Nonlinear dimensionality reduction based visualization of single-cell RNA sequencing data

Single-cell multi-omics technology has catalyzed a transformative shift in contemporary cell biology, illuminating the nuanced relationship between genotype and phenotype. This paradigm shift hinges on the understanding that while genomic structures remain uniform across cells within an organism, the expression patterns dictate physiological traits. Leveraging high throughput sequencing, single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool, enabling comprehensive transcriptomic analysis at unprecedented resolution. This paper navi-gates through a landscape of dimensionality reduction techniques essential for distilling meaningful insights from the scRNA-seq datasets. Notably, while foundational, Principal Component Analysis may fall short of capturing the intricacies of diverse cell types. In response, nonlinear techniques have garnered traction, offering a more nuanced portrayal of cellular relationships. Among these, Pairwise Controlled Manifold Approximation Projection (PaCMAP) stands out for its capacity to preserve local and global structures. We present an augmented iteration, Compactness Preservation Pairwise Controlled Manifold Approximation Projection (CP-PaCMAP), a novel advancement for scRNA-seq data visualization. Employing benchmark datasets from critical human organs, we demonstrate the superior efficacy of CP-PaCMAP in preserving compactness, offering a pivotal breakthrough for enhanced classification and clustering in scRNA-seq analysis. A comprehensive suite of metrics, including Trustworthiness, Continuity, Mathew Correlation Coefficient, and Mantel test, collectively validate the fidelity and utility of proposed and existing techniques. These metrics provide a multi-dimensional evaluation, elucidating the performance of CP-PaCMAP compared to other dimensionality reduction techniques.


Introduction
It is a widely accepted and proven scientific fact that cells are the fundamental building blocks of all living organisms.They play a vital role in the structure and function of these organisms.In recent years, there has been a significant shift in cell biology research due to the development of single-cell multi-omics technology.Despite the fact that the genome structure of every cell in a given individual is essentially the same, the expression pattern of this genome determines the cell's physiological characteristics.The diverse range of physical traits observed in different organisms is a result of both the genotype and the expression pattern of the genome, and deviations from the norm in these patterns can lead to various diseases.To fully understand the relationship between genotype and phenotype, it is necessary to analyze transcriptomic information at a high resolution, and advances in high throughput sequencing technologies have made it possible to do so at the level of single cell (Nayak and Hasija 2021;Battenberg et al. 2022).
Page 2 of 23 Yousuff et al. Journal of Analytical Science and Technology (2024) 15:1 Recent single-cell RNA sequencing (scRNA-seq) technologies can create data for multitudes of cells in a single experiment, a portion of which are open to the public over the internet.This surge in throughput has allowed researchers to use scRNA-seq for a wide variety of tissues and even whole organisms (Ghazanfar et al. 2016).As the technology advances, it is anticipated that scRNA-seq will become more precise, dependable, and cost-effective per cell, making it possible for a vast array of studies.scRNA-seq has unleashed a plethora of opportunities in biomedical research; however, we have only touched a small portion of the possibilities of such a huge and varied dataset (Wang et al. 2021).scRNA-seq transcriptome profiles have opened up the possibility for recognition of unusual and peculiar cell types in organs or tissues, resolving the fate of a cell (Grün et al. 2015), cell lineage connections in early stages of development (Petropoulos et al. 2016), differentiating normal and abnormal cells (Shalek et al. 2013), antigen sensitivity and specificity of immune cells (Tu et al. 2019), deducing cellular trajectory (Miragaia et al. 2019), finding regulatory signatures in malignant tumors (Granja et al. 2019), decoding immune repertoire for contagious diseases (Yao et al. 2019), knowing and interpreting tumor heterogeneity (Wagner et al. 2019), enlightening the pathway for drug resistance and various stages of cancer treatment including relapse of tumors (Shaffer et al. 2017).More applications are being uncovered as a result of improved analysis techniques.
The data collected from hundreds of thousands of cells, each with numerous genes, results in a dataset with a large number of data points and high dimensionality.While this vast amount of data has the potential to reveal valuable insights, extracting useful information from it can be difficult (Babjac et al. 2022).To address this challenge, Dimensionality Reduction (DR) techniques have been developed to simplify the data and create lowerdimensional representations that are easier to understand and interpret.DR methods involve reconstructing the underlying distributions of the data in the "gene space" and providing a more intuitive way to analyze single-cell data.Researchers are seeking ways to represent highdimensional scRNA-seq datasets in a Low Dimensional Space (LDS) while preserving patient similarities and differences (Xiang et al. 2021).
The goal is to create an LDS representation that captures the relationships between patients, such that those with the same disease have similar patterns of expression.DR techniques are used to map High Dimensional Space (HDS) data to a 2-dimensional (2D) or 3-dimensional (3D) space, which makes it easier to visualize connections between data points that would be difficult or impossible to identify in the HDS (Carter et al. 2008;Yousuff and Babu 2022).The key principle of the DR approach is that it maintains the proximity of similar data points and keeps distant data points separated.Retention of local structure refers to maintaining the proximity of elements in the HDS in the LDS.In broader terms, the local structure is maintained when the neighboring elements in the HDS correspond to those in the LDS.On the other hand, preserving global structure implies maintaining relationships between clusters and larger-scale structures (Heiser and Lau 2020).
Principal Component Analysis (PCA), a linear DR technique, is commonly used in unsupervised data reduction by identifying linear feature combinations that have the highest variance.However, linear DR approaches are not always reliable for scRNA-seq analysis as they may not fully capture the complexity of diverse cell types and can result in an inadequate representation of the data (Tsuyuzaki et al. 2020).In contrast, nonlinear DR techniques have become popular for scRNA-seq data visualization because of their ability to identify both local and global patterns while avoiding coordinate overlap.These techniques are particularly useful for scRNA-seq data, which is often highly diverse and has complex associations between cell types and states.Additionally, nonlinear DR techniques are more effective in reducing the dimensionality of scRNA-seq data with many features per cell (Pierson and Yau 2015).
Several nonlinear dimensionality reduction algorithms have been proposed for visualizing and generating LDS for scRNA-seq data.Uniform Manifold Approximation and Projection (UMAP) (McInnes et al. 2018), t-distributed stochastic neighbor embedding (t-SNE) (Maaten and Hinton 2008), TriMap (Amid and Warmuth 2019), Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE) (Moon et al. 2019), and IVIS (Szubert et al. 2019) are commonly used among these algorithms.Each of these methods has limitations; for example, t-SNE is sensitive to the perplexity hyperparameter and may create clusters that are not real, t-SNE and UMAP are good at retaining local structures but have difficulty maintaining global structures.TriMap is a triplet model to reach the performance of UMAP and t-SNE, but it also has limitations; at times, it struggles with preserving local structures.Additionally, it is not possible to regulate these techniques, such as t-SNE, UMAP, or TriMap, effortlessly from local to global structure retention through any apparent modification of parameters (Coenen et al. 2019;Wang et al. 2022).PHATE is also a recently proposed alternative approach, but it is sensitive to initialization values, and it is liable to serious deformations when attempting to maintain pairwise associations or distances from HDS data in 2D or 3D (Moon et al. 2019).IVIS, on the other hand, uses Siamese neural networks, which can lead to high computational cost, limited interpretability, confined ability to handle variations, limited scalability and the need for a large amount of labeled data for effective training (Chicco and Cartwright 2021).
Selecting which points to attract and which to repel is crucial in maintaining both local and global structures.Pairwise Controlled Manifold Approximation Projection (PaCMAP) is a recent nonlinear DR algorithm that claims to achieve this by using a unique loss function and graph components.PaCMAP is demonstrated on synthetic, benchmark and real-time datasets and it has been proven to preserve local and global structures.It is quite reliable in hyperparameter choices and exhibit considerably faster runtime compare to other DR algorithms (Wang et al. 2022).This paper aims to present an augmented version of PaCMAP termed as Compactness Preservation Pairwise Controlled Manifold Approximation Projection (CP-PaCMAP) which can additionally preserve compactness property of HDS datapoints into LDS.CP-PaCMAP is remarkable in order to visualize scRNA-seq data.Further, the LDS obtained through CP-PaCMAP can be effectively utilized for better classification or clustering of scRNA-seq data.

Research gap
In spite of the vast potential inherent in scRNA-seq data, the colossal size and soaring dimensionality of these datasets introduce formidable hurdles.The quest for gleaning meaningful insights from such data has spurred the evolution of dimensionality reduction (DR) techniques.DR methodologies strive to reshape the high-dimensional landscape of gene expression into a more manageable, lower-dimensional form, facilitating streamlined analysis and visualization (Babjac et al. 2022).While Principal Component Analysis (PCA), a linear DR approach, has enjoyed widespread use, its applicability in scRNA-seq investigations is somewhat constrained, as it may fall short of encapsulating the full spectrum of cell diversity (Tsuyuzaki et al. 2020).Consequently, nonlinear DR techniques have risen in prominence within the realm of scRNA-seq data, primarily due to their adeptness in unveiling both local and global patterns within data characterized by intricate relationships among cell types and states (Pierson and Yau 2015).
Nonetheless, the prevailing nonlinear DR techniques, including UMAP, t-SNE, TriMap, PHATE, and IVIS, exhibit variable degrees of sensitivity to hyperparameters and encounter obstacles in preserving both local and global data structures (McInnes et al. 2018;Maaten and Hinton 2008;Amid and Warmuth 2019;Moon et al. 2019;Szubert et al. 2019).The specific research gap targeted by this study comes to the forefront: the demand for an enhanced nonlinear DR methodology tailored for scRNA-seq data analysis.This method should exhibit unwavering proficiency in effectively capturing both local and global data structures while concurrently preserving compactness, offering a holistic solution to the challenges presented by high-dimensional single-cell transcriptomics data.

Materials and methods
This section will discuss a comprehensive overview of the scRNA-seq datasets utilized in this study.We will also describe the preprocessing procedures carried out on the datasets to ensure the quality and reliability of the data.Finally, details about the proposed CP-PaCMAP approach are presented.

scRNA-seq data collection and preprocessing
Benchmark scRNA-seq datasets belonging to three vital human organs, the pancreas, skeletal muscles, and heart, are gathered and used in this study and the dataset are available from https:// hembe rg-lab.github.io/ scRNA.seq.datas ets/.The human pancreas dataset consists of 16,382 cells and 19,093 genes from 14 different classes of cells.The human skeletal muscle dataset contains 52,825 cells and 33,538 genes belonging to 8 unique categories of cells.A set of 38,929 cells and 27,420 genes categorized under 13 labels of cells of the human heart are present in the third dataset.Initially, all the datasets are subjected to a doublets removal process.Then, other preprocessing techniques, such as filtering, quality control, and normalization, are utilized to prepare the data for nonlinear DR and LDS visualization.All the preprocessing tasks are carried out in Python language using the Scanpy library (Wolf et al. 2018).
Doublets in scRNA-seq data indicate two separate cells combined by unexpected events during the sequencing procedure.In a droplet-based sequencing approach, this can occur if, for instance, two cells reside in the same droplet.Doublets can significantly influence the processing of scRNA-seq data, leading to skewed results and inaccurate inferences.This is due to the fact that the combined gene expression readings of the doublets do not adequately represent the genuine gene expression of either individual cell (Weber et al. 2021).Therefore, identifying and eliminating doublets from single-cell data is essential before undertaking subsequent analysis.This ensures that the results of the study are based on reliable and representative measurements of individual cells rather than on measurements of cells that have been artificially blended.
Single-Cell Variational Inference (SCVI) is a method that can be used to model and analyze scRNA-seq data.SCVI is based on a variational autoencoder architecture consisting of two main components: an encoder network and a decoder network.The encoder network maps each cell's HDS gene expression data to an LDS, while the decoder network maps the LDS representation back to the actual HDS.The training process of SCVI involves minimizing a reconstruction loss, which measures the difference between the input data and the reconstructed data generated by the decoder network.In addition, SCVI uses a regularization term in the loss function to encourage the learned latent representation to be smooth and continuous and to prevent overfitting to the training data (Gayoso et al. 2022).We have incorporated SCVI to identify doublets by calculating the reconstruction error of each cell in the data and setting a threshold based on this error.Cells with a high reconstruction error are considered doublets and can be removed from the data before further analysis.Table 1 displays the number of doublets identified from each dataset.
Filtering is a crucial preprocessing step in analyzing scRNA-seq data because it helps eliminate low-quality or undesirable cells and low-quality genes or irrelevant features.This can enhance subsequent analysis and improve the precision of the results.Moreover, by deleting redundant data points, filtering might lower the computing load of downstream analysis (McCarthy et al. 2017).A general filtering criterion for cells is given in Eq. 1, whereas filtering criteria for each dataset with specific values are given in Table 1.
where C is the set of all cells in the dataset, C′ is the fil- tered set of cells with at least X expressed genes, n(c) is the number of expressed genes in cell c , and the colon (:) represents a filter operation that retains only the cells that meet the specified criteria.Let Z be the gene expression matrix, where each row corresponds to a cell and each column corresponds to a gene.The element (1) C′ = {cinC : n(c) ≥ X} Z[i, j] represents the expression level of gene j in cell i .To remove genes that are found in fewer than Y cells, we have applied a filter based on the number of non-zero entries in each column of Z .Let M[j] be the number of non-zero entries in column j of Z , i.e., the number of cells where gene j is expressed.Then, it can be defined as a filtered gene expression matrix Z′ as given in Eq. 2, i.e., Z′ consists of the columns of Z where the corresponding gene is expressed in at least Y cells.The colon (:) symbol in Eq. 2 is a notation for all the rows and columns of Z which satisfies the condition.
Our scRNA-seq data were meticulously obtained using cutting-edge sequencing platforms to guarantee exceptional data quality and reliability.The iPSC and TMWC libraries were sequenced on an Illumina NextSeq 500 platform, employing a 150-cycle NextSeq High Output Reagent Kit v2.5.The sequencing protocol consisted of specific parameters: 26 bp for Read 1, 8 bp for the Index, and 98 bp for Read 2. The sequencing process on the NextSeq 500 platform was managed by the skilled team at the Institute of Molecular Bioscience Sequencing Core Facility.
Furthermore, the two PBMC libraries underwent sequencing on the Illumina NovaSeq 6000 instrument, featuring a 2 × 150 cycle S4 flow cell, operating in standalone mode.The libraries were loaded at a concentration of 8 nM, with each sample having a volume of 350 μL.The proficient execution of the NovaSeq 6000 sequencing procedure was carried out by the Kinghorn Centre for Clinical Genomics Sequencing Core Facility.
Libraries generated using the 10 × Genomics Chromium system underwent a critical conversion process employing the MGIEasy Universal Library Conversion kit (App-A) before being sequenced on the MGISEQ-2000 instrument.For each library, 10 ng of material underwent 10 cycles of polymerase chain reaction (PCR) to introduce a 5' phosphorylation exclusively on the forward strand.Following this, the purified PCR (2) product was subjected to denaturation, after which it was combined with a 'splint' oligonucleotide.This oligonucleotide is homologous to the P5 and P7 adapter regions of the library, facilitating the formation of a circular single-stranded DNA molecule.A ligase reaction was subsequently carried out to produce a complete single-stranded DNA circle of the forward strand.An exonuclease digestion step was executed to remove single-stranded non-circularized DNA molecules.The circular single-stranded DNA molecules then underwent Rolling Circle Amplification (RCA), generating 300-500 precise copies of the libraries, forming DNA Nanoballs (DNB).Each DNB library was loaded onto a 1500 M feature patterned array flow cell in preparation for sequencing, utilizing the MGISEQ-2000RS High-Throughput Sequencing Set (App-A).The sequencing process entailed 26 bp for Read 1 and 100 bp for Read 2 cycles, without an index barcode read, as only one sample was run per flow cell.FASTQ files were locally generated on the instrument, and sequencing was expertly conducted at the BGI Shenzhen, MGI R&D facility.Filtering out mitochondrial and ribosomal genes can enhance the reliability of scRNA-seq data, as high expression levels of these genes can signal poor data quality caused by technical issues like mitochondrial stress or cell lysis.In addition to reducing technical differences between cells, the removal of these genes can also improve downstream analysis and interpretation (McCarthy et al. 2017).Owing to the stochastic nature of RNA sequencing, various cells in a collection may have differing degrees of RNA sequenced, resulting in varying total read counts per cell.Normalization aids in compensating for these variations in sequencing depth by scaling the gene expression values for each cell by a factor that corresponds to the total amount of reads for that cell (Vallejos et al. 2017).
Data derived from scRNA-seq may be vulnerable to technical biases such as batch effects, variances in cell capture efficiency, or gene-specific effects such as amplification bias or content bias.Normalization can assist in accounting for these technological biases, allowing for a more accurate comparison of gene expression levels across cells.During sample preparation, sequencing, and data processing, technical noise can be created during scRNA-seq.By leveling the gene expression data of each cell, scaling can lessen the influence of technical noise.Scaling can enhance the display and clustering of scRNA-seq data by lowering the influence of genes with high expression values, which can control the analysis and obscure the signal from other genes with lower expression levels (Lytal and Ran 2020).Figure 1 depicts the preprocessed data in terms of a violin plot, explaining the distribution of four metrics across the cells in all three scRNA-seq datasets.The four metrics are: (i) the number of genes detected in each cell based on read counts, (ii) the total number of reads sequenced for each cell, (iii) the percentage of reads mapped to mitochondrial genes for each cell, (iv) the percentage of reads mapped to ribosomal genes for each cell.The y-axis of the plot shows the distribution of the metric values, with the width of the violin indicating the density of cells at that value.

Methodology
The loss function regulates the attractive and repulsive forces between each pair of data points; thus, fine-tuning the loss function helps to maintain local structure.The PaCMAP aims to bring together the neighbors from the HDS in the LDS and push away further points in the original space in the LDS.Specifically, it highlights the significance of having forces on non-neighbors.PaC-MAP algorithm prioritizes global structure: neighbors and mid-near pairings are attracted, whereas distant points are repelled.After the global structure is in place, the attractive force on mid-near edges reduces, stabilizes, and eventually vanishes over time, leaving the algorithm to fine-tune the local structure.PaCMAP has a primary objective with three kinds of pairwise loss elements, each related to a certain kind of graph section: nearest neighbor edges (NE), mid-near edges (ME), and repulsion edges with additional points (RE) (Wang et al. 2022) The edges are additionally weighted by the coefficients ω NE , ω ME , and ω RE , which collectively represent the total loss.As part of the optimization process, the weights are dynamically adjusted.The Student's t-distribution utilized in the similarity functions of t-SNE and TriMap is the reason for the decision to employ the scaled distance (Wang et al. 2022).
UMAP employs a binary search for the scale of each point, comparable to t-SNE, which utilizes entropy as perplexity for a similar search.UMAP and t-SNE imply that data points are distributed uniformly on an inherent LDS manifold since the search makes the neighborhoods of several data points behave identically.PaCMAP discards the data compactness surrounding each point by nullifying the influence of compactness with the search for scales of data points.CP-PaCMAP regularizes the cost function of PaCMAP to account for and return the compactness information surrounding each data point.Empirical evidence demonstrates that this incorporation of compactness information yields a remarkable embedding despite requiring additional calculation for the regularization element.If a data point's neighbors are relatively close, the surrounding area is compact for that point.Consequently, the local radius, determined as the mean distance between neighbors, can serve as a measure of local compactness.
A method for producing LDS that preserves compactness information at individual data points is proposed.This is achieved by defining a local radius, which formalizes the concept of spatial compactness.The proximity of nearest neighbors is often used to determine whether a data point belongs to a compact or sparse region.Specifically, a data point is considered to be in a compact area if its nearest neighbors are in close proximity to it.In contrast, a data point is deemed to be in a sparse area if its nearest neighbors are located at a considerable distance from it.The level of compactness for a given data point is determined by utilizing the average distance to nearest neighbors.In order to formalize this concept, it is necessary to have two elements for a given data point a i .The proposed methodology involves the use of a pairwise distance function, indicated as d(a i , a j ) , and a probability distribution, denoted as p j|i , which assigns weights to each data point a j depending on its distance from a i .The weights assigned to distant points are comparatively lower than those assigned to nearby points.The local radius at a given data point a i , represented as C p (a i ) , is defined as the expected value of the distance function on all other data points a j , with respect to the conditional probability p j|i .This meas- ure effectively captures the average distance between a i and its neighboring points as given in Eq. 4. The CP-PaCMAP approach leverage the probability distributions of PaCMAP, which is capable of capturing local associations.To determine the local radius in the input HDS, we perform a renormalization of the edge probabilities H ij .To obtain a conditional distribution p j|i , H ij /� N j=1 H ij can be calculated, and then determine the local radius as given in Eq. 5.
Subsequently, the local radius is determined within the LDS.Let b i denote the embedding coordinates of data point a i .A distribution that is corresponding to H is required to compute the probable distance between b i and its neighboring data points in the LDS.It is appropriate for the distribution in examining to possess adaptive length-scales analogous to those of H .This is necessary to ensure that a consistent number of nearest neighbors are incorporated in the calculation of the local radius at various data points within the dataset.The variable L is indicative of a total average across various length-scales.By defining p j|i as L ij /� N j=1 L ij and d(b i , b j ) as � b i − b j � 2 , the local radius in the LDS can be determined using Eq. 6.
Let us consider a data point from the input high ( H ) dimensional data x ∈ R H with K neighborhood data points uniformly distributed in a sphere of radius ζ H and volume v ∝ ζ H H .Both structure and compactness should be preserved in LDS ( L < H) , this implies that x and its neighbors should be mapped to an L-dimensional sphere of uniform density with radius ζ L , and, to retain the compactness of x′s K-neighborhood, the volume of the L-dimensional sphere should also remain as v such that v ∝ ζ L L , this indicates that ζ L and ζ H have a power law association i.e. ζ L ∝ ζ H −L H . Applying logarithms will result in logζ L = (H − L)logζ H + β for some val- ues of β.Driven by the exponential scaling of com- pactness with regards to dimensionality of the feature vectors, we seek for a power law association between the local radius in the input HDS dataset and in the output LDS for some hyperparameters α and β in order to retain the compactness.This is reformulated as an affinal connection between the logarithms of the local radii as given in Eq. 7. where , and γ = ln(α) .Our compactness retention objective is to select the LDS in such a way that the correlation between the logarithmic local radii of the input HDS data points and the output LDS is maximized.This method basically resembles canonical correlation analysis (Andrew et al. 2013).Thus, it can be stated that there exists an affine relationship between the logarithms of local compactness.Correlation serves as a means of measuring linear or affine interdependence; therefore, the correlation of the logarithms of local compactness is implemented as given in Eq. 8, whereas the covariance and variance of compactness can be computed using Eq. 9 and Eq. 10, respectively. where The PaC-MAP's cost function is regularized by maximizing the correlation of local compactness to create the CP-PaC-MAP's cost function, which needs to be minimized.The CP-PaCMAP algorithm is given in Algorithm 1 and its loss function is stated in Eq. 11. η is the regularization parameter that weights the correlation in respect to the initial cost of the PaCMAP.Similar to PaCMAP, CP-PaC-MAP optimizes via stochastic gradient descent Page 8 of 23 Yousuff et al. Journal of Analytical Science and Technology (2024)  PaCMAP is a recently proposed approach to visualize high-dimensional feature vectors.PHATE and IVIS are occasionally used for data visualization.We have considered all these techniques in our study to compare with the proposed CP-PaCMAP approach.Initially, all the DR techniques are implemented on 2D-generated data to understand the necessity and idea behind compactness preservation.The 2D data is generated in such a way that it contains linear data points belonging to 4 class labels.
Each category is meant to hold a different degree of compactness within the cluster, as shown in Fig. 2a.The base cluster of data points is very compactly placed compared to the second cluster.The third cluster contains a lesser level of compactness compared to the second.The fourth cluster is the one with more sparse data points.
The DR techniques are applied to the 2D generated data, and their corresponding 2D embeddings are depicted in Fig. 2, 3, 4 and 5. PaCMAP and UMAP visualization shown in Fig. 2b, c clearly prove that both the local structure and global structure of the original data are well preserved in LDS.TRIMAP and IVIS are able to retain the global structure but slightly struggle  2e, g.PHATE (Fig. 2f ) has an issue in preserving both structures, while t-SNE (Fig. 2d) is able to retain local structure in the LDS but has minor deviations in maintaining the global structure.Among all the seven DR techniques examined, the proposed CP-PaCMAP approach performs as well as PaCMAP in preserving both local and global structures in the LDS.It is also able to hold the compactness aspect present in all the clusters of the original data, as shown in Fig. 2h.

Results and discussions
We utilized a diverse array of assessment criteria to appraise the efficacy of the proposed approach and various DR methodologies across three distinct scRNAseq datasets-Human pancreas, skeletal muscle, and heart.Trustworthiness and continuity metrics are leveraged to scrutinize the fidelity of local and global structures within the reduced-dimensional representations (Andrew et al. 2013;Lee and Verleysen 2009;Jurman et al. 2012;Yousuff and Babu 2023;Allen et al. 2021;Gatin et al. 2019).The Mathew Correlation Coefficient metric provides the assessment of classification task (with imbalanced cell Fig. 3 continued types classes) performed on the LDS generated by all the DR techniques, while the Mantel test helps to evaluate the preservation of pairwise relationships between cells in the original HDS and their corresponding LDS.Furthermore, a runtime analysis is done to visualize the computational efficiency of each technique.This comprehensive suite of metrics collectively furnishes a multidimensional evaluation, elucidating both the merits and potential limitations of each approach within the diverse landscape of scRNAseq data analysis.( 12) • TW is the trustworthiness score and CN is the conti- nuity score for a given m , which represents the num- ber of nearest neighbors to consider.• N is the total number of cells (data points).
• r m (i, j) represents the rank of cell j among the m nearest neighbors of cell i in the HDS.This indicates how close cell j is to cell i in the original space con- sidering m neighbors.• m neighbors i, j is a binary indicator function.It takes a value of 1 if cell j is among the m nearest neighbors of cell i in the LDS.It checks whether the proximity relationship is maintained in the reduced space.
The TW score ranges from 0 to 1 ( TW ∈ [0, 1] ), where 0 indicates that the local structures are not preserved well in the LDS, and 1 indicates perfect preservation of local structures.The continuity score ranges from -1 to 1 ( CN ∈ [−1, 1] ).A score of -1 means that the global structure is perfectly preserved in reverse order (what is close in the original space is far in the reduced space), 0 means no preservation, and 1 means perfect preservation of the global structure.TW and CN scores are computed on various m values for all the three different scRNAseq datasets.Figures 6, 7, and 8 demonstrate that CP-PaC-MAP is performing comparatively fine with respect to all other DR techniques.Hence, compactness can be well preserved along with local and global structures of HDS into LDS without any compensation in performance.

Classification model and Matthew's correlation coefficient
In this study, we applied the K-nearest neighbor (KNN) classification algorithm to analyze scRNA-seq data.To assess the algorithm's performance and ensure robustness, we employed tenfold cross-validation.The scRNA-seq dataset consisted of gene expression profiles for individual cells, with the target variable being the cell type.By utilizing the KNN algorithm with a k value of 25, we aimed to predict the cell types based on the similarity of gene expression profiles among neighboring cells.The tenfold cross-validation approach allowed us to evaluate the algorithm's performance by splitting the data into 10 subsets, training the model on nine of them, and testing it on the remaining subset.This process was repeated 10 times, ensuring that each subset served as training and testing data.Confusion matrices are obtained for all the DR techniques applied to each scRNA-seq dataset.Finally, all the confusion matrices are utilized to compute Matthew's correlation (13) (m neighbors i, j − r m i, j ) coefficient (MCC), the classification performance metric.
The MCC is a widely utilized performance metric for assessing prediction precision in multi-class classification tasks (Zegarra Flores and Radoux 2023;Dine et al. 2022;Lee and Park 2022;Thakur et al. 2023;Zhang and Leatham 2019;Zhou et al. 2018).The overall assessment of classification accuracy is determined by considering the True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).TP refers to the count of positive instances that have been exactly classified.TN refers to the count of negative instances that are accurately classified.FP refers to the quantity of instances that are erroneously classified as positive.FN refers to the quantity of instances that are erroneously classified as negative.MCC is a metric that takes into account the distribution of true positives, true negatives, false positives, and false negatives in order to yield a singular value that serves as a comprehensive indicator of the classifier's predictive performance.A higher MCC score signifies superior performance, where a value of 1 represents the ideal result, and -1 represents the lowest outcome (Jurman et al. 2012;Yousuff and Babu 2023).
The MCC is a valuable metric in the context of multiclass classification tasks due to its ability to consider the disparities in class distributions.This metric offers a more dependable assessment of the classifier's effectiveness, particularly when confronted with unequal class proportions or imbalanced datasets.In the context of scRNA-seq classification, the MCC is a valuable metric.It is beneficial because it takes into account the differences in class distributions, which are often encountered in scRNA-seq data.The MCC provides a reliable measure of the classifier's performance, especially when dealing with imbalanced datasets or variations in the proportions of different cell types.It helps assess the accuracy and robustness of the classification algorithm in handling the complexities of scRNA-seq data (Jurman et al. 2012).In the multiple cell types (categories or classes) scenario, the MCC can be mathematically expressed by utilizing a confusion matrix M that represents the classifica- tion outcomes for each category C as given in Eq. 12.The MCC value of the DR techniques computed on all three scRNA-seq data is depicted in

Mantel test
The Mantel test can be utilized along with the Pearson Correlation Coefficient (PCC) to evaluate the preservation of pairwise relationships between cells in the original HDS and their LDS representations.The PCC is a measure of relationship between two sets of data, is commonly used as the correlation coefficient in the Mantel test (Zhou et al. 2018;Mushtaq et al. 2020;Singh et al. 2023;Fakhfakh et al. 2020;Gupta 2022;Zhao 2021).By comparing the PCC obtained from the Mantel test, it is possible to determine how well the DR technique preserves the pairwise relationships between cells.A higher PCC value (+ 1) indicates a stronger correlation and suggests better preservation of the relationships in the LDS.To create a distribution of correlation values, Mantel test procedure was performed multiple times on randomly chosen subsamples of the scRNA-seq data points (n = 500 cells per subsample picked without replacement).Mantel test on cluster centroid distance matrices exposes potential similarities or variations in the underlying grouping patterns (Szubert et al. 2019).PCC values obtained for various DR techniques on three different scRNA-seq Datasets demonstrated a strong correlation between the actual HDS and LDS cluster centroid distances.Mean and Median PCC values for all the DR techniques on scRNA-seq Human pancreas, skeletal muscle, and heart are listed in Table 2.
The RainCloud plot is a smart combination of a Strip plot, a split-half violin plot, a boxplot with whiskers, and a point plot.In the case of a strip plot, the data points are represented as individual dots distributed evenly along the categorical axis, providing a more granular view.Violin plots reveal data distribution shape, density, and spread.Width signifies density; wider areas have more data, and narrower areas have less.Longer violins suggest a broader range, while shorter ones imply a narrower range.Outliers are shown when data points extend beyond the violin's range.The box in the box plot represents the middle 50% of the data (interquartile range-IQR), with the median shown as a central line.The box length reflects the data spread, longer indicating a larger spread and shorter suggesting a narrower spread.Whiskers extend to 1.5 times the IQR, covering the data's range from minimum to maximum values.The flag of the point plot is meant to represent the mean of data in the context of the RainCloud plot (Allen et al. 2021).
PCC values collected after the permutations of the Mantel Test on different HDS scRNA-seq datasets and their corresponding LDS are plotted using RainCloud plots, as shown in Figs. 10,11,and 12,respectively.We are able to observe a higher density of PCC values towards + 1 in the case of CP-PaCMAP, as depicted using a split-half violin plot.The Median, minimum, and maximum values of PCC are also comparatively better in CP-PaCMAP, which is observed in the box plot.THE mean PCC values of CP-PaCMAP are also high compared to other DR techniques demonstrated using point plot flags (Fasil and Rajesh 2023;Gupta et al. 2023aGupta et al. , 2023b;;Sénéchal et al. 2005;Mukherjee et al. 2021;Gupta 2023;Kaur and Khehra 2021).

Runtime analysis
We have comprehensively explored several DR techniques, including PaCMAP, UMAP, t-SNE, TRIMAP, PHATE, and IVIS, alongside the proposed technique termed CP-PaCMAP.These techniques are pivotal in scRNA-seq analysis, revealing large-scale datasets' intrinsic structures and relationships.The primary objective was to assess these DR techniques' runtime  (computational efficiency) across various data point magnitudes ranging from 5000 to 30,000.To accomplish this, the execution times of each technique in seconds are recorded and subsequently visualized through a line graph, as shown in Fig. 13.Upon scrutinizing the outcomes, it is apparent that PaCMAP exhibited remarkable performance across all scenarios.It consistently outperformed its counterparts, showcasing its prowess in runtime.Intriguingly, CP-PaCMAP emerged as a notable approach, securing the second position in terms of

Conclusion
Our study highlights the pivotal role of DR techniques in unraveling the intricate relationships within scRNAseq data.While PCA remains a stalwart in linear DR, the limitations of this approach are evident in the face of diverse cell types.Nonlinear techniques like UMAP, t-SNE, TriMap, PHATE, and IVIS have emerged as powerful alternatives, each with unique strengths and constraints.Our introduction of the CP-PaCMAP algorithm addresses many challenges, providing a robust solution for visualizing and analyzing scRNA-seq data.Its ability to preserve both local and global structures, coupled with its enhanced computational efficiency, positions CP-PaC-MAP as a promising tool for researchers seeking to gain deeper insights into cellular heterogeneity.

Future work
Looking ahead, several avenues for further exploration and refinement can be implemented.Firstly, extending CP-PaCMAP to accommodate even larger and more diverse datasets could enhance its applicability across a broader spectrum of biological systems.Additionally, incorporating CP-PaCMAP into integrated workflows for scRNA-seq analysis, potentially in conjunction with advanced machine learning techniques, holds promise for uncovering novel biological insights.Exploring the potential of CP-PaCMAP in the context of multi-modal single-cell omics data could further expand its utility in deciphering complex cellular landscapes.Furthermore, investigating the algorithm's performance in scenarios of perturbed cellular states or rare cell type identification could yield valuable insights for various biomedical applications.Finally, efforts towards enhancing the interpretability of the resulting low-dimensional representations and developing user-friendly interfaces will be crucial for enabling the broader adoption of CP-PaCMAP in the scientific community.By pursuing these directions, we aim to advance the capabilities of DR techniques in scRNAseq analysis and contribute to a more comprehensive understanding of cellular biology.

Fig. 1
Fig. 1 Violin plot of preprocessed data to visualize the distribution of four metrics across the cells in three scRNA-seq human tissue datasets: a pancreas, b skeletal muscle, c heart

15: 1 Algorithm 1
Compactness Preservation Pairwise Controlled Manifold Approximation Projection.To ensure transparency and reproducibility, we are committed to providing details of the specific nondefault parameters employed in our methodology.Below, we outline the key non-default parameters along with their values: Used value: Random Regularization Parameter (η) for CP-PaCMAP: Default value: 0.01 Used value: 0.005 (11) Loss CP−PaCMAP = Loss PaCMAP − ηCorr(c L , c H ) Experiments DR techniques such as UMAP, TRIMAP, and t-SNE are very commonly used for scRNA-seq data visualization.

Fig. 2
Fig. 2 2D visualization of LDS generated by various DR techniques.b PaCMAP, c UMAP, d t-SNE, e TRIMAP, f PHATE, g IVIS, h CP-PaCMAP on a generated linear data, with different level of compactness at each cluster

Fig. 3
Fig. 3 2D visualization of LDS generated by various DR techniques.a PaCMAP, b UMAP, c t-SNE, d TRIMAP, e PHATE, f IVIS, g CP-PaCMAP on human pancreas scRNA-seq data containing 14 categories of cell type

Fig. 4
Fig. 4 2D visualization of LDS generated by various DR techniques.a PaCMAP, b UMAP, c t-SNE, d TRIMAP, e PHATE, f IVIS, g CP-PaCMAP on human skeletal muscle scRNA-seq data containing 8 categories of cell type ) helps us understand how well local relationships are preserved.It focuses on the nearest neighbors of each data point and checks if they remain close in the LDS.This is particularly important for methods that aim to capture local structures and clusters.Continuity ( CN ) helps us understand the preservation of global data patterns and the overall structure.It ensures that data points that were far apart or close in the HDS retain their relative distances in the LDS.This is essential for methods that aim to maintain the broader structure of the data(Wulfman et al. 2010; Ribaut et al. 2007;Sharini et al.

Fig. 9 . 2 c
The proposed CP-PaCMAP technique demonstrates slight improvement in MCC metric, compared to existing DR techniques (Jurman et al. 2012).(12)MCC = o × d − C c p c × t c d 2 − C c p 2 c × d 2 − C c tPage 18 of 23 Yousuff et al.Journal of Analytical Science and Technology (2024) 15:1 where, t c = C i M ic the number of times category C actually occurred, p c = C i M ci the number of the times category C got predicted, o = C c M cc the total count of observations rightly predicted, d = C i C j M ij the total count of data points.

Fig. 9
Fig. 9 MCC performance metric computed using confusion matrices of KNN model for all the DR techniques implemented on each scRNA-seq dataset

Fig. 13
Fig. 13 Runtime analysis of DR techniques on scRNA-seq data with varying data point sizes

Table 2
Mean and median PCC values obtained from the Mantel test for various DR techniques on all three scRNA-seq Datasets