1.1 Dataset (Select or Upload)
CAP-RNAseq includes three sample datasets to analyze. You can also upload your own raw count data (CSV file) and phenotype data (condition TXT file).
Note that the first column of the uploaded CSV file should be HGNC Symbols of human genes with no duplicated rows.
Demo Datasets
Demo1: GSE136868
C: Control; TrkB.FL: Full length splice variant; TrkB.T1: Truncated splice variant
Demo2: GSE190998
siCtrl_P: siRNA control of proliferating cells; siCtrl_S: siRNA control of senescent cells; siNTRK2: NTRK2 knockdown senescent cells
Demo3: GSE201085
RDR: residual disease (RD) with recurrence; RDnR: RD without recurrence; pCR: pathologic complete response; NoNAC: did not receive neoadjuvant chemotherapy
Sample Input File Formats
We recommend uploading filtered data that excludes genes with zero or low counts. Alternatively, you can filter your data during upload by retaining only genes with counts per million (CPM) above a specified threshold in a minimum number of samples.
The uploaded data can be visualized as a table of raw counts, logCPM values or mean of replicates.
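The CPM-based filter and the logCPM view described above can be sketched as follows. This is a minimal illustration, not the app's actual code: the function names, the `min_cpm` and `min_samples` defaults, the prior count of 1 and the toy count table are all assumptions made for the example.

```python
import math

def cpm(counts):
    """Convert a {gene: [raw counts per sample]} table to counts per million."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {gene: [1e6 * row[j] / lib_sizes[j] for j in range(n_samples)]
            for gene, row in counts.items()}

def filter_by_cpm(counts, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM exceeds min_cpm in at least min_samples samples."""
    table = cpm(counts)
    return {gene: counts[gene] for gene, values in table.items()
            if sum(v > min_cpm for v in values) >= min_samples}

def log_cpm(counts, prior=1.0):
    """log2(CPM + prior); the prior (an assumed value) avoids log2(0)."""
    return {gene: [math.log2(v + prior) for v in values]
            for gene, values in cpm(counts).items()}

# Hypothetical raw counts for three genes in four samples.
raw = {
    "GENE_A": [500, 450, 520, 480],
    "GENE_B": [0, 0, 1, 0],
    "GENE_C": [499500, 549550, 479479, 519520],
}
kept = filter_by_cpm(raw, min_cpm=1.0, min_samples=2)
```

Here GENE_B exceeds the CPM threshold in only one sample, so it is dropped, while the other two genes are retained.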
Demo
Let’s say you want to analyze the sample dataset, Demo 1. You can display the dataset as raw counts to use the data as is, or convert the counts into logCPM values and visualize the biological replicates of each group in your downstream cluster analyses.
Alternatively, you may prefer to use the mean logCPM values of your groups. In that case, k-means clusters will be generated using the group means in the following steps.
1.2 Data Filtering (Recommended)
Optionally, a variance stabilizing transformation (VST) followed by analysis of variance (ANOVA) can be applied to the uploaded count data to extract genes whose expression changes significantly among conditions.
We suggest using the ANOVA-filtered gene set for further analyses by clicking the button Use data after vst+ANOVA.
Note that the sample datasets have already been filtered by ANOVA.
Proceed to the Clustering tab once you have made your selections.
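To illustrate the ANOVA step, the sketch below computes a one-way ANOVA F statistic for a single gene across labelled samples. It assumes the values have already been variance-stabilized; the VST itself and the conversion of F into a p-value are not reproduced here, and the function name and toy values are hypothetical.

```python
def anova_f(values, groups):
    """One-way ANOVA F statistic for one gene's values across labelled samples."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    n, k = len(values), len(by_group)
    grand_mean = sum(values) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(x) * (sum(x) / len(x) - grand_mean) ** 2
                     for x in by_group.values())
    # Variation of the replicates around their own group mean.
    ss_within = sum((v - sum(x) / len(x)) ** 2
                    for x in by_group.values() for v in x)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

conditions = ["C", "C", "T", "T"]
f_changing = anova_f([1.0, 2.0, 3.0, 4.0], conditions)  # differs between groups
f_flat = anova_f([5.0, 5.0, 5.0, 5.0], conditions)      # no change at all
```

A large F value means that the differences between conditions dominate the differences between replicates, which is the kind of gene this filter is meant to keep.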
2.1 Hierarchical Clustering (Optional)
The selected dataset can be visualized as nested clusters by Hierarchical Clustering, using the agglomeration and distance methods of your choice, to obtain an overview of the expression profiles.
You can specify the number of clusters to generate; the default is 4.
Demo
You can evaluate the branches of the cluster dendrogram to decide on a cluster number at first glance. For instance, the dendrogram may suggest that a larger number of clusters, such as 8, is better suited to analyzing gene expression patterns in the selected dataset.
Hierarchical clustering results can help select the number of clusters to be used in k-means clustering.
2.2 K-means Clustering
The number of clusters, the maximum number of iterations, the number of random sets, and the plot layout can be specified according to the sample size. The defaults are 8 clusters, a maximum of 200000 iterations, 20 random sets, and a layout with 4 columns.
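The role of these parameters can be sketched with a small k-means implementation. This is a plain Lloyd-style illustration under stated assumptions, not the app's implementation: the manual does not say which k-means variant is used, and the function names and toy profiles are hypothetical. The defaults mirror the documented ones, with `n_start` standing in for the number of random sets.

```python
import random

def scale(profile):
    """Z-score one gene's profile so clustering follows shape, not magnitude."""
    m = sum(profile) / len(profile)
    sd = (sum((v - m) ** 2 for v in profile) / (len(profile) - 1)) ** 0.5
    return [(v - m) / sd for v in profile] if sd > 0 else [0.0] * len(profile)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, max_iter, rng):
    """One run from a random set of k starting centroids."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wss

def kmeans(points, k=8, max_iter=200000, n_start=20, seed=0):
    """Run n_start random starts and keep the lowest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, max_iter, rng) for _ in range(n_start))
    return min(runs, key=lambda run: run[2])

# Hypothetical logCPM profiles: three rising genes and two falling genes.
profiles = [[1, 2, 3], [2, 4, 6], [10, 11, 12], [3, 2, 1], [9, 6, 3]]
labels, centroids, wss = kmeans([scale(p) for p in profiles], k=2)
```

Because the profiles are scaled first, the three rising genes end up in one cluster even though their absolute expression levels differ widely.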
Demo
Upon selecting the desired parameters and clicking on the button Cluster, line graphs appear for the generated clusters. The black lines correspond to the scaled logCPM values of each gene falling into a cluster and the red lines correspond to the cluster centroids.
The line graphs help interpret the differential gene expression patterns of the clusters. For instance, Cluster 7 seems to consist of genes that are upregulated in condition TrkB.T1 compared to condition TrkB.FL.
Alternatively, you can visualize the generated clusters as boxplots. They may be easier to interpret for datasets with larger numbers of groups or with low variation among groups.
Boxplot visualization can also be preferred for k-means clusters generated using group means of logCPM values.
Clicking the DEGs button opens a pop-up window showing the number of differentially expressed genes between the selected experimental groups whose absolute log fold change is greater than or equal to the specified limit (default 0.5).
You can view and sort the clusters by the number of differentially expressed genes they contain. For instance, Cluster 4 seems to have the highest number of significantly differentially expressed genes between groups TrkB.FL and TrkB.T1.
The average silhouette widths for increasing numbers of clusters can be viewed from the Cluster analysis button. These silhouette scores can be used to choose an optimal number of clusters for the gene set of interest.
For instance, when k is evaluated over the range 4-16, the highest number of well-clustered clusters is shared by the options k = 8, 13 and 16. However, the average silhouette width decreases when k > 8, so generating 8 clusters can be preferred for further analysis.
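The per-cluster count behind this view can be sketched as below. The sketch applies only the absolute log fold change limit and leaves out any significance filtering; the function name, gene names, fold changes and cluster assignments are all hypothetical.

```python
from collections import Counter

def count_degs(log_fc, cluster_of, limit=0.5):
    """Count, per cluster, the genes with |logFC| >= limit."""
    return Counter(cluster_of[gene] for gene, fc in log_fc.items()
                   if abs(fc) >= limit)

# Hypothetical log fold changes between two groups and cluster assignments.
log_fc = {"GENE_A": 1.2, "GENE_B": -0.7, "GENE_C": 0.1, "GENE_D": 0.5}
cluster_of = {"GENE_A": 4, "GENE_B": 4, "GENE_C": 7, "GENE_D": 7}

deg_counts = count_degs(log_fc, cluster_of)
ranked = deg_counts.most_common()  # clusters sorted by number of DEGs
```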
2.3 Mirror Clusters
In the Dissimilarity Heatmap tab, Pearson’s correlation is calculated between the centroids of every pair of clusters, and the cluster pairs with the strongest negative correlation are named mirror clusters.
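As a minimal sketch of this idea, the code below correlates every pair of cluster centroids and returns the most negatively correlated pair. The centroid values and function names are made up for illustration.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def mirror_pair(centroids):
    """Return (r, cluster_a, cluster_b) for the most negatively correlated pair."""
    return min((pearson(centroids[a], centroids[b]), a, b)
               for a, b in combinations(sorted(centroids), 2))

# Hypothetical centroids of three clusters across three conditions.
centroids = {
    1: [-1.0, 0.0, 1.0],
    2: [1.0, 0.1, -1.1],
    3: [0.5, -1.0, 0.5],
}
r, cluster_a, cluster_b = mirror_pair(centroids)
```

In this toy example Clusters 1 and 2 have nearly opposite profiles, so they are reported as the mirror pair.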
Demo
Evaluating clusters with completely opposite expression profiles can give better insight into the molecular mechanisms induced by a certain condition. For instance, Clusters 4 and 7 are negatively correlated with a score of -0.98, so it might be worthwhile to investigate them together.
3.1 Cluster Prioritization: Silhouette Plots
You can view the silhouette width of each cluster to evaluate how well the genes are clustered and to identify the best possible cluster(s). Not all genes might be clustered well, hence suggestions for cluster selection based on the evaluation of average silhouette widths are provided to help improve the accuracy of your cluster analysis. You may prefer analyzing clusters with higher silhouette widths.
Clusters having similar silhouette scores can be observed from the cluster plot.
Demo
The silhouette information is also provided for each gene in your gene set of interest, together with its cluster and its neighboring cluster.
Note that you can also search for the silhouette information of a specific gene of interest.
A gene's silhouette score compares its average distance to the other genes in its assigned cluster with its average distance to the genes in the nearest cluster to which it is not assigned. Silhouette scores range from −1 to +1, where a high value indicates that the gene is well matched to its own cluster and poorly matched to its neighboring clusters. See: silhouette function of the cluster package.
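The definition above can be written out as a short sketch. It uses Euclidean distance and assumes at least two clusters; it is an illustration of the general silhouette formula rather than the cluster package's code, and the toy points are hypothetical.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width s = (b - a) / max(a, b) for every point.

    a: mean distance to the other members of the point's own cluster.
    b: smallest mean distance to the members of any other cluster.
    """
    clusters = sorted(set(labels))
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        mean_dist = {}
        for c in clusters:
            others = [q for j, (q, l) in enumerate(zip(points, labels))
                      if l == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        if own not in mean_dist:  # a singleton cluster gets width 0
            widths.append(0.0)
            continue
        a = mean_dist[own]
        b = min(d for c, d in mean_dist.items() if c != own)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

# Two tight, well-separated toy clusters.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)
average_width = sum(widths) / len(widths)
```

Averaging the widths over all genes gives the average silhouette width that is compared across different numbers of clusters.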
3.2 MSigDB for All Clusters
Gene set enrichment analysis can be conducted on the obtained clusters using Human MSigDB collections from the Molecular Signatures Database (GSEA). A table of molecular signatures and the enriched genes will be generated for each cluster. Genes can be filtered to have mRNA-protein correlation (retrieved from the DepMap Portal) above a specified threshold, or to have expression levels within a specified range.
Demo
You can select any of the MSigDB collection(s) and examine the enriched genes with their ratios within their clusters. The gene signatures can be filtered by gene counts, p values and q values.
MSigDB Collections are retrieved by the msigdbr package.
The gene clusters can also be divided into essential and non-essential genes to evaluate separately.
Enrichment of essential genes alone may give better insight into the more critical pathways induced by a condition. Evaluating non-essential genes separately for functional annotations can facilitate biological interpretations that are unrelated to cell survival.
The Cancer Dependency Map (DepMap) data was used for categorizing genes by essentiality.
The Network panel allows for the use of available gene sets from MSigDB Collections for conducting a network analysis on the clusters of interest.
For instance, you can see the network plot of the mirror clusters 4 and 7 based on the selected control and treatment groups. Upregulated gene sets are shown in red and downregulated gene sets in blue. The top 10 enriched terms are shown by default, but you can increase or decrease the number of terms you want to visualize in your network. The edge widths are directly proportional to the gene ratios in the clusters enriched for the presented terms. Like edge widths, edge lengths are proportional to normalized gene ratios.
You can search for all the clusters and terms presented in the generated network plot. It will facilitate your observations on the plots if you have preliminary hypotheses for the clustered gene sets.
Visualizing all of the generated clusters at once will provide insight into shared gene signatures and thus a better interpretation of the clusters. For instance, Clusters 1 and 7 share a greater proportion of terms and both are upregulated.
Note that you can check the gene expression patterns of your clusters from the View Clusters tab as a reminder.
3.3 Priority Genes
The Priority Genes tab shows the number of genes in each cluster having target priority scores in different cancer types with their odds ratios and p-values.
The priority scores are retrieved from The Project Score Database, part of the DepMap Portal.
Demo
You can sort and filter the clusters by the odds ratios and p-values of priority genes for the given cancer types. For instance, Clusters 1 and 7 have the highest odds ratios with p < 0.05 for pan cancer. You can also check specific cancer types to see the association of your clusters.
The Heatmap panel visualizes, for each cluster, the proportion of genes that have target priority scores, using either odds ratios or gene percentages.
For instance, Cluster 6 is seen to be a cluster depleted of genes with priority scores, hence you might want to investigate other clusters.
3.4 Sample Correlations
The Pearson’s correlation between experimental groups in a selected cluster can be checked over an interactive heatmap to observe whether two conditions or the biological replicates of a condition have high correlation.
Demo
For instance, the correlation between conditions C and TrkB.FL is higher than that between conditions C and TrkB.T1, and all replicates of each condition are close to each other.
3.5 Gene Essentiality
For each cluster, the mean (DepMap) efficacy and selectivity scores of its essential genes are provided together with the odds ratio and p-value from Fisher’s Exact Test, which indicate whether the number of essential genes in that cluster differs significantly from the number in all other clusters.
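The test on such a cluster-versus-rest comparison can be sketched as follows, using a 2x2 table of essential and non-essential gene counts. This is a plain two-sided Fisher's exact test with the sample odds ratio (a*d)/(b*c); the app's exact settings are not stated in this manual, and some implementations report a conditional estimate of the odds ratio that differs slightly. The counts are hypothetical.

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    a: essential genes in the cluster      b: non-essential genes in the cluster
    c: essential genes in other clusters   d: non-essential genes in other clusters
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lowest, highest = max(0, row1 - (n - col1)), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds_ratio, min(1.0, p_value)

# Hypothetical counts: 3 of 4 genes in the cluster are essential, 1 of 4 elsewhere.
odds_ratio, p_value = fisher_exact(3, 1, 1, 3)
```

An odds ratio above 1 suggests enrichment of essential genes in the cluster, but with counts as small as these the p-value is far from significant.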
At the bottom of the page, several graphs regarding the essentiality of genes will appear upon clicking on a cluster of your interest from the table.
Demo
The interactive scatter plot (x, y, z) on the top left corner of the section allows you to make observations on regression analyses using values of efficacy, selectivity, range of expression (CPM), average of expression (CPM), coefficient of variation or maximum log fold change.
For instance, the regression plot of log2 average CPM vs log2 range CPM can help compare the variations in essential and non-essential genes. Simultaneously, selection of point size as maximum logFC can help compare gene expression levels. You can simply replot by changing the parameters and selecting a cluster of interest. Note that you can discard either essential or non-essential genes from the plot for separate visualization.
On the top right corner of the section, graphs of the Kolmogorov-Smirnov Test, density and Mann Whitney U Test results of genes within the selected cluster can be obtained for the selected parameter. Genes can be compared by either their efficacy scores or essentiality scores in each test.
Demo
For instance, both the Density and Mann Whitney U Test plots show that non-essential genes have higher coefficients of variation, whereas essential genes have slightly higher expression levels overall.
The barplot on the bottom left corner of the section shows the number of essential genes that are either common essentials, i.e. genes exhibiting shared essentiality among all lineages, or lineage-dependent, which impact efficacy scores.
The line graph on the bottom right corner is provided as a reminder of the selected cluster’s pattern.
3.6 mRNA-Protein Correlation
A regression plot can be generated for observing the overall mRNA-protein correlation of genes in the selected cluster with the data retrieved from the DepMap Portal. Regression analysis can be conducted for particular tissues of interest along with optional specifications of diseases or cell lines of interest. The interactive plot that is generated upon selection provides the regression line and the fitted values from robust regression. The bisquare weights of genes in the regression analysis are provided as a table down the page.
The linear models are generated using the smooth, lm, and rlm functions.
Demo
For instance, let’s say you want to check the mRNA and protein expression levels of genes in Cluster 4 in two specific cell lines of interest from the central nervous system lineage. You will obtain linear regression results for mRNA and protein expressions in both cell lines grouped into essential and non-essential genes. Note that you can discard some groups from the plot by clicking on their icons.
The robust regression table can be searched, filtered and sorted. For instance, you may consider sorting genes by weight if you want to investigate only genes with high mRNA-protein correlation. A weight of 0 means that the gene's values do not fit the regression line.
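To show where bisquare weights come from, the sketch below fits a straight line by iteratively reweighted least squares with Tukey's bisquare weight function. It is an illustration under assumptions, not the app's code: the tuning constant 4.685, the MAD-based scale and the toy data are choices made for this example, and the function names are hypothetical.

```python
from statistics import median

def weighted_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def bisquare_fit(x, y, c=4.685, n_iter=50, tol=1e-8):
    """Robust line fit; returns (intercept, slope, final bisquare weights)."""
    w = [1.0] * len(x)
    intercept, slope = weighted_line(x, y, w)  # start from ordinary least squares
    for _ in range(n_iter):
        resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        scale = median([abs(r) for r in resid]) / 0.6745  # robust scale (MAD)
        if scale == 0:
            break
        # Points far from the line get a weight of exactly 0.
        w = [(1 - (r / (c * scale)) ** 2) ** 2 if abs(r) < c * scale else 0.0
             for r in resid]
        new_intercept, new_slope = weighted_line(x, y, w)
        converged = (abs(new_slope - slope) < tol
                     and abs(new_intercept - intercept) < tol)
        intercept, slope = new_intercept, new_slope
        if converged:
            break
    return intercept, slope, w

# Hypothetical mRNA (x) and protein (y) values with one gene far off the trend.
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
x = [float(i) for i in range(10)]
y = [1 + 2 * xi + e for xi, e in zip(x, noise)]
y[5] += 30  # the outlier
intercept, slope, weights = bisquare_fit(x, y)
```

In this toy example the outlying gene receives a weight of 0, while the genes that follow the trend keep weights close to 1.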
4.1 Differential Gene Expression Analysis
Any one of the clusters and any of the sample condition pairs can be chosen for DGEA. The analysis can be conducted by either Limma-Voom or DESeq2, and the results will be provided as a table.
Demo
4.2 Functional Profiling
DGEA results for the selected cluster under a particular condition of interest can be used for gene set enrichment analysis. The analysis can be conducted on genes having logFC values within a specified range. Optionally, genes can be categorized into essentials and non-essentials for analysis.
The descriptions of the gene sets can be viewed upon clicking the button Check MSigDB Collections.
Demo
Human MSigDB Collections from the Molecular Signatures Database (GSEA) are retrieved by the msigdbr package.
5.1 Gene-Specific Analysis
DGEA by limma or DESeq2 can also be conducted for further gene prioritization.
The results are provided together with the distance correlation, differential gene expression results, gene essentiality information with efficacy and selectivity values, and linear regression values for mRNA-protein correlation, including the correlation coefficient (R), R square (R2), Adjusted R square (Adj.R2), intercept, slope and p value.
A distance correlation is calculated between the expression profile of each gene in a selected cluster and that of the cluster centroid. In other words, it shows how closely the expression pattern of each gene in a chosen cluster matches the pattern of the cluster's center.
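The distance correlation between a gene's profile and its cluster centroid can be sketched from the standard definition with double-centered distance matrices. The function names and the toy profiles are hypothetical, and the app may rely on a library implementation instead.

```python
def centered_distances(v):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_mean = [sum(row) / n for row in d]  # equals the column means (symmetric)
    grand_mean = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation between two equal-length profiles (0 to 1)."""
    n = len(x)
    A, B = centered_distances(x), centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(A[i][j] * B[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(A[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(B[i][j] ** 2 for i, j in pairs) / n ** 2
    denom = (dvar_x * dvar_y) ** 0.5
    return (max(dcov2, 0.0) / denom) ** 0.5 if denom > 0 else 0.0

# Hypothetical centroid and a gene that follows it on a different scale.
centroid = [1.0, 2.0, 3.0, 4.0, 5.0]
gene = [2 * v + 3 for v in centroid]
dcor_same_shape = distance_correlation(gene, centroid)
dcor_flat = distance_correlation([7.0] * 5, centroid)
```

Note that distance correlation is never negative: a gene whose profile is the exact mirror image of the centroid would also score 1, so the direction of change should be checked on the plot.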
Demo
Individual genes within a selected cluster under the specified sample condition can be investigated in detail. In the Gene-Specific Analysis panel, clicking a gene in the results table plots the normalized expression levels of that gene in each replicate on the left, and the alignment of the gene's pattern with that of the cluster centroid on the right.
You can check the deviations among the replicates of a group for your gene of interest and see the distance of your gene from the cluster centroid.
At the bottom of the page, the dependency scores of the gene you have selected will be plotted for several different lineages. Each dot corresponds to a cell line of the corresponding lineage, and the more negative the dependency score, the more dependent the cell line is on the gene of interest. For instance, the U87MG cell line seems to be quite dependent on PDE4D as it has a negative dependency score of -0.0346.
The selected gene can also be evaluated at the protein level. The proportion of protein expression in cancer cells and healthy tissues can be plotted upon the selection of cancer types.
Note that there might not be any data for some of the selected genes in the HPA database.
For instance, while PDE4D has low to medium expression in both healthy glial or neuronal cells and gliomas, it has high expression in several other cancer types such as melanoma and colorectal cancer.
You can also search for another gene of interest by name. For instance, a search for the MCM4 protein shows high expression in 75% of glioma patients.
Note that you can search for multiple genes of your interest – either comma or space delimited. Gene name search is not case sensitive.
The RNA expression levels of the gene can be visualized along with the normalized quantitative protein profile by mass spectrometry within distinct tissues and, optionally, within the specified diseases or cell lines.
The data is extracted from 22Q1 DepMap public release that includes 55825 genes, 1165 cell lines, 33 primary diseases, and 32 lineages, retrieved by the depmap R package.
You can conduct a linear regression analysis for the gene you have selected from the results table to check its mRNA-protein correlation in different lineages and, optionally, in specified diseases or cell lines. Robust regression results will be displayed as in the Cluster Prioritization tab.
The cluster in which a specific gene or gene set of interest resides can be found from the tab Search Gene/Geneset. Note that multiple gene names should be separated by commas or spaces. Gene name search is not case sensitive. The tab presents two tables: one listing genes and their cluster IDs in the current dataset, and the other a frequency table with both observed and expected counts. Additionally, it computes the p-value of the chi-squared test comparing the observed and expected counts.
5.2 Primer Design
Gene-specific primers can be designed for any gene within the cluster that was selected in the Gene-Specific Analysis section, after the table has been generated.
The designed exon junction spanning and intron spanning primers are displayed in their genomic positions.
Primer pairs and their attributes are provided as a table further down the page.
The red lines represent the gene's exons; blue (forward) and orange (reverse) dots indicate the positions where each primer binds to the gene.
We highly recommend validating the primers designed in CAP-RNAseq with another tool.
Demo
Please cite our app:
Raw File
CSV file with Gene Identifiers (Gene Symbol) in first column and raw count values in other columns
Do not use duplicated gene names! If you have duplicated gene names, they will be removed.
Condition File
TXT file (a single row or column) without a header
When you click 'Apply vst+ANOVA', vst normalization (variance-stabilizing transformation) will be performed before ANOVA to stabilize the variance along the mean. Then, ANOVA will be used to remove the genes whose expression does not change between samples.
This step is optional and can be skipped.
You can visualize your data with Hierarchical Clustering. This can give an idea for the number of clusters to use in K-means Clustering.
The dissimilarity is calculated by applying the Pearson's correlation method to the centroids of the clusters to determine mirror clusters.
The target priority score is a metric used to prioritize potential therapeutic targets based on their relevance and importance. For more information, you can visit the Project Score database.
The table below displays the count of genes that have a priority score in any cancer type, as well as the odds ratio and p-value calculated for each cluster by comparing the number of these genes within the cluster to the total number of genes across all clusters.
Please select a cluster from the pull-down menu to visualize the correlation between mRNA and protein expression values of this cluster’s genes.
This tab helps you visualize the protein expression levels of genes selected from the table above or entered manually, based on the Human Protein Atlas (HPA) database.
The Primer Design section becomes active once the table in the 'Gene-Specific Analysis' tab has been generated.