SRTdb: an omnibus for human tissue and cancer-specific RNA transcripts

The production of functional mature RNA transcripts from genes involves various pre-transcriptional regulation and post-transcriptional modification steps. Accumulating studies have demonstrated that gene transcription is carried out in tissue- and cancer type-dependent ways. However, RNA transcript-level specificity analysis of large-scale transcriptomics data across different normal tissue and cancer types is lacking. We applied reference-based de novo transcript assembly and quantification to 27,741 samples across 33 cancer types, 29 tissue types, and 25 cancer cell line types. In total, we identified 231,836 specific RNA transcripts (SRTs) across various tissue and cancer types, most of which were found to be independent of specific genes. Almost half of tumor SRTs are also tissue-specific, but in different tissues. Furthermore, we found that 10 to 20% of tumor SRTs in most tumor types were testis-specific. The SRT database (SRTdb) was constructed based on these resources. Taking liver cancer as an example, we showed how the SRTdb resource can be used to optimize the identification of RNA transcripts for more precise diagnosis of particular cancers. Our results provide a useful resource for exploring transcript specificity across various cancer and tissue types and can advance precision medicine for tumor patients.


Introduction
Over the past decade, high-throughput RNA sequencing (RNA-seq) technology has greatly improved our understanding of the roles the transcriptome plays in various human physiological and pathological processes [1]. Transcriptome-wide analysis has become indispensable for investigating systematic changes in numerous aspects of RNA biology and for discovering novel functional RNAs [2][3][4]. Mature RNA transcripts, which are the major carriers that deliver genetic codes from DNA to proteins and exert regulatory roles, arise from nascent RNA products through diverse pre-transcriptional regulation and post-transcriptional modification steps [5]. Through alternative processing of nascent RNAs, individual genes can produce different RNA transcripts under distinct physiological or pathological conditions to execute specific functions [6,7]. However, most transcriptome-wide studies have focused on gene-level activities, neglecting the specific RNA transcripts (SRTs). Emerging evidence has demonstrated the importance of determining the SRTs of genes in particular physiological and pathological conditions [8,9].
Recently, accumulating studies have shed light on the advantages of transcript-level analysis over gene-level analysis. A pilot study of alternative transcriptional isoforms across multiple human tissues revealed the universality of alternative transcription in different tissue types, indicating the necessity and importance of RNA transcript-level analysis [10]. In a previous study, we identified one alternative transcript of the UGP2 gene, which showed significantly differential expression and indicated a favorable prognosis in liver cancer [9]. Zheng et al. analyzed approximately 1000 normal and liver cancer RNA-seq samples to identify transcripts that were exclusively expressed in liver cancer samples relative to normal liver samples [11]. They found that tumor-specific transcripts were frequently expressed in liver cancer and experimentally demonstrated their biological functions in liver cancer. These non-negligible findings are masked in gene-level analysis. The specificity of RNA transcripts also suggests considerable potential for tumor-specific diagnosis in clinical practice.
To maximize the utility of human tissue and cancer SRTs, we conducted de novo transcriptome assembly of 27,741 RNA-seq samples across various tissue/cancer types and present SRTdb (http://www.shenglilabs.com/SRTdb/), a comprehensive database of human tissue and cancer SRTs. In total, the SRTdb database contains 1,160,216 RNA transcripts across 29 different tissue types, 33 cancer types, and 25 cancer cell line lineages. We identified 228,752, 212,214, and 231,836 SRTs in human normal tissue types, cancer types, and cell line types, respectively. Further analysis revealed that tissue/cancer SRTs are independent of the corresponding specific genes (SRGs) and that about half of tumor SRTs are also specific in other normal tissues, especially the testis. Overall, our results offer a panorama of RNA transcript specificity across various tissue and cancer types and lay a solid data foundation for cancer precision medicine.

RNA-seq data collection
The RNA-seq read alignments (BAM files) of 16,367 human normal tissue samples from 29 different tissue types were downloaded from the Genotype-Tissue Expression data portal (GTEx, https://www.gtexportal.org/) with official authorization. The RNA-seq read alignments (BAM files) of 10,358 human tumor samples from 33 different cancer types were obtained from the Genomic Data Commons data portal (GDC, https://portal.gdc.cancer.gov/) with official authorization. The raw RNA-seq data (FASTQ files) of 1016 human cancer cell lines from 25 different primary sites were downloaded from the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) database under accession SRP186687. These data were released by the Cancer Cell Line Encyclopedia project (CCLE, https://portals.broadinstitute.org/ccle/). Raw sequencing reads were aligned to the human reference genome (GRCh38) using STAR to generate read alignments for each cell line.

Transcript assembly and quantification
Individual read alignment files were provided as input to StringTie [12] for de novo transcript assembly. Transcript annotation from GENCODE version 22 was used as the reference transcript model to guide the assembly process with the "-G" option. Transcript assembly was performed separately for each sample. All assembled transcripts were then merged into a non-redundant master set of transcripts for all samples by using the "--merge" mode of StringTie. StringTie quantification was used to produce transcript-level expression for each sample. Expression levels were normalized in TPM units (TPM = Transcripts Per Million mapped reads). In each tissue/cancer/cell type, transcripts with expression levels higher than 0.1 TPM in at least one sample were retained as expressed transcripts.
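The three StringTie steps and the expression filter described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the functions only build command lines and do not run them, all file names are hypothetical, and the use of the "-e" flag for the quantification step is an assumption, since the text names only the "-G" and "--merge" options.

```python
from typing import Dict, List


def assemble_cmd(bam: str, reference_gtf: str, out_gtf: str) -> List[str]:
    """Per-sample, reference-guided assembly (StringTie with -G)."""
    return ["stringtie", bam, "-G", reference_gtf, "-o", out_gtf]


def merge_cmd(sample_gtfs: List[str], reference_gtf: str, merged_gtf: str) -> List[str]:
    """Merge all per-sample assemblies into one non-redundant transcript set."""
    return ["stringtie", "--merge", "-G", reference_gtf, "-o", merged_gtf, *sample_gtfs]


def quantify_cmd(bam: str, merged_gtf: str, out_gtf: str) -> List[str]:
    """Quantify one sample against the merged set.

    Assumption: -e limits estimation to the transcripts supplied with -G;
    the source text does not state which quantification flags were used.
    """
    return ["stringtie", bam, "-e", "-G", merged_gtf, "-o", out_gtf]


def expressed_transcripts(tpm: Dict[str, List[float]], min_tpm: float = 0.1) -> List[str]:
    """Keep transcripts whose TPM exceeds min_tpm in at least one sample."""
    return [tid for tid, values in tpm.items() if any(v > min_tpm for v in values)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(assemble_cmd("sample1.bam", "gencode.v22.gtf", "sample1.gtf")))
    print(" ".join(merge_cmd(["sample1.gtf", "sample2.gtf"], "gencode.v22.gtf", "merged.gtf")))
    print(" ".join(quantify_cmd("sample1.bam", "merged.gtf", "sample1.quant.gtf")))
    # Toy TPM values per transcript across the samples of one tissue type.
    print(expressed_transcripts({"tx_a": [0.0, 0.2], "tx_b": [0.05, 0.1]}))
```

The filter uses a strict "higher than" comparison, so a transcript that reaches exactly 0.1 TPM but never exceeds it is not retained.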

Calculation of expression specificity scores
To obtain tissue/cancer/cell type-specific transcripts, a specificity score was calculated for each transcript, as described in our previous study [9]. In particular, the specificity score equals the logarithm of the number of lineages minus the Shannon entropy of the transcript's expression. The calculation is as follows:

$$S_t = \log N - \left( -\sum_{i=1}^{N} p_{it} \log p_{it} \right) = \log N + \sum_{i=1}^{N} p_{it} \log p_{it}$$

where $S_t$ represents the specificity score of transcript $t$, $N$ is the total number of tissue/cancer/cell types, and $p_{it}$ indicates the expression ratio of transcript $t$ in tissue/cancer/cell type $i$. One specificity score and $N$ expression ratios were assigned to each transcript. The expression ratio of each transcript across all tissue/cancer/cell types is calculated as follows:

$$p_{it} = \frac{x_{it}}{\sum_{j=1}^{N} x_{jt}}$$

where $p_{it}$ is the expression ratio of transcript $t$ in tissue/cancer/cell type $i$, $N$ indicates the total number of tissue/cancer/cell types, and $x_{it}$ represents the expression value of transcript $t$ in tissue/cancer/cell type $i$. When the largest expression ratio was more than twice the second largest expression ratio and the specificity score was larger than 1, the transcript was defined as a tissue/cancer/cell type-specific transcript in the tissue/cancer/cell type with the largest expression ratio.
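The score and the two-part decision rule can be illustrated with a short Python sketch. It assumes base-2 logarithms (the base is not stated in this section) and takes one summarized expression value per tissue/cancer/cell type as input; the function names and the toy values are hypothetical.

```python
import math
from typing import Dict, Optional


def expression_ratios(expr: Dict[str, float]) -> Dict[str, float]:
    """Expression ratio of one transcript in each type: x_i / sum_j x_j."""
    total = sum(expr.values())
    if total <= 0:
        raise ValueError("transcript is not expressed in any type")
    return {name: value / total for name, value in expr.items()}


def specificity_score(expr: Dict[str, float]) -> float:
    """log(N) minus the Shannon entropy of the ratios (base 2 is an assumption)."""
    ratios = expression_ratios(expr)
    entropy = -sum(p * math.log2(p) for p in ratios.values() if p > 0)
    return math.log2(len(ratios)) - entropy


def specific_type(expr: Dict[str, float],
                  min_score: float = 1.0,
                  min_fold: float = 2.0) -> Optional[str]:
    """Return the type a transcript is specific to, or None.

    Both conditions from the text must hold: the score exceeds min_score and
    the largest ratio is more than min_fold times the second largest.
    """
    if len(expr) < 2:
        raise ValueError("at least two types are required")
    ranked = sorted(expression_ratios(expr).items(), key=lambda kv: kv[1], reverse=True)
    (top_name, top_ratio), (_, second_ratio) = ranked[0], ranked[1]
    if specificity_score(expr) > min_score and top_ratio > min_fold * second_ratio:
        return top_name
    return None


if __name__ == "__main__":
    # Toy values: expressed in a single type versus spread across types.
    print(specific_type({"liver": 8.0, "lung": 0.0, "testis": 0.0, "brain": 0.0}))
    print(specific_type({"liver": 6.0, "lung": 1.0, "testis": 1.0}))
```

In the second toy case the top ratio is six times the runner-up, yet the transcript is not called specific because the score stays below 1 when only three types are compared. This shows why the two conditions are applied jointly.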

Calculation of specific diagnostic scores in different tumor types
To further filter out transcripts specific for individual tumor types, we calculated one specific diagnostic score for each tumor SRT by integrating the expression level, tumor-specific scores, and tissue-specific scores. Only transcripts that were expressed in tumor samples but not in the corresponding normal samples were used to calculate specific diagnostic scores. Specific diagnostic scores are calculated as follows: where $S_{tc}$ is the specific diagnostic score of transcript t in cancer type c, $x_{tc}$ indicates the average expression level of transcript t across tumor samples in cancer type c, $s_{tc}$ is the specificity score of transcript t in cancer type c, $r_{tc}$ represents the expression frequency of transcript t across samples in cancer type c, and $s_{tt}$ is the tissue-specific score of transcript t in tissue t; $s_{tt}$ is set to 1 when transcript t is not a tissue-specific transcript. $\beta_1$ and $\beta_2$ are weight coefficients. $\beta_1$ is set to 1 when transcript t is a specific transcript in cancer type c and to −1 when transcript t is specific in other cancer types. $\beta_2$ is set to 1 when transcript t is a specific transcript in tissue type t and to 0 when transcript t is specific in other tissue types. The larger the $S_{tc}$ value, the more reliable transcript t is as a specific diagnostic biomarker in cancer type c.
Transcripts with CPAT coding potential scores larger than the default cutoff of 0.364 were labeled as protein-coding. A transcript was considered to have coding potential when it was predicted as coding by both algorithms.

Database and web site implementation
The SRTdb database was built with the Python Flask-RESTful API framework (https://flask-restful.readthedocs.io/) as the backend web framework. MongoDB (https://www.mongodb.com) was adopted for data storage and management. Angular (https://angular.io/) was used to develop the web interfaces of SRTdb. The frontend framework was constructed using Bootstrap (https://getbootstrap.com). Data visualization was carried out with ECharts (https://echarts.apache.org/). The SRTdb online database is tested and supported in popular web browsers, including Microsoft Edge, Google Chrome, Firefox, and Safari.

SRTdb is resourced from over 27,000 human RNA-seq samples
SRTdb is a database that aims to collect and annotate human specific RNA transcripts, especially cancer-specific transcripts, from large-scale RNA-seq datasets. Currently, SRTdb covers 27,741 samples across 33 tumor types, 29 normal tissue types, and 25 cancer cell lineages (Table 1). To make transcript identification more sensitive to tumors, we first performed reference-based de novo transcriptome assembly across 10,358 tumor samples (Fig. 1A). Briefly, de novo transcript assembly was conducted separately in individual tumor samples (see Materials and Methods). Then, all assembled transcripts were merged to generate one non-redundant set of transcripts as the master transcript annotation for the following analyses. In total, 1,160,216 transcripts were identified, of which 198,256 are annotated transcripts and 961,960 (82.91%) are novel transcripts (Fig. 1B). The length of identified transcripts ranged widely from 200 bp to 30 kb, with a median length of 5192 bp (Fig. 1C). About 40% of transcripts are composed of 2 to 5 exons, and approximately 10% have more than 20 exons (Fig. 1D).

Exploring tumor and tissue-specific RNA transcripts with SRTdb
The expression profiles of transcripts were used to cluster tumor samples and normal samples separately, based on the top 2000 variable transcripts. Transcript expression profiles performed notably well in distinguishing different tumor types (Fig. 2A) and normal tissue types (Fig. 2B). The number of transcripts specific to each cancer type varies considerably (Fig. 2C). The largest number of specifically expressed transcripts is found in acute myeloid leukemia (LAML), followed by esophageal carcinoma (ESCA), brain lower grade glioma (LGG) and glioblastoma multiforme (GBM). Notably, there are fewer SRTs for cancers of the same tissue type, such as lung adenocarcinoma (LUAD) and lung squamous carcinoma (LUSC), both of which are derived from lung tissue. Cancers of the colorectum, colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ), also have fewer SRTs. The major reason for these results is that cancers of the same or related tissue types have relatively similar expression patterns. A certain number of specifically expressed transcripts are present in each normal tissue type, and these transcripts may be tightly related to tissue-specific functions. The testis has the highest number of specifically expressed RNAs, followed by pituitary, salivary gland, spleen and liver. These SRTs can be employed as molecular biomarkers for different tissue types. Similarly, distinct numbers of specific transcripts were obtained in different cancer cell line types. To examine whether specific transcripts are independent of specific genes, we also evaluated the specificity of the corresponding host genes. In most cancer and tissue types, the majority of SRTs are not from specific genes (Fig. 2D). This result demonstrates that a large portion of valuable RNA transcripts is neglected by gene-level analysis. The SRTdb data portal mainly includes "Browse", "Search", and "Download" features.
Through a user-friendly interface, users can browse the SRTs by selecting a specific tumor, cancer cell line, or normal tissue type from the dataset column on the left (Supplementary Fig. S1A), and the result will be displayed in a table. The result table contains transcript ID, specificity score, specificity ratio, specific cancer type, tissue type, gene symbol, genomic loci, sequence, length, and transcript type. By clicking the transcript ID, users will be redirected to a new web page showing the basic information of the transcript, its expression specificity, and boxplots of its expression profiles across cancer types, normal tissues, and cancer cell lines (Supplementary Fig. S1B). Users can query transcripts of interest by transcript ID, gene name, or genomic loci on the "Search" page (Supplementary Fig. S1C). In addition, users can perform a quick search by gene symbol on the homepage. Files of transcript annotation and specificity can be downloaded from the "Download" page.

Specificity analysis reveals dual roles of tumor SRTs
To further explore the potential clinical utility of tumor SRTs, we examined their specificity distribution across multiple tumor and tissue types. On average, approximately half of tumor SRTs are also tissue SRTs (Fig. 3A). Interestingly, the majority of tumor SRTs are exclusively expressed in tissue types other than the tissue from which the corresponding tumor type originates. For example, one of the LINC01419 transcripts, ENST00000522365.1, is specifically expressed in liver cancer across multiple cancer types, while it is exclusively expressed in testis tissue across various tissue types (Fig. 3B). In some tumor types, especially liver cancer, the vast majority of tumor SRTs are also specific in the tissue of origin (Supplemental Fig. S2A). The APOA2 transcript ENST00000367990.6 is exclusively active in liver cancer across different cancer types and is also specifically expressed in liver tissue across multiple tissue types (Supplemental Fig. S2B). To further explore the specificity distribution of transcripts, we examined the tumor specificity of tissue SRTs. Strikingly, more than 30,000 testis SRTs are also specific in tumors, most of which are specific in tumor types originating from tissues other than the testis (Fig. 3C). We next examined how many tumor SRTs are testis-specific in each tumor type. Among all tumor types, tumor SRTs from LAML constitute the largest population of testis SRTs, followed by TGCT, ESCA, and LGG (Fig. 3D). In most tumor types, 10 to 20% of tumor SRTs are testis-specific. As expected, the largest portion (about 40%) of TGCT tumor SRTs are also testis SRTs. CESC has the second largest percentage (about 30%) of tumor SRTs that are testis-specific, which may be due to reproductive functions.
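The per-tumor-type overlap reported above (the share of a tumor type's SRTs that are also specific to a given normal tissue) amounts to a set intersection. A minimal sketch, assuming SRTs are held as sets of transcript IDs keyed by tumor or tissue type; the data structure, the function name, and the toy IDs are hypothetical and do not reflect the SRTdb file format.

```python
from typing import Dict, Set


def fraction_also_tissue_specific(tumor_srts: Dict[str, Set[str]],
                                  tissue_srts: Dict[str, Set[str]],
                                  tissue: str) -> Dict[str, float]:
    """For each tumor type, the fraction of its SRTs that are SRTs of `tissue`."""
    target = tissue_srts.get(tissue, set())
    return {
        tumor: (len(ids & target) / len(ids) if ids else 0.0)
        for tumor, ids in tumor_srts.items()
    }


if __name__ == "__main__":
    # Toy transcript IDs, for illustration only.
    tumor = {"LAML": {"t1", "t2", "t3", "t4"}, "TGCT": {"t5", "t6"}}
    tissue = {"testis": {"t1", "t5", "t6"}, "liver": {"t2"}}
    print(fraction_also_tissue_specific(tumor, tissue, "testis"))
```

Note that the fraction within a tumor type and the absolute count of shared SRTs can rank tumor types differently, which is why the text reports both the largest population and the largest percentage.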

The application of SRTdb for precision cancer diagnosis: liver cancer as an example
A considerable proportion of tumor SRTs were found to be specifically expressed in particular normal tissues, which points to a possibly promiscuous assignment of specific RNAs in pan-cancer or paired tumor-normal studies. The SRT resources deposited in the SRTdb data portal can be used to develop specific diagnostic RNA markers for particular cancer types. Here, we took liver cancer as an example to show the application of the SRTdb database for precision diagnosis of cancer. Liver cancer transcripts were first filtered to discard transcripts that were expressed at a considerable level in normal tissues. Then a specific diagnostic score was calculated for each of the filtered liver cancer SRTs (see Materials and Methods). Compared to normal tissues, 3234 transcripts were specifically expressed in liver cancer, of which only 116 were stringently liver cancer-specific (Fig. 4A). For example, one of the APO2 transcripts, ENST00000481511.4, is exclusively expressed in liver cancer across different cancer types and showed negligible transcriptional activity across multiple normal tissues, including normal liver tissue (Fig. 4B). The detection of ENST00000481511.4 in any sample from liquid biopsy or liver tissue very likely indicates the occurrence of liver cancer. Although ENST00000465758.1 (one of the TM4SF4 transcripts) is specifically expressed in liver cancer relative to normal liver tissue and any other tissue, it is also transcriptionally active in other cancer types, such as cholangiocarcinoma and pancreatic cancer (Fig. 4C). In another case, the transcript ENST00000379236.3 (from the TNFRSF4 gene) was expressed at a much higher level in liver cancer than in normal liver tissue, but also showed considerable expression in other cancer and tissue types (Fig. 4D). These results demonstrate the necessity of examining transcript-level expression across different tumor and tissue types for precision tumor diagnosis.

Discussion
In the recent decade, RNA-seq techniques have uncovered the vast diversity of the transcriptome and its applications in cell/tissue identity and clinical diagnosis/treatment [1,15,16]. Nevertheless, studies have focused on the gene level. For a long time, one genomic location of a particular gene has been deemed to transcribe one single or major RNA transcript [17].
Although experimental validations are based on RNA transcripts, a large portion of isoform transcripts have been ignored. With the development of RNA-seq techniques and computational algorithms, more and more transcriptional isoforms and their functions have been discovered [18,19]. To unveil the transcriptional diversity and specificity across multiple human tissues and tumors, we conducted reference-based de novo transcript assembly and quantification of 27,741 samples across 29 tissue types, 33 cancer types, and 25 cancer cell line types. Our results revealed dual roles of tumor SRTs, in that tumor SRTs also showed exclusive expression in particular tissues. We present a publicly accessible data portal, SRTdb, to facilitate the exploration of the cancer transcriptome and transcriptional specificity at transcript resolution. Efforts have been made to identify specific transcriptional activities across various tissue or cancer types [20][21][22]. However, these results were based on gene-level quantification, which fails to consider the ubiquitous transcriptional isoforms. To minimize batch effects between samples, we applied a widely used RNA-seq pipeline to process all the samples [9,23,24]. All the RNA-seq samples went through the same alignment, transcript quantification and expression normalization procedures.
Some published databases also provide RNA transcript information in human tissues or cancers, such as cncRNAdb [25], NONCODE [26], and NoncoRNA [27], but they are quite different from the SRTdb database. The cncRNAdb database collected about 2000 experimentally supported cncRNAs (coding and noncoding RNAs) across over 20 species. The NoncoRNA database curated 5568 experimentally supported noncoding RNAs and their drug target associations in cancer. Both cncRNAdb and NoncoRNA provide only gene-level information and no specific transcript information for RNAs, and they do not include expression levels across different human tissues and cancers. The NONCODE database provides integrated knowledge of noncoding RNAs across 39 different species. Although NONCODE provides information on specific transcripts of RNAs, it differs from our SRTdb database in three major aspects: it focuses on noncoding RNAs; it does not provide RNA specificity across human tissues and cancers; and its RNA transcripts were retrieved from published papers or databases.
Precision diagnosis of cancer is vital for preventing further deterioration and developing treatment strategies. Specificity of diagnostic markers or factors is crucial for differential or precision diagnosis, which includes discrimination from both normal tissues and other cancer types. Diagnostic specificity is especially important in non-invasive diagnosis, such as liquid biopsy, which is one of the most important means for the early detection of cancers [28,29]. Our results show that most tumor SRTs are also specifically expressed in other tumor types or particular normal tissues, indicating possible misdiagnosis in liquid biopsy. By utilizing the SRTdb resource, we also developed a specific diagnostic score system to identify transcripts for the precision diagnosis of particular cancer types.
The advent of third-generation RNA sequencing (TGRS) technologies (i.e., long-read or full-length RNA sequencing) has expedited the more accurate identification of full-length RNA transcripts [30]. TGRS will rectify a variety of RNA transcripts that were assembled from RNA-seq data by computational algorithms. Even so, major findings from computationally assembled RNA transcripts will substantially promote the development of precision cancer medicine. The SRTdb data portal presented in this study is expected to deepen our understanding of cancer transcriptome diversity and to support precision cancer diagnosis in clinical practice.