The study of RNA expression is the fastest-growing area of genomic research. However, despite the dramatic increase in the number of sequenced transcriptomes, we still do not have accurate estimates of the number and expression levels of non-coding RNA genes. Non-coding transcripts are often overlooked due to incomplete genome annotation.
In this study, Université de Sherbrooke researchers used annotation-independent detection of RNA reads, generated using a reverse transcriptase with low structure bias, to identify non-coding RNAs. Transcripts between 20 and 500 nucleotides were filtered and crosschecked against non-coding RNA annotations, revealing 111 non-annotated non-coding RNAs expressed in different cell lines and tissues. Inspection of their sequence and structural features indicated that 60% of these transcripts correspond to new snoRNA and tRNA-like genes. The identified genes exhibited features of their respective families in terms of structure, expression, conservation and response to depletion of interacting proteins. Together, these data reveal a new group of RNAs that are difficult to detect using standard gene prediction and RNA sequencing techniques, suggesting that reliance on current gene annotation and sequencing techniques distorts the perceived architecture of the human transcriptome.
Identification of new non-coding RNA genes from sequencing read clusters
(A) Description of the RNA-Seq and bioinformatic analysis pipelines used for RNA detection and identification. The lack of fragmentation is indicated by an X. R1 and R2 indicate the forward and reverse read files obtained from paired-end sequencing. Read clusters were identified using Blockbuster on BAM alignment files and visualized on a genome browser. (B) Identification of new non-coding RNA clusters. The RNA clusters obtained from A were compared to the annotation available in RNAcentral; clusters with and without standard annotations are indicated in the pie chart. (C) Proportion of kept and discarded clusters. Clusters were discarded if they had fewer than 100 uniquely mapped reads, were shorter than 20 nucleotides, were not detected in the other investigated TGIRT-Seq datasets, were poorly defined antisense fragments of the 45S rRNA, or represented a retained intron. Note that clusters with fewer than 100 uniquely mapped reads whose multimapped reads aligned only to other non-annotated clusters were kept (as is the case for the ETS and ITS clusters). (D) Distribution of the non-optimal features of the discarded clusters. (E) Methods used for the functional (biotype) classification of the new non-coding RNA genes. The retained clusters were compared to known ncRNAs in terms of sequence, structure, function and repeated element overlap, and classified by biotype. (F) Final distribution of the predicted biotypes of the NA_RNA clusters.
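The keep/discard rules described for panel C can be written out as a small filter. The following is only a minimal sketch of those rules, not the authors' pipeline: the `Cluster` record, its field names and the reason labels are hypothetical, and it assumes each cluster has already been scored for these features upstream. Only the two numeric thresholds (100 uniquely mapped reads, 20 nucleotides) come from the caption.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the panel C description.
MIN_UNIQUE_READS = 100
MIN_LENGTH_NT = 20


@dataclass
class Cluster:
    """Hypothetical per-cluster summary; field names are illustrative only."""
    name: str
    length_nt: int
    unique_reads: int
    # True when the multimapped reads align only to other non-annotated clusters.
    multimap_only_to_unannotated: bool = False
    seen_in_other_datasets: bool = True
    poorly_defined_45s_antisense: bool = False
    retained_intron: bool = False


def discard_reasons(c: Cluster) -> List[str]:
    """Return every panel C criterion that the cluster fails (empty list = keep)."""
    reasons = []
    # Low unique-read support is tolerated when the multimapped reads
    # align only to other non-annotated clusters.
    if c.unique_reads < MIN_UNIQUE_READS and not c.multimap_only_to_unannotated:
        reasons.append("low_unique_reads")
    if c.length_nt < MIN_LENGTH_NT:
        reasons.append("too_short")
    if not c.seen_in_other_datasets:
        reasons.append("not_reproduced")
    if c.poorly_defined_45s_antisense:
        reasons.append("45S_antisense_fragment")
    if c.retained_intron:
        reasons.append("retained_intron")
    return reasons


def is_kept(c: Cluster) -> bool:
    """A cluster is kept only if it fails none of the criteria."""
    return not discard_reasons(c)
```

Returning the list of failed criteria, rather than a single boolean, makes it straightforward to tally the non-optimal features of discarded clusters, which is the kind of summary shown in panel D.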