The choice of a gene model has a dramatic effect on both gene quantification and differential analysis

RNA-Seq has become increasingly popular in transcriptome profiling. One aspect of transcriptome research is to quantify the expression levels of genomic elements, such as genes, their transcripts and exons. Acquiring a transcriptome expression profile requires genomic elements to be defined in the context of the genome. Multiple human genome annotation databases exist, including RefGene (RefSeq Gene), Ensembl, and the UCSC annotation database. The impact of the choice of an annotation on estimating gene expression remains insufficiently investigated.

Recently, researchers from Pfizer systematically characterized the impact of genome annotation choice on read mapping and transcriptome quantification by analyzing an RNA-Seq dataset generated by the Human Body Map 2.0 Project. A gene model affects the mapping of non-junction reads and junction reads differently. For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene model was used. By contrast, this percentage dropped to 53% for junction reads. In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10-15% mapped to alternative locations. There are 21,958 common genes among the RefGene, Ensembl, and UCSC annotations. When the researchers compared the gene quantification results from the RefGene and Ensembl annotations, 20% of genes were not expressed and thus had a zero count in both annotations. Surprisingly, identical gene quantification results were obtained for only 16.3% (about one sixth) of genes. For approximately 28.1% of genes, expression levels differed by 5% or more, including 9.3% of genes (2,038) whose relative expression levels differed by 50% or more. Case studies revealed that differences in how the gene models define a gene frequently result in inconsistent gene quantification.
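
The comparison described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the gene names and counts are made up, and the relative difference is assumed here to be the absolute difference divided by the larger of the two counts, which may not match the definition used in the paper. The bins are mutually exclusive, so the 28.1% figure quoted above would correspond to the two "differ" bins at 5% and above combined.

```python
from collections import Counter


def relative_difference(a, b):
    """Relative difference between two counts, scaled by the larger count.

    This scaling is an assumption made for illustration only.
    """
    if a == b:
        return 0.0
    return abs(a - b) / max(a, b)


def categorize(counts_a, counts_b):
    """Bin each gene shared by two annotations by how much its count differs."""
    bins = Counter()
    for gene in counts_a.keys() & counts_b.keys():
        a, b = counts_a[gene], counts_b[gene]
        if a == 0 and b == 0:
            bins["not expressed"] += 1
        elif a == b:
            bins["identical"] += 1
        else:
            diff = relative_difference(a, b)
            if diff >= 0.5:
                bins["differ >= 50%"] += 1
            elif diff >= 0.05:
                bins["differ 5-50%"] += 1
            else:
                bins["differ < 5%"] += 1
    return bins


# Hypothetical read counts for the same genes under two annotations.
refgene = {"GENE_A": 0, "GENE_B": 120, "GENE_C": 100, "GENE_D": 40, "GENE_E": 200}
ensembl = {"GENE_A": 0, "GENE_B": 120, "GENE_C": 98, "GENE_D": 100, "GENE_E": 180}

bins = categorize(refgene, ensembl)
total = sum(bins.values())
for label, n in sorted(bins.items()):
    print(f"{label}: {n} gene(s), {100 * n / total:.1f}%")
```

With this toy input, each of the five genes falls into a different bin. On real data the two dictionaries would be filled from the per-gene count tables produced under each annotation.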

The impact of a gene model on RNA-Seq read mapping (read length = 75 bp)

This study demonstrates that the choice of a gene model has a dramatic effect on both gene quantification and differential analysis. The findings should help analysts make an informed choice of gene model in practical RNA-Seq data analysis.

Zhao S, Zhang B. (2015) A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics. Feb 18;16(1):97.

One comment

  1. While it is true that the results presented are valid for the specific data sets used, please be aware that it is not valid to draw any conclusions about the complete human RefSeq data set from the presented results. These results do not reflect on the complete RefSeq data set for human as complete RefSeq data was not analyzed. The UCSC RefGene track represents only a subset of the available RefSeq human transcript data set.

    The RefSeq collection is provided by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. The NCBI does provide comprehensive annotation data for the human RefSeq collection in several formats including GFF.

    The use of the UCSC RefGene track, instead of NCBI’s annotation of their RefSeq product, is unfortunate for the following reasons:
    1) UCSC’s RefGene track is generated by aligning RefSeq transcripts to the genome. It does not reflect NCBI’s annotated placement of RefSeq transcripts. Thus, for near-identical paralogs, a RefSeq transcript will be placed in more than one location in the UCSC RefGene track.
    2) The UCSC RefGene track only includes the ‘known’ RefSeqs having an accession prefix of ‘NM_’ or ‘NR_’. NCBI also provides a large number of model RefSeq transcripts (with XM_ and XR_ accession prefixes). Model RefSeq transcripts are based on a combination of evidence data including RNA-seq.
    3) NCBI’s generated annotation provides placement data for the complete RefSeq data set, including both the heavily curated known set and the model set of RefSeq transcripts. NCBI-generated annotation data is more directly comparable to the Ensembl annotation results, which also include two levels of annotation: that which has been provided by the HAVANA manual curation team at the Wellcome Trust Sanger Institute (comparable to the known RefSeq data set), and that which is predicted based on a combination of input data including RNA-Seq (comparable to model RefSeq records).
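
The first two points in the comment above can be checked on any transcript table with a few lines of Python. This is only a sketch under stated assumptions: the accessions are invented, and each placement is assumed to be a simple tuple of (accession, chromosome, strand, transcript start) rather than any particular file format. It separates known from model RefSeq accessions by prefix and reports transcripts that appear at more than one location.

```python
from collections import defaultdict

# Accession prefixes as described in the comment above.
KNOWN_PREFIXES = ("NM_", "NR_")
MODEL_PREFIXES = ("XM_", "XR_")


def refseq_class(accession):
    """Classify a RefSeq transcript accession as 'known', 'model', or 'other'."""
    if accession.startswith(KNOWN_PREFIXES):
        return "known"
    if accession.startswith(MODEL_PREFIXES):
        return "model"
    return "other"


def multiply_placed(placements):
    """Return accessions placed at more than one distinct locus.

    `placements` is an iterable of (accession, chrom, strand, tx_start)
    tuples; this layout is a simplification chosen for the example.
    """
    loci = defaultdict(set)
    for accession, chrom, strand, tx_start in placements:
        loci[accession].add((chrom, strand, tx_start))
    return {acc: sorted(sites) for acc, sites in loci.items() if len(sites) > 1}


# Hypothetical placements; none of these accessions are real.
placements = [
    ("NM_TOY1", "chr1", "+", 1000),
    ("NM_TOY1", "chr1", "+", 50000),  # same transcript aligned to a paralogous locus
    ("NR_TOY2", "chr2", "-", 300),
    ("XM_TOY3", "chr3", "+", 10),
]

dupes = multiply_placed(placements)
classes = {acc: refseq_class(acc) for acc, *_ in placements}
print(dupes)
print(classes)
```

A table that contains multiply placed accessions, or no XM_/XR_ accessions at all, is a hint that it was derived from transcript alignments or from the known subset only, which is the situation the comment describes.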
