Using next-generation sequencing data to improve eukaryotic gene prediction
Although annotation is a basic step in most NGS projects, it is still very inaccurate. Most de novo annotations rely on ab initio prediction. This method is based on HMMs or other machine learning algorithms and attempts to output the most probable gene annotation given the gene model. However, gene models are often far from biological reality and rarely make use of signaling sequences. The reason is that the nature of some signals is poorly understood (for example the transcription start site, TSS, and the transcription termination site, TTS), while others are only weakly defined (such as the Kozak sequence or an enhancer). Such signals cannot be modeled directly, but hints about them can be obtained from experiments such as RNA-seq. Mapping RNA-seq reads to the genome can precisely locate intron boundaries, detect transcribed regions and determine the correct DNA strand for a gene model. Using such hints, it is possible to improve annotation quality significantly. Here we discuss our pipeline, which combines a hinted ab initio approach with a homology-based scoring system for competing gene models. The results of this pipeline are usually of better quality than those of the most widely used methods. We will also discuss the importance of annotation in whole-genome studies and the connection between genome assembly and annotation quality.
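As an illustration of how mapped RNA-seq reads can yield intron hints, the following is a minimal sketch and not the pipeline's actual code. It assumes spliced alignments are available as (chromosome, 1-based start position, CIGAR string) tuples, as in the SAM format, where the `N` operation marks a skipped reference region. The function names and the `min_support` threshold are hypothetical and chosen for this example only.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# SAM operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def introns_from_alignment(pos, cigar):
    """Return 1-based inclusive (start, end) intervals for each N operation."""
    introns = []
    ref = pos  # reference coordinate of the next aligned base
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # The skipped region is a candidate intron.
            introns.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return introns


def intron_hints(alignments, min_support=2):
    """Count reads per intron and keep introns seen at least min_support times.

    min_support is an illustrative (assumed) filter against spurious
    alignments, not a value taken from the abstract.
    """
    counts = Counter()
    for chrom, pos, cigar in alignments:
        for start, end in introns_from_alignment(pos, cigar):
            counts[(chrom, start, end)] += 1
    return {key: n for key, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    reads = [
        ("chr1", 100, "50M200N50M"),
        ("chr1", 120, "30M200N70M"),
        ("chr1", 500, "20M80N30M"),
    ]
    for (chrom, start, end), n in sorted(intron_hints(reads).items()):
        print(chrom, start, end, n)
```

Hints of this kind (intron coordinates with read support) are what a hinted ab initio predictor could take as extra evidence. Strand assignment and detection of transcribed regions would need additional information, such as splice-site motifs or read coverage, which this sketch does not cover.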