The Annotation Pipeline

 

The annotation pipeline integrates evidence from reference protein sequences and gene expression data to predict gene structures on genome sequences. For the homology-based annotation step, we combined available Triticeae protein sequences obtained from UniProt (05/10/2016). This set included validated protein sequences from Swiss-Prot as well as predicted protein sequences from species including Triticum aestivum, Aegilops tauschii and Hordeum vulgare. These protein sequences were mapped to the nucleotide sequence of the T. urartu pseudomolecules using the splice-aware alignment software GenomeThreader (version 1.6.6, arguments: -startcodon -finalstopcodon -species rice -gcmincoverage 70 -prseedlength 7 -prhdist 4; doi:10.1016/j.infsof.2005.09.005).

 

Furthermore, we used HISAT2 (version 2.0.4, parameter: --dta) (Pertea et al. 2016) to align multiple sets of RNA-seq data to the assemblies. Data sets included expression data [add datasets]. Mapped reads were assembled into transcript sequences with StringTie (version 1.2.3, parameters: -m 150 -t -f 0.3, PMID: 25690850) (Pertea et al. 2016) for each data set separately. With these parameters, StringTie reports only transcripts with a minimum size of 150 bp and only isoforms expressed at a level of at least 30% of the main isoform. All transcripts were combined using Cuffcompare (version 2.2.1, PMID: 26519415) and merged with StringTie (version 1.2.3, parameters: --merge -m 150) to remove fragments and redundant structures. Next, we used TransDecoder (version 3.0.0) to find potential open reading frames and to predict protein sequences. We used BLASTP (ncbi-blast-2.3.0+, parameters: -max_target_seqs 1 -evalue 1e-05, PMID: 2231712) to compare potential protein sequences with a trusted protein reference database (UniProt Magnoliophyta, reviewed/Swiss-Prot, downloaded on 03 Aug 2016) and hmmscan (version 3.1b2, PMID: 22039361) to identify conserved protein family domains in all potential proteins. BLAST and hmmscan results were fed back into TransDecoder.Predict to select the best translation per transcript sequence. The final gene predictions were combined with the protein-based structure predictions from GenomeThreader to compensate for potentially differing open reading frame predictions by the two tools.
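The alignment and assembly steps described above can be sketched as a small set of command builders. This is a minimal illustration, not the scripts used for the annotation: the option values (--dta, -m 150, -t, -f 0.3, --merge) are taken from the text, whereas the function names, the file names, the assumption of paired-end reads, and the conversion of the HISAT2 output to a coordinate-sorted BAM file before StringTie are assumptions made for the example.

```python
import shlex


def hisat2_cmd(index, reads_1, reads_2, sam_out):
    """HISAT2 alignment with --dta (output tailored for transcript assemblers).

    Paired-end input (-1/-2) is an assumption; the text does not state the
    library layout.
    """
    return ["hisat2", "--dta", "-x", index,
            "-1", reads_1, "-2", reads_2, "-S", sam_out]


def stringtie_assemble_cmd(sorted_bam, gtf_out):
    """Per-data-set assembly: minimum transcript length 150 bp (-m 150),
    minimum isoform fraction 0.3 (-f 0.3), and -t as given in the text.

    StringTie expects a coordinate-sorted BAM; how the SAM file was sorted
    is not described in the text and is left out here.
    """
    return ["stringtie", sorted_bam, "-m", "150", "-t", "-f", "0.3",
            "-o", gtf_out]


def stringtie_merge_cmd(gtf_files, merged_gtf):
    """Merge the per-data-set assemblies (--merge -m 150)."""
    return ["stringtie", "--merge", "-m", "150", "-o", merged_gtf, *gtf_files]


if __name__ == "__main__":
    # Hypothetical sample and index names, for illustration only.
    samples = ["sampleA", "sampleB"]
    for s in samples:
        print(shlex.join(hisat2_cmd("genome_index", f"{s}_1.fq", f"{s}_2.fq", f"{s}.sam")))
        print(shlex.join(stringtie_assemble_cmd(f"{s}.sorted.bam", f"{s}.gtf")))
    print(shlex.join(stringtie_merge_cmd([f"{s}.gtf" for s in samples], "merged.gtf")))
```

The builders only return argument lists, so the commands can be inspected or logged before being passed to a job scheduler or to subprocess.run.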

 

To differentiate the predicted protein sequences into (i) canonical proteins, (ii) non-coding transcripts, (iii) pseudogenes and (iv) transposable elements, we applied a confidence classification to all potential protein/transcript sequences. To this end, we compared all potential protein sequences by BLAST against two protein reference databases. The first database (UniMag) contained all validated Magnoliophyta protein sequences from UniProt (downloaded on 08/03/2016) and the second database (UniPoa) contained all annotated Poaceae protein sequences from UniProt (downloaded on 08/03/2016). The second database was further filtered to contain complete protein sequences only. Furthermore, to filter out transposons, we compared all potential protein sequences by BLAST against the translated TREP database (release 16, Sabot et al. 2005, http://botserv2.uzh.ch/kelldata/trep-db/index.html). For each predicted protein, the best hit in each of the three databases was selected. Only hits with an E-value below 10E-10 were considered.
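The best-hit selection can be illustrated with a short sketch. It assumes that the BLAST results are available in the standard 12-column tabular format (-outfmt 6), which the text does not state, and that ties in E-value are broken by bit score, which is likewise an assumption. The cutoff is written exactly as in the text (10E-10) and should be changed if 1e-10 was intended. The function name is hypothetical.

```python
import csv
import io

# Cutoff as written in the text ("10E-10"); adjust if 1e-10 was intended.
EVALUE_CUTOFF = 10e-10


def best_hits(tabular_text, evalue_cutoff=EVALUE_CUTOFF):
    """Return the best hit per query from BLAST tabular output (outfmt 6).

    Columns assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    Hits with an E-value at or above the cutoff are discarded; among the
    remaining hits the lowest E-value wins, ties broken by higher bit score.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= evalue_cutoff:
            continue
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current["evalue"], -current["bitscore"]):
            best[query] = {"subject": subject, "evalue": evalue, "bitscore": bitscore}
    return best


if __name__ == "__main__":
    # Made-up example rows, for illustration only.
    example = (
        "prot1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n"
        "prot1\tsubjB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t230\n"
        "prot2\tsubjC\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-3\t30\n"
    )
    for query, hit in best_hits(example).items():
        print(query, hit["subject"], hit["evalue"])
```

Running the selection once per database (the two protein databases and TREP) yields the three best hits per predicted protein that enter the classification.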

 

Significant alignments were further filtered by query and subject coverage. For the comparison with the protein databases, we considered only alignments with query and subject coverage of at least 90% as representative hits; for the comparison with the TREP database, we considered alignments with a query coverage of at least 75% as representative hits. Based on representative BLAST hits and the completeness of protein sequences (annotated start and stop codon), all potential transcript sequences were then classified into the following confidence classes: (i) High confidence (HC): the protein sequence is complete and either has a hit with subject and query coverage above the threshold in database UniMag (HC1), or has no BLAST hit in UniMag but a hit in UniPoa and none in TREP (HC2). (ii) Low confidence (LC): the protein sequence is not complete and has a hit in UniMag or UniPoa but not in TREP (LC1), or the protein sequence is complete but has no hit in UniMag, UniPoa or TREP. The tag REP was assigned to protein sequences that are complete and have no hit in UniMag but a hit in TREP.
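The classification rules can be written down as a small decision function. The sketch below is one reading of the rules in this paragraph, not the code used for the annotation. The names Hit and classify are hypothetical, coverage is assumed to be supplied as a fraction of sequence length, HC2 is assumed to require a complete protein sequence, and combinations that the text does not cover (for example, an incomplete sequence with no hit at all) are returned as "unclassified". The second low-confidence case carries no sub-label in the text and is returned as plain "LC".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    query_cov: float    # fraction of the query covered by the alignment
    subject_cov: float  # fraction of the subject covered by the alignment


def is_representative_protein_hit(hit: Optional[Hit]) -> bool:
    """Protein databases: query and subject coverage of at least 90%."""
    return hit is not None and hit.query_cov >= 0.90 and hit.subject_cov >= 0.90


def is_representative_trep_hit(hit: Optional[Hit]) -> bool:
    """TREP: query coverage of at least 75%."""
    return hit is not None and hit.query_cov >= 0.75


def classify(complete: bool,
             unimag: Optional[Hit],
             unipoa: Optional[Hit],
             trep: Optional[Hit]) -> str:
    """Assign a confidence class from completeness and best hits per database."""
    in_mag = is_representative_protein_hit(unimag)
    in_poa = is_representative_protein_hit(unipoa)
    in_trep = is_representative_trep_hit(trep)

    if complete:
        if in_mag:
            return "HC1"   # complete, representative hit in UniMag
        if in_trep:
            return "REP"   # complete, not in UniMag, hit in TREP
        if in_poa:
            return "HC2"   # complete, not in UniMag, in UniPoa, not in TREP
        return "LC"        # complete, no hit in any database
    if (in_mag or in_poa) and not in_trep:
        return "LC1"       # incomplete, hit in UniMag or UniPoa, not in TREP
    return "unclassified"  # combination not described in the text


if __name__ == "__main__":
    good = Hit(query_cov=0.95, subject_cov=0.93)
    print(classify(True, good, None, None))
    print(classify(False, None, good, None))
    print(classify(True, None, None, Hit(query_cov=0.80, subject_cov=0.10)))
```

Keeping the two coverage thresholds in separate helper functions makes it easy to test the effect of stricter or more permissive cutoffs on the class sizes.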