Introduction & References

TRAPID is an online tool for the fast and efficient processing of assembled RNA-Seq transcriptome data. TRAPID offers high-throughput ORF detection and frameshift correction, and includes a functional, comparative and phylogenetic toolbox that makes use of 175 reference proteomes. The TRAPID platform is available at

Detailed information about the platform and its tools is provided in the different sections. In addition, we provide a detailed step-by-step tutorial to guide non-experts through the different steps of processing a complete transcriptome using TRAPID. Sample data, including Panicum transcripts (from Meyer et al., 2012; see [1]) and subset labels, can be found at

Software requirements

In order to use TRAPID, all you need is any modern browser with JavaScript enabled. TRAPID was tested using Firefox 58 & Chrome 64, on Ubuntu and Windows.

[1] Meyer E, Logan TL, Juenger TE: Transcriptome analysis and gene expression atlas for Panicum hallii var. filipes, a diploid model for biofuel research. Plant J 2012, 70(5):879-890.


In case you publish results generated using TRAPID, please cite this paper:

TRAPID, an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes.
Michiel Van Bel, Sebastian Proost, Christophe Van Neste, Dieter Deforce, Yves Van de Peer and Klaas Vandepoele*
Genome Biology, 14:R134, 2013

In case you publish frameshift corrected sequences, MUSCLE multiple sequence alignments or phylogenetic trees generated using FastTree2 or PhyML, please also cite the corresponding papers:

FrameDP: sensitive peptide detection on noisy matured sequences.
Gouzy J, Carrere S, Schiex T
Bioinformatics, 25:670-671, 2009

MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Edgar RC
Nucleic Acids Res, 32:1792-1797, 2004

New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0.
Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O
Syst Biol, 59:307-321, 2010

FastTree 2--approximately maximum-likelihood trees for large alignments.
Price MN, Dehal PS, Arkin AP
PLoS One, 5:e9490, 2010

User Authentication

Data security is a necessary concern when dealing with online platforms and services. User authentication ensures that no user has access to the data of any other user. Authentication is performed through a username/password combination.

To acquire a username/password combination for the platform, select the register option when visiting the TRAPID website. After you supply a valid e-mail address, an associated password will be sent to you. This e-mail address/password combination gives access to the user-restricted area within the TRAPID platform.

Step-by-step instructions on how to create an account and log in can be found in the tutorial.

Creating TRAPID Experiments

The transcriptome data should be uploaded to the TRAPID platform. Before doing this, it is important to note that, after authentication, the user can create different experiments for different transcriptome data sets, with a maximum of 20 experiments per user. Analyzing several transcriptome data sets at the same time is therefore possible.

The most important choice to be made here is which reference database to use. The PLAZA reference databases are well suited to transcriptome data sets from plant or green algal species, while the EggNOG reference database should be used for any other species, such as animals, fungi or bacteria.

Step-by-step instructions on how to create an experiment can be found in the tutorial.

Uploading transcript sequences and job control

After the creation of a TRAPID experiment, the user should upload their transcriptome data to the platform. The transcriptome data should be made available as a multi-fasta file before upload to the server (max. size 30MB using the File upload option). To accommodate the rather large file sizes associated with plain-text multi-fasta files, the uploaded file can also be compressed using zip or gzip. Fasta files, compressed or not, can also be uploaded by providing a URL to a specific transcript file (e.g. hosted at an FTP site, a public DropBox URL, etc.); this option allows uploading bigger data sets (max. size 300MB). If the transcriptome data is split over several multi-fasta files, the user can continue uploading data (via file upload or URLs) into the transcriptome data set before starting the processing phase. You will receive an e-mail when your sequences have been successfully uploaded into the database.
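
The size limits above can be checked locally before uploading. The following is a minimal sketch, not part of TRAPID: the helper names are hypothetical, and reading "MB" as 1024 x 1024 bytes is an assumption.

```python
import gzip
import shutil
from pathlib import Path

# Limits quoted in the text; interpreting MB as 1024 * 1024 bytes is an assumption.
FILE_UPLOAD_LIMIT = 30 * 1024 * 1024    # File upload option
URL_UPLOAD_LIMIT = 300 * 1024 * 1024    # upload through a URL


def gzip_fasta(fasta_path):
    """Write a gzip-compressed copy of a multi-fasta file and return its path."""
    src = Path(fasta_path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def upload_route(size_bytes):
    """Suggest an upload route for a file of the given size (hypothetical helper)."""
    if size_bytes <= FILE_UPLOAD_LIMIT:
        return "file upload"
    if size_bytes <= URL_UPLOAD_LIMIT:
        return "url"
    return "split into several files"
```

For example, `upload_route(gzip_fasta("transcripts.fasta").stat().st_size)` would report which route fits the compressed file (the file name is only an illustration).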

During all TRAPID processing steps (upload, transcript processing, running frameshift correction or computing alignment/phylogenetic tree), users can check the experiment status to see if their job is queued, running or in error status. In case you want to cancel or stop your job, go to the Experiment Status page and modify the New status to Finished.

Step-by-step instructions on how to upload data can be found in the tutorial.

Perform transcript processing

The processing phase of the TRAPID platform is the next step, and necessary before any of the user custom analyses can be performed. This phase is initiated by selecting the Perform Initial Processing link on an experiment page. During this step, the user should consider the options carefully, as they may seriously impact the custom analyses.

  • First and foremost is the choice of Database Type, i.e. whether a single species, a phylogenetic clade or the gene family representatives will be used for the similarity search. A single species is a good choice if a close relative of the transcriptome species is present in the reference database. If a good encompassing phylogenetic clade is available, then this is also a solid choice. If neither applies, the gene family representatives provide a good sample distribution of the gene content within each reference database.
  • Set an E-value cutoff for the DIAMOND similarity search.
  • Define the Gene Family Type: for PLAZA reference databases (PLAZA 3 monocots/dicots, pico-PLAZA 02), this is Gene families (TribeMCL clusters) or Integrative Orthology (in case a single species was selected as Database Type), for EggNOG this is Gene families (EggNOG ortholog groups).
  • Define how the functional annotation should be transferred from the family to the transcript level. In general, transfer based on gene family is the most conservative approach, while transfer based on best hit yields a larger number of functionally annotated transcripts. Logically, combining both methods using transfer from both GF and best hit yields the largest fraction of annotated transcripts.
  • Decide if the similarity search should be preceded by a taxonomic classification of the transcripts (performed using Kaiju against the NCBI NR protein database). This step can be useful to flag potential contamination in a transcriptome dataset or to get a quick overview of the composition of a complex sample (e.g. single-cell transcriptome). In addition, it is possible to define transcript subsets from the taxonomic classification results.
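
As a rough illustration of the annotation transfer options in the list above, the sketch below combines two sets of terms for a single transcript. The function name, the mode strings and the input sets are assumptions made for this example; how TRAPID derives each set internally is not shown here.

```python
def transfer_annotation(gf_terms, best_hit_terms, mode):
    """Return the annotation set of one transcript under a given transfer mode.

    gf_terms and best_hit_terms are sets of term identifiers obtained from the
    gene family and from the best similarity hit (hypothetical inputs).
    """
    if mode == "gf":
        return set(gf_terms)
    if mode == "best_hit":
        return set(best_hit_terms)
    if mode == "gf_and_best_hit":
        # Combining both sources can only add annotations, never remove them.
        return set(gf_terms) | set(best_hit_terms)
    raise ValueError(f"unknown transfer mode: {mode}")
```

Because the combined mode is a set union, it always returns at least as many terms as either single mode, which matches the statement that it annotates the largest fraction of transcripts.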

After this step, the experiment will become available while the server performs the initial processing of the data. Again, you will receive an e-mail when the processing is finished.

The processing of the data is sufficiently fast for normal transcriptome data sets. The fraction of very short or very long transcripts will impact the total processing time during this initial phase.

Initial processing time with TRAPID 01

A test data set containing 90,000 transcripts can be processed in less than 3 hours, with approximately 28% of these transcripts assigned to gene families. For the Panicum data set (25,392 transcripts) discussed in the manuscript, the complete processing (incl. upload as gzip file from public URL + transcript processing using PLAZA 2.5, clade = Monocots) takes around 1 hour (with 60% of transcripts assigned to gene families).

Initial processing time with TRAPID 02

Using the same Panicum data set discussed in the TRAPID 01 manuscript, the PLAZA 03 monocots reference database and 'Monocots' as clade, processing takes around 30 minutes (including taxonomic binning).

Step-by-step instructions on how to process transcripts can be found in the tutorial.

Basic analyses

After the initial processing of the data has been performed, several new data types are available for the TRAPID experiment: gene families and functional annotation (GO categories or protein domains). Using these extra data types offers exciting new analyses to the user.

Step-by-step instructions on how to browse these statistics can be found in the tutorial.

General statistics

Figure 1: general statistics example.

The general statistics page offers a complete overview of ORF finding, gene family assignments, similarity search species information, meta-annotation and functional information.

Subsets and labels

If the data set is comprised of transcriptome data from different sources (with sources indicating different tissues, developmental types or stress conditions), then the user has the ability to assign labels to the subsets. This is done through the Import data > Transcript subsets/labels link on the experiment side menu and providing per label a list of transcript identifiers. Note that it is possible to assign multiple labels to one transcript.

Assigning labels to transcripts makes several new analyses available, such as comparing functional annotation between different subsets or computing functional enrichment.

Step-by-step instructions on how to use subsets and labels can be found in the tutorial.

The user can search for a number of data types within the selected experiment, through the search bar in the header bar of the experiment. Functional annotation can be searched either through direct term identifiers (e.g. GO:0005509) or through descriptions (e.g. Calcium ion binding). Similarly, gene families can be searched using TRAPID's identifier (with experiment ID prefix) or the original gene family identifier.

Exporting data

The TRAPID platform allows the export of both the original data and the annotated and processed data of a user experiment. This data access is available under the Export data header on an experiment page and includes structural ORF information, transcript/ORF/protein sequences, taxonomic classification, gene/RNA family information, and functional GO/InterPro information.

The remainder of this section describes each type of export file, organized by category, complemented by minimal examples (first ten records).

Structural data

The structural data export file is a tab-delimited file providing the following information for each sequence of an experiment:

  • Transcript identifier: the transcript sequence identifier.
  • Frame information: the detected frame, strand, and full frame information (homology support) for the inferred ORF sequence of the transcript.
  • Frameshift information: flag putative frameshift and potential frameshift correction (0/1 boolean values).
  • ORF information: the start/stop coordinates of the inferred ORF sequence and the presence of start/stop codons.
  • Meta-annotation: the meta-annotation complemented by meta-annotation scoring information.
Users can choose to export any combination of the above information.

#transcript_id	detected_frame	detected_strand	full_frame_info	putative_frameshift	is_frame_corrected	orf_start	orf_stop	orf_contains_start_codon	orf_contains_stop_codon	meta_annotation	meta_annotation_score
contig15600	3	-	hit="Seita.5G133300"	0	0	0	1417	0	1	Quasi Full Length	std_dev=562;avg=1615;orf_length=1416;cutoff=491
contig14583	2	+	hit="LOC_Os12g34104"	0	0	799	945	1	1	No Information	null
contig14334	3	-	hit="Seita.9G362500"	0	0	0	2689	0	1	Quasi Full Length	std_dev=540;avg=2737;orf_length=2688;cutoff=1657
contig15854	1	-	hit="Seita.3G183900"	0	0	0	1517	0	1	Quasi Full Length	std_dev=762;avg=1912;orf_length=1518;cutoff=388
contig15185	3	-	hit="Sobic.004G069000"	0	0	0	2047	0	1	Quasi Full Length	std_dev=312;avg=2228;orf_length=2046;cutoff=1604
contig14563	1	-	hit="OB08G26510";alternative_frames="-2"	1	0	1365	1832	1	1	Full Length	std_dev=616;avg=1673;orf_length=468;cutoff=441
contig14653	1	-	hit="Seita.3G037300";alternative_frames="-2"	1	0	957	2204	1	1	Full Length	std_dev=373;avg=1181;orf_length=1248;cutoff=435
contig15055	1	-	hit="Seita.1G378200";alternative_frames="-3"	1	0	0	1847	0	1	Quasi Full Length	std_dev=321;avg=1837;orf_length=1848;cutoff=1195
contig15538	3	+	hit="Sobic.001G116500"	0	0	0	1693	0	1	Quasi Full Length	std_dev=455;avg=1572;orf_length=1692;cutoff=662
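
A structural data export such as the example above can be read with a few lines of Python. This is a minimal sketch that relies on the header line for column names (since users can export any combination of columns); the function names are hypothetical, and treating every score value as an integer is an assumption based on the example records.

```python
import csv


def parse_score(field):
    """Turn 'std_dev=562;avg=1615;...' into a dict of ints; 'null' or empty gives {}."""
    if field in ("", "null"):
        return {}
    pairs = (item.split("=", 1) for item in field.split(";"))
    return {key: int(value) for key, value in pairs}


def read_structural(lines):
    """Yield one dict per transcript from the lines of a structural data export."""
    # QUOTE_NONE keeps the double quotes inside full_frame_info untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(reader)]
    for row in reader:
        record = dict(zip(header, row))
        if "meta_annotation_score" in record:
            record["meta_annotation_score"] = parse_score(record["meta_annotation_score"])
        yield record
```

Passing an open file handle, e.g. `read_structural(open("structural_data.txt"))`, yields the records one by one (the file name is only an illustration).
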
Taxonomic classification

The taxonomic classification export file is a tab-delimited file that provides, for each transcript of an experiment, its associated taxonomic label (NCBI tax ID of the lowest common ancestor, set to 0 if a transcript was not classified). In case a transcript was classified, classification metrics (score, number of matching tax IDs, number of matching sequences) and full taxonomic lineage are also provided. The classification score corresponds to the length of the best MEM sequence found by Kaiju.

#counter	transcript_id	tax_id	score	n_match_tax	n_match_seqs	lineage
1	contig00001	206008	263	15	22	Panicum hallii; Panicum sect. Panicum; Panicum; Panicinae; Paniceae; Panicodae; Panicoideae; PACMAD clade; Poaceae; Poales; commelinids; Petrosaviidae; Liliopsida; Mesangiospermae; Magnoliophyta; Spermatophyta; Euphyllophyta; Tracheophyta; Embryophyta; Streptophytina; Streptophyta; Viridiplantae; Eukaryota; cellular organisms
2	contig00002	0
3	contig00003	206008	387	7	22	Panicum hallii; Panicum sect. Panicum; Panicum; Panicinae; Paniceae; Panicodae; Panicoideae; PACMAD clade; Poaceae; Poales; commelinids; Petrosaviidae; Liliopsida; Mesangiospermae; Magnoliophyta; Spermatophyta; Euphyllophyta; Tracheophyta; Embryophyta; Streptophytina; Streptophyta; Viridiplantae; Eukaryota; cellular organisms
4	contig00004	206008	673	7	11	Panicum hallii; Panicum sect. Panicum; Panicum; Panicinae; Paniceae; Panicodae; Panicoideae; PACMAD clade; Poaceae; Poales; commelinids; Petrosaviidae; Liliopsida; Mesangiospermae; Magnoliophyta; Spermatophyta; Euphyllophyta; Tracheophyta; Embryophyta; Streptophytina; Streptophyta; Viridiplantae; Eukaryota; cellular organisms
5	contig00005	0
6	contig00006	206008	107	7	22	Panicum hallii; Panicum sect. Panicum; Panicum; Panicinae; Paniceae; Panicodae; Panicoideae; PACMAD clade; Poaceae; Poales; commelinids; Petrosaviidae; Liliopsida; Mesangiospermae; Magnoliophyta; Spermatophyta; Euphyllophyta; Tracheophyta; Embryophyta; Streptophytina; Streptophyta; Viridiplantae; Eukaryota; cellular organisms
7	contig00007	2014292	11	8	11	bacterium (Candidatus Ratteibacteria) CG23_combo_of_CG06-09_8_20_14_all_48_7; unclassified Bacteria (miscellaneous); unclassified Bacteria; Bacteria; cellular organisms
8	contig00008	206008	74	15	44	Panicum hallii; Panicum sect. Panicum; Panicum; Panicinae; Paniceae; Panicodae; Panicoideae; PACMAD clade; Poaceae; Poales; commelinids; Petrosaviidae; Liliopsida; Mesangiospermae; Magnoliophyta; Spermatophyta; Euphyllophyta; Tracheophyta; Embryophyta; Streptophytina; Streptophyta; Viridiplantae; Eukaryota; cellular organisms
9	contig00009	206008	332	7	22	Panicum hallii; Panicum sect. Panicum; Panicum; Panicinae; Paniceae; Panicodae; Panicoideae; PACMAD clade; Poaceae; Poales; commelinids; Petrosaviidae; Liliopsida; Mesangiospermae; Magnoliophyta; Spermatophyta; Euphyllophyta; Tracheophyta; Embryophyta; Streptophytina; Streptophyta; Viridiplantae; Eukaryota; cellular organisms
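
The taxonomic classification export can be summarized in a similar way, for instance to spot potential contamination. The sketch below is illustrative only: the function names are hypothetical, the column order follows the example header above, and using the lineage entry just before "cellular organisms" as the top-level group is an assumption.

```python
from collections import Counter


def read_taxonomy(lines):
    """Map transcript_id -> (tax_id, lineage list) from a taxonomic classification export."""
    records = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        transcript_id = fields[1]
        tax_id = int(fields[2])
        # Unclassified transcripts (tax_id 0) carry no metrics and no lineage.
        has_lineage = len(fields) > 6 and fields[6]
        lineage = fields[6].split("; ") if has_lineage else []
        records[transcript_id] = (tax_id, lineage)
    return records


def count_top_level(records):
    """Count transcripts per top-level group (the entry just below the lineage root)."""
    counts = Counter()
    for tax_id, lineage in records.values():
        if tax_id == 0 or len(lineage) < 2:
            counts["unclassified"] += 1
        else:
            counts[lineage[-2]] += 1
    return counts
```
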
Gene family data

Three types of gene family data export files are available:

  1. Transcripts with GF: a tab-delimited file that contains the transcripts of an experiment and their associated gene family (if any).
  2. GF with transcripts: a tab-delimited file that contains, for each gene family of an experiment, the number and identifiers of transcripts assigned to the gene family (on a single line).
  3. GF reference data: a tab-delimited file that contains the reference data (GF name and members from the reference database) for each gene family of an experiment.

#counter	transcript_id	gf_id
1	contig15600	325_HOM04M000078
2	contig14583	325_HOM04M024306
3	contig14334	325_HOM04M002301
4	contig15854	325_HOM04M000265
5	contig15185	325_HOM04M007940
6	contig14563	325_HOM04M000016
7	contig14653	325_HOM04M000527
8	contig15055	325_HOM04M002035
9	contig15538	325_HOM04M001852
10	contig15718	325_HOM04M000525
#counter	gf_id	transcript_count	transcripts
1	325_HOM04M000288	1	contig21531
2	325_HOM04M000289	8	contig14024 contig13643 contig15820 contig14435 contig00084 contig24153 contig14741 contig25148
3	325_HOM04M000290	4	contig12096 contig12098 contig24795 contig21293
4	325_HOM04M000291	8	contig08387 contig24777 contig20960 contig19804 contig17222 contig22643 contig14830 contig19514
5	325_HOM04M000292	11	contig19081 contig18701 contig06441 contig06440 contig22975 contig06437 contig19799 contig10181 contig06438 contig24751 contig21978
6	325_HOM04M000293	1	contig18090
7	325_HOM04M000294	1	contig16262
8	325_HOM04M000295	2	contig17068 contig21683
9	325_HOM04M000297	2	contig20335 contig06599
10	325_HOM04M000299	11	contig05344 contig22288 contig05342 contig20433 contig05343 contig16646 contig17772 contig11019 contig23477 contig11020 contig15118
#counter	trapid_gf_id	reference_gf_id	gene_id
1	325_HOM04M002770	HOM04M002770	Pp3c2_32420
2	325_HOM04M002770	HOM04M002770	TAE37408G001
3	325_HOM04M002770	HOM04M002770	TAE37408G002
4	325_HOM04M002770	HOM04M002770	ATR0680G209
5	325_HOM04M002770	HOM04M002770	GSVIVG01020946001
6	325_HOM04M002770	HOM04M002770	PH01000860G0510
7	325_HOM04M002770	HOM04M002770	TAE27601G003
8	325_HOM04M002770	HOM04M002770	PAB00003424
9	325_HOM04M002770	HOM04M002770	Solyc03g025590.2
10	325_HOM04M002770	HOM04M002770	PAB00037282
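
The "GF with transcripts" layout shown above (second example) stores all transcripts of a family on one line, separated by spaces. A minimal reading sketch, with a hypothetical function name and the four-column order taken from the example header:

```python
def read_gf_with_transcripts(lines):
    """Map gf_id -> list of transcript identifiers from a 'GF with transcripts' export."""
    families = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        _counter, gf_id, count, transcripts = line.split("\t")
        members = transcripts.split(" ")
        # The transcript_count column lets us check that the line was read completely.
        if len(members) != int(count):
            raise ValueError(f"{gf_id}: expected {count} transcripts, found {len(members)}")
        families[gf_id] = members
    return families
```

Inverting the returned dictionary gives the same transcript-to-family mapping as the "Transcripts with GF" file.
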
RNA family data

The export files for RNA family data are identical to the gene family data export files (but containing RNA family information). However, no export file for reference information of RNA families is available, as this data is not stored anywhere within TRAPID. Please visit the Rfam website to retrieve this information.


Sequence export files are FASTA files for a chosen type of sequence and a selection of transcript sequences from an experiment. Exported sequences can either be the uploaded transcript sequences, the inferred ORF sequences, or amino acid (translated ORF) sequences. It is possible to export sequences for all the transcripts of an experiment (default) or for any defined transcript subset.
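
Exported FASTA files can be read with any standard parser. For readers without one at hand, here is a minimal sketch using only the Python standard library; the function name is hypothetical, and taking the first word of the header line as the identifier is an assumption about how the headers are written.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    identifier, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)
```

For example, `{name: len(seq) for name, seq in read_fasta(open("orfs.fasta"))}` gives the length of every exported sequence (the file name is only an illustration).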

Functional data

For each type of available functional annotation data (GO terms, protein domains, KO terms), two types of export files are available:

  1. Transcripts with functional annotation: a tab-delimited file that contains the transcripts of an experiment and their associated functional annotation labels (identifiers and descriptions).
  2. Functional annotation metadata: a tab-delimited file that contains, for each functional annotation label (identifier and description), the number and identifiers of associated transcripts (on a single line).

Note: the export of GO functional information has extra columns. The is_hidden column indicates whether a GO term is flagged as hidden, due to the presence of more informative GO codes in the GO graph for the given transcript, while the evidence_code column (value set to ISS) indicates that the GO annotation was assigned to the transcript via sequence similarity search.

#counter	transcript_id	go	evidence_code	is_hidden	description
1	contig00423	GO:0003676	ISS	1	nucleic acid binding
2	contig00423	GO:0003677	ISS	0	DNA binding
3	contig00423	GO:1901363	ISS	1	heterocyclic compound binding
4	contig00423	GO:0097159	ISS	1	organic cyclic compound binding
5	contig00423	GO:0005488	ISS	1	binding
6	contig01755	GO:0003676	ISS	1	nucleic acid binding
7	contig01755	GO:0003677	ISS	0	DNA binding
8	contig01755	GO:0005515	ISS	1	protein binding
9	contig01755	GO:1901363	ISS	1	heterocyclic compound binding
10	contig01755	GO:0097159	ISS	1	organic cyclic compound binding
#counter	interpro	description	num_transcripts	transcripts
1	IPR000007	Tubby, C-terminal	17	contig11469 contig05821 contig20328 contig20141 contig15849 contig06374 contig19374 contig19204 contig06372 contig16019 contig06373 contig11470 contig09177 contig20969 contig18970 contig11265 contig18737
2	IPR000008	C2 domain	97	contig08907 contig08909 contig02775 contig07858 contig02773 contig15163 contig02772 contig14957 contig02776 contig22276 contig12203 contig14003 contig05143 contig19878 contig12201 contig09834 contig09835 contig25077 contig04198 contig13022 contig10181 contig04197 contig20905 contig04196 contig16070 contig13823 contig19615 contig14145 contig24751 contig21722 contig18032 contig05835 contig13882 contig14463 contig09510 contig23293 contig09511 contig14358 contig20260 contig14672 contig19081 contig15275 contig19388 contig07086 contig02770 contig02771 contig20071 contig15790 contig04145 contig23085 contig06441 contig04143 contig06440 contig04141 contig11319 contig21842 contig09810 contig09811 contig14049 contig06437 contig06438 contig20391 contig16132 contig18588 contig15968 contig14595 contig14388 contig13965 contig05170 contig13867 contig18699 contig22624 contig23062 contig16332 contig11464 contig11463 contig23722 contig15859 contig23826 contig05678 contig05679 contig18701 contig22117 contig24707 contig05169 contig16583 contig11320 contig23734 contig11321 contig19799 contig21978 contig09893 contig09366 contig04200 contig09368 contig22975 contig14016
3	IPR000009	Protein phosphatase 2A regulatory subunit PR55	7	contig07662 contig03651 contig03650 contig03649 contig07664 contig03647 contig03648
4	IPR000010	Cystatin domain	4	contig21270 contig24646 contig22490 contig21355
5	IPR000011	Ubiquitin/SUMO-activating enzyme E1	5	contig15475 contig15605 contig21279 contig16325 contig14898
6	IPR000014	PAS domain	28	contig24777 contig15287 contig01658 contig15044 contig01654 contig01655 contig16104 contig02922 contig01652 contig01651 contig06672 contig14270 contig06674 contig06675 contig05325 contig05326 contig12570 contig12571 contig12572 contig09082 contig15078 contig05322 contig05323 contig01649 contig15218 contig22643 contig14523 contig13969
7	IPR000023	Phosphofructokinase domain	18	contig15688 contig14681 contig16895 contig15760 contig25248 contig17384 contig15339 contig20162 contig15048 contig10452 contig16255 contig10453 contig10454 contig15002 contig09953 contig09952 contig16755 contig09951
8	IPR000031	PurE domain	1	contig16781
9	IPR000033	LDLR class B repeat	3	contig03829 contig03830 contig03831
10	IPR000039	Ribosomal protein L18e	1	contig19134
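
When working with the GO export shown above, a common first step is to drop the terms flagged as hidden, keeping only the most informative GO terms per transcript. A minimal sketch, with a hypothetical function name and the six-column order taken from the example header:

```python
from collections import defaultdict


def visible_go_terms(lines):
    """Map transcript_id -> list of GO identifiers whose is_hidden flag is 0."""
    terms = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        _counter, transcript_id, go_id, _evidence, is_hidden, _description = line.split("\t")
        if is_hidden == "0":
            terms[transcript_id].append(go_id)
    return dict(terms)
```

Applied to the first five example records, only GO:0003677 (DNA binding) remains for contig00423.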

Subset export files enable the retrieval of the list of sequences that are part of a given transcript subset. They simply consist of a list of sequence identifiers (one identifier per line).
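
Since a subset export is just one identifier per line, two exported subsets can be compared with plain set operations. The sketch below uses hypothetical function names and is not TRAPID code.

```python
def read_subset(lines):
    """Return the set of sequence identifiers listed one per line."""
    return {line.strip() for line in lines if line.strip()}


def venn_counts(subset_a, subset_b):
    """Counts for a two-set Venn diagram: only in A, shared, only in B."""
    return (
        len(subset_a - subset_b),
        len(subset_a & subset_b),
        len(subset_b - subset_a),
    )
```

Because a transcript can carry multiple labels, the shared count is often non-zero.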


The toolbox

On most pages (experiment/transcript/gene family/GO/protein domain) a toolbox is available which contains the most common analyses to be performed on the given data object.

Frameshift correction

This feature is currently disabled.

For transcripts that were flagged as potentially containing frameshifts, the user can execute FrameDP to putatively correct the transcript sequence and identify the correct ORF. FrameDP is a program which uses BLAST together with machine learning methods to build models which are used to test whether a sequence has a putative frameshift or not. The model is then used to correct the sequence (by inserting N-nucleotides at the necessary locations), which also has a direct impact on the associated Open Reading Frame (ORF). The drawback, however, is the exceptionally long processing time. As such, the correction of frameshifts can only be done per gene family, and not on an entire transcript experiment.

The putative frameshifts are first detected during the "initial processing" phase, using a simple algorithm. On a gene family page, the user can select these transcripts for FrameDP processing. If the total number of selected transcripts is lower than 20, additional random transcripts are added in order to have a good background model. Subsequently, all sequences are used for training and 'correction'.
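
The padding rule described above can be sketched as follows. This is an illustration under stated assumptions, not TRAPID's implementation: the function name is hypothetical, and drawing the extra transcripts from a caller-supplied pool is an assumption, since the text does not say where the random transcripts come from.

```python
import random

MIN_TRAINING_SIZE = 20  # threshold quoted in the text


def pad_selection(selected, pool, rng=None):
    """Return the selected transcripts, padded with random ones up to 20 if needed."""
    rng = rng or random.Random()
    selected = list(selected)
    missing = MIN_TRAINING_SIZE - len(selected)
    if missing <= 0:
        return selected
    already_chosen = set(selected)
    candidates = [t for t in pool if t not in already_chosen]
    # Never ask for more extra transcripts than the pool can provide.
    extra = rng.sample(candidates, min(missing, len(candidates)))
    return selected + extra
```

Passing a seeded generator such as `random.Random(0)` makes the padding reproducible.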

Step-by-step instructions on how to correct frameshifts using FrameDP can be found in the tutorial.

Functional enrichment analysis

Apart from the functional annotation of individual transcripts, TRAPID also supports the quantitative analysis of experiment subsets using GO and protein domain enrichment statistics. Through the association of specific labels to (sub-)sets of sequences, transcripts can be annotated with specific sample information (e.g. tissue, developmental stage, control or treatment condition) and be used to perform within-transcriptome functional analysis.

Specific comparative analyses that can be performed using subsets are:

  • GO enrichment (subset versus all; hypergeometric distribution): bar chart, table and enrichment GO graph output
  • GO ratios (table) calculates GO frequencies between subsets. Also includes subset-specific GO annotations
  • Protein domain enrichment (subset versus all; hypergeometric distribution)
  • Protein domain ratios between subsets
  • Different subsets - Venn diagrams
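
To make the "subset versus all; hypergeometric distribution" entries above concrete, the sketch below computes an enrichment p-value for one term as the upper tail of the hypergeometric distribution. The function and parameter names are assumptions for this example; any multiple-testing correction or cutoff that TRAPID applies is not covered.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: transcripts in the experiment, K: of those annotated with the term,
    n: transcripts in the subset,     k: of those annotated with the term.
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total
```

For a toy experiment of 10 transcripts of which 4 carry the term, a subset of 3 transcripts with 2 annotated ones gives (6 x 6 + 4 x 1) / 120 = 1/3, i.e. no evidence of enrichment.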

Figure 2: GO enrichment.
Figure 3: GO graph.
Figure 4: GO ratios.

Multiple sequence alignment


Figure 5: example multiple sequence alignment. Panicum data set, transcript contig16311 in family HOM000957 covering 117 genes from 25 species.

Starting from a selected transcript, the user can create an amino acid multiple sequence alignment (MSA) within a gene family context. As such, the user can create an MSA containing the transcripts within a gene family together with a selection of coding sequences from the reference database. This tool is accessible from the toolbox on a gene family page. The MSA is created using MUSCLE, a tool which delivers a good balance between speed and accuracy (Edgar 2004, Nucleic Acids Res. 32(5): 1792-1797). In order to reduce the computation time, the maximum number of iterations in the MUSCLE algorithm is fixed at three. All other settings are left at default.

After the MSA has been created, the user has the ability to view this alignment using JalView, or to download the MSA and investigate it using a different tool.

Phylogenetic trees

Figure 6: example FastTree phylogenetic tree (Panicum data set, transcript contig16311 in family HOM000957 covering 117 genes from 25 species, relaxed editing). The query transcript is shown in grey while homologs from the reference proteomes are shown in colors based on their taxonomic information. Meta-annotations are displayed as colored boxes next to the gene identifiers. Only a part of the complete tree is depicted in the image.

As with the multiple sequence alignment (see previous section), the user can also create a phylogenetic tree within a gene family context. The tree building algorithm needs an MSA as input. If no MSA is present yet for the indicated gene family, the TRAPID system creates one automatically.

In order to create phylogenetic trees which are less dependent on putative large gaps, the standard MSA is transformed to a stripped MSA. In this stripped MSA the alignment length is reduced by removing all positions (for every gene/transcript sequence) for which a certain fraction (0.10 for stringent editing, 0.25 for relaxed editing) is a gap. As such, all gaps introduced by a small number of sequences will be removed.
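
The column stripping described above can be sketched in a few lines. This is an illustrative reading of the rule, not TRAPID's code: the function name is hypothetical, only "-" is treated as a gap, and keeping a column only when its gap fraction is strictly below the threshold is an assumption about the boundary case.

```python
STRINGENT = 0.10  # gap fraction thresholds quoted in the text
RELAXED = 0.25


def strip_alignment(alignment, max_gap_fraction):
    """Remove alignment columns whose gap fraction reaches the threshold.

    alignment maps sequence names to aligned sequences of equal length.
    """
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in sequences) / len(sequences) < max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}
```

With this reading, the stringent threshold removes more columns than the relaxed one, which is why relaxed editing yields more alignment positions.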

In case the stringent editing yields a stripped MSA with zero or only a few conserved alignment positions, please re-run the tree analysis using the relaxed editing option, which will yield more conserved alignment positions.

In the TRAPID system we offer two different tree inference algorithms: FastTree and PhyML, with the first one being the default due to its very fast processing speed, coupled with equal or better fidelity (Price et al., 2010 PLoS One Mar 10;5(3):e9490). The user can choose which algorithm to use and, if desired, how many bootstrap runs will be applied. For FastTree we used the following non-default settings: '--wag gamma', which indicate that the algorithm uses the WAG+CAT model, and rescales the branch lengths. For PhyML we used the following non-default settings: '-m WAG -f e -c 4 -a e', which indicate that the algorithm uses the WAG model, that empirical amino acid frequencies are used, that 4 relative substitution rate categories are used, and that the parameter for the gamma distribution shape is based on the maximum likelihood estimate.

Finally, if the user has defined subsets within his/her experiment, these subsets can also be displayed on the phylogenetic tree, making subsequent analyses much easier. By default, the meta-annotation is displayed as domains next to the phylogenetic tree (see example below).

The default way to create a phylogenetic tree is:

  1. Log in to the TRAPID platform, and select the desired experiment
  2. Select the transcript, either through the search function, or through any link in the platform
  3. On the transcript page, select the associated gene family
  4. On the gene family page, select Create phylogenetic tree from the toolbox
  5. Select the reference species from the reference database you want to include in the tree

Step-by-step instructions on how to construct phylogenetic trees can be found in the tutorial.


A key challenge in comparative genomics is reliably grouping homologous genes (derived from a common ancestor) and orthologous genes (homologs separated by a speciation event) into gene families. Orthology is generally considered a good proxy to identify genes performing a similar function in different species. Consequently, orthologs are frequently used as a means to transfer functional information from well-studied model systems to non-model organisms, for which e.g. only RNA-Seq-based gene catalogs are available. In eukaryotes, the utilization of orthology is not trivial, due to a wealth of paralogs (homologous genes created through a duplication event) in almost all lineages. Ancient duplication events preceding speciation led to outparalogs, which are frequently considered as subtypes within large gene families. In contrast to these are inparalogs, genes that originated through duplication events occurring after a speciation event. Besides continuous duplication events (for instance, via tandem duplication), many paralogs are remnants of different whole genome duplications (WGDs), resulting in the establishment of one-to-many and many-to-many orthologs (or co-orthologs).

Within TRAPID, the phylogenetic trees provide the most detailed approach to identify orthology relationships. For a given transcript, inspecting the phylogenetic tree can reveal whether orthologs exist in related species and whether this relationship is a one-to-one, one-to-many or many-to-many orthology. Apart from the trees, the Browse similarity search output tool, available from a transcript page, also offers an overview of homologous genes in related species. Below, we show two examples demonstrating how orthologous groups and simple/complex orthology relationships can be derived from a phylogenetic tree generated using TRAPID.

Figure 7: example one-to-one orthology (Panicum data set, transcript contig14762 RecQ helicase). The node indicated in red shows the monocot homologs for this clade (1 indicates 100% bootstrap support). Within this sub-tree, from each included monocot species a single gene is present, revealing simple one-to-one orthology relationships.

Figure 8: example one-to-many orthology (Panicum data set, transcript contig00984 ATPase). The node indicated in red shows the monocot homologs for this clade (1 indicates 100% bootstrap support). Within this sub-tree, 2 genes from Zea mays are present, revealing that for the single Panicum transcript two co-orthologs exist in Z. mays.
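
Once the gene copies of two species have been counted within a well-supported sub-tree, as in the two figures above, naming the relationship is mechanical. The helper below is a hypothetical illustration of that last step only; identifying the sub-tree and judging its support still requires inspecting the tree.

```python
def orthology_type(query_copies, reference_copies):
    """Name the orthology relationship from gene counts within one sub-tree."""
    if query_copies < 1 or reference_copies < 1:
        raise ValueError("both species need at least one gene in the sub-tree")
    if query_copies == 1 and reference_copies == 1:
        return "one-to-one"
    if query_copies == 1 or reference_copies == 1:
        return "one-to-many"
    return "many-to-many"
```

For the ATPase example, one Panicum transcript and two Z. mays genes give `orthology_type(1, 2)`, i.e. "one-to-many".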