About

  1. The MORPH algorithm
  2. Data used
  3. Database structure
  4. Adding a new species
  5. Acknowledgements

The MORPH algorithm

The MORPH algorithm is described in Tzfadia et al. (2012). MORPH (MOdule-guided Ranking of PatHway genes) ranks candidate genes for a biological process based on gene expression data and a set of bait genes. In contrast to other guilt-by-association methods, MORPH uses clusterings to partition the data into modules before calculating co-expression measures. MORPH then uses a machine learning approach known as data selection to find the dataset-clustering combination in which the bait genes are most strongly co-expressed. This selection is based on the area under the self-rank curve (AUSR). Together, these features make MORPH a powerful tool for unravelling gene functions. A computationally efficient version of the MORPH algorithm is available online for use with your own genes of interest.
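To make the selection step more concrete, the following is a minimal Python sketch of a leave-one-out self-rank and the area under the self-rank curve. It is not the MORPH implementation. It assumes the mean Pearson correlation with the bait genes as the co-expression score, skips the clustering into modules entirely, and uses hypothetical function names (`rank_candidates`, `self_ranks`, `ausr`, `select_dataset`) and made-up expression values. The exact self-rank definition and rank cutoff are given in Tzfadia et al. (2012) and may differ from this simplification.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_candidates(expr, baits):
    """Rank all non-bait genes by mean correlation with the baits, best first."""
    scores = {
        gene: sum(pearson(profile, expr[b]) for b in baits) / len(baits)
        for gene, profile in expr.items()
        if gene not in baits
    }
    return sorted(scores, key=scores.get, reverse=True)


def self_ranks(expr, baits):
    """Hold out each bait in turn and record its rank among the candidates."""
    ranks = []
    for held_out in baits:
        rest = [b for b in baits if b != held_out]
        ranks.append(rank_candidates(expr, rest).index(held_out) + 1)
    return ranks


def ausr(ranks, max_rank):
    """Mean over k = 1..max_rank of the fraction of baits with self-rank <= k."""
    fractions = (
        sum(r <= k for r in ranks) / len(ranks) for k in range(1, max_rank + 1)
    )
    return sum(fractions) / max_rank


def select_dataset(datasets, baits, max_rank):
    """Data selection (simplified): name of the data set with the highest AUSR."""
    return max(
        datasets,
        key=lambda name: ausr(self_ranks(datasets[name], baits), max_rank),
    )


# Toy data (made up): the baits b1-b3 are co-expressed in "coherent" only.
baits = ["b1", "b2", "b3"]
datasets = {
    "coherent": {
        "b1": [1, 2, 3, 4],
        "b2": [2, 4, 6, 8.5],
        "b3": [1, 3, 5, 7],
        "x1": [4, 3, 2, 1],
        "x2": [1, 5, 2, 4],
    },
    "noisy": {
        "b1": [1, 2, 3, 4],
        "b2": [4, 3, 2, 1],
        "b3": [1, 4, 1, 4],
        "x1": [1, 2, 3, 4.5],
        "x2": [2, 1, 4, 3],
    },
}
best = select_dataset(datasets, baits, max_rank=3)
print(best)
```

In this toy example every held-out bait ranks first in the "coherent" data set, so its AUSR is maximal and it is the one selected. In MORPH itself the selection runs over dataset-clustering combinations rather than data sets alone.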

Data used

The MorphDB database was last updated April 2017.

Species included

The species currently included are depicted in the cladogram below (branch lengths are meaningless). If you would like to submit data for a new species, please refer to "Adding a new species" below.

Data sets used by MORPH

This section summarizes the data used by the MORPH prediction algorithm for each of the included species. For more information, please contact us.

Note that we have used only a relatively limited collection of data sets. As more and more data sets become available, we advise using the MORPH bulk tool with your own data set of interest instead of relying on MorphDB, which was built as a proof of concept and as a user-friendly tool for exploring the MORPH bulk predictions described in Zwaenepoel et al. (2018). To apply MORPH bulk to your own data sets, see https://github.com/arzwa/morph-bulk/wiki.

The MorphDB database is RDF-based, and tools for generating the RDF graph are included in the MORPH bulk distribution. In other words, you can easily generate a MorphDB instance yourself!

Arabidopsis thaliana

Clusterings: Co-expression (CLICK), protein-protein interaction (Matisse), metabolic network (Matisse) and enzyme based.

Dataset   # Conditions   # Genes
ds1Data.txt 160 12459
Seed_GH_DataSet.txt 42 22225
SeedlingsDataSet.txt 64 12459
SeedsDataSet.txt 51 12459
TissuesData.txt 96 12564

Medicago truncatula

Clusterings: Co-expression (CLICK) and metabolic network (Matisse) based. Note: all data except the JAMER, MAERF and JAMER_MAERF data sets were retrieved from the Noble Foundation.

Dataset   # Conditions   # Genes
Balzergue.expression_matrix 41 20461
Benedito.expression_matrix 20 19363
Breakspear.expression_matrix 10 15137
Carvalho.expression_matrix 11 17446
Czaja.expression_matrix 20 16082
Niebel.expression_matrix 18 17388
noble_all.expression_matrix 151 20914
Ruffel.expression_matrix 13 18294
Zhang.expression_matrix 18 17902
JAMER.expression_matrix 21 12790
JAMER_MAERF.expression_matrix 32 13962
MAERF.expression_matrix 11 12637

Solanum lycopersicum

Clusterings: Co-expression (CLICK), protein-protein interaction (Matisse), metabolic network (Matisse), orthology (MCL) and enzyme based.

Dataset   # Conditions   # Genes
FruitDataSet.txt 32 9217
RootAndLeafDataSet.txt 21 9217

Solanum tuberosum

Clusterings: Co-expression (CLICK), protein-protein interaction (Matisse) and metabolic network (Matisse).

Dataset   # Conditions   # Genes
All_TissuesDataSet_ITAG.txt 326 9012
Leafs_ITAG.txt 242 9012
Root_ITAG.txt 24 9012
Tuber_ITAG.txt 60 9012

Oryza sativa

Clusterings: Co-expression (CLICK), protein-protein interaction (Matisse) and enzyme based.

Dataset   # Conditions   # Genes
E-GEOD-14275.expression_matrix 6 11019
E-GEOD-25073.expression_matrix 6 15930
E-GEOD-31077.expression_matrix 16 5137
E-GEOD-35984.expression_matrix 10 10116
E-GEOD-39298.expression_matrix 6 12118
E-GEOD-5167.expression_matrix 0 11852
E-GEOD-8216.expression_matrix 6 15964
E-MEXP-2267.expression_matrix 36 8025
RiceGenomeDataSet.expression_matrix 16 25744

Populus trichocarpa

Clusterings: Co-expression (CLICK) based.

Dataset   # Conditions   # Genes
all 24 32777
fiber 3 21720
leaf 3 27087
phloem 3 25795
root 3 25780
shoot 3 26404
three_cell_type 3 25326
vessel 3 22826
xylem 3 23888

Catharanthus roseus

Clusterings: Co-expression (CLICK), orthology and enzyme based.

Dataset   # Conditions   # Genes
caros_all.expression_matrix 30 25329
caros_hairy_roots.expression_matrix 7 15105
caros_organs.expression_matrix 8 19933
caros_smartcell.expression_matrix 7 18678
caros_suspension_culture.expression_matrix 8 15121

Zostera marina

Clusterings: Co-expression (CLICK) based.

Dataset   # Conditions   # Genes
all.count_table.txt 23 16674
all.fpkm_table.txt 23 15679
female_flower.count_table.txt 6 13077
female_flower.fpkm_table.txt 6 10587
male_flower.count_table.txt 7 14260
male_flower.fpkm_table.txt 7 11809
root.count_table.txt 4 13380
root.fpkm_table.txt 4 10708
vegetative.count_table.txt 6 13946
vegetative.fpkm_table.txt 6 11416

Database structure

MorphDB is an RDF-based graph database stored in an Apache Jena triple store (TDB). The database is currently not served directly; instead, queries are performed with SPARQL on the server side. The triple store is highly scalable and can be served directly if interest grows.

Predicates

Currently, the following predicates are included in MorphDB.

Predicate
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://morph.org/has_score
http://morph.org/is_candidate_of
http://morph.org/member_of
http://morph.org/species
http://morph.org/has_member
http://morph.org/has_species_member
http://morph.org/is_missing_bait_of
http://morph.org/has_ausr
http://morph.org/has_bait_in_dataset
http://morph.org/has_candidate
http://morph.org/no_genes_in_dataset
http://morph.org/no_genes_missing
http://morph.org/has_bait_missing
http://morph.org/rank
http://morph.org/score_for_gene_set
http://morph.org/score_for_sp_gene_set
http://morph.org/score_value
http://morph.org/is_bait_of
http://morph.org/gene_set_type
http://morph.org/has_species

Objects (subjects)

The main objects (subjects) in the RDF graph are genes, gene sets (GO/MapMan terms), gene families and the scores of a gene for a particular gene set. Because a gene can have multiple scores (i.e. for different pathways and GO terms), there is no unique <gene> has_score <score> triple in the triple store. Gene scores are therefore reified: the predicate has_score links a gene to a score object, and the actual score for the pathway of interest is obtained with the predicates score_for_sp_gene_set and score_value, which take the score object as their subject. For a similar reason, GO/MapMan terms (gene sets) exist at two levels, because a GO term has different AUSR values in different species.

Object (subject)            Example URI
Gene                        http://morph.org/gene#MT0001S0570
Gene set (general level)    http://morph.org/gene_set#GO_0010105
Gene set (species level)    http://morph.org/gene_set_sp#sly_GO_0010105
Gene family                 http://morph.org/gene_family#HOM0000018
Score                       http://morph.org/score#MT0001S0570_GO_0010224
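
The reification described above can be illustrated with a small, self-contained Python sketch that walks from a gene to its score for one species-level gene set. This is an illustration only: the triples are held in a plain list rather than a real triple store, the score values are made up, and the `mtr_` species prefix on the gene set URIs and the second score URI are hypothetical, modelled on the examples in the table. Only the predicate names and the first gene and score URIs come from this page.

```python
MORPH = "http://morph.org/"

gene = MORPH + "gene#MT0001S0570"
score_a = MORPH + "score#MT0001S0570_GO_0010224"
score_b = MORPH + "score#MT0001S0570_GO_0010105"    # hypothetical second score
set_a = MORPH + "gene_set_sp#mtr_GO_0010224"        # hypothetical species prefix
set_b = MORPH + "gene_set_sp#mtr_GO_0010105"        # hypothetical species prefix

# One gene, two reified score objects, each tied to its own gene set.
triples = [
    (gene, MORPH + "has_score", score_a),
    (score_a, MORPH + "score_for_sp_gene_set", set_a),
    (score_a, MORPH + "score_value", 1.5),          # made-up value
    (gene, MORPH + "has_score", score_b),
    (score_b, MORPH + "score_for_sp_gene_set", set_b),
    (score_b, MORPH + "score_value", -0.3),         # made-up value
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def score_for(triples, gene, gene_set_sp):
    """Follow gene -> score object -> (gene set, value); None if absent."""
    for score_node in objects(triples, gene, MORPH + "has_score"):
        targets = objects(triples, score_node, MORPH + "score_for_sp_gene_set")
        if gene_set_sp in targets:
            values = objects(triples, score_node, MORPH + "score_value")
            if values:
                return values[0]
    return None


print(score_for(triples, gene, set_a))
print(score_for(triples, gene, set_b))
```

The intermediate score object is what lets one gene carry a separate value per gene set. A direct gene-to-value triple could not record which gene set the value belongs to.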

Adding a new species

If you want to run MORPH bulk on your own data and species of interest, we refer you to the MORPH bulk repository at https://github.com/arzwa/morph_bulk. The wiki page of that repository provides documentation and a tutorial on how to run MORPH bulk for your case of interest. It also includes tools for building a MorphDB RDF graph, which can be used to generate a MorphDB instance.

Acknowledgements

MorphDB was developed by Arthur Zwaenepoel (2017).

If you use MorphDB or MORPH bulk, please cite:

Zwaenepoel, A., Diels, T., Amar, D., Van Parys, T., Shamir, R., Van de Peer, Y., & Tzfadia, O. (2018).
MorphDB: Prioritizing Genes for Specialized Metabolism Pathways and Gene Ontology Categories in Plants.
Frontiers in Plant Science, 9(March), 1–13. https://doi.org/10.3389/fpls.2018.00352