About

The MORPH algorithm
Data used
Database structure
Adding a new species
Acknowledgements

The MORPH algorithm

The MORPH algorithm is described in Tzfadia et al. (2012). MORPH (MOdule-guided Ranking of PatHway genes) ranks candidate genes for biological processes based on gene expression data and a set of bait genes. In contrast to other guilt-by-association methods, MORPH uses clusterings to partition the data into modules before calculating co-expression measures. MORPH uses a machine learning approach known as data selection to find the dataset-clustering combination in which the bait genes have the highest co-expression. This selection is based on the area under the self-rank curve (AUSR). These features make MORPH a powerful tool for unraveling gene functions. A computationally efficient version of the MORPH algorithm can be used online with your own genes of interest here.
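To make the selection step more concrete, the sketch below computes a normalized area under the self-rank curve and picks the dataset-clustering combination with the highest value. It is a simplified illustration and not the MORPH implementation: it assumes that the self-rank of each bait gene (its rank when left out of the bait set and scored against the remaining baits) has already been computed, that ranks are 0-based, and that the curve is cut off at a hypothetical max_rank of 1000. The function names and the example values are made up for illustration; see Tzfadia et al. (2012) for the exact definition.

```python
def ausr(self_ranks, max_rank=1000):
    """Normalized area under the self-rank curve for one bait set.

    self_ranks: 0-based rank each bait gene obtains when it is left out of
    the bait set and ranked among the candidates (0 = top of the list).
    max_rank: assumed cutoff on the rank axis (thresholds 0..max_rank-1).
    """
    if not self_ranks:
        raise ValueError("at least one bait self-rank is required")
    # The curve at threshold k is the fraction of baits with self-rank <= k.
    # A bait with self-rank r is counted at thresholds r..max_rank-1, so it
    # contributes max(0, max_rank - r) to the summed curve.
    covered = sum(max(0, max_rank - r) for r in self_ranks)
    return covered / (len(self_ranks) * max_rank)


def select_best(self_ranks_by_combination, max_rank=1000):
    """Return the (dataset, clustering) key with the highest AUSR."""
    return max(
        self_ranks_by_combination,
        key=lambda combo: ausr(self_ranks_by_combination[combo], max_rank),
    )


# Made-up self-ranks for two hypothetical dataset/clustering combinations.
example = {
    ("dataset_A", "CLICK"): [0, 10],
    ("dataset_B", "CLICK"): [500, 900],
}
best = select_best(example)
print(best, ausr(example[best]))
```

An AUSR close to 1 means the left-out baits are consistently ranked near the top, so the combination with the highest AUSR is the one in which the bait genes are most strongly co-expressed.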

Data used

The MorphDB database was last updated April 2017.

Species included

Species currently included are depicted in the cladogram below (branch lengths are not meaningful). If you would like to submit data for a new species, please refer to Adding a new species below.

Data sets used by MORPH

The tables below summarize the data used by the MORPH prediction algorithm for each included species. For more information, contact us here.

Note that we have used only a relatively limited number of data sets. As more data sets become available, we advise using the MORPH bulk tool with your own data set of interest instead of relying on MorphDB, which was built as a proof of concept and a user-friendly tool for exploring the MORPH bulk predictions described in Zwaenepoel et al. (2018). To apply MORPH bulk to your own data sets, see https://github.com/arzwa/morph-bulk/wiki.

The MorphDB database is RDF based, and tools for generating the RDF graph are included in the MORPH bulk distribution. In other words, you can easily generate a MorphDB instance yourself.

Arabidopsis thaliana

Clusterings: Co-expression (CLICK), protein-protein interaction (Matisse), metabolic network (Matisse) and Enzyme based.

Dataset	# Conditions	# Genes
ds1Data.txt	160	12459
Seed_GH_DataSet.txt	42	22225
SeedlingsDataSet.txt	64	12459
SeedsDataSet.txt	51	12459
TissuesData.txt	96	12564

Medicago truncatula

Clusterings: Co-expression (CLICK) and metabolic network (Matisse) based. Note: all data except the JAMER, MAERF and JAMER_MAERF data sets were retrieved from the Noble Foundation.

Dataset	# Conditions	# Genes
Balzergue.expression_matrix	41	20461
Benedito.expression_matrix	20	19363
Breakspear.expression_matrix	10	15137
Carvalho.expression_matrix	11	17446
Czaja.expression_matrix	20	16082
Niebel.expression_matrix	18	17388
noble_all.expression_matrix	151	20914
Ruffel.expression_matrix	13	18294
Zhang.expression_matrix	18	17902
JAMER.expression_matrix	21	12790
JAMER_MAERF.expression_matrix	32	13962
MAERF.expression_matrix	11	12637

Solanum lycopersicum

Clusterings: Co-expression (CLICK), protein-protein interaction (Matisse), metabolic network (Matisse), orthology (MCL) and enzyme based.

Dataset	# Conditions	# Genes
FruitDataSet.txt	32	9217
RootAndLeafDataSet.txt	21	9217

Solanum tuberosum

Clusterings: Co-expression (CLICK), protein-protein interaction (Matisse) and metabolic network (Matisse).

Dataset	# Conditions	# Genes
All_TissuesDataSet_ITAG.txt	326	9012
Leafs_ITAG.txt	242	9012
Root_ITAG.txt	24	9012
Tuber_ITAG.txt	60	9012

Oryza sativa

Clusterings: Co-expression (CLICK), protein-protein interaction (Matisse) and enzyme based.

Dataset	# Conditions	# Genes
E-GEOD-14275.expression_matrix	6	11019
E-GEOD-25073.expression_matrix	6	15930
E-GEOD-31077.expression_matrix	16	5137
E-GEOD-35984.expression_matrix	10	10116
E-GEOD-39298.expression_matrix	6	12118
E-GEOD-5167.expression_matrix	0	11852
E-GEOD-8216.expression_matrix	6	15964
E-MEXP-2267.expression_matrix	36	8025
RiceGenomeDataSet.expression_matrix	16	25744

Populus trichocarpa

Clusterings: Co-expression (CLICK) based.

Dataset	# Conditions	# Genes
all	24	32777
fiber	3	21720
leaf	3	27087
phloem	3	25795
root	3	25780
shoot	3	26404
three_cell_type	3	25326
vessel	3	22826
xylem	3	23888

Catharanthus roseus

Clusterings: Co-expression (CLICK), orthology and enzyme based.

Dataset	# Conditions	# Genes
caros_all.expression_matrix	30	25329
caros_hairy_roots.expression_matrix	7	15105
caros_organs.expression_matrix	8	19933
caros_smartcell.expression_matrix	7	18678
caros_suspension_culture.expression_matrix	8	15121

Zostera marina

Clusterings: Co-expression (CLICK) based.

Dataset	# Conditions	# Genes
all.count_table.txt	23	16674
all.fpkm_table.txt	23	15679
female_flower.count_table.txt	6	13077
female_flower.fpkm_table.txt	6	10587
male_flower.count_table.txt	7	14260
male_flower.fpkm_table.txt	7	11809
root.count_table.txt	4	13380
root.fpkm_table.txt	4	10708
vegetative.count_table.txt	6	13946
vegetative.fpkm_table.txt	6	11416

Database structure

MorphDB is an RDF-based graph database stored in an Apache Jena triple store (TDB). Currently the database is not served directly; queries are performed with SPARQL on the server side. The triple store is highly scalable and could be served directly if interest grows.

Predicates

Currently, the following predicates are included in MorphDB.

Predicate
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://morph.org/has_score
http://morph.org/is_candidate_of
http://morph.org/member_of
http://morph.org/species
http://morph.org/has_member
http://morph.org/has_species_member
http://morph.org/is_missing_bait_of
http://morph.org/has_ausr
http://morph.org/has_bait_in_dataset
http://morph.org/has_candidate
http://morph.org/no_genes_in_dataset
http://morph.org/no_genes_missing
http://morph.org/has_bait_missing
http://morph.org/rank
http://morph.org/score_for_gene_set
http://morph.org/score_for_sp_gene_set
http://morph.org/score_value
http://morph.org/is_bait_of
http://morph.org/gene_set_type
http://morph.org/has_species

Objects (subjects)

The main objects (subjects) in the RDF graph are genes, gene sets (GO/MapMan terms), gene families, and scores of a gene for a particular gene set term. Because a gene can have multiple scores (i.e. for different pathways and GO terms), there is no unique <gene> has_score <score> triple in the triple store. Gene scores are therefore reified: the predicate has_score leads from a gene to a score object, and the actual score for the pathway of interest is obtained with the predicates score_for_sp_gene_set and score_value, using the score object as subject. For a similar reason, GO/MapMan terms (gene sets) exist at two levels (general and species), as a GO term has different AUSR values in different species.

Object (subject)	example URI
Gene	http://morph.org/gene#MT0001S0570
Gene set (general level)	http://morph.org/gene_set#GO_0010105
Gene set (species level)	http://morph.org/gene_set_sp#sly_GO_0010105
Gene family	http://morph.org/gene_family#HOM0000018
Score	http://morph.org/score#MT0001S0570_GO_0010224
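As an illustration of how the reified scores might be traversed, the sketch below builds a SPARQL query that follows has_score from a gene to its score objects and then selects the one attached to a given species-level gene set. It only constructs the query text; since the database is not served directly, no endpoint is contacted. The helper name score_query is hypothetical, the URI patterns are copied from the example table above, and it is an assumption that the object of score_for_sp_gene_set is the species-level gene set URI. The identifiers in the usage line are taken from the example URIs and need not form a pair that actually has a score in MorphDB.

```python
MORPH = "http://morph.org/"  # namespace of the MorphDB predicates listed above


def score_query(gene_id, sp_gene_set_id):
    """Build a SPARQL query for the score of one gene in one gene set.

    gene_id: gene identifier, e.g. "MT0001S0570".
    sp_gene_set_id: species-level gene set identifier, e.g. "sly_GO_0010105".
    """
    gene = f"<{MORPH}gene#{gene_id}>"
    gene_set = f"<{MORPH}gene_set_sp#{sp_gene_set_id}>"
    return "\n".join([
        f"PREFIX morph: <{MORPH}>",
        "SELECT ?score ?value WHERE {",
        # Step 1: from the gene to each of its (reified) score objects.
        f"  {gene} morph:has_score ?score .",
        # Step 2: keep the score object that belongs to the gene set
        # (assumed direction: score object -> species-level gene set).
        f"  ?score morph:score_for_sp_gene_set {gene_set} ;",
        # Step 3: read the numeric value from that score object.
        "         morph:score_value ?value .",
        "}",
    ])


query = score_query("MT0001S0570", "sly_GO_0010105")
print(query)
```

The resulting text could be run against a locally generated MorphDB instance with any SPARQL-capable tool.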

Adding a new species

If you want to run MORPH bulk on your own data and species of interest, see the MORPH bulk repository at https://github.com/arzwa/morph_bulk. The wiki of that repository contains documentation and a tutorial on how to run MORPH bulk for your case of interest. It also includes tools for building a MorphDB RDF graph, which can be used to generate a MorphDB instance.

Acknowledgements

MorphDB was developed by Arthur Zwaenepoel (2017).

If you use MorphDB or MORPH bulk, please cite:

Zwaenepoel, A., Diels, T., Amar, D., Van Parys, T., Shamir, R., Van de Peer, Y., & Tzfadia, O. (2018).
MorphDB: Prioritizing Genes for Specialized Metabolism Pathways and Gene Ontology Categories in Plants.
Frontiers in Plant Science, 9(March), 1–13. https://doi.org/10.3389/fpls.2018.00352