ENIGMA - Expression Network Inference and Global Module Analysis

ENIGMA installation instructions

ENIGMA requires Java. If Java is not already installed on your computer, download and install the Java 2 Runtime Environment, version 1.6.0 or higher.

Download the ENIGMA zip file and unpack it in the directory of your choice. After unpacking, you should find a folder called ENIGMA containing a lib folder, the folders test_data_1 and test_data_2 and the following files:

ENIGMA.jar (executable jar file)

enigma.properties (sample ENIGMA settings file)

README.txt (GPL license)

The test_data folders contain artificial datasets of 1000 genes x 100 conditions with 20 simulated biclusters (see article for more info) and the properties files and other data files needed to analyze them. The output files are also included.

Running ENIGMA

Go to the directory containing your custom <xxx>.properties file (see "Setting ENIGMA properties" below) and type:

(Linux)

java -jar <path>/ENIGMA.jar <xxx>.properties > outputfile

(Windows, open a command-line window with Start>Run>cmd)

java -jar <path>\ENIGMA.jar <xxx>.properties > outputfile

where <path> is the path to the directory where ENIGMA.jar is located and <xxx> can be any name you like. The text that ENIGMA prints to standard output is redirected to outputfile.
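If you prefer to start ENIGMA from a script, the same command can be wrapped as in the minimal sketch below. The helper names build_enigma_command and run_enigma are hypothetical and not part of ENIGMA; the sketch only assumes that java is on your PATH and that ENIGMA.jar sits in the directory you pass in.

```python
import subprocess
from pathlib import Path


def build_enigma_command(jar_dir, properties_file):
    """Return the argument list for: java -jar <path>/ENIGMA.jar <xxx>.properties"""
    jar = Path(jar_dir) / "ENIGMA.jar"
    return ["java", "-jar", str(jar), str(properties_file)]


def run_enigma(jar_dir, properties_file, output_file, work_dir="."):
    """Run ENIGMA inside work_dir and write its standard output to output_file."""
    cmd = build_enigma_command(jar_dir, properties_file)
    out_path = Path(work_dir) / output_file
    with open(out_path, "w") as out:
        # check=True raises CalledProcessError if java exits with a non-zero status.
        subprocess.run(cmd, cwd=work_dir, stdout=out, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces, on both Linux and Windows.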

ENIGMA data requirements

ENIGMA works with perturbational gene expression data, which means that you need to use a perturbation vs control type of experimental setup. You need to calculate differential expression p-values for all genes in all experiments beforehand, using e.g. the limma package in R.
Gather as much expression data as possible on the processes of interest. If you run ENIGMA on a collection of experiments that are not focused on any particular process, you will probably end up with a very blurry network. We advise focusing the dataset on one or several specific processes, but 'focus' can be interpreted broadly: even perturbations that are only remotely related to the process of interest can be informative. There is no general rule on how much data you need to get a good result, but 5 perturbational experiments on a process encompassing 100 genes will probably not yield much resolution.
Only include high-quality data. As with all data analysis frameworks, 'rubbish in = rubbish out'.
Avoid including perturbational experiments with pleiotropic effects. These can give rise to large modules that link biological processes that are not directly related. You can either remove such experiments beforehand by setting a threshold on the number of genes that can be differentially expressed in any given experiment (e.g. 1000), or you can exclude them after a first run of ENIGMA by examining the substructure of the very large modules (if any). Look for unnaturally large modules in which a few experiments seem to hold together otherwise divergently expressed submodules. These experiments can be excluded in a second ENIGMA run.
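The pre-filtering of pleiotropic experiments described above can be sketched as follows. This is only an illustration: the function names, the in-memory layout (a dictionary mapping each condition name to its list of p-values) and the use of a strict "<" comparison are assumptions, and the cutoff of 1000 genes is simply the example value from the text.

```python
def count_significant(pvals_by_condition, p_threshold=0.01):
    """Count, for every condition, the genes with a p-value below p_threshold."""
    return {
        cond: sum(1 for p in pvals if p < p_threshold)
        for cond, pvals in pvals_by_condition.items()
    }


def drop_pleiotropic(pvals_by_condition, max_genes=1000, p_threshold=0.01):
    """Keep only conditions with at most max_genes differentially expressed genes."""
    counts = count_significant(pvals_by_condition, p_threshold)
    return {
        cond: pvals
        for cond, pvals in pvals_by_condition.items()
        if counts[cond] <= max_genes
    }
```

On a toy dataset you would call drop_pleiotropic(data, max_genes=2) to see the effect; on real data, inspect the counts first so that the cutoff matches the size of your genome and array.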

Setting ENIGMA properties

The enigma.properties file contains the following fields that define ENIGMA settings (example values are filled in):

ENIGMA parameters

dataDir=d:/test_data

Directory in which the data files are located

expressionData=hughratios

Log-ratio expression data. File format:

UID DESCRIPTION condition1 condition2 condition3

gene a 0.001 0.033 -0.01

gene b -0.045 -0.012 0.03

gene c -0.018 0.02 0.059

pvalData=hughpvals

P-values for over- or under-expression of genes under experimental conditions. File format:

UID DESCRIPTION condition1 condition2 condition3

gene a 0.001 1.20E-04 0.050

gene b 0.935 0.352 0.032

gene c 0.566 1.23E-05 0.220
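To check a data file before running ENIGMA, a small reader such as the sketch below can help. It assumes that the columns are tab-delimited, which this manual does not state explicitly for the input files, and the name read_matrix is hypothetical. It works for both the expressionData and the pvalData layout, since they share the same header.

```python
import csv


def read_matrix(lines):
    """Parse 'UID DESCRIPTION condition1 ...' rows into (conditions, {uid: values}).

    'lines' is any iterable of text lines, for example an open file.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    conditions = header[2:]  # the first two columns are UID and DESCRIPTION
    values = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        # float() accepts both plain and scientific notation (e.g. 1.20E-04)
        values[row[0]] = [float(x) for x in row[2:]]
    return conditions, values
```

A row with a non-numeric or missing value makes float() raise a ValueError, which is usually what you want when validating input.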

geneDescriptionFile=SGD_features.tab

File with gene descriptions. File format: only the fourth (ORF), fifth (NAME) and sixteenth (DESCRIPTION) columns are used by the present version of ENIGMA; all other columns can be left empty.

DBREF FEATURE_TYPE FEATURE_QUALIFIER ORF NAME ALIAS ... ... DESCRIPTION

S000002143 ORF Dubious YAL069W Dubious open reading frame unlikely to encode a protein, based on available experimental and comparative sequence data

S000000054 ORF Verified YAL058W CNE1 FUN48 Calnexin

S000000045 ORF Verified YAL047C SPC72 LDB4 Component of the cytoplasmic Tub4p (gamma-tubulin) complex, binds spindle pole bodies and links them to microtubules

chipData=harb04_reg_graph_0_p005_ORF.txt

File with ChIP binding data or motif data. If chipData=null, no TF binding site overrepresentation analysis will be performed. File format:

TF Target Binding_Sites

YMR016C YMR016C 14

YMR016C YDR309C 6

YMR016C YKL062W 10

'Binding_Sites' represents the number of binding sites/motifs for the TF in the target's promoter. This number is not really used; if you don't have information on the number of binding sites, just fill in ones.
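A chipData file can be generated from a list of TF-target pairs as in the sketch below, which fills in 1 whenever no binding-site count is known. The tab delimiter and the function name chip_lines are assumptions made for illustration.

```python
def chip_lines(tf_targets, site_counts=None):
    """Return the lines of a chipData file for the given (TF, target) pairs.

    site_counts optionally maps (TF, target) to a number of binding sites;
    pairs without a known count get 1.
    """
    site_counts = site_counts or {}
    lines = ["TF\tTarget\tBinding_Sites"]
    for tf, target in tf_targets:
        n_sites = site_counts.get((tf, target), 1)
        lines.append(f"{tf}\t{target}\t{n_sites}")
    return lines
```

Writing the result to disk is then a matter of joining the lines with newline characters.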

interactionData=BIOGRID-ORGANISM-Saccharomyces_cerevisiae-2.0.35.tab.txt

A BioGRID-style text file containing protein and genetic interactions. See the BioGRID website for details and downloadable interaction files.

regulatorData=regulators.txt

Either a custom list of potential regulators, or 'regulatorData=default', in which case the regulators are selected from the GO categories specified under regulatoryGoCats. Custom list file format:

ORF NAME DESCRIPTION

YAL040C CLN3 G1/S transition of mitotic cell cycle* G1/S-specific cyclin

YAL041W CDC24 establishment of cell polarity (sensu Saccharomyces)* signal transducer*

YBL005W PDR3 transport transcription factor

regulatoryGoCats=30528 4672 79

A space-separated list of GO category numbers from which potential regulators should be selected.

clusterPar1=0.30
clusterPar2=0.55

These lines define custom ENIGMA clustering parameters. If clusterPar1=NaN and clusterPar2=NaN, ENIGMA will search for optimal parameter settings using a Simulated Annealing procedure. Otherwise, ENIGMA will look for modules using the specified parameter settings. Both parameters take values between 0 and 1: clusterPar1 controls the spacing of the modules, and clusterPar2 controls the size and coherence of individual modules.

pvalThreshold=0.01

P-value threshold for significant up- or down-regulation.

NOTE: if usePvals=false (i.e. if you don't wish to use p-values), this threshold is interpreted as a log-ratio cutoff for 'significant' up- or down-regulation. E.g. when 'usePvals=false' is used in combination with 'pvalThreshold=1' on data in log2-ratio format, two-fold up- or down-regulation of expression will be considered 'significant'.

usePvals=true

'true' if you want to use p-values to assess up- or down-regulation of genes under specific conditions (recommended), 'false' if you want to use a log-ratio cutoff instead. In the latter case, you should specify the same file in both the 'expressionData' and 'pvalData' fields (see above).
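The two ways of interpreting pvalThreshold can be summarized in a few lines. This is a sketch of the rule as described in this manual, not ENIGMA's actual code; in particular, whether values exactly equal to the threshold count as significant is an assumption.

```python
def is_significant(value, threshold, use_pvals=True):
    """Decide whether one gene/condition value counts as differentially expressed.

    use_pvals=True:  value is a p-value, significant if below the threshold.
    use_pvals=False: value is a log-ratio, significant if its absolute value
                     reaches the threshold (up- or down-regulation).
    """
    if use_pvals:
        return value < threshold
    return abs(value) >= threshold
```

With use_pvals=False and a threshold of 1 on log2-ratios, a value of -1.0 (a two-fold decrease) is reported as significant, in line with the example in the NOTE above.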

fdr=0.05

Significance level for FDR correction of expression correlation p-values. Also used as the FDR level for determining the set of conditions under which a module's genes show a significant response (expression up or down).

fdrTF=0.05

FDR level used for selecting enriched TF binding sites from chipData

nrMCSAruns=3

The number of Simulated Annealing runs, if clusterPar1=NaN and clusterPar2=NaN

beginT1=0.1

Begin temperature of first (rough) stage of Simulated Annealing

endT1=0.001

End temperature of first (rough) stage of Simulated Annealing

step1=0.05

Size of steps in parameter space in first (rough) stage of Simulated Annealing

coolingRate1=0.99

Cooling rate during first (rough) stage of Simulated Annealing

beginT2=0.01

Begin temperature of second stage of Simulated Annealing

endT2=0.0001

End temperature of second stage of Simulated Annealing

step2=0.01

Size of steps in parameter space in second stage of Simulated Annealing

coolingRate2=0.995

Cooling rate during second stage of Simulated Annealing
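To get a feeling for how the temperature settings interact, the sketch below counts the temperature levels each stage would visit. It assumes a geometric schedule in which the temperature is multiplied by the cooling rate until it drops below the end temperature; this manual does not state which schedule ENIGMA actually uses, so treat the numbers as an order-of-magnitude guide only. The function name cooling_steps is hypothetical.

```python
def cooling_steps(begin_t, end_t, cooling_rate):
    """Count the multiplications by cooling_rate needed to go from begin_t to below end_t.

    Assumes geometric cooling: T_next = T * cooling_rate, with 0 < cooling_rate < 1.
    """
    if not 0 < cooling_rate < 1:
        raise ValueError("cooling_rate must lie strictly between 0 and 1")
    steps = 0
    temperature = begin_t
    while temperature > end_t:
        temperature *= cooling_rate
        steps += 1
    return steps
```

Under this assumption, the example values above give 459 levels for the first stage (0.1 to 0.001 at rate 0.99) and 919 for the second (0.01 to 0.0001 at rate 0.995). A cooling rate closer to 1 lengthens the run considerably.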

conditionSelection=true

True if you want to narrow down the condition sets for the modules to the most relevant ones, false otherwise

cosCorrThreshold=0.65

Cosine correlation threshold for grouping conditions into leaves. E.g. with cosCorrThreshold=0.65, conditions in the same leaf should have a cosine correlation of at least 0.65. If you don't want to define leaves but just cluster hierarchically, use cosCorrThreshold=-1.
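For reference, the cosine correlation of two condition profiles is their dot product divided by the product of their lengths, i.e. a correlation without mean-centering. The sketch below computes it for two equally long lists of log-ratios; the function name and the choice to return 0.0 for an all-zero profile are assumptions made for illustration.

```python
import math


def cosine_correlation(x, y):
    """Return the cosine of the angle between vectors x and y (range -1 to 1)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # undefined for an all-zero profile; 0.0 is a convention here
    return dot / (norm_x * norm_y)
```

Because the cosine correlation can never be lower than -1, a threshold of -1 is met by every pair of conditions.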

drawModules=png

Draw figures for all modules. drawModules=eps for eps figures, drawModules=png for png figures, drawModules=null if you don't want figures to be drawn

BiNGO parameters

BiNGO=true

True if you want to perform GO overrepresentation analysis on the gene sets of the modules, otherwise false

conditionBiNGO=true

True if you want to perform GO overrepresentation analysis on the condition sets of the modules, otherwise false. This option requires that your condition names start with the names of the perturbed genes, separated from the rest of the condition name by ',' or '('.
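To verify that your condition names follow this convention, you can extract the leading part of each name as in the sketch below. It illustrates the naming rule only and may differ from how ENIGMA parses names internally; the function name and the example condition names in the usage note are hypothetical.

```python
import re


def perturbed_gene(condition_name):
    """Return the part of a condition name before the first ',' or '('."""
    return re.split(r"[,(]", condition_name, maxsplit=1)[0].strip()
```

For instance, perturbed_gene("YAL040C (deletion)") returns "YAL040C". A name without either separator is returned whole, which is a quick way to spot conditions that do not follow the convention.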

fdrBiNGO=0.05

FDR significance level for GO overrepresentation analysis, generally 0.05 or 0.01

annotationFile=S_cerevisiae_default

GO annotation file, either one of the default files or a custom file (see BiNGO website for info on making custom annotation files). Currently allowed default files are:

S_cerevisiae_default
A_thaliana_default
S_pombe_default
T_brucei_default
C_elegans_default
D_melanogaster_default
B_rerio_default
H_sapiens_default
M_musculus_default
R_norvegicus_default
P_falsiparum_3D7_default
O_sativa_japonica_default
B_anthracis_Ames_default
S_oneidensis_MR-1_default
P_syringae_DC3000_default
C_burnetii_RSA_default
G_sulfurreducens_PCA_default
M_capsulatus_Bath_default
L_monocytogenes_4b_F2365_default
C_jejuni_RM1221_default
D_ethenogenes_195_default
S_pomeroyi_DSS-3_default
G_gallus_default
B_taurus_default

ontologyFile=GO_Biological_Process

GO ontology files, either one of the default files or a custom file (see BiNGO website for info on making custom ontologies). Currently allowed default files are:

GO_Biological_Process
GO_Molecular_Function
GO_Cellular_Component
GO_Full
GOSlim_Generic
GOSlim_GOA
GOSlim_Plants
GOSlim_Yeast

annotationDefault=true

True if you use one of the default GO annotation files, false otherwise

ontologyDefault=true

True if you use one of the default GO ontology files, false otherwise
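Some of the settings above depend on each other, and a quick consistency check before a long run can save time. The sketch below reads key=value lines and tests two rules taken from this manual: with usePvals=false the same file should be given for expressionData and pvalData, and custom clustering parameters should lie between 0 and 1. Treating a mix of one NaN and one numeric clustering parameter as a problem is an assumption, since the manual does not describe that case. All function names are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' or '!' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_properties(props):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    if (props.get("usePvals") == "false"
            and props.get("expressionData") != props.get("pvalData")):
        problems.append("usePvals=false: expressionData and pvalData should be the same file")
    pars = [props.get("clusterPar1"), props.get("clusterPar2")]
    if (pars[0] == "NaN") != (pars[1] == "NaN"):
        # Assumption: the manual only describes both-NaN or both-numeric settings.
        problems.append("clusterPar1 and clusterPar2 should both be NaN or both be numbers")
    for name, value in zip(("clusterPar1", "clusterPar2"), pars):
        if value not in (None, "NaN") and not 0 <= float(value) <= 1:
            problems.append(f"{name} should lie between 0 and 1")
    return problems
```

An empty list means that none of these particular rules is violated; it does not guarantee that ENIGMA will accept the file.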

ENIGMA output files

<dataset>_graph

The correlation graph learned in the first stage of ENIGMA. Structure:

Gene_1 | Gene_2 | PosNeg | P-value

Gene_1 and Gene_2 are the correlated genes, PosNeg is + for positive correlations and - for negative correlations (the present version of ENIGMA calculates only positive correlations by default), and P-value is the FDR-corrected P-value for the correlation

<dataset>_gene_stats

Basic statistics for all genes in the graph. Structure:

Gene_name | Description | k | C | Nr_modules | Modules

k is the positive degree of the gene and C is the clustering coefficient of the gene's neighborhood. Nr_modules is the number of modules that the gene is part of; these modules are listed under Modules (tab-delimited).
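The two graph statistics can be recomputed from a list of positive-correlation edges, for example to cross-check a subset of the graph file. The sketch below assumes the standard local clustering coefficient, C = 2e / (k(k-1)), where e is the number of edges among the k neighbors of a gene; this manual does not spell out the formula ENIGMA uses, and the function name is hypothetical.

```python
from itertools import combinations


def degree_and_clustering(edges):
    """Return {gene: (k, C)} for an undirected edge list of (gene_1, gene_2) pairs."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    stats = {}
    for gene, nbrs in neighbours.items():
        k = len(nbrs)
        if k < 2:
            stats[gene] = (k, 0.0)  # C is undefined for k < 2; 0.0 by convention
            continue
        # Count the edges that connect two neighbours of this gene.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbours[u])
        stats[gene] = (k, 2 * links / (k * (k - 1)))
    return stats
```

Genes with a high k and a high C sit in densely connected neighborhoods.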

<dataset>_modules

The modules found by ENIGMA. Structure:

Module_number | Module_size | Member Genes

The module name is the name of the gene whose k_max-core was used as the seed for the cluster (see paper).

<dataset>_conditions

The condition sets for all the modules. Structure:

Module_number | Condition_name | PosNeg | P-value

PosNeg = + if there is an overrepresentation of up-regulated expression values for that condition in the module relative to the whole microarray, PosNeg = - if there is an overrepresentation of down-regulated expression values. P-value is the FDR-corrected P-value for overrepresentation.

<dataset>_tf_enrichment

File with the results of the transcription factor binding overrepresentation analysis (based on ChIP or motif data) and the results of the search for significantly coexpressed regulators for every module. Structure:

TF_name | Module_number | type | P-value

P-value is the FDR-corrected P-value for overrepresentation. The type is either 'coexpression' or 'binding', depending on whether the TF/regulator is significantly coexpressed with or binding on the module's genes

<dataset>.bgo

BiNGO output file, contains info on the enrichment of GO categories in the modules. See BiNGO website for details.

module_xxx.eps or module_xxx.png

Files with png or eps figures for all the modules, including regulation programs, gene and condition sets, overrepresented GO categories and TF binding sites, and links to other modules