ENIGMA installation instructions

ENIGMA requires Java. If not already installed on your computer, download and install the Java 2 Runtime Environment, version 1.6.0 or higher. 

Download the ENIGMA zip file and unpack it in the directory of your choice. After unpacking, you should find a folder called ENIGMA containing a lib folder, the folders test_data_1 and test_data_2 and the following files:

ENIGMA.jar (executable jar file)

enigma.properties (sample ENIGMA settings file)

README.txt (GPL license)

The test_data folders contain artificial datasets of 1000 genes x 100 conditions with 20 simulated biclusters (see article for more info) and the properties files and other data files needed to analyze them. The output files are also included.

Running ENIGMA

Go to the directory containing your custom <xxx>.properties file (see 'Setting ENIGMA properties' below) and type in:

(Linux)  java -jar <path>/ENIGMA.jar <xxx>.properties > outputfile

where <path> is the path to the directory where ENIGMA.jar is located and <xxx> can be anything you like.

(Windows, open command-line window with Start>Run>cmd)

java -jar <path>\ENIGMA.jar <xxx>.properties > outputfile
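
If you prefer to launch ENIGMA from a script, the same command can be wrapped as in the minimal Python sketch below. The function names enigma_command and run_enigma are hypothetical helpers for illustration and are not part of ENIGMA; only the 'java -jar <path>/ENIGMA.jar <xxx>.properties > outputfile' call itself comes from the instructions above.

```python
import subprocess
from pathlib import Path


def enigma_command(jar_dir, properties_file):
    """Build the argument list for: java -jar <path>/ENIGMA.jar <xxx>.properties"""
    jar = Path(jar_dir) / "ENIGMA.jar"
    return ["java", "-jar", str(jar), str(properties_file)]


def run_enigma(jar_dir, properties_file, output_file, work_dir="."):
    """Run ENIGMA from work_dir and redirect its standard output to output_file.

    Note: output_file is opened relative to the calling process, not work_dir.
    """
    with open(output_file, "w") as out:
        return subprocess.run(
            enigma_command(jar_dir, properties_file),
            cwd=work_dir,   # directory containing the .properties file
            stdout=out,     # equivalent of '> outputfile'
            check=True,     # raise an error if java exits with a failure code
        )
```

Building the command as a list avoids quoting problems with paths that contain spaces, on both Linux and Windows.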

ENIGMA data requirements

  • ENIGMA works with perturbational gene expression data, which means that you need to use a perturbation vs control type of experimental setup. You need to calculate differential expression p-values for all genes in all experiments beforehand, using e.g. the limma package in R.
  • Gather as much expression data as possible on the processes of interest.  If you run ENIGMA on a bunch of experiments that are not focused on any particular process, you will probably end up with a very blurry network. It is advised to focus the dataset on one or several specific processes, but 'focus' can be interpreted in the broad sense. Even perturbations that are only remotely related to the process of interest can be informative. There is no general rule on how much data you need to get a good result, but it is clear that 5 perturbational experiments on a process encompassing 100 genes will probably not yield tremendous resolution.
  • Only include high-quality data. As with all data analysis frameworks, 'rubbish in = rubbish out'.
  • Avoid including perturbational experiments with pleiotropic effects. These can give rise to large modules that link biological processes that are not directly related. You can either remove such experiments beforehand by setting a threshold on the number of genes that can be differentially expressed in any given experiment (e.g. 1000), or you can exclude them after a first run of ENIGMA by examining the substructure of the very large modules (if any). Look for unnaturally large modules in which a few experiments seem to hold together otherwise divergently expressed submodules. These experiments can be excluded in a second ENIGMA run.
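
The pre-filtering step suggested in the last point can be sketched as follows. This is a minimal illustration and not part of ENIGMA: the function names, the in-memory data layout (a dict mapping each condition name to a list of per-gene p-values) and the strict 'p < threshold' comparison are assumptions. The cutoff of 1000 genes is the example value mentioned above.

```python
def count_significant(pvals_by_condition, p_threshold=0.01):
    """Count, for each condition, the genes whose p-value is below p_threshold."""
    return {
        cond: sum(1 for p in pvals if p < p_threshold)
        for cond, pvals in pvals_by_condition.items()
    }


def drop_pleiotropic(pvals_by_condition, max_genes=1000, p_threshold=0.01):
    """Return the conditions with at most max_genes differentially expressed genes."""
    counts = count_significant(pvals_by_condition, p_threshold)
    return [cond for cond in pvals_by_condition if counts[cond] <= max_genes]
```

Conditions that are dropped here would simply be left out of the expression and p-value files before running ENIGMA.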


Setting ENIGMA properties

The enigma.properties file contains the following fields that define ENIGMA settings (example values are filled in):

ENIGMA parameters

  • dataDir=d:/test_data

Directory in which the data files are located

  • expressionData=hughratios

Log-ratio expression data. File format:

UID DESCRIPTION condition1 condition2 condition3
gene a   0.001 0.033 -0.01
gene b   -0.045 -0.012 0.03
gene c   -0.018 0.02 0.059
  • pvalData=hughpvals

P-values for over- or under-expression of genes under experimental conditions. File format:

UID DESCRIPTION condition1 condition2 condition3
gene a   0.001 1.20E-04 0.050
gene b   0.935 0.352 0.032
gene c   0.566 1.23E-05 0.220
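
As a sanity check before a run, files in the layout shown above can be read with a few lines of Python. This sketch assumes the columns are tab-delimited and that the first two columns are UID and DESCRIPTION, as in the samples; read_matrix is a hypothetical helper, not part of ENIGMA.

```python
import csv


def read_matrix(handle):
    """Read a 'UID DESCRIPTION condition1 ...' table from an open text handle.

    Returns (conditions, rows), where rows maps each UID to a list of floats.
    Assumes tab-delimited columns; the DESCRIPTION column is ignored.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    conditions = header[2:]          # skip UID and DESCRIPTION
    rows = {}
    for fields in reader:
        if not fields:               # skip empty lines
            continue
        rows[fields[0]] = [float(x) for x in fields[2:]]
    return conditions, rows
```

Reading both the expressionData and the pvalData file this way makes it easy to verify that they list the same genes and conditions in the same order.
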
  • geneDescriptionFile=SGD_features.tab

Gene description file. File format: only the fourth (ORF), fifth (NAME) and sixteenth (DESCRIPTION) columns are used by the present version of ENIGMA; all other columns can be left empty.

S000002143 ORF Dubious YAL069W                       Dubious open reading frame unlikely to encode a protein, based on available experimental and comparative sequence data
S000000054 ORF Verified YAL058W CNE1 FUN48                   Calnexin
S000000045 ORF Verified YAL047C SPC72 LDB4                   Component of the cytoplasmic Tub4p (gamma-tubulin) complex, binds spindle pole bodies and links them to microtubules


  • chipData=harb04_reg_graph_0_p005_ORF.txt

File with ChIP binding data or motif data. If chipData=null, no TF binding site overrepresentation analysis will be performed. File format:

TF Target Binding_Sites
YMR016C YMR016C 14
YMR016C YDR309C 6
YMR016C YKL062W 10

'Binding_Sites' represents the number of binding sites/motifs for the TF in the target's promoter. This number is not really used; if you don't have info on the number of binding sites, just fill in ones.
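
If you only have a list of TF-target pairs, a file in this layout can be generated as in the sketch below, which fills in 1 for every Binding_Sites value. The helper name chip_lines is hypothetical, and the tab delimiter is an assumption based on the sample above.

```python
def chip_lines(tf_targets):
    """Turn (TF, target) pairs into chipData lines with Binding_Sites set to 1."""
    lines = ["TF\tTarget\tBinding_Sites"]      # header as in the sample file
    for tf, target in tf_targets:
        lines.append(f"{tf}\t{target}\t1")     # 1 = binding site count unknown
    return lines
```

Writing the result with "\n".join(chip_lines(pairs)) gives a file that can be referenced in the chipData field.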

  • interactionData=BIOGRID-ORGANISM-Saccharomyces_cerevisiae-2.0.35.tab.txt

A BioGRID-style text file containing protein and genetic interactions. See the BioGRID website for details and downloadable interaction files.

  • regulatorData=regulators.txt

Either a custom list of potential regulators, or 'regulatorData=default', in which case the regulators are selected from the GO categories specified under regulatoryGoCats. Custom list file format:

YAL040C CLN3 G1/S transition of mitotic cell cycle* G1/S-specific cyclin
YAL041W CDC24 establishment of cell polarity (sensu Saccharomyces)* signal transducer*
YBL005W PDR3 transport transcription factor
  • regulatoryGoCats=30528 4672 79

A space-separated list of GO category numbers from which potential regulators should be selected.

  • clusterPar1=0.30
  • clusterPar2=0.55

These lines define custom ENIGMA clustering parameters. If clusterPar1=NaN and clusterPar2=NaN, ENIGMA will search for optimal parameter settings using a Simulated Annealing procedure. Otherwise, ENIGMA will look for modules using the specified parameter settings. Both values lie between 0 and 1: clusterPar1 controls the spacing of the modules, and clusterPar2 controls the size and coherence of individual modules.

  • pvalThreshold=0.01

p-value threshold for significant up- or down-regulation.

NOTE: if usePvals=false (i.e. if you don't wish to use p-values), this threshold can be set to a log-ratio cutoff for 'significant' up- or down-regulation. E.g. when 'usePvals=false' is used in combination with 'pvalThreshold=1' on data in log2-ratio format, two-fold up- or down-regulation of expression will be considered 'significant'.

  • usePvals=true

'true' if you want to use p-values to assess up- or down-regulation of genes under specific conditions (recommended); 'false' if you want to use a log-ratio cutoff instead. In the latter case, you should specify the same file in both the 'expressionData' and 'pvalData' fields (see above).
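
The interplay between pvalThreshold and usePvals can be illustrated with the small sketch below. It is not ENIGMA's actual code: the function name is hypothetical, and whether ENIGMA uses strict or non-strict comparisons at the threshold is an assumption.

```python
def is_significant(value, threshold, use_pvals=True):
    """Decide whether a gene counts as significantly up- or down-regulated.

    use_pvals=True : value is a p-value, significant if below the threshold.
    use_pvals=False: value is a log-ratio, significant if its absolute value
                     reaches the threshold (e.g. 1 = two-fold on log2 data).
    The exact comparison operators are assumptions.
    """
    if use_pvals:
        return value < threshold
    return abs(value) >= threshold
```

With log2-ratio data, a cutoff of 1 corresponds to a fold change of 2 ** 1 = 2 in either direction, which is why 'pvalThreshold=1' selects two-fold changes.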

  • fdr=0.05

Significance level for FDR correction of expression correlation p-values. Also used as FDR level for determining set of conditions under which a module's genes show a significant response (expression up or down).

  • fdrTF=0.05

FDR level used for selecting enriched TF binding sites from chipData

  • nrMCSAruns=3

The number of Simulated Annealing runs, if clusterPar1=NaN and clusterPar2=NaN

  • beginT1=0.1

Begin temperature of first (rough) stage of Simulated Annealing

  • endT1=0.001

End temperature of first (rough) stage of Simulated Annealing

  • step1=0.05

Size of steps in parameter space in first (rough) stage of Simulated Annealing

  • coolingRate1=0.99

Cooling rate during first (rough) stage of Simulated Annealing

  • beginT2=0.01

Begin temperature of second stage of Simulated Annealing

  • endT2=0.0001

End temperature of second stage of Simulated Annealing

  • step2=0.01

Size of steps in parameter space in second stage of Simulated Annealing

  • coolingRate2=0.995

Cooling rate during second stage of Simulated Annealing
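
To get a feeling for how the begin temperature, end temperature and cooling rate interact, the sketch below counts the temperature updates in one annealing stage. It assumes a geometric schedule in which the temperature is multiplied by the cooling rate at every step; this document does not state the schedule ENIGMA actually uses, so treat the numbers as a rough guide only. The function name is hypothetical.

```python
def cooling_steps(begin_t, end_t, cooling_rate):
    """Count temperature updates until the temperature drops to end_t or below.

    Assumes geometric cooling: T <- T * cooling_rate at every step.
    """
    steps = 0
    t = begin_t
    while t > end_t:
        t *= cooling_rate
        steps += 1
    return steps
```

Under this assumption, the example settings above give roughly 460 updates for the first stage (cooling_steps(0.1, 0.001, 0.99)) and roughly 920 for the second (cooling_steps(0.01, 0.0001, 0.995)). A cooling rate closer to 1 lengthens the run.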

  • conditionSelection=true

True if you want to narrow down the condition sets for the modules to the most relevant ones, false otherwise

  • cosCorrThreshold=0.65

Cosine correlation threshold for grouping conditions into leaves. E.g. conditions in the same leaf should have a cosine correlation of at least 0.65. If you don't want to define leaves but just cluster hierarchically, use cosCorrThreshold= -1.
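
For reference, the sketch below computes a cosine correlation between two condition profiles, assuming the term refers to the usual uncentered correlation (the cosine of the angle between the two vectors). This document does not define the measure, so this interpretation and the function name are assumptions.

```python
import math


def cosine_correlation(x, y):
    """Cosine of the angle between two equally long numeric profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0                     # undefined for all-zero profiles
    return dot / (norm_x * norm_y)
```

Under this definition, two conditions would fall in the same leaf with cosCorrThreshold=0.65 only if their profiles point in broadly the same direction.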

  • drawModules=png

Draw figures for all modules. drawModules=eps for eps figures, drawModules=png for png figures, drawModules=null if you don't want figures to be drawn

BiNGO parameters

  • BiNGO=true

True if you want to perform GO overrepresentation analysis on the gene sets of the modules, otherwise false

  • conditionBiNGO=true

True if you want to perform GO overrepresentation analysis on the condition sets of the modules, otherwise false. This option requires that your condition names start with the names of perturbed genes, separated from the rest of the condition name by ',' or '('.
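
The naming convention can be checked with a sketch like the one below, which extracts the part of a condition name before the first ',' or '('. It only illustrates the convention described above; the function name and the example condition names are hypothetical, and how ENIGMA handles several perturbed genes in one name is not covered here.

```python
import re


def perturbed_gene_prefix(condition_name):
    """Return the text before the first ',' or '(' in a condition name."""
    return re.split(r"[,(]", condition_name, maxsplit=1)[0].strip()
```

For example, hypothetical condition names such as 'YAL040C (deletion)' or 'CLN3, 2h' would yield 'YAL040C' and 'CLN3'.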

  • fdrBiNGO=0.05

FDR significance level for GO overrepresentation analysis, generally 0.05 or 0.01

  • annotationFile=S_cerevisiae_default

GO annotation file, either one of the default files or a custom file (see BiNGO website for info on making custom annotation files). Currently allowed default files are:


  • ontologyFile=GO_Biological_Process

GO ontology files, either one of the default files or a custom file (see BiNGO website for info on making custom ontologies). Currently allowed default files are:


  • annotationDefault=true

True if you use one of the default GO annotation files, false otherwise

  • ontologyDefault=true

True if you use one of the default GO ontology files, false otherwise
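
Before starting a long run, it can be useful to check a settings file programmatically. The sketch below assumes the file consists of plain 'key=value' lines as in the examples above; it does not implement the full Java properties syntax, and the helper names are hypothetical.

```python
def read_properties(lines):
    """Parse simple 'key=value' lines into a dict (comments and blanks skipped)."""
    props = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def uses_simulated_annealing(props):
    """True if both clustering parameters are NaN, i.e. ENIGMA will optimize them."""
    return props.get("clusterPar1") == "NaN" and props.get("clusterPar2") == "NaN"
```

A typical use would be read_properties(open("my.properties")), followed by checks that the files named in expressionData and pvalData exist in dataDir.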



ENIGMA output files

  • <dataset>_graph

The correlation graph learned in the first stage of ENIGMA. Structure:

Gene_1 | Gene_2 | PosNeg | P-value

Gene_1 and Gene_2 are the correlated genes, PosNeg is + for positive correlations and - for negative correlations (the present version of ENIGMA calculates only positive correlations by default), and P-value is the FDR-corrected P-value for the correlation

  • <dataset>_gene_stats

Basic statistics for all genes in the graph. Structure:

Gene_name | Description | k | C | Nr_modules | Modules

k is the positive degree of the gene, and C is the clustering coefficient of the gene's neighborhood. Nr_modules is the number of modules that the gene is part of; these modules are listed under Modules (tab-delimited).
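
For readers unfamiliar with these graph statistics, the sketch below computes a degree and a clustering coefficient for every gene in an edge list. It uses the standard local clustering coefficient (the fraction of pairs of neighbors that are themselves connected); whether ENIGMA uses exactly this definition is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def degree_and_clustering(edges):
    """Map each gene to (k, C) for an undirected graph given as (gene, gene) pairs."""
    neighbors = {}
    for a, b in edges:
        if a == b:                      # ignore self-loops
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    stats = {}
    for gene, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            stats[gene] = (k, 0.0)      # C is set to 0 when there are < 2 neighbors
            continue
        # count the edges that exist between pairs of neighbors
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        stats[gene] = (k, 2.0 * links / (k * (k - 1)))
    return stats
```

The input could be taken from the first two columns of the <dataset>_graph file, restricted to positive correlations.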

  • <dataset>_modules

The modules found by ENIGMA. Structure:

Module_number | Module_size | Member Genes

The module name is the name of the gene whose kmax-core was used as the seed for the cluster (see paper).

  • <dataset>_conditions

The condition sets for all the modules. Structure:

Module_number | Condition_name | PosNeg | P-value

PosNeg = + if there is an overrepresentation of up-regulated expression values for that condition in the module relative to the whole microarray, PosNeg = - if there is an overrepresentation of down-regulated expression values. P-value is the FDR-corrected P-value for overrepresentation.

  • <dataset>_tf_enrichment

File with the results of the transcription factor binding overrepresentation analysis (based on ChIP or motif data) and the results of the search for significantly coexpressed regulators for every module. Structure:

TF_name | Module_number | type | P-value

P-value is the FDR-corrected P-value for overrepresentation. The type is either 'coexpression' or 'binding', depending on whether the TF/regulator is significantly coexpressed with or binding on the module's genes

  • <dataset>.bgo

BiNGO output file, contains info on the enrichment of GO categories in the modules. See BiNGO website for details.

  • module_xxx.eps or module_xxx.png

Files with png or eps figures for all the modules, including regulation programs, gene and condition sets, overrepresented GO categories and TF binding sites, and links to other modules

Copyright (c) 2007 Flanders Interuniversitary Institute for Biotechnology (VIB)