Loading Gene Expression Data

Interaction networks are useful as stand-alone models. However, they are most powerful for answering scientific questions when integrated with additional biological information, such as gene or protein expression levels. Once loaded, expression ratios/levels may be visually superimposed on the network, used in a filter to select a subset of nodes, or used to identify active modules and subsystems (via plugin analysis tools). An expression data set can be loaded at any time, but is only relevant once a network has been loaded.

Format

Gene expression ratios or values are specified over one or more experiments using a text file. Ratios result from a comparison of two expression measurements (experiment vs. control). Some expression platforms, such as Affymetrix, directly measure expression values, without a comparison. The file consists of a header line followed by one line per gene, each made up of space- or tab-delimited fields, in the following format:

GeneName [CommonName] ratio1 ratio2 ... ratioN [pval1 pval2 ... pvalN]

Brackets [] indicate optional fields. The first field is the systematic gene name, optionally followed by a common name. Expression ratios/values are provided for each experiment, optionally followed by one p-value per experiment or another measure of significance, i.e. whether the ratio represents a true change in expression or whether the value accurately measures the expression of a gene (according to some statistical model). Significance values are generated by a variety of software packages for analyzing DNA microarray expression data, for instance the VERA program from the Institute for Systems Biology (http://www.systemsbiology.org/VERAandSAM). A list of other microarray analysis packages is available at: http://www.nslij-genetics.org/microarray/soft.html

Example

GENE DESCRIPT gal1RG.sig gal2RG.sig gal3RG.sig gal1RG.sig gal2RG.sig gal3RG.sig
YHR051W COX6 -0.034 -0.052 0.152 1.177 0.102 0.857
YHR124W NDT80 -0.090 -0.000 0.041 0.130 0.341 0.061
YKL181W PRS1 -0.167 -0.063 -0.230 -0.233 0.143 0.089

The first line is a header line giving the names of the experimental conditions. Note that each condition is duplicated; the first set of columns gives expression ratios and the second set gives significance values. The significance columns can be omitted if your data doesn't include significance measures. Every remaining row specifies the values for a gene, starting with the formal name of the gene, then a common name, then the ratios, then the significance values.
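
The layout described above can be read with a few lines of Python. This is a minimal sketch for illustration, not Cytoscape's own parser. The function name parse_basic_expression is hypothetical, and the sketch assumes two name columns and, when significance values are present, a header whose condition names are duplicated exactly.

```python
def parse_basic_expression(text):
    """Parse the basic expression format into a dict keyed by gene name.

    Hypothetical helper for illustration; not part of Cytoscape.
    """
    # Skip blank lines; any run of spaces or tabs separates fields.
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    names = header[2:]  # drop the titles of the two name columns
    half = len(names) // 2
    # Significance columns are present if the condition list repeats exactly.
    has_sig = half > 0 and len(names) % 2 == 0 and names[:half] == names[half:]
    conditions = names[:half] if has_sig else names
    n = len(conditions)
    genes = {}
    for fields in rows:
        values = [float(v) for v in fields[2:]]
        genes[fields[0]] = {
            "common": fields[1],
            "ratios": dict(zip(conditions, values[:n])),
            "significance": dict(zip(conditions, values[n:2 * n])) if has_sig else {},
        }
    return conditions, genes


example = """\
GENE DESCRIPT gal1RG.sig gal2RG.sig gal3RG.sig gal1RG.sig gal2RG.sig gal3RG.sig
YHR051W COX6 -0.034 -0.052 0.152 1.177 0.102 0.857
YHR124W NDT80 -0.090 -0.000 0.041 0.130 0.341 0.061
YKL181W PRS1 -0.167 -0.063 -0.230 -0.233 0.143 0.089
"""

conditions, genes = parse_basic_expression(example)
print(conditions)
print(genes["YHR051W"]["ratios"]["gal3RG.sig"])
```

Keying the result on the first column mirrors the way the formal gene name is used to look up nodes; the common name is kept only for reference.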

Some variations on this basic format are recognized: see the formal file format specification below for more information. Expression data files commonly have the file extensions ".mrna" or ".pvals", and these file extensions are recognized by Cytoscape when browsing for data files.

Commands

Load an expression data file using the File menu of Cytoscape, or by specifying the filename using the -e option at the command line. The -x command line option indicates that the expression data should not be loaded into node attributes. This is an advanced option, typically used only when the number of expression conditions is large enough to become unwieldy in the normal user interface.

Example

Load a sample gene expression data set using the menu: File / Load / Expression Matrix File. In the resulting file dialog box (shown at right), select the file “sampleData/galExpData.pvals”. As described in the following sections, Cytoscape is now ready to integrate these data with the underlying molecular interaction network.

Detailed file format (Advanced users)

In all expression data files, any whitespace (spaces and/or tabs) is considered a delimiter between adjacent fields. Every line of text is either the header line or contains all the measurements for a particular gene. No name conversion is applied to expression data files (see the discussion of name resolution in section 5, Building and Storing Interaction Networks). The names given in the first column of the expression data file should exactly match the names used elsewhere (for example, in SIF or GML files).

The first line is a header line with one of the following three formats:

<text> <text> cond1 cond2 ... cond1 cond2 ... [NumSigConds]
<text> <text> cond1 cond2 ...
<tab><tab>RATIOS<tab><tab>...LAMBDAS

The first format specifies that both expression ratios and significance values are included in the file. The first two text tokens are ignored; these columns will contain names for each gene. The next token set specifies the names of the experimental conditions; these columns will contain ratio values. This list of condition names must then be duplicated exactly, each spelled the same way and in the same order. Optionally, a final column with the title NumSigConds may be present. If present, this column will contain integer values indicating the number of conditions in which each gene had a statistically significant change according to some threshold.

The second format is similar to the first except that the duplicate column names are omitted, and there is no NumSigConds field. This format specifies data with ratios but no significance values.

The third format specifies an MTX header, which is a commonly used format. Two tab characters precede the RATIOS token. This token is followed by a number of tabs equal to the number of conditions, followed by the LAMBDAS token. This format specifies both ratios and significance values.
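
The three header formats can be told apart mechanically. The following is a rough sketch under the rules stated above, not Cytoscape's actual implementation. The function name classify_header and the labels it returns are hypothetical. Because the MTX header line as described carries only a count of conditions, the sketch returns None placeholders in place of condition names for that format.

```python
def classify_header(line):
    """Return (kind, conditions, has_num_sig_conds) for a header line.

    Hypothetical helper; the kind labels are this sketch's own.
    """
    tokens = line.split()
    # MTX header: only the RATIOS and LAMBDAS tokens, separated by tabs.
    if tokens == ["RATIOS", "LAMBDAS"]:
        start = line.index("RATIOS") + len("RATIOS")
        end = line.index("LAMBDAS")
        # The number of tabs between the two tokens equals the number of conditions.
        n_conditions = line[start:end].count("\t")
        return "mtx", [None] * n_conditions, False
    names = tokens[2:]  # the first two text tokens are ignored
    # A trailing NumSigConds column is only valid in the first format.
    candidate = names[:-1] if names and names[-1] == "NumSigConds" else names
    half = len(candidate) // 2
    if half > 0 and len(candidate) % 2 == 0 and candidate[:half] == candidate[half:]:
        return "ratios+significance", candidate[:half], len(candidate) != len(names)
    return "ratios-only", names, False


print(classify_header("GENE DESCRIPT cond1 cond2 cond1 cond2 NumSigConds"))
print(classify_header("GENE DESCRIPT cond1 cond2 cond3"))
print(classify_header("\t\tRATIOS\t\t\tLAMBDAS"))
```

Note that the MTX check must look at the raw line rather than the split tokens, since splitting on whitespace discards the tab count.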

Each line after the first is a data line with the following format:
FormalGeneName CommonGeneName ratio1 ratio2 ... [lambda1 lambda2 ...] [numSigConds]

The first two tokens are gene names. The names in the first column are the keys used for node name lookup; these names should be the same as the names used elsewhere in Cytoscape (for example, in the SIF or GML files). By the convention of the gene expression microarray community, which defined these file formats, the first token is the formal name of the gene (in systems where there is a formal naming scheme for genes), while the second is a synonym for the gene commonly used by biologists. Cytoscape does not make use of the common name column. The next columns contain floating point values for the ratios, followed by columns with the significance values if specified by the header line. The final column, if specified by the header line, should contain an integer giving the number of significant conditions for that gene.

Missing values are not allowed and will confuse the parser. For example, using two consecutive tabs to indicate a missing value will not work; the parser will regard both tabs as a single delimiter and be unable to parse the line correctly.
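
The effect of a missing value can be illustrated with Python's str.split(), which, like the parser described here, treats any run of whitespace as a single delimiter. This is an illustrative sketch only; the helper check_field_count is hypothetical and simply shows one way to detect such lines before loading a file.

```python
line_ok = "YHR051W\tCOX6\t-0.034\t-0.052\t0.152"
line_missing = "YHR051W\tCOX6\t-0.034\t\t0.152"  # second ratio left empty


def check_field_count(line, n_values, n_name_columns=2):
    """Return True if the line has exactly n_values fields after the names."""
    fields = line.split()  # consecutive tabs collapse into one delimiter
    return len(fields) == n_name_columns + n_values


# The empty field disappears, so later values shift one column to the left.
print(line_missing.split())
print(check_field_count(line_ok, 3), check_field_count(line_missing, 3))
```

Here n_values is the total number of numeric columns expected on a data line (ratios plus any significance values), as implied by the header.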

Optionally, the last line of the file may be a special footer line with the following format:
NumSigGenes int1 int2 ...

This line specifies the number of genes that were significantly differentially expressed in each condition. The first text token must be spelled exactly as shown; the rest of the line should contain one integer value for each experimental condition.
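
As a small sketch of how the optional footer might be separated from the data lines, assuming the lines have already been read into a list and the number of conditions is known from the header. The function name split_footer is hypothetical and is not part of Cytoscape.

```python
def split_footer(lines, n_conditions):
    """Separate an optional NumSigGenes footer from the data lines.

    Returns (data_lines, counts); counts is None when no footer is present.
    """
    if lines:
        tokens = lines[-1].split()
        # The first token must be spelled exactly "NumSigGenes".
        if tokens and tokens[0] == "NumSigGenes":
            counts = [int(t) for t in tokens[1:]]
            if len(counts) != n_conditions:
                raise ValueError("footer needs one integer per condition")
            return lines[:-1], counts
    return lines, None


rows = [
    "YHR051W COX6 -0.034 -0.052 0.152",
    "YHR124W NDT80 -0.090 -0.000 0.041",
    "NumSigGenes 12 7 30",
]
data_lines, sig_gene_counts = split_footer(rows, 3)
print(len(data_lines), sig_gene_counts)
```

The integer counts in this usage example are made up for illustration and do not come from any real data set.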