Building and Storing Interaction Networks

Cytoscape reads interaction networks in three formats: the simple interaction format (SIF, .sif), a standard format known as Graph Markup Language (GML, .gml), and an XML standard called XGMML (eXtensible Graph Markup and Modelling Language). SIF specifies nodes and interactions only, while GML and XGMML also store information about network layout and allow network data to be exchanged with a variety of other network display programs. Typically, SIF is used to import interactions when building a network for the first time, since it is easy to create in a text editor or spreadsheet. Once the interactions have been loaded and a layout has been performed, the network may be saved to GML or XGMML and reloaded from that file in future Cytoscape sessions. SIF, GML, and XGMML are all text files, so you can view and edit them in a regular text editor.

SIF FORMAT:

The simple interaction format is convenient for building a graph from a list of interactions. It also makes it easy to combine different interaction sets into a larger network, or to add new interactions to an existing data set. The main disadvantage is that this format does not include any layout information, so Cytoscape must compute a new layout of the network each time it is loaded.

Lines in the SIF file specify a source node, a relationship type (or edge type), and one or more target nodes:

nodeA <relationship type> nodeB
nodeC <relationship type> nodeA
nodeD <relationship type> nodeE nodeF nodeB
nodeG
...
nodeY <relationship type> nodeZ

  

A more specific example is:

node1 typeA node2
node2 typeB node3 node4 node5
node0

  

The first line identifies two nodes, called node1 and node2, and a single relationship between node1 and node2 of type typeA. The second line specifies three new nodes, node3, node4, and node5; here "node2" refers to the same node as in the first line. The second line also specifies three relationships, all of type typeB and with node2 as the source, with node3, node4, and node5 as the targets, respectively. This second form is simply shorthand for specifying multiple relationships of the same type with the same source node. The third line indicates how to specify a node that has no relationships with other nodes. This form is not needed for nodes that do have relationships, since the specification of the relationship implicitly identifies the nodes as well. Duplicate entries are allowed and indicate multiple edges between the same nodes. For example, the following specifies three edges between the same pair of nodes, two of type pp and one of type pd:

node1 pp node2
node1 pp node2
node1 pd node2

  

Edges connecting a node to itself (self-edges) are also allowed:

node1 pp node1

  

Every node and edge in Cytoscape has an identifying name, most commonly used with the node and edge data attribute structures. Node names must be unique, since identically named nodes are treated as the same node. By default, the name of each node is the name given in this file (unless another string is mapped to display on the node using the visual mapper, as discussed in the section on visual styles). The name of each edge is formed from the names of the source and target nodes plus the interaction type: for example, sourceName edgeType targetName.
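The line forms described above can be sketched as a small reader. This is a minimal illustration in Python, not Cytoscape's own parser: the function names parse_sif and edge_name are hypothetical, fields are assumed to be separated by plain whitespace (so names cannot contain spaces here), and rejecting a line with a source and a type but no target is a choice made for this sketch.

```python
def parse_sif(text):
    """Parse SIF text into (nodes, edges).

    nodes: node names in order of first appearance, each listed once.
    edges: (source, relationship_type, target) tuples; duplicates are kept,
           because repeated lines mean multiple edges between the same nodes.
    """
    nodes, seen, edges = [], set(), []

    def add_node(name):
        # Identically named nodes are the same node, so record each name once.
        if name not in seen:
            seen.add(name)
            nodes.append(name)

    for line in text.splitlines():
        fields = line.split()  # assumption: whitespace-delimited, no spaces in names
        if not fields:
            continue  # skip blank lines
        source = fields[0]
        add_node(source)
        if len(fields) == 1:
            continue  # a lone name declares a node with no relationships
        if len(fields) == 2:
            # Assumption of this sketch: a type without a target is an error.
            raise ValueError(f"missing target node in line: {line!r}")
        relationship = fields[1]
        for target in fields[2:]:  # shorthand: one source, many targets
            add_node(target)
            edges.append((source, relationship, target))
    return nodes, edges


def edge_name(source, relationship, target):
    """Edge name in the form sourceName edgeType targetName."""
    return f"{source} {relationship} {target}"


example = "node1 typeA node2\nnode2 typeB node3 node4 node5\nnode0\n"
nodes, edges = parse_sif(example)
print(nodes)
for edge in edges:
    print(edge_name(*edge))
```

Run on the three-line example above, this yields six nodes and four edges: one of type typeA and three of type typeB that share node2 as their source.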

The tag <relationship type> is conventionally one of:

  pp .................. protein-protein interaction
  pd .................. protein -> DNA
  (e.g. a transcription factor binding upstream of a gene it regulates)

  

Any text string will work, but the above are the conventions that have been followed thus far.

Additional interaction types are also possible, but not widely used, e.g.:

  pr .................. protein -> reaction
  rc .................. reaction -> compound
  cr .................. compound -> reaction
  gl .................. genetic lethal relationship
  pm .................. protein-metabolite interaction
  mp .................. metabolite-protein interaction

  

Whole words or concatenated words may also be used to define other types of relationships, e.g. geneFusion, cogInference, pullsDown, activates, degrades, inactivates, inhibits, phosphorylates, upRegulates.

Delimiters. Whitespace (space or tab) is used to delimit the names in the simple interaction file format. However, in some cases spaces are desired in a node name or edge type. The standard is that, if the file contains any tab characters, then tabs are used to delimit the fields and spaces are considered part of the name. If the file contains no tabs, then any spaces are delimiters that separate names (and names cannot contain spaces).

If your network unexpectedly contains no edges and has node names that look like edge names, your file probably contains a stray tab that is fooling the parser. Conversely, if your network has nodes whose names are only part of a full name, you probably meant to use tabs to separate names that contain spaces.
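The delimiter rule, and both symptoms just described, can be reproduced with a short sketch. It is written in Python for illustration only: split_sif_lines is a hypothetical helper, not part of Cytoscape, and it assumes the rule applies to the file as a whole exactly as stated above.

```python
def split_sif_lines(text):
    """Split each non-blank line into fields using the tab-or-space rule."""
    has_tab = "\t" in text  # a single tab anywhere switches the whole file to tab mode
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if has_tab:
            rows.append(line.split("\t"))  # spaces stay inside the names
        else:
            rows.append(line.split())  # every run of spaces separates names
    return rows


# Tab-delimited: names may contain spaces.
print(split_sif_lines("DNA polymerase\tbinds to\thistone H3\n"))

# No tabs at all: the same names are cut into pieces ("half names").
print(split_sif_lines("DNA polymerase pp histone H3\n"))

# One stray tab on the second line: the first, space-delimited line now
# becomes a single field, i.e. a node whose name looks like an edge name.
print(split_sif_lines("node1 pp node2\nnode3\tpp\tnode4\n"))
```

The third call shows why a single stray tab can leave a network with no edges from the space-delimited lines.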

Networks in simple interaction format are often stored in files with a ".sif" extension, and Cytoscape recognizes this extension when browsing a directory for files of this type.

GML FORMAT:

In contrast to SIF, GML is a rich graph format language supported by many other network visualization packages. The GML file format specification is available at:

http://www.infosun.fmi.uni-passau.de/Graphlet/GML/

It is generally not necessary to modify the content of a GML file directly. Once a network is built in SIF format and then laid out, the layout is preserved by saving to and loading from GML. Visual attributes specified in a GML file will result in a new visual style named “Filename.style” when that GML file is loaded.
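To give a sense of what a GML file holds beyond the interactions, the following Python sketch writes a minimal GML document with node positions. It uses only the general GML key-value structure (graph, node, edge, id, label, graphics with x and y); the function name to_gml and the sample coordinates are hypothetical, names containing double quotes are not escaped, and the output is not claimed to match what Cytoscape itself writes.

```python
def to_gml(positions, edges):
    """Render a minimal GML document.

    positions: {node_name: (x, y)} layout coordinates.
    edges: list of (source, relationship_type, target) tuples.
    """
    # GML refers to nodes by integer id, so assign one per node name.
    ids = {name: i for i, name in enumerate(positions, start=1)}
    lines = ["graph ["]
    for name, (x, y) in positions.items():
        lines += [
            "  node [",
            f"    id {ids[name]}",
            f'    label "{name}"',
            "    graphics [",
            f"      x {float(x)}",
            f"      y {float(y)}",
            "    ]",
            "  ]",
        ]
    for source, relationship, target in edges:
        lines += [
            "  edge [",
            f"    source {ids[source]}",
            f"    target {ids[target]}",
            f'    label "{relationship}"',
            "  ]",
        ]
    lines.append("]")
    return "\n".join(lines)


layout = {"node1": (10, 20), "node2": (110, 20)}
print(to_gml(layout, [("node1", "pp", "node2")]))
```

The graphics blocks carry the layout that SIF cannot express, which is why reloading from GML does not require a new layout.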

XGMML FORMAT:

XGMML is the XML evolution of GML and is based on the GML definition. The XGMML file format specification is available at:

http://www.cs.rpi.edu/~puninj/XGMML/

XGMML is now preferred to GML because it offers the flexibility associated with all XML document types. If you're unsure about which to use, choose XGMML.

COMMANDS:

Load and save network files using the File menu of Cytoscape. Network files may also be loaded directly from the command line using the -N option (works for SIF, GML, and XGMML).

FOR EXAMPLE:

To load a sample molecular interaction network in SIF format, use the menu File → Import → Network. In the resulting file dialog box, select the file “sampleData/galFiltered.sif”. After a few seconds, a small network of 331 nodes should appear in the main window. To load the same interaction network in GML format, use the menu File → Import → Network again. In the resulting file dialog box, select the file “sampleData/galFiltered.gml”. Node and edge attribute files, expression data, and extra annotation can also be loaded.

NODE NAMING ISSUES IN CYTOSCAPE:

Typically, genes are represented by nodes, and interactions (or other biological relationships) are represented by edges between nodes. For compactness, a gene also represents its corresponding protein. Nodes may also be used to represent compounds and reactions (or anything else) instead of genes.

If a network of genes or proteins is to be integrated with Gene Ontology (GO) annotation or gene expression data, the gene names must exactly match the names specified in the other data files. We strongly encourage naming genes and proteins by their systematic ORF name or standard accession number; common names may be displayed on the screen for ease of interpretation, so long as these are available to the program in the annotation directory or in a node attribute file. Cytoscape ships with all yeast ORF-to-common name mappings in a synonym table within the annotation/ directory. Other organisms will be supported in the future.

Why do we recommend using standard gene names? All of the external data formats recognized by Cytoscape provide data associated with particular names of particular objects. For example, a network of protein-protein interactions would list the names of the proteins, and the attribute and expression data would likewise be indexed by the name of the object.

The problem is in connecting data from different data sources that don't necessarily use the same name for the same object. For example, genes are commonly referred to by different names, including a formal "location on the chromosome" identifier and one or more common names that are used by ordinary researchers when talking about that gene. Additionally, database identifiers from every database where the gene is stored may be used to refer to a gene (e.g. protein accession numbers from Swiss-Prot). If one data source uses the formal name while a different data source uses a common name or identifier, then Cytoscape must figure out that these two different names really refer to the same biological entity.

Cytoscape has two strategies for dealing with this naming issue, one simple and one more complex. The simple strategy is to assume that every data source uses the same set of names for every object. If this is the case, then Cytoscape can easily connect all of the different data sources.

To handle data sources with different sets of names, as is usually the case when manually integrating gene information from different sources, Cytoscape needs a data server that provides synonym information (see 12. Annotation). A synonym table gives a canonical name for each object in a given organism and one or more recognized synonyms for that object. Note that the synonym table itself defines which set of names are the "canonical" names. For example, in budding yeast the ORF names are commonly used as the canonical names.

If a synonym server is available, then by default Cytoscape will convert every name that appears in a data file to the associated canonical name. Unrecognized names will not be changed. This conversion of names to a common set allows Cytoscape to connect the genes present in different data sources, even if they have different names – as long as those names are recognized by the synonym server.
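The conversion described above amounts to a lookup with a fall-through for unrecognized names. The Python sketch below illustrates the idea only: the table layout, the function names build_lookup and canonicalize, and the two-entry yeast table are assumptions made for the example, not Cytoscape's data server interface, and matching is assumed to be exact and case-sensitive.

```python
def build_lookup(synonym_table):
    """Turn {canonical_name: [synonym, ...]} into {any_known_name: canonical_name}."""
    lookup = {}
    for canonical, synonyms in synonym_table.items():
        lookup[canonical] = canonical  # a canonical name maps to itself
        for synonym in synonyms:
            lookup[synonym] = canonical
    return lookup


def canonicalize(name, lookup):
    """Return the canonical name, or the name unchanged if it is not recognized."""
    return lookup.get(name, name)


# Illustrative table for a single species: yeast ORF names as canonical names.
yeast_synonyms = {
    "YPL248C": ["GAL4"],
    "YBR020W": ["GAL1"],
}
lookup = build_lookup(yeast_synonyms)

network_names = ["GAL4", "YBR020W", "myProtein"]  # names as they appear in a network file
attribute_names = ["YPL248C", "GAL1"]             # names as they appear in another data file

canonical_network = {canonicalize(n, lookup) for n in network_names}
canonical_attributes = {canonicalize(n, lookup) for n in attribute_names}

# After conversion the two sources agree on the shared genes;
# "myProtein" is not in the table and is left as it was.
print(sorted(canonical_network & canonical_attributes))
print(canonicalize("myProtein", lookup))
```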

For this to work, Cytoscape must also be provided with the species to which the objects belong, since the data server requires the species in order to uniquely identify the object referred to by a particular name. This is usually done in Cytoscape by specifying the species name on the command line with the -P option (cytoscape.sh -P "defaultSpeciesName=Saccharomyces cerevisiae") or by editing the properties in the Preferences Dialog (Edit → Preferences...).

The automatic canonicalization of names can be turned off using the -P option (cytoscape.sh -P "canonicalizeName=false") or by editing the properties in the Preferences Dialog (Edit → Preferences...). This canonicalization of names currently does not apply to expression data. Expression data should use the same names as the other data sources or use the canonical names as defined by the synonym table.
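When Cytoscape is launched from a script, the command-line options mentioned in this section can be assembled programmatically. The Python sketch below only builds and prints the argument list; the helper name cytoscape_command is hypothetical, and combining -N with several -P options in one invocation is an assumption rather than something stated in this section.

```python
import shlex


def cytoscape_command(network=None, properties=None, launcher="cytoscape.sh"):
    """Build an argument list from a network file and a dict of properties."""
    args = [launcher]
    if network is not None:
        args += ["-N", network]  # network file: SIF, GML, or XGMML
    for key, value in (properties or {}).items():
        args += ["-P", f"{key}={value}"]  # one -P per property (assumed to be repeatable)
    return args


command = cytoscape_command(
    network="sampleData/galFiltered.sif",
    properties={
        "defaultSpeciesName": "Saccharomyces cerevisiae",
        "canonicalizeName": "false",
    },
)
# shlex.join quotes arguments that contain spaces, such as the species name.
print(shlex.join(command))
```

Keeping the arguments as a list, rather than one string, avoids quoting mistakes with values that contain spaces if the list is later handed to a process launcher.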