User Manual

ForCon 1.0 manual



      Installing ForCon

      After downloading ForCon, you have a .zip file. Unzip it using WinZip or PKUNZIP in a temporary directory. Double-click the setup.exe file and follow the instructions.

      General description

      ForCon is a user-friendly software tool for the easy conversion of nucleic and amino acid sequence alignments into different formats.

      At the moment, ForCon can both read and write the following formats (or the formats used by the following software packages):

      • CLUSTAL
      • EMBL
      • FASTA
      • GCG/MSF
      • Hennig86
      • MEGA
      • NBRF/PIR
      • PAUP/Nexus
      • Parsimony Jackknifer
      • PHYLIP
      • TREECON


      Software packages not included in the list are usually able to read one of the formats mentioned. For the publication of sequence alignments, a format with codon positions can be generated ("Pretty").

      ForCon supports both sequential and interleaved formats (see the next section).

      File formats

      The use of correct formats is extremely important: the program cannot interpret a file that deviates from the expected format. For this reason, a description and an example of each format are given below.

      Overall, two major types of formats exist: interleaved and noninterleaved (sequential). In the interleaved format, sequences are written in the form of an alignment:

      Sequence 1   AGUCGAGUC---GCAGAAACGCAUGAC
      Sequence 2   AGUCGCGUCG--GCAGAAACGCAUGAC
      Sequence 3   AGUCGCGUCG--GCAGAUACGCAUCAC
      Sequence 4   AGUCGCGUCGAAGCAGA--CGCAUGAC

      (Sequence 1) -GACCACAUUUU-CCUUGCAAAG
      (Sequence 2) GGACCACAUCAU-CCUUGCAAAG
      (Sequence 3) GGAC-ACAUCAUCCCUCGCAGAG
      (Sequence 4) GGACCACAUCAUCCCUUGCAGAG

      In the noninterleaved formats, sequences are written one after another:

      Sequence 1   AGUCGAGUC---GCAGAAACGCAUGAC
      -GACCACAUUUU-CCUUGCAAAG
      Sequence 2   AGUCGCGUCG--GCAGAAACGCAUGAC
      GGACCACAUCAU-CCUUGCAAAG
      Sequence 3   AGUCGCGUCG--GCAGAUACGCAUCAC
      GGAC-ACAUCAUCCCUCGCAGAG
      Sequence 4   AGUCGCGUCGAAGCAGA--CGCAUGAC
      GGACCACAUCAUCCCUUGCAGAG

      Usually the symbol for missing data is 'N' (nucleotides) or 'X' (proteins). For insertions/deletions ('gaps') the most commonly used symbol is a hyphen '-'.
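The difference between the two layouts can be illustrated with a minimal Python sketch. It is not ForCon's own code: the function names `to_interleaved` and `to_sequential` and the `block` parameter are hypothetical, and the sequences are assumed to be aligned (equal length).

```python
def to_interleaved(seqs, block=27):
    """Write (name, sequence) pairs as blocks of `block` columns.

    Every block repeats the names, as in the interleaved example above.
    """
    width = max(len(name) for name, _ in seqs) + 2  # name column width
    length = len(seqs[0][1])
    blocks = []
    for start in range(0, length, block):
        lines = [f"{name:<{width}}{seq[start:start + block]}" for name, seq in seqs]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def to_sequential(seqs, block=27):
    """Write each sequence completely before starting the next one."""
    out = []
    for name, seq in seqs:
        out.append(name)
        out.extend(seq[i:i + block] for i in range(0, len(seq), block))
    return "\n".join(out)


if __name__ == "__main__":
    demo = [("s1", "AAAACCCC"), ("s2", "GGGGUUUU")]
    print(to_interleaved(demo, block=4))
    print()
    print(to_sequential(demo, block=4))
```

The `block` argument plays the same role as the blocksize/cutoff value that ForCon asks for during a conversion.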

      Regarding the different formats:

        1) CLUSTAL

        CLUSTAL is a program for creating multiple sequence alignments.
        The CLUSTAL format can be described as follows:

        - the word CLUSTAL should be on the first non-space line of the file
        - the alignment is displayed in blocks of a fixed length
        - each line in the block corresponds to one sequence
        - the line starts with the sequence name (of any length), followed by at least one space character
        - then the sequence itself is displayed (upper- or lowercase; '-' marks gaps),
          optionally followed by the residue number at the end of the line
        - in between blocks: a line with conservation info (ForCon only writes stars for now; for more info:
          http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/#G)

        Example :

        CLUSTAL W (1.74) multiple sequence alignment

        Homo_sapiens        AGUCGAGUC---GCAGAAAC
        Pan_paniscus        AGUCGCGUCG--GCAGAAAC
        Gorilla_gorilla     AGUCGCGUCG--GCAGAUAC
        Pongo_pigmaeus      AGUCGCGUCGAAGCAGA--C
                            ***** ***   *****  *

        Homo_sapiens        GCAUGAC-GACCACAUUUU-
        Pan_paniscus        GCAUGACGGACCACAUCAU-
        Gorilla_gorilla     GCAUCACGGAC-ACAUCAUC
        Pongo_pigmaeus      GCAUGACGGACCACAUCAUC
                            **** ** *** ****  *

        Homo_sapiens        CCUUGCAAAG
        Pan_paniscus        CCUUGCAAAG
        Gorilla_gorilla     CCUCGCAGAG
        Pongo_pigmaeus      CCUUGCAGAG
                            *** *** **
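The star line under each block marks the columns in which all sequences carry the same residue. A minimal Python sketch of that rule follows. The function name `conservation_line` is hypothetical, and treating a column that contains a gap as not conserved is an assumption, not documented ForCon behaviour.

```python
def conservation_line(rows):
    """Return '*' for each fully conserved column of one alignment block.

    rows: the sequence strings of one block, all of the same length.
    """
    marks = []
    for column in zip(*rows):
        residues = {c.upper() for c in column}  # case-insensitive comparison
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    block = [
        "AGUCGAGUC---GCAGAAAC",
        "AGUCGCGUCG--GCAGAAAC",
        "AGUCGCGUCG--GCAGAUAC",
        "AGUCGCGUCGAAGCAGA--C",
    ]
    print(conservation_line(block))
```

Applied to the first block of the example, this reproduces the star pattern shown there.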

        2) EMBL

        The EMBL database is the primary nucleotide database in Europe.
        The format is described in detail at: http://www.ebi.ac.uk/ebi_docs/embl_db/usrman/structure_entry.html

        Multiple sequence files also follow these rules. They are separated by the '//' that ends each entry.
        Only the information used in multiple sequence alignments is used by ForCon.

        Example ( as generated by ForCon; for input, any EMBL file is allowed ):

        ID   Homo sapiens
        SQ   Sequence 50 BP;
             AGUCGAGUC- --GCAGAAAC GCAUGAC-GA CCACAUUUU- CCUUGCAAAG
        //
        ID   Pan paniscus
        SQ   Sequence 50 BP;
             AGUCGCGUCG --GCAGAAAC GCAUGACGGA CCACAUCAU- CCUUGCAAAG
        //
        ID   Gorilla gorilla
        SQ   Sequence 50 BP;
             AGUCGCGUCG --GCAGAUAC GCAUCACGGA C-ACAUCAUC CCUCGCAGAG
        //
        ID   Pongo pigmaeus
        SQ   Sequence 50 BP;
             AGUCGCGUCG AAGCAGA--C GCAUGACGGA CCACAUCAUC CCUUGCAGAG
        //
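The layout of one entry in the example (ID line, SQ line, sequence in groups of ten characters, closing '//') can be sketched in Python as follows. The function `embl_entry` and its parameters are hypothetical. The limit of six groups per line follows the usual EMBL flat-file convention and is an assumption about what ForCon writes for longer sequences.

```python
def embl_entry(name, seq, group=10, groups_per_line=6):
    """Format one aligned sequence as a minimal EMBL-style entry."""
    chunks = [seq[i:i + group] for i in range(0, len(seq), group)]
    lines = [f"ID   {name}", f"SQ   Sequence {len(seq)} BP;"]
    for i in range(0, len(chunks), groups_per_line):
        # sequence lines start with five blanks; groups are separated by one blank
        lines.append("     " + " ".join(chunks[i:i + groups_per_line]))
    lines.append("//")  # end-of-entry marker
    return "\n".join(lines)


if __name__ == "__main__":
    print(embl_entry("Homo sapiens",
                     "AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG"))
```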

        3) FASTA

        The FASTA program is used for database searches.
        The format is described at: http://www.ncbi.nlm.nih.gov/BLAST/fasta.html

        Example:

        >Homo sapiens
        AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG
        >Pan paniscus
        AGUCGCGUCG--GCAGAAACGCAUGACGGACCACAUCAU-CCUUGCAAAG
        >Gorilla gorilla
        AGUCGCGUCG--GCAGAUACGCAUCACGGAC-ACAUCAUCCCUCGCAGAG
        >Pongo pigmaeus
        AGUCGCGUCGAAGCAGA--CGCAUGACGGACCACAUCAUCCCUUGCAGAG
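Reading this format takes only a few lines. The following Python sketch is one possible reader, not ForCon's implementation; the function name `read_fasta` is hypothetical. It accepts sequences that are wrapped over several lines.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:].strip(), []  # start a new record
        elif name is not None:
            parts.append(line)  # sequence data may span several lines
    if name is not None:
        records.append((name, "".join(parts)))
    return records


if __name__ == "__main__":
    sample = ">Homo sapiens\nAGUCGAGUC---\nGCAG\n>Pan paniscus\nAGUCGCGUCG--GCAG\n"
    for name, seq in read_fasta(sample):
        print(name, seq)
```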

        4) GCG/MSF

        The Multiple Sequence File format by the Genetics Computer Group Wisconsin package is thoroughly described in their user manual. In brief:

        - on the first line: an optional file type identifier such as '!!AA_MULTIPLE_ALIGNMENT 1.0',
          '!!NA_MULTIPLE_ALIGNMENT 1.0' or 'PileUp'
        - second line: optional title/description
        - dividing line with the obligatory 'MSF: sequence length', a checksum value and two periods '..'
        - name/weight section with checksum
        - separating line : //
        - alignment : interleaved

        Example (as generated by ForCon):

        !!NA_MULTIPLE_ALIGNMENT 1.0
        Four anthropoidea
        MSF: 50  Type: N  Check: 2666 ..

        Name: Homo_sapiens     Len: 50   Check: 8318   Weight: 1.00
        Name: Pan_paniscus     Len: 50   Check: 7854   Weight: 1.00
        Name: Gorilla_gorilla  Len: 50   Check: 7778   Weight: 1.00
        Name: Pongo_pigmaeus   Len: 50   Check: 8716   Weight: 1.00

        //

        Homo_sapiens        AGUCGAGUC...GCAGAAAC
        Pan_paniscus        AGUCGCGUCG..GCAGAAAC
        Gorilla_gorilla     AGUCGCGUCG..GCAGAUAC
        Pongo_pigmaeus      AGUCGCGUCGAAGCAGA..C

        Homo_sapiens        GCAUGAC.GACCACAUUUU.
        Pan_paniscus        GCAUGACGGACCACAUCAU.
        Gorilla_gorilla     GCAUCACGGAC.ACAUCAUC
        Pongo_pigmaeus      GCAUGACGGACCACAUCAUC

        Homo_sapiens        CCUUGCAAAG
        Pan_paniscus        CCUUGCAAAG
        Gorilla_gorilla     CCUCGCAGAG
        Pongo_pigmaeus      CCUUGCAGAG

        5) Hennig86

        The parsimony phylogeny program by Farris uses an unusual format: the different IUPAC nucleotide letter codes are replaced by a number code. ForCon uses the following standard translation:

        A     to:  0
        U,T   to:  1
        G     to:  2
        C     to:  3
        N     to:  ?

        When converting from the Hennig86 format, the user will be prompted to enter his/her own translation preferences.
        The format is sequential. The first line contains the word 'xread', which is used to recognize the file. The following line can hold a title/description between single quotes. The third line gives the sequence length and the number of sequences. After the alignment (in sequential format), the file is closed by a semicolon (;). The symbol used for missing data is '?'. There is no separate character for gaps.

        Example:

        xread
        ' Four anthropoidea '
        50 4
        Homo sapiens
        132431324???341311143412314?31441412222?4422341113
        Pan paniscus
        1324343243??341311143412314331441412412?4422341113
        Gorilla gorilla
        1324343243??3413121434124143314?141241244424341313
        Pongo pigmaeus
        13243432431134131??4341231433144141241244422341313
        ;

        6) MEGA

        The Molecular Evolutionary Genetics Analysis (MEGA) program by Kumar, Tamura & Nei is a tree construction program based on distance and parsimony methods.
        The format is described in the MEGA manual. In brief:
        The format exists in an interleaved and a noninterleaved variant.
        Regardless of the variant, the file always starts with the word '#mega' on the first line. On the following line, a title can be given, preceded by the term 'TITLE:'. Between the title and the sequence data, a description or extra comments can be placed. Comments are allowed even inside the sequences, between double quotes (""). The sequence names are preceded by a '#'.

        Examples:

        #mega
        TITLE: Four Anthropoidea

        The interleaved format

        #Homo_sapiens        AGUCGAGUC---GCAGAAACGCAUGAC-GACC
        #Pan_paniscus        AGUCGCGUCG--GCAGAAACGCAUGACGGACC
        #Gorilla_gorilla     AGUCGCGUCG--GCAGAUACGCAUCACGGAC-
        #Pongo_pigmaeus      AGUCGCGUCGAAGCAGA--CGCAUGACGGACC

        #Homo_sapiens        ACAUUUU-CCUUGCAAAG
        #Pan_paniscus        ACAUCAU-CCUUGCAAAG
        #Gorilla_gorilla     ACAUCAUCCCUCGCAGAG
        #Pongo_pigmaeus      ACAUCAUCCCUUGCAGAG

        ---

        #mega
        TITLE: Four Anthropoidea

        The noninterleaved format

        #Homo_sapiens
        AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG
        #Pan_paniscus
        AGUCGCGUCG--GCAGAAACGCAUGACGGACCACAUCAU-CCUUGCAAAG
        #Gorilla_gorilla
        AGUCGCGUCG--GCAGAUACGCAUCACGGAC-ACAUCAUCCCUCGCAGAG
        #Pongo_pigmaeus
        AGUCGCGUCGAAGCAGA--CGCAUGACGGACCACAUCAUCCCUUGCAGAG
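Because every sequence name starts with '#', one reader can handle both MEGA variants. The Python sketch below is an illustration under simplifying assumptions, not ForCon's parser: the function name `read_mega` is hypothetical, names are assumed to contain no blanks, quoted comments are assumed to sit on a single line, and every line before the first '#name' line (other than the title) is treated as description.

```python
import re


def read_mega(text):
    """Parse interleaved or noninterleaved MEGA text.

    Returns (title, {name: sequence}).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip().lower() != "#mega":
        raise ValueError("not a MEGA file: the first line must be '#mega'")
    title, current, parts = "", None, {}
    for line in lines[1:]:
        line = re.sub(r'"[^"]*"', "", line).strip()  # drop quoted comments
        if not line:
            continue
        if line.upper().startswith("TITLE:"):
            title = line[len("TITLE:"):].strip()
        elif line.startswith("#"):
            name, _, rest = line[1:].partition(" ")
            current = name
            parts.setdefault(name, []).append(rest.replace(" ", ""))
        elif current is not None:
            parts[current].append(line.replace(" ", ""))
        # any other line before the first name is description text
    return title, {name: "".join(chunks) for name, chunks in parts.items()}


if __name__ == "__main__":
    sample = ("#mega\nTITLE: Four Anthropoidea\n\nThe interleaved format\n\n"
              "#Homo_sapiens   AGUC\n#Pan_paniscus   AGUG\n\n"
              "#Homo_sapiens   GGAA\n#Pan_paniscus   GG-A\n")
    print(read_mega(sample))
```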

        7) NBRF/PIR

        The format of this large protein database is similar to the FASTA format. Each sequence, though, starts with a '>[sequence type code];', followed by the sequence name and a description ( on the next line ).
        This description is ignored by ForCon.
        On the following line the actual sequence is written and is ended with an asterisk (*).

        The sequence type codes are as follows:

        Code   Sequence type
        P1     Protein (complete)
        F1     Protein (fragment)
        DL     DNA (linear)
        DC     DNA (circular)
        RL     RNA (linear)
        RC     RNA (circular)
        N3     tRNA
        N1     other functional RNA

        ForCon accepts all these codes, but only writes the codes P1, DL and RL.
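A record in this format can be read as sketched below in Python. This is an illustration of the layout described above, not ForCon's code; the function name `read_pir` is hypothetical. The description line is skipped, and the sequence is read up to the closing asterisk.

```python
def read_pir(text):
    """Parse NBRF/PIR text into (type code, name, sequence) tuples."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if not line.startswith(">"):
            continue
        code, _, name = line[1:].partition(";")  # e.g. 'RL' and 'Homo sapiens'
        next(lines, "")                           # description line, ignored
        parts = []
        for seq_line in lines:                    # continues on the same iterator
            seq_line = seq_line.strip()
            parts.append(seq_line.rstrip("*"))
            if seq_line.endswith("*"):            # asterisk ends the sequence
                break
        records.append((code, name.strip(), "".join(parts)))
    return records


if __name__ == "__main__":
    sample = (">RL;Homo sapiens\nHomo sapiens RNA sequence\nAGUC--\nGCA*\n"
              ">P1;x\nsome protein\nMK*\n")
    print(read_pir(sample))
```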

        Example :

        >RL;Homo sapiens
        Homo sapiens RNA sequence
        AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG*
        >RL;Pan paniscus
        Pan paniscus RNA sequence
        AGUCGCGUCG--GCAGAAACGCAUGACGGACCACAUCAU-CCUUGCAAAG*
        >RL;Gorilla gorilla
        Gorilla gorilla RNA sequence
        AGUCGCGUCG--GCAGAUACGCAUCACGGAC-ACAUCAUCCCUCGCAGAG*
        >RL;Pongo pigmaeus
        Pongo pigmaeus RNA sequence
        AGUCGCGUCGAAGCAGA--CGCAUGACGGACCACAUCAUCCCUUGCAGAG*

        8) PAUP/NEXUS

        The NEXUS format is used by several programs, such as PAUP, MacClade and Spectrum.
        For a detailed description of the format, see the article by Maddison et al.:

        Maddison, D.R., Swofford, D.L., Maddison, W.P. (1997) NEXUS: An extensible file format for systematic information. Syst. Biol. 46, 590-621.

        ForCon is limited in the use of this extremely versatile format. Only the information on the alignment itself is used and generated, although any NEXUS file can be used as input file. The program will ignore all information that is not used.
        Here is an example of a NEXUS file generated by the ForCon program:

        #NEXUS
        [TITLE: Four Anthropoidea]

        begin data;
        dimensions ntax=4 nchar=50;
        format datatype=RNA missing=N gap=-;

        matrix
        Homo_sapiens
        AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG
        Pan_paniscus
        AGUCGCGUCG--GCAGAAACGCAUGACGGACCACAUCAU-CCUUGCAAAG
        Gorilla_gorilla
        AGUCGCGUCG--GCAGAUACGCAUCACGGAC-ACAUCAUCCCUCGCAGAG
        Pongo_pigmaeus
        AGUCGCGUCGAAGCAGA--CGCAUGACGGACCACAUCAUCCCUUGCAGAG
        ;
        endblock;
        begin assumptions;
        options deftype=unord;

        ---

        #NEXUS
        [TITLE: Four Anthropoidea]

        begin data;
        dimensions ntax=4 nchar=50;
        format interleave datatype=RNA missing=N gap=-;

        matrix
        Homo_sapiens        AGUCGAGUC---GCAGAAACGCAUGAC-GAC
        Pan_paniscus        AGUCGCGUCG--GCAGAAACGCAUGACGGAC
        Gorilla_gorilla     AGUCGCGUCG--GCAGAUACGCAUCACGGAC
        Pongo_pigmaeus      AGUCGCGUCGAAGCAGA--CGCAUGACGGAC

        Homo_sapiens        CACAUUUU-CCUUGCAAAG
        Pan_paniscus        CACAUCAU-CCUUGCAAAG
        Gorilla_gorilla     -ACAUCAUCCCUCGCAGAG
        Pongo_pigmaeus      CACAUCAUCCCUUGCAGAG
        ;

        endblock;
        begin assumptions;
        options deftype=unord;
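The data block of the sequential example can be generated with a short Python sketch like the one below. It covers only the data block, not the assumptions block, and it is not ForCon's implementation: the function name `nexus_data_block` and its parameters are hypothetical. Blanks in names are replaced by underscores, as in the examples.

```python
def nexus_data_block(seqs, datatype="RNA", missing="N", gap="-"):
    """Build a sequential NEXUS data block from (name, sequence) pairs."""
    nchar = len(seqs[0][1])
    if any(len(seq) != nchar for _, seq in seqs):
        raise ValueError("all sequences must have the same length")
    lines = [
        "#NEXUS",
        "",
        "begin data;",
        f"dimensions ntax={len(seqs)} nchar={nchar};",
        f"format datatype={datatype} missing={missing} gap={gap};",
        "",
        "matrix",
    ]
    for name, seq in seqs:
        lines.append(name.replace(" ", "_"))  # no blanks in taxon names
        lines.append(seq)
    lines += [";", "endblock;"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(nexus_data_block([("Homo sapiens", "AGUC"), ("Pan paniscus", "AG-C")]))
```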

        9) Parsimony Jackknifer

        This program by Farris is a parsimony program that also implements the jackknife method to test the reliability of branches.
        The format is similar to the MEGA format. The first line holds a title/description between single quotes. The alignment can be written in sequential or interleaved format, but the sequence names have to be placed between parentheses. Blanks are not allowed in the names; they should be replaced by underscores ( _ ). The file ends with a semicolon.

        Examples:

        ' Four Anthropoidea '
        (Homo_sapiens)        AGUCGAGUC---GCAGAAACGCAUGAC-GAC
        CACAUUUU-CCUUGCAAAG
        (Pan_paniscus)        AGUCGCGUCG--GCAGAAACGCAUGACGGAC
        CACAUCAU-CCUUGCAAAG
        (Gorilla_gorilla)     AGUCGCGUCG--GCAGAUACGCAUCACGGAC
        -ACAUCAUCCCUCGCAGAG
        (Pongo_pigmaeus)      AGUCGCGUCGAAGCAGA--CGCAUGACGGAC
        CACAUCAUCCCUUGCAGAG
        ;

        ---

        ' Four Anthropoidea '
        (Homo_sapiens)        AGUCGAGUC---GCAGAAACGCAUGAC-GAC
        (Pan_paniscus)        AGUCGCGUCG--GCAGAAACGCAUGACGGAC
        (Gorilla_gorilla)     AGUCGCGUCG--GCAGAUACGCAUCACGGAC
        (Pongo_pigmaeus)      AGUCGCGUCGAAGCAGA--CGCAUGACGGAC
        (Homo_sapiens)        CACAUUUU-CCUUGCAAAG
        (Pan_paniscus)        CACAUCAU-CCUUGCAAAG
        (Gorilla_gorilla)     -ACAUCAUCCCUCGCAGAG
        (Pongo_pigmaeus)      CACAUCAUCCCUUGCAGAG
        ;

        10) PHYLIP

        The PHYLIP package is a tree construction package that implements parsimony, distance and maximum likelihood.
        The format is straightforward: the first line gives the number of sequences and their length (in characters). The alignment then follows in interleaved or sequential format. Sequence names may contain blanks, but may not be longer than 10 characters. The interleaved format differs slightly from the other interleaved formats in that the sequence names appear only in the first block, whereas the other formats repeat the names in every block.

        For example:

        4 50
        Homo sapie AGUCGAGUC---GCAGAAACGCAUGAC-GACC
        Pan panisc AGUCGCGUCG--GCAGAAACGCAUGACGGACC
        Gorilla go AGUCGCGUCG--GCAGAUACGCAUCACGGAC-
        Pongo pigm AGUCGCGUCGAAGCAGA--CGCAUGACGGACC

        ACAUUUU-CCUUGCAAAG
        ACAUCAU-CCUUGCAAAG
        ACAUCAUCCCUCGCAGAG
        ACAUCAUCCCUUGCAGAG

        The sequential format looks like this:

        4 50
        Homo sapie AGUCGAGUC---GCAGAAACGCAUGAC-GACC
        ACAUUUU-CCUUGCAAAG
        Pan panisc AGUCGCGUCG--GCAGAAACGCAUGACGGACC
        ACAUCAU-CCUUGCAAAG
        Gorilla go AGUCGCGUCG--GCAGAUACGCAUCACGGAC-
        ACAUCAUCCCUCGCAGAG
        Pongo pigm AGUCGCGUCGAAGCAGA--CGCAUGACGGACC
        ACAUCAUCCCUUGCAGAG

        You can find more info in the PHYLIP package documentation.
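The two PHYLIP-specific rules (names limited to 10 characters, names only in the first block) can be sketched in Python as follows. The function name `phylip_interleaved` and the `block` parameter are hypothetical. The single blank between the name column and the sequence follows the example above and is an assumption about ForCon's output.

```python
def phylip_interleaved(seqs, block=32):
    """Write (name, sequence) pairs in interleaved PHYLIP layout."""
    nchar = len(seqs[0][1])
    lines = [f"{len(seqs)} {nchar}"]  # header: number of sequences, length
    for start in range(0, nchar, block):
        if start:
            lines.append("")  # blank line between blocks
        for name, seq in seqs:
            chunk = seq[start:start + block]
            if start == 0:
                # names are cut or padded to exactly 10 characters
                lines.append(f"{name[:10]:<10} {chunk}")
            else:
                lines.append(chunk)  # later blocks carry no names
    return "\n".join(lines)


if __name__ == "__main__":
    print(phylip_interleaved([("Homo sapiens", "AAAACCCC"), ("Pan", "GGGGUUUU")], block=4))
```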

        11) TREECON

        TREECON is a software package for the construction and drawing of phylogenetic trees on the basis of evolutionary distances.
        A full description of the TREECON format can be found right here.

        Example:

        50
        Homo sapiens
        AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG
        Pan paniscus
        AGUCGCGUCG--GCAGAAACGCAUGACGGACCACAUCAU-CCUUGCAAAG
        Gorilla gorilla
        AGUCGCGUCG--GCAGAUACGCAUCACGGAC-ACAUCAUCCCUCGCAGAG
        Pongo pigmaeus
        AGUCGCGUCGAAGCAGA--CGCAUGACGGACCACAUCAUCCCUUGCAGAG

      Walk-through

      After successfully installing ForCon, run the forcon.exe executable (or just double-click the shortcut).

      The start-up screen appears:

      Press the 'Enter' button to continue.
      First, you will be asked to specify the formats of the input and output files:



      Just select the format from each list and press OK.
      After doing this, you will be prompted to specify the input file.



      After this, the program will ask you for the blocksize/cutoff. Here you can specify the number of characters that each block/sequence line will consist of.

      Fill in the text box and press OK.

      If your input file was a Hennig86 file, you are asked for the 'translation':

      So, in this case, every 0 is translated into an A, 1 to T, etc.
      Make sure to enter just one character in each box!

      Specify the file you would like to save the new alignment in:

      After doing this, you can select the sequences you would like to convert.

      Click a name on the list to select that sequence. To select multiple sequences, hold down the Control key on your keyboard while selecting. Large blocks of sequences can be selected using the Shift key. Use the Select All button to select all the sequences at once. The Deselect All button does the opposite.
      After you have made your selection, press OK.

      You now get the chance to select certain positions of the alignment:

      You can choose between 4 options:

      • use all of the alignment ( no change )
      • use the 1st and 2nd codon positions, e.g. AAU GCU ACU ACG becomes AAGCACAC
      • only use the third codon positions, e.g. AAU GCU ACU ACG becomes UUUG
      • use specific user-defined codon positions: to cut parts out of your alignment; areas should be separated by commas.
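The effect of these options can be sketched in Python as follows. Both function names are hypothetical, and the sequence is assumed to start at the first codon position. The list above only says that areas are separated by commas, so the 'start-end' notation with 1-based, inclusive positions used in `select_regions` is an assumption, not ForCon's documented input syntax.

```python
def select_codon_positions(seq, positions):
    """Keep only the given codon positions (1, 2 and/or 3) of a sequence."""
    return "".join(base for i, base in enumerate(seq) if i % 3 + 1 in positions)


def select_regions(seq, spec):
    """Keep comma-separated areas such as '1-3,10-12' (1-based, inclusive)."""
    kept = []
    for area in spec.split(","):
        start, _, end = area.strip().partition("-")
        end = end or start  # a single position such as '5'
        kept.append(seq[int(start) - 1:int(end)])
    return "".join(kept)


if __name__ == "__main__":
    seq = "AAUGCUACUACG"
    print(select_codon_positions(seq, {1, 2}))  # 1st and 2nd positions
    print(select_codon_positions(seq, {3}))     # third positions only
    print(select_regions(seq, "1-3,10-12"))
```

With the sequence AAU GCU ACU ACG from the list, the first two calls give AAGCACAC and UUUG, matching the examples there.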
      Just check the button of your choice, press OK, and we're off to:

      The end.
      You can find your file in the directory you specified earlier.

      Disclaimer

      This software is distributed freely 'as-is'. The programmer cannot be held responsible for any damage that may occur. You can distribute the program among your friends, colleagues, etc. in the original .ZIP file. Please always register your program, if you should get a copy. It's free, won't take much of your time, and you will be notified of any new releases or bugs.

      Yes, please, register me !

      If you encounter any bugs, please report them to me: jerae@gengenp.rug.ac.be

      Acknowledgements

      The programmers would like to thank (in random order):

      Julie Thompson, for her help on the CLUSTAL format
      Rob Verschraegen, for his programming tips
      Yves Van de Peer, for all his help
      Alex Dong Li, for his help on the GCG/MSF format
      All others who helped me in any way




      Jeroen Raes
      Research group of Bioinformatics
      Department of Plant Genetics   Tel: 32.9.264 87 20   Fax: 32.9.264 50 08
      University of Ghent, K.L. Ledeganckstraat 35, B-9000 GENT, Belgium

      Laboratoire Associe de l'INRA
      Vlaams interuniversitair Instituut voor Biotechnologie (VIB)

      jerae@gengenp.rug.ac.be










