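This page contains programs to:
  - standardize outputs from prediction programs
  - generate consensus predictions from two standard prediction files
  - filter a standard file on the score or probability
All these programs are written in Perl.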

Programs to standardize outputs from prediction programs
Since every gene prediction program presents its results in its own way, we wrote programs that convert these outputs into a uniform format.
All these programs were written between the end of 1998 and the beginning of 1999. If the output format of a prediction program has changed since then, the corresponding program will either no longer work at all or, more dangerously, give you false results.

In the following, SEQ is your sequence name, accession number, or other identifier.

These programs all need the file, in the same directory.
Predictor        Author       INPUT file                          Generated file

Splice site predictors
NetGene2         Brunak       data.list* or SEQnetgene2.txt       dataset_netgene2ST
NetPlantGene     Brunak       data.list* or SEQnetplantgene.txt   dataset_netplantgeneST
SplicePredictor  Brendel      data.list* or SEQsplicep1.txt**     dataset_splicepredictor12ST
SPL              Solovyev     data.list* or SEQspl1.txt**         dataset_splST

Exon predictors
MZEF             Zhang        data.list* or SEQmzef1.txt**        dataset_mzefST
Fexa             Solovyev     data.list* or SEQfex1.txt           dataset_fexST
Grail            Uberbacher   data.list* or SEQgrail.txt          dataset_grailST
GeneMark         Borodovsky   data.list* or SEQ.tfa.lst           dataset_genemarkST
SEQgm.txtST                   data.list* or SEQ.tfa.lst           dataset_genemarkregionST

Gene Modelers
GeneMark.hmm     Borodovsky   data.list* or SEQgmhmm.txt          dataset_gmhmmST
FgeneP           Solovyev     data.list* or SEQfgene1.txt         dataset_fgenepWST
Fgenea           Solovyev     data.list* or SEQfgene1.txt**       dataset_fgeneaST
GENSCAN          Burge        data.list* or SEQgs.txt             dataset_genscanST
NetStart         Brunak       data.list* or

*data.list = file containing, on each line:
SEQ SEQ_length
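
As an illustration of the data.list layout described above, here is a minimal sketch in Python (not the original Perl scripts). It assumes the two fields are separated by whitespace; the function name and the sample sequence names are hypothetical.

```python
def parse_data_list(text):
    """Map each sequence name (SEQ) to its length (SEQ_length)."""
    lengths = {}
    for line in text.splitlines():
        fields = line.split()  # assumption: whitespace-separated fields
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 'SEQ SEQ_length', got: {line!r}")
        name, length = fields
        lengths[name] = int(length)
    return lengths


# Hypothetical content of a data.list file.
example = "seq01 15230\nseq02 8764\n"
print(parse_data_list(example))
```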

**the prediction program must be run once on the direct strand =>SEQprog1.txt
                                                    and once on the reverse one =>SEQprog2.txt

USAGE: -l <data.list> <prediction_file> <seq_length>

Output format (generated file ST):

The standard output is presented as a table in which each line is a new prediction; the columns are described below.
Contig            Name of the genomic sequence, i.e. the name you used in place of SEQ.
Type              Single (Sngl), Initial (Init), Internal (Intr) or Terminal (Term) exon.
Strand            Direct (+) or complement (-).
Lend              5' end of the exon (reading on the direct strand).
Rend              3' end of the exon (reading on the direct strand).
Length            Exon length.
Phase             Only in the case of GeneMark.hmm and NetGene2.
Frame             See below.
Ac                Acceptor splice site position.
Do                Donor splice site position.
Proba or Score    Probability or score given by the program.
    A few explanations:

  1. Coordinates

    All coordinates are given according to the 5'->3' direction on the direct strand.

    Thus, for Grail, Fgenea, Fexa, and SPL, coordinates on the complementary strand were "reversed" so that they follow the direct strand numbering.

    For NetGene2 and NetPlantGene, on the complementary strand, we read the 3'->5' coordinates.

    Acceptor and donor site positions were defined as in NetGene2 and SplicePredictor:

              Strand (+)    Strand (-)
    Ac        Lend - 1      Rend + 1
    Do        Rend + 1      Lend - 1

    Corrections have been performed on the NetPlantGene output:
    Strand (-) Do = Do_npg + 1, Ac = Ac_npg + 1
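
The acceptor and donor definition above can be sketched as follows, in Python rather than the original Perl. This assumes the four values read as Ac = Lend - 1 and Do = Rend + 1 on the direct strand, and Ac = Rend + 1 and Do = Lend - 1 on the complementary strand; the function name is hypothetical and the NetPlantGene correction is not included.

```python
def splice_site_positions(strand, lend, rend):
    """Return (Ac, Do) for an exon, in direct-strand coordinates."""
    if strand == "+":
        # Acceptor lies just before the exon, donor just after it.
        return lend - 1, rend + 1
    if strand == "-":
        # On the complementary strand the exon is read from Rend to Lend.
        return rend + 1, lend - 1
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


print(splice_site_positions("+", 100, 200))
print(splice_site_positions("-", 100, 200))
```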

  2. Phase

    The Phase is the position in a codon of the first nucleotide of the exon considered
    (so the Phase of the first exon is always 1).

    In GeneMark.hmm, the phase is x from the Edxy value, and y from the Erxy value.

    In NetGene2, the phase is given and corresponds to the position in the codon where splicing occurs.

  3. Frame

    As the programs do not all use the same definition of the Frame, we defined our Frame according to the one found when doing a map.
    Thus our definition of the frame is:
Lg(i) % 3 = X

Strand (+)
    L'(i+1) = L(i+1) - X
    F(i+1)  = (L'(i+1) - 1) % 3 + 1
    Lg(i+1) = R(i+1) - L'(i+1) + 1

Strand (-)
    R'(i+1) = R(i+1) + X
    F(i+1)  = (Ltot - R'(i+1)) % 3 + 1
    Lg(i+1) = R'(i+1) - L(i+1) + 1

L(i) and R(i) are Lend and Rend of exon(i) respectively.
Lg(i) and F(i) are the length and the frame of exon (i).
Ltot is the total length of the genomic sequence given as input to the program.
% is the modulo
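
A minimal sketch of this frame definition, in Python rather than the original Perl. It assumes that on the direct strand L'(i+1) = L(i+1) - X, the sign being inferred by symmetry with R'(i+1) = R(i+1) + X on the complementary strand. The function and parameter names are hypothetical.

```python
def next_exon_frame(strand, prev_length, lend, rend, ltot):
    """Return (F(i+1), Lg(i+1)) for exon i+1.

    prev_length is Lg(i); lend and rend are L(i+1) and R(i+1);
    ltot is the total length of the genomic sequence.
    """
    x = prev_length % 3  # Lg(i) % 3 = X
    if strand == "+":
        l_adj = lend - x  # L'(i+1), assumed sign (see lead-in)
        frame = (l_adj - 1) % 3 + 1
        length = rend - l_adj + 1
    elif strand == "-":
        r_adj = rend + x  # R'(i+1)
        frame = (ltot - r_adj) % 3 + 1
        length = r_adj - lend + 1
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return frame, length


print(next_exon_frame("+", 100, 201, 300, 1000))
print(next_exon_frame("-", 100, 201, 300, 1000))
```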

To agree with this definition, in some cases we modified the frame given by a program, or calculated it from the program output, as explained in this table:

GENSCAN
    Strand (+): F = GS_frame + 1
    Strand (-): F = 4 + Ltot%3 - GS_frame
                if (F == 6) => F = 3
                if (F == 5) => F = 2
                if (F == 4) => F = 1

GENEMARK
    Strand (+): F = GM_frame
    Strand (-): F = 4 + Ltot%3 - GM_frame
                if (F == 5) => F = 2
                if (F == 4) => F = 1

GeneMark.hmm
    Edxy (NB: x = phase): F = [Lend - x] % 3 + 1
    Erxy (NB: y = phase): F = [Ltot - (Rend + (y - 1))] % 3 + 1

Grail
    Strand (+): F = grail_frame + 1
    Strand (-): F = grail_frame + 1

Fgenea, Fexa, FgeneP
    Strand (+): F = (ORF_Lend - 1) % 3 + 1
    Strand (-): F = (Ltot - ORF_Rend) % 3 + 1

We kept the frame for which the probability was the highest.

Ltot is the total length of the genomic sequence given as input to the program.
% is the modulo
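
The conversions in this table can be sketched as follows, in Python rather than the original Perl. The function names are hypothetical. The GENEMARK complementary-strand formula is read as F = 4 + Ltot%3 - GM_frame, an assumption made by analogy with the GENSCAN formula, and Edxy / Erxy are taken to be the direct- and reverse-strand values of GeneMark.hmm.

```python
def _wrap(frame):
    """Bring values 4, 5, 6 back to 1, 2, 3, as in the table."""
    return frame - 3 if frame > 3 else frame


def genscan_frame(strand, gs_frame, ltot):
    if strand == "+":
        return gs_frame + 1
    return _wrap(4 + ltot % 3 - gs_frame)


def genemark_frame(strand, gm_frame, ltot):
    if strand == "+":
        return gm_frame
    return _wrap(4 + ltot % 3 - gm_frame)  # assumed minus sign


def genemark_hmm_frame(strand, phase, lend, rend, ltot):
    if strand == "+":  # Edxy, phase = x
        return (lend - phase) % 3 + 1
    return (ltot - (rend + (phase - 1))) % 3 + 1  # Erxy, phase = y


def grail_frame(frame):
    return frame + 1  # same formula on both strands


def fgene_frame(strand, orf_lend, orf_rend, ltot):
    """Fgenea, Fexa and FgeneP."""
    if strand == "+":
        return (orf_lend - 1) % 3 + 1
    return (ltot - orf_rend) % 3 + 1


print(genscan_frame("-", 0, 1001))
```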

Programs to generate consensus predictions from two standard prediction files.

All these scripts take two standard files as input and generate a standard file as output.
The "Contig" field of the standard files must have the form seq##, where ## is a number.

Usage: file1 file2

File1 = exon predictor file
File2 = splice predictor file
Keeps only the splice sites found in both files.
Results contain file2 lines.

File1 = exon predictor file
File2 = splice predictor file
Results contain file1 lines, only when both splice sites were found by the splice predictor (except for Init or Term exons).

File1 = exon predictor file
File2 = exon predictor file
Results contain file1 predictions when they are in consensus with file2.

File1 = splice predictor file
File2 = splice predictor file
Results contain file1 predictions when they are in consensus with file2.
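
As an illustration of the second case above (exon predictions kept only when the splice predictor supports them), here is a minimal sketch in Python, not the original Perl script. Everything about the data layout is an assumption: records are dictionaries with hypothetical keys contig, type, strand, ac and do. The handling of the exception is also an interpretation: an Init exon is required to have only its donor confirmed, a Term exon only its acceptor, and Sngl exons, which the text does not mention, are kept unchanged.

```python
def exons_supported_by_splice_sites(exons, splice_sites):
    """Keep exon records whose required splice sites were also predicted."""
    acceptors = {(s["contig"], s["strand"], s["ac"])
                 for s in splice_sites if s.get("ac") is not None}
    donors = {(s["contig"], s["strand"], s["do"])
              for s in splice_sites if s.get("do") is not None}

    kept = []
    for exon in exons:
        key = (exon["contig"], exon["strand"])
        # Assumed interpretation of "except if Init or Term exon".
        need_ac = exon["type"] in ("Intr", "Term")
        need_do = exon["type"] in ("Intr", "Init")
        ac_ok = not need_ac or key + (exon["ac"],) in acceptors
        do_ok = not need_do or key + (exon["do"],) in donors
        if ac_ok and do_ok:
            kept.append(exon)
    return kept


# Hypothetical records.
exons = [
    {"contig": "seq01", "type": "Intr", "strand": "+", "ac": 99, "do": 201},
    {"contig": "seq01", "type": "Intr", "strand": "+", "ac": 299, "do": 401},
    {"contig": "seq01", "type": "Init", "strand": "+", "ac": None, "do": 201},
]
sites = [
    {"contig": "seq01", "strand": "+", "ac": 99},
    {"contig": "seq01", "strand": "+", "do": 201},
    {"contig": "seq01", "strand": "+", "ac": 299},
]
for exon in exons_supported_by_splice_sites(exons, sites):
    print(exon["type"], exon["ac"], exon["do"])
```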


Programs to filter a standard file on the score or probability.

    -min <min score> -max <max score>

    Returns only the predictions with a score between min and max.
    You can use both the min and max options, or only one of them.

    Input file = a standard file.
    The "Contig" field of the standard file must have the form seq##, where ## is a number.

    NB: in the case of a NetGene2 standard file, splice sites marked with an H (for highly confident sites) are always kept.
    If you give a min score higher than 1, you will get only these sites.

    Example: -min 0.6 -max 0.8 < seq01.txtST > seq01.txtST_06_08
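
The filtering rule described above can be sketched as follows, in Python rather than the original Perl. The record layout is an assumption: each prediction is a dictionary with a hypothetical numeric score key and an optional highly_confident flag standing for the NetGene2 H mark. Treating the bounds as inclusive is also an assumption.

```python
def filter_by_score(records, min_score=None, max_score=None):
    """Keep records whose score lies between min_score and max_score."""
    kept = []
    for rec in records:
        if rec.get("highly_confident"):
            kept.append(rec)  # NetGene2 H sites are always kept
            continue
        score = rec["score"]
        if min_score is not None and score < min_score:
            continue
        if max_score is not None and score > max_score:
            continue
        kept.append(rec)
    return kept


# Hypothetical predictions.
records = [
    {"contig": "seq01", "score": 0.55},
    {"contig": "seq01", "score": 0.70},
    {"contig": "seq01", "score": 0.95, "highly_confident": True},
]
print(filter_by_score(records, min_score=0.6, max_score=0.8))
```

With a min score above 1, only the record flagged as highly confident remains, which matches the note above.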

If you encounter any problem, please contact Catherine Mathé.
