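This page contains programs to standardize outputs from prediction programs, to generate consensus predictions from two standard prediction files, and to filter a standard file on the score or probability. All these programs are written in Perl.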



Programs to standardize outputs from prediction programs
Because each gene prediction program presents its results in its own way, we wrote programs to convert these results into a uniform output.
All these programs were written between the end of 1998 and the beginning of 1999. If the software outputs have changed since then, these programs will either no longer work at all or, which is more dangerous, give you false results.

In the following, SEQ stands for your sequence name, accession number, or other identifier.

These programs all need the file util.pl, in the same directory.
 
| PROGRAM | AUTHOR | SCRIPT | INPUT file | OUTPUT file |
|---|---|---|---|---|
| Splice site predictors | | | | |
| NetGene2 | Brunak | ng2st.pl | data.list* or SEQnetgene2.txt | dataset_netgene2ST or SEQnetgene2.txtST |
| NetPlantGene | Brunak | npg2st.pl | data.list* or SEQnetplantgene.txt | dataset_netplantgeneST or SEQnetplantgene.txtST |
| SplicePredictor | Brendel | splicep2st.pl | data.list* or SEQsplicep1.txt** | dataset_splicepredictor12ST or SEQsplicep12.txtST |
| SPL | Solovyev | spl2st.pl | data.list* or SEQspl1.txt** | dataset_splST or SEQspl12.txtST |
| Exon predictors | | | | |
| MZEF | Zhang | mzef2sr.pl | data.list* or SEQmzef1.txt** | dataset_mzefST or SEQmzef12.txtST |
| Fexa | Solovyev | fex2st.pl | data.list* or SEQfex1.txt | dataset_fexST or SEQfex12.txtST |
| Grail | Uberbacher | grail2st.pl | data.list* or SEQgrail.txt | dataset_grailST or SEQgrail.txtST |
| GeneMark | Borodovsky | gm2st.pl | data.list* or SEQ.tfa.lst | dataset_genemarkST or SEQgm.txtST |
| | | gmregion2st.pl | data.list* or SEQ.tfa.lst | dataset_genemarkregionST or SEQgmregion.txtST |
| Gene Modelers | | | | |
| GeneMark.hmm | Borodovsky | gmhmm2st.pl | data.list* or SEQgmhmm.txt | dataset_gmhmmST or SEQgmhmm.txtST |
| FgeneP | Solovyev | fgenepW2st.pl | data.list* or SEQfgene1.txt | dataset_fgenepWST or SEQfgenep.txtWST |
| Fgenea | Solovyev | fgene2st.pl | data.list* or SEQfgene1.txt** | dataset_fgeneaST or SEQfgenea.txtST |
| GENSCAN | Burge | gs2st.pl | data.list* or SEQgs.txt | dataset_genscanST or SEQgs.txtST |
| Start | | | | |
| NetStart | Brunak | start2st.pl | data.list* or SEQstart1.txt** | dataset_startST or SEQstart12.txtST |

* data.list = file containing, on each line:
SEQ SEQ_length

** The prediction program must be run once on the direct strand (=> SEQprog1.txt) and once on the reverse strand (=> SEQprog2.txt).

USAGE:

program.pl -l <data.list>
program.pl <prediction_file> <seq_length>
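
As an illustration of the data.list format described above (one "SEQ SEQ_length" pair per line), here is a minimal Python sketch that builds such a file's content. It is not part of the Perl scripts; the function name, the single-space separator, and the example names and lengths are assumptions made for illustration only.

```python
def format_data_list(seq_lengths):
    """Return data.list content: one 'SEQ SEQ_length' line per sequence."""
    lines = []
    for name, length in seq_lengths.items():
        # Assumption: name and length are separated by a single space.
        lines.append(f"{name} {length}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sequence names and lengths.
    with open("data.list", "w") as handle:
        handle.write(format_data_list({"seq01": 12500, "seq02": 8340}))
```

The resulting file could then be passed to a script with the -l option shown in the usage lines.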


Output format (generated file ST):

The standard output is presented as a table in which each line is a new prediction. The columns are described below.

| Column | Description |
|---|---|
| Contig | Name of the genomic sequence, i.e. the name you wrote instead of SEQ. |
| Type | Single (Sngl), Initial (Init), Internal (Intr) or Terminal (Term) exon. |
| Strand | Direct (+) or complement (-). |
| Lend | 5' end of exon (reading on direct strand). |
| Rend | 3' end of exon (reading on direct strand). |
| Length | Exon length. |
| Phase | Only in the case of GeneMark.hmm and NetGene2. |
| Frame | See below. |
| Ac | Acceptor splice site position. |
| Do | Donor splice site position. |
| Proba or Score | Probability, or score given by the program. |

    A few explanations:

  1. Coordinates

    All coordinates are given in the 5'->3' direction on the direct strand.

    Thus, for Grail, Fgenea, Fexa, and SPL, coordinates on the complementary strand were "reversed" so that they follow the direct-strand numbering.

    For NetGene2 and NetPlantGene, on the complementary strand, we read the 3'->5' coordinates.

    Acceptor (Ac) and Donor (Do) site positions were defined as in NetGene2 and SplicePredictor:

    | Site | Strand(+) | Strand(-) |
    |---|---|---|
    | Ac | Lend - 1 | Rend + 1 |
    | Do | Rend + 1 | Lend - 1 |

    Corrections have been performed on the NetPlantGene output:
    Strand (-): Do = Do_npg + 1, Ac = Ac_npg + 1
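
The acceptor and donor definitions above can be illustrated with a short Python sketch. It is only an illustration of the rule, not code from the Perl scripts; the function name and the example coordinates are hypothetical.

```python
def splice_sites(lend, rend, strand):
    """Return (Ac, Do) for an exon whose Lend and Rend use direct-strand numbering."""
    if strand == "+":
        # Direct strand: acceptor just before Lend, donor just after Rend.
        return lend - 1, rend + 1
    if strand == "-":
        # Complementary strand: acceptor just after Rend, donor just before Lend.
        return rend + 1, lend - 1
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    # Hypothetical exon spanning positions 100 to 200.
    print(splice_sites(100, 200, "+"))
    print(splice_sites(100, 200, "-"))
```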
     

  2. Phase

    The Phase corresponds to the position in a codon of the first nucleotide of the considered exon
    (so the Phase of the first exon is always 1).

    In GeneMark.hmm, the phase is x from the Edxy value and y from the Erxy value.

    In NetGene2, the phase is given and corresponds to the position in the codon where the splicing occurs.
     

  3. Frame

    Because the programs do not all use the same definition of the Frame, we defined our Frame according to the one found when doing a map.
    Thus our definition of the frame is as follows. For both strands, X = Lg(i) % 3. Then:

    | Strand(+) | Strand(-) |
    |---|---|
    | L'(i+1) = L(i+1) - X | R'(i+1) = R(i+1) + X |
    | F(i+1) = (L'(i+1) - 1) % 3 + 1 | F(i+1) = (Ltot - R'(i+1)) % 3 + 1 |
    | Lg(i+1) = R(i+1) - L'(i+1) + 1 | Lg(i+1) = R'(i+1) - L(i+1) + 1 |

L(i) and R(i) are Lend and Rend of exon(i) respectively.
Lg(i) and F(i) are the length and the frame of exon (i).
Ltot is the total length of the genomic sequence given as input to the program.
% is the modulo
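
To make the recurrence above concrete, here is a minimal Python sketch that applies it to a chain of exons. It is not the authors' code. The function name is hypothetical, and two points are assumptions rather than statements from this page: the exons are supplied in transcription order, and X = 0 for the first exon (no nucleotides carried over).

```python
def exon_frames(exons, strand, ltot=None):
    """Return the Frame of each exon.

    exons: list of (Lend, Rend) pairs, assumed to be in transcription order.
    ltot:  total length of the genomic sequence (needed for strand '-').
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if strand == "-" and ltot is None:
        raise ValueError("ltot is required on the complementary strand")
    x = 0  # Assumption: nothing is carried over into the first exon.
    frames = []
    for lend, rend in exons:
        if strand == "+":
            lprime = lend - x                # L'(i+1) = L(i+1) - X
            frame = (lprime - 1) % 3 + 1     # F(i+1)
            length = rend - lprime + 1       # Lg(i+1)
        else:
            rprime = rend + x                # R'(i+1) = R(i+1) + X
            frame = (ltot - rprime) % 3 + 1  # F(i+1)
            length = rprime - lend + 1       # Lg(i+1)
        frames.append(frame)
        x = length % 3                       # X used for the next exon
    return frames
```

For example, with hypothetical direct-strand exons (10, 20) and (31, 40), the first exon has length 11, so X = 2 and the second exon is treated as starting at position 29.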

To agree with this definition, in some cases we modified the frame given by the programs, or calculated it from the program output, as explained in this table:

| Program | Strand(+) | Strand(-) |
|---|---|---|
| GENSCAN | F = GS_frame + 1 | F = 4 + Ltot%3 - GS_frame; then if F == 6, F = 3; if F == 5, F = 2; if F == 4, F = 1 |
| GeneMark | F = GM_frame | F = 4 + Ltot%3 - GM_frame; then if F == 5, F = 2; if F == 4, F = 1 |
| GeneMark.hmm | Edxy (NB: x = phase): F = [Lend - x]%3 + 1 | Erxy (NB: y = phase): F = [Ltot - (Rend + (y-1))]%3 + 1 |
| Grail | F = grail_frame + 1 | F = grail_frame + 1 |
| Fgenea, Fexa, FgeneP | F = (ORF_Lend - 1)%3 + 1 | F = (Ltot - ORF_Rend)%3 + 1 |
| MZEF | We kept the frame for which the probability was the highest | We kept the frame for which the probability was the highest |

Ltot is the total length of the genomic sequence given as input to the program.
% is the modulo
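
The conversions in the table above can be written as small functions. The following Python sketch is only an illustration of those formulas, not the Perl implementation; all function and parameter names are hypothetical. The "if F == 6 => F = 3" style corrections are expressed as subtracting 3 whenever F exceeds 3, which gives the same results for the listed cases.

```python
def _wrap(frame):
    """Bring a frame value of 4, 5 or 6 back to 1, 2 or 3."""
    return frame - 3 if frame > 3 else frame


def genscan_frame(gs_frame, strand, ltot):
    if strand == "+":
        return gs_frame + 1
    return _wrap(4 + ltot % 3 - gs_frame)


def genemark_frame(gm_frame, strand, ltot):
    if strand == "+":
        return gm_frame
    return _wrap(4 + ltot % 3 - gm_frame)


def genemark_hmm_frame(lend, rend, phase, strand, ltot):
    # phase is x (from Edxy) on the direct strand, y (from Erxy) on the reverse.
    if strand == "+":
        return (lend - phase) % 3 + 1
    return (ltot - (rend + (phase - 1))) % 3 + 1


def grail_frame(frame):
    # Same correction on both strands.
    return frame + 1


def orf_frame(orf_lend, orf_rend, strand, ltot):
    # Fgenea, Fexa and FgeneP: frame computed from the ORF coordinates.
    if strand == "+":
        return (orf_lend - 1) % 3 + 1
    return (ltot - orf_rend) % 3 + 1
```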







Programs to generate consensus predictions from two standard prediction files.

All these scripts take two standard input files and generate a standard output.
The "Contig" field of the standard files must contain seq##, where ## is a number.

Usage:  prog.pl file1 file2

compar_site.pl
    File1 = exon predictor file
    File2 = splice predictor file
    Keeps only the splice sites found in both files.
    The results contain File2 lines.
compar_M_site.pl
    File1 = exon predictor file
    File2 = splice predictor file
    The results contain File1 lines, only when both splice sites were found by the splice predictor (except for Init or Term exons).
compar_pred.pl
    File1 = exon predictor file
    File2 = exon predictor file
    The results contain File1 predictions when they are in consensus with File2.
compar_2splicepred.pl
    File1 = splice predictor file
    File2 = splice predictor file
    The results contain File1 predictions when they are in consensus with File2.
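
As a rough illustration of the rule described for compar_site.pl (keep the splice predictor lines whose site is also present in the exon predictor file), here is a minimal Python sketch. It is not the Perl script. It assumes each prediction has already been parsed into a dictionary keyed by the standard column names, with None for an absent Ac or Do value, and it assumes that "found in both files" means identical contig, strand, site type, and position. All function names are hypothetical.

```python
def site_keys(record):
    """Return the set of (contig, strand, kind, position) sites in one record."""
    keys = set()
    for kind in ("Ac", "Do"):
        position = record.get(kind)
        if position is not None:
            keys.add((record["Contig"], record["Strand"], kind, position))
    return keys


def common_splice_sites(exon_records, splice_records):
    """Keep the splice predictor records that share a site with the exon records."""
    exon_sites = set()
    for record in exon_records:
        exon_sites |= site_keys(record)
    # Assumption: an exact position match is required.
    return [rec for rec in splice_records if site_keys(rec) & exon_sites]
```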

 


Programs to filter a standard file on the score or probability.

    getscore.pl -min <min score> -max <max score>

    Returns only the predictions with a score between min and max.
    You can use both the min and max options, or only one of them.

    Input file = a standard file.
    The "Contig" field of the standard file must contain seq##, where ## is a number.

    NB: in the case of a NetGene2 standard file, splice sites marked with an H (for highly confident sites) are always kept.
    If you give a min score higher than 1, you will get only these sites.

    Example: getscore.pl -min 0.6 -max 0.8 < seq01.txtST > seq01.txtST_06_08
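
The filtering rule described above can be sketched in Python as follows. This is not the getscore.pl code. The function and parameter names are hypothetical, and treating both bounds as inclusive is an assumption, since this page does not say how scores exactly equal to min or max are handled.

```python
def keep_prediction(score, min_score=None, max_score=None, highly_confident=False):
    """Decide whether a prediction passes the score filter."""
    if highly_confident:
        # NetGene2 sites flagged H are always kept.
        return True
    # Assumption: both bounds are inclusive.
    if min_score is not None and score < min_score:
        return False
    if max_score is not None and score > max_score:
        return False
    return True
```

With a min score above 1 and probabilities that never exceed 1, only the predictions flagged as highly confident pass the filter, which matches the note above.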
 
 


If you encounter any problem, please contact Catherine Mathé.
 
 
