Because each gene prediction program presents its results in its own way, we wrote programs that convert these outputs to a uniform format.
All these programs were written between the end of 1998 and the beginning of 1999. If the software outputs have changed since then, these programs will either no longer work at all or, more dangerously, give you false results.
In the following, SEQ is your sequence name, accession number, identifier ...
Program | Author | Script | Input | Output |
NetGene2 | Brunak | ng2st.pl | data.list* or SEQnetgene2.txt | dataset_netgene2ST or SEQnetgene2.txtST |
NetPlantGene | Brunak | npg2st.pl | data.list* or SEQnetplantgene.txt | dataset_netplantgeneST or SEQnetplantgene.txtST |
SplicePredictor | Brendel | splicep2st.pl | data.list* or SEQsplicep1.txt** | dataset_splicepredictor12ST or SEQsplicep12.txtST |
SPL | Solovyev | spl2st.pl | data.list* or SEQspl1.txt** | dataset_splST or SEQspl12.txtST |
MZEF | Zhang | mzef2sr.pl | data.list* or SEQmzef1.txt** | dataset_mzefST or SEQmzef12.txtST |
Fexa | Solovyev | fex2st.pl | data.list* or SEQfex1.txt | dataset_fexST or SEQfex12.txtST |
Grail | Uberbacher | grail2st.pl | data.list* or SEQgrail.txt | dataset_grailST or SEQgrail.txtST |
GeneMark | Borodovsky | gm2st.pl | data.list* or SEQ.tfa.lst | dataset_genemarkST or SEQgm.txtST |
GeneMark | Borodovsky | gmregion2st.pl | data.list* or SEQ.tfa.lst | dataset_genemarkregionST or SEQgmregion.txtST |
GeneMark.hmm | Borodovsky | gmhmm2st.pl | data.list* or SEQgmhmm.txt | dataset_gmhmmST or SEQgmhmm.txtST |
FgeneP | Solovyev | fgenepW2st.pl | data.list* or SEQfgene1.txt | dataset_fgenepWST or SEQfgenep.txtWST |
Fgenea | Solovyev | fgene2st.pl | data.list* or SEQfgene1.txt** | dataset_fgeneaST or SEQfgenea.txtST |
GENSCAN | Burge | gs2st.pl | data.list* or SEQgs.txt | dataset_genscanST or SEQgs.txtST |
Start |
NetStart | Brunak | start2st.pl | data.list* or SEQstart1.txt** | dataset_startST or SEQstart12.txtST |
* data.list = file containing, on each line:
SEQ SEQ_length
** The prediction program must be run once on the direct strand => SEQprog1.txt
and once on the reverse one => SEQprog2.txt
USAGE:
program.pl -l <data.list>
program.pl <prediction_file> <seq_length>
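As an illustration of the data.list format described above, here is a minimal Python sketch that reads such a file into a name-to-length table. It assumes that the two fields on each line are separated by whitespace, which the description does not state explicitly; the function name `parse_data_list` is hypothetical and is not part of the distributed scripts.

```python
from typing import Dict, Iterable


def parse_data_list(lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence name (SEQ) to its length (SEQ_length).

    Assumption: fields are whitespace-separated, one sequence per line.
    """
    lengths: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        name, length = line.split()[:2]
        lengths[name] = int(length)
    return lengths


if __name__ == "__main__":
    # Made-up example content, for illustration only.
    example = ["seq01 12500", "seq02 8342", ""]
    print(parse_data_list(example))
```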
Output format (generated file ST):
The standard output is presented as a table in which each line is a new prediction; the columns are described below.
Contig | name of the genomic sequence i.e. the name you wrote instead of SEQ. |
Type | Single(Sngl), Initial(Init), Internal(Intr) or Terminal(Term) exon. |
Strand | Direct (+) or complement (-). |
Lend | 5' end of exon (reading on direct strand) |
Rend | 3' end of exon (reading on direct strand) |
Length | Exon length |
Phase | Only in the case of GeneMark.hmm and NetGene2 |
Frame | see below. |
Ac | Acceptor splice site position |
Do | Donor splice site position |
Proba or Score | probability, or score given by the program. |
Thus, for Grail, Fgenea, Fexa, and SPL, coordinates on the complementary strand were "reversed" so that they are given according to the direct strand numbering.
For NetGene2 and NetPlantGene, on the complementary strand, we read the 3’->5’ coordinates.
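The "reversal" of complementary strand coordinates can be sketched as follows. This is only an illustration under an assumption that is not confirmed here: that the program reports 1-based positions numbered from the start of the reverse-complemented sequence. The function name `to_direct_strand` is hypothetical.

```python
from typing import Tuple


def to_direct_strand(lend_rev: int, rend_rev: int, ltot: int) -> Tuple[int, int]:
    """Convert an interval numbered on the reverse-complemented sequence
    to direct strand numbering (1-based, inclusive).

    Assumption: position p on the reverse complement corresponds to
    position ltot - p + 1 on the direct strand.
    """
    lend = ltot - rend_rev + 1
    rend = ltot - lend_rev + 1
    return lend, rend


if __name__ == "__main__":
    # A 150 bp exon at 101..250 on the reverse complement of a 1000 bp sequence.
    print(to_direct_strand(101, 250, 1000))
```

With these made-up values the interval becomes 751..900 on the direct strand, and the exon length (150) is unchanged.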
Acceptor and Donor site positions were defined as in NetGene2 and SplicePredictor:
[Figure: positions of the acceptor (Ac) and donor (Do) splice sites]
Corrections have been performed on the NetPlantGene output:
Strand (-) Do = Do_npg + 1, Ac = Ac_npg + 1
In GeneMark.hmm, the phase is x from the Edxy value and y from the Erxy value.
In NetGene2, the phase is given and corresponds to the position in the codon where the splicing occurs.
Direct strand | Complementary strand |
L'(i+1) = L(i+1) - X | R'(i+1) = R(i+1) + X |
F(i+1) = (L'(i+1) - 1) % 3 + 1 | F(i+1) = (Ltot - R'(i+1)) % 3 + 1 |
Lg(i+1) = R(i+1) - L'(i+1) + 1 | Lg(i+1) = R'(i+1) - L(i+1) + 1 |
L(i) and R(i) are Lend and Rend of exon(i) respectively.
Lg(i) and F(i) are the length and the frame of exon (i).
Ltot is the total length of the genomic sequence given as input to the program.
% is the modulo operator.
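The frame and length formulas can be written out as a short Python sketch. The offset X is taken as a given input, since its value comes from the prediction itself; the assignment of the first pair of formulas to the direct strand and the second pair to the complementary strand is our reading of the definitions (the second pair uses Ltot). The function names are hypothetical.

```python
from typing import Tuple


def exon_direct(lend: int, rend: int, x: int) -> Tuple[int, int]:
    """Frame and length of an exon on the direct strand.

    L' = L - X ; F = (L' - 1) % 3 + 1 ; Lg = R - L' + 1
    """
    l_adj = lend - x
    frame = (l_adj - 1) % 3 + 1
    length = rend - l_adj + 1
    return frame, length


def exon_complement(lend: int, rend: int, x: int, ltot: int) -> Tuple[int, int]:
    """Frame and length of an exon on the complementary strand.

    R' = R + X ; F = (Ltot - R') % 3 + 1 ; Lg = R' - L + 1
    """
    r_adj = rend + x
    frame = (ltot - r_adj) % 3 + 1
    length = r_adj - lend + 1
    return frame, length


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    print(exon_direct(100, 160, 1))
    print(exon_complement(100, 160, 1, 1000))
```

In both cases the frame is always a value in 1..3, because the modulo is taken before adding 1.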
To agree with this definition, we modified in some cases the frame given by the programs, or calculated it from the program output, as explained in this table:
Program | Direct strand | Complementary strand |
GENSCAN | F = GS_frame + 1 | F = 4 + Ltot%3 - GS_frame; if (F == 6) => F = 3; if (F == 5) => F = 2; if (F == 4) => F = 1 |
GeneMark | F = GM_frame | F = 4 + Ltot%3 - GM_frame; if (F == 5) => F = 2; if (F == 4) => F = 1 |
GeneMark.hmm | Edxy NB: x = phase; F = [Lend - x] % 3 + 1 | Erxy NB: y = phase; F = [Ltot - (Rend + (y - 1))] % 3 + 1 |
Grail | F = grail_frame + 1 | F = grail_frame + 1 |
Fgenea, Fexa, FgeneP | F = [(ORF_Lend - 1) % 3 + 1] | F = [(Ltot - ORF_Rend) % 3 + 1] |
MZEF | | |
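To make the frame conversions for GENSCAN, GeneMark and GeneMark.hmm concrete, here is a minimal Python sketch of the formulas in the table. It assumes that the first formula of each row applies to the direct strand and the second to the complementary strand (the second uses Ltot, and Edxy/Erxy are read as direct/reverse). The function names are hypothetical and are not those used in the Perl scripts.

```python
def wrap_frame(f: int) -> int:
    """Bring a frame value back into 1..3 (6 -> 3, 5 -> 2, 4 -> 1)."""
    return f - 3 if f > 3 else f


def genscan_frame(gs_frame: int, strand: str, ltot: int) -> int:
    """Standard frame from the frame reported by GENSCAN."""
    if strand == "+":
        return gs_frame + 1
    return wrap_frame(4 + ltot % 3 - gs_frame)


def genemark_frame(gm_frame: int, strand: str, ltot: int) -> int:
    """Standard frame from the frame reported by GeneMark."""
    if strand == "+":
        return gm_frame
    return wrap_frame(4 + ltot % 3 - gm_frame)


def genemark_hmm_frame(lend: int, rend: int, phase: int, strand: str, ltot: int) -> int:
    """Standard frame from the phase digit of an Edxy (x) or Erxy (y) value."""
    if strand == "+":
        return (lend - phase) % 3 + 1
    return (ltot - (rend + (phase - 1))) % 3 + 1


if __name__ == "__main__":
    # Made-up values, for illustration only.
    print(genscan_frame(0, "-", 1000))
    print(genemark_frame(1, "-", 1001))
    print(genemark_hmm_frame(100, 160, 2, "-", 1000))
```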
Programs to generate consensus predictions from two standard prediction files. All these scripts take two standard input files and generate a standard output.
They need seq## in the "Contig" field of the standard file, where ## is a number.
Usage: prog.pl file1 file2
File2 = splice predictor file | Keep only splice sites found in both files. Results contain file2 lines. |
File2 = splice predictor file | Results contain file1 lines, only when both splice sites were found by the splice predictor (except for Init or Term exons). |
File2 = exon predictor file | Results contain file1 predictions when they are in consensus with file2. |
File2 = splice predictor file | Results contain file1 predictions when they are in consensus with file2. |
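The first kind of consensus (keep only splice sites found in both files, reporting the file2 lines) can be sketched in Python as below. This is an illustration, not the logic of the Perl scripts: it assumes that two sites are "the same" when contig, strand, site type and position are all identical, and the `Site` record and function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Site(NamedTuple):
    contig: str
    strand: str    # "+" or "-"
    kind: str      # "Ac" (acceptor) or "Do" (donor)
    position: int
    score: float


def site_key(site: Site) -> Tuple[str, str, str, int]:
    """Identity of a splice site, ignoring its score (assumed matching rule)."""
    return (site.contig, site.strand, site.kind, site.position)


def sites_in_both(file1: Iterable[Site], file2: Iterable[Site]) -> List[Site]:
    """Return the file2 records whose site is also present in file1."""
    keys1 = {site_key(s) for s in file1}
    return [s for s in file2 if site_key(s) in keys1]


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    f1 = [Site("seq01", "+", "Ac", 120, 0.9), Site("seq01", "+", "Do", 300, 0.7)]
    f2 = [Site("seq01", "+", "Ac", 120, 0.5), Site("seq01", "-", "Do", 300, 0.8)]
    print(sites_in_both(f1, f2))
```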
Programs to filter a standard file on the score or probability.
getscore.pl -min <min score> -max <max score>
Returns only predictions with a score between min and max.
You can use both the min and max options, or only one.
Input file = a standard file.
Needs seq## in the "Contig" field of the standard file, where ## is a number.
NB: in the case of a NetGene2 standard file, splice sites with an H (for highly confident sites) are always kept. If you give a min score higher than 1, you will get only these results.
example: getscore.pl -min 0.6 -max 0.8 <seq01.txtST>
seq01.txtST_06_08
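The filtering rule described above can be sketched in Python as follows. This is a minimal illustration and not the code of getscore.pl: it assumes that the bounds are inclusive and that each prediction has already been parsed into a score and a flag telling whether it is a NetGene2 "H" site. The names `Prediction` and `filter_by_score` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional


class Prediction(NamedTuple):
    line: str               # original line of the standard file
    score: float            # probability or score
    highly_confident: bool  # True for NetGene2 sites marked with an H


def filter_by_score(
    predictions: Iterable[Prediction],
    min_score: Optional[float] = None,
    max_score: Optional[float] = None,
) -> List[Prediction]:
    """Keep predictions with min_score <= score <= max_score.

    Either bound may be omitted. Sites flagged as highly confident
    are always kept, whatever their score.
    """
    kept: List[Prediction] = []
    for p in predictions:
        if p.highly_confident:
            kept.append(p)
            continue
        if min_score is not None and p.score < min_score:
            continue
        if max_score is not None and p.score > max_score:
            continue
        kept.append(p)
    return kept


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    preds = [
        Prediction("a", 0.5, False),
        Prediction("b", 0.7, False),
        Prediction("c", 0.9, False),
        Prediction("d", 0.95, True),
    ]
    print([p.line for p in filter_by_score(preds, 0.6, 0.8)])
```

With a min score higher than 1, only the highly confident sites remain, as noted above.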
If you encounter any problems, please contact Catherine Mathé.