Programs to standardize outputs from prediction programs:

This page contains programs to:

standardize outputs from prediction programs.
generate consensus predictions from two standard files.
filter a standard file on the score or probability.

All these programs are written in Perl.

Programs to standardize outputs from prediction programs

Since each gene prediction program presents its results in its own way, we wrote programs that convert these outputs to a uniform format.
All these programs were written between the end of 1998 and the beginning of 1999. If the output format of a prediction program has changed since then, the corresponding script will either no longer work at all or, more dangerously, give you false results.
In the following, SEQ is your sequence name, accession number, identifier, etc.

These programs all need the file util.pl, in the same directory.

PROGRAM	AUTHOR	SCRIPT	INPUT file	OUTPUT file
Splice site predictors
NetGene2	Brunak	ng2st.pl	data.list* or SEQnetgene2.txt	dataset_netgene2ST or SEQnetgene2.txtST
NetPlantGene	Brunak	npg2st.pl	data.list* or SEQnetplantgene.txt	dataset_netplantgeneST or SEQnetplantgene.txtST
SplicePredictor	Brendel	splicep2st.pl	data.list* or SEQsplicep1.txt**	dataset_splicepredictor12ST or SEQsplicep12.txtST
SPL	Solovyev	spl2st.pl	data.list* or SEQspl1.txt**	dataset_splST or SEQspl12.txtST
Exon predictors
MZEF	Zhang	mzef2sr.pl	data.list* or SEQmzef1.txt**	dataset_mzefST or SEQmzef12.txtST
Fexa	Solovyev	fex2st.pl	data.list* or SEQfex1.txt	dataset_fexST or SEQfex12.txtST
Grail	Uberbacher	grail2st.pl	data.list* or SEQgrail.txt	dataset_grailST or SEQgrail.txtST
GeneMark	Borodovsky	gm2st.pl	data.list* or SEQ.tfa.lst	dataset_genemarkST or SEQgm.txtST
		gmregion2st.pl	data.list* or SEQ.tfa.lst	dataset_genemarkregionST or SEQgmregion.txtST
Gene modelers
GeneMark.hmm	Borodovsky	gmhmm2st.pl	data.list* or SEQgmhmm.txt	dataset_gmhmmST or SEQgmhmm.txtST
FgeneP	Solovyev	fgenepW2st.pl	data.list* or SEQfgene1.txt	dataset_fgenepWST or SEQfgenep.txtWST
Fgenea	Solovyev	fgene2st.pl	data.list* or SEQfgene1.txt**	dataset_fgeneaST or SEQfgenea.txtST
GENSCAN	Burge	gs2st.pl	data.list* or SEQgs.txt	dataset_genscanST or SEQgs.txtST
Start
NetStart	Brunak	start2st.pl	data.list* or SEQstart1.txt**	dataset_startST or SEQstart12.txtST

*data.list = file containing, on each line:
SEQ SEQ_length
**The prediction program must be run once on the direct strand (=> SEQprog1.txt) and once on the reverse strand (=> SEQprog2.txt).
USAGE:
program.pl -l <data.list>
program.pl <prediction_file> <seq_length>
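
As an illustration of the data.list layout described above, here is a minimal sketch that builds the "SEQ SEQ_length" lines from sequences already held in memory. It is written in Python for illustration and is not part of the original Perl scripts; the function name data_list_lines and the single-space separator are assumptions.

```python
def data_list_lines(sequences):
    """Return one 'SEQ SEQ_length' line per sequence.

    `sequences` maps a sequence name to its nucleotide string.
    The single space between name and length is an assumption;
    the page only says each line holds SEQ and SEQ_length.
    """
    return [f"{name} {len(seq)}" for name, seq in sequences.items()]


if __name__ == "__main__":
    # Hypothetical sequences, used only to show the layout.
    example = {"seq01": "ATGGCGTAA", "seq02": "ATGAAATTTGGGTAA"}
    print("\n".join(data_list_lines(example)))
```

The resulting lines could be written to a file and passed to a script with the -l option shown in the usage above.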

Output format (generated file ST):

The standard output is presented as a table in which each line is a new prediction; the columns are described below.

Contig	name of the genomic sequence i.e. the name you wrote instead of SEQ.
Type	Single(Sngl), Initial(Init), Internal(Intr) or Terminal(Term) exon.
Strand	Direct (+) or complement (-).
Lend	5' end of exon (reading on direct strand)
Rend	3' end of exon (reading on direct strand)
Length	Exon length
Phase	Only in the case of GeneMark.hmm and NetGene2
Frame	see below.
Ac	Acceptor splice site position
Do	Donor splice site position
Proba or Score	probability, or score given by the program.
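
To show how such a table might be read by another program, here is a minimal Python sketch (not part of the original Perl scripts). It assumes that the columns are separated by whitespace, that all eleven columns are present in the order listed above, and that unused columns hold a placeholder such as "-". None of this is confirmed by this page, and the names Prediction and parse_standard_line are hypothetical.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    contig: str
    exon_type: str   # Sngl, Init, Intr or Term
    strand: str      # "+" or "-"
    lend: int
    rend: int
    length: int
    phase: str       # kept as text: only some programs provide it
    frame: str
    ac: str
    do: str
    score: str       # probability or score, kept as text


def parse_standard_line(line):
    """Split one line of a standard (ST) file into named fields.

    Assumes eleven whitespace-separated columns in the documented order.
    """
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    contig, exon_type, strand, lend, rend, length = fields[:6]
    return Prediction(contig, exon_type, strand,
                      int(lend), int(rend), int(length), *fields[6:])


if __name__ == "__main__":
    # Hypothetical line, made up for illustration only.
    print(parse_standard_line("seq01 Intr + 1201 1350 150 - 1 1200 1351 0.87"))
```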

A few explanations:

Coordinates:

All coordinates are given according to the 5’->3’ direction on direct strand.

Thus, for Grail, Fgenea, Fexa, and SPL, coordinates on the complementary strand were "reversed" so that they follow the direct strand numbering.

For NetGene2 and NetPlantGene, on the complementary strand, we read the 3’->5’ coordinates.

Acceptor and Donor site positions were defined as in NetGene2 and SplicePredictor:

	Strand(+)	Strand(-)
Ac	Lend - 1	Rend + 1
Do	Rend + 1	Lend - 1
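
The table above can be written as a small function. This is a Python sketch for illustration, not the original Perl code; the function name splice_sites is hypothetical.

```python
def splice_sites(lend, rend, strand):
    """Return (Ac, Do) for an exon, following the table above.

    Strand "+": Ac = Lend - 1, Do = Rend + 1
    Strand "-": Ac = Rend + 1, Do = Lend - 1
    """
    if strand == "+":
        return lend - 1, rend + 1
    if strand == "-":
        return rend + 1, lend - 1
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(splice_sites(1201, 1350, "+"))  # (1200, 1351)
    print(splice_sites(1201, 1350, "-"))  # (1351, 1200)
```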

Corrections were applied to the NetPlantGene output:
Strand (-): Do = Do_npg + 1, Ac = Ac_npg + 1

Phase

In GeneMark.hmm, the phase is x in the Edxy value (direct strand) and y in the Erxy value (complementary strand).

In NetGene2, the phase is given by the program and corresponds to the position in the codon where splicing occurs.

Frame

Our definition of the frame, with X = Lg(i) % 3:

Strand (+)	Strand (-)
L'(i+1) = L(i+1) - X	R'(i+1) = R(i+1) + X
F(i+1) = (L'(i+1) - 1) % 3 + 1	F(i+1) = (Ltot - R'(i+1)) % 3 + 1
Lg(i+1) = R(i+1) - L'(i+1) + 1	Lg(i+1) = R'(i+1) - L(i+1) + 1

L(i) and R(i) are Lend and Rend of exon(i) respectively.
Lg(i) and F(i) are the length and the frame of exon (i).
Ltot is the total length of the genomic sequence given as input to the program.
% is the modulo operator.
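
The frame definition above can be checked with a short function. This is a Python sketch for illustration, not the original Perl code; the function name next_exon_frame and its argument names are hypothetical.

```python
def next_exon_frame(prev_length, lend, rend, strand, ltot):
    """Return (F(i+1), Lg(i+1)) from the definition above.

    prev_length is Lg(i); lend and rend are L(i+1) and R(i+1);
    ltot is the total length of the genomic sequence.
    """
    x = prev_length % 3                     # X = Lg(i) % 3
    if strand == "+":
        adjusted = lend - x                 # L'(i+1)
        frame = (adjusted - 1) % 3 + 1
        length = rend - adjusted + 1
    elif strand == "-":
        adjusted = rend + x                 # R'(i+1)
        frame = (ltot - adjusted) % 3 + 1
        length = adjusted - lend + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return frame, length


if __name__ == "__main__":
    # Made-up coordinates: Lg(i) = 101, so X = 2.
    print(next_exon_frame(101, 500, 600, "+", 1000))  # (3, 103)
    print(next_exon_frame(101, 500, 600, "-", 1000))  # (3, 103)
```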

To agree with this definition, in some cases we modified the frame given by the programs, or calculated it from the program output, as explained in this table:

	Strand(+)	Strand(-)
GENSCAN	F = GS_frame + 1	F = 4 + Ltot % 3 - GS_frame; if F == 6 then F = 3; if F == 5 then F = 2; if F == 4 then F = 1
GeneMark	F = GM_frame	F = 4 + Ltot % 3 - GM_frame; if F == 5 then F = 2; if F == 4 then F = 1
GeneMark.hmm	Edxy (NB: x = phase): F = (Lend - x) % 3 + 1	Erxy (NB: y = phase): F = (Ltot - (Rend + (y - 1))) % 3 + 1
Grail	F = grail_frame + 1	F = grail_frame + 1
Fgenea, Fexa, FgeneP	F = (ORF_Lend - 1) % 3 + 1	F = (Ltot - ORF_Rend) % 3 + 1
MZEF	We kept the frame for which the probability was the highest.
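
Some rows of this table can be written out as follows. This is a Python sketch for illustration, not the original Perl code; the function names are hypothetical. The explicit corrections in the table (6 => 3, 5 => 2, 4 => 1) are expressed here as "subtract 3 when F is greater than 3", which gives the same result for those values.

```python
def genscan_frame(gs_frame, strand, ltot):
    """Convert a GENSCAN frame to the frame defined on this page."""
    if strand == "+":
        return gs_frame + 1
    f = 4 + ltot % 3 - gs_frame
    return f - 3 if f > 3 else f       # 6 => 3, 5 => 2, 4 => 1


def genemark_frame(gm_frame, strand, ltot):
    """Convert a GeneMark frame to the frame defined on this page."""
    if strand == "+":
        return gm_frame
    f = 4 + ltot % 3 - gm_frame
    return f - 3 if f > 3 else f       # 5 => 2, 4 => 1


def orf_frame(orf_lend, orf_rend, strand, ltot):
    """Frame computed from ORF ends (Fgenea, Fexa, FgeneP row)."""
    if strand == "+":
        return (orf_lend - 1) % 3 + 1
    return (ltot - orf_rend) % 3 + 1


if __name__ == "__main__":
    print(genscan_frame(0, "-", 1001))        # 1001 % 3 == 2, so 6 => 3
    print(genemark_frame(1, "-", 1000))       # 1000 % 3 == 1, so 4 => 1
    print(orf_frame(1201, 1350, "-", 2000))   # (2000 - 1350) % 3 + 1 == 3
```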

Programs to generate consensus predictions from two standard prediction files.
All these scripts take two standard files as input and generate a standard file as output.
The "Contig" field of the standard files must have the form seq##, where ## is a number.
Usage: prog.pl file1 file2

compar_site.pl	File1 = exon predictor file	File2 = splice predictor file	Keeps only splice sites found in both files. Results contain file2 lines.
compar_M_site.pl	File1 = exon predictor file	File2 = splice predictor file	Results contain file1 lines, only when both splice sites were found by the splice predictor (except for Init or Term exons).
compar_pred.pl	File1 = exon predictor file	File2 = exon predictor file	Results contain file1 predictions when they agree with file2.
compar_2splicepred.pl	File1 = splice predictor file	File2 = splice predictor file	Results contain file1 predictions when they agree with file2.
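
The idea behind a comparison such as compar_pred.pl can be sketched as follows. This is a Python illustration, not the original Perl code. This page does not say how the scripts decide that two predictions agree; the sketch assumes that agreement means identical contig, strand, Lend and Rend, and the function names are hypothetical.

```python
def site_key(pred):
    """Key used to compare predictions (an assumption, see above)."""
    return (pred["contig"], pred["strand"], pred["lend"], pred["rend"])


def consensus(file1_preds, file2_preds):
    """Keep file1 predictions that also appear in file2."""
    keys2 = {site_key(p) for p in file2_preds}
    return [p for p in file1_preds if site_key(p) in keys2]


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    a = [{"contig": "seq01", "strand": "+", "lend": 1201, "rend": 1350},
         {"contig": "seq01", "strand": "+", "lend": 2000, "rend": 2100}]
    b = [{"contig": "seq01", "strand": "+", "lend": 1201, "rend": 1350}]
    print(consensus(a, b))
```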

Programs to filter a standard file on the score or probability.

getscore.pl -min <min score> -max <max score>

Returns only the predictions whose score lies between min and max.
You can use both the min and max options, or only one of them.

Input file: a standard file.
The "Contig" field of the standard file must have the form seq##, where ## is a number.

NB: in the case of a NetGene2 standard file, splice sites marked with an H (for highly confident sites) are always kept.
If you give a min score higher than 1, only these sites are returned.

Example: getscore.pl -min 0.6 -max 0.8 < seq01.txtST > seq01.txtST_06_08
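
The filtering rule described above can be sketched as follows. This is a Python illustration, not the original Perl code. The function name keep_prediction is hypothetical, the bounds are assumed to be inclusive, and the way an H site is marked in a NetGene2 standard file is not described here, so it is passed in as a simple flag.

```python
def keep_prediction(score, min_score=None, max_score=None,
                    high_confidence=False):
    """Decide whether a prediction passes the score filter.

    min_score and max_score are optional, like the -min and -max options.
    high_confidence stands for a NetGene2 site marked H, which is always kept.
    """
    if high_confidence:
        return True
    if min_score is not None and score < min_score:
        return False
    if max_score is not None and score > max_score:
        return False
    return True


if __name__ == "__main__":
    print(keep_prediction(0.7, min_score=0.6, max_score=0.8))   # True
    print(keep_prediction(0.9, min_score=0.6, max_score=0.8))   # False
    # With a min score above 1, only H sites pass.
    print(keep_prediction(0.99, min_score=1.5))                 # False
    print(keep_prediction(0.99, min_score=1.5, high_confidence=True))  # True
```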

If you encounter any problem, please contact Catherine Mathé.
