Post processing

Data sets

Two main data sets are taken into account:

Data.txtST, which contains annotations grouped by contig (each entry can contain more than one gene). Sequence lengths are stored in Data.list.
geneData.txtST, which contains the same annotations grouped by gene: for each contig, the exons are grouped by gene and each group becomes a new entry in this data set. The correspondence between new entries (seq###) and contig names (seq##) is stored in genecorrespondance.txt. New entries keep the sequence length of the contig they come from (geneData.list).

For special purposes, we have built sub data sets:

Data.initST, which is derived from Data.txtST and contains only initial exons.
Data.intrST, which is derived from Data.txtST and contains only internal exons.
Data.termST, which is derived from Data.txtST and contains only terminal exons.
Data.snglST, which is derived from Data.txtST and contains only single exons (genes with a single exon).

Nucleotide level

cmpbase.pl realData predData dataList
This program takes three parameters as input: a real data set, a prediction data set and a data list file. For each entry in the data list file, it writes to standard output:

the number of nucleotides that are coding in the real data set and also predicted as coding in the prediction data set (TP: true positives)
the number of nucleotides that are non coding in the real data set but predicted as coding in the prediction data set (FP: false positives)
the number of nucleotides that are non coding in the real data set and also predicted as non coding in the prediction data set (TN: true negatives)
the number of nucleotides that are coding in the real data set but predicted as non coding in the prediction data set (FN: false negatives)
the sensitivity: Sn = TP/(TP+FN)
the specificity: Sp = TP/(TP+FP)
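
The counts above can be sketched in a few lines of Python. This is only an illustration, not the code of cmpbase.pl: the function names, the representation of exons as (left, right) pairs and the choice of 1-based inclusive borders are assumptions made for the example.

```python
def coding_mask(exons, length):
    """Return one boolean per position; True where the position is coding.

    Assumption: borders are 1-based and inclusive.
    """
    mask = [False] * length
    for left, right in exons:
        for pos in range(left, right + 1):
            mask[pos - 1] = True
    return mask


def nucleotide_counts(real_exons, pred_exons, length):
    """Compare real and predicted coding positions of one entry."""
    real = coding_mask(real_exons, length)
    pred = coding_mask(pred_exons, length)
    tp = sum(1 for r, p in zip(real, pred) if r and p)
    fp = sum(1 for r, p in zip(real, pred) if not r and p)
    tn = sum(1 for r, p in zip(real, pred) if not r and not p)
    fn = sum(1 for r, p in zip(real, pred) if r and not p)
    # Sn and Sp follow the definitions given in the text.
    sn = tp / (tp + fn) if tp + fn else None
    sp = tp / (tp + fp) if tp + fp else None
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "Sn": sn, "Sp": sp}


# Hypothetical entry of length 60 with two real and two predicted exons.
counts = nucleotide_counts([(11, 20), (41, 50)], [(15, 24), (41, 50)], 60)
print(counts)
```

In this made-up entry, the first predicted exon is shifted by four positions, which produces four false positives and four false negatives.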

Exon level

cmpexons.pl realData predData
For each entry of the real data set and for each exon of this entry, it writes to standard output:

Contig	Reality	Length	Prediction	TPf	%TPf	FPf	%FPf	(1+%TPf)/(1+%FPf)
seq#	ExR# [L, R]	l	ExP# [L',R']

where:

seq# is an entry in the real data set
ExR# [L, R] is an exon in this entry (numbered from 0 to n) and [L, R] are its left and right borders in the corresponding contig
l is the length of this exon
ExP# [L', R'] is a predicted exon for the same entry that overlaps this particular exon in the correct frame
TPf is the number of nucleotides of the predicted exon that are really coding
%TPf = TPf / Length * 100
FPf is the number of nucleotides of the predicted exon that are not coding
%FPf = FPf / Length * 100
(1+%TPf)/(1+%FPf) gives an idea of the quality of the prediction. Its maximum is 101, obtained when %TPf = 100 and %FPf = 0. It falls below 1 when a predicted exon has more FPf than TPf, which means that the prediction software produces too much noise.
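
As a sketch of how these columns relate to each other, the following Python function computes them for one real exon and one predicted exon. It is not the code of cmpexons.pl: the function name is hypothetical, borders are assumed to be inclusive, the frame check is left out, and FPf is approximated as the part of the predicted exon lying outside this real exon (which assumes that it covers no other real exon).

```python
def exon_scores(real_exon, pred_exon):
    """Compute TPf, %TPf, FPf, %FPf and the quality ratio for one pair."""
    real_left, real_right = real_exon
    pred_left, pred_right = pred_exon
    length = real_right - real_left + 1  # the "Length" column
    overlap = max(0, min(real_right, pred_right) - max(real_left, pred_left) + 1)
    tpf = overlap
    # Assumption: everything outside this real exon is non coding.
    fpf = (pred_right - pred_left + 1) - overlap
    pct_tpf = 100 * tpf / length
    pct_fpf = 100 * fpf / length
    ratio = (1 + pct_tpf) / (1 + pct_fpf)
    return {"TPf": tpf, "%TPf": pct_tpf, "FPf": fpf, "%FPf": pct_fpf, "ratio": ratio}


# A perfect prediction reaches the maximum ratio of 101.
perfect = exon_scores((101, 200), (101, 200))
# A prediction that mostly falls outside the real exon drops below 1.
noisy = exon_scores((101, 200), (191, 300))
print(perfect["ratio"], noisy["ratio"])
```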

cmpexonsFI.pl realData predData
Same as cmpexons.pl but doesn't take into account the frame.

FPexons.pl predData
Reads as standard input the output of cmpexons.pl (or cmpexonsFI.pl) and computes, for each entry in the prediction data set, the number of predicted exons that do not overlap a real exon (in the correct frame for cmpexons.pl). This is the number of false positives (FP).

FNexons.pl realData
Reads as standard input the output of cmpexons.pl (or cmpexonsFI.pl) and computes, for each entry in the real data set, the number of false negatives (FN), that is, the number of real exons that are not considered as predicted and thus are not listed in the output of cmpexons.pl (or cmpexonsFI.pl).

Intron level

cmpintrons.pl realData predData datalist
For each entry in the prediction data set, this program computes the intronic regions (the parts that are not predicted as exons at all) and compares them to the real introns. For each predicted intron found, it outputs the number of nucleotides that are really non coding (TN) and the number of nucleotides wrongly predicted as non coding (FN).
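
The idea can be sketched as follows in Python. This is an illustration under stated assumptions, not the code of cmpintrons.pl: the function names are hypothetical, exons are given as (left, right) pairs with 1-based inclusive borders, and the sequence length plays the role of the value read from the data list.

```python
def predicted_introns(pred_exons, length):
    """Return the regions of the sequence not covered by any predicted exon."""
    introns = []
    start = 1
    for left, right in sorted(pred_exons):
        if left > start:
            introns.append((start, left - 1))
        start = max(start, right + 1)
    if start <= length:
        introns.append((start, length))
    return introns


def intron_counts(real_exons, pred_exons, length):
    """For each predicted intron, count TN and FN nucleotides."""
    real_coding = set()
    for left, right in real_exons:
        real_coding.update(range(left, right + 1))
    rows = []
    for left, right in predicted_introns(pred_exons, length):
        fn = sum(1 for pos in range(left, right + 1) if pos in real_coding)
        tn = (right - left + 1) - fn
        rows.append(((left, right), tn, fn))
    return rows


# Hypothetical entry of length 60.
rows = intron_counts([(11, 20), (41, 50)], [(15, 24), (41, 50)], 60)
for region, tn, fn in rows:
    print(region, tn, fn)
```

Note that the regions before the first predicted exon and after the last one are treated as introns here, in line with the convention used for the Nb In column of cmpseq.pl.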

Sequence summary

TNintrons.pl
Reads as standard input the output of cmpintrons.pl and computes, for each sequence, the number of well predicted introns.

cmpseq.pl realData FPdata FNdata TNdata
Reads as standard input the output of cmpexons.pl and takes as parameters, in this order, the real data set and the files where the results of FPexons.pl, FNexons.pl and TNintrons.pl have been stored. It gives as output:

Contig	sLg	Nb Ex	Nb In	#TPf	sTPf	sTPf/sLg	#FPf	#TN	#FNf	sFNf	sFNf/sLg	#split	#ExR split	#concat	#ExR concat

where:

Contig is the entry in the data set
sLg is the sum of the lengths of all exons of this entry
Nb Ex is the number of exons in the entry
Nb In is the number of "introns" in the entry (Nb Ex + 1, because the non coding regions before the first exon and after the last exon are considered as introns)
#TPf is the number of exons predicted in the correct frame, that is, real exons overlapped by at least one predicted exon with TPf > 0 (a single overlapping nucleotide is enough!). To avoid counting such marginal overlaps, it is better to filter the output of cmpexons.pl, keep only the lines with (1+%TPf)/(1+%FPf) > 1, and use them instead of the raw output of cmpexons.pl.
sTPf is the sum of all the TPf of predicted exons (some software generates overlapping predictions, so some nucleotides may be counted several times in sTPf)
#FPf comes from the output of FPexons.pl
#TN comes from the output of TNintrons.pl
#FNf is the number of real exons not considered as predicted (comes from the output of FNexons.pl)
sFNf is the sum of the FNf coming from the output of FNexons.pl (same remark as for sTPf).
#split is the number of split events (a split event occurs when a portion of an exon is not predicted).
#ExR split is the number of exons that have been split.
#concat is the number of concat events (a concat event occurs when a non coding region is predicted as coding and thus merges two exons).
#ExR concat is the number of exons that have been merged.
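
The filtering recommended for #TPf (keeping only the lines with (1+%TPf)/(1+%FPf) > 1) could be sketched as below. This is only an example: it assumes that the output of cmpexons.pl is tab-separated with the ratio in the last column, as in the table layout shown earlier, and the function name and sample lines are hypothetical.

```python
def keep_informative_lines(lines):
    """Keep only data lines whose last column (the quality ratio) exceeds 1.

    Assumption: columns are tab-separated and the ratio is the last one.
    Lines without a numeric last column (header, blank lines) are dropped.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            ratio = float(fields[-1])
        except ValueError:
            continue
        if ratio > 1:
            kept.append(line)
    return kept


# Made-up lines in the assumed layout.
sample = [
    "Contig\tReality\tLength\tPrediction\tTPf\t%TPf\tFPf\t%FPf\t(1+%TPf)/(1+%FPf)\n",
    "seq1\tExR0 [101, 200]\t100\tExP0 [101, 200]\t100\t100\t0\t0\t101\n",
    "seq1\tExR1 [301, 400]\t100\tExP1 [400, 520]\t1\t1\t120\t120\t0.0165\n",
]
filtered = keep_informative_lines(sample)
print("".join(filtered), end="")
```

The filtered lines would then be given to cmpseq.pl on standard input in place of the raw output.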

cmpseqFI.pl realData predData
Reads as standard input the output of cmpexonsFI.pl and takes the real and prediction data sets as parameters. It generates:

Contig	Nb Ex	sLg	#Pred	#Correct	sCorrect	#Partial	sPartial	#Wrong	sWrong	#CompMiss	sCompMiss	#split	#ExR split	#concat	#ExR concat

where:

Contig is the entry in the data set
Nb Ex is the number of exons in the entry
sLg is the sum of the lengths of all exons of this entry
#Pred is the number of predicted exons
#Correct is the number of exons predicted with correct borders (%TP=100 && %FP=0)
sCorrect is the number of nucleotides in exons predicted with correct borders
#Partial is the number of exons predicted with wrong borders (%TP!=100 || %FP!=0)
sPartial is the number of nucleotides in exons predicted with wrong borders
#Wrong is the number of predicted exons that match no real exon
sWrong is the number of nucleotides in predicted exons that match no real exon
#CompMiss is the number of exons not predicted at all
sCompMiss is the number of nucleotides in exons not predicted at all
#split is the number of split events (a split event occurs when a portion of an exon is not predicted).
#ExR split is the number of exons that have been split.
#concat is the number of concat events (a concat event occurs when a non coding region is predicted as coding and thus merges two exons).
#ExR concat is the number of exons that have been merged.
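
The four exon categories can be illustrated with a small Python sketch. It is not the code of cmpseqFI.pl: the function names are hypothetical, borders are assumed to be inclusive, and the sketch assumes that #Correct, #Partial and #CompMiss count real exons while #Wrong counts predicted exons.

```python
def overlaps(a, b):
    """True if the two (left, right) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_length(exon):
    return exon[1] - exon[0] + 1


def classify_exons(real_exons, pred_exons):
    """Return {category: (number of exons, number of nucleotides)}."""
    pred_set = set(pred_exons)
    # Correct: a predicted exon has exactly the same borders.
    correct = [r for r in real_exons if r in pred_set]
    # Partial: overlapped by a prediction, but never with exact borders.
    partial = [r for r in real_exons
               if r not in pred_set and any(overlaps(r, p) for p in pred_exons)]
    # CompMiss: real exons that no prediction overlaps.
    comp_miss = [r for r in real_exons
                 if not any(overlaps(r, p) for p in pred_exons)]
    # Wrong: predicted exons that overlap no real exon.
    wrong = [p for p in pred_exons
             if not any(overlaps(p, r) for r in real_exons)]
    groups = {"Correct": correct, "Partial": partial,
              "Wrong": wrong, "CompMiss": comp_miss}
    return {name: (len(exons), sum(exon_length(e) for e in exons))
            for name, exons in groups.items()}


# Hypothetical entry with three real and three predicted exons.
summary = classify_exons([(11, 20), (41, 50), (71, 80)],
                         [(11, 20), (45, 55), (90, 95)])
print(summary)
```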

Gene level

cmpgene.pl realData predData
This program compares the predicted and real gene structures.
It first gives the real gene structure, deduced from the real data set:

Seq	Gene	Frame	begin	end
seq#	0 .. n	+/-	x	y

Then the gene structure of the prediction:

Seq	Gene	Frame	begin	end	begin exon type	end exon type
seq#	0 .. n	+/-	x	y	Term / Intr / Init / Sngl	Term / Intr / Init / Sngl

Because the predicted gene structure is not perfect, the pair [begin exon type, end exon type] is not always one of {[Sngl, Sngl], [Init, Term], [Term, Init]}.

Then information about split and concat events:

Contig	Real gene(s)	Split in / concatenated with	explanation
seq#	(i, j, ... , n)	x	real genes i, j, ... , n in seq# are concatenated with predicted gene x in seq#
seq#	x	(i, j, ... , n)	real gene x in seq# is split in predicted genes i, j, ... , n in seq#

The correctness of the prediction:

Contig	Gene	Comment
seq#	Gene [begin, end]	one or more of the comments listed below

The possible comments are:

borders differ for at least one exon
not predicted
good borders for all exons
perfect match
the predicted type of exon is correct: Good single predicted, or Perfect model
the predicted type of exon is not correct: NOT single predicted, or at least one border exon has incorrect type: Ta Tb

Finally a summary:

Contig	#genes	#pred	#Correct	#Missing	#partial	#Wrong	#split	#concat
seq#	nb genes	nb predicted genes	nb good predictions	nb not predicted	#pred - #Correct - #Wrong	nb predictions that do not overlap a real gene	nb genes split	nb genes concatenated

lgStretch.pl realData cmpexons.pl_output
This program computes the longest stretch for a sequence: the maximum number of nucleotides that are predicted in the correct frame and belong to a contiguous series of real exons.
It takes as parameters a real data set and the output of cmpexons.pl, and generates as standard output:

Contig	lg stretch
seq#	nb nucleotides

If you encounter a problem, contact Patrice Déhais
