Post processing


Data sets

Two main data sets are taken into account:

For special purposes, we have built sub data sets:

Nucleic level

cmpbase.pl realData predData dataList
This program takes three parameters as input: a real data set, a prediction data set and a data list file. For each entry in the data list file, it writes to standard output:

Exons level

cmpexons.pl realData predData
For each entry of the real data set and for each exon of this entry, it writes to standard output:
 
Contig  Reality      Length  Prediction     TPf  %TPf  FPf  %FPf  (1+%TPf)/(1+%FPf)
seq#    ExR# [L, R]  l       ExP# [L', R']
where:

cmpexonsFI.pl realData predData
Same as cmpexons.pl, but does not take the frame into account.
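
The per-exon columns above can be illustrated with a small frame-independent sketch in Python. It is not the code of cmpexons.pl or cmpexonsFI.pl. It assumes inclusive coordinates, that TP counts predicted nucleotides inside the real exon and FP those outside it, and that the percentages are taken relative to the real exon length and the predicted exon length respectively; none of these definitions is stated in this document. The function names are hypothetical.

```python
def overlap_counts(real, pred):
    """Return (tp, fp) for one real exon and one predicted exon.

    real and pred are (left, right) pairs with inclusive coordinates.
    tp counts predicted nucleotides inside the real exon, fp those outside it.
    """
    (l, r), (pl, pr) = real, pred
    tp = max(0, min(r, pr) - max(l, pl) + 1)
    fp = (pr - pl + 1) - tp
    return tp, fp


def ratio(real, pred):
    """Compute (1 + %TP) / (1 + %FP) under the assumptions stated above."""
    tp, fp = overlap_counts(real, pred)
    real_len = real[1] - real[0] + 1
    pred_len = pred[1] - pred[0] + 1
    pct_tp = 100.0 * tp / real_len  # assumption: relative to the real exon
    pct_fp = 100.0 * fp / pred_len  # assumption: relative to the prediction
    return (1 + pct_tp) / (1 + pct_fp)


if __name__ == "__main__":
    print(overlap_counts((100, 199), (150, 249)))
    print(ratio((100, 199), (100, 199)))
```

Adding 1 to both terms keeps the ratio defined when a prediction has no false positive nucleotides.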

FPexons.pl predData
Reads the output of cmpexons.pl (or cmpexonsFI.pl) on standard input and computes, for each entry in the prediction data set, the number of predicted exons that do not overlap a real exon (in the correct frame, for cmpexons.pl). This is the number of false positives (FP).

FNexons.pl realData
Reads the output of cmpexons.pl (or cmpexonsFI.pl) on standard input and computes, for each entry in the real data set, the number of false negatives (FN), that is, the number of real exons that were not predicted and are therefore not listed in the output of cmpexons.pl (or cmpexonsFI.pl).
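
The two counts can be sketched in Python as follows. This is a minimal illustration, not the logic of FPexons.pl or FNexons.pl: it works directly on exon coordinates instead of parsing the output of cmpexons.pl, it ignores the frame (as cmpexonsFI.pl does), and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if inclusive intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def count_fp_fn(real_exons, pred_exons):
    """Return (FP, FN) for one sequence.

    FP: predicted exons that overlap no real exon.
    FN: real exons that are overlapped by no predicted exon.
    """
    fp = sum(1 for p in pred_exons
             if not any(overlaps(p, r) for r in real_exons))
    fn = sum(1 for r in real_exons
             if not any(overlaps(r, p) for p in pred_exons))
    return fp, fn


if __name__ == "__main__":
    real = [(100, 200), (300, 400), (500, 600)]
    pred = [(150, 220), (700, 800)]
    print(count_fp_fn(real, pred))
```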


Intron level

cmpintrons.pl realData predData dataList
For each entry in the prediction data set, this program computes the intronic regions (the parts that are not predicted as exons at all) and compares them to the real introns. For each predicted intron found, it outputs the number of nucleotides that are truly non-coding (TN) and the number of nucleotides wrongly predicted as non-coding (FN).
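
A minimal Python sketch of this comparison is given below. It is an illustration under assumptions, not the code of cmpintrons.pl: every region not covered by a predicted exon, including the sequence ends, is treated as predicted non-coding, real exons are assumed not to overlap each other, and the names predicted_introns, tn_fn and seq_len are hypothetical.

```python
def predicted_introns(pred_exons, seq_len):
    """Return the regions of 1..seq_len not covered by any predicted exon."""
    gaps, pos = [], 1
    for left, right in sorted(pred_exons):
        if left > pos:
            gaps.append((pos, left - 1))
        pos = max(pos, right + 1)
    if pos <= seq_len:
        gaps.append((pos, seq_len))
    return gaps


def tn_fn(gap, real_exons):
    """Return (TN, FN) for one predicted non-coding region.

    FN counts the positions of the gap that lie inside a real exon,
    TN counts the remaining positions of the gap.
    """
    fn = sum(max(0, min(gap[1], r) - max(gap[0], l) + 1)
             for l, r in real_exons)
    tn = (gap[1] - gap[0] + 1) - fn
    return tn, fn


if __name__ == "__main__":
    for gap in predicted_introns([(11, 20), (41, 50)], 60):
        print(gap, tn_fn(gap, [(15, 25), (38, 45)]))
```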


Sequence summary

TNintrons.pl
Reads the output of cmpintrons.pl on standard input and computes, for each sequence, the number of correctly predicted introns.

cmpseq.pl realData FPdata FNdata TNdata
Reads the output of cmpexons.pl on standard input and takes as parameters the real data set followed by the files in which the results of FPexons.pl, FNexons.pl and TNintrons.pl have been stored, in that order. It writes to standard output:
 
Contig  sLg  Nb Ex  Nb In  #TPf  sTPf  sTPf/sLg  #FPf  #TN  #FNf  sFNf  sFNf/sLg  #split  #ExR split  #concat  #ExR concat

where:

cmpseqFI.pl realData predData
Reads the output of cmpexonsFI.pl on standard input and takes the real and prediction data sets as parameters. It writes to standard output:
 
Contig  Nb Ex  sLg  #Pred  #Correct  sCorrect  #Partial  sPartial  #Wrong  sWrong  #CompMiss  sCompMiss  #split  #ExR split  #concat  #ExR concat
where:

Gene level

cmpgene.pl realData predData
This program compares the predicted and real gene structures.
It first gives the real gene structure deduced from the real data set:
 
Seq   Gene     Frame  begin  end
seq#  0 .. n   +/-    x      y

Then the gene structure of the prediction:
 
Seq   Gene   Frame  begin  end  begin exon type            end exon type
seq#  0..n   +/-    x      y    Term / Intr / Init / Sngl  Term / Intr / Init / Sngl
Because the predicted gene structure is not perfect, the pair (begin exon type, end exon type) is not always one of [Sngl, Sngl], [Init, Term] or [Term, Init].
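
As an illustration, the check on the pair of border exon types can be written as a small Python sketch. The set below contains exactly the three pairs listed above; the names CONSISTENT_PAIRS and is_consistent are hypothetical and do not come from cmpgene.pl.

```python
# Pairs (begin exon type, end exon type) listed in the text as consistent.
CONSISTENT_PAIRS = {("Sngl", "Sngl"), ("Init", "Term"), ("Term", "Init")}


def is_consistent(begin_type, end_type):
    """True if the border exon types form one of the listed pairs."""
    return (begin_type, end_type) in CONSISTENT_PAIRS


if __name__ == "__main__":
    print(is_consistent("Init", "Term"))
    print(is_consistent("Intr", "Term"))
```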

Then information about split and concat events:
 
Contig  Real gene(s)    Split into / concatenated with  Explanation
seq#    (i, j, ..., n)  x                               real genes i, j, ..., n in seq# are concatenated with predicted gene x in seq#
seq#    x               (i, j, ..., n)                  real gene x in seq# is split into predicted genes i, j, ..., n in seq#
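
Split and concat events can be detected along the following lines. This Python sketch is only an assumption about the criterion: it treats a real gene as split when it overlaps more than one predicted gene, and real genes as concatenated when more than one of them overlaps the same predicted gene. It ignores the strand, numbers genes from 0 as in the tables above, and uses hypothetical function names.

```python
def overlaps(a, b):
    """True if inclusive intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def split_and_concat(real_genes, pred_genes):
    """Return (splits, concats) as dicts mapping a gene index to a list.

    splits[i]  = predicted genes overlapping real gene i (if more than one).
    concats[x] = real genes overlapping predicted gene x (if more than one).
    """
    splits, concats = {}, {}
    for i, r in enumerate(real_genes):
        hits = [x for x, p in enumerate(pred_genes) if overlaps(r, p)]
        if len(hits) > 1:
            splits[i] = hits
    for x, p in enumerate(pred_genes):
        hits = [i for i, r in enumerate(real_genes) if overlaps(r, p)]
        if len(hits) > 1:
            concats[x] = hits
    return splits, concats


if __name__ == "__main__":
    real = [(1, 100), (200, 300), (400, 900)]
    pred = [(50, 250), (400, 500), (600, 800)]
    print(split_and_concat(real, pred))
```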

The correctness of the prediction:
 
Contig Gene Comment
seq# Gene [begin, end]
  • borders differ for at least one exon
    not predicted 
  • good borders for all exons
    perfect match 
    • the predicted type of exon is correct
      • Good single predicted
      • Perfect model
    • the predicted type of exon is not correct
      • NOT single predicted
      • at least one border exon has incorrect type: Ta Tb

Finally a summary:
 
 
Contig    seq#
#genes    number of genes
#pred     number of predicted genes
#Correct  number of good predictions
#Missing  number of genes not predicted
#partial  #pred - #Correct - #Wrong
#Wrong    number of predictions that do not overlap a real gene
#split    number of genes split
#concat   number of genes concatenated

lgStretch.pl realData cmpexons.pl_output
This program computes the longest stretch for a sequence: the maximum number of nucleotides that are predicted in the correct frame and belong to a run of contiguous real exons.
It takes as parameters a real data set and the output of cmpexons.pl, and writes to standard output:
 
Contig  lg stretch
seq#    nb nucleic acid
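
One possible reading of the longest stretch is sketched below in Python. It assumes that the input is the number of correctly framed predicted nucleotides for each real exon, in exon order, and that a stretch is interrupted by any real exon with no such nucleotide. Both points are assumptions rather than the documented behaviour of lgStretch.pl, and the function name is hypothetical.

```python
def longest_stretch(tp_per_exon):
    """Return the largest sum of in-frame nucleotides over consecutive
    real exons that each have at least one correctly predicted nucleotide.

    tp_per_exon lists, in exon order, the in-frame true positive count of
    each real exon (0 when the exon is not predicted in the correct frame).
    """
    best = current = 0
    for tp in tp_per_exon:
        current = current + tp if tp > 0 else 0
        best = max(best, current)
    return best


if __name__ == "__main__":
    print(longest_stretch([120, 80, 0, 300, 50, 10]))
```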

If you encounter a problem, contact Patrice Déhais