Post processing

Data sets

Two main data sets are taken into account:

For special purposes, we have built sub data sets:

Nucleic level

realData predData dataList
This program takes three parameters as input: a real data set, a prediction data set and a data list file. For each entry in the data list file, it writes to standard output:

Exons level

realData predData
For each entry of the real data set and for each exon of this entry, this program writes to standard output:
Contig  Reality      Length  Prediction     TPf  %TPf  FPf  %FPf  (1+%TPf)/(1+%FPf)
seq#    ExR# [L, R]  l       ExP# [L', R']
where:

realData predData
Same as but does not take the frame into account.

predData
Reads as standard input the output of (or and computes for each entry in the prediction data set the number of predicted exons that do not overlap a real exon (in the good frame for That is the number of false positives FP.

realData
Reads as standard input the output of (or and computes for each entry in the real data set the number of false negatives FN, that is, the number of real exons that have not been considered as predicted, and thus are not listed in the output of (
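The two counts described above can be illustrated with a minimal Python sketch. It is not the code of the programs themselves: the function names, the (left, right) tuple representation and the 1-based inclusive coordinates are assumptions made for the example, and the frame is ignored.

```python
from typing import List, Tuple

# Hypothetical representation: an exon is a (left, right) pair,
# 1-based and inclusive. This is an assumption for the example.
Exon = Tuple[int, int]

def overlaps(a: Exon, b: Exon) -> bool:
    """True when two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def count_fp_fn(real: List[Exon], pred: List[Exon]) -> Tuple[int, int]:
    """Return (FP, FN) exon counts for one sequence, ignoring the frame.

    FP: predicted exons that overlap no real exon.
    FN: real exons that are overlapped by no predicted exon.
    """
    fp = sum(1 for p in pred if not any(overlaps(p, r) for r in real))
    fn = sum(1 for r in real if not any(overlaps(r, p) for p in pred))
    return fp, fn

real_exons = [(100, 200), (300, 400), (500, 600)]
pred_exons = [(90, 210), (450, 480), (520, 580)]
print(count_fp_fn(real_exons, pred_exons))  # (1, 1)
```

Here the prediction (450, 480) overlaps no real exon (one FP) and the real exon (300, 400) is overlapped by no prediction (one FN).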

Intron level

realData predData dataList
For each entry in the prediction data set, this program computes intronic regions (parts that are not predicted as exons at all) and compares them to the real introns. For each predicted intron found, it outputs the number of nucleotides that are truly non-coding (TN) and the number of nucleotides wrongly predicted as non-coding (FN).
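As an illustration only, the intron-level computation can be sketched in Python as follows. The function names, the 1-based inclusive coordinates and the assumption that real exons do not overlap each other are choices made for this example. The sketch also treats every region not covered by a predicted exon as a predicted non-coding region, including the two ends of the sequence.

```python
from typing import List, Tuple

# Hypothetical representation: (left, right), 1-based and inclusive.
Interval = Tuple[int, int]

def predicted_introns(seq_len: int, pred_exons: List[Interval]) -> List[Interval]:
    """Regions of the sequence that are covered by no predicted exon."""
    gaps, pos = [], 1
    for left, right in sorted(pred_exons):
        if left > pos:
            gaps.append((pos, left - 1))
        pos = max(pos, right + 1)
    if pos <= seq_len:
        gaps.append((pos, seq_len))
    return gaps

def tn_fn(region: Interval, real_exons: List[Interval]) -> Tuple[int, int]:
    """Return (TN, FN) nucleotide counts for one predicted non-coding region.

    FN: positions of the region that lie inside a real exon.
    TN: the remaining positions of the region.
    Assumes that real exons do not overlap each other.
    """
    length = region[1] - region[0] + 1
    coding = sum(
        max(0, min(region[1], right) - max(region[0], left) + 1)
        for left, right in real_exons
    )
    return length - coding, coding

pred = [(101, 200), (401, 500)]
real = [(101, 200), (351, 500)]
for region in predicted_introns(1000, pred):
    print(region, tn_fn(region, real))
# (1, 100) (100, 0)
# (201, 400) (150, 50)
# (501, 1000) (500, 0)
```

In the second region, 50 nucleotides of the real exon (351, 500) were predicted as non-coding, which gives FN = 50.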

Sequence summary
Reads as standard input the output of and computes for each sequence the number of well predicted introns.

realData FPdata FNdata TNdata
Reads as standard input the output of and takes as parameters respectively the files where the results of, and have been stored. It gives as output:
Contig  sLg  Nb Ex  Nb In  #TPf  sTPf  sTPf/sLg  #FPf  #TN  #FNf  sFNf  sFNf/sLg  #split  #ExR split  #concat  #ExR concat

where:

realData predData
Reads as standard input the output of and takes the real and prediction data sets as parameters. It generates:
Contig  Nb Ex  sLg  #Pred  #Correct  sCorrect  #Partial  sPartial  #Wrong  sWrong  #CompMiss  sCompMiss  #split  #ExR split  #concat  #ExR concat

Gene level

realData predData
This program compares the predicted and real gene structures.
It first gives the real gene structure deduced from the real data set:
Seq   Gene  Frame  begin  end
seq#  0..n  +/-    x      y

Then the gene structure of the prediction:
Seq   Gene  Frame  begin  end  begin exon type            end exon type
seq#  0..n  +/-    x      y    Term / Intr / Init / Sngl  Term / Intr / Init / Sngl
Because the predicted gene structure is not always perfect, the pair [begin exon type, end exon type] is not always one of {[Sngl, Sngl], [Init, Term], [Term, Init]}.
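A check of this kind can be written in a few lines. The following Python sketch is only an illustration: the constant and function names are hypothetical, and the set of valid pairs is copied from the sentence above.

```python
# Pairs [begin exon type, end exon type] listed above as consistent.
VALID_BORDER_TYPES = {("Sngl", "Sngl"), ("Init", "Term"), ("Term", "Init")}

def has_consistent_borders(begin_type: str, end_type: str) -> bool:
    """True when the two border exon types form one of the valid pairs."""
    return (begin_type, end_type) in VALID_BORDER_TYPES

print(has_consistent_borders("Init", "Term"))  # True
print(has_consistent_borders("Intr", "Term"))  # False
```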

Then information about split and concat events:
Contig  Real gene(s)    Split in / concatenated with  Explanation
seq#    (i, j, ..., n)  x                             real genes i, j, ..., n in seq# are concatenated with predicted gene x in seq#
seq#    x               (i, j, ..., n)                real gene x in seq# is split into predicted genes i, j, ..., n in seq#
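One way to detect such events is to look at which real genes overlap which predicted genes. The Python sketch below is an assumption about how this could be done, not a description of the program: genes are reduced to (begin, end) pairs, the strand is ignored, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical representation: a gene is a (begin, end) pair, inclusive.
Gene = Tuple[int, int]

def overlaps(a: Gene, b: Gene) -> bool:
    """True when two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def split_and_concat(
    real: List[Gene], pred: List[Gene]
) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Return (splits, concats) for one sequence.

    splits:  real gene index -> predicted genes it is split into.
    concats: predicted gene index -> real genes it concatenates.
    """
    real_to_pred = defaultdict(list)
    pred_to_real = defaultdict(list)
    for i, r in enumerate(real):
        for j, p in enumerate(pred):
            if overlaps(r, p):
                real_to_pred[i].append(j)
                pred_to_real[j].append(i)
    splits = {i: js for i, js in real_to_pred.items() if len(js) > 1}
    concats = {j: ks for j, ks in pred_to_real.items() if len(ks) > 1}
    return splits, concats

real_genes = [(100, 500), (800, 1200), (1500, 3000)]
pred_genes = [(90, 1250), (1400, 2000), (2200, 3100)]
splits, concats = split_and_concat(real_genes, pred_genes)
print(splits)   # {2: [1, 2]}
print(concats)  # {0: [0, 1]}
```

In this example, real gene 2 is split into predicted genes 1 and 2, and real genes 0 and 1 are concatenated with predicted gene 0.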

The correctness of the prediction:
Contig Gene Comment
seq# Gene [begin, end]
  • borders differ for at least one exon
    not predicted 
  • good borders for all exons
    perfect match 
    • the predicted type of exon is correct
      • Good single predicted
      • Perfect model
    • the predicted type of exon is not correct
      • NOT single predicted
      • at least one border exon has incorrect type: Ta Tb

Finally, a summary with the following columns:
Contig    seq#
#genes    nb genes
#pred     nb predicted genes
#Correct  nb good predictions
#Missing  nb not predicted
#partial  = (#pred - #Correct - #Wrong)
#Wrong    nb predictions that do not overlap a real gene
#split    nb genes split
#concat   nb genes concatenated

realData cmpexons.pl_output
This program computes the longest stretch for a sequence:
the maximum number of nucleotides that are predicted in the correct frame and belong to a series of contiguous real exons.
It takes as parameters a real data set and the output of and generates as standard output:
Contig  lg stretch
seq#    nb nucleotides
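The definition above leaves some room for interpretation. The Python sketch below shows one possible reading, stated as an assumption: each real exon, taken in order along the sequence, has a count of nucleotides predicted in the correct frame, and a stretch is a run of consecutive real exons whose counts are all non-zero. The function name and the input format are hypothetical and do not reflect the actual input of the program.

```python
from typing import List

def longest_stretch(correct_per_exon: List[int]) -> int:
    """Largest sum of correctly framed nucleotides over a run of
    consecutive real exons that each have a non-zero count."""
    best = current = 0
    for count in correct_per_exon:
        # A real exon with no correctly framed nucleotide ends the run.
        current = current + count if count > 0 else 0
        best = max(best, current)
    return best

# One value per real exon, in order along the sequence.
print(longest_stretch([120, 0, 80, 150, 60, 0, 200]))  # 290
```

With these counts, the third, fourth and fifth real exons form the longest run (80 + 150 + 60 = 290 nucleotides).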

If you encounter a problem, contact Patrice Déhais