Post processing
Data sets
Two main data sets are taken into account:
-
Data.txtST which contains annotations grouped
by contigs (each entry can contain more than one gene). Information
about sequence lengths is stored in Data.list.
-
geneData.txtST which contains annotations
grouped by genes rather than by contigs: for each contig, we group exons by gene and
treat each such set as a new entry in the new data set. The correspondence
between new entries (seq###) and contig names (seq##) is stored in genecorrespondance.txt.
New entries keep the sequence length of the contig they come from
(geneData.list).
For special purposes, we have built sub data sets:
-
Data.initST which is deduced from Data.txtST
and contains only initial exons.
-
Data.intrST which is deduced from Data.txtST
and contains only internal exons.
-
Data.termST which is deduced from Data.txtST
and contains only terminal exons.
-
Data.snglST which is deduced from Data.txtST
and contains only single exons (genes with one exon).
Nucleic level
cmpbase.pl realData predData dataList
This program takes 3 parameters as input: a real data set, a prediction
data set and a data list file. For each entry in the data list file,
this program computes (standard output):
-
the number of coding nucleic acids in the real data set that are
also predicted as coding in the prediction data set (TP: true positive)
-
the number of non coding nucleic acids in the real data set that
are predicted as coding in the prediction data set (FP:
false positive)
-
the number of non coding nucleic acids in the real data set that
are also predicted as non coding in the prediction data set (TN:
true negative)
-
the number of coding nucleic acids in the real data set that are
predicted as non coding in the prediction data set (FN: false negative)
-
the sensitivity: Sn = TP/(TP+FN)
-
the specificity: Sp = TP/(TP+FP)
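The counts and ratios listed above can be illustrated with a minimal Python sketch. It is not the code of cmpbase.pl: the function names, the representation of exons as (left, right) pairs and the 1-based inclusive coordinates are assumptions made for the example.

```python
def nucleotide_counts(real_exons, pred_exons, seq_length):
    """Return (TP, FP, TN, FN) for one entry.

    Assumption: exons are (left, right) pairs, 1-based and inclusive,
    and all positions lie within 1..seq_length.
    """
    real = set()
    for left, right in real_exons:
        real.update(range(left, right + 1))
    pred = set()
    for left, right in pred_exons:
        pred.update(range(left, right + 1))
    tp = len(real & pred)   # coding and predicted as coding
    fp = len(pred - real)   # non coding but predicted as coding
    fn = len(real - pred)   # coding but predicted as non coding
    tn = seq_length - tp - fp - fn
    return tp, fp, tn, fn


def sensitivity_specificity(tp, fp, fn):
    """Sn = TP/(TP+FN) and Sp = TP/(TP+FP), as defined in this section."""
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


tp, fp, tn, fn = nucleotide_counts([(11, 20)], [(16, 25)], 30)
print(tp, fp, tn, fn, sensitivity_specificity(tp, fp, fn))
```

Returning 0.0 when a denominator is zero is a choice made for this sketch; the behaviour of cmpbase.pl in that case is not documented here.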
Exons level
cmpexons.pl realData predData
For each entry of the real data set and for each exon of this entry,
it writes to standard output:
Contig | Reality | Length | Prediction | TPf | %TPf | FPf | %FPf | (1+%TPf)/(1+%FPf)
seq# | ExR# [L, R] | l | ExP# [L', R'] | | | | |
where:
-
seq# is an entry in the real data set
-
ExR# [L, R] is an exon in this entry (numbered from 0 to n) and [L, R]
are its left and right borders in the corresponding contig
-
l is the length of this exon
-
ExP# [L', R'] is a predicted exon for the same entry that overlaps this
particular exon in the correct frame
-
TPf is the number of nucleic acids of the predicted exon that are really
coding
-
%TPf = TPf / Length * 100
-
FPf is the number of nucleic acids of the predicted exon that are not coding
-
%FPf = FPf / Length * 100
-
(1+%TPf)/(1+%FPf) gives an idea of the quality of the prediction: it reaches
its maximum of 101 when %TPf = 100 and %FPf = 0, and falls below 1 when a predicted
exon has more FPf than TPf, which means the prediction software produces too
much noise.
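As an illustration of the columns defined above, here is a minimal Python sketch for a single pair of real and predicted exons. It is a simplification and not the code of cmpexons.pl: it ignores the reading frame (as cmpexonsFI.pl does), it counts FPf against this one real exon only, and the function name and the (left, right) inclusive coordinates are assumptions.

```python
def exon_quality(real_exon, pred_exon):
    """Return (TPf, %TPf, FPf, %FPf, ratio) for one real/predicted exon pair.

    Assumption: exons are (left, right) pairs with inclusive borders.
    """
    rl, rr = real_exon
    pl, pr = pred_exon
    length = rr - rl + 1                              # Length column
    overlap = max(0, min(rr, pr) - max(rl, pl) + 1)
    tpf = overlap                                     # predicted and really coding
    fpf = (pr - pl + 1) - overlap                     # predicted but not coding
    pct_tpf = tpf / length * 100
    pct_fpf = fpf / length * 100
    ratio = (1 + pct_tpf) / (1 + pct_fpf)
    return tpf, pct_tpf, fpf, pct_fpf, ratio


print(exon_quality((101, 200), (101, 200)))  # exact prediction
print(exon_quality((101, 200), (151, 300)))  # noisy prediction
```

With identical borders the ratio reaches the maximum of 101 mentioned above; in the second call the predicted exon has more FPf than TPf, so the ratio falls below 1.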
cmpexonsFI.pl realData predData
Same as cmpexons.pl but doesn't take into account the frame.
FPexons.pl predData
Reads from standard input the output of cmpexons.pl (or cmpexonsFI.pl)
and computes, for each entry in the prediction data set, the number of predicted
exons that do not overlap a real exon (in the correct frame for cmpexons.pl).
This is the number of false positives FP.
FNexons.pl realData
Reads from standard input the output of cmpexons.pl (or cmpexonsFI.pl)
and computes, for each entry in the real data set, the number of false negatives
FN, that is, the number of real exons that have not been considered
as predicted and thus are not listed in the output of cmpexons.pl (cmpexonsFI.pl).
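The two counts can be sketched in Python as follows. Unlike FPexons.pl and FNexons.pl, which read the output of cmpexons.pl, this illustration works directly on exon intervals and ignores the reading frame; the function names and the (left, right) inclusive coordinates are assumptions.

```python
def overlaps(a, b):
    """True when two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def count_fp_fn_exons(real_exons, pred_exons):
    """Return (FP, FN) at the exon level for one entry.

    FP: predicted exons that overlap no real exon.
    FN: real exons that are overlapped by no predicted exon.
    """
    fp = sum(1 for p in pred_exons
             if not any(overlaps(p, r) for r in real_exons))
    fn = sum(1 for r in real_exons
             if not any(overlaps(r, p) for p in pred_exons))
    return fp, fn


print(count_fp_fn_exons([(10, 20), (50, 60)], [(15, 25), (80, 90)]))
```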
Intron level
cmpintrons.pl realData predData datalist
For each entry in the prediction data set, this program computes intronic
regions (parts that are not predicted as exons at all) and compares them
to the real introns. For each predicted intron found, it outputs
the number of nucleic acids that are really non coding (TN) and the
number of nucleic acids wrongly predicted as non coding (FN).
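A minimal Python sketch of this computation is given below. It is an illustration, not the code of cmpintrons.pl: the function names are hypothetical, exons are assumed to be (left, right) pairs with 1-based inclusive borders, and real exons are assumed not to overlap each other.

```python
def predicted_introns(pred_exons, seq_length):
    """Return the regions of 1..seq_length not covered by any predicted exon."""
    regions = []
    cursor = 1  # first position not yet covered by a predicted exon
    for left, right in sorted(pred_exons):
        if left > cursor:
            regions.append((cursor, left - 1))
        cursor = max(cursor, right + 1)
    if cursor <= seq_length:
        regions.append((cursor, seq_length))
    return regions


def intron_tn_fn(region, real_exons):
    """Return (TN, FN) for one predicted intronic region.

    FN: positions of the region that belong to a real exon.
    TN: the remaining positions of the region.
    """
    left, right = region
    fn = 0
    for rl, rr in real_exons:
        fn += max(0, min(right, rr) - max(left, rl) + 1)
    tn = (right - left + 1) - fn
    return tn, fn


for region in predicted_introns([(11, 20), (41, 50)], 60):
    print(region, intron_tn_fn(region, [(11, 20), (36, 50)]))
```

The regions before the first predicted exon and after the last one are returned as well, so two predicted exons yield three intronic regions here.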
Sequence summary
TNintrons.pl
Reads from standard input the output of cmpintrons.pl and computes, for
each sequence, the number of well predicted introns.
cmpseq.pl realData FPdata FNdata TNdata
Reads from standard input the output of cmpexons.pl and takes as parameters,
in this order, the real data set and the files where the results of FPexons.pl, FNexons.pl and
TNintrons.pl have been stored. It gives as output:
Contig | sLg | Nb Ex | Nb In | #TPf | sTPf | sTPf/sLg | #FPf | #TN | #FNf | sFNf | sFNf/sLg | #split | #ExR split | #concat | #ExR concat
where:
-
Contig is the entry in the data set
-
sLg is the sum of the lengths of all exons of this entry
-
Nb Ex is the number of exons in the entry
-
Nb In is the number of "introns" in the entry (Nb Ex + 1, because the
non coding regions before the first exon and after the last exon are considered
as introns)
-
#TPf is the number of exons predicted in the correct frame, that is, exons
that have at least one predicted exon with TPf > 0 (one overlapping
nucleic acid is enough!). To avoid counting such marginal overlaps, it is better to
filter the output of cmpexons.pl, keeping only lines with (1+%TPf)/(1+%FPf)
> 1, and to use them instead of the raw output of cmpexons.pl.
-
sTPf is the sum of all the TPf of predicted exons (some software
generates overlapping predictions, so some nucleic acids may be
counted several times in sTPf)
-
#FPf comes from the output of FPexons.pl
-
#TN comes from the output of TNintrons.pl
-
#FNf is the number of real exons not considered as predicted
(comes from the output of FNexons.pl)
-
sFNf is the sum of the FNf coming from the output of FNexons.pl
(same remark as for sTPf).
-
#split is the number of split events (a split event occurs when a portion
of an exon is not predicted).
-
#ExR split is the number of exons that have been split.
-
#concat is the number of concat events (a concat event occurs when
a non coding region, wrongly predicted as coding, merges two exons).
-
#ExR concat is the number of exons that have been merged.
cmpseqFI.pl realData predData
Reads from standard input the output of cmpexonsFI.pl and takes the real and prediction
data sets as parameters. It generates:
Contig | Nb Ex | sLg | #Pred | #Correct | sCorrect | #Partial | sPartial | #Wrong | sWrong | #CompMiss | sCompMiss | #split | #ExR split | #concat | #ExR concat
where:
-
Contig is the entry in the data set
-
Nb Ex is the number of exons in the entry
-
sLg is the sum of the lengths of all exons of this entry
-
#Pred is the number of predicted exons
-
#Correct is the number of exons predicted with good borders (%TP=100
&& %FP=0)
-
sCorrect is the number of nucleic acids in exons predicted with good
borders
-
#Partial is the number of exons predicted with wrong borders (%TP!=100
|| %FP!=0)
-
sPartial is the number of nucleic acids in exons predicted with wrong
borders
-
#Wrong is the number of predicted exons which match no real exons
-
sWrong is the number of nucleic acids in predicted exons which
match no real exons
-
#CompMiss is the number of exons not predicted at all
-
sCompMiss is the number of nucleic acids in exons not predicted
at all
-
#split is the number of split events (a split event occurs when a portion
of an exon is not predicted).
-
#ExR split is the number of exons that have been split.
-
#concat is the number of concat events (a concat event occurs when
a non coding region, wrongly predicted as coding, merges two exons).
-
#ExR concat is the number of exons that have been merged.
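The #Correct, #Partial, #Wrong and #CompMiss categories can be illustrated with the Python sketch below. It is one possible reading of the definitions above, not the code of cmpseqFI.pl: it assumes that #Correct and #Partial are counted per predicted exon, that "good borders" means identical left and right borders, and that exons are (left, right) pairs with inclusive coordinates. The function name is hypothetical.

```python
def classify_exons(real_exons, pred_exons):
    """Count Correct, Partial, Wrong and CompMiss exons for one entry."""
    counts = {"Correct": 0, "Partial": 0, "Wrong": 0, "CompMiss": 0}

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    for p in pred_exons:
        hits = [r for r in real_exons if overlaps(p, r)]
        if not hits:
            counts["Wrong"] += 1        # matches no real exon
        elif any(r[0] == p[0] and r[1] == p[1] for r in hits):
            counts["Correct"] += 1      # %TP = 100 and %FP = 0
        else:
            counts["Partial"] += 1      # overlap, but wrong borders
    for r in real_exons:
        if not any(overlaps(r, p) for p in pred_exons):
            counts["CompMiss"] += 1     # real exon not predicted at all
    return counts


print(classify_exons([(10, 20), (30, 40), (50, 60)],
                     [(10, 20), (32, 45), (70, 80)]))
```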
Gene level
cmpgene.pl realData predData
This program compares the predicted and real gene structures.
It first gives the real gene structure deduced from the real data set:
Seq | Gene | Frame | begin | end
seq# | 0 .. n | +/- | x | y
Then the gene structure of the prediction:
Seq | Gene | Frame | begin | end | begin exon type | end exon type
seq# | 0 .. n | +/- | x | y | Term / Intr / Init / Sngl | Term / Intr / Init / Sngl
Because the predicted gene structure is not perfect, the begin exon type and end
exon type do not always form one of the pairs {[Sngl, Sngl], [Init, Term], [Term,
Init]}.
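The check on the pair of border exon types can be written as a small Python sketch. The set of valid pairs is taken from the text above; the names VALID_GENE_ENDS and has_consistent_ends are hypothetical and do not come from cmpgene.pl.

```python
# Pairs (begin exon type, end exon type) expected for a well formed gene,
# as listed above.
VALID_GENE_ENDS = {("Sngl", "Sngl"), ("Init", "Term"), ("Term", "Init")}


def has_consistent_ends(begin_type, end_type):
    """True when the two border exon types form one of the expected pairs."""
    return (begin_type, end_type) in VALID_GENE_ENDS


print(has_consistent_ends("Init", "Term"))
print(has_consistent_ends("Intr", "Term"))
```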
Then information about split and concat events:
Contig | Real gene(s) | Split in / concatenated with | explanation
seq# | (i, j, ..., n) | x | real genes i, j, ..., n in seq# are concatenated with predicted gene x in seq#
seq# | x | (i, j, ..., n) | real gene x in seq# is split into predicted genes i, j, ..., n in seq#
The correctness of the prediction:
Contig | Gene | Comment
seq# | Gene [begin, end] |
-
borders differ for at least one exon
not predicted
-
good borders for all exons
perfect match
-
the predicted type of exon is correct
-
Good single predicted
-
Perfect model
-
the predicted type of exon is not correct
-
NOT single predicted
-
at least one border exon has incorrect type: Ta Tb
|
Finally a summary:
Contig | #genes | #pred | #Correct | #Missing | #partial | #Wrong | #split | #concat
seq# | nb genes | nb predicted genes | nb good prediction | nb not predicted | = (#pred - #Correct - #Wrong) | nb predictions that do not overlap a real gene | nb genes split | nb genes concatenated
lgStretch.pl realData cmpexons.pl_output
This program computes the longest stretch for a sequence: the
maximum number of nucleotides that are predicted in the correct frame and belong
to a contiguous series of real exons.
It takes as parameters a real data set and the output of cmpexons.pl
and writes to standard output:
Contig | lg stretch
seq# | nb nucleic acid
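One possible reading of the longest stretch is sketched below in Python. It is an interpretation, not the code of lgStretch.pl: it assumes that each real exon of the sequence, taken in order, comes with its number of nucleotides predicted in the correct frame (0 when the exon is not predicted), and that a stretch is a run of consecutive exons with a non-zero value. The function name and the input layout are hypothetical.

```python
def longest_stretch(tpf_per_exon):
    """Return the largest sum of TPf over a run of consecutive predicted exons.

    tpf_per_exon: TPf of each real exon, in sequence order
    (assumed layout; 0 means the exon is not predicted in the correct frame).
    """
    best = 0
    current = 0
    for tpf in tpf_per_exon:
        if tpf > 0:
            current += tpf          # the run continues
            best = max(best, current)
        else:
            current = 0             # a missed exon breaks the run
    return best


print(longest_stretch([50, 0, 30, 40, 0, 10]))
```

In this example the second and fifth exons are missed, so the longest run is formed by the third and fourth exons.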
If you encounter a problem, contact
Patrice Déhais