Utilities
General tools
-
html2txt.c
This C program reads an HTML file from standard input and writes to
standard output the text with all HTML tags removed and the layout largely
preserved. It is used as a filter to extract prediction data from HTML
documents generated by some prediction software available on the Web.
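The filtering idea can be illustrated with a minimal Python sketch. This is not the code of html2txt.c, which is not shown here; the names TagStripper and strip_tags are hypothetical, and the sketch assumes that keeping the text nodes verbatim is enough to preserve the layout of preformatted prediction output.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Text between tags is kept verbatim, line breaks included,
        # so preformatted prediction tables keep their layout.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

Used as a filter, such a function would be applied to the whole of standard input and its result written to standard output.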
-
cutInGene.pl realData predData
This program takes two input parameters: a real data set file made of
contigs, and the standardized output of a prediction program. Its purpose
is to split the predicted exons into numbered sets that correspond to the
real partition into genes (deduced from the real data set). It also
generates a correspondance.txt file, which gives the new numbering of the
entries.
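The partitioning step might look like the following sketch. The file formats read by cutInGene.pl are not described here, so this example works on plain (start, end) tuples; the function name split_exons_by_gene, the overlap rule, and the use of set 0 for exons outside every gene are all assumptions made for illustration.

```python
def split_exons_by_gene(genes, exons):
    """Group predicted exons by the real gene whose span they overlap.

    genes: list of (start, end) spans of the real genes (assumed layout).
    exons: list of (start, end) predicted exons (assumed layout).
    Returns a dict mapping gene number (counted from 1) to its exons;
    exons that overlap no gene are collected under key 0.
    """
    groups = {number: [] for number in range(len(genes) + 1)}
    for exon_start, exon_end in exons:
        assigned = 0
        for number, (gene_start, gene_end) in enumerate(genes, start=1):
            # Two closed intervals overlap when each starts before the other ends.
            if exon_start <= gene_end and exon_end >= gene_start:
                assigned = number
                break
        groups[assigned].append((exon_start, exon_end))
    return groups
```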
Automation
Some prediction software was available only via the Web and did not allow
a set of sequences to be processed in one run, so we developed Perl scripts
to submit our sequences. These Perl scripts need libwww-perl
(http://www.linpro.no/lwp/).
-
NetStart 1.0
(netstart.pl)
This program needs two parameters: the location of a folder containing
all needed FASTA files*, and a Data.list file.
For each contig in Data.list, it submits the sequence to NetStart, once
for each strand. The output of NetStart is filtered with html2txt.
The resulting prediction data are stored in the current directory, in a
file whose name looks like seq##start#.txt (where ## is the contig number
and # is 1 or 2 depending on the strand, direct or reverse).
The output of this program is used by start2st.pl.
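The naming scheme can be made concrete with a small Python sketch. The helper output_name is hypothetical and is not part of the scripts; it assumes the contig number is always written on two digits, as required for the FASTA file names, and that tools producing a single file per contig simply omit the strand digit.

```python
def output_name(contig_number, tool, strand=None):
    """Build a file name such as seq01start2.txt (hypothetical helper).

    contig_number: integer contig number, written on two digits.
    tool: the tool tag used in the name, e.g. "start" for NetStart.
    strand: 1 (direct), 2 (reverse), or None when the name has no strand digit.
    """
    strand_part = "" if strand is None else str(strand)
    return f"seq{contig_number:02d}{tool}{strand_part}.txt"
```

For example, output_name(1, "start", 2) gives seq01start2.txt, the NetStart result for the reverse strand of contig 01.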
-
SplicePredictor
(sp_reformat.pl)
This program takes a SplicePredictor output file as first parameter and
direct or reverse as second parameter. It splits the input file into two
files for each sequence (one for each strand).
The names of the output files follow the format seq##splicep#.txt, where
## is the contig number and # is 1 for direct strand or 2 for reverse strand.
-
Genefinder-SPL
(spl.pl)
This program takes the same kind of parameters as the one for NetStart.
The output files have names which look like seq##spl#.txt where
## is the contig number and # is 1 for direct strand or 2 for reverse strand.
-
Genefinder-FEX
(fex.pl)
This program takes the same kind of parameters as the one for NetStart.
The output files have names which look like seq##fex#.txt where
## is the contig number and # is 1 for direct strand or 2 for reverse strand.
-
Genefinder-Fgenea
(fgenea.pl)
This program takes the same kind of parameters as the one for NetStart.
The output files have names which look like seq##fgene#.txt where
## is the contig number and # is 1 for direct strand or 2 for reverse strand.
-
Genefinder-Fgenep
(fgenep.pl)
This program takes the same kind of parameters as the one for NetStart.
The output files have names which look like seq##fgene#.txt where
## is the contig number and # is 1 for direct strand or 2 for reverse strand.
-
MZEF (mzef.pl)
This program takes the same kind of parameters as the one for NetStart,
plus two optional parameter settings: overlap=xxx and prior_proba=yyy
(the default values for overlap and prior_proba are 0 and .03
respectively). The output files have names which look like seq##mzef#.txt
where ## is the contig number and # is 1 for direct strand or 2 for
reverse strand.
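The handling of the two optional settings can be sketched in Python as follows. This is not taken from mzef.pl; the function name parse_mzef_options is hypothetical, and treating overlap as an integer and prior_proba as a floating-point number is an assumption based only on the default values given above.

```python
def parse_mzef_options(args):
    """Parse optional key=value settings, falling back to the defaults.

    args: list of strings such as ["overlap=1", "prior_proba=0.05"].
    Returns a dict with the keys "overlap" and "prior_proba".
    """
    options = {"overlap": 0, "prior_proba": 0.03}  # defaults stated in the text
    for arg in args:
        key, separator, value = arg.partition("=")
        if not separator or key not in options:
            raise ValueError(f"unrecognized option: {arg!r}")
        # Assumption: overlap is an integer, prior_proba a probability.
        options[key] = int(value) if key == "overlap" else float(value)
    return options
```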
-
Grail (xGrail.pl)
This program takes the same kind of parameters as the one for NetStart.
The output files have names which look like seq##grail.txt where
## is the contig number.
-
GeneMark.hmm
(GeneMarkHMM.pl)
This program takes the same kind of parameters as the one for NetStart.
In addition to the html2txt filter, this program also uses the
GMHMMreformat.pl Perl script. The output files have names which look like
seq##gmhmm.txt where ## is the contig number.
-
The names of FASTA sequence files MUST START with seq## (where ## is the
contig number on 2 digits) and have .tfa as file extension
(e.g. seq01ac002332g14g15.tfa).
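A quick way to check a file name against this rule is sketched below in Python. The helper contig_number is hypothetical and not part of the scripts; it assumes that any characters are allowed between the two-digit number and the .tfa extension.

```python
import re

# seq, exactly two digits for the contig number, anything, then .tfa
FASTA_NAME = re.compile(r"seq(\d{2}).*\.tfa")


def contig_number(filename):
    """Return the contig number of a valid FASTA file name, else None."""
    match = FASTA_NAME.fullmatch(filename)
    return int(match.group(1)) if match else None
```

With this sketch, seq01ac002332g14g15.tfa is accepted as contig 1, while a name that lacks the seq## prefix or the .tfa extension is rejected.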
-
A Data.list file is a text file in which each line contains the name of
a contig (e.g. seq01) and the length of the sequence, separated by a tab
character.
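Reading such a file can be sketched in Python as follows. The function name parse_data_list is hypothetical, and skipping blank lines is an assumption; the sequence lengths used in the example are made up for illustration.

```python
def parse_data_list(text):
    """Parse the content of a Data.list file into (name, length) pairs."""
    contigs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        name, length = line.split("\t")
        contigs.append((name, int(length)))
    return contigs


# Illustrative content only; these lengths do not come from real data.
example = "seq01\t12000\nseq02\t8500\n"
contigs = parse_data_list(example)
```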
If you encounter a problem, contact
Patrice Déhais