classifier
Interface Classifier

All Known Implementing Classes:
WekaSVM

public interface Classifier

This interface defines the methods that a classifier must implement in order to be compatible with FunSiP.

Author:
Michiel Van Bel

Nested Class Summary
static class Classifier.DATA_TYPE
          The possible types of data: POSITIVE_DATA indicates that the source is positive training data.
 
Method Summary
 void applyAttributeFilter(java.util.List<java.lang.Integer> attributeFilter, int maxNumFeatures, java.io.File toBeFilteredFile)
          After feature selection has produced a filter, this filter can be applied to the feature files in order to optimize the SVMs.
 boolean buildClassifier()
          Builds an SVM model file from a file with training examples.
 java.lang.Double classify_single_instance_fast(double[] features)
          Use the trained classifier to classify a single instance of data in a very fast way, without resorting to string parsing (the recommended method for these classifications).
 java.lang.String classify_single_instance(java.lang.String instance)
          Use the trained classifier to classify a single instance of data, defined by the instance parameter.
 void classify(java.lang.String testFile, java.lang.String outputFile)
          Use the SVM to classify data and write the output to an output file. The SVM may be trained or untrained, but the model file must have been set beforehand.
 CrossValidationResult crossValidate(int n, int maxPosTrain, int maxNegTrain)
          Performs a cross-validation on the training file.
 java.lang.String generateFeatureString(java.util.List<java.lang.Double> data, Classifier.DATA_TYPE dataType)
          Creates a string of features for a single feature vector.
 java.lang.String getFileExtension()
           
 java.lang.String getModelFile()
           
 int[] getPosNegExamplesInFile(java.io.File file)
          Returns the number of positive and negative examples in a training file.
 double getSigmoid_A()
           
 double getSigmoid_B()
           
 weka.core.Instances getTrainingFileInstances()
           
 boolean loadClassifier()
          Sets the model file. Depending on the implementation, there may be an attempt to build the SVM from this model file.
 boolean loadClassifier(java.lang.String modelFile)
          Sets the model file. Depending on the implementation, there may be an attempt to build the SVM from this model file.
 java.io.File mergeFeatureFiles(java.io.File tempFilePositive, java.io.File tempFileNegative)
          Merges the feature files (one with positive training features, one with negative training features) into the actual training file.
 java.util.List<ValPosCombination> performAttributeEvaluation(boolean sort, weka.attributeSelection.AttributeEvaluator evaluator)
          Performs feature selection by evaluating different attributes.
 java.lang.String[] prepareCrossvalidationCommand(int fold, java.lang.String fileIn, java.lang.String fileOut)
          Creates an array with string values, to be parsed by the implementation of the classifier.
 java.lang.String[] prepareTrainingCommand(java.lang.String fileIn, java.lang.String fileOut)
          Creates an array with string values, to be parsed by the classifier.
 void setModelFile(java.lang.String svmModelFile)
          Changes the model for the classifier by changing the name of the model file.
 void setOptions(ClassifierOptions options)
          Changes the various options of this classifier.
 void setSigmoid_A(double sigmoid_A)
          Changes the sigmoid variable A (see documentation about restructuring the output by use of sigmoid curves)
 void setSigmoid_B(double sigmoid_B)
          Changes the sigmoid variable B (see documentation about restructuring the output by use of sigmoid curves)
 java.lang.String to_genomeview_output(int id, java.lang.Double distance, int funsite_start, int funsite_stop, java.lang.String classification_name)
          Method which produces a string that can be used by the GenomeView program.
 java.lang.String to_splice_machine_output(java.lang.Double distance, int funsite, int increase, java.lang.String classification_name)
          Method which produces a string that is similar to the output provided by Splicemachine (with the provided results).
 java.lang.String to_splice_machine_output(java.lang.String classification_result, int funsite, int increase, java.lang.String classification_name)
          Method which produces a string that is similar to that of the Splicemachine program, according to the provided results.
 java.io.File writeTemporaryFeatureData(java.lang.String tempFileName, boolean forward_strand, java.util.List<java.util.List<java.lang.Double>> data, Classifier.DATA_TYPE dataType)
          Writes the temporary feature data (all the features extracted from one sequence, each feature in a separate list) to a file.
 java.io.File writeTemporaryFeatureData(java.lang.String tempFileName, java.util.List<java.util.List<java.lang.Double>> data, Classifier.DATA_TYPE dataType)
          Writes the temporary feature data (all the features extracted from one sequence, each feature in a separate list) to a file.
 

Method Detail

crossValidate

CrossValidationResult crossValidate(int n,
                                    int maxPosTrain,
                                    int maxNegTrain)
Performs a cross-validation on the training file and returns the results (the number of true positives, false positives, true negatives and false negatives, plus the statistics derived from them). The normal cross-validation procedure can be followed. However, the procedure has been extended so that different numbers of positive and negative examples can be used during the training phase of each cross-validation step. The number of positive and negative examples used for the testing phase of each cross-validation step remains the same (being (n-1)*(total_amount)). For further information, see the Crossvalidation.txt document in the /doc subdirectory.

Parameters:
n - The number of folds of the cross-validation. Common values are 2, 5 and 10.
maxPosTrain - The maximum number of positive training examples during the training phase of the cross-validation.
maxNegTrain - The maximum number of negative training examples during the training phase of the cross-validation.
Returns:
The CrossValidationResult object, which contains the results of the cross-validation and the derived statistics.
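
The extended procedure described for crossValidate can be illustrated with a small sketch. This is not FunSiP's implementation: the class name, the Split helper and the round-robin assignment of examples to folds are assumptions made for illustration. It only shows the idea that the held-out fold is left untouched while the training part is capped at maxPosTrain positive and maxNegTrain negative examples.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of FunSiP.
final class CappedCrossValidationSketch {

    /** Indices of the training and test examples for one cross-validation step. */
    static final class Split {
        final List<Integer> train = new ArrayList<>();
        final List<Integer> test = new ArrayList<>();
    }

    /**
     * Builds the split for one fold.
     * labels[i] is true for a positive example and false for a negative one.
     * Assumption: example i belongs to fold (i % n).
     */
    static Split split(boolean[] labels, int n, int fold, int maxPosTrain, int maxNegTrain) {
        Split s = new Split();
        int pos = 0;
        int neg = 0;
        for (int i = 0; i < labels.length; i++) {
            if (i % n == fold) {
                // The held-out fold is never capped.
                s.test.add(i);
            } else if (labels[i] && pos < maxPosTrain) {
                s.train.add(i);
                pos++;
            } else if (!labels[i] && neg < maxNegTrain) {
                s.train.add(i);
                neg++;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        boolean[] labels = {true, false, true, false, true, false, true, false, true, false};
        for (int fold = 0; fold < 5; fold++) {
            Split s = split(labels, 5, fold, 2, 3);
            System.out.println("fold " + fold + ": train=" + s.train + " test=" + s.test);
        }
    }
}
```

Capping only the training part keeps the test sets comparable between runs that use different values of maxPosTrain and maxNegTrain.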

prepareCrossvalidationCommand

java.lang.String[] prepareCrossvalidationCommand(int fold,
                                                 java.lang.String fileIn,
                                                 java.lang.String fileOut)
Creates an array of string values to be parsed by the implementation of the classifier. This method is part of the Classifier interface in order to reduce implementation issues.

Parameters:
fold - The fold of the crossvalidation
fileIn - The file containing the training features.
fileOut - The file for output (if applicable).
Returns:
The command-line array with options.

prepareTrainingCommand

java.lang.String[] prepareTrainingCommand(java.lang.String fileIn,
                                          java.lang.String fileOut)
Creates an array with string values, to be parsed by the classifier.

Parameters:
fileIn - The name of the file containing the extracted features.
fileOut - The name of the file to which the output should be written (if applicable).
Returns:
The command-line array with options.

buildClassifier

boolean buildClassifier()
Builds an SVM model file from a file with training examples. The name of the file containing the extracted features must have been defined beforehand.


mergeFeatureFiles

java.io.File mergeFeatureFiles(java.io.File tempFilePositive,
                               java.io.File tempFileNegative)
Merges the feature files (one with positive training features, one with negative training features) into the actual training file. The type (donor/acceptor) is set by the constructor of the SVM, so the necessary information can be taken from there.

Parameters:
tempFilePositive - The name of the file with features for positive training
tempFileNegative - The name of the file with features for negative training
Returns:
The resulting training file.

writeTemporaryFeatureData

java.io.File writeTemporaryFeatureData(java.lang.String tempFileName,
                                       boolean forward_strand,
                                       java.util.List<java.util.List<java.lang.Double>> data,
                                       Classifier.DATA_TYPE dataType)
Writes the temporary feature data (all the features extracted from one sequence, each feature in a separate list) to a file. The type of data matters because the data must be labeled (positive, negative, unknown) accordingly. Originally all data (all features of all sequences) was kept in memory, but for large datasets this very quickly resulted in heap overflow problems.

Parameters:
tempFileName - The name of the file to which the data should be written.
forward_strand - Indicates whether or not the data is located on the forward strand.
data - The feature data, as a nested list.
dataType - The type of data (see enum in this interface)
Returns:
The file with the temporary data.

writeTemporaryFeatureData

java.io.File writeTemporaryFeatureData(java.lang.String tempFileName,
                                       java.util.List<java.util.List<java.lang.Double>> data,
                                       Classifier.DATA_TYPE dataType)
Writes the temporary feature data (all the features extracted from one sequence, each feature in a separate list) to a file. The type of data matters because the data must be labeled (positive, negative, unknown) accordingly. Originally all data (all features of all sequences) was kept in memory, but this very quickly resulted in heap overflow problems.

Parameters:
tempFileName - The name of the file to which the data should be written.
data - The feature data.
dataType - The type of data (see enum in this interface)
Returns:
The file with the temporary data.

generateFeatureString

java.lang.String generateFeatureString(java.util.List<java.lang.Double> data,
                                       Classifier.DATA_TYPE dataType)
Creates a string of features for a single feature vector.

Parameters:
data - The feature data.
dataType - The data type (positive, negative, unclassified).
Returns:
The feature string.

loadClassifier

boolean loadClassifier()
Sets the model file. Depending on the implementation, there may be an attempt to build the SVM from this model file.

Returns:
Whether or not loading the classifier succeeded.

loadClassifier

boolean loadClassifier(java.lang.String modelFile)
Sets the model file. Depending on the implementation, there may be an attempt to build the SVM from this model file.

Parameters:
modelFile - The name of the model file.
Returns:
Whether or not loading the classifier succeeded.

classify

void classify(java.lang.String testFile,
              java.lang.String outputFile)
Use the SVM to classify data and write the output to an output file. The SVM may be trained or untrained, but the model file must have been set beforehand.

Parameters:
testFile - The name of the file that contains the extracted features. The filename is expected to include the output directory.
outputFile - The name of the output file. The filename is expected to include the output directory.

classify_single_instance

java.lang.String classify_single_instance(java.lang.String instance)
Use the trained classifier to classify a single instance of data, defined by the instance parameter.

Parameters:
instance - The instance (consisting of extracted features) to be classified.
Returns:
The string indicating the result of the classification.

classify_single_instance_fast

java.lang.Double classify_single_instance_fast(double[] features)
Use the trained classifier to classify a single instance of data in a very fast way, without resorting to string parsing (the recommended method for these classifications).

Parameters:
features - The features that make up the instance that needs to be classified.
Returns:
A value given by the classifier. This is NOT a simple -1/+1 value: when using SVMs, it indicates the distance to the hyperplane.
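
The difference between a decision value and a hard -1/+1 label can be shown with a minimal sketch. It assumes a linear SVM with known weights and bias, which is an illustration only: the class, method names and the threshold parameter are hypothetical, and the source does not say which kernel or model representation an implementation such as WekaSVM uses.

```java
// Hypothetical sketch, not part of FunSiP.
final class DecisionValueSketch {

    /**
     * Decision value of a linear SVM: w . x + b.
     * The value is proportional to the signed distance to the hyperplane
     * (equal to it when the weight vector has unit length).
     */
    static double decisionValue(double[] weights, double bias, double[] features) {
        if (weights.length != features.length) {
            throw new IllegalArgumentException("weights and features must have the same length");
        }
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    /** Turns a decision value into a hard label by comparing it with a threshold. */
    static int label(double decisionValue, double threshold) {
        return decisionValue >= threshold ? 1 : -1;
    }

    public static void main(String[] args) {
        double value = decisionValue(new double[] {1.0, -2.0}, 0.5, new double[] {3.0, 1.0});
        System.out.println("decision value = " + value + ", label = " + label(value, 0.0));
    }
}
```

Returning the raw value instead of the label lets the caller choose its own threshold or rank candidate sites by score.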

to_splice_machine_output

java.lang.String to_splice_machine_output(java.lang.String classification_result,
                                          int funsite,
                                          int increase,
                                          java.lang.String classification_name)
Method which produces a string that is similar to that of the Splicemachine program, according to the provided results.

Parameters:
classification_result - The result of the classification, in string format.
funsite - The location of the functional site in the sequence.
increase - An extra increase for the output (see documentation).
classification_name - The name for this type of functional site.
Returns:
The string in Splicemachine format.

to_splice_machine_output

java.lang.String to_splice_machine_output(java.lang.Double distance,
                                          int funsite,
                                          int increase,
                                          java.lang.String classification_name)
Method which produces a string that is similar to the output provided by Splicemachine (with the provided results).

Parameters:
distance - A value (the distance to the hyperplane for SVMs) that is used to give a score to a certain functional site.
funsite - The location of the functional site in the sequence.
increase - An extra increase for the location of the functional site in the output (see documentation).
classification_name - The name for this type of evaluated functional site.
Returns:
The string in Splicemachine format.

to_genomeview_output

java.lang.String to_genomeview_output(int id,
                                      java.lang.Double distance,
                                      int funsite_start,
                                      int funsite_stop,
                                      java.lang.String classification_name)
Method which produces a string that can be used by the GenomeView program.

Parameters:
id - A unique id for the functional site in the sequence
distance - A value (the distance to the hyperplane for SVMs) that is used to give a score to a certain functional site.
funsite_start - The start of the functional site in the sequence
funsite_stop - The stop of the functional site in the sequence
classification_name - The name for this type of evaluated functional site.
Returns:
The string in GenomeView format.

getPosNegExamplesInFile

int[] getPosNegExamplesInFile(java.io.File file)
Returns the number of positive and negative examples in a training file.

Parameters:
file - The training file.
Returns:
An array of size 2: the first element is the number of positive training examples, the second the number of negative training examples.

getModelFile

java.lang.String getModelFile()
Returns:
The name of the file containing the model for the classifier.

setModelFile

void setModelFile(java.lang.String svmModelFile)
Changes the model for the classifier by changing the name of the model file.

Parameters:
svmModelFile - The name of the file containing the new model.

setOptions

void setOptions(ClassifierOptions options)
Changes the various options of this classifier.

Parameters:
options - The new set of options for this classifier.

performAttributeEvaluation

java.util.List<ValPosCombination> performAttributeEvaluation(boolean sort,
                                                             weka.attributeSelection.AttributeEvaluator evaluator)
Performs feature selection by evaluating different attributes.

Parameters:
sort - Whether to sort the resulting ValPosCombination objects according to their values.
evaluator - The evaluator used for performing the evaluation of the attributes.
Returns:
A list of the values together with their original positions, so that the classification feature each attribute belonged to can be located.
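
Why the original position is stored next to each value can be seen in a small sketch. The ValPos class below is a hypothetical stand-in for ValPosCombination, whose actual fields are not documented here, and the descending sort order is an assumption, since the source does not state the direction.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not part of FunSiP.
final class ValPosSketch {

    /** Stand-in for ValPosCombination: an evaluation score plus the original attribute index. */
    static final class ValPos {
        final double value;
        final int position;

        ValPos(double value, int position) {
            this.value = value;
            this.position = position;
        }

        @Override
        public String toString() {
            return position + ":" + value;
        }
    }

    /** Pairs every score with its index and optionally sorts the pairs by score. */
    static List<ValPos> rank(double[] scores, boolean sort) {
        List<ValPos> result = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            result.add(new ValPos(scores[i], i));
        }
        if (sort) {
            // Assumption: highest score first.
            result.sort(Comparator.comparingDouble((ValPos vp) -> vp.value).reversed());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(rank(new double[] {0.1, 0.9, 0.4}, true));
    }
}
```

After sorting, the position field still identifies which attribute produced each score, which is what makes the ranking usable for building a filter.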

getTrainingFileInstances

weka.core.Instances getTrainingFileInstances()
Returns:
The instances used for training the model of this classifier.

applyAttributeFilter

void applyAttributeFilter(java.util.List<java.lang.Integer> attributeFilter,
                          int maxNumFeatures,
                          java.io.File toBeFilteredFile)
After feature selection has produced a filter, this filter can be applied to the feature files in order to optimize the SVMs.

Parameters:
attributeFilter - The filter: a list with the numbers of the attributes that MUST be preserved.
maxNumFeatures - The maximum number of features to be used by the classifier.
toBeFilteredFile - The file containing the various features (stored in a classifier-dependent way) which should be filtered by the given attribute filter.
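
The effect of an attribute filter on a single feature vector can be sketched as follows. The real method operates on a file in a classifier-dependent format, so this in-memory version is only an illustration. It assumes zero-based attribute numbers and assumes that maxNumFeatures simply truncates the selection; neither detail is stated in the source, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of FunSiP.
final class AttributeFilterSketch {

    /**
     * Keeps only the features whose indices are listed in attributeFilter,
     * in filter order, and stops once maxNumFeatures have been kept.
     */
    static List<Double> filter(List<Double> features, List<Integer> attributeFilter, int maxNumFeatures) {
        List<Double> kept = new ArrayList<>();
        for (int index : attributeFilter) {
            if (kept.size() >= maxNumFeatures) {
                break;
            }
            // Indices outside the vector are ignored rather than treated as errors.
            if (index >= 0 && index < features.size()) {
                kept.add(features.get(index));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> features = Arrays.asList(0.5, 1.5, 2.5, 3.5);
        System.out.println(filter(features, Arrays.asList(3, 0, 2), 2));
    }
}
```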

getSigmoid_A

double getSigmoid_A()
Returns:
the sigmoid variable A (see documentation about restructuring the output by use of sigmoid curves)

setSigmoid_A

void setSigmoid_A(double sigmoid_A)
Changes the sigmoid variable A (see documentation about restructuring the output by use of sigmoid curves)

Parameters:
sigmoid_A - The new sigmoid variable A

getSigmoid_B

double getSigmoid_B()
Returns:
the sigmoid variable B (see documentation about restructuring the output by use of sigmoid curves)

setSigmoid_B

void setSigmoid_B(double sigmoid_B)
Changes the sigmoid variable B (see documentation about restructuring the output by use of sigmoid curves)

Parameters:
sigmoid_B - The new sigmoid variable B
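
How two sigmoid parameters A and B can reshape a classifier output is shown in the sketch below. It assumes the common Platt-style form P = 1 / (1 + exp(A * f + B)), where f is the decision value. The source only refers to separate documentation about sigmoid curves, so the exact formula used by FunSiP may differ, and the class and method names are hypothetical.

```java
// Hypothetical sketch, not part of FunSiP.
final class SigmoidSketch {

    /**
     * Maps a decision value to the range (0, 1) with a two-parameter sigmoid.
     * Assumption: Platt-style form 1 / (1 + exp(a * f + b)).
     */
    static double toProbability(double decisionValue, double a, double b) {
        return 1.0 / (1.0 + Math.exp(a * decisionValue + b));
    }

    public static void main(String[] args) {
        double a = -2.0; // a negative A maps larger decision values to larger outputs
        double b = 0.0;
        for (double f : new double[] {-1.0, 0.0, 1.0}) {
            System.out.println("f = " + f + " -> " + toProbability(f, a, b));
        }
    }
}
```

In this form A controls the steepness of the curve and B shifts the decision value at which the output equals 0.5.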

getFileExtension

java.lang.String getFileExtension()
Returns:
The file extension to be used by files containing the features that will be used for either building the classification model or for evaluation.