Latest News

MISSEL: Multiple Sub Sequences Extractor for cLassification

November 2015

A new feature selection algorithm that is able to extract alternative and equivalent classification models has been released.

CAMUR: Classifier with Alternative and MUltiple Rule-based models

October 2015

A new classification technique that is able to compute multiple rule-based models has been developed.

LAF: Logic Alignment Free

April 2015

A new section and software for classifying biological sequences with alignment-free techniques and rule-based supervised methods.

MISSEL: Multiple Sub Sequences Extractor for cLassification

Thanks to next-generation technologies, the falling cost and rising throughput of sequencing have led to ever-growing collections of sequenced data. However, biologists and clinicians cannot easily handle such an overabundance of data without powerful tools and techniques to analyze and interpret it. In this context, automatic knowledge extraction methods that optimize the analysis and point to the interesting portions of the sequences are key to eliciting meaningful information.

In this work, we propose MISSEL (Multiple Sub Sequences Extractor for cLassification), a novel algorithm that relies on a feature selection technique to extract multiple and locally adjacent solutions for supervised machine learning problems. In particular, we design our method for biological settings in which the relative position of a feature is relevant (e.g., sequenced data). Our goal is to find sets of separating features that are as close as possible to each other. A second crucial goal is to highlight the subsequences that share the same required characteristics. Our approach adopts a fast and effective method to evaluate the quality of subsequences and integrates it into a genetic algorithm.
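To make the idea of "separating features that are as close as possible to each other" concrete, the following minimal sketch scores a candidate set of loci by how well it separates two classes and how compact it is. It is not the quality measure used by MISSEL, which is not specified here: the function names (`separation`, `span`, `fitness`), the `weight` parameter, and the toy sequences are all assumptions made for illustration. A genetic algorithm could use a score of this kind as its fitness function.

```python
from itertools import product


def separation(seqs_a, seqs_b, loci):
    """Fraction of cross-class sequence pairs that differ at one or more selected loci."""
    pairs = list(product(seqs_a, seqs_b))
    separated = sum(any(a[i] != b[i] for i in loci) for a, b in pairs)
    return separated / len(pairs)


def span(loci):
    """Length of the smallest window that contains all selected loci."""
    return max(loci) - min(loci) + 1


def fitness(seqs_a, seqs_b, loci, seq_len, weight=0.5):
    """Hypothetical score: reward class separation, penalize spread-out loci.

    `weight` is an illustrative trade-off parameter, not a MISSEL setting.
    """
    return separation(seqs_a, seqs_b, loci) - weight * span(loci) / seq_len


# Toy aligned sequences (invented for this example), 0-based loci.
class_a = ["ACGTAC", "ACGTAA"]
class_b = ["ACCTGC", "ACCTGA"]

compact = [2, 4]    # nearby loci that separate every cross-class pair
scattered = [0, 5]  # distant loci that separate only some pairs

print(fitness(class_a, class_b, compact, seq_len=6))
print(fitness(class_a, class_b, scattered, seq_len=6))
```

On this toy data, the compact set scores higher than the scattered one because it both separates all cross-class pairs and occupies a shorter window.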

Our algorithm has been applied to genomic sequences from Influenza-, Polyoma-, and Rhinoviruses, where features represent nucleotide loci. Moreover, our method has been integrated into a rule-based classification framework and efficiently extracts a large number of highly reliable and compact classification rules for the three above-mentioned datasets. Last but not least, it makes it possible to identify several small, highly informative portions of the different genomic regions analyzed.
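As an illustration of what a compact classification rule over nucleotide loci might look like, here is a small sketch in which a rule is a conjunction of conditions of the form "locus = nucleotide". The rule format, the function names (`matches`, `classify`), the class labels, and the example rules are assumptions for illustration only; they are not taken from the MISSEL software package or its results.

```python
def matches(rule, sequence):
    """Return True if the sequence satisfies every (locus -> nucleotide) condition.

    `rule` is a dict mapping a 0-based locus to the nucleotide required there.
    """
    return all(sequence[pos] == nt for pos, nt in rule.items())


def classify(rules, sequence, default="unknown"):
    """Return the label of the first rule the sequence satisfies, else `default`."""
    for label, rule in rules:
        if matches(rule, sequence):
            return label
    return default


# Hypothetical rules over two nearby loci (invented for this example).
rules = [
    ("class_a", {2: "G", 4: "A"}),  # IF locus 2 = G AND locus 4 = A THEN class_a
    ("class_b", {2: "C"}),          # IF locus 2 = C THEN class_b
]

print(classify(rules, "ACGTAC"))
print(classify(rules, "ACCTGC"))
print(classify(rules, "ACTTAC"))
```

Rules of this form stay readable when they involve only a few loci, which is one reason compact feature sets are attractive in a rule-based setting.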

MISSEL DOWNLOADS

MISSEL_poster.pdf

MISSEL_software_package_linux_64.zip

Results.zip

Data sets

Influenzaviruses.zip

Polyomaviruses.zip

Rhinoviruses.zip