
MISSEL: Multiple Sub Sequences Extractor for cLassification

Thanks to next-generation technologies, recent reductions in the cost of sequencing and gains in its throughput have led to ever-growing collections of sequenced data. However, biologists and clinicians cannot easily handle such an abundance of data without powerful tools and techniques for analyzing and interpreting it. In this context, automatic knowledge extraction methods that optimize the analysis and point to the interesting portions of the sequences are key to eliciting meaningful information.

In this work, we propose MISSEL (Multiple Sub Sequences Extractor for cLassification), a novel algorithm that relies on a feature selection technique to extract multiple, locally adjacent solutions for supervised machine learning problems. In particular, we design our method for biological settings where the relative position of a feature is relevant (e.g., sequenced data). Our goal is to find sets of separating features that are as close as possible to each other. Another crucial goal is to highlight every subsequence that exhibits the same required characteristics. Our approach adopts a fast and effective method to evaluate the quality of subsequences and integrates it into a genetic algorithm.
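To make the idea of "separating features that are as close as possible to each other" concrete, here is a minimal sketch in Python. It is not the MISSEL implementation: it swaps the genetic algorithm for an exhaustive search over a toy alignment, and the function names, the span-based compactness measure, and the sequences are all assumptions made for illustration.

```python
from itertools import combinations

def separates(seqs_a, seqs_b, loci):
    """Return True if no class-A sequence matches a class-B sequence
    on all of the chosen loci (hypothetical separation criterion)."""
    proj_a = {tuple(s[i] for i in loci) for s in seqs_a}
    proj_b = {tuple(s[i] for i in loci) for s in seqs_b}
    return proj_a.isdisjoint(proj_b)

def span(loci):
    """Length of the window covering the chosen loci (assumed compactness measure)."""
    return max(loci) - min(loci) + 1

def most_compact_separating_sets(seqs_a, seqs_b, k):
    """Exhaustively list every k-locus set that separates the two classes
    and has the smallest span. Only feasible for toy inputs."""
    length = len(seqs_a[0])
    candidates = [c for c in combinations(range(length), k)
                  if separates(seqs_a, seqs_b, c)]
    if not candidates:
        return []
    best = min(span(c) for c in candidates)
    return [c for c in candidates if span(c) == best]

# Toy aligned sequences (made up); no single locus separates the classes.
class_a = ["AATGG", "CCTTT"]
class_b = ["ACTGT", "CATTG"]

print(most_compact_separating_sets(class_a, class_b, 2))  # [(0, 1), (3, 4)]
```

The toy data yields two equally compact solutions, loci (0, 1) and (3, 4), which mirrors the aim of reporting multiple subsequences with the same characteristics rather than a single one. On real genomic data an exhaustive search is impractical, which is where a heuristic such as a genetic algorithm comes in.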

Our algorithm has been applied to genomic sequences from Influenza-, Polyoma-, and Rhinoviruses, where features represent nucleotide loci. Moreover, our method has been integrated into a rule-based classification framework and is able to efficiently extract a large number of highly reliable and compact classification rules for these three datasets. Finally, it makes it possible to identify several small, highly informative portions of the different genomic regions analyzed.



Data sets