
CAMUR: Classifier with Alternative and MUltiple Rule-based models

Next Generation Sequencing (NGS) techniques are spreading rapidly and produce huge amounts of genomic data, so methods for extracting knowledge from NGS data are in high demand. In this work, we focus on RNA-seq gene expression analysis, and specifically on case-control studies for cancer, using rule-based classification algorithms that build models to discriminate cases from controls. State-of-the-art algorithms typically extract a single classification model that contains few features (genes). In contrast, our goal is to elicit more knowledge by computing multiple alternative classification models, and thereby to discover several features that are related to the predicted class.

We propose CAMUR (Classifier with Alternative and MUltiple Rule-based models), a new method and software package able to extract multiple, alternative, and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set (or a partial combination) of the features present in the rules, iteratively eliminates those combinations from the data set, and repeats the classification procedure until a stopping criterion is met. CAMUR includes an ad-hoc knowledge repository (database) and a complete querying tool.
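The iterative procedure described above can be illustrated with a minimal Python sketch. This is not the CAMUR implementation: the rule learner here is a hypothetical brute-force stand-in that searches for the best conjunction of at most two binary features, the stopping criterion (an accuracy threshold plus an iteration cap) is an assumption, and the gene names and data are made-up toy values. All function and parameter names are illustrative only.

```python
from itertools import combinations


def power_set(features):
    """Return all non-empty subsets of the given features."""
    items = sorted(features)
    return [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)]


def rule_accuracy(rows, labels, genes):
    """Accuracy of the rule 'all listed genes are high => case'."""
    predictions = [int(all(row[g] for g in genes)) for row in rows]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def toy_rule_learner(rows, labels, allowed, max_size=2):
    """Hypothetical stand-in for a rule-based classifier.

    Tries every conjunction of up to max_size allowed genes and keeps
    the most accurate one (smaller rules win ties).
    """
    best_genes, best_acc = None, -1.0
    for size in range(1, max_size + 1):
        for genes in combinations(sorted(allowed), size):
            acc = rule_accuracy(rows, labels, genes)
            if acc > best_acc:
                best_genes, best_acc = frozenset(genes), acc
    return best_genes, best_acc


def camur_sketch(rows, labels, features, min_accuracy=1.0, max_iterations=50):
    """Collect alternative models by repeatedly removing feature subsets."""
    models = []                # list of (genes used, accuracy)
    visited = set()            # removal sets already tried
    queue = [frozenset()]      # start with nothing removed
    while queue and len(visited) < max_iterations:
        removed = queue.pop(0)
        if removed in visited:
            continue
        visited.add(removed)
        allowed = [f for f in features if f not in removed]
        if not allowed:
            continue
        genes, acc = toy_rule_learner(rows, labels, allowed)
        if acc < min_accuracy:
            continue           # assumed stopping criterion for this branch
        if genes not in [m[0] for m in models]:
            models.append((genes, acc))
        # Eliminate every combination of the genes just used, then retry.
        for combo in power_set(genes):
            queue.append(removed | combo)
    return models


# Toy data: 1 = gene highly expressed, label 1 = case, 0 = control.
features = ["gene_A", "gene_B", "gene_C", "gene_D"]
rows = [
    {"gene_A": 1, "gene_B": 1, "gene_C": 1, "gene_D": 0},
    {"gene_A": 1, "gene_B": 1, "gene_C": 1, "gene_D": 1},
    {"gene_A": 1, "gene_B": 1, "gene_C": 1, "gene_D": 0},
    {"gene_A": 0, "gene_B": 1, "gene_C": 0, "gene_D": 1},
    {"gene_A": 0, "gene_B": 0, "gene_C": 1, "gene_D": 0},
    {"gene_A": 0, "gene_B": 0, "gene_C": 0, "gene_D": 1},
]
labels = [1, 1, 1, 0, 0, 0]

models = camur_sketch(rows, labels, features)
print([sorted(genes) for genes, _ in models])
# prints [['gene_A'], ['gene_B', 'gene_C']]
```

In this toy run, a single-model classifier would stop at gene_A. Removing gene_A and classifying again exposes the equivalent model built on gene_B and gene_C, which is the kind of alternative knowledge the method aims to recover.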

We downloaded, processed, and analyzed three different RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA). Our experimental results show the efficacy of CAMUR: we obtain several highly reliable, equivalent classification models, from which we deduce the most frequent genes, their relationships, and their relation to a particular cancer.



Data sets