Next generation sequencing reads comparison with an alignment free distance
Next Generation Sequencing (NGS) machines extract from a
biological sample a large number of short DNA fragments (reads). These reads
are then used for several applications, e.g., sequence reconstruction, DNA
assembly, gene expression profiling, and mutation analysis.
We propose a method to evaluate the similarity between reads. This
method does not rely on the alignment of the reads and it is based on the
distance between the frequencies of their fixed-length substrings (k-mers).
We compare this alignment-free distance with the similarity measures derived
from two alignment methods: Needleman-Wunsch and BLAST. The comparison is
based on a simple assumption: the most correct distance is obtained by knowing
in advance the reference sequence. Therefore, we first align the reads on the
original DNA sequence, compute the overlap between the aligned reads, and use
this overlap as an ideal distance. We then verify how well the alignment-free and the
alignment-based distances reproduce this ideal distance. The ability to reproduce
the ideal distance correctly is evaluated over samples of read pairs from
Saccharomyces cerevisiae, Escherichia coli, and Homo sapiens. The
comparison is based on the correctness of threshold predictors cross-validated
over different samples.
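
As a rough illustration of the two quantities described above, the following Python sketch computes an alignment-free distance from k-mer frequency profiles and an overlap-based ideal distance from reference positions. It is not the authors' Java implementation. The choice of the Euclidean metric, the default k = 3, the normalization of the overlap by the shorter read length, and all function names are assumptions made for this example only.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(read, k):
    """Return the relative frequency of each k-mer occurring in a read."""
    total = len(read) - k + 1
    if total <= 0:
        raise ValueError("read is shorter than k")
    counts = Counter(read[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def alignment_free_distance(read_a, read_b, k=3):
    """Euclidean distance between two k-mer frequency profiles.

    The metric and the default k are assumptions for illustration.
    """
    fa = kmer_frequencies(read_a, k)
    fb = kmer_frequencies(read_b, k)
    kmers = fa.keys() | fb.keys()
    return sqrt(sum((fa.get(m, 0.0) - fb.get(m, 0.0)) ** 2 for m in kmers))


def overlap_distance(start_a, len_a, start_b, len_b):
    """Ideal distance from the positions of two reads on the reference.

    Assumed normalization: 1 - overlap / length of the shorter read,
    so identical placements give 0 and disjoint reads give 1.
    """
    end = min(start_a + len_a, start_b + len_b)
    begin = max(start_a, start_b)
    overlap = max(0, end - begin)
    return 1.0 - overlap / min(len_a, len_b)
```

Under these assumptions, a threshold predictor would label a read pair as overlapping when `alignment_free_distance` falls below a cut-off chosen on one sample and checked on the others, with `overlap_distance` supplying the reference labels.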
We present experimental evidence that the proposed alignment-free
distance is a potentially useful read-to-read distance measure and performs better
than the more time-consuming distances based on alignment.
Alignment-free distances may be used effectively for read
comparison, and may provide a significant speed-up in several processes based on
NGS sequencing (e.g., DNA assembly).
NGS SAMPLE DATA SETS AND SOFTWARE
JavaDistanceOnlyAF.zip
JavaDistanceOnlyBL.zip
JavaDistanceOnlyNW.zip
JavaDistancesAll.zip
SampleData.zip
softwareExecution.txt