Efficient String Matching Algorithms in DNA Sequence Analysis
DOI:
https://doi.org/10.63345/Keywords:
DNA sequence analysis; string matching; FM-index; seed-and-extend; edit distance; read mapping; algorithm efficiencyAbstract
String matching underpins nearly every task in computational genomics—from mapping next-generation sequencing (NGS) reads to reference genomes and detecting motifs to calling variants and scanning regulatory regions. The volume and heterogeneity of contemporary sequencing data place extreme demands on algorithmic efficiency, with practical constraints on time, memory, and accuracy under sequencing errors and biological variation. This manuscript surveys the algorithmic foundations of efficient string matching for DNA sequence analysis, distinguishes exact from approximate matching, and clarifies when to favor hash-based seeding, automata, or index-based approaches such as suffix arrays and FM-indexes. We complement the conceptual review with a controlled simulation that compares six representative techniques—Knuth–Morris–Pratt (KMP), Boyer–Moore, Rabin–Karp, a bit-parallel edit-distance method (Myers), an FM-index (Burrows–Wheeler–transform based), and a seed-and-extend strategy with spaced seeds—on synthetic read-mapping tasks.
Evaluation emphasizes sensitivity, precision, F1-score, throughput (reads per second), and memory footprint. Results highlight trade-offs: FM-index and seed-and-extend approaches achieve the best F1 and throughput on noisy reads, while classical exact matchers are fast but brittle to mismatches/indels. We discuss statistical significance, effect sizes, and practical guidance for algorithm selection across read length, error profiles, and genome repetitiveness. We conclude with scope and limitations relevant to large genomes, long-read technologies, and structural variation, offering a roadmap for deploying efficient string matching in modern bioinformatics pipelines.
Downloads
Downloads
Additional Files
Published
Issue
Section
License
Copyright (c) 2026 The journal retains copyright of all published articles, ensuring that authors have control over their work while allowing wide dissenmination.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Articles are published under the Creative Commons Attribution NonCommercial 4.0 License (CC BY NC 4.0), allowing others to distribute, remix, adapt, and build upon the work for non-commercial purposes while crediting the original author.
