Efficient String Matching Algorithms in DNA Sequence Analysis

Authors

  • Er Apoorva Jain Chandigarh University Mohali, Punjab, India Author

Abstract

String matching underpins nearly every task in computational genomics—from mapping next-generation sequencing (NGS) reads to reference genomes and detecting motifs to calling variants and scanning regulatory regions. The volume and heterogeneity of contemporary sequencing data place extreme demands on algorithmic efficiency, with practical constraints on time, memory, and accuracy under sequencing errors and biological variation. This manuscript surveys the algorithmic foundations of efficient string matching for DNA sequence analysis, distinguishes exact from approximate matching, and clarifies when to favor hash-based seeding, automata, or index-based approaches such as suffix arrays and FM-indexes. We complement the conceptual review with a controlled simulation that compares six representative techniques—Knuth–Morris–Pratt (KMP), Boyer–Moore, Rabin–Karp, a bit-parallel edit-distance method (Myers), an FM-index (Burrows–Wheeler–transform based), and a seed-and-extend strategy with spaced seeds—on synthetic read-mapping tasks.

Evaluation emphasizes sensitivity, precision, F1-score, throughput (reads per second), and memory footprint. Results highlight trade-offs: FM-index and seed-and-extend approaches achieve the best F1 and throughput on noisy reads, while classical exact matchers are fast but brittle to mismatches/indels. We discuss statistical significance, effect sizes, and practical guidance for algorithm selection across read length, error profiles, and genome repetitiveness. We conclude with scope and limitations relevant to large genomes, long-read technologies, and structural variation, offering a roadmap for deploying efficient string matching in modern bioinformatics pipelines.

Downloads

Download data is not yet available.

Published

2026-05-07

How to Cite

Er Apoorva Jain. “Efficient String Matching Algorithms in DNA Sequence Analysis”. International Journal of Advanced Research in Computer Science and Engineering (IJARCSE) U.S. ISSN: 3071-0154 2, no. 2 (May 7, 2026): May (9–16). Accessed June 14, 2026. https://ijarcse.org/index.php/ijarcse/article/view/130.