Efficient String Matching Algorithms in DNA Sequence Analysis

Authors

  • Er. Niharika Singh ABES Engineering College, Crossings Republik, Ghaziabad, Uttar Pradesh 201009 niharika250104@gmail.com Author

DOI:

https://doi.org/10.63345/

Keywords:

DNA sequence analysis; string matching; FM-index; seed-and-extend; edit distance; read mapping; algorithm efficiency

Abstract

String matching underpins nearly every task in computational genomics—from mapping next-generation sequencing (NGS) reads to reference genomes and detecting motifs to calling variants and scanning regulatory regions. The volume and heterogeneity of contemporary sequencing data place extreme demands on algorithmic efficiency, with practical constraints on time, memory, and accuracy under sequencing errors and biological variation. This manuscript surveys the algorithmic foundations of efficient string matching for DNA sequence analysis, distinguishes exact from approximate matching, and clarifies when to favor hash-based seeding, automata, or index-based approaches such as suffix arrays and FM-indexes. We complement the conceptual review with a controlled simulation that compares six representative techniques—Knuth–Morris–Pratt (KMP), Boyer–Moore, Rabin–Karp, a bit-parallel edit-distance method (Myers), an FM-index (Burrows–Wheeler–transform based), and a seed-and-extend strategy with spaced seeds—on synthetic read-mapping tasks.

Evaluation emphasizes sensitivity, precision, F1-score, throughput (reads per second), and memory footprint. Results highlight trade-offs: FM-index and seed-and-extend approaches achieve the best F1 and throughput on noisy reads, while classical exact matchers are fast but brittle to mismatches/indels. We discuss statistical significance, effect sizes, and practical guidance for algorithm selection across read length, error profiles, and genome repetitiveness. We conclude with scope and limitations relevant to large genomes, long-read technologies, and structural variation, offering a roadmap for deploying efficient string matching in modern bioinformatics pipelines.

Downloads

Download data is not yet available.

Additional Files

Published

2026-05-03

How to Cite

Singh, Er. Niharika. “Efficient String Matching Algorithms in DNA Sequence Analysis”. International Journal of Advanced Research in Computer Science and Engineering (IJARCSE) U.S. ISSN: 3071-0154 2, no. 2 (May 3, 2026): May (14–26). Accessed June 12, 2026. https://ijarcse.org/index.php/ijarcse/article/view/143.

Similar Articles

1-10 of 80

You may also start an advanced similarity search for this article.