Fast practical multi-pattern matching

Maxime Crochemore; A. Czumaj; L. Ga̧sieniec; T. Lecroq; W. Plandowski; W. Rytter

doi:10.1016/S0020-0190(99)00092-7

Research output: Contribution to journal › Article › peer-review

52 Scopus citations

Abstract

The multi-pattern matching problem consists in finding all occurrences of the patterns from a finite set X in a given text T of length n. We present a new and simple algorithm combining the ideas of the Aho-Corasick algorithm and directed acyclic word graphs. The algorithm has time complexity which is linear in the worst case (it makes at most 2n symbol comparisons) and has good average-case time complexity assuming the shortest pattern is sufficiently long. Denote the length of the shortest pattern by m, and the total length of all patterns by M. Assume that M is polynomial with respect to m, that the alphabet contains at least 2 symbols, and that the text (in which the patterns are to be found) is random: at each position, each letter occurs independently with the same probability. Then the average number of comparisons is O((n/m)·log m), which matches the lower bound of the problem. For sufficiently large values of m the algorithm behaves well in practice.
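As a point of reference for the problem statement above, here is a minimal sketch of the classical Aho-Corasick automaton, one of the two ingredients the abstract names. It is not the algorithm of the paper: it omits the directed acyclic word graph component and therefore does not achieve the O((n/m)·log m) average case. The function names `build_automaton` and `find_all` are illustrative choices, not taken from the source.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists (plain Aho-Corasick)."""
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)

    # Breadth-first pass: depth-1 states keep failure link 0;
    # deeper states inherit from the parent's failure chain.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Patterns ending at the failure target also end here.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(patterns, text):
    """Return (start_index, pattern) for every occurrence of every pattern in text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits


if __name__ == "__main__":
    print(sorted(find_all(["he", "she", "his", "hers"], "ushers")))
```

This baseline reads every text symbol once from left to right. The abstract's average-case bound is sublinear in n for long patterns, which means the paper's algorithm must skip parts of the text; the sketch above does not attempt that.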

Original language	English (US)
Pages (from-to)	107-113
Number of pages	7
Journal	Information Processing Letters
Volume	71
Issue number	3
DOIs	https://doi.org/10.1016/S0020-0190(99)00092-7
State	Published - Aug 27 1999
Externally published	Yes

ASJC Scopus subject areas

Theoretical Computer Science
Signal Processing
Information Systems
Computer Science Applications

Cite this

@article{8aa4f82e9c74456d9999795f22ed09ab,
title = "Fast practical multi-pattern matching",
abstract = "The multi-pattern matching problem consists in finding all occurrences of the patterns from a finite set X in a given text T of length n. We present a new and simple algorithm combining the ideas of the Aho-Corasick algorithm and the directed acyclic word graphs. The algorithm has time complexity which is linear in the worst case (it makes at most 2n symbol comparisons) and has good average-case time complexity assuming the shortest pattern is sufficiently long. Denote the length of the shortest pattern by m, and the total length of all patterns by M. Assume that M is polynomial with respect to m, the alphabet contains at least 2 symbols and the text (in which the pattern is to be found) is random, for each position each letter occurs independently with the same probability. Then the average number of comparisons is O((n/m)·log m), which matches the lower bound of the problem. For sufficiently large values of m the algorithm has a good behavior in practice.",
author = "Maxime Crochemore and A. Czumaj and L. G{\c a}sieniec and T. Lecroq and W. Plandowski and W. Rytter",
note = "Funding Information: ∗Corresponding author. Email: Thierry.Lecroq@dir.univ-rouen.fr. Supported in part by programme “G{\'e}nomes” of CNRS. 1Supported in part by programme “G{\'e}nomes” of CNRS. 2Supported by the grant KBN 8T11C03915.",
year = "1999",
month = aug,
day = "27",
doi = "10.1016/S0020-0190(99)00092-7",
language = "English (US)",
volume = "71",
pages = "107--113",
journal = "Information Processing Letters",
issn = "0020-0190",
publisher = "Elsevier",
number = "3",
}

TY  - JOUR
T1  - Fast practical multi-pattern matching
AU  - Crochemore, Maxime
AU  - Czumaj, A.
AU  - Ga̧sieniec, L.
AU  - Lecroq, T.
AU  - Plandowski, W.
AU  - Rytter, W.
N1  - Funding Information: ∗Corresponding author. Email: Thierry.Lecroq@dir.univ-rouen.fr. Supported in part by programme “Génomes” of CNRS. 1Supported in part by programme “Génomes” of CNRS. 2Supported by the grant KBN 8T11C03915.
PY  - 1999/8/27
Y1  - 1999/8/27
N2  - The multi-pattern matching problem consists in finding all occurrences of the patterns from a finite set X in a given text T of length n. We present a new and simple algorithm combining the ideas of the Aho-Corasick algorithm and the directed acyclic word graphs. The algorithm has time complexity which is linear in the worst case (it makes at most 2n symbol comparisons) and has good average-case time complexity assuming the shortest pattern is sufficiently long. Denote the length of the shortest pattern by m, and the total length of all patterns by M. Assume that M is polynomial with respect to m, the alphabet contains at least 2 symbols and the text (in which the pattern is to be found) is random, for each position each letter occurs independently with the same probability. Then the average number of comparisons is O((n/m)·log m), which matches the lower bound of the problem. For sufficiently large values of m the algorithm has a good behavior in practice.
AB  - The multi-pattern matching problem consists in finding all occurrences of the patterns from a finite set X in a given text T of length n. We present a new and simple algorithm combining the ideas of the Aho-Corasick algorithm and the directed acyclic word graphs. The algorithm has time complexity which is linear in the worst case (it makes at most 2n symbol comparisons) and has good average-case time complexity assuming the shortest pattern is sufficiently long. Denote the length of the shortest pattern by m, and the total length of all patterns by M. Assume that M is polynomial with respect to m, the alphabet contains at least 2 symbols and the text (in which the pattern is to be found) is random, for each position each letter occurs independently with the same probability. Then the average number of comparisons is O((n/m)·log m), which matches the lower bound of the problem. For sufficiently large values of m the algorithm has a good behavior in practice.
UR  - http://www.scopus.com/inward/record.url?scp=0033609581&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0033609581&partnerID=8YFLogxK
U2  - 10.1016/S0020-0190(99)00092-7
DO  - 10.1016/S0020-0190(99)00092-7
M3  - Article
AN  - SCOPUS:0033609581
SN  - 0020-0190
VL  - 71
SP  - 107
EP  - 113
JO  - Information Processing Letters
JF  - Information Processing Letters
IS  - 3
ER  - 
