Real-time traversal in grammar-based compressed files

Leszek Ga̧sieniec, Roman Kolpako, Igor Potapov, Paul Sant

Research output: Contribution to journalConference articlepeer-review

39 Scopus citations

Abstract

Historically, one of the main aims of text compression was to reduce the size of data stored for future analysis and processing, see, e.g., [2, 3]. Unfortunately, further processing of the compressed data is usually preceded by complete decompression. This standard approach may cause problems in certain applications, where the original (uncompressed) file is very large and we face running out of space available for the computation. Thus, it is important to be able to process compressed data without requiring (complete) decompression. Moreover, in some applications (e.g., pattern matching problems), the information represented by the compressed file can be processed in relatively small chunks (length of a pattern). In this context it is crucial to study compression methods that allow time/space efficient access to any fragment of a compressed file without being forced to perform complete decompression. We study here real-time recovery of consecutive symbols from compressed files, in the context of grammar-based compression, see, e.g., [1]. In this setting, a compressed text is represented as a small (a few kilobytes) dictionary D (containing a set of code words), and a very long (a few megabytes) string based on symbols drawn from the dictionary D. The space efficiency of this kind of compression, is comparable with standard compression methods based on the Lempel-Ziv approach [3]. We show, that one can visit consecutive symbols of the original text, moving from one symbol to another in constant time and extra O(|D|) space. This algorithm is an improvement of the on-line linear (amortised) time algorithm presented in [1].

Original languageEnglish (US)
Pages (from-to)458
Number of pages1
JournalData Compression Conference Proceedings
StatePublished - 2005
Externally publishedYes
EventDCC 2005: Data Compression Conference - Snowbird, UT, United States
Duration: Mar 29 2005Mar 31 2005

ASJC Scopus subject areas

  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'Real-time traversal in grammar-based compressed files'. Together they form a unique fingerprint.

Cite this