Sary: FAQ

Last Modified: 2000-12-06 (Since: 2000-12-06)


General

Can a binary file be handled?

Yes, it can. Files containing special characters (including the NUL character) are handled correctly.

Is there a file size limitation?

Currently the limit is 32 bits. The target file and the suffix array file are each limited to 2 GB.

Construction of Suffix Array

mksary is too slow!

The best way to speed up mksary when constructing a suffix array for a huge file is to use a high-performance machine with a fast CPU and plenty of memory. Ideally, the machine should have about 5 times as much memory as the size of the target file.

If your machine has limited memory, try the -b option, which enables memory-saving block sorting.

Note: Searching can be performed efficiently even on a machine without much memory.

How large is a suffix array file?

The size is simply the number of index points * 4 bytes, because a suffix array is constructed with 32-bit integers.
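
As a rough illustration of the arithmetic above, the following Python sketch estimates the size of a suffix array file. The function name and the sample count are hypothetical and not part of Sary.

```python
INDEX_POINT_BYTES = 4  # each index point is stored as one 32-bit integer


def suffix_array_size(num_index_points):
    """Return the estimated suffix array file size in bytes."""
    return num_index_points * INDEX_POINT_BYTES


# Example: a 1,000,000-byte text with an index point at every byte.
print(suffix_array_size(1_000_000))  # 4000000
```

So a text indexed at every byte needs a suffix array about four times its own size, while indexing only word or line starts shrinks the array accordingly.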

Can a suffix array be compressed?

It is very hard. To see this for yourself, construct a huge suffix array and apply gzip or bzip2 to it.

Can a suffix array be updated incrementally?

No, it cannot. Please reconstruct it from scratch.

Can only specific fields be searched?

It can be done by assigning index points to those fields only. It's probably easy to write a program for your task in a scripting language such as Perl. See: Reference Manual: Appendix
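
The idea of restricting index points to particular fields can be sketched as follows. This example uses Python rather than Perl and collects the byte offset at which a chosen field starts in each tab-separated line. The function name, the tab-separated layout, and the sample data are assumptions made for illustration; how such offsets are turned into a sorted suffix array is covered by the Reference Manual and is not shown here.

```python
def field_index_points(data, field_no, sep=b"\t"):
    """Return the byte offset where field `field_no` (0-based) starts in each line."""
    points = []
    line_start = 0
    for line in data.splitlines(keepends=True):
        fields = line.rstrip(b"\r\n").split(sep)
        if field_no < len(fields):
            # Offset = line start + lengths of the preceding fields and separators.
            offset = line_start + sum(len(f) + len(sep) for f in fields[:field_no])
            points.append(offset)
        line_start += len(line)
    return points


sample = b"1\tapple\tred\n2\tbanana\tyellow\n"
points = field_index_points(sample, 1)
print(points)                             # [2, 14]
print([sample[p:p + 5] for p in points])  # [b'apple', b'banan']
```

Lines that lack the requested field simply contribute no index point, so a search can never match inside the other fields.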

Can character encodings be handled?

Character encodings can be specified for assigning index points. mksary takes the -c option to select a character encoding; by default it handles an input file as a byte stream. The following character encodings can be handled.

Note: No special handling of character encodings is performed when searching.

Is there a byte order incompatibility?

Suffix array files are stored in network byte order (big endian), so the same suffix array files can be shared by both big endian machines (e.g., Sun SPARC) and little endian machines (e.g., Intel x86).
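
The portability described above comes from always reading and writing index points in one fixed byte order. The following minimal Python sketch packs and unpacks 32-bit big endian integers. The helper names are hypothetical, the use of signed integers is an assumption, and the code only illustrates the byte layout, not Sary's own I/O routines.

```python
import struct


def pack_index_points(offsets):
    """Encode offsets as 32-bit big endian (network byte order) integers."""
    # ">" forces big endian; "i" is a signed 32-bit integer (assumed here).
    return struct.pack(">%di" % len(offsets), *offsets)


def unpack_index_points(raw):
    """Decode 32-bit big endian integers, regardless of the host's byte order."""
    return list(struct.unpack(">%di" % (len(raw) // 4), raw))


raw = pack_index_points([1, 258])
print(raw.hex())                 # 0000000100000102
print(unpack_index_points(raw))  # [1, 258]
```

Because the format string names the byte order explicitly, the same bytes decode to the same offsets on SPARC and x86 alike.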

Search with Suffix Array

Is case-insensitive search supported?

Yes, it is. It employs setlocale(3), isalpha(3), toupper(3), and tolower(3). If you set the locale correctly, case-insensitive searches for European languages such as French, German, and Russian should work.

Can multiple files be searched at once?

No, they cannot. Please concatenate those files into one file and construct the suffix array for that file. For searching vast quantities of documents, Namazu may be a better choice.
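
One way to keep hits in a concatenated file traceable to the original files is to record where each file starts. The Python sketch below works on in-memory byte strings; the function names and the sample documents are hypothetical and not part of Sary.

```python
import bisect


def concatenate(docs):
    """Join (name, bytes) pairs; return the joined bytes plus names and start offsets."""
    names, starts, chunks, pos = [], [], [], 0
    for name, body in docs:
        names.append(name)
        starts.append(pos)
        chunks.append(body)
        pos += len(body)
    return b"".join(chunks), names, starts


def source_of(offset, names, starts):
    """Map a byte offset in the joined data back to the file it came from."""
    return names[bisect.bisect_right(starts, offset) - 1]


joined, names, starts = concatenate([("a.txt", b"hello\n"), ("b.txt", b"world\n")])
hit = joined.find(b"world")  # stands in for an offset returned by a search
print(hit, source_of(hit, names, starts))  # 6 b.txt
```

The start offsets are already sorted, so a binary search over them finds the owning file for any hit.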

Can line numbers be printed in search results?

No, they cannot, but you can simulate it by preparing a file containing line numbers in advance with the following command

    % cat -n foo.txt > foo-with-line-numbers.txt

and then constructing the suffix array for that file.

Miscellaneous

Why are a lot of test programs included?

Because I'm test-infected. The `tests' directory contains test suites, which you can run with `make check'. I enjoy practicing the software development methodology described in Refactoring. Since suffix array manipulation involves a lot of hazardous boundary conditions, rigorous testing is really needed. I also use `g_assert' in many places to catch subtle bugs.

Note: Don't worry if some tests fail; those tests are not expected to work everywhere. They are mainly for developers.

What's the difference between Sary and SUFARY?

Sary: moderate performance, object-oriented APIs
SUFARY: high performance, low-level APIs

The development of Sary started because I didn't like the source code of SUFARY. Sary aims for maintainability, extensibility, and usability rather than performance. By the way, the authors of Sary (Takabayashi) and SUFARY (Yamashita) sat side by side from April 1999 to March 2000 in the same laboratory at our college. :-)


Satoru Takabayashi