Last Modified: 2000-12-06 (Since: 2000-12-06)
Yes, it can. Files containing special characters, including the NUL character, are handled correctly.
Currently the limit is 32 bits. Both the target file and the suffix array file are limited to 2 GB in size.
The best way to speed up mksary when constructing a suffix array for a huge file is to use a high-performance machine with a fast CPU and plenty of memory. Ideally, the machine should have about 5 times as much memory as the size of the target file.
If your machine has limited memory, try the -b option, which enables memory-saving block sorting.
Note: Searching can be performed efficiently even on a machine without much memory.
The size is simply the number of index points * 4 bytes, because a suffix array is constructed from 32-bit integers.
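As a quick illustration of the arithmetic above, the following Python sketch computes the expected size of a suffix array file. The function name `suffix_array_size` is hypothetical and not part of Sary; the only fact taken from the text is that each index point occupies 4 bytes.

```python
INDEX_ENTRY_BYTES = 4  # each index point is stored as one 32-bit integer


def suffix_array_size(num_index_points):
    """Return the expected size in bytes of a suffix array file."""
    if num_index_points < 0:
        raise ValueError("number of index points must not be negative")
    return num_index_points * INDEX_ENTRY_BYTES


# Example: one index point per byte of a 100 MB (100 * 1024 * 1024 bytes) file.
size_in_bytes = suffix_array_size(100 * 1024 * 1024)
print(size_in_bytes)
```

With one index point per byte, the array is four times the size of the target file; assigning index points more sparsely (for example, per word or per line) shrinks it proportionally.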
Very hard. To see this for yourself, construct a huge suffix array and apply gzip or bzip2 to it.
No, it cannot. Please reconstruct it from scratch.
This can be done by assigning index points to those fields only. It is probably easy to write a program for your task in a scripting language such as Perl. See: Reference Manual: Appendix
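The idea of indexing only certain fields can be sketched as follows. This is a minimal illustration in Python, not Sary's own tooling: it assumes a newline-delimited file with tab-separated fields, and the function name `field_offsets` is hypothetical. It only computes the byte offsets where the chosen field begins; how such index points are handed to Sary is described in the Reference Manual appendix and is not shown here.

```python
def field_offsets(data, field_index, sep=b"\t"):
    """Return the byte offset of field `field_index` on every line of `data`.

    Assumes newline-terminated lines whose fields are separated by `sep`.
    Lines with too few fields (and empty lines) are skipped.
    """
    offsets = []
    line_start = 0
    for line in data.split(b"\n"):
        fields = line.split(sep)
        if line and field_index < len(fields):
            field_start = line_start
            # Skip over the preceding fields and their separators.
            for field in fields[:field_index]:
                field_start += len(field) + len(sep)
            offsets.append(field_start)
        line_start += len(line) + 1  # +1 for the newline
    return offsets


sample = b"id1\tapple\nid2\tbanana\n"
points = field_offsets(sample, 1)
print(points)
```

Each returned offset marks the first byte of the second field on a line, so a search restricted to these index points can only match text that starts at the beginning of that field.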
A character encoding can be specified for assigning index points: mksary takes the -c option for this purpose. By default, mksary handles an input file as a byte stream. The following character encodings can be handled.
Note: No special handling for character encodings is performed when searching.
Suffix array files are stored in network byte order (big endian) so that the same suffix array files can be shared by both big-endian machines (e.g., Sun SPARC) and little-endian machines (e.g., Intel x86).
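To illustrate what network byte order means for a file of 32-bit integers, here is a small Python sketch. The helper names `pack_offsets` and `unpack_offsets` are hypothetical, and the sketch assumes the file is nothing more than a sequence of 4-byte big-endian offsets with no header. It treats offsets as unsigned, which is equivalent to signed for the non-negative values below 2 GB discussed above.

```python
import struct


def pack_offsets(offsets):
    """Encode offsets as consecutive 32-bit big-endian integers."""
    return struct.pack(">%dI" % len(offsets), *offsets)


def unpack_offsets(data):
    """Decode a byte string of 32-bit big-endian integers into a list."""
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    count = len(data) // 4
    return list(struct.unpack(">%dI" % count, data))


encoded = pack_offsets([0, 1, 258])
decoded = unpack_offsets(encoded)
print(encoded.hex(), decoded)
```

Because the `>` prefix fixes the byte order explicitly, the same bytes decode to the same offsets regardless of the byte order of the machine running the code.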
Yes, it can. It employs setlocale(3), isalpha(3), toupper(3), and tolower(3). If you set the locale correctly, case-insensitive searches for European languages such as French, German, and Russian should work.
No, it cannot. Please concatenate those files into one file and construct the suffix array for that file. For searching vast quantities of documents, Namazu may be a better choice.
No, it cannot, but you can simulate it by first preparing a file that contains line numbers, as follows,
% cat -n foo.txt > foo-with-line-numbers.txt
and then constructing the suffix array for that file.
Because I'm test-infected. The `tests' directory contains test suites. You can run them with `make check'. I enjoy practicing the software development methodology described in Refactoring. Since suffix array manipulation involves many hazardous boundary conditions, rigorous testing is really needed. I also use `g_assert' liberally to catch subtle bugs.
Note: Don't worry if some tests fail, because those tests are not expected to work everywhere. They are mainly for developers.
The development of Sary started because I didn't like the source code of SUFARY. Sary aims at maintainability, extensibility, and usability rather than performance. By the way, the authors of Sary (Takabayashi) and SUFARY (Yamashita) sat side by side in the same laboratory at our college from April 1999 to March 2000. :-)