marisa trie suffix compression?


I'm using a custom Cython wrapper of this marisa trie library as a key-value multimap.
My trie entries look like key 0xff data1 0xff data2 to map key to the tuple (data1, data2). data1 is a string of variable length but data2 is always a 4-byte unsigned int. The 0xff is a delimiter byte.
I know a trie is not the optimal data structure for this from a theoretical point of view, but various practical considerations make it the best available choice.
In this use case, I have about 10-20 million keys, each one has on average 10 data points. data2 is redundant for many entries (in some cases, data2 is always the same for all data points for a given key), so I had the idea of taking the most frequent data2 entry and adding a ("", base_data2) data point to each key.
Since a MARISA trie, to my knowledge, does not have suffix compression and for a given key each data1 is unique, I assumed that this would save 4 bytes per data tuple that uses a redundant key (plus adding in a single 4-byte "value" for each key). Having rebuilt the trie, I checked that the redundant data was no longer being stored. I expected a sizable decrease in both serialized and in-memory size, but in fact the on-disk trie went from 566MB to 557MB (and a similar reduction in RAM usage for a loaded trie).
From this I concluded that I must be wrong about there being no suffix compression. I was now storing the entries with a redundant data2 number as key 0xff data1 0xff, so to test this theory I removed the trailing 0xff and adjusted the code that uses the trie to cope. The new trie went down from 557MB to 535MB.
Removing a single redundant trailing byte therefore gave a more than 2x larger improvement than removing the same number of 4-byte sequences, so either the suffix compression theory is dead wrong or it's implemented in some very convoluted way.
My remaining theory is that adding in the ("", base_data2) entry at a higher point in the trie somehow throws off the compression in some terrible way, but it should just be adding in 4 more bytes when I've removed many more than that from lower down in the trie.
I'm not optimistic for a fix, but I'd dearly like to know why I'm seeing this behavior! Thank you for your attention.

As I suspected, it's caused by padding.
In lib/marisa/grimoire/vector/vector.h, there is the following function:
void write_(Writer &writer) const {
    writer.write((UInt64)total_size());
    writer.write(const_objs_, size_);
    writer.seek((8 - (total_size() % 8)) % 8);
}
The key point is: writer.seek((8 - (total_size() % 8)) % 8);. After writing each chunk, the writer pads to the next 8-byte boundary.
This explains the behavior you are seeing, as part of the data removed by the initial shortening of the key was replaced with padding.
When you removed the extra byte, it brought the key size below the next boundary limit, resulting in a major size change.
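To see why a shorter payload may leave the serialized size unchanged, the padding expression from write_ can be replayed in Python. This is only a sketch of the arithmetic: padded_size is a hypothetical helper and the sizes are made-up examples, not measurements from MARISA.

```python
def padded_size(total_size: int, alignment: int = 8) -> int:
    """Size after padding to the next multiple of `alignment`.

    Mirrors the expression (8 - (total_size % 8)) % 8 from write_().
    """
    padding = (alignment - (total_size % alignment)) % alignment
    return total_size + padding


# Hypothetical payload sizes in bytes (illustrative only).
for size in (16, 17, 20, 23, 24):
    print(size, "->", padded_size(size))
```

Any payload of 17 to 24 bytes occupies the same 24 bytes once padded, so trimming a few bytes only pays off when it crosses a boundary.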
Practically, what this means is that, since the padding code is in the serialization part of the library, you are probably getting the in-memory savings you were expecting, but that did not translate into on-disk savings. Monitoring program RAM usage should confirm that.
If disk size is your concern, then you might as well simply deflate the serialized data, as it seems MARISA does not apply any compression whatsoever.

Related

Compression of short strings

I am trying to compress short strings (max 15 characters).
The goal is to implement the "Normalized Compression Distance" [1]. I tried a few compression algorithms in Python (I also looked to see if I could do it in Julia, but the packages all refuse to install).
I always obtain in the end a bit-string longer than the original string I am trying to compress which totally defeats the purpose.
An example with zlib :
import zlib
data = b"this is a test"
compressed_data = zlib.compress(data, 9)
print(len(data))
print(len(compressed_data))
Which returns :
13
21
Do you know what I am doing wrong, or how I could do this more efficiently?
[1] : https://arxiv.org/pdf/cs/0312044.pdf

Check out these libraries for compressing short strings:
https://github.com/siara-cc/unishox :
Unishox is a hybrid encoder (entropy, dictionary and delta coding). It works by assigning fixed prefix-free codes for each letter of the 95 letter printable Character Set (entropy coding). It encodes repeating letter sets separately (dictionary coding). For Unicode characters (UTF-8), delta coding is used. It also has special handling for repeating upper case and num pad characters.
Unishox was developed to save memory in embedded devices and compress strings stored in databases. It is used in many projects and has an extension for Sqlite database. Although it is slower than other available libraries, it works well for the given applications.
https://github.com/antirez/smaz :
Smaz was developed by Salvatore Sanfilippo and it compresses strings by replacing parts of them using a codebook. This was the first one available for compressing short strings as far as I know.
https://github.com/Ed-von-Schleck/shoco :
shoco was written by Christian Schramm. It is an entropy encoder, because the length of the representation of a character is determined by the probability of encountering it in a given input string.
It has a default model for English language and a provision to train new models based on given sample text.
PS: Unishox was developed by me and its working principle is explained in this article:

According to your reference the extra overhead added by Zlib may not matter.
That article defines the NCD as (C(x*y) − min(C(x),C(y))) / max(C(x),C(y)), where using your zlib compression for C:
C(x) = length(zlib.compress(x, 9))
NCD(x,y) = (C(x*y) − min(C(x),C(y))) / max(C(x),C(y))
As long as zlib only adds a constant overhead, the numerator of the NCD should not change, and the denominator should only change by a small amount.
You could add a correction factor like this:
C(x) = length(zlib.compress(x, 9)) - length(zlib.compress("a", 9)) + 1
which might eliminate the remaining issues with the denominator of NCD.
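The NCD formula quoted above can be written out in a few lines of Python. This is a minimal sketch that uses zlib as the compressor C and plain concatenation for x*y; the names c and ncd and the sample inputs are illustrative assumptions, and the correction factor is left out.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, used as C(.) in the NCD."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance with zlib as the compressor."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


a = b"the quick brown fox jumps over the lazy dog"
b = b"0123456789 ZYXWVUTSRQPONM !@#$%^&*() 9876543210"
print(ncd(a, a))  # small: the second copy compresses almost for free
print(ncd(a, b))  # larger: the two inputs share very little
```

Even though each compressed string is longer than its input, the distance still separates the similar pair from the dissimilar one, which is what the NCD needs.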

The DEFLATE algorithm uses a 32 KB compression dictionary to deduplicate your data. By default it builds this dictionary from the data you provide it.
With short strings, it cannot build a decent compression dictionary and therefore cannot compress efficiently; the metadata overhead is what makes the compressed result larger than the input.
One solution would be to use a preset dictionary with samples of recurring patterns.
This question handles the same issue: Reusing compression dictionary
You can use my dicflate utility to experiment with DEFLATE compression on short and long strings with and without preset dictionaries: dicflate
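In Python, a preset dictionary can be tried with the zdict argument of zlib.compressobj and zlib.decompressobj. The sketch below is only an illustration: the dictionary contents and the helper names are assumptions, and a real dictionary would be built from samples of your own data.

```python
import zlib

# Hypothetical preset dictionary: text expected to resemble the inputs.
ZDICT = b"that is a test, here is a test, this is a test"


def compress_with_dict(data: bytes, zdict: bytes = ZDICT) -> bytes:
    """Compress `data` with DEFLATE primed by a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes = ZDICT) -> bytes:
    """Decompress; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


data = b"this is a test"
plain = zlib.compress(data, 9)
primed = compress_with_dict(data)
print(len(data), len(plain), len(primed))
```

The dictionary is not stored in the output, so both sides have to agree on it ahead of time.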

Parsing IFF-style data using Python

I have an IFF-style file (see below) whose contents I need to inspect in Python.
https://en.wikipedia.org/wiki/Interchange_File_Format
I can iterate through the file using the following code
from chunk import Chunk

def chunks(f):
    # Yield one Chunk at a time until the end of the file is reached.
    while True:
        try:
            c = Chunk(f, align=False, bigendian=False)
            yield c
            c.skip()
        except EOFError:
            break

if __name__ == "__main__":
    with open("sample.iff", "rb") as f:
        for c in chunks(f):
            name, sz, value = c.getname(), c.getsize(), c.read()
            print(name, sz, value)
Now I need to parse the different values. I have had some success using Python's 'struct' module, unpacking different fields as follows
struct.unpack('<I', value)
or
struct.unpack('BBBB', value)
by experimenting with different formatting characters shown in the struct module documentation
https://docs.python.org/2/library/struct.html
This works with some of the simpler fields but not with the more complex ones. It is all very trial-and-error. What I need is some systematic way of unpacking the different values, some way of knowing or inspecting the type of data they represent. I am not a C datatype expert.
Any ideas ? Many thanks.
SVOXVERS BVER BPM }SPEDTGRDGVOL`NAME2017-02-15 16-38MSCLMZOOMXOFMYOFLMSKCURLTIMESELSLGENPATNPATTPATLPDTAa � 1pQ 10 `q !#QP! 0A �`A PCHNPLIN PYSZ PFLGPICO �m�!�a��Q�1:\<<<<:\�1�Q��a�!�mPFGCPBGC���PFFFPXXXPYYYPENDSFFFCSNAM OutputSFINSRELSXXXDSYYYhSZZZSSCLSVPRSCOL���SMICSMIB����SMIP����SLNK����SENDSFFFISNAM FMSTYPFMSFINSRELSXXX�SYYY8SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNKCVAL�CVAL0CVAL�CVALCVALCVALCVALCVALGCVALnCVAL\CVALCVAL&CVALoCVALDCVALCVALCVALCMID������������������SENDSFFFQSNAM EchoSTYPEchoSFINSRELSXXX�SYYY SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNK����CVALCVALCVAL�CVALCVALCVALCMID0������SENDSFFFQSNAM ReverbSTYPReverbSFINSRELSXXX\SYYY�SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNK����CVALCVALCVAL�CVAL�CVALCVALCVALCVALCVALCMIDH���������SENDSENDSENDSENDSEND

If it's really an IFF file, it needs alignment and big-endian turned on, and the file would contain a single FORM chunk that in turn contains the FORM type such as SVOX and the contents chunks. (Or it could contain a LIST or CAT container chunk.)
An IFF chunk has:
A four-character chunk-type code
A four-byte big-endian integer: length
length number of data bytes
A pad byte for alignment if length is odd
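The chunk layout listed above can be walked with the struct module alone. This is a rough sketch, not a full IFF reader: iter_chunks is a hypothetical helper, it does not recurse into FORM/LIST/CAT containers, and the big_endian and align switches exist only so the same loop can be tried on IFF-like variants.

```python
import struct


def iter_chunks(buf: bytes, big_endian: bool = True, align: bool = True):
    """Yield (chunk_type, data) pairs from a flat sequence of chunks."""
    header = ">4sI" if big_endian else "<4sI"
    pos = 0
    while pos + 8 <= len(buf):
        chunk_type, length = struct.unpack_from(header, buf, pos)
        start = pos + 8
        yield chunk_type, buf[start:start + length]
        pos = start + length
        if align and length % 2:
            pos += 1  # skip the pad byte after an odd-length chunk


# Hand-built sample: a 3-byte chunk (plus pad byte) and a 4-byte chunk.
sample = (
    b"NAME" + struct.pack(">I", 3) + b"abc" + b"\x00"
    + b"BPM " + struct.pack(">I", 4) + struct.pack(">I", 125)
)
for chunk_type, data in iter_chunks(sample):
    print(chunk_type, len(data), data)
```

Running the loop with each combination of the two switches and checking which one yields sensible chunk names is a quick way to test a guess about the variant in use.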
This is documented in "EA IFF 85". See the "EA IFF-85" Repository for the original IFF docs. [I wrote them.]
Some file formats like RIFF and PNG are variations on the IFF design, not conforming applications of the IFF standard. They vary the chunk format details, which is why Python's Chunk reader library lets you pick alignment, endian, and when to recurse into chunks.
By looking at your file in a hex/ascii dump and mapping out the chunk spans, you should be able to deduce whether it uses big-endian or little-endian length fields, whether each odd-length chunk is followed by a pad byte for alignment, and whether there are chunks within chunks.
Now to the contents. A chunk's type signals the format and semantics of its contents. Those contents could be a simple C struct or could contain variable-length strings. IFF itself does not provide metadata on that level of structure, unlike JSON and TIFF.
So try to find the documentation for the file format (SVOX?).
Otherwise try to reverse engineer the data. If you put sample data into an application that generates these files, you can try special cases, look for the expected values in the file, change just one parameter, then look for what changed in the file.
Finally, your code should call c.close(). c.close() will call c.skip() for you and also handle chunk closing, which includes safety checks for attempts to read the chunk afterwards.

Fast hash for strings

I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.
I'm looking for an algorithm that calculates a hash of such strings, where the hash is itself a string of fixed length, like YouTube video IDs:
https://www.youtube.com/watch?v=-F-3E8pyjFo
^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have short hash strings.
Is there a shell command or python library which can do that?

As of Python 3 this method does not work:
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned:
>>> import ctypes
>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16-character hex string:
>>> hashu("dfds").to_bytes(8,"big").hex()
Or an N*2-character string, where N is <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so much faster that it's never worth using hashlib for anything unless you need security, not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the urlsafe base64 encoding to make a more compact string than "hex" gives you:
>>> from base64 import urlsafe_b64encode
>>> hashnbigu=lambda word, N : urlsafe_b64encode(((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0.
See https://github.com/flier/pyfasthash for fast, stable hashes that similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And stuffing things like 2**32 in your code, instead of making them constants, is bad form.
In the end, 8 bytes of collision resistance is OK for smaller applications: with fewer than a million entries, the collision odds are roughly 0.000003% or less (about n²/2⁶⁵ for n = 10⁶). That's a 12-character base64-encoded string with padding. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
Speed comparison for producing 300k 16 byte hashes from a bytes-input.
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes, this adds a lot of conversion overhead, making the builtin shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284

I guess this question is off-topic because it is opinion-based, but here is at least one hint. I know the FNV hash because The Sims 3 uses it to find resources by name across its different content packages. It uses the 64-bit version, so I guess that is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement if no module satisfies you (pyfasthash has an implementation of it, for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
Edit: I had finally the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
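For illustration, here is a minimal pure-Python sketch of that idea. It assumes the FNV-1a variant with the standard 64-bit offset basis and prime (the answer does not say which FNV variant is meant), and short_id is a hypothetical helper name; this is not the code from the linked gist.

```python
import base64

FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def short_id(text: str) -> str:
    """11-character URL-safe string derived from the 64-bit hash."""
    digest = fnv1a_64(text.encode("utf-8")).to_bytes(8, "big")
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


print(short_id("/usr/local/share/example.txt"))
```

Unlike the built-in hash(), the result is stable across interpreter runs, because nothing in it depends on PYTHONHASHSEED.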

Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversible.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.

You could use the sum program (assuming you're on linux) but keep in mind that the shorter the hash the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions

Something to keep in mind is that hash codes are one way functions - you cannot use them for "video ids" as you cannot go back from the hash to the original path. Quite apart from anything else hash collisions are quite likely and you end up with two hashes both pointing to the same video instead of different ones.
To create an Id like the youtube one the easiest way is to create a unique id however you normally do that (for example an auto key column in a database) and then map that to a unique string in a reversible way.
For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
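The reversible mapping described above might look like the following sketch. The alphabet order, the function names and the padding character are assumptions made for the example, not a standard.

```python
import string

# 0-9a-zA-Z, i.e. base 62 (the order is an arbitrary choice for this sketch).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_id(n: int, min_length: int = 1) -> str:
    """Map a non-negative integer to a base-62 string, left-padded with '0'."""
    if n < 0:
        raise ValueError("id must be non-negative")
    digits = []
    while True:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
        if n == 0:
            break
    return "".join(reversed(digits)).rjust(min_length, ALPHABET[0])


def decode_id(s: str) -> int:
    """Inverse of encode_id (leading padding characters decode to zero)."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


print(encode_id(123456789, min_length=11))
print(decode_id(encode_id(123456789, min_length=11)))
```

Because the padding character is the zero digit, padded strings decode back to the same integer.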

LZ77 compression reserved bytes "< , >"

I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So how do I compress a file that contains these bytes, if I cannot compress them but also cannot replace them with different bytes (because decoders wouldn't be able to read the result)? Is there a way? Or do decoders only decode when there is an exact <d, l> string? (If so, imagine that by coincidence we find these bytes in a file. What would happen?)
Thanks!

LZ77 is about referencing strings back in the decompressing buffer by their lengths and distances from the current position. But it is left to you how you encode these back-references. Many implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).
One way to do it is reserving some characters as "special" (so called "escape sequences"). You can do it the way you did it, that is, by using < to mark the start of a back-reference. But then you also need a way to output < if it is a literal. You can do it, for example, by establishing that when after < there's another <, then it means a literal, and you just output one <. Or, you can establish that if after < there's immediately >, with nothing in between, then that's not a back-reference, so you just output <.
It also wouldn't be the most efficient way to encode those back-references, because it uses several bytes to encode a back-reference, so it will become efficient only for referencing strings longer than those several bytes. For shorter back-references it will inflate the data instead of compressing them, unless you establish that matches shorter than several bytes are being left as is, instead of generating back-references. But again, this means lower compression gains.
If you compress only plain old ASCII texts, you can employ a better encoding scheme, because ASCII uses just 7 out of 8 bits in a byte. So you can use the highest bit to signal a back-reference, and then use the remaining 7 bits as length, and the very next byte (or two) as back-reference's distance. This way you can always tell for sure whether the next byte is a literal ASCII character or a back-reference, by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the following 7 bits as length, and read up the next 2 bytes to use it as distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences of more than 3 characters long.
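A decoder for the high-bit scheme just described could look like the sketch below. It assumes 7-bit ASCII literals, a 7-bit length in the flagged byte and a 2-byte big-endian distance; the byte order and the function name are choices made for this example, not part of any standard format.

```python
def decode(data: bytes) -> bytes:
    """Decode a stream of ASCII literals and 3-byte back-references."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte & 0x80 == 0:
            # Highest bit clear: a literal ASCII character.
            out.append(byte)
            i += 1
        else:
            # Highest bit set: low 7 bits are the length,
            # the next two bytes are the distance.
            length = byte & 0x7F
            distance = int.from_bytes(data[i + 1:i + 3], "big")
            if distance == 0 or distance > len(out):
                raise ValueError("back-reference points outside the output")
            start = len(out) - distance
            for k in range(length):
                # Copy byte by byte so overlapping references work.
                out.append(out[start + k])
            i += 3
    return bytes(out)


# "hello " as literals, then: copy 11 bytes from 6 bytes back.
encoded = b"hello " + bytes([0x80 | 11, 0x00, 0x06])
print(decode(encoded))
```

No escape character is needed here, because the highest bit alone tells a literal from a back-reference.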
But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted in such a way that the characters appearing more often have the shortest codes, and those which are rare have longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code is a prefix of some other code. When your codes have this property, you can always distinguish them by reading the bits in sequence until you decode one of them. Then you can be sure that you won't get any other valid item by reading more bits. The next bit always starts a new sequence. To produce such codes, you need to use Huffman trees. You can then join all your bytes and the different lengths of references into one such tree and generate distinct bit codes for them, depending on their frequency. When you decode them, you just read the bits until you reach the code of one of these elements, and then you know for sure whether it is the code of a literal character or the code for a back-reference's length. In the second case, you then read some additional bits for the distance of the back-reference (also encoded with a prefix code). This is what the DEFLATE compression scheme does. But this is a whole other story, and you will find the details in the RFC supplied by @MarkAdler.

If I understand your question correctly, it makes no sense. There are no "reserved bytes" for the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.

Shortest hash in python to name cache files

What is the shortest hash (in filename-usable form, like a hexdigest) available in python? My application wants to save cache files for some objects. The objects must have unique repr() so they are used to 'seed' the filename. I want to produce a possibly unique filename for each object (not that many). They should not collide, but if they do my app will simply lack cache for that object (and will have to reindex that object's data, a minor cost for the application).
So, if there is one collision we lose one cache file, but the collected savings of caching all objects make the application startup much faster, so it does not matter much.
Right now I'm actually using abs(hash(repr(obj))); that's right, the string hash! Haven't found any collisions yet, but I would like to have a better hash function. hashlib.md5 is available in the python library, but the hexdigest is really long if put in a filename. Alternatives, with reasonable collision resistance?
Edit:
Use case is like this:
The data loader gets a new instance of a data-carrying object. Unique types have unique repr. so if a cache file for hash(repr(obj)) exists, I unpickle that cache file and replace obj with the unpickled object. If there was a collision and the cache was a false match I notice. So if we don't have cache or have a false match, I instead init obj (reloading its data).
Conclusions (?)
The str hash in python may be good enough, I was only worried about its collision resistance. But if I can hash 2**16 objects with it, it's going to be more than good enough.
I found out how to take a hex hash (from any hash source) and store it compactly with base64:
import base64
# 'h' is a string of hex digits
raw = bytes.fromhex(h)
hashstr = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

The birthday paradox applies: given a good hash function, the expected number of hashes before a collision occurs is about sqrt(N), where N is the number of different values that the hash function can take. (The wikipedia entry I've pointed to gives the exact formula). So, for example, if you want to use no more than 32 bits, your collision worries are serious for around 64K objects (i.e., 2**16 objects -- the square root of the 2**32 different values your hash function can take). How many objects do you expect to have, as an order of magnitude?
Since you mention that a collision is a minor annoyance, I recommend you aim for a hash whose number of possible values is roughly the square of the number of objects you'll have (that is, about twice as many bits as it takes to count your objects), or a bit less but not MUCH less than that.
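The birthday estimate is easy to check numerically. The sketch below uses the usual approximation p ≈ 1 − exp(−n(n−1)/2N) for n objects and N possible hash values; it is an approximation rather than the exact formula, and the function name is made up for the example.

```python
import math


def collision_probability(n_objects: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision (birthday bound)."""
    space = 2 ** hash_bits
    exponent = n_objects * (n_objects - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


for bits in (32, 40, 64):
    print(bits, collision_probability(2 ** 16, bits))
```

With 2**16 objects and a 32-bit hash the estimate comes out near 39%, which is why the square-root rule marks the point where collisions stop being rare.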
You want to make a filename: is that on a case-sensitive filesystem, as is typical on Unix, or do you have to cater for case-insensitive systems too? This matters because you aim for short filenames, but the number of bits per character you can use to represent your hash as a filename changes dramatically on case-sensitive vs. case-insensitive systems.
On a case-sensitive system, you can use the standard library's base64 module (I recommend the "urlsafe" version of the encoding, i.e. this function, as avoiding '/' characters that could be present in plain base64 is important in Unix filenames). This gives you 6 usable bits per character, much better than the 4 bits/char in hex.
Even on a case-insensitive system, you can still do better than hex -- use base64.b32encode and get 5 bits per character.
These functions take and return strings; use the struct module to turn numbers into strings if your chosen hash function generates numbers.
If you do have a few tens of thousands of objects I think you'll be fine with builtin hash (32 bits, so 6-7 characters depending on your chosen encoding). For a million objects you'd want 40 bits or so (7 or 8 characters) -- you can fold (xor, don't truncate;-) a sha256 down to a long with a reasonable number of bits, say 128 or so, and use the % operator to cut it further to your desired length before encoding.
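Put together, the fold-then-encode recipe could be sketched as follows. The 40-bit default, the helper names and the choice to XOR the two 128-bit halves of the digest are assumptions made for this illustration.

```python
import base64
import hashlib


def folded_hash(text: str, bits: int = 40) -> int:
    """XOR-fold a SHA-256 digest to 128 bits, then reduce with %."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    high = int.from_bytes(digest[:16], "big")
    low = int.from_bytes(digest[16:], "big")
    return (high ^ low) % (2 ** bits)


def cache_name(text: str, bits: int = 40, case_sensitive: bool = True) -> str:
    """Short filename-safe name: base64 (6 bits/char) or base32 (5 bits/char)."""
    raw = folded_hash(text, bits).to_bytes((bits + 7) // 8, "big")
    if case_sensitive:
        encoded = base64.urlsafe_b64encode(raw)
    else:
        encoded = base64.b32encode(raw)
    return encoded.rstrip(b"=").decode("ascii")


key = repr((1, 2, 3))
print(cache_name(key))                        # 7 characters
print(cache_name(key, case_sensitive=False))  # 8 characters
```

At 40 bits this gives the 7 or 8 characters mentioned above, depending on which encoding the filesystem allows.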

The builtin hash function of strings is fairly collision free, and also fairly short. It has 2**32 values, so it is fairly unlikely that you encounter collisions (if you use its abs value, it will have only 2**31 values).
You have been asking for the shortest hash function. That would certainly be
def hash(s):
    return 0
but I guess you didn't really mean it that way...

You can make any hash you like shorter by simply truncating it. md5 is always 32 hex digits, but an arbitrary substring of it (or any other hash) has the proper qualities of a hash: equal values produce equal hashes, and the values are spread around a bunch.

I'm sure that there's a CRC32 implementation in Python, but that may be too short (8 hex digits). On the upside, it's very quick.
Found it, binascii.crc32

If you do have a collision, how are you going to tell that it actually happened?
If I were you, I would use hashlib to sha1() the repr(), and then just get a limited substring of it (first 16 characters, for example).
Unless you are talking about huge numbers of these objects, I would suggest that you just use the full hash. Then the opportunity for collision is so, so, so, so small, that you will never live to see it happen (likely).
Also, if you are dealing with that many files, I'm guessing that your caching technique should be adjusted to accommodate it.

We use hashlib.sha1(...).hexdigest(), which produces even longer strings, for cache objects with good success. Nobody is actually looking at cache files anyway.

Considering your use case, if you don't have your heart set on using separate cache files and you are not too far down that development path, you might consider using the shelve module.
This will give you a persistent dictionary (stored in a single dbm file) in which you store your objects. Pickling/unpickling is performed transparently, and you don't have to concern yourself with hashing, collisions, file I/O, etc.
For the shelve dictionary keys, you would just use repr(obj) and let shelve deal with stashing your objects for you. A simple example:
import shelve
cache = shelve.open('cache')
t = (1, 2, 3)
i = 10
cache[repr(t)] = t
cache[repr(i)] = i
print(dict(cache))
# {'(1, 2, 3)': (1, 2, 3), '10': 10}  (key order may vary)
cache.close()
# Reopen the shelf to show that the data persisted.
cache = shelve.open('cache')
print(dict(cache))
# {'(1, 2, 3)': (1, 2, 3), '10': 10}  (key order may vary)
print(cache[repr(10)])
# 10
cache.close()

Short hashes mean you may get the same hash for two different files. The same may happen with big hashes too, but it's far rarer.
Maybe these file names should vary based on other references, like microtime (unless these files may be created too quickly).
