Efficient data interchange format using only C-style fprintf() statements? - python

I need to transfer a very large dataset (1-10 million records, possibly many more) from a domain-specific language (whose sole output mechanism is a C-style fprintf statement) to Python.
Currently, I'm using the DSL's fprintf to write records to a flat file. The flat file looks like this:
x['a',1,2]=1.23456789012345e-01
x['a',1,3]=1.23456789012345e-01
x['a',1,4]=1.23456789012345e-01
y1=1.23456789012345e-01
y2=1.23456789012345e-01
z['a',1,2]=1.23456789012345e-01
z['a',1,3]=1.23456789012345e-01
z['a',1,4]=1.23456789012345e-01
As you can see the structure of each record is very simple (but the representation of the double-precision float as a 20-char string is grossly inefficient!):
<variable-length string> + "=" + <double-precision float>
I'm currently using Python to read each line and split it on the "=".
Is there anything I can do to make the representation more compact, so as to make it faster for Python to read? Is some sort of binary-encoding possible with fprintf?
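For reference, the reading step described above can be sketched roughly as follows. The function name parse_records and the choice of rpartition are illustrative assumptions, not the asker's actual code.

```python
def parse_records(lines):
    """Yield (key, value) pairs from lines shaped like  x['a',1,2]=1.23e-01."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # rpartition splits on the last "=", so a key containing "=" stays whole.
        key, sep, text = line.rpartition("=")
        if not sep:
            raise ValueError("no '=' in line: %r" % line)
        yield key, float(text)
```

Passing an open file object as lines processes the file one record at a time without loading it all into memory.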

Err....
How many times per minute are you reading this data from Python?
Because on my system I could read such a file with 20 million records (~400MB) in well under a second.
Unless you are doing this on limited hardware, I'd say you are worrying too much about nothing.
>>> timeit("all(b.read(20) for x in xrange(0, 20000000,20) ) ", "b=open('data.dat')", number=1)
0.2856929302215576
>>> c = open("data.dat").read()
>>> len(c)
380000172

A compact binary format for serializing float values is defined in the Basic Encoding Rules (BER), where they are called "reals". BER implementations are available for Python, and one is not too hard to write yourself; there are libraries for C as well. You could use this format (that's what it was designed for) or a variant (CER, DER). One such Python implementation is pyasn1.
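To give a feel for the size difference a binary encoding makes, here is a minimal sketch using Python's struct module. This is plain IEEE 754 packing, not BER, and whether the DSL's fprintf can emit such raw bytes is an open assumption.

```python
import struct

text = "1.23456789012345e-01"        # the 20-character text form from the question
value = float(text)

packed = struct.pack("<d", value)    # little-endian IEEE 754 double, 8 bytes
(restored,) = struct.unpack("<d", packed)

print(len(text), len(packed))        # size of text form vs. binary form
print(restored == value)             # packing a double is lossless
```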

Related

Compression of short strings

I am trying to compress short strings (max 15 characters).
The goal is to implement the "Normalized Compression Distance" [1]. I tried a few compression algorithms in Python (I also looked to see if I could do it in Julia, but the packages all refuse to install).
I always obtain in the end a bit-string longer than the original string I am trying to compress which totally defeats the purpose.
An example with zlib :
import zlib
data = b"this is a test"
compressed_data = zlib.compress(data, 9)
print(len(data))
print(len(compressed_data))
Which returns :
13
21
Do you know what I am doing wrong, or how I could do this more efficiently?
[1] : https://arxiv.org/pdf/cs/0312044.pdf
Check out these libraries for compressing short strings:
https://github.com/siara-cc/unishox :
Unishox is a hybrid encoder (entropy, dictionary and delta coding). It works by assigning fixed prefix-free codes for each letter of the 95 letter printable Character Set (entropy coding). It encodes repeating letter sets separately (dictionary coding). For Unicode characters (UTF-8), delta coding is used. It also has special handling for repeating upper case and num pad characters.
Unishox was developed to save memory in embedded devices and compress strings stored in databases. It is used in many projects and has an extension for Sqlite database. Although it is slower than other available libraries, it works well for the given applications.
https://github.com/antirez/smaz :
Smaz was developed by Salvatore Sanfilippo. It compresses strings by replacing parts of them using a codebook. As far as I know, this was the first library available for compressing short strings.
https://github.com/Ed-von-Schleck/shoco :
shoco was written by Christian Schramm. It is an entropy encoder, because the length of the representation of a character is determined by the probability of encountering it in a given input string.
It has a default model for English language and a provision to train new models based on given sample text.
PS: Unishox was developed by me and its working principle is explained in this article:
According to your reference the extra overhead added by Zlib may not matter.
That article defines the NCD as (C(x*y) − min(C(x),C(y))) / max(C(x),C(y)), where using your zlib compression for C:
C(x) = length(zlib.compress(x, 9))
NCD(x,y) = (C(x*y) − min(C(x),C(y))) / max(C(x),C(y))
As long as zlib only adds a constant overhead, the numerator of the NCD
should not change, and the denominator should only change by a small amount.
You could add a correction factor like this:
C(x) = length(zlib.compress(x, 9)) - length(zlib.compress("a", 9)) + 1
which might eliminate the remaining issues with the denominator of NCD.
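A minimal sketch of this calculation, assuming x*y in the formula means the concatenation of the two byte strings. The function names and the corrected flag are made up for illustration.

```python
import zlib

def c(data, baseline=0):
    """Compressed length of data, minus an optional constant overhead."""
    return len(zlib.compress(data, 9)) - baseline

def ncd(x, y, corrected=False):
    # Correction from the text: subtract the size of a compressed
    # one-byte input, then add 1 back for that byte itself.
    baseline = len(zlib.compress(b"a", 9)) - 1 if corrected else 0
    cx, cy = c(x, baseline), c(y, baseline)
    cxy = c(x + y, baseline)  # x*y read as concatenation (assumption)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"this is a test", b"this is a test"))
print(ncd(b"this is a test", b"completely other"))
```

Identical inputs should give a value near 0 and unrelated inputs a value closer to 1, even though each individual compressed string is longer than its input.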
The DEFLATE algorithm uses a 32 KB sliding window as its compression dictionary to deduplicate your data. By default it builds this dictionary from the data you provide.
With short strings it cannot build a useful dictionary, so it cannot compress efficiently, and the metadata overhead ends up increasing the size of the result.
One solution would be to use a preset dictionary with samples of recurring patterns.
This question handles the same issue: Reusing compression dictionary
You can use my dicflate utility to experiment with DEFLATE compression on short and long strings with and without preset dictionaries: dicflate
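In Python, a preset dictionary can be tried with the zdict parameter of zlib.compressobj and zlib.decompressobj (available since Python 3.3). The sketch below is only an illustration; the helper names and the sample dictionary are assumptions.

```python
import zlib

def compress_with_dict(data, zdict):
    """Compress data using a preset dictionary of expected byte patterns."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()

def decompress_with_dict(blob, zdict):
    """Decompress; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical dictionary built from text the inputs are expected to resemble.
SAMPLE_DICT = b"this is a test of short strings"

data = b"this is a test"
blob = compress_with_dict(data, SAMPLE_DICT)
print(len(data), len(zlib.compress(data, 9)), len(blob))
```

The dictionary is not stored in the output, so both sides have to agree on it ahead of time.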

Passing a sequence of bits to a file python

As a part of a bigger project, I want to save a sequence of bits in a file so that the file is as small as possible. I'm not talking about compression, I want to save the sequence as it is but using the least amount of characters. The initial idea was to turn mini-sequences of 8 bits into chars using ASCII encoding and saving those chars, but due to some unknown problem with strange characters, the characters retrieved when reading the file are not the same that were originally written. I've tried opening the file with utf-8 encoding, latin-1 but none seems to work. I'm wondering if there's any other way, maybe by turning the sequence into a hexadecimal number?
Technically you cannot write less than a byte, because the OS organizes memory and files in bytes (see "write individual bits to a file in python"). So this is binary file I/O; see https://docs.python.org/2/library/io.html. There are modules such as struct that help.
Open the file with the 'b' flag, which indicates a binary read/write operation, then use e.g. the to_bytes() method (see "Writing bits to a binary file") or struct.pack() (see "How to write individual bits to a text file in python?").
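>>> import struct
>>> struct.pack("h", 824)  # native 16-bit short; little-endian on this machine
b'8\x03'
>>> bits = "10111111111111111011110"
>>> int(bits[::-1], 2).to_bytes(4, 'little')
b'\xfd\xff=\x00'
>>> with open('somefile.bin', 'wb') as f:
...     f.write(int(bits[::-1], 2).to_bytes(4, 'little'))
...
4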
If you want to get around the 8-bit (byte) structure of memory, you can use bit manipulation and techniques like bitmasks and BitArrays;
see https://wiki.python.org/moin/BitManipulation and https://wiki.python.org/moin/BitArrays
However, the problem is, as you said, reading the data back if you use bit fields of differing lengths. For example, to store a decimal 7 you need 3 bits (0b111), and to store a decimal 2 you need 2 bits (0b10).
How can your program know whether to read a value back as a 3-bit value or as a 2-bit value? In unorganized memory the sequence 7, 2 looks like 11110, which is meant as 111|10, so how can your program know where the | is?
In normal byte-ordered memory the sequence 7, 2 is 0000011100000010 -> 00000111|00000010, which has the advantage that it is clear where the | is.
This is why memory at its lowest level is organized in fixed clusters of 8 bits = 1 byte. If you want to access single bits inside a byte, you can use bitmasks in combination with logic operators (http://www.learncpp.com/cpp-tutorial/3-8a-bit-flags-and-bit-masks/). In Python, the bitwise operators on int (&, |, ^, <<, >>) work for single-bit manipulation; the ctypes module is another option.
If you know that your values are all 6 bits wide it may be worth the effort, but this is also tough...
(How do you set, clear, and toggle a single bit?)
(Why can't you do bitwise operations on pointer in C, and is there a way around this?)
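One way to sidestep the "where does the sequence end" problem is to store the bit count in a small header and pad the payload to whole bytes. The sketch below assumes the bits arrive as a string of '0' and '1' characters; the function names and the 4-byte header layout are illustrative choices, not an established format.

```python
import struct

def write_bits(path, bits):
    """Write a '0'/'1' string as packed bytes, prefixed by its bit count."""
    padded = bits + "0" * (-len(bits) % 8)  # pad to a multiple of 8 bits
    payload = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(bits)))  # 4-byte header: number of bits
        f.write(payload)

def read_bits(path):
    """Read back the exact bit string written by write_bits."""
    with open(path, "rb") as f:
        (n_bits,) = struct.unpack("<I", f.read(4))
        payload = f.read()
    bits = "".join(format(byte, "08b") for byte in payload)
    return bits[:n_bits]  # drop the padding bits
```

A 23-bit sequence then costs 3 payload bytes plus the 4-byte header, and opening the file in binary mode avoids the text-encoding problems described in the question.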

Alternative ways for binary conversion in python

I often need to convert a status code to its bit representation in order to determine which errors/statuses are active on analyzers using a plain-text or binary communication protocol.
I use Python to poll data and to parse it. Sometimes I get confused because there are so many ways to solve a problem. Today I had to convert a string where each character is a hexadecimal digit to its binary representation. That is, each hexadecimal character must be converted into 4 bits, with the MSB on the left. Note: I need a char-by-char conversion, with leading zeros.
I managed to build the following function, which does the trick in a quasi-one-liner fashion.
import math

def convertStatus(s, base=16):
    n = int(math.log2(base))  # bits per digit, e.g. 4 for base 16
    b = "".join(["{{:0>{}b}}".format(n).format(int(x, base)) for x in s])
    return b
E.g., this converts the following input:
0123456789abcdef
into:
0000000100100011010001010110011110001001101010111100110111101111
Which was my goal.
Now I am wondering what other elegant solutions I could have used to reach my goal. I would also like to better understand the advantages and drawbacks of each solution. The function signature can be changed, but usually it is a string for input and output. Let's get imaginative...
This is simple in two steps.
Converting a string to an int is almost trivial: use int(aString, base=...)
The first parameter can be a string!
And with base, almost every option is possible.
Converting a number to a string is easy with format() and the format specification mini-language.
So converting hex strings to binary can be done as
def h2b(x):
    val = int(x, base=16)    # step 1: parse the hex string
    return format(val, 'b')  # step 2: render as binary (leading zeros are dropped)
Here the two steps are explicit. It may be better to do it in one line, or even inline.
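Since the question asks for leading zeros to be kept, here are two hedged variants: one pads the whole-string conversion to 4 bits per hex digit, the other uses a per-character lookup table. The names hex_to_bits, NIBBLES and hex_to_bits_table are made up for this sketch.

```python
def hex_to_bits(s):
    """Whole-string conversion, zero-padded to 4 bits per hex digit."""
    if not s:
        return ""
    return format(int(s, 16), "0{}b".format(4 * len(s)))

# Lookup table from each hex digit to its 4-bit string.
NIBBLES = {format(i, "x"): format(i, "04b") for i in range(16)}

def hex_to_bits_table(s):
    """Character-by-character conversion using the lookup table."""
    return "".join(NIBBLES[ch] for ch in s.lower())

print(hex_to_bits("0123456789abcdef"))
```

The table version fails with a KeyError on any non-hex character, which can be a useful check when parsing status codes.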

Fast hash for strings

I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.
I'm looking for an algorithm that calculates a hash of such strings, where the hash is also a string but has a fixed length, like YouTube video IDs:
https://www.youtube.com/watch?v=-F-3E8pyjFo
                                ^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have a short hash strings.
Is there a shell command or python library which can do that?
As of Python 3 this method does not work:
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned:
>>> import ctypes
>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16-character hex string (8 bytes):
>>> hashu("dfds").to_bytes(8,"big").hex()
Or an N*2-character hex string, where N is <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so vastly faster, it's never worth using hashlib for anything unless you need security... not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the urlsafe base64 encoding to make a much better string than "hex" gives you:
>>> from base64 import urlsafe_b64encode
>>> hashnbigu=lambda word, N : urlsafe_b64encode(((hashu(word)+2**64*hash(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is randomized per process and won't work for some use cases. You can disable this with PYTHONHASHSEED=0.
See https://github.com/flier/pyfasthash for fast, stable hashes that similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And stuffing things like 2**32 in your code, instead of making them constants, is bad form.
In the end, 8 bytes of collision resistance is OK for smaller applications: with fewer than a million entries, the birthday bound puts the odds of any collision below about 0.000003%. That's a 12-character base64-encoded string. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
Speed comparison for producing 300k 16 byte hashes from a bytes-input.
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (a tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes. This adds a lot of conversion overhead, making the builtin shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
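If the IDs have to stay the same across runs, the randomization caveat above rules out the built-in hash(). One stable alternative, swapped in here purely as an illustration and not the method benchmarked above, is hashlib.blake2b with a short digest_size. The function name short_id is an assumption.

```python
import hashlib
from base64 import urlsafe_b64encode

def short_id(text, n_bytes=8):
    """Stable short ID: an n_bytes BLAKE2b digest, URL-safe base64, no padding."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=n_bytes).digest()
    return urlsafe_b64encode(digest).decode("ascii").rstrip("=")

print(short_id("/var/log/some/long/file/path.txt"))      # 11 characters
print(short_id("/var/log/some/long/file/path.txt", 16))  # 22 characters
```

An 8-byte digest gives 11 characters, the same length as the YouTube-style ID in the question.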
I guess this question is off-topic because it is opinion-based, but here is at least one hint. I know the FNV hash because it is used by The Sims 3 to find resources based on their names across the different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement if no module satisfies you (pyfasthash has an implementation of it, for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
Edit: I had finally the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
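A minimal pure-Python sketch of that idea, using the 64-bit FNV-1a variant with the standard offset basis and prime. This is not the code from the linked gist; the choice of FNV-1a (rather than FNV-1) and the function names are assumptions made for illustration.

```python
from base64 import urlsafe_b64encode

FNV64_OFFSET_BASIS = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a_64(data):
    """64-bit FNV-1a hash of a bytes object."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte                        # mix in the next byte
        h = (h * FNV64_PRIME) & MASK64   # multiply, keep the low 64 bits
    return h

def fnv1a_64_b64(text):
    """Hash a string and return it as an unpadded URL-safe base64 string."""
    raw = fnv1a_64(text.encode("utf-8")).to_bytes(8, "big")
    return urlsafe_b64encode(raw).decode("ascii").rstrip("=")
```

The result is always 11 characters long, and unlike the built-in hash() it is the same in every process.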
Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversible.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.
You could use the sum program (assuming you're on linux) but keep in mind that the shorter the hash the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions
Something to keep in mind is that hash codes are one way functions - you cannot use them for "video ids" as you cannot go back from the hash to the original path. Quite apart from anything else hash collisions are quite likely and you end up with two hashes both pointing to the same video instead of different ones.
To create an Id like the youtube one the easiest way is to create a unique id however you normally do that (for example an auto key column in a database) and then map that to a unique string in a reversible way.
For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
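The base 62 mapping can be sketched as below. The alphabet order (0-9, a-z, A-Z) and the function names are illustrative assumptions; any fixed alphabet works as long as encoding and decoding agree on it.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_id(n, min_length=1):
    """Map a non-negative integer to a base 62 string, left-padded with '0'."""
    if n < 0:
        raise ValueError("id must be non-negative")
    digits = []
    while True:
        n, rem = divmod(n, len(ALPHABET))
        digits.append(ALPHABET[rem])
        if n == 0:
            break
    return "".join(reversed(digits)).rjust(min_length, ALPHABET[0])

def decode_id(s):
    """Inverse of encode_id; padding characters decode as leading zeros."""
    n = 0
    for ch in s:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n
```

Because the mapping is reversible, the original integer key can be recovered from the string without a lookup table.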

A RAM error of big array

I need to read the numbers of randomly chosen lines, put each line in another array, and then get the numbers of one column.
I have a big file, more than 400 MB. It contains 13496*13496 numbers, i.e. 13496 rows and 13496 columns. I want to read them into an array.
This is my code:
_L1 = [[0 for col in range(13496)] for row in range(13496)]
_L1file = open('distanceCMD.function.txt')
i = 0
while (i<13496):
    print "i="+str(i)
    _strlf = _L1file.readline()
    _strlf = _strlf.split('\t')
    _strlf = _strlf[:-1]
    _L1[i] = _strlf
    i += 1
_L1file.close()
And this is my error message:
MemoryError:
File "D:\research\space-function\ART3.py", line 30, in <module>
_strlf = _strlf.split('\t')
You might want to approach your problem in another way: process the file line by line. I don't see a need to store the whole big file in an array. Otherwise, you might want to tell us what you are actually trying to do.
for line in open("400MB_file"):
    # do something with line.
Or
f=open("file")
for linenum,line in enumerate(f):
    if linenum+1 in [2,3,10]:
        print "there are ", len(line.split())," columns" #assuming you want to split on spaces
        print "100th column value is: ", line.split()[99]
    if linenum+1>10:
        break # break if you want to stop after the 10th line
f.close()
This is a simple case of your program demanding more memory than is available to the computer. An array of 13496x13496 elements requires 182,142,016 'cells', where a cell is a minimum of one byte (if storing chars) and potentially several bytes (if storing floating-point numerics, for example). I'm not even taking your particular runtime's array metadata into account, though this would typically be a tiny overhead on a simple array.
Assuming each array element is just a single byte, your computer needs around 180 MB of RAM to hold it in memory in its entirety. Trying to process it could be impractical.
You need to think about the problem a different way; as has already been mentioned, a line-by-line approach might be a better option. Or perhaps processing the grid in smaller units, perhaps 10x10 or 100x100, and aggregating the results. Or maybe the problem itself can be expressed in a different form, which avoids the need to process the entire dataset altogether...?
If you give us a little more detail on the nature of the data and the objective, perhaps someone will have an idea to make the task more manageable.
Short answer: the Python object overhead is killing you. In Python 2.x on a 64-bit machine, a list of strings consumes 48 bytes per list entry even before accounting for the content of the strings. That's over 8.7 GB of overhead for the size of array you describe.
On a 32-bit machine it'll be a bit better: only 28 bytes per list entry.
Longer explanation: you should be aware that Python objects themselves can be quite large, even simple objects like ints, floats and strings. In your code you're ending up with a list of lists of strings. On my (64-bit) machine, even an empty string object takes up 40 bytes, and to that you need to add 8 bytes for the list pointer that's pointing to this string object in memory. So that's already 48 bytes per entry, or around 8.7 GB. Given that Python allocates memory in multiples of 8 bytes at a time, and that your strings are almost certainly non-empty, you're actually looking at 56 or 64 bytes (I don't know how long your strings are) per entry.
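The arithmetic behind these figures can be checked with a few lines. The per-entry sizes are the 64-bit Python 2 values quoted above; other interpreters will differ, so treat the numbers as an estimate rather than a measurement.

```python
N = 13496
cells = N * N                      # number of matrix entries

EMPTY_STR_BYTES = 40               # empty str object, as quoted in the text
POINTER_BYTES = 8                  # list slot pointing at that object
list_overhead = cells * (EMPTY_STR_BYTES + POINTER_BYTES)

packed_doubles = cells * 8         # same entries as raw 8-byte floats

print(cells)                       # 182142016
print(list_overhead / 1e9)         # about 8.7 GB as a list of lists of strings
print(packed_doubles / 1e9)        # about 1.5 GB as packed doubles
```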
Possible solutions:
(1) You might do (a little) better by converting your entries from strings to ints or floats as appropriate.
(2) You'd do much better by either using Python's array type (not the same as list!) or by using numpy: then your ints or floats would only take 4 or 8 bytes each.
Since Python 2.6, you can get basic information about object sizes with the sys.getsizeof function. Note that if you apply it to a list (or other container), the returned size doesn't include the size of the contained objects, only that of the structure used to hold them. Here are some values on my machine.
>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof(5.0)
24
>>> sys.getsizeof(5)
24
>>> sys.getsizeof([])
72
>>> sys.getsizeof(range(10)) # 72 + 8 bytes for each pointer
152
MemoryError exception:
Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.
It seems that, at least in your case, reading the entire file into memory is not a doable option.
Replace this:
_strlf = _strlf[:-1]
with this:
_strlf = [float(val) for val in _strlf[:-1]]
You are making a big array of strings. I can guarantee that the string "123.00123214213" takes a lot less memory when you convert it to floating point.
You might want to include some handling for null values.
You can also go to numpy's array type, but your problem may be too small to bother.
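A hedged sketch that combines the suggestions above without requiring numpy: the standard-library array module stores the values as packed 8-byte doubles in one flat, row-major buffer. The function names and the trailing-tab handling are assumptions based on the file layout described in the question.

```python
from array import array

def load_matrix(path, n_cols):
    """Read a tab-separated numeric file into a flat array of doubles."""
    data = array("d")  # 8 bytes per value, no per-element Python object
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields and fields[-1] == "":
                fields = fields[:-1]  # drop the empty field after a trailing tab
            row = [float(v) for v in fields]
            if len(row) != n_cols:
                raise ValueError("expected %d columns, got %d" % (n_cols, len(row)))
            data.extend(row)
    return data

def cell(data, n_cols, row, col):
    """Look up one entry in the flat row-major array."""
    return data[row * n_cols + col]
```

A whole column can be read with slicing, for example data[col::n_cols].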
