Loading a large dictionary using python pickle

I have a full inverted index in the form of a nested Python dictionary. Its structure is:
{word : { doc_name : [location_list] } }
For example, let the dictionary be called index; then for the word "spam", the entry would look like:
{ spam : { doc1.txt : [102,300,399], doc5.txt : [200,587] } }
I used this structure because Python dicts are pretty optimised and it makes programming easier.
For any word 'spam', the documents containing it can be given by:
index['spam'].keys()
and the posting list for a document doc1 by:
index['spam']['doc1']
At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load - 112 seconds (approx.; I timed it using time.time()) - and memory usage goes to 1.2 GB (Gnome system monitor). Once it loads, it's fine. I have 4 GB of RAM.
len(index.keys()) gives 229758
Code
import cPickle as pickle
f = open('full_index','rb')
print 'Loading index... please wait...'
index = pickle.load(f) # This takes ages
print 'Index loaded. You may now proceed to search'
How can I make it load faster? I only need to load it once, when the application starts. After that, the access time is important to respond to queries.
Should I switch to a database like SQLite and create an index on its keys? If yes, how do I store the values to get an equivalent schema that makes retrieval easy? Is there anything else that I should look into?
Addendum
Using Tim's answer, pickle.dump(index, file, -1), the pickled file is considerably smaller - around 237 MB (it took 300 seconds to dump) - and it takes half the time to load now (61 seconds, as opposed to 112 seconds earlier, timed with time.time()).
But should I migrate to a database for scalability ?
As for now I am marking Tim's answer as accepted.
PS: I don't want to use Lucene or Xapian...
This question refers to Storing an inverted index. I had to ask a new question because I wasn't able to delete the previous one.

Try the protocol argument when using cPickle.dump/cPickle.dumps. From cPickle.Pickler.__doc__:
Pickler(file, protocol=0) -- Create a pickler.
This takes a file-like object for writing a pickle data stream.
The optional proto argument tells the pickler to use the given
protocol; supported protocols are 0, 1, 2. The default
protocol is 0, to be backwards compatible. (Protocol 0 is the
only protocol that can be written to a file opened in text
mode and read back successfully. When using a protocol higher
than 0, make sure the file is opened in binary mode, both when
pickling and unpickling.)
Protocol 1 is more efficient than protocol 0; protocol 2 is
more efficient than protocol 1.
Specifying a negative protocol version selects the highest
protocol version supported. The higher the protocol used, the
more recent the version of Python needed to read the pickle
produced.
The file parameter must have a write() method that accepts a single
string argument. It can thus be an open file object, a StringIO
object, or any other custom object that meets this interface.
Converting to JSON or YAML will probably take longer than pickling most of the time, since pickle stores native Python types.
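For the question's index, a minimal sketch of how that looks (assuming the same full_index file name; -1 selects the highest protocol available):
import cPickle as pickle

# Dump with the highest available protocol; the file must be opened in binary mode.
with open('full_index', 'wb') as f:
    pickle.dump(index, f, -1)

# Load it back, also in binary mode, since a protocol higher than 0 was used.
with open('full_index', 'rb') as f:
    index = pickle.load(f)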

Do you really need it to load all at once? If you don't need all of it in memory, but only the select parts you want at any given time, you may want to map your dictionary to a set of files on disk instead of a single file… or map the dict to a database table. So, if you are looking for something that saves large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), then you might want to look at klepto.
klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, and have each entry be a file. klepto also offers caching algorithms, so if you are using a filesystem backend for the dictionary you can avoid some speed penalty by utilizing memory caching.
>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True)
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem
>>> demo.dump()
>>> del demo
>>>
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>>
klepto also has other flags such as compression and memmode that can be used to customize how your data is stored (e.g. compression level, memory map mode, etc).
It's equally easy (the same exact interface) to use a (MySQL, etc) database as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.
klepto provides access to customizing your encoding, by building a custom keymap.
>>> from klepto.keymaps import *
>>>
>>> s = stringmap(encoding='hex_codec')
>>> x = [1,2,'3',min]
>>> s(x)
'285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c29'
>>> p = picklemap(serializer='dill')
>>> p(x)
'\x80\x02]q\x00(K\x01K\x02U\x013q\x01c__builtin__\nmin\nq\x02e\x85q\x03.'
>>> sp = s+p
>>> sp(x)
'\x80\x02UT28285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c292c29q\x00.'
klepto also provides a lot of caching algorithms (like mru, lru, lfu, etc), to help you manage your in-memory cache, and will use the algorithm to do the dump and load to the archive backend for you.
You can use the flag cached=False to turn off memory caching completely, and directly read and write to and from disk or database. If your entries are large enough, you might choose to write to disk, putting each entry in its own file. Here's an example that does both.
>>> from klepto.archives import dir_archive
>>> # does not hold entries in memory, each entry will be stored on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a'] = 10
>>> demo['b'] = 20
>>> demo['c'] = min
>>> demo['d'] = [1,2,3]
However, while this should greatly reduce load time, it might slow overall execution down a bit… it's usually better to specify the maximum amount to hold in the memory cache and pick a good caching algorithm. You have to play with it to get the right balance for your needs.
Get klepto here: https://github.com/uqfoundation

A common pattern in Python 2.x is to have one version of a module implemented in pure Python, with an optional accelerated version implemented as a C extension; for example, pickle and cPickle. This places the burden of importing the accelerated version and falling back on the pure Python version on each user of these modules. In Python 3.0, the accelerated versions are considered implementation details of the pure Python versions. Users should always import the standard version, which attempts to import the accelerated version and falls back to the pure Python version. The pickle / cPickle pair received this treatment.
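The import pattern described above looks roughly like this (a sketch; on Python 3 you simply import pickle and get the accelerated version automatically):
try:
    import cPickle as pickle  # accelerated C version (Python 2)
except ImportError:
    import pickle             # pure-Python fallback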
Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.
Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
If your dictionary is huge and should only be compatible with Python 3.4 or higher, use:
pickle.dump(obj, file, protocol=4)
pickle.load(file, encoding="bytes")
or:
Pickler(file, 4).dump(obj)
Unpickler(file).load()
That said, in 2010 the json module was 25 times faster at encoding and 15 times faster at decoding simple types than pickle. My 2014 benchmark says marshal > pickle > json, but marshal is coupled to specific Python versions.
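A rough way to rerun such a comparison yourself (a sketch, not the original benchmark; the toy index below is made up):
import json
import marshal
import pickle
import timeit

index = {'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}

candidates = [
    ('pickle', lambda o: pickle.dumps(o, pickle.HIGHEST_PROTOCOL), pickle.loads),
    ('json', lambda o: json.dumps(o).encode('utf-8'), lambda b: json.loads(b.decode('utf-8'))),
    ('marshal', marshal.dumps, marshal.loads),
]

for name, dump, load in candidates:
    blob = dump(index)                          # serialize once
    t = timeit.timeit(lambda: load(blob), number=100000)
    print('%-8s %5d bytes, %.3f s per 100k loads' % (name, len(blob), t))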

Have you tried using an alternative storage format such as YAML or JSON? Python has supported JSON natively since Python 2.6 via the json module, I think, and there are third-party modules for YAML.
You may also try the shelve module.
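For the JSON suggestion, a sketch of a round trip with the question's index structure (the file name is assumed):
import json

index = {'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}

# Write the index out as JSON...
with open('full_index.json', 'w') as f:
    json.dump(index, f)

# ...and read it back at startup.
with open('full_index.json') as f:
    index = json.load(f)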

Depending on how long 'long' is, you have to think about the trade-offs you have to make: either have all data ready in memory after a (long) startup, or load only partial data (then you need to split up the data into multiple files or use SQLite or something like it). I doubt that loading all data upfront from e.g. SQLite into a dictionary will bring any improvement.
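If you do try SQLite, one possible layout (a sketch, not the only schema) is a single postings table indexed on the word column, so a query only reads the rows for that word from disk:
import json
import sqlite3

conn = sqlite3.connect('full_index.db')
conn.execute('CREATE TABLE IF NOT EXISTS postings (word TEXT, doc TEXT, locations TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_word ON postings (word)')

# One row per (word, document); the location list is stored as a JSON string.
conn.execute('INSERT INTO postings VALUES (?, ?, ?)',
             ('spam', 'doc1.txt', json.dumps([102, 300, 399])))
conn.commit()

# Equivalent of index['spam'], loaded on demand.
rows = conn.execute('SELECT doc, locations FROM postings WHERE word = ?', ('spam',))
spam_postings = dict((doc, json.loads(locs)) for doc, locs in rows)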

Related

Prefer BytesIO or bytes for internal interface in Python?

I'm trying to decide on the best internal interface to use in my code, specifically around how to handle file contents. Really, the file contents are just binary data, so bytes is sufficient to represent them.
I'm storing files in different remote locations, so have a couple of different classes for reading and writing. I'm trying to figure out the best interface to use for my functions. Originally I was using file paths, but that was suboptimal because it meant that disk was always used (which meant lots of clumsy tempfiles).
There are several areas of the code that have the same requirement, and would directly use whatever was returned from this interface. As a result whatever abstraction I choose will touch a fair bit of code.
What are the various tradeoffs to using BytesIO vs bytes?
def put_file(location, contents_as_bytes):
def put_file(location, contents_as_fp):
def get_file_contents(location):
def get_file_contents(location, fp):
Playing around, I've found that using the file-like interfaces (BytesIO, etc.) requires a bit of administrative overhead in terms of seek(0) etc. That raises questions like:
is it better to seek before you start, or after you've finished?
do you seek to the start or just operate from the position the file is in?
should you tell() to maintain the position?
looking at something like shutil.copyfileobj it doesn't do any seeking
One advantage I've found with using file-like interfaces instead is that they allow passing in the fp to write into when you're retrieving data, which seems to give a good deal of flexibility.
def get_file_contents(location, write_into=None):
    if not write_into:
        write_into = io.BytesIO()
    # get the contents and put it into write_into
    return write_into
get_file_contents('blah', file_on_disk)
get_file_contents('blah', gzip_file)
get_file_contents('blah', temp_file)
get_file_contents('blah', bytes_io)
new_bytes_io = get_file_contents('blah')
# etc
Is there a good reason to prefer BytesIO over just using fixed bytes when designing an interface in python?
The benefit of io.BytesIO objects is that they implement a common-ish interface (commonly known as a 'file-like' object). BytesIO objects have an internal pointer (whose position is returned by tell()), and for every call to read(n) the pointer advances n bytes. For example:
import io
buf = io.BytesIO(b'Hello world!')
buf.read(1) # Returns b'H'
buf.tell() # Returns 1
buf.read(1) # Returns b'e'
buf.tell() # Returns 2
# Set the pointer to 0.
buf.seek(0)
buf.read(1) # Returns b'H' again, like the first call.
In your use case, both the bytes object and the io.BytesIO object may not be the best solutions. They will read the complete contents of your files into memory.
Instead, you could look at tempfile.TemporaryFile (https://docs.python.org/3/library/tempfile.html).
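A rough sketch of that approach; a TemporaryFile offers the same file-like interface but is backed by disk rather than memory:
import tempfile

# An anonymous temporary file, removed automatically when closed.
with tempfile.TemporaryFile() as tmp:
    tmp.write(b'Hello world!')
    tmp.seek(0)          # rewind before reading, just as with BytesIO
    data = tmp.read(5)   # b'Hello'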

Shelve dictionary size is >100Gb for a 2Gb text file

I am creating a shelve file of sequences from a genomic FASTA file:
# Import necessary libraries
import shelve
from Bio import SeqIO
# Create dictionary of genomic sequences
genome = {}
with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
for record in SeqIO.parse(handle, "fasta"):
genome[str(record.id)] = str(record.seq)
# Shelve genome sequences
myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
myShelve.update(genome)
myShelve.close()
The file itself is 2.6 GB; however, when I try to shelve it, a file of >100 GB is produced, and my computer throws out a number of complaints about being out of memory and the startup disk being full. This only seems to happen under OS X Yosemite; on Ubuntu it works as expected. Any suggestions why this is not working? I'm using Python 3.4.2.
Verify which interface is used for dbm with import dbm; print(dbm.whichdb('your_file.db')). The file format used by shelve depends on the best installed binary package available on your system and its interfaces. The newest is gdbm, dumb is a fallback solution if no binary is found, and ndbm is something in between.
https://docs.python.org/3/library/shelve.html
https://docs.python.org/3/library/dbm.html
It is not favourable to have all the data in memory if it leaves no memory for the filesystem cache. Updating in smaller blocks is better. I don't even see a slowdown if items are updated one by one.
myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
    for i, record in enumerate(SeqIO.parse(handle, "fasta")):
        myShelve.update([(str(record.id), str(record.seq))])
myShelve.close()
It is known that dbm databases become fragmented if the app crashes after updates without closing the database. I think this was your case. You probably have no important data in the big file yet, but in the future you can defragment a database with gdbm.reorganize().
I had the very same problem: on a macOS system, a shelve with about 4 megabytes of data grew to an enormous 29 gigabytes on disk! This obviously happened because I updated the same key-value pairs in the shelve over and over again.
As my shelve was based on GNU dbm I was able to use his hint about reorganizing. Here is the code that brought my shelve file back to normal size within seconds:
import dbm
db = dbm.open(shelfFileName, 'w')
db.reorganize()
db.close()
I am not sure whether this technique will work for other (non-GNU) dbms as well. To test your dbm system, remember the code shown by @hynekcer:
import dbm
print( dbm.whichdb(shelfFileName) )
If GNU dbm is used by your system this should output 'dbm.gnu' (which is the new name for the older gdbm).

Flat file key-value store in python

I'm looking for a flat-file, portable key-value store in Python. I'll be using strings for keys and either strings or lists for values. I looked at ZODB, but I'd like something which is more widely used and more actively developed. Do any of the dbm modules in Python require system libraries or a database server (like MySQL), or can I write to a file with any of them?
If a dbm does not support Python lists, I imagine that I can just serialize them?
You may want to consider h5py which is a Python interface to HDF5.
In [1]: import h5py
In [2]: f = h5py.File('test.hdf5', 'w')
In [3]: f['abc'] = [1, 2, 3]
In [4]: f['d'] = 'hello'
In [5]: f.close()
In [6]: f2 = h5py.File('test.hdf5', 'r')
In [7]: f2['abc'].value
Out[7]: array([1, 2, 3])
In [8]: list(f2['abc'])
Out[8]: [1, 2, 3]
In [10]: f2['d'].value
Out[10]: 'hello'
SQLite support (sqlite3) is included in the standard library, but for the sake of simplicity you can use shelve:
http://docs.python.org/library/shelve.html
edit:
I haven't tested this, but dbm might be a solution for you. It has been a key-value database on UNIX since 1979.
http://docs.python.org/library/anydbm.html#module-anydbm — and in case you need serialisation, you can use pickle.
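A minimal sketch of that combination on Python 2 (the file name here is made up):
import anydbm
import cPickle as pickle

db = anydbm.open('kvstore.db', 'c')                      # 'c' creates the file if needed
db['colours'] = pickle.dumps(['red', 'green', 'blue'])   # lists must be serialized to strings

colours = pickle.loads(db['colours'])                    # deserialize on the way back out
db.close()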
Sorry to ask an obvious question, but wouldn't a JSON file serve you just fine?
You can look at the shelve module. It uses pickle under the hood, and allows you to create a key-value look up that persists between launches.
Additionally, the json module with its dump and load methods would probably work well too.

MD5 and SHA-2 collisions in Python

I'm writing a simple MP3 cataloguer to keep track of which MP3's are on my various devices. I was planning on using MD5 or SHA2 keys to identify matching files even if they have been renamed/moved, etc. I'm not trying to match MP3's that are logically equivalent (i.e.: same song but encoded differently). I have about 8000 MP3's. Only about 6700 of them generated unique keys.
My problem is that I'm running into collisions regardless of the hashing algorithm I choose. In one case, I have two files that happen to be tracks #1 and #2 on the same album, they are different file sizes yet produce identical hash keys whether I use MD5, SHA2-256, SHA2-512, etc...
This is the first time I'm really using hash keys on files and this is an unexpected result. I feel something fishy is going on here from the little I know about these hashing algorithms. Could this be an issue related to MP3's or Python's implementation?
Here's the snippet of code that I'm using:
data = open(path, 'r').read()
m = hashlib.md5(data)
m.update(data)
md5String = m.hexdigest()
Any answers or insights to why this is happening would be much appreciated. Thanks in advance.
--UPDATE--:
I tried executing this code in linux (with Python 2.6) and it did not produce a collision. As demonstrated by the stat call, the files are not the same. I also downloaded WinMD5 and this did not produce a collision(8d327ef3937437e0e5abbf6485c24bb3 and 9b2c66781cbe8c1be7d6a1447994430c). Is this a bug with Python hashlib on Windows? I tried the same under Python 2.7.1 and 2.6.6 and both provide the same result.
import hashlib
import os
def createMD5(path):
    fh = open(path, 'r')
    data = fh.read()
    m = hashlib.md5(data)
    md5String = m.hexdigest()
    fh.close()
    return md5String
print os.stat(path1)
print os.stat(path2)
print createMD5(path1)
print createMD5(path2)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=6617216L, st_atime=1303808346L, st_mtime=1167098073L, st_ctime=1290222341L)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=4921346L, st_atime=1303808348L, st_mtime=1167098076L, st_ctime=1290222341L)
>>> a7a10146b241cddff031eb03bd572d96
>>> a7a10146b241cddff031eb03bd572d96
I sort of have the feeling that you are reading a chunk of data which is smaller than expected, and this chunk happens to be the same for both files. I don't know why, but try to open the file in binary with 'rb'. read() should read up to the end of the file, but Windows behaves differently. From the docs:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
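A sketch of hashing in binary mode, reading in chunks so large files never have to fit in memory at once:
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in binary mode."""
    m = hashlib.md5()
    with open(path, 'rb') as fh:          # 'rb' avoids newline translation on Windows
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            m.update(chunk)
    return m.hexdigest()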
The files you're having a problem with are almost certainly identical if several different hashing algorithms all return the same hash results on them, or there's a bug in your implementation.
As a sanity test, write your own "hash" that just returns the file's contents in full, and see if this one generates the same "hashes".
As others have stated, a single hash collision is unlikely, and multiple nigh on impossible, unless the files are identical. I would recommend generating the sums with an external utility as something of a sanity check. For example, in Ubuntu (and most/all other Linux distributions):
blair@blair-eeepc:~$ md5sum Bandwagon.mp3
b87cbc2c17cd46789cb3a3c51a350557  Bandwagon.mp3
blair@blair-eeepc:~$ sha256sum Bandwagon.mp3
b909b027271b4c3a918ec19fc85602233a4c5f418e8456648c426403526e7bc0  Bandwagon.mp3
A quick Google search shows there are similar utilities available for Windows machines. If you see the collisions with the external utilities, then the files are identical. If there are no collisions, you are doing something wrong. I doubt the Python implementation is wrong, as I get the same results when doing the hash in Python:
>>> import hashlib
>>> hashlib.md5(open('Bandwagon.mp3', 'r').read()).hexdigest()
'b87cbc2c17cd46789cb3a3c51a350557'
>>> hashlib.sha256(open('Bandwagon.mp3', 'r').read()).hexdigest()
'b909b027271b4c3a918ec19fc85602233a4c5f418e8456648c426403526e7bc0'
Like @Delan Azabani said, there is something fishy here; collisions are bound to happen, but not that often. Check if the songs are the same, and please update your post.
Also, if you feel that you don't have enough keys, you can use two (or even more) hashing algorithms at the same time: by using MD5 for example, you have 2**128, or 340282366920938463463374607431768211456 keys. By using SHA-1, you have 2**160 or 1461501637330902918203684832716283019655932542976 keys. By combining them, you have 2**128 * 2**160, or 497323236409786642155382248146820840100456150797347717440463976893159497012533375533056.
(But if you ask me, MD5 is more than enough for your needs.)
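If you did want a wider key space, a sketch of combining two digests into a single identifier (the function name is just for illustration):
import hashlib

def combined_key(data):
    # Two different files would have to collide in both algorithms at once
    # to end up with the same combined key.
    return hashlib.md5(data).hexdigest() + hashlib.sha1(data).hexdigest()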

How can I read a python pickle database/file from C?

I am working on integrating with several music players. At the moment my favorite is exaile.
In the new version they are migrating the database format from SQLite3 to an internal Pickle format. I wanted to know if there is a way to access pickle format files without having to reverse engineer the format by hand.
I know there is the cPickle python module, but I am unaware if it is callable directly from C.
http://www.picklingtools.com/
There is a library called PicklingTools, which I help maintain, that might be useful: it allows you to form data structures in C++ that you can then pickle/unpickle ... it is C++, not C, but that shouldn't be a problem these days (assuming you are using the gcc/g++ suite).
The library is a plain C++ library (there are examples of C++ and Python within the distribution showing how to use the library over sockets and files from both C++ and Python), but in general, the basics of pickling to files is available.
The basic idea is that the PicklingTools library gives you "python-like" data structures from C++ so that you can then serialize and deserialize to/from Python/C++. All (?) the basic types - int, long int, string, None, complex, dictionaries, lists, ordered dictionaries and tuples - are supported. There are a few hooks to do custom classes, but that part is a bit immature: the rest of the library is pretty stable and has been active for 8 (?) years.
Simple example:
#include "chooseser.h"
int main()
{
Val a_dict = Tab("{ 'a':1, 'b':[1,2.2,'three'], 'c':None }");
cout << a_dict["b"][0]; // value of 1
// Dump to a file
DumpValToFile(a_dict, "example.p0", SERIALIZE_P0);
// .. from Python, can load the dictionary with pickle.load(file('example.p0'))
// Get the result back
Val result;
LoadValFromFile(result, "example.p0", SERIALIZE_P0);
cout << result << endl;
}
There is further documentation (FAQ and User's Guide) on the web site.
Hope this is useful.
As Cristian said, you can rather easily embed Python code in your C code; see the example here.
Using cPickle is dead easy as well; in Python you could use something like:
import cPickle
f = open('dbfile', 'rb')
db = cPickle.load(f)
f.close()
# handle db integration
f = open('dbfile', 'wb')
cPickle.dump(db, f)
f.close()
You can embed a Python interpreter in a C program, but I think that the easiest solution is to write a Python script that converts "pickles" in another format, e.g. an SQLite database.
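A sketch of that conversion idea, assuming the pickle holds a dictionary (the file names and schema here are made up):
import cPickle as pickle
import sqlite3

# Load the pickled database written by the player...
with open('dbfile', 'rb') as f:
    db = pickle.load(f)

# ...and re-export it as SQLite, which is easy to read from C.
conn = sqlite3.connect('dbfile.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS tracks (key TEXT PRIMARY KEY, value TEXT)')
for key, value in db.items():
    conn.execute('INSERT OR REPLACE INTO tracks VALUES (?, ?)', (str(key), repr(value)))
conn.commit()
conn.close()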
