Accessing items from a dictionary using pickle efficiently in Python - python

I have a large dictionary mapping keys (which are strings) to objects. I pickled this large dictionary and at certain times I want to pull out only a handful of entries from it. The dictionary has usually thousands of entries total. When I load the dictionary using pickle, as follows:
from cPickle import *
# my dictionary from pickle, containing thousands of entries
mydict = load(open('mypickle.pickle', 'rb'))
# accessing only a handful of entries here
for entry in relevant_entries:
    # find relevant entry
    value = mydict[entry]
I notice that it can take up to 3-4 seconds to load the entire pickle, which I don't need, since I access only a tiny subset of the dictionary entries later on (shown above).
How can I make pickle load only the entries that I need from the dictionary, to make this faster?
Thanks.

Pickle serializes objects (or object hierarchies); it's not an on-disk store. As you have seen, you must unpickle the entire object to use it, which is of course wasteful. Use shelve, dbm or a database (SQLite) for on-disk storage.
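As a rough illustration of the shelve route, here is a minimal sketch that converts an existing pickled dict into a shelf once and afterwards reads only the requested keys. The helper names pickle_to_shelf and fetch, the file names, and the sample data are made up for the example; it assumes string keys, as in the question.

```python
import os
import pickle
import shelve
import tempfile


def pickle_to_shelf(pickle_path, shelf_path):
    """One-time conversion: copy every entry of a pickled dict into a shelf."""
    with open(pickle_path, "rb") as f:
        data = pickle.load(f)  # the full load happens once, here
    with shelve.open(shelf_path, flag="n") as shelf:
        for key, value in data.items():
            shelf[key] = value  # each value is pickled on its own


def fetch(shelf_path, keys):
    """Read only the requested entries; the other values stay on disk."""
    with shelve.open(shelf_path, flag="r") as shelf:
        return {key: shelf[key] for key in keys if key in shelf}


# Demo in a temporary directory (file names and data are hypothetical).
workdir = tempfile.mkdtemp()
pickle_path = os.path.join(workdir, "mypickle.pickle")
shelf_path = os.path.join(workdir, "myshelf")

with open(pickle_path, "wb") as f:
    pickle.dump({"key%d" % i: {"n": i} for i in range(2000)}, f)

pickle_to_shelf(pickle_path, shelf_path)
subset = fetch(shelf_path, ["key7", "key42", "missing"])
```

After the conversion, each later run only pays for the handful of values it actually reads.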

You'll have to have "ghost" objects, i.e. objects that are only placeholders and load themselves when accessed. This is a difficult issue, but it has been solved. You have two options: you can use the persistence library from ZODB, which helps with this, or you can just start using ZODB directly; problem solved.
http://www.zodb.org/

If your objects are independent of each other, you could pickle and unpickle them individually, using each key as a filename. In some perverse way, a directory is a kind of dictionary mapping filenames to files. This makes it simple to load only the relevant entries.
Basically, you use an in-memory dictionary as a cache and, if the key you are looking for is missing, try to load the file from the filesystem.
I'm not really saying you should do that. A database (ZODB, SQLite, other) is probably better for persistent storage.
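For what it's worth, the directory-as-dictionary idea could look roughly like the sketch below. DirectoryDict is a hypothetical class written for this example, and it departs from the description in one respect: it hashes each key to get a file name, on the assumption that raw keys may contain characters that are not valid in file names.

```python
import hashlib
import os
import pickle
import tempfile


class DirectoryDict:
    """Hypothetical helper: one pickle file per key, plus an in-memory cache."""

    def __init__(self, directory):
        self.directory = directory
        self.cache = {}
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any string becomes a safe file name.
        name = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, name + ".pickle")

    def __setitem__(self, key, value):
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)
        self.cache[key] = value

    def __getitem__(self, key):
        if key not in self.cache:
            # Cache miss: try to load just this entry from its own file.
            try:
                with open(self._path(key), "rb") as f:
                    self.cache[key] = pickle.load(f)
            except FileNotFoundError:
                raise KeyError(key) from None
        return self.cache[key]


store = DirectoryDict(tempfile.mkdtemp())
store["alpha"] = [1, 2, 3]

# A fresh instance starts with an empty cache and loads entries on demand.
fresh = DirectoryDict(store.directory)
loaded = fresh["alpha"]
```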

Related

Loading a large dictionary through pickle file or by directly entering code defining dictionary

I have a large dictionary that I currently define as such:
my_dict = { … }
Because it has a lot of keys (4000 keys and I have 101 dictionaries of that size), defining my 101 dictionaries like that sounds like bad practice.
What should I do? Should I make a list of those dictionaries, make a pickle file out of it, and load the pickle file?
Is there a better solution? I'm interested in a solution that is efficient in terms of time. I cannot use SQL because I want to distribute the code to different computers that aren't necessarily connected to the same database or the internet.
I'm using Python 3.10.2.
Both pickles and JSON files need to be loaded fully before they can be used, so you'd be shipping 101 pickles or JSONs if you wanted to be able to access one without loading the rest, and even so, you'd still need to load one dict fully into memory to access any key.
Honestly, the better solution is SQL, in particular the sqlite3 module built-in to Python, and no, you don't need to be connected anywhere.
You get cheap random lookup (SELECT value FROM data WHERE dict_id = 123 AND key = 2345) and cheap "all keys in this dict" (SELECT key, value FROM data where dict_id = 345) and cheap "all everything" (SELECT dict_id, key, value FROM data) (and if you need, you can extend that with more indexes).
You would then just ship your premade SQLite file with your application (there are, heh, application notes on that).
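A minimal sketch of that layout with the built-in sqlite3 module follows. The table name data and its columns come from the queries above; the column types, the generated stand-in values, and the helper names lookup and load_dict are assumptions for illustration. It uses an in-memory database only so that it runs anywhere; a shipped application would open a file path instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real application would pass a file path
conn.execute(
    "CREATE TABLE data ("
    " dict_id INTEGER NOT NULL,"
    " key INTEGER NOT NULL,"
    " value TEXT,"
    " PRIMARY KEY (dict_id, key))"
)

# Stand-in data: 101 dictionaries of 4000 keys each.
rows = ((d, k, "value-%d-%d" % (d, k)) for d in range(101) for k in range(4000))
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
conn.commit()


def lookup(dict_id, key):
    """Random lookup of one value; served by the primary-key index."""
    row = conn.execute(
        "SELECT value FROM data WHERE dict_id = ? AND key = ?", (dict_id, key)
    ).fetchone()
    return None if row is None else row[0]


def load_dict(dict_id):
    """Rebuild one whole dictionary as a regular Python dict."""
    cur = conn.execute("SELECT key, value FROM data WHERE dict_id = ?", (dict_id,))
    return dict(cur.fetchall())


one_value = lookup(12, 2345)
one_dict = load_dict(45)
```

The composite primary key doubles as the index that keeps both query shapes cheap.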

Is there a way to load variables directly from a binary file in python? [duplicate]

I understood that Python pickling is a way to 'store' a Python Object in a way that does respect Object programming - different from an output written in txt file or DB.
Do you have more details or references on the following points:
where are pickled objects 'stored'?
why is pickling preserving object representation more than, say, storing in DB?
can I retrieve pickled objects from one Python shell session to another?
do you have significant examples when serialization is useful?
does serialization with pickle imply data 'compression'?
In other words, I am looking for documentation on pickling. The Python docs explain how to use pickle but do not seem to go into detail about the uses of and the need for serialization.
Pickling is a way to convert a Python object (list, dict, etc.) into a byte stream. The idea is that this byte stream contains all the information necessary to reconstruct the object in another Python script.
As for where the pickled information is stored, usually one would do:
import pickle
with open('filename', 'wb') as f:
    var = {1 : 'a' , 2 : 'b'}
    pickle.dump(var, f)
That would store the pickled version of our var dict in the 'filename' file. Then, in another script, you could load from this file into a variable and the dictionary would be recreated:
import pickle
with open('filename', 'rb') as f:
    var = pickle.load(f)
Another use for pickling is if you need to transmit this dictionary over a network (perhaps with sockets or something). You first need to convert it into a byte stream, then you can send it over a socket connection.
Also, there is no "compression" to speak of here; it's just a way to convert from one representation (in RAM) to another (a stream of bytes).
About.com has a nice introduction of pickling here.
Pickling is absolutely necessary for distributed and parallel computing.
Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn't pickle, you can't send it to the other resources on another process, computer, etc. Also see here for a good example.
To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.
And, yes, people use pickling to save the state of a calculation, or your ipython session, or whatever. You can also extend pickle's Pickler and Unpickler to do compression with bz2 or gzip if you'd like.
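As a sketch of the compression idea: rather than subclassing Pickler and Unpickler, the version below takes the simpler route of wrapping the file object with gzip.open, which gives a similar result for whole-file dumps. The function names dump_compressed and load_compressed and the sample data are made up for the example.

```python
import gzip
import os
import pickle
import tempfile


def dump_compressed(obj, path):
    """Pickle obj into a gzip-compressed file."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Load an object written by dump_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Repetitive stand-in data, which compresses well.
state = {"step": 3, "text": "abc " * 10000}
path = os.path.join(tempfile.mkdtemp(), "state.pickle.gz")

dump_compressed(state, path)
restored = load_compressed(path)

raw_size = len(pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL))
compressed_size = os.path.getsize(path)
```

How much space this saves depends entirely on the data; the repetitive string here is a best case.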
I find it to be particularly useful with large and complex custom classes. In a particular example I'm thinking of, "Gathering" the information (from a database) to create the class was already half the battle. Then that information stored in the class might be altered at runtime by the user.
You could have another group of tables in the database and write another function to go through everything stored and write it to the new database tables. Then you would need to write another function to be able to load something saved by reading all of that info back in.
Alternatively, you could pickle the whole class as is and then store that to a single field in the database. Then when you go to load it back, it will all load back in at once as it was before. This can end up saving a lot of time and code when saving and retrieving complicated classes.
It is a kind of serialization. In Python 2, use cPickle, which is much faster than pickle; in Python 3, pickle already uses the C implementation when it is available.
import pickle
# make pickle file
with open('pickles/corups.pickle', 'wb') as handle:
    pickle.dump(corpus, handle)
# read pickle file
with open('pickles/corups.pickle', 'rb') as handle:
    corpus = pickle.load(handle)

How to update a saved mapping without loading it into memory?

I'm maintaining some mappings that I need to continually update.
These mappings are implemented as pickle serialized dicts right now.
The update process is like this:
Load the pickle file into memory, so that I have access to the dict
Do any update to the dict and serialize it again.
The problem with this solution is it could consume a lot of memory for large dicts.
I've looked into other solutions like shelve and leveldb, but they can both generate many files instead of one, which makes them more complex to save to systems like key-value storage.
To read and modify your mappings without reading the entire map into memory, you'll need to store it as an indexed structure in some sort of database. There are lots of databases with good Python bindings that store the data on disk as a file, so that you don't have to worry about database servers or separate index files. Sqlite is almost certainly the most common choice. However, as you pointed out in the comments, the full functionality of an SQL database is probably unnecessary for your purpose, since you really only need to store key-value pairs.
Based on your particular requirements, then, I'd probably recommend vedis. It's a single-file, key-value database which can support very large database sizes (the documentation claims it can handle on the order of terabytes), and which is transactional and thread-safe to boot.
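For comparison, the SQLite option mentioned earlier can also be used as a plain single-file key-value store. The sketch below is one possible shape, not a vedis example: SqliteMapping is a hypothetical class, the table name kv is arbitrary, and values are stored as pickled blobs so that updating one key never loads the rest of the mapping.

```python
import os
import pickle
import sqlite3
import tempfile


class SqliteMapping:
    """Hypothetical single-file key-value store; values are pickled blobs."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __setitem__(self, key, value):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, pickle.dumps(value)),
            )

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def close(self):
        self.conn.close()


path = os.path.join(tempfile.mkdtemp(), "mapping.sqlite")

mapping = SqliteMapping(path)
mapping["a"] = [1]
mapping["a"] = mapping["a"] + [2]  # update a single entry in place on disk
mapping.close()

reopened = SqliteMapping(path)
result = reopened["a"]
reopened.close()
```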

How to modify variables and instances in modules and save it at runtime in Python

I have main.py,header.py and var.py
header.py
import var
class table():
    def __init__(self, name):
        self.name = name
var.py
month = "jen"
table = "" # tried to make empty container which can save table instance but don't know how
main.py
import header
import var
var.table = header.table(var.month)
var.month = "feb"
After this program ends, I want the modified var.table and var.month to be saved in var.py.
When your program ends, all your values are lost—unless you save them first, and load them on the next run. There are a variety of different ways to do this; which one you want depends on what kind of data you have and what you're doing with it.
The one thing you never, ever want to do is print arbitrary objects to a file and then try to figure out how to parse them later. If the answer to any of your questions is ast.literal_eval, you're saving things wrong.
One important thing to consider is when you save. If someone quits your program with ^C, and you only save during clean shutdowns, all your changes are gone.
Numpy/Pandas
Numpy and Pandas have their own built-in functions for saving data. See the Numpy docs and Pandas docs for all of the options, but the basic choices are:
Text (e.g., np.savetxt): Portable formats, editable in a spreadsheet.
Binary (e.g., np.save): Small files, fast saving and loading.
Pickle (see below, but also builtin functions): Can save arrays with arbitrary Python objects.
HDF5. If you need HDF5 or NetCDF, you probably already know that you need it.
List of strings
If all you have is a list of single-line strings, you just write them to a file and read them back line by line. It's hard to get simpler, and it's obviously human-readable.
If you need a short name for each value, or need separate sections, but your values are still all simple strings, you may want to look at configparser for CFG/INI files. But as soon as you get more complicated than that, look for a different format.
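As a small sketch of the configparser route, the example below saves two string values to an INI file and reads them back. The section name state, the option names, and the file name are made up for illustration.

```python
import configparser
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "settings.ini")

# Save: one section holding short names mapped to simple strings.
config = configparser.ConfigParser()
config["state"] = {"month": "feb", "table": "jen"}
with open(path, "w") as f:
    config.write(f)

# Load on the next run.
loaded = configparser.ConfigParser()
loaded.read(path)
month = loaded["state"]["month"]
```

Everything comes back as a string, which is one reason this format stops scaling once values get more complicated.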
Python source
If you don't need to save anything, only load data (that your users might want to edit), you can use Python itself as a format—either a module that you import, or a script file that you exec. This can of course be very dangerous, but for a config file that's only being edited by people who already have your entire source code on their computer, that may not be a problem.
JSON and friends
JSON can save a single dict or list to a file and load it back. JSON is built into the Python standard library, and most other languages can also load and save it. JSON files are human-editable, although not beautiful.
JSON dicts and lists can be nested structure with other dicts and lists inside, and can also contain strings, floats, bools, and None, but nothing else. You can extend the json library with converters for other types, but it's a bit of work.
YAML is (almost) a superset of JSON that's easier to extend, and allows for prettier human-editable files. It doesn't have builtin support in the standard library, but there are a number of solid libraries on PyPI, like ruamel.yaml.
Both JSON and YAML can only save one dict or list per file. (The library will let you save multiple objects, but you won't be able to load them back, so be careful.) The simplest way around this is to create one big dict or list with all of your data packed into it. But JSON Lines allows you to save multiple JSON dicts in a single file, at the cost of human readability. You can load it just by for line in file: obj = json.loads(line), and you can save it with just the standard library if you know what you're doing, but you can also find third-party libraries like json-lines to do it for you.
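A standard-library-only sketch of JSON Lines might look like this. The helper names save_jsonl and load_jsonl and the sample records are assumptions for the example; the only real rule is one complete JSON document per line.

```python
import json
import os
import tempfile


def save_jsonl(path, objects):
    """Write one JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for obj in objects:
            # json.dumps never emits a raw newline, so each object stays on one line.
            f.write(json.dumps(obj) + "\n")


def load_jsonl(path):
    """Read the objects back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"month": "jan", "days": 31}, {"month": "feb", "days": 28}]
path = os.path.join(tempfile.mkdtemp(), "records.jsonl")

save_jsonl(path, records)
loaded = load_jsonl(path)
```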
Key-value stores
If what you want to store fits into a dict, but you want to have it on disk all the time instead of explicitly saving and loading, you want a key-value store.
dbm is an old but still functional format, as long as your keys and values are all small-ish strings and you don't have tons of them. Python makes a dbm look like a dict, so you don't need to change most of your code at all.
shelve extends dbm to let you save arbitrary values instead of just strings. It does this by using Pickle (see below), meaning it has the same safety issues, and it can also be slow.
More powerful key-value stores (and related things) are generally called NoSQL databases. There are lots of them nowadays; Redis is one of the popular choices. There's more to learn, but it can be worth it.
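To give a feel for the dict-like interface, here is a minimal dbm sketch. The file name and keys are made up; note that values come back as bytes, so they are decoded explicitly. Which on-disk backend dbm picks depends on the Python build, so the actual file name or extension may differ between systems.

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "settings")

# "c" opens the database for reading and writing, creating it if needed.
with dbm.open(path, "c") as db:
    db["month"] = "feb"
    db["table"] = "jen"

# Reopen read-only, as a later run of the program would.
with dbm.open(path, "r") as db:
    month = db["month"].decode("utf-8")  # stored values are bytes
    has_year = "year" in db
```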
CSV
CSV stands for "comma-separated values", although there are variations that use whitespace or other characters. CSV is built into the standard library.
It's a great format when you have a list of objects all with the same fields, as long as all of the members are strings or numbers. But don't try to stretch it beyond that.
CSV files are just barely human-editable as text—but they can be edited very easily in spreadsheet programs like Excel or Google Sheets.
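A short sketch with the csv module, for a list of objects that all share the same fields, could look like this. The field names and rows are invented for the example; numbers have to be converted back by hand, because CSV itself only stores text.

```python
import csv
import os
import tempfile

rows = [{"name": "jan", "days": 31}, {"name": "feb", "days": 28}]
path = os.path.join(tempfile.mkdtemp(), "months.csv")

# Save: a header line followed by one line per object.
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "days"])
    writer.writeheader()
    writer.writerows(rows)

# Load: every field is read back as a string, so convert numbers explicitly.
with open(path, newline="") as f:
    loaded = [
        {"name": record["name"], "days": int(record["days"])}
        for record in csv.DictReader(f)
    ]
```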
Pickle
Pickle is designed to save and load just about anything. This can be dangerous if you're reading arbitrary pickle files supplied by users, but it can also be very convenient. Pickle actually can't quite save and load everything unless you do a lot of work to add support to some of your types, but there's a third-party library named dill that extends support a lot further.
Pickle files are not at all human-readable, and are only compatible with Python, and sometimes not even with older versions of Python.
SQL
Finally, you can always build a full relational database. This isn't quite as scary as it sounds.
Python has a database called sqlite3 built into the standard library.
If that looks too complicated, you may want to consider SQLAlchemy, which lets you store and query data without having to learn the SQL language. Or, if you search around, there are a number of fancier ORMs, and libraries that let you run custom list comprehensions directly against databases, and so on.
Other formats
There are zillions of other standards out there for data files; a few even come with support in the standard library. They can be useful for special cases: plist files match what Apple uses for preferences on macOS and iOS; netrc files are a long-established way to store a list of server logins; XML is perfect if you have a time machine that can only travel to the year 2000; etc. But usually, you're better off using one of the common formats mentioned above.

I need a class that creates a dictionary file that lives on disk

I want to create a very very large dictionary, and I'd like to store it on disk so as not to kill my memory. Basically, my needs are a cross between cPickle and the dict class, in that it's a class that Python treats like a dictionary, but happens to live on the disk.
My first thought was to create some sort of wrapper around a simple MySQL table, but I have to store types in the entries of the structure that MySQL can't even hope to support out of the box.
The simplest way is the shelve module, which works almost exactly like a dictionary:
import shelve
myshelf = shelve.open("filename") # Might turn into filename.db
myshelf["A"] = "First letter of alphabet"
print(myshelf["A"])
# ...
myshelf.close() # You should do this explicitly when you're finished
Note the caveats in the module documentation about changing mutable values (lists, dicts, etc.) stored on a shelf (you can, but it takes a bit more fiddling). It uses (c)pickle and dbm under the hood, so it will cheerfully store anything you can pickle.
I don't know how well it performs relative to other solutions, but it doesn't require any custom code or third party libraries.
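The caveat about mutable values can be illustrated with a small sketch. The key name letters and the file name are made up; the behaviour shown (in-place changes are lost unless you reassign the value or open the shelf with writeback=True) is the one described in the shelve documentation.

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myshelf")

with shelve.open(path) as shelf:
    shelf["letters"] = ["A"]

    # This appends to a temporary copy that is never stored again.
    shelf["letters"].append("B")
    without_writeback = shelf["letters"]

    # Fetch, modify, then assign back so the change reaches the disk.
    letters = shelf["letters"]
    letters.append("B")
    shelf["letters"] = letters
    after_reassign = shelf["letters"]

# With writeback=True, accessed entries are cached and written back on close.
with shelve.open(path, writeback=True) as shelf:
    shelf["letters"].append("C")

with shelve.open(path) as shelf:
    final = shelf["letters"]
```

writeback=True is convenient but keeps every accessed entry in memory until the shelf is synced or closed, which matters for a very large dictionary.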
Look at dbm in particular, and more generally at the entire Data Persistence chapter in the manual. Most key/value-store databases (gdbm, bdb, metakit, etc.) have a dict-like API which would probably serve your needs (and are fully embeddable, so there is no need to manage an external database process).
File IO is expensive in terms of CPU cycles. So my first thoughts would be in favor of a database.
However, you could also split your "English dictionary" across multiple files so that (say) each file holds words that start with a specific letter of the alphabet (therefore, you'll have 26 files).
Now, when you say I want to create a very very large dictionary, do you mean a python dict or an English dictionary with words and their definitions, stored in a dict (with words as keys and definitions as values)? The second can be easily implemented with cPickle, as you pointed out.
Again, if memory is your main concern, then you'll need to recheck the number of files you want to use, because if you're pickling a dict into each file, you want those dicts to not get too big.
Perhaps a usable solution for you would be to do this (I am going to assume that all the English words are sorted):
Get all the words in the English language into one file.
Count how many such words there are and split them into as many files as you see fit, depending on how large the files get.
Now, these smaller files contain the words and their meanings
This is how this solution is useful:
Say that your problem is to look up the definition of a particular word. At runtime, you can read the first word in each file and determine whether the word you are looking for is in the previous file that you read (you will need a loop counter to check if you are at the last file). Once you have determined which file the word is in, you can open that file and load its contents into the dict.
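One way to sketch this scheme is shown below. It differs slightly from the description: instead of re-reading the first word of each file on every lookup, it keeps the list of first words in memory and uses bisect to pick the file. The function names, the chunk file naming, and the tiny word list are all assumptions for the example.

```python
import bisect
import os
import pickle
import tempfile


def write_chunks(entries, directory, chunk_size):
    """Split sorted words into pickled dict files; return the index of first words."""
    words = sorted(entries)
    first_words, paths = [], []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        path = os.path.join(directory, "chunk%03d.pickle" % len(paths))
        with open(path, "wb") as f:
            pickle.dump({word: entries[word] for word in chunk}, f)
        first_words.append(chunk[0])
        paths.append(path)
    return first_words, paths


def lookup(word, first_words, paths):
    """Load only the one chunk whose word range can contain `word`."""
    index = bisect.bisect_right(first_words, word) - 1
    if index < 0:
        raise KeyError(word)  # sorts before the first word of the first file
    with open(paths[index], "rb") as f:
        return pickle.load(f)[word]


entries = {
    "apple": "a fruit",
    "bear": "an animal",
    "cat": "a pet",
    "dog": "another pet",
    "eel": "a fish",
}
first_words, paths = write_chunks(entries, tempfile.mkdtemp(), chunk_size=2)
definition = lookup("dog", first_words, paths)
```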
It's a little difficult to offer a solution without knowing more details about the problem at hand.
