I'm storing a list of dictionaries with cPickle, but I need to be able to add items to it and remove items from it occasionally. If I store the dictionary data with cPickle, is there some sort of limit on when I will be able to load it again?
You can store it for as long as you want. It's just a file. However, if your data structures start becoming complicated, it can become tedious and time consuming to unpickle, update and pickle the data again. Also, it's just file access so you have to handle concurrency issues by yourself.
No. cPickle just writes data to files and reads it back; why would you think there would be a limit?
cPickle is just a faster implementation of pickle. You can use it to convert a Python object into its serialized string form and get it back later by unpickling.
You can do one of two things with a pickled object:

Do not write it to a file. In this case, the scope of your pickled data is similar to that of any other variable: it lives in memory and disappears with your process.

Write it to a file. You can write the pickled data to a file and read it back whenever you want, getting the original Python objects/data structures back. Your pickled data is safe for as long as the pickled file is stored on disk.
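A minimal sketch of both cases, using the standard pickle module (the dictionary and filename are placeholders):

import pickle

var = {1: 'a', 2: 'b'}

# Case 1: keep the pickled bytes in memory; they have the same
# lifetime as any other variable.
blob = pickle.dumps(var)
restored = pickle.loads(blob)

# Case 2: write the pickled bytes to a file; they persist on disk
# until the file is deleted.
with open('var.pickle', 'wb') as f:
    pickle.dump(var, f)
with open('var.pickle', 'rb') as f:
    restored_from_disk = pickle.load(f)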
Basically, I messed up. I have pickled some data - a pretty massive dictionary - and while my computer was able to create and pickle that dictionary in the first place, it crashes from running out of memory when I try to unpickle it. I need to unpickle it somehow to get the data back, and then I can write each entry of the dictionary to a separate file that can actually fit in memory. My best guess for how to do that is to unpickle the dictionary entry by entry and then pickle each entry into its own file, or failing that, to unpickle it but somehow leave it as an on-disk object. I can't seem to find any information on how pickled data is actually stored, to start writing a program to recover the data.
pickle is a serialization format unique to Python, and there is no user-level documentation for it.
However, there are extensive comments in a standard distribution's "Lib/pickletools.py" module, and with enough effort you should be able to use that module's dis() function to produce output you can parse yourself, or modify the source itself. dis() does not execute a pickle (meaning it doesn't build any Python objects from the pickle). Instead it reads the pickle file a few bytes at a time, and prints (to stdout, by default) a more human-readable form of what those bytes "mean".
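For example, a short sketch of what that looks like (the sample dictionary is arbitrary):

import pickle
import pickletools

blob = pickle.dumps({'a': [1, 2, 3]})

# Print a human-readable disassembly of the pickle's opcodes to
# stdout, without constructing any of the pickled objects.
pickletools.dis(blob)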
My understanding is that Python pickling is a way to 'store' a Python object in a way that respects object-oriented programming - different from an output written to a txt file or a DB.
Do you have more details or references on the following points:
where are pickled objects 'stored'?
why is pickling preserving object representation more than, say, storing in DB?
can I retrieve pickled objects from one Python shell session to another?
do you have significant examples when serialization is useful?
does serialization with pickle imply data 'compression'?
In other words, I am looking for a doc on pickling - the Python docs explain how to use pickle but do not seem to dive into details about the use and necessity of serialization.
Pickling is a way to convert a Python object (list, dict, etc.) into a byte stream. The idea is that this byte stream contains all the information necessary to reconstruct the object in another Python script.
As for where the pickled information is stored, usually one would do:
import pickle

var = {1: 'a', 2: 'b'}
with open('filename', 'wb') as f:
    pickle.dump(var, f)  # write the pickled bytes to 'filename'
That would store the pickled version of our var dict in the 'filename' file. Then, in another script, you could load from this file into a variable and the dictionary would be recreated:
import pickle

with open('filename', 'rb') as f:
    var = pickle.load(f)  # the dictionary is reconstructed from disk
Another use for pickling is if you need to transmit this dictionary over a network (perhaps with sockets). You first need to convert it into a byte stream, and then you can send it over a socket connection.
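A rough sketch of that idea, using a local socketpair as a stand-in for a real network connection (never unpickle data received from an untrusted peer):

import pickle
import socket

sender, receiver = socket.socketpair()

# Convert the dictionary to a byte stream and send it.
payload = pickle.dumps({1: 'a', 2: 'b'})
sender.sendall(payload)
sender.close()  # closing signals end-of-stream to the receiver

# Read until the peer closes, then rebuild the object.
chunks = []
while True:
    chunk = receiver.recv(4096)
    if not chunk:
        break
    chunks.append(chunk)
var = pickle.loads(b''.join(chunks))
print(var)  # {1: 'a', 2: 'b'}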
Also, there is no "compression" to speak of here... it's just a way to convert from one representation (in RAM) to another (a byte stream).
About.com has a nice introduction to pickling here.
Pickling is absolutely necessary for distributed and parallel computing.
Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn't pickle, you can't send it to the other resources on another process, computer, etc. Also see here for a good example.
To do this, I use dill, which can serialize almost anything in Python. dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.
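As a small illustration of the difference (the lambda stands in for any object the standard pickle module rejects):

import pickle
import dill

square = lambda x: x * x

# The standard pickle module cannot serialize a lambda...
try:
    pickle.dumps(square)
except Exception as exc:
    print('pickle failed:', exc)

# ...but dill can, so the function can be shipped to another process.
blob = dill.dumps(square)
restored = dill.loads(blob)
print(restored(4))  # 16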
And, yes, people use pickling to save the state of a calculation, or an IPython session, or whatever. You can also extend pickle's Pickler and Unpickler to do compression with bz2 or gzip if you'd like.
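For instance, a minimal sketch of gzip-compressed pickling; pickle.dump only needs a file-like object, so you can hand it a gzip stream directly (the filename and data are placeholders):

import gzip
import pickle

state = {'step': 42, 'weights': [0.1, 0.2, 0.3]}

# Write the pickle through a gzip stream; compression comes for free.
with gzip.open('state.pkl.gz', 'wb') as f:
    pickle.dump(state, f)

# Read it back the same way.
with gzip.open('state.pkl.gz', 'rb') as f:
    state = pickle.load(f)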
I find it to be particularly useful with large and complex custom classes. In a particular example I'm thinking of, "Gathering" the information (from a database) to create the class was already half the battle. Then that information stored in the class might be altered at runtime by the user.
You could have another group of tables in the database and write another function to go through everything stored and write it to the new database tables. Then you would need to write another function to be able to load something saved by reading all of that info back in.
Alternatively, you could pickle the whole class as is and then store that to a single field in the database. Then when you go to load it back, it will all load back in at once as it was before. This can end up saving a lot of time and code when saving and retrieving complicated classes.
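A sketch of that approach with sqlite3 (the Session class and table layout are made up for illustration):

import pickle
import sqlite3

class Session:  # hypothetical stand-in for a large, complex class
    def __init__(self, user, settings):
        self.user = user
        self.settings = settings

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sessions (id INTEGER PRIMARY KEY, state BLOB)')

# Pickle the whole object and store it in a single BLOB field.
obj = Session('alice', {'theme': 'dark'})
conn.execute('INSERT INTO sessions (state) VALUES (?)',
             (pickle.dumps(obj),))
conn.commit()

# Load it back in one step, exactly as it was.
row = conn.execute('SELECT state FROM sessions WHERE id = 1').fetchone()
restored = pickle.loads(row[0])
print(restored.user, restored.settings)  # alice {'theme': 'dark'}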
It is a kind of serialization. Use cPickle; it is much faster than pickle (in Python 3 the C implementation is used automatically, so plain pickle is fine there).
import pickle

# write the pickle file
with open('pickles/corpus.pickle', 'wb') as handle:
    pickle.dump(corpus, handle)

# read the pickle file back
with open('pickles/corpus.pickle', 'rb') as handle:
    corpus = pickle.load(handle)
Say I have a lot of JSON lines to process and I only care about specific fields in each line.
{blablabla, 'whatICare': 1, blablabla}
{blablabla, 'whatICare': 2, blablabla}
....
Is there any way to extract whatICare from these JSON lines without loading them? Since the JSON lines are very long, it may be slow to build full objects from the JSON.
Not any reliable way without writing your own parsing code.
But check out ujson! It can be 10x faster than Python's built-in json library, which is a bit on the slow side.
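A small sketch of per-line parsing with ujson (the sample lines mirror the question's format):

import ujson  # pip install ujson

lines = [
    '{"other": "x", "whatICare": 1}',
    '{"other": "y", "whatICare": 2}',
]

# Each line still has to be parsed, just with a much faster parser.
values = [ujson.loads(line)['whatICare'] for line in lines]
print(values)  # [1, 2]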
No, you will have to load and parse the JSON before you know what’s inside and to be able to filter out the desired elements.
That being said, if you worry about memory, you could use ijson, which is an iterative parser. Instead of loading all the content at once, it loads only what's necessary for the next iteration. So if your file contains an array of objects, you can load and parse one object at a time, reducing the memory impact (you only need to keep one object in memory, plus the data you actually care about). But it won't become faster, and it also won't magically skip data you are not interested in.
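For illustration, a minimal ijson sketch, assuming the input is one top-level JSON array of objects (data.json and the field name are placeholders):

import ijson  # pip install ijson

with open('data.json', 'rb') as f:
    # 'item' selects each element of the top-level array; only one
    # object is held in memory at a time.
    for obj in ijson.items(f, 'item'):
        value = obj['whatICare']
        # ... process value ...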
I have a large dictionary mapping keys (which are strings) to objects. I pickled this large dictionary and at certain times I want to pull out only a handful of entries from it. The dictionary has usually thousands of entries total. When I load the dictionary using pickle, as follows:
from cPickle import load

# my dictionary from pickle, containing thousands of entries
with open('mypickle.pickle', 'rb') as f:
    mydict = load(f)

# accessing only a handful of entries here
for entry in relevant_entries:
    value = mydict[entry]  # find the relevant entry
I notice that it can take up to 3-4 seconds to load the entire pickle, which I don't need, since I access only a tiny subset of the dictionary entries later on (shown above).
How can I make it so pickle only loads those entries that I have from the dictionary, to make this faster?
Thanks.
Pickle serializes objects (or whole object hierarchies); it's not an on-disk store. As you have seen, you must unpickle the entire object to use it - which is of course wasteful. Use shelve, dbm, or a database (SQLite) for on-disk storage.
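A minimal shelve sketch (keys and values are placeholders); shelve pickles each value individually under its string key, so a lookup only unpickles the entry you ask for:

import shelve

# Build the on-disk store once.
with shelve.open('mydata') as db:
    db['alice'] = {'score': 10}
    db['bob'] = {'score': 20}

# Later, possibly in another process: load only the entries you need.
with shelve.open('mydata') as db:
    value = db['alice']  # nothing else is unpickled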
You'll have to have "ghost" objects, i.e. objects that are only placeholders and load themselves when accessed. This is a difficult issue, but it has been solved. You have two options: you can use the persistence library from ZODB, which helps with this, or you can just start using ZODB directly; problem solved.
http://www.zodb.org/
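A rough sketch of the direct-ZODB route, assuming the usual FileStorage setup (the Entry class is purely illustrative); subclassing persistent.Persistent is what gives you the lazily loaded ghost objects described above:

import ZODB, ZODB.FileStorage
import persistent
import transaction

class Entry(persistent.Persistent):
    def __init__(self, value):
        self.value = value

db = ZODB.DB(ZODB.FileStorage.FileStorage('mydata.fs'))
connection = db.open()
root = connection.root()

root['key'] = Entry(42)
transaction.commit()

# Later, root['key'] is a ghost: it is loaded from disk only when
# one of its attributes is actually accessed.
print(root['key'].value)
db.close()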
If your objects are independent of each other, you could pickle and unpickle them individually, using each key as the filename - in some perverse way, a directory is a kind of dictionary mapping filenames to files. This makes it simple to load only the relevant entries.
Basically, you use an in-memory dictionary as a cache, and if the key you're looking for is missing, you try to load its file from the filesystem, as in the sketch below.
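A minimal sketch of that scheme (the directory name and helper functions are made up for illustration):

import os
import pickle

STORE_DIR = 'entries'  # the directory plays the role of the dictionary
CACHE = {}             # in-memory cache of already-loaded entries

def save_entry(key, value):
    os.makedirs(STORE_DIR, exist_ok=True)
    with open(os.path.join(STORE_DIR, key + '.pickle'), 'wb') as f:
        pickle.dump(value, f)

def load_entry(key):
    # Hit the cache first; fall back to the per-key file on disk.
    if key not in CACHE:
        with open(os.path.join(STORE_DIR, key + '.pickle'), 'rb') as f:
            CACHE[key] = pickle.load(f)
    return CACHE[key]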
I'm not really saying you should do that. A database (ZODB, SQLite, or another) is probably better for persistent storage.