Storing a list, then reading it as integer - python

I have a question that may be an easy one, but I could not find a good solution. I have two Python programs. The first produces two outputs: a huge list (containing thousands of other lists) and a simple CSV file for Weka. I need to store this list (the first output) somehow so that I can use it later as input to the second program. I cannot just pass it directly to the second program, because once the first program is done, Weka must also produce new output for the second program. Hence, the second program has to wait for the outputs of both the first program and Weka.
The problem is that the output list consists of lots of lists of numerical values. A simple example:
list1 = [[1,5,7],[14,3,27], [19,12,0], [23,8,17], [12,7]]
If I write this to a txt file, then when I read it back, all the values come in as strings. Is there an easy and fast way (since the data is big) to read all the values as integers? Or, alternatively, to write them as integers in the first place?

I think this is a good case for the pickle module.
To save data:
import pickle
lst = [[1,5,7],[14,3,27], [19,12,0], [23,8,17], [12,7]]
pickle.dump(lst, open('data.pkl', 'wb'))
To read data from saved file:
import pickle
lst = pickle.load(open('data.pkl', 'rb'))
From documentation:
The pickle module implements a fundamental, but powerful algorithm for
serializing and de-serializing a Python object structure. “Pickling”
is the process whereby a Python object hierarchy is converted into a
byte stream, and “unpickling” is the inverse operation, whereby a byte
stream is converted back into an object hierarchy. Pickling (and
unpickling) is alternatively known as “serialization”, “marshalling,”
or “flattening”; however, to avoid confusion, the terms used here
are “pickling” and “unpickling”.
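There is also the faster cPickle module (Python 2 only; in Python 3, pickle uses the C implementation automatically when it is available):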
To save data:
from cPickle import Pickler
p = Pickler(open('data2.pkl', 'wb'))
p.dump(lst)
To read data from saved file:
from cPickle import Unpickler
up = Unpickler(open('data2.pkl', 'rb'))
lst = up.load()

How about pickling the list output rather than outputting it as a plaintext representation? Have a look at the documentation for your version: it's basically a way to write Python objects to file, which you can then read from Python at any point to get identical objects.
Once you have the file open that you want to output to, the outputting difference will be quite minor, e.g.
import pickle
my_list = [[1, 2], [134, 76], [798, 5, 2]]
with open('outputfile.pkl', 'wb') as output:
    pickle.dump(my_list, output, -1)
And then just use the following way to read it in from your second program:
import pickle
my_list = pickle.load(open('outputfile.pkl', 'rb'))
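As a quick check, here is a minimal round-trip sketch. It assumes Python 3 and writes to a temporary directory; the path and the variable names are only illustrative. It suggests that the nested values come back as int rather than str:

```python
import os
import pickle
import tempfile

nested = [[1, 5, 7], [14, 3, 27], [19, 12, 0], [23, 8, 17], [12, 7]]

# Illustrative path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "outputfile.pkl")

# First program: write the list using the highest available protocol.
with open(path, "wb") as output:
    pickle.dump(nested, output, pickle.HIGHEST_PROTOCOL)

# Second program: read it back; the file must be opened in binary mode.
with open(path, "rb") as source:
    restored = pickle.load(source)

# Every element should still be an int, not a string.
all_ints = all(isinstance(value, int) for row in restored for value in row)
```

Using with blocks for both the write and the read also makes sure the file handles are closed promptly.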

Related

How to alter the pickle database in Python?

I have a pickle database which I am reading using the following code
import pickle, pprint
import sys
def main(datafile):
    # Load the pickled object and pretty-print it.
    with open(datafile, 'rb') as fin:
        data = pickle.load(fin)
    pprint.pprint(data)
if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Pickle database file must be given as an argument.")
        sys.exit()
    main(sys.argv[1])
I recognised that it contained a dictionary. I want to delete/edit some values from this dictionary and make a new pickle database.
I am storing the output of this program in a file (so that I can read the elements in the dictionary and choose which ones to delete). How do I read this file (pprinted data structures) and create a pickle database from it?
As stated in the Python docs, pprint is guaranteed to produce output that is valid Python syntax as long as the objects are representable as Python constants. So what you are doing is fine as long as you do it for dicts, lists, numbers, strings, etc. If some value deep down in the dict is not representable as a constant (e.g. a custom object), this will fail.
Reading the output file back should then be quite straightforward:
import ast
with open('output.txt') as fo:
    data = fo.read()
obj = ast.literal_eval(data)
This is assuming that you keep one object per file and nothing more.
Note that you may use built-in eval instead of ast.literal_eval but that is quite unsafe since eval can run arbitrary Python code.
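Putting the pieces together, the whole workflow might look like the sketch below. It assumes Python 3; the file names and the sample dictionary are hypothetical stand-ins for your own pickle database, and the deletion is done in code here only to keep the example self-contained (in practice you would edit the text file by hand before step 2).

```python
import ast
import os
import pickle
import pprint
import tempfile

workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "output.txt")    # hypothetical file names
new_pickle_path = os.path.join(workdir, "new.pkl")

# Stand-in for the dictionary loaded from the original pickle file.
original = {"alpha": [1, 2, 3], "beta": {"x": 1.5}, "gamma": "text"}

# Step 1: write a readable form to a text file.
with open(text_path, "w") as fo:
    fo.write(pprint.pformat(original))

# Step 2: parse the (possibly hand-edited) text back into a Python object.
with open(text_path) as fo:
    data = ast.literal_eval(fo.read())

# Step 3: drop an entry, then write a new pickle file.
del data["gamma"]
with open(new_pickle_path, "wb") as fout:
    pickle.dump(data, fout)
```

If the edits can be expressed in code, it is simpler to skip the text file entirely: load the pickle, modify the dictionary, and dump it again.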

why does converting a python 'shelve' to 'dict' use so much memory?

I have a very large python shelve object (6GB on disk). I want to be able to move it to another machine, and since shelves are not portable, I wanted to cPickle it. To do that, I first have to convert it to a dict.
For some reason, when I do dict(myShelf) the ipython process spikes up to 32GB of memory (all my machine has) and then seems to hang (or maybe just take a really long time).
Can someone explain this? And perhaps offer a potential workaround?
edit: using Python 2.7
From my experience I'd expect pickling to be even more of a memory-hog than what you've done so far. However, creating a dict loads every key and value in the shelf into memory at once, and you shouldn't assume because your shelf is 6GB on disk, that it's only 6GB in memory. For example:
>>> import sys, pickle
>>> sys.getsizeof(1)
24
>>> len(pickle.dumps(1))
4
>>> len(pickle.dumps(1, -1))
5
So, a very small integer is 5-6 times bigger as a Python int object (on my machine) than it is once pickled.
As for the workaround: you can write more than one pickled object to a file. So don't convert the shelf to a dict, just write a long sequence of keys and values to your file, then read an equally long sequence of keys and values on the other side to put into your new shelf. That way you only need one key/value pair in memory at a time. Something like this:
Write:
import pickle
with open('myshelf.pkl', 'wb') as outfile:
    pickle.dump(len(myShelf), outfile)
    for p in myShelf.iteritems():
        pickle.dump(p, outfile)
Read:
with open('myshelf.pkl', 'rb') as infile:
    for _ in xrange(pickle.load(infile)):
        k, v = pickle.load(infile)
        myShelf[k] = v
I think you don't actually need to store the length, you could just keep reading until pickle.load throws an exception indicating it's run out of file.
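A sketch of that read-until-exhausted variant, assuming Python 3, where pickle.load raises EOFError at the end of the file. The helper name iter_pickled is made up for illustration, and a plain dict stands in for the shelf:

```python
import os
import pickle
import tempfile

def iter_pickled(path):
    """Yield every object that was pickled back to back into the file."""
    with open(path, "rb") as infile:
        while True:
            try:
                yield pickle.load(infile)
            except EOFError:
                # No more pickled objects in the file.
                return

# A plain dict stands in for the real shelf in this sketch.
source = {"a": [1, 2], "b": [3], "c": []}
path = os.path.join(tempfile.mkdtemp(), "myshelf.pkl")

# Write one (key, value) pair at a time, without storing the length.
with open(path, "wb") as outfile:
    for pair in source.items():
        pickle.dump(pair, outfile)

# Read the pairs back one at a time and rebuild the mapping.
rebuilt = {}
for key, value in iter_pickled(path):
    rebuilt[key] = value
```

Only one key/value pair is held in memory at a time while reading, which is the point of the workaround.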

cPickle.load( ) error

I am using cPickle to convert structured data into a data-stream format and pass it to a library. I have to read the contents of a manually written file named "targetstrings.txt" and convert them into the format the Netcdf library needs, in the following manner:
Note: targetstrings.txt contains latin characters
op=open("targetstrings.txt",'rb')
targetStrings=cPickle.load(op)
The Netcdf library takes the contents as strings.
While loading the file, it fails with the following error:
cPickle.UnpicklingError: invalid load key, 'A'.
Please tell me how can I rectify this error, I have googled around but did not find an appropriate solution.
Any suggestions?
pickle is not for reading/writing generic text files, but to serialize/deserialize Python objects to file. If you want to read text data you should use Python's usual IO functions.
with open('targetstrings.txt', 'r') as f:
    fileContent = f.read()
If, as it seems, the library just wants to have a list of strings, taking each line as a list element, you just have to do:
with open('targetstrings.txt', 'r') as f:
    lines = [l for l in f]
# now in lines you have the lines read from the file
As stated - Pickle is not meant to be used in this way.
If you need to manually edit complex Python objects that are to be read and passed as Python objects to another function, there are plenty of other formats to use, for example XML, JSON, or Python files themselves. Pickle uses a Python-specific protocol that, while not binary (in version 0 of the protocol) and not changing across Python versions, is not meant for this, and is not even the recommended method to record Python objects for persistence or communication (although it can be used for those purposes).
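For instance, JSON keeps the file hand-editable and copes with non-ASCII characters. The following is only a sketch, assuming Python 3; the sample strings and the file name targetstrings.json are invented for illustration:

```python
import json
import os
import tempfile

# Sample data with Latin characters (invented for this example).
target_strings = ["Ávila", "São Paulo", "Zürich"]
path = os.path.join(tempfile.mkdtemp(), "targetstrings.json")

# Write a human-readable, hand-editable file.
with open(path, "w", encoding="utf-8") as f:
    json.dump(target_strings, f, ensure_ascii=False, indent=2)

# Read it back as a list of Python strings.
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)
```

Passing ensure_ascii=False keeps the accented characters readable in the file instead of writing them as \u escapes.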

Editing Pickled Data

I need to save a complex piece of data:
list = ["Animals", {"Cats":4, "Dogs":5}, {"x":[], "y":[]}]
I was planning on saving several of these lists within the same file, and I was also planning on using the pickle module to save this data. I also want to be able to access the pickled data and add items to the lists in the 2nd dictionary. So after I unpickle the data and edit, the list might look like this:
list = ["Animals", {"Cats":4, "Dogs":5}, {"x":[1, 2, 3], "y":[]}]
Preferably, I want to be able to save this list (using pickle) in the same file I took that piece of data from. However, if I simply re-pickle the data to the same file (let's say I originally saved it to "File"), I'll end up with two copies of the same list in that file:
a = open("File", "ab")
pickle.dump(list, a)
a.close()
Is there a way to replace the edited list in the file using pickle rather than adding a second (updated) copy? Or, is there another method I should consider for saving this data?
I think you want the shelve module. It creates a file (uses pickle under the hood) that contains the contents of a variable accessible by key (think persistent dictionary).
You could open the file for writing instead of appending -- then the changes would overwrite previous data. This is however a problem if there is more data stored in that file. If what you want really is to selectively replace data in a pickled file, I'm afraid this won't work with pickle. If this is a common operation, check if something like a sqlite database helps you to this end.
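To illustrate the shelve suggestion, here is a minimal sketch assuming Python 3 (where a shelf can be used as a context manager). The key "animals" and the file name are made up for the example:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "animals_shelf")

# Store the record under a key of our choosing.
with shelve.open(path) as db:
    db["animals"] = ["Animals", {"Cats": 4, "Dogs": 5}, {"x": [], "y": []}]

# Later: fetch the record, edit it, and store it back under the same key.
with shelve.open(path) as db:
    record = db["animals"]            # this is an unpickled copy
    record[2]["x"].extend([1, 2, 3])
    db["animals"] = record            # reassign so the change is saved

# The shelf now holds a single, updated entry for that key.
with shelve.open(path) as db:
    stored = db["animals"]
```

With the default writeback=False, mutating the retrieved object in place is not enough; the reassignment is what replaces the stored entry rather than appending a second copy.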

Pickle vs output to a file in python

I have a program that outputs some lists that I want to store to work with later. For example, suppose it outputs a list of student names and another list of their midterm scores. I can store this output in the following two ways:
Standard File Output way:
newFile = open('trialWrite1.py','w')
newFile.write(str(firstNames))
newFile.write(str(midterm1Scores))
newFile.close()
The pickle way:
newFile = open('trialWrite2.txt','w')
cPickle.dump(firstNames, newFile)
cPickle.dump(midterm1Scores, newFile)
newFile.close()
Which technique is better or preferred? Is there an advantage of using one over the other?
Thanks
I think the csv module might be a good fit here, since CSV is a standard format that can be both read and written by Python (and many other languages), and it's also human-readable. Usage could be as simple as
import csv
with open('trialWrite1.py','wb') as fileobj:
    newFile = csv.writer(fileobj)
    newFile.writerow(firstNames)
    newFile.writerow(midterm1Scores)
However, it'd probably make more sense to write one student per row, including their name and score. That can be done like this:
from itertools import izip
with open('trialWrite1.py','wb') as fileobj:
    newFile = csv.writer(fileobj)
    for row in izip(firstNames, midterm1Scores):
        newFile.writerow(row)
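The snippets above are written for Python 2. A rough Python 3 equivalent, including reading the rows back, might look like the sketch below; the sample names and scores and the .csv file name are invented for illustration:

```python
import csv
import os
import tempfile

first_names = ["Ann", "Bob", "Cleo"]     # sample data, not from the question
midterm1_scores = [88, 92, 79]

path = os.path.join(tempfile.mkdtemp(), "trialWrite1.csv")

# Python 3: open in text mode with newline="" and use the built-in zip.
with open(path, "w", newline="") as fileobj:
    writer = csv.writer(fileobj)
    for row in zip(first_names, midterm1_scores):
        writer.writerow(row)

# csv.reader yields strings, so the scores are converted back to int here.
with open(path, newline="") as fileobj:
    rows = [(name, int(score)) for name, score in csv.reader(fileobj)]
```

Note that CSV stores everything as text, so numeric columns need an explicit conversion when they are read back.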
pickle is more generic -- it allows you to dump many different kinds of objects to a file for later use. The downside is that the interim storage is not very human-readable, and not in a standard format.
Writing strings to a file, on the other hand, is a much better interface to other activities or code. But it comes at the cost of having to parse the text back into your Python object again.
Both are fine for this simple (list?) data; I would use write(str(firstNames)) simply because there's no need to use pickle. In general, how to persist your data to the filesystem depends on the data!
For instance, pickle will happily pickle functions and classes (by reference to their importable name), which you can't do by simply writing the string representations.
>>> data = range
>>> data
<class 'range'>
>>> pickle.dump( data, foo )
# stuff
>>> pickle.load( open( ..., "rb" ) )
<class 'range'>
For a completely different approach, consider that Python ships with SQLite. You could store your data in a SQL database without adding any third-party dependencies.
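A minimal sketch of the SQLite route, using the sqlite3 module from the standard library and assuming Python 3. The table name, column names, and sample data are invented, and an in-memory database is used so the example leaves no file behind (pass a file path to sqlite3.connect to persist it):

```python
import sqlite3

first_names = ["Ann", "Bob", "Cleo"]     # sample data, not from the question
midterm1_scores = [88, 92, 79]

# ":memory:" keeps the database in RAM; use a file path to store it on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT PRIMARY KEY, midterm1 INTEGER)")

# One row per student, with the values passed as query parameters.
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    zip(first_names, midterm1_scores),
)
conn.commit()

# Query the data back; the INTEGER column comes back as Python ints.
rows = conn.execute(
    "SELECT name, midterm1 FROM scores ORDER BY name"
).fetchall()
conn.close()
```

Unlike a pickle file, individual rows can be updated or replaced later with ordinary SQL statements.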
