I have some data types I need to write:
a. A list of numpy arrays, e.g. [ndarray, ndarray, ndarray] of different sizes.
b. Any arbitrary numpy array, e.g. np.zeros((5,6)), np.random.randn(76, 2), and so on.
c. Any other future datatype that hasn't occurred to me yet.
Requirements:
I need a single function to be able to save all those data types, with no specific handling, and with future compatibility for type c stated above.
I also need the output file to be dumped in a human-readable format.
So far, I have only been able to achieve the first requirement with either YAML or pickle, both of which produce binary output, i.e. files that are not human readable.
@staticmethod
def _read_with_yaml(path):
    with open(path, 'r') as stream:
        return yaml.load(stream)

@staticmethod
def _write_with_yaml(path, obj):
    with io.open(path, 'w+', encoding='utf8') as outfile:
        yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)
This example code outputs non-human-readable files, but works for the data types I have.
Is there a way to achieve both requirements?
No, your requirements cannot be satisfied.
You already have one function, yaml.dump(), that saves all those data types. As you noticed, it doesn't do so in a very readable way for numpy data structures. This is because numpy doesn't provide dumping routines for its special data structures, so YAML falls back to the not-so-readable, default !!python/... tagged dump. You (or the YAML or Numpy package maintainers) could provide special routines for those objects that dump in a more readable format, so that part could be covered: you can make the representer in your YAML library more intelligent and get more readable output for Numpy data structures without touching the Numpy classes.
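As a rough sketch of that last point, assuming PyYAML and that dumping arrays as plain nested lists is acceptable (you lose dtype and shape information on loading):

import numpy as np
import yaml

# Represent any ndarray as a plain YAML sequence of nested lists,
# which keeps the dump human readable at the cost of round-trip fidelity.
def ndarray_representer(dumper, array):
    return dumper.represent_list(array.tolist())

yaml.add_representer(np.ndarray, ndarray_representer)

print(yaml.dump([np.zeros((2, 3)), np.arange(4)], default_flow_style=False))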
But you want this for all future datatypes, and IMO a variation of Gödel's theorem applies: even if the YAML library is extended so that it covers all known cases and dumps them in a readable way, there will always be new data structures, especially in C-based extensions (like Numpy), that cannot be represented in a readable way without extra work.
So because of your premise of "any other future unknown datatype that hasn't occurred to me yet", this is not just a lot of difficult work, but impossible.
Related
I have a question. It may be an easy one, but I could not find a good approach. I have two Python programs. The first of them produces two outputs: one is a huge list (containing thousands of other lists) and the other is a simple csv file for Weka. I need to store this list (the first output) somehow so that I can use it later as input to the second program. I cannot just pass it straight to the second program, because after the first program finishes, Weka also has to produce new output for the second program. Hence, the second program has to wait for the outputs of both the first program and Weka.
The problem is that the output list consists of lots of lists holding numerical values. A simple example could be:
list1 = [[1,5,7],[14,3,27], [19,12,0], [23,8,17], [12,7]]
If I write this to a txt file, then when I try to read it back, all the values come in as strings. Is there any easy and fast way (since the data is big) to read all the values back as integers? Or maybe, in the first place, to write them as integers?
I think this is a good case for the pickle module.
To save data:
import pickle
lst = [[1,5,7],[14,3,27], [19,12,0], [23,8,17], [12,7]]
pickle.dump(lst, open('data.pkl', 'wb'))
To read data from saved file:
import pickle
lst = pickle.load(open('data.pkl', 'rb'))
From documentation:
The pickle module implements a fundamental, but powerful algorithm for
serializing and de-serializing a Python object structure. “Pickling”
is the process whereby a Python object hierarchy is converted into a
byte stream, and “unpickling” is the inverse operation, whereby a byte
stream is converted back into an object hierarchy. Pickling (and
unpickling) is alternatively known as “serialization”, “marshalling,”
[1] or “flattening”, however, to avoid confusion, the terms used here
are “pickling” and “unpickling”.
There's also the faster cPickle module (Python 2 only; in Python 3, pickle uses the C implementation automatically):
To save data:
from cPickle import Pickler
p = Pickler(open('data2.pkl', 'wb'))
p.dump(lst)
To read data from saved file:
from cPickle import Unpickler
up = Unpickler(open('data2.pkl', 'rb'))
lst = up.load()
How about pickling the list output rather than outputting it as a plaintext representation? Have a look at the documentation for your version: it's basically a way to write Python objects to file, which you can then read from Python at any point to get identical objects.
Once you have the file open that you want to output to, the outputting difference will be quite minor, e.g.
import pickle
my_list = [[1, 2], [134, 76], [798, 5, 2]]
with open('outputfile.pkl', 'wb') as output:
    pickle.dump(my_list, output, -1)  # -1 selects the highest available pickle protocol
And then just use the following way to read it in from your second program:
import pickle
my_list = pickle.load(open('outputfile.pkl', 'rb'))
I am working with cPickle to convert structured data into a datastream format and pass it to a library. What I have to do is read the file contents from the manually written file "targetstrings.txt" and convert the contents of the file into the format the Netcdf library needs, in the following manner.
Note: targetstrings.txt contains Latin characters
op=open("targetstrings.txt",'rb')
targetStrings=cPickle.load(op)
The Netcdf library takes the contents as strings.
While loading the file, it fails with the following error:
cPickle.UnpicklingError: invalid load key, 'A'.
Please tell me how I can rectify this error; I have googled around but did not find an appropriate solution. Any suggestions?
pickle is not for reading/writing generic text files, but for serializing/deserializing Python objects to a file. If you want to read text data you should use Python's usual IO functions:
with open('targetstrings.txt', 'r') as f:
    fileContent = f.read()
If, as it seems, the library just wants to have a list of strings, taking each line as a list element, you just have to do:
with open('targetstrings.txt', 'r') as f:
    lines = [line for line in f]
    # now 'lines' holds the lines read from the file
As stated, pickle is not meant to be used in this way.
If you need to manually edit complex Python objects that are to be read and passed as Python objects to another function, there are plenty of other formats to use, for example XML, JSON, or Python files themselves. Pickle uses a Python-specific protocol that, while not binary (in version 0 of the protocol) and not changing across Python versions, is not meant for this, and it is not even the recommended method for recording Python objects for persistence or communication (although it can be used for those purposes).
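For example, a minimal JSON sketch for a plain list of strings (the file name and contents here are only illustrative):

import json

targetStrings = ['alpha', 'beta', 'gamma']  # illustrative contents

# JSON files are plain text, so they stay easy to inspect and edit by hand.
with open('targetstrings.json', 'w') as f:
    json.dump(targetStrings, f, indent=2)

with open('targetstrings.json') as f:
    targetStrings = json.load(f)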
I recently asked a question regarding how to save large Python objects to file. I had previously run into problems converting massive Python dictionaries to strings and writing them to file via write(). Now I am using pickle. Although it works, the files are incredibly large (> 5 GB). I have little experience with files this large. I wanted to know if it would be faster, or even possible, to zip this pickle file prior to writing it to disk.
You can compress the data with bzip2:
from __future__ import with_statement # Only for Python 2.5
import bz2,json,contextlib
hugeData = {'key': {'x': 1, 'y':2}}
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'wb')) as f:
    json.dump(hugeData, f)
Load it like this:
from __future__ import with_statement # Only for Python 2.5
import bz2,json,contextlib
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'rb')) as f:
    hugeData = json.load(f)
You can also compress the data using zlib or gzip with pretty much the same interface. However, both zlib and gzip's compression rates will be lower than the one achieved with bzip2 (or lzma).
Pure Python code would be extremely slow at implementing data serialization.
If you tried to create an equivalent of pickle in pure Python, you would see that it is super slow.
Fortunately, the built-in modules that perform this are quite good.
Apart from cPickle, you will find the marshal module, which is a lot faster.
But it needs a real file handle (not a file-like object).
You can import marshal as Pickle and see the difference.
I don't think you can make a custom serializer which is a lot faster than this...
Here's an actual (not so old) serious benchmark of Python serializers
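As a small sketch of the marshal interface mentioned above (the file name is illustrative; note that marshal output is tied to the Python version that wrote it, so it is only suitable for short-lived caches):

import marshal

lst = [[1, 5, 7], [14, 3, 27], [19, 12, 0]]

# marshal wants a real binary file object, not an arbitrary file-like object.
with open('data.marshal', 'wb') as f:
    marshal.dump(lst, f)

with open('data.marshal', 'rb') as f:
    lst2 = marshal.load(f)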
faster, or even possible, to zip this pickle file prior to [writing]
Of course it's possible, but there's no reason to try to make an explicit zipped copy in memory (it might not fit!) before writing it, when you can automatically cause it to be zipped as it is written, with built-in standard library functionality ;)
See http://docs.python.org/library/gzip.html. Basically, you create a special kind of stream with
gzip.GzipFile("output file name", "wb")
and then use it exactly like an ordinary file created with open(...) (or file(...) for that matter).
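A minimal sketch of that idea, compressing the pickle as it is written (the file and variable names are illustrative):

import gzip
import pickle

hugeData = {'key': {'x': 1, 'y': 2}}

# The GzipFile behaves like an ordinary binary file; pickle streams into it
# and the bytes are compressed on their way to disk.
with gzip.GzipFile('data.pkl.gz', 'wb') as f:
    pickle.dump(hugeData, f, -1)

with gzip.GzipFile('data.pkl.gz', 'rb') as f:
    hugeData = pickle.load(f)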
I'd just expand on phihag's answer.
When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since it requires additional memory of 1-2 times the size of the object in order to serialize. That's true even when streaming it to BZ2File. In my case I was even running out of swap space.
But the problem with JSON (and similarly with HDF files as mentioned in the linked article) is that it cannot serialize tuples, which in my data are used as keys to dicts. There is no great solution for this; the best I could find was to convert tuples to strings, which requires some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the json library.
For tuples composed of strings (requires strings to contain no commas):
import ujson as json
from bz2 import BZ2File
bigdata = { ('a','b','c') : 25, ('d','e') : 13 }
bigdata = dict([(','.join(k), v) for k, v in bigdata.viewitems()])
f = BZ2File('filename.json.bz2',mode='wb')
json.dump(bigdata,f)
f.close()
To re-compose the tuples:
bigdata = dict([(tuple(k.split(',')),v) for k,v in bigdata.viewitems()])
Alternatively if e.g. your keys are 2-tuples of integers:
bigdata2 = { (1,2): 1.2, (2,3): 3.4}
bigdata2 = dict([('%d,%d' % k, v) for k, v in bigdata2.viewitems()])
# ... save, load ...
bigdata2 = dict([(tuple(map(int,k.split(','))),v) for k,v in bigdata2.viewitems()])
Another advantage of this approach over pickle is that JSON appears to compress significantly better than pickles when using bzip2 compression.
Look at Google's Protocol Buffers. Although they are not designed out of the box for large files such as audio/video files, they do well with object serialization as in your case, because that is what they were designed for. Practice shows that some day you may need to update the structure of your files, and Protocol Buffers will handle that. Also, they are highly optimized for compression and speed. And you're not tied to Python; Java and C++ are well supported.
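As a rough illustration only: this assumes you have written a students.proto definition and generated students_pb2.py with protoc, and the message and field names here are made up for the example.

import students_pb2  # hypothetical module generated by protoc

record = students_pb2.Student()  # hypothetical message type
record.name = 'Alice'
record.scores.extend([87, 92, 78])

# SerializeToString produces the compact binary wire format.
data = record.SerializeToString()

decoded = students_pb2.Student()
decoded.ParseFromString(data)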
I have a program that outputs some lists that I want to store to work with later. For example, suppose it outputs a list of student names and another list of their midterm scores. I can store this output in the following two ways:
Standard File Output way:
newFile = open('trialWrite1.py','w')
newFile.write(str(firstNames))
newFile.write(str(midterm1Scores))
newFile.close()
The pickle way:
newFile = open('trialWrite2.txt','w')
cPickle.dump(firstNames, newFile)
cPickle.dump(midterm1Scores, newFile)
newFile.close()
Which technique is better or preferred? Is there an advantage of using one over the other?
Thanks
I think the csv module might be a good fit here, since CSV is a standard format that can be both read and written by Python (and many other languages), and it's also human-readable. Usage could be as simple as
import csv

with open('trialWrite1.py', 'wb') as fileobj:
    newFile = csv.writer(fileobj)
    newFile.writerow(firstNames)
    newFile.writerow(midterm1Scores)
However, it'd probably make more sense to write one student per row, including their name and score. That can be done like this:
import csv
from itertools import izip

with open('trialWrite1.py', 'wb') as fileobj:
    newFile = csv.writer(fileobj)
    for row in izip(firstNames, midterm1Scores):
        newFile.writerow(row)
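To read the rows back later, a sketch in the same Python 2 style, assuming the file written above (csv hands every field back as a string, so convert the scores explicitly):

import csv

with open('trialWrite1.py', 'rb') as fileobj:
    reader = csv.reader(fileobj)
    rows = [(name, int(score)) for name, score in reader]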
pickle is more generic -- it allows you to dump many different kinds of objects to a file for later use. The downside is that the interim storage is not very human-readable, and not in a standard format.
Writing strings to a file, on the other hand, is a much better interface to other activities or code. But it comes at the cost of having to parse the text back into your Python object again.
Both are fine for this simple (list?) data; I would use write( firstNames ) simply because there's no need to use pickle. In general, how to persist your data to the filesystem depends on the data!
For instance, pickle will happily pickle functions, which you can't do by simply writing the string representations.
>>> data = range
>>> data
<class 'range'>
>>> pickle.dump( data, foo )
# stuff
>>> pickle.load( open( ..., "rb" ) )
<class 'range'>
For a completely different approach, consider that Python ships with SQLite. You could store your data in a SQL database without adding any third-party dependencies.
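A minimal sketch of that idea with the built-in sqlite3 module (the table and file names are illustrative, and sample lists stand in for the question's data):

import sqlite3

firstNames = ['Alice', 'Bob']  # sample data in place of the question's lists
midterm1Scores = [87, 92]

conn = sqlite3.connect('students.db')
conn.execute('CREATE TABLE IF NOT EXISTS scores (name TEXT, midterm1 INTEGER)')
conn.executemany('INSERT INTO scores VALUES (?, ?)', zip(firstNames, midterm1Scores))
conn.commit()

# Later, from any program:
rows = conn.execute('SELECT name, midterm1 FROM scores').fetchall()
conn.close()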
I'm used to C++, and I build my data handling classes/functions to handle stream objects instead of files. I'd like to know how I might modify the following code, so that it can handle a stream of binary data in memory, rather than a file handle.
def get_count(self):
    curr = self.file.tell()
    self.file.seek(0, 0)
    count, = struct.unpack('I', self.file.read(c_uint32_size))
    self.file.seek(curr, 0)
    return count
In this case, the code assumes self.file is a file, opened like so:
file = open('somefile.data', 'r+b')
How might I use the same code, yet instead do something like this:
file = get_binary_data()
Where get_binary_data() returns a string of binary data. Although the code doesn't show it, I also need to write to the stream (I didn't think it was worth posting the code for that).
Also, if possible, I'd like the new code to handle files as well.
You can use an instance of StringIO.StringIO (or cStringIO.StringIO, which is faster; in Python 3, io.BytesIO for binary data) to give a file-like interface to in-memory data.
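A short sketch of that idea; the sample bytes and the c_uint32_size constant are illustrative, and io.BytesIO (the Python 3 equivalent for binary data) is used here:

import io
import struct

c_uint32_size = struct.calcsize('I')

# Wrap an in-memory byte string so it supports read/seek/tell like a file.
stream = io.BytesIO(struct.pack('I', 42) + b'payload')

curr = stream.tell()
stream.seek(0, 0)
count, = struct.unpack('I', stream.read(c_uint32_size))
stream.seek(curr, 0)
print(count)  # prints 42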
Take a look at Python's StringIO module (see its documentation); it could be pretty much what you're after.
Have a look at 'StringIO' (Read and write strings as files)
Use StringIO.