I have a list of strings, and would like to pass this to an api that accepts only a file-like object, without having to concatenate/flatten the list to use the likes of StringIO.
The strings are utf-8, don't necessarily end in newlines, and if naively concatenated could be used directly in StringIO.
Preferred solution would be within the standard library (python3.8) (Given the shape of the data is naturally similar to a file (~identical to readlines() obviously), and memory access pattern would be efficient, I have a feeling I'm just failing to DuckDuckGo correctly) - but if that doesn't exist any "streaming" (no data concatenation) solution would suffice.
[Update, based on #JonSG's links]
Both RawIOBase and TextIOBase look provide an api that decouples arbitrarily sized "chunks"/fragments (in my case: strings in a list) from a file-like read which can specify its own read chunk size, while streaming the data itself (memory cost increases by only some window at any given time [dependent of course on behavior of your source & sink])
RawIOBase.readinto looks especially promising because it provides the buffer returned to client reads directly, allowing much simpler code - but this appears to come at the cost of one full copy (into that buffer).
TextIOBase.read() has its own cost for its operation solving the same subproblem, which is concatenating k (k much smaller than N) chunks together.
I'll investigate both of these.
Related
There's convenient peek function in io.BufferedReader. But
peek([n])
Return 1 (or n if specified) bytes from a buffer without advancing
the position. Only a single read on the raw stream is done to satisfy
the call. The number of bytes returned may be less than requested since
at most all the buffer’s bytes from the current position to the end are
returned.
it is returning too few bytes.
Where shall I get reliable multi-byte peek (without using read and disrupting other code nibbling the stream byte by byte and interpreting data from it)?
It depends on what you mean by reliable. The buffered classes are specifically tailored to prevent I/O as much as possible (as that is the whole point of a buffer) so they only guarantee that it will do 1 read of the buffer at most. The amount of data returned depends exclusively on the amount of data that the buffer already has in it.
If you need an exact amount of data, you will need to alter the underlying structures. In particular, you will probably need to re-open the stream with a bigger buffer.
If that is not an option, you could provide a wrapper class so that you can intercept the reads that you need and provide the data transparently to other code that actually want to consume the data.
I am naive to Python. But, what I came to know is that both are being used for serialization and deserialization. So, I just want to know what all basic differences in between them?
YAML is a language-neutral format that can represent primitive types (int, string, etc.) well, and is highly portable between languages. Kind of analogous to JSON, XML or a plain-text file; just with some useful formatting conventions mixed in -- in fact, YAML is a superset of JSON.
Pickle format is specific to Python and can represent a wide variety of data structures and objects, e.g. Python lists, sets and dictionaries; instances of Python classes; and combinations of these like lists of objects; objects containing dicts containing lists; etc.
So basically:
YAML represents simple data types & structures in a language-portable manner
pickle can represent complex structures, but in a non-language-portable manner
There's more to it than that, but you asked for the "basic" difference.
pickle is a special python serialization format when a python object is converted into a byte stream and back:
“Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse
operation, whereby a byte stream is converted back into an object
hierarchy.
The main point is that it is python specific.
On the other hand, YAML is language-agnostic and human-readable serialization format.
FYI, if you are choosing between these formats, think about:
serialization/derialization speed (see cPickle module)
do you need to store serialized files in a human-readable form?
what are you going to serialize? If it's a python-specific complex data structure, for example, then you should go with pickle.
See also:
Python serialization - Why pickle?
Lightweight pickle for basic types in python?
If it is not important for you to read files by a person, but you just need to save the file, and then read it, then use the pickle. It is much faster and the binaries weigh less.
YAML files are more readable as mentioned above, but also slower and larger in size.
I have tested for my application. I measured the time to upload and download an object to a file, as well as its size.
Serialization/deserialization method
Average time, s
Size of file, kB
PyYAML
1.73
1149.358
pickle
0.004
690.658
As you can see, yaml is 1,67 times heavier. And 432,5 times slower.
P. S. This is for my data. In your case, it may be different. But that's enough for comparison.
I'm using python to set up a computationally intense simulation, then running it in a custom built C-extension and finally processing the results in python. During the simulation, I want to store a fixed-length number of floats (C doubles converted to PyFloatObjects) representing my variables at every time step, but I don't know how many time steps there will be in advance. Once the simulation is done, I need to pass back the results to python in a form where the data logged for each individual variable is available as a list-like object (for example a (wrapper around a) continuous array, piece-wise continuous array or column in a matrix with a fixed stride).
At the moment I'm creating a dictionary mapping the name of each variable to a list containing PyFloatObject objects. This format is perfect for working with in the post-processing stage but I have a feeling the creation stage could be a lot faster.
Time is quite crucial since the simulation is a computationally heavy task already. I expect that a combination of A. buying lots of memory and B. setting up your experiment wisely will allow the entire log to fit in the RAM. However, with my current dict-of-lists solution keeping every variable's log in a continuous section of memory would require a lot of copying and overhead.
My question is: What is a clever, low-level way of quickly logging gigabytes of doubles in memory with minimal space/time overhead, that still translates to a neat python data structure?
Clarification: when I say "logging", I mean storing until after the simulation. Once that's done a post-processing phase begins and in most cases I'll only store the resulting graphs. So I don't actually need to store the numbers on disk.
Update: In the end, I changed my approach a little and added the log (as a dict mapping variable names to sequence types) to the function parameters. This allows you to pass in objects such as lists or array.arrays or anything that has an append method. This adds a little time overhead because I'm using the PyObject_CallMethodObjArgs function to call the Append method instead of PyList_Append or similar. Using arrays allows you to reduce the memory load, which appears to be the best I can do short of writing my own expanding storage type. Thanks everyone!
You might want to consider doing this in Cython, instead of as a C extension module. Cython is smart, and lets you do things in a pretty pythonic way, even though it at the same time lets you use C datatypes and python datatypes.
Have you checked out the array module? It allows you to store lots of scalar, homogeneous types in a single collection.
If you're truly "logging" these, and not just returning them to CPython, you might try opening a file and fprintf'ing them.
BTW, realloc might be your friend here, whether you go with a C extension module or Cython.
This is going to be more a huge dump of ideas rather than a consistent answer, because it sounds like that's what you're looking for. If not, I apologize.
The main thing you're trying to avoid here is storing billions of PyFloatObjects in memory. There are a few ways around that, but they all revolve on storing billions of plain C doubles instead, and finding some way to expose them to Python as if they were sequences of PyFloatObjects.
To make Python (or someone else's module) do the work, you can use a numpy array, a standard library array, a simple hand-made wrapper on top of the struct module, or ctypes. (It's a bit odd to use ctypes to deal with an extension module, but there's nothing stopping you from doing it.) If you're using struct or ctypes, you can even go beyond the limits of your memory by creating a huge file and mmapping in windows into it as needed.
To make your C module do the work, instead of actually returning a list, return a custom object that meets the sequence protocol, so when someone calls, say, foo.getitem(i) you convert _array[i] to a PyFloatObject on the fly.
Another advantage of mmap is that, if you're creating the arrays iteratively, you can create them by just streaming to a file, and then use them by mmapping the resulting file back as a block of memory.
Otherwise, you need to handle the allocations. If you're using the standard array, it takes care of auto-expanding as needed, but otherwise, you're doing it yourself. The code to do a realloc and copy if necessary isn't that difficult, and there's lots of sample code online, but you do have to write it. Or you may want to consider building a strided container that you can expose to Python as if it were contiguous even though it isn't. (You can do this directly via the complex buffer protocol, but personally I've always found that harder than writing my own sequence implementation.) If you can use C++, vector is an auto-expanding array, and deque is a strided container (and if you've got the SGI STL rope, it may be an even better strided container for the kind of thing you're doing).
As the other answer pointed out, Cython can help for some of this. Not so much for the "exposing lots of floats to Python" part; you can just move pieces of the Python part into Cython, where they'll get compiled into C. If you're lucky, all of the code that needs to deal with the lots of floats will work within the subset of Python that Cython implements, and the only things you'll need to expose to actual interpreted code are higher-level drivers (if even that).
I was wondering whether someone might know the answer to the following.
I'm using Python to build a character-based suffix tree. There are over 11 million nodes in the tree which fits in to approximately 3GB of memory. This was down from 7GB by using the slot class method rather than the Dict method.
When I serialise the tree (using the highest protocol) the resulting file is more than a hundred times smaller.
When I load the pickled file back in, it again consumes 3GB of memory. Where does this extra overhead come from, is it something to do with Pythons handling of memory references to class instances?
Update
Thank you larsmans and Gurgeh for your very helpful explanations and advice. I'm using the tree as part of an information retrieval interface over a corpus of texts.
I originally stored the children (max of 30) as a Numpy array, then tried the hardware version (ctypes.py_object*30), the Python array (ArrayType), as well as the dictionary and Set types.
Lists seemed to do better (using guppy to profile the memory, and __slots__['variable',...]), but I'm still trying to squash it down a bit more if I can. The only problem I had with arrays is having to specify their size in advance, which causes a bit of redundancy in terms of nodes with only one child, and I have quite a lot of them. ;-)
After the tree is constructed I intend to convert it to a probabilistic tree with a second pass, but may be I can do this as the tree is constructed. As construction time is not too important in my case, the array.array() sounds like something that would be useful to try, thanks for the tip, really appreciated.
I'll let you know how it goes.
If you try to pickle an empty list, you get:
>>> s = StringIO()
>>> pickle.dump([], s)
>>> s.getvalue()
'(l.'
and similarly '(d.' for an empty dict. That's three bytes. The in-memory representation of a list, however, contains
a reference count
a type ID, in turn containing a pointer to the type name and bookkeeping info for memory allocation
a pointer to a vector of pointers to actual elements
and yet more bookkeeping info.
On my machine, which has 64-bit pointers, the sizeof a Python list header object is 40 bytes, so that's one order of magnitude. I assume an empty dict will have similar size.
Then, both list and dict use an overallocation strategy to obtain amortized O(1) performance for their main operations, malloc introduces overhead, there's alignment, member attributes that you may or may not even be aware of and various other factors that get you the second order of magnitude.
Summing up: pickle is a pretty good compression algorithm for Python objects :)
Do you construct your tree once and then use it without modifying it further? In that case you might want to consider using separate structures for the dynamic construction and the static usage.
Dicts and objects are very good for dynamic modification, but they are not very space efficient in a read-only scenario. I don't know exactly what you are using your suffix tree for, but you could let each node be represented by a 2-tuple of a sorted array.array('c') and an equally long tuple of subnodes (a tuple instead of a vector to avoid overallocation). You traverse the tree using the bisect-module for lookup in the array. The index of a character in the array will correspond to a subnode in the subnode-tuple. This way you avoid dicts, objects and vector.
You could do something similar during the construction process, perhaps using a subnode-vector instead of subnode-tuple. But this will of course make construction slower, since inserting new nodes in a sorted vector is O(N).
BACKGROUND
The issue I'm working with is as follows:
Within the context of an experiment I am designing for my research, I produce a large number of large (length 4M) arrays which are somewhat sparse, and thereby could be stored as scipy.sparse.lil_matrix instances, or simply as scipy.array instances (the space gain/loss isn't the issue here).
Each of these arrays must be paired with a string (namely a word) for the data to make sense, as they are semantic vectors representing the meaning of that string. I need to preserve this pairing.
The vectors for each word in a list are built one-by-one, and stored to disk before moving on to the next word.
They must be stored to disk in a manner which could be then retrieved with dictionary-like syntax. For example if all the words are stored in a DB-like file, I need to be able to open this file and do things like vector = wordDB[word].
CURRENT APPROACH
What I'm currently doing:
Using shelve to open a shelf named wordDB
Each time the vector (currently using lil_matrix from scipy.sparse) for a word is built, storing the vector in the shelf: wordDB[word] = vector
When I need to use the vectors during the evaluation, I'll do the reverse: open the shelf, and then recall vectors by doing vector = wordDB[word] for each word, as they are needed, so that not all the vectors need be held in RAM (which would be impossible).
The above 'solution' fits my needs in terms of solving the problem as specified. The issue is simply that when I wish to use this method to build and store vectors for a large amount of words, I simply run out of disk space.
This is, as far as I can tell, because shelve pickles the data being stored, which is not an efficient way of storing large arrays, thus rendering this storage problem intractable with shelve for the number of words I need to deal with.
PROBLEM
The question is thus: is there a way of serializing my set of arrays which will:
Save the arrays themselves in compressed binary format akin to the .npy files generated by scipy.save?
Meet my requirement that the data be readable from disk as a dictionary, maintaining the association between words and arrays?
as JoshAdel already suggested, I would go for HDF5, the simplest way is to use h5py:
http://h5py.alfven.org/
you can attach several attributes to an array with a dictionary like sintax:
dset.attrs["Name"] = "My Dataset"
where dset is your dataset which can be sliced exactly as a numpy array, but in the background it does not load all the array into memory.
I would suggest to use scipy.save and have an dictionnary between the word and the name of the files.
Have you tried just using cPickle to pickle the dictionary directly using:
import cPickle
DD = dict()
f = open('testfile.pkl','wb')
cPickle.dump(DD,f,-1)
f.close()
Alternatively, I would just save the vectors in a large multidimensional array using hdf5 or netcdf if necessary since this allows you to open a large array without bringing it all into memory at once and then get slices as needed. You can then associate the words as an additional group in the netcdf4/hdf5 file and use the common indices to quickly associate the appropriate slice from each group, or just name the group the word and then have the data be the vector. You'd have to play around with which is more efficient.
http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html
Pytables also might be a useful storage layer on top of HDF5:
http://www.pytables.org
Avoid using shelve, it's bug ridden and has cross-platform issues.
The memory issue, however, has nothing to do with shelve. Numpy arrays provide efficient implementation of the pickle protocol and there is little memory overhead to cPickle.dumps(protocol=-1), compared to binary .npy (only the extra headers in pickle, basically).
So if binary/pickle isn't enough, you'll have to go for compression. Have a look at pytables or h5py (difference between the two).
If specifying the binary protocol in pickle is enough, you can consider something more lightweight than hdf5: check out sqlitedict for a replacement of shelve. It has no additional dependencies.