Python: quickly loading 7 GB of text files into unicode strings - python

I have a large directory of text files--approximately 7 GB. I need to load them quickly into Python unicode strings in iPython. I have 15 GB of memory total. (I'm using EC2, so I can buy more memory if absolutely necessary.)
Simply reading the files will be too slow for my purposes. I have tried copying the files to a ramdisk and then loading them from there into iPython. That speeds things up but iPython crashes (not enough memory left over?) Here is the ramdisk setup:
mount -t tmpfs none /var/ramdisk -o size=7g
Anyone have any ideas? Basically, I'm looking for persistent in-memory Python objects. The iPython requirement precludes using IncPy: http://www.stanford.edu/~pgbovine/incpy.html .
Thanks!

There is much that is confusing here, which makes it more difficult to answer this question:
The ipython requirement. Why do you need to process such large data files from within ipython instead of a stand-alone script?
The tmpfs RAM disk. I read your question as implying that you read all of your input data into memory at once in Python. If that is the case, then python allocates its own buffers to hold all the data anyway, and the tmpfs filesystem only buys you a performance gain if you reload the data from the RAM disk many, many times.
Mentioning IncPy. If your performance issues are something you could solve with memoization, why can't you just manually implement memoization for the functions where it would help most?
So. If you actually need all the data in memory at once -- if your algorithm reprocesses the entire dataset multiple times, for example -- I would suggest looking at the mmap module. That will provide the data in raw bytes instead of unicode objects, which might entail a little more work in your algorithm (operating on the encoded data, for example), but will use a reasonable amount of memory. Reading the data into Python unicode objects all at once will require either 2x or 4x as much RAM as it occupies on disk (assuming the data is UTF-8).
If your algorithm simply does a single linear pass over the data (as does the Aho-Corasick algorithm you mention), then you'd be far better off just reading in a reasonably sized chunk at a time:
with codecs.open(inpath, encoding='utf-8') as f:
data = f.read(8192)
while data:
process(data)
data = f.read(8192)
I hope this at least gets you closer.

I saw the mention of IncPy and IPython in your question, so let me plug a project of mine that goes a bit in the direction of IncPy, but works with IPython and is well-suited to large data: http://packages.python.org/joblib/
If you are storing your data in numpy arrays (strings can be stored in numpy arrays), joblib can use memmap for intermediate results and be efficient for IO.

Related

Indexing very large hex file with python

I'm trying to write a program that parses data from a (very) large file that contains even rows of 8 sets of 16 bit hex values. For instance, one row would look like this:
edfc b600 edfc 2102 81fb 0000 d1fe 0eff
The data files are expected to be anywhere between 1-4 TB, so I wasn't sure what the best approach would be. If I load this file using Python's open() function, could this turn out badly? I'm worried about how much of an impact this will have on my memory if I'm loading such a large file just to index through. Alternatively, if there's a method I can use to load just the section of data I want from the file, that would be ideal, but as far as I know, I don't think that's even possible. Is this correct?
Anyway, Some sort of idea as to how to approach this very general problem would be much appreciated!
Found an answer from Github. In numpy, there's a function called memmap that works for what I'm doing.
samples = np.memmap("hexdump_samples", mode="r", dtype=np.int16)[100:159]
This didn't seem to cause any issues with the smaller data set I was using, but I can't imagine this causing any issues with memory with the larger files. As far as I understand, this wouldn't cause any issues.
It depends on your computer hardware, how much RAM you have. Python is an interpreted language with a bunch of safeguards, but I wouldn't risk trying to open that file with Python. I would recommend using C or C++, they are good with large amounts of data and memory management. You can then parse the data in bite sized chunks, maybe 16MB per chunk. Python is a extremely slow and memory inefficient compared to C.

Decode big amount of "JSON-like" data in Python quickly

Say there are many (about 300,000) JSON files that take much time (about 30 minutes) to load into a list of Python objects. Profiling revealed that it is in fact not the file access but the decoding, which takes most of the time. Is there a format that I can convert these files to, which can be loaded much faster into a python list of objects?
My attempt: I converted the files to ProtoBuf (aka Google's Protocol Buffers) but even though I got really small files (reduced to ~20% of their original size), the time to load them did not improve that dramatically (still more than 20 minutes to load them all).
You might be looking into the wrong direction with the conversion as it will probably not cut your loading times as much as you would like. If the decoding is taking a lot of time, it will probably take quite some time from other formats as well, assuming that the JSON decoder is not really badly written. I am assuming the standard library functions have decent implementations, and JSON is not a lousy format for data storage speed-wise.
You could try running your program with PyPy instead of the default CPython implementation that I will assume you are using. PyPy could decrease the execution time tremendously. It has a faster JSON module and uses a JIT which might speed up your program a lot.
If you are using Python 3 you could also try using ProcessPoolExecutor to run the file loading and data deserialization / decoding concurrently. You will have to experiment with the degree of concurrency, but a good starting point is the number of your CPU cores, which you can halve or double. If your program waits for I/O a lot, you should run a higher degree of concurrency, if the degree of I/O is smaller you can try and reduce the concurrency. If you write each executor so that they load the data into Python objects and simply return them, you should be able to cut your loading times significantly. Note that you must use a process-driven approach, using threads will not work with the GIL.
You could also use a faster JSON library which could speed up your execution times two or three-fold in an optimal case. In a real-world use case the speed up will probably be smaller. Do note that these might not work with PyPy since it uses an alternative CFFI implementation and will not work with CPython programs, and PyPy has a good JSON module anyway.
Try ujson, it's quite a bit faster.
"Decoding takes most of the time" can be seen as "building the Python objects takes all the time". Do you really need all these things as Python objects in RAM all the time? It must be quite a lot.
I'd consider using a proper database for e.g. querying data of such size.
If you need mass processing of a different kind, e.g. stats or matrix processing, I'd take a look at pandas.

Caching CSV-read data with pandas for multiple runs

I'm trying to apply machine learning (Python with scikit-learn) to a large data stored in a CSV file which is about 2.2 gigabytes.
As this is a partially empirical process I need to run the script numerous times which results in the pandas.read_csv() function being called over and over again and it takes a lot of time.
Obviously, this is very time consuming so I guess there is must be a way to make the process of reading the data faster - like storing it in a different format or caching it in some way.
Code example in the solution would be great!
I would store already parsed DFs in one of the following formats:
HDF5 (fast, supports conditional reading / querying, supports various compression methods, supported by different tools/languages)
Feather (extremely fast - makes sense to use on SSD drives)
Pickle (fast)
All of them are very fast
PS it's important to know what kind of data (what dtypes) you are going to store, because it might affect the speed dramatically

how to rapidaly load data into memory with python?

I have a large csv file (5 GB) and I can read it with pandas.read_csv(). This operation takes a lot of time 10-20 minutes.
How can I speed it up?
Would it be useful to transform the data in a sqllite format? In case what should I do?
EDIT: More information:
The data contains 1852 columns and 350000 rows. Most of the columns are float65 and contain numbers. Some other contains string or dates (that I suppose are considered as string)
I am using a laptop with 16 GB of RAM and SSD hard drive. The data should fit fine in memory (but I know that python tends to increase the data size)
EDIT 2 :
During the loading I receive this message
/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1841,1842,1844) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
EDIT: SOLUTION
Read one time the csv file and save it as
data.to_hdf('data.h5', 'table')
This format is incredibly efficient
This actually depends on which part of reading it is taking 10 minutes.
If it's actually reading from disk, then obviously any more compact form of the data will be better.
If it's processing the CSV format (you can tell this because your CPU is at near 100% on one core while reading; it'll be very low for the other two), then you want a form that's already preprocessed.
If it's swapping memory, e.g., because you only have 2GB of physical RAM, then nothing is going to help except splitting the data.
It's important to know which one you have. For example, stream-compressing the data (e.g., with gzip) will make the first problem a lot better, but the second one even worse.
It sounds like you probably have the second problem, which is good to know. (However, there are things you can do that will probably be better no matter what the problem.)
Your idea of storing it in a sqlite database is nice because it can at least potentially solve all three at once; you only read the data in from disk as-needed, and it's stored in a reasonably compact and easy-to-process form. But it's not the best possible solution for the first two, just a "pretty good" one.
In particular, if you actually do need to do array-wide work across all 350000 rows, and can't translate that work into SQL queries, you're not going to get much benefit out of sqlite. Ultimately, you're going to be doing a giant SELECT to pull in all the data and then process it all into one big frame.
Writing out the shape and structure information, then writing the underlying arrays in NumPy binary form. Then, for reading, you have to reverse that. NumPy's binary form just stores the raw data as compactly as possible, and it's a format that can be written blindingly quickly (it's basically just dumping the raw in-memory storage to disk). That will improve both the first and second problems.
Similarly, storing the data in HDF5 (either using Pandas IO or an external library like PyTables or h5py) will improve both the first and second problems. HDF5 is designed to be a reasonably compact and simple format for storing the same kind of data you usually store in Pandas. (And it includes optional compression as a built-in feature, so if you know which of the two you have, you can tune it.) It won't solve the second problem quite as well as the last option, but probably well enough, and it's much simpler (once you get past setting up your HDF5 libraries).
Finally, pickling the data may sometimes be faster. pickle is Python's native serialization format, and it's hookable by third-party modules—and NumPy and Pandas have both hooked it to do a reasonably good job of pickling their data.
(Although this doesn't apply to the question, it may help someone searching later: If you're using Python 2.x, make sure to explicitly use pickle format 2; IIRC, NumPy is very bad at the default pickle format 0. In Python 3.0+, this isn't relevant, because the default format is at least 3.)
Python has two built-in libraries called pickle and cPickle that can store any Python data structure.
cPickle is identical to pickle except that cPickle has trouble with Unicode stuff and is 1000x faster.
Both are really convenient for saving stuff that's going to be re-loaded into Python in some form, since you don't have to worry about some kind of error popping up in your file I/O.
Having worked with a number of XML files, I've found some performance gains from loading pickles instead of raw XML. I'm not entirely sure how the performance compares with CSVs, but it's worth a shot, especially if you don't have to worry about Unicode stuff and can use cPickle. It's also simple, so if it's not a good enough boost, you can move on to other methods with minimal time lost.
A simple example of usage:
>>> import pickle
>>> stuff = ["Here's", "a", "list", "of", "tokens"]
>>> fstream = open("test.pkl", "wb")
>>> pickle.dump(stuff,fstream)
>>> fstream.close()
>>>
>>> fstream2 = open("test.pkl", "rb")
>>> old_stuff = pickle.load(fstream2)
>>> fstream2.close()
>>> old_stuff
["Here's", 'a', 'list', 'of', 'tokens']
>>>
Notice the "b" in the file stream openers. This is important--it preserves cross-platform compatibility of the pickles. I've failed to do this before and had it come back to haunt me.
For your stuff, I recommend writing a first script that parses the CSV and saves it as a pickle; when you do your analysis, the script associated with that loads the pickle like in the second block of code up there.
I've tried this with XML; I'm curious how much of a boost you will get with CSVs.
If the problem is in the processing overhead, then you can divide the file into smaller files and handle them in different CPU cores or threads. Also for some algorithms the python time will increase non-linearly and the dividing method will help in these cases.

Creating very large NUMPY arrays in small chunks (PyTables vs. numpy.memmap)

There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.
QUESTION:
I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each dimension in the array contains the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in the RAM is not suitable, since it floods up the RAM and brings the machine to a halt. So I need to be able to initialize the larger array and fill it in small batches, so that each batch gets written to the larger array on disk.
I initially thought that numpy.memmap is the way to go, but when I issue a command like
mmapData = np.memmap(mmapFile,mode='w+', shape=(large_no1,large_no2))
the RAM floods and the machine slows to a halt.
After poking around a bit it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. Also, it was hard to find a simple example in the doc or elsewhere which illustrates this common use-case.
IF anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do this, please let me know! Any refs. to examples appreciated!
That's weird. The np.memmap should work. I've been using it with 250Gb data on a 12Gb RAM machine without problems.
Does the system really runs out of memory at the very moment of the creation of the memmap file? Or it happens along the code? If it happens at the file creation I really don't know what the problem would be.
When I started using memmap I've made some mistakes that led me to memory run out. For me, something like the below code should work:
mmapData = np.memmap(mmapFile, mode='w+', shape = (smallarray_size,number_of_arrays), dtype ='float64')
for k in range(number_of_arrays):
smallarray = np.fromfile(list_of_files[k]) # list_of_file is the list with the files name
smallarray = do_something_with_array(smallarray)
mmapData[:,k] = smallarray
It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.
Ps: Be aware that the default dtype value for memmap(int) and fromfile(float) are different!
HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
There are out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see docs on stacking.)

Categories