I'm working on an application where it is convenient to store many (around 10^6-10^9) small files (roughly 1 KB-10 KB each) on disk. I need to iterate over them multiple times for a computation. Would compressing them into a single file significantly reduce the IO bottleneck? I haven't measured different solutions yet, but I would like to get pointers on what to expect. This application can easily become IO bound.
At the moment, I am doing a file comparison on 2 CSV files, checking for duplicate lines in each specific file, checking for data mismatches between the files, and checking for missing data rows in each file.
Currently, I am doing this entirely in memory, built for speed because it will be processing thousands of files constantly. This comes at a price, though: it can only process files it can completely store in memory.
I am looking to add a fallback so that, if for some reason the files can't fit in memory (although this should never happen), the comparison can still be performed.
What would be a good approach to do this?
Use pandas. Can't beat it for data analysis in Python.
https://pandas.pydata.org/pandas-docs/stable/10min.html
Comes complete with a
read_csv(filepath, skiprows=100000, nrows=9999999)
function that loads only the specified rows.
It's built on NumPy, most of whose methods are implemented in C, making them incredibly fast.
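If the files might not fit in memory, read_csv can also hand the data back in chunks. A minimal sketch, assuming plain CSV input (the file name and chunk size are placeholders):

    import pandas as pd

    # Process the CSV in pieces instead of loading it all at once.
    # "data.csv" and the 100k-row chunk size are placeholder values.
    for chunk in pd.read_csv("data.csv", chunksize=100_000):
        # each `chunk` is an ordinary DataFrame
        # note: this only sees duplicates inside one chunk; finding duplicates
        # across chunks needs extra bookkeeping (e.g. a set of row hashes)
        print(chunk.duplicated().sum())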
I'm trying to apply machine learning (Python with scikit-learn) to a large dataset stored in a CSV file which is about 2.2 gigabytes.
As this is a partially empirical process, I need to run the script numerous times, which results in the pandas.read_csv() function being called over and over again, and that takes a lot of time.
Obviously, this is very time consuming, so I guess there must be a way to make the process of reading the data faster - like storing it in a different format or caching it in some way.
Code example in the solution would be great!
I would store already parsed DFs in one of the following formats:
HDF5 (fast, supports conditional reading / querying, supports various compression methods, supported by different tools/languages)
Feather (extremely fast - makes sense to use on SSD drives)
Pickle (fast)
All of them are very fast
PS: it's important to know what kind of data (what dtypes) you are going to store, because it might affect the speed dramatically.
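For illustration, a minimal caching sketch under the assumption that the CSV is parsed once and the binary copy is reused on later runs (the paths are placeholders; pickle could be swapped for HDF5 or Feather):

    import os
    import pandas as pd

    CSV_PATH = "data.csv"       # placeholder paths
    CACHE_PATH = "data.pkl"

    if os.path.exists(CACHE_PATH):
        df = pd.read_pickle(CACHE_PATH)    # fast binary reload
    else:
        df = pd.read_csv(CSV_PATH)         # slow text parse, done only once
        df.to_pickle(CACHE_PATH)           # to_hdf / to_feather work similarly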
I intend to use Python's multiprocessing capabilities to read a set of small files. However, this seems awkward to me in some sense, because if the disk is rotational then the bottleneck is the rotation time, and even though I use multiple processes, the total read time should be similar to a single-process read. Am I wrong? What are your comments?
In addition, do you think using multiprocessing might cause intertwined reading of the files, so that the contents of these files are skewed in some way?
Your reasoning is sound, but the only way to find out for sure is by benchmarking (that said, it is unlikely that reading many small files in parallel will increase performance over reading them sequentially).
I am not entirely sure what you mean by "intertwined reading", but -- unless there are bugs in your code or the files are being changed while you're reading them -- you will get exactly the same contents irrespective of how you read it.
You are indeed right: the bottleneck will be disk I/O.
However, the only way to really know is to measure both approaches.
If you have influence on the files, you could go for one larger file as opposed to many smaller files.
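To measure it, something like the following rough sketch could compare sequential and multiprocessing reads (the glob pattern and pool size are placeholders to adjust):

    import glob
    import time
    from multiprocessing import Pool

    def read_one(path):
        # read the whole file; return its size so the work isn't optimised away
        with open(path, "rb") as f:
            return len(f.read())

    if __name__ == "__main__":
        paths = glob.glob("data/*.txt")          # placeholder pattern

        t0 = time.perf_counter()
        sizes_seq = [read_one(p) for p in paths]
        print("sequential:", time.perf_counter() - t0, "s")

        t0 = time.perf_counter()
        with Pool(4) as pool:                    # pool size is a guess; tune it
            sizes_par = pool.map(read_one, paths)
        print("parallel:  ", time.perf_counter() - t0, "s")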
I've currently got a project running on PiCloud that involves multiple iterations of an ODE solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iteration's results being appended to the bottom of the array from the previous iterations.
Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory, and deal with them all at once. Except PiCloud has a fairly restrictive cap on the size of the data that can be returned outright by a function, to keep down transmission costs. Which is fine, except that means I'd have to launch thousands of jobs, each running one iteration, with considerable overhead.
It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.
Is my best bet to do this just dumping it into a CSV file? Should I add to the CSV file each iteration, or hold it all in an array until the end and then just write once? Is there something terribly clever I'm missing?
Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as this will inevitably involve a loss of precision.
The most efficient option is probably tofile (doc), which is intended for quick dumps of data to disk when you know all of the attributes of the data ahead of time.
For platform-independent, but numpy-specific, saves, you can use save (doc).
Numpy and scipy also have support for various scientific data formats like HDF5 if you need portability.
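For illustration, a minimal sketch of the save/load round trip (the solver call and file name are placeholders; unlike CSV, np.save keeps full precision):

    import numpy as np

    def solve_ode(i):
        # placeholder for the real ODE solver; returns one (30, 1500) block
        return np.random.rand(30, 1500)

    results = [solve_ode(i) for i in range(10)]   # e.g. 10 iterations
    output = np.vstack(results)                   # stack blocks row-wise

    np.save("results.npy", output)                # binary, platform-independent
    restored = np.load("results.npy")             # collect it later, losslessly
    assert np.array_equal(output, restored)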
I would recommend looking at the pickle module. The pickle module allows you to serialize python objects as streams of bytes (e.g., strings). This allows you to write them to a file or send them over a network, and then reinstantiate the objects later.
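A minimal sketch of that round trip (the file name is a placeholder):

    import pickle

    data = {"iteration": 3, "rows": [[1.0, 2.0], [3.0, 4.0]]}

    with open("results.pkl", "wb") as f:    # placeholder file name
        pickle.dump(data, f)                # serialize to a byte stream on disk

    with open("results.pkl", "rb") as f:
        restored = pickle.load(f)           # reinstantiate the object later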
Try Joblib - Fast compressed persistence
One of the key components of joblib is its ability to persist arbitrary Python objects and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed has been to save the numpy arrays in separate files and load them via memmapping.
Edit:
Newer (2016) blog entry on data persistence in Joblib
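For illustration, a minimal joblib sketch (the file name and compression level are placeholders):

    import numpy as np
    from joblib import dump, load

    payload = {"weights": np.random.rand(1000, 100), "meta": {"run": 1}}

    dump(payload, "payload.joblib", compress=3)   # compressed persistence on disk
    restored = load("payload.joblib")             # fast reload

    # for uncompressed dumps, load("payload.joblib", mmap_mode="r")
    # memory-maps the numpy arrays instead of copying them into RAM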
Is it faster to open a large file once and read it completely into a list, or to open smaller files (whose total size equals the large file), load each into a list, and manipulate them one by one?
Which is faster? Is the difference in time large enough to impact my program?
A total time difference of less than 30 seconds is negligible for me.
It depends on whether your data fits in your available memory. If you need to resort to paging or virtual memory, then opening a single giant file might become slower than opening more smaller files. This will be even more true if the computation you need to make creates intermediate variables that won't fit in physical RAM either.
So, as long as the file is not that big, one opening will be faster; but if that is not true, then many openings may be faster.
Lastly, note that if you do many openings, you might be able to do them in parallel and process the various parts in different processes, which might make things faster again.
Obviously one open and close is going to be faster than n opens and closes if you are reading the same amount of data. Plus, when reading a single file, the I/O classes you use can take advantage of things like buffering, etc., which makes it even faster.
If you are reading the file sequentially from start until end, one open/close is faster than multiple open/close operations.
However, keep in mind that if you need to do a lot of seeking within your one big file, then separate files may not be slower in that case.
Also, keep in mind that no matter which approach you are using, you shouldn't read the entire file in at once. Do it in chunks.
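For example, a minimal chunked-read sketch (the file name and chunk size are placeholders):

    CHUNK_SIZE = 64 * 1024                    # 64 KiB per read; tune as needed
    total = 0

    with open("big_file.bin", "rb") as f:     # placeholder file name
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:                     # empty bytes means end of file
                break
            total += len(chunk)               # replace with the real per-chunk work

    print(total, "bytes processed")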
Working with a single file is almost certainly going to be faster: you have to read the same amount of data in both cases, but when working with multiple files, you have that much more housekeeping operations slowing you down.
Additionally, you can read data from a single file at the maximum speed the disk can handle, using the disk buffer to the maximum, etc., whereas with multiple files the disk head does a lot more dancing, jumping from file to file.
A 30-second time difference? Define large. Everything that fits into an average computer's RAM would probably not take much more than 30 seconds in total.
Why do you think you need to read the file(s) into a list?
If you can open several small files and process each independently, then surely that means:
(a) that you don't need to read into a list, you can process any file (including 1 large file) a line at a time (avoiding running-out-of-real-memory problems)
or
(b) what you need to do is more complicated than you have told us.