What are the different ways to access really large csv files?

I had been working on a project where I had to read and process very large csv files with millions of rows as fast as possible.
I came across the link: https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/ where the author has benchmarked different ways of accessing csv and the time taken for each step.
He uses a catDevNull function, with the code as shown:
def catDevNull():
    os.system('cat %s > /dev/null' % fn)
The time taken in this case is the least. I believe it is independent of the Python version, since the time taken to read the file stays the same. Then he uses the warm cache method, as shown:
def wc():
    os.system('wc -l %s > /dev/null' % fn)
The above two methods are the fastest. pandas.read_csv takes less time than the other Python methods, but it is still slower than the two methods above.
When I assign x = os.system('cat %s > /dev/null' % fn) and check the data type of x, it is a string.
How does os.system read the file so that the time is so much less? Also, is there a way to access the files after they are read by os.system for further processing?
I am also curious why reading the file is so much faster in pandas than with the other methods shown in the link above.

os.system completely relinquishes the control you have in Python. There is no way to access anything which happened in the subprocess after it has finished.
A better way to have some (but not sufficient) control over a subprocess is to use the Python subprocess module. This allows you to interact with the running process using signals and I/O, but still, there is no way to affect the internals of a process unless it has a specific API for allowing you to do that. (Linux exposes some process internals in the /proc filesystem if you want to explore that.)
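A minimal sketch of that difference, assuming Python 3.7 or later: subprocess.run can hand the child's output back to the Python process, which os.system cannot do. The helper name run_and_capture is made up for illustration.

```python
import subprocess

def run_and_capture(cmd):
    """Run cmd (a list of arguments) and return (exit_code, stdout_bytes)."""
    # capture_output=True collects stdout/stderr in memory instead of
    # letting them go wherever the parent's output goes.
    result = subprocess.run(cmd, capture_output=True, check=False)
    return result.returncode, result.stdout

# os.system only reports an exit status; the bytes that cat read are gone.
# With subprocess the parent receives them, for example:
#     code, data = run_and_capture(["cat", fn])
```

Note that capturing the output means the whole file is held in memory by the parent, so this is not a replacement for the cat > /dev/null baseline.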
I don't think you understand what the benchmark means. The cat >/dev/null is a baseline which simply measures how quickly the system is able to read the file off the disk; your process cannot possibly be faster than the I/O channel permits, so this is the time that the system takes for doing nothing at all. You would basically subtract this time from the subsequent results before you compare their relative performance.
Conventionally, the absolute fastest way to read a large file is to index it, then use the index in memory to seek to the position inside the file you want to access. Building the index causes some overhead, but if you access the file more than once, the benefits soon outweigh the overhead. Importing your file to a database is a convenient and friendly way to do this; the database encapsulates the I/O completely and lets you query the data as if you could ignore that it is somehow serialized into bytes on a disk behind the scenes.
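As a rough illustration of the index-then-seek idea, here is a sketch that records the byte offset of every line once and then jumps straight to a given line. It assumes one record per line (no newlines inside quoted CSV fields) and UTF-8 text; the function names are hypothetical.

```python
def build_line_index(path):
    """Return a list with the byte offset at which each line starts."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(path, offsets, n):
    """Seek straight to line n (0-based) without scanning the lines before it."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode("utf-8").rstrip("\r\n")
```

Building the index costs one full pass over the file; each later lookup is a single seek plus one readline.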

Based on my testing, I found that it is a lot faster to query a pandas DataFrame than to query a database (tested with sqlite3).
Thus, the fastest way is to load the csv as a pandas DataFrame and then query the DataFrame as required. Also, if I need to save the data, I can pickle the DataFrame and reuse it as required. The time to pickle, unpickle and query is much less than the time to store the data in SQL and then query for the results.

Related

Python3 shelve items iterator for multiprocessing

I am trying to analyse a large shelve with multiprocessing.Pool. In read-only mode it should be thread safe, but it seems that a large object is being read first, then slowly dispatched through the pool. Can this be done more efficiently?
Here is a minimal example of what I'm doing. Assume that test_file.shelf already exists and is large (14GB+). I can see this snippet hogging 20GB of RAM, but only a small part of the shelve can be read at the same time (many more items than processors).
from multiprocessing import Pool
import shelve
def func(key_val):
    print(key_val)
with shelve.open('test_file.shelf', flag='r') as shelf,\
        Pool(4) as pool:
    list(pool.imap_unordered(func, iter(shelf.items())))

Shelves are inherently slow to open because they behave as dictionary-like objects, and opening one takes a while, especially for large shelves, because of how the backend works. To function as a dictionary-like object, each time you get a new item, that item is loaded into a separate dictionary in memory (see the library reference).
Also from the docs:
(imap) For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
You are using the default chunk size of 1, which is causing it to take a long time to get through your very large shelve file. The docs suggest chunking, instead of sending 1 item at a time, to speed it up.
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.
Last, just as a note, I'm not sure that your assumption about it being safe for multiprocessing is correct out of the box, depending on the implementation.
Edited:
As pointed out by juanpa.arrivillaga, this answer describes what is happening in the backend: your entire iterable may be consumed up front, which causes large memory usage.
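A small sketch of the chunksize suggestion. The pool is passed in so the same function works with multiprocessing.Pool or any object that offers imap_unordered; value_size and analyse are hypothetical names, and 256 is an arbitrary starting value to tune by timing.

```python
def value_size(item):
    """Placeholder for the real per-item analysis."""
    key, value = item
    return key, len(value)

def analyse(items, pool, chunksize=256):
    """Map value_size over items, sending chunksize items per task."""
    # With the default chunksize of 1 every item is dispatched on its own;
    # a larger value groups many items into each task sent to a worker.
    return dict(pool.imap_unordered(value_size, items, chunksize=chunksize))

# With the shelf from the question this could be called as
#     analyse(shelf.items(), pool)
# inside the existing "with shelve.open(...) as shelf, Pool(4) as pool:" block.
```

This only reduces dispatch overhead. It does not by itself change how much of the iterable is consumed up front.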

Python 3 - Faster Print & I/O

I'm currently involved in a Python project that involves handling massive amounts of data. In this, I have to print massive amounts of data to files. They are always one-liners, but they sometimes consist of millions of digits.
The actual mathematical operations in Python only take seconds, minutes at most. Printing them to a file takes up to several hours, which I don't always have.
Is there any way of speeding up the I/O?
From what I figure, the number is stored in the RAM (Or at least I assume so, it's the only thing which would take up 11GB of RAM), but Python does not print it to a text file immediately. Is there a way to dump that information -- if it is the number -- to a file? I've tried Task Manager's Dump, which gave me a 22GB dump file (Yes, you read that right), and it doesn't look like there's what I was looking for in there, albeit it wasn't very clear.
If it makes a difference, I have Python 3.5.1 (Anaconda and Spyder), Windows 8.1 x64 and 16GB RAM.
By the way, I do run Garbage Collect (gc module) inside the script, and I delete variables that are not needed, so those 11GB aren't just junk.

If you are indeed I/O bound by the time it takes to write the file, multi-threading with a pool of threads may help. Of course, there is a limit to that, but at least, it would allow you to issue non-blocking file writes.

Multithreading could speed it up (have writer threads, each with an in-memory queue that you write to).
From a system design standpoint, also evaluate whether or not you need to write everything to the file. Perhaps consider creating various levels of logging so that a release mode could run faster (if that makes sense in your context).
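The queue-backed writer thread mentioned at the start of this answer could look roughly like the sketch below. The names start_writer and stop_writer are made up, and whether this helps at all depends on the file write really being the bottleneck.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the writer thread to finish

def start_writer(path):
    """Start a background thread that writes queued strings to path."""
    q = queue.Queue()

    def drain():
        with open(path, "w") as out:
            while True:
                chunk = q.get()
                if chunk is _STOP:
                    break
                out.write(chunk)

    t = threading.Thread(target=drain)
    t.start()
    return q, t

def stop_writer(q, t):
    """Signal the writer to finish and wait until the file is closed."""
    q.put(_STOP)
    t.join()
```

The computing code calls q.put(text) and carries on immediately; only stop_writer blocks until everything is on disk.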

Use HDF5 file format
The problem is that you have to write a lot of data.
HDF5 is a format that is very efficient in size and can be accessed by various tools.
Be prepared for a few challenges:
- There are multiple Python packages for HDF5; you will have to find the one that fits your needs.
- Installation is not always simple (but there might be a Windows installation binary).
- Expect a bit of study to understand the data structures to be stored.
- It will occasionally need some CPU cycles: typically you write a lot of data quickly, and at some point it has to be flushed to disk. At that moment it starts compressing the data, which can take a few seconds. See "GIL for IO bounded thread in C extension (HDF5)".
Anyway, I think it is very likely that you will manage, and apart from faster writes you will also gain smaller files, which are simpler to handle.

How to have multiple python programs append rows to the same file?

I've got multiple python processes (typically 1 per core) transforming large volumes of data that they are each reading from dedicated sources, and writing to a single output file that each opened in append mode.
Is this a safe way for these programs to work?
Because of the tight performance requirements and large data volumes I don't think that I can have each process repeatedly open & close the file. Another option is to have each write to a dedicated output file and a single process concatenate them together once they're all done. But I'd prefer to avoid that.
Thanks in advance for any & all answers and suggestions.

Have you considered using the multiprocessing module to coordinate between the running programs in a thread-like manner? See in particular the queue interface; you can place each completed work item on a queue when completed, and have a single process reading off the queue and writing to your output file.
Alternately, you can have each subprocess maintain a separate pipe to a parent process which does a select() call from all of them, and copies data to the output file when appropriate. Of course, this can be done "by hand" (without the multiprocessing module) as well as with it.
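A sketch of the queue approach: workers put finished lines on a multiprocessing queue and one writer owns the output file. Here the parent process acts as the writer, the upper-casing is a placeholder for the real transformation, and all function names are hypothetical.

```python
import multiprocessing as mp

DONE = None  # sentinel: each worker sends one when it has finished

def worker(rows, out_queue):
    """Transform rows and hand each finished line to the writer."""
    for row in rows:
        out_queue.put(row.upper() + "\n")  # placeholder transformation
    out_queue.put(DONE)

def writer(out_queue, path, n_workers):
    """The only code that touches the output file, so lines never interleave."""
    finished = 0
    with open(path, "a") as out:
        while finished < n_workers:
            item = out_queue.get()
            if item is DONE:
                finished += 1
            else:
                out.write(item)

def run(sources, path):
    """Start one worker process per source; write in the calling process."""
    out_queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(rows, out_queue)) for rows in sources]
    for p in procs:
        p.start()
    writer(out_queue, path, len(procs))
    for p in procs:
        p.join()
```

run should be called from under an if __name__ == "__main__": guard so that it also works with start methods that re-import the main module.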
Alternately, if the reason you're avoiding threads is to avoid the global interpreter lock, you might consider a non-CPython implementation (such as Jython or IronPython).

Your procedure is "safe" in that no crashes will result, but data coming (with very unlucky timing) from different processes could get mixed up -- e.g., if process 1 is appending a long string of a's and process 2 a long string of b's, you could end up in the file with lots of a's, then the b's, then more a's (or other combinations / mixings).
Problem is, .write is not guaranteed to be atomic for sufficiently long string arguments. If you have a tight boundary on the arguments, less than your fs/os's blocksize, you might be lucky. Otherwise, try using the logging module, which does take more precautions (but perhaps those precautions might slow you down... you'll need to benchmark) exactly because it targets "log files" that are often being appended to by multiple programs.

What's the best way to divide large files in Python for multiprocessing?

I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or should I map the file iterable over a pool of processes using multiprocessing? I've experimented with these approaches, but the overhead of distributing the data line by line is immense. I've settled on a lightweight pipe-filters design using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly to the second input (see this post), but I'd like to have a solution contained entirely in Python.
Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).
Thanks,
Vince
Additional information: Processing time per line varies. Some problems are fast and barely not I/O bound, some are CPU-bound. The CPU-bound, non-dependent tasks will gain the most from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall clock time.
A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or queue in multiprocessing it is always over 100% slower.

One of the best architectures is already part of Linux OS's. No special libraries required.
You want a "fan-out" design.
A "main" program creates a number of subprocesses connected by pipes.
The main program reads the file, writing lines to the pipes doing the minimum filtering required to deal the lines to appropriate subprocesses.
Each subprocess should probably be a pipeline of distinct processes that read and write from stdin.
You don't need a queue data structure; that's exactly what an in-memory pipeline is: a queue of bytes between two concurrent processes.
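The fan-out design can also be driven from Python with the subprocess module. The sketch below deals lines round-robin to child processes over pipes; the child is a stand-in one-liner that upper-cases its input, and each child writes to its own output file to keep the example simple. The name fan_out is hypothetical.

```python
import subprocess
import sys

# Stand-in for a real worker program: upper-cases every line read from stdin.
CHILD = "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())\n"

def fan_out(lines, out_paths):
    """Deal lines round-robin to one subprocess per output path."""
    outs = [open(p, "w") for p in out_paths]
    procs = [
        subprocess.Popen([sys.executable, "-c", CHILD],
                         stdin=subprocess.PIPE, stdout=out, text=True)
        for out in outs
    ]
    for i, line in enumerate(lines):
        procs[i % len(procs)].stdin.write(line)
    for p in procs:
        p.stdin.close()  # EOF lets the child finish its loop
        p.wait()
    for out in outs:
        out.close()
```

The pipe itself is the queue: when a child falls behind, its pipe fills up and the main program blocks on write until the child catches up.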

One strategy is to assign each worker an offset, so if you have eight worker processes you assign them the numbers 0 to 7. Worker number 0 reads the first record, processes it, then skips 7 and goes on to process the 9th record, and so on; worker number 1 reads the second record, then skips 7 and processes the 10th record.
There are a number of advantages to this scheme. It doesn't matter how big the file is, the work is always divided evenly; processes on the same machine will process at roughly the same rate and use the same buffer areas, so you don't incur any excessive I/O overhead. As long as the file hasn't been updated, you can rerun individual threads to recover from failures.
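A short sketch of this offset scheme using itertools.islice, assuming one record per line. The function name is hypothetical.

```python
from itertools import islice

def records_for_worker(path, worker_id, n_workers):
    """Yield every n_workers-th line of path, starting at line worker_id."""
    with open(path) as f:
        # islice(f, start, stop, step) skips the lines owned by other workers.
        yield from islice(f, worker_id, None, n_workers)
```

Every worker still scans the whole file, so this leans on the OS file cache to keep the repeated reads cheap.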

You don't mention how you are processing the lines; possibly the most important piece of info.
Is each line independent? Is the calculation dependent on one line coming before the next? Must they be processed in blocks? How long does the processing for each line take? Is there a processing step that must incorporate "all" the data at the end? Or can intermediate results be thrown away and just a running total maintained? Can the file be initially split by dividing filesize by count of threads? Or does it grow as you process it?
If the lines are independent and the file doesn't grow, the only coordination you need is to farm out "starting addresses" and "lengths" to each of the workers; they can independently open and seek into the file and then you must simply coordinate their results; perhaps by waiting for N results to come back into a queue.
If the lines are not independent, the answer will depend highly on the structure of the file.

I know you specifically asked about Python, but I will encourage you to look at Hadoop (http://hadoop.apache.org/): it implements the Map and Reduce algorithm which was specifically designed to address this kind of problem.
Good luck

It depends a lot on the format of your file.
Does it make sense to split it anywhere? Or do you need to split it at a new line? Or do you need to make sure that you split it at the end of an object definition?
Instead of splitting the file, you should use multiple readers on the same file, using os.lseek to jump to the appropriate part of the file.
Update: Poster added that he wants to split on new lines. Then I propose the following:
Let's say you have 4 processes. Then the simple solution is to os.lseek to 0%, 25%, 50% and 75% of the file, and read bytes until you hit the first new line. That's your starting point for each process. You don't need to split the file to do this, just seek to the right location in the large file in each process and start reading from there.
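The seek-to-a-percentage idea can be sketched as follows. This version uses the file object's seek rather than os.lseek, and the function names are hypothetical; each (start, end) pair would be handed to one worker process.

```python
import os

def chunk_bounds(path, n_chunks):
    """Return (start, end) byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)  # jump to roughly i/n of the file
            f.readline()                  # move past the line we landed in
            cuts.append(f.tell())
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))

def read_chunk(path, start, end):
    """Return the complete lines stored between byte offsets start and end."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()
```

Because every cut is moved forward to the end of the line it lands in, no line is split between two workers and none is read twice.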

Fredrik Lundh's Some Notes on Tim Bray's Wide Finder Benchmark is an interesting read, about a very similar use case, with a lot of good advice. Various other authors also implemented the same thing, some are linked from the article, but you might want to try googling for "python wide finder" or something to find some more. (there was also a solution somewhere based on the multiprocessing module, but that doesn't seem to be available anymore)

If the run time is long, instead of having each process read its next line through a Queue, have the processes read batches of lines. This way the overhead is amortized over several lines (e.g. thousands or more).
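A minimal batching helper along those lines, built on itertools.islice. The name batches is made up; each yielded list would be put on the queue as one item.

```python
from itertools import islice

def batches(lines, size):
    """Yield lists of up to size lines from any iterable of lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

For example, iterating batches(open_file, 5000) puts 5000 lines behind each queue operation instead of one.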

How should I optimize this filesystem I/O bound program?

I have a python program that does something like this:
1. Read a row from a csv file.
2. Do some transformations on it.
3. Break it up into the actual rows as they would be written to the database.
4. Write those rows to individual csv files.
5. Go back to step 1 unless the file has been totally read.
6. Run SQL*Loader and load those files into the database.
Step 6 isn't really taking much time at all. It seems to be step 4 that's taking up most of the time. For the most part, I'd like to optimize this for handling a set of records in the low millions running on a quad-core server with a RAID setup of some kind.
There are a few ideas that I have to solve this:
1. Read the entire file from step one (or at least read it in very large chunks) and write the file to disk as a whole or in very large chunks. The idea being that the hard disk would spend less time going back and forth between files. Would this do anything that buffering wouldn't?
2. Parallelize steps 1, 2&3, and 4 into separate processes. This would make steps 1, 2, and 3 not have to wait on 4 to complete.
3. Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with step 2 somehow.
Of course, the correct answer to this question is "do what you find to be the fastest by testing." However, I'm mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?

Poor man's map-reduce:
1. Use split to break the file up into as many pieces as you have CPUs.
2. Use batch to run your muncher in parallel.
3. Use cat to concatenate the results.

Python already does IO buffering, and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data in RAM for too long, unless you force the OS to write immediately, for example by closing the file after each write or opening the file in O_SYNC mode.
If the OS isn't doing the right thing, you can try raising the buffer size (third parameter to open()). For some guidance on appropriate values: given a 100 MB/s, 10 ms latency IO system, a 1 MB IO size will result in approximately 50% latency overhead, while a 10 MB IO size will result in 9% overhead. If it's still IO bound, you probably just need more bandwidth. Use your OS-specific tools to check what kind of bandwidth you are getting to/from the disks.
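The overhead figures above follow from comparing the fixed latency with the transfer time of one request. A small sketch of that arithmetic, with the 100 MB/s and 10 ms values taken from the text and a hypothetical function name:

```python
def latency_overhead(io_size_mb, bandwidth_mb_s=100.0, latency_s=0.010):
    """Fraction of each IO request spent on latency rather than transfer."""
    transfer_s = io_size_mb / bandwidth_mb_s
    return latency_s / (latency_s + transfer_s)

for size_mb in (1, 10):
    print(f"{size_mb:>2} MB: {latency_overhead(size_mb):.0%} overhead")
#  1 MB: 50% overhead
# 10 MB: 9% overhead

# The buffer size itself is the third parameter of open(), for example:
#     out = open("rows.csv", "w", 1024 * 1024)   # 1 MB buffer
```

A 1 MB request takes 10 ms to transfer plus 10 ms of latency, so half the time is latency; a 10 MB request takes 100 ms plus the same 10 ms, which is about 9%.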
Also useful is to check if step 4 is taking a lot of time executing or waiting on IO. If it's executing you'll need to spend more time checking which part is the culprit and optimize that, or split out the work to different processes.

If you are I/O bound, the best way I have found to optimize is to read or write the entire file into/out of memory at once, then operate out of RAM from there on.
With extensive testing I found that my runtime ended up bound not by the amount of data I read from/wrote to disk, but by the number of I/O operations I used to do it. That is what you need to optimize.
I don't know Python, but if there is a way to tell it to write the whole file out of RAM in one go, rather than issuing a separate I/O for each byte, that's what you need to do.
Of course the drawback to this is that files can be considerably larger than available RAM. There are lots of ways to deal with that, but that is another question for another time.

Can you use a ramdisk for step 4? Low millions sounds doable if the rows are less than a couple of kB or so.

Use buffered writes for step 4.
Write a simple function that simply appends the output onto a string, checks the string length, and only writes when you have enough which should be some multiple of 4k bytes. I would say start with 32k buffers and time it.
You would have one buffer per file, so that most "writes" won't actually hit the disk.
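A sketch of such a per-file buffer. It collects the pieces in a list and joins them on flush instead of growing one string, which is an implementation choice of this example; the class name is hypothetical. Python's open() already buffers writes, so time this against simply passing a larger buffering value.

```python
class BufferedAppender:
    """Collect output in memory and write it once the buffer is large enough."""

    def __init__(self, path, limit=32 * 1024):
        self.path = path
        self.limit = limit  # 32k starting point from the text; tune by timing
        self.parts = []
        self.size = 0

    def write(self, text):
        self.parts.append(text)
        self.size += len(text)
        if self.size >= self.limit:
            self.flush()

    def flush(self):
        """Append everything buffered so far to the file in one write."""
        if self.parts:
            with open(self.path, "a") as out:
                out.write("".join(self.parts))
            self.parts = []
            self.size = 0
```

Create one BufferedAppender per output file and call flush() on each once the input has been fully read, otherwise the tail of the data stays in memory.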

Isn't it possible to collect a few thousand rows in ram, then go directly to the database server and execute them?
This would remove the save to and load from the disk that step 4 entails.
If the database server is transactional, this is also a safe way to do it - just have the database begin before your first row and commit after the last.
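A sketch of batched inserts inside one transaction. The question targets an Oracle database loaded with SQL*Loader; sqlite3 is used here only as a stand-in because it ships with Python, and the table and column names are invented.

```python
import sqlite3

def load_rows(conn, rows, batch_size=5000):
    """Insert rows in batches inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        for start in range(0, len(rows), batch_size):
            conn.executemany(
                "INSERT INTO records (id, name) VALUES (?, ?)",
                rows[start:start + batch_size],
            )
```

With another DB-API driver the placeholder style and the transaction handling may differ, so check that driver's documentation.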

The first thing is to be certain of what you should optimize. You seem to not know precisely where your time is going. Before spending more time wondering, use a performance profiler to see exactly where the time is going.
http://docs.python.org/library/profile.html
When you know exactly where the time is going, you'll be in a better position to know where to spend your time optimizing.
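For instance, the standard library's cProfile can wrap a single call and report where the time went. The helper name profile_call is hypothetical.

```python
import cProfile
import io
import pstats

def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries
    return result, report.getvalue()
```

Profiling the function that performs step 4 on a sample of the input should show whether the time goes to the transformation code or to the file writes.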
