I've got multiple python processes (typically 1 per core) transforming large volumes of data that they are each reading from dedicated sources, and writing to a single output file that each opened in append mode.
Is this a safe way for these programs to work?
Because of the tight performance requirements and large data volumes I don't think that I can have each process repeatedly open & close the file. Another option is to have each write to a dedicated output file and a single process concatenate them together once they're all done. But I'd prefer to avoid that.
Thanks in advance for any & all answers and suggestions.
Have you considered using the multiprocessing module to coordinate between the running programs in a thread-like manner? See in particular the queue interface; you can place each completed work item on a queue when completed, and have a single process reading off the queue and writing to your output file.
Alternately, you can have each subprocess maintain a separate pipe to a parent process which does a select() call from all of them, and copies data to the output file when appropriate. Of course, this can be done "by hand" (without the multiprocessing module) as well as with it.
Alternately, if the reason you're avoiding threads is to avoid the global interpreter lock, you might consider a non-CPython implementation (such as Jython or IronPython).
Your procedure is "safe" in that no crashes will result, but data coming (with very unlucky timing) from different processes could get mixed up -- e.g., process 1 is appending a long string of as, process 2 a long string of b, you could end up in the file with lots of as then the bs then more as (or other combinations / mixings).
Problem is, .write is not guaranteed to be atomic for sufficiently long string arguments. If you have a tight boundary on the arguments, less than your fs/os's blocksize, you might be lucky. Otherwise, try using the logging module, which does take more precautions (but perhaps those precautions might slow you down... you'll need to benchmark) exactly because it targets "log files" that are often being appended to by multiple programs.
Related
I'm working on a project in Python 3 that involves reading lines from a text file, manipulating those lines in some way, and then writing the results of said manipulation into another text file. Implementing that flow in a serial way is trivial.
However, running every step serially takes a long time (I'm working on text files that are several hundred megabytes/several gigabytes in size). I thought about breaking up the process into multiple, actual system processes. Based on the recommended best practices, I'm going to use Python's multiprocessing library.
Ideally, there should be one and only one Process to read from and write to the text files. The manipulation part, however, is where I'm running into issues.
When the "reader process" reads a line from the initial text file, it places that line in a Queue. The "manipulation processes" then pull from that line from the Queue, do their thing, then put the result into yet another Queue, which the "writer process" then takes and writes to another text file. As it stands right now, the manipulation processes simply check to see if the "reader Queue" has data in it, and if it does, they get() the data from the Queue and do their thing. However, those processes may be running before the reader process runs, thus causing the program to stall.
What, in your opinions, would be the "Best Way" to schedule the processes in such a way so the manipulation processes won't run until the reader process has put data into the Queue, and vice-versa with the writer process? I considered firing off custom signals, but I'm not sure if that's the most appropriate way forward. Any help will be greatly appreciated!
If I were you, I would separate the tasks of dividing your file into tractable chunks and the compute-intensive manipulation part. If that is not possible (for example, if lines are not independent for some reason), then you might have to do a purely serial implementation anyway.
Once you have N chunks in separate files, you can just start your serial manipulation script N times, for each chunk. Afterwards, combine the output back into one file. If you do it it this way, no queue is needed and you will save yourself some work.
You're describing a task queue. Celery is a task queue: http://www.celeryproject.org/
I have a program that takes a very huge input file and makes a dict out of it. Since there is no way this is going to fit in memory, I Decided to use shelve to write it to my disk. Now I need to take advantage of the multiple cores available in my system (8 of them) so that I can speed up my parsing. The most obvious way to do this I thought was to split my input file into 8 parts and run the code on all 8 parts concurrently. The problem is that I need only 1 dictionary in the end. Not 8 of them. So how do I use shelve to update one single dictionary parallely?
I gave a pretty detailed answer here on Processing single file from multiple processes in python
Don't try to figure out how you can have many processes write to a shelve at once. Think about how you can have a single process deliver results to the shelve.
The idea is that you have a single process producing the input to a queue. Then you have as many workers as you want receiving queued items and doing the work. When they are done, they place the result into a result queue for the sink to read. The benefit is that you do not have to manually split up your work ahead of time. Just produce the "input" and let whatever worker is read take it and work on it.
With this pattern, you can scale up or down the workers based on the system capabilities.
shelve doesn't support concurrent access. There are a few options for accomplishing what you want:
Make one shelf per process and then merge at the end.
Have worker processes send their results back to the master process over eg multiprocessing.Pipe; the master then stores them in the shelf.
I think you can get bsddb to work with concurrent access in a shelve-like API, but I've never had the need to do so.
My projects needs multiple threads reading files from the same folder. This folder has incoming files and the file should only be processed by any one of those threads. Later, this file reading thread, deletes the file after processing it.
EDIT after the first answer: I don't want a single thread in charge of reading filenames and feeding those names to other threads, so that they can read it.
Is there any efficient way of achieving this in python?
You should probably use the Queue module. From the docs:
The Queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads.
I would use a FIFO approach, with a thread in charge of checking for inbound files and queuing them, and a number of workers processing them. A LIFO approach or an approach in which priority is assigned with a custom method are also supported by the module.
EDIT: If you don't want to use the Queue module and you are under a *nix system, you could use fcntl.lockf instead. An alternative, opening the files with os.open('filename', os.O_EXLOCK).
Depending on how often you perform this operation, you might find it less performing than using Queue, as you will have to account for race conditions (i.e.: you might acquire the name of the file to open, but the file might get locked by another thread before you get a chance to open it, throwing an exception that you will have to trap). Queue is there for a reason! ;)
EDIT2: Comments in this and other questions are bringing up the problem with simultaneous disk access to different files and the consequent performance hit. I was thinking that task_done would have been used for preventing this, but reading others' comments it occurred to me that instead of queuing file names, one could queue the files' content directly. This second alternative would work only for a limited amount of limited size queued files, given that RAM would fill up rather quickly otherwise.
I'm unaware if RAID and other parallel disk configurations would already take care of reading one file per disk rather than bouncing back and forth between two files on both disks.
HTH!
If you want multiple threads to read directly from the same folder several files in parallel, then I must disappoint you. Reading in parallel from a single disk is not a viable option. A single disk needs to spin and seek the next location to be read. If you're reading with multiple threads, you are just bouncing the disk around between seeks and the performance is much worse than a simple sequential read.
Just stick to mac's advice and use a single thread for reading.
Consider:
pipe_read, pipe_write = os.pipe()
Now, I would like to know two things:
(1) I have two threads. If I guarantee that only one is reading os.read(pipe_read,n) and the other is only writing os.write(pipe_write), will I have any problem, even if the two threads do it simultaneously? Will I get all data that was written in the correct order? What happens if they do it simultaneously? Is it possible that a single write is read in pieces, like?:
Thread 1: os.write(pipe_write, '1234567')
Thread 2: os.read(pipe_read,big_number) --> '123'
Thread 2: os.read(pipe_read,big_number) --> '4567'
Or -- again, consider simultaneity -- will a single os.write(some_string) always return entirely by a single os.read(pipe_read, very_big_number)?
(2) Consider more than one thread writing to the pipe_write end of the pipe using logging.handlers.FileHandler() -- I've read that the logging module is threadsafe. Does this mean that I can do this without losing data? I think I won't be able to control the order of the data in the pipe; but this is not a requirement.
Requirements:
all data written by some threads on the write end must come out at the read end
a string written by a single logger.info(), logger.error(), ... has to stay in one piece.
Are these reqs fulfilled?
Thank you in advance,
Jan-Philip Gehrcke
os.read and os.write on the two fds returned from os.pipe is threadsafe, but you appear to demand more than that. Sub (1), yes, there is no "atomicity" guarantee for sinle reads or writes -- the scenario you depict (a single short write ends up producing two reads) is entirely possible. (In general, os.whatever is a thin wrapper on operating system functionality, and it's up to the OS to ensure, or fail to ensure, the kind of functionality you require; in this case, the Posix standard doesn't require the OS to ensure this kind of "atomicity"). You're guaranteed to get all data that was written, and in the correct order, but that's it. A single write of a large piece of data might stall once it's filled the OS-supplied buffer and only proceed once some other thread has read some of the initial data (beware deadlocks, of course!), etc, etc.
Sub (2), yes, the logging module is threadsafe AND "atomic" in that data produced by a single call to logging.info, logging.warn, logging.error, etc, "stays in one piece" in terms of calls to the underlying handler (however if that handler in turn uses non-atomic means such as os.write, it may still e.g. stall in the kernel until the underlying buffer gets unclogged, etc, etc, as above).
I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2gb), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or, should I map the file iterable over a pool of processes using multiprocessing? I've experimented with these approaches but the overhead is immense in distribution the data line by line. I've settled on a lightweight pipe-filters design by using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly to the second input (see this post), but I'd like to have a solution contained entirely in Python.
Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).
Thanks,
Vince
Additional information: Processing time per line varies. Some problems are fast and barely not I/O bound, some are CPU-bound. The CPU bound, non-dependent tasks will gain the post from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall clock time.
A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or queue in multiprocessing it is always over 100% slower.
One of the best architectures is already part of Linux OS's. No special libraries required.
You want a "fan-out" design.
A "main" program creates a number of subprocesses connected by pipes.
The main program reads the file, writing lines to the pipes doing the minimum filtering required to deal the lines to appropriate subprocesses.
Each subprocess should probably be a pipeline of distinct processes that read and write from stdin.
You don't need a queue data structure, that's exactly what an in-memory pipeline is -- a queue of bytes between two concurrent processes.
One strategy is to assign each worker an offset so if you have eight worker processes you assign then numbers 0 to 7. Worker number 0 reads the first record processes it then skips 7 and goes on to process the 8th record etc., worker number 1 reads the second record then skips 7 and processes the 9th record.........
There are a number of advantages to this scheme. It doesnt matter how big the file is the work is always divided evenly, processes on the same machine will process at roughly the same rate, and use the same buffer areas so you dont incur any excessive I/O overhead. As long as the file hasnt been updated you can rerun individual threads to recover from failures.
You dont mention how you are processing the lines; possibly the most important piece of info.
Is each line independant? Is the calculation dependant on one line coming before the next? Must they be processed in blocks? How long does the processing for each line take? Is there a processing step that must incorporate "all" the data at the end? Or can intermediate results be thrown away and just a running total maintained? Can the file be initially split by dividing filesize by count of threads? Or does it grow as you process it?
If the lines are independant and the file doesn't grow, the only coordination you need is to farm out "starting addresses" and "lengths" to each of the workers; they can independantly open and seek into the file and then you must simply coordinate their results; perhaps by waiting for N results to come back into a queue.
If the lines are not independant, the answer will depend highly on the structure of the file.
I know you specifically asked about Python, but I will encourage you to look at Hadoop (http://hadoop.apache.org/): it implements the Map and Reduce algorithm which was specifically designed to address this kind of problem.
Good luck
It depends a lot on the format of your file.
Does it make sense to split it anywhere? Or do you need to split it at a new line? Or do you need to make sure that you split it at the end of an object definition?
Instead of splitting the file, you should use multiple readers on the same file, using os.lseek to jump to the appropriate part of the file.
Update: Poster added that he wants to split on new lines. Then I propose the following:
Let's say you have 4 processes. Then the simple solution is to os.lseek to 0%, 25%, 50% and 75% of the file, and read bytes until you hit the first new line. That's your starting point for each process. You don't need to split the file to do this, just seek to the right location in the large file in each process and start reading from there.
Fredrik Lundh's Some Notes on Tim Bray's Wide Finder Benchmark is an interesting read, about a very similar use case, with a lot of good advice. Various other authors also implemented the same thing, some are linked from the article, but you might want to try googling for "python wide finder" or something to find some more. (there was also a solution somewhere based on the multiprocessing module, but that doesn't seem to be available anymore)
If the run time is long, instead of having each process read its next line through a Queue, have the processes read batches of lines. This way the overhead is amortized over several lines (e.g. thousands or more).