Python: is os.read() / os.write() on an os.pipe() threadsafe?

Consider:
pipe_read, pipe_write = os.pipe()
Now, I would like to know two things:
(1) I have two threads. If I guarantee that only one reads with os.read(pipe_read, n) and the other only writes with os.write(pipe_write, ...), will I have any problem, even if the two threads do it simultaneously? Will I get all data that was written, in the correct order? What happens if they do it simultaneously? Is it possible that a single write is read in pieces, like this?:
Thread 1: os.write(pipe_write, '1234567')
Thread 2: os.read(pipe_read,big_number) --> '123'
Thread 2: os.read(pipe_read,big_number) --> '4567'
Or -- again, consider simultaneity -- will everything written by a single os.write(pipe_write, some_string) always be returned in one piece by a single os.read(pipe_read, very_big_number)?
(2) Consider more than one thread writing to the pipe_write end of the pipe using logging.handlers.FileHandler() -- I've read that the logging module is threadsafe. Does this mean that I can do this without losing data? I think I won't be able to control the order of the data in the pipe; but this is not a requirement.
Requirements:
all data written by some threads on the write end must come out at the read end
a string written by a single logger.info(), logger.error(), ... has to stay in one piece.
Are these requirements fulfilled?
Thank you in advance,
Jan-Philip Gehrcke

os.read and os.write on the two fds returned from os.pipe are threadsafe, but you appear to demand more than that. Sub (1): yes, there is no "atomicity" guarantee for single reads or writes -- the scenario you depict (a single short write ending up producing two reads) is entirely possible. (In general, os.whatever is a thin wrapper over operating-system functionality, and it's up to the OS to ensure, or fail to ensure, the kind of functionality you require; in this case, the POSIX standard doesn't require the OS to ensure this kind of "atomicity".) You're guaranteed to get all data that was written, and in the correct order, but that's it. A single write of a large piece of data might also stall once it has filled the OS-supplied buffer, and only proceed once some other thread has read some of the initial data (beware deadlocks, of course!), and so on.
Sub (2): yes, the logging module is threadsafe AND "atomic" in that data produced by a single call to logging.info, logging.warn, logging.error, etc. "stays in one piece" in terms of calls to the underlying handler (however, if that handler in turn uses non-atomic means such as os.write, it may still, e.g., stall in the kernel until the underlying buffer gets unclogged, as above).
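To make the read-side implication concrete, here is a small sketch (mine, not from the answer) of the usual fix: loop on os.read() until you have all the bytes you expect, since a single write may arrive in pieces.

import os

def read_exactly(fd, n):
    # Keep calling os.read() until n bytes have arrived or the write end closes.
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:          # EOF: write end closed
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

pipe_read, pipe_write = os.pipe()
os.write(pipe_write, b'1234567')
print(read_exactly(pipe_read, 7))   # b'1234567', even if it was delivered in pieces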

Related

How can I get the most recent value from a data stream in a separate Python multiprocessing process?

I have a process that reads from a sensor and another that graphs the readings. However, the grapher can be slower than the sensor reads, especially when many sensors are added.
I see two options for passing the information from the sensor to the grapher: pipes and mp.Value. Pipes, from what I know, should be faster, but I worry about the grapher falling behind: if the sensor samples n times as fast as the grapher, then with every grapher time step we only progress 1/n of a timestep (e.g., if it samples twice as fast, after 20 s the grapher has only displayed 10 s). I could have the sensor poll the pipe and remove all values before adding a new one, but that sounds computationally expensive. The mp.Value route requires more explicit locking and I believe isn't as fast as the Pipe class, although I don't know for sure.
What would be the best way to approach this multiprocessing to avoid issues here?
Edit for clarification: I don't care if the grapher gets all the information. Using the most recent value is fine, which is why the title says "Pipe only Last Value". The main requirement of the grapher is just to have the plot not get delayed, even if we effectively downsample by throwing away data. The sensor does need to sample faster than the grapher reads though as the data is also being recorded and processed, and we don't want to downsample that information.
To get the most up-to-date sensor value, you actually need the sensor process to wait until the grapher is ready to receive data. There are several ways to do this, but I think using two unidirectional (duplex=False) pipes is the best way to go, because you don't need to involve any extra threads or semaphores. In this setup, the first pipe sends data sensor->grapher as normal, while the second simply signals that the grapher is ready to immediately accept data. It's a little awkward to express in prose, so here is pseudo code:
def grapher():
    while True:
        data = pipe_to_grapher.recv()
        graph(data)
        pipe_to_sensor.send(None)  # Can be any value

def sensor():
    while True:
        data = sense()
        if pipe_to_sensor.poll():
            pipe_to_grapher.send(data)  # Freshest possible
            pipe_to_sensor.recv()       # Clear the pipe
        record(data)
Note that the sensor can simply pass right on by if poll() returns False, as it is an indication that the grapher is not ready for data yet. You can also easily extend the system to use special values to communicate something about the state of one process to the other, such as a shutdown command.
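For completeness, here is a rough sketch (my wiring, not part of the original answer) of how the two duplex=False pipes could be created and handed to the processes; in practice grapher() and sensor() would take the relevant connection ends as arguments rather than using globals:

import multiprocessing as mp

if __name__ == '__main__':
    # Pipe(duplex=False) returns (receive_end, send_end).
    to_grapher_recv, to_grapher_send = mp.Pipe(duplex=False)  # sensor -> grapher data
    to_sensor_recv, to_sensor_send = mp.Pipe(duplex=False)    # grapher -> sensor "ready" signal

    g = mp.Process(target=grapher, args=(to_grapher_recv, to_sensor_send))
    s = mp.Process(target=sensor, args=(to_grapher_send, to_sensor_recv))
    g.start()
    s.start()
    g.join()
    s.join()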
(Pre-edit answer follows)
This question appears to be asking about applying backpressure to your sensor data flow. It sounds like a multiprocessing.Queue might be a good solution for your specific case. Internally it uses a pipe, so it will have similar performance characteristics, and it can be created with a maxsize parameter that you can set to a low number like 1, so that the sensor process's put() will wait until the grapher process has retrieved an item before going back to acquire more data.
If the sensor has its own buffer that needs clearing, you can use put_nowait() instead and catch the Full exception as an indication that the grapher won't be able to plot that data and it should be discarded. This saves the overhead of pickling and sending the data, but it can lead to very rapid polling of the sensor, which may itself be a source of overhead, depending on the device/drivers/API.
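A minimal sketch of the maxsize=1 variant; the sensor read, the recording step, and the plotting call below are stand-ins of mine, not from the question:

import multiprocessing as mp
import random
import time
from queue import Full   # Queue.Full on Python 2

def sensor(q):
    while True:
        data = random.random()   # stand-in for a real sensor read
        try:
            q.put_nowait(data)   # drop the sample if the grapher is still busy
        except Full:
            pass
        time.sleep(0.01)         # stand-in for recording/processing the sample

def grapher(q):
    while True:
        data = q.get()           # always the freshest accepted sample
        print("plotting", data)  # stand-in for the real plotting call

if __name__ == '__main__':
    q = mp.Queue(maxsize=1)
    mp.Process(target=sensor, args=(q,), daemon=True).start()
    grapher(q)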

How to use StringIO with independent read and write pointer?

In my Python program, in a worker thread, I'm constantly and asynchronously receiving a raw byte buffer (a log output, actually) from an embedded device. After conversion, it is an ASCII string containing line breaks.
In the same program I would like to consume it as lines from the main thread, from time to time.
I chose io.StringIO for this as it does the linebreaks-to-lines split and has the nice readline interface.
In my thread I'm calling write() on the StringIO object, and the main thread would consume lines from time to time (via readline()).
However, the StringIO class does not behave like a FIFO or ring buffer: it does not have separate read and write pointers.
Before readline() I'd have to move the file pointer back to the last read position, and when done, back to the end, so that write() will append correctly.
I could write my own string-line ring buffer with that interface, taking concurrent access and so on into account. But that is actually a hard task, and I'm wondering whether there is already something that fulfills my needs.
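For what it's worth, a minimal sketch (my illustration, not an existing library) of the pointer-juggling the question describes, wrapped in a lock so the device thread and the reading main thread don't collide:

import io
import threading

class LineBuffer:
    """StringIO wrapper with an independent read position."""
    def __init__(self):
        self._buf = io.StringIO()
        self._read_pos = 0
        self._lock = threading.Lock()

    def write(self, text):            # called from the device thread
        with self._lock:
            self._buf.seek(0, io.SEEK_END)
            self._buf.write(text)

    def readline(self):               # called from the main thread
        with self._lock:
            self._buf.seek(self._read_pos)
            line = self._buf.readline()
            if line.endswith('\n'):
                self._read_pos = self._buf.tell()
                return line
            return ''                 # incomplete line: wait for more data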

Where to write parallelized program output to?

I have a program that is using pool.map() to get values using ten parallel workers. I'm having trouble wrapping my head around how I am supposed to stitch the values back together to make use of them at the end.
What I have is structured like this:
initial_input = get_initial_values()
pool.map(function, initial_input)
pool.close()
pool.join()
# now how would I get the output?
send_ftp_of_output(output_data)
Would I write the function to a log file? If so, if there are (as a hypothetical) a million processes trying to write to the same file, would things overwrite each other?
pool.map(function,input)
returns a list.
You can get the output by doing:
output_data = pool.map(function,input)
pool.map simply runs the map function in parallel, but it still returns only a single list. If you're not outputting anything in the function you are mapping (and you shouldn't), then it simply returns a list. This is the same as map() would do, except that it is executed in parallel.
In regards to the log file, yes, having multiple threads write to the same place would interleave entries within the log file. You could have each thread lock the file before writing, which would ensure that an entry doesn't get interrupted mid-write, but entries would still be interleaved chronologically amongst all the threads. Locking the log file on each write would also significantly slow down logging due to the overhead involved.
You can also include, say, the thread number -- %(thread)d -- or some other identifying mark in the logging Formatter output to help differentiate, but it could still be hard to follow, especially for a bunch of threads.
Not sure if this would work in your specific application, as the specifics in your app may preclude it, however, I would strongly recommend considering GNU Parallel (http://www.gnu.org/software/parallel/) to do the parallelized work. (You can use, say, subprocess.check_output to call into it).
The benefit of this is severalfold: chiefly, you can easily vary the number of parallel workers -- up to having parallel use one worker per core on the machine -- and it will pipeline the items accordingly. The other main benefit, and the one more specifically related to your question, is that it will stitch the output of all of these parallel workers together as if they had been invoked serially.
If your program wouldn't work so well with, say, a single command line piped from a file and parallelized within the app, you could perhaps make your Python code a single worker, generate a number of permutations of your Python command line (varying the target each time) as the commands piped to parallel, and then have it collect the output.
I use GNU Parallel quite often in conjunction with Python, often to do things, like, say, 6 simultaneous Postgres queries using psql from a list of 50 items.
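As a rough illustration (the worker script, targets, and command lines here are made up, not from the question), calling GNU Parallel from Python and collecting the stitched-together output might look like this:

import subprocess

targets = ["alpha", "beta", "gamma", "delta"]                  # hypothetical work items
commands = ["python worker.py --target %s" % t for t in targets]

# -j 4: four parallel jobs; --keep-order: stitch output in input order.
output = subprocess.check_output(
    ["parallel", "-j", "4", "--keep-order", ":::"] + commands
)
print(output.decode())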
Using Tritlo's suggestion, here is what worked for me:
def run_updates(input_data):
    # do something
    return {data}

if __name__ == '__main__':
    item = iTunes()
    item.fetch_itunes_pulldowns_to_do()
    initial_input_data = item.fetched_update_info
    pool = Pool(NUM_IN_PARALLEL)
    result = pool.map(run_updates, initial_input_data)
    pool.close()
    pool.join()
    print result
And this gives me a list of results

How to have multiple python programs append rows to the same file?

I've got multiple python processes (typically 1 per core) transforming large volumes of data that they are each reading from dedicated sources, and writing to a single output file that each opened in append mode.
Is this a safe way for these programs to work?
Because of the tight performance requirements and large data volumes I don't think that I can have each process repeatedly open & close the file. Another option is to have each write to a dedicated output file and a single process concatenate them together once they're all done. But I'd prefer to avoid that.
Thanks in advance for any & all answers and suggestions.
Have you considered using the multiprocessing module to coordinate between the running programs in a thread-like manner? See in particular the queue interface; you can place each completed work item on a queue when completed, and have a single process reading off the queue and writing to your output file.
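A minimal sketch of that queue-plus-single-writer arrangement; the worker logic, worker count, and file name below are placeholders of mine, not from the answer:

import multiprocessing as mp

def worker(source_id, queue):
    for i in range(3):                            # placeholder for the real transform loop
        queue.put("source %d, row %d\n" % (source_id, i))
    queue.put(None)                               # sentinel: this worker is done

def writer(queue, num_workers, path):
    finished = 0
    with open(path, "a") as out:                  # single process owns the output file
        while finished < num_workers:
            item = queue.get()
            if item is None:
                finished += 1
            else:
                out.write(item)

if __name__ == '__main__':
    q = mp.Queue()
    workers = [mp.Process(target=worker, args=(i, q)) for i in range(4)]
    for w in workers:
        w.start()
    writer(q, len(workers), "combined_output.txt")
    for w in workers:
        w.join()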
Alternately, you can have each subprocess maintain a separate pipe to a parent process which does a select() call from all of them, and copies data to the output file when appropriate. Of course, this can be done "by hand" (without the multiprocessing module) as well as with it.
Alternately, if the reason you're avoiding threads is to avoid the global interpreter lock, you might consider a non-CPython implementation (such as Jython or IronPython).
Your procedure is "safe" in that no crashes will result, but data coming (with very unlucky timing) from different processes could get mixed up -- e.g., if process 1 is appending a long string of a's and process 2 a long string of b's, you could end up in the file with lots of a's, then some b's, then more a's (or other combinations/mixings).
The problem is that .write is not guaranteed to be atomic for sufficiently long string arguments. If you have a tight bound on the length of the arguments, smaller than your filesystem/OS block size, you might be lucky. Otherwise, try using the logging module, which does take more precautions (though perhaps those precautions will slow you down -- you'll need to benchmark) exactly because it targets "log files" that are often appended to by multiple programs.
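A minimal sketch of that logging-based variant, with each process writing whole records through its own FileHandler (the file name and record are placeholders of mine):

import logging

logger = logging.getLogger("appender")
handler = logging.FileHandler("combined_output.log")    # opened in append mode by default
handler.setFormatter(logging.Formatter("%(message)s"))  # write the record and nothing else
logger.addHandler(handler)
logger.setLevel(logging.INFO)

record = "field1,field2,field3"   # placeholder for one complete output row
logger.info(record)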

What's the best way to divide large files in Python for multiprocessing?

I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or, should I map the file iterable over a pool of processes using multiprocessing? I've experimented with these approaches but the overhead of distributing the data line by line is immense. I've settled on a lightweight pipe-filters design by using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly to the second input (see this post), but I'd like to have a solution contained entirely in Python.
Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).
Thanks,
Vince
Additional information: Processing time per line varies. Some problems are fast and barely I/O-bound, some are CPU-bound. The CPU-bound, non-dependent tasks will gain the most from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall-clock time.
A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or queue in multiprocessing it is always over 100% slower.
One of the best architectures is already part of Linux OS's. No special libraries required.
You want a "fan-out" design.
A "main" program creates a number of subprocesses connected by pipes.
The main program reads the file, writing lines to the pipes doing the minimum filtering required to deal the lines to appropriate subprocesses.
Each subprocess should probably be a pipeline of distinct processes that read from stdin and write to stdout.
You don't need a queue data structure, that's exactly what an in-memory pipeline is -- a queue of bytes between two concurrent processes.
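A minimal sketch of that fan-out, with the main program dealing lines round-robin to worker subprocesses over pipes (the worker command "worker.py" and the input file name are placeholders of mine):

import subprocess

NUM_WORKERS = 4
workers = [
    subprocess.Popen(["python", "worker.py", "--out-file", "out%d" % i],
                     stdin=subprocess.PIPE)
    for i in range(NUM_WORKERS)
]

with open("big_input.txt", "rb") as f:
    for lineno, line in enumerate(f):
        # Minimum "filtering": here just a round-robin deal; replace with
        # whatever routing logic sends each line to the right subprocess.
        workers[lineno % NUM_WORKERS].stdin.write(line)

for w in workers:
    w.stdin.close()   # signals EOF to the worker
for w in workers:
    w.wait()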
One strategy is to assign each worker an offset: if you have eight worker processes, you assign them numbers 0 to 7. Worker number 0 reads the first record, processes it, then skips 7 and goes on to process the 8th record, and so on; worker number 1 reads the second record, then skips 7 and processes the 9th record, etc.
There are a number of advantages to this scheme. It doesn't matter how big the file is, the work is always divided evenly; processes on the same machine will process at roughly the same rate and use the same buffer areas, so you don't incur any excessive I/O overhead. As long as the file hasn't been updated, you can rerun individual threads to recover from failures.
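A minimal sketch of that striding scheme, assuming the records are lines; process_line() below is a stand-in of mine for the real per-record work:

from itertools import islice

def process_line(line):
    pass   # stand-in for the actual per-record work

def worker(path, worker_id, num_workers):
    with open(path) as f:
        # Yields lines worker_id, worker_id + num_workers, worker_id + 2*num_workers, ...
        for line in islice(f, worker_id, None, num_workers):
            process_line(line)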
You don't mention how you are processing the lines; possibly the most important piece of info.
Is each line independent? Is the calculation dependent on one line coming before the next? Must they be processed in blocks? How long does the processing for each line take? Is there a processing step that must incorporate "all" the data at the end? Or can intermediate results be thrown away and just a running total maintained? Can the file be initially split by dividing the file size by the number of threads? Or does it grow as you process it?
If the lines are independent and the file doesn't grow, the only coordination you need is to farm out "starting addresses" and "lengths" to each of the workers; they can independently open and seek into the file, and then you must simply coordinate their results, perhaps by waiting for N results to come back into a queue.
If the lines are not independent, the answer will depend highly on the structure of the file.
I know you specifically asked about Python, but I will encourage you to look at Hadoop (http://hadoop.apache.org/): it implements the Map and Reduce algorithm which was specifically designed to address this kind of problem.
Good luck
It depends a lot on the format of your file.
Does it make sense to split it anywhere? Or do you need to split it at a new line? Or do you need to make sure that you split it at the end of an object definition?
Instead of splitting the file, you should use multiple readers on the same file, using os.lseek to jump to the appropriate part of the file.
Update: The poster added that he wants to split on newlines. Then I propose the following:
Let's say you have 4 processes. Then the simple solution is to os.lseek to 0%, 25%, 50% and 75% of the file, and read bytes until you hit the first newline. That's your starting point for each process. You don't need to split the file to do this; just seek to the right location in the large file in each process and start reading from there.
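A minimal sketch of that seek-and-align step, assuming each process knows its index i and the total process count:

import os

def start_offset(path, i, num_procs):
    size = os.path.getsize(path)
    offset = size * i // num_procs
    if offset == 0:
        return 0                 # the first process starts at the beginning
    with open(path, 'rb') as f:
        f.seek(offset)
        f.readline()             # discard the partial line we landed in
        return f.tell()          # first byte of the next full line

Each process would then read from start_offset(path, i, N) up to start_offset(path, i + 1, N).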
Fredrik Lundh's Some Notes on Tim Bray's Wide Finder Benchmark is an interesting read, about a very similar use case, with a lot of good advice. Various other authors also implemented the same thing, some are linked from the article, but you might want to try googling for "python wide finder" or something to find some more. (there was also a solution somewhere based on the multiprocessing module, but that doesn't seem to be available anymore)
If the run time is long, instead of having each process read its next line through a Queue, have the processes read batches of lines. This way the overhead is amortized over several lines (e.g. thousands or more).
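A minimal sketch of that batching idea, with the parent putting chunks of lines on the queue instead of single lines (the batch size and the per-line work are assumptions of mine):

import multiprocessing as mp
from itertools import islice

BATCH_SIZE = 5000   # amortize queue overhead over thousands of lines

def producer(path, queue, num_workers):
    with open(path) as f:
        while True:
            batch = list(islice(f, BATCH_SIZE))
            if not batch:
                break
            queue.put(batch)
    for _ in range(num_workers):
        queue.put(None)          # sentinel: no more batches

def consumer(queue):
    while True:
        batch = queue.get()
        if batch is None:
            break
        for line in batch:
            pass                 # stand-in for the real per-line calculation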
