I found lots of similar questions asking about the size of an object at run time in Python. Some of the answers suggest setting a limit on the amount of memory the sub-process can use. I do not want to set a limit on the sub-process's memory. Here is what I want:
I'm using subprocess.Popen() to execute an external program. I can get standard output and error just fine with process.stdout.readlines() and process.stderr.readlines() after the process is complete.
I have a problem when an erroneous program gets into an infinite loop and keeps producing output. Since subprocess.Popen() stores the output data in memory, such an infinite loop quickly eats up the entire memory and the program slows down.
One solution is to run the command with a timeout, but programs take variable time to complete. A large timeout, for a program that normally finishes quickly but has hit an infinite loop, defeats the purpose of having one.
Is there any simple way to put an upper limit, say 200 MB, on the amount of data the command can produce? If it exceeds the limit, the command should be killed.
First: it is not subprocess.Popen() that stores the data, but the pipe between "us" and "our" subprocess.
You shouldn't use readlines() in this case, as it buffers all the data indefinitely and only returns it as a list at the end (in this case, it is indeed this function that stores the data).
If you do something like
bytes = lines = 0
for line in process.stdout:
    bytes += len(line)
    lines += 1
    if bytes > 200000000 or lines > 10000:
        # handle the described situation
        break
you can act as wanted in your question. But don't forget to kill the subprocess afterwards, to stop it from producing further data.
But if you want to take care of stderr as well, you'd have to try to replicate process.communicate()'s behaviour with select() etc., and act appropriately.
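As a rough, POSIX-only sketch of that approach, the selectors module can multiplex both streams; "some_command" is a placeholder for the real command, and the 200 MB cap is taken from the question:

import os
import selectors
import subprocess

LIMIT = 200 * 1024 * 1024                      # the 200 MB cap from the question

process = subprocess.Popen(["some_command"],   # hypothetical command
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)

sel = selectors.DefaultSelector()
sel.register(process.stdout, selectors.EVENT_READ)
sel.register(process.stderr, selectors.EVENT_READ)

total = 0
while sel.get_map():                           # until both streams reach EOF
    for key, _ in sel.select():
        chunk = os.read(key.fileobj.fileno(), 65536)
        if not chunk:                          # EOF on this stream
            sel.unregister(key.fileobj)
            continue
        total += len(chunk)
        # ... buffer or process the chunk here ...
        if total > LIMIT:
            process.kill()                     # runaway child: stop it now
process.wait()

After the kill, both pipes hit EOF, so the loop drains what is left and exits.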
There doesn't seem to be an easy answer to what you want.
http://linux.about.com/library/cmd/blcmdl2_setrlimit.htm
rlimit has a flag to limit memory, CPU or number of open files, but apparently nothing to limit the amount of I/O.
You should handle the case manually as already described.
Related
I have a process that reads from a sensor and another that graphs the data. However, the grapher can be slower than the sensor, especially as more sensors are added.
I see two options to pass the information from the sensor to the grapher: pipes and mp.Value. Pipes, from what I know, should be faster, but I worry about the grapher falling behind: if the sensor samples n times as fast as the grapher, then with every grapher time step we only progress 1/n of real time (e.g. if it samples twice as fast, after 20 s the grapher has only displayed 10 s). I could have the sensor poll the pipe and remove all values before adding a new one, but that sounds computationally expensive. The mp.Value route requires more explicit locking, and I believe it isn't as fast as the Pipe class, although I don't know for sure.
What would be the best way to approach this multiprocessing to avoid issues here?
Edit for clarification: I don't care if the grapher gets all the information. Using the most recent value is fine, which is why the title says "Pipe only Last Value". The main requirement is just that the plot not get delayed, even if we effectively downsample by throwing away data. The sensor does need to sample faster than the grapher reads, though, as the data is also being recorded and processed, and we don't want to downsample that information.
To get the most up-to-date sensor value, you actually need the sensor process to wait until the grapher is ready before sending data. There are several ways to do this, but I think using two unidirectional (duplex=False) pipes is the best way to go, because you don't need to involve any extra threads or semaphores. In this setup, the first pipe sends data sensor->grapher as normal, while the second simply signals that the grapher is ready to accept data immediately. It's a little awkward to express in prose, so here is pseudo code:
def grapher(pipe_to_grapher, pipe_to_sensor):
    while True:
        data = pipe_to_grapher.recv()
        graph(data)
        pipe_to_sensor.send(None)       # can be any value; it just means "ready"

def sensor(pipe_to_grapher, pipe_to_sensor):
    while True:
        data = sense()
        if pipe_to_sensor.poll():       # grapher has signalled it is ready
            pipe_to_grapher.send(data)  # freshest possible value
            pipe_to_sensor.recv()       # clear the ready signal
        record(data)
Note that the sensor can simply pass right on by if poll() returns False, as it is an indication that the grapher is not ready for data yet. You can also easily extend the system to use special values to communicate something about the state of one process to the other, such as a shutdown command.
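For completeness, a minimal sketch of the wiring under the same assumptions (sense(), graph() and record() are placeholders for the real sensor, plotting and storage code):

import multiprocessing as mp

if __name__ == "__main__":
    # data pipe, sensor -> grapher: read end to the grapher, write end to the sensor
    to_grapher_recv, to_grapher_send = mp.Pipe(duplex=False)
    # ready pipe, grapher -> sensor: read end to the sensor, write end to the grapher
    to_sensor_recv, to_sensor_send = mp.Pipe(duplex=False)

    g = mp.Process(target=grapher, args=(to_grapher_recv, to_sensor_send))
    s = mp.Process(target=sensor, args=(to_grapher_send, to_sensor_recv))
    g.start()
    s.start()
    s.join()
    g.join()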
(Pre-edit answer follows)
This question appears to be asking about applying backpressure to your sensor data flow. It sounds like a multiprocessing.Queue might be a good solution for your specific case. Internally it uses a pipe, so it will have similar performance characteristics, but it can be created with a maxsize parameter that you can set to a low number like 1, so that put() in the sensor process will block until the grapher process has retrieved the previous item before going back to acquire more data.
If the sensor has its own buffer that needs clearing, you can use put_nowait() instead and catch the queue.Full exception as an indication that the grapher won't be able to plot that data and it should be discarded. This saves the overhead of pickling and sending the data, but can lead to very rapid polling of the sensor, which may itself be a source of overhead, depending on the device/drivers/API.
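A minimal sketch of that variant, again with sense(), graph() and record() as placeholders:

import multiprocessing as mp
from queue import Full

def sensor(q):
    while True:
        data = sense()
        try:
            q.put_nowait(data)   # hand off only if the grapher has kept up
        except Full:
            pass                 # grapher still busy: drop this sample
        record(data)             # every sample is still recorded

def grapher(q):
    while True:
        graph(q.get())           # blocks until the sensor puts a value

if __name__ == "__main__":
    q = mp.Queue(maxsize=1)      # at most one pending item
    mp.Process(target=grapher, args=(q,)).start()
    sensor(q)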
I am working on a threaded Python program and am using pipes, but found that they freeze at a certain point (with what I would consider a relatively small amount of data). I have a test case below. I've tried digging into the documentation but have been unable to find anything.
import multiprocessing

def test():
    out_, in_ = multiprocessing.Pipe()
    for i in range(10**6):
        print(i)
        in_.send(i)

test()
When I run this code, it prints up to 278 and then stops, which seems like a small amount of data. Is this due to it running out of memory, or something else? Are there any workarounds or parameters I could use to increase the size?
Yes, pipes have a limited amount of storage, the amount depends on the operating system. If a process tries to write to the pipe faster than another process reads from it, the pipe will eventually fill up. When that happens, the next attempt to write will block until the reader reads enough to make space for it.
The pipe can also be put into non-blocking mode; trying to write to a full pipe then returns an error indication, and the writing code can decide how to deal with it. The Python multiprocessing module doesn't appear to expose a way to make a pipe non-blocking. Python multiprocess non-blocking intercommunication using Pipes suggests using poll() on the receiving connection, which tells you whether data is waiting to be read, so the reader can keep the pipe drained without blocking.
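For illustration, a minimal sketch of that pattern: a separate reader process keeps the pipe drained, so the writer from the question never blocks (the None sentinel is my own convention here):

import multiprocessing

def reader(conn):
    while True:
        if conn.poll(0.1):       # data waiting to be read? (0.1 s timeout)
            item = conn.recv()
            if item is None:     # sentinel: the writer is done
                break
            print(item)

def writer(conn, n):
    for i in range(n):
        conn.send(i)             # never fills the pipe: the reader drains it
    conn.send(None)

if __name__ == "__main__":
    recv_end, send_end = multiprocessing.Pipe(duplex=False)
    p = multiprocessing.Process(target=reader, args=(recv_end,))
    p.start()
    writer(send_end, 10**3)
    p.join()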
I am trying to figure out how much memory a while loop uses.
At the basic level:
while True:
    pass
If I did a similar thing in PHP, it would grind my localhost to a crawl. But in this case I would expect it to use next to nothing.
So for an infinite loop in python, should I expect that it would use up tons of memory, or does it scale according to what is being done, and where?
Your infinite loop is a no-op (it doesn't do anything), so it won't increase the memory use beyond what is being used by the rest of your program. To answer your question, you need to post the code that you suspect is causing memory problems.
In PHP, however, the same loop will "hang" because the web server is expecting a response to send back to the client. Since no response is received, the web browser will simply "freeze". Depending on how the web server is configured, it may choose to end the process and issue a timeout error.
You could do the same if you used Python and a web framework and put an infinite loop in one of your methods that returns a response to the client.
If you ran the equivalent PHP code from the shell, it will have the same effect as if it was written in Python (or any other language). That is, your console will block until you kill the process.
I'm asking because I want to create a program that runs infinitely, but I'm not sure how to determine its footprint or how much it will take from system resources.
A program that runs indefinitely (I think that's what you mean) generally falls into one of two cases:
It's waiting to do some work on a trigger (like a web server that runs indefinitely but just sits there until someone visits your website)
It's doing a process that takes a long time.
For #2, you need to determine the resource use by figuring out what work is being done.
If it's building a large list of items to do some calculations/sorting, then memory use will grow as the list grows.
If it's processing a bunch of files and generating a lot of output stored on disk, then disk usage will grow, and then shrink when the process is done.
If it's a rendering engine, then memory use and CPU use will increase, along with disk use as memory is swapped out during rendering. However, such a system will not tax the disk too much.
The bottom line is, you can't get an answer to this unless you explain the process being run.
I have a python program which needs to scan some large log files to extract useful information.
In this program, to better utilize the computing resources of the server (which runs Ubuntu 12.04 LTS and has 64 cores and 96 GB of memory), I create a process pool with size = 10 and submit several jobs to these pool workers. Each job reads from several large files (about 50 GB each, 20 files in total) using file.readlines(), then analyzes them line by line to find useful information and saves the results in a dictionary. After all files are scanned and analyzed, the result dictionary is written to disk. There is no explicit call to gc.collect() anywhere in the script.
I started this program on the server using the root account, and the processes work fine at first: each process occupies about 3.8 GB of memory, so about 40 GB in total.
After a few hours, another user starts a memory-consuming program (also under the root account) which aggressively uses almost all the memory (99% of the total), and later that program is suspended with CTRL-Z and killed with killall -9 process_name.
However, after this, I have found that the process state of most of my pool workers has changed to S, and the CPU usage of these sleeping processes has dropped to 0. According to man top:
The status of the task which can be one of:
'D' = uninterruptible sleep,
'R' = running,
'S' = sleeping,
'T' = traced or stopped,
'Z' = zombie
I used the ps -axl command to check the name of the kernel function where the processes are sleeping, and it turns out these pool worker processes sleep on _fastMutex.
This situation has lasted for a long time (the process state is still S now), and I don't want to restart my process and scan all the files again. How can I change these processes from the sleeping state back to running?
The Sleeping state indicates that they are waiting for something; the way to wake them up is to satisfy whatever condition it is they wait for (the mutex is probably the mechanism of waiting, not the condition itself). The references to memory consumption suggest the possibility that some processes are at least partially paged out, in which case they would be waiting for the swapper; however, that results in uninterruptible sleep D, not S.
System calls that are in interruptible sleep can also be interrupted by signals, such as alarm, terminate, stop, or continue. Most signals cause the program to abort, however. The two that are (usually) safe, continue and ignore, don't change program flow; so it would just go back to sleep on the same condition again.
Most likely, the reason your processes are in S is that they're genuinely waiting for outside input. Since all we know of your program is that it loads a lot of data, I can't tell you where that happens.
As for how you've described your program: "Each job reads from several large files ... using file.readlines(), and then analyzes them line by line." It's highly unlikely that this is an efficient way to do it: if you're only scanning line by line in one sweep, it's better to iterate over the file object in the first place (getting one line at a time). If you're reading lines in a random order, linecache is your friend. Using mmap you could avoid copying the data from the disk buffers. Which is the best fit depends on the structure of your data and algorithm.
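For example, a sketch of the line-at-a-time version; parse() stands in for the real analysis and "big.log" is a hypothetical file name:

# The file object buffers internally, so memory stays flat even on 50 GB files.
results = {}
with open("big.log") as f:
    for line in f:                  # one line at a time, no giant list
        key = parse(line)           # placeholder for the real analysis
        results[key] = results.get(key, 0) + 1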
By "state of most of my poolworkers have been changed to S" I suspect that the other workers are what's interesting. Perhaps the sleeping ones are just waiting for the ones that are paged out to return.
I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or should I map the file iterable over a pool of processes using multiprocessing? I've experimented with these approaches, but the overhead is immense in distributing the data line by line. I've settled on a lightweight pipe-filters design by using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly to the second input (see this post), but I'd like a solution contained entirely in Python.
Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).
Thanks,
Vince
Additional information: processing time per line varies. Some problems are fast and only just short of being I/O-bound, some are CPU-bound. The CPU-bound, non-dependent tasks will gain the most from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall-clock time.
A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O-bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or with a queue in multiprocessing, it is always over 100% slower.
One of the best architectures is already part of Linux OS's. No special libraries required.
You want a "fan-out" design.
A "main" program creates a number of subprocesses connected by pipes.
The main program reads the file, writing lines to the pipes and doing the minimum filtering required to deal the lines to the appropriate subprocesses.
Each subprocess should probably be a pipeline of distinct processes that read from stdin and write to stdout.
You don't need a queue data structure, that's exactly what an in-memory pipeline is -- a queue of bytes between two concurrent processes.
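A minimal sketch of that fan-out, assuming a hypothetical worker.py filter that reads its stdin and writes its own output file:

import subprocess

N = 4
workers = [
    subprocess.Popen(["python", "worker.py", "out%d.txt" % i],
                     stdin=subprocess.PIPE)
    for i in range(N)
]

with open("big_input.txt", "rb") as f:      # hypothetical input file
    for i, line in enumerate(f):
        workers[i % N].stdin.write(line)    # minimal "filtering": round-robin

for w in workers:
    w.stdin.close()                         # EOF tells each worker it can finish
    w.wait()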
One strategy is to assign each worker an offset: if you have eight worker processes, you assign them numbers 0 to 7. Worker 0 reads the first record, processes it, then skips 7 and goes on to process the 8th record, and so on; worker 1 reads the second record, then skips 7 and processes the 9th record.........
There are a number of advantages to this scheme. It doesn't matter how big the file is, the work is always divided evenly; processes on the same machine will process at roughly the same rate and use the same buffer areas, so you don't incur any excessive I/O overhead. As long as the file hasn't been updated, you can rerun individual threads to recover from failures.
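A sketch of the striding scheme, assuming line-oriented records and eight workers; process() is a placeholder for the real per-record work:

def worker(path, worker_id, num_workers=8):
    # worker k handles records k, k + 8, k + 16, ...
    with open(path) as f:
        for i, line in enumerate(f):
            if i % num_workers == worker_id:
                process(line)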
You don't mention how you are processing the lines; possibly the most important piece of info.
Is each line independent? Is the calculation dependent on one line coming before the next? Must they be processed in blocks? How long does the processing of each line take? Is there a processing step that must incorporate "all" the data at the end? Or can intermediate results be thrown away and just a running total maintained? Can the file be initially split by dividing its size by the thread count? Or does it grow as you process it?
If the lines are independent and the file doesn't grow, the only coordination you need is to farm out "starting addresses" and "lengths" to each of the workers; they can independently open and seek into the file, and then you simply coordinate their results, perhaps by waiting for N results to come back into a queue (see the sketch below).
If the lines are not independent, the answer will depend highly on the structure of the file.
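For the independent case, a rough sketch of that farm-out-and-collect coordination; do_chunk() is a placeholder for the real per-chunk work, and the byte ranges are made up:

import multiprocessing

def worker(path, start, length, results):
    results.put(do_chunk(path, start, length))  # placeholder chunk processor

if __name__ == "__main__":
    path = "big_input.txt"                      # hypothetical input file
    ranges = [(0, 10**6), (10**6, 10**6)]       # hypothetical (start, length) pairs
    results = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker,
                                     args=(path, start, length, results))
             for start, length in ranges]
    for p in procs:
        p.start()
    partials = [results.get() for _ in procs]   # wait for N results to come back
    for p in procs:
        p.join()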
I know you specifically asked about Python, but I will encourage you to look at Hadoop (http://hadoop.apache.org/): it implements the Map and Reduce algorithm which was specifically designed to address this kind of problem.
Good luck
It depends a lot on the format of your file.
Does it make sense to split it anywhere? Or do you need to split it at a new line? Or do you need to make sure that you split it at the end of an object definition?
Instead of splitting the file, you should use multiple readers on the same file, using os.lseek to jump to the appropriate part of the file.
Update: Poster added that he wants to split on new lines. Then I propose the following:
Let's say you have 4 processes. Then the simple solution is to os.lseek to 0%, 25%, 50% and 75% of the file, and read bytes until you hit the first new line. That's your starting point for each process. You don't need to split the file to do this, just seek to the right location in the large file in each process and start reading from there.
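A sketch of that approach, assuming line-oriented records; process() is a placeholder. Each worker owns the lines that start inside its byte range:

import os
import multiprocessing

def worker(path, start, end):
    with open(path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            if f.read(1) != b"\n":  # landed mid-line: the rest of this
                f.readline()        # line belongs to the previous worker
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            process(line)

if __name__ == "__main__":
    path = "big_input.txt"          # hypothetical input file
    n = 4
    size = os.path.getsize(path)
    bounds = [size * i // n for i in range(n + 1)]
    procs = [multiprocessing.Process(target=worker,
                                     args=(path, bounds[i], bounds[i + 1]))
             for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()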
Fredrik Lundh's Some Notes on Tim Bray's Wide Finder Benchmark is an interesting read, about a very similar use case, with a lot of good advice. Various other authors also implemented the same thing, some are linked from the article, but you might want to try googling for "python wide finder" or something to find some more. (there was also a solution somewhere based on the multiprocessing module, but that doesn't seem to be available anymore)
If the run time is long, instead of having each process read its next line through a Queue, have the processes read batches of lines. This way the overhead is amortized over several lines (e.g. thousands or more).
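A minimal sketch of the batching idea; process() is again a placeholder for the per-line work:

import multiprocessing

def producer(path, q, batch_size=1000):
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                q.put(batch)        # one IPC transfer per thousand lines
                batch = []
    if batch:
        q.put(batch)
    q.put(None)                     # sentinel: no more batches

def consumer(q):
    while True:
        batch = q.get()
        if batch is None:
            break
        for line in batch:
            process(line)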