In my Python program, a background thread continuously and asynchronously receives a raw byte buffer (log output, actually) from an embedded device. After conversion, it is an ASCII string containing line breaks.
In the same program I would like to consume it as lines from the main-thread, from time to time.
I chose io.StringIO for this as it splits the text into lines at the line breaks and has the nice readline interface.
In my thread I'm calling write() on the StringIO object, and the main thread would want to consume lines from time to time (via readline()).
However, the StringIO-class does not behave as a FIFO or Ringbuffer. It does not have a separate read and write-pointer.
Before readline() I'd have to move the file pointer back to the last read position and, when done, back to the end, so that the next write appends correctly.
I could write my own string-line ring buffer with that interface, taking into account the concurrent access and so on. But it is actually a hard task, and I'm wondering whether there is already something which fulfills my needs.
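One possible shape for such a buffer, as a rough sketch rather than a finished solution: keep the complete lines in a queue.Queue (which handles the locking) and hold back the trailing partial line until its line break arrives. The class name LineFifo and its non-blocking readline() are assumptions made for illustration, and write() is assumed to be called from a single producer thread only.

```python
import queue


class LineFifo:
    """FIFO of text lines: one thread calls write(), another calls readline()."""

    def __init__(self):
        self._lines = queue.Queue()  # thread-safe FIFO of complete lines
        self._partial = ""           # trailing text without a line break yet

    def write(self, text):
        # Assumed to be called from the single producer thread only.
        *complete, self._partial = (self._partial + text).split("\n")
        for line in complete:
            self._lines.put(line + "\n")

    def readline(self):
        # Non-blocking: returns "" when no complete line is available.
        try:
            return self._lines.get_nowait()
        except queue.Empty:
            return ""


fifo = LineFifo()
fifo.write("ab")
fifo.write("c\nde\nf")
first = fifo.readline()   # "abc\n"
second = fifo.readline()  # "de\n"
third = fifo.readline()   # "" because "f" is still incomplete
```

Because the reader only ever sees complete lines, there is no shared file pointer to move back and forth.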
In the Python HDF5 library h5py, do I need to flush() a file before I close() it?
Or does closing the file already make sure that any data that might still be in the buffers will be written to disk?
What exactly is the point of flushing? When would flushing be necessary?
No, you do not need to flush the file before closing. Flushing is done automatically by the underlying HDF5 C library when you close the file.
As to the point of flushing. File I/O is slow compared to things like memory or cache access. If programs had to wait before data was actually on the disk each time a write was performed, that would slow things down a lot. So the actual writing to disk is buffered by at least the OS, but in many cases by the I/O library being used (e.g., the C standard I/O library). When you ask to write data to a file, it usually just means that the OS has copied your data to its own internal buffer, and will actually put it on the disk when it's convenient to do so.
Flushing overrides this buffering, at whatever level the call is made. So calling h5py.File.flush() will flush the HDF5 library buffers, but not necessarily the OS buffers. The point of this is to give the program some control over when data actually leaves a buffer.
For example, writing to the standard output is usually line-buffered. But if you really want to see the output before a newline, you can call fflush(stdout). This might make sense if you are piping the standard output of one process into another: that downstream process can start consuming the input right away, without waiting for the OS to decide it's a good time.
Another good example is making a call to fork(2). This usually copies the entire address space of a process, which means the I/O buffers as well. That may result in duplicated output, unnecessary copying, etc. Flushing a stream guarantees that the buffer is empty before forking.
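The same buffering can be observed from plain Python file objects. The following is a small sketch, assuming CPython's default buffering for text files; the file name demo.txt is arbitrary. It shows that a small write is typically not visible to a second reader until flush() is called, and that os.fsync() is the separate step that asks the OS to commit its own buffers.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("hello")            # stays in Python's userspace buffer for now
    with open(path) as reader:
        before = reader.read()  # typically "": nothing has reached the file yet
    f.flush()                   # hand the library buffer over to the OS
    with open(path) as reader:
        after = reader.read()   # "hello": now visible to other readers
    os.fsync(f.fileno())        # additionally ask the OS to write its buffers to disk
```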
I'm working on a project in Python 3 that involves reading lines from a text file, manipulating those lines in some way, and then writing the results of said manipulation into another text file. Implementing that flow in a serial way is trivial.
However, running every step serially takes a long time (I'm working on text files that are several hundred megabytes/several gigabytes in size). I thought about breaking up the process into multiple, actual system processes. Based on the recommended best practices, I'm going to use Python's multiprocessing library.
Ideally, there should be one and only one Process to read from and write to the text files. The manipulation part, however, is where I'm running into issues.
When the "reader process" reads a line from the initial text file, it places that line in a Queue. The "manipulation processes" then pull that line from the Queue, do their thing, then put the result into yet another Queue, which the "writer process" then takes and writes to another text file. As it stands right now, the manipulation processes simply check to see if the "reader Queue" has data in it, and if it does, they get() the data from the Queue and do their thing. However, those processes may be running before the reader process runs, thus causing the program to stall.
What, in your opinions, would be the "Best Way" to schedule the processes in such a way so the manipulation processes won't run until the reader process has put data into the Queue, and vice-versa with the writer process? I considered firing off custom signals, but I'm not sure if that's the most appropriate way forward. Any help will be greatly appreciated!
If I were you, I would separate the tasks of dividing your file into tractable chunks and the compute-intensive manipulation part. If that is not possible (for example, if lines are not independent for some reason), then you might have to do a purely serial implementation anyway.
Once you have N chunks in separate files, you can just start your serial manipulation script N times, once for each chunk. Afterwards, combine the output back into one file. If you do it this way, no queue is needed and you will save yourself some work.
You're describing a task queue. Celery is a task queue: http://www.celeryproject.org/
I've got multiple python processes (typically 1 per core) transforming large volumes of data that they are each reading from dedicated sources, and writing to a single output file that each opened in append mode.
Is this a safe way for these programs to work?
Because of the tight performance requirements and large data volumes I don't think that I can have each process repeatedly open & close the file. Another option is to have each write to a dedicated output file and a single process concatenate them together once they're all done. But I'd prefer to avoid that.
Thanks in advance for any & all answers and suggestions.
Have you considered using the multiprocessing module to coordinate between the running programs in a thread-like manner? See in particular the queue interface; you can place each completed work item on a queue when completed, and have a single process reading off the queue and writing to your output file.
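As a rough sketch of that queue arrangement (not a tuned implementation): workers block in get() until work arrives, and one writer process is the only one touching the output file. The DONE marker, the upper() call standing in for the real transformation, and the function names are all made up for illustration. The sketch also assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import os
import tempfile

DONE = None  # hypothetical end-of-work marker


def worker(in_q, out_q):
    # get() blocks until an item arrives, so a worker that starts early simply waits.
    for line in iter(in_q.get, DONE):
        out_q.put(line.upper())  # placeholder for the real transformation
    out_q.put(DONE)  # tell the writer that this worker has finished


def writer(out_q, path, n_workers):
    # The only process that writes to the output file.
    finished = 0
    with open(path, "w") as f:
        while finished < n_workers:
            item = out_q.get()
            if item is DONE:
                finished += 1
            else:
                f.write(item)


def run(lines, n_workers=2):
    # Assumption: POSIX with the "fork" start method; elsewhere this raises ValueError.
    ctx = mp.get_context("fork")
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    in_q, out_q = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
    procs.append(ctx.Process(target=writer, args=(out_q, path, n_workers)))
    for p in procs:
        p.start()
    for line in lines:
        in_q.put(line)
    for _ in range(n_workers):
        in_q.put(DONE)  # one marker per worker
    for p in procs:
        p.join()
    return path  # note: the order of lines in the output is not preserved
```

Calling run(["a\n", "b\n"]) returns the path of a temporary file containing the transformed lines. Since every write goes through one process, records cannot interleave, at the cost of pickling each item across the queues.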
Alternately, you can have each subprocess maintain a separate pipe to a parent process which does a select() call from all of them, and copies data to the output file when appropriate. Of course, this can be done "by hand" (without the multiprocessing module) as well as with it.
Alternately, if the reason you're avoiding threads is to avoid the global interpreter lock, you might consider a non-CPython implementation (such as Jython or IronPython).
Your procedure is "safe" in that no crashes will result, but data coming (with very unlucky timing) from different processes could get mixed up -- e.g., if process 1 is appending a long string of a's and process 2 a long string of b's, you could end up in the file with lots of a's, then the b's, then more a's (or other combinations / mixings).
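Problem is, .write is not guaranteed to be atomic for sufficiently long string arguments. If you have a tight bound on the size of the arguments, less than your fs/os's blocksize, you might be lucky. Otherwise, try using the logging module, which does take more precautions (but perhaps those precautions might slow you down... you'll need to benchmark) exactly because it targets "log files" that are often being appended to by multiple programs. Bear in mind, though, that the logging documentation only describes writing to one file as supported across the threads of a single process; for several processes it suggests funnelling records through a single writer (for example via a queue or socket handler).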
I'm creating a Python script which accepts a path to a remote file and a number n of threads. The file's size will be divided by the number of threads, and when each thread completes I want it to append the fetched data to a local file.
How do I manage it so that the data is appended to the local file in the order in which the threads were created, so that the bytes don't get scrambled?
Also, what if I'm to download several files simultaneously?
You could coordinate the works with locks &c, but I recommend instead using Queue -- usually the best way to coordinate multi-threading (and multi-processing) in Python.
I would have the main thread spawn as many worker threads as you think appropriate (you may want to calibrate between performance, and load on the remote server, by experimenting); every worker thread waits at the same global Queue.Queue instance, call it workQ for example, for "work requests" (wr = workQ.get() will do it properly -- each work request is obtained by a single worker thread, no fuss, no muss).
A "work request" can in this case simply be a triple (tuple with three items): identification of the remote file (URL or whatever), offset from which it is requested to get data from it, number of bytes to get from it (note that this works just as well for one or multiple files to fetch).
The main thread pushes all work requests to the workQ (just workQ.put((url, offset, numbytes)) for each request; note that from is a reserved word in Python and cannot be used as a variable name) and waits for results to come to another Queue instance, call it resultQ (each result will also be a triple: identifier of the file, starting offset, string of bytes that are the results from that file at that offset).
As each working thread satisfies the request it's doing, it puts the results into resultQ and goes back to fetch another work request (or wait for one). Meanwhile the main thread (or a separate dedicated "writing thread" if needed -- i.e. if the main thread has other work to do, for example on the GUI) gets results from resultQ and performs the needed open, seek, and write operations to place the data at the right spot.
There are several ways to terminate the operation: for example, a special work request may be asking the thread receiving it to terminate -- the main thread puts on workQ just as many of those as there are working threads, after all the actual work requests, then joins all the worker threads when all data have been received and written (many alternatives exist, such as joining the queue directly, having the worker threads daemonic so they just go away when the main thread terminates, and so forth).
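A compact sketch of this scheme, using the Python 3 module name queue (Queue in Python 2). To keep it self-contained, the hypothetical fetch() just slices an in-memory bytes object instead of doing a ranged download, the results are (offset, data) pairs rather than full triples because only one file is involved, and STOP is an assumed name for the special terminating work request.

```python
import os
import queue
import tempfile
import threading

STOP = None  # hypothetical "terminate" work request


def fetch(source, offset, numbytes):
    # Stand-in for a ranged download; here the "remote file" is a bytes object.
    return source[offset:offset + numbytes]


def worker(work_q, result_q):
    # Each get() hands a work request to exactly one worker thread.
    for source, offset, numbytes in iter(work_q.get, STOP):
        result_q.put((offset, fetch(source, offset, numbytes)))


def download(source, path, chunk=4, n_threads=3):
    work_q, result_q = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(work_q, result_q))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    offsets = range(0, len(source), chunk)
    for offset in offsets:
        work_q.put((source, offset, chunk))
    for _ in threads:
        work_q.put(STOP)  # one terminating request per worker
    with open(path, "wb") as f:
        for _ in offsets:  # exactly one result per work request
            offset, data = result_q.get()
            f.seek(offset)  # results arrive in any order; the offset places them
            f.write(data)
    for t in threads:
        t.join()


data = b"0123456789abcdef!"
path = os.path.join(tempfile.mkdtemp(), "out.bin")
download(data, path)
with open(path, "rb") as f:
    result = f.read()
```

Only the main thread touches the file, so no lock around seek and write is needed.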
You need to fetch completely separate parts of the file on each thread. Calculate the chunk start and end positions based on the number of threads. Each chunk must have no overlap obviously.
For example, if the target file was 3000 bytes long and you wanted to fetch it using three threads:
Thread 1: fetches bytes 1 to 1000
Thread 2: fetches bytes 1001 to 2000
Thread 3: fetches bytes 2001 to 3000
You would pre-allocate an empty file of the original size, and write back to the respective positions within the file.
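The chunk arithmetic and the pre-allocation could look roughly like this. It is a sketch with made-up helper names (chunk_ranges, preallocate), and it uses zero-based offsets, as file positions and HTTP Range headers do, rather than the one-based byte numbers in the example above.

```python
import os
import tempfile


def chunk_ranges(size, n_chunks):
    """Split size bytes into n_chunks non-overlapping (offset, length) pairs."""
    base, extra = divmod(size, n_chunks)
    ranges, offset = [], 0
    for i in range(n_chunks):
        length = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        ranges.append((offset, length))
        offset += length
    return ranges


def preallocate(path, size):
    # Create a file of the final size so each thread can seek to its own offset.
    with open(path, "wb") as f:
        f.truncate(size)


ranges = chunk_ranges(3000, 3)  # [(0, 1000), (1000, 1000), (2000, 1000)]
path = os.path.join(tempfile.mkdtemp(), "target.bin")
preallocate(path, 3000)
```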
You can use a thread safe "semaphore", like this:
    import threading

    class Counter:
        counter = 0
        _lock = threading.Lock()  # cls.counter + 1 is a read-modify-write, not atomic by itself

        @classmethod
        def inc(cls):
            with cls._lock:
                cls.counter += 1
                return cls.counter
Using Counter.inc() returns an incremented number across threads, which you can use to keep track of the current block of bytes.
That being said, there's no need to split up file downloads into several threads, because the downstream is way slower than the writing to disk, so one thread will always finish before the next one is downloading.
The best and least resource hungry way is simply to have a download file descriptor linked directly to a file object on disk.
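In Python terms, that can be as little as copying from the response object to an open file. A minimal sketch, with the helper name save_stream invented here: response can be any binary file-like object, for instance what urllib.request.urlopen(url) returns, and an io.BytesIO stands in for it below so the example runs without a network.

```python
import io
import os
import shutil
import tempfile


def save_stream(response, path, bufsize=64 * 1024):
    # Copy in fixed-size pieces so the whole download never sits in memory.
    with open(path, "wb") as out:
        shutil.copyfileobj(response, out, bufsize)


fake_response = io.BytesIO(b"payload" * 1000)  # stand-in for a real HTTP response
path = os.path.join(tempfile.mkdtemp(), "download.bin")
save_stream(fake_response, path)
size = os.path.getsize(path)
```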
For "download several files simultaneously", I recommend this article: Practical threaded programming with Python. It gives an example of simultaneous downloads that combines threads with Queues; I think it is worth reading.
Consider:
pipe_read, pipe_write = os.pipe()
Now, I would like to know two things:
(1) I have two threads. If I guarantee that only one is reading os.read(pipe_read,n) and the other is only writing os.write(pipe_write), will I have any problem, even if the two threads do it simultaneously? Will I get all data that was written in the correct order? What happens if they do it simultaneously? Is it possible that a single write is read in pieces, like?:
Thread 1: os.write(pipe_write, '1234567')
Thread 2: os.read(pipe_read,big_number) --> '123'
Thread 2: os.read(pipe_read,big_number) --> '4567'
Or -- again, consider simultaneity -- will the data from a single os.write(some_string) always be returned entirely by a single os.read(pipe_read, very_big_number)?
(2) Consider more than one thread writing to the pipe_write end of the pipe using logging.handlers.FileHandler() -- I've read that the logging module is threadsafe. Does this mean that I can do this without losing data? I think I won't be able to control the order of the data in the pipe; but this is not a requirement.
Requirements:
all data written by some threads on the write end must come out at the read end
a string written by a single logger.info(), logger.error(), ... has to stay in one piece.
Are these reqs fulfilled?
Thank you in advance,
Jan-Philip Gehrcke
os.read and os.write on the two fds returned from os.pipe are threadsafe, but you appear to demand more than that. Sub (1), yes, there is no guarantee that one read corresponds to one write -- the scenario you depict (a single short write ends up producing two reads) is entirely possible. (In general, os.whatever is a thin wrapper on operating system functionality, and it's up to the OS to ensure, or fail to ensure, the kind of functionality you require; in this case, the Posix standard doesn't require the OS to ensure this kind of "atomicity" on the reading side. What Posix does promise is that a write of at most PIPE_BUF bytes to a pipe is not interleaved with data from other writers.) You're guaranteed to get all data that was written, and in the correct order, but that's it. A single write of a large piece of data might stall once it's filled the OS-supplied buffer and only proceed once some other thread has read some of the initial data (beware deadlocks, of course!), etc, etc.
Sub (2), yes, the logging module is threadsafe AND "atomic" in that data produced by a single call to logging.info, logging.warn, logging.error, etc, "stays in one piece" in terms of calls to the underlying handler (however if that handler in turn uses non-atomic means such as os.write, it may still e.g. stall in the kernel until the underlying buffer gets unclogged, etc, etc, as above).
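Since a read may return only part of what one write sent, a common workaround is to frame each message yourself. Below is a sketch under an assumed framing (a 4-byte big-endian length prefix). The helper names are made up, and with several writer threads each send_msg() call would additionally need a lock around it (or payloads small enough to fit in PIPE_BUF).

```python
import os
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length prefix


def write_all(fd, data):
    # os.write may accept only part of the data, so loop until all of it is written.
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def send_msg(fd, payload):
    write_all(fd, HEADER.pack(len(payload)) + payload)


def read_exact(fd, n):
    # os.read may return fewer than n bytes, so collect pieces until n have arrived.
    chunks = []
    while n:
        chunk = os.read(fd, n)
        if not chunk:
            raise EOFError("pipe closed in the middle of a message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(fd):
    (length,) = HEADER.unpack(read_exact(fd, HEADER.size))
    return read_exact(fd, length)


pipe_read, pipe_write = os.pipe()
send_msg(pipe_write, b"1234567")
send_msg(pipe_write, b"abc")
first = recv_msg(pipe_read)   # b"1234567", however the kernel split the bytes
second = recv_msg(pipe_read)  # b"abc"
os.close(pipe_write)
os.close(pipe_read)
```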