Multiple threads reading from single folder on Linux - python

My project needs multiple threads reading files from the same folder. The folder receives incoming files, and each file should be processed by only one of those threads. The thread that reads a file deletes it after processing.
EDIT after the first answer: I don't want a single thread in charge of reading filenames and feeding those names to other threads so that they can read them.
Is there any efficient way of achieving this in Python?

You should probably use the Queue module. From the docs:
The Queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads.
I would use a FIFO approach, with a thread in charge of checking for inbound files and queuing them, and a number of workers processing them. A LIFO approach or an approach in which priority is assigned with a custom method are also supported by the module.
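A minimal sketch of that layout might look like this (I'm using the Python 3 spelling queue; the folder name, poll interval and process_file() are placeholders you would replace):
import os
import queue
import threading
import time

INCOMING = "incoming"          # hypothetical folder receiving the files
tasks = queue.Queue()
seen = set()

def scanner():
    # Single thread that watches the folder and queues new file names.
    while True:
        for name in os.listdir(INCOMING):
            path = os.path.join(INCOMING, name)
            if path not in seen and os.path.isfile(path):
                seen.add(path)
                tasks.put(path)
        time.sleep(1)          # poll interval

def worker():
    # Each worker takes a name, processes the file, then deletes it.
    while True:
        path = tasks.get()
        try:
            process_file(path)     # your processing goes here
            os.remove(path)
        finally:
            tasks.task_done()

def process_file(path):
    pass                           # placeholder

threading.Thread(target=scanner, daemon=True).start()
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

while True:                        # keep the main thread alive
    time.sleep(60)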
EDIT: If you don't want to use the Queue module and you are under a *nix system, you could use fcntl.lockf instead. An alternative is opening the files with os.open('filename', os.O_EXLOCK), where that flag is available (it is BSD/macOS-specific, not POSIX).
Depending on how often you perform this operation, you might find it performs worse than using Queue, as you will have to account for race conditions (i.e.: you might acquire the name of the file to open, but the file might get locked by another thread before you get a chance to open it, throwing an exception that you will have to trap). Queue is there for a reason! ;)
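For completeness, a rough sketch of the lock-based alternative (again the folder name and process_file() are placeholders; note the existence re-check after acquiring the lock, which is exactly the sort of race you have to trap):
import fcntl
import os

INCOMING = "incoming"              # hypothetical folder receiving the files

def try_process_one():
    # Scan the folder and claim the first file we can lock; return True if
    # a file was processed.
    for name in os.listdir(INCOMING):
        path = os.path.join(INCOMING, name)
        try:
            f = open(path, "rb")
        except OSError:
            continue                            # already deleted by another worker
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            f.close()
            continue                            # another worker holds the lock
        try:
            if not os.path.exists(path):        # processed and deleted while we waited
                continue
            process_file(f)                     # your processing goes here
            os.remove(path)                     # delete while still holding the lock
        finally:
            f.close()                           # closing releases the lock
        return True
    return False

def process_file(f):
    pass                                        # placeholder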
EDIT2: Comments in this and other questions are bringing up the problem of simultaneous disk access to different files and the consequent performance hit. I was thinking that task_done would have been used to prevent this, but reading others' comments it occurred to me that instead of queuing file names, one could queue the files' contents directly. This second alternative would only work for a limited number of queued files of limited size, given that RAM would fill up rather quickly otherwise.
I don't know whether RAID and other parallel disk configurations would already take care of reading one file per disk rather than bouncing back and forth between two files on both disks.
HTH!

If you want multiple threads to read several files in parallel directly from the same folder, then I must disappoint you: reading in parallel from a single disk is not a viable option. A single disk needs to spin and seek to the next location to be read. If you read with multiple threads, you just bounce the disk around between seeks, and the performance is much worse than a simple sequential read.
Just stick to mac's advice and use a single thread for reading.


How to do proper file locking on NFS?

I am trying to implement a "record manager" class in Python 3.x on Linux/macOS. The class is relatively easy and straightforward; the only "hard" thing I want is to be able to access the same file (where results are saved) from multiple processes.
This seemed pretty easy, conceptually: when saving, acquire an exclusive lock on the file. Update your information, save the new information, release exclusive lock on the file. Easy enough.
I am using fcntl.lockf(file, fcntl.LOCK_EX) to acquire the exclusive lock. The problem is that, looking around the internet, I am finding a lot of different websites saying that this is not reliable, that it won't work on Windows, that support over NFS is shaky, and that things could change between macOS and Linux.
I have accepted that the code won't work on Windows, but I was hoping to be able to make it work on macOS (single machine) and on Linux (on multiple servers with NFS).
The problem is that I can't seem to make this work; after a while of debugging, and after the tests passed on macOS, they failed once I tried them over NFS with Linux (Ubuntu 16.04). The issue is an inconsistency in the information saved by multiple processes - some processes have their modifications missing, which means something went wrong in the locking and saving procedure.
I am sure there is something I am doing wrong, and I suspect it is related to the issues I read about online. So, what is the proper way to deal with multiple accesses to the same file that works on macOS and on Linux over NFS?
Edit
This is what the typical method that writes new information to disk looks like:
sf = open(self._save_file_path, 'rb+')
try:
    fcntl.lockf(sf, fcntl.LOCK_EX)  # acquire an exclusive lock - only one writer
    self._raw_update(sf)            # updates the records from file (other processes may have modified it)
    self._saved_records[name] = new_info
    self._raw_save()                # does not check for locks (but does *not* release the lock on self._save_file_path)
finally:
    sf.flush()
    os.fsync(sf.fileno())           # forcing the OS to write to disk
    sf.close()                      # release the lock and close
And this is what a typical method that only reads info from disk looks like:
sf = open(self._save_file_path, 'rb')
try:
    fcntl.lockf(sf, fcntl.LOCK_SH)  # acquire a shared lock - multiple readers
    self._raw_update(sf)            # updates the records from file (other processes may have modified it)
    return self._saved_records
finally:
    sf.close()                      # release the lock and close
Also, this is what _raw_save looks like:
def _raw_save(self):
    # write to a temp file first to avoid accidental corruption of information;
    # os.replace is guaranteed to be an atomic operation in POSIX
    with open('temp_file', 'wb') as p:
        p.write(self._saved_records)
    os.replace('temp_file', self._save_file_path)  # pretty sure this does not release the lock
Error message
I have written a unit test where I create 100 different processes, 50 that read and 50 that write to the same file. Each process does some random waiting to avoid accessing the file sequentially.
The problem is that some of the records are not kept; at the end there are some 3-4 random records missing, so I only end up with 46-47 records rather than 50.
Edit 2
I have modified the code above so that I acquire the lock not on the file itself but on a separate lock file. This prevents the issue that closing the data file would release the lock (as suggested by @janneb), and makes the code work correctly on macOS. The same code fails on Linux over NFS, though.
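Roughly, the updated locking now looks like this (a simplified sketch; the lock-file name is just illustrative, and it is still subject to the fcntl-over-NFS caveats discussed in the answers below):
import fcntl
from contextlib import contextmanager

@contextmanager
def exclusive(lock_path):
    # Hold an exclusive fcntl lock on a dedicated lock file for the duration
    # of the with-block; the data file can then be closed or os.replace()d
    # without dropping the lock.
    lock = open(lock_path, "w")
    try:
        fcntl.lockf(lock, fcntl.LOCK_EX)
        yield
    finally:
        lock.close()                      # closing releases the lock

# usage (illustrative):
# with exclusive(self._save_file_path + ".lock"):
#     self._raw_update(sf)
#     self._saved_records[name] = new_info
#     self._raw_save()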
I don't see how the combination of file locks and os.replace() can make sense. When the file is replaced (that is, when the directory entry is replaced), all the existing file locks (probably including locks waiting for the locking to succeed; I'm not sure of the semantics here) and file descriptors will be against the old file, not the new one. I suspect this is the reason behind the race conditions causing you to lose some of the records in your tests.
os.replace() is a good technique to ensure that a reader doesn't read a partial update. But it doesn't work robustly in the face of multiple updaters (unless losing some of the updates is ok).
Another issue is that fcntl is a really, really stupid API. In particular, the locks are bound to the process, not to the file descriptor. Which means that e.g. a close() on ANY file descriptor pointing to the file will release the lock.
One way would be to use a "lock file", e.g. taking advantage of the atomicity of link(). From http://man7.org/linux/man-pages/man2/open.2.html:
Portable programs that want to perform atomic file locking using a lockfile, and need to avoid reliance on NFS support for O_EXCL, can create a unique file on the same filesystem (e.g., incorporating hostname and PID), and use link(2) to make a link to the lockfile. If link(2) returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.
If it's OK to read slightly stale data, then you can use this link() dance only for a temp file that you use when updating the file, and then os.replace() the "main" file you use for reading (reading can then be lockless). If not, then you need to do the link() trick for the "main" file and forget about shared/exclusive locking; all locks are then exclusive.
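A rough sketch of that link() dance in Python (the naming scheme, timeout and retry interval are my assumptions, not part of the man page recipe):
import os
import socket
import time

def acquire_lock(lock_path, timeout=30.0):
    # Unique file on the same filesystem, incorporating hostname and PID.
    unique = "%s.%s.%d" % (lock_path, socket.gethostname(), os.getpid())
    open(unique, "w").close()
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            os.link(unique, lock_path)      # atomic, even over NFS
        except FileExistsError:
            pass
        if os.stat(unique).st_nlink == 2:   # the link exists: lock acquired
            return unique
        time.sleep(0.1)
    os.unlink(unique)
    raise TimeoutError("could not acquire %s" % lock_path)

def release_lock(lock_path, unique):
    os.unlink(lock_path)
    os.unlink(unique)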
Addendum: One tricky thing to deal with when using lock files is what to do when a process dies for whatever reason, and leaves the lock file around. If this is to run unattended, you might want to incorporate some kind of timeout and removal of lock files (e.g. check the stat() timestamps).
Using randomly named hard links and the link counts on those files as lock files is a common strategy (e.g. this), and arguably better than using lockd. For far more information about the limits of all sorts of locks over NFS, read this: http://0pointer.de/blog/projects/locking.html
You'll also find that this is a long-standing standard problem for MTA software using Mbox files over NFS. Probably the best answer there was to use Maildir instead of Mbox, but if you look for examples in the source code of something like Postfix, it'll be close to best practice. And if they simply don't solve that problem, that might also be your answer.
NFS is great for file sharing. It sucks as a "transmission" medium.
I've been down the NFS-for-data-transmission road multiple times. In every instance, the solution involved moving away from NFS.
Getting reliable locking is one part of the problem. The other part is the update of the file on the server and expecting the clients to receive that data at some specific point-in-time (such as before they can grab the lock).
NFS isn't designed to be a data transmission solution. There are caches and timing involved. Not to mention paging of the file content and of file metadata (e.g. the atime attribute), and client OSes keeping track of state locally (such as "where" to append the client's data when writing to the end of the file).
For a distributed, synchronized store, I recommend looking at a tool that does just that. Such as Cassandra, or even a general-purpose database.
If I'm reading the use case correctly, you could also go with a simple server-based solution. Have a server listen for TCP connections, read messages from the connections, and then write each to the file, serializing the writes within the server itself. There's some added complexity in having your own protocol (to know where a message starts and stops), but otherwise it's fairly straightforward.
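A toy sketch of such a server (the port, output path and newline-delimited protocol are all assumptions):
import socketserver
import threading

OUTPUT_PATH = "results.log"            # hypothetical output file
write_lock = threading.Lock()

class AppendHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Newline-delimited protocol: each received line is one record.
        for line in self.rfile:
            with write_lock:           # serialize writes inside the server
                with open(OUTPUT_PATH, "a") as out:
                    out.write(line.decode("utf-8"))

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("0.0.0.0", 9000), AppendHandler) as srv:
        srv.serve_forever()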

Read a file multi-threaded in Python in chunks of 2 KB

I have to read a file in chunks of 2 KB and do some operation on those chunks. Where I'm actually stuck is when the data needs to be thread-safe. From what I've seen in online tutorials and StackOverflow answers, we define a worker thread and override its run method. The run method uses data from a queue which we pass as an argument, and which contains the actual data. But to load that queue with data, I'll have to go through the file serially, which eliminates parallelism. I want multiple threads to read the file in parallel. So I'll have to cover the read part in the run function only, but I'm not sure how to go about that. Help needed.
Reading the file serially is your best option since (hardware-wise) it gives you the best read throughput.
Usually the slow part is not the data reading but its processing...
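So a sketch along those lines would be one reader thread filling a bounded queue with 2 KB chunks and a few workers doing the processing (the file name and process() are placeholders):
import queue
import threading

CHUNK_SIZE = 2048
NUM_WORKERS = 4
chunks = queue.Queue(maxsize=100)      # bounded, so reading can't outrun RAM

def reader(path):
    # Single sequential reader: best throughput on a single disk.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            chunks.put(chunk)
    for _ in range(NUM_WORKERS):
        chunks.put(None)               # one sentinel per worker

def worker():
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        process(chunk)                 # your per-chunk work goes here

def process(chunk):
    pass                               # placeholder

threads = [threading.Thread(target=reader, args=("data.bin",))]
threads += [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()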

Is using a giant OR regex inefficient in Python?

Simple question: is using a giant OR regex inefficient in Python? I am building a script to search for bad files. I have a source file that contains 50 or so "signatures" so far. The list is in the form of:
Djfsid
LJflsdflsdf
fjlsdlf
fsdf
.
.
.
There are no real "consistencies", so optimizing the list by removing "duplicates" or checking for "is one entry a substring of another entry" won't do much.
I basically want to os.walk down a directory, open a file, check for the signature, close and move on.
To speed things up I will break the list up into 50/N different sublists, where N is the number of cores, and have each thread work on a few entries of the list.
Would using a giant regex re.search('(entry1|entry2|entry3....|entryK)', FILE_CONTENTS) or a giant loop for i in xrange(0,NUM_SUBENTRIES)...if subentry[i] in FILE_CONTENTS... be better?
Also, is this a good way to multithread? This is Unix, so multiple threads can work on the same file at the same time. Will disk access basically bottleneck me to the point where multithreading is useless?
Also, is this a good way to multithread?
Not really.
Will disk access basically bottleneck me to the point where multithreading is useless?
Correct.
You might want to look closely at multiprocessing.
A worker Process should do the os.walk and put the file names into a Queue.
A pool of worker Process instances. Each will get a file name from the Queue, open it, check the signature and enqueue results into a "good" Queue and a "bad" Queue. Create as many of these as it takes to make the CPU 100% busy.
Another worker Process instance can dequeue good entries and log them.
Another worker Process instance can dequeue bad entries and delete or rename or whatever is supposed to happen. This can interfere with the os.walk. A possibility is to log these into a "do this next" file which is processed after the os.walk is finished.
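A rough sketch of that layout (the signature list, output files and worker count are placeholders; this is just one way to wire it up):
import os
import re
from multiprocessing import Process, Queue, cpu_count

SIGNATURES = [b"signature1", b"signature2"]        # placeholder entries
PATTERN = re.compile(b"|".join(map(re.escape, SIGNATURES)))

def walker(root, names, n_workers):
    # Walks the tree and feeds file paths to the scanners.
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            names.put(os.path.join(dirpath, name))
    for _ in range(n_workers):
        names.put(None)                            # one sentinel per scanner

def scanner(names, good, bad):
    while True:
        path = names.get()
        if path is None:
            break
        with open(path, "rb") as f:
            contents = f.read()
        (bad if PATTERN.search(contents) else good).put(path)

def logger(results, out_path):
    # Drains one result queue into a file until it sees the sentinel.
    with open(out_path, "w") as out:
        for path in iter(results.get, None):
            out.write(path + "\n")

if __name__ == "__main__":
    names, good, bad = Queue(), Queue(), Queue()
    n = cpu_count()
    scanners = [Process(target=scanner, args=(names, good, bad)) for _ in range(n)]
    loggers = [Process(target=logger, args=(good, "good_files.txt")),
               Process(target=logger, args=(bad, "bad_files.txt"))]
    feeder = Process(target=walker, args=(".", names, n))
    for p in scanners + loggers + [feeder]:
        p.start()
    feeder.join()
    for p in scanners:
        p.join()
    good.put(None)
    bad.put(None)
    for p in loggers:
        p.join()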
It would depend on the machine you are using. If you use the machine's maximum capacity, it will slow down, of course. I think the best way to find out is to try.
Don't worry about optimisation.
50 data points is tiny compared to what your computer can manage, so you'll probably waste a lot of your time, and make your program more complicated.

How to have multiple python programs append rows to the same file?

I've got multiple python processes (typically 1 per core) transforming large volumes of data that they are each reading from dedicated sources, and writing to a single output file that each opened in append mode.
Is this a safe way for these programs to work?
Because of the tight performance requirements and large data volumes I don't think that I can have each process repeatedly open & close the file. Another option is to have each write to a dedicated output file and a single process concatenate them together once they're all done. But I'd prefer to avoid that.
Thanks in advance for any & all answers and suggestions.
Have you considered using the multiprocessing module to coordinate between the running programs in a thread-like manner? See in particular the queue interface; you can place each completed work item on a queue when completed, and have a single process reading off the queue and writing to your output file.
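A minimal sketch of that arrangement (the transform() step, worker count and output file name are placeholders):
from multiprocessing import Process, Queue

def worker(source_id, results):
    # Each worker reads its own source and never touches the output file.
    for row in transform(source_id):       # hypothetical per-source work
        results.put(row)
    results.put(None)                      # this worker is done

def writer(results, n_workers, path="output.txt"):
    # Single process owns the output file and serializes all appends.
    done = 0
    with open(path, "a") as out:
        while done < n_workers:
            row = results.get()
            if row is None:
                done += 1
            else:
                out.write(row + "\n")

def transform(source_id):
    return []                              # placeholder

if __name__ == "__main__":
    results = Queue()
    workers = [Process(target=worker, args=(i, results)) for i in range(4)]
    w = Process(target=writer, args=(results, len(workers)))
    for p in workers + [w]:
        p.start()
    for p in workers + [w]:
        p.join()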
Alternately, you can have each subprocess maintain a separate pipe to a parent process which does a select() call from all of them, and copies data to the output file when appropriate. Of course, this can be done "by hand" (without the multiprocessing module) as well as with it.
Alternately, if the reason you're avoiding threads is to avoid the global interpreter lock, you might consider a non-CPython implementation (such as Jython or IronPython).
Your procedure is "safe" in that no crashes will result, but data coming (with very unlucky timing) from different processes could get mixed up -- e.g., process 1 is appending a long string of a's, process 2 a long string of b's; you could end up in the file with lots of a's, then b's, then more a's (or other combinations / interleavings).
The problem is that .write is not guaranteed to be atomic for sufficiently long string arguments. If you have a tight bound on the arguments, less than your filesystem/OS block size, you might be lucky. Otherwise, try the logging module, which does take more precautions (but perhaps those precautions might slow you down... you'll need to benchmark) exactly because it targets "log files" that are often being appended to by multiple programs.

What's the best way to divide large files in Python for multiprocessing?

I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or should I map the file iterable over a pool of processes using multiprocessing? I've experimented with these approaches, but the overhead is immense when distributing the data line by line. I've settled on a lightweight pipe-filters design by using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly to the second input (see this post), but I'd like a solution contained entirely in Python.
Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).
Thanks,
Vince
Additional information: processing time per line varies. Some problems are fast and all but I/O-bound, some are CPU-bound. The CPU-bound, non-dependent tasks will gain the most from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall clock time.
A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O-bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or with a Queue in multiprocessing, it is always over 100% slower.
One of the best architectures is already part of the Linux OS. No special libraries are required.
You want a "fan-out" design.
A "main" program creates a number of subprocesses connected by pipes.
The main program reads the file, writing lines to the pipes and doing the minimum filtering required to deal the lines to the appropriate subprocesses.
Each subprocess should probably be a pipeline of distinct processes that read from stdin and write to stdout.
You don't need a queue data structure, that's exactly what an in-memory pipeline is -- a queue of bytes between two concurrent processes.
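A bare-bones sketch of that fan-out (the worker command here is a stand-in that just counts lines, and the input file name is an assumption):
import subprocess
import sys

NUM_WORKERS = 4
worker_code = "import sys; print(sum(1 for _ in sys.stdin))"   # stand-in worker

# Spawn the subprocesses, each connected to the main program by a pipe.
workers = [
    subprocess.Popen([sys.executable, "-c", worker_code],
                     stdin=subprocess.PIPE, text=True)
    for _ in range(NUM_WORKERS)
]

with open("big_input.txt") as f:          # hypothetical input file
    for i, line in enumerate(f):
        # Minimal "dealing" of lines: round-robin to the worker pipes.
        workers[i % NUM_WORKERS].stdin.write(line)

for w in workers:
    w.stdin.close()                       # EOF lets each worker finish
    w.wait()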
One strategy is to assign each worker an offset: if you have eight worker processes, you assign them numbers 0 to 7. Worker number 0 reads the first record, processes it, then skips 7 and goes on to process the 9th record, and so on; worker number 1 reads the second record, then skips 7 and processes the 10th record...
There are a number of advantages to this scheme. It doesn't matter how big the file is, the work is always divided evenly; processes on the same machine will process at roughly the same rate and use the same buffer areas, so you don't incur any excessive I/O overhead. As long as the file hasn't been updated, you can rerun individual threads to recover from failures.
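A rough sketch of that offset scheme using multiprocessing (the file name, worker count and per-record processing are assumptions):
import itertools
from multiprocessing import Process

NUM_WORKERS = 8

def worker(worker_id, path="big_input.txt"):
    with open(path) as f:
        # islice with step NUM_WORKERS yields every Nth line, starting at
        # this worker's offset; the lines in between are read and skipped.
        for line in itertools.islice(f, worker_id, None, NUM_WORKERS):
            process_line(line)             # hypothetical per-record processing

def process_line(line):
    pass                                   # placeholder

if __name__ == "__main__":
    procs = [Process(target=worker, args=(k,)) for k in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()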
You don't mention how you are processing the lines, which is possibly the most important piece of information.
Is each line independent? Is the calculation dependent on one line coming before the next? Must they be processed in blocks? How long does the processing of each line take? Is there a processing step that must incorporate "all" the data at the end? Or can intermediate results be thrown away and just a running total maintained? Can the file be initially split by dividing the file size by the number of threads? Or does it grow as you process it?
If the lines are independent and the file doesn't grow, the only coordination you need is to farm out "starting addresses" and "lengths" to each of the workers; they can independently open and seek into the file, and then you simply coordinate their results, perhaps by waiting for N results to come back into a queue.
If the lines are not independent, the answer will depend highly on the structure of the file.
I know you specifically asked about Python, but I encourage you to look at Hadoop (http://hadoop.apache.org/): it implements the MapReduce algorithm, which was specifically designed to address this kind of problem.
Good luck
It depends a lot on the format of your file.
Does it make sense to split it anywhere? Or do you need to split it at a new line? Or do you need to make sure that you split it at the end of an object definition?
Instead of splitting the file, you should use multiple readers on the same file, using os.lseek to jump to the appropriate part of the file.
Update: the poster added that he wants to split on newlines. Then I propose the following:
Let's say you have 4 processes. Then the simple solution is to os.lseek to 0%, 25%, 50% and 75% of the file, and read bytes until you hit the first newline. That's the starting point for each process. You don't need to split the file to do this; just seek to the right location in the large file in each process and start reading from there.
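A sketch of that seek-and-read-to-the-next-newline idea (the path, process count and per-line handling are assumptions):
import os
from multiprocessing import Process

PATH = "huge_file.txt"                 # hypothetical input file
NUM_WORKERS = 4

def worker(index):
    size = os.path.getsize(PATH)
    start = size * index // NUM_WORKERS
    end = size * (index + 1) // NUM_WORKERS
    with open(PATH, "rb") as f:
        f.seek(start)
        if index != 0:
            f.readline()               # skip the partial line we landed in
        # Process lines until we pass the next worker's starting offset.
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            handle(line)               # hypothetical per-line processing

def handle(line):
    pass                               # placeholder

if __name__ == "__main__":
    procs = [Process(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()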
Fredrik Lundh's Some Notes on Tim Bray's Wide Finder Benchmark is an interesting read, about a very similar use case, with a lot of good advice. Various other authors also implemented the same thing, some are linked from the article, but you might want to try googling for "python wide finder" or something to find some more. (there was also a solution somewhere based on the multiprocessing module, but that doesn't seem to be available anymore)
If the run time is long, instead of having each process read its next line through a Queue, have the processes read batches of lines. This way the overhead is amortized over several lines (e.g. thousands or more).
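For example (the batch size, file name and per-line work are placeholders):
import itertools
from multiprocessing import Process, Queue

BATCH_SIZE = 5000                      # lines per queue item; tune to taste

def producer(path, work, n_workers):
    with open(path) as f:
        while True:
            batch = list(itertools.islice(f, BATCH_SIZE))
            if not batch:
                break
            work.put(batch)            # one queue operation per BATCH_SIZE lines
    for _ in range(n_workers):
        work.put(None)                 # one sentinel per consumer

def consumer(work):
    while True:
        batch = work.get()
        if batch is None:
            break
        for line in batch:
            pass                       # per-line processing goes here

if __name__ == "__main__":
    work = Queue(maxsize=8)            # bounded so the producer can't run away
    consumers = [Process(target=consumer, args=(work,)) for _ in range(4)]
    prod = Process(target=producer, args=("big_input.txt", work, len(consumers)))
    for p in consumers + [prod]:
        p.start()
    for p in consumers + [prod]:
        p.join()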
