Real-time handling of multiple files in data processing (Python Multiprocessing) - python

The program monitors a receive directory and processes files as they arrive, in real time. After a file has been processed, the original should be deleted to save disk space.
I am trying to use Python multiprocessing with a Pool.
I want to check whether there is any technical flaw in the current approach.
One problem with the current code is that the program has to wait until all 20 files in the queue are processed before starting the next round, so it may be inefficient in certain conditions (i.e., varying file sizes).
from multiprocessing import Pool
import os
import os.path
import gzip  # used by parser() below
import glob  # used by the main loop below

Parse_OUT = "/opt/out/"
Receive_Dir = "/opt/receive/"

def parser(infile):
    out_dir = date_of(filename)
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
    fout = gzip.open(out_dir + '/' + filename + 'csv.gz', 'wb')
    with gzip.open(infile) as fin:
        for line in fin:
            data = line.split(',')
            fout.write(data)
    fout.close()
    os.remove(infile)

if __name__ == '__main__':
    pool = Pool(20)
    while True:
        targets = glob.glob(Receive_Dir)[:10]
        pool.map(parser, targets)
    pool.close()

I see several issues:
if not os.path.exists(out_dir): os.mkdir(out_dir): this is a race condition. If two workers try to create the same directory at the same time, one of them will raise an exception. Drop the if check and simply call os.makedirs(out_dir, exist_ok=True)
Don't assemble file paths with string addition. Simply do os.path.join(out_dir, filename+'csv.gz'). This is cleaner and has fewer failure states
Instead of spinning in your while True-loop even if no new directories appear, you can use the inotify mechanism on Linux to monitor the directory for changes. That would only wake your process if there is actually anything to do. Check out pyinotify: https://github.com/seb-m/pyinotify
Since you mentioned that you are dissatisfied with the batching: you can use pool.apply_async to start new operations as they become available. Your main loop doesn't do anything with the results, so you can just "fire and forget" (see the sketch at the end of this answer)
Incidentally, why are you starting a pool with 20 workers and then you just launch 10 directory operations at once?
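A minimal sketch of that fire-and-forget shape, reusing parser and Receive_Dir from the question; the seen set and the *.gz glob pattern are assumptions added for illustration, not part of the original code:
import glob
import time
from multiprocessing import Pool

Receive_Dir = "/opt/receive/"  # as in the question

if __name__ == '__main__':
    seen = set()  # paths already handed to the pool
    with Pool(20) as pool:
        while True:
            for path in glob.glob(Receive_Dir + "*.gz"):
                if path not in seen:
                    seen.add(path)
                    # parser() is the worker function from the question;
                    # submit each file immediately instead of waiting for a batch
                    pool.apply_async(parser, (path,))
            time.sleep(1)  # or replace this polling with pyinotify as suggested above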


Python multiprocessing to process files

I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1
for process in process_list:
    process.join()
Everything in that relating to multiprocessing was cobbled together from a collection of google searches and articles, so I wouldn't be surprised if it wasn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in the other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, which waits for the result and collects it.
It can be simpler with Pool. You don't need to use a queue manually, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't hog the CPU for other processes on the system.
But it doesn't combine as directly with tqdm; you would need Pool.imap instead of Pool.map to get a progress bar (see the sketch after the code).
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)
    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)
    for item in materials:
        print(item)
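On the tqdm point above: this is an untested sketch of how a progress bar can be combined with a pool by using Pool.imap instead of Pool.map; process_dict and the folder path come from the question, everything else mirrors the code above:
import os
import json
from multiprocessing import Pool
from tqdm import tqdm

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def parse_one(filename):
    # same body as asyncJSONs above; process_dict is defined in the question's code
    try:
        with open(os.path.join(folder, filename)) as f:
            properties = process_dict(json.loads(f.read()), {})
        properties['name'] = filename.split('.')[0]
        return properties
    except Exception as exc:
        print("Error parsing at {}: {}".format(filename, exc))

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    with Pool(5) as p:
        # imap yields results one by one (in order), so tqdm can track progress
        materials = list(tqdm(p.imap(parse_one, all_jsons), total=len(all_jsons)))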
BTW: other modules worth a look: concurrent.futures, joblib, ray.
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to some target file in ndjson/jsonlines format. That just means that, instead of objects being part of a JSON array [{},{}...], you have one separate object per line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and corrupt each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
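A hedged sketch of what one such worker could look like, applied to the asker's JSON files; process_dict and the folder come from the question, while the output filename and the use of a Pool (rather than a tool like rush) are assumptions for illustration:
import os
import json
from multiprocessing import Pool

folder = '/content/drive/My Drive/mrp_workflow/JSONs'
out_file = 'materials.ndjson'  # hypothetical output path

def extract_and_append(filename):
    # process_dict is defined in the question's code
    with open(os.path.join(folder, filename)) as f:
        properties = process_dict(json.loads(f.read()), {})
    properties['name'] = filename.split('.')[0]
    # append one JSON object per line; flush so small writes land intact
    with open(out_file, 'a') as fp:
        print(json.dumps(properties), file=fp, flush=True)

if __name__ == '__main__':
    with Pool(5) as p:
        p.map(extract_and_append, os.listdir(folder))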
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put its results on the queue, while your main code waits for stuff to magically appear in the queue.
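A minimal sketch of that queue pattern, assuming a worker like the asker's (process_dict and the folder come from the question); with 336,000 files you would still want a Pool, this only shows the Queue mechanism:
import os
import json
from multiprocessing import Process, Queue

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def worker(filename, queue):
    with open(os.path.join(folder, filename)) as f:
        properties = process_dict(json.loads(f.read()), {})  # process_dict from the question
    properties['name'] = filename.split('.')[0]
    queue.put(properties)  # send the result back instead of writing to a global

if __name__ == '__main__':
    queue = Queue()
    files = os.listdir(folder)
    procs = [Process(target=worker, args=(name, queue)) for name in files]
    for p in procs:
        p.start()
    # collect one result per worker *before* joining, so the queue never fills up
    materials = [queue.get() for _ in procs]
    for p in procs:
        p.join()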
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.

Python Recursive Multiprocessing - too many threads

Background:
Python 3.5.1, Windows 7
I have a network drive that holds a large number of files and directories. I'm trying to write a script to parse through all of these as quickly as possible, find all files that match a RegEx, and copy those files to my local PC for review. There are about 3500 directories and subdirectories, and a few million files. I'm trying to make this as generic as possible (i.e., not writing code for this exact file structure) in order to reuse it for other network drives. My code works when run against a small network drive; the issue here seems to be scalability.
I've tried a few things using the multiprocessing library and can't seem to get it to work reliably. My idea was to create a new job to parse through each subdirectory to work as quickly as possible. I have a recursive function that parses through all objects in a directory, then calls itself for any subdirectories, and checks any files it finds against the RegEx.
Question: how can I limit the number of threads/processes without using Pools to achieve my goal?
What I've tried:
If I only use Process jobs, I get the error RuntimeError: can't start new thread after more than a few hundred threads start, and it starts dropping connections. I end up with about half the files found, as half of the directories error out (code for this below).
To limit the number of total threads, I tried to use the Pool methods, but I can't pass pool objects to called methods according to this question, which makes the recursion implementation not possible.
To fix that, I tried to call Processes inside the Pool methods, but I get the error daemonic processes are not allowed to have children.
I think that if I can limit the number of concurrent threads, then my solution will work as designed.
Code:
import os
import re
import shutil
from multiprocessing import Process, Manager
CheckLocations = ['network drive location 1', 'network drive location 2']
SaveLocation = 'local PC location'
FileNameRegex = re.compile('RegEx here', flags = re.IGNORECASE)
# Loop through all items in folder, and call itself for subfolders.
def ParseFolderContents(path, DebugFileList):
    FolderList = []
    jobs = []
    TempList = []
    if not os.path.exists(path):
        return
    try:
        for item in os.scandir(path):
            try:
                if item.is_dir():
                    p = Process(target=ParseFolderContents, args=(item.path, DebugFileList))
                    jobs.append(p)
                    p.start()
                elif FileNameRegex.search(item.name) != None:
                    DebugFileList.append((path, item.name))
                else:
                    pass
            except Exception as ex:
                if hasattr(ex, 'message'):
                    print(ex.message)
                else:
                    print(ex)
                    # print('Error in file:\t' + item.path)
    except Exception as ex:
        if hasattr(ex, 'message'):
            print(ex.message)
        else:
            print('Error in path:\t' + path)
        pass
    else:
        print('\tToo many threads to restart directory.')
    for job in jobs:
        job.join()

# Save list of debug files.
def SaveDebugFiles(DebugFileList):
    for file in DebugFileList:
        try:
            shutil.copyfile(file[0] + '\\' + file[1], SaveLocation + file[1])
        except PermissionError:
            continue

if __name__ == '__main__':
    with Manager() as manager:
        # Iterate through all directories to make a list of all desired files.
        DebugFileList = manager.list()
        jobs = []
        for path in CheckLocations:
            p = Process(target=ParseFolderContents, args=(path, DebugFileList))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
        print('\n' + str(len(DebugFileList)) + ' files found.\n')
        if len(DebugFileList) == 0:
            quit()
        # Iterate through all debug files and copy them to local PC.
        n = 25  # Number of files to grab for each parallel path.
        TempList = [DebugFileList[i:i + n] for i in range(0, len(DebugFileList), n)]  # Split list into small chunks.
        jobs = []
        for item in TempList:
            p = Process(target=SaveDebugFiles, args=(item, ))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
Don't disdain the usefulness of pools, especially when you want to control the number of processes to create. They also take care of managing your workers (create/start/join/distribute chunks of work) and help you collect potential results.
As you have realized yourself, you create way too many processes, up to a point where you seem to exhaust so many system resources that you cannot create more processes.
Additionally, the creation of new processes in your code is controlled by outside factors, i.e. the number of folders in your file trees, which makes it very difficult to limit the number of processes. Also, creating a new process comes with quite some overhead on the OS and you might even end up wasting that overhead on empty directories. Plus, context switches between processes are quite costly.
With the number of processes you create, given the number of folders you stated, your processes will basically just sit there and idle most of the time while they are waiting for a share of CPU time to actually do some work. There will be a lot of contention for said CPU time, unless you have a supercomputer with thousands of cores at your disposal. And even when a process gets some CPU time to work, it will likely spend quite a bit of that time waiting for I/O.
That being said, you'll probably want to look into using threads for such a task. And you could do some optimization in your code. From your example, I don't see any reason why you would split identifying the files to copy and actually copying them into different tasks. Why not let your workers copy each file they found matching the RE right away?
I'd create a list of files in the directories in question using os.walk (which I consider reasonably fast) from the main thread and then offload that list to a pool of workers that checks these files for matches and copies those right away:
import os
import re
from multiprocessing.pool import ThreadPool

search_dirs = ["dir 1", "dir2"]
ptn = re.compile(r"your regex")
# your target dir definition

file_list = []
for topdir in search_dirs:
    for root, dirs, files in os.walk(topdir):
        for file in files:
            file_list.append(os.path.join(root, file))

def copier(path):
    if ptn.match(path):
        # do your shutil.copyfile with the try-except right here
        # obviously I did not want to start mindlessly copying around files on my box :)
        return path

with ThreadPool(processes=10) as pool:
    results = pool.map(copier, file_list)

# print all the processed files. For those that did not match, None is returned
print("\n".join([r for r in results if r]))
On a side note: don't concatenate your paths manually (file[0] + "\\" + file[1]), rather use os.path.join for this.
I was unable to get this to work exactly as I desired. os.walk was slow, and every other method I thought of was either a similar speed or crashed due to too many threads.
I ended up using a similar method that I posted above, but instead of starting the recursion at the top level directory, it would go down one or two levels until there were several directories. It would then start the recursion at each of these directories in series, which limited the number of threads enough to finish successfully. Execution time is similar to os.walk, which would probably make for a simpler and more readable implementation.
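A rough sketch of that shape, reusing ParseFolderContents and CheckLocations from the question; the collect_subdirs helper and the depth of two levels are assumptions for illustration, not the asker's exact code:
import os
from multiprocessing import Manager

def collect_subdirs(path, depth=2):
    # descend one or two levels and return the directories found there
    if depth == 0:
        return [path]
    subdirs = []
    try:
        for item in os.scandir(path):
            if item.is_dir():
                subdirs.extend(collect_subdirs(item.path, depth - 1))
    except OSError:
        pass
    return subdirs or [path]

if __name__ == '__main__':
    with Manager() as manager:
        DebugFileList = manager.list()
        for top in CheckLocations:  # CheckLocations as defined in the question
            for subdir in collect_subdirs(top, depth=2):
                # run the recursive parser on each mid-level directory in series,
                # which keeps the total number of processes bounded
                ParseFolderContents(subdir, DebugFileList)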

Call one method with several threads in python

I have about 4 input text files that I want to read and write all of them into one separate file.
I use two threads so it runs faster!
Here are my questions and code in Python:
1 - Does each thread have its own version of variables such as "lines" inside the function "writeInFile"?
2 - Since I copied some parts of the code from Tutorialspoint, I don't understand what "while 1: pass" in the last line does. Can you explain? Link to the main code: http://www.tutorialspoint.com/python/python_multithreading.htm
3 - Does it matter what delay I put for the threads?
4 - If I have about 400 input text files and want to do some operations on them before writing all of them into a separate file, how many threads can I use?
5 - Assuming I use 10 threads, is it better to have the inputs in different folders (10 folders with 40 input text files each) and have each thread read one folder, OR use what I already have in the code below, in which I ask each thread to read one of the 400 input text files if it has not been read before by another thread?
processedFiles = []  # this list is used to check which files in the folder have already been read by one thread, so the other thread doesn't read them

# Function run by the threads
def writeInFile(threadName, delay):
    for file in glob.glob("*.txt"):
        if file not in processedFiles:
            processedFiles.append(file)
            f = open(file, "r")
            lines = f.readlines()
            f.close()
            time.sleep(delay)
            # open the file to write in
            f = open('myfile', 'a')
            f.write("%s \n" % lines)
            f.close()
            print "%s: %s" % (threadName, time.ctime(time.time()))

# Create two threads as follows
try:
    f = open('myfile', 'r+')
    f.truncate()
    start = timeit.default_timer()
    thread.start_new_thread(writeInFile, ("Thread-1", 0, ))
    thread.start_new_thread(writeInFile, ("Thread-2", 0, ))
    stop = timeit.default_timer()
    print stop - start
except:
    print "Error: unable to start thread"

while 1:
    pass
Yes. Each of the local variables is on the thread's stack and is not shared between threads.
This loop allows the parent thread to wait for each of the child threads to finish and exit before termination of the program. The actual construct you should use to handle this is join, not a while loop (see the sketch below). See what is the use of join() in python threading.
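A minimal sketch of that join-based shape, replacing thread.start_new_thread and the busy-wait; writeInFile is the function from the question:
import threading

threads = []
for name in ("Thread-1", "Thread-2"):
    t = threading.Thread(target=writeInFile, args=(name, 0))
    threads.append(t)
    t.start()

for t in threads:
    t.join()  # the main thread blocks here instead of spinning in "while 1: pass"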
In practice, yes, especially if the threads are writing to a common set of files (e.g., both thread 1 and thread 2 will be reading/writing the same file). Depending on the hardware, the size of the files, and the amount of data you're trying to write, different delays may make your program feel more responsive to the user than not. The best bet is to start with a simple value and adjust it as you see the program work in a real-world setting.
While you can technically use as many threads as you want, you generally won't get any performance benefit beyond 1 thread per core per CPU.
Different folders won't matter as much for only 400 files. If you're talking about 4,000,000 files, then it might matter for instances when you want to do ls on those directories. What will matter for performance is whether each thread is working on its own file or whether two or more threads might be operating on the same file.
General thought: while it is a more advanced architecture, you may want to try to learn/use celery for these types of tasks in a production environment http://www.celeryproject.org/.

Python for loop flow when calling a time consuming function

I have implemented the following code:
lines = []
with open('path_to_file', 'r+') as source:
    for line in source:
        line = line.replace('\n', '').strip()
        if line.split()[-1] != 'sent':
            # do some operation on line without 'sent' tag
            upload(data1.zip)
            upload(data2.zip)
            do_operation(line)
            # tag the line
            line += '\tsent'
        line += '\n'
        # temporary save lines in a list
        lines.append(line)
    # move position to start of the file
    source.seek(0)
    # write back lines to the file
    source.writelines(lines)
I am calling the upload methods in the section "# do some operation on line without 'sent' tag" to upload data to the cloud. As the data is a bit large (around 1 GB), it takes a while to finish the upload. In the meantime, does the for loop go ahead and call upload(data2.zip)? I am getting errors, as I cannot upload simultaneously.
If yes, how can I avoid this?
EDIT:
I have changed the upload function to return a status of "done" after uploading. So how can I modify my main loop so that it waits after calling upload(data1.zip) and only then moves on to upload(data2.zip)? I want to synchronize.
I think your problem might be that you don't want to try to upload more than one file at a time.
Your code doesn't try to do any parallel uploads. So I suspect that your upload() function is starting an upload process and then letting it run in the background while it returns to you.
If this is true, you can try some of these options:
Pass an option to the upload function that tells it to wait until the upload finishes before returning.
Discover (research) some attribute that you can use to synchronize your program with the process started by the upload function. For example, if the function returns the child process id, you could do a wait on that pid to complete. Or perhaps it writes the pid out to a pidfile - you could read in the number, and wait for it.
If you can't make the upload function do what you want synchronously, you might consider replacing calls to upload() with print statements to have your code generate some kind of script that could be executed separately, possibly with a different environment or using a different upload utility.
You can use multiprocessing to do the time-consuming work.
import multiprocessing

# creates a process for each of your files; each file gets its own process
processes = [multiprocessing.Process(target=upload, args=(zip_file,)) for zip_file in [data1.zip, data2.zip]]

# starts the processes
for p in processes:
    p.start()

# waits for all processes to finish their work
for p in processes:
    p.join()

# It will not get here until all files finish uploading.
...
...
You can send them off as independent processes. Use the Python multiprocessing module; there are nice tutorials, too.
Your inner loop might look something like this:
up1 = Process(target=upload, args=(data1.zip,))
up2 = Process(target=upload, args=(data2.zip,))
up1.start()
up2.start()
# Now, do other stuff while these run
do_operation(line)
# tag the line
line += '\tsent'
# Wait for the uploads to finish -- in case they're slower than do_operation.
up1.join()
up2.join()
@Prune yes, it's me who is confused.. I want to synchronize.
Excellent; we have that cleared up. The things you synchronize are separate processes. You have your main process waiting for the result of your child process, the upload. Multiple processes is called ... :-)
Are we at a solution point now? I think the pieces you need are in one (or at most two) of these answers.

python multiprocessing logging: QueueHandler with RotatingFileHandler "file being used by another process" error

I'm converting a program to multiprocessing and need to be able to log to a single rotating log from the main process as well as subprocesses. I'm trying to use the 2nd example in the python cookbook Logging to a single file from multiple processes, which starts a logger_thread running as part of the main process, picking up log messages off a queue that the subprocesses add to. The example works well as is, and also works if I switch to a RotatingFileHandler.
However if I change it to start logger_thread before the subprocesses (so that I can log from the main process as well), then as soon as the log rotates, all subsequent logging generates a traceback with WindowsError: [Error 32] The process cannot access the file because it is being used by another process.
In other words I change this code from the 2nd example
workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()
logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()
to this:
logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()
workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()
and swap out logging.FileHandler for logging.handlers.RotatingFileHandler (with a very small maxBytes for testing) and then I hit this error.
I'm using Windows and Python 2.7. QueueHandler is not part of the stdlib until Python 3.2, but I've copied the source code from a Gist, which it says is safe to do.
I don't understand why starting the listener first would make any difference, nor do I understand why any process other than main would be attempting to access the file.
You should never start any threads before subprocesses. When Python forks, the threads and IPC state will not always be copied properly.
There are several resources on this, just google for fork and threads. Some people claim they can do it, but it's not clear to me that it can ever work properly.
Just start all your processes first.
Example additional information:
Status of mixing multiprocessing and threading in Python
https://stackoverflow.com/a/6079669/4279
In your case, it might be that the copied open file handle is the problem, but you still should start your subprocesses before your threads (and before you open any files that you will later want to destroy).
Some rules of thumb, summarized by fantabolous from the comments:
Subprocesses must always be started before any threads created by the same process.
multiprocessing.Pool creates both subprocesses AND threads, so one mustn't create additional Processes or Pools after the first one.
Files should not already be open at the time a Process or Pool is created. (This is OK in some cases, but not, e.g. if a file will be deleted later.)
Subprocesses can create their own threads and processes, with the same rules above applying.
Starting all processes first is the easiest way to do this
So, you can simply make your own file log handler. I have yet to see logs getting garbled from multiprocessing, so it seems the file log rotation is the big issue. Just do this in your main, and you don't have to change any of the rest of your logging:
import os
import logging
import logging.handlers
from multiprocessing import RLock

class MultiprocessRotatingFileHandler(logging.handlers.RotatingFileHandler):
    def __init__(self, *kargs, **kwargs):
        super(MultiprocessRotatingFileHandler, self).__init__(*kargs, **kwargs)
        self.lock = RLock()  # replace the handler's thread lock with a process-aware lock

    def shouldRollover(self, record):
        with self.lock:
            return super(MultiprocessRotatingFileHandler, self).shouldRollover(record)

file_log_path = os.path.join('var', 'log', os.path.basename(__file__) + '.log')
file_log = MultiprocessRotatingFileHandler(file_log_path,
                                           maxBytes=8 * 1000 * 1024,
                                           backupCount=5,
                                           delay=True)
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().addHandler(file_log)
I'm willing to guess that locking every time you try to rotate is probably slowing down logging, but then this is a case where we need to sacrifice performance for correctness.
