I have implemented the following code:
lines = []
with open('path_to_file', 'r+') as source:
    for line in source:
        line = line.replace('\n', '').strip()
        if line.split()[-1] != 'sent':
            # do some operation on lines without the 'sent' tag
            upload('data1.zip')
            upload('data2.zip')
            do_operation(line)
            # tag the line
            line += '\tsent'
        line += '\n'
        # temporarily save lines in a list
        lines.append(line)
    # move position back to the start of the file
    source.seek(0)
    # write the lines back to the file
    source.writelines(lines)
I am calling the upload methods in the section marked # do some operation on lines without the 'sent' tag to upload data to the cloud. As the data is fairly large (around 1 GB), the upload takes a while to finish. In the meantime, does the for loop go ahead and call upload(data2.zip)? I am getting errors because I cannot upload simultaneously.
If yes, how can I avoid this?
EDIT:
I have changed the upload function to return a 'done' status after uploading. How can I modify my main loop so that it waits after calling upload(data1.zip) and only then moves on to upload(data2.zip)? I want to synchronize.
I think your problem might be that you don't want to try to upload more than one file at a time.
Your code doesn't try to do any parallel uploads. So I suspect that your upload() function is starting an upload process and then letting it run in the background while it returns to you.
If this is true, you can try some of these options:
Pass an option to the upload function that tells it to wait until the upload finishes before returning.
Discover (research) some mechanism that you can use to synchronize your program with the process started by the upload function. For example, if the function returns the child process ID, you could wait on that PID until it completes. Or perhaps it writes the PID out to a pidfile; you could read in the number and wait for it (see the sketch after this list).
If you can't make the upload function do what you want synchronously, you might consider replacing calls to upload() with print statements to have your code generate some kind of script that could be executed separately, possibly with a different environment or using a different upload utility.
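To illustrate the PID-based option, here is a minimal sketch. It assumes (purely an assumption) that upload() spawns a child process of your script and returns its PID; os.waitpid() only works for direct children of the calling process.

import os

# Assumption: upload() spawns a child process and returns its PID.
pid = upload('data1.zip')
os.waitpid(pid, 0)        # block until that child process exits
pid = upload('data2.zip')
os.waitpid(pid, 0)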
You can use multiprocessing to do the time-consuming work.
import multiprocessing

# create a process for each file; each file gets its own worker process
processes = [multiprocessing.Process(target=upload, args=(zip_file,))
             for zip_file in ['data1.zip', 'data2.zip']]

# start the processes
for p in processes:
    p.start()

# wait for all processes to finish their work
for p in processes:
    p.join()

# execution will not reach this point until all files have finished uploading
...
...
You can send them off as independent processes. Use the Python multiprocessing module; there are nice tutorials, too.
Your inner loop might look something like this:
from multiprocessing import Process

up1 = Process(target=upload, args=('data1.zip',))
up2 = Process(target=upload, args=('data2.zip',))
up1.start()
up2.start()
# Now, do other stuff while these run
do_operation(line)
# tag the line
line += '\tsent'
# Wait for the uploads to finish -- in case they're slower than do_operation.
up1.join()
up2.join()
@Prune yes, it's me who is confused.. I want to synchronize.
Excellent; we have that cleared up. The things you synchronize are separate processes. You have your main process waiting for the result of your child process, the upload. Multiple processes is called ... :-)
Are we at a solution point now? I think the pieces you need are in one (or at most two) of these answers.
The program monitors a folder Receive_Dir and processes the files received in real time. After processing a file, the original file should be deleted to save disk space.
I am trying to use Python multiprocessing and Pool.
I want to check whether there is any technical flaw in the current approach.
One of the problems in the current code is that the program has to wait until all 20 files in the queue are processed before starting the next round, so it may be inefficient under certain conditions (e.g., widely varying file sizes).
from multiprocessing import Pool
import glob
import gzip
import os
import os.path

Parse_OUT = "/opt/out/"
Receive_Dir = "/opt/receive/"

def parser(infile):
    filename = os.path.basename(infile)
    out_dir = date_of(filename)
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
    fout = gzip.open(out_dir + '/' + filename + 'csv.gz', 'wb')
    with gzip.open(infile) as fin:
        for line in fin:
            data = line.split(',')
            fout.write(data)
    fout.close()
    os.remove(infile)

if __name__ == '__main__':
    pool = Pool(20)
    while True:
        targets = glob.glob(Receive_Dir)[:10]
        pool.map(parser, targets)
    pool.close()
I see several issues:
if not os.path.exists(out_dir): os.mkdir(out_dir): This is a race condition. If two workers try to create the same directory at the same time, one will raise an exception. Don't do the if condition; simply call os.makedirs(out_dir, exist_ok=True).
Don't assemble file paths with string addition. Simply do os.path.join(out_dir, filename + 'csv.gz'). This is cleaner and has fewer failure modes.
Instead of spinning in your while True loop even if no new directories appear, you can use the inotify mechanism on Linux to monitor the directory for changes. That will only wake your process when there is actually something to do. Check out pyinotify: https://github.com/seb-m/pyinotify
Since you mentioned that you are dissatisfied with the batching: you can use pool.apply_async to start new operations as they become available. Your main loop doesn't do anything with the results, so you can just "fire and forget" (see the sketch below).
Incidentally, why are you starting a pool with 20 workers and then launching only 10 directory operations at once?
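To illustrate the apply_async suggestion, here is a rough, untested sketch. The parser() body is a placeholder for the one from the question, and the *.gz glob pattern and one-second sleep are my assumptions; on Linux, inotify could replace the sleep entirely.

from multiprocessing import Pool
import glob
import os
import time

Receive_Dir = "/opt/receive/"

def parser(infile):
    # placeholder for the real parser() from the question
    print("processing", infile)

if __name__ == '__main__':
    seen = set()
    with Pool(20) as pool:
        while True:
            for path in glob.glob(os.path.join(Receive_Dir, "*.gz")):  # assumed pattern
                if path not in seen:
                    seen.add(path)
                    pool.apply_async(parser, (path,))   # fire and forget
            time.sleep(1)   # avoid a busy loop; inotify would be better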
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't even remotely correct. For example, the i variable is a crude attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code, nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send its result back to the main process, and the main process has to wait for that result and collect it.
This can be simpler with Pool: you don't have to use a queue manually, the results come back in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't hog the CPU from other processes on the system.
Combining it with tqdm is less straightforward, though (see the aside after the code).
I couldn't test it, but it could look something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for the main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
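As an aside, a progress bar is still possible if you swap map for imap and wrap the iterator in tqdm. This is only a sketch; asyncJSONs and all_jsons are assumed to be defined exactly as in the code above.

from multiprocessing import Pool
from tqdm import tqdm

if __name__ == '__main__':
    with Pool(5) as p:
        # imap yields results one by one, so tqdm can count them as they arrive
        materials = list(tqdm(p.imap(asyncJSONs, all_jsons), total=len(all_jsons)))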
BTW:
Other modules worth a look: concurrent.futures, joblib, ray.
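For instance, a rough equivalent with concurrent.futures (again untested, and assuming the same asyncJSONs and all_jsons as above) might be:

from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=5) as executor:
        materials = list(executor.map(asyncJSONs, all_jsons))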
I'm going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to some target file in ndjson/jsonlines format. That just means that, instead of objects being part of a JSON array [{},{}...], you have a separate object on each line:
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
Spawn N workers with a manifest of files to process and the output file to write to. You don't even need multiprocessing; you could use a tool like rush to parallelize.
Each worker parses its data and generates the output dict.
Each worker opens the output file with the append flag, dumps the data, and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is smaller than the kernel's buffer size (usually several MB), your different processes won't stomp on each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a much better data format for long lists of objects, since you don't have to parse the whole thing in memory.
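To illustrate that last point, reading the file back only ever needs one line in memory at a time; out.ndjson here is just a placeholder name:

import json

with open('out.ndjson') as fp:
    for line in fp:
        record = json.loads(line)   # one object at a time, never the whole file
        print(record)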
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put its results in the queue, while your main code waits for results to appear in the queue.
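A minimal sketch of that pattern, with made-up file names and a placeholder worker that just echoes its input back through the queue:

import multiprocessing as mp

def worker(filename, q):
    # real parsing would happen here; this placeholder just sends back a dict
    q.put({'name': filename})

if __name__ == '__main__':
    files = ['a.json', 'b.json']              # hypothetical inputs
    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(f, q)) for f in files]
    for p in procs:
        p.start()
    results = [q.get() for _ in files]        # blocks until each result arrives
    for p in procs:
        p.join()
    print(results)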
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where putting your mainline code into a function called main() really is a best practice.
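In other words, a skeleton like this (the names are generic, not taken from your code):

def main():
    # one-time setup and the code that creates worker processes goes here
    ...

if __name__ == "__main__":
    main()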
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
I have 2 separate scripts working with the same variables.
To be more precise, one script edits the variables and the other one uses them (it would be nice if it could edit them too, but that is not absolutely necessary).
This is what I am currently doing:
When script 1 edits a variable, it dumps it into a JSON file.
Script 2 repeatedly opens the JSON file to get the variables.
This method is really not elegant and the while loop is really slow.
How can I share variables across scripts?
My first script gets data from a MIDI controller and sends web requests.
My second script is for LED strips (those also run off the same MIDI controller). Both scripts run in a "while True" loop.
I can't simply put them in the same script, since every web request would slow the LEDs down. I am currently just sharing the variables via a JSON file.
If enough people ask for it I will post the whole code, but I have been told not to do that.
Considering the information you provided, meaning...
Both scripts run in a "while True" loop.
I can't simply put them in the same script, since every web request would slow the LEDs down.
To me, you have two choices:
Use a client/server model. You have two machines: one acts as the server and the second as the client. The server runs a script with an infinite loop that constantly updates the data, plus an API that simply reads and exposes the current state of your file/database to the client. The client, on the other machine, would simply request the current data and process it.
Make a single multiprocessing script. Each piece of logic would run as a separate process and manage its own memory. Since you also want to share variables between your two programs, you could pass a shared object as an argument to both of them. See this resource to help you.
Note that there are more solutions than these. For instance, you're currently opening and closing a JSON file over and over (that is probably what takes the most time in your program). You could use a real database, which is opened once and then queried and updated many times.
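As a hedged illustration of the database idea: SQLite ships with Python and can act as that shared store. The file name state.db and the kv table below are invented for this sketch.

import sqlite3

# Writer side (the MIDI/web-request script):
conn = sqlite3.connect('state.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
conn.execute("INSERT OR REPLACE INTO kv VALUES ('led_mode', 'rainbow')")
conn.commit()

# Reader side (the LED script, with its own connection):
conn2 = sqlite3.connect('state.db')
row = conn2.execute("SELECT value FROM kv WHERE key = 'led_mode'").fetchone()
print(row[0] if row else None)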
A Manager from multiprocessing lets you do this sort of thing pretty easily.
First, I simplify your "MIDI controller and web-request" code down to something that just sleeps for random amounts of time and updates a variable in a managed dictionary:
from time import sleep
from random import random

def slow_fn(d):
    i = 0
    while True:
        sleep(random() ** 2)
        i += 1
        d['value'] = i
Next, we simplify the "LED strip" control down to something that just prints to the screen:
from time import sleep, perf_counter

def fast_fn(d):
    last = perf_counter()
    while True:
        sleep(0.05)
        value = d.get('value')
        now = perf_counter()
        print(f'fast {value} {(now - last) * 1000:.2f}ms')
        last = now
You can then run these functions in separate processes:
import multiprocessing as mp

if __name__ == '__main__':   # needed on platforms that spawn new interpreters
    with mp.Manager() as manager:
        d = manager.dict()
        procs = []
        for fn in [slow_fn, fast_fn]:
            p = mp.Process(target=fn, args=[d])
            procs.append(p)
            p.start()
        for p in procs:
            p.join()
The "fast" output appears regularly, with no obvious visual pauses.
I have a function which requests a server, retrieves some data, processes it and saves a CSV file. This function needs to be launched 20k times. Each execution lasts a different amount of time: sometimes more than 20 minutes, other times less than a second. I decided to go with multiprocessing.Pool.map to parallelize the execution. My code looks like this:
from multiprocessing import Pool

def get_data_and_process_it(filename):
    print('getting', filename)
    ...
    print(filename, 'has been processed')

with Pool(8) as p:
    p.map(get_data_and_process_it, long_list_of_filenames)
Looking at how the prints are generated, it seems that long_list_of_filenames has been split into 8 parts and assigned to the CPUs, because sometimes it just gets blocked on one 20-minute execution with no other element of long_list_of_filenames being processed during those 20 minutes. What I was expecting is for map to schedule each element on a CPU core in a FIFO style.
Is there a better approach for my case?
The map method only returns when all operations have finished.
And printing from a pool worker is not ideal. For one thing, files like stdout use buffering, so there might be a variable amount of time between printing a message and it actually appearing. Furthermore, since all workers inherit the same stdout, their output would become intermeshed and possibly even garbled.
So I would suggest using imap_unordered instead. That returns an iterator that will begin yielding results as soon as they are available. The only catch is that this returns results in the order they finish, not in the order they started.
Your worker function (get_data_and_process_it) should return some kind of status indicator. For example a tuple of the filename and the result.
def get_data_and_process_it(filename):
    ...
    if error:
        return (filename, f'has *failed* because of {reason}')
    return (filename, 'has been processed')
You could then do:
with Pool(8) as p:
    for fn, res in p.imap_unordered(get_data_and_process_it, long_list_of_filenames):
        print(fn, res)
That gives accurate information about when a job finishes, and since only the parent process writes to stdout, there is no chance of the output becoming garbled.
Additionally, I would suggest calling sys.stdout.reconfigure(line_buffering=True) somewhere near the beginning of your program. That ensures that the stdout stream is flushed after every line of output.
map is blocking; instead of p.map you can use p.map_async. map waits for all of those function calls to finish, so we see all the results in a row. map_async returns immediately and does not wait for one task to finish before starting the next, so the main process can keep working in the meantime. This is the fastest approach. (For more.) There is also an SO thread which discusses map and map_async in detail.
The multiprocessing Pool class handles the queuing logic for us. It's perfect for running web-scraping jobs in parallel (example) or really any job that can be broken up and distributed independently. If you need more control over the queue or need to share data between multiple processes, you may want to look at the Queue class. (For more.)
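A hedged sketch of the map_async pattern applied to this question; the worker body, the file names, and the chunksize=1 argument (which keeps the pool from handing each worker a large batch up front) are my additions:

from multiprocessing import Pool
import time

def get_data_and_process_it(filename):
    time.sleep(0.1)          # placeholder for the real request/processing work
    return filename

if __name__ == '__main__':
    long_list_of_filenames = ['f1', 'f2', 'f3']      # hypothetical inputs
    with Pool(8) as p:
        async_result = p.map_async(get_data_and_process_it,
                                   long_list_of_filenames, chunksize=1)
        # the main process is free to do other work here
        results = async_result.get()                 # blocks until everything is done
    print(results)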
I have about 4 input text files that I want to read and write all of them into one separate file.
I use two threads so it runs faster!
Here are my questions and my code in Python:
1- Does each thread have its own version of variables such as lines inside the function writeInFile?
2- Since I copied some parts of the code from Tutorialspoint, I don't understand what "while 1: pass" on the last line does. Can you explain? Link to the original code: http://www.tutorialspoint.com/python/python_multithreading.htm
3- Does it matter what delay I put for the threads?
4- If I have about 400 input text files and want to do some operations on them before writing all of them into a separate file, how many threads can I use?
5- Assuming I use 10 threads, is it better to have the inputs in different folders (10 folders with 40 input text files each) and have each thread handle one folder, OR to use what I have already done in the code below, in which each thread reads one of the 400 input text files if it has not already been read by another thread?
import glob
import thread
import time
import timeit

processedFiles = []  # files already read by one thread, so the other thread doesn't read them again

# Function run by the threads
def writeInFile(threadName, delay):
    for file in glob.glob("*.txt"):
        if file not in processedFiles:
            processedFiles.append(file)
            f = open(file, "r")
            lines = f.readlines()
            f.close()
            time.sleep(delay)
            # open the file to write in
            f = open('myfile', 'a')
            f.write("%s \n" % lines)
            f.close()
            print "%s: %s" % (threadName, time.ctime(time.time()))

# Create two threads as follows
try:
    f = open('myfile', 'r+')
    f.truncate()
    start = timeit.default_timer()
    thread.start_new_thread(writeInFile, ("Thread-1", 0,))
    thread.start_new_thread(writeInFile, ("Thread-2", 0,))
    stop = timeit.default_timer()
    print stop - start
except:
    print "Error: unable to start thread"

while 1:
    pass
Yes. Each of the local variables lives on the thread's own stack and is not shared between threads.
This loop keeps the parent thread alive so the child threads can finish before the program terminates. The actual construct you should use for this is join, not a busy while loop (see the sketch after these answers). See "What is the use of join() in Python threading?".
In practice, yes, especially if the threads are writing to a common set of files (e.g., both thread 1 and thread 2 will be reading/writing the same file). Depending on the hardware, the size of the files and the amount of data you're trying to write, different delays may make your program feel more responsive to the user than not. The best bet is to start with a simple value and adjust it as you see the program work in a real-world setting.
While you can technically use as many threads as you want, you generally won’t get any performance benefits over 1 thread per core per CPU.
Different folders won’t matter as much for only 400 files. If you’re talking about 4,000,000 files, then it might matter for instances when you want to do ls on those directories. What will matter for performance is whether each thread is working on its own file or whether two or more threads might be operating on the same file.
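For reference, a minimal join-based sketch of the approach from answer 2, using the modern threading module instead of thread; writeInFile is assumed to be the function from the question:

import threading

threads = []
for name in ("Thread-1", "Thread-2"):
    t = threading.Thread(target=writeInFile, args=(name, 0))
    threads.append(t)
    t.start()

for t in threads:
    t.join()   # the main thread blocks here until both workers finish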
General thought: while it is a more advanced architecture, you may want to try to learn/use celery for these types of tasks in a production environment http://www.celeryproject.org/.