An application I use for graphics has an embedded Python interpreter. It works exactly like any other Python interpreter, except that it exposes a few special objects.
Basically, I am trying to use Python to download a bunch of images and perform other network and disk I/O. If I do this without multithreading, my application freezes (i.e. videos stop playing) until the downloads are finished.
To get around this I am trying to use multithreading. However, I cannot touch anything in the main process.
I have written this code. The only parts unique to the program are commented. me.store / me.fetch are basically ways of setting and getting a global variable. op('files') refers to a global table.
Those are the two things in the main process that may only be touched in a thread-safe way, and I am not sure whether my code does that.
I would appreciate any input as to why (or why not) this code is thread-safe, and how I can access the global variables in a thread-safe way.
One thing I am worried about is how the counter is fetched by many threads. Since it is only updated after the file is written, could this cause a race condition where different threads fetch the counter with the same value (and then fail to store the incremented value correctly)? And what happens to the counter if the disk write fails?
from urllib import request
import threading, queue, os
url = 'http://users.dialogfeed.com/en/snippet/dialogfeed-social-wall-twitter-instagram.json?api_key=ac77f8f99310758c70ee9f7a89529023'
imgs = [
'http://search.it.online.fr/jpgs/placeholder-hollywood.jpg.jpg',
'http://www.lpkfusa.com/Images/placeholder.jpg',
'http://bi1x.caltech.edu/2015/_images/embryogenesis_placeholder.jpg'
]
def get_pic(url):
    # Fetch image data
    data = request.urlopen(url).read()
    # This is the part I am concerned about: what if multiple threads fetch the counter before it is updated below?
    # What happens if the file write fails?
    counter = me.fetch('count', 0)
    # Write the file to disk
    with open(str(counter) + '.jpg', 'wb') as outfile:
        outfile.write(data)
    file_name = 'file_' + str(counter)
    path = os.getcwd() + '\\' + str(counter) + '.jpg'
    me.store('count', counter + 1)
    return file_name, path

def get_url(q, results):
    url = q.get_nowait()
    file_name, path = get_pic(url)
    results.append([file_name, path])
    q.task_done()

def fetch():
    # Clear the table
    op('files').clear()
    results = []
    url_q = queue.Queue()
    # Simulate getting a JSON feed
    print(request.urlopen(url).read().decode('utf-8'))
    for img in imgs:
        # Add url to queue and start a thread
        url_q.put(img)
        t = threading.Thread(target=get_url, args=(url_q, results))
        t.start()
    # Wait for threads to finish before updating table
    url_q.join()
    for cell in results:
        op('files').appendRow(cell)

# Start a thread so that the first http get doesn't block
thread = threading.Thread(target=fetch)
thread.start()
Your code doesn't appear to be safe at all. Key points:
Appending to results is unsafe -- two threads might try to append to the list at the same time.
Accessing and setting counter is unsafe -- a thread may fetch counter before another thread has stored the new counter value.
Passing a queue of urls is redundant -- just pass a new url to each job.
Another way (concurrent.futures)
Since you are using Python 3, why not make use of the concurrent.futures module, which makes your task much easier to manage. Below I've written out your code in a way which does not require explicit synchronisation -- all the work is handled by the futures module.
from urllib import request
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import count
url = 'http://users.dialogfeed.com/en/snippet/dialogfeed-social-wall-twitter-instagram.json?api_key=ac77f8f99310758c70ee9f7a89529023'
imgs = [
'http://search.it.online.fr/jpgs/placeholder-hollywood.jpg.jpg',
'http://www.lpkfusa.com/Images/placeholder.jpg',
'http://bi1x.caltech.edu/2015/_images/embryogenesis_placeholder.jpg'
]
def get_pic(url, counter):
    # Fetch image data
    data = request.urlopen(url).read()
    # Write the file to disk
    with open(str(counter) + '.jpg', 'wb') as outfile:
        outfile.write(data)
    file_name = 'file_' + str(counter)
    path = os.getcwd() + '\\' + str(counter) + '.jpg'
    return file_name, path

def fetch():
    # Clear the table
    op('files').clear()
    with ThreadPoolExecutor(max_workers=2) as executor:
        count_start = me.fetch('count', 0)
        # reserve these numbers for our tasks
        me.store('count', count_start + len(imgs))
        # Separate fetching and storing is usually not thread safe.
        # However, if only one thread modifies count (the one running fetch),
        # then this is safe (the same goes for the files table).
        for cell in executor.map(get_pic, imgs, count(count_start)):
            op('files').appendRow(cell)

# Start a thread so that the first http get doesn't block
thread = threading.Thread(target=fetch)
thread.start()
If multiple threads modify count then you should use a lock when modifying count.
eg.
lock = threading.Lock()

def fetch():
    ...
    with lock:
        # Do not release the lock between accessing and modifying count.
        # Other threads wanting to modify count must use the same lock object
        # (not another instance of Lock).
        count_start = me.fetch('count', 0)
        me.store('count', count_start + len(imgs))
    # use count_start here
The only problem with this is that if one job fails for some reason, you will get a missing file number. Any raised exception will also interrupt the executor doing the mapping, by re-raising the exception there -- so you can handle it if needed.
You could avoid using a counter by using the tempfile module to find somewhere to temporarily store a file before moving the file somewhere permanent.
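For instance, a minimal sketch of that idea (the save_image helper and the fake data are made up for illustration): each call gets a unique name from the OS, so no shared counter or lock is needed.

```python
import tempfile

def save_image(data):
    # NamedTemporaryFile picks a unique name atomically, so no shared
    # counter (and hence no lock) is needed across threads.
    # delete=False keeps the file around after the handle is closed;
    # pass dir='...' to choose where it lands.
    with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as outfile:
        outfile.write(data)
        return outfile.name

name1 = save_image(b'fake-jpeg-bytes')
name2 = save_image(b'fake-jpeg-bytes')
# name1 and name2 are guaranteed to differ, whichever threads made the calls
```

You would then move each temporary file to its permanent location (e.g. with os.replace) once the download succeeds.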
Remember to look at the multiprocessing and threading docs if you are new to Python multithreading.
Your code seems OK, though the code style is not very easy to read. You need to run it to check that it works as you expect.
with will make sure your lock is released. The acquire() method is called when the block is entered, and release() is called when the block is exited.
If you add more threads, make sure they do not take the same url from the queue and that there is no race condition (it seems this is handled by Queue.get(), but you need to run it to verify). Remember, threads share the same process, so almost everything is shared. You don't want two threads handling the same url.
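One way to convince yourself that Queue.get() hands each item to exactly one thread is a small stress test (a sketch; the thread and item counts are arbitrary):

```python
import queue
import threading

q = queue.Queue()
for i in range(100):
    q.put(i)

seen = []
seen_lock = threading.Lock()

def worker():
    while True:
        try:
            item = q.get_nowait()   # each item is handed to exactly one thread
        except queue.Empty:
            return
        with seen_lock:             # protect the shared list while appending
            seen.append(item)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# seen now contains every item exactly once -- no duplicates, none missing
```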
The Lock doesn't do anything at all. You only have one thread that ever calls download_job - that's the one you assigned to my_thread. The other one, the main thread, calls offToOn and is finished as soon as it reaches the end of that function. So there is no second thread that ever tries to acquire the lock, and hence no second thread ever gets blocked. The table you mention is, apparently, in a file that you explicitly open and close. If the operating system protects this file against simultaneous access from different programs, you can get away with this; otherwise it is definitely unsafe because you haven't accomplished any thread synchronization.
Proper synchronization between threads requires that different threads have access to the SAME lock; i.e., one lock is accessed by multiple threads. Also note that "thread" is not a synonym for "process." Python supports both. If you're really supposed to avoid accessing the main process, you have to use the multiprocessing module to launch and manage a second process.
And this code will never exit, since there is always a thread running in an infinite loop (in threader).
Accessing a resource in a thread-safe manner requires something like this:
a_lock = Lock()

def use_resource():
    with a_lock:
        # do something
        ...
The lock is created once, outside the function that uses it. Every access to the resource in the whole application, from whatever thread, must acquire the same lock, either by calling use_resource or some equivalent.
Related
I have a Python script that does two things: 1) it downloads a large file by making an API call, and 2) it preprocesses that large file. I want to use multiprocessing to run my script. Each individual part (1 and 2) takes quite long. Everything happens in memory due to the large size of the files, so ideally a single core would do both (1) and (2) consecutively.
I have a large number of cores available (100+), but I can only have 4 API calls running at the same time (a limitation set by the API developers). So what I want to do is spawn 4 cores that start downloading by making an API call, and as soon as one of those cores finishes downloading and starts preprocessing, I want a new core to start the whole process as well. That way there are always 4 cores downloading, and as many cores as needed doing the preprocessing. I do not know, however, how to have a new core spawn as soon as another core is finished with the first part of the script.
My actual code is way too complex to just dump here, but let's say I have the following two functions:
import requests
def make_api_call(val):
    """Function that does part 1): makes an API call, stores it in memory
    and returns a large satellite GeoTIFF.
    """
    large_image = requests.get(val)
    return large_image

def preprocess_large_image(large_image):
    """Function that does part 2): preprocesses a large image and returns
    the relevant data.
    """
    results = preprocess(large_image)
    return results
How, then, can I make sure that as soon as a single core/process finishes make_api_call and starts preprocess_large_image, another core spawns and starts the entire process as well? That way there are always 4 images downloading side by side. Thank you in advance for the help!
This is a perfect application for a multiprocessing.Semaphore (or for safety, use a BoundedSemaphore)! Basically you put a lock around the api call part of the process, but let up to 4 worker processes hold the lock at any given time. For various reasons, things like Lock, Semaphore, Queue, etc all need to be passed at the creation of a Pool, rather than when a method like map or imap is called. This is done by specifying an initialization function in the pool constructor.
import multiprocessing as mp

def api_call(arg):
    return foo

def process_data(foo):
    return "done"

def map_func(arg):
    global semaphore
    with semaphore:
        foo = api_call(arg)
    return process_data(foo)

def init_pool(s):
    global semaphore
    semaphore = s

if __name__ == "__main__":
    s = mp.BoundedSemaphore(4)  # max concurrent API calls
    # n_workers should be large enough that there is always a free worker
    # waiting on semaphore.acquire()
    with mp.Pool(n_workers, init_pool, (s,)) as p:
        for result in p.imap(map_func, arglist):
            print(result)
If both the downloading (part 1) and the conversion (part 2) take long, there is not much reason to do everything in memory.
Keep in mind that networking is generally slower than disk operations.
So I would suggest to use two pools, saving the downloaded files to disk, and send file names to workers.
The first Pool is created with four workers and does the downloading. The worker saves the image to a file and returns the filename. With this Pool you use the imap_unordered method, because that starts yielding values as soon as they become available.
The second Pool does the image processing. It gets fed by apply_async, which returns an AsyncResult object.
We need to save those to keep track of when all the conversions are finished.
Note that map or imap_unordered are not suitable here because they require a ready-made iterable.
import multiprocessing
import time

import requests

def download(url):
    large_image = requests.get(url).content
    filename = url_to_filename(url)  # you need to write this
    with open(filename, "wb") as imgf:
        imgf.write(large_image)
    return filename

def process_image(name):
    with open(name, "rb") as f:
        large_image = f.read()
    # File processing goes here
    with open(name, "wb") as f:
        f.write(large_image)
    return name

dlp = multiprocessing.Pool(processes=4)
# Default pool size is os.cpu_count(); might be too much.
imgp = multiprocessing.Pool(processes=20)

urllist = ['http://foo', 'http://bar']  # et cetera
in_progress = []
for name in dlp.imap_unordered(download, urllist):
    in_progress.append(imgp.apply_async(process_image, (name,)))

# Wait for the conversions to finish.
while in_progress:
    finished = []
    for res in in_progress:
        if res.ready():
            finished.append(res)
    for f in finished:
        in_progress.remove(f)
        print(f"Finished processing '{f.get()}'.")
    time.sleep(0.1)
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp
jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)
def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in that relating to multiprocessing was cobbled together from a collection of google searches and articles, so I wouldn't be surprised if it wasn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to return its result to the main process, which waits for it and fetches it.
It can be simpler with Pool: you don't need to manage a queue manually, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so they won't starve other processes on the system of CPU.
But it can't use tqdm.
I couldn't test it but it can be something like this
import os
import json
from multiprocessing import Pool
# --- functions ---
def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)
    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)
    for item in materials:
        print(item)
BTW:
Other modules worth a look: concurrent.futures, joblib, ray.
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need, and append it to some target file in ndjson/jsonlines format. That's a format where, instead of objects inside a JSON array ([{}, {} ...]), you have one separate object per line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
Spawn N workers with a manifest of files to process and the output file to write to. You don't even need multiprocessing; you could use a tool like rush to parallelize.
Each worker parses its data and generates the output dict.
Each worker opens the output file with the append flag, dumps the data, and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that as long as each write is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and garble writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
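Reading jsonlines back in Python is just one json.loads per line, so nothing ever has to hold a whole array in memory; a tiny sketch:

```python
import json

# three records in jsonlines form -- one JSON object per line
lines = '{"foo": "bar"}\n{"foo": "spam"}\n{"eggs": "jam"}\n'

# parse one object at a time; skip any blank lines
records = [json.loads(line) for line in lines.splitlines() if line.strip()]
```

For a real file you would iterate the file object directly (`for line in fp:`) instead of holding the text in a string.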
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put its results in the queue, while your main code waits for them to magically appear in the queue.
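A minimal sketch of that pattern (the squaring worker is a stand-in for real work):

```python
import multiprocessing as mp

def work(item, result_q):
    # each worker process pushes its result into the shared queue
    result_q.put(item * item)

if __name__ == '__main__':
    result_q = mp.Queue()
    procs = [mp.Process(target=work, args=(i, result_q)) for i in range(4)]
    for p in procs:
        p.start()
    # get() blocks until each result arrives from a child process
    results = [result_q.get() for _ in procs]
    for p in procs:
        p.join()
```

Note the results arrive in whatever order the workers finish, not input order.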
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
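If it is mostly file reading, a thread-pool sketch like this (load_json is a stand-in for the real per-file work) may already give a large speedup, since threads release the GIL while blocked on I/O:

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor

def load_json(path):
    # Threads release the GIL while blocked on file I/O, so several
    # reads can genuinely overlap even within one Python process.
    with open(path) as f:
        return json.load(f)

def load_all(paths, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_json, paths))  # results keep input order

# tiny demo: write three small JSON files and read them back in parallel
paths = []
for i in range(3):
    with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
        json.dump({'i': i}, f)
        paths.append(f.name)
loaded = load_all(paths)
```

Unlike processes, the threads here share memory, so collecting results into one list needs no queues or copying.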
I am writing a Flask Web Application using WTForms. In one of the forms the user should upload a csv file and the server will analyze the received data. This is the code I am using.
filename = token_hex(8) + '.csv' # Generate a new filename
form.dataset.data.save('myapp/datasets/' + filename) # Save the received file
dataset = genfromtxt('myapp/datasets/' + filename, delimiter=',') # Open the newly generated file
# analyze 'dataset'
As long as I was using this code inside a single-thread application everything was working. I tried adding a thread in the code. Here's the procedure called by the thread (the same exact code inside a function):
def execute_analysis(form):
    filename = token_hex(8) + '.csv'  # Generate a new filename
    form.dataset.data.save('myapp/datasets/' + filename)  # Save the received file
    dataset = genfromtxt('myapp/datasets/' + filename, delimiter=',')  # Open the newly generated file
    # analyze 'dataset'
and here's how I call the thread
import threading

@posts.route("/estimation", methods=['GET', 'POST'])
@login_required
def estimate_parameters():
    form = EstimateForm()
    if form.validate_on_submit():
        threading.Thread(target=execute_analysis, args=[form]).start()
        flash("Your request has been received. Please check the site again in a few minutes.", category='success')
        # return render_template('posts/post.html', title=post.id, post=post)
    return render_template('estimations/estimator.html', title='New Analysis', form=form, legend='New Analysis')
But now I get the following error:
ValueError: I/O operation on closed file.
The error refers to the save function call. Why is it not working? How should I fix this?
It's hard to tell without further context, but I suspect it's likely that you're returning from a function or exiting a context manager which causes some file descriptor to close, and hence causes the save(..) call to fail with ValueError.
If so, one direct fix would be to wait for the thread to finish before returning/closing the file. Something along the lines of:
def handle_request(form):
    ...
    analyzer_thread = threading.Thread(target=execute_analysis, args=[form])
    analyzer_thread.start()
    ...
    analyzer_thread.join()  # wait for completion of execute_analysis
    cleanup_context(form)
    return
Here is a reproducible minimal example of the problem I am describing:
import threading
SEM = threading.Semaphore(0)
def run(fd):
    SEM.acquire()  # wait till release
    fd.write("This will fail :(")

fd = open("test.txt", "w+")
other_thread = threading.Thread(target=run, args=[fd])
other_thread.start()

fd.close()
SEM.release()  # release the semaphore, so other_thread will acquire & proceed
other_thread.join()
Note that the main thread will close the file, and the other thread will fail on write call with ValueError: I/O operation on closed file., as in your case.
I don't know the framework sufficiently to tell exactly what happened, but I can tell you how you probably can fix it.
Whenever you have a resource that is shared by multiple threads, use a lock.
from threading import Lock, Thread

LOCK = Lock()

def process():
    LOCK.acquire()
    ...  # open a file, write some data to it, etc.
    LOCK.release()

    # alternatively, use the context manager syntax
    with LOCK:
        ...

Thread(target=process).start()
Thread(target=process).start()
Documentation on threading.Lock:
The class implementing primitive lock objects. Once a thread has acquired a lock, subsequent attempts to acquire it block, until it is released
Basically, after thread 1 calls LOCK.acquire(), subsequent calls e.g. from other threads, will cause those threads to freeze and wait until something calls LOCK.release() (usually thread 1, after it finishes its business with the resource).
If the filenames are randomly generated then I wouldn't expect problems with 1 thread closing the other's file, unless both of them happen to generate the same name. But perhaps you can figure it out with some experimentation, e.g. first try locking calls to both save and genfromtxt and check if that helps. It might also make sense to add some print statements (or even better, use logging), e.g. to check if the file names don't collide.
I'm trying to use the fbx Python module from Autodesk, but it seems I can't thread any operation. This seems to be due to the GIL not being released. Has anyone seen the same issue, or am I doing something wrong? When I say it doesn't work, I mean the code doesn't release the thread and I'm not able to do anything else while the fbx code is running.
There isn't much code to post; I just want to know whether anyone has run into this.
Update:
Here is the example code; please note each fbx file is something like 2 GB.
import os
import fbx
import threading
file_dir = r'../fbxfiles'
def parse_fbx(filepath):
    print '-' * (len(filepath) + 9)
    print 'parsing:', filepath
    manager = fbx.FbxManager.Create()
    importer = fbx.FbxImporter.Create(manager, '')
    status = importer.Initialize(filepath)
    if not status:
        raise IOError()
    scene = fbx.FbxScene.Create(manager, '')
    importer.Import(scene)

    rootNode = scene.GetRootNode()

    def traverse(node):
        print node.GetName()
        for i in range(0, node.GetChildCount()):
            child = node.GetChild(i)
            traverse(child)

    # RUN
    traverse(rootNode)

    # free up memory
    importer.Destroy()
    manager.Destroy()

files = os.listdir(file_dir)
tt = []
for file_ in files:
    filepath = os.path.join(file_dir, file_)
    t = threading.Thread(target=parse_fbx, args=(filepath,))
    tt.append(t)
    t.start()
One problem I see is with your traverse() function. It's calling itself recursively potentially a huge number of times. Another is having all the threads printing stuff at the same time. Doing that properly requires coordinating access to the shared output device (i.e. the screen). A simple way to do that is by creating and using a global threading.Lock object.
First create a global Lock to prevent threads from printing at same time:
file_dir = '../fbxfiles' # an "r" prefix needed only when path contains backslashes
print_lock = threading.Lock() # add this here
Then make a non-recursive version of traverse() that uses it:
def traverse(rootNode):
    with print_lock:
        print rootNode.GetName()
    for i in range(rootNode.GetChildCount()):
        child = rootNode.GetChild(i)
        with print_lock:
            print child.GetName()
It's not clear to me exactly where the reading of each fbxfile takes place. If it all happens as a result of the importer.Import(scene) call, then that is the only time any other threads will be given a chance to run — unless some I/O is [also] done within the traverse() function.
Since printing is most definitely a form of output, thread switching will also be able to occur when it's done. However, if all the function did was perform computations of some kind, no multi-threading would take place within it during its execution.
Once you get the multithreaded reading working, you may encounter insufficient-memory issues if multiple 2 GB fbx files are read into memory simultaneously by the different threads.
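If that happens, one option is to gate the expensive part with a semaphore so only a couple of files are ever loaded at once. This is a sketch with stand-in work instead of the real fbx calls (the names and limit are illustrative):

```python
import threading

MAX_IN_MEMORY = 2  # at most two large files held in memory at once
mem_gate = threading.BoundedSemaphore(MAX_IN_MEMORY)

results = []

def parse_file_limited(filepath):
    with mem_gate:                  # blocks while two parses are in flight
        data = filepath.upper()     # stand-in for the 2 GB load + traverse
        results.append(data)        # list.append is atomic under the GIL,
                                    # but an explicit Lock is clearer in general

threads = [threading.Thread(target=parse_file_limited, args=(p,))
           for p in ('a.fbx', 'b.fbx', 'c.fbx')]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The third thread simply waits at the semaphore until one of the first two releases it, bounding peak memory use.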
Let's assume I'm stuck using Python 2.6, and can't upgrade (even if that would help). I've written a program that uses the Queue class. My producer is a simple directory listing. My consumer threads pull a file from the queue, and do stuff with it. If the file has already been processed, I skip it. The processed list is generated before all of the threads are started, so it isn't empty.
Here's some pseudo-code.
import os, Queue, threading

processed = []

def consumer():
    while True:
        file = dirlist.get(block=True)
        if file in processed:
            print "Ignoring %s" % file
        else:
            # do stuff here
            pass
        dirlist.task_done()

dirlist = Queue.Queue()
for f in os.listdir("/some/dir"):
    dirlist.put(f)

max_threads = 8
for i in range(max_threads):
    thr = threading.Thread(target=consumer)
    thr.start()

dirlist.join()
The strange behavior I'm getting is that if a thread encounters a file that's already been processed, the thread stalls out and waits until the entire program ends. I've done a little bit of testing, and the first 7 threads (assuming 8 is the max) stop, while the 8th thread keeps processing, one file at a time. But, by doing that, I'm losing the entire reason for threading the application.
Am I doing something wrong, or is this the expected behavior of the Queue/threading classes in Python 2.6?
I tried running your code, and did not see the behavior you describe. However, the program never exits. I recommend changing the .get() call as follows:
try:
    file = dirlist.get(True, 1)
except Queue.Empty:
    return
If you want to know which thread is currently executing, you can import the thread module and print thread.get_ident().
I added the following line after the .get():
print file, thread.get_ident()
and got the following output:
bin 7116328
cygdrive 7116328
cygwin.bat 7149424
cygwin.ico 7116328
dev etc7598568
7149424
fix 7331000
home 7116328lib
7598568sbin
7149424Thumbs.db
7331000
tmp 7107008
usr 7116328
var 7598568proc
7441800
The output is messy because the threads are writing to stdout at the same time. The variety of thread identifiers further confirms that all of the threads are running.
Perhaps something is wrong in the real code or your test methodology, but not in the code you posted?
Since this problem only manifests itself when finding a file that's already been processed, it seems like this is something to do with the processed list itself. Have you tried implementing a simple lock? For example:
processed = []
processed_lock = threading.Lock()

def consumer():
    while True:
        with processed_lock:
            fileInList = file in processed
        if fileInList:
            # ... et cetera
            ...
Threading tends to cause the strangest bugs, even if they seem like they "shouldn't" happen. Using locks on shared variables is the first step to make sure you don't end up with some kind of race condition that could cause threads to deadlock.
Of course, if what you're doing under # do stuff here is CPU-intensive, then Python will only run code from one thread at a time anyway, due to the Global Interpreter Lock. In that case, you may want to switch to the multiprocessing module - it's very similar to threading, though you will need to replace shared variables with another solution (see here for details).
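As a rough sketch of that switch (in modern Python; the Pool context-manager form needs 3.3+), with cpu_heavy standing in for the real per-file work:

```python
import multiprocessing as mp

def cpu_heavy(n):
    # stand-in for the CPU-bound "do stuff here" step
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        # each input is handled in a separate process, sidestepping the GIL
        totals = pool.map(cpu_heavy, [10, 100, 1000])
```

Pool.map keeps input order in its results, so this is close to a drop-in replacement for a plain loop over the inputs.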