Multiprocessing where new process starts halfway through other process - python

I have a Python script that does two things: 1) it downloads a large file by making an API call, and 2) it preprocesses that large file. I want to use multiprocessing to run my script. Each individual part (1 and 2) takes quite long. Everything happens in memory due to the large size of the files, so ideally a single core would do both (1) and (2) consecutively. I have a large number of cores available (100+), but I can only have 4 API calls running at the same time (a limitation set by the API developers). So what I want to do is spawn 4 cores/processes that start downloading by making an API call, and as soon as one of those processes finishes downloading and starts preprocessing, I want a new process to start the whole pipeline as well. That way there are always 4 downloads running, and as many processes as needed doing the preprocessing. What I don't know, however, is how to have a new process spawn as soon as another process has finished the first part of the script.
My actual code is way too complex to just dump here, but let's say I have the following two functions:
import requests

def make_api_call(val):
    """Function that does part 1): makes an API call, stores the result in memory
    and returns a large satellite GeoTIFF.
    """
    large_image = requests.get(val)
    return large_image

def preprocess_large_image(large_image):
    """Function that does part 2): preprocesses a large image and returns the relevant data.
    """
    results = preprocess(large_image)
    return results
How can I then make sure that as soon as a single core/process finishes make_api_call and starts preprocess_large_image, another process spawns and starts the entire pipeline as well, so that there are always 4 images downloading side by side? Thank you in advance for the help!

This is a perfect application for a multiprocessing.Semaphore (or, for safety, a BoundedSemaphore)! Basically you put a lock around the API-call part of the process, but let up to 4 worker processes hold the lock at any given time. For various reasons, things like Lock, Semaphore, Queue, etc. all need to be passed at the creation of a Pool rather than when a method like map or imap is called. This is done by specifying an initialization function in the pool constructor.
import multiprocessing as mp

def api_call(arg):
    foo = ...  # the download; at most 4 of these run at the same time
    return foo

def process_data(foo):
    ...  # the preprocessing; any number of these can run at the same time
    return "done"

def map_func(arg):
    global semaphore
    with semaphore:
        foo = api_call(arg)
    return process_data(foo)

def init_pool(s):
    global semaphore
    semaphore = s

if __name__ == "__main__":
    s = mp.BoundedSemaphore(4)  # max concurrent API calls
    n_workers = 20  # should be large enough that a free worker is always waiting on semaphore.acquire()
    arglist = [...]  # your API-call arguments
    with mp.Pool(n_workers, init_pool, (s,)) as p:
        for result in p.imap(map_func, arglist):
            print(result)

If both the downloading (part 1) and the conversion (part 2) take long, there is not much reason to do everything in memory.
Keep in mind that networking is generally slower than disk operations.
So I would suggest using two pools, saving the downloaded files to disk, and sending file names to the workers.
The first Pool is created with four workers and does the downloading. The worker saves the image to a file and returns the filename. With this Pool you use the imap_unordered method, because that starts yielding values as soon as they become available.
The second Pool does the image processing. It gets fed by apply_async, which returns an AsyncResult object.
We need to save those to keep track of when all the conversions are finished.
Note that map or imap_unordered are not suitable here because they require a ready-made iterable.
import multiprocessing
import time
import requests

def download(url):
    large_image = requests.get(url).content
    filename = url_to_filename(url)  # you need to write this
    with open(filename, "wb") as imgf:
        imgf.write(large_image)
    return filename

def process_image(name):
    with open(name, "rb") as f:
        large_image = f.read()
    # File processing goes here
    with open(name, "wb") as f:
        f.write(large_image)
    return name

if __name__ == "__main__":
    dlp = multiprocessing.Pool(processes=4)
    # Default pool size is os.cpu_count(); might be too much.
    imgp = multiprocessing.Pool(processes=20)
    urllist = ['http://foo', 'http://bar']  # et cetera
    in_progress = []
    for name in dlp.imap_unordered(download, urllist):
        in_progress.append(imgp.apply_async(process_image, (name,)))
    # Wait for the conversions to finish.
    while in_progress:
        finished = []
        for res in in_progress:
            if res.ready():
                finished.append(res)
        for f in finished:
            in_progress.remove(f)
            print(f"Finished processing '{f.get()}'.")
        time.sleep(0.1)

Related

Python multiprocessing to process files

I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, which has to wait for the result and get it.
It can be simpler with Pool. You don't need to use a queue manually, it should return results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't hog the CPU for other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)
    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)
    for item in materials:
        print(item)
BTW: other modules worth a look: concurrent.futures, joblib, ray.
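For reference, a minimal sketch of the same job with concurrent.futures, one of the modules mentioned above. It is untested and reuses process_dict and the folder path from the question:
import os
import json
from concurrent.futures import ProcessPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def load_json(filename):
    # same idea as asyncJSONs above; process_dict comes from the question
    with open(os.path.join(folder, filename)) as f:
        data = json.load(f)
    properties = process_dict(data, {})
    properties['name'] = filename.split('.')[0]
    return properties

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    with ProcessPoolExecutor(max_workers=5) as ex:
        materials = list(ex.map(load_json, all_jsons))
    for item in materials:
        print(item)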
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to some target file in ndjson/jsonlines format. That's just where, instead of the objects being part of a JSON array [{},{}...], you have separate objects on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a much better data format for long lists of objects, since you don't have to parse the whole thing into memory.
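A rough, untested sketch of that workflow, reusing process_dict from the question; the output path and the file manifest are placeholders:
import json
import multiprocessing as mp

OUT_FILE = 'materials.ndjson'  # hypothetical output file

def extract_and_append(filename):
    # parse one input file and append its result as a single JSON line
    with open(filename) as f:
        data = json.load(f)
    properties = process_dict(data, {})  # process_dict comes from the question
    properties['name'] = filename.split('.')[0]
    with open(OUT_FILE, 'a') as fp:
        print(json.dumps(properties), file=fp, flush=True)

if __name__ == '__main__':
    files = ['a.json', 'b.json']  # hypothetical manifest of files to process
    with mp.Pool(processes=8) as pool:
        pool.map(extract_and_append, files)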
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function stuff the results into the queue, while your main code waits for stuff to magically appear in the queue.
Also, PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in an
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
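A minimal, untested sketch of that queue pattern; parse_one_file is a hypothetical stand-in for your per-file work:
import multiprocessing as mp

def worker(filename, result_q):
    # do the per-file work and put the result on the queue
    result_q.put((filename, parse_one_file(filename)))  # parse_one_file is hypothetical

if __name__ == "__main__":
    result_q = mp.Queue()
    files = ['a.json', 'b.json']  # hypothetical file list
    procs = [mp.Process(target=worker, args=(f, result_q)) for f in files]
    for p in procs:
        p.start()
    # collect one result per process; get() before join() to avoid a full-queue deadlock
    results = [result_q.get() for _ in procs]
    for p in procs:
        p.join()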
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.

Why doesn't multithreading speed up my program?

I have a big text file that needs to be processed. I first read all text into a list and then use ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel and get_relations().
I am on Mac and my observations show that it doesn't really speed up the processing (cpu with 8 cores, only 15% cpu is used). If there is a performance bottleneck in either the function is_channel or get_relations, then the multithreading won't help much. Is that the reason for no performance gain? Should I try to use multiprocessing to speed up instead of multithreading?
import itertools
from concurrent.futures import ThreadPoolExecutor

def process_file(file_name):
    all_lines = []
    with open(file_name, 'r', encoding='utf8') as f:
        for index, line in enumerate(f):
            line = line.strip()
            all_lines.append(line)
    # Classify text
    all_results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
            all_results.append(result)
    for index, entities_relations_list in enumerate(all_results):
        ...  # print out results

def process_text(text, channel):
    global channel_text
    global non_channel_text
    is_right_channel = is_channel(text, channel)
    entities = ()
    relations = None
    entities_relations_list = set()
    entities_relations_list.add((entities, relations))
    if is_right_channel:
        channel_text += 1
        entities_relations_list = get_relations(text, channel)
        return (text, entities_relations_list, is_right_channel)
    non_channel_text += 1
    return (text, entities_relations_list, is_right_channel)
The first thing that should be done is finding out how much time it takes to:
Read the file in memory (T1)
Do all processing (T2)
Printing result (T3)
The third point (printing), if you are really doing it, can slow things down. It's fine as long as you are not printing to the terminal and are just piping the output to a file or something else.
Based on timings, we'll get to know:
T1 >> T2 => IO bound
T2 >> T1 => CPU bound
T1 and T2 are close => Neither.
by x >> y I mean x is significantly greater than y.
Based on the above and the file size, you can try a few approaches:
Threading based
Even this can be done in two ways; which one works faster can be found out by benchmarking/looking at the timings again.
Approach-1 (T1 >> T2, or even when T1 and T2 are similar)
Run the code that reads the file in a thread of its own and let it push the lines to a queue instead of a list.
This thread inserts a None at the end when it is done reading from the file. This is important to tell the workers that they can stop.
Now run the processing workers and pass them the queue.
The workers keep reading from the queue in a loop and processing the results. Similar to the reader thread, these workers put their results in a queue.
Once a worker encounters a None, it stops the loop and re-inserts the None into the queue (so that the other threads can stop themselves).
The printing part can again be done in a thread.
The above is an example of a single producer with multiple consumer threads; a minimal sketch of it follows below.
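An untested sketch of Approach-1, reusing process_text and channel from the question and the single-None sentinel that each worker re-inserts:
import queue
import threading

def reader(file_name, work_q):
    # single producer: push lines, then one None as the stop signal
    with open(file_name, 'r', encoding='utf8') as f:
        for line in f:
            work_q.put(line.strip())
    work_q.put(None)

def worker(work_q, result_q):
    while True:
        line = work_q.get()
        if line is None:
            work_q.put(None)  # re-insert so the other workers can stop too
            break
        result_q.put(process_text(line, channel))  # from the question

def run(file_name, n_workers=10):
    work_q, result_q = queue.Queue(maxsize=1000), queue.Queue()
    threads = [threading.Thread(target=reader, args=(file_name, work_q))]
    threads += [threading.Thread(target=worker, args=(work_q, result_q))
                for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    results = []
    while not result_q.empty():
        results.append(result_q.get())
    return results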
Approach-2 (this is just another way of doing what the code snippet in the question already does)
Read the entire file into a list.
Divide the list into index ranges based on the number of threads.
Example: if the file has 100 lines in total and we use 10 threads,
then 0-9, 10-19, ..., 90-99 are the index ranges.
Pass the complete list and these index ranges to the threads, each processing its own set. Since you are not modifying the original list, this works.
This approach can give better results than running a worker for each individual line; a sketch follows below.
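An untested sketch of Approach-2, again reusing process_text and channel from the question:
import threading

def worker(lines, start, end, out, slot):
    # process lines[start:end] and store the partial results in out[slot]
    out[slot] = [process_text(line, channel) for line in lines[start:end]]

def run_ranges(all_lines, n_threads=10):
    chunk = (len(all_lines) + n_threads - 1) // n_threads
    partials = [None] * n_threads
    threads = [threading.Thread(target=worker,
                                args=(all_lines, i * chunk, (i + 1) * chunk, partials, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # flatten the partial results, preserving the original order
    return [r for part in partials if part for r in part]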
Multiprocessing based
(CPU bound)
Split the file into multiple files before processing.
Run a new process for each file.
Each process gets the path of the file it should read and process.
This requires an additional step of combining all results/files at the end.
The process creation part can be done from within Python using the multiprocessing module,
or from a driver script that spawns a Python process for each file, like a shell script. A sketch of the multiprocessing variant follows below.
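An untested sketch of the multiprocessing variant; it assumes the big file has already been split into the listed chunk files and that process_text and channel from the question are importable by the workers:
import multiprocessing as mp

def process_chunk_file(path):
    # one process per pre-split file; the per-file results are combined at the end
    results = []
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            results.append(process_text(line.strip(), channel))  # from the question
    return results

if __name__ == '__main__':
    chunk_paths = ['part_000.txt', 'part_001.txt', 'part_002.txt']  # produced by a prior split step
    with mp.Pool(processes=len(chunk_paths)) as pool:
        per_file = pool.map(process_chunk_file, chunk_paths)
    all_results = [r for chunk in per_file for r in chunk]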
Just by looking at the code, it seems to be CPU bound. Hence, I would prefer multiprocessing for doing that. I have used both approaches in practice.
Multiprocessing: when processing huge text files (GBs) stored on disk (like what you are doing).
Threading (Approach-1): when reading from multiple databases, as that is more IO bound than CPU bound (I used multiple producer and multiple consumer threads).

Python: how to parallelize a simple loop with MPI

I need to rewrite a simple for loop with MPI because each step is time-consuming. Let's say I have a list containing several np.array objects, and I want to apply some computation to each array. For example:
import numpy as np

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

dat = [np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)]  # real data would be much larger
result = []
for item in dat:
    result.append(myFun(item))
Instead of using the simple for loop above, I want to use MPI to run the 'for loop' part of the above code in parallel on 24 different nodes. I also want the order of items in the result list to match the order of the dat list.
Note: the data is read from another file, which can be treated as 'fixed' for each processor.
I haven't used MPI before, so I've been stuck on this for a while.
For simplicity, let us assume that the master process (the process with rank = 0) is the one that will read the entire file from disk into memory. This problem can be solved knowing only about the following MPI routines: Get_size(), Get_rank(), scatter, and gather.
The Get_size():
Returns the number of processes in the communicator. It will return
the same number to every process.
The Get_rank():
Determines the rank of the calling process in the communicator.
In MPI, each process is assigned a rank that varies from 0 to N - 1, where N is the total number of processes running.
The scatter:
MPI_Scatter involves a designated root process sending data to all
processes in a communicator. The primary difference between MPI_Bcast
and MPI_Scatter is small but important. MPI_Bcast sends the same piece
of data to all processes while MPI_Scatter sends chunks of an array to
different processes.
and the gather:
MPI_Gather is the inverse of MPI_Scatter. Instead of spreading
elements from one process to many processes, MPI_Gather takes elements
from many processes and gathers them to one single process.
Obviously, you should first follow a tutorial and read the MPI documentation to understand its parallel programming model and its routines. Otherwise, you will find it very hard to understand how it all works. That being said, your code could look like the following:
from mpi4py import MPI
import numpy as np

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # get your process ID

data = None  # init the data
if rank == 0:  # The master is the only process that reads the file
    data = ...  # something read from file

# Divide the data among processes
data = comm.scatter(data, root=0)

result = []
for item in data:
    result.append(myFun(item))

# Send the results back to the master process
newData = comm.gather(result, root=0)
In this way, each process will work (in parallel) on only a certain chunk of the data. After having finished its work, each process sends its data chunk back to the master process (i.e., comm.gather(result, root=0)). This is just a toy example; now it is up to you to improve it according to your testing environment and code.
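One detail worth noting: comm.scatter expects the root to provide exactly one item per rank, so with more ranks than arrays the master usually has to group the flat list into per-rank chunks first. A hedged sketch of that extra step (the round-robin split is just one possible choice):
if rank == 0:
    size = comm.Get_size()
    # one sub-list per rank, distributed round-robin
    data = [data[i::size] for i in range(size)]
data = comm.scatter(data, root=0)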
You could either go the low-level MPI way as shown in the answer of @dreamcrash or you could go for a more Pythonic solution that uses an executor pool very similar to the one provided by the standard Python multiprocessing module.
First, you need to turn your code into a more functional-style one by noticing that you are actually doing a map operation, which applies myFun to each element of dat:
import numpy as np

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

dat = [
    np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
]  # real data would be much larger
result = map(myFun, dat)
map here runs sequentially in one Python interpreter process.
To run that map in parallel with the multiprocessing module, you only need to instantiate a Pool object and then call its map() method in place of the Python map() function:
from multiprocessing import Pool
import numpy as np

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

if __name__ == '__main__':
    dat = [
        np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
    ]  # real data would be much larger
    with Pool() as pool:
        result = pool.map(myFun, dat)
Here, Pool() creates a new executor pool with as many interpreter processes as there are logical CPUs as seen by the OS. Calling the map() method of the pool runs the mapping in parallel by sending items to the different processes in the pool and waiting for completion. Since the worker processes import the Python script as a module, it is important to have the code that was previously at the top level moved under the if __name__ == '__main__': conditional so it doesn't run in the workers too.
Using multiprocessing.Pool() is very convenient because it requires only a slight change of the original code and the module handles for you all the work scheduling and the required data movement to and from the worker processes. The problem with multiprocessing is that it only works on a single host. Fortunately, mpi4py provides a similar interface through the mpi4py.futures.MPIPoolExecutor class:
from mpi4py.futures import MPIPoolExecutor
import numpy as np

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

if __name__ == '__main__':
    dat = [
        np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
    ]  # real data would be much larger
    with MPIPoolExecutor() as pool:
        result = pool.map(myFun, dat)
Like with the Pool object from the multiprocessing module, the MPI pool executor handles for you all the work scheduling and data movement.
There are two ways to run the MPI program. The first one starts the script as an MPI singleton and then uses the MPI process control facility to spawn a child MPI job with all the pool workers:
mpiexec -n 1 python program.py
You also need to specify the MPI universe size (the total number of MPI ranks in both the main and all child jobs). The specific way of doing so differs between the implementations, so you need to consult your implementation's manual.
The second option is to launch directly the desired number of MPI ranks and have them execute the mpi4py.futures module itself with the script name as argument:
mpiexec -n 24 python -m mpi4py.futures program.py
Keep in mind that no matter which way you launch the script, one MPI rank will be reserved for the controller and will not be running mapping tasks. You are aiming at running on 24 hosts, so you should have plenty of CPU cores and can probably afford to have one reserved. Or you could instruct MPI to oversubscribe the first host with one more rank.
One thing to note with both multiprocessing.Pool and mpi4py.futures.MPIPoolExecutor is that the map() method guarantees the order of the items in the output array, but it doesn't guarantee the order in which the different items are evaluated. This shouldn't be a problem in most cases.
A word of advice: if your data actually consists of chunks read from a file, you may be tempted to do something like this:
if __name__ == '__main__':
    data = read_chunks()
    with MPIPoolExecutor() as p:
        result = p.map(myFun, data)
Don't do that. Instead, if possible, e.g., if enabled by the presence of a shared (and hopefully parallel) filesystem, delegate the reading to the workers:
NUM_CHUNKS = 100

def myFun(chunk_num):
    # You may need to pass the value of NUM_CHUNKS to read_chunk()
    # for it to be able to seek to the right position in the file
    data = read_chunk(NUM_CHUNKS, chunk_num)
    return ...

if __name__ == '__main__':
    chunk_nums = range(NUM_CHUNKS)  # 100 chunks
    with MPIPoolExecutor() as p:
        result = p.map(myFun, chunk_nums)

Clean, pythonic way for concurrent data loaders?

Python 3
I would like to know what a really clean, pythonic concurrent data loader should look like. I need this approach for a project of mine that does heavy computations on data that is too big to fit entirely into memory. Hence, I implemented data loaders that run concurrently and store data in a queue, so that the main process can work while the next data is being loaded and prepared in the meantime. Of course, the queue should block when it is empty (the main process trying to consume more items -> the queue should wait for new data) or full (the worker process should wait until the main process consumes data out of the queue, to prevent out-of-memory errors).
I have written a class to fulfill this need using Python's multiprocessing module (multiprocessing.Queue and multiprocessing.Process). The crucial parts of the class are implemented as follows:
import multiprocessing as mp
import numpy as np
from itertools import cycle

class ConcurrentLoader:
    def __init__(self, path_to_data, queue_size, batch_size):
        self._batch_size = batch_size
        self._path = path_to_data
        filenames = ...  # filenames for path 'path_to_data',
                         # get loaded using glob
        self._files = cycle(filenames)
        self._q = mp.Queue(queue_size)
        ...
        self._worker = mp.Process(target=self._worker_func, daemon=True)
        self._worker.start()  # only started, never stopped

    def _worker_func(self):
        while True:
            buffer = list()
            for i in range(self._batch_size):
                f = next(self._files)
                ...  # load f and do some pre-processing with NumPy
                ...  # add it to buffer
            self._q.put(np.array(buffer).astype(np.float32))

    def get_batch_data(self):
        return self._q.get()
The class has some more methods, but they are all for "convenience functionality". For example, it counts in a dict how often each file was loaded, how often the whole data set was loaded and so on, but these are rather easy to implement in Python and do not waste much computation time (sets, dicts, ...).
The data part itself on the other hand, due to I/O and pre-processing, can even take seconds. That is the reason why I want this to happen concurrently.
ConcurrentLoader should:
block main process: if get_batch_data is called, but queue is empty
block worker process: if queue is full, to prevent out-of-memory errors and prevent while True from wasting resources
be "transparent" to any class that uses ConcurrentLoader: they should just supply the path to the data and use get_batch_data without noticing that this actually works concurrently ("hassle free usage")
terminate its worker when main process dies to free resources again
Considering these goals (have I forgotten anything?), what should I do to enhance the current implementation? Is it thread/deadlock safe? Is there a more "pythonic" way of implementing it? Can I make it cleaner? Does it waste resources somehow?
Any class that uses ConcurrentLoader would roughly follow this setup:
class Foo:
    ...

    def do_something(self):
        ...
        data1 = ConcurrentLoader("path/to/data1", 64, 8)
        data2 = ConcurrentLoader("path/to/data2", 256, 16)
        ...
        sample1 = data1.get_batch_data()
        sample2 = data2.get_batch_data()
        ...  # heavy computations with data contained in 'sample1' & 'sample2'
             # go *here*
Please either point out mistakes of any kind in order to improve my approach, or supply your own, cleaner, more pythonic approach.
Blocking when a multiprocessing.Queue is empty/full and
get()/put() is called on it happens automatically.
This behavior is transparent to calling functions.
Use self._worker.daemon = True before self._worker.start() so the worker(s) will automatically be killed when the main process exits.
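A tiny, untested demo of both points; the maxsize, sleep and item count are arbitrary:
import multiprocessing as mp
import time

def slow_producer(q):
    for i in range(5):
        q.put(i)  # blocks automatically once the queue already holds 2 items
        time.sleep(0.5)

if __name__ == '__main__':
    q = mp.Queue(maxsize=2)
    w = mp.Process(target=slow_producer, args=(q,), daemon=True)
    w.start()
    for _ in range(5):
        print(q.get())  # blocks automatically while the queue is empty
    w.join()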

Is this Python code a safe way to use multi-threading

An application I use for graphics has an embedded Python interpreter - It works exactly the same as any other Python interpreter except there are a few special objects.
Basically I am trying to use Python to download a bunch of images and make other Network and disk I/O. If I do this without multithreading, my application will freeze (i.e. videos quit playing) until the downloads are finished.
To get around this I am trying to use multi-threading. However, I can not touch any of the main process.
I have written this code. The only parts unique to the program are commented. me.store / me.fetch is basically a way of getting a global variable. op('files') refers to a global table.
These are the two things "in the main process" that can only be touched in a thread-safe way. I am not sure if my code does this.
I would appreciate any input as to why (or why not) this code is thread-safe, and how I can access the global variables in a thread-safe way.
One thing I am worried about is how the counter is fetched multiple times by many threads. Since it is only updated after the file is written, could this cause a race condition where different threads access the counter with the same value (and then don't store the incremented value correctly)? Also, what happens to the counter if the disk write fails?
from urllib import request
import threading, queue, os

url = 'http://users.dialogfeed.com/en/snippet/dialogfeed-social-wall-twitter-instagram.json?api_key=ac77f8f99310758c70ee9f7a89529023'
imgs = [
    'http://search.it.online.fr/jpgs/placeholder-hollywood.jpg.jpg',
    'http://www.lpkfusa.com/Images/placeholder.jpg',
    'http://bi1x.caltech.edu/2015/_images/embryogenesis_placeholder.jpg'
]

def get_pic(url):
    # Fetch image data
    data = request.urlopen(url).read()
    # This is the part I am concerned about: what if multiple threads fetch the counter before it is updated below?
    # What happens if the file write fails?
    counter = me.fetch('count', 0)
    # Download the file
    with open(str(counter) + '.jpg', 'wb') as outfile:
        outfile.write(data)
    file_name = 'file_' + str(counter)
    path = os.getcwd() + '\\' + str(counter) + '.jpg'
    me.store('count', counter + 1)
    return file_name, path

def get_url(q, results):
    url = q.get_nowait()
    file_name, path = get_pic(url)
    results.append([file_name, path])
    q.task_done()

def fetch():
    # Clear the table
    op('files').clear()
    results = []
    url_q = queue.Queue()
    # Simulate getting a JSON feed
    print(request.urlopen(url).read().decode('utf-8'))
    for img in imgs:
        # Add url to queue and start a thread
        url_q.put(img)
        t = threading.Thread(target=get_url, args=(url_q, results,))
        t.start()
    # Wait for threads to finish before updating table
    url_q.join()
    for cell in results:
        op('files').appendRow(cell)
    return

# Start a thread so that the first http get doesn't block
thread = threading.Thread(target=fetch)
thread.start()
Your code doesn't appear to be safe at all. Key points:
Appending to results is unsafe -- two threads might try to append to the list at the same time.
Accessing and setting counter is unsafe -- a thread may fetch counter before another thread has set the new counter value.
Passing a queue of urls is redundant -- just pass a new url to each job.
Another way (concurrent.futures)
Since you are using python 3, why not make use of the concurrent.futures module, which makes your task much easier to manage. Below I've written out your code in a way which does not require explicit synchronisation -- all the work is handled by the futures module.
from urllib import request
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import count

url = 'http://users.dialogfeed.com/en/snippet/dialogfeed-social-wall-twitter-instagram.json?api_key=ac77f8f99310758c70ee9f7a89529023'
imgs = [
    'http://search.it.online.fr/jpgs/placeholder-hollywood.jpg.jpg',
    'http://www.lpkfusa.com/Images/placeholder.jpg',
    'http://bi1x.caltech.edu/2015/_images/embryogenesis_placeholder.jpg'
]

def get_pic(url, counter):
    # Fetch image data
    data = request.urlopen(url).read()
    # Download the file
    with open(str(counter) + '.jpg', 'wb') as outfile:
        outfile.write(data)
    file_name = 'file_' + str(counter)
    path = os.getcwd() + '\\' + str(counter) + '.jpg'
    return file_name, path

def fetch():
    # Clear the table
    op('files').clear()
    with ThreadPoolExecutor(max_workers=2) as executor:
        count_start = me.fetch('count', 0)
        # reserve these numbers for our tasks
        me.store('count', count_start + len(imgs))
        # separate fetching and storing is usually not thread safe
        # however, if only one thread modifies count (the one running fetch) then
        # this will be safe (same goes for the files variable)
        for cell in executor.map(get_pic, imgs, count(count_start)):
            op('files').appendRow(cell)

# Start a thread so that the first http get doesn't block
thread = threading.Thread(target=fetch)
thread.start()
If multiple threads modify count then you should use a lock when modifying count.
For example:
lock = threading.Lock()

def fetch():
    ...
    with lock:
        # Do not release the lock between accessing and modifying count.
        # Other threads wanting to modify count must use the same lock object (not
        # another instance of Lock).
        count_start = me.fetch('count', 0)
        me.store('count', count_start + len(imgs))
    # use count_start here
The only problem with this is that if one job fails for some reason, you will get a missing file number. Any raised exception will also interrupt the executor doing the mapping by re-raising the exception there, so you can then do something if needed.
You could avoid using a counter by using the tempfile module to find somewhere to temporarily store a file before moving the file somewhere permanent.
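A hedged, untested sketch of that idea; the prefix/suffix and target directory are arbitrary choices:
import os
import tempfile
from urllib import request

def get_pic_without_counter(url):
    data = request.urlopen(url).read()
    # delete=False keeps the file around after the with-block; its unique
    # name replaces the shared counter entirely
    with tempfile.NamedTemporaryFile(prefix='img_', suffix='.jpg',
                                     dir=os.getcwd(), delete=False) as outfile:
        outfile.write(data)
        path = outfile.name
    return os.path.basename(path), path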
Remember to look at multiprocessing and threading if you are new to python multi-threading stuff.
Your code seems OK, though the code style is not very easy to read. You need to run it to see whether it works as you expect.
with will make sure your lock is released. The acquire() method will be called when the block is entered, and release() will be called when the block is exited.
If you add more threads, make sure they are not using the same address from the queue and that there is no race condition (it seems this is handled by Queue.get(), but you need to run it to verify). Remember, all threads share the same process, so almost everything is shared. You don't want two threads handling the same address.
The Lock doesn't do anything at all. You only have one thread that ever calls download_job - that's the one you assigned to my_thread. The other one, the main thread, calls offToOn and is finished as soon as it reaches the end of that function. So there is no second thread that ever tries to acquire the lock, and hence no second thread ever gets blocked. The table you mention is, apparently, in a file that you explicitly open and close. If the operating system protects this file against simultaneous access from different programs, you can get away with this; otherwise it is definitely unsafe because you haven't accomplished any thread synchronization.
Proper synchronization between threads requires that different threads have access to the SAME lock; i.e., one lock is accessed by multiple threads. Also note that "thread" is not a synonym for "process." Python supports both. If you're really supposed to avoid accessing the main process, you have to use the multiprocessing module to launch and manage a second process.
And this code will never exit, since there is always a thread running in an infinite loop (in threader).
Accessing a resource in a thread-safe manner requires something like this:
from threading import Lock

a_lock = Lock()

def use_resource():
    with a_lock:
        ...  # do something
The lock is created once, outside the function that uses it. Every access to the resource in the whole application, from whatever thread, must acquire the same lock, either by calling use_resource or some equivalent.
