I'm pretty new to Python (I mainly write code in Java). I have a Python script that's essentially a crawler. It calls PhantomJS, which loads up the page and returns its source along with a list of URLs that it found in the page.
I've been trying to use Python 3's multiprocessing module to do this, but I can't figure out how to use a shared queue that workers can also add to. I keep getting unpredictable results.
My previous approach used a global list of URLs, out of which I extracted a chunk and sent to workers using map_async. At the end, I would gather all the returned URLs and append them to the global list. The problem is that each "chunk" takes as long as the slowest worker. I'm trying to modify it so that whenever a worker is done, it can pick up the next URL. However, I don't think I'm doing it correctly. Here's what I have so far:
import multiprocessing
from multiprocessing import Pool

def worker(url, urls):
    print(multiprocessing.current_process().name + "." + str(multiprocessing.current_process().pid) + " loading " + url)
    returned_urls = phantomjs(url)  # phantomjs() wraps the call to the phantomjs binary (defined elsewhere)
    print(multiprocessing.current_process().name + "." + str(multiprocessing.current_process().pid) + " returning " + str(len(returned_urls)) + " URLs")
    for returned_url in returned_urls:
        urls.put(returned_url, block=True)
    print("There are " + str(urls.qsize()) + " URLs in total.\n")

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    urls = manager.Queue()
    urls.put(<some-url>)  # a Queue uses put(), not append()
    pool = Pool()
    while True:
        url = urls.get(block=True)
        pool.apply_async(worker, (url, urls))
    pool.close()
    pool.join()
If there is a better way to do this, please let me know. I'm crawling a known site, and the eventual terminating condition is when there are no URLs left to process. But right now it looks like it will just keep running forever. I'm not sure whether I should use queue.empty(), because the documentation says it's not reliable.
Here is what I would probably do:
import multiprocessing
from multiprocessing import Pool

def worker(url, urls):
    print(multiprocessing.current_process().name + "." + str(multiprocessing.current_process().pid) + " loading " + url)
    returned_urls = phantomjs(url)
    print(multiprocessing.current_process().name + "." + str(multiprocessing.current_process().pid) + " returning " + str(len(returned_urls)) + " URLs")
    for returned_url in returned_urls:
        urls.put(returned_url, block=True)
    # signal finished processing this url
    urls.put('no-url')
    print("There are " + str(urls.qsize()) + " URLs in total.\n")

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    pool = Pool()
    urls = manager.Queue()
    # start first url before entering loop
    counter = 1
    pool.apply_async(worker, (<some-url>, urls))
    while counter > 0:
        url = urls.get(block=True)
        if url == 'no-url':
            # a url has finished processing
            counter -= 1
        else:
            # a new url needs to be processed
            counter += 1
            pool.apply_async(worker, (url, urls))
    pool.close()
    pool.join()
Whenever a url is popped off the queue, increment the counter. Think of it as a "currently processing url" counter. When a 'no-url' is popped off the queue, a "currently processing url" has finished, so decrement the counter. As long as the counter is greater than 0, there are urls that haven't finished processing and returned 'no-url' yet.
EDIT
As I said in the comment (put here for anyone else who reads it), when using a multiprocessing.Pool, instead of thinking of it as individual processes, it's best to think of it as a single construct that executes your function each time it gets data (concurrently when possible). This is most useful for data-driven problems where you don't track or care about individual worker processes, only the data being processed.
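To make that "data in, data out" view concrete, here is a minimal sketch; the fetch function, the depth cut-off, and the seed URL are hypothetical stand-ins, not code from the question:

from multiprocessing import Pool

def fetch(url):
    # Stand-in for the real work (e.g. a phantomjs call); returns newly discovered URLs.
    return [url + "/a", url + "/b"] if url.count("/") < 5 else []

if __name__ == '__main__':
    seen = {"http://example.com"}   # hypothetical seed URL
    frontier = list(seen)
    with Pool(4) as pool:
        while frontier:
            batch, frontier = frontier, []
            # The pool is fed data and hands back data; no per-process bookkeeping.
            for found in pool.imap_unordered(fetch, batch):
                new = [u for u in found if u not in seen]
                seen.update(new)
                frontier.extend(new)
    print("Discovered", len(seen), "URLs")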
This is how I solved the problem. I originally went with the design posted in this answer but bj0 mentioned that it was abusing the initializer function. So I decided to do it using apply_async, in a fashion similar to the code posted in my question.
Since my workers modify the queue they are reading URLs from (they add to it), I thought that I could simply run my loop like so:
while not urls.empty():
    pool.apply_async(worker, (urls.get(), urls))
I expected this to work, since the workers add to the queue, and apply_async would wait if all workers were busy. It didn't work as I expected, and the loop terminated early. The problem was that it wasn't clear that apply_async does not block when all workers are busy. Instead, it queues up submitted tasks, which means that urls eventually becomes empty and the loop terminates. The only time the loop blocks is when the queue is empty at the moment you call urls.get(); at that point it waits for more items to become available.
I still needed a way to terminate the loop. The condition is that the loop should terminate when none of the workers return new URLs. To do this, I use a shared dict that sets the value associated with a process name to 0 if that process didn't return any URLs, and 1 otherwise. I check the sum of the values on every iteration of the loop, and if it is ever 0 I know that I am done.
The basic structure ended up being like this:
import multiprocessing
from multiprocessing import Manager, Pool

def worker(url, url_queue, proc_user_urls_queue, proc_empty_urls_queue):
    # proc_user_urls_queue feeds the separate writer process (that part of the code is not shown here)
    returned_urls = phantomjs(url)  # calls phantomjs and waits for output
    if len(returned_urls) > 0:
        proc_empty_urls_queue.put(
            [multiprocessing.current_process().name, 1]
        )
    else:
        proc_empty_urls_queue.put(
            [multiprocessing.current_process().name, 0]
        )
    for returned_url in returned_urls:
        url_queue.put(returned_url)
def empty_url_tallier(proc_empty_urls_queue, proc_empty_urls_dict):
    while 1:
        # This may not be necessary. I don't know if this worker is run
        # by the same process every time. If not, it is possible that
        # the worker was assigned the task of fetching URLs, and returned
        # some. So let's make sure that we set its entry to zero anyway.
        # If this worker is run by the same process every time, then this
        # stuff is not necessary.
        id = multiprocessing.current_process().name
        proc_empty_urls_dict[id] = 0

        proc_empty_urls = proc_empty_urls_queue.get()
        if proc_empty_urls == "done":  # poison pill
            break
        proc_id = proc_empty_urls[0]
        proc_empty_url = proc_empty_urls[1]
        proc_empty_urls_dict[proc_id] = proc_empty_url
manager = Manager()
urls = manager.Queue()
proc_user_urls_queue = manager.Queue()  # consumed by the writer process
proc_empty_urls_queue = manager.Queue()
proc_empty_urls_dict = manager.dict()

pool = Pool(33)

pool.apply_async(writer, (proc_user_urls_queue,))  # writer is defined elsewhere (not shown)
pool.apply_async(empty_url_tallier, (proc_empty_urls_queue, proc_empty_urls_dict))

# Run the first apply synchronously
urls.put("<some-url>")
pool.apply(worker, (urls.get(), urls, proc_user_urls_queue, proc_empty_urls_queue))

while sum(proc_empty_urls_dict.values()) > 0:
    pool.apply_async(worker, (urls.get(), urls, proc_user_urls_queue, proc_empty_urls_queue))

proc_empty_urls_queue.put("done")  # poison pill
pool.close()
pool.join()
Related
So this is the first time I am playing around with threading, so please bear with me here. In my main application (which I will implement this into), I need to add multithreading to my script. The script will read account info from a text file, then log in and do some tasks with that account. I need to make sure that threads aren't reading the same line from the accounts text file, since that would screw everything up, which I'm not quite sure how to do.
from multiprocessing import Queue, Process
from threading import Thread
from time import sleep

urls_queue = Queue()
max_process = 10

def dostuff():
    with open('acc.txt', 'r') as accounts:
        for account in accounts:
            account.strip()
            split = account.split(":")
            a = {
                'user': split[0],
                'pass': split[1],
                'name': split[2].replace('\n', ''),
            }
            sleep(1)
            print(a)
    for i in range(max_process):
        urls_queue.put("DONE")

def doshit_processor():
    while True:
        url = urls_queue.get()
        if url == "DONE":
            break

def main():
    file_reader_thread = Thread(target=dostuff)
    file_reader_thread.start()
    procs = []
    for i in range(max_process):
        p = Process(target=doshit_processor)
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
    print('all done')
    # wait for all tasks in the queue
    file_reader_thread.join()

if __name__ == '__main__':
    main()
So at the moment I don't think the threading is even working, because it's printing one account out per second, even with 10 threads. It should be printing 10 accounts per second, which it isn't, and that has me confused. Also, I am not sure how to make sure that threads won't pick the same account line. Help by a big brain is much appreciated.
The problem is that you create a single thread to generate the data for your processes but then don't post that data to the queue. You sleep in that single thread, so you see one item generated per second and then... nothing, because the items are never queued. It seems that all you are really doing is creating a process pool, so the built-in multiprocessing.Pool should work for you.
I've set the pool "chunk size" low so that workers are only given 1 work item at a time. This is good for workflows where processing time can vary for each work item. By default, the pool assumes processing times are roughly equivalent and optimizes to reduce interprocess communication instead.
Your data looks like a colon-separated file and you can use csv to cut down the processing there too. This smaller script should do what you want.
import csv
import multiprocessing as mp
from time import sleep

max_process = 10

def doshit_processor(row):
    sleep(1)  # if you want to simulate work
    print(row)

def main():
    with open('acc.txt', newline='') as accounts:
        table = list(csv.DictReader(accounts, fieldnames=('user', 'pass', 'name'),
                                    delimiter=':'))
    with mp.Pool(max_process) as pool:
        pool.map(doshit_processor, table, chunksize=1)
    print('all done')

if __name__ == '__main__':
    main()
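For reference, a quick self-contained check of how csv.DictReader handles the colon-separated format (the sample account lines here are made up):

import csv
import io

# Hypothetical sample of what lines in acc.txt might look like.
sample = io.StringIO("alice:hunter2:Alice Smith\nbob:swordfish:Bob Jones\n")

for row in csv.DictReader(sample, fieldnames=('user', 'pass', 'name'), delimiter=':'):
    print(row)
# e.g. {'user': 'alice', 'pass': 'hunter2', 'name': 'Alice Smith'}
# e.g. {'user': 'bob', 'pass': 'swordfish', 'name': 'Bob Jones'}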
I'm having a problem with my multiprocessing, and I'm afraid it's a rather simple fix that I'm just not implementing correctly. I've been researching the things that can cause the problem, but all I'm really finding is people recommending the use of a queue to prevent this, and that doesn't seem to be stopping it (again, I may just be implementing the queue incorrectly). I've been at this a couple of days now and I was hoping I could get some help.
Thanks in advance!
import csv
import multiprocessing as mp
import os
import queue
import sys
import time
import connections
import packages
import profiles
def execute_extract(package, profiles, q):
    # This is the package execution for the extract
    # It fires fine and will print the starting message below
    started_at = time.monotonic()
    print(f"Starting {package.packageName}")
    try:
        oracle_connection = connections.getOracleConnection(profiles['oracle'], 1)
        engine = connections.getSQLConnection(profiles['system'], 1)
        path = os.path.join(os.getcwd(), 'csv_data', package.packageName + '.csv')
        cursor = oracle_connection.cursor()

        if os.path.exists(path):
            os.remove(path)

        f = open(path, 'w')
        chunksize = 100000
        offset = 0
        row_total = 0
        csv_writer = csv.writer(f, delimiter='^', lineterminator='\n')
        # I am having to do some data cleansing. I know this is not the most efficient way to do this, but currently
        # it is what I am limited to
        while True:
            cursor.execute(package.query + f'\r\n OFFSET {offset} ROWS\r\n FETCH NEXT {chunksize} ROWS ONLY')
            test = cursor.fetchone()
            if test is None:
                break
            else:
                while True:
                    row = cursor.fetchone()
                    if row is None:
                        break
                    else:
                        new_row = list(row)
                        new_row.append(package.sourceId[0])
                        new_row.append('')
                        i = 0
                        for item in new_row:
                            if type(item) == float:
                                new_row[i] = int(item)
                            elif type(item) == str:
                                new_row[i] = item.encode('ascii', 'replace')
                            i += 1
                        row = tuple(new_row)
                        csv_writer.writerow(row)
                        row_total += 1
            offset += chunksize

        f.close()
        # I know that execution is at least reaching this point. I can watch the CSV files grow as more and more
        # rows are added for all the packages. What I never get are either the success message or error message
        # below, and there are never any entries placed in the tables
        query = f"BULK INSERT {profiles['system'].database.split('_')[0]}_{profiles['system'].database.split('_')[1]}_test_{profiles['system'].database.split('_')[2]}.{package.destTable} FROM \"{path}\" WITH (FIELDTERMINATOR='^', ROWTERMINATOR='\\n');"
        engine.cursor().execute(query)
        engine.commit()

        end_time = time.monotonic() - started_at
        print(
            f"{package.packageName} has completed. Total rows inserted: {row_total}. Total execution time: {end_time} seconds\n")
        os.remove(path)
    except Exception as e:
        print(f'An error has occured for package {package.packageName}.\r\n {repr(e)}')
    finally:
        # Here is where I am trying to add an item to the queue so the get method in the main def will pick it up and
        # remove it from the queue
        q.put(f'{package.packageName} has completed')
        if oracle_connection:
            oracle_connection.close()
        if engine:
            engine.cursor().close()
            engine.close()
if __name__ == '__main__':
    # Setting mp creation type
    ctx = mp.get_context('spawn')
    q = ctx.Queue()

    # For the ETL I generate a list of class objects that hold relevant information. profs contains a list of
    # connection objects (credentials, connection strings, etc). packages contains the information to run the extract
    # (destination tables, query string, package name for logging, etc)
    profs = profiles.get_conn_vars(sys.argv[1])
    packages = packages.get_etl_packages(profs)

    processes = []
    # I'm trying to track both individual package execution time and overall time so I can get an estimate on rows
    # per second
    start_time = time.monotonic()

    sqlConn = connections.getSQLConnection(profs['system'])
    # Here I'm executing a SQL command to truncate all my staging tables to ensure they are empty and will not
    # generate any key violations
    sqlConn.execute(
        f"USE [{profs['system'].database.split('_')[0]}_{profs['system'].database.split('_')[1]}_test_{profs['system'].database.split('_')[2]}]\r\nExec Sp_msforeachtable @command1='Truncate Table ?', @whereand='and Schema_Id=Schema_id(''my_schema'')'")

    # Here is where I start generating a process per package to try and get all packages to run simultaneously
    for package in packages:
        p = ctx.Process(target=execute_extract, args=(package, profs, q,))
        processes.append(p)
        p.start()

    # Here is my attempt at managing the queue. This is a monstrosity of fixes I've tried to get this to work
    results = []
    while True:
        try:
            result = q.get(False, 0.01)
            results.append(result)
        except queue.Empty:
            pass
        allExited = True
        for t in processes:
            if t.exitcode is None:
                allExited = False
                break
        if allExited and q.empty():
            break

    for p in processes:
        p.join()

    # Closing out the end time and writing the overall execution time in minutes.
    end_time = time.monotonic() - start_time
    print(f'Total execution time of {end_time / 60} minutes.')
I can't be sure why you are experiencing a deadlock (I am not at all convinced it is related to your queue management), but I can say for sure that you can simplify your queue management logic if you do either of two things:
Method 1
Ensure that your worker function, execute_extract, puts something on the results queue even in the case of an exception (I would recommend putting the Exception object itself). Then your entire main-process loop that begins with while True: and attempts to get the results can be replaced with:
results = [q.get() for _ in range(len(processes))]
You are guaranteed that there will be a fixed number of messages on the queue equal to the number of processes created.
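As a minimal sketch of that pattern (do_the_extract here is a hypothetical stand-in for the real extract work, not a function from your code):

def do_the_extract(package, profiles):
    # Hypothetical stand-in for the Oracle-to-CSV-to-BULK-INSERT work in the question.
    return 0

def execute_extract(package, profiles, q):
    try:
        row_total = do_the_extract(package, profiles)
        q.put(f'{package.packageName} has completed. Rows: {row_total}')
    except Exception as e:
        # Put the exception itself so the main process still receives exactly one
        # message per worker and can see what went wrong.
        q.put(e)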
Method 2 (even simpler)
Simply reverse the order in which you wait for the subprocesses to complete and the order in which you process the results queue. You don't know how many messages will be on the queue, but you aren't processing the queue until all the processes have returned, so however many messages are on the queue is all you will ever get. Just retrieve them until the queue is empty:
for p in processes:
    p.join()
results = []
while not q.empty():
    results.append(q.get())
At this point I would normally suggest that you use a multiprocessing pool class such as multiprocessing.Pool which does not require an explicit queue to retrieve results. But make either of these changes (I suggest Method 2, as I cannot see how it can cause a deadlock since only the main process is running at this point) and see if your problem goes away. I am not, however, guaranteeing that your issue is not somewhere else in your code. While your code is overly complicated and inefficient, it is not obviously "wrong." At least you will know whether your problem is elsewhere.
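For what it's worth, here is a rough sketch of what the Pool-based version could look like, assuming execute_extract were refactored to take one package and return its outcome; the package list here is made up and this is illustrative only, not a drop-in replacement:

import multiprocessing as mp

def execute_extract(package):
    # Refactored to *return* its outcome instead of writing to a shared queue.
    try:
        return f'{package} extracted'   # stand-in for the real extract work
    except Exception as e:
        return e                        # exceptions come back as ordinary results

if __name__ == '__main__':
    packages = ['package_a', 'package_b']   # hypothetical package list
    with mp.Pool() as pool:
        results = pool.map(execute_extract, packages)
    print(results)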
And my question for you: What does it buy you to do everything using a context acquired with ctx = mp.get_context('spawn') instead of just calling the methods on the multiprocessing module itself? If your platform had support for a fork call, which would be the default context, would you not want to use that?
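(For readers unfamiliar with contexts: the two spellings below are equivalent ways of forcing the 'spawn' start method; the context form mainly matters when different parts of a program need different start methods.)

import multiprocessing as mp

def hello():
    print('hello from', mp.current_process().name)

if __name__ == '__main__':
    # Context form: only objects created from this ctx use the 'spawn' start method.
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=hello)
    p.start()
    p.join()

    # Module-level form: set the start method once, globally, then use mp.Process as usual.
    # mp.set_start_method('spawn')   # may only be called once per program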
Background: I'm trying to do hundreds of Dymola simulations with the python-dymola interface. I managed to run them in a for-loop. Now I want to run them with multi-threading so I can run multiple models in parallel (which will be much faster). Since probably nobody uses that interface, I wrote some simple code that also shows my problem:
1: Turn a for-loop into a function that is then run inside another for-loop, where both the function and the for-loop share the same variable 'i'.
2: Turn a for-loop into a function and use multi-threading to execute it. A for-loop runs the commands one by one; I want to run them in parallel with a maximum of x threads at the same time. The result should be the same as when executing the for-loop.
Example-code:
import os

nSim = 100
ndig = '{:01d}'

for i in range(nSim):
    os.makedirs(str(ndig.format(i)))
Note that the names of the created directories are just the numbers from the for-loop (this is important). Now, instead of using the for-loop, I would love to create the directories with multi-threading (note: probably not interesting for this short piece of code, but when calling and executing hundreds of simulation models it definitely is interesting to use multi-threading).
So I started with something simple, I thought: turning the for-loop into a function that is then run inside another for-loop. I hoped to get the same result as with the for-loop code above, but I got this error:
AttributeError: 'NoneType' object has no attribute 'start'
(note: I just started with this, because I did not use the def-statement before and the thread package is also new. After this I would evolve towards the multi-threading.)
1:
import os

nSim = 100
ndig = '{:01d}'

def simulation(i):
    os.makedirs(str(ndig.format(i)))

for i in range(nSim):
    simulation(i=i).start
After that failed, I tried to evolve towards multi-threading (converting the for-loop into something that does the same thing, but runs the code in parallel instead of one by one, with a maximum number of threads):
2:
import os
import threading

nSim = 100
ndig = '{:01d}'

def simulation(i):
    os.makedirs(str(ndig.format(i)))

if __name__ == '__main__':
    i in range(nSim)
    simulation_thread[i] = threading.Thread(target=simulation(i=i))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
Unfortunately that attempt failed as well and now I got the error:
NameError: name 'i' is not defined
Does anybody has suggestions for issues 1 or 2?
Both examples are incomplete. Here's a complete example. Note that target is passed the function itself (target=simulation) and args a tuple of its arguments (args=(i,)). Don't call the function with target=simulation(i=i), because that just passes the result of the function, which is equivalent to target=None in this case.
import threading

nSim = 100

def simulation(i):
    print(f'{threading.current_thread().name}: {i}')

if __name__ == '__main__':
    threads = [threading.Thread(target=simulation, args=(i,)) for i in range(nSim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
Output:
Thread-1: 0
Thread-2: 1
Thread-3: 2
.
.
Thread-98: 97
Thread-99: 98
Thread-100: 99
Note you usually don't want more threads than CPUs, a count you can get from multiprocessing.cpu_count(). You can create a thread pool and use queue.Queue to post work that the threads execute. An example is in the Python queue documentation.
You cannot call .start like this
simulation(i=i).start
on a non-threading object. Also, you have to import the threading module as well.
It seems like you forgot to add 'for' and indent the code in your loop
i in range(nSim)
simulation_thread[i] = threading.Thread(target=simulation(i=i))
simulation_thread[i].daemon = True
simulation_thread[i].start()
to
for i in range(nSim):
    simulation_thread[i] = threading.Thread(target=simulation(i=i))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
If you would like to have a maximum number of threads in a pool and run all the items in the queue, we can continue @mark-tolonen's answer and do it like this:
import threading
import queue
import time

def main():
    size_of_threads_pool = 10
    num_of_tasks = 30
    task_seconds = 1

    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            print(my_st)
            print(f'{threading.current_thread().name}: Working on {item}')
            time.sleep(task_seconds)
            print(f'Finished {item}')
            q.task_done()

    my_st = "MY string"
    threads = [threading.Thread(target=worker, daemon=True) for i in range(size_of_threads_pool)]
    for t in threads:
        t.start()

    # send the task requests to the workers
    for item in range(num_of_tasks):
        q.put(item)

    # block until all tasks are done
    q.join()
    print('All work completed')

    # No need for this, as the threads loop with while True and will never stop:
    # for t in threads:
    #     t.join()

if __name__ == '__main__':
    main()
This will run 30 tasks of 1 second each, using 10 threads, so the total time should be about 3 seconds.
$ time python3 q_test.py
...
All work completed
real 0m3.064s
user 0m0.033s
sys 0m0.016s
EDIT: I found another higher-level interface for asynchronously executing callables.
Use concurrent.futures, see the example in the docs:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
Note the max_workers=5 argument, which sets the maximum number of threads, and note the for url in URLS loop that you can adapt to your own work items.
I want to use rq to run tasks on a separate worker to gather data from a measuring instrument. The end of the task will be signaled by a user pressing a button on a dash app.
The problem is that the task itself does not know when to terminate since it doesn't have access to the dash app's context.
I already use meta to pass information from the worker back to the caller, but can I pass information from the caller to the worker?
Example task:
import numpy as np
from rq import get_current_job
from time import time, sleep

def mock_measurement():
    job = get_current_job()
    t_start = time()
    # Run the measurement
    t = []
    i = []
    job.meta['should_stop'] = False  # I want to use this tag to tell the job to stop
    while not job.meta['should_stop']:
        t.append(time() - t_start)
        i.append(np.random.random())
        job.meta['data'] = (t, i)
        job.save_meta()
        sleep(5)
    print("Job Finished")
From the console, I can start a job like this:
import rq
from redis import Redis

queue = rq.Queue('test-app', connection=Redis('localhost', 6379))
job = queue.enqueue('tasks.mock_measurement')
and I would like to be able to do this from the console to signify to the worker it can stop running:
job.meta['should_stop'] = True
job.save_meta()
job.refresh
However, while the commands above return without an error, they do not actually update the meta dictionary.
That's because you didn't fetch the updated meta. But don't do this!
Invoking save_meta and refresh in both the caller and the worker will lose data.
Instead, use job.connection.set(job + ':should_stop', 1, ex=300) to set the flag, and job.connection.get(job + ':should_stop') to check whether the flag is set.
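For context, here is a rough sketch of how the task loop could poll that flag; I've built the Redis key from job.get_id() here, which is an assumption about the intended key naming rather than a literal copy of the expression above:

from time import time, sleep

import numpy as np
from rq import get_current_job

def mock_measurement():
    job = get_current_job()
    stop_key = job.get_id() + ':should_stop'   # assumed key naming, adapted from the answer above
    t_start = time()
    t, i = [], []
    while not job.connection.get(stop_key):    # the flag lives in Redis, not in job.meta
        t.append(time() - t_start)
        i.append(np.random.random())
        job.meta['data'] = (t, i)
        job.save_meta()
        sleep(5)
    print("Job Finished")

From the console, the caller would then run job.connection.set(job.get_id() + ':should_stop', 1, ex=300) to signal the task to stop.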
My code:
import multiprocessing as mp
import Queue  # Python 2: the Queue module provides the Empty exception


def create_rods(folder="./", kappas=10, allowed_kappa_error=.3,
                radius_correction_ratio=0.1):
    """
    Create one rod for each rod_data and for each file
    returns [RodGroup1, RodGroup2, ...]
    """
    names, files = import_files(folder=folder)
    if len(files) == 0:
        print "No files to import."
        raise ValueError
    states = [None for dummy_ in range(len(files))]
    processes = []
    states_queue = mp.Queue()
    for index in range(len(files)):
        process = mp.Process(target=create_rods_process,
                             args=(kappas, allowed_kappa_error,
                                   radius_correction_ratio, names,
                                   files, index, states_queue))
        processes.append(process)
    run_processes(processes)  # This part seems to take a lot of time.
    try:
        while True:
            [index, state] = states_queue.get(False)
            states[index] = state
    except Queue.Empty:
        pass
    return names, states


def create_rods_process(kappas, allowed_kappa_error,
                        radius_correction_ratio, names,
                        files, index, states_queue):
    """
    Process of method.
    """
    state = SystemState(kappas, allowed_kappa_error,
                        radius_correction_ratio, names[index])
    data = import_data(files[index])
    for dataline in data:
        parameters = tuple(dataline)
        new_rod = Rod(parameters)
        state.put_rod(new_rod)
    state.check_rods()
    states_queue.put([index, state])
def run_processes(processes, time_out=None):
    """
    Runs all processes using all cores.
    """
    running = []
    cpus = mp.cpu_count()
    try:
        while True:
            #for cpu in range(cpus):
            next_process = processes.pop()
            running.append(next_process)
            next_process.start()
    except IndexError:
        pass
    if not time_out:
        try:
            while True:
                for process in running:
                    if not process.is_alive():
                        running.remove(process)
        except TypeError:
            pass
    else:
        for process in running:
            process.join(time_out)
I expect the processes to end, but I get a process that stays stuck. I don't know if the problem is in the run_processes() method or in the create_rods() method. With join the CPUs are freed, but the program doesn't go on.
From Python's multiprocessing guidelines.
Joining processes that use queues
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.
Joining processes before draining their Queues results in a deadlock. You need to be sure the queues are emptied before joining the processes.
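Applied to the code above, that means draining states_queue before joining the children, not after. Here is a minimal sketch of the pattern; the worker below is a stand-in for create_rods_process, not the actual implementation:

import multiprocessing as mp

def worker(index, states_queue):
    # Stand-in for create_rods_process: put one result, then exit.
    states_queue.put([index, "state-%d" % index])

if __name__ == '__main__':
    states_queue = mp.Queue()
    processes = [mp.Process(target=worker, args=(i, states_queue)) for i in range(4)]
    for p in processes:
        p.start()

    # Drain the queue first: one result is expected per process, so get() that many
    # items. Only after the queue has been emptied is it safe to join the children.
    states = {}
    for _ in processes:
        index, state = states_queue.get()
        states[index] = state

    for p in processes:
        p.join()
    print(states)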