multiprocess download limited by throttle - python

I have a third-party Python 3 module that downloads data from a source on the internet.
I do not have control over how this module does that (most likely just an HTTP download).
I only pass some key and the module returns the relevant data to me.
There is a limit of 1 download per second max.
I have 10000 pieces of data to obtain through this module, one for each of 10000 keys.
So I construct the list of 10000 keys, then map the list over a multiprocessing pool (say 4 processes) and collect the results.
Each process downloads, processes the data, stores it to a file on disk, then returns a status.
Each process does this for each of the 2500 requests it gets to work on.
The assumption is that result processing takes some time, and while one process is processing, another could be downloading, although everything is still restricted by the 1 download/s limit.
The code looks like:
main.py
from multiprocessing import Manager, Pool

# collect the list l of 10000 keys
...
pool = Pool(initializer=initPoolProcess, initargs=...)
...
m = Manager()
lck = m.Lock()
lastQueryTime = m.Value('d', 0.0)
result = pool.map(f, l)
pool.close()
pool.join()
The other file, mp.py, has:
import datetime
import time

def f(...):
    while (datetime.datetime.now().timestamp() - lastQueryTime.value) < 1.0:
        time.sleep(datetime.datetime.now().timestamp() - lastQueryTime.value)
    lck.acquire()
    lastQueryTime.value = datetime.datetime.now().timestamp()
    lck.release()
    # do the actual download
    # process
    # return result
Question 1: given this piece of code, how do I test that my assumption holds, i.e. that I am gaining by parallelizing the task?
Question 2: if so, is there any optimization to be had here?
Regards,
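A minimal, self-contained sketch for question 1, under assumptions of my own (the sleep calls stand in for the real download and processing, and only a small sample of keys is timed): run the same throttled function once serially and once through a pool and compare wall-clock times; if the pooled run is clearly faster, the assumption holds.

import time
from multiprocessing import Manager, Pool

def init_worker(shared_lock, shared_last):
    # make the shared throttle state available in each process
    global lck, lastQueryTime
    lck, lastQueryTime = shared_lock, shared_last

def f(key):
    # throttle: at most one download start per second across all processes
    while True:
        with lck:
            now = time.monotonic()
            if now - lastQueryTime.value >= 1.0:
                lastQueryTime.value = now
                break
        time.sleep(0.05)
    time.sleep(0.2)   # stand-in for the download itself
    time.sleep(0.5)   # stand-in for processing + writing to disk
    return key

if __name__ == '__main__':
    keys = list(range(30))          # small sample instead of the 10000 keys
    m = Manager()
    lock, last = m.Lock(), m.Value('d', 0.0)

    t0 = time.perf_counter()
    init_worker(lock, last)
    serial = [f(k) for k in keys]
    print(f'serial:  {time.perf_counter() - t0:.1f} s')

    last.value = 0.0
    t0 = time.perf_counter()
    with Pool(4, initializer=init_worker, initargs=(lock, last)) as pool:
        pooled = pool.map(f, keys)
    print(f'pool(4): {time.perf_counter() - t0:.1f} s')

Note that this sketch also checks and updates lastQueryTime in one step while holding the lock, so two workers cannot both pass the 1-second check at the same moment.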

Related

Sharing a large dataframe with multiprocessing.pool

I have a function which I want to compute in parallel using multiprocessing. The function takes an argument, but also loads subsets from two very large dataframes which have already been loaded into memory (one of which is about 1 GB and the other just over 6 GB).
import multiprocessing
import pandas as pd

largeDF1 = pd.read_csv(directory + 'name1.csv')
largeDF2 = pd.read_csv(directory + 'name2.csv')

def f(x):
    load_content1 = largeDF1.loc[largeDF1['FirstRow'] == x]
    load_content2 = largeDF2.loc[largeDF2['FirstRow'] == x]
    # some computation happens here
    new_data.to_csv(directory + 'output.csv', index=False)

def main():
    multiprocessing.set_start_method('spawn', force=True)
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    input = input_data['col']
    pool.map_async(f, input)
    pool.close()
    pool.join()
The problem is that the files are too big, and when I run this over multiple cores I run into memory issues. I want to know if there is a way for the loaded dataframes to be shared across all processes.
I have tried Manager() but could not get it to work. Any help is appreciated. Thanks.
If you were running this on a UNIX-like system (which uses the fork start method by default), the data would be shared out of the box. Most operating systems use copy-on-write for memory pages. So even if you fork a process several times, the processes would share most of the memory pages that contain the dataframes, as long as you don't modify those dataframes.
But when using the spawn start method, each worker process has to load the dataframe. I'm not sure if the OS is smart enough in that case to share the memory pages. Or indeed whether these spawned processes would all have the same memory layout.
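As a minimal sketch of the fork approach (my own illustration, assuming a UNIX-like OS and the file names from the question; the body of f is a placeholder): load the dataframes at module level before creating the pool, and the forked workers see them without reloading or pickling.

import multiprocessing as mp
import pandas as pd

# Loaded once in the parent; forked workers share these pages copy-on-write.
largeDF1 = pd.read_csv('name1.csv')
largeDF2 = pd.read_csv('name2.csv')

def f(x):
    sub1 = largeDF1.loc[largeDF1['FirstRow'] == x]
    sub2 = largeDF2.loc[largeDF2['FirstRow'] == x]
    return len(sub1), len(sub2)   # placeholder for the real computation

if __name__ == '__main__':
    mp.set_start_method('fork')   # default on Linux; not available on Windows
    with mp.Pool(processes=4) as pool:
        results = pool.map(f, ['a', 'b', 'c'])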
The only portable solution I can think of would be to leave the data on disk and use mmap in the workers to map it into memory read-only. That way the OS would notice that multiple processes are mapping the same file, and it would only load one copy.
The downside is that the data would be in memory in on-disk CSV format, which makes reading data from it (without making a copy!) less convenient. So you might want to prepare the data beforehand into a form that is easier to use. For example, convert the data from 'FirstRow' into a binary file of float or double values that you can iterate over with struct.iter_unpack.
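A minimal sketch of that preparation step (my illustration; it assumes the 'FirstRow' column is numeric and uses made-up file names): dump the column to a flat binary file of doubles once, then mmap it read-only and iterate with struct.iter_unpack in each worker.

import mmap
import struct

import pandas as pd

# One-time preparation: write the column as a flat array of 8-byte doubles.
df = pd.read_csv('name1.csv')
df['FirstRow'].astype('float64').to_numpy().tofile('firstrow.bin')

# In each worker: map the file read-only; the OS keeps a single shared copy.
with open('firstrow.bin', 'rb') as fh:
    with mmap.mmap(fh.fileno(), 0, prot=mmap.PROT_READ) as mm:  # POSIX; use access=mmap.ACCESS_READ on Windows
        for (value,) in struct.iter_unpack('d', mm):
            pass  # use value without copying the whole array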
The function below (from my statusline script) uses mmap to count the number of messages in a mailbox file.
import mmap
import os

def mail(storage, mboxname):
    """
    Report unread mail.

    Arguments:
        storage: a dict with keys (unread, time, size) from the previous call
            or an empty dict. This dict will be *modified* by this function.
        mboxname (str): name of the mailbox to read.

    Returns: A string to display.
    """
    stats = os.stat(mboxname)
    if stats.st_size == 0:
        return 'Mail: 0'
    # When mutt modifies the mailbox, it seems to only change the
    # ctime, not the mtime! This is probably related to how mutt saves the
    # file. See also stat(2).
    newtime = stats.st_ctime
    newsize = stats.st_size
    if not storage or newtime > storage['time'] or newsize != storage['size']:
        with open(mboxname) as mbox:
            with mmap.mmap(mbox.fileno(), 0, prot=mmap.PROT_READ) as mm:
                start, total = 0, 1  # First mail is not found; it starts on the first line...
                while True:
                    rv = mm.find(b'\n\nFrom ', start)
                    if rv == -1:
                        break
                    else:
                        total += 1
                        start = rv + 7
                start, read = 0, 0
                while True:
                    rv = mm.find(b'\nStatus: R', start)
                    if rv == -1:
                        break
                    else:
                        read += 1
                        start = rv + 10
        unread = total - read
        # Save the values for the next run.
        storage['unread'], storage['time'], storage['size'] = unread, newtime, newsize
    else:
        unread = storage['unread']
    return f'Mail: {unread}'
In this case I used mmap because it was 4x faster than just reading the file. See normal reading versus using mmap.

How can I make python scripts run faster or use multiprocess in this case?

I'm trying to measure four similarities (cosine similarity, Jaccard, SequenceMatcher similarity, and Jaccard-variant similarity) over 800K pairs of documents.
Every document file is in txt format and about 100 KB ~ 300 KB (about 1500000 characters).
I have two questions regarding how to make my python scripts faster:
MY PYTHON SCRIPTS:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import SequenceMatcher

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    vectorizer = CountVectorizer(text)
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def measure_sim(doc1, doc2):
    a, b = doc1.split(), doc2.split()
    c, d = set(a), set(b)
    vectors = [t for t in get_tf_vectors(doc1, doc2)]
    return cosine_similarity(vectors)[1][0], float(len(c & d) / len(c | d)), \
        1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])), \
        SequenceMatcher(None, a, b).ratio()

# items in doc_pair list are like ('ID', 'doc1_directory', 'doc2_directory')
def data_analysis(doc_pair_list):
    result = {}
    for item in doc_pair_list:
        f1 = open(item[1], 'rb')
        doc1 = f1.read()
        f1.close()
        f2 = open(item[2], 'rb')
        doc2 = f2.read()
        f2.close()
        result[item[0]] = measure_sim(doc1, doc2)
However, this code uses only 10% of my CPU, and it takes almost 20 days for this task to be done. So I want to ask if there is any way to make this code more efficient.
Q1. Since the documents are saved on an HDD, I thought loading the text data would take some time. Hence, I suspect that loading only two documents every time the computer computes the similarities might not be efficient. So I am going to try loading 50 pairs of documents at once and computing the similarities for each pair. Would that be helpful?
Q2. Most of the postings about "how to make your code run faster" say that I should use Python modules based on C code. However, since I'm using the sklearn module, which is known to be quite efficient, I wonder whether there is any better way.
Is there any way to help this Python script use more of the computer's resources and become faster?
There may be better solutions, but you could try something like this, if computing the similarities is the bottleneck:
1) A separate process that reads all the files one by one and puts them into a multiprocessing.Queue.
2) A pool of multiple worker processes that compute the similarities and put the results into another multiprocessing.Queue.
3) The main thread then simply picks results from the results queue and saves them to a dictionary, as you have it now.
I don't know your hardware limitations (number and speed of CPU cores, RAM size, disk read speed) and I don't have any samples to test it on.
EDIT: Below is the described code. Please try it, check whether it is faster, and let me know. If the main bottleneck is loading the files, we can create more loader processes (e.g. 2 processes, each loading half of the files). If the bottleneck is calculating the similarities, you can create more worker processes (just change worker_count). Finally, 'results' is the dictionary with all the results.
import multiprocessing
import os
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    vectorizer = CountVectorizer(text)
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def calculate_similarities(doc_pairs_queue, results_queue):
    """ Pick docs from doc_pairs_queue and calculate their similarities, save the result to results_queue.
    Repeat infinitely (until the process is terminated). """
    while True:
        pair = doc_pairs_queue.get()
        pair_id = pair[0]
        doc1 = pair[1]
        doc2 = pair[2]
        a, b = doc1.split(), doc2.split()
        c, d = set(a), set(b)
        vectors = [t for t in get_tf_vectors(doc1, doc2)]
        results_queue.put((pair_id, cosine_similarity(vectors)[1][0], float(len(c & d) / len(c | d)),
                           1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])),
                           SequenceMatcher(None, a, b).ratio()))

def load_files(doc_pair_list, loaded_queue):
    """
    Pre-load files and put them to a queue, so working processes can get them.
    :param doc_pair_list: list of files to be loaded (ID, doc1_path, doc2_path)
    :param loaded_queue: multiprocessing.Queue that will hold the pre-loaded data
    """
    print("Started loading files...")
    for item in doc_pair_list:
        with open(item[1], 'rb') as f1:
            with open(item[2], 'rb') as f2:
                loaded_queue.put((item[0], f1.read(), f2.read()))  # if the queue is full, this automatically waits until there is space
    print("Finished loading files.")

def data_analysis(doc_pair_list):
    # create a loader process that will pre-load files (it does no calculations, so it loads much faster)
    # loader puts loaded files to a queue; 1 pair ~ 500 KB, 1000 pairs ~ 500 MB max size of queue (RAM memory)
    loaded_pairs_queue = multiprocessing.Queue(maxsize=1000)
    loader = multiprocessing.Process(target=load_files, args=(doc_pair_list, loaded_pairs_queue))
    loader.start()

    # create worker processes - these will do all the calculations
    results_queue = multiprocessing.Queue(maxsize=1000)  # workers put results into this queue
    worker_count = os.cpu_count() if os.cpu_count() else 2  # number of worker processes
    workers = []  # create a list of workers, so we can terminate them later
    for i in range(worker_count):
        worker = multiprocessing.Process(target=calculate_similarities, args=(loaded_pairs_queue, results_queue))
        worker.start()
        workers.append(worker)

    # the main process just picks the results from the queue and saves them to the dictionary
    results = {}
    i = 0  # results counter
    pairs_count = len(doc_pair_list)
    while i < pairs_count:
        res = results_queue.get(timeout=600)  # timeout is just in case something unexpected happened (results are calculated much more quickly)
        # Queue.get() is blocking - if the queue is empty, get() waits until something is put into the queue and then gets it
        results[res[0]] = res[1:]  # save to the dictionary by ID (first item in the result)
        i += 1

    # clean up the processes (so there aren't any zombies left)
    loader.terminate()
    loader.join()
    for worker in workers:
        worker.terminate()
        worker.join()

    return results
Please let me know about the results; I am quite interested and will assist you further if needed ;)
The first thing to do is to see if you can find the real bottleneck, and I think using cProfile might confirm your suspicion or shed some more light on your problem.
You should be able to run your code unmodified using cProfile like this:
python -m cProfile -o profiling-results python-file-to-test.py
After that you can analyze the results using pstats like this:
import pstats
stats = pstats.Stats("profiling-results")
stats.sort_stats("tottime")
stats.print_stats(10)
More on profiling your code can be found in Marco Bonazanin's blog article "My Python Code is Slow? Tips for Profiling".

Optimize parsing of GB sized files in parallel

I have several compressed files, each on the order of 2 GB compressed. The beginning of each file has a set of headers which I parse to extract a list of ~4,000,000 pointers (pointers).
For each pair of pointers (pointers[i], pointers[i+1]) for 0 <= i < len(pointers) - 1, I
seek to pointers[i]
read pointers[i+1] - pointers[i] bytes
decompress them
do a single-pass operation on that data and update a dictionary with what I find.
The issue is that I can only process roughly 30 pointer pairs a second using a single Python process, which means each file takes more than a day to get through.
Assuming that splitting up the pointers list among multiple processes doesn't hurt performance (since each process looks at the same file, though at different, non-overlapping parts), how can I use multiprocessing to speed this up?
My single-threaded operation looks like this:
def search_clusters(pointers, filepath, automaton, counter):

    def _decompress_lzma(f, pointer, chunk_size=2**14):
        # skipping over this
        ...
        return uncompressed_buffer

    first_pointer, last_pointer = pointers[0], pointers[-1]
    with open(filepath, 'rb') as fh:
        fh.seek(first_pointer)
        f = StringIO(fh.read(last_pointer - first_pointer))

    for pointer1, pointer2 in zip(pointers, pointers[1:]):
        size = pointer2 - pointer1
        f.seek(pointer1 - first_pointer)
        buffer = _decompress_lzma(f, 0)
        # skipping details, ultimately the counter dict is
        # modified by passing the uncompressed buffer through
        # an aho corasick automaton
        counter = update_counter_with_buffer(buffer, automaton, counter)

    return counter

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()
search_clusters(pointers, infile, automaton, counter)
I tried changing this to use multiprocessing.Pool:
from itertools import repeat, izip
import logging
import multiprocessing

logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)

def chunked(pointers, chunksize=1024):
    for i in range(0, len(pointers), chunksize):
        yield list(pointers[i:i + chunksize + 1])

def search_wrapper(args):
    return search_clusters(*args)

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict()  # returns collections.Counter()

map_args = izip(chunked(pointers), repeat(infile),
                repeat(automaton.copy()), repeat(counter.copy()))

pool = multiprocessing.Pool(20)
results = pool.map(search_wrapper, map_args)
pool.close()
pool.join()
but after a little while of processing, I get the following message and the script just hangs there with no further output:
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-20] child process calling self.run()
However, if I run with a serial version of map, without multiprocessing, things run just fine:
map(search_wrapper, map_args)
Any advice on how to change my multiprocessing code so it doesn't hang? Is it even a good idea to attempt to use multiple processes to read the same file?
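For comparison, a hedged sketch of one common way to structure such a job (it reuses the question's own names — search_clusters, load_automaton, ZimFile, infile — so it is not runnable on its own): let every worker open its own handle on the read-only file and build its own automaton, so only the small pointer chunks are pickled, and merge the per-chunk counters in the parent.

import multiprocessing
from collections import Counter

def worker_init(path):
    # Each worker remembers the path and builds its own automaton, so neither
    # a file handle nor the automaton has to be pickled and sent to it.
    global _filepath, _automaton
    _filepath = path
    _automaton = load_automaton()

def worker(chunk):
    # search_clusters opens the file itself and reads only this chunk's bytes.
    return search_clusters(chunk, _filepath, _automaton, Counter())

def chunked(pointers, chunksize=1024):
    # overlap by one pointer so every (pointers[i], pointers[i+1]) pair is covered
    for i in range(0, len(pointers), chunksize):
        yield pointers[i:i + chunksize + 1]

if __name__ == '__main__':
    bzf = ZimFile(infile)
    pointers = bzf.cluster_pointers
    total = Counter()
    with multiprocessing.Pool(8, initializer=worker_init, initargs=(infile,)) as pool:
        for partial in pool.imap_unordered(worker, chunked(pointers)):
            total.update(partial)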

Python: How to run nested parallel process in python?

I have a dataset df of trader transactions.
I have 2 levels of for loops as follows:
smartTrader = []

for asset in range(len(Assets)):
    df = df[df['Assets'] == asset]
    # I have some more calculations here
    for trader in range(len(df['TraderID'])):
        # I have some calculations here; if the trader is successful, I add his ID
        # to the list as follows
        smartTrader.append(df['TraderID'][trader])
    # some more calculations here which are related to the first for loop.
I would like to parallelise the calculations for each asset in Assets, and I also want to parallelise the calculations for each trader for every asset. After ALL these calculations are done, I want to do additional analysis based on the list of smartTrader.
This is my first attempt at parallel processing, so please be patient with me, and I appreciate your help.
If you use pathos, which provides a fork of multiprocessing, you can easily nest parallel maps. pathos is built for easily testing combinations of nested parallel maps -- which are direct translations of nested for loops.
It provides a selection of maps that are blocking, non-blocking, iterative, asynchronous, serial, parallel, and distributed.
>>> from pathos.pools import ProcessPool, ThreadPool
>>> amap = ProcessPool().amap
>>> tmap = ThreadPool().map
>>> from math import sin, cos
>>> print amap(tmap, [sin,cos], [range(10),range(10)]).get()
[[0.0, 0.8414709848078965, 0.9092974268256817, 0.1411200080598672, -0.7568024953079282, -0.9589242746631385, -0.27941549819892586, 0.6569865987187891, 0.9893582466233818, 0.4121184852417566], [1.0, 0.5403023058681398, -0.4161468365471424, -0.9899924966004454, -0.6536436208636119, 0.2836621854632263, 0.9601702866503661, 0.7539022543433046, -0.14550003380861354, -0.9111302618846769]]
Here this example uses a processing pool and a thread pool, where the thread map call is blocking, while the processing map call is asynchronous (note the get at the end of the last line).
Get pathos here: https://github.com/uqfoundation
or with:
$ pip install git+https://github.com/uqfoundation/pathos.git#master
Nested parallelism can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
Assume you want to parallelize the following nested program
def inner_calculation(asset, trader):
    return trader

def outer_calculation(asset):
    return asset, [inner_calculation(asset, trader) for trader in range(5)]

inner_results = []
outer_results = []

for asset in range(10):
    outer_result, inner_result = outer_calculation(asset)
    outer_results.append(outer_result)
    inner_results.append(inner_result)

# Then you can filter inner_results to get the final output.
Below is the Ray code parallelizing the above code:
Use the @ray.remote decorator for each function that we want to execute concurrently in its own process. A remote function returns a future (i.e., an identifier to the result) rather than the result itself.
When invoking a remote function f(), use the remote modifier, i.e., f.remote().
Use the ids_to_vals() helper function to convert a nested list of ids to values.
Note the program structure is identical. You only need to add remote and then convert the futures (ids) returned by the remote functions to values using the ids_to_vals() helper function.
import ray

ray.init()

# Define inner calculation as a remote function.
@ray.remote
def inner_calculation(asset, trader):
    return trader

# Define outer calculation to be executed as a remote function.
@ray.remote(num_return_vals=2)
def outer_calculation(asset):
    return asset, [inner_calculation.remote(asset, trader) for trader in range(5)]

# Helper to convert a nested list of object ids to a nested list of corresponding objects.
def ids_to_vals(ids):
    if isinstance(ids, ray.ObjectID):
        ids = ray.get(ids)
        if isinstance(ids, ray.ObjectID):
            return ids_to_vals(ids)
    if isinstance(ids, list):
        results = []
        for id in ids:
            results.append(ids_to_vals(id))
        return results
    return ids

outer_result_ids = []
inner_result_ids = []
for asset in range(10):
    outer_result_id, inner_result_id = outer_calculation.remote(asset)
    outer_result_ids.append(outer_result_id)
    inner_result_ids.append(inner_result_id)

outer_results = ids_to_vals(outer_result_ids)
inner_results = ids_to_vals(inner_result_ids)
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Probably threading, from the standard Python library, is the most convenient approach:

import threading

def worker(id):
    # Do your calculations here
    return

threads = []
for asset in range(len(Assets)):
    df = df[df['Assets'] == asset]
    for trader in range(len(df['TraderID'])):
        t = threading.Thread(target=worker, args=(trader,))
        threads.append(t)
        t.start()

# add a semaphore here if you need to synchronize results for all traders.
Instead of using for, use map:
import functools

smartTrader = []

m = map(calculations_as_a_function,
        [df[df['Assets'] == asset]
         for asset in range(len(Assets))])

# collect the per-asset results into the smartTrader list
smartTrader = functools.reduce(lambda acc, res: acc + list(res), m, smartTrader)
From then on, you can try different parallel map implementations, such as multiprocessing's or stackless's.

Launching other programs using a multiprocessing pool in Python is slow

My goal:
Use some function in Python to extract data from many data files simultaneously, using different processors on my computer. I need to extract data from 1300 files and would therefore like Python to start extracting data from a new file once one extraction is complete. The extraction of data from a file is completely independent of extraction from other files. The files are of a form that requires opening the program which created them (OrcaFlex) to extract the data. Because of this, extracting data from a single file can be time-consuming. (I use Windows.)
My attempt:
Using multiprocessing.Pool().map_async to pool my tasks.
Code:
import multiprocessing as mp
import OrcFxAPI  # Package connected to external program

arcs = [1, 2, 5, 7, 9, 13]

# define an example function
def get_results(x):
    # Collects results from external program:
    model = OrcFxAPI.Model(x[0])
    c = model['line_inner_barrel'].LinkedStatistics(['Ezy-Angle'], 1, objectExtra=OrcFxAPI.oeEndB).TimeSeriesStatistics('Ezy-Angle').Mean
    d = model['line_inner_barrel'].LinkedStatistics(['Ezy-Angle'], 1, objectExtra=OrcFxAPI.oeEndB).Query('Ezy-Angle', 'Ezy-Angle').ValueAtMax
    e = model['line_inner_barrel'].LinkedStatistics(['Ezy-Angle'], 1, objectExtra=OrcFxAPI.oeEndB).Query('Ezy-Angle', 'Ezy-Angle').ValueAtMin
    # Also does many other operations for extraction of results
    return [c, d, e]

if __name__ == '__main__':

    # METHOD WITH POOL - TAKES APPROX 1 HR 28 MIN
    # List of input needed for the get_results function:
    args = ((('CaseD%.3d.sim' % casenumber), arcs, 1) for casenumber in range(1, 25))

    pool = mp.Pool(processes=7)
    results = pool.map_async(get_results, args)
    pool.close()
    pool.join()

    # METHOD WITH FOR-LOOP - TAKES APPROX 1 HR 10 MIN
    # List of input needed for the get_results function:
    args2 = ((('CaseD%.3d.sim' % casenumber), arcs, 1) for casenumber in range(1, 25))

    for arg in args2:
        get_results(arg)
Problem:
Using a for-loop for a reduced set (24) of smaller data files took 1 hour 10 min, while using the Pool with 7 processes took 1 hour 28 min. Does anybody know why the run time is so slow, and not close to being divided by seven?
Also, is there a way of knowing which processor multiprocessing.Pool() allocates to a given process? (In other words, can I let my process know which processor it is using?)
All help would be very much appreciated!
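On the last point, a small sketch (my own illustration, not tied to OrcaFlex): a worker can report which pool process and PID it runs in; which physical core that process is scheduled on is decided by the operating system, not by multiprocessing.

import multiprocessing as mp
import os

def which_worker(x):
    # Identify the pool process handling this task.
    return x, mp.current_process().name, os.getpid()

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        for task, name, pid in pool.map(which_worker, range(8)):
            print(f'task {task} ran in {name} (pid {pid})')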
