I sped up a process by using a multithread function, however I need to maintain a relationship between the output and input.
import requests
import pprint
import threading

ticker = ['aapl', 'googl', 'nvda']
url_array = []
for i in ticker:
    url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/' + i + '?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com'
    url_array.append(url)

ev_array = []  # shared list that every thread appends to

def fetch_ev(url):
    urlHandler = requests.get(url)
    data = urlHandler.json()
    ev_single = data['quoteSummary']['result'][0]['defaultKeyStatistics']['enterpriseValue']['raw']
    ev_array.append(ev_single)  # makes array of enterprise values

threads = [threading.Thread(target=fetch_ev, args=(url,)) for url in
           url_array]  # calls multi thread that pulls enterprise value
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

pprint.pprint(dict(zip(ticker, ev_array)))
Sample output of the code:
1) {'aapl': '30.34B', 'googl': '484.66B', 'nvda': '602.66B'}
2) {'aapl': '484.66B', 'googl': '30.34B', 'nvda': '602.66B'}
I need the value to be matched up with the correct ticker.
Edit: I know dictionaries do not preserve order. Sorry, perhaps I was a little (very) unclear in my question. I have an array of ticker symbols, that matches the order of my url inputs. After running fetch_ev, I want to combine these ticker symbols with the matching enterprise value or ev_single. The order that they are stored in does not matter, however the pairings (k v pairs) or which values are stored with which ticker is very important.
Edit2 (MCVE): I changed the code to a simpler version of what I had, one that shows the problem better. Sorry, it's still a little more complicated than I would have wanted.
To make it easy to maintain the correspondence between input and output, the ev_array can be preallocated so it's the same size as the ticker array, and the fetch_ev() thread function can be given an extra argument specifying the index of the location in that array to store the value fetched.
To maintain the integrity of ev_array, a threading.RLock was added to prevent concurrent access to the shared resource, which might otherwise be written to simultaneously by more than one thread. (Since each thread now writes only to its own slot, through the index passed to fetch_ev(), this may not be strictly necessary.)
I don't know the proper ticker ↔ enterprise value correspondence, so I can't verify the results that doing this produces:
{'aapl': 602658308096L, 'googl': 484659986432L, 'nvda': 30338199552L}
but at least they're now the same each time it's run.
import requests
import pprint
import threading

def fetch_ev(index, url):  # index parameter added
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    ev_single = data['quoteSummary']['result'][0][
        'defaultKeyStatistics']['enterpriseValue']['raw']
    with ev_array_lock:
        ev_array[index] = ev_single  # store enterprise value obtained

tickers = ['aapl', 'googl', 'nvda']
ev_array = [None] * len(tickers)  # preallocate to hold results
ev_array_lock = threading.RLock()  # to synchronize concurrent array access
urls = ['https://query2.finance.yahoo.com/v10/finance/quoteSummary/{}'
        '?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US'
        '&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents'
        '&corsDomain=finance.yahoo.com'.format(symbol)
        for symbol in tickers]

threads = [threading.Thread(target=fetch_ev, args=(i, url))
           for i, url in enumerate(urls)]  # activities to obtain ev's
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

pprint.pprint(dict(zip(tickers, ev_array)))
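As an alternative to passing an index by hand, the same input-to-output pairing can be sketched with concurrent.futures, whose executor.map() yields results in the order of its inputs regardless of which thread finishes first. This is only a sketch assuming Python 3: fetch_ev_stub() and fetch_all() are hypothetical names, and the stub fakes the HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
import random
import time


def fetch_ev_stub(ticker):
    """Hypothetical stand-in for the real HTTP request."""
    # Sleep a random amount so threads finish in an unpredictable order.
    time.sleep(random.uniform(0.0, 0.05))
    return "EV-" + ticker  # fake value derived from the ticker


def fetch_all(tickers, fetch=fetch_ev_stub, max_workers=8):
    # executor.map() returns results in input order, so zip() pairs
    # every ticker with the value that was fetched for it.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(zip(tickers, executor.map(fetch, tickers)))


if __name__ == "__main__":
    print(fetch_all(['aapl', 'googl', 'nvda']))
```

In real use, fetch would be replaced by a function that takes a ticker, builds the URL, and returns the enterprise value.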
I have a script that loops over a pandas dataframe and outputs GIS data to a geopackage based on some searches and geometry manipulation. It works when I use a for loop, but with over 4k records it takes a while. Since I have it built as its own function that returns what I need based on a row iteration, I tried to run it with multiprocessing:
import pandas as pd, bwe_mapping
from multiprocessing import Pool

#Sample dataframe
bwes = [['id', 7216],['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92], ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwedf = pd.read_csv(bwes)
geopackage = "datalocation\geopackage.gpkg"
tracklayer = "tracks"

if __name__=='__main__':
    def task(item):
        bwe_mapping.map_bwe(item, geopackage, tracklayer)
    pool = Pool()
    for index, row in bwedf.iterrows():
        task(row)
    with Pool() as pool:
        for results in pool.imap_unordered(task, bwedf.iterrows()):
            print(results)
When I run this, my Task Manager populates with 16 new python tasks but shows no sign that anything is being done. Would it be better to use numpy.array_split() to break up my pandas df into 4 or 8 smaller ones and run the for index, row in bwedf.iterrows(): loop for each dataframe on its own processor?
No one process needs to be done in any order; as long as I can store the outputs, which are geopanda dataframes, into a list to concatenate into geopackage layers at the end.
Should I have put the for loop in the function and just passed it the whole dataframe and gis data to search?
If you are running on Windows or macOS, multiprocessing uses spawn to create the workers, which means that every child MUST be able to find the function it is going to execute when it imports your main script.
Your code has the function definition inside your if __name__=='__main__': block, so the children don't have access to it.
Simply moving the function def to before if __name__=='__main__': will make it work.
What is happening is that each child crashes when it tries to run the function, because it never saw its definition.
minimal code to reproduce the problem:
from multiprocessing import Pool

if __name__ == '__main__':
    def task(item):
        print(item)
        return item
    pool = Pool()
    with Pool() as pool:
        for results in pool.imap_unordered(task, range(10)):
            print(results)
and the solution is to move the function definition to before the if __name__=='__main__': line.
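A minimal sketch of the corrected layout, assuming Python 3. The function lives at module level, and only the code that starts the pool stays behind the guard. The helper name run() and the pool size of 2 are choices made for this illustration only.

```python
from multiprocessing import Pool


def task(item):
    # Defined at module level, so a spawned worker finds it
    # when it imports the main script.
    return item


def run(n=10):
    with Pool(2) as pool:
        # imap_unordered yields results as they complete,
        # so sort them to get a stable view.
        return sorted(pool.imap_unordered(task, range(n)))


if __name__ == '__main__':
    # Only the code that actually starts the workers is guarded.
    print(run())
```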
Edit: to iterate over the rows of a dataframe, this simple example demonstrates how to do it. Note that iterrows yields an (index, row) pair for each row, which is why the item is unpacked inside task.
import os
import pandas as pd
from multiprocessing import Pool
import time

# Sample dataframe
bwes = [['id', 7216], ['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92],
        ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwef = pd.DataFrame(bwes)

def task(item):
    time.sleep(1)
    index, row = item
    # print(os.getpid(), tuple(row))
    return str(os.getpid()) + " " + str(tuple(row))

if __name__ == '__main__':
    with Pool() as pool:
        for results in pool.imap_unordered(task, bwef.iterrows()):
            print(results)
The time.sleep(1) is only there because there is so little work that one worker might grab all of it, so I am forcing every worker to wait for the others. You should remove it. The result is as follows:
13228 ('id', 7216)
11376 ('item_id', 3277841)
15580 ('Date', '2019-01-04T00:00:00.000Z')
10712 ('start_lat', -56.92)
11376 ('End_lat', -59.87)
13228 ('start_lon', 45.87)
10712 ('End_lon', 44.67)
It seems like your "example" dataframe is transposed, but you just have to construct the dataframe correctly. I'd recommend you first run the code serially with iterrows before running it across multiple cores.
Sending data to the workers and back from them takes time, so make sure each worker is doing a lot of computational work and not just sending data back to the parent process.
I have two functions that I want to run concurrently to check performance. Currently I'm running one after another and it's taking quite some time.
Here is how I'm running them:
import pandas as pd
import threading

df = pd.read_csv('data/Detalhado_full.csv', sep=',', dtype={'maquina':str})

def gerar_graph_36():
    df_ordered = df.query(f'maquina=="3.6"')[['data', 'dia_semana', 'oee', 'ptg_ruins', 'prod_real_kg', 'prod_teorica_kg']].sort_values(by='data')
    oee = df_ordered['oee'].iloc[-1:].iloc[0]
    return oee

def gerar_graph_31():
    df_ordered = df.query(f'maquina=="3.1"')[['data', 'dia_semana', 'oee', 'ptg_ruins', 'prod_real_kg', 'prod_teorica_kg']].sort_values(by='data')
    oee = df_ordered['oee'].iloc[-1:].iloc[0]
    return oee

oee_36 = gerar_graph_36()
oee_31 = gerar_graph_31()
print(oee_36, oee_31)
I tried to apply threading using this statement but it's not returning the variable, instead it's printing None value
print(oee_31, oee_36) -> Expecting: 106.3 99.7 // Returning None None
oee_31 = threading.Thread(target=gerar_graph_31, args=()).start()
oee_36 = threading.Thread(target=gerar_graph_36, args=()).start()
print(oee_31, oee_36)
For checking purpose, If I use the command below, returns 3 as expected
print(threading.active_count())
I need the return oee value from the function, something like 103.8.
Thanks in advance!!
Ordinarily, creating a new thread and starting it is not like calling a function which returns a value: the Thread.start() call just "starts the code of the other thread" and returns immediately.
To collect results from the other threads, you have to communicate the computed results to the main thread using some data structure. An ordinary list or dictionary could do, or one could use a queue.Queue.
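The queue.Queue option can be sketched as follows. This is only an illustration: compute_oee() and collect() are hypothetical names, and the values are placeholders taken from the expected output in the question rather than anything read from the CSV file. Each result is tagged with its input so it stays identifiable no matter which thread finishes first.

```python
import queue
import threading


def compute_oee(machine, out_queue):
    # Hypothetical stand-in for the gerar_graph_*() functions:
    # placeholder values instead of a real dataframe query.
    fake_values = {"3.1": 106.3, "3.6": 99.7}
    # Tag the result with its input so the pairing is never lost.
    out_queue.put((machine, fake_values[machine]))


def collect(machines):
    out_queue = queue.Queue()
    threads = [threading.Thread(target=compute_oee, args=(m, out_queue))
               for m in machines]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Every thread has finished, so the queue holds one item per machine.
    return dict(out_queue.get() for _ in machines)


if __name__ == "__main__":
    print(collect(["3.1", "3.6"]))
```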
If you want something more like a function call, without having to modify the gerar_graph() functions, you could use the concurrent.futures module instead of threading: that is higher-level code that wraps your calls in a "future" object, and you can check when each future is done and fetch the value returned by the function.
Otherwise, simply have a top-level variable containing a list, wait for your threads to finish running (they stop when the function called by "target" returns), and collect the results:
import pandas as pd
import threading

df = pd.read_csv('data/Detalhado_full.csv', sep=',', dtype={'maquina':str})
results = []

def gerar_graph_36():
    df_ordered = df.query(f'maquina=="3.6"')[['data', 'dia_semana', 'oee', 'ptg_ruins', 'prod_real_kg', 'prod_teorica_kg']].sort_values(by='data')
    oee = df_ordered['oee'].iloc[-1:].iloc[0]
    results.append(oee)

def gerar_graph_31():
    df_ordered = df.query(f'maquina=="3.1"')[['data', 'dia_semana', 'oee', 'ptg_ruins', 'prod_real_kg', 'prod_teorica_kg']].sort_values(by='data')
    oee = df_ordered['oee'].iloc[-1:].iloc[0]
    results.append(oee)

# We need to keep a reference to the threads themselves
# so that we can call both ".start()" (which always returns None)
# and ".join()" on them.
oee_31 = threading.Thread(target=gerar_graph_31); oee_31.start()
oee_36 = threading.Thread(target=gerar_graph_36); oee_36.start()
oee_31.join()  # will block and return only when the task is done, but oee_36 will be running concurrently
oee_36.join()
print(results)
If you need more than 2 threads, (like all 36...), I strongly suggest using concurrent.futures: you can limit the number of workers to a number comparable to the logical CPUs you have. And, of course, manage your tasks and calls in a list or dictionary, instead of having a separate variable name for each.
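A rough sketch of the concurrent.futures variant, assuming Python 3. The two functions below are placeholders that return fixed values instead of querying the dataframe, and run_all() is a hypothetical helper name. The point is that the functions keep their plain return statements and the tasks are managed in a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


def gerar_graph_36():
    return 99.7   # placeholder for the real dataframe query


def gerar_graph_31():
    return 106.3  # placeholder for the real dataframe query


def run_all(functions, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns a Future immediately; the call runs in a worker.
        futures = {func.__name__: executor.submit(func) for func in functions}
        # result() blocks until that call has finished and hands back
        # whatever the function returned.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(run_all([gerar_graph_31, gerar_graph_36]))
```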
I'd like to parallelize a function that returns a flattened list of values (called "keys") in a dict, but I don't understand how to obtain the final result. I have tried:
import multiprocessing
import pandas as pd

def toParallel(ht, token):
    keys = []
    words = token[token['hashtag'] == ht]['word']
    for w in words:
        keys.append(checkString(w))
    y = {ht:keys}
    return y  # the dict described below

num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)
token = pd.read_csv('/path', sep=",", header = None, encoding='utf-8')
token.columns = ['word', 'hashtag', 'count']
hashtag = pd.DataFrame(token.groupby(by='hashtag', as_index=False).count()['hashtag'])
result = pd.DataFrame(index = hashtag['hashtag'], columns = range(0, 21))
result = result.fillna(0)
final_result = []
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
Where toParallel function should return a dict with hashtag as key and a list of keys (where keys are int). But if I try to print final_result, I obtain only
<bound method ApplyResult.get of <multiprocessing.pool.ApplyResult object at 0x10c4fa950>>
How can I do it?
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
You can either use Pool.apply() and get the result right away (in which case you do not need multiprocessing hehe, the function is just there for completeness) or use Pool.apply_async() followed by a call to get() on each returned ApplyResult. Pool.apply_async() is asynchronous: it returns immediately with an ApplyResult object, not with the function's return value.
Something like this:
workers = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
final_result = [worker.get() for worker in workers]
Alternatively, you can also use Pool.map() which will do all this for you.
Either way, I recommend you read the documentation carefully.
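A minimal sketch of the Pool.map() variant, assuming Python 3. To stay self-contained it replaces the pandas dataframe with a plain list of (word, hashtag) pairs, and check_string(), to_parallel() and run() are hypothetical stand-ins, so treat it as an outline rather than a drop-in replacement. functools.partial fixes the token argument so that map() only has to supply the hashtag.

```python
from functools import partial
from multiprocessing import Pool


def check_string(word):
    # Hypothetical stand-in for checkString(): one int per word.
    return len(word)


def to_parallel(ht, token):
    # token is a list of (word, hashtag) pairs in this sketch.
    keys = [check_string(word) for word, hashtag in token if hashtag == ht]
    return {ht: keys}


def run(hashtags, token):
    with Pool() as pool:
        # map() blocks until every call is done and returns the
        # results in the same order as the hashtags.
        return pool.map(partial(to_parallel, token=token), hashtags)


if __name__ == "__main__":
    token = [("sun", "#a"), ("moon", "#a"), ("star", "#b")]
    print(run(["#a", "#b"], token))
```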
Addendum: When answering this question I presumed the OP is using some Unix operating system like Linux or OSX. If you are using Windows, you must not forget to safeguard your parent/worker processes using if __name__ == '__main__'. This is because Windows lacks fork() and so the child process starts at the beginning of the file, and not at the point of forking like in Unix, so you must use an if condition to guide it. See here.
ps: this is unnecessary:
num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)
If you call multiprocessing.Pool() without arguments (or None), it already creates a pool of workers with the size of your cpu count.
I have a dataset df of trader transactions.
I have 2 levels of for loops as follows:
smartTrader = []
for asset in range(len(Assets)):
    df = df[df['Assets'] == asset]
    # I have some more calculations here
    for trader in range(len(df['TraderID'])):
        # I have some calculations here, If trader is successful, I add his ID
        # to the list as follows
        smartTrader.append(df['TraderID'][trader])
    # some more calculations here which are related to the first for loop.
I would like to parallelise the calculations for each asset in Assets, and I also want to parallelise the calculations for each trader for every asset. After ALL these calculations are done, I want to do additional analysis based on the list of smartTrader.
This is my first attempt at parallel processing, so please be patient with me, and I appreciate your help.
If you use pathos, which provides a fork of multiprocessing, you can easily nest parallel maps. pathos is built for easily testing combinations of nested parallel maps -- which are direct translations of nested for loops.
It provides a selection of maps that are blocking, non-blocking, iterative, asynchronous, serial, parallel, and distributed.
>>> from pathos.pools import ProcessPool, ThreadPool
>>> amap = ProcessPool().amap
>>> tmap = ThreadPool().map
>>> from math import sin, cos
>>> print(amap(tmap, [sin,cos], [range(10),range(10)]).get())
[[0.0, 0.8414709848078965, 0.9092974268256817, 0.1411200080598672, -0.7568024953079282, -0.9589242746631385, -0.27941549819892586, 0.6569865987187891, 0.9893582466233818, 0.4121184852417566], [1.0, 0.5403023058681398, -0.4161468365471424, -0.9899924966004454, -0.6536436208636119, 0.2836621854632263, 0.9601702866503661, 0.7539022543433046, -0.14550003380861354, -0.9111302618846769]]
Here this example uses a processing pool and a thread pool, where the thread map call is blocking, while the processing map call is asynchronous (note the get at the end of the last line).
Get pathos here: https://github.com/uqfoundation
or with:
$ pip install git+https://github.com/uqfoundation/pathos.git#master
Nested parallelism can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
Assume you want to parallelize the following nested program
def inner_calculation(asset, trader):
    return trader

def outer_calculation(asset):
    return asset, [inner_calculation(asset, trader) for trader in range(5)]

inner_results = []
outer_results = []
for asset in range(10):
    outer_result, inner_result = outer_calculation(asset)
    outer_results.append(outer_result)
    inner_results.append(inner_result)
# Then you can filter inner_results to get the final output.
Below is the Ray code parallelizing the above code:
Use the @ray.remote decorator for each function that we want to execute concurrently in its own process. A remote function returns a future (i.e., an identifier to the result) rather than the result itself.
When invoking a remote function f(), use the remote modifier, i.e., f.remote().
Use the ids_to_vals() helper function to convert a nested list of ids to values.
Note the program structure is identical. You only need to add remote and then convert the futures (ids) returned by the remote functions to values using the ids_to_vals() helper function.
import ray

ray.init()

# Define inner calculation as a remote function.
@ray.remote
def inner_calculation(asset, trader):
    return trader

# Define outer calculation to be executed as a remote function.
# (Newer Ray releases name this option num_returns.)
@ray.remote(num_return_vals=2)
def outer_calculation(asset):
    return asset, [inner_calculation.remote(asset, trader) for trader in range(5)]

# Helper to convert a nested list of object ids to a nested list of corresponding objects.
# (Newer Ray releases call the id type ray.ObjectRef.)
def ids_to_vals(ids):
    if isinstance(ids, ray.ObjectID):
        ids = ray.get(ids)
    if isinstance(ids, ray.ObjectID):
        return ids_to_vals(ids)
    if isinstance(ids, list):
        results = []
        for id in ids:
            results.append(ids_to_vals(id))
        return results
    return ids

outer_result_ids = []
inner_result_ids = []
for asset in range(10):
    outer_result_id, inner_result_id = outer_calculation.remote(asset)
    outer_result_ids.append(outer_result_id)
    inner_result_ids.append(inner_result_id)

outer_results = ids_to_vals(outer_result_ids)
inner_results = ids_to_vals(inner_result_ids)
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Probably threading, from the standard Python library, is the most convenient approach:
import threading

def worker(id):
    # Do your calculations here
    return

threads = []
for asset in range(len(Assets)):
    df = df[df['Assets'] == asset]
    for trader in range(len(df['TraderID'])):
        t = threading.Thread(target=worker, args=(trader,))
        threads.append(t)
        t.start()
    # add semaphore here if you need synchronize results for all traders.
Instead of using for, use map:
smartTrader = []
m = map(calculations_as_a_function,
        [df[df['Assets'] == asset]
         for asset in range(len(Assets))])
smartTrader.extend(m)  # collect the result of every call
From then on, you can try different parallel map implementations, such as multiprocessing's or stackless'.
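To illustrate why the map form makes this swap easy, here is a small sketch assuming Python 3. It uses plain lists of (trader, profit) pairs in place of the dataframe, and smart_traders_for_asset() with its "profit > 0" rule is a made-up placeholder for the real calculations. The built-in map and ThreadPoolExecutor.map are interchangeable here because both return results in input order.

```python
from concurrent.futures import ThreadPoolExecutor


def smart_traders_for_asset(rows):
    # Hypothetical rule: a trader counts as "smart" if the profit is positive.
    return [trader for trader, profit in rows if profit > 0]


def find_smart_traders(rows_by_asset, mapper=map):
    smart = []
    # mapper can be the built-in map or any drop-in parallel map.
    for found in mapper(smart_traders_for_asset, rows_by_asset):
        smart.extend(found)
    return smart


# One list of (trader, profit) pairs per asset; made-up data.
rows_by_asset = [[("t1", 5), ("t2", -1)], [("t3", 2)]]

serial = find_smart_traders(rows_by_asset)
with ThreadPoolExecutor(max_workers=2) as executor:
    threaded = find_smart_traders(rows_by_asset, mapper=executor.map)

print(serial, threaded)
```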
I have a python process (2.7) that takes a key, does a bunch of calculations and returns a list of results. Here is a very simplified version.
I am using multiprocessing to create threads so this can be processed faster. However, my production data has several million rows and each loop takes progressively longer to complete. The last time I ran this each loop took over 6 minutes to complete while at the start it takes a second or less. I think this is because all the threads are adding results into resultset and that continues to grow until it contains all the records.
Is it possible to use multiprocessing to stream the results of each thread (a list) into a csv or batch resultset so it writes to the csv after a set number of rows?
Any other suggestions for speeding up or optimizing the approach would be appreciated.
import numpy as np
import pandas as pd
import csv
import os
import multiprocessing
from multiprocessing import Pool

global keys
keys = [1,2,3,4,5,6,7,8,9,10,11,12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)
        resultset = pool.imap(key_loop,(key for key in keys) )
        loaddata = []
        for sublist in resultset:
            loaddata.append(sublist)
        with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
            writer = csv.writer(file)
            for listitem in loaddata:
                writer.writerow(listitem)
        file.close
        print "finished load"
    except:
        print 'There was a problem multithreading the key Pool'
        raise
Here is an answer consolidating the suggestions Eevee and I made
import numpy as np
import pandas as pd
import csv
from multiprocessing import Pool

keys = [1,2,3,4,5,6,7,8,9,10,11,12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)
        resultset = pool.imap(key_loop, keys, chunksize=200)
        with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
            writer = csv.writer(file)
            for listitem in resultset:
                writer.writerow(listitem)
        print "finished load"
    except:
        print 'There was a problem multithreading the key Pool'
        raise
Again, the changes here are
Iterate over resultset directly, rather than needlessly copying it to a list first.
Feed the keys list directly to pool.imap instead of creating a generator expression out of it.
Provide a larger chunksize to imap than the default of 1. The larger chunksize reduces the cost of the inter-process communication required to pass the values inside keys to the sub-processes in your pool, which can give big performance boosts when keys is very large (as it is in your case). You should experiment with different values for chunksize (try something considerably larger than 200, like 5000, etc.) and see how it affects performance. I'm making a wild guess with 200, though it should definitely do better than 1.
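The same three changes, sketched for Python 3 without the pandas dependency. key_loop() here is a made-up stand-in that returns one row per key, stream_rows() is a hypothetical helper, and the chunksize of 3 only suits this tiny key range. Rows are written as soon as imap yields them, so the full result set is never held in memory.

```python
import csv
import sys
from multiprocessing import Pool


def key_loop(key):
    # Hypothetical stand-in for the real calculation: one row per key.
    return [key, key * key]


def stream_rows(rows, out_file):
    """Write each row as soon as it is produced; return the row count."""
    writer = csv.writer(out_file)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    keys = range(1, 13)
    with Pool(processes=4) as pool:
        # imap yields results in key order; chunksize batches the keys
        # sent to each worker to cut inter-process overhead.
        stream_rows(pool.imap(key_loop, keys, chunksize=3), sys.stdout)
```

Passing an open file instead of sys.stdout writes the rows to disk in the same streaming fashion.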
The following very simple code collects many workers' data into a single CSV file. A worker takes a key and returns a list of rows. The parent processes several keys at a time, using several workers. When each key is done, the parent writes output rows, in order, to a CSV file.
Be careful about order. If each worker writes to the CSV file directly, the rows will be out of order or the workers will stomp on each other. Having each worker write to its own CSV file will be fast, but will require merging all the data files together afterward.
source
import csv, multiprocessing, sys

def worker(key):
    return [ [key, 0], [key+1, 1] ]

pool = multiprocessing.Pool()  # default 1 proc per CPU
writer = csv.writer(sys.stdout)
for resultset in pool.imap(worker, [1,2,3,4]):
    for row in resultset:
        writer.writerow(row)
output
1,0
2,1
2,0
3,1
3,0
4,1
4,0
5,1
My bet would be that building up the whole large structure by appending is what makes it slow. What I usually do is open as many files as there are cores and use modulo to pick the file each result is written to immediately. That way the streams don't collide as they would if you directed them all into the same file (write errors), and you never have to hold the huge data set in memory. Probably not the best solution, but really quite easy. In the end you just merge the results back together.
Define at start of the run:
num_cores = 8
file_sep = ","
outFiles = [open('out' + str(x) + ".csv", "a") for x in range(num_cores)]
Then in the key_loop function:
def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    outFiles[key % num_cores].write(file_sep.join([str(x) for x in test_list])
                                    + "\n")
Afterwards, don't forget to close: [x.close() for x in outFiles]
Improvements:
Iterate over blocks like mentioned in the comments. Writing/processing 1 line at a time is going to be much slower than writing blocks.
Handling errors (closing of files)
IMPORTANT: I'm not sure of the meaning of the "keys" variable, but the numbers there will not allow modulo to ensure you have each process write to each individual stream (12 keys, modulo 8 will make 2 processes write to the same file)
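The arithmetic behind that warning can be checked with a few lines. This sketch only counts how many of the 12 sample keys land on each of the 8 file indices; keys_per_file() is a hypothetical helper, and the count says nothing about which worker process ends up handling which key.

```python
from collections import Counter


def keys_per_file(keys, num_files):
    # Count how many keys map to each file index under key % num_files.
    return Counter(key % num_files for key in keys)


counts = keys_per_file(range(1, 13), 8)
# File indices that receive more than one key, and so may be
# written to by more than one worker.
shared = sorted(index for index, n in counts.items() if n > 1)
print(shared)
```

With the sample keys 1 to 12, indices 1 through 4 each receive two keys, while the remaining four indices receive one.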