I have to run a process on about 2 million IDs, for which I am trying to use multiprocessing.
My sample data, stored in a dataframe df, looks like this (showing just 3 rows):
c_id
0 ID1
1 ID2
2 ID3
My parallelize code is as follows:
from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd

def parallelize(data, func, parts=cpu_count()):
    if data.shape[0] < parts:
        parts = data.shape[0]
    data_split = np.array_split(data, parts)
    pool = Pool(parts)
    parallel_out = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return parallel_out
A sample process that I want to run on all the IDs is to append my first name to each ID.
There are two pieces of code that I tested.
First: using a for-loop and then calling the parallelize function, as follows:
def pooltype1(df_id):
    dfi = []
    for item in df_id['c_id']:
        dfi.append({'string': str(item) + '_ravi'})
    dfi = pd.DataFrame(dfi)
    return dfi
p = parallelize(df,pooltype1,parts=cpu_count())
The output is as expected and the index of each row is 0, confirming that each ID went to a different CPU (cpu_count() for my system is > 3):
string
0 ID1_ravi
0 ID2_ravi
0 ID3_ravi
and the runtime is 0.12 seconds.
However, to further speed it up on my actual (2 million row) data, I tried to replace the for-loop in the pooltype1 function with an apply call and then called the parallelize function as below:
# New function
def add_string(x):
    return x + '_ravi'

def pooltype2(df_id):
    dfi = df_id.apply(add_string)
    return dfi
p = parallelize(df,pooltype2,parts=cpu_count())
Now the index of the output was not all zero:
string
0 ID1_ravi
1 ID2_ravi
2 ID3_ravi
and to my surprise the runtime jumped to 5.5 seconds. It looks as if apply was executed on the whole original dataframe rather than on each CPU's chunk.
So, when using pool.map, do I have to use a for-loop (as in the pooltype1 function), or is there a way apply can be run within each CPU (hoping that it will further reduce the runtime)? If apply can be done at the CPU level, please help me with the code.
Thank you.
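For reference, a minimal sketch of what running the apply on the c_id column inside each worker could look like, reusing the parallelize helper and add_string from above (pooltype3 is a hypothetical name; this is an illustration, not tested code):

# Hypothetical variant: apply add_string element-wise to the chunk's 'c_id'
# column, so the work still happens inside each worker process.
def pooltype3(df_id):
    return pd.DataFrame({'string': df_id['c_id'].astype(str).apply(add_string)})

p = parallelize(df, pooltype3, parts=cpu_count())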
Related
The code was initially in R, but as R does not handle large datasets well, I converted it to Python and ported it to Google Colab. Even on Google Colab it took very long, and I never actually saw it finish running, even after 8 hours. I also added more break statements to avoid unnecessary runs.
The dataset has around 50,000 unique time stamps and 40,000 unique IDs. It is in the format ['time', 'id', 'x-coordinate', 'y-coordinate'], a clear-cut passenger trajectory dataset.
What the code is trying to do is extract all pairs of IDs which are 2 meters or less apart from each other in the same time frame.
Please let me know if there are ways to optimize this.
Here's a short overview of the data. [my_data.head(10)][1]
i = 0
y = pd.DataFrame(columns=['source', 'dest'])  # empty contact network df
infectedGrp = [824, 11648, 23468]

while (i < my_data.shape[0]):
    row1 = my_data.iloc[i]
    id1 = row1[1]
    time1 = row1[0]
    x1 = row1[2]
    y1 = row1[3]
    infected1 = my_data.iloc[i, 4]
    infectious1 = my_data.iloc[i, 5]
    #print(row1)
    #print(time1)
    for j in range(i + 1, my_data.shape[0]):
        row2 = my_data.iloc[j]
        id2 = row2[1]
        time2 = row2[0]
        x2 = row2[2]
        y2 = row2[3]
        infected2 = my_data.iloc[j, 4]
        infectious2 = my_data.iloc[j, 5]
        print(time2)
        if (time2 != time1):
            i = i + 1
            print("diff time...breaking")
            break
        if (x2 > x1 + 2) or (x1 > x2 + 2):
            i = i + 1
            print("x more than 2...breaking")
            break
        if (y2 > y1 + 2) or (y1 > y2 + 2):
            i = i + 1
            print("y more than 2...breaking")
            break
        probability = 0
        distance = round(math.sqrt(pow((x1 - x2), 2) + pow((y1 - y2), 2)), 2)
        print(distance)
        print(infected1)
        print(infected2)
        if (distance <= R):
            if infectious1 and not infected2:  # if one person is infectious and the other is not infected
                probability = (1 - beta) * (1 / R) * (math.sqrt(R**2 - distance**2))
                print(probability)
                print("here")
                infected2 = decision(probability)
                numid2 = int(id2)  # update all entries for id2
                if (infected2):
                    my_data.loc[my_data['id'] == numid2, 'infected'] = True
                #my_data.iloc[j,7]=probability
            elif infectious2 and not infected1:
                infected1 = decision(probability)
                numid1 = int(id1)  # update all entries for id1
                if (infected1):
                    my_data.loc[my_data['id'] == numid1, 'infected'] = True
                #my_data.iloc[i,7]=probability
            inf1 = 'F'
            inf2 = 'F'
            if (infected1):
                inf1 = 'T'
            if (infected2):
                inf2 = 'T'
            print('prob ' + str(probability) + ' at time ' + str(time1))
            new_row = {'source': id1.astype(str) + ' ' + inf1, 'dest': id2.astype(str) + ' ' + inf2}
            y = y.append(new_row, ignore_index=True)
    i = i + 1
[1]: https://i.stack.imgur.com/YVdmB.png
It's hard to tell for sure, but a good guess is that this line is your biggest "sin":
y = y.append(new_row, ignore_index=True)
You should not append rows to a dataframe in a loop.
You should accumulate them in a Python list and then create the DataFrame from all of them after the loop:
y = []
while (i < my_data.shape[0]):
    (...)
    y.append(new_row)
y = pd.DataFrame(y)
I also suggest using a line profiler (e.g. the line_profiler package) to analyse which parts of the code are the bottlenecks.
You are using a nested loop to find rows with equal time values. You can get a huge improvement by doing a groupby operation instead and then iterating through the groups.
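A minimal sketch of that idea, assuming pandas is imported as pd, my_data is the dataframe from the question with columns 'time', 'id', 'x-coordinate' and 'y-coordinate', and a plain 2-meter threshold; it only collects the contact pairs and leaves out the infection updates:

import math
from itertools import combinations

pairs = []
for time_val, grp in my_data.groupby('time'):
    # only rows sharing the same timestamp can form a contact pair
    rows = list(grp[['id', 'x-coordinate', 'y-coordinate']].itertuples(index=False))
    for (id1, x1, y1), (id2, x2, y2) in combinations(rows, 2):
        if math.hypot(x1 - x2, y1 - y2) <= 2:
            pairs.append({'source': id1, 'dest': id2})

contacts = pd.DataFrame(pairs)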
I have a data table which has two columns time in and time out as shown below.
TimeIn TimeOut
01:23AM 01:45AM
01:34AM 01:53AM
01:43AM 01:59AM
02:01AM 02:09AM
02:34AM 03:11AM
02:39AM 02:48AM
02:56AM 03:12AM
I need to create a third column named 'Counter'. When the TimeIn of the i-th row is later than the TimeOut of an earlier row, that earlier row's counter can be reused; otherwise the counter increases by 1. Think of the counters as people assigned to tasks: once a person's TimeOut has passed, he or she is free to take up the next job. Also, if more than one counter is free at a particular instant, I need to take the one that became free first, so the above table would look like this:
TimeIn TimeOut Counter
01:23AM 01:45AM 1
01:34AM 01:53AM 2
01:43AM 01:59AM 3
02:01AM 02:09AM 1 (in this case 1,2,3 all are also free but 1 became free first)
02:34AM 03:11AM 2 (in this case 1,2,3 all are also free but 2 became free first)
02:39AM 02:48AM 3 (in this case 1 is also free but 3 became free first)
02:56AM 03:12AM 1 (in this case 3 is also free but 1 became free first)
I was hoping there is a way to do this in pandas without a loop, since my dataset could be large, but an efficient loop-based solution would also be fine.
Many thanks in advance.
I couldn't figure out an efficient way with native pandas methods, but if I'm not completely mistaken, a heap queue seems to be an adequate tool for the problem.
With
df =
TimeIn TimeOut
0 01:23AM 01:45AM
1 01:34AM 01:53AM
2 01:43AM 01:59AM
3 02:01AM 02:09AM
4 02:34AM 03:11AM
5 02:39AM 02:48AM
6 02:56AM 03:12AM
and
for col in ("TimeIn", "TimeOut"):
df[col] = pd.to_datetime(df[col])
this
from heapq import heappush, heappop

w_count = 1
counter = [1]
heap = []
w_time_out, w = df.TimeOut[0], 1
for time_in, time_out in zip(df.TimeIn.tolist()[1:], df.TimeOut.tolist()[1:]):
    if time_in > w_time_out:
        heappush(heap, (time_out, w))
        counter.append(w)
        w_time_out, w = heappop(heap)
    else:
        w_count += 1
        counter.append(w_count)
        if time_out > w_time_out:
            heappush(heap, (time_out, w_count))
        else:
            heappush(heap, (w_time_out, w))
            w_time_out, w = time_out, w_count
produces the counter-list
[1, 2, 3, 1, 2, 3, 1]
Regarding your input data: You don't have complete timestamps, so pd.to_datetime uses the current day as date part. So if the range of your times isn't contained in one day you'll run into trouble.
EDIT: Fixed a mistake in the last else-branch.
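If that becomes an issue, a sketch of a workaround (assuming the real data has some date column; "Date" here is a hypothetical name) is to build full timestamps instead of the bare conversion above, and the counter list can then be attached as the requested column:

# combine a real date with the clock time so rows from different days order correctly
# ("Date" is a placeholder column name -- use whatever the actual data provides)
for col in ("TimeIn", "TimeOut"):
    df[col] = pd.to_datetime(df["Date"].astype(str) + " " + df[col])

# after running the heap-queue loop above
df["Counter"] = counter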
For the sake of completeness, I'm including a pandas/numpy based solution. Performance is roughly 3x better (I saw 12 s vs 34 s for 10 million records) than the heapq-based one, but the implementation is significantly harder to follow. Unless you really need the performance, I'd recommend @Timus' solution.
The idea here is:
1. We identify sessions where we have to increment the counter. We can immediately assign counter values to these sessions.
2. For the remaining sessions, we create a sequence of sessions that the same worker handles. We can then map any session to a "root session" where the worker was created.
To accomplish step (2):
1. We get two lists of session IDs, one sorted by start time and the other by end time.
2. Pair each session start with the least recent session end. This corresponds to the earliest available worker taking on the next incoming request.
3. Work up the tree to map any given session to the first session handled by that worker.
# setup
from io import StringIO

import numpy as np
import pandas as pd

text = StringIO(
    """
TimeIn TimeOut
01:23AM 01:45AM
01:34AM 01:53AM
01:43AM 01:59AM
02:01AM 02:09AM
02:34AM 03:11AM
02:39AM 02:48AM
02:56AM 03:12AM
""".strip()
)
sessions = pd.read_csv(text, sep=" ", parse_dates=["TimeIn", "TimeOut"])
# transform the data from wide format to long format
# event_log has the following columns:
# - Session: corresponding to the index of the input data
# - EventType: either TimeIn or TimeOut
# - EventTime: the event's time value
event_log = pd.melt(
    sessions.rename_axis(index="Session").reset_index(),
    id_vars=["Session"],
    value_vars=["TimeIn", "TimeOut"],
    var_name="EventType",
    value_name="EventTime",
)
# sort the entire log by time
event_log.sort_values("EventTime", inplace=True, kind="mergesort")
# concurrency is the number of active workers at the time of that log entry
concurrency = event_log["EventType"].replace({"TimeIn": 1, "TimeOut": -1}).cumsum()
# new workers occur when the running maximum concurrency increases
new_worker = concurrency.cummax().diff().astype(bool)
new_worker_sessions = event_log.loc[new_worker, "Session"]
root_session = np.empty_like(sessions.index)
root_session[new_worker_sessions] = new_worker_sessions
# we could use the `sessions` DataFrame to avoid searching, but we'd need to sort on TimeOut
new_session = event_log.query("~@new_worker & (EventType == 'TimeIn')")["Session"]
old_session = event_log.query("~@new_worker & (EventType == 'TimeOut')")["Session"]
# Pair each session start with the session that ended least recently
root_session[new_session] = old_session[: new_session.shape[0]]
# Find the root session
# maybe something can be optimized here?
while not np.array_equal((_root_session := root_session.take(root_session)), root_session):
    root_session = _root_session
counter = np.empty_like(root_session)
counter[new_worker_sessions] = np.arange(start=1, stop=new_worker_sessions.shape[0] + 1)
sessions["Counter"] = counter.take(root_session)
Quick bit of code to generate more fake data:
N = 10 ** 6
start = pd.Timestamp("2021-08-12T01:23:00")
_base = pd.date_range(start=start, periods=N, freq=pd.Timedelta(1, "seconds"))
time_in = (
    _base.values
    + np.random.exponential(1000, size=N).astype("timedelta64[ms]")
    + np.random.exponential(10000, size=N).astype("timedelta64[ns]")
    + np.timedelta64(1, "ms")
)
time_out = (
    time_in
    + np.random.exponential(10, size=N).astype("timedelta64[s]")
    + np.random.exponential(1000, size=N).astype("timedelta64[ms]")
    + np.random.exponential(10000, size=N).astype("timedelta64[ns]")
    + np.timedelta64(1, "s")
)
sessions = (
    pd.DataFrame({"TimeIn": time_in, "TimeOut": time_out})
    .sort_values("TimeIn")
    .reset_index(drop=True)
)
I am running set of numerical simulations. I need to run some sensitivity analyses on the results, i.e. calculate and show how much certain outputs change, as certain inputs vary within given ranges. Basically I need to create a table like this, where each row is the result of one model run:
+-------------+-------------+-------------+-------------+
| Input 1 | Input 2 | Output 1 | Output 2 |
+-------------+-------------+-------------+-------------+
| 0.708788979 | 0.614576315 | 0.366315092 | 0.476088865 |
| 0.793662551 | 0.938622754 | 0.898870204 | 0.014915374 |
| 0.366560694 | 0.244354275 | 0.740988568 | 0.197036087 |
+-------------+-------------+-------------+-------------+
Each individual model run is tricky to parallelise, but it shouldn't be too hard to get each CPU to run a different model with different inputs.
I have put something together with the multiprocessing library, but it is much slower than I would have hoped. Do you have any suggestions on what I am doing wrong / how I can speed it up? I am open to using a library other than multiprocessing.
Does it have to do with load balancing?
I must confess I am new to multiprocessing in Python and am not too clear on the differences among map, apply, and apply_async.
I have made a toy example to show what I mean: I create random samples from a lognormal distribution, and calculate how much the mean of my sample changes as the mean and sigma of the distribution change. This is just a banal example because what matters here is not the model itself, but running multiple models in parallel.
In my example, the times (in seconds) are:
+-----------------+-----------------+---------------------+
| Million records | Time (parallel) | Time (not parallel) |
+-----------------+-----------------+---------------------+
| 5 | 24.4 | 18 |
| 10 | 26.5 | 35.8 |
| 20 | 32.2 | 71 |
+-----------------+-----------------+---------------------+
Parallelising only starts to bring benefits somewhere between a sample size of 5 and 10 million. Is this to be expected?
P.S. I am aware of the SALib library for sensitivity analyses, but, as far as I can see, it doesn't do what I'm after.
My code:
import numpy as np
import pandas as pd
import time
import multiprocessing
from multiprocessing import Pool

# I store all the possible inputs in a dataframe
tmp = {}
i = 0
for mysigma in np.linspace(0, 1, 10):
    for mymean in np.linspace(0, 1, 10):
        i += 1
        tmp[i] = pd.DataFrame({'mean': [mymean],
                               'sigma': [mysigma]})
par_inputs = pd.concat([tmp[x] for x in tmp], axis=0, ignore_index=True)

def not_parallel(df):
    for row in df.itertuples(index=True):
        myindex = row[0]
        mymean = row[1]
        mysigma = row[2]
        dist = np.random.lognormal(mymean, mysigma, size=n)
        empmean = dist.mean()
        df.loc[myindex, 'empirical mean'] = empmean
    df.to_csv('results not parallel.csv')

# splits the dataframe and sets up the parallelisation
def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    conc_df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    conc_df.to_csv('results parallelized.csv')
    return conc_df

# the actual function being parallelised
def parallel_sensitivities(data):
    for row in data.itertuples(index=True):
        myindex = row[0]
        mymean = row[1]
        mysigma = row[2]
        dist = np.random.lognormal(mymean, mysigma, size=n)
        empmean = dist.mean()
        print(empmean)
        data.loc[myindex, 'empirical mean'] = empmean
    return data

num_cores = multiprocessing.cpu_count()
num_partitions = num_cores
n = int(5e6)

if __name__ == '__main__':
    start = time.time()
    not_parallel(par_inputs)
    time_np = time.time() - start

    start = time.time()
    parallelize_dataframe(par_inputs, parallel_sensitivities)
    time_p = time.time() - start
The time difference comes from the overhead of starting the multiple processes: starting each process takes a noticeable fraction of a second. The actual processing is faster than the non-parallel version, but part of the multiprocessing speed-up is offset by the time it takes to start each process.
In this case your example functions are relatively fast, taking only a few seconds, so you don't see the time gain immediately on a small number of records. For more intensive operations on each record you would see much more significant gains from parallelizing.
Keep in mind that parallelization is both costly, and time-consuming due to the overhead of the subprocesses that is needed by your operating system. Compared to running two or more tasks in a linear way, doing this in parallel you may save between 25 and 30 percent of time per subprocess, depending on your use-case. For example, two tasks that consume 5 seconds each need 10 seconds in total if executed in series, and may need about 8 seconds on average on a multi-core machine when parallelized. 3 of those 8 seconds may be lost to overhead, limiting your speed improvements.
From this article.
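A rough way to see that fixed cost on your own machine (a sketch; the numbers will vary) is to time a pool that does essentially nothing:

import time
from multiprocessing import Pool

def noop(x):
    return x

if __name__ == "__main__":
    start = time.time()
    with Pool() as pool:  # creating the worker processes dominates the runtime here
        pool.map(noop, range(100))
    print("pool start-up + teardown took %.2f s" % (time.time() - start))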
Edited:
When using a Pool(), you have a few options to assign tasks to the pool.
multiprocessing.Pool.apply_async() (docs) is used to assign a single task and to avoid blocking while waiting for that task to complete.
multiprocessing.Pool.map_async() (docs) will chunk an iterable by chunksize and add each chunk to the pool to be completed.
In your case it will depend on the real scenario you are using, but they aren't interchangeable based on time; rather, it depends on what function you need to run. I'm not going to say for sure which one you need, since you used a fake example. I'm guessing you could use apply_async if each call needs to run on its own and the function is self-contained; if the function can run in parallel over an iterable, you would want map_async.
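A minimal sketch of the two patterns (illustrative only; run_model is a placeholder for whatever per-row work you actually do):

from multiprocessing import Pool

def run_model(params):  # placeholder for the real computation
    mean, sigma = params
    return mean + sigma

if __name__ == "__main__":
    inputs = [(0.1, 0.5), (0.2, 0.5), (0.3, 0.5)]
    with Pool() as pool:
        # apply_async: submit one task per call and collect the AsyncResult objects
        async_results = [pool.apply_async(run_model, (p,)) for p in inputs]
        out1 = [r.get() for r in async_results]

        # map_async: hand over the whole iterable and let the pool chunk it
        out2 = pool.map_async(run_model, inputs, chunksize=1).get()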
I have a few hundred thousand groups over which I want to run this particular lag operation. Below is a sample where Buy_Ord_No is the group-by variable:
I would like to generate Lag_Exec_Qty and Exec_Qty. What I am basically doing here is initially setting Exec_Qty to 0 when Buy_Act_Type is 1 or 4. Then I take the lagged value of Exec_Qty as Lag_Exec_Qty. In the same row, I sum Trd_Qty and Lag_Exec_Qty to get the updated Exec_Qty.
This is the code that I currently have:
for b in buy:
    temp = buy_sorted_file[buy_sorted_file["Buy_Ord_No"] == b]
    temp = temp.sort_values(["Buy_Ord_No", "Buy_Ord_Txn_Time"],
                            ascending=[True, True]).reset_index(drop=True)
    for index in range(len(temp.index)):
        if (int(temp["Buy_Act_Type"].iloc[index]) == 1 or int(temp["Buy_Act_Type"].iloc[index]) == 4):
            temp["Exec_Qty"].iloc[index] = 0
            temp["Lag_Exec_Qty"].iloc[index] = 0
        else:
            temp["Lag_Exec_Qty"].iloc[index] = temp["Exec_Qty"].iloc[index - 1]
            temp["Exec_Qty"].iloc[index] = temp["Trd_Qty"].iloc[index] + temp["Lag_Exec_Qty"].iloc[index]
    if (len(buy_sorted_exec_file.index) == 0):
        buy_sorted_exec_file = temp.copy()
    else:
        buy_sorted_exec_file = pd.concat([temp, buy_sorted_exec_file]).reset_index(drop=True)

buy_sorted_file = buy_sorted_exec_file.sort_values(["Buy_Ord_Txn_Time", "Buy_Ord_Limit_Pr"],
                                                   ascending=[True, True]).reset_index(drop=True)
The code takes a really long time to run. Is there any way I can speed this process up?
You should be able to do this without any loops:
temp['Lag_Exec_Qty'] = temp['Exec_Qty'].shift(1)
temp['Exec_Qty'] = temp['Trd_Qty'] + temp['Lag_Exec_Qty']
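Building on that, a hedged sketch of the whole operation without the outer loop, assuming the column names from the question; the reset on Buy_Act_Type 1/4 is reproduced by accumulating Trd_Qty within each "run" that starts at a reset row:

df = buy_sorted_file.sort_values(["Buy_Ord_No", "Buy_Ord_Txn_Time"]).reset_index(drop=True)

reset = df["Buy_Act_Type"].isin([1, 4])                        # rows where Exec_Qty restarts at 0
run_id = reset.astype(int).groupby(df["Buy_Ord_No"]).cumsum()  # label each run following a reset

# Exec_Qty is the running sum of Trd_Qty within an order since the last reset,
# with the reset rows themselves contributing 0.
qty = df["Trd_Qty"].where(~reset, 0)
df["Exec_Qty"] = qty.groupby([df["Buy_Ord_No"], run_id]).cumsum()

# Lag_Exec_Qty is the previous Exec_Qty within the same order, forced to 0 on reset rows.
df["Lag_Exec_Qty"] = df.groupby("Buy_Ord_No")["Exec_Qty"].shift(1).fillna(0).where(~reset, 0)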
I have a Python program to process big data on one computer (16 CPU cores). Because the data keeps getting bigger, I need it to run on 5 computers. I am new to Spark and still feel confused after reading some docs. I would appreciate it if anyone could tell me the best way to set up a small cluster.
Here are some details:
The program counts the trade volume at every price for each stock (one day at a time) from a pandas dataframe of tick transactions.
There are more than 3000 stocks and 1 billion transactions in one day. The size of one data file (dataframe) is between 1 and 2 GB.
Getting the results for 300 days currently takes about 3 days on one computer; I hope adding 4 more computers will shorten that time.
Here is the sample code in Python:
import os
import multiprocessing as mp

import pandas as pd
import sharedmem

def ticks_to_priceline(day=None):
    # file name for the tick dataframe file, one day for a file
    fn = get_tick_dataframe_filename_byday(day)
    with pd.HDFStore(fn, 'r') as tick_store:
        tick_dataframe = tick_store.select("tick")

    all_stock_symbols = tick_dataframe.symbol.drop_duplicates()
    sblist = []
    # cut into small chunks
    chunk = 300
    for xx in range(len(all_stock_symbols) // chunk + 1):
        sblist.append(all_stock_symbols[xx * chunk:(xx + 1) * chunk])

    # run with all cpus
    with sharedmem.MapReduce(np=mp.cpu_count()) as pool:
        def work(chunk_list):
            result = {}
            for symbol in chunk_list:
                data = tick_dataframe[tick_dataframe.symbol == symbol]
                if not data.empty and len(data) > 99:
                    df1 = data.loc[:, [u'timestamp', u'price', u'volume']]
                    df1['vol_diff'] = df1.volume.diff().fillna(0)
                    df2 = df1.loc[:, ['price', 'vol_diff']]
                    df2.price = df2.price.apply(int)
                    rs = df2.groupby('price').sum()
                    rs = rs.sort_index(ascending=0).reset_index()
                    result[symbol] = rs
            return result

        rslist = pool.map(work, sblist)
    return rslist
I have already set up a Spark cluster in standalone mode for testing. My main problem is how to rewrite the code above for it.
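For what it's worth, a hedged sketch of how the per-symbol aggregation might look with the PySpark DataFrame API (an illustration, not a tested rewrite; the parquet path is a placeholder and assumes the HDF5 tick files have been exported to a format Spark can read):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("ticks-to-priceline").getOrCreate()

# placeholder path; assumes columns symbol, timestamp, price, volume
ticks = spark.read.parquet("hdfs:///ticks/day=20180101")

w = Window.partitionBy("symbol").orderBy("timestamp")
priceline = (
    ticks
    .withColumn("vol_diff", F.coalesce(F.col("volume") - F.lag("volume").over(w), F.lit(0)))
    .withColumn("price_int", F.col("price").cast("int"))
    .groupBy("symbol", "price_int")
    .agg(F.sum("vol_diff").alias("vol_diff"))
    .orderBy("symbol", F.desc("price_int"))
)
priceline.show()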