I have a large DataFrame of market data from the online game EVE.
I'm trying to determine the most profitable trades by comparing the prices of buy and sell orders for an item.
I've found that it takes quite a while to loop through all the possibilities, and I would like some advice on how to make my code more efficient.
data = https://market.fuzzwork.co.uk/orderbooks/latest.csv.gz
SETUP:
import pandas as pd
df = pd.read_csv('latest.csv', sep='\t', names=["orderID","typeID","issued","buy","volume","volumeEntered","minVolume","price","stationID","range","duration","region","orderSet"])
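(A side note on the setup, in case it helps: pandas can read the gzipped download directly, since compression is inferred from the .gz extension, so decompressing to latest.csv first is optional. A minimal sketch:)

import pandas as pd

cols = ["orderID", "typeID", "issued", "buy", "volume", "volumeEntered", "minVolume",
        "price", "stationID", "range", "duration", "region", "orderSet"]
df = pd.read_csv('latest.csv.gz', sep='\t', names=cols)  # gzip handled automatically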
Iterate through all the possibilities
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
profitable_trade = []
for i in buy_order.index:
    for j in sell_order.index:
        if buy_order.loc[i, 'price'] > sell_order.loc[j, 'price']:
            profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
This takes quite a long time (33s on a Ryzen 2600X, 12s on an M1 Pro).
Shorten the iteration
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade2 = []
for i in buy_order.index:
    if buy_order.loc[i, 'price'] > sell_order.price.min():
        for j in sell_order.index:
            if buy_order.loc[i, 'price'] > sell_order.loc[j, 'price']:
                profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
            else:
                break
    else:
        break
This shaves about 25-30% off the time (23s on the 2600X, 9s on the M1 Pro).
Times were recorded in a Jupyter notebook.
Any tips are welcome!
Option 1 - Iterate through all the possibilities (yours):
import time

start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
profitable_trade = []
for i in buy_order.index:
    for j in sell_order.index:
        if buy_order.loc[i, 'price'] > sell_order.loc[j, 'price']:
            profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 33.145344734191895 seconds
Option 2 - Shorten the iteration (yours):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade2 = []
for i in buy_order.index:
    if buy_order.loc[i, 'price'] > sell_order.price.min():
        for j in sell_order.index:
            if buy_order.loc[i, 'price'] > sell_order.loc[j, 'price']:
                profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
            else:
                break
    else:
        break
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 26.736826419830322 seconds
Option 3 - Pandas Optimizations:
You can get some speedup by applying the following optimizations:
- iterate over the dataframe rows directly (iterrows instead of index + loc)
- a single filtering operation for the sell orders
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 19.43745183944702 seconds
Note that almost all of the time is spent on the tolist() operations: each call first selects the labels from the row Series, building a new intermediate Series per pair before converting it to a list. The following variant exists only to show this impact; it does not build the target list:
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        # 'tolist' operations removed; append a constant instead
        profitable_trade.append(1)
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 2.072049617767334 seconds
Option 4 - Replace tolist-operations and store results in a dataframe:
You can accelerate your code by
- storing the filtered values in intermediate lists containing rows of the original dataframe
- converting the intermediate lists to dataframes and concatenating them (the resulting dataframe yields the same information as the list profitable_trade)
- converting the dataframe to the desired list of lists (if needed)
start = time.time()
buy_orders = df[(df.typeID == 34) & (df.buy == True)]
sell_orders = df[(df.typeID == 34) & (df.buy == False)]
# store buy and sell rows in intermediate lists
buys = []
sells = []
for _, buy in buy_orders.iterrows():
    # apply the filtering operation once per buy order
    filtered_sell_orders = sell_orders[sell_orders.price < buy.price]
    sell_rows = list(filtered_sell_orders.iterrows())
    # store buy and sell row items
    buys.extend([buy] * len(sell_rows))
    sells.extend([sell for _, sell in sell_rows])
# convert intermediate lists to dataframes
buys = pd.DataFrame(buys)
sells = pd.DataFrame(sells)
# rename columns of the buys / sells dataframes to get unique column names
buys = buys.rename(columns={column: f"{column}_buy" for column in buys.columns})
sells = sells.rename(columns={column: f"{column}_sell" for column in sells.columns})
# reset indices and concatenate buys / sells along the column axis
buys.reset_index(drop=True, inplace=True)
sells.reset_index(drop=True, inplace=True)
profitable_trade_df = pd.concat([buys, sells], axis=1)
# convert to list of lists (if needed)
profitable_trade = profitable_trade_df[['typeID_buy', 'orderID_buy', 'price_buy', 'volume_buy', 'stationID_buy', 'range_buy', 'orderID_sell', 'price_sell', 'volume_sell', 'stationID_sell', 'range_sell']].values.tolist()
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 3.785726308822632 seconds
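(A quick sanity check for the equivalence claim above, with a hypothetical name since Option 4 reuses profitable_trade: if Option 1's list is kept around as, say, profitable_trade_option1, both variants should contain the same rows.)

# hypothetical: profitable_trade_option1 holds Option 1's result
assert sorted(map(tuple, profitable_trade_option1)) == sorted(map(tuple, profitable_trade))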
Many thanks to @daniel.fehrenbacher for the explanation and suggestions.
In addition to his options, I've found a few more myself using this article:
https://towardsdatascience.com/heres-the-most-efficient-way-to-iterate-through-your-pandas-dataframe-4dad88ac92ee
TL;DR
- Don't use tolist()
- A filter operation isn't always better; it depends on the iteration method
- There are much faster iteration methods than a regular for loop, or even iterrows(): use dictionary iteration
Use of .tolist() is detrimental
As mentioned in the answer above, .tolist() uses too much time. It's much faster to use append([item1, item2, item3, ...]) than append(row[['item1', 'item2', 'item3', ...]].tolist()).
tolist(): 19.2s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
append([item1, item2]): 3.5s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append([
            buy.typeID,
            buy.orderID,
            buy.price,
            buy.volume,
            buy.stationID,
            buy.range,
            sell.orderID,
            sell.price,
            sell.volume,
            sell.stationID,
            sell.range
        ])
Filtering operation vs. break
While the single filtering operation gives a slight efficiency increase with .iterrows(), I've found the opposite with the better .itertuples(): there, the sorted frames with break statements win.
iterrows() with filter operation: 3.26s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, row_buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
    for _, row_sell in filtered_sell_orders.iterrows():
        profitable_trade.append([
            row_buy.typeID,
            row_buy.orderID,
            row_buy.price,
            row_buy.volume,
            row_buy.stationID,
            row_buy.range,
            row_sell.orderID,
            row_sell.price,
            row_sell.volume,
            row_sell.stationID,
            row_sell.range
        ])
iterrows() with break statements: 3.77s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade3 = []
lowest_sell = sell_order.price.min()
for _, row_buy in buy_order.iterrows():
    if row_buy.price > lowest_sell:
        for _, row_sell in sell_order.iterrows():
            if row_buy.price > row_sell.price:
                profitable_trade3.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
            else:
                break
    else:
        break
itertuples with filter operation: 650ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for row_buy in buy_order.itertuples():
    filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
    for row_sell in filtered_sell_orders.itertuples():
        profitable_trade.append([
            row_buy.typeID,
            row_buy.orderID,
            row_buy.price,
            row_buy.volume,
            row_buy.stationID,
            row_buy.range,
            row_sell.orderID,
            row_sell.price,
            row_sell.volume,
            row_sell.stationID,
            row_sell.range
        ])
itertuples with break statement: 375ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade3 = []
lowest_sell = sell_order.price.min()
for row_buy in buy_order.itertuples():
    if row_buy.price > lowest_sell:
        for row_sell in sell_order.itertuples():
            if row_buy.price > row_sell.price:
                profitable_trade3.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
            else:
                break
    else:
        break
Better iteration methods
itertuples (see above): 375ms
Numpy Iteration Method (df.values): 200ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade4 = []
lowest_sell = sell_order.price.min()
# .values yields plain numpy rows, so columns are addressed by position:
# 0=orderID, 1=typeID, 4=volume, 7=price, 8=stationID, 9=range
for row_buy in buy_order.values:
    if row_buy[7] > lowest_sell:
        for row_sell in sell_order.values:
            if row_buy[7] > row_sell[7]:
                profitable_trade4.append([
                    row_buy[1],
                    row_buy[0],
                    row_buy[7],
                    row_buy[4],
                    row_buy[8],
                    row_buy[9],
                    row_sell[0],
                    row_sell[7],
                    row_sell[4],
                    row_sell[8],
                    row_sell[9]
                ])
            else:
                break
    else:
        break
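(Building on the df.values variant, an untimed sketch of my own, not from the article: because sell_order is sorted ascending by price, numpy.searchsorted can find, for each buy price, how many sell orders are strictly cheaper, replacing the inner price comparisons entirely. Variable names here are illustrative.)

import numpy as np

sell_prices = sell_order['price'].to_numpy()  # sorted ascending
sell_values = sell_order.values               # hoisted once instead of per buy order
profitable_trade_ss = []                      # illustrative name
for row_buy in buy_order.values:
    # side='left' gives the count of sell prices strictly below this buy price
    cutoff = np.searchsorted(sell_prices, row_buy[7], side='left')
    if cutoff == 0:
        break  # buy_order is sorted descending, so later (cheaper) buys match nothing
    for row_sell in sell_values[:cutoff]:
        profitable_trade_ss.append([
            row_buy[1], row_buy[0], row_buy[7], row_buy[4], row_buy[8], row_buy[9],
            row_sell[0], row_sell[7], row_sell[4], row_sell[8], row_sell[9]
        ])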
Dictionary Iteration (df.to_dict('records')): 78ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade5 = []
buy_dict = buy_order.to_dict('records')
sell_dict = sell_order.to_dict('records')
lowest_sell = sell_order.price.min()
for row_buy in buy_dict:
    if row_buy['price'] > lowest_sell:
        for row_sell in sell_dict:
            if row_buy['price'] > row_sell['price']:
                profitable_trade5.append([
                    row_buy['typeID'],
                    row_buy['orderID'],
                    row_buy['price'],
                    row_buy['volume'],
                    row_buy['stationID'],
                    row_buy['range'],
                    row_sell['orderID'],
                    row_sell['price'],
                    row_sell['volume'],
                    row_sell['stationID'],
                    row_sell['range']
                ])
            else:
                break
    else:
        break
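Finally, for completeness, a fully vectorized sketch of my own (not from the article, and untimed here): pandas can build all buy/sell pairs with a cross merge (available since pandas 1.2) and filter them in one shot. Beware that the intermediate frame has len(buy_order) * len(sell_order) rows before filtering, so this trades memory for speed.

buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
# every buy paired with every sell; overlapping column names get the suffixes
pairs = buy_order.merge(sell_order, how='cross', suffixes=('_buy', '_sell'))
pairs = pairs[pairs['price_buy'] > pairs['price_sell']]
# same column layout as the lists built above (illustrative variable name)
profitable_trade_merge = pairs[['typeID_buy', 'orderID_buy', 'price_buy', 'volume_buy',
                                'stationID_buy', 'range_buy', 'orderID_sell', 'price_sell',
                                'volume_sell', 'stationID_sell', 'range_sell']].values.tolist()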