I have a large DataFrame built from market data for the online game EVE.
I'm trying to determine the most profitable trades based on the prices of the buy and sell orders for an item.
Looping through all the possibilities takes quite a while, and I'd like some advice on how to make my code more efficient.
data = https://market.fuzzwork.co.uk/orderbooks/latest.csv.gz
SETUP:
import pandas as pd
df = pd.read_csv('latest.csv', sep='\t', names=["orderID","typeID","issued","buy","volume","volumeEntered","minVolume","price","stationID","range","duration","region","orderSet"])
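As a side note, pandas can read the gzipped download directly; compression is inferred from the .gz suffix, so the file does not need to be decompressed first. A minimal runnable sketch (a one-line stand-in file is written so the example is self-contained):

```python
import gzip

import pandas as pd

cols = ["orderID", "typeID", "issued", "buy", "volume", "volumeEntered",
        "minVolume", "price", "stationID", "range", "duration", "region",
        "orderSet"]

# Tiny stand-in for the real fuzzwork download, so the example runs as-is.
sample = "1\t34\t2021-01-01\tTrue\t10\t10\t1\t5.0\t100\tregion\t90\t10000002\t1\n"
with gzip.open('latest.csv.gz', 'wt') as f:
    f.write(sample)

# pandas infers gzip compression from the .gz suffix.
df = pd.read_csv('latest.csv.gz', sep='\t', names=cols)
```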
Iterate through all the possibilities
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
profitable_trade = []
for i in buy_order.index:
    for j in sell_order.index:
        if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
            profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
This takes quite a long time (33s on a Ryzen 2600X, 12s on an M1 Pro).
Shorten the iteration
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade2 = []
for i in buy_order.index:
    if buy_order.loc[i, 'price'] > sell_order.price.min():
        for j in sell_order.index:
            if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
                profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
            else:
                break
    else:
        break
This shaves about 25%-30% off the time (23s on the 2600X, 9s on the M1 Pro).
Times were recorded in a Jupyter notebook.
Any tips are welcome!
Option 1 - Iterate through all the possibilities (yours):
import time

start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
profitable_trade = []
for i in buy_order.index:
    for j in sell_order.index:
        if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
            profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 33.145344734191895 seconds
Option 2 - Shorten the iteration (yours):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade2 = []
for i in buy_order.index:
    if buy_order.loc[i, 'price'] > sell_order.price.min():
        for j in sell_order.index:
            if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
                profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
            else:
                break
    else:
        break
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 26.736826419830322 seconds
Option 3 - Pandas Optimizations:
You can get some speedup by applying the following optimizations:
iterate over dataframe items directly (iterrows instead of index + loc)
single filtering operation for sell-orders
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 19.43745183944702 seconds
Note that almost all of the time is spent on the tolist() operations (the following option only demonstrates this impact; it does not build the target list):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        # removed 'tolist'-operations
        profitable_trade.append(1)
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 2.072049617767334 seconds
Option 4 - Replace tolist-operations and store results in dataframe:
You can accelerate your code by
storing your filtered values in intermediate lists containing rows of the original dataframe
converting the intermediate lists to dataframes and concatenating them
the resulting dataframe yields the same information as the list profitable_trade
convert the dataframe to the desired list of lists (if needed)
start = time.time()
buy_orders = df[(df.typeID == 34) & (df.buy == True)]
sell_orders = df[(df.typeID == 34) & (df.buy == False)]
# store buy and sell rows in intermediate lists
buys = []
sells = []
for _, buy in buy_orders.iterrows():
    # apply the filtering operation once
    filtered_sell_orders = sell_orders[sell_orders.price < buy.price]
    sell_rows = list(filtered_sell_orders.iterrows())
    # store buy and sell row items
    buys.extend([buy] * len(sell_rows))
    sells.extend([sell for _, sell in sell_rows])
# convert intermediate lists to dataframes
buys = pd.DataFrame(buys)
sells = pd.DataFrame(sells)
# rename columns of the buys / sells dataframes for unique column names
buys = buys.rename(columns={column: f"{column}_buy" for column in buys.columns})
sells = sells.rename(columns={column: f"{column}_sell" for column in sells.columns})
# reset indices and concatenate buys / sells along the column axis
buys.reset_index(drop=True, inplace=True)
sells.reset_index(drop=True, inplace=True)
profitable_trade_df = pd.concat([buys, sells], axis=1)
# convert to list of lists (if needed)
profitable_trade = profitable_trade_df[['typeID_buy', 'orderID_buy', 'price_buy', 'volume_buy', 'stationID_buy', 'range_buy', 'orderID_sell', 'price_sell', 'volume_sell', 'stationID_sell', 'range_sell']].values.tolist()
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 3.785726308822632 seconds
Many thanks to @daniel.fehrenbacher for the explanation and suggestions.
In addition to his options, I've found a few myself using this article:
https://towardsdatascience.com/heres-the-most-efficient-way-to-iterate-through-your-pandas-dataframe-4dad88ac92ee
TL;DR
Don't use tolist()
Filter operation isn't always better, depends on the iteration method
There are much faster iteration methods than a regular for loop, or even iterrows(): use dictionary iteration
Use of .tolist() is detrimental
As mentioned in the answer above, .tolist() takes too much time. It's much faster to use append([item1, item2, item3, ...]) than append(row[['item1', 'item2', 'item3', ...]].tolist()).
tolist(): 19.2s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
append([item1, item2]): 3.5s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append([
            buy.typeID,
            buy.orderID,
            buy.price,
            buy.volume,
            buy.stationID,
            buy.range,
            sell.orderID,
            sell.price,
            sell.volume,
            sell.stationID,
            sell.range
        ])
Filtering Operation VS break
While the single filtering operation gives a slight efficiency increase with .iterrows(), I've found the opposite with the faster .itertuples().
iterrows() with filter operation: 3.26s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, row_buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
    for _, row_sell in filtered_sell_orders.iterrows():
        profitable_trade.append([
            row_buy.typeID,
            row_buy.orderID,
            row_buy.price,
            row_buy.volume,
            row_buy.stationID,
            row_buy.range,
            row_sell.orderID,
            row_sell.price,
            row_sell.volume,
            row_sell.stationID,
            row_sell.range
        ])
iterrows() with break statements: 3.77s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade3 = []
lowest_sell = sell_order.price.min()
for _, row_buy in buy_order.iterrows():
    if row_buy.price > lowest_sell:
        for _, row_sell in sell_order.iterrows():
            if row_buy.price > row_sell.price:
                profitable_trade3.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
            else:
                break
    else:
        break
itertuples with filter operation: 650ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for row_buy in buy_order.itertuples():
    filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
    for row_sell in filtered_sell_orders.itertuples():
        profitable_trade.append([
            row_buy.typeID,
            row_buy.orderID,
            row_buy.price,
            row_buy.volume,
            row_buy.stationID,
            row_buy.range,
            row_sell.orderID,
            row_sell.price,
            row_sell.volume,
            row_sell.stationID,
            row_sell.range
        ])
itertuples with break statement: 375ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade3 = []
lowest_sell = sell_order.price.min()
for row_buy in buy_order.itertuples():
    if row_buy.price > lowest_sell:
        for row_sell in sell_order.itertuples():
            if row_buy.price > row_sell.price:
                profitable_trade3.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
            else:
                break
    else:
        break
Better iteration methods
itertuples (see above): 375ms
Numpy Iteration Method (df.values): 200ms
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade4 = []
lowest_sell = sell_order.price.min()
for row_buy in buy_order.values:
    if row_buy[7] > lowest_sell:
        for row_sell in sell_order.values:
            if row_buy[7] > row_sell[7]:
                profitable_trade4.append([
                    row_buy[1],
                    row_buy[0],
                    row_buy[7],
                    row_buy[4],
                    row_buy[8],
                    row_buy[9],
                    row_sell[0],
                    row_sell[7],
                    row_sell[4],
                    row_sell[8],
                    row_sell[9]
                ])
            else:
                break
    else:
        break
Dictionary Iteration (df.to_dict('records')): 78ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade5 = []
buy_dict = buy_order.to_dict('records')
sell_dict = sell_order.to_dict('records')
lowest_sell = sell_order.price.min()
for row_buy in buy_dict:
    if row_buy['price'] > lowest_sell:
        for row_sell in sell_dict:
            if row_buy['price'] > row_sell['price']:
                profitable_trade5.append([
                    row_buy['typeID'],
                    row_buy['orderID'],
                    row_buy['price'],
                    row_buy['volume'],
                    row_buy['stationID'],
                    row_buy['range'],
                    row_sell['orderID'],
                    row_sell['price'],
                    row_sell['volume'],
                    row_sell['stationID'],
                    row_sell['range']
                ])
            else:
                break
    else:
        break
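For completeness, the pairing can also be fully vectorized with a cross join, avoiding Python-level loops entirely. A minimal sketch, assuming pandas >= 1.2 for how='cross' (a toy frame stands in for the real order book here so the example runs as-is):

```python
import pandas as pd

# Toy stand-in for the real order book loaded from latest.csv.
df = pd.DataFrame({
    'typeID':    [34, 34, 34, 34],
    'buy':       [True, True, False, False],
    'orderID':   [1, 2, 3, 4],
    'price':     [6.0, 4.0, 5.0, 7.0],
    'volume':    [10, 20, 30, 40],
    'stationID': [100, 100, 200, 200],
    'range':     ['region'] * 4,
})

buy_order = df[(df.typeID == 34) & df.buy][
    ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']]
sell_order = df[(df.typeID == 34) & ~df.buy][
    ['orderID', 'price', 'volume', 'stationID', 'range']]

# Cross-join every buy with every sell, then keep only profitable pairs.
pairs = buy_order.merge(sell_order, how='cross', suffixes=('_buy', '_sell'))
profitable = pairs[pairs['price_buy'] > pairs['price_sell']]
profitable_trade = profitable.values.tolist()
```

On the toy data this yields the single pair where the buy at 6.0 beats the sell at 5.0. The cross join materializes len(buy) * len(sell) rows, so for very large order books the sorted break-out loops above may still use less memory.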
Hi, I'm wondering what I should do to save all of these values in a dataframe...
for mask in range(len(predicted_masks)):
    folha = np.where(predicted_masks[mask,:,:] == 1, 1, 0)
    soma_folha = np.sum(folha)
    sintoma = np.where(predicted_masks[mask,:,:] == 2, 1, 0)
    soma_sintoma = np.sum(sintoma)
    fundo = np.where(predicted_masks[mask,:,:] == 0, 1, 0)
    soma_fundo = np.sum(fundo)
    #print(soma_fundo, soma_folha, soma_sintoma)
    severidade = (soma_sintoma/(soma_folha+soma_sintoma))*100
    severidade = round(severidade, 2)
    print(soma_fundo, soma_folha, soma_sintoma, severidade)
    d = {'mask': mask, 'soma_folha': soma_folha, 'soma_sintoma': soma_sintoma, 'soma_fundo': soma_fundo, 'severidade': severidade}
    df = pd.DataFrame([d])
    df.to_csv('/content/drive/MyDrive/DB_mosca_minadora/pred_csv/pred_test_db_anotated.csv', index=False)
I already tried saving each one separately, but it won't come out.
I need to save all the printed values in a dataframe, that's for 304 images (304 lines), but it only saves the last line.
Can someone help me?
You are overwriting and saving your dataframe within the loop. You should instead do something like the following:
df = pd.DataFrame(columns=['mask', 'soma_folha', 'soma_sintoma', 'soma_fundo', 'severidade'])
for mask in range(len(predicted_masks)):
    folha = np.where(predicted_masks[mask,:,:] == 1, 1, 0)
    soma_folha = np.sum(folha)
    sintoma = np.where(predicted_masks[mask,:,:] == 2, 1, 0)
    soma_sintoma = np.sum(sintoma)
    fundo = np.where(predicted_masks[mask,:,:] == 0, 1, 0)
    soma_fundo = np.sum(fundo)
    #print(soma_fundo, soma_folha, soma_sintoma)
    severidade = (soma_sintoma/(soma_folha+soma_sintoma))*100
    severidade = round(severidade, 2)
    print(soma_fundo, soma_folha, soma_sintoma, severidade)
    d = {'mask': mask, 'soma_folha': soma_folha, 'soma_sintoma': soma_sintoma, 'soma_fundo': soma_fundo, 'severidade': severidade}
    new_df = pd.DataFrame([d])
    df = pd.concat([df, new_df])
df.to_csv('/content/drive/MyDrive/DB_mosca_minadora/pred_csv/pred_test_db_anotated.csv', index=False)
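Concatenating inside the loop also gets slow as the frame grows, since each pd.concat copies everything accumulated so far. A faster pattern is to collect plain dicts in a list and build the DataFrame once at the end. A minimal runnable sketch, with a small toy predicted_masks array standing in for the real predictions:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real predicted_masks (3 masks of 2x2 labels 0/1/2).
predicted_masks = np.array([
    [[0, 1], [2, 1]],
    [[1, 1], [2, 0]],
    [[2, 2], [1, 0]],
])

rows = []
for mask in range(len(predicted_masks)):
    # Boolean comparisons sum directly; no intermediate np.where needed.
    soma_folha = np.sum(predicted_masks[mask] == 1)
    soma_sintoma = np.sum(predicted_masks[mask] == 2)
    soma_fundo = np.sum(predicted_masks[mask] == 0)
    severidade = round(soma_sintoma / (soma_folha + soma_sintoma) * 100, 2)
    rows.append({'mask': mask, 'soma_folha': soma_folha,
                 'soma_sintoma': soma_sintoma, 'soma_fundo': soma_fundo,
                 'severidade': severidade})

# Build the DataFrame once, after the loop, then write the CSV once.
df = pd.DataFrame(rows)
```

With the real 304 masks this produces one 304-line frame to pass to df.to_csv, instead of 304 intermediate concatenations.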
I have a dataframe about movies:
df = pd.read_csv("https://raw.githubusercontent.com/Giovanni1085/UvA_CSDA_2021/main/assignments/miniproject/Pudding-Film-Dialogue-Clean.csv")
It has the columns gender and proportion_of_dialogue per film. The latter shows the proportion of dialogue for each given character, and the gender column of course shows the gender.
I want to create a for loop that goes over every row and creates a new dictionary every time a row starts a new movie after several characters. In that dictionary I want the proportions of dialogue from women and from men. All of this together then goes into its own dictionary.
The outcome would be something like:
What I have is this (it's a lot):
proportion_per_film_dictionary = {}
prop_woman = 0
prop_man = 0
for i in range(len(df.index)):
    if df.loc[i,'index1'] == 0: #I made a column with the index so I could do this for the first row
        dct['film_%s' %i] = []
        if df.loc[i, 'gender'] == 'woman':
            prop_woman = prop_woman + df.loc[i,'proportion_of_dialogue']
        if df.loc[i,'gender'] == 'man':
            prop_man = prop_man + df.loc[i,'proportion_of_dialogue']
        if df.loc[i,'title'] != df.loc[i+1,'title']:
            dct['film_%s'%i].append(prop_man)
            dct['film_%s'%i].append(prop_woman)
            proportion_per_film_dictionary.append(dct['film_%s'%i])
            prop_woman = None
            prop_man = None
    elif df.loc[i,'title'] == df.loc[i+1,'title']:
        if df.loc[i, 'gender'] == 'woman':
            prop_woman = prop_woman + df.loc[i,'proportion_of_dialogue']
        if df.loc[i,'gender'] == 'man':
            prop_man = prop_man + df.loc[i,'proportion_of_dialogue']
    elif df.loc[i,'title'] != df.loc[i+1, 'title']:
        if df.loc[i, 'gender'] == 'woman':
            prop_woman = prop_woman + df.loc[i,'proportion_of_dialogue']
        if df.loc[i,'gender'] == 'man':
            prop_man = prop_man + df.loc[i,'proportion_of_dialogue']
        if df.loc[i,'title'] != df.loc[i+1,'title']:
            dct['film_%s'%i].append(prop_man)
            dct['film_%s'%i].append(prop_woman)
            proportion_per_film_dictionary.append(dct['film_%s'%i])
            prop_woman = None
            prop_man = None
This code shows the iteration over every movie title. I first also tried i-1, but apparently Python doesn't accept -1 there, although it does accept +1.
I now get the following error:
'dict' object has no attribute 'append'
I don't know why. Can anybody help me? Thank you.
In Python you don't use append on dictionaries; that's for lists. For dictionaries, it's enough to add a new value by assigning to a key, like this:
dct['film_%s'%i] = prop_man
dct['film_%s'%i] = prop_woman
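Beyond the attribute error, the whole per-film accumulation can be done without manual index bookkeeping by grouping on title and gender. A minimal sketch, assuming the columns title, gender, and proportion_of_dialogue from the question (a toy frame stands in for the real CSV so the example runs as-is):

```python
import pandas as pd

# Toy stand-in for the Pudding film-dialogue data.
df = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['man', 'woman', 'man', 'woman', 'woman'],
    'proportion_of_dialogue': [0.5, 0.3, 0.2, 0.6, 0.4],
})

# Sum dialogue per (film, gender), then turn each film's totals into a dict.
totals = (df.groupby(['title', 'gender'])['proportion_of_dialogue']
            .sum()
            .unstack(fill_value=0))
proportion_per_film = {title: row.to_dict() for title, row in totals.iterrows()}
```

This yields one dict per film keyed by gender, e.g. {'A': {'man': ..., 'woman': ...}, ...}, with missing genders filled with 0.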
I have a big dataframe with two million rows. There are 60,000 unique (store_id, product_id) pairs.
I need to select by each (store_id, product_id) pair, do some calculations (such as resample to H, sum, avg), and finally concat everything into a new dataframe.
The problem is that it is very, very slow, and it becomes slower while running.
The main code is:
def process_df(df, func, *args, **kwargs):
    '''
    '''
    product_ids = df.product_id.unique()
    store_ids = df.store_id.unique()
    # uk = df.drop_duplicates(subset=['store_id','product_id'])
    # for idx, item in uk.iterrows():
    all_df = list()
    i = 1
    with tqdm(total=product_ids.shape[0]*store_ids.shape[0]) as t:
        for store_id in store_ids:
            sdf = df.loc[df['store_id']==store_id]
            for product_id in product_ids:
                new_df = sdf.loc[(sdf['product_id']==product_id)]
                if new_df.shape[0] < 14:
                    continue
                new_df = func(new_df, *args, **kwargs)
                new_df.loc[:, 'store_id'] = store_id
                new_df.loc[:, 'product_id'] = product_id
                all_df.append(new_df)
                t.update()
    all_df = pd.concat(all_df)
    return all_df
def process_order_items(df, store_id=None, product_id=None, freq='D'):
    if store_id and "store_id" in df.columns:
        df = df.loc[df['store_id']==store_id]
    if product_id and "product_id" in df.columns:
        df = df.loc[df['product_id']==product_id]
    # convert to datetime
    df.loc[:, "datetime_create"] = pd.to_datetime(df.time_create, unit='ms').dt.tz_localize('UTC').dt.tz_convert('Asia/Shanghai').dt.tz_localize(None)
    df = df[["price", "count", "fee_total", "fee_real", "price_real", "price_guide", "price_change_category", "datetime_create"]]
    df.loc[:, "has_discount"] = (df.price_change_category > 0).astype(int)
    df.loc[:, "clearance"] = df.price_change_category.apply(lambda x: x in (10, 20, 23)).astype(int)
    if not freq:
        df.loc[:, "date_create"] = df["datetime_create"]
    else:
        assert freq in ('D', 'H')
        df.index = df.loc[:, "datetime_create"]
        discount_order_count = df['has_discount'].resample(freq).sum()
        clearance_order_count = df['clearance'].resample(freq).sum()
        discount_sale_count = df.loc[df.has_discount > 0, 'count'].resample(freq).sum()
        clearance_sale_count = df.loc[df.clearance > 0, 'count'].resample(freq).sum()
        no_discount_price = df.loc[df.has_discount == 0, 'price'].resample(freq).sum()
        no_clearance_price = df.loc[df.clearance == 0, 'price'].resample(freq).sum()
        order_count = df['count'].resample(freq).count()
        day_count = df['count'].resample(freq).sum()
        price_guide = df['price_guide'].resample(freq).max()
        price_avg = (df['price'] * df['count']).resample(freq).sum() / day_count
        df = pd.DataFrame({
            "price": price_avg,
            "price_guide": price_guide,
            "sale_count": day_count,
            "order_count": order_count,
            "discount_order_count": discount_order_count,
            "clearance_order_count": clearance_order_count,
            "discount_sale_count": discount_sale_count,
            "clearance_sale_count": clearance_sale_count,
        })
        df = df.drop(df[df.order_count == 0].index)
    return df
I think the problem is that there are too many redundant selections.
Maybe I could use groupby(['store_id','product_id']).agg to avoid the redundancy, but I have no idea how to use process_order_items with it and merge the results together.
I think you can change:
df.loc[:,"clearance"] = df.price_change_category.apply(lambda x:x in(10, 20, 23)).astype(int)
to Series.isin:
df["clearance"] = df.price_change_category.isin([10, 20, 23]).astype(int)
Also a solution with Resampler.aggregate:
d = {'has_discount': 'sum',
     'clearance': 'sum',
     'count': ['count', 'sum'],
     'price_guide': 'max'}
df1 = df.resample(freq).agg(d)
df1.columns = df1.columns.map('_'.join)
d1 = {'has_discount_sum': 'discount_order_count',
      'clearance_sum': 'clearance_order_count',
      'count_count': 'order_count',
      'count_sum': 'day_count',
      'price_guide_max': 'price_guide'}
df1 = df1.rename(columns=d1)
Another idea is not to convert the boolean mask to integer, but to use the boolean columns for filtering directly, like:
df["has_discount"] = df.price_change_category > 0
df["clearance"] = df.price_change_category.isin([10, 20, 23])
discount_sale_count = df.loc[df.has_discount, 'count'].resample(freq).sum()
clearance_sale_count = df.loc[df.clearance, 'count'].resample(freq).sum()
#for filtering ==0 invert boolean mask columns by ~
no_discount_price = df.loc[~df.has_discount, 'price'].resample(freq).sum()
no_clearance_price = df.loc[~df.clearance, 'price'].resample(freq).sum()
The first function should be simplified with GroupBy.apply instead of loops; then the concat is not necessary:
def f(x):
    print(x)

df = df.groupby(['product_id','store_id']).apply(f)
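To make the GroupBy.apply idea concrete: a minimal runnable sketch on toy data, where per_pair stands in for process_order_items (the real function would be passed the same way). Each group is resampled on its own, and pandas reassembles the results with a (store_id, product_id, date) MultiIndex, so no manual selection or concat is needed:

```python
import pandas as pd

# Toy stand-in for the two-million-row order-items frame.
df = pd.DataFrame({
    'store_id':   [1, 1, 1, 2],
    'product_id': [10, 10, 20, 10],
    'datetime_create': pd.to_datetime(
        ['2021-01-01 09:00', '2021-01-01 10:00',
         '2021-01-01 09:30', '2021-01-02 12:00']),
    'count': [3, 2, 5, 1],
})

def per_pair(g, freq='D'):
    # Stand-in for process_order_items: resample one group's rows.
    g = g.set_index('datetime_create')
    return g['count'].resample(freq).agg(['count', 'sum'])

# One pass over the data; no per-(store_id, product_id) selection loop.
result = df.groupby(['store_id', 'product_id'])[['datetime_create', 'count']].apply(per_pair)
```

A size filter like the original `if new_df.shape[0] < 14: continue` can be reproduced by returning None from per_pair for small groups, which groupby.apply simply drops.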