I have this df:
Segnale Prezzo Prezzo_exit
0 Long 44645 43302
1 Short 41169 44169
2 Long 44322 47093
3 Short 45323 42514
Sample code to generate it:
import pandas as pd

tbl2 = {
    "Segnale": ["Long", "Short", "Long", "Short"],
    "Prezzo": [44645, 41169, 44322, 45323],
    "Prezzo_exit": [43302, 44169, 47093, 42514]}
df = pd.DataFrame(tbl2)
I need to create a new column named "esito" with these conditions:
if df["Segnale"] == "Long" and df["Prezzo"] < df["Prezzo_exit"]   # row gets "target"
if df["Segnale"] == "Long" and df["Prezzo"] > df["Prezzo_exit"]   # row gets "stop"
if df["Segnale"] == "Short" and df["Prezzo"] < df["Prezzo_exit"]  # row gets "stop"
if df["Segnale"] == "Short" and df["Prezzo"] > df["Prezzo_exit"]  # row gets "target"
So the final result will be:
Segnale Prezzo Prezzo_exit esito
0 Long 44645 43302 stop
1 Short 41169 44169 stop
2 Long 44322 47093 target
3 Short 45323 42514 target
I tried this with no success:
df.loc[(df['Segnale'].str.contains('Long') & df['Prezzo'] < df['Prezzo_exit']), 'Esito'] = 'Target'
df.loc[(df['Segnale'].str.contains('Long') & df['Prezzo'] > df['Prezzo_exit']), 'Esito'] = 'Stop'
df.loc[(df['Segnale'].str.contains('Short') & df['Prezzo'] > df['Prezzo_exit']), 'Esito'] = 'Target'
df.loc[(df['Segnale'].str.contains('Short') & df['Prezzo'] > df['Prezzo_exit']), 'Esito'] = 'Stop'
This will do what your question asks:
df.loc[(df.Segnale=='Long') & (df.Prezzo < df.Prezzo_exit), 'esito'] = 'target'
df.loc[(df.Segnale=='Long') & (df.Prezzo > df.Prezzo_exit), 'esito'] = 'stop'
df.loc[(df.Segnale=='Short') & (df.Prezzo < df.Prezzo_exit), 'esito'] = 'stop'
df.loc[(df.Segnale=='Short') & (df.Prezzo > df.Prezzo_exit), 'esito'] = 'target'
Output:
Segnale Prezzo Prezzo_exit esito
0 Long 44645 43302 stop
1 Short 41169 44169 stop
2 Long 44322 47093 target
3 Short 45323 42514 target
UPDATE:
You could also do this:
df['esito'] = pd.Series(['stop'] * len(df)).where(
    ((df.Segnale == 'Long') & (df.Prezzo > df.Prezzo_exit)) |
    ((df.Segnale == 'Short') & (df.Prezzo < df.Prezzo_exit)),
    'target')
... or this:
df['esito'] = np.where(
    ((df.Segnale == 'Long') & (df.Prezzo > df.Prezzo_exit)) |
    ((df.Segnale == 'Short') & (df.Prezzo < df.Prezzo_exit)),
    'stop', 'target')
You need to add parentheses around each comparison, because & has higher precedence than < and >:
(df['Prezzo'] < df['Prezzo_exit'])
For simplicity, you can use np.select to express all the conditions and choices in one statement.
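A minimal np.select sketch, reusing the sample data from the question; the four conditions and their outcomes map one-to-one onto the conditions and choices lists:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Segnale": ["Long", "Short", "Long", "Short"],
    "Prezzo": [44645, 41169, 44322, 45323],
    "Prezzo_exit": [43302, 44169, 47093, 42514]})

# one entry in choices per condition, in the same order
conditions = [
    (df.Segnale == "Long") & (df.Prezzo < df.Prezzo_exit),
    (df.Segnale == "Long") & (df.Prezzo > df.Prezzo_exit),
    (df.Segnale == "Short") & (df.Prezzo < df.Prezzo_exit),
    (df.Segnale == "Short") & (df.Prezzo > df.Prezzo_exit),
]
choices = ["target", "stop", "stop", "target"]
df["esito"] = np.select(conditions, choices, default="")
```

The default="" covers the case Prezzo == Prezzo_exit, which the question's conditions leave unspecified.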
I have a large Dataframe based on market data from the online game EVE.
I'm trying to determine the most profitable trades based on the price of the buy or sell order of an item.
I've found that it takes quite a while to loop through all the possibilities and would like some advice on how to make my code more efficient.
data = https://market.fuzzwork.co.uk/orderbooks/latest.csv.gz
SETUP:
import pandas as pd
df = pd.read_csv('latest.csv', sep='\t', names=["orderID","typeID","issued","buy","volume","volumeEntered","minVolume","price","stationID","range","duration","region","orderSet"])
Iterate through all the possibilities
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
profitable_trade = []
for i in buy_order.index:
    for j in sell_order.index:
        if buy_order.loc[i, 'price'] > sell_order.loc[j, 'price']:
            profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
This takes quite a long time (33s on a ryzen 2600x, 12s on an M1 Pro)
Shorten the iteration
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade2 = []
for i in buy_order.index:
    if buy_order.loc[i, 'price'] > sell_order.price.min():
        for j in sell_order.index:
            if buy_order.loc[i, 'price'] > sell_order.loc[j, 'price']:
                profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
            else:
                break
    else:
        break
This shaves about 25%-30% off the time (23s on 2600x, 9s on the M1 Pro)
Times have been recorded in a Jupyter Notebook
Any Tips are welcome!
Option 1 - Iterate through all the possibilities (yours):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
profitable_trade = []
for i in buy_order.index:
    for j in sell_order.index:
        if buy_order.loc[i, 'price'] > sell_order.loc[j, 'price']:
            profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 33.145344734191895 seconds
Option 2 - Shorten the iteration (yours):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade2 = []
for i in buy_order.index:
    if buy_order.loc[i, 'price'] > sell_order.price.min():
        for j in sell_order.index:
            if buy_order.loc[i, 'price'] > sell_order.loc[j, 'price']:
                profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
            else:
                break
    else:
        break
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 26.736826419830322 seconds
Option 3 - Pandas Optimizations:
You can get some speedup by applying the following optimizations:
iterate over dataframe items directly (iterrows instead of index + loc)
single filtering operation for sell-orders
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 19.43745183944702 seconds
Note that almost all the time is spent on the tolist()-operations (the following option is just for showing this impact, it does not return the target list):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        # removed 'tolist'-operations
        profitable_trade.append(1)
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 2.072049617767334 seconds
Option 4 - Replace tolist-operations and store results in dataframe:
You can accelerate your code by
storing your filtered values in intermediate lists containing rows of the original dataframe
converting the intermediate lists to dataframes and concatenating them
the resulting dataframe yields the same information as the list profitable_trade
convert the dataframe to the desired list of lists (if needed)
start = time.time()
buy_orders = df[(df.typeID == 34) & (df.buy == True)]
sell_orders = df[(df.typeID == 34) & (df.buy == False)]
# store buy and sell rows in intermediate lists
buys = []
sells = []
for _, buy in buy_orders.iterrows():
    # apply the filtering operation once per buy order
    filtered_sell_orders = sell_orders[sell_orders.price < buy.price]
    sell_rows = list(filtered_sell_orders.iterrows())
    # store buy and sell row items
    buys.extend([buy] * len(sell_rows))
    sells.extend([sell for _, sell in sell_rows])
# convert intermediate lists to dataframes
buys = pd.DataFrame(buys)
sells = pd.DataFrame(sells)
# rename columns of the buys / sells dataframes for unique column names
buys = buys.rename(columns={column: f"{column}_buy" for column in buys.columns})
sells = sells.rename(columns={column: f"{column}_sell" for column in sells.columns})
# reset indices and concatenate buys / sells along the column axis
buys.reset_index(drop=True, inplace=True)
sells.reset_index(drop=True, inplace=True)
profitable_trade_df = pd.concat([buys, sells], axis=1)
# convert to list of lists (if needed)
profitable_trade = profitable_trade_df[['typeID_buy', 'orderID_buy', 'price_buy', 'volume_buy', 'stationID_buy', 'range_buy','orderID_sell', 'price_sell', 'volume_sell', 'stationID_sell', 'range_sell']].values.tolist()
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 3.785726308822632 seconds
Many thanks to @daniel.fehrenbacher for the explanation and suggestions.
In addition to his options, I've found a few myself using this article:
https://towardsdatascience.com/heres-the-most-efficient-way-to-iterate-through-your-pandas-dataframe-4dad88ac92ee
TL;DR
Don't use tolist()
A filter operation isn't always better than break statements; it depends on the iteration method
There are much faster iteration methods than a regular for loop, or even iterrows(): use dictionary iteration
Use of .tolist() is detrimental
As mentioned in the answer above, .tolist() uses too much time. It's much faster to use append([item1, item2, item3, ...]) than append(row[['item1', 'item2', 'item3', ...]].tolist())
tolist(): 19.2s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() + sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
append([item1, item2]): 3.5s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append([
            buy.typeID,
            buy.orderID,
            buy.price,
            buy.volume,
            buy.stationID,
            buy.range,
            sell.orderID,
            sell.price,
            sell.volume,
            sell.stationID,
            sell.range
        ])
Filtering operation vs. break
While the single filtering operation gives a slight efficiency increase with .iterrows(), I've found the opposite holds with the faster .itertuples().
iterrows() with filter operation: 3.26s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, row_buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
    for _, row_sell in filtered_sell_orders.iterrows():
        profitable_trade.append([
            row_buy.typeID,
            row_buy.orderID,
            row_buy.price,
            row_buy.volume,
            row_buy.stationID,
            row_buy.range,
            row_sell.orderID,
            row_sell.price,
            row_sell.volume,
            row_sell.stationID,
            row_sell.range
        ])
iterrows() with break statements: 3.77s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade3 = []
lowest_sell = sell_order.price.min()
for _, row_buy in buy_order.iterrows():
    if row_buy.price > lowest_sell:
        for _, row_sell in sell_order.iterrows():
            if row_buy.price > row_sell.price:
                profitable_trade3.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
            else:
                break
    else:
        break
itertuples with filter operation: 650ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for row_buy in buy_order.itertuples():
    filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
    for row_sell in filtered_sell_orders.itertuples():
        profitable_trade.append([
            row_buy.typeID,
            row_buy.orderID,
            row_buy.price,
            row_buy.volume,
            row_buy.stationID,
            row_buy.range,
            row_sell.orderID,
            row_sell.price,
            row_sell.volume,
            row_sell.stationID,
            row_sell.range
        ])
itertuples with break statement: 375ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade3 = []
lowest_sell = sell_order.price.min()
for row_buy in buy_order.itertuples():
    if row_buy.price > lowest_sell:
        for row_sell in sell_order.itertuples():
            if row_buy.price > row_sell.price:
                profitable_trade3.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
            else:
                break
    else:
        break
Better iteration methods
itertuples (see above): 375ms
Numpy Iteration Method (df.values): 200ms
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade4 = []
lowest_sell = sell_order.price.min()
# column positions (from the names list in read_csv):
# 0=orderID, 1=typeID, 4=volume, 7=price, 8=stationID, 9=range
for row_buy in buy_order.values:
    if row_buy[7] > lowest_sell:
        for row_sell in sell_order.values:
            if row_buy[7] > row_sell[7]:
                profitable_trade4.append([
                    row_buy[1],   # typeID
                    row_buy[0],   # orderID
                    row_buy[7],   # price
                    row_buy[4],   # volume
                    row_buy[8],   # stationID
                    row_buy[9],   # range
                    row_sell[0],  # orderID
                    row_sell[7],  # price
                    row_sell[4],  # volume
                    row_sell[8],  # stationID
                    row_sell[9]   # range
                ])
            else:
                break
    else:
        break
Dictionary Iteration (df.to_dict('records')): 78ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade5 = []
buy_dict = buy_order.to_dict('records')
sell_dict = sell_order.to_dict('records')
lowest_sell = sell_order.price.min()
for row_buy in buy_dict:
    if row_buy['price'] > lowest_sell:
        for row_sell in sell_dict:
            if row_buy['price'] > row_sell['price']:
                profitable_trade5.append([
                    row_buy['typeID'],
                    row_buy['orderID'],
                    row_buy['price'],
                    row_buy['volume'],
                    row_buy['stationID'],
                    row_buy['range'],
                    row_sell['orderID'],
                    row_sell['price'],
                    row_sell['volume'],
                    row_sell['stationID'],
                    row_sell['range']
                ])
            else:
                break
    else:
        break
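Beyond faster iteration, the pairing can be vectorized away entirely with a cross join. This is a sketch on a hypothetical miniature order book (the column names follow the question; merge(how="cross") assumes pandas 1.2+):

```python
import pandas as pd

# hypothetical miniature order book standing in for the EVE data
df = pd.DataFrame({
    "orderID": [1, 2, 3, 4],
    "typeID": [34, 34, 34, 34],
    "buy": [True, True, False, False],
    "price": [6.0, 5.0, 4.5, 5.5],
    "volume": [10, 20, 30, 40],
    "stationID": [100, 101, 102, 103],
    "range": ["region"] * 4})

buys = df[(df.typeID == 34) & df.buy]
sells = df[(df.typeID == 34) & ~df.buy]

# pair every buy with every sell, then keep only the profitable pairs
pairs = buys.merge(sells, how="cross", suffixes=("_buy", "_sell"))
profitable = pairs[pairs.price_buy > pairs.price_sell]
```

One caveat: the cross join materializes len(buys) * len(sells) rows before filtering, so for very large order books the sorted break-loops above may still be preferable on memory.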
I'm trying to create a lineplot of the counts of three different groups (desktop, mobile & tablet), with the x-axis showing the years 2014, 2015 and 2016, but I am getting an error.
my code is currently:
#year-by-year change
desktop14 = od.loc[(od.Account_Year_Week >= 201401) & (od.Account_Year_Week <= 201453) & (od.online_device_type_detail == "DESKTOP"), "Gross_Demand_Pre_Credit"]
desktop15 = od.loc[(od.Account_Year_Week >= 201501) & (od.Account_Year_Week <= 201553) & (od.online_device_type_detail == "DESKTOP"), "Gross_Demand_Pre_Credit"]
desktop16 = od.loc[(od.Account_Year_Week >= 201601) & (od.Account_Year_Week <= 201653) & (od.online_device_type_detail == "DESKTOP"), "Gross_Demand_Pre_Credit"]
mobile14 = od.loc[(od.Account_Year_Week >= 201401) & (od.Account_Year_Week <= 201453) & (od.online_device_type_detail == "MOBILE"), "Gross_Demand_Pre_Credit"]
mobile15 = od.loc[(od.Account_Year_Week >= 201501) & (od.Account_Year_Week <= 201553) & (od.online_device_type_detail == "MOBILE"), "Gross_Demand_Pre_Credit"]
mobile16 = od.loc[(od.Account_Year_Week >= 201601) & (od.Account_Year_Week <= 201653) & (od.online_device_type_detail == "MOBILE"), "Gross_Demand_Pre_Credit"]
tablet14 = od.loc[(od.Account_Year_Week >= 201401) & (od.Account_Year_Week <= 201453) & (od.online_device_type_detail == "TABLET"), "Gross_Demand_Pre_Credit"]
tablet15 = od.loc[(od.Account_Year_Week >= 201501) & (od.Account_Year_Week <= 201553) & (od.online_device_type_detail == "TABLET"), "Gross_Demand_Pre_Credit"]
tablet16 = od.loc[(od.Account_Year_Week >= 201601) & (od.Account_Year_Week <= 201653) & (od.online_device_type_detail == "TABLET"), "Gross_Demand_Pre_Credit"]
devicedata = [["Desktop", desktop14.count(), desktop15.count(), desktop16.count()], ["Mobile", mobile14.count(), mobile15.count(), mobile16.count()], ["Tablet", tablet14.count(), tablet15.count(), tablet16.count()]]
df = pd.DataFrame(devicedata, columns=["Device", "2014", "2015", "2016"]).set_index("Device")
plt.show()
I want each of the lines to be a device type, with the x-axis showing the change by year (essentially swapping the axes). How do I do this?
Any help is greatly appreciated.
Just do
df.transpose().plot()
The result will be a line plot with the years on the x-axis and one line per device type.
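A self-contained sketch with made-up counts (the real numbers would come from the od dataframe above) shows what the transpose does: the years become the row index, so they land on the x-axis, and each device column becomes its own line:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

# made-up counts standing in for the real desktop/mobile/tablet data
devicedata = [["Desktop", 120, 150, 170],
              ["Mobile", 80, 140, 210],
              ["Tablet", 60, 70, 65]]
df = pd.DataFrame(devicedata, columns=["Device", "2014", "2015", "2016"]).set_index("Device")

# transpose: years become the index (x-axis), devices become columns (lines)
ax = df.transpose().plot()
ax.set_xlabel("Year")
ax.set_ylabel("Count")
plt.show()
```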