Pandas: merge two dataframes on a column with different lengths - python

I am working on a trading algo but I have some issues when trying to combine the buy_orders and sell_orders dataframes into a single dataframe, orders.
The issue shows itself on the buy order dated 2021-01-21, where my algo made a recommendation to buy, but there is no matching sell order yet because that signal hasn't been spotted; those fields should therefore be NaN after the merge.
If I were to join on index, the sell order would be the sell order for a different stock from the sell_orders dataframe.
buy_orders dataframe
Date_buy Name Stock_Price_buy Order
26 2020-07-30 AAPL 96.19 buy
27 2020-09-30 AAPL 115.81 buy
28 2020-11-05 AAPL 119.03 buy
29 2020-11-30 AAPL 119.05 buy
30 2021-01-21 AAPL 136.87 buy
31 2020-10-11 ABBV 21.21 buy
sell_orders dataframe
Date_sell Name Stock_Price_sell Order
25 2020-07-20 AAPL 98.36 sell
26 2020-09-02 AAPL 131.40 sell
27 2020-10-20 AAPL 117.51 sell
28 2020-11-20 AAPL 117.34 sell
29 2021-01-04 AAPL 129.41 sell
30 2020-10-15 ABBV 24.23 sell
The ideal result would be the orders dataframe demonstrated below.
Index Buy_date Name_x Stock_Price_buy Order_x Sell_date Name_y Stock_Price_sell Order_y
26 2020-07-30 AAPL 96.19 buy 2020-09-02 AAPL 131.40 sell
27 2020-09-30 AAPL 115.81 buy 2020-10-20 AAPL 117.51 sell
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL 117.34 sell
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL 129.41 sell
30 2021-01-21 AAPL 136.87 buy NaN NaN NaN NaN
Here's how the orders dataframe looks now, where buy_orders.Name_x and sell_orders.Name_y differ for the first time. The ABBV sell order should have been NaNs:
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL 117.34 sell
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL 129.41 sell
30 2021-01-21 AAPL 136.87 buy 2020-10-15 ABBV 24.23 sell

Have you thought of join, declaring suffixes as follows?
buy_orders.join(sell_orders, lsuffix='_buy', rsuffix='_sell')
Date_buy Name_buy Stock_Price_buy Order_buy Date_sell Name_sell \
26 2020-07-30 AAPL 96.19 buy 2020-09-02 AAPL
27 2020-09-30 AAPL 115.81 buy 2020-10-20 AAPL
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL
30 2021-01-21 AAPL 136.87 buy NaN NaN
Stock_Price_sell Order_sell
26 131.40 sell
27 117.51 sell
28 117.34 sell
29 129.41 sell
30 NaN NaN
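One caveat with join: it pairs rows purely by index label, so it only gives the right answer while the two frames' indices happen to stay in lockstep. An index-independent alternative (not from the original answer, just a sketch) is pd.merge_asof, which pairs each buy with the first sell on or after its date within the same ticker, leaving unmatched buys as NaN:
import pandas as pd

# merge_asof needs datetime keys, sorted on both sides
buys = buy_orders.assign(Date_buy=pd.to_datetime(buy_orders['Date_buy'])).sort_values('Date_buy')
sells = sell_orders.assign(Date_sell=pd.to_datetime(sell_orders['Date_sell'])).sort_values('Date_sell')

orders = pd.merge_asof(
    buys, sells,
    left_on='Date_buy', right_on='Date_sell',
    by='Name',              # only match rows for the same ticker
    direction='forward',    # first sell on or after the buy date
    suffixes=('_buy', '_sell'))
# Caveat: one sell can match several buys if two buys precede it,
# so sanity-check the result on real data.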

Assuming your data always starts with a buy and alternates buy/sell, has only one transaction per day, and always trades a single one-sized lot per transaction... you can use pd.concat. I made a simple dataframe that is kind of like yours (and in the future, it makes things easier if you include code to build a sample dataframe as part of your question):
import pandas as pd

buy_orders = pd.DataFrame.from_dict(
    {'Date_buy': [pd.to_datetime('2020-11-01'), pd.to_datetime('2020-11-03'),
                  pd.to_datetime('2020-11-05'), pd.to_datetime('2020-11-08'),
                  pd.to_datetime('2020-11-10')],
     'Order': ['B', 'B', 'B', 'B', 'B'],
     'Name': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'ABBV'],
     'Stock_Price_buy': [1, 2, 3, 4, 5.0]})
sell_orders = pd.DataFrame.from_dict(
    {'Date_sell': [pd.to_datetime('2020-11-02'), pd.to_datetime('2020-11-04'),
                   pd.to_datetime('2020-11-06'), pd.to_datetime('2020-11-12'),
                   pd.to_datetime('2020-11-22')],
     'Order': ['S', 'S', 'S', 'S', 'S'],
     'Name': ['AAPL', 'AAPL', 'AAPL', 'ABBV', 'ABBV'],
     'Stock_Price_sell': [23, 24, 25, 26, 5.0]})
You can first stack the two dataframes and sort them by date and order type (after normalizing the column names):
buy_orders = buy_orders.rename(columns={'Date_buy' : "Date", "Stock_Price_buy" : "Price"})
sell_orders = sell_orders.rename(columns={'Date_sell' : "Date", "Stock_Price_sell" : "Price"})
df = pd.concat([buy_orders, sell_orders])
df = df.sort_values(['Date','Order']).reset_index(drop=True)
...then making a copy of the dataframe (changing the column names to keep them distinct in the later concat step):
df2 = df.copy()
df2.columns = [f"{c}_sell" for c in df.columns]
You then concatenate the two dataframes next to each other, but with a .shift(-1) on the second one so that they're offset:
df3 = pd.concat([df, df2.shift(-1)], axis=1)
Finally, you can clean up the junk rows:
import numpy as np

cut = (df3.Name != df3.Name_sell)
df3.loc[cut, 'Date_sell'] = np.nan
df3.loc[cut, 'Order_sell'] = np.nan
df3.loc[cut, 'Price_sell'] = np.nan
df3 = df3.drop(columns='Name_sell')
df3 = df3[df3.Order!="S"].reset_index(drop=True).copy()
That gives you something like
Date Order Name Price Date_sell Order_sell Price_sell
0 2020-11-01 B AAPL 1.0 2020-11-02 S 23.0
1 2020-11-03 B AAPL 2.0 2020-11-04 S 24.0
2 2020-11-05 B AAPL 3.0 2020-11-06 S 25.0
3 2020-11-08 B AAPL 4.0 NaT NaN NaN
4 2020-11-10 B ABBV 5.0 2020-11-12 S 26.0
You don't have to make all the intermediate dataframes, etc.; I left the code that way here so that if you paste things into a notebook you can look at the steps.
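For completeness, here is the same pipeline condensed into a few chained steps (a sketch that assumes the renamed buy_orders/sell_orders from above, i.e. columns Date / Order / Name / Price):
import numpy as np
import pandas as pd

df = (pd.concat([buy_orders, sell_orders])
        .sort_values(['Date', 'Order'])
        .reset_index(drop=True))
pairs = pd.concat([df, df.add_suffix('_sell').shift(-1)], axis=1)
bad = pairs.Name != pairs.Name_sell            # buy not followed by its own ticker's sell
pairs.loc[bad, ['Date_sell', 'Order_sell', 'Price_sell']] = np.nan
pairs = (pairs[pairs.Order != 'S']             # keep only the buy rows
           .drop(columns='Name_sell')
           .reset_index(drop=True))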

Related

How to fill NaNs with a specific row of data

I am a new Python user and have a few questions about filling NAs in a dataframe.
Currently, I have a dataframe with a series of monthly dates from 2022-08-01 to 2037-08-01.
However, after 2027-06-01 the pricing data stops, and I would like to extrapolate the values forward to fill out the rest of the dates. Essentially, I would like to take the last 12 months of prices and fill them forward for the rest of the dataframe. I am thinking of some type of groupby on month with fillna(method='ffill'), but when I do this it just fills the last value in the df forward.
Below is a simplified example of my code: the values stop at 2020-12-01, and I want the previous 12 values (2020-01-01 through 2020-12-01) filled forward for all remaining maturity dates.
import pandas as pd

mat = pd.DataFrame(pd.date_range('01/01/2020', '01/01/2022', freq='MS'))
prices = pd.DataFrame([179.06, 174.6, 182.3, 205.59, 204.78, 202.19,
                       216.17, 218.69, 220.73, 223.28, 225.16, 226.31])
example = pd.concat([mat, prices], axis=1)
example.columns = ['maturity', 'price']
Output
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 NaN
13 2021-02-01 NaN
14 2021-03-01 NaN
15 2021-04-01 NaN
16 2021-05-01 NaN
17 2021-06-01 NaN
18 2021-07-01 NaN
19 2021-08-01 NaN
20 2021-09-01 NaN
21 2021-10-01 NaN
22 2021-11-01 NaN
23 2021-12-01 NaN
24 2022-01-01 NaN
Is this what you're looking for?
out = example.groupby(example.maturity.dt.month).ffill()
print(out)
Output:
maturity price
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 179.06
13 2021-02-01 174.6
14 2021-03-01 182.3
15 2021-04-01 205.59
16 2021-05-01 204.78
17 2021-06-01 202.19
18 2021-07-01 216.17
19 2021-08-01 218.69
20 2021-09-01 220.73
21 2021-10-01 223.28
22 2021-11-01 225.16
23 2021-12-01 226.31
24 2022-01-01 179.06
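Why grouping by month works here: each group holds a single calendar month across all years (January 2020, January 2021, January 2022, ...), and ffill runs down each group in row order, so the last observed value for a given month is carried into every later year of that month. A quick way to see the grouping on the sample frame above:
# each month-number group spans that month across all years in the range
for month, grp in example.groupby(example.maturity.dt.month):
    print(month, grp.maturity.dt.year.tolist())
# e.g. month 1 -> [2020, 2021, 2022]; ffill fills down within each such group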

MultiIndex column-wise from existing pandas dataframe columns

I am trying to reindex my pandas dataframe to a column-wise MultiIndex. Most answers I've explored seem to cover only the row-wise case. My current df looks like this:
ticker calendardate eps price ps revenue
0 ABNB 2019-12-31 -2.59 NaN NaN 4.80
1 ABNB 2020-12-31 -16.12 146.80 25.962 3.37
2 AMZN 2019-12-31 23.46 1847.84 3.266 2.80
3 AMZN 2020-12-31 42.64 3256.93 4.233 3.86
I want a MultiIndex based upon calendardate so that my output looks like this:
ticker eps price ps revenue
2019 2020 2019 2020 2019 2020 2019 2020
0 ABNB -2.59 -16.12 NaN 146.80 NaN 25.962 4.80 3.37
1 AMZN 23.46 42.64 1847.84 3256.93 3.266 4.233 2.80 3.86
Any help would be appreciated. Thanks
We can use str.split to split the column calendardate on the delimiter '-', then use str[0] to select the year portion of the split column; now set the index of the dataframe to the column ticker along with the extracted year, followed by unstack to reshape.
y = df['calendardate'].str.split('-', n=1).str[0]
df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
If the dtype of column calendardate is datetime, we can instead use:
y = df['calendardate'].dt.year
df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
eps price ps revenue
calendardate 2019 2020 2019 2020 2019 2020 2019 2020
ticker
ABNB -2.59 -16.12 NaN 146.80 NaN 25.962 4.8 3.37
AMZN 23.46 42.64 1847.84 3256.93 3.266 4.233 2.8 3.86
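For reference, a minimal reproduction of the question's frame that you can paste into a session (values copied from the table above; NaNs written as None):
import pandas as pd

df = pd.DataFrame({
    'ticker': ['ABNB', 'ABNB', 'AMZN', 'AMZN'],
    'calendardate': ['2019-12-31', '2020-12-31', '2019-12-31', '2020-12-31'],
    'eps': [-2.59, -16.12, 23.46, 42.64],
    'price': [None, 146.80, 1847.84, 3256.93],
    'ps': [None, 25.962, 3.266, 4.233],
    'revenue': [4.80, 3.37, 2.80, 3.86],
})
y = df['calendardate'].str.split('-', n=1).str[0]
out = df.drop(columns='calendardate').set_index(['ticker', y]).unstack()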

Faster way to filter pandas dataframe and create new columns

Given df
ticker close open
0 AAPL 1.2 1.1
1 TSLA 25.0 27.0
2 TSLA 83.0 80.0
3 TSLA 95.0 93.0
4 CCL 234.0 234.2
5 AAPL 512.0 520.0
My purpose:
(1) Apply functions to each ticker dataframe (subset)
(2) Create new column with values in string like 'exist' to each ticker dataframe
My expected output
ticker close open candlestick SMA_20 SMA_50
0 AAPL 1.2 1.1 bullish (number) (number)
1 TSLA 25.0 27.0 bearish (number) (number)
2 TSLA 83.0 80.0 bullish (number) (number)
3 TSLA 95.0 93.0 bullish (number) (number)
4 CCL 234.0 234.2 bearish (number) (number)
5 AAPL 512.0 520.0 bearish (number) (number)
I've tried this code, which is extremely slow:
for x in df.ticker:
    df_ticker = df[df.ticker == x]
    df_close_price = pd.DataFrame(df_ticker.close)
    for days in [20, 50]:
        df_ticker[f'SMA_{days}'] = df_close_price.apply(lambda c: abstract.SMA(c, days))
    ......
    df_result = df_result.append(df_ticker)
I was wondering how to filter the dataframe by ticker in a faster way when dealing with millions of rows. Many suggested using .loc or numpy, but I could not find a workable approach.
Thanks!
I think you need numpy.where:
import numpy as np

df['candlestick'] = np.where(df['close'] > df['open'], 'bullish', 'bearish')
print (df)
ticker close open candlestick
0 AAPL 1.2 1.1 bullish
1 TSLA 25.0 27.0 bearish
2 TSLA 83.0 80.0 bullish
3 TSLA 95.0 93.0 bullish
4 CCL 234.0 234.2 bearish
5 AAPL 512.0 520.0 bearish
EDIT: Here it is possible to use GroupBy.apply with a custom function, and mainly to pass a Series to abstract.SMA instead of .apply(lambda c: abstract.SMA(c, days)):
def f(x):
    for days in [20, 50]:
        x[f'SMA_{days}'] = abstract.SMA(x.close, days)
    return x

df = df.groupby('ticker').apply(f)
print (df)
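If TA-Lib's abstract.SMA is not available, a plain-pandas sketch of the same idea (assuming SMA means an ordinary rolling mean of close) avoids the Python-level loop over tickers entirely:
# simple moving averages per ticker without TA-Lib
for days in [20, 50]:
    df[f'SMA_{days}'] = (df.groupby('ticker')['close']
                           .transform(lambda s: s.rolling(days).mean()))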

melting a multi index dataframe in pandas [duplicate]

This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 2 years ago.
I keep getting stuck with this multi-level dataframe of stock prices that I'm trying to melt from wide to long format.
I'd like to go from this:
Attributes Close Close High
Symbols AMZN ^DJI AMZN
Date
2020-12-01 32 29 35
2020-11-30 31 28 36
2020-11-27 30 27 37
To this:
Attri Sym Date price
0 Close AMZN 2020-12-01 32
1 Close AMZN 2020-11-30 31
2 Close AMZN 2020-11-27 30
3 Close ^DJI 2020-12-01 29
4 Close ^DJI 2020-11-30 28
5 High AMZN 2020-12-01 35
6 ....
I tried:
df = df.reset_index()
df = df.set_index([('Date', '')]).rename_axis(index=None, columns=('Date', ''))
df = df.fillna('').set_index('Date').T\
.set_index('',append=True).stack().reset_index()
But I'm not getting it. Any ideas what else I should try?
For me, DataFrame.stack on both levels together with Series.reset_index works:
df = df.stack([0,1]).reset_index(name='price')
print (df)
Date Attributes Symbols price
0 2020-12-01 Close AMZN 32.0
1 2020-12-01 Close ^DJI 29.0
2 2020-12-01 High AMZN 35.0
3 2020-11-30 Close AMZN 31.0
4 2020-11-30 Close ^DJI 28.0
5 2020-11-30 High AMZN 36.0
6 2020-11-27 Close AMZN 30.0
7 2020-11-27 Close ^DJI 27.0
8 2020-11-27 High AMZN 37.0
Another idea is the solution from a comment by @sammywemmy:
df = df.melt(ignore_index=False, value_name="price").reset_index()
print (df)
Date Attributes Symbols price
0 2020-12-01 Close AMZN 32
1 2020-11-30 Close AMZN 31
2 2020-11-27 Close AMZN 30
3 2020-12-01 Close ^DJI 29
4 2020-11-30 Close ^DJI 28
5 2020-11-27 Close ^DJI 27
6 2020-12-01 High AMZN 35
7 2020-11-30 High AMZN 36
8 2020-11-27 High AMZN 37
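Both solutions return the rows in a different order than the question's target. If the exact ordering matters (attribute, then symbol, then date descending), one sketch is to sort after stacking:
out = (df.stack([0, 1])
         .reset_index(name='price')
         .sort_values(['Attributes', 'Symbols', 'Date'],
                      ascending=[True, True, False])
         .reset_index(drop=True))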

Pandas Panel Data - Identifying year gap and calculating returns

I am working with a large panel dataset of financial info, but the values are a bit spotty. I am trying to calculate the return between each year for each stock in the panel. However, because of missing values, firms sometimes have year gaps, so df['stock_ret'] = df.groupby(['tic'])['stock_price'].pct_change() cannot be applied as-is: it would compute returns across the gaps. The df looks something like this (just giving an example):
datadate month fyear ticker price
0 31/12/1998 12 1998 AAPL 188.92
1 31/12/1999 12 1999 AAPL 197.44
2 31/12/2002 12 2002 AAPL 268.13
3 31/12/2003 12 2003 AAPL 278.06
4 31/12/2004 12 2004 AAPL 288.35
5 31/12/2005 12 2005 AAPL 312.23
6 31/05/2008 5 2008 TSLA 45.67
7 31/05/2009 5 2009 TSLA 38.29
8 31/05/2010 5 2010 TSLA 42.89
9 31/05/2011 5 2011 TSLA 56.03
10 31/05/2014 5 2014 TSLA 103.45
.. ... .. .. .. ..
What I am looking for is a piece of code that would identify, for each individual firm, whether there is any gap in the data, and calculate returns separately for each contiguous run of years. Just like this:
datadate month fyear ticker price return
0 31/12/1998 12 1998 AAPL 188.92 NaN
1 31/12/1999 12 1999 AAPL 197.44 0.0451
2 31/12/2002 12 2002 AAPL 268.13 NaN
3 31/12/2003 12 2003 AAPL 278.06 0.0370
4 31/12/2004 12 2004 AAPL 288.35 0.0370
5 31/12/2005 12 2005 AAPL 312.23 0.0828
6 31/05/2008 5 2008 TSLA 45.67 NaN
7 31/05/2009 5 2009 TSLA 38.29 -0.1616
8 31/05/2010 5 2010 TSLA 42.89 0.1201
9 31/05/2011 5 2011 TSLA 56.03 0.3063
10 31/05/2014 5 2014 TSLA 103.45 NaN
.. ... .. .. .. ..
If you have any other suggestions on how to treat this problem, please feel free to share your knowledge :) I am a bit inexperienced so I am sure that your advice could help!
Thank you in advance guys!
You can create a mask that tells whether the previous year exists for each row, and update only those rows with the pct change:
import numpy as np

df['return'] = np.nan
mask = df.groupby('ticker')['fyear'].shift(1) == df['fyear'] - 1
df.loc[mask, 'return'] = df.groupby('ticker')['price'].pct_change()
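A quick sanity check of the mask on the AAPL rows from the question (values copied from the table above):
import pandas as pd

d = pd.DataFrame({'fyear': [1998, 1999, 2002, 2003],
                  'ticker': ['AAPL'] * 4,
                  'price': [188.92, 197.44, 268.13, 278.06]})
mask = d.groupby('ticker')['fyear'].shift(1) == d['fyear'] - 1
d['return'] = d.groupby('ticker')['price'].pct_change().where(mask)
# return -> NaN, 0.0451, NaN, 0.0370 (the gap year 2002 stays NaN)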
