MultiIndex column-wise from existing pandas dataframe columns - python

I am trying to reindex my pandas dataframe to a column-wise MultiIndex. Most answers I've explored seem to cover only the row-wise case. My current df looks like this:
ticker calendardate eps price ps revenue
0 ABNB 2019-12-31 -2.59 NaN NaN 4.80
1 ABNB 2020-12-31 -16.12 146.80 25.962 3.37
2 AMZN 2019-12-31 23.46 1847.84 3.266 2.80
3 AMZN 2020-12-31 42.64 3256.93 4.233 3.86
I want a MultiIndex based upon calendardate so that my output looks like this:
ticker eps price ps revenue
2019 2020 2019 2020 2019 2020 2019 2020
0 ABNB -2.59 -16.12 NaN 146.80 NaN 25.962 4.80 3.37
1 AMZN 23.46 42.64 1847.84 3256.93 3.266 4.233 2.80 3.86
Any help would be appreciated. Thanks

We can use str.split to split the column calendardate around the delimiter, then use str[0] to select the year portion of the split column. Now set the index of the dataframe to the column ticker along with the extracted year, followed by unstack to reshape.
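For reference, here is a minimal reproducible frame matching the sample above (assuming calendardate is stored as strings):

import pandas as pd

df = pd.DataFrame({
    'ticker': ['ABNB', 'ABNB', 'AMZN', 'AMZN'],
    'calendardate': ['2019-12-31', '2020-12-31', '2019-12-31', '2020-12-31'],
    'eps': [-2.59, -16.12, 23.46, 42.64],
    'price': [None, 146.80, 1847.84, 3256.93],
    'ps': [None, 25.962, 3.266, 4.233],
    'revenue': [4.80, 3.37, 2.80, 3.86],
})

With that in place: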
y = df['calendardate'].str.split('-', n=1).str[0]
df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
If the dtype of calendardate is already datetime, we can instead use:
y = df['calendardate'].dt.year
df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
eps price ps revenue
calendardate 2019 2020 2019 2020 2019 2020 2019 2020
ticker
ABNB -2.59 -16.12 NaN 146.80 NaN 25.962 4.8 3.37
AMZN 23.46 42.64 1847.84 3256.93 3.266 4.233 2.8 3.86
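If you prefer plain single-level column labels in the final result, the column MultiIndex can be flattened afterwards (a sketch, storing the unstacked frame in out first):

out = df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
# Collapse ('eps', '2019') -> 'eps_2019', etc.
out.columns = [f'{metric}_{year}' for metric, year in out.columns]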

Related

Convert 3 columns from dataframe to date

I have a dataframe like this:
I want to convert the 'start_year', 'start_month', 'start_day' columns to one date,
and the columns 'end_year', 'end_month', 'end_day' to another date.
Is there a way to do that?
Thank you.
Given a dataframe like this:
year month day
0 2019.0 12.0 29.0
1 2020.0 9.0 15.0
2 2018.0 3.0 1.0
You can convert them to date strings using a type cast and str.zfill:
df.apply(lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}', axis=1)
OUTPUT:
0 2019-12-29
1 2020-09-15
2 2018-03-01
dtype: object
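A vectorized variant (same assumption that the columns are floats, as in the sample) avoids the row-wise apply, which gets slow on large frames:

# Cast to int to drop the trailing .0, then zero-pad month and day
df['date'] = (df['year'].astype(int).astype(str) + '-'
              + df['month'].astype(int).astype(str).str.zfill(2) + '-'
              + df['day'].astype(int).astype(str).str.zfill(2))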
Here's an approach:
simulate some data, since your data was posted as an image
use apply on each row, building both dates with datetime.datetime()
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "start_year": np.random.choice(range(2018, 2022), 10),
        "start_month": np.random.choice(range(1, 13), 10),
        "start_day": np.random.choice(range(1, 28), 10),
        "end_year": np.random.choice(range(2018, 2022), 10),
        "end_month": np.random.choice(range(1, 13), 10),
        "end_day": np.random.choice(range(1, 28), 10),
    }
)

# Series.append was removed in pandas 2.0, so pd.concat attaches the two
# new date entries to each row
df = df.apply(
    lambda r: pd.concat([r, pd.Series({f"{startend}_date": dt.datetime(*(r[f"{startend}_{part}"]
                                                                         for part in ["year", "month", "day"]))
                                       for startend in ["start", "end"]})]),
    axis=1)
df
   start_year  start_month  start_day  end_year  end_month  end_day           start_date             end_date
0        2018            9          6      2020          1        3  2018-09-06 00:00:00  2020-01-03 00:00:00
1        2018           11          6      2020          7        2  2018-11-06 00:00:00  2020-07-02 00:00:00
2        2021            8         13      2020         11        2  2021-08-13 00:00:00  2020-11-02 00:00:00
3        2021            3         15      2021          3        6  2021-03-15 00:00:00  2021-03-06 00:00:00
4        2019            4         13      2021         11        5  2019-04-13 00:00:00  2021-11-05 00:00:00
5        2021            2          5      2018          8       17  2021-02-05 00:00:00  2018-08-17 00:00:00
6        2020            4         19      2020          9       18  2020-04-19 00:00:00  2020-09-18 00:00:00
7        2020            3         27      2020         10       20  2020-03-27 00:00:00  2020-10-20 00:00:00
8        2019           12         23      2018          5       11  2019-12-23 00:00:00  2018-05-11 00:00:00
9        2021            7         18      2018          5       10  2021-07-18 00:00:00  2018-05-10 00:00:00
An interesting feature of the pandas to_datetime function is that instead of
a sequence of strings you can pass it a whole DataFrame.
In this case there is a requirement that such a DataFrame must have columns
named year, month and day. They can also be of float type, like your source
DataFrame sample.
So a quite elegant solution is to:
take a part of the source DataFrame (the 3 columns with the respective year,
month and day),
rename its columns to year, month and day,
use it as the argument to to_datetime,
save the result as a new column.
To do it, start from defining a lambda function, to be used as the rename
function below:
colNames = lambda x: x.split('_')[1]
Then just call:
df['Start'] = pd.to_datetime(df.loc[:, 'start_year':'start_day']
                             .rename(columns=colNames))
df['End'] = pd.to_datetime(df.loc[:, 'end_year':'end_day']
                           .rename(columns=colNames))
For a sample of your source DataFrame, the result is:
start_year start_month start_day evidence_method_dating end_year end_month end_day Start End
0 2019.0 12.0 9.0 Historical Observations 2019.0 12.0 9.0 2019-12-09 2019-12-09
1 2019.0 2.0 18.0 Historical Observations 2019.0 7.0 28.0 2019-02-18 2019-07-28
2 2018.0 7.0 3.0 Seismicity 2019.0 8.0 20.0 2018-07-03 2019-08-20
Maybe the next step should be to remove the columns holding the parts of both
"start" and "end" dates, as sketched below. Your choice.
Edit
To avoid saving the lambda (anonymous) function under a variable, define
this function as a regular (named) function:
def colNames(x):
    return x.split('_')[1]

How to split pandas data with years in index and months in columns (percent change)

I'm trying to get percentage changes per month/year so that I see years in index and months in columns.
Here's how the original data looks like:
time
2009-12-31 1.692868
2010-01-03 1.693478
2010-01-04 1.681354
2010-01-05 1.681792
2010-01-06 1.676942
2010-01-07 1.685896
2010-01-08 1.675619
2010-01-09 1.675620
2010-01-10 1.671965
2010-01-11 1.668323
I have further used the following formula to obtain monthly percentage change.
prices.resample("M").ffill().pct_change().apply(lambda x: round(x*100,2))
Here's the data I received:
time
2009-12-31 NaN
2010-01-31 1.32
2010-02-28 0.48
2010-03-31 -0.49
2010-04-30 0.11
2010-05-31 4.45
2010-06-30 -1.30
2010-07-31 -4.09
2010-08-31 1.08
2010-09-30 -3.72
2010-10-31 -1.91
2010-11-30 2.93
2010-12-31 -3.42
2011-01-31 0.14
2011-02-28 -0.83
2011-03-31 -0.40
2011-04-30 -3.91
2011-05-31 0.88
2011-06-30 -0.34
2011-07-31 -2.66
However, my final goal is to have the percentage change for each month, with years in the index and months in the columns. How can I do it? I would appreciate any advice. Also, I am wondering how to build a similar heatmap with matplotlib.
Here's an example of what I need.
You can obtain your graph directly without the first step:
As your example is a bit short, let's use this dummy one:
import numpy as np
import pandas as pd

dates = pd.date_range('2009-01-01', '2020-12-31')
df = pd.DataFrame({'time': dates,
                   'value': np.random.random(size=len(dates)),
                   }).set_index('time')
>>> print(df)
value
time
2009-01-01 0.661110
2009-01-02 0.757710
2009-01-03 0.490736
2009-01-04 0.148575
2009-01-05 0.715500
... ...
2020-12-27 0.715620
2020-12-28 0.171634
2020-12-29 0.412512
2020-12-30 0.814756
2020-12-31 0.427202
[4383 rows x 1 columns]
processing:
Here we'll use pivot_table to reshape the data, taking the mean of the values, then we'll apply pct_change:
# just to get month names in order
cols = pd.date_range('2020', '2021', freq='M').month_name().str[:3]

df2 = (df.assign(year=df.index.year,
                 month=df.index.month_name().str[:3],
                 )
         .pivot_table(index='year', columns='month', values='value', fill_value=0)
         .pct_change()
         [cols]
       )
output:
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
year
2009 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010 0.214070 0.065160 -0.073928 -0.013328 0.145379 -0.000346 0.118528 0.069972 0.037107 -0.249954 -0.244608 -0.087839
2011 0.031421 0.126091 -0.032321 -0.000812 0.004430 -0.084645 0.026099 0.020069 0.073262 0.087346 0.228409 0.093019
2012 -0.095717 -0.248492 0.094027 0.173968 0.307899 0.075966 -0.200719 0.030460 -0.185117 0.107859 -0.090682 -0.109882
2013 0.075015 0.242875 -0.049873 -0.195608 -0.144225 0.017974 0.311462 0.041423 0.277412 -0.113914 0.175273 0.045258
2014 0.018353 -0.113219 0.189669 0.064207 0.036269 0.022477 -0.232103 0.109159 -0.103024 -0.088224 -0.159047 0.067562
2015 -0.094678 0.186993 -0.128900 0.074652 0.054206 0.093470 0.111634 -0.053931 0.034411 -0.088852 0.181860 -0.055049
2016 0.057190 0.029102 0.011317 -0.051180 -0.181694 -0.084899 0.013056 -0.078995 -0.198341 0.377086 -0.096291 -0.181843
2017 -0.161556 -0.059750 -0.051224 -0.202536 0.165222 -0.086402 0.116095 -0.029666 0.224123 -0.010386 -0.081571 0.381159
2018 0.109618 -0.004155 -0.007470 0.251640 -0.100422 -0.113325 -0.161298 -0.107079 0.023862 -0.029307 0.070167 -0.144116
2019 0.027455 -0.189825 0.142514 -0.037071 0.100118 0.157974 0.020722 0.022490 -0.187602 0.168074 0.187713 0.209489
2020 0.014801 0.310334 -0.037249 -0.010381 -0.231910 0.012961 0.128481 -0.083552 0.186090 -0.055755 -0.102882 -0.020587
plotting
Let's use seaborn.heatmap to plot with the "vlag" colormap (blue/red is much better than green/red for colorblind readers):
import seaborn as sns
ax = sns.heatmap(df2, cmap='vlag', center=0, annot=True, fmt='.2f')
ax.figure.set_size_inches(8, 6)
I had a similar objective a while ago and solved it using the groupby method. You do have to use an aggregation function (I used mean()), but since you already have only a single value per month and year, the aggregation is a no-op; the method is a bit inefficient here, yet it still works for your case.
df["month"] = df.index.month
# Have a dataframe named df with a month column, and group it by year
df2 = df.groupby([df.index.year, 'month'])[1].mean().unstack()
print(df2)
This gives this output:
month 1 2 3 4 5 6 7 8 9 10 11 12
0
2009 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010 1.32 0.48 -0.49 0.11 4.45 -1.30 -4.09 1.08 -3.72 -1.91 2.93 -3.42
2011 0.14 -0.83 -0.40 -3.91 0.88 -0.34 -2.66 NaN NaN NaN NaN NaN
Now I do not use matplotlib, but using plotly you can make a heat-map quite easily from this:
import plotly.graph_objects as go

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
times = list(df.index.year.unique())

fig2 = go.Figure(data=go.Heatmap(
    z=df2,
    x=months,
    y=times))
fig2.update_xaxes(title_text='Months')
fig2.update_yaxes(title_text='Years', dtick=1)
fig2.show()
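If you do want to stay with matplotlib, the same matrix can be rendered with imshow (a minimal sketch, reusing df2, months and times from above):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(df2, cmap='RdBu_r', aspect='auto')  # NaN cells are left blank
ax.set_xticks(range(12))
ax.set_xticklabels(months, rotation=45, ha='right')
ax.set_yticks(range(len(times)))
ax.set_yticklabels(times)
fig.colorbar(im, ax=ax, label='% change')
plt.show()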

Pandas Dataframe : Assigning conditional values to each dataframe row on values greater than or lower than

Currently I have a dataframe which has a stock ticker and return:
ticker_name return
"AAPL 2020" -15%
"AAPL 2019" 20%
"AAPL 2018" 40%
"AAPL 2017" 30%
"AAPL 2016" -10%
....
I also have the data of the index return in yearly format for the last x years. With this data I want to tag whether a specific stock has an above-market return.
sp_500_year return
"2020" -30%
"2019" 10%
"2018" 10%
"2017" 10%
"2016" 20%
....
The expected output is a new column of boolean-style tags: 1 if the stock has an above-market return, and 0 if it underperforms the market.
ticker_name return above_market
"AAPL 2020" -15% 1
"AAPL 2019" 20% 1
"AAPL 2018" 40% 1
"AAPL 2017" 30% 1
"AAPL 2016" -10% 0
....
I found a similar question to mine; however, that question compares strings and has only two possible inputs (BULL & BEAR), while mine is a float and varies (index return).
Pandas: if row in column A contains "x", write "y" to row in column B
First I would suggest splitting your ticker_name column into two columns.
Below, df_ticker stores your first DataFrame with the stock ticker, year and return. Here the Series.str.split() method is used to split the ticker from the year; then we use the trick described in Python pandas split column list into multiple columns.
df_ticker = df_ticker.join(pd.DataFrame(df_ticker['ticker_name'].str.split().tolist(), columns=['ticker', 'year']))
Two new columns are created:
ticker_name return ticker year
0 AAPL 2020 -0.15 AAPL 2020
1 AAPL 2019 0.20 AAPL 2019
2 AAPL 2018 0.40 AAPL 2018
3 AAPL 2017 0.30 AAPL 2017
4 AAPL 2016 -0.10 AAPL 2016
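As a side note, the split-and-join above can also be written more directly with expand=True (a sketch):

df_ticker[['ticker', 'year']] = df_ticker['ticker_name'].str.split(expand=True)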
Then I would suggest setting the year as the index, this will ease comparisons between DataFrames.
df_ticker = df_ticker.set_index('year')
which gives:
ticker_name return ticker
year
2020 AAPL 2020 -0.15 AAPL
2019 AAPL 2019 0.20 AAPL
2018 AAPL 2018 0.40 AAPL
2017 AAPL 2017 0.30 AAPL
2016 AAPL 2016 -0.10 AAPL
And
df_index = df_index.set_index('sp_500_year')
which gives:
return
sp_500_year
2020 -0.3
2019 0.1
2018 0.1
2017 0.1
2016 0.2
Now you can safely compare both DataFrames:
df_ticker['above_index'] = df_ticker['return'] > df_index['return']
ticker_name return ticker above_index
year
2020 AAPL 2020 -0.15 AAPL True
2019 AAPL 2019 0.20 AAPL True
2018 AAPL 2018 0.40 AAPL True
2017 AAPL 2017 0.30 AAPL True
2016 AAPL 2016 -0.10 AAPL False
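If you need the 1/0 tags shown in the expected output rather than True/False, a final cast does it:

df_ticker['above_market'] = (df_ticker['return'] > df_index['return']).astype(int)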

Pandas merge two dataframes on column with different length

I am working on a trading algo but I have some issues when trying to combine the buy_orders and sell_orders dataframes into a single dataframe, orders.
The issue shows itself on the buy_order date 2021-01-21, where my algo made a recommendation to buy, but there is no sell order yet because the signal hasn't been spotted; these entries should therefore be NaN when merged.
If I were to join on index, the sell order would be the sell order for a different stock from the sell_orders dataframe.
buy_orders dataframe
Date_buy Name Stock_Price_buy Order
26 2020-07-30 AAPL 96.19 buy
27 2020-09-30 AAPL 115.81 buy
28 2020-11-05 AAPL 119.03 buy
29 2020-11-30 AAPL 119.05 buy
30 2021-01-21 AAPL 136.87 buy
31 2020-10-11 ABBV 21.21 buy
sell_orders dataframe
Date_sell Name Stock_Price_sell Order
25 2020-07-20 AAPL 98.36 sell
26 2020-09-02 AAPL 131.40 sell
27 2020-10-20 AAPL 117.51 sell
28 2020-11-20 AAPL 117.34 sell
29 2021-01-04 AAPL 129.41 sell
30 2020-10-15 ABBV 24.23 sell
Ideal result would be the orders dataframe as demonstrated below.
Index Buy_date Name_x Stock_Price_buy Order_x Sell_date Name_y Stock_Price_buy Order_y
26 2020-07-30 AAPL 96.19 buy 2020-09-02 AAPL 131.40 sell
27 2020-09-30 AAPL 115.81 buy 2020-10-20 AAPL 117.51 sell
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL 117.34 sell
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL 129.41 sell
30 2021-01-21 AAPL 136.87 buy NaN NaN NaN NaN
Here's what the orders dataframe looks like now, where buy_orders.Name_x and sell_orders.Name_y differ for the first time. The ABBV sell order should have been NaNs:
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL 117.34 sell
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL 129.41 sell
30 2021-01-21 AAPL 136.87 buy 2018-05-24 ABBV 24.23 sell
Have you thought of join, declaring the suffixes as follows?
buy_orders.join(sell_orders,lsuffix='_buy', rsuffix='_sell')
Date_buy Name_buy Stock_Price_buy Order_buy Date_sell Name_sell \
26 2020-07-30 AAPL 96.19 buy 2020-09-02 AAPL
27 2020-09-30 AAPL 115.81 buy 2020-10-20 AAPL
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL
30 2021-01-21 AAPL 136.87 buy NaN NaN
Stock_Price_sell Order_sell
26 131.40 sell
27 117.51 sell
28 117.34 sell
29 129.41 sell
30 NaN NaN
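Note that join pairs rows purely by index label, so this works because each buy and its matching sell happen to share an index per stock. If the indexes drift, a guard can blank out mismatched tickers afterwards (a sketch; assumes the joined frame is stored in orders):

import numpy as np

orders = buy_orders.join(sell_orders, lsuffix='_buy', rsuffix='_sell')
# Blank the sell columns wherever the ticker names disagree
mismatch = orders['Name_buy'] != orders['Name_sell']
orders.loc[mismatch, ['Date_sell', 'Name_sell', 'Stock_Price_sell', 'Order_sell']] = np.nan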
Assuming your data is structured so that it always starts with a buy and alternates with sell orders, has only one transaction per day, and always trades one-sized lots... you can use pd.concat. I made a simple dataframe that is kind of like yours (and in the future, it makes things easier if you include code to build a sample dataframe as part of your question):
buy_orders = pd.DataFrame.from_dict({'Date_buy': [pd.to_datetime('2020-11-01'), pd.to_datetime('2020-11-03'),
                                                  pd.to_datetime('2020-11-05'), pd.to_datetime('2020-11-08'),
                                                  pd.to_datetime('2020-11-10')],
                                     'Order': ['B', 'B', 'B', 'B', 'B'],
                                     'Name': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'ABBV'],
                                     'Stock_Price_buy': [1, 2, 3, 4, 5.0]})

sell_orders = pd.DataFrame.from_dict({'Date_sell': [pd.to_datetime('2020-11-02'), pd.to_datetime('2020-11-04'),
                                                    pd.to_datetime('2020-11-06'), pd.to_datetime('2020-11-12'),
                                                    pd.to_datetime('2020-11-22')],
                                      'Order': ['S', 'S', 'S', 'S', 'S'],
                                      'Name': ['AAPL', 'AAPL', 'AAPL', 'ABBV', 'ABBV'],
                                      'Stock_Price_sell': [23, 24, 25, 26, 5.0]})
You can first stack the two dataframes and sort them by date and ticker (after normalizing the column names):
buy_orders = buy_orders.rename(columns={'Date_buy' : "Date", "Stock_Price_buy" : "Price"})
sell_orders = sell_orders.rename(columns={'Date_sell' : "Date", "Stock_Price_sell" : "Price"})
df = pd.concat([buy_orders, sell_orders])
df = df.sort_values(['Date','Order']).reset_index(drop=True)
...then making a copy of the dataframe (changing the column names to keep them distinct in the later concat step):
df2 = df.copy()
df2.columns = [f"{c}_sell" for c in df.columns]
You then concatenate the two dataframes next to each other, but with a .shift(-1) on the second one so that they're offset
df3 = pd.concat([df, df2.shift(-1)], axis=1)
Finally, you can clean up the junk rows:
import numpy as np

# Rows where the shifted sell half belongs to a different ticker are junk
cut = (df3.Name != df3.Name_sell)
df3.loc[cut, 'Date_sell'] = np.nan
df3.loc[cut, 'Order_sell'] = np.nan
df3.loc[cut, 'Price_sell'] = np.nan
df3 = df3.drop(columns='Name_sell')
df3 = df3[df3.Order != "S"].reset_index(drop=True).copy()
That gives you something like
Date Order Name Price Date_sell Order_sell Price_sell
0 2020-11-01 B AAPL 1.0 2020-11-02 S 23.0
1 2020-11-03 B AAPL 2.0 2020-11-04 S 24.0
2 2020-11-05 B AAPL 3.0 2020-11-06 S 25.0
3 2020-11-08 B AAPL 4.0 NaT NaN NaN
4 2020-11-10 B ABBV 5.0 2020-11-12 S 26.0
You don't have to make all the intermediate dataframes, etc, but I left the code that way here so that if you paste things in a notebook you can look at the steps.

Pandas Panel Data - Identifying year gap and calculating returns

I am working with a large panel dataset of financial info, but the values are a bit spotty. I am trying to calculate the return between each year of each stock in my panel data. However, because of missing values firms sometimes have year gaps, so df['stock_ret'] = df.groupby(['tic'])['stock_price'].pct_change() cannot be used as-is, since it would compute returns across the gaps. The df looks something like this (just giving an example):
datadate month fyear ticker price
0 31/12/1998 12 1998 AAPL 188.92
1 31/12/1999 12 1999 AAPL 197.44
2 31/12/2002 12 2002 AAPL 268.13
3 31/12/2003 12 2003 AAPL 278.06
4 31/12/2004 12 2004 AAPL 288.35
5 31/12/2005 12 2005 AAPL 312.23
6 31/05/2008 5 2008 TSLA 45.67
7 31/05/2009 5 2009 TSLA 38.29
8 31/05/2010 5 2010 TSLA 42.89
9 31/05/2011 5 2011 TSLA 56.03
10 31/05/2014 5 2014 TSLA 103.45
.. ... .. .. .. ..
What I am looking for is a piece of code that would allow me to detect, for each individual firm, whether there is any gap in the data, and to calculate returns separately for the resulting series. Just like this:
datadate month fyear ticker price return
0 31/12/1998 12 1998 AAPL 188.92 NaN
1 31/12/1999 12 1999 AAPL 197.44 0.0451
2 31/12/2002 12 2002 AAPL 268.13 NaN
3 31/12/2003 12 2003 AAPL 278.06 0.0370
4 31/12/2004 12 2004 AAPL 288.35 0.0370
5 31/12/2005 12 2005 AAPL 312.23 0.0828
6 31/05/2008 5 2008 TSLA 45.67 NaN
7 31/05/2009 5 2009 TSLA 38.29 -0.1616
8 31/05/2010 5 2010 TSLA 42.89 0.1201
9 31/05/2011 5 2011 TSLA 56.03 0.3063
10 31/05/2014 5 2014 TSLA 103.45 NaN
.. ... .. .. .. ..
If you have any other suggestions on how to treat this problem, please feel free to share your knowledge :) I am a bit inexperienced so I am sure that your advice could help!
Thank you in advance guys!
You can create a mask that tells whether the previous year exists, and update only those rows with the pct change:
import numpy as np

df['return'] = np.nan
# True where the previous row within the same ticker is exactly one fiscal year earlier
mask = df.groupby('ticker')['fyear'].apply(lambda x: x.shift(1) == x - 1)
df.loc[mask, 'return'] = df.groupby('ticker')['price'].pct_change()
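An alternative that avoids the lambda is to build the mask with a per-ticker diff and keep pct_change only where years are consecutive (a sketch on the same assumptions, i.e. fyear is numeric and sorted within each ticker):

# diff is NaN at each ticker's first row and 1 only for consecutive years
consecutive = df.groupby('ticker')['fyear'].diff().eq(1)
df['return'] = df.groupby('ticker')['price'].pct_change().where(consecutive)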
