I keep getting stuck with this multi-level dataframe of stock prices that I'm trying to melt from wide to long format.
I'd like to go from this:
Attributes Close Close High
Symbols AMZN ^DJI AMZN
Date
2020-12-01 32 29 35
2020-11-30 31 28 36
2020-11-27 30 27 37
To this:
Attri Sym Date price
0 Close AMZN 2020-12-01 32
1 Close AMZN 2020-11-30 31
2 Close AMZN 2020-11-27 30
3 Close ^DJI 2020-12-01 29
4 Close ^DJI 2020-11-30 28
5 High AMZN 2020-12-01 35
6 ....
I tried:
df = df.reset_index()
df = df.set_index([('Date', '')]).rename_axis(index=None, columns=('Date', ''))
df = df.fillna('').set_index('Date').T\
.set_index('',append=True).stack().reset_index()
But I'm not getting it. Any ideas what else I should try?
This works with DataFrame.stack on both column levels, followed by Series.reset_index:
df = df.stack([0,1]).reset_index(name='price')
print (df)
Date Attributes Symbols price
0 2020-12-01 Close AMZN 32.0
1 2020-12-01 Close ^DJI 29.0
2 2020-12-01 High AMZN 35.0
3 2020-11-30 Close AMZN 31.0
4 2020-11-30 Close ^DJI 28.0
5 2020-11-30 High AMZN 36.0
6 2020-11-27 Close AMZN 30.0
7 2020-11-27 Close ^DJI 27.0
8 2020-11-27 High AMZN 37.0
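A version note: pandas 2.1 reworked stack and deprecated the old implementation, so on newer versions you may need to opt in explicitly. A minimal sketch, assuming a pandas version (2.1+) where the future_stack keyword exists:
# opt into the new stack implementation; note it keeps NaN rows,
# so an extra .dropna() may be needed to match the output above
df = df.stack([0, 1], future_stack=True).reset_index(name='price')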
Another idea, from a comment by @sammywemmy, is DataFrame.melt:
df = df.melt(ignore_index=False, value_name="price").reset_index()
print (df)
Date Attributes Symbols price
0 2020-12-01 Close AMZN 32
1 2020-11-30 Close AMZN 31
2 2020-11-27 Close AMZN 30
3 2020-12-01 Close ^DJI 29
4 2020-11-30 Close ^DJI 28
5 2020-11-27 Close ^DJI 27
6 2020-12-01 High AMZN 35
7 2020-11-30 High AMZN 36
8 2020-11-27 High AMZN 37
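If you want the rows ordered exactly like the expected output in the question (attribute, then symbol, with dates descending), a sort afterwards should do it. A small sketch using the column names produced above:
# order rows: Attributes/Symbols ascending, Date newest-first
df = df.sort_values(['Attributes', 'Symbols', 'Date'],
                    ascending=[True, True, False], ignore_index=True)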
I am a new Python user and have a question about filling NAs in a data frame.
Currently, I have a data frame with a series of monthly dates from 2022-08-01 to 2037-08-01.
However, after 2027-06-01 the pricing data stops, and I would like to extrapolate the values forward to fill out the rest of the dates. Essentially, I would like to take the last 12 months of prices and fill those forward for the rest of the data frame. I am thinking of some kind of groupby on month with fillna(method='ffill'), but when I do this it just fills the last value in the frame forward. Below is an example of my code.
In my actual data the values stop at 12/1/2023; I wish to fill the previous 12 values forward for the rest of the maturity dates, so all prices from 1/1/2023 to 12/1/2023 would be filled forward for all later months.
import pandas as pd

# monthly maturity dates from 2020-01 to 2022-01; only the first 12 rows
# get a price, so the remaining rows become NaN after the concat
mat = pd.DataFrame(pd.date_range('01/01/2020', '01/01/2022', freq='MS'))
prices = pd.DataFrame(['179.06', '174.6', '182.3', '205.59', '204.78', '202.19',
                       '216.17', '218.69', '220.73', '223.28', '225.16', '226.31'])
example = pd.concat([mat, prices], axis=1)
example.columns = ['maturity', 'price']
Output
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 NaN
13 2021-02-01 NaN
14 2021-03-01 NaN
15 2021-04-01 NaN
16 2021-05-01 NaN
17 2021-06-01 NaN
18 2021-07-01 NaN
19 2021-08-01 NaN
20 2021-09-01 NaN
21 2021-10-01 NaN
22 2021-11-01 NaN
23 2021-12-01 NaN
24 2022-01-01 NaN
Is this what you're looking for?
out = example.groupby(example.maturity.dt.month).ffill()
print(out)
Output:
maturity price
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 179.06
13 2021-02-01 174.6
14 2021-03-01 182.3
15 2021-04-01 205.59
16 2021-05-01 204.78
17 2021-06-01 202.19
18 2021-07-01 216.17
19 2021-08-01 218.69
20 2021-09-01 220.73
21 2021-10-01 223.28
22 2021-11-01 225.16
23 2021-12-01 226.31
24 2022-01-01 179.06
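This works because grouping by maturity.dt.month puts each calendar month into its own group, so ffill carries a month's last observed price forward into the same month of later years. One caveat: the sample code builds the prices as strings; if you need numeric prices afterwards, convert first. A minimal sketch, assuming the example frame from the question:
# the example's prices are strings; make them floats before any arithmetic
example['price'] = pd.to_numeric(example['price'])
out = example.groupby(example.maturity.dt.month).ffill()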
I am working on a trading algo, but I have some issues when trying to combine the buy_orders and sell_orders dataframes into a single dataframe, orders.
The issue shows itself on the buy order dated 2021-01-21: my algo recommended a buy, but there is no sell order yet because the sell signal hasn't appeared, so those fields should be NaN after the merge.
If I simply join on index, that buy gets paired with a sell order for a different stock from the sell_orders dataframe.
buy_orders dataframe
Date_buy Name Stock_Price_buy Order
26 2020-07-30 AAPL 96.19 buy
27 2020-09-30 AAPL 115.81 buy
28 2020-11-05 AAPL 119.03 buy
29 2020-11-30 AAPL 119.05 buy
30 2021-01-21 AAPL 136.87 buy
31 2020-10-11 ABBV 21.21 buy
sell_orders dataframe
Date_sell Name Stock_Price_sell Order
25 2020-07-20 AAPL 98.36 sell
26 2020-09-02 AAPL 131.40 sell
27 2020-10-20 AAPL 117.51 sell
28 2020-11-20 AAPL 117.34 sell
29 2021-01-04 AAPL 129.41 sell
30 2020-10-15 ABBV 24.23 sell
Ideal result would be the orders dataframe as demonstrated below.
Index Buy_date Name_x Stock_Price_buy Order_x Sell_date Name_y Stock_Price_sell Order_y
26 2020-07-30 AAPL 96.19 buy 2020-09-02 AAPL 131.40 sell
27 2020-09-30 AAPL 115.81 buy 2020-10-20 AAPL 117.51 sell
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL 117.34 sell
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL 129.41 sell
30 2021-01-21 AAPL 136.87 buy NaN NaN NaN NaN
Here's how the orders dataframe looks now, when Name_x and Name_y differ for the first time; the ABBV sell order should have been NaNs:
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL 117.34 sell
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL 129.41 sell
30 2021-01-21 AAPL 136.87 buy 2018-05-24 ABBV 24.23 sell
Have you thought of join, declaring suffixes as follows?
buy_orders.join(sell_orders,lsuffix='_buy', rsuffix='_sell')
Date_buy Name_buy Stock_Price_buy Order_buy Date_sell Name_sell \
26 2020-07-30 AAPL 96.19 buy 2020-09-02 AAPL
27 2020-09-30 AAPL 115.81 buy 2020-10-20 AAPL
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL
30 2021-01-21 AAPL 136.87 buy NaN NaN
Stock_Price_sell Order_sell
26 131.40 sell
27 117.51 sell
28 117.34 sell
29 129.41 sell
30 NaN NaN
Assuming your data always starts with a buy and alternates with sell orders, has only one transaction per day, and always uses one-sized lots per transaction... you can use pd.concat. I made a simple dataframe that is kind of like yours (in the future, it's easier if you include code to create a sample dataframe as part of your question):
buy_orders = pd.DataFrame.from_dict({'Date_buy': [ pd.to_datetime('2020-11-01'), pd.to_datetime('2020-11-03'),
pd.to_datetime('2020-11-05'), pd.to_datetime('2020-11-08'),
pd.to_datetime('2020-11-10')],
'Order' : ['B','B','B','B','B'],
'Name' : ['AAPL','AAPL','AAPL','AAPL','ABBV'],
'Stock_Price_buy' : [1,2,3,4,5.0]})
sell_orders = pd.DataFrame.from_dict({'Date_sell': [ pd.to_datetime('2020-11-02'), pd.to_datetime('2020-11-04'),
pd.to_datetime('2020-11-06'), pd.to_datetime('2020-11-12'),
pd.to_datetime('2020-11-22')],
'Order' : ['S','S','S','S','S'],
'Name' : ['AAPL','AAPL','AAPL','ABBV','ABBV'],
'Stock_Price_sell' : [23,24,25,26,5.0]})
You can first stack the two dataframes and sort them by date and ticker (after normalizing the column names):
buy_orders = buy_orders.rename(columns={'Date_buy' : "Date", "Stock_Price_buy" : "Price"})
sell_orders = sell_orders.rename(columns={'Date_sell' : "Date", "Stock_Price_sell" : "Price"})
df = pd.concat([buy_orders, sell_orders])
df = df.sort_values(['Date','Order']).reset_index(drop=True)
...then making a copy of the dataframe (changing the column names to keep them distinct in the later concat step):
df2 = df.copy()
df2.columns = [f"{c}_sell" for c in df.columns]
You then concatenate the two dataframes next to each other, but with a .shift(-1) on the second one so that they're offset:
df3 = pd.concat([df, df2.shift(-1)], axis=1)
Finally, you can clean up the junk rows:
import numpy as np

# blank out the sell-side fields wherever the shifted row belongs to a
# different ticker
cut = (df3.Name != df3.Name_sell)
df3.loc[cut, 'Date_sell'] = np.nan
df3.loc[cut, 'Order_sell'] = np.nan
df3.loc[cut, 'Price_sell'] = np.nan

# drop the helper column and the rows that are pure sell orders
df3 = df3.drop(columns='Name_sell')
df3 = df3[df3.Order != "S"].reset_index(drop=True).copy()
That gives you something like
Date Order Name Price Date_sell Order_sell Price_sell
0 2020-11-01 B AAPL 1.0 2020-11-02 S 23.0
1 2020-11-03 B AAPL 2.0 2020-11-04 S 24.0
2 2020-11-05 B AAPL 3.0 2020-11-06 S 25.0
3 2020-11-08 B AAPL 4.0 NaT NaN NaN
4 2020-11-10 B ABBV 5.0 2020-11-12 S 26.0
You don't have to make all the intermediate dataframes, etc, but I left the code that way here so that if you paste things in a notebook you can look at the steps.
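If the one-lot-at-a-time assumption holds, pd.merge_asof is another way to pair each buy with the nearest following sell per ticker, without the manual shift. A sketch under the same assumptions, using the renamed frames from above; note merge_asof does not enforce one-to-one pairing, so two buys with no sell between them would match the same sell:
# both frames must be sorted on their merge keys
buys = buy_orders.sort_values('Date')
sells = sell_orders.rename(columns={'Date': 'Date_sell'}).sort_values('Date_sell')

# for each buy, take the first strictly-later sell of the same ticker;
# buys with no later sell get NaT/NaN on the sell side
orders = pd.merge_asof(buys, sells,
                       left_on='Date', right_on='Date_sell',
                       by='Name', direction='forward',
                       allow_exact_matches=False,
                       suffixes=('_buy', '_sell'))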
I have some data which I'm trying to groupby "name" first and then resample by "transaction_date"
transaction_date name revenue
01/01/2020 ADIB 30419
01/01/2020 ADIB 1119372
01/01/2020 ADIB 1272170
01/01/2020 ADIB 43822
01/01/2020 ADIB 24199
The issue I have is that writing the groupby-resample in two different ways returns two different results:
1. df.groupby("name").resample("M", on="transaction_date").sum()[['revenue']].head(12)
2. df.groupby("name").resample("M", on="transaction_date").aggregate({'revenue':'sum'}).head(12)
The first method returns the values I'm looking for.
I don't understand why the two methods return different results. Is this a bug?
Result 1
name transaction_date revenue
ADIB 2020-01-31 39170943.0
2020-02-29 48003966.0
2020-03-31 32691641.0
2020-04-30 11979337.0
2020-05-31 35510726.0
2020-06-30 25677857.0
2020-07-31 12437122.0
2020-08-31 4348936.0
2020-09-30 10547188.0
2020-10-31 5287406.0
2020-11-30 4288930.0
2020-12-31 17066105.0
Result 2
name transaction_date revenue
ADIB 2020-01-31 64128331.0
2020-02-29 54450014.0
2020-03-31 45636192.0
2020-04-30 25016777.0
2020-05-31 11941744.0
2020-06-30 15703151.0
2020-07-31 5517526.0
2020-08-31 4092618.0
2020-09-30 4333433.0
2020-10-31 3944117.0
2020-11-30 6528058.0
2020-12-31 5718196.0
Indeed, it's either a bug or an extremely strange behavior. Consider the following data:
input:
date revenue name
0 2020-10-27 0.744045 n_1
1 2020-10-29 0.074852 n_1
2 2020-11-21 0.560182 n_2
3 2020-12-29 0.208616 n_2
4 2020-05-03 0.325044 n_0
gb = df.groupby("name").resample("M", on="date")
gb.aggregate({'revenue':'sum'})
==>
revenue
name date
n_0 2020-12-31 0.325044
n_1 2020-05-31 0.744045
2020-06-30 0.000000
2020-07-31 0.000000
2020-08-31 0.000000
2020-09-30 0.000000
2020-10-31 0.074852
n_2 2020-10-31 0.560182
2020-11-30 0.208616
print(gb.sum()[['revenue']])
==>
revenue
name date
n_0 2020-05-31 0.325044
n_1 2020-10-31 0.818897
n_2 2020-11-30 0.560182
2020-12-31 0.208616
As one can see, it seems that aggregate produces the wrong results. For example, it takes data from Oct and attaches it to May.
Here's an even simpler example:
Data frame:
date revenue name
0 2020-02-24 9 n_1
1 2020-05-12 8 n_2
2 2020-03-28 9 n_2
3 2020-01-14 2 n_0
gb = df.groupby("name").resample("M", on="date")
res1 = gb.sum()[['revenue']]
==>
name date
n_0 2020-01-31 2
n_1 2020-02-29 9
n_2 2020-03-31 9
2020-04-30 0
2020-05-31 8
res2 = gb.aggregate({'revenue':'sum'})
==>
name date
n_0 2020-05-31 2
n_1 2020-01-31 9
n_2 2020-02-29 8
2020-03-31 9
I opened a bug about it: https://github.com/pandas-dev/pandas/issues/35173
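Until that is fixed, a workaround is to skip the groupby-then-resample path entirely and bin the dates with pd.Grouper inside a single groupby. A sketch using the column names from the example above:
# one groupby with month-end bins; both aggregation spellings agree here
out = df.groupby(['name', pd.Grouper(key='date', freq='M')]).agg({'revenue': 'sum'})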
I'm new to pandas and I'm having some problems when I try to obtain daily averages from a data file.
So, my data is structured as follows:
DATA ESTACION
DATETIME
2020-01-15 00:00:00 175 47
2020-01-15 01:00:00 152 47
2020-01-15 02:00:00 180 47
2020-01-15 03:00:00 132 47
2020-01-15 04:00:00 115 47
... ... ...
2020-03-13 19:00:00 38 16
2020-03-13 20:00:00 53 16
2020-03-13 21:00:00 73 16
2020-03-13 22:00:00 28 16
2020-03-13 23:00:00 22 16
These are air pollution results gathered by 24 stations. Each station receives hourly information as you can see.
I'm trying to get daily average data by station. So this is what I do:
I group all info by station
grouped = data.groupby(['ESTACION'])
Then I get the daily average by resampling the grouped data:
resampled = grouped.resample('D').mean()
And this is what I've obtained:
DATA ESTACION
ESTACION DATETIME
4 2020-01-02 18.250000 4.0
2020-01-03 NaN NaN
2020-01-04 NaN NaN
2020-01-05 NaN NaN
2020-01-06 NaN NaN
... ... ...
60 2020-11-29 NaN NaN
2020-11-30 NaN NaN
2020-12-01 NaN NaN
2020-12-02 118.666667 60.0
2020-12-03 80.833333 60.0
I don't really know what's going on, because I've only got data for 2020-01-15 to 2020-03-13, yet it shows other timestamps and NaN results.
If you need anything else to reproduce this case let me know.
Thanks and best regards
The output is expected, because resample always creates a consecutive DatetimeIndex for each group.
So it's possible to remove the missing rows with DataFrame.dropna:
resampled = grouped.resample('D').mean().dropna()
Another solution is to group by the date part of the timestamps; since DATETIME is the index here, use index.date (Series.dt.date would apply if it were a column):
data.groupby(['ESTACION', data.index.date]).mean()
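A related alternative is pd.Grouper, which keeps real datetimes in the result and, when combined with another grouping key like this, avoids generating the empty daily bins. A sketch, assuming DATETIME is a DatetimeIndex:
# daily bins per station, taken directly from the index
resampled = data.groupby(['ESTACION', pd.Grouper(freq='D')]).mean()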
I have close prices of multiple stocks over multiple days in a dataframe like this.
In [67]: df
Out[67]:
Date Symbol Close
0 12/30/2016 AMZN 749.87
1 12/29/2016 AMZN 765.15
2 12/28/2016 AMZN 772.13
3 12/27/2016 AMZN 771.40
4 12/30/2016 GOOGL 792.45
5 12/29/2016 GOOGL 802.88
6 12/28/2016 GOOGL 804.57
7 12/27/2016 GOOGL 805.80
8 12/30/2016 NFLX 123.80
9 12/29/2016 NFLX 125.33
10 12/28/2016 NFLX 125.89
11 12/27/2016 NFLX 128.35
I would like to compute daily returns of these stocks using pandas. The output should looks like this:
Date Symbol Return
0 12/27/2016 AMZN NaN
1 12/28/2016 AMZN 0.000946
2 12/29/2016 AMZN -0.009040
3 12/30/2016 AMZN -0.019970
4 12/27/2016 GOOGL NaN
5 12/28/2016 GOOGL -0.001526
6 12/29/2016 GOOGL -0.002101
7 12/30/2016 GOOGL -0.012991
8 12/27/2016 NFLX NaN
9 12/28/2016 NFLX -0.019166
10 12/29/2016 NFLX -0.004448
11 12/30/2016 NFLX -0.012208
I got the above output using the following code, but I feel it can be simplified further.
In [70]: rtn = df.pivot("Date", "Symbol", "Close").pct_change().reset_index()
In [73]: pd.melt(rtn, id_vars='Date', value_vars=list(rtn.columns[1:]),var_name='Symbol',value_name='Return')
You can use sort_values first and then groupby with DataFrameGroupBy.pct_change:
df = df.sort_values(['Symbol','Date']).reset_index(drop=True)
df['Return'] = df.groupby('Symbol')['Close'].pct_change()
print (df)
Date Symbol Close Return
0 12/27/2016 AMZN 771.40 NaN
1 12/28/2016 AMZN 772.13 0.000946
2 12/29/2016 AMZN 765.15 -0.009040
3 12/30/2016 AMZN 749.87 -0.019970
4 12/27/2016 GOOGL 805.80 NaN
5 12/28/2016 GOOGL 804.57 -0.001526
6 12/29/2016 GOOGL 802.88 -0.002101
7 12/30/2016 GOOGL 792.45 -0.012991
8 12/27/2016 NFLX 128.35 NaN
9 12/28/2016 NFLX 125.89 -0.019166
10 12/29/2016 NFLX 125.33 -0.004448
11 12/30/2016 NFLX 123.80 -0.012208
You can set_index and unstack (which sorts the index for you), then pct_change and stack back:
print(
df.set_index(['Date', 'Symbol'])
.Close.unstack().pct_change()
.stack(dropna=False).reset_index(name='Return')
.sort_values(['Symbol', 'Date'])
.reset_index(drop=True)
)
Date Symbol Return
0 2016-12-27 AMZN NaN
1 2016-12-28 AMZN 0.000946
2 2016-12-29 AMZN -0.009040
3 2016-12-30 AMZN -0.019970
4 2016-12-27 GOOGL NaN
5 2016-12-28 GOOGL -0.001526
6 2016-12-29 GOOGL -0.002101
7 2016-12-30 GOOGL -0.012991
8 2016-12-27 NFLX NaN
9 2016-12-28 NFLX -0.019166
10 2016-12-29 NFLX -0.004448
11 2016-12-30 NFLX -0.012208
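A side note on the pivot in the question: on pandas 2.0+, DataFrame.pivot only accepts keyword arguments, so the first step would need to be spelled out. A sketch of the same operation:
# pandas >= 2.0: pivot arguments are keyword-only
rtn = df.pivot(index="Date", columns="Symbol", values="Close").pct_change().reset_index()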