I have a DataFrame that looks similar to this:
Date Close Open
AAP AWS BGG ... AAP AWS BGG ...
2020 10 50 13 ... 100 500 13 ...
2021 11 41 7 ... 111 41 7 ...
2022 12 50 13 ... 122 50 13 ...
and want to turn it into
Date Close Open Index2
2020 10 100 AAP
2021 11 111 AAP
2022 12 122 AAP
2020 50 500 AWS
...
How can I achieve it using pandas?
You can use set_index and stack to get the expected dataframe:
>>> (df.set_index('Date')
...    .stack(level=1)
...    .rename_axis(index=['Date', 'Ticker'])
...    .reset_index())
Date Ticker Close Open
0 2020 AAP 10 100
1 2020 AWS 50 500
2 2020 BGG 13 13
3 2021 AAP 11 111
4 2021 AWS 41 41
5 2021 BGG 7 7
6 2022 AAP 12 122
7 2022 AWS 50 50
8 2022 BGG 13 13
My input dataframe:
>>> df
Date Close Open
AAP AWS BGG AAP AWS BGG
0 2020 10 50 13 100 500 13
1 2021 11 41 7 111 41 7
2 2022 12 50 13 122 50 13
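For reference, a minimal sketch to reconstruct this sample input (assuming the Date column sits under an empty second column level, as the flattened headers above suggest):

import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('Date', ''),
     ('Close', 'AAP'), ('Close', 'AWS'), ('Close', 'BGG'),
     ('Open', 'AAP'), ('Open', 'AWS'), ('Open', 'BGG')])
df = pd.DataFrame([[2020, 10, 50, 13, 100, 500, 13],
                   [2021, 11, 41, 7, 111, 41, 7],
                   [2022, 12, 50, 13, 122, 50, 13]], columns=cols)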
You could also use wide_to_long:
pd.wide_to_long(df.set_axis(df.columns.map('_'.join).str.rstrip('_'), axis=1),
                ['Close', 'Open'], 'Date', 'Ticker', '_', '\\w+').reset_index()
Date Ticker Close Open
0 2020 AAP 10 100
1 2021 AAP 11 111
2 2022 AAP 12 122
3 2020 AWS 50 500
4 2021 AWS 41 41
5 2022 AWS 50 50
6 2020 BGG 13 13
7 2021 BGG 7 7
8 2022 BGG 13 13
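For clarity, the set_axis step just flattens the MultiIndex columns into single strings before wide_to_long splits them back apart on '_' (assuming Date has an empty second level, so 'Date_' becomes 'Date' after rstrip):

flat = df.set_axis(df.columns.map('_'.join).str.rstrip('_'), axis=1)
print(flat.columns.tolist())
# ['Date', 'Close_AAP', 'Close_AWS', 'Close_BGG', 'Open_AAP', 'Open_AWS', 'Open_BGG']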
I am new to data science and am working on a project to analyze sports statistics. I have a dataset of hockey statistics for a group of players over multiple seasons. Players have anywhere from 1 to 12 rows representing their season statistics over however many seasons they've played.
Example:
Player Season Pos GP G A P +/- PIM P/GP ... PPG PPP SHG SHP OTG GWG S S% TOI/GP FOW%
0 Nathan MacKinnon 2022 1 65 32 56 88 22 42 1.35 ... 7 27 0 0 1 5 299 10.7 21.07 45.4
1 Nathan MacKinnon 2021 1 48 20 45 65 22 37 1.35 ... 8 25 0 0 0 2 206 9.7 20.37 48.5
2 Nathan MacKinnon 2020 1 69 35 58 93 13 12 1.35 ... 12 31 0 0 2 4 318 11.0 21.22 43.1
3 Nathan MacKinnon 2019 1 82 41 58 99 20 34 1.21 ... 12 37 0 0 1 6 365 11.2 22.08 43.7
4 Nathan MacKinnon 2018 1 74 39 58 97 11 55 1.31 ... 12 32 0 1 3 12 284 13.7 19.90 41.9
5 Nathan MacKinnon 2017 1 82 16 37 53 -14 16 0.65 ... 2 14 2 2 2 4 251 6.4 19.95 50.6
6 Nathan MacKinnon 2016 1 72 21 31 52 -4 20 0.72 ... 7 16 0 1 0 6 245 8.6 18.87 48.4
7 Nathan MacKinnon 2015 1 64 14 24 38 -7 34 0.59 ... 3 7 0 0 0 2 192 7.3 17.05 47.0
8 Nathan MacKinnon 2014 1 82 24 39 63 20 26 0.77 ... 8 17 0 0 0 5 241 10.0 17.35 42.9
9 J.T. Compher 2022 2 70 18 15 33 6 25 0.47 ... 4 6 1 1 0 0 102 17.7 16.32 51.4
10 J.T. Compher 2021 2 48 10 8 18 10 19 0.38 ... 1 2 0 0 0 2 47 21.3 14.22 45.9
11 J.T. Compher 2020 2 67 11 20 31 9 18 0.46 ... 1 5 0 3 1 3 106 10.4 16.75 47.7
12 J.T. Compher 2019 2 66 16 16 32 -8 31 0.48 ... 4 9 3 3 0 3 118 13.6 17.48 49.2
13 J.T. Compher 2018 2 69 13 10 23 -29 20 0.33 ... 4 7 2 2 2 3 131 9.9 16.00 45.1
14 J.T. Compher 2017 2 21 3 2 5 0 4 0.24 ... 1 1 0 0 0 1 30 10.0 14.93 47.6
15 Darren Helm 2022 1 68 7 8 15 -5 14 0.22 ... 0 0 1 2 0 1 93 7.5 10.55 44.2
16 Darren Helm 2021 1 47 3 5 8 -3 10 0.17 ... 0 0 0 0 0 0 83 3.6 14.68 66.7
17 Darren Helm 2020 1 68 9 7 16 -6 37 0.24 ... 0 0 1 2 0 0 102 8.8 13.73 53.6
18 Darren Helm 2019 1 61 7 10 17 -11 20 0.28 ... 0 0 1 4 0 0 107 6.5 14.57 44.4
19 Darren Helm 2018 1 75 13 18 31 3 39 0.41 ... 0 0 2 4 0 0 141 9.2 15.57 44.1
[sample of my dataset][1]
[1]: https://i.stack.imgur.com/7CsUd.png
If any player has played more than 6 seasons, I want to drop the row corresponding to Season 2021. This is because COVID drastically shortened the season and it is causing issues as I work with averages.
As you can see from the screenshot, Nathan MacKinnon has played 9 seasons. Across those 9 seasons, except for 2021, he plays in no fewer than 64 games. Due to the shortened season of 2021, he only got 48 games.
Removing Season 2021 results in an Average Games Played of 73.75.
Keeping Season 2021 in the data, the Average Games Played becomes 70.89.
While not drastic, it compounds into the other metrics as well.
I have been trying this for a little while now, but as I mentioned, I am new to this world and am struggling to figure out how to accomplish this.
I don't want to just completely drop ALL rows for 2021 across all players, though, as some players only have 1-5 years' worth of data; for those players I need to use as much data as I can, and removing a row from a player with only 2 seasons would also negatively skew the averages.
I would really appreciate some assistance from anyone more experienced than me!
This can be accomplished by using groupby and apply. For example:
edited_players = (players
                  .groupby("Player")
                  .apply(lambda subset: subset if len(subset) <= 6
                                        else subset.query("Season != 2021"))
                  )
The surrounding round brackets are just for formatting purposes.
The combination of groupby and apply basically feeds a grouped subset of your dataframe to a function. So, first all the rows of Nathan MacKinnon will be used, then rows for J.T. Compher, then Darren Helm rows, etc.
The function used is an anonymous/lambda function which operates under the following logic: "if the dataframe subset that I receive has 6 or fewer rows, I'll return the subset unedited. Otherwise, I will filter out rows within that subset which have the value 2021 in the Season column".
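For reference, here is a mask-based sketch that avoids apply (assuming the same Player and Season column names as above):

# Count each player's rows, then drop 2021 only for players with more than 6 seasons.
season_counts = players.groupby("Player")["Season"].transform("size")
edited_players = players[~((season_counts > 6) & (players["Season"] == 2021))]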
How can I use the value from the same month in the previous year to fill values in the following table for 2020:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
The final desired result is the following:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
B 4 2020 40
A 4 2020 40
A 5 2020 20
A 6 2020 15
A 7 2020 17
A 8 2020 18
A 9 2020 12
A 10 2020 11
A 11 2020 19
A 12 2020 15
I tried using pandas groupby but not sure if that is the right approach.
IIUC, we can use pivot_table, then ffill within each Category, and stack back:
s = (df.pivot_table(index=['Category', 'Year'], columns='Month', values='Value')
       .groupby(level=0)
       .ffill()
       .stack()
       .reset_index())
Category Year level_2 0
0 A 2019 1 15.0
1 A 2019 2 90.0
2 A 2019 3 50.0
3 A 2019 5 20.0
4 A 2019 6 15.0
5 A 2019 7 17.0
6 A 2019 8 18.0
7 A 2019 9 12.0
8 A 2019 10 11.0
9 A 2019 11 19.0
10 A 2019 12 15.0
11 A 2020 1 18.0
12 A 2020 2 53.0
13 A 2020 3 80.0
14 A 2020 5 20.0
15 A 2020 6 15.0
16 A 2020 7 17.0
17 A 2020 8 18.0
18 A 2020 9 12.0
19 A 2020 10 11.0
20 A 2020 11 19.0
21 A 2020 12 15.0
22 B 2019 2 20.0
23 B 2019 4 40.0
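As shown above, the stacked columns come back as level_2 and 0, so an optional last step is to restore the original names:

s = s.rename(columns={'level_2': 'Month', 0: 'Value'})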
You can accomplish this with a combination of loc, concat, and drop_duplicates.
The idea here is to concatenate the dataframe with a copy of the 2019 data where year is changed to 2020, and then only keeping the first value for Category, Month, Year.
df2 = df.loc[df['Year'] == 2019, :].copy()
df2['Year'] = 2020
pd.concat([df, df2]).drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
Output
Category Month Year Value
0 A 1 2019 15
1 B 2 2019 20
2 A 2 2019 90
3 A 3 2019 50
4 B 4 2019 40
5 A 5 2019 20
6 A 6 2019 15
7 A 7 2019 17
8 A 8 2019 18
9 A 9 2019 12
10 A 10 2019 11
11 A 11 2019 19
12 A 12 2019 15
13 A 1 2020 18
14 A 2 2020 53
15 A 3 2020 80
1 B 2 2020 20
4 B 4 2020 40
5 A 5 2020 20
6 A 6 2020 15
7 A 7 2020 17
8 A 8 2020 18
9 A 9 2020 12
10 A 10 2020 11
11 A 11 2020 19
12 A 12 2020 15
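Since the copied 2019 rows keep their original index labels (note the repeated 1, 4, 5, ... above), you may want to finish with reset_index; here out is just an illustrative variable name:

out = (pd.concat([df, df2])
       .drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
       .reset_index(drop=True))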
I have a database in panel data form:
Date id variable1 variable2
2015 1 10 200
2016 1 17 300
2017 1 8 400
2018 1 11 500
2015 2 12 150
2016 2 19 350
2017 2 15 250
2018 2 9 450
2015 3 20 100
2016 3 8 220
2017 3 12 310
2018 3 14 350
And I have a list with the labels of the ID
List = ['Argentina', 'Brazil','Chile']
I want to replace the values of id with the labels from my list, so the result looks like this. Thanks in advance.
Date id variable1 variable2
2015 Argentina 10 200
2016 Argentina 17 300
2017 Argentina 8 400
2018 Argentina 11 500
2015 Brazil 12 150
2016 Brazil 19 350
2017 Brazil 15 250
2018 Brazil 9 450
2015 Chile 20 100
2016 Chile 8 220
2017 Chile 12 310
2018 Chile 14 350
map is the way to go, with enumerate:
d = {k:v for k,v in enumerate(List, start=1)}
df['id'] = df['id'].map(d)
Output:
Date id variable1 variable2
0 2015 Argentina 10 200
1 2016 Argentina 17 300
2 2017 Argentina 8 400
3 2018 Argentina 11 500
4 2015 Brazil 12 150
5 2016 Brazil 19 350
6 2017 Brazil 15 250
7 2018 Brazil 9 450
8 2015 Chile 20 100
9 2016 Chile 8 220
10 2017 Chile 12 310
11 2018 Chile 14 350
Try
df['id'] = df['id'].map({1: 'Argentina', 2: 'Brazil', 3: 'Chile'})
or
df['id'] = df['id'].map({k+1: v for k, v in enumerate(List)})
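One caveat to be aware of (an assumption about your data): map returns NaN for any id that has no entry in the mapping, so if that can happen you can keep the original value instead:

mapping = {k + 1: v for k, v in enumerate(List)}
df['id'] = df['id'].map(mapping).fillna(df['id'])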
Data
I have a dataset that shows up-to-date bookings data grouped by company and month (empty values are NaNs)
company month year_ly bookings_ly year_ty bookings_ty
company a 1 2018 432 2019 253
company a 2 2018 265 2019 635
company a 3 2018 345 2019 525
company a 4 2018 233 2019
company a 5 2018 7664 2019
... ... ... ... ... ...
company a 12 2018 224 2019 321
company b 1 2018 543 2019 576
company b 2 2018 23 2019 43
company b 3 2018 64 2019 156
company b 4 2018 143 2019
company b 5 2018 41 2019
company b 6 2018 90 2019
... ... ... ... ... ...
What I want
I'd like to create a new column, or update the bookings_ty column where its value is NaN (whichever is easier), that applies the following calculation for each row (grouped by company):
((SUM of previous 3 rows (or months) of bookings_ty)
/(SUM of previous 3 rows (or months) of bookings_ly))
* bookings_ly
Where a row's bookings_ty is NaN, I'd like later iterations of the formula to treat that newly calculated value as the row's bookings_ty, so essentially the formula should progressively populate the NaN values in bookings_ty.
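For example, for company a, month 4 would be filled with (253 + 635 + 525) / (432 + 265 + 345) * 233 ≈ 1.36 * 233 ≈ 316, and month 5 would then use that freshly filled value as part of its own 3-month sum.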
My attempt
df_bkgs.set_index(['operator', 'month'], inplace=True)
def calc(df_bkgs):
    df_bkgs['bookings_calc'] = df_bkgs['bookings_ty'].copy()
    df_bkgs['bookings_ty_l3m'] = df_bkgs.groupby(level=0)['bookings_ty'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_ly_l3m'] = df_bkgs.groupby(level=0)['bookings_ly'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_factor'] = df_bkgs['bookings_ty_l3m'] / df_bkgs['bookings_ly_l3m']
    df_bkgs['bookings_calc'] = df_bkgs['bookings_factor'] * df_bkgs['bookings_ly']
    return df_bkgs

df_bkgs.groupby(level=0).apply(calc)

import numpy as np
df['bookings_calc'] = np.where(df['bookings_ty'].isna(), df['bookings_calc'], df['bookings_ty'])
The issue with this code is that it generates the calculated field only for the first empty/NaN bookings_ty. What I'd like is an iterative/loop-type process that takes the previous 3 rows in the group and, where a previous row's bookings_ty is empty/NaN, uses that row's calculated field instead.
Thanks
You can try this. I made a function which finds the last 3 records in your dataframe for each row. Note that I had to create a column named index to do this, as (as far as I know) you can't access the index within an apply statement.
# dataframe is named f
company month year_ly bookings_ly year_ty bookings_ty
0 a 1 2018 432 2019 253.0
1 a 2 2018 265 2019 635.0
2 a 3 2018 345 2019 525.0
3 a 4 2018 233 2019 NaN
4 a 5 2018 7664 2019 NaN
5 a 12 2018 224 2019 321.0
6 b 1 2018 543 2019 576.0
7 b 2 2018 23 2019 43.0
8 b 3 2018 64 2019 156.0
9 b 4 2018 143 2019 NaN
10 b 5 2018 41 2019 NaN
11 b 6 2018 90 2019 NaN
f.reset_index(inplace=True)
def aggFunct(row, df, last=3):
    # sum bookings_ty over the `last` rows immediately before this row (NaNs count as 0)
    series = df.loc[(df['index'] < row['index'])
                    & (df['index'] >= row['index'] - last), 'bookings_ty'].fillna(0)
    ssum = series.sum()
    return ssum

f.loc[f['bookings_ty'].isna(), 'bookings_ty'] = f[f['bookings_ty'].isna()].apply(aggFunct, df=f, axis=1)
f.drop('index',axis=1,inplace=True)
f
company month year_ly bookings_ly year_ty bookings_ty
0 a 1 2018 432 2019 253.0
1 a 2 2018 265 2019 635.0
2 a 3 2018 345 2019 525.0
3 a 4 2018 233 2019 1413.0
4 a 5 2018 7664 2019 1160.0
5 a 12 2018 224 2019 321.0
6 b 1 2018 543 2019 576.0
7 b 2 2018 23 2019 43.0
8 b 3 2018 64 2019 156.0
9 b 4 2018 143 2019 775.0
10 b 5 2018 41 2019 199.0
11 b 6 2018 90 2019 156.0
Depending on how many companies you have in your table, I might be inclined to do this in Excel rather than in pandas. Iterating through the rows might be slow, but if speed is not a concern, the following solution should work:
import numpy as np
import pandas as pd
df = pd.read_excel('data_file.xlsx') # <-- name of your file.
companies = pd.unique(df.company)
months = pd.unique(df.month)
for c in companies:
    for m in months:
        # slice a single row
        df_row = df[(df['company'] == c) & (df['month'] == m)]
        val = df_row.bookings_ty.values[0]
        if np.isnan(val):
            # get the index of the row
            idx = df_row.index[0]
            df1 = df.copy()
            df1 = df1[(df1['company'] == c) & (df1['month'].isin(list(range(m - 3, m))))]
            ratio = df1.bookings_ty.sum() / df1.bookings_ly.sum()
            projected_value = df_row.bookings_ly.values[0] * ratio
            df.loc[idx, 'bookings_ty'] = projected_value
        else:
            pass
print(df)
If we can assume that the DataFrame is always sorted by 'company' and then by 'month', we can use the following approach; it gives roughly a 20-fold improvement (0.003 s vs. 0.07 s) with my sample data of 24 rows.
df = pd.read_excel('data_file.xlsx') # your input file
ly = df.bookings_ly.values.tolist()
ty = df.bookings_ty.values.tolist()
for val in ty:
    if np.isnan(val):
        idx = ty.index(val)  # returns the index of the first 'nan' found
        ratio = sum(ty[idx-3:idx]) / sum(ly[idx-3:idx])
        ty[idx] = ratio * ly[idx]
df['bookings_ty'] = ty
Here is a solution:
import numpy as np
import pandas as pd
#sort values if not
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)
def process(x):
    while x['bookings_ty'].isnull().any():
        x['bookings_ty'] = np.where(x['bookings_ty'].isnull(),
                                    (x['bookings_ty'].shift(1) +
                                     x['bookings_ty'].shift(2) +
                                     x['bookings_ty'].shift(3)) /
                                    (x['bookings_ly'].shift(1) +
                                     x['bookings_ly'].shift(2) +
                                     x['bookings_ly'].shift(3)) *
                                    x['bookings_ly'], x['bookings_ty'])
    return x
df = df.groupby(['company']).apply(lambda x: process(x))
#convert to int64 if needed or stay with float values
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
initial DF:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 315 **
4 company_a 5 2018 7664 2019 13418 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
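To trace the ** values: month 4 of company_a is (253 + 635 + 525) / (432 + 265 + 345) * 233 ≈ 315.98, shown as 315 after the int conversion, and month 5 then reuses that freshly filled value: (635 + 525 + 315.98) / (265 + 345 + 233) * 7664 ≈ 13418.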
In case you want a different rolling window, or a NaN value might occur at the beginning of a company's rows, you could use this generic solution:
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)
def process(x, m):
    idx = (x.loc[x['bookings_ty'].isnull()].index.to_list())
    for i in idx:
        id = i - x.index[0]
        start = 0 if id < m else id - m
        sum_ty = sum(x['bookings_ty'].to_list()[start:id])
        sum_ly = sum(x['bookings_ly'].to_list()[start:id])
        ly = x.at[i, 'bookings_ly']
        x.at[i, 'bookings_ty'] = sum_ty / sum_ly * ly
    return x
rolling_month = 3
df = df.groupby(['company']).apply(lambda x: process(x, rolling_month))
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
initial df:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253.0
1 company_a 2 2018 265 2019 635.0
2 company_a 3 2018 345 2019 NaN
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321.0
6 company_b 1 2018 543 2019 576.0
7 company_b 2 2018 23 2019 43.0
8 company_b 3 2018 64 2019 156.0
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
final result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 439 ** computed using only the 2 previous rows
3 company_a 4 2018 233 2019 296 **
4 company_a 5 2018 7664 2019 12467 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
If you want to speed up the process, you could try:
df.set_index(['company'], inplace=True)
df = df.groupby(level=(0)).apply(lambda x: process(x))
instead of
df = df.groupby(['company']).apply(lambda x: process(x))
I have a pandas DataFrame with 3 columns: date (from 1/1/2018 up until 8/23/2019), column A, and column B.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')
df is as follows:
date A B
2018-01-01 7 4
2018-01-02 5 4
2018-01-03 3 1
2018-01-04 9 3
2018-01-05 7 8
2018-01-06 0 0
2018-01-07 6 8
2018-01-08 3 7
...
...
...
2019-08-18 1 0
2019-08-19 8 1
2019-08-20 5 9
2019-08-21 0 7
2019-08-22 3 6
2019-08-23 8 6
I want monthly accumulated values of column A and monthly averaged values of column B. The final output will be a df with 20 rows (12 months of 2018 and 8 months of 2019) and 4 columns: month number, year number, monthly accumulated values of column A, and monthly averaged values of column B, just like below:
month year monthly_accumulated_of_A monthly_averaged_of_B
0 1 2018 176 1.747947
1 2 2018 110 2.399476
2 3 2018 131 3.976747
3 4 2018 227 2.314923
4 5 2018 234 0.464097
5 6 2018 249 1.662753
6 7 2018 121 1.588865
7 8 2018 165 2.318268
8 9 2018 219 1.060595
9 10 2018 131 0.577268
10 11 2018 179 3.948414
11 12 2018 115 1.750346
12 1 2019 190 3.364003
13 2 2019 215 0.864792
14 3 2019 231 3.219739
15 4 2019 186 2.904413
16 5 2019 232 0.324695
17 6 2019 163 1.334139
18 7 2019 238 1.670644
19 8 2019 112 1.316442
How can I achieve this in pandas?
Use DataFrameGroupBy.agg with DatetimeIndex.month and DatetimeIndex.year; for ordering add sort_index, and last use reset_index to turn the MultiIndex levels into columns:
import pandas as pd
import numpy as np
np.random.seed(2018)
#changed 300 to 600
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')
df1 = (df.groupby([df.index.month.rename('month'),
df.index.year.rename('year')])
.agg({'A':'sum', 'B':'mean'})
.sort_index(level=['year', 'month'])
.reset_index())
print (df1)
month year A B
0 1 2018 147 4.838710
1 2 2018 120 3.678571
2 3 2018 114 4.387097
3 4 2018 143 3.800000
4 5 2018 124 3.870968
5 6 2018 129 4.700000
6 7 2018 143 3.935484
7 8 2018 118 5.483871
8 9 2018 150 5.500000
9 10 2018 139 4.225806
10 11 2018 136 4.933333
11 12 2018 141 4.548387
12 1 2019 137 4.709677
13 2 2019 120 4.964286
14 3 2019 167 4.935484
15 4 2019 121 4.200000
16 5 2019 133 4.129032
17 6 2019 140 5.066667
18 7 2019 189 4.677419
19 8 2019 100 3.695652
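If you also want the column names from the question, a final rename would do it:

df1 = df1.rename(columns={'A': 'monthly_accumulated_of_A',
                          'B': 'monthly_averaged_of_B'})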