Impute values from pandas row with specific identifier to all other rows - python

I have this dataframe (sorry, not sure how to format it nicely here):
SRC SRCDate Ticker Coupon Vintage Bal ($bn) WAC WAM WALA LNSZ ... FICO Refi% Month_Assessed CPR Month_key SRC_year SRC_month Year Month Interest_Rate
JPM 02/05/2021 FNCI 1.5 2020 28.7 2.25 175 4 293 / 286 ... 777 91 Apr 7.536801 M+2 2021 2 2021 2 2.24
JPM 03/05/2021 FNCI 1.5 2020 28.7 2.25 175 4 293 / 286 ... 777 91 Apr 5.131145 M+1 2021 3 2021 3 2.39
JPM 04/07/2021 FNCI 1.5 2020 28 2.25 173 6 292 / 281 ... 777 91 Apr 7.233214 M 2021 4 2021 4 2.36
JPM 05/07/2021 FNCI 1.5 2020 27.6 2.25 171 7 292 / 279 ... 777 91 Apr 8.900000 M-1 2021 5 2021 5 2.28
And use this code:
cols = ['SRC_year','Bal ($bn)', 'WAC', 'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate']
jpm_2021[cols] = jpm_2021[cols].apply(pd.to_numeric, downcast='float', errors='coerce')
for col in cols:
    jpm_2021[col] = jpm_2021.groupby(['SRC_year', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed'])[col].transform('mean')
to normalize the values of all the cols to their respective means within the grouping defined in the groupby. The reason for this is to be able to create a pivoted table with this code:
jpm_final = jpm_2021.pivot_table(index=['SRC', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed', 'Bal ($bn)', 'WAC', 'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate'],
                                 columns="Month_key", values="CPR").rename_axis(columns=None).reset_index()
The problem is that taking the mean of all of those columns (especially Interest_Rate) renders the resulting table less than insightful. Instead, what I'd like to do is copy the values from the rows where Month_key is M to all the other rows that share the same grouping defined in the groupby above. Any tips on how to do that?
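One way to do that (a sketch, untested against the full dataset): blank out each column everywhere except the Month_key == 'M' row, then broadcast that value across the group with transform('first'), which skips the NaNs created by where(). If a group happens to have no M row, the whole group ends up NaN, so that case may need extra handling.
group_cols = ['SRC_year', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed']
for col in cols:
    # keep the value only on the Month_key == 'M' row of each group
    masked = jpm_2021[col].where(jpm_2021['Month_key'] == 'M')
    # spread that single value to every row of the group
    jpm_2021[col] = masked.groupby([jpm_2021[c] for c in group_cols]).transform('first')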

Related

Python Pandas convert selective columns into rows

My dataset has some information about price and sales for different years. The problem is each year is actually a different column header for price and for sales as well. For example the CSV looks like
Items  Price in 2018  Price in 2019  Price in 2020  Sales in 2018  Sales in 2019  Sales in 2020
A      100            120            135            5000           6000           6500
B      110            130            150            2000           4000           4500
C      150            110            175            1000           3000           3000
I want to show it something like this
Items  Year  Price  Sales
A      2018  100    5000
A      2019  120    6000
A      2020  135    6500
B      2018  110    2000
B      2019  130    4000
B      2020  150    4500
C      2018  150    1000
C      2019  110    3000
C      2020  175    3000
I used melt function from Pandas like this
df.melt(id_vars = ['Items'], var_name="Year", value_name="Price")
But I'm struggling to get separate columns for Price and Sales, as this puts both of them in a single value column. Thanks
Let us try pandas wide_to_long:
pd.wide_to_long(df, i='Items', j='year',
                stubnames=['Price', 'Sales'],
                suffix=r'\d+', sep=' in ').sort_index()
Price Sales
Items year
A 2018 100 5000
2019 120 6000
2020 135 6500
B 2018 110 2000
2019 130 4000
2020 150 4500
C 2018 150 1000
2019 110 3000
2020 175 3000
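(Appending .reset_index() to the wide_to_long result flattens Items/year back into ordinary columns, matching the desired layout.)
If you prefer to stay with melt, here is a hedged alternative sketch using the column names from the example above: melt everything into one long frame, split labels like 'Price in 2018' into a measure and a year, then pivot the measure back out into separate Price and Sales columns.
long_df = df.melt(id_vars='Items', var_name='var', value_name='value')
# 'Price in 2018' -> measure 'Price', year '2018'
long_df[['Measure', 'Year']] = long_df['var'].str.split(' in ', expand=True)
out = (long_df.pivot_table(index=['Items', 'Year'], columns='Measure', values='value')
              .reset_index()
              .rename_axis(columns=None))
print(out)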

How to assign the groupby results to a series in pandas

I have a df which looks like this:
Date Value
2020 0
2020 100
2020 200
2020 300
2021 100
2021 150
2021 0
I want to get the average of the grouped Value by Date where Value > 0. When I tried:
df['Yearly AVG'] = df[df['Value']>0].groupby('Date')['Value'].mean()
I get NaN values. When I print just the right-hand side of the line above, I get what I need, but indexed by Date:
Date
2020 200
2021 125
How Can I have the following:
Date Value Yearly AVG
2020 0 200
2020 100 200
2020 200 200
2020 300 200
2021 100 125
2021 150 125
2021 0 125
Here the trick is to replace the non-matching values with missing values and then use GroupBy.transform to fill the new column with the aggregated values:
df['Yearly AVG'] = df['Value'].where(df['Value']>0).groupby(df['Date']).transform('mean')
print (df)
Date Value Yearly AVG
0 2020 0 200.0
1 2020 100 200.0
2 2020 200 200.0
3 2020 300 200.0
4 2021 100 125.0
5 2021 150 125.0
6 2021 0 125.0
Detail:
print (df['Value'].where(df['Value']>0))
0 NaN
1 100.0
2 200.0
3 300.0
4 100.0
5 150.0
6 NaN
Name: Value, dtype: float64
Your solution should be changed to map the aggregated means back by Date:
df['Yearly AVG'] = df['Date'].map(df[df['Value']>0].groupby('Date')['Value'].mean())
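The reason the original assignment produced NaN is that groupby().mean() returns a Series indexed by Date, which doesn't align with the frame's default integer index; map() instead looks each row's Date up in that Series. A small illustration using the means shown above:
means = df[df['Value'] > 0].groupby('Date')['Value'].mean()
# Date
# 2020    200.0
# 2021    125.0
df['Yearly AVG'] = df['Date'].map(means)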

Pandas: aggregate and show percent difference

I have a dataframe that looks like this:
df = pd.DataFrame([
    ['BILING', 2017, 7, 1406],
    ['BILWPL', 2017, 7, 199],
    ['BKCLUB', 2017, 7, 9417],
    ['LEAVEN', 2017, 7, 4773],
    ['MAILORDER', 2017, 7, 10487]
], columns=['Branch', 'Year', 'Month', 'count'])
df
Out[1]:
Branch Year Month count
0 BILING 2017 7 1406
1 BILWPL 2017 7 199
2 BKCLUB 2017 7 9417
10 LEAVEN 2017 7 4773
18 MAILORDER 2017 7 10487
It contains the same month but different years so that one can compare the time of year across time.
The desired output would look something like:
Branch Month 2017 2019 Mean(ave) percent_diff
BILING 7 1406 1501 1480 5%
BILWPL 7 199 87 102 -40%
BKCLUB 7 9417 8002 7503 -3%
LEAVEN 7 4773 5009 4509 -15%
MAILORDER 7 10487 11032 9004 8%
My question is how to aggregate by Branch, spread the years across columns, and add 2 columns: the mean and the percent difference between the mean and the newest year.
**** UPDATE ****
This is close but is missing some columns [ Thanks G. Anderson ]:
df.pivot_table(values='count', index='Branch', columns='Year',
               fill_value=0, aggfunc='mean')
Produces:
Year 2017 2018 2019
Branch
BILING 1406 1280 4
BILWPL 199 117 239
BKCLUB 94 161 238
This is very close but I'm hoping to tack on columns corresponding to the mean, and percent difference.
**** UPDATE 2 ****
circ_pivot = df.pivot_table(values='count', index='Branch', columns='Year',
                            fill_value=0)
circ_pivot['Mean'] = circ_pivot[[2017, 2018, 2019]].mean(axis=1)
circ_pivot['Change'] = ((circ_pivot[2019] - circ_pivot[2018]) / circ_pivot[2018]) * 100
circ_pivot['Change_mean'] = ((circ_pivot[2019] - circ_pivot['Mean']) / circ_pivot['Mean']) * 100
Output:
Year 2017 2018 2019 Mean Change Change_mean
Branch
BILING 1406 1280 4 896.666667 -99.687500 -99.553903
BILWPL 199 117 239 185.000000 104.273504 29.189189
BKCLUB 94 161 238 164.333333 47.826087 44.827586
This is the solution I ended up with.
import numpy as np

circ_pivot = df.pivot_table(values='count', index='Branch', columns='Year',
                            fill_value=0, aggfunc=np.sum, margins=True)
circ_pivot['Mean'] = round(circ_pivot[[2017, 2018, 2019]].mean(axis=1))
circ_pivot['Change'] = round(((circ_pivot[2019] - circ_pivot[2018]) / circ_pivot[2018]) * 100)
circ_pivot['Change_mean'] = round(((circ_pivot[2019] - circ_pivot['Mean']) / circ_pivot['Mean']) * 100)
print(circ_pivot)
Output:
Year 2017 2018 2019 All Mean Change Change_mean
Branch
BILING 1406 1280 4 2690.0 897.0 -100.0 -100.0
BILWPL 199 117 239 555.0 185.0 104.0 29.0
BKCLUB 94 161 238 493.0 164.0 48.0 45.0
Improvements would be:
Relative dates instead of hard-coded year columns (see the sketch below).
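A sketch of that improvement, assuming at least two year columns are present and keeping the column names from the example above: derive the list of years from the pivot itself instead of hard-coding 2017/2018/2019.
import numpy as np

circ_pivot = df.pivot_table(values='count', index='Branch', columns='Year',
                            fill_value=0, aggfunc=np.sum, margins=True)
# pick up whatever year columns exist, skipping the 'All' margin
years = [c for c in circ_pivot.columns if isinstance(c, (int, np.integer))]
latest, prior = years[-1], years[-2]
circ_pivot['Mean'] = round(circ_pivot[years].mean(axis=1))
circ_pivot['Change'] = round((circ_pivot[latest] - circ_pivot[prior]) / circ_pivot[prior] * 100)
circ_pivot['Change_mean'] = round((circ_pivot[latest] - circ_pivot['Mean']) / circ_pivot['Mean'] * 100)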

How to apply a function to multiple columns that iterates through each row

Data
I have a dataset that shows up-to-date bookings data grouped by company and month (empty values are NaNs)
company month year_ly bookings_ly year_ty bookings_ty
company a 1 2018 432 2019 253
company a 2 2018 265 2019 635
company a 3 2018 345 2019 525
company a 4 2018 233 2019
company a 5 2018 7664 2019
... ... ... ... ... ...
company a 12 2018 224 2019 321
company b 1 2018 543 2019 576
company b 2 2018 23 2019 43
company b 3 2018 64 2019 156
company b 4 2018 143 2019
company b 5 2018 41 2019
company b 6 2018 90 2019
... ... ... ... ... ...
What I want
I'd like to create a new column, or update the bookings_ty column where the value is NaN (whichever is easier), that applies the following calculation for each row (grouped by company):
((SUM of previous 3 rows (or months) of bookings_ty)
/(SUM of previous 3 rows (or months) of bookings_ly))
* bookings_ly
Where a row's bookings_ty is NaN, I'd like that iteration of the formula to take the newly calculated field as part of its bookings_ty, so essentially the formula should populate the NaN values in bookings_ty.
My attempt
df_bkgs.set_index(['operator', 'month'], inplace=True)
def calc(df_bkgs):
    df_bkgs['bookings_calc'] = df_bkgs['bookings_ty'].copy()
    df_bkgs['bookings_ty_l3m'] = df_bkgs.groupby(level=0)['bookings_ty'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_ly_l3m'] = df_bkgs.groupby(level=0)['bookings_ly'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_factor'] = df_bkgs['bookings_ty_l3m'] / df_bkgs['bookings_ly_l3m']
    df_bkgs['bookings_calc'] = df_bkgs['bookings_factor'] * df_bkgs['bookings_ly']
    return df_bkgs
df_bkgs.groupby(level=0).apply(calc)

import numpy as np
df['bookings_calc'] = np.where(df['bookings_ty'].isna(), df['bookings_calc'], df['bookings_ty'])
The issue with this code is that it generates the calculated field only for the first empty/NaN bookings_ty. What I'd like is an iteration or loop-type process that takes the previous 3 rows in the group and, if a row's bookings_ty is empty/NaN, uses that row's calculated field instead.
Thanks
You can try this. I made a function which finds the last 3 records in your dataframe for each row. Note that I had to create a column named index to do this, as you can't access the index (as far as I know) within an apply statement.
# dataframe is named f
company month year_ly bookings_ly year_ty bookings_ty
0 a 1 2018 432 2019 253.0
1 a 2 2018 265 2019 635.0
2 a 3 2018 345 2019 525.0
3 a 4 2018 233 2019 NaN
4 a 5 2018 7664 2019 NaN
5 a 12 2018 224 2019 321.0
6 b 1 2018 543 2019 576.0
7 b 2 2018 23 2019 43.0
8 b 3 2018 64 2019 156.0
9 b 4 2018 143 2019 NaN
10 b 5 2018 41 2019 NaN
11 b 6 2018 90 2019 NaN
f.reset_index(inplace=True)
def aggFunct(row, df, last=3):
    # sum bookings_ty over the previous `last` rows, counting NaNs as 0
    series = df.loc[(df['index'] < row['index']) & (df['index'] >= row['index'] - last), 'bookings_ty'].fillna(0)
    ssum = series.sum()
    return ssum
f.loc[f['bookings_ty'].isna(), 'bookings_ty'] = f[f['bookings_ty'].isna()].apply(aggFunct, df=f, axis=1)
f.drop('index', axis=1, inplace=True)
f
f
company month year_ly bookings_ly year_ty bookings_ty
0 a 1 2018 432 2019 253.0
1 a 2 2018 265 2019 635.0
2 a 3 2018 345 2019 525.0
3 a 4 2018 233 2019 1413.0
4 a 5 2018 7664 2019 1160.0
5 a 12 2018 224 2019 321.0
6 b 1 2018 543 2019 576.0
7 b 2 2018 23 2019 43.0
8 b 3 2018 64 2019 156.0
9 b 4 2018 143 2019 775.0
10 b 5 2018 41 2019 199.0
11 b 6 2018 90 2019 156.0
Depending on how many companies you have in your table, I might be inclined to do this in Excel rather than pandas. Iterating through the rows might be slow, but if speed is not a concern, the following solution should work:
import numpy as np
import pandas as pd

df = pd.read_excel('data_file.xlsx')  # <-- name of your file.
companies = pd.unique(df.company)
months = pd.unique(df.month)
for c in companies:
    for m in months:
        # slice a single row
        df_row = df[(df['company'] == c) & (df['month'] == m)]
        val = df_row.bookings_ty.values[0]
        if np.isnan(val):
            # get the index of the row
            idx = df_row.index[0]
            # previous 3 months for the same company
            df1 = df.copy()
            df1 = df1[(df1['company'] == c) & (df1['month'].isin(range(m - 3, m)))]
            ratio = df1.bookings_ty.sum() / df1.bookings_ly.sum()
            projected_value = df_row.bookings_ly.values[0] * ratio
            df.loc[idx, 'bookings_ty'] = projected_value
        else:
            pass
print(df)
If we can assume that the DataFrame is always sorted by 'company' and then by 'month', we can use the following approach; there is a 20-fold improvement (0.003 s vs. 0.07 s) with my sample data of 24 rows.
df = pd.read_excel('data_file.xlsx')  # your input file
ly = df.bookings_ly.values.tolist()
ty = df.bookings_ty.values.tolist()
for val in ty:
    if np.isnan(val):
        idx = ty.index(val)  # returns the index of the first 'nan' found
        ratio = sum(ty[idx-3:idx]) / sum(ly[idx-3:idx])
        ty[idx] = ratio * ly[idx]
df['bookings_ty'] = ty
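One caveat with the flat-list version: the ty[idx-3:idx] slice can straddle the boundary between two companies. A hedged variant that keeps the same idea but works company by company, assuming (as in the sample above) that each company's first three months are populated:
for _, grp in df.groupby('company'):
    ly = grp.bookings_ly.tolist()
    ty = grp.bookings_ty.tolist()
    for i, val in enumerate(ty):
        if np.isnan(val):
            # use at most the 3 previous rows of this company only
            start = max(i - 3, 0)
            ratio = sum(ty[start:i]) / sum(ly[start:i])
            ty[i] = ratio * ly[i]
    df.loc[grp.index, 'bookings_ty'] = ty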
Here is a solution:
import numpy as np
import pandas as pd

# sort values if not already sorted
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x):
    # repeat until no NaN remains, so newly filled values feed the next row's 3-month window
    while x['bookings_ty'].isnull().any():
        x['bookings_ty'] = np.where(x['bookings_ty'].isnull(),
                                    (x['bookings_ty'].shift(1) +
                                     x['bookings_ty'].shift(2) +
                                     x['bookings_ty'].shift(3)) /
                                    (x['bookings_ly'].shift(1) +
                                     x['bookings_ly'].shift(2) +
                                     x['bookings_ly'].shift(3)) *
                                    x['bookings_ly'], x['bookings_ty'])
    return x

df = df.groupby(['company']).apply(lambda x: process(x))
# convert to int64 if needed or stay with float values
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
initial DF:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 315 **
4 company_a 5 2018 7664 2019 13418 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
In case you want a different rolling window, or a NaN value could occur at the beginning of a company's rows, you could use this generic solution:
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x, m):
    idx = x.loc[x['bookings_ty'].isnull()].index.to_list()
    for i in idx:
        id = i - x.index[0]
        start = 0 if id < m else id - m
        sum_ty = sum(x['bookings_ty'].to_list()[start:id])
        sum_ly = sum(x['bookings_ly'].to_list()[start:id])
        ly = x.at[i, 'bookings_ly']
        x.at[i, 'bookings_ty'] = sum_ty / sum_ly * ly
    return x

rolling_month = 3
df = df.groupby(['company']).apply(lambda x: process(x, rolling_month))
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
initial df:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253.0
1 company_a 2 2018 265 2019 635.0
2 company_a 3 2018 345 2019 NaN
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321.0
6 company_b 1 2018 543 2019 576.0
7 company_b 2 2018 23 2019 43.0
8 company_b 3 2018 64 2019 156.0
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
final result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 439 ** works with only the 2 previous rows
3 company_a 4 2018 233 2019 296 **
4 company_a 5 2018 7664 2019 12467 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
If you want to speed up the process, you could try:
df.set_index(['company'], inplace=True)
df = df.groupby(level=(0)).apply(lambda x: process(x))
instead of
df = df.groupby(['company']).apply(lambda x: process(x))

Grouping data series by day intervals with Pandas

I have to perform some data analysis on a seasonal basis.
I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017. What I want to do is sort this data into seasons.
Here's an example of the data I am working with:
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
As you can see, I have data from three different years.
What I was thinking of doing is converting the first column with pd.to_datetime(), and then grouping the rows by day/month (dd/mm) intervals regardless of the year: if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls inside this interval, whatever the year. However, I couldn't manage to do this while ignoring the year, which makes things more complicated.
EDIT:
A desired output would be:
df_spring
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
df_autumn
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
And so on for the remaining seasons.
Define each season by filtering the relevant rows using the Day and Month columns, as shown here for winter:
df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) | (df['Month'] == 1) | (df['Month'] == 2) | ((df['Day'] <= 21) & (df['Month'] == 3))]
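A follow-up sketch so the other seasons don't have to be spelled out by hand; the cut-off dates are an assumption based on the 21/12 to 21/03 winter window mentioned in the question, so adjust them if your season boundaries differ:
def season_of(month, day):
    # assumed boundaries: winter 21/12-21/03, spring 21/03-21/06, summer 21/06-21/09, autumn 21/09-21/12
    md = (month, day)
    if md >= (12, 21) or md < (3, 21):
        return 'winter'
    elif md < (6, 21):
        return 'spring'
    elif md < (9, 21):
        return 'summer'
    return 'autumn'

df['Season'] = [season_of(m, d) for m, d in zip(df['Month'], df['Day'])]
df_winter = df[df['Season'] == 'winter']
df_spring = df[df['Season'] == 'spring']
df_summer = df[df['Season'] == 'summer']
df_autumn = df[df['Season'] == 'autumn']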
You can simply filter your dataframe with Month.isin():
# spring
df[df['Month'].isin([3,4])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
2 19/04/2016 2016 4 19 3 3 0 3 1348 809 14.4
3 19/04/2016 2016 4 19 3 4 0 3 1353 812 14.1
10 07/03/2017 2017 3 7 3 14 0 3 3668 2201 14.2
11 07/03/2017 2017 3 7 3 15 0 3 3666 2200 14.0
12 24/04/2017 2017 4 24 2 5 0 2 1347 808 11.4
13 24/04/2017 2017 4 24 2 6 0 2 1816 1090 11.5
14 24/04/2017 2017 4 24 2 7 0 2 2918 1751 12.4
# autumn
df[df['Month'].isin([11,12])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
0 04/12/2015 2015 12 4 6 18 0 6 2968 1781 16.2
1 04/12/2015 2015 12 4 6 19 0 6 2437 1462 16.2
8 04/12/2016 2016 12 4 1 17 0 1 1425 855 14.6
9 04/12/2016 2016 12 4 1 18 0 1 1466 880 14.4
18 15/11/2017 2017 11 15 4 13 0 4 3765 2259 15.6
19 15/11/2017 2017 11 15 4 14 0 4 3873 2324 15.9
20 15/11/2017 2017 11 15 4 15 0 4 3905 2343 15.8
21 15/11/2017 2017 11 15 4 16 0 4 3861 2317 15.3
