Impute values from pandas row with specific identifier to all other rows - python
I have this dataframe (sorry, not sure how to format it nicely here):
SRC SRCDate Ticker Coupon Vintage Bal ($bn) WAC WAM WALA LNSZ ... FICO Refi% Month_Assessed CPR Month_key SRC_year SRC_month Year Month Interest_Rate
JPM 02/05/2021 FNCI 1.5 2020 28.7 2.25 175 4 293 / 286 ... 777 91 Apr 7.536801 M+2 2021 2 2021 2 2.24
JPM 03/05/2021 FNCI 1.5 2020 28.7 2.25 175 4 293 / 286 ... 777 91 Apr 5.131145 M+1 2021 3 2021 3 2.39
JPM 04/07/2021 FNCI 1.5 2020 28 2.25 173 6 292 / 281 ... 777 91 Apr 7.233214 M 2021 4 2021 4 2.36
JPM 05/07/2021 FNCI 1.5 2020 27.6 2.25 171 7 292 / 279 ... 777 91 Apr 8.900000 M-1 2021 5 2021 5 2.28
And I use this code:
cols = ['SRC_year','Bal ($bn)', 'WAC', 'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate']
jpm_2021[cols] = jpm_2021[cols].apply(pd.to_numeric, downcast='float', errors='coerce')
for col in cols:
    jpm_2021[col] = jpm_2021.groupby(['SRC_year', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed'])[col].transform('mean')
This normalizes the values of all the cols to their respective means within each group defined in the groupby. The reason for doing this is to be able to create a pivoted table with this code:
jpm_final = jpm_2021.pivot_table(index=['SRC', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed', 'Bal ($bn)', 'WAC', 'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate'],
columns="Month_key", values="CPR").rename_axis(columns=None).reset_index()
The problem is that taking the mean of all of those columns (especially Interest_Rate) renders the resulting table less than insightful. Instead, what I'd like to do is impute all the values from the rows where Month_key is 'M' into all the other rows that share the same grouping defined in the groupby call above. Any tips on how to do that?
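One possible approach, offered only as a hedged sketch since it is not part of the original post: instead of averaging, blank out every value except the one on the M row and broadcast it across the group, reusing the where/GroupBy.transform trick from one of the related answers below. It assumes each group has exactly one row with Month_key equal to 'M'; groups without one end up as NaN.

# snapshot the grouping keys so overwriting 'SRC_year' inside the loop cannot disturb the groups
keys = [jpm_2021[k].copy() for k in ['SRC_year', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed']]
is_m = jpm_2021['Month_key'].eq('M')
for col in cols:
    # keep only the value from the M row, then spread it to every row of the group
    jpm_2021[col] = jpm_2021[col].where(is_m).groupby(keys).transform('first')

Run this in place of the mean transform above; the pivot_table call can then stay exactly as it is.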
Related
Python Pandas convert selective columns into rows
My dataset has some information about price and sales for different years. The problem is that each year is actually a separate column header for price and for sales. For example, the CSV looks like:

Items  Price in 2018  Price in 2019  Price in 2020  Sales in 2018  Sales in 2019  Sales in 2020
A      100            120            135            5000           6000           6500
B      110            130            150            2000           4000           4500
C      150            110            175            1000           3000           3000

I want to show it something like this:

Items  Year  Price  Sales
A      2018  100    5000
A      2019  120    6000
A      2020  135    6500
B      2018  110    2000
B      2019  130    4000
B      2020  150    4500
C      2018  150    1000
C      2019  110    3000
C      2020  175    3000

I used the melt function from pandas like this:

df.melt(id_vars=['Items'], var_name="Year", value_name="Price")

But I'm struggling to get separate columns for Price and Sales, as melt puts Price and Sales into a single column. Thanks
Let us try pandas wide_to_long:

pd.wide_to_long(df, i='Items', j='year', stubnames=['Price', 'Sales'], suffix=r'\d+', sep=' in ').sort_index()

            Price  Sales
Items year
A     2018    100   5000
      2019    120   6000
      2020    135   6500
B     2018    110   2000
      2019    130   4000
      2020    150   4500
C     2018    150   1000
      2019    110   3000
      2020    175   3000
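An alternative sketch (not part of the original answer) that stays closer to the asker's melt attempt: melt everything, split the 'Price in 2018'-style labels into a measure name and a year, then pivot the two measures back out into their own columns:

long = df.melt(id_vars=['Items'], var_name='variable', value_name='value')
# 'Price in 2018' -> measure 'Price', year '2018'
long[['Measure', 'Year']] = long['variable'].str.split(' in ', expand=True)
out = (long.pivot_table(index=['Items', 'Year'], columns='Measure', values='value')
           .reset_index()
           .rename_axis(columns=None))

Both routes give the same Items/Year/Price/Sales layout; wide_to_long is simply the more direct tool when the column names follow a regular stub pattern.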
How to assign the groupby results to a series in pandas
I have a df which looks like this:

Date  Value
2020  0
2020  100
2020  200
2020  300
2021  100
2021  150
2021  0

I want to get the average of Value grouped by Date, where Value > 0. When I tried:

df['Yearly AVG'] = df[df['Value']>0].groupby('Date')['Value'].mean()

I get NaN values. When I print the right-hand side above I get what I need, but indexed by Date:

Date
2020    200
2021    125

How can I get the following:

Date  Value  Yearly AVG
2020  0      200
2020  100    200
2020  200    200
2020  300    200
2021  100    125
2021  150    125
2021  0      125
Here is a trick: replace the non-matching values with missing values, then use GroupBy.transform to fill a new column with the aggregated values:

df['Yearly AVG'] = df['Value'].where(df['Value']>0).groupby(df['Date']).transform('mean')
print (df)

   Date  Value  Yearly AVG
0  2020      0       200.0
1  2020    100       200.0
2  2020    200       200.0
3  2020    300       200.0
4  2021    100       125.0
5  2021    150       125.0
6  2021      0       125.0

Detail:

print (df['Value'].where(df['Value']>0))

0      NaN
1    100.0
2    200.0
3    300.0
4    100.0
5    150.0
6      NaN
Name: Value, dtype: float64

Your solution should be changed to:

df['Yearly AVG'] = df['Date'].map(df[df['Value']>0].groupby('Date')['Value'].mean())
Pandas: aggregate and show percent difference
I have a dataframe that looks like this:

df = pd.DataFrame(
    [
        ['BILING', 2017, 7, 1406],
        ['BILWPL', 2017, 7, 199],
        ['BKCLUB', 2017, 7, 9417],
        ['LEAVEN', 2017, 7, 4773],
        ['MAILORDER', 2017, 7, 10487]
    ],
    columns=['Branch', 'Year', 'Month', 'count'])

df
Out[1]:
       Branch  Year  Month  count
0      BILING  2017      7   1406
1      BILWPL  2017      7    199
2      BKCLUB  2017      7   9417
10     LEAVEN  2017      7   4773
18  MAILORDER  2017      7  10487

It contains the same month but different years, so that one can compare the same time of year across time. The desired output would look something like:

Branch     Month  2017   2019   Mean(ave)  percent_diff
BILING     7      1406   1501   1480       5%
BILWPL     7      199    87     102        -40%
BKCLUB     7      9417   8002   7503       -3%
LEAVEN     7      4773   5009   4509       -15%
MAILORDER  7      10487  11032  9004       8%

My question is how to aggregate by Branch, display the years across the columns, and add 2 columns: the mean and the percent difference between the mean and the newest year.

**** UPDATE ****

This is close but is missing some columns [Thanks G. Anderson]:

df.pivot_table(
    values='count',
    index='Branch',
    columns='Year',
    fill_value=0,
    aggfunc='mean')

Produces:

Year    2017  2018  2019
Branch
BILING  1406  1280     4
BILWPL   199   117   239
BKCLUB    94   161   238

This is very close, but I'm hoping to tack on columns corresponding to the mean and percent difference.

* UPDATE 2 *

circ_pivot = df.pivot_table(
    values='count',
    index='Branch',
    columns='Year',
    fill_value=0)
circ_pivot['Mean'] = circ_pivot[[2017,2018,2019]].mean(axis=1)
circ_pivot['Change'] = ((circ_pivot[2019] - circ_pivot[2018]) / circ_pivot[2018]) * 100
circ_pivot['Change_mean'] = ((circ_pivot[2019] - circ_pivot['Mean']) / circ_pivot['Mean']) * 100

Output:

Year    2017  2018  2019        Mean      Change  Change_mean
Branch
BILING  1406  1280     4  896.666667  -99.687500   -99.553903
BILWPL   199   117   239  185.000000  104.273504    29.189189
BKCLUB    94   161   238  164.333333   47.826087    44.827586
This is the solution I ended up with.

circ_pivot = df.pivot_table(
    values='count',
    index='Branch',
    columns='Year',
    fill_value=0,
    aggfunc=np.sum,
    margins=True)
circ_pivot['Mean'] = round(circ_pivot[[2017,2018,2019]].mean(axis=1))
circ_pivot['Change'] = round(((circ_pivot[2019] - circ_pivot[2018]) / circ_pivot[2018]) * 100)
circ_pivot['Change_mean'] = round(((circ_pivot[2019] - circ_pivot['Mean']) / circ_pivot['Mean']) * 100)
print(circ_pivot)

Output:

Year    2017  2018  2019     All   Mean  Change  Change_mean
Branch
BILING  1406  1280     4  2690.0  897.0  -100.0       -100.0
BILWPL   199   117   239   555.0  185.0   104.0         29.0
BKCLUB    94   161   238   493.0  164.0    48.0         45.0

One improvement would be to use relative dates instead of hard-coded year columns (see the sketch below).
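A hedged sketch of that improvement (not part of the original answer): derive the year columns from the pivot itself instead of hard-coding 2017, 2018 and 2019, so the same lines keep working as new years arrive. Run these lines in place of the three hard-coded assignments above; it assumes at least two year columns and that the margin column keeps its default name 'All'.

# keep only the real year columns; margins=True adds an extra 'All' column
year_cols = sorted(c for c in circ_pivot.columns if c != 'All')
latest, prior = year_cols[-1], year_cols[-2]
circ_pivot['Mean'] = round(circ_pivot[year_cols].mean(axis=1))
circ_pivot['Change'] = round((circ_pivot[latest] - circ_pivot[prior]) / circ_pivot[prior] * 100)
circ_pivot['Change_mean'] = round((circ_pivot[latest] - circ_pivot['Mean']) / circ_pivot['Mean'] * 100)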
How to apply a function to multiple columns that iterates through each row
Data

I have a dataset that shows up-to-date bookings data grouped by company and month (empty values are NaNs):

company    month  year_ly  bookings_ly  year_ty  bookings_ty
company a      1     2018          432     2019          253
company a      2     2018          265     2019          635
company a      3     2018          345     2019          525
company a      4     2018          233     2019
company a      5     2018         7664     2019
...          ...      ...          ...      ...          ...
company a     12     2018          224     2019          321
company b      1     2018          543     2019          576
company b      2     2018           23     2019           43
company b      3     2018           64     2019          156
company b      4     2018          143     2019
company b      5     2018           41     2019
company b      6     2018           90     2019
...          ...      ...          ...      ...          ...

What I want

I'd like to create a column, or update the bookings_ty column where the value is NaN (whichever is easier), that applies the following calculation for each row (grouped by company):

((SUM of previous 3 rows (or months) of bookings_ty)
 / (SUM of previous 3 rows (or months) of bookings_ly)) * bookings_ly

Where a row's bookings_ty is NaN, I'd like that iteration of the formula to use the newly calculated value as its bookings_ty, so essentially the formula should populate the NaN values in bookings_ty as it goes.

My attempt

df_bkgs.set_index(['operator', 'month'], inplace=True)

def calc(df_bkgs):
    df_bkgs['bookings_calc'] = df_bkgs['bookings_ty'].copy()
    df_bkgs['bookings_ty_l3m'] = df_bkgs.groupby(level=0)['bookings_ty'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_ly_l3m'] = df_bkgs.groupby(level=0)['bookings_ly'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_factor'] = df_bkgs['bookings_ty_l3m'] / df_bkgs['bookings_ly_l3m']
    df_bkgs['bookings_calc'] = df_bkgs['bookings_factor'] * df_bkgs['bookings_ly']
    return df_bkgs

df_bkgs.groupby(level=0).apply(calc)

import numpy as np
df['bookings_calc'] = np.where(df['bookings_ty'].isna(), df['bookings_calc'], df['bookings_ty'])

The issue with this code is that it generates the calculated field only for the first empty/NaN bookings_ty. What I'd like is an iteration or loop-type process that takes the previous 3 rows in the group and, if one of those rows' bookings_ty is empty/NaN, uses the calculated value for that row instead. Thanks
You can try this. I made a function which sums the last 3 records in your dataframe for each row. Note that I had to create a column named index to do this, as (as far as I know) you can't access the index within an apply statement.

# dataframe is named f
   company  month  year_ly  bookings_ly  year_ty  bookings_ty
0        a      1     2018          432     2019        253.0
1        a      2     2018          265     2019        635.0
2        a      3     2018          345     2019        525.0
3        a      4     2018          233     2019          NaN
4        a      5     2018         7664     2019          NaN
5        a     12     2018          224     2019        321.0
6        b      1     2018          543     2019        576.0
7        b      2     2018           23     2019         43.0
8        b      3     2018           64     2019        156.0
9        b      4     2018          143     2019          NaN
10       b      5     2018           41     2019          NaN
11       b      6     2018           90     2019          NaN

f.reset_index(inplace=True)

def aggFunct(row, df, last=3):
    # sum bookings_ty over the previous `last` rows, treating NaN as 0
    series = df.loc[(df['index'] < row['index']) & (df['index'] >= row['index'] - last), 'bookings_ty'].fillna(0)
    ssum = series.sum()
    return ssum

f.loc[f['bookings_ty'].isna(), 'bookings_ty'] = f[f['bookings_ty'].isna()].apply(aggFunct, df=f, axis=1)
f.drop('index', axis=1, inplace=True)
f

   company  month  year_ly  bookings_ly  year_ty  bookings_ty
0        a      1     2018          432     2019        253.0
1        a      2     2018          265     2019        635.0
2        a      3     2018          345     2019        525.0
3        a      4     2018          233     2019       1413.0
4        a      5     2018         7664     2019       1160.0
5        a     12     2018          224     2019        321.0
6        b      1     2018          543     2019        576.0
7        b      2     2018           23     2019         43.0
8        b      3     2018           64     2019        156.0
9        b      4     2018          143     2019        775.0
10       b      5     2018           41     2019        199.0
11       b      6     2018           90     2019        156.0
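Side note, offered as a hedged sketch rather than part of the original answer: when apply is used with axis=1, each row's index label is available as row.name, so the helper 'index' column can usually be skipped. Assuming the frame keeps its default integer index:

def aggFunct(row, df, last=3):
    # row.name is the integer index label, so no extra 'index' column is needed
    prev = df.loc[(df.index < row.name) & (df.index >= row.name - last), 'bookings_ty'].fillna(0)
    return prev.sum()

f.loc[f['bookings_ty'].isna(), 'bookings_ty'] = f[f['bookings_ty'].isna()].apply(aggFunct, df=f, axis=1)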
Depending on how many companies you have in your table, I might be inclined to run this in Excel rather than in pandas. Iterating through the rows might be slow, but if speed is not a concern, the following solution should work:

import numpy as np
import pandas as pd

df = pd.read_excel('data_file.xlsx')  # <-- name of your file.
companies = pd.unique(df.company)
months = pd.unique(df.month)

for c in companies:
    for m in months:
        # slice a single row
        df_row = df[(df['company'] == c) & (df['month'] == m)]
        val = df_row.bookings_ty.values[0]
        if np.isnan(val):
            # get the index of the row
            idx = df_row.index[0]
            df1 = df.copy()
            df1 = df1[(df1['company'] == c) & (df1['month'].isin(range(m - 3, m)))]
            ratio = df1.bookings_ty.sum() / df1.bookings_ly.sum()
            projected_value = df_row.bookings_ly.values[0] * ratio
            df.loc[idx, 'bookings_ty'] = projected_value
        else:
            pass

print(df)

If we can assume that the DataFrame is always sorted by 'company' and then by 'month', then we can use the following approach; there is a 20-fold improvement (0.003s vs. 0.07s) with my sample data of 24 rows:

df = pd.read_excel('data_file.xlsx')  # your input file
ly = df.bookings_ly.values.tolist()
ty = df.bookings_ty.values.tolist()

for val in ty:
    if np.isnan(val):
        idx = ty.index(val)  # returns the index of the first 'nan' found
        ratio = sum(ty[idx - 3:idx]) / sum(ly[idx - 3:idx])
        ty[idx] = ratio * ly[idx]

df['bookings_ty'] = ty
Here is a solution:

import numpy as np
import pandas as pd

# sort values if they are not sorted already
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x):
    while x['bookings_ty'].isnull().any():
        x['bookings_ty'] = np.where((x['bookings_ty'].isnull()),
                                    (x['bookings_ty'].shift(1) + x['bookings_ty'].shift(2) + x['bookings_ty'].shift(3))
                                    / (x['bookings_ly'].shift(1) + x['bookings_ly'].shift(2) + x['bookings_ly'].shift(3))
                                    * x['bookings_ly'],
                                    x['bookings_ty'])
    return x

df = df.groupby(['company']).apply(lambda x: process(x))

# convert to int64 if needed, or stay with float values
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)

Initial df:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019          253
1   company_a      2     2018          265     2019          635
2   company_a      3     2018          345     2019          525
3   company_a      4     2018          233     2019          NaN
4   company_a      5     2018         7664     2019          NaN
5   company_a     12     2018          224     2019          321
6   company_b      1     2018          543     2019          576
7   company_b      2     2018           23     2019           43
8   company_b      3     2018           64     2019          156
9   company_b      4     2018          143     2019          NaN
10  company_b      5     2018           41     2019          NaN
11  company_b      6     2018           90     2019          NaN

Result:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019          253
1   company_a      2     2018          265     2019          635
2   company_a      3     2018          345     2019          525
3   company_a      4     2018          233     2019          315  **
4   company_a      5     2018         7664     2019        13418  **
5   company_a     12     2018          224     2019          321
6   company_b      1     2018          543     2019          576
7   company_b      2     2018           23     2019           43
8   company_b      3     2018           64     2019          156
9   company_b      4     2018          143     2019          175  **
10  company_b      5     2018           41     2019           66  **
11  company_b      6     2018           90     2019          144  **

In case you want a different rolling-month window, or a NaN value could exist at the beginning of a company's rows, you could use this generic solution:

df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x, m):
    idx = (x.loc[x['bookings_ty'].isnull()].index.to_list())
    for i in idx:
        id = i - x.index[0]
        start = 0 if id < m else id - m
        sum_ty = sum(x['bookings_ty'].to_list()[start:id])
        sum_ly = sum(x['bookings_ly'].to_list()[start:id])
        ly = x.at[i, 'bookings_ly']
        x.at[i, 'bookings_ty'] = sum_ty / sum_ly * ly
    return x

rolling_month = 3
df = df.groupby(['company']).apply(lambda x: process(x, rolling_month))
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)

Initial df:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019        253.0
1   company_a      2     2018          265     2019        635.0
2   company_a      3     2018          345     2019          NaN
3   company_a      4     2018          233     2019          NaN
4   company_a      5     2018         7664     2019          NaN
5   company_a     12     2018          224     2019        321.0
6   company_b      1     2018          543     2019        576.0
7   company_b      2     2018           23     2019         43.0
8   company_b      3     2018           64     2019        156.0
9   company_b      4     2018          143     2019          NaN
10  company_b      5     2018           41     2019          NaN
11  company_b      6     2018           90     2019          NaN

Final result:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019          253
1   company_a      2     2018          265     2019          635
2   company_a      3     2018          345     2019          439  ** works with only the 2 previous rows
3   company_a      4     2018          233     2019          296  **
4   company_a      5     2018         7664     2019        12467  **
5   company_a     12     2018          224     2019          321
6   company_b      1     2018          543     2019          576
7   company_b      2     2018           23     2019           43
8   company_b      3     2018           64     2019          156
9   company_b      4     2018          143     2019          175  **
10  company_b      5     2018           41     2019           66  **
11  company_b      6     2018           90     2019          144  **

If you want to speed up the process you could try:

df.set_index(['company'], inplace=True)
df = df.groupby(level=0).apply(lambda x: process(x))

instead of

df = df.groupby(['company']).apply(lambda x: process(x))
Grouping data series by day intervals with Pandas
I have to perform some data analysis on a seasonal basis. I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017. What I want to do is sort this data into seasons. Here's an example of the data I am working with:

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3

As you can see, I have data from three different years. What I was thinking of doing is to convert the first column with the pd.to_datetime() command and then group the rows by dd/mm intervals, regardless of the year (for example, if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls in that interval, whatever the year). I couldn't manage to do this while ignoring the year, which makes things more complicated.

EDIT: A desired output would be:

df_spring
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4

df_autumn
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3

And so on for the remaining seasons.
Define each season by filtering the relevant rows using the Day and Month columns, as shown here for winter:

df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) | (df['Month'] == 1) | (df['Month'] == 2) | ((df['Day'] <= 21) & (df['Month'] == 3))]
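The remaining seasons follow the same pattern; here is a sketch (not part of the original answer, with cutoff days chosen so the four filters do not overlap; adjust them to your own season boundaries):

df_spring = df.loc[((df['Day'] >= 22) & (df['Month'] == 3)) | df['Month'].isin([4, 5]) | ((df['Day'] <= 20) & (df['Month'] == 6))]
df_summer = df.loc[((df['Day'] >= 21) & (df['Month'] == 6)) | df['Month'].isin([7, 8]) | ((df['Day'] <= 22) & (df['Month'] == 9))]
df_autumn = df.loc[((df['Day'] >= 23) & (df['Month'] == 9)) | df['Month'].isin([10, 11]) | ((df['Day'] <= 20) & (df['Month'] == 12))]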
You can simply filter your dataframe with Month.isin():

# spring
df[df['Month'].isin([3,4])]

          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
2   19/04/2016  2016      4   19         3     3        0         3     1348         809             14.4
3   19/04/2016  2016      4   19         3     4        0         3     1353         812             14.1
10  07/03/2017  2017      3    7         3    14        0         3     3668        2201             14.2
11  07/03/2017  2017      3    7         3    15        0         3     3666        2200             14.0
12  24/04/2017  2017      4   24         2     5        0         2     1347         808             11.4
13  24/04/2017  2017      4   24         2     6        0         2     1816        1090             11.5
14  24/04/2017  2017      4   24         2     7        0         2     2918        1751             12.4

# autumn
df[df['Month'].isin([11,12])]

          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
0   04/12/2015  2015     12    4         6    18        0         6     2968        1781             16.2
1   04/12/2015  2015     12    4         6    19        0         6     2437        1462             16.2
8   04/12/2016  2016     12    4         1    17        0         1     1425         855             14.6
9   04/12/2016  2016     12    4         1    18        0         1     1466         880             14.4
18  15/11/2017  2017     11   15         4    13        0         4     3765        2259             15.6
19  15/11/2017  2017     11   15         4    14        0         4     3873        2324             15.9
20  15/11/2017  2017     11   15         4    15        0         4     3905        2343             15.8
21  15/11/2017  2017     11   15         4    16        0         4     3861        2317             15.3