Pandas: In pivot_table, how to custom fill missing values?

I want to fill the missing values in my Pandas pivot_table with values from the index and to fill the missing Year Week columns.
import numpy as np
import pandas as pd
d = { 'Year': [2019,2019,2019,2019,2019,2019],
'Week': [1,2,3,4,5,6],
'Part': ['A','A','A','B','B','B'],
'Static': [20,20,20,40,40,40],
'Value': [np.nan,10,50,np.nan,30,np.nan]
}
df = pd.DataFrame(d)
pivot = df.pivot_table(index=['Part','Static'], columns=['Year', 'Week'], values=['Value'])
print(pivot)
Value
Year 2019
Week 2 3 5
Part Static
A 20 10.0 50.0 NaN
B 40 NaN NaN 30.0
In the example above, the Weeks 1, 4 & 6 are missing because they don't have values. As for the NaN, I want to fill them with a value from the "left", so for Week 1 for Part A the value will be 20.0, and for Week 4 to 6 will be 50.0, and the same for Part B where all NaN will be filled with values from the left.
The expected output is
Value
Year 2019
Week 1 2 3 4 5 6
Part Static
A 20 20.0 10.0 50.0 50.0 50.0 50.0
B 40 40.0 40.0 40.0 40.0 30.0 30.0
PS: I can refer to a reference calendar dataframe to pull in all the Year Week values.
Edit:
I tested the solution on my data, but it does not seem to work. Here is updated data with Week 4 removed.
d = { 'Year': [2019,2019,2019,2019,2019],
'Week': [1,2,3,5,6],
'Part': ['A','A','A','B','B'],
'Static': [20,20,20,40,40],
'Value': [np.nan,10,50,30,np.nan]
}
df = pd.DataFrame(d)
#Year Week data set for reference
d2 = {'Year':[2019,2019,2019,2019,2019,2019,2019,2019,2019,2019],
'Week':[1,2,3,4,5,6,7,8,9,10] }

unstack, reset_index and fillna is one option:
df.set_index(['Year','Week', 'Part', 'Static']).unstack([0,1]).reset_index().fillna(method='ffill', axis=1)
Part Static Value
Year 2019
Week 1 2 3 4 5 6
0 A 20 20 10 50 50 50 50
1 B 40 40 40 40 40 30 30
fillna with method='ffill' will forward fill data, so when you set axis=1 it forward fills left to right.
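Note that newer pandas releases deprecate the fillna(method=...) form; a sketch of the same left-to-right fill written with DataFrame.ffill (assuming a pandas version where ffill accepts axis=1, and reusing df from the question) would be:
# same reshape as above, forward filling across columns with ffill instead of fillna(method='ffill')
out = (df.set_index(['Year', 'Week', 'Part', 'Static'])
         .unstack([0, 1])
         .reset_index()
         .ffill(axis=1))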

Fill the column Value: first fill down the column, and then fill across with the Static value:
df.Value = df.groupby('Part')[['Static', 'Value']].ffill().ffill(axis=1).Value
After this operation, the Value column has object dtype, so cast it back to int:
df.Value = df.Value.astype('int')
Then pivot as usual, but also ffill & bfill afterwards along the horizontal axis:
df.pivot_table(index=['Part','Static'], columns=['Year', 'Week'], values=['Value']).ffill(axis=1).bfill(axis=1)
# outputs:
Value
Year 2019
Week 1 2 3 4 5 6
Part Static
A 20 20.0 10.0 50.0 50.0 50.0 50.0
B 40 40.0 40.0 40.0 40.0 30.0 30.0
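If whole weeks can be absent, as in the edited data, one possible sketch (my own continuation, not taken from the answers above) is to build the full column index from the reference calendar d2 and reindex the pivot before filling; it assumes leading gaps should fall back to the Static value:
import pandas as pd

df = pd.DataFrame(d)          # edited data from the question
calendar = pd.DataFrame(d2)   # Year/Week reference calendar from the question

# fill Value from Static as in the answer above, then pivot
df['Value'] = df.groupby('Part')[['Static', 'Value']].ffill().ffill(axis=1)['Value'].astype(float)
pivot = df.pivot_table(index=['Part', 'Static'], columns=['Year', 'Week'], values=['Value'])

# add the weeks that are missing entirely, taken from the calendar, then fill left to right
full_cols = pd.MultiIndex.from_tuples(
    [('Value', y, w) for y, w in zip(calendar['Year'], calendar['Week'])],
    names=pivot.columns.names)
pivot = pivot.reindex(columns=full_cols).ffill(axis=1)

# leading weeks with no observation yet fall back to the Static level of the index
pivot = pivot.apply(lambda row: row.fillna(row.name[1]), axis=1)
print(pivot)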

Related

How to interpolate missing years within pd.groupby()

Problem:
I have a dataframe that contains entries with 5-year time intervals. I need to group entries by the 'id' column and interpolate values between the first and last item in each group. I understand that it has to be some combination of groupby(), set_index() and interpolate(), but I am unable to make it work for the whole input dataframe.
Sample df:
import pandas as pd
data = {
'id': ['a', 'b', 'a', 'b'],
'year': [2005, 2005, 2010, 2010],
'val': [0, 0, 100, 100],
}
df = pd.DataFrame.from_dict(data)
example input df:
_ id year val
0 a 2005 0
1 a 2010 100
2 b 2005 0
3 b 2010 100
expected output df:
_ id year val type
0 a 2005 0 original
1 a 2006 20 interpolated
2 a 2007 40 interpolated
3 a 2008 60 interpolated
4 a 2009 80 interpolated
5 a 2010 100 original
6 b 2005 0 original
7 b 2006 20 interpolated
8 b 2007 40 interpolated
9 b 2008 60 interpolated
10 b 2009 80 interpolated
11 b 2010 100 original
'type' is not necessary, it's just for illustration purposes.
Question:
How can I add missing years to the groupby() view and interpolate() their corresponding values?
Thank you!
One option is a temporary reshape with pivot, then reindex + interpolate to add the missing years, then unstack back:
out = (df
.pivot(index='year', columns='id', values='val')
.reindex(range(df['year'].min(), df['year'].max()+1))
.interpolate('index')
.unstack(-1).reset_index(name='val')
)
Output:
id year val
0 a 2005 0.0
1 a 2006 20.0
2 a 2007 40.0
3 a 2008 60.0
4 a 2009 80.0
5 a 2010 100.0
6 b 2005 0.0
7 b 2006 20.0
8 b 2007 40.0
9 b 2008 60.0
10 b 2009 80.0
11 b 2010 100.0
Solution that creates the years between the minimal and maximal year of each group independently:
First create the missing years per group with DataFrame.reindex between each group's minimal and maximal values, then interpolate with Series.interpolate, and finally flag the values that come from the original DataFrame in a new column:
import numpy as np

df = (df.set_index('year')
.groupby('id')['val']
.apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1)).interpolate())
.reset_index()
.merge(df, how='left', indicator=True)
.assign(type = lambda x: np.where(x.pop('_merge').eq('both'),
'original',
'interpolated')))
print (df)
id year val type
0 a 2005 0.0 original
1 a 2006 20.0 interpolated
2 a 2007 40.0 interpolated
3 a 2008 60.0 interpolated
4 a 2009 80.0 interpolated
5 a 2010 100.0 original
6 b 2005 0.0 original
7 b 2006 20.0 interpolated
8 b 2007 40.0 interpolated
9 b 2008 60.0 interpolated
10 b 2009 80.0 interpolated
11 b 2010 100.0 original

How to make a new pandas column that's the average of the last 3 values?

Let's say I have a dataframe with 3 columns: dt, unit, sold. What I would like to know is how to create a new column, say prior_3_avg, that is, as the name suggests, the average of sold for that unit over the past three occurrences of the same day of week as dt. E.g., for unit "1" on May 5th 2020, what is the average it sold on April 28th, 21st, and 14th, the previous three same days of the week?
Toy sample data:
df = pd.DataFrame({'dt':['2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28','2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28',],'unit':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'sold':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]})
df['dt'] = pd.to_datetime(df['dt'])
dt unit sold
0 2020-05-01 1 1
1 2020-05-02 1 2
2 2020-05-03 1 3
3 2020-05-04 1 4
4 2020-05-05 1 5
5 2020-05-06 1 6
...
How would I go about this? I've seen:
Pandas new column from groupby averages
That explains how to just do a group by on the columns. I figure I could do a "day of week" column, but then I still have the same problem of wanting to limit to the past 3 matching day of week values instead of just all of the results.
It could possibly have something to do with this, but this looks more like it's useful for one-off analysis than making a new column: limit amount of rows as result of groupby Pandas
This should work:
df['dayofweek'] = df['dt'].dt.dayofweek
df['output'] = df.apply(lambda x: df['sold'][(df.index < x.name) & (df.unit == x.unit) & (df.dayofweek == x.dayofweek)].tail(3).mean(), axis = 1)
First create a new column with the day of week:
import pandas as pd
date = pd.date_range('2018-12-30', '2019-01-07',
freq='D').to_series()
date.dt.dayofweek
That will give you the number for the day of week; after that you just need to filter on matching days and average the last three values, as sketched below.
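A rough sketch of one way to complete that idea (my own continuation, not the answerer's code), using shift and rolling per unit and weekday on the toy df from the question:
import pandas as pd

# average of the previous three same-weekday values, per unit
df['dayofweek'] = df['dt'].dt.dayofweek
df['prior_3_avg'] = (
    df.sort_values('dt')
      .groupby(['unit', 'dayofweek'])['sold']
      .transform(lambda s: s.shift().rolling(3).mean())
)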
Here is one idea: First group by unit, then group each unit by weekdays and get the rolling average for n weeks (with closed='left', the last n weeks excluding the current one are used for calculation, which seems to be what you want)...
n = 3
result = (df.groupby('unit')
.apply(lambda f: (f['sold']
.groupby(f.dt.dt.day_name())
.rolling(n, closed='left')
.mean()
)
)
)
...which results in this series:
unit dt
1 Friday 0 NaN
7 NaN
14 NaN
21 8.0
Monday 3 NaN
10 NaN
17 NaN
24 11.0
...
2 Friday 28 NaN
35 NaN
42 NaN
49 8.0
Monday 31 NaN
38 NaN
45 NaN
52 11.0
...
Name: sold, dtype: float64
Next, get rid of the unit and weekday index levels; we don't need them.
Also, rename the series for easier joining.
result = result.reset_index(level=[0, 1], drop=True)
result = result.rename('prior_3_avg')
Back to the mothership...
df2 = df.join(result)
Part of the final result in df2:
dt unit sold prior_3_avg
... # first 21 are NaN
21 2020-05-22 1 22 8.0
22 2020-05-23 1 23 9.0
23 2020-05-24 1 24 10.0
24 2020-05-25 1 25 11.0
25 2020-05-26 1 26 12.0
26 2020-05-27 1 27 13.0
27 2020-05-28 1 28 14.0

Pandas: Extract the row before and the row after based on a given value

I just started getting into pandas. I have searched through many sources and could not find a solution to my problem. Hope to learn from the specialists here.
This is the original dataframe:
Country  Sales  Item_A  Item_B
UK        28     20      30
Asia      75     15      20
USA      100     30      40
Assume that the Sales column is always sorted in ascending order from lowest to highest.
Let's say, given Sales = 50 and Country = 'UK', how do I:
Identify the two rows that have the closest Sales value w.r.t. 50?
Insert a new row between the two rows with the given Country and Sales?
Interpolate the values for Item_A and Item_B?
This is the expected result:
Country  Sales  Item_A  Item_B
UK        28     20      30
UK        50     17.7    25.3
Asia      75     15      20
First, I would recommend just adding the new row at the bottom and sorting the column so that it moves to your preferred position.
new = {'Country': ['UK'], 'Sales': [50]}
df = pd.concat([df, pd.DataFrame(new)]).sort_values(by=["Sales"]).reset_index(drop=True)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 NaN NaN
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
The second line adds the new row (concat), then sorts on the relevant column (sort_values), and the row moves to the preferred index (reset_index).
But if you have your reasons for inserting directly at a given index: I am not aware of a pandas insert for rows, only for columns. So my recommendation would be to split the original dataframe into the rows before and after, which requires finding the index where your new row should go.
def check_index(value):
    ruler = sorted(df["Sales"])
    ruled = [i for i in range(len(ruler)) if ruler[i] < value]
    return max(ruled) + 1
This function sorts the relevant column of the original dataframe, compares the values and returns the index where your new row should go.
df = pd.concat([df[:check_index(new["Sales"][0])], pd.DataFrame(new), df[check_index(new["Sales"][0]):]]).reset_index(drop=True)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 NaN NaN
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
This splits your dataframe and concatenates the before part, the new row, then the after part. For the second part of your request, you can apply the same approach directly by naming the columns, but here I make sure to select the numeric columns first, since we are going to do arithmetic on them. We use shift to select the previous and next values, then take half of their sum.
for col in df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns.tolist():
    df[col] = df[col].fillna((df[col].shift() + df[col].shift(-1)) / 2)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 17.5 25.0
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
But please note that if the new row goes to the first row of the dataframe, its values will still be NaN, since there is no preceding row to calculate with. For that case, I added a second fillna call; you can replace it with the value/calculation of your choice.
Country Sales Item_A Item_B
0 UK 10 NaN NaN
1 UK 28 20.0 30.0
2 UK 50 NaN NaN
3 Asia 75 15.0 20.0
4 USA 100 30.0 40.0
for col in df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns.tolist():
    df[col] = df[col].fillna((df[col].shift() + df[col].shift(-1)) / 2)
    df[col] = df[col].fillna(df[col].shift(-1) / 2)  # fall back to half the next row's value when there is no previous row
Country Sales Item_A Item_B
0 UK 10 10.0 15.0
1 UK 28 20.0 30.0
2 UK 50 17.5 25.0
3 Asia 75 15.0 20.0
4 USA 100 30.0 40.0
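As a side note, the 17.7 / 25.3 in the question's expected result come from weighting by the distance in Sales rather than taking a plain midpoint. A sketch of that (my own, assuming Sales values are unique and sorted) uses index-based interpolation with Sales as the index:
import pandas as pd

df = pd.DataFrame({'Country': ['UK', 'Asia', 'USA'],
                   'Sales': [28, 75, 100],
                   'Item_A': [20, 15, 30],
                   'Item_B': [30, 20, 40]})
new = pd.DataFrame({'Country': ['UK'], 'Sales': [50]})

# insert the new row, then interpolate proportionally to the gap in Sales
out = pd.concat([df, new]).sort_values('Sales').set_index('Sales')
out[['Item_A', 'Item_B']] = out[['Item_A', 'Item_B']].interpolate(method='index')
out = out.reset_index()
print(out)   # the Sales = 50 row gets Item_A ≈ 17.66, Item_B ≈ 25.32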

Is there a pandas function to return instantaneous values from a cumulative sum?

I have a dataframe whose columns hold accumulated values, i.e. a financial report for all four quarters in a year. I need to de-accumulate the values in order to get the value for every period instead of the accumulated sum over time.
I've already built a function that loops over every column in the dataframe and subtracts the previous column from the selected column (very inefficient). But in some cases I have monthly data instead of quarterly, so the number of periods changes from 4 to 12.
[Image of the original dataframe omitted]
I need a function that takes the number of periods (like a rolling sum that takes the window size as input) and outputs the disaggregated sum of the dataframe.
Thank you!
Take a diff within each group; you need .fillna to recover the first value.
Sample Data
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (3, 8)))
df.columns = [f'{y}-{str(m).zfill(2)}' for y in range(2012, 2014) for m in range(1, 5)]
df = df.cumsum(1) # For illustration, don't worry about across years.
df['tag'] = 'foo'
2012-01 2012-02 2012-03 2012-04 2013-01 2013-02 2013-03 2013-04 tag
0 5 6 15 23 25 28 36 45 foo
1 5 9 14 17 24 27 31 38 foo
2 4 10 11 19 24 29 38 41 foo
Code:
df.groupby(df.columns.str[0:4], axis=1).diff(1).fillna(df)
2012-01 2012-02 2012-03 2012-04 2013-01 2013-02 2013-03 2013-04 tag
0 5.0 1.0 9.0 8.0 25.0 3.0 8.0 9.0 foo
1 5.0 4.0 5.0 3.0 24.0 3.0 4.0 7.0 foo
2 4.0 6.0 1.0 8.0 24.0 5.0 9.0 3.0 foo
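In pandas versions where groupby(..., axis=1) is deprecated, a sketch of the same idea via a transpose (my own variant, reusing the sample df above) would be:
# group the former columns by year on the transposed frame, diff within each year, transpose back
num = df.drop(columns='tag')
out = num.T.groupby(num.columns.str[0:4]).diff().T.fillna(num)
out['tag'] = df['tag']
print(out)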
You can do those steps:
import pandas as pd
df = pd.DataFrame([[1, 3, 2], [100, 90, 110]], columns=['2019-01', '2019-02', '2019-03'], index=['A', 'B'])
df = df.unstack().reset_index(name='value').sort_values(['level_1', 'level_0'])
df['delta'] = df.groupby('level_1')['value'].diff()
df['delta'].fillna(df.value, inplace=True)
df.pivot(index='level_1', columns='level_0', values='delta')

Use pandas to dynamically convert multiple rows with matching index to multiple columns

I need to convert the following data frame from this:
class_id instructor_id
1 10
2 10
2 20
3 30
3 40
3 50
to this:
class_id instructor_id instructor_id_2 instructor_id_3
1 10
2 10 20
3 30 40 50
The number of instructor_id columns will be determined dynamically based on the number of instructor_id values associated with each class_id. The instructor_id column names will continue the same pattern of instructor_id_x.
Using groupby with apply + list and then apply + pd.Series:
df1 = df.groupby('class_id')['instructor_id'].apply(list).apply(pd.Series)
# alternative df.groupby('class_id')['instructor_id'].apply(lambda x: pd.Series(x.tolist())).unstack()
df1.columns = ['instructor_id']+['instructor_id_'+str(i+1) for i in df1.columns[1:]]
df1.reset_index(inplace=True)
print(df1)
class_id instructor_id instructor_id_2 instructor_id_3
0 1 10.0 NaN NaN
1 2 10.0 20.0 NaN
2 3 30.0 40.0 50.0
groupby + cumcount + unstack
Here's one way using a key helper series:
key = df.groupby('class_id')['instructor_id'].cumcount()\
.add(1).map('Instructor_{}'.format)
res = df.set_index(['class_id', key]).unstack().reset_index()
# clean up column names
res.columns = res.columns.droplevel(0)
res = res.rename(columns={'': 'class_id'})
print(res)
class_id Instructor_1 Instructor_2 Instructor_3
0 1 10.0 NaN NaN
1 2 10.0 20.0 NaN
2 3 30.0 40.0 50.0
