Problem:
I have a dataframe that contains entries with 5-year time intervals. I need to group entries by the 'id' column and interpolate values between the first and last item in each group. I understand that it has to be some combination of groupby(), set_index() and interpolate(), but I am unable to make it work for the whole input dataframe.
Sample df:
import pandas as pd
data = {
'id': ['a', 'a', 'b', 'b'],
'year': [2005, 2010, 2005, 2010],
'val': [0, 100, 0, 100],
}
df = pd.DataFrame.from_dict(data)
example input df:
_ id year val
0 a 2005 0
1 a 2010 100
2 b 2005 0
3 b 2010 100
expected output df:
_ id year val type
0 a 2005 0 original
1 a 2006 20 interpolated
2 a 2007 40 interpolated
3 a 2008 60 interpolated
4 a 2009 80 interpolated
5 a 2010 100 original
6 b 2005 0 original
7 b 2006 20 interpolated
8 b 2007 40 interpolated
9 b 2008 60 interpolated
10 b 2009 80 interpolated
11 b 2010 100 original
The 'type' column is not necessary; it's just for illustration purposes.
Question:
How can I add missing years to the groupby() view and interpolate() their corresponding values?
Thank you!
Using a temporary reshape with pivot and unstack, plus reindex + interpolate to add the missing years:
out = (df
.pivot(index='year', columns='id', values='val')
.reindex(range(df['year'].min(), df['year'].max()+1))
.interpolate('index')
.unstack(-1).reset_index(name='val')
)
Output:
id year val
0 a 2005 0.0
1 a 2006 20.0
2 a 2007 40.0
3 a 2008 60.0
4 a 2009 80.0
5 a 2010 100.0
6 b 2005 0.0
7 b 2006 20.0
8 b 2007 40.0
9 b 2008 60.0
10 b 2009 80.0
11 b 2010 100.0
Solution that creates the years from each group's own minimal and maximal year:
First create the missing years with DataFrame.reindex per group, spanning each group's minimum and maximum year, then fill the values with Series.interpolate, and finally flag which rows come from the original DataFrame in a new column:
import numpy as np

df = (df.set_index('year')
        .groupby('id')['val']
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1)).interpolate())
        .reset_index()
        .merge(df, how='left', indicator=True)
        .assign(type = lambda x: np.where(x.pop('_merge').eq('both'),
                                          'original',
                                          'interpolated')))
print (df)
id year val type
0 a 2005 0.0 original
1 a 2006 20.0 interpolated
2 a 2007 40.0 interpolated
3 a 2008 60.0 interpolated
4 a 2009 80.0 interpolated
5 a 2010 100.0 original
6 b 2005 0.0 original
7 b 2006 20.0 interpolated
8 b 2007 40.0 interpolated
9 b 2008 60.0 interpolated
10 b 2009 80.0 interpolated
11 b 2010 100.0 original
Let's say I have a dataframe with 3 columns: dt, unit, sold. What I would like to know is how to create a new column, say prior_3_avg, that is, as the name suggests, an average of sold by unit for the past three same-days-of-week as dt. E.g., for unit "1" on May 5th 2020, what is the average it sold on April 28th, 21st, and 14th, the three preceding occurrences of that weekday?
Toy sample data:
df = pd.DataFrame({'dt':['2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28','2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28',],'unit':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'sold':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]})
df['dt'] = pd.to_datetime(df['dt'])
dt unit sold
0 2020-05-01 1 1
1 2020-05-02 1 2
2 2020-05-03 1 3
3 2020-05-04 1 4
4 2020-05-05 1 5
5 2020-05-06 1 6
...
How would I go about this? I've seen:
Pandas new column from groupby averages
That explains how to just do a group by on the columns. I figure I could do a "day of week" column, but then I still have the same problem of wanting to limit to the past 3 matching day of week values instead of just all of the results.
It could possibly have something to do with this, but this looks more like it's useful for one-off analysis than making a new column: limit amount of rows as result of groupby Pandas
This should work:
df['dayofweek'] = df['dt'].dt.dayofweek
df['output'] = df.apply(lambda x: df['sold'][(df.index < x.name)
                                             & (df['unit'] == x['unit'])
                                             & (df['dayofweek'] == x['dayofweek'])].tail(3).mean(),
                        axis = 1)
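Note that this evaluates a boolean mask over the whole frame for every row inside apply, so it can get slow on larger data; the groupby/rolling answer further down avoids that.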
First create a new column with the day of week:
import pandas as pd
date = pd.date_range('2018-12-30', '2019-01-07',
freq='D').to_series()
date.dt.dayofweek
That gives you the weekday number; after that you just need to filter each unit's rows to the matching weekday, sort by date, and average the last three values before each row.
Here is one idea: First group by unit, then group each unit by weekdays and get the rolling average for n weeks (with closed='left', the last n weeks excluding the current one are used for calculation, which seems to be what you want)...
n = 3
result = (df.groupby('unit')
.apply(lambda f: (f['sold']
.groupby(f.dt.dt.day_name())
.rolling(n, closed='left')
.mean()
)
)
)
...which results in this series:
unit dt
1 Friday 0 NaN
7 NaN
14 NaN
21 8.0
Monday 3 NaN
10 NaN
17 NaN
24 11.0
...
2 Friday 28 NaN
35 NaN
42 NaN
49 8.0
Monday 31 NaN
38 NaN
45 NaN
52 11.0
...
Name: sold, dtype: float64
Next, get rid of the unit and time index levels, we don't need them.
Also, rename the series for easier joining.
result = result.reset_index(level=[0, 1], drop=True)
result = result.rename('prior_3_avg')
Back to the mothership...
df2 = df.join(result)
Part of the final result in df2:
dt unit sold prior_3_avg
... # first 21 are NaN
21 2020-05-22 1 22 8.0
22 2020-05-23 1 23 9.0
23 2020-05-24 1 24 10.0
24 2020-05-25 1 25 11.0
25 2020-05-26 1 26 12.0
26 2020-05-27 1 27 13.0
27 2020-05-28 1 28 14.0
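As a side note, the same numbers can be obtained in one step with transform, which keeps the original row index and so skips the reset_index/rename/join steps. This is just a compact sketch of the same approach (using the question's df and n = 3), and, like the answer above, it relies on closed='left' being supported for fixed integer windows, i.e. a reasonably recent pandas:

n = 3
df['prior_3_avg'] = (df.groupby(['unit', df['dt'].dt.day_name()])['sold']
                       .transform(lambda s: s.rolling(n, closed='left').mean()))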
I just started getting into pandas. I have searched through many sources and could not find a solution to my problem. Hope to learn from the specialists here.
This is the original dataframe:
Country  Sales  Item_A  Item_B
UK          28      20      30
Asia        75      15      20
USA        100      30      40
Assume that the Sales column is always sorted in ascending order from lowest to highest.
Let's say I am given Sales = 50 and Country = 'UK'. How do I
Identify the two rows that have the closest Sales value w.r.t. 50?
Insert a new row between the two rows with the given Country and Sales?
Interpolate the values for Item_A and Item_B?
This is the expected result:
Country  Sales  Item_A  Item_B
UK          28    20      30
UK          50    17.7    25.3
Asia        75    15      20
First, I would recommend just adding the new row at the bottom and sorting the column, so that the row moves to your preferred position.
new = {'Country': ['UK'], 'Sales': [50]}
df = pd.concat([df, pd.DataFrame(new)]).sort_values(by=["Sales"]).reset_index(drop=True)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 NaN NaN
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
The second line appends the new row (concat), sorts on the relevant column (sort_values), and renumbers the rows so the new row ends up at the preferred index (reset_index).
But if you have your reasons for inserting directly at a given position: pandas has insert for columns only, not rows. So my recommendation would be to split the original dataframe into the rows before and after the insertion point. To do so, you need to find the index where your new row should go.
def check_index(value):
    # sort the Sales column, find all positions whose value is below `value`,
    # and return the position just after the last of them
    ruler = sorted(df["Sales"])
    ruled = [i for i in range(len(ruler)) if ruler[i] < value]
    return max(ruled) + 1
This function sorts the relevant column of the original dataframe, compares the values, and returns the index where your new row should go.
df = pd.concat([df[:check_index(new["Sales"][0])], pd.DataFrame(new), df[check_index(new["Sales"][0]):]]).reset_index(drop=True)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 NaN NaN
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
This splits your dataframe and concatenates the before rows, the new row, and the after rows. For the second part of your request, you could name the columns directly, but here I select the numeric columns first, since we are going to do arithmetic on them. We use shift to pick up the previous and next values and take their midpoint.
for col in df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns.tolist():
    df[col] = df[col].fillna((df[col].shift() + df[col].shift(-1))/2)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 17.5 25.0
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
But please note that if the new row ends up as the first row of the dataframe, its values will still be NaN, since there is no preceding row to calculate with. For that case I added a second fillna below; you can replace it with whatever value or calculation you prefer.
Country Sales Item_A Item_B
0 UK 10 NaN NaN
1 UK 28 20.0 30.0
2 UK 50 NaN NaN
3 Asia 75 15.0 20.0
4 USA 100 30.0 40.0
for col in df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns.tolist():
    df[col] = df[col].fillna((df[col].shift() + df[col].shift(-1))/2)
    df[col] = df[col].fillna(df[col].shift(-1)/2)  # the added fallback: half of the following row's value
Country Sales Item_A Item_B
0 UK 10 10.0 15.0
1 UK 28 20.0 30.0
2 UK 50 17.5 25.0
3 Asia 75 15.0 20.0
4 USA 100 30.0 40.0
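Note that the simple midpoint above gives 17.5 / 25.0, whereas the expected output in the question (17.7 / 25.3) is an interpolation weighted by the distance in Sales. If you want the latter, a minimal sketch (assuming the new row with NaNs has already been inserted and the frame is sorted by Sales, as above) is to interpolate against the Sales values themselves:

num_cols = df.select_dtypes('number').columns.drop('Sales')
df[num_cols] = (df.set_index('Sales')[num_cols]
                  .interpolate(method='index')  # weight by distance in Sales
                  .to_numpy())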
I have a dataframe whose columns contain accumulated values, e.g. a financial report for all four quarters of a year. I need to de-accumulate the values in order to get the value for every period instead of the accumulated sum over time.
I've already built a function that loops over every column of the dataframe and subtracts the previous column from the selected column (very inefficient). But in some cases I have monthly data instead of quarterly, so the number of periods changes from 4 to 12.
Image of dataframe I have
I need a function that takes the number of periods as input (like a rolling sum that takes the window size) and outputs the disaggregated values of the dataframe.
Thank you!
Take a diff within each group. You need .fillna to recover the first value of each group.
Sample Data
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (3, 8)))
df.columns = [f'{y}-{str(m).zfill(2)}' for y in range(2012, 2014) for m in range(1, 5)]
df = df.cumsum(1)  # for illustration; don't worry about accumulating across years
df['tag'] = 'foo'
2012-01 2012-02 2012-03 2012-04 2013-01 2013-02 2013-03 2013-04 tag
0 5 6 15 23 25 28 36 45 foo
1 5 9 14 17 24 27 31 38 foo
2 4 10 11 19 24 29 38 41 foo
Code:
df.groupby(df.columns.str[0:4], axis=1).diff(1).fillna(df)
2012-01 2012-02 2012-03 2012-04 2013-01 2013-02 2013-03 2013-04 tag
0 5.0 1.0 9.0 8.0 25.0 3.0 8.0 9.0 foo
1 5.0 4.0 5.0 3.0 24.0 3.0 4.0 7.0 foo
2 4.0 6.0 1.0 8.0 24.0 5.0 9.0 3.0 foo
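The question asks for a function parameterized by the number of periods (4 for quarterly, 12 for monthly), and recent pandas versions deprecate groupby(..., axis=1), so here is a sketch of the same diff-within-block idea written as a reusable function. The name deaccumulate and the assumption that the period columns are already in chronological order, grouped in consecutive blocks, are mine, not from the question:

import numpy as np

def deaccumulate(df, n_periods):
    # illustrative helper: de-accumulate consecutive blocks of n_periods numeric columns
    out = df.copy()
    num = df.select_dtypes('number')
    # label the period columns in blocks of n_periods: 0,0,0,0,1,1,1,1,...
    blocks = np.arange(num.shape[1]) // n_periods
    # diff within each block along the columns (via a transpose), then restore
    # the first period of each block from the original accumulated values
    out[num.columns] = num.T.groupby(blocks).diff().T.fillna(num)
    return out

# e.g. quarterly data: deaccumulate(df, 4); monthly data: deaccumulate(df, 12)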
You can follow these steps:
import pandas as pd

df = pd.DataFrame([[1, 3, 2], [100, 90, 110]], columns=['2019-01', '2019-02', '2019-03'], index=['A', 'B'])
# reshape to long format: one row per (period, original row label)
df = df.unstack().reset_index(name='value').sort_values(['level_1', 'level_0'])
# diff the accumulated values within each original row
df['delta'] = df.groupby('level_1')['value'].diff()
# the first period of each row keeps its original value
df['delta'] = df['delta'].fillna(df['value'])
# reshape back to wide format
df.pivot(index='level_1', columns='level_0', values='delta')
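For the small frame above, this final pivot yields per-period deltas of 1, 2, -1 for row A and 100, -10, 20 for row B.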