Problem:
I have a dataframe that contains entries with 5 year time intervals. I need to group entries by the 'id' column and interpolate values between the first and last item in each group. I understand it has to be some combination of groupby(), set_index() and interpolate(), but I am unable to make it work for the whole input dataframe.
Sample df:
import pandas as pd
data = {
'id': ['a', 'b', 'a', 'b'],
'year': [2005, 2005, 2010, 2010],
'val': [0, 0, 100, 100],
}
df = pd.DataFrame.from_dict(data)
example input df:
_ id year val
0 a 2005 0
1 a 2010 100
2 b 2005 0
3 b 2010 100
expected output df:
_ id year val type
0 a 2005 0 original
1 a 2006 20 interpolated
2 a 2007 40 interpolated
3 a 2008 60 interpolated
4 a 2009 80 interpolated
5 a 2010 100 original
6 b 2005 0 original
7 b 2006 20 interpolated
8 b 2007 40 interpolated
9 b 2008 60 interpolated
10 b 2009 80 interpolated
11 b 2010 100 original
'type' is not necessary; it's just for illustration purposes.
Question:
How can I add missing years to the groupby() view and interpolate() their corresponding values?
Thank you!
Using a temporary reshape with pivot, then reindex+interpolate to add the missing years, and unstack to restore the long format:
out = (df
.pivot(index='year', columns='id', values='val')
.reindex(range(df['year'].min(), df['year'].max()+1))
.interpolate('index')
.unstack(-1).reset_index(name='val')
)
Output:
id year val
0 a 2005 0.0
1 a 2006 20.0
2 a 2007 40.0
3 a 2008 60.0
4 a 2009 80.0
5 a 2010 100.0
6 b 2005 0.0
7 b 2006 20.0
8 b 2007 40.0
9 b 2008 60.0
10 b 2009 80.0
11 b 2010 100.0
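If you also want the illustrative type column from the expected output, a merge back against the original frame with indicator can flag which rows were original. A sketch combining that with the pivot approach above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'a', 'b'],
                   'year': [2005, 2005, 2010, 2010],
                   'val': [0, 0, 100, 100]})

out = (df
   .pivot(index='year', columns='id', values='val')
   .reindex(range(df['year'].min(), df['year'].max() + 1))
   .interpolate('index')
   .unstack(-1)
   .reset_index(name='val')
)

# rows present in the original frame are 'original', the rest 'interpolated'
out = out.merge(df, how='left', indicator=True)
out['type'] = np.where(out.pop('_merge').eq('both'), 'original', 'interpolated')
```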
Solution that creates the years from the minimal to the maximal year of each group independently:
First create the missing years with DataFrame.reindex per group, then interpolate with Series.interpolate, and finally flag which rows come from the original DataFrame in a new column:
import numpy as np

df = (df.set_index('year')
.groupby('id')['val']
.apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1)).interpolate())
.reset_index()
.merge(df, how='left', indicator=True)
.assign(type = lambda x: np.where(x.pop('_merge').eq('both'),
'original',
'interpolated')))
print (df)
id year val type
0 a 2005 0.0 original
1 a 2006 20.0 interpolated
2 a 2007 40.0 interpolated
3 a 2008 60.0 interpolated
4 a 2009 80.0 interpolated
5 a 2010 100.0 original
6 b 2005 0.0 original
7 b 2006 20.0 interpolated
8 b 2007 40.0 interpolated
9 b 2008 60.0 interpolated
10 b 2009 80.0 interpolated
11 b 2010 100.0 original
Related
So I have a panel df that looks like this:
ID year value
1 2002 8
1 2003 9
1 2004 10
2 2002 11
2 2003 11
2 2004 12
I want to set the value for every ID and for all years to the value in 2004. How do I do this?
The df should then look like this:
ID year value
1 2002 10
1 2003 10
1 2004 10
2 2002 12
2 2003 12
2 2004 12
Could not find anything online. So far I have tried to get the value for every ID for the year 2004, created a new df from that, and then merged it back in, but that is super slow.
We can use Series.map for this, first we select the values and create our mapping:
mapping = df[df["year"].eq(2004)].set_index("ID")["value"]
df["value"] = df["ID"].map(mapping)
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
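Self-contained, with the sample frame built inline (a sketch of the same map approach):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'year': [2002, 2003, 2004, 2002, 2003, 2004],
                   'value': [8, 9, 10, 11, 11, 12]})

# look up each ID's 2004 value, then broadcast it over the whole column
mapping = df[df['year'].eq(2004)].set_index('ID')['value']
df['value'] = df['ID'].map(mapping)
```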
Let's convert the values where the corresponding year is not 2004 to NaN, then take the max value per ID.
df['value'] = (df.assign(value=df['value'].mask(df['year'].ne(2004)))
.groupby('ID')['value'].transform('max'))
print(df)
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
Another method, for some variety.
import numpy as np

# Make everything that isn't 2004 null~
df.loc[df.year.ne(2004), 'value'] = np.nan
# Fill the values by ID~
df['value'] = df.groupby('ID')['value'].bfill()
Output:
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
Yet another method, a bit longer but quite intuitive: basically, create a lookup table for ID->value, then perform the lookup using pandas.merge.
import pandas as pd
# Original dataframe
df_orig = pd.DataFrame([(1, 2002, 8), (1, 2003, 9), (1, 2004, 10), (2, 2002, 11), (2, 2003, 11), (2, 2004, 12)])
df_orig.columns = ['ID', 'year', 'value']
# Dataframe with the 2004 value per ID (copy so the drop doesn't trigger SettingWithCopyWarning)
df_2004 = df_orig[df_orig['year'] == 2004].drop(columns=['year'])
print(df_2004)
print(df_2004)
# Drop values from df_orig and replace with those from df_2004
df_orig.drop(columns=['value'], inplace=True)
df_final = pd.merge(df_orig, df_2004, on='ID', how='right')
print(df_final)
df_2004:
ID value
2 1 10
5 2 12
df_final:
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
A B C D
1 2010 one 0 0
2 2020 one 2 4
3 2007 two 0 8
4 2010 one 8 4
5 2020 four 6 12
6 2007 three 0 14
7 2006 four 7 14
8 2010 two 10 12
I need to replace the 0's with the average of the non-zero C values for that year. For example, the 2010 C value would be 9. What is the best way to do this? I have over 10,000 rows.
You can use replace to change 0's to np.nan in column C, then fillna to map in the yearly averages:
import numpy as np

df.C.replace({0: np.nan}, inplace=True)
df.C.fillna(df.A.map(df.groupby('A').C.mean().fillna(0).to_dict()),
            inplace=True)
print(df)
A B C D
0 2010 one 9.0 0
1 2020 one 2.0 4
2 2007 two 0.0 8
3 2010 one 8.0 4
4 2020 four 6.0 12
5 2007 three 0.0 14
6 2006 four 7.0 14
7 2010 two 10.0 12
2007 still shows 0.0 because it has no values other than 0's in the initial data; its group mean is NaN, which we fill with 0.
Here is what I think I would do. The code below is pseudo-code.
1: Find the average C value for each year, and put it in a dict.
my_year_dict = {2020: xxx, 2021: xxx}
2: Use apply with a lambda over the rows:
df['C'] = df.apply(lambda r: my_year_dict[r['A']] if r['C'] == 0 else r['C'], axis=1)
Hope it can be a start!
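A runnable version of that idea, with the question's data built inline. Note this averages only the non-zero C values, and years whose only C values are 0 (like 2007 here) fall back to 0, matching the accepted output:

```python
import pandas as pd

df = pd.DataFrame({'A': [2010, 2020, 2007, 2010, 2020, 2007, 2006, 2010],
                   'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four', 'two'],
                   'C': [0, 2, 0, 8, 6, 0, 7, 10],
                   'D': [0, 4, 8, 4, 12, 14, 14, 12]})

# average of the non-zero C values per year
my_year_dict = df.loc[df['C'] != 0].groupby('A')['C'].mean().to_dict()

# replace the zeros; years with no non-zero values fall back to 0
df['C'] = df.apply(lambda r: my_year_dict.get(r['A'], 0) if r['C'] == 0 else r['C'],
                   axis=1)
```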
I have following example df:
housing = {'year': [2001, 2002, 2003, 2004, 2005],
'moved in': [10, 26, 15, 11, 12],
'moved out': [4, 15, 23, 1, 3]}
df = pd.DataFrame(housing, columns = ['year', 'moved in', 'moved out'])
Now I want to create a column showing the number of people living in the house in a given year. The first row is simply the number who moved in minus the number who moved out. Each following row takes the previous result, adds the people who moved in that year and subtracts the people who moved out, giving the number of people still living in the house. I would like to apply this over the whole df.
Is there a solution for it? Thank you in advance.
Basically you need a running sum of each year's net change. Since the rolling window spans the whole frame, this is just a cumulative sum:
df['current'] = (df['moved in'] - df['moved out']).rolling(window=len(df), min_periods=1).sum()
print(df)
year moved in moved out current
0 2001 10 4 6.0
1 2002 26 15 17.0
2 2003 15 23 9.0
3 2004 11 1 19.0
4 2005 12 3 28.0
With the net change column:
df['net change'] = df['moved in'] - df['moved out']
df['current'] = df['net change'].rolling(window=len(df), min_periods=1).sum()
print(df)
year moved in moved out net change current
0 2001 10 4 6 6.0
1 2002 26 15 11 17.0
2 2003 15 23 -8 9.0
3 2004 11 1 10 19.0
4 2005 12 3 9 28.0
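Because the window covers the entire frame, the same numbers come out of a plain cumulative sum, which is simpler (a sketch using the question's data):

```python
import pandas as pd

housing = {'year': [2001, 2002, 2003, 2004, 2005],
           'moved in': [10, 26, 15, 11, 12],
           'moved out': [4, 15, 23, 1, 3]}
df = pd.DataFrame(housing)

# running total of yearly net change = people currently in the house
df['current'] = (df['moved in'] - df['moved out']).cumsum()
```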
I want to fill the missing values in my Pandas pivot_table with values from the index and to fill the missing Year Week columns.
import pandas as pd
import numpy as np
d = { 'Year': [2019,2019,2019,2019,2019,2019],
'Week': [1,2,3,4,5,6],
'Part': ['A','A','A','B','B','B'],
'Static': [20,20,20,40,40,40],
'Value': [np.nan,10,50,np.nan,30,np.nan]
}
df = pd.DataFrame(d)
pivot = df.pivot_table(index=['Part','Static'], columns=['Year', 'Week'], values=['Value'])
print(pivot)
Value
Year 2019
Week 2 3 5
Part Static
A 20 10.0 50.0 NaN
B 40 NaN NaN 30.0
In the example above, the Weeks 1, 4 & 6 are missing because they don't have values. As for the NaN, I want to fill them with a value from the "left", so for Week 1 for Part A the value will be 20.0, and for Week 4 to 6 will be 50.0, and the same for Part B where all NaN will be filled with values from the left.
The expected output is
Value
Year 2019
Week 1 2 3 4 5 6
Part Static
A 20 20.0 10.0 50.0 50.0 50.0 50.0
B 40 40.0 40.0 40.0 40.0 30.0 30.0
PS: I can refer to a reference calendar dataframe to pull in all the Year Week values.
Edit:
I tested the solution on my data, but it seems to not work. Here is an updated data with Week 4 being removed.
d = { 'Year': [2019,2019,2019,2019,2019],
'Week': [1,2,3,5,6],
'Part': ['A','A','A','B','B'],
'Static': [20,20,20,40,40],
'Value': [np.nan,10,50,30,np.nan]
}
df = pd.DataFrame(d)
#Year Week data set for reference
d2 = {'Year':[2019,2019,2019,2019,2019,2019,2019,2019,2019,2019],
'Week':[1,2,3,4,5,6,7,8,9,10] }
unstack, reset_index and fillna is one option:
df.set_index(['Year','Week', 'Part', 'Static']).unstack([0,1]).reset_index().fillna(method='ffill', axis=1)
Part Static Value
Year 2019
Week 1 2 3 4 5 6
0 A 20 20 10 50 50 50 50
1 B 40 40 40 40 40 30 30
fillna with method='ffill' will forward fill data, so when you set axis=1 it forward fills left to right.
Fill the column Value by first filling down the column within each Part, and then filling across with the Static value:
df.Value = df.groupby('Part')[['Static', 'Value']].ffill().ffill(axis=1).Value
After this operation, the Value column has an object dtype, so it is necessary to cast it to int:
df.Value = df.Value.astype('int')
Then pivot as usual, but also ffill & bfill afterwards along the horizontal axis:
df.pivot_table(index=['Part','Static'], columns=['Year', 'Week'], values=['Value']).ffill(axis=1).bfill(axis=1)
# outputs:
Value
Year 2019
Week 1 2 3 4 5 6
Part Static
A 20 20.0 10.0 50.0 50.0 50.0 50.0
B 40 40.0 40.0 40.0 40.0 30.0 30.0
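For the edited data, where some weeks are missing from the frame entirely, the pivot's columns can be rebuilt from the d2 reference calendar before filling. A sketch combining the answer above with a column reindex; the final fallback of still-empty leading weeks to Static is my assumption about the desired behaviour:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [2019] * 5,
                   'Week': [1, 2, 3, 5, 6],
                   'Part': ['A', 'A', 'A', 'B', 'B'],
                   'Static': [20, 20, 20, 40, 40],
                   'Value': [np.nan, 10, 50, 30, np.nan]})
calendar = pd.DataFrame({'Year': [2019] * 10, 'Week': list(range(1, 11))})

# pre-fill Value from Static as in the answer above
df['Value'] = df.groupby('Part')[['Static', 'Value']].ffill().ffill(axis=1)['Value']

pivot = df.pivot_table(index=['Part', 'Static'], columns=['Year', 'Week'],
                       values=['Value'])

# rebuild the columns from the reference calendar, then fill across
cols = pd.MultiIndex.from_tuples(
    [('Value', y, w) for y, w in zip(calendar['Year'], calendar['Week'])])
pivot = pivot.reindex(columns=cols).ffill(axis=1)

# weeks still empty (before a Part's first record) fall back to Static
static = pd.Series(pivot.index.get_level_values('Static'), index=pivot.index)
pivot = pivot.apply(lambda col: col.fillna(static))
```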
I have a data frame like this, and I have to find the missing quarters, and the count of missing quarters, between entries for each Id. The data frame is:
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
Id start-year end-year count
B 2018Q2 2018Q3 2
B 2017Q3 2018Q3 1
How can I achieve this using Python pandas?
Use:
#changed data for more general solution - multiple missing years per group
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
import numpy as np

#add missing years per group by reindex
df1 = (df.set_index('year')
.groupby('Id')['Id']
.apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
.reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
#boolean mask of non-NaN values, stored for reuse
m = df1['val'].notnull().rename('g')
#cumulative sum gives a unique group id per run of consecutive NaNs
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT:
#convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#resample by start of months with asfreq
df1 = df.set_index('year').groupby('Id')['Id'].resample('MS').asfreq().rename('val').reset_index()
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
#cumulative sum gives a unique group id per run of consecutive NaNs
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
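The core trick used in both variants, labelling each run of consecutive NaNs with the cumulative sum of the non-NaN mask so the runs can be grouped, can be seen in isolation (a small sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan, 3.0])
m = s.notna()

# cumsum only increments on non-NaN positions, so every NaN in the
# same run shares one label
labels = m.cumsum()[~m]
print(labels.tolist())  # [1, 1, 2]
```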