I have the following example df:
import pandas as pd

housing = {'year': [2001, 2002, 2003, 2004, 2005],
           'moved in': [10, 26, 15, 11, 12],
           'moved out': [4, 15, 23, 1, 3]}
df = pd.DataFrame(housing, columns=['year', 'moved in', 'moved out'])
Now I want to create a column with calculated values showing how many people live in the house in a given year. For the first row, the result is simply the number of people who moved in minus the number who moved out. For each following row, that result is carried forward, adding the people who moved in and subtracting the people who moved out in that year; the result is the number of people still living in the house in that year. I would like to apply this across the whole df.
Is there a solution for it? Thank you in advance.
Basically you need a running sum of each year's net change.
df['current'] = (df['moved in'] - df['moved out']).rolling(window=len(df), min_periods=1).sum()
print(df)
year moved in moved out current
0 2001 10 4 6.0
1 2002 26 15 17.0
2 2003 15 23 9.0
3 2004 11 1 19.0
4 2005 12 3 28.0
With the net change column:
df['net change'] = df['moved in'] - df['moved out']
df['current'] = df['net change'].rolling(window=len(df), min_periods=1).sum()
print(df)
year moved in moved out net change current
0 2001 10 4 6 6.0
1 2002 26 15 11 17.0
2 2003 15 23 -8 9.0
3 2004 11 1 10 19.0
4 2005 12 3 9 28.0
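For what it's worth, a rolling window that spans the whole frame is effectively a cumulative sum, so the same running total can be written more directly with cumsum (this also keeps the integer dtype instead of producing floats):
import pandas as pd

housing = {'year': [2001, 2002, 2003, 2004, 2005],
           'moved in': [10, 26, 15, 11, 12],
           'moved out': [4, 15, 23, 1, 3]}
df = pd.DataFrame(housing)

# Running balance: each year's net change, accumulated over all previous years.
df['current'] = (df['moved in'] - df['moved out']).cumsum()
print(df)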
Problem:
I have a dataframe that contains entries at 5-year intervals. I need to group the entries by the 'id' column and interpolate values between the first and last item in each group. I understand that it has to be some combination of groupby(), set_index() and interpolate(), but I am unable to make it work for the whole input dataframe.
Sample df:
import pandas as pd
data = {
    'id': ['a', 'b', 'a', 'b'],
    'year': [2005, 2005, 2010, 2010],
    'val': [0, 0, 100, 100],
}
df = pd.DataFrame.from_dict(data)
example input df:
_ id year val
0 a 2005 0
1 a 2010 100
2 b 2005 0
3 b 2010 100
expected output df:
_ id year val type
0 a 2005 0 original
1 a 2006 20 interpolated
2 a 2007 40 interpolated
3 a 2008 60 interpolated
4 a 2009 80 interpolated
5 a 2010 100 original
6 b 2005 0 original
7 b 2006 20 interpolated
8 b 2007 40 interpolated
9 b 2008 60 interpolated
10 b 2009 80 interpolated
11 b 2010 100 original
'type' is not necessary, it's just for illustration purposes.
Question:
How can I add missing years to the groupby() view and interpolate() their corresponding values?
Thank you!
Using a temporary reshape with pivot, then reindex + interpolate to add the missing years, and unstack to go back to long format:
out = (df
       .pivot(index='year', columns='id', values='val')        # one column per id
       .reindex(range(df['year'].min(), df['year'].max()+1))   # add the missing years
       .interpolate('index')                                    # fill values along the year index
       .unstack(-1)                                             # back to long format
       .reset_index(name='val')
)
Output:
id year val
0 a 2005 0.0
1 a 2006 20.0
2 a 2007 40.0
3 a 2008 60.0
4 a 2009 80.0
5 a 2010 100.0
6 b 2005 0.0
7 b 2006 20.0
8 b 2007 40.0
9 b 2008 60.0
10 b 2009 80.0
11 b 2010 100.0
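The question also asks for a 'type' column; one possible way to add it on top of this output, assuming out and df as defined above, is to flag which (id, year) pairs existed in the original input:
# Flag rows whose (id, year) pair occurs in the original df as 'original',
# everything else as 'interpolated'.
orig_pairs = set(zip(df['id'], df['year']))
out['type'] = ['original' if pair in orig_pairs else 'interpolated'
               for pair in zip(out['id'], out['year'])]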
Solution that creates the years between each group's minimal and maximal year independently:
First create the missing years with Series.reindex per group (between each group's minimal and maximal year), then fill the values with Series.interpolate, and finally flag which rows come from the original DataFrame in a new column:
import numpy as np

df = (df.set_index('year')
        .groupby('id')['val']
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1)).interpolate())
        .reset_index()
        .merge(df, how='left', indicator=True)
        .assign(type = lambda x: np.where(x.pop('_merge').eq('both'),
                                          'original',
                                          'interpolated')))
print (df)
id year val type
0 a 2005 0.0 original
1 a 2006 20.0 interpolated
2 a 2007 40.0 interpolated
3 a 2008 60.0 interpolated
4 a 2009 80.0 interpolated
5 a 2010 100.0 original
6 b 2005 0.0 original
7 b 2006 20.0 interpolated
8 b 2007 40.0 interpolated
9 b 2008 60.0 interpolated
10 b 2009 80.0 interpolated
11 b 2010 100.0 original
So I have a panel df that looks like this:
ID  year  value
 1  2002      8
 1  2003      9
 1  2004     10
 2  2002     11
 2  2003     11
 2  2004     12
I want to set the value for every ID and for all years to the value in 2004. How do I do this?
The df should then look like this:
ID  year  value
 1  2002     10
 1  2003     10
 1  2004     10
 2  2002     12
 2  2003     12
 2  2004     12
I could not find anything online. So far I have tried to get the value for each ID for the year 2004, create a new df from that, and then merge it back in, but that is super slow.
We can use Series.map for this. First we select the 2004 rows and create our mapping:
mapping = df[df["year"].eq(2004)].set_index("ID")["value"]
df["value"] = df["ID"].map(mapping)
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
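One caveat, assuming the real data might contain IDs without a 2004 row: .map() would leave NaN for those. A hedged variant of the assignment above that falls back to the existing value in that case:
# Hypothetical variant: keep the old value for IDs missing from the 2004 mapping.
df['value'] = df['ID'].map(mapping).fillna(df['value'])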
Let's convert the value to NaN wherever the corresponding year is not 2004, then take the max value per ID.
df['value'] = (df.assign(value=df['value'].mask(df['year'].ne(2004)))
.groupby('ID')['value'].transform('max'))
print(df)
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
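'max' works here only because masking leaves exactly one non-NaN value (the 2004 one) per ID. An equivalent sketch using 'first', which also skips NaN:
# Equivalent sketch: 'first' picks the single remaining non-NaN value per ID.
df['value'] = (df.assign(value=df['value'].mask(df['year'].ne(2004)))
                 .groupby('ID')['value'].transform('first'))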
Another method, for some variety.
import numpy as np

# Make everything that isn't 2004 null~
df.loc[df.year.ne(2004), 'value'] = np.nan
# Fill the values by ID~
df['value'] = df.groupby('ID')['value'].bfill()
Output:
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
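A purely hypothetical caveat: bfill only fills rows that come before the 2004 row. If the data also contained years after 2004, a two-way fill per group would cover those rows as well:
# Hypothetical extension: propagate the 2004 value both backwards and forwards
# within each ID group.
df['value'] = df.groupby('ID')['value'].transform(lambda s: s.bfill().ffill())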
Yet another method, a bit longer but it should be quite intuitive: basically, create a lookup table mapping ID to value, then perform the lookup with pandas.merge.
import pandas as pd
# Original dataframe
df_orig = pd.DataFrame([(1, 2002, 8), (1, 2003, 9), (1, 2004, 10), (2, 2002, 11), (2, 2003, 11), (2, 2004, 12)])
df_orig.columns = ['ID', 'year', 'value']
# Dataframe with 2004 IDs
df_2004 = df_orig[df_orig['year'] == 2004].copy()
df_2004.drop(columns=['year'], inplace=True)
print(df_2004)
# Drop values from df_orig and replace with those from df_2004
df_orig.drop(columns=['value'], inplace=True)
df_final = pd.merge(df_orig, df_2004, on='ID', how='right')
print(df_final)
df_2004:
ID value
2 1 10
5 2 12
df_final:
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
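A small caveat, assuming other data could contain an ID with no 2004 row: with how='right' such an ID would be dropped entirely. A sketch that keeps every original row instead (its value becoming NaN):
# Hypothetical variant: keep all rows of df_orig, filling value with NaN
# for IDs that have no 2004 entry.
df_final = pd.merge(df_orig, df_2004, on='ID', how='left')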
I have the following problem:
If a year or day does not exist in df2, a price and a listing_id still get filled in by the merge, but those should be NaN.
The second problem is that when merging, as soon as there are multiple rows in df2 for the same day, the temperature is also copied onto the second row, for example:
d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
d2 = {'id': [122, 244], 'day': [1, 1],
'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 1 4 440 2001
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 1 1 20 2001 4 440 # <-- The second temperature is wrong :/
This should not be so: if I later also have a day-1 entry from the year 2002 with a temperature of 30 and I want to calculate the average, I would get (20 + 20 + 30) / 3 = 23.3, whereas it should be (20 + 30) / 2 = 25. Therefore, once a temperature has already been used, the duplicate row should contain NaN instead.
Code Snippet
d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2],
'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
1 2 2 40 2002
2 3 3 50 2004
3 4 4 60 2005
4 5 2 20 1999
d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500],
'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 2 4 440 2002
2 387 3 5 500 2004
3 4454 4 6 6600 2005
4 521 2 7 500 2005
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 40 2002 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 4 440
6 5 2 20 1999 7 500
What I want
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 NaN 2005 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 NaN NaN
IIUC:
>>> df.merge(df2[['day', 'listing_id', 'price', 'year']],
             on=['day', 'year'], how='outer')
id day temperature year listing_id price
0 1.0 1 20.0 2001 2.0 20.0
1 2.0 2 40.0 2002 4.0 440.0
2 3.0 3 50.0 2004 5.0 500.0
3 4.0 4 60.0 2005 6.0 6600.0
4 5.0 2 20.0 1999 NaN NaN
5 NaN 2 NaN 2005 7.0 500.0
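The outer join is what keeps the df2 row that has no matching temperature. For comparison, a sketch with the same df and df2: a left merge on the same keys would drop the (day=2, year=2005) listing instead of showing it with NaN temperature.
# Comparison sketch: a left merge keeps only df's rows.
df.merge(df2[['day', 'listing_id', 'price', 'year']], on=['day', 'year'], how='left')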
In the following dataframe
d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
'tin': [12, 23, 24, 28,30, 12,7],
'ptin': [12, 23, 28, 22, 12, 12,0] }
df = pd.DataFrame(data=d)
If I run the following code:
df = (df.groupby(['ptin', 'tin', 'year'])
.apply(lambda x : x['tin'].isin(x['ptin']).astype(int).sum())
.reset_index(name='matches'))
df
I get the following result:
ptin tin year matches
0 12 3.0 1999 0
1 12 3.0 2001 0
2 22 1.0 2002 0
3 23 1.0 2002 0
This gives me the tin values that match a ptin, grouped by year.
Now if I want to find the last occurrence of, say, tin == 12, I should get 2001. I want to add that as a column, as well as the difference between 1999 and 2001 (which is two) in another column, so that my answer looks like the table below:
ptin tin year matches lastoccurence length
0 12 3.0 1999 0 0 0
1 12 3.0 2001 0 2001 2
2 22 1.0 2002 0 2002 1
3 23 1.0 2002 0 2002 1
Any help would be appreciated. I could take solution in either pandas or SQL if that is possible.
I think this will do magic (at least partially?):
df['duration'] = df.sort_values(['ptin','year']).groupby('ptin')['year'].diff()
df = df.dropna(subset=['duration'])
print (df)
ptin tin year matches duration
2 12 12 2001 1 2.0
3 12 30 2004 0 3.0
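For the 'last occurrence' column asked about in the question, one hedged sketch (keeping the question's spelling of the column name) is to broadcast the latest year within each ptin group:
# Hypothetical sketch: latest year seen in each ptin group, repeated on every row.
df['lastoccurence'] = df.groupby('ptin')['year'].transform('max')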
So I used a group by on a pandas dataframe which looks like this
df.groupby(['year','month'])['AMT'].agg('sum')
And I get something like this
year month
2003 1 114.00
2 9195.00
3 300.00
5 200.00
6 450.00
7 68.00
8 750.00
9 3521.00
10 250.00
11 799.00
12 1000.00
2004 1 8551.00
2 9998.00
3 17334.00
4 2525.00
5 16014.00
6 9132.00
7 10623.00
8 7538.00
9 3650.00
10 7733.00
11 10128.00
12 4741.00
2005 1 6965.00
2 3208.00
3 8630.00
4 7776.00
5 11950.00
6 11717.00
7 1510.00
...
2015 7 1431441.00
8 966974.00
9 1121650.00
10 1200104.00
11 1312191.90
12 482535.00
2016 1 1337343.00
2 1465068.00
3 1170113.00
4 1121691.00
5 1302936.00
6 1518047.00
7 1251844.00
8 825215.00
9 1491626.00
10 1243877.00
11 1632252.00
12 750995.50
2017 1 905974.00
2 1330182.00
3 1382628.52
4 1146789.00
5 1201425.00
6 1278701.00
7 1172596.00
8 1517116.50
9 1108609.00
10 1360841.00
11 1340386.00
12 860686.00
What I want is to select just the max of the summed column, so that the final data frame has only the maximum month for each year, something like:
year month
2003 2 9195.00
2004 3 17334.00
2005 5 11950.00
... and so on
What do I have to add to my group by aggregation to do this?
I think you need GroupBy.idxmax:
s = df.groupby(['year','month'])['AMT'].sum()
out = s.loc[s.groupby(level=0).idxmax()]
# equivalent in newer pandas versions, grouping by the index level name:
# out = s.loc[s.groupby('year').idxmax()]
print (out)
year month
2003 2 9195.0
2004 3 17334.0
2005 5 11950.0
Name: AMT, dtype: float64
If multiple max values per year are possible:
out = s[s == s.groupby(level=0).transform('max')]
print (out)
year month
2003 2 9195.0
2004 3 17334.0
2005 5 11950.0
Name: AMT, dtype: float64
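If a flat DataFrame is preferred over the (year, month)-indexed Series, a small follow-up on the out from either variant above:
# Turn the (year, month) index levels back into ordinary columns.
out = out.reset_index()
print(out)   # columns: year, month, AMT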
You can use GroupBy + transform with max. Note this gives multiple maximums for any years where a tie exists. This may or may not be what you require.
As you have requested, it's possible to do this in 2 steps, first summing and then calculating maximums by year.
df = pd.DataFrame({'year': [2003, 2003, 2003, 2004, 2004, 2004],
'month': [1, 2, 2, 1, 1, 2],
'AMT': [100, 200, 100, 100, 300, 100]})
# STEP 1: sum by year + month
df2 = df.groupby(['year', 'month']).sum().reset_index()
# STEP 2: filter for max by year
res = df2[df2['AMT'] == df2.groupby(['year'])['AMT'].transform('max')]
print(res)
year month AMT
1 2003 2 300
2 2004 1 400
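If exactly one row per year is wanted even when ties occur, one possible tie-breaker (a sketch on the df2 from step 1; the res_single name is just for illustration) is to sort by AMT and keep the first row per year:
# Hypothetical tie-breaker: keep a single maximal month per year.
res_single = (df2.sort_values('AMT', ascending=False)
                 .drop_duplicates('year')
                 .sort_values('year'))
print(res_single)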