grouping by count, year and displaying the last occurrence and its count - python

Given the following dataframe:
import pandas as pd

d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}
df = pd.DataFrame(data=d)
If I run the following code:
df = (df.groupby(['ptin', 'tin', 'year'])
        .apply(lambda x: x['tin'].isin(x['ptin']).astype(int).sum())
        .reset_index(name='matches'))
df
I get the following result:
ptin tin year matches
0 12 3.0 1999 0
1 12 3.0 2001 0
2 22 1.0 2002 0
3 23 1.0 2002 0
This gives me the tin values matching ptin, grouped by year.
Now, if I want to find the last occurrence of, say, tin == 12, I should get 2001. I want to add that as a column, as well as the difference between 1999 and 2001 (which is two) in another column, so that my answer looks like the following:
ptin tin year matches lastoccurrence length
0 12 3.0 1999 0 0 0
1 12 3.0 2001 0 2001 2
2 22 1.0 2002 0 2002 1
3 23 1.0 2002 0 2002 1
Any help would be appreciated. I could take solution in either pandas or SQL if that is possible.

I think this will do magic (at least partially?):
df['duration'] = df.sort_values(['ptin','year']).groupby('ptin')['year'].diff()
df = df.dropna(subset=['duration'])
print (df)
ptin tin year matches duration
2 12 12 2001 1 2.0
3 12 30 2004 0 3.0
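To also get the last-occurrence and length columns asked for above, one rough sketch (my addition; it assumes "last occurrence" means the most recent year in which a given tin appears in the raw data) is to broadcast the per-tin maximum year back onto each row and take the gap from it:
import pandas as pd

d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}
df = pd.DataFrame(data=d)

# Most recent year in which each tin value occurs, broadcast to every row of that tin
df['lastoccurrence'] = df.groupby('tin')['year'].transform('max')
# Gap between that last occurrence and the row's own year
df['length'] = df['lastoccurrence'] - df['year']
For tin == 12 this gives lastoccurrence 2001 on both rows and length 0 and 2, which is close to (though not exactly) the layout requested above.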

Related

How to interpolate missing years within pd.groupby()

Problem:
I have a dataframe that contains entries at 5-year intervals. I need to group the entries by the 'id' column and interpolate values between the first and last item in each group. I understand that it has to be some combination of groupby(), set_index() and interpolate(), but I am unable to make it work for the whole input dataframe.
Sample df:
import pandas as pd
data = {
    'id': ['a', 'b', 'a', 'b'],
    'year': [2005, 2005, 2010, 2010],
    'val': [0, 0, 100, 100],
}
df = pd.DataFrame.from_dict(data)
example input df:
_ id year val
0 a 2005 0
1 a 2010 100
2 b 2005 0
3 b 2010 100
expected output df:
_ id year val type
0 a 2005 0 original
1 a 2006 20 interpolated
2 a 2007 40 interpolated
3 a 2008 60 interpolated
4 a 2009 80 interpolated
5 a 2010 100 original
6 b 2005 0 original
7 b 2006 20 interpolated
8 b 2007 40 interpolated
9 b 2008 60 interpolated
10 b 2009 80 interpolated
11 b 2010 100 original
'type' is not necessary; it's just for illustration purposes.
Question:
How can I add missing years to the groupby() view and interpolate() their corresponding values?
Thank you!
Using a temporary reshape with pivot and unstack, plus reindex + interpolate to add the missing years:
out = (df
       .pivot(index='year', columns='id', values='val')
       .reindex(range(df['year'].min(), df['year'].max() + 1))
       .interpolate('index')
       .unstack(-1)
       .reset_index(name='val')
       )
Output:
id year val
0 a 2005 0.0
1 a 2006 20.0
2 a 2007 40.0
3 a 2008 60.0
4 a 2009 80.0
5 a 2010 100.0
6 b 2005 0.0
7 b 2006 20.0
8 b 2007 40.0
9 b 2008 60.0
10 b 2009 80.0
11 b 2010 100.0
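If the type flag from the expected output is also wanted on top of this reshaping approach, one possible sketch (my addition; it assumes an (id, year) pair that exists in the original df counts as 'original'):
import numpy as np
import pandas as pd

# Mark rows whose (id, year) pair appears in the original frame
orig_pairs = pd.MultiIndex.from_frame(df[['id', 'year']])
out_pairs = pd.MultiIndex.from_frame(out[['id', 'year']])
out['type'] = np.where(out_pairs.isin(orig_pairs), 'original', 'interpolated')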
A solution that creates the years from the minimal and maximal year of each group independently:
First create the missing rows with DataFrame.reindex per group, using each group's minimal and maximal year, then interpolate with Series.interpolate, and finally flag which values come from the original DataFrame in a new column:
import numpy as np

df = (df.set_index('year')
        .groupby('id')['val']
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1)).interpolate())
        .reset_index()
        .merge(df, how='left', indicator=True)
        .assign(type=lambda x: np.where(x.pop('_merge').eq('both'),
                                        'original',
                                        'interpolated')))
print (df)
id year val type
0 a 2005 0.0 original
1 a 2006 20.0 interpolated
2 a 2007 40.0 interpolated
3 a 2008 60.0 interpolated
4 a 2009 80.0 interpolated
5 a 2010 100.0 original
6 b 2005 0.0 original
7 b 2006 20.0 interpolated
8 b 2007 40.0 interpolated
9 b 2008 60.0 interpolated
10 b 2009 80.0 interpolated
11 b 2010 100.0 original

Pandas: How to replace column values in panel dataset based on ID and condition

So I have a panel df that looks like this:
ID  year  value
 1  2002      8
 1  2003      9
 1  2004     10
 2  2002     11
 2  2003     11
 2  2004     12
I want to set the value for every ID, for all years, to that ID's value in 2004. How do I do this?
The df should then look like this:
ID  year  value
 1  2002     10
 1  2003     10
 1  2004     10
 2  2002     12
 2  2003     12
 2  2004     12
I could not find anything online. So far I have tried getting the value for every ID for year 2004, creating a new df from that, and merging it back in, but that is super slow.
We can use Series.map for this. First we select the 2004 values and create our mapping:
mapping = df[df["year"].eq(2004)].set_index("ID")["value"]
df["value"] = df["ID"].map(mapping)
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
Let's convert the values whose corresponding year is not 2004 to NaN, then take the max value per ID.
df['value'] = (df.assign(value=df['value'].mask(df['year'].ne(2004)))
                 .groupby('ID')['value'].transform('max'))
print(df)
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
Another method, for some variety.
import numpy as np

# Make everything that isn't 2004 null~
df.loc[df.year.ne(2004), 'value'] = np.nan
# Fill the values by ID~
df['value'] = df.groupby('ID')['value'].bfill()
Output:
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
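A small caveat with this approach (my note, not part of the answer above): bfill fills each NaN from the next non-null row within its group, so it relies on the 2004 row coming after the other years of its ID. If the frame is not already ordered, sorting first keeps the fill correct; a sketch with shuffled years:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'year': [2004, 2002, 2003, 2003, 2002, 2004],
                   'value': [10, 8, 9, 11, 11, 12]})

df = df.sort_values(['ID', 'year'])               # ensure 2004 is last within each ID
df.loc[df.year.ne(2004), 'value'] = np.nan        # blank out the non-2004 values
df['value'] = df.groupby('ID')['value'].bfill()   # backfill each ID from its 2004 row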
Yet another method, a bit longer but quite intuitive: build a lookup table for ID -> value, then perform the lookup using pandas.merge.
import pandas as pd
# Original dataframe
df_orig = pd.DataFrame([(1, 2002, 8), (1, 2003, 9), (1, 2004, 10), (2, 2002, 11), (2, 2003, 11), (2, 2004, 12)])
df_orig.columns = ['ID', 'year', 'value']
# Dataframe with 2004 IDs (copy so the drop below doesn't act on a view of df_orig)
df_2004 = df_orig[df_orig['year'] == 2004].copy()
df_2004.drop(columns=['year'], inplace=True)
print(df_2004)
# Drop values from df_orig and replace with those from df_2004
df_orig.drop(columns=['value'], inplace=True)
df_final = pd.merge(df_orig, df_2004, on='ID', how='right')
print(df_final)
df_2004:
ID value
2 1 10
5 2 12
df_final:
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12

Merge should adopt NaN values if no matching value exists [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have the following problem:
If a year or date does not exist in df2, then a price and a listing_id are still filled in during the merge, but those should be NaN.
The second problem is that, when merging, as soon as there are multiple rows with the same day and year, the temperature is also merged onto the second row, for example:
d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
d2 = {'id': [122, 244], 'day': [1, 1],
'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 1 4 440 2001
df3 = pd.merge(df,df2[['day', 'listing_id', 'price']],
left_on='day', right_on = 'day',how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 1 1 20 2001 4 440 # <-- The second temperature is wrong :/
This should not be so, because if I later have another date from year 2002 that falls on day 1 with a temperature of 30 and I want to calculate the average, I get (20 + 20 + 30) / 3 ≈ 23.3. It should be (20 + 30) / 2 = 25. Therefore, if a value has already been used, there should be a NaN value instead.
Code Snippet
d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2],
'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
1 2 2 40 2002
2 3 3 50 2004
3 4 4 60 2005
4 5 2 20 1999
d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500],
'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 2 4 440 2002
2 387 3 5 500 2004
3 4454 4 6 6600 2005
4 521 2 7 500 2005
df3 = pd.merge(df,df2[['day','listing_id', 'price']],
left_on='day', right_on = 'day',how='left').drop('day',axis=1)
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 40 2002 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 4 440
6 5 2 20 1999 7 500
What I want:
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 NaN 2005 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 NaN NaN
IIUC, merge on both day and year, so rows without a matching (day, year) pair keep NaN instead of borrowing another row's values:
>>> df.merge(df2[['day', 'listing_id', 'price', 'year']],
...          on=['day', 'year'], how='outer')
id day temperature year listing_id price
0 1.0 1 20.0 2001 2.0 20.0
1 2.0 2 40.0 2002 4.0 440.0
2 3.0 3 50.0 2004 5.0 500.0
3 4.0 4 60.0 2005 6.0 6600.0
4 5.0 2 20.0 1999 NaN NaN
5 NaN 2 NaN 2005 7.0 500.0

Summarize rows in df

I have the following example df:
housing = {'year': [2001, 2002, 2003, 2004, 2005],
'moved in': [10, 26, 15, 11, 12],
'moved out': [4, 15, 23, 1, 3]}
df = pd.DataFrame(housing, columns = ['year', 'moved in', 'moved out'])
Now I want to create a column of calculated values showing the number of people living in the house in a given year. For the first row, the result is the number of people who moved in minus the number who moved out. For each following row, that running result should have the year's move-ins added and its move-outs subtracted, giving the number of people who still live in the house in that year. I would like to iterate this through the whole df.
Is there a solution for it? Thank you in advance.
Basically you need a rolling sum over each year's net change.
df['current'] = (df['moved in'] - df['moved out']).rolling(window=len(df), min_periods=1).sum()
print(df)
year moved in moved out current
0 2001 10 4 6.0
1 2002 26 15 17.0
2 2003 15 23 9.0
3 2004 11 1 19.0
4 2005 12 3 28.0
With the net change column:
df['net change'] = df['moved in'] - df['moved out']
df['current'] = df['net change'].rolling(window=len(df), min_periods=1).sum()
print(df)
year moved in moved out net change current
0 2001 10 4 6 6.0
1 2002 26 15 11 17.0
2 2003 15 23 -8 9.0
3 2004 11 1 10 19.0
4 2005 12 3 9 28.0
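For what it's worth, the same running total can also be written with cumsum(), which is equivalent here because the rolling window spans the whole frame (a sketch on the same df):
df['current'] = (df['moved in'] - df['moved out']).cumsum()
print(df)
   year  moved in  moved out  current
0  2001        10          4        6
1  2002        26         15       17
2  2003        15         23        9
3  2004        11          1       19
4  2005        12          3       28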

Pandas dataframe frequency operation

I am just starting with pandas, and I would like to know how to count the number of unique documents per year per company.
My data are:
df
year document_id company
0 1999 3 Orange
1 1999 5 Orange
2 1999 3 Orange
3 2001 41 Banana
4 2001 21 Strawberry
5 2001 18 Strawberry
6 2002 44 Orange
In the end, I would like to have a new dataframe like this:
year document_id company nbDocument
0 1999 [3,5] Orange 2
1 2001 [41] Banana 1
2 2001 [21,18] Strawberry 2
3 2002 [44] Orange 1
I tried:
count2 = apyData.groupby(['year','company']).agg({'document_id': pd.Series.value_counts})
But with this groupby operation I'm not able to get that kind of structure and count the unique values for Orange in 1999, for example. Is there a way to do this?
Thx
You could create a new DataFrame and add the unique document_id values using a list comprehension as follows:
result = pd.DataFrame()
result['document_id'] = df.groupby(['company', 'year']).apply(lambda x: [d for d in x['document_id'].drop_duplicates()])
Now that you have a list of unique document_id values, you only need the length of this list:
result['nbDocument'] = result.document_id.apply(lambda x: len(x))
to get:
result.reset_index().sort_values(['company', 'year'])
company year document_id nbDocument
0 Banana 2001 [41] 1
1 Orange 1999 [3, 5] 2
2 Orange 2002 [44] 1
3 Strawberry 2001 [21, 18] 2
This produces the desired output (the list column is named document_id and the count column nbDocument):
out = pd.DataFrame()
grouped = df.groupby(['year', 'company'])
out['document_id'] = grouped.apply(lambda x: list(x['document_id'].drop_duplicates()))
out['nbDocument'] = out['document_id'].apply(lambda x: len(x))
print(out.reset_index().sort_values(['year', 'company']))
   year     company document_id  nbDocument
0  1999      Orange      [3, 5]           2
1  2001      Banana        [41]           1
2  2001  Strawberry    [21, 18]           2
3  2002      Orange        [44]           1
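A more compact variant (a sketch; it uses SeriesGroupBy.unique/nunique and assumes plain Python lists in the output are acceptable):
g = df.groupby(['year', 'company'])['document_id']
out = pd.DataFrame({'document_id': g.unique().apply(list),  # unique ids per group, as lists
                    'nbDocument': g.nunique()}).reset_index()
print(out)
   year     company document_id  nbDocument
0  1999      Orange      [3, 5]           2
1  2001      Banana        [41]           1
2  2001  Strawberry    [21, 18]           2
3  2002      Orange        [44]           1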
