I wonder how to count cumulative unique values by group in Python (pandas)?
Below is an example dataframe:
Group  Year  Type
A      1998  red
A      1998  blue
A      2002  red
A      2005  blue
A      2008  blue
A      2008  yello
B      1998  red
B      2001  red
B      2003  red
C      1996  red
C      2002  orange
C      2002  red
C      2012  blue
C      2012  yello
I need to create a new column within each "Group". Its value should be the cumulative count of unique values in column "Type", accumulated by column "Year".
Below is the dataframe I want.
For example:
(1) For Group A in year 1998, I count the unique values of Type up to and including 1998, and there are two unique values of Type: red and blue.
(2) For Group A in year 2002, I count the unique values of Type in years 1998 and 2002, and there are still two unique values of Type: red and blue.
(3) For Group A in year 2008, I count the unique values of Type in years 1998, 2002, 2005, and 2008, and there are three unique values of Type: red, blue, and yellow.
Group  Year  Type    Want
A      1998  red     2
A      1998  blue    2
A      2002  red     2
A      2005  blue    2
A      2008  blue    3
A      2008  yello   3
B      1998  red     1
B      2001  red     1
B      2003  red     1
C      1996  red     1
C      2002  orange  2
C      2002  red     2
C      2012  blue    4
C      2012  yello   4
One more thing about this dataframe: not all groups have rows in the same years. For example, group A has two rows in each of 1998 and 2008 and one row in each of 2002 and 2005, while group B only has rows in 1998, 2001, and 2003.
How can I address this? Any help is much appreciated. Thanks!
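For reference, the example data can be reproduced with something like the following (my own reconstruction of the table above, not code from the question):
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAAAABBBCCCCC'),
    'Year':  [1998, 1998, 2002, 2005, 2008, 2008,
              1998, 2001, 2003,
              1996, 2002, 2002, 2012, 2012],
    'Type':  ['red', 'blue', 'red', 'blue', 'blue', 'yello',
              'red', 'red', 'red',
              'red', 'orange', 'red', 'blue', 'yello'],
})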
For each Group, append a new column Want with the values you want:
def f(df):
    # per Year: collect the Types, concatenate the lists cumulatively, then count distinct values
    want = df.groupby('Year')['Type'].agg(list).cumsum().apply(set).apply(len)
    want.name = 'Want'
    # align the per-Year counts back onto the original rows
    return df.merge(want, on='Year')

df.groupby('Group', group_keys=False).apply(f).reset_index(drop=True)
Result:
Group Year Type Want
0 A 1998 red 2
1 A 1998 blue 2
2 A 2002 red 2
3 A 2005 blue 2
4 A 2008 blue 3
5 A 2008 yello 3
6 B 1998 red 1
7 B 2001 red 1
8 B 2003 red 1
9 C 1996 red 1
10 C 2002 orange 2
11 C 2002 red 2
12 C 2012 blue 4
13 C 2012 yello 4
Notes:
I think the use of .merge here is efficient: it broadcasts the per-year counts back onto the original rows in one step.
You can also use a single .apply inside f instead of two chained ones to improve efficiency: .apply(lambda x: len(set(x))), as sketched below.
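A sketch of f with that single .apply (same logic, one pass fewer over the intermediate values):
def f(df):
    want = df.groupby('Year')['Type'].agg(list).cumsum().apply(lambda x: len(set(x)))
    want.name = 'Want'
    return df.merge(want, on='Year')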
Related
So I have a panel df that looks like this:
ID  year  value
1   2002  8
1   2003  9
1   2004  10
2   2002  11
2   2003  11
2   2004  12
I want to set the value for every ID and for all years to the value in 2004. How do I do this?
The df should then look like this:
ID  year  value
1   2002  10
1   2003  10
1   2004  10
2   2002  12
2   2003  12
2   2004  12
I could not find anything online. So far I have tried getting the value for every ID for year 2004, creating a new df from that, and then merging it back in. However, that is super slow.
We can use Series.map for this. First we select the 2004 rows and create our mapping:
mapping = df[df["year"].eq(2004)].set_index("ID")["value"]
df["value"] = df["ID"].map(mapping)
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
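One caveat (my note, not from the answer): any ID without a 2004 row comes back from .map as NaN. A sketch of how to keep the original value in that case:
df["value"] = df["ID"].map(mapping).fillna(df["value"])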
Let's convert value to NaN wherever the corresponding year is not 2004, then take the max value per ID.
df['value'] = (df.assign(value=df['value'].mask(df['year'].ne(2004)))
.groupby('ID')['value'].transform('max'))
print(df)
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
Another method, for some variety.
import numpy as np

# Make everything that isn't 2004 null~
df.loc[df.year.ne(2004), 'value'] = np.nan
# Fill the earlier NaNs within each ID from the 2004 row~
df['value'] = df.groupby('ID')['value'].bfill()
Output:
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
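Note that the mask/transform and bfill approaches leave value as float because of the intermediate NaNs. A one-line sketch to restore integers, assuming every ID has a 2004 row so no NaNs remain:
df['value'] = df['value'].astype(int)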
Yet another method, a bit longer but quite intuitive: build a lookup table of ID -> value, then perform the lookup using pandas.merge.
import pandas as pd
# Original dataframe
df_orig = pd.DataFrame([(1, 2002, 8), (1, 2003, 9), (1, 2004, 10), (2, 2002, 11), (2, 2003, 11), (2, 2004, 12)])
df_orig.columns = ['ID', 'year', 'value']
# Dataframe with 2004 IDs (copy the slice so the drop below doesn't warn)
df_2004 = df_orig[df_orig['year'] == 2004].copy()
df_2004.drop(columns=['year'], inplace=True)
print(df_2004)
# Drop values from df_orig and replace with those from df_2004
df_orig.drop(columns=['value'], inplace=True)
df_final = pd.merge(df_orig, df_2004, on='ID', how='right')
print(df_final)
df_2004:
ID value
2 1 10
5 2 12
df_final:
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
Another answer to the original question (counting cumulative unique values per group): use a custom lambda function with factorize in GroupBy.transform:
f = lambda x: pd.factorize(x)[0]
df['Want1'] = df.groupby('Group', sort=False)['Type'].transform(f) + 1
print (df)
Group Year Type Want1
0 A 1998 red 1
1 A 2002 red 1
2 A 2005 blue 2
3 A 2008 blue 2
4 A 2009 yello 3
5 B 1998 red 1
6 B 2001 red 1
7 B 2003 red 1
8 C 1996 red 1
9 C 2002 orange 2
10 C 2008 blue 3
11 C 2012 yello 4
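The factorize index numbers each new Type in its order of first appearance within the Group. To reproduce the Want column from the question exactly, one possible extension (my own sketch, assuming rows are sorted by Year within each Group as in the example) is to take a running maximum and then broadcast the per-year maximum:
codes = df.groupby('Group', sort=False)['Type'].transform(lambda x: pd.factorize(x)[0] + 1)
# running count of distinct Types seen so far within each Group
df['Want'] = codes.groupby(df['Group']).cummax()
# rows that share the same Group and Year should show the same count
df['Want'] = df.groupby(['Group', 'Year'])['Want'].transform('max')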
I have a dataframe that looks like this:
I want to keep only the consecutive years in each group, so that, for example, the year 2005 in group A and the years 2009 and 2011 in group B are deleted.
I created a column of year differences with df['year_diff']=df.groupby(['group'])['Year'].diff(), and then kept only the rows where the year difference equals 1.
However, this method also deletes the first row of each run of consecutive years, since its year difference is NaN. For example, the year 2000 would be deleted from its run of consecutive years in group A. Is there a way to avoid this problem?
shift
Get the year diffs as the OP first did, then keep a row if its own diff equals 1 or the next row's diff equals 1 (so the first year of each consecutive run is kept):
yd = df.Year.groupby(df.group).diff().eq(1)
df[yd | yd.shift(-1)]
group Year
0 A 2000
1 A 2001
2 A 2002
3 A 2003
5 A 2007
6 A 2008
7 A 2009
8 A 2010
9 A 2011
10 B 2005
11 B 2006
12 B 2007
15 B 2013
16 B 2014
17 B 2015
18 B 2016
19 B 2017
Setup
Thx jez
a = [('A',x) for x in range(2000, 2012) if x not in [2004,2006]]
b = [('B',x) for x in range(2005, 2018) if x not in [2008,2010,2012]]
df = pd.DataFrame(a + b, columns=['group','Year'])
If I understand correctly: use diff and cumsum to create an additional group key, then group by it together with your group column, and drop the runs whose count equals 1.
df[df.g.groupby([df.g,df.Year.diff().ne(1).cumsum()]).transform('count').ne(1)]
Out[317]:
g Year
0 A 2000
1 A 2001
2 A 2002
3 A 2003
5 A 2007
6 A 2008
7 A 2009
8 A 2010
9 A 2011
10 B 2005
11 B 2006
12 B 2007
15 B 2013
16 B 2014
17 B 2015
18 B 2016
19 B 2017
Data

df = pd.DataFrame({'g': list('AAAAAAAAAABBBBBBBBBB'),
                   'Year': [2000, 2001, 2002, 2003, 2005, 2007, 2008, 2009, 2010, 2011,
                            2005, 2006, 2007, 2009, 2011, 2013, 2014, 2015, 2016, 2017]})
You can make two difference columns: one for the difference from the previous row and one for the difference from the next row. Then use np.where to flag the rows where the first difference equals 1 OR the second difference equals -1.
df=pd.DataFrame({'group':list('AAAAAAAAAABBBBBBBBBB'),'Year':[2000,2001,2002,2003,2005,2007,2008,2009,2010,2011,2005,2006,2007,2009,2011,2013,2014,2015,2016,2017]})
df['year_diff']=df.groupby(['group'])['Year'].diff()
df['year_diff2']=df.groupby(['group'])['Year'].diff(-1)
df['check']=np.where((df.year_diff==1) | (df.year_diff2==-1),True,False)
Then drop all the rows where df.check == False (a one-liner for that step is sketched below).
This seems like a long method, but it is quite easy to follow the logic.
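A minimal sketch of that final drop, also removing the helper columns created above:
df = df[df['check']].drop(columns=['year_diff', 'year_diff2', 'check'])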
If I wanted to aggregate/sum a column over a certain time period using a pivot table, how would I do it? For example, in the table below, if I want the aggregate sum of fruits for 2000-2001 and for 2002-2004, what code would I write? Currently I have this so far:
import pandas as pd
import numpy as np
UG = pd.read_csv('fruitslist.csv', index_col=2)
UG = UG.pivot_table(values = 'Count', index = 'Fruits', columns = 'Year', aggfunc=np.sum)
UG.to_csv('fruits.csv')
This returns counts for each fruit for each individual year, but I can't seem to aggregate by decade (e.g. 90s, 00s, 2010s).
Fruits Count Year
Apple 4 1995
Orange 5 1996
Orange 6 2001
Guava 8 2003
Banana 6 2010
Guava 8 2011
Peach 7 2012
Guava 9 2013
Thanks in advance!
This might help: convert the Year column to decades inside the groupby and then aggregate.
"""
Fruits Count Year
Apple 4 1995
Orange 5 1996
Orange 6 2001
Guava 8 2003
Banana 6 2010
Guava 8 2011
Peach 7 2012
Guava 9 2013
"""
df = pd.read_clipboard()
output = df.groupby([
df.Year//10*10,
'Fruits'
]).agg({
'Count' : 'sum'
})
print(output)
Count
Year Fruits
1990 Apple 4
Orange 5
2000 Guava 8
Orange 6
2010 Banana 6
Guava 17
Peach 7
Edit
If you want to group the years by a different amount, say every 2 years, just change the Year group:
print(df.groupby([
df.Year//2*2,
'Fruits'
]).agg({
'Count' : 'sum'
}))
Count
Year Fruits
1994 Apple 4
1996 Orange 5
2000 Orange 6
2002 Guava 8
2010 Banana 6
Guava 8
2012 Guava 9
Peach 7
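If you specifically want to keep using pivot_table as in your current code, here is a sketch under the assumption that the data is in a dataframe df with Fruits/Count/Year columns: derive a decade column and pivot on it.
UG = (df.assign(decade=df['Year'] // 10 * 10)
        .pivot_table(values='Count', index='Fruits', columns='decade', aggfunc='sum'))
print(UG)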
I'm just starting with pandas and I would like to know how to count the number of unique documents per year per company.
My data are :
df
year document_id company
0 1999 3 Orange
1 1999 5 Orange
2 1999 3 Orange
3 2001 41 Banana
4 2001 21 Strawberry
5 2001 18 Strawberry
6 2002 44 Orange
At the end, I would like to have a new dataframe like this
year document_id company nbDocument
0 1999 [3,5] Orange 2
1 2001 [41] Banana 1
2 2001 [21,18] Strawberry 2
3 2002 [44] Orange 1
I tried :
count2 = apyData.groupby(['year','company']).agg({'document_id': pd.Series.value_counts})
But with this groupby operation, I'm not able to get that kind of structure or count the unique values for Orange in 1999, for example. Is there a way to do this?
Thx
You could create a new DataFrame and add the unique document_id using a list comprehension as follows:
result = pd.DataFrame()
result['document_id'] = df.groupby(['company', 'year']).apply(lambda x: [d for d in x['document_id'].drop_duplicates()])
now that you have a list of unique document_id, you only need to get the length of this list:
result['nbDocument'] = result.document_id.apply(lambda x: len(x))
to get:
result.reset_index().sort_values(['company', 'year'])
company year document_id nbDocument
0 Banana 2001 [41] 1
1 Orange 1999 [3, 5] 2
2 Orange 2002 [44] 1
3 Strawberry 2001 [21, 18] 2
This produces the desired output:
out = pd.DataFrame()
grouped = df.groupby(['year', 'company'])
out['document_id'] = grouped.apply(lambda x: list(x['document_id'].drop_duplicates()))
out['nbDocument'] = out['document_id'].apply(lambda x: len(x))
print(out.reset_index().sort_values(['year', 'company']))
   year     company document_id  nbDocument
0  1999      Orange      [3, 5]           2
1  2001      Banana        [41]           1
2  2001  Strawberry    [21, 18]           2
3  2002      Orange        [44]           1
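For what it's worth, a more compact sketch using named aggregation (my own variant, assuming pandas >= 0.25 and the df from the question):
result = (df.drop_duplicates(subset=['year', 'company', 'document_id'])
            .groupby(['year', 'company'])
            .agg(document_id=('document_id', list),
                 nbDocument=('document_id', 'nunique'))
            .reset_index())
print(result)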