So basically I have 2 DataFrames like this:
Table_1
Apple  Banana  Orange  Date
1      2       4       2020
3      5       2       2021
7      8       9       2022
Table_2
fruit   year
Apple   2020
Apple   2021
Apple   2022
Banana  2020
Banana  2021
Banana  2022
Orange  2020
Orange  2021
Orange  2022
I want to look up the value for each fruit in Table_2 from Table_1, based on the fruit name and the respective year.
The final outcome should look like this:
fruit   year  number
Apple   2020  1
Apple   2021  3
Apple   2022  7
Banana  2020  2
Banana  2021  5
Banana  2022  8
Orange  2020  4
Orange  2021  2
Orange  2022  9
In Excel, for example, one can do something like this:
=INDEX(Table1[[Apple]:[Orange]],MATCH([#year],Table1[Date],0),MATCH([#fruit],Table1[[#Headers],[Apple]:[Orange]],0))
But what is the way to do it in Python?
Assuming pandas, you can melt and merge:
out = (df2
       .merge(df1.rename(columns={'Date': 'year'})
                 .melt('year', var_name='fruit', value_name='number'),
              how='left'
              )
       )
output:
    fruit  year  number
0   Apple  2020       1
1   Apple  2021       3
2   Apple  2022       7
3  Banana  2020       2
4  Banana  2021       5
5  Banana  2022       8
6  Orange  2020       4
7  Orange  2021       2
8  Orange  2022       9
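For readers coming from the Excel formula, the INDEX/MATCH pair can also be mirrored positionally. This is only a minimal sketch and not part of the answer above: it rebuilds the two tables as df1 and df2 (names taken from the answer), and it assumes every (fruit, year) pair in df2 exists in df1, because get_indexer returns -1 for a missing key.

```python
import pandas as pd

# Sample data rebuilt from the question (df1 = Table_1, df2 = Table_2).
df1 = pd.DataFrame({
    "Apple": [1, 3, 7],
    "Banana": [2, 5, 8],
    "Orange": [4, 2, 9],
    "Date": [2020, 2021, 2022],
})
df2 = pd.DataFrame({
    "fruit": ["Apple"] * 3 + ["Banana"] * 3 + ["Orange"] * 3,
    "year": [2020, 2021, 2022] * 3,
})

wide = df1.set_index("Date")

# MATCH(year, Table1[Date]) and MATCH(fruit, headers): positions, -1 if absent.
row_pos = wide.index.get_indexer(df2["year"])
col_pos = wide.columns.get_indexer(df2["fruit"])

# Assumption: all keys are present. Fail loudly rather than read the wrong cell.
if (row_pos < 0).any() or (col_pos < 0).any():
    raise KeyError("some (fruit, year) pairs are missing from df1")

# INDEX(...): pick one cell per row of df2.
lookup = df2.assign(number=wide.to_numpy()[row_pos, col_pos])
print(lookup)
```

Compared with melt and merge, this avoids reshaping df1, but it needs the explicit check for missing keys that a left merge handles by producing NaN.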
How can I count cumulative unique values by group in Python?
Below is the dataframe example:
Group  Year  Type
A      1998  red
A      1998  blue
A      2002  red
A      2005  blue
A      2008  blue
A      2008  yello
B      1998  red
B      2001  red
B      2003  red
C      1996  red
C      2002  orange
C      2002  red
C      2012  blue
C      2012  yello
I need to create a new column, computed within each "Group". Its value should be the cumulative number of unique values of column "Type", accumulated over column "Year".
Below is the dataframe I want.
For example:
(1) For Group A in year 1998, I want to count the unique values of Type in year 1998. There are two: red and blue.
(2) For Group A in year 2002, I want to count the unique values of Type in years 1998 and 2002. There are still two: red and blue.
(3) For Group A in year 2008, I want to count the unique values of Type in years 1998, 2002, 2005, and 2008. There are three: red, blue, and yello.
Group  Year  Type    Want
A      1998  red     2
A      1998  blue    2
A      2002  red     2
A      2005  blue    2
A      2008  blue    3
A      2008  yello   3
B      1998  red     1
B      2001  red     1
B      2003  red     1
C      1996  red     1
C      2002  orange  2
C      2002  red     2
C      2012  blue    4
C      2012  yello   4
One more thing about this dataframe: not all groups have values in the same years. For example, Group A has two rows in each of 1998 and 2008 and one row in each of 2002 and 2005, while Group B has rows in 1998, 2001, and 2003.
I wonder how to address this problem. Your great help means a lot to me. Thanks!
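Within each Group, compute the cumulative number of unique Type values per Year and merge it back as a new column Want:
def f(df):
    # collect the Types per Year, accumulate the lists, then count distinct values
    want = df.groupby('Year')['Type'].agg(list).cumsum().apply(set).apply(len)
    want.name = 'Want'
    # broadcast each per-year count back to every row of that year
    return df.merge(want, on='Year')
df.groupby('Group', group_keys=False).apply(f).reset_index(drop=True)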
Result:
   Group  Year    Type  Want
0      A  1998     red     2
1      A  1998    blue     2
2      A  2002     red     2
3      A  2005    blue     2
4      A  2008    blue     3
5      A  2008   yello     3
6      B  1998     red     1
7      B  2001     red     1
8      B  2003     red     1
9      C  1996     red     1
10     C  2002  orange     2
11     C  2002     red     2
12     C  2012    blue     4
13     C  2012   yello     4
Notes:
The .merge on 'Year' broadcasts each per-year count back to every row of that year.
The two chained .apply calls inside f can be replaced by a single one: .apply(lambda x: len(set(x)))
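Use a custom lambda function with factorize in GroupBy.transform. This numbers each Type by order of first appearance within its Group, so it equals the cumulative unique count only when every row either introduces a new Type or repeats the most recently introduced one: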
f = lambda x: pd.factorize(x)[0]
df['Want1'] = df.groupby('Group', sort=False)['Type'].transform(f) + 1
print(df)
   Group  Year    Type  Want1
0      A  1998     red      1
1      A  2002     red      1
2      A  2005    blue      2
3      A  2008    blue      2
4      A  2009   yello      3
5      B  1998     red      1
6      B  2001     red      1
7      B  2003     red      1
8      C  1996     red      1
9      C  2002  orange      2
10     C  2008    blue      3
11     C  2012   yello      4
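When a Group has several rows per Year, or an earlier Type comes back later, a running count of first appearances may be a safer basis. The following is a minimal sketch under stated assumptions, not taken from either answer: it rebuilds the question's data, assumes rows can be sorted by Year within each Group, and uses the column name Want from the question.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Group": list("AAAAAA") + list("BBB") + list("CCCCC"),
    "Year": [1998, 1998, 2002, 2005, 2008, 2008,
             1998, 2001, 2003,
             1996, 2002, 2002, 2012, 2012],
    "Type": ["red", "blue", "red", "blue", "blue", "yello",
             "red", "red", "red",
             "red", "orange", "red", "blue", "yello"],
})

# Assumption: rows must be in Year order within each Group.
df = df.sort_values(["Group", "Year"]).reset_index(drop=True)

# 1 where a Type appears for the first time in its Group, else 0.
first_seen = (~df.duplicated(["Group", "Type"])).astype(int)

# Running number of distinct Types per Group, row by row.
running = first_seen.groupby(df["Group"]).cumsum()

# Rows sharing a (Group, Year) all receive that year's final running count.
df["Want"] = running.groupby([df["Group"], df["Year"]]).transform("max")
print(df)
```

The last step matters for years with several rows: without it, the first 1998 row of Group A would get 1 instead of 2.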
Let's say I have the following table:
ID FRUIT ORDER
01 apple 1
01 apple 2
01 peach 3
01 apple 4
02 melon 1
02 apple 2
02 apple 3
02 apple 4
Now I want to consolidate rows within the same ID when consecutive values are equal (that is, drop duplicates only if they occur in a sequence) and renumber the order, e.g.
ID FRUIT ORDER
01 apple 1
01 peach 2
01 apple 3
02 melon 1
02 apple 2
EDIT: I forgot to mention the reordering. As shown above, ORDER should be renumbered sequentially within each ID.
Use boolean indexing to keep only the first row of each run of consecutive duplicates, then use cumcount for the new ordering:
a = df['ID'] + df['FRUIT']
#if necessary
#a = df['ID'].astype(str) + df['FRUIT']
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[a.ne(a.shift())].copy()
df['ORDER'] = df.groupby('ID').cumcount().add(1)
print(df)
ID FRUIT ORDER
0 01 apple 1
2 01 peach 2
3 01 apple 3
4 02 melon 1
5 02 apple 2
A simpler filter compares each FRUIT with the previous row. Note that it ignores ID and keeps the original ORDER values:
>>> df
ID FRUIT ORDER
0 01 apple 1
1 01 apple 2
2 01 peach 3
3 01 apple 4
4 02 melon 1
5 02 apple 2
6 02 apple 3
7 02 apple 4
>>> df[df['FRUIT'] != df['FRUIT'].shift(1)]
ID FRUIT ORDER
0 01 apple 1
2 01 peach 3
3 01 apple 4
4 02 melon 1
5 02 apple 2
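The two answers can be combined so that a change in either ID or FRUIT starts a new run, without concatenating the columns into one string. This is a minimal sketch that assumes the sample table from the question and that rows are already in the desired order; the names keys, starts_run, and out are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question; ID is kept as a string.
df = pd.DataFrame({
    "ID": ["01"] * 4 + ["02"] * 4,
    "FRUIT": ["apple", "apple", "peach", "apple",
              "melon", "apple", "apple", "apple"],
    "ORDER": [1, 2, 3, 4, 1, 2, 3, 4],
})

keys = df[["ID", "FRUIT"]]

# True where ID or FRUIT differs from the previous row (always True for row 0).
starts_run = (keys != keys.shift()).any(axis=1)

# Keep the first row of each run, then renumber within each ID.
out = df[starts_run].copy()
out["ORDER"] = out.groupby("ID").cumcount() + 1
print(out)
```

Comparing the two columns separately avoids accidental collisions that string concatenation could produce, for example ID "1" with FRUIT "2x" versus ID "12" with FRUIT "x".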
If I wanted to aggregate values/sum a column by a certain time period, how do I do it using the pivot table? For example in the table below, if I want the aggregate sum of fruits between 2000 - 2001, and 2002 - 2004, what code would I write? Currently I have this so far:
import pandas as pd
import numpy as np
UG = pd.read_csv('fruitslist.csv', index_col=2)
UG = UG.pivot_table(values = 'Count', index = 'Fruits', columns = 'Year', aggfunc=np.sum)
UG.to_csv('fruits.csv')
This returns counts for each fruit for each individual year, but I can't seem to aggregate by decade (e.g. 90s, 00s, 2010s).
Fruits Count Year
Apple 4 1995
Orange 5 1996
Orange 6 2001
Guava 8 2003
Banana 6 2010
Guava 8 2011
Peach 7 2012
Guava 9 2013
Thanks in advance!
This might help. Convert the Year column to decades inside the groupby, then aggregate.
import pandas as pd
"""
Fruits Count Year
Apple 4 1995
Orange 5 1996
Orange 6 2001
Guava 8 2003
Banana 6 2010
Guava 8 2011
Peach 7 2012
Guava 9 2013
"""
# Copy the table above to the clipboard before running this line.
df = pd.read_clipboard()
output = df.groupby([
    df.Year//10*10,   # floor each year to its decade, e.g. 1995 -> 1990
    'Fruits'
]).agg({
    'Count' : 'sum'
})
print(output)
             Count
Year Fruits
1990 Apple       4
     Orange      5
2000 Guava       8
     Orange      6
2010 Banana      6
     Guava      17
     Peach       7
Edit
If you want to group the years by a different amount, say every 2 years, just change the Year group:
print(df.groupby([
    df.Year//2*2,
    'Fruits'
]).agg({
    'Count' : 'sum'
}))
             Count
Year Fruits
1994 Apple       4
1996 Orange      5
2000 Orange      6
2002 Guava       8
2010 Banana      6
     Guava       8
2012 Guava       9
     Peach       7
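Integer division only gives equal-width periods. For the unequal ranges mentioned in the question (2000 - 2001 and 2002 - 2004), one option is pd.cut. The following is a minimal sketch under stated assumptions: it rebuilds the sample table, and the bin edges, the labels, and the name Period are illustrative choices, not something the question or answer specifies.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Orange", "Guava",
               "Banana", "Guava", "Peach", "Guava"],
    "Count": [4, 5, 6, 8, 6, 8, 7, 9],
    "Year": [1995, 1996, 2001, 2003, 2010, 2011, 2012, 2013],
})

# Right-closed bins: (1999, 2001] covers 2000-2001, (2001, 2004] covers 2002-2004.
bins = [1999, 2001, 2004]
labels = ["2000-2001", "2002-2004"]
period = pd.cut(df["Year"], bins=bins, labels=labels).rename("Period")

# Years outside the bins become NaN and are dropped by groupby.
# observed=True keeps only period/fruit combinations that actually occur.
result = df.groupby([period, "Fruits"], observed=True)["Count"].sum()
print(result)
```

If a wide layout like the original pivot table is preferred, result.unstack("Period") should give one column per period.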
I just started with pandas and would like to know how to count the number of unique documents per year per company.
My data are:
df
year document_id company
0 1999 3 Orange
1 1999 5 Orange
2 1999 3 Orange
3 2001 41 Banana
4 2001 21 Strawberry
5 2001 18 Strawberry
6 2002 44 Orange
At the end, I would like to have a new dataframe like this
   year document_id     company  nbDocument
0  1999       [3,5]      Orange           2
1  2001        [41]      Banana           1
2  2001     [21,18]  Strawberry           2
3  2002        [44]      Orange           1
I tried :
count2 = apyData.groupby(['year','company']).agg({'document_id': pd.Series.value_counts})
But with this groupby operation I can't get this kind of structure or count the unique values (for Orange in 1999, for example). Is there a way to do this?
Thx
You could create a new DataFrame and add the unique document_id values using a list comprehension as follows:
result = pd.DataFrame()
result['document_id'] = df.groupby(['company', 'year']).apply(lambda x: [d for d in x['document_id'].drop_duplicates()])
Now that you have a list of unique document_id values, you only need the length of each list:
result['nbDocument'] = result.document_id.apply(len)
This gives:
result.reset_index().sort_values(['company', 'year'])
company year document_id nbDocument
0 Banana 2001 [41] 1
1 Orange 1999 [3, 5] 2
2 Orange 2002 [44] 1
3 Strawberry 2001 [21, 18] 2
This produces the desired output:
out = pd.DataFrame()
grouped = df.groupby(['year', 'company'])
# list of unique document ids per (year, company)
out['document_id'] = grouped.apply(lambda x: list(x['document_id'].drop_duplicates()))
# number of unique documents = length of that list
out['nbDocument'] = out['document_id'].apply(len)
print(out.reset_index().sort_values(['year', 'company']))
   year     company document_id  nbDocument
0  1999      Orange      [3, 5]           2
1  2001      Banana        [41]           1
2  2001  Strawberry    [21, 18]           2
3  2002      Orange        [44]           1
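The same table can also be built without groupby.apply, by dropping duplicate documents first and then using named aggregation. This is a minimal sketch that assumes the sample data from the question; the variable names unique_docs and result are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "year": [1999, 1999, 1999, 2001, 2001, 2001, 2002],
    "document_id": [3, 5, 3, 41, 21, 18, 44],
    "company": ["Orange", "Orange", "Orange", "Banana",
                "Strawberry", "Strawberry", "Orange"],
})

# Keep one row per (year, company, document_id), preserving first-seen order.
unique_docs = df.drop_duplicates(["year", "company", "document_id"])

# Named aggregation: collect the ids into a list and count them in one pass.
result = (
    unique_docs.groupby(["year", "company"])["document_id"]
    .agg(document_id=list, nbDocument="count")
    .reset_index()
)

# Column order as requested in the question.
result = result[["year", "document_id", "company", "nbDocument"]]
print(result)
```

Because duplicates are removed up front, a plain count already equals the number of unique documents, so no per-group Python lambda is needed.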