Pandas dataframe frequency operation - python

I'm just starting with pandas and I would like to know how to count the number of unique documents per year per company.
My data are:
df
   year  document_id     company
0  1999            3      Orange
1  1999            5      Orange
2  1999            3      Orange
3  2001           41      Banana
4  2001           21  Strawberry
5  2001           18  Strawberry
6  2002           44      Orange
In the end, I would like to have a new dataframe like this:
   year document_id     company  nbDocument
0  1999      [3, 5]      Orange           2
1  2001        [41]      Banana           1
2  2001    [21, 18]  Strawberry           2
3  2002        [44]      Orange           1
I tried:
count2 = df.groupby(['year', 'company']).agg({'document_id': pd.Series.value_counts})
But with a groupby operation I'm not able to get this kind of structure and count the unique values for Orange in 1999, for example. Is there a way to do this?
Thanks

You could create a new DataFrame and add the unique document_id values using a list comprehension as follows:
result = pd.DataFrame()
result['document_id'] = df.groupby(['company', 'year']).apply(lambda x: [d for d in x['document_id'].drop_duplicates()])
Now that you have a list of unique document_id values, you only need to get the length of each list:
result['nbDocument'] = result.document_id.apply(len)
to get:
result.reset_index().sort_values(['company', 'year'])
      company  year document_id  nbDocument
0      Banana  2001        [41]           1
1      Orange  1999      [3, 5]           2
2      Orange  2002        [44]           1
3  Strawberry  2001    [21, 18]           2
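For what it's worth, on pandas 0.25+ the same result can be built in one groupby call with named aggregation. A minimal sketch, assuming the sample df from the question:
import pandas as pd

df = pd.DataFrame({'year': [1999, 1999, 1999, 2001, 2001, 2001, 2002],
                   'document_id': [3, 5, 3, 41, 21, 18, 44],
                   'company': ['Orange', 'Orange', 'Orange', 'Banana',
                               'Strawberry', 'Strawberry', 'Orange']})

# One pass: collect the unique ids as a list and count them
result = (df.groupby(['year', 'company'])['document_id']
            .agg(document_id=lambda s: list(s.drop_duplicates()),
                 nbDocument='nunique')
            .reset_index())
print(result)
#    year     company document_id  nbDocument
# 0  1999      Orange      [3, 5]           2
# 1  2001      Banana        [41]           1
# 2  2001  Strawberry    [21, 18]           2
# 3  2002      Orange        [44]           1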

This produces the desired output:
out = pd.DataFrame()
grouped = df.groupby(['year', 'company'])
out['document_id'] = grouped.apply(lambda x: list(x['document_id'].drop_duplicates()))
out['nbDocument'] = out['document_id'].apply(len)
print(out.reset_index().sort_values(['year', 'company']))
   year     company document_id  nbDocument
0  1999      Orange      [3, 5]           2
1  2001      Banana        [41]           1
2  2001  Strawberry    [21, 18]           2
3  2002      Orange        [44]           1

Related

Pandas: How to replace column values in panel dataset based on ID and condition

So I have a panel df that looks like this:
ID  year  value
 1  2002      8
 1  2003      9
 1  2004     10
 2  2002     11
 2  2003     11
 2  2004     12
I want to set the value for every ID and for all years to the value in 2004. How do I do this?
The df should then look like this:
ID  year  value
 1  2002     10
 1  2003     10
 1  2004     10
 2  2002     12
 2  2003     12
 2  2004     12
I could not find anything online. So far I have tried getting the value for every ID for the year 2004, creating a new df from that, and then merging it back in. However, that is super slow.
We can use Series.map for this. First we select the values and create our mapping:
mapping = df[df["year"].eq(2004)].set_index("ID")["value"]
df["value"] = df["ID"].map(mapping)
   ID  year  value
0   1  2002     10
1   1  2003     10
2   1  2004     10
3   2  2002     12
4   2  2003     12
5   2  2004     12
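For reference, the intermediate mapping here is just a Series indexed by ID; printing it for the sample frame should give:
print(mapping)
# ID
# 1    10
# 2    12
# Name: value, dtype: int64
map then performs a vectorized lookup of each row's ID in that Series.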
Let's convert the values where the corresponding year is not 2004 to NaN, then take the max value per ID; since only the 2004 value survives in each group, the max recovers exactly that value.
df['value'] = (df.assign(value=df['value'].mask(df['year'].ne(2004)))
                 .groupby('ID')['value'].transform('max'))
print(df)
   ID  year  value
0   1  2002   10.0
1   1  2003   10.0
2   1  2004   10.0
3   2  2002   12.0
4   2  2003   12.0
5   2  2004   12.0
Another method, for some variety.
import numpy as np

# Make everything that isn't 2004 null~
df.loc[df.year.ne(2004), 'value'] = np.nan
# Fill the values by ID~ (2004 is the last year within each ID, so a
# backward fill pulls its value up to the earlier rows)
df['value'] = df.groupby('ID')['value'].bfill()
Output:
   ID  year  value
0   1  2002   10.0
1   1  2003   10.0
2   1  2004   10.0
3   2  2002   12.0
4   2  2003   12.0
5   2  2004   12.0
Yet another method, a bit longer but quite intuitive: basically, create a lookup table for ID->value, then perform the lookup using pandas.merge.
import pandas as pd

# Original dataframe
df_orig = pd.DataFrame([(1, 2002, 8), (1, 2003, 9), (1, 2004, 10),
                        (2, 2002, 11), (2, 2003, 11), (2, 2004, 12)])
df_orig.columns = ['ID', 'year', 'value']

# Dataframe with the 2004 rows; reassign instead of dropping inplace on a
# slice, which would trigger a SettingWithCopyWarning
df_2004 = df_orig[df_orig['year'] == 2004].drop(columns=['year'])
print(df_2004)

# Drop values from df_orig and replace with those from df_2004
df_orig = df_orig.drop(columns=['value'])
df_final = pd.merge(df_orig, df_2004, on='ID', how='right')
print(df_final)
df_2004:
   ID  value
2   1     10
5   2     12
df_final:
   ID  year  value
0   1  2002     10
1   1  2003     10
2   1  2004     10
3   2  2002     12
4   2  2003     12
5   2  2004     12

Two-Way lookup in Python

So basically I have 2 DataFrames like this:
Table_1

Apple  Banana  Orange  Date
    1       2       4  2020
    3       5       2  2021
    7       8       9  2022

Table_2

fruit   year
Apple   2020
Apple   2021
Apple   2022
Banana  2020
Banana  2021
Banana  2022
Orange  2020
Orange  2021
Orange  2022
So I want to look up the values for the fruits in Table_2 from Table_1, based on the fruit name and the respective year.
The final outcome should look like this:
fruit   year  number
Apple   2020       1
Apple   2021       3
Apple   2022       7
Banana  2020       2
Banana  2021       5
Banana  2022       8
Orange  2020       4
Orange  2021       2
Orange  2022       9
In Excel, for example, one can do something like this:
=INDEX(Table1[[Apple]:[Orange]],MATCH([#year],Table1[Date],0),MATCH([#fruit],Table1[[#Headers],[Apple]:[Orange]],0))
But what is the way to do it in Python?
Assuming pandas, you can melt and merge:
out = (df2
       .merge(df1.rename(columns={'Date': 'year'})
                 .melt('year', var_name='fruit', value_name='number'),
              how='left')
       )
output:
    fruit  year  number
0   Apple  2020       1
1   Apple  2021       3
2   Apple  2022       7
3  Banana  2020       2
4  Banana  2021       5
5  Banana  2022       8
6  Orange  2020       4
7  Orange  2021       2
8  Orange  2022       9
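An equivalent lookup without the merge, for comparison: a sketch assuming the same df1/df2, which stacks Table_1 into a Series keyed by (year, fruit) and reindexes it with the pairs from Table_2:
# Series whose MultiIndex is (Date, fruit column name)
s = df1.set_index('Date').stack()
# Pull out each (year, fruit) pair requested by df2
df2['number'] = s.reindex(pd.MultiIndex.from_frame(df2[['year', 'fruit']])).to_numpy()
Pairs missing from Table_1 would come back as NaN rather than raising.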

Filtering for values in Pivot table columns

If I wanted to aggregate values/sum a column by a certain time period, how do I do it using the pivot table? For example in the table below, if I want the aggregate sum of fruits between 2000 - 2001, and 2002 - 2004, what code would I write? Currently I have this so far:
import pandas as pd
import numpy as np
UG = pd.read_csv('fruitslist.csv', index_col=2)
UG = UG.pivot_table(values = 'Count', index = 'Fruits', columns = 'Year', aggfunc=np.sum)
UG.to_csv('fruits.csv')
This returns counts for each fruit by each individual year, but I can't seem to aggregate by decade (e.g. 90s, 00s, 2010s).
Fruits  Count  Year
Apple       4  1995
Orange      5  1996
Orange      6  2001
Guava       8  2003
Banana      6  2010
Guava       8  2011
Peach       7  2012
Guava       9  2013
Thanks in advance!
This might help. Convert the Year column within a groupby to decades and then aggregate.
"""
Fruits Count Year
Apple 4 1995
Orange 5 1996
Orange 6 2001
Guava 8 2003
Banana 6 2010
Guava 8 2011
Peach 7 2012
Guava 9 2013
"""
df = pd.read_clipboard()
output = df.groupby([
    df.Year // 10 * 10,
    'Fruits'
]).agg({
    'Count': 'sum'
})
print(output)
                 Count
Year Fruits
1990 Apple           4
     Orange          5
2000 Guava           8
     Orange          6
2010 Banana          6
     Guava          17
     Peach           7
Edit
If you want to group the years by a different amount, say every 2 years, just change the Year group:
print(df.groupby([
    df.Year // 2 * 2,
    'Fruits'
]).agg({
    'Count': 'sum'
}))
             Count
Year Fruits
1994 Apple       4
1996 Orange      5
2000 Orange      6
2002 Guava       8
2010 Banana      6
     Guava       8
2012 Guava       9
     Peach       7
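If the goal really is the uneven ranges from the original question (2000-2001 and 2002-2004), pd.cut may be the more direct tool. A sketch, assuming the same df; the bin edges and labels below are mine:
# Right-closed bins: (1999, 2001] and (2001, 2004]
bins = [1999, 2001, 2004]
labels = ['2000-2001', '2002-2004']
out = (df.assign(period=pd.cut(df['Year'], bins=bins, labels=labels))
         .groupby(['period', 'Fruits'], observed=True)['Count'].sum())
print(out)
# period     Fruits
# 2000-2001  Orange    6
# 2002-2004  Guava     8
# Name: Count, dtype: int64
Years outside the bins come out as NaN and are dropped by the groupby.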

grouping by count, year and displaying the last occurrence and its count

In the following dataframe
d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}
df = pd.DataFrame(data=d)
If I run the following code:
df = (df.groupby(['ptin', 'tin', 'year'])
        .apply(lambda x: x['tin'].isin(x['ptin']).astype(int).sum())
        .reset_index(name='matches'))
df
I get the following result:
   ptin  tin  year  matches
0    12  3.0  1999        0
1    12  3.0  2001        0
2    22  1.0  2002        0
3    23  1.0  2002        0
This gives me the tin values matching ptin, grouped by year.
Now if I want to find the last occurrence of, say, tin == 12, I should get 2001. I want to add that as a column, along with the difference between 1999 and 2001 (which is 2) in another column, so that my answer looks like below:
   ptin  tin  year  matches  lastoccurence  length
0    12  3.0  1999        0              0       0
1    12  3.0  2001        0           2001       2
2    22  1.0  2002        0           2002       1
3    23  1.0  2002        0           2002       1
Any help would be appreciated. I could take a solution in either pandas or SQL if that is possible.
I think this will do magic (at least partially?):
df['duration'] = df.sort_values(['ptin', 'year']).groupby('ptin')['year'].diff()
df = df.dropna(subset=['duration'])
print(df)
   ptin  tin  year  matches  duration
2    12   12  2001        1       2.0
3    12   30  2004        0       3.0
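To make the diff step concrete, here is a minimal illustration (made-up data) of how groupby(...).diff() behaves: it differences consecutive rows within each group, leaving NaN on each group's first row:
import pandas as pd

s = pd.DataFrame({'ptin': [12, 12, 12, 23],
                  'year': [1999, 2001, 2004, 2002]})
print(s.groupby('ptin')['year'].diff())
# 0    NaN    <- first year for ptin 12
# 1    2.0    <- 2001 - 1999
# 2    3.0    <- 2004 - 2001
# 3    NaN    <- only year for ptin 23
That is why the answer sorts by ['ptin', 'year'] first and then drops the NaN rows.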

Find the index of multi-rows

Suppose I have a dataframe called df as follows:
  A_column  B_column  C_column
0    Apple       100        15
1   Banana        80        20
2   Orange       110        10
3    Apple       150        16
4    Apple        90        13
[Q] How to list the index [0,3,4] for Apple in A_column?
You can just pass the row indexes as a list to df.iloc:
>>> df
  A_column  B_column  C_column
0    Apple       100        15
1   Banana        80        20
2   Orange       110        10
3    Apple       150        16
4    Apple        90        13
>>> df.iloc[[0, 3, 4]]
  A_column  B_column  C_column
0    Apple       100        15
3    Apple       150        16
4    Apple        90        13
EDIT: it seems I misunderstood your question.
So you want a list containing the index numbers of the rows containing 'Apple'; you can use df.index[df['A_column']=='Apple'].tolist()
>>> df.index[df['A_column']=='Apple'].tolist()
[0, 3, 4]
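If you need positional row numbers rather than index labels (say, to feed df.iloc on a frame with a non-default index), a numpy-based sketch:
import numpy as np

pos = np.flatnonzero(df['A_column'].to_numpy() == 'Apple')
print(pos)  # [0 3 4] -- positions, usable as df.iloc[pos]
With the default RangeIndex the two coincide, which is why both give [0, 3, 4] here.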
