Pandas sum rows by group based on condition - python

I have weekly data grouped by region. I'm trying to figure out how to sum a set of rows based on a condition for each region. For example:
Region | Week | Year | value
------------------------------
R1 | 53 | 2016 | 10
R1 | 1 | 2017 | 8
R2 | 53 | 2017 | 10
R2 | 1 | 2018 | 17
R3 | 53 | 2018 | 30
R3 | 1 | 2019 | 1
I would like to add every week-53 value from the previous year to the value of the first week of the following year, turning it into:
Region | Week | Year | value
------------------------------
R1 | 1 | 2017 | 18
R2 | 1 | 2018 | 27
R3 | 1 | 2019 | 31
Thanks.

agg can be very useful here. Try this:
df = df.groupby('Region', as_index=False).agg({'Year':'max', 'value':'sum'})
Output:
>>> df
Region Year value
0 R1 2017 18
1 R2 2018 27
2 R3 2019 31

Format the year and the week of the year into something that can be parsed as a date, then extract the date components and proceed to groupby and sum:
import numpy as np
import pandas as pd

# Parse Year + Week (+ weekday 0) as a date; for week 53 this lands in the next year
s = pd.to_datetime(df.Year * 1000 + df.Week * 10 + 0, format='%Y%W%w')
df = (df.assign(Year=np.where(df['Week'] == 53, s.dt.year, df['Year']),
                Week=np.where(df['Week'] == 53, s.dt.isocalendar().week, df['Week']))
        .groupby(['Region', 'Year', 'Week']).agg('sum'))

Using column as tiebreaker for maximums in Python

Reposted with clarification.
I am working on a dataframe that looks like the following:
+-------+----+------+------+
| Value | ID | Date | ID 2 |
+-------+----+------+------+
| 1 | 5 | 2012 | 111 |
| 1 | 5 | 2012 | 112 |
| 0 | 12 | 2017 | 113 |
| 0 | 12 | 2022 | 114 |
| 1 | 27 | 2005 | 115 |
| 1 | 27 | 2011 | 116 |
+-------+----+------+------+
Using only rows where "Value" == 1 ("Value" is boolean), I would like to group the dataframe by ID and write the string "Latest" into a new (blank) column on the latest row of each group, giving the following output:
+-------+----+------+------+-------+
| Value | ID | Date | ID 2 |Latest |
+-------+----+------+------+-------+
| 1 | 5 | 2012 | 111 | |
| 1 | 5 | 2012 | 112 | Latest |
| 0 | 12 | 2017 | 113 | |
| 0 | 12 | 2022 | 114 | |
| 1 | 27 | 2005 | 115 | |
| 1 | 27 | 2011 | 116 | Latest |
+-------+----+------+------+-------+
I am using the following code to find the maximum:
latest = df.query('Value==1').groupby("ID").max("Year").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
But I have since realized some of the max years are the same, i.e. there could be 4 rows, all with max year 2017. For the tiebreaker, I need to use the max ID 2 within groups.
latest = df.query('Value==1').groupby("ID").max("Year").groupby("ID 2").max("ID 2").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
but it is giving me a dataframe completely different from the one desired.
Try this:
grp = df['Value'].ne(df['Value'].shift()).cumsum()  # label consecutive runs of Value
df['Latest'] = np.where(
    df['ID 2'].eq(df.groupby(grp)['ID 2'].transform('max')) & df['Value'].ne(0),
    'Latest', '')
Output:
>>> df
   Value  ID  Date  ID 2  Latest
0      1   5  2012   111
1      1   5  2012   112  Latest
2      0  12  2017   113
3      0  12  2022   114
4      1  27  2005   115
5      1  27  2011   116  Latest
Here's one way, a bit similar to your own approach: groupby + last to get the latest row, assign the flag, then merge:
df = df.merge(df.groupby(['ID', 'Value'])['ID 2'].last()
                .reset_index()
                .assign(Latest=lambda x: np.where(x['Value'], 'Latest', '')),
              how='outer').fillna('')
or even this works:
df = (df.query('Value==1').groupby('ID').last()
        .assign(Latest='Latest').merge(df, how='outer').fillna(''))
Output:
Value ID Date ID 2 Latest
0 1 5 2012 111
1 1 5 2012 112 Latest
2 0 12 2017 113
3 0 12 2022 114
4 1 27 2005 115
5 1 27 2011 116 Latest
Here is one with window functions:
c = df['Value'].ne(df['Value'].shift())  # flag the start of each run of Value
s = df['Date'].add(df['ID 2'])           # add the year and ID to break ties between duplicate max years
c1 = s.eq(s.groupby(c.cumsum()).transform('max')) & df['Value'].eq(1)
df['Latest'] = np.where(c1, 'Latest', '')
print(df)
Value ID Date ID 2 Latest
0 1 5 2012 111
1 1 5 2012 112 Latest
2 0 12 2017 113
3 0 12 2022 114
4 1 27 2005 115
5 1 27 2011 116 Latest
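Both answers above group on consecutive runs of Value, which assumes the frame is sorted by ID. Here's a sketch, not from the original thread, that makes the tiebreak explicit and doesn't rely on pre-sorting, using sort_values and drop_duplicates:
import numpy as np

# Within each ID, sort so the row with the max Date (ties broken by max ID 2)
# comes last, then keep that last row among the Value == 1 rows.
winners = (df[df['Value'] == 1]
           .sort_values(['Date', 'ID 2'])
           .drop_duplicates('ID', keep='last'))
df['Latest'] = np.where(df.index.isin(winners.index), 'Latest', '')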

Create a subset by filtering on Year

I have a sample dataset as shown below:
| Id | Year | Price |
|----|------|-------|
| 1 | 2000 | 10 |
| 1 | 2001 | 12 |
| 1 | 2002 | 15 |
| 2 | 2000 | 16 |
| 2 | 2001 | 20 |
| 2 | 2002 | 22 |
| 3 | 2000 | 15 |
| 3 | 2001 | 19 |
| 3 | 2002 | 26 |
I want to subset the dataset to keep only the last two years. Since new data keeps coming in, I want to create a variable end_year, pass a year value to it, and then use it to filter the original dataframe down to the last two years. I have tried the code below but I'm getting an error.
end_year="2002"
df1=df[(df['Year'] >= end_year-1)]
Per the comments, Year is of type object (strings) in the raw data. We should first cast it to int and then compare with a numeric end_year:
df.Year=df.Year.astype(int) # cast `Year` to `int`
end_year=2002 # now we can use `int` here too
df1=df[(df['Year'] >= end_year-1)]
   Id  Year  Price
1   1  2001     12
2   1  2002     15
4   2  2001     20
5   2  2002     22
7   3  2001     19
8   3  2002     26
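If Year might also contain stray non-numeric strings, a slightly safer cast, as an alternative sketch, is pd.to_numeric with errors='coerce':
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')  # non-numeric values become NaN
df1 = df[df['Year'] >= end_year - 1]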

pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format (index1 and index2 form a MultiIndex):
               col1
index1 index2
place1 2018       5
       2019       4
       2020       2
place2 2016       9
       2017       8
place3 2018       6
       2019       1
I'm trying to pull rows out for the maximum years available for each place. In the above example the final df would be:
place1 | 2020 | 2
place2 | 2017 | 8
place3 | 2019 | 1
You can use df.sort_values(...).groupby(...).last() to take the row with the maximum value in each group. In your case:
df.sort_values("index2").groupby("index1").last()
I think it should work for you.
I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)
df.reset_index(level=1, inplace=True)
# sort descending so the latest year comes first in each group, then take it
df = df.sort_values(['index1', 'index2'], ascending=False).groupby('index1').first()
df.set_index('index2', append=True, inplace=True)
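Another common idiom for "row with the max value per group", sketched here assuming the columns from the setup above, is groupby + idxmax on a flat frame:
flat = df.reset_index()
flat['index2'] = flat['index2'].astype(int)  # the setup stores years as strings
print(flat.loc[flat.groupby('index1')['index2'].idxmax()])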

Difference of sum of consecutive years pandas

Suppose I have this pandas DataFrame df
Date | Year | Value
2017-01-01 | 2017 | 20
2017-01-12 | 2017 | 40
2018-01-12 | 2018 | 150
2019-10-10 | 2019 | 300
I want to calculate the difference between the total sum of Value per year between consecutive years. To get the total sum of Value per year I can do
df['YearlyValue'] = df.groupby('Year')['Value'].transform('sum')
which gives me
Date | Year | Value | YearlyValue
2017-01-01 | 2017 | 20 | 60
2017-01-12 | 2017 | 40 | 60
2018-01-12 | 2018 | 150 | 150
2019-10-10 | 2019 | 300 | 300
but how can I get a new column 'Increment' that holds the difference between the YearlyValue of consecutive years?
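One possible approach (a sketch, not an answer from the original thread): compute the per-year totals once, diff them, and map both back onto the rows by year:
yearly = df.groupby('Year')['Value'].sum()       # one total per year
df['YearlyValue'] = df['Year'].map(yearly)
df['Increment'] = df['Year'].map(yearly.diff())  # NaN for the first year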

How to calculate Average of dates in groupby Python

I have a dataframe with more than 500K rows and 20 columns. I am trying to determine the frequency with which each personId receives something, based on the date_Recieved column; all of the other columns are irrelevant for this task but useful for subsequent tasks.
| personId | date_Recieved |
|----------|---------------|
| 1        | 2 feb 2016    |
| 1        | 4 feb 2016    |
| 1        | 6 feb 2016    |
| 2        | 10 dec 2016   |
| 2        | 1 jan 2017    |
| 2        | 20 jan 2017   |
date_Recieved is of type pandas.tslib.Timestamp. I am looking for something like this:
| personId | Frequency |
|----------|-----------|
| 1        | 2 days    |
| 2        | 20.5 days |
So on average person 1 receives something every 2 days and person 2 receives something every 20.5 days.
I tried using the groupby function but still haven't been able to get the result I need with my dataframe.
Can someone please help me with this?
Using groupby and a lambda:
df.groupby('personId').date_Recieved.apply(lambda x: x.diff().dropna().mean())
personId
1 2 days 00:00:00
2 20 days 12:00:00
Name: date_Recieved, dtype: timedelta64[ns]
setup
from io import StringIO
import pandas as pd

txt = """
personId  date_Recieved
1         2 feb 2016
1         4 feb 2016
1         6 feb 2016
2         10 dec 2016
2         1 jan 2017
2         20 jan 2017
"""
df = pd.read_csv(StringIO(txt), sep=r'\s{2,}', engine='python', parse_dates=[1])
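Equivalently, the mean of consecutive gaps telescopes to (last - first) / (n - 1), so the same result can be had without diff (a sketch assuming at least two dates per person):
freq = df.groupby('personId')['date_Recieved'].agg(lambda x: (x.max() - x.min()) / (len(x) - 1))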
