Pandas sum rows by group based on condition - python

I have weekly data grouped by region. I'm trying to figure out how to sum a set of rows based on a condition for each region. For example:
Region | Week | Year | value
------------------------------
R1 | 53 | 2016 | 10
R1 | 1 | 2017 | 8
R2 | 53 | 2017 | 10
R2 | 1 | 2018 | 17
R3 | 53 | 2018 | 30
R3 | 1 | 2019 | 1
I would like to add every week-53 value from the previous year to the value of the first week of the following year, turning it into:
Region | Week | Year | value
------------------------------
R1 | 1 | 2017 | 18
R2 | 1 | 2018 | 27
R3 | 1 | 2019 | 31
Thanks.

agg can be very useful here. Try this:
df = df.groupby('Region', as_index=False).agg({'Year':'max', 'value':'sum'})
Output:
>>> df
Region Year value
0 R1 2017 18
1 R2 2018 27
2 R3 2019 31

Format the year and the week of the year into something that can be parsed as a date, then extract the date components and proceed to groupby and sum:
import numpy as np
import pandas as pd

# Parse Year + Week (+ weekday 0) as a date; for week 53 this lands in the next year
s = pd.to_datetime(df.Year * 1000 + df.Week * 10 + 0, format='%Y%W%w')
df = (df.assign(Year=np.where(df['Week'] == 53, s.dt.year, df['Year']),
                Week=np.where(df['Week'] == 53, s.dt.isocalendar().week, df['Week']))
        .groupby(['Region', 'Year', 'Week']).agg('sum'))

Using column as tiebreaker for maximums in Python

Reposted with clarification.
I am working on a dataframe that looks like the following:
+-------+----+------+------+
| Value | ID | Date | ID 2 |
+-------+----+------+------+
| 1 | 5 | 2012 | 111 |
| 1 | 5 | 2012 | 112 |
| 0 | 12 | 2017 | 113 |
| 0 | 12 | 2022 | 114 |
| 1 | 27 | 2005 | 115 |
| 1 | 27 | 2011 | 116 |
+-------+----+------+------+
Using only rows where "Value" == 1 ("Value" is boolean), I would like to group the dataframe by ID and write the string "Latest" into a new (blank) column on the latest row of each group, giving the following output:
+-------+----+------+------+-------+
| Value | ID | Date | ID 2 |Latest |
+-------+----+------+------+-------+
| 1 | 5 | 2012 | 111 | |
| 1 | 5 | 2012 | 112 | Latest |
| 0 | 12 | 2017 | 113 | |
| 0 | 12 | 2022 | 114 | |
| 1 | 27 | 2005 | 115 | |
| 1 | 27 | 2011 | 116 | Latest |
+-------+----+------+------+-------+
I am using the following code to find the maximum:
latest = df.query('Value==1').groupby("ID").max("Year").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
But I have since realized some of the max years are the same, i.e. there could be 4 rows, all with max year 2017. For the tiebreaker, I need to use the max ID 2 within groups.
latest = df.query('Value==1').groupby("ID").max("Year").groupby("ID 2").max("ID 2").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
but it is giving me a dataframe completely different from the one desired.
Try this:
grp = df['Value'].ne(df['Value'].shift()).cumsum()  # label consecutive runs of Value
df['Latest'] = np.where(
    df['ID 2'].eq(df.groupby(grp)['ID 2'].transform('max')) & df['Value'].ne(0),
    'Latest', '')
Output:
>>> df
   Value  ID  Date  ID 2  Latest
0      1   5  2012   111
1      1   5  2012   112  Latest
2      0  12  2017   113
3      0  12  2022   114
4      1  27  2005   115
5      1  27  2011   116  Latest
Here's one way, a bit similar to your own approach: groupby + last to get the latest row, assign the flag, then merge:
df = df.merge(df.groupby(['ID', 'Value'])['ID 2'].last()
                .reset_index()
                .assign(Latest=lambda x: np.where(x['Value'], 'Latest', '')),
              how='outer').fillna('')
or even this works:
df = (df.query('Value==1').groupby('ID').last()
        .assign(Latest='Latest').merge(df, how='outer').fillna(''))
Output:
Value ID Date ID 2 Latest
0 1 5 2012 111
1 1 5 2012 112 Latest
2 0 12 2017 113
3 0 12 2022 114
4 1 27 2005 115
5 1 27 2011 116 Latest
Here is one with window functions:
c = df['Value'].ne(df['Value'].shift())  # flag the start of each run of Value
s = df['Date'].add(df['ID 2'])           # add the year and ID to break ties between duplicate max years
c1 = s.eq(s.groupby(c.cumsum()).transform('max')) & df['Value'].eq(1)
df['Latest'] = np.where(c1, 'Latest', '')
print(df)
Value ID Date ID 2 Latest
0 1 5 2012 111
1 1 5 2012 112 Latest
2 0 12 2017 113
3 0 12 2022 114
4 1 27 2005 115
5 1 27 2011 116 Latest
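Both answers above group on consecutive runs of Value, which assumes the frame is sorted by ID. Here's a sketch, not from the original thread, that makes the tiebreak explicit and doesn't rely on pre-sorting, using sort_values and drop_duplicates:
import numpy as np

# Within each ID, sort so the row with the max Date (ties broken by max ID 2)
# comes last, then keep that last row among the Value == 1 rows.
winners = (df[df['Value'] == 1]
           .sort_values(['Date', 'ID 2'])
           .drop_duplicates('ID', keep='last'))
df['Latest'] = np.where(df.index.isin(winners.index), 'Latest', '')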

Create a subset by filtering on Year

I have a sample dataset as shown below:
| Id | Year | Price |
|----|------|-------|
| 1 | 2000 | 10 |
| 1 | 2001 | 12 |
| 1 | 2002 | 15 |
| 2 | 2000 | 16 |
| 2 | 2001 | 20 |
| 2 | 2002 | 22 |
| 3 | 2000 | 15 |
| 3 | 2001 | 19 |
| 3 | 2002 | 26 |
I want to subset the dataset to keep only the last two years. Since new data keeps coming in, I want to create a variable end_year, pass a year value to it, and then use it to filter the original dataframe down to the last two years. I have tried the code below but I'm getting an error.
end_year="2002"
df1=df[(df['Year'] >= end_year-1)]
Per the comments, Year is of type object (strings) in the raw data. We should first cast it to int and then compare with a numeric end_year:
df.Year=df.Year.astype(int) # cast `Year` to `int`
end_year=2002 # now we can use `int` here too
df1=df[(df['Year'] >= end_year-1)]
   Id  Year  Price
1   1  2001     12
2   1  2002     15
4   2  2001     20
5   2  2002     22
7   3  2001     19
8   3  2002     26
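If Year might also contain stray non-numeric strings, a slightly safer cast, as an alternative sketch, is pd.to_numeric with errors='coerce':
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')  # non-numeric values become NaN
df1 = df[df['Year'] >= end_year - 1]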

pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format (index1 and index2 form a MultiIndex):
               col1
index1 index2
place1 2018       5
       2019       4
       2020       2
place2 2016       9
       2017       8
place3 2018       6
       2019       1
I'm trying to pull rows out for the maximum years available for each place. In the above example the final df would be:
place1 | 2020 | 2
place2 | 2017 | 8
place3 | 2019 | 1
You can use df.sort_values(...).groupby(...).last() to take the row with the maximum value in each group. In your case:
df.sort_values("index2").groupby("index1").last()
I think it should work for you.
I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)
df.reset_index(level=1, inplace=True)
# sort descending so the latest year comes first in each group, then take it
df = df.sort_values(['index1', 'index2'], ascending=False).groupby('index1').first()
df.set_index('index2', append=True, inplace=True)
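Another common idiom for "row with the max value per group", sketched here assuming the columns from the setup above, is groupby + idxmax on a flat frame:
flat = df.reset_index()
flat['index2'] = flat['index2'].astype(int)  # the setup stores years as strings
print(flat.loc[flat.groupby('index1')['index2'].idxmax()])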

Difference of sum of consecutive years pandas

Suppose I have this pandas DataFrame df
Date | Year | Value
2017-01-01 | 2017 | 20
2017-01-12 | 2017 | 40
2018-01-12 | 2018 | 150
2019-10-10 | 2019 | 300
I want to calculate the difference between the total sum of Value per year between consecutive years. To get the total sum of Value per year I can do
df['YearlyValue'] = df.groupby('Year')['Value'].transform('sum')
which gives me
Date | Year | Value | YearlyValue
2017-01-01 | 2017 | 20 | 60
2017-01-12 | 2017 | 40 | 60
2018-01-12 | 2018 | 150 | 150
2019-10-10 | 2019 | 300 | 300
but how can I get a new column 'Increment' that holds the difference between the YearlyValue of consecutive years?
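One possible approach (a sketch, not an answer from the original thread): compute the per-year totals once, diff them, and map both back onto the rows by year:
yearly = df.groupby('Year')['Value'].sum()       # one total per year
df['YearlyValue'] = df['Year'].map(yearly)
df['Increment'] = df['Year'].map(yearly.diff())  # NaN for the first year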

How to calculate Average of dates in groupby Python

I have a dataframe with more than 500K rows and 20 columns. I am trying to determine the frequency with which each personId receives something, based on the date_Recieved column; all of the other columns are irrelevant for this task but useful for subsequent tasks.
| personId | date_Recieved |
|----------|---------------|
| 1        | 2 feb 2016    |
| 1        | 4 feb 2016    |
| 1        | 6 feb 2016    |
| 2        | 10 dec 2016   |
| 2        | 1 jan 2017    |
| 2        | 20 jan 2017   |
date_Recieved is of type pandas.tslib.Timestamp. I am looking for something like this:
| personId | Frequency |
|----------|-----------|
| 1        | 2 days    |
| 2        | 20.5 days |
So on average person 1 receives something every 2 days and person 2 receives something every 20.5 days.
I tried using the groupby function but still haven't been able to get the result I need with my dataframe.
Can someone please help me with this?
Using groupby and a lambda:
df.groupby('personId').date_Recieved.apply(lambda x: x.diff().dropna().mean())
personId
1 2 days 00:00:00
2 20 days 12:00:00
Name: date_Recieved, dtype: timedelta64[ns]
setup
from io import StringIO
import pandas as pd

txt = """
personId  date_Recieved
1         2 feb 2016
1         4 feb 2016
1         6 feb 2016
2         10 dec 2016
2         1 jan 2017
2         20 jan 2017
"""
df = pd.read_csv(StringIO(txt), sep=r'\s{2,}', engine='python', parse_dates=[1])
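Equivalently, the mean of consecutive gaps telescopes to (last - first) / (n - 1), so the same result can be had without diff (a sketch assuming at least two dates per person):
freq = df.groupby('personId')['date_Recieved'].agg(lambda x: (x.max() - x.min()) / (len(x) - 1))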
