Difference of sum of consecutive years pandas - python

Suppose I have this pandas DataFrame df
Date | Year | Value
2017-01-01 | 2017 | 20
2017-01-12 | 2017 | 40
2018-01-12 | 2018 | 150
2019-10-10 | 2019 | 300
I want to calculate the difference between the total sum of Value per year between consecutive years. To get the total sum of Value per year I can do
df['YearlyValue'] = df.groupby('Year')['Value'].transform('sum')
which gives me
Date | Year | Value | YearlyValue
2017-01-01 | 2017 | 20 | 60
2017-01-12 | 2017 | 40 | 60
2018-01-12 | 2018 | 150 | 150
2019-10-10 | 2019 | 300 | 300
but how can I get a new column 'Increment' that holds the difference between the YearlyValue of consecutive years?
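This question has no answer in this collection; a minimal sketch of one approach is to compute the per-year totals once, take their diff, and map both back onto the rows by Year:

```python
import pandas as pd

df = pd.DataFrame({
    'Date':  ['2017-01-01', '2017-01-12', '2018-01-12', '2019-10-10'],
    'Year':  [2017, 2017, 2018, 2019],
    'Value': [20, 40, 150, 300],
})

# Per-year totals, then the difference between consecutive years
yearly = df.groupby('Year')['Value'].sum()   # 2017: 60, 2018: 150, 2019: 300
increment = yearly.diff()                    # NaN, 90, 150

# Map both back onto the original rows
df['YearlyValue'] = df['Year'].map(yearly)
df['Increment'] = df['Year'].map(increment)
```

Note this sketch assumes every year in the data is consecutive; if years can be missing, reindex `yearly` over the full year range before calling `diff`.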

Related

How to count NA "blocks" in one column depending on another column

Suppose I have this df:
Date | Time | DeviceID | Temperature (c°)| Humidity (%)|
---------------------------------------------------------------
01/01/20 | 12:00 | 567 | 13.1 | 73 |
01/01/20 | 12:10 | 2543 | 13 | 72.7 |
01/01/20 | 12:20 | 573 | 13.5 | 70 |
01/01/20 | 12:30 | 474 | 12 | 75 |
How can I display the DeviceIDs which had more than, let's say, 3 consecutive NAs in Temperature or Humidity?
Also, I would like to see the length of each consecutive-NA segment a DeviceID has in Temperature and Humidity (i.e. Device 4 has 4 consecutive NAs in Temperature on this day; Device 80 has 6 consecutive NAs in Humidity on that day, and so on).
What can I do?
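This question also has no answer here; a common sketch uses the shift-compare-cumsum trick to label each run of consecutive NAs and measure its length (the column and device values below are assumed from the question for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'DeviceID':    [567, 567, 567, 567, 2543, 2543, 2543],
    'Temperature': [13.1, np.nan, np.nan, np.nan, 13.0, np.nan, 13.2],
})

def na_run_lengths(s):
    # A new run starts whenever the NA/non-NA state flips
    is_na = s.isna()
    run_id = (is_na != is_na.shift()).cumsum()
    sizes = is_na.groupby(run_id).agg(['all', 'size'])
    return sizes.loc[sizes['all'], 'size']   # keep only the NA runs

# One row per (DeviceID, run): the length of each consecutive-NA segment
runs = df.groupby('DeviceID')['Temperature'].apply(na_run_lengths)

# Devices with at least one run of 3 or more consecutive NAs
long_runs = runs[runs >= 3]
```

The same `na_run_lengths` helper can be applied to the Humidity column; filter the result per day first if the per-day breakdown is needed.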

Pandas sum rows by group based on condition

I have weekly data grouped by region. I'm trying to figure out how to sum a set of rows based on a condition for each region. For example:
Region | Week | Year | value
------------------------------
R1 | 53 | 2016 | 10
R1 | 1 | 2017 | 8
R2 | 53 | 2017 | 10
R2 | 1 | 2018 | 17
R3 | 53 | 2018 | 30
R3 | 1 | 2019 | 1
I would like to add each week-53 value to the first week of the following year, turning it into:
Region | Week | Year | value
------------------------------
R1 | 1 | 2017 | 18
R2 | 1 | 2018 | 27
R3 | 1 | 2019 | 31
Thanks.
agg can be very useful here. Try this:
df = df.groupby('Region', as_index=False).agg({'Year':'max', 'value':'sum'})
Output:
>>> df
Region Year value
0 R1 2017 18
1 R2 2018 27
2 R3 2019 31
Format Year and the week of the year so they can be converted into a date, extract the time components, then group by and sum:
s = pd.to_datetime(df.Year * 1000 + df.Week * 10 + 0, format='%Y%W%w')
df = (df.assign(Year=np.where(df['Week'] == 53, s.dt.year, df['Year']),
                Week=np.where(df['Week'] == 53, s.dt.isocalendar().week, df['Week']))
        .groupby(['Region', 'Year', 'Week']).agg('sum'))
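Since the `%Y%W%w` round-trip above is fairly opaque, here is a simpler sketch of the same idea: relabel week 53 as week 1 of the following year directly, then group and sum:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Region': ['R1', 'R1', 'R2', 'R2', 'R3', 'R3'],
    'Week':   [53, 1, 53, 1, 53, 1],
    'Year':   [2016, 2017, 2017, 2018, 2018, 2019],
    'value':  [10, 8, 10, 17, 30, 1],
})

# Roll week 53 forward into week 1 of the next year, then sum the pairs
rolled = df.assign(
    Year=np.where(df['Week'] == 53, df['Year'] + 1, df['Year']),
    Week=np.where(df['Week'] == 53, 1, df['Week']),
)
out = rolled.groupby(['Region', 'Year', 'Week'], as_index=False)['value'].sum()
```

This assumes week 53 should always merge into week 1 of the following year, which matches the sample data but may need adjusting for other week-numbering schemes.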

Create a subset by filtering on Year

I have a sample dataset as shown below:
| Id | Year | Price |
|----|------|-------|
| 1 | 2000 | 10 |
| 1 | 2001 | 12 |
| 1 | 2002 | 15 |
| 2 | 2000 | 16 |
| 2 | 2001 | 20 |
| 2 | 2002 | 22 |
| 3 | 2000 | 15 |
| 3 | 2001 | 19 |
| 3 | 2002 | 26 |
I want to subset the dataset so that only the values for the last two years are considered. I want to create a variable 'end_year', pass a year value to it, and then use it to subset the original dataframe to the last two years. Since new data keeps coming in, I wanted the year to be a variable. I have tried the code below but I'm getting an error.
end_year="2002"
df1=df[(df['Year'] >= end_year-1)]
Per the comments, Year is type object in the raw data. We should first cast it to int and then compare with numeric end_year:
df.Year=df.Year.astype(int) # cast `Year` to `int`
end_year=2002 # now we can use `int` here too
df1=df[(df['Year'] >= end_year-1)]
   Id  Year  Price
1   1  2001     12
2   1  2002     15
4   2  2001     20
5   2  2002     22
7   3  2001     19
8   3  2002     26
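Equivalently, the cast can be done inline without mutating `df` in place; a sketch using the question's sample data:

```python
import pandas as pd

# Year arrives as strings in the raw data
df = pd.DataFrame({
    'Id':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Year':  ['2000', '2001', '2002'] * 3,
    'Price': [10, 12, 15, 16, 20, 22, 15, 19, 26],
})

end_year = 2002
# Cast on the fly so the original dtypes stay untouched
df1 = df[df['Year'].astype(int) >= end_year - 1]
```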

pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format
                col1
index1 index2
place1 2018        5
       2019        4
       2020        2
place2 2016        9
       2017        8
place3 2018        6
       2019        1
I'm trying to pull rows out for the maximum years available for each place. In the above example the final df would be:
place1 | 2020 | 2
place2 | 2017 | 8
place3 | 2019 | 1
You can use df.sort_values().groupby().last() to take the row with the maximum value in each group. In your case:
df.sort_values("index2").groupby("index1").last()
I think that will work for you.
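For reference, a runnable sketch of that approach on the question's data, assuming index1/index2 are plain columns rather than an index:

```python
import pandas as pd

df = pd.DataFrame({
    'index1': ['place1', 'place1', 'place1', 'place2', 'place2', 'place3', 'place3'],
    'index2': [2018, 2019, 2020, 2016, 2017, 2018, 2019],
    'col1':   [5, 4, 2, 9, 8, 6, 1],
})

# Sorting by year first means .last() keeps each place's max-year row
out = df.sort_values('index2').groupby('index1').last()
```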
I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)
df.reset_index(level=1, inplace=True)
df = df.sort_values(['index1', 'index2'], ascending=False).groupby('index1').first()
df.set_index('index2', append=True, inplace=True)

How to group data by 6 month in Python

I have the following dataframe and I want to get the sum of Revenue per 6 months. I can extract the quarter, month, and year from the date, but I am unable to do it for 6-month periods.
| date | Revenue |
|-----------|---------|
| 1/2/2017 | 200 |
| 2/2/2017 | 300 |
| 3/2/2017 | 100 |
| 4/2/2017 | 100 |
| 5/23/2017 | 200 |
| 6/20/2017 | 300 |
| 7/22/2017 | 400 |
| 8/21/2017 | 800 |
| 9/21/2017 | 500 |
| 10/21/2017| 500 |
| 11/21/2017| 500 |
| 12/21/2017| 500 |
You can use resample.
df['date'] = pd.to_datetime(df['date'])
df.resample('6M', on='date').sum().reset_index()
# output
        date  Revenue
0 2017-01-31      200
1 2017-07-31     1400
2 2018-01-31     2800
Use pandas.Grouper:
df['date'] = pd.to_datetime(df['date'])
dfg = df.groupby(pd.Grouper(key='date', freq='6M')).sum().reset_index()
date Revenue
0 2017-01-31 200
1 2017-07-31 1400
2 2018-01-31 2800
You could do
df['date'] = pd.to_datetime(df['date'])
df['year_half'] = df.date.dt.month <= 6
df.groupby([df.year_half, df.date.dt.year])['Revenue'].sum()
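A runnable sketch of that last approach; note it groups by calendar halves (Jan–Jun / Jul–Dec), whereas resample('6M') bins by month-ends starting from the first date, so the totals come out differently:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['1/2/2017', '2/2/2017', '3/2/2017', '4/2/2017',
                            '5/23/2017', '6/20/2017', '7/22/2017', '8/21/2017',
                            '9/21/2017', '10/21/2017', '11/21/2017', '12/21/2017']),
    'Revenue': [200, 300, 100, 100, 200, 300, 400, 800, 500, 500, 500, 500],
})

# True for Jan-Jun, mapped to readable half-year labels
half = df['date'].dt.month.le(6).map({True: 'H1', False: 'H2'})
out = df.groupby([df['date'].dt.year, half])['Revenue'].sum()
```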
