How to calculate the average of dates in a groupby in Python

I have a dataframe with more than 500K rows and 20 columns. I am trying to determine the frequency with which each personId receives something, based on the date_Recieved column; all of the other columns are irrelevant for this task but useful for subsequent tasks.
| personId | date_Recieved |
|----------|---------------|
| 1        | 2 feb 2016    |
| 1        | 4 feb 2016    |
| 1        | 6 feb 2016    |
| 2        | 10 dec 2016   |
| 2        | 1 jan 2017    |
| 2        | 20 jan 2017   |
The date_Recieved column is of type pandas.tslib.Timestamp. I am looking for something like this:
| personId | Frequency |
|----------|-----------|
| 1        | 2 days    |
| 2        | 20.5 days |
So on average person 1 receives something every 2 days and person 2 receives something every 20.5 days.
I tried using the groupby function but still haven't been able to get this result with my dataframe.
Can someone please help me with this?

Using groupby and lambda:
df.groupby('personId').date_Recieved.apply(lambda x: x.diff().dropna().mean())
personId
1 2 days 00:00:00
2 20 days 12:00:00
Name: date_Recieved, dtype: timedelta64[ns]
Setup:
from io import StringIO
import pandas as pd

txt = """
personId  date_Recieved
1         2 feb 2016
1         4 feb 2016
1         6 feb 2016
2         10 dec 2016
2         1 jan 2017
2         20 jan 2017
"""
df = pd.read_csv(StringIO(txt), sep=r'\s{2,}', engine='python', parse_dates=[1])
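If the dates within each person are already sorted (as in the sample), here is a vectorized sketch of the same idea: the average gap equals the total span divided by the number of gaps. Variable names are just illustrative.
g = df.groupby('personId').date_Recieved
# total span per person divided by the number of gaps
freq = (g.max() - g.min()) / (g.count() - 1)
print(freq)
# personId
# 1     2 days 00:00:00
# 2    20 days 12:00:00
# dtype: timedelta64[ns]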

Related

Pandas sum rows by group based on condition

I have weekly data grouped by region. I'm trying to figure out how to sum a set of rows based on a condition for each region. For example:
Region | Week | Year | value
------------------------------
R1 | 53 | 2016 | 10
R1 | 1 | 2017 | 8
R2 | 53 | 2017 | 10
R2 | 1 | 2018 | 17
R3 | 53 | 2018 | 30
R3 | 1 | 2019 | 1
I would like to add every week-53 value from the previous year to the first week of the following year, to turn it into:
Region | Week | Year | value
------------------------------
R1 | 1 | 2017 | 18
R2 | 1 | 2018 | 27
R3 | 1 | 2019 | 31
Thanks.
agg can be very useful here. Try this:
df = df.groupby('Region', as_index=False).agg({'Year':'max', 'value':'sum'})
Output:
>>> df
Region Year value
0 R1 2017 18
1 R2 2018 27
2 R3 2019 31
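A side note on this approach: the Week column from the desired output is dropped. If it is needed, one option might be to aggregate it as well, e.g. taking the minimum so that week 1 is kept. A small sketch, assuming each region has exactly one week-53 row and one week-1 row:
# max year, smallest week number (i.e. 1) and summed values per region
df = df.groupby('Region', as_index=False).agg(
    {'Year': 'max', 'Week': 'min', 'value': 'sum'})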
Format the Year and week of the year so they can be converted into a date, extract the time components from that date, and then groupby and sum:
import numpy as np

# build a date from Year + Week (the Sunday of that week); for week 53 it falls in
# the next calendar year, so those rows are relabelled as week 1 of the following year
s = pd.to_datetime(df.Year * 1000 + df.Week * 10 + 0, format='%Y%W%w')
df = (df.assign(Year=np.where(df['Week'] == 53, s.dt.year, df['Year']),
                Week=np.where(df['Week'] == 53, s.dt.isocalendar().week, df['Week']))
        .groupby(['Region', 'Year', 'Week']).agg('sum'))

Create a subset by filtering on Year

I have a sample dataset as shown below:
| Id | Year | Price |
|----|------|-------|
| 1 | 2000 | 10 |
| 1 | 2001 | 12 |
| 1 | 2002 | 15 |
| 2 | 2000 | 16 |
| 2 | 2001 | 20 |
| 2 | 2002 | 22 |
| 3 | 2000 | 15 |
| 3 | 2001 | 19 |
| 3 | 2002 | 26 |
I want to subset the dataset so that I only consider the values for the last two years. I want to create a variable end_year, pass a year value to it, and then use it to subset the original dataframe so that only the last two years are taken into account. Since I have new data coming in, I wanted to use a variable. I have tried the code below but I'm getting an error.
end_year="2002"
df1=df[(df['Year'] >= end_year-1)]
Per the comments, Year is of type object (string) in the raw data. We should first cast it to int and then compare it with a numeric end_year:
df.Year=df.Year.astype(int) # cast `Year` to `int`
end_year=2002 # now we can use `int` here too
df1=df[(df['Year'] >= end_year-1)]
   Id  Year  Price
1   1  2001     12
2   1  2002     15
4   2  2001     20
5   2  2002     22
7   3  2001     19
8   3  2002     26
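For reference, a self-contained sketch of that fix, assuming the raw Year column arrives as strings; the frame below just mirrors the sample data:
import pandas as pd

df = pd.DataFrame({'Id':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Year':  ['2000', '2001', '2002'] * 3,   # object dtype, as in the raw data
                   'Price': [10, 12, 15, 16, 20, 22, 15, 19, 26]})

df['Year'] = df['Year'].astype(int)    # cast once so the numeric comparison works
end_year = 2002
df1 = df[df['Year'] >= end_year - 1]   # keep only the last two years (2001 and 2002)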

pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format:
                col1
index1  index2
place1  2018       5
        2019       4
        2020       2
place2  2016       9
        2017       8
place3  2018       6
        2019       1
I'm trying to pull out the row with the maximum year available for each place. In the above example the final df would be:
place1 | 2020 | 2
place2 | 2017 | 8
place3 | 2019 | 1
You can use DataFrame.sort_values().groupby().last() to find the maximum value in a group. In your case:
df.sort_values("index2").groupby("index1").last()
I think that should work for you.
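A small runnable sketch of that suggestion; it assumes index1 and index2 are ordinary columns (if they are index levels, as the display above suggests, call reset_index() first):
import pandas as pd

df = pd.DataFrame({'index1': ['place1', 'place1', 'place1', 'place2', 'place2', 'place3', 'place3'],
                   'index2': [2018, 2019, 2020, 2016, 2017, 2018, 2019],
                   'col1':   [5, 4, 2, 9, 8, 6, 1]})

# sort so the latest year ends up last within each place, then keep that last row
out = df.sort_values('index2').groupby('index1').last()
print(out)
#         index2  col1
# index1
# place1    2020     2
# place2    2017     8
# place3    2019     1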
I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)
df.reset_index(level=1, inplace=True)

# sort descending so the latest year comes first, then take the first row per place
df = df.sort_values(['index1', 'index2'], ascending=False).groupby('index1').first()
df.set_index('index2', append=True, inplace=True)

Sum two dataframes based on row and column

Given two DataFrames, df_1:
Code | Jan | Feb | Mar
a | 1 | 2 | 1
b | 3 | 4 | 3
and df_2
Code | Jan | Feb | Mar
a | 1 | 1 | 2
c | 7 | 0 | 0
I would like to sum these two tables based on the row and column. So my result dataframe should look like this:
Code | Jan | Feb | Mar
a | 2 | 3 | 3
b | 3 | 4 | 3
c | 7 | 0 | 0
Is there an easy way to do this? I can do it using a lot of for loops and if statements, but that is very slow for large datasets.
Use concat and aggregate sum:
df = pd.concat([df_1, df_2]).groupby('Code', as_index=False).sum()
print (df)
Code Jan Feb Mar
0 a 2 3 3
1 b 3 4 3
2 c 7 0 0
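An alternative sketch that skips the concat step: DataFrame.add aligns the two frames on their index and fills rows that exist in only one of them with 0. The result may come back as float because of the fill; cast with .astype(int) if integer output is needed.
# align on Code, add cell by cell, treat missing rows as 0
out = (df_1.set_index('Code')
           .add(df_2.set_index('Code'), fill_value=0)
           .reset_index())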

Sort dataframe using dictionary as sort criteria

There is a similar question here but not exactly what I'm looking for.
I want to sort a dataframe based on a dictionary that specifies the column(s) to sort by as well as the order for each column.
Example:
df =
+-------+-------+-----------+------+
| Index | Time | Month | Year |
+-------+-------+-----------+------+
| 0 | 13:00 | January | 2018 |
| 1 | 14:30 | March | 2015 |
| 2 | 12:00 | November | 2003 |
| 3 | 10:15 | September | 2012 |
| 4 | 13:30 | October | 2012 |
| 5 | 06:25 | June | 2012 |
| 6 | 07:50 | August | 2019 |
| 7 | 09:20 | May | 2015 |
| 8 | 22:30 | July | 2016 |
| 9 | 23:05 | April | 2013 |
| 10 | 21:10 | April | 2008 |
+-------+-------+-----------+------+
sort_dict = {'Month': 'Ascending', 'Year': 'Descending', 'Time': 'Ascending'}
df.sort_values(by=sort_dict)
df =
+-------+-------+-----------+------+
| Index | Time | Month | Year |
+-------+-------+-----------+------+
| 0 | 13:00 | January | 2018 |
| 1 | 14:30 | March | 2015 |
| 9 | 23:05 | April | 2013 |
| 10 | 21:10 | April | 2008 |
| 7 | 09:20 | May | 2015 |
| 5 | 06:25 | June | 2012 |
| 8 | 22:30 | July | 2016 |
| 6 | 07:50 | August | 2019 |
| 3 | 10:15 | September | 2012 |
| 4 | 13:30 | October | 2012 |
| 2 | 12:00 | November | 2003 |
+-------+-------+-----------+------+
Any help is appreciated thanks!
Column index would also be fine:
sort_dict = {2: 'Ascending', 3: 'Descending', 1: 'Ascending'}
EDIT: (thanks #Jon Clements)
In Python 3.6, declaring sort_dict as above preserves the key order; prior to 3.6, however, dict literals won't necessarily preserve order. E.g. in 3.5, declaring sort_dict may end up as {'Month': 'Ascending', 'Time': 'Ascending', 'Year': 'Descending'}, which is a different iteration order and thus gives different sort results.
If you always need the same order, you can use an OrderedDict or a Series built via the constructor; their order does not depend on the Python version.
One possible solution is to create a helper Series, convert its index to a list for the by parameter, and pass the ascending parameter a matching list of booleans:
s = pd.Series(sort_dict)
print (s)
Month Ascending
Year Descending
Time Ascending
dtype: object
df = df.sort_values(by=s.index.tolist(), ascending = (s == 'Ascending'))
print (df)
Time Month Year
Index
9 23:05 April 2013
10 21:10 April 2008
6 07:50 August 2019
0 13:00 January 2018
8 22:30 July 2016
5 06:25 June 2012
1 14:30 March 2015
7 09:20 May 2015
2 12:00 November 2003
4 13:30 October 2012
3 10:15 September 2012
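On Python 3.7+, where plain dicts keep insertion order, an equivalent sketch that builds the column list and the ascending flags directly from sort_dict (assuming its keys are column names rather than positions):
# sort by the dict's keys, ascending where the value says 'Ascending'
df = df.sort_values(by=list(sort_dict),
                    ascending=[v == 'Ascending' for v in sort_dict.values()])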
