How to calculate the average of dates in a groupby in Python

I have a dataframe with more than 500K rows and 20 columns. I am trying to determine the frequency with which each personId receives something, based on the date_Recieved column; all of the other columns are irrelevant for this task but useful for subsequent tasks.
| personId | date_Recieved |
|----------|---------------|
| 1        | 2 feb 2016    |
| 1        | 4 feb 2016    |
| 1        | 6 feb 2016    |
| 2        | 10 dec 2016   |
| 2        | 1 jan 2017    |
| 2        | 20 jan 2017   |
The date_Recieved column is of type pandas.tslib.Timestamp. I am looking for something like this:
| personId | Frequency |
|----------|-----------|
| 1        | 2 days    |
| 2        | 20.5 days |
So on average person 1 receives something every 2 days and person 2 receives something every 20.5 days.
I tried using the groupby function but still haven't been able to get this result with my dataframe.
Can someone please help me with this?

Using groupby and lambda:
df.groupby('personId').date_Recieved.apply(lambda x: x.diff().dropna().mean())
personId
1 2 days 00:00:00
2 20 days 12:00:00
Name: date_Recieved, dtype: timedelta64[ns]
Setup:
from io import StringIO
import pandas as pd

txt = """
personId  date_Recieved
1         2 feb 2016
1         4 feb 2016
1         6 feb 2016
2         10 dec 2016
2         1 jan 2017
2         20 jan 2017
"""
df = pd.read_csv(StringIO(txt), sep=r'\s{2,}', engine='python', parse_dates=[1])
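If the dates within each person are already sorted (as in the sample), here is a vectorized sketch of the same idea: the average gap equals the total span divided by the number of gaps. Variable names are just illustrative.
g = df.groupby('personId').date_Recieved
# total span per person divided by the number of gaps
freq = (g.max() - g.min()) / (g.count() - 1)
print(freq)
# personId
# 1     2 days 00:00:00
# 2    20 days 12:00:00
# dtype: timedelta64[ns]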

Related

Pandas sum rows by group based on condition

I have weekly data grouped by region. I'm trying to figure out how to sum a set of rows based on a condition for each region. For example:
Region | Week | Year | value
------------------------------
R1 | 53 | 2016 | 10
R1 | 1 | 2017 | 8
R2 | 53 | 2017 | 10
R2 | 1 | 2018 | 17
R3 | 53 | 2018 | 30
R3 | 1 | 2019 | 1
I would like to add every week-53 value from the previous year to the first week of the following year, to turn it into:
Region | Week | Year | value
------------------------------
R1 | 1 | 2017 | 18
R2 | 1 | 2018 | 27
R3 | 1 | 2019 | 31
Thanks.
agg can be very useful here. Try this:
df = df.groupby('Region', as_index=False).agg({'Year':'max', 'value':'sum'})
Output:
>>> df
Region Year value
0 R1 2017 18
1 R2 2018 27
2 R3 2019 31
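A side note on this approach: the Week column from the desired output is dropped. If it is needed, one option might be to aggregate it as well, e.g. taking the minimum so that week 1 is kept. A small sketch, assuming each region has exactly one week-53 row and one week-1 row:
# max year, smallest week number (i.e. 1) and summed values per region
df = df.groupby('Region', as_index=False).agg(
    {'Year': 'max', 'Week': 'min', 'value': 'sum'})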
Format the Year and week of the year so they can be converted into a date, extract the time components from that date, and then groupby and sum:
import numpy as np

# build a date from Year + Week (the Sunday of that week); for week 53 it falls in
# the next calendar year, so those rows are relabelled as week 1 of the following year
s = pd.to_datetime(df.Year * 1000 + df.Week * 10 + 0, format='%Y%W%w')
df = (df.assign(Year=np.where(df['Week'] == 53, s.dt.year, df['Year']),
                Week=np.where(df['Week'] == 53, s.dt.isocalendar().week, df['Week']))
        .groupby(['Region', 'Year', 'Week']).agg('sum'))

Create a subset by filtering on Year

I have a sample dataset as shown below:
| Id | Year | Price |
|----|------|-------|
| 1 | 2000 | 10 |
| 1 | 2001 | 12 |
| 1 | 2002 | 15 |
| 2 | 2000 | 16 |
| 2 | 2001 | 20 |
| 2 | 2002 | 22 |
| 3 | 2000 | 15 |
| 3 | 2001 | 19 |
| 3 | 2002 | 26 |
I want to subset the dataset so that I only consider the values for the last two years. I want to create a variable end_year, pass a year value to it, and then use it to subset the original dataframe so that only the last two years are taken into account. Since I have new data coming in, I wanted to use a variable. I have tried the code below but I'm getting an error.
end_year="2002"
df1=df[(df['Year'] >= end_year-1)]
Per the comments, Year is of type object (string) in the raw data. We should first cast it to int and then compare it with a numeric end_year:
df.Year=df.Year.astype(int) # cast `Year` to `int`
end_year=2002 # now we can use `int` here too
df1=df[(df['Year'] >= end_year-1)]
   Id  Year  Price
1   1  2001     12
2   1  2002     15
4   2  2001     20
5   2  2002     22
7   3  2001     19
8   3  2002     26
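For reference, a self-contained sketch of that fix, assuming the raw Year column arrives as strings; the frame below just mirrors the sample data:
import pandas as pd

df = pd.DataFrame({'Id':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Year':  ['2000', '2001', '2002'] * 3,   # object dtype, as in the raw data
                   'Price': [10, 12, 15, 16, 20, 22, 15, 19, 26]})

df['Year'] = df['Year'].astype(int)    # cast once so the numeric comparison works
end_year = 2002
df1 = df[df['Year'] >= end_year - 1]   # keep only the last two years (2001 and 2002)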

pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format:
                col1
index1  index2
place1  2018       5
        2019       4
        2020       2
place2  2016       9
        2017       8
place3  2018       6
        2019       1
I'm trying to pull out the row with the maximum year available for each place. In the above example the final df would be:
place1 | 2020 | 2
place2 | 2017 | 8
place3 | 2019 | 1
You can use DataFrame.sort_values().groupby().last() to find the maximum value in a group. In your case:
df.sort_values("index2").groupby("index1").last()
I think that should work for you.
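A small runnable sketch of that suggestion; it assumes index1 and index2 are ordinary columns (if they are index levels, as the display above suggests, call reset_index() first):
import pandas as pd

df = pd.DataFrame({'index1': ['place1', 'place1', 'place1', 'place2', 'place2', 'place3', 'place3'],
                   'index2': [2018, 2019, 2020, 2016, 2017, 2018, 2019],
                   'col1':   [5, 4, 2, 9, 8, 6, 1]})

# sort so the latest year ends up last within each place, then keep that last row
out = df.sort_values('index2').groupby('index1').last()
print(out)
#         index2  col1
# index1
# place1    2020     2
# place2    2017     8
# place3    2019     1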
I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)
df.reset_index(level=1, inplace=True)

# sort descending so the latest year comes first, then take the first row per place
df = df.sort_values(['index1', 'index2'], ascending=False).groupby('index1').first()
df.set_index('index2', append=True, inplace=True)

Sum two dataframes based on row and column

Given two DataFrames, df_1:
Code | Jan | Feb | Mar
a | 1 | 2 | 1
b | 3 | 4 | 3
and df_2
Code | Jan | Feb | Mar
a | 1 | 1 | 2
c | 7 | 0 | 0
I would like to sum these two tables based on the row and column. So my result dataframe should look like this:
Code | Jan | Feb | Mar
a | 2 | 3 | 3
b | 3 | 4 | 3
c | 7 | 0 | 0
Is there an easy way to do this? I can do it using a lot of for loops and if statements, but that is very slow for large datasets.
Use concat and aggregate sum:
df = pd.concat([df_1, df_2]).groupby('Code', as_index=False).sum()
print (df)
Code Jan Feb Mar
0 a 2 3 3
1 b 3 4 3
2 c 7 0 0
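An alternative sketch that skips the concat step: DataFrame.add aligns the two frames on their index and fills rows that exist in only one of them with 0. The result may come back as float because of the fill; cast with .astype(int) if integer output is needed.
# align on Code, add cell by cell, treat missing rows as 0
out = (df_1.set_index('Code')
           .add(df_2.set_index('Code'), fill_value=0)
           .reset_index())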

Sort dataframe using dictionary as sort criteria

There is a similar question here but not exactly what I'm looking for.
I want to sort a dataframe based on a dictionary that specifies the column(s) to sort by as well as the order for each column.
Example:
df =
+-------+-------+-----------+------+
| Index | Time | Month | Year |
+-------+-------+-----------+------+
| 0 | 13:00 | January | 2018 |
| 1 | 14:30 | March | 2015 |
| 2 | 12:00 | November | 2003 |
| 3 | 10:15 | September | 2012 |
| 4 | 13:30 | October | 2012 |
| 5 | 06:25 | June | 2012 |
| 6 | 07:50 | August | 2019 |
| 7 | 09:20 | May | 2015 |
| 8 | 22:30 | July | 2016 |
| 9 | 23:05 | April | 2013 |
| 10 | 21:10 | April | 2008 |
+-------+-------+-----------+------+
sort_dict = {'Month': 'Ascending', 'Year': 'Descending', 'Time': 'Ascending'}
df.sort_values(by=sort_dict)
df =
+-------+-------+-----------+------+
| Index | Time | Month | Year |
+-------+-------+-----------+------+
| 0 | 13:00 | January | 2018 |
| 1 | 14:30 | March | 2015 |
| 9 | 23:05 | April | 2013 |
| 10 | 21:10 | April | 2008 |
| 7 | 09:20 | May | 2015 |
| 5 | 06:25 | June | 2012 |
| 8 | 22:30 | July | 2016 |
| 6 | 07:50 | August | 2019 |
| 3 | 10:15 | September | 2012 |
| 4 | 13:30 | October | 2012 |
| 2 | 12:00 | November | 2003 |
+-------+-------+-----------+------+
Any help is appreciated thanks!
Column index would also be fine:
sort_dict = {2: 'Ascending', 3: 'Descending', 1: 'Ascending'}
EDIT: (thanks #Jon Clements)
In Python 3.6, declaring sort_dict as above preserves the key order; prior to 3.6, however, dict literals won't necessarily preserve order. E.g. in 3.5, declaring sort_dict may end up as {'Month': 'Ascending', 'Time': 'Ascending', 'Year': 'Descending'}, which is a different iteration order and thus gives different sort results.
If you always need the same order, you can use an OrderedDict or a Series built via the constructor; their order does not depend on the Python version.
One possible solution is to create a helper Series, convert its index to a list for the by parameter, and pass the ascending parameter a matching list of booleans:
s = pd.Series(sort_dict)
print (s)
Month Ascending
Year Descending
Time Ascending
dtype: object
df = df.sort_values(by=s.index.tolist(), ascending = (s == 'Ascending'))
print (df)
Time Month Year
Index
9 23:05 April 2013
10 21:10 April 2008
6 07:50 August 2019
0 13:00 January 2018
8 22:30 July 2016
5 06:25 June 2012
1 14:30 March 2015
7 09:20 May 2015
2 12:00 November 2003
4 13:30 October 2012
3 10:15 September 2012
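On Python 3.7+, where plain dicts keep insertion order, an equivalent sketch that builds the column list and the ascending flags directly from sort_dict (assuming its keys are column names rather than positions):
# sort by the dict's keys, ascending where the value says 'Ascending'
df = df.sort_values(by=list(sort_dict),
                    ascending=[v == 'Ascending' for v in sort_dict.values()])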
