Compute operations within subgroups in pandas - python

I have a table that has multiple subgroups. For example, person A has a total of three visits and person B has a total of two visits. I also have the time of each visit:
id  visit  time_of_visit
A   1      2002-01-15
A   2      2003-01-15
A   3      2003-02-15
B   1      1996-08-09
B   2      1998-08-09
I want to compute how long apart each visit is in terms of years for each person. So I want something like this:
id  visit  time_of_visit  difference_in_time
A   1      2002-01-15     na
A   2      2003-01-15     1
A   3      2003-02-15     0.0833
B   1      1996-08-09     na
B   2      1998-08-09     2
Any ideas how to do this in python pandas? Thanks!

groupby.diff on a datetime column will give you timedeltas:
df['time_of_visit'] = pd.to_datetime(df['time_of_visit'])
df.groupby('id')['time_of_visit'].diff()
Out:
0 NaT
1 365 days
2 31 days
3 NaT
4 730 days
Name: time_of_visit, dtype: timedelta64[ns]
However, a timedelta cannot be expressed directly in years, since a year is not a fixed-length unit. You can always convert by your own rules, of course (for example, divide by 365).
df.groupby('id')['time_of_visit'].diff().dt.days / 365
Out:
0 NaN
1 1.000000
2 0.084932
3 NaN
4 2.000000
Name: time_of_visit, dtype: float64
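If a fixed-length year convention is acceptable, a slightly more precise variant of that conversion (my own sketch, not part of the original answer) divides the total seconds by the seconds in a 365.25-day year:
import pandas as pd

df['time_of_visit'] = pd.to_datetime(df['time_of_visit'])

# 365.25-day year convention; pick whatever convention suits your data
seconds_per_year = 365.25 * 24 * 60 * 60
df['difference_in_time'] = (
    df.groupby('id')['time_of_visit'].diff().dt.total_seconds() / seconds_per_year
)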

Related

How to get weekly averages for column values and week number for the corresponding year based on daily data records with pandas

I'm still learning python and would like to ask your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar week. So for the mockup data below I have rows stretched over two weeks (week 14, the current week, and week 13, the past week). Now I need to find a way to group rows per calendar week, recognize what year they belong to, and calculate the week sum and week average. In the example input file there are only two different IDs; however, in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
my goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week - not sure if this is possible) and create a 'week' column with the week number, then sum the 'activeMembers' values for that particular week, save them as a 'WeeklyActiveMembersSum' column in my output file, and finally calculate 'WeeklyAverageActiveMembers' for that week. I was experimenting with groupby and isin but no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week setting than you do:
# convert the date column to datetime type first
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['id',df.date.dt.strftime('%Y%W')], sort=False)
.activeMembers.agg([('Sum','sum'),('Average','mean')])
.add_prefix('activeMembers')
.reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
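For reference, here is a sketch that keys the groups on the ISO calendar week instead, which should reproduce week labels like 202014 from the question; it assumes pandas >= 1.1 for Series.dt.isocalendar(), and the output column names are my own guess at what was wanted:
import pandas as pd

df = pd.read_csv('path/to/my/input.csv', parse_dates=['date'])

# ISO year + zero-padded ISO week number, e.g. 202014
iso = df['date'].dt.isocalendar()
week = (iso['year'].astype(str) + iso['week'].astype(str).str.zfill(2)).rename('week')

out = (df.groupby(['id', week], sort=False)['activeMembers']
         .agg(WeeklyActiveMembersSum='sum', WeeklyAverageActiveMembers='mean')
         .reset_index())

out.to_csv('path/to/my/output.csv', index=False)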

column that depends on computing the difference among two column cells in groupby object

I need some tips to make a calculation.
I have a DataFrame that looks like the following:
text_id  user   date        important_words
1        John   2018-01-01  {cat, dog, puppy}
1        John   2018-02-01  {cat, dog}
2        Anne   2018-01-01  {flower, sun}
3        John   2018-03-01  {water, blue}
3        Marie  2018-05-01  {water, blue, ocean}
3        Kate   2018-08-01  {island, sand, towel}
4        Max    2018-01-01  {hot, cold}
4        Ethan  2018-06-01  {hot, warm}
5        Marie  2019-01-01  {boo}
In the given dataframe:
The text_id refers to the id of a text: each text with a different id is a different text. The user column refers to the name of the user that edited the text (adding and erasing important words). The date column refers to the moment at which the edit was made (note that the edits on each text are chronologically sorted). Finally, the important_words column is the set of important words present in the text after the user's edit.
I need to calculate how many words were added by each user on each edition of a page.
The expected output here would be:
text_id  user   date        important_words        added_words
1        John   2018-01-01  {cat, dog, puppy}      3
1        John   2018-02-01  {cat, dog}             0
2        Anne   2018-01-01  {flower, sun}          2
3        John   2018-03-01  {water, blue}          2
3        Marie  2018-05-01  {water, blue, ocean}   1
3        Kate   2018-08-01  {island, sand, towel}  3
4        Max    2018-01-01  {hot, cold}            2
4        Ethan  2018-06-01  {hot, warm}            1
5        Marie  2019-01-01  {boo}                  1
Note that the first time editing the text is the creation, so the number of words added is always the size of the important_words set in that case.
Any tips on what would be the fastest way to compute the added_words column will be highly appreciated.
Note that the important_words column contains a set, thus the operation of calculating the difference among two consecutive editions should be easy.
Hard to think through but interesting :-) I am using get_dummies, then we just keep the first 1 value per column and sum them:
s=df.important_words.map(','.join).str.get_dummies(sep=',')
s.mask(s==0).cumsum().eq(1).sum(1)
Out[247]:
0 3
1 0
2 2
3 2
4 1
5 3
6 2
7 1
8 1
dtype: int64
df['val']=s.mask(s==0).cumsum().eq(1).sum(1)
Update
s=df.important_words.map(','.join).str.get_dummies(sep=',')
s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(1)
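An alternative sketch (my own, not from the answer above) works directly on the sets, comparing each row with the previous edit of the same text via groupby + shift:
import pandas as pd

# previous edit's word set within each text (NaN for the first edit of a text);
# assumes rows are already sorted by date within each text_id, as stated in the question
prev = df.groupby('text_id')['important_words'].shift()

df['added_words'] = [
    len(cur - prev_set) if isinstance(prev_set, set) else len(cur)
    for cur, prev_set in zip(df['important_words'], prev)
]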

Calculating Interest Rates Python Dataframe

I need to calculate the compound interest rate, so let's say I have a DataFrame like this:
days
1 10
2 15
3 20
What I want to get is this (suppose the interest rate is 1% every day):
days interest rate
1 10 10,46%
2 15 16,10%
3 20 22,02%
My code is as follows:
def inclusao_juros (x):
    dias = df_arrumada_4['Prazo Medio']
    return ((1.0009723)^dias)-1

df_arrumada_4['juros_acumulado'] = df_arrumada_4['Prazo Medio'].apply(inclusao_juros)
What should I do??? Tks
I think you need numpy.power:
df['new'] = np.power(1.01, df['days']) - 1
print (df)
days new
1 10 0.104622
2 15 0.160969
3 20 0.220190
IIUC
pd.Series([1.01]*len(df)).pow(df.reset_index().days,0).sub(1)
Out[695]:
0 0.104622
1 0.160969
2 0.220190
dtype: float64
Jez's : pd.Series([1.01]*len(df),index=df.index).pow(df.days,0).sub(1)
Or using your apply
df.days.apply(lambda x: 1.01**x -1)
Out[697]:
1 0.104622
2 0.160969
3 0.220190
Name: days, dtype: float64
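For completeness, the function in the question fails for two reasons: Python uses ** for exponentiation (^ is bitwise XOR), and the function ignores its argument x and operates on the whole column instead. A minimal corrected version of that apply-based approach, keeping the question's column names and the 1% daily rate from the question text, would be:
def inclusao_juros(dias):
    # 1% per day, compounded over 'dias' days
    return (1.01 ** dias) - 1

df_arrumada_4['juros_acumulado'] = df_arrumada_4['Prazo Medio'].apply(inclusao_juros)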

Grouping by unique values in python pandas dataframe

I have a dataframe that goes like this
id rev committer_id
date
1996-07-03 08:18:15 1 76620 1
1996-07-03 08:18:15 2 76621 2
1996-11-18 20:51:08 3 76987 3
1996-11-21 09:12:53 4 76995 2
1996-11-21 09:16:33 5 76997 2
1996-11-21 09:39:27 6 76999 2
1996-11-21 09:53:37 7 77003 2
1996-11-21 10:11:35 8 77006 2
1996-11-21 10:17:50 9 77008 2
1996-11-21 10:23:58 10 77010 2
1996-11-21 10:32:58 11 77012 2
1996-11-21 10:55:51 12 77014 2
I would like to group by quarterly periods and then show number of unique entries in the committer_id column. Columns id and rev are actually not used for the moment.
I would like to have a result as the following
committer_id
date
1996-09-30 2
1996-12-31 91
1997-03-31 56
1997-06-30 154
1997-09-30 84
The actual results are aggregated by number of entries in each time period and not by unique entries. I am using the following :
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(np.size)
Can't figure out how to use np.unique.
Any ideas, please.
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(pd.Series.nunique)
Should work for you. Or df.groupby(pd.Grouper(freq='Q-DEC'))['committer_id'].nunique()
Your try with np.unique didn't work because it returns an array of unique items, while the result passed to agg must be a scalar. So .aggregate(lambda x: len(np.unique(x))) would probably work too.
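A minimal end-to-end sketch of the same idea, assuming the dates start out as a regular 'date' column (pd.Grouper with a freq needs a DatetimeIndex, or a key= pointing at a datetime column; if 'date' is already the index as in the question, skip the set_index step):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])

# quarterly count of distinct committers
quarterly_unique = (df.set_index('date')
                      .groupby(pd.Grouper(freq='Q-DEC'))['committer_id']
                      .nunique())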

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning python (+pandas) and hope I can explain this well enough. I have a large time series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id    Start_date           Count  Day_name_no
149   2011-12-31 21:30:00  1      5
150   2011-12-31 20:51:00  1      0
259   2011-12-31 20:48:00  1      1
3015  2011-12-31 19:38:00  1      4
28    2011-12-31 19:37:00  1      4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I'm looking for the mean per hour per day per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to these groups.
Any help would be grateful and if possible an explanation of what I am doing wrong either code wise or my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can use the groupby function using the 'Id' column and then use the resample function with how='sum'.
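A sketch of that resample idea, assuming Start_date is the DatetimeIndex as in the question; resample(..., how='sum') is deprecated in modern pandas, so the aggregation is called as a method instead. Note that resampling also creates zero rows for ticket-free hours, which lowers the resulting means compared to the groupby answer above:
# hourly ticket counts per Id
hourly = (df.groupby('Id')['Count']
            .resample('H')
            .sum()
            .reset_index())

hourly['dow'] = hourly['Start_date'].dt.dayofweek
hourly['hour'] = hourly['Start_date'].dt.hour

# mean tickets per Id, per day of week, per hour
mean_per_id = hourly.groupby(['Id', 'dow', 'hour'])['Count'].mean().reset_index()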
