Multiple group counts within data base - python

I have been given a very small dataset that records the date of each time a user logs into a system. I need to use it to build a table that shows, for each login, the cumulative monthly count of logins and the overall cumulative count of logins. This is the dataset I have:
date       user
1/01/2022  Mark
2/01/2022  Mark
3/02/2022  Mark
4/02/2022  Mark
5/03/2022  Mark
6/03/2022  Mark
7/03/2022  Mark
8/03/2022  Mark
9/03/2022  Mark
and this is my desired output:
row  date       user  monthly_track  acum_track
1    1/01/2022  Mark  1              1
2    2/01/2022  Mark  2              2
3    3/02/2022  Mark  1              3
4    4/02/2022  Mark  2              4
5    5/03/2022  Mark  1              5
6    6/03/2022  Mark  2              6
7    7/03/2022  Mark  3              7
8    8/03/2022  Mark  4              8
9    9/03/2022  Mark  5              9
Why? Look at row 5. It is the first time the user Mark has logged into the system during month 3 (March), but it is the 5th login overall in the dataset. (For the purpose of learning there is only one year, 2022.)
I have no idea how to get the monthly and overall counts together. I can group by user and sort by date to count how many times in total a user has logged in. I know that to achieve my desired output I have to group by date and user and then count by month, but I also have to somehow group the data by user only to get the overall count, and I don't think I can group the data twice.

First you need to convert date to actual datetime values with to_datetime. The rest is simple with groupby and cumcount:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df['monthly_count'] = df.groupby([df['user'], df['date'].dt.year, df['date'].dt.month]).cumcount() + 1
df['acum_count'] = df.groupby('user').cumcount() + 1
Output:
>>> df
        date  user  monthly_count  acum_count
0 2022-01-01  Mark              1           1
1 2022-01-02  Mark              2           2
2 2022-02-03  Mark              1           3
3 2022-02-04  Mark              2           4
4 2022-03-05  Mark              1           5
5 2022-03-06  Mark              2           6
6 2022-03-07  Mark              3           7
7 2022-03-08  Mark              4           8
8 2022-03-09  Mark              5           9
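For reference, here is a self-contained sketch of the same idea that rebuilds the sample data from the question. It is a slight variation rather than the answer's exact code: it assumes a single month key built with dt.to_period('M') in place of separate year and month keys, and it sorts by user and date first so the running counts follow chronological order.

```python
import pandas as pd

# Sample data from the question (day/month/year strings).
df = pd.DataFrame({
    "date": ["1/01/2022", "2/01/2022", "3/02/2022", "4/02/2022", "5/03/2022",
             "6/03/2022", "7/03/2022", "8/03/2022", "9/03/2022"],
    "user": ["Mark"] * 9,
})
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# cumcount follows row order, so sort chronologically within each user first.
df = df.sort_values(["user", "date"]).reset_index(drop=True)

# One key per calendar month (assumption: a monthly Period instead of year + month).
month = df["date"].dt.to_period("M").rename("month")

# Running count within (user, month), and running count within user overall.
df["monthly_count"] = df.groupby(["user", month]).cumcount() + 1
df["acum_count"] = df.groupby("user").cumcount() + 1

print(df)
```

The two groupby calls are independent, so there is no problem grouping the same data twice with different keys.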

Related

column that depends on computing the difference among two column cells in groupby object

I need some tips to make a calculation.
I have a DataFrame that looks like the following:
text_id user date important_words
1 John 2018-01-01 {cat, dog, puppy}
1 John 2018-02-01 {cat, dog}
2 Anne 2018-01-01 {flower, sun}
3 John 2018-03-01 {water, blue}
3 Marie 2018-05-01 {water, blue, ocean}
3 Kate 2018-08-01 {island, sand, towel}
4 Max 2018-01-01 {hot, cold}
4 Ethan 2018-06-01 {hot, warm}
5 Marie 2019-01-01 {boo}
In the given dataframe:
the text_id refers to the id of a text: each text with a different id is a different text. The user column refers to the name of the user that has edited the text (adding and erasing important words). The date column refers to the moment at which the edit was made (note that the edits on each text are sorted chronologically). Finally, the important_words column is the set of important words present in the text after the user's edit.
I need to calculate how many words were added by each user on each edit of a text.
The expected output here would be:
text_id user date important_words added_words
1 John 2018-01-01 {cat, dog, puppy} 3
1 John 2018-02-01 {cat, dog} 0
2 Anne 2018-01-01 {flower, sun} 2
3 John 2018-03-01 {water, blue} 2
3 Marie 2018-05-01 {water, blue, ocean} 1
3 Kate 2018-08-01 {island, sand, towel} 3
4 Max 2018-01-01 {hot, cold} 2
4 Ethan 2018-06-01 {hot, warm} 1
5 Marie 2019-01-01 {boo} 1
Note that the first time editing the text is the creation, so the number of words added is always the size of the important_words set in that case.
Any tips on what would be the fastest way to compute the added_words column will be highly appreciated.
Note that the important_words column contains a set, thus the operation of calculating the difference among two consecutive editions should be easy.
Hard to think about but interesting :-) I am using get_dummies; then we just keep the first 1 value per column and sum them
s=df.important_words.map(','.join).str.get_dummies(sep=',')
s.mask(s==0).cumsum().eq(1).sum(1)
Out[247]:
0 3
1 0
2 2
3 2
4 1
5 3
6 2
7 1
8 1
dtype: int64
df['val']=s.mask(s==0).cumsum().eq(1).sum(1)
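Update: to count first occurrences within each text_id rather than across the whole DataFrame, group by text_id before the cumsum: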
s=df.important_words.map(','.join).str.get_dummies(sep=',')
s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(1)
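Since the question mentions that the column already holds sets, a more literal alternative is to take the set difference against the previous edit of the same text. The following is only a sketch of that idea, not the approach used in the answer above; the dictionary name previous is made up for illustration, and it assumes the rows are already in chronological order within each text_id.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "text_id": [1, 1, 2, 3, 3, 3, 4, 4, 5],
    "user": ["John", "John", "Anne", "John", "Marie", "Kate", "Max", "Ethan", "Marie"],
    "important_words": [
        {"cat", "dog", "puppy"}, {"cat", "dog"}, {"flower", "sun"},
        {"water", "blue"}, {"water", "blue", "ocean"}, {"island", "sand", "towel"},
        {"hot", "cold"}, {"hot", "warm"}, {"boo"},
    ],
})

previous = {}  # last word set seen for each text_id (hypothetical helper name)
added = []
for text_id, words in zip(df["text_id"], df["important_words"]):
    # The first edit of a text has no previous set, so every word counts as added.
    added.append(len(words - previous.get(text_id, set())))
    previous[text_id] = words

df["added_words"] = added
print(df[["text_id", "user", "added_words"]])
```

The two approaches differ in one case: a word that is removed and later added back counts as added again here, whereas the get_dummies version only counts the first appearance of a word within a text.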

Rolling sum based on dates, adding in conditions that actively update values in Pandas Dataframe if met?

I am calculating rolling last 180 day sales totals by ID in Python using Pandas and need to be able to update the last 180 day cumulative sales column if a user hits a certain threshold. For example, if someone reaches $100 spent cumulatively in the last 180 days, their cumulative spend for that day should reflect them reaching that level and effectively "redeeming" that $100, leaving them only with the excess from the last visit as progress towards their next $100 hit. (See the example below)
I also need to create a separate data frame during this process containing only the dates & user_ids for when the $100 is met to keep track of how many times the threshold has been met across all users.
I was thinking somehow I could use apply with conditional statements, but was not sure exactly how it would work as the data frame needs to be updated on the fly to have the rolling sums for later dates be calculated taking into account this updated total. In other words, the cumulative sums for dates after they hit the threshold need to be adjusted for the fact that they "redeemed" the $100.
This is what I have so far, which gets the rolling cumulative sum by user. I don't know if it's possible to chain conditional methods with apply to this or what the best way forward is.
order_data['rolling_sales_180'] = order_data.groupby('user_id').rolling(window='180D', on='day')['sales'].sum().reset_index(drop=True)
See the below example of expected results. In row 6, the user reaches $120, crossing the $100 threshold, but the $100 is subtracted from his cumulative sum as of that date and he is left with $20 as of that date because that was the amount in excess of the $100 threshold that he spent on that day. He then continues to earn cumulatively on this $20 for his subsequent visit within 180 days. A user can go through this process many times, earning many rewards over different 180 day periods.
print(order_data)
         day  user_id  sales  rolling_sales_180
0 2017-08-10        1     10                 10
1 2017-08-22        1     10                 20
2 2017-08-31        1     10                 30
3 2017-09-06        1     10                 40
4 2017-09-19        1     10                 50
5 2017-10-16        1     30                 80
6 2017-11-28        1     40                 20
7 2018-01-22        1     10                 30
8 2018-03-19        1     10                 40
9 2018-07-25        1     10                 20
Additionally, as mentioned above, I need a separate data frame to be created throughout this process with the day, user_id, sales, and rolling_sales_180 that only includes all the days during which the $100 threshold was met in order to count the number of times this goal is reached. See below:
print(threshold_reached)
day user_id sales rolling_sales_180
0 2017-11-28 1 40 120
.
.
.
If I understand your question correctly, the following should work for you:
def groupby_rolling(grp_df):
    df = grp_df.set_index("day")
    cum_sales = df.rolling("180D")["sales"].sum()
    hundreds = (cum_sales // 100).astype(int)
    progress = cum_sales % 100
    df["rolling_sales_180"] = cum_sales
    df["progress"] = progress
    df["milestones"] = hundreds
    return df

result = df.groupby("user_id").apply(groupby_rolling)
Output of this is (for your provided sample):
                    user_id  sales  rolling_sales_180  progress  milestones
user_id day
1       2017-08-10        1     10               10.0      10.0           0
        2017-08-22        1     10               20.0      20.0           0
        2017-08-31        1     10               30.0      30.0           0
        2017-09-06        1     10               40.0      40.0           0
        2017-09-19        1     10               50.0      50.0           0
        2017-10-16        1     30               80.0      80.0           0
        2017-11-28        1     40              120.0      20.0           1
        2018-01-22        1     10              130.0      30.0           1
        2018-03-19        1     10               90.0      90.0           0
        2018-07-25        1     10               20.0      20.0           0
What the groupby(...).apply(...) does is for each group in the original df, the provided function is applied. In this case, I've encapsulated your complex logic, which is currently not possible to do with a straightforward groupby-rolling operation, in a simple-to-parse basic function.
The function should hopefully be self-documenting by how I named variables, but I'd be happy to add comments if you'd like.
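Note that the progress column above matches the question's expected rolling_sales_180 for most rows but not all (90 versus the expected 40 on 2018-03-19). One reading of the expected table is that each redemption uses up the oldest spend still inside the window. The sketch below implements that interpretation with an explicit loop. The function name redeem_rolling and its parameters are hypothetical, it assumes the rows of a single user sorted by day, and it is quadratic in the number of rows, so treat it as an illustration rather than a tuned solution.

```python
import pandas as pd

def redeem_rolling(days, sales, window="180D", threshold=100):
    """Hypothetical helper: rolling balance where a redemption consumes the oldest spend."""
    days = list(pd.to_datetime(days))
    width = pd.Timedelta(window)
    remaining = []  # unredeemed amount left on each purchase seen so far
    balances = []   # balance reported for each row, after any redemption
    hits = []       # (day, balance before redemption) for each threshold hit
    for i, (day, amount) in enumerate(zip(days, sales)):
        remaining.append(float(amount))
        # Same window as rolling("180D") with default closing: (day - 180 days, day]
        in_window = [j for j in range(i + 1) if days[j] > day - width]
        total = sum(remaining[j] for j in in_window)
        while total >= threshold:
            hits.append((day, total))
            to_consume = float(threshold)
            for j in in_window:  # oldest purchases first
                used = min(remaining[j], to_consume)
                remaining[j] -= used
                to_consume -= used
                if to_consume <= 0:
                    break
            total -= threshold
        balances.append(total)
    return balances, hits

# Sample data from the question (one user).
order_data = pd.DataFrame({
    "day": pd.to_datetime(["2017-08-10", "2017-08-22", "2017-08-31", "2017-09-06",
                           "2017-09-19", "2017-10-16", "2017-11-28", "2018-01-22",
                           "2018-03-19", "2018-07-25"]),
    "user_id": 1,
    "sales": [10, 10, 10, 10, 10, 30, 40, 10, 10, 10],
})
balances, hits = redeem_rolling(order_data["day"], order_data["sales"])
order_data["rolling_sales_180"] = balances
threshold_reached = pd.DataFrame(hits, columns=["day", "rolling_sales_180"])
print(order_data)
print(threshold_reached)
```

With several users, the same function could be called once per group of order_data.groupby("user_id"), adding the user_id to each row of threshold_reached.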

How to group my time by month / week in pd.DataFrame

I have this DataFrame of my Facebook activity. It records the events I was interested in, the events I joined, and the respective timestamps. I am having trouble grouping the times by month or week, because there are two time columns.
joined_time interested_time
0 2019-04-01 2019-04-21
1 2019-03-15 2019-04-06
2 2019-03-13 2019-03-26
Both times indicate when I clicked the 'Going' or 'Interested' button when an event popped up on Facebook. Sorry for the very small sample size, but this is what I have simplified it down to at the moment. What I am trying to achieve is this:
Year  Month  Total_Events_No  Events_Joined  Events_Interested
2019  3      3                2              1
      4      3                1              2
Where in this DataFrame, the year and month are multi-index, and the other columns consist of the counts of respective situations.
I am using melt before groupby and unstack
s=df.melt()
s.value=pd.to_datetime(s.value)
s=s.groupby([s.value.dt.year,s.value.dt.month,s.variable]).size().unstack()
s['Total']=s.sum(axis=1)
s
variable     interested_time  joined_time  Total
value value
2019  3                    1            2      3
      4                    2            1      3
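Here is a self-contained sketch of the same melt, groupby, unstack chain using the three sample rows from the question. The names long, event_type and when are made up for illustration, and the index levels are renamed to Year and Month so the result is closer to the requested layout.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "joined_time": ["2019-04-01", "2019-03-15", "2019-03-13"],
    "interested_time": ["2019-04-21", "2019-04-06", "2019-03-26"],
})

# Wide to long: one row per (event type, timestamp).
long = df.melt(var_name="event_type", value_name="when")
long["when"] = pd.to_datetime(long["when"])

# Count rows per year, month and event type, then pivot the event type into columns.
counts = (
    long.groupby([long["when"].dt.year.rename("Year"),
                  long["when"].dt.month.rename("Month"),
                  "event_type"])
        .size()
        .unstack(fill_value=0)
)
counts["Total_Events_No"] = counts.sum(axis=1)
print(counts)
```

For weekly grouping, the month key could be swapped for another key derived from the same datetime column.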

Grouping by unique values in python pandas dataframe

I have a dataframe that looks like this
id rev committer_id
date
1996-07-03 08:18:15 1 76620 1
1996-07-03 08:18:15 2 76621 2
1996-11-18 20:51:08 3 76987 3
1996-11-21 09:12:53 4 76995 2
1996-11-21 09:16:33 5 76997 2
1996-11-21 09:39:27 6 76999 2
1996-11-21 09:53:37 7 77003 2
1996-11-21 10:11:35 8 77006 2
1996-11-21 10:17:50 9 77008 2
1996-11-21 10:23:58 10 77010 2
1996-11-21 10:32:58 11 77012 2
1996-11-21 10:55:51 12 77014 2
I would like to group by quarterly periods and then show number of unique entries in the committer_id column. Columns id and rev are actually not used for the moment.
I would like to have a result as the following
committer_id
date
1996-09-30 2
1996-12-31 91
1997-03-31 56
1997-06-30 154
1997-09-30 84
The actual results are aggregated by number of entries in each time period and not by unique entries. I am using the following :
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(np.size)
Can't figure how to use np.unique.
Any ideas, please.
Best,
--
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(pd.Series.nunique)
Should work for you. Or df.groupby(pd.Grouper(freq='Q-DEC'))['committer_id'].nunique()
Your attempt with np.unique didn't work because it returns an array of unique items, while the result of agg must be a scalar. So .aggregate(lambda x: len(np.unique(x))) would probably work too.
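To see the difference between counting entries and counting unique values, here is a small sketch built from the twelve sample rows in the question. As an assumption made to keep it runnable across pandas versions, it groups on quarterly periods from index.to_period("Q") rather than pd.Grouper(freq='Q-DEC'), so the result is labelled 1996Q3 and 1996Q4 instead of by quarter-end date.

```python
import pandas as pd

# The twelve sample rows from the question (only committer_id is needed).
df = pd.DataFrame(
    {"committer_id": [1, 2, 3] + [2] * 9},
    index=pd.to_datetime([
        "1996-07-03 08:18:15", "1996-07-03 08:18:15", "1996-11-18 20:51:08",
        "1996-11-21 09:12:53", "1996-11-21 09:16:33", "1996-11-21 09:39:27",
        "1996-11-21 09:53:37", "1996-11-21 10:11:35", "1996-11-21 10:17:50",
        "1996-11-21 10:23:58", "1996-11-21 10:32:58", "1996-11-21 10:55:51",
    ]),
)
df.index.name = "date"

# Calendar quarters ending in December (assumption: periods instead of pd.Grouper).
quarter = df.index.to_period("Q")

entries = df.groupby(quarter)["committer_id"].size()              # rows per quarter
unique_committers = df.groupby(quarter)["committer_id"].nunique() # distinct ids per quarter

print(entries)
print(unique_committers)
```

In the last quarter of 1996 there are ten entries but only two distinct committers, which is the gap between np.size and nunique.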

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer and learning python (+pandas) and hope I can explain this well enough. I have a large time series pd dataframe of over 3 million rows and initially 12 columns spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers(350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour per day per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in the code or in my approach.
Thanks in advance.
I have edited this to try make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows, contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations (only the Count column is aggregated)
df = df.groupby(['Id', 'date', 'dow', 'hour'])['Count'].sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour'])['Count'].mean()
Alternatively, you can group by the 'Id' column and then resample. The how='sum' argument from older pandas versions has since been removed, so call .sum() on the resampler instead.
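The two-step aggregation (sum per date, then mean over dates) can be checked against the toy dataset from the question. This is a minimal sketch that uses the Dow and Hour columns exactly as given there rather than deriving them from the dates; the names toy, per_date and mean_per_slot are made up for illustration.

```python
import pandas as pd

# Toy dataset from the question: one row per ticket taken.
toy = pd.DataFrame({
    "Date": ["12/12/2014"] * 5 + ["19/12/2014"] * 3 + ["26/12/2014"]
            + ["27/12/2014"] * 4 + ["04/01/2015"],
    "Id": 1234,
    "Dow": [0] * 9 + [1] * 5,
    "Hour": [9] * 8 + [10] + [11] * 5,
    "Count": 1,
})

# Step 1: total tickets per Id, calendar date and hour.
per_date = toy.groupby(["Id", "Date", "Dow", "Hour"], as_index=False)["Count"].sum()

# Step 2: average those per-date totals over all dates sharing the same weekday and hour.
mean_per_slot = per_date.groupby(["Id", "Dow", "Hour"])["Count"].mean()

print(mean_per_slot)
```

One caveat: a date on which no ticket was taken in a given hour has no row at all, so it does not pull the mean down as a zero would. If such hours should count as zero, the per-date totals would need to be reindexed over the full set of dates first.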
