How to group my time by month / week in pd.DataFrame - python

I have a DataFrame from my Facebook data that lists the events I was interested in, the events I joined, and the respective timestamps for each. I am having trouble grouping the times by month or week because there are two time columns:
joined_time interested_time
0 2019-04-01 2019-04-21
1 2019-03-15 2019-04-06
2 2019-03-13 2019-03-26
Both times indicate when I clicked the 'Going' or 'Interested' button when an event popped up on Facebook. Sorry for the very small sample size; this is what I have simplified it down to for now. What I am trying to achieve is this:
Year Month Total_Events_No Events_Joined Events_Interested
2019 3 3 2 1
4 3 1 2
In this DataFrame the year and month form a MultiIndex, and the other columns hold the counts for the respective situations.

I am using melt before groupby and unstack:
# reshape to long form: one row per (original column name, timestamp) pair
s = df.melt()
s['value'] = pd.to_datetime(s['value'])
# count events per year, month and original column, then pivot the column names out
s = s.groupby([s.value.dt.year, s.value.dt.month, s.variable]).size().unstack()
s['Total'] = s.sum(axis=1)
s
variable interested_time joined_time Total
value value
2019 3 1 2 3
4 2 1 3
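If you also want the index and column names to match the layout asked for in the question, a small follow-up sketch (assuming the frame s produced above; the target names are simply taken from the desired output) could be:
# label the two index levels and drop the 'variable' axis name left by unstack
s = s.rename_axis(['Year', 'Month'])
s.columns.name = None
# rename and reorder the count columns to the requested layout
s = s.rename(columns={'Total': 'Total_Events_No',
                      'joined_time': 'Events_Joined',
                      'interested_time': 'Events_Interested'})
s = s[['Total_Events_No', 'Events_Joined', 'Events_Interested']]
s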

Related

Randomly sample from panel data by 3months periods

I have a pandas DataFrame with panel data, i.e. data for multiple customers over a time frame. For bootstrapping, I want to sample a continuous three-month period (I always want full months) of a random customer, 90 times.
I have googled a bit and found several sampling techniques, but none that samples based on three continuous months.
I was considering making a list of all the month names and sampling three consecutive ones (although I'm not sure how to do 'consecutive'). But how would I then be able to, e.g., pick Nov21-Dec21-Jan22?
Would appreciate the help a lot!
import pandas as pd

date_range = pd.date_range("2020-01-01", "2022-01-01")
df = pd.DataFrame({"value": 3}, index=date_range)
# draw 5 random rows from each calendar quarter
df.groupby(df.index.quarter).sample(5)
This would output:
Out[12]:
value
2021-01-14 3
2021-02-27 3
2020-01-20 3
2021-02-03 3
2021-02-19 3
2021-04-27 3
2021-06-29 3
2021-04-12 3
2020-06-24 3
2020-06-05 3
2021-07-30 3
2020-08-29 3
2021-07-03 3
2020-07-17 3
2020-09-12 3
2020-12-22 3
2021-12-13 3
2021-11-29 3
2021-12-19 3
2020-10-18 3
It selected 5 sample values from each quarter group.
From here on you can format the date column (index) so that it shows the month as text.
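Note that the snippet above draws 5 rows per quarter rather than a continuous three-month window for one customer, as the question asks. One rough, untested sketch of the latter (the column names customer and date are assumptions, and date is assumed to already be datetime):
import numpy as np
import pandas as pd

def sample_three_month_block(df, rng=None):
    """All rows of one random customer inside a random window of
    three consecutive calendar months (e.g. Nov21-Dec21-Jan22)."""
    rng = np.random.default_rng(rng)
    customer = rng.choice(df['customer'].unique())
    sub = df[df['customer'] == customer]
    # consecutive calendar months spanned by this customer's data
    months = pd.period_range(sub['date'].min(), sub['date'].max(), freq='M')
    if len(months) < 3:
        return sub
    start = rng.integers(0, len(months) - 2)   # high is exclusive
    window = months[start:start + 3]
    return sub[sub['date'].dt.to_period('M').isin(window)]

# repeat 90 times for the bootstrap
# samples = [sample_three_month_block(df) for _ in range(90)]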

Multiple group counts within a database

I have been presented with a very small dataset that has the date of each time a user logs into a system. I have to use this dataset to create a table showing, for each log-in, the cumulative monthly count of log-ins and the overall cumulative count of log-ins. This is the dataset I have:
date        user
1/01/2022   Mark
2/01/2022   Mark
3/02/2022   Mark
4/02/2022   Mark
5/03/2022   Mark
6/03/2022   Mark
7/03/2022   Mark
8/03/2022   Mark
9/03/2022   Mark
and this is my desired output:
row  date       user  monthly_track  acum_track
1    1/01/2022  Mark  1              1
2    2/01/2022  Mark  2              2
3    3/02/2022  Mark  1              3
4    4/02/2022  Mark  2              4
5    5/03/2022  Mark  1              5
6    6/03/2022  Mark  2              6
7    7/03/2022  Mark  3              7
8    8/03/2022  Mark  4              8
9    9/03/2022  Mark  5              9
Why? Look at row number 5. This is the first time the user Mark logged into the system during month 3 (March), but it is the 5th overall log-in in the dataset (for the purpose of learning there will only be one year, 2022).
I have no idea how to get the monthly and overall counts together. I can group by user and sort by date to count how many times in total a user has logged in, but to achieve my desired output I would have to group by date and user and count per month, while also somehow grouping the data by user (only) to get the overall count, and I don't think I can group the data twice.
First you need to convert date to actual datetime values with to_datetime. The rest is simple with groupby and cumcount:
# dates are day-first, e.g. 5/03/2022 is 5 March 2022
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
# running count within each (user, year, month) group
df['monthly_count'] = df.groupby([df['user'], df['date'].dt.year, df['date'].dt.month]).cumcount() + 1
# running count per user over the whole dataset
df['acum_count'] = df.groupby('user').cumcount() + 1
Output:
>>> df
date user monthly_count acum_count
0 2022-01-01 Mark 1 1
1 2022-01-02 Mark 2 2
2 2022-02-03 Mark 1 3
3 2022-02-04 Mark 2 4
4 2022-03-05 Mark 1 5
5 2022-03-06 Mark 2 6
6 2022-03-07 Mark 3 7
7 2022-03-08 Mark 4 8
8 2022-03-09 Mark 5 9
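As a side note (a variant, not part of the answer above), the year/month pair can also be expressed as a single monthly period key:
# one monthly key instead of separate year and month grouping levels
df['monthly_count'] = df.groupby(['user', df['date'].dt.to_period('M')]).cumcount() + 1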

Applying Pandas groupby to multiple columns

I have a dataset with several columns and daily data going back several years. The variable is the same in each column. I've calculated the daily, monthly, and yearly statistics for each column, and now I want to do the same but with all columns combined, to get one statistic for each day, month, and year rather than the several separate ones I calculated before.
I've been using pandas groupby so far, with something like this:
sum_daily_files = daily_files.groupby(daily_files.Date.dt.day).sum()
sum_monthly_files = daily_files.groupby(daily_files.Date.dt.month).sum()
sum_yearly_files = daily_files.groupby(daily_files.Date.dt.year).sum()
Any suggestions on how I might go about using Pandas - or any other package - to combine the statistics together? Thanks so much!
edit
Here's a snippet of my dataframe:
Date site1 site2 site3 site4 site5 site6
2010-01-01 00:00:00 2 0 1 1 0 1
2010-01-02 00:00:00 7 5 1 3 1 1
2010-01-03 00:00:00 3 3 2 2 2 1
2010-01-04 00:00:00 0 0 0 0 0 0
2010-01-05 00:00:00 0 0 0 0 0 1
I had to type it in by hand because I was having trouble copying it over, so my apologies. Basically, it's six different sites from 2010 to 2019, and the data shows how much snow (in inches) each site received on each day.
(Your problem needs to be clarified.)
Is this what you want?
all_sum_daily_files = sum_daily_files.sum(axis=1) # or daily_files.sum(axis=1)
all_sum_monthly_files = sum_monthly_files.sum(axis=1)
all_sum_yearly_files = sum_yearly_files.sum(axis=1)
If your data is daily, there is no need to calculate the daily sum first; you can use daily_files.sum(axis=1) directly.
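If the goal is one combined figure across all six sites per day, month, and year, another sketch (assuming Date is already a datetime column, as in the original groupby calls) is to collapse the site columns first and then group on the same keys:
# total across all site columns for each date, then aggregate by calendar unit
combined = daily_files.set_index('Date').sum(axis=1)

sum_daily_all = combined.groupby(combined.index.day).sum()
sum_monthly_all = combined.groupby(combined.index.month).sum()
sum_yearly_all = combined.groupby(combined.index.year).sum()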

How to get weekly averages for column values and week number for the corresponding year based on daily data records with pandas

I'm still learning python and would like to ask your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar weeks. So for the mockup data below I have rows stretched over 2 weeks (week 14 (current week) and week 13 (past week)). Now I need to find a way to group rows per calendar week, recognize what year they belong to and calculate week sum and week average. In the file input example there are only two different IDs. However, in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
my goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week, not sure if this is possible) and create a 'week' column with the week number, then sum the 'activeMembers' values for that week, save them as a 'WeeklyActiveMembersSum' column in my output file, and finally calculate 'WeeklyAverageActiveMembers' for that week. I was experimenting with groupby and isin but no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week setting than you do:
# convert the date column to datetime first
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['id',df.date.dt.strftime('%Y%W')], sort=False)
.activeMembers.agg([('Sum','sum'),('Average','mean')])
.add_prefix('activeMembers')
.reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
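The mismatch in week numbers most likely comes from '%W' (which starts counting at the first Monday of the year) versus ISO weeks: the desired output (202014 for 2020-03-30) matches ISO week numbering. A hedged variant keyed on the ISO year/week (dt.isocalendar() requires pandas 1.1+) would be:
# ISO calendar year and week, e.g. 2020-03-30 opens ISO week 14 of 2020
iso = df['date'].dt.isocalendar()
week = (iso['year'].astype(str) + iso['week'].astype(str).str.zfill(2)).rename('week')

(df.groupby(['id', week], sort=False)
   .activeMembers.agg([('Sum', 'sum'), ('Average', 'mean')])
   .add_prefix('activeMembers')
   .reset_index())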

Pandas: Conditional replace on consecutive rows within a group

I am trying to build "episodes" from a list of transactions organized by group (patient). I used to do this with Stata, but I'm not sure how to do it in Python. In Stata, I would say something like:
by patient: replace startDate = startDate[_n-1] if startDate-endDate[_n-1]<10
In English, that means: start with the first row of a group and check whether the number of days between that row's startDate and the prior row's endDate is less than 10. Then move to the next row and do the same, and so on, until all rows are exhausted.
I have been trying to figure out how to do the same thing in Python/pandas and have been running into a wall. I could sort the dataframe by patient and date, then iterate over the entire data frame, but it seems like there should be a better way to do this.
It's important that the script compare row 2 to row 1 first because, if the script replaces the value in row 2, then when it gets to row 3 I want it to use the replaced value, not the original value.
Sample input:
Patient startDate endDate
1 1/1/2016 1/2/2016
1 1/11/2016 1/12/2016
1 1/28/2016 1/28/2016
1 6/15/2016 6/16/2016
2 3/1/2016 3/1/2016
Sample output:
Patient startDate endDate
1 1/1/2016 1/2/2016
1 1/1/2016 1/12/2016
1 1/1/2016 1/28/2016
1 6/15/2016 6/16/2016
2 3/1/2016 3/1/2016
I think we need shift + groupby, and bfill + mask is the key:
df.startDate=pd.to_datetime(df.startDate)
df.endDate=pd.to_datetime(df.endDate)
df.startDate = (df.groupby('Patient')
                  .apply(lambda x: x.startDate.mask((x.startDate - x.endDate.shift(1)).fillna(0).astype('timedelta64[D]') < 10).bfill())
                  .reset_index(level=0, drop=True)
                  .fillna(df.startDate))
df
Out[495]:
Patient startDate endDate
0 1 2016-01-28 2016-01-02
1 1 2016-01-28 2016-01-12
2 1 2016-01-28 2016-01-28
3 1 2016-06-15 2016-06-16
4 2 2016-03-01 2016-03-01
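The output above propagates the last start date backwards within each run (via bfill), so it differs from the sample output. A different sketch, not the answer's approach, builds an episode id from the 10-day rule and broadcasts each episode's first startDate forward; it assumes rows are already sorted by Patient and startDate, and note it still keeps 1/28/2016 on the third row, since that row's gap to 1/12/2016 is 16 days:
# flag rows that open a new episode: first row per patient, or a gap of 10+ days
gap = df['startDate'] - df.groupby('Patient')['endDate'].shift(1)
new_episode = gap.isna() | (gap >= pd.Timedelta(days=10))

# cumulative sum of the flags gives an episode id per patient,
# then every row in an episode takes that episode's first startDate
episode = new_episode.astype(int).groupby(df['Patient']).cumsum()
df['startDate'] = df.groupby(['Patient', episode])['startDate'].transform('first')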
