I am a fairly beginner programmer learning Python (+ pandas) and hope I can explain this well enough. I have a large time-series pandas DataFrame of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many similar questions, like counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), per station.
I have the following, with the datetime set as the index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby(['Id', 'Day_name_no', 'Trip_hour']).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(['Id', 'Day_name_no', 'Trip_hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to those groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach. Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question on a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for each (Id, date, dow, hour) combination
df = df.groupby(['Id', 'date', 'dow', 'hour'])['Count'].sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour'])['Count'].mean()
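As a sanity check, here is a minimal sketch (using the column names from the toy dataset in the question) showing that this two-step aggregation reproduces the expected means of 4, 1 and 2.5:
import pandas as pd

toy = pd.DataFrame({
    'Date': ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
            + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id': 1234,
    'Dow': [0] * 9 + [1] * 5,
    'Hour': [9] * 8 + [10] + [11] * 5,
    'Count': 1,
})

# step 1: total tickets per Id per calendar date (and its dow/hour slot)
daily = toy.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()
# step 2: mean of those daily totals per Id, per day of week, per hour
means = daily.groupby(['Id', 'Dow', 'Hour'])['Count'].mean().reset_index(name='Mean')
print(means)  # Mean column: 4.0, 1.0, 2.5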
You can also group by the 'Id' column and then resample the datetime index hourly, summing the counts (recent pandas versions call .sum() on the resampler instead of passing how='sum').
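A hedged sketch of that route, assuming Start_date is the DataFrame's DatetimeIndex. Note that resample emits zero-sum rows for hours with no tickets inside each Id's date range, which pulls the means down compared with averaging over observed dates only:
# hourly ticket totals per Id; resample requires a DatetimeIndex
hourly = df.groupby('Id').resample('H')['Count'].sum().reset_index()

# mean tickets per Id, per day of week, per hour of day
dow = hourly['Start_date'].dt.dayofweek.rename('dow')
hour = hourly['Start_date'].dt.hour.rename('hour')
mean_per_slot = hourly.groupby(['Id', dow, hour])['Count'].mean()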
I have a dataset with a few columns. I would like to slice the data frame by finding the string "M22" in the column "RUN_NUMBER". I am able to do so. However, I would like to count the number of unique RUN_NUMBER values that contain the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
Here are the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the strings/rows that contained "M22", so I need to get 3 as output.
Use the following approach with the pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a faster alternative using the numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3
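Equivalently, pandas' built-in nunique does this in one step; a small self-contained check (recreating the sample rows) might look like:
import pandas as pd

df = pd.DataFrame({'RUN_NUMBER': ['335991M', 'M220621', 'M220678', '510091M',
                                  'M220500', '335991M', 'M220621', 'M220678',
                                  '335991M', 'M220621', 'M220678']})

# number of distinct RUN_NUMBER values containing the substring "M22"
print(df.loc[df['RUN_NUMBER'].str.contains('M22'), 'RUN_NUMBER'].nunique())  # 3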
I am manipulating some data in Python and was wondering if anyone can help.
I have data that looks like this:
count source timestamp tokens
0 1 alt-right-census 2006-03-21 setting
1 1 alt-right-census 2006-03-21 twttr
2 1 stormfront 2006-06-24 head
3 1 stormfront 2006-10-07 five
and I need data that looks like this:
count_stormfront count_alt-right-census month token
2 1 2006-01 setting
or like this:
date token alt_count storm_count
4069995 2016-09 zealand 0 0
4069996 2016-09 zero 11 8
4069997 2016-09 zika 295 160
How can I aggregate days by year-month and pivot so that count becomes count_source summed over the month?
Any help would be appreciated. Thanks!
df.groupby(['source', df['timestamp'].str[:7]]).size().unstack()
Result:
timestamp 2006-03 2006-06 2006-10
source
alt-right-census 2.0 NaN NaN
stormfront NaN 1.0 1.0
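To get from there to the wide shape in the question (one row per month and token, with a count_<source> column summed over the month), a pivot_table sketch, assuming timestamp is stored as a 'YYYY-MM-DD' string:
# truncate the day to get a year-month key
df['month'] = df['timestamp'].str[:7]

out = (df.pivot_table(index=['month', 'tokens'], columns='source',
                      values='count', aggfunc='sum', fill_value=0)
         .add_prefix('count_')
         .reset_index())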
I'm still learning Python and would like to ask your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar week. For the mockup data below, the rows stretch over 2 weeks (week 14, the current week, and week 13, the past week). I need to find a way to group the rows per calendar week, recognise which year they belong to, and calculate the week's sum and average. In the example input there are only two different IDs; in the actual data file, however, I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
My goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week, though I'm not sure that's possible) and create a 'week' column with the week number, then sum the 'activeMembers' values for that week and save them as a 'WeeklyActiveMembersSum' column in my output file, and finally calculate a 'WeeklyAverageActiveMembers' column for that week. I was experimenting with groupby and isin but with no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week numbering than you do:
# convert the date column to datetime type
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['id',df.date.dt.strftime('%Y%W')], sort=False)
.activeMembers.agg([('Sum','sum'),('Average','mean')])
.add_prefix('activeMembers')
.reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
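If the week numbers in the desired output (202014 for 2020-03-30) are ISO weeks, .dt.isocalendar() (pandas >= 1.1) reproduces them; and since the expected averages work out to the weekly sum divided by 7 calendar days (68 / 7 ≈ 9.7, 10 / 7 ≈ 1.4), a sketch under those two assumptions:
# ISO year and week; 2020-03-30 is a Monday and falls in ISO week 14
iso = df['date'].dt.isocalendar()
df['week'] = iso['year'].astype(str) + iso['week'].astype(str).str.zfill(2)

out = (df.groupby(['id', 'week'], sort=False)['activeMembers']
         .sum()
         .reset_index(name='WeeklyActiveMembersSum'))
# average read as the weekly total spread over 7 calendar days
out['WeeklyAverageActiveMembers'] = (out['WeeklyActiveMembersSum'] / 7).round(1)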
This question is related to my previous question. I have the following dataframe:
df =
QUEUE_1 QUEUE_2 DAY HOUR TOTAL_SERVICE_TIME TOTAL_WAIT_TIME EVAL
ABC123 DEF656 1 7 20 30 1
ABC123 1 7 22 32 0
DEF656 ABC123 1 8 15 12 0
FED456 DEF656 2 8 15 16 1
I need to get the following dataframe (it's similar to the one I wanted to get in my previous question, but here I need to add 2 additional columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1).
QUEUE HOUR AVG_TOT_SERVICE_TIME AVG_TOT_WAIT_TIME AVG_COUNT_PER_DAY_HOUR AVG_PERCENT_EVAL_1
ABC123 7 21 31 1 50
ABC123 8 15 12 0.5 100
DEF656 7 20 30 0.5 100
DEF656 8 15 14 1 50
FED456 7 0 0 0 0
FED456 8 15 14 0.5 100
The column AVG_COUNT_PER_DAY_HOUR should contain the average count of a corresponding HOUR value over days (DAY) grouped by QUEUE. For example, in df, in case of ABC123, the HOUR 7 appears 2 times for the DAY 1 and 0 times for the DAY 2. Therefore the average is 1. The same logic is applied to the HOUR 8. It appears 1 time in DAY 1 and 0 times in DAY 2 for ABC123. Therefore the average is 0.5.
The column AVG_PERCENT_EVAL_1 should contain the percent of EVAL equal to 1 over hours, grouped by QUEUE. For example, in case of ABC123, the EVAL is equal to 1 one time when HOUR is 7. It is also equal to 0 one time when HOUR is 7. So, AVG_PERCENT_EVAL_1 is 50 for ABC123 and hour 7.
I use this approach:
df = pd.lreshape(aa, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index=['QUEUE'], columns=['HOUR'], fill_value=0)
result = piv_df.stack().add_prefix('AVG_').reset_index()
I get stuck with adding the columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1. For instance, to add the column AVG_COUNT_PER_DAY_HOUR I am thinking of using .apply(pd.value_counts, 1).notnull().groupby(level=0).sum().astype(int), while for calculating AVG_PERCENT_EVAL_1 I am thinking of using [df.EVAL==1].agg({'EVAL' : 'count'}). However, I don't know how to incorporate these into my current code in order to get the correct solution.
UPDATE:
Perhaps it is easier to adapt this solution to what I need in this question:
result = pd.lreshape(df, {'QUEUE': ['QUEUE_1','QUEUE_2']})
mux = pd.MultiIndex.from_product([result.QUEUE.dropna().unique(),
                                  result.DAY.dropna().unique(),
                                  result.HOUR.dropna().unique()], names=['QUEUE','DAY','HOUR'])
print (result.groupby(['QUEUE','DAY','HOUR'])
.mean()
.reindex(mux, fill_value=0)
.add_prefix('AVG_')
.reset_index())
Steps:
1) To compute AVG_COUNT_PER_DAY_HOUR:
With the help of pd.crosstab(), compute the distinct counts of HOUR w.r.t. DAY (so that we obtain the missing-day cases), grouped by QUEUE.
stack the DF so that HOUR, which was part of a hierarchical column before, becomes part of the index, leaving just DAY as columns. Fill NaNs with 0 and take the mean column-wise (i.e. across days).
2) To compute AVG_PERCENT_EVAL_1:
After getting the pivoted frame (same as before): since EVAL is binary (1/0), the mean computed while pivoting (the default aggfunc is np.mean) is already the fraction of rows with EVAL equal to 1, so we simply take EVAL from this DF and multiply its result by 100.
Finally, we join all these frames.
Same as in the linked post:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index='QUEUE', columns='HOUR', fill_value=0).stack()
avg_tot = piv_df[['TOTAL_SERVICE_TIME', 'TOTAL_WAIT_TIME']].add_prefix("AVG_")
Additional portion:
avg_cnt = pd.crosstab(df['QUEUE'], [df['DAY'], df['HOUR']]).stack().fillna(0).mean(1)
avg_pct = piv_df['EVAL'].mul(100).astype(int)
avg_tot.join(
avg_cnt.to_frame("AVG_COUNT_PER_DAY_HOUR")
).join(avg_pct.to_frame("AVG_PERCENT_EVAL_1")).reset_index()
avg_cnt looks like:
QUEUE HOUR
ABC123 7 1.0
8 0.5
DEF656 7 0.5
8 1.0
FED456 7 0.0
8 0.5
dtype: float64
avg_pct looks like:
QUEUE HOUR
ABC123 7 50
8 0
DEF656 7 100
8 50
FED456 7 0
8 100
Name: EVAL, dtype: int32
I have a pandas dataframe with 1408 rows of data. My goal is to compare the largest and smallest numbers associated with a given weekday during one week to the next week's numbers on the same day of the week on which the prior largest/smallest occurred. Essentially, I want to look at quintiles (since there are 5 days in a business week), ranks 1 and 5, and see how they change from week to week, building a CDF of the numbers associated with each weekday.
To clean the data, I need to remove 18 weeks in total: every week in the dataframe associated with a holiday, plus the entire week following the holiday.
After this, I think I should insert a column into the dataframe that labels all my data with Monday through Friday, for all the dates in the file (there are 6 years of data). Labeling M-F would let me sort the numbers associated with each day of the week in ascending order and query by day of the week.
Methodological suggestions on either 1. or 2. or both would be immensely appreciated.
Thank you!
#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting GroupBy object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
In [54]: def rankdays(df):
   ....:     if len(df) != 5:
   ....:         return pandas.Series()
   ....:     # .values drops the datetime index so the ranks align with the new weekday index
   ....:     return pandas.Series(df.Value.rank().values, index=df.index.weekday)
   ....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
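For part 1 (dropping every holiday week plus the week after it), a hedged sketch, assuming a DatetimeIndex like the one above and a hand-maintained holidays list (both the list and its single date are illustrative):
import pandas

# illustrative; in practice list every relevant holiday date
holidays = pandas.to_datetime(['2012-08-06'])

# weekly periods containing a holiday, plus the week after each one
bad_weeks = set()
for day in holidays:
    week = day.to_period('W')
    bad_weeks.update([week, week + 1])

clean = df[~df.index.to_period('W').isin(list(bad_weeks))]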