My data has trips with datetime info, user id for each trip and trip type (single, round, pseudo).
Here's a data sample (pandas dataframe), named All_Data:
HoraDTRetirada idpass type
2016-02-17 15:36:00 39579449489 'single'
2016-02-18 19:13:00 39579449489 'single'
2016-02-26 09:20:00 72986744521 'pseudo'
2016-02-27 12:11:00 72986744521 'round'
2016-02-27 14:55:00 11533148958 'pseudo'
2016-02-28 12:27:00 72986744521 'round'
2016-02-28 16:32:00 72986744521 'round'
I would like to count the number of times each category repeats in a "week of year" by user.
For example, if an event happens on a Monday and the next event happens on a Thursday for the same user, that makes two events in the same week; however, if one event happens on a Saturday and the next on the following Monday, they happened in different weeks.
The output I am looking for would be in a form like this:
idpass weekofyear type frequency
39579449489 1 'single' 2
72986744521 2 'round' 3
72986744521 2 'pseudo' 1
11533148958 2 'pseudo' 1
Edit: this older question approaches a similar problem, but I don't know how to do it with pandas.
import pandas as pd
data = {"HoraDTRetirada": ["2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-12-31 09:20:00", "2016-02-28 12:11:00",
"2016-02-28 14:55:00", "2016-02-29 12:27:00", "2016-02-29 16:32:00"],
"idpass": ["39579449489", "39579449489", "72986744521", "72986744521", "11533148958", "72986744521",
"72986744521"],
"type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"]}
df = pd.DataFrame.from_dict(data)
print(df)
df["HoraDTRetirada"] = pd.to_datetime(df['HoraDTRetirada'])
df["week"] = df['HoraDTRetirada'].dt.strftime('%U')
k = df.groupby(["idpass", "week", "type"],as_index=False).count()
print(k)
Output:
HoraDTRetirada idpass type
0 2016-02-17 15:36:00 39579449489 single
1 2016-02-18 19:13:00 39579449489 single
2 2016-12-31 09:20:00 72986744521 pseudo
3 2016-02-28 12:11:00 72986744521 round
4 2016-02-28 14:55:00 11533148958 pseudo
5 2016-02-29 12:27:00 72986744521 round
6 2016-02-29 16:32:00 72986744521 round
idpass week type HoraDTRetirada
0 11533148958 09 pseudo 1
1 39579449489 07 single 2
2 72986744521 09 round 3
3 72986744521 52 pseudo 1
This is how I got what I was looking for:
Step 1 from suggested answers was skipped because timestamps were already in pandas datetime form.
Step 2: create column for week of year:
df['week'] = df['HoraDTRetirada'].dt.strftime('%U')
Step 3: group by user id, type and week, and count values with size()
df.groupby(['idpass','type','week']).size()
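If weeks should run Monday to Sunday, as in the example in the question, ISO week numbers may be a closer fit than '%U', which starts weeks on Sunday. Below is a minimal sketch on the question's sample rows, assuming pandas 1.1 or later for Series.dt.isocalendar(). The year and frequency column names are illustrative choices, not taken from the original posts.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "HoraDTRetirada": pd.to_datetime([
        "2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-02-26 09:20:00",
        "2016-02-27 12:11:00", "2016-02-27 14:55:00", "2016-02-28 12:27:00",
        "2016-02-28 16:32:00",
    ]),
    "idpass": ["39579449489", "39579449489", "72986744521", "72986744521",
               "11533148958", "72986744521", "72986744521"],
    "type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"],
})

# ISO calendar: weeks start on Monday. Grouping on the ISO year as well keeps
# week 1 of one year apart from week 1 of the next.
iso = df["HoraDTRetirada"].dt.isocalendar()
freq = (
    df.assign(year=iso["year"], weekofyear=iso["week"])
      .groupby(["idpass", "year", "weekofyear", "type"])
      .size()
      .reset_index(name="frequency")
)
print(freq)
```

Note that the week numbers differ from the '%U' ones whenever a Sunday is involved, because ISO weeks end on Sunday rather than start on it.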
My suggestion would be to do this:
Make sure your timestamp is a pandas datetime and add a frequency column:
df['HoraDTRetirada'] = pd.to_datetime(df['HoraDTRetirada'])
df['freq'] = 1
Group it and count
res = df.groupby(['idpass', 'type', pd.Grouper(key='HoraDTRetirada', freq='1W')]).count().reset_index()
Convert time to week of a year
res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
Final result looks like that:
EDIT:
You are right, in your case we should do step 3 before step 2. If you do that, remember that the groupby changes, so step 2 becomes:
df['HoraDTRetirada'] = df['HoraDTRetirada'].apply(lambda x: x.week)
and step 3:
res = df.groupby(['idpass', 'type', 'HoraDTRetirada']).count().reset_index()
It's a bit different because the "Hora" variable is not a time anymore, but just an int representing a week.
I have code as below. My questions:
Why is it assigning week 1 to both '2014-12-29' and '2014-1-1'? Why is it not assigning week 53 to '2014-12-29'?
How can I assign a week number that is continuously increasing? I want '2014-12-29' and '2015-1-1' to have week 53 and '2015-1-15' to have week 55, etc.
x=pd.DataFrame(data=['2014-1-1','2014-12-29','2015-1-1','2015-1-15'],columns=['date'])
x['week_number']=pd.DatetimeIndex(x['date']).week
As far as why the week number is 1 for 12/29/2014 -- see the question I linked to in the comments. For the second part of your question:
January 1, 2014 was a Wednesday. We can take the minimum date of your date column, get the day number and subtract from the difference:
Solution
# x["date"] = pd.to_datetime(x["date"]) # if not already a datetime column
min_date = x["date"].min().day + 1 # day number of the earliest date; + 1 because they're zero-indexed
x["weeks_from_start"] = ((x["date"].diff().dt.days.cumsum() - min_date) // 7 + 1).fillna(1).astype(int)
Output:
date weeks_from_start
0 2014-01-01 1
1 2014-12-29 52
2 2015-01-01 52
3 2015-01-15 54
Step by step
The first step is to convert the date column to the datetime type, if you haven't already:
In [3]: x.dtypes
Out[3]:
date object
dtype: object
In [4]: x["date"] = pd.to_datetime(x["date"])
In [5]: x
Out[5]:
date
0 2014-01-01
1 2014-12-29
2 2015-01-01
3 2015-01-15
In [6]: x.dtypes
Out[6]:
date datetime64[ns]
dtype: object
Next, we need to find the minimum of your date column and set that as the starting date day of the week number (adding 1 because the day number starts at 0):
In [7]: x["date"].min().day + 1
Out[7]: 2
Next, use the built-in .diff() function to take the differences of adjacent rows:
In [8]: x["date"].diff()
Out[8]:
0 NaT
1 362 days
2 3 days
3 14 days
Name: date, dtype: timedelta64[ns]
Note that we get NaT ("not a time") for the first entry -- that's because the first row has nothing to compare to above it.
The way to interpret these values is that row 1 is 362 days after row 0, and row 2 is 3 days after row 1, etc.
If you take the cumulative sum and subtract the starting day number, you'll get the days since the starting date, in this case 2014-01-01, as if the Wednesday was day 0 of that first week (this is because when we calculate the number of weeks since that starting date, we need to compensate for the fact that Wednesday was the middle of that week):
In [9]: x["date"].diff().dt.days.cumsum() - min_date
Out[9]:
0 NaN
1 360.0
2 363.0
3 377.0
Name: date, dtype: float64
Now when we take the floor division by 7, we'll get the correct number of weeks since the starting date:
In [10]: (x["date"].diff().dt.days.cumsum() - 2) // 7 + 1
Out[10]:
0 NaN
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Note that we add 1 because (I assume) you're counting from 1 -- i.e., 2014-01-01 is week 1 for you, and not week 0.
Finally, the .fillna is just to take care of that NaT (which turned into a NaN when we started doing arithmetic). You use .fillna(value) to fill NaNs with value:
In [11]: ((x["date"].diff().dt.days.cumsum() - 2) // 7 + 1).fillna(1)
Out[11]:
0 1.0
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Finally use .astype() to convert the column to integers instead of floats.
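An alternative sketch, not the method above: anchor the count on the Monday of the week that contains the earliest date, then floor-divide the day difference by 7. It assumes weeks start on Monday. The names week0_monday and continuous_week are made up for this illustration. On the question's four dates it yields weeks 1, 53, 53 and 55.

```python
import pandas as pd

x = pd.DataFrame({"date": pd.to_datetime(
    ["2014-01-01", "2014-12-29", "2015-01-01", "2015-01-15"])})

# Monday of the week containing the earliest date (weekday(): Monday == 0).
first = x["date"].min()
week0_monday = first - pd.Timedelta(days=first.weekday())

# Whole weeks elapsed since that Monday, counted from 1.
x["continuous_week"] = (x["date"] - week0_monday).dt.days // 7 + 1
print(x)
```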
My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4-weeks long, except every 3rd month is 5-weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4-weeks starting on Sunday results in the following. The first three month start dates are correct but I need every third month to be 5-weeks long. The 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports only yearly and quarterly 5253 (aka 4-4-5 calendar) offsets.
See pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So indeed you need some more work to get to week level and maintain a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill-in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    start -= offset_array[0]
    while(start<end):
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a regular frequency period, so it lets you move from date to date following the offsets in the given array:
offset_445 = [
pd.tseries.offsets.FY5253Quarter(weekday=6),
4*pd.tseries.offsets.Week(weekday=6),
4*pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
Once the index is created, then it's back to aggregations logic to get the data in the right row buckets. Assuming that you want the mean for the start of each 4 or 5 week period, according to the df_index_445 you have generated, it could look like this:
# calculate the mean on reindex groups
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# filter valid output
res = res[res.index>=0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
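If the start of the fiscal year is known, another option is to build the period start dates directly from the repeating 4-4-5 week pattern instead of going through FY5253Quarter. This is only a sketch under that assumption. period_starts_445 is a hypothetical helper, not a pandas function, and the data column is filled with ones so the result is easy to check.

```python
import numpy as np
import pandas as pd

def period_starts_445(first_start, n_periods):
    """Start dates of n_periods consecutive periods in a 4-4-5 week pattern."""
    weeks = np.tile([4, 4, 5], n_periods // 3 + 1)[:n_periods]
    # Offset of each period start, in weeks, from the first start date.
    week_offsets = np.concatenate(([0], np.cumsum(weeks)[:-1]))
    starts = pd.Timestamp(first_start) + pd.to_timedelta(week_offsets * 7, unit="D")
    return pd.DatetimeIndex(starts, name="date")

starts = period_starts_445("2020-03-29", 12)

df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame({"a": 1.0}, index=df_index)

# Position of the latest period start that is <= each daily timestamp.
bucket = starts.searchsorted(df.index, side="right") - 1
res = df.groupby(starts[bucket]).mean()
days_per_period = df.groupby(starts[bucket]).size()
print(days_per_period)
```

With the first period starting on 2020-03-29, the fourth period starts on 2020-06-28, which is the date the question asks for.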
I'm stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in a format %Y-%m
What I tried was
def parse(date):
    if len(date)<=5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date']= parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; values in a format like 2001-12-25 came out as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps,
assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps,
first we cast the regular datetimes to a new series called s
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in our series; these correspond to your dates that are missing a day.
We then apply pd.to_datetime again with the other format ('%Y%m') and use the result to fill the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
then we re-assign to your dataframe.
You could use a regex to pull out the year and month, and convert to datetime :
df = pd.read_clipboard(sep=r"\s{2,}", header=None, names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only the year and month were supplied.
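The question asked for strings in %Y-%m form rather than timestamps. One way to get there, sketched here with plain string operations, is to pull out the year and month with a regex and zero-pad the month. The column name Date follows the question. The pattern and the year_month column are assumptions for illustration, and the pattern expects a four-digit year at the start of every value.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2001-12-25", "2002-9-27", "2001-2-24",
                            "2001-5-3", "200510", "20078"]})

# Four-digit year, optional dash, then a one- or two-digit month.
parts = df["Date"].str.extract(r"^(?P<year>\d{4})-?(?P<month>\d{1,2})")

# Zero-pad the month so that "8" becomes "08".
df["year_month"] = parts["year"] + "-" + parts["month"].str.zfill(2)
print(df)
```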
UsageDate CustID1 CustID2 .... CustIDn
0 2018-01-01 00:00:00 1.095
1 2018-01-01 01:00:00 1.129
2 2018-01-01 02:00:00 1.165
3 2018-01-01 04:00:00 1.697
.
.
m 2018-31-01 23:00:00 1.835 (m,n)
The dataframe (df) has m rows and n columns. The index is an hourly time series that runs from the first hour of the month to the last hour of the month.
The columns are the customers, of which there are almost 100,000.
The values at each cell of Dataframe are energy consumption values.
For every customer, I need to calculate:
1) Mean of every hour usage - so basically average of 1st hour of every day in a month, 2nd hour of every day in a month etc.
2) Summation of usage of every customer
3) Top 3 usage hours - for a customer x, it can be "2018-01-01 01:00:00", "2018-11-01 05:00:00", "2018-21-01 17:00:00"
4) Bottom 3 usage hours - Similar explanation as above
5) Mean of usage for every customer in the month
My main point of trouble is how to aggregate data both for every customer and the hour of day, or day together.
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
    df_temp[col,"TotalUsage"] = df[col].apply.sum()
However, neither this nor the many variations of it that I tried solved the problem.
Please help me with an approach and how to think about such problems.
Also, since the dataframe is large, it would be helpful if we can talk about Computational Complexity and how can we decrease computation time.
This looks like a job for pandas.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour']=df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour=df.groupby('hour').mean()
2) Summation by user.
sum_by_user = df.drop(columns=['UsageDate', 'hour']).sum()
3) Top 3 usage hours by customer. I don't quite understand your desired output; you might be asking too many different questions in one question. If you want the hour and not the value, I think you may have to iterate through the columns. Adding an example may help.
4) Bottom 3 usage hours by customer. Same comment.
5) Mean by customer.
mean_by_cust = df.drop(columns=['UsageDate', 'hour']).mean()
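For the per-customer top and bottom hours, one possibility is Series.nlargest and Series.nsmallest on each customer column. The sketch below uses made-up random data with three customers and the timestamps as the index (the CustID names and the 72-hour range are assumptions). It is an illustration, not a tested answer for the 100,000-column frame.

```python
import numpy as np
import pandas as pd

# Made-up sample: 72 hourly readings for three customers.
rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-01", periods=72,
                    freq=pd.Timedelta(hours=1), name="UsageDate")
usage = pd.DataFrame(rng.random((72, 3)), index=idx,
                     columns=["CustID1", "CustID2", "CustID3"])

hourly_mean = usage.groupby(usage.index.hour).mean()  # 1) one row per hour of day
total_usage = usage.sum()                             # 2) one value per customer
mean_usage = usage.mean()                             # 5) one value per customer

# 3) and 4): timestamps of the three highest and lowest readings per customer.
top3 = {cust: usage[cust].nlargest(3).index.tolist() for cust in usage.columns}
bottom3 = {cust: usage[cust].nsmallest(3).index.tolist() for cust in usage.columns}
print(top3["CustID1"])
```

The column-wise reductions (sum, mean and the grouped mean) are vectorized. Only the top/bottom step loops over the columns in Python.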
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods= 72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72,5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
# print('Sample Data:')
# print(df.head())
# print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
# print('Mean Data:')
# print(mean_data.head())
# print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
# print('Top 3:')
# print(top_3_Usage_hours)
# print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
# print('Bottom 3:')
# print(bottom_3_Usage_hours)
out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For top and bottom 3 if you want to find the min sum across rows then:
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)
I have tried to calculate the number of business days between two dates (stored in separate columns of a dataframe).
MonthBegin MonthEnd
0 2014-06-09 2014-06-30
1 2014-07-01 2014-07-31
2 2014-08-01 2014-08-31
3 2014-09-01 2014-09-30
4 2014-10-01 2014-10-31
I have tried to apply numpy.busday_count but I get the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
I have tried to change the type into Timestamp as the following :
Timestamp('2014-08-31 00:00:00')
or datetime :
datetime.date(2014, 8, 31)
or to numpy.datetime64:
numpy.datetime64('2014-06-30T00:00:00.000000000')
Does anyone know how to fix it?
Note 1: I have tried np.busday_count in two ways:
1. Passing dataframe columns: t['Days']=np.busday_count(t.MonthBegin,t.MonthEnd)
2. Passing arrays: np.busday_count(dt1,dt2)
Note2: My dataframe has over 150K rows so I need to use an efficient algorithm
You can use bdate_range. I also corrected your input, since most of the MonthEnd values were earlier than MonthBegin.
[len(pd.bdate_range(x,y))for x,y in zip(df['MonthBegin'],df['MonthEnd'])]
Out[519]: [16, 21, 22, 23, 20]
I think the best way to do it is
df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
For my dataframe df as below:
MBegin MEnd
0 2011-01-01 2011-02-01
1 2011-01-10 2011-02-10
2 2011-01-02 2011-02-02
doing :
df['MBegin'] = df['MBegin'].values.astype('datetime64[D]')
df['MEnd'] = df['MEnd'].values.astype('datetime64[D]')
df['busday'] = df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
>>df
MBegin MEnd busday
0 2011-01-01 2011-02-01 21
1 2011-01-10 2011-02-10 23
2 2011-01-02 2011-02-02 22
You need to provide the template in which your dates are written.
a = datetime.strptime('2014-06-9', '%Y-%m-%d')
Do the same for your second date:
b = datetime.strptime('2014-06-30', '%Y-%m-%d')
Now their difference
c = b-a
will give you datetime.timedelta(21). To convert it into days, just use
c.days
which gives you the difference, 21 days. You can now use a list comprehension to get the difference between two dates as days.
You can modify your code to get the desired result as below:
df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01'],
'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30']})
df['MonthBegin'] = df['MonthBegin'].astype('datetime64[ns]')
df['MonthEnd'] = df['MonthEnd'].astype('datetime64[ns]')
df['BDays'] = np.busday_count(df['MonthBegin'].tolist(), df['MonthEnd'].tolist())
print(df)
MonthBegin MonthEnd BDays
0 2014-06-09 2014-06-30 15
1 2014-08-01 2014-08-31 21
2 2014-09-01 2014-09-30 21
3 2014-10-01 2014-10-31 22
4 2014-11-01 2014-11-30 20
Additionally, numpy.busday_count has a few other optional arguments, such as weekmask and holidays, which you can use as needed.
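The cast error quoted in the question comes from passing datetime64[ns] values where numpy expects datetime64[D]. Below is a sketch of a vectorized call, assuming both columns are already pandas datetimes. It converts the underlying arrays to day resolution first and avoids a Python-level loop, which should matter for 150K rows. Note that np.busday_count excludes the end date.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MonthBegin": pd.to_datetime(["2014-06-09", "2014-08-01", "2014-09-01",
                                  "2014-10-01", "2014-11-01"]),
    "MonthEnd": pd.to_datetime(["2014-06-30", "2014-08-31", "2014-09-30",
                                "2014-10-31", "2014-11-30"]),
})

# Cast from nanosecond to day resolution, which is what busday_count expects.
begin = df["MonthBegin"].to_numpy().astype("datetime64[D]")
end = df["MonthEnd"].to_numpy().astype("datetime64[D]")

# One call for the whole column; the end date itself is not counted.
df["BDays"] = np.busday_count(begin, end)
print(df)
```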