I am looking at shift data of a factory that works 24 hours a day. I want to group the data at each shift change, which is at 6:00 and 18:00. Up till now I have been trying to do it with:
Data_Frame.groupby([pd.Grouper(freq='12H')]).count()
However I have realised that since freq is set to 12H, it will always take a period of 12 hours including during daylight savings.
Unfortunately, the shift changes are always at 6:00 and 18:00 local time, even when the clocks change. That means there is one shift in the year that is 11 hours long and another that is 13 hours long, so for the middle of the year the grouping is off by 1 hour.
I feel that this is such a fundamental thing (daylight savings) that there should be some way of telling pandas that it needs to take account of daylight savings.
I have tried changing it from UTC to Europe/London however it still takes 12 hours periods.
Many Thanks
edit:
The only way I have found to do this is to split my data into 3 parts before using groupby (before the first clock change, between the two clock changes, and after the second clock change), use groupby on each part individually, then put them back together. This is irritating and tedious, so anything better than this would be hugely appreciated.
Hourly and 10 minute time-zone-aware time series' spanning spring dst change:
import numpy as np
import pandas as pd
ts_hrly = pd.date_range('03-10-2018', '3-13-2018', freq='H', tz='US/Eastern')
ts_10m = pd.date_range('03-10-2018', '3-13-2018', freq='10T', tz='US/Eastern')
Use the hourly data
ts = ts_hrly
df = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
The dst transition looks like this:
>>> df[18:23]
period tstamp
18 18 2018-03-11 00:00:00-05:00
19 19 2018-03-11 01:00:00-05:00
20 20 2018-03-11 03:00:00-04:00
21 21 2018-03-11 04:00:00-04:00
22 22 2018-03-11 05:00:00-04:00
>>>
To group into twelve-hour increments on 06:00 and 18:00 boundaries, I assigned each observation to a shift number and then grouped by the shift number.
My data conveniently starts at a shift change, so calculate the elapsed time since that first shift change:
nanosec = df['tstamp'].values - df['tstamp'].iloc[0].value
Find the shift changes and use np.cumsum() to assign shift numbers
shift_change = nanosec.astype(np.int64) % (3600 * 1e9 * 12) == 0
df['shift_nbr'] = shift_change.cumsum()
gb = df.groupby(df['shift_nbr'])
for k,g in gb:
    print(f'{k} has {len(g)} items')
>>>
1 has 12 items
2 has 12 items
3 has 12 items
4 has 12 items
5 has 12 items
6 has 12 items
I haven't found a way to compensate for data starting in the middle of a shift.
If you want the groups for shifts affected by dst changes to have 11 or 13 items, change the timezone aware series to a timezone naive series
df2 = pd.DataFrame({'tstamp':pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')),'period':range(len(ts))})
Use the same process to assign and group by shift numbers
nanosec = df2['tstamp'].values - df2['tstamp'].iloc[0].value
shift_change = nanosec.astype(np.int64) % (3600 * 1e9 * 12) == 0
df2['shift_nbr'] = shift_change.cumsum()
gb2 = df2.groupby(df2['shift_nbr'])
for k,g in gb2:
    print(f'{k} has {len(g)} items')
>>>
1 has 12 items
2 has 11 items
3 has 12 items
4 has 12 items
5 has 12 items
6 has 12 items
7 has 1 items
Unfortunately, pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')) takes some time. Here is a faster way that uses the hour attribute of the timestamps to find the shift changes. There is no need to create a separate timezone-naive series, because the hour attribute appears to report the local wall-clock hour regardless of the UTC offset. It also works for data starting in the middle of a shift.
ts = pd.date_range('01-01-2018 03:00', '01-01-2019 06:00', freq='H', tz='US/Eastern')
df3 = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
shift_change = ((df3['tstamp'].dt.hour - 6) % 12) == 0
shift_nbr = shift_change.cumsum()
gb3 = df3.groupby(shift_nbr)
print('gb3')
for k,g in gb3:
    if len(g) != 12:
        print(f"shift starting {g['tstamp'].iloc[0]} has {len(g)} items")
>>>
shift starting 2018-01-01 03:00:00-05:00 has 3 items
shift starting 2018-03-10 18:00:00-05:00 has 11 items
shift starting 2018-11-03 18:00:00-04:00 has 13 items
shift starting 2019-01-01 06:00:00-05:00 has 1 items
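As a variation on the approach above, the shift a row belongs to can also be labelled by its local start time instead of a running number. This is only a minimal sketch under my own assumptions: the sample range and the column name shift_start are made up for illustration, and the trick is to drop the UTC offset, move the 06:00/18:00 boundaries onto 00:00/12:00, floor to 12 hours, and move back.

```python
import pandas as pd

# Hypothetical sample: hourly, tz-aware readings spanning the 2018 spring DST change.
ts = pd.date_range("2018-03-10 06:00", "2018-03-12 05:00", freq="h", tz="US/Eastern")
df = pd.DataFrame({"tstamp": ts, "period": range(len(ts))})

# Drop the UTC offset but keep the local wall-clock time.
local = df["tstamp"].dt.tz_localize(None)

# Move the 06:00/18:00 boundaries onto 00:00/12:00, floor to 12 hours, then move back.
six_hours = pd.Timedelta(hours=6)
df["shift_start"] = (local - six_hours).dt.floor("12h") + six_hours

# Number of rows per shift, keyed by the local shift start.
sizes = df.groupby("shift_start").size()
print(sizes)
```

With this sample the shift starting 2018-03-10 18:00 should come out with 11 rows and the others with 12, which agrees with the spring result shown above.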
Related
My DataFrame looks like this:
id  date        value
1   2021-07-16  100
2   2021-09-15  20
1   2021-04-10  50
1   2021-08-27  30
2   2021-07-22  15
2   2021-07-22  25
1   2021-06-30  40
3   2021-10-11  150
2   2021-08-03  15
1   2021-07-02  90
I want to groupby the id, and return the difference of total value in a 90-days period.
Specifically, I want the values of last 90 days based on today, and based on 30 days ago.
For example, considering today is 2021-10-13, I would like to get:
the sum of all values per id between 2021-10-13 and 2021-07-15
the sum of all values per id between 2021-09-13 and 2021-06-15
And finally, subtract them to get the variation.
I've already managed to calculate it, by creating separated temporary dataframes containing only the dates in those periods of 90 days, grouping by id, and then merging these temp dataframes into a final one.
But I guess there should be an easier or simpler way to do it. I appreciate any help!
Btw, sorry if the explanation was a little messy.
If I understood correctly, you need something like this:
import pandas as pd
import datetime
## Calculate the dates that we are going to need.
today = datetime.datetime.now()
delta = datetime.timedelta(days = 120)
# Date 120 days ago
hundredTwentyDaysAgo = today - delta
delta = datetime.timedelta(days = 90)
# Date 90 days ago
ninetyDaysAgo = today - delta
delta = datetime.timedelta(days = 30)
# Date 30 days ago
thirtyDaysAgo = today - delta
## Initializing an example df.
df = pd.DataFrame({"id":[1,2,1,1,2,2,1,3,2,1],
"date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22", "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
"value": [100,20,50,30,15,25,40,150,15,90]})
## Casting date column
df['date'] = pd.to_datetime(df['date']).dt.date
grouped = df.groupby('id')
# Sum of last 90 days per id
ninetySum = grouped.apply(lambda x: x[x['date'] >= ninetyDaysAgo.date()]['value'].sum())
# Sum of last 90 days, starting from 30 days ago per id
hundredTwentySum = grouped.apply(lambda x: x[(x['date'] >= hundredTwentyDaysAgo.date()) & (x['date'] <= thirtyDaysAgo.date())]['value'].sum())
The output is
ninetySum - hundredTwentySum
id
1 -130
2 20
3 150
dtype: int64
You can double check to make sure these are the numbers you wanted by printing ninetySum and hundredTwentySum variables.
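Because the code above depends on the day it is run, here is a minimal sketch of the same calculation with "today" pinned to 2021-10-13, the date used in the question. The helper name window_sum is my own and not part of pandas; it filters with a boolean mask before grouping instead of using apply.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1, 1, 2, 2, 1, 3, 2, 1],
    "date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22",
             "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
    "value": [100, 20, 50, 30, 15, 25, 40, 150, 15, 90],
})
df["date"] = pd.to_datetime(df["date"])

# Assumption: "today" is fixed so the result is reproducible.
today = pd.Timestamp("2021-10-13")

def window_sum(frame, end, days=90):
    """Sum of value per id for dates in [end - days, end], both ends included."""
    start = end - pd.Timedelta(days=days)
    mask = frame["date"].between(start, end)
    return frame.loc[mask].groupby("id")["value"].sum()

recent = window_sum(df, today)                             # 2021-07-15 .. 2021-10-13
earlier = window_sum(df, today - pd.Timedelta(days=30))    # 2021-06-15 .. 2021-09-13

# fill_value=0 keeps ids that appear in only one of the two windows.
variation = recent.sub(earlier, fill_value=0).astype(int)
print(variation)
```

For this data the variation should be -130, 20 and 150 for ids 1, 2 and 3, the same numbers as in the output above.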
I’m trying to look at some sales data for a small store. I have a timestamp of when the settlement was made, but sometimes it’s done before midnight and sometimes it’s done after midnight.
This is giving me data correct for some days and incorrect for others, as anything after midnight should be for the day before. I couldn’t find the correct pandas documentation for what I’m looking for.
Is there an if else solution to create a new column, loop through the NEW_TIMESTAMP column and set a custom timeframe (if after midnight, but before 3pm: set the day before ; else set the day). Every time I write something it either runs forever, or it crashes jupyter.
What I did is create another series that says when a day should be offset back by one day, and multiply it by a pd.Timedelta object, such that 0 turns into "0 days" and 1 turns into "1 day". Subtracting the two series gives the right result.
Let me know how the following code works for you.
import pandas as pd
import numpy as np
# copied from https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
def random_dates(start, end, n=15):
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dates = random_dates(start=pd.to_datetime('2020-01-01'),
end=pd.to_datetime('2021-01-01'))
timestamps = pd.Series(dates)
# this takes only the hour component of every datetime
hours = timestamps.dt.hour
# this takes only the date component of every datetime
dates = timestamps.dt.date
# this compares the hours with 15, and returns True where the hour is smaller
flag_is_day_before = hours < 15
# now you can set the dates by multiplying the 1s and 0s with a day timedelta
new_dates = dates - pd.to_timedelta(1, unit='day') * flag_is_day_before
df = pd.DataFrame(data=dict(timestamps=timestamps, new_dates=new_dates))
print(df)
This outputs
timestamps new_dates
0 2020-07-10 20:11:13 2020-07-10
1 2020-05-04 01:20:07 2020-05-03
2 2020-03-30 09:17:36 2020-03-29
3 2020-06-01 16:16:58 2020-06-01
4 2020-09-22 04:53:33 2020-09-21
5 2020-08-02 20:07:26 2020-08-02
6 2020-03-22 14:06:53 2020-03-21
7 2020-03-14 14:21:12 2020-03-13
8 2020-07-16 20:50:22 2020-07-16
9 2020-09-26 13:26:55 2020-09-25
10 2020-11-08 17:27:22 2020-11-08
11 2020-11-01 13:32:46 2020-10-31
12 2020-03-12 12:26:21 2020-03-11
13 2020-12-28 08:04:29 2020-12-27
14 2020-04-06 02:46:59 2020-04-05
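The same rule can also be written without the intermediate flag. This is just a sketch of an alternative, assuming the 15:00 cutoff from the code above: subtracting 15 hours pushes anything before 3pm onto the previous calendar day, and normalize() then drops the time of day. The sample timestamps are four rows copied from the output above, and the name business_day is my own.

```python
import pandas as pd

timestamps = pd.Series(pd.to_datetime([
    "2020-07-10 20:11:13",   # after 3pm  -> same day
    "2020-05-04 01:20:07",   # after midnight -> previous day
    "2020-06-01 16:16:58",   # after 3pm  -> same day
    "2020-03-22 14:06:53",   # before 3pm -> previous day
]))

cutoff = pd.Timedelta(hours=15)

# Shift back by the cutoff, then truncate each timestamp to midnight.
business_day = (timestamps - cutoff).dt.normalize()

df = pd.DataFrame({"timestamps": timestamps, "business_day": business_day})
print(df)
```

Unlike dt.date, this keeps the new column as a datetime64 column, which is usually more convenient for a later groupby or resample.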
I have a big dataframe consisting of 600 days' worth of data. Each day has 100 timestamps. I have a separate list of 30 days whose data I want to remove. How do I remove the data for these 30 days from the dataframe?
I tried a for loop, but it did not work. I know there is a simple method. But I don't know how to implement it.
df  # main dataframe, which has many columns and rows. Index is a timestamp.
df['dates'] = df.index.strftime('%Y-%m-%d')  # date part of the timestamp is sliced and
# a new column is created. Instead of the index, I want to use this column for comparing with the bad list.
bad_list  # a list of bad dates
for i in range(0,len(df)):
    for j in range(0,len(bad_list)):
        if str(df['dates'][i])== bad_list[j]:
            df.drop(df[i].index,inplace=True)
You can do the following
df['dates'] = df.index.strftime('%Y-%m-%d')
# bad_list must hold strings in the same '%Y-%m-%d' format as df['dates'].
newdf = df[~df['dates'].isin(bad_list)]
# isin() flags the rows whose date is in the list; the ~ negates it, so only rows "not in" the list are kept.
# For example, if Jan 1, 2000 is a bad date, it should be in the list as '2000-01-01'.
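If you would rather not add a string column, a similar filter can be built directly on the DatetimeIndex. A minimal sketch, assuming a small made-up frame and a bad_list of date strings: normalize() truncates each timestamp to midnight so it can be compared against the parsed bad dates.

```python
import pandas as pd

# Hypothetical frame: two readings per day for three days.
idx = pd.date_range("2021-01-01", periods=6, freq="12h")
df = pd.DataFrame({"x": range(6)}, index=idx)

bad_list = ["2021-01-02"]            # days to drop
bad_days = pd.to_datetime(bad_list)  # parsed to midnight timestamps

# True for rows whose calendar day is NOT in bad_list.
keep = ~df.index.normalize().isin(bad_days)
clean = df[keep]
print(clean)
```

Here both readings from 2021-01-02 are dropped and the other four rows are kept.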
You can perform simple comparison:
>>> from time import time
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.Series(pd.to_datetime(np.random.randint(int(time()) - 60 * 60 * 24 * 5, int(time()), 12), unit='s'))
>>> dates
0 2019-03-19 05:25:32
1 2019-03-20 00:58:29
2 2019-03-19 01:03:36
3 2019-03-22 11:45:24
4 2019-03-19 08:14:29
5 2019-03-21 10:17:13
6 2019-03-18 09:09:15
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
9 2019-03-23 06:19:35
10 2019-03-23 05:42:34
11 2019-03-21 11:37:46
>>> start_date = pd.to_datetime('2019-03-20')
>>> end_date = pd.to_datetime('2019-03-22')
>>> dates[(dates > start_date) & (dates < end_date)]
1 2019-03-20 00:58:29
5 2019-03-21 10:17:13
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
11 2019-03-21 11:37:46
If your source Series is not in datetime format, then you will need to use pd.to_datetime to convert it.
I have a dataframe of DateTime (index) and a sampling of power usage:
DateTime Usage
01-Jan-17 12am 10
01-Jan-17 3am 5
01-Jan-17 6am 15
01-Jan-17 9am 40
01-Jan-17 12pm 60
01-Jan-17 3pm 62
01-Jan-17 6pm 45
01-Jan-17 9pm 18
02-Jan-17 12am 11
02-Jan-17 3am 4
02-Jan-17 6am 17
02-Jan-17 9am 37
02-Jan-17 12pm 64
02-Jan-17 3pm 68
02-Jan-17 6pm 41
02-Jan-17 9pm 16
In reality, this series is much longer. I am trying to compare day-over-day time periods, so that I can look at the daily seasonality of the time series. Is there a way in pandas to split the data such that you can compare these time series? I'd imagine the resulting DataFrame would look something like:
Time 1-Jan 2-Jan
12am 10 11
3am 5 4
6am 15 17
9am 40 37
12pm 60 64
3pm 62 68
6pm 45 41
9pm 18 16
Thanks!
Assuming you have DateTime as str data type, you can split it into Date and Time and then pivot it:
df[['Date', 'Time']] = df.DateTime.str.split(" ", expand=True)
df1 = df.pivot(index="Time", columns="Date", values="Usage").reset_index()
How do you sort the Time column? It is not so straightforward. We need to extract a few columns from Time: the hour, the AM/PM indicator, and whether the hour is 12, since 12 should be placed above all other hours:
# use regex to extract Hour (numeric part of Time) and AM/PM indicator
hourInd = df1.Time.str.extract(r"(?P<Hour>\d+)(?P<Ind>[pa]m)", expand=True)
# convert the hour column to integer and create another column to check if hour is 12
# then sort by AM/PM indicator, IsTwelve and Hour and get the index to reorder the original
# data frame
df1.loc[(hourInd.assign(Hour = hourInd.Hour.astype(int), IsTwelve = hourInd.Hour != "12")
.sort_values(["Ind", "IsTwelve", "Hour"]).index)]
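If DateTime is a real datetime index rather than strings (the question describes it as the index), the sorting step may not be needed at all. The following is only a sketch under that assumption, rebuilding the sample data with a 3-hour date_range: pivoting on the time of day and the date gives rows that already sort chronologically.

```python
import pandas as pd

# Assumption: the index is a DatetimeIndex with one sample every 3 hours.
idx = pd.date_range("2017-01-01", periods=16, freq="3h")
usage = [10, 5, 15, 40, 60, 62, 45, 18, 11, 4, 17, 37, 64, 68, 41, 16]
df = pd.DataFrame({"Usage": usage}, index=idx)

# One row per time of day, one column per calendar date.
wide = (
    df.assign(Date=df.index.date, Time=df.index.time)
      .pivot(index="Time", columns="Date", values="Usage")
)
print(wide)
```

The row labels are datetime.time objects, so 00:00 comes first and 21:00 last without any special handling for 12am or 12pm.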
I can create an hour of day column in Pandas using data['hod'] = [r.hour for r in data.index] which is really useful for groupby related analysis. However, I would like to be able to create a similar column for 1 hour intervals starting at 09:30 instead of 09:00. So the column values would be 09:30-10:30, 10:30-11:30 etc.
The aim is to be able to groupby these values in order to gain stats for the time period.
Using data as follows. I already added hour of day, day of week etc, I just need the same for time sliced from 09:30 onwards in one hour intervals:
data['2008-05-06 09:00:00':].head()
Open High Low Last Volume hod dow dom minute
Timestamp
2008-05-06 09:00:00 1399.50 1399.50 1399.25 1399.50 4 9 1 6 0
2008-05-06 09:01:00 1399.25 1399.75 1399.25 1399.50 5 9 1 6 1
2008-05-06 09:02:00 1399.75 1399.75 1399.00 1399.50 19 9 1 6 2
2008-05-06 09:03:00 1399.50 1399.75 1398.50 1398.50 37 9 1 6 3
2008-05-06 09:04:00 1398.75 1399.00 1398.75 1398.75 15 9 1 6 4
I assumed that when you start from the half point of each hour, you divide a day into 25 sections instead of 24. Here is how I labelled those sections: Section -1: [0:00, 0:29], Section 0: [0:30, 1:29], Section 1: [1:30, 2:29] ... Section 22: [22:30, 23:29] and Section 23: [23:30, 23:59], where the first and last sections are half an hour long.
And here is an implementation with pandas
import pandas as pd
import numpy as np
def shifted_hour_of_day(ts, beginning_of_hour=0):
    shift = pd.Timedelta('%dmin' % (beginning_of_hour))
    ts_shifted = ts - pd.Timedelta(shift)
    hour = ts_shifted.hour
    if ts_shifted.day != ts.day: # we shifted these timestamps to yesterday
        hour = -1 # label the first section as -1
    return hour
# Generate random data
timestamps = pd.date_range('2008-05-06 00:00:00', '2008-05-07 00:00:00', freq='10min')
vals = np.random.rand(len(timestamps))
df = pd.DataFrame(index=timestamps, data={'value': vals})
df.loc[:, 'hod'] = [r.hour for r in df.index]
# Test shifted_hour_of_day
df.loc[:, 'hod2'] = [shifted_hour_of_day(r, beginning_of_hour=20) for r in df.index]
df.head(20)
Now you can groupby this DataFrame on 'hod2'.
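A vectorized variation, offered as a sketch rather than a drop-in replacement: subtract the 30-minute offset, floor to the hour, and add the offset back, which labels each row with the start of its 09:30-style bucket. The sample frame, the Volume values and the column name bucket are made up for illustration. Note that, unlike the function above, rows before 00:30 would be labelled 23:30 instead of -1.

```python
import pandas as pd

# Hypothetical minute bars from 09:00 to 10:59.
idx = pd.date_range("2008-05-06 09:00", periods=120, freq="min")
data = pd.DataFrame({"Volume": range(120)}, index=idx)

offset = pd.Timedelta(minutes=30)

# Start of the one-hour bucket that begins on the half hour.
bucket_start = (data.index - offset).floor("h") + offset
data["bucket"] = bucket_start.strftime("%H:%M")

counts = data.groupby("bucket").size()
print(counts)
```

In this sample the rows from 09:00 to 09:29 fall into the 08:30 bucket, and 09:30 to 10:29 form one full bucket of 60 rows.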