I have a dataframe of DateTime (index) and a sampling of power usage:
DateTime Usage
01-Jan-17 12am 10
01-Jan-17 3am 5
01-Jan-17 6am 15
01-Jan-17 9am 40
01-Jan-17 12pm 60
01-Jan-17 3pm 62
01-Jan-17 6pm 45
01-Jan-17 9pm 18
02-Jan-17 12am 11
02-Jan-17 3am 4
02-Jan-17 6am 17
02-Jan-17 9am 37
02-Jan-17 12pm 64
02-Jan-17 3pm 68
02-Jan-17 6pm 41
02-Jan-17 9pm 16
In reality, this series is much longer. I am trying to compare day-over-day time periods so that I can look at the daily seasonality of the time series. Is there a way in pandas to split the data so that these time series can be compared? I'd imagine the resulting DataFrame would look something like:
Time 1-Jan 2-Jan
12am 10 11
3am 5 4
6am 15 17
9am 40 37
12pm 60 64
3pm 62 68
6pm 45 41
9pm 18 16
Thanks!
Assuming you have DateTime as str data type, you can split it into Date and Time and then pivot it:
df[['Date', 'Time']] = df.DateTime.str.split(" ", expand=True)
df1 = df.pivot(index="Time", columns="Date", values="Usage").reset_index()
How do you sort the Time column? It's actually not so straightforward. To do this, we need to extract some helper columns from Time: the hour, the AM/PM indicator, and a flag for whether the hour is 12, since 12 should be placed above all other hours:
# use a raw-string regex to extract Hour (the numeric part of Time) and the am/pm indicator
hourInd = df1.Time.str.extract(r"(?P<Hour>\d+)(?P<Ind>[pa]m)", expand=True)
# convert the Hour column to integer and add a flag for whether the hour is 12,
# then sort by am/pm indicator, the IsTwelve flag and Hour, and use the sorted
# index to reorder the original data frame
df1.loc[(hourInd.assign(Hour=hourInd.Hour.astype(int), IsTwelve=hourInd.Hour != "12")
                .sort_values(["Ind", "IsTwelve", "Hour"]).index)]
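If you would rather avoid the string gymnastics, an alternative sketch (assuming the timestamps parse as day-month-year with a 12-hour clock, as in the sample) is to convert DateTime into real timestamps first; datetime.time values sort chronologically on their own, so no custom ordering is needed:
import pandas as pd

# parse '01-Jan-17 12am' style strings into real timestamps
ts = pd.to_datetime(df.DateTime, format='%d-%b-%y %I%p')

# pivot on the date/time parts; the Time index then sorts correctly by itself
out = (df.assign(Date=ts.dt.date, Time=ts.dt.time)
         .pivot(index='Time', columns='Date', values='Usage'))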
I am trying to convert this series of data into a dataframe using as_index=False inside the groupby method. My goal is to show the total count for each month and weekday.
My data
This is my main data, uber_15.
Dispatching Pickup_date Affiliated locationID month weekDay day hour minute
0 B02617 2015-05-17 09:47:00 B02617 141 5 Sunday 17 9 47
1 B02617 2015-05-17 09:47:00 B02617 65 5 Sunday 17 9 47
From this I am extracting month and weekDay.
temp = uber_15.groupby(['month', "weekDay"]).size()
Next, I am converting this series to a dataframe using as_index=False.
temp = uber_15.groupby(['month', "weekDay"], as_index=False).size()
But the result is the same when I use as_index=False; it simply has no effect.
I also tried searching online and found reset_index, but with that the count column gets the header "0" when it was supposed to be named 'size'.
temp = uber_15.groupby(['month', "weekDay"]).size().reset_index()
This is the goal I am trying to achieve.
This is the output I am getting.
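In case it helps, a minimal sketch of the usual fix (assuming a reasonably recent pandas): Series.reset_index accepts a name for the values column, so the stray "0" header can be set directly:
temp = uber_15.groupby(['month', 'weekDay']).size().reset_index(name='size')
On pandas 1.1 and later, groupby(['month', 'weekDay'], as_index=False).size() should also return a DataFrame with a 'size' column directly; if it does not, the installed pandas is probably older.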
I am looking for the week start date for the entire data frame, in dd-mm-yyyy format.
Week numbers below (src_data['WEEK']):
28
29
30
31
32
33
34
35
Code:
src_data['firstdayofweek'] = datetime.datetime.strptime(f'{2020}-W{int(src_data['WEEK'] )- 1}-1','%Y-W%W-%w').date()
Output:
Thanks in advance
You can add a year and a weekday as strings and parse with to_datetime using the ISO week-date directives. If desired, convert back to string with strftime:
src_data = pd.DataFrame({'WEEK':[28,29,30,31,32,33,34,35]})
year, weekday = '2020', '1'
src_data['DATE'] = pd.to_datetime(year + src_data['WEEK'].astype(str) + weekday,
format='%G%V%u').dt.strftime('%d-%m-%Y')
# src_data
# WEEK DATE
# 0 28 06-07-2020
# 1 29 13-07-2020
# 2 30 20-07-2020
# 3 31 27-07-2020
# 4 32 03-08-2020
# 5 33 10-08-2020
# 6 34 17-08-2020
# 7 35 24-08-2020
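For reference, the original snippet fails because datetime.strptime expects a single string, not a whole Series (and nesting the same quote character inside the f-string is a syntax error on older Python versions). If you prefer the standard-library route, a sketch applying it element-wise with the ISO directives (assumes Python 3.6+ for %G/%V/%u):
import datetime

src_data['firstdayofweek'] = src_data['WEEK'].apply(
    lambda w: datetime.datetime.strptime(f'2020-W{w}-1', '%G-W%V-%u')
                               .strftime('%d-%m-%Y')
)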
I am looking at shift data of a factory that works 24 hours a day. I want to group the data at each shift change, which is at 6:00 and 18:00. Up till now I have been trying to do it with:
Data_Frame.groupby([pd.Grouper(freq='12H')]).count()
However, I have realised that since freq is set to 12H, it will always take a fixed 12-hour period, including across daylight-saving changes.
The shift changes stay at 6:00 and 18:00 even when the clocks change, which means that in reality one shift in the year is 11 hours long and another is 13 hours long, so for the middle of the year the grouping is off by one hour.
I feel that this is such a fundamental thing (daylight savings) that there should be some way of telling pandas that it needs to take account of daylight savings.
I have tried changing the timezone from UTC to Europe/London, however it still takes 12-hour periods.
Many Thanks
edit:
The only way I have found to do this is to split my data into three parts before using groupby (before the first clock change, between the clock changes, and after the second clock change), use groupby on each individually, then put them back together. But this is irritating and tedious, so anything better than this is hugely appreciated.
Hourly and 10-minute timezone-aware time series spanning the spring DST change:
import numpy as np
import pandas as pd

ts_hrly = pd.date_range('03-10-2018', '3-13-2018', freq='H', tz='US/Eastern')
ts_10m = pd.date_range('03-10-2018', '3-13-2018', freq='10T', tz='US/Eastern')
Use the hourly data
ts = ts_hrly
df = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
The DST transition looks like this:
>>> df[18:23]
period tstamp
18 18 2018-03-11 00:00:00-05:00
19 19 2018-03-11 01:00:00-05:00
20 20 2018-03-11 03:00:00-04:00
21 21 2018-03-11 04:00:00-04:00
22 22 2018-03-11 05:00:00-04:00
>>>
To group into twelve-hour increments on 06:00 and 18:00 boundaries, I assigned each observation to a shift number, then grouped by the shift number.
My data conveniently starts at a shift change, so calculate the elapsed time since that first shift change:
nanosec = df['tstamp'].values - df['tstamp'].iloc[0].value
Find the shift changes and use np.cumsum() to assign shift numbers
shift_change = nanosec.astype(np.int64) % (3600 * 1e9 * 12) == 0
df['shift_nbr'] = shift_change.cumsum()
gb = df.groupby(df['shift_nbr'])
for k, g in gb:
    print(f'{k} has {len(g)} items')
>>>
1 has 12 items
2 has 12 items
3 has 12 items
4 has 12 items
5 has 12 items
6 has 12 items
I haven't found a way to compensate for data starting in the middle of a shift.
If you want the groups for shifts affected by DST changes to have 11 or 13 items, change the timezone-aware series to a timezone-naive series:
df2 = pd.DataFrame({'tstamp':pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')),'period':range(len(ts))})
Use the same process to assign and group by shift numbers:
nanosec = df2['tstamp'].values - df2['tstamp'].iloc[0].value
shift_change = nanosec.astype(np.int64) % (3600 * 1e9 * 12) == 0
df2['shift_nbr'] = shift_change.cumsum()
gb2 = df2.groupby(df2['shift_nbr'])
for k, g in gb2:
    print(f'{k} has {len(g)} items')
>>>
1 has 12 items
2 has 11 items
3 has 12 items
4 has 12 items
5 has 12 items
6 has 12 items
7 has 1 items
Unfortunately, pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')) takes some time. Here is a faster/better way to do it, using the hour attribute of the timestamps to calculate elapsed hours. There is no need to create a separate timezone-naive series, because the hour attribute reflects local wall-clock time. It also works for data starting in the middle of a shift.
ts = pd.date_range('01-01-2018 03:00', '01-01-2019 06:00', freq='H', tz='US/Eastern')
df3 = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
shift_change = ((df3['tstamp'].dt.hour - 6) % 12) == 0
shift_nbr = shift_change.cumsum()
gb3 = df3.groupby(shift_nbr)
for k, g in gb3:
    if len(g) != 12:
        print(f'shift starting {g["tstamp"].iloc[0]} has {len(g)} items')
>>>
shift starting 2018-01-01 03:00:00-05:00 has 3 items
shift starting 2018-03-10 18:00:00-05:00 has 11 items
shift starting 2018-11-03 18:00:00-04:00 has 13 items
shift starting 2019-01-01 06:00:00-05:00 has 1 items
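For what it's worth, on pandas 1.1+ there is a shortcut that avoids the cumsum bookkeeping (a sketch, assuming the same df3 as above and that your pandas version supports the offset argument of pd.Grouper): dropping the timezone makes the timestamps read as local wall-clock time, and pd.Grouper can then cut directly on 06:00/18:00 boundaries:
# tz_localize(None) keeps the wall-clock reading, so the two DST-affected
# shifts naturally come out 11 and 13 hours long
wall = df3.assign(tstamp=df3['tstamp'].dt.tz_localize(None))
gb4 = wall.groupby(pd.Grouper(key='tstamp', freq='12H', offset='6H'))
for k, g in gb4:
    if len(g) != 12:
        print(f'shift starting {k} has {len(g)} items')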
Suppose I have a dataframe, where the rows are indexed by trading days, so something like:
Date ClosingPrice
2017-3-16 10.00
2017-3-17 10.13
2017-3-20 10.19
...
I want to find the $N$ rows starting at (say) 2017-2-28; I don't know the date range in advance, I just know that I want the next ten rows from that point. What is the most elegant way of doing this? (There are plenty of ugly ways...)
my quick answer
s = df.Date.searchsorted(pd.to_datetime('2017-2-28'))
df.iloc[s:s + 10]
demo
df = pd.DataFrame(dict(
    Date=pd.date_range('2017-01-31', periods=90, freq='B'),
    ClosingPrice=np.random.rand(90)
)).iloc[:, ::-1]
date = pd.to_datetime('2017-3-11')
s = df.Date.searchsorted(date)
df.iloc[s:s + 10]
Date ClosingPrice
29 2017-03-13 0.737527
30 2017-03-14 0.411525
31 2017-03-15 0.794309
32 2017-03-16 0.578911
33 2017-03-17 0.747763
34 2017-03-20 0.081113
35 2017-03-21 0.000058
36 2017-03-22 0.274022
37 2017-03-23 0.367831
38 2017-03-24 0.100930
naive time test
df[df['Date'] >= pd.Timestamp(2017, 2, 28)][:10]
I guess?
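Another option, a sketch assuming the Date column is sorted: set it as the index and use partial string slicing, which also works when the start date itself is not a trading day:
df.set_index('Date').loc['2017-02-28':].iloc[:10]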
I have an hourly dataframe in the following format over several years:
Date/Time Value
01.03.2010 00:00:00 60
01.03.2010 01:00:00 50
01.03.2010 02:00:00 52
01.03.2010 03:00:00 49
.
.
.
31.12.2013 23:00:00 77
I would like to average the data so I can get the average of hour 0, hour 1... hour 23 of each of the years.
So the output should look somehow like this:
Year Hour Avg
2010 00 63
2010 01 55
2010 02 50
.
.
.
2013 22 71
2013 23 80
Does anyone know how to obtain this in pandas?
Note: Now that Series have the dt accessor it's less important that date is the index, though Date/Time still needs to be a datetime64.
Update: You can do the groupby more directly (without the lambda):
In [21]: df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
Out[21]:
Value
Date/Time Date/Time
2010 0 60
1 50
2 52
3 49
In [22]: res = df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
In [23]: res.index.names = ["year", "hour"]
In [24]: res
Out[24]:
Value
year hour
2010 0 60
1 50
2 52
3 49
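If you want the flat Year / Hour / Avg layout from the question rather than a MultiIndex, a small additional sketch (assuming the same df):
dt = df["Date/Time"].dt
res = (df.groupby([dt.year.rename("Year"), dt.hour.rename("Hour")])["Value"]
         .mean()
         .reset_index(name="Avg"))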
If it's a datetime64 index you can do:
In [31]: df1.groupby([df1.index.year, df1.index.hour]).mean()
Out[31]:
Value
2010 0 60
1 50
2 52
3 49
Old answer (will be slower):
Assuming Date/Time was the index* you can use a mapping function in the groupby:
In [11]: year_hour_means = df1.groupby(lambda x: (x.year, x.hour)).mean()
In [12]: year_hour_means
Out[12]:
Value
(2010, 0) 60
(2010, 1) 50
(2010, 2) 52
(2010, 3) 49
For a more useful index, you could then create a MultiIndex from the tuples:
In [13]: year_hour_means.index = pd.MultiIndex.from_tuples(year_hour_means.index,
names=['year', 'hour'])
In [14]: year_hour_means
Out[14]:
Value
year hour
2010 0 60
1 50
2 52
3 49
* if not, then first use set_index:
df1 = df.set_index('Date/Time')
If your date/time column is in datetime format (see dateutil.parser for automatic parsing options) and set as the index, you can use pandas resample as below:
hourly_means = df.resample('H').mean()
which will keep your data in the datetime format. This may help you with whatever you are going to be doing with your data down the line.
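Note that this gives one mean per calendar hour of data rather than the hour-of-day averages asked for; resampling is more useful as a cleanup step before a groupby like the ones above. A sketch, assuming a datetime64 index:
hourly = df.resample('H').mean()
year_hour_means = hourly.groupby([hourly.index.year, hourly.index.hour]).mean()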