I would like to analyze a dataframe with hourly data for several days, e.g. df:
DATE TIME Threshold Value
2022-11-04 02:00:00 10 9
2022-11-04 03:00:00 11 10
2022-11-04 04:00:00 10 11
2022-11-04 06:00:00 12 11
2022-11-04 05:00:00 12 12
2022-11-04 07:00:00 10 11
2022-11-04 08:00:00 11 10
2022-11-04 09:00:00 11 9
2022-11-04 10:00:00 12 9
2022-11-04 11:00:00 10 10
2022-11-04 12:00:00 10 10
...
2022-11-05 01:00:00 10 9
2022-11-05 02:00:00 11 10
...
Now I would like to examine the data based on threshold/value and time.
Let's say I am interested in the Value at time "08:00:00" if the Threshold at the preceding time "04:00:00" was 10. To find possible patterns, I might also look at other combinations in the future.
My approach was:
Create a new dataframe df_2 with all slices of 04:00:00 and Threshold = 10
Create a new dataframe df_3 with all slices of 08:00:00
Merge df_2 and df_3 and select only rows where a 04:00:00 entry of the same day precedes an 08:00:00 entry.
This seems to be a bit cumbersome and I was wondering if there was a more practical way to do this.
Maybe someone could suggest a more efficient way?
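In code, my approach would look roughly like this (a sketch, assuming the column layout shown above and that DATE and TIME are plain strings):
import pandas as pd

# step 1: 04:00:00 rows whose Threshold is 10
df_2 = df[(df['TIME'] == '04:00:00') & (df['Threshold'] == 10)]
# step 2: all 08:00:00 rows
df_3 = df[df['TIME'] == '08:00:00']
# step 3: merging on DATE keeps only days where a qualifying 04:00:00 row
# exists for the same day as the 08:00:00 row
df_merged = pd.merge(df_2, df_3, on='DATE', suffixes=('_04', '_08'))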
First, make a DatetimeIndex from the DATE and TIME columns:
date_idx = df.iloc[:, :2].astype('str').apply(lambda x: pd.to_datetime(' '.join(x)), axis=1)
Then make a new column holding the Threshold from 4 hours earlier and assign the result to df1:
df1 = (df.set_index(date_idx)
         .drop(['DATE', 'TIME'], axis=1)
         .sort_index()
         .assign(new=lambda d: d.shift(freq='4H')['Threshold']))  # Threshold from 4 hours before
Output (df1):
Threshold Value new
2022-11-04 02:00:00 10 9 NaN
2022-11-04 03:00:00 11 10 NaN
2022-11-04 04:00:00 10 11 NaN
2022-11-04 05:00:00 12 12 NaN
2022-11-04 06:00:00 12 11 10.0
2022-11-04 07:00:00 10 11 11.0
2022-11-04 08:00:00 11 10 10.0
2022-11-04 09:00:00 11 9 12.0
2022-11-04 10:00:00 12 9 12.0
2022-11-04 11:00:00 10 10 10.0
2022-11-04 12:00:00 10 10 11.0
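As an aside, the index can also be built without apply, which is usually faster (a sketch assuming DATE and TIME are strings):
date_idx = pd.to_datetime(df['DATE'].astype(str) + ' ' + df['TIME'].astype(str))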
Filter the data at 08:00:00:
df1.at_time('08:00')
Output:
Threshold Value new
2022-11-04 08:00:00 11 10 10.0
Then check or filter on the Value and new columns.
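For example, to keep only the 08:00:00 rows whose preceding 04:00:00 Threshold was 10 (a small sketch building on df1 above):
df1.at_time('08:00').query('new == 10')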
Here is one way to do it:
out = (df.loc[
    df['TIME'].isin(['04:00:00', '08:00:00']) &   # rows where time is 04:00 or 08:00
    df['DATE'].isin(                              # and the date is one where
        df.loc[df['TIME'].eq('04:00:00') &        # time is 04:00:00
               df['Threshold'].eq(10)]['DATE'])   # and Threshold is 10
])
out
DATE TIME Threshold Value
2 2022-11-04 04:00:00 10 11
6 2022-11-04 08:00:00 11 10
Alternatively, the same as above, but selecting only rows where the time equals 08:00:00:
out = (df.loc[
    df['TIME'].isin(['08:00:00']) &
    df['DATE'].isin(
        df.loc[df['TIME'].eq('04:00:00') &
               df['Threshold'].eq(10)]['DATE'])
])
out
DATE TIME Threshold Value
6 2022-11-04 08:00:00 11 10
I have a dataframe with "normal" steps of two hours between the timestamps, but unfortunately there are sometimes gaps in my data. Because of that, I would like to round timestamps with odd hours (01:00, 03:00, etc.) to even hours (02:00, 04:00, etc.). Time is my index column.
My dataframe looks like this:
Time Values
2021-10-24 22:00:00 2
2021-10-25 00:00:00 4
2021-10-25 02:00:00 78
2021-10-25 05:00:00 90
2021-10-25 07:00:00 1
How can I get a dataframe like this?
Time Values
2021-10-24 22:00:00 2
2021-10-25 00:00:00 4
2021-10-25 02:00:00 78
2021-10-25 06:00:00 90
2021-10-25 08:00:00 1
Use DatetimeIndex.floor or DatetimeIndex.ceil with the frequency string '2H', depending on whether you want to round down or up.
df.index = df.index.ceil('2H')
>>> df
Values
Time
2021-10-24 22:00:00 2
2021-10-25 00:00:00 4
2021-10-25 02:00:00 78
2021-10-25 06:00:00 90
2021-10-25 08:00:00 1
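The floor variant is symmetric, rounding odd hours down instead of up (so 05:00 would become 04:00):
df.index = df.index.floor('2H')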
If "Time" is a column (and not the index), you can use dt.ceil:
df["Time"] = df["Time"].dt.ceil("2H")
>>> df
Time Values
0 2021-10-24 22:00:00 2
1 2021-10-25 00:00:00 4
2 2021-10-25 02:00:00 78
3 2021-10-25 06:00:00 90
4 2021-10-25 08:00:00 1
Alternatively, if you want to ensure that the data contains every 2-hour interval, you could resample (closed="right" makes each bin include its right edge, and label="right" labels the bin with that edge):
df = df.resample("2H", on="Time", closed="right", label="right").sum()
>>> df
Values
Time
2021-10-24 22:00:00 2
2021-10-25 00:00:00 4
2021-10-25 02:00:00 78
2021-10-25 04:00:00 0
2021-10-25 06:00:00 90
2021-10-25 08:00:00 1
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all the values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00).
I have the following code:
hora_pico_aug = hora_pico.groupby(pd.Grouper(key="Hora_Retiro", freq='H')).count()
The Hora_Retiro column is of timedelta64[ns] type.
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since it is a timedelta, the hour component comes from dt.components:
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then group by the hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'].astype(str), format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
which gives (on the answerer's own small sample data):
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type, not datetime; otherwise the date part would also be printed.
Indeed, your code creates groups starting at the minute/second taken from the first row.
To group by "full hours", floor each element in this column to the hour, then group just by this floored value (floor rather than round, so that e.g. 23:40:00 stays in the 23:00:00 group instead of rounding up to 1 days 00:00:00).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, decide what you want to count: rows, or the values in the count_uses column.
In the latter case, replace count with sum.
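For example, the sum variant, reusing the floored grouping above (a small sketch):
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.sum()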
There is a CSV data frame which contains attributes and their values in hourly intervals. Not all attributes are listed every hour. It looks like this:
time attribute value
2019.10.11. 10:00:00 A 10
2019.10.11. 10:00:00 B 20
2019.10.11. 10:00:00 C 10
2019.10.11. 10:00:00 D 13
2019.10.11. 10:00:00 E 12
2019.10.11. 11:00:00 A 11
2019.10.11. 11:00:00 D 8
2019.10.11. 11:00:00 E 17
2019.10.11. 12:00:00 A 13
2019.10.11. 12:00:00 B 24
2019.10.11. 12:00:00 C 11
2019.10.11. 12:00:00 E 17
I would like to convert it to have one row for each hour, with the attribute names as columns holding their values. If an attribute is not listed, it should have a zero value (or can be left blank, etc.). Does pandas offer a way to automate this with merge, concat, join, or anything else, or do I have to implement it manually?
I would need the dataset in the following format:
time A B C D E
2019.10.11. 10:00:00 10 20 10 13 12
2019.10.11. 11:00:00 11 0 0 8 17
2019.10.11. 12:00:00 13 24 11 0 17
Thank you for reading it!
Use DataFrame.pivot_table:
df = df.pivot_table(index='time', columns='attribute', values='value', fill_value=0)
print(df)
attribute A B C D E
time
2019.10.11. 10:00:00 10 20 10 13 12
2019.10.11. 11:00:00 11 0 0 8 17
2019.10.11. 12:00:00 13 24 11 0 17
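If time should become an ordinary column again, as in the requested layout, a reset_index plus dropping the columns-axis name restores it (a small follow-up sketch):
df = df.reset_index().rename_axis(columns=None)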
You could use unstack + fillna:
df = pd.DataFrame(data=data, columns=['time', 'attribute', 'value'])
print(df.set_index(['time', 'attribute']).unstack(level=-1).fillna(0))
Output
value
attribute A B C D E
time
2019.10.11. 10:00:00 10.0 20.0 10.0 13.0 12.0
2019.10.11. 11:00:00 11.0 0.0 0.0 8.0 17.0
2019.10.11. 12:00:00 13.0 24.0 11.0 0.0 17.0
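The columns come back as a MultiIndex with a leftover value level; if that is unwanted, it can be dropped (a sketch, assigning the unstacked frame to out first):
out = df.set_index(['time', 'attribute']).unstack(level=-1).fillna(0)
out.columns = out.columns.droplevel(0)  # drop the 'value' level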
I have two data frames like the following: data frame A has datetimes including minutes, while data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that data frame A gets all its rows filled by the merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', on=['date', 'hour'])
but it is a very long process. Is there a more efficient way to perform the same operation with the help of pandas' time series or date functionality?
Use map if you only need to append one column from B to A, with floor to set the minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution, use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
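If only the count column from B is needed, joining just that column avoids the suffix handling (a small sketch, assuming the floored indexes set above):
A = A.join(B['count'])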
Fairly new to Python and pandas here.
I make a query that's giving me back a timeseries. I'm never sure how many data points I receive from the query (run for a single day), but what I do know is that I need to resample them to contain 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Following I do this
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print "Now daily summary"
print daily_summary
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
The first thing you need to do is convert that 'Timestamp' to an actual pd.Timestamp. It looks like those are milliseconds since the epoch.
Then resample with the on parameter set to 'Timestamp':
df = df.assign(
Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate:
df.assign(
Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
Let's try, starting from the raw frame that still has the Timestamp column:
daily_summary = daily_summary.set_index('Timestamp')
daily_summary.index = pd.to_datetime(daily_summary.index, unit='ms')
For once an hour:
daily_summary.resample('H').mean()
or for once a day:
daily_summary.resample('D').mean()
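Put together, a minimal end-to-end sketch, assuming m3hstream as printed in the question:
import pandas as pd

df = pd.DataFrame(list(m3hstream), columns=['Timestamp', 'Value'])  # raw query result
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms')        # ms since epoch
daily_summary = df.set_index('Timestamp').resample('H').mean()      # one row per hour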