Python Pandas - sum a boolean variable by hour

I have a pretty simple question: I have a pandas DataFrame that looks like:
y
2015-12-09 09:00:00 1
2015-12-09 08:48:00 1
2015-12-09 08:24:00 1
2015-12-09 08:12:00 1
2015-12-09 08:00:00 1
2015-12-09 06:36:00 1
2015-12-09 06:24:00 1
... ..
2015-12-08 10:12:00 1
2015-12-08 10:00:00 1
2015-12-08 09:48:00 1
2015-12-08 09:36:00 1
I want to sum the boolean variables by hour, so I have something that looks like:
y
2015-12-09 09:00:00 1
2015-12-09 08:00:00 4
2015-12-09 07:00:00 0
2015-12-09 06:00:00 2
... ..
2015-12-08 10:00:00 2
2015-12-08 09:00:00 2
I keep getting this error:
AttributeError: 'numpy.ndarray' object has no attribute 'groupby'
It doesn't seem like a very hard problem, but I cannot figure it out.

The solution is relatively straightforward, but it implicitly assumes that 0 means False in your data set (which seems logical to me) and that df has a DatetimeIndex. If so, this works (older pandas versions wrote it as df.resample('1H', how='sum'), but the how argument has since been removed):
df.resample('1H').sum().fillna(0)
Otherwise you may have to look into a different way of aggregating your data.
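For reference, here is a minimal sketch of the hourly resample on a few of the timestamps from the question. The frame is made up for illustration and assumes y sits in a DataFrame with a DatetimeIndex. The lowercase "h" alias is used because recent pandas releases deprecate "H".

```python
import pandas as pd

# Hypothetical sample built from a few timestamps in the question.
times = pd.to_datetime([
    "2015-12-09 09:00:00",
    "2015-12-09 08:48:00",
    "2015-12-09 08:24:00",
    "2015-12-09 08:12:00",
    "2015-12-09 08:00:00",
    "2015-12-09 06:36:00",
    "2015-12-09 06:24:00",
])
df = pd.DataFrame({"y": 1}, index=times)

# Sort so the index is increasing, then bin into hours and sum each bin.
hourly = df.sort_index().resample("h").sum()
print(hourly)
```

Hours with no rows at all (07:00 here) should come out as 0, because the sum of an empty bin is 0.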

I'm a Pandas newbie but here are my two cents.
Let's start with a DataFrame that looks like this (like yours):
What I did first was converting that string date-time into a date-time field:
data['datetime'] = pd.to_datetime(data['datetime'])
Then I created another column with only date values:
data['date'] = data.datetime.dt.date
And another one with hour values:
data['hour'] = data.datetime.dt.hour
So my data DataFrame looks like this:
Finally, I just grouped by date and hour:
data.groupby(['date', 'hour']).size()
And these are the results:
If you don't want to alter your DataFrame, work on a copy of it (a plain assignment would only create a second name for the same object):
mutable_data = data.copy()
And then make changes to mutable_data.
I hope this helps. If not, I would love to receive suggestions.
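Putting those steps together, here is a small sketch under a few assumptions: the column names datetime and y and the sample rows are invented for illustration. It also shows the difference between size(), which counts rows, and sum(), which adds up the 0/1 values. The two only agree when every y is 1.

```python
import pandas as pd

# Hypothetical input: string timestamps plus a 0/1 flag column.
data = pd.DataFrame({
    "datetime": [
        "2015-12-09 09:00:00",
        "2015-12-09 08:48:00",
        "2015-12-09 08:24:00",
        "2015-12-08 10:12:00",
    ],
    "y": [1, 0, 1, 1],
})

# Work on a real copy so the original frame is left untouched.
work = data.copy()
work["datetime"] = pd.to_datetime(work["datetime"])
work["date"] = work["datetime"].dt.date
work["hour"] = work["datetime"].dt.hour

counts = work.groupby(["date", "hour"]).size()    # number of rows per hour
sums = work.groupby(["date", "hour"])["y"].sum()  # number of 1s per hour
print(counts)
print(sums)
```

Unlike resample, this grouping only returns hours that actually occur in the data, so empty hours are not filled in with 0.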

Related

python masking each day in dataframe

I have to compute a daily sum on a dataframe, but only for days where at least 70% of the data is not NaN. Days that do not meet this threshold must not be taken into account. Is there a way to create such a mask? My dataframe is more than 17 years of hourly data.
my data is something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I groupby and aggregate, then there is no way to know whether any day was missing data. Days with gaps will have lower sums and will therefore lower my monthly means.
As said in the comments, use groupby to group the data by date and then write an appropriate selection. This is an example that would sum all days (assuming regular data points, 24 per day) with less than 50% of nan entries:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data":np.random.randint(0,100,size=(len(date_rng)))}, index = date_rng)
# set some values to nan
df["data"] = df["data"].where(df["data"] <= 50)
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
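The same idea can be adapted to the 70% threshold from the question. This is only a sketch: it assumes regular hourly data (24 expected points per day), and the toy values and the names expected_per_day and valid_fraction are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two days of hypothetical hourly data.
idx = pd.date_range("2018-01-01", periods=48, freq="h")
values = np.arange(48, dtype=float)
values[:10] = np.nan    # day 1: 14 of 24 valid (about 58%)
values[24:27] = np.nan  # day 2: 21 of 24 valid (87.5%)
df = pd.DataFrame({"data": values}, index=idx)

expected_per_day = 24  # assumption: one reading per hour
grouped = df["data"].groupby(df.index.normalize())

daily_sum = grouped.sum()                            # NaN values are skipped
valid_fraction = grouped.count() / expected_per_day  # count() ignores NaN

# Keep the sum only where at least 70% of the day is present.
daily = daily_sum.where(valid_fraction >= 0.7)
print(daily)
```

Dividing by a fixed 24 rather than by the number of rows in each group means that days with missing rows (not just NaN values) also count as incomplete.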
An alternative solution - you may find it useful & flexible:
# pip install convtools
from convtools import conversion as c
total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))
input_data = [] # e.g. iterable of dicts
converter = (
    c.group_by(
        c.item("key1"),
        c.item("key2"),
    )
    .aggregate(
        {
            "key1": c.item("key1"),
            "key2": c.item("key2"),
            "sum_if_70": c.if_(
                total_not_none / total_number < 0.7,
                None,
                total_sum,
            ),
        }
    )
    .gen_converter(
        debug=False
    )  # install black and set to True to see the generated ad-hoc code
)
result = converter(input_data)

How to change the time of a pandas datetime object to the start of the hour?

I have a pandas DataFrame in which one of the columns is a datetime column created using pd.to_datetime(). I want to extract the date and hour from each datetime object; in other words, I want to set the minutes and seconds to 0.
I used normalize() to change the time to midnight, but I don't know how to change the time to the start of the hour. Please suggest a way to do so.
making some test data and turning it into a dataframe
rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
df = pd.DataFrame(rng)
print(df)
print(df[0].dt.round('H'))
gives the input
0
0 2018-01-01 11:59:00
1 2018-01-01 12:00:00
2 2018-01-01 12:01:00
and rounded to the nearest hour gives
0
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
and
print(df[0].dt.floor('H'))
gives
0
0 2018-01-01 11:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
if you always want to round down. Likewise, use dt.ceil('H') if you want to round up.
I think you need to check out pandas.Series.dt.strftime.
Or try this:
import datetime
df=pd.DataFrame({'timestamp':[pd.Timestamp('today')]})
df['Date']=[pd.to_datetime(i.date())+ datetime.timedelta(hours=i.hour) for i in df['timestamp']]
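The list comprehension above can probably be replaced by a single vectorized call. A minimal sketch, assuming a datetime column named timestamp (the sample values and the hour_start column name are made up):

```python
import pandas as pd

# Hypothetical timestamps with non-zero minutes and seconds.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-01-01 11:59:30",
        "2018-01-01 12:00:00",
        "2018-01-01 12:01:45",
    ])
})

# Floor every value to the start of its hour; the dtype stays datetime64.
df["hour_start"] = df["timestamp"].dt.floor("h")
print(df)
```

The lowercase "h" alias is used here because recent pandas releases deprecate the uppercase "H".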

How to make a pandas series whose index is every day of 2020

I would like to make an empty pandas series with a date index which is every day of 2020. That is 01-01-2020, 02-01-2020 etc.
Although this looks very simple I couldn’t find out how to do it.
Use date_range:
range_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")
pd.DataFrame(range(366), index=range_2020)
The output is:
0
2020-01-01 0
2020-01-02 1
2020-01-03 2
2020-01-04 3
2020-01-05 4
...
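Since the question asks for an empty Series rather than a DataFrame, here is a small sketch of that variant. The float64 dtype is an assumption; pick whatever type the values will eventually have.

```python
import pandas as pd

# One entry per calendar day of 2020 (a leap year, so 366 days).
days_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")

# "Empty" here means every value is NaN until it is filled in.
series_2020 = pd.Series(index=days_2020, dtype="float64")
print(series_2020.head())
```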

Slicing pandas dataframe by custom months and days -- is there a way to avoid for loops?

The problem
Suppose I have a time series dataframe df (a pandas dataframe) and some days I want to slice from it, contained in another dataframe called sample_days:
>>> df
foo bar
2020-01-01 00:00:00 0.360049 0.897839
2020-01-01 01:00:00 0.285667 0.409544
2020-01-01 02:00:00 0.323871 0.240926
2020-01-01 03:00:00 0.921623 0.766624
2020-01-01 04:00:00 0.087618 0.142409
... ... ...
2020-12-31 19:00:00 0.145111 0.993822
2020-12-31 20:00:00 0.331223 0.021287
2020-12-31 21:00:00 0.531099 0.859035
2020-12-31 22:00:00 0.759594 0.790265
2020-12-31 23:00:00 0.103651 0.074029
[8784 rows x 2 columns]
>>> sample_days
month day
0 3 16
1 7 26
2 8 15
3 9 26
4 11 25
I want to slice df with the days specified in sample_days. I can do this with for loops (see below). However, is there a way to avoid for loops, since a vectorized approach should be more efficient? The result should be a dataframe called sample like the following:
>>> sample
foo bar
2020-03-16 00:00:00 0.707276 0.592614
2020-03-16 01:00:00 0.136679 0.357872
2020-03-16 02:00:00 0.612331 0.290126
2020-03-16 03:00:00 0.276389 0.576996
2020-03-16 04:00:00 0.612977 0.781527
... ... ...
2020-11-25 19:00:00 0.904266 0.825501
2020-11-25 20:00:00 0.269589 0.050304
2020-11-25 21:00:00 0.271814 0.418235
2020-11-25 22:00:00 0.595005 0.973198
2020-11-25 23:00:00 0.151149 0.024057
[120 rows x 2 columns]
which is just the df sliced across the correct days.
My (slow) solution
I've managed to do this using for loops and pd.concat:
sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
df.index.day.isin([sample_day.day])]
for sample_day in sample_days.itertuples()])
which is based on concatenating multiple days as sliced by the method indicated here. This gives the desired result but is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas just calling df.loc[df.index.day == 1] (presumably avoiding python for loops under-the-hood) is around 300 times faster. However, this is a slice on just the day -- I am slicing on month and day.
Apologies if this has been answered somewhere else -- I've searched for quite a while but perhaps was not using the correct keywords.
You can do a string comparison of the month and days at the same time.
You need the space to differentiate between 11 2 and 1 12 for example, otherwise both would be regarded as the same.
df.loc[(df.index.month.astype(str) +' '+ df.index.day.astype(str)).isin(sample_days['month'].astype(str)+' '+sample_days['day'].astype(str))]
After getting a bit of inspiration from @Ben Pap's solution (thanks!), I've found a solution that is both fast and avoids any "hacks" like changing datetime to strings. It combines the month and day into a single MultiIndex, as below (you can make this a single line, but I've expanded it into multiple lines to make the idea clear).
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
If I run this code along with my original for loop and @Ben Pap's answer, and sample 100 days from a one-year time series for 2020 (8784 hours with the leap day), I get the following solution times:
Original for loop: 0.16s
@Ben Pap's solution, combining month and day into single string: 0.019s
Above solution using MultiIndex: 0.006s
so I think using a MultiIndex is the way to go.
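For completeness, a third variant that is not from either answer: encode month and day as one integer key (month * 100 + day) and use isin on that. It is only a sketch on made-up data, and it has not been timed against the approaches above.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly frame for 2020 and the sample days from the question.
index = pd.date_range("2020-01-01", "2020-12-31 23:00", freq="h")
df = pd.DataFrame({"foo": np.arange(len(index))}, index=index)
sample_days = pd.DataFrame({"month": [3, 7, 8, 9, 11],
                            "day": [16, 26, 15, 26, 25]})

# month * 100 + day is unique per calendar day, e.g. 316 for 16 March.
key = df.index.month * 100 + df.index.day
wanted = sample_days["month"] * 100 + sample_days["day"]

sample = df.loc[key.isin(wanted)]
print(sample)
```

Because the day never exceeds 31, multiplying the month by 100 avoids the ambiguity that the string version solves with a space.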

Pandas read_hdf query by date and time range

I have a question regarding how to filter results in the pd.read_hdf function. So here's the setup, I have a pandas dataframe (with np.datetime64 index) which I put into a hdf5 file. There's nothing fancy going on here, so no use of hierarchy or anything (maybe I could incorporate it?). Here's an example:
Foo Bar
TIME
2014-07-14 12:02:00 0 0
2014-07-14 12:03:00 0 0
2014-07-14 12:04:00 0 0
2014-07-14 12:05:00 0 0
2014-07-14 12:06:00 0 0
2014-07-15 12:02:00 0 0
2014-07-15 12:03:00 0 0
2014-07-15 12:04:00 0 0
2014-07-15 12:05:00 0 0
2014-07-15 12:06:00 0 0
2014-07-16 12:02:00 0 0
2014-07-16 12:03:00 0 0
2014-07-16 12:04:00 0 0
2014-07-16 12:05:00 0 0
2014-07-16 12:06:00 0 0
Now I store this into a .h5 using the following command:
store = pd.HDFStore('qux.h5')
#generate df
store.append('data', df)
store.close()
Next, I'll have another process which accesses this data and I would like to take date/time slices of this data. So suppose I want dates between 2014-07-14 and 2014-07-15, and only for times between 12:02:00 and 12:04:00. Currently I am using the following command to retrieve this:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715').between_time(start_time=datetime.time(12,2), end_time=datetime.time(12,4))
As far as I'm aware (someone please correct me if I'm wrong), the entire original dataset is not read into memory if I use 'where'. In other words:
This:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715')
Is not the same as this:
pd.read_hdf('qux.h5', 'data')['20140714':'20140715']
While the end result is exactly the same, what's being done in the background is not. So my question is, is there a way to incorporate that time range filter (i.e. .between_time()) into my where statement? Or if there's another way I should structure my hdf5 file? Maybe store a table for each day?
Thanks!
EDIT:
Regarding using hierarchy, I'm aware that the structure should be highly dependent on how I'll be using the data. However, suppose I define a table per date (e.g. 'df/date_20140714', 'df/date_20140715', ...). Again I may be mistaken here, but using my example of querying a date/time range, I'll probably incur a performance penalty, as I'll need to read each table and merge them if I want a consolidated output, right?
See an example of selecting using a where mask
Here's an example
In [50]: pd.set_option('max_rows',10)
In [51]: df = DataFrame(np.random.randn(1000,2),index=date_range('20130101',periods=1000,freq='H'))
In [52]: df
Out[52]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 11:00:00 0.554420 0.777484
2013-02-11 12:00:00 -0.558041 1.833465
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[1000 rows x 2 columns]
In [53]: store = pd.HDFStore('test.h5',mode='w')
In [54]: store.append('df',df)
In [55]: c = store.select_column('df','index')
In [56]: where = pd.DatetimeIndex(c).indexer_between_time('12:30','4:00')
In [57]: store.select('df',where=where)
Out[57]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 03:00:00 0.902023 1.416775
2013-02-11 04:00:00 -1.455099 -0.766558
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[664 rows x 2 columns]
In [58]: store.close()
A couple of points to note. This reads in the entire index to start. Usually this is not a burden. If it is, you can read it in chunks (provide start/stop, though it's a bit manual to do this at the moment). I don't believe the current select_column can accept a query either.
You could potentially iterate over the days (and do individual queries) if you have a gargantuan amount of data (tens of millions of wide rows), which might be more efficient.
Recombining data is relatively cheap (via concat), so don't be afraid to sub-query (though doing this too much can hurt performance as well).
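To cover both the date range and the time-of-day window from the question, the row positions can be computed on the index alone. The sketch below works on an in-memory DatetimeIndex so that it runs without an HDF5 file. In the session above, that index would come from store.select_column('df', 'index'). The sample index and variable names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical index shaped like the one in the question:
# three days, five one-minute stamps each (12:02 to 12:06).
days = ["2014-07-14", "2014-07-15", "2014-07-16"]
index = pd.DatetimeIndex([f"{d} 12:0{m}:00" for d in days for m in range(2, 7)])

# Positions of rows inside the date range (14th and 15th, whole days).
in_dates = (index >= pd.Timestamp("2014-07-14")) & (index < pd.Timestamp("2014-07-16"))
date_pos = np.flatnonzero(in_dates)

# Positions of rows inside the time-of-day window (both ends included).
time_pos = index.indexer_between_time("12:02", "12:04")

# Rows that satisfy both conditions.
positions = np.intersect1d(date_pos, time_pos)
print(positions)
```

An integer array like positions is what the session above passes as where= to store.select, so only the matching rows would be read from the file.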
