Pandas read_hdf query by date and time range - python

I have a question regarding how to filter results in the pd.read_hdf function. So here's the setup: I have a pandas DataFrame (with an np.datetime64 index) which I put into an HDF5 file. There's nothing fancy going on here, so no use of hierarchy or anything (maybe I could incorporate it?). Here's an example:
Foo Bar
TIME
2014-07-14 12:02:00 0 0
2014-07-14 12:03:00 0 0
2014-07-14 12:04:00 0 0
2014-07-14 12:05:00 0 0
2014-07-14 12:06:00 0 0
2014-07-15 12:02:00 0 0
2014-07-15 12:03:00 0 0
2014-07-15 12:04:00 0 0
2014-07-15 12:05:00 0 0
2014-07-15 12:06:00 0 0
2014-07-16 12:02:00 0 0
2014-07-16 12:03:00 0 0
2014-07-16 12:04:00 0 0
2014-07-16 12:05:00 0 0
2014-07-16 12:06:00 0 0
Now I store this into a .h5 using the following command:
store = pd.HDFStore('qux.h5')
#generate df
store.append('data', df)
store.close()
Next, I'll have another process which accesses this data and I would like to take date/time slices of this data. So suppose I want dates between 2014-07-14 and 2014-07-15, and only for times between 12:02:00 and 12:04:00. Currently I am using the following command to retrieve this:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715').between_time(start_time=datetime.time(12,2), end_time=datetime.time(12,4))
As far as I'm aware (someone please correct me if I'm wrong here), the entire original dataset is not read into memory if I use 'where'. So in other words:
This:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715')
Is not the same as this:
pd.read_hdf('qux.h5', 'data')['20140714':'20140715']
While the end result is exactly the same, what's being done in the background is not. So my question is: is there a way to incorporate that time-range filter (i.e. .between_time()) into my where statement? Or is there another way I should structure my HDF5 file? Maybe store a table for each day?
Thanks!
EDIT:
Regarding using hierarchy, I'm aware that the structure should be highly dependent on how I'll be using the data. However, suppose I define a table per date (e.g. 'df/date_20140714', 'df/date_20140715', ...). Again, I may be mistaken here, but using my example of querying a date/time range, I'll probably incur a performance penalty, as I'll need to read each table and merge them if I want a consolidated output, right?

See an example of selecting using a where mask. Here's an example:
In [50]: pd.set_option('max_rows',10)
In [51]: df = DataFrame(np.random.randn(1000,2),index=date_range('20130101',periods=1000,freq='H'))
In [52]: df
Out[52]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 11:00:00 0.554420 0.777484
2013-02-11 12:00:00 -0.558041 1.833465
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[1000 rows x 2 columns]
In [53]: store = pd.HDFStore('test.h5',mode='w')
In [54]: store.append('df',df)
In [55]: c = store.select_column('df','index')
In [56]: where = pd.DatetimeIndex(c).indexer_between_time('12:30','4:00')
In [57]: store.select('df',where=where)
Out[57]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 03:00:00 0.902023 1.416775
2013-02-11 04:00:00 -1.455099 -0.766558
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[664 rows x 2 columns]
In [58]: store.close()
A couple of points to note. This reads in the entire index to start with. Usually this is not a burden. If it is, you can chunk-read it (provide start/stop, though it's a bit manual to do this ATM). I don't believe the current select_column can accept a query either.
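If reading the full index really is a burden, a chunked version might look like this (just a sketch against the same test.h5/df store from above; the chunk size is arbitrary, and the positions returned for each chunk are shifted back to absolute row numbers):
import numpy as np
import pandas as pd

store = pd.HDFStore('test.h5')
nrows = store.get_storer('df').nrows
chunksize = 250  # arbitrary chunk size for this sketch

parts = []
for start in range(0, nrows, chunksize):
    c = store.select_column('df', 'index', start=start, stop=start + chunksize)
    # indexer_between_time returns positions relative to the chunk,
    # so shift them back to absolute row numbers
    parts.append(pd.DatetimeIndex(c).indexer_between_time('12:30', '4:00') + start)

where = np.concatenate(parts)
result = store.select('df', where=where)
store.close()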
You could potentially iterate over the days (and do individual queries) if you have a gargantuan amount of data (tens of millions of rows, and wide), which might be more efficient.
Recombining data is relatively cheap (via concat), so don't be afraid to sub-query (though doing this too much can drag performance as well); a sketch of that approach follows below.
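Here is a sketch of that idea against the 'qux.h5'/'data' store from the question; each day gets its own where clause covering just the time window, and the pieces are recombined with concat afterwards (the day range is just an example):
import pandas as pd

days = pd.date_range('2014-07-14', '2014-07-15', freq='D')
pieces = []
for day in days:
    lo = day + pd.Timedelta(hours=12, minutes=2)
    hi = day + pd.Timedelta(hours=12, minutes=4)
    pieces.append(pd.read_hdf('qux.h5', 'data',
                              where=f"index >= '{lo}' and index <= '{hi}'"))

result = pd.concat(pieces)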

Related

Pandas apply on rolling with multi-column output

I am working on code that applies a rolling window to a function that returns multiple columns.
Input: Pandas Series
Expected output: 3-column DataFrame
def fun1(series):
    # Some calculations producing numbers a, b and c
    return {"a": a, "b": b, "c": c}

res.rolling('21 D').apply(fun1)
Contents of res:
time
2019-09-26 16:00:00 0.674969
2019-09-26 16:15:00 0.249569
2019-09-26 16:30:00 -0.529949
2019-09-26 16:45:00 -0.247077
2019-09-26 17:00:00 0.390827
...
2019-10-17 22:45:00 0.232998
2019-10-17 23:00:00 0.590827
2019-10-17 23:15:00 0.768991
2019-10-17 23:30:00 0.142661
2019-10-17 23:45:00 -0.555284
Length: 1830, dtype: float64
Error:
TypeError: must be real number, not dict
What I've tried:
Changing raw=True in apply
Using a lambda function in apply
Returning result in fun1 as lists/numpy arrays/dataframe/series.
I have also gone through many related posts on SO, to state a few:
Pandas - Using `.rolling()` on multiple columns
Returning two values from pandas.rolling_apply
How to apply a function to two columns of Pandas dataframe
Apply pandas function to column to create multiple new columns?
But none of the solutions specified solves this problem.
Is there a straight-forward solution to this?
Here is a hacky answer using rolling, producing a DataFrame:
import pandas as pd
import numpy as np

dr = pd.date_range('09-26-2019', '10-17-2019', freq='15T')
data = np.random.rand(len(dr))
s = pd.Series(data, index=dr)
output = pd.DataFrame(columns=['a', 'b', 'c'])
row = 0

def compute(window, df):
    global row
    a = window.max()
    b = window.min()
    c = a - b
    df.loc[row, ['a', 'b', 'c']] = [a, b, c]
    row += 1
    return 1

s.rolling('1D').apply(compute, kwargs={'df': output})
output.index = s.index
It seems like the rolling apply function always expects a number to be returned, in order to immediately generate a new Series based on the calculations.
I am getting around this by making a new output DataFrame (with the desired output columns) and writing to that within the function. I'm not sure if there is a way to get the index within a rolling object, so I instead use global to keep an increasing count for writing new rows. In light of the point above, though, you need to return some number. So while the actual rolling operation returns a series of 1s, output is modified:
In[0]:
s
Out[0]:
2019-09-26 00:00:00 0.106208
2019-09-26 00:15:00 0.979709
2019-09-26 00:30:00 0.748573
2019-09-26 00:45:00 0.702593
2019-09-26 01:00:00 0.617028
2019-10-16 23:00:00 0.742230
2019-10-16 23:15:00 0.729797
2019-10-16 23:30:00 0.094662
2019-10-16 23:45:00 0.967469
2019-10-17 00:00:00 0.455361
Freq: 15T, Length: 2017, dtype: float64
In[1]:
output
Out[1]:
a b c
2019-09-26 00:00:00 0.106208 0.106208 0.000000
2019-09-26 00:15:00 0.979709 0.106208 0.873501
2019-09-26 00:30:00 0.979709 0.106208 0.873501
2019-09-26 00:45:00 0.979709 0.106208 0.873501
2019-09-26 01:00:00 0.979709 0.106208 0.873501
... ... ...
2019-10-16 23:00:00 0.980544 0.022601 0.957943
2019-10-16 23:15:00 0.980544 0.022601 0.957943
2019-10-16 23:30:00 0.980544 0.022601 0.957943
2019-10-16 23:45:00 0.980544 0.022601 0.957943
2019-10-17 00:00:00 0.980544 0.022601 0.957943
[2017 rows x 3 columns]
This feels like more of an exploit of rolling than an intended use, so I would be interested to see a more elegant answer.
UPDATE: Thanks to #JuanPi, you can get the rolling window index using this answer. So a non-global answer could look like this:
def compute(window, df):
    a = window.max()
    b = window.min()
    c = a - b
    df.loc[window.index.max(), ['a', 'b', 'c']] = [a, b, c]
    return 1
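Usage is the same as before, except that rows now land directly at each window's last timestamp, so (assuming the same s and output as above) the output.index = s.index reassignment is no longer needed:
output = pd.DataFrame(columns=['a', 'b', 'c'])
s.rolling('1D').apply(compute, kwargs={'df': output})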
This hack seems to work for me, albeit the additional features of rolling cannot be applied to this solution. However, the speed of the application is significantly faster due to multiprocessing.
from multiprocessing import Pool
import functools

import pandas as pd

def apply_fn(indices, fn, df):
    return fn(df.loc[indices])

def rolling_apply(df, fn, window_size, start=None, end=None):
    """
    The rolling application of a function fn on a DataFrame df given the window_size
    """
    x = df.index
    if start is not None:
        x = x[x >= start]
    if end is not None:
        x = x[x <= end]
    if type(window_size) == str:
        delta = pd.Timedelta(window_size)
        index_sets = [x[(x > (i - delta)) & (x <= i)] for i in x]
    else:
        assert type(window_size) == int, "Window size should be str (representing Timedelta) or int"
        delta = window_size
        index_sets = [x[(x > (i - delta)) & (x <= i)] for i in x]
    with Pool() as pool:
        result = list(pool.map(functools.partial(apply_fn, fn=fn, df=df), index_sets))
    result = pd.DataFrame(data=result, index=x)
    return result
Having the above functions in place, plug the function you want to roll into the custom rolling_apply function:
result = rolling_apply(res, fun1, "21 D")
Contents of result:
a b c
time
2019-09-26 16:00:00 NaN NaN NaN
2019-09-26 16:15:00 0.500000 0.106350 0.196394
2019-09-26 16:30:00 0.500000 0.389759 -0.724829
2019-09-26 16:45:00 2.000000 0.141436 -0.529949
2019-09-26 17:00:00 6.010184 0.141436 -0.459231
... ... ... ...
2019-10-17 22:45:00 4.864015 0.204483 -0.761609
2019-10-17 23:00:00 6.607717 0.204647 -0.761421
2019-10-17 23:15:00 7.466364 0.204932 -0.761108
2019-10-17 23:30:00 4.412779 0.204644 -0.760386
2019-10-17 23:45:00 0.998308 0.203039 -0.757979
1830 rows × 3 columns
Note:
This implementation works for both Series and DataFrame input
This implementation works for both time and integer windows
The result returned by fun1 can even be a list, numpy array, series, or dictionary (see the sketch after these notes)
The window_size considers only the max window size, so all starting indices below the window_size will have their windows include all elements up to the starting element.
The apply function should not be nested inside the rolling_apply function, since pool.map cannot accept local or lambda functions, as they cannot be 'pickled' according to the multiprocessing library.
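As a concrete illustration, a hypothetical dict-returning fun1 (standing in for the one in the question) plugs straight in:
def fun1(series):
    # hypothetical calculations standing in for the question's a, b and c
    a = series.max()
    b = series.min()
    c = a - b
    return {"a": a, "b": b, "c": c}

result = rolling_apply(res, fun1, "21 D")  # result gets columns 'a', 'b' and 'c'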

Slicing pandas dataframe by custom months and days -- is there a way to avoid for loops?

The problem
Suppose I have a time series dataframe df (a pandas dataframe) and some days I want to slice from it, contained in another dataframe called sample_days:
>>> df
foo bar
2020-01-01 00:00:00 0.360049 0.897839
2020-01-01 01:00:00 0.285667 0.409544
2020-01-01 02:00:00 0.323871 0.240926
2020-01-01 03:00:00 0.921623 0.766624
2020-01-01 04:00:00 0.087618 0.142409
... ... ...
2020-12-31 19:00:00 0.145111 0.993822
2020-12-31 20:00:00 0.331223 0.021287
2020-12-31 21:00:00 0.531099 0.859035
2020-12-31 22:00:00 0.759594 0.790265
2020-12-31 23:00:00 0.103651 0.074029
[8784 rows x 2 columns]
>>> sample_days
month day
0 3 16
1 7 26
2 8 15
3 9 26
4 11 25
I want to slice df with the days specified in sample_days. I can do this with for loops (see below). However, is there a way to avoid for loops (as this would be more efficient)? The result should be a dataframe called sample like the following:
>>> sample
foo bar
2020-03-16 00:00:00 0.707276 0.592614
2020-03-16 01:00:00 0.136679 0.357872
2020-03-16 02:00:00 0.612331 0.290126
2020-03-16 03:00:00 0.276389 0.576996
2020-03-16 04:00:00 0.612977 0.781527
... ... ...
2020-11-25 19:00:00 0.904266 0.825501
2020-11-25 20:00:00 0.269589 0.050304
2020-11-25 21:00:00 0.271814 0.418235
2020-11-25 22:00:00 0.595005 0.973198
2020-11-25 23:00:00 0.151149 0.024057
[120 rows x 2 columns]
which is just the df sliced across the correct days.
My (slow) solution
I've managed to do this using for loops and pd.concat:
sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
                           df.index.day.isin([sample_day.day])]
                    for sample_day in sample_days.itertuples()])
which is based on concatenating multiple days as sliced by the method indicated here. This gives the desired result but is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas just calling df.loc[df.index.day == 1] (presumably avoiding Python for loops under the hood) is around 300 times faster. However, that is a slice on just the day -- I am slicing on both month and day.
Apologies if this has been answered somewhere else -- I've searched for quite a while but perhaps was not using the correct keywords.
You can do a string comparison of the month and day at the same time.
You need the space to differentiate between, for example, 11 2 and 1 12; otherwise both would be regarded as the same.
df.loc[(df.index.month.astype(str) + ' ' + df.index.day.astype(str))
       .isin(sample_days['month'].astype(str) + ' ' + sample_days['day'].astype(str))]
After getting a bit of inspiration from #Ben Pap's solution (thanks!), I've found a solution that is both fast and avoids any "hacks" like converting datetimes to strings. It combines the month and day into a single MultiIndex, as below (you can make this a single line, but I've expanded it into multiple lines to make the idea clear).
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                       names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
If I run this code along with my original for loop and #Ben Pap's answer, sampling 100 days from a one-year time series for 2020 (8784 hours with the leap day), I get the following solution times:
Original for loop: 0.16s
#Ben Pap's solution, combining month and day into single string: 0.019s
Above solution using MultiIndex: 0.006s
so I think using a MultiIndex is the way to go.
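For completeness, here is a self-contained sketch of the MultiIndex approach on generated data (the values are random, so only the shape matches the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(8784, 2), columns=['foo', 'bar'],
                  index=pd.date_range('2020-01-01', periods=8784, freq='H'))
sample_days = pd.DataFrame({'month': [3, 7, 8, 9, 11],
                            'day': [16, 26, 15, 26, 25]})

full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                       names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]  # 5 days x 24 hours = 120 rows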

Switching row header and column header in python

I have a .csv file that has data something like this:
#file...out/houses.csv
#data...sun may 1 11:20:43 2011
#user...abs12
#host...(null)
#group...class=house
#property..change_per_hour
#limit...0
#interval..10000000
#timestamp,house_0,house_1,house_2,house_3,.....,house_1000
2010-07-01 00:00:00 EDT,1.2,1.3,1.4,1.5,........,9.72
2010-07-01 01:00:00 EDT,2.2,2.3,2.4,2.5,........,19.72
2010-07-01 02:00:00 EDT,3.2,3.3,3.4,3.5,........,29.72
2010-07-01 05:00:00 EDT,5.2,5.3,5.4,5.5,........,59.72
2010-07-01 06:00:00 EDT,6.2,,6.4,,..............,
...
I want to convert this and save to a new .csv and the data should look like:
#file...out/houses.csv
#data...sun may 1 11:20:43 2011
#user...abs12
#host...(null)
#group...class=house
#property..change_per_hour
#limit...0
#interval..10000000
#EntityName,2010-07-01 00:00:00 EDT,2010-07-01 01:00:00 EDT,2010-07-01 02:00:00 EDT,2010-07-01 05:00:00 EDT,2010-07-01 06:00:00 EDT
house_0,1.2,2.2,3.2,5.2,6.2,...
house_1,1.3,2.3,3.3,5.3,,...
house_2,1.4,2.4,3.4,5.4,6.4,...
house_3,1.5,2.5,3.5,5.5,,...
...
house_1000,9.72,19.72,29.72,59.72,
I tried to use pandas: convert to a dictionary that looks like dtDict={'house_0':{'datetimestamp_1':'value_1','datetimestamp_2':'value_2'...}...}, but I am not able to convert to a dictionary and use pandas' DataFrame, such as pandas.DataFrame(dtDict), to do that conversion. I don't have to use pandas (can use anything in Python) but thought pandas is good for CSV manipulation. Any help?
Assuming it is in a pandas dataframe already, this works:
df = pd.DataFrame(
    data=[[1, 3], [2, 5]],
    index=[0, 1],
    columns=['a', 'b']
)
Output:
>>> print(df)
a b
0 1 3
1 2 5
Then, transpose the dataframe:
>>> print(df.transpose())
0 1
a 1 2
b 3 5
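Applied to the file in the question, a minimal sketch could look like this, assuming exactly the layout shown above (eight '#' metadata lines followed by the '#timestamp,...' header row); the paths are hypothetical:
import pandas as pd

src, dst = 'out/houses.csv', 'out/houses_transposed.csv'  # hypothetical paths

with open(src) as f:
    lines = f.readlines()

preamble = lines[:8]                              # the '#file...' to '#interval..' lines
header = lines[8].lstrip('#').strip().split(',')  # ['timestamp', 'house_0', ...]

df = pd.read_csv(src, skiprows=9, names=header, index_col='timestamp')
out = df.transpose()
out.index.name = '#EntityName'

with open(dst, 'w') as f:
    f.writelines(preamble)   # keep the metadata lines at the top
    out.to_csv(f)            # one row per house, one column per timestamp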

Python Pandas - sum a boolean variable by hour

I have a pretty simple question: I have a pandas DataFrame that looks like:
y
2015-12-09 09:00:00 1
2015-12-09 08:48:00 1
2015-12-09 08:24:00 1
2015-12-09 08:12:00 1
2015-12-09 08:00:00 1
2015-12-09 06:36:00 1
2015-12-09 06:24:00 1
... ..
2015-12-08 10:12:00 1
2015-12-08 10:00:00 1
2015-12-08 09:48:00 1
2015-12-08 09:36:00 1
I want to sum the boolean variables by hour, so I have something that looks like:
y
2015-12-09 09:00:00 1
2015-12-09 08:00:00 4
2015-12-09 07:00:00 0
2015-12-09 06:00:00 2
... ..
2015-12-08 10:00:00 2
2015-12-08 09:00:00 2
I keep getting this error:
AttributeError: 'numpy.ndarray' object has no attribute 'groupby'
It doesn't seem like a very hard problem, but I cannot figure it out.
The solution is relatively straightforward, but it does implicitly assume that in your data set, 0 equates to False (which seems logical to me). If so, this works:
df.resample('1H', how='sum').fillna(0)
Else you may have to look into a different way of sorting through your data.
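Note that the how= keyword has since been removed from resample; on newer pandas versions the equivalent would be:
df.resample('1H').sum()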
I'm a Pandas newbie but here are my two cents.
Let's start with a DataFrame that looks like this (like yours):
What I did first was convert that string date-time into a datetime field:
data['datetime'] = pd.to_datetime(data['datetime'])
Then I created another column with only date values:
data['date'] = data.datetime.dt.date
And another one with hour values:
data['hour'] = data.datetime.dt.hour
So my data DataFrame looks like this:
Finally, I just grouped by date and hour:
data.groupby(['date', 'hour']).size()
And these are the results:
If you don't want to alter your DataFrame, just work on a copy of it:
mutable_data = data.copy()
And then make changes to mutable_data.
I hope this helps. If not, I would love to receive suggestions.
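Putting those steps together on a small, hypothetical frame (a 'datetime' string column plus the boolean y column), the whole pipeline would be:
import pandas as pd

# hypothetical sample data with the timestamp as a plain string column
data = pd.DataFrame({
    'datetime': ['2015-12-09 09:00:00', '2015-12-09 08:48:00',
                 '2015-12-09 08:24:00', '2015-12-08 10:12:00'],
    'y': [1, 1, 1, 1],
})

data['datetime'] = pd.to_datetime(data['datetime'])
data['date'] = data.datetime.dt.date
data['hour'] = data.datetime.dt.hour
counts = data.groupby(['date', 'hour']).size()  # one count per (date, hour) pair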

Pandas: use array index all values

I want to select all rows with a particular index. My DataFrame look like this:
>>> df
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
Selecting a single value of the first (Patient) index level works:
>>> df.loc[1]
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
But selecting multiple values of the first (Patient) index level does not:
>>> df.loc[[1, 2]]
Code
Patient Date
1 2003-01-12 00:00:00 a
2 2001-1-17 22:00:00 z
However, I would like to get the entire dataframe (as the result would be if I used [1, 1, 1, 2], i.e. the original dataframe).
When using a single index it works fine. For example:
>>> df.reset_index().set_index("Patient").loc[[1, 2]]
Date Code
Patient
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
TL;DR Why do I have to repeat the index when using multiple indexes but not when I use a single index?
EDIT: Apparently it can be done similar to:
>>> df.loc[df.index.get_level_values("Patient").isin([1, 2])]
But this seems quite dirty to me. Is this the way, or is there another, better way?
For pandas version 0.14, the recommended way, according to the above comment, is:
df.loc[([1,2],),:]
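On newer pandas versions, pd.IndexSlice does the same selection and reads a little more clearly (assuming the MultiIndex is sorted):
idx = pd.IndexSlice
df.loc[idx[[1, 2], :], :]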
