I am using pandas TimeGrouper to group data points in a pandas DataFrame in Python:
grouped = data.groupby(pd.TimeGrouper('30S'))
I would like to know if there's a way to achieve window overlap, as suggested in this question: Window overlap in Pandas, while keeping the pandas DataFrame as the data structure.
Update: I tested the timing of the three solutions proposed below, and the rolling mean seems fastest:
%timeit df.groupby(pd.TimeGrouper('30s',closed='right')).mean()
%timeit df.resample('30s',how='mean',closed='right')
%timeit pd.rolling_mean(df,window=30).iloc[29::30]
yields:
1000 loops, best of 3: 336 µs per loop
1000 loops, best of 3: 349 µs per loop
1000 loops, best of 3: 199 µs per loop
Create some data exactly 3 x 30s long
In [51]: df = DataFrame(randn(90,2),columns=list('AB'),index=date_range('20130101 9:01:01',freq='s',periods=90))
Using a TimeGrouper in this way is equivalent to resample (and that's what resample actually does)
Note that I used closed='right' to make sure that exactly 30 observations are included
In [57]: df.groupby(pd.TimeGrouper('30s',closed='right')).mean()
Out[57]:
A B
2013-01-01 09:01:00 -0.214968 -0.162200
2013-01-01 09:01:30 -0.090708 -0.021484
2013-01-01 09:02:00 -0.160335 -0.135074
In [52]: df.resample('30s',how='mean',closed='right')
Out[52]:
A B
2013-01-01 09:01:00 -0.214968 -0.162200
2013-01-01 09:01:30 -0.090708 -0.021484
2013-01-01 09:02:00 -0.160335 -0.135074
This is also equivalent if you then pick out the 30s intervals
In [55]: pd.rolling_mean(df,window=30).iloc[28:40]
Out[55]:
A B
2013-01-01 09:01:29 NaN NaN
2013-01-01 09:01:30 -0.214968 -0.162200
2013-01-01 09:01:31 -0.150401 -0.180492
2013-01-01 09:01:32 -0.160755 -0.142534
2013-01-01 09:01:33 -0.114918 -0.181424
2013-01-01 09:01:34 -0.098945 -0.221110
2013-01-01 09:01:35 -0.052450 -0.169884
2013-01-01 09:01:36 -0.011172 -0.185132
2013-01-01 09:01:37 0.100843 -0.178179
2013-01-01 09:01:38 0.062554 -0.097637
2013-01-01 09:01:39 0.048834 -0.065808
2013-01-01 09:01:40 0.003585 -0.059181
So depending on what you want to achieve, it's easy to do an overlap by using rolling_mean
and then picking out whatever 'frequency' you want. E.g. here is a 5s sampling frequency with a 30s window.
In [61]: pd.rolling_mean(df,window=30)[9::5]
Out[61]:
A B
2013-01-01 09:01:10 NaN NaN
2013-01-01 09:01:15 NaN NaN
2013-01-01 09:01:20 NaN NaN
2013-01-01 09:01:25 NaN NaN
2013-01-01 09:01:30 -0.214968 -0.162200
2013-01-01 09:01:35 -0.052450 -0.169884
2013-01-01 09:01:40 0.003585 -0.059181
2013-01-01 09:01:45 -0.055886 -0.111228
2013-01-01 09:01:50 -0.110191 -0.045032
2013-01-01 09:01:55 0.093662 -0.036177
2013-01-01 09:02:00 -0.090708 -0.021484
2013-01-01 09:02:05 -0.286759 0.020365
2013-01-01 09:02:10 -0.273221 -0.073886
2013-01-01 09:02:15 -0.222720 -0.038865
2013-01-01 09:02:20 -0.175630 0.001389
2013-01-01 09:02:25 -0.301671 -0.025603
2013-01-01 09:02:30 -0.160335 -0.135074
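For reference, pd.rolling_mean has since been removed from pandas; a minimal sketch of the same overlapping-window idea with the newer .rolling() method (assuming the same df as above) would be:
# 30-observation rolling mean, then take every 5th row starting at the
# 10th observation -- overlapping 30s windows evaluated every 5s.
df.rolling(window=30).mean().iloc[9::5]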
I have a dataframe with a DatetimeIndex. This is one of the columns:
>>> y.out_brd
2013-01-01 11:25:00 0.04464286
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 0.005952381
2013-01-01 11:45:00 0.01785714
2013-01-01 11:50:00 0.008928571
Freq: 5T, Name: out_brd, dtype: object
When I try to use interpolate() on the column, absolutely nothing changes:
>>> y.out_brd.interpolate(method='time')
2013-01-01 11:25:00 0.04464286
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 0.005952381
2013-01-01 11:45:00 0.01785714
2013-01-01 11:50:00 0.008928571
Freq: 5T, Name: out_brd, dtype: object
How can I make it work?
Update:
Here is the code for generating such a dataframe:
time_index = pd.date_range(start=datetime(2013, 1, 1, 3),
                           end=datetime(2013, 1, 2, 2, 59),
                           freq='5T')
grid_columns = [u'in_brd', u'in_alt', u'out_brd', u'out_alt']
df = pd.DataFrame(index=time_index, columns=grid_columns)
After that I fill cells with some data.
I have a dataframe field_data with survey data about boarding and alighting on a railroad, and a station variable.
I also have interval_end function defined like this:
interval_end = lambda index, prec_lvl: index.to_datetime() \
    + timedelta(minutes=prec_lvl - 1,
                seconds=59)
The code:
for index, row in df.iterrows():
    recs = field_data[(field_data.station_name == station)
                      & (field_data.arrive_time >= index.time())
                      & (field_data.arrive_time <= interval_end(
                          index, prec_lvl).time())]
    in_recs_num = recs[recs.orientation == u'in'][u'train_number'].count()
    out_recs_num = recs[recs.orientation == u'out'][u'train_number'].count()
    if in_recs_num:
        df.loc[index, u'in_brd'] = recs[
            recs.orientation == u'in'][u'boarding'].sum() / \
            (in_recs_num * CAR_CAPACITY)
        df.loc[index, u'in_alt'] = recs[
            recs.orientation == u'in'][u'alighting'].sum() / \
            (in_recs_num * CAR_CAPACITY)
    if out_recs_num:
        df.loc[index, u'out_brd'] = recs[
            recs.orientation == u'out'][u'boarding'].sum() / \
            (out_recs_num * CAR_CAPACITY)
        df.loc[index, u'out_alt'] = recs[
            recs.orientation == u'out'][u'alighting'].sum() / \
            (out_recs_num * CAR_CAPACITY)
You need to convert your Series to have a dtype of float64 instead of its current object dtype. Here's an example to illustrate the difference. Note that, in general, object dtype Series are of limited use, the most common legitimate case being a Series containing strings. Other than that, they are very slow, since they cannot take advantage of any data type information.
In [9]: s = Series(randn(6), index=pd.date_range('2013-01-01 11:25:00', freq='5T', periods=6), dtype=object)
In [10]: s.iloc[1:3] = nan
In [11]: s
Out[11]:
2013-01-01 11:25:00 -0.69522
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 -0.70308
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.95893
Freq: 5T, dtype: object
In [12]: s.interpolate(method='time')
Out[12]:
2013-01-01 11:25:00 -0.69522
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 -0.70308
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.95893
Freq: 5T, dtype: object
In [13]: s.astype(float).interpolate(method='time')
Out[13]:
2013-01-01 11:25:00 -0.6952
2013-01-01 11:30:00 -0.6978
2013-01-01 11:35:00 -0.7005
2013-01-01 11:40:00 -0.7031
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.9589
Freq: 5T, dtype: float64
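Applied to the data from the question, a minimal sketch (assuming the frame is called y, as in the question) would be:
# Cast the object column to float first, then interpolate by time.
# On newer pandas, pd.to_numeric(y['out_brd'], errors='coerce') would do the same cast.
y['out_brd'] = y['out_brd'].astype(float).interpolate(method='time')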
I am late, but this solved my problem: you need to assign the outcome to some variable, or back to itself.
y=y.out_brd.interpolate(method='time')
You could also fix this without reassigning the result, by using the inplace argument:
y.out_brd.interpolate(method='time', inplace=True)
Short answer from Phillip above, which I missed the first time:
You need to have a float series:
s.astype(float).interpolate(method='time')
I am trying to use the function numpy.unwrap to correct some phase values.
I have a long vector with 2678399 records which contains the difference in radians between 2 angles. The array contains nan values, although I think that is not relevant, as unwrap is applied to each record independently.
When I apply unwrap, around the 400th record it starts generating nan values for the rest of the array.
If I apply np.unwrap to just one slice of the original array, it works fine.
Is that a possible bug in this function?
d90dif = (df2['d90'] - df2['d90avg']) * (np.pi / 180)  # difference between two angles in radians
df2['d90dif'] = np.unwrap(d90dif.values)  # unwrap the array to create a new column
Just to explain the problem:
d90dif[700:705]  # angle difference for some records
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 0.087808
2013-01-01 00:11:45 0.017995
dtype: float64
df2['d90dif'][700:705]  # results with unwrap for these records
2013-01-01 00:11:41 NaN
2013-01-01 00:11:42 NaN
2013-01-01 00:11:43 NaN
2013-01-01 00:11:44 NaN
2013-01-01 00:11:45 NaN
Name: d90dif, dtype: float64
Now I repeat the process with a small array:
test = d90dif[700:705]
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 0.087808
2013-01-01 00:11:45 0.017995
dtype: float64
unw=np.unwrap(test.values)
array([ 0.08780774, 0.05290116, 0.00054128, 0.08780774, 0.01799457])
Now it is OK. If I pass a dataframe input to unwrap(), it works fine as well.
By looking at the documentation of unwrap, it seems that NaN would have an effect since the function is looking at differences of adjacent elements to detect jumps in the phase.
It seems that the nan values play an important role:
test
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 NaN
2013-01-01 00:11:45 0.017995
dtype: float64
If there is a nan in the column, everything from there on becomes nan:
np.unwrap(test)
array([ 0.08780774, 0.05290116, 0.00054128, nan, nan])
I would say this is a bug but...
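If the NaNs are the problem, a minimal workaround sketch (assuming the same d90dif and df2 as in the question, and that skipping the NaN positions is acceptable) is to unwrap only the valid entries:
import numpy as np

vals = d90dif.values.copy()
valid = ~np.isnan(vals)               # positions that actually hold a value
vals[valid] = np.unwrap(vals[valid])  # unwrap only the valid entries
df2['d90dif'] = vals                  # NaN positions stay NaN
Note that this treats the values on either side of a NaN gap as adjacent when detecting jumps, which may or may not be what you want.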
I use pandas to import a csv file (about a million rows, 5 columns) that contains one column of timestamps (increasing row-by-row) in the format Hour:Min:Sec.Milliseconds, e.g.
11:52:55.162
and some other columns with floats. I need to transform the timestamp column into floats (say in seconds). So far I'm using
pandas.read_csv
to get a dataframe df and then transform it into a numpy array
df=np.array(df)
All the above works great and is quite fast. However, I then use datetime.strptime (the 0th column holds the timestamps)
df[:,0]=[(datetime.strptime(str(d),'%H:%M:%S.%f')).total_seconds() for d in df[:,0]]
to transform the timestamps into seconds, and unfortunately this turns out to be very slow. It's not the iteration over all the rows that is so slow, but
datetime.strptime
is the bottleneck. Is there a better way to do it?
Here's an approach using timedeltas.
Create a sample series
In [21]: s = pd.to_timedelta(np.arange(100000),unit='s')
In [22]: s
Out[22]:
0 00:00:00
1 00:00:01
2 00:00:02
3 00:00:03
4 00:00:04
5 00:00:05
6 00:00:06
7 00:00:07
8 00:00:08
9 00:00:09
10 00:00:10
11 00:00:11
12 00:00:12
13 00:00:13
14 00:00:14
...
99985 1 days, 03:46:25
99986 1 days, 03:46:26
99987 1 days, 03:46:27
99988 1 days, 03:46:28
99989 1 days, 03:46:29
99990 1 days, 03:46:30
99991 1 days, 03:46:31
99992 1 days, 03:46:32
99993 1 days, 03:46:33
99994 1 days, 03:46:34
99995 1 days, 03:46:35
99996 1 days, 03:46:36
99997 1 days, 03:46:37
99998 1 days, 03:46:38
99999 1 days, 03:46:39
Length: 100000, dtype: timedelta64[ns]
Convert to string for testing purposes
In [23]: t = s.apply(pd.tslib.repr_timedelta64)
These are strings
In [24]: t.iloc[-1]
Out[24]: '1 days, 03:46:39'
Dividing by a timedelta64 converts this to seconds
In [25]: pd.to_timedelta(t.iloc[-1])/np.timedelta64(1,'s')
Out[25]: 99999.0
This currently matches using a regex, so it is not very fast when starting from a string directly.
In [27]: %timeit pd.to_timedelta(t)/np.timedelta64(1,'s')
1 loops, best of 3: 1.84 s per loop
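In more recent pandas versions there is also a .dt.total_seconds() accessor on timedelta Series, which avoids the explicit division (a sketch, assuming the same t as above):
secs = pd.to_timedelta(t).dt.total_seconds()  # float seconds per row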
This is a datetime-based solution.
Since datetimes are already stored as int64s internally, this is very easy and fast.
Create a sample series
In [7]: s = Series(date_range('20130101',periods=1000,freq='ms'))
In [8]: s
Out[8]:
0 2013-01-01 00:00:00
1 2013-01-01 00:00:00.001000
2 2013-01-01 00:00:00.002000
3 2013-01-01 00:00:00.003000
4 2013-01-01 00:00:00.004000
5 2013-01-01 00:00:00.005000
6 2013-01-01 00:00:00.006000
7 2013-01-01 00:00:00.007000
8 2013-01-01 00:00:00.008000
9 2013-01-01 00:00:00.009000
10 2013-01-01 00:00:00.010000
11 2013-01-01 00:00:00.011000
12 2013-01-01 00:00:00.012000
13 2013-01-01 00:00:00.013000
14 2013-01-01 00:00:00.014000
...
985 2013-01-01 00:00:00.985000
986 2013-01-01 00:00:00.986000
987 2013-01-01 00:00:00.987000
988 2013-01-01 00:00:00.988000
989 2013-01-01 00:00:00.989000
990 2013-01-01 00:00:00.990000
991 2013-01-01 00:00:00.991000
992 2013-01-01 00:00:00.992000
993 2013-01-01 00:00:00.993000
994 2013-01-01 00:00:00.994000
995 2013-01-01 00:00:00.995000
996 2013-01-01 00:00:00.996000
997 2013-01-01 00:00:00.997000
998 2013-01-01 00:00:00.998000
999 2013-01-01 00:00:00.999000
Length: 1000, dtype: datetime64[ns]
Convert to ns since epoch / divide to get ms since epoch (if you want seconds,
divide by 10**9)
In [9]: pd.DatetimeIndex(s).asi8/10**6
Out[9]:
array([1356998400000, 1356998400001, 1356998400002, 1356998400003,
1356998400004, 1356998400005, 1356998400006, 1356998400007,
1356998400008, 1356998400009, 1356998400010, 1356998400011,
...
1356998400992, 1356998400993, 1356998400994, 1356998400995,
1356998400996, 1356998400997, 1356998400998, 1356998400999])
Pretty fast
In [12]: s = Series(date_range('20130101',periods=1000000,freq='ms'))
In [13]: %timeit pd.DatetimeIndex(s).asi8/10**6
100 loops, best of 3: 11 ms per loop
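For the original question's data, a hedged sketch in recent pandas (assuming the raw timestamps are still strings like '11:52:55.162' in a column named 'ts', a name made up here for illustration) could be:
# Vectorized parse of the strings, then seconds since midnight.
parsed = pd.to_datetime(df['ts'], format='%H:%M:%S.%f')
seconds = (parsed - parsed.dt.normalize()).dt.total_seconds()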
I'm guessing that the datetime object has a lot of overhead - it may be easier to do it by hand:
def to_seconds(s):
    hr, min, sec = [float(x) for x in s.split(':')]
    return hr * 3600 + min * 60 + sec
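You could then map it over the timestamp column, e.g. (assuming the timestamps sit in the first column of the frame read by read_csv, as in the question):
df['seconds'] = df.iloc[:, 0].map(to_seconds)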
Using sum() and enumerate():
>>> ts = '11:52:55.162'
>>> ts1 = map(float, ts.split(':'))
>>> ts1
[11.0, 52.0, 55.162]
>>> ts2 = [60**(2-i)*n for i, n in enumerate(ts1)]
>>> ts2
[39600.0, 3120.0, 55.162]
>>> ts3 = sum(ts2)
>>> ts3
42775.162
>>> seconds = sum(60**(2-i)*n for i, n in enumerate(map(float, ts.split(':'))))
>>> seconds
42775.162
>>>
I am looking for a way to convert a DataFrame to a TimeSeries without splitting the index and value columns. Any ideas? Thanks.
In [20]: import pandas as pd
In [21]: import numpy as np
In [22]: dates = pd.date_range('20130101',periods=6)
In [23]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [24]: df
Out[24]:
A B C D
2013-01-01 -0.119230 1.892838 0.843414 -0.482739
2013-01-02 1.204884 -0.942299 -0.521808 0.446309
2013-01-03 1.899832 0.460871 -1.491727 -0.647614
2013-01-04 1.126043 0.818145 0.159674 -1.490958
2013-01-05 0.113360 0.190421 -0.618656 0.976943
2013-01-06 -0.537863 -0.078802 0.197864 -1.414924
In [25]: pd.Series(df)
Out[25]:
0 A
1 B
2 C
3 D
dtype: object
I know this is late to the game here, but a few points.
What determines whether a DataFrame is considered a TimeSeries is the type of its index. In your case, your index is already a DatetimeIndex, so you are good to go. For more information on all the cool slicing you can do with a pd.timeseries index, take a look at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetime-indexing
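For example, a small sketch of that datetime-based indexing (the dates here are just illustrative):
df.loc['2013-01']                    # partial-string indexing: every row in Jan 2013
df.loc['2013-01-02':'2013-01-04']    # slice of rows by date range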
Now, others might arrive here because they have a column 'DateTime' that they want to make an index, in which case the answer is simple
ts = df.set_index('DateTime')
Here is one possibility
In [3]: df
Out[3]:
A B C D
2013-01-01 -0.024362 0.712035 -0.913923 0.755276
2013-01-02 2.624298 0.285546 0.142265 -0.047871
2013-01-03 1.315157 -0.333630 0.398759 -1.034859
2013-01-04 0.713141 -0.109539 0.263706 -0.588048
2013-01-05 -1.172163 -1.387645 -0.171854 -0.458660
2013-01-06 -0.192586 0.480023 -0.530907 -0.872709
In [4]: df.unstack()
Out[4]:
A 2013-01-01 -0.024362
2013-01-02 2.624298
2013-01-03 1.315157
2013-01-04 0.713141
2013-01-05 -1.172163
2013-01-06 -0.192586
B 2013-01-01 0.712035
2013-01-02 0.285546
2013-01-03 -0.333630
2013-01-04 -0.109539
2013-01-05 -1.387645
2013-01-06 0.480023
C 2013-01-01 -0.913923
2013-01-02 0.142265
2013-01-03 0.398759
2013-01-04 0.263706
2013-01-05 -0.171854
2013-01-06 -0.530907
D 2013-01-01 0.755276
2013-01-02 -0.047871
2013-01-03 -1.034859
2013-01-04 -0.588048
2013-01-05 -0.458660
2013-01-06 -0.872709
dtype: float64
I've been working with Pandas to calculate the age of a sportsman on the date of a particular fixture, although it's returned as a TimeSeries of timedeltas.
I'd now like to be able to plot age (in days) against the fixture dates, but can't work out how to turn the TimeSeries values into integers. What can I try next?
This is the shape of the data.
squad_date['mean_age']
2008-08-16 11753 days, 0:00:00
2008-08-23 11760 days, 0:00:00
2008-08-30 11767 days, 0:00:00
2008-09-14 11782 days, 0:00:00
2008-09-20 11788 days, 0:00:00
This is what I would like:
2008-08-16 11753
2008-08-23 11760
2008-08-30 11767
2008-09-14 11782
2008-09-20 11788
For people who find this post via Google: if you have numpy >= 1.7 and pandas 0.11, these solutions will not work. What does work:
squad_date['mean_age'].apply(lambda x: x / np.timedelta64(1,'D'))
The official Pandas documentation can be confusing here. It suggests doing "x.item()", where x is already a timedelta object.
x.item() retrieves the difference as an int value from the timedelta object; if the unit were 'ns', you would get an int with the number of nanoseconds, for example. That would then give an "integer divide by a timedelta" error; dividing the timedeltas directly by each other does work (and converts the result to days, per the 'D' in the second argument).
I hope this will help someone in the future!
You need to be on master for this (0.11-dev):
In [40]: x = pd.date_range('20130101',periods=5)
In [41]: td = pd.Series(x,index=x)-pd.Timestamp('20130101')
In [43]: td
Out[43]:
2013-01-01 00:00:00
2013-01-02 1 days, 00:00:00
2013-01-03 2 days, 00:00:00
2013-01-04 3 days, 00:00:00
2013-01-05 4 days, 00:00:00
Freq: D, Dtype: timedelta64[ns]
In [44]: td.apply(lambda x: x.item().days)
Out[44]:
2013-01-01 0
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
Freq: D, Dtype: int64
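On more recent pandas versions the .dt accessor makes this simpler (a sketch, assuming the same td as above):
td.dt.days   # integer number of whole days in each timedelta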
The way I did it:
def conv_delta_to_int(dt):
    return int(str(dt).split(" ")[0].replace(",", ""))
squad_date['mean_age'] = map(conv_delta_to_int, squad_date['mean_age'])