Errors with numpy unwrap when using long arrays - python

I am trying to use the function numpy.unwrap to correct some phase data.
I have a long vector with 2678399 records which contains the difference in radians between two angles. The array contains NaN values, although I think this is not relevant, as unwrap is applied to each record independently.
When I apply unwrap, from around record 400 onwards it generates NaN values in the rest of the array.
If I apply np.unwrap to just one slice of the original array, it works fine.
Is this a possible bug in this function?
d90dif = (df2['d90'] - df2['d90avg']) * (np.pi/180)  # difference between the two angles in radians
df2['d90dif'] = np.unwrap(d90dif.values)  # unwrap the array to create a new column
Just to illustrate the problem:
d90dif[700:705]  # angle difference for some records
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 0.087808
2013-01-01 00:11:45 0.017995
dtype: float64
df2['d90dif'][700:705]  # results with unwrap for these records
2013-01-01 00:11:41 NaN
2013-01-01 00:11:42 NaN
2013-01-01 00:11:43 NaN
2013-01-01 00:11:44 NaN
2013-01-01 00:11:45 NaN
Name: d90dif, dtype: float64
Now I repeat the process with a small array:
test=d90dif[700:705]
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 0.087808
2013-01-01 00:11:45 0.017995
dtype: float64
unw=np.unwrap(test.values)
array([ 0.08780774, 0.05290116, 0.00054128, 0.08780774, 0.01799457])
Now it is OK. If I call unwrap() with a DataFrame input, it works fine as well.

By looking at the documentation of unwrap, it seems that NaN would indeed have an effect, since the function looks at the differences of adjacent elements to detect jumps in the phase.

It seems that the NaN values play an important role:
test
2013-01-01 00:11:41 0.087808
2013-01-01 00:11:42 0.052901
2013-01-01 00:11:43 0.000541
2013-01-01 00:11:44 NaN
2013-01-01 00:11:45 0.017995
dtype: float64
If there is a NaN in the column, everything from that point onwards becomes NaN:
np.unwrap(test)
array([ 0.08780774, 0.05290116, 0.00054128, nan, nan])
I would say this is a bug but...
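Indeed: np.unwrap accumulates corrections from the differences of adjacent elements, so a single NaN propagates into every element after it. A possible workaround (just a sketch, reusing the d90dif and df2 names from above) is to unwrap only the non-NaN samples and write the result back, keeping in mind that this treats the samples on either side of a gap as adjacent:
import numpy as np
raw = d90dif.values.astype(float)
mask = ~np.isnan(raw)                    # positions with valid data
unwrapped = np.full_like(raw, np.nan)
unwrapped[mask] = np.unwrap(raw[mask])   # unwrap only the valid samples
df2['d90dif'] = unwrapped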

Related

How can I interpolate missing values of a time series in a pythonic way? [duplicate]

I have a dataframe with DatetimeIndex. This is one of columns:
>>> y.out_brd
2013-01-01 11:25:00 0.04464286
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 0.005952381
2013-01-01 11:45:00 0.01785714
2013-01-01 11:50:00 0.008928571
Freq: 5T, Name: out_brd, dtype: object
When I try to use interpolate() on the column, absolutely nothing changes:
>>> y.out_brd.interpolate(method='time')
2013-01-01 11:25:00 0.04464286
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 0.005952381
2013-01-01 11:45:00 0.01785714
2013-01-01 11:50:00 0.008928571
Freq: 5T, Name: out_brd, dtype: object
How to make it work?
Update:
Here is the code for generating such a dataframe:
time_index = pd.date_range(start=datetime(2013, 1, 1, 3),
                           end=datetime(2013, 1, 2, 2, 59),
                           freq='5T')
grid_columns = [u'in_brd', u'in_alt', u'out_brd', u'out_alt']
df = pd.DataFrame(index=time_index, columns=grid_columns)
After that I fill cells with some data.
I have dataframe field_data with survey data about boarding and alighting on railroad, and station variable.
I also have interval_end function defined like this:
interval_end = lambda index, prec_lvl: index.to_datetime() \
    + timedelta(minutes=prec_lvl - 1, seconds=59)
The code:
for index, row in df.iterrows():
    recs = field_data[(field_data.station_name == station)
                      & (field_data.arrive_time >= index.time())
                      & (field_data.arrive_time <= interval_end(index, prec_lvl).time())]
    in_recs_num = recs[recs.orientation == u'in'][u'train_number'].count()
    out_recs_num = recs[recs.orientation == u'out'][u'train_number'].count()
    if in_recs_num:
        df.loc[index, u'in_brd'] = recs[recs.orientation == u'in'][u'boarding'].sum() / \
            (in_recs_num * CAR_CAPACITY)
        df.loc[index, u'in_alt'] = recs[recs.orientation == u'in'][u'alighting'].sum() / \
            (in_recs_num * CAR_CAPACITY)
    if out_recs_num:
        df.loc[index, u'out_brd'] = recs[recs.orientation == u'out'][u'boarding'].sum() / \
            (out_recs_num * CAR_CAPACITY)
        df.loc[index, u'out_alt'] = recs[recs.orientation == u'out'][u'alighting'].sum() / \
            (out_recs_num * CAR_CAPACITY)
You need to convert your Series to have a dtype of float64 instead of your current object dtype. Here's an example to illustrate the difference. Note that, in general, object dtype Series are of limited use, the most common case being a Series containing strings. Other than that, they are very slow, since they cannot take advantage of any data type information.
In [9]: s = Series(randn(6), index=pd.date_range('2013-01-01 11:25:00', freq='5T', periods=6), dtype=object)
In [10]: s.iloc[1:3] = nan
In [11]: s
Out[11]:
2013-01-01 11:25:00 -0.69522
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 -0.70308
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.95893
Freq: 5T, dtype: object
In [12]: s.interpolate(method='time')
Out[12]:
2013-01-01 11:25:00 -0.69522
2013-01-01 11:30:00 NaN
2013-01-01 11:35:00 NaN
2013-01-01 11:40:00 -0.70308
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.95893
Freq: 5T, dtype: object
In [13]: s.astype(float).interpolate(method='time')
Out[13]:
2013-01-01 11:25:00 -0.6952
2013-01-01 11:30:00 -0.6978
2013-01-01 11:35:00 -0.7005
2013-01-01 11:40:00 -0.7031
2013-01-01 11:45:00 -1.5653
2013-01-01 11:50:00 0.9589
Freq: 5T, dtype: float64
I am late, but this solved my problem.
You need to assign the outcome to some variable or back to the column itself:
y = y.out_brd.interpolate(method='time')
You could also fix this without changing the name of the data frame by using the inplace argument:
y.out_brd.interpolate(method='time', inplace=True)
The short answer from Phillip above, which I missed the first time and came back to add: you need to have a float series:
s.astype(float).interpolate(method='time')
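Putting the pieces together for the original question (a sketch, assuming df is the DataFrame built above and out_brd the object-dtype column): cast to float first, interpolate, and assign the result back to the column rather than to y:
df['out_brd'] = df['out_brd'].astype(float).interpolate(method='time')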

Create overlapping groups with pandas timegrouper

I am using the pandas TimeGrouper to group datapoints in a pandas DataFrame in Python:
grouped = data.groupby(pd.TimeGrouper('30S'))
I would like to know if there's a way to achieve window overlap, as suggested in this question: Window overlap in Pandas, while keeping the pandas DataFrame as the data structure.
Update: I timed the three solutions proposed below, and the rolling mean seems fastest:
%timeit df.groupby(pd.TimeGrouper('30s',closed='right')).mean()
%timeit df.resample('30s',how='mean',closed='right')
%timeit pd.rolling_mean(df,window=30).iloc[29::30]
yields:
1000 loops, best of 3: 336 µs per loop
1000 loops, best of 3: 349 µs per loop
1000 loops, best of 3: 199 µs per loop
Create some data, exactly 3 x 30s long:
In [51]: df = DataFrame(randn(90,2),columns=list('AB'),index=date_range('20130101 9:01:01',freq='s',periods=90))
Using a TimeGrouper in this way is equivalent to resample (and that's what resample actually does).
Note that I used closed='right' to make sure that exactly 30 observations are included:
In [57]: df.groupby(pd.TimeGrouper('30s',closed='right')).mean()
Out[57]:
A B
2013-01-01 09:01:00 -0.214968 -0.162200
2013-01-01 09:01:30 -0.090708 -0.021484
2013-01-01 09:02:00 -0.160335 -0.135074
In [52]: df.resample('30s',how='mean',closed='right')
Out[52]:
A B
2013-01-01 09:01:00 -0.214968 -0.162200
2013-01-01 09:01:30 -0.090708 -0.021484
2013-01-01 09:02:00 -0.160335 -0.135074
This is also equivalent if you then pick out the 30s intervals
In [55]: pd.rolling_mean(df,window=30).iloc[28:40]
Out[55]:
A B
2013-01-01 09:01:29 NaN NaN
2013-01-01 09:01:30 -0.214968 -0.162200
2013-01-01 09:01:31 -0.150401 -0.180492
2013-01-01 09:01:32 -0.160755 -0.142534
2013-01-01 09:01:33 -0.114918 -0.181424
2013-01-01 09:01:34 -0.098945 -0.221110
2013-01-01 09:01:35 -0.052450 -0.169884
2013-01-01 09:01:36 -0.011172 -0.185132
2013-01-01 09:01:37 0.100843 -0.178179
2013-01-01 09:01:38 0.062554 -0.097637
2013-01-01 09:01:39 0.048834 -0.065808
2013-01-01 09:01:40 0.003585 -0.059181
So depending on what you want to achieve, it's easy to do an overlap by using rolling_mean
and then picking out whatever 'frequency' you want. E.g. here is a 5s resample with a 30s window:
In [61]: pd.rolling_mean(df,window=30)[9::5]
Out[61]:
A B
2013-01-01 09:01:10 NaN NaN
2013-01-01 09:01:15 NaN NaN
2013-01-01 09:01:20 NaN NaN
2013-01-01 09:01:25 NaN NaN
2013-01-01 09:01:30 -0.214968 -0.162200
2013-01-01 09:01:35 -0.052450 -0.169884
2013-01-01 09:01:40 0.003585 -0.059181
2013-01-01 09:01:45 -0.055886 -0.111228
2013-01-01 09:01:50 -0.110191 -0.045032
2013-01-01 09:01:55 0.093662 -0.036177
2013-01-01 09:02:00 -0.090708 -0.021484
2013-01-01 09:02:05 -0.286759 0.020365
2013-01-01 09:02:10 -0.273221 -0.073886
2013-01-01 09:02:15 -0.222720 -0.038865
2013-01-01 09:02:20 -0.175630 0.001389
2013-01-01 09:02:25 -0.301671 -0.025603
2013-01-01 09:02:30 -0.160335 -0.135074
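Note that pd.rolling_mean and resample's how= argument were removed in later pandas releases; on current versions the same overlap idea can be written with the .rolling() method. A rough sketch using the same random data as above:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(90, 2), columns=list('AB'),
                  index=pd.date_range('20130101 9:01:01', freq='s', periods=90))
block_means = df.resample('30s', closed='right').mean()  # non-overlapping 30s means
overlapping = df.rolling(window=30).mean().iloc[29::5]   # overlapping 30s windows, every 5s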

How to convert TimeSeries object in pandas into integer?

I've been working with pandas to calculate the age of a sportsman on a particular fixture date, but the result is returned as a TimeSeries of timedeltas.
I'd now like to be able to plot age (in days) against the fixture dates, but can't work out how to turn the TimeSeries object into integers. What can I try next?
This is the shape of the data.
squad_date['mean_age']
2008-08-16 11753 days, 0:00:00
2008-08-23 11760 days, 0:00:00
2008-08-30 11767 days, 0:00:00
2008-09-14 11782 days, 0:00:00
2008-09-20 11788 days, 0:00:00
This is what I would like:
2008-08-16 11753
2008-08-23 11760
2008-08-30 11767
2008-09-14 11782
2008-09-20 11788
For people who find this post via Google: if you have numpy >= 1.7 and pandas 0.11, the other solutions here will not work. What does work:
squad_date['mean_age'].apply(lambda x: x / np.timedelta64(1,'D'))
The official pandas documentation can be confusing here. It suggests doing x.item(), where x is already a timedelta object.
x.item() retrieves the difference as an int from the timedelta object; if the unit is 'ns', for example, you get the number of nanoseconds as an int. Dividing that integer by a timedelta then raises an error, whereas dividing the timedeltas directly by each other does work (and converts the result to days, as specified by the 'D' in the second argument).
I hope this will help someone in the future!
You need to be on master for this (0.11-dev):
In [40]: x = pd.date_range('20130101',periods=5)
In [41]: td = pd.Series(x,index=x)-pd.Timestamp('20130101')
In [43]: td
Out[43]:
2013-01-01 00:00:00
2013-01-02 1 days, 00:00:00
2013-01-03 2 days, 00:00:00
2013-01-04 3 days, 00:00:00
2013-01-05 4 days, 00:00:00
Freq: D, Dtype: timedelta64[ns]
In [44]: td.apply(lambda x: x.item().days)
Out[44]:
2013-01-01 0
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
Freq: D, Dtype: int64
The way I did it:
def conv_delta_to_int(dt):
    return int(str(dt).split(" ")[0].replace(",", ""))

squad_date['mean_age'] = map(conv_delta_to_int, squad_date['mean_age'])
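On current pandas versions a timedelta64[ns] column exposes a .dt accessor, so the conversion needs neither apply nor string parsing. A sketch reusing the squad_date and mean_age names from the question (the new column name is just illustrative):
squad_date['mean_age_days'] = squad_date['mean_age'].dt.days
# or, keeping fractional days:
# squad_date['mean_age_days'] = squad_date['mean_age'] / pd.Timedelta(days=1)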

What representation should I use in Pandas for data valid throughout an interval?

I have a series of hourly prices. Each price is valid throughout the whole 1-hour period. What is the best way to represent these prices in pandas so that I can index them at arbitrary higher frequencies (such as minutes or seconds) and do arithmetic with them?
Data specifics
Sample prices might be:
>>> prices = Series(randn(5), pd.date_range('2013-01-01 12:00', periods = 5, freq='H'))
>>> prices
2013-01-01 12:00:00 -1.001692
2013-01-01 13:00:00 -1.408082
2013-01-01 14:00:00 -0.329637
2013-01-01 15:00:00 1.005882
2013-01-01 16:00:00 1.202557
Freq: H
Now, what representation should I use if I want the value at 13:37:42 (I expect it to be the same as at 13:00)?
>>> prices['2013-01-01 13:37:42']
...
KeyError: <Timestamp: 2013-01-01 13:37:42>
Resampling
I know I could resample the prices and fill in the details (ffill, right?), but that doesn't seem like such a nice solution, because I have to assume the frequency I'm going to be indexing it at and it reduces readability with too many unnecessary data points.
Time spans
At first glance a PeriodIndex seems to work
>>> price_periods = prices.to_period()
>>> price_periods['2013-01-01 13:37:42']
-1.408082
But a time-spanned series doesn't offer some of the other functionality I expect from a Series. Say I have another series, amounts, that says how many items I bought at a certain moment. If I wanted to calculate the cost, I would want to multiply the two series:
>>> amounts = Series([1,2,2], pd.DatetimeIndex(['2013-01-01 13:37', '2013-01-01 13:57', '2013-01-01 14:05']))
>>> amounts*price_periods
but that yields an exception and sometimes even freezes my IPython Notebook. Indexing doesn't help either:
>>> price_periods[amounts.index]
Are PeriodIndex structures still a work in progress, or are these features not going to be added? Is there maybe some other structure I should have used (or should use for now, before PeriodIndex matures)? I'm using pandas version 0.9.0.dev-1e68fd9.
Check asof
prices.asof('2013-01-01 13:37:42')
returns the value for the previous available datetime:
prices['2013-01-01 13:00:00']
To make calculations, you can use:
prices.asof(amounts.index) * amounts
which returns a Series with amounts' index and the respective values:
>>> prices
2013-01-01 12:00:00 0.943607
2013-01-01 13:00:00 -1.019452
2013-01-01 14:00:00 -0.279136
2013-01-01 15:00:00 1.013548
2013-01-01 16:00:00 0.929920
>>> prices.asof(amounts.index)
2013-01-01 13:37:00 -1.019452
2013-01-01 13:57:00 -1.019452
2013-01-01 14:05:00 -0.279136
>>> prices.asof(amounts.index) * amounts
2013-01-01 13:37:00 -1.019452
2013-01-01 13:57:00 -2.038904
2013-01-01 14:05:00 -0.558272
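For reference, a self-contained sketch of the asof approach, with random numbers standing in for real prices:
import numpy as np
import pandas as pd
prices = pd.Series(np.random.randn(5),
                   pd.date_range('2013-01-01 12:00', periods=5, freq='H'))
amounts = pd.Series([1, 2, 2],
                    pd.DatetimeIndex(['2013-01-01 13:37',
                                      '2013-01-01 13:57',
                                      '2013-01-01 14:05']))
cost = prices.asof(amounts.index) * amounts  # last known price at each purchase time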

How to get the correlation between two timeseries using Pandas

I have two sets of temperature data, which have readings at regular (but different) time intervals. I'm trying to get the correlation between these two sets of data.
I've been playing with pandas to try to do this. I've created two time series, and am using TimeSeriesA.corr(TimeSeriesB). However, if the times in the two time series do not match up exactly (they're generally off by seconds), I get null as an answer. I could get a decent answer if I could:
a) Interpolate/fill missing times in each TimeSeries (I know this is possible in Pandas, I just don't know how to do it)
b) strip the seconds out of python datetime objects (Set seconds to 00, without changing minutes). I'd lose a degree of accuracy, but not a huge amount
c) Use something else in Pandas to get the correlation between two timeSeries
d) Use something in python to get the correlation between two lists of floats, each float having a corresponding datetime object, taking into account the time.
Anyone have any suggestions?
You have a number of options using pandas, but you have to make a decision about how it makes sense to align the data given that they don't occur at the same instants.
Use the values "as of" the times in one of the time series, here's an example:
In [15]: ts
Out[15]:
2000-01-03 00:00:00 -0.722808451504
2000-01-04 00:00:00 0.0125041039477
2000-01-05 00:00:00 0.777515530539
2000-01-06 00:00:00 -0.35714026263
2000-01-07 00:00:00 -1.55213541118
2000-01-10 00:00:00 -0.508166334892
2000-01-11 00:00:00 0.58016097981
2000-01-12 00:00:00 1.50766289013
2000-01-13 00:00:00 -1.11114968643
2000-01-14 00:00:00 0.259320239297
In [16]: ts2
Out[16]:
2000-01-03 00:00:30 1.05595278907
2000-01-04 00:00:30 -0.568961755792
2000-01-05 00:00:30 0.660511172645
2000-01-06 00:00:30 -0.0327384421979
2000-01-07 00:00:30 0.158094407533
2000-01-10 00:00:30 -0.321679671377
2000-01-11 00:00:30 0.977286027619
2000-01-12 00:00:30 -0.603541295894
2000-01-13 00:00:30 1.15993249209
2000-01-14 00:00:30 -0.229379534767
You can see these are off by 30 seconds. The reindex function enables you to align the data while forward-filling values (getting the "as of" value):
In [17]: ts.reindex(ts2.index, method='pad')
Out[17]:
2000-01-03 00:00:30 -0.722808451504
2000-01-04 00:00:30 0.0125041039477
2000-01-05 00:00:30 0.777515530539
2000-01-06 00:00:30 -0.35714026263
2000-01-07 00:00:30 -1.55213541118
2000-01-10 00:00:30 -0.508166334892
2000-01-11 00:00:30 0.58016097981
2000-01-12 00:00:30 1.50766289013
2000-01-13 00:00:30 -1.11114968643
2000-01-14 00:00:30 0.259320239297
In [18]: ts2.corr(ts.reindex(ts2.index, method='pad'))
Out[18]: -0.31004148593302283
Note that 'pad' is also aliased by 'ffill' (but only in the very latest version of pandas on GitHub as of this time!).
Strip the seconds out of all your datetimes. The best way to do this is to use rename:
In [25]: ts2.rename(lambda date: date.replace(second=0))
Out[25]:
2000-01-03 00:00:00 1.05595278907
2000-01-04 00:00:00 -0.568961755792
2000-01-05 00:00:00 0.660511172645
2000-01-06 00:00:00 -0.0327384421979
2000-01-07 00:00:00 0.158094407533
2000-01-10 00:00:00 -0.321679671377
2000-01-11 00:00:00 0.977286027619
2000-01-12 00:00:00 -0.603541295894
2000-01-13 00:00:00 1.15993249209
2000-01-14 00:00:00 -0.229379534767
Note that if rename causes there to be duplicate dates an Exception will be thrown.
For something a little more advanced, suppose you wanted to correlate the mean value for each minute (where you have multiple observations per second):
In [31]: ts_mean = ts.groupby(lambda date: date.replace(second=0)).mean()
In [32]: ts2_mean = ts2.groupby(lambda date: date.replace(second=0)).mean()
In [33]: ts_mean.corr(ts2_mean)
Out[33]: -0.31004148593302283
These last code snippets may not work if you don't have the latest code from https://github.com/wesm/pandas. If .mean() doesn't work on a GroupBy object as above, try .agg(np.mean).
Hope this helps!
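A self-contained sketch of the "as of" alignment for recent pandas versions, where 'ffill' is a standard alias for 'pad' (random data stands in for the temperature readings):
import numpy as np
import pandas as pd
idx = pd.bdate_range('2000-01-03', periods=10)
ts = pd.Series(np.random.randn(10), index=idx)
ts2 = pd.Series(np.random.randn(10), index=idx + pd.Timedelta(seconds=30))
aligned = ts.reindex(ts2.index, method='ffill')  # carry the last known value forward
print(ts2.corr(aligned))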
By shifting your timestamps you might be losing some accuracy. Instead, you can perform an outer join on your time series, filling NaN values with 0, so that you keep all the timestamps (whether a timestamp is shared or belongs to only one of the datasets). Then you can run the correlation function on the columns of your new dataset, which will give you the result you are looking for without losing accuracy. This is my code from when I was working with time series:
t12 = t1.join(t2, lsuffix='_t1', rsuffix='_t2', how='outer').fillna(0)
t12.corr()
This way you will have all timestamps.
