Python pandas resample added dates not present in the original data

I am using pandas to convert intraday data, stored in data_m, to daily data. For some reason resample added rows for days that were not present in the intraday data. For example, 1/8/2000 is not in the intraday data, yet the daily data contains a row for that date with NaN as the value. The DatetimeIndex has more entries than the actual data. Am I doing anything wrong?
data_m.resample('D', how='mean').head()
Out[13]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
2000-01-08 NaN
data_m.resample('D', how='mean')
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4729 entries, 2000-01-04 00:00:00 to 2012-12-14 00:00:00
Freq: D
Data columns:
x 3241 non-null values
dtypes: float64(1)

What you are doing looks correct, it's just that pandas gives NaN for the mean of an empty array.
In [1]: Series().mean()
Out[1]: nan
resample converts to a regular time interval, so if there are no samples that day you get NaN.
Most of the time having NaN isn't a problem. If it is, you can either use fill_method (for example 'ffill'), or, if you really want to remove them, you can use dropna (not recommended):
data_m.resample('D', how='mean', fill_method='ffill')
data_m.resample('D', how='mean').dropna()
Update: The modern equivalent seems to be:
In [21]: s.resample("D").mean().ffill()
Out[21]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
2000-01-08 8780.037433
In [22]: s.resample("D").mean().dropna()
Out[22]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
See resample docs.

Prior to 0.10.0, pandas labeled resample bins with the right-most edge, which, for daily resampling, is the next day. Starting with 0.10.0, the default binning behavior for daily and higher frequencies changed to label='left', closed='left' to minimize this confusion. See http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#api-changes for more information.
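As a quick illustration of the label/closed behavior described above, here is a minimal sketch with made-up hourly data (not the asker's data_m):
import numpy as np
import pandas as pd

# hypothetical intraday series covering 2000-01-04 only
idx = pd.date_range("2000-01-04 09:00", periods=6, freq="H")
s = pd.Series(np.arange(6.0), index=idx)

print(s.resample("D").mean())                                 # default: bin labeled 2000-01-04 (left edge)
print(s.resample("D", label="right", closed="right").mean())  # pre-0.10.0-style labeling: 2000-01-05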

Related

How to filter specific months from a Pandas datetime index

I have a daily dataset which ranges from 2000 to 2010. I already set the column 'GregDate' via
df.set_index(pd.DatetimeIndex(df['GregDate']))
as the index. Now I only want to investigate the months from November through March (for all ten years).
My dataframe looks like this:
Sigma Lat Lon
GregDate
2000-01-01 -5.874062 79.913437 -74.583125
2000-01-02 -6.794000 79.904000 -74.604000
2000-01-03 -5.826061 79.923939 -74.548485
2000-01-04 -5.702439 79.916829 -74.641707
...
2009-07-11 -10.727381 79.925952 -74.660714
2009-07-12 -10.648000 79.923667 -74.557333
2009-07-13 -11.123095 79.908810 -74.596190
[3482 rows x 3 columns]
I already had a look at this question, but I am still not able to solve my problem.
I think you need boolean indexing with DatetimeIndex.month and Index.isin:
df = df[df.index.month.isin([11,12,1,2,3])]
print (df)
Sigma Lat Lon
GregDate
2000-01-01 -5.874062 79.913437 -74.583125
2000-01-02 -6.794000 79.904000 -74.604000
2000-01-03 -5.826061 79.923939 -74.548485
2000-01-04 -5.702439 79.916829 -74.641707
In [10]: df.query("index.dt.month in [11,12,1,2,3]")
Out[10]:
Sigma Lat Lon
GregDate
2000-01-01 -5.874062 79.913437 -74.583125
2000-01-02 -6.794000 79.904000 -74.604000
2000-01-03 -5.826061 79.923939 -74.548485
2000-01-04 -5.702439 79.916829 -74.641707
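As a self-contained sketch (with synthetic data rather than the asker's file), the boolean month filter can be reproduced like this:
import numpy as np
import pandas as pd

# synthetic daily frame standing in for the asker's data
idx = pd.date_range("2000-01-01", "2009-12-31", freq="D")
df = pd.DataFrame({"Sigma": np.random.randn(len(idx))}, index=idx)
df.index.name = "GregDate"

winter = df[df.index.month.isin([11, 12, 1, 2, 3])]
print(sorted(winter.index.month.unique()))   # only months 1, 2, 3, 11 and 12 remain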

Fancy Index in Pandas Panels

Suppose we have the following Panel:
companies = ["GOOG", "YHOO", "AMZN", "MSFT", "AAPL"]
p = data.DataReader(name = companies, data_source="google", start = "2013-01-01", end = "2017-02-22")
I want to extract the values of "Low" for "MSFT" for the two dates "2013-01-02" and "2013-01-08". I tried several approaches; some of them work and some do not. Here are those methods:
Using .ix[] method
p.ix["Low", [0,4], "MSFT"] and the result is:
Date
2013-01-02 27.15
2013-01-08 26.46
Name: MSFT, dtype: float64
So it works, no problem at all.
Using .iloc[] method
p.iloc[2, [0,4], 3] and it also works.
Date
2013-01-02 27.15
2013-01-08 26.46
Name: MSFT, dtype: float64
Using the .ix[] method again, but in a different way
p.ix["Low", ["2013-01-02", "2013-01-08"], "MSFT"] and it returns weird result as:
Date
2013-01-02 NaN
2013-01-08 NaN
Name: MSFT, dtype: float64
Using .loc[] method
p.loc["Low", ["2013-01-02", "2013-01-08"], "MSFT"] and this time an error raised
KeyError: "None of [['2013-01-02', '2013-01-08']] are in the [index]"
Methods 1 and 2 work, and they are pretty straightforward. However, I don't understand why the 3rd method returns NaN values and the 4th method raises an error.
Using the following
In [116]: wp = pd.Panel(np.random.randn(2, 5, 4), items=['Low', 'High'],
.....: major_axis=pd.date_range('1/1/2000', periods=5),
.....: minor_axis=['A', 'B', 'C', 'D'])
It's good to remember that:
.loc uses the labels in the index
.iloc uses integer positioning in the index
.ix tries acting like .loc but falls back to .iloc if it fails, so we can focus only on the .ix version
If I do wp.ix['Low'] I get
A B C D
2000-01-01 -0.864402 0.559969 1.226582 -1.090447
2000-01-02 0.288341 -0.786711 -0.662960 0.613778
2000-01-03 1.712770 1.393537 -2.230170 -0.082778
2000-01-04 -1.297067 1.076110 -1.384226 1.824781
2000-01-05 1.268253 -2.185574 0.090986 0.464095
Now if you want to access the data for 2000-01-01 through 2000-01-03, you need to use the syntax
wp.loc['Low','2000-01-01':'2000-01-03']
which returns
A B C D
2000-01-01 -0.864402 0.559969 1.226582 -1.090447
2000-01-02 0.288341 -0.786711 -0.662960 0.613778
2000-01-03 1.712770 1.393537 -2.230170 -0.082778
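As for methods 3 and 4, a likely fix (a sketch, assuming a pandas version older than 0.25, where Panel still exists) is to pass real Timestamps instead of plain strings when selecting a list of dates on the major axis:
import numpy as np
import pandas as pd

wp = pd.Panel(np.random.randn(2, 5, 4), items=['Low', 'High'],
              major_axis=pd.date_range('1/1/2000', periods=5),
              minor_axis=['A', 'B', 'C', 'D'])

dates = pd.to_datetime(['2000-01-01', '2000-01-03'])   # Timestamps, not plain strings
print(wp.loc['Low', dates, 'A'])                        # two values, no NaN and no KeyError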

Pandas: Group by year and plot density

I have a data frame that contains some time-based data:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].mean()
date
2001-01-01 0.567128
2002-01-01 0.581349
2003-01-01 0.556646
2004-01-01 0.549128
2005-01-01 NaN
2006-01-01 0.536796
2007-01-01 0.513109
2008-01-01 0.525859
2009-01-01 0.530433
2010-01-01 0.499250
2011-01-01 0.488159
2012-01-01 0.493405
2013-01-01 0.530207
Freq: AS-JAN, Name: INC_RANK, dtype: float64
And now I would like to plot the density for each year. The following command used to work for other data frames, but it does not work here:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density')
ValueError: ordinal must be >= 1
Here's what that column looks like:
>>> temp['INC_RANK'].head()
date
2001-01-01 0.516016
2001-01-01 0.636038
2001-01-01 0.959501
2001-01-01 NaN
2001-01-01 0.433824
Name: INC_RANK, dtype: float64
I think it is due to the NaNs in your data, as a density cannot be estimated from NaNs. However, since you want to visualize the density, it should not be a big issue to simply drop the missing values, assuming the missing/unobserved cells follow the same distribution as the observed/non-missing cells. Therefore, df.dropna().groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density') should suffice.
On the other hand, if the missing values are not 'unobserved' but rather are values outside the measuring range (say, data from a temperature sensor that reads 0-50F: when a 100F temperature is encountered, the sensor sends out an error code and the reading is recorded as a missing value), then dropna() is probably not a good idea.
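In current pandas, pd.TimeGrouper has been removed and pd.Grouper is the replacement; here is a minimal sketch of the same idea with synthetic data (the density plot also requires scipy for the KDE):
import numpy as np
import pandas as pd

# synthetic stand-in for the asker's frame
idx = pd.date_range('2001-01-01', '2013-12-31', freq='D')
temp = pd.DataFrame({'INC_RANK': np.random.rand(len(idx))}, index=idx)

temp['INC_RANK'].dropna().groupby(pd.Grouper(freq='AS')).plot(kind='density')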

Python pandas: how to plot only the data points a DataFrame actually has and leave the gaps out

I have a DataFrame with intraday data indexed with a DatetimeIndex:
df1 =pd.DataFrame(np.random.randn(6,4),index=pd.date_range('1/1/2000',periods=6, freq='1h'))
df2 =pd.DataFrame(np.random.randn(6,4),index=pd.date_range('1/2/2000',periods=6, freq='1h'))
df3 = df1.append(df2)
so, as can be seen, there is a big gap between the two days in df3
df3.plot()
will plot every single hour from 2000-01-01 00:00:00 to 2000-01-02 05:00:00, even though there are actually no data points from 2000-01-01 06:00:00 to 2000-01-02 00:00:00.
How can I leave those missing points out of the plot so that the span from 2000-01-01 06:00:00 to 2000-01-02 00:00:00 is not plotted?
This seems to have been under discussion for some time on Google Groups:
Pandas Intraday Time Series plots
One way to do this is to resample (hourly) before you plot:
df3.resample('H').plot()
Note: this ensures you have NaN values between the real values, which are then not plotted (rather than connected). It also means you are storing more data, which may be an issue.
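A sketch of the same idea with the current API (in modern pandas, resample returns a Resampler, so an explicit step such as .asfreq() is needed, and DataFrame.append has been removed in favor of pd.concat):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('1/1/2000', periods=6, freq='1h'))
df2 = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('1/2/2000', periods=6, freq='1h'))
df3 = pd.concat([df1, df2])

df3.resample('1h').asfreq().plot()   # NaN rows leave a visual gap instead of a connecting line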

How to get the correlation between two timeseries using Pandas

I have two sets of temperature data, which have readings at regular (but different) time intervals. I'm trying to get the correlation between these two sets of data.
I've been playing with Pandas to try to do this. I've created two time series and am using TimeSeriesA.corr(TimeSeriesB). However, if the times in the two time series do not match up exactly (they're generally off by seconds), I get null as an answer. I could get a decent answer if I could:
a) Interpolate/fill missing times in each time series (I know this is possible in Pandas, I just don't know how to do it)
b) Strip the seconds out of the Python datetime objects (set seconds to 00 without changing minutes). I'd lose a degree of accuracy, but not a huge amount
c) Use something else in Pandas to get the correlation between the two time series
d) Use something in Python to get the correlation between two lists of floats, each float having a corresponding datetime object, taking the time into account
Anyone have any suggestions?
You have a number of options using pandas, but you have to make a decision about how it makes sense to align the data given that they don't occur at the same instants.
Use the values "as of" the times in one of the time series, here's an example:
In [15]: ts
Out[15]:
2000-01-03 00:00:00 -0.722808451504
2000-01-04 00:00:00 0.0125041039477
2000-01-05 00:00:00 0.777515530539
2000-01-06 00:00:00 -0.35714026263
2000-01-07 00:00:00 -1.55213541118
2000-01-10 00:00:00 -0.508166334892
2000-01-11 00:00:00 0.58016097981
2000-01-12 00:00:00 1.50766289013
2000-01-13 00:00:00 -1.11114968643
2000-01-14 00:00:00 0.259320239297
In [16]: ts2
Out[16]:
2000-01-03 00:00:30 1.05595278907
2000-01-04 00:00:30 -0.568961755792
2000-01-05 00:00:30 0.660511172645
2000-01-06 00:00:30 -0.0327384421979
2000-01-07 00:00:30 0.158094407533
2000-01-10 00:00:30 -0.321679671377
2000-01-11 00:00:30 0.977286027619
2000-01-12 00:00:30 -0.603541295894
2000-01-13 00:00:30 1.15993249209
2000-01-14 00:00:30 -0.229379534767
You can see these are off by 30 seconds. The reindex function enables you to align the data while forward-filling values (getting the "as of" value):
In [17]: ts.reindex(ts2.index, method='pad')
Out[17]:
2000-01-03 00:00:30 -0.722808451504
2000-01-04 00:00:30 0.0125041039477
2000-01-05 00:00:30 0.777515530539
2000-01-06 00:00:30 -0.35714026263
2000-01-07 00:00:30 -1.55213541118
2000-01-10 00:00:30 -0.508166334892
2000-01-11 00:00:30 0.58016097981
2000-01-12 00:00:30 1.50766289013
2000-01-13 00:00:30 -1.11114968643
2000-01-14 00:00:30 0.259320239297
In [18]: ts2.corr(ts.reindex(ts2.index, method='pad'))
Out[18]: -0.31004148593302283
Note that 'pad' is also aliased as 'ffill' (but only in the very latest version of pandas on GitHub as of this time!).
Strip the seconds out of all your datetimes. The best way to do this is to use rename:
In [25]: ts2.rename(lambda date: date.replace(second=0))
Out[25]:
2000-01-03 00:00:00 1.05595278907
2000-01-04 00:00:00 -0.568961755792
2000-01-05 00:00:00 0.660511172645
2000-01-06 00:00:00 -0.0327384421979
2000-01-07 00:00:00 0.158094407533
2000-01-10 00:00:00 -0.321679671377
2000-01-11 00:00:00 0.977286027619
2000-01-12 00:00:00 -0.603541295894
2000-01-13 00:00:00 1.15993249209
2000-01-14 00:00:00 -0.229379534767
Note that if rename causes there to be duplicate dates, an exception will be thrown.
For something a little more advanced, suppose you wanted to correlate the mean value for each minute (where you have multiple observations per second):
In [31]: ts_mean = ts.groupby(lambda date: date.replace(second=0)).mean()
In [32]: ts2_mean = ts2.groupby(lambda date: date.replace(second=0)).mean()
In [33]: ts_mean.corr(ts2_mean)
Out[33]: -0.31004148593302283
These last code snippets may not work if you don't have the latest code from https://github.com/wesm/pandas. If .mean() doesn't work on a GroupBy object as above, try .agg(np.mean).
Hope this helps!
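For completeness, a self-contained sketch of the same "as of" alignment in current pandas, using synthetic data (method='ffill' is now the usual spelling of 'pad'):
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(10), index=pd.bdate_range('2000-01-03', periods=10))
ts2 = pd.Series(np.random.randn(10), index=ts.index + pd.Timedelta(seconds=30))

aligned = ts.reindex(ts2.index, method='ffill')   # take ts "as of" ts2's timestamps
print(ts2.corr(aligned))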
By shifting your timestamps you might be losing some accuracy. You can instead perform an outer join on your time series, filling NaN values with 0, so that you keep all the timestamps (whether shared or belonging to only one of the datasets). Then you can compute the correlation between the columns of the new dataset, which gives you the result you are looking for without losing accuracy. This is code from when I was working with time series:
t12 = t1.join(t2, lsuffix='_t1', rsuffix='_t2', how ='outer').fillna(0)
t12.corr()
This way you will have all timestamps.
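A self-contained sketch of this join-based approach, using two hypothetical frames t1 and t2 as synthetic stand-ins for the asker's temperature data:
import numpy as np
import pandas as pd

t1 = pd.DataFrame({'temp': np.random.randn(10)}, index=pd.date_range('2000-01-03', periods=10, freq='min'))
t2 = pd.DataFrame({'temp': np.random.randn(10)}, index=pd.date_range('2000-01-03 00:00:30', periods=10, freq='min'))

t12 = t1.join(t2, lsuffix='_t1', rsuffix='_t2', how='outer').fillna(0)
print(t12.corr())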
