I have a data frame that contains some time based data:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].mean()
date
2001-01-01 0.567128
2002-01-01 0.581349
2003-01-01 0.556646
2004-01-01 0.549128
2005-01-01 NaN
2006-01-01 0.536796
2007-01-01 0.513109
2008-01-01 0.525859
2009-01-01 0.530433
2010-01-01 0.499250
2011-01-01 0.488159
2012-01-01 0.493405
2013-01-01 0.530207
Freq: AS-JAN, Name: INC_RANK, dtype: float64
And now I would like to plot the density for each year. The following command has worked for other data frames, but it does not work here:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density')
ValueError: ordinal must be >= 1
Here's what that column looks like:
>>> temp['INC_RANK'].head()
date
2001-01-01 0.516016
2001-01-01 0.636038
2001-01-01 0.959501
2001-01-01 NaN
2001-01-01 0.433824
Name: INC_RANK, dtype: float64
I think it is due to the NaN values in your data, as a density cannot be estimated for NaNs. However, since you want to visualize the density, it should not be a big issue to simply drop the missing values, assuming the missing/unobserved cells follow the same distribution as the observed/non-missing cells. Therefore, df.dropna().groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density') should suffice.
On the other hand, if the missing values are not 'unobserved' but rather values outside the measuring range (say data from a temperature sensor that reads 0-50F; when a 100F temperature is encountered, the sensor sends out an error code that is recorded as a missing value), then dropna() is probably not a good idea.
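For reference, a minimal sketch of the dropna approach, assuming temp has the DatetimeIndex named date shown above (in newer pandas, pd.Grouper(freq='AS') plays the role of pd.TimeGrouper('AS'), and kind='density' needs scipy installed):
# drop the missing values, then estimate one density per year
temp['INC_RANK'].dropna().groupby(pd.Grouper(freq='AS')).plot(kind='density')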
Related
Long story short: what is an appropriate resampling freq/rule? Sometimes I get a DataFrame mostly filled with NaNs, and sometimes it works great. I thought I had a handle on it.
Below is an example,
I am processing a lot of data and was changing my resample frequency when I noticed that, for some reason, certain resample rules produce a result in which only one row per period has values and all the remaining rows are filled with NaNs.
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['date'] = pd.date_range(start='1/1/2018', end='5/08/2018')
Creating some example data,
df['data1'] = np.random.randint(1, 10, df.shape[0])
df['data2'] = np.random.randint(1, 10, df.shape[0])
df['data3'] = np.arange(len(df))
df = df.set_index('date')   # the output below uses 'date' as the index
The data looks like,
print(df.head())
print(df.shape)
data1 data2 data3
date
2018-01-01 7 7 0
2018-01-02 8 8 1
2018-01-03 2 7 2
2018-01-04 2 2 3
2018-01-05 2 5 4
(128, 3)
When I resample the data using offset aliases I get unexpected results.
Below I resample the data every 3 minutes.
resampled=df.resample('3T').mean()
print(resampled.head())
print(resampled.shape)
data1 data2 data3
date
2018-01-01 00:00:00 4.0 5.0 0.0
2018-01-01 00:03:00 NaN NaN NaN
2018-01-01 00:06:00 NaN NaN NaN
2018-01-01 00:09:00 NaN NaN NaN
2018-01-01 00:12:00 NaN NaN NaN
Most of the rows are filled with NaN besides the first. I believe this is because there are no index entries that fall inside those resampling bins. Is this correct? '24H' is the finest interval that works for this data; anything smaller leaves rows full of NaNs.
Can a dataframe be resampled for increments less than the datetime resolution?
I have had trouble in the past trying to resample a large dataset spanning over a year with the datetime index formatted as %Y:%j:%H:%M:%S (year:day-of-year:hour:minute:second). Attempting to resample every 15 or 30 days also produced very similar results with NaNs. I thought it was due to having an odd date format with no month, but df.head() showed the index with correct dates.
When you resample to a lower frequency (downsample), then
one of the possible options to compute the result is just mean().
It actually means:
The source DataFrame contains too detailed data.
You want to change the sampling frequency to some lower one and
compute e.g. a mean of each column from some number
of source rows for the current sampling period.
But when you increase the sampling frequency (upsample), then:
Your source data are too general.
You want to change the frequency to a higher one.
One of possible options to compute the result is e.g. to
interpolate between known source values.
Note that when you upsample daily data to 3-minute frequency then:
The first row will contain data between 2018-01-01 00:00:00 and
2018-01-01 00:03:00.
The next row will contain data between 2018-01-01 00:03:00 and
2018-01-01 00:06:00.
And so on.
So, based on your source data:
The first row contains data from 2018-01-01 (exactly at midnight).
Since no source data is available for the time range between
00:03:00 and 00:06:00 (on 2018-01-01), the second row contains
just NaN values.
The same pertains to further rows, up to 2018-01-01 23:57:00
(no source data for these time slices).
The next row, for 2018-01-02 00:00:00, can be filled with source data.
And so on.
There is nothing strange about this behaviour; resample just works this way.
As you actually upsample the source data, maybe you should interpolate
the missing values?
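A minimal sketch of that idea, assuming df is the daily frame from the question with date as its index (Resampler.interpolate fills the new 3-minute bins by linear interpolation between the known daily values):
upsampled = df.resample('3T').interpolate()
print(upsampled.head())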
This is a small subset of my data:
heartrate
2018-01-01 00:00:00 67.0
2018-01-01 00:01:00 55.0
2018-01-01 00:02:00 60.0
2018-01-01 00:03:00 67.0
2018-01-01 00:04:00 72.0
2018-01-01 00:05:00 53.0
2018-01-01 00:06:00 62.0
2018-01-01 00:07:00 59.0
2018-01-01 00:08:00 117.0
2018-01-01 00:09:00 62.0
2018-01-01 00:10:00 65.0
2018-01-01 00:11:00 70.0
2018-01-01 00:12:00 49.0
2018-01-01 00:13:00 59.0
This data is a collection of daily heart rates from patients. I am trying to see if, based on their heart rate, I can find the time window during which they are asleep.
I am not sure how to write code that can identify that time window, because every few minutes there is a spike in the data. For example, in the data provided, from 2018-01-01 00:07:00 to 2018-01-01 00:08:00 the heart rate jumps from 59 to 117. Can anyone suggest a way around this, and a way to find the time window when the heart rate is below the mean for a few hours?
As mentioned in your comments, you can compute a rolling mean to smooth your signal using:
patient_data_df['rollingmeanVal'] = patient_data_df.rolling('3T').heartrate.mean()
Assuming you are using a dataframe and want to identify rows that have a heart rate below or equal to the mean, you can use:
HR_mean = patient_data_df['rollingmeanVal'].mean()
selected_data_df = patient_data_df[patient_data_df['rollingmeanVal'] <= HR_mean]
Then, instead of dealing with the dataframe as a time series, you can reset the index, which generates a column called index holding the datetime values. Now that you have a dataframe with all values below the mean, you can split it into groups wherever there is more than a 30-minute gap between consecutive rows. This assumes that fluctuating data for up to 30 minutes is acceptable.
Assuming that the group with the most data corresponds to when the patient is asleep, you can identify that group. Using its first and last date, you can then identify the time window during which the patient is asleep.
Reset the index, adding a new col called index with the time-series data:
selected_data_df.reset_index(inplace=True)
Group by:
selected_data_df['grp'] = selected_data_df['index'].diff().dt.seconds.ge(30 * 60).cumsum()
# assume the largest group (the most rows below the mean) is the sleep period
sleep_grp_index = selected_data_df.groupby('grp').size().idxmax()
sleep_df = selected_data_df[selected_data_df['grp'] == sleep_grp_index].drop('grp', axis=1)
Start of sleep time:
sleep_df['index'].iloc[0]
End of sleep time:
sleep_df['index'].iloc[-1]
You may use the run length encoding (rle) function from base R to solve your problem. In step 1 you calculate the rolling mean of your patient's heart rate (you may use your solution or any other). Afterwards you add a logical flag to your data frame, e.g. patient['lowerVal'] = patient['heartrate'] < patient['rollingmeanVal']. Then apply the rle function to that lowerVal variable. As a return value you get the lengths of the runs below and above the mean. By applying cumsum to those lengths, you get the locations of your sleeping time frames.
Sorry, this is Python, so you would use a Python equivalent of run length encoding instead.
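Pandas has no built-in rle, but the same run lengths can be computed with a short groupby trick; a sketch assuming the patient dataframe and the heartrate / rollingmeanVal columns from the previous answer:
patient['lowerVal'] = patient['heartrate'] < patient['rollingmeanVal']
# every change of the flag starts a new run
run_id = patient['lowerVal'].ne(patient['lowerVal'].shift()).cumsum()
# value and length of each run, plus the cumulative end position of each run
runs = patient.groupby(run_id)['lowerVal'].agg(['first', 'size'])
runs.columns = ['below_mean', 'run_length']
runs['end_pos'] = runs['run_length'].cumsum()
print(runs)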
I have a dataframe as follows
df = pd.DataFrame({ 'X' : np.random.randn(50000)}, index=pd.date_range('1/1/2000', periods=50000, freq='T'))
df.head(10)
Out[37]:
X
2000-01-01 00:00:00 -0.699565
2000-01-01 00:01:00 -0.646129
2000-01-01 00:02:00 1.339314
2000-01-01 00:03:00 0.559563
2000-01-01 00:04:00 1.529063
2000-01-01 00:05:00 0.131740
2000-01-01 00:06:00 1.282263
2000-01-01 00:07:00 -1.003991
2000-01-01 00:08:00 -1.594918
2000-01-01 00:09:00 -0.775230
I would like to create a variable that contains the sum of X
over the last 5 days (not including the current observation)
only considering observations that fall at the exact same hour as the current observation.
In other words:
At index 2000-01-01 00:00:00, df['rolling_sum_same_hour'] contains the sum of the values of X observed at 00:00:00 during the last 5 days in the data (not including 2000-01-01 of course).
At index 2000-01-01 00:01:00, df['rolling_sum_same_hour'] contains the sum of the values of X observed at 00:01:00 during the last 5 days, and so on.
The intuitive idea is that intraday prices have intraday seasonality, and I want to get rid of it that way.
I tried to use df['rolling_sum_same_hour']=df.at_time(df.index.minute).rolling(window=5).sum()
with no success.
Any ideas?
Many thanks!
Behold the power of groupby!
df = # as you defined above
df['rolling_sum_by_time'] = df.groupby(df.index.time)['X'].apply(lambda x: x.shift(1).rolling(10).sum())
It's a big pill to swallow, but we are grouping by time of day (as in Python's datetime.time), then selecting the column we care about (so that apply works on the time groups rather than on whole columns), and then applying the function you want!
IIUC, what you want is to perform a rolling sum, but only on the observations grouped by the exact same time of day. This can be done by
df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum())
(Note that your question alternates between 5 and 10 periods.) For example:
In [43]: df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum()).tail()
Out[43]:
2000-02-04 17:15:00 -2.135887
2000-02-04 17:16:00 -3.056707
2000-02-04 17:17:00 0.813798
2000-02-04 17:18:00 -1.092548
2000-02-04 17:19:00 -0.997104
Freq: T, Name: X, dtype: float64
I have a DataFrame with intraday data indexed with DatetimeIndex
df1 =pd.DataFrame(np.random.randn(6,4),index=pd.date_range('1/1/2000',periods=6, freq='1h'))
df2 =pd.DataFrame(np.random.randn(6,4),index=pd.date_range('1/2/2000',periods=6, freq='1h'))
df3 = df1.append(df2)
so, as can be seen, there is a big gap between the two days in df3
df3.plot()
will plot every single hour from 2000-01-01 00:00:00 to 2000-01-02 05:00:00, while from 2000-01-01 06:00:00 to 2000-01-02 00:00:00 there are actually no data points.
How can I leave those data points out of the plot, so that the range from 2000-01-01 06:00:00 to 2000-01-02 00:00:00 is not plotted?
This seems to have been under discussion for some time on Google Groups:
Pandas Intraday Time Series plots
One way to do this is to resample (hourly) before you plot:
df3.resample('H').plot()
Note: This ensures you have NaN values between real values which are not plotted (rather than connected). This means you are storing more data here, which may be an issue.
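In recent pandas versions resample() returns a Resampler object rather than a frame, so an aggregation (or .asfreq()) is needed before plotting; a sketch of the equivalent call:
# the empty hours become NaN, and matplotlib leaves a gap instead of a line
df3.resample('H').mean().plot()
# or equivalently:
df3.asfreq('H').plot()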
I am using pandas to convert intraday data, stored in data_m, to daily data. For some reason resample added rows for days that were not present in the intraday data. For example, 1/8/2000 is not in the intraday data, yet the daily data contains a row for that date with NaN as the value. DatetimeIndex has more entries than the actual data. Am I doing anything wrong?
data_m.resample('D', how='mean').head()
Out[13]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
2000-01-08 NaN
data_m.resample('D', how='mean')
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4729 entries, 2000-01-04 00:00:00 to 2012-12-14 00:00:00
Freq: D
Data columns:
x 3241 non-null values
dtypes: float64(1)
What you are doing looks correct, it's just that pandas gives NaN for the mean of an empty array.
In [1]: pd.Series(dtype=float).mean()
Out[1]: nan
resample converts to a regular time interval, so if there are no samples that day you get NaN.
Most of the time having NaN isn't a problem. If it is, we can either use a fill_method (for example 'ffill'), or, if you really want to remove them, use dropna (not recommended):
data_m.resample('D', how='mean', fill_method='ffill')
data_m.resample('D', how='mean').dropna()
Update: The modern equivalent seems to be:
In [21]: data_m.resample("D").mean().ffill()
Out[21]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
2000-01-08 8780.037433
In [22]: data_m.resample("D").mean().dropna()
Out[22]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
See resample docs.
Prior to 0.10.0, pandas labeled resample bins with the right-most edge, which for daily resampling is the next day. Starting with 0.10.0, the default binning behavior for daily and higher frequencies changed to label='left', closed='left' to minimize this confusion. See http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#api-changes for more information.
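If you want that binning spelled out explicitly, both options can be passed to resample; a small sketch:
# label each daily bin with its left edge and treat the left edge as the closed side
daily = data_m.resample('D', label='left', closed='left').mean()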