Long question short, what is an appropriate resampling freq/rule? Sometimes I get a dataframe mostly filled with NaNs, sometimes it works great. I thought I had a handle on it.
Below is an example,
I am processing a lot of data and, while changing my resample frequency, noticed that for some reason certain resample rules produce a result in which only an occasional row has values and all the other rows contain NaNs.
For example,
df = pd.DataFrame()
df['date']=pd.date_range(start='1/1/2018', end='5/08/2018')
Creating some example data,
df['data1']=np.random.randint(1, 10, df.shape[0])
df['data2']=np.random.randint(1, 10, df.shape[0])
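df['data3'] = np.arange(len(df))
df = df.set_index('date')  # resample() needs a DatetimeIndex (the output below shows date as the index)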
The data looks like,
print(df.head())
print(df.shape)
data1 data2 data3
date
2018-01-01 7 7 0
2018-01-02 8 8 1
2018-01-03 2 7 2
2018-01-04 2 2 3
2018-01-05 2 5 4
(128, 3)
When I resample the data using offset aliases I get unexpected results.
Below I resample the data every 3 minutes.
resampled=df.resample('3T').mean()
print(resampled.head())
print(resampled.shape)
data1 data2 data3
date
2018-01-01 00:00:00 4.0 5.0 0.0
2018-01-01 00:03:00 NaN NaN NaN
2018-01-01 00:06:00 NaN NaN NaN
2018-01-01 00:09:00 NaN NaN NaN
2018-01-01 00:12:00 NaN NaN NaN
Most of the rows are filled with NaN besides the first. I believe this is because the source data has no timestamps that fall inside most of the bins created by my resampling rule. Is this correct? '24H' is the smallest interval that works for this data; anything smaller leaves rows of NaN.
Can a dataframe be resampled for increments less than the datetime resolution?
I have had trouble in the past trying to resample a large dataset that spanned over a year with the datetime index formatted as %Y:%j:%H:%M:%S (year:day #: hour: minute:second, note: close enough without being verbose). Attempting to resample every 15 or 30 days also produced very similar results with NaNs. I thought it was due to having an odd date format with no month, but df.head() showed the index with correct dates.
When you resample to a lower frequency (downsample), one
possible way to compute the result is simply mean().
It actually means:
The source DataFrame contains data that is too detailed.
You want to change the sampling frequency to a lower one and
compute e.g. the mean of each column over the source rows
that fall into each sampling period.
But when you increase the sampling frequency (upsample):
Your source data is too coarse.
You want to change the frequency to a higher one.
One possible way to compute the result is e.g. to
interpolate between known source values.
Note that when you upsample daily data to 3-minute frequency then:
The first row will contain data between 2018-01-01 00:00:00 and
2018-01-01 00:03:00.
The next row will contain data between 2018-01-01 00:03:00 and
2018-01-01 00:06:00.
And so on.
So, based on your source data:
The first row contains data from 2018-01-01 (exactly at midnight).
Since no source data is available for the time range between
00:03:00 and 00:06:00 (on 2018-01-01), the second row contains
just NaN values.
The same pertains to further rows, up to 2018-01-01 23:57:00
(no source data for these time slices).
The next row, for 2018-01-02 00:00:00 can be filled with source data.
And so on.
There is nothing strange in this behaviour. Resample works just this way.
As you actually upsample the source data, maybe you should interpolate
the missing values?
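A minimal sketch of that suggestion, assuming daily data like the question's (the values and the data3 column here are only illustrative). It upsamples to 3-minute bins and then fills the gaps either by time-weighted interpolation or by carrying the last known value forward; which of the two is appropriate depends on what the data represents.

```python
import numpy as np
import pandas as pd

# Daily data, similar in shape to the question's (values are illustrative)
index = pd.date_range(start="2018-01-01", end="2018-01-03", freq="D")
df = pd.DataFrame({"data3": np.arange(len(index), dtype=float)}, index=index)

# Upsample to 3-minute bins: only the midnight rows have source data
upsampled = df.resample("3min").mean()

# Fill the gaps, either linearly in time or by repeating the last known value
interpolated = upsampled.interpolate(method="time")
forward_filled = upsampled.ffill()

print(upsampled.head(3))
print(interpolated.head(3))
```

Without the fill step, every bin that does not contain a midnight timestamp stays NaN, which is exactly the output shown in the question.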
Related
I have a pandas dataframe (python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all functions I find start their averaging at even hours (e.g. hour 9 includes data from 08:00:00 to 08:59:50). Let's say I have the dataframe below.
Timestamp value data
2022-01-01 00:00:00 0.0 5.31
2022-01-01 00:00:10 0.0 0.52
2022-01-01 00:00:20 1.0 9.03
2022-01-01 00:00:30 1.0 4.37
2022-01-01 00:00:40 1.0 8.03
...
2022-01-01 13:52:30 1.0 9.75
2022-01-01 13:52:40 1.0 0.62
2022-01-01 13:52:50 1.0 3.58
2022-01-01 13:53:00 1.0 8.23
2022-01-01 13:53:10 1.0 3.07
Freq: 10S, Length: 5000, dtype: float64
So what I want to do:
Only look at periods where value is consistently 1 for a full hour
Find an hourly average of these hours (could e.g. be between 01:30:00-02:29:50 and 11:16:30-12:16:20).
I hope I made my problem clear enough. How do I do this?
EDIT:
Maybe the question was phrased a bit unclearly.
I added a third column data, which is what I want to find the mean of. I am only interested in time intervals where, value = 1 consistently through one hour, the rest of the data can be excluded.
EDIT #2:
A bit of background to my problem: I have a sensor giving me data every 10 seconds. For data to be "approved" certain requirements are to be fulfilled (value in this example), and I need the hourly averages (and preferably timestamps for when this occurs). So in order to maximize the number of possible hours to include in my analysis, I would like to find full hours even if they don't start at an even timestamp.
If I understand you correctly you want a conditional mean - calculate the mean per hour of the data column conditional on the value column being all 1 for every 10s row in that hour.
Assuming your dataframe is called df, the steps to do this are:
Create a grouping column
This is your 'hour' column that can be created by
df['hour'] = df.Timestamp.dt.hour  # use df.index.hour instead if Timestamp is the index
Create condition
Now we've got a column to identify groups we can check which groups are eligible - only those with value consistently equal to 1. If we have 10s intervals and it's per hour then if we group by hour and sum this column then we should get 360 as there are 360 10s intervals per hour.
Group and compute
We can now group and use the aggregate function to:
sum the value column to evaluate against our condition
compute the mean of the data column to return for the valid hours
# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]
That's it - you are left with a dataframe that contains the mean value of data for only the hours where you have a complete hour of value=1.
If you want to augment this so you don't have to start with the grouping as per hour starting as 08:00:00-09:00:00 and maybe you want to start as 08:00:10-09:00:10 then the solution is simple - augment the grouping column but don't change anything else in the process.
To do this you can use datetime.timedelta to shift the timestamps forward or back before extracting the hour, so that df.Timestamp.dt.hour can still be leveraged to keep things simple.
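A rough sketch of the shifted grouping, assuming a Timestamp column and an illustrative 30-minute offset. One deliberate deviation: it builds the grouping key with .dt.floor rather than .dt.hour, so that the same clock hour on different days is not merged into one group. That is a choice made for this sketch, not something stated above.

```python
import datetime

import numpy as np
import pandas as pd

# Illustrative data: 3 hours at 10 s spacing, with a 'Timestamp' column
ts = pd.date_range("2022-01-01 00:00:00", periods=3 * 360, freq="10s")
df = pd.DataFrame({"Timestamp": ts, "value": 1.0, "data": np.arange(len(ts), dtype=float)})
df.loc[df["Timestamp"] < "2022-01-01 00:30:00", "value"] = 0.0  # first half hour fails the condition

# Shift back by the offset, floor to the hour, shift forward again:
# every group now runs from HH:30:00 to (HH+1):29:50
offset = datetime.timedelta(minutes=30)
df["hour"] = (df["Timestamp"] - offset).dt.floor("h") + offset

# Same group-and-filter step as before, only the grouping column changed
df_mean = df.groupby("hour").aggregate({"value": "sum", "data": "mean"})
df_mean = df_mean[df_mean["value"] == 360]
print(df_mean)
```

With this sample data only the groups starting at 00:30:00 and 01:30:00 survive the filter, because the other two groups are incomplete or contain rows where value is 0.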
Infer grouping from data
One final idea - if you want to infer which hours on a rolling basis you have complete data for then you can do this with a rolling sum - this is even easier. You:
compute the rolling sum of value and mean of data
only select where value is equal to 360
df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
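Since the question also asks for the timestamps of the valid hours, here is a small sketch of how they could be recovered from the rolling result. It assumes a DatetimeIndex with exactly 10 s spacing and no missing rows; the window_start column name is made up for this example.

```python
import numpy as np
import pandas as pd

# Illustrative data: 2 hours at 10 s spacing; 'value' is 1 except for one row
index = pd.date_range("2022-01-01 00:00:00", periods=720, freq="10s")
df = pd.DataFrame({"value": 1.0, "data": np.arange(720, dtype=float)}, index=index)
df.loc[pd.Timestamp("2022-01-01 00:10:00"), "value"] = 0.0

window = 360  # rows per hour at 10 s spacing (assumes no gaps in the index)
df_roll = df.rolling(window).aggregate({"value": "sum", "data": "mean"})
df_roll = df_roll[df_roll["value"] == window]

# Each remaining row is labelled with the END of its window; derive the start
df_roll = df_roll.assign(
    window_start=df_roll.index - pd.Timedelta(seconds=10 * (window - 1))
)
print(df_roll.head())
```

Note that consecutive rows of the result describe overlapping hours shifted by 10 s, so some further selection is needed if non-overlapping hours are wanted.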
Yes, there is. You need resample with an offset.
Make some test data
Please make sure to provide meaningful test data next time.
import pandas as pd
import numpy as np
# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)
This is what we got:
>>> df
data value
2018-01-01 00:00:00 0.377082 1
2018-01-01 00:00:10 0.574471 1
2018-01-01 00:00:20 0.284629 1
2018-01-01 00:00:30 0.678923 1
2018-01-01 00:00:40 0.094724 1
... ... ...
2018-01-01 23:59:20 0.839973 0
2018-01-01 23:59:30 0.890321 0
2018-01-01 23:59:40 0.426595 0
2018-01-01 23:59:50 0.089174 0
2018-01-02 00:00:00 0.351624 0
Get the mean per hour with an offset
Here is a small function that checks if all value rows in the slice are equal to 1 and returns the mean if so, otherwise it (implicitly) returns None.
def get_conditioned_average(frame):
    if frame.value.eq(1).all():
        return frame.data.mean()
Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.
df2 = df.resample('H', offset='10S').apply(get_conditioned_average)
This is the final result:
>>> df2
2017-12-31 23:00:10 0.377082
2018-01-01 00:00:10 0.522144
2018-01-01 01:00:10 0.506536
2018-01-01 02:00:10 0.505334
2018-01-01 03:00:10 0.504431
... ... ...
2018-01-01 19:00:10 NaN
2018-01-01 20:00:10 NaN
2018-01-01 21:00:10 NaN
2018-01-01 22:00:10 NaN
2018-01-01 23:00:10 NaN
Freq: H, dtype: float64
Let's say I have an idx=pd.DatetimeIndex with one minute frequency. I also have a list of bad dates (each of type pd.Timestamp without the time information) that I want to remove from the original idx. How do I do that in pandas?
Use normalize to remove the time part from your index so you can do a simple ~ + isin selection, i.e. find the dates not in that bad list. If you need to be extra safe, you can also make sure your list of dates doesn't have a time part by normalizing it the same way: [x.normalize() for x in bad_dates].
Sample Data
import pandas as pd
df = pd.DataFrame(range(9), index=pd.date_range('2010-01-01', freq='11H', periods=9))
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]
Code
df[~df.index.normalize().isin(bad_dates)]
# 0
#2010-01-01 00:00:00 0
#2010-01-01 11:00:00 1
#2010-01-01 22:00:00 2
#2010-01-04 05:00:00 7
#2010-01-04 16:00:00 8
I have a Series which consists of hourly data. I want to compute daily sum.
The data may have missing hours and sometimes missing dates.
2017-02-01 00:00:00 3.0
2017-02-01 01:00:00 4.0
2017-02-01 02:00:00 4.0
2017-02-03 00:00:00 3.0
For example, in the time series above for 2017-02-01, only the first three hours of data are present. The remaining 21 hours are missing.
The data for 2017-02-02 is completely missing.
1. I don't care about missing hours. The daily sum should consider whatever data is present for a day (in the example, it should consider hours 0, 1, 2).
2. But, if a date is completely missing, I should have NaN as the sum for that date.
resample() followed by sum() works fine for #1, but it returns 0 for #2.
2017-02-01 11.0
2017-02-02 0.0
2017-02-03 3.0
Here is the dummy code:
my_series.resample('1D',closed='left',label='left').sum()
How can I tell resample() not to set 0 for missing dates?
Use min_count=1 in sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
a = my_series.resample('1D',closed='left',label='left').sum(min_count=1)
print (a)
2017-02-01 11.0
2017-02-02 NaN
2017-02-03 3.0
Freq: D, Name: a, dtype: float64
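For anyone who wants to try this, a self-contained sketch comparing the two behaviours on the four sample rows from the question (the variable names default_sum and strict_sum are just for illustration):

```python
import numpy as np
import pandas as pd

# The hourly series from the question: 2017-02-02 is missing entirely
my_series = pd.Series(
    [3.0, 4.0, 4.0, 3.0],
    index=pd.to_datetime(
        ["2017-02-01 00:00", "2017-02-01 01:00", "2017-02-01 02:00", "2017-02-03 00:00"]
    ),
)

# Default min_count=0: a day without any rows sums to 0.0
default_sum = my_series.resample("1D").sum()

# min_count=1: a day needs at least one valid value, otherwise the result is NaN
strict_sum = my_series.resample("1D").sum(min_count=1)

print(default_sum)
print(strict_sum)
```

Days with partial data (2017-02-01 here) are unaffected, since they contain at least one valid value.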
This is a small subset of my data:
heartrate
2018-01-01 00:00:00 67.0
2018-01-01 00:01:00 55.0
2018-01-01 00:02:00 60.0
2018-01-01 00:03:00 67.0
2018-01-01 00:04:00 72.0
2018-01-01 00:05:00 53.0
2018-01-01 00:06:00 62.0
2018-01-01 00:07:00 59.0
2018-01-01 00:08:00 117.0
2018-01-01 00:09:00 62.0
2018-01-01 00:10:00 65.0
2018-01-01 00:11:00 70.0
2018-01-01 00:12:00 49.0
2018-01-01 00:13:00 59.0
This data is a collection of daily heart rates from patients. I am trying to see if, based off their heart rate, I can find the time window that they are asleep.
I am not sure how to write a code that is able to identify the time window that the patient is asleep because every few minutes, there will be a spike in the data. For example, in the data provided from 2018-01-01 00:07:00 to 2018-01-01 00:08:00, the heartrate jumped from 59 to 117. Can anyone suggest a way around this and a way to find the time window when the Heartrate is below the mean for a few hours?
As mentioned in your comments, you can find the rolling mean to 'smoothen' your signal using:
patient_data_df['rollingmeanVal'] = patient_data_df.rolling('3T').heartrate.mean()
Assuming you are using a dataframe and want to identify rows that have an HR below or equal to the mean, you can use:
HR_mean = patient_data_df['rollingmeanVal'].mean()
selected_data_df = patient_data_df[patient_data_df['rollingmeanVal'] <= HR_mean]
Then, instead of dealing with the dataframe as a time-series dataframe, you can reset the index and generate a column called index with the datetime as values. Now that you have a dataframe with all values below the mean, you can split the rows into groups wherever consecutive rows are more than 30 minutes apart. This assumes that having fluctuating data for up to 30 minutes is OK.
Assuming that the group with the most data is when the patient is asleep, you can identify that group. Using the first and last date of this group, you can then identify the time window that the patient is asleep.
Reset the index, adding a new col called index with the time-series data:
selected_data_df.reset_index(inplace=True)
Group by:
selected_data_df['grp'] = selected_data_df['index'].diff().dt.total_seconds().ge(30 * 60).cumsum()
# pick the group with the most rows, which is assumed to be the sleep period
sleep_grp_index = selected_data_df.groupby('grp').size().idxmax()
sleep_df = selected_data_df[selected_data_df['grp'] == sleep_grp_index].drop('grp', axis=1)
Start of sleep time:
sleep_df['index'].iloc[0]
End of sleep time:
sleep_df['index'].iloc[-1]
You may use the Run Length Encoding function from base R to solve your problem. In step 1, calculate the rolling mean of your patient's heart rate, using your solution or any other. Then add a logical flag to your data.frame, e.g. patient['lowerVal'] = patient['heartrate'] < patient['rollingmeanVal'], and apply the rle function to that variable lowerVal. In return you get the lengths of the runs below and above the mean. By applying cumsum to the lengths, you get the locations of your sleeping time frames.
Sorry. It is Python. Therefore, you may use the Python version of Run Length Encoding.
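One possible Python version of the run-length idea, sketched with pandas only. The heart rates are made up, and for brevity the flag compares against the overall mean rather than a rolling mean; the names run_id, runs and longest are introduced just for this example.

```python
import pandas as pd

# Illustrative minute-level heart rates (not the asker's data)
index = pd.date_range("2018-01-01 00:00:00", periods=10, freq="min")
patient = pd.DataFrame(
    {"heartrate": [80.0, 82.0, 60.0, 58.0, 59.0, 61.0, 57.0, 85.0, 83.0, 81.0]},
    index=index,
)

# Flag rows below the mean (a rolling mean could be substituted here)
patient["lowerVal"] = patient["heartrate"] < patient["heartrate"].mean()

# Run-length encoding: a new run starts wherever the flag changes
run_id = (patient["lowerVal"] != patient["lowerVal"].shift()).cumsum()
timestamps = patient.index.to_series()
runs = pd.DataFrame({
    "below_mean": patient["lowerVal"].groupby(run_id).first().astype(bool),
    "start": timestamps.groupby(run_id).min(),
    "end": timestamps.groupby(run_id).max(),
    "length": patient["lowerVal"].groupby(run_id).size(),
})

# The longest run below the mean is the candidate sleep window
longest = runs.loc[runs.loc[runs["below_mean"], "length"].idxmax()]
print(runs)
print(longest["start"], longest["end"])
```

On real data a single spike would split a long run in two, so smoothing the signal first (or merging runs separated by short gaps) is probably still needed.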
I have a data frame that contains some time based data:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].mean()
date
2001-01-01 0.567128
2002-01-01 0.581349
2003-01-01 0.556646
2004-01-01 0.549128
2005-01-01 NaN
2006-01-01 0.536796
2007-01-01 0.513109
2008-01-01 0.525859
2009-01-01 0.530433
2010-01-01 0.499250
2011-01-01 0.488159
2012-01-01 0.493405
2013-01-01 0.530207
Freq: AS-JAN, Name: INC_RANK, dtype: float64
And now I would like to plot the density for each year. The following command used to work for other data frames, but it is not here:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density')
ValueError: ordinal must be >= 1
Here's how that column looks like:
>>> temp['INC_RANK'].head()
date
2001-01-01 0.516016
2001-01-01 0.636038
2001-01-01 0.959501
2001-01-01 NaN
2001-01-01 0.433824
Name: INC_RANK, dtype: float64
I think it is due to the NaN values in your data, as density cannot be estimated for NaNs. However, since you want to visualize density, it should not be a big issue to simply drop the missing values, assuming the missing/unobserved cells follow the same distribution as the observed/non-missing cells. Therefore, df.dropna().groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density') should suffice (pd.TimeGrouper has been removed from later pandas releases; pd.Grouper(freq='AS') is its replacement).
On the other hand, if the missing values are not 'unobserved' but are values outside the measuring range, then dropna() is probably not a good idea. For example, take a temperature sensor that reads 0~50F: when a 100F temperature is encountered, the sensor sends out an error code, which is recorded as a missing value.
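A small sketch of the drop-then-group step, using made-up data in place of temp. It groups by clean.index.year instead of a time-based grouper; that is a choice made here so that a year with no observed values produces no group at all, whereas a time-based grouper may still yield an empty bin for such a year. The plotting call is left as a comment because it needs matplotlib and SciPy.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for `temp`: the year 2002 contains only missing values
index = pd.to_datetime(["2001-01-01"] * 3 + ["2002-01-01"] * 2 + ["2003-01-01"] * 3)
temp = pd.DataFrame(
    {"INC_RANK": [0.5, np.nan, 0.6, np.nan, np.nan, 0.4, 0.7, 0.55]},
    index=index,
)
temp.index.name = "date"

# Drop the missing values first, then group the remaining rows by year
clean = temp["INC_RANK"].dropna()
per_year = clean.groupby(clean.index.year)

# Each group now holds only observed values, e.g. for a density plot:
# per_year.plot(kind="density")
counts = per_year.size()
print(counts)
```

Checking counts before plotting is a cheap way to spot years that have too few points for a meaningful density estimate.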