Pandas : How to avoid fillna while resampling from hourly to daily data - python

I have a Series which consists of hourly data, and I want to compute the daily sum.
The data may have missing hours and sometimes missing dates.
2017-02-01 00:00:00 3.0
2017-02-01 01:00:00 4.0
2017-02-01 02:00:00 4.0
2017-02-03 00:00:00 3.0
For example, in the time series above, only the first three hours of data are present for 2017-02-01; the remaining 21 hours are missing.
The data for 2017-02-02 is completely missing.
1. I don't care about missing hours. The daily sum should consider whatever data is present for a day (in the example, it should use hours 0, 1 and 2).
2. If a date is completely missing, I should get NaN as the sum for that date.
resample() followed by sum() works fine for requirement 1, but it returns 0 for requirement 2.
2017-02-01 11.0
2017-02-02 0.0
2017-02-03 3.0
Here is the dummy code:
my_series.resample('1D',closed='left',label='left').sum()
How can I tell resample() not to set 0 for missing dates?

Use min_count=1 in sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
a = my_series.resample('1D',closed='left',label='left').sum(min_count=1)
print (a)
2017-02-01 11.0
2017-02-02 NaN
2017-02-03 3.0
Freq: D, Name: a, dtype: float64
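For reference, a minimal, self-contained sketch of the same idea; the series is rebuilt here from the sample rows in the question, and my_series is just the question's variable name:
import pandas as pd

# rebuild the hourly series from the question's sample rows
idx = pd.to_datetime(['2017-02-01 00:00:00', '2017-02-01 01:00:00',
                      '2017-02-01 02:00:00', '2017-02-03 00:00:00'])
my_series = pd.Series([3.0, 4.0, 4.0, 3.0], index=idx)

# min_count=1: a day with no valid values becomes NaN instead of 0
daily = my_series.resample('1D', closed='left', label='left').sum(min_count=1)
print(daily)
# 2017-02-01    11.0
# 2017-02-02     NaN
# 2017-02-03     3.0
# Freq: D, dtype: float64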

Related

Removing invalid values to correctly recreate cumulative data in pandas

I have a data set with statistics that I collect from text. The processing method sometimes does not work correctly, and I need to correct the output data. The values are supposed to be cumulative, but sometimes I get incorrect data.
This is time series data that should accumulate over time. Right now I'm getting the following (sample snippet):
df
date value
2021-07-20 21347.0
2021-07-24 21739.0
2021-08-02 22.0
2021-08-03 22.0
2021-08-06 22947.0
2021-08-17 4.0
As you can see, the data is cumulative, but some values are defined incorrectly.
I would like such values to be converted to nan.
How can I do that? The final result is expected to be as follows:
df
date value
2021-07-20 21347.0
2021-07-24 21739.0
2021-08-02 nan
2021-08-03 nan
2021-08-06 22947.0
2021-08-17 nan
You can do that using numpy:
import numpy as np

df['value'] = np.where(df['value'] < df['value'].iloc[0], np.nan, df['value'])
Output:
date value
0 2021-07-20 21347.0
1 2021-07-24 21739.0
2 2021-08-02 nan
3 2021-08-03 nan
4 2021-08-06 22947.0
5 2021-08-17 nan
Can you try this:
import numpy as np

# compare each value with the previous row's value; the first row has no
# predecessor, so substitute -inf there to keep it
df['check'] = df['value'].shift(1).fillna(-np.inf)
df['value'] = np.where(df['value'] > df['check'], df['value'], np.nan)
df = df.drop(columns='check')
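For reproducibility, here is a minimal sketch of how the sample frame above can be built (column names taken from the question) before trying either snippet:
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-07-20', '2021-07-24', '2021-08-02',
                            '2021-08-03', '2021-08-06', '2021-08-17']),
    'value': [21347.0, 21739.0, 22.0, 22.0, 22947.0, 4.0],
})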

Is there a way to find hourly averages in pandas timeframes that do not start from even hours?

I have a pandas dataframe (python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all functions I find start their averaging at even hours (e.g. hour 9 includes data from 08:00:00 to 08:59:50). Let's say I have the dataframe below.
Timestamp value data
2022-01-01 00:00:00 0.0 5.31
2022-01-01 00:00:10 0.0 0.52
2022-01-01 00:00:20 1.0 9.03
2022-01-01 00:00:30 1.0 4.37
2022-01-01 00:00:40 1.0 8.03
...
2022-01-01 13:52:30 1.0 9.75
2022-01-01 13:52:40 1.0 0.62
2022-01-01 13:52:50 1.0 3.58
2022-01-01 13:53:00 1.0 8.23
2022-01-01 13:53:10 1.0 3.07
Freq: 10S, Length: 5000, dtype: float64
So what I want to do:
Only look at data where value is consistently 1 through a full hour.
Find an hourly average of data over those hours (which could e.g. run from 01:30:00 to 02:29:50, or from 11:16:30 to 12:16:20).
I hope I made my problem clear enough. How do I do this?
EDIT:
Maybe the question was phrased a bit unclearly.
I added a third column, data, which is what I want to find the mean of. I am only interested in time intervals where value = 1 consistently through one hour; the rest of the data can be excluded.
EDIT #2:
A bit of background to my problem: I have a sensor giving me data every 10 seconds. For data to be "approved" certain requirements are to be fulfilled (value in this example), and I need the hourly averages (and preferably timestamps for when this occurs). So in order to maximize the number of possible hours to include in my analysis, I would like to find full hours even if they don't start at an even timestamp.
If I understand you correctly, you want a conditional mean - calculate the mean of the data column per hour, conditional on the value column being 1 for every 10s row in that hour.
Assuming your dataframe is called df, the steps to do this are:
Create a grouping column
This is your 'hour' column, which can be created with
df['hour'] = df.index.hour  # or df['Timestamp'].dt.hour if the timestamps are a column rather than the index
Create condition
Now that we've got a column to identify groups, we can check which groups are eligible - only those with value consistently equal to 1. Since the data comes in 10s intervals, a complete hour has 360 rows, so if we group by hour and sum the value column, an eligible hour sums to 360.
Group and compute
We can now group and use the aggregate function to:
sum the value column to evaluate against our condition
compute the mean of the data column to return for the valid hours
# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]
That's it - you are left with a dataframe that contains the mean value of data for only the hours where you have a complete hour of value=1.
If you want to augment this so the grouping does not have to run on the full hour (08:00:00-09:00:00) but can instead start at, say, 08:00:10-09:00:10, the solution is simple - change the grouping column, but don't change anything else in the process.
To do this you can shift the timestamps by a timedelta so that the hour attribute can still be leveraged to keep things simple, as in the sketch below.
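A minimal sketch of that shifted grouping, assuming the timestamps are the index and a 10-second offset is wanted:
# shift the clock back by 10 seconds before taking the hour, so that e.g.
# 08:00:10 .. 09:00:00 all land in the same 'hour' group (hour 8)
df['hour'] = (df.index - pd.Timedelta(seconds=10)).hour
# if the timestamps are a column instead of the index:
# df['hour'] = (df['Timestamp'] - pd.Timedelta(seconds=10)).dt.hour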
Infer grouping from data
One final idea - if you want to infer, on a rolling basis, which hours you have complete data for, you can do this with a rolling window - this is even easier. You:
compute the rolling sum of value and the rolling mean of data
only keep the rows where the rolled value equals 360
df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
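If you also want timestamps for the qualifying hours (as asked in EDIT #2), note that the index of df_roll marks the last sample of each 360-row window. A small sketch, continuing from the snippet above (pandas imported as pd):
# 360 samples at 10 s spacing span 3590 s from the first to the last timestamp
valid_ends = df_roll.index
valid_starts = valid_ends - pd.Timedelta(seconds=3590)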
Yes, there is. You need resample with an offset.
Make some test data
Please make sure to provide meaningful test data next time.
import pandas as pd
import numpy as np
# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)
This is what we got:
>>> df
data value
2018-01-01 00:00:00 0.377082 1
2018-01-01 00:00:10 0.574471 1
2018-01-01 00:00:20 0.284629 1
2018-01-01 00:00:30 0.678923 1
2018-01-01 00:00:40 0.094724 1
... ... ...
2018-01-01 23:59:20 0.839973 0
2018-01-01 23:59:30 0.890321 0
2018-01-01 23:59:40 0.426595 0
2018-01-01 23:59:50 0.089174 0
2018-01-02 00:00:00 0.351624 0
Get the mean per hour with an offset
Here is a small function that checks if all value rows in the slice are equal to 1 and returns the mean if so, otherwise it (implicitly) returns None.
def get_conditioned_average(frame):
    if frame.value.eq(1).all():
        return frame.data.mean()
Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.
df2 = df.resample('H', offset='10S').apply(get_conditioned_average)
This is the final result:
>>> df2
2017-12-31 23:00:10 0.377082
2018-01-01 00:00:10 0.522144
2018-01-01 01:00:10 0.506536
2018-01-01 02:00:10 0.505334
2018-01-01 03:00:10 0.504431
... ... ...
2018-01-01 19:00:10 NaN
2018-01-01 20:00:10 NaN
2018-01-01 21:00:10 NaN
2018-01-01 22:00:10 NaN
2018-01-01 23:00:10 NaN
Freq: H, dtype: float64

Difference between pandas aggregators .first() and .last()

I'm curious as to what last() and first() do in this specific instance (when chained to a resampling). Correct me if I'm wrong, but I understand that if you pass an argument into first or last, e.g. 3, it returns the first 3 months or the first 3 years.
In this circumstance, since I'm not passing any arguments into first() and last(), what are they actually doing when I resample like that? I know that if I resample by chaining .mean(), I'll resample into years with the mean score from averaging all the months, but what is happening when I use last()?
More importantly, why do first() and last() give me different answers in this context? I see that numerically they are not equal.
i.e. post2008.resample().first() != post2008.resample().last()
TLDR:
What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why does .resample().first() != .resample().last()?
This is the code before the aggregation:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]
# Print the last 8 rows of post2008
print(post2008.tail(8))
This is what print(post2008.tail(8)) outputs:
VALUE
DATE
2014-07-01 17569.4
2014-10-01 17692.2
2015-01-01 17783.6
2015-04-01 17998.3
2015-07-01 18141.9
2015-10-01 18222.8
2016-01-01 18281.6
2016-04-01 18436.5
Here is the code that resamples and aggregates by last():
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)
This is what yearly is like when it's post2008.resample('A').last():
VALUE
DATE
2008-12-31 14549.9
2009-12-31 14566.5
2010-12-31 15230.2
2011-12-31 15785.3
2012-12-31 16297.3
2013-12-31 16999.9
2014-12-31 17692.2
2015-12-31 18222.8
2016-12-31 18436.5
Here is the code that resamples and aggregates by first():
# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)
This is what yearly is like when it's post2008.resample('A').first():
VALUE
DATE
2008-12-31 14668.4
2009-12-31 14383.9
2010-12-31 14681.1
2011-12-31 15238.4
2012-12-31 15973.9
2013-12-31 16475.4
2014-12-31 17025.2
2015-12-31 17783.6
2016-12-31 18281.6
Before anything else, let's create a dataframe with example data:
import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)
The output will be
VALUE
2014-07-01 1000
2014-10-01 2000
2015-01-01 3000
2015-04-01 4000
2015-07-01 5000
2015-07-01 6000
2016-01-01 7000
2016-04-01 8000
If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we will be selecting the first six months of data, which in the example above consists of just two rows:
print(df.first('6M'))
VALUE
2014-07-01 1000
2014-10-01 2000
Similarly, last returns only the rows that belong to the last six months of data:
print(df.last('6M'))
VALUE
2016-01-01 7000
2016-04-01 8000
In this context, not passing the required argument results in an error:
print(df.first())
TypeError: first() missing 1 required positional argument: 'offset'
On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods first, last, mean, etc. In this case, they keep only the first (respectively, last) values of each year (instead of e.g. averaging all values, or some other kind of aggregation):
print(df.resample('Y').first())
VALUE
2014-12-31 1000
2015-12-31 3000 # This is the first of the 4 values from 2015
2016-12-31 7000
print(df.resample('Y').last())
VALUE
2014-12-31 2000
2015-12-31 6000 # This is the last of the 4 values from 2015
2016-12-31 8000
As an extra example, consider also the case of grouping by a smaller period:
print(df.resample('M').last().head())
VALUE
2014-07-31 1000.0 # This is the last (and only) value from July, 2014
2014-08-31 NaN # No data for August, 2014
2014-09-30 NaN # No data for September, 2014
2014-10-31 2000.0
2014-11-30 NaN # No data for November, 2014
In this case, any periods for which there is no value are filled with NaNs. Also, for the months shown above, using first instead of last would have returned the same values, since each of those months has at most one value (July 2015, with its two rows, is where the two would differ).

Resampling dataframe is producing unexpected results

Long question short, what is an appropriate resampling freq/rule? Sometimes I get a dataframe mostly filled with NaNs, sometimes it works great. I thought I had a handle on it.
Below is an example,
I am processing a lot of data and was changing my resample frequency when I noticed that, for some reason, certain resample rules produce a result in which only one row has values and the rest are NaN's.
For example,
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['date'] = pd.date_range(start='1/1/2018', end='5/08/2018')
Creating some example data,
df['data1']=np.random.randint(1, 10, df.shape[0])
df['data2']=np.random.randint(1, 10, df.shape[0])
df['data3'] = np.arange(len(df))
df = df.set_index('date')  # the prints below show 'date' as the index; resample() needs it
The data looks like,
print(df.head())
print(df.shape)
data1 data2 data3
date
2018-01-01 7 7 0
2018-01-02 8 8 1
2018-01-03 2 7 2
2018-01-04 2 2 3
2018-01-05 2 5 4
(128, 3)
When I resample the data using offset aliases I get an unexpected results.
Below I resample the data every 3 minutes.
resampled=df.resample('3T').mean()
print(resampled.head())
print(resampled.shape)
data1 data2 data3
date
2018-01-01 00:00:00 4.0 5.0 0.0
2018-01-01 00:03:00 NaN NaN NaN
2018-01-01 00:06:00 NaN NaN NaN
2018-01-01 00:09:00 NaN NaN NaN
2018-01-01 00:12:00 NaN NaN NaN
Most of the rows are filled with NaN besides the first. I believe this is because there is no source data for the intervals my resampling rule creates. Is this correct? '24H' is the smallest interval for this data, but anything less leaves NaNs in the rows.
Can a dataframe be resampled for increments less than the datetime resolution?
I have had trouble in the past trying to resample a large dataset that spanned over a year with the datetime index formatted as %Y:%j:%H:%M:%S (year:day #: hour: minute:second, note: close enough without being verbose). Attempting to resample every 15 or 30 days also produced very similar results with NaNs. I thought it was due to having an odd date format with no month, but df.head() showed the index with correct dates.
When you resample to a lower frequency (downsample), one possible way to compute the result is simply mean(). It actually means:
The source DataFrame contains more detailed data than you need.
You want to change the sampling frequency to a lower one and compute e.g. a mean of each column over the source rows that fall into each new sampling period.
But when you increase the sampling frequency (upsample):
Your source data is too coarse.
You want to change the frequency to a higher one.
One possible way to compute the result is e.g. to interpolate between known source values.
Note that when you upsample daily data to 3-minute frequency then:
The first row will contain data between 2018-01-01 00:00:00 and
2018-01-01 00:03:00.
The next row will contain data between 2018-01-01 00:03:00 and
2018-01-01 00:06:00.
And so on.
So, based on your source data:
The first row contains data from 2018-01-01 (exactly at midnight).
Since no source data is available for the time range between
00:03:00 and 00:06:00 (on 2018-01-01), the second row contains
just NaN values.
The same pertains to further rows, up to 2018-01-01 23:57:00
(no source data for these time slices).
The next row, for 2018-01-02 00:00:00, can be filled with source data.
And so on.
There is nothing strange in this behaviour; resample just works this way.
Since you are actually upsampling the source data, maybe you should interpolate
the missing values?
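A minimal sketch of that interpolation route, reusing the example df from the question (with 'date' set as the DatetimeIndex); Resampler.interpolate fills the new 3-minute rows from the surrounding daily values:
# upsample the daily data to 3-minute bins and fill the gaps by linear interpolation
upsampled = df.resample('3T').interpolate()
print(upsampled.head())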

Pandas: Group by year and plot density

I have a data frame that contains some time based data:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].mean()
date
2001-01-01 0.567128
2002-01-01 0.581349
2003-01-01 0.556646
2004-01-01 0.549128
2005-01-01 NaN
2006-01-01 0.536796
2007-01-01 0.513109
2008-01-01 0.525859
2009-01-01 0.530433
2010-01-01 0.499250
2011-01-01 0.488159
2012-01-01 0.493405
2013-01-01 0.530207
Freq: AS-JAN, Name: INC_RANK, dtype: float64
And now I would like to plot the density for each year. The following command used to work for other data frames, but it does not work here:
>>> temp.groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density')
ValueError: ordinal must be >= 1
Here's how that column looks like:
>>> temp['INC_RANK'].head()
date
2001-01-01 0.516016
2001-01-01 0.636038
2001-01-01 0.959501
2001-01-01 NaN
2001-01-01 0.433824
Name: INC_RANK, dtype: float64
I think it is due to the NaNs in your data, as density cannot be estimated for NaNs. However, since you want to visualize density, it should not be a big issue to simply drop the missing values, assuming the missing/unobserved cells follow the same distribution as the observed/non-missing cells. Therefore, df.dropna().groupby(pd.TimeGrouper('AS'))['INC_RANK'].plot(kind='density') should suffice.
On the other hand, if the missing values are not 'unobserved' but rather are values outside the measuring range (say data from a temperature sensor that reads 0~50F; when a 100F temperature is encountered, the sensor sends out an error code and the reading is recorded as a missing value), then dropna() is probably not a good idea.
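A minimal sketch of the dropna route, using the temp frame from the question (DatetimeIndex named date); note that pd.TimeGrouper has since been deprecated in favour of pd.Grouper:
import pandas as pd
import matplotlib.pyplot as plt

# drop missing values first, then plot one density curve per year
temp['INC_RANK'].dropna().groupby(pd.Grouper(freq='AS')).plot(kind='density')
plt.show()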
