Imagine a data frame with multiple variables measured every 30 min. Every time series inside this data frame has gaps at possibly different positions. These gaps are to be replaced by some kind of running mean, let's say +/- 2 days. For example, if at day 4 07:30 I have missing data, I want to replace the NaN entry with the average of the measurements at 07:30 on days 2, 3, 5 and 6. Note that it is also possible that, for example, day 5 07:30 is also NaN -- in that case it should be excluded from the average that replaces the missing measurement at day 4 (should be possible with np.nanmean?).
I am not sure how to do this. Right now, I would probably loop over every single row and column in the data frame and write a really bad hack along the lines of np.mean(df.ix[[i-48, i, i+48], "A"]), but I feel there must be a more pythonic/pandas-y way?
Sample data set:
import numpy as np
import pandas as pd
# generate a 1-week time series
dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates),3), index=dates, columns=("A", "B", "C"))
# generate some artificial gaps
df.ix["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
df.ix["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
df.ix["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan
print df["2014-01-04 08:00":"2014-01-04 16:00"]
A B C
2014-01-04 08:00:00 0.675720 2.186484 -0.033969
2014-01-04 08:30:00 -0.897217 1.332437 -2.618197
2014-01-04 09:00:00 0.299395 0.837023 1.346117
2014-01-04 09:30:00 0.223051 0.913047 NaN
2014-01-04 10:00:00 NaN 1.395480 NaN
2014-01-04 10:30:00 NaN -0.800921 NaN
2014-01-04 11:00:00 NaN -0.932760 NaN
2014-01-04 11:30:00 0.057219 -0.071280 NaN
2014-01-04 12:00:00 0.215810 -1.099531 NaN
2014-01-04 12:30:00 -0.532563 NaN NaN
2014-01-04 13:00:00 -0.697872 NaN NaN
2014-01-04 13:30:00 -0.028541 NaN NaN
2014-01-04 14:00:00 -0.073426 NaN NaN
2014-01-04 14:30:00 -1.187419 0.221636 NaN
2014-01-04 15:00:00 1.802449 0.144715 NaN
2014-01-04 15:30:00 0.446615 1.013915 -1.813272
2014-01-04 16:00:00 -0.410670 1.265309 -0.198607
[17 rows x 3 columns]
(An even more sophisticated tool would also exclude measurements from the averaging procedure that were themselves created by averaging, but that doesn't necessarily have to be included in an answer, since I believe this may make things too complicated for now.)
/edit: A sample solution that I'm not really happy with:
# specify the columns of df where gaps should be filled
cols = ["A", "B", "C"]
for col in cols:
    for i, (idx, row) in enumerate(df.iterrows()):
        if np.isnan(row[col]):
            # replace with the mean of the same time on the adjacent days (+/- 48 rows)
            df.loc[idx, col] = np.nanmean(df[col].iloc[[i - 48, i + 48]])
There are two things I don't like about this solution:
If there is a single row missing or duplicated anywhere, this fails. In the last line, I would like to subtract "one day" all the time, no matter whether that is 47, 48 or 49 rows away. Also, it would be good if I could extend the range (e.g. -3 days to +3 days) without manually writing a list for the index.
I would like to get rid of the loops, if that is possible.
This should be a faster and more concise way to do it. The main thing is to use the shift() function instead of the loop. A simple version would be this:
df[ df.isnull() ] = np.nanmean( [ df.shift(-48), df.shift(48) ] )
It turned out to be really hard to generalize this, but this seems to work:
window = 1  # average over +/- `window` days
df[df.isnull()] = np.nanmean([df.shift(x).values for x in
                              range(-48*window, 48*(window+1), 48)], axis=0)
I'm not sure, but I suspect there might be a bug in nanmean, and it's also the reason you got missing values yourself: it seems that nanmean cannot handle NaNs properly if you feed it DataFrames directly. But if I convert them to arrays (with .values) and use axis=0, then it seems to work.
Check results for window=1:
print df.ix["2014-01-04 12:30":"2014-01-04 14:00", "B"]
print df.ix["2014-01-03 12:30":"2014-01-03 14:00", "B"]
print df.ix["2014-01-05 12:30":"2014-01-05 14:00", "B"]
2014-01-04 12:30:00 0.940193 # was nan, now filled
2014-01-04 13:00:00 0.078160
2014-01-04 13:30:00 -0.662918
2014-01-04 14:00:00 -0.967121
2014-01-03 12:30:00 0.947915 # day before
2014-01-03 13:00:00 0.167218
2014-01-03 13:30:00 -0.391444
2014-01-03 14:00:00 -1.157040
2014-01-05 12:30:00 0.932471 # day after
2014-01-05 13:00:00 -0.010899
2014-01-05 13:30:00 -0.934391
2014-01-05 14:00:00 -0.777203
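Putting the pieces together as a self-contained sketch (a variant of the above that uses fillna with a frame of day-shifted nanmeans instead of the boolean-mask assignment; window and the 48-rows-per-day spacing are taken straight from the question):
import numpy as np
import pandas as pd
window = 2  # average over +/- 2 days, as in the question
# day-shifted copies of the frame (48 rows = 1 day at 30-min frequency); offset 0 is skipped
offsets = [o for o in range(-48 * window, 48 * (window + 1), 48) if o != 0]
shifted = np.array([df.shift(o).values for o in offsets])
# nanmean over the shifted copies, then fill only the entries that are actually missing
fill = pd.DataFrame(np.nanmean(shifted, axis=0), index=df.index, columns=df.columns)
df = df.fillna(fill)
This is equivalent to the mask assignment above, since fillna only touches the NaN positions anyway.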
Regarding the missing-rows problem, it will depend on your data, but if you precede the above with
df = df.resample('30min').asfreq()
that will give you a row of NaNs for every missing timestamp, and then you can fill them in the same way as all the other NaNs. That's probably the simplest and fastest way if it works.
Alternatively, you could do something with groupby. My groupby-fu is weak but to give you the flavor of it, something like:
df.groupby(df.index.hour).fillna(method='pad')
would correctly deal with the issue of missing rows, but not the other things.
Related
I have to compute a daily sum on a dataframe, but only if at least 70% of the daily data is not NaN. If fewer than 70% of a day's values are valid, that day must not be taken into account. Is there a way to create such a mask? My dataframe covers more than 17 years of hourly data.
My data looks something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I groupby and aggregate, there is no way to know whether a given day had missing data; such days will simply have lower sums, which in turn lowers my monthly means.
As said in the comments, use groupby to group the data by date and then write an appropriate selection. This is an example that sums all days (assuming regular data points, 24 per day) that have fewer than 50% NaN entries:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data":np.random.randint(0,100,size=(len(date_rng)))}, index = date_rng)
# set some values to nan
df["data"][df["data"] > 50] = np.nan
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
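For the 70% threshold from the question, a sketch that masks on the fraction of valid entries per calendar day rather than a hard-coded count; df is the hourly frame from above:
# fraction of non-NaN entries per calendar day
valid_frac = df["data"].notna().groupby(df.index.date).mean()
# daily sums, masked to NaN wherever less than 70% of that day's data is valid
daily_sum = df["data"].groupby(df.index.date).sum()
daily_sum_70 = daily_sum.where(valid_frac >= 0.7)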
An alternative solution - you may find it useful & flexible:
# pip install convtools
from convtools import conversion as c
total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))
input_data = [] # e.g. iterable of dicts
converter = (
    c.group_by(
        c.item("key1"),
        c.item("key2"),
    )
    .aggregate(
        {
            "key1": c.item("key1"),
            "key2": c.item("key2"),
            "sum_if_70": c.if_(
                total_not_none / total_number < 0.7,
                None,
                total_sum,
            ),
        }
    )
    .gen_converter(
        debug=False
    )  # install black and set debug=True to see the generated ad-hoc code
)
result = converter(input_data)
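A hypothetical usage example for the converter above (the key1/key2/amount rows below are made up to mirror the 70% rule; the output is whatever convtools produces for the aggregate template, one aggregated dict per group):
# hypothetical input rows: key1/key2 identify the group, amount may be None (missing)
input_data = [
    {"key1": "2015-02-26", "key2": "Lab", "amount": 307.62},
    {"key1": "2015-02-26", "key2": "Lab", "amount": None},
    {"key1": "2015-02-26", "key2": "Lab", "amount": 199.94},
    {"key1": "2015-12-05", "key2": "Lab", "amount": 45.23},
]
result = converter(input_data)
print(result)  # sum_if_70 is None for the first group (2/3 < 70% non-None), 45.23 for the second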
The context
I am looking to apply a ufunc (cumsum in this case) to blocks of contiguous rows in a time series stored in a pandas DataFrame.
This time series is sorted according to its DatetimeIndex.
Blocks are defined by a custom DatetimeIndex.
To do so, I came up with this (ok) code.
import random
import pandas as pd

# input dataset
length = 10
ts = pd.date_range(start='2021/01/01 00:00', periods=length, freq='1h')
random.seed(1)
val = random.sample(range(1, 10 + length), length)
df = pd.DataFrame({'val': val}, index=ts)
# groupby custom datetimeindex
key_ts = [ts[i] for i in [1,3,7]]
df.loc[key_ts, 'id'] = range(len(key_ts))
df['id'] = df['id'].ffill()
# cumsum
df['cumsum'] = df.groupby('id')['val'].cumsum()
# initial dataset
In [13]: df
Out[13]:
val
2021-01-01 00:00:00 5
2021-01-01 01:00:00 3
2021-01-01 02:00:00 9
2021-01-01 03:00:00 4
2021-01-01 04:00:00 8
2021-01-01 05:00:00 13
2021-01-01 06:00:00 15
2021-01-01 07:00:00 14
2021-01-01 08:00:00 11
2021-01-01 09:00:00 7
# DatetimeIndex defining custom time intervals for 'resampling'.
In [14]: key_ts
Out[14]:
[Timestamp('2021-01-01 01:00:00', freq='H'),
Timestamp('2021-01-01 03:00:00', freq='H'),
Timestamp('2021-01-01 07:00:00', freq='H')]
# result
In [16]: df
Out[16]:
val id cumsum
2021-01-01 00:00:00 5 NaN -1
2021-01-01 01:00:00 3 0.0 3
2021-01-01 02:00:00 9 0.0 12
2021-01-01 03:00:00 4 1.0 4
2021-01-01 04:00:00 8 1.0 12
2021-01-01 05:00:00 13 1.0 25
2021-01-01 06:00:00 15 1.0 40
2021-01-01 07:00:00 14 2.0 14
2021-01-01 08:00:00 11 2.0 25
2021-01-01 09:00:00 7 2.0 32
The question
Is groupby the most efficient in terms of CPU and memory in this case where blocks are made with contiguous rows?
I would think that with groupby, a first pass over the full dataset is made to identify all rows to group together.
Knowing rows are contiguous in my case, I don't need to read the full dataset to know I have gathered all the rows of the current group.
As soon as I hit the row of the next group, I know calculations are done with the previous group.
When rows are contiguous, the sorting step is also lighter.
Hence the question: is there a way to tell pandas this, to save some CPU?
Thanks in advance for your feedback,
Bests
groupby is clearly not the fastest solution here, because it has to use either a (relatively slow) sort or hashing operations to group the values.
What you want to implement is called a segmented cumulative sum. You can implement this quite efficiently using NumPy, but it is a bit tricky (especially because of the NaN values) and still not the fastest option, because it needs multiple passes over the id/val columns. The fastest solution is to use something like Numba to do this in a single pass.
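For reference, the pure-NumPy route could look roughly like the sketch below; it skips the NaN-id handling that the Numba version below adds, and it assumes ids and values are plain 1-D arrays:
import numpy as np
def segmented_cumsum_np(ids, values):
    # cumulative sum of `values` that restarts whenever `ids` changes (no NaN handling)
    csum = np.cumsum(values)
    # indices where a new segment begins: row 0 plus every id change
    starts = np.flatnonzero(np.r_[True, ids[1:] != ids[:-1]])
    # amount to subtract inside each segment: the cumsum just before the segment start
    offsets = np.zeros(len(starts), dtype=csum.dtype)
    offsets[1:] = csum[starts[1:] - 1]
    # broadcast each segment's offset over its rows
    seg_lengths = np.diff(np.append(starts, len(values)))
    return csum - np.repeat(offsets, seg_lengths)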
Here is a Numba implementation:
import numpy as np
import numba as nb

# To avoid the compilation cost at runtime, use:
# @nb.njit('int64[:](float64[:],int64[:])')
@nb.njit
def segmentedCumSum(ids, values):
    size = len(ids)
    res = np.empty(size, dtype=values.dtype)
    if size == 0:
        return res
    zero = values.dtype.type(0)
    curValue = zero
    for i in range(size):
        if not np.isnan(ids[i]):
            if i > 0 and ids[i-1] != ids[i]:
                curValue = zero
            curValue += values[i]
            res[i] = curValue
        else:
            res[i] = -1
            curValue = zero
    return res
df['cumsum'] = segmentedCumSum(df['id'].to_numpy(), df['val'].to_numpy())
Note that ids[i-1] != ids[i] might fail with big floats because of their imprecision. The best solution is to use integers and -1 to replace the NaN value. If you do want to keep the float values, you can use the expression np.abs(ids[i-1]-ids[i]) > epsilon with a very small epsilon. See this for more information.
I have a time series dataframe with dates|weather information that looks like this:
2017-01-01 5
2017-01-02 10
.
.
2017-12-31 6
I am trying to upsample it to hourly data using the following:
weather.resample('H').pad()
I expected to see 8760 entries for 24 intervals * 365 days. However, it only returns 8737, with the last 23 intervals missing for the 31st of December. Is there something special I need to do to get 24 intervals for the last day?
Thanks in advance.
Pandas normalizes 2017-12-31 to 2017-12-31 00:00 and then creates a range that ends at that last datetime... I would include a last row before resampling with
df.loc['2018-01-01'] = 0
Edit:
You can get the result you want with numpy.repeat
Take this df
np.random.seed(1)
weather = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-12-31'),
                       data={'WEATHER_MAX': np.random.random(365)*15})
WEATHER_MAX
2017-01-01 6.255330
2017-01-02 10.804867
2017-01-03 0.001716
2017-01-04 4.534989
2017-01-05 2.201338
... ...
2017-12-27 4.503725
2017-12-28 2.145087
2017-12-29 13.519627
2017-12-30 8.123391
2017-12-31 14.621106
[365 rows x 1 columns]
By repeating along axis=1 you can then transform the default range(24) column labels into hourly time offsets
# repeat, then stack
hourly = pd.DataFrame(np.repeat(weather.values, 24, axis=1),
                      index=weather.index).stack()
# combine date and hour
hourly.index = (
    hourly.index.get_level_values(0) +
    pd.to_timedelta(hourly.index.get_level_values(1), unit='h')
)
hourly = hourly.rename('WEATHER_MAX').to_frame()
Output
WEATHER_MAX
2017-01-01 00:00:00 6.255330
2017-01-01 01:00:00 6.255330
2017-01-01 02:00:00 6.255330
2017-01-01 03:00:00 6.255330
2017-01-01 04:00:00 6.255330
... ...
2017-12-31 19:00:00 14.621106
2017-12-31 20:00:00 14.621106
2017-12-31 21:00:00 14.621106
2017-12-31 22:00:00 14.621106
2017-12-31 23:00:00 14.621106
[8760 rows x 1 columns]
What to do, and the reason for it, are the same as in @RichieV's answer.
However, the value used for the extra row should not be 0 or some other meaningless value; you need valid data actually measured on 2018-01-01.
This is because using a meaningless value degrades the resampled 2017-12-31 data and any results derived from it.
1. Prepare a valid value for 2018-01-01 at the end of the data.
2. Call resample.
3. Delete the 2018-01-01 data after resampling.
You will get 8760 rows for 2017.
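A minimal sketch of those three steps (the toy weather frame and the measured_value placeholder are made up; in practice the boundary row should hold the value actually measured on 2018-01-01):
import numpy as np
import pandas as pd
# toy daily series standing in for the real one, which ends on 2017-12-31
weather = pd.DataFrame({'WEATHER_MAX': np.random.random(365) * 15},
                       index=pd.date_range('2017-01-01', '2017-12-31'))
measured_value = 7.5  # hypothetical reading actually taken on 2018-01-01
weather.loc[pd.Timestamp('2018-01-01')] = measured_value  # 1. append a valid boundary row
hourly = weather.resample('H').pad()                      # 2. upsample to hourly
hourly = hourly.loc[:'2017-12-31 23:00:00']               # 3. drop the 2018-01-01 row again
print(len(hourly))  # 8760 rows for 2017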
Looking at @RichieV's updated answer:
I was misunderstanding the question. My answer assumed the goal was to extrapolate (interpolate) the data using resample, i.e. to complement resample with interpolate and similar methods.
If simply reusing the 00:00 value for the rest of the day is acceptable, that is a different way of thinking about it.
I've got two dataframes that both have a date column and an emaX column. When I merge them, I get the expected result of a single date column and two emaX columns. But when I try to access the date key from the merged dataframe, it raises a KeyError: date.
This is the function that returns the emaX (I have two, but they're nearly identical):
def av_get_ema_20():
    ti = TechIndicators(key=TOKEN, output_format="pandas")
    emaData20, meta_ema = ti.get_ema(symbol=SYMBOL, interval=INTERVAL, time_period=20, series_type=EMA_TYPE)
    ema20renamed = pd.DataFrame(emaData20)
    ema20renamed.rename(columns={'EMA': 'ema20'}, inplace=True)
    return ema20renamed
Then I merge the two returned dataframes:
mergedDF = pd.merge(av_get_ema_10(), av_get_ema_20(), on=["date"], how="inner")
# TEST LINE
print(mergedDF)
The dataframe that is printed out appears as I expected it to be:
ema10 ema20
date
2020-01-02 11:30:00 3226.5200 NaN
2020-01-02 12:30:00 3229.0927 NaN
2020-01-02 13:30:00 3232.0558 NaN
2020-01-02 14:30:00 3235.0839 NaN
2020-01-02 15:30:00 3239.1668 NaN
... ... ...
2020-03-26 11:30:00 2524.9545 2473.8551
2020-03-26 12:30:00 2533.1755 2483.0279
2020-03-26 13:30:00 2541.2982 2492.0586
2020-03-26 14:30:00 2551.0458 2501.8540
2020-03-26 15:30:00 2565.2866 2513.9983
But then when I attempt to use the merged dataframe (for example, iterating through it), I get KeyError: date:
for index, row in mergedDF.iterrows():
    print(row["date"], row["ema10"], row["ema20"])
Am I misinterpreting the dataframe in some way or is there something else I am supposed to do prior to using the merged set (including the date)? I'm at a loss here.
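For what it's worth, the printed frame suggests that date ended up as the index of mergedDF rather than a regular column, so inside iterrows() it is the row label rather than row["date"]. A sketch of both ways to get at it (assuming the index really is named date):
# the date is the row label returned by iterrows(), not a column
for index, row in mergedDF.iterrows():
    print(index, row["ema10"], row["ema20"])
# or turn the index back into a regular "date" column first
mergedDF_reset = mergedDF.reset_index()
for _, row in mergedDF_reset.iterrows():
    print(row["date"], row["ema10"], row["ema20"])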
The problem
Suppose I have a time series dataframe df (a pandas dataframe) and some days I want to slice from it, contained in another dataframe called sample_days:
>>> df
foo bar
2020-01-01 00:00:00 0.360049 0.897839
2020-01-01 01:00:00 0.285667 0.409544
2020-01-01 02:00:00 0.323871 0.240926
2020-01-01 03:00:00 0.921623 0.766624
2020-01-01 04:00:00 0.087618 0.142409
... ... ...
2020-12-31 19:00:00 0.145111 0.993822
2020-12-31 20:00:00 0.331223 0.021287
2020-12-31 21:00:00 0.531099 0.859035
2020-12-31 22:00:00 0.759594 0.790265
2020-12-31 23:00:00 0.103651 0.074029
[8784 rows x 2 columns]
>>> sample_days
month day
0 3 16
1 7 26
2 8 15
3 9 26
4 11 25
I want to slice df with the days specified in sample_days. I can do this with for loops (see below). However, is there a way to avoid for loops (as this would be more efficient)? The result should be a dataframe called sample like the following:
>>> sample
foo bar
2020-03-16 00:00:00 0.707276 0.592614
2020-03-16 01:00:00 0.136679 0.357872
2020-03-16 02:00:00 0.612331 0.290126
2020-03-16 03:00:00 0.276389 0.576996
2020-03-16 04:00:00 0.612977 0.781527
... ... ...
2020-11-25 19:00:00 0.904266 0.825501
2020-11-25 20:00:00 0.269589 0.050304
2020-11-25 21:00:00 0.271814 0.418235
2020-11-25 22:00:00 0.595005 0.973198
2020-11-25 23:00:00 0.151149 0.024057
[120 rows x 2 columns]
which is just the df sliced across the correct days.
My (slow) solution
I've managed to do this using for loops and pd.concat:
sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
                           df.index.day.isin([sample_day.day])]
                    for sample_day in sample_days.itertuples()])
which is based on concatenating multiple days as sliced by the method indicated here. This gives the desired result but is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas just calling df.loc[df.index.day == 1] (presumably avoiding python for loops under-the-hood) is around 300 times faster. However, this is a slice on just the day -- I am slicing on month and day.
Apologies if this has been answered somewhere else -- I've searched for quite a while but perhaps was not using the correct keywords.
You can do a string comparison of the months and days at the same time.
You need the space to differentiate between, for example, month 11 day 2 and month 1 day 12; without it both would become "112" and be regarded as the same.
df.loc[(df.index.month.astype(str) + ' ' + df.index.day.astype(str))
       .isin(sample_days['month'].astype(str) + ' ' + sample_days['day'].astype(str))]
After getting a bit of inspiration from @Ben Pap's solution (thanks!), I've found a solution that is both fast and avoids any "hacks" like changing datetimes to strings. It combines the month and day into a single MultiIndex, as below (you can make this a single line, but I've expanded it into multiple to make the idea clear).
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                        names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
If I run this code along with my original for loop and @Ben Pap's answer, and sample 100 days from a one-year time series for 2020 (8784 hours including the leap day), I get the following solution times:
Original for loop: 0.16s
@Ben Pap's solution, combining month and day into a single string: 0.019s
Above solution using MultiIndex: 0.006s
so I think using a MultiIndex is the way to go.
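For reference, a rough harness to reproduce such a comparison (the numbers will of course vary by machine; the 8784-hour frame and the 100 sampled days below are generated arbitrarily):
import timeit
import numpy as np
import pandas as pd
# one leap year of hourly data plus 100 arbitrary sample days
rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', periods=8784, freq='H')
df = pd.DataFrame({'foo': rng.random(8784), 'bar': rng.random(8784)}, index=idx)
days = pd.Series(idx.normalize().unique()).sample(100, random_state=0)
sample_days = pd.DataFrame({'month': days.dt.month, 'day': days.dt.day}).reset_index(drop=True)
def multiindex_way():
    full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                           names=['month', 'day'])
    sample_index = pd.MultiIndex.from_frame(sample_days)
    return df.loc[full_index.isin(sample_index)]
print(timeit.timeit(multiindex_way, number=100), 'seconds for 100 runs')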