Hi, I am trying to resample a pandas DataFrame backwards.
This is my dataframe:
import numpy as np
import pandas as pd
from random import randint

seconds = np.arange(20, 700, 60)
timedeltas = pd.to_timedelta(seconds, unit='s')
vals = np.array([randint(-10, 10) for a in range(len(seconds))])
df = pd.DataFrame({'values': vals}, index=timedeltas)
then I have
In [252]: df
Out[252]:
values
00:00:20 8
00:01:20 4
00:02:20 5
00:03:20 9
00:04:20 7
00:05:20 5
00:06:20 5
00:07:20 -6
00:08:20 -3
00:09:20 -5
00:10:20 -5
00:11:20 -10
and
In [253]: df.resample('5min').mean()
Out[253]:
values
00:00:20 6.6
00:05:20 -0.8
00:10:20 -7.5
and what I would like is something like
Out[***]:
values
00:01:20 6
00:06:20 valb
00:11:20 -5.8
where the value at each new time is the mean of each bin obtained by binning the dataframe from the end backwards. For example, in this case the last value should be
valc = (-6 - 3 - 5 - 5 - 10) / 5
valc = -5.8
which is the average of the last 5 values, and the first one should be the average of only the first 2 values, because that "bin" is incomplete.
Reading the pandas documentation I thought I had to use the parameter how='last', but in my current version of pandas (0.20.3) this is not working. I also tried the closed and convention options, but I wasn't able to get this behaviour.
Thanks for the help
The easiest way is to sort the index in reverse order, then resample to get the desired results:
df.sort_index(ascending=False).resample('5min').mean()
Resample reference: when resampling starts, the first bin gets the maximum available length, in this case 5 minutes. The closed, label, and convention parameters are helpful, but none of them computes the mean from the end backwards; to do that, sort first.
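On a recent pandas you could also get this backwards binning without sorting, since resample gained an origin parameter (origin='end' was added in 1.3). A minimal sketch, assuming origin='end' is also accepted for a TimedeltaIndex (if not, convert the index to datetimes first):
# anchor the bins at the last timestamp, so the *last* bin is always full
# (requires pandas >= 1.3)
df.resample('5min', origin='end', closed='right', label='right').mean()
With the sample data this should label the bins 00:01:20, 00:06:20 and 00:11:20, the last one being the mean of the final five observations, -5.8, as in the desired output.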
Related
I am trying to loop through a dataframe creating dynamic ranges that are limited to the last 6 months of every row index.
Because I am looking back 6 months, I start from the first index row that has a date >= the first date in row index 0 of the dataframe. The condition which I have managed to create is shown below:
for i in df.index:
    if datetime.strptime(df['date'][i], '%Y-%m-%d %H:%M:%S') >= (
            datetime.strptime(df['date'].iloc[0], '%Y-%m-%d %H:%M:%S')
            + dateutil.relativedelta.relativedelta(months=6)):
However, this merely creates ranges that grow in size, incorporating all data that is indexed after the first index row that has a date >= the first date in row index 0 of the dataframe.
How can I limit the condition statement to only the last 6 months of each row index?
I'm not sure what exactly you want to do once you have your "dynamic ranges".
You can obtain a list of intervals (t - 6mo, t) for each t in your DatetimeIndex:
intervals = [(t - pd.DateOffset(months=6), t) for t in df.index]
But doing selection operations in a big for-loop might be slow.
Instead, you might be interested in pandas's rolling operations. Rolling can even use a date offset (as long as it is fixed-frequency) instead of a fixed-size int window width. However, "6 months" is a non-fixed frequency, and as such the regular rolling won't accept it.
Still, if you are ok with an approximation, say "182 days", then the following might work well.
# setup
import numpy as np
import pandas as pd

n = 10
df = pd.DataFrame(
    {'a': np.arange(n), 'b': np.ones(n)},
    index=pd.date_range('2019-01-01', freq='M', periods=n))
# example: sum
df.rolling('182D', min_periods=0).sum()
# out:
a b
2019-01-31 0.0 1.0
2019-02-28 1.0 2.0
2019-03-31 3.0 3.0
2019-04-30 6.0 4.0
2019-05-31 10.0 5.0
2019-06-30 15.0 6.0
2019-07-31 21.0 7.0
2019-08-31 27.0 6.0
2019-09-30 33.0 6.0
2019-10-31 39.0 6.0
If you want to be strict about the 6-month windows, you can implement your own pandas.api.indexers.BaseIndexer and use that as the arg of rolling.
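A minimal sketch of such an indexer, assuming a DatetimeIndex (the class name SixMonthIndexer is illustrative, not a pandas API):
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class SixMonthIndexer(BaseIndexer):
    # For each row, the window spans all rows whose timestamp lies in the
    # half-open interval (t - 6 months, t], i.e. strictly after t - 6mo.
    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        idx = pd.DatetimeIndex(self.index_array)
        starts = np.array(
            [idx.searchsorted(t - pd.DateOffset(months=6), side='right')
             for t in idx], dtype=np.int64)
        ends = np.arange(1, num_values + 1, dtype=np.int64)  # end-exclusive
        return starts, ends

indexer = SixMonthIndexer(index_array=df.index)
df.rolling(indexer, min_periods=1).sum()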
I have a dataset that has multiple values received per second - up to 100 DFS (no more, but not consistently 100). The challenge is that the date field did not capture time more granularly than second, so multiple rows have the same hh:mm:ss timestamp. These are fine, but I also have several seconds missing across the set, i.e., not showing at all.
Therefore my 2 initial columns might look like this, where I am missing the 54 sec step:
2020-08-24 03:36:53, 5
2020-08-24 03:36:53, 8
2020-08-24 03:36:53, 6
2020-08-24 03:36:55, 8
Because of the legit date "duplicates" and the information I need from this, I don't want to aggregate, but I do need to create the missing seconds, insert them, and fill them (with NaN, etc.) so I can then manage them appropriately for aligning with other datasets.
The only way I can seem to do this is with a nested if loop which looks at the previous timestamp: if it is the same as the current cell (pt == ct), no action; if it is 1 less (pt == ct - 1), no action; but if it trails the current cell by 2 or more (pt <= ct - 2), insert the missing seconds. This feels a bit cumbersome (though workable). Am I missing an easier way to do this?
I have checked a lot of "fill missing dates" threads on here as well as in various functions on pandas.pydata.org but reindexing and the most common date fills all seem to rely on dates not having duplicates. Any advice would be fantastic.
This can be solved by creating a pandas series containing all timepoints you want to consider and then merge this with the original dataframe.
For example:
start, end = df['date'].min(), df['date'].max()
all_timepoints = pd.date_range(start, end, freq='s').to_series(name='date')
df.merge(all_timepoints, on='date', how='outer', sort=True).fillna(0)
Will give:
date value
0 2020-08-24 03:36:53 5.0
1 2020-08-24 03:36:53 8.0
2 2020-08-24 03:36:53 6.0
3 2020-08-24 03:36:54 0.0
4 2020-08-24 03:36:55 8.0
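For reference, the example frame above can be built like this (column names assumed from the output):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-08-24 03:36:53',
                            '2020-08-24 03:36:53',
                            '2020-08-24 03:36:53',
                            '2020-08-24 03:36:55']),
    'value': [5, 8, 6, 8],
})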
I would like to perform a moving average considering periodic boundary conditions. I will try to make myself clear.
I have this data:
Date,Q
1989-01-01 00:00,0
1989-01-02 00:00,1
1989-01-03 00:00,4
1989-01-04 00:00,6
1989-01-05 00:00,8
1989-01-06 00:00,10
1989-01-07 00:00,11
I would like to compute the moving average over 3 data points: the current one, the next, and the previous.
In particular, I would like an option of the "rolling" function where the first data point (index 0 in Python) takes into account the last one, and vice versa the last one takes into account the first. This would allow me to have a sort of periodic boundary conditions.
Indeed, I have applied the following:
First, I read the dataframe
df = pd.read_csv(fname, index_col = 0, parse_dates=True)
then I apply the "rolling" as
df['Q'] = pd.Series(df["Q"].rolling(3, center=True).mean())
However, I get the following results:
Date
1989-01-01 NaN
1989-01-02 1.66
1989-01-03 3.66
1989-01-04 6
1989-01-05 8
1989-01-06 9.66
1989-01-07 NaN
I know that I could apply the "min_periods=1" option, but this is not what I want. Indeed, it is clear that in the second row the result is correct:
1.66 = (0+1+4)/3
However, I would like to have this result in the first row:
(0+1+11)/3
As you can notice, the number 11 is the value of the last row. Similarly, I expect in the last row:
(10+11+0)/3
where 0 is the value of the first row.
Do you have any suggestions or ideas?
Thanks,
Diego
I would just duplicate the values before the first one and after the last one, sort the dataframe, and do the rolling average. Then it is enough to drop the added values:
# copy the last value to one day before the start
df.loc[df.index[0] - pd.offsets.Day(1), 'Q'] = df.iloc[-1]['Q']
# the new row was appended at the end, so the original last row is now at -2;
# copy the first value to one day after the end
df.loc[df.index[-2] + pd.offsets.Day(1), 'Q'] = df.iloc[0]['Q']
df = df.sort_index()
df['Q'] = df['Q'].rolling(3, center=True).mean()
# drop the two padded boundary rows
df = df.iloc[1:-1]
It gives as expected:
Q
Date
1989-01-01 4.000000
1989-01-02 1.666667
1989-01-03 3.666667
1989-01-04 6.000000
1989-01-05 8.000000
1989-01-06 9.666667
1989-01-07 7.000000
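An alternative sketch that avoids the index padding altogether, using numpy's wrap-around roll (the column name Q_periodic is illustrative):
import numpy as np

# np.roll wraps around, so element 0 sees the last value and vice versa
q = df['Q'].to_numpy()
df['Q_periodic'] = (np.roll(q, 1) + q + np.roll(q, -1)) / 3
For the sample data this reproduces the same boundary values, (0 + 1 + 11) / 3 = 4 and (10 + 11 + 0) / 3 = 7.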
I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes it skips a second, or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the number of data points becomes smaller. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method which adds data to the dataframe. I thought about it, but don't know how to implement it. My idea is as follows:
If the index is not equal to the time, then we need to add a number at time = index. If the gap is only 1 value, then the average of the previous and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with a complete time column and then interpolate:
# Create your table
import numpy as np
import pandas as pd

time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First you can convert the second values to actual datetime values like so:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
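Putting it together on the question's sample data (a minimal sketch; the single missing second at time 3 and the long gap up to time 100 are both filled linearly):
import pandas as pd

df = pd.DataFrame({'time': [0, 1, 2, 4, 100],
                   'temp': [20.10, 20.20, 20.20, 20.10, 22.30]})
df.index = pd.to_datetime(df['time'], unit='s')
df = df.resample('s').interpolate('time')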
I need to fill the NA values with the mean of the three values preceding each NA.
this is my dataset
RECEIPT_MONTH_YEAR NET_SALES
0 2014-01-01 818817.20
1 2014-02-01 362377.20
2 2014-03-01 374644.60
3 2014-04-01 NA
4 2014-05-01 NA
5 2014-06-01 NA
6 2014-07-01 NA
7 2014-08-01 46382.50
8 2014-09-01 55933.70
9 2014-10-01 292303.40
10 2014-10-01 382928.60
Is this dataset a .csv file or a dataframe? Is the NA an actual NaN or the string 'NA'?
import pandas as pd
import numpy as np

df = pd.read_csv('your dataset', sep=' ')
df = df.replace('NA', np.nan)  # replace returns a copy, so assign it back
df.fillna(method='ffill', inplace=True)
You mention something about the mean of 3 values. The above simply forward-fills the last observation before the NaNs begin. This is often a good way for forecasting (better than taking means in certain cases, if persistence is important).
ind = df['NET_SALES'].index[df['NET_SALES'].apply(np.isnan)]
# mean of the 3 values before the first NaN (skipna ignores any NaNs among them)
Meanof3 = df['NET_SALES'].iloc[ind[0] - 3:ind[0]].mean(skipna=True)
df['NET_SALES'] = df['NET_SALES'].fillna(Meanof3)
Maybe the answer can be generalised and improved if more info about the dataset is known, for example if you always want to take the mean of the last 3 measurements before any NA. The above lets you check the indices that are NaN and then take the mean of the 3 before, while ignoring any NaNs.
This is simple but it works:
df_data.fillna(0, inplace=True)
for i in range(len(df_data)):
    if df_data['NET_SALES'].iloc[i] == 0.0:
        condtn = (df_data['NET_SALES'].iloc[i - 1]
                  + df_data['NET_SALES'].iloc[i - 2]
                  + df_data['NET_SALES'].iloc[i - 3])
        # assign via .loc to avoid chained-assignment warnings
        df_data.loc[df_data.index[i], 'NET_SALES'] = condtn / 3
You could use fillna (assuming that your NA is already np.nan) and rolling mean:
import pandas as pd
import numpy as np

df = pd.DataFrame([818817.2, 362377.2, 374644.6, np.nan, np.nan, np.nan, np.nan,
                   46382.5, 55933.7, 292303.4, 382928.6], columns=["NET_SALES"])
df["NET_SALES"] = df["NET_SALES"].fillna(
    df["NET_SALES"].shift(1).rolling(3, min_periods=1).mean())
Out:
NET_SALES
0 818817.2
1 362377.2
2 374644.6
3 518613.0
4 368510.9
5 374644.6
6 NaN
7 46382.5
8 55933.7
9 292303.4
10 382928.6
If you want to include the imputed values I guess you'll need to use a loop.
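For instance, a minimal sketch of such a loop, where each filled value is available when computing the mean for the next gap (near the start it falls back to fewer than 3 values):
s = df["NET_SALES"].copy()
for i in range(len(s)):
    if pd.isna(s.iloc[i]):
        # mean of up to three preceding values, imputed ones included
        s.iloc[i] = s.iloc[max(i - 3, 0):i].mean()
df["NET_SALES"] = s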