I have a pandas DataFrame like the following:
Date time LifeTime1 LifeTime2 LifeTime3 LifeTime4 LifeTime5
2020-02-11 17:30:00 6 7 NaN NaN 3
2020-02-11 17:30:00 NaN NaN 3 3 NaN
2020-02-12 15:30:00 2 2 NaN NaN 3
2020-02-16 14:30:00 4 NaN NaN NaN 1
2020-02-16 14:30:00 NaN 7 NaN NaN NaN
2020-02-16 14:30:00 NaN NaN 8 2 NaN
The dates are identical for some rows. Is it possible to add 1, 2, or 3 seconds to the second, third, and fourth occurrences of an identical date? If a date is unique, leave it as is. If there are two identical dates, leave the first as is and add 1 second to the second. If there are three identical dates, leave the first as is, add 1 second to the second, and add 2 seconds to the third. Is this easy to do in pandas?
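For reference, a minimal sketch that rebuilds the frame above (the values and NaN layout are read off the display):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date time': ['2020-02-11 17:30:00', '2020-02-11 17:30:00',
                  '2020-02-12 15:30:00', '2020-02-16 14:30:00',
                  '2020-02-16 14:30:00', '2020-02-16 14:30:00'],
    'LifeTime1': [6, np.nan, 2, 4, np.nan, np.nan],
    'LifeTime2': [7, np.nan, 2, np.nan, 7, np.nan],
    'LifeTime3': [np.nan, 3, np.nan, np.nan, np.nan, 8],
    'LifeTime4': [np.nan, 3, np.nan, np.nan, np.nan, 2],
    'LifeTime5': [3, np.nan, 3, 1, np.nan, np.nan],
})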
You can use groupby.cumcount combined with pandas.to_timedelta with unit='s' to add incremental seconds to the duplicated rows:
s = pd.to_datetime(df['Date time'])
df['Date time'] = s + pd.to_timedelta(s.groupby(s).cumcount(), unit='s')
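Here s.groupby(s).cumcount() numbers the duplicates 0, 1, 2, ... within each timestamp, so the first occurrence gets +0 seconds. A quick check on the frame above:
print(s.groupby(s).cumcount().tolist())
# [0, 1, 0, 0, 1, 2]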
As a one-liner with the Python 3.8+ walrus operator:
df['Date time'] = ((s := pd.to_datetime(df['Date time']))
                   + pd.to_timedelta(s.groupby(s).cumcount(), unit='s'))
Output:
Date time LifeTime1 LifeTime2 LifeTime3 LifeTime4 LifeTime5
0 2020-02-11 17:30:00 6.0 7.0 NaN NaN 3.0
1 2020-02-11 17:30:01 NaN NaN 3.0 3.0 NaN
2 2020-02-12 15:30:00 2.0 2.0 NaN NaN 3.0
3 2020-02-16 14:30:00 4.0 NaN NaN NaN 1.0
4 2020-02-16 14:30:01 NaN 7.0 NaN NaN NaN
5 2020-02-16 14:30:02 NaN NaN 8.0 2.0 NaN
I have a dataframe where the index is an increasing date and the columns are observations of variables. The array is sparse.
My goal is to propagate known values forward in time to fill NaNs, but I want to stop at the last non-NaN value, as that last value signifies the "death" of the variable.
e.g. for the dataset
             a    b    c
2020-01-01  NaN   11  NaN
2020-02-01    1  NaN  NaN
2020-03-01  NaN  NaN   14
2020-04-01    2  NaN  NaN
2020-05-01  NaN  NaN  NaN
2020-06-01  NaN  NaN   15
2020-07-01    3  NaN  NaN
2020-08-01  NaN  NaN  NaN
I want the output to be:
             a    b    c
2020-01-01  NaN   11  NaN
2020-02-01    1  NaN  NaN
2020-03-01    1  NaN   14
2020-04-01    2  NaN   14
2020-05-01    2  NaN   14
2020-06-01    2  NaN   15
2020-07-01    3  NaN  NaN
2020-08-01  NaN  NaN  NaN
I can identify the index of the last observation using df.notna()[::-1].idxmax(), but I can't figure out how to use this to limit the fillna operation.
I'd be grateful for any suggestions. Many thanks
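For reproducibility, a sketch that rebuilds the example frame (values read off the table above):
import pandas as pd
import numpy as np

idx = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
                      '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01'])
df = pd.DataFrame({'a': [np.nan, 1, np.nan, 2, np.nan, np.nan, 3, np.nan],
                   'b': [11] + [np.nan] * 7,
                   'c': [np.nan, np.nan, 14, np.nan, np.nan, 15, np.nan, np.nan]},
                  index=idx)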
Use DataFrame.where with a mask: cells that are still NaN after a back fill have no later valid value, so they are kept as NaN while everything else is forward filled:
df = df.where(df.bfill().isna(), df.ffill())
print(df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
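Why this works: df.bfill().isna() is True exactly where a column has no later valid value, i.e. after its "death", so those cells stay NaN. A quick check on a standalone copy of column a from the example:
orig_a = pd.Series([np.nan, 1, np.nan, 2, np.nan, np.nan, 3, np.nan])
print(orig_a.bfill().isna().tolist())
# [False, False, False, False, False, False, False, True]  -> only the tail stays NaN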
Your idxmax-based solution can be used too if you compare it, converted to a NumPy array, against the index with broadcasting:
mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
df = df.where(mask, df.ffill())
print(df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
You can use Series.last_valid_index, which is designed exactly for this (it returns the index of the last non-NA/null value), to forward fill only up to that point:
Assuming your dataset is called df:
df.apply(lambda x: x.loc[:x.last_valid_index()].ffill())
              a     b     c
2020-01-01  NaN  11.0   NaN
2020-02-01  1.0   NaN   NaN
2020-03-01  1.0   NaN  14.0
2020-04-01  2.0   NaN  14.0
2020-05-01  2.0   NaN  14.0
2020-06-01  2.0   NaN  15.0
2020-07-01  3.0   NaN   NaN
2020-08-01  NaN   NaN   NaN
More on this in the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last_valid_index.html
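A quick check of what last_valid_index returns, using the example frame rebuilt earlier:
print(df['b'].last_valid_index())  # Timestamp('2020-01-01 00:00:00')
print(df['c'].last_valid_index())  # Timestamp('2020-06-01 00:00:00')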
I have a dataframe where some cases have their records spread across more than one row, with nulls in some rows, like so:
date_rounded 1 2 3 4 5
0 2020-04-01 00:05:00 0.0 NaN NaN NaN NaN
1 2020-04-01 00:05:00 NaN 1.0 44.0 44.0 46.454
2 2020-04-01 00:05:00 NaN NaN NaN NaN NaN
I want to have only one row with the filled data, so far I have:
df.groupby(['date_rounded']).apply(lambda df0: df0.fillna(method='ffill').fillna(method='bfill').drop_duplicates())
This works, but it is slow. Any better ideas?
Thanks
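For reference, a sketch rebuilding the example (column labels assumed to be strings):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date_rounded': ['2020-04-01 00:05:00'] * 3,
    '1': [0.0, np.nan, np.nan],
    '2': [np.nan, 1.0, np.nan],
    '3': [np.nan, 44.0, np.nan],
    '4': [np.nan, 44.0, np.nan],
    '5': [np.nan, 46.454, np.nan],
})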
You can also use groupby and first:
df.groupby("date_rounded").first()
1 2 3 4 5
date_rounded
2020-04-01 00:05:00 0.0 1.0 44.0 44.0 46.454
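Note that GroupBy.first returns the first non-null entry of each column within the group, so it collapses the partial rows into one without any explicit filling.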
If you need to fill within each group, you can use groupby().apply and bfill:
df.groupby('date_rounded', as_index=False).apply(lambda x: x.bfill().iloc[0])
Output:
          date_rounded    1    2     3     4       5
0  2020-04-01 00:05:00  0.0  1.0  44.0  44.0  46.454
I have a DataFrame of the form
eqt_code ACA_FP AC_FP AI_FP
BDATE
2015-01-01 NaN NaN NaN
2015-01-02 NaN NaN NaN
2015-01-05 1 NaN NaN
2015-01-06 NaN NaN NaN
2015-01-07 NaN NaN NaN
2015-01-08 NaN 0.2 NaN
2015-01-09 NaN NaN NaN
2015-01-12 5 NaN NaN
2015-01-13 NaN NaN NaN
2015-01-14 NaN NaN NaN
2015-01-15 NaN NaN NaN
And I would like, for each month, to get the last non-NaN value of each column (NaN if there is no valid value), resulting in something like:
eqt_code ACA_FP AC_FP AI_FP
BDATE
2015-01-31 5 0.2 NaN
2015-02-28 10 1 3
2015-03-31 NaN NaN 3
2015-04-30 10 1 3
I had two ideas to perform this:
Do a ffill with a limit that goes to the end of the month. Something like df.ffill(<add good thing here>).resample('M').last().
Use last_valid_index with resample('M').
Using resample:
df.resample('M').last()
Out[82]:
eqt_code    ACA_FP  AC_FP  AI_FP
BDATE
2015-01-31     5.0    0.2    NaN
Use groupby and last:
# Do this if the index isn't a DatetimeIndex.
# df.index = pd.to_datetime(df.index)
df.groupby(df.index + pd.offsets.MonthEnd(0)).last()
ACA_FP AC_FP AI_FP
BDATE
2015-01-31 5.0 0.2 NaN
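pd.offsets.MonthEnd(0) rolls each date forward to its month end, leaving dates already on a month end unchanged, which makes the group keys line up with resample('M'):
import pandas as pd

print(pd.Timestamp('2015-01-05') + pd.offsets.MonthEnd(0))  # 2015-01-31 00:00:00
print(pd.Timestamp('2015-01-31') + pd.offsets.MonthEnd(0))  # 2015-01-31 00:00:00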
...
Using df.dropna(how='all') will remove each row where all the values are NaN, and will get you most of the way there.
I want to replace all values less than 5 in the following df with NaN, but column B should be excluded from the operation without dropping it.
A B C D
DateTime
2016-03-03 05:45:00 1 2 3 4
2016-03-03 06:00:00 1 2 3 4
2016-03-03 06:15:00 1 2 3 4
2016-03-03 06:30:00 1 2 3 4
2016-03-03 06:45:00 1 2 3 4
desired result
A B C D
DateTime
2016-03-03 05:45:00 NaN 2 NaN NaN
2016-03-03 06:00:00 NaN 2 NaN NaN
2016-03-03 06:15:00 NaN 2 NaN NaN
2016-03-03 06:30:00 NaN 2 NaN NaN
2016-03-03 06:45:00 NaN 2 NaN NaN
I can take column B out of the df, apply df[df < 5] = np.nan to the remaining df, then combine them again. Dropping column B before the operation would be another approach. But I am looking for a more efficient way, a one-liner if possible.
I tried df[df.columns.difference(['B']) < 5] = np.nan, but it is not correct. Also df[(df.B != 'Other') < 5] = np.nan, without success.
Let's use a more sensible example:
A B C D
DateTime
2016-03-03 05:45:00 1 2 3 4
2016-03-03 06:00:00 1 2 3 10
2016-03-03 06:15:00 1 2 6 4
2016-03-03 06:30:00 1 2 3 4
2016-03-03 06:45:00 1 2 6 10
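A sketch constructing that frame, if you want to follow along (values as shown):
import pandas as pd
import numpy as np

idx = pd.DatetimeIndex(['2016-03-03 05:45:00', '2016-03-03 06:00:00',
                        '2016-03-03 06:15:00', '2016-03-03 06:30:00',
                        '2016-03-03 06:45:00'], name='DateTime')
df = pd.DataFrame({'A': [1, 1, 1, 1, 1],
                   'B': [2, 2, 2, 2, 2],
                   'C': [3, 3, 6, 3, 6],
                   'D': [4, 10, 4, 4, 10]}, index=idx)
With that in place: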
df.loc[:, df.columns.difference(['B'])] = df[df >= 5]
df
A B C D
DateTime
2016-03-03 05:45:00 NaN 2 NaN NaN
2016-03-03 06:00:00 NaN 2 NaN 10.0
2016-03-03 06:15:00 NaN 2 6.0 NaN
2016-03-03 06:30:00 NaN 2 NaN NaN
2016-03-03 06:45:00 NaN 2 6.0 10.0
This masks the whole frame, but assigns only to the non-B columns selected via loc, so B is left untouched.
Another option is masking with update:
v = df[df >= 5]
v.update(df[['B']])
v
A B C D
DateTime
2016-03-03 05:45:00 NaN 2.0 NaN NaN
2016-03-03 06:00:00 NaN 2.0 NaN 10.0
2016-03-03 06:15:00 NaN 2.0 6.0 NaN
2016-03-03 06:30:00 NaN 2.0 NaN NaN
2016-03-03 06:45:00 NaN 2.0 6.0 10.0
Working from your code, you can instead do:
mask = (df.loc[:, df.columns.difference(['B']).tolist()] < 5).any()
df[mask[mask].index] = np.nan
Note that df.columns.difference(['B']) is just the list of columns excluding B, so it doesn't make sense to compare it with < 5 directly. You first have to slice the dataframe with these columns and then check the condition. Finally, any() checks whether there is at least one True. (Be aware this sets whole columns to NaN if they contain any value below 5; that matches the constant-valued example here, but it is not element-wise.)
df[df[df.columns.difference(['B'])] < 5] = np.nan
You may use mask:
df.mask(df.lt(5)).combine_first(df[['B']])
Out[258]:
A B C D
DateTime
2016-03-03 05:45:00 NaN 2.0 NaN NaN
2016-03-03 06:00:00 NaN 2.0 NaN NaN
2016-03-03 06:15:00 NaN 2.0 NaN NaN
2016-03-03 06:30:00 NaN 2.0 NaN NaN
2016-03-03 06:45:00 NaN 2.0 NaN NaN
You can do this simply by slicing out the relevant columns:
import pandas as pd
import numpy as np
df = pd.DataFrame({l: range(10) for l in 'ABCDEFGH'})
dont_change = ['B']
cols = [col for col in df.columns if col not in dont_change]
df_sel = df.loc[:, cols]     # select the columns to modify
df_sel[df_sel < 5] = np.nan  # modify
df[cols] = df_sel            # reassign
I have a df like this:
a001 a002 a003 a004 a005
time_axis
2017-02-07 1 NaN NaN NaN NaN
2017-02-14 NaN NaN NaN NaN NaN
2017-03-20 NaN NaN 2 NaN NaN
2017-04-03 NaN 3 NaN NaN NaN
2017-05-15 NaN NaN NaN NaN NaN
2017-06-05 NaN NaN NaN NaN NaN
2017-07-10 NaN 6 NaN NaN NaN
2017-07-17 4 NaN NaN NaN NaN
2017-07-24 NaN NaN NaN 1 NaN
2017-08-07 NaN NaN NaN NaN NaN
2017-08-14 NaN NaN NaN NaN NaN
2017-08-28 NaN NaN NaN NaN 5
And I would like to compute, for each row, a rolling mean over the previous 3 valid values (ignoring the empty rows) and save it in another df:
last_3
time_axis
2017-02-07   1    # only one row so far
2017-02-14   1    # only one valid value (in the first row) -> the average is the value itself
2017-03-20   1.5  # average over the previous non-empty rows (only 2 rows contain values -> (2+1)/2)
2017-04-03   2    # average over the previous rows with non-NaN values (2017-02-14 excluded): (2+3+1)/3
2017-05-15   2    # same reason as the previous row
2017-06-05   2    # same reason
2017-07-10   3.6  # now the considered values are: 2, 3, 6
2017-07-17   4.3  # considered values: 4, 6, 3
2017-07-24   3.6  # considered values: 1, 4, 6
2017-08-07   3.6  # no new values in this row, so again 1, 4, 6
2017-08-14   3.6  # same reason
2017-08-28   3.3  # now the considered values are: 5, 1, 4
I tried deleting the empty rows from the first dataframe and then applying rolling and mean, but I think it is the wrong approach (df1 in my example already exists):
df2 = df.dropna(how='all')
df1['last_3'] = df2.mean(axis=1).rolling(window=3, min_periods=3).mean()
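For reference, the input frame can be rebuilt like this (a sketch; values read off the table above):
import pandas as pd
import numpy as np

idx = pd.to_datetime(['2017-02-07', '2017-02-14', '2017-03-20', '2017-04-03',
                      '2017-05-15', '2017-06-05', '2017-07-10', '2017-07-17',
                      '2017-07-24', '2017-08-07', '2017-08-14', '2017-08-28'])
df = pd.DataFrame(np.nan, index=idx, columns=['a001', 'a002', 'a003', 'a004', 'a005'])
df.loc['2017-02-07', 'a001'] = 1
df.loc['2017-03-20', 'a003'] = 2
df.loc['2017-04-03', 'a002'] = 3
df.loc['2017-07-10', 'a002'] = 6
df.loc['2017-07-17', 'a001'] = 4
df.loc['2017-07-24', 'a004'] = 1
df.loc['2017-08-28', 'a005'] = 5
df.index.name = 'time_axis'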
I think you need:
df2 = df.dropna(how='all')
df['last_3'] = df2.mean(axis=1).rolling(window=3, min_periods=1).mean()
df['last_3'] = df['last_3'].ffill()
print(df)
a001 a002 a003 a004 a005 last_3
2017-02-07 1.0 NaN NaN NaN NaN 1.000000
2017-02-14 NaN NaN NaN NaN NaN 1.000000
2017-03-20 NaN NaN 2.0 NaN NaN 1.500000
2017-04-03 NaN 3.0 NaN NaN NaN 2.000000
2017-05-15 NaN NaN NaN NaN NaN 2.000000
2017-06-05 NaN NaN NaN NaN NaN 2.000000
2017-07-10 NaN 6.0 NaN NaN NaN 3.666667
2017-07-17 4.0 NaN NaN NaN NaN 4.333333
2017-07-24 NaN NaN NaN 1.0 NaN 3.666667
2017-08-07 NaN NaN NaN NaN NaN 3.666667
2017-08-14 NaN NaN NaN NaN NaN 3.666667
2017-08-28 NaN NaN NaN NaN 5.0 3.333333
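Here min_periods=1 is what lets the first non-empty rows average over fewer than three values (matching the expected 1 and 1.5), and the final ffill carries each result onto the all-NaN rows that were dropped from df2.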