I have a pandas dataframe as following:
Date time LifeTime1 LifeTime2 LifeTime3 LifeTime4 LifeTime5
2020-02-11 17:30:00 6 7 NaN NaN 3
2020-02-11 17:30:00 NaN NaN 3 3 NaN
2020-02-12 15:30:00 2 2 NaN NaN 3
2020-02-16 14:30:00 4 NaN NaN NaN 1
2020-02-16 14:30:00 NaN 7 NaN NaN NaN
2020-02-16 14:30:00 NaN NaN 8 2 NaN
The dates are identical for some rows, is it possible to add 1 second, 2 second, 3 seconds to 2, 3, and 4 identical dates? So if its just one unique date, leave as is. If there are two identical dates, leave first one as is but add 1 second to the second identical date. And if three identical date, leave first as is, second add 1 second and add 2 second to third one. Is this possible to do easily in pandas?
You can use groupby.cumcount combined with pandas.to_datetime with unit='s' to add incremental seconds to the duplicated rows:
s = pd.to_datetime(df['Date time'])
df['Date time'] = s+pd.to_timedelta(s.groupby(s).cumcount(), unit='s')
As a one liner with python 3.8+ walrus operator:
df['Date time'] = ((s:=pd.to_datetime(df['Date time']))
+pd.to_timedelta(s.groupby(s).cumcount(), unit='s')
)
output:
Date time LifeTime1 LifeTime2 LifeTime3 LifeTime4 LifeTime5
0 2020-02-11 17:30:00 6.0 7.0 NaN NaN 3.0
1 2020-02-11 17:30:01 NaN NaN 3.0 3.0 NaN
2 2020-02-12 15:30:00 2.0 2.0 NaN NaN 3.0
3 2020-02-16 14:30:00 4.0 NaN NaN NaN 1.0
4 2020-02-16 14:30:01 NaN 7.0 NaN NaN NaN
5 2020-02-16 14:30:02 NaN NaN 8.0 2.0 NaN
I have a dataframe where the index is date increasing and the columns are observations of variables. The array is sparse.
My goal is to propogate forward in time a known value to fill NaN but I want to stop at the last non-NaN value as that last value signifies the "death" of the variable.
e.g. for the dataset
a
b
c
2020-01-01
NaN
11
NaN
2020-02-01
1
NaN
NaN
2020-03-01
NaN
NaN
14
2020-04-01
2
NaN
NaN
2020-05-01
NaN
NaN
NaN
2020-06-01
NaN
NaN
15
2020-07-01
3
NaN
NaN
2020-08-01
NaN
NaN
NaN
I want to output
a
b
c
2020-01-01
NaN
11
NaN
2020-02-01
1
NaN
NaN
2020-03-01
1
NaN
14
2020-04-01
2
NaN
14
2020-05-01
2
NaN
14
2020-06-01
2
NaN
15
2020-07-01
3
NaN
NaN
2020-08-01
NaN
NaN
NaN
I can identify the index of the last observation using df.notna()[::-1].idxmax() but can't figure out how to use this as a way to limit the fillna function
I'd be grateful for any suggestions. Many thanks
Use DataFrame.where for forward filling by mask - testing only non missing values by back filling them:
df = df.where(df.bfill().isna(), df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
Your solution should be used too if compare Series converted to numpy array with broadcasting:
mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
df = df.where(mask, df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
You can use Series.last_valid_index which is specifically designed for this (to return the index for last non-NA/null value) , to just ffill up to that point:
Assuming your dataset is called df:
df.apply(lambda x: x.loc[:x.last_valid_index()].ffill())
index a b c
0 2020-01-01 NaN 11.00 NaN
1 2020-02-01 1.00 NaN NaN
2 2020-03-01 1.00 NaN 14.00
3 2020-04-01 2.00 NaN 14.00
4 2020-05-01 2.00 NaN 14.00
5 2020-06-01 2.00 NaN 15.00
6 2020-07-01 3.00 NaN NaN
7 2020-08-01 NaN NaN NaN
More on this on:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last_valid_index.html
I have a dataframe where in some cases a case has its records in more than one row, with nulls in some rows as so:
date_rounded 1 2 3 4 5
0 2020-04-01 00:05:00 0.0 NaN NaN NaN NaN
1 2020-04-01 00:05:00 NaN 1.0 44.0 44.0 46.454
2 2020-04-01 00:05:00 NaN NaN NaN NaN NaN
I want to have only one row with the filled data, so far I have:
df.groupby(['date_rounded']).apply(lambda df0: df0.fillna(method='ffill').fillna(method='bfill').drop_duplicates())
this works, but it is slow, any better ideas?
Thanks
You can also use groupby and first:
df.groupby("date_rounded").first()
1 2 3 4 5
date_rounded
2020-04-01 00:05:00 0.0 1.0 44.0 44.0 46.454
If you need to fill within each group, you can use groupby().apply and bfill:
df.groupby('date_rounded', as_index=False).apply(lambda x: x.bfill().iloc[0])
Output:
0 date_rounded 1 2 3 4 5
0 2020-04-01 00:05:00 0.0 1.0 44.0 44.0 46.454
Here is what I have tried and what error I received:
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
>>> df
A B C D
0 1 5 0 1
1 2 4 0 1
2 3 3 0 1
3 4 2 0 1
4 5 1 0 1
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
>>> first = [2,2,2,2,2,2,2,2,2,2,2,2]
>>> first = pd.DataFrame(first).T
>>> first.index = [2]
>>> df = df.join(first)
>>> df
A B C D 0 1 2 3 4 5 6 7 8 9 10 11
0 1 5 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 4 2 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> second = [3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]
>>> second = pd.DataFrame(second).T
>>> second.index = [1]
>>> df = df.join(second)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python35\lib\site-packages\pandas\core\frame.py", line 6815, in join
rsuffix=rsuffix, sort=sort)
File "C:\Python35\lib\site-packages\pandas\core\frame.py", line 6830, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "C:\Python35\lib\site-packages\pandas\core\reshape\merge.py", line 48, in merge
return op.get_result()
File "C:\Python35\lib\site-packages\pandas\core\reshape\merge.py", line 552, in get_result
rdata.items, rsuf)
File "C:\Python35\lib\site-packages\pandas\core\internals\managers.py", line 1972, in items_overlap_with_suffix
'{rename}'.format(rename=to_rename))
ValueError: columns overlap but no suffix specified: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='object')
I am trying to create new list with the extra columns which I have to add at specific indexes of the main dataframe df.
When i tried the first it worked and you can see the output. But when I tried the same way with second I received the above mentioned error.
Kindly, let me know what I can do in this situation and achieve the goal I am expecting.
Use DataFrame.combine_first instead join if need assign to same columns created before, last DataFrame.reindex by list of columns for expected ordering:
df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
orig = df.columns.tolist()
first = [2,2,2,2,2,2,2,2,2,2,2,2]
first = pd.DataFrame(first).T
first.index = [2]
df = df.combine_first(first)
second = [3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]
second = pd.DataFrame(second).T
second.index = [1]
df = df.combine_first(second)
df = df.reindex(orig + first.columns.tolist(), axis=1)
print (df)
A B C D 0 1 2 3 4 5 6 7 8 9 10 11
0 1 5 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 4 2 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Yes this is expected behaviour because join works much like an SQL join, meaning that it will join on the provided index and concatenate all the columns together. The problem arises from the fact that pandas does not accept two columns to have the same name. Hence, if you have 2 columns in each dataframe with the same name, it will first look for a suffix to add to those columns to avoid name clashes. This is controlled with the lsuffix and rsuffix arguments in the join method.
Conclusion: 2 ways to solve this:
Either provide a suffix so that pandas is able to resolve the name clashes; or
Make sure that you don't have overlapping columns
You have to specify the suffixes since the column names are the same. Assuming you are trying to add the second values as new columns horizontally:
df = df.join(second, lsuffix='first', rsuffix='second')
A B C D 0first 1first 2first 3first 4first 5first ... 10second 11second 12 13 14 15 16 17 18 19
0 1 5 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 NaN NaN NaN NaN NaN NaN ... 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 2 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have a long list of columns and I want to subtract the previous column from the current column and replace the current column with the difference.
So if I have:
A B C D
1 NaN 3 7
3 NaN 8 10
2 NaN 6 11
I want the output to be:
A B C D
1 NaN 2 4
3 NaN 5 2
2 NaN 4 5
I have been trying to use this code:
df2 = df1.diff(axis=1)
but this does not produce the desired output
Thanks in advance.
You can do this with df.where and then update to bring back the first non-null entry for each row of your DataFrame.
Sample Data: df
A B C D
0 1.0 NaN 3.0 7.0
1 1.0 4.0 5.0 9.0
2 NaN 4.0 NaN 4.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 7.0
5 3.0 NaN NaN 7.0
6 6.0 NaN NaN NaN
Code:
df_d = df.where(df.isnull(),
df.fillna(method='ffill', axis=1).diff(axis=1))
df_d.update(df.where(df.notnull().cumsum(1).cumsum(1) == 1))
Output: df_d
A B C D
0 1.0 NaN 2.0 4.0
1 1.0 3.0 1.0 4.0
2 NaN 4.0 NaN 0.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 4.0
5 3.0 NaN NaN 4.0
6 6.0 NaN NaN NaN
Actually, it is producing the desired result but you are trying to calculate diff on nan values which will be nan so diff is working as expected.
For your case just fetch the first column from original dataframe and you should be fine
df2=df1.diff(axis=1)
df2.A=df1.A
print(df2)
Output
A B C D
1 NaN 2.0 4.0