I have a dataframe where, in some cases, a single case has its records spread across more than one row, with nulls in some rows, like so:
date_rounded 1 2 3 4 5
0 2020-04-01 00:05:00 0.0 NaN NaN NaN NaN
1 2020-04-01 00:05:00 NaN 1.0 44.0 44.0 46.454
2 2020-04-01 00:05:00 NaN NaN NaN NaN NaN
I want to end up with only one row per date containing the filled data. So far I have:
df.groupby(['date_rounded']).apply(lambda df0: df0.fillna(method='ffill').fillna(method='bfill').drop_duplicates())
This works, but it is slow. Any better ideas?
Thanks
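For reference, the sample frame can be reconstructed roughly as follows (a sketch; the integer column labels are an assumption based on the display above), so the answers below can be run directly:
import numpy as np
import pandas as pd

# approximate reconstruction of the sample data shown above
df = pd.DataFrame({
    'date_rounded': pd.to_datetime(['2020-04-01 00:05:00'] * 3),
    1: [0.0, np.nan, np.nan],
    2: [np.nan, 1.0, np.nan],
    3: [np.nan, 44.0, np.nan],
    4: [np.nan, 44.0, np.nan],
    5: [np.nan, 46.454, np.nan],
})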
You can also use groupby and first:
df.groupby("date_rounded").first()
1 2 3 4 5
date_rounded
2020-04-01 00:05:00 0.0 1.0 44.0 44.0 46.454
If you need to fill within each group, you can use groupby().apply and bfill:
df.groupby('date_rounded', as_index=False).apply(lambda x: x.bfill().iloc[0])
Output:
          date_rounded    1    2     3     4       5
0  2020-04-01 00:05:00  0.0  1.0  44.0  44.0  46.454
I'm trying to reference data from source dataframes, using indices stored in another dataframe.
For example, let's say we have a "shifts" dataframe with the names of the people on duty on each date (some values can be NaN):
a b c
2023-01-01 Sam Max NaN
2023-01-02 Mia NaN Max
2023-01-03 NaN Sam Mia
Then we have a "performance" dataframe, with the performance of each employee on each date. Row indices are the same as the shifts dataframe, but column names are different:
Sam Mia Max Ian
2023-01-01 4.5 NaN 3.0 NaN
2023-01-02 NaN 2.0 3.0 NaN
2023-01-03 4.0 3.0 NaN 4.0
and finally we have a "salary" dataframe, whose structure and indices are different from the other two dataframes:
Employee Salary
0 Sam 100
1 Mia 90
2 Max 80
3 Ian 70
I need to create two output dataframes, with same structure and indices as "shifts".
In the first one, I need to substitute the employee name with his/her performance on that date.
In the second output dataframe, the employee name is replaced with his/her salary. These are the expected outputs:
Output 1:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
Output 2:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
Any idea of how to do it? Thanks
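For reference, the three frames can be reconstructed roughly as follows (a sketch based on the samples above; the answers below refer to them as shifts, performance/perf and salary):
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])

shifts = pd.DataFrame(
    {'a': ['Sam', 'Mia', np.nan],
     'b': ['Max', np.nan, 'Sam'],
     'c': [np.nan, 'Max', 'Mia']},
    index=dates)

performance = pd.DataFrame(
    {'Sam': [4.5, np.nan, 4.0],
     'Mia': [np.nan, 2.0, 3.0],
     'Max': [3.0, 3.0, np.nan],
     'Ian': [np.nan, np.nan, 4.0]},
    index=dates)

salary = pd.DataFrame({'Employee': ['Sam', 'Mia', 'Max', 'Ian'],
                       'Salary': [100, 90, 80, 70]})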
For the first one:
(shifts
 .reset_index()
 .melt('index')                               # long format: one row per (date, shift letter, employee)
 .merge(performance.stack().rename('p'),      # performance as a Series indexed by (date, employee)
        left_on=['index', 'value'], right_index=True)
 .pivot(index='index', columns='variable', values='p')
 .reindex_like(shifts)                        # back to the original shape, restoring all-NaN cells
)
Output:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
For the second:
shifts.replace(salary.set_index('Employee')['Salary'])
Output:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
Here's a way to do what your question asks:
out1 = ( shifts.stack()                               # one row per (date, shift letter) holding the employee name
         .rename_axis(index=('date', 'shift'))
         .reset_index().rename(columns={0: 'employee'})
         # look up each employee's performance on that date
         # (perf is the "performance" frame from the question)
         .pipe(lambda df: df.assign(perf=
             df.apply(lambda row: perf.loc[row.date, row.employee], axis=1)))
         .pivot(index='date', columns='shift', values='perf')
         .rename_axis(index=None, columns=None) )
out2 = shifts.replace(salary.set_index('Employee').Salary)
Output:
Output 1:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
Output 2:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
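As a side note, output 1 can also be built with a plain dictionary lookup on the stacked performance frame; this is just a sketch and not part of either answer above:
import numpy as np
import pandas as pd

# (date, employee) -> performance value, as a plain dict (NaN cells are simply absent)
perf_lookup = performance.stack().to_dict()

out1 = pd.DataFrame(
    {c: [perf_lookup.get((d, n), np.nan) if pd.notna(n) else np.nan
         for d, n in zip(shifts.index, shifts[c])]
     for c in shifts.columns},
    index=shifts.index)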
I have a pandas dataframe as following:
Date time LifeTime1 LifeTime2 LifeTime3 LifeTime4 LifeTime5
2020-02-11 17:30:00 6 7 NaN NaN 3
2020-02-11 17:30:00 NaN NaN 3 3 NaN
2020-02-12 15:30:00 2 2 NaN NaN 3
2020-02-16 14:30:00 4 NaN NaN NaN 1
2020-02-16 14:30:00 NaN 7 NaN NaN NaN
2020-02-16 14:30:00 NaN NaN 8 2 NaN
The dates are identical for some rows. Is it possible to add 1, 2, 3, ... seconds to the 2nd, 3rd and 4th occurrences of an identical date? So if there is just one unique date, leave it as is. If there are two identical dates, leave the first one as is but add 1 second to the second. If there are three identical dates, leave the first as is, add 1 second to the second and 2 seconds to the third. Is this easy to do in pandas?
You can use groupby.cumcount combined with pandas.to_timedelta (with unit='s') to add incremental seconds to the duplicated timestamps:
s = pd.to_datetime(df['Date time'])
df['Date time'] = s+pd.to_timedelta(s.groupby(s).cumcount(), unit='s')
As a one-liner with the Python 3.8+ walrus operator:
df['Date time'] = ((s := pd.to_datetime(df['Date time']))
                   + pd.to_timedelta(s.groupby(s).cumcount(), unit='s')
                   )
Output:
Date time LifeTime1 LifeTime2 LifeTime3 LifeTime4 LifeTime5
0 2020-02-11 17:30:00 6.0 7.0 NaN NaN 3.0
1 2020-02-11 17:30:01 NaN NaN 3.0 3.0 NaN
2 2020-02-12 15:30:00 2.0 2.0 NaN NaN 3.0
3 2020-02-16 14:30:00 4.0 NaN NaN NaN 1.0
4 2020-02-16 14:30:01 NaN 7.0 NaN NaN NaN
5 2020-02-16 14:30:02 NaN NaN 8.0 2.0 NaN
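Note that groupby collects all equal timestamps whether or not they are adjacent, so the duplicates do not need to be consecutive rows. A quick sanity check (not part of the original answer):
# after the fix, every timestamp should be unique
assert df['Date time'].is_unique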
I need, for each row (x) of a dataframe, to get the value stored in the previous row (x-1), in a specific target column. The label of the target column is stored in a column (Target_col) of row x.
0 1 2 Target_col
Date
2022-01-01 37.0 26.0 NaN 0
2022-01-02 NaN 41.0 0.0 1
2022-01-03 NaN 40.0 43.0 1
2022-01-04 NaN NaN 23.0 2
For example, in the last row my Target_value is 43.0, which is stored in the column "2" of the previous row.
This is the expected output:
0 1 2 Target_col Target_value
Date
2022-01-01 37.0 26.0 NaN 0 NaN
2022-01-02 NaN 41.0 0.0 1 26.0
2022-01-03 NaN 40.0 43.0 1 41.0
2022-01-04 NaN NaN 23.0 2 43.0
I was able to get what I want by duplicating the df:
df2 = df.shift(periods=1)
df['Target_value'] = df2.lookup(df.index, df['Target_col'])
but I guess there is a smarter way to do that. Furthermore, lookup is deprecated. Any ideas?
Please note that I reshaped my question and the example df to make everything clearer, so itprorh66's answer and my comments to his answer are no longer relevant.
I would approach the problem a little differently as illustrated below:
given a base dataframe of the form:
df:
date a b c
0 2022-01-01 12.0 11.0 NaN
1 2022-01-02 10.0 11.0 NaN
2 2022-01-03 NaN 10.0 10.0
3 2022-01-04 NaN 11.0 9.0
4 2022-01-05 NaN NaN 12.0
Instead of defining the column that contains the first valid data, I would create a column which just contains the first valid piece of data, as follows:
import numpy as np

# helper function to find the first valid piece of data in a row
def findfirst(row, cols_list):
    # return the first non-NaN value found within the row
    for c in cols_list:
        if not np.isnan(row[c]):
            return row[c]
    return np.nan
Then using the above helper, I add the column 'First' which contains the desired data as follows:
df['First'] = df.apply(lambda row: findfirst(row, ['a', 'b', 'c']), axis=1)
This creates the following dataframe:
date a b c First
0 2022-01-01 12.0 11.0 NaN 12.0
1 2022-01-02 10.0 11.0 NaN 10.0
2 2022-01-03 NaN 10.0 10.0 10.0
3 2022-01-04 NaN 11.0 9.0 11.0
4 2022-01-05 NaN NaN 12.0 12.0
From the above you can then compute the change value as follows:
df['Change'] = (df['First']/df['First'].shift())-1
Which yields:
         date     a     b     c  First    Change
0  2022-01-01  12.0  11.0   NaN   12.0       NaN
1  2022-01-02  10.0  11.0   NaN   10.0 -0.166667
2  2022-01-03   NaN  10.0  10.0   10.0  0.000000
3  2022-01-04   NaN  11.0   9.0   11.0  0.100000
4  2022-01-05   NaN   NaN  12.0   12.0  0.090909
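As a side note, the same 'First' column can usually be built without a row-wise apply, for example with a row-wise back fill (a sketch that should be equivalent for the data above):
# back-fill across the columns, then take the first one:
# column 'a' then holds the first non-NaN among a, b, c for every row
df['First'] = df[['a', 'b', 'c']].bfill(axis=1)['a']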
It's a bit convoluted but this works:
import numpy as np

cols = df.columns[:-1]                          # the data columns (everything except Target_col)
temp = df['Target_col'].shift(-1).values[:-1]   # for each row, the column the *next* row points at
temp = np.append(temp, 0)                       # pad the last position with a dummy 0
# pick, for every row, the value in the column the next row's Target_col names ...
target_values = df[cols].to_numpy()[np.arange(len(df)), temp.astype(int)][:-1]
# ... then shift the result down by one row (the first row gets a dummy 0)
target_values = np.insert(target_values, 0, 0, axis=0)
df['target_values'] = target_values.tolist()
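As a follow-up to the deprecation mentioned in the question: the pandas docs suggest a factorize-based replacement for lookup, which applied here would look roughly like this (a sketch, assuming the frame and Target_col from the question):
import numpy as np
import pandas as pd

prev = df.shift(1)                           # previous row, as in the original attempt
idx, cols = pd.factorize(df['Target_col'])   # integer codes plus the column labels they refer to
df['Target_value'] = (
    prev.reindex(cols, axis=1)               # order the columns to match the codes
        .to_numpy()[np.arange(len(df)), idx] # pick one value per row
)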
I have a dataframe where the index is date increasing and the columns are observations of variables. The array is sparse.
My goal is to propagate a known value forward in time to fill NaN, but I want to stop at the last non-NaN value, as that last value signifies the "death" of the variable.
e.g. for the dataset
              a     b     c
2020-01-01  NaN    11   NaN
2020-02-01    1   NaN   NaN
2020-03-01  NaN   NaN    14
2020-04-01    2   NaN   NaN
2020-05-01  NaN   NaN   NaN
2020-06-01  NaN   NaN    15
2020-07-01    3   NaN   NaN
2020-08-01  NaN   NaN   NaN
I want to output
              a     b     c
2020-01-01  NaN    11   NaN
2020-02-01    1   NaN   NaN
2020-03-01    1   NaN    14
2020-04-01    2   NaN    14
2020-05-01    2   NaN    14
2020-06-01    2   NaN    15
2020-07-01    3   NaN   NaN
2020-08-01  NaN   NaN   NaN
I can identify the index of the last observation using df.notna()[::-1].idxmax() but can't figure out how to use this as a way to limit the fillna function
I'd be grateful for any suggestions. Many thanks
Use DataFrame.where to forward fill under a mask: back fill first, and wherever the result is still missing (i.e. positions after a column's last valid value) keep the original NaN; everywhere else take the forward-filled values:
df = df.where(df.bfill().isna(), df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
Your idxmax-based solution can be used too if you compare the resulting Series (converted to a NumPy array) against the index with broadcasting:
# True where the row's date is strictly after the column's last valid observation
mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
df = df.where(mask, df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
You can use Series.last_valid_index, which is specifically designed for this (it returns the index of the last non-NA/null value), to ffill only up to that point:
Assuming your dataset is called df:
df.apply(lambda x: x.loc[:x.last_valid_index()].ffill())
index a b c
0 2020-01-01 NaN 11.00 NaN
1 2020-02-01 1.00 NaN NaN
2 2020-03-01 1.00 NaN 14.00
3 2020-04-01 2.00 NaN 14.00
4 2020-05-01 2.00 NaN 14.00
5 2020-06-01 2.00 NaN 15.00
6 2020-07-01 3.00 NaN NaN
7 2020-08-01 NaN NaN NaN
More on this in the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last_valid_index.html
I want to replace all values less than 5 in the following df with NaN, but column B should be excluded from the operation without dropping it.
A B C D
DateTime
2016-03-03 05:45:00 1 2 3 4
2016-03-03 06:00:00 1 2 3 4
2016-03-03 06:15:00 1 2 3 4
2016-03-03 06:30:00 1 2 3 4
2016-03-03 06:45:00 1 2 3 4
desired result
A B C D
DateTime
2016-03-03 05:45:00 NaN 2 NaN NaN
2016-03-03 06:00:00 NaN 2 NaN NaN
2016-03-03 06:15:00 NaN 2 NaN NaN
2016-03-03 06:30:00 NaN 2 NaN NaN
2016-03-03 06:45:00 NaN 2 NaN NaN
I can take column B out of the df, apply df[df < 5] = np.nan to the remaining df, and then combine them again. Dropping column B before the operation could be another approach. But I am looking for a more efficient way, a one-liner if possible.
I tried df[df.columns.difference(['B']) < 5] = np.nan, but it is not correct. I also tried df[(df.B != 'Other') < 5] = np.nan, without success.
Let's use a more sensible example:
A B C D
DateTime
2016-03-03 05:45:00 1 2 3 4
2016-03-03 06:00:00 1 2 3 10
2016-03-03 06:15:00 1 2 6 4
2016-03-03 06:30:00 1 2 3 4
2016-03-03 06:45:00 1 2 6 10
df.loc[:, df.columns.difference(['B'])] = df[df >= 5]
df
A B C D
DateTime
2016-03-03 05:45:00 NaN 2 NaN NaN
2016-03-03 06:00:00 NaN 2 NaN 10.0
2016-03-03 06:15:00 NaN 2 6.0 NaN
2016-03-03 06:30:00 NaN 2 NaN NaN
2016-03-03 06:45:00 NaN 2 6.0 10.0
This masks the entire frame, but only assigns to the non-B columns via loc.
Another option is masking with update:
v = df[df >= 5]       # everything below 5 becomes NaN
v.update(df[['B']])   # restore column B from the original frame
A B C D
DateTime
2016-03-03 05:45:00 NaN 2.0 NaN NaN
2016-03-03 06:00:00 NaN 2.0 NaN 10.0
2016-03-03 06:15:00 NaN 2.0 6.0 NaN
2016-03-03 06:30:00 NaN 2.0 NaN NaN
2016-03-03 06:45:00 NaN 2.0 6.0 10.0
Working from your code, you can do instead:
mask = (df.loc[:,df.columns.difference(['B']).tolist()] < 5).any()
df[mask[mask].index] = np.nan
Note that df.columns.difference(['B']) is just the list of column labels excluding B, so it makes no sense to compare it against 5. You first have to slice the dataframe with these columns and then check the condition. Finally, you add any() to check whether there is at least one True per column.
df[df[df.columns.difference(['B'])] < 5] = np.nan
You may use mask:
df.mask(df.lt(5)).combine_first(df[['B']])
Out[258]:
A B C D
DateTime
2016-03-03 05:45:00 NaN 2.0 NaN NaN
2016-03-03 06:00:00 NaN 2.0 NaN NaN
2016-03-03 06:15:00 NaN 2.0 NaN NaN
2016-03-03 06:30:00 NaN 2.0 NaN NaN
2016-03-03 06:45:00 NaN 2.0 NaN NaN
You can do this simply by slicing down the columns:
import pandas as pd
import numpy as np

df = pd.DataFrame({l: range(10) for l in 'ABCDEFGH'})

dont_change = ['B']
cols = [col for col in df.columns if col not in dont_change]
df_sel = df.loc[:, cols]       # select the columns to modify
df_sel[df_sel < 5] = np.nan    # apply the mask
df[cols] = df_sel              # reassign back into the original frame