For each row (x) of a dataframe, I need to get the value stored in the previous row (x-1), in a specific target column. The header of the target column is stored in a column (Target_col) of row x.
0 1 2 Target_col
Date
2022-01-01 37.0 26.0 NaN 0
2022-01-02 NaN 41.0 0.0 1
2022-01-03 NaN 40.0 43.0 1
2022-01-04 NaN NaN 23.0 2
For example, in the last row my Target_value is 43.0, which is stored in the column "2" of the previous row.
This is the expected output:
0 1 2 Target_col Target_value
Date
2022-01-01 37.0 26.0 NaN 0 NaN
2022-01-02 NaN 41.0 0.0 1 26.0
2022-01-03 NaN 40.0 43.0 1 41.0
2022-01-04 NaN NaN 23.0 2 43.0
I was able to get what I want by duplicating the df:
df2 = df.shift(periods=1)
df['Target_value'] = df2.lookup(df.index, df['Target_col'])
but I guess there is a smarter way to do that. Furthermore, lookup is deprecated. Any ideas?
Please note that I reshaped my question and the example df to make everything clearer, so itprorh66's answer and my comments to his answer are no longer relevant.
I would approach the problem a little differently as illustrated below:
given a base dataframe of the form:
df:
date a b c
0 2022-01-01 12.0 11.0 NaN
1 2022-01-02 10.0 11.0 NaN
2 2022-01-03 NaN 10.0 10.0
3 2022-01-04 NaN 11.0 9.0
4 2022-01-05 NaN NaN 12.0
Instead of identifying the column that contains the first valid data, I would create a column which just contains the first valid piece of data itself, as follows:
import numpy as np

# helper function to find the first valid data in a row
def findfirst(row, cols_list):
    # return the first non-NaN value found within row
    for c in cols_list:
        if not np.isnan(row[c]):
            return row[c]
    return np.nan
Then, using the above helper, I add the column 'First', which contains the desired data, as follows:
df['First'] = df.apply(lambda row: findfirst(row, ['a', 'b', 'c']), axis=1)
This creates the following dataframe:
date a b c First
0 2022-01-01 12.0 11.0 NaN 12.0
1 2022-01-02 10.0 11.0 NaN 10.0
2 2022-01-03 NaN 10.0 10.0 10.0
3 2022-01-04 NaN 11.0 9.0 11.0
4 2022-01-05 NaN NaN 12.0 12.0
From the above you can then compute the change value as follows:
df['Change'] = (df['First']/df['First'].shift())-1
Which yields:
date a b c First Change
0 2022-01-01 12.0 11.0 NaN 12.0 NaN
1 2022-01-02 10.0 11.0 NaN 10.0 -0.166667
2 2022-01-03 NaN 10.0 10.0 10.0 0.000000
3 2022-01-04 NaN 11.0 9.0 11.0 0.100000
4 2022-01-05 NaN NaN 12.0 12.0 0.090909
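As a side note, the row-wise apply can be avoided: backfilling along axis 1 pushes each row's first valid value into the first column. A minimal sketch, assuming the same column layout:

# first non-NaN value per row, without a Python-level loop
df['First'] = df[['a', 'b', 'c']].bfill(axis=1).iloc[:, 0]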
It's a bit convoluted, but this works:
import numpy as np

cols = df.columns[:-1]                  # the value columns, excluding Target_col
# shift Target_col up so that position i holds the target column of row i+1
temp = df['Target_col'].shift(-1).values[:-1]
temp = np.append(temp, 0)               # pad the last slot (its pick is discarded below)
# for each row i, pick the value in the column targeted by row i+1
target_values = df[cols].to_numpy()[np.arange(len(df)), temp.astype(int)][:-1]
# the first row has no previous row, so prepend NaN
target_values = np.insert(target_values, 0, np.nan, axis=0)
df['Target_value'] = target_values.tolist()
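For the record, a common replacement for the deprecated lookup() is plain NumPy indexing; applied to the shifted frame it stays short. A sketch, assuming Target_col holds the labels of the value columns:

import numpy as np

prev = df[cols].shift(1)
# get_indexer maps each target label to its column position (-1 if a label is missing)
col_pos = prev.columns.get_indexer(df['Target_col'])
df['Target_value'] = prev.to_numpy()[np.arange(len(df)), col_pos]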
Related
I'm trying to reference data from source dataframes, using indices stored in another dataframe.
For example, let's say we have a "shifts" dataframe with the names of the people on duty on each date (some values can be NaN):
a b c
2023-01-01 Sam Max NaN
2023-01-02 Mia NaN Max
2023-01-03 NaN Sam Mia
Then we have a "performance" dataframe, with the performance of each employee on each date. Row indices are the same as the shifts dataframe, but column names are different:
Sam Mia Max Ian
2023-01-01 4.5 NaN 3.0 NaN
2023-01-02 NaN 2.0 3.0 NaN
2023-01-03 4.0 3.0 NaN 4.0
and finally we have a "salary" dataframe, whose structure and indices are different from the other two dataframes:
Employee Salary
0 Sam 100
1 Mia 90
2 Max 80
3 Ian 70
I need to create two output dataframes, with the same structure and indices as "shifts".
In the first one, I need to substitute the employee name with his/her performance on that date.
In the second output dataframe, the employee name is replaced with his/her salary. These are the expected outputs:
Output 1:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
Output 2:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
Any idea how to do this? Thanks
For the first one:
(shifts
 .reset_index().melt('index')                    # to long format: index / variable / value
 .merge(performance.stack().rename('p'),         # performance as a (date, employee) -> p Series
        left_on=['index', 'value'], right_index=True)
 .pivot(index='index', columns='variable', values='p')
 .reindex_like(shifts)                           # restore the original shape and column order
)
Output:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
For the second:
shifts.replace(salary.set_index('Employee')['Salary'])
Output:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
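For the salary lookup, an alternative sketch is to build the Employee-to-Salary mapping once and map it per column (the name mapping is mine; unmatched names and NaN stay NaN):

mapping = salary.set_index('Employee')['Salary']
out2 = shifts.apply(lambda col: col.map(mapping))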
Here's a way to do what your question asks:
out1 = (shifts.stack()                           # long format: one row per (date, shift)
        .rename_axis(index=('date', 'shift'))
        .reset_index().rename(columns={0: 'employee'})
        .pipe(lambda df: df.assign(perf=         # look up each employee's performance
            df.apply(lambda row: performance.loc[row.date, row.employee], axis=1)))
        .pivot(index='date', columns='shift', values='perf')
        .rename_axis(index=None, columns=None))
out2 = shifts.replace(salary.set_index('Employee').Salary)
Output:
Output 1:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
Output 2:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
How do I create a new column which joins the column names for any non-NaN values, on a per-row basis?
Please note the duplicate index.
Code
import numpy as np
import pandas as pd

so_df = pd.DataFrame({"ma_1": [10, np.nan, 13, 15],
                      "ma_2": [10, 11, np.nan, 15],
                      "ma_3": [np.nan, 11, np.nan, 15]}, index=[0, 1, 1, 2])
Example DF
ma_1 ma_2 ma_3
0 10.0 10.0 NaN
1 NaN 11.0 11.0
1 13.0 NaN NaN
2 15.0 15.0 15.0
The desired output is a new column which joins the column names for the non-NaN values, as per the col_names example below.
so_df["col_names"] = ["ma_1, ma_2","ma_2, ma_3","ma_1","ma_1, ma_2, ma_3"]
ma_1 ma_2 ma_3 col_names
0 10.0 10.0 NaN ma_1, ma_2
1 NaN 11.0 11.0 ma_2, ma_3
1 13.0 NaN NaN ma_1
2 15.0 15.0 15.0 ma_1, ma_2, ma_3
Try with dot
df['new'] = df.notna().dot(df.columns+',').str[:-1]
df
Out[77]:
ma_1 ma_2 ma_3 new
0 10.0 10.0 NaN ma_1,ma_2
1 NaN 11.0 11.0 ma_2,ma_3
1 13.0 NaN NaN ma_1
2 15.0 15.0 15.0 ma_1,ma_2,ma_3
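The dot trick works because multiplying the boolean mask by the string Index concatenates, for each row, the names of the columns that are True; the trailing separator is then stripped. If you want the ", " separator from the desired output, a small variant (a sketch, stripping two characters instead of one):

so_df['col_names'] = so_df.notna().dot(so_df.columns + ', ').str[:-2]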
I have a pandas dataframe that effectively contains several different datasets. Between each dataset is a row full of NaN. Can I split the dataframe on the NaN row to make two dataframes? Thanks in advance.
You can use this to split into multiple data frames, based on the all-NaN rows:
# indices of all-NaN rows (+ beginning and end of df); assumes the default RangeIndex
idx = [0] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
#list of data frames split at all NaN indices
list_of_dfs = [df.iloc[idx[n]:idx[n+1]] for n in range(len(idx)-1)]
And if you want to exclude the NaN rows from the split data frames:
idx = [-1] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
list_of_dfs = [df.iloc[idx[n]+1:idx[n+1]] for n in range(len(idx)-1)]
Example:
df:
0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN
3 NaN NaN
4 NaN NaN
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN
list_of_dfs:
[ 0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN,
Empty DataFrame
Columns: [0, 1]
Index: [],
0 1
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN]
Use df[df[COLUMN_NAME].isnull()].index.tolist() to get a list of indices corresponding to the NaN rows. You can then split the dataframe into multiple dataframes by using the indices.
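A minimal sketch of that idea, assuming a default RangeIndex and using 'A' as a stand-in for COLUMN_NAME:

# rows where column 'A' is NaN mark the boundaries between datasets
nan_rows = df[df['A'].isnull()].index.tolist()
bounds = [-1] + nan_rows + [len(df)]
# slice between boundaries, skipping the NaN rows themselves
list_of_dfs = [df.iloc[start + 1:stop] for start, stop in zip(bounds, bounds[1:])]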
My solution lets you split your DataFrame into any number of chunks,
on each row full of NaNs.
Assume that the input DataFrame contains:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
3 NaN NaN NaN
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
7 NaN NaN NaN
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
so that "split points" are rows with indices 3 and 7.
To do your task:
Generate the grouping criterion Series:
grp = (df.isnull().sum(axis=1) == df.shape[1]).cumsum()
Drop rows full of NaN and group the result by the above criterion:
gr = df.dropna(axis=0, thresh=1).groupby(grp)
thresh=1 means that a row needs at least 1 non-NaN value to be kept in the result.
Perform the actual split, as a list comprehension:
result = [gr.get_group(key) for key in gr.groups]
To print the result, you can run:
for i, chunk in enumerate(result):
    print(f'Chunk {i}:')
    print(chunk, end='\n\n')
getting:
Chunk 0:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
Chunk 1:
A B C
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
Chunk 2:
A B C
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
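As a side note, the grouping criterion grp can be written more directly with an all-NaN row test; a sketch of the equivalent:

grp = df.isnull().all(axis=1).cumsum()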
When summing two pandas columns, I want to ignore NaN values when one of the two values is a float. However, when NaN appears in both columns, I want to keep NaN in the output (instead of 0.0).
Initial dataframe:
Surf1 Surf2
0 0
NaN 8
8 15
NaN NaN
16 14
15 7
Desired output:
Surf1 Surf2 Sum
0 0 0
NaN 8 8
8 15 23
NaN NaN NaN
16 14 30
15 7 22
Tried code:
The code below ignores NaN values, but when summing two NaN values it gives 0.0 in the output; I want to keep NaN in that particular case, to keep empty values separate from sums that are actually 0.
import pandas as pd
import numpy as np
data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
From the documentation of pandas.DataFrame.sum:

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([]).sum()  # min_count=0 is the default
0.0

This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
Change your code to
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
output
Surf1 Surf2
0 10.0 22.0
1 NaN 8.0
2 8.0 15.0
3 NaN NaN
4 16.0 14.0
5 15.0 7.0
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You could mask the result by doing:
df.sum(1).mask(df.isna().all(1))
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
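To attach it as a column, as in the question, the same expression can be assigned directly; a minimal sketch:

df['Sum'] = df.sum(axis=1).mask(df.isna().all(axis=1))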
You can sum only the rows that are not entirely NaN; index alignment then fills the dropped rows back with NaN on assignment:
df['Sum'] = df.dropna(how='all').sum(1)
Output:
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You can use min_count: this sums the row when there is at least one non-null value; if all values are null, it returns null.
df['SUM'] = df.sum(min_count=1, axis=1)

df.sum(min_count=1, axis=1)
Out[199]:
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
I think all the solutions listed above work only for cases where it is the FIRST column value that is missing. If you have cases where the first column value is non-missing but the second column value is missing, try using:
df['sum'] = df['Surf1']
df.loc[(df['Surf2'].notnull()), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']
I have a dataframe where in some cases a case has its records in more than one row, with nulls in some rows as so:
date_rounded 1 2 3 4 5
0 2020-04-01 00:05:00 0.0 NaN NaN NaN NaN
1 2020-04-01 00:05:00 NaN 1.0 44.0 44.0 46.454
2 2020-04-01 00:05:00 NaN NaN NaN NaN NaN
I want to have only one row with the filled data, so far I have:
df.groupby(['date_rounded']).apply(lambda df0: df0.ffill().bfill().drop_duplicates())
this works, but it is slow, any better ideas?
Thanks
You can also use groupby and first; within each group, first() returns the first non-NaN value of each column:
df.groupby("date_rounded").first()
1 2 3 4 5
date_rounded
2020-04-01 00:05:00 0.0 1.0 44.0 44.0 46.454
If you need to fill within each group, you can use groupby().apply and bfill:
df.groupby('date_rounded', as_index=False).apply(lambda x: x.bfill().iloc[0])
Output:
          date_rounded    1    2     3     4       5
0  2020-04-01 00:05:00  0.0  1.0  44.0  44.0  46.454