I have a DataFrame where I am looking to fill in values in a column based on their grouping. I only want to fill in the values (by propagating non-NaN values using ffill and bfill) if there is only one unique value in the column to be filled; otherwise, it should be left as is. My code below has a sample dataset where I try to do this, but I get an error.
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                   "B": ['a', 'a', np.nan, 'b', 'b', 'c', np.nan, 'd', np.nan, 'e', 'e', np.nan, 'h', 'h'],
                   "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, np.nan]})
col_to_groupby = "A"
col_to_modify = "B"
group = df.groupby(col_to_groupby)
modified = group[group[col_to_modify].nunique() == 1].transform(lambda x: x.ffill().bfill())
df.update(modified)
Error:
KeyError: 'Columns not found: False, True'
Original dataset:
A B C
0 1 a 5.0
1 1 a NaN
2 2 NaN 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 NaN NaN
Desired result:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
The above is the desired result because:
Row index 2 is in group 2, which has only 1 unique value in column B ("b"), so it is filled.
Row indices 6 and 8 are in group 3, which has 2 unique values in column B ("c" and "d"), so they are left unaltered.
Row index 11 is in group 5, which has no data in column B to propagate.
Row index 13 is in group 6, which has only 1 unique value in column B ("h"), so it is filled.
The KeyError occurs because indexing a GroupBy object with `[...]` selects columns, and `group[col_to_modify].nunique() == 1` is a boolean Series whose values, False and True, are then looked up as column names (hence "Columns not found: False, True"). One option is to add the condition inside groupby.apply:
df[col_to_modify] = df.groupby(col_to_groupby)[col_to_modify].apply(lambda x: x.ffill().bfill() if x.nunique()==1 else x)
Another is to use groupby + transform('nunique') + eq to build a boolean mask of the groups with a single unique value, then fill those rows via where with groupby + transform('first') (first skips NaN):
g = df.groupby(col_to_groupby)[col_to_modify]
df[col_to_modify] = g.transform('first').where(g.transform('nunique').eq(1), df[col_to_modify])
Output:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
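Put together as a self-contained script on the question's sample data (column C omitted for brevity), the transform-based approach can be checked end to end:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                   "B": ['a', 'a', np.nan, 'b', 'b', 'c', np.nan, 'd',
                         np.nan, 'e', 'e', np.nan, 'h', 'h']})
col_to_groupby, col_to_modify = "A", "B"

g = df.groupby(col_to_groupby)[col_to_modify]
# 'first' picks each group's first non-NaN value; keep it only where the
# group has exactly one unique value, otherwise keep the original cell.
df[col_to_modify] = g.transform('first').where(
    g.transform('nunique').eq(1), df[col_to_modify])
```

Groups with no non-NaN values at all (like A == 5) have nunique 0, so the condition is False and they are left untouched.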
Suppose I have a DataFrame with some NaNs:
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
     0    1    2
0  1.0  2.0  3.0
1  4.0  NaN  NaN
2  NaN  NaN  9.0
The result should be like this, where each NaN is replaced by the previous non-NaN value in its column plus 10:
      0     1     2
0   1.0   2.0   3.0
1   4.0  12.0  13.0
2  14.0  12.0   9.0
Is there any way to do this with built-in methods, or do I have to iterate over each column?
You can use ffill() to fill the NaNs with the previous non-NaN value in each column, then a boolean mask to increment just the originally missing cells by 10:
result = df.ffill()
result[df.isna()] += 10
Output:
0 1 2
0 1.0 2.0 3.0
1 4.0 12.0 13.0
2 14.0 12.0 9.0
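Since df.isna() is a boolean frame, the same result can also be written as a single expression — a sketch (note that a column whose very first value is NaN would stay NaN, since ffill has nothing to propagate there):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

# ffill fills each NaN with the previous value in its column;
# df.isna() * 10 adds 10 only in the originally-missing cells.
result = df.ffill() + df.isna() * 10
```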
Working with pandas, I have a dataframe with two hierarchies A and B, where B can be NaN, and I want to fill some NaNs in D in a particular way:
In the example below, A has "B-subgroups" where there are no values at all for D (e.g. (1, 1)), while A also has values for D in other subgroups (e.g. (1, 3)).
Now I want to get the mean of each subgroup (120, 90 and 75 for A==1), find the median of these means (90 for A==1) and use this median to fill NaNs in the other subgroups of A==1.
Groups like A==2, where there are only NaNs for D, should not be filled.
Groups like A==3, where there are some values for D but only rows with B being NaN have NaN in D, should not be filled if possible (I intend to fill these later with the mean of all values of D of their whole A groups).
Example df:
import numpy as np
import pandas as pd

d = {'A': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3],
     'B': [1, 2, 3, 3, 4, 5, 6, 1, 1, np.nan, np.nan],
     'D': [np.nan, np.nan, 120, 120, 90, 75, np.nan, np.nan, 60, 50, np.nan]}
df = pd.DataFrame(data=d)
A B D
1 1 NaN
1 2 NaN
1 3 120
1 3 120
1 4 90
1 5 75
1 6 NaN
2 1 NaN
3 1 60
3 NaN 50
3 NaN NaN
Expected result:
A B D
1 1 90
1 2 90
1 3 120
1 3 120
1 4 90
1 5 75
1 6 90
2 1 NaN
3 1 60
3 NaN 50
3 NaN NaN
With df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median') or .median() I seem to get the right values, but using
df['D'] = df['D'].fillna(
df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
)
does not seem to change any values in D.
Any help is greatly appreciated, I've been stuck on this for a while and cannot find any solution anywhere.
Your first step is correct. The reason your fillna does nothing is that fillna aligns on the index: your medians Series is indexed by A, while df has a default RangeIndex, so no labels match. Instead, use Series.map to map the correct median to each group in column A, and np.where to fill in column D only where B is not NaN:
medians = df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
df['D'] = np.where(df['B'].notna(), # if B is not NaN
df['D'].fillna(df['A'].map(medians)), # fill in the median
df['D']) # else keep the value of column D
A B D
0 1 1.00 90.00
1 1 2.00 90.00
2 1 3.00 120.00
3 1 3.00 120.00
4 1 4.00 90.00
5 1 5.00 75.00
6 1 6.00 90.00
7 2 1.00 nan
8 3 1.00 60.00
9 3 nan 50.00
10 3 nan nan
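To see why this works, the intermediate medians series can be inspected on the question's data — a sketch of the two grouping steps:

```python
import numpy as np
import pandas as pd

d = {'A': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3],
     'B': [1, 2, 3, 3, 4, 5, 6, 1, 1, np.nan, np.nan],
     'D': [np.nan, np.nan, 120, 120, 90, 75, np.nan, np.nan, 60, 50, np.nan]}
df = pd.DataFrame(d)

# Per-(A, B) subgroup means: all-NaN subgroups produce NaN, and rows
# with B == NaN are dropped by groupby entirely.
means = df.groupby(['A', 'B'])['D'].mean()
# Median of those means per A group; NaN means are skipped,
# so A == 1 gives median(120, 90, 75) == 90.
medians = means.groupby('A').median()
```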
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
But I have no idea how to achieve this without very slow iteration.
Looks like what you want is DataFrame.shift:
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
So the code below should do what you're looking for (add_prefix makes the shifted column names unique):
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
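A vectorized alternative is possible with NumPy's sliding_window_view (assuming NumPy >= 1.20) — a sketch, not the answer's code: pad one NaN row above and below, take 3-row windows spanning all columns (as the answer's output effectively does), and use np.nanstd so the edge windows ignore the padding:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

# Pad a NaN row above and below so the first and last rows get windows too.
padded = np.pad(df.to_numpy(float), ((1, 1), (0, 0)), constant_values=np.nan)
# One window per original row: shape (n_rows, n_cols, 3).
windows = sliding_window_view(padded, window_shape=3, axis=0)
# Population std over each window's 9 cells, ignoring the NaN padding.
stds = np.nanstd(windows.reshape(len(df), -1), axis=1)
```

This reproduces the std column of the lagged frame above without building the intermediate shifted copies.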
import pandas as pd

df1 = pd.DataFrame({
    'value1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'value2': [1, 2, 3, 4, 4, 4, 5, 5],
    'value3': [1, 2, 3, None, None, None, None, None],
    'value4': [1, 2, 3, None, None, None, None, None],
    'value5': [1, 2, 3, None, None, None, None, None]})
df2 = pd.DataFrame({
    'value1': ['k', 'j', 'l', 'm', 'x', 'y'],
    'value2': [2, 2, 1, 3, 4, 5],
    'value3': [2, 2, 2, 3, 4, 5],
    'value4': [3, 2, 2, 3, 4, 5],
    'value5': [2, 1, 2, 3, 4, 5]})
df1 =
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 NaN NaN NaN
4 b 4 NaN NaN NaN
5 b 4 NaN NaN NaN
6 c 5 NaN NaN NaN
7 c 5 NaN NaN NaN
df2 =
value1 value2 value3 value4 value5
0 k 2 2 3 2
1 j 2 2 2 1
2 l 1 2 2 2
3 m 3 3 3 3
4 x 4 4 4 4
5 y 5 5 5 5
I would like to fill the NaNs in df1 with values from df2.
The result should look like this:
df1 =
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 2 2 1
4 b 4 2 2 2
5 b 4 3 3 3
6 c 5 4 4 4
7 c 5 5 5 5
I used the following code:
tmp1 = df1[df1.value1 == 'b'].iloc[:, 2:]
tmp2 = df2.iloc[1:, 2:]
tmp1 = tmp2 can update the values in tmp1, but when I use the following
df1[df1.value1 == 'b'].iloc[:, 2:] = tmp2
it doesn't update the values in df1, as shown below.
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 NaN NaN NaN
4 b 4 NaN NaN NaN
5 b 4 NaN NaN NaN
6 c 5 NaN NaN NaN
7 c 5 NaN NaN NaN
Why it happens and how can I solve this issue?
Thank you.
This line doesn't do what you think it's doing:
tmp1 = df1[df1.value1 == 'b'].iloc[:, 2:]
Indexers are applied sequentially, so df1[df1.value1 == 'b'] first materialises a copy holding only rows 3, 4, 5; the chained .iloc assignment then writes to that copy, never to df1 (the classic chained-indexing pitfall). It also isn't the selection you want: you want to update all rows starting from the first row where your condition is satisfied.
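A minimal illustration of the chained-indexing problem, on a hypothetical two-column frame: the boolean filter returns a copy, so the chained assignment lands on the copy, while a single .loc call writes through to the original:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [float('nan'), 5.0]})

# Chained indexing: df[mask] materialises a copy, so this assignment
# never reaches df (pandas may emit a SettingWithCopyWarning).
df[df.x == 1].iloc[:, 1:] = 99.0
unchanged = pd.isna(df.loc[0, 'y'])  # True: df was not modified

# A single indexer writes through to df itself.
df.loc[df.x == 1, 'y'] = 99.0
```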
Instead, first find the required index.
idx = df1['value1'].eq('b').values.argmax()
You then need to explicitly assign the last n rows from df2:
df1.iloc[idx:, 2:] = df2.iloc[-(len(df1.index)-idx):, 2:].values
print(df1)
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 2.0 2.0 1.0
4 b 4 2.0 2.0 2.0
5 b 4 3.0 3.0 3.0
6 c 5 4.0 4.0 4.0
7 c 5 5.0 5.0 5.0
If you want to replace the NaN values using index alignment, use pandas fillna:
df1.fillna(df2)
Add inplace=True if you want to update df1 in place:
df1.fillna(df2, inplace=True)
Edit, for the case without aligned indexes:
If the indexes of the target and replacement values are not aligned, they can be aligned first so that the DataFrame fillna method can be used.
To align them, get the indexes of the rows in df1 whose NaNs are to be replaced, slice df2 down to the replacement values, and assign those df1 indexes as the index of the df2 slice. Then fillna transfers the values from df2 to df1.
# in this case, find index values when df1.value1 is greater than or equal to 'b'
# (alternately could be indexes of rows containing nans)
idx = df1.index[df1.value1 >= 'b']
# get the section of df2 which will provide replacement values
# limit length to length of idx
align_df = df2[1:len(idx) + 1]
# set the index to match the nan rows from df1
align_df.index = idx
# use auto-alignment with fillna to transfer values from align_df(df2) to df1
df1.fillna(align_df)
# or can use df1.combine_first(align_df) because of the matching target and replacement indexes
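End to end on the question's data, the alignment approach looks like this (a sketch; the .copy() is added so relabelling the slice cannot touch df2):

```python
import pandas as pd

df1 = pd.DataFrame({
    'value1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'value2': [1, 2, 3, 4, 4, 4, 5, 5],
    'value3': [1, 2, 3, None, None, None, None, None],
    'value4': [1, 2, 3, None, None, None, None, None],
    'value5': [1, 2, 3, None, None, None, None, None]})
df2 = pd.DataFrame({
    'value1': ['k', 'j', 'l', 'm', 'x', 'y'],
    'value2': [2, 2, 1, 3, 4, 5],
    'value3': [2, 2, 2, 3, 4, 5],
    'value4': [3, 2, 2, 3, 4, 5],
    'value5': [2, 1, 2, 3, 4, 5]})

# Rows of df1 that need replacement values.
idx = df1.index[df1.value1 >= 'b']
# Matching rows of df2, relabelled with df1's indexes so fillna aligns.
align_df = df2[1:len(idx) + 1].copy()
align_df.index = idx
filled = df1.fillna(align_df)
```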
I have to generate a list of inserts, deletes, and updates based on a database extract and a CSV file.
I'm planning to do that using two pandas DataFrames.
I'm able to generate the inserts (new items in the CSV-based df) and the deletes (items not existing in the CSV-based df), but I don't know how to generate the updates list. Each update dict should contain only the columns whose values changed, plus the key column.
The result of the operation should be something like this:
{'key': 10,
 'column1': 'abc',
 'column6': 10.8
}
Any idea how to achieve this?
You can do it this way:
In [424]: df
Out[424]:
a b c d
0 7 5 1 3
1 1 8 6 1
2 9 6 5 2
3 5 5 4 2
4 7 1 4 6
In [425]: df2
Out[425]:
a b c d
0 -1 5 1 -1
1 1 8 6 1
2 9 6 5 2
3 5 5 -1 2
4 7 1 4 6
In [426]: df.index.name = 'key'
In [427]: df2.index.name = 'key'
In [430]: (df2[df2 != df]
.....: .dropna(how='all')
.....: .dropna(axis=1, how='all')
.....: .reset_index()
.....: .apply(lambda x: x.dropna().to_dict(), axis=1)
.....: )
Out[430]:
0 {'a': -1.0, 'd': -1.0, 'key': 0.0}
1 {'c': -1.0, 'key': 3.0}
dtype: object
explanation:
In [441]: df2[df2 != df]
Out[441]:
a b c d
key
0 -1.0 NaN NaN -1.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN -1.0 NaN
4 NaN NaN NaN NaN
In [443]: df2[df2 != df].dropna(how='all')
Out[443]:
a b c d
key
0 -1.0 NaN NaN -1.0
3 NaN NaN -1.0 NaN
In [444]: df2[df2 != df].dropna(how='all').dropna(axis=1, how='all')
Out[444]:
a c d
key
0 -1.0 NaN -1.0
3 NaN -1.0 NaN
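Wrapped into a reusable function (the changed_records name and taking the key from the index are assumptions, not part of the question), the same chain yields plain dicts instead of a Series of dicts:

```python
import pandas as pd

def changed_records(old, new, key_name='key'):
    """One dict per changed row: only the changed columns plus the key.

    Caveat: NaN != NaN, and changed cells whose *new* value is NaN are
    dropped by the dropna steps, so changes to NaN are not reported.
    """
    # NaN marks cells where old and new agree; changed cells keep new's value.
    diff = new[new != old]
    # Drop rows and columns with no changes at all.
    diff = diff.dropna(how='all').dropna(axis=1, how='all')
    # One dict per remaining row: the key plus only its changed columns.
    return [{key_name: k, **row.dropna().to_dict()}
            for k, row in diff.iterrows()]

# Same data as the session above.
df = pd.DataFrame({'a': [7, 1, 9, 5, 7], 'b': [5, 8, 6, 5, 1],
                   'c': [1, 6, 5, 4, 4], 'd': [3, 1, 2, 2, 6]})
df2 = df.copy()
df2.loc[0, ['a', 'd']] = -1
df2.loc[3, 'c'] = -1

records = changed_records(df, df2)
```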