I have a DataFrame where I am looking to fill in values in a column based on their grouping. I only want to fill in the values (by propagating non-NaN values using ffill and bfill) if there is only one unique value in the column to be filled; otherwise, it should be left as is. My code below has a sample dataset where I try to do this, but I get an error.
Code:
df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
"B": ['a', 'a', np.nan, 'b', 'b', 'c', np.nan, 'd', np.nan, 'e', 'e', np.nan, 'h', 'h'],
"C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, np.nan]})
col_to_groupby = "A"
col_to_modify = "B"
group = df.groupby(col_to_groupby)
modified = group[group[col_to_modify].nunique() == 1].transform(lambda x: x.ffill().bfill())
df.update(modified)
Error:
KeyError: 'Columns not found: False, True'
Original dataset:
A B C
0 1 a 5.0
1 1 a NaN
2 2 NaN 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 NaN NaN
Desired result:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
The above is the desired result because
row index 2 is in group 2, which only has 1 unique value in column B ("b"), so it is changed.
row indices 6 and 8 are in group 3, but there are 2 unique values in column B ("c" and "d"), so they are unaltered.
row index 5 is in group 11, but has no data in column B to propagate.
row index 13 is in group 6, which only has 1 unique value in column B ("h"), so it is changed.
One option is to add a condition in groupby.apply:
df[col_to_modify] = df.groupby(col_to_groupby)[col_to_modify].apply(lambda x: x.ffill().bfill() if x.nunique()==1 else x)
Another could be to use groupby + transform(nunique) + eq to create a boolean filter for the groups with unique values; then update those rows with groupby + first (first drops NaN) using where:
g = df.groupby(col_to_groupby)[col_to_modify]
df[col_to_modify] = g.transform('first').where(g.transform('nunique').eq(1), df[col_to_modify])
Output:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
I would like to merge two dataframes, df2 might have more columns and will always be 1 row. I would like the data from the df2 row to overwrite the matching row in df on a.
df = pd.DataFrame({'a': {0: 0, 1: 1, 2: 2}, 'b': {0: 3, 1: 4, 2: 5}})
df2 = pd.DataFrame({'a': {0: 1}, 'b': {0: 90}, 'c': {0: 76}})
>>> df
a b
0 0 3
1 1 4
2 2 5
>>> df2
a b c
0 1 90 76
The desired output:
a b c
0 0 3 NaN
1 1 90 76
2 2 5 NaN
I have tried merge left but this creates two b columns (b_x and b_y):
>>> pd.merge(df,df2,how='left', on='a')
a b_x b_y c
0 0 3 NaN NaN
1 1 4 90.0 76.0
2 2 5 NaN NaN
You can use df.combine_first here:
df2.set_index("a").combine_first(df.set_index("a")).reset_index()
Or with merge:
out = df.merge(df2,on=['a'],how='left')
out.loc[:,out.columns.str.endswith("_x")] = out.loc[:,
out.columns.str.endswith("_y")].to_numpy()
out = out.groupby(out.columns.str.split("_").str[0],axis=1).first()
print(out)
a b c
0 0 3.0 NaN
1 1 90.0 76.0
2 2 5.0 NaN
I have a (example-) dataframe with 4 columns:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
A B C D
0 a 42.0 NaN NaN
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
I would now like to merge/combine columns B, C, and D to a new column E like in this example:
data2 = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'E': [42, 52, 31, 2, 62, 70]}
df2 = pd.DataFrame(data2, columns = ['A', 'E'])
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
I found a quite similar question here but this adds the merged colums B, C, and D at the end of column A:
0 a
1 b
2 c
3 d
4 e
5 f
6 42
7 52
8 31
9 2
10 62
11 70
dtype: object
Thanks for help.
Option 1
Using assign and drop
In [644]: cols = ['B', 'C', 'D']
In [645]: df.assign(E=df[cols].sum(1)).drop(cols, 1)
Out[645]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 2
Using assignment and drop
In [648]: df['E'] = df[cols].sum(1)
In [649]: df = df.drop(cols, 1)
In [650]: df
Out[650]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 3 Lately, I like the 3rd option.
Using groupby
In [660]: df.groupby(np.where(df.columns == 'A', 'A', 'E'), axis=1).first() #or sum max min
Out[660]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
In [661]: df.columns == 'A'
Out[661]: array([ True, False, False, False], dtype=bool)
In [662]: np.where(df.columns == 'A', 'A', 'E')
Out[662]:
array(['A', 'E', 'E', 'E'],
dtype='|S1')
The question as written asks for merge/combine as opposed to sum, so posting this to help folks who find this answer looking for help on coalescing with combine_first, which can be a bit tricky.
df2 = pd.concat([df["A"],
df["B"].combine_first(df["C"]).combine_first(df["D"])],
axis=1)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
What's so tricky about that? in this case there's no problem - but let's say you were pulling the B, C and D values from different dataframes, in which the a,b,c,d,e,f labels were present, but not necessarily in the same order. combine_first() aligns on the index, so you'd need to tack a set_index() on to each of your df references.
df2 = pd.concat([df.set_index("A", drop=False)["A"],
df.set_index("A")["B"]\
.combine_first(df.set_index("A")["C"])\
.combine_first(df.set_index("A")["D"]).astype(int)],
axis=1).reset_index(drop=True)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Use difference for columns names without A and then get sum or max:
cols = df.columns.difference(['A'])
df['E'] = df[cols].sum(axis=1).astype(int)
# df['E'] = df[cols].max(axis=1).astype(int)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
If multiple values per rows:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [10, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
print (df)
A B C D
0 a 42.0 NaN 10.0
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
cols = df.columns.difference(['A'])
df['E'] = df[cols].apply(lambda x: ', '.join(x.dropna().astype(int).astype(str)), 1)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42, 10
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
You can also use ffill with iloc:
df['E'] = df.iloc[:, 1:].ffill(1).iloc[:, -1].astype(int)
df = df.iloc[:, [0, -1]]
print(df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Zero's third option using groupby requires a numpy import and only handles one column outside the set of columns to collapse, while jpp's answer using ffill requires you know how columns are ordered. Here's a solution that has no extra dependencies, takes an arbitrary input dataframe, and only collapses columns if all rows in those columns are single-valued:
import pandas as pd
data = [{'A':'a', 'B':42, 'messy':'z'},
{'A':'b', 'B':52, 'messy':'y'},
{'A':'c', 'C':31},
{'A':'d', 'C':2, 'messy':'w'},
{'A':'e', 'D':62, 'messy':'v'},
{'A':'f', 'D':70, 'messy':['z']}]
df = pd.DataFrame(data)
cols = ['B', 'C', 'D']
new_col = 'E'
if df[cols].apply(lambda x: len(x.notna().value_counts()) == 1, axis=1).all():
df[new_col] = df[cols].ffill(axis=1).dropna(axis=1)
df2 = df.drop(columns=cols)
print(df, '\n\n', df2)
Output:
A B messy C D
0 a 42.0 z NaN NaN
1 b 52.0 y NaN NaN
2 c NaN NaN 31.0 NaN
3 d NaN w 2.0 NaN
4 e NaN v NaN 62.0
5 f NaN [z] NaN 70.0
A messy E
0 a z 42.0
1 b y 52.0
2 c NaN 31.0
3 d w 2.0
4 e v 62.0
5 f [z] 70.0
I have a dataframe like this:
df = pd.DataFrame({'dim': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A'},
'id': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3},
'value1': {0: nan, 1: 1.2, 2: 2.0, 3: nan, 4: 3.0},
'value2': {0: 1.0, 1: 2.0, 2: nan, 3: nan, 4: nan}})
dim id value1 value2
0 A 1 NaN 1.0
1 B 1 1.2 2.0
2 A 2 2.0 NaN
3 B 2 NaN NaN
4 A 3 3.0 NaN
I now want to aggregate the values for different dimensions over the id, so that the following is true:
If dim == 'A' is not None then take the value from dim == 'A' else take value where dim == 'B' (if it is not None). If both are None, just take None.
So the result should be:
id value1 value2
0 1 1.2 1.0
1 2 2.0 NaN
2 3 3.0 NaN
My guess is, I would need to use some form of group by function, but I am not too sure. Maybe something with apply?
You can use set_index with unstack and swaplevel for reshape and then combine_first:
df1 = df.set_index(['id','dim']).unstack().swaplevel(0,1,axis=1)
#alternative
#df1 = df.pivot('id','dim').swaplevel(0,1,axis=1)
print (df1)
dim A B A B
value1 value1 value2 value2
id
1 NaN 1.2 1.0 2.0
2 2.0 NaN NaN NaN
3 3.0 NaN NaN NaN
df2 = df1['A'].combine_first(df1['B']).reset_index()
print (df2)
id value1 value2
0 1 1.2 1.0
1 2 2.0 NaN
2 3 3.0 NaN
Similar solution with xs for select MultiIndex:
df1 = df.set_index(['id','dim']).unstack()
#alternative
#df1 = df.pivot('id','dim')
print (df1)
value1 value2
dim A B A B
id
1 NaN 1.2 1.0 2.0
2 2.0 NaN NaN NaN
3 3.0 NaN NaN NaN
df2 = df1.xs('A', axis=1, level=1).combine_first(df1.xs('B', axis=1, level=1)).reset_index()
print (df2)
id value1 value2
0 1 1.2 1.0
1 2 2.0 NaN
2 3 3.0 NaN
I have to generate and update a list based on a database extract and a csv file.
I'm planning to do that using 2 pandas dataframes.
I'm able to generate the inserts (new items within the csv file based df) and the deletes (items not existing in the csv file based df) but I don't know how to generate and update the list. The dict should only contain the columns where the values are changed and the key column
The result of the operation should be something like this:
{'key': 10,
'column1': 'abc',
'column6': 10.8
}
Any idea on how to achieve this?
you can do it this way:
In [424]: df
Out[424]:
a b c d
0 7 5 1 3
1 1 8 6 1
2 9 6 5 2
3 5 5 4 2
4 7 1 4 6
In [425]: df2
Out[425]:
a b c d
0 -1 5 1 -1
1 1 8 6 1
2 9 6 5 2
3 5 5 -1 2
4 7 1 4 6
In [426]: df.index.name = 'key'
In [427]: df2.index.name = 'key'
In [430]: (df2[df2 != df]
.....: .dropna(how='all')
.....: .dropna(axis=1, how='all')
.....: .reset_index()
.....: .apply(lambda x: x.dropna().to_dict(), axis=1)
.....: )
Out[430]:
0 {'a': -1.0, 'd': -1.0, 'key': 0.0}
1 {'c': -1.0, 'key': 3.0}
dtype: object
explanation:
In [441]: df2[df2 != df]
Out[441]:
a b c d
key
0 -1.0 NaN NaN -1.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN -1.0 NaN
4 NaN NaN NaN NaN
In [443]: df2[df2 != df].dropna(how='all')
Out[443]:
a b c d
key
0 -1.0 NaN NaN -1.0
3 NaN NaN -1.0 NaN
In [444]: df2[df2 != df].dropna(how='all').dropna(axis=1, how='all')
Out[444]:
a c d
key
0 -1.0 NaN -1.0
3 NaN -1.0 NaN