Create a dict of the changes between 2 dataframes - python

I have to generate insert, update, and delete lists by comparing a database extract with a CSV file.
I'm planning to do that using 2 pandas dataframes.
I'm able to generate the inserts (new items that exist only in the CSV-based df) and the deletes (items no longer present in the CSV-based df), but I don't know how to generate the updates. Each update dict should contain only the key column and the columns whose values changed.
The result of the operation should be something like this:
{'key': 10,
'column1': 'abc',
'column6': 10.8
}
Any idea on how to achieve this?

You can do it this way:
In [424]: df
Out[424]:
a b c d
0 7 5 1 3
1 1 8 6 1
2 9 6 5 2
3 5 5 4 2
4 7 1 4 6
In [425]: df2
Out[425]:
a b c d
0 -1 5 1 -1
1 1 8 6 1
2 9 6 5 2
3 5 5 -1 2
4 7 1 4 6
In [426]: df.index.name = 'key'
In [427]: df2.index.name = 'key'
In [430]: (df2[df2 != df]
.....: .dropna(how='all')
.....: .dropna(axis=1, how='all')
.....: .reset_index()
.....: .apply(lambda x: x.dropna().to_dict(), axis=1)
.....: )
Out[430]:
0 {'a': -1.0, 'd': -1.0, 'key': 0.0}
1 {'c': -1.0, 'key': 3.0}
dtype: object
Explanation (the values in the dicts come back as floats because the intermediate NaNs force a float dtype):
In [441]: df2[df2 != df]
Out[441]:
a b c d
key
0 -1.0 NaN NaN -1.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN -1.0 NaN
4 NaN NaN NaN NaN
In [443]: df2[df2 != df].dropna(how='all')
Out[443]:
a b c d
key
0 -1.0 NaN NaN -1.0
3 NaN NaN -1.0 NaN
In [444]: df2[df2 != df].dropna(how='all').dropna(axis=1, how='all')
Out[444]:
a c d
key
0 -1.0 NaN -1.0
3 NaN -1.0 NaN
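To put the whole pipeline together, here is a self-contained sketch (diff_records is a hypothetical helper name; it assumes both frames share the same key index and columns):
import pandas as pd

def diff_records(old, new, key_name='key'):
    # compare two identically-indexed frames and return one dict per
    # changed row, holding the key plus only the columns that differ
    old, new = old.copy(), new.copy()
    old.index.name = new.index.name = key_name
    changed = (new[new != old]
               .dropna(how='all')            # drop rows with no changes
               .dropna(axis=1, how='all')    # drop columns with no changes
               .reset_index())
    return changed.apply(lambda x: x.dropna().to_dict(), axis=1).tolist()

old = pd.DataFrame({'a': [7, 1], 'b': [5, 8]})
new = pd.DataFrame({'a': [-1, 1], 'b': [5, 8]})
print(diff_records(old, new))   # [{'key': 0.0, 'a': -1.0}]
One caveat: NaN != NaN evaluates to True, so positions that are NaN in both frames would be reported as changes; that doesn't matter here because the data has no NaNs.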

Related

Combine and group rows from 2 dfs

I have 2 dfs, which I want to combine as the following:
df1 = pd.DataFrame({"a": [1,2], "b":['A','B'], "c":[3,2]})
df2 = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[3, None,None,2,None,None,None,None]})
Output:
a b c
1 A 3.0
1 A NaN
1 A NaN
2 B 2.0
2 B NaN
2 B NaN
I had an earlier version of this question that only involved df2 and was solved with
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
but now I need to run it only for rows that appear in df1 (df2 contains the rows from df1 plus some extra rows, which I don't want included).
Thanks!
You can turn on the indicator in merge:
out = df2.merge(df1,indicator=True,how='outer',on=['a','b'])
Out[91]:
a b c_x c_y _merge
0 1 A 3.0 3.0 both
1 1 A NaN 3.0 both
2 1 A NaN 3.0 both
3 2 B 2.0 2.0 both
4 2 B NaN 2.0 both
5 2 B NaN 2.0 both
6 3 C NaN NaN left_only
7 4 D NaN NaN left_only
out = out[out['_merge']=='both']
Then drop the helper columns (_merge and the duplicated c column from the merge) to match the desired output.
IIUC, you could merge:
out = df2.merge(df1[['a','b']])
or you could use chained isin (note that this checks each column independently rather than as (a, b) pairs; it happens to be safe for this data):
out1 = df2[df2['a'].isin(df1['a']) & df2['b'].isin(df1['b'])]
Output:
a b c
0 1 A 3.0
1 1 A NaN
2 1 A NaN
3 2 B 2.0
4 2 B NaN
5 2 B NaN
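If the (a, b) combinations ever need to be matched strictly as pairs without a merge, a pair-safe variant is to compare MultiIndexes (a sketch, built on the same df1 and df2):
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ['A', 'B'], "c": [3, 2]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                    "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                    "c": [3, None, None, 2, None, None, None, None]})

# build a MultiIndex of (a, b) pairs on both frames and keep only the
# df2 rows whose pair actually occurs in df1
mask = df2.set_index(['a', 'b']).index.isin(df1.set_index(['a', 'b']).index)
out = df2[mask]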

Pandas fill in group if condition is met

I have a DataFrame where I am looking to fill in values in a column based on their grouping. I only want to fill in the values (by propagating non-NaN values using ffill and bfill) if there is only one unique value in the column to be filled; otherwise, it should be left as is. My code below has a sample dataset where I try to do this, but I get an error.
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
"B": ['a', 'a', np.nan, 'b', 'b', 'c', np.nan, 'd', np.nan, 'e', 'e', np.nan, 'h', 'h'],
"C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, np.nan]})
col_to_groupby = "A"
col_to_modify = "B"
group = df.groupby(col_to_groupby)
modified = group[group[col_to_modify].nunique() == 1].transform(lambda x: x.ffill().bfill())
df.update(modified)
Error:
KeyError: 'Columns not found: False, True'
Original dataset:
A B C
0 1 a 5.0
1 1 a NaN
2 2 NaN 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 NaN NaN
Desired result:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
The above is the desired result because:
row index 2 is in group 2, which has only 1 unique value in column B ("b"), so it is changed.
row indices 6 and 8 are in group 3, but there are 2 unique values in column B ("c" and "d"), so they are unaltered.
row index 11 is in group 5, which has no data in column B to propagate.
row index 13 is in group 6, which has only 1 unique value in column B ("h"), so it is changed.
The KeyError happens because indexing a GroupBy object with a boolean Series is treated as column selection, so pandas looks for columns literally named True and False. One option is to move the condition into groupby.apply:
df[col_to_modify] = df.groupby(col_to_groupby)[col_to_modify].apply(lambda x: x.ffill().bfill() if x.nunique()==1 else x)
Another option is to use groupby + transform('nunique') + eq to create a boolean mask of the groups with a single unique value, then fill those rows with groupby + transform('first') (first skips NaN) via where:
g = df.groupby(col_to_groupby)[col_to_modify]
df[col_to_modify] = g.transform('first').where(g.transform('nunique').eq(1), df[col_to_modify])
Output:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
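To reuse this on other column pairs, the transform-based variant can be wrapped in a small helper (a sketch; fill_if_unique is a hypothetical name, applied here to the df from the question):
def fill_if_unique(df, group_col, fill_col):
    # fill fill_col within each group_col group, but only for groups
    # that contain exactly one unique non-NaN value
    g = df.groupby(group_col)[fill_col]
    # 'first' broadcasts the single non-NaN value per group; 'where'
    # keeps the original values for groups with zero or several values
    return g.transform('first').where(g.transform('nunique').eq(1), df[fill_col])

df['B'] = fill_if_unique(df, 'A', 'B')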

Fill out NA values in Pandas DataFrame by using another Pandas DataFrame

import pandas as pd
df1 = pd.DataFrame({
'value1': ["a","a","a","b","b","b","c","c"],
'value2': [1,2,3,4,4,4,5,5],
'value3': [1,2,3, None , None, None, None, None],
'value4': [1,2,3,None , None, None, None, None],
'value5': [1,2,3,None , None, None, None, None]})
df2 = pd.DataFrame({
'value1': ["k","j","l","m","x","y"],
'value2': [2, 2, 1, 3, 4, 5],
'value3': [2, 2, 2, 3, 4, 5],
'value4': [3, 2, 2, 3, 4, 5],
'value5': [2, 1, 2, 3, 4, 5]})
df1 =
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 NaN NaN NaN
4 b 4 NaN NaN NaN
5 b 4 NaN NaN NaN
6 c 5 NaN NaN NaN
7 c 5 NaN NaN NaN
df2 =
value1 value2 value3 value4 value5
0 k 2 2 3 2
1 j 2 2 2 1
2 l 1 2 2 2
3 m 3 3 3 3
4 x 4 4 4 4
5 y 5 5 5 5
I would like to fill the NaNs in df1 with values from df2, so that df1 looks like this:
df1 =
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 2 2 1
4 b 4 2 2 2
5 b 4 3 3 3
6 c 5 4 4 4
7 c 5 5 5 5
I used the following code:
tmp1 = df1[df1.value1 == 'b'].iloc[:, 2:]
tmp2 = df2.iloc[1:, 2:]
tmp1 = tmp2 updates the values in tmp1, but when I use the following:
df1[df1.value1 == 'b'].iloc[:, 2:]= tmp2
it doesn't update the values in df1, as shown below:
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 NaN NaN NaN
4 b 4 NaN NaN NaN
5 b 4 NaN NaN NaN
6 c 5 NaN NaN NaN
7 c 5 NaN NaN NaN
Why it happens and how can I solve this issue?
Thank you.
This line doesn't do what you think it's doing:
tmp1 = df1[df1.value1 == 'b'].iloc[:, 2:]
Methods are applied sequentially, so df1[df1.value1 == 'b'] keeps only rows 3, 4, and 5 of df1. Worse, assigning through chained indexing like df1[df1.value1 == 'b'].iloc[:, 2:] = tmp2 writes into a temporary copy, so df1 itself is never modified. And it isn't what you want anyway: you want to update all rows starting from the first instance where your condition is satisfied.
Instead, first find the required index.
idx = df1['value1'].eq('b').values.argmax()
You then need to explicitly assign the last n rows from df2:
df1.iloc[idx:, 2:] = df2.iloc[-(len(df1.index)-idx):, 2:].values
print(df1)
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 2.0 2.0 1.0
4 b 4 2.0 2.0 2.0
5 b 4 3.0 3.0 3.0
6 c 5 4.0 4.0 4.0
7 c 5 5.0 5.0 5.0
If you want to replace the NaN values using index alignment, use pandas fillna:
df1.fillna(df2)
Add inplace=True if you want to update df1 in place:
df1.fillna(df2, inplace=True)
Edit, for the case without aligned indexes:
If the indexes of the target and replacement values are not aligned, they can be aligned first so that the DataFrame fillna method can be used.
To do that, get the indexes of the NaN-containing rows in df1, slice the replacement values out of df2, and assign the df1 indexes as the index of that slice. Then fillna transfers the values from df2 to df1.
# in this case, find index values when df1.value1 is greater than or equal to 'b'
# (alternately could be indexes of rows containing nans)
idx = df1.index[df1.value1 >= 'b']
# get the section of df2 which will provide replacement values
# limit length to length of idx
align_df = df2[1:len(idx) + 1]
# set the index to match the nan rows from df1
align_df.index = idx
# use auto-alignment with fillna to transfer values from align_df(df2) to df1
df1.fillna(align_df)
# or can use df1.combine_first(align_df) because of the matching target and replacement indexes
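A compact variant of the same alignment idea (a sketch, assuming df1 and df2 as defined above, with the replacement rows starting at position 1 of df2):
# rows of df1 that still contain NaNs
idx = df1.index[df1.isna().any(axis=1)]
# take the same number of replacement rows from df2 and re-label them so
# that fillna's index alignment pairs them with df1's NaN rows
repl = df2.iloc[1:len(idx) + 1].set_axis(idx)
df1 = df1.fillna(repl)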

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign the value by condition:
import numpy as np

df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default replaces values with NaN where the condition is met:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated as:
assign np.nan to every column (:) of the dataframe (df) wherever the
condition df.B > 5 holds.
Or using reindex:
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
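For completeness, a minimal self-contained run (a sketch with the imports spelled out). Note that assigning NaN upcasts the integer columns to float, which is why the outputs above show 1.0, 4.0, and so on:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4, 8], 'B': [4, 5, 6, 7]})
df.loc[df['B'] > 5, :] = np.nan   # blank out every column where B > 5
print(df)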

Pandas Apply returns matrix instead of single column

This is probably a stupid question, but I have been trying for a while and I can't seem to get it to work.
I have a dataframe:
import pandas as pd

df1 = pd.DataFrame({'Type': ['A','A', 'B', 'F', 'C', 'G', 'A', 'E'], 'Other': [999., 999., 999., 999., 999., 999., 999., 999.]})
I now want to create a new column based on the column Type. For this I have a second dataframe:
df2 = pd.DataFrame({'Type':['A','B','C','D','E','F', 'G'],'Value':[1, 1, 2, 3, 4, 4, 5]})
that I am using as a lookup table.
When I try something like:
df1.apply(lambda x: df2.Value[df2.Type == x['Type']],axis=1)
I get a matrix instead of a single column:
Out[21]:
0 1 2 4 5 6
0 1 NaN NaN NaN NaN NaN
1 1 NaN NaN NaN NaN NaN
2 NaN 1 NaN NaN NaN NaN
3 NaN NaN NaN NaN 4 NaN
4 NaN NaN 2 NaN NaN NaN
5 NaN NaN NaN NaN NaN 5
6 1 NaN NaN NaN NaN NaN
7 NaN NaN NaN 4 NaN NaN
What I want however is this:
0
0 1
1 1
2 1
3 4
4 2
5 5
6 1
7 4
What am I doing wrong?
You can use map to achieve this:
In [62]:
df1['Type'].map(df2.set_index('Type')['Value'],na_action='ignore')
Out[62]:
0 1
1 1
2 1
3 4
4 2
5 5
6 1
7 4
Name: Type, dtype: int64
If you modified your apply attempt to the following, then it would've worked:
In [70]:
df1['Type'].apply(lambda x: df2.loc[df2.Type == x,'Value'].values[0])
Out[70]:
0 1
1 1
2 1
3 4
4 2
5 5
6 1
7 4
Name: Type, dtype: int64
If we look at what you tried:
df1.apply(lambda x: df2.Value[df2.Type == x['Type']],axis=1)
this tries to compare the 'Type' and return the matching 'Value'. The problem is that the lambda returns a Series indexed by df2's index; pandas aligns those indexes across the rows of the result, which is what produces the matrix. You can see this if we hard-code 'B' as an example:
In [75]:
df2.Value[df2.Type == 'B']
Out[75]:
1 1
Name: Value, dtype: int64
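Another common idiom for this kind of lookup is a left merge (a sketch, assuming the Type values in df2 are unique so the merge cannot duplicate rows):
import pandas as pd

df1 = pd.DataFrame({'Type': ['A', 'A', 'B', 'F', 'C', 'G', 'A', 'E'],
                    'Other': [999.] * 8})
df2 = pd.DataFrame({'Type': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                    'Value': [1, 1, 2, 3, 4, 4, 5]})

# how='left' preserves df1's row order and length; with the default
# RangeIndex the result assigns back into df1 position by position
df1['Value'] = df1.merge(df2, on='Type', how='left')['Value']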
