Check values in two different columns and replace non-identical values

Check values in two different columns and replace non-identical values - python

I have this df
A
B
111
4
111
4
112
0
112
2
113
3
113
3
114
nan
114
1
I want to replace nan and 0 values with other values from col B for the corresponding item from col A as follows:
A
B
111
4
111
4
112
2
112
2
113
3
113
3
114
1
114
1
I tried this but this not returning the correct values
df['B'].fillna(0)
df=df.merge(df[B > 0].groupby('$LINK:NO').size().reset_index(name='B'), on='A')

Replace values less or equal 0 to missing values in Series.where, so possible get first non missing values per groups by GroupBy.transform with GroupBy.first:
df['B'] = (df.assign(new = df['B'].where(df['B'].gt(0)))
.groupby('A')['new']
.transform('first'))
print (df)
A B
0 111 4.0
1 111 4.0
2 112 2.0
3 112 2.0
4 113 3.0
5 113 3.0
6 114 1.0
7 114 1.0
Another idea is sorting use max:
df['B'] = df.sort_values('B').groupby('A').transform('max')
print (df)
A B
0 111 4.0
1 111 4.0
2 112 2.0
3 112 2.0
4 113 3.0
5 113 3.0
6 114 1.0
7 114 1.0

Related

How to merge DataFrames based on on column while adding another

I have the following mock DataFrames:
df1:
ID FILLER1 FILLER2 QUANTITY
01 123 132 12
02 123 132 5
03 123 132 10
df2:
ID FILLER1 FILLER2 QUANTITY
01 123 132 +1
02 123 132 -1
which would result in the 'Quantity' of DF1 will result in 13, 4 and 10.
Thx in advance for any help provided!

Question is not super clear but if I get what you're trying to do here is a way:
# A left join and filling 0 instead of NaN for that third row
In [19]: merged = df1.merge(df2, on=['ID', 'FILLER1', 'FILLER2'], how='left').fillna(0)
In [20]: merged
Out[20]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y
0 1 123 132 12 1.0
1 2 123 132 5 -1.0
2 3 123 132 10 0.0
# Adding new quantity column
In [21]: merged['QUANTITY'] = merged['QUANTITY_x'] + merged['QUANTITY_y']
In [22]: merged
Out[22]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y QUANTITY
0 1 123 132 12 1.0 13.0
1 2 123 132 5 -1.0 4.0
2 3 123 132 10 0.0 10.0
# Removing _x and _y columns
In [23]: merged = merged[['ID', 'FILLER1', 'FILLER2', 'QUANTITY']]
In [24]: merged
Out[24]:
ID FILLER1 FILLER2 QUANTITY
0 1 123 132 13.0
1 2 123 132 4.0
2 3 123 132 10.0

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

I have two pandas dataframe df1 and df2. Where i need to find df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step where am having issues is to map the df3 to df1 and conduct a conditional check to see if df1['amount'] is greater then df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1

Solution should be simplify with remove .reset_index() for df3 and pass Series to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
Alternative with casting boolean mask to integer for True, False to 1,0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1

You forget add quote for sum_column
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)

Calculations between different rows

I try to run loop over a pandas dataframe that takes two arguments from different rows. I tried to use .iloc and shift functions but did not manage to get the result i need.
Here's a simple example to explain better what i want to do:
dataframe1:
a b c
0 101 1 aaa
1 211 2 dcd
2 351 3 yyy
3 401 5 lol
4 631 6 zzz
for the above df I want to make new column ('d') that gets the diff between the values in column 'a' only if the diff between the values in column 'b' is equal to 1, if not the value should be null. like the following dataframe2:
a b c d
0 101 1 aaa nan
1 211 2 dcd 110
2 351 3 yyy 140
3 401 5 lol nan
4 631 6 zzz 230
Is there any designed function that can handle this kind of calculations?

Try like this, using loc and diff():
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
>>> df
a b c d
0 101 1 aaa NaN
1 211 2 dcd 110.0
2 351 3 yyy 140.0
3 401 5 lol NaN
4 631 6 zzz 230.0

You can create a group key
df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
Out[361]:
0 NaN
1 110.0
2 140.0
3 NaN
4 230.0
Name: a, dtype: float64

Python dataframe: pivot on same column

I have two columns "ID" and "division" as shown below.
df = pd.DataFrame(np.array([['111', 'AAA'],['222','AAA'],['333','BBB'],['444','CCC'],['444','AAA'],['222','BBB'],['111','BBB']]),columns=['ID','division'])
ID division
0 111 AAA
1 222 AAA
2 333 BBB
3 444 CCC
4 444 AAA
5 222 BBB
6 111 BBB
The expected output is as shown below where I need to pivot on the same column but the count is dependent on "division". This should be presented in a heatmap.
df = pd.DataFrame(np.array([['0','2','1','1'],['2','0','1','1'],['1','1','0','0'],['1','1','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 2 1 1
222 2 0 1 1
333 1 1 0 0
444 1 1 0 0
So, technically I am doing an overlap between ID's with respect to division.
Example:
The highlighted box in red where the overlap between 111 and 222 ID's is 2(AAA and BBB). where as the overlap between 111 and 444 is 1 (AAA highlighted in the black box).
I could do this in excel in 2 steps.Not sure if below one helps.
Step1:=SUM(COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,$G2),COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,H$1))-1
Step2:=IF($G12=H$1,0,SUMIFS(H$2:H$8,$G$2:$G$8,$G12))
But is there any way that we can do it in Python using dataframes.
Appreciate your help
Case-2
if df = pd.DataFrame(np.array([['111', 'AAA','4'],['222','AAA','5'],['333','BBB','6'],
['444','CCC','3'],['444','AAA','2'], ['222','BBB','2'],
['111','BBB','7']]),columns=['ID','division','count'])
ID division count
0 111 AAA 4
1 222 AAA 5
2 333 BBB 6
3 444 CCC 3
4 444 AAA 2
5 222 BBB 2
6 111 BBB 7
Expected output would be
df_result = pd.DataFrame(np.array([['0','18','13','6'],['18','0','8','7'],['13','8','0','0'],['6','7','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 18 13 6
222 18 0 8 7
333 13 8 0 0
444 6 7 0 0
Calculation: Here there is an overlap between 111 and 222 with respect to divisions AAA and BBB hence the sum would be 4+5+2+7=18

Another way to do this is to use a self join with merge and pd.crosstab:
df_out = df.merge(df, on='division')
results = pd.crosstab(df_out.ID_x, df_out.ID_y)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 2.0 1.0 1.0
222 2.0 0.0 1.0 1.0
333 1.0 1.0 0.0 0.0
444 1.0 1.0 0.0 0.0
Case 2
df = pd.DataFrame(np.array([['111', 'AAA','4'],['222','AAA','5'],['333','BBB','6'],
['444','CCC','3'],['444','AAA','2'], ['222','BBB','2'],
['111','BBB','7']]),columns=['ID','division','count'])
df['count'] = df['count'].astype(int)
df_out = df.merge(df, on='division')
df_out = df_out.assign(count = df_out.count_x + df_out.count_y)
results = pd.crosstab(df_out.ID_x, df_out.ID_y, df_out['count'], aggfunc='sum').fillna(0)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 18.0 13.0 6.0
222 18.0 0.0 8.0 7.0
333 13.0 8.0 0.0 0.0
444 6.0 7.0 0.0 0.0

Problems with combining columns from dataframes in pandas

I have two dataframes that I'm trying to merge.
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 NaN NaN
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 NaN NaN
5 313 3 NaN NaN
...
df2
code scale R1 R2...
0 121 2 30 20
3 313 2 15 10
...
I need, based on the equality of the columns code and scale copy the value from df2 to df1.
The result should look like this:
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 30 20
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 15 10
5 313 3 NaN NaN
...
The problem is that there can be a lot of columns like R1 and R2 and I can not check each one separately, so I wanted to use something from this instruction, but nothing gives me the desired result. I'm doing something wrong, but I can't understand what. I really need advice.

What do you want to happen if the two dataframes both have values for R1/R2? If you want keep df1, you could do
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
To keep df2 just do the fillna the other way round. To combine in some other way please clarify the question!

Try this ?
pd.concat([df,df1],axis=0).sort_values(['code','scale']).drop_duplicates(['code','scale'],keep='last')
Out[21]:
code scale R1 R2
0 121 1 80.0 110.0
0 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
3 313 2 15.0 10.0
5 313 3 NaN NaN

This is a good situation for combine_first. It replaces the nulls in the calling dataframe from the passed dataframe.
df1.set_index(['code', 'scale']).combine_first(df2.set_index(['code', 'scale'])).reset_index()
code scale R1 R2
0 121 1 80.0 110.0
1 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
4 313 2 15.0 10.0
5 313 3 NaN NaN
Other solutions
with fillna
df.set_index(['code', 'scale']).fillna(df1.set_index(['code', 'scale'])).reset_index()
with add - a bit faster
df.set_index(['code', 'scale']).add(df1.set_index(['code', 'scale']), fill_value=0)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Check values in two different columns and replace non-identical values - python

Related

How to merge DataFrames based on on column while adding another

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

Calculations between different rows

Python dataframe: pivot on same column

Problems with combining columns from dataframes in pandas

Categories

Resources