Search and replace matching values across data frames in Python

I have the following Pandas Dataframes:
df1:
C D E F G
111 222 333 444 555
666 777
df2:
A B
111 3
222 4
333 3
444 3
555 4
100 3
666 4
200 3
777 3
I need to look up each value of df1 in df2.A and then replace that value in df1 with the paired value in df2.B.
So the required output would be:
C D E F G
3 4 3 3 4
4 3
I tried a left merge and thought about reshaping the values across, but there must be a simpler / cleaner direct search-and-replace method. Any help much appreciated.

First create a series mapping:
s = df2.set_index('A')['B']
Then apply this to each value:
df1 = df1.applymap(s.get)
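Note that applymap was deprecated in pandas 2.1 in favour of the element-wise DataFrame.map, so on recent versions the same lookup reads:
df1 = df1.map(s.get)   # pandas >= 2.1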

Try this:
temp = df2.set_index('A')['B']
print(df1.replace(temp))
Output:
C D E F G
0 3 4 3.0 3.0 4.0
1 4 3 NaN NaN NaN
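For reference, a self-contained sketch of the replace approach, with both frames built from the question's data:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'C': [111, 666], 'D': [222, 777],
                    'E': [333, np.nan], 'F': [444, np.nan],
                    'G': [555, np.nan]})
df2 = pd.DataFrame({'A': [111, 222, 333, 444, 555, 100, 666, 200, 777],
                    'B': [3, 4, 3, 3, 4, 3, 4, 3, 3]})

# map A -> B and substitute every matching value in df1
mapping = df2.set_index('A')['B']
print(df1.replace(mapping))
One practical difference between the two answers: replace leaves values that have no match in df2.A untouched, while applymap(s.get) turns them into None.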

Related

Adding and multiplying values of a dataframe in Python

I have a dataset with multiple columns and rows. The rows are supposed to be summed based on the unique value in a column. I tried .groupby, but I want to retain the whole dataset rather than just the summed columns keyed on one unique column. I further need to multiply these individual columns (values) by another column.
For example:
id A B C D E
11 2 1 2 4 100
11 2 2 1 1 100
12 1 3 2 2 200
13 3 1 1 4 190
14 NaN 1 2 2 300
I would like to sum columns B, C & D based on the unique id and then multiply the result by columns A and E in a new column F. I do not want to sum the values of columns A & E.
I would like the resultant dataframe to be something like the following, which also deals with NaN by skipping it and moving on with the calculation:
id A B C D E F
11 2 3 3 5 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300 1200
If the above is unachievable, then I would like something like this, where the rows stay the same but the calculation is as stated above, based on the same id:
id A B C D E F
11 2 3 3 5 100 9000
11 2 2 1 1 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300 1200
My logic earlier was to apply groupby on columns B, C, D and then multiply, but that is not working out for me. If the above dataframes are unachievable, then please let me know how I can perform this calculation and then merge/join the results with the original file with just the E column.
You must first sum columns B, C and D vertically for each common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B':'sum', 'C': 'sum', 'D': 'sum',
'E': 'first'})
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
A B C D E F
id
11 2.0 3 3 5 100 9000
12 1.0 3 2 2 200 2400
13 3.0 1 1 4 190 2280
14 NaN 1 2 2 300 1200
Beware: id is the index here - use reset_index if you want it to be a normal column.
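If the row-preserving second output from the question is wanted instead, here is a sketch using groupby(...).transform('sum') to broadcast the per-id sums back onto every row (the frame construction mirrors the question's data):
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [11, 11, 12, 13, 14],
                   'A': [2, 2, 1, 3, np.nan],
                   'B': [1, 2, 3, 1, 1],
                   'C': [2, 1, 2, 1, 2],
                   'D': [4, 1, 2, 4, 2],
                   'E': [100, 100, 200, 190, 300]})

# broadcast the per-id sums of B, C, D back onto every row
summed = df.groupby('id')[['B', 'C', 'D']].transform('sum')

# multiply across, treating the missing A as 1 so it does not poison the product
df['F'] = (df['A'].fillna(1) * summed['B'] * summed['C']
           * summed['D'] * df['E']).astype('int64')
print(df)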

Pandas - For Each Index, Put All Columns Into Rows [duplicate]

This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 3 years ago.
I'm trying to avoid looping, but the title sort of explains the issue.
import pandas as pd
# build the frame from both rows at once
df = pd.DataFrame([{'Index': 333, 1: 'A', 2: 'C', 3: 'F', 4: 'B', 5: 'D'},
                   {'Index': 234, 1: 'B', 2: 'D', 3: 'C', 4: 'A', 5: 'Z'}])
df.set_index('Index', inplace=True)
print(df)
1 2 3 4 5
Index
333 A C F B D
234 B D C A Z
I want to preserve the index, and for each column turn it into a row with the corresponding value like this:
newcol value
Index
333 1 A
333 2 C
333 3 F
333 4 B
333 5 D
234 1 B
234 2 D
234 3 C
234 4 A
234 5 Z
It's somewhat of a transpose issue, but not exactly like that. Any ideas?
You need:
df.stack().reset_index(1, name='value').rename(columns={'level_1':'newcol'})
# OR df.reset_index().melt('Index',var_name='new_col',value_name='Value').set_index('Index')
#(cc: #anky_91)
Output:
newcol value
Index
333 1 A
333 2 C
333 3 F
333 4 B
333 5 D
234 1 B
234 2 D
234 3 C
234 4 A
234 5 Z
Another solution using to_frame and rename_axis:
df.stack().to_frame('value').rename_axis(index=['','newcol']).reset_index(1)
newcol value
333 1 A
333 2 C
333 3 F
333 4 B
333 5 D
234 1 B
234 2 D
234 3 C
234 4 A
234 5 Z
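For completeness, a runnable version of the melt variant from the comment above, with var_name adjusted to match the expected 'newcol' header:
import pandas as pd

df = pd.DataFrame([{'Index': 333, 1: 'A', 2: 'C', 3: 'F', 4: 'B', 5: 'D'},
                   {'Index': 234, 1: 'B', 2: 'D', 3: 'C', 4: 'A', 5: 'Z'}]).set_index('Index')

# melt needs 'Index' as a regular column, so reset it and restore it afterwards;
# a stable descending sort groups the rows per index as in the expected output
out = (df.reset_index()
         .melt('Index', var_name='newcol', value_name='value')
         .set_index('Index')
         .sort_index(ascending=False, kind='stable'))
print(out)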

Calculations between different rows

I am trying to run a loop over a pandas dataframe that takes two arguments from different rows. I tried to use .iloc and the shift function but did not manage to get the result I need.
Here's a simple example to explain better what i want to do:
dataframe1:
a b c
0 101 1 aaa
1 211 2 dcd
2 351 3 yyy
3 401 5 lol
4 631 6 zzz
For the above df I want to make a new column ('d') that gets the diff between the values in column 'a', but only if the diff between the values in column 'b' is equal to 1; if not, the value should be null, like the following dataframe2:
a b c d
0 101 1 aaa nan
1 211 2 dcd 110
2 351 3 yyy 140
3 401 5 lol nan
4 631 6 zzz 230
Is there a built-in function that can handle this kind of calculation?
Try it like this, using loc and diff():
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
>>> df
a b c d
0 101 1 aaa NaN
1 211 2 dcd 110.0
2 351 3 yyy 140.0
3 401 5 lol NaN
4 631 6 zzz 230.0
Alternatively, you can create a group key:
df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
Out[361]:
0 NaN
1 110.0
2 140.0
3 NaN
4 230.0
Name: a, dtype: float64
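Equivalently, the condition and the diff combine into a one-liner with Series.where; a self-contained sketch using the question's data:
import pandas as pd

df = pd.DataFrame({'a': [101, 211, 351, 401, 631],
                   'b': [1, 2, 3, 5, 6],
                   'c': ['aaa', 'dcd', 'yyy', 'lol', 'zzz']})

# keep the row-to-row diff of 'a' only where 'b' increased by exactly 1
df['d'] = df['a'].diff().where(df['b'].diff().eq(1))
print(df)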

How to drop duplicates from a subset of rows in a pandas dataframe?

I have a dataframe like this:
A B C
12 true 1
12 true 1
3 nan 2
3 nan 3
I would like to drop all rows where the value of column A is duplicate but only if the value of column B is 'true'.
The resulting dataframe I have in mind is:
A B C
12 true 1
3 nan 2
3 nan 3
I tried using: df.loc[df['B']=='true'].drop_duplicates('A', inplace=True, keep='first') but it doesn't seem to work.
Thanks for your help!
You can use pd.concat to split the df by B:
df = pd.concat([df.loc[df.B != True],
                df.loc[df.B == True].drop_duplicates(['A'], keep='first')]).sort_index()
df
Out[1593]:
A B C
0 12 True 1
2 3 NaN 2
3 3 NaN 3
df[df.B.ne(True) | ~df.A.duplicated()]
A B C
0 12 True 1
2 3 NaN 2
3 3 NaN 3
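As for why the original attempt fails: drop_duplicates(..., inplace=True) modifies the temporary slice returned by df.loc[df['B']=='true'], not df itself, so the change is lost. Here is a self-contained sketch of the mask approach from the second answer, assuming boolean B and NaN as shown in the outputs above:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [12, 12, 3, 3],
                   'B': [True, True, np.nan, np.nan],
                   'C': [1, 1, 2, 3]})

# keep a row if B is not True, or if it is the first occurrence of its A value
df = df[df['B'].ne(True) | ~df['A'].duplicated()]
print(df)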

Pandas - remove row similar to other row

I need to remove all rows from a pandas.DataFrame which satisfy an unusual condition.
If there is an exactly identical row, except that it has a NaN value in column "C", I want to remove that row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has NaN in column C but there is an otherwise identical row (the second) with a real value in column C.
However, the 3rd row must stay, because there are no rows with the same A, B and D values as it has.
How do you perform this using pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort the DataFrame by column C. This pushes the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
Then remove duplicates, taking into account all columns except C and keeping the first row encountered:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
Note that this is valid only if the NaNs are in column C.
You can try filling the NaNs before drop_duplicates and then indexing back into the original frame, so the kept rows retain their original values:
keep = df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep='last').index
df.loc[keep]
This also handles the scenario where the A, B and D values are the same but C has non-NaN values in both rows.
You get:
A B C D
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
This feels right to me:
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8
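A runnable version of that last approach, reusing the construction from the first answer (uppercase column names to match):
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C', 'D'],
                  data=[[1, 2, None, 3], [1, 2, 50, 3],
                        [10, 20, None, 30], [5, 6, 7, 8]])

# mark rows duplicated on every column except C (keep=False flags all copies)
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
# mark rows whose C holds a real value
notnans = df['C'].notnull()

# a row survives if it is unique on A/B/D or has a filled C
print(df[notdups | notnans])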
