VLOOKUP in Python

I have two dataframes:
df1:
Index a b c d e
1 1 X 10 12 A
2 1 Y 11 13 B
3 1 Z 12 14 C
4 1 W 13 15 C
5 1 A 14 49 D
df2:
Index b f
1 X YES
2 Y YES
3 Z YES
4 W YES
I would like to VLOOKUP the values in column 'b' and bring column 'f' into df1.
I tried running the following code, but it does not work:
new_df = df1.merge(df2, on='b', how='left')
My output should look like as follows:
Index a b c d e f
1 1 X 10 12 A YES
2 1 Y 11 13 B YES
3 1 Z 12 14 C YES
4 1 W 13 15 C YES
5 1 A 14 49 D NaN
Note that df1 has 3400 rows, while df2 has only 30.

You can also use a list comprehension:
import numpy as np

vlookup = ['Yes' if b in set(df2['b']) else np.nan for b in df1['b']]
df1['vlookup'] = vlookup
Here is the output:
a b c d e vlookup
0 1 X 10 12 A Yes
1 1 Y 11 13 B Yes
2 1 Z 12 14 C Yes
3 1 W 13 15 C Yes
4 1 A 14 49 D NaN

You can use map with a pd.Series built from the df2 dataframe:
df1['f'] = df1['b'].map(df2.set_index('b')['f'])
df1
Output:
a b c d e f
Index
1 1 X 10 12 A YES
2 1 Y 11 13 B YES
3 1 Z 12 14 C YES
4 1 W 13 15 C YES
5 1 A 14 49 D NaN
First create a pd.Series with df2.set_index('b')['f'], then map the values in df1['b'] to create the column df1['f'].
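For the record, the merge from the question does produce the expected result for this data; here is a minimal end-to-end sketch (values copied from the question):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 1, 1, 1],
                    'b': list('XYZWA'),
                    'c': [10, 11, 12, 13, 14],
                    'd': [12, 13, 14, 15, 49],
                    'e': list('ABCCD')})
df2 = pd.DataFrame({'b': list('XYZW'), 'f': ['YES'] * 4})

# a left join keeps every row of df1; keys missing from df2 get NaN in 'f'
new_df = df1.merge(df2, on='b', how='left')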

Related

Compare two dataframes on multiple columns

I have two dataframes with the same columns. I want to compare them and, for each pair of rows that differ, find the column on which they have different values.
My dataframes are as follows (column A is the unique key both dataframes share):
df1
A B C D E
0 V 10 5 18 20
1 W 9 18 11 13
2 X 8 7 12 5
3 Y 7 9 7 8
4 Z 6 5 3 90
df2
A B C D E
0 V 30 5 18 20
1 W 9 18 11 9
2 X 8 7 12 5
3 Y 36 9 7 8
4 Z 6 5 3 90
expected result:
df3
A key
0 V B
1 W E
3 Y B
What I've tried so far is:
df3 = df1.merge(df2, on=['A', 'B', 'C', 'D', 'E'], how='outer', indicator=True)
df3 = df3[df3._merge != 'both']  # keep only the rows where a difference was spotted
This is what I get for df3
A B C D E _merge
0 V 10 5 18 20 left_only
1 W 9 18 11 13 left_only
3 Y 7 9 7 8 left_only
5 V 30 5 18 20 right_only
6 W 9 18 11 9 right_only
8 Y 36 9 7 8 right_only
How can I achieve the expected result?
In your case you can set the index first, then use eq:
s = df1.set_index('A').eq(df2.set_index('A'))
s.mask(s).stack().reset_index()
Out[442]:
A level_1 0
0 V B False
1 W E False
2 Y B False
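To get the exact df3 layout asked for, you can rename the stacked level and keep only the key columns; a small sketch building on the answer above:
out = s.mask(s).stack().reset_index()
df3 = out.rename(columns={'level_1': 'key'})[['A', 'key']]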
You can find the differences between the two frames and use idxmax with axis=1 to get the differing column:
diff = df1.set_index("A") - df2.set_index("A")
result = diff[diff.ne(0)].abs().idxmax(axis=1).dropna()
>>> result
A
V B
W E
Y B
dtype: object
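If you are on pandas 1.1 or later, DataFrame.compare surfaces the differing cells directly; a sketch under that assumption:
# compare() keeps only rows/columns with differences; the outer column
# level ('B', 'E' here) names the differing column, with 'self'/'other'
# holding the two values
diff = df1.set_index('A').compare(df2.set_index('A'))
print(diff)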

map values in a dataframe according to ranges

I have a dataframe df
import pandas
df = pandas.DataFrame(data=[1,2,3,2,2,2,3,3,4,5,10,11,12,1,2,1,1], columns=['codes'])
codes
0 1
1 2
2 3
3 2
4 2
5 2
6 3
7 3
8 4
9 5
10 10
11 11
12 12
13 1
14 2
15 1
16 1
and I would like to group the values in the column codes
according to a specific logic:
values == 0 become A
values in the range (1, 4) become B
values == 5 become C
values in the range (6, 16) become D
Is there a way to keep the logic and the dataframe separate, so that it is easy to change the grouping rules in the future?
I would like to avoid writing
df.loc[df['codes'] == 0, 'codes'] = 'A'
df.loc[(df['codes'] >= 1) & (df['codes'] <= 4), 'codes'] = 'B'
The first idea is to use Series.map with merged dictionaries; the second is to use cut with right=False:
df = pd.DataFrame(data=[0,1,2,3,2,2,2,3,3,4,5,10,11,12,16,2,17,1], columns=['codes'])
d1 = {0: 'A', 5:'C'}
d2 = dict.fromkeys(range(1,5), 'B')
d3 = dict.fromkeys(range(6,17), 'D')
d = {**d1, **d2, **d3}
df['codes1'] = df['codes'].map(d)
df['codes2'] = pd.cut(df['codes'], bins=(0,1,5,6,17), labels=list('ABCD'), right=False)
print (df)
codes codes1 codes2
0 0 A A
1 1 B B
2 2 B B
3 3 B B
4 2 B B
5 2 B B
6 2 B B
7 3 B B
8 3 B B
9 4 B B
10 5 C C
11 10 D D
12 11 D D
13 12 D D
14 16 D D
15 2 B B
16 17 NaN NaN
17 1 B B
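To keep the rules separate from the data, you could hold the edges and labels in one editable structure and rebuild the mapping from it; a sketch (the names BINS, LABELS and apply_rules are illustrative):
import pandas as pd

# all grouping rules live here; edit this, not the dataframe code
BINS = [0, 1, 5, 6, 17]   # left-closed edges, matching right=False above
LABELS = list('ABCD')

def apply_rules(s, bins=BINS, labels=LABELS):
    # map a numeric Series onto labels using left-closed bins
    return pd.cut(s, bins=bins, labels=labels, right=False)

df['codes2'] = apply_rules(df['codes'])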

How to multiply a dataframe in an effective way

I have a dataframe like the one below:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': list('abcde')})
I want to duplicate the dataframe by the length of its contents. Basically, I want each value in col1 to be paired with the entire contents of col2.
Input:
col1 col2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
O/P:
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
For this I have used:
op = []
for val in df.col1.values:
    temp = df.copy()
    temp['col1'] = val
    op.append(temp)
print(pd.concat(op, ignore_index=True))
I want to get the exact output in a better way (without the loop).
With unstack:
pd.DataFrame(index=df.col2, columns=df.col1).unstack().reset_index().drop(columns=0)
Try itertools.product to do that:
import pandas as pd
from itertools import product
df=pd.DataFrame({'col1':[1,2,3,4,5],'col2':list('abcde')})
res = pd.DataFrame((product(df['col1'],df['col2'])),columns=['col1','col2'])
print(res)
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
I hope this solves your problem.
Use a cross join and keep only the necessary columns:
df = pd.merge(df.assign(a=1), df.assign(a=1), on='a')[['col1_x','col2_y']]
df = df.rename(columns = lambda x: x.split('_')[0])
print (df)
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
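On pandas 1.2 or later, merge supports how='cross' directly, which avoids the dummy key; a sketch:
# cross merge needs no join key and yields len(df) ** 2 rows
res = df[['col1']].merge(df[['col2']], how='cross')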
So, what you want is the cartesian product. I would do it like this:
from itertools import product
pd.DataFrame(product(df.col1.values, df.col2.values), columns=["col1", "col2"])
#output
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
You need to input the names of the columns again, though.
Well, essentially anything that gives you the cartesian product would do. For example:
pd.MultiIndex.from_product([df['col1'],df['col2']]).to_frame(index=False, name=['Col1','Col2'])
There you go:
import pandas as pd
df=pd.DataFrame({'col1':[1,2,3,4,5],'col2':list('abcde')})
# repeat col1 values
df_col1 = df['col1'].repeat(df.shape[0]).reset_index(drop=True)
# tile col2 values
df_col2 = pd.concat([df['col2']] * df.shape[0], ignore_index=True)
# concat the results
result = pd.concat([df_col1, df_col2], axis=1, sort=False)
Output:
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
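The same replication can also be done with numpy's repeat and tile, which avoids building intermediate frames; a sketch:
import numpy as np
import pandas as pd

n = len(df)
# repeat: 1,1,1,1,1,2,2,...  tile: a,b,c,d,e,a,b,...
res = pd.DataFrame({'col1': np.repeat(df['col1'].to_numpy(), n),
                    'col2': np.tile(df['col2'].to_numpy(), n)})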

Compare multiple columns of a pandas dataframe with one column

I have a dataframe df:
A B C D E
0 V 10 5 18 20
1 W 9 18 11 13
2 X 8 7 12 5
3 Y 7 9 7 8
4 Z 6 5 3 90
I want to add a column 'Result' which should return 1 if the value in column 'E' is greater than the values in B, C & D columns else return 0.
Output should be:
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
For a few columns I would use Excel-style logic like if(and(E>B, E>C, E>D), 1, 0),
but I have to compare around 20 columns (from B to U) with the column named 'V'. Additionally, the dataframe has around 100 thousand rows.
I am using
df['Result'] = np.where((df.ix[:, 1:20] < df['V']).all(1), 1, 0)
and it gives a MemoryError.
One possible solution is to compare in numpy and then convert the boolean mask to ints:
df['Result'] = (df.iloc[:, 1:4].values < df[['E']].values).all(axis=1).astype(int)
print (df)
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
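For the full B-to-U range in the real data, the same comparison can be written with a label slice and DataFrame.lt, which aligns the Series row by row; a sketch on the example columns (swap 'B':'D' and 'E' for 'B':'U' and 'V'):
# lt(..., axis=0) compares each row of the slice against that row's E value
df['Result'] = df.loc[:, 'B':'D'].lt(df['E'], axis=0).all(axis=1).astype(int)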

Groupby returning the full row for the max

How to get the full row of data for a groupby result?
df
a b c d e
0 a 25 12 1 20
1 a 15 1 1 1
2 b 12 1 1 1
3 n 25 2 3 3
In [4]: df = pd.read_clipboard()
In [5]: df.groupby('a')['b'].max()
Out[5]:
a
a 25
b 12
n 25
Name: b, dtype: int64
How to get the full row?
a b c d e
a 25 12 1 20
b 12 1 1 1
n 25 2 3 3
I tried filtering with df[df.e == df.groupby('a')['b'].max()], but the sizes are different :(
Original data (a single 15-column row, wrapped by the pandas display):
0 1 2 3 4 5 6 7 8 9
EVE00101 Trial DRY RUN PASS 1610071 1610071 Y 20140808 NaN 29
10 11 12 13 14
FF1 ./ff1.sh Event Validation Hive Tables 2015-11-30 9:40:34
df.groupby([1, 7])[14].max() gives me the result, but as a grouped series with 1 and 7 as the index; I want the corresponding full rows. The real data has 15,000 rows; only one sample row is shown above.
You can use idxmax() (older pandas called this Series.argmax()):
In [287]: df.groupby('a', as_index=False).apply(lambda x: x.loc[x.b.idxmax()])
Out[287]:
a b c d e
0 a 25 12 1 20
1 b 12 1 1 1
2 n 25 2 3 3
This works regardless of which row within each group holds the maximum of b.
I'd overwrite the 'b' column using transform and then drop the duplicate 'a' row using drop_duplicates:
In [331]:
df['b'] = df.groupby('a')['b'].transform('max')
df
Out[331]:
a b c d e
0 a 25 12 1 20
1 a 25 1 1 1
2 b 12 1 1 1
3 n 25 2 3 3
In [332]:
df.drop_duplicates('a')
Out[332]:
a b c d e
0 a 25 12 1 20
2 b 12 1 1 1
3 n 25 2 3 3
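A common shorthand for the same result is to index the frame with the per-group idxmax; a sketch (assumes the max of b occurs at a single row per group):
# idxmax returns the index label of each group's max; loc pulls the full rows
df.loc[df.groupby('a')['b'].idxmax()]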
