compare multiple columns of pandas dataframe with one column - python

I have a dataframe df:
A B C D E
0 V 10 5 18 20
1 W 9 18 11 13
2 X 8 7 12 5
3 Y 7 9 7 8
4 Z 6 5 3 90
I want to add a column 'Result' which should be 1 if the value in column 'E' is greater than the values in columns B, C and D, and 0 otherwise.
Output should be:
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
For a few columns, I would use logic like if(and(E>B, E>C, E>D), 1, 0), but I have to compare around 20 columns (from B to U) with column 'V'. Additionally, the dataframe has around 100,000 rows.
I am using
df['Result'] = np.where((df.ix[:, 1:20] < df['V']).all(1), 1, 0)
and it gives a MemoryError.

One possible solution is to do the comparison in NumPy and then convert the boolean mask to integers:
df['Result'] = (df.iloc[:, 1:4].values < df[['E']].values).all(axis=1).astype(int)
print (df)
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
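For the real frame (columns B through U compared against column V), a sketch of the same idea using label-based slicing; the column names here are taken from the question, so adjust them to your data:
# .lt(df['V'], axis=0) compares each column to the Series row by row.
# A plain `df.loc[:, 'B':'U'] < df['V']` would align the Series index
# against the column labels instead, producing a huge all-NaN frame,
# which is the likely source of the MemoryError above.
df['Result'] = df.loc[:, 'B':'U'].lt(df['V'], axis=0).all(axis=1).astype(int)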

Related

Compare two dataframes on multiple columns

I have two dataframes; they both have the same columns. I want to compare them and, for each pair of rows that differs, find on which column they have different values.
My dataframes are as follows (column A is the unique key both dataframes share):
df1
A B C D E
0 V 10 5 18 20
1 W 9 18 11 13
2 X 8 7 12 5
3 Y 7 9 7 8
4 Z 6 5 3 90
df2
A B C D E
0 V 30 5 18 20
1 W 9 18 11 9
2 X 8 7 12 5
3 Y 36 9 7 8
4 Z 6 5 3 90
expected result:
df3
A key
0 V B
1 W E
3 Y B
What I've tried so far is:
df3 = df1.merge(df2, on=['A', 'B', 'C', 'D', 'E'], how='outer', indicator=True)
df3 = df3[df3._merge != 'both']  # retrieve only the rows where a difference was spotted
This is what I get for df3
A B C D E _merge
0 V 10 5 18 20 left_only
1 W 9 18 11 13 left_only
3 Y 7 9 7 8 left_only
5 V 30 5 18 20 right_only
6 W 9 18 11 9 right_only
8 Y 36 9 7 8 right_only
How can I achieve the expected result?
In your case you can set the index first, then use eq:
s = df1.set_index('A').eq(df2.set_index('A'))
s.mask(s).stack().reset_index()
Out[442]:
A level_1 0
0 V B False
1 W E False
2 Y B False
You can find the differences between the two frames and use idxmax with axis=1 to get the differing column:
diff = df1.set_index("A") - df2.set_index("A")
result = diff[diff.ne(0)].abs().idxmax(1).dropna()
>>> result
A
V B
W E
Y B
dtype: object
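To shape the stack-based output into the expected df3 (columns A and key), one possible tidy-up, assuming the frames from the question:
s = df1.set_index('A').eq(df2.set_index('A'))
# mask(s) turns the matching (True) cells into NaN, so stack() keeps only
# the mismatching cells; the second index level holds the column name.
df3 = s.mask(s).stack().reset_index()[['A', 'level_1']]
df3.columns = ['A', 'key']
Note that the idxmax variant reports at most one differing column per row, while the stack-based version lists every mismatching column.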

Find index of first row whose value matches a condition set by another row

I have a dataframe which consists of two columns:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
I would like to add a column whose value is the index of the first value to meet the following condition: y >= x. For example, for row 2 (x = 3), the first y value greater than or equal to 3 is 4, so the output of z for row 2 is its index, 1. I expect the final table to look like:
x y z
0 1 2 0
1 2 4 0
2 3 6 1
3 4 8 1
4 5 10 2
5 6 12 2
6 7 14 3
7 8 16 3
8 9 18 4
9 10 20 4
It should be noted that both x and y are sorted if that should make the solution easier.
I have seen a similar answer but I could not translate it to my situation.
You want np.searchsorted, which assumes df['y'] is sorted:
df['z'] = np.searchsorted(df['y'], df['x'])
Output:
x y z
0 1 2 0
1 2 4 0
2 3 6 1
3 4 8 1
4 5 10 2
5 6 12 2
6 7 14 3
7 8 16 3
8 9 18 4
9 10 20 4
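As a sanity check, here is a self-contained sketch (the frame is rebuilt from the question) comparing searchsorted against a naive loop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(1, 11), 'y': range(2, 21, 2)})

# side='left' (the default) returns, for each x, the first index i with
# df['y'][i] >= x, which is exactly the condition asked for.
df['z'] = np.searchsorted(df['y'], df['x'])

# Naive equivalent, for verification only.
naive = [next(i for i, y in enumerate(df['y']) if y >= x) for x in df['x']]
assert (df['z'] == naive).all()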

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
Or you can also use pd.merge (not a very clean solution, though):
In [1297]: df['tmp'] = 1
In [1298]: df1['tmp'] = 1
In [1309]: pd.merge(df, df1, on=['tmp'], left_index=True, right_index=True).drop('tmp', axis=1)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
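A third option, assuming the two frames share the same index, is pd.concat along the columns, which also keeps each frame's original column order (frames rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'F': [2, 5], 'Y': [3, 6]})
df1 = pd.DataFrame({'B': [7, 10], 'C': [8, 11], 'T': [9, 12]})

# Columns of df come first, then columns of df1, with no alphabetical
# re-sorting of the labels.
new_df = pd.concat([df, df1], axis=1)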

pandas dataframe create unique ids from column having elements frequency greater than 1

I have following dataframe:
line# key amino0 pos0 amino1 pos1 amino2 pos2
0 14 A 13 M 2 K 14
1 12 A 13 M 2 A 1
2 1 A 1 M 2 P 3
3 2 P 3 P 4 B 6
4 1 A 1 M 2 P 35
5 12 A 31 A 32 M 41
6 1 M 24 P 23 A 22
7 12 A 31 A 32 M 42
8 4 J 5 P 4 B 6
9 3 B 6 I 7 P 4
10 8 B 6 H 10 I 7
I want to update column 'key' with the occurrence number of each key, so keys which have frequency > 1 get distinct suffixes. My output should look like this:
line# key amino0 pos0 amino1 pos1 amino2 pos2
0 14_1 A 13 M 2 K 14
1 12_1 A 13 M 2 A 1
2 1_1 A 1 M 2 P 3
3 2_1 P 3 P 4 B 6
4 1_2 A 1 M 2 P 35
5 12_2 A 31 A 32 M 41
6 1_3 M 24 P 23 A 22
7 12_3 A 31 A 32 M 42
8 4_1 J 5 P 4 B 6
9 3_1 B 6 I 7 P 4
10 8_1 B 6 H 10 I 7
For each element in the 'key' column, the first portion is the key and the second portion is its occurrence number. For example, key 12 has frequency 3, so the three occurrences of key 12 will be updated to 12_1, 12_2, 12_3.
The following code only gives me the keys with frequency > 1:
df = pd.read_csv("myfile.txt", sep='\t', names=['key', 'amino0', 'pos0', 'amino1', 'pos1', 'amino2', 'pos2'])
vc = df.key.value_counts()
print(vc[vc > 1].index[0])
How do I update the keys? Avoiding a loop is preferable.
If the key column's dtype is string, use radd:
df['key'] += df.groupby('key').cumcount().add(1).astype(str).radd('_')
#alternative
#df['key'] += '_' + df.groupby('key').cumcount().add(1).astype(str)
If it is integer, converting to string first is necessary:
df['key'] = df['key'].astype(str) + '_' + df.groupby('key').cumcount().add(1).astype(str)
print (df)
line# key amino0 pos0 amino1 pos1 amino2 pos2
0 0 14_1 A 13 M 2 K 14
1 1 12_1 A 13 M 2 A 1
2 2 1_1 A 1 M 2 P 3
3 3 2_1 P 3 P 4 B 6
4 4 1_2 A 1 M 2 P 35
5 5 12_2 A 31 A 32 M 41
6 6 1_3 M 24 P 23 A 22
7 7 12_3 A 31 A 32 M 42
8 8 4_1 J 5 P 4 B 6
9 9 3_1 B 6 I 7 P 4
10 10 8_1 B 6 H 10 I 7
Details:
First use GroupBy.cumcount to build a counter per group defined by column key:
print (df.groupby('key').cumcount())
0 0
1 0
2 0
3 0
4 1
5 1
6 2
7 2
8 0
9 0
10 0
dtype: int64
Then add 1 so the counter starts at 1:
print (df.groupby('key').cumcount().add(1))
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 1
9 1
10 1
dtype: int64
To convert to strings use astype; here the object dtype means string:
print (df.groupby('key').cumcount().add(1).astype(str))
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 1
9 1
10 1
dtype: object
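If the suffix should be added only to keys whose frequency is greater than 1 (as the question title suggests), a possible variation on the same cumcount idea; the where-based masking is an assumption on top of the answer above, not part of it:
counts = df.groupby('key')['key'].transform('size')
occ = df.groupby('key').cumcount().add(1).astype(str)
# Keep singleton keys unchanged; suffix only the repeated ones.
df['key'] = df['key'].astype(str).where(counts == 1,
                                        df['key'].astype(str) + '_' + occ)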

sort dataframe by position in group then by that group

consider the dataframe df
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such that if I grouped by column 'A' I'd pull the first position from each group, then cycle back and get the second position from each group if any remain, and so on.
I'd expect the results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount to count positions within groups first, then sort_values, and reindex by the Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
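One caveat: sort_values defaults to quicksort, which is not guaranteed stable, so to be certain that ties keep their original row order (and hence that the groups cycle in order of first appearance) it is safer to request a stable sort explicitly:
# mergesort is stable, so rows with equal counter values keep their
# original relative order.
cum = df.groupby('A')['B'].cumcount().sort_values(kind='mergesort')
print(df.reindex(cum.index))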
Here's a NumPy approach -
import numpy as np

def approach1(g, v):
    # Inputs : 1D arrays of the groupby and value columns
    id_arr2 = np.ones(v.size, dtype=int)
    # Positions where a new group starts (first row of each group after the first)
    sf = np.flatnonzero(g[1:] != g[:-1]) + 1
    # Reset the running counter to 1 at each group boundary
    id_arr2[sf[0]] = -sf[0] + 1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:] + 1
    # Cumsum yields the within-group position; a stable argsort orders by it
    return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or using df.reindex from @jezrael's post:
df.reindex(approach1(df.A.values, df.B.values))
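Note that df.reindex works here only because the index is the default RangeIndex, so positional indices and labels coincide; with an arbitrary index, df.iloc (as in the sample run above) is the safer choice.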
