Compare a matrix against a column vector - python

Array 'A' and vector 'B' below are part of a pandas dataframe.
I have a large array A of form:
28 39 52
77 80 66
7 18 24
9 97 68
I have a vector B of form:
32
5
42
17
How do I compare each column of A against B in a pythonic way? I am trying to get True/False values for the comparison A < B, giving the following result:
TRUE FALSE FALSE
FALSE FALSE FALSE
TRUE TRUE TRUE
TRUE FALSE FALSE
I can do this with a list comprehension, but is there a better way to pull this off? My arrays A and B are very large.

Consider the pd.DataFrame and pd.Series, A and B
A = pd.DataFrame([
[28, 39, 52],
[77, 80, 66],
[7, 18, 24],
[9, 97, 68]
])
B = pd.Series([32, 5, 42, 17])
pandas
By default, when you compare a pd.DataFrame with a pd.Series, pandas aligns each index value from the series with the column names of the dataframe. This is what happens when you use A < B. In this case, you have 4 rows in your dataframe and 4 elements in your series, so I'm going to assume you want to align the index values of the series with the index values of the dataframe. In order to specify the axis you want to align with, you need to use the comparison method rather than the operator. That's because when you use the method, you can use the axis parameter and specify that you want axis=0 rather than the default axis=1.
A.lt(B, axis=0)
0 1 2
0 True False False
1 False False False
2 True True True
3 True False False
I often just write this as A.lt(B, 0)
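To illustrate that the method aligns on labels rather than on positions, here is a minimal sketch. It reuses A and B from above; B_shuffled is a hypothetical reordering of B introduced only for this illustration.

```python
import pandas as pd

A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
B = pd.Series([32, 5, 42, 17])

# Hypothetical variant of B: same label -> value pairs, listed in reverse order.
B_shuffled = pd.Series([17, 42, 5, 32], index=[3, 2, 1, 0])

in_order = A.lt(B, axis=0)
by_label = A.lt(B_shuffled, axis=0)

# With axis=0 the Series index is matched against the DataFrame's row labels,
# so the order in which the Series lists its values should not matter.
print(by_label.loc[A.index].equals(in_order))
```

If the two objects do not share row labels, this alignment is not what you want, and the positional numpy approach is the safer choice.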
numpy
In numpy, you also have to pay attention to the dimensionality of the arrays and you are assuming that the positions are already lined up. The positions will be taken care of if they come from the same dataframe.
print(A.values)
[[28 39 52]
[77 80 66]
[ 7 18 24]
[ 9 97 68]]
print(B.values)
[32 5 42 17]
Notice that B is a 1 dimensional array while A is a 2 dimensional array. In order to compare B along the rows of A we need to reshape B into a 2 dimensional array. The most obvious way to do this is with reshape
print(A.values < B.values.reshape(4, 1))
[[ True False False]
[False False False]
[ True True True]
[ True False False]]
However, you will commonly see others do the same reshaping in one of these ways:
A.values < B.values.reshape(-1, 1)
Or
A.values < B.values[:, None]
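As a rough check that the three reshaping forms are interchangeable, the following sketch works on plain numpy arrays. The names a, b and col_* are placeholders chosen for this example, not part of the original answer.

```python
import numpy as np

a = np.array([[28, 39, 52],
              [77, 80, 66],
              [7, 18, 24],
              [9, 97, 68]])
b = np.array([32, 5, 42, 17])

# Each form turns b from shape (4,) into a column of shape (4, 1).
col_explicit = b.reshape(4, 1)
col_inferred = b.reshape(-1, 1)   # -1 lets numpy work out the row count
col_indexed = b[:, None]          # None is the same object as np.newaxis

# A (4, 3) array compared with a (4, 1) array broadcasts across the columns.
result = a < col_indexed
print(col_explicit.shape, col_inferred.shape, col_indexed.shape)
print(result)
```

The -1 and None forms have the practical advantage that the number of rows does not need to be hard-coded.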
timed back test
To get a handle on how fast these comparisons are, I've constructed the following back test.
from timeit import timeit
def pd_cmp(df, s):
    return df.lt(s, 0)
def np_cmp_a2a(df, s):
    """To get an apples to apples comparison
    I return the same thing in both functions"""
    return pd.DataFrame(
        df.values < s.values[:, None],
        df.index, df.columns
    )
def np_cmp_a2o(df, s):
    """To get an apples to oranges comparison
    I return a numpy array"""
    return df.values < s.values[:, None]
# dtype=float so the collected timings can be plotted as numbers
results = pd.DataFrame(
    index=pd.Index([10, 1000, 100000], name='group size'),
    columns=pd.Index(['pd_cmp', 'np_cmp_a2a', 'np_cmp_a2o'], name='method'),
    dtype=float,
)
for i in results.index:
    # repeat A and B i times to build larger inputs
    df = pd.concat([A] * i, ignore_index=True)
    s = pd.concat([B] * i, ignore_index=True)
    for j in results.columns:
        # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
        results.at[i, j] = timeit(
            '{}(df, s)'.format(j),
            'from __main__ import {}, df, s'.format(j),
            number=100
        )
results.plot()
I can conclude that the numpy-based solutions are faster, but not by much. They all scale the same.

You can do this using lt and calling squeeze on B so it flattens the df to a 1-D Series:
In [107]:
A.lt(B.squeeze(),axis=0)
Out[107]:
0 1 2
0 True False False
1 False False False
2 True True True
3 True False False
The problem is that without squeeze it will try to align on the column labels, which we don't want. We want to broadcast the comparison along the column axis.
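A self-contained sketch of this approach follows. It assumes B is stored as a one-column DataFrame; the column label 'b' is made up for the example.

```python
import pandas as pd

A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
# Assumption: B is a one-column DataFrame (hypothetical column label 'b').
B = pd.DataFrame({'b': [32, 5, 42, 17]})

# squeeze() collapses the single column into a Series with the same row index.
b_series = B.squeeze()

# axis=0 aligns the Series with A's row labels and compares every column to it.
result = A.lt(b_series, axis=0)
print(type(b_series).__name__)
print(result)
```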

The more efficient approach is to drop down to the numpy level (A and B are DataFrames here):
A.values<B.values
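This one-liner relies on B being a one-column DataFrame, whose .values is already two-dimensional with shape (n, 1); a Series would first need reshaping. A small sketch under that assumption, with a hypothetical column label 'b':

```python
import pandas as pd

A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
# Assumption: B is a one-column DataFrame (hypothetical column label 'b').
B = pd.DataFrame({'b': [32, 5, 42, 17]})

# (4, 3) against (4, 1): numpy broadcasts B's single column across A's columns.
print(A.values.shape, B.values.shape)
mask = A.values < B.values

# Optionally wrap the raw boolean array back into a labelled DataFrame.
result = pd.DataFrame(mask, index=A.index, columns=A.columns)
print(result)
```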

Yet another option using numpy is with numpy.newaxis
In [99]: B = B[:, np.newaxis]
In [100]: B
Out[100]:
array([[32],
[ 5],
[42],
[17]])
In [101]: A < B
Out[101]:
array([[ True, False, False],
[False, False, False],
[ True, True, True],
[ True, False, False]], dtype=bool)
Essentially, we're converting the vector B into a 2D array so that numpy can broadcast when comparing two arrays of different shapes.

Related

How to perform boolean AND on matrix using vector in Pandas?

I would like to perform a boolean AND operation on a matrix using a vector. For example, given:
matrix = pd.DataFrame([[True, False], [True, False], [True, False]], columns=["A", "B"])
vector = pd.Series([False, False, False])
The result would be column-wise boolean AND, like:
result = pd.DataFrame([[False, False], [False, False], [False, False]], columns=["A", "B"])
I was able to achieve that using a loop, but I'm wondering - is there a more elegant way to do that?
I would drop down to numpy; that way you can avoid the loop and broadcast over the correct axis. Then reconstruct the DataFrame:
import pandas as pd
pd.DataFrame(matrix.to_numpy() & vector.to_numpy()[:, None],
             columns=matrix.columns,
             index=matrix.index)
A B
0 False False
1 False False
2 False False
Alternatively you can transpose the DataFrame allowing a simple & comparison and then transpose the result back. This might get slow for large DataFrames
(matrix.T & vector).T
Use -
matrix.apply(lambda x: x & vector)
Output
A B
0 False False
1 False False
2 False False
It is implied here that the axis parameter of the apply function is 0 for columnwise apply.
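For a side-by-side check of the three approaches above, here is a small sketch. The vector below is a hypothetical one with mixed values (not the all-False vector from the question), so that the effect of the AND is visible.

```python
import pandas as pd

matrix = pd.DataFrame([[True, False], [True, False], [True, False]],
                      columns=["A", "B"])
# Hypothetical vector with mixed values, chosen only for illustration.
vector = pd.Series([True, False, True])

# 1) numpy broadcasting, then rebuild the labelled DataFrame
via_numpy = pd.DataFrame(
    matrix.to_numpy() & vector.to_numpy()[:, None],
    columns=matrix.columns,
    index=matrix.index,
)

# 2) transpose, AND against the Series, transpose back
via_transpose = (matrix.T & vector).T

# 3) apply the AND column by column (axis=0 is the default)
via_apply = matrix.apply(lambda col: col & vector)

print(via_numpy)
```

All three rely on the vector having one entry per row of the matrix; the pandas variants additionally match on row labels.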

Get a boolean vector from comparing columns using conditions

Given a matrix with column 0 and 1 that needs to be compared:
>>> import pandas as pd
>>> df = pd.DataFrame([[3,4,'a'],[3,5,'b'],[9,2,'a']])
>>> df
0 1 2
0 3 4 a
1 3 5 b
2 9 2 a
The goal is to compare the first and second columns against a certain condition, e.g. to find out whether each value in column df[0] is lower than the corresponding value in df[1]. The desired output would look like this:
[True, True, False]
I've tried to use np.where with a condition, but it seems to return the values that fit the condition instead:
>>> import numpy as np
>>> np.where(df[0] < df[1], df[0], df[1])
array([3, 3, 2])
I could do this, but I'm sure there's a simpler way to use numpy or pandas function to get the boolean vector:
[row[0] < row[1] for idx, row in df.iterrows()]
Is that what you want?
import numpy as np
df = np.array([[3,4,'a'],[3,5,'b'],[9,2,'a']])
df[0, :] == df[1, :]
#output array([ True, False, False], dtype=bool)
df[0, :] < df[1, :]
#output array([False, True, True], dtype=bool)
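Note that the snippet above builds a numpy array of strings and compares its first two rows rather than the first two columns of the DataFrame. As an alternative sketch, assuming the df from the question, the columns can be compared directly:

```python
import pandas as pd

df = pd.DataFrame([[3, 4, 'a'], [3, 5, 'b'], [9, 2, 'a']])

# Comparing two columns is element-wise and returns a boolean Series.
mask = df[0] < df[1]

print(mask.tolist())     # plain Python list: [True, True, False]
print(mask.to_numpy())   # the same values as a numpy boolean array
```

The same condition is what np.where takes as its first argument, so the boolean vector is available without calling np.where at all.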

selecting pandas rows: why isin() and not i in mylist?

I have a dataframe with a multi-index and need to select only the rows where the first index is not in a list. This works:
df = df.iloc[~df.index.get_level_values(0).isin(mylist)]
This doesn't:
df = df.iloc[df.index.get_level_values(0) not in mylist]
I get an error about the truth value of the array.
Why? What does it mean? Is it documented in the official docs?
Say, you have a dataframe df as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(30).reshape((6,5)))
tuples = [(i//2, i%2) for i in range(6)]
df.index = pd.MultiIndex.from_tuples(tuples)
print(df)
0 1 2 3 4
0 0 0.623671 0.335741 0.035219 0.902372 0.349697
1 0.487387 0.325101 0.361753 0.935972 0.425735
1 0 0.147836 0.599608 0.888232 0.712804 0.604688
1 0.156712 0.286682 0.680316 0.104996 0.389507
2 0 0.212923 0.580401 0.02415 0.712987 0.803497
1 0.804538 0.035597 0.611101 0.328159 0.140793
df.index.get_level_values(0) will return an array: Int64Index([0, 0, 1, 1, 2, 2], dtype='int64')
The error says that, with the in operator, it is not clear whether you want to check that all elements of that array are in the list or that any element is. You are comparing the array against the whole list. What you want is an element-wise comparison, and in does not do that; even if the intent were clear, it would return a single value. If you try df.index.get_level_values(0).isin([0,1]), on the other hand, it returns an array of boolean values: array([ True, True, True, True, False, False], dtype=bool). It checks whether the first 0 is in the list, whether the second 0 is in the list, whether the first 1 is in the list, and so on. Those boolean values are then used to slice the dataframe (i.e. show only the rows where the array is True).
In [12]: df.iloc[[ True, True, True, True, False, False]]
Out [12]: 0 1 2 3 4
0 0 0.623671 0.335741 0.035219 0.902372 0.349697
1 0.487387 0.325101 0.361753 0.935972 0.425735
1 0 0.147836 0.599608 0.888232 0.712804 0.604688
1 0.156712 0.286682 0.680316 0.104996 0.389507
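The difference between the two spellings can be reproduced with a small sketch. It uses deterministic data instead of the random values above, and mylist is a hypothetical list of first-level values to exclude.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random data above.
df = pd.DataFrame(np.arange(30).reshape((6, 5)))
df.index = pd.MultiIndex.from_tuples([(i // 2, i % 2) for i in range(6)])
mylist = [0, 1]   # hypothetical values to exclude

level0 = df.index.get_level_values(0)

# `in` asks for a single True/False answer about the whole Index, which is
# ambiguous, so the comparison raises a ValueError.
try:
    level0 not in mylist
    raised = False
except ValueError:
    raised = True

# isin answers element by element, so the result can serve as a row mask.
mask = level0.isin(mylist)
kept = df[~mask]

print(raised)
print(kept)
```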

Pandas GroupBy Index

I have a dataframe with a column that I want to groupby. Within each group, I want to perform a check to see if the first value is less than the second value times some scalar, e.g. (x < y * .5). If it is, the first value is set to True and all other values to False. Otherwise, all values are False.
I have a sample data frame here:
d = pd.DataFrame(np.array([[0, 0, 1, 1, 2, 2, 2],
[3, 4, 5, 6, 7, 8, 9],
[1.25, 10.1, 2.3, 2.4, 1.2, 5.5, 5.7]]).T,
columns=['a', 'b', 'c'])
I can use a stacked groupby to get the data that I want out of a:
g = d.groupby('a')['c'].nsmallest(2).groupby(level='a')
This results in three groups, each with 2 entries. By adding an apply, I can call a function to return a boolean mask:
def func(group):
    if group.iloc[0] < group.iloc[1] * .5:
        return [True, False]
    else:
        return [False, False]
g = d.groupby('a')['c'].nsmallest(2).groupby(level='a').apply(func)
Unfortunately, this destroys the index into the original dataframe and removes the ability to handle cases where more than 2 elements are present.
Two questions:
Is it possible to maintain the index in the original dataframe and update a column with the results of a groupby? This is made slightly different because the .nsmallest call results in a Series on the 'c' column.
Does a more elegant method exist to compute a boolean array for groups in a dataframe based on some custom criteria, e.g. this ratio test?
Looks like transform is what you need:
>>> def func(group):
...     res = [False] * len(group)
...     if group.iloc[0] < group.iloc[1] * .5:
...         res[0] = True
...     return res
>>> d['res'] = d.groupby('a')['c'].transform(func).astype('bool')
>>> d
a b c res
0 0 3 1.25 True
1 0 4 10.10 False
2 1 5 2.30 False
3 1 6 2.40 False
4 2 7 1.20 True
5 2 8 5.50 False
6 2 9 5.70 False
From the documentation:
The transform method returns an object that is indexed the same (same
size) as the one being grouped. Thus, the passed transform function
should return a result that is the same size as the group chunk. For
example, suppose we wished to standardize the data within each group
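As a variation on the answer above, the following sketch combines transform with the nsmallest idea from the question, so it does not depend on the rows of a group being sorted. flag_smallest and its ratio parameter are hypothetical names introduced for this example, and groups with fewer than two rows are simply left all False.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(np.array([[0, 0, 1, 1, 2, 2, 2],
                           [3, 4, 5, 6, 7, 8, 9],
                           [1.25, 10.1, 2.3, 2.4, 1.2, 5.5, 5.7]]).T,
                 columns=['a', 'b', 'c'])


def flag_smallest(group, ratio=0.5):
    """Hypothetical helper: mark the smallest value of the group when it is
    below `ratio` times the second smallest; all other rows stay False."""
    flags = pd.Series(False, index=group.index)
    two = group.nsmallest(2)
    if len(two) == 2 and two.iloc[0] < two.iloc[1] * ratio:
        flags.loc[two.index[0]] = True
    return flags


# transform keeps the original row index, so the result can be assigned back.
d['res'] = d.groupby('a')['c'].transform(flag_smallest).astype(bool)
print(d)
```

Because the helper returns a Series indexed like the group, each flag lands on the row it belongs to, however many rows the group has.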

Greater/less than comparisons between Pandas DataFrames/Series

How can I perform comparisons between DataFrames and Series? I'd like to mask elements in a DataFrame/Series that are greater/less than elements in another DataFrame/Series.
For instance, the following doesn't replace elements greater than the mean with nans although I was expecting it to:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x[x > x.mean(axis=1)] = np.nan
>>> x
a b
0 1 3
1 2 4
If we look at the boolean array created by the comparison, it is really weird:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x > x.mean(axis=1)
a b 0 1
0 False False False False
1 False False False False
I don't understand by what logic the resulting boolean array is like that. I'm able to work around this problem by using transpose:
>>> (x.T > x.mean(axis=1).T).T
a b
0 False True
1 False True
But I believe there is some "correct" way of doing this that I'm not aware of. And at least I'd like to understand what is going on.
The problem here is that it's interpreting the index of the Series as column labels when it performs the comparison. If you use .gt and pass axis=0, you get the result you desire:
In [203]:
x.gt(x.mean(axis=1), axis=0)
Out[203]:
a b
0 False True
1 False True
You can see what I mean when you perform the comparison with the np array:
In [205]:
x > x.mean(axis=1).values
Out[205]:
a b
0 False False
1 False True
Here you can see that, by default, the comparison is aligned along the columns, which gives a different result.
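Putting this together for the masking task in the question, a minimal sketch could look as follows. It uses DataFrame.mask instead of the in-place assignment from the question, which is a substitution made for this example.

```python
import pandas as pd

x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})

# Compare every row against its own mean by aligning the Series of row means
# with the row labels (axis=0) rather than with the column labels.
above_row_mean = x.gt(x.mean(axis=1), axis=0)

# mask() returns a copy with NaN wherever the condition is True.
masked = x.mask(above_row_mean)

print(above_row_mean)
print(masked)
```

Column b becomes a float column here because NaN cannot be stored in an integer column.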
