Given a matrix with column 0 and 1 that needs to be compared:
>>> import pandas as pd
>>> df = pd.DataFrame([[3,4,'a'],[3,5,'b'],[9,2,'a']])
>>> df
   0  1  2
0  3  4  a
1  3  5  b
2  9  2  a
The goal is to compare the first and second columns against a condition, e.g. to find out whether each value in column df[0] is lower than the corresponding value in df[1]. The desired output would look like this:
[True, True, False]
I've tried np.where with a condition, but it returns the values that satisfy the condition rather than booleans:
>>> import numpy as np
>>> np.where(df[0] < df[1], df[0], df[1])
array([3, 3, 2])
I could do this, but I'm sure there's a simpler way to get the boolean vector with a numpy or pandas function:
[row[0] < row[1] for idx, row in df.iterrows()]
Is that what you want?
import numpy as np
arr = np.array([[3, 4], [3, 5], [9, 2]])  # numeric columns 0 and 1 only
arr[:, 0] == arr[:, 1]
#output array([False, False, False])
arr[:, 0] < arr[:, 1]
#output array([ True,  True, False])
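For comparison, here is a minimal sketch of the same check done directly on the DataFrame from the question. Comparing two columns with < is element-wise and yields a boolean Series, which .tolist() turns into a plain list. The name mask is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[3, 4, 'a'], [3, 5, 'b'], [9, 2, 'a']])

# Element-wise comparison of column 0 against column 1 (row by row).
mask = df[0] < df[1]

print(mask.tolist())  # [True, True, False]
```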
Related
I would like to perform a boolean AND operation on a matrix using a vector. For example, given:
matrix = pd.DataFrame([[True, False], [True, False], [True, False]], columns=["A", "B"])
vector = pd.Series([False, False, False])
The result would be column-wise boolean AND, like:
result = pd.DataFrame([[False, False], [False, False], [False, False]], columns=["A", "B"])
I was able to achieve that using a loop, but I'm wondering - is there a more elegant way to do that?
I would drop down to numpy; that way you can avoid the loop and broadcast over the correct axis. Then reconstruct the DataFrame:
import pandas as pd
pd.DataFrame(matrix.to_numpy() & vector.to_numpy()[:, None],
             columns=matrix.columns,
             index=matrix.index)
A B
0 False False
1 False False
2 False False
Alternatively, you can transpose the DataFrame, which allows a simple & operation, and then transpose the result back. This might get slow for large DataFrames.
(matrix.T & vector).T
Use -
matrix.apply(lambda x: x & vector)
Output
A B
0 False False
1 False False
2 False False
It is implied here that the axis parameter of the apply function is 0 for columnwise apply.
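As a rough check that the three approaches above agree, the sketch below runs them on a vector that is not all False, so the effect of the AND is visible. The vector values and the names via_numpy, via_transpose and via_apply are made up for illustration.

```python
import pandas as pd

matrix = pd.DataFrame([[True, False], [True, False], [True, False]],
                      columns=["A", "B"])
# Illustrative vector (not the one from the question) so the result is not all False.
vector = pd.Series([True, False, True])

# 1) NumPy broadcasting: the vector becomes a (3, 1) column and is ANDed with every column.
via_numpy = pd.DataFrame(matrix.to_numpy() & vector.to_numpy()[:, None],
                         columns=matrix.columns, index=matrix.index)

# 2) Transpose so the Series index lines up with the (transposed) column labels.
via_transpose = (matrix.T & vector).T

# 3) Column-wise apply (axis=0 is the default).
via_apply = matrix.apply(lambda col: col & vector)

# Column A follows the vector, column B stays all False.
print(via_numpy)
```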
I have a pandas DataFrame looking like this
data = [["2020-01-01", "2020-01-01"], ["2020-01-02", "2020-01-04"], ["2020-01-05", "2020-01-06"]]
df = pd.DataFrame(data, columns=["START", "END"]).astype({"END": "datetime64[ns]" , "START": "datetime64[ns]"})
       START        END
0 2020-01-01 2020-01-01
1 2020-01-02 2020-01-04
2 2020-01-05 2020-01-06
and a Series/numpy array of datetime64[ns] like this
timestamps = pd.Series(["2020-01-02", "2020-01-03"], dtype="datetime64[ns]")
For every row of df I want to know if there is at least one value in timestamps which lies between START and END (inclusive).
I can do the following
df["START"].apply(lambda x: (timestamps >= x).any()) & df["END"].apply(lambda x: (timestamps <= x).any())
resulting in [False, True, False], but is there a more performant or built-in way without using df.apply?
EDIT:
Actually, my solution using apply was incorrect because if we had
timestamps = pd.Series(["2019-01-01", "2021-01-01"], dtype="datetime64[ns]")
the output would be [True, True, True], which is obviously wrong. However, the accepted answer does produce a correct result.
We can broadcast the values in the START and END columns against timestamps to create a 2-D boolean mask, then reduce that mask along axis=1:
t = timestamps.values
((df['START'].values[:, None] <= t) & (df['END'].values[:, None] >= t)).any(axis=1)
array([False, True, False])
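The broadcasting approach can be sketched as a small function and run against both sets of timestamps from the question, including the ones from the EDIT that broke the apply-based attempt. The function name any_timestamp_in_range and the variable names inside and outside are hypothetical.

```python
import pandas as pd

data = [["2020-01-01", "2020-01-01"],
        ["2020-01-02", "2020-01-04"],
        ["2020-01-05", "2020-01-06"]]
df = pd.DataFrame(data, columns=["START", "END"]).astype(
    {"START": "datetime64[ns]", "END": "datetime64[ns]"})


def any_timestamp_in_range(frame, stamps):
    """For each row, is at least one stamp within [START, END]?"""
    t = stamps.to_numpy()                       # shape (m,)
    start = frame["START"].to_numpy()[:, None]  # shape (n, 1)
    end = frame["END"].to_numpy()[:, None]      # shape (n, 1)
    # Broadcasting yields an (n, m) mask; reduce over the timestamp axis.
    return ((start <= t) & (end >= t)).any(axis=1)


inside = pd.Series(["2020-01-02", "2020-01-03"], dtype="datetime64[ns]")
outside = pd.Series(["2019-01-01", "2021-01-01"], dtype="datetime64[ns]")

print(any_timestamp_in_range(df, inside))   # [False  True False]
print(any_timestamp_in_range(df, outside))  # [False False False]
```

Note that the intermediate mask has one cell per (row, timestamp) pair, so memory use grows with the product of the two lengths.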
If it is acceptable to represent timestamps by a single interval spanning its minimum and maximum values, use IntervalIndex.overlaps:
s = pd.IntervalIndex.from_arrays(df['START'],
                                 df['END'],
                                 closed='both')
i = pd.Interval(timestamps.min(), timestamps.max(), closed='both')
out = s.overlaps(i)
print(out)
[False True False]
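One caveat worth checking before relying on the interval approach: it tests overlap with the whole range between the smallest and largest timestamp, not with the individual timestamps. The sketch below uses the timestamps from the question's EDIT, which lie entirely outside every START/END range; the names starts and ends are illustrative.

```python
import pandas as pd

starts = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])
ends = pd.to_datetime(["2020-01-01", "2020-01-04", "2020-01-06"])
s = pd.IntervalIndex.from_arrays(starts, ends, closed="both")

# Timestamps from the EDIT: none of them falls inside any row's range.
timestamps = pd.Series(["2019-01-01", "2021-01-01"], dtype="datetime64[ns]")

# The min/max interval spans 2019 to 2021 and therefore overlaps every row.
i = pd.Interval(timestamps.min(), timestamps.max(), closed="both")
out = s.overlaps(i)

print(out)  # [ True  True  True]
```

So this variant only matches the per-timestamp check when the timestamps have no gap large enough to skip over a whole START/END range.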
I have a data frame with a bool type column. I would like to obtain the number of True values per id using pandas' groupby and agg functions. I've done this a bunch of times, but it seems the resulting column's type depends on the data frame. Here is an example:
import pandas as pd
d = {'id': [1, 1, 2, 3], 'bool': [True, False, False, True]}
df = pd.DataFrame(data=d)
print(df.groupby(['id']).agg({'bool': 'sum'}))
The output I get from this code is:
id bool
0 1 True
1 2 False
2 3 True
Which is not what I want. Now, if agg tries to sum two True values:
import pandas as pd
d = {'id': [1, 1, 2, 3], 'bool': [True, True, False, True]}
df = pd.DataFrame(data=d)
print(df.groupby(['id']).agg({'bool': 'sum'}))
Then I get:
id bool
0 1 2.00
1 2 0.00
2 3 1.00
Which is what I want.
I've seen situations in which a few rows are of type bool, whereas others are of type float. It seems to be related to the number of rows grouped: if there is only one row, the result shows the bool value; if there is more than one, the resulting type is float. I would like the resulting aggregated columns to always be of type float.
Pandas version is 1.0.1
You can sum and keep the output as a float in all cases with the following:
import pandas as pd
d = {'id': [1, 1, 2, 3], 'bool': [True, False, False, True]}
df = pd.DataFrame(data=d)
print(df.groupby(['id'])['bool'].sum().astype(float))
Yields the output
id
1 1.0
2 0.0
3 1.0
Name: bool, dtype: float64
You can just use the max function
df.groupby(['id']).agg({'bool': 'max'})
You can cast the result to float. Use reset_index if you want the group key back as a regular column at the end:
df.groupby(['id']).agg({'bool': 'sum'}).astype(float).reset_index()
Example:
>>> import pandas as pd
>>> d = {'id': [1, 1, 2, 3], 'bool': [True, True, False, True]}
>>> df = pd.DataFrame(data=d)
>>>
>>> df.groupby(['id']).agg({'bool': 'sum'}).astype(float).reset_index()
id bool
0 1 2.0
1 2 0.0
2 3 1.0
>>>
There is a dedicated NumPy function, np.count_nonzero, that counts non-zero cells (True is counted as 1, False as 0). So you can run:
import numpy as np
df.groupby(['id'])['bool'].agg(lambda gr: np.count_nonzero(gr))
I assume that you want an integer count of True values. Otherwise append .astype(float).
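Another way to make the result type independent of group size, sketched here as an assumption rather than taken from the answers above, is to convert the booleans to integers before grouping and cast the final counts to float. The name counts is illustrative.

```python
import pandas as pd

d = {'id': [1, 1, 2, 3], 'bool': [True, False, False, True]}
df = pd.DataFrame(data=d)

# Cast to int first so the sum is always numeric, then to float as requested.
counts = df['bool'].astype(int).groupby(df['id']).sum().astype(float)

print(counts.tolist())  # [1.0, 0.0, 1.0]
```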
Arrays 'A' and vector 'B' below are part of pandas dataframe.
I have a large array A of form:
28 39 52
77 80 66
7 18 24
9 97 68
I have a vector B of form:
32
5
42
17
How do I compare each column of A against B in a Pythonic way? I am trying to get True/False values for the comparison A < B, with the following result:
TRUE FALSE FALSE
FALSE FALSE FALSE
TRUE TRUE TRUE
TRUE FALSE FALSE
I can do this with a list comprehension, but is there a better way to pull this off? My arrays A and B are very large.
Consider the pd.DataFrame and pd.Series, A and B
A = pd.DataFrame([
[28, 39, 52],
[77, 80, 66],
[7, 18, 24],
[9, 97, 68]
])
B = pd.Series([32, 5, 42, 17])
pandas
By default, when you compare a pd.DataFrame with a pd.Series, pandas aligns each index value from the series with the column names of the dataframe. This is what happens when you use A < B. In this case, you have 4 rows in your dataframe and 4 elements in your series, so I'm going to assume you want to align the index values of the series with the index values of the dataframe. In order to specify the axis you want to align with, you need to use the comparison method rather than the operator. That's because when you use the method, you can use the axis parameter and specify that you want axis=0 rather than the default axis=1.
A.lt(B, axis=0)
0 1 2
0 True False False
1 False False False
2 True True True
3 True False False
I often just write this as A.lt(B, 0)
numpy
In numpy, you also have to pay attention to the dimensionality of the arrays and you are assuming that the positions are already lined up. The positions will be taken care of if they come from the same dataframe.
print(A.values)
[[28 39 52]
[77 80 66]
[ 7 18 24]
[ 9 97 68]]
print(B.values)
[32 5 42 17]
Notice that B is a 1 dimensional array while A is a 2 dimensional array. In order to compare B along the rows of A we need to reshape B into a 2 dimensional array. The most obvious way to do this is with reshape
print(A.values < B.values.reshape(4, 1))
[[ True False False]
[False False False]
[ True True True]
[ True False False]]
However, these are the forms you will more commonly see others use for the same reshaping:
A.values < B.values.reshape(-1, 1)
Or
A.values < B.values[:, None]
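To tie the pandas and numpy variants together, here is a small sketch that checks they produce the same mask for the example data. The names by_label and by_position are illustrative.

```python
import pandas as pd

A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68],
])
B = pd.Series([32, 5, 42, 17])

# pandas: align B with the row index of A (axis=0) and compare.
by_label = A.lt(B, axis=0)

# numpy: turn B into a (4, 1) column so it broadcasts across A's 3 columns.
by_position = A.to_numpy() < B.to_numpy()[:, None]

# Same mask either way, because the row labels here are already in positional order.
print((by_label.to_numpy() == by_position).all())  # True
```

The pandas version keeps the index and column labels and aligns on them; the numpy version relies purely on position.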
timed back test
To get a handle of how fast these comparisons are, I've constructed the following back test.
from timeit import timeit

def pd_cmp(df, s):
    return df.lt(s, 0)

def np_cmp_a2a(df, s):
    """To get an apples to apples comparison
    I return the same thing in both functions"""
    return pd.DataFrame(
        df.values < s.values[:, None],
        df.index, df.columns
    )

def np_cmp_a2o(df, s):
    """To get an apples to oranges comparison
    I return a numpy array"""
    return df.values < s.values[:, None]

results = pd.DataFrame(
    index=pd.Index([10, 1000, 100000], name='group size'),
    columns=pd.Index(['pd_cmp', 'np_cmp_a2a', 'np_cmp_a2o'], name='method'),
)

for i in results.index:
    df = pd.concat([A] * i, ignore_index=True)
    s = pd.concat([B] * i, ignore_index=True)
    for j in results.columns:
        # DataFrame.set_value was removed in pandas 1.0; .at is its replacement
        results.at[i, j] = timeit(
            '{}(df, s)'.format(j),
            'from __main__ import {}, df, s'.format(j),
            number=100
        )

# timings are stored as objects, so cast to float before plotting
results.astype(float).plot()
I can conclude that the numpy based solutions are faster but not all that much. They all scale the same.
You can do this using lt and calling squeeze on B so it flattens the df to a 1-D Series:
In [107]:
A.lt(B.squeeze(),axis=0)
Out[107]:
0 1 2
0 True False False
1 False False False
2 True True True
3 True False False
The problem is that without squeeze it will try to align on the column labels, which we don't want. We want to broadcast the comparison along the column axis.
The most efficient approach is to drop down to the numpy level (A and B are DataFrames here):
A.values < B.values
Yet another option using numpy is numpy.newaxis (A and B are numpy arrays here):
In [99]: B = B[:, np.newaxis]
In [100]: B
Out[100]:
array([[32],
[ 5],
[42],
[17]])
In [101]: A < B
Out[101]:
array([[ True, False, False],
[False, False, False],
[ True, True, True],
[ True, False, False]], dtype=bool)
Essentially, we're converting the vector B into a 2D array so that numpy can broadcast when comparing two arrays of different shapes.
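The shape change is the whole trick, so a short sketch that prints the shapes may help. It assumes A and B are plain NumPy arrays with the example values; B_col is an illustrative name.

```python
import numpy as np

A = np.array([[28, 39, 52],
              [77, 80, 66],
              [7, 18, 24],
              [9, 97, 68]])
B = np.array([32, 5, 42, 17])

# Adding a new axis turns the 1-D vector into a single column.
B_col = B[:, np.newaxis]
print(B.shape, B_col.shape)  # (4,) (4, 1)

# (4, 3) against (4, 1): the column is stretched across A's three columns.
result = A < B_col
print(result.shape)  # (4, 3)
```

Comparing A < B directly with the 1-D B would fail here, because shapes (4, 3) and (4,) cannot be broadcast together.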
I have a dataframe with a multi-index and need to select only the rows where the first index level is not in a list. This works:
df = df.iloc[~(df.index.get_level_values(0).isin(mylist))]
This doesn't:
df = df.iloc[(df.index.get_level_values(0) not in mylist)]
I get an error about the truth value of the array.
Why? What does it mean? Is it documented in the official docs?
Say, you have a dataframe df as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(30).reshape((6,5)))
tuples = [(i//2, i%2) for i in range(6)]
df.index = pd.MultiIndex.from_tuples(tuples)
print(df)
            0         1         2         3         4
0 0  0.623671  0.335741  0.035219  0.902372  0.349697
  1  0.487387  0.325101  0.361753  0.935972  0.425735
1 0  0.147836  0.599608  0.888232  0.712804  0.604688
  1  0.156712  0.286682  0.680316  0.104996  0.389507
2 0  0.212923  0.580401   0.02415  0.712987  0.803497
  1  0.804538  0.035597  0.611101  0.328159  0.140793
df.index.get_level_values(0) will return an index object that behaves like an array: Int64Index([0, 0, 1, 1, 2, 2], dtype='int64')
The error says that with the in operator it is not clear whether you want to check that all elements of that array are in the list, or that any element is in the list. You are comparing the whole array against the list. What you want is an element-wise comparison, and in does not do that. Even if it were unambiguous, it would return a single value. df.index.get_level_values(0).isin([0,1]), on the other hand, returns an array of boolean values: array([ True, True, True, True, False, False], dtype=bool). It checks whether the first 0 is in the list, whether the second 0 is in the list, whether 1 is in the list, and so on. Those boolean values are then used to slice the dataframe (i.e. show only the rows where the array is True).
In [12]: df.iloc[[ True, True, True, True, False, False]]
Out [12]:
            0         1         2         3         4
0 0  0.623671  0.335741  0.035219  0.902372  0.349697
  1  0.487387  0.325101  0.361753  0.935972  0.425735
1 0  0.147836  0.599608  0.888232  0.712804  0.604688
  1  0.156712  0.286682  0.680316  0.104996  0.389507
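The difference between the two spellings can be reproduced in a few lines. This is a sketch with made-up integer data in place of the random values above; mylist, level0, raised and selected are illustrative names.

```python
import numpy as np
import pandas as pd

tuples = [(i // 2, i % 2) for i in range(6)]
df = pd.DataFrame(np.arange(30).reshape((6, 5)),
                  index=pd.MultiIndex.from_tuples(tuples))
mylist = [0, 1]
level0 = df.index.get_level_values(0)

# `not in` compares each list item with the whole index, which yields a boolean
# array; Python then asks for a single truth value and pandas/NumPy refuse.
try:
    result = level0 not in mylist
    raised = False
except ValueError as err:
    raised = True
    print(err)

# isin is element-wise, so it produces one boolean per row; ~ negates it.
mask = ~level0.isin(mylist)
selected = df.loc[mask]
print(selected.index.get_level_values(0).tolist())  # [2, 2]
```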