applying multiple conditions in pandas dataframe - python

I want to fill a column with True or False, depending on whether a condition is met. I know how to use the any() method, but here I need to compare the values of two columns. I tried and have not succeeded: using & gives a TypeError.
My data looks something like:
A B condition_met
1 2
3 3
5 9
7 2
The expected output is something like:
A B condition_met
1 2 true
3 3 true
5 9 true
7 2 false
I want condition_met to be True if A > 3 and B > 4.
What I tried was
df.loc[df['A'] > 3 & 'B' > 4, 'condition_met'] = 'True'
Update: I need to check whether the condition is met as an implication, i.e., if A > 3 then B > 4.
If A <= 3 then it must still be True, since the condition doesn't apply.

Run:
df['condition_met'] = (df.A > 3) & (df.B > 4)
Another possible approach is to use the logical_and function from NumPy (assuming import numpy as np):
df['condition_met'] = np.logical_and(df.A.gt(3), df.B.gt(4))

You need to wrap each condition in parentheses when assigning the mask, because of operator precedence (& binds more tightly than >):
df['condition_met'] = (df.A>3) & (df.B>4)
Or:
df['condition_met'] = df.A.gt(3) & df.B.gt(4)
Your solution assigns the string 'True' where the condition matches and leaves NaN elsewhere:
df.loc[(df['A'] > 3) & (df['B'] > 4), 'condition_met'] = 'True'
EDIT: For the updated requirement, where the condition only applies when A > 3:
df['condition_met'] = ((df.A > 3) & (df.B > 4)) | (df.A <= 3)
print (df)
A B condition_met
0 1 2 True
1 3 3 True
2 5 9 True
3 7 2 False
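An equivalent way to write the implication "if A > 3 then B > 4" uses the logical identity (P implies Q) == (~P | Q); a minimal runnable sketch:
import pandas as pd
df = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 3, 9, 2]})
# rows where A <= 3 pass automatically; otherwise B must exceed 4
df['condition_met'] = ~(df['A'] > 3) | (df['B'] > 4)
print(df)  # True, True, True, False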

Related

Equate a column vector and a row vector with Pandas

Supposing I have two Pandas series:
s1 = pandas.Series([1,2,3])
s2 = pandas.Series([3,1,2])
Is there a good way to equate them in a column x row-style? i.e. I want a DataFrame output that is the result of doing
1 == 3, 2 == 3, 3 == 3
1 == 1, 2 == 1, 3 == 1
1 == 2, 2 == 2, 3 == 2
With the expected output of
False False True
True False False
False True False
I understand that I could expand the two series out into dataframes in their own right and then equate those data frames, but then my peak memory usage will double. I could also loop through one series and equate each individual value to the other series, and then stack those output series together into a DataFrame, and I'll do that if I have to. But it feels like there should be a way to do this.
You can take advantage of broadcasting (convert to NumPy arrays first; multi-dimensional indexing on a Series itself is no longer supported):
res = s1.to_numpy()[:, None] == s2.to_numpy()[None, :]
You can do it using numpy.outer (note this relies on division and exact float equality, so it fails if s2 contains zeros):
pd.DataFrame(np.outer(s1, 1 / s2) == 1, s1, s2)
s2 3 1 2
s1
1 False True False
2 False False True
3 True False False
It's also easy to do with apply, though this loops in Python and is slower than broadcasting:
out = s1.apply(lambda x : s2==x)
Out[31]:
0 1 2
0 False True False
1 False False True
2 True False False
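A self-contained sketch of the broadcasting approach, wrapping the result back into a labelled DataFrame:
import pandas as pd
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([3, 1, 2])
# result[i, j] = (s1[i] == s2[j]); no intermediate expanded DataFrames
res = pd.DataFrame(s1.to_numpy()[:, None] == s2.to_numpy()[None, :],
                   index=s1, columns=s2)
print(res)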

What will be the mean of a conditional output

Let's take a condition such as:
(df['a'] > 10) & (df['a'] < 20)
This condition gives a True/False output.
What will be the mean of this conditional output?
i.e np.mean((df['a'] > 10) & (df['a'] < 20)) = ?
np.mean applied directly to the boolean mask gives the fraction of True values, not the mean of the matching values. To get the mean of the values that are > 10 and < 20, index the DataFrame with the mask (square brackets):
np.mean(df[(df['a'] > 10) & (df['a'] < 20)])
The boolean values behave like 1 and 0, so the mean is the fraction of rows matching both conditions:
df = pd.DataFrame({'a':[9,13,23,16,23]})
m = (df['a'] > 10) & (df['a'] < 20)
print (m)
0 False
1 True
2 False
3 True
4 False
Name: a, dtype: bool
There are 2 matching values out of 5, so the fraction is 2/5 = 0.4:
print (m.mean())
0.4
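A short sketch contrasting the two interpretations on the same data:
import pandas as pd
df = pd.DataFrame({'a': [9, 13, 23, 16, 23]})
m = (df['a'] > 10) & (df['a'] < 20)
# fraction of rows where the condition holds (True counts as 1)
print(m.mean())               # 0.4
# mean of the values that satisfy the condition: (13 + 16) / 2
print(df.loc[m, 'a'].mean())  # 14.5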

Checking when more than one dataframe column has specific values in Python

I have data that is in the form that looks like:
Shop Date Produced Lost Output Signal
Cornerstop 01-01-2010 0 1 9 1
Cornerstop 01-01-2010 11 1 11 0
Cornerstop 01-01-2010 0 0 0 2
Cornerstop 01-01-2010 1 0 0 2
Cornerstop 01-01-2010 5 7 0 2
...
The data SHOULD have values for 'Lost' and 'Output' that are 0 when 'Produced' is 0, but that's not the case. I need a way to find the rows where Produced is 0 but any of Lost, Output, or Signal is not 0.
To count how many times this happens, I made a counter:
counter = 0
for index, row in data.iterrows():
    if row['Produced'] and row['Lost'] != 0:
        counter += 1
    else:
        continue
I'd like to see exactly which rows in the dataframe these are (it's a large set), and iterating row by row is hardly efficient.
Is there a better way I can do this?
Try boolean indexing (note this version keeps rows where all three columns are nonzero; for "any of them nonzero", see the .any approaches below):
data[(data['Produced'] == 0) & (data['Lost'] != 0) &
     (data['Output'] != 0) & (data['Signal'] != 0)]
You can use Boolean indexing and pd.DataFrame.all. For readability, you can store masks in variables:
m1 = data['Produced'] == 0
m2 = (data[['Lost', 'Output', 'Signal']] != 0).all(axis=1)
res = data[m1 & m2]
My approach would be boolean indexing with one mask for the == 0 part (Produced) and one for the != 0 part, combined via loc and any. Note the parentheses around the == 0 comparison; without them, & binds before == and the expression misbehaves:
df[(df.Produced == 0) & (df.loc[:, ['Lost', 'Output', 'Signal']] != 0).any(axis=1)]
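A runnable sketch on data shaped like the sample above (column values are illustrative), flagging the rows that violate the rule:
import pandas as pd
data = pd.DataFrame({'Shop': ['Cornerstop'] * 5,
                     'Produced': [0, 11, 0, 1, 5],
                     'Lost': [1, 1, 0, 0, 7],
                     'Output': [9, 11, 0, 0, 0],
                     'Signal': [1, 0, 2, 2, 2]})
# rows where Produced is 0 but any of Lost/Output/Signal is nonzero
bad = data[(data['Produced'] == 0) &
           (data[['Lost', 'Output', 'Signal']] != 0).any(axis=1)]
print(bad)  # rows 0 and 2 in this sample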

Pandas: Filtering multiple conditions

I'm trying to do boolean indexing with a couple conditions using Pandas. My original DataFrame is called df. If I perform the below, I get the expected result:
temp = df[df["bin"] == 3]
temp = temp[(~temp["Def"])]
temp = temp[temp["days since"] > 7]
temp.head()
However, if I do this (which I think should be equivalent), I get no rows back:
temp2 = df[df["bin"] == 3]
temp2 = temp2[~temp2["Def"] & temp2["days since"] > 7]
temp2.head()
Any idea what accounts for the difference?
Use () because of operator precedence: & binds more tightly than comparison operators, so each condition needs its own parentheses:
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
Alternatively, create the conditions on separate lines:
cond1 = df["bin"] == 3
cond2 = df["days since"] > 7
cond3 = ~df["Def"]
temp2 = df[cond1 & cond2 & cond3]
Sample:
df = pd.DataFrame({'Def': [True] * 2 + [False] * 4,
                   'days since': [7, 8, 9, 14, 2, 13],
                   'bin': [1, 3, 5, 3, 3, 3]})
print (df)
Def bin days since
0 True 1 7
1 True 3 8
2 False 5 9
3 False 3 14
4 False 3 2
5 False 3 13
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
Def bin days since
3 False 3 14
5 False 3 13
OR:
df_train[(df_train["fold"] == 1) | (df_train["fold"] == 2)]
AND (this particular condition can never match, since fold cannot equal both 1 and 2 at once; it only illustrates the syntax):
df_train[(df_train["fold"] == 1) & (df_train["fold"] == 2)]
Alternatively, you can use the query method. Use and/not rather than &, because under query's Python-style precedence not Def & ... parses as not (Def & ...); note the backticks around the column name containing a space:
df.query('not Def and `days since` > 7 and bin == 3')
If you want multiple conditions mixing & and | (& binds more tightly than |, so each State/route pair groups first):
Del_Det_5k_top_10 = Del_Det[(Del_Det['State'] == 'NSW') & (Del_Det['route'] == 2) |
                            (Del_Det['State'] == 'VIC') & (Del_Det['route'] == 3)]
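As a quick check, the mask form and the query form select the same rows (3 and 5) from the sample frame built above; a small sketch:
mask = ~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)
# both approaches should agree on the sample frame
assert df[mask].equals(df.query('not Def and `days since` > 7 and bin == 3'))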

Get first row of dataframe in Python Pandas based on criteria

Let's say that I have a dataframe like this one
import pandas as pd
df = pd.DataFrame([[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]], columns=['A', 'B', 'C'])
>>> df
A B C
0 1 2 1
1 1 3 2
2 4 6 3
3 4 3 4
4 5 4 5
The original table is more complicated with more columns and rows.
I want to get the first row that fulfil some criteria. Examples:
Get first row where A > 3 (returns row 2)
Get first row where A > 4 AND B > 3 (returns row 4)
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
But if no row fulfils the specific criteria, then I want the first row after sorting descending by A (or in other cases by B, C, etc.):
Get first row where A > 6 (returns row 4: sort by A descending and take the first one)
I was able to do it by iterating over the dataframe (I know that's crap :P), so I'd prefer a more pythonic way to solve it.
This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:
>>> df[condition]
This will return a slice of your dataframe which you can index using iloc. Here are your examples:
Get first row where A > 3 (returns row 2)
>>> df[df.A > 3].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
If what you actually want is the row number, rather than using iloc, it would be df[df.A > 3].index[0].
Get first row where A > 4 AND B > 3:
>>> df[(df.A > 4) & (df.B > 3)].iloc[0]
A 5
B 4
C 5
Name: 4, dtype: int64
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
>>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
Now, with your last case we can write a function that handles the default case of returning the descending-sorted frame:
>>> def series_or_default(X, condition, default_col, ascending=False):
... sliced = X[condition]
... if sliced.shape[0] == 0:
... return X.sort_values(default_col, ascending=ascending).iloc[0]
... return sliced.iloc[0]
>>>
>>> series_or_default(df, df.A > 6, 'A')
A 5
B 4
C 5
Name: 4, dtype: int64
As expected, it returns row 4.
When a match is sure to exist, use query and take the head:
df.query('A > 3').head(1)
Out[33]:
A B C
2 4 6 3
df.query('A > 4 and B > 3').head(1)
Out[34]:
A B C
4 5 4 5
df.query('A > 3 and (B > 3 or C > 2)').head(1)
Out[35]:
A B C
2 4 6 3
You can take care of the first 3 items with boolean slicing and head:
df[df.A > 3].head(1)
df[(df.A > 4) & (df.B > 3)].head(1)
df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].head(1)
The case where nothing comes back can be handled with an if (or a try around an assert):
output = df[df.A > 6].head(1)
if output.empty:
    output = df.sort_values('A', ascending=False).head(1)
To "return the value as soon as you find the first row/record that meets the requirements and not iterate over the other rows", the following code works:
def pd_iter_func(df):
    for row in df.itertuples():
        # define your criteria here
        if row.A > 4 and row.B > 3:
            return row
It can be more efficient than boolean indexing on a large dataframe when a match occurs early, because it returns as soon as the first matching row is found instead of evaluating the condition on every row.
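For the sample df above, the first record satisfying A > 4 and B > 3 is row 4:
first = pd_iter_func(df)
print(first)  # Pandas(Index=4, A=5, B=4, C=5)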
To make the function above more general, pass the criteria as a callable:
from typing import Callable, NamedTuple, Optional
from pandas import DataFrame

def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
    for row in df.itertuples():
        if criteria(row):
            return row

pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)
As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax would also be a nice choice.
def pd_idxmax_func(df, mask):
    return df.loc[mask.idxmax()]

pd_idxmax_func(df, (df.A > 4) & (df.B > 3))
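One caveat with idxmax: on an all-False mask it returns the label of the first row rather than signalling "no match". A small sketch (helper name is illustrative) that guards against this and falls back to sorting, combining the approaches above:
def first_match_or_sorted(df, mask, default_col, ascending=False):
    # idxmax silently returns the first index when the mask is all False,
    # so check mask.any() and fall back to the sorted frame instead
    if mask.any():
        return df.loc[mask.idxmax()]
    return df.sort_values(default_col, ascending=ascending).iloc[0]

first_match_or_sorted(df, df.A > 3, 'A')  # first match: row 2
first_match_or_sorted(df, df.A > 6, 'A')  # no match: row 4 (largest A)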
