Pandas: Index rows by an OR condition - python

I'm trying to filter a dataframe so that only two combinations of values are allowed across two columns. For example, columns 'A' and 'B' may be either 'A' > 0 and 'B' > 0, OR 'A' < 0 and 'B' < 0. I want to drop every other combination.
I tried the following:
df = df.loc[(df['A'] > 0 & df['B'] > 0) or (df['A'] < 0 & df['B'] < 0)]
which gives me an error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this is probably a very trivial question, but I honestly couldn't find a solution, and I can't figure out what the problem with my approach is.

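You need extra parentheses, and you need pandas' element-wise operators (& and | instead of and and or). Because & and | bind more tightly than comparisons, each comparison must be wrapped in its own parentheses: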
df = df[((df['A'] > 0) & (df['B'] > 0)) | ((df['A'] < 0) & (df['B'] < 0))]
Keep in mind what this is doing: you're building a boolean Series such as [True, False, True, True] and passing it to the df indexer, which keeps or drops each row according to the True or False in the corresponding position.
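To make the mask idea concrete, here is a minimal sketch on a small made-up frame (the sample values and the names both_pos, both_neg and filtered are assumptions for illustration, not from the question). It builds the two sign conditions separately and combines them with |:

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({"A": [1, -2, 3, -4, 0], "B": [5, -6, -7, 8, 0]})

# Each comparison is element-wise and yields a boolean Series
both_pos = (df["A"] > 0) & (df["B"] > 0)
both_neg = (df["A"] < 0) & (df["B"] < 0)

# | is the element-wise OR; the result is one True/False per row
mask = both_pos | both_neg

# Indexing with the mask keeps only the rows marked True
filtered = df[mask]
print(mask.tolist())  # [True, True, False, False, False]
print(filtered)
```

Naming the intermediate masks is optional, but it tends to make long conditions easier to read than one deeply parenthesized expression.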

Related

Iterate with iterator-range in python

In this dataframe I want to iterate with a span of 3 rows
df = pd.DataFrame(index=range(0, 43), columns=['slow', 'fast', 'p'])
df.slow = 5
df.fast = [
2,2,2,3,3,3,3,3,4,4,
5,6,6,4,5,6,
6,5,4,5,6,6,7,
7,7,6,5,5,4,5,6,6,7,
8,8,9,8,7,7,7,7,7,7
]
df.p = [
1,1,1,1,2,3,3,4,5,6,
7,6,5,4,4,5,
6,7,6,6,7,7,8,
7,6,8,9,10,4,5,3,2,2,
4,4,5,6,7,8,8,8,8,8
]
the logic:
If fast > slow and p >= fast and p[-1], p[-2], p[-3] > slow: append True to the array
my attempt:
iterarray = [-1, -2, -3]
array = []
for i in range(len(df.index[2:])):
    if df.fast[i] > df.slow[i] and df.p[i] >= df.fast[i] and df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)]:
        array.append(True)
    else:
        array.append(False)
But I get an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I achieve the proper iteration?
In your last condition, df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)], you compare three pairs of numbers. That produces three results (each True or False), and Python cannot reduce them to a single truth value on its own.
Use .all(), which returns True only if every pair compares True:
...
if df.fast[i] > df.slow[i] and df.p[i] >= df.fast[i] and (df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)]).all():
...
If you want to check the conditions on the current row and also that p was above slow i rows earlier, you can build one column per lag:
for i in [1, 2, 3]:
    df[f"col_-{i}"] = (df['slow'] < df['fast']) & (df['fast'] <= df['p']) & (df['slow'].shift(i) < df['p'].shift(i))
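The per-lag columns can also be folded into a single signal without a Python loop over rows. The sketch below assumes that "p[-1] p[-2] p[-3] > slow" means p was above slow on each of the three preceding rows; that reading, the shortened sample data, and the names above, prev3 and signal are all assumptions made for illustration:

```python
import pandas as pd

# Hypothetical, shortened sample data (not the 43-row frame from the question)
df = pd.DataFrame({
    "slow": [5] * 8,
    "fast": [4, 6, 6, 6, 6, 6, 4, 6],
    "p":    [6, 6, 7, 7, 7, 3, 7, 7],
})

# Element-wise: was p above slow on this row?
above = df["p"] > df["slow"]

# True where the three preceding rows all had p > slow.
# fill_value=False treats rows before the start of the frame as failing.
prev3 = (
    above.shift(1, fill_value=False)
    & above.shift(2, fill_value=False)
    & above.shift(3, fill_value=False)
)

# Combine with the conditions on the current row
signal = (df["fast"] > df["slow"]) & (df["p"] >= df["fast"]) & prev3
print(signal.tolist())
# [False, False, False, True, True, False, False, False]
```

If the intended window includes the current row instead, shifting by 0, 1 and 2 would be the corresponding change.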

Why does passing a df column not work when the user function contains Boolean conditions?

When I pass a pd.Series (e.g. a df column) to a user function without Boolean conditions, it works; otherwise it fails with
error: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all()
Sorry, I'm new to Python, so I don't understand why it is processed element-wise in one case but as a whole array when a Boolean condition is involved.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A' : ['football', 'football',
                          'tennis','tennis','tennis'],
                   'B' : ['MESSI', 'ROONEY', 'FEDERER','NADAL', 'FEDERER'],
                   'C' : [5,4,6,5,6],
                   'D' : np.random.randn(5),
                   'E' : [1,2,4,3,5],
                   'F' : [1,0,1,0,1]
                   })
def diffs(E, F):
    vals = E - F
    return vals
This works:
df.loc[:, 'asd'] = pd.Series(diffs(df.loc[:,'E'],df.loc[:,'F']),
                             index=df.index)
And this code fails:
def peak_rate(E, F):
    if E > 0:
        vals = 1
    else:
        vals = 0
    return vals
df.loc[:, 'asd'] = pd.Series(peak_rate(df.loc[:,'E'],df.loc[:,'F']),
                             index=df.index)
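The problem is this line:
if E > 0:
E (that is, df.loc[:,'E']) is a pd.Series, so E > 0 is a Series of booleans rather than a single True or False.
An if statement can't decide the truth of a whole Series.
If you want to test whether every element is greater than 0, you can use:
if (E > 0).all():
Maybe you confused the column label 'E' with the Series E.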
It's because, in the first case, it's just subtraction: two arrays or Series can be added, subtracted, or multiplied, and the output is still a Series. A comparison is element-wise too, but an if statement needs a single True or False, which a Series can't supply. Here's an alternate solution that calls the function once per row:
def peak_rate(E, F):
    if E > F:
        return 1
    else:
        return 0
df.loc[:, 'asd'] = pd.Series([peak_rate(df["E"][i],df["F"][i]) for i in range(len(df))], index=df.index)
Or you don't even need the function peak_rate. You can write it as below (I'm guessing you meant E > F instead of E > 0 in peak_rate; if it really was E > 0, just replace df["F"][i] with 0):
df.loc[:, 'asd'] = pd.Series([int(df["E"][i]>df["F"][i]) for i in range(len(df))], index=df.index)
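The list comprehension above can usually be replaced by a fully vectorized expression. The following is a minimal sketch on the E and F columns from the question; the column name positive is a made-up name for the E > 0 variant:

```python
import numpy as np
import pandas as pd

# Only the two columns the comparison needs
df = pd.DataFrame({"E": [1, 2, 4, 3, 5], "F": [1, 0, 1, 0, 1]})

# The comparison is element-wise and gives a boolean Series;
# astype(int) turns True/False into 1/0
df["asd"] = (df["E"] > df["F"]).astype(int)

# Same idea with np.where, here for the E > 0 variant
df["positive"] = np.where(df["E"] > 0, 1, 0)

print(df["asd"].tolist())       # [0, 1, 1, 1, 1]
print(df["positive"].tolist())  # [1, 1, 1, 1, 1]
```

No if statement is involved, so the ambiguity error cannot occur.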

Apply 'if' condition to compare two pandas column and form a third column with the value of the second column

I have two columns in pandas dataframe and want to compare their values against each other and return a third column processing a simple formula.
if post_df['pos'] == 1:
    if post_df['lastPrice'] < post_df['exp']:
        post_df['profP'] = post_df['lastPrice'] - post_df['ltP']
        post_df['pos'] = 0
    else:
        post_df['profP'] = post_df['lastPrice'] - post_df['ltP']
However, when I run the above code I get the following error:
if post_df['pos'] == 1:
File "/Users/srikanthiyer/Environments/emacs/lib/python3.7/site-packages/pandas/core/generic.py", line 1479, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I have tried np.where, which works, but since I intend to build a complex conditional structure, I want to keep it simple by using if statements.
I would try something like this:
def calculate_profp(row):
    profP = None
    if row['pos'] == 1:
        if row['lastPrice'] < row['exp']:
            profP = row['lastPrice'] - row['ltP']
        else:
            profP = row['lastPrice'] - row['ltP']
    return profP
post_df['profP'] = post_df.apply(calculate_profp, axis=1)
What do you want to do with rows where row['pos'] is not 1?
Afterwards, you can run:
post_df['pos'] = post_df.apply(
    lambda row: 0 if row['pos'] == 1 and row['lastPrice'] < row['exp'] else row['pos'],
    axis=1)
to set pos from 1 to 0 where lastPrice is below exp, or, to reset every pos of 1 regardless of price:
post_df['pos'] = post_df['pos'].map(lambda pos: 0 if pos == 1 else pos)
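For comparison, the same rule can be sketched with boolean masks and .loc instead of apply. The sample values and the mask names is_open and below_exp are assumptions for illustration; rows where pos is not 1 are left with a missing profP, since the question does not say what they should hold:

```python
import pandas as pd

# Hypothetical sample data using the column names from the question
post_df = pd.DataFrame({
    "pos": [1, 1, 0],
    "lastPrice": [90.0, 110.0, 95.0],
    "exp": [100.0, 100.0, 100.0],
    "ltP": [80.0, 80.0, 80.0],
})

is_open = post_df["pos"] == 1                       # rows with pos == 1
below_exp = post_df["lastPrice"] < post_df["exp"]   # rows with lastPrice < exp

# Start with missing values, then fill only the rows where pos == 1
post_df["profP"] = float("nan")
post_df.loc[is_open, "profP"] = post_df["lastPrice"] - post_df["ltP"]

# Reset pos only where both conditions hold
post_df.loc[is_open & below_exp, "pos"] = 0

print(post_df)
```

Each extra condition becomes another named mask, so a more complex structure grows by adding masks rather than nesting if statements.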

Create a "directional" pandas pct_change function

I want to create a directional pandas pct_change function, so a negative number in a prior row, followed by a larger negative number in a subsequent row will result in a negative pct_change (instead of positive).
I have created the following function:
```
def pct_change_directional(x):
    if x.shift() > 0.0:
        return x.pct_change() #compute normally if prior number > 0
    elif x.shift() < 0.0 and x > x.shift():
        return abs(x.pct_change()) # make positive
    elif x.shift() < 0.0 and x < x.shift():
        return -x.pct_change() #make negative
    else:
        return 0
```
However when I apply it to my pandas dataframe column like so:
df['col_pct_change'] = pct_change_directional(df[col1])
I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any ideas how I can make this work?
Thanks!
CWE
As @Wen said, you can chain multiple where calls or, much the same, use np.select:
mask1 = df[col].shift() > 0.0
mask2 = (df[col].shift() < 0.0) & (df[col] > df[col].shift())
mask3 = (df[col].shift() < 0.0) & (df[col] < df[col].shift())
df['col_pct_change'] = np.select([mask1, mask2, mask3],
                                 [df[col].pct_change(), abs(df[col].pct_change()),
                                  -df[col].pct_change()],
                                 0)
The NumPy documentation for select and where covers both in more detail.
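A self-contained run of the np.select approach may help check the sign logic. The input Series s is made up for illustration, and prev, pct and directional are names chosen here, not from the answer:

```python
import numpy as np
import pandas as pd

# Hypothetical input: positive values followed by negative ones
s = pd.Series([10.0, 12.0, -4.0, -2.0, -8.0])

prev = s.shift()       # value on the previous row (NaN for the first row)
pct = s.pct_change()   # ordinary percentage change

directional = pd.Series(
    np.select(
        [
            prev > 0.0,                 # prior value positive: keep as is
            (prev < 0.0) & (s > prev),  # moved up from a negative: make positive
            (prev < 0.0) & (s < prev),  # moved down from a negative: make negative
        ],
        [pct, pct.abs(), -pct],
        default=0.0,                    # first row and any other case
    ),
    index=s.index,
)
print(directional.round(4).tolist())
```

np.select takes the first condition that is True on each row, so the order of the masks matters when they could overlap; here they are mutually exclusive.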

The truth value of a Series is ambiguous - Error when calling a function

I know the following error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
was asked about a long time ago.
However, I am trying to create a basic function that returns a new column, df['busy'], containing 1 or 0. My function looks like this:
def hour_bus(df):
    if df[(df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00')&\
          (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')]:
        return df['busy'] == 1
    else:
        return df['busy'] == 0
I can execute the function, but when I call it with the DataFrame, I get the error mentioned above. I followed the following thread and another thread to create that function. I used & instead of and in my if clause.
Anyhow, when I do the following, I get my desired output.
df['busy'] = np.where((df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00') & \
                      (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday'),'1','0')
Any ideas on what mistake I am making in my hour_bus function?
The expression
(df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00')& (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')
gives a boolean Series, and when you index your df with it you get a (probably smaller) subset of your df.
Just to illustrate what I mean:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4]})
mask = df['a'] > 2
print(mask)
# 0    False
# 1    False
# 2     True
# 3     True
# Name: a, dtype: bool
indexed_df = df[mask]
print(indexed_df)
#    a
# 2  3
# 3  4
However, it's still a DataFrame, so it's ambiguous to use it as an expression that requires a truth value (in your case, an if).
bool(indexed_df)
# ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
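The error message lists the ways to collapse a Series or DataFrame into one truth value. As a small sketch on the same toy frame (the variable names has_match, all_match, no_rows and raised are made up here), each of these is an explicit choice of what "true" should mean:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})
mask = df["a"] > 2

has_match = bool(mask.any())  # True: at least one row satisfies the condition
all_match = bool(mask.all())  # False: not every row satisfies it
no_rows = df[mask].empty      # False: the filtered frame still has rows

# Asking for the truth value of the frame itself raises the error
try:
    bool(df[mask])
    raised = False
except ValueError:
    raised = True

print(has_match, all_match, no_rows, raised)  # True False False True
```

None of these gives a per-row answer, which is why an if statement is the wrong tool when the goal is a new column.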
You could use the np.where you used - or equivalently:
def hour_bus(df):
    mask = (df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00')& (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')
    res = df['busy'] == 0
    res[mask] = (df['busy'] == 1)[mask] # replace the values where the mask is True
    return res
However the np.where will be the better solution (it's more readable and probably faster).
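If a function is still wanted, one possible shape is to have it return the whole 0/1 column rather than branch with if. This is only a sketch: the sample rows are made up, the parameter name frame is an assumption, and it returns integers rather than the '1'/'0' strings used in the np.where call. The string comparison relies on the hour values being zero-padded HH:MM:SS text, as in the question:

```python
import pandas as pd

# Hypothetical sample data using the column names from the question
df = pd.DataFrame({
    "hour": ["09:30:00", "15:00:00", "22:59:59", "16:00:00"],
    "week_day": ["Monday", "Tuesday", "Saturday", "Friday"],
})

def hour_bus(frame):
    """Return a 0/1 Series marking weekday rows between 14:00 and 23:00."""
    # between() is inclusive on both ends by default
    busy_hours = frame["hour"].between("14:00:00", "23:00:00")
    # ~ negates the element-wise membership test
    weekday = ~frame["week_day"].isin(["Saturday", "Sunday"])
    return (busy_hours & weekday).astype(int)

df["busy"] = hour_bus(df)
print(df["busy"].tolist())  # [0, 1, 0, 1]
```

The function never needs a single truth value, so it sidesteps the error the same way np.where does.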
