I'm trying to filter rows in my dataframe so that only two combinations of values are allowed for two columns. For example, for columns 'A' and 'B' I want to keep rows where either 'A' > 0 and 'B' > 0, or 'A' < 0 and 'B' < 0. Any other combination should be filtered out.
I tried the following
df = df.loc[(df['A'] > 0 & df['B'] > 0) or (df['A'] < 0 & df['B'] < 0)]
which gives me an error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this is probably a very trivial question, but I couldn't find a solution and I can't figure out what the problem with my approach is.
You need some extra parentheses, and for pandas the Python keywords and/or have to become the element-wise operators &/|:
df = df[((df['A'] > 0) & (df['B'] > 0)) | ((df['A'] < 0) & (df['B'] < 0))]
Keep in mind what this is doing: you're building a boolean Series such as [True, False, True, True] and passing it to the df indexer, which keeps or drops each row depending on whether the corresponding entry is True or False.
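To see the mask being built step by step, here is a minimal sketch on made-up data (the values are hypothetical, not from the question). It also helps explain the original error: without the inner parentheses, & binds more tightly than >, so df['A'] > 0 & df['B'] > 0 is parsed as the chained comparison df['A'] > (0 & df['B']) > 0, and a chained comparison needs a single truth value for its first half.

```python
import pandas as pd

# Hypothetical sample data, chosen only to exercise every sign combination.
df = pd.DataFrame({"A": [1, -2, 3, -4, 0], "B": [5, -6, -7, 8, 0]})

# Each comparison is wrapped in parentheses so that & combines booleans.
both_positive = (df["A"] > 0) & (df["B"] > 0)
both_negative = (df["A"] < 0) & (df["B"] < 0)

# | is the element-wise "or"; the result is one True/False per row.
mask = both_positive | both_negative

# Indexing with the mask keeps only the rows marked True.
filtered = df[mask]
print(filtered)
```

Naming the intermediate masks is optional, but it tends to make longer conditions easier to read than one nested expression.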
In this dataframe I want to iterate with a span of 3 rows
df = pd.DataFrame(index=range(0, 43), columns=['slow', 'fast', 'p'])
df.slow = 5
df.fast = [
2,2,2,3,3,3,3,3,4,4,
5,6,6,4,5,6,
6,5,4,5,6,6,7,
7,7,6,5,5,4,5,6,6,7,
8,8,9,8,7,7,7,7,7,7
]
df.p = [
1,1,1,1,2,3,3,4,5,6,
7,6,5,4,4,5,
6,7,6,6,7,7,8,
7,6,8,9,10,4,5,3,2,2,
4,4,5,6,7,8,8,8,8,8
]
the logic:
If fast > slow and p >= fast and p[-1], p[-2], p[-3] are all > slow, then append True to the array
my attempt:
iterarray = [-1, -2, -3]
array = []
for i in range(len(df.index[2:])):
    if df.fast[i] > df.slow[i] and df.p[i] >= df.fast[i] and df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)]:
        array.append(True)
    else:
        array.append(False)
But I get an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I achieve the proper iteration?
In your last condition, df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)], you compare three pairs of numbers. That produces three results (each True or False), and Python cannot collapse them into a single truth value on its own.
Use .all(), which returns True only if every pair compares True.
...
if df.fast[i] > df.slow[i] and df.p[i] >= df.fast[i] and (df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)]).all():
...
If you want to check that the condition (fast greater than slow) holds and also compare against some earlier records, you can do this:
for i in [1, 2, 3]:
    df[f"col_-{i}"] = (df['slow'] < df['fast']) & (df['fast'] <= df['p']) & (df['slow'].shift(i) < df['p'].shift(i))
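If a single True/False column is wanted instead of three, the shifted comparisons can be combined into one signal. The sketch below is only one possible reading of the question's rule: it assumes that p[-1], p[-2], p[-3] means the three previous rows, and it uses a short made-up frame rather than the 43-row one from the question.

```python
import pandas as pd

# Small hypothetical frame (not the data from the question).
df = pd.DataFrame({
    "slow": [5, 5, 5, 5, 5, 5],
    "fast": [4, 6, 6, 6, 6, 6],
    "p":    [6, 6, 7, 7, 4, 7],
})

# Row-wise check: is p above slow on this row?
above = df["p"] > df["slow"]

# True only where the three previous rows were all above slow.
# fill_value=False treats rows before the start of the frame as failing.
prev3 = (
    above.shift(1, fill_value=False)
    & above.shift(2, fill_value=False)
    & above.shift(3, fill_value=False)
)

# Combine with the conditions on the current row.
signal = (df["fast"] > df["slow"]) & (df["p"] >= df["fast"]) & prev3
print(signal.tolist())
```

Because everything stays element-wise, no if statement ever sees a whole Series, so the ambiguity error cannot occur.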
When I pass a pd.Series (e.g. a df column) to a user function without Boolean conditions, it works; otherwise it fails with
error: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all()
Sorry, I'm new to Python, so I can't see why in one case it is processed element-wise but with a Boolean condition it is treated as a whole array.
df = pd.DataFrame({'A' : ['football', 'football',
'tennis','tennis','tennis'],
'B' : ['MESSI', 'ROONEY', 'FEDERER','NADAL', 'FEDERER'],
'C' : [5,4,6,5,6],
'D' : np.random.randn(5),
'E' : [1,2,4,3,5],
'F' : [1,0,1,0,1]
})
def diffs(E, F):
    vals = E - F
    return vals
This works:
df.loc[:, 'asd'] = pd.Series(diffs(df.loc[:,'E'],df.loc[:,'F']),
index=df.index)
And this code fails:
def peak_rate(E, F):
    if E > 0:
        vals = 1
    else:
        vals = 0
    return vals
df.loc[:, 'asd'] = pd.Series(peak_rate(df.loc[:,'E'],df.loc[:,'F']),
index=df.index)
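The problem is the line:
if E > 0:
E (a.k.a. df.loc[:,'E']) is a pd.Series, so E > 0 is itself a Series of True/False values, one per row.
An if statement needs a single True or False, so it can't use a whole Series as its condition.
What you can do is reduce the comparison to a single value:
if (E > 0).all():
Maybe you got confused between 'E' and E.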
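A short sketch of what happens here, using a hypothetical Series with the same values as column E. It shows that the comparison itself is fine, that asking for its truth value raises the error, and that .all() or .any() reduce it to one boolean. Note that a reduction answers a question about the whole column, not about each row.

```python
import pandas as pd

# Hypothetical stand-in for df.loc[:, 'E'].
E = pd.Series([1, 2, 4, 3, 5])

# The comparison works element-wise and returns a boolean Series.
elementwise = E > 0

# Asking for the truth value of that Series is what `if` does internally.
try:
    bool(elementwise)
    raised = False
except ValueError:
    raised = True

# Reductions collapse the Series to a single True/False.
all_positive = (E > 0).all()    # every element is > 0
any_above_four = (E > 4).any()  # at least one element is > 4

print(raised, all_positive, any_above_four)
```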
It's because, in the first case, it's just subtraction: two arrays or Series can be added/subtracted/multiplied and the output is still a Series. A comparison such as E > 0 also returns a Series, but an if statement needs a single True or False, so the comparison can't be used directly as its condition. Here's an alternate solution:
def peak_rate(E, F):
    if E > F:
        return 1
    else:
        return 0
df.loc[:, 'asd'] = pd.Series([peak_rate(df["E"][i],df["F"][i]) for i in range(len(df))], index=df.index)
Or you don't even need the function peak_rate. You can write it like below (I'm guessing you meant E > F instead of E > 0 in peak_rate. In case it was E > 0, just replace df["F"][i] with 0)
df.loc[:, 'asd'] = pd.Series([int(df["E"][i]>df["F"][i]) for i in range(len(df))], index=df.index)
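The same result can probably be had without a Python-level loop. The sketch below assumes, like the answer above, that E > F is the intended test, and it uses only the E and F columns from the question's frame. The column name asd_where is made up for the comparison.

```python
import numpy as np
import pandas as pd

# Only the two columns that matter, with the values from the question.
df = pd.DataFrame({"E": [1, 2, 4, 3, 5], "F": [1, 0, 1, 0, 1]})

# The comparison yields booleans; astype(int) turns True/False into 1/0.
df["asd"] = (df["E"] > df["F"]).astype(int)

# Equivalent with np.where, which is handy when the two outputs are not 1 and 0.
df["asd_where"] = np.where(df["E"] > df["F"], 1, 0)

print(df)
```

For the E > 0 variant, replacing df["F"] with 0 in either line should be enough.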
I have two columns in a pandas dataframe and want to compare their values against each other and fill a third column using a simple formula.
if post_df['pos'] == 1:
    if post_df['lastPrice'] < post_df['exp']:
        post_df['profP'] = post_df['lastPrice'] - post_df['ltP']
        post_df['pos'] = 0
    else:
        post_df['profP'] = post_df['lastPrice'] - post_df['ltP']
However, when I run the above code I get the following error:
if post_df['pos'] == 1:
File "/Users/srikanthiyer/Environments/emacs/lib/python3.7/site-packages/pandas/core/generic.py", line 1479, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I have tried np.where, which works, but since I intend to build a complex conditional structure I want to keep it simple by using if statements.
I would try something like this:
def calculate_profp(row):
    profP = None
    if row['pos'] == 1:
        if row['lastPrice'] < row['exp']:
            profP = row['lastPrice'] - row['ltP']
        else:
            profP = row['lastPrice'] - row['ltP']
    return profP

post_df['profP'] = post_df.apply(calculate_profp, axis=1)
What do you want to do with rows where row['pos'] is not 1?
Afterwards, you can run:
post_df['pos'] = post_df.apply(
    lambda row: 0 if row['pos'] == 1 and row['lastPrice'] < row['exp'] else row['pos'],
    axis=1)
to set pos from 1 to 0,
or:
post_df['pos'] = post_df['pos'].map(lambda pos: 0 if pos == 1 else pos)
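If a vectorized version is acceptable, the same rule can be sketched with boolean masks instead of apply. This assumes the else branch belongs to the inner if, so profP is only defined where pos is 1, and that pos is reset only where lastPrice < exp. The sample values and the names is_open and should_close are invented for illustration.

```python
import pandas as pd

# Hypothetical data with the column names from the question.
post_df = pd.DataFrame({
    "pos": [1, 1, 0],
    "lastPrice": [10.0, 12.0, 9.0],
    "exp": [11.0, 11.0, 11.0],
    "ltP": [8.0, 8.0, 8.0],
})

# Build both masks before modifying pos, so they reflect the original state.
is_open = post_df["pos"] == 1
should_close = is_open & (post_df["lastPrice"] < post_df["exp"])

# profP is computed everywhere, then blanked (NaN) where pos was not 1.
post_df["profP"] = (post_df["lastPrice"] - post_df["ltP"]).where(is_open)

# Reset pos only on the rows selected by the mask.
post_df.loc[should_close, "pos"] = 0

print(post_df)
```

More branches can be added as further named masks, which keeps the structure readable even when the conditions grow.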
I want to create a directional pandas pct_change function, so a negative number in a prior row, followed by a larger negative number in a subsequent row will result in a negative pct_change (instead of positive).
I have created the following function:
```
def pct_change_directional(x):
    if x.shift() > 0.0:
        return x.pct_change()  # compute normally if prior number > 0
    elif x.shift() < 0.0 and x > x.shift():
        return abs(x.pct_change())  # make positive
    elif x.shift() < 0.0 and x < x.shift():
        return -x.pct_change()  # make negative
    else:
        return 0
```
However when I apply it to my pandas dataframe column like so:
df['col_pct_change'] = pct_change_directional(df[col1])
I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
any ideas how I can make this work?
Thanks!
CWE
As @Wen said, you could chain multiple where calls, or alternatively use np.select:
mask1 = df[col].shift() > 0.0
mask2 = (df[col].shift() < 0.0) & (df[col] > df[col].shift())
mask3 = (df[col].shift() < 0.0) & (df[col] < df[col].shift())
np.select([mask1, mask2, mask3],
          [df[col].pct_change(), abs(df[col].pct_change()),
           -df[col].pct_change()],
          0)
Much detail about select and where you can see here
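For a runnable illustration of the np.select approach, here is a sketch on a short made-up Series (the numbers are hypothetical). np.select takes a list of conditions and a list of choices of the same length, picks the first choice whose condition is True for each element, and falls back to the default otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical values that cross from positive to negative.
s = pd.Series([10.0, 12.0, -4.0, -2.0, -8.0])

prev = s.shift()        # value on the previous row (NaN on the first row)
pct = s.pct_change()    # ordinary percentage change

conditions = [
    prev > 0.0,                   # prior value positive: keep pct as is
    (prev < 0.0) & (s > prev),    # prior negative, moved up: force positive
    (prev < 0.0) & (s < prev),    # prior negative, moved down: force negative
]
choices = [pct, pct.abs(), -pct]

# Rows matching no condition (e.g. the first row) get the default 0.
result = np.select(conditions, choices, default=0)
print(result)
```

The result is a NumPy array, so it can be assigned straight back to a dataframe column of the same length.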
I know the following error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
has been asked about a long time ago.
However, I am trying to create a basic function that returns a new column df['busy'] containing 1 or 0. My function looks like this:
def hour_bus(df):
    if df[(df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00') &\
          (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')]:
        return df['busy'] == 1
    else:
        return df['busy'] == 0
I can execute the function, but when I call it with the DataFrame, I get the error mentioned above. I followed the following thread and another thread to create that function. I used & instead of and in my if clause.
Anyhow, when I do the following, I get my desired output.
df['busy'] = np.where((df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00') & \
(df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday'),'1','0')
Any ideas on what mistake I am making in my hour_bus function?
The
(df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00')& (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')
gives a boolean array, and when you index your df with that you'll get a (probably) smaller part of your df.
Just to illustrate what I mean:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4]})
mask = df['a'] > 2
print(mask)
# 0 False
# 1 False
# 2 True
# 3 True
# Name: a, dtype: bool
indexed_df = df[mask]
print(indexed_df)
# a
# 2 3
# 3 4
However, it's still a DataFrame, so it's ambiguous to use it as an expression that requires a truth value (in your case an if).
bool(indexed_df)
# ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You could use the np.where you used - or equivalently:
def hour_bus(df):
    mask = (df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00') & (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')
    res = df['busy'] == 0
    res[mask] = (df['busy'] == 1)[mask]  # replace the values where the mask is True
    return res
However the np.where will be the better solution (it's more readable and probably faster).
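For completeness, the np.where version can be wrapped in a function so it reads like the original hour_bus. This is a sketch under a few assumptions: the hour column holds zero-padded HH:MM:SS strings (so string comparison orders them correctly), the sample rows are invented, and it returns integers 1/0 rather than the strings '1'/'0' used in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each branch of the condition.
df = pd.DataFrame({
    "hour": ["13:30:00", "14:00:00", "18:45:00", "23:30:00", "15:00:00"],
    "week_day": ["Monday", "Tuesday", "Saturday", "Friday", "Sunday"],
})

def hour_bus(frame):
    # between() is inclusive on both ends by default.
    in_hours = frame["hour"].between("14:00:00", "23:00:00")
    # ~ negates the mask: True for every day that is not a weekend day.
    weekday = ~frame["week_day"].isin(["Saturday", "Sunday"])
    # One value per row: 1 where both masks are True, 0 otherwise.
    return np.where(in_hours & weekday, 1, 0)

df["busy"] = hour_bus(df)
print(df)
```

The function never puts a Series inside an if, which is what avoids the ambiguity error.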