pandas multi-group apply() change view value - python

For some reason this is not working:
sample data:
dt = pd.DataFrame({'sid': ['a']*9 + ['b']*9 + ['c']*9,
                   'src': [1]*18 + [2]*9,
                   'val': np.random.randn(27),
                   'dval': [0]*18 + list(np.random.rand(9))})
I want to multi-group by src,sid and change a dval row value, for those rows that are c, based on some val criteria.
I keep getting a StopIteration error.
# -- set bycp threshold for probability val to alert
def quantg(g):
    try:
        g['dval'] = g['dval'].apply(lambda x: x > x['val'].quantile(.90) and 1 or 0 )
        print '***** bycp ', g.head(2)
        #print 'discretize bycp ', g.head()
        return g
    except (Exception,StopIteration) as e:
        print '**bycp error\n', e
        print g.info()
        pass
Then I try to filter by row before the groupby:
d = d[d['alert_t']=='bycp'].groupby(['source','subject_id','alert_t','variable']).apply(quantg )
I also tried a multilevel select:
# -- xs for multilevel select
g['dval'] = g.xs(('c','sid')).map(lambda x: len(g['value']) and\
#(x>g['value'].quantile(.90) and 1 or 0 ))
But no luck!
I get FrameIndex or StopIteration type errors.
What gives? How can I get this done?

The following doesn't do what you think it does:
x > x['val'].quantile(.90) and 1 or 0
In fact, if you try it with a Series or DataFrame it ought to raise a ValueError:
In [11]: dt and True
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
When writing something like that you want to use np.where:
np.where(x > x['val'].quantile(.90), 1, 0)
Note: astype('int64') would also work, or just leaving it as bool...
However, I think I might use a transform here (to extract each group's quantile and then mask with it), with something like:
g = dt.groupby(['src', 'sid'])['val']
q90 = g.transform(lambda x: x.quantile(.90))
dt[dt.val > q90]
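A minimal sketch of the transform idea, assuming the groups are the (src, sid) pairs from the sample data and that the goal is a 0/1 flag on rows whose val exceeds their own group's 90th percentile. The names rng, q90 and flag are illustrative and not from the question, and the seed is only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for repeatability only
dt = pd.DataFrame({'sid': ['a'] * 9 + ['b'] * 9 + ['c'] * 9,
                   'src': [1] * 18 + [2] * 9,
                   'val': rng.standard_normal(27)})

# One 90th-percentile value per (src, sid) group, broadcast back to every row.
q90 = dt.groupby(['src', 'sid'])['val'].transform(lambda s: s.quantile(.90))

# The comparison is element-wise and gives a boolean Series; cast it to 0/1.
dt['flag'] = (dt['val'] > q90).astype(int)
```

Because transform returns a result aligned with the original index, no `and`/`or` on a Series is needed anywhere.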

Related

Iterate with iterator-range in python

In this dataframe I want to iterate with a span of 3 rows
df = pd.DataFrame(index=range(0, 43), columns=['slow', 'fast', 'p'])
df.slow = 5
df.fast = [
2,2,2,3,3,3,3,3,4,4,
5,6,6,4,5,6,
6,5,4,5,6,6,7,
7,7,6,5,5,4,5,6,6,7,
8,8,9,8,7,7,7,7,7,7
]
df.p = [
1,1,1,1,2,3,3,4,5,6,
7,6,5,4,4,5,
6,7,6,6,7,7,8,
7,6,8,9,10,4,5,3,2,2,
4,4,5,6,7,8,8,8,8,8
]
the logic:
If fast > slow and p >= fast and p[-1] p[-2] p[-3] > slow = array append True
my attempt:
iterarray = [-1, -2, -3]
array = []
for i in range(len(df.index[2:])):
    if df.fast[i] > df.slow[i] and df.p[i] >= df.fast[i] and df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)]:
        array.append(True)
    else:
        array.append(False)
But I get an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I achieve the proper iteration?
In your last condition, df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)], you compare three pairs of numbers. The comparison yields three results (True or False), one per pair, and Python cannot reduce them to a single truth value on its own.
Use .all(), which returns True only if every pair compares True.
...
if df.fast[i] > df.slow[i] and df.p[i] >= df.fast[i] and (df.p[i:i+len(iterarray)] > df.slow[i:i+len(iterarray)]).all():
...
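A small sketch of what happens here, using two made-up three-element Series in place of the slices from the question. The names pairwise, all_above, any_above and raised are illustrative only.

```python
import pandas as pd

p = pd.Series([6, 7, 8])
slow = pd.Series([5, 5, 5])

pairwise = p > slow          # element-wise: a Series of three booleans
all_above = pairwise.all()   # True only if every pair satisfies p > slow
any_above = pairwise.any()   # True if at least one pair does

try:
    bool(pairwise)           # what `if` and `and` do implicitly
    raised = False
except ValueError:
    raised = True            # "The truth value of a Series is ambiguous"
```

Choosing between .all() and .any() is a decision about the logic, which is why pandas refuses to guess.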
If you want to check that the condition (fast greater than slow) holds on the current row and that p was also above slow some rows earlier, you can do this:
for i in [1, 2, 3]:
    df[f"col_-{i}"] = (df['slow'] < df['fast']) & (df['fast'] <= df['p']) & (df['slow'].shift(i) < df['p'].shift(i))

Why passing df column not work when user function contain boolean conditions?

When I pass a pd.Series (e.g. a df column) to a user function without Boolean conditions it works; otherwise it fails with
error: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all()
Sorry, I'm new to Python, so I can't see why it is processed element-wise in one case but as a whole array in the Boolean case.
df = pd.DataFrame({'A' : ['football', 'football',
'tennis','tennis','tennis'],
'B' : ['MESSI', 'ROONEY', 'FEDERER','NADAL', 'FEDERER'],
'C' : [5,4,6,5,6],
'D' : np.random.randn(5),
'E' : [1,2,4,3,5],
'F' : [1,0,1,0,1]
})
def diffs(E, F):
    vals = E - F
    return vals
This works:
df.loc[:, 'asd'] = pd.Series(diffs(df.loc[:,'E'],df.loc[:,'F']),
                             index=df.index)
And this code fails:
def peak_rate(E, F):
    if E > 0:
        vals = 1
    else:
        vals = 0
    return vals
df.loc[:, 'asd'] = pd.Series(peak_rate(df.loc[:,'E'],df.loc[:,'F']),
index=df.index)
The problem is the line:
if E > 0:
E (a.k.a. df.loc[:,'E']) is a pd.Series, so E > 0 compares every element with 0 and gives a Series of booleans.
An if statement needs a single truth value, and a whole Series does not have one.
What you can do is reduce the comparison first:
if (E > 0).all():
Maybe you got confused between 'E' and E.
It's because, in the first case, it's just subtraction: two arrays or Series can be added/subtracted/multiplied and the output is still a Series. A comparison such as E > 0 is also element-wise and returns a Series of booleans, but an if statement needs a single True or False. Here's an alternate solution:
def peak_rate(E, F):
    if E > F:
        return 1
    else:
        return 0
df.loc[:, 'asd'] = pd.Series([peak_rate(df["E"][i],df["F"][i]) for i in range(len(df))], index=df.index)
Or you don't even need the function peak_rate. You can write it like below (I'm guessing you meant E > F instead of E > 0 in peak_rate. In case it was E > 0, just replace df["F"][i] with 0)
df.loc[:, 'asd'] = pd.Series([int(df["E"][i]>df["F"][i]) for i in range(len(df))], index=df.index)
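The same result can be had without a Python-level loop. A minimal sketch, assuming only the E and F columns from the question matter; the column names e_gt_f and e_gt_0 are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'E': [1, 2, 4, 3, 5],
                   'F': [1, 0, 1, 0, 1]})

# Element-wise comparison, then cast the booleans to 0/1.
df['e_gt_f'] = (df['E'] > df['F']).astype(int)

# The E > 0 variant from the question, written with np.where.
df['e_gt_0'] = np.where(df['E'] > 0, 1, 0)
```

Both forms work on whole columns at once, so they do not depend on the index being 0..n-1 the way df["E"][i] does.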

Apply 'if' condition to compare two pandas column and form a third column with the value of the second column

I have two columns in pandas dataframe and want to compare their values against each other and return a third column processing a simple formula.
if post_df['pos'] == 1:
    if post_df['lastPrice'] < post_df['exp']:
        post_df['profP'] = post_df['lastPrice'] - post_df['ltP']
        post_df['pos'] = 0
    else:
        post_df['profP'] = post_df['lastPrice'] - post_df['ltP']
However, when I run the above code I get the following error:
if post_df['pos'] == 1:
File "/Users/srikanthiyer/Environments/emacs/lib/python3.7/site-packages/pandas/core/generic.py", line 1479, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I have tried np.where, which works, but since I intend to build a complex conditional structure I want to keep it simple using if statements.
I would try something like this:
def calculate_profp(row):
    profP = None
    if row['pos'] == 1:
        if row['lastPrice'] < row['exp']:
            profP = row['lastPrice'] - row['ltP']
        else:
            profP = row['lastPrice'] - row['ltP']
    return profP
post_df['profP'] = post_df.apply(calculate_profp, axis=1)
What do you want to do with rows where row['pos'] is not 1?
Afterwards, you can run:
post_df['pos'] = post_df.apply(
    lambda row: 0 if row['pos'] == 1 and row['lastPrice'] < row['exp'] else row['pos'],
    axis=1)
to set pos from 1 to 0
or:
post_df['pos'] = post_df['pos'].map(lambda pos: 0 if pos == 1 else pos)
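For comparison, a vectorized sketch of the same logic using named boolean masks, which can stay readable as the conditions grow. The sample rows are hypothetical, the mask names is_open and below_exp are made up, and rows where pos is not 1 are assumed to get no profP (NaN).

```python
import pandas as pd

# Hypothetical sample rows; column names follow the question.
post_df = pd.DataFrame({'pos': [1, 1, 0],
                        'lastPrice': [10.0, 12.0, 9.0],
                        'exp': [11.0, 11.0, 11.0],
                        'ltP': [8.0, 8.0, 8.0]})

# Compute both masks before pos is modified.
is_open = post_df['pos'] == 1
below_exp = post_df['lastPrice'] < post_df['exp']

# profP only for rows with pos == 1; all other rows become NaN.
post_df['profP'] = (post_df['lastPrice'] - post_df['ltP']).where(is_open)

# Reset pos to 0 where the position was open and lastPrice is below exp.
post_df.loc[is_open & below_exp, 'pos'] = 0
```

Each mask is an ordinary boolean Series, so further conditions can be combined with & and | instead of nested if statements.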

Create a "directional" pandas pct_change function

I want to create a directional pandas pct_change function, so a negative number in a prior row, followed by a larger negative number in a subsequent row will result in a negative pct_change (instead of positive).
I have created the following function:
```
def pct_change_directional(x):
    if x.shift() > 0.0:
        return x.pct_change() #compute normally if prior number > 0
    elif x.shift() < 0.0 and x > x.shift():
        return abs(x.pct_change()) # make positive
    elif x.shift() < 0.0 and x < x.shift():
        return -x.pct_change() #make negative
    else:
        return 0
```
However when I apply it to my pandas dataframe column like so:
df['col_pct_change'] = pct_change_directional(df[col1])
I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
any ideas how I can make this work?
Thanks!
CWE
As @Wen said, this calls for multiple where calls, or more likely np.select:
mask1 = df[col].shift() > 0.0
mask2 = (df[col].shift() < 0.0) & (df[col] > df[col].shift())
mask3 = (df[col].shift() < 0.0) & (df[col] < df[col].shift())
np.select([mask1, mask2, mask3],
          [df[col].pct_change(), abs(df[col].pct_change()),
           -df[col].pct_change()],
          0)
Much detail about select and where you can see here
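A self-contained sketch of the np.select approach on a made-up Series, so the three branches can be checked by hand. The values in s and the names prev, pct and directional are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, -4.0, -2.0, -8.0])  # hypothetical values
prev = s.shift()
pct = s.pct_change()

conditions = [prev > 0.0,                   # prior value positive: keep as is
              (prev < 0.0) & (s > prev),    # rose from a negative prior value
              (prev < 0.0) & (s < prev)]    # fell from a negative prior value
choices = [pct, pct.abs(), -pct]

# Rows matching no condition (here the first row, whose prior value is NaN)
# fall back to the default of 0.
directional = pd.Series(np.select(conditions, choices, default=0.0),
                        index=s.index)
```

np.select takes the first condition that is True for each row, so the order of the conditions matters when they can overlap.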

The truth value of a Series is ambiguous - Error when calling a function

I know the following error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
has been asked about a long time ago.
However, I am trying to create a basic function and return a new column with df['busy'] with 1 or 0. My function looks like this,
def hour_bus(df):
    if df[(df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00')&\
          (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')]:
        return df['busy'] == 1
    else:
        return df['busy'] == 0
I can execute the function, but when I call it with the DataFrame, I get the error mentioned above. I followed the following thread and another thread to create that function. I used & instead of and in my if clause.
Anyhow, when I do the following, I get my desired output.
df['busy'] = np.where((df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00') & \
(df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday'),'1','0')
Any ideas on what mistake I am making in my hour_bus function?
The
(df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00')& (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')
gives a boolean array, and when you index your df with that you'll get a (probably) smaller part of your df.
Just to illustrate what I mean:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4]})
mask = df['a'] > 2
print(mask)
# 0 False
# 1 False
# 2 True
# 3 True
# Name: a, dtype: bool
indexed_df = df[mask]
print(indexed_df)
# a
# 2 3
# 3 4
However, it's still a DataFrame, so it's ambiguous to use it as an expression that requires a truth value (in your case an if).
bool(indexed_df)
# ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You could use the np.where you used - or equivalently:
def hour_bus(df):
    mask = (df['hour'] >= '14:00:00') & (df['hour'] <= '23:00:00')& (df['week_day'] != 'Saturday') & (df['week_day'] != 'Sunday')
    res = df['busy'] == 0
    res[mask] = (df['busy'] == 1)[mask] # replace the values where the mask is True
    return res
However the np.where will be the better solution (it's more readable and probably faster).
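A runnable sketch of the np.where route on a few made-up rows, assuming hour is stored as zero-padded 'HH:MM:SS' strings so that string comparison orders times correctly. The mask names in_hours and weekday are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the question.
df = pd.DataFrame({'hour': ['09:30:00', '15:00:00', '22:59:59', '16:00:00'],
                   'week_day': ['Monday', 'Tuesday', 'Saturday', 'Sunday']})

# between() is inclusive on both ends by default, like >= and <= above.
in_hours = df['hour'].between('14:00:00', '23:00:00')
weekday = ~df['week_day'].isin(['Saturday', 'Sunday'])

# The combined mask is used directly; no `if` on a Series is involved.
df['busy'] = np.where(in_hours & weekday, 1, 0)
```

This writes integer 1/0 rather than the strings '1'/'0' used in the question; swap the last two arguments if strings are wanted.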
