I am trying to change some values in a DataFrame using pandas. Having already tried df.replace, df.mask, and df.where, I have concluded that it must be a logical mistake, since it keeps throwing the same error:
ValueError: The truth value of a Series is ambiguous.
I am trying to normalize a column in a dataset, hence the function rather than a single line. I need to understand why my logic is wrong; it seems to be such a dumb mistake.
This is my function:
def overweight_normalizer():
    if df[df["overweight"] > 25]:
        df.where(df["overweight"] > 25, 1)
    elif df[df["overweight"] < 25]:
        df.where(df["overweight"] < 25, 0)
df[df["overweight"] > 25] is not a valid condition: it is a filtered DataFrame, not a single True or False, so an if statement cannot evaluate it.
Try this:
def overweight_normalizer():
    df = pd.DataFrame({'overweight': [2, 39, 15, 45, 9]})
    # 1 for values above 25, 0 otherwise
    df["overweight"] = [1 if i > 25 else 0 for i in df["overweight"]]
    return df

overweight_normalizer()
Output:
   overweight
0           0
1           1
2           0
3           1
4           0
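As a minimal sketch of the same idea without a Python-level loop, assuming the same sample values as above: the comparison itself already yields one True/False per row, and that boolean Series can be converted to 1/0 directly. The sketch also reproduces the error from the question to show where it comes from.

```python
import pandas as pd

df = pd.DataFrame({"overweight": [2, 39, 15, 45, 9]})

# The comparison returns a boolean Series with one value per row.
mask = df["overweight"] > 25

# Using that Series in an `if` is what raises the error from the question.
try:
    if mask:
        pass
except ValueError as err:
    message = str(err)

# Convert True/False to 1/0 for the whole column at once.
df["overweight"] = mask.astype(int)
print(df)
```

Here mask.astype(int) plays the role of the list comprehension; np.where(mask, 1, 0) would be another option.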
I have a dataframe with sorted values:
import numpy as np
import pandas as pd
sub_run = pd.DataFrame({'Runoff':[45,10,5,26,30,23,35], 'ind':[3, 10, 25,43,53,60,93]})
I would like to start from the highest value in Runoff (45) and drop all rows whose "ind" is less than 30 away from its "ind" (the rows with Runoff 10 and 5), then update the DataFrame, then go to the second highest value (35) and drop the rows whose "ind" is less than 30 away, then the third highest value (30) and drop 26 and 23, and so on.
I wrote the following code :
pre_ind = []
for (idx1, row1) in sub_run.iterrows():
    var = row1.ind
    pre_ind.append(np.array(var))
    for (idx2,row2) in sub_run.iterrows():
        if (row2.ind != var) and (row2.ind not in pre_ind):
            test = abs(row2.ind - var)
            print("test" , test)
            if test <= 30:
                sub_run = sub_run.drop(sub_run[sub_run.ind == row2.ind].index)
I expect the output to contain the values [45, 35, 30]. However, I only get the first one.
Many thanks
Try this:
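list_pre_max = []  # Runoff values that have already been processed
while True:
    try:
        # next-highest Runoff value that has not been processed yet
        max_val = sub_run.Runoff.sort_values(ascending=False).iloc[len(list_pre_max)]
    except IndexError:
        # every remaining row has been processed
        break
    max_ind = sub_run.loc[sub_run['Runoff'] == max_val, 'ind'].item()
    list_pre_max.append(max_val)
    # rows within 30 of the current peak, excluding the peak itself and earlier peaks
    dropped_indices = sub_run.loc[(abs(sub_run['ind']-max_ind) <= 30) & (sub_run['ind'] != max_ind) & (~sub_run.Runoff.isin(list_pre_max))].index
    sub_run.drop(index=dropped_indices, inplace=True)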
Output:
>>> sub_run
   Runoff  ind
0      45    3
4      30   53
6      35   93
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
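In your case, reassigning sub_run has no immediate effect on the outer iteration, because the outer loop keeps iterating over the frame it started with.
Therefore, with the frame sorted by Runoff in descending order, after the iteration on (45, 3) the next row iterated is (35, 93), followed by (30, 53), (26, 43), (23, 60), (10, 10), (5, 25), including rows that have already been dropped. The inner loop does see your modification, since a new inner iteration over the current sub_run starts on every pass of the outer loop.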
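The point about copies can be checked with a small sketch. The frame below is hypothetical (it is not the data from the question) and deliberately mixes column types, so iterrows has to build a new Series for each row; writing to that row then leaves the frame untouched.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes (strings and integers).
frame = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

for _, row in frame.iterrows():
    # `row` is a separate Series built for this iteration,
    # so this assignment does not reach `frame`.
    row["value"] = 99

print(frame)
```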
Here is my suggested code, inspired by bubble sort.
import pandas as pd

sub_run = pd.DataFrame({'Runoff': [45,10,5,26,30,23,35],
                        'ind': [3,10,25,43,53,60,93]})
sub_run = sub_run.sort_values(by=['Runoff'], ascending=False)
highestRow = 0
while highestRow < len(sub_run) - 1:
    cur_run = sub_run
    highestRunoffInd = cur_run.iloc[highestRow].ind
    for i in range(highestRow + 1, len(cur_run)):
        ind = cur_run.iloc[i].ind
        if abs(ind - highestRunoffInd) <= 30:
            sub_run = sub_run.drop(sub_run[sub_run.ind == ind].index)
    highestRow += 1
print(sub_run)
Output:
   Runoff  ind
0      45    3
6      35   93
4      30   53
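Another way to avoid modifying the frame while iterating, as a minimal sketch rather than a definitive implementation: walk the rows from the highest Runoff down and keep a row only if its "ind" is more than 30 away from every row kept so far. The function name keep_spaced_peaks and the min_gap parameter are made up for this illustration.

```python
import pandas as pd

def keep_spaced_peaks(frame, min_gap=30):
    """Greedy selection: highest Runoff first, skip rows too close to a kept row."""
    kept_inds = []    # 'ind' values of the rows kept so far
    kept_labels = []  # index labels of the rows kept so far
    # Iterate over a sorted copy; the original frame is never modified.
    for label, row in frame.sort_values("Runoff", ascending=False).iterrows():
        if all(abs(row["ind"] - k) > min_gap for k in kept_inds):
            kept_inds.append(row["ind"])
            kept_labels.append(label)
    return frame.loc[kept_labels]

sub_run = pd.DataFrame({"Runoff": [45, 10, 5, 26, 30, 23, 35],
                        "ind": [3, 10, 25, 43, 53, 60, 93]})
result = keep_spaced_peaks(sub_run)
print(result)
```

A dropped row never causes other rows to be dropped, so this should give the same rows as dropping in place; rows with equal Runoff values may be visited in a different order.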
Consider the following code.
import numpy as np
import pandas as pd

np.random.seed(0)
df_H = pd.DataFrame({'L0': np.random.randn(100),
                     'OneAndZero': np.random.randn(100),
                     'OneAndTwo': np.random.randn(100),
                     'GTwo': np.random.randn(100),
                     'Decide': np.random.randn(100)})
I would like to create a new column named Result, which depends on the value of the column Decide. If the value in Decide is less than 0, Result should take the value of L0 in the same row. If the value in Decide is between 0 and 1, it should take the value in OneAndZero; between 1 and 2, the value in OneAndTwo; and if the value in Decide is greater than 2, the value in GTwo.
How would one do this with df.apply since I have only seen examples with fixed values and not values from other columns?
Just because it is Good Friday, we can try the following. Otherwise, this is a commonly asked question.
# np.select uses the first condition that matches, so the shared bounds are harmless
c1 = df_H['Decide'].le(0)
c2 = df_H['Decide'].between(0, 1)
c3 = df_H['Decide'].between(1, 2)
c4 = df_H['Decide'].gt(2)
cond = [c1, c2, c3, c4]
choices = [df_H['L0'], df_H['OneAndZero'], df_H['OneAndTwo'], df_H['GTwo']]
df_H['Result'] = np.select(cond, choices)
df_H
If you really want to use apply
def choose_res(x):
    if x['Decide'] <= 0:
        return x['L0']
    if 0 < x['Decide'] <= 1:
        return x['OneAndZero']
    if 1 < x['Decide'] <= 2:
        return x['OneAndTwo']
    if x['Decide'] > 2:
        return x['GTwo']

df_H['Result'] = df_H.apply(axis=1, func=choose_res, result_type='expand')
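To see that the two approaches agree, here is a small sketch on a hypothetical four-row frame (one row per branch) instead of the random data, so the expected values are easy to read off. The names demo, pick, by_select and by_apply are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each row falls into a different branch.
demo = pd.DataFrame({
    "L0": [10.0, 11.0, 12.0, 13.0],
    "OneAndZero": [20.0, 21.0, 22.0, 23.0],
    "OneAndTwo": [30.0, 31.0, 32.0, 33.0],
    "GTwo": [40.0, 41.0, 42.0, 43.0],
    "Decide": [-0.5, 0.5, 1.5, 2.5],
})

# Vectorized version: the first matching condition picks the column.
conditions = [
    demo["Decide"].le(0),
    demo["Decide"].between(0, 1),
    demo["Decide"].between(1, 2),
    demo["Decide"].gt(2),
]
choices = [demo["L0"], demo["OneAndZero"], demo["OneAndTwo"], demo["GTwo"]]
demo["by_select"] = np.select(conditions, choices)

# Row-wise version: axis=1 passes each row to the function as a Series.
def pick(row):
    if row["Decide"] <= 0:
        return row["L0"]
    if row["Decide"] <= 1:
        return row["OneAndZero"]
    if row["Decide"] <= 2:
        return row["OneAndTwo"]
    return row["GTwo"]

demo["by_apply"] = demo.apply(pick, axis=1)
print(demo[["Decide", "by_select", "by_apply"]])
```

On larger frames the np.select version is usually the faster of the two, since apply with axis=1 calls the Python function once per row.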
df.iloc
df_H.reset_index(drop=True, inplace=True)
for i in range(len(df_H)):
    a = df_H['Decide'].iloc[i]
    # pick the source column for this row; every value of a falls into one branch
    if a < 0:
        col = 'L0'
    elif a <= 1:
        col = 'OneAndZero'
    elif a <= 2:
        col = 'OneAndTwo'
    else:
        col = 'GTwo'
    df_H.loc[i, 'Result'] = df_H[col].iloc[i]
Maybe you can try it this way.
df_apply
If you want to use apply,
create a function that contains the conditions and returns the output,
then use this code (axis=1 passes each row to the function):
df_H['Result'] = df_H.apply(your_function_name, axis=1)
I am trying to apply this logic to a pandas DataFrame:
IF base_total_price > 0
    IF base_total_discount = 0
        actual_price = base_total_price
    IF base_total_discount > 0
        actual_price = base_total_price + base_total_discount
IF base_total_price = 0
    IF base_total_discount > 0
        actual_price = base_total_discount
    IF base_total_discount = 0
        actual_price = 0
so I wrote these 2 apply functions
#for all entries where base_total_price > 0
df_slice_1['actual_price'] = df_slice_1['base_total_discount'].apply(lambda x: df_slice_1['base_total_price'] if x == 0 else df_slice_1['base_total_price']+df_slice_1['base_total_discount'])
#for all entries where base_total_price = 0
df_slice_1['actual_price'] = df_slice_1['base_total_discount'].apply(lambda x: x if x == 0 else df_slice_1['base_total_discount'])
When I run the code I get this error:
ValueError: Wrong number of items passed 20, placement implies 1
I know that it is trying to put more than one value into a single column, but I do not understand why this is happening or how I can solve it. All I need to do is add the new column `actual_price` to the dataframe and calculate its values according to the logic above. Please suggest a better way of implementing the logic, or correct me.
Sample data would have been useful. Please try using np.select(conditions, choices):
conditions = [(df.base_total_price > 0) & (df.base_total_discount == 0),
              (df.base_total_price > 0) & (df.base_total_discount > 0),
              (df.base_total_price == 0) & (df.base_total_discount > 0),
              (df.base_total_price == 0) & (df.base_total_discount == 0)]
choices = [df.base_total_price, df.base_total_price.add(df.base_total_discount), df.base_total_discount, 0]
# use bracket assignment: attribute assignment (df.actual_price = ...) does not create a new column
df['actual_price'] = np.select(conditions, choices)
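Since no sample data was posted, here is a sketch of the np.select approach on a hypothetical four-row frame, one row per branch of the logic. The frame name orders and its values are invented for illustration. The lambdas in the question return an entire column for every element, which is likely what triggers the "Wrong number of items" error; np.select works on whole columns instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering all four branches.
orders = pd.DataFrame({
    "base_total_price": [100.0, 100.0, 0.0, 0.0],
    "base_total_discount": [0.0, 20.0, 15.0, 0.0],
})

price = orders["base_total_price"]
discount = orders["base_total_discount"]

conditions = [
    (price > 0) & (discount == 0),
    (price > 0) & (discount > 0),
    (price == 0) & (discount > 0),
    (price == 0) & (discount == 0),
]
# One choice per condition; the scalar 0 is broadcast to every row.
choices = [price, price + discount, discount, 0]

orders["actual_price"] = np.select(conditions, choices)
print(orders)
```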
I solved this question simply by using iterrows. Thanks everyone who responded
I want to label the data "1" if the current value is higher than that of the previous row and "0" otherwise.
Let's say I have this DataFrame:
df = pd.DataFrame({'date': [1,2,3,4,5], 'price': [50.125, 45.25, 65.857, 100.956, 77.4152]})
and I want the output as if the DataFrame is constructed like this:
df = pd.DataFrame({'date': [1,2,3,4,5], 'price': [50.125, 45.25, 65.857, 100.956, 77.4152], 'label':[0, 0, 1, 1, 0]})
*I don't know how to post a DataFrame
This code is my attempt:
df['label'] = 0
i = 0
for price in df['price']:
    i = i+1
    if price in i > price: #---> right now I am stuck here. It says argument of type 'int' is not iterable
        df.append['label', 1]
    elif price in i <= price:
        df.append['label', 0]
I think there are also other logical mistakes in my code. What am I doing wrong?
Create a boolean mask with Series.ge (>=) and Series.shift, then map True/False to 1/0 by converting to integers with Series.view:
df['label'] = df['price'].ge(df['price'].shift()).view('i1')
Or with Series.astype, which is the safer choice because Series.view is deprecated in recent pandas versions:
df['label'] = df['price'].ge(df['price'].shift()).astype(int)
IIUC, you can use np.where with a shifted comparison to check whether each row's price is greater than or equal to the price in the row above:
df['label'] = np.where(df['price'].ge(df['price'].shift()),1,0)
print(df)
   date     price  label
0     1   50.1250      0
1     2   45.2500      0
2     3   65.8570      1
3     4  100.9560      1
4     5   77.4152      0
Explanation:
print(df['price'].ge(df['price'].shift()))
returns a boolean Series of True and False values that we can use as the condition in np.where:
0 False
1 False
2 True
3 True
4 False
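The first row comes out as False because shift() leaves a missing value there and any comparison with a missing value is False. A minimal sketch that makes the intermediate step visible, using the DataFrame from the question (the variable name previous is made up here):

```python
import pandas as pd

df = pd.DataFrame({'date': [1, 2, 3, 4, 5],
                   'price': [50.125, 45.25, 65.857, 100.956, 77.4152]})

# shift() moves every price down one row; the first row has no predecessor (NaN).
previous = df['price'].shift()

# Comparing against NaN gives False, so the first label becomes 0.
df['label'] = df['price'].ge(previous).astype(int)
print(df)
```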
To explain what is happening in your code:
Assigning df['label'] = 0 fills the whole column with 0 up front. If you build the labels in a loop, it is simpler to collect them in a plain Python list, starting with [0] because the first row has no previous row, and assign that list to df['label'] after the loop.
i is just a running counter (1, 2, 3...) and not the value of the price at a specific index (50.125, 45.25, 65.857...), so it is not what you want to compare.
price in checks whether the value of the price variable exists in the container that follows. Here in is followed by i, which is an int and not a container, so you get the "argument of type 'int' is not iterable" error. Instead, you want to get the price at a specific index and compare it with the price at the previous index.
append is a method, so it is called with () and not [].
If you want to continue along your method of using a loop, the following can work:
labels = [0]  # the first row has no previous row to compare against
for i in range(1, len(df['price'])):
    if df['price'].iloc[i] > df['price'].iloc[i - 1]:
        labels.append(1)
    else:
        labels.append(0)
df['label'] = labels
What this does is loop over the positions of the price column, starting at the second row. It compares the price at position i with the price at position i - 1, collects the results in a list, and assigns that list to the label column at the end.
You can also simplify the if/else statement with a conditional expression:
labels.append(1 if df['price'].iloc[i] > df['price'].iloc[i - 1] else 0)
Working fiddle: https://repl.it/repls/HoarseImpressiveScientists
The code below (calculation of a moving average over N days) works well. But I want to replace 50 with other numbers (e.g., 5, 10, 20, etc.). I am not sure whether I can turn the code below into a for loop. Could anybody please help me?
df['ma50pfret']= df['ret']
df.loc[df.adjp >= df.ma50, 'adjp > ma50']= 1
df.loc[df.adjp < df.ma50, 'adjp > ma50']= 0
df.iloc[0, -1]= 1
df['adjp > ma50']= df['adjp > ma50'].astype(int)
df.loc[df['adjp > ma50'].shift(1)== 0, 'ma50pfret']= 1.000079 # 1.02**(1/250)
df['cum_ma50pfret']=df['ma50pfret'].cumprod()
df.head(10)
Do you just mean to replace the 50's with 5, 10, 20, etc? If so, that could be done by always using brackets to access columns, and using f-strings (or some other string formatting method) to replace 50 with the other numbers, like this:
for num in [5, 10, 20, 50]:
    df[f'ma{num}pfret'] = df['ret']
    df.loc[df.adjp >= df[f'ma{num}'], f'adjp > ma{num}'] = 1
    df.loc[df.adjp < df[f'ma{num}'], f'adjp > ma{num}'] = 0
    # set the first row of the flag column by name instead of relying on it being the last column
    df.loc[df.index[0], f'adjp > ma{num}'] = 1
    df[f'adjp > ma{num}'] = df[f'adjp > ma{num}'].astype(int)
    df.loc[df[f'adjp > ma{num}'].shift(1) == 0, f'ma{num}pfret'] = 1.000079  # 1.02**(1/250)
    df[f'cum_ma{num}pfret'] = df[f'ma{num}pfret'].cumprod()
df.head(10)
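If the loop body grows, it could also be wrapped in a function. The following is only a sketch under stated assumptions: the helper name add_ma_strategy and its cash_ret parameter are made up, the data is randomly generated, and the ma columns are assumed to be simple rolling means of adjp, which the question does not confirm. Unlike the loop above, rows where the moving average is still missing are flagged 0 here rather than left empty.

```python
import numpy as np
import pandas as pd

def add_ma_strategy(frame, num, cash_ret=1.000079):
    """Add the flag, strategy-return and cumulative-return columns for one window."""
    flag = f"adjp > ma{num}"
    pfret = f"ma{num}pfret"
    # 1 when the price is at or above its moving average, else 0
    # (rows where the average is NaN compare as False and become 0).
    frame[flag] = (frame["adjp"] >= frame[f"ma{num}"]).astype(int)
    frame.loc[frame.index[0], flag] = 1
    # Earn the asset return, except on days after the flag was 0.
    frame[pfret] = frame["ret"]
    frame.loc[frame[flag].shift(1) == 0, pfret] = cash_ret
    frame[f"cum_{pfret}"] = frame[pfret].cumprod()
    return frame

# Hypothetical data: gross daily returns and the price series they imply.
rng = np.random.default_rng(0)
ret = 1 + rng.normal(0, 0.01, size=60)
df = pd.DataFrame({"ret": ret, "adjp": 100 * ret.cumprod()})

for num in [5, 10, 20]:
    df[f"ma{num}"] = df["adjp"].rolling(num).mean()  # assumed definition of ma{num}
    add_ma_strategy(df, num)

print(df.filter(like="cum_").tail())
```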