How to avoid repeated values in Python

import pandas as pd
import os
os.chdir(r"C:\Users\Administrator\Desktop")
data = pd.read_csv("nifty.csv")
for i in range(9, len(data['SMA(10)'])):
    if (data['SMA(10)'][i] < data['Open(-1)'][i] and
            data['SMA(10)'][i] > data['Open'][i]):
        data['Trans'][i] = "Sell"
    elif (data['SMA(10)'][i] > data['Open(-1)'][i] and
            data['SMA(10)'][i] < data['Open'][i]):
        data['Trans'][i] = "Buy"
    else:
        data['Trans'][i] = "Hold"
print("-----------------------------------------------------------------------")
print(data.head(50))
I don't want two Buy or two Sell values in a row; when a value repeats, I want Hold instead.

That's not how you use pandas. Use a vectorized solution, with -1 for sell, 1 for buy and 0 for hold. Integers are faster than strings, and you can map them back to strings afterwards if you prefer.
import numpy as np
data['Trans'] = np.select(
    [(data['SMA(10)'] < data['Open(-1)']) & (data['SMA(10)'] > data['Open']),
     (data['SMA(10)'] > data['Open(-1)']) & (data['SMA(10)'] < data['Open'])],
    [-1, 1], default=0)
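The vectorized signal above can still produce the same value on adjacent rows. Here is a minimal sketch of one way to turn such repeats into holds, assuming "together" means two identical signals on consecutive rows. The helper name suppress_repeats and the sample series are hypothetical, not part of the original answer.

```python
import pandas as pd

def suppress_repeats(trans: pd.Series) -> pd.Series:
    """Replace a signal that repeats the previous row's signal with 0 (hold)."""
    # True where the current row carries the same value as the row before it
    repeated = trans.eq(trans.shift())
    # mask() replaces values where the condition is True
    return trans.mask(repeated, 0)

# Hypothetical signals: 1 = buy, -1 = sell, 0 = hold
signals = pd.Series([1, 1, 0, -1, -1, -1, 1])
cleaned = suppress_repeats(signals)
print(cleaned.tolist())  # [1, 0, 0, -1, 0, 0, 1]
```

Only the first signal of each run survives. If a repeat should instead be measured against the last non-hold signal, the comparison would need a forward fill first, which this sketch does not do.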

Related

Create a minimum centered on zero operator

I am trying to create a central tendency operator (like the mean or the median) which would follow this logic:
For a given array, return the value closest to zero if all values have the same sign and zero otherwise
In other words:
if all values > 0 return min(array)
if all values < 0 return max(array)
else return 0
Here is the most optimized implementation I managed to do:
def zero_min(x):
    if len(x) == 1:
        return x[0]
    else:
        tmin = np.min(x)
        tmax = np.max(x)
        return (tmin if tmin == abs(tmin) else tmax) if tmin*tmax > 0 else 0
The issue is that I want it to be very efficient in order to use it in a rolling window (using pandas.Series.rolling) on 8.5M values of type float64, like this:
df = df.rolling(timedelta(seconds=5)).apply(zero_min, raw=True)
But this function is painfully slow: for a 5 s window it takes 33.34 s, while pandas.Series.rolling.mean takes 0.15 s and pandas.Series.rolling.median takes 1.01 s (and the median should take longer to compute, since it is a more complex operation).
Would you know how to optimize it so that it is at least as fast as the median?
I guess I would have to use matrix calculation or code the operation in C but I don't know how to do that.
You can reproduce the data to process using
import random
from datetime import datetime, timedelta
import pandas as pd
n = 8467200
df = pd.Series([random.random() for i in range(n)], index=pd.date_range(datetime.now(), datetime.now() + timedelta(seconds=n-1), freq='1S'))
Avoid using apply. You can do something like this:
min_val = df['some_col'].rolling(timedelta(seconds=5), min_periods=1).min()
max_val = df['some_col'].rolling(timedelta(seconds=5), min_periods=1).max()
# combine the two series: the minimum if all values are positive (or min equals max),
# the maximum if all values are negative, 0 otherwise
df['new_col'] = np.select([min_val.gt(0) | min_val.eq(max_val), max_val < 0],
                          [min_val, max_val], 0)
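To check that the rolling min/max approach agrees with the original zero_min, here is a small sketch on a Series (as in the question) rather than a DataFrame column. The wrapper name rolling_zero_min and the eight sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def zero_min(x):
    # Reference implementation from the question
    if len(x) == 1:
        return x[0]
    tmin = np.min(x)
    tmax = np.max(x)
    return (tmin if tmin == abs(tmin) else tmax) if tmin * tmax > 0 else 0

def rolling_zero_min(s: pd.Series, window: str = "5s") -> pd.Series:
    """Rolling zero_min built from the built-in rolling min and max."""
    roll = s.rolling(window, min_periods=1)
    lo, hi = roll.min(), roll.max()
    out = np.select([(lo > 0) | (lo == hi), hi < 0], [lo, hi], default=0.0)
    return pd.Series(out, index=s.index)

# Hypothetical sample: one value per second
idx = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(8), unit="s")
s = pd.Series([0.5, 0.2, -0.1, -0.4, -0.3, -0.6, -0.2, -0.9], index=idx)

fast = rolling_zero_min(s)
slow = s.rolling("5s").apply(zero_min, raw=True)
print(np.allclose(fast, slow))
```

The speed gain comes from replacing a Python-level callback per window with two built-in rolling reductions, so the comparison against apply is only worth running on small inputs.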

Check if two consecutive values in a dataframe column are bigger than 0

I want to check if two consecutive values in a column are bigger than 0. If yes, then data['Exit'] = 1, else 0
My code:
data['Exit'] = 0
for row in range(len(data)):
    if (data['Mkt_Return'].iloc[row] > 0) and (data['Mkt_Return'].iloc[row-1] > 0):
        data['Exit'] = 1
Right now all my values are equal to 1, but I know some values are smaller than 0 and therefore shouldn't be equal to 1.
Is .iloc[row-1] wrong?
Your loop has two problems. At row 0, .iloc[row-1] is .iloc[-1], so the first row is compared with the last row. And data['Exit'] = 1 assigns 1 to the whole column, not just the current row. Start the loop at 1 and assign to a single cell:
exit_col = data.columns.get_loc('Exit')
for row in range(1, len(data)):
    if (data['Mkt_Return'].iloc[row-1] > 0) and (data['Mkt_Return'].iloc[row] > 0):
        data.iloc[row, exit_col] = 1
How about this?
import numpy as np
data["Mkt_Return_2"] = data["Mkt_Return"].shift(-1)
data["foo"] = np.where((data["Mkt_Return_2"] > 0) & (data["Mkt_Return"] > 0), 1, 0)
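A minimal vectorized sketch that writes straight to the Exit column, assuming "consecutive" means the current row and the row before it, as in the question's loop. The sample returns are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data
data = pd.DataFrame({"Mkt_Return": [0.1, 0.2, -0.1, 0.3, 0.4, 0.5]})

positive = data["Mkt_Return"] > 0
# shift(1) aligns each row with the previous row; the first row has no
# predecessor, so it is filled with False and can never be flagged
data["Exit"] = (positive & positive.shift(1, fill_value=False)).astype(int)
print(data["Exit"].tolist())  # [0, 1, 0, 0, 1, 1]
```

Using shift(-1) instead would flag the first row of each positive pair rather than the second.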

Pandas apply function does not assign values to the column

I am trying to apply this logic to a pandas dataframe:
IF base_total_price > 0
    IF base_total_discount = 0
        actual_price = base_total_price
    IF base_total_discount > 0
        actual_price = base_total_price + base_total_discount
IF base_total_price = 0
    IF base_total_discount > 0
        actual_price = base_total_discount
    IF base_total_discount = 0
        actual_price = 0
So I wrote these two apply calls:
#for all entries where base_total_price > 0
df_slice_1['actual_price'] = df_slice_1['base_total_discount'].apply(lambda x: df_slice_1['base_total_price'] if x == 0 else df_slice_1['base_total_price']+df_slice_1['base_total_discount'])
#for all entries where base_total_price = 0
df_slice_1['actual_price'] = df_slice_1['base_total_discount'].apply(lambda x: x if x == 0 else df_slice_1['base_total_discount'])
When I run the code I get this error:
ValueError: Wrong number of items passed 20, placement implies 1
I understand that it is trying to put more than one value into a single column, but I do not understand why this happens or how to solve it. All I need is to add a new column `actual_price` to the dataframe, computed according to the logic above. Please suggest a better way to implement the logic, or correct my code.
Sample data would have been useful. Try np.select(conditions, choices):
import numpy as np
conditions = [(df.base_total_price > 0) & (df.base_total_discount == 0),
              (df.base_total_price > 0) & (df.base_total_discount > 0),
              (df.base_total_price == 0) & (df.base_total_discount > 0),
              (df.base_total_price == 0) & (df.base_total_discount == 0)]
choices = [df.base_total_price, df.base_total_price.add(df.base_total_discount), df.base_total_discount, 0]
# use bracket assignment: df.actual_price = ... would not create a new column
df['actual_price'] = np.select(conditions, choices)
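Since the question came without sample data, here is a small runnable sketch of the np.select approach on made-up values. The four rows are hypothetical and cover each branch of the stated logic once.

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering all four branches
df = pd.DataFrame({
    "base_total_price":    [100, 100, 0, 0],
    "base_total_discount": [0,   10,  5, 0],
})

price = df["base_total_price"]
discount = df["base_total_discount"]

conditions = [
    (price > 0) & (discount == 0),
    (price > 0) & (discount > 0),
    (price == 0) & (discount > 0),
    (price == 0) & (discount == 0),
]
choices = [price, price + discount, discount, 0]

# Rows matching none of the conditions (e.g. negative values) get the default 0
df["actual_price"] = np.select(conditions, choices, default=0)
print(df["actual_price"].tolist())  # [100, 110, 5, 0]
```

For non-negative inputs, all four branches give the same result as price + discount, so a plain column sum may be enough if negative values cannot occur.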
I solved this question simply by using iterrows. Thanks everyone who responded

Pandas Mask on multiple Conditions

In my dataframe I want to replace every value below 1 or above 5 with NaN.
This code works:
persDf = persDf.mask(persDf < 1000)
and turns every value into NaN, but this one does not:
persDf = persDf.mask((persDf < 1) and (persDf > 5))
and I have no idea why. I have checked the documentation and various solutions to apparently similar problems but could not find an answer. Does anyone have an idea that could help me?
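Use the | operator, because a value can't be < 1 AND > 5. Python's and keyword also does not work element-wise on DataFrames, so combine the conditions with | (or &) and keep each one in parentheses: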
persDf = persDf.mask((persDf < 1) | (persDf > 5))
Another method would be to use np.where and pass the result to pd.DataFrame:
pd.DataFrame(data=np.where((df < 1) | (df > 5), np.nan, df),
             index=df.index, columns=df.columns)
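A short sketch comparing the two approaches on a made-up frame. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
persDf = pd.DataFrame({"a": [0.5, 1.0, 3.0, 6.0],
                       "b": [5.0, 5.5, 2.0, -1.0]})

# Element-wise OR: mask values that are below 1 or above 5
out_of_range = (persDf < 1) | (persDf > 5)
masked = persDf.mask(out_of_range)

# Same result via np.where, keeping the original index and columns
alt = pd.DataFrame(np.where(out_of_range, np.nan, persDf),
                   index=persDf.index, columns=persDf.columns)

print(masked)
print(masked.equals(alt))
```

The boundary values 1.0 and 5.0 are kept, because the comparisons are strict.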

Replacing value based on conditional pandas

How do you replace values in a dataframe based on a condition applied to the entire data frame, not just one column? I have tried to use df.where, but it doesn't work as planned:
df = df.where(operator.and_(df > (-1 * .2), df < 0),0)
df = df.where(df > 0 , df * 1.2)
Basically, what I'm trying to do is replace all values between -.2 and 0 with zero across all columns in my dataframe, and multiply all values greater than zero by 1.2.
You've misunderstood how DataFrame.where works: it keeps the original values where the condition is True and replaces them where it is False. Try reversing your logic:
df = df.where((df <= (-1 * .2)) | (df >= 0), 0)
df = df.where(df <= 0 , df * 1.2)
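To see the keep-where-True behaviour of where in action, here is a minimal sketch that applies the two reversed conditions to a made-up frame. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"a": [-0.5, -0.1, 0.0, 1.0],
                   "b": [2.0, -0.2, -0.05, 0.5]})

# Keep values <= -0.2 or >= 0; everything strictly between -0.2 and 0 becomes 0
step1 = df.where((df <= -0.2) | (df >= 0), 0)

# Keep values <= 0; every positive value is replaced by itself times 1.2
result = step1.where(step1 <= 0, step1 * 1.2)
print(result)
```

The second argument of where may be a scalar or a frame of the same shape, which is why df * 1.2 can be used to scale only the replaced cells.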
where gives you a one-line solution, which is great. I prefer to use a boolean mask, like so:
idx = (df < 0) & (df >= -0.2)
df[idx] = 0
I prefer splitting this into two lines because it is easier to read, but you could also put it on a single line:
df[(df < 0) & (df >= -0.2)] = 0
Just another option.
