Pandas DF nested loop to find value matching value from loop1 - python

I'm new to Python/pandas, so please don't judge. :)
I have a DataFrame with stock data (i.e., Date, Close value, ...).
Now I want to see if a given Close value will hit a target value (e.g., Close + 50 €, Close − 50 €).
I wrote a nested loop that compares every Close value against the Close values that follow it on the same day:
def calc_zv(_df, _distance):
    _df['ZV_C'] = 0
    _df['ZV_P'] = 0
    for i in range(0, len(_df)):
        _date = _df.iloc[i].get('Date')
        target_put = _df.iloc[i].get('Close') - _distance
        target_call = _df.iloc[i].get('Close') + _distance
        # scan forward through the remaining rows of the same day
        for x in range(i, len(_df) - 1):
            a = _df.iloc[x + 1].get('Close')
            _date2 = _df.iloc[x + 1].get('Date')
            if target_call <= a and _date == _date2:
                _df.loc[i, 'ZV_C'] = 1  # .loc replaces the removed .ix indexer; assumes a default RangeIndex
                break
            elif target_put >= a and _date == _date2:
                _df.loc[i, 'ZV_P'] = 1
                break
            elif _date != _date2:
                break
This works fine, but I wonder if there is a "better" (faster, more pandas-like) solution?
Thanks and best wishes.
M.
EDIT
Hi again,
here is a sample data generator:
import numpy as np
import pandas as pd
from PX.indicator_macros import calc_zv
import datetime
abc = datetime.datetime.now()
print(abc)
df2 = pd.DataFrame({'Date': pd.Timestamp('20130102'),
                    'Close': pd.Series(np.random.randn(5000))})
#print(df2.to_string())
calc_zv(df2, 2)
#print(df2.to_string())
abc = datetime.datetime.now()
print(abc)
For 5000 rows I need approx. 10 s.
I have stock data for 3 years (in 15-minute intervals), which takes some minutes.
Cheers
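One direction that is usually much faster (a sketch of my own, untested against your real data): keep the same scan-forward, early-exit logic, but work on plain NumPy arrays instead of repeated .iloc lookups, which removes most of the per-row pandas overhead. This assumes the 'Date' and 'Close' columns from the snippet above:
import numpy as np

def calc_zv_np(df, distance):
    # same early-exit logic as calc_zv, but on NumPy arrays
    close = df['Close'].to_numpy()
    dates = df['Date'].to_numpy()
    zv_c = np.zeros(len(df), dtype=int)
    zv_p = np.zeros(len(df), dtype=int)
    for i in range(len(df)):
        for x in range(i + 1, len(df)):
            if dates[x] != dates[i]:
                break  # next day reached, stop scanning
            if close[x] >= close[i] + distance:
                zv_c[i] = 1
                break
            if close[x] <= close[i] - distance:
                zv_p[i] = 1
                break
    df['ZV_C'] = zv_c
    df['ZV_P'] = zv_p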

Related

How to vectorize 10 million rows with for and if conditions in Python?

I have one column full of Unix timestamps. I need to create a new column and label a row as 'occurred' if its timestamp falls into a certain range, but I have 10 million rows to iterate over.
my code:
start_time = 1.583860e+12  # example timestamp
end_time = 1.583862e+12
for i in range(0, len(df['time'])):
    if start_time < df['time'][i] < end_time:
        df.loc[i, 'event'] = 'occurred'
    else:
        pass
The above code runs forever on 10 million rows. I tried vectorizing, but I could not do it, as I don't have much experience. Can someone help me with this?
Here is a way to use vectorized operations in pandas to flag values in the range:
df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'
We use vectorized pandas comparison and boolean operations (<, > and &) to populate the appropriate rows in the new event column.
Full test code is here:
start_time = 1.583860e+12 #example timestamp
end_time = 1.583862e+12
import pandas as pd
import random
df = pd.DataFrame({'time':random.choices(range(10), k=10000)})
df.time = df.time *1.0e6 + 1.583855e+12
df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'
print(f"count of 'occurred': {sum(df['event'] == 'occurred')}")
print(df.head())
Output:
count of 'occurred': 975
time event
0 1.583863e+12 NaN
1 1.583861e+12 occurred
2 1.583862e+12 NaN
3 1.583860e+12 NaN
4 1.583861e+12 occurred
If you prefer to have an empty string in rows without 'occurred', you can just initialize the event column in advance:
df['event'] = ''
df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'
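As a side note (my own addition, not part of the original answer), numpy.where expresses the same initialize-and-flag pattern in a single assignment:
import numpy as np
# 'occurred' where time is strictly inside the window, empty string elsewhere
df['event'] = np.where((start_time < df.time) & (df.time < end_time), 'occurred', '')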

Sorting my data frame by date (d/m/y + hour:min:sec)

I am trying to sort the values of my columns depending on the date (d/m/y + hour:min:sec). Below I will show you an example of the format of the given data:
Initiator   Price   date
XXX         560     13/05/2020 11:05:35
Glovoapp    250     12/05/2020 13:07:15
Glovoapp    250     13/04/2020 12:09:25
expected output:
if the user selects a date from 10/04/2020 00:00:00 to 15/05/2020 00:00:00:
Glovoapp: 500
XXX: 560
if the user selects a date from 10/04/2020 00:00:00 to 01/05/2020 00:00:00:
Glovoapp: 250
So far I am able to sum the prices per initiator, but without the date filtering. Any suggestions on what I should do?
def sum_method(self):
    montant_init = self.data.groupby("Initiateur")["Montant (centimes)"].sum()
    print(montant_init)
    return montant_init
I use this method for the calculation. I hope I am clear enough; thanks.
Tried answer; please correct me:
class evaluation():
    def __init__(self, df):
        self.df = df
    # Will receive 'actual' datetime from df, and user-defined 'start' and 'stop' datetimes.
    def in_range(actual, start, stop):
        return start <= actual <= stop
    def evaluate(self):
        user_start = input("Enter your start date (dd.mm.yyyy hour:min:second): ")
        user_stop = input("Enter your end date (dd.mm.yyyy hour:min:second): ")
        # creates series of True or False selecting proper rows
        mask = self.df['Date'].apply(self.in_range, args=(user_start, user_stop))
        # Do the groupby and sum on only those rows.
        montant_init = self.df.loc[mask].groupby("Initiateur")["Montant (centimes)"].sum()
        print(montant_init)
Output when printing self.df.loc[mask]:
Empty DataFrame
Columns: [Opération, Initiateur, Montant (centimes), Monnaie, Date, Résultat, Compte marchand, Adresse IP Acheteur, Marque de carte]
Index: []
The below works. There are two steps:
Make a mask to select the right rows
Then do the groupby and sum on only those rows
Mask function:
# Will receive 'actual' datetime from df, and user defined 'start' and 'stop' datetimes.
def in_range(actual, start, stop):
    return start <= actual <= stop
Then apply the mask and perform the groupby:
# creates series of True or False selecting proper rows.
mask = df['date'].apply(in_range, args=(user_start, user_stop))
# Do the groupby and sum on only those rows.
df2 = df.loc[mask].groupby('Initiator').sum()
Note that user_start and user_stop should be the defined start and stop datetimes by the user.
And you're done!
UPDATE: to include the methods as part of a class:
class evaluation():
    def __init__(self, df):
        self.df = df
    # Will receive 'actual' datetime from df, and user-defined 'start' and 'stop' datetimes. Add 'self' as arg in method.
    def in_range(self, actual, start, stop):
        return start <= actual <= stop
    def evaluate(self):
        user_start = pd.to_datetime(input("Enter your start date (yyyy.mm.dd hour:min:second): "))
        user_stop = pd.to_datetime(input("Enter your end date (yyyy.mm.dd hour:min:second): "))
        # creates series of True or False selecting proper rows
        mask = self.df['Date'].apply(self.in_range, args=(user_start, user_stop))
        # Do the groupby and sum on only those rows.
        amount_init = self.df.loc[mask].groupby("Initiator")["Price"].sum()
        print(amount_init)
Then to instantiate an object of the new class:
import pandas as pd
import dateutil.parser as dtp
import evaluation as eval # this is the class we just made
data = {
    'Initiator': ['XXX', 'Glovoapp', 'Glovoapp'],
    'Price': [560, 250, 250],
    'Date': [dtp.parse('13/05/2020 11:05:35'), dtp.parse('12/05/2020 13:07:15'), dtp.parse('13/04/2020 12:09:25')]
}
df = pd.DataFrame(data)
eval_obj = eval.evaluation(df)
eval_obj.evaluate()
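As an aside (my own sketch, not part of the answer above): the per-row apply can be replaced by the vectorized Series.between, which performs the same inclusive range check without calling a Python function once per row:
# inclusive on both ends, like in_range
mask = df['Date'].between(user_start, user_stop)
amount_init = df.loc[mask].groupby('Initiator')['Price'].sum()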

Issue of creating a Dataframe

I am trying to create a dataframe using a for loop. It works, but the output is not correct: each cell of the DataFrame contains the data for all symbols. How can I fix it?
Here is the code:
from pandas_datareader import data
import datetime
from math import exp, sqrt
import pandas as pd
records = []
test = ['AAPL', 'AAL']
for i in test:
    stock_price = data.DataReader(test,
                                  start='2021-01-01',
                                  end='2021-04-01',
                                  data_source='yahoo')['Adj Close'][-100:]
    stock_volume = data.DataReader(test,
                                   start='2021-01-01',
                                   end='2021-04-01',
                                   data_source='yahoo')['Volume'][-100:]
    returns = stock_price.pct_change()
    ((1 + returns).cumprod() - 1)
    records.append({
        'underlyingSymbol': i,
        'last_price': stock_price.iloc[-1],
        '15d_highest': stock_price.iloc[-15:].max(),
        '15d_lowest': stock_price.iloc[-15:].min(),
    })
df = pd.DataFrame(records)
df
Since you're looping over symbols, you should change data.DataReader(test... to data.DataReader(i... (otherwise it reads data for both of them on every iteration):
for i in test:
    stock_price = data.DataReader(i,
                                  start='2021-01-01',
                                  end='2021-04-01',
                                  data_source='yahoo')['Adj Close'][-100:]
    stock_volume = data.DataReader(i,
                                   start='2021-01-01',
                                   end='2021-04-01',
                                   data_source='yahoo')['Volume'][-100:]
    ...
Output:
underlyingSymbol last_price 15d_highest 15d_lowest 15d_volume \
0 AAPL 123.000000 125.57 119.900002 92403800.0
1 AAL 23.860001 25.17 21.809999 93746800.0
30d_returns 15d_returns 7d_returns volatility
0 -0.047342 0.018057 0.024240 0.325800
1 0.266432 0.030475 0.092192 0.571564
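A further tweak (my suggestion, not part of the answer above): each DataReader call downloads the full quote table, so you can fetch once per symbol and slice both fields from the result, halving the number of downloads:
for i in test:
    # one download per symbol instead of two
    quotes = data.DataReader(i,
                             start='2021-01-01',
                             end='2021-04-01',
                             data_source='yahoo')
    stock_price = quotes['Adj Close'][-100:]
    stock_volume = quotes['Volume'][-100:]
    ...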

How to improve performance on average calculations in python dataframe

I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate = p.RefDate) & (r.Item = p.Item) & (p.StartDate >= r.StartDate) & (p.EndDate <= r.EndDate)
Dummy data for this can be generated as per the below;
import pandas as pd
import numpy as np
from datetime import datetime
######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01','2019-12-31')
p = pd.DataFrame(columns=['RefDate','Item','StartDate','EndDate','Val'])
for item in ['A', 'B', 'C', 'D']:
    for date in daily_range:
        daily_p = pd.DataFrame({'RefDate': rng,
                                'Item': item,
                                'StartDate': date,
                                'EndDate': date,
                                'Val': np.random.randint(0, 100, len(rng))})
        p = p.append(daily_p)
r = pd.DataFrame(columns=['RefDate','Item','PeriodStartDate','PeriodEndDate','AvgVal'])
for item in ['A', 'B', 'C', 'D']:
    r1 = pd.DataFrame({'RefDate': rng,
                       'Item': item,
                       'PeriodStartDate': '2019-10-25',
                       'PeriodEndDate': '2019-10-31',  # datetime(2019,10,31)
                       'AvgVal': 0})
    r = r.append(r1)
r.reset_index(drop=True, inplace=True)
######### END CREATION OF DUMMY DATA ##########
The piece of code I currently have calculating and would like to improve the performance of is as follows
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that, when generating the r DataFrame, both PeriodStartDate and PeriodEndDate are created as datetime; see the following fragment of your initialization code, changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item': item,
                   'PeriodStartDate': pd.to_datetime('2019-10-25'),
                   'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I set the index in both DataFrames to RefDate and Item
(the two columns compared on equality) and sorted by the index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, the access by index is significantly quicker.
Then I defined the following function computing the mean for rows
from p "related to" the current row from r:
def myMean(row):
    # row.name is the (RefDate, Item) index of the current r row
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
              pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine
and got a result almost 10 times shorter.
Check on your own.
By using iterrows I managed to improve the performance, although there may still be quicker ways:
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price
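Yet another direction (a sketch of my own, not from either answer): because RefDate and Item are compared on equality, you can merge r with p on those keys, filter the window condition vectorized, and compute every mean in a single groupby. This assumes PeriodStartDate/PeriodEndDate were converted to datetime as in the first answer:
# one row per (r row, matching p row) pair
m = r[['RefDate', 'Item', 'PeriodStartDate', 'PeriodEndDate']].merge(
        p, on=['RefDate', 'Item'])
# keep only p rows whose period lies inside the r window
m = m[(m.StartDate >= m.PeriodStartDate) & (m.EndDate <= m.PeriodEndDate)]
means = m.groupby(['RefDate', 'Item'])['Val'].mean().rename('AvgVal')
r = r.drop(columns='AvgVal').merge(means.reset_index(),
                                   on=['RefDate', 'Item'], how='left')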

Problem with multiprocessing in Python writing into csv

I created a function which takes a date as an argument and writes the produced output into a CSV file. If I run a multiprocessing Pool with e.g. 28 processes and a list of 100 dates, then the last 72 rows in the output CSV are twice as long as they should be (a joined repetition of the last 72 rows).
My code:
import numpy as np
import pandas as pd
import multiprocessing
#Load the data
df = pd.read_csv('data.csv', low_memory=False)
list_s = df.date.unique()
def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output
    return sample

# list_s is a list of dates I want to calculate function funk for
def mp_handler():
    # 28 is the number of processes I want to run
    p = multiprocessing.Pool(28)
    for result in p.imap(funk, list_s[0:100]):
        result.to_csv('crsp_full.csv', mode='a')

if __name__ == '__main__':
    mp_handler()
And the output looks like this:
date,port_ret_1,port_ret_2
2010-03-05,0.0,0.002
date,port_ret_1,port_ret_2
2010-02-12,-0.001727,0.009139189315
...
# and after first 28 rows like this
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-03,0.002045,0.00045092025,0.002045,0.00045092025
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-15,-0.006055,-0.00188451972,-0.006055,-0.00188451972
I tried to insert a lock into funk(), but it yielded the same results and just took more time. Any ideas how to fix it?
Edit. funk looks like this. e is equivalent to date.
def funk(e):
    block = pd.DataFrame()
    i = s_list.index(e)
    if i > 19:
        ran = s_list[i-19:i+6]
        ran0 = s_list[i-19:i+1]
        # print ran0
        piv = df.pivot(index='date', columns='permno', values='date')
        # Drop the stocks which do not have returns for the given time window and make the list of suitable stocks
        s = list(piv.loc[ran].dropna(axis=1).columns)
        sample = df[df['permno'].isin(s)]
        sample = sample.loc[ran]
        permno = ['10001', '93422']
        sample = sample[sample['permno'].isin(permno)]
        # print sample.index.unique()
        # get past 20 days returns in additional 20 columns
        for i in range(0, 20):
            sample['r_{}'.format(i)] = sample.groupby('permno')['ret'].shift(i)
        # merge dataset with betas
        sample = pd.merge(sample, betas_aug, left_index=True, right_index=True)
        sample['ex_ret'] = 0
        # calculate expected return
        for i in range(0, 20):
            sample['ex_ret'] += sample['ma_beta_{}'.format(i)] * sample['r_{}'.format(i)]
        # print(sample)
        # define a stock into two legs based on expected return
        sample['sign'] = sample['ex_ret'].apply(lambda x: -1 if x < 0 else 1)
        # workaround for short leg, multiply returns by -1
        sample['abs_ex_ret'] = sample['ex_ret'] * sample['sign']
        # create 5 columns for future realised 5 days returns (multiplied by -1 for short leg)
        for i in range(1, 6):
            sample['rp_{}'.format(i)] = sample.groupby(['permno'])['ret'].shift(-i)
            sample['rp_{}'.format(i)] = sample['rp_{}'.format(i)] * sample['sign']
        sample = sample.reset_index(drop=True)
        sample['w_0'] = sample['abs_ex_ret'].div(sample.groupby(['date'])['abs_ex_ret'].transform('sum'))
        for i in range(1, 5):
            sample['w_{}'.format(i)] = sample['w_{}'.format(i-1)] * (1 + sample['rp_{}'.format(i)])
        sample = sample.dropna(how='any')
        for k in range(0, 20):
            sample.drop(columns=['ma_beta_{}'.format(k), 'r_{}'.format(k)])
        for k in range(1, 6):
            sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)] * sample['rp_{}'.format(k)]
            q = ['port_ret_{}'.format(k)]
            list_names.extend(q)
        block = sample.groupby('date')[list_names].sum().copy()
    return block
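Two things stand out to me here (my own reading of the code above, not a confirmed fix). First, list_names appears to be a module-level list that funk extends on every call; a pooled worker process is reused for several dates, so from its second task onward list_names holds duplicated column names, which would produce exactly the doubled rows reported for the last 100 − 28 = 72 results. Making the list local to funk avoids that. Second, to_csv(..., mode='a') writes a header line on every append, which matches the repeated "date,port_ret_1,port_ret_2" lines in the output; writing the header only once keeps the file clean. A sketch:
def funk(e):
    list_names = []  # local: a reused pool worker now starts fresh for every date
    ...
    for k in range(1, 6):
        sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)] * sample['rp_{}'.format(k)]
        list_names.append('port_ret_{}'.format(k))
    block = sample.groupby('date')[list_names].sum().copy()
    return block

def mp_handler():
    p = multiprocessing.Pool(28)
    for n, result in enumerate(p.imap(funk, list_s[0:100])):
        # write the header once, then append only data rows
        result.to_csv('crsp_full.csv', mode='a', header=(n == 0))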
