The following code takes too long to run (more than 5 minutes).
Is there a good way to reduce the running time?
data.head() # more than 10 year data, Total iteration is around 4,500,000
Open High Low Close Volume Adj Close \
Date
2012-07-02 125500.0 126500.0 124000.0 125000.0 118500 104996.59
2012-07-03 126500.0 130000.0 125500.0 129500.0 239400 108776.47
2012-07-04 130000.0 132500.0 128500.0 131000.0 180800 110036.43
2012-07-05 129500.0 131000.0 127500.0 128500.0 118600 107936.50
2012-07-06 128500.0 129000.0 126000.0 127000.0 149000 106676.54
My Code is
import pandas as pd
import numpy as np
from pandas.io.data import DataReader
import matplotlib.pylab as plt
from datetime import datetime
def DataReading(code):
    start = datetime(2012,7,1)
    end = pd.to_datetime('today')
    data = DataReader(code,'yahoo',start=start,end=end)
    data = data[data["Volume"] != 0]
    return data
data['Cut_Off'] = 0
Cut_Pct = 0.85
for i in range(len(data['Open'])):
    if i==0:
        pass
    for j in range(0,i):
        if data['Close'][j]/data['Close'][i-1]<=Cut_Pct:
            data['Cut_Off'][j] = 1
            data['Cut_Off'][i] = 1
        else:
            pass
The code above takes more than 5 minutes to run.
There are further "elif" branches after it that I have not shown here;
I only tested the code above.
Is there a good way to reduce its running time?
Additional information:
The buying list is:
Open High Low Close Volume Adj Close \
Date
2012-07-02 125500.0 126500.0 124000.0 125000.0 118500 104996.59
2012-07-03 126500.0 130000.0 125500.0 129500.0 239400 108776.47
2012-07-04 130000.0 132500.0 128500.0 131000.0 180800 110036.43
2012-07-05 129500.0 131000.0 127500.0 128500.0 118600 107936.50
2012-07-06 128500.0 129000.0 126000.0 127000.0 149000 106676.54
2012-07-09 127000.0 133000.0 126500.0 131500.0 207500 110456.41
2012-07-10 131500.0 135000.0 130500.0 133000.0 240800 111716.37
2012-07-11 133500.0 136500.0 132500.0 136500.0 223800 114656.28
For example, I bought 10 shares on 2012-07-02 at 125,500. As the days go
by, if the close price drops below 85% of the buying price (125,500), I
will sell all 10 shares at 85% of the buying price.
To reduce the running time, I also built the buying list in advance (not shown here),
but that also takes more than 2 minutes with a for loop.
Rather than running roughly 4.5 million loop iterations over your data, use pandas' built-in indexing features. I've rewritten the loop at the end of your code as below:
data.loc[data.Close/data.Close.shift(1) <= Cut_Pct,'Cut_Off'] = 1
.loc selects the rows that satisfy the condition given as its first argument. .shift moves the values up or down by the number of rows passed to it, so shift(1) lines each row up with the previous day's close.
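As a small, self-contained illustration of what .loc and .shift do here, the sketch below uses made-up prices. It assumes the intended rule is "flag a day whose close is at most 85% of the previous day's close"; that is a simplification and not a line-for-line equivalent of the nested loop in the question.

```python
import pandas as pd

# Made-up closing prices; only the third value drops by more than 15%.
data = pd.DataFrame({"Close": [100.0, 98.0, 80.0, 81.0, 90.0]})
data["Cut_Off"] = 0
Cut_Pct = 0.85

# shift(1) moves every value down one row, so each row sees the previous close.
ratio = data["Close"] / data["Close"].shift(1)

# The first ratio is NaN (no previous row); NaN <= Cut_Pct is False, so it is skipped.
data.loc[ratio <= Cut_Pct, "Cut_Off"] = 1
print(data)
```

Because the comparison runs on whole columns at once, there is no Python-level loop over rows.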
Related
I am working with stock data coming from Yahoo Finance.
import pandas as pd
import yfinance as yf

def load_y_finance_data(y_finance_tickers: list):
    df = pd.DataFrame()
    print("Loading Y-Finance data ...")
    for ticker in y_finance_tickers:
        df[ticker.replace("^", "")] = yf.download(
            ticker,
            auto_adjust=True, # only download adjusted data
            progress=False,
        )["Close"]
    print("Done loading Y-Finance data!")
    return df
x = load_y_finance_data(["^VIX", "^GSPC"])
x
VIX GSPC
Date
1990-01-02 17.240000 359.690002
1990-01-03 18.190001 358.760010
1990-01-04 19.219999 355.670013
1990-01-05 20.110001 352.200012
1990-01-08 20.260000 353.790009
DataSize=(8301, 2)
Here I want to perform a sliding-window operation over every 50-day period: compute the correlation (using the corr() function) for a 50-day slice of the data (day_1 to day_50), then move the window forward by one day (day_2 to day_51), and so on.
I tried the naive approach of a for loop, and it works, but it takes too much time. Code below:
data_size = len(x)
period = 50
df = pd.DataFrame()
for i in range(data_size-period):
    df.loc[i, "GSPC_VIX_corr"] = x[["GSPC", "VIX"]][i:i+period].corr().loc["GSPC", "VIX"]
df
GSPC_VIX_corr
0 -0.703156
1 -0.651513
2 -0.602876
3 -0.583256
4 -0.589086
How can I do this more efficiently? Is there any built-in way I can use?
Thanks :)
You can use the rolling-window functionality of pandas, which supports many different aggregations, including corr(). Instead of your for loop, do this:
x["VIX"].rolling(window=period).corr(x["GSPC"])
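To see how the rolling result lines up with the loop-based one, here is a minimal sketch on synthetic data (the random series are hypothetical stand-ins for the downloaded VIX and GSPC columns):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded data (hypothetical values).
rng = np.random.default_rng(0)
x = pd.DataFrame({"VIX": rng.normal(size=200).cumsum(),
                  "GSPC": rng.normal(size=200).cumsum()})
period = 50

# Vectorised rolling correlation; the value at row i covers rows i-period+1 .. i.
rolling_corr = x["VIX"].rolling(window=period).corr(x["GSPC"])

# The loop-based result for the window starting at row 0 ends at row period-1.
loop_first = x[["GSPC", "VIX"]][0:period].corr().loc["GSPC", "VIX"]
print(rolling_corr.iloc[period - 1], loop_first)
```

Note that the rolling result is labelled by the last row of each window and starts with period-1 NaN values, whereas the loop labels each window by its first row. Calling .dropna().reset_index(drop=True) on the rolling result gives the loop's numbering.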
So I have an excel file containing data on a specific stock.
My excel file contains about 2 months of data, it monitors the Open price, Close price, High Price, Low Price and Volume of trades in 5 minute intervals, so there are about 3000 rows in my file.
I want to calculate the RSI (or EMA if it's easier) of a stock daily. I'm making a summary table that collects the daily data, so it converts my table of 3000+ rows into a table with only about 60 rows (each row represents one day).
Essentially I want some sort of code that sorts the excel data by date and then calculates the RSI as a single value for that day. RSI is given by 100 - (100 / (1 + RS)), where RS = (average gain of up periods) / (average loss of down periods).
Note: My excel uses 'Datetime' so each row's 'Datetime' looks something like '2022-03-03 9:30-5:00' and the next row would be '2022-03-03 9:35-5:00', etc. So the code needs to just look at the date and ignore the time I guess.
Some code to maybe help understand what I'm looking for:
So here I'm calling my excel file, I want the code to take the called excel file, group data by date and then calculate the RSI of each day using the formula I wrote above.
dat = pd.read_csv('AMD_5m.csv',index_col='Datetime',parse_dates=['Datetime'],
date_parser=lambda x: pd.to_datetime(x, utc=True))
dates = backtest.get_dates(dat.index)
#create a summary table
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio','RSI'] #add addtional fields if necessary
summary_table = pd.DataFrame(index = dates, columns=cols)
# loop backtest by dates
This is the code I used to fill out the other columns in my summary table, I'll put my SMA (simple moving average) function below.
for d in dates:
    this_dat = dat.loc[dat.index.date==d]
    #find the number of observations in date d
    summary_table.loc[d, 'Num. Obs.'] = this_dat.shape[0]
    #get trading (i.e. position holding) signals
    signals = backtest.SMA(this_dat['Close'].values, window=10)
    #find the number of trades in date d
    summary_table.loc[d, 'Num. Trade'] = np.sum(np.diff(signals)==1)
    #find PnLs for 100 shares
    shares = 100
    PnL = -shares*np.sum(this_dat['Close'].values[1:]*np.diff(signals))
    if np.sum(np.diff(signals))>0:
        #close position at market close
        PnL += shares*this_dat['Close'].values[-1]
    summary_table.loc[d, 'PnL'] = PnL
    #find the win ratio
    ind_in = np.where(np.diff(signals)==1)[0]+1
    ind_out = np.where(np.diff(signals)==-1)[0]+1
    num_win = np.sum((this_dat['Close'].values[ind_out]-this_dat['Close'].values[ind_in])>0)
    if summary_table.loc[d, 'Num. Trade']!=0:
        summary_table.loc[d, 'Win. Ratio'] = 1. *num_win/summary_table.loc[d, 'Num. Trade']
This is my function for calculating Simple Moving Average. I was told to try and adapt this for RSI or for EMA (Exponential Moving Average). Apparently adapting this for EMA isn't too troublesome but I can't figure it out.
def SMA(p,window=10,signal_type='buy only'):
    #input price "p", look-back window "window",
    #signal type = buy only (default) --gives long signals, sell only --gives sell signals, both --gives both long and short signals
    #return a list of signals = 1 for long position and -1 for short position
    signals = np.zeros(len(p))
    if len(p)<window:
        #no signal if no sufficient data
        return signals
    sma = list(np.zeros(window)+np.nan) #the first few prices does not give technical indicator values
    sma += [np.average(p[k:k+window]) for k in np.arange(len(p)-window)]
    for i in np.arange(len(p)-1):
        if np.isnan(sma[i]):
            continue #skip the open market time window
        if sma[i]<p[i] and (signal_type=='buy only' or signal_type=='both'):
            signals[i] = 1
        elif sma[i]>p[i] and (signal_type=='sell only' or signal_type=='both'):
            signals[i] = -1
    return signals
I have two solutions to this. One is to loop through each group and add the relevant value to summary_table; the other is to calculate the whole series in one go and assign it to the RSI column.
I first recreated the data:
import yfinance
import pandas as pd
# initially created similar data through yfinance,
# then copied this to Excel and changed the Datetime column to match yours.
df = yfinance.download("AAPL", period="60d", interval="5m")
# copied it and read it as a dataframe
df = pd.read_clipboard(sep=r'\s{2,}', engine="python")
df.head()
# Datetime Open High Low Close Adj Close Volume
#0 2022-03-03 09:30-05:00 168.470001 168.910004 167.970001 168.199905 168.199905 5374241
#1 2022-03-03 09:35-05:00 168.199997 168.289993 167.550003 168.129898 168.129898 1936734
#2 2022-03-03 09:40-05:00 168.119995 168.250000 167.740005 167.770004 167.770004 1198687
#3 2022-03-03 09:45-05:00 167.770004 168.339996 167.589996 167.718094 167.718094 2128957
#4 2022-03-03 09:50-05:00 167.729996 167.970001 167.619995 167.710007 167.710007 968410
Then I formatted the data and created the summary_table:
df["date"] = pd.to_datetime(df["Datetime"].str[:16], format="%Y-%m-%d %H:%M").dt.date
# calculate percentage change from open and close of each row
df["gain"] = (df["Close"] / df["Open"]) - 1
# your summary table, slightly changing the index to use the dates above
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio','RSI'] #add addtional fields if necessary
summary_table = pd.DataFrame(index=df["date"].unique(), columns=cols)
Option 1:
# loop through each group, calculate the average gain and loss, then RSI
for grp, data in df.groupby("date"):
    # average gain for gain greater than 0
    average_gain = data[data["gain"] > 0]["gain"].mean()
    # average loss for gain less than 0
    average_loss = data[data["gain"] < 0]["gain"].mean()
    # add to relevant cell of summary_table
    summary_table.loc[grp, "RSI"] = 100 - (100 / (1 + (average_gain / average_loss)))
Option 2:
# define a function to apply in the groupby
def rsi_calc(series):
    avg_gain = series[series > 0].mean()
    avg_loss = series[series < 0].mean()
    return 100 - (100 / (1 + (avg_gain / avg_loss)))
summary_table["RSI"] = df.groupby("date")["gain"].apply(rsi_calc)
Output (same for each):
summary_table.head()
# Num. Obs. Num. Trade PnL Win. Ratio RSI
#2022-03-03 NaN NaN NaN NaN -981.214015
#2022-03-04 NaN NaN NaN NaN 501.950956
#2022-03-07 NaN NaN NaN NaN -228.379066
#2022-03-08 NaN NaN NaN NaN -2304.451654
#2022-03-09 NaN NaN NaN NaN -689.824739
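RSI is conventionally bounded between 0 and 100, which requires the average loss to enter the formula as a positive magnitude. The values printed above fall outside that range because the average of the negative gains is itself negative. A hedged variant that takes the absolute value is sketched below; the function name rsi_calc_abs and the sample gains are hypothetical.

```python
import pandas as pd

def rsi_calc_abs(series):
    """RSI from per-bar gains, using the magnitude of the average loss."""
    avg_gain = series[series > 0].mean()
    avg_loss = series[series < 0].abs().mean()
    return 100 - (100 / (1 + (avg_gain / avg_loss)))

# Made-up per-bar gains for two days (hypothetical values).
df = pd.DataFrame({
    "date": ["2022-03-03"] * 4 + ["2022-03-04"] * 4,
    "gain": [0.02, -0.01, 0.01, -0.02, 0.03, 0.01, -0.01, -0.01],
})
rsi = df.groupby("date")["gain"].apply(rsi_calc_abs)
print(rsi)
```

A day with no losing bars (or no winning bars) still yields NaN here, so that case would need its own handling.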
I am trying to use Python pandas to compute:
the 10-day and 30-day cumulative % performance of (stock ticker RTH minus stock ticker SPY) after a certain performance threshold in stock ticker USO occurs (>= 10% in a 5-day window).
Here is my code:
import pandas as pd
import datetime
import pandas_datareader.data as web
from pandas import Series, DataFrame
start = datetime.datetime(2012, 4, 1)
end = datetime.datetime.now()
dfcomp = web.DataReader(['USO', 'RTH', 'SPY'],'yahoo',start=start,end=end)['Adj Close']
dfcomp_daily_returns = dfcomp.pct_change()
dfcomp_daily_returns = dfcomp_daily_returns.dropna().copy()
dfcomp_daily_returns.head()
Symbols USO RTH SPY
Date
2012-04-03 -0.009243 -0.004758 -0.004089
2012-04-04 -0.020676 -0.007411 -0.009911
2012-04-05 0.010814 0.003372 -0.000501
2012-04-09 -0.007387 -0.006961 -0.011231
2012-04-10 -0.011804 -0.018613 -0.016785
I added several more rows so it might be easier to work with if someone can help
Symbols USO RTH SPY
Date
2012-04-03 -0.009243 -0.004758 -0.004089
2012-04-04 -0.020676 -0.007411 -0.009911
2012-04-05 0.010814 0.003372 -0.000501
2012-04-09 -0.007387 -0.006961 -0.011231
2012-04-10 -0.011804 -0.018612 -0.016785
2012-04-11 0.012984 0.010345 0.008095
2012-04-12 0.011023 0.010970 0.013065
2012-04-13 -0.007353 -0.004823 -0.011888
2012-04-16 0.000766 0.004362 -0.000656
2012-04-17 0.011741 0.015440 0.014812
2012-04-18 -0.014884 -0.000951 -0.003379
2012-04-19 -0.002305 -0.006183 -0.006421
2012-04-20 0.011037 0.002632 0.001670
2012-04-23 -0.009139 -0.015513 -0.008409
2012-04-24 0.003587 -0.004364 0.003802
I think this is a solution to your question. Note that I copied your code up to dropna(), and have also used import numpy as np. You don't need to use from pandas import Series, DataFrame, especially as you have already used import pandas as pd.
The main computations use rolling, apply and where.
# 5-day cumulative %
dfcomp_daily_returns["5_day_cum_%"] = dfcomp_daily_returns["USO"].rolling(5).apply(lambda x: np.prod(1+x)-1)
# RTH - SPY
dfcomp_daily_returns["RTH-SPY"] = dfcomp_daily_returns["RTH"] - dfcomp_daily_returns["SPY"]
# 10-day cumulative %
dfcomp_daily_returns["output_10"] = dfcomp_daily_returns["RTH-SPY"].rolling(10).apply(lambda x: np.prod(1+x)-1).shift(-10).where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan)
# 30-day cumulative %
dfcomp_daily_returns["output_30"] = dfcomp_daily_returns["RTH-SPY"].rolling(30).apply(lambda x: np.prod(1+x)-1).shift(-30).where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan)
I won't print the output, given that there are thousands of rows, and the occurrences of ["5_day_cum_%"] > 0.1 are irregular.
How this code works:
The 5_day_cum_% is calculated using a rolling 5-day window, with the product of the values in this window.
RTH-SPY is column RTH "minus" column SPY.
The output columns calculate the rolling product of RTH-SPY, then use .shift() to turn the backward-looking window into a forward-looking one (.rolling() with an integer window only looks backwards). This idea came from Daniel Manso. Finally, .where() keeps these values only where [5_day_cum_%] > 0.1 (i.e. 10%) and returns np.nan otherwise.
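The shift trick is easier to see on a tiny series. The sketch below uses made-up returns and a 2-day window (both hypothetical) to show which rows end up in each forward-looking value:

```python
import numpy as np
import pandas as pd

# Made-up daily returns (hypothetical values).
r = pd.Series([0.01, 0.02, -0.01, 0.03, 0.00, 0.01])
n = 2

# Backward-looking: the value at row t compounds rows t-n+1 .. t.
backward = r.rolling(n).apply(lambda x: np.prod(1 + x) - 1, raw=True)

# Shifting by -n relabels it so that row t holds the return over rows t+1 .. t+n.
forward = backward.shift(-n)
print(forward)
```

The forward window therefore starts on the day after the signal, and the last n rows are NaN because their future returns are not available yet.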
Additions from comments
From your additions in the comments, here are two options for each of those: one using .where() again, the other using standard pandas filtering (boolean indexing). In both cases, the standard filtering is shorter.
A list of all the dates:
# Option 1: DataFrame.where
list(dfcomp_daily_returns.where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan).dropna(subset=["5_day_cum_%"]).index)
# Option 2: standard pandas filtering
list(dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1].index)
A dataframe of only those with 5-day return greater than 10%:
# Option 1: DataFrame.where
dfcomp_daily_returns.where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan).dropna(subset=["5_day_cum_%"])[["5_day_cum_%", "output_10", "output_30"]]
# Option 2: standard pandas row filtering
dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1][["5_day_cum_%", "output_10", "output_30"]]
I'm reading a very large (15M lines) csv file into a pandas dataframe. I then want to split it into smaller ones (ultimately creating smaller csv files, or a pandas panel...).
I have working code but it's VERY slow. I believe it's not taking advantage of the fact that my dataframe is 'ordered'.
The df looks like:
ticker date open high low
0 AAPL 1999-11-18 45.50 50.0000 40.0000
1 AAPL 1999-11-19 42.94 43.0000 39.8100
2 AAPL 1999-11-22 41.31 44.0000 40.0600
...
1000 MSFT 1999-11-18 45.50 50.0000 40.0000
1001 MSFT 1999-11-19 42.94 43.0000 39.8100
1002 MSFT 1999-11-22 41.31 44.0000 40.0600
...
7663 IBM 1999-11-18 45.50 50.0000 40.0000
7664 IBM 1999-11-19 42.94 43.0000 39.8100
7665 IBM 1999-11-22 41.31 44.0000 40.0600
I want to take all rows where ticker=='AAPL' and make a dataframe from them, then all rows where ticker=='MSFT', and so on. The number of rows for each ticker is NOT the same, and the code has to adapt. I might load in a new 'large' csv where everything is different.
This is what I came up with:
#Read database
alldata = pd.read_csv('./alldata.csv')
#get a list of all unique ticker present in the database
alltickers = alldata.iloc[:,0].unique()
#write data of each ticker in its own csv file
for ticker in alltickers:
    print('Creating csv for '+ticker)
    #get data for current ticker
    tickerdata = alldata.loc[alldata['ticker'] == ticker]
    #remove column with ticker symbol (will be the file name) and reindex as
    #we're grabbing from somewhere in a large dataframe
    tickerdata = tickerdata.iloc[:,1:13].reset_index(drop=True)
    #write csv
    tickerdata.to_csv('./split/'+ticker+'.csv')
This takes forever to run. I thought it was the file I/O, but when I commented out the csv-writing part of the for loop, I saw that this line is the problem:
tickerdata = alldata.loc[alldata['ticker'] == ticker]
I wonder if pandas is looking in the WHOLE dataframe every single time. I do know that the dataframe is in order of ticker. Is there a way to leverage that?
Thank you very much!
Dave
The easiest way to do this is to create a dictionary of the dataframes using a dictionary comprehension and pandas groupby:
dodf = {ticker: sub_df for ticker, sub_df in alldata.groupby('ticker')}
dodf['IBM']
ticker date open high low
7663 IBM 1999-11-18 45.50 50.0 40.00
7664 IBM 1999-11-19 42.94 43.0 39.81
7665 IBM 1999-11-22 41.31 44.0 40.06
It makes sense that creating a boolean index of length 15 million, and doing it repeatedly, is going to take a little while. Honestly, for splitting the file into subfiles, I think Pandas is the wrong tool for the job. I'd just use a simple loop to iterate over the lines in the input file, writing them to the appropriate output file as they come. This doesn't even have to load the whole file at once, so it will be fairly fast.
import itertools as it
tickers = set()
with open('./alldata.csv') as f:
    headers = next(f)
    for ticker, lines in it.groupby(f, lambda s: s.split(',', 1)[0]):
        with open('./split/{}.csv'.format(ticker), 'a') as w:
            if ticker not in tickers:
                w.writelines([headers])
                tickers.add(ticker)
            w.writelines(lines)
Then you can load each individual split file using pd.read_csv() and turn that into its own DataFrame.
If you know that the file is ordered by ticker, then you can skip everything involving the set tickers (which tracks which tickers have already been encountered). But that's a fairly cheap check.
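To make the behaviour of itertools.groupby concrete, here is a minimal sketch that runs entirely in memory. The io.StringIO contents are a hypothetical stand-in for alldata.csv, and the blocks are collected in a dict instead of being written to files.

```python
import io
import itertools as it

# In-memory stand-in for the CSV file (hypothetical contents).
f = io.StringIO(
    "ticker,date,open\n"
    "AAPL,1999-11-18,45.50\n"
    "AAPL,1999-11-19,42.94\n"
    "MSFT,1999-11-18,45.50\n"
)
header = next(f)

# groupby yields one group per run of consecutive lines sharing a first field.
blocks = {}
for ticker, lines in it.groupby(f, key=lambda s: s.split(",", 1)[0]):
    blocks[ticker] = header + "".join(lines)

print(list(blocks))
```

Since groupby only merges consecutive lines, a ticker that reappears later in an unsorted file would start a new group. That is the case the tickers set guards against when appending to the output files.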
Probably, the best approach is to use groupby. Suppose:
>>> df
ticker v1 v2
0 A 6 0.655625
1 A 2 0.573070
2 A 7 0.549985
3 B 32 0.155053
4 B 10 0.438095
5 B 26 0.310344
6 C 23 0.558831
7 C 15 0.930617
8 C 32 0.276483
Then group:
>>> grouped = df.groupby('ticker', as_index=False)
Finally, iterate over your groups:
>>> for g, df_g in grouped:
...     print('creating csv for ', g)
...     print(df_g.to_csv())
...
creating csv for A
,ticker,v1,v2
0,A,6,0.6556248347252436
1,A,2,0.5730698850517599
2,A,7,0.5499849530664374
creating csv for B
,ticker,v1,v2
3,B,32,0.15505313728451087
4,B,10,0.43809490694469133
5,B,26,0.31034386153099336
creating csv for C
,ticker,v1,v2
6,C,23,0.5588311692150466
7,C,15,0.930617426953476
8,C,32,0.2764826801584902
Of course, here I am printing a csv, but you can do whatever you want.
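For instance, writing each group to its own file could look like the sketch below. The frame is made up, and a temporary directory is used as a stand-in for './split/' so the example can run anywhere.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame (hypothetical values).
df = pd.DataFrame({"ticker": ["A", "A", "B"], "v1": [6, 2, 32]})

# A temporary directory stands in for './split/'.
out_dir = tempfile.mkdtemp()
written = []
for ticker, df_g in df.groupby("ticker"):
    path = os.path.join(out_dir, f"{ticker}.csv")
    # Drop the ticker column (it is the file name) and the old index.
    df_g.drop(columns="ticker").to_csv(path, index=False)
    written.append(path)

print(written)
```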
Using groupby is great, but it does not take advantage of the fact that the data is presorted and so will likely have more overhead compared to a solution that does. For a large dataset, this could be a noticeable slowdown.
Here is a method which is optimized for the sorted case:
import pandas as pd
import numpy as np
alldata = pd.read_csv("tickers.csv")
tickers = np.array(alldata.ticker)
# use numpy to compute change points, should
# be super fast and yield performance boost over groupby:
change_points = np.where(
    tickers[1:] != tickers[:-1])[0].tolist()
# add last point in as well to get last ticker block
change_points += [tickers.size - 1]
prev_idx = 0
for idx in change_points:
    ticker = alldata.ticker[idx]
    print('Creating csv for ' + ticker)
    # get data for current ticker
    tickerdata = alldata.iloc[prev_idx: idx + 1]
    tickerdata = tickerdata.iloc[:, 1:13].reset_index(drop=True)
    tickerdata.to_csv('./split/' + ticker + '.csv')
    prev_idx = idx + 1
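The change-point step can be checked in isolation on a short array. In this sketch the ticker values are made up, and the (start, stop) pairs are an illustrative way to view the blocks that the loop slices with iloc.

```python
import numpy as np

# Made-up, already-sorted ticker column (hypothetical values).
tickers = np.array(["AAPL", "AAPL", "MSFT", "MSFT", "MSFT", "IBM"])

# Index of the last row of every block except the final one.
change_points = np.where(tickers[1:] != tickers[:-1])[0].tolist()
# Append the last row so the final block is included too.
change_points += [tickers.size - 1]

# Convert to (start, stop) pairs usable as iloc[start:stop].
starts = [0] + [i + 1 for i in change_points[:-1]]
blocks = [(s, e + 1) for s, e in zip(starts, change_points)]
print(blocks)
```

Each block is found with a single vectorised comparison, so the cost does not grow with the number of distinct tickers the way repeated boolean masks do.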
I have a pandas dataframe with ~3900 rows and 6 columns compiled from Google Finance. One of these columns defines a time in unix format, specifically a time during the trading day for a market, in this case the DJIA from 9:30 AM EST to 4 PM EST. However, only the cell for the beginning of each day (9:30 AM) has the complete unix time stamp (prefixed with an 'a'); the others hold the number of minutes after the first time of the day.
Here is an example of the raw data:
Date Close High Low Open Volume
0 a1450449000 173.87 173.87 173.83 173.87 46987
1 1 173.61 173.83 173.55 173.78 19275
2 2 173.37 173.63 173.37 173.60 16014
3 3 173.50 173.59 173.31 173.34 14198
4 4 173.50 173.57 173.46 173.52 7010
Date Close High Low Open Volume
388 388 171.16 171.27 171.15 171.26 11809
389 389 171.11 171.23 171.07 171.18 30449
390 390 170.89 171.16 170.89 171.09 163937
391 a1450708200 172.28 172.28 172.28 172.28 23880
392 1 172.27 172.27 172.00 172.06 2719
The change at index 391 is not contiguous, so a solution like @Stefan's would unfortunately not correctly adjust the Date value.
I can easily enough go through with a lambda and line by line remove the 'a' (if necessary) convert the values to an integer and convert the minutes past 930A into seconds with the following code:
import pandas as pd
import numpy as np
import datetime
bars = pd.read_csv(r'http://www.google.com/finance/getprices?i=60&p=10d&f=d,o,h,l,c,v&df=cpct&q=DIA', skiprows=7, header=None, names=['Date', 'Close', 'High', 'Low', 'Open', 'Volume'])
bars['Date'] = bars['Date'].map(lambda x: int(x[1:]) if x[0] == 'a' else int(x))
bars['Date'] = bars['Date'].map(lambda u: u * 60 if u < 400 else u)
Now what I would like to do is, without iterating over the dataframe, determine if the value of bars['Date'] is not a unix time stamp (e.g. < 24000 in the terms of this data set). If so I want to add that value to the time stamp for that particular day to create a complete unix time stamp for each entry.
I know that I can compare the previous row via:
bars['Date'][:-1]>bars['Date'][1:]
I feel like that would be the way to go, but I can't figure out a way to use this in a function, as it returns a series.
Thanks in advance for any help!
You could add a new column that always contains the latest Timestamp and then add to the Date where necessary.
threshold = 24000
bars['Timestamp'] = bars[bars['Date']>threshold].loc[:, 'Date']
bars['Timestamp'] = bars['Timestamp'].ffill()
bars['Date'] = bars.apply(lambda x: x.Date + x.Timestamp if x.Date < threshold else x.Date, axis=1)
bars.drop('Timestamp', axis=1, inplace=True)
to get:
Date Close High Low Open Volume
0 1450449000 173.87 173.870 173.83 173.87 46987
1 1450449060 173.61 173.830 173.55 173.78 19275
2 1450449120 173.37 173.630 173.37 173.60 16014
3 1450449180 173.50 173.590 173.31 173.34 14198
4 1450449240 173.50 173.570 173.46 173.52 7010
5 1450449300 173.66 173.680 173.44 173.45 10597
6 1450449360 173.40 173.670 173.34 173.67 14270
7 1450449420 173.36 173.360 173.13 173.32 22485
8 1450449480 173.29 173.480 173.25 173.36 18542
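The same idea can also be written without a row-wise apply. The following is a minimal sketch on made-up values, assuming the Date column already holds integers (full timestamps at each day's open, seconds past the open elsewhere) and that the first row is a full timestamp. The names is_stamp and day_start are hypothetical.

```python
import pandas as pd

# Made-up 'Date' column after the two map() steps (hypothetical values).
bars = pd.DataFrame({"Date": [1450449000, 60, 120, 1450708200, 60]})
threshold = 24000

is_stamp = bars["Date"] > threshold
# Keep only the full timestamps, then carry each one forward through its day.
day_start = bars["Date"].where(is_stamp).ffill()
# Offsets are added to the day's timestamp; full timestamps stay unchanged.
bars["Date"] = bars["Date"].where(is_stamp, bars["Date"] + day_start).astype("int64")
print(bars)
```

Every step operates on whole columns, which avoids calling a Python function once per row.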