I have the following code, which produces a streaming data frame with over 5000 new rows every minute. Because the data frame is built inside a for loop, I am unable to manipulate its contents. I need a way to break the data frame out of the loop, say every 5 minutes, process it, and then restart collection.
'''
df = pd.DataFrame(data=None)

def on_ticks(ws, ticks):
    global df
    for sc in ticks:
        token = sc['instrument_token']
        name = trd_portfolio[token]['name']
        ltp = sc['last_price']
        df1 = pd.DataFrame([name, ltp]).T
        df1.columns = ['name', 'ltp']
        df = df.append(df1, ignore_index=True)
    print(df)
'''
Resultant output is
name ltp
0 GLAXO 1352.2
1 GSPL 195.75
2 ABAN 18
3 ADANIPOWER 36.2
4 CGPOWER 6
... ... ...
1470 COLPAL 1317
1471 ITC 196.2
1472 JUBLFOOD 1698.5
1473 HCLTECH 550.6
1474 INDIGO 964.8
[1475 rows x 2 columns]
Further manipulation required on the data frame is, for example:
'''
df['change'] = df.groupby('name')['ltp'].pct_change() * 100
g = df.groupby('name')['change']
counts = g.agg(
    pos_count=lambda s: s.gt(0).sum(),
    neg_count=lambda s: s.lt(0).sum(),
    net_count=lambda s: s.gt(0).sum() - s.lt(0).sum()).astype(int)
print(counts)
'''
However, I am unable to pause the for loop for a certain time so that other processing can happen. I did try the sleep method, but it just sleeps for the given time and then goes straight back to the for loop.
I need guidance on how to pause the loop for a certain time so that the other code can execute, and then return to the loop to continue collecting data.
There is no way to pause the loop itself, but you can pass the accumulated data to another function that performs the other operations after every n iterations. The pseudo-code would be something like:
def other_operation(data):
    # perform the other operations here
    ...

for loop in range(10000):
    data = ...                # collecting data
    if loop % 100 == 0:       # every 100 iterations
        other_operation(data)
This will perform the other manipulations after every 100 loop iterations.
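For the original 5-minute requirement, a time-based variant of the same idea may be closer to what is needed. Below is a minimal sketch (an assumption, not tested against the Kite ticker): it keeps appending ticks to the global frame and, once 5 minutes have elapsed, hands the batch to a processing function and starts a fresh frame.
'''
import time
import pandas as pd

df = pd.DataFrame()
last_flush = time.time()
FLUSH_INTERVAL = 5 * 60          # seconds

def process_batch(batch):
    # place the groupby / pct_change manipulations shown above here
    print(batch)

def on_ticks(ws, ticks):
    global df, last_flush
    rows = [{'name': trd_portfolio[sc['instrument_token']]['name'],
             'ltp': sc['last_price']} for sc in ticks]
    df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
    if time.time() - last_flush >= FLUSH_INTERVAL:
        process_batch(df)        # run the heavier manipulations on the 5-minute batch
        df = pd.DataFrame()      # restart collection
        last_flush = time.time()
'''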
I am working with stock data coming from Yahoo Finance.
import pandas as pd
import yfinance as yf

def load_y_finance_data(y_finance_tickers: list):
    df = pd.DataFrame()
    print("Loading Y-Finance data ...")
    for ticker in y_finance_tickers:
        df[ticker.replace("^", "")] = yf.download(
            ticker,
            auto_adjust=True,  # only download adjusted data
            progress=False,
        )["Close"]
    print("Done loading Y-Finance data!")
    return df
x = load_y_finance_data(["^VIX", "^GSPC"])
x
VIX GSPC
Date
1990-01-02 17.240000 359.690002
1990-01-03 18.190001 358.760010
1990-01-04 19.219999 355.670013
1990-01-05 20.110001 352.200012
1990-01-08 20.260000 353.790009
DataSize=(8301, 2)
Here I want to perform a sliding-window operation over every 50-day period: compute the correlation (using corr()) for a 50-day slice (day_1 to day_50), then move the window by one day (day_2 to day_51), and so on.
I tried the naive way of doing this with a for loop, and it works, but it takes too much time. Code below:
data_size = len(x)
period = 50

df = pd.DataFrame()
for i in range(data_size - period):
    df.loc[i, "GSPC_VIX_corr"] = x[["GSPC", "VIX"]][i:i+period].corr().loc["GSPC", "VIX"]
df
GSPC_VIX_corr
0 -0.703156
1 -0.651513
2 -0.602876
3 -0.583256
4 -0.589086
How can I do this more efficiently? Is there any built-in way I can use?
Thanks :)
You can use the rolling-window functionality of pandas with many different aggregations, including corr(). Instead of your for loop, do this:
x["VIX"].rolling(window=period).corr(x["GSPC"])
I'm currently trying to process some log files using Python and the pandas library. The logs contain simple information about requests sent to the server, and I want to extract information about sessions from them. Sessions here are defined as sets of requests made by the same user within a specific period of time (e.g. 30 minutes, counted from the time of the first request to the time of the last request; requests after this timeframe should be treated as part of a new session).
To do that, I currently perform nested grouping: first I use groupby to get each user's requests, then I group each user's requests into 30-minute intervals, and finally I iterate over those intervals and choose the ones that actually contain data:
# example log entry:
# id,host,time,method,url,response,bytes
# 303372,XXX.XXX.XXX.XXX,1995-07-11 12:17:09,GET,/htbin/wais.com?IMAX,200,6923

by_host = logs.groupby('host', sort=False)
for host, frame in by_host:
    by_frame = frame.groupby(pd.Grouper(key='time', freq='30min', origin='start'))
    for date, logs in by_frame:
        if not logs.empty and logs.shape[0] > 1:
            session_calculations()
This of course is quite inefficient and makes the calculations take a considerable amount of time. Is there any way to optimize this process? I haven't been able to come up with anything successful.
edit:
host time method url response bytes
0 ***.novo.dk 1995-07-11 12:17:09 GET /ksc.html 200 7067
1 ***.novo.dk 1995-07-11 12:17:48 GET /shuttle/missions/missions.html 200 8678
2 ***.novo.dk 1995-07-11 12:23:10 GET /shuttle/resources/orbiters/columbia.html 200 6922
3 ***.novo.dk 1995-08-09 12:48:48 GET /shuttle/missions/sts-69/mission-sts-69.html 200 11264
4 ***.novo.dk 1995-08-09 12:49:48 GET /shuttle/countdown/liftoff.html 200 4665
and expected result is a list of sessions extracted from requests:
host session_time
0 ***.novo.dk 00:06:01
1 ***.novo.dk 00:01:00
Note that session_time here is the time difference between the first and the last request in the input, after grouping the requests into 30-minute time windows.
To define local time windows for each user, i.e. to take the origin as the time of each user's first request, you can first group by 'host'. Then apply a function to each user's DataFrame using GroupBy.apply; the function handles the time grouping and computes the duration of the user's sessions.
def session_duration_by_host(by_host):
    time_grouper = pd.Grouper(key='time', freq='30min', origin='start')
    duration = lambda time: time.max() - time.min()
    return (
        by_host.groupby(time_grouper)
               .agg(session_time=('time', duration))
    )

res = (
    logs.groupby("host")
        .apply(session_duration_by_host)
        .reset_index()
        .drop(columns="time")
)
You have to write idiomatic pandas code: rather than processing something -> saving it into a variable -> using that variable (only once) for something else -> ..., you should chain your operations. Also, pandas `apply` is much faster than a normal `for` loop in most situations.
logs.groupby('host', sort=False).apply(
    lambda by_frame: by_frame.groupby(
        pd.Grouper(key='time', freq='30min', origin='start')
    ).apply(lambda logs: session_calculations() if (not logs.empty) and (logs.shape[0] > 1) else None)
)
I am reading in a .csv file and creating a pandas DataFrame. The file contains stock data. I am only interested in the date, the company, and the closing price. I want my program to find the maximum profit along with the starting date, the ending date, and the company. It needs to use a divide-and-conquer algorithm. I only know how to use for loops, but that takes forever to run. The .csv file has 200,000 rows. How can I get this to run fast?
import pandas as pd
import numpy as np
import math

def cleanData(file):
    df = pd.read_csv(file)
    del df['open']
    del df['low']
    del df['high']
    del df['volume']
    return np.array(df)

df = cleanData('prices-split-adjusted.csv')

bestStock = [None, None, None, float(-math.inf)]

def DAC(data):
    global bestStock
    if len(data) > 1:
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        DAC(left)
        DAC(right)
        for i in range(len(data)):
            for j in range(i+1, len(data)):
                if data[i,1] == data[j,1]:
                    profit = data[j,2] - data[i,2]
                    if profit > bestStock[3]:
                        bestStock[0] = data[i,0]
                        bestStock[1] = data[j,0]
                        bestStock[2] = data[i,1]
                        bestStock[3] = profit
                        print(bestStock)
                        print('\n')
    return bestStock

print(DAC(df))
I've got two things for your consideration (my answer tries not to change your algorithmic approach, i.e. the nested loops and recursive functions, and tackles the low-hanging fruit first):
Unless you are debugging, try to avoid print() inside a loop (in your case the print(bestStock) call). The I/O overhead can add up, especially if you are looping across large datasets and printing to screen often. Once you are happy with your code, comment it out for runs on the full dataset and uncomment it only during debugging sessions. You can expect some improvement in speed just from not printing to screen inside the loop.
If you are after even more ways to speed it up, I have found in cases like this (which I often encounter, especially in search/sort problems) that simply switching the expensive part (the Python for loops) to Cython, and statically defining variable types (this is KEY to the speed-up!), gives several orders of magnitude of speed-up even before optimizing the implementation. Check out Cython: https://cython.readthedocs.io/en/latest/index.html. If that is not enough, then parallelism is your next best friend, which would require rethinking your code implementation.
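As a rough illustration of what "statically defining variable types" can look like, here is a sketch only, using Cython's pure-Python mode (it assumes Cython is installed and the file is compiled with cythonize; the function name and logic are illustrative, not taken from the question):
'''
import cython

def best_gain(prices) -> cython.double:
    # In Cython's pure-Python mode these annotations become C types
    # when the module is compiled, which is where the speed-up comes from.
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t
    n: cython.Py_ssize_t = len(prices)
    best: cython.double = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if prices[j] - prices[i] > best:
                best = prices[j] - prices[i]
    return best
'''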
The main problems causing the slow performance are:
You manually iterate over 2 columns in nested loops without using pandas operations, which make use of fast ndarray functions;
you use recursive calls, which look nice and simple but are slow.
Setting the sample data as follows:
Date Company Close
0 2019-12-31 AAPL 73.412498
1 2019-12-31 FB 205.250000
2 2019-12-31 NFLX 323.570007
3 2020-01-02 AAPL 75.087502
4 2020-01-02 FB 209.779999
... ... ... ...
184 2020-03-30 FB 165.949997
185 2020-03-30 NFLX 370.959991
186 2020-03-31 AAPL 63.572498
187 2020-03-31 FB 166.800003
188 2020-03-31 NFLX 375.500000
189 rows × 3 columns
Then use the following code (adjust the column labels if yours are different):
df_result = df.groupby('Company').agg(
    Start_Date=pd.NamedAgg(column='Date', aggfunc="first"),
    End_Date=pd.NamedAgg(column='Date', aggfunc="last"),
    bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0])
)
Resulting output:
Start_Date End_Date bestGain
Company
AAPL 2019-12-31 2020-03-31 8.387505
FB 2019-12-31 2020-03-31 17.979996
NFLX 2019-12-31 2020-03-31 64.209991
To get the entry with greatest gain:
df_result.loc[df_result['bestGain'].idxmax()]
Resulting output:
Start_Date 2019-12-31 00:00:00
End_Date 2020-03-31 00:00:00
bestGain 64.209991
Name: NFLX, dtype: object
Execution time comparison
With my scaled-down data of 3 stocks over 3 months, the code making use of pandas functions takes 8.9 ms, which is about half the execution time of the original code that manually iterates over the numpy array with nested loops and recursive calls (16.9 ms), even after removing the majority of the print() calls.
Your code with the print() calls inside DAC() removed:
%%timeit
"""
def cleanData(df):
    # df = pd.read_csv(file)
    del df['Open']
    del df['Low']
    del df['High']
    del df['Volume']
    return np.array(df)
"""
# df = cleanData('prices-split-adjusted.csv')
# df = cleanData(df0)
df = np.array(df0)

bestStock = [None, None, None, float(-math.inf)]

def DAC(data):
    global bestStock
    if len(data) > 1:
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        DAC(left)
        DAC(right)
        for i in range(len(data)):
            for j in range(i+1, len(data)):
                if data[i,1] == data[j,1]:
                    profit = data[j,2] - data[i,2]
                    if profit > bestStock[3]:
                        bestStock[0] = data[i,0]
                        bestStock[1] = data[j,0]
                        bestStock[2] = data[i,1]
                        bestStock[3] = profit
                        # print(bestStock)
                        # print('\n')
    return bestStock

print(DAC(df))
[Timestamp('2020-03-16 00:00:00'), Timestamp('2020-03-31 00:00:00'), 'NFLX', 76.66000366210938]
16.9 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New simplified code written the pandas way:
%%timeit
df_result = df.groupby('Company').agg(
    Start_Date=pd.NamedAgg(column='Date', aggfunc="first"),
    End_Date=pd.NamedAgg(column='Date', aggfunc="last"),
    bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0])
)
df_result.loc[df_result['bestGain'].idxmax()]
8.9 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution using a recursive function:
The main problem with your recursive function is that it does not make use of the results of the recursive calls on the reduced-size data.
To properly use a recursive function as a divide-and-conquer approach, you should take 3 major steps:
Divide the whole set of data into smaller pieces and handle each piece with a recursive call;
handle the end-point case (usually the easiest case) in each recursive call;
consolidate the results of all the recursive calls on the smaller pieces.
The beauty of recursive calls is that you can solve a complicated problem by replacing the processing with 2 much easier steps. The 1st step is to handle the end-point case, where most of the time you deal with only ONE data item (which is usually easy). The 2nd step is another easy one: consolidate the results of the reduced-size calls.
You managed to take the first step but not the other 2. In particular, you did not take advantage of simplifying the processing by using the results of the smaller pieces. Instead, you handle the whole set of data in each call by looping over all rows of the 2-dimensional numpy array. The nested-loop logic is just like a "Bubble Sort" [with complexity of order(n squared) instead of order(n)]. Hence, your recursive calls are just wasting time without adding value!
I suggest modifying your recursive function as follows:
def DAC(data):
    # global bestStock                   # define bestStock as a local variable instead
    bestStock = [None, None, None, float(-math.inf)]   # init bestStock
    if len(data) == 1:                   # End-point case: data = 1 row
        bestStock[0] = data[0,0]
        bestStock[1] = data[0,0]
        bestStock[2] = data[0,1]
        bestStock[3] = 0.0
    elif len(data) == 2:                 # End-point case: data = 2 rows
        bestStock[0] = data[0,0]
        bestStock[1] = data[1,0]
        bestStock[2] = data[0,1]         # Enhance here to allow stock break
        bestStock[3] = data[1,2] - data[0,2]
    elif len(data) >= 3:                 # Recursive calls and consolidate results
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        bestStock_left = DAC(left)
        bestStock_right = DAC(right)
        # Now make use of the results of divide-and-conquer and consolidate the results
        bestStock[0] = bestStock_left[0]
        bestStock[1] = bestStock_right[1]
        bestStock[2] = bestStock_left[2]   # Enhance here to allow stock break
        bestStock[3] = bestStock_left[3] if bestStock_left[3] >= bestStock_right[3] else bestStock_right[3]
    # print(bestStock)
    # print('\n')
    return bestStock
Here we need to handle 2 kinds of end-point cases: 1 row and 2 rows. The reason is that with only 1 row we cannot calculate a gain and can only set it to zero; a gain can only be calculated from 2 rows onward. If we did not split out these 2 end-point cases, we could end up propagating a zero gain all the way up.
This is a demo of how to code the recursive calls so that you actually benefit from them. The code still has limitations you need to fine-tune: you have to enhance it further to handle the stock-break case (the code for 2 rows and >= 3 rows currently assumes there is no stock break).
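As a pointer for the single-stock case, the classic consolidation step also considers the profit that crosses the split point (buy in the left half, sell in the right half). Below is a minimal sketch on a plain 1-D price list (an illustration only, not the answer's exact code):
'''
def max_profit(prices):
    # Divide-and-conquer max profit for one company's price series.
    if len(prices) < 2:
        return 0.0
    mid = len(prices) // 2
    left, right = prices[:mid], prices[mid:]
    cross = max(right) - min(left)   # buy in the left half, sell in the right half
    return max(max_profit(left), max_profit(right), cross)
'''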
I have to run a process on about 2 million IDs, for which I am trying to use multiprocessing.
My sample data, stored in the dataframe df, looks like this (just presenting 3 rows):
c_id
0 ID1
1 ID2
2 ID3
My parallelize code is as follows:
from multiprocessing import Pool, cpu_count
import numpy as np
import pandas as pd

def parallelize(data, func, parts=cpu_count()):
    if data.shape[0] < parts:
        parts = data.shape[0]
    data_split = np.array_split(data, parts)
    pool = Pool(parts)
    parallel_out = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return parallel_out
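(Aside: on platforms where worker processes are spawned, e.g. Windows, multiprocessing.Pool generally has to be driven from under an if __name__ == '__main__': guard. A minimal usage sketch, where process_chunk is a hypothetical placeholder:)
'''
def process_chunk(chunk):
    # hypothetical placeholder: each worker receives one chunk of the frame
    return chunk

if __name__ == '__main__':
    out = parallelize(df, process_chunk, parts=cpu_count())
'''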
A sample process that I want to run on all the IDs is to append my first name to each ID.
There are two pieces of code that I tested.
First: using a for loop and then calling the parallelize function, as follows:
def pooltype1(df_id):
    dfi = []
    for item in df_id['c_id']:
        dfi.append({'string': str(item) + '_ravi'})
    dfi = pd.DataFrame(dfi)
    return dfi

p = parallelize(df, pooltype1, parts=cpu_count())
The output is as expected, and the index of each row is 0, confirming that each ID went to a different CPU (cpu_count() for my system > 3):
string
0 ID1_ravi
0 ID2_ravi
0 ID3_ravi
and the runtime is 0.12 seconds.
However, to further speed things up on my actual (2 million row) data, I tried to replace the for loop in the pooltype1 function with an apply call, then call the parallelize function as below:
# New function
def add_string(x):
    return x + '_ravi'

def pooltype2(df_id):
    dfi = df_id.apply(add_string)
    return dfi

p = parallelize(df, pooltype2, parts=cpu_count())
Now the index of the output is not all zeros:
string
0 ID1_ravi
1 ID2_ravi
2 ID3_ravi
and to my surprise the runtime jumped to 5.5 seconds. It seems as if apply was executed on the whole original dataframe rather than per CPU.
So, when using pool.map, do I have to use a for loop (as in the pooltype1 function), or is there a way apply can be applied within each CPU's chunk (hoping that this will further reduce the run time)? If apply can be done at the CPU level, please help me with the code.
Thank you.
I am working on a project for algo trading using the Zerodha broker's API.
I am trying to use multithreading to save on the costly operation of calling the API function that fetches historical data for 50 stocks at a time, and then apply my strategy to the data for buy/sell decisions.
Here is my code:
Historical data function:
def historicData(token, start_dt, end_dt):
    # kite.historical_data() is the API call, with a limit of 20 requests/sec
    data = pd.DataFrame(kite.historical_data(token,
                                             start_dt,
                                             end_dt,
                                             interval='minute',
                                             continuous=False,
                                             oi=False))
    return data.tail(5)
Strategy function:
def Strategy(token, script_name):
    start_dt = (datetime.now() - timedelta(3)).strftime("%Y-%m-%d")
    end_dt = datetime.now().strftime("%Y-%m-%d")
    ScriptData = historicData(token, start_dt, end_dt)
    # perform operations on ScriptData
    print(token, script_name)
Concurrent calling of the above function:
# concurrent code
import time
from threading import Thread, Lock

start_dt = (datetime.now() - timedelta(3)).strftime("%Y-%m-%d")
end_dt = datetime.now().strftime("%Y-%m-%d")

th_list = [None]*20
start = time.time()

for i in range(0, 50, 20):          # trying to send 20 requests as 20 threads in one go
    token_batch = tokens[i:i+20]    # data inside is ['123414','124124',...], 50 in total
    script_batch = scripts[i:i+20]  # data inside is ['RELIANCE','INFY',...], 50 in total
    j = 0
    for stock_script in zip(token_batch, script_batch):
        th_list[j] = Thread(target=Strategy,
                            args=(stock_script[0],
                                  stock_script[1]))
        th_list[j].start()
        j += 1

end = time.time()
print('time is : ', end-start)
Now there are 2 issues I have been unable to resolve after 2 days of trying many solutions found online.
There is a bottleneck on the API server: it accepts 20 API calls per second and rejects anything more. There are 50 stocks in the list, and what I am trying to do is fetch data for 20 stocks at a time, then the next 20, and then the remaining 10 in a third go. The stock list will soon grow to 200 stocks, which is why serial execution is too costly for my strategy to work.
When running this concurrently, too many threads are created at once, the API request limit is exceeded, and print('time is : ', end-start) runs as soon as I run the 3rd cell.
So how do I block the code from leaving the inner for loop before all threads finish their execution?
And is my way of getting at most 20 threads per second correct? Should I place a sleep(1) somewhere?
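A minimal sketch of the batching pattern described above (an assumption, not tested against the kite API): join() makes the main thread wait for every thread in the current batch, and a one-second pause keeps each batch of at most 20 requests inside the rate limit.
'''
import time
from threading import Thread

for i in range(0, len(tokens), 20):
    batch = list(zip(tokens[i:i+20], scripts[i:i+20]))
    threads = [Thread(target=Strategy, args=(tok, scr)) for tok, scr in batch]
    for t in threads:
        t.start()
    for t in threads:
        t.join()        # block here until every thread in this batch has finished
    time.sleep(1)       # start the next batch in the following second (20 req/sec limit)
'''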