Related
I would like to simulate individual changes in growth and mortality for a variable number of days. My dataframe is formatted as follows...
import pandas as pd
data = {'unique_id': ['2', '4', '5', '13'],
'length': ['27.7', '30.2', '25.4', '29.1'],
'no_fish': ['3195', '1894', '8', '2774'],
'days_left': ['253', '253', '254', '256'],
'growth': ['0.3898', '0.3414', '0.4080', '0.3839']
}
df = pd.DataFrame(data)
print(df)
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Ideally, I would like the initial length (i.e., length) to increase by the daily growth rate (i.e., growth) for each of the days remaining in the year (i.e., days_left).
df['final'] = df['length'] + (df['days_left'] * df['growth']
However, I would also like to update the number of fish that each individual represents (i.e., no_fish) on a daily basis using a size-specific equation. I'm fairly new to python so I initially thought to use a for-loop (I'm not sure if there is another, more efficient way). My code is as follows:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_to_forecast[indx]):
# (1) update individual length
df.lgth[indx] = df.lgth[indx] + df.linearGR[indx]
# (2) estimate daily size-specific mortality
if df.lgth[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.lgth[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.lgth[indx])
elif df.lgth[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.lgth[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
The above code now works correctly, but it is still far to inefficient to run for 40,000 individuals each for 200+ days.
I would really appreciate any advice on how to modify the following code to make it pythonic.
Thanks
Another option that was suggested to me is to use the pd.dataframe.apply function. This dramatically reduced the overall the run time and could be useful to someone else in the future.
### === RUN SIMULATION === ###
start_time = time.perf_counter() # keep track of run time -- START
#-------------------------------------------------------------------------#
def function_to_apply( df ):
df['z_instantMort'] = ''
for indx in range(int(df['days_left'])):
# (1) update individual length
df['length'] = df['length'] + df['growth']
# (2) estimate daily size-specific mortality
if df['length'] > 50.0:
df['z_instantMort'] = 0.01
else:
if df['length'] <= 50.0:
df['z_instantMort'] = 0.052857-((0.03/35)*df['length'])
elif df['length'] < 15.0:
df['z_instantMort'] = 0.728*np.exp(-0.1892*df['length'])
whole_fish = round(df['no_fish'], 0)
if whole_fish < 1.0:
df['no_fish'] = 0.0
elif whole_fish >= 1.0:
df['no_fish'] = df['no_fish']*np.exp(-(df['z_instantMort']))
return df
#-------------------------------------------------------------------------#
sim_results = df.apply(function_to_apply, axis=1)
total_elapsed_time = round(time.perf_counter() - start_time, 2) # END
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(sim_results)
### ====================== ###
output being...
Forecast iteration completed in 0.05 seconds
unique_id length no_fish days_left growth z_instantMort
0 2.0 126.3194 148.729190 253.0 0.3898 0.01
1 4.0 116.5742 93.018465 253.0 0.3414 0.01
2 5.0 129.0320 0.000000 254.0 0.4080 0.01
3 13.0 127.3784 132.864757 256.0 0.3839 0.01
As I said in my comment, a preferable alternative to for loops in this setting is using vector operations. For instance, running your code:
import pandas as pd
import time
import math
import numpy as np
data = {'unique_id': [2, 4, 5, 13],
'length': [27.7, 30.2, 25.4, 29.1],
'no_fish': [3195, 1894, 8, 2774],
'days_left': [253, 253, 254, 256],
'growth': [0.3898, 0.3414, 0.4080, 0.3839]
}
df = pd.DataFrame(data)
print(df)
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_left[indx]):
# (1) update individual length
df.length[indx] = df.length[indx] + df.growth[indx]
# (2) estimate daily size-specific mortality
if df.length[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.length[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.length[indx])
elif df.length[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.length[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output:
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Forecast iteration completed in 31.75 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.01
1 4 116.5742 93.018465 253 0.3414 0.01
2 5 129.0320 0.000000 254 0.4080 0.01
3 13 127.3784 132.864757 256 0.3839 0.01
Now with vector operations, you could do something like:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for day in range(1, df.days_left.max() + 1):
update = day <= df['days_left']
# (1) update individual length
df[update]['length'] = df[update]['length'] + df[update]['growth']
# (2) estimate daily size-specific mortality
df[update]['z'] = np.where( df[update]['length'] > 50.0, 0.01, 0.052857-( ( 0.03 / 35)*df[update]['length'] ) )
df[update]['z'] = np.where( df[update]['length'] < 15.0, 0.728 * np.exp(-0.1892*df[update]['length'] ), df[update]['z'] )
df[update]['no_fish'].round(decimals = 0)
df[update]['no_fish'] = np.where(df[update]['no_fish'] < 1.0, 0.0, df[update]['no_fish'] * np.exp(-(df[update]['z'])))
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output
Forecast iteration completed in 1.32 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.0
1 4 116.5742 93.018465 253 0.3414 0.0
2 5 129.0320 0.000000 254 0.4080 0.0
3 13 127.3784 132.864757 256 0.3839 0.0
I'm trying to make a calculation on multiple rows for every row in a dataframe.
My current solution takes so much time when I run 2971000 rows. it almost takes more than 2hours.
So, I want know other solutions to speed up a function
my data looks like this for example.
sig1 sig2 sig3 sig4 sig_p sig_t
20210114 05:52:02.00 0.0 0.0 0.0 0.0 11.5 -3.5
20210114 05:52:02.01 0.0 0.0 0.0 0.0 11.6 -3.5
20210114 05:52:02.02 0.0 0.0 0.0 0.0 11.5 -3.5
20210114 05:52:02.03 0.0 0.0 0.0 0.0 11.6 -3.5
20210114 05:52:02.04 0.0 0.0 0.0 0.0 11.7 -3.5
... ... ... ... ... ... ...
20210114 22:38:59.85 0.0 0.0 0.0 0.0 0.0 -0.5
20210114 22:38:59.86 0.0 0.0 0.0 0.0 0.0 -0.5
20210114 22:38:59.87 0.0 0.0 0.0 0.0 0.0 -0.5
20210114 22:38:59.88 0.0 0.0 0.0 0.0 0.0 -0.5
20210114 22:38:59.89 0.0 0.0 0.0 0.0 0.0 -0.5
I have a function which loops through and calculates value for newcol based on sig1, sig_p, sig_t,previous newcol. the function runs repeat for sig1, sig2, sig3, sig4.
I'll show you the code I currently have, but it's too slow.
parameter.py
from typing import NamedTuple
class Param(NamedTuple):
RATIO : float
D : float
T : float
M : float
S : float
W : float
DYNAMIC : float
T_CONST : float
P_CONST : float
L_COEF : float
O_COEF : float
#property
def A(self):
return (self.D**2)*math.pi
#property
def FACTOR(self):
return self.S / self.A
Param1 = Param(
RATIO = 0.74,
D = 172e-3,
T = 23e-3,
M = 6,
S = 53.7e-4,#4232.5e-6,
W = 0.805,
DYNAMIC = 0.3150,
T_CONST = 2, #4,
P_CONST = 0.2,#3,
L_COEF = 0.8,#4,
O_COEF = 2.5
)
rear = Param(
RATIO = 0.26,
D = 204e-3,
T = 10e-3,
M = 4,
S = 26.8e-4,
W = 0.38,
DYNAMIC = 0.3150,
T_CONST = 1.8,
P_CONST = 0.2,
L_COEF = 0.2,
O_COEF = 1.8
)
test.py
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
TIME_STAMP = 0.1
SPEC = 449
SPECIFIC = 935
EMISSIVITY = 0.7
ABSORBTIVITY = 0.3
DYNAMIC_SPEED = 12
COEFFICIENT = 0.9506173967164384
input_KV = [-75, -50, -25, -15, -10, -5, 0, 5, 10, 15, 20, 25, 30, 40, 50, 60,
80, 100, 125, 150, 175, 200, 225, 300, 412, 500, 600, 700, 800, 900, 1000, 1100]
viscosity_value = [7.4, 9.22, 11.18, 12.01, 12.43, 12.85, 13.28, 13.72, 14.16, 14.61, 15.06, 15.52, 15.98, 16.92, 17.88, 18.86, 20.88,
22.97, 25.69, 28.51, 31.44, 34.47, 37.6, 47.54, 63.82, 77.72, 94.62, 112.6, 131.7, 151.7, 172.7, 194.6]
input_ka = [-190, -150, -100, -75, -50, -25, -15, -10, -5, 0, 5, 10, 15, 20, 25, 30, 40,
50, 60, 80, 100, 125, 150, 175, 200, 225, 300, 412, 500, 600, 700, 800, 900, 1000, 1100]
conductivity_value = [7.82, 11.69, 16.2, 18.34, 20.41, 22.41, 23.2, 23.59, 23.97, 24.36, 24.74, 25.12, 25.5, 25.87, 26.24, 26.62,
27.35, 28.08, 28.8, 30.23, 31.62, 33.33, 35, 36.64, 38.25, 39.83, 44.41, 50.92, 55.79, 61.14, 66.32, 71.35, 76.26, 81.08, 85.83]
def viscosity(input):
fq = interp1d(input_KV,
viscosity_value, kind='quadratic')
return (fq(input)*10e-6)
def conductivity(input):
fq = interp1d(input_ka,
conductivity_value, kind='quadratic')
return (fq(input)*10e-3)
def calculation(Param, sig, sig_p, sig_t):
new_col1 = np.empty(len(sig_p))
new_col1[0] = sig_t[0]
my_goal = np.empty(len(sig_p))
my_goal[0] = sig_t[0]
calc1 = COEFFICIENT * Param.RATIO * sig_p * sig /2
for n in range(1, len(sig_p)):
calc2 = EMISSIVITY * Param.A * (new_col1[n-1]**4 - sig_t[n]**4)
Ka = conductivity(sig_t[n])
if sig[n] == 0:
h = Param.O_COEF
else :
KV = viscosity(sig_t[n])
if sig[n] < DYNAMIC_SPEED:
h = (0.7*(sig[n]/KV)**0.4) * Ka + Param.O_COEF
else :
h = (0.04*(sig[n])/KV**0.8) * Ka + Param.L_COEF
calc3 = h * Param.A * (new_col1[n-1] - sig_t[n])
calc4 = Ka *Param.A * (new_col1[n-1] - sig_t[n]) / Param.T
a1 = (calc1[n] - (calc2 + calc3 + calc4)) / (SPEC * Param.M)
new_col1[n] = new_col1[n-1] + a1 * TIME_STAMP
if sig_p[n] == 0 :
val1 = ABSORBTIVITY * Param.FACTOR * calc2
elif (sig_p[n] > 0) & (sig_p[n] <= 20):
val1 = ABSORBTIVITY * Param.FACTOR * calc2* (20-sig_p[n])/20 + ((1-COEFFICIENT) * calc1[n] / (4)) * sig_p[n] / 20
else:
val1 = (1-COEFFICIENT) * calc1[n] / 4
if sig[n] == 0:
val2 = Param.T_CONST
else:
h_bar = Param.P_CONST * (sig[n] *Param.DYNAMIC)**0.8
val2 = h_bar * Param.S * (my_goal[n-1] - sig_t[n])
a2 = (val1 - (val2)) / (SPECIFIC * Param.W)
my_goal[n] = my_goal[n-1] + a2 * TIME_STAMP
if my_goal[n] < sig_t[0] : my_goal[n] = sig_t[0]
return my_goal
df = pd.read_csv('data.csv', index_col=0)
df['newcol1'] = calculation(Param1, df['sig1'].values, df[sig_p].values, df['sig_t'].values)
df['newcol2'] = calculation(Param1, df['sig2'].values, df[sig_p].values, df['sig_t'].values)
df['newcol3'] = calculation(Param2, df['sig3'].values, df[sig_p].values, df['sig_t'].values)
df['newcol4'] = calculation(Param2, df['sig4'].values, df[sig_p].values, df['sig_t'].values)
I now need to apply this function to several million rows and it's impossibly slow so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions but I have no experience with it (and I'm new to both pandas and python).
My question is if there is anyway to enhance or speed up this computation method.
I'm run this python code on AWS(sagemaker>notebook instance, jupyter) and my computer OS is window.
Iteration is easy to code but slow for dataframe. Here is a hint to your solution. You need to vectorize the code inside the while loop while n < len(sig_p):.
For example, previously your code:
def fun(Param, sig_p, sig, sig_t):
tempvalue = np.empty(sig_p.shape)
tempvalue[0] = sig_t[0]
newcol = np.empty(sig_p.shape)
newcol[0] = sig_t[0]
n = 1
while n < len(sig_p):
# calc1 = fun1()
calc1 = Param.COEF * (sig_p[n]) * Param.NO * Param.EFF # fun1()
# calc2 = fun2()
if sig[n] > Param.THRESHOLD:
calc2 = 0
else:
calc2 = Param.EMISSIVITY * Param.CONSTANT * (tempvalue[n-1]**4 - sig_t[n]**4)
# calc3
# calc4
# ......
df['newcol1'] = fun(param1, df['sig_p'].values, df['sig1'].values, df['sig_t'].values)
To eliminate the while loop, the fun1() and fun2 can be rewritten like this:
def fun(Param, df, sigTag):
# df['calc1'] = vectorized fun1()
df['calc1'] = Param.COEF * df['sig_p'] * Param.NO * Param.EFF
# df['calc2'] = vectorized fun2()
df['calc2'] = Param.EMISSIVITY * Param.CONSTANT * (df['sig_t'].shift(1)**4 - df['sig_t']**4)
df.loc[df[sigTag] > Param.THRESHOLD, 'calc2'] = 0
# df['calc3'] = vectorized fun3()
# df['calc4'] = vectorized fun4()
# ......
df['newcol1'] = fun(param1, df, 'sig1')
you might also want to input the dataframe to the fun() rather than in separated ndarray(s).
This approach'd greatly enhance the performance. You might want to do some research on how to vectorize the calculation.
I have dataframe
ID Value
A 70
A 80
A 1000
A 100
A 200
A 130
A 60
A 300
A 800
A 200
A 150
A 250
I need to replace outliers to median value.
I use
df = pd.read_excel("test.xlsx")
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25), \
'median': grouped['Value'].median(), 'q3' :
grouped['Value'].quantile(.75)})
def is_outlier(row):
iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
median = statBefore.loc[row.ID]['median']
q3 = statBefore.loc[row.ID]['q3']
q1 = statBefore.loc[row.ID]['q1']
if row.Value > (q3 + (3 * iq_range)) or row.Value < (q1 - (3 * iq_range)):
return True
else:
return False
#apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis = 1)
But it returns me median - 175 and q1 - 92, but I get - 90, and it returns me q3 - 262,5, but I count and get - 275.
What wrong there?
This is simple and performant, with no Python for-loops to slow it down:
s = pd.Series([30, 31, 32, 45, 50, 999]) # example data
s.where(s.between(*s.quantile([0.25, 0.75])), s.median())
It gives you:
0 38.5
1 38.5
2 32.0
3 45.0
4 38.5
5 38.5
Unpacking that code, we have s.quantile([0.25, 0.75]) to get this:
0.25 31.25
0.75 48.75
We then use the values (31.25 and 48.75) as arguments to between(), with the * operator to unpack them because between() expects two separate arguments, not an array of length 2. That gives us:
0 False
1 False
2 True
3 True
4 False
5 False
Now that we have the binary mask, we use s.where() to choose the original values at the True locations, and fall back to s.median() otherwise.
This is just how quantiles are defined
df = pd.DataFrame(np.array([60,70,80,100,130,150,200,200,250,300,800,1000]))
print df.quantile(.25)
print df.quantile(.50)
print df.quantile(.75)
(The q1 for your data set is 95 btw)
The median is in between 150 and 200 (175)
The first quantile is 3 quarters between 80 and 100 (95)
The thrid quantile is 1 quarter in between 250 and 300 (262.5)
I have a large catalog that I am selecting data from according to the following criteria:
columns = ["System", "rp", "mp", "logg"]
catalog = pd.read_csv('data.txt', skiprows=1, sep ='\s+', names=columns)
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
new_catalog = pd.DataFrame(catalog[i])
print("{0} targets after cuts".format(len(new_catalog)))
When I perform the above cuts the code is working fine. Next, I want to add one more cut: I want to select all the targets that have 4.0 < logg < 5.0. However, some of the targets have logg = -1 (which stands for the fact that the value is not available). Luckily, I can calculate logg from the other available parameters. So here is my updated cuts:
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
if catalog.logg[i] == -1:
catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
i &= (4 <= catalog.logg) & (catalog.logg <= 5)
However, I am receiving an error:
if catalog.logg[i] == -1:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can someone please explain what I am doing wrong and how I can fix it. Thank you
Edit 1
My dataframe looks like the following:
Data columns:
System 477 non-null values
rp 477 non-null values
mp 477 non-null values
logg 477 non-null values
dtypes: float64(37), int64(3), object(3)None
Edit 2
System rp mp logg FeH FeHu FeHl Mstar Mstaru Mstarl
0 target-01 5196 24 24 0.31 0.04 0.04 0.905 0.015 0.015
1 target-02 5950 150 150 -0.30 0.25 0.25 0.950 0.110 0.110
2 target-03 5598 50 50 0.04 0.05 0.05 0.997 0.049 0.049
3 target-04 6558 44 -1 0.14 0.04 0.04 1.403 0.061 0.061
4 target-05 6190 60 60 0.05 0.07 0.07 1.194 0.049 0.050
....
[5 rows x 43 columns]
Edit 3
My code in a format that I understand should be:
for row in range(len(catalog)):
parameter = catalog['logg'][row]
if parameter == -1:
parameter = catalog['mp'][row] / catalog['rp'][row]
if parameter > 4.0 and parameter < 5.0:
# select this row for further analysis
However, I am trying to write my code in a more simple and professional way. I don't want to use the for loop. How can I do it?
EDIT 4
Consider the following small example:
System rp mp logg
target-01 2 -1 2 # will NOT be selected since mp = -1
target-02 -1 3 4 # will NOT be selected since rp = -1
target-03 7 6 4.3 # will be selected since mp != -1, rp != -1, and 4 < logg <5
target-04 3.2 15 -1 # will be selected since mp != -1, rp != -1, logg = mp / rp = 15/3.2 = 4.68 (which is between 4 and 5)
you get the error because catalog.logg[i] is not a scalar,but a series,so you should turn to vectorized manipulation:
catalog.loc[i,'logg'] = catalog.loc[i,'mp']/catalog.loc[i,'rp']
which would modify the logg column inplace
As for edit 3:
rows=catalog.loc[(catalog.logg > 4) & (catalog.logg < 5)]
which will select rows that satisfy the condition
Instead of that code:
if catalog.logg[i] == -1:
catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
You could use following:
i &= df.logg == -1
df.loc[i, 'logg'] = df.loc[i, 'mp'] / df.loc[i, 'rp']
# or
df.ix[i, 'logg'] = df.ix[i, 'mp'] / df.ix[i, 'rp']
For your edit 3 you need to add that line:
your_rows = df[(df.logg > 4) & (df.logg < 5)]
Full code:
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= df.logg == -1
df.ix[i, 'logg'] = df.ix[i, 'mp'] / df.ix[i, 'rp']
your_rows = df[(df.logg > 4) & (df.logg < 5)]
EDIT
Probably I still don't understand what you want, but I get your desired output:
import pandas as pd
from io import StringIO
data = """
System rp mp logg
target-01 2 -1 2
target-02 -1 3 4
target-03 7 6 4.3
target-04 3.2 15 -1
"""
catalog = pd.read_csv(StringIO(data), sep='\s+')
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.ix[i, 'logg'] = catalog.ix[i, 'mp'] / catalog.ix[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
In [7]: your_rows
Out[7]:
System rp mp logg
2 target-03 7.0 6 4.3000
3 target-04 3.2 15 4.6875
Am I still wrong?
New to Pandas, looking for the most efficient way to do this.
I have a Series of DataFrames. Each DataFrame has the same columns but different indexes, and they are indexed by date. The Series is indexed by ticker symbol. So each item in the Sequence represents a single time series of each individual stock's performance.
I need to randomly generate a list of n data frames, where each dataframe is a subset of some random assortment of the available stocks' histories. It's ok if there is overlap, so long as start end end dates are different.
This following code does it, but it's really slow, and I'm wondering if there's a better way to go about it:
Code
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
if type(data) != pd.Series:
return None
if subset=='validate':
offset = 0
elif subset=='test':
offset = 200
elif subset=='train':
offset = 400
tickers = np.random.randint(0, len(data), size=len(data))
ret_data = []
while len(ret_data) != batch_size:
for t in tickers:
data_t = data[t]
max_len = len(data_t)-timesteps-1
if len(ret_data)==batch_size: break
if max_len-offset < 0: continue
index = np.random.randint(offset, max_len)
d = data_t[index:index+timesteps]
if len(d)==timesteps: ret_data.append(d)
return ret_data
Profile output:
Timer unit: 1e-06 s
File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
137 #profile
138 def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
139 1 5 5.0 0.0 if type(data) != pd.Series:
140 return None
141
142 1 1 1.0 0.0 if subset=='validate':
143 offset = 0
144 1 1 1.0 0.0 elif subset=='test':
145 offset = 200
146 1 0 0.0 0.0 elif subset=='train':
147 1 1 1.0 0.0 offset = 400
148
149 1 1835 1835.0 11.4 tickers = np.random.randint(0, len(data), size=len(data))
150
151 1 2 2.0 0.0 ret_data = []
152 2 3 1.5 0.0 while len(ret_data) != batch_size:
153 116 148 1.3 0.9 for t in tickers:
154 116 2497 21.5 15.5 data_t = data[t]
155 116 317 2.7 2.0 max_len = len(data_t)-timesteps-1
156 116 80 0.7 0.5 if len(ret_data)==batch_size: break
157 115 69 0.6 0.4 if max_len-offset < 0: continue
158
159 100 101 1.0 0.6 index = np.random.randint(offset, max_len)
160 100 10840 108.4 67.2 d = data_t[index:index+timesteps]
161 100 241 2.4 1.5 if len(d)==timesteps: ret_data.append(d)
162
163 1 1 1.0 0.0 return ret_data
Are you sure you need to find a faster method? Your current method isn't that slow. The following changes might simplify, but won't necessarily be any faster:
Step 1: Take a random sample (with replacement) from the list of dataframes
rand_stocks = np.random.randint(0, len(data), size=batch_size)
You can treat this array rand_stocks as a list of indices to be applied to your Series of dataframes. The size is already batch size so that eliminates the need for the while loop and your comparison on line 156.
That is, you can iterate over rand_stocks and access the stock like so:
for idx in rand_stocks:
stock = data.ix[idx]
# Get a sample from this stock.
Step 2: Get a random datarange for each stock you have randomly selected.
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = data_t[start_idx:start_idx+timesteps]
I don't have your data, but here's how I put it together:
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
if subset=='train': offset = 0 #you can obviously change this back
rand_stocks = np.random.randint(0, len(data), size=batch_size)
ret_data = []
for idx in rand_stocks:
stock = data[idx]
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]
ret_data.append(d)
return ret_data
Creating a dataset:
In [22]: import numpy as np
In [23]: import pandas as pd
In [24]: rndrange = pd.DateRange('1/1/2012', periods=72, freq='H')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02 2.025795
2012-01-03 1.731667
2012-01-04 0.092725
2012-01-05 -0.489804
2012-01-06 -0.090041
In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]
Testing the function:
In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23 1.464576
2012-01-24 -1.052048,
2012-01-23 1.464576
2012-01-24 -1.052048]