I would like to simulate individual changes in growth and mortality for a variable number of days. My dataframe is formatted as follows...
import pandas as pd
data = {'unique_id': ['2', '4', '5', '13'],
'length': ['27.7', '30.2', '25.4', '29.1'],
'no_fish': ['3195', '1894', '8', '2774'],
'days_left': ['253', '253', '254', '256'],
'growth': ['0.3898', '0.3414', '0.4080', '0.3839']
}
df = pd.DataFrame(data)
print(df)
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Ideally, I would like the initial length (i.e., length) to increase by the daily growth rate (i.e., growth) for each of the days remaining in the year (i.e., days_left).
df['final'] = df['length'] + (df['days_left'] * df['growth']
However, I would also like to update the number of fish that each individual represents (i.e., no_fish) on a daily basis using a size-specific equation. I'm fairly new to python so I initially thought to use a for-loop (I'm not sure if there is another, more efficient way). My code is as follows:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_to_forecast[indx]):
# (1) update individual length
df.lgth[indx] = df.lgth[indx] + df.linearGR[indx]
# (2) estimate daily size-specific mortality
if df.lgth[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.lgth[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.lgth[indx])
elif df.lgth[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.lgth[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
The above code now works correctly, but it is still far to inefficient to run for 40,000 individuals each for 200+ days.
I would really appreciate any advice on how to modify the following code to make it pythonic.
Thanks
Another option that was suggested to me is to use the pd.dataframe.apply function. This dramatically reduced the overall the run time and could be useful to someone else in the future.
### === RUN SIMULATION === ###
start_time = time.perf_counter() # keep track of run time -- START
#-------------------------------------------------------------------------#
def function_to_apply( df ):
df['z_instantMort'] = ''
for indx in range(int(df['days_left'])):
# (1) update individual length
df['length'] = df['length'] + df['growth']
# (2) estimate daily size-specific mortality
if df['length'] > 50.0:
df['z_instantMort'] = 0.01
else:
if df['length'] <= 50.0:
df['z_instantMort'] = 0.052857-((0.03/35)*df['length'])
elif df['length'] < 15.0:
df['z_instantMort'] = 0.728*np.exp(-0.1892*df['length'])
whole_fish = round(df['no_fish'], 0)
if whole_fish < 1.0:
df['no_fish'] = 0.0
elif whole_fish >= 1.0:
df['no_fish'] = df['no_fish']*np.exp(-(df['z_instantMort']))
return df
#-------------------------------------------------------------------------#
sim_results = df.apply(function_to_apply, axis=1)
total_elapsed_time = round(time.perf_counter() - start_time, 2) # END
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(sim_results)
### ====================== ###
output being...
Forecast iteration completed in 0.05 seconds
unique_id length no_fish days_left growth z_instantMort
0 2.0 126.3194 148.729190 253.0 0.3898 0.01
1 4.0 116.5742 93.018465 253.0 0.3414 0.01
2 5.0 129.0320 0.000000 254.0 0.4080 0.01
3 13.0 127.3784 132.864757 256.0 0.3839 0.01
As I said in my comment, a preferable alternative to for loops in this setting is using vector operations. For instance, running your code:
import pandas as pd
import time
import math
import numpy as np
data = {'unique_id': [2, 4, 5, 13],
'length': [27.7, 30.2, 25.4, 29.1],
'no_fish': [3195, 1894, 8, 2774],
'days_left': [253, 253, 254, 256],
'growth': [0.3898, 0.3414, 0.4080, 0.3839]
}
df = pd.DataFrame(data)
print(df)
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_left[indx]):
# (1) update individual length
df.length[indx] = df.length[indx] + df.growth[indx]
# (2) estimate daily size-specific mortality
if df.length[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.length[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.length[indx])
elif df.length[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.length[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output:
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Forecast iteration completed in 31.75 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.01
1 4 116.5742 93.018465 253 0.3414 0.01
2 5 129.0320 0.000000 254 0.4080 0.01
3 13 127.3784 132.864757 256 0.3839 0.01
Now with vector operations, you could do something like:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for day in range(1, df.days_left.max() + 1):
update = day <= df['days_left']
# (1) update individual length
df[update]['length'] = df[update]['length'] + df[update]['growth']
# (2) estimate daily size-specific mortality
df[update]['z'] = np.where( df[update]['length'] > 50.0, 0.01, 0.052857-( ( 0.03 / 35)*df[update]['length'] ) )
df[update]['z'] = np.where( df[update]['length'] < 15.0, 0.728 * np.exp(-0.1892*df[update]['length'] ), df[update]['z'] )
df[update]['no_fish'].round(decimals = 0)
df[update]['no_fish'] = np.where(df[update]['no_fish'] < 1.0, 0.0, df[update]['no_fish'] * np.exp(-(df[update]['z'])))
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output
Forecast iteration completed in 1.32 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.0
1 4 116.5742 93.018465 253 0.3414 0.0
2 5 129.0320 0.000000 254 0.4080 0.0
3 13 127.3784 132.864757 256 0.3839 0.0
Related
I want to perform desired computations based in my 'x' and 'y' coordinate values as in below table (Table 1):
TIMESTEP nparticles v_x v_y radius area_sum vx_a1 vy_a1 phi_1
0 1 0.0 0.0 0.490244 0.7550478008000959 1.90579 -1.83605 0.36630
100 1 0.369944 0.196252 0.490244 0.7550478008000959
200 1 -0.110178 -0.233131 0.490244 0.7550478008000959
...
...
97400 1 -1.03617 -7.24768 0.461981 0.6704989496863082
97500 1 -1.30016 -7.25768 0.461981 0.6704989496863082
...
...
For which I am using this code for my above generated dataframe:
bindistance = 0.25
orfl = -4.0
orfr = 4.0
bin_xc = np.arange(orfl, orfr, bindistance)
nbins = len(bin_xc)
binx = 0.25
xo_min = -4.0
xo_max = 4.0
xb1_c = xo_min
xb1_max = xb1_c + (binx * 2)
xb1_min = xb1_c - (binx * 2)
yb_min = -0.5
yb_max = 0.5
yb_c = 0
x_particle1 = df.loc[(df['x'] < xb1_max) &
(df['x'] > xb1_min)]
xy_particle1 =
x_particle1.loc[(x_particle1['y'] < yb_max)
& (x_particle1['y'] > yb_min)]
output1 = xy_particle1.groupby("TIMESTEP").agg(nparticles = ("id", "count"), v_x=("vx", "sum"), v_y=("vy", "sum"), radius = ("radius", "sum"), area_sum = ("Area", "sum"))
nsum1 = output1["nparticles"].sum()
vxsum1 = output1["v_x"].sum()
vysum1 = output1["v_y"].sum()
v_a1 = vxsum1 / nsum1
vy_a1 = vysum1 / nsum1
phi_1 = output1["area_sum"].sum() / 1001
But I am having very large number of desired dataframes (first dataframe is above shown) based in my 'x' and 'y' coordinate conditions. So manually writing each code 50 or more times is not feasible. How I can do with using a loop or otherwise? Please help
This is my input dataset (df):
TIMESTEP id radius x y vx vy Area
0 42 0.490244 -3.85683 0.489375 0.0 0.0 0.7550478008000959
0 245 0.479994 -2.88838 0.479446 0.0 0.0 0.7238048519265009
0 344 0.463757 -1.94613 0.463363 0.0 0.0 0.6756640757454175
0 313 0.503268 -0.981364 0.501991 0.0 0.0 0.7956984398459999
...
...
100000 1051 0.542993 0.887743 1.71649 -0.309668 -5.83282 0.9262715700848821
100000 504 0.540275 2.87158 1.94939 -5.76545 -2.30889 0.9170217083878441
100000 589 0.450005 3.86868 1.89373 -4.49676 -2.63977 0.636186649597414
...
...
I have a pandas dataframe and I need to calculate the sum of a column of values that fall within a certain window. So for instance, if I have a window of 500, and my initial value is 1000, I want to sum all values that are between 499 and 999, and also between 1001 and 1501.
This is easier to explain with some data:
chrom pos end AFR EUR pi
0 1 10177 10177 0.4909 0.4056 0.495988
1 1 10352 10352 0.4788 0.4264 0.496369
2 1 10617 10617 0.9894 0.9940 0.017083
3 1 11008 11008 0.1346 0.0885 0.203142
4 1 11012 11012 0.1346 0.0885 0.203142
5 1 13110 13110 0.0053 0.0567 0.053532
6 1 13116 13116 0.0295 0.1869 0.176091
7 1 13118 13118 0.0295 0.1869 0.176091
8 1 13273 13273 0.0204 0.1471 0.139066
9 1 13550 13550 0.0008 0.0080 0.007795
10 1 14464 14464 0.0144 0.1859 0.161422
11 1 14599 14599 0.1210 0.1610 0.238427
12 1 14604 14604 0.1210 0.1610 0.238427
13 1 14930 14930 0.4811 0.5209 0.500209
14 1 14933 14933 0.0015 0.0507 0.044505
15 1 15211 15211 0.5371 0.7316 0.470848
16 1 15585 15585 0.0008 0.0020 0.002635
17 1 15644 15644 0.0008 0.0080 0.007795
18 1 15777 15777 0.0159 0.0149 0.030470
19 1 15820 15820 0.4849 0.2714 0.477153
20 1 15903 15903 0.0431 0.4652 0.349452
21 1 16071 16071 0.0091 0.0010 0.011142
22 1 16142 16142 0.0053 0.0020 0.007721
23 1 16949 16949 0.0227 0.0159 0.038759
24 1 18643 18643 0.0023 0.0080 0.009485
25 1 18849 18849 0.8411 0.9911 0.170532
26 2 30923 30923 0.6687 0.9364 0.338400
27 2 20286 46286 0.0053 0.0010 0.006863
28 2 21698 46698 0.0015 0.0010 0.002566
29 2 42159 47159 0.0083 0.0696 0.067187
So I need to subset based on the first two columns. For example, if my window = 500, my chrom = 1 and my pos = 15500, I will need to subset my df to include only those rows that have chrom = 1 and 15000 > pos < 16000.
I would then like to sum the AFR column of this subset of data.
Here is the function I have made:
#vdf is my main dataframe,
#polyChrom is the chromosome to subset by,
#polyPos is the position to subset by.
#Distance is how far the window should be from the polyPos.
#windowSize is the size of the window itself
#E.g. if distance=20000 and windowSize= 500, we are looking at a window
#that is (polyPos-20000)-500 to (polyPos-20000) and a window that is
#(polyPos+20000) to (polyPos+20000)+500.
def mafWindow(vdf, polyChrom, polyPos, distance, windowSize):
#If start position becomes less than 0, set it to 0
if(polyPos - distance < 0):
start1 = 0
end1 = windowSize
else:
start1 = polyPos - distance
end1 = start1 + windowSize
end2 = polyPos + distance
start2 = end2 - windowSize
#subset df
df = vdf.loc[(vdf['chrom'] == polyChrom) & ((vdf['pos'] <= end1) & (vdf['pos'] >= start1))|
((vdf['pos'] <= end2) & (vdf['pos'] >= start2))].copy()
return(df.AFR.sum())
This whole method works on subsetting the dataframe and is very slow when my dataframe contains ~55k rows. Is there a quicker and more efficient way of doing this?
The trick is to drop down to numpy arrays. Pandas indexing and slicing is slow.
import pandas as pd
df = pd.DataFrame([[1, 10177, 0.5], [1, 10178, 0.2], [1, 20178, 0.1],
[2, 10180, 0.3], [1, 10180, 0.4]], columns=['chrom', 'pos', 'AFR'])
chrom = df['chrom'].values
pos = df['pos'].values
afr = df['AFR'].values
def filter_sum(chrom_arr, pos_arr, afr_arr, chrom_val, pos_start, pos_end):
return sum(k for i, j, k in zip(chrom_arr, pos_arr, afr_arr) \
if pos_start < j < pos_end and i == chrom_val)
filter_sum(chrom, pos, afr, 1, 10150, 10200)
# 1.1
I have dataframe
ID Value
A 70
A 80
A 1000
A 100
A 200
A 130
A 60
A 300
A 800
A 200
A 150
A 250
I need to replace outliers to median value.
I use
df = pd.read_excel("test.xlsx")
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25), \
'median': grouped['Value'].median(), 'q3' :
grouped['Value'].quantile(.75)})
def is_outlier(row):
iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
median = statBefore.loc[row.ID]['median']
q3 = statBefore.loc[row.ID]['q3']
q1 = statBefore.loc[row.ID]['q1']
if row.Value > (q3 + (3 * iq_range)) or row.Value < (q1 - (3 * iq_range)):
return True
else:
return False
#apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis = 1)
But it returns me median - 175 and q1 - 92, but I get - 90, and it returns me q3 - 262,5, but I count and get - 275.
What wrong there?
This is simple and performant, with no Python for-loops to slow it down:
s = pd.Series([30, 31, 32, 45, 50, 999]) # example data
s.where(s.between(*s.quantile([0.25, 0.75])), s.median())
It gives you:
0 38.5
1 38.5
2 32.0
3 45.0
4 38.5
5 38.5
Unpacking that code, we have s.quantile([0.25, 0.75]) to get this:
0.25 31.25
0.75 48.75
We then use the values (31.25 and 48.75) as arguments to between(), with the * operator to unpack them because between() expects two separate arguments, not an array of length 2. That gives us:
0 False
1 False
2 True
3 True
4 False
5 False
Now that we have the binary mask, we use s.where() to choose the original values at the True locations, and fall back to s.median() otherwise.
This is just how quantiles are defined
df = pd.DataFrame(np.array([60,70,80,100,130,150,200,200,250,300,800,1000]))
print df.quantile(.25)
print df.quantile(.50)
print df.quantile(.75)
(The q1 for your data set is 95 btw)
The median is in between 150 and 200 (175)
The first quantile is 3 quarters between 80 and 100 (95)
The thrid quantile is 1 quarter in between 250 and 300 (262.5)
I have a large catalog that I am selecting data from according to the following criteria:
columns = ["System", "rp", "mp", "logg"]
catalog = pd.read_csv('data.txt', skiprows=1, sep ='\s+', names=columns)
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
new_catalog = pd.DataFrame(catalog[i])
print("{0} targets after cuts".format(len(new_catalog)))
When I perform the above cuts the code is working fine. Next, I want to add one more cut: I want to select all the targets that have 4.0 < logg < 5.0. However, some of the targets have logg = -1 (which stands for the fact that the value is not available). Luckily, I can calculate logg from the other available parameters. So here is my updated cuts:
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
if catalog.logg[i] == -1:
catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
i &= (4 <= catalog.logg) & (catalog.logg <= 5)
However, I am receiving an error:
if catalog.logg[i] == -1:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can someone please explain what I am doing wrong and how I can fix it. Thank you
Edit 1
My dataframe looks like the following:
Data columns:
System 477 non-null values
rp 477 non-null values
mp 477 non-null values
logg 477 non-null values
dtypes: float64(37), int64(3), object(3)None
Edit 2
System rp mp logg FeH FeHu FeHl Mstar Mstaru Mstarl
0 target-01 5196 24 24 0.31 0.04 0.04 0.905 0.015 0.015
1 target-02 5950 150 150 -0.30 0.25 0.25 0.950 0.110 0.110
2 target-03 5598 50 50 0.04 0.05 0.05 0.997 0.049 0.049
3 target-04 6558 44 -1 0.14 0.04 0.04 1.403 0.061 0.061
4 target-05 6190 60 60 0.05 0.07 0.07 1.194 0.049 0.050
....
[5 rows x 43 columns]
Edit 3
My code in a format that I understand should be:
for row in range(len(catalog)):
parameter = catalog['logg'][row]
if parameter == -1:
parameter = catalog['mp'][row] / catalog['rp'][row]
if parameter > 4.0 and parameter < 5.0:
# select this row for further analysis
However, I am trying to write my code in a more simple and professional way. I don't want to use the for loop. How can I do it?
EDIT 4
Consider the following small example:
System rp mp logg
target-01 2 -1 2 # will NOT be selected since mp = -1
target-02 -1 3 4 # will NOT be selected since rp = -1
target-03 7 6 4.3 # will be selected since mp != -1, rp != -1, and 4 < logg <5
target-04 3.2 15 -1 # will be selected since mp != -1, rp != -1, logg = mp / rp = 15/3.2 = 4.68 (which is between 4 and 5)
you get the error because catalog.logg[i] is not a scalar,but a series,so you should turn to vectorized manipulation:
catalog.loc[i,'logg'] = catalog.loc[i,'mp']/catalog.loc[i,'rp']
which would modify the logg column inplace
As for edit 3:
rows=catalog.loc[(catalog.logg > 4) & (catalog.logg < 5)]
which will select rows that satisfy the condition
Instead of that code:
if catalog.logg[i] == -1:
catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
You could use following:
i &= df.logg == -1
df.loc[i, 'logg'] = df.loc[i, 'mp'] / df.loc[i, 'rp']
# or
df.ix[i, 'logg'] = df.ix[i, 'mp'] / df.ix[i, 'rp']
For your edit 3 you need to add that line:
your_rows = df[(df.logg > 4) & (df.logg < 5)]
Full code:
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= df.logg == -1
df.ix[i, 'logg'] = df.ix[i, 'mp'] / df.ix[i, 'rp']
your_rows = df[(df.logg > 4) & (df.logg < 5)]
EDIT
Probably I still don't understand what you want, but I get your desired output:
import pandas as pd
from io import StringIO
data = """
System rp mp logg
target-01 2 -1 2
target-02 -1 3 4
target-03 7 6 4.3
target-04 3.2 15 -1
"""
catalog = pd.read_csv(StringIO(data), sep='\s+')
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.ix[i, 'logg'] = catalog.ix[i, 'mp'] / catalog.ix[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
In [7]: your_rows
Out[7]:
System rp mp logg
2 target-03 7.0 6 4.3000
3 target-04 3.2 15 4.6875
Am I still wrong?
New to Pandas, looking for the most efficient way to do this.
I have a Series of DataFrames. Each DataFrame has the same columns but different indexes, and they are indexed by date. The Series is indexed by ticker symbol. So each item in the Sequence represents a single time series of each individual stock's performance.
I need to randomly generate a list of n data frames, where each dataframe is a subset of some random assortment of the available stocks' histories. It's ok if there is overlap, so long as start end end dates are different.
This following code does it, but it's really slow, and I'm wondering if there's a better way to go about it:
Code
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
if type(data) != pd.Series:
return None
if subset=='validate':
offset = 0
elif subset=='test':
offset = 200
elif subset=='train':
offset = 400
tickers = np.random.randint(0, len(data), size=len(data))
ret_data = []
while len(ret_data) != batch_size:
for t in tickers:
data_t = data[t]
max_len = len(data_t)-timesteps-1
if len(ret_data)==batch_size: break
if max_len-offset < 0: continue
index = np.random.randint(offset, max_len)
d = data_t[index:index+timesteps]
if len(d)==timesteps: ret_data.append(d)
return ret_data
Profile output:
Timer unit: 1e-06 s
File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
137 #profile
138 def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
139 1 5 5.0 0.0 if type(data) != pd.Series:
140 return None
141
142 1 1 1.0 0.0 if subset=='validate':
143 offset = 0
144 1 1 1.0 0.0 elif subset=='test':
145 offset = 200
146 1 0 0.0 0.0 elif subset=='train':
147 1 1 1.0 0.0 offset = 400
148
149 1 1835 1835.0 11.4 tickers = np.random.randint(0, len(data), size=len(data))
150
151 1 2 2.0 0.0 ret_data = []
152 2 3 1.5 0.0 while len(ret_data) != batch_size:
153 116 148 1.3 0.9 for t in tickers:
154 116 2497 21.5 15.5 data_t = data[t]
155 116 317 2.7 2.0 max_len = len(data_t)-timesteps-1
156 116 80 0.7 0.5 if len(ret_data)==batch_size: break
157 115 69 0.6 0.4 if max_len-offset < 0: continue
158
159 100 101 1.0 0.6 index = np.random.randint(offset, max_len)
160 100 10840 108.4 67.2 d = data_t[index:index+timesteps]
161 100 241 2.4 1.5 if len(d)==timesteps: ret_data.append(d)
162
163 1 1 1.0 0.0 return ret_data
Are you sure you need to find a faster method? Your current method isn't that slow. The following changes might simplify, but won't necessarily be any faster:
Step 1: Take a random sample (with replacement) from the list of dataframes
rand_stocks = np.random.randint(0, len(data), size=batch_size)
You can treat this array rand_stocks as a list of indices to be applied to your Series of dataframes. The size is already batch size so that eliminates the need for the while loop and your comparison on line 156.
That is, you can iterate over rand_stocks and access the stock like so:
for idx in rand_stocks:
stock = data.ix[idx]
# Get a sample from this stock.
Step 2: Get a random datarange for each stock you have randomly selected.
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = data_t[start_idx:start_idx+timesteps]
I don't have your data, but here's how I put it together:
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
if subset=='train': offset = 0 #you can obviously change this back
rand_stocks = np.random.randint(0, len(data), size=batch_size)
ret_data = []
for idx in rand_stocks:
stock = data[idx]
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]
ret_data.append(d)
return ret_data
Creating a dataset:
In [22]: import numpy as np
In [23]: import pandas as pd
In [24]: rndrange = pd.DateRange('1/1/2012', periods=72, freq='H')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02 2.025795
2012-01-03 1.731667
2012-01-04 0.092725
2012-01-05 -0.489804
2012-01-06 -0.090041
In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]
Testing the function:
In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23 1.464576
2012-01-24 -1.052048,
2012-01-23 1.464576
2012-01-24 -1.052048]