I want to run two nested for loops in which I calculate the annualized returns of a hypothetical trading strategy based on moving average crossovers. It's pretty simple: go long as soon as the "faster" MA crosses above the "slower" one; otherwise move to cash.
My data looks like this:
My Code:
rets = {}
ann_rets = {}
#Nested Loop to calculate returns
for short in range(20, 200):
    for long in range(short + 1, 200):
        # Calculate cumulative return
        rets[short, long] = (aapl[short, long][-1] - aapl[short, long][1]) / aapl[short, long][1]
        # Calculate annualized return
        ann_rets[short, long] = ((1 + rets[short, long]) ** (12 / D)) - 1
The error message I get is the following:
TypeError: list indices must be integers or slices, not tuple
EDIT:
Using a dictionary works fine. The screenshot below shows where I'm stuck at the moment.
I want to have three final columns: (SMA_1,SMA_2,Ann_rets)
SMA_1: First Moving average e.g. 20
SMA_2: Second Moving average e.g. 50
Ann_rets: annualized return which is calculated in the loop above
I have tried to understand your question; hope this helps. I simplified your ann_rets output to illustrate reformatting it into the expected output format. Kind regards.
rets = {}
ann_rets = {}
#Nested Loop to calculate returns
for short in range(20, 200):
    for long in range(short + 1, 200):
        # Calculate cumulative return
        rets[short, long] = (aapl[short, long][-1] - aapl[short, long][1]) / aapl[short, long][1]
        # Calculate annualized return
        ann_rets[short, long] = ((1 + rets[short, long]) ** (12 / D)) - 1
# Reformat
Example data:
ann_rets = {(1,2): 0.1, (3,4):0.2, (5,6):0.3}
df1 = pd.DataFrame(ann_rets.values())
df2 = pd.DataFrame(list(ann_rets.keys()))
df = pd.concat([df2, df1], axis=1)
df.columns = ['SMA_1','SMA_2','Ann_rets']
print(df)
Which yields:
SMA_1 SMA_2 Ann_rets
0 1 2 0.1
1 3 4 0.2
2 5 6 0.3
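An equivalent route, sketched on the same example data, is to build a Series from the dict (the tuple keys become a two-level index) and reset it into columns:

```python
import pandas as pd

ann_rets = {(1, 2): 0.1, (3, 4): 0.2, (5, 6): 0.3}

# Tuple keys turn into a MultiIndex; reset_index converts its levels to columns.
df = pd.Series(ann_rets).rename_axis(['SMA_1', 'SMA_2']).reset_index(name='Ann_rets')
print(df)
```

This avoids the concat step and names the columns in one pass.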
You're trying to access a list index with a tuple here: rets[short, long].
Try instead using a dictionary. So change
rets = []
ann_rets = []
to
rets = {}
ann_rets = {}
A double index like rets[short, long] will work for NumPy arrays and pandas DataFrames (like, presumably, your aapl variable), but not for a regular Python list. Use rets[short][long] instead. (Which also means you would need to change the initialization of rets at the top of your code.)
To explain the actual error message briefly: a tuple is, more or less, defined by the separating comma. That is, Python sees short,long and turns it into the tuple (short, long), which is then used as the list index. That, of course, fails and throws this error message.
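A minimal, self-contained illustration of both the failure and the dictionary fix (values are made up):

```python
# Indexing a list with a tuple fails: Python turns 20, 50 into the tuple (20, 50).
rets_list = []
try:
    rets_list[20, 50]
except TypeError as e:
    print(e)  # list indices must be integers or slices, not tuple

# A dict happily accepts that same tuple as a key.
rets = {}
rets[20, 50] = 0.12
print(rets[(20, 50)])  # 0.12
```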
I am stuck at the moment and don't really know how to solve this problem.
I want to apply this calculation to a list/dataframe:
The equation itself is not really the problem for me: I can easily solve it manually, but that wouldn't do with the amount of data I have.
v : value to be approximated
vi: known values (in my case Temperatures)
di: distance to the approximated point
So basically this is for calculating/approximating a new temperature value for a position a certain distance away from the corners of the square:
import pandas as pd
import numpy as np
import xarray as xr
import math
filepath = r'F:\Data\data.nc' # just the path to the file
obj= xr.open_dataset(filepath)
# This is where I get the coordinates for each of the corners of the square
# from the netcdf4 file
lat = 9.7398
lon = 51.2695
xlat = obj['XLAT'].values
xlon = obj['XLON'].values
p_1 = [xlat[0,0], xlon[0,0]]
p_2 = [xlat[0,1], xlon[0,1]]
p_3 = [xlat[1,0], xlon[1,0]]
p_4 = [xlat[1,1], xlon[1,1]]
p_rect = [p_1, p_2, p_3, p_4]
p_orig = [lat, lon]
#=================================================
# Calculates the distance between the points
# d = sqrt((x2-x1)^2 + (y2-y1)^2))
#=================================================
distance = []
for coord in p_rect:
    distance.append(math.sqrt(math.pow(coord[0] - p_orig[0], 2) + math.pow(coord[1] - p_orig[1], 2)))
# To get the values for the key ['WS'], for example:
a = obj['WS'].values[:,0,0,0] # Array of floats for the first values
b = obj['WS'].values[:,0,0,1] # Array of floats for the second values
c = obj['WS'].values[:,0,1,0] # Array of floats for the third values
d = obj['WS'].values[:,0,1,1] # Array of floats for the fourth values
From then on, I have no idea how I should continue, should I do:
df = pd.DataFrame()
df['a'] = a
df['b'] = b
df['c'] = c
df['d'] = d
Then somehow work with DataFrames and drop a, b, c, d after I get the needed values, or should I do it with lists first and then add only the result to the dataframe? I am a bit lost.
The only thing I came up with so far is how it would look like if I would do it manually:
for i starting at 0 and ending when the end of the list is reached (a, b, c, d all have the same length):
1/a[i]^2*distance[0] + 1/b[i]^2*distance[1] + 1/c[i]^2*distance[2] + 1/d[i]^2*distance[3]
v = ------------------------------------------------------------------------------------------
1/a[i]^2 + 1/b[i]^2 + 1/c[i]^2 + 1/d[i]^2
This is the first time I had such a (at least for me) complex calculation on a list/dataframe. I hope you can help me solve this problem or at least nudge me in the right direction.
PS: here is the link to the file:
LINK TO FILE
Simply vectorize your calculations. With data frames you can run whole arithmetic operations directly on columns, as if they were scalars, to generate another column, df['v']. The code below assumes distance is a list of four scalars; also remember that in Python ^ does not mean power, so use ** instead.
df = pd.DataFrame({'a':a, 'b':b, 'c':c, 'd':d})
df['v'] = (1/df['a']**2 * distance[0] +
1/df['b']**2 * distance[1] +
1/df['c']**2 * distance[2] +
1/df['d']**2 * distance[3]) / (1/df['a']**2 +
1/df['b']**2 +
1/df['c']**2 +
1/df['d']**2)
Or the functional form using pandas Series binary operators. The version below follows the order of operations (parentheses --> exponents --> multiplication/division --> addition/subtraction):
df['v'] = (df['a'].pow(2).pow(-1).mul(distance[0]) +
df['b'].pow(2).pow(-1).mul(distance[1]) +
df['c'].pow(2).pow(-1).mul(distance[2]) +
df['d'].pow(2).pow(-1).mul(distance[3])) / (df['a'].pow(2).pow(-1) +
df['b'].pow(2).pow(-1) +
df['c'].pow(2).pow(-1) +
df['d'].pow(2).pow(-1))
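As a quick sanity check with made-up numbers (the corner values and distances below are hypothetical, not from the netCDF file), both spellings produce the same column:

```python
import numpy as np
import pandas as pd

# Hypothetical corner values and distances
a = np.array([280.0, 282.0])
b = np.array([281.0, 283.0])
c = np.array([279.5, 281.5])
d = np.array([280.5, 282.5])
distance = [1.0, 2.0, 1.5, 2.5]

df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})

# Plain arithmetic form
v_arith = (1/df['a']**2 * distance[0] + 1/df['b']**2 * distance[1] +
           1/df['c']**2 * distance[2] + 1/df['d']**2 * distance[3]) / \
          (1/df['a']**2 + 1/df['b']**2 + 1/df['c']**2 + 1/df['d']**2)

# Functional form with Series operators
v_func = (df['a'].pow(2).pow(-1).mul(distance[0]) + df['b'].pow(2).pow(-1).mul(distance[1]) +
          df['c'].pow(2).pow(-1).mul(distance[2]) + df['d'].pow(2).pow(-1).mul(distance[3])) / \
         (df['a'].pow(2).pow(-1) + df['b'].pow(2).pow(-1) +
          df['c'].pow(2).pow(-1) + df['d'].pow(2).pow(-1))

print(np.allclose(v_arith, v_func))  # True
```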
I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
return_columns = []
return_columns_cb = lambda x: return_columns.append(x)

def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet

    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())

    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol
def combine_yield(field):
    # field is a list of the files for the field I'm working with
    # lots of pre-processing
    # dfs in this case is a list of the dataframes for the current field
    # mdf is the dataframe with the most points, which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]),
                      callback=return_columns_cb)
    p.close()
    p.join()  # wait for the workers before consuming return_columns
    for col in return_columns:
        mdf = mdf.append(col)
    # I unzip my points back to longitude and latitude here in the final
    # dataframe so I can write to csv without tuples
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf
def multiprocess_combine_yield():
    '''do stuff to get the dictionary below, with each field name as key and
    all the files for that field as values'''
    yield_by_field = {'C01': ('files...'), ...}
    # The farm I'm working on has 30 fields and the loop below is too slow
    for k, v in yield_by_field.items():
        combine_yield(v)
I guess what I need help with is this: I envision something like using a pool to imap or apply_async on each tuple of files in the dictionary. Then, within the combine_yield function applied to that tuple of files, I want to be able to parallel-process the distance function. That function bogs the program down because it calculates the distance between every point in each of the dataframes for each year of yield. The files average around 1200 data points, and then you multiply all of that by 30 fields, so I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me the value from gdB and the distance, though, because of what I do later on when selecting which rows to use from the mdf dataframe.
Thanks to @ALollz's comment, I figured this out. I went back to my getnearestpoint function and, instead of doing a bunch of Series.apply calls, I am now using cKDTree from scipy.spatial to find the closest point, and then a vectorized haversine distance to calculate the true distances on each of the matched points. Much, much quicker. Here are the basics of the code:
import re
from itertools import zip_longest
from multiprocessing import Pool

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    tree = cKDTree(data=gdB_coordinates)
    distances, indices = tree.query(gdA_coordinates, k=1)
    # These column names are done this way due to the formatting of my 'retcols'
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA
def combine_yield(field):
    # same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    # Convert the main coordinates to radians as well, so the haversine terms
    # below compare like with like (the per-year coords are converted in the loop)
    main_coords = np.deg2rad(np.array(list(zip(mdf.Longitude, mdf.Latitude))))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for lon_col, lat_col in year_coords:
        year = re.search(r"\d{4}", lon_col).group(0)
        yc = np.deg2rad(np.array(list(zip(mdf.loc[:, lon_col], mdf.loc[:, lat_col]))))
        lat_year = yc[:, 1]
        longitude_year = yc[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        # 2.0902e7 ft is roughly the Earth's radius in feet
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
This has made a substantial difference. Hope it helps anyone else in a similar predicament.
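To see the shape of what getnearestpoint does, here is a toy run with hypothetical coordinate frames and a hypothetical value column ('Yield'); the coordinates are planar, so cKDTree's Euclidean metric is fine for the illustration:

```python
import pandas as pd
from scipy.spatial import cKDTree

# gdA holds the query points, gdB holds candidates plus a value column to pull across
gdA = pd.DataFrame({'Longitude': [0.0, 10.0], 'Latitude': [0.0, 10.0]})
gdB = pd.DataFrame({'Longitude': [0.1, 9.8, 5.0],
                    'Latitude':  [0.2, 9.9, 5.0],
                    'Yield':     [100, 200, 150]})

tree = cKDTree(gdB[['Longitude', 'Latitude']].to_numpy())
distances, indices = tree.query(gdA[['Longitude', 'Latitude']].to_numpy(), k=1)

# Nearest to (0, 0) is row 0; nearest to (10, 10) is row 1
print(gdB.loc[indices, 'Yield'].tolist())  # [100, 200]
```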
I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods, however, I believe that the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns a list in ascending order, while I simply want the original order preserved, only with duplicates removed.
Once again, I apologise to the advanced users who will look at my code and shake their heads; I'm still learning! And, yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list, etc.), unfortunately to no avail, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand these are pretty much the best data structures to house data in, such that they can be recalled and used later; plus, it is quite useful to be able to name the columns without affecting the data contained in the dataframe.
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5,3),columns=('startdate','enddate','price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#convert datetype to be datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensuring that the bootstrapping
#prioritises the shorter-term contracts
F = F.sort_values(by=['duration'], ascending=True)
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0, j] == 0):
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0, j] = flUniqueCalendarMarker
            P1.iloc[i, j] = flUniqueCalendarMarker
            R.iloc[0, j] = i
            for k in list(range(j + 1, intNbCalendarDays)):
                if (C.iloc[0, k] == 0 and P.iloc[i, k] != 0):
                    C.iloc[0, k] = flUniqueCalendarMarker
                    P1.iloc[i, k] = flUniqueCalendarMarker
                    R.iloc[0, k] = i
        elif (C.iloc[0, j] != 0 and P.iloc[i, j] != 0):
            P1.iloc[i, j] = C.iloc[0, j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order-preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your dataframe contains any duplicates.
If it does, you can remove them with DataFrame.drop_duplicates().
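For example (a sketch with made-up values), drop_duplicates keeps first-occurrence order, while np.unique sorts:

```python
import numpy as np
import pandas as pd

C = pd.Series([3, 1, 3, 2, 1])

print(np.unique(C))  # [1 2 3] -- sorted, which is the unwanted behaviour

# Keeps the original order, drops later duplicates
C1 = C.drop_duplicates().reset_index(drop=True)
print(C1.tolist())   # [3, 1, 2]
```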
I would like to do some feature enrichment on a large two-dimensional array (15 columns × 100m rows).
Working on a sample set with 100,000 records showed that I need to make this faster.
Edit (data model info)
To simplify, let's say we have only two relevant columns:
IP (identifier)
Unix (timestamp in seconds since 1970)
I would like to add a 3rd column, counting how many times this IP has shown up in the past 12 hours.
End edit
My first attempt was using pandas, because it was comfortable working with named dimensions, but too slow:
for index, row in tqdm_notebook(myData.iterrows(), desc='iterrows'):
    # how many times was the IP address (and specific device) around in the prior 12h?
    hours = 12
    seen = myData[(myData['ip'] == row['ip'])
                  & (myData['device'] == row['device'])
                  & (myData['os'] == row['os'])
                  & (myData['unix'] < row['unix'])
                  & (myData['unix'] > (row['unix'] - (60 * 60 * hours)))].shape[0]
    ip_seen = myData[(myData['ip'] == row['ip'])
                     & (myData['unix'] < row['unix'])
                     & (myData['unix'] > (row['unix'] - (60 * 60 * hours)))].shape[0]
    myData.loc[index, 'seen'] = seen
    myData.loc[index, 'ip_seen'] = ip_seen
Then I switched to numpy arrays and hoped for a better result, but it is still too slow to run against the full dataset:
# speed test numpy arrays
for i in np.arange(myArray.shape[0]):
    hours = 12
    ip, device, os, ts = myArray[i, [0, 3, 4, 12]]
    ip_seen = myArray[np.where((myArray[:, 0] == ip)
                               & (myArray[:, 12] < ts)
                               & (myArray[:, 12] > (ts - 60 * 60 * hours)))].shape[0]
    device_seen = myArray[np.where((myArray[:, 0] == ip)
                                   & (myArray[:, 2] == device)
                                   & (myArray[:, 3] == os)
                                   & (myArray[:, 12] < ts)
                                   & (myArray[:, 12] > (ts - 60 * 60 * hours)))].shape[0]
    myArray[i, 13] = ip_seen
    myArray[i, 14] = device_seen
My next idea would be to iterate only once, and maintain a growing dictionary of the current count, instead of looking backwards in every iteration.
But that would have some other drawbacks (e.g. how to keep track when to reduce count for observations falling out of the 12h window).
How would you approach this problem?
Could it be even an option to use low level Tensorflow functions to involve a GPU?
Thanks
The only way to speed things up is to avoid looping. In your case you can try using rolling with a window of the desired time span, using the Unix timestamp as a datetime index (assuming the records are sorted by timestamp; otherwise you would need to sort first). This should work fine for the ip_seen column:
ip = myData['ip']
ip.index = pd.to_datetime(myData['unix'], unit='s')
myData['ip_seen'] = (ip.rolling('12h')
                     .agg(lambda w: np.count_nonzero(w[:-1] == w[-1]))
                     .values.astype(np.int32))
However, when the aggregation involves multiple columns, as in the seen column, it gets more complicated. Currently (see Pandas issue #15095) rolling functions do not support aggregations spanning two dimensions. A workaround is to merge the columns of interest into a single new series, for example as a tuple (which may work better if the values are numbers) or a string (which may be better if the values are already strings). For example:
criteria = myData['ip'] + '|' + myData['device'] + '|' + myData['os']
criteria.index = pd.to_datetime(myData['unix'], unit='s')
myData['seen'] = (criteria.rolling('12h')
                  .agg(lambda w: np.count_nonzero(w[:-1] == w[-1]))
                  .values.astype(np.int32))
EDIT
Apparently rolling only works with numeric types, which leaves us with two options:
Manipulate the data to use numeric types. For the IP this is easy, since it actually represents a 32-bit number (or 64-bit for IPv6, I guess). For device and OS, assuming they are strings now, it gets more complicated: you would have to map each possible value to an integer and then merge it with the IP into one long value, e.g. putting these in the higher bits or something like that (maybe even impossible with IPv6, since the biggest integers NumPy supports right now are 64 bits).
Roll over the index of myData (which should now not be datetime, because rolling cannot work with that either) and use the index window to get the necessary data and operate on it:
# Use a sequential integer index
idx_orig = myData.index
myData.reset_index(drop=True, inplace=True)

# Index to roll over
idx = pd.Series(myData.index)
idx.index = pd.to_datetime(myData['unix'], unit='s')

# Rolling aggregation function
def agg_seen(w, data, fields):
    # Use a slice for faster data frame slicing
    slc = slice(int(w[0]), int(w[-2])) if len(w) > 1 else []
    match = data.loc[slc, fields] == data.loc[int(w[-1]), fields]
    return np.count_nonzero(np.all(match, axis=1))

# Do the rolling
myData['ip_seen'] = idx.rolling('12h') \
    .agg(lambda w: agg_seen(w, myData, ['ip'])) \
    .values.astype(np.int32)
myData['seen'] = idx.rolling('12h') \
    .agg(lambda w: agg_seen(w, myData, ['ip', 'device', 'os'])) \
    .values.astype(np.int32)

# Put the original index back
myData.index = idx_orig
This is not how rolling is meant to be used, though, and I'm not sure if this gives much better performance than just looping.
As mentioned in the comment to @jdehesa, I took another approach, which allows me to iterate only once through the entire dataset and pull the (decaying) weight from an index.
decay_window = 60 * 60 * 12  # 12 h window
decay = 0.5                  # fall by 50% every window

ip_idx = pd.DataFrame(myData.ip.unique())
ip_idx['ts_seen'] = 0
ip_idx['ip_seen'] = 0
ip_idx.columns = ['ip', 'ts_seen', 'ip_seen']
ip_idx.set_index('ip', inplace=True)

for index, row in myData.iterrows():  # all rows
    # How often was this IP seen before, decayed by the time since then?
    prior_ip_seen = ip_idx.loc[row['ip'], 'ip_seen']
    prior_ts_seen = ip_idx.loc[row['ip'], 'ts_seen']
    delay_since_count = row['unix'] - prior_ts_seen
    new_ip_seen = prior_ip_seen * decay ** (delay_since_count / decay_window) + 1
    ip_idx.loc[row['ip'], 'ip_seen'] = new_ip_seen
    ip_idx.loc[row['ip'], 'ts_seen'] = row['unix']
    myData.iloc[index, 14] = new_ip_seen - 1
That way the result is not the fixed time window as requested initially, but prior observations "fade out" over time, giving frequent recent observations a higher weight.
This feature carries more information than the simplified (and, as it turned out, more expensive) approach initially planned.
Thanks for your input!
Edit
In the meantime I switched to numpy arrays for the same operation, which now only takes a fraction of the time (loop with 200m updates in <2h).
Just in case somebody looks for a starting point:
%%time
import sys
import time

# temporary lookup tables, indexed by integer IP id
ip_seen_ts = [0] * 365000
ip_seen_count = [0] * 365000

window = 60 * 60 * 12  # 12h
decay = 0.5
counter = 0
chunksize = 10000000

store = pd.HDFStore('store.h5')
t = time.process_time()
try:
    store.remove('myCount')
except KeyError:
    print("myCount not present.")

for myHdfData in store.select_as_multiple(['myData', 'myFeatures'],
                                          columns=['ip', 'unix', 'ip_seen'],
                                          chunksize=chunksize):
    print(counter, time.process_time() - t)
    # display(myHdfData.head(5))
    counter += chunksize
    t = time.process_time()
    sys.stdout.flush()
    keep_index = myHdfData.index.values
    myArray = myHdfData.as_matrix()
    for row in myArray[:, :]:
        i = row[0].astype('uint32')  # IP as identifier
        u = row[1].astype('uint32')  # timestamp
        try:
            delay = u - ip_seen_ts[i]
        except IndexError:
            delay = 0
        ip_seen_ts[i] = u
        try:
            ip_seen_count[i] = ip_seen_count[i] * decay ** (delay / window) + 1
        except IndexError:
            ip_seen_count[i] = 1
        row[3] = np.tanh(ip_seen_count[i] - 1)  # tanh to normalize between 0 and 1
    myArrayAsDF = pd.DataFrame(myArray, columns=['c_ip', 'c_unix', 'c_ip2', 'ip_seen'])
    myArrayAsDF.set_index(keep_index, inplace=True)
    store.append('myCount', myArrayAsDF)
store.close()
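The decay bookkeeping inside that loop can be isolated into a tiny pure-Python sketch (the timestamps below are hypothetical):

```python
window = 60 * 60 * 12  # 12 h, in seconds
decay = 0.5            # halve the count once per window

def decayed_count(timestamps):
    """For each event, return the decayed count of prior events."""
    last_ts, count, out = None, 0.0, []
    for ts in sorted(timestamps):
        if last_ts is not None:
            # shrink the running count by the elapsed fraction of a window
            count *= decay ** ((ts - last_ts) / window)
        out.append(count)  # decayed number of earlier observations
        count += 1
        last_ts = ts
    return out

# Three hits 12 h apart: each earlier hit has halved by the next arrival.
print(decayed_count([0, 43200, 86400]))  # [0.0, 0.5, 0.75]
```

This also shows why no explicit "falling out of the window" bookkeeping is needed: old observations simply decay towards zero.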
I need some help dropping the NaN from the list generated in the code below. I'm trying to calculate the geometric average of the list of numbers labeled prices. I can get as far as calculating the percent changes between the sequential numbers, but when I go to take the product of the list, there is a NaN that throws it off. I tried pandas dropna(), but it didn't drop anything and gave me the same output. Any suggestions would be appreciated.
Thanks.
import pandas as pd
import math
import numpy as np
prices = [2,3,4,3,1,3,7,8]
prices = pd.Series(prices)
prices = prices.iloc[::-1]
retlist = list(prices.pct_change())
retlist.reverse()
print(retlist)
calc = np.array([x + 1 for x in retlist])
print(calc)
def product(P):
    p = 1
    for i in P:
        p = i * p
    return p
print(product(calc))
retlist is a list which contains a NaN: pct_change always produces one for the first element, since there is no prior value to compare against.
You can get rid of the NaN with the following step:
retlist = [x for x in retlist if not math.isnan(x)]
After this you can continue with the other steps. Do note that the size of the list changes.
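Equivalently, the whole pipeline can stay in pandas: pct_change produces exactly one NaN (for the first element), Series.dropna removes it, and the product of the growth factors telescopes to last/first. A sketch on the same prices:

```python
import pandas as pd

prices = pd.Series([2, 3, 4, 3, 1, 3, 7, 8])

# Reverse, take percent changes, drop the leading NaN, shift to growth factors
growth = prices.iloc[::-1].pct_change().dropna() + 1

# Successive ratios telescope: the product is first/last of the original series
print(growth.prod())  # 0.25, i.e. 2/8
```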