Python: percent change between consecutive items in a list

I need some help dropping the NaN from the list generated in the code below. I'm trying to calculate the geometric average of the list of numbers labeled 'prices'. I can get as far as calculating the percent changes between the sequential numbers, but when I go to take the product of the list, there is a NaN that throws it off. I tried pandas.dropna(), but it didn't drop anything and gave me the same output. Any suggestions would be appreciated.
Thanks.
import pandas as pd
import math
import numpy as np

prices = [2, 3, 4, 3, 1, 3, 7, 8]
prices = pd.Series(prices)
prices = prices.iloc[::-1]
retlist = list(prices.pct_change())
retlist.reverse()
print(retlist)

calc = np.array([x + 1 for x in retlist])
print(calc)

def product(P):
    p = 1
    for i in P:
        p = i * p
    return p

print(product(calc))

retlist is a list which contains NaN.
You can add a step to get rid of the NaN with the following code (math is already imported in your script):
retlist = [x for x in retlist if not math.isnan(x)]
After this you can follow with the other steps. Do note that the size of the list changes.
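For reference, here is a minimal sketch of the same calculation that avoids the NaN entirely by calling dropna() on the Series of percent changes before taking the product. It assumes the prices are in chronological order and that the goal is the geometric mean of the period returns:
import pandas as pd

prices = pd.Series([2, 3, 4, 3, 1, 3, 7, 8])

# pct_change() leaves a NaN in the first position; dropna() removes it
returns = prices.pct_change().dropna()

growth = (returns + 1).prod()                # product of the (1 + r) factors
geo_mean = growth ** (1 / len(returns)) - 1  # geometric average return per period
print(geo_mean)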

Related

Going from code that compares one value to all other values, to comparing all values with all other values

I've written code that takes a number for the n-grams, a specific index, and a threshold, and returns the values that fall within that threshold. However, it currently only compares one set of tokens (at the given index) to the set of tokens at every other index. I want to compare each set of tokens at every index to every other set of tokens at every other index. I don't think this is a difficult question, but Python is my main language and I struggle with for loops a bit.
So essentially, the variable token in the function should iterate over each string in the column and be compared with comp_token, and the index argument would be removed, since it would be iterating over all indices.
Let me know if that isn't clear enough and I will think more about how to phrase it: it is difficult to describe because the thing I am asking about is the thing I am struggling with.
import py_stringmatching as sm
import pandas as pd
import numpy as np

data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])
ph.reset_index(inplace=True)

jac = sm.Jaccard()

def predict_label(num, index, thresh):
    qg_num_tok = sm.QgramTokenizer(qval=num)
    companies = ph.companies.to_list()
    ids = ph['index']
    companies_qg_num_token = {}
    companies_id2index = {}
    for i in range(len(companies)):
        companies_id2index[i] = companies[i]
        companies_qg_num_token[i] = qg_num_tok.tokenize(companies[i])
    predicted_label = [1]
    token = companies_qg_num_token[index]  # index you want: to get all the tokens
    for comp_name in ids[1:]:
        comp_token = companies_qg_num_token[comp_name]
        sim = jac.get_sim_score(token, comp_token)
        if sim > thresh:
            predicted_label.append(1)
        else:
            predicted_label.append(0)
    # companies_id2index must be equal to the token number
    ph.loc[ph['index'] != companies_id2index[index], 'label'] = 0  # if not equal to index
    ph['prediction'] = predicted_label
    print(token)
    print(companies_id2index[index])
    return ph.query('prediction==1')

predict_label(3, 1, .5)  # first argument is the q-gram size (e.g. 3), not the dataframe
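What the question describes (every set of tokens compared against every other set) can be sketched as a loop over all index pairs. This is only a minimal illustration built on the same ph DataFrame and py_stringmatching setup as above; the qval of 3 and the 0.5 threshold are arbitrary choices, not values from the question:
import itertools
import pandas as pd
import py_stringmatching as sm

ph = pd.DataFrame(['Time', "NY Times", 'Atlantic'], columns=['companies'])

jac = sm.Jaccard()
qg_tok = sm.QgramTokenizer(qval=3)

# Tokenize every company name once
tokens = {i: qg_tok.tokenize(name) for i, name in enumerate(ph['companies'])}

# Compare every index with every other index
pairs = []
for i, j in itertools.combinations(tokens, 2):
    sim = jac.get_sim_score(tokens[i], tokens[j])
    pairs.append((i, j, sim))

pair_df = pd.DataFrame(pairs, columns=['index_a', 'index_b', 'similarity'])
print(pair_df[pair_df['similarity'] > 0.5])  # keep only the pairs above the threshold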

Avoid for loop in Python DataFrame

Problem 1.
Suppose I have n years of annual returns r and my initial wealth is 100. Every year I have a fixed expense of 6. I want to create the yearly wealth path. I can do it in a for loop, but for my purposes that is too time-consuming. How do I do it with a DataFrame?
wealth = pd.Series(index=range(n + 1))
wealth[0] = 100
for i in range(n):
    wealth.iloc[i + 1] = wealth.iloc[i] * (1 + r.iloc[i]) - 6
Initially I thought
wealth = ((1 + r - 0.06).cumprod()).multiply(other=100)
would be the solution, but it is not: the expense is not 6%, it is a fixed 6.
Problem 2.
I want to do the above N times. In each case I generate r by sampling n returns with replacement.
r = returnY.sample(n,replace=True).reset_index(drop=True)
Then, for that set of returns, I create the wealth path described above and build an n*N dataframe of wealth paths. I can do this in a for loop, but for big N and n it takes a long time to run. Is there an efficient and elegant way to do this?
Problem 3.
Suppose allWealth is the DataFrame with all wealth paths. I want to check the percentage of columns in each row that are less than 0. This is how I resolved it:
yy = allWealth.copy()
yy[yy>0] = 1
yy[yy<=0] = 0
yy.sum(axis = 1)/N
Any better, more elegant solution?
Problem 1: It looks like you want to apply the "reduce" pattern. You can use the reduce function from functools.
import numpy as np
from functools import reduce
rs = np.random.random(50)*0.3 #sequence of annual returns
result = reduce(lambda w,r: w*(1+r)-6, rs, 100)
If you want to keep all the intermediate values, use itertools.accumulate() instead. For example, replace the last line with the following:
import itertools

ts_iter = itertools.accumulate(rs, lambda w, r: w * (1 + r) - 6, initial=100)
ts = list(ts_iter)  # itertools.accumulate returns an iterator
Problem 2: You can first generate a random n*N matrix by sampling with replacement. Then you can use the apply_along_axis method on each column.
import numpy as np
from functools import reduce

rm = np.random.random((n, N))  # placeholder returns; n and N as defined in the question

def sim(rs):
    return reduce(lambda w, r: w * (1 + r) - 6, rs, 100)

result = np.apply_along_axis(sim, 0, rm)
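If the returns should come from the empirical series rather than a uniform placeholder, the same pattern works with a matrix sampled with replacement from returnY. This is a sketch only; the returnY stand-in and the sizes n and N below are illustrative, not values from the question:
import numpy as np
import pandas as pd
from functools import reduce

returnY = pd.Series(np.random.randn(100) * 0.15 + 0.05)  # stand-in for the real return history
n, N = 30, 1000  # years per path, number of simulated paths

# One column per simulated path, sampled with replacement from the historical returns
rm = np.random.choice(returnY.values, size=(n, N), replace=True)

def sim(rs):
    return reduce(lambda w, r: w * (1 + r) - 6, rs, 100)

result = np.apply_along_axis(sim, 0, rm)  # final wealth of each of the N paths
print(result[:5])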
Problem 3: you don't need to assign ones and zeros to your original dataframe. A mask dataframe of True and False implicitly acts as a dataframe of ones and zeros in this case.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((50,30)))
mask = df < 0.5
mask.sum(axis=1)/30
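A small note on the same idea: the mean of a boolean mask is the fraction of True values, so the share of columns below 0 in each row (which is what the original question asked for) can be written in one line. The allWealth stand-in below is random data for illustration only:
import numpy as np
import pandas as pd

allWealth = pd.DataFrame(np.random.randn(50, 30) * 100)  # stand-in for the real wealth paths
frac_below_zero = (allWealth < 0).mean(axis=1)  # fraction of columns per row that are negative
print(frac_below_zero.head())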
I used @chi's solution with some small edits.
import numpy as np
import itertools
rm = np.random.random((n, N))  # sequence of annual returns
rm0 = np.insert(rm, 0, 100, axis=1)

def wealth(rs):
    return list(itertools.accumulate(rs, lambda w, r: w * (1 + r) - 6))

result = np.apply_along_axis(wealth, 1, rm0)
itertools.accumulate does not recognize the initial argument here, hence the initial wealth is inserted at the front of the return array.
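For context, the initial keyword argument of itertools.accumulate was only added in Python 3.8, so on 3.8 or newer the insert step can be dropped. A minimal sketch:
import itertools
import numpy as np

rs = np.random.random(50) * 0.3  # example sequence of annual returns

# Python 3.8+: seed the accumulation with the starting wealth directly
path = list(itertools.accumulate(rs, lambda w, r: w * (1 + r) - 6, initial=100))
print(path[:5])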

Slicing in a for loop

I want to run two for loops in which I calculate the annualized returns of a hypothetical trading strategy based on moving-average crossovers. It's pretty simple: go long as soon as the "faster" MA crosses the "slower" one; otherwise move to cash.
My data looks like this:
My Code:
rets = {}
ann_rets = {}

# Nested loop to calculate returns
for short in range(20, 200):
    for long in range(short + 1, 200):
        # Calculate cumulative return
        rets[short, long] = (aapl[short, long][-1] - aapl[short, long][1]) / aapl[short, long][1]
        # Calculate annualized return
        ann_rets[short, long] = ((1 + rets[short, long]) ** (12 / D)) - 1
The error message I get is the following:
TypeError: list indices must be integers or slices, not tuple
EDIT:
Using a dictionary works fine. The screenshot below shows where I'm stuck at the moment.
I want to have three final columns: (SMA_1,SMA_2,Ann_rets)
SMA_1: First Moving average e.g. 20
SMA_2: Second Moving average e.g. 50
Ann_rets: annualized return which is calculated in the loop above
I've tried to understand your question; hope this helps. I simplified your output ann_rets to illustrate reformatting to the expected output format.
import pandas as pd

rets = {}
ann_rets = {}

# Nested loop to calculate returns
for short in range(20, 200):
    for long in range(short + 1, 200):
        # Calculate cumulative return
        rets[short, long] = (aapl[short, long][-1] - aapl[short, long][1]) / aapl[short, long][1]
        # Calculate annualized return
        ann_rets[short, long] = ((1 + rets[short, long]) ** (12 / D)) - 1

# Reformat, with example data:
ann_rets = {(1, 2): 0.1, (3, 4): 0.2, (5, 6): 0.3}

df1 = pd.DataFrame(ann_rets.values())
df2 = pd.DataFrame(list(ann_rets.keys()))
df = pd.concat([df2, df1], axis=1)
df.columns = ['SMA_1', 'SMA_2', 'Ann_rets']
print(df)
Which yields:
   SMA_1  SMA_2  Ann_rets
0      1      2       0.1
1      3      4       0.2
2      5      6       0.3
You're trying to access the index of a list with a tuple here: rets[short, long].
Try instead using a dictionary. So change
rets = []
ann_rets = []
to
rets = {}
ann_rets = {}
A double index like rets[short, long] will work for NumPy arrays and pandas DataFrames (like, presumably, your aapl variable), but not for a regular Python list. Use rets[short][long] instead. (Which also means you would need to change the initialization of rets at the top of your code.)
To explain the actual error message briefly: a tuple is more or less defined by the separating comma, so Python sees short,long and turns it into the tuple (short, long), which is then used as the list index. That, of course, fails and throws this error message.
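A minimal illustration of both points, with made-up values, just to show the indexing behaviour:
rets_list = [0.1, 0.2, 0.3]
try:
    rets_list[1, 2]  # the comma makes this a tuple index, which lists reject
except TypeError as e:
    print(e)  # list indices must be integers or slices, not tuple

rets_dict = {}
rets_dict[1, 2] = 0.15  # dict keys can be tuples, so this works
print(rets_dict[(1, 2)])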

How to multiprocess finding closest geographic point in two pandas dataframes?

I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
from multiprocessing import Pool

import pandas as pd
from geopy import distance  # great_circle is assumed to come from geopy's distance module

return_columns = []
return_columns_cb = lambda x: return_columns.append(x)

def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet

    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())

    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol

def combine_yield(field):
    # field is a list of the files for the field I'm working with
    # lots of pre-processing
    # dfs in this case is a list of the dataframes for the current field
    # mdf is the dataframe with the most points, which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]),
                      callback=return_columns_cb)
    p.close()
    p.join()  # wait for the workers before using the callback results
    for col in return_columns:
        mdf = mdf.append(col)
    # I unzip my points back to longitude and latitude here in the final
    # dataframe so I can write to csv without tuples
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf

def multiprocess_combine_yield():
    '''do stuff to get the dictionary below with each field name as a key and
    all the files for that field as the value'''
    yield_by_field = {'C01': ('files...'), ...}
    # The farm I'm working on has 30 fields and the loop below is too slow
    for k, v in yield_by_field.items():
        combine_yield(v)
I guess what I need help with is this: I envision something like using a pool to imap or apply_async on each tuple of files in the dictionary. Then, within the combine_yield function applied to that tuple of files, I want to be able to parallel-process the distance function. That function bogs the program down because it calculates the distance between every point in each of the dataframes for each year of yield. The files average around 1200 data points each, and then you multiply all of that by 30 fields, so I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me the value from gdB and the distance, though, because of what I do later on when selecting which rows to use from the mdf dataframe.
Thanks to @ALollz's comment, I figured this out. I went back to my getnearestpoint function and, instead of doing a bunch of Series.apply calls, I am now using cKDTree from scipy.spatial to find the closest point, and then a vectorized haversine distance to calculate the true distances for the matched points. Much, much quicker. Here are the basics of the code:
import re
from itertools import zip_longest

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    tree = cKDTree(data=gdB_coordinates)
    distances, indices = tree.query(gdA_coordinates, k=1)
    # These column names are formatted this way because of my 'retcol' values
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA

def combine_yield(field):
    # same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    # convert the main coordinates to radians so they match year_coords in the haversine formula
    main_coords = np.deg2rad(np.array(list(zip(mdf.Longitude, mdf.Latitude))))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for i in year_coords:
        year = re.search(r"\d{4}", i[0]).group(0)
        year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
        year_coords = np.deg2rad(year_coords)
        lat_year = year_coords[:, 1]
        longitude_year = year_coords[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
This has made a substantial difference. Hope it helps anyone else in a similar predicament.
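For completeness, a minimal sketch of how that final Pool.map call might be wrapped. The __main__ guard is needed on platforms that spawn worker processes, and the dictionary contents below are purely illustrative (combine_yield is the function defined above):
from multiprocessing import Pool

if __name__ == "__main__":
    # yield_by_field maps each field name to the tuple of its yield files
    yield_by_field = {'C01': ('file_a.csv', 'file_b.csv')}  # illustrative contents only

    with Pool() as pool:
        results = pool.map(combine_yield, yield_by_field.values())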

Data Cleaning(Flagging) Dead Sensor

I have a large time series (a pandas dataframe) of wind speed (10-minute averages) which contains erroneous data from a dead sensor. How can it be flagged automatically? I was trying a moving average.
Any approach other than a moving average would also be much appreciated. I have attached the sample data image below.
There are several ways to deal with this problem. I will first convert the series to differences:
%matplotlib inline
import pandas as pd
import numpy as np
np.random.seed(0)
n = 200
y = np.cumsum(np.random.randn(n))
y[100:120] = 2
y[150:160] = 0
ts = pd.Series(y)
ts.diff().plot();
The next step is to find how long the streaks of consecutive zeros are.
def getZeroStrikeLen(x):
    """Accept a boolean array only."""
    res = np.diff(np.where(np.concatenate(([x[0]],
                                            x[:-1] != x[1:],
                                            [True])))[0])[::2]
    return res

vec = ts.diff().values == 0
out = getZeroStrikeLen(vec)
Now, if len(out) > 0, you can conclude that there is a problem. If you want to go one step further you can have a look at this; it is in R, but it's not that hard to replicate in Python.
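As a follow-on, here is a minimal sketch of turning that idea into a flag column: mark any timestamp whose value sits in a run of at least min_len identical consecutive readings. The min_len threshold and the simulated series are assumptions for illustration, not values from the question:
import numpy as np
import pandas as pd

np.random.seed(0)
ts = pd.Series(np.cumsum(np.random.randn(200)))
ts[100:120] = 2   # simulate a stuck (dead) sensor
min_len = 5       # assumed minimum run length that counts as "dead"

# Group consecutive equal values: the group id increases whenever the value changes
group_id = (ts != ts.shift()).cumsum()
run_len = ts.groupby(group_id).transform('size')

dead_flag = run_len >= min_len
print(dead_flag.sum())  # number of flagged samples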
