Loop optimization with pandas for creating two matrices - python

I need to optimize this loop, which takes about 2.5 seconds; I call it more than 3000 times in my script.
The aim of this code is to create two matrices which are then used in a linear system.
Does anyone have an idea, in Python or Cython?
import time
from datetime import datetime

import pandas as pd

## df is only here for illustration and date_indicatrice changes upon function call
df = pd.DataFrame(0, columns=range(6),
                  index=pd.date_range(start=datetime(2010, 1, 1),
                                      end=datetime(2020, 1, 1), freq="H"))
mat = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6 * 2))
date_indicatrice = [(datetime(2010, 1, 1), datetime(2010, 4, 1)),
                    (datetime(2012, 5, 1), datetime(2019, 4, 1)),
                    (datetime(2013, 4, 1), datetime(2019, 4, 1)),
                    (datetime(2014, 3, 1), datetime(2019, 4, 1)),
                    (datetime(2015, 1, 1), datetime(2015, 4, 1)),
                    (datetime(2013, 6, 1), datetime(2018, 4, 1))]

timer = time.time()
for j, (d1, d2) in enumerate(date_indicatrice):
    # rows inside the date window
    result = df[(mat.index >= d1) & (mat.index <= d2)]
    # rows inside the date window whose hour is >= 8
    result2 = df[(mat.index >= d1) & (mat.index <= d2) & (mat.index.hour >= 8)]
    mat.loc[result.index, j] = 1.
    mat_bp.loc[result2.index, j * 2] = 1.
    mat_bp[j * 2 + 1] = (1 - mat_bp[j * 2]) * mat[j]
print(time.time() - timer)

Here you go. I tested the following and I get the same resultant matrices in mat and mat_bp as in your original code, but in 0.07 seconds vs. 1.4 seconds for the original code on my machine.
The real slowdown was due to using result.index and result2.index: looking rows up by datetime label is much slower than looking them up by integer position. I used binary searches where possible to find the right positions.
import bisect
import time
from datetime import datetime

import numpy as np
import pandas as pd

## df is only here for illustration and date_indicatrice changes upon function call
df = pd.DataFrame(0, columns=range(6),
                  index=pd.date_range(start=datetime(2010, 1, 1),
                                      end=datetime(2020, 1, 1), freq="H"))
mat = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6 * 2))
date_indicatrice = [(datetime(2010, 1, 1), datetime(2010, 4, 1)),
                    (datetime(2012, 5, 1), datetime(2019, 4, 1)),
                    (datetime(2013, 4, 1), datetime(2019, 4, 1)),
                    (datetime(2014, 3, 1), datetime(2019, 4, 1)),
                    (datetime(2015, 1, 1), datetime(2015, 4, 1)),
                    (datetime(2013, 6, 1), datetime(2018, 4, 1))]

timer = time.time()
for j, (d1, d2) in enumerate(date_indicatrice):
    # binary-search the sorted DatetimeIndex for the window boundaries
    ind_start = bisect.bisect_left(mat.index, d1)
    ind_end = bisect.bisect_right(mat.index, d2)
    # integer positions of the rows in the window whose hour is >= 8
    inds = np.arange(ind_start, ind_end)
    valid_inds = inds[mat.index[ind_start:ind_end].hour >= 8]
    # set by integer position (iloc) instead of by datetime label
    mat.iloc[ind_start:ind_end, j] = 1.
    mat_bp.iloc[valid_inds, j * 2] = 1.
    mat_bp[j * 2 + 1] = (1 - mat_bp[j * 2]) * mat[j]
print(time.time() - timer)
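For what it's worth, pandas indexes also expose their own binary search via Index.searchsorted, so the bisect calls above could be swapped for the following; a minimal sketch of just that substitution, under the same setup (not part of the tested answer above):

# DatetimeIndex.searchsorted performs the binary search natively
ind_start = mat.index.searchsorted(d1, side="left")
ind_end = mat.index.searchsorted(d2, side="right")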


Avoid for loop in Python DataFrame

Problem 1.
Suppose I have n years of annual returns r and my initial wealth is 100. Every year I have a fixed expense of 6. I want to compute the yearly wealth path. I can do it in a for loop, but for my purpose that is too time consuming. How do I do it with a DataFrame?
wealth = pd.Series(index=range(n + 1))
wealth[0] = 100
for i in range(n):
    wealth.iloc[i + 1] = wealth.iloc[i] * (1 + r.iloc[i]) - 6
Initially I thought
wealth = ((1 + r - 0.06).cumprod()).multiply(other=100)
would be the solution, but it is not: the expense is not 6%, it is a fixed 6 per year.
Problem 2.
I want to do the above N times. In each case I generate r by sampling n returns with replacement.
r = returnY.sample(n,replace=True).reset_index(drop=True)
Then, for each sampled return series, I create the wealth path described above, ending up with an n*N dataframe of wealth paths. I can do this in a for loop, but for big N and n it takes a long time to run. Is there an efficient and elegant way to do this?
Problem 3.
Suppose allWealth is the DataFrame with all wealth paths. I want to check the percentage of columns in each row that are less than 0. This is how I resolved it:
yy = allWealth.copy()
yy[yy>0] = 1
yy[yy<=0] = 0
yy.sum(axis = 1)/N
Any better, more elegant solution?
Problem 1: It looks like you want to apply the "reduce" pattern. You can use the reduce function from functools.
import numpy as np
from functools import reduce
rs = np.random.random(50)*0.3 #sequence of annual returns
result = reduce(lambda w,r: w*(1+r)-6, rs, 100)
If you want to keep all the intermediate values, use itertools.accumulate() instead (its initial parameter requires Python 3.8+). For example, replace the last line with the following:
import itertools

ts_iter = itertools.accumulate(rs, lambda w, r: w * (1 + r) - 6, initial=100)
ts = list(ts_iter)  # itertools.accumulate returns an iterator
Problem 2: You can first generate an n×N matrix of returns by sampling with replacement. Then you can apply the simulation to each column with "apply_along_axis".
import numpy as np
from functools import reduce

rm = np.random.random((n, N))  # n×N matrix of sampled returns

def sim(rs):
    return reduce(lambda w, r: w * (1 + r) - 6, rs, 100)

result = np.apply_along_axis(sim, 0, rm)
Problem 3: you don't need to assign ones and zeros to your original dataframe. A mask dataframe of True and False implicitly acts as a dataframe of ones and zeros in this case.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((50,30)))
mask = df < 0.5
mask.sum(axis=1)/30
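Applied to the question's allWealth, the same masking idea collapses to a one-liner; a sketch, assuming the at-or-below-zero split used in the question's own code and that N equals the number of columns:

# boolean mask; the row-wise mean is the fraction of non-positive entries per row
frac_nonpositive = (allWealth <= 0).mean(axis=1)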
I used @chi's solution with some small edits.
import itertools
import numpy as np

rm = np.random.random((n, N))  # matrix of sampled annual returns
rm0 = np.insert(rm, 0, 100, axis=1)  # prepend the initial wealth to each row

def wealth(rs):
    return list(itertools.accumulate(rs, lambda w, r: w * (1 + r) - 6))

result = np.apply_along_axis(wealth, 1, rm0)
On my Python version, itertools.accumulate does not recognize the initial argument (it was only added in Python 3.8), hence the initial wealth is inserted at the front of each row of returns instead.
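On Python 3.8+, where accumulate does accept initial, the insertion step is unnecessary; a minimal sketch of that variant, reusing rm from above:

# accumulate(..., initial=100) yields 100 first, then each subsequent wealth value
def wealth(rs):
    return list(itertools.accumulate(rs, lambda w, r: w * (1 + r) - 6, initial=100))

result = np.apply_along_axis(wealth, 1, rm)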

How do I add a matrix constraint `Ax=b` to a Pyomo model efficiently?

I want to add the constraints Ax=b to a Pyomo model from my numpy arrays A and b as efficiently as possible. Unfortunately, the performance is currently very bad. For the following example
import time
import numpy as np
import pyomo.environ as pyo
start = time.time()
rows = 287
cols = 2765
A = np.random.rand(rows, cols)
b = np.random.rand(rows)
mdl = pyo.ConcreteModel()
mdl.rows = range(rows)
mdl.cols = range(cols)
mdl.A = A
mdl.b = b
mdl.x_var = pyo.Var(mdl.cols, bounds=(0.0, None))
mdl.constraints = pyo.ConstraintList()
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
mdl.obj = pyo.Objective(expr=sum(mdl.x_var[col] for col in mdl.cols), sense=pyo.minimize)
end = time.time()
print(end - start)
it takes almost 30 seconds because of the add statement and the huge number of columns. Is it possible to pass A, x, and b directly and quickly instead of adding the constraints row by row?
The main thing that is slowing down your construction above is the fact that you are building the constraint list elements within a list comprehension, which is unnecessary and causes a lot of bloat.
This line:
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
constructs a list of the captured results of each ConstraintList.add() call, which is a "rich" return. That list is an unnecessary byproduct of the loop you want to run over add(). Just change your construction to either a plain loop or a generator expression (by using parentheses) to avoid that capture, like so:
(mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows)
And the model construction time drops to about 0.02 seconds.
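For reference, a minimal sketch of the plain-loop form mentioned above, which builds the same constraints without capturing the add() return values in a list:

# plain for loop: the ConstraintList.add() return values are simply discarded
for row in mdl.rows:
    mdl.constraints.add(
        sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]
    )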

Pandas: memory usage when working with very many columns using Groupby

I have a dataframe with over 1000 columns and I would like to know whether it makes a difference to memory usage and/or speed to run a groupby directly on the full dataframe or to first create a smaller column-wise subset of it.
df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
or,
df2=df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df=pd.concat([df,df2[['xnew','ynew','znew']]],axis=1)
I would like to test this myself but I am unfamiliar with how to do it. Advice on how to test this would be much appreciated.
The short answer is no, it doesn't matter on either dimension. From a Colab notebook:
%load_ext memory_profiler
import pandas as pd
import numpy as np
d = {'a': [1]*100 + [2]*100, 'b': [3]*50 + [4]*50 + [5]*50 + [6]*50}
for i in range(1000):
    d[i] = np.random.random(200)
for c in 'xyz':
    d[c] = np.random.random(200)
df = pd.DataFrame(d)
%time %memit df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
%%time
%%memit
df2=df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df=pd.concat([df,df2[['xnew','ynew','znew']]],axis=1)
A simple way to test this is to record the start time and subtract it from the time at the end of the process to display the elapsed time.
import time

start = time.time()
# ... the code you want to time goes here ...
process_time = time.time() - start
print(process_time)
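For example, applied to the first variant from the question (a sketch; df here is the dataframe built in the snippet above):

import time

start = time.time()
df[['xnew', 'ynew', 'znew']] = df.groupby(['a', 'b'])[['x', 'y', 'z']].transform(
    lambda f: f.rolling(3).mean().shift()
)
print(time.time() - start)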

How to multiprocess finding closest geographic point in two pandas dataframes?

I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
from multiprocessing import Pool

import pandas as pd
from geopy import distance

return_columns = []
return_columns_cb = lambda x: return_columns.append(x)

def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet

    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())

    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol

def combine_yield(field):
    # field is a list of the files for the field I'm working with
    # lots of pre-processing
    # dfs in this case is a list of the dataframes for the current field
    # mdf is the dataframe with the most points, which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]),
                      callback=return_columns_cb)
    for col in return_columns:
        mdf = mdf.append(col)
    # I unzip my points back to longitude and latitude here in the final
    # dataframe so I can write to csv without tuples
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf

def multiprocess_combine_yield():
    # do stuff to get the dictionary below, with each field name as key and
    # all the files for that field as value
    yield_by_field = {'C01': ('files...'), ...}
    # The farm I'm working on has 30 fields and the loop below is too slow
    for k, v in yield_by_field.items():
        combine_yield(v)
What I need help with is this: I envision something like using a pool to imap or apply_async over each tuple of files in the dictionary. Then, within the combine_yield function applied to that tuple of files, I want to be able to process the distance function in parallel. That function bogs the program down because it calculates the distance between every point in each of the dataframes for each year of yield. The files average around 1200 data points, and multiplying that by 30 fields, I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me both the value from gdB and the distance, though, because of what I do later on when selecting which rows to use from the mdf dataframe.
Thanks to @ALollz's comment, I figured this out. I went back to my getnearestpoint function and, instead of doing a bunch of Series.apply calls, I am now using cKDTree from scipy.spatial to find the closest point, and then a vectorized haversine distance to calculate the true distances for each of the matched points. Much, much quicker. Here are the basics of the code:
import re
from itertools import zip_longest

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    # k-d tree on the candidate points; query returns the nearest neighbour of each gdA point
    tree = cKDTree(data=gdB_coordinates)
    distances, indices = tree.query(gdA_coordinates, k=1)
    # These column names are done as so due to formatting of my 'retcols'
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA

def combine_yield(field):
    # same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    # the haversine formula needs radians, so convert the main coordinates as well
    main_coords = np.deg2rad(np.array(list(zip(mdf.Longitude, mdf.Latitude))))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for i in year_coords:
        year = re.search(r"\d{4}", i[0]).group(0)
        year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
        year_coords = np.deg2rad(year_coords)
        lat_year = year_coords[:, 1]
        longitude_year = year_coords[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        # haversine formula; 2.0902e7 is roughly the Earth's radius in feet
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
This has made a substantial difference. Hope it helps anyone else in a similar predicament.
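A minimal sketch of that last step wrapped in a context manager, assuming yield_by_field is the field-to-files dictionary described in the question:

from multiprocessing import Pool

if __name__ == "__main__":
    # one combine_yield call per field, spread over the pool's worker processes
    with Pool() as pool:
        results = pool.map(combine_yield, [v for k, v in yield_by_field.items()])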

How to use multiprocessing pool in a for loop while saving the data?

I have some data to which I'm trying to apply multiprocessing.Pool, as I have a machine available with 16 processors.
Here I generate some pseudo data:
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

y = pd.Series(np.random.randint(400, high=600, size=1250))
date_today = datetime.now()
x = pd.date_range(date_today, date_today + timedelta(1250), freq='D')
data = pd.DataFrame(columns=['Date', 'Price'])
data['Date'] = x
data['Price'] = y
d = {name: group for name, group in data.groupby(np.arange(len(data)) // (len(data)))}
What I want exactly is to apply the pool inside the for loop over the parameters, so that one processor is used per constant:
parameters = range(300, 550, 50)
portfolio = pd.DataFrame(columns=['Parameter', 'Date', 'Price', 'Calculation'])
for key, value in sorted(d.items()):
    for constante in parameters:
        print('Constante:', constante)
        # HERE I WANT TO USE MP.POOL()
In the code I'm using a sort of sliding window to perform calculations on; this is the simplest version of the code. I want to assign a process per constant in parameters while writing the results to a DataFrame. How does one achieve this?
You'll want to use multiprocessing.Pool.map a bit like this, though you'll probably have to adjust it for your needs...
from functools import partial
from multiprocessing import Pool

import pandas as pd

def pool_map_fn(i, value=None, constante=None):
    # work on one window of length `constante` starting at row i
    s = {'val': value[i:i + constante]}
    window = pd.concat([s['val']['Date'], s['val']['Price']], axis=1)
    window['Price'] = pd.to_numeric(window['Price'], errors='coerce').fillna(0)
    calc = window['Price'].mean()
    date_variable = window['Date'].iloc[-1]
    price_var = window['Price'].iloc[-1]
    if price_var < calc:
        print('Parameter', constante, 'Lower than average', date_variable, price_var, calc)
        # return the record instead of appending to a global DataFrame,
        # which would not be shared between worker processes anyway
        return {'Parameter': constante,
                'Date': date_variable,
                'Price': price_var,
                'Calculation': calc}
    if price_var > calc:
        print('Parameter', constante, 'Higher than average', date_variable, price_var, calc)

parameters = range(300, 550, 50)
portfolio = pd.DataFrame(columns=['Parameter', 'Date', 'Price', 'Calculation'])
for key, value in sorted(d.items()):
    for constante in parameters:
        with Pool() as pool:
            results = pool.map(partial(pool_map_fn, value=value, constante=constante),
                               range(len(value) - constante + 1))
Note: this is untested but should work; if you get errors, try to resolve them, as the concept should be sound.
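If you go with the version above that returns a record from each worker, one way to fold those records back into portfolio after each map call could be (a sketch, assuming results from the loop above):

# keep only the windows that produced a record and append them to the portfolio
rows = [r for r in results if r is not None]
if rows:
    portfolio = pd.concat([portfolio, pd.DataFrame(rows)], ignore_index=True)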
