Speed up a curve_fit in Pandas DataFrame - python

I have a dataframe with the independent variables in the column headers, and each row is a separate set of dependent variables:
5.032530 6.972868 8.888268 10.732009 12.879130 16.877655
0 2.512298 2.132748 1.890665 1.583538 1.582968 1.440091
1 5.628667 4.206962 4.179009 3.162677 3.132448 1.887631
2 3.177090 2.274014 2.412432 2.066641 1.845065 1.574748
3 5.060260 3.793109 3.129861 2.617136 2.703114 1.921615
4 4.153010 3.354411 2.706463 2.570981 2.020634 1.646298
I would like to fit a curve of the form Y = A*x^B to each row. I need to solve for A and B for about 5000 rows, with 6 data points in each row. I was able to do this using DataFrame.apply, but it takes about 40 seconds. Can I speed this up with Cython or by vectorizing somehow? I need precision to about 4 decimal places.
Here is what I have:
import pandas as pd
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\File.csv')

def curvefita(y):
    return curve_fit(lambda x, a, b: a*np.power(x, b), df.iloc[:, 3:].columns, y, p0=[8.4, -.58], bounds=([0, -10], [200, 10]), maxfev=2000)[0][0]

def curvefitb(y):
    return curve_fit(lambda x, a, b: a*np.power(x, b), df.iloc[:, 3:].columns, y, p0=[8.4, -.58], bounds=([0, -10], [200, 10]), maxfev=2000)[0][1]

avalues = df.iloc[:, 3:].apply(curvefita, axis=1)
bvalues = df.iloc[:, 3:].apply(curvefitb, axis=1)
df['a'] = avalues
df['b'] = bvalues

colcount = len(df.columns)

# build power fit - make the matrix
powerfit = df.copy()
for column in range(colcount-2):
    powerfit.iloc[:, column] = powerfit.iloc[:, colcount-2] * (powerfit.columns[column]**powerfit.iloc[:, colcount-1])

# graph an example
plt.plot(powerfit.iloc[0, :colcount-2], 'r')
plt.plot(df.iloc[0, :colcount-2], 'ro')
# another example looked up by ticker
plt.plot(powerfit.iloc[5, :colcount-2], 'b')
plt.plot(df.iloc[5, :colcount-2], 'bo')

You actually do two curve_fits per row, one for a and one for b. Try to extract both of them from a single fit, so you can halve your execution time:
def func(x, a, b):
    return a * np.power(x, b)

def curvefit(y):
    return tuple(curve_fit(func, df.iloc[:, 3:].columns, y, p0=[8.4, -.58], bounds=([0, -10], [200, 10]))[0])

df[["a", "b"]] = df.iloc[:, 3:].apply(curvefit, axis=1).apply(pd.Series)
print(df)
# 5.03253 6.972868 8.888268 10.732009 12.87913 16.877655 a \
# 0 2.512298 2.132748 1.890665 1.583538 1.582968 1.440091 2.677070
# 1 5.628667 4.206962 4.179009 3.162677 3.132448 1.887631 39.878792
# 2 3.177090 2.274014 2.412432 2.066641 1.845065 1.574748 8.589886
# 3 5.060260 3.793109 3.129861 2.617136 2.703114 1.921615 13.078827
# 4 4.153010 3.354411 2.706463 2.570981 2.020634 1.646298 27.715207
# b
# 0 -0.215338
# 1 -1.044384
# 2 -0.600827
# 3 -0.656381
# 4 -1.008753
And to make this more reusable, I would make curvefit also take the x-values and function, which can be passed in with functools.partial:
from functools import partial

def curvefit(func, x, y):
    return tuple(curve_fit(func, x, y, p0=[8.4, -.58], bounds=([0, -10], [200, 10]))[0])

fit = partial(curvefit, func, df.iloc[:, 3:].columns)
df[["a", "b"]] = df.iloc[:, 3:].apply(fit, axis=1).apply(pd.Series)

I was able to bring my runtime down to 550 ms by following the advice of @Brenlla. This code uses an unweighted/biased formula similar to Excel's, which is good enough for my purposes (@kennytm discusses it here).
df = pd.read_csv(r'C:\File.csv')
df2 = np.log(df)
df3 = df2.iloc[:, 3:].copy()
df3.columns = np.log(df3.columns)

def curvefit(y):
    return tuple(np.polyfit(df3.columns, y, 1))

df[["b", "a"]] = df3.apply(curvefit, axis=1).apply(pd.Series)
df['a'] = np.exp(df['a'])

colcount = len(df.columns)
powerfit = df.copy()
for column in range(colcount-2):
    powerfit.iloc[:, column] = powerfit.iloc[:, colcount-1] * (powerfit.columns[column]**powerfit.iloc[:, colcount-2])

# graph an example
plt.plot(powerfit.iloc[0, :colcount-2], 'r')
plt.plot(df.iloc[0, :colcount-2], 'ro')
# another example looked up by ticker
plt.plot(powerfit.iloc[5, :colcount-2], 'b')
plt.plot(df.iloc[5, :colcount-2], 'bo')
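This can be pushed further: np.polyfit accepts a 2-D y and fits one dataset per column in a single call, so the per-row apply can be dropped entirely. A minimal sketch of that idea, assuming every row shares the same x values:
# fit every row at once: polyfit treats each *column* of y as one dataset
logx = np.asarray(df3.columns, dtype=float)  # already log-transformed above
coeffs = np.polyfit(logx, df3.values.T, 1)   # shape (2, n_rows)
df['b'] = coeffs[0]            # slope of the log-log fit -> exponent B
df['a'] = np.exp(coeffs[1])    # intercept is log(A), so exponentiate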

Related

How do I apply Sympy.Solve to solve a two-sided function on each row in a database?

I am attempting to apply the sympy.solve function to each row in a DataFrame, using each column as a different variable. I have managed to solve for the undefined variable I need to calculate -- λ -- using the following code:
from sympy import symbols, Eq, solve
import math

Tmin = 33.2067
Tmax = 42.606
D = 19.5526
tmin = 6
tmax = 14
pi = math.pi

λ = symbols('λ')
lhs = D
rhs = (((Tmax-Tmin)/2) * (λ-tmin) * (1-(2/pi))) + (((Tmax-Tmin)/2)*(tmax-λ)*(1+(2/pi))) - (((Tmax-Tmin)/2)*(tmax-tmin))
eq1 = Eq(lhs, rhs)
lam = solve(eq1)
print(lam)
However, I need to apply this function to every row in a DataFrame and output the result as its own column. The DataFrame is formatted as follows:
import pandas as pd
data = [[6, 14, 33.2067, 42.606, 19.5526], [6, 14, 33.4885, 43.0318, -27.9222]]
df = pd.DataFrame(data, columns=['tmin', 'tmax', 'Tmin', 'Tmax', 'D'])
I have searched for how to do this, but am not sure how to proceed. I managed to find similar questions wherein the answers discussed lambdifying the equation, but I wasn't sure how to lambdify a two-sided equation and then apply it to my DataFrame, and my math skills aren't strong enough to isolate λ and place it on the left side of the equation so that I don't have to lambdify a two-sided equation. Any help here would be appreciated.
You can use the apply method of a dataframe in order to apply a numerical function to each row. But first we need to create the numerical function: we are going to do that with lambdify:
from sympy import symbols, Eq, solve, pi, lambdify
import pandas as pd
# create the necessary symbols
Tmin, Tmax, D, tmin, tmax = symbols("T_min, T_max, D, t_min, t_max")
λ = symbols('λ')
# create the equation and solve for λ
lhs = D
rhs = (((Tmax-Tmin)/2) * (λ-tmin) * (1-(2/pi))) + (((Tmax-Tmin)/2)*(tmax-λ)*(1+(2/pi))) - (((Tmax-Tmin)/2)*(tmax-tmin))
eq1 = Eq(lhs, rhs)
# solve eq1 for λ: take the first (and only) solution
λ_expr = solve(eq1, λ)[0]
print(λ_expr)
# out: (-pi*D + T_max*t_max + T_max*t_min - T_min*t_max - T_min*t_min)/(2*(T_max - T_min))
# convert λ_expr to a numerical function so that it can
# be quickly evaluated. Essentially, it creates:
# λ_func(tmin, tmax, Tmin, Tmax, D)
# NOTE: for simplicity, let's order the symbols like the
# columns of the dataframe
λ_func = lambdify([tmin, tmax, Tmin, Tmax, D], λ_expr)
# We are going to use df.apply to apply the function to each row
# of the dataframe. However, pandas will pass in the current row
# as the argument. For example:
# row = [val_tmin, val_tmax, val_Tmin, val_Tmax, val_D]
# Hence, we need a wrapper function to unpack the row to the
# arguments required by λ_func
wrapper_func = lambda row: λ_func(*row)
# create the dataframe
data = [[6, 14, 33.2067, 42.606, 19.5526], [6, 14, 33.4885, 43.0318, -27.9222]]
df = pd.DataFrame(data, columns=['tmin', 'tmax', 'Tmin', 'Tmax', 'D'])
# apply the function to each row
print(df.apply(wrapper_func, axis=1))
# 0 6.732400044759731
# 1 14.595903848357743
# dtype: float64
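As a side note, the function produced by lambdify is NumPy-backed, so it also accepts whole columns at once; under that assumption, the row-wise apply and the wrapper can be skipped entirely:
# vectorized evaluation: pass entire columns instead of one row at a time
df['λ'] = λ_func(df['tmin'], df['tmax'], df['Tmin'], df['Tmax'], df['D'])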

Generating and Storing Samples of an Exponential Distribution with a name for each sample using a loop

I've got a weird question for a class project. Assuming X ~ Exp(Lambda), Lambda = 1.6, I have to generate 100 samples of X, with the index corresponding to the sample size of each generated sample (S1, S2, ..., S100). I've worked out a simple loop which generates the required samples in an array, but I am not able to rename the array.
First attempt:
import numpy as np
import matplotlib.pyplot as plt

samples = []
for i in range(1, 101, 1):
    samples.append(np.random.exponential(scale=1/1.6, size=i))
Second attempt:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

for i in range(1, 101, 1):
    samples = np.random.exponential(scale=1/1.2, size=i)
    col = f'samples {i}'
    df_samples[col] = exponential_sample
df_samples = pd.DataFrame(samples)
An example of how I would like to visualize the data:
# drawing 50 random samples of size 2 from the exponentially distributed population
sample_size = 2
df2 = pd.DataFrame(index=['x1', 'x2'])
for i in range(1, 51):
    exponential_sample = np.random.exponential((1/rate), sample_size)
    col = f'sample {i}'
    df2[col] = exponential_sample
# Taking a peek at the samples
df2
But instead of a fixed size = 2, I would like sample size = i. This way I will get 1 row for the first column (S1), 2 rows for the second column (S2), and so on, up to 100 rows for the 100th column (S100).
You cannot easily stick vectors of different lengths into a DataFrame, so your mock-up code would not work, but you can concat one vector at a time:
df = pd.DataFrame()
for i in range(100, 10100, 100):
    tmp = pd.DataFrame({f'S{i}': np.random.exponential(scale=1/1.2, size=i)})
    df = pd.concat([df, tmp], axis=1)
Use a dict instead maybe?
samples = {}
for i in range(100, 10100, 100):
    samples[i] = np.random.exponential(scale=1/1.2, size=i)
Then you can convert it into a pandas DataFrame if you like.
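Since the arrays have different lengths, one way to do that conversion (a sketch of my own, not from the original answer) is to wrap each array in a Series so pandas pads the shorter columns with NaN:
# wrapping each array in a Series aligns on the index and pads with NaN
df_samples = pd.DataFrame({f'S{i}': pd.Series(v) for i, v in samples.items()})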

2D bin (x,y) and calculate mean of values (c) of 10 deepest data points (z)

For a data set consisting of:
coordinates x, y
depth z
a certain value c
I would like to do the following more efficiently:
bin the data set in 2D bins based on the coordinates (x, y)
take the 10 deepest data points (z) per bin
calculate the mean value of c of these 10 data points per bin
Finally show a 2d heatmap with the calculated mean values.
I have found a working solution, but this takes too long for small bins and/or large data sets.
Is there a more efficient way of achieving the same result?
Current working example
Example dataframe:
import numpy as np
from numpy.random import rand
import pandas as pd
import math
import matplotlib.pyplot as plt
n = 10000
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
Bin the data set:
cell_size = 0.01
nx = math.ceil((max(df['x']) - min(df['x'])) / cell_size)
ny = math.ceil((max(df['y']) - min(df['y'])) / cell_size)
x_range = np.arange(0, nx)
y_range = np.arange(0, ny)
df['xbin'], x_edges = pd.cut(x=df['x'], bins=nx, labels=x_range, retbins=True)
df['ybin'], y_edges = pd.cut(x=df['y'], bins=ny, labels=y_range, retbins=True)
Code that now takes too long:
df = df.groupby(['xbin', 'ybin']).apply(
    lambda d: d.sort_values('z').head(10).mean())
Update an empty DataFrame for the bins without data and show result:
index = pd.MultiIndex.from_product([x_range, y_range],
                                   names=['xbin', 'ybin'])
tot_df = pd.DataFrame(index=index, columns=['z', 'c'])
tot_df.update(df)
zval = tot_df['c'].astype('float').values
zval = zval.reshape((nx, ny))
zval = zval.T
zval = np.flipud(zval)
extent = [min(x_edges), max(x_edges), min(y_edges), max(y_edges)]
plt.matshow(zval, aspect='auto', extent=extent)
plt.show()
You can use np.searchsorted to bin the rows by x and y, and then use groupby to take the 10 deepest values and calculate the means. Since groupby maintains the order within each group, you can sort the values before binning. groupby also performs better without apply:
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
df = df.sort_values("z", ascending=False)
bins = np.linspace(0, 1, 11)
df["bin_x"] = np.searchsorted(bins, df['x'].values) - 1
df["bin_y"] = np.searchsorted(bins, df['y'].values) - 1
result = df.groupby(["bin_x", "bin_y"]).head(10)
result.groupby(["bin_x", "bin_y"])["c"].mean()
Result
bin_x bin_y
0 0 0.369531
1 0.601803
2 0.554452
3 0.575464
4 0.455198
...
9 5 0.469838
6 0.420772
7 0.367549
8 0.379200
9 0.523083
Name: c, Length: 100, dtype: float64
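To get back to the heatmap from the question, this Series can be reindexed over the full bin grid and unstacked into a 2-D array; a sketch, assuming the 10x10 bins used above:
means = result.groupby(["bin_x", "bin_y"])["c"].mean()
full_index = pd.MultiIndex.from_product([range(10), range(10)],
                                        names=["bin_x", "bin_y"])
# bins with no data become NaN; rows are bin_x, columns are bin_y
grid = means.reindex(full_index).unstack("bin_y").values
plt.matshow(np.flipud(grid.T), aspect='auto', extent=[0, 1, 0, 1])
plt.show()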

How do I use the output of pandas.ewm.cov?

How is one intended to use the output of the pandas.ewm.cov function? I would presume there are functions that let you use it directly in the returned form for multiplication, but nothing I try seems to work.
For example, take a minimal use case: stock X and Y return timeseries in DF1, from which we estimate an EWMA covariance matrix. To get the variance estimate for a portfolio with positions A and B (given in DF2), I need to compute $x^T C x$ per date, but I can't find a way to do this without writing a for loop.
# Python 3.6, pandas 0.20
import pandas as pd
import numpy as np
np.random.seed(100)
DF1 = pd.DataFrame(dict(X = np.random.normal(size = 100), Y = np.random.normal(size = 100)))
DF2 = pd.DataFrame(dict(A = np.random.normal(size = 100), B = np.random.normal(size = 100)))
COV = DF1.ewm(10).cov()
print(DF1)
print(COV)
# All of the following are invalid
print(COV.dot(DF2))
print(DF2.dot(COV))
print(COV.multiply(DF2))
The best I can figure out is this ugly piece of code:
COV = COV.reset_index().rename(columns=dict(level_0="index", level_1="variable"))
DF2m = pd.melt(DF2.reset_index(), id_vars="index").sort_values("index")
MDF = pd.merge(COV, DF2m, on=["index", "variable"])
VAR = MDF.groupby("index").apply(lambda x: np.dot(np.dot(x["value"], np.matrix([x["X"], x["Y"]])), x["value"])[0, 0])
I hold out hope that there is a nice way to do this...
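One cleaner possibility (a sketch of my own, not a confirmed answer from the thread): pull the MultiIndexed covariance out into a 3-D NumPy array and evaluate the per-date quadratic form with np.einsum:
# COV has a (date, variable) MultiIndex; reshape to (n_dates, n_assets, n_assets)
n = DF2.shape[1]
cov_arr = COV.values.reshape(-1, n, n)
w = DF2.values
# per-date quadratic form w^T C w, all dates at once
VAR = np.einsum('tj,tjk,tk->t', w, cov_arr, w)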

pandas, correctly handle numpy arrays inside a row element

I'll give a minimal example where I create numpy arrays inside the row elements of a pandas.DataFrame.
TL;DR: see the screenshot of the DataFrame
This code finds the minimum of a certain function using scipy.optimize.brute, which returns the minimum, the variable value at which the minimum is found, the evaluation grid, and the function values on that grid.
import numpy as np
import pandas as pd
import scipy.optimize
import itertools

sin = lambda r, phi, x: r * np.sin(phi * x)

def func(r, x):
    x0, fval, grid, Jout = scipy.optimize.brute(
        sin, ranges=[(-np.pi, np.pi)], args=(r, x), Ns=10, full_output=True)
    return dict(phi_at_min=x0[0], result_min=fval, phis=grid, result_at_grid=Jout)

rs = np.linspace(-1, 1, 10)
xs = np.linspace(0, 1, 10)
vals = list(itertools.product(rs, xs))
result = [func(r, x) for r, x in vals]

# I don't know whether this is the best way of generating the DataFrame, but it works
df = pd.DataFrame(vals, columns=['r', 'x'])
df = pd.concat((pd.DataFrame(result), df), axis=1)
df.head()
I expect that this is not how I am supposed to do this and should maybe expand the lists somehow. How do I handle this in a correct, beautiful, and clean way?
So, even though "beautiful and clean" is subject to interpretation, I'll give you mine, which should in turn give you some ideas. I'm leveraging a MultiIndex so that you can later easily select pairs of phi/result_at_grid for each point in the evaluation grid. I'm also using apply instead of creating two dataframes.
import numpy as np
import pandas as pd
import scipy.optimize
import itertools

sin = lambda r, phi, x: r * np.sin(phi * x)

def func(row):
    """
    Accepts a row of a dataframe (a pd.Series).
    df.apply(func, axis=1)
    returns a pd.Series with the initial (r, x) and the results
    """
    r = row['r']
    x = row['x']
    x0, fval, grid, Jout = scipy.optimize.brute(
        sin, ranges=[(-np.pi, np.pi)], args=(r, x), Ns=10, full_output=True)
    # Create a multi-index series for the phis
    phis = pd.Series(grid)
    phis.index = pd.MultiIndex.from_product([['Phis'], phis.index])
    # same for the results at the grid
    result_at_grid = pd.Series(Jout)
    result_at_grid.index = pd.MultiIndex.from_product([['result_at_grid'], result_at_grid.index])
    # concat
    s = pd.concat([phis, result_at_grid])
    # Add these two float results
    s['phi_at_min'] = x0[0]
    s['result_min'] = fval
    # add the initial r, x to reconstruct the index later
    s['r'] = r
    s['x'] = x
    return s

rs = np.linspace(-1, 1, 10)
xs = np.linspace(0, 1, 10)
vals = list(itertools.product(rs, xs))
df = pd.DataFrame(vals, columns=['r', 'x'])

# Apply func to each row (axis=1)
results = df.apply(func, axis=1)
results.set_index(['r', 'x'], inplace=True)
results.head().T  # Transposing so we can see the output in one go...
Now you can select all values at evaluation grid point 2, for example:
print(results.swaplevel(0,1, axis=1)[2].head()) # Showing only 5 first
Phis result_at_grid
r x
-1.0 0.000000 -1.745329 0.000000
0.111111 -1.745329 0.193527
0.222222 -1.745329 0.384667
0.333333 -1.745329 0.571062
0.444444 -1.745329 0.750415
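And if you need the grids back as plain arrays, selecting a top-level column group returns a regular 2-D block (a small usage sketch, assuming the MultiIndex columns built above):
# each top-level column group is an (n_points, Ns) block
phis_arr = results['Phis'].values             # shape (100, 10)
vals_arr = results['result_at_grid'].values   # shape (100, 10)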
