pandas, correctly handle numpy arrays inside a row element

I'll give a minimal example where I create numpy arrays inside the row elements of a pandas.DataFrame.
This code finds the minimum of a certain function using scipy.optimize.brute, which returns the minimizer, the function value at the minimum, and two numpy arrays: the evaluation grid and the function values on that grid.
import itertools
import numpy as np
import pandas as pd
import scipy.optimize

sin = lambda r, phi, x: r * np.sin(phi * x)

def func(r, x):
    x0, fval, grid, Jout = scipy.optimize.brute(
        sin, ranges=[(-np.pi, np.pi)], args=(r, x), Ns=10, full_output=True)
    return dict(phi_at_min=x0[0], result_min=fval, phis=grid, result_at_grid=Jout)

rs = np.linspace(-1, 1, 10)
xs = np.linspace(0, 1, 10)
vals = list(itertools.product(rs, xs))
result = [func(r, x) for r, x in vals]

# idk whether this is the best way of generating the DataFrame, but it works
df = pd.DataFrame(vals, columns=['r', 'x'])
df = pd.concat((pd.DataFrame(result), df), axis=1)
df.head()
I expect that this is not how I am supposed to do this and should maybe expand the lists somehow. How do I handle this in a correct, beautiful, and clean way?

So, even though "beautiful and clean" is subject to interpretation, I'll give you mine, which should in turn give you some ideas. I'm leveraging a MultiIndex so that you can later easily select pairs of phi/result_at_grid for each point in the evaluation grid. I'm also using apply instead of creating two dataframes.
import itertools
import numpy as np
import pandas as pd
import scipy.optimize

sin = lambda r, phi, x: r * np.sin(phi * x)

def func(row):
    """
    Accepts a row of a dataframe (a pd.Series) via
        df.apply(func, axis=1)
    and returns a pd.Series with the initial (r, x) and the results.
    """
    r = row['r']
    x = row['x']
    x0, fval, grid, Jout = scipy.optimize.brute(
        sin, ranges=[(-np.pi, np.pi)], args=(r, x), Ns=10, full_output=True)
    # Create a MultiIndex series for the phis
    phis = pd.Series(grid)
    phis.index = pd.MultiIndex.from_product([['Phis'], phis.index])
    # Same for the result at each grid point
    result_at_grid = pd.Series(Jout)
    result_at_grid.index = pd.MultiIndex.from_product(
        [['result_at_grid'], result_at_grid.index])
    # Concatenate both
    s = pd.concat([phis, result_at_grid])
    # Add the two scalar results
    s['phi_at_min'] = x0[0]
    s['result_min'] = fval
    # Add the initial r, x to reconstruct the index later
    s['r'] = r
    s['x'] = x
    return s

rs = np.linspace(-1, 1, 10)
xs = np.linspace(0, 1, 10)
vals = list(itertools.product(rs, xs))
df = pd.DataFrame(vals, columns=['r', 'x'])

# Apply func to each row (axis=1)
results = df.apply(func, axis=1)
results.set_index(['r', 'x'], inplace=True)
results.head().T  # Transposing so we can see the output in one go...
Now you can select all values at evaluation grid point 2, for example:
print(results.swaplevel(0, 1, axis=1)[2].head())  # Showing only the first 5
                    Phis  result_at_grid
r    x
-1.0 0.000000  -1.745329        0.000000
     0.111111  -1.745329        0.193527
     0.222222  -1.745329        0.384667
     0.333333  -1.745329        0.571062
     0.444444  -1.745329        0.750415


Replace outlier values with NaN in numpy? (preserve length of array)

I have an array of magnetometer data with artifacts every two hours due to power cycling.
I'd like to replace those indices with NaN so that the length of the array is preserved.
Here's a code example, adapted from https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html.
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime

# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']

def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = [x for x in y if (x > mean - 2 * sd)]
    final_list = [x for x in final_list if (x < mean + 2 * sd)]
    return final_list

px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
px.line(y=y, x=x)
# px.scatter(y) # It looks like the outliers are successfully dropped.
# px.line(y=reject_outliers(y), x=x) # This is the line I'd like to see work.
When I run px.scatter(reject_outliers(y)), it looks like the outliers are successfully getting dropped:
...but that plot shows the culled y vector against its index, rather than against the datetime vector x as in the plot above. As the debugging text indicates, the vector is shortened because the outlier values are dropped rather than replaced.
How can I edit my reject_outliers() function to assign those values to NaN, or to adjacent values, so that the length of the array stays the same and I can plot my data?
Use else in the list comprehension along the lines of:
[x if x_condition else other_value for x in y]
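For example, applied to the outlier case above (a minimal sketch, assuming y is a 1-D float array; the n-sigma threshold is kept as a parameter):
import numpy as np

def reject_outliers(y, n=5):
    mean, sd = np.mean(y), np.std(y)
    # Keep values within n standard deviations; replace the rest with NaN
    return np.array([v if abs(v - mean) <= n * sd else np.nan for v in y])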
Got a less compact version to work. Full code:
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime

# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']

def reject_outliers(y):  # y is the data in a 1D numpy array
    mean = np.mean(y)
    sd = np.std(y)
    final_list = np.copy(y)
    for n in range(len(y)):
        final_list[n] = y[n] if y[n] > mean - 5 * sd else np.nan
        final_list[n] = final_list[n] if final_list[n] < mean + 5 * sd else np.nan
    return final_list

px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
# px.line(y=y, x=x)
px.line(y=reject_outliers(y), x=x)  # This is the line I wanted to get working - check!
More compact answer, sent via email by a friend:
In numpy you can select/index based on a Boolean array, and then make assignment with it:
def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = y.copy()
    final_list[np.abs(y - mean) > n * sd] = np.nan
    return final_list
I also noticed that you didn’t use the value of n in your example code.
Alternatively, you can use np.where (https://numpy.org/doc/stable/reference/generated/numpy.where.html):
np.where(np.abs(y - mean) > n * sd, np.nan, y)
You don’t need the .copy() if you don’t mind modifying the input array.
Replace np.mean and np.std with np.nanmean and np.nanstd if you want the function to work on arrays that already contain nans, i.e. if you want to use this function recursively.
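For instance, a NaN-tolerant variant might look like this (a sketch combining np.nanmean/np.nanstd with np.where):
def reject_outliers_nan(y, n=5):
    mean = np.nanmean(y)  # ignore NaNs already present in y
    sd = np.nanstd(y)
    return np.where(np.abs(y - mean) > n * sd, np.nan, y)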
The answer about using if else in a list comprehension would work, but avoiding the list comprehension makes the function much faster if the arrays are large.

How do I apply sympy.solve to a two-sided equation for each row in a DataFrame?

I am attempting to apply the sympy.solve function to each row in a DataFrame, using each column as a different variable. I have managed to solve for the undefined variable I need to calculate, λ, using the following code:
from sympy import symbols, Eq, solve
import math

Tmin = 33.2067
Tmax = 42.606
D = 19.5526
tmin = 6
tmax = 14
pi = math.pi

λ = symbols('λ')
lhs = D
rhs = (((Tmax-Tmin)/2) * (λ-tmin) * (1-(2/pi))) + (((Tmax-Tmin)/2)*(tmax-λ)*(1+(2/pi))) - (((Tmax-Tmin)/2)*(tmax-tmin))
eq1 = Eq(lhs, rhs)
lam = solve(eq1)
print(lam)
However, I need to apply this function to every row in a DataFrame and output the result as its own column. The DataFrame is formatted as follows:
import pandas as pd
data = [[6, 14, 33.2067, 42.606, 19.5526], [6, 14, 33.4885, 43.0318, -27.9222]]
df = pd.DataFrame(data, columns=['tmin', 'tmax', 'Tmin', 'Tmax', 'D'])
I have searched for how to do this but am not sure how to proceed. I found similar questions whose answers discussed lambdifying the equation, but I wasn't sure how to lambdify a two-sided equation and then apply it to my DataFrame, and my math skills aren't strong enough to isolate λ on the left side so that I could avoid lambdifying a two-sided equation altogether. Any help here would be appreciated.
You can use the apply method of a dataframe in order to apply a numerical function to each row. But first we need to create the numerical function: we are going to do that with lambdify:
from sympy import symbols, Eq, solve, pi, lambdify
import pandas as pd

# create the necessary symbols
Tmin, Tmax, D, tmin, tmax = symbols("T_min, T_max, D, t_min, t_max")
λ = symbols('λ')

# create the equation and solve for λ
lhs = D
rhs = (((Tmax-Tmin)/2) * (λ-tmin) * (1-(2/pi))) + (((Tmax-Tmin)/2)*(tmax-λ)*(1+(2/pi))) - (((Tmax-Tmin)/2)*(tmax-tmin))
eq1 = Eq(lhs, rhs)

# solve eq1 for λ: take the first (and only) solution
λ_expr = solve(eq1, λ)[0]
print(λ_expr)
# out: (-pi*D + T_max*t_max + T_max*t_min - T_min*t_max - T_min*t_min)/(2*(T_max - T_min))

# convert λ_expr to a numerical function so that it can
# be quickly evaluated. Essentially, this creates:
# λ_func(tmin, tmax, Tmin, Tmax, D)
# NOTE: for simplicity, order the symbols like the
# columns of the dataframe
λ_func = lambdify([tmin, tmax, Tmin, Tmax, D], λ_expr)

# We are going to use df.apply to apply the function to each row
# of the dataframe. However, pandas will pass in the current row
# as a single argument. For example:
# row = [val_tmin, val_tmax, val_Tmin, val_Tmax, val_D]
# Hence, we need a wrapper function to unpack the row into the
# arguments required by λ_func
wrapper_func = lambda row: λ_func(*row)

# create the dataframe
data = [[6, 14, 33.2067, 42.606, 19.5526], [6, 14, 33.4885, 43.0318, -27.9222]]
df = pd.DataFrame(data, columns=['tmin', 'tmax', 'Tmin', 'Tmax', 'D'])

# apply the function to each row
print(df.apply(wrapper_func, axis=1))
# 0     6.732400044759731
# 1    14.595903848357743
# dtype: float64
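As an aside, lambdify returns a NumPy-backed function, so it can also be evaluated on whole columns at once, skipping the per-row Python overhead of apply (a sketch; the 'lambda' column name is just an example):
# Evaluate λ for every row in one vectorized call (same column order as above)
df['lambda'] = λ_func(df['tmin'], df['tmax'], df['Tmin'], df['Tmax'], df['D'])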

How to use a for loop with a dictionary and an array at the same time

I want to make a linear equation with some dynamic inputs. It can be
y = θ0*x0 + θ1*x1
or
y = θ0*x0 + θ1*x1 + θ2*x2 + θ3*x3 + θ4*x4
For that I have a dictionary for x0, x1, x2, ..., xn and an array for θ0, θ1, θ2, ..., θn.
I'm new to Python, so I tried the function below but got stuck.
My question is: how can I write a function that takes x_values and theta_values as parameters and gives y_values as output?
import numpy as np
import pandas as pd

X = pd.DataFrame({'x0': np.ones(6), 'x1': np.linspace(0, 5, 6)})
θ = np.matrix('0 1')

def line_func(features, parameters):
    result = []
    for feat, param in zip(features.iteritems(), parameters):
        for i in feat:
            result.append(i * param)
    return result

line_func(X, θ)
If you want to multiply your thetas with a list of features, then you are technically multiplying a matrix (the features) with a vector (theta).
You can do this as follows:
import numpy as np

x_array = x.values  # x is the features DataFrame (X above)
theta = np.array([theta_0, theta_1])
x_array.dot(theta)
Just order your theta vector the way your columns are ordered in x. But note that this gives a row-wise sum of the products theta_i*x_i over all i. If you don't want it summed up row-wise, just write x_array * theta, as sketched below.
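A quick illustration of the difference between the two (using the x_array and theta from above):
y_per_row = x_array.dot(theta)   # shape (n_rows,): summed products, one y per row
products = x_array * theta       # shape (n_rows, n_cols): theta_i * x_i, not summed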
If you want to work with pandas for the multiplication as well (which I wouldn't recommend) and get a dataframe with the products of each column value and the corresponding theta, you could do it as follows:
# define the theta-x mapping (theta value per column name in x)
thetas = {'x1': 1, 'x2': 3}
# create an empty result dataframe with the index of x
df_result = pd.DataFrame(index=x.index)
# assign the calculated columns in a loop
# (.items() replaces .iteritems(), which was removed in pandas 2.0)
for col_name, col_series in x.items():
    df_result[col_name] = col_series * thetas[col_name]
df_result
This results in:
   x1  x2
0   1   6
1  -1   3
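And to get the y-values the question asks for from these per-column products, sum across the columns:
y = df_result.sum(axis=1)  # row-wise sum of theta_i * x_i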

Some problems with the array dimensions, I guess

With this code I want to find a minimum of a two-dimensional function using the Newton method:
from numpy import array, atleast_2d
from numpy.linalg import solve, norm

def newton2d(f, df, x, tol=1e-12, maxit=50):
    x = atleast_2d(x)
    for i in range(maxit):
        s = solve(df(x), f(x))
        x -= s
        if norm(s) < tol:
            print(x)
            print(i)
            break

f = lambda x: array([x[0]**2 - x[1]**4, x[0] - x[1]**3])
df = lambda x: array([[2*x[0], -4*x[1]**3], [1, -3*x[1]**2]])
x = array([0.7, 0.7])
newton2d(f, df, x)
I think this code should work, but I get an error as follows:
IndexError: index 1 is out of bounds for axis 0 with size 1
Thanks for any help!!
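The IndexError comes from atleast_2d: it turns x of shape (2,) into shape (1, 2), so x[1] inside f indexes the rows (axis 0, which has size 1) and fails. A minimal sketch that keeps x one-dimensional:
import numpy as np
from numpy.linalg import solve, norm

def newton2d(f, df, x, tol=1e-12, maxit=50):
    x = np.asarray(x, dtype=float).copy()  # keep shape (2,); don't promote to 2-D
    for i in range(maxit):
        s = solve(df(x), f(x))  # Newton step: solve J(x) @ s = f(x)
        x -= s
        if norm(s) < tol:
            return x, i
    return x, maxit

f = lambda x: np.array([x[0]**2 - x[1]**4, x[0] - x[1]**3])
df = lambda x: np.array([[2*x[0], -4*x[1]**3], [1, -3*x[1]**2]])
print(newton2d(f, df, [0.7, 0.7]))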

Speed up a curve_fit in Pandas DataFrame

I have a dataframe with the independent variables in the column headers, and each row is a separate set of dependent variables:
   5.032530  6.972868  8.888268  10.732009  12.879130  16.877655
0  2.512298  2.132748  1.890665   1.583538   1.582968   1.440091
1  5.628667  4.206962  4.179009   3.162677   3.132448   1.887631
2  3.177090  2.274014  2.412432   2.066641   1.845065   1.574748
3  5.060260  3.793109  3.129861   2.617136   2.703114   1.921615
4  4.153010  3.354411  2.706463   2.570981   2.020634   1.646298
I would like to fit a curve of the form Y = A*x^B to each row. I need to solve for A and B for about 5000 rows, with 6 data points in each row. I was able to do this using df.apply, but it takes about 40 seconds. Can I speed this up using Cython or by vectorizing somehow? I need precision to about 4 decimals.
Here is what I have:
import pandas as pd
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\File.csv')

def curvefita(y):
    return curve_fit(lambda x, a, b: a * np.power(x, b), df.iloc[:, 3:].columns, y,
                     p0=[8.4, -.58], bounds=([0, -10], [200, 10]), maxfev=2000)[0][0]

def curvefitb(y):
    return curve_fit(lambda x, a, b: a * np.power(x, b), df.iloc[:, 3:].columns, y,
                     p0=[8.4, -.58], bounds=([0, -10], [200, 10]), maxfev=2000)[0][1]

avalues = df.iloc[:, 3:].apply(curvefita, axis=1)
bvalues = df.iloc[:, 3:].apply(curvefitb, axis=1)
df['a'] = avalues
df['b'] = bvalues

colcount = len(df.columns)

# build power fit - make the matrix
powerfit = df.copy()
for column in range(colcount - 2):
    powerfit.iloc[:, column] = powerfit.iloc[:, colcount - 2] * (powerfit.columns[column] ** powerfit.iloc[:, colcount - 1])

# graph an example
plt.plot(powerfit.iloc[0, :colcount - 2], 'r')
plt.plot(df.iloc[0, :colcount - 2], 'ro')
# another example looked up by ticker
plt.plot(powerfit.iloc[5, :colcount - 2], 'b')
plt.plot(df.iloc[5, :colcount - 2], 'bo')
You actually run two curve_fits per row, one for a and one for b. Try to find a way to compute both of them at the same time, so you can halve your execution time:
def func(x, a, b):
    return a * np.power(x, b)

def curvefit(y):
    return tuple(curve_fit(func, df.iloc[:, 3:].columns, y,
                           p0=[8.4, -.58], bounds=([0, -10], [200, 10]))[0])

df[["a", "b"]] = df.iloc[:, 3:].apply(curvefit, axis=1).apply(pd.Series)
print(df)
#     5.03253  6.972868  8.888268  10.732009  12.87913  16.877655          a         b
# 0  2.512298  2.132748  1.890665   1.583538  1.582968   1.440091   2.677070 -0.215338
# 1  5.628667  4.206962  4.179009   3.162677  3.132448   1.887631  39.878792 -1.044384
# 2  3.177090  2.274014  2.412432   2.066641  1.845065   1.574748   8.589886 -0.600827
# 3  5.060260  3.793109  3.129861   2.617136  2.703114   1.921615  13.078827 -0.656381
# 4  4.153010  3.354411  2.706463   2.570981  2.020634   1.646298  27.715207 -1.008753
And to make this more reusable, I would make curvefit also take the x-values and function, which can be passed in with functools.partial:
from functools import partial

def curvefit(func, x, y):
    return tuple(curve_fit(func, x, y, p0=[8.4, -.58], bounds=([0, -10], [200, 10]))[0])

fit = partial(curvefit, func, df.iloc[:, 3:].columns)
df[["a", "b"]] = df.iloc[:, 3:].apply(fit, axis=1).apply(pd.Series)
I was able to bring my runtime down to 550 ms by following the advice of @Brenlla. This code uses an unweighted/biased formula similar to Excel's, which is good enough for my purposes (@kennytm discusses it here):
df = pd.read_csv(r'C:\File.csv')
df2 = np.log(df)
df3 = df2.iloc[:, 3:].copy()
df3.columns = np.log(df3.columns)

def curvefit(y):
    return tuple(np.polyfit(df3.columns, y, 1))

df[["b", "a"]] = df3.apply(curvefit, axis=1).apply(pd.Series)
df['a'] = np.exp(df['a'])

colcount = len(df.columns)
powerfit = df.copy()
for column in range(colcount - 2):
    powerfit.iloc[:, column] = powerfit.iloc[:, colcount - 1] * (powerfit.columns[column] ** powerfit.iloc[:, colcount - 2])
# graph an example
plt.plot(powerfit.iloc[0, :colcount - 2], 'r')
plt.plot(df.iloc[0, :colcount - 2], 'ro')
# another example looked up by ticker
plt.plot(powerfit.iloc[5, :colcount - 2], 'b')
plt.plot(df.iloc[5, :colcount - 2], 'bo')
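Going one step further (a sketch building on df3 above, assuming the data contains no NaNs): np.polyfit accepts a 2-D y whose columns are independent datasets, so every row can be fit in a single call, with no apply loop at all:
logx = df3.columns.to_numpy(dtype=float)   # the log-x values from above
logy = df3.to_numpy().T                    # shape (n_points, n_rows)
b, loga = np.polyfit(logx, logy, 1)        # arrays: one (b, log a) per row
df['b'], df['a'] = b, np.exp(loga)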
