DataFrame Multiobjective Sort to Define Pareto Boundary

Are there any multiobjective sorting algorithms built into Pandas?
I have found this, which is an NSGA-II algorithm (which is what I want), but it requires passing the objective functions in as separate files. In an ideal world, I would use a DataFrame for all of the data, call a method like multi_of_sort on it while specifying the objective-function columns (and other required parameters), and get back another DataFrame with the Pareto-optimal values.
This seems like it should be trivial with Pandas, but I could be wrong.

As it turns out... the pareto package referenced above does handle DataFrame inputs.
import pareto
import pandas as pd
# load the data
df = pd.read_csv('data.csv')
# define the objective-function column indices
# optional; the default is ALL columns
of_cols = [4, 5]
# define the convergence tolerance (epsilon) for each objective
# optional; the default is 1e-9
eps_tols = [1, 2]
# sort
nondominated = pareto.eps_sort([list(df.itertuples(False))], of_cols, eps_tols)
# convert multi-dimension array to DataFrame
df_pareto = pd.DataFrame.from_records(nondominated, columns=list(df.columns.values))
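For reference, here is a minimal self-contained sketch of the same call on made-up data (assuming, as the pareto package does by default, that every objective is minimized):
import pandas as pd
import pareto

# hypothetical two-objective data; both cost and weight are to be minimized
df = pd.DataFrame({"design": ["a", "b", "c", "d"],
                   "cost":   [1.0, 2.0, 2.5, 3.0],
                   "weight": [5.0, 2.0, 4.0, 1.0]})

# objectives are the cost (1) and weight (2) columns
nondominated = pareto.eps_sort([list(df.itertuples(False))], [1, 2])
df_pareto = pd.DataFrame.from_records(nondominated, columns=list(df.columns))
print(df_pareto)  # "c" drops out: "b" is at least as good on both objectives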


python efficiently applying function over multiple arrays

(new to python so I apologize if this question is basic)
Say I create a function that will calculate some equation:
import numpy as np

def plot_ev(accuracy, tranChance, numChoices, reward):
    # the original line had unbalanced parentheses; this grouping is one plausible reading
    ev = (reward - numChoices) * (1 - np.power((1 - accuracy), numChoices) * tranChance)
    return ev
accuracy, tranChance, and numChoices are each float arrays, e.g.
accuracy = np.array([0.6, 0.7, 0.8])
tranChance = np.array([0.6, 0.7, 0.8])
numChoices = np.array([2, 3, 4])
How would I run and plot plot_ev over my three arrays so that I end up with an output covering all combinations of elements (ideally without running three nested for loops)?
Ideally I would have a single plot showing the output for all combinations (the 1st element of accuracy with all elements of tranChance and numChoices, the 2nd element of accuracy with all elements of tranChance and numChoices, and so on).
Thanks in advance!
Use numpy.meshgrid to make an array of all the combinations of values of the three variables.
products = np.array(np.meshgrid(accuracy, tranChance, numChoices)).T.reshape(-1, 3)
Then transpose this again and extract three longer arrays with the values of the three variables in every combination:
accuracy_, tranChance_, numChoices_ = products.T
Your function contains only operations that can be carried out on numpy arrays, so you can then simply feed these arrays as parameters into the function:
reward = ?? # you need to set the reward value
results = plot_ev(accuracy_, tranChance_, numChoices_, reward)
Alternatively, consider using a pandas DataFrame, which provides clearer labeling of the columns.
import pandas as pd
df = pd.DataFrame(products, columns=["accuracy", "tranChance", "numChoices"])
df["ev"] = plot_ev(df["accuracy"], df["tranChance"], df["numChoices"], reward)

How do I calculate lambda to use scipy.special.boxcox1p function for my entire dataframe of 500 columns?

I have a dataframe with the total sales of around 500 product categories in each row, so there are 500 columns. I am trying to find the category most highly correlated with the columns of another dataframe.
So I will use the Pearson correlation method for this.
But the total sales for all the categories are highly skewed, with skewness levels ranging from 10 to 40 across the category columns, so I want to transform this sales data using a Box-Cox transformation.
Since my sales data contains 0 values as well, I want to use the boxcox1p function.
Can somebody help me: how do I calculate lambda for the boxcox1p function, since it is a mandatory parameter?
Also, is this the correct approach to finding highly correlated categories?
Assume df is your dataframe with many columns containing numeric values, and that the lambda parameter of the Box-Cox transformation equals 0.25; then:
from scipy.special import boxcox1p
df_boxcox = df.apply(lambda x: boxcox1p(x, 0.25))
Now transformed values are in df_boxcox.
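As a quick sanity check against the skewness levels mentioned in the question, pandas' built-in skew() can be compared before and after the transform:
# distribution of per-column skewness before and after the transform
print(df.skew().describe())         # raw sales: heavily right-skewed
print(df_boxcox.skew().describe())  # ideally much closer to zero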
Unfortunately there is no built-in method to find the lambda for boxcox1p, but we can use PowerTransformer from sklearn.preprocessing instead:
import pandas as pd
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
Note that method 'yeo-johnson' is used because it works with both positive and negative values; method 'box-cox' would raise ValueError: The Box-Cox transformation can only be applied to strictly positive data.
data = pd.DataFrame({'x': [-2, -1, 0, 1, 2, 3, 4, 5]})  # just sample data to explain
pt.fit(data)
print(pt.lambdas_)
[0.89691707]
then apply calculated lambda:
print(pt.transform(data))
result:
[[-1.60758267]
[-1.09524803]
[-0.60974999]
[-0.16141745]
[ 0.26331586]
[ 0.67341476]
[ 1.07296428]
[ 1.46430326]]
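Scaling this to the full 500-column dataframe: PowerTransformer estimates one lambda per feature, so a single fit covers every column. A sketch, assuming df holds only the numeric category columns:
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# one fit covers the whole frame: PowerTransformer estimates one lambda per column
pt = PowerTransformer(method='yeo-johnson', standardize=False)
pt.fit(df)

lambdas = pd.Series(pt.lambdas_, index=df.columns)  # lambda per category column
df_transformed = pd.DataFrame(pt.transform(df),
                              columns=df.columns, index=df.index)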

Replicate statsmodels' RegressionResults.predict functionality

Here's my sample program:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

df = pd.DataFrame({"z": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "x": [0, 1, 2, 0, 1, 2, 0, 1, 2],
                   "y": [0, 2, 4, 3, 5, 7, 7, 9, 11]})
model = ols("y ~ x + z + I(z**2)", df).fit()
model.params
newdf = pd.DataFrame({"z": [4, 4, 4, 5, 5, 5],
                      "x": [0, 1, 2, 0, 1, 2]})
model.predict(newdf)
You'll notice, if you run this, that model.params is a pandas Series with indices the same as the right-hand side of the formula, except with an additional entry: "Intercept"
> Out[2]:
> Intercept -2.0
> x 2.0
> z 1.5
> I(z ** 2) 0.5
> dtype: float64
And, using some internal functionality I can't determine, the RegressionResults object's .predict() can recognize the column headers from newdf, match them up (including the patsy syntax "I(z**2)"), add the intercept, and return an answer Series (this is the last line of my sample code).
This seems convenient! Better than writing out my formula again in python/numpy code whenever I want to evaluate slight variations on it. I feel like there should be some way for me to construct a similar pd.Series of formula coefficients, instead of having created it through a model and fit. Then I should be able to apply it to an appropriate dataframe as a way of evaluating functions.
My attempts to figure out how statsmodels does this haven't worked: I haven't found anything obvious in the related function doc pages or in patsy, nor can I seem to step into this section of the source code while debugging.
Anyone have any idea how to set this up?
I eventually pieced together one way of doing this.
import numpy as np
import pandas as pd
import patsy

def predict(coeffs, datadf: pd.DataFrame) -> np.ndarray:
    """Apply a series (or df) of coefficients indexed by model terms to new data.

    :param coeffs: a series whose elements are coefficients and whose index holds
        the formula terms, or a df whose column names are formula terms and whose
        rows are sets of coefficients
    :param datadf: the new data to predict on
    """
    desc = patsy.ModelDesc([], [patsy.Term([]) if column == "Intercept"
                                else patsy.Term([patsy.EvalFactor(column)])
                                for column in coeffs.index])
    dmat = patsy.dmatrix(desc, datadf)
    return np.dot(dmat, coeffs.T)

newdf["y"] = predict(model.params, newdf)
The reason this seemed so appealing to me, in case anyone is baffled, is that I was fitting data piecewise using df.groupby("column").apply(FitFunction). It seemed like having FitFunction() return the model.params series would be the cleanest approach within the pandas paradigm.
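For completeness, the internal machinery appears to be reachable directly: a formula-fitted statsmodels model keeps its patsy design information, and patsy's build_design_matrices can reuse it on new data. A sketch, assuming model and newdf from the question:
import numpy as np
from patsy import build_design_matrices

# reuse the design info that patsy attached to the formula-fitted model
design_info = model.model.data.design_info
(dmat,) = build_design_matrices([design_info], newdf)
predictions = np.dot(dmat, model.params)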

Performance Python: How should I structure time series analysis modules?

Question
Let's assume the following DataFrame is given
ID IBM MSFT APPL ORCL FB TWTR
date
1986-08-31 -1.332298 0.396217 0.574269 -0.679972 -0.470584 0.234379
1986-09-30 -0.222567 0.281202 -0.505856 -1.392477 0.941539 0.974867
1986-10-31 -1.139867 -0.458111 -0.999498 1.920840 0.478174 -0.315904
1986-11-30 -0.189720 -0.542432 -0.471642 1.506206 -1.506439 0.301714
1986-12-31 1.061092 -0.922713 -0.275050 0.776958 1.371245 -2.540688
and I want to do some operations on it. This could be some complicated mathematical method. The columns are structurally the same.
Q1: What is the best method with respect to performance and/or implementation design?
Q2: Should I write a method that disassembles the DataFrame into its numerical parts (numpy arrays) and indices? The necessary calculations would then be carried out by a submodule on the numpy array, and the main method would only be responsible for collecting the data returned from the submodule and rejoining it with the corresponding indices (see the example code below).
def submodule(np_array):
    # some fancy calculations here
    return modified_array

def main(df):
    cols = df.columns
    indices = df.index
    values = df.values  # note: .values is an attribute, not a method
    modified_values = submodule(values)
    new_df = pd.DataFrame(modified_values, columns=cols, index=indices)
    return new_df
Q3: Or should I do the calculations with DataFrames directly?
Q4: Or should I work with objects instead?
Q5: What is better with respect to performance, design, or code structure?
Addendum
A more practical example: suppose I want to do a portfolio optimization.
Q6: Should I pass the whole DataFrame into the optimization or use only the numerical matrix? Strictly speaking, I don't think the DataFrame's labeling information should be passed into a numerical method, but I am not sure whether my thinking is outdated.
Another example would be calculating the Delta for a number of options (an operation on every single series instead of a matrix operation).
P.S.:
I know that I wouldn't need a separate function for disassembling, but it highlights my intention.
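To make Q6 concrete, here is a sketch of the "numerical matrix only" style I have in mind: labels are stripped at the boundary, the optimizer works on plain arrays, and the labels are re-attached afterwards (an illustrative minimum-variance setup, not a recommendation):
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def optimize_portfolio(returns: np.ndarray) -> np.ndarray:
    # the numerical routine sees pure arrays; it knows nothing of tickers or dates
    cov = np.cov(returns, rowvar=False)
    n = cov.shape[0]
    result = minimize(lambda w: w @ cov @ w, np.full(n, 1 / n),
                      constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
    return result.x

# labels are stripped at the boundary and re-attached afterwards
weights = pd.Series(optimize_portfolio(df.to_numpy()), index=df.columns)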

Uncertainty about the Interpolate Function in Pandas

I am working with the interpolate function in pandas. Here is a toy example to make an illustrative case:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': np.random.normal(size=200),
                   'Data2': np.random.normal(size=200)})
df.iloc[1, 0] = np.nan
print(df)
print(df.interpolate('nearest'))
My question: Does the interpolate function work over multiple columns? That is, does it use multivariate analysis to determine the value for a missing field? Or does it simply look at individual columns?
The docs reference the various available methods - most just rely on the index, possibly via the univariate scipy.interpolate.interp1d or other univariate scipy methods. In other words, interpolation is column by column: each column is filled from its own values (and the index), and no multivariate analysis is involved.
method : {'linear', 'time', 'index', 'values', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline', 'piecewise_polynomial', 'pchip'}
'linear': ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes, and the default.
'time': interpolation works on daily and higher-resolution data to interpolate a given length of interval.
'index', 'values': use the actual numerical values of the index.
'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'polynomial' are passed to scipy.interpolate.interp1d. Both 'polynomial' and 'spline' require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=4). These use the actual numerical values of the index.
'krogh', 'piecewise_polynomial', 'spline', and 'pchip' are all wrappers around the scipy interpolation methods of similar names. These use the actual numerical values of the index.
Scipy docs and charts illustrating output here
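A quick way to convince yourself of the column-by-column behaviour (made-up three-row data):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Data": [1.0, np.nan, 3.0],
                   "Data2": [10.0, 20.0, 30.0]})
print(df.interpolate(method="linear"))
# the NaN is filled with 2.0 from Data's own neighbours;
# changing Data2 would not change that value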
