I have a dataframe with the total sales of around 500 product categories in each row, so there are 500 columns in my dataframe. I am trying to find the category that is most highly correlated with the columns of another dataframe.
I plan to use the Pearson correlation method for this.
However, the total sales for all the categories are highly skewed, with skewness ranging from 10 to 40 across the category columns, so I want to transform this sales data using a Box-Cox transformation.
Since my sales data also contains 0 values, I want to use the boxcox1p function.
How do I calculate the lambda for the boxcox1p function, since it is a mandatory parameter?
Also, is this the correct approach for finding the most highly correlated categories?
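For concreteness, the correlation step I have in mind would look roughly like this (a minimal sketch; df_sales and df_other are hypothetical names for the two dataframes, assumed to share the same row index):
import pandas as pd
# stack the two dataframes side by side and take the full Pearson correlation matrix
corr = pd.concat([df_sales, df_other], axis=1).corr(method='pearson')
# keep only the cross block: rows are the ~500 categories, columns belong to the other dataframe
cross = corr.loc[df_sales.columns, df_other.columns]
# category with the strongest absolute correlation for each column of the other dataframe
best = cross.abs().idxmax()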
Assume df is your dataframe with many columns containing numeric values, and that the lambda parameter of the Box-Cox transformation equals 0.25; then:
from scipy.special import boxcox1p
df_boxcox = df.apply(lambda x: boxcox1p(x, 0.25))
The transformed values are now in df_boxcox.
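If you later need the original scale back, scipy also provides the inverse transform; a small sketch, assuming the same fixed lambda of 0.25:
from scipy.special import inv_boxcox1p
df_original = df_boxcox.apply(lambda x: inv_boxcox1p(x, 0.25))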
Unfortunately, there is no built-in method to find the lambda for boxcox1p, but we can use PowerTransformer from sklearn.preprocessing instead:
import pandas as pd
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
Note that method='yeo-johnson' is used because it works with both positive and negative values; method='box-cox' would raise ValueError: The Box-Cox transformation can only be applied to strictly positive data.
data = pd.DataFrame({'x': [-2, -1, 0, 1, 2, 3, 4, 5]})  # just sample data to explain
pt.fit(data)
print(pt.lambdas_)
[0.89691707]
Then apply the calculated lambda:
print(pt.transform(data))
result:
[[-1.60758267]
[-1.09524803]
[-0.60974999]
[-0.16141745]
[ 0.26331586]
[ 0.67341476]
[ 1.07296428]
[ 1.46430326]]
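Alternatively, since your sales data is non-negative, you could estimate a maximum-likelihood lambda per column with scipy and feed it straight into boxcox1p. A minimal sketch, assuming df holds your non-negative sales columns (boxcox1p(x, lmbda) is the same as boxcox(x + 1, lmbda), so fitting on x + 1 is valid):
from scipy import stats
from scipy.special import boxcox1p

lambdas = {}
df_boxcox = df.copy()
for col in df.columns:
    # stats.boxcox with no lambda returns the transformed data and the fitted lambda;
    # x + 1 is strictly positive because the sales are >= 0
    _, lmbda = stats.boxcox(df[col] + 1)
    lambdas[col] = lmbda
    df_boxcox[col] = boxcox1p(df[col], lmbda)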
Are there any multiobjective sorting algorithms built into Pandas?
I have found this, which is an NSGA-II algorithm (which is what I want), but it requires passing the objective functions in as separate files. In an ideal world, I would use a DataFrame for all of the data, call a method like multi_of_sort on it while specifying the objective function columns (and other required parameters), and it would return another DataFrame with the Pareto-optimal values.
This seems like it should be trivial with Pandas, but I could be wrong.
As it turns out... the pareto package referenced above does handle DataFrame inputs.
import pareto
import pandas as pd
# load the data
df = pd.read_csv('data.csv')
# define the objective function column indices
# optional. default is ALL columns
of_cols = [4, 5]
# define the convergence tolerance for the objective functions
# optional. default is 1e-9
eps_tols = [1, 2]
# sort
nondominated = pareto.eps_sort([list(df.itertuples(False))], of_cols, eps_tols)
# convert multi-dimension array to DataFrame
df_pareto = pd.DataFrame.from_records(nondominated, columns=list(df.columns.values))
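For what it's worth, if all you need is the non-dominated subset rather than the full epsilon-sorting machinery, a brute-force filter can be written with pandas/NumPy alone. A minimal sketch, assuming two objective columns named 'f1' and 'f2' (hypothetical names), both to be minimized:
import numpy as np
import pandas as pd

def pareto_front_2d(df, f1, f2):
    vals = df[[f1, f2]].to_numpy()
    # a row is dominated if some other row is <= in both objectives
    # and strictly < in at least one of them
    dominated = np.array([
        np.any(np.all(vals <= v, axis=1) & np.any(vals < v, axis=1))
        for v in vals
    ])
    return df[~dominated]

df_pareto = pareto_front_2d(df, 'f1', 'f2')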
Here's my sample program:
import numpy as np
import pandas as pd
import statsmodels
from statsmodels.formula.api import ols

df = pd.DataFrame({"z": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "x": [0, 1, 2, 0, 1, 2, 0, 1, 2],
                   "y": [0, 2, 4, 3, 5, 7, 7, 9, 11]})

model = ols("y ~ x + z + I(z**2)", df).fit()
model.params

newdf = pd.DataFrame({"z": [4, 4, 4, 5, 5, 5],
                      "x": [0, 1, 2, 0, 1, 2]})
model.predict(newdf)
You'll notice, if you run this, that model.params is a pandas Series whose index matches the terms on the right-hand side of the formula, plus one additional entry: "Intercept".
> Out[2]:
> Intercept    -2.0
> x             2.0
> z             1.5
> I(z ** 2)     0.5
> dtype: float64
And, using some internal functionality I can't pin down, the RegressionResults object's .predict() can recognize the column headers from newdf, match them up (including the patsy syntax "I(z**2)"), add the intercept, and return an answer Series. (This is the last line of my sample code.)
This seems convenient! Better than writing out my formula again in Python/NumPy code whenever I want to evaluate slight variations on it. I feel like there should be some way for me to construct a similar pd.Series of formula coefficients myself, instead of having to create it through a model fit. Then I could apply it to an appropriate dataframe as a way of evaluating functions.
My attempts to figure out how statsmodels does this haven't worked: I haven't found anything obvious in the related documentation pages or in patsy, nor can I manage to step into this section of the source code while debugging.
Does anyone have an idea of how to set this up?
I eventually pieced together one way of doing this.
import patsy

def predict(coeffs, datadf: pd.DataFrame) -> np.ndarray:
    """Apply a series (or df) of coefficients indexed by model terms to new data.

    :param coeffs: a Series whose elements are coefficients and whose index holds the formula terms,
        or a DataFrame whose column names are formula terms and whose rows are sets of coefficients
    :param datadf: the new data to predict on
    """
    desc = patsy.ModelDesc([], [patsy.Term([]) if column == "Intercept"
                                else patsy.Term([patsy.EvalFactor(column)])
                                for column in coeffs.index])
    dmat = patsy.dmatrix(desc, datadf)
    return np.dot(dmat, coeffs.T)

newdf["y"] = predict(model.params, newdf)
The reason this seemed so appealing to me, in case anyone is baffled, is that I was fitting data piecewise using df.groupby("column").apply(FitFunction). It seemed like having FitFunction() return the model.params series would be the cleanest approach within the pandas paradigm.
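In case the groupby use case above is unclear, it might look roughly like this (a sketch only; "group_col" is a hypothetical grouping column and the formula is taken from the sample program):
from statsmodels.formula.api import ols

def fit_group(g):
    # return only the coefficient Series for each group
    return ols("y ~ x + z + I(z**2)", g).fit().params

# one row of coefficients per group; the columns are the formula terms
coeff_df = df.groupby("group_col").apply(fit_group)
# the coefficients of a single group can then be reused with predict() above, e.g.
# newdf["y"] = predict(coeff_df.loc[some_group_label], newdf)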
How do I convert log2-transformed values back to the normal scale in Python?
Any suggestions would be great.
log2(x) and 2**x (2 to the power of x) are inverse operations. If you have a column of data that has been transformed by log2(x), all you have to do is apply the inverse:
df['colname'] = [2**i for i in df['colname']]
As suggested in the comment below, it would be more efficient to do:
df['colname'] = df['colname'].rpow(2)
rpow is a built-in pandas Series method. The first argument is the base whose powers you want to take, so s.rpow(2) computes 2 ** value for each element. There is also a fill_value argument, which lets you substitute a value for missing (NaN) entries before the computation.
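An equivalent vectorised alternative (a small sketch, assuming the column holds plain log2-scale numeric values) is to exponentiate the whole column at once:
import numpy as np
df['colname'] = np.exp2(df['colname'])  # same as 2 ** df['colname']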
Question
Let's assume the following DataFrame is given
ID                IBM      MSFT      APPL      ORCL        FB      TWTR
date
1986-08-31  -1.332298  0.396217  0.574269 -0.679972 -0.470584  0.234379
1986-09-30  -0.222567  0.281202 -0.505856 -1.392477  0.941539  0.974867
1986-10-31  -1.139867 -0.458111 -0.999498  1.920840  0.478174 -0.315904
1986-11-30  -0.189720 -0.542432 -0.471642  1.506206 -1.506439  0.301714
1986-12-31   1.061092 -0.922713 -0.275050  0.776958  1.371245 -2.540688
and I want to do some operations on it. This could be some complicated mathematical method. The columns are structurally the same.
Q1: What is the best method with respect to performance and/or implementation design?
Q2: Should I write a method that disassembles the DataFrame into its numerical parts (numpy arrays) and indices? The necessary calculations would then be carried out by a submodule on the numpy array, and the main method would only be responsible for collecting the data returned by the submodule and rejoining it with the corresponding indices (see the example code below).
def submodule(np_array):
    # some fancy calculations here
    return modified_array

def main(df):
    cols = df.columns
    indices = df.index
    values = df.values
    modified_values = submodule(values)
    new_df = pd.DataFrame(modified_values, columns=cols, index=indices)
    return new_df
Q3: Or should I do the calculations with DataFrames directly (see the sketch after these questions)?
Q4: Or should I work with objects instead?
Q5: What is better with respect to performance, design, or code structure?
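As a minimal illustration of Q3 (a sketch under assumptions, not a benchmark): many elementwise operations can be applied to the DataFrame directly, since NumPy ufuncs preserve the index and columns, so no manual disassembly and reassembly is needed.
import numpy as np
import pandas as pd

# elementwise operation applied directly to the DataFrame;
# the result is again a DataFrame with the same index and columns
new_df = np.tanh(df)
# or column by column, if the calculation works on one Series at a time
new_df = df.apply(lambda s: (s - s.mean()) / s.std())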
Addendum
A more practical example would be a portfolio optimization.
Q6: Should I pass the whole DataFrame into the optimization, or only the numerical matrix? Strictly speaking, I don't think the DataFrame's labelling information should be passed into a numerical method, but I am not sure whether my thinking is outdated.
Another example would be calculating the delta for a number of options (an operation on every single series rather than a matrix operation).
P.S.:
I know that I wouldn't need a separate function for the disassembling, but it highlights my intentions.