Here's my sample program:
import numpy as np
import pandas as pd
import statsmodels
from statsmodels.formula.api import ols
df = pd.DataFrame({"z": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "x": [0, 1, 2, 0, 1, 2, 0, 1, 2],
                   "y": [0, 2, 4, 3, 5, 7, 7, 9, 11]
                   })
model = ols("y ~ x + z + I(z**2)", df).fit()
model.params
newdf = pd.DataFrame({"z": [4, 4, 4, 5, 5, 5],
                      "x": [0, 1, 2, 0, 1, 2]
                      })
model.predict(newdf)
You'll notice, if you run this, that model.params is a pandas Series whose index matches the terms on the right-hand side of the formula, plus one additional entry, "Intercept":
> Out[2]:
> Intercept -2.0
> x 2.0
> z 1.5
> I(z ** 2) 0.5
> dtype: float64
And, using some internal functionality I can't identify, the RegressionResults object's .predict() can recognize the column headers from newdf, match them up (including the patsy syntax "I(z**2)"), add the intercept, and return the answer as a Series. (This is the last line of my sample code.)
This seems convenient! Better than writing out my formula again in python/numpy code whenever I want to evaluate slight variations on it. I feel like there should be some way for me to construct a similar pd.Series for formula coefficients, instead of having created it through a model and fit. Then I should be able to apply this to an appropriate dataframe as a way of evaluating functions.
My attempts to figure out how statsmodels does this haven't worked. I haven't found anything obvious in the related function doc pages or in patsy, nor can I seem to step into this section of the source code while debugging.
Anyone have any idea how to set this up?
I eventually pieced together one way of doing this.
import patsy

def predict(coeffs, datadf: pd.DataFrame) -> np.ndarray:
    """Apply a series (or df) of coefficients indexed by model terms to new data.

    :param coeffs: a series whose elements are coefficients and whose index holds the formula terms,
        or a df whose column names are formula terms and whose rows are sets of coefficients
    :param datadf: the new data to predict on
    """
    # The terms live in the index of a Series but in the columns of a DataFrame.
    terms = coeffs.index if isinstance(coeffs, pd.Series) else coeffs.columns
    # "Intercept" maps to patsy's empty term; every other label becomes an EvalFactor.
    desc = patsy.ModelDesc([], [patsy.Term([]) if column == "Intercept" else patsy.Term([patsy.EvalFactor(column)]) for column in terms])
    dmat = patsy.dmatrix(desc, datadf)
    return np.dot(dmat, coeffs.T)

newdf["y"] = predict(model.params, newdf)
The reason this seemed so appealing to me, in case anyone is baffled, is that I was fitting data piecewise using df.groupby("column").apply(FitFunction). It seemed like having FitFunction() return the model.params series would be the cleanest approach within the pandas paradigm.
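For comparison, here is a minimal sketch of the same idea without patsy. It swaps in pandas' own DataFrame.eval, so the index labels must be plain pandas expressions ("z ** 2" rather than the patsy form "I(z ** 2)"). The function name evaluate_terms and the hand-built coefficient Series are assumptions made for illustration; the coefficient values are copied from the model.params output above.

```python
import pandas as pd

def evaluate_terms(coeffs: pd.Series, data: pd.DataFrame) -> pd.Series:
    """Sum coefficient * term, where each index label is a pandas-eval expression."""
    total = pd.Series(0.0, index=data.index)
    for term, coef in coeffs.items():
        if term == "Intercept":
            total += coef                     # constant term
        else:
            total += coef * data.eval(term)   # e.g. "x" or "z ** 2"
    return total

# Hand-built coefficients (hypothetical), no model fit required.
coeffs = pd.Series({"Intercept": -2.0, "x": 2.0, "z": 1.5, "z ** 2": 0.5})
newdf = pd.DataFrame({"z": [4, 4, 4, 5, 5, 5], "x": [0, 1, 2, 0, 1, 2]})
pred = evaluate_terms(coeffs, newdf)
print(pred.tolist())
```

This only covers terms that DataFrame.eval can parse; categorical encodings and interaction terms would still need patsy.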
I have a dataframe in which each row holds the total sales of around 500 product categories, so the dataframe has 500 columns. I am trying to find the category that is most highly correlated with the columns of another dataframe.
I plan to use the Pearson correlation method for this.
However, the total sales for all categories are highly skewed, with skewness ranging from 10 to 40 across the category columns. So I want to transform the sales data using a Box-Cox transformation.
Since my sales data contains 0 values as well, I want to use the boxcox1p function.
Can somebody help me calculate lambda for the boxcox1p function, since it is a mandatory parameter?
Also, is this the correct approach for finding highly correlated categories?
Assume df is your dataframe with many columns containing numeric values, and the lambda parameter of the Box-Cox transformation equals 0.25. Then:
from scipy.special import boxcox1p
df_boxcox = df.apply(lambda x: boxcox1p(x,0.25))
Now transformed values are in df_boxcox.
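To connect the transformation back to the correlation question, here is a small sketch that correlates every transformed category with a target column from another dataframe. The column names and numbers are made up for illustration, and np.log1p stands in for the transform, since boxcox1p(x, 0) is exactly log1p(x).

```python
import numpy as np
import pandas as pd

# Hypothetical sales table (one column per category) and a target series.
sales = pd.DataFrame({
    "cat_a": [0, 5, 1, 7, 2],
    "cat_b": [0, 3, 10, 30, 100],
    "cat_c": [9, 1, 8, 0, 3],
})
target = pd.Series([1, 2, 3, 4, 5], name="target")

# boxcox1p with lambda = 0 reduces to log1p, which tolerates zeros.
transformed = sales.apply(np.log1p)

# Pearson correlation of each transformed category with the target.
correlations = transformed.corrwith(target)
best_category = correlations.abs().idxmax()
print(correlations.round(3))
print(best_category)
```

Taking the absolute value before idxmax treats strong negative correlation as "highly correlated" too; drop .abs() if only positive relationships matter.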
Unfortunately, there is no built-in method to find lambda for boxcox1p, but we can use PowerTransformer from sklearn.preprocessing instead:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
Note that method 'yeo-johnson' is used because it works with both positive and negative values. Method 'box-cox' would raise an error: ValueError: The Box-Cox transformation can only be applied to strictly positive data.
data = pd.DataFrame({'x':[-2,-1,0,1,2,3,4,5]}) # just sample data to explain
pt.fit(data)
print(pt.lambdas_)
[0.89691707]
Then apply the calculated lambda:
print(pt.transform(data))
Result:
[[-1.60758267]
[-1.09524803]
[-0.60974999]
[-0.16141745]
[ 0.26331586]
[ 0.67341476]
[ 1.07296428]
[ 1.46430326]]
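Although boxcox1p has no estimator of its own, boxcox1p(x, lambda) is the same as boxcox(1 + x, lambda). So one possible workaround for non-negative data with zeros is to let scipy.stats.boxcox fit lambda on the shifted values and then reuse it. The sample data below is synthetic and only there to make the sketch runnable.

```python
import numpy as np
from scipy import stats
from scipy.special import boxcox1p

# Hypothetical skewed, non-negative sales figures that include zeros.
rng = np.random.default_rng(0)
sales = np.concatenate([np.zeros(5), rng.lognormal(mean=3.0, sigma=1.0, size=200)])

# Fit lambda by maximum likelihood on the shifted data (1 + x is strictly positive).
shifted_transformed, fitted_lambda = stats.boxcox(sales + 1.0)

# Reuse the fitted lambda with boxcox1p on the original, unshifted data.
transformed = boxcox1p(sales, fitted_lambda)
print(fitted_lambda)
```

For a dataframe with many columns, the same fit would have to be repeated per column, because each category generally gets its own lambda.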
Are there any multiobjective sorting algorithms built into Pandas?
I have found this which is an NSGA-II algorithm (which is what I want), but it requires passing the objective functions in as separate files. In an ideal world, I would use a DataFrame for all of the data, call a method like multi_of_sort on it while specifying the objective function columns (and other required parameters), and it would return another DataFrame with the Pareto optimum values.
This seems like it should be trivial with Pandas, but I could be wrong.
As it turns out... the pareto package referenced above does handle DataFrame inputs.
import pareto
import pandas as pd
# load the data
df = pd.read_csv('data.csv')
# define the objective function column indices
# optional. default is ALL columns
of_cols = [4, 5]
# define the convergence tolerance for the OF's
# optional. default is 1e-9
eps_tols = [1, 2]
# sort
nondominated = pareto.eps_sort([list(df.itertuples(False))], of_cols, eps_tols)
# convert multi-dimension array to DataFrame
df_pareto = pd.DataFrame.from_records(nondominated, columns=list(df.columns.values))
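If an extra dependency is not wanted, plain non-dominated filtering can also be sketched directly with pandas and NumPy. This is not the epsilon-nondominated sort that the pareto package performs: it has no tolerance parameter, assumes every objective is minimized, and compares all rows pairwise, so it only suits small tables. The function name pareto_front and the sample columns are hypothetical.

```python
import numpy as np
import pandas as pd

def pareto_front(df: pd.DataFrame, objective_cols: list) -> pd.DataFrame:
    """Return the rows that no other row dominates (all objectives minimized)."""
    values = df[objective_cols].to_numpy()
    keep = np.ones(len(values), dtype=bool)
    for i, row in enumerate(values):
        # Another row dominates `row` if it is <= everywhere and < somewhere.
        dominators = np.all(values <= row, axis=1) & np.any(values < row, axis=1)
        keep[i] = not dominators.any()
    return df[keep]

df = pd.DataFrame({
    "name": ["p", "q", "r", "s"],
    "cost": [1, 2, 3, 2],
    "time": [4, 2, 1, 3],
})
front = pareto_front(df, ["cost", "time"])
print(front)
```

Row "s" (cost 2, time 3) drops out because row "q" (cost 2, time 2) is at least as good in both objectives and strictly better in one.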
Question
Let's assume the following DataFrame is given
ID               IBM      MSFT      APPL      ORCL        FB      TWTR
date
1986-08-31 -1.332298  0.396217  0.574269 -0.679972 -0.470584  0.234379
1986-09-30 -0.222567  0.281202 -0.505856 -1.392477  0.941539  0.974867
1986-10-31 -1.139867 -0.458111 -0.999498  1.920840  0.478174 -0.315904
1986-11-30 -0.189720 -0.542432 -0.471642  1.506206 -1.506439  0.301714
1986-12-31  1.061092 -0.922713 -0.275050  0.776958  1.371245 -2.540688
and I want to do some operations on it. This could be some complicated mathematical method. The columns are structurally the same.
Q1: What is the best method wrt. performance and/or implementation design?
Q2: Should I program a method that is disassembling the DataFrame into numerical parts ( numpy arrays ) and indices? Thereby the necessary calculations would be undertaken by a submodule on the numpy array. The main method would be only responsible for recollecting the data retrieved from the submodule and rejoining it with the corresponding indices ( see example code below ).
def submodule(np_array):
    # some fancy calculations here (placeholder: return the input unchanged)
    modified_array = np_array
    return modified_array

def main(df):
    cols = df.columns
    indices = df.index
    values = df.values  # .values is an attribute, not a method
    modified_values = submodule(values)
    new_df = pd.DataFrame(modified_values, columns=cols, index=indices)
    return new_df
Q3: Or should I do the calculations with DataFrames directly?
Q4: Or should I work with objects instead?
Q5: What is better with respect to performance, design, or code structure?
Addendum
A more practical example would be a portfolio optimization.
Q6: Should I pass the whole DataFrame into the optimization or use only the numerical matrix? Strictly speaking, I don't think that the information of the DataFrame should be passed to a numerical method. But I am not sure whether my thinking is outdated.
Another example would be calculating the Delta for a number of options (an operation on every single series instead of a matrix operation).
P.S.:
I know that I wouldn't need to use a separate function for disassembling. But it highlights my intentions.
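To make the two designs from Q2 and Q3 concrete, here is a small sketch that runs the same calculation both ways. A column-wise z-score stands in for the "fancy calculation", and the names standardize and apply_numeric are made up for illustration. It only shows that the two routes agree and says nothing about which one is faster.

```python
import numpy as np
import pandas as pd

def standardize(values: np.ndarray) -> np.ndarray:
    """Numerical core: column-wise z-score on a plain 2-D array."""
    return (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)

def apply_numeric(df: pd.DataFrame, func) -> pd.DataFrame:
    """Wrapper: strip the labels, run func on the array, reattach the labels."""
    return pd.DataFrame(func(df.to_numpy()), index=df.index, columns=df.columns)

df = pd.DataFrame(
    {"IBM": [-1.3, -0.2, -1.1], "MSFT": [0.4, 0.3, -0.5]},
    index=pd.to_datetime(["1986-08-31", "1986-09-30", "1986-10-31"]),
)

via_numpy = apply_numeric(df, standardize)   # Q2: NumPy core plus thin wrapper
via_pandas = (df - df.mean()) / df.std()     # Q3: DataFrame directly (std uses ddof=1)
print(via_numpy)
```

Keeping the numerical core label-free makes it easy to test and reuse, while the wrapper is the only place that needs to know about indices and column names.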
In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.dataframe as dd
# path and col1 are defined elsewhere; dask reads HDF files with read_hdf
data = dd.read_hdf(path, 'sim')[col1]
shifted = data.shift(1)
idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that method would also catch changes from a signed value to zero)
I would then use the boolean series to index a different Dask dataframe for plotting.
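For reference, this is what the intended computation looks like in plain pandas on a tiny in-memory series. The sample values are made up, and nothing here touches Dask.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, 2.0, -1.0, -3.0, 4.0])

# Each row is compared with the row before it.
shifted = data.shift(1)
sign_change = np.sign(data) != np.sign(shifted)

# The first entry is True only because the shifted value is NaN,
# and NaN compares unequal to everything.
print(sign_change.tolist())
```

If the leading True is unwanted, it can be masked out afterwards, for example with sign_change & shifted.notna().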
Rolling functions
Currently dask.dataframe does not implement the shift operation. It could, though, if you raise an issue. In principle this is not so dissimilar from the rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc.
Actually, if you were to create a Pandas function that adheres to the same API as these pandas.rolling_foo functions then you can use the dask.dataframe.rolling.wrap_rolling function to turn your pandas style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)
The following code might help to shift down the series.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you are rolling and finding the sum, for a particular row,
result[i] = row[i-1] + row[i]
Then by subtracting the old value of the column from the result, you are doing the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column getting shifted down once.
Tip:
If you want to shift it down multiple rows, you should actually execute the whole operation again that many times with the same window.
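The identity above can be checked on a small pandas series. The sketch below is a pandas-only illustration (the helper name shift_once is made up); it compares the rolling-sum trick against shift for one and two rows.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

def shift_once(series: pd.Series) -> pd.Series:
    """Shift down by one row using a window-2 rolling sum minus the series itself."""
    return series.rolling(window=2).sum() - series

shifted_one = shift_once(s)
shifted_two = shift_once(shift_once(s))  # repeat the operation to shift two rows
print(shifted_one.tolist())
print(shifted_two.tolist())
```

Because the result comes from adding and then subtracting floats, it can differ from a true shift by rounding error on data with widely varying magnitudes.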
I'd like to find the record in a pandas.DataFrame that worsens the correlation the most, so that I can remove anomalous records.
When I have the following DataFrame:
df = pd.DataFrame({'a':[1,2,3], 'b':[1,2,30]})
The correlation improves when the third row is removed.
print(df.corr())  # -> correlation is 0.88
print(df.iloc[0:2].corr())  # -> correlation is 1.00
In this case, my question is how to identify the third row as a candidate anomaly that makes the correlation worse.
My idea is to run a linear regression and calculate the error of each element (row). But I don't know a simple way to try that idea, and I also believe there is a simpler, more straightforward way.
Update
Of course, you can keep removing elements until the correlation reaches 1. But I'd like to find just one (or several) anomalous row(s). Intuitively, I hope to get a non-trivial set of records that achieves a better correlation.
First, you could brute force it to get an exact solution:
import pandas as pd
import numpy as np
from itertools import combinations, chain
df = pd.DataFrame(np.random.randn(10, 2))
# set the maximal number of rows you are willing to remove
remove_up_to_n = 3
# all combinations of indices to keep
to_keep = map(list, chain.from_iterable(combinations(df.index, df.shape[0] - i) for i in range(1, remove_up_to_n + 1)))
# find the index set with the highest remaining correlation
highest_correlation_index = max(to_keep, key=lambda ks: df.loc[ks].corr().iloc[0, 1])
df_remaining = df.loc[highest_correlation_index]
This can be costly. You could get a greedy approximation by adding a column with something like row's contribution to correlation.
df['CorComp'] = (df.iloc[:, 0].mean() - df.iloc[:, 0]) * (df.iloc[:, 1].mean() - df.iloc[:, 1])
df = df.sort_values('CorComp')
Now you can remove rows starting from the top, which may raise your correlation.
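Here is a small worked sketch of that greedy step on made-up data, where one row runs against an otherwise perfectly linear trend. The column names and values are hypothetical; the contribution score is the same product of deviations as CorComp.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, -10]})

x, y = df["a"], df["b"]
# Each row's contribution to the covariance; the most negative ones hurt most.
contribution = (x.mean() - x) * (y.mean() - y)
ranked = contribution.sort_values()

before = x.corr(y)
trimmed = df.drop(index=ranked.index[0])   # drop the single worst row
after = trimmed["a"].corr(trimmed["b"])
print(ranked.index[0], round(before, 3), round(after, 3))
```

The heuristic assumes the relationship you want to strengthen is positive; for an expected negative correlation the sort order would have to be reversed.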
Your question is about outlier detection. There are many ways to perform this detection, but a simple one is to exclude values whose deviation from the mean exceeds some multiple of the standard deviation of the series.
# Keep only values that deviate from the mean by at most 1.1 standard deviations of the series.
df[np.abs(df.b-df.b.mean())<=(1.1*df.b.std())]
# result
   a  b
0  1  1
1  2  2
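The same filter can be wrapped in a small helper so that the threshold becomes a parameter. The name within_k_std is made up for this sketch, and the right value of k depends on the data.

```python
import pandas as pd

def within_k_std(series: pd.Series, k: float) -> pd.Series:
    """Boolean mask: True where a value lies within k sample standard deviations of the mean."""
    return (series - series.mean()).abs() <= k * series.std()

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 30]})
kept = df[within_k_std(df["b"], k=1.1)]
print(kept)
```

On this three-row example, k=2 would keep every row, because a single extreme value inflates the standard deviation it is measured against.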