There are certainly many ways of creating interaction terms in Python, whether by using numpy or pandas directly, or some library like patsy. However, I was looking for a way of creating interaction terms scikit-learn style, i.e. in a form that plays nicely with its fit-transform-predict paradigm. How might I do this?
Let's consider the case of making an interaction term between two variables.
You might make use of the FunctionTransformer class, like so:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# 5 rows, 2 columns
X = np.arange(10).reshape(5, 2)
# Appends interaction of columns at 0 and 1 indices to original matrix
interaction_append_function = lambda x: np.append(x, (x[:, 0] * x[:, 1])[:, None], 1)
interaction_transformer = FunctionTransformer(func=interaction_append_function)
Let's try it out:
>>> interaction_transformer.fit_transform(X)
array([[ 0, 1, 0],
[ 2, 3, 6],
[ 4, 5, 20],
[ 6, 7, 42],
[ 8, 9, 72]])
You now have a transformer that will play well with other workflows like sklearn.pipeline or sklearn.compose.
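For example, here is a minimal sketch of dropping the transformer into a Pipeline (the LinearRegression estimator and the dummy target y are placeholders, purely for illustration):
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# hypothetical pipeline: append the interaction column, then fit a linear model
pipe = Pipeline([
    ("interaction", interaction_transformer),
    ("model", LinearRegression()),
])
y = np.arange(5)  # dummy target, just for illustration
pipe.fit(X, y)
pipe.predict(X)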
Certainly there are more extensible ways of handling this, but hopefully you get the idea.
According to the NumPy documentation, they may deprecate the np.matrix class. And while arrays have a multitude of use cases, they cannot do everything. Specifically, they will "break" when doing pretty basic linear algebra operations (you can read more about it here).
Building my own matrix multiplication module in Python is not too difficult, but it would not be optimized at all. I am looking for a library that has full linear algebra support and is optimized on top of BLAS (Basic Linear Algebra Subprograms). Or, at the least, are there any documents on how to integrate a BLAS backend into Python myself?
Edit: So some are suggesting the @ operator, which is like pushing a mole down one hole only to have it pop up immediately in the neighbouring one. In essence, what is happening is a debugger's nightmare:
W*x == W*x.T
W@x == W@x.T
You would hope that an error is raised here, letting you know that you made a mistake when defining your matrices. But since a 1-D array carries no row/column orientation (transposing it is a no-op), I am not sure this issue can ever be solved via np.array. (These problems don't exist with np.matrix, but for some reason the developers seem insistent on removing it.)
If you insist on the distinction between column and row vectors, you can do that.
>>> x = np.array([1, 2, 3]).reshape(-1, 1)
>>> W = np.arange(15).reshape(5, 3)
>>> x
array([[1],
[2],
[3]])
>>> W
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
>>> W @ x
array([[ 8],
[26],
[44],
[62],
[80]])
>>> W @ x.T
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 3)
You could create helper functions to create column and row vectors:
def rowvec(x):
return np.array(x).reshape(1, -1)
def colvec(x):
return np.array(x).reshape(-1, 1)
>>> rowvec([1, 2, 3])
array([[1, 2, 3]])
>>> colvec([1, 2, 3])
array([[1],
[2],
[3]])
I would recommend that you only use this type of construct when you're porting existing Matlab code. You'll have trouble reading numpy code written by others, and many library functions expect 1-D arrays as inputs, not (1, n)-shaped arrays.
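If one of those (1, n)- or (n, 1)-shaped arrays does need to go into a function that expects a 1-D input, ravel() flattens it back; a quick sketch using the colvec helper above:
>>> colvec([1, 2, 3]).ravel()
array([1, 2, 3])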
Actually, numpy offers BLAS-powered matrix multiplication through the matmul operator @. This invokes the __matmul__ magic method for a given class.
All you have to do in the above example is W @ x.
Other linear algebra routines can be found in the np.linalg module.
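For instance, a minimal sketch of a couple of np.linalg routines (the matrix A and vector b here are made up for illustration):
import numpy as np
A = np.array([[2., 1., 1.],
              [1., 3., 2.],
              [1., 0., 0.]])
b = np.array([4., 5., 6.])
x = np.linalg.solve(A, b)   # solve A x = b, backed by LAPACK/BLAS
A_inv = np.linalg.inv(A)    # matrix inverse
nrm = np.linalg.norm(b)     # Euclidean norm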
Edit: I guess your problem is more about the language's conventions than about any technical issue. I found this answer very illuminating:
Transposing a NumPy array
Also, I find it very improbable that you will find something that is NOT numpy since most of the major machine learning/data science frameworks rely on it.
I am fairly new to numpy. I want to apply a custom function to 1, 2 or more rows (or columns). How can I do this? Before this is marked as a duplicate, I want to point out that the only thread I found that does this is how to apply a generic function over numpy rows?. There are two issues with this post:
a) As a beginner, I am not quite sure what operation like A[:,None,:] does.
b) That operation doesn't work in my case. Please see below.
Let's assume that Matrix M is:
import numpy as np
M = np.array([[8, 3, 2],
[6, 1, 2],
[1, 2, 4]])
Now, I want to calculate the product of combinations of all three rows. For this, I have created a custom function. The actual operation of the function could be different from multiplication; multiplication is just an example.
def myf(a, b):
    return a * b
I have taken the element-wise numpy array product as an example. The actual custom function could be different, but no matter what the operation is, it will always take two equally sized 1-D numpy arrays and return a 1-D numpy array. In myf I am assuming that a and b are each an np.array.
I want to be able to apply the custom function to any two rows or columns, or even to three rows (applying the function recursively).
Expected output after multiplying two rows recursively:
If I apply pairwise row-operation:
[[48,3,4],
[6,2,8],
[8,6,8]]
OR (the order of application of the custom function doesn't matter, so the actual position of rows in the output matrix won't matter either; the matrix below would be fine as well):
[[6,2,8],
[48,3,4], #row1 and 2 are swapped
[8,6,8]]
Similarly, if I apply the pairwise operation on columns, I would get
[[24, 6, 16],
[6, 2, 12],
[2, 8, 4]]
Similarly, if I apply the custom function to all three rows, I would get:
[48,6,16] #row-wise
OR
[48,12,8] #column-wise
I tried a few approaches after reading SO:
1:
vf=np.vectorize(myf)
vf(M,M)
However, the above applies the custom function element-wise rather than row-wise or column-wise.
2:
I also tried:
M[:,None,:].dot(M) #dot mimics multiplication. Python wouldn't accept `*`
There are two problems with this:
a) I don't know what the output is.
b) I cannot apply custom function.
Can someone please help me? I'd appreciate any help.
I am open to numpy and scipy.
Some commenters have requested the desired output. Let's assume that the desired output is
[[48,3,4],
[6,2,8],
[8,6,8]].
However, I'd appreciate some guidance on customizing the solution for 2 or more columns and 2 or more rows.
You can simply roll your array along the 0th axis:
np.roll(M, -1, axis=0)
# array([[6, 1, 2],
# [1, 2, 4],
# [8, 3, 2]])
And multiply the result by your original array:
M * np.roll(M, -1, axis=0)
# array([[48, 3, 4],
# [ 6, 2, 8],
# [ 8, 6, 8]])
If you want to incorporate more than two rows, you can roll it more than once:
M * np.roll(M, -1, axis=0) * np.roll(M, -2, axis=0)
# array([[48, 6, 16],
# [48, 6, 16],
# [48, 6, 16]])
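If your custom function is something other than plain multiplication, the same rolling trick still applies, and axis=1 handles the column-wise case. A minimal sketch (myf is just the example multiplication from the question and could be any function that accepts two equal-shaped arrays):
def myf(a, b):
    # placeholder for any element-wise custom operation
    return a * b
# pairwise row operation: each row combined with the next row (cyclically)
myf(M, np.roll(M, -1, axis=0))
# pairwise column operation: roll along axis=1 instead
myf(M, np.roll(M, -1, axis=1))
# array([[24,  6, 16],
#        [ 6,  2, 12],
#        [ 2,  8,  4]])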
I have a question regarding dataframes and numpy arrays in Python. When we read a csv file using pandas, it is stored in a DataFrame. A DataFrame is useful when it comes to data manipulation, viewing data in columns, etc. However, some preprocessing functions such as Imputer do not work on DataFrames. For these functions we have to get the data into numpy arrays, which makes the data manipulation difficult.
In the following code, y is stored as an int64 array while X is an ndarray object of the numpy module. I cannot use the append function on X. Can anyone suggest how to correct this?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('titanic.csv')
y = dataset.iloc[:,1].values
X= dataset.iloc[:,2:12].values
You really should give us more information about what you want/expect, but here's my guess:
In [6]: Y=np.arange(3) # 1d
In [7]: X=np.arange(12).reshape(3,4) # 2d
In [8]: np.column_stack([Y,X])
Out[8]:
array([[ 0, 0, 1, 2, 3],
[ 1, 4, 5, 6, 7],
[ 2, 8, 9, 10, 11]])
This should be the same as
dataset.iloc[:,[1,2,3,...12]].values
though why didn't you just use dataset.iloc[:, 1:12]?
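If you specifically want to stick with np.append, an equivalent (assuming Y is 1-D, as above) would be a quick sketch like:
np.append(Y[:, None], X, axis=1)  # promote Y to a column, then append X's columns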
I have a set of N data points X = {x_1, ..., x_n} and a set of N target values / classes Y = {y_1, ..., y_n}.
The feature vector for a given y_i is constructed taking into account a "window" (for lack of a better term) of data points, e.g. I might want to stack "the last 4 data points", i.e. x_{i-4}, x_{i-3}, x_{i-2}, x_{i-1}, for prediction of y_i.
Obviously, for a window size of 4, such a feature vector cannot be constructed for the first three target values, and I would like to simply drop them. Likewise for the last data point x_n.
This would not be a problem, except I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y matrix.
Is there a way to do this, that I am unaware of or am I stuck doing this as preprocessing outside of the pipeline? (Which means, I would not be able to use GridsearchCV to find the optimal window size and shift.)
I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think that what I want to do is not supported in scikit-learn, but I wanted to make sure.
You are correct: you cannot adjust your target within a sklearn Pipeline. That doesn't mean you cannot do a grid search, but it does mean you may have to go about it in a somewhat more manual fashion. I would recommend writing a function that does your transformations and filtering on y, and then manually looping through a tuning grid created via ParameterGrid, as sketched below. If this doesn't make sense to you, edit your post with the code you have for further assistance.
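A minimal sketch of that manual loop, assuming a hypothetical make_windows helper that builds the stacked-window features and drops the targets that cannot be built (the dummy data, the window sizes, and the LinearRegression estimator are all placeholders):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid
X = np.arange(20.0)        # dummy 1-D series, purely for illustration
y = np.arange(20.0) * 2.0  # dummy targets
def make_windows(X, y, window_size):
    # hypothetical helper: stack the previous `window_size` points as the
    # feature vector for each target, dropping targets that cannot be built
    X_new = np.array([X[i - window_size:i] for i in range(window_size, len(X))])
    y_new = y[window_size:]
    return X_new, y_new
best = None
for params in ParameterGrid({"window_size": [2, 3, 4]}):
    X_w, y_w = make_windows(X, y, params["window_size"])
    model = LinearRegression().fit(X_w, y_w)
    score = mean_squared_error(y_w, model.predict(X_w))  # swap in a proper CV split here
    if best is None or score < best[0]:
        best = (score, params)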
I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.
I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. Then, the transform method passes on a tuple (X, self.y_stored) which is expected by the next estimator. This means I have to write wrapper estimators and it's very ugly, but it works!
Something like this:
class MyWrapperEstimator(RealEstimator):
    def fit(self, X, y=None):
        # unpack the (X, y) tuple handed over by the previous transformer
        if isinstance(X, tuple):
            X, y = X
        super().fit(X=X, y=y)
For your specific example of stacking the last 4 data points, you might be able to use seglearn.
>>> import numpy as np
>>> import seglearn
>>> x = np.arange(10)[None,:]
>>> x
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> y = x
>>> new_x, new_y, _ = seglearn.transform.SegmentXY(width=4, overlap=0.75).fit_transform(x, y)
>>> new_x
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8],
[6, 7, 8, 9]])
>>> new_y
array([3, 4, 5, 6, 7, 8, 9])
seglearn claims to be scikit-learn-compatible, so you should be able to fit SegmentXY in the beginning of a scikit-learn pipeline. However, I have not tried it in a pipeline myself.
I've tried using matrices, but it has failed. I've looked at external modules and external programs, but none of them has worked. If someone could share some tips or code, that would be helpful. Thanks.
import numpy
import scipy.linalg
m = numpy.matrix([
[1, 1, 1, 1, 1],
[16, 8, 4, 2, 1],
[81, 27, 9, 3, 1],
[256, 64, 16, 4, 1],
[625, 125, 25, 5, 1]
])
res = numpy.matrix([[1],[2],[3],[4],[8]])
print(scipy.linalg.solve(m, res))
returns
[[ 0.125]
[-1.25 ]
[ 4.375]
[-5.25 ]
[ 3. ]]
(your solution coefficients for a,b,c,d,e)
I'm not sure what you mean when you say the matrix methods don't work. That's the standard way of solving these types of problems.
From a linear algebra standpoint, solving 5 linear equations is trivial. It can be solved using any number of methods. You can use Gaussian elimination, finding the inverse, Cramer's rule, etc.
If you're lazy, you can always resort to libraries. Sympy and Numpy can both solve linear equations with ease.
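For example, here is a minimal sketch with np.linalg.solve on the same system as above, using plain arrays rather than np.matrix:
import numpy as np
A = np.array([[  1,   1,  1, 1, 1],
              [ 16,   8,  4, 2, 1],
              [ 81,  27,  9, 3, 1],
              [256,  64, 16, 4, 1],
              [625, 125, 25, 5, 1]], dtype=float)
b = np.array([1, 2, 3, 4, 8], dtype=float)
print(np.linalg.solve(A, b))  # [ 0.125 -1.25   4.375 -5.25   3.   ]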
Perhaps you're using matrices in the wrong way.
Matrices are just like lists within lists.
[[1,1,1,1,1],[1,1,1,1,1],[1,1,1,1,1],[1,1,1,1,1],[1,1,1,1,1]]
The code above makes a nested list that you access as mylist[y][x]: the row index comes first, so the axes are swapped relative to (x, y).