Creating interaction terms quickly without SKLearn - python

I am using the following code to create interaction terms in my data:
def Interaction(x):
    n = x.shape[1]                          # number of original columns
    for k in range(0, n - 1):
        for j in range(k + 1, n):
            new = x[:, k] * x[:, j]
            x = np.hstack((x, new[:, None]))
    return x
My problem is that it is extremely slow compared to SKLearn's PolynomialFeatures. How can I speed it up? I can't use SKLearn because there are a few customizations that I would like to make. For example, I would like to make an interaction variable of X1 * X2 but also X1 * (1-X2), etc.

We can multiply every pair of elements in each row with np.einsum('ij,ik->ijk', x, x). This computes each product twice (the result is symmetric), but it is still about 2 times faster than PolynomialFeatures; the upper triangle is then extracted so each interaction appears once.
import numpy as np

def interaction(x):
    """
    >>> a = np.arange(9).reshape(3, 3)
    >>> b = np.arange(6).reshape(3, 2)
    >>> a
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> interaction(a)
    array([[ 0,  1,  2,  0,  0,  2],
           [ 3,  4,  5, 12, 15, 20],
           [ 6,  7,  8, 42, 48, 56]])
    >>> b
    array([[0, 1],
           [2, 3],
           [4, 5]])
    >>> interaction(b)
    array([[ 0,  1,  0],
           [ 2,  3,  6],
           [ 4,  5, 20]])
    """
    b = np.einsum('ij,ik->ijk', x, x)
    m, n = x.shape
    axis1, axis2 = np.triu_indices(n, 1)
    axis1 = np.tile(axis1, m)
    axis2 = np.tile(axis2, m)
    axis0 = np.arange(m).repeat(n * (n - 1) // 2)
    return np.c_[x, b[axis0, axis1, axis2].reshape(m, -1)]
Performance comparison:
c = np.arange(30).reshape(6, 5)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, interaction_only=True)
skl = poly.fit_transform
print(np.allclose(interaction(c), skl(c)[:, 1:]))
# True
In [1]: %timeit interaction(c)
118 µs ± 172 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [2]: %timeit skl(c)
243 µs ± 4.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
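Regarding the customizations mentioned in the question (e.g. producing X1 * (1 - X2) alongside X1 * X2): the vectorized approach is easy to adapt, because once you have the column-pair indices you can combine the paired columns any way you like. Here is a small sketch of one possible way (the function name and the assumption that 1 - x is meaningful for your features are mine, not from the question):
import numpy as np

def custom_interaction(x):
    m, n = x.shape
    j, k = np.triu_indices(n, 1)                    # all column pairs (j < k)
    xj, xk = x[:, j], x[:, k]                       # paired columns, shape (m, n*(n-1)/2) each
    # original features, Xj * Xk terms, and Xj * (1 - Xk) terms
    return np.hstack([x, xj * xk, xj * (1 - xk)])

x = np.random.rand(5, 4)
print(custom_interaction(x).shape)                  # (5, 4 + 6 + 6) -> (5, 16)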

Related

Is there a way to add a different number to each row in a numpy array in python?

I want to add a different number to each row of the matrix below.
array([[  6,   6,   6,   6],
       [  1,  -5, -11, -17],
       [  1,   7,  13,  19]], dtype=int64)
For example I want to add this array to the matrix:
array([-4, -3, 0])
Add the -4 from the array to the first row, so it becomes array([2, 2, 2, 2], dtype=int64).
The whole matrix should then look like this:
array([[  2,   2,   2,   2],
       [ -2,  -8, -14, -20],
       [  1,   7,  13,  19]], dtype=int64)
I could of course transform the 1d array to a matrix, but I wanted to know if there is maybe another option.
You can do it in several ways:
Using .reshape: it will create a "column-vector" instead of a "row-vector"
a + b.reshape((-1,1))
Creating a new array then transposing it:
a + np.array([b]).T
Using numpy.atleast_2d:
a + np.atleast_2d(b).T
All of them with the same output:
array([[  2,   2,   2,   2],
       [ -2,  -8, -14, -20],
       [  1,   7,  13,  19]])
Performance
%%timeit
a = np.random.randint(0,10,(2000,100))
b = np.random.randint(0,10,2000)
a + b.reshape((-1,1))
#3.39 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
a = np.random.randint(0,10,(2000,100))
b = np.random.randint(0,10,2000)
a + np.array([b]).T
#3.4 ms ± 81.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
a = np.random.randint(0,10,(2000,100))
b = np.random.randint(0,10,2000)
a + np.atleast_2d(b).T
#3.37 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
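Indexing with None (an alias for np.newaxis) is one more equivalent spelling of the same idea; a minimal sketch:
import numpy as np

a = np.array([[6, 6, 6, 6], [1, -5, -11, -17], [1, 7, 13, 19]])
b = np.array([-4, -3, 0])

# b[:, None] has shape (3, 1), so it broadcasts across the columns of a
print(a + b[:, None])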

Efficient way to shift values in one matrix according to the values in another matrix

I want to shift the values in matrix_a according to the values in matrix_b. For example, if the value in matrix_b at position (0,0) is 1, then the element in result_matrix at (0,0) should be the element at (1,1) in matrix_a. I already have this working using the following code:
import numpy as np
matrix_a = np.matrix([[1, 2, 3],
                      [4, 5, 6],
                      [7, 8, 9]])
matrix_b = np.matrix([[1,  1,  0],
                      [0, -1,  0],
                      [0,  0, -1]])
result_matrix = np.zeros((3,3))
for x in range(matrix_b.shape[0]):
    for y in range(matrix_b.shape[1]):
        value = matrix_b.item(x, y)
        result_matrix[x][y] = matrix_a.item(x + value, y + value)
print(result_matrix)
which results in:
[[5. 6. 3.]
 [4. 1. 6.]
 [7. 8. 5.]]
Right now this is quite slow on large matrices, and I have the feeling that this can be optimized using one of numpy or scipy's functions. Can someone tell me how this can be done more efficiently?
Using np.indices
ix = np.indices(matrix_a.shape)
matrix_a[tuple(ix + np.array(matrix_b))]
Out[]:
matrix([[5, 6, 3],
        [4, 1, 6],
        [7, 8, 5]])
As a word of advice, try to avoid using np.matrix - it's only really for compatibility with old MATLAB code, and breaks a lot of numpy functions. np.array works just as well 99% of the time, and the rest of the time np.matrix will be confusing for core numpy users.
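If your data already comes as np.matrix, converting it up front is a one-liner; a minimal sketch:
import numpy as np

m = np.matrix([[1, 2], [3, 4]])
a = np.asarray(m)            # plain ndarray view of the same data, no copy
print(type(a), a.shape)      # <class 'numpy.ndarray'> (2, 2)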
Here's one way with integer indexing, using open grids (np.ogrid) to generate the row and column indices for all elements -
I,J = np.ogrid[:matrix_b.shape[0],:matrix_b.shape[1]]
out = matrix_a[I+matrix_b, J+matrix_b]
Output for given sample -
In [152]: out
Out[152]:
matrix([[5, 6, 3],
        [4, 1, 6],
        [7, 8, 5]])
Timings on a large dataset 5000x5000 -
In [142]: np.random.seed(0)
...: N = 5000 # matrix size
...: matrix_a = np.random.rand(N,N)
...: matrix_b = np.random.randint(0,N,matrix_a.shape)-matrix_a.shape[1]
# @Daniel F's soln
In [143]: %%timeit
...: ix = np.indices(matrix_a.shape)
...: matrix_a[tuple(ix + np.array(matrix_b))]
1.37 s ± 99.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Solution from this post
In [149]: %%timeit
...: I,J = np.ogrid[:matrix_b.shape[0],:matrix_b.shape[1]]
...: out = matrix_a[I+matrix_b, J+matrix_b]
1.17 s ± 3.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
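One caveat (my own addition, not part of either answer): both solutions rely on the shifted indices staying in bounds, with negative values wrapping the way NumPy's negative indexing does. If the shifts can exceed the matrix size and you want explicit wraparound, taking the indices modulo the shape is a small extra step; a sketch:
import numpy as np

matrix_a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix_b = np.array([[1, 1, 0], [0, -1, 0], [0, 0, -1]])

I, J = np.ogrid[:matrix_b.shape[0], :matrix_b.shape[1]]
rows = np.mod(I + matrix_b, matrix_a.shape[0])   # wrap row indices into range
cols = np.mod(J + matrix_b, matrix_a.shape[1])   # wrap column indices into range
print(matrix_a[rows, cols])
# [[5 6 3]
#  [4 1 6]
#  [7 8 5]]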

Efficiently delete each row of an array if it occurs in another array in pure numpy

I have a numpy array in which indices are stored, with shape (n, 2). E.g.:
[[0, 1],
 [2, 3],
 [1, 2],
 [4, 2]]
Then I do some processing and create an array in the shape of (m, 2), where n > m. E.g.:
[[2, 3],
 [4, 2]]
Now I want to delete every row in the first array that can be found in the second array as well. So my wanted result is:
[[0, 1],
 [1, 2]]
My current solution is as follows:
result = first_array
for row in second_array:
    result = np.delete(result, np.where(np.all(result == row, axis=1)), axis=0)
However, this is quite time-consuming if the second array is large. Does someone know a numpy-only solution which does not require a loop?
Here's one that leverages the fact that the numbers are positive, using matrix multiplication for dimensionality reduction -
def setdiff_nd_positivenums(a, b):
    s = np.maximum(a.max(0) + 1, b.max(0) + 1)
    return a[~np.isin(a.dot(s), b.dot(s))]
Sample run -
In [82]: a
Out[82]:
array([[0, 1],
       [2, 3],
       [1, 2],
       [4, 2]])
In [83]: b
Out[83]:
array([[2, 3],
       [4, 2]])
In [85]: setdiff_nd_positivenums(a,b)
Out[85]:
array([[0, 1],
       [1, 2]])
Also, it seems the second array b is a subset of a. We can leverage that scenario to boost the performance even further using np.searchsorted, like so -
def setdiff_nd_positivenums_searchsorted(a, b):
    s = np.maximum(a.max(0) + 1, b.max(0) + 1)
    a1D, b1D = a.dot(s), b.dot(s)
    b1Ds = np.sort(b1D)
    return a[b1Ds[np.searchsorted(b1Ds, a1D)] != a1D]
Timings -
In [146]: np.random.seed(0)
...: a = np.random.randint(0,9,(1000000,2))
...: b = a[np.random.choice(len(a), 10000, replace=0)]
In [147]: %timeit setdiff_nd_positivenums(a,b)
...: %timeit setdiff_nd_positivenums_searchsorted(a,b)
10 loops, best of 3: 101 ms per loop
10 loops, best of 3: 70.9 ms per loop
For generic numbers, here's another using views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b):  # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

def setdiff_nd(a, b):
    # a, b are the nD input arrays
    A, B = view1D(a, b)
    return a[~np.isin(A, B)]
Sample run -
In [94]: a
Out[94]:
array([[ 0,  1],
       [-2, -3],
       [ 1,  2],
       [-4, -2]])
In [95]: b
Out[95]:
array([[-2, -3],
       [ 4,  2]])
In [96]: setdiff_nd(a,b)
Out[96]:
array([[ 0,  1],
       [ 1,  2],
       [-4, -2]])
Timings -
In [158]: np.random.seed(0)
...: a = np.random.randint(0,9,(1000000,2))
...: b = a[np.random.choice(len(a), 10000, replace=0)]
In [159]: %timeit setdiff_nd(a,b)
1 loop, best of 3: 352 ms per loop
The numpy-indexed package (disclaimer: I am its author) was designed to perform operations of this type efficiently on nd-arrays.
import numpy_indexed as npi
# if the output should consist of unique values and there is no need to preserve ordering
result = npi.difference(first_array, second_array)
# otherwise:
result = first_array[~npi.in_(first_array, second_array)]
Here is a function that works with 2D integer arrays of any shape, accepting both positive and negative numbers:
import numpy as np

# Gets a boolean array marking rows of a that are in b
def isin_rows(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # Subtract minimum value per column
    min = np.minimum(a.min(0), b.min(0))
    a = a - min
    b = b - min
    # Get number of possible values per column
    max = np.maximum(a.max(0), b.max(0)) + 1
    # Compute multiplicative (mixed-radix) base for each column
    base = np.roll(max, 1)
    base[0] = 1
    base = np.cumprod(base)
    # Make flattened version of arrays
    a_flat = (a * base).sum(1)
    b_flat = (b * base).sum(1)
    # Check elements of a in b
    return np.isin(a_flat, b_flat)
# Test
a = np.array([[0, 1],
              [2, 3],
              [1, 2],
              [4, 2]])
b = np.array([[2, 3],
              [4, 2]])
a_in_b_mask = isin_rows(a, b)
a_not_in_b = a[~a_in_b_mask]
print(a_not_in_b)
# [[0 1]
#  [1 2]]
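The flattening step is essentially a mixed-radix encoding: base holds the cumulative products of the per-column sizes, so every distinct (shifted) row maps to a distinct integer, and np.isin on the flattened values then matches whole rows. A tiny sketch of just that encoding (the sizes and rows here are made up for illustration):
import numpy as np

sizes = np.array([5, 3])         # column 0 takes values 0..4, column 1 takes 0..2
base = np.roll(sizes, 1)
base[0] = 1
base = np.cumprod(base)          # [1, 5]

rows = np.array([[0, 0], [4, 0], [0, 1], [4, 2]])
print((rows * base).sum(1))      # [ 0  4  5 14] -- one distinct integer per distinct row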
EDIT: One possible optimization arises from considering the number of possible rows in b. If b has more rows than the number of possible combinations, then you may find its unique elements first so np.isin is faster:
import numpy as np

def isin_rows_opt(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    min = np.minimum(a.min(0), b.min(0))
    a = a - min
    b = b - min
    max = np.maximum(a.max(0), b.max(0)) + 1
    base = np.roll(max, 1)
    base[0] = 1
    base = np.cumprod(base)
    a_flat = (a * base).sum(1)
    b_flat = (b * base).sum(1)
    # Count the number of possible distinct rows of b
    num_possible_b = np.prod(b.max(0) - b.min(0) + 1)
    if len(b_flat) > num_possible_b:  # May tune this condition
        b_flat = np.unique(b_flat)
    return np.isin(a_flat, b_flat)
The condition len(b_flat) > num_possible_b should probably be tuned better, so you only look for unique elements when it is really going to be worth it (maybe len(b_flat) > 2 * num_possible_b or len(b_flat) > num_possible_b + CONSTANT). It seems to give some improvement for big arrays with few distinct values:
import numpy as np
# Test setup from @Divakar
np.random.seed(0)
a = np.random.randint(0, 9, (1000000, 2))
b = a[np.random.choice(len(a), 10000, replace=0)]
print(np.all(isin_rows(a, b) == isin_rows_opt(a, b)))
# True
%timeit isin_rows(a, b)
# 100 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit isin_rows_opt(a, b)
# 81.2 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

How to reshape DataFrame to get 3-D array using a sliding window?

I need to reshape my DataFrame df:
flights = {
    'flight_stage':   [1, 0, 1, 1, 0, 0, 1],
    'scheduled_hour': [16, 16, 17, 17, 17, 18, 18],
    'delay_category': [1, 0, 2, 2, 1, 0, 2]
}
columns = ['flight_stage', 'scheduled_hour', 'delay_category']
df = pd.DataFrame(flights, columns=columns)
I want to get the following 3-D array X:
[
[[1,16],[0,16],[1,17]],
[[0,16],[1,17],[1,17]],
[[1,17],[1,17],[0,17]],
[[1,17],[0,17],[0,18]],
[[0,17],[0,18],[1,18]]
]
and 1-D array y:
[
3,
4,
5,
3,
3
]
Basically, the original DataFrame df should be reshaped using a sliding window of 1, taking last 3 entries, in order to get X. The values of y should be a sum of delay_category of all 3 entries.
How can I do it?
I tried to use reshape, but didn't come up with any solution.
You could do:
import pprint
import pandas as pd
flights = {
    'flight_stage':   [1, 0, 1, 1, 0, 0, 1],
    'scheduled_hour': [16, 16, 17, 17, 17, 18, 18],
    'delay_category': [1, 0, 2, 2, 1, 0, 2]
}
columns = ['flight_stage', 'scheduled_hour', 'delay_category']
df = pd.DataFrame(flights, columns=columns)
X = [df.iloc[i:i+3, [0, 1]].values.tolist() for i in range(len(df) - (3 - 1))]
y = df.delay_category.rolling(3).sum().dropna()
pprint.pprint(X)
pprint.pprint(y)
Output
[[[1, 16], [0, 16], [1, 17]],
[[0, 16], [1, 17], [1, 17]],
[[1, 17], [1, 17], [0, 17]],
[[1, 17], [0, 17], [0, 18]],
[[0, 17], [0, 18], [1, 18]]]
2 3.0
3 4.0
4 5.0
5 3.0
6 3.0
Name: delay_category, dtype: float64
If desired you can convert X to a numpy array very easily.
A simple way is to just loop through your array and stack subarrays of your window size. To get y, the rolling method works well here. Something like this should work:
arr = df[['flight_stage', 'scheduled_hour']].values
win_size = 3
X = np.stack([arr[n:n+win_size, :] for n in range(len(arr) - win_size + 1)])
y = df['delay_category'].rolling(3).sum()
For better performance, you can use numpy and stack together slices of the array:
x = df.values
w = 3
cols = 2
rows = x.shape[0] - w + 1
X = np.hstack((x[:-2, :2], x[1:-1, :2], x[2:, :2])).reshape((rows, w, cols))
print(X)
array([[[ 1, 16],
        [ 0, 16],
        [ 1, 17]],

       [[ 0, 16],
        [ 1, 17],
        [ 1, 17]],

       [[ 1, 17],
        [ 1, 17],
        [ 0, 17]],

       [[ 1, 17],
        [ 0, 17],
        [ 0, 18]],

       [[ 0, 17],
        [ 0, 18],
        [ 1, 18]]], dtype=int64)
y = np.vstack((x[:-2, -1], x[1:-1, -1], x[2:, -1])).sum(axis=0)
print(y)
array([3, 4, 5, 3, 3], dtype=int64)
Some time comparisons:
def daniel(df):
    columns = ['flight_stage', 'scheduled_hour', 'delay_category']
    X = [df.iloc[i:i+3, [0, 1]].values.tolist() for i in range(len(df) - (3 - 1))]
    y = df.delay_category.rolling(3).sum().dropna()

def busybear(df):
    arr = df[['flight_stage', 'scheduled_hour']].values
    win_size = 3
    X = np.stack([arr[n:n+win_size, :] for n in range(len(arr) - win_size + 1)])
    y = df['delay_category'].rolling(3).sum()

def yatu(df):
    x = df.values
    w = 3
    cols = 2
    rows = x.shape[0] - w + 1
    X = np.hstack((x[:-2, :2], x[1:-1, :2], x[2:, :2])).reshape((rows, w, cols))
    y = np.vstack((x[:-2, -1], x[1:-1, -1], x[2:, -1])).sum(axis=0)
%timeit daniel(df)
# 2.75 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit yatu(df)
# 26.3 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit busybear(df)
# 929 µs ± 179 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
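On NumPy 1.20+ there is also numpy.lib.stride_tricks.sliding_window_view, which builds the windows without a Python loop; a sketch, assuming the same df as above:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({'flight_stage':   [1, 0, 1, 1, 0, 0, 1],
                   'scheduled_hour': [16, 16, 17, 17, 17, 18, 18],
                   'delay_category': [1, 0, 2, 2, 1, 0, 2]})

feats = df[['flight_stage', 'scheduled_hour']].to_numpy()
# sliding_window_view puts the window axis last, hence the transpose
X = sliding_window_view(feats, window_shape=3, axis=0).transpose(0, 2, 1)   # (5, 3, 2)
y = sliding_window_view(df['delay_category'].to_numpy(), 3).sum(axis=1)     # [3 4 5 3 3]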

numpy, taking array difference of their intersection

I have multiple numpy arrays and I want to create new arrays doing something that is like an XOR ... but not quite.
My input is two arrays, array1 and array2.
My output is a modified (or new array, I don't really care) version of array1.
The modification is elementwise, by doing the following:
1.) If either array has 0 at the given index, that index is left unchanged.
2.) If both array1 and array2 are nonzero at the given index, the modified array gets array1's value minus array2's value, floored at a minimum of zero.
Examples:
array1: [0, 3, 8, 0]
array2: [1, 1, 1, 1]
output: [0, 2, 7, 0]
array1: [1, 1, 1, 1]
array2: [0, 3, 8, 0]
output: [1, 0, 0, 1]
array1: [10, 10, 10, 10]
array2: [8, 12, 8, 12]
output: [2, 0, 2, 0]
I would like to be able to do this with say, a single numpy.copyto statement, but I don't know how. Thank you.
edit:
it just hit me. could I do:
new_array = np.zeros(size_of_array1)
numpy.copyto(new_array, array1-array2, where=array1>array2)
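For what it's worth, assuming nonnegative inputs that np.copyto idea does reproduce the examples; a minimal self-contained check (positions the mask skips simply keep the zeros they were initialized with):
import numpy as np

array1 = np.array([10, 10, 10, 10])
array2 = np.array([8, 12, 8, 12])

new_array = np.zeros(array1.shape, dtype=array1.dtype)        # keep an integer dtype
np.copyto(new_array, array1 - array2, where=array1 > array2)
print(new_array)   # [2 0 2 0]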
Edit 2: Since I have received several answers very quickly I am going to time the different answers against each other to see how they do. Be back with results in a few minutes.
Okay, results are in:
array of random ints 0 to 5, size = 10,000, 10 loops
1.) using my np.copyto method: 0.000768184661865
2.) using clip:                0.000391960144043
3.) using maximum:             0.000403165817261
Kasramvd also provided some useful timings below
You can use a simple subtraction and clip the result with zero as the min:
(arr1 - arr2).clip(min=0)
Demo:
In [43]: arr1 = np.array([0,3,8,0]); arr2 = np.array([1,1,1,1])
In [44]: (arr1 - arr2).clip(min=0)
Out[44]: array([0, 2, 7, 0])
On large arrays it's also faster than maximum approach:
In [51]: arr1 = np.arange(10000); arr2 = np.arange(10000)
In [52]: %timeit np.maximum(0, arr1 - arr2)
22.3 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [53]: %timeit (arr1 - arr2).clip(min=0)
20.9 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [54]: arr1 = np.arange(100000); arr2 = np.arange(100000)
In [55]: %timeit np.maximum(0, arr1 - arr2)
671 µs ± 5.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [56]: %timeit (arr1 - arr2).clip(min=0)
648 µs ± 4.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that if it's possible for arr2 to have negative values you should consider using an abs function on arr2 to get the expected result:
(arr1 - abs(arr2)).clip(min=0)
In [73]: np.maximum(0,np.array([0,3,8,0])-np.array([1,1,1,1]))
Out[73]: array([0, 2, 7, 0])
This doesn't explicitly address
If either array has 0 for the given index, then the index is left unchanged.
but the results match for all examples:
In [74]: np.maximum(0,np.array([1,1,1,1])-np.array([0,3,8,0]))
Out[74]: array([1, 0, 0, 1])
In [75]: np.maximum(0,np.array([10,10,10,10])-np.array([8,12,8,12]))
Out[75]: array([2, 0, 2, 0])
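If you want the "leave it alone when either value is 0" rule enforced explicitly rather than relying on the inputs being nonnegative, np.where makes the condition visible; a minimal sketch (the variable names mirror the question):
import numpy as np

array1 = np.array([0, 3, 8, 0])
array2 = np.array([1, 1, 1, 1])

both_nonzero = (array1 != 0) & (array2 != 0)
output = np.where(both_nonzero, np.clip(array1 - array2, 0, None), array1)
print(output)   # [0 2 7 0]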
You can first simply subtract the arrays and then use boolean array indexing on the subtracted result to assign 0 where there are negative values as in:
# subtract
In [43]: subtracted = arr1 - arr2
# get a boolean mask by checking for < 0
# index into the array and assign 0
In [44]: subtracted[subtracted < 0] = 0
In [45]: subtracted
Out[45]: array([0, 2, 7, 0])
Applying the same for the other inputs specified by OP:
In [46]: arr1 = np.array([1, 1, 1, 1])
...: arr2 = np.array([0, 3, 8, 0])
In [47]: subtracted = arr1 - arr2
In [48]: subtracted[subtracted < 0] = 0
In [49]: subtracted
Out[49]: array([1, 0, 0, 1])
And for the third input arrays:
In [50]: arr1 = np.array([10, 10, 10, 10])
...: arr2 = np.array([8, 12, 8, 12])
In [51]: subtracted = arr1 - arr2
In [52]: subtracted[subtracted < 0] = 0
In [53]: subtracted
Out[53]: array([2, 0, 2, 0])
