I'm attempting to solve a set of equations of the form Ax = 0. A is a known 6x6 matrix, and I've written the code below using SVD to get the vector x, which works to a certain extent. The answer is approximately correct but not precise enough to be useful to me. How can I improve the precision of the calculation? Lowering eps below 1e-4 causes the function to fail.
from numpy.linalg import *
from numpy import *
A = matrix([[0.624010149127497 ,0.020915658603923 ,0.838082638087629 ,62.0778180312547 ,-0.336 ,0],
[0.669649399820597 ,0.344105317421833 ,0.0543868015800246 ,49.0194290212841 ,-0.267 ,0],
[0.473153758252885 ,0.366893577716959 ,0.924972565581684 ,186.071352614705 ,-1 ,0],
[0.0759305208803158 ,0.356365401030535 ,0.126682113674883 ,175.292109352674 ,0 ,-5.201],
[0.91160934274653 ,0.32447818779582 ,0.741382053883291 ,0.11536775372698 ,0 ,-0.034],
[0.480860406786873 ,0.903499596111067 ,0.542581424762866 ,32.782593418975 ,0 ,-1]])
def null(A, eps=1e-3):
    u, s, vh = svd(A, full_matrices=1, compute_uv=1)
    null_space = compress(s <= eps, vh, axis=0)
    return null_space.T

NS = null(A)
print("Null space equals ", NS, "\n")
print(dot(A, NS))
A is full rank, so the only exact solution is x = 0.
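For illustration, a quick numerical check (a minimal sketch, reusing the matrix A defined in the question):
import numpy as np

# With the 6x6 matrix A from the question (converted from np.matrix to a plain array):
A_arr = np.asarray(A, dtype=float)
print(np.linalg.matrix_rank(A_arr))            # 6 -> full rank, so the only exact solution of A x = 0 is x = 0
print(np.linalg.svd(A_arr, compute_uv=False))  # smallest singular value ~2.75e-4: small, but not zero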
Since it looks like you need a least-squares solution, i.e. min ||A*x|| s.t. ||x|| = 1, compute the SVD [U S V] = svd(A); the last column of V (assuming the columns are sorted in order of decreasing singular values) is x.
I.e.,
U =
-0.23024 -0.23241 0.28225 -0.59968 -0.04403 -0.67213
-0.1818 -0.16426 0.18132 0.39639 0.83929 -0.21343
-0.69008 -0.59685 -0.18202 0.10908 -0.20664 0.28255
-0.65033 0.73984 -0.066702 -0.12447 0.088364 0.0442
-0.00045131 -0.043887 0.71552 -0.32745 0.1436 0.59855
-0.12164 0.11611 0.5813 0.59046 -0.47173 -0.25029
S =
269.62 0 0 0 0 0
0 4.1038 0 0 0 0
0 0 1.656 0 0 0
0 0 0 0.6416 0 0
0 0 0 0 0.49215 0
0 0 0 0 0 0.00027528
V =
-0.002597 -0.11341 0.68728 -0.12654 0.70622 0.0050325
-0.0024567 0.018021 0.4439 0.85217 -0.27644 0.0028357
-0.0036713 -0.1539 0.55281 -0.4961 -0.6516 0.00013067
-0.9999 -0.011204 -0.0068651 0.0013713 0.0014128 0.0052698
0.0030264 0.17515 0.02341 -0.020917 -0.0054032 0.98402
0.012996 -0.96557 -0.15623 0.10603 0.014754 0.17788
So,
x =
0.0050325
0.0028357
0.00013067
0.0052698
0.98402
0.17788
And, ||A*x|| = 0.00027528 as opposed to your previous solution for x where ||A*x_old|| = 0.079442
Attention: there might be some confusion between the SVD conventions in Python and MATLAB:
in Python, numpy.linalg.svd(A) returns u, s, vh such that u*s*vh = A
(strictly: dot(u, dot(diag(s), vh)) = A, because s is a vector and not a 2D matrix in numpy).
The uppermost answer is correct in the sense that one usually writes u*s*vh = A and vh is what numpy returns, whereas that answer discusses V and NOT vh.
To make a long story short: if you have matrices u, s, vh such that u*s*vh = A, then the last rows of vh, not the last columns of vh, describe the null space.
Edit: [for people like me:] each of the last rows is a vector v0 such that A*v0 = 0 (if the corresponding singular value is 0)
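To make that concrete, a minimal numpy sketch (reusing the matrix A from the question; the numbers should match the V column shown above, possibly up to an overall sign):
import numpy as np

A_arr = np.asarray(A, dtype=float)      # A from the question
u, s, vh = np.linalg.svd(A_arr)
x = vh[-1]                              # last ROW of vh == last column of V above (up to sign)
print(x)
print(np.linalg.norm(A_arr @ x))        # ~0.00027528, the smallest singular value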
I have a very big df:
df.shape = (106, 3364)
I want to calculate the so-called Fréchet distance by using this Frechet Distance between 2 curves approach, and it works well. Example:
x = df['1']
x1 = df['1.1']
p = np.array([x, x1])
y = df['2']
y1 = df['2.1']
q = np.array([y, y1])
P_final = list(zip(p[0], p[1]))
Q_final = list(zip(q[0], q[1]))
from frechetdist import frdist
frdist(P_final,Q_final)
But I cannot do it row by row, like:
`1 and 1.1` to `1 and 1.1`, which is equal to 0
`1 and 1.1` to `2 and 2.1`, which is equal to some number
...
`1 and 1.1` to `1682 and 1682.1`, which is equal to some number
I want to create something (my first idea is a for loop, but maybe you have a better solution) to calculate this frdist(P_final, Q_final) between:
the first row and all rows (including itself)
the second row and all rows (including itself), and so on
Finally, I am supposed to get a matrix of size (106, 106) with 0 on the diagonal (because the distance between a curve and itself is 0):
matrix =
0 1 2 3 4 5 ... 105
0 0
1 0
2 0
3 0
4 0
5 0
... 0
105 0
I'm not including my trial code because it confuses everyone!
EDITED:
Sample data:
1 1.1 2 2.1 3 3.1 4 4.1 5 5.1
0 43.1024 6.7498 45.1027 5.7500 45.1072 3.7568 45.1076 8.7563 42.1076 8.7563
1 46.0595 1.6829 45.0595 9.6829 45.0564 4.6820 45.0533 8.6796 42.0501 3.6775
2 25.0695 5.5454 44.9727 8.6660 41.9726 2.6666 84.9566 3.8484 44.9566 1.8484
3 35.0281 7.7525 45.0322 3.7465 14.0369 3.7463 62.0386 7.7549 65.0422 7.7599
4 35.0292 7.5616 45.0292 4.5616 23.0292 3.5616 45.0292 7.5616 25.0293 7.5613
I just used my own sample data in your format (I hope):
import pandas as pd
from frechetdist import frdist
import numpy as np
# create sample data
df = pd.DataFrame([[1,2,3,4,5,6], [3,4,5,6,8,9], [2,3,4,5,2,2], [3,4,5,6,7,3]], columns=['1','1.1','2', '2.1', '3', '3.1'])
# this matrix will hold the result
res = np.ndarray(shape=(df.shape[1] // 2, df.shape[1] // 2), dtype=np.float32)
for row in range(res.shape[0]):
    for col in range(row, res.shape[1]):
        # extract the two curves
        P = [*zip([df.loc[:, f'{row+1}'], df.loc[:, f'{row+1}.1']])]
        Q = [*zip([df.loc[:, f'{col+1}'], df.loc[:, f'{col+1}.1']])]
        # calculate distance
        dist = frdist(P, Q)
        # put result back (it's symmetric)
        res[row, col] = dist
        res[col, row] = dist
# output
print(res)
Output:
[[0. 4. 7.5498343]
[4. 0. 5.5677643]
[7.5498343 5.5677643 0. ]]
Hope that helps
EDIT: Some general tips:
If speed matters: check whether frdist also handles a numpy array of shape (n_values, 2); then you could skip the rather expensive zip-and-unpack step and use the arrays directly, or build the data directly in the format your library needs.
Generally, use better column names (3 and 3.1 is not very descriptive). Why not call them x3, y3, or x3 and f_x3?
I would actually put the data into two different matrices. If you look at the code, I had to do some not-so-obvious things, like iterating over the shape divided by two and building indices from string operations, because of the given table layout.
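For illustration, a minimal sketch of that two-matrix layout (hypothetical data and names; it assumes frdist accepts array-like curves, which is worth verifying as noted above):
import numpy as np
from frechetdist import frdist

# Hypothetical layout: xs[i] and ys[i] hold the x- and y-values of curve i.
rng = np.random.default_rng(0)
xs = rng.random((4, 5))                   # 4 curves, 5 points each
ys = rng.random((4, 5))
curves = np.stack([xs, ys], axis=-1)      # shape (4, 5, 2): one (x, y) pair per point

n = len(curves)
res = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = frdist(curves[i], curves[j])  # assumes frdist converts array-like input itself
        res[i, j] = res[j, i] = d         # symmetric; the diagonal stays 0
print(res)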
I have the numpy array shown below.
Functions like
print(arr.argsort()[:3])
will return the indices of the three lowest values:
[69 66 70]
How do I return the index of the first minimum or the first saddle point (in the calculus sense), whichever comes first, of an array?
In this case the two numbers 0.62026396 and 0.60566623 at indices 2 and 3 form a first saddle point (it isn't a true saddle point since the slope doesn't flatten completely, but it clearly breaks the hard downward slope there; perhaps add a threshold for what "flattens" means). Since the function never goes up before the first saddle point, and therefore the first minimum occurs after the saddle point, that is the index I am interested in.
[1.04814804 0.90445908 0.62026396 0.60566623 0.32295758 0.26658469
0.19059289 0.10281547 0.08582772 0.05091265 0.03391474 0.03844931
0.03315003 0.02838656 0.03420759 0.03567401 0.038203 0.03530763
0.04394316 0.03876966 0.04156067 0.03937291 0.03966426 0.04438747
0.03690863 0.0363976 0.03171374 0.03644719 0.02989291 0.03166156
0.0323875 0.03406287 0.03691943 0.02829374 0.0368121 0.02971704
0.03427005 0.02873735 0.02843848 0.02101889 0.02114978 0.02128403
0.0185619 0.01749904 0.01441699 0.02118773 0.02091855 0.02431763
0.02472427 0.03186318 0.03205664 0.03135686 0.02838413 0.03206674
0.02638371 0.02048122 0.01502128 0.0162665 0.01331485 0.01569286
0.00901017 0.01343558 0.00908635 0.00990869 0.01041151 0.01063606
0.00822482 0.01312368 0.0115005 0.00620334 0.0084177 0.01058152
0.01198732 0.01451455 0.01605602 0.01823713 0.01685975 0.03161889
0.0216687 0.03052391 0.02220871 0.02420951 0.01651778 0.02066987
0.01999613 0.02532265 0.02589186 0.02748692 0.02191687 0.02612152
0.02309497 0.02744753 0.02619196 0.02281516 0.0254296 0.02732746
0.02567608 0.0199178 0.01831929 0.01776025]
This is how I would detect local maxima/minima, inflection points, and saddles.
Let us first define the following functions:
import numpy as np
def n_derivative(arr, degree=1):
    """Compute the n-th derivative."""
    result = arr.copy()
    for i in range(degree):
        result = np.gradient(result)
    return result

def sign_change(arr):
    """Detect sign changes."""
    sign = np.sign(arr)
    result = ((np.roll(sign, 1) - sign) != 0).astype(bool)
    result[0] = False
    return result

def zeroes(arr, threshold=1e-8):
    """Find zeroes of an array."""
    return sign_change(arr) | (abs(arr) < threshold)
We can now make use of the derivative test
A critical point will have its first derivative equal to zero:
def critical_points(arr):
    return zeroes(n_derivative(arr, 1))
If a critical point has the second-derivative non-zero, then the point is either a maximum or a minimum:
def maxima_minima(arr):
    return zeroes(n_derivative(arr, 1)) & ~zeroes(n_derivative(arr, 2))

def maxima(arr):
    return zeroes(n_derivative(arr, 1)) & (n_derivative(arr, 2) < 0)

def minima(arr):
    return zeroes(n_derivative(arr, 1)) & (n_derivative(arr, 2) > 0)
If the second-derivative is equal to zero but the third-derivative is non-zero, then the point is a point of inflection:
def inflections(arr):
    return zeroes(n_derivative(arr, 2)) & ~zeroes(n_derivative(arr, 3))
If a critical point has second-derivative equal to zero, but third-derivative is non-zero, then this is a saddle:
def saddles(arr):
    return zeroes(n_derivative(arr, 1)) & zeroes(n_derivative(arr, 2)) & ~zeroes(n_derivative(arr, 3))
Note that this method is not numerically stable: on the one hand, the zeroes are detected against a somewhat arbitrary threshold, and on the other hand, a different sampling may result in the function / array not being differentiable.
Hence, according to this definition, what you expect is actually not a saddle point.
To have a better approximation of a continuous function, one could use a cubic interpolation on a largely oversampled (as per K in the code) function, e.g.:
import scipy as sp
import scipy.interpolate
data = [
1.04814804, 0.90445908, 0.62026396, 0.60566623, 0.32295758, 0.26658469, 0.19059289,
0.10281547, 0.08582772, 0.05091265, 0.03391474, 0.03844931, 0.03315003, 0.02838656,
0.03420759, 0.03567401, 0.038203, 0.03530763, 0.04394316, 0.03876966, 0.04156067,
0.03937291, 0.03966426, 0.04438747, 0.03690863, 0.0363976, 0.03171374, 0.03644719,
0.02989291, 0.03166156, 0.0323875, 0.03406287, 0.03691943, 0.02829374, 0.0368121,
0.02971704, 0.03427005, 0.02873735, 0.02843848, 0.02101889, 0.02114978, 0.02128403,
0.0185619, 0.01749904, 0.01441699, 0.02118773, 0.02091855, 0.02431763, 0.02472427,
0.03186318, 0.03205664, 0.03135686, 0.02838413, 0.03206674, 0.02638371, 0.02048122,
0.01502128, 0.0162665, 0.01331485, 0.01569286, 0.00901017, 0.01343558, 0.00908635,
0.00990869, 0.01041151, 0.01063606, 0.00822482, 0.01312368, 0.0115005, 0.00620334,
0.0084177, 0.01058152, 0.01198732, 0.01451455, 0.01605602, 0.01823713, 0.01685975,
0.03161889, 0.0216687, 0.03052391, 0.02220871, 0.02420951, 0.01651778, 0.02066987,
0.01999613, 0.02532265, 0.02589186, 0.02748692, 0.02191687, 0.02612152, 0.02309497,
0.02744753, 0.02619196, 0.02281516, 0.0254296, 0.02732746, 0.02567608, 0.0199178,
0.01831929, 0.01776025]
samples = np.arange(len(data))
f = sp.interpolate.interp1d(samples, data, 'cubic')
K = 10
N = len(data) * K
x = np.linspace(min(samples), max(samples), N)
y = f(x)
Then, all these definitions can be visually tested with:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(samples, data, label='data')
plt.plot(x, y, label='f')
plt.plot(x, n_derivative(y, 1), label='d1f')
plt.plot(x, n_derivative(y, 2), label='d2f')
plt.plot(x, n_derivative(y, 3), label='d3f')
plt.legend()
for w in np.where(saddles(y))[0]:
    plt.axvline(x=x[w])
plt.show()
but even in this case, that point is not a saddle.
You can use np.gradient or np.diff to evaluate differences (the first computes central differences, the second is just x[1:] - x[:-1]), then use np.sign to get the gradient sign and another np.diff to see where the sign changes. Then filter for the positive sign changes (corresponding to minima):
np.where(np.diff(np.sign(np.gradient(x))) > 0)[0][0]+2 # add 2 as each time you call np.gradient or np.diff you are subtracting 1 in size; the first [0] is to get the positions, the second [0] is to get the "first" element
>> 8
x[np.where(np.diff(np.sign(np.gradient(x))) > 0)[0][0]+2]
>> 0.03420759
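For reference, the same idea as a self-contained sketch (arr is the 100-value array from the question; the +2 offset follows the comment above):
import numpy as np

grad = np.gradient(arr)                     # slope estimate, same length as arr
turns = np.diff(np.sign(grad)) > 0          # True where the slope flips from negative to positive
first_min_idx = np.where(turns)[0][0] + 2   # same offset as in the one-liner above
print(first_min_idx, arr[first_min_idx])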
After looking around a little bit and from the two suggestions given (so far), I did this:
import numpy as np
import scipy
from scipy import interpolate

x = np.arange(0, 100)
spl = scipy.interpolate.splrep(x, arr, k=3)   # no smoothing, 3rd order spline
ddy = scipy.interpolate.splev(x, spl, der=2)  # use those knots to get the second derivative
print(ddy)
asign = np.sign(ddy)
signchange = ((np.roll(asign, 1) - asign) != 0).astype(int)
print(signchange)
This gives me the second derivative, which I can then analyse, for example by seeing where the sign changes happen:
[-0.894053 -0.14050616 0.61304067 -0.69407217 0.55458251 -0.16624336
-0.0073225 0.12481963 -0.067218 0.03648846 0.02876712 -0.02236204
0.00167794 0.01886512 -0.0136314 0.00953279 -0.01812436 0.03041855
-0.03436446 0.02418512 -0.01458896 0.00429809 0.01227133 -0.02679232
0.02168571 -0.0181437 0.02585209 -0.02876075 0.0214645 -0.00715966
0.0009179 0.00918466 -0.03056938 0.04419937 -0.0433638 0.03557532
-0.02904901 0.02010647 -0.0199739 0.0170648 -0.00298236 -0.00511529
0.00630525 -0.01015011 0.02218007 -0.01945341 0.01339405 -0.01211326
0.01710444 -0.01591092 0.00486652 -0.00891456 0.01715403 -0.01976949
0.00573004 -0.00446743 0.01479495 -0.01448144 0.01794968 -0.02533936
0.02904355 -0.02418628 0.01505374 -0.00499926 0.00302616 -0.00877499
0.01625907 -0.01240068 -0.00578862 0.01351128 -0.00318733 -0.0010652
0.0029 -0.0038062 0.0064102 -0.01799678 0.04422601 -0.0620881
0.05587037 -0.04856099 0.03535114 -0.03094757 0.03028399 -0.01912546
0.01726283 -0.01392421 0.00989012 -0.01948119 0.02504401 -0.02204667
0.0197554 -0.01270022 -0.00260326 0.01038581 -0.00299247 -0.00271539
-0.00744152 0.00784016 0.00103947 -0.00576122]
[0 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 1]
I've been reading about the Metropolis-Hastings (MH) algorithm. Theoretically, I understood how the algorithm works. Now, I am trying to implement the MH algorithm using python.
I came across the following notebook. It suits my problem exactly, since I want to fit my data with a straight line while taking into account the measurement errors on my data. I am going to paste the code I am finding difficult to understand:
# initial m, b
m,b = 2, 0
# step sizes
mstep, bstep = 0.1, 10.
# how many steps?
nsteps = 10000
chain = []
probs = []
naccept = 0
print('Running MH for', nsteps, 'steps')
# First point:
L_old = straight_line_log_likelihood(x, y, sigmay, m, b)
p_old = straight_line_log_prior(m, b)
prob_old = np.exp(L_old + p_old)
for i in range(nsteps):
    # step
    mnew = m + np.random.normal() * mstep
    bnew = b + np.random.normal() * bstep
    # evaluate probabilities
    # prob_new = straight_line_posterior(x, y, sigmay, mnew, bnew)
    L_new = straight_line_log_likelihood(x, y, sigmay, mnew, bnew)
    p_new = straight_line_log_prior(mnew, bnew)
    prob_new = np.exp(L_new + p_new)
    if (prob_new / prob_old > np.random.uniform()):
        # accept
        m = mnew
        b = bnew
        L_old = L_new
        p_old = p_new
        prob_old = prob_new
        naccept += 1
    else:
        # Stay where we are; m,b stay the same, and we append them
        # to the chain below.
        pass
    chain.append((b,m))
    probs.append((L_old,p_old))

print('Acceptance fraction:', naccept/float(nsteps))
The code is simple and easy, but I have difficulties in understanding how the MH is being implemented.
My question is about the chain.append (the third line from the bottom). The author appends m and b whether they were accepted or rejected. Why? Shouldn't he append only the accepted points?
The following R code demonstrates why it is important to capture the rejected case:
# 20 samples from 0 or 1. 1 has an 80% probability of being chosen.
the.population <- sample(c(0,1), 20, replace = TRUE, prob=c(0.2, 0.8))
# Create a new sample that only catches changes
the.sample <- c(the.population[1])
# Loop though the.population,
# but only copy the.population to the.sample if the value changes
for( i in 2:length(the.population))
{
  if(the.population[i] != the.population[i-1])
    the.sample <- append(the.sample, the.population[i])
}
When this code runs, the.population gets 20 values, for example:
0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1
The probability of a 1 in this population is 16/20 or 0.8. Exactly the probability we expected...
The sample, on the other hand, which only records changes, looks like this:
0 1 0 1 0 1
The probability of a 1 in the sample is 3/6 or 0.5.
We are trying to build a distribution; rejecting the new value means that the old value is more likely than the new one. That needs to be captured so that our distribution is correct.
From a quick reading of the algorithm description: when a candidate is rejected, it still counts as a step, but the value is the same as in the old step. I.e. b, m are appended either way; they only get updated (to bnew, mnew) when the candidate is accepted.
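For illustration, a minimal Python sketch of the same point, using a toy standard-normal target rather than the notebook's straight-line model: the current state is appended on every iteration, whether or not the proposal is accepted.
import numpy as np

rng = np.random.default_rng(0)
x = 0.0
chain = []
naccept = 0
for _ in range(10000):
    x_new = x + 0.5 * rng.normal()                  # symmetric proposal
    log_ratio = -0.5 * x_new**2 + 0.5 * x**2        # log p(x_new) - log p(x) for N(0, 1)
    if np.log(rng.uniform()) < log_ratio:
        x = x_new                                   # accept: move to the proposal
        naccept += 1
    chain.append(x)                                 # append the CURRENT state either way
print(naccept / 10000, np.mean(chain), np.std(chain))  # mean ~0, std ~1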
Question in short
Given a large sparse csr_matrix A and a numpy array B, what is the fastest way to construct a numpy matrix C, such that C[i,j] = sum(A[k,j]) for all k where B[k] == i?
Details of question
I found a solution to do this, but I am not really content with how long it takes. I will first explain the problem, then my solution, then show my code, and then show my timings.
Problem
I am working on a clustering algorithm in Python, and I'd like to speed it up. I have a sparse csr_matrix pam, which holds, per person and per article, how many items of that article they bought. Furthermore, I have a numpy array clustering, which denotes the cluster each person belongs to. Example:
         pam                     pam.T                  clustering
        article                 person
    p [[1 0 0 0]          a
    e  [0 2 0 0]          r  [[1 0 1 0 0 0]           [0 0 0 0 1 1]
    r  [1 1 0 0]          t   [0 2 1 0 0 0]
    s  [0 0 1 0]          i   [0 0 0 1 0 1]
    o  [0 0 0 1]          c   [0 0 0 0 1 2]]
    n  [0 0 1 2]]         l
                          e
What I want to calculate is acm: the number of items that all people in one cluster together bought of each article. This amounts to, for every column i of acm, adding up those columns p of pam.T for which clustering[p] == i.
acm
cluster
a
r [[2 0]
t [3 0]
i [1 1]
c [0 3]]
l
e
Solution
First, I create another sparse matrix pcm, in which element [i,j] indicates whether person i is in cluster j. Result (when cast to a dense matrix):
         pcm
       cluster
    p [[ True False]
    e  [ True False]
    r  [ True False]
    s  [ True False]
    o  [False  True]
    n  [False  True]]
Next, I matrix multiply pam.T with pcm to get the matrix that I want.
Code
I wrote the following program to test the duration of this method in practice.
import numpy as np
from scipy.sparse import csr_matrix
from timeit import timeit
def _clustering2pcm(clustering):
    '''
    Converts a clustering (np array) into a person-cluster matrix (pcm)
    '''
    N_persons = clustering.size
    m_person = np.arange(N_persons)
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    m_data = [True] * N_persons
    pcm = csr_matrix( (m_data, (m_person, clustering)), shape = (N_persons, N_clusters))
    return pcm

def pam_clustering2acm():
    '''
    Convert a person-article matrix and a given clustering into an
    article-cluster matrix
    '''
    global clustering
    global pam
    pcm = _clustering2pcm(clustering)
    acm = csr_matrix.transpose(pam).dot(pcm).todense()
    return acm

if __name__ == '__main__':
    global clustering
    global pam

    N_persons = 200000
    N_articles = 400
    N_shoppings = 400000
    N_clusters = 20

    m_person = np.random.choice(np.arange(N_persons), size = N_shoppings, replace = True)
    m_article = np.random.choice(np.arange(N_articles), size = N_shoppings, replace = True)
    m_data = np.random.choice([1, 2], p = [0.99, 0.01], size = N_shoppings, replace = True)
    pam = csr_matrix( (m_data, (m_person, m_article)), shape = (N_persons, N_articles))
    clustering = np.random.choice(np.arange(N_clusters), size = N_persons, replace = True)

    print(timeit(pam_clustering2acm, number = 100))
Timing
It turns out that for these 100 runs, I need 5.1 seconds. 3.6 seconds of these are spent on creating pcm. I have the feeling there could be a faster way to calculate this matrix without creating a temporary sparse matrix, but I don't see one without looping. Is there a faster way of construction?
EDIT
After Martino's answer, I tried to implement the loop-over-clusters and slicing algorithm, but that is even slower. It now takes 12.5 seconds to calculate acm 100 times, of which 4.1 seconds remain if I remove the line acm[:,i] = pam[p,:].sum(axis = 0).
def pam_clustering2acm_loopoverclusters():
    global clustering
    global pam

    N_articles = pam.shape[1]
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    acm = np.zeros([N_articles, N_clusters])

    for i in clusters:
        p = np.where(clustering == i)[0]
        acm[:,i] = pam[p,:].sum(axis = 0)
    return acm
This is about 50x faster than your _clustering2pcm function:
def pcm(clustering):
    n = clustering.size
    data = np.ones((n,), dtype=bool)
    indptr = np.arange(n+1)
    return csr_matrix((data, clustering, indptr))
I haven't looked at the source code, but when you pass the CSR constructor the (data, (rows, cols)) structure, it is almost certainly using that to create a COO matrix first and then converting it to CSR. Because your matrix is so simple, it is very easy to put the actual CSR description arrays together yourself, as above, and skip all of that.
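For illustration, a tiny (hypothetical) example of what that (data, indices, indptr) triple encodes: row i stores a single True in column clustering[i], so indptr is simply 0, 1, ..., n.
import numpy as np
from scipy.sparse import csr_matrix

clustering = np.array([1, 0, 1, 2])
data = np.ones(4, dtype=bool)        # one stored value per row
indptr = np.arange(5)                # row i uses entries indptr[i]:indptr[i+1] -> exactly one each
m = csr_matrix((data, clustering, indptr))
print(m.toarray().astype(int))
# [[0 1 0]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]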
This cuts your execution time by almost a factor of three:
In [38]: %timeit pam_clustering2acm()
10 loops, best of 3: 36.9 ms per loop
In [40]: %timeit pam.T.dot(pcm(clustering)).A
100 loops, best of 3: 12.8 ms per loop
In [42]: np.all(pam.T.dot(pcm(clustering)).A == pam_clustering2acm())
Out[42]: True
I refer you to the scipy.sparse docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix), which say that row slicing is efficient (as opposed to column slicing), so it is probably better to stick to the non-transposed matrix. If you browse down, there is a sum method for which the axis can be specified. It is probably better to use the methods that come with your object, as they are likely to use compiled code. This comes at the cost of looping through the clusters (of which I am assuming there are not too many).
I have the following dataset in numpy
indices | real data (X) |targets (y)
| |
0 0 | 43.25 665.32 ... |2.4 } 1st block
0 0 | 11.234 |-4.5 }
0 1 ... ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
These are my variables:
idx1 = data[:,0]
idx2 = data[:,1]
X = data[:,2:-1]
y = data[:,-1]
I also have a variable W which is a 3D array.
What I need to do in the code is loop through all the blocks in the dataset, compute a scalar for each block, sum up all the scalars, and store the result in a variable called cost. The problem is that the looping implementation is very slow, so I'm trying to vectorize it if possible. This is my current code. Is it possible to do this without for loops in numpy?
IDX1 = 0
IDX2 = 1
# get unique indices
idx1s = np.arange(len(np.unique(data[:,IDX1])))
idx2s = np.arange(len(np.unique(data[:,IDX2])))
# initialize global sum variable to 0
cost = 0
for i1 in idx1s:
    for i2 in idx2s:
        # for each block in the dataset
        mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
        # get variables for that block
        curr_X = X[mask,:]
        curr_y = y[mask]
        curr_W = W[:,i2,i1]
        # calculate a scalar
        pred = np.dot(curr_X, curr_W)
        sigm = 1.0 / (1.0 + np.exp(-pred))
        loss = np.sum((sigm - 0.5) * curr_y)
        # add result to global cost
        cost += loss
Here is some sample data
data = np.array([[0,0,5,5,7],
[0,0,5,5,7],
[0,1,5,5,7],
[0,1,5,5,7],
[1,0,5,5,7],
[1,1,5,5,7]])
W = np.zeros((2,2,2))
idx1 = data[:,0]
idx2 = data[:,1]
X = data[:,2:-1]
y = data[:,-1]
That W was tricky... Actually, your blocks are pretty irrelevant, apart from getting the right slice of W to do the np.dot with the corresponding X, so I went the easy route of creating an aligned_W array as follows:
aligned_W = W[:, idx2, idx1]
This is an array of shape (2, rows) where rows is the number of rows of your data set. You can now proceed to do your whole calculation without any for loops as:
from numpy.core.umath_tests import inner1d

pred = inner1d(X, aligned_W.T)
sigm = 1.0 / (1.0 + np.exp(-pred))
loss = (sigm - 0.5) * y
cost = np.sum(loss)
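One caveat: numpy.core.umath_tests is an internal module that newer NumPy versions warn about importing. If it is not available, the same row-wise dot products can be written with einsum; a sketch using X, aligned_W and y from the snippets above:
import numpy as np

# pred[i] = X[i, :] . aligned_W.T[i, :], i.e. one dot product per row
pred = np.einsum('ij,ij->i', X, aligned_W.T)
sigm = 1.0 / (1.0 + np.exp(-pred))
cost = np.sum((sigm - 0.5) * y)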
My guess is that the major reason your code is slow is the following line:
mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
Because you repeatedly scan your input arrays for a small number of rows of interest. So you need to do the following:
ni1 = len(np.unique(data[:,IDX1]))
ni2 = len(np.unique(data[:,IDX2]))
idx1s = np.arange(ni1)
idx2s = np.arange(ni2)
key = data[:,IDX1] * ni2 + data[:,IDX2] # 1D key to the rows
sortids = np.argsort(key) #indices to the sorted key
Then inside the loop instead of
mask=np.nonzero(...)
you need to do
curid = i1 * ni2 + i2
left = np.searchsorted(key, curid, 'left', sorter=sortids)
right = np.searchsorted(key, curid, 'right', sorter=sortids)
mask = sortids[left:right]
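For illustration, a small self-contained sketch of that binary-search lookup on toy data (hypothetical values, just to show the mechanics):
import numpy as np

# toy data: two index columns; block (0, 1) occupies rows 0 and 4
data = np.array([[0, 1],
                 [1, 0],
                 [0, 0],
                 [1, 1],
                 [0, 1]])
ni2 = len(np.unique(data[:, 1]))
key = data[:, 0] * ni2 + data[:, 1]      # one scalar key per row
sortids = np.argsort(key)                # indices that sort the keys

def block_rows(i1, i2):
    """Row indices of block (i1, i2), found by binary search instead of a full scan."""
    curid = i1 * ni2 + i2
    left = np.searchsorted(key, curid, 'left', sorter=sortids)
    right = np.searchsorted(key, curid, 'right', sorter=sortids)
    return sortids[left:right]

print(block_rows(0, 1))   # -> [0 4]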
I don't think that there is a way to compare numpy arrays of different sizes without using for loops. It would be hard to decide the meaning and shape of the output of something like
[0,1,2,3,4] == [3,4,2]
The only suggestion that I can give you is to get rid of one of the for loops using itertools.product:
import itertools as it
[...]
idx1s = np.unique(data[:,IDX1])
idx2s = np.unique(data[:,IDX2])
# initialize global sum variable to 0
cost = 0
for i1, i2 in it.product(idx1s, idx2s):
    # for each block in the dataset
    mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
    # get variables for that block
    curr_X = X[mask,:]
    curr_y = y[mask]
    [...]
You can also keep mask as a bool array
mask = (data[:,IDX1] == i1) & (data[:,IDX2] == i2)
The output is the same, and you have to use the memory to create the bool array anyway. Doing it this way saves you some memory and a function evaluation.
EDIT
If you know that the indices do not have holes, or have only a few holes, it might be worth removing the part where you define idx1s and idx2s and changing the for loop to
max1, max2 = data[:,[IDX1, IDX2]].max(axis=0)
for i1, i2 in it.product(xrange(max1 + 1), xrange(max2 + 1)):
    [...]
Both xrange and it.product are iterators, so they create i1 and i2 only when you need them.
ps: if you are on Python 3.x, use range instead of xrange
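For reference, it.product simply yields all the index pairs lazily:
import itertools as it

print(list(it.product(range(2), range(3))))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]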