Efficiently update values held in scoring matrix

I am continuously calculating correlation matrices where each time the order of the underlying data is randomized. When a correlation score with randomized data is greater than or equal to the original correlation determined with ordered data, I would like to update the corresponding cell in a scoring matrix with +1. (All cells begin as zeroes in the scoring matrix).
Due to the size of the matrices I am dealing with shape = (3681, 12709), I would like to find out an efficient way of doing this. So far, what I have is inefficient and takes too long. I wonder if there is a matrix-operation style approach to this rather than iterating, as I am currently doing below:
for i, j in product(data_sorted.index, data_sorted.columns):
    # if random correlation is as good as or better than sorted correlation
    if data_random.loc[i, j] >= data_sorted.loc[i, j]:
        # update scoring matrix
        scoring_matrix[sorted_index_list.index(i)][sorted_column_list.index(j)] += 1
I have crudely timed this approach and found that doing this for a single line of my matrix will take roughly 4.2 seconds which seems excessive.
Any help would be much appreciated.

Assuming everything has the same indices, this should work as expected and be pretty quick.
scoring_matrix += (data_random >= data_sorted).astype(int)
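A minimal sketch of that one-liner in context, assuming scoring_matrix is a NumPy integer array and both correlation matrices are pandas DataFrames carrying the same labels (possibly in a different order). The helper name update_scores is hypothetical, and the reindex step is an added safeguard rather than part of the original answer:

```python
import numpy as np
import pandas as pd

def update_scores(scoring_matrix, data_sorted, data_random):
    """Add 1 wherever the randomized correlation matches or beats the ordered one."""
    # Put the randomized result in the same row/column order as the ordered one.
    aligned = data_random.reindex(index=data_sorted.index,
                                  columns=data_sorted.columns)
    # One vectorized comparison over the whole matrix; True counts as 1.
    scoring_matrix += (aligned.to_numpy() >= data_sorted.to_numpy())
    return scoring_matrix

data_sorted = pd.DataFrame([[0.5, 0.2], [0.1, 0.9]],
                           index=["a", "b"], columns=["x", "y"])
data_random = pd.DataFrame([[0.9, 0.3], [0.6, 0.1]],
                           index=["b", "a"], columns=["y", "x"])
scores = update_scores(np.zeros(data_sorted.shape, dtype=int),
                       data_sorted, data_random)
```

Rows and columns of scores follow the order of data_sorted, so no per-cell index lookup is needed.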

Generate random binary matrix constrained to no null row

I want to generate a random binary matrix, so I'm using W=np.random.binomial(1, p, (n,n)).
It works fine, but I want a constraint that no row is just of 0s.
I create the following function:
def random_matrix(p,n):
    m=0
    while m==0:
        W = np.random.binomial(1, p, (n,n))
        m=min(W.sum(axis=1))
    return W
It also works fine, but it seems to me too inefficient. Is there a faster way to create this constraint?

When the matrix is large, regenerating the entire matrix just because a few rows are full of zeros is not efficient. It is statistically safe to regenerate only those rows, since the rows are independent. Here is an example:
def random_matrix(p,n):
    W = np.random.binomial(1, p, (n,n))
    while True:
        null_rows = np.where(W.sum(axis=1) == 0)[0]
        # If there is no null row, then m>0 so we stop the replacement
        if null_rows.size == 0:
            break
        # Replace only the null rows
        W[null_rows] = np.random.binomial(1, p, (null_rows.shape[0],n))
    return W
Even faster solutions
There is an even more efficient approach when p is close to 0 (when p is close to 1, the above function is already fast). A binomial random variable with 0-1 values is a Bernoulli random variable, and the sum of n independent Bernoulli values with probability p is a binomial random value. Thus, you can generate the sum of every row at once using S = np.random.binomial(n, p, n), apply the above method to replace the zero entries of S, and then build the final matrix by putting S[i] ones in the ith row and using np.random.shuffle to randomize the order of the 0-1 values in each row. This method resolves conflicts much more efficiently than the others, because it does not need to generate a full row to check whether it is all zeros: resolving a conflict is about n times cheaper.
If this is not enough, you can use the uint8 datatype for W. Memory access is slow, so generating smaller matrices is generally faster, not to mention that it takes less RAM.
If this is still not enough, you can generate the values one at a time in a basic loop compiled with the Numba JIT compiler. This should be faster since no temporary array is created apart from the final one. For large matrices, this algorithm can even be parallelized (every row can be generated independently). This last solution should be close to optimal.
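The row-sum idea described above could be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: the function name random_matrix_by_row_sums is hypothetical, it uses NumPy's Generator API (rng.permuted, available in NumPy 1.20 and later) instead of np.random.shuffle so that every row is shuffled independently in one call, and it assumes p > 0 (with p = 0 the loop would never end).

```python
import numpy as np

def random_matrix_by_row_sums(p, n, rng=None):
    """Random 0-1 matrix with no all-zero row, built from per-row counts."""
    rng = np.random.default_rng() if rng is None else rng
    # Number of ones in each row: one Binomial(n, p) draw per row.
    s = rng.binomial(n, p, size=n)
    # Redraw only the counts that are zero (assumes p > 0).
    while True:
        zero = np.flatnonzero(s == 0)
        if zero.size == 0:
            break
        s[zero] = rng.binomial(n, p, size=zero.size)
    # Row i starts with s[i] ones followed by zeros, stored as uint8.
    W = (np.arange(n) < s[:, None]).astype(np.uint8)
    # Shuffle each row independently so the ones land at random positions.
    return rng.permuted(W, axis=1)

W = random_matrix_by_row_sums(0.01, 50, np.random.default_rng(0))
```

Only a vector of n counts is redrawn when a conflict occurs, which is where the saving over regenerating full rows comes from.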

Pairwise distance in very large datasets

I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at least a certain distance from each other.
The idea is:
Start new_array with the first row from the data array
Compare new_array with the second row from the data array
If the pdist between them is > tol, append the row to new_array
Compare new_array with the third row from the data array
and so on...
One problem is RAM size. I can't compare all rows at once, even with pdist.
So I've been thinking of splitting the dataset into smaller ones, but then I don't know how to retrieve the index information for the rows in the dataset.
I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances and sklearn paired_distances, and the code below is the fastest I could get. At first it is fast, but after 40k iterations it becomes really slow.
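import numpy as np
from scipy.spatial.distance import pdist
xyTotal=np.random.random([5000000,6])
tol=0.5
ng=[xyTotal[0]]  # start with the first row, as described above
for i,z in enumerate(xyTotal):
    # keep z only if every pairwise distance in the stacked array exceeds tol
    if (pdist(np.vstack([np.array(ng),z]))>tol).all():
        ng.append(z)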
Any suggestions for this problem?
EDIT
ktree = BallTree(xyTotal, leaf_size=40,metric='euclidean')
btsem=[]
for i,j in enumerate(xyTotal):
    ktree.query_radius(j.reshape(1,-1),r=tol, return_distance=True)
    if (ktree.query_radius(j.reshape(1,-1),r=tol, count_only=True))==1:
        btsem.append(j)
This is fast, but I'm only picking outliers. When I get to points that are near others (i.e. in a small cluster), I don't know how to pick only one point and leave the others, since I get the same metrics for all points in the cluster (they all have the same distance to each other).

The computation is slow because the complexity of your algorithm is quadratic: O(k * n * n), where n is len(xyTotal) and k is the probability of the condition being true. Thus, assuming k=0.1 and n=5000000, the running time will be huge (likely hours of computation).
Fortunately, you can write a better implementation running in O(n * log(n)) time. However, this is tricky to implement. You need to add your ng points to a k-d tree; then, for each new point, you can search for its nearest neighbor in the tree and check that the distance to it is greater than tol.
Note that you can find Python modules implementing k-d trees, and the SciPy documentation provides an example implementation written in pure Python (so likely not very efficient).
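A minimal sketch of the tree-based idea, with one swapped-in detail that should be stated plainly: SciPy's cKDTree is built once and does not accept new points afterwards, so instead of inserting accepted points one by one, this version builds the tree over all rows and, each time a row is kept, marks every row within tol of it as rejected. For rows visited in order this selects the same rows as the loop in the question. The function name greedy_thin is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def greedy_thin(points, tol):
    """Return indices of rows kept so that all kept rows are more than tol apart."""
    tree = cKDTree(points)                 # built once over every row
    rejected = np.zeros(len(points), dtype=bool)
    kept = []
    for i in range(len(points)):
        if rejected[i]:
            continue                       # too close to an earlier kept row
        kept.append(i)
        # Every row within tol of row i (including i itself) is now ruled out.
        rejected[tree.query_ball_point(points[i], r=tol)] = True
    return np.array(kept)

rng = np.random.default_rng(0)
pts = rng.random((300, 6))
idx = greedy_thin(pts, 0.5)
```

Because the function returns indices rather than rows, the link back to the original dataset is preserved, and clustered points are handled naturally: the first point of a cluster is kept and its neighbours are skipped.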

Does this shuffling algorithm produce each permutation with uniform probability?

I've seen how a particular naive shuffling algorithm is biased, and I feel like I basically get that, and I get how the Fisher-Yates algorithm is not biased. I have the following algorithm, which was the one I first thought of when I thought about how to shuffle a list. I know it consumes twice the memory and runs in unnecessarily large time, but I'm still curious whether it produces each permutation with uniform probability, or whether there's some sneaky reason I'm not seeing for it to be biased.
I'm also kind of wondering if there is some other "undesirable" property to a random shuffle that this would have, like perhaps the probabilities of various positions in the list being filled with some values are dependent.
def shuf(x):
    out = [None for i in range(len(x))]
    for i in x:
        pos = rand.randint(0,len(x)-1)
        while out[pos] != None:
            pos = rand.randint(0,len(x)-1)
        out[pos] = i
    return out
I generated a heat map of this on a list of 20 elements, running 10^6 trials, and it produced the following. The (i,j) coordinate of the map represents the probability of the ith position of the list being filled with the jth element of the original list.
While I don't see any pattern to the heat map, it looks like the variance might be high. Or that might be the heat map over-stating the variance because, hey, the minimum and max have to come up somewhere.

Undesirable property - this can be expensive if you're shuffling a large set:
while out[pos] != None:
    pos = rand.randint(0,len(x)-1)
Imagine len(x) == 100,000,000 and you've placed 90,000,000 already - you're going to loop a LOT before you get a hit.
Interesting exercises:
What does the heat map look like if you simply generate random numbers between 1 and len(x) over 10^6 iterations?
What does the heat map look like for Fisher-Yates, for comparison?
At a glance, it looks to me like, given a uniform RNG, it should yield a uniformly random permutation (albeit more slowly than Fisher-Yates).
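One way to check this empirically, sketched under a few assumptions: the list is kept tiny so that every permutation can be counted directly instead of reading a heat map, shuf is the question's function with the random source passed in as a parameter, and permutation_counts is a hypothetical helper. The retry loop picks uniformly among the slots that are still empty, so each of the n! outcomes should appear about equally often.

```python
import random
from collections import Counter

def shuf(x, rng=random):
    """The question's shuffle: retry random slots until an empty one is found."""
    out = [None] * len(x)
    for item in x:
        pos = rng.randint(0, len(x) - 1)
        while out[pos] is not None:
            pos = rng.randint(0, len(x) - 1)
        out[pos] = item
    return out

def permutation_counts(n_items=3, trials=60000, seed=0):
    """Count how often each permutation of range(n_items) is produced."""
    rng = random.Random(seed)
    items = list(range(n_items))
    return Counter(tuple(shuf(items, rng)) for _ in range(trials))

counts = permutation_counts()
for perm, count in sorted(counts.items()):
    print(perm, count)
```

With 3 items and 60000 trials, a uniform shuffle gives each of the 6 permutations an expected count of 10000, with a standard deviation of roughly 90.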

Efficient implementation of the transition matrix for page rank

I'm trying to implement PageRank. I'm reading the description here: http://nlp.stanford.edu/IR-book/html/htmledition/markov-chains-1.html
Everything is very clear to me, however I'm concerned about the construction of the matrix $P$. I find that constructing $P$ the naive way would be very expensive. For example: to implement step 1, one would need to check every row of $A$ and then check every element of that row to see if all elements are zero. For step 2 one would need to compute the number of ones for each row. I can imagine my code to have nasty slow loops. I was wondering if there are smart linear algebra techniques that could efficiently construct $P$. I will be using python numpy for my coding.
EDIT: one way I'm thinking of solving this is to sum element-wise over the columns of $A$, which gives a column vector of row sums. I can then go through each element of this vector to check which elements are zero. That tells me which rows have no 1s, and I can set those rows to $1/N$.

Your concern is justified. Since the number of web pages (vertices in the representing graph) is huge, it is impractical to actually build A as a dense matrix and work on it.
The PageRank computation can be done much more efficiently using sparse matrix implementations, since the matrix is very sparse. Most webpages are not actually connected to each other, so most entries in the matrix are 0.
The sparse matrix is built as follows:
Build matrix A as described: A_ij = 1 if (i,j) is an edge, otherwise A_ij = 0.
Step 1 is usually skipped; instead, 'sinks' are removed iteratively. This keeps the matrix from becoming dense. Alternatives include linking each sink back to the nodes that link to it, or linking a sink to itself.
Divide each 1 in A as described in step 2.
Denote the resulting matrix by M. This is the matrix we work with to obtain a column vector p (which is initialized with 1/n for each entry).
x = [1/n, 1/n, ... , 1/n]^T //a column vector
p = [1/n, 1/n, ... , 1/n]^T //a column vector with the initial ranks
M = genSparseMatrix() //as described above
do until p converges:
    p = (1-\alpha)* M*p + (\alpha) * x
return p
In the end, this yields p, the column vector that holds the page rank value for each node.
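The iteration above could be sketched with SciPy sparse matrices as follows. Several choices here are assumptions rather than part of the answer: the function name pagerank and its defaults are hypothetical, M is taken to be the transpose of the row-normalized A so that M @ p pushes rank along outgoing links, and sinks are handled by spreading their rank uniformly at each step (the effect of the 1/N rows in the linked description) without ever storing those dense rows.

```python
import numpy as np
from scipy import sparse

def pagerank(A, alpha=0.15, tol=1e-10, max_iter=200):
    """Power iteration for PageRank; alpha is the teleport probability."""
    A = sparse.csr_matrix(A, dtype=float)
    n = A.shape[0]
    out_deg = np.asarray(A.sum(axis=1)).ravel()   # number of 1s in each row
    dangling = out_deg == 0                       # rows with no outgoing link
    inv = np.zeros(n)
    inv[~dangling] = 1.0 / out_deg[~dangling]
    # Divide each row by its number of 1s, then transpose for M @ p.
    M = (sparse.diags(inv) @ A).T.tocsr()
    x = np.full(n, 1.0 / n)
    p = x.copy()
    for _ in range(max_iter):
        # Rank held by sinks is shared equally among all nodes.
        sink_share = p[dangling].sum() / n
        new_p = (1 - alpha) * (M @ p + sink_share) + alpha * x
        if np.abs(new_p - p).sum() < tol:
            return new_p
        p = new_p
    return p

ranks = pagerank(np.array([[0, 1, 1],
                           [0, 0, 1],
                           [1, 0, 0]]))
```

The row sums are computed in a single vectorized call, which is the same idea as the summation proposed in the question's edit, and the loop only ever touches the nonzero entries of A.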

Python: Randomly draw several objects in a list

I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item
nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.

The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling can be done efficiently with Walker's alias method. I implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post here.) My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
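For readers without access to the linked source, here is a compact pure-Python sketch of the alias method in the variant usually attributed to Vose. It is not the answerer's implementation; the names build_alias and alias_draw are hypothetical, and the probabilities are assumed to sum to 1.

```python
import random

def build_alias(probs):
    """Linear-time preprocessing: one bucket per item, each with an alias."""
    n = len(probs)
    scaled = [p * n for p in probs]          # average bucket height becomes 1
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                  # bucket s keeps its own share...
        alias[s] = l                         # ...and is topped up by item l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:                  # leftovers fill their own bucket
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """Constant-time draw: pick a bucket, then the item or its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]
prob, alias = build_alias(MyProba)
rng = random.Random(0)
drawn = [aList[alias_draw(prob, alias, rng)] for _ in range(len(aList))]
```

The tables are built once, so drawing len(aList) elements costs one linear preprocessing pass plus constant work per draw.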

Here's my lazy method: build a list with the expected number of occurrences of each value for the desired distribution, and use random.choice() to pick a value from the list. (The values and probabilities are paired with zip rather than a dict, because the list contains repeated values and a dict would keep only the last probability for each.)
>>> import random
>>>
>>> values = [3,4,2,1,4,3,5,7,6,4]
>>> probs = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
>>> expected_dist = sum([[v] * int(round(p * 100)) for v, p in zip(values, probs)], [])
>>> random.choice(expected_dist)

You might try precalculating the cumulative probability range for each element and building a tree from these intervals. You then get logarithmic complexity for looking up the element that corresponds to the generated probability, instead of the linear complexity you have now.

You're calculating cumulative_proba every time you call random_pick. I suggest calculating it once outside the function and storing it in a better data structure, such as a binary search tree, which reduces the time complexity of each draw from O(n) to O(log n).
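A sketch of that suggestion, using a sorted list of cumulative probabilities and the standard bisect module in place of an explicit tree (a substitution, though the lookup cost is the same O(log n)). The name make_sampler is hypothetical.

```python
import bisect
import itertools
import random

def make_sampler(items, probs, rng=random):
    """Precompute the cumulative distribution once; each draw is a binary search."""
    cumulative = list(itertools.accumulate(probs))
    total = cumulative[-1]
    last = len(items) - 1

    def draw():
        x = rng.random() * total
        # First position whose cumulative value exceeds x; the min() only
        # guards against floating-point rounding at the very top of the range.
        return items[min(bisect.bisect_right(cumulative, x), last)]

    return draw

aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]
draw = make_sampler(aList, MyProba, random.Random(1))
list_of_drawn_elements = [draw() for _ in range(len(aList))]
```

On Python 3.6 and later, random.choices(aList, weights=MyProba, k=len(aList)) performs the same cumulative-weights-plus-bisection procedure in a single call.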
