For-Loop over python float array - python

I am working with the IRIS dataset. I have two sets of data: (1) a training set and (2) a test set. Now I want to calculate the Euclidean distance between every test set row and every training set row, using only the first 4 values of each row.
A working example would be:
dist = np.linalg.norm(inner1test[0][0:4]-inner1train[0][0:4])
print(dist)
***output: 3.034243***
The problem is that I have 120 training set points and 30 test set points, so I would have to do 3,600 operations manually. I therefore thought about iterating through with a for-loop, but unfortunately every one of my attempts is failing.
This would be my best attempt, which produces the error message below:
for i in inner1test:
    for number in inner1train:
        dist = np.linalg.norm(inner1test[i][0:4]-inner1train[number][0:4])
        print(dist)
IndexError: arrays used as indices must be of integer (or boolean) type
What would be the best solution to iterate through this array?
PS: I will also provide a screenshot for better visualization.

From what I see, inner1test is a tuple of lists, so the i value will not be an index but the actual list.
You should use enumerate, which returns two variables, the index and the actual data.
for i, value in enumerate(inner1test):
    for j, number in enumerate(inner1train):
        dist = np.linalg.norm(inner1test[i][0:4]-inner1train[j][0:4])
        print(dist)
Also, if your lists begin to get bigger, consider using a generator, which will execute your calculations iteration by iteration and return only one value at a time, avoiding returning a big chunk of results that would occupy a lot of memory.
e.g.:
def my_calculation(inner1test, inner1train):
    for i, value in enumerate(inner1test):
        for j, number in enumerate(inner1train):
            dist = np.linalg.norm(inner1test[i][0:4]-inner1train[j][0:4])
            yield dist

for i in my_calculation(inner1test, inner1train):
    print(i)
You might also want to investigate Python list comprehensions, which are sometimes a more elegant way to handle for-loops over lists (see the sketch below).
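For example, a minimal list-comprehension sketch (assuming, as above, that the rows support numpy-style slicing and that numpy is imported as np):
all_dists = [np.linalg.norm(testrow[0:4] - trainrow[0:4])
             for testrow in inner1test
             for trainrow in inner1train]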
[EDIT]
Here's a probably easier solution anyway, without the need for indexes, which won't fail when iterating over a numpy object:
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4]-testtrain[0:4])
[/EDIT]

This was the final solution with the correct output for me:
distanceslist = list()
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4]-testtrain[0:4])
        distances = (dist, testtrain[0:4])
        distanceslist.append(distances)
distanceslist
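As a side note, if inner1test and inner1train can be treated as 2-D numpy arrays, the whole 30 x 120 distance matrix can be computed without an explicit Python loop. A minimal sketch, assuming both variables convert cleanly with np.asarray and that the four features sit in the first 4 columns:
import numpy as np
from scipy.spatial.distance import cdist

test_arr = np.asarray(inner1test)[:, 0:4]    # shape (30, 4)
train_arr = np.asarray(inner1train)[:, 0:4]  # shape (120, 4)

# dists[i, j] is the Euclidean distance between test row i and train row j
dists = cdist(test_arr, train_arr)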

Related

Given a set t of tuples containing elements from the set S, what is the most efficient way to build another set whose members are not contained in t?

For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
     [7,3],
     [3,1]]
d = some_algorithm(S,t)
# one possible output:
# d = [[2,1],
#      [3,2],
#      [7,0]]
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S, (m, 2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, it rarely results in an (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe I could make a large hash map of the values in t so that checking for membership in t is O(1), but that produces the same issue, just with memory. Is there a more efficient way?
An approximate solution is also okay.
My naive attempt would be a base-transformation function to reduce the problem to an integer-set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
Since S is a set, we can write each element of t as an M-digit number in base L, using the elements of S as digits.
For M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
Since you can now represent t as an integer set (read: hashtable), calculating the inverse set d becomes rather trivial:
loop over all codes from 0 to L^M - 1 and ask your hashtable whether each code is in t; every code that is not belongs to d
If the size of S is too big, you can also just draw random numbers and check them against the hashtable to get a subset of the inverse of t.
Does this help you?
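A minimal Python sketch of this base-transform idea for M = 2 (hypothetical helper names, using the S and t from the question):
S = [0, 1, 2, 3, 7]
t = [[0, 1], [7, 3], [3, 1]]

L = len(S)
index_of = {v: i for i, v in enumerate(S)}   # I(x): value -> index in S

def f(pair):
    # encode a tuple as a single base-L integer
    return index_of[pair[1]] * L + index_of[pair[0]]

def f_inv(code):
    # decode a base-L integer back into a tuple
    return (S[code % L], S[(code // L) % L])

t_codes = {f(pair) for pair in t}            # t as an integer set (hash set)

# the inverse set d: every 2-tuple over S whose code is not in t
d = [f_inv(code) for code in range(L ** 2) if code not in t_codes]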
If |t| + |d| << |S|^2, then the probability that a random tuple has already been chosen (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C < 1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that by redrawing elements until each is a new one, you need O((1/(1-C)) * |d|) draws to produce the new elements (on average), which is O(|d|) if C is indeed a constant.
Checking whether an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup takes constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; it will make some errors, saying an element has been "seen" though it was not, but never the other way around, so you will still get all elements of d as unique.
In-place sorting of t and using binary search. This adds O(|t| log|t|) pre-processing and O(log|t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to run a Fisher-Yates shuffle on the available choices and take the first |d| elements that do not appear in t.
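A minimal rejection-sampling sketch along these lines (hypothetical helper name; assumes S, t and a target size m as in the question, and uses a hash set for the "seen" check):
import numpy as np

def sample_new_pairs(S, t, m, seed=None):
    # Draw m distinct pairs over S that do not occur in t (rejection sampling).
    rng = np.random.default_rng(seed)
    seen = {tuple(pair) for pair in t}      # hash set of forbidden tuples: O(1) lookups
    out = []
    while len(out) < m:
        pair = (int(rng.choice(S)), int(rng.choice(S)))
        if pair not in seen:                # expect roughly 1/(1-C) draws per new element
            seen.add(pair)
            out.append(pair)
    return np.array(out)

d = sample_new_pairs(S, t, m=3)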

Search for the nearest array in a huge array of arrays

I need to find the closest possible sentence.
I have an array of sentences and a user sentence, and I need to find the element of the array that is closest to the user's sentence.
I presented each sentence in the form of a vector using word2vec:
def get_avg_vector(word_list, model_w2v, size=500):
    sum_vec = np.zeros(shape=(1, size))
    count = 0
    for w in word_list:
        if w in model_w2v and w != '':
            sum_vec += model_w2v[w]
            count += 1
    if count == 0:
        return sum_vec
    else:
        return sum_vec / count + 1
As a result, the array element looks like this:
array([[ 0.93162371, 0.95618944, 0.98519795, 0.98580566, 0.96563747,
0.97070891, 0.99079191, 1.01572807, 1.00631016, 1.07349398,
1.02079309, 1.0064849 , 0.99179418, 1.02865136, 1.02610303,
1.02909719, 0.99350413, 0.97481178, 0.97980362, 0.98068508,
1.05657591, 0.97224562, 0.99778703, 0.97888296, 1.01650529,
1.0421448 , 0.98731804, 0.98349052, 0.93752996, 0.98205837,
1.05691232, 0.99914532, 1.02040555, 0.99427229, 1.01193818,
0.94922226, 0.9818139 , 1.03955 , 1.01252615, 1.01402485,
...
0.98990598, 0.99576604, 1.0903802 , 1.02493086, 0.97395976,
0.95563786, 1.00538653, 1.0036294 , 0.97220088, 1.04822631,
1.02806122, 0.95402776, 1.0048053 , 0.97677222, 0.97830801]])
I represent the user's sentence as a vector in the same way, and I compute the closest element to it like this:
%%cython
from scipy.spatial.distance import euclidean

def compute_dist(v, list_sentences):
    dist_dict = {}
    for key, val in list_sentences.items():
        dist_dict[key] = euclidean(v, val)
    return sorted(dist_dict.items(), key=lambda x: x[1])[0][0]
list_sentences in the method above is a dictionary in which the keys are text representations of the sentences and the values are the vectors.
It takes a very long time, because I have more than 60 million sentences.
How can I speed up and optimize this process?
I'll be grateful for any advice.
The initial calculation of the 60 million sentences' vectors is essentially a fixed cost you'll pay once. I'm assuming you mainly care about the time for each subsequent lookup, for a single user-supplied query sentence.
Using numpy native array operations can speed up the distance calculations over doing your own individual calculations in a Python loop. (It's able to do things in bulk using its optimized code.)
But first you'd want to replace list_sentences with a true numpy array, accessed only by array-index. (If you have other keys/texts you need to associate with each slot, you'd do that elsewhere, with some dict or list.)
Let's assume you've done that, in whatever way is natural for your data, and now have array_sentences, a 60-million by 500-dimension numpy array, with one sentence average vector per row.
Then a 1-liner way to get an array full of the distances is to take the vector length ("norm") of the difference between each of the 60 million candidates and the 1 query, which gives a 60-million-entry array with one distance per candidate:
dists = np.linalg.norm(array_sentences - v, axis=1)
Another 1-liner is to use the scipy utility function cdist() (from scipy.spatial.distance) for computing the distance between each pair from two collections of inputs. Here, your first collection is just the one query vector v (but if you had batches to do at once, supplying more than one query at a time could offer an additional slight speedup):
from scipy.spatial.distance import cdist
dists = cdist(v.reshape(1, -1), array_sentences)[0]
(Note that such vector comparisons often use cosine-distance/cosine-similarity rather than euclidean-distance. If you switch to that, you might be doing other norming/dot-products instead of the first option above, or use the metric='cosine' option to cdist().)
Once you have all the distances in a numpy array, using a numpy-native sort option is likely to be faster than Python's sorted(). For example, numpy's indirect sort argsort() just returns the sorted indexes (and thus avoids moving all the vector coordinates around), which is all you need to know which items are the best match(es). For example:
sorted_indexes = np.argsort(dists)
best_index = sorted_indexes[0]
If you need to turn that int index back into your other key/text, you'd use your own dict/list that remembered the slot-to-key relationships.
All these still give an exactly right result, by comparing against all candidates, which (even when done optimally well) is still time-consuming.
There are ways to get faster results, based on pre-building indexes to the full set of candidates – but such indexes become very tricky in high-dimensional spaces (like your 500-dimensional space). They often trade off perfectly accurate results for faster results. (That is, what they return for 'closest 1' or 'closest N' will have some errors, but usually not be off by much.) For examples of such libraries, see Spotify's ANNOY or Facebook's FAISS.
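Purely as an illustration of such a library (a sketch, not a tuned recipe), a minimal Annoy example, assuming array_sentences is the (N, 500) numpy array from above and v the (1, 500) query vector; the number of trees and neighbours are arbitrary choices here:
from annoy import AnnoyIndex

dim = 500
index = AnnoyIndex(dim, 'euclidean')       # or 'angular' for cosine-style similarity
for i, vec in enumerate(array_sentences):
    index.add_item(i, vec)                 # one add_item call per candidate vector
index.build(10)                            # more trees = better accuracy, more memory/time

best_index = index.get_nns_by_vector(v.ravel(), 1)[0]   # approximate nearest neighbour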
At least if you are doing this procedure for multiple query sentences, you could try using scipy.spatial.cKDTree (I don't know whether it pays for itself on a single query; also, 500 dimensions is quite high, and I seem to remember KD-trees work better with fewer dimensions, so you'll have to experiment).
Assuming you've put all your vectors (dict values) into one large numpy array:
>>> import numpy as np
>>> from scipy.spatial import cKDTree as KDTree
>>>
# 100,000 vectors (that's all my RAM can take)
>>> a = np.random.random((100000, 500))
>>>
>>> t = KDTree(a)
# create one new vector and find distance and index of closest
>>> t.query(np.random.random(500))
(8.20910072933986, 83407)
I can think of 2 possible ways of optimizing this process.
First, if your goal is only to get the closest vector (or sentence), you could get rid of the dist_dict variable and only keep in memory the closest sentence found so far. This way, you won't need to sort the complete (and presumably very large) list at the end, and can just return the closest one.
def compute_dist(v, list_sentences):
    min_dist = float('inf')
    closest_sentence = None
    for key, val in list_sentences.items():
        dist = euclidean(v, val)
        if dist < min_dist:
            closest_sentence = key
            min_dist = dist
    return closest_sentence
The second one is maybe a little more unsound. You could try to re-implement the euclidean method, giving it a third argument which would be the current minimum distance min_dist between the closest vector you have found so far and the user vector. I don't know how the scipy euclidean method is implemented, but I guess it is close to summing squared differences along all the vector dimensions. What you want is for the method to stop as soon as the running sum exceeds min_dist squared (the distance will be higher than min_dist anyway, and you won't keep it).
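A minimal sketch of that early-exit idea in plain Python/numpy (hypothetical helper name; the running sum of squared differences is compared against min_dist squared so that it matches the Euclidean distance):
import numpy as np

def euclidean_early_exit(v, val, min_dist):
    # Return the Euclidean distance, or None as soon as it is known to exceed min_dist.
    limit = min_dist ** 2
    total = 0.0
    for a, b in zip(np.ravel(v), np.ravel(val)):
        total += (a - b) ** 2
        if total > limit:        # already worse than the best candidate so far
            return None
    return np.sqrt(total)

In pure Python this per-element loop will usually be slower than scipy's vectorized euclidean despite the early exit, so the idea mainly pays off inside Cython or another compiled extension.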

More efficient way to find index of objects in Python array

I have a very large 400x300x60x27 array (let's call it 'A'). I took the maximum values along the last axis, which gives a 400x300x60 array called 'B'. Basically, I need to find the index in 'A' of each value in 'B'. I have converted them both to lists and set up a for-loop to find the indices, but it takes an absurdly long time to get through because there are over 7 million values. This is what I have:
B = np.zeros((400, 300, 60))
C = np.zeros((400*300*60))
B = np.amax(A, axis=3)
A = np.ravel(A)
A = A.tolist()
B = np.ravel(B)
B = B.tolist()
for i in range(0, 400*300*60):
    C[i] = A.index(B[i])
Is there a more efficient way to do this? It's taking hours and hours, and the program is still stuck on the last line.
You don't need amax, you need argmax. With argmax, the array will contain the indices rather than the values, and finding the values from the indices is computationally much more efficient than the other way around.
So, I would recommend you store only the indices, before flattening the array.
Instead of np.amax, run A.argmax(axis=3); this will contain the indices.
But before you flatten it to 1D, you will need a mapping function that maps the indices to 1D as well. This is probably a trivial problem, as you just need a few basic arithmetic operations to achieve it. It will also consume some time, since it needs to be executed quite a few times, but it won't be a searching problem, and it will save you quite some time.
You are getting those argmax indices, and because of the flattening you basically need the linear index equivalents of those.
Thus, a solution would be to add the proper offsets into the argmax indices in steps, leveraging broadcasting at each one of them, like so -
m,n,r,s = A.shape
idx = A.argmax(axis=3)
idx += s*np.arange(r)
idx += r*s*np.arange(n)[:,None]
idx += n*r*s*np.arange(m)[:,None,None] # idx is your C output
Alternatively, a compact way to put it would be like so -
m,n,r,s = A.shape
I,J,K = np.ogrid[:m,:n,:r]
idx = n*r*s*I + r*s*J + s*K + A.argmax(axis=3)
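For reference, an equivalent sketch using numpy's built-in index helpers (np.indices and np.ravel_multi_index), which computes the same linear indices without writing the offsets by hand:
import numpy as np

m, n, r, s = A.shape
I, J, K = np.indices((m, n, r))                       # index grids for the first three axes
last = A.argmax(axis=3)                               # argmax along the last axis, shape (m, n, r)
idx = np.ravel_multi_index((I, J, K, last), A.shape)  # linear indices into np.ravel(A)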

Numpy mean 'inplace'

I have a line of code that looks like this:
te_succ_rate = np.mean(np.argmax(test_y, axis=1) == self.predictor(test_x))
where test_y is a numpy array of arrays and self.predictor(test_x) returns a numpy array. The whole line of code returns the fraction of subarrays in test_y whose argmax (the position of the maximum value) is equal to the value in the corresponding position of the array returned from self.predictor(test_x).
The problem is that for large sizes of test_y and test_x, it runs out of memory. It works fine for 10 000, but not 60 000.
Is there a way to avoid this?
I tried this:
tr_res = []
for start, end in zip(range(0, len(train_x), subsize), range(subsize, len(train_x), subsize)):
    tr_res.append(self.predictor(train_x[start:end]))
tr_res = np.asarray(tr_res)
tr_res = tr_res.flatten()
tr_succ_rate = np.mean(np.argmax(train_y, axis=1) == tr_res)
But it does not work as the result is somehow 0 (which is not correct).
Level 1:
Though this isn't an answer for doing it inline, it may still be an answer to your problem:
Are you sure you're running out of memory from the mean and not the argmax?
Each additional dimension in test_y stores an extra N values of whatever datatype you're working with. Say you have 5 dimensions in your data: you'll have to store 5N values (presumably floats). The result of self.predictor(test_x) will take a 6th N of memory. The temporary boolean array that is the result of your conditional is a 7th N. I don't actually know what the memory usage of np.mean is, but I assume it's not another N. But for argument's sake, let's say it is. If you inline just np.mean, you'll only save up to one N of memory, while you already need 7N worth.
So alternatively, try pulling your np.argmax(test_y, axis=1) out into an intermediate variable in a previous step, and don't reference test_y again after calculating the argmax, so test_y gets garbage collected (or do whatever Python 3 does to force deletion of that variable). That should save you (number of dimensions of your data minus 1) times N of memory usage: you'll be down to around 3N, or at most 4N, which is better than you could have achieved by inlining just np.mean.
I made the assumption that running self.predictor(test_x) only takes 1N. If it takes more, then pulling that out into its own intermediate variable in the same way will also help.
Level 2:
If that still isn't enough, still pull out your np.argmax(test_y, axis=1) and the self.predictor(test_x) into their own variables, then iterate across the two arrays yourself and do the conditional and aggregation yourself. Something like:
total = 0.0
n = 0
correct_ans = np.argmax(test_y, axis=1)
returned_ans = self.predictor(test_x)
for c, r in zip(correct_ans, returned_ans):
    if c == r:
        total += 1
    n += 1
avg = total / n
(Not sure if zip is the best way to do this; np probably has a more efficient way to do the same thing. This is essentially the second thing you tried, but accumulating the aggregates without storing an additional array.)
That way, you'll also save the need to store the temporary list of booleans resulting from your conditional.
If that still isn't enough, you're going to have to fundamentally change how you're storing your actual and target results, since the issue becomes you not being able to fit just the target and results into memory.
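If chunking is acceptable, here is a minimal sketch of a batched version of the same accuracy calculation (hypothetical helper; it assumes self.predictor accepts a slice of inputs, as in your own attempt, and unlike that attempt it also covers the final partial chunk):
import numpy as np

def chunked_success_rate(predictor, x, y, subsize=10000):
    # Accuracy computed chunk by chunk, so no full-size temporary arrays are kept.
    correct = 0
    total = 0
    for start in range(0, len(x), subsize):          # includes the last partial chunk
        end = min(start + subsize, len(x))
        pred = np.asarray(predictor(x[start:end])).ravel()
        true = np.argmax(y[start:end], axis=1)
        correct += int(np.sum(true == pred))
        total += end - start
    return correct / total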

Vectorize iteration over two large numpy arrays in parallel

I have two large arrays of type numpy.core.memmap.memmap, called data and new_data, with > 7 million float32 items.
I need to iterate over them both within the same loop which I'm currently doing like this.
for i in range(0, len(data)):
    if new_data[i] == 0: continue
    combo = (data[i], new_data[i])
    if combo not in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
However, this is unreasonably slow, so I gather that using numpy's vectorising functions is the way to go.
Is it possible to vectorize with the index, so that the vectorised array can compare its items to the corresponding items in the other array?
I thought of zipping the two arrays but I guess this would cause unreasonable overhead to prepare?
Is there some other way to optimise this operation?
For context: the goal is to effectively merge the two arrays such that each unique combination of corresponding values between the two arrays is represented by a different value in the resulting array, except zeros in the new_data array which are ignored. The arrays represent 3D bitmap images.
EDIT: available_values is a set of values that have not yet been used in data and persists across calls to this loop. new_values_map on the other hand is reset to an empty dictionary before each time this loop is used.
EDIT2: the data array only contains whole numbers, that is: it's initialised as zeros then with each usage of this loop with a different new_data it is populated with more values drawn from available_values which is initially a range of integers. new_data could theoretically be anything.
In answer to your question about vectorising, the answer is probably yes, though you need to clarify what available_values contains and how it's used, as that is the core of the vectorisation.
Your solution will probably look something like this...
indices = new_data != 0
data[indices] = available_values
In this case, if available_values can be considered as a set of values in which we allocate the first value to the first value in data in which new_data is not 0, that should work, as long as available_values is a numpy array.
Let's say new_data and data take values 0-255; then you can construct an available_data array with a unique entry for every possible pair of values in new_data and data, like the following:
available_data = numpy.arange(256 * 256).reshape((256, 256))
indices = new_data != 0
# cast needed because data and new_data are float arrays (with whole-number values)
data[indices] = available_data[data[indices].astype(int), new_data[indices].astype(int)]
Obviously, available_data can be whatever mapping you want. The above should be very quick whatever is in available_data (especially if you only construct available_data once).
Python gives you powerful tools for handling large arrays of data: generators and iterators.
Basically, they allow you to access your data as if it were a regular list, without fetching it into memory all at once, but accessing it piece by piece.
In the case of accessing two large arrays at once, you can do:
from itertools import izip  # on Python 3, the built-in zip is already lazy

for item_a, item_b in izip(data, new_data):
    # ... do your stuff here
izip creates an iterator that iterates over the elements of both arrays at once, but it picks up pieces as you need them, not all at once.
It seems that replacing the first two lines of the loop to produce:
for i in numpy.where(new_data != 0)[0]:
    combo = (data[i], new_data[i])
    if combo not in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
has the desired effect.
So most of the time in the loop was spent skipping the entire loop body upon encountering a zero in new_data. I don't really understand why that many null iterations were so expensive; maybe one day I will...
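For completeness, a possible fully vectorized sketch of the merge using np.unique, under the assumption that the unused values can be materialised as an indexable numpy array; it labels the distinct (data, new_data) pairs in bulk instead of consulting a dictionary per element, and it leaves out the bookkeeping of removing the consumed values from available_values:
import numpy as np

mask = new_data != 0
pairs = np.stack((data[mask], new_data[mask]), axis=1)   # one row per nonzero position

# Label each distinct (data, new_data) combination with an integer 0..k-1.
_, combo_ids = np.unique(pairs, axis=0, return_inverse=True)

# Assumes fresh_values has at least k unused entries (k = number of distinct combos).
fresh_values = np.array(sorted(available_values))
data[mask] = fresh_values[combo_ids]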
