Faster way of centering lists - python

I'm looking for a better, faster way to center a couple of lists. Right now I have the following:
import random
m = range(2000)
sm = sorted(random.sample(range(100000), 16000))
si = random.sample(range(16005), 16000)
# Centered array.
smm = []
print sm
print si
for i in m:
    if i in sm:
        smm.append(si[sm.index(i)])
    else:
        smm.append(None)
print m
print smm
This in effect creates a list (m) containing the range of numbers to center against, another (sorted) list (sm) that m is centered against, and a list of values (si) to append.
This sample runs fairly quickly, but when I run a larger task with many more variables, performance slows to a standstill.

Your main loop contains this infamous line:
if i in sm:
It looks innocuous, but since sm is the result of sorted() it is a list, hence lookup in it is O(n), which explains why it's slow with a big dataset.
Moreover, you're using the even more infamous si[sm.index(i)], which makes your algorithm O(n**2).
Since you need the indexes, using a set is not so easy, but there's something better to do.
Since sm is sorted, you can use bisect to find the index in O(log(n)), like this:
import bisect

for i in m:
    j = bisect.bisect_left(sm, i)
    smm.append(si[j] if (j < len(sm) and sm[j] == i) else None)
Small explanation: bisect gives you the insertion point of i in sm. That doesn't mean the value is actually in the list, so we have to check (first that the returned index is within the existing list's range, then that the value at that index is the searched value); if so, append si[j], else append None.
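To make the check concrete, here is a tiny self-contained sketch (the short lists are made up purely for illustration; the question's m, sm and si behave the same way):

import bisect

sm = [2, 5, 9]          # stand-in for the sorted sample
si = [20, 50, 90]       # stand-in for the values to append
m = range(4)

smm = []
for i in m:
    j = bisect.bisect_left(sm, i)
    smm.append(si[j] if (j < len(sm) and sm[j] == i) else None)

print smm               # [None, None, 20, None]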

How can I make this Python code quicker?

How can I make this code quicker?
def add_may_go(x, y):
    counter = 0
    for i in range(-2, 3):
        cur_y = y + i
        if cur_y < 0 or cur_y >= board_order:
            continue
        for j in range(-2, 3):
            cur_x = x + j
            if (i == 0 and j == 0) or cur_x < 0 or cur_x >= board_order or [cur_y, cur_x] in huge_may_go:
                continue
            if not public_grid[cur_y][cur_x]:
                huge_may_go.append([cur_y, cur_x])
                counter += 1
    return counter
INPUT:
something like: add_may_go(8,8), add_may_go(8,9) ...
huge_may_go is a huge list like:
[[7,8],[7,9], [8,8],[8,9],[8,10]....]
public_grid is also a huge list; its size is board_order*board_order, and every element it holds is either 0 or 1,
like:
[
[0,1,0,1,0,1,1,...(board_order times), 0, 1],
... board_order times
[1,0,1,1,0,0,1,...(board_order times), 0, 1],
]
and board_order is a global variable which is usually 19 (sometimes it is 15 or 20).
It runs way too slowly now, and this function is going to be called hundreds of times. Any suggestion is welcome!
I have tried numpy, but numpy made it even slower! Please help.
It is difficult to provide a definitive improvement without sample data and a bit more context. Using numpy would be beneficial if you can manage to perform all the calls (i.e. all (x,y) coordinate values) in a single operation. There are also strategies based on sets that could work but you would need to maintain additional data structures in parallel with the public_grid.
Based only on that piece of code, and without changing the rest of the program, there are a couple of things you could do that will provide small performance improvements:
loop only over eligible coordinates rather than skipping invalid ones (outside the board)
only compute the cur_x, cur_y values once (track them in a dictionary for each (x, y) pair). This is assuming that the same x, y coordinates are used in multiple calls to the function.
use comprehensions when possible
use set operations to avoid duplicate coordinates in huge_may_go
hugeCoord = dict()  # keep track of the offset coordinates

def add_may_go(x, y):
    # compute coordinates only once (the first time)
    if (x, y) not in hugeCoord:
        hugeCoord[x, y] = [(cx, cy)
                           for cy in range(max(0, y-2), min(board_order, y+3))
                           for cx in range(max(0, x-2), min(board_order, x+3))
                           if cx != x or cy != y]
    # get the resulting list of coordinates using a comprehension
    fit = {(cy, cx) for cx, cy in hugeCoord[x, y] if not public_grid[cy][cx]}
    # note: this assumes huge_may_go holds (y, x) tuples rather than lists,
    # otherwise the set difference below cannot hash its elements
    fit.difference_update(huge_may_go)  # use set operations to avoid duplicates
    huge_may_go.extend(fit)
    return len(fit)
Note that if huge_may_go were a set instead of a list, adding to it without repetitions would be more efficient, because you could update it directly (and return the difference in size).
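A minimal sketch of that variant (hypothetical, assuming huge_may_go is a set of (cy, cx) tuples) would replace the end of the function above with:

    fit = {(cy, cx) for cx, cy in hugeCoord[x, y] if not public_grid[cy][cx]}
    before = len(huge_may_go)
    huge_may_go.update(fit)            # repeated coordinates are ignored by the set
    return len(huge_may_go) - before   # the difference in size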
if (i == 0 and j == 0)...: continue
A small improvement: reduce the number of iterations by not generating the offsets you would skip anyway.
for i in (1, 2):
    do stuff with i and -i
for j in (1, 2):
    do stuff with j and -j
I want to highlight 2 places which need special attention.
if (...) [cur_y,cur_x] in huge_may_go:
Unlike the rest of the conditions, this is not an arithmetic comparison but a contains check; if huge_may_go is a list, this takes O(n) time, or, put simply, time proportional to the number of elements in the list.
huge_may_go.append([cur_y,cur_x])
The Python Wiki describes the .append method of list as O(1), but with the disclaimer that "Individual actions may take surprisingly long, depending on the history of the container." You might use collections.deque as a replacement for list; it was designed with the performance of inserts (at either end) in mind.
If huge_may_go must not contain duplicates and you do not care about order, then you might use a set rather than a list and keep tuples of (y, x) in it (a set is unable to hold lists). When using the .add method of a set you can skip the contains check, as adding an existing element has no effect; consider that
s = set()
s.add((1,2))
s.add((3,4))
s.add((1,2))
print(s)
gives output
{(1, 2), (3, 4)}
If you then need a contains check, a set's contains check is O(1).
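Putting those suggestions together, a sketch of the question's function (assuming the rest of the program is adapted so that huge_may_go is a set of (y, x) tuples rather than a list of lists) might look like:

huge_may_go = set()   # now a set of (y, x) tuples

def add_may_go(x, y):
    counter = 0
    for i in range(-2, 3):
        cur_y = y + i
        if cur_y < 0 or cur_y >= board_order:
            continue
        for j in range(-2, 3):
            cur_x = x + j
            if (i == 0 and j == 0) or cur_x < 0 or cur_x >= board_order:
                continue
            if (cur_y, cur_x) not in huge_may_go and not public_grid[cur_y][cur_x]:
                huge_may_go.add((cur_y, cur_x))   # O(1) membership test and add
                counter += 1
    return counter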

For-Loop over python float array

I am working with the IRIS dataset. I have two sets of data: (1) a training set and (2) a test set. Now I want to calculate the Euclidean distance between every test-set row and every training-set row. However, I only want to include the first 4 values of each row.
A working example would be:
dist = np.linalg.norm(inner1test[0][0:4]-inner1train[0][0:4])
print(dist)
output: 3.034243
The problem is that I have 120 training-set points and 30 test-set points, so I would have to do 3,600 operations manually; thus I thought about iterating through with a for loop. Unfortunately, every one of my attempts fails.
This is my best attempt, which produces the error message shown below:
for i in inner1test:
    for number in inner1train:
        dist = np.linalg.norm(inner1test[i][0:4]-inner1train[number][0:4])
        print(dist)
IndexError: arrays used as indices must be of integer (or boolean) type
What would be the best solution to iterate through this array?
PS: I will also provide a screenshot for better visualisation.
From what I see, inner1test is a tuple of lists, so the i value will not be an index but the actual list.
You should use enumerate, which returns two variables, the index and the actual data.
for i, value in enumerate(inner1test):
    for j, number in enumerate(inner1train):
        dist = np.linalg.norm(inner1test[i][0:4]-inner1train[j][0:4])
        print(dist)
Also, if your lists get bigger, consider using a generator, which will execute your calculation iteration by iteration and return only one value at a time, avoiding returning one big chunk of results which would occupy a lot of memory.
e.g.:
def my_calculation(inner1test, inner1train):
    for i, value in enumerate(inner1test):
        for j, number in enumerate(inner1train):
            dist = np.linalg.norm(inner1test[i][0:4]-inner1train[j][0:4])
            yield dist

for i in my_calculation(inner1test, inner1train):
    print(i)
You might also want to investigate Python list comprehensions, which are sometimes a more elegant way to handle for loops over lists.
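For instance, a comprehension version of the same calculation (a sketch, assuming inner1test and inner1train are as described in the question) could be:

distances = [np.linalg.norm(testvalue[0:4] - testtrain[0:4])
             for testvalue in inner1test
             for testtrain in inner1train]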
[EDIT]
Here's a probably easier solution anyway, without the need for indexes, which won't fail when enumerating a numpy object:
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4]-testtrain[0:4])
[/EDIT]
This was the final solution with the correct output for me:
distanceslist = list()
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4]-testtrain[0:4])
        distances = (dist, testtrain[0:4])
        distanceslist.append(distances)
distanceslist

Search for the nearest array in a huge array of arrays

I need to find the closest possible sentence.
I have an array of sentences and a user sentence, and I need to find the element of the array closest to the user's sentence.
I presented each sentence in the form of a vector using word2vec:
import numpy as np

def get_avg_vector(word_list, model_w2v, size=500):
    sum_vec = np.zeros(shape=(1, size))
    count = 0
    for w in word_list:
        if w in model_w2v and w != '':
            sum_vec += model_w2v[w]
            count += 1
    if count == 0:
        return sum_vec
    else:
        return sum_vec / count + 1
As a result, the array element looks like this:
array([[ 0.93162371, 0.95618944, 0.98519795, 0.98580566, 0.96563747,
0.97070891, 0.99079191, 1.01572807, 1.00631016, 1.07349398,
1.02079309, 1.0064849 , 0.99179418, 1.02865136, 1.02610303,
1.02909719, 0.99350413, 0.97481178, 0.97980362, 0.98068508,
1.05657591, 0.97224562, 0.99778703, 0.97888296, 1.01650529,
1.0421448 , 0.98731804, 0.98349052, 0.93752996, 0.98205837,
1.05691232, 0.99914532, 1.02040555, 0.99427229, 1.01193818,
0.94922226, 0.9818139 , 1.03955 , 1.01252615, 1.01402485,
...
0.98990598, 0.99576604, 1.0903802 , 1.02493086, 0.97395976,
0.95563786, 1.00538653, 1.0036294 , 0.97220088, 1.04822631,
1.02806122, 0.95402776, 1.0048053 , 0.97677222, 0.97830801]])
I represent the user's sentence as a vector in the same way, and I compute the element closest to it like this:
%%cython
from scipy.spatial.distance import euclidean

def compute_dist(v, list_sentences):
    dist_dict = {}
    for key, val in list_sentences.items():
        dist_dict[key] = euclidean(v, val)
    return sorted(dist_dict.items(), key=lambda x: x[1])[0][0]
list_sentences in the method above is a dictionary in which the keys are text representations of the sentences and the values are the vectors.
It takes a very long time, because I have more than 60 million sentences.
How can I speed up, optimize this process?
I'll be grateful for any advice.
The initial calculation of the 60 million sentences' vectors is essentially a fixed cost you'll pay once. I'm assuming you mainly care about the time for each subsequent lookup, for a single user-supplied query sentence.
Using numpy native array operations can speed up the distance calculations over doing your own individual calculations in a Python loop. (It's able to do things in bulk using its optimized code.)
But first you'd want to replace list_sentences with a true numpy array, accessed only by array-index. (If you have other keys/texts you need to associate with each slot, you'd do that elsewhere, with some dict or list.)
Let's assume you've done that, in whatever way is natural for your data, and now have array_sentences, a 60-million by 500-dimension numpy array, with one sentence average vector per row.
Then a 1-liner way to get an array full of the distances is as the vector-length ("norm") of the difference between each of the 60 million candidates and the 1 query (which gives a 60-million entry answer with each of the differences):
dists = np.linalg.norm(array_sentences - v, axis=1)  # axis=1: one distance per row
Another 1-liner way is to use the scipy utility function cdist() for computing the distance between each pair of two collections of inputs. Here, your first collection is just the one query vector v (but if you had batches to do at once, supplying more than one query at a time could offer an additional slight speedup):
from scipy.spatial.distance import cdist
dists = cdist(np.atleast_2d(v), array_sentences)[0]
(Note that such vector comparisons often use cosine-distance/cosine-similarity rather than euclidean-distance. If you switch to that, you might be doing other norming/dot-products instead of the first option above, or use the metric='cosine' option to cdist().)
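For example, under the same assumptions as the cdist() call above, the cosine variant would just be:

dists = cdist(np.atleast_2d(v), array_sentences, metric='cosine')[0]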
Once you have all the distances in a numpy array, using a numpy-native sort option is likely to be faster than using Python's sorted(). A good choice is numpy's indirect sort, argsort(), which just returns the sorted indexes (and thus avoids moving all the vector coordinates around), since you just want to know which items are the best match(es). For example:
sorted_indexes = np.argsort(dists)
best_index = sorted_indexes[0]
If you need to turn that int index back into your other key/text, you'd use your own dict/list that remembered the slot-to-key relationships.
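A hypothetical sketch of that bookkeeping (names are illustrative, not from the original code):

# Build once, when converting the original dict into an array:
sentence_keys = list(list_sentences.keys())                 # slot index -> sentence text
array_sentences = np.vstack(list(list_sentences.values()))  # rows in the same order

# After computing best_index as above:
best_sentence = sentence_keys[best_index]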
All these still give an exactly right result, by comparing against all candidates, which (even when done optimally well) is still time-consuming.
There are ways to get faster results, based on pre-building indexes to the full set of candidates – but such indexes become very tricky in high-dimensional spaces (like your 500-dimensional space). They often trade off perfectly accurate results for faster results. (That is, what they return for 'closest 1' or 'closest N' will have some errors, but usually not be off by much.) For examples of such libraries, see Spotify's ANNOY or Facebook's FAISS.
At least if you are doing this procedure for multiple sentences, you could try using scipy.spatial.cKDTree (I don't know whether it pays for itself on a single query; also, 500 dimensions is quite high, and I seem to remember KD-trees work better with not quite as many dimensions, so you'll have to experiment).
Assuming you've put all your vectors (dict values) into one large numpy array:
>>> import numpy as np
>>> from scipy.spatial import cKDTree as KDTree
>>>
# 100,000 vectors (that's all my RAM can take)
>>> a = np.random.random((100000, 500))
>>>
>>> t = KDTree(a)
# create one new vector and find distance and index of closest
>>> t.query(np.random.random(500))
(8.20910072933986, 83407)
I can think of two possible ways of optimizing this process.
First, if your goal is only to get the closest vector (or sentence), you could get rid of the dist_dict variable and keep in memory only the closest sentence found so far. This way, you won't need to sort the complete (and presumably very large) list at the end, and can simply return the closest one.
def compute_dist(v, list_sentences):
    min_dist = float('inf')   # start with an infinite "best" distance
    closest_sentence = None
    for key, val in list_sentences.items():
        dist = euclidean(v, val)
        if dist < min_dist:
            closest_sentence = key
            min_dist = dist
    return closest_sentence
The second one is maybe a little more unsound. You can try to reimplement the euclidean method by giving it a third argument, which would be the current minimum distance min_dist between the user vector and the closest vector found so far. I don't know how the scipy euclidean method is implemented, but I guess it is close to summing squared differences along all the vectors' dimensions. What you want is for the method to stop if the sum gets higher than min_dist (the distance will be higher than min_dist anyway, so you won't keep it).
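A minimal sketch of that idea (hypothetical; it assumes v and val are flat 1-D vectors and compares against the squared min_dist, since it sums squared differences):

import numpy as np

def euclidean_with_cutoff(v, val, min_dist):
    limit = min_dist ** 2              # work with squared sums, take sqrt only at the end
    total = 0.0
    for a, b in zip(v, val):
        total += (a - b) ** 2
        if total > limit:
            return None                # already worse than the best candidate so far
    return np.sqrt(total)

Note that a per-element loop like this in pure Python may well cost more than it saves; the early-exit idea pays off mostly in compiled code such as the Cython cell the question already uses.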

Efficient Array replacement in Python

I'm wondering what the most efficient way is to replace elements in an array with other random elements in the array given some criteria. More specifically, I need to replace each element which doesn't meet a given criterion with another random value from that row. For example, I want to replace each outlier in a row of data with a random cell from data(row) whose value is between -.8 and .8. My inefficient solution looks something like this:
import numpy as np
import random as r

data = np.random.normal(0, 1, (10, 100))
for index, row in enumerate(data):
    row_copy = np.copy(row)
    outliers = np.logical_or(row > .8, row < -.8)
    for prob in np.where(outliers == 1)[0]:
        fixed = 0
        while fixed == 0:
            random_other_value = r.randint(0, 99)
            if random_other_value in np.where(outliers == 1)[0]:
                fixed = 0
            else:
                row_copy[prob] = row[random_other_value]
                fixed = 1
Obviously, this is not efficient.
I think it would be faster to pull out all the good values, then use random.choice() to pick one whenever you need it. Something like this:
import numpy as np
import random
from itertools import izip

data = np.random.normal(0, 1, (10, 100))
for row in data:
    good_ones = np.logical_and(row >= -0.8, row <= 0.8)
    good = row[good_ones]
    row_copy = np.array([x if f else random.choice(good) for f, x in izip(good_ones, row)])
High-level Python code that you write is slower than the C internals of Python. If you can push work down into the C internals it is usually faster. In other words, try to let Python do the heavy lifting for you rather than writing a lot of code. It's zen... write less code to get faster code.
I added a loop to run your code 1000 times, and to run my code 1000 times, and measured how long they took to execute. According to my test, my code is ten times faster.
Additional explanation of what this code is doing:
row_copy is being set by building a new list, and then calling np.array() on the new list to convert it to a NumPy array object. The new list is being built by a list comprehension.
The new list is made according to the rule: if the number is good, keep it; else, take a random choice from among the good values.
A list comprehension walks over a sequence of values, but to apply this rule we need two values: the number, and the flag saying whether that number is good or not. The easiest and fastest way to make a list comprehension walk along two sequences at once is to use izip() to "zip" the two sequences together. izip() will yield up tuples, one at a time, where the tuple is (f, x); f in this case is the flag saying good or not, and x is the number. (Python has a built-in feature called zip() which does pretty much the same thing, but actually builds a list of tuples; izip() just makes an iterator that yields up tuple values. But you can play with zip() at a Python prompt to learn more about how it works.)
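For example, at a Python 2 prompt (the flag and number values below are made up for illustration):

>>> from itertools import izip
>>> flags = [True, False, True]
>>> numbers = [1.0, 2.5, -0.3]
>>> list(izip(flags, numbers))
[(True, 1.0), (False, 2.5), (True, -0.3)]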
In Python we can unpack a tuple into variable names like so:
a, b = (2, 3)
In this example, we set a to 2 and b to 3. In the list comprehension we unpack the tuples from izip() into variables f and x.
Then the heart of the list comprehension is a "ternary if" statement like so:
a if flag else b
The above will return the value a if the flag value is true, and otherwise return b. The one in this list comprehension is:
x if f else random.choice(good)
This implements our rule.

Python: take max N elements from some list

Is there some function which would return me the N highest elements from some list?
I.e. if max(l) returns the single highest element, something like max(l, count=10) would return me a list of the 10 highest numbers (or fewer if l is smaller).
Or what would be an efficient, easy way to get these? (Excluding the obvious canonical implementation; also, nothing that involves sorting the whole list first, because that would be inefficient compared to the canonical solution.)
heapq.nlargest:
>>> import heapq, random
>>> heapq.nlargest(3, (random.gauss(0, 1) for _ in xrange(100)))
[1.9730767232998481, 1.9326532289091407, 1.7762926716966254]
The function in the standard library that does this is heapq.nlargest
Start with the first 10 from L, call that X. Note the minimum value of X.
Loop over L[i] for i over the rest of L.
If L[i] is greater than min(X), drop min(X) from X and insert L[i]. You may need to keep X as a sorted linked list and do an insertion. Update min(X).
At the end, you have the 10 largest values in X.
I suspect that will be O(kN) (where k is 10 here) since insertion sort is linear. Might be what gsl uses, so if you can read some C code:
http://www.gnu.org/software/gsl/manual/html_node/Selecting-the-k-smallest-or-largest-elements.html
Probably something in numpy that does this.
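A minimal sketch of that scan-and-insert idea (illustrative names; it keeps X as a sorted Python list and uses bisect.insort for the insertion step):

import bisect

def top_k(L, k=10):
    X = sorted(L[:k])                 # the k largest values seen so far, ascending
    for value in L[k:]:
        if value > X[0]:              # beats the current minimum of X
            X.pop(0)                  # drop min(X)
            bisect.insort(X, value)   # insert while keeping X sorted
    return X                          # the k largest values, in ascending order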
A fairly efficient solution is a variation of quicksort where recursion is limited to the part of the list to the right of the pivot, until the pivot's position is high enough that only the required number of elements lies to its right (with a few extra conditions to deal with border cases, of course).
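A rough sketch of that approach (illustrative only, not an in-place implementation; largest_n is a hypothetical name):

import random

def largest_n(items, n):
    # Partition repeatedly, but only descend into the side that still
    # contains the boundary between "the n largest" and "the rest".
    a = list(items)
    k = len(a) - n                                 # index where the top-n block starts
    lo, hi = 0, len(a)
    while hi - lo > 1:
        pivot = a[random.randrange(lo, hi)]
        less = [x for x in a[lo:hi] if x < pivot]
        equal = [x for x in a[lo:hi] if x == pivot]
        greater = [x for x in a[lo:hi] if x > pivot]
        a[lo:hi] = less + equal + greater
        if k < lo + len(less):
            hi = lo + len(less)                    # boundary lies in the "less" block
        elif k < lo + len(less) + len(equal):
            break                                  # boundary falls on the pivot value
        else:
            lo = lo + len(less) + len(equal)       # boundary lies in the "greater" block
    return a[k:]                                   # the n largest, in no particular order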
The standard library has heapq.nlargest, as pointed out by others here.
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Pandas nlargest documentation
