I'm trying to find all possible sub-intervals of np.linspace(0,n,n*10+1)
where the sub-interval width is greater than some threshold (say width=0.5),
so I tried this using itertools:
import itertools
import numpy as np

ranges = np.linspace(0, n, n*10+1)
# find all combinations
combinations = list(itertools.combinations(ranges, 2))
# use a for-loop to calculate the width of each interval
# and append it to a new list if the width is greater than 0.5
save = []
for i in range(len(combinations)):
    if combinations[i][1] - combinations[i][0] > 0.5:
        save.append(combinations[i])
But this takes too long, especially as n gets bigger, and it also uses a huge amount of RAM.
So I'm wondering whether I can make this faster, or apply the constraint while collecting the combinations.
itertools.combinations(...) returns an iterator, which means the returned object produces its values on demand instead of computing everything at once and storing the result in memory. Converting it to a list forces immediate computation and storage, which is unnecessary here. Simply iterate over the combinations object instead of building a list and iterating over its indexes (which should not be done anyway):
import itertools
import numpy as np

ranges = np.linspace(0, n, n*10+1)  # alternatively 'range(100)' or so, to test
combinations = itertools.combinations(ranges, 2)

save = []
for c in combinations:
    if c[1] - c[0] > 0.5:
        save.append(c)
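If you do need all qualifying pairs materialized, a vectorized NumPy sketch avoids the Python-level loop entirely. Note that it still builds O(N^2) index arrays, so it trades memory for speed; n and the 0.5 threshold below are just the example values from the question:

import numpy as np

n = 10  # example value
ranges = np.linspace(0, n, n * 10 + 1)

# All index pairs (i, j) with j > i, i.e. the upper triangle of the pair matrix.
i, j = np.triu_indices(len(ranges), k=1)

# Keep only the pairs whose interval width exceeds the threshold.
mask = ranges[j] - ranges[i] > 0.5
save = np.column_stack((ranges[i][mask], ranges[j][mask]))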
I created a function max_points that compares two argument strings and returns a score based on a separately given criterion involving the values ga, la, ldif, and lgap. It also returns the list of combinations of the strings that achieve this score. The strings s and t are each run through their respective anagrams with up to n gaps (in this case, the gap character is '_'). Here is an example of what the function should return:
In [3]: max_points('AT_', 'A_T', 2, 5, 1, 0, 2)
Out[3]: (16, [['_A_T_', '_A_T_'],
              ['A__T_', 'A__T_'],
              ['A_T__', 'A_T__']])
The code I have right now is this:
def max_points(s, t, ga, la, ldif, lgap, n=1):
    lst_s = generate_n_gaps(s, n)
    lst_t = generate_n_gaps(t, n)
    point_max = -9999
    for i in lst_s:
        for j in lst_t:
            if len(i) == len(j):
                point = pointage(i, j, ga, la, ldif, lgap)
                if point >= point_max:
                    point_max = point
    ultimate = []
    for i in lst_s:
        for j in lst_t:
            if len(i) == len(j) and pointage(i, j, ga, la, ldif, lgap) == point_max:
                specific = []
                specific.append(i)
                specific.append(j)
                ultimate.append(specific)
    return point_max, ultimate
The other functions, generate_n_gaps and pointage (not shown), work as follows:
generate_n_gaps: returns a list of all combinations of the argument string with up to n gaps.
pointage: compares just the two argument strings s and t (not all their combinations) and returns an integer score based on the same criterion as max_points.
You can see that, if the lengths of the argument strings s and t are larger than 4 or 5 and n is larger than 2, the function ends up producing quite a large number of lists. I suspect that is why it takes longer than 2 or 3 seconds for some inputs. Is there any way I can make the code for this specific function faster (<1 sec of runtime)? Or might the problem lie in the other, unshown functions?
One obvious issue here is that you're looping through all i,j combinations twice: once to calculate the maximum value, and then a second time to return all (i,j) combinations that achieve this maximum.
It would probably be more efficient to do this in a single pass. Something like:
point_max = -9999  # or better yet, -math.inf
ultimate = []
for i in lst_s:
    for j in lst_t:
        if len(i) == len(j):
            point = pointage(i, j, ga, la, ldif, lgap)
            if point > point_max:
                point_max = point
                ultimate = []
            if point == point_max:
                specific = []
                specific.append(i)
                specific.append(j)
                ultimate.append(specific)
This should approximately halve your run-time.
If i and j have many different possible lengths, you might also be able to achieve savings by blocking up the comparisons. Instead of simply looping through lst_s and lst_t, split these lists up by length (use a dict structure keyed by length, with each value being the subset of lst_s or lst_t having that length). Then iterate through all possible lengths, checking only the s- and t-values of that length against one another. This is a bit more work to set up, but may be useful depending on how many comparisons it saves you.
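A rough sketch of that blocking idea, reusing the names from your code (defaultdict is just one convenient way to build the length buckets):

from collections import defaultdict

# Bucket both lists by string length so only equal-length pairs are compared.
by_len_s = defaultdict(list)
for i in lst_s:
    by_len_s[len(i)].append(i)

by_len_t = defaultdict(list)
for j in lst_t:
    by_len_t[len(j)].append(j)

# Only lengths present in both buckets can produce comparable pairs.
for length in by_len_s.keys() & by_len_t.keys():
    for i in by_len_s[length]:
        for j in by_len_t[length]:
            point = pointage(i, j, ga, la, ldif, lgap)
            # ... same max/collect logic as above ...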
You haven't included the code for pointage, but I would be looking hard at that to see if there are any possible savings there; you're going to be calling it a lot, so you want to make it as efficient as possible.
More advanced options include parallelisation, and making use of specific information about the "score" function to do more precise blocking of your score calls. But try the simple stuff first and see if that does the job.
For example, suppose I have an (n,2)-shaped tensor t whose elements all come from a set S of random integers. I want to build another tensor d of shape (m,2) where the individual elements of each tuple are from S, but the tuples themselves do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
     [7,3],
     [3,1]]

d = some_algorithm(S, t)
# e.g.
# d = [[2,1],
#      [3,2],
#      [7,4]]
What is the most efficient way to do this in Python? Preferably with PyTorch or NumPy, but I can adapt general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, it rarely results in a full (m,2) array). I feel like there has to be some fancy tensor trick for this, or maybe I could build a large hash map of the values in t so that membership checks are O(1), but that just moves the same problem into memory. Is there a more efficient way?
An approximate solution is also okay.
My naive attempt would be a base-transformation function to reduce the problem to an integer-set problem.
Definitions and assumptions:

let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of an element x in S
let x[n] be the n-th tuple member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
Since S is a set, we can write each element of t as an M-digit number in base L, using the indices of the elements of S as digits.
For M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
Since you can now represent t as a set of integers (read: hash table), calculating the inverse set d becomes rather trivial:
loop from 0 to L^M - 1 and ask your hash table whether the element is already in t or d.
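For concreteness, here is a small sketch of this encoding in Python for M=2; the names index, f, f_inv, and seen are mine, not from the question:

# Map each value of S to its digit index once.
S = [0, 1, 2, 3, 7]
t = [[0, 1], [7, 3], [3, 1]]
L = len(S)
index = {v: k for k, v in enumerate(S)}

def f(pair):
    # Encode a 2-tuple as a base-L integer: I(x[1])*L + I(x[0]).
    return index[pair[1]] * L + index[pair[0]]

def f_inv(x):
    # Decode: x mod L is the least significant digit, x // L the next one.
    return (S[x % L], S[x // L])

# t as an integer hash set; the inverse set d is everything not in it.
seen = {f(tuple(row)) for row in t}
d = [f_inv(x) for x in range(L ** 2) if x not in seen]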
If the size of S is too big, you can also just draw random numbers and check them against the hash table, yielding a subset of the inverse of t.
Does this help you?
If |t| + |d| << |S|^2, then the probability of a random tuple being drawn again (in a single draw) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C < 1, and you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that by redrawing until you hit a new element each time, you need O((1/(1-C)) * |d|) draws on average to produce d, which is O(|d|) if C is indeed a constant.
Checking whether an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d, as in the sketch after this list. This requires extra space, but each lookup takes constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; this will make some errors, claiming an element is already "seen" when it is not, but never the other way around, so you will still get only unique elements in d.
Sorting t in place and using binary search. This adds O(|t|log|t|) pre-processing and O(log|t|) per lookup, but requires no additional space (other than where you store d).
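A minimal sketch of the redraw loop with the hash-set variant, assuming S is indexable (e.g. a list) and the target count m is given:

import random

# Everything forbidden or already produced lives in one hash set.
seen = {tuple(row) for row in t}
d = []
while len(d) < m:
    candidate = (random.choice(S), random.choice(S))
    if candidate not in seen:  # expected ~1/(1-C) draws per new element
        seen.add(candidate)
        d.append(candidate)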
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2)-time solution could be to run a Fisher-Yates shuffle on the available choices and take the first |d| elements that do not appear in t.
I'm looking for a better, faster way to center a couple of lists. Right now I have the following:
import random

m = range(2000)
sm = sorted(random.sample(range(100000), 16000))
si = random.sample(range(16005), 16000)

# Centered array.
smm = []
print(sm)
print(si)
for i in m:
    if i in sm:
        smm.append(si[sm.index(i)])
    else:
        smm.append(None)

print(m)
print(smm)
This in effect creates a list (m) containing a range of numbers to center against, another list (sm) against which m is centered, and a list of values (si) to append.
This sample runs fairly quickly, but when I run a larger task with many more variables, performance slows to a standstill.
Your main loop contains this infamous line:
if i in sm:
It seems harmless, but since sm is the result of sorted, it is a list, hence O(n) lookup, which explains why it's slow on a big dataset.
Moreover you're using the even more infamous si[sm.index(i)], which makes your algorithm O(n**2).
Since you need the indexes, using a set is not so easy, but there's something better:
Since sm is sorted, you could use bisect to find the index in O(log(n)), like this:
import bisect

for i in m:
    j = bisect.bisect_left(sm, i)
    smm.append(si[j] if (j < len(sm) and sm[j] == i) else None)
A small explanation: bisect gives you the insertion point of i in sm. That doesn't mean the value is actually in the list, so we have to check (first that the returned index is within the list's range, then that the value at that index is the one we searched for); if so, append si[j], else append None.
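If memory allows, an alternative sketch is to precompute a value-to-index dict once, making each lookup O(1); this doesn't even require sm to be sorted:

# Build the lookup table once: O(n) time and space.
pos = {v: k for k, v in enumerate(sm)}
smm = [si[pos[i]] if i in pos else None for i in m]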
I have a 2-D list of shape (300,000, X), where each of the sublists has a different size. In order to convert the data to a Tensor, all of the sublists need to have equal length, but I don't want to lose any data from my sublists in the conversion.
That means that I need to fill all sublists smaller than the longest sublist with filler (-1) in order to create a rectangular array. For my current dataset, the longest sublist is of length 5037.
My conversion code is below:
for seq in new_format:
    for i in range(0, length - len(seq)):
        seq.append(-1)
However, when there are 300,000 sequences in new_format, and length-len(seq) is generally >4000, the process is extraordinarily slow. How can I speed this process up or get around the issue efficiently?
Individual append calls can be rather slow, so use list multiplication to create the whole filler list at once, then concatenate it in one go, e.g.:
for seq in new_format:
    seq += [-1] * (length - len(seq))
seq.extend([-1] * (length-len(seq))) would be equivalent (trivially slower due to generalized method call approach, but likely unnoticeable given size of real work).
In theory, seq.extend(itertools.repeat(-1, length-len(seq))) would avoid the potentially large temporaries, but IIRC, the actual CPython implementation of list.__iadd__/list.extend forces the creation of a temporary list anyway (to handle the case where the generator is defined in terms of the list being extended), so it wouldn't actually avoid the temporary.
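If the end goal is a tensor anyway, a NumPy-based sketch may be faster still, assuming every entry of new_format is a plain list of numbers: allocate the full rectangular array of filler once, then copy each row in.

import numpy as np

# One big allocation of the filler value, then row-wise copies.
padded = np.full((len(new_format), length), -1, dtype=np.int64)
for row, seq in zip(padded, new_format):
    row[:len(seq)] = seq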
If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.
For example, if I have the following sets:
A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}
I want to be able to find:
intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}
Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:
intersections_C = {"a": {"A", "C"},
                   "c": {"A", "B", "C"},
                   "e": {"B", "C"}}
I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs, and then iterate through those values to build maps such as intersections_C above, but I'm not sure how that scales as n increases and the sets become very large.
Some additional background information:
Each of the sets is roughly the same size, but they are all very large (large enough that storing them all in memory is a realistic concern; an algorithm that avoids that would be preferred, though it is not necessary)
The size of the intersections between any two sets is very small compared to the size of the sets themselves
If it helps, we can assume anything we need to about the ordering of the input sets.
This ought to do what you want.
import random as RND
import string
import itertools as IT
Mock some data:
fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for c in range(5)]
Generate an index list for the sets in S, so the sets can be referenced more concisely below:
idx = range(len(S))
Get all possible unique pairs of the items in S; since set intersection is commutative, we want combinations rather than permutations:
pairs = IT.combinations(idx, 2)
Write a function to perform the set intersection:
nt = lambda a, b: S[a].intersection(S[b])
Fold this function over the pairs, keying the result of each call to its arguments:
res = dict([ (t, nt(*t)) for t in pairs ])
The result, formatted per the first option given in the question, is a dictionary in which the values are the set intersections of two sequences, each keyed to a tuple of the two indices of those sequences.
This solution is really just two lines of code: (i) calculate the combinations; (ii) apply some function over each pair, storing the returned value in a key-value container.
The memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, i.e.,
res = ( (t, nt(*t)) for t in pairs )
Notice that with this approach, neither the sequence of pairs nor the corresponding intersections is ever fully written out in memory; i.e., both pairs and res are iterators.
If we can assume that the input sets are ordered, a pseudo-mergesort approach seems promising: treat each set as a sorted stream, advance the streams in parallel, always advancing only those whose current value is the lowest among all iterators. Compare each current value with the new minimum every time an iterator is advanced, and collect the matches into your same-item collections, as in the sketch below.
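A sketch of that idea, assuming each input is available as a sorted list of unique items and using heapq to track the current head of each stream; the function name and the item-to-owners output format are my choices, matching the second format in the question:

import heapq

def pairwise_intersections_sorted(named_lists):
    """Map each item found in >= 2 of the sorted input lists to the
    set of list names that contain it."""
    heap = []  # entries: (current value, list name, iterator)
    for name, seq in named_lists.items():
        it = iter(seq)
        first = next(it, None)
        if first is not None:
            heap.append((first, name, it))
    heapq.heapify(heap)

    result = {}
    while heap:
        value, name, it = heapq.heappop(heap)
        owners = [name]
        # Pull every stream whose current head equals this minimum.
        while heap and heap[0][0] == value:
            _, other, other_it = heapq.heappop(heap)
            owners.append(other)
            nxt = next(other_it, None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, other, other_it))
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, name, it))
        if len(owners) > 1:
            result[value] = set(owners)
    return result

# Example with the question's sets A, B, C:
print(pairwise_intersections_sorted(
    {"A": ["a", "b", "c"], "B": ["c", "d", "e"], "C": ["a", "c", "e"]}))
# {'a': {'A', 'C'}, 'c': {'A', 'B', 'C'}, 'e': {'B', 'C'}}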
How about using the intersection method of set? See below:
A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}
intersect_AB = A.intersection(B)
intersect_BC = B.intersection(C)
intersect_AC = A.intersection(C)
print(intersect_AB, intersect_BC, intersect_AC)