The result is a fixed number of arrays, let's say lists (all of the same length), in Python.
One could see it as a matrix too, so in C I would use an array where every cell points to another array. How should I do it in Python?
A list where every item is a list, or something else?
I thought of a dictionary, but the keys are trivial (1, 2, ..., M), so I am not sure that is the Pythonic way to go here.
I am not interested in the implementation; I am interested in which approach I should follow, in which choice I should make!
Whatever container you choose, it should contain hash-itemID pairs, and should be indexed or sorted by the hash. Unsorted arrays will not be remotely efficient.
Assuming you're using a decent-sized hash and your various hash algorithms are well implemented, you should be able to store all minhashes just as effectively in a single container. The chance of collision between a minhash from one algorithm and a minhash from another is negligible, and if any such collision occurs it won't substantially alter the similarity measure.
Using a single container as opposed to multiple reduces the memory overhead for indexing, though it also slightly increases the amount of processing required. As memory is usually the limiting factor for minhash, a single container may be preferable.
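As a rough illustration (not from the original answer; names like add_item and candidates are just placeholders), a single dict keyed by the minhash value could serve as that container:

from collections import defaultdict

index = defaultdict(set)          # minhash value -> set of item IDs

def add_item(item_id, minhashes):
    # register every minhash of this item, regardless of which hash function produced it
    for h in minhashes:
        index[h].add(item_id)

def candidates(minhashes):
    # items that share at least one minhash with the query
    found = set()
    for h in minhashes:
        found |= index[h]
    return found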
You can store anything you want in a Python list: ints, strings, more lists, dicts, objects, functions - you name it.
anything_goes_in_here = [1, 'one', lambda one: one / 1, {1: 'one'}, [1, 1]]
So storing a list of lists is pretty straightforward:
>>> list_1 = [1, 2, 3, 4]
>>> list_2 = [5, 6, 7, 8]
>>> list_3 = [9, 10, 11, 12]
>>> list_4 = [13, 14, 15, 16]
>>> main_list = [list_1, list_2, list_3, list_4]
>>> for sublist in main_list:
...     for num in sublist:
...         print num
...
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
If you are looking to store a list of lists where the index is meaningful (meaning the index gives you some information about the data stored there), then you are basically reimplementing a hashmap (dictionary); even though you say the keys are trivial, a dictionary sounds like it fits the problem well here.
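For example (purely illustrative values), a dictionary keyed by those "trivial" indices 1..M would look like:

rows = {1: [1, 2, 3, 4],
        2: [5, 6, 7, 8],
        3: [9, 10, 11, 12]}
print(rows[2])   # [5, 6, 7, 8]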
I have a list with values and a list with some given numbers:
my_list = [1, 3, 5, 6, 8, 10]
my_numbers = [2, 3, 4]
Now I want to know if the values in my_numbers exist in my_list and, if so, put the matching value(s) to the back of my_list. I can do this for example like this:
for number in my_numbers:
    if number in my_list:
        my_list.remove(number)
        my_list.append(number)
Specifics:
I can be certain that there are no duplicates in either of the lists, due to my program's setup.
The order of which the matching numbers in my_numbers are put in the back of my_list does not matter.
Question: Can I do this more efficiently performance wise?
One possible solution is this: rebuild the list my_list from two parts:
the items that are not in my_numbers
the items that are in my_numbers
Note that I would suggest to use a set for the lookup. A membership test for a set is O(1) (constant time), whereas such a test for a list is O(n), where n is the length of the list.
This means that the total runtime of the code below is O(max(m,n)), where m, n are the lengths of the lists. Your original solution was more like O(m*n), which is much slower if either of the lists is large.
my_numbers_set = set(my_numbers)
my_list = [x for x in my_list if x not in my_numbers_set] + \
          [x for x in my_list if x in my_numbers_set]
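With the question's example data, this produces (a quick illustrative run):

my_list = [1, 3, 5, 6, 8, 10]
my_numbers = [2, 3, 4]
my_numbers_set = set(my_numbers)
my_list = [x for x in my_list if x not in my_numbers_set] + \
          [x for x in my_list if x in my_numbers_set]
print(my_list)   # [1, 5, 6, 8, 10, 3]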
I have a Pandas Series (which could be a list; this is not very important) of lists containing positive and negative numbers (to simplify; the elements could also be letters or words),
such as
0 [12,-13,0,6]
1 [2,-3,8,233]
2 [0,6,8,3]
For each of these, I want to fill a row in a three-column data frame with a list of all positive values, a list of all negative values, and a list of all values that fall within some interval. Such as:
[[12,6],[-13],[0,6]]
[[2,8,233],[-3],[2,8]]
[[6,8,3],[],[6,8,3]]
What I first thought was using a list comprehension to generate a list of triadic lists of lists, which would be converted using pd.DataFrame to the right form.
This was because I don't want to loop over the list of lists 3 times, applying a new selection heuristic each time; that feels slow and dull.
But the problem is that I can't actually generate well the lists of the triad [[positive],[negative], [interval]].
I was using a syntax like
[[[positivelist.extend(number)],[negativelist], [intervalist.extend(number)]]\
for listofnumbers in listoflists for number in listofnumbers\
if number>0 else [positivelist],[negativelist.extend(number)], [intervalist.extend(number)]]
but let's be honest, this is unreadable, and anyway it doesn't do what I want, since extend returns None.
So how could I go about this without looping three times? (I could have many millions of elements in the list of lists, and in the sublists, and I might want to apply more complex formulae to these numbers too; this is just a first approach.)
I thought about using functional programming (map/lambda), but that feels unpythonic. The catch is: what in Python may help me do this right?
My guess would be something as:
newlistoflist = []
for sublist in lists:
    positive = []
    negative = []
    interval = []
    for element in sublist:
        if element > 0:
            positive.append(element)
        if element < 0:
            negative.append(element)
        if n < element < m:
            interval.append(element)
    triad = [positive, negative, interval]
    newlistoflist.append(triad)
what do you think?
You can do:
import numpy
l = [[12,-13,0,6], [2,-3,8,233], [0,6,8,3]]
l = numpy.array([x for e in l for x in e])
positive = l[l>0]
negative = l[l<0]
n,m = 1,5
interval = l[((l>n) & (l<m))]
print positive, negative, interval
Output: [ 12 6 2 8 233 6 8 3] [-13 -3] [2 3]
Edit: Triad version:
import numpy
l = numpy.array([[12,-13,0,6], [2,-3,8,233], [0,6,8,3]])
n,m = 1,5
triad = numpy.array([[e[e>0], e[e<0], e[((e>n) & (e<m))]] for e in l])
print triad
Output:
[[array([12, 6]) array([-13]) array([], dtype=int64)]
[array([ 2, 8, 233]) array([-3]) array([2])]
[array([6, 8, 3]) array([], dtype=int64) array([3])]]
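If the end goal is the three-column data frame from the question, one possible way (a sketch; the column names are my own, and plain list comprehensions are used here so each cell stays a Python list) would be:

import pandas as pd

l = [[12, -13, 0, 6], [2, -3, 8, 233], [0, 6, 8, 3]]
n, m = 1, 5

rows = [[[x for x in e if x > 0],       # positives
         [x for x in e if x < 0],       # negatives
         [x for x in e if n < x < m]]   # values inside the interval
        for e in l]
df = pd.DataFrame(rows, columns=['positive', 'negative', 'interval'])
print(df)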
While this question is formulated using the Python programming language, I believe it is more of a programming logic problem.
I have a list of all possible combinations, i.e.: n choose k
I can prepare such a list using
import itertools
bits_list = list(itertools.combinations(range(n), k))
If n is 100 and k is 5, then the length of bits_list will be 75287520.
Now, I want to prune this list, such that numbers appear in groups, or they don't. Let's use the following sets as an example:
Set 1: [0, 1, 2]
Set 2: [57, 58]
Set 3: [10, 15, 20, 25]
Set 4: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Here, each set's elements need to appear together in any given member of bits_list, or not at all.
So far, I only have been able to think of a brute-force if-else method of solving this problem, but the number of if-else conditions will be very large this way.
Here's what I have:
bits_list = [x for x in list(itertools.combinations(range(n), k))
             if all(y in x for y in [0, 1, 2]) or
                all(y not in x for y in [0, 1, 2])]
Now, this only covered Set 1. I would like to do this for many sets. If the length of the set is longer than the value of k, we can ignore the set (for example, k = 5 and Set 4).
Note that the ultimate aim is to have k iterate over a range, say [5:25], and work on the appended list. The size of the list grows exponentially here and, computationally speaking, becomes very expensive!
With k as 10, the Python interpreter interrupts the process before completion on any average laptop with 16 GB RAM. I need to find a solution that fits in the memory of a relatively modern server (not a cluster or a server farm).
Any help is greatly appreciated!
P.S.: Intuitively, think of this problem as generating all the possible cases for people boarding a public bus or train system. Usually, you board an entire group or you don't board anyone.
UPDATE:
For the given sets above, if k = 5, then a valid member of bits_list would be [0, 1, 2, 57, 58], i.e. a combination of Set1 and Set2. If k = 10, then we could have built Set1 + Set2 + Set3 + NoSetElement as a possible member. #DonkeyKong's solution made me realize I hadn't mentioned this explicitly in my question.
I have a lot of sets; I intend to use enough sets to prune the full list of combinations such that the bits_list eventually fits into memory.
#9000's suggestion is perfectly valid here, that during each iteration, I can save the combinations as actual bits.
This still gets crushed by a memory error (which I don't see how you're getting away from if you insist on a list) at a certain point (around n=90, k=5), but it is much faster than your current implementation. For n=80 and k=5, my rudimentary benchmarking had my solution at 2.6 seconds and yours around 52 seconds.
The idea is to construct the disjoint and subset parts of your filter separately. The disjoint part is trivial, and the subset part is calculated by taking the itertools.product of all disjoint combinations of length k - set_len and the individual elements of your set.
from itertools import combinations, product, chain
n = 80
k = 5
set1 = {0,1,2}
nots = set(range(n)) - set1
disj_part = list(combinations(nots, k))
subs_part = [tuple(chain(x, els)) for x, *els in
             product(combinations(nots, k - len(set1)), *([e] for e in set1))]
full_l = disj_part + subs_part
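A quick sanity check on the result (my addition, assuming the variables above): every surviving combination should either contain all of set1 or none of it.

assert all(set1 <= set(c) or set1.isdisjoint(c) for c in full_l)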
If you actually represented your bits as bits, that is, 0/1 values in a binary representation of an integer n bits long with exactly k bits set, the amount of RAM you'd need to store the data would be drastically smaller.
Also, you'd be able to use bit operations to check whether all bits in a mask are set (value & mask == mask) or all unset (value & mask == 0).
The brute-force approach will probably take less time than you'd spend thinking about a more clever algorithm, so it's totally OK for a one-off filtering.
If you must execute this often and quickly, and your n is in small hundreds or less, I'd rather use cython to describe the brute-force algorithm efficiently than look at algorithmic improvements. Modern CPUs can efficiently operate on 64-bit numbers; you won't benefit much from not comparing a part of the number.
OTOH if your n is really large, and the number of sets to compare to is also large, you could partition your bits for efficient comparison.
Let's suppose you can efficiently compare a chunk of 64 bits, and your bit lists contain e.g. 100 chunks each. Then you can do the same thing you'd do with strings: compare chunk by chunk, and if one of the chunks fails to match, do not compare the rest.
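To make the bit-representation idea concrete, here is a minimal sketch (my own illustration; combo_to_bits is a hypothetical helper, and n is assumed small enough to fit comfortably in a plain Python int):

def combo_to_bits(combo):
    # pack a combination of indices into an integer bitmask
    bits = 0
    for i in combo:
        bits |= 1 << i
    return bits

mask = combo_to_bits([0, 1, 2])            # Set 1 as a bitmask
value = combo_to_bits([0, 1, 2, 57, 58])   # one candidate combination

all_set = (value & mask) == mask    # every bit of the set is present
all_unset = (value & mask) == 0     # none of the set's bits are present
keep = all_set or all_unset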
A faster implementation would be to replace the if and all() statements in:
bits_list = [x for x in list(itertools.combinations(range(n), k))
             if all(y in x for y in [0, 1, 2]) or
                all(y not in x for y in [0, 1, 2])]
with Python's set operations isdisjoint() and issubset().
bits_generator = (set(x) for x in itertools.combinations(range(n), k))
first_set = set([0, 1, 2])
filter_bits = (x for x in bits_generator
               if first_set.issubset(x) or
                  x.isdisjoint(first_set))
answer_for_first_set = list(filter_bits)
I can keep going with generators, and with generators you won't run out of memory, but you will be waiting and hastening the heat death of the universe. That's not because of Python's runtime or other implementation details, but because some problems are just not feasible even in computer time if you pick large n and k values.
Based on the ideas from #Mitch's answer, I created a solution with a slightly different thinking than originally presented in the question. Instead of creating the list (bits_list) of all combinations and then pruning those combinations that do not match the sets listed, I built bits_list from the sets.
import itertools
all_sets = [[0, 1, 2], [3, 4, 5], [6, 7], [8], [9, 19, 29], [10, 20, 30],
            [11, 21, 31], [12, 22, 32], ...[57, 58], ... [95], [96], [97]]
bits_list = [list(itertools.chain.from_iterable(x)) for y in [1, 2, 3, 4, 5]
             for x in itertools.combinations(all_sets, y)]
Here, instead of finding n choose k, looping over all k, and finding the combinations which match the sets, I started from the sets, and even included the individual members as sets by themselves, thereby removing the need for the two components - the disjoint and the subset parts - discussed in #Mitch's answer.
I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.
Say I have three lists, A, B and C.
A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.
I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the corresponding indexes removed from B and C:
A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]
This is easy enough to do with longer code by moving everything to new lists:
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A:
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.
Zip the three lists together, uniquify based on the first element, then unzip:
from operator import itemgetter
from more_itertools import unique_everseen
abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)
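For concreteness, here is a full run with the question's data (the lowercase names stand in for the question's A, B and C; this assumes more-itertools is installed):

from operator import itemgetter
from more_itertools import unique_everseen

a = [1, 1, 2, 3, 4, 4, 5, 5, 3]
b = [1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [1, 2, 3, 4, 5, 6, 7, 8, 9]

abc_unique = unique_everseen(zip(a, b, c), key=itemgetter(0))
a, b, c = (list(t) for t in zip(*abc_unique))
print(a)   # [1, 2, 3, 4, 5]
print(b)   # [1, 3, 4, 5, 7]
print(c)   # [1, 3, 4, 5, 7]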
This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.
Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:
abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = zip(*abc_unique)
Once you get the hang of zip, the "uniquify" is the only hard part, so let me explain it.
Your existing code checks whether each element has been seen by searching for each one in new_A. Since new_A is a list, this means that if you have N elements, M of them unique, on average you're going to be doing M/2 comparisons for each of those N elements. Plug in some big numbers, and NM/2 gets pretty big—e.g., 1 million values, a half of them unique, and you're doing 250 billion comparisons.
To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.
If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:
new_A_set = set()
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A_set:
        new_A_set.add(A[i])
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But this can be generalized—and should be, especially if you're planning to expand from 3 lists to a whole lot of them.
The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself, or pip install more-itertools and use someone else's implementation (as I did above).
PadraicCunningham asks:
how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?
If there are N elements, M unique, it's O(N) time and O(M) space.
In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work that's not obviously trivial inside the loop is key in seen and seen.add(key), and since both operations are amortized constant time for set, that means the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions are about 278ms and 297ms (I forget which is which) compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so—but it's hard to imagine a case where you'd need that, but wouldn't benefit from running it in PyPy instead of CPython, or writing it in Cython or C, or numpy-izing it, or getting a faster computer, or parallelizing it.
As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get indices, then doing the same thing mu 無's answer does:
indices = set(zip(*unique_everseen(enumerate(a), key=itemgetter(1)))[0])
a = [a[index] for index in indices]
b = [b[index] for index in indices]
c = [c[index] for index in indices]
But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.
If you really need to save space, you can mutate all three lists in-place. But this is a lot more complicated, and a bit slower (although still linear*).
Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think it's doable in 2M space, but it's not too hard in 3M:
indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1)))
indices = set(indices)
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]
* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it—in fact, at N=100000, M=10000 it's twice as fast as the immutable version… But if N might be too big, you have to instead replace each duplicate element with a sentinel, then loop over the list in a second pass so you can shift each element only once, which is instead 50% slower than the immutable version.
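A rough sketch of that sentinel approach (my own reading of the footnote, not the answer's actual code): mark duplicates with a unique marker in the first pass, then compact each list in place in a second pass so every surviving element is shifted at most once.

_SENTINEL = object()   # a marker guaranteed not to appear in the data

def dedupe_inplace(a, b, c):
    seen = set()
    for i, value in enumerate(a):
        if value in seen:
            a[i] = b[i] = c[i] = _SENTINEL
        else:
            seen.add(value)
    # second pass: shift surviving elements left, one move each, then truncate
    for lst in (a, b, c):
        write = 0
        for read in range(len(lst)):
            if lst[read] is not _SENTINEL:
                lst[write] = lst[read]
                write += 1
        del lst[write:]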
How about this: basically, get a set of all unique elements of A, then get their indices, and create new lists based on those indices.
new_A = list(set(A))
indices_to_copy = [A.index(element) for element in new_A]
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]
You can write a function for the second statement, for reuse:
def get_new_list(original_list, indices):
    return [original_list[idx] for idx in indices]
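Usage would then look like this (illustrative only; note that list(set(A)) does not preserve the original order of A, so the rebuilt lists may come out in a different order than the originals):

new_B = get_new_list(B, indices_to_copy)
new_C = get_new_list(C, indices_to_copy)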
If I have the following Python code
>>> x = []
>>> x = x + [1]
>>> x = x + [2]
>>> x = x + [3]
>>> x
[1, 2, 3]
Will x be guaranteed to always be [1,2,3], or are other orderings of the interim elements possible?
Yes, the order of elements in a python list is persistent.
In short, yes, the order is preserved. In long:
In general the following definitions will always apply to objects like lists:
A list is a collection of elements that can contain duplicate elements and has a defined order that generally does not change unless explicitly made to do so. Stacks and queues are both types of lists that provide specific (often limited) behavior for adding and removing elements (stacks being LIFO, queues being FIFO). Lists are practical representations of, well, lists of things. A string can be thought of as a list of characters, as the order is important ("abc" != "bca") and duplicates in the content of the string are certainly permitted ("aaa" can exist and != "a").
A set is a collection of elements that cannot contain duplicates and has a non-definite order that may or may not change over time. Sets do not represent lists of things so much as they describe the extent of a certain selection of things. The internal structure of set, how its elements are stored relative to each other, is usually not meant to convey useful information. In some implementations, sets are always internally sorted; in others the ordering is simply undefined (usually depending on a hash function).
Collection is a generic term referring to any object used to store a (usually variable) number of other objects. Both lists and sets are a type of collection. Tuples and Arrays are normally not considered to be collections. Some languages consider maps (containers that describe associations between different objects) to be a type of collection as well.
This naming scheme holds true for all programming languages that I know of, including Python, C++, Java, C#, and Lisp (in which lists not keeping their order would be particularly catastrophic). If anyone knows of any where this is not the case, please just say so and I'll edit my answer. Note that specific implementations may use other names for these objects, such as vector in C++ and flex in ALGOL 68 (both lists; flex is technically just a re-sizable array).
If there is any confusion left in your case due to the specifics of how the + sign works here, just know that order is important for lists and unless there is very good reason to believe otherwise you can pretty much always safely assume that list operations preserve order. In this case, the + sign behaves much like it does for strings (which are really just lists of characters anyway): it takes the content of a list and places it behind the content of another.
If we have
list1 = [0, 1, 2, 3, 4]
list2 = [5, 6, 7, 8, 9]
Then
list1 + list2
Is the same as
[0, 1, 2, 3, 4] + [5, 6, 7, 8, 9]
Which evaluates to
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Much like
"abdcde" + "fghijk"
Produces
"abdcdefghijk"
You are confusing 'sets' and 'lists'. A set does not guarantee order, but lists do.
Sets are declared using curly brackets: {} (note that {} by itself actually creates an empty dict; a non-empty set literal such as {1, 2, 3} uses the same braces). In contrast, lists are declared using square brackets: [].
mySet = {'a', 'b', 'c', 'c'}
does not guarantee order, but a list does:
myList = ['a', 'b', 'c']
I suppose one thing that may be concerning you is whether or not the entries could change, so that the 2 becomes a different number, for instance. You can put your mind at ease here, because in Python, integers are immutable, meaning they cannot change after they are created.
Not everything in Python is immutable, though. For example, lists are mutable: they can change after being created. So for example, if you had a list of lists
>>> a = [[1], [2], [3]]
>>> a[0].append(7)
>>> a
[[1, 7], [2], [3]]
Here, I changed the first entry of a (I added 7 to it). One could imagine shuffling things around, and getting unexpected things here if you are not careful (and indeed, this does happen to everyone when they start programming in Python in some way or another; just search this site for "modifying a list while looping through it" to see dozens of examples).
It's also worth pointing out that x = x + [a] and x.append(a) are not the same thing. The second one mutates x, and the first one creates a new list and assigns it to x. To see the difference, try setting y = x before adding anything to x and trying each one, and look at the difference the two make to y.
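A small illustration of that difference (values are arbitrary):

x = [1, 2, 3]
y = x             # y refers to the same list object as x

x.append(4)       # mutates the shared list
print(y)          # [1, 2, 3, 4] - y sees the change

x = x + [5]       # builds a new list and rebinds x to it
print(y)          # [1, 2, 3, 4] - y still refers to the old list
print(x)          # [1, 2, 3, 4, 5]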
Yes, the list will remain as [1, 2, 3] unless you perform some other operation on it. For example, modifying the list while you are looping over it can give surprising results:
aList = [1, 2, 3]
i = 0
for item in aList:
    if i < 2:
        aList.remove(item)
    i += 1
aList
[2]
The moral is: when modifying a list in a loop driven by that list, take two steps:
aList = [1, 2, 3]
i = 0
for item in aList:
    if i < 2:
        aList[i] = "del"
    i += 1
aList
['del', 'del', 3]
for i in range(2):
    del aList[0]
aList
[3]
Yes, lists and tuples are always ordered, while dictionaries are not (although as of Python 3.7, regular dicts do preserve insertion order).