Compare DB row values efficiently - python

I want to loop through a database of documents and calculate a pairwise comparison score.
A simplistic, naive method would nest one loop inside another. That would compare each pair of documents twice and also compare each document to itself.
Is there a name for an algorithm or approach that does this task efficiently?
Thanks.

Assume all items have a number, ItemNumber.
Simple solution -- always require the 2nd item's ItemNumber to be greater than the first item's.
e.g.
for (firstitem = 1 to maxitemnumber)
    for (seconditem = firstitem + 1 to maxitemnumber)
        compare(firstitem, seconditem)
Visual note: if you think of the comparisons as a matrix (the item number of one item on one axis, the item number of the other on the other axis), this looks at one of the triangles.
........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.

I don't think it's complicated enough to qualify for a name.
You can avoid duplicate pairs just by forcing an inequality on any value which differs between rows - the primary key is an obvious choice, e.g.
Unique pairings:
SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id
Potentially there are a lot of ways in which the comparison operation can be used to generate data summaries and therefore identify potentially similar items - for single words, soundex is an obvious choice - however, you don't say what your comparison metric is.
C.

You can keep track of which documents you have already compared, e.g. (with numbers ;))
compared = set()
for i in [1, 2, 3]:
    for j in [1, 2, 3]:
        pair = frozenset((i, j))
        if i != j and pair not in compared:
            compared.add(pair)
            compare(i, j)
Another idea would be to create the combinations of documents first and iterate over them. But in order to generate them, you have to iterate over both lists, and then you iterate over the result list again, so I don't think it has any advantage.
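For reference, the standard library can produce exactly these pairs lazily with itertools.combinations, so no intermediate result list is needed. A minimal sketch; compare here is just a placeholder:

import itertools

def compare(x, y):
    # placeholder for whatever comparison metric you actually use
    print(x, y)

docs = [1, 2, 3]   # stand-ins for the documents
for x, y in itertools.combinations(docs, 2):
    compare(x, y)  # visits (1, 2), (1, 3), (2, 3): each unordered pair once, no self-pairs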
Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:
docs = [1,2,3]
l = len(docs)
for i in range(l):
    for j in range(i + 1, l):
        compare(docs[i], docs[j])

Something like this?
src = [1,2,3]
for i, x in enumerate(src):
    for y in src[i + 1:]:
        compare(x, y)
Or you might wish to generate a list of pairs instead:
pairs = [(x, y) for i, x in enumerate(src) for y in src[i + 1:]]

Related

Python: How to insert into a nested list via iteration at a variable index position?

I've been banging my head over this one for a while, so hopefully you can help me! So here is what I have:
grouped_list = [[["0","1","1","1"],["1","0","1","1"]],[["1","1","0","1","1","1"]],[["1","1","1","0","1"]]]
index_list = [[2,3],[],[4]]
and I want to insert a "-" into the sublists of grouped_list at the corresponding index positions indicated in the index_list. The result would look like:
[[["0","1","-","-","1","1"]["1","0","-","-","1","1"]][["1","1","0","1","1","1"]][["1","1","1","0","-","1"]]]
And since I'm new to python, here is my laughable attempt at this:
for groups in grouped_list:
    for columns in groups:
        [[columns[i:i] = ["-"] for i in index] for index in index_list]
I get a syntax error, pointing at the = in the list comprehension, but I didn't think it would really work to start. I would prefer not to do this manually, because I'm dealing with rather large datasets, so some sort of iteration would be nice! Do I need to use numpy or pandas for something like this? Could this be solved with clever use of zipping? Any help is greatly appreciated!
I am sadly unable to make this a one liner:
def func(x, il):
    for i in il:
        x.insert(i, '-')
    return x

s = [[func(l, il) for l in ll] for (ll, il) in zip(grouped_list, index_list)]
I think what you want is
for k, groups in enumerate(grouped_list):
    for columns in groups:
        for i in sorted(index_list[k], reverse=True):
            columns.insert(i, "-")
Here, I iterate over the grouped lists and keep the index k to pick the matching entry from index_list. The sublists are modified in place with list.insert. Note that this only works when the indices are applied from the largest to the smallest, since otherwise the earlier insertions shift the later positions. This is why I use sorted with reverse=True in the loop.

Collapse list of lists to eliminate redundancy

I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:
>>>list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]...]
>>>remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7]...]
So lists that share elements would be collapsed into single lists. Collapsing them is easy: once I find lists to combine, I can turn them into sets and take their union. But I'm not sure how to compare the lists. Do I need a series of for loops?
My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists, if yes, merge the lists and then start over, but that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups but my data isn't structured. Also, my actual data is a series of strings and thus not sortable in any meaningful sense.
I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?
I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.
Maybe the missing bit of knowledge was d & g (also written d.intersection(g)) for finding the set intersection, along with the fact that an empty set is "falsey" in Python.
data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]

result = []
for d in data:
    d = set(d)

    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]

print(result)
# [set([5, 6, 7]), set([1, 2, 3, 4, 8, 9, 10])]
We start with no groups at all (result = []). Then we take the first list from the data. We then check which of the existing groups intersect this list and which don't. Then we merge all of these matching groups along with the list (achieved by starting with matched = [d]). We don't touch the non-matching groups (though maybe some of these will end up being merged in a later iteration). If you add a line print(result) in each loop you should be able to see how it's built up.
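For example, printing result at the end of each pass over the sample data shows the groups building up like this:

# after [1,2,3]:    [set([1, 2, 3])]
# after [3,4]:      [set([1, 2, 3, 4])]
# after [5,6,7]:    [set([1, 2, 3, 4]), set([5, 6, 7])]
# after [1,8,9,10]: [set([5, 6, 7]), set([1, 2, 3, 4, 8, 9, 10])]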
The union of all the sets in matched is computed by set().union(*matched). For reference:
Pythonic Way to Create Union of All Values Contained in Multiple Lists
What does the Star operator mean?
I assume that you want to merge lists that contain any common element.
Here is a function that checks efficiently (to the best of my knowledge) whether two lists contain at least one common element (according to the == operator):
import functools  # python 2.5+

def seematch(X, Y):
    return functools.reduce(lambda x, y: x | y,
                            functools.reduce(lambda x, y: x + y,
                                             [[k == l for k in X] for l in Y]))
It would be even faster if you used a reduce that can be interrupted when it finds "true", as described here:
Stopping a Reduce() operation mid way. Functional way of doing partial running sum
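For illustration, here is a short-circuiting variant sketched with any() rather than an interruptible reduce (not part of the original answer, just one way to stop at the first match):

def seematch_any(X, Y):
    # stops as soon as one common element is found, instead of building the full matrix
    return any(k == l for k in X for l in Y)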
I was trying to find an elegant way to iterate quickly once that is in place, but I think a good approach would simply be to loop once and build another container that holds the "merged" lists: you loop once over the lists in the original list, and for each of them over the new lists already created in the proxy container.
Having said that - it seems there might be a much better option - see if you can do away with that redundancy by some sort of book-keeping on the previous steps.
I know this is an incomplete answer - hope that helped anyway!

Is it possible to pass a list as argument of del?

Good morning everybody,
my simple question is the following: I have 2 lists (let's call them a and b) of length T and I want to eliminate K random elements (with the same index) from each of them.
Let's suppose for the moment K << T, in order to neglect the probability of extracting the same index twice or more. Can I simply generate a list aleaindex of K random numbers and pass it to del, like
for i in range(K):
    aleaindex.append(random.randint(0, T-1))

del a[aleaindex]
del b[aleaindex]
And is there some Python trick to do this more efficiently?
Thank you very much in advance!
No, there is no way to do this.
The reason for this is that del deletes a name - if there is still another name attached to the object, it will continue to exist. The object itself is untouched.
When you store objects in a list, they do not have names attached, just indices.
This means that when you have a list of objects, Python doesn't know the names that refer to those objects (if there are any), so it can't delete them. It can, at most, remove them from that particular list.
The best solution is to make a new list that doesn't contain the values you don't want. This can be achieved with a list comprehension:
new_a = [v for i, v in enumerate(a) if i not in aleaindex]
You can always assign this back to a if you need to modify the list (a[:] = ...).
Note that it would also make more sense to make aleaindex a set, as it would make this operation faster, and the order doesn't matter:
aleaindex = {random.randint(0, T-1) for _ in range(K)}
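Putting the pieces together, a minimal end-to-end sketch (the sizes and list contents are made up for illustration):

import random

T, K = 10, 3
a = list(range(100, 100 + T))
b = list(range(200, 200 + T))

# indices to drop; this may yield fewer than K distinct indices if randint repeats,
# which random.sample(range(T), K) would avoid
aleaindex = {random.randint(0, T - 1) for _ in range(K)}

a = [v for i, v in enumerate(a) if i not in aleaindex]
b = [v for i, v in enumerate(b) if i not in aleaindex]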

Finding the most similar numbers across multiple lists in Python

In Python, I have 3 lists of floating-point numbers (angles), in the range 0-360, and the lists are not the same length. I need to find the triplet (with 1 number from each list) in which the numbers are the closest. (It's highly unlikely that any of the numbers will be identical, since this is real-world data.) I was thinking of using a simple lowest-standard-deviation method to measure agreement, but I'm not sure of a good way to implement this. I could loop through each list, comparing the standard deviation of every possible combination using nested for loops, and have a temporary variable save the indices of the triplet that agrees the best, but I was wondering if anyone had a better or more elegant way to do something like this. Thanks!
I wouldn't be surprised if there is an established algorithm for doing this, and if so, you should use it. But I don't know of one, so I'm going to speculate a little.
If I had to do it, the first thing I would try would be just to loop through all possible combinations of all the numbers and see how long it takes. If your data set is small enough, it's not worth the time to invent a clever algorithm. To demonstrate the setup, I'll include the sample code:
# setup
import itertools
from statistics import variance   # one possible metric for the example below (Python 3.4+)

def distance(nplet):
    '''Takes a pair or triplet (an "n-plet") as a list, and returns its distance.
    A smaller return value means better agreement.'''
    # your choice of implementation here. Example:
    return variance(nplet)

# algorithm
def brute_force(*lists):
    return min(itertools.product(*lists), key=distance)
For a large data set, I would try something like this: first create one triplet for each number in the first list, with its first entry set to that number. Then go through this list of partially-filled triplets and for each one, pick the number from the second list that is closest to the number from the first list and set that as the second member of the triplet. Then go through the list of triplets and for each one, pick the number from the third list that is closest to the first two numbers (as measured by your agreement metric). Finally, take the best of the bunch. This sample code demonstrates how you could try to keep the runtime linear in the length of the lists.
def item_selection(listA, listB, listC):
    # make the list of partially-filled triplets
    triplets = [[a] for a in listA]
    iT = 0
    iB = 0
    while iT < len(triplets):
        # advance iB to the first value in listB that is not below triplets[iT][0]
        while iB < len(listB) and listB[iB] < triplets[iT][0]:
            iB += 1
        if iB == 0:
            triplets[iT].append(listB[0])
        elif iB == len(listB):
            triplets[iT].append(listB[-1])
        else:
            # look at the values in listB just below and just above triplets[iT][0]
            # and add the closer one as the second member of the triplet
            dist_lower = distance([triplets[iT][0], listB[iB - 1]])
            dist_upper = distance([triplets[iT][0], listB[iB]])
            if dist_lower < dist_upper:
                triplets[iT].append(listB[iB - 1])
            elif dist_lower > dist_upper:
                triplets[iT].append(listB[iB])
            else:
                # if they are equidistant, add both (as two separate partial triplets)
                triplets[iT].append(listB[iB - 1])
                iT += 1
                triplets[iT:iT] = [[triplets[iT - 1][0], listB[iB]]]
        iT += 1
    # then another loop while iT < len(triplets) to add in the numbers from listC
    return min(triplets, key=distance)
The thing is, I can imagine situations where this wouldn't actually find the best triplet, for instance if a number from the first list is close to one from the second list but not at all close to anything in the third list. So something you could try is to run this algorithm for all 6 possible orderings of the lists. I can't think of a specific situation where that would fail to find the best triplet, but one might still exist. In any case the algorithm will still be O(N) if you use a clever implementation, assuming the lists are sorted.
def symmetrized_item_selection(listA, listB, listC):
    best_results = []
    for ordering in itertools.permutations([listA, listB, listC]):
        best_results.append(item_selection(*ordering))
    return min(best_results, key=distance)
Another option might be to compute all possible pairs of numbers between list 1 and list 2, between list 1 and list 3, and between list 2 and list 3. Then sort all three lists of pairs together, from best to worst agreement between the two numbers. Starting with the closest pair, go through the list pair by pair and any time you encounter a pair which shares a number with one you've already seen, merge them into a triplet. For a suitable measure of agreement, once you find your first triplet, that will give you a maximum pair distance that you need to iterate up to, and once you get up to it, you just choose the closest triplet of the ones you've found. I think that should consistently find the best possible triplet, but it will be O(N^2 log N) because of the requirement for sorting the lists of pairs.
import collections

def pair_sorting(listA, listB, listC):
    # make all possible pairs of values from two lists
    # each pair has the structure ((number, origin_list), (number, origin_list))
    # so we know which lists the numbers came from
    all_pairs = []
    all_pairs += [((nA,0), (nB,1)) for (nA,nB) in itertools.product(listA,listB)]
    all_pairs += [((nA,0), (nC,2)) for (nA,nC) in itertools.product(listA,listC)]
    all_pairs += [((nB,1), (nC,2)) for (nB,nC) in itertools.product(listB,listC)]
    all_pairs.sort(key=lambda p: distance([p[0][0], p[1][0]]))
    # make a dict to track which (number, origin_list)s we've already seen
    pairs_by_number_and_list = collections.defaultdict(list)
    min_distance = float('inf')
    min_triplet = None
    # start with the closest pair
    for pair in all_pairs:
        # for the first value of the current pair, see if we've seen that particular
        # (number, origin_list) combination before
        for pair2 in pairs_by_number_and_list[pair[0]]:
            # if so, that means the current pair shares its first value with
            # another pair, so put the 3 unique values together to make a triplet
            this_triplet = (pair[1][0], pair2[0][0], pair2[1][0])
            # check if the triplet agrees more than the previous best triplet
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # do the same thing but checking the second element of the current pair
        for pair2 in pairs_by_number_and_list[pair[1]]:
            this_triplet = (pair[0][0], pair2[0][0], pair2[1][0])
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # finally, add the current pair to the list of pairs we've seen
        pairs_by_number_and_list[pair[0]].append(pair)
        pairs_by_number_and_list[pair[1]].append(pair)
    return min_triplet
N.B. I've written all the code samples in this answer out a little more explicitly than you'd do it in practice to help you to understand how they work. But when doing it for real, you'd use more list comprehensions and such things.
N.B.2. No guarantees that the code works :-P but it should get the rough idea across.

data structure that can do a "select distinct X where W=w and Y=y and Z=z and ..." type lookup

I have a set of unique vectors (10k's worth). And I need to, for any chosen column, extract the set of values that are seen in that column, in rows where all the other columns have given values.
I'm hoping for a solution that is sub-linear (wrt the item count) in time and at most linear (wrt the total size of all the items) in space, preferably with sub-linear extra space over just storing the items.
Can I get that or better?
BTW: it's going to be accessed from Python and needs to be simple to program or be part of an existing, commonly used library.
edit: the costs are for the lookup, and do not include the time to build the structures. All the data that will ever be indexed is available before the first query will be made.
It seems I'm doing a bad job of describing what I'm looking for, so here is a solution that gets close:
class Index:
    def __init__(self, stuff):  # don't care about this O() time
        self.all = set(stuff)
        self.index = {}
        for item in stuff:
            for i, v in enumerate(item):
                self.index.setdefault(i, set()).add(v)

    def Get(self, col, have):  # this O() matters
        ret = []
        t = list(have)  # make a copy
        for i in self.index[col]:
            t[col] = i
            if tuple(t) in self.all:
                ret.append(i)
        return ret
The problem is that this gives really bad (O(n)) worst-case performance.
Since you are asking for a SQL-like query, how about using a relational database? SQLite is part of the standard library, and can be used either on-disk or fully in memory.
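A minimal in-memory sketch with the sqlite3 module (the table layout and column names here are made up for illustration):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE vectors (x INTEGER, y INTEGER, z INTEGER)')
conn.executemany('INSERT INTO vectors VALUES (?, ?, ?)',
                 [(1, 2, 3), (1, 2, 4), (1, 5, 4)])

# distinct values of z among rows where every other column is pinned to a given value
rows = conn.execute('SELECT DISTINCT z FROM vectors WHERE x = ? AND y = ?', (1, 2))
print(sorted(r[0] for r in rows))   # [3, 4]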
If you have a Python set (no ordering) there is no way to select all relevant items without at least looking at all items -- so it's impossible for any solution to be "sub linear" (wrt the number of items) as you require.
If you have a list, instead of a set, then it can be ordered -- but that cannot be achieved in linear time in the general case (O(N log N) is provably the best you can do for a general-case sorting -- and building sorted indices would be similar -- unless there are constraints that let you use "bucket-sort-like" approaches). You can spread around the time it takes to keep indices over all insertions in the data structure -- but that won't reduce the total time needed to build such indices, just, as I said, spread them around.
Hash-based (not sorted) indices can be faster for your special case (where you only need to select by equality, not by "less than" &c) -- but the time to construct such indices is linear in the number of items: obviously you can't construct such an index without looking at least once at each item. Sublinear time requires some magic that lets you completely ignore certain items, and that can't happen without supporting "structure" (such as sortedness), which in turn requires time to achieve (and though it can be achieved "incrementally" ahead of time, such an approach doesn't reduce the total time required).
So, taken to the letter, your requirements appear overconstrained: neither Python, nor any other language, nor any database engine, etc, can possibly achieve them -- if interpreted literally exactly as you state them. If "incremental work done ahead of time" doesn't count (as breaching your demands of linearity and sublinearity), if you talk about expected/typical rather than worst-case behavior (and your items have friendly probability distributions), etc, then it might be possible to come close to achieving your very demanding requests.
For example, consider maintaining for each of the vectors' D dimensions a dictionary mapping the value an item has in that dimension, to a set of indices of such items; then, selecting the items that meet the D-1 requirements of equality for every dimension but the ith one can be done by set intersections. Does this meet your requirements? Not by taking the latter strictly to the letter, as I've explained above -- but maybe, depending on how much each requirement can perhaps be taken in a more "relaxed" sense.
BTW, I don't understand what a "group by" implies here since all the vectors in each group would be identical (since you said all dimensions but one are specified by equality), so it may be that you've skipped a COUNT(*) in your SQL-equivalent -- i.e., you need a count of how many such vectors have a given value in the i-th dimension. In that case, it would be achievable by the above approach.
Edit: as the OP has clarified somewhat in comments and an edit to his Q, I can propose an approach in more detail:
import collections

class Searchable(object):

    def __init__(self, toindex=None):
        self.toindex = toindex
        self.data = []
        self.indices = None

    def makeindices(self):
        if self.indices is not None:
            return
        self.indices = dict((i, collections.defaultdict(set))
                            for i in self.toindex)

    def add(self, record):
        if self.toindex is None:
            self.toindex = range(len(record))
        self.makeindices()
        where = len(self.data)
        self.data.append(record)
        for i in self.toindex:
            self.indices[i][record[i]].add(where)

    def get(self, indices_and_values, indices_to_get):
        ok = set(range(len(self.data)))
        for i, v in indices_and_values:
            ok.intersection_update(self.indices[i][v])
        result = set()
        for rec in (self.data[i] for i in ok):
            t = tuple(rec[i] for i in indices_to_get)
            result.add(t)
        return result

def main():
    c = Searchable()
    for r in ((1,2,3), (1,2,4), (1,5,4)):
        c.add(r)
    print c.get([(0,1),(1,2)], [2])

main()
This prints
set([(3,), (4,)])
and of course could be easily specialized to return results in other formats, accept indices (to index and/or to return) in different ways, etc. I believe it meets the requirements as edited / clarified since the extra storage is, for each indexed dimension/value, a set of the indices at which said value occurs on that dimension, and the search time is one set intersection per indexed dimension plus a loop on the number of items to be returned.
I'm assuming that you've tried the dictionary and you need something more flexible. Basically, what you need to do is index values of x, y and z
def build_index(vectors):
    index = {'x': {}, 'y': {}, 'z': {}}
    for position, vector in enumerate(vectors):
        if vector.x in index['x']:
            index['x'][vector.x].append(position)
        else:
            index['x'][vector.x] = [position]
        if vector.y in index['y']:
            index['y'][vector.y].append(position)
        else:
            index['y'][vector.y] = [position]
        if vector.z in index['z']:
            index['z'][vector.z].append(position)
        else:
            index['z'][vector.z] = [position]
    return index
What you have in index is a lookup table. You can say, for example, select x,y,z from vectors where x=42 by doing this:
def query_by(vectors, index, property, value):
    results = []
    for i in index[property][value]:
        results.append(vectors[i])
    return results

vecs_x_42 = query_by(vectors, index, 'x', 42)
# now vecs_x_42 is a list of all vectors where x is 42
Now to do a logical conjunction, say select x,y,z from vectors where x=42 and y=3 you can use Python's sets to accomplish this:
def query_by(vectors, index, criteria):
    sets = []
    for k, v in criteria.iteritems():
        if v not in index[k]:
            return []
        sets.append(set(index[k][v]))
    results = []
    for i in set.intersection(*sets):
        results.append(vectors[i])
    return results

vecs_x_42_y_3 = query_by(vectors, index, {'x': 42, 'y': 3})
The intersection operation on sets produces values which only appear in both sets, so you are only iterating the positions which satisfy both conditions.
Now for the last part of your question, to group by x:
def group_by(vectors, property):
    result = {}
    for v in vectors:
        value = getattr(v, property)
        if value in result:
            result[value].append(v)
        else:
            result[value] = [v]
    return result
So let's bring it all together:
vectors = [...] # your vectors, as objects such that v.x, v.y produces the x and y values
index = build_index(vectors)
my_vectors = group_by(query_by(vectors, index, {'y':42, 'z': 3}), 'x')
# now you have in my_vectors a dictionary of vectors grouped by x value, where y=42 and z=3
Update
I updated the code above and fixed a few obvious errors. It works now and it does what it claims to do. On my laptop, a 2GHz core 2 duo with 4GB RAM, it takes less than 1s to build_index. Lookups are very quick, even when the dataset has 100k vectors. If I have some time I'll do some formal comparisons against MySQL.
You can see the full code at this Codepad, if you time it or improve it, let me know.
Suppose you have a 'tuple' class with fields x, y, and z and you have a bunch of such tuples saved in an enumerable var named myTuples. Then:
A) Pre-population:
dct = {}
for tpl in myTuples:
    tmp = (tpl.y, tpl.z)
    if tmp in dct:
        dct[tmp].append(tpl.x)
    else:
        dct[tmp] = [tpl.x]
B) Query:
def findAll(y, z):
    tmp = (y, z)
    if tmp not in dct: return ()
    return [(x, y, z) for x in dct[tmp]]
I am sure that there is a way to optimize the code for readability, save a few cycles, etc. But essentially you want to pre-populate a dict, using a 2-tuple as a key. If I did not see a request for sub-linear then I would not have thought of this :)
A) The pre-population is linear, sorry.
B) A query costs time proportional to the number of items returned - most of the time sub-linear, except for weird edge cases.
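For instance, with a tiny made-up sample (using collections.namedtuple as a stand-in for the 'tuple' class described above):

import collections

Tuple3 = collections.namedtuple('Tuple3', 'x y z')   # hypothetical stand-in
myTuples = [Tuple3(1, 2, 3), Tuple3(7, 2, 3), Tuple3(5, 0, 9)]

dct = {}
for tpl in myTuples:
    dct.setdefault((tpl.y, tpl.z), []).append(tpl.x)

def findAll(y, z):
    return [(x, y, z) for x in dct.get((y, z), ())]

print(findAll(2, 3))   # [(1, 2, 3), (7, 2, 3)]
print(findAll(9, 9))   # []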
So you have 3 coordinates and one value for start and end of vector (x,y,z)?
How is it possible to know the seven known values? Do many coordinate triples occur multiple times?
You must be running a very tight loop with this function to be so concerned about lookup time, given the small size of the data (10K).
Could you give an example of real input for the class you posted?
