How would I search through a list of ~5 million 128-bit (or 256-bit, depending on how you look at it) strings quickly and find the duplicates (in Python)? I can turn the strings into numbers, but I don't think that will help much. Since I haven't learned much information theory: is there anything in information theory about this?
And since these are hashes already, there's no point in hashing them again.
If it fits into memory, use set(). I think it will be faster than sorting; O(n log n) for 5 million items is going to cost you.
If it does not fit into memory, say you have a lot more than 5 million records, divide and conquer: break the records at the midpoint of the key space, i.e. 1 x 2^127, and apply any of the above methods to each half. I guess information theory helps by stating that a good hash function distributes the keys evenly, so splitting at the midpoint should work great.
You can also apply divide and conquer even if it fits into memory; sorting 2 x 2.5 million records is faster than sorting 5 million records.
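A rough sketch of the split, assuming the hashes have already been converted to 128-bit integers and that each half fits in memory (records and its contents are made-up names/values):

def find_duplicates(values):
    # Plain set-based duplicate detection, expected O(n)
    seen, dups = set(), set()
    for v in values:
        if v in seen:
            dups.add(v)
        else:
            seen.add(v)
    return dups

records = [5, 1 << 127, 5, (1 << 127) + 3]  # hypothetical stand-in data
MIDPOINT = 1 << 127                         # split the 128-bit key space at 2**127

low = [v for v in records if v < MIDPOINT]
high = [v for v in records if v >= MIDPOINT]
duplicates = find_duplicates(low) | find_duplicates(high)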
Load them into memory (5M x 64B = 320MB), sort them, and scan through them finding the duplicates.
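A minimal sketch of the sort-and-scan step (hashes is an assumed name for the loaded list):

from itertools import groupby

hashes = ['a', 'b', 'a', 'c', 'b']  # hypothetical stand-in data

hashes.sort()
# After sorting, equal hashes sit next to each other; groupby collects each run
duplicates = [k for k, grp in groupby(hashes) if sum(1 for _ in grp) > 1]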
In Python 2.7+ you can use collections.Counter; for older Python, use collections.defaultdict(int). Either way is O(n).
First, make a list with some hashes in it:
>>> import hashlib
>>> s=[hashlib.sha1(str(x)).digest() for x in (1,2,3,4,5,1,2)]
>>> s
['5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab', '\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', 'w\xdeh\xda\xec\xd8#\xba\xbb\xb5\x8e\xdb\x1c\x8e\x14\xd7\x10n\x83\xbb', '\x1bdS\x89$s\xa4g\xd0sr\xd4^\xb0Z\xbc 1dz', '\xac4x\xd6\x9a<\x81\xfab\xe6\x0f\\6\x96\x16ZN^j\xc4', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab', '\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0']
If you are using Python 2.7 or later:
>>> from collections import Counter
>>> c=Counter(s)
>>> duplicates = [k for k in c if c[k]>1]
>>> print duplicates
['\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab']
If you are using Python 2.6 or earlier:
>>> from collections import defaultdict
>>> d=defaultdict(int)
>>> for i in s:
... d[i]+=1
...
>>> duplicates = [k for k in d if d[k]>1]
>>> print duplicates
['\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab']
Is this array sorted?
I think the fastest solution would be a heap sort or quick sort, and then going through the array to find the duplicates.
You say you have a list of about 5 million strings, and that the list may contain duplicates. You don't say (1) what you want to do with the duplicates (log them, delete all but one occurrence, ...), (2) what you want to do with the non-duplicates, (3) whether this list is a stand-alone structure or whether the strings are keys to some other data that you haven't mentioned, or (4) why you haven't deleted duplicates at input time instead of building a list that contains duplicates.
As a Data Structures and Algorithms 101 exercise, the answer you have accepted is nonsense. If you have enough memory, detecting duplicates using a set should be faster than sorting a list and scanning it. Note that deleting M items from a list of size N is O(MN). The code for each of the various alternatives is short and rather obvious; why don't you try writing them, timing them, and reporting back?
If this is a real-world problem that you have, you need to provide much more information if you want a sensible answer.
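For what it's worth, a throwaway timing harness along those lines might look like this (the data is made up and scaled down; substitute the real hashes):

import os
import timeit

# Hypothetical stand-in data: 100k random 16-byte strings, each appearing twice
data = [os.urandom(16) for _ in range(100000)] * 2

def via_set(items):
    seen, dups = set(), set()
    for x in items:
        if x in seen:
            dups.add(x)
        else:
            seen.add(x)
    return dups

def via_sort(items):
    items = sorted(items)
    return set(a for a, b in zip(items, items[1:]) if a == b)

print(timeit.timeit(lambda: via_set(data), number=3))
print(timeit.timeit(lambda: via_sort(data), number=3))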
I have an input of about 2-5 million strings of about 400 characters each, coming from a stored text file.
I need to check for duplicates before adding them to the list that I check (it doesn't have to be a list; it can be any other data type. The list is technically a set, since all items are unique).
I can expect at most about 0.01% of my data to be non-unique, and I need to filter those out.
I'm wondering if there is any faster way for me to check if the item exists in the list rather than:
a = []
for item in data:
    if item not in a:
        a.append(item)
I do not want to lose the order.
Would hashing be faster (I don't need encryption)? But then I'd have to maintain a hash table for all the values to check first.
Is there any way I'm missing?
I'm on Python 2, and can go up to Python 3.5 at most.
It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:
a = []
for item in data:
    if item not in a:
        a.append(item)
That will be horridly slow, taking time quadratic in len(data). In any version of Python the following will take expected-case time linear in len(data):
seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)
where emit() does whatever you like (append to a list, write to a file, whatever).
In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7, or via the OrderedDict type from the collections package). The code just above is the most memory-efficient, though.
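For reference, those ordered-dictionary variants might look like this (plain dict preserves insertion order as a language guarantee from Python 3.7; OrderedDict covers older versions):

from collections import OrderedDict

data = ["b", "a", "b", "c", "a"]  # hypothetical stand-in data

unique_new = list(dict.fromkeys(data))         # Python 3.7+ (insertion order guaranteed)
unique_old = list(OrderedDict.fromkeys(data))  # any Python 2.7+ / 3.x
# on a suitable Python, either gives ["b", "a", "c"]: first occurrences, in order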
You can try this:
a = list(set(data))
A list is an ordered sequence of elements, whereas a set is an unordered collection of distinct elements, so note that this does not preserve the original order.
I have a list of ~30 floats. I want to see if a specific float is in my list. For example:
1 >> # For the example below my list has integers, not floats
2 >> list_a = range(30)
3 >> 5.5 in list_a
False
4 >> 1 in list_a
True
The bottleneck in my code is line 3. I search if an item is in my list numerous times, and I require a faster alternative. This bottleneck takes over 99% of my time.
I was able to speed up my code by making list_a a set instead of a list. Are there any other ways to significantly speed up this line?
The best possible time to check whether an element is in a list, if the list is not sorted, is O(n), because the element may be anywhere and you need to look at each item and check whether it is the one you are looking for.
If the list were sorted, you could use binary search to get O(log n) lookup time. You can also use a hash map to get average O(1) lookup time (or use the built-in set, which is basically a dictionary that accomplishes the same task).
That does not make much sense for a list of length 30, though.
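For completeness, here is roughly what the two faster lookups look like (list_a as in the question; the queried values are made up):

import bisect

list_a = list(range(30))
queries = [5.5, 1, 17]

# Hash-based membership: average O(1) per lookup
set_a = set(list_a)
hits_set = [x in set_a for x in queries]

# Binary search on a sorted copy: O(log n) per lookup
sorted_a = sorted(list_a)

def contains(seq, value):
    i = bisect.bisect_left(seq, value)
    return i < len(seq) and seq[i] == value

hits_bisect = [contains(sorted_a, x) for x in queries]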
In my experience, Python indeed slows down when we search something in a long list.
To complement the suggestion above, my suggestion is to subset the list, which of course only works if the list can be subset and a query can easily be assigned to the correct subset.
An example is searching for a word in an English dictionary: first subset the dictionary into 26 sections based on each word's initial letter. If the query is "apple", you only need to search the "A" section. The advantage is that you have greatly limited the search space, hence the speed boost.
For a numerical list, subset it either by range or by the first digit.
Hope this helps.
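A minimal sketch of the idea with a made-up word list:

from collections import defaultdict

words = ["apple", "apricot", "banana", "cherry"]  # hypothetical data

# Build the buckets once, keyed by initial letter
buckets = defaultdict(list)
for w in words:
    buckets[w[0]].append(w)

query = "apple"
found = query in buckets[query[0]]  # only the "a" bucket is scanned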
Say that I have a list of strings and a set of the same strings:
l = [str1, str2, str3, str4, ...]
s = set([str1, str2, str3, str4, ...])
I need to run a string comparison with a phrase that I have: comparephrase
I need to iterate over all the elements in the list or the set and generate a ratio between the comparephrase and the compared string. I know that set() is faster when we are doing a membership test. However, I am not doing a membership test but comparing the phrase that I have and the strings in the list/set. Does set() still offer a faster speed? If so, why? It seems to me that this set is actually a set with a list inside. Wouldn't that take the long time since we're iterating over the list within the set?
Iterating over a list is much faster than iterating over a set.
The currently accepted answer uses a very small set and list, hence the difference is negligible there.
The following code explains it:
>>> import timeit
>>> l = [ x*x for x in range(1, 400)]
>>> s = set(l)
>>> timeit.timeit("for i in s: pass", "from __main__ import s")
12.152284085999781
>>> timeit.timeit("for i in l: pass", "from __main__ import l")
5.460189446001095
>>> timeit.timeit("if 567 in l: pass", "from __main__ import l")
6.0497558240003855
>>> timeit.timeit("if 567 in s: pass", "from __main__ import s")
0.04609546199935721
I do not know what makes set iteration slow, but the fact is evident from the output above.
A Python set is optimized for equality tests and duplicate removal, and thus implements a hash table underneath. I believe this would make it very slightly slower than a list, if you have to compare every element to comparephrase; lists are very good for iterating over every element one after the other. The difference is probably going to be negligible in almost any case, though.
I've run some tests with timeit, and (while list performs slightly faster) there is no significant difference:
>>> import timeit
>>> # For the set
>>> timeit.timeit("for i in s: pass", "s = set([1,4,7,10,13])")
0.20565616500061878
>>> # For the list
>>> timeit.timeit("for i in l: pass", "l = [1,4,7,10,13]")
0.19532391999928223
These values stay very much the same (0.20 vs. 0.19) even when trying multiple times.
However, the overhead of creating the sets can be significant.
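To put a rough number on that overhead, you could time the set construction separately (the sizes here are arbitrary):

import timeit

# Building the set is itself O(n) work that plain list iteration never pays
build_cost = timeit.timeit("set(l)", "l = list(range(10000))", number=1000)
scan_cost = timeit.timeit("for i in l: pass", "l = list(range(10000))", number=1000)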
The test in the accepted answer is not really representative as Aditya Shaw stated.
Let me explain the technical differences between iterating lists and sets roughly in an easy way.
Iterating lists
Lists keep their elements in an indexed order by design, which makes them easy to iterate.
Access by index is fast because it comes down to a few cheap memory reads.
Iterating sets
Sets iterate more slowly because their elements are accessed through hashes.
Imagine a big tree with many branches, where each leaf is an element; a hash is translated into "directions" for traversing the branches until you reach the leaf.
Finding a single leaf (element) this way is still fast, but slower than a simple index access on a list.
The elements are also scattered across the structure rather than packed next to each other, so an iteration cannot simply jump to the "next" item the way a list can; it has to walk the whole structure, including the empty spots.
Sets (and dicts) are the opposite of lists in this respect.
Each kind has its primary use case.
Lists and sets swap their advantages when it comes to finding elements directly.
Does a list contain an element?
The list has to be scanned from the start until a match is found, so the time depends heavily on the list size.
If the element is near the beginning, it gets found quite fast even in large lists.
If the element is not in the list, or is near the end, the list gets scanned completely, or until a match near the end.
Does a set contain an element?
It just has to follow, say, 5 "branches" to see whether the leaf is there.
Even in large sets, the number of branches to traverse is relatively low.
Why not make and use a universal collection type?
If a set had an internal list index, the set's operations would be slower because the list would also have to be updated and/or checked.
If a list had an internal set to find items faster, list operations would be slower because the hashes would also have to be updated and/or checked.
The data behind the hash and behind the index could also become inconsistent because of duplicate management, and keeping both increases memory usage overall.
Background: My Python program handles relatively large quantities of data, which can be generated in-program or imported. The data is then processed, and during one of these processes the data is deliberately copied, manipulated, cleaned of duplicates and then returned to the program for further use. The data I'm handling is very precise (up to 16 decimal places), and maintaining this accuracy to at least 14dp is vital. However, mathematical operations can of course return slight variations in my floats, such that two values are identical to 14dp but may vary ever so slightly at 16dp, meaning the built-in set() function doesn't correctly remove such 'duplicates' (I used this method to prototype the idea, but it's not satisfactory for the finished program). I should also point out I may well be overlooking something simple! I am just interested to see what others come up with :)
Question: What is the most efficient way to remove very-near-duplicates from a potentially very large data set?
My Attempts: I have tried rounding the values themselves to 14dp, but this is of course not satisfactory as it leads to larger errors down the line. I have a potential solution to this problem, but I am not convinced it is as efficient or 'pythonic' as possible. My attempt involves finding the indices of list entries that match to x dp, and then removing one of the matching entries. Thank you in advance for any advice! Please let me know if there's anything you wish to be clarified, or of course if I'm overlooking something very simple (I may be at a point where I'm over-thinking it).
Clarification on 'Duplicates': Example of one of my 'duplicate' entries: 603.73066958946424, 603.73066958946460; the solution would remove one of these values.
Note on decimal.Decimal: This could work if it were guaranteed that all imported data did not already contain some near-duplicates (which it often does).
You really want to use NumPy if you're handling large quantities of data. Here's how I would do it:
Import NumPy :
import numpy as np
Generate 8000 high-precision floats (128 bits will be enough for your purposes, but note that I'm converting the 64-bit output of random to 128 bits just to fake it; use your real data here):
a = np.float128(np.random.random((8000,)))
Find the indexes of the unique elements in the rounded array :
_, unique = np.unique(a.round(decimals=14), return_index=True)
And take those indexes from the original (non-rounded) array :
no_duplicates = a[unique]
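Putting the steps together as one runnable snippet (np.unique returns the indexes of the first occurrences of each rounded value; sorting those indexes keeps the survivors in their original order):

import numpy as np

a = np.float128(np.random.random((8000,)))   # stand-in for your real data
_, unique = np.unique(a.round(decimals=14), return_index=True)
no_duplicates = a[np.sort(unique)]           # one survivor per 14dp value, in original order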
Why don't you create a dict that maps the 14dp values to the corresponding full 16dp values:
import collections

d = collections.defaultdict(list)
for x in l:
    d[round(x, 14)].append(x)
Now if you just want "unique" (by your definition) values, you can do
unique = [v[0] for v in d.values()]
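A quick illustration with made-up values (the first two agree to 14 decimal places, so they land under the same key and only one survives):

import collections

l = [2.0000000000000004, 2.0000000000000013, 5.5]  # hypothetical data

d = collections.defaultdict(list)
for x in l:
    d[round(x, 14)].append(x)

unique = [v[0] for v in d.values()]  # one representative per 14dp key, e.g. [2.0000000000000004, 5.5]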
print sum(1 for x in alist if x[1] == 8)
This code runs fine, but it is very slow. Is there a better way than this? My list is very large and the computation takes a lot of time. Do you know a better and faster way to do it?
You'd have to create indexes or cached counts to speed up such code; trade memory for speed.
Wherever you handle your list (adding to it, removing from it, editing entries) you also maintain your indices. For example, if you had a counts dict with the x[1] values as keys and their frequencies as values, all you would have to do is look up the count directly, and ensure the counts stay up to date as you manipulate alist.
The best way to manage this is by encapsulating your list in a custom type, so that you can control all manipulations of the data structure and maintain the extra information.
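A minimal sketch of that encapsulation, assuming the items are sequences whose second element (x[1]) is the value being counted, as in the question:

from collections import Counter

class CountedList(object):
    # Wraps a list and keeps a frequency index of the x[1] values up to date.
    def __init__(self, items=None):
        self._items = list(items or [])
        self._counts = Counter(x[1] for x in self._items)

    def append(self, item):
        self._items.append(item)
        self._counts[item[1]] += 1

    def remove(self, item):
        self._items.remove(item)
        self._counts[item[1]] -= 1

    def count_of(self, value):
        # O(1) lookup instead of scanning the whole list
        return self._counts[value]

With this in place, CountedList(alist).count_of(8) replaces the generator-expression scan.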
Not sure how much faster it would be but
len([x for x in alist if x[1] == 8])
is a little clearer.
I would use numpy. My numpy skills are a little bit rusty, but something like (np_array == 8).sum() would give you what you need for a single-depth array. I think for your case it would be (np_array[:, 1] == 8).sum(), but I would have to check (this assumes your problem can use numpy arrays).
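A rough sketch of that idea (alist here is made-up data shaped like the question's, with the value of interest in column 1):

import numpy as np

alist = [(1, 8), (2, 3), (3, 8)]     # hypothetical stand-in data
np_array = np.array(alist)

count = (np_array[:, 1] == 8).sum()  # counts rows whose second column equals 8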