Efficiently Removing Very-Near-Duplicates From Python List

Efficiently Removing Very-Near-Duplicates From Python List - python

Background:My Python program handles relatively large quantities of data, which can be generated in-program, or imported. The data is then processed, and during one of these processes, the data is deliberately copied and then manipulated, cleaned for duplicates and then returned to the program for further use. The data I'm handling is very precise (up to 16 decimal places), and maintaining this accuracy to at least 14dp is vital. However, mathematical operations of course can return slight variations in my floats, such that two values are identical to 14dp, but may vary ever so slightly to 16dp, therefore meaning the built in set() function doesn't correctly remove such 'duplicates' (I used this method to prototype the idea, but it's not satisfactory for the finished program). I should also point out I may well be overlooking something simple! I am just interested to see what others come up with :)Question:What is the most efficient way to remove very-near-duplicates from a potentially very large data set?My Attempts:I have tried rounding the values themselves to 14dp, but this is of course not satisfactory as this leads to larger errors down the line. I have a potential solution to this problem, but I am not convinced it is as efficient or 'pythonic' as possible. My attempt involves finding the indices of list entries that match to x dp, and then removing one of the matching entries. Thank you in advance for any advice! Please let me know if there's anything you wish to be clarified, or of course if I'm overlooking something very simple (I may be at a point where I'm over-thinking it).Clarification on 'Duplicates':Example of one of my 'duplicate' entries: 603.73066958946424, 603.73066958946460, the solution would remove one of these values.Note on decimal.Decimal: This could work if it was guaranteed that all imported data did not already have some near-duplicates (which it often does).

You really want to use NumPy if you're handling large quantities of data. Here's how I would do it :
Import NumPy :
import numpy as np
Generate 8000 high-precision floats (128-bits will be enough for your purposes, but note that I'm converting the 64-bits output of random to 128 just to fake it. Use your real data here.) :
a = np.float128(np.random.random((8000,)))
Find the indexes of the unique elements in the rounded array :
_, unique = np.unique(a.round(decimals=14), return_index=True)
And take those indexes from the original (non-rounded) array :
no_duplicates = a[unique]

Why don't you create a dict that maps the 14dp values to the corresponding full 16dp values:
d = collections.defaultdict(list)
for x in l:
d[round(x, 14)].append(x)
Now if you just want "unique" (by your definition) values, you can do
unique = [v[0] for v in d.values()]

Related

Jagged Quote of IV for BlackVarianceSurface

I am following this example and trying to adapt it to my needs.
In the section of code:
implied_vols[i][j] = data[j][i]
implied_vols = ql.Matrix(len(strikes), len(expiration_dates))
for i in range(implied_vols.rows()):
for j in range(implied_vols.columns()):
implied_vols[i][j] = data[j][i]
[1]: http://gouthamanbalaraman.com/blog/volatility-smile-heston-model-calibration-quantlib-python.html
This assumes the IV matrix has all corresponding strikes for a given expiry. In fact, the [mass] quote is often stored in a dictionary instead of an array exactly for this reason.
For example in the SPX, we have different strike increments at different expiration. So some strikes are empty for one expiry but not another. I realize I can force the situation by making the matrix cell always have numerical value, but I am assuming that inserting a 0 at a given strike/expiry is a bad idea. Alternatively, forcing all expiry to the least common denominator of strikes between the expiry throws out lots of data.
What happens if the volatility quotes you have are not square and you don't want to throw out data when building a ql.Matrix to hand to BlackVarianceSurface?

Unfortunately, there's no ready-made solution. As you say filling the missing cells with 0 is a bad idea; but filling them by interpolating manually the missing values should work. The way to do it probably depends on how sparse your data is...

How to find a float number in a list?

I am trying to create a list of points from a road network. Here, I try to put their coordinates in a List of [x,y] whose items have a float format. As a new point from the network is picked, it should be checked with the existing points in the list. if it exists, then the same index will be given to the feature of network, otherwise a new point will be added to the list and the new index will be given to the feature.
I know that a float number will be saved differently form integers, but for exactly the same float numbers, I still cannot use:
If new_point in list_of_points:
#do something
and I should use:
for point in list_of_points:
if abs(point.x-new_point.x)<0.01 and abs(point.y-new_point.y)<0.01
#do something
the points are supposed to be exactly the same as I snap them using the ArcGIS software, and when I check the coordinates in the software they are exactly the same.
I asked this question for:
1- I think using "in" can make my code tidy and also faster while using for-loop is kind of clumsy way of coding for this situation.
2- I want to know: does that mean even exactly the same float numbers are stored differently?

It's never a good idea to check for equality between two floating point numbers. However, there are built in functions to do a comparison like that. From numpy you can use allclose. For example,
>>> np.allclose( (1.0,2.0), (1.00000001,2.0000001) )
True
This checks if the two array like inputs are element-wise equal within a certain tolerance. You can adjust the relative and absolute tolerances with keyword arguments.

Any given Python implenetation should always store a given floating point number in the same, deterministic, non-random way within itself. I do not believe you can take the same floating point number, input it twice, and have it stored in two different ways. But I'm also reluctant to believe that you're going to be getting exact duplicates of coordinates out of a geographic program like ArcGIS, especially if the resolution is very small. There are many ways that floating point math can mess with your expectations, so you shouldn't ever expect that you'll have identical floats. And between different machines and different versions, you get even more possibilities for error.
If you're worried about the elegance of your code, you can just create a function to abstract out the for loop.
def coord_in(coord, coord_list):
for other_coord in coord_list:
if abs(coord.x-other_coord.x)<0.00001 and abs(coord.y-other_coord.y)<0.00001:
return True
return False

For a large number of points, numpy will always be faster (and perhapd more elegant). If you have separated the x and y coords into (float) arrays arrx and arry:
numpy.sometrue((arrx-point.x)**2+(arry-point.y)**2<tol**2)
will return True if point is within distance tol of an existing point.

2: exactly the same literal (e.g., "2.3") will be stored as exactly the same float representation for for a given platform and data-type, but in general it depends on the bit-ness, endian-ness and perhaps the compiler used to make python.
To be certain when comparing numbers, you should at least round to the precision of the least precise number, or (better) do the kind of thing you are doing here.
>>> 1==1.00000000000000000000000000000000001
True

Old thread but helped me develop my own solution using list comprehension. Because of course it's not a good idea to compare two floats using ==. The following returns list of indices of all elements of the input list that are reasonably close to the value we're looking for.
def findFloats(listOfFloats, value):
return [i for i, number in enumerate(listOfFloats)
if abs(number-value) < 0.00001]

Iterate two or more lists / numpy arrays... and compare each item with each other and avoid loops in python

I am new to python and my problem is the following:
I have defined a function func(a,b) that return a value, given two input values.
Now I have my data stored in lists or numpy arrays A,Band would like to use func for every combination. (A and B have over one million entries)
ATM i use this snippet:
for p in A:
for k in B:
value = func(p,k)
This takes really really a lot of time.
So i was thinking that maybe something like this:
C=(map(func,zip(A,B)))
But this method only works pairwise... Any ideas?
Thanks for help

First issue
You need to calculate the output of f for many pairs of values. The "standard" way to speed up this kind of loops (calculations) is to make your function f accept (NumPy) arrays as input, and do the calculation on the whole array at once (ie, no looping as seen from Python). Check any NumPy tutorial to get an introduction.
Second issue
If A and B have over a million entries each, there are one trillion combinations. For 64 bits numbers, that means you'll need 7.3 TiB of space just to store the result of your calculation. Do you have enough hard drive to just store the result?
Third issue
If A and B where much smaller, in your particular case you'd be able to do this:
values = f(*meshgrid(A, B))
meshgrid returns the cartesian product of A and B, so it's simply a way to generate the points that have to be evaluated.
Summary
You need to use NumPy effectively to avoid Python loops. (Or if all else fails or they can't easily be vectorized, write those loops in a compiled language, for instance by using Cython)
Working with terabytes of data is hard. Do you really need that much data?
Any solution that calls a function f 1e12 times in a loop is bound to be slow, specially in CPython (which is the default Python implementation. If you're not really sure and you're using NumPy, you're using it too).

suppose, itertools.product does what you need:
from itertools import product
pro = product(A,B)
C = map(lambda x: func(*x), pro)
so far as it is generator it doesn't require additional memory

One million times one million is one trillion. Calling f one trillion times will take a while.
Unless you have a way of reducing the number of values to compute, you can't do better than the above.

If you use NumPy, you should definitely look the np.vectorize function which is designed for this kind of problems...

Which data types for genetic algorithms in Python?

I am implementing a GA in Python and need to store a sequence of ones and zeros, so I am representing my data as binaries. What is the best data structure for that? A simple string?

If your chromosomes are fixed-length bitstrings, consider using Numpy arrays and vectorized operations on them instead of lists. These may be much faster than Python lists. E.g., one-point crossover can be done with
def crossover(a, b):
"""Return new individual by combining parents a and b
with random crossover point"""
c = np.empty(a.shape, dtype=bool)
k = np.random.randint(a.shape[0])
c[:k] = a[:k]
c[k:] = b[k:]
return c
If you don't want to use Numpy, then strings seem quite appropriate; they're much more compact than lists, which store pointers to elements rather than actual elements.
Finally, be sure to have a look at how Pyevolve represents chromosomes; it seems to do so with using Numpy.

I think sticking with the strings is a good idea. You can easily chop strings into pieces. if you need to act on them as a list, you can convert it with "list(str)". Once you have a list, you can alter it and turn it back into a string using "''.join(lst)".
Personally, I wouldn't use a long or another integer type to store as bits. It may be more space efficient, but the headache of working with the data when you want to do a recombination would be considerable. Mutations would be problematic as well if the mutation consist of something other than a bit flip. Plus, the code would be much harder to read.
Just my 2 cents. Hope that helps you out.

You can try using bitarray.
Or you can play with buffers.

Searching through large data set

how would i search through a list with ~5 mil 128bit (or 256, depending on how you look at it) strings quickly and find the duplicates (in python)? i can turn the strings into numbers, but i don't think that's going to help much. since i haven't learned much information theory, is there anything about this in information theory?
and since these are hashes already, there's no point in hashing them again

If it fits into memeory, use set(). I think it will be faster than sort. O(n log n) for 5 million items is going to cost you.
If it does not fit into memory, say you've lot more than 5 million record, divide and conquer. Break the records at the mid point like 1 x 2^127. Apply any of the above methods. I guess information theory helps by stating that a good hash function will distribute the keys evenly. So the divide by mid point method should work great.
You can also apply divide and conquer even if it fit into memory. Sorting 2 x 2.5 mil records is faster than sorting 5 mil records.

Load them into memory (5M x 64B = 320MB), sort them, and scan through them finding the duplicates.

In Python2.7+ you can use collections.Counter for older Python use collections.deaultdict(int). Either way is O(n).
first make a list with some hashes in it
>>> import hashlib
>>> s=[hashlib.sha1(str(x)).digest() for x in (1,2,3,4,5,1,2)]
>>> s
['5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab', '\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', 'w\xdeh\xda\xec\xd8#\xba\xbb\xb5\x8e\xdb\x1c\x8e\x14\xd7\x10n\x83\xbb', '\x1bdS\x89$s\xa4g\xd0sr\xd4^\xb0Z\xbc 1dz', '\xac4x\xd6\x9a<\x81\xfab\xe6\x0f\\6\x96\x16ZN^j\xc4', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab', '\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0']
If you are using Python2.7 or later
>>> from collections import Counter
>>> c=Counter(s)
>>> duplicates = [k for k in c if c[k]>1]
>>> print duplicates
['\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab']
if you are using Python2.6 or earlier
>>> from collections import defaultdict
>>> d=defaultdict(int)
>>> for i in s:
... d[i]+=1
...
>>> duplicates = [k for k in d if d[k]>1]
>>> print duplicates
['\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab']

Is this array sorted?
I think the fastest solution can be a heap sort or quick sort, and after go through the array, and find the duplicates.

You say you have a list of about 5 million strings, and the list may contain duplicates. You don't say (1) what you want to do with the duplicates (log them, delete all but one occurrence, ...) (2) what you want to do with the non-duplicates (3) whether this list is a stand-alone structure or whether the strings are keys to some other data that you haven't mentioned (4) why you haven't deleted duplicates at input time instead building a list containing duplicates.
As a Data Structures and Algorithms 101 exercise, the answer you have accepted is a nonsense. If you have enough memory, detecting duplicates using a set should be faster than sorting a list and scanning it. Note that deleting M items from a list of size N is O(MN). The code for each of the various alternatives is short and rather obvious; why don't you try writing them, timing them, and reporting back?
If this is a real-world problem that you have, you need to provide much more information if you want a sensible answer.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Efficiently Removing Very-Near-Duplicates From Python List - python

Why don't you create a dict that maps the 14dp values to the corresponding full 16dp values: d = collections.defaultdict(list) for x in l: d[round(x, 14)].append(x) Now if you just want "unique" (by your definition) values, you can do unique = [v[0] for v in d.values()]

Related

Jagged Quote of IV for BlackVarianceSurface

How to find a float number in a list?

Iterate two or more lists / numpy arrays... and compare each item with each other and avoid loops in python

Which data types for genetic algorithms in Python?

Searching through large data set

Categories

Resources