Efficient algorithm for only keeping distant values - python

I have a list of values which might look something like this: [500,501,809,702,808,807,703,502,499] and I would like to only keep the first instance of each number within a certain distance. In other words, I'd like to get the list: [500,809,702] because the other numbers are within a certain distance from those numbers. So it would keep 500, skip 501 because it's too close, keep 809 because it's far away from the already selected values, keep 702, etc.
Here's my current solution:
import numpy as np

vals = ... # the original data
result = []
tolerance = 50
for i in vals:
    if not len(np.where(np.abs(np.array(result) - i) < tolerance)[0]):
        result.append(i)
This works fine, but it's too slow for my purposes (I'm dealing with 2.4 million elements in the list). Is there an efficient solution to this problem? Thanks!
EDIT: Just to clarify, I need to keep the first element of each group, not the smallest element (i.e. [499, 702, 807] would not be a valid result in the above example), so sorting it might not help so much.

vals = [500, 501, 809, 702, 808, 807, 703, 502, 499]
close_set = set()
tolerance = 5
result = []
for e in vals:
    if e in close_set:
        continue
    else:
        result.append(e)
        close_set.update(range(e - tolerance, e + tolerance + 1))
print(result)  # [500, 809, 702]
This should be pretty fast (I tested it on a list of 1,000,000 elements and it took ~3 seconds). For each element in the list, you check to see if a close value has been seen before by checking for membership in the set of close numbers, which is O(1). If it's not, you add it to your results and then update the set of close numbers.

A better solution is to use a SortedSet from http://www.grantjenks.com/docs/sortedcontainers/index.html.
Before inserting an element, use irange to check for any already-kept value within +/- tolerance. If nothing is there, then add the element.
This solution should be at least an order of magnitude faster than the close_set approach already suggested, and an order of magnitude better on memory usage as well. Plus, it will work for floats as well as integers, should you need that.
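For instance, a minimal sketch of that approach (keep_distant is my name for it, and it assumes the sortedcontainers package is installed; a SortedList works the same way as a SortedSet here):
from sortedcontainers import SortedList

def keep_distant(vals, tolerance):
    kept = SortedList()
    result = []
    for v in vals:
        # irange yields any already-kept value in [v - tolerance, v + tolerance];
        # an empty iterator means v is far enough from everything kept so far
        if next(kept.irange(v - tolerance, v + tolerance), None) is None:
            result.append(v)
            kept.add(v)
    return result

print(keep_distant([500, 501, 809, 702, 808, 807, 703, 502, 499], 5))  # [500, 809, 702]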

Related

Is it possible to make a "get values that are not duplicated at all" algorithm without sorting? (Time complexity O(n))

I want to know whether a "get values that are not duplicated at all" algorithm is possible without sorting.
Let me explain what I mean by "get values that are not duplicated at all":
if the list's values are [1, 3, 3, 2]
I want to get 1, 2
I tried lots of algorithms,
but I can't find one whose time complexity is a stable O(n).
I tried using dict and queue.
However, the time complexity was never a stable O(n).
Using dict:
li = [1, 2, 3, 3, 5]
li_dict = dict()
for i in range(len(li)):
    try:
        a = li_dict[li[i]]
        li_dict[li[i]] = False
    except KeyError:
        li_dict[li[i]] = True
for k in li_dict.keys():
    if li_dict[k]:
        print(k, end=" ")
Is this possible?
If it is, please explain how to make it possible (without sorting).
I've got a hand-coded solution for you. You can make use of the set type:
li = [1, 3, 3, 2]
result = set()
seen = set()
for i in li:
    if i in seen:
        result.discard(i)
    else:
        result.add(i)
        seen.add(i)
print(result)
Because Python sets use a hash table under the hood, lookups are very fast, just as in dictionaries. Sets are a bit better suited to your problem, however, because like lists they hold plain values rather than key-value pairs.
In my code example I iterate over the list only once, and since set membership tests are near-instant (O(1)), I believe this is O(n) complexity.
I use two sets to separate the items I've seen already from the items I want to keep.
There are most likely more Pythonic solutions. This is my first attempt at reasoning about the complexity of a piece of code, but I believe my conclusions are valid. If not, someone else may correct me!
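For instance, a sketch using collections.Counter (my illustration, not part of the answer above), which gets the same result with one counting pass and one filtering pass:
from collections import Counter

li = [1, 3, 3, 2]
counts = Counter(li)                        # one O(n) pass to count occurrences
result = [x for x in li if counts[x] == 1]  # one O(n) pass to keep the unique values
print(result)  # [1, 2]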

Most efficient way to loop over python list and compare and adjust element values

I have a list of elements with a certain attribute/variable. First of all I need to check whether all attributes have the same value; if not, I need to set this attribute on every element to the highest value.
The thing is, it is not difficult to program such a loop. However, I would like to know the most efficient way to do so.
My current approach works fine, but it loops through the list two times, has two local variables, and just does not feel efficient enough.
I simplified the code. This is basically what I've got:
biggest_value = 0
re_calc = 0
for _, element in enumerate(element_list):
    if element.value > biggest_value:
        biggest_value = element.value
        re_calc += 1
if re_calc > 1:
    for _, element in enumerate(element_list):
        element.value = adjust_value(biggest_value)
        element_list[_] = element
The thing annoying me is the necessity of the "re_calc" variable. A simple check for the biggest value is no big deal. But this task consists of three steps:
"Compare Attributes --> Find Biggest Value --> possibly Adjust Others". However, I do not want to loop over this list three times. Not even two times, as my current suggestion does.
There has to be a more efficient way. Any ideas? Thanks in advance.
The first loop just determines the largest value in element_list. So an approach can be:
transform element_list into a numpy array. Unfortunately you do not say what the list looks like, but if it contains numbers then
L = np.array(element_list)
can probably do it.
After that, use np.max(L). Numpy commands without for loops are usually much faster.
import numpy as np

nl = 10
L = np.random.rand(nl)
biggest_value = np.max(L)
L, biggest_value
gives
(array([0.70047074, 0.14160459, 0.75061621, 0.89013494, 0.70587705,
        0.50218377, 0.31197993, 0.42670057, 0.67869183, 0.04415816]),
 0.8901349369179461)
In the second for-loop it is not obvious what you want to achieve. Unfortunately you give neither an input nor a desired output, and do not say what adjust_value has to do. Minimal running code with data would be helpful for giving support.
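As a sketch of the whole flow, under the assumption that element_list holds plain numbers and that adjust_value maps the biggest value to the replacement value (the question leaves both open):
import numpy as np

L = np.array(element_list)
biggest_value = L.max()
if not np.all(L == biggest_value):       # the attribute values differ somewhere
    L[:] = adjust_value(biggest_value)   # broadcast the adjusted value to every element
element_list = L.tolist()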

Python lists - codes & algorithm

I need some help with python, a new program language to me.
So, let's say that I have this list:
list = [3, 1, 4, 9, 8, 2]
And I would like to sort it, but without using the built-in function "sort", otherwise where's all the fun and the studying in here? I want to code as simply and as basically as I can, even if it means working a bit harder. Therefore, if you want to help me and offer me some ideas and code, please try to keep them very "basic".
Anyway, back to my problem: in order to sort this list, I've decided to compare each number in the list to the last number. First, I'll check 3 and 2. If 3 is smaller than 2 (and it's not), then do nothing.
Next, check if 1 is smaller than 2 (and it is) - then swap this number's place with the first element.
On the next run, it will check again whether the number is smaller than the last number in the list. But this time, if the number is smaller, it will swap places with the second number (and on the third run with the third number, if it's smaller, of course),
and so on and so on.
In the end, the function will return the sorted list.
Hope you've understood it.
So I want to use a recursive function to make the task a bit interesting, but still basic.
Therefore, I thought about this code:
def func(list):
    if not list:
        for i in range(len(list)):
            if list[-1] > lst[i]:
                # have no idea what to write here in order to change the locations
                i = i + 1
        # return func(lst[i+1:]) ?
    return list
Two questions:
1. How can I change the locations? Using pop/remove and then insert?
2. I don't know where to put the recursive part, or whether I've written it correctly (I think I haven't). The recursive part is the second "#", just before the first "return".
What do you think? How can I improve this code? What's wrong?
Thanks a lot!
Oh man, sorting. That's one of the most popular problems in programming, with many, many solutions that differ a little in every language. Anyway, the most straightforward algorithm is, I guess, bubble sort. However, it's not very efficient, so it's mostly used for educational purposes. If you want to try something more efficient and common, go for quicksort. I believe it's the most popular sorting algorithm. In Python, however, the default algorithm is a bit different (Timsort). And like I've said, there are many, many more sorting algorithms around the web.
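For reference, a minimal bubble sort sketch (my illustration, not part of the original answer):
def bubble_sort(lst):
    # after each pass, the largest remaining value has "bubbled" to the end
    for end in range(len(lst) - 1, 0, -1):
        for i in range(end):
            if lst[i] > lst[i + 1]:
                lst[i], lst[i + 1] = lst[i + 1], lst[i]
    return lst

print(bubble_sort([3, 1, 4, 9, 8, 2]))  # [1, 2, 3, 4, 8, 9]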
Now, to answer your specific questions: in Python, replacing an item in a list is as simple as
list[-1] = list[i]
or
tmp = list[-1]
list[-1] = list[i]
list[i] = tmp
As to recursion - I don't think it's a good idea to use it; a simple while/for loop is better here.
Maybe you can try a quicksort this way:
def quicksort(array, down, up):
    # sort array[down..up] in place
    if down >= up:
        return
    pivot = array[(down + up) // 2]
    i, j = down, up
    while i <= j:                      # do it until the indexes cross
        while array[i] < pivot: i += 1
        while array[j] > pivot: j -= 1
        if i <= j:                     # out-of-order pair: switch it
            array[i], array[j] = array[j], array[i]
            i, j = i + 1, j - 1
    quicksort(array, down, j)          # recurse on the lower part
    quicksort(array, i, up)            # recurse on the upper part
You can check this link: Quicksort on Wikipedia.
It is nearly the same logic.
Hope it will help !
The easiest way to swap the two list elements is by using “parallel assignment”:
list[-1], list[i] = list[i], list[-1]
It doesn't really make sense to use recursion for this algorithm. If you call func(lst[i+1:]), that makes a copy of those elements of the list, and the recursive call operates on the copy, and then the copy is discarded. You could make func take two arguments: the list and i+1.
But your code is still broken. The not list test is incorrect, and the i = i + 1 is incorrect. What you are describing sounds like a variation of selection sort where you do a bunch of extra swapping.
Here's how a selection sort normally works.
Find the smallest of all elements and swap it into index 0.
Find the smallest of all remaining elements (all indexes greater than 0) and swap it into index 1.
Find the smallest of all remaining elements (all indexes greater than 1) and swap it into index 2.
And so on.
To simplify, the algorithm is this: find the smallest of all remaining (unsorted) elements, and append it to the list of sorted elements. Repeat until there are no remaining unsorted elements.
We can write it in Python like this:
def func(elements):
    for firstUnsortedIndex in range(len(elements)):
        # elements[0:firstUnsortedIndex] are sorted
        # elements[firstUnsortedIndex:] are not sorted
        bestIndex = firstUnsortedIndex
        for candidateIndex in range(bestIndex + 1, len(elements)):
            if elements[candidateIndex] < elements[bestIndex]:
                bestIndex = candidateIndex
        # Now bestIndex is the index of the smallest unsorted element
        elements[firstUnsortedIndex], elements[bestIndex] = elements[bestIndex], elements[firstUnsortedIndex]
        # Now elements[0:firstUnsortedIndex+1] are sorted, so it's safe to increment firstUnsortedIndex
    # Now all elements are sorted.
Test:
>>> testList = [3, 1, 4, 9, 8, 2]
>>> func(testList)
>>> testList
[1, 2, 3, 4, 8, 9]
If you really want to structure this so that recursion makes sense, here's how. Find the smallest element of the list. Then call func recursively, passing all the remaining elements. (Thus each recursive call passes one less element, eventually passing zero elements.) Then prepend that smallest element onto the list returned by the recursive call. Here's the code:
def func(elements):
    if len(elements) == 0:
        return elements
    bestIndex = 0
    for candidateIndex in range(1, len(elements)):
        if elements[candidateIndex] < elements[bestIndex]:
            bestIndex = candidateIndex
    return [elements[bestIndex]] + func(elements[0:bestIndex] + elements[bestIndex + 1:])

Efficient Python way to process two huge files?

I am working on a problem where I have to find whether a number falls within a certain range. However, the problem is complicated by the fact that the files I am dealing with have hundreds of thousands of lines.
Below I try to explain the problem in as simple a language as possible.
Here is a brief description of my input files :
File Ranges.txt has some ranges whose min and max are tab separated.
10 20
30 40
60 70
This can have about 10,000,000 such lines with ranges.
NOTE: The ranges never overlap.
File Numbers.txt has a list of numbers and some values associated with each number.
12 0.34
22 0.14
34 0.79
37 0.87
And so on. Again there are hundreds of thousands of such lines with numbers and their associated values.
What I wish to do is take every number from Numbers.txt and check if it falls within any of the ranges in Ranges.txt.
For all such numbers that fall within a range, I have to get a mean of their associated values (ie a mean per range).
For example, in Numbers.txt above there are two numbers, 34 and 37, that fall within the range 30-40 in Ranges.txt, so for the range 30-40 I have to calculate the mean of their associated values (i.e. the mean of 0.79 and 0.87), which is 0.83.
My final output file should be the Ranges.txt but with the mean of the associated values of all numbers falling within each range. Something like :
Output.txt
10 20 <mean>
30 40 0.83
60 70 <mean>
and so on.
Would appreciate any help and ideas on how this can be written efficiently in Python.
Obviously you need to run each line from Numbers.txt against each line from Ranges.txt.
You could just iterate over Numbers.txt, and, for each line, iterate over Ranges.txt. But this will take forever, reading the whole Ranges.txt file millions of times.
You could read both of them into memory, but that will take a lot of storage, and it means you won't be able to do any processing until you've finished reading and preprocessing both files.
So, what you want to do is read Ranges.txt into memory once and store it as, say, a list of pairs of ints instead, but read Numbers.txt lazily, iterating over the list for each number.
This kind of thing comes up all the time. In general, you want to make the bigger collection into the outer loop, and make it as lazy as possible, while the smaller collection goes into the inner loop, and is pre-processed to make it as fast as possible. But if the bigger collection can be preprocessed more efficiently (and you have enough memory to store it!), reverse that.
And speaking of preprocessing, you can do a lot better than just reading into a list of pairs of ints. If you sorted Ranges.txt, you could find the closest range without going over by bisecting (about 23 steps for 10,000,000 ranges) and then just check that one range, instead of checking each range exhaustively (up to 10,000,000 steps).
This is a bit of a pain with the stdlib, because it's easy to make off-by-one errors when using bisect, but there are plenty of ActiveState recipes to make it easier (including one linked from the official docs), not to mention third-party modules like blist or bintrees that give you a sorted collection in a simple OO interface.
So, something like this:
import bisect

with open('ranges.txt') as f:
    ranges = sorted(tuple(map(int, line.split())) for line in f)
range_starts = [r[0] for r in ranges]
range_values = {}
with open('numbers.txt') as f:
    for line in f:
        number_str, value_str = line.split()
        number, value = int(number_str), float(value_str)
        # find the closest range starting at or below number, then check
        # that number does not go over the end of that range
        i = bisect.bisect_right(range_starts, number) - 1
        if i >= 0 and number <= ranges[i][1]:
            range_values.setdefault(ranges[i], []).append(value)
with open('output.txt', 'w') as f:
    for r, values in sorted(range_values.items()):
        mean = sum(values) / len(values)
        f.write('{} {} {}\n'.format(r[0], r[1], mean))
By the way, if the parsing turns out to be any more complicated than just calling split on each line, I'd suggest using the csv module… but it looks like that won't be a problem here.
What if you can't fit Ranges.txt into memory, but can fit Numbers.txt? Well, you can sort that, then iterate over Ranges.txt, find all of the matches in the sorted numbers, and write the results out for that range.
This is a bit more complicated, because you have to bisect_left and bisect_right and iterate over everything in between. But that's the only way in which it's any harder. (And here, a third-party class will help even more. For example, with a bintrees.FastRBTree as your sorted collection, it's just sorted_number_tree[low:high].)
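A sketch of that variant, assuming the number/value pairs have already been read into a list of (number, value) tuples called numbers:
import bisect

numbers.sort()                  # sort the (number, value) pairs by number
keys = [n for n, _ in numbers]
with open('ranges.txt') as rf, open('output.txt', 'w') as out:
    for line in rf:
        low, high = map(int, line.split())
        lo = bisect.bisect_left(keys, low)    # index of the first number >= low
        hi = bisect.bisect_right(keys, high)  # index of the first number > high
        values = [v for _, v in numbers[lo:hi]]
        if values:
            out.write('{} {} {}\n'.format(low, high, sum(values) / len(values)))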
If the ranges can overlap, you need to be a bit smarter—you have to find the closest range without going over the start, and the closest range without going under the end, and check everything in between. But the main trick there is the exact same one used for the last version. The only other trick is to keep two copies of ranges, one sorted by the start value and one by the end, and you'll need to have one of them be a map to indices in the other instead of just a plain list.
The naive approach would be to read Numbers.txt into some structure in number order, then read each line of Ranges.txt, use a binary search to find the lowest number in the range, and then read through the numbers higher than that to find all those within the range, so that you can produce the corresponding line of output.
I assume the problem is that you can't have all of Numbers in memory.
So you could do the problem in phases, where each phase reads a portion of Numbers in, then goes through the process outlined above, but using an annotated version of Ranges, where each line includes the COUNT of the values so far that have produced that mean, and writes a similarly annotated version.
Obviously, the initial pass will not have an annotated version of Ranges, and the final pass will not produce one.
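For instance, the per-range merge step of one phase might look like this sketch, using the (mean, COUNT) annotation described above:
def merge_phase(mean, count, chunk_sum, chunk_count):
    # fold this phase's partial sum and count into the running annotated mean
    total = count + chunk_count
    if total == 0:
        return 0.0, 0
    return (mean * count + chunk_sum) / total, total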
It looks like the data in both files is already sorted. If not, first sort them with an external tool or using Python.
Then you can go through the two files in parallel. You read a number from the Numbers.txt file and see whether it is in a range in the Ranges.txt file, reading as many lines from that file as needed to answer the question. Then read the next number from Numbers.txt, and repeat. The idea is similar to merging two sorted arrays, and should run in O(n + m) time, where n and m are the sizes of the files. If you need to sort the files first, the running time is O(n lg(n) + m lg(m)). Here is a quick program I wrote to implement this:
import sys
from collections import Counter

class Gen(object):
    __slots__ = ('rfp', 'nfp', 'mn', 'mx', 'num', 'val', 'd', 'n')

    def __init__(self, ranges_filename, numbers_filename):
        self.d = Counter()  # sum of associated values keyed by range
        self.n = Counter()  # number of elements keyed by range
        self.rfp = open(ranges_filename)
        self.nfp = open(numbers_filename)
        # Read the first number/value pair and the first range
        fields = next(self.nfp).split()
        self.num, self.val = int(fields[0]), float(fields[1])  # Current number and value
        self.mn, self.mx = [int(x) for x in next(self.rfp).split()]  # Current range

    def go(self):
        while True:
            if self.mx < self.num:
                # Current range ends before the current number; advance the range
                try:
                    self.mn, self.mx = [int(x) for x in next(self.rfp).split()]
                except StopIteration:
                    break
            else:
                if self.mn <= self.num <= self.mx:
                    self.d[(self.mn, self.mx)] += self.val
                    self.n[(self.mn, self.mx)] += 1
                # Advance to the next number
                try:
                    fields = next(self.nfp).split()
                    self.num, self.val = int(fields[0]), float(fields[1])
                except StopIteration:
                    break
        self.nfp.close()
        self.rfp.close()
        return self.d, self.n

def run(ranges_filename, numbers_filename):
    r = Gen(ranges_filename, numbers_filename)
    d, n = r.go()
    for mn, mx in sorted(d):
        s, N = d[(mn, mx)], n[(mn, mx)]
        if N:
            av = s / N
        else:
            av = 0
        sys.stdout.write('%d %d %.3f\n' % (mn, mx, av))
On files with 10,000,000 numbers in each, the above runs in about 1.5 minutes on my computer, not counting the output part.

Finding the most similar numbers across multiple lists in Python

In Python, I have 3 lists of floating-point numbers (angles), in the range 0-360, and the lists are not the same length. I need to find the triplet (with 1 number from each list) in which the numbers are the closest. (It's highly unlikely that any of the numbers will be identical, since this is real-world data.) I was thinking of using a simple lowest-standard-deviation method to measure agreement, but I'm not sure of a good way to implement this. I could loop through each list, comparing the standard deviation of every possible combination using nested for loops, and have a temporary variable save the indices of the triplet that agrees the best, but I was wondering if anyone had a better or more elegant way to do something like this. Thanks!
I wouldn't be surprised if there is an established algorithm for doing this, and if so, you should use it. But I don't know of one, so I'm going to speculate a little.
If I had to do it, the first thing I would try would be just to loop through all possible combinations of all the numbers and see how long it takes. If your data set is small enough, it's not worth the time to invent a clever algorithm. To demonstrate the setup, I'll include the sample code:
# setup
import itertools
from statistics import pvariance as variance

def distance(nplet):
    '''Takes a pair or triplet (an "n-plet") as a list, and returns its distance.
    A smaller return value means better agreement.'''
    # your choice of implementation here. Example:
    return variance(nplet)

# algorithm
def brute_force(*lists):
    return min(itertools.product(*lists), key=distance)
For a large data set, I would try something like this: first create one triplet for each number in the first list, with its first entry set to that number. Then go through this list of partially-filled triplets and for each one, pick the number from the second list that is closest to the number from the first list and set that as the second member of the triplet. Then go through the list of triplets and for each one, pick the number from the third list that is closest to the first two numbers (as measured by your agreement metric). Finally, take the best of the bunch. This sample code demonstrates how you could try to keep the runtime linear in the length of the lists.
def item_selection(listA, listB, listC):
    # make the list of partially-filled triplets
    triplets = [[a] for a in listA]
    iT = 0
    iB = 0
    while iT < len(triplets):
        # make iB the index of the first value in listB not below triplets[iT][0]
        while iB < len(listB) and listB[iB] < triplets[iT][0]:
            iB += 1
        if iB == 0:
            triplets[iT].append(listB[0])
        elif iB == len(listB):
            triplets[iT].append(listB[-1])
        else:
            # look at the values in listB just below and just above triplets[iT][0]
            # and add the closer one as the second member of the triplet
            dist_lower = distance([triplets[iT][0], listB[iB - 1]])
            dist_upper = distance([triplets[iT][0], listB[iB]])
            if dist_lower < dist_upper:
                triplets[iT].append(listB[iB - 1])
            elif dist_lower > dist_upper:
                triplets[iT].append(listB[iB])
            else:
                # if they are equidistant, add both, as two separate triplets
                first = triplets[iT][0]
                triplets[iT].append(listB[iB - 1])
                triplets.insert(iT + 1, [first, listB[iB]])
                iT += 1
        iT += 1
    # then another loop while iT < len(triplets) to add in the numbers from listC
    return min(triplets, key=distance)
The thing is, I can imagine situations where this wouldn't actually find the best triplet, for instance if a number from the first list is close to one from the second list but not at all close to anything in the third list. So something you could try is to run this algorithm for all 6 possible orderings of the lists. I can't think of a specific situation where that would fail to find the best triplet, but one might still exist. In any case the algorithm will still be O(N) if you use a clever implementation, assuming the lists are sorted.
def symmetrized_item_selection(listA, listB, listC):
    best_results = []
    for ordering in itertools.permutations([listA, listB, listC]):
        best_results.append(item_selection(*ordering))
    return min(best_results, key=distance)
Another option might be to compute all possible pairs of numbers between list 1 and list 2, between list 1 and list 3, and between list 2 and list 3. Then sort all three lists of pairs together, from best to worst agreement between the two numbers. Starting with the closest pair, go through the list pair by pair and any time you encounter a pair which shares a number with one you've already seen, merge them into a triplet. For a suitable measure of agreement, once you find your first triplet, that will give you a maximum pair distance that you need to iterate up to, and once you get up to it, you just choose the closest triplet of the ones you've found. I think that should consistently find the best possible triplet, but it will be O(N^2 log N) because of the requirement for sorting the lists of pairs.
import collections
import itertools

def pair_sorting(listA, listB, listC):
    # make all possible pairs of values from two lists
    # each pair has the structure ((number, origin_list), (number, origin_list))
    # so we know which lists the numbers came from
    all_pairs = []
    all_pairs += [((nA, 0), (nB, 1)) for (nA, nB) in itertools.product(listA, listB)]
    all_pairs += [((nA, 0), (nC, 2)) for (nA, nC) in itertools.product(listA, listC)]
    all_pairs += [((nB, 1), (nC, 2)) for (nB, nC) in itertools.product(listB, listC)]
    all_pairs.sort(key=lambda p: distance([p[0][0], p[1][0]]))
    # make a dict to track which (number, origin_list)s we've already seen
    pairs_by_number_and_list = collections.defaultdict(list)
    min_distance = float('inf')
    min_triplet = None
    # start with the closest pair
    for pair in all_pairs:
        # for the first value of the current pair, see if we've seen that particular
        # (number, origin_list) combination before
        for pair2 in pairs_by_number_and_list[pair[0]]:
            # if so, the current pair shares its first value with another pair,
            # so put the 3 unique values together to make a triplet
            this_triplet = (pair[1][0], pair2[0][0], pair2[1][0])
            # check if the triplet agrees more than the previous best triplet
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # do the same thing but checking the second element of the current pair
        for pair2 in pairs_by_number_and_list[pair[1]]:
            this_triplet = (pair[0][0], pair2[0][0], pair2[1][0])
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # finally, add the current pair to the list of pairs we've seen
        pairs_by_number_and_list[pair[0]].append(pair)
        pairs_by_number_and_list[pair[1]].append(pair)
    return min_triplet
N.B. I've written all the code samples in this answer out a little more explicitly than you'd do it in practice to help you to understand how they work. But when doing it for real, you'd use more list comprehensions and such things.
N.B.2. No guarantees that the code works :-P but it should get the rough idea across.
