Quickly remove outliers from list in Python?

I have many long lists of time and temperature values, each with the following structure:
list1 = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74], [6, 73], [7, 71], [8, 92], [9, 73]]
Some of the time/temperature pairs are incorrect spikes in the data. For example, at time 8, the temperature spiked to 92 degrees. I would like to get rid of these sudden jumps or dips in the temperature values.
To do this, I wrote the following code (I removed the stuff that isn't necessary and only copied the part that removes the spikes/outliers):
outlierpercent = 3
for i in values:
    temperature = i[1]
    index = values.index(i)
    if index > 0:
        prevtemp = values[index-1][1]
        pctdiff = (temperature/prevtemp - 1) * 100
        if abs(pctdiff) > outlierpercent:
            outliers.append(i)
While this works (outlierpercent sets the minimum percentage difference required for a reading to be considered a spike), it takes a very long time (5-10 minutes per list). My lists are extremely long (around 5 million data points each), and I have hundreds of lists.
Is there a much quicker way of doing this? My main concern here is time. There are other similar questions, but their answers don't seem efficient enough for lists this long, so I'm not sure how to proceed. Thanks!

outlierpercent = 3
for index in range(1, len(values)):
    temperature = values[index][1]
    prevtemp = values[index-1][1]
    pctdiff = (temperature/prevtemp - 1) * 100
    if abs(pctdiff) > outlierpercent:
        outliers.append(index)
This should do a lot better with time: the original loop calls values.index(i) on every iteration, which is itself an O(n) scan, making the whole loop O(n²); iterating by index avoids that.
Update:
The issue of only the first outlier being detected is that in the iteration right after an outlier, we compare against the outlier's own temperature (prevtemp = values[index-1][1]).
I believe you can avoid that by handling the previous temp better. Something like this:
outlierpercent = 3
prevtemp = values[0][1]
for index in range(1, len(values)):
    temperature = values[index][1]
    pctdiff = (temperature/prevtemp - 1) * 100
    # outlier - add to list and don't update prev temp
    if abs(pctdiff) > outlierpercent:
        outliers.append(index)
    # valid temp, update prev temp
    else:
        prevtemp = temperature

Using Numpy to speed up computations
With
values = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74], [6, 73], [7, 71], [8, 92], [9, 73]]
Numpy Code
import numpy as np

# Convert list to Numpy array
a = np.array(values)
# Calculate percent difference between consecutive temperatures
b = np.diff(a[:, 1]) * 100 / a[:-1, 1]
# Indices of outliers
outlier_indices = np.where(np.abs(b) > outlierpercent)[0]
if outlier_indices.size:
    # add one since b is one element shorter due to computing the difference
    print(a[outlier_indices + 1])
# Output: List of outliers same as original code
[[ 8 92]
[ 9 73]]
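If the goal is the cleaned series rather than just the outlier list, one hedged sketch (assuming flagged rows should simply be dropped) builds on the same percent-difference idea. Note that with this simple neighbour comparison, the reading right after a spike gets flagged too, since the drop back down also exceeds the threshold:

```python
import numpy as np

values = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74],
          [6, 73], [7, 71], [8, 92], [9, 73]]
outlierpercent = 3

a = np.array(values)
# Percent change of each temperature relative to its predecessor
b = np.diff(a[:, 1]) * 100 / a[:-1, 1]
# Indices into `a` of readings that jumped by more than the threshold
outlier_indices = np.where(np.abs(b) > outlierpercent)[0] + 1
# Drop the flagged rows to get the cleaned series
cleaned = np.delete(a, outlier_indices, axis=0)
print(cleaned[:, 0])  # times 1..7 survive; 8 and 9 are both flagged
```

Comparing against the last valid reading (as in the corrected loop approach) avoids flagging the recovery point, at the cost of losing easy vectorization.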

This should make two lists, valids and outliers.
I tried to keep math operations to a minimum for speed.
Pardon any typos, this was keyboard composed, untested.
lolim = None
outliers = []
outlierpercent = 3.0
lower_mult = (100.0 - outlierpercent) / 100.0
upper_mult = (100.0 + outlierpercent) / 100.0
for index, temp in values:
    if lolim is None:
        valids = [[index, temp]]  # start the valid list
        lolim, hilim = lower_mult * temp, upper_mult * temp  # create initial range
    else:
        if lolim <= temp <= hilim:
            valids.append([index, temp])  # new valid entry
            lolim, hilim = lower_mult * temp, upper_mult * temp  # update range
        else:
            outliers.append([index, temp])  # save outliers, keep old range

Related

How to use random.sample() within a for-loop to generate multiple, *non-identical* sample lists?

I would like to know how to use the python random.sample() function within a for-loop to generate multiple sample lists that are not identical.
For example, right now I have:
for i in range(3):
    sample = random.sample(range(10), k=2)
This will generate 3 sample lists containing two numbers each, but I would like to make sure none of those sample lists are identical. (It is okay if there are repeating values, i.e., (2,1), (3,2), (3,7) would be okay, but (2,1), (1,2), (5,4) would not.)
If you specifically need to "use random.sample() within a for-loop", then you could keep track of samples that you've seen, and check that new ones haven't been seen yet.
import random

seen = set()
for i in range(3):
    while True:
        sample = random.sample(range(10), k=2)
        print(f'TESTING: {sample = }')  # For demo
        fr = frozenset(sample)
        if fr not in seen:
            seen.add(fr)
            break
    print(sample)
Example output:
TESTING: sample = [0, 7]
[0, 7]
TESTING: sample = [0, 7]
TESTING: sample = [1, 5]
[1, 5]
TESTING: sample = [7, 0]
TESTING: sample = [3, 5]
[3, 5]
Here I made seen a set to allow fast lookups, and I converted sample to a frozenset so that order doesn't matter in comparisons. It has to be frozen because a set can't contain another set.
However, this could be very slow with different inputs, especially a larger number of samples or a smaller range to draw from. In theory, its runtime is unbounded, though in practice random's number generator is finite.
Alternatives
There are other ways to do the same thing that could be much more performant. For example, you could take a big random sample, then chunk it into the desired size:
n = 3
k = 2
upper = 10

def chunks(lst, size):
    # helper: split lst into consecutive pieces of the given size
    return [lst[i:i+size] for i in range(0, len(lst), size)]

sample = random.sample(range(upper), k=k*n)
for chunk in chunks(sample, k):
    print(chunk)
Example output:
[6, 5]
[3, 0]
[1, 8]
With this approach, you'll never get numbers repeated across samples (like the 3 in [[2,1], [3,2], [3,7]]), because the big sample contains only unique numbers.
This approach was inspired by Sven Marnach's answer on "Non-repetitive random number in numpy", which I coincidentally just read today.
It looks like you are trying to make a nested list of items drawn without repetition from the original list; you can try the code below.
import random

mylist = list(range(50))

def randomlist(mylist, k):
    length = lambda: len(mylist)
    newlist = []
    while length() >= k:
        newlist.append([mylist.pop(random.randint(0, length() - 1)) for _ in range(k)])
    newlist.append(mylist)
    return newlist

randomlist(mylist, 6)
[[2, 20, 36, 46, 14, 30],
[4, 12, 13, 3, 28, 5],
[45, 37, 18, 9, 34, 24],
[31, 48, 11, 6, 19, 17],
[40, 38, 0, 7, 22, 42],
[23, 25, 47, 41, 16, 39],
[8, 33, 10, 43, 15, 26],
[1, 49, 35, 44, 27, 21],
[29, 32]]
This should do the trick.
import random
import math

# create set to store samples
a = set()
# number of distinct elements in the population
m = 10
# sample size
k = 2
# number of samples
n = 3
# this protects against an infinite loop (see Safety Note)
if n > math.comb(m, k):
    print(
        f"Error: {math.comb(m, k)} is the number of {k}-combinations "
        f"from a set of {m} distinct elements."
    )
    exit()
# the meat
while len(a) < n:
    a.add(tuple(sorted(random.sample(range(m), k=k))))
print(a)
With a set you are guaranteed to get a collection with no duplicate elements. In a set, you would be allowed to have (1, 2) and (2, 1) inside, which is why sorted is applied. So if [1, 2] is drawn, sorted([1, 2]) returns [1, 2]. And if [2, 1] is subsequently drawn, sorted([2, 1]) returns [1, 2], which won't be added to the set because (1, 2) is already in the set. We use tuple because objects in a set have to be hashable and list objects are not.
I hope this helps. Any questions, please let me know.
Safety Note
To avoid an infinite loop when you change 3 to some large number, you need to know the maximum number of possible samples of the type that you desire.
The relevant mathematical concept for this is a combination.
Suppose your first argument of random.sample() is range(m) where
m is some arbitrary positive integer. Note that this means that the
sample will be drawn from a population of m distinct members
without replacement.
Suppose that you wish to have n samples of length k in total.
The number of possible k-combinations from the set of m distinct elements is
m! / (k! * (m - k)!)
You can get this value via
from math import comb
num_comb = comb(m, k)
comb(m, k) gives the number of different ways to choose k elements from m elements without repetition and without order, which is exactly what we want.
So in the example above, m = 10, k = 2, n = 3.
With these m and k, the number of possible k-combinations from the set of m distinct elements is 45.
You need to ensure that n is less than 45 if you want to use those specific m and k and avoid an infinite loop.
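When comb(m, k) is small enough to enumerate, a hedged alternative sketch is to enumerate the combinations and sample from them directly, which avoids the retry loop entirely (at the cost of materializing all comb(m, k) combinations):

```python
import itertools
import random

m, k, n = 10, 2, 3
# All k-combinations of range(m); each tuple is already sorted and appears once
all_combos = list(itertools.combinations(range(m), k))
# Draw n distinct combinations in one shot -- no rejection loop needed
samples = random.sample(all_combos, n)
```

Since random.sample draws without replacement, the n combinations are guaranteed distinct by construction.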

Accessing elements from a list?

I am trying to calculate the distance between two lists so I can find the shortest distance between all coordinates.
Here is my code:
import random
import math
import copy
def calculate_distance(starting_x, starting_y, destination_x, destination_y):
    # Euclidean (straight-line) distance between two points
    distance = math.hypot(destination_x - starting_x, destination_y - starting_y)
    return distance

def nearest_neighbour_algorithm(selected_map):
    temp_map = copy.deepcopy(selected_map)
    optermised_map = []  # we set up an empty optimised list to fill up
    # take the last element of temp_map as the starting point, removing it from temp_map
    optermised_map.append(temp_map.pop())
    for x in range(len(temp_map)):
        nearest_value = 1000
        nearest_index = 0
        for i in range(len(temp_map[x])):
            current_value = calculate_distance(*optermised_map[x], *temp_map[x])
I get an error at this part and I'm not sure why:
for i in range(len(temp_map[x])):
    current_value = calculate_distance(*optermised_map[x], *temp_map[x])
I am trying to find the distance between points in these two lists, and the error I get is that my list index is out of range at the for loop.
On the first iteration, optermised_map has length 1, but x iterates up to len(temp_map), which is likely more than 1, so optermised_map[x] goes out of range. I think you may have wanted:
for i in range(len(optermised_map)):
    current_value = calculate_distance(*optermised_map[i], *temp_map[x])
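For completeness, here is one hedged sketch of how the full nearest-neighbour loop might look, completing the question's fragment under the assumption that each map entry is an (x, y) pair (the function names are the question's own):

```python
import copy
import math

def calculate_distance(starting_x, starting_y, destination_x, destination_y):
    # Euclidean (straight-line) distance between two points
    return math.hypot(destination_x - starting_x, destination_y - starting_y)

def nearest_neighbour_algorithm(selected_map):
    temp_map = copy.deepcopy(selected_map)
    optermised_map = [temp_map.pop()]  # start from the last point
    while temp_map:
        last = optermised_map[-1]
        # index of the closest remaining point to the last chosen one
        nearest_index = min(
            range(len(temp_map)),
            key=lambda i: calculate_distance(*last, *temp_map[i]),
        )
        optermised_map.append(temp_map.pop(nearest_index))
    return optermised_map
```

For example, starting from (1, 1), the route visits (0, 0) before (5, 5) because it is closer.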
Are the lengths of the lists the same? I could be wrong, but this sounds like a cosine similarity exercise to me. Check out this very simple exercise.
from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
result
# 0.97228425171235
dataSetI = [1, 2, 3, 10]
dataSetII = [2, 4, 6, 20]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
result
# 1.0
dataSetI = [10, 200, 234, 500]
dataSetII = [45, 3, 19, 20]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
result
# 0.4991255575740505
In the second example, the ratios of the numbers in the two lists are exactly the same even though the numbers differ, which is why the similarity is 1.0; cosine similarity focuses on the ratios of the numbers.

How to improve time complexity of remove all multiplicands from array or list?

I am trying to find the elements of an integer array or list that are not divisible by any other element of the same array or list.
You can answer in any language like python, java, c, c++ etc.
I have tried this code in Python 3 and it works, but I am looking for a better, more optimal solution in terms of time complexity.
# assuming array or list A is already sorted and has unique elements
A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
i = 0
j = 1
while i < len(A)-1:
    while j < len(A):
        if A[j] % A[i] == 0:
            A.pop(j)
        else:
            j += 1
    i += 1
    j = i + 1
For the given array A=[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16], the answer would be ans=[2,3,5,7,11,13].
Another example: for A=[4,5,15,16,17,23,39], ans would be [4,5,17,23,39].
ans has unique numbers; an element i of the array survives only if (i % j) != 0 for every other element j.
I think it's more natural to do it in reverse, by building a new list containing the answer instead of removing elements from the original list. If I'm thinking correctly, both approaches do the same number of mod operations, but you avoid the issue of removing an element from a list.
A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
ans = []
for x in A:
    for y in ans:
        if x % y == 0:
            break
    else:
        ans.append(x)
Edit: Promoting the completion else.
This algorithm will perform much faster:
A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
if (A[-1]-A[0])/A[0] > len(A)*2:
    result = list()
    for v in A:
        for f in result:
            d, m = divmod(v, f)
            if m == 0:
                v = 0
                break
            if d < f:
                break
        if v:
            result.append(v)
else:
    retain = set(A)
    minMult = 1
    maxVal = A[-1]
    for v in A:
        if v not in retain:
            continue
        minMult = v*2
        if minMult > maxVal:
            break
        if v*len(A) < maxVal:
            retain.difference_update([m for m in retain if m >= minMult and m % v == 0])
        else:
            retain.difference_update(range(minMult, maxVal+1, v))
        if maxVal % v == 0:
            maxVal = max(retain)
    result = list(retain)

print(result)  # [2, 3, 5, 7, 11, 13]
In the spirit of the sieve of Eratosthenes, each number that is retained removes its multiples from the remaining eligible numbers. Depending on the magnitude of the highest value, it is sometimes more efficient to exclude multiples than to check for divisibility. The divisibility check takes several times longer for an equivalent number of factors to check.
At some point, when the data is widely spread out, assembling the result instead of removing multiples becomes faster (this last addition was inspired by Imperishable Night's post).
TEST RESULTS
A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] (100000 repetitions)
Original: 0.55 sec
New: 0.29 sec
A = list(range(2,5000))+[9697] (100 repetitions)
Original: 3.77 sec
New: 0.12 sec
A = list(range(1001,2000))+list(range(4000,6000))+[9697**2] (10 repetitions)
Original: 3.54 sec
New: 0.02 sec
I know that this is totally insane, but I want to know what you think about this:
A = [4,5,15,16,17,23,39]
prova = [[x for x in A if x != y and y % x == 0] for y in A]
print([A[idx] for idx, x in enumerate(prova) if len(prova[idx]) == 0])
And I think it's still O(n^2).
If you care about speed more than algorithmic efficiency, numpy would be the package to use here in python:
import numpy as np
# Note: doesn't have to be sorted
a = [2, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 16, 29, 29]
a = np.unique(a)
result = a[np.all((a % a[:, None] + np.diag(a)), axis=0)]
# array([2, 3, 5, 7, 11, 13, 29])
This divides all elements by all other elements and stores the remainder in a matrix, checks which columns contain only non-0 values (other than the diagonal), and selects all elements corresponding to those columns.
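To see why the diagonal is added back, it may help to run the expression on a tiny input: the remainder matrix always has zeros on its diagonal (every element divides itself), and adding np.diag(a) masks those out so only divisibility by *other* elements eliminates a column. A small sketch:

```python
import numpy as np

a = np.unique([2, 3, 4])
m = a % a[:, None]             # m[i, j] = a[j] % a[i]
masked = m + np.diag(a)        # restore a nonzero diagonal so self-division is ignored
keep = np.all(masked, axis=0)  # column j survives if no other element divides a[j]
result = a[keep]
print(result)  # 4 is dropped because 2 divides it
```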
This is O(n*M), where M is the maximum integer in your list. The integers are all assumed to be non-negative. This also assumes your input list is sorted (I came to that assumption since all the lists you provided are sorted).
a = [4, 7, 7, 8]
# a = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
# a = [4, 5, 15, 16, 17, 23, 39]
M = max(a)
used = set()
final_list = []
for e in a:
    if e in used:
        continue
    else:
        used.add(e)
        for i in range(e, M + 1):
            if not (i % e):
                used.add(i)
        final_list.append(e)
print(final_list)
Maybe this can be optimized even further...
If the list is not sorted, then for the above method to work, one must sort it first. The time complexity is then O(n log n + Mn), which is effectively O(n log n) when n >> M.

Check if the elements of a range are within a list of lists

I have a sorted list and a list of ranges:
>>> n= [10, 20, 30, 40]
>>> m= [[1, 20], [21, 30]]
What I am trying to do is to check whether all the elements of the n list fall within any of the existing ranges in m. For instance, in the above example, 40 is not within any of the ranges.
I tried to extend the answer to the question in the following post, but it does not seem to work.
Checking if all elements of a List of Lists are in another List of Lists Python
is_there = set(tuple(x) for x in [n]).issubset(tuple(x) for x in m)
You should go through each element in n and check if it's in the range of each list of m. Assuming you are only working with ints:
[any(x in range(r[0], r[1]) for r in m) for x in n]
If you want to include the end of your range, just add 1:
[any(x in range(r[0], r[1]+1) for r in m) for x in n]
The simple approach is to check all the elements:
items = [10, 20, 30, 40]
ranges = [[1, 20], [21, 30]]
result = all(any(low <= i <= high for low, high in ranges) for i in items)
For fun, you can make the containment check a bit different by using actual range objects:
range_objects = [range(low, high + 1) for low, high in ranges]
filtered_items = all(any(i in r for r in range_objects) for i in items)
If you wanted to get the matching items:
good = [i for i in items if any(low <= i <= high for low, high in ranges)]
You could also get the bad elements instead:
bad = [i for i in items if all(i < low or i > high for low, high in ranges)]
That way, your original result is just not bad.
Since you said "a sorted list", you can use the following min and max logic. outside will be True if any element of n falls outside the given ranges, and False if none does.
n = [10, 20, 30, 40]  # <- as per you, this is sorted
m = [[1, 20], [21, 30]]
outside = any(min(n) < i[0] and max(n) > i[1] for i in m)
# True
Edit: answering the test case asked by Peter DeGlopper in the comment below.
m = [[1, 20], [31, 40]]
n = [10, 20, 25, 30, 40]
outside = any(not any(l <= i <= r for l, r in m) for i in n)
# True

python find intersection timeranges in array

I have a Python numpy array with two rows: one row holds the start time of each event and the other the end time (times given as epoch integers). In the example below, the event at index=0 starts at time=1 and ends at time=7.
start = [1, 8, 15, 30]
end = [7, 16, 20, 40]
timeranges = np.array([start,end])
I want to know if any of the time ranges intersect. That means I need a function/algorithm that determines, for example, that the range from 8 to 16 intersects the range from 15 to 20.
My solution is to use two nested loops and check whether any start or end time falls within another time range. But it takes very long, because my time ranges contain nearly 10000 events.
Is there an elegant solution to get the result in "short" time (e.g. below one minute)?
Store the data as a collection of (time,index_in_list,start_or_end). For example, if the input data is:
start = [1, 8, 15, 30]
end = [7, 16, 20, 40]
Transform the input data to a list of tuples as follows:
def extract_times(times, is_start):
    return [(times[i], i, is_start) for i in range(len(times))]
Which yields:
extract_times(start, True) == [(1, 0, True), (8, 1, True), (15, 2, True), (30, 3, True)]
extract_times(end, False) == [(7, 0, False), (16, 1, False), (20, 2, False), (40, 3, False)]
Now, merge the two lists and sort them.
Then traverse the merged list from beginning to end, keeping track of the currently intersecting intervals and updating the state based on whether each new tuple is a beginning or an ending of an interval. This way you'll find all overlaps.
The complexity is O(n log(n)) for the sorting plus some overhead if there are lots of intersections.
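A minimal sketch of that sweep, assuming ranges that merely touch at a boundary do not count as intersecting (ends are processed before starts at equal times):

```python
def find_intersections(start, end):
    # Events: (time, kind, index) where kind 0 = end, 1 = start,
    # so ends sort before starts at equal times
    events = sorted(
        [(t, 1, i) for i, t in enumerate(start)] +
        [(t, 0, i) for i, t in enumerate(end)]
    )
    active = set()  # indices of intervals currently open
    overlaps = []
    for t, kind, i in events:
        if kind == 1:
            # The new interval intersects every currently open one
            overlaps.extend((min(i, j), max(i, j)) for j in sorted(active))
            active.add(i)
        else:
            active.discard(i)
    return overlaps

print(find_intersections([1, 8, 15, 30], [7, 16, 20, 40]))  # [(1, 2)]
```

Swapping the 0/1 kind values would make touching boundaries count as intersections instead.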
Given that the input lists might not be sorted and to handle cases where we might see timeranges with multiple intersections, here's a brute-force comparison based method using broadcasting -
np.argwhere(np.triu(timeranges[1][:,None] > timeranges[0],1))
Sample runs
Original sample case :
In [81]: timeranges
Out[81]:
array([[ 1, 8, 15, 30],
[ 7, 16, 20, 40]])
In [82]: np.argwhere(np.triu(timeranges[1][:,None] > timeranges[0],1))
Out[82]: array([[1, 2]])
Multiple intersections case :
In [77]: timeranges
Out[77]:
array([[ 5, 7, 18, 12, 19],
[11, 17, 28, 19, 28]])
In [78]: np.argwhere(np.triu(timeranges[1][:,None] > timeranges[0],1))
Out[78]:
array([[0, 1],
[1, 3],
[2, 3],
[2, 4]])
If by "within" in "if any start time or end time is within an other timerange" you meant that the boundaries are inclusive, change the comparison > to >= in the solution code.
