Python: Find outliers inside a list - python

I have a list containing an arbitrary number of integers and/or floats. I'm trying to find the exceptions among my numbers (hopefully these are the right words to explain it). For example:
list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
90 to 99% of my integer values are between 1 and 20
sometimes there are values that are much higher, let's say somewhere around 100 or 1,000 or even more
My problem is that these values can be different every time. Maybe the regular range is somewhere between 1,000 and 1,200 and the exceptions are around half a million.
Is there a function to filter out these special numbers?

Assuming your list is l:
If you know you want to filter by a certain percentile/quantile, you can use the snippet below.
It removes the values at or below the 10th percentile and at or above the 90th percentile (the bottom 10% and the top 10%). Of course, you can change either cut-off (for example, in your case you can drop the bottom filter and only remove values above the 90th percentile):
import numpy as np
l = np.array(l)
l = l[(l>np.quantile(l,0.1)) & (l<np.quantile(l,0.9))].tolist()
output:
[3, 2, 14, 2, 8, 4, 3, 5]
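A minimal sketch of the top-only variant mentioned above, assuming the same sample list (the names `values`, `upper` and `top_trimmed` are introduced here only for illustration). Dropping the lower bound keeps the small values and removes only what lies at or above the 90th percentile:

```python
import numpy as np

values = np.array([1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5])

# Keep everything strictly below the 90th percentile; no lower cut-off.
upper = np.quantile(values, 0.9)
top_trimmed = values[values < upper].tolist()
print(top_trimmed)  # [1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
```

For this sample the two large values (108 and 97) are dropped and everything else is kept.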
If you are not sure of the percentile cut-off and just want to remove outliers, you can use the median-based function below.
Adjust the cut-off through the argument m in the function call: the larger it is, the fewer outliers are removed. This function seems to be more robust to various types of outliers than other outlier removal techniques.
import numpy as np
l = np.array(l)
def reject_outliers(data, m=6.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m].tolist()
print(reject_outliers(l))
output:
[1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
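If NumPy is not available, the same median-based idea can be sketched with the standard library only. This is an illustrative variant, not the answer's own code; `reject_outliers_stdlib`, `sample` and `filtered` are names made up for this example:

```python
from statistics import median

def reject_outliers_stdlib(data, m=6.0):
    """Keep values whose distance from the median is below m times
    the median absolute deviation."""
    center = median(data)
    deviations = [abs(v - center) for v in data]
    mdev = median(deviations) or 1.0  # fall back to 1 to avoid dividing by zero
    return [v for v, d in zip(data, deviations) if d / mdev < m]

sample = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
filtered = reject_outliers_stdlib(sample)
print(filtered)  # [1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
```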

You can use the built-in filter() function with a fixed threshold (here, keeping only the values greater than 5):
lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
lst2 = list(filter(lambda x: x > 5,lst1))
print(lst2)
Output:
[14, 108, 8, 97]

Here is a method to filter out those deviating values, based on the normal probability density:
import math
_list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
def consts(_list):
    # Mean of the values
    mu = 0
    for i in _list:
        mu += i
    mu = mu/len(_list)
    # Population standard deviation
    sigma = 0
    for i in _list:
        sigma += math.pow(i-mu,2)
    sigma = math.sqrt(sigma/len(_list))
    return sigma, mu
def frequence(x, sigma, mu):
    # Normal probability density at x
    return (1/(sigma*math.sqrt(2*math.pi)))*math.exp(-(1/2)*math.pow(((x-mu)/sigma),2))
sigma, mu = consts(_list)
new_list = []
for i in range(len(_list)):
    if frequence(_list[i], sigma, mu) > 0.01:
        new_list.append(_list[i])  # keep the value itself, not its index
print(new_list)

Related

Length of the intersections between a list and a list of lists

Note: almost a duplicate of Numpy vectorization: Find intersection between list and list of lists
Differences:
I am focused on efficiency when the lists are large
I'm searching for the largest intersections.
x = [500 numbers between 1 and N]
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12], etc. up to N]
Here are some assumptions:
y is a list of ~500,000 sublists of ~500 elements each
each sublist in y is a range, so y is characterized by the last element of each sublist. In the example: 3, 7, 9, 12 ...
x is not sorted
y contains each number between 1 and ~500000*500 once and only once
y is sorted in the sense that, as in the example, the sublists are sorted and the first element of one sublist immediately follows the last element of the previous one.
y is known long before even compile-time
My goal is to find which sublists of y share at least 10 elements with x.
I can obviously make a loop :
def find_best(x, y):
    result = []
    x_set = set(x)  # build the set once, not once per sublist
    for index, sublist in enumerate(y):
        intersection = x_set.intersection(sublist)
        if len(intersection) >= 2:  # in real life: >= 10
            result.append(index)
    return result
x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
res = find_best(x, y)
print(res) # [0, 2]
Here the result is [0, 2] because the first and third sublists of y have at least 2 elements in common with x.
Another method is to parse y only once and count the intersections:
def find_intersec2(x, y):
    n_sublists = len(y)
    res = {num: 0 for num in range(0, n_sublists + 1)}
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in x:
                x.remove(num)
                res[list_no] += 1
    return [n for n in range(n_sublists + 1) if res[n] >= 2]
This second method makes more use of the assumptions.
Questions:
What optimizations are possible?
Is there a completely different approach? Indexing, a k-d tree? In my use case, the large list y is known days before the actual run, so I'm not afraid of building an index or whatever from y. The small list x is only known at runtime.
Since y contains disjoint ranges and the union of them is also a range, a very fast solution is to first perform a binary search on y and then count the resulting indices and only return the ones that appear at least 10 times. The complexity of this algorithm is O(Nx log Ny) with Nx and Ny the number of items in respectively x and y. This algorithm is nearly optimal (since x needs to be read entirely).
Actual implementation
First of all, you need to transform your current y to a Numpy array containing the beginning value of all ranges (in an increasing order) with N as the last value (assuming N is excluded for the ranges of y, or N+1 otherwise). This part can be assumed as free since y can be computed at compile time in your case. Here is an example:
import numpy as np
y = np.array([1, 4, 8, 10, 13, ..., N])
Then, you need to perform the binary search and check that the values fit in the range of y:
indices = np.searchsorted(y, x, 'right')
# The `0 < indices < len(y)` check should not be needed regarding the input.
# If so, you can use only `indices -= 1`.
indices = indices[(0 < indices) & (indices < len(y))] - 1
Then you need to count the indices and keep the ones that occur at least 10 times:
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]
Here is an example based on yours:
x = np.array([1, 2, 3, 4, 5, 6])
# [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
y = np.array([1, 4, 5, 7, 8, 12])
# Actual simplified version of the above algorithm
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 2]
# [0, 2]
print(result.tolist())
It runs in less than 0.1 ms on my machine on a random input based on your input constraints.
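The same binary-search idea can also be sketched without NumPy, using the standard-library `bisect` module. This is only an illustration under the question's assumptions (every value of x falls inside some range of y); `sublists_with_hits` and `starts` are hypothetical names:

```python
from bisect import bisect_right
from collections import Counter

def sublists_with_hits(starts, x, min_hits):
    """starts: sorted first values of each range, plus one final sentinel.
    Returns the indices of the ranges hit at least min_hits times by x."""
    # bisect_right(...) - 1 is the index of the range that contains v
    counts = Counter(bisect_right(starts, v) - 1 for v in x)
    return sorted(i for i, c in counts.items() if c >= min_hits)

# [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]] -> range starts, then 12 as sentinel
starts = [1, 4, 5, 7, 8, 12]
hits = sublists_with_hits(starts, [1, 2, 3, 4, 5, 6], 2)
print(hits)  # [0, 2]
```

For large inputs the vectorized NumPy version is likely to be faster; this variant mainly shows what `searchsorted` with `'right'` is doing.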
Turn y into 2 dicts.
index = { # index to count map
0 : 0,
1 : 0,
2 : 0,
3 : 0,
4 : 0
}
y = { # elem to index map
1: 0,
2: 0,
3: 0,
4: 1,
5: 2,
6: 2,
7: 3,
8 : 4,
9 : 4,
10 : 4,
11 : 4
}
Since you know y in advance, I don't count the above operations into the time complexity. Then, to count the intersection:
x = [1, 2, 3, 4, 5, 6]
for e in x: index[y[e]] += 1
Since you mentioned x is small, I tried to make the time complexity depend only on the size of x (in this case O(n)).
Finally, the answer is the list of keys in index dict where the value is >= 2 (or 10 in real case).
answer = [i for i in index if index[i] >= 2]
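Rather than writing the two dicts by hand, they could be built from the original list of lists ahead of time. A minimal sketch, assuming y is still available in its list-of-lists form; the helper names `build_lookup` and `count_hits` are made up for this example:

```python
def build_lookup(y):
    """Map every element of every sublist to the index of its sublist."""
    return {elem: idx for idx, sublist in enumerate(y) for elem in sublist}

def count_hits(x, elem_to_index, n_sublists, min_hits):
    """Return the indices of the sublists hit at least min_hits times by x."""
    counts = [0] * n_sublists
    for e in x:
        counts[elem_to_index[e]] += 1  # one dict lookup per element of x
    return [i for i, c in enumerate(counts) if c >= min_hits]

y_lists = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
lookup = build_lookup(y_lists)  # precomputed once, before x is known
answer_idx = count_hits([1, 2, 3, 4, 5, 6], lookup, len(y_lists), 2)
print(answer_idx)  # [0, 2]
```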
This uses y to create a linear array, called x2range_counter, that maps every int to 1 plus the index of the range (subgroup) the int is in.
x2range_counter uses an array.array of unsigned integers (typecode 'L', at least 32 bits per item) to save memory, and it can be cached and reused for all x on the same y.
Calculating the hits in each range for a particular x is then just an indirect array increment of a counter in the function count_ranges.
import array
from typing import List
y = [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
x = [5, 3, 1, 11, 8, 10]
range_counter_max = len(y)
extent = y[-1][-1] + 1 # min in y must be 1 not 0 remember.
x2range_counter = array.array('L', [0] * extent) # compact unsigned-int array storage
# Map any int in any x to appropriate ranges counter.
for range_counter_index, rng in enumerate(y, start=1):
    for n in rng:
        x2range_counter[n] = range_counter_index
print(x2range_counter) # array('L', [0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4])
# x2range_counter can be saved for this y and any x on this y.
def count_ranges(x: List[int]) -> array.array:
    "Number of x-hits on each y subgroup in order"
    # Note: count[0] initially catches errors. count[1..] counts x's in y ranges [0..]
    count = array.array('L', [0] * (range_counter_max + 1))
    for xx in x:
        count[x2range_counter[xx]] += 1
    assert count[0] == 0, "x values must all exist in a y range and y must have all int in its range."
    return count[1:]
print(count_ranges(x)) # array('L', [1, 2, 1, 2])
I created a class for this, with extra functionality such as returning the ranges rather than the indices; all ranges hit >=M times; (range, hit-count) tuples sorted most hit first.
Range calculations for different x are proportional to x and are simple array lookups rather than any hashing of dicts.
What do you think?

Distribute elements based on percentages

Let's say that I want to distribute a number of items (n) into an array of fixed size (x).
The difficult part is that I have to distribute the items using a flexibility array.
Assuming that x = 4, n = 11 and flexibility = [20, 20, 30, 30] with len(flexibility) == x.
My question is:
How can I distribute the n elements in an array of length x using the percentages defined in flexibility?
What I want at the end is something like:
n = 11
x = 4
flexibility = [20, 20, 30, 30]
distributed = distribute_elements_in_slots(n, x, flexibility)
print(distributed)
# distributed = [2, 2, 3, 4]
In the case of equal flexibility values, the final result will depend on the rule we decide to apply so that all the items are used. In the previous case, both [2, 2, 3, 4] and [2, 2, 4, 3] are acceptable results.
Edit: An example of the method that I want to have is as follows:
def distribute_elements_in_slots(n, x, flexibility=[25,25,25,25]):
    element_in_slots = []
    element_per_percentage = n / 100  # items per percentage point
    for i in range(x):
        element_in_slots.append(round(element_per_percentage * flexibility[i]))
    return element_in_slots  # because of rounding, the total may differ from n
Edit 2: One of the solutions that I found is the following:
from math import floor
def distribute_elements_in_slots(n, x, flexibility=[25,25,25,25]):
    element_in_slots = [f * n / 100 for f in flexibility]
    carry = 0
    for i in range(len(element_in_slots)):
        element = element_in_slots[i] + carry
        element_in_slots[i] = floor(element)
        carry = element - floor(element)
    if sum(element_in_slots) < n:
        # Here the carry is almost 1
        max_index = flexibility.index(max(flexibility))
        element_in_slots[max_index] = element_in_slots[max_index] + 1
    return element_in_slots
This distributes the items almost evenly across the slots, based on the flexibility array.
What you need to do is split the number 11 according to the percentages given in the array, so each slot initially gets percentage * 11 / 100. Then we truncate each share, compute the remainder, and assign it somewhere, which in your case is the last element (f is the flexibility list here).
In [10]: [i*n/100 for i in f]
Out[10]: [2.2, 2.2, 3.3, 3.3]
In [11]: b=[i*n/100 for i in f]
In [12]: rem = sum(b) - sum(map(int,b))
In [13]: rem
Out[13]: 1.0
In [24]: b= list(map(int,b))
In [26]: b[-1] +=rem
In [27]: b
Out[27]: [2, 2, 3, 4.0]
Hope it helps. :)
As Albin Paul did, we need to allocate the whole-number amount for each slot's percentage. The leftovers need to be allocated, largest first.
def distribute_elements_in_slots(total, slots, pct):
    # Compute proportional distribution by given percentages.
    distr = [total * pct[i] / 100 for i in range(slots)]
    # Truncate each position and store the difference in a new list.
    solid = [int(elem) for elem in distr]
    short = [distr[i] - solid[i] for i in range(slots)]
    print(distr)
    print(solid)
    print(short)
    # allocate leftovers
    leftover = int(round(sum(short)))
    print(leftover)
    # For each unallocated item,
    # find the neediest slot, and put an extra there.
    for i in range(leftover):
        shortest = short.index(max(short))
        solid[shortest] += 1
        short[shortest] = 0
        print("Added 1 to slot", shortest)
    return solid
n = 11
x = 4
flexibility = [20, 20, 30, 30]
distributed = distribute_elements_in_slots(n, x, flexibility)
print(distributed)
# distributed = [2, 2, 3, 4]
Output:
[2.2, 2.2, 3.3, 3.3]
[2, 2, 3, 3]
[0.2, 0.2, 0.3, 0.3]
1
Added 1 to slot 2
[2, 2, 4, 3]
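A possible variant of this largest-remainder approach, sketched with integer arithmetic only so that floating-point rounding cannot affect the leftovers. It assumes the percentages are whole numbers that sum to 100; `distribute_integer` is a hypothetical name, not part of the answer above:

```python
def distribute_integer(total, pct):
    """Largest-remainder split of `total` items.
    Assumes pct holds whole-number percentages that sum to 100."""
    scaled = [total * p for p in pct]        # each share, in hundredths of an item
    base = [s // 100 for s in scaled]        # whole items per slot
    remainders = [s % 100 for s in scaled]   # what truncation left out
    leftover = total - sum(base)
    # Slots with the largest remainder get the extras; ties go to the earlier slot.
    order = sorted(range(len(pct)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        base[i] += 1
    return base

result = distribute_integer(11, [20, 20, 30, 30])
print(result)  # [2, 2, 4, 3]
```

The tie-breaking rule (earlier slot first) is a design choice; sorting ties the other way would give [2, 2, 3, 4], which the question also accepts.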

How to add numbers in your list, incrementally, while also being sorted from lowest to highest value?

I'm trying to write code to firstly, order numbers from lowest to highest (e.g. 1, 3, 2, 4, 5 to 1, 2, 3, 4, 5). Secondly, I would like to incrementally add the numbers in the list.
eg.
1
3
6
10
15
I've already tried using the sum function, then the sorted function, but I was wondering if I can write them neatly in a code to just get everything worked out.
Addition = [1, 13, 166, 3, 80, 6, 40]
print(sorted(Addition))
I was able to get the numbers sorted horizontally, but I wasn't able to get the numbers added vertically.
You need a cumulative sum. You can write one with a simple loop that yields the running total as it goes:
def cumulative_add(array):
    total = 0
    for item in array:
        total += item
        yield total
>>> list(cumulative_add([1,2,3,4,5]))
[1, 3, 6, 10, 15]
Depending on your goals, you may also wish to use a library, such as pandas, that already provides a cumulative sum.
For example,
>>> import pandas as pd
>>> s = pd.Series([1,2,3,4,5])
>>> s.cumsum()
0     1
1     3
2     6
3    10
4    15
dtype: int64
You can use itertools.accumulate with sorted:
import itertools
mylist = [1, 2, 3, 4, 5]
result = list(itertools.accumulate(sorted(mylist)))
# result: [1, 3, 6, 10, 15]
The default action is operator.add, but you can customize it. For example, you can do running product instead of running sum if you needed it:
import itertools
import operator
mylist = [1, 2, 3, 4, 5]
result = list(itertools.accumulate(sorted(mylist), operator.mul))
# result: [1, 2, 6, 24, 120]

Single line chunk re-assignment

As shown in the following code, I have a chunk list x and the full list h. I want to reassign back the values stored in x in the correct positions of h.
index = 0
for t1 in range(lbp, ubp):
h[4 + t1] = x[index]
index = index + 1
Does anyone know how to write it in a single line/expression?
Disclaimer: This is part of a bigger project and I simplified the questions as much as possible. You can expect the matrix sizes to be correct but if you think I am missing something please ask for it. For testing you can use the following variable values:
h = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x = [20, 21]
lbp = 2
ubp = 4
You can use slice assignment to expand on the left-hand side and assign your x list directly to the indices of h, e.g.:
h = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x = [20, 21]
lbp = 2
ubp = 4
h[4 + lbp:4 + ubp] = x # or better yet h[4 + lbp:4 + lbp + len(x)] = x
print(h)
# [1, 2, 3, 4, 5, 6, 20, 21, 9, 10]
I'm not really sure why you are adding 4 to the indexes in your loop, nor what lbp and ubp are supposed to mean, though. Keep in mind that when you assign to a range like this, the list you're assigning should have the same length as the range; if it doesn't, a plain slice assignment will not raise an error but will silently grow or shrink h.
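A quick sketch of why matching lengths matters here (the values are made up for illustration): with a plain slice, Python does not complain when the lengths differ, it resizes the list instead.

```python
h_demo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x_demo = [20, 21, 22]          # three values ...

h_demo[6:8] = x_demo           # ... assigned to a two-element slice
print(h_demo)       # [1, 2, 3, 4, 5, 6, 20, 21, 22, 9, 10]
print(len(h_demo))  # 11: the list grew by one element
```

Using `h[4 + lbp:4 + lbp + len(x)] = x`, as suggested in the comment above, keeps the length of h unchanged as long as the slice stays within the list.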

How do I implement this similarity measure in Python?

I tried implementing the distance measure shown in the image, in Python as such:
import numpy as np
A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]
A = np.asarray(A).flatten()
B = np.asarray(B).flatten()
x = np.sum(1 - np.divide((1 + np.minimum(A, B)), (1 + np.maximum(A, B))))
print("Distance: {}".format(x))
but after testing, it doesn't seem to be the right approach. The maximum value returned if there's no similarity at all between the given vectors should be 1, with 0 meaning perfect similarity. A and B in the image are both vectors of size m.
Edit: I forgot to add that I ignored the part for min(A, B) < 0, as that won't ever happen in my use case
This should work. First, we create a matrix AB by stacking the columns and calculate the minimum vector AB_min and maximum vector AB_max out of that. Then, we compute D as you defined it, making use of numpy.where to specify the two conditions. After that, we sum the elements to get the D_proposed as you defined it. It gives a value of 0.9 for this example.
import numpy as np
A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]
AB = np.column_stack((A,B))
AB_min = np.min(AB,1)
AB_max = np.max(AB,1)
print(AB_min)
print(AB_max)
D = np.where(AB_min >= 0.,
             1. - (1. + AB_min) / (1. + AB_max),
             1. - (1. + AB_min + abs(AB_min)) / (1. + AB_max + abs(AB_min)))
print(D)
D_proposed = np.sum(D)
print(D_proposed)
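Since the question expects a value between 0 and 1, one possibility is that the measure averages over the m components instead of summing them. The image is not reproduced here, so this is purely an assumption; the sketch below (with the hypothetical name `normalized_distance`) handles only non-negative inputs, as in the question:

```python
import numpy as np

def normalized_distance(a, b):
    """Mean of 1 - (1 + min) / (1 + max) over all components.
    Assumes non-negative inputs; averaging (not summing) is an assumption."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    lo = np.minimum(a, b)
    hi = np.maximum(a, b)
    return float(np.mean(1.0 - (1.0 + lo) / (1.0 + hi)))

A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]
d = normalized_distance(A, B)
print(d)
```

Each term lies in [0, 1), so the mean stays below 1 and is 0 for identical vectors; for the example above it is the sum 0.9 divided by 9 components.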
