How to detect outliers from a list


I have a list of values, and I would like to extract the outliers from it.
import numpy as np

list_of_values = [2, 3, 100, 5, 53, 5, 4, 7]

def detect_outlier(data):
    threshold = 3
    mean_1 = np.mean(data)
    std_1 = np.std(data)
    # keep the values whose z-score exceeds the threshold
    outliers = [y for y in data if np.abs((y - mean_1) / std_1) > threshold]
    return outliers

print(detect_outlier(list_of_values))
However, my print comes up empty: it shows [] with nothing in it. Any ideas?

Since mean_1 = 22.375 and std_1 ≈ 33.413, the largest z-score in list_of_values, the one for 100, is (100 - 22.375) / 33.413 ≈ 2.32. Every element therefore scores below the threshold of 3, and nothing is yielded.
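With the population standard deviation that np.std computes by default, no single value among n can have a z-score above sqrt(n - 1), which is about 2.65 for eight values, so a threshold of 3 can never fire on a list this short. The sketch below shows two possible workarounds; it is not the asker's code. The function names, the lower threshold of 2.0, and the median-based variant with its conventional 0.6745 factor and 3.5 cutoff are all assumptions chosen for illustration, and it uses the standard library's statistics module instead of NumPy.

```python
import statistics

def zscore_outliers(data, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mean = statistics.mean(data)
    std = statistics.pstdev(data)  # population std, like np.std's default
    if std == 0:
        return []  # all values identical: nothing can be an outlier
    return [y for y in data if abs((y - mean) / std) > threshold]

def mad_outliers(data, threshold=3.5):
    """Return the values whose modified z-score (median/MAD based) exceeds `threshold`."""
    median = statistics.median(data)
    mad = statistics.median(abs(y - median) for y in data)
    if mad == 0:
        return []
    return [y for y in data if abs(0.6745 * (y - median) / mad) > threshold]

list_of_values = [2, 3, 100, 5, 53, 5, 4, 7]
print(zscore_outliers(list_of_values, threshold=3))  # []
print(zscore_outliers(list_of_values))               # [100]
print(mad_outliers(list_of_values))                  # [100, 53]
```

The median-based variant is less sensitive to the outliers themselves, which is why it also flags 53 here; which behaviour is appropriate depends on the data.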

Related

How can you find the maximum nth integer in a list in python? [duplicate]

I know how to find the highest value, but not the rest. I need to print the positions of the 1st, 2nd and 3rd highest values. Please keep it simple, as I have only been coding for 2 months. The values can also be joint ranks (ties).
def linearSearch(Fscore_list):
    pos_list = []
    target = (max(Fscore_list))
    for i in range(len(Fscore_list)):
        if Fscore_list[i] >= target:
            pos_list.append(i)
    return pos_list

This will create a list of the 3 largest items, and a list of the corresponding indices:
lst = [9, 7, 43, 2, 4, 7, 8, 5, 4]
# pair each value with its index, sort descending, keep the first three pairs
top3 = sorted([(x, i) for (i, x) in enumerate(lst)], reverse=True)[:3]
values = [x for (x, i) in top3]  # [43, 9, 8]
posns = [i for (x, i) in top3]   # [2, 0, 6]
Things are a bit more complicated if the same value can appear multiple times (this will show the highest position for a value):
lst = [9, 7, 43, 2, 4, 7, 8, 5, 4]
ranks = sorted([(x, i) for (i, x) in enumerate(lst)], reverse=True)
values = []
posns = []
for x, i in ranks:
    if x not in values:
        values.append(x)
        posns.append(i)
        if len(values) == 3:
            break
print(list(zip(values, posns)))

Use heapq.nlargest:
>>> import heapq
>>> [i
... for x, i
... in heapq.nlargest(
... 3,
... ((x, i) for i, x in enumerate((0,5,8,7,2,4,3,9,1))))]
[7, 2, 3]

Add all the values in the list to a set. This will ensure you have each value only once.
Sort the set.
Find the index of the top three values in the set in the original list.
Make sense?
Edit
thelist = [1, 45, 88, 1, 45, 88, 5, 2, 103, 103, 7, 8]
theset = frozenset(thelist)
theset = sorted(theset, reverse=True)
print('1st = ' + str(theset[0]) + ' at ' + str(thelist.index(theset[0])))
print('2nd = ' + str(theset[1]) + ' at ' + str(thelist.index(theset[1])))
print('3rd = ' + str(theset[2]) + ' at ' + str(thelist.index(theset[2])))
Edit
You still haven't told us how to handle 'joint winners' but looking at your responses to other answers I am guessing this might possibly be what you are trying to do, maybe? If this is not the output you want please give us an example of the output you are hoping to get.
thelist = [1, 45, 88, 1, 45, 88, 5, 2, 103, 103, 7, 8]
theset = frozenset(thelist)
theset = sorted(theset, reverse=True)
thedict = {}
for j in range(3):
    positions = [i for i, x in enumerate(thelist) if x == theset[j]]
    thedict[theset[j]] = positions
print('1st = ' + str(theset[0]) + ' at ' + str(thedict.get(theset[0])))
print('2nd = ' + str(theset[1]) + ' at ' + str(thedict.get(theset[1])))
print('3rd = ' + str(theset[2]) + ' at ' + str(thedict.get(theset[2])))
Output
1st = 103 at [8, 9]
2nd = 88 at [2, 5]
3rd = 45 at [1, 4]
BTW : What if all the values are the same (equal first) or for some other reason there is no third place? (or second place?). Do you need to protect against that? If you do then I'm sure you can work out appropriate safety shields to add to the code.

This question came up in my Udemy machine learning course way too soon. Scott Hunter's answer helped me the most, but it didn't get me to a pass on the site, so I had to think about the problem more deeply on my own. Here is my solution, since I couldn't find one anywhere else online in terms where I understood everything that was going on*:
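lst = [9,7,43,2,4,7,8,9,4]
ranks = sorted( [(x,i) for (i,x) in enumerate(lst)], reverse=True )
box = []
for x,i in ranks:
    # compare the (value, position) pair; the original i&x was a bitwise AND
    if (x, i) not in box:
        box.append( (x, i) )
        if len(box) == 3:
            break
print(box)  # [(43, 2), (9, 7), (9, 0)]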
So we have a list of numbers. To rank the numbers we sort the value with its position for every position that has a value when we enumerate/iterate the list. Then we put the highest values on top by reversing it. Now we need a box to put our information in to pull out of later, so we build that box []. Now for every value with a position put that in the box, if the value and position isn't already in the box--meaning if the value is already in the box, but the position isn't, still put in the box. And we only want three answers. Finally tell me what is in the variable called box.
*Many of these answers, on this post, will most likely work.

Input : [4, 5, 1, 2, 9]
N = 2
Output : [9, 5]
Input : [81, 52, 45, 10, 3, 2, 96]
N = 3
Output : [81, 96, 52]
# Python program to find N largest
# element from given list of integers
l = [1000,298,3579,100,200,-45,900]
n = 4
l.sort()
print(l[-n:])
Output:
[298, 900, 1000, 3579]

lst = [9,7,43,2,4,7,8,9,4]
temp1 = list(lst)  # copy, so that lst itself is not modified
print(temp1)
# First highest value:
print(max(temp1))
# output: 43
temp1.remove(max(temp1))
# Second highest value:
print(max(temp1))
# output: 9
temp1.remove(max(temp1))
# Third highest value:
print(max(temp1))
# output: 9 (9 appears twice in the list, so it is a joint rank)

There's a complicated O(n) algorithm, but the simplest way is to sort it, which is O(n * log n), then take the top. The trickiest part here is to sort the data while keeping the indices information.
from operator import itemgetter

def find_top_n_indices(data, top=3):
    indexed = enumerate(data)  # create pairs [(0, v1), (1, v2)...]
    sorted_data = sorted(indexed,
                         key=itemgetter(1),  # sort pairs by value
                         reverse=True)       # in descending order
    return [d[0] for d in sorted_data[:top]]  # take first N indices

data = [5, 3, 6, 3, 7, 8, 2, 7, 9, 1]
print(find_top_n_indices(data))  # should be [8, 5, 4]
Similarly, it can be done with heapq.nlargest(), but still you need to pack the initial data into tuples and unpack afterwards.
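For comparison, here is a minimal sketch of the heapq.nlargest() route. It sidesteps the packing and unpacking by ranking the indices directly and passing a key function that looks up each index's value; the function name top_n_indices is an assumption for illustration.

```python
import heapq

def top_n_indices(data, n=3):
    """Return the indices of the n largest values, largest first."""
    # rank the indices 0..len(data)-1 by the value stored at each index
    return heapq.nlargest(n, range(len(data)), key=data.__getitem__)

data = [5, 3, 6, 3, 7, 8, 2, 7, 9, 1]
print(top_n_indices(data))  # [8, 5, 4]
```

The documentation describes nlargest as equivalent to sorted(iterable, key=key, reverse=True)[:n], so ties resolve the same way as in the sorted() version.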

To get the list filtered and returned in descending order with duplicates removed, try this function.
You can pass in how many values you want it to return as a keyword argument.
As a side note, if the keyword argument (ordered_nums_to_return) is greater than the number of distinct values, it returns all of them in descending order. If you need it to raise an exception, you can add a check to the function. If no argument is passed, it returns only the highest value; again, you can change this behaviour if you need.
list_of_nums = [2, 4, 23, 7, 4, 1]

def find_highest_values(list_to_search, ordered_nums_to_return=None):
    if ordered_nums_to_return:
        return sorted(set(list_to_search), reverse=True)[0:ordered_nums_to_return]
    return [sorted(list_to_search, reverse=True)[0]]

print(find_highest_values(list_of_nums, ordered_nums_to_return=4))

If values can appear in your list repeatedly you can try this solution.
def search(Fscore_list, num=3):
    l = Fscore_list
    res = dict([(v, []) for v in sorted(set(l), reverse=True)[:num]])
    for index, val in enumerate(l):
        if val in res:
            res[val].append(index)
    return sorted(res.items(), key=lambda x: x[0], reverse=True)
First it finds the num=3 highest values and creates a dict with an empty list of indexes for each. Next it goes over the list and, for each of the highest values (val in res), saves its indexes. Then it just returns a sorted list of tuples like [(highest_1, [indexes ...]), ..]. E.g.
>>> l = [9, 7, 43, 2, 4, 7, 43, 8, 5, 8, 4]
>>> print(search(l))
[(43, [2, 6]), (9, [0]), (8, [7, 9])]
To print the positions do something like:
>>> Fscore_list = [9, 7, 43, 2, 4, 7, 43, 8, 5, 8, 4, 43, 43, 43]
>>> result = search(Fscore_list)
>>> print("1st. %d on positions %s" % (result[0][0], result[0][1]))
1st. 43 on positions [2, 6, 11, 12, 13]
>>> print("2nd. %d on positions %s" % (result[1][0], result[1][1]))
2nd. 9 on positions [0]
>>> print("3rd. %d on positions %s" % (result[2][0], result[2][1]))
3rd. 8 on positions [7, 9]

In one line:
lst = [9,7,43,2,8,4]
index = [i for (x, i) in sorted([(x, i) for (i, x) in enumerate(lst)], reverse=True)[:3]]
print(index)
[2, 0, 4]

In Python 2, None is always considered smaller than any number. (Python 3 raises a TypeError for this comparison, so use float('-inf') as the placeholder there.)
>>> None<4
True
>>> None>4
False
Find the highest element, and its index.
Replace it by None. Find the new highest element, and its index. This would be the second highest in the original list. Replace it by None. Find the new highest element, which is actually the third one.
Optional: restore the found elements to the list.
This is O(number of highest elements * list size), so it scales poorly if your "three" grows, but for three elements it is only about 3n steps.
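The replace-and-search-again idea could be sketched as follows. This is an illustration under assumptions, not the answerer's code: it uses float("-inf") as the placeholder so that it also runs on Python 3, works on a copy instead of restoring the elements afterwards, and the name top_positions is made up.

```python
def top_positions(values, count=3):
    """Return (value, index) pairs for the `count` highest elements."""
    work = list(values)            # copy, so the caller's list is untouched
    found = []
    for _ in range(min(count, len(work))):
        best = max(work)
        pos = work.index(best)     # first position of the current maximum
        found.append((best, pos))
        work[pos] = float("-inf")  # placeholder: smaller than any real number
    return found

print(top_positions([9, 7, 43, 2, 4, 7, 8, 5, 4]))
# [(43, 2), (9, 0), (8, 6)]
```

Tied values are reported one position at a time, so a list with two equal maxima yields both of their positions as 1st and 2nd.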

Index of element in list

I want to find out if the maximum value in a list has a smaller index than the minimum value in a list. If there are two or more indices with the minimum value, I want to look at the greatest index. If there are two or more indices with the maximum value, I want to look at the smallest index. Now my code looks like this:
maximum = max(lijst)
minimum = min(lijst)
if lijst.index(maximum) <= lijst.index(minimum):
    ...
But this doesn't give me the indices I want for lists like this one:
[2, 9, 15, 36, 36, 3, 2, 36]
Now I want to look at the largest index of the minimum value (which is 6 in this case) and the smallest index for the maximum value (which is 3 in this case). Does someone know how to find these indices?

You can get the min/max value of a list using min/max, then use enumerate to collect every index where it occurs, then apply min/max again to that list of indices. For example:
my_list = [2, 9, 15, 36, 36, 3, 2, 36]
maxval = max(my_list)
indices = [index for index, val in enumerate(my_list) if val == maxval]
# indices == [3, 4, 7]
maxIndex = max(indices)
# maxIndex == 7 (use min(indices) for the first occurrence, 3)
So to check whether the maximum comes before the minimum, find each value's index this way and compare the two.
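Applied to the question as asked (smallest index of the maximum, greatest index of the minimum), the enumerate approach might look like the sketch below. The function name first_max_and_last_min is an assumption for illustration.

```python
def first_max_and_last_min(values):
    """Return (first index of the maximum, last index of the minimum)."""
    maximum, minimum = max(values), min(values)
    max_indices = [i for i, v in enumerate(values) if v == maximum]
    min_indices = [i for i, v in enumerate(values) if v == minimum]
    return min(max_indices), max(min_indices)

first_max, last_min = first_max_and_last_min([2, 9, 15, 36, 36, 3, 2, 36])
print(first_max, last_min)    # 3 6
print(first_max <= last_min)  # True
```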

Python lists have no find method (that is a str method), but list.index accepts a start position and raises ValueError when the value is not found. You can find the last minimum value by searching again from the previous hit until that happens.
maximum = max(li)
minimum = min(li)
i1 = li.index(maximum)
i2 = li.index(minimum)
while True:
    try:
        i2 = li.index(minimum, i2 + 1)  # look for a later occurrence
    except ValueError:
        break  # no later occurrence: i2 is the last minimum
if i1 < i2:
    .......

To get the index of the first maximum:
l.index(max(l))
To get the index of the last minimum you can reverse the list (l.reverse() reverses it in place, so call it again afterwards to restore the original order) and apply something similar:
l.reverse()
len(l)-l.index(min(l))-1

What you probably had in mind
Although there are other answers I wanted to work on something along the lines of your code. It may not be the most efficient but I believe it is what you had in mind:
my_list = [2, 9, 15, 36, 36, 3, 2, 36]
maximum = max(my_list)
minimum = min(my_list)
first_maximum_index = my_list.index(maximum)
last_minimum_index = len(my_list)-1 - my_list[::-1].index(minimum)
if first_maximum_index <= last_minimum_index:
    print("Yes!")
.index() gets the index of the first value in the list. So, to get the last minimum value, you need to reverse the list before using .index() which is this portion:
my_list[::-1].index(minimum)
After that you will get the index of the minimum value, BUT it is an index into the reversed list. Now you have to "reverse" this index by subtracting it from the last index, len(my_list)-1, which gives you the final expression:
len(my_list)-1 - my_list[::-1].index(minimum)
After that, you can compare the indices as you did.
A more efficient method
Now, here's a more efficient solution (though longer, and perhaps less readable). If you notice, you are running through the list about 4 times (worst case) in the code above. You can reduce it to running through the list once:
my_list = [2, 9, 15, 36, 36, 3, 2, 36]
current_min = float("inf")
current_max = float("-inf")
is_before = False
for val in my_list:
    if val > current_max:
        is_before = False
        current_max = val
    if val <= current_min:
        current_min = val
        is_before = True
if is_before:
    print("Yes!")
The trick here is to think about subsets of the list:
[2] # ???
[2, 9] # False
[2, 9, 15] # False
[2, 9, 15, 36] # False
[2, 9, 15, 36, 36] # False
[2, 9, 15, 36, 36, 3] # False
[2, 9, 15, 36, 36, 3, 2] # True
[2, 9, 15, 36, 36, 3, 2, 36] # True
If you look closely, the result changes from True to False when there is a new maximum value. Similarly, the result changes from False to True when there is a new or existing minimum value introduced at the end of the list.
These correspond to the block of code:
    # If value introduced is the new maximum
    if val > current_max:
        is_before = False
        current_max = val
    # If value introduced is an existing or new minimum
    if val <= current_min:
        current_min = val
        is_before = True
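As a sanity check, the single-pass loop can be wrapped in a function and compared against the index-based version on a few lists. This is a sketch; the function names max_before_min and max_before_min_by_index are assumptions for illustration.

```python
def max_before_min(values):
    """Single pass: is the first maximum at or before the last minimum?"""
    current_min = float("inf")
    current_max = float("-inf")
    is_before = False
    for val in values:
        if val > current_max:   # a new maximum appears after every minimum seen so far
            is_before = False
            current_max = val
        if val <= current_min:  # a new or repeated minimum at the current end of the list
            current_min = val
            is_before = True
    return is_before

def max_before_min_by_index(values):
    """Reference version using index(), several passes over the list."""
    first_max = values.index(max(values))
    last_min = len(values) - 1 - values[::-1].index(min(values))
    return first_max <= last_min

for sample in ([2, 9, 15, 36, 36, 3, 2, 36], [2, 9, 15, 36, 36, 3, 36], [36, 2], [2, 36], [5]):
    assert max_before_min(sample) == max_before_min_by_index(sample)
    print(sample, max_before_min(sample))
```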

Accessing elements from a list?

I am trying to calculate the distance between two lists so I can find the shortest distance between all coordinates.
Here is my code:
import random
import math
import copy
def calculate_distance(starting_x, starting_y, destination_x, destination_y):
    distance = math.hypot(destination_x - starting_x, destination_y - starting_y)  # Euclidean (straight-line) distance between two points
    return distance

def nearest_neighbour_algorithm(selected_map):
    temp_map = copy.deepcopy(selected_map)
    optermised_map = []  # we set up an empty optimised list to fill up
    # take the last element of temp_map as the starting point; pop() also removes it from temp_map
    optermised_map.append(temp_map.pop())
    for x in range(len(temp_map)):
        nearest_value = 1000
        neares_index = 0
        for i in range(len(temp_map[x])):
            current_value = calculate_distance(*optermised_map[x], *temp_map[x])
I get an error at this part and I'm not sure why:
for i in range(len(temp_map[x])):
    current_value = calculate_distance(*optermised_map[x], *temp_map[x])
I am trying to find the distances between the points in these two lists, and the error I get, raised at the for loop, is that my list index is out of range.

On the first iteration optermised_map has length 1, but the outer loop runs x over range(len(temp_map)), which is likely longer. As soon as x reaches 1, optermised_map[x] is out of range. I think you may have wanted:
for i in range(len(optermised_map)):
    current_value = calculate_distance(*optermised_map[i], *temp_map[x])
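For reference, a complete nearest-neighbour ordering could be sketched as below. This is not the asker's intended implementation, which the question truncates; it assumes selected_map is a list of (x, y) tuples, starts from the last point as the question does, and the name nearest_neighbour_order is made up for illustration.

```python
import math

def nearest_neighbour_order(points):
    """Greedy route: start at the last point, then always visit the closest unvisited one."""
    remaining = list(points)       # copy, so the caller's list is untouched
    if not remaining:
        return []
    route = [remaining.pop()]      # starting point, removed from the candidates
    while remaining:
        last_x, last_y = route[-1]
        # index of the candidate closest to the most recently visited point
        nearest_index = min(
            range(len(remaining)),
            key=lambda i: math.hypot(remaining[i][0] - last_x,
                                     remaining[i][1] - last_y),
        )
        route.append(remaining.pop(nearest_index))
    return route

print(nearest_neighbour_order([(0, 0), (5, 5), (1, 1), (6, 6)]))
# [(6, 6), (5, 5), (1, 1), (0, 0)]
```

Measuring from route[-1] rather than from an index shared with the candidate list avoids the out-of-range lookup, because the two lists change length in opposite directions.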

Are the lengths of the lists the same? I could be wrong, but this sounds like a cosine similarity exercise to me. Check out this very simple exercise.
from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
result
# 0.97228425171235
dataSetI = [1, 2, 3, 10]
dataSetII = [2, 4, 6, 20]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
result
# 1.0
dataSetI = [10, 200, 234, 500]
dataSetII = [45, 3, 19, 20]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
result
# 0.4991255575740505
In the second example, the ratios between the numbers in the two lists are exactly the same even though the numbers themselves differ, so the similarity is 1.0. Cosine similarity depends only on these ratios, not on the magnitudes.

Median of the medians of a list

I need a vector that stores the median values of the medians of the main list "v". I have tried something with the following code but I am only able to write some values in the correct way.
v=[1,2,3,4,5,6,7,8,9,10]
final=[]
nfac=0
for j in range (0,4):
    nfac=j+1
    for k in range (0,nfac):
        if k%2==0:
            final.append(v[10/2**(nfac)-1])
        else:
            final.append(v[9-10/2**(nfac)])
The first median in v=[1,2,3,4,5,6,7,8,9,10] is 5
Then I want the medians of the remaining sublists [1,2,3,4] and [6,7,8,9,10]. I.e. 2 and 8 respectively. And so on.
The list "final" must be in the following form:
final=[5,2,8,1,3,6,9,4,7,10]

Note that the task as you defined it is basically equivalent to laying out a balanced binary search tree in breadth-first order, the same array layout a binary heap uses.
Definitely start by defining a helper function for finding the median:
def split_by_median(l):
    median_ind = (len(l)-1) // 2
    median = l[median_ind]
    left = l[:median_ind]
    right = l[median_ind+1:] if len(l) > 1 else []
    return median, left, right
Following the example you give, you want to process the resulting sublists in a breadth-first manner, so we need a queue to remember the following tasks:
from collections import deque

def construct_heap(v):
    lists_to_process = deque([sorted(v)])
    nodes = []
    while lists_to_process:
        head = lists_to_process.popleft()
        if len(head) == 0:
            continue
        median, left, right = split_by_median(head)
        nodes.append(median)
        lists_to_process.append(left)
        lists_to_process.append(right)
    return nodes
So calling the function finally:
print(construct_heap([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])) # [5, 2, 8, 1, 3, 6, 9, 4, 7, 10]
print(construct_heap([5, 1, 2])) # [2, 1, 5]
print(construct_heap([1, 0, 0.5, -1])) # [0, -1, 0.5, 1]
print(construct_heap([])) # []

Python finding repeating sequence in list of integers?

I have a list of lists and each list has a repeating sequence. I'm trying to count the length of repeated sequence of integers in the list:
list_a = [111,0,3,1,111,0,3,1,111,0,3,1]
list_b = [67,4,67,4,67,4,67,4,2,9,0]
list_c = [1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,23,18,10]
Which would return:
list_a count = 4 (for [111,0,3,1])
list_b count = 2 (for [67,4])
list_c count = 10 (for [1,2,3,4,5,6,7,8,9,0])
Any advice or tips would be welcome. I'm trying to work it out with re.compile right now but, its not quite right.

Guess the sequence length by iterating through guesses between 2 and half the sequence length. If no pattern is discovered, return 1 by default.
def guess_seq_len(seq):
    guess = 1
    max_len = len(seq) // 2  # integer division, so range() accepts it
    # + 1 so that a block of exactly half the list is also tried
    for x in range(2, max_len + 1):
        if seq[0:x] == seq[x:2*x]:
            return x
    return guess

list_a = [111,0,3,1,111,0,3,1,111,0,3,1]
list_b = [67,4,67,4,67,4,67,4,2,9,0]
list_c = [1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,23,18,10]
print(guess_seq_len(list_a))
print(guess_seq_len(list_b))
print(guess_seq_len(list_c))
print(guess_seq_len(list(range(500))))  # test of no repetition
This gives (as expected):
4
2
10
1
As requested, this alternative gives longest repeated sequence. Hence it will return 4 for list_b. The only change is guess = x instead of return x
def guess_seq_len(seq):
    guess = 1
    max_len = len(seq) // 2  # integer division, so range() accepts it
    for x in range(2, max_len + 1):
        if seq[0:x] == seq[x:2*x]:
            guess = x
    return guess
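If the repeated block itself is wanted as well as its length, the same prefix comparison can be extended to return the block and how many times it repeats back to back. This is a sketch under assumptions: the name repeating_prefix is made up, and unlike the functions above it also tries a block length of 1.

```python
def repeating_prefix(seq):
    """Return (block, repeats) for the shortest prefix that repeats immediately, else ([], 0)."""
    for size in range(1, len(seq) // 2 + 1):
        block = seq[:size]
        if seq[size:2 * size] == block:
            repeats = 1
            # count how many consecutive copies of the block start the list
            while seq[repeats * size:(repeats + 1) * size] == block:
                repeats += 1
            return block, repeats
    return [], 0

list_a = [111, 0, 3, 1, 111, 0, 3, 1, 111, 0, 3, 1]
list_b = [67, 4, 67, 4, 67, 4, 67, 4, 2, 9, 0]
print(repeating_prefix(list_a))  # ([111, 0, 3, 1], 3)
print(repeating_prefix(list_b))  # ([67, 4], 4)
```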

I took Maria's faster and more stackoverflow-compliant answer and made it find the largest sequence first:
def guess_seq_len(seq, verbose=False):
    seq_len = 1
    initial_item = seq[0]
    butfirst_items = seq[1:]
    if initial_item in butfirst_items:
        # + 1 converts the position within butfirst_items to an index into seq
        first_match_idx = butfirst_items.index(initial_item) + 1
        if verbose:
            print(f'"{initial_item}" was found at index 0 and index {first_match_idx}')
        max_seq_len = min(len(seq) - first_match_idx, first_match_idx)
        for seq_len in range(max_seq_len, 0, -1):
            if seq[:seq_len] == seq[first_match_idx:first_match_idx+seq_len]:
                if verbose:
                    print(f'A sequence length of {seq_len} was found at index {first_match_idx}')
                break
    return seq_len

This worked for me.
def repeated(L):
    '''Reduce the input list to a list of all repeated integers in the list.'''
    return [item for item in list(set(L)) if L.count(item) > 1]

def print_result(L, name):
    '''Print the output for one list.'''
    output = repeated(L)
    print('%s count = %i (for %s)' % (name, len(output), output))

list_a = [111, 0, 3, 1, 111, 0, 3, 1, 111, 0, 3, 1]
list_b = [67, 4, 67, 4, 67, 4, 67, 4, 2, 9, 0]
list_c = [
    1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2,
    3, 4, 5, 6, 7, 8, 9, 0, 23, 18, 10
]
print_result(list_a, 'list_a')
print_result(list_b, 'list_b')
print_result(list_c, 'list_c')
Python's set() function will transform a list to a set, a datatype that can only contain one of any given value, much like a set in algebra. I converted the input list to a set, and then back to a list, reducing the list to only its unique values. I then tested the original list for each of these values to see if it contained that value more than once. I returned a list of all of the duplicates. The rest of the code is just for demonstration purposes, to show that it works.
Edit: Syntax highlighting didn't like the apostrophe in my docstring.
