Python find most common value in array - python

import numpy as np
x = ([1,2,3,3])
y = ([1,2,3])
z = ([6,6,1,2,9,9])
(only positive values)
In each array I need to return the most common value or, if several values occur the same number of times, the smallest of them.
This is a home assignment and I can't use anything but numpy.
outputs:
f(x) = 3,
f(y) = 1,
f(z) = 6

For a numpy-only solution, something like this will work:
occurrences = np.bincount(x)
print(np.argmax(occurrences))
np.bincount only accepts non-negative integers, so the method above won't work if the list contains a negative number. To handle that case, use np.unique instead:
values, counts = np.unique(x, return_counts=True)
# np.unique returns sorted values and np.argmax returns the first maximum,
# so ties resolve to the smallest value
print(values[np.argmax(counts)])
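As a minimal sketch, the np.unique approach can be wrapped in a function and checked against the three arrays from the question. The function name `most_common_min` is hypothetical. The sketch relies on np.unique returning its values in ascending order and on np.argmax returning the first index of the maximum, which together make ties resolve to the smallest value.

```python
import numpy as np

def most_common_min(values):
    """Return the most frequent value; on ties, return the smallest one."""
    # uniques is sorted ascending, counts[i] is how often uniques[i] occurs
    uniques, counts = np.unique(values, return_counts=True)
    # argmax picks the first maximal count, i.e. the smallest tied value
    return uniques[np.argmax(counts)]

print(most_common_min([1, 2, 3, 3]))        # 3
print(most_common_min([1, 2, 3]))           # 1
print(most_common_min([6, 6, 1, 2, 9, 9]))  # 6
```

Unlike np.bincount, this also accepts negative values.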

It's called a mode function. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html

Without numpy:
n_dict = {}
for k in x:
    try:
        n_dict[k] += 1
    except KeyError:
        n_dict[k] = 1
# Invert the mapping: count -> list of values with that count
rev_n_dict = {}
for k in n_dict:
    if n_dict[k] not in rev_n_dict:
        rev_n_dict[n_dict[k]] = [k]
    else:
        rev_n_dict[n_dict[k]].append(k)
local_max = 0
for k in rev_n_dict:
    if k > local_max:
        local_max = k
# Several values may share the highest count; print the smallest
print(min(rev_n_dict[local_max]))

To add to the previous results, you could use a collections.Counter object:
my_array = [3,24,543,3,1,6,7,8,....,223213,13213]
from collections import Counter
my_counter = Counter( my_array)
most_common_value = my_counter.most_common(1)[0][0]
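Note that most_common orders equal counts by first appearance, so on a tie it does not necessarily return the smallest value the question asks for. A minimal sketch of one way to add that tie-break follows; the function name `mode_with_min_tiebreak` is hypothetical.

```python
from collections import Counter

def mode_with_min_tiebreak(values):
    """Most frequent value; the smallest one when several share the top count."""
    counts = Counter(values)
    top = max(counts.values())
    # Keep only the values that reach the top count, then take the smallest
    return min(v for v, c in counts.items() if c == top)

print(mode_with_min_tiebreak([9, 9, 6, 6, 1, 2]))  # 6
```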

It is quite simple but certainly not pretty. The variable names and comments should be self-explanatory. Feel free to ask if anything is unclear.
import numpy as np
x = [6, 6, 1, 2, 9, 9]
def tester(x):
    # Return the most frequent value (wrapped in a list) and its count
    values, counts = np.unique(x, return_counts=True)
    highest_occurrence = [values[np.argmax(counts)]]
    number_of_counts = np.max(counts)
    return highest_occurrence, number_of_counts
most_abundant, first_test_counts = tester(x)
# Remove the winner and test again to detect a tie
new_x = [vals for vals in x if vals not in most_abundant]
second_most_abundant, second_test_counts = tester(new_x)
if second_test_counts == first_test_counts:
    print("At least two elements have the same number of counts:", most_abundant, "and", second_most_abundant, "have %s" % first_test_counts, "occurrences")
else:
    print("%s occurs a maximum of %s times" % (most_abundant, first_test_counts))
We can also loop this to check whether more than two elements share the same count, instead of using an if/else that only looks at two elements.

Related

Any easy way to transform a missing number sequence to its range?

Suppose I have a list that goes like :
'''
[1,2,3,4,9,10,11,20]
'''
I need the result to be like :
'''
[[4,9],[11,20]]
'''
I have defined a function that goes like this :
def get_range(lst):
    i=0
    seqrange=[]
    for new in lst:
        a=[]
        start=new
        end=new
        if i==0:
            i=1
            old=new
        else:
            if new - old >1:
                a.append(old)
                a.append(new)
            old=new
        if len(a):
            seqrange.append(a)
    return seqrange
Is there any other easier and efficient way to do it? I need to do this in the range of millions.
You can use numpy arrays and the diff function that comes along with them. Numpy is so much more efficient than looping when you have millions of rows.
Slight aside:
Why are numpy arrays so fast? Because they are arrays of data instead of arrays of pointers to data (which is what Python lists are), because they offload a whole bunch of computations to a backend written in C, and because they leverage the SIMD paradigm to run a Single Instruction on Multiple Data simultaneously.
Now back to the problem at hand:
The diff function gives us the difference between consecutive elements of the array. Pretty convenient, given that we need to find where this difference is greater than a known threshold!
import numpy as np
threshold = 1
arr = np.array([1,2,3,4,9,10,11,20])
deltas = np.diff(arr)
# There's a gap wherever the delta is greater than our threshold
gaps = deltas > threshold
gap_indices = np.argwhere(gaps)
gap_starts = arr[gap_indices]
gap_ends = arr[gap_indices + 1]
# Finally, stack the two arrays horizontally
all_gaps = np.hstack((gap_starts, gap_ends))
print(all_gaps)
# Output:
# [[ 4 9]
# [11 20]]
You can access all_gaps like a 2D matrix: all_gaps[0, 1] would give you 9, for example. If you really need the answer as a list-of-lists, simply convert it like so:
all_gaps_list = all_gaps.tolist()
print(all_gaps_list)
# Output: [[4, 9], [11, 20]]
Comparing the runtime of the iterative method from @happydave's answer with the numpy method:
import random
import timeit
import numpy as np
def gaps1(arr, threshold):
    deltas = np.diff(arr)
    gaps = deltas > threshold
    gap_indices = np.argwhere(gaps)
    gap_starts = arr[gap_indices]
    gap_ends = arr[gap_indices + 1]
    all_gaps = np.hstack((gap_starts, gap_ends))
    return all_gaps
def gaps2(lst, thr):
    seqrange = []
    for i in range(len(lst)-1):
        if lst[i+1] - lst[i] > thr:
            seqrange.append([lst[i], lst[i+1]])
    return seqrange
test_list = [i for i in range(100000)]
for i in range(100):
    # Remove by index; removing by value fails if the same value is drawn twice
    test_list.pop(random.randint(0, len(test_list) - 1))
test_arr = np.array(test_list)
# Make sure both give the same answer:
assert np.all(gaps1(test_arr, 1) == gaps2(test_list, 1))
t1 = timeit.timeit('gaps1(test_arr, 1)', setup='from __main__ import gaps1, test_arr', number=100)
t2 = timeit.timeit('gaps2(test_list, 1)', setup='from __main__ import gaps2, test_list', number=100)
print(f"t1 = {t1}s; t2 = {t2}s; Numpy gives ~{t2 // t1}x speedup")
On my laptop, this gives:
t1 = 0.020834800001466647s; t2 = 1.2446780000027502s; Numpy gives ~59.0x speedup
My word that's fast!
There is also an iterator-based solution, which lets you get the intervals one by one:
flist = [1,2,3,4,9,10,11,20]
def get_range(lst):
    start_idx = lst[0]
    for current_idx in lst[1:]:
        if current_idx > start_idx+1:
            yield [start_idx, current_idx]
        start_idx = current_idx
for interval in get_range(flist):
    print(interval)
I don't think there's anything inefficient about the solution, but you can clean up the code quite a bit:
seqrange = []
for i in range(len(lst)-1):
    if lst[i+1] - lst[i] > 1:
        seqrange.append([lst[i], lst[i+1]])
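The same cleanup can be sketched without index arithmetic by pairing each element with its successor through zip. The function name `find_gaps` and the `threshold` parameter are hypothetical, introduced only for this illustration.

```python
def find_gaps(lst, threshold=1):
    """Return [previous, next] pairs wherever consecutive values differ by more than threshold."""
    # zip(lst, lst[1:]) yields each element together with the one after it
    return [[a, b] for a, b in zip(lst, lst[1:]) if b - a > threshold]

print(find_gaps([1, 2, 3, 4, 9, 10, 11, 20]))  # [[4, 9], [11, 20]]
```

The slice lst[1:] copies the list, so for millions of elements the numpy version is still likely to be the faster choice.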
I think this could be more efficient and a bit cleaner.
def func(lst):
    ans=0
    final=[]
    sol=[]
    for i in range(1,lst[-1]+1):
        if(i not in lst):
            ans+=1
            final.append(i)
        elif(i in lst and ans>0):
            final=[final[0]-1,i]
            sol.append(final)
            ans=0
            final=[]
        else:
            final=[]
    return(sol)

Solving a simple function with step and which outputs max value & argument of a function

I am writing a program which solves a function in an interval 0:9 where step size is 0.005. This program requires 1800 calculations and a way to find the max value of a function and x argument which was used.
What would be the recommended way and loops to use in order calculate function 1800 times (9/0.005), find the max value of it and output related argument value which was used in calculation for the max value?
My idea was that there should be 2 lists generated, one for the range/interval (1800 items) and other for calculated values (also 1800). Which would then find max in 'calculated array' and related x argument in the other array, using list index or some other method..
from operator import itemgetter
import math
myfile = open("result.txt", "w")
data = []
step=0.005
rng=9
lim=rng/step
print(lim)
xs=[x * step for x in range(rng)]
lim_int=int(lim)
print(xs)
for i in range(lim_int):
    num=itemgetter(i)(xs)
    x=math.sin(num)* math.exp(-num/100)
    print(i, x)
    data.append(x)
for i in range(rng):
    text = str(i)
    text2 = str(data[i])
    print(text, text2)
    myfile.write(text + ' ' + text2 + '\n')
i=1
while i < rng:
    i=i+1
    num2=itemgetter(i)(xs)
    v=math.sin(num2)* math.exp(-num2/100)
    if v==max(data):
        arg=num2
        break
print('largest function value', max(data))
print('function argument value used', arg)
myfile.close()
Numpy is the widely used performant package for this:
import numpy as np
x = np.arange(0, 9, 0.005)
f = np.sin(x)*np.exp(-x/100)
print("max is: ", np.max(f))
print("index of max is: ", np.argmax(f))
output:
max is: 0.98446367206362
index of max is: 312
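The question also asks for the argument at which the maximum occurs. A minimal sketch, reusing the same grid and function: the index from np.argmax can be used to look up the matching entry of x. The names `i_max`, `x_at_max` and `f_max` are hypothetical.

```python
import numpy as np

x = np.arange(0, 9, 0.005)
f = np.sin(x) * np.exp(-x / 100)

i_max = np.argmax(f)   # position of the largest function value
x_at_max = x[i_max]    # argument that produced it (about 1.56)
f_max = f[i_max]       # the largest value itself

print("argument at max:", x_at_max)
print("max value:", f_max)
```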
If for some reason you want a native Python solution (without using the built-in max or the list method index), you can do something like this:
import math
step = 0.005
rng = 9
lim = int(rng/step)
x = [x_i*step for x_i in range(lim + 1)]
f = [math.exp(-x_i/100)*math.sin(x_i) for x_i in x]
max_ind = 0
f_max = f[max_ind]
for j, f_x in enumerate(f):
    if f_x > f_max:
        f_max = f_x
        max_ind = j

Make my Nested loops Works simpler (Operating Time is Higher)

I am a learner in nested loops in python.
Below I have written my code. I want to make my code simpler, since when I run the code it takes so much time to produce the result.
I have a list which contains 1000 values:
Brake_index_values = [ 44990678, 44990679, 44990680, 44990681, 44990682, 44990683,
44997076, 44990684, 44997077, 44990685,
...
44960673, 8195083, 8979525, 100107546, 11089058, 43040161,
43059162, 100100533, 10180192, 10036189]
I am storing the element no 1 in another list
original_top_brake_index = [Brake_index_values[0]]
I created a temporary list called temp and a numpy array for iteration through Loop:
temp =[]
arr = np.arange(0,1000,1)
Loop operation:
for i in range(1, len(Brake_index_values)):
    if top_15_brake <= 15:
        a1 = Brake_index_values[i]
        #a2 = Brake_index_values[j]
        a3 = arr[:i]
        for j in a3:
            a2 = range(Brake_index_values[j] - 30000, Brake_index_values[j] + 30000)
            if a1 in a2:
                pass
            else:
                temp.append(a1)
        if len(temp)== len(a3):
            original_top_brake_index.append(a1)
            top_15_brake += 1
            del temp[:]
        else:
            del temp[:]
            continue
I am checking whether the element Brake_index_values[1] lies within 30000 of Brake_index_values[0], that is, within range(Brake_index_values[0]-30000, Brake_index_values[0]+30000).
If Brake_index_values[1] lies within that range, I should ignore it and move on to the next element, Brake_index_values[2], which is checked in the same way against both Brake_index_values[0] and Brake_index_values[1].
If it does not lie within the range, I store the value in original_top_brake_index with an append operation.
In other words:
Let's take three values a, b and c. I check whether b lies between a-30000 and a+30000.
Possibility 1: if b is within that range, neglect it (here I am storing it inside a temporary list). The same process then continues with c, the next element.
Possibility 2: if b is not within that range, put b in another list called original_top_brake_index (this list is the actual result I need).
The result I get:
It is working, but it takes so much time to complete the operation and sometimes it shows MemoryError.
I just want my code to work simpler and efficient with simple operations.
Try this code (with numpy):
import numpy as np
original_top_brake_index = [Brake_index_values[0]]
top_15_brake = 0
Brake_index_values = np.array(Brake_index_values)
# Start at index 1: element 0 is already stored above
for i, a1 in enumerate(Brake_index_values[1:], start=1):
    if top_15_brake > 15:
        break
    m = Brake_index_values[:i] - a1
    if np.logical_or(m > 30000, m < -30000).all():
        original_top_brake_index.append(a1)
        top_15_brake += 1
Note: you can probably make it even more efficient, but this should already reduce the number of operations significantly (without changing the logic of your original code much).
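The vectorized check above can be sketched as a small self-contained function. The name `pick_isolated` and the parameters `window` and `limit` are hypothetical, and the symmetric test on the absolute difference is a simplifying assumption (the original range() check is half-open at its upper end). As in the question's loop, each value is compared against every earlier value, not only against the values already kept.

```python
import numpy as np

def pick_isolated(values, window=30000, limit=16):
    """Keep the first value, then every later value that lies farther than
    `window` from all values before it, stopping at `limit` results."""
    arr = np.asarray(values)
    picked = [int(arr[0])]
    for i in range(1, len(arr)):
        if len(picked) >= limit:
            break
        # One vectorized comparison replaces the inner Python loop
        if (np.abs(arr[:i] - arr[i]) > window).all():
            picked.append(int(arr[i]))
    return picked

sample = [44990678, 44990679, 44997076, 8195083, 8979525, 100107546]
print(pick_isolated(sample))  # [44990678, 8195083, 8979525, 100107546]
```

Because no range objects or temporary lists are built, memory use stays proportional to the input size.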
We can use the bisect module to cut down the number of elements we actually have to look at, by finding the nearest stored values just below and just above the current value. We will use the find_lt and find_gt recipes, adapted from the bisect module documentation.
Let's look at this example:
from bisect import bisect_left, bisect_right
def find_lt(a, x):
    'Find rightmost value less than x'
    i = bisect_left(a, x)
    if i:
        return a[i-1]
    return
def find_gt(a, x):
    'Find leftmost value greater than x'
    i = bisect_right(a, x)
    if i != len(a):
        return a[i]
    return
vals = [44990678, 44990679, 44990680, 44990681, 44990682, 589548954, 493459734, 3948305434, 34939349534]
vals.sort() # we have to sort the values for bisect to work
passed = []
originals = []
for val in vals:
    passed.append(val)
    l = find_lt(passed, val)
    m = find_gt(passed, val)
    cond1 = (l and l + 30000 >= val)
    cond2 = (m and m - 30000 <= val)
    if not l and not m:
        originals.append(val)
        continue
    elif cond1 or cond2:
        continue
    else:
        originals.append(val)
Which gives us:
print(originals)
[44990678, 493459734, 589548954, 3948305434, 34939349534]
There might be another, more mathematical way to do this, but this should at least simplify your code.

Most frequently overlapping range - Python3.x

I'm a beginner, trying to write code listing the most frequently overlapping ranges in a list of ranges.
So, input is various ranges (#1 through #7 in the example figure; https://prntscr.com/kj80xl) and I would like to find the most common range (in the example 3,000- 4,000 in 6 out of 7 - 86 %). Actually, I would like to find top 5 most frequent.
Not all ranges overlap. Ranges are always positive and given as integers with 1 distance (standard range).
What I have now is only code comparing one sequence to another and returning the overlap, but after that I'm stuck.
def range_overlap(range_x,range_y):
    x = (range_x[0], (range_x[-1])+1)
    y = (range_y[0], (range_y[-1])+1)
    overlap = (max(x[0],y[0]),min(x[-1],(y[-1])))
    if overlap[0] <= overlap[1]:
        return range(overlap[0], overlap[1])
    else:
        return "Out of range"
I would be very grateful for any help.
Better solution
I came up with a simpler solution (at least IMHO), so here it is:
def get_abs_min(ranges):
    return min([min(r) for r in ranges])
def get_abs_max(ranges):
    return max([max(r) for r in ranges])
def count_appearances(i, ranges):
    return sum([1 for r in ranges if i in r])
def create_histogram(ranges):
    keys = [str(i) for i in range(len(ranges) + 1)]
    histogram = dict.fromkeys(keys)
    results = []
    # Use the ranges argument (not a global) and avoid shadowing min/max
    lo = get_abs_min(ranges)
    hi = get_abs_max(ranges)
    for i in range(lo, hi):
        count = str(count_appearances(i, ranges))
        if histogram[count] is None:
            histogram[count] = dict(start=i, end=None)
        elif histogram[count]['end'] is None:
            histogram[count]['end'] = i
        elif histogram[count]['end'] == i - 1:
            histogram[count]['end'] = i
        else:
            start = histogram[count]['start']
            end = histogram[count]['end']
            results.append((range(start, end + 1), count))
            histogram[count]['start'] = i
            histogram[count]['end'] = None
    for count, d in histogram.items():
        if d is not None and d['start'] is not None and d['end'] is not None:
            results.append((range(d['start'], d['end'] + 1), count))
    return results
def main(ranges, top):
    appearances = create_histogram(ranges)
    # Counts are stored as strings, so compare them numerically
    return sorted(appearances, key=lambda t: int(t[1]), reverse=True)[:top]
The idea here is simply to iterate over the superposition of all the ranges and build a histogram of appearances (i.e. the number of original ranges in which the current i appears).
After that, just sort and slice according to the chosen size of the results.
Just call main with the ranges and the top number you want (or None if you want to see all results).
OLDER EDITS BELOW
I (almost) agree with @Kasramvd's answer.
Here is my take on it:
from collections import Counter
from itertools import combinations
def range_overlap(x, y):
    # Sets are unordered, so sort before taking the first and last element
    common_part = sorted(set(x) & set(y))
    if common_part:
        return range(common_part[0], common_part[-1] + 1)
    else:
        return False
def get_most_common(range_list, top_frequent):
    overlaps = Counter(range_overlap(i, j) for i, j in
                       combinations(range_list, 2))
    return [(r, i) for (r, i) in overlaps.most_common(top_frequent) if r]
You need to pass in the range_list and the number of top_frequent results you want.
EDIT
The previous answer only considered combinations of two ranges from the range list.
This edit is tested against your input and gives the correct answer:
from collections import Counter
from itertools import combinations
def range_overlap(*args):
    sets = [set(r) for r in args]
    # Sets are unordered, so sort before taking the first and last element
    common_part = sorted(set(args[0]).intersection(*sets))
    if common_part:
        return range(common_part[0], common_part[-1] + 1)
    else:
        return False
def get_all_possible_combinations(range_list):
    all_combos = []
    for i in range(2, len(range_list)):
        all_combos.append(combinations(range_list, i))
    all_combos = [list(combo) for combo in all_combos]
    return all_combos
def get_most_common_for_combo(combo):
    return list(filter(None, [range_overlap(*option) for option in combo]))
def get_most_common(range_list, top_frequent):
    all_overlaps = []
    combos = get_all_possible_combinations(range_list)
    for combo in combos:
        all_overlaps.extend(get_most_common_for_combo(combo))
    return [r for (r, i) in Counter(all_overlaps).most_common(top_frequent) if r]
To get the results, just run get_most_common(range_list, top_frequent).
Tested on my machine (Ubuntu 16.04 with Python 3.5.2) with your input range_list and top_frequent = 5, with the results:
[range(3000, 4000), range(2500, 4000), range(1500, 4000), range(3000, 6000), range(1, 4000)]
You can first change your function so that it returns a hashable value in both cases (a (start, stop) tuple when the ranges overlap, None otherwise), which lets you count the results. Also, since Python's range objects do not materialize their elements but only store start, stop and step and produce values on demand, you can compute the overlap directly from those attributes.
def range_overlap(range_x, range_y):
    # stop is already exclusive, so no +1 is needed
    rng = range(max(range_x.start, range_y.start),
                min(range_x.stop, range_y.stop))
    if rng.start < rng.stop:
        return rng.start, rng.stop
Now, if you have a set of ranges and you want to compare all the pairs, you can use itertools.combinations to get all the pairs, and then use range_overlap together with collections.Counter to count how often each overlap occurs.
from collections import Counter
from itertools import combinations
overlaps = Counter(range_overlap(i,j) for i, j in
                   combinations(list_of_ranges, 2))
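Pairwise comparison grows quadratically with the number of ranges. As an alternative that none of the answers above uses, a sweep over the sorted start and stop points can find the span covered by the most ranges in one pass. This is a minimal sketch under stated assumptions: the function name `most_covered_span` is hypothetical, every range is assumed to have step 1, and only the single best span is returned, not the top five.

```python
def most_covered_span(ranges):
    """Return (start, stop, count) of the first span covered by the most ranges.

    stop is exclusive, matching Python's range convention.
    """
    events = []
    for r in ranges:
        if len(r) == 0:
            continue  # empty ranges cover nothing
        events.append((r.start, 1))   # coverage rises at start
        events.append((r.stop, -1))   # and falls at the exclusive stop
    events.sort()  # at equal positions, closings (-1) sort before openings (+1)
    best = (None, None, 0)
    depth = 0
    for idx, (pos, delta) in enumerate(events):
        depth += delta
        if delta == 1 and depth > best[2]:
            # This depth holds from pos until the next event position
            best = (pos, events[idx + 1][0], depth)
    return best

print(most_covered_span([range(0, 10), range(5, 15), range(8, 20)]))  # (8, 10, 3)
```

Dividing the returned count by len(ranges) gives the coverage fraction mentioned in the question.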

Re-order numpy array based on where its associated ids are positioned in the `master_order` array

I am looking for a function that makes a new array of values based on ordered_ids, when the array has a length of one million.
Input:
>>> ids=array(["WYOMING01","TEXAS01","TEXAS02",...])
>>> values=array([12,20,30,...])
>>> ordered_ids=array(["TEXAS01","TEXAS02","ALABAMA01",...])
Output:
ordered [ 20 , 30 , nan , ...]
Closing Summary
@Dietrich's use of a dictionary in a list comprehension is roughly 100x faster than using numpy index search (numpy.where). I compared the times of three approaches in my answer below.
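You could try:
import numpy as np
def order_array(ids, values, master_order_ids):
    n = len(master_order_ids)
    # searchsorted assumes a sorted array, so search through a sorting permutation
    sorter = np.argsort(master_order_ids)
    idx = np.searchsorted(master_order_ids, ids, sorter=sorter)
    idx[idx == n] = 0  # clamp out-of-range positions; the match test below rejects them
    pos = sorter[idx]  # candidate position of each id in master_order_ids
    found = master_order_ids[pos] == ids
    ordered_values = np.full(n, np.nan)
    ordered_values[pos[found]] = values[found]
    print("ordered", ordered_values)
    return ordered_values
searchsorted gives you the positions at which ids would have to be inserted into master_order_ids to keep it ordered. Because it assumes a sorted array, the lookup goes through an argsort permutation. Ids that do not actually occur in master_order_ids are then dropped, and their slots stay nan.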
You could try using a dict() to associate the strings with your numbers. It simplifies the code considerably:
import numpy as np
def order_bydict(ids,values,master_order_ids):
    """ Using a dict to order ``master_order_ids`` """
    dd = dict([(k,v) for k,v in zip(ids, values)]) # create the dict
    ordered_values = [dd.get(m, 0) for m in master_order_ids] # get() returns 0 if key not found
    return np.asarray(ordered_values) # return a numpy array instead of a list
The speed improvement is hard to predict without testing longer arrays (with your example it was 25% faster based on %timeit).
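The question's expected output uses nan for ids that are missing, whereas the function above fills in 0. A minimal sketch of the same dictionary approach with nan as the default follows; the function name `order_with_nan` is hypothetical, and the result is cast to float because nan cannot be stored in an integer array.

```python
import numpy as np

def order_with_nan(ids, values, master_order_ids):
    """Look up each master id in a dict; ids without a value become nan."""
    lookup = dict(zip(ids, values))
    return np.array([lookup.get(m, np.nan) for m in master_order_ids], dtype=float)

ids = np.array(["WYOMING01", "TEXAS01", "TEXAS02"])
values = np.array([12, 20, 30])
ordered_ids = np.array(["TEXAS01", "TEXAS02", "ALABAMA01"])
print(order_with_nan(ids, values, ordered_ids))  # [20. 30. nan]
```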
import numpy
from numpy import arange
import time
# SETUP
N = 10**4
ids = arange(0, N).astype(str)
values = arange(0, N)
numpy.random.shuffle(ids)
numpy.random.shuffle(values)
ordered_ids = arange(0, N).astype(str)
ordered_values = numpy.empty((N, 1))
ordered_values[:] = numpy.nan
# METHOD 1
start = time.perf_counter()
for i in range(len(values)):
    ordered_values[ordered_ids == ids[i]] = values[i]
print("not using dictionary:", time.perf_counter() - start)
# METHOD 2
start = time.perf_counter()
d = dict(zip(ids, values))
for k, v in d.items():
    ordered_values[ordered_ids == k] = v
print("using dictionary:", time.perf_counter() - start)
# METHOD 3: @Dietrich's approach in the answer above
start = time.perf_counter()
dd = dict(zip(ids, values))
ordered_values = [dd.get(m, 0) for m in ordered_ids]
print("using dictionary with list comprehension:", time.perf_counter() - start)
Results
not using dictionary: 1.320237 # Method 1
using dictionary: 1.327119 # Method 2
using dictionary with list comprehension: 0.013287 # @Dietrich
The following solution using the numpy_indexed package (disclaimer: I am its author) is purely vectorized, and likely to be much more efficient than the solutions posted thus far:
import numpy_indexed as npi
idx = npi.indices(ids, ordered_ids, missing='mask')
new_values = values[idx]
new_values[idx.mask] = -1 # or cast to float and set to nan, but you get the idea...
