I am trying to translate every element of a numpy.array according to a given key:
For example:
a = np.array([[1,2,3],
              [3,2,4]])
my_dict = {1:23, 2:34, 3:36, 4:45}
I want to get:
array([[ 23.,  34.,  36.],
       [ 36.,  34.,  45.]])
I can see how to do it with a loop:
def loop_translate(a, my_dict):
    new_a = np.empty(a.shape)
    for i, row in enumerate(a):
        new_a[i, :] = map(my_dict.get, row)
    return new_a
Is there a more efficient and/or pure numpy way?
Edit:
I timed it, and the np.vectorize method proposed by DSM is considerably faster for larger arrays:
In [13]: def loop_translate(a, my_dict):
   ....:     new_a = np.empty(a.shape)
   ....:     for i,row in enumerate(a):
   ....:         new_a[i,:] = map(my_dict.get, row)
   ....:     return new_a
   ....:
In [14]: def vec_translate(a, my_dict):
   ....:     return np.vectorize(my_dict.__getitem__)(a)
   ....:
In [15]: a = np.random.randint(1,5, (4,5))
In [16]: a
Out[16]:
array([[2, 4, 3, 1, 1],
       [2, 4, 3, 2, 4],
       [4, 2, 1, 3, 1],
       [2, 4, 3, 4, 1]])
In [17]: %timeit loop_translate(a, my_dict)
10000 loops, best of 3: 77.9 us per loop
In [18]: %timeit vec_translate(a, my_dict)
10000 loops, best of 3: 70.5 us per loop
In [19]: a = np.random.randint(1, 5, (500,500))
In [20]: %timeit loop_translate(a, my_dict)
1 loops, best of 3: 298 ms per loop
In [21]: %timeit vec_translate(a, my_dict)
10 loops, best of 3: 37.6 ms per loop
I don't know about efficient, but you could use np.vectorize on the .get method of dictionaries:
>>> a = np.array([[1,2,3],
...               [3,2,4]])
>>> my_dict = {1:23, 2:34, 3:36, 4:45}
>>> np.vectorize(my_dict.get)(a)
array([[23, 34, 36],
       [36, 34, 45]])
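If some values in a might be missing from my_dict, a hedged variant of the same idea (not from the original answer) is to wrap dict.get in a lambda with a default, so absent keys fall back to the original value instead of becoming None:
>>> np.vectorize(lambda x: my_dict.get(x, x))(a)
array([[23, 34, 36],
       [36, 34, 45]])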
Here's another approach, using numpy.unique:
>>> a = np.array([[1,2,3],[3,2,1]])
>>> a
array([[1, 2, 3],
       [3, 2, 1]])
>>> d = {1 : 11, 2 : 22, 3 : 33}
>>> u,inv = np.unique(a,return_inverse = True)
>>> np.array([d[x] for x in u])[inv].reshape(a.shape)
array([[11, 22, 33],
       [33, 22, 11]])
This approach is much faster than the np.vectorize approach when the number of unique elements in the array is small.
Explanation: Python is slow; in this approach the in-Python loop is used only to convert the unique elements, and afterwards we rely on an extremely optimized numpy indexing operation (done in C) to do the mapping. Hence, if the number of unique elements is comparable to the overall size of the array, there will be no speedup. On the other hand, if there are only a few unique elements, you can observe a speedup of up to 100x.
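A rough timing sketch of that claim (a sketch only; exact numbers vary by machine and are not from the original answer). With just four distinct values in a large array, the unique-based version should clearly beat np.vectorize:

import numpy as np
import timeit

d = {k: k * 10 for k in range(1, 5)}        # only 4 distinct keys
a = np.random.randint(1, 5, (1000, 1000))   # large array, few unique values

def unique_translate(a, d):
    u, inv = np.unique(a, return_inverse=True)
    return np.array([d[x] for x in u])[inv].reshape(a.shape)

def vec_translate(a, d):
    return np.vectorize(d.__getitem__)(a)

print(timeit.timeit(lambda: unique_translate(a, d), number=10))
print(timeit.timeit(lambda: vec_translate(a, d), number=10))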
I think it'd be better to iterate over the dictionary, and set values in all the rows and columns "at once":
>>> a = np.array([[1,2,3],[3,2,1]])
>>> a
array([[1, 2, 3],
       [3, 2, 1]])
>>> d = {1 : 11, 2 : 22, 3 : 33}
>>> for k,v in d.iteritems():
...     a[a == k] = v
...
>>> a
array([[11, 22, 33],
       [33, 22, 11]])
Edit:
While it may not be as sexy as DSM's (really good) answer using numpy.vectorize, my tests of all the proposed methods show that this approach (using @jamylak's suggestion) is actually a bit faster:
from __future__ import division
import numpy as np
a = np.random.randint(1, 5, (500,500))
d = {1 : 11, 2 : 22, 3 : 33, 4 : 44}
def unique_translate(a,d):
    u,inv = np.unique(a,return_inverse = True)
    return np.array([d[x] for x in u])[inv].reshape(a.shape)

def vec_translate(a, d):
    return np.vectorize(d.__getitem__)(a)

def loop_translate(a,d):
    n = np.ndarray(a.shape)
    for k in d:
        n[a == k] = d[k]
    return n

def orig_translate(a, d):
    new_a = np.empty(a.shape)
    for i,row in enumerate(a):
        new_a[i,:] = map(d.get, row)
    return new_a


if __name__ == '__main__':
    import timeit
    n_exec = 100
    print 'orig'
    print timeit.timeit("orig_translate(a,d)",
                        setup="from __main__ import np,a,d,orig_translate",
                        number = n_exec) / n_exec
    print 'unique'
    print timeit.timeit("unique_translate(a,d)",
                        setup="from __main__ import np,a,d,unique_translate",
                        number = n_exec) / n_exec
    print 'vec'
    print timeit.timeit("vec_translate(a,d)",
                        setup="from __main__ import np,a,d,vec_translate",
                        number = n_exec) / n_exec
    print 'loop'
    print timeit.timeit("loop_translate(a,d)",
                        setup="from __main__ import np,a,d,loop_translate",
                        number = n_exec) / n_exec
Outputs:
orig
0.222067718506
unique
0.0472617006302
vec
0.0357889199257
loop
0.0285375618935
The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:
import numpy_indexed as npi
remapped_a = npi.remap(a, list(my_dict.keys()), list(my_dict.values()))
The method implemented is similar to the approach mentioned by John Vinyard, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves.
If you set the optional 'missing' kwarg to 'raise' (default is 'ignore'), performance will be slightly better, and you will get a KeyError if not all elements of 'a' are present in the keys.
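A hedged usage sketch based on the description above (keyword handling may differ between numpy_indexed versions):

import numpy as np
import numpy_indexed as npi

a = np.array([[1, 2, 3], [3, 2, 4]])
my_dict = {1: 23, 2: 34, 3: 36, 4: 45}

keys = np.array(list(my_dict.keys()))
values = np.array(list(my_dict.values()))

# missing='raise' turns absent keys into an error instead of passing them through
remapped_a = npi.remap(a, keys, values, missing='raise')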
Assuming your dict keys are positive integers, without huge gaps (similar to a range from 0 to N), you would be better off converting your translation dict to an array such that my_array[i] = my_dict[i], and using numpy indexing to do the translation.
A code using this approach is:
def direct_translate(a, d):
    # list() so this also works on Python 3, where keys()/values() are views
    src, values = list(d.keys()), list(d.values())
    d_array = np.arange(a.max() + 1)
    d_array[src] = values
    return d_array[a]
Testing with random arrays:
N = 10000
shape = (5000, 5000)
a = np.random.randint(N, size=shape)
my_dict = dict(zip(np.arange(N), np.random.randint(N, size=N)))
For these sizes I get around 140 ms for this approach. The np.vectorize approach on dict.get takes around 5.8 s and unique_translate around 8 s.
Possible generalizations:
If you have negative values to translate, you could shift the values in a and in the keys of the dictionary by a constant to map them back to positive integers:
def direct_translate(a, d):  # handles negative source keys
    min_a = a.min()
    # list() so this also works on Python 3, where keys()/values() are views
    src = np.array(list(d.keys())) - min_a
    values = list(d.values())
    d_array = np.arange(a.max() - min_a + 1)
    d_array[src] = values
    return d_array[a - min_a]
If the source keys have huge gaps, the initial array creation would waste memory; in that case I would resort to Cython to speed up the function.
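For that sparse-keys case, one hedged alternative (not the approach above) is to sort the keys once and use np.searchsorted, which avoids allocating a dense lookup table; it assumes every element of a is actually present among the keys:

import numpy as np

def searchsorted_translate(a, d):
    keys = np.array(sorted(d))
    values = np.array([d[k] for k in keys])
    # position of each element of a within the sorted keys
    # (assumes all elements of a appear in d)
    idx = np.searchsorted(keys, a)
    return values[idx]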
If you don't really have to use a dictionary as the substitution table, a simple solution would be (for your example):
a = numpy.array([your array])
my_dict = numpy.array([0, 23, 34, 36, 45])  # your dictionary as an array

def Sub(myarr, table):
    return table[myarr]

values = Sub(a, my_dict)
This will of course only work if the indexes of the lookup array cover all possible values of your a; in other words, only for a containing unsigned integers.
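A minimal guard for that assumption (a sketch, not part of the answer above): check the bounds before indexing the lookup array.

import numpy as np

a = np.array([[1, 2, 3], [3, 2, 4]])
table = np.array([0, 23, 34, 36, 45])   # the "dictionary" as an array

if a.min() >= 0 and a.max() < len(table):
    values = table[a]
else:
    raise ValueError("array contains values outside the lookup table")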
Related
Given the following array:
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
[[1 2 3]
 [4 5 6]
 [7 8 9]]
How can I replace certain values with other values?
bad_vals = [4, 2, 6]
update_vals = [11, 1, 8]
I currently use:
for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]
Which gives:
[[ 1  1  3]
 [11  5  8]
 [ 7  8  9]]
But it is rather slow for large arrays with many values to be replaced. Is there any good alternative?
The input array can be changed to anything (list of list/tuples) if this might be necessary to access certain speedy black magic.
EDIT:
Based on the great answers from @Divakar and @charlysotelo, I did a quick comparison for my real use-case data using the benchit package. My input data array has a ratio of roughly 100:1 (rows:columns), and the length of the array of replacement values is on the order of 3 x the number of rows.
Functions:
# current approach
def enumerate_values(a, bad_vals, update_vals):
    for idx, v in enumerate(bad_vals):
        a[a==v] = update_vals[idx]
    return a

# provided solution by @Divakar
def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out

# provided solution by @charlysotelo
def vectorize_values(a, bad_vals, update_vals):
    bad_to_good_map = {}
    for idx, bad_val in enumerate(bad_vals):
        bad_to_good_map[bad_val] = update_vals[idx]
    f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
    a = f(a)
    return a

# define benchit input functions
import benchit
funcs = [enumerate_values, map_values, vectorize_values]

# define benchit input variables to bench against
in_ = {
    n: (
        np.random.randint(0, n*10, (n, int(n * 0.01))),  # array
        np.random.choice(n*10, n*3, replace=False),      # bad_vals
        np.random.choice(n*10, n*3)                      # update_vals
    )
    for n in [300, 1000, 3000, 10000, 30000]
}

# do the bench
# btw: timing of bad approaches (my own function here) takes time
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, grid=False)
Here's one way based on the hinted mapping array method for positive numbers -
def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out
Sample run -
In [94]: a
Out[94]:
array([[1, 2, 1],
       [4, 5, 6],
       [7, 1, 1]])

In [95]: bad_vals
Out[95]: [4, 2, 6]

In [96]: update_vals
Out[96]: [11, 1, 8]

In [97]: map_values(a, bad_vals, update_vals)
Out[97]:
array([[ 1,  1,  1],
       [11,  5,  8],
       [ 7,  1,  1]])
Benchmarking
# Original soln
def replacevals(a, bad_vals, update_vals):
    out = a.copy()
    for idx, v in enumerate(bad_vals):
        out[out==v] = update_vals[idx]
    return out
The given sample had a 2D input of n x n with n values to be replaced. Let's set up input datasets with the same structure.
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
import benchit
funcs = [replacevals, map_values]
in_ = {n: (np.random.randint(0, n*10, (n, n)),
           np.random.choice(n*10, n, replace=False),
           np.random.choice(n*10, n))
       for n in [3, 10, 100, 1000, 2000]}
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, save='timings.png')
Plot: (timings.png generated by the code above)
This really depends on the size of your array, and the size of your mappings from bad to good integers.
For a larger number of bad-to-good mappings, the method below is better:
import numpy as np
import time
ARRAY_ROWS = 10000
ARRAY_COLS = 1000
NUM_MAPPINGS = 10000
bad_vals = np.random.rand(NUM_MAPPINGS)
update_vals = np.random.rand(NUM_MAPPINGS)
bad_to_good_map = {}
for idx, bad_val in enumerate(bad_vals):
    bad_to_good_map[bad_val] = update_vals[idx]
# np.vectorize with mapping
# Takes about 4 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
print (time.time())
a = f(a)
print (time.time())
# Your way
# Takes about 60 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
print (time.time())
for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]
print (time.time())
Running the code above, the np.vectorize(lambda) way took less than 4 seconds to finish, whereas your way took almost 60 seconds. However, with NUM_MAPPINGS set to 100, your method takes less than a second for me, faster than the 2 seconds for the np.vectorize way.
I'm using itertools.combinations() as follows:
import itertools
import numpy as np
L = [1,2,3,4,5]
N = 3
output = np.array([a for a in itertools.combinations(L,N)]).T
Which yields me the output I need:
array([[1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
       [2, 2, 2, 3, 3, 4, 3, 3, 4, 4],
       [3, 4, 5, 4, 5, 5, 4, 5, 5, 5]])
I'm using this expression repeatedly and excessively in a multiprocessing environment and I need it to be as fast as possible.
From this post I understand that itertools-based code isn't the fastest solution, and that using numpy could be an improvement; however, I'm not good enough at numpy optimization tricks to understand and adapt the iterative code written there, or to come up with my own optimization.
Any help would be greatly appreciated.
EDIT:
L comes from a pandas dataframe, so it can as well be seen as a numpy array:
L = df.L.values
Here's one that's slightly faster than itertools. UPDATE: and one (nump2) that's actually quite a bit faster:
import numpy as np
import itertools
import timeit
def nump(n, k, i=0):
    if k == 1:
        a = np.arange(i, i+n)
        return tuple([a[None, j:] for j in range(n)])
    template = nump(n-1, k-1, i+1)
    full = np.r_[np.repeat(np.arange(i, i+n-k+1),
                           [t.shape[1] for t in template])[None, :],
                 np.c_[template]]
    return tuple([full[:, j:] for j in np.r_[0, np.add.accumulate(
        [t.shape[1] for t in template[:-1]])]])

def nump2(n, k):
    a = np.ones((k, n-k+1), dtype=int)
    a[0] = np.arange(n-k+1)
    for j in range(1, k):
        reps = (n-k+j) - a[j-1]
        a = np.repeat(a, reps, axis=1)
        ind = np.add.accumulate(reps)
        a[j, ind[:-1]] = 1-reps[1:]
        a[j, 0] = j
        a[j] = np.add.accumulate(a[j])
    return a

def itto(L, N):
    return np.array([a for a in itertools.combinations(L,N)]).T

k = 6
n = 12
N = np.arange(n)

assert np.all(nump2(n,k) == itto(N,k))

print('numpy    ', timeit.timeit('f(a,b)', number=100, globals={'f':nump,  'a':n, 'b':k}))
print('numpy 2  ', timeit.timeit('f(a,b)', number=100, globals={'f':nump2, 'a':n, 'b':k}))
print('itertools', timeit.timeit('f(a,b)', number=100, globals={'f':itto,  'a':N, 'b':k}))
Timings:
k = 3, n = 50
numpy 0.06967267207801342
numpy 2 0.035096961073577404
itertools 0.7981023890897632
k = 3, n = 10
numpy 0.015058324905112386
numpy 2 0.0017436158377677202
itertools 0.004743851954117417
k = 6, n = 12
numpy 0.03546895203180611
numpy 2 0.00997065706178546
itertools 0.05292179994285107
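If, as in the question, you need combinations of the values in L rather than of index positions, the index array returned by nump2 can be used to fancy-index L directly (a usage sketch, assuming np, L and nump2 as defined above):

L = np.array([1, 2, 3, 4, 5])
idx = nump2(len(L), 3)   # shape (3, C(5, 3)) array of index combinations
out = L[idx]             # value combinations, same layout as itto(L, 3)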
This is most certainly not faster than itertools.combinations, but it is vectorized numpy:
def nd_triu_indices(T,N):
    o = np.array(np.meshgrid(*(np.arange(len(T)),)*N))
    return np.array(T)[o[..., np.all(o[1:] > o[:-1], axis=0)]]
%timeit np.array(list(itertools.combinations(T,N))).T
The slowest run took 4.40 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.6 µs per loop
%timeit nd_triu_indices(T,N)
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 52.4 µs per loop
Not sure if this is vectorizable another way, or if one of the optimization wizards around here can make this method faster.
EDIT: Came up with another way, but still not faster than combinations:
%timeit np.array(T)[np.array(np.where(np.fromfunction(lambda *i: np.all(np.array(i)[1:]>np.array(i)[:-1], axis=0),(len(T),)*N,dtype=int)))]
The slowest run took 7.78 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 34.3 µs per loop
I know this question is old, but I have been working on it recently, and it still might help. From my (pretty extensive) testing, I have found that first generating combinations of each index, and then using these indices to slice the array, is much faster than directly making combinations from the array. I'm sure that using @Paul Panzer's nump2 function to generate these indices could be even faster.
Here is an example:
import numpy as np
from math import factorial
import itertools as iters
from timeit import timeit
from perfplot import show
def combinations_iter(array: np.ndarray, r: int = 3) -> np.ndarray:
    return np.array([*iters.combinations(array, r = r)], dtype = array.dtype)

def combinations_iter_idx(array: np.ndarray, r: int = 3) -> np.ndarray:
    n_items = array.shape[0]
    num_combinations = factorial(n_items)//(factorial(n_items-r)*factorial(r))
    combination_idx = np.fromiter(
        iters.chain.from_iterable(iters.combinations(np.arange(n_items, dtype = np.int64), r = r)),
        dtype = np.int64,
        count = num_combinations*r,
    ).reshape(-1, r)
    return array[combination_idx]

show(
    setup = lambda n: np.random.uniform(0, 100, (n, 3)),
    kernels = [combinations_iter, combinations_iter_idx],
    labels = ['pure itertools', 'itertools for index'],
    n_range = np.geomspace(5, 300, 10, dtype = np.int64),
    xlabel = "n",
    logx = True,
    logy = False,
    equality_check = np.allclose,
    show_progress = True,
    max_time = None,
    time_unit = "ms",
)
It is clear that the indexing method is much faster.
First off, apologies for the vague title, I couldn't think of an appropriate name for this issue.
I have 3 numpy arrays in the following formats:
N = ([[13, 14, 15], [2, 5, 7], [4, 6, 8] ... several hundred thousand elements long
e1 = [1, 0, 0]
e2 = [0, 1, 0]
The idea is to create a fourth array, 'v', which shall have the same dimensions as 'N', but will be given values based on an if statement. Here is what I currently have which should better explain the issue:
v = np.zeros([len(N), 3])
for i in range(0, len(N)):
    if (N*e1)[i,0] != 0:
        v[i] = np.cross(N[i],e1)
    else:
        v[i] = np.cross(N[i],e2)
This code does what I require it to but does so in a longer than anticipated time (> 5 mins). Is there any form of list comprehension or similar concept I could use to increase the efficiency of the code?
You can use numpy.where to replace the if-else and vectorize the process with broadcasting; here is one option:
import numpy as np
np.where(np.repeat(N[:,0] != 0, 3).reshape(1000,3), np.cross(N, e1), np.cross(N, e2))
Some benchmarks here:
1) Data set up:
N = np.array([np.random.randint(0,10,3) for i in range(1000)])
N
#array([[3, 5, 0],
#       [5, 0, 8],
#       [4, 6, 0],
#       ...,
#       [9, 4, 2],
#       [6, 9, 3],
#       [2, 9, 2]])
e1 = np.array([1, 0, 0])
e2 = np.array([0, 1, 0])
2) Timing:
def forloop():
    v = np.zeros([len(N), 3])
    for i in range(0, len(N)):
        if (N*e1)[i,0] != 0:
            v[i] = np.cross(N[i],e1)
        else:
            v[i] = np.cross(N[i],e2)
    return v

def forloop2():
    v = np.zeros([len(N), 3])
    # Only calculate this one time.
    my_product = N*e1
    for i in range(0, len(N)):
        if my_product[i,0] != 0:
            v[i] = np.cross(N[i],e1)
        else:
            v[i] = np.cross(N[i],e2)
    return v
%timeit forloop()
10 loops, best of 3: 25.5 ms per loop
%timeit forloop2()
100 loops, best of 3: 12.7 ms per loop
%timeit np.where(np.repeat(N[:,0] != 0, 3).reshape(1000,3), np.cross(N, e1), np.cross(N, e2))
10000 loops, best of 3: 71.9 µs per loop
3) Result checking for all methods:
v1 = forloop()
v2 = np.where(np.repeat(N[:,0] != 0, 3).reshape(1000,3), np.cross(N, e1), np.cross(N, e2))
v3 = forloop2()
(v3 == v1).all()
# True
(v1 == v2).all()
# True
I'm not certain what it is you're trying to do, but I know why this specific code is so slow for you. The worst offender is (N*e1). That's a simple calculation, and it runs pretty fast with numpy, but you're executing it inside of the loop, len(N) times!
I am able to execute your code with N == 1000000 in less than 15 seconds on my machine by pulling that outside of the loop. Example below.
v = np.zeros([len(N), 3])

# Only calculate this one time.
my_product = N*e1

for i in range(0, len(N)):
    if my_product[i,0] != 0:
        v[i] = np.cross(N[i],e1)
    else:
        v[i] = np.cross(N[i],e2)
The other answer demonstrates how to avoid the for loop and if statements for a lot of extra speed at the cost of somewhat less readable code.
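Another hedged way to drop the loop (not from either answer) is boolean-mask assignment, which also avoids computing both cross products for every row; it assumes N is already a numpy array and e1, e2 are as in the question:

import numpy as np

mask = N[:, 0] != 0                  # same condition as (N*e1)[i, 0] != 0
v = np.empty((len(N), 3))
v[mask] = np.cross(N[mask], e1)      # cross with e1 only where needed
v[~mask] = np.cross(N[~mask], e2)    # cross with e2 for the remaining rows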
I have a list of integer numbers and I want to write a function that returns a subset of numbers that are within a range. Something like NumbersWithinRange(list, interval) function name...
I.e.,
list = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
interval = [4,20]
results = NumbersWithinRange(list, interval) # [4,4,6,8,7,8]
Maybe I forgot to write one more number in results, but that's the idea...
The list can be as long as 10-20 million elements, and the range normally spans only a few hundred values.
Any suggestions on how to do it efficiently with Python? I was thinking of using bisect.
Thanks.
I would use numpy for that, especially if the list is that long. For example:
In [101]: list = np.array([4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100])
In [102]: list
Out[102]:
array([  4,   2,   1,   7,   9,   4,   3,   6,   8,  97,   7,  65,   3,
         2,   2,  78,  23,   1,   3,   4,   5,  67,   8, 100])
In [103]: good = np.where((list > 4) & (list < 20))
In [104]: list[good]
Out[104]: array([7, 9, 6, 8, 7, 5, 8])
# %timeit says that numpy is MUCH faster than any list comprehension:
# create an array 10**6 random ints b/w 0 and 100
In [129]: arr = np.random.randint(0,100,1000000)
In [130]: interval = xrange(4,21)
In [126]: %timeit r = [x for x in arr if x in interval]
1 loops, best of 3: 14.2 s per loop
In [136]: %timeit good = np.where((list > 4) & (list < 20)) ; new_list = list[good]
100 loops, best of 3: 10.8 ms per loop
In [134]: %timeit r = [x for x in arr if 4 < x < 20]
1 loops, best of 3: 2.22 s per loop
In [142]: %timeit filtered = [i for i in ifilter(lambda x: 4 < x < 20, arr)]
1 loops, best of 3: 2.56 s per loop
The pure-Python sortedcontainers module has a SortedList type that can help you. It maintains the list automatically in sorted order and has been tested past tens of millions of elements. The sorted list type has a bisect function you can use.
from sortedcontainers import SortedList

data = SortedList(...)

def NumbersWithinRange(items, lower, upper):
    start = items.bisect_left(lower)   # bisect_left so values equal to lower are included
    end = items.bisect_right(upper)
    return items[start:end]

subset = NumbersWithinRange(data, 4, 20)
Bisecting and indexing will be much faster this way than scanning the entire list. The sorted containers module is very fast and has a performance comparison page with benchmarks against alternative implementations.
If the list isn't sorted, you need to scan the entire list:
lst = [ 4,2,1,...]
interval=[4,20]
results = [ x for x in lst if interval[0] <= x <= interval[1] ]
If the list is sorted, you can use bisect to find the left and right indices that
bound your range.
import bisect

left = bisect.bisect_left(lst, interval[0])
right = bisect.bisect_right(lst, interval[1])
results = lst[left:right]
Since scanning the list is O(n) and sorting is O(n lg n), it probably is not worth sorting the list just to use bisect unless you plan on doing lots of range extractions.
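A small end-to-end sketch of the sorted case with the question's data (assuming you sort once and then extract many ranges):

import bisect

lst = sorted([4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100])
left = bisect.bisect_left(lst, 4)
right = bisect.bisect_right(lst, 20)
print(lst[left:right])   # [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]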
I think this should be sufficiently efficient:
>>> nums = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
>>> r = [x for x in nums if 4 <= x < 21]
>>> r
[4, 7, 9, 4, 6, 8, 7, 4, 5, 8]
Edit:
After J.F. Sebastian's excellent observation, I modified the code.
Using iterators
>>> from itertools import ifilter
>>> A = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
>>> [i for i in ifilter(lambda x: 4 < x < 20, A)]
[7, 9, 6, 8, 7, 5, 8]
Let's create a list similar to what you described:
import random
l = [random.randint(-100000,100000) for i in xrange(1000000)]
Now test some possible solutions:
from itertools import ifilter   # Python 2
import cmpthese                 # third-party benchmarking helper

interval = range(400, 800)

def v2():
    """ return a list """
    return [i for i in l if i in interval]

def v3():
    """ return a generator """
    return list((i for i in l if i in interval))

def v4():
    def te(x):
        return x in interval
    return filter(te, l)

def v5():
    return [i for i in ifilter(lambda x: x in interval, l)]

print len(v2()), len(v3()), len(v4()), len(v5())
cmpthese.cmpthese([v2, v3, v4, v5], micro=True, c=2)
Prints this:
rate/sec usec/pass v5 v4 v2 v3
v5 0 6929225.922 -- -0.4% -1.0% -1.6%
v4 0 6903028.488 0.4% -- -0.6% -1.2%
v2 0 6861472.487 1.0% 0.6% -- -0.6%
v3 0 6817855.477 1.6% 1.2% 0.6% --
HOWEVER, watch what happens if interval is a set instead of a list:
interval=set(range(400,800))
cmpthese.cmpthese([v2,v3,v4,v5],micro=True, c=2)
rate/sec usec/pass v5 v4 v3 v2
v5 5 201332.569 -- -20.6% -62.9% -64.6%
v4 6 159871.578 25.9% -- -53.2% -55.4%
v3 13 74769.974 169.3% 113.8% -- -4.7%
v2 14 71270.943 182.5% 124.3% 4.9% --
Now comparing with numpy:
na = np.array(l)

def v7():
    """ assume you have to convert from list => numpy array and return a list """
    arr = np.array(l)
    tgt = np.where((arr >= 400) & (arr < 800))
    return [arr[x] for x in tgt][0].tolist()

def v8():
    """ start with a numpy list but return a python list """
    tgt = np.where((na >= 400) & (na < 800))
    return na[tgt].tolist()

def v9():
    """ numpy all the way through """
    tgt = np.where((na >= 400) & (na < 800))
    return [na[x] for x in tgt][0]
    # or return na[tgt] if you prefer that syntax...

cmpthese.cmpthese([v2, v3, v4, v5, v7, v8, v9], micro=True, c=2)
rate/sec usec/pass v5 v4 v7 v3 v2 v8 v9
v5 5 185431.957 -- -17.4% -24.7% -63.3% -63.4% -93.6% -93.6%
v4 7 153095.007 21.1% -- -8.8% -55.6% -55.7% -92.3% -92.3%
v7 7 139570.475 32.9% 9.7% -- -51.3% -51.4% -91.5% -91.5%
v3 15 67983.985 172.8% 125.2% 105.3% -- -0.2% -82.6% -82.6%
v2 15 67861.438 173.3% 125.6% 105.7% 0.2% -- -82.5% -82.5%
v8 84 11850.476 1464.8% 1191.9% 1077.8% 473.7% 472.6% -- -0.0%
v9 84 11847.973 1465.1% 1192.2% 1078.0% 473.8% 472.8% 0.0% --
Clearly numpy is faster than pure python as long as you can work with numpy all the way through. Otherwise, use a set for the interval to speed up a bit...
I think you are looking for something like this:
b = [i for i in a if 4 <= i < 90]
print sorted(set(b))
[4, 5, 6, 7, 8, 9, 23, 65, 67, 78]
If your data set isn't too sparse, you could use "bins" to store and retrieve the data. For example:
a = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
# Initialize a list of 0's [0, 0, ...]
# This is assuming that the minimum possible value is 0
bins = [0 for _ in range(max(a) + 1)]

# Update the bins with the frequency of each number
for i in a:
    bins[i] += 1

def NumbersWithinRange(data, interval):
    result = []
    for i in range(interval[0], interval[1] + 1):
        freq = data[i]
        if freq > 0:
            result += [i] * freq
    return result
This works for this test case:
print(NumbersWithinRange(bins, [4, 20]))
# [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]
For simplicity, I omitted some bounds checking in the function.
To reiterate, this may work better in terms of space and time usage, but it depends heavily on your particular data set. The less sparse the data set, the better it will do.
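The same binning idea can also be written with numpy's bincount; a minimal sketch, assuming non-negative integers as in the example list:

import numpy as np

a = np.array([4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100])
counts = np.bincount(a)                        # counts[v] == frequency of value v
lo, hi = 4, 20
vals = np.arange(lo, hi + 1)
result = np.repeat(vals, counts[lo:hi + 1])    # each value repeated by its frequency
print(result)   # [4 4 4 5 6 7 7 8 8 9]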
In Python, I have a list:
L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
I want to identify the item that occurred the highest number of times. I am able to solve it but I need the fastest way to do so. I know there is a nice Pythonic answer to this.
I am surprised no one has mentioned the simplest solution, max() with the key list.count:
max(lst,key=lst.count)
Example:
>>> lst = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
>>> max(lst,key=lst.count)
4
This works in Python 3 or 2, but note that it only returns the most frequent item and not also the frequency. Also, in the case of a draw (i.e. joint most frequent item) only a single item is returned.
Although the time complexity of using max() is worse than using Counter.most_common(1) as PM 2Ring comments, the approach benefits from a rapid C implementation and I find this approach is fastest for short lists but slower for larger ones (Python 3.6 timings shown in IPython 5.3):
In [1]: from collections import Counter
   ...:
   ...: def f1(lst):
   ...:     return max(lst, key = lst.count)
   ...:
   ...: def f2(lst):
   ...:     return Counter(lst).most_common(1)
   ...:
   ...: lst0 = [1,2,3,4,3]
   ...: lst1 = lst0[:] * 100
   ...:
In [2]: %timeit -n 10 f1(lst0)
10 loops, best of 3: 3.32 us per loop
In [3]: %timeit -n 10 f2(lst0)
10 loops, best of 3: 26 us per loop
In [4]: %timeit -n 10 f1(lst1)
10 loops, best of 3: 4.04 ms per loop
In [5]: %timeit -n 10 f2(lst1)
10 loops, best of 3: 75.6 us per loop
from collections import Counter
most_common,num_most_common = Counter(L).most_common(1)[0] # 4, 6 times
For older Python versions (< 2.7), you can use this recipe to create the Counter class.
In your question, you asked for the fastest way to do it. As has been demonstrated repeatedly, particularly with Python, intuition is not a reliable guide: you need to measure.
Here's a simple test of several different implementations:
import sys
from collections import Counter, defaultdict
from itertools import groupby
from operator import itemgetter
from timeit import timeit

L = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]

def max_occurrences_1a(seq=L):
    "dict iteritems"
    c = dict()
    for item in seq:
        c[item] = c.get(item, 0) + 1
    return max(c.iteritems(), key=itemgetter(1))

def max_occurrences_1b(seq=L):
    "dict items"
    c = dict()
    for item in seq:
        c[item] = c.get(item, 0) + 1
    return max(c.items(), key=itemgetter(1))

def max_occurrences_2(seq=L):
    "defaultdict iteritems"
    c = defaultdict(int)
    for item in seq:
        c[item] += 1
    return max(c.iteritems(), key=itemgetter(1))

def max_occurrences_3a(seq=L):
    "sort groupby generator expression"
    return max(((k, sum(1 for i in g)) for k, g in groupby(sorted(seq))), key=itemgetter(1))

def max_occurrences_3b(seq=L):
    "sort groupby list comprehension"
    return max([(k, sum(1 for i in g)) for k, g in groupby(sorted(seq))], key=itemgetter(1))

def max_occurrences_4(seq=L):
    "counter"
    return Counter(seq).most_common(1)[0]

versions = [max_occurrences_1a, max_occurrences_1b, max_occurrences_2,
            max_occurrences_3a, max_occurrences_3b, max_occurrences_4]

print sys.version, "\n"

for vers in versions:
    print vers.__doc__, vers(), timeit(vers, number=20000)
The results on my machine:
2.7.2 (v2.7.2:8527427914a2, Jun 11 2011, 15:22:34)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
dict iteritems (4, 6) 0.202214956284
dict items (4, 6) 0.208412885666
defaultdict iteritems (4, 6) 0.221301078796
sort groupby generator expression (4, 6) 0.383440971375
sort groupby list comprehension (4, 6) 0.402786016464
counter (4, 6) 0.564319133759
So it appears that the Counter solution is not the fastest. And, in this case at least, groupby is faster. defaultdict is good but you pay a little bit for its convenience; it's slightly faster to use a regular dict with a get.
What happens if the list is much bigger? Adding L *= 10000 to the test above and reducing the repeat count to 200:
dict iteritems (4, 60000) 10.3451900482
dict items (4, 60000) 10.2988479137
defaultdict iteritems (4, 60000) 5.52838587761
sort groupby generator expression (4, 60000) 11.9538850784
sort groupby list comprehension (4, 60000) 12.1327362061
counter (4, 60000) 14.7495789528
Now defaultdict is the clear winner. So perhaps the cost of the 'get' method and the loss of the inplace add adds up (an examination of the generated code is left as an exercise).
But with the modified test data, the number of unique item values did not change so presumably dict and defaultdict have an advantage there over the other implementations. So what happens if we use the bigger list but substantially increase the number of unique items? Replacing the initialization of L with:
LL = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]
L = []
for i in xrange(1,10001):
    L.extend(l * i for l in LL)
dict iteritems (2520, 13) 17.9935798645
dict items (2520, 13) 21.8974409103
defaultdict iteritems (2520, 13) 16.8289561272
sort groupby generator expression (2520, 13) 33.853593111
sort groupby list comprehension (2520, 13) 36.1303369999
counter (2520, 13) 22.626899004
So now Counter is clearly faster than the groupby solutions but still slower than the iteritems versions of dict and defaultdict.
The point of these examples isn't to produce an optimal solution. The point is that there often isn't one optimal general solution. Plus there are other performance criteria. The memory requirements will differ substantially among the solutions and, as the size of the input goes up, memory requirements may become the overriding factor in algorithm selection.
Bottom line: it all depends and you need to measure.
Here is a defaultdict solution that will work with Python versions 2.5 and above:
from collections import defaultdict
L = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]
d = defaultdict(int)
for i in L:
    d[i] += 1
result = max(d.iteritems(), key=lambda x: x[1])
print result
# (4, 6)
# The number 4 occurs 6 times
Note if L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 7, 7, 7, 7, 7, 56, 6, 7, 67]
then there are six 4s and six 7s. However, the result will be (4, 6) i.e. six 4s.
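If you also want every item that ties for the highest count, a small sketch extending the same counting dict (an addition for illustration, not part of the answer above):

from collections import defaultdict

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 7, 7, 7, 7, 7, 56, 6, 7, 67]
d = defaultdict(int)
for i in L:
    d[i] += 1

top = max(d.values())
print([k for k, v in d.items() if v == top])   # [4, 7] (order may vary)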
If you're using Python 3.8 or above, you can use either statistics.mode() to return the first mode encountered or statistics.multimode() to return all the modes.
>>> import statistics
>>> data = [1, 2, 2, 3, 3, 4]
>>> statistics.mode(data)
2
>>> statistics.multimode(data)
[2, 3]
If the list is empty, statistics.mode() throws a statistics.StatisticsError and statistics.multimode() returns an empty list.
Note before Python 3.8, statistics.mode() (introduced in 3.4) would additionally throw a statistics.StatisticsError if there is not exactly one most common value.
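On versions before 3.8 you may therefore want to guard the call; a minimal sketch:

import statistics

data = [1, 1, 2, 2, 3]
try:
    m = statistics.mode(data)
except statistics.StatisticsError:
    m = None   # no unique mode (only raised on Python < 3.8)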
A simple way without any libraries or sets
def mcount(l):
    n = []  # to store the count of each element
    for x in l:
        count = 0
        for i in range(len(l)):
            if x == l[i]:
                count += 1
        n.append(count)
    a = max(n)  # largest in the counts list
    for i in range(len(n)):
        if n[i] == a:
            return (l[i], a)  # element, frequency
    return  # if something goes wrong
Perhaps the most_common() method of collections.Counter.
I obtained the best results with groupby from the itertools module, with this function, using Python 3.5.2:
from itertools import groupby
a = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
def occurrence():
    occurrence, num_times = 0, 0
    # note: groupby only groups *consecutive* equal items, so this relies on
    # the repeated values in a being adjacent; sort a first for general input
    for key, values in groupby(a, lambda x: x):
        val = len(list(values))
        if val >= num_times:
            occurrence, num_times = key, val
    return occurrence, num_times
occurrence, num_times = occurrence()
print("%d occurred %d times which is the highest number of times" % (occurrence, num_times))
Output:
4 occurred 6 times which is the highest number of times
Test with timeit from the timeit module. I used this script for my test with number=20000:
from itertools import groupby

def occurrence():
    a = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
    occurrence, num_times = 0, 0
    for key, values in groupby(a, lambda x: x):
        val = len(list(values))
        if val >= num_times:
            occurrence, num_times = key, val
    return occurrence, num_times

if __name__ == '__main__':
    from timeit import timeit
    print(timeit("occurrence()", setup="from __main__ import occurrence", number=20000))
Output (The best one):
0.1893607140000313
I want to throw in another solution that looks nice and is fast for short lists.
def mc(seq=L):
    "max/count"
    max_element = max(seq, key=seq.count)
    return (max_element, seq.count(max_element))
You can benchmark this with the code provided by Ned Deily which will give you these results for the smallest test case:
3.5.2 (default, Nov 7 2016, 11:31:36)
[GCC 6.2.1 20160830]
dict iteritems (4, 6) 0.2069783889998289
dict items (4, 6) 0.20462976200065896
defaultdict iteritems (4, 6) 0.2095775119996688
sort groupby generator expression (4, 6) 0.4473949929997616
sort groupby list comprehension (4, 6) 0.4367636879997008
counter (4, 6) 0.3618192010007988
max/count (4, 6) 0.20328268999946886
But beware, it is inefficient and thus gets really slow for large lists!
Simple and best code:
def max_occ(lst, x):
    count = 0
    for i in lst:
        if i == x:
            count = count + 1
    return count

lst = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
x = max(lst, key=lst.count)
print(x, "occurs ", max_occ(lst, x), "times")
Output: 4 occurs 6 times
My (simple) code (three months studying Python):
def more_frequent_item(lst):
    new_lst = []
    times = 0
    for item in lst:
        count_num = lst.count(item)
        new_lst.append(count_num)
    times = max(new_lst)
    key = max(lst, key=lst.count)
    print("In the list: ")
    print(lst)
    print("The most frequent item is " + str(key) + ". Appears " + str(times) + " times in this list.")

more_frequent_item([1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67])
The output will be:
In the list:
[1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
The most frequent item is 4. Appears 6 times in this list.
If you are using numpy in your solution, for faster computation use this:
import numpy as np

x = np.array([2,5,77,77,77,77,77,77,77,9,0,3,3,3,3,3])
y = np.bincount(x, minlength=max(x) + 1)   # counts for every value 0..max(x)
y = np.argmax(y)
print(y)   # outputs 77
Following is the solution I came up with for the case where multiple characters in the string all have the highest frequency.
mystr = input("enter string: ")

# define dictionary to store characters and their frequencies
mydict = {}

# get the unique characters
unique_chars = sorted(set(mystr), key=mystr.index)

# store the characters and their respective frequencies in the dictionary
for c in unique_chars:
    ctr = 0
    for d in mystr:
        if d != " " and d == c:
            ctr = ctr + 1
    mydict[c] = ctr
print(mydict)

# store the maximum frequency
max_freq = max(mydict.values())
print("the highest frequency of occurrence: ", max_freq)

# print all characters with highest frequency
print("the characters are:")
for k, v in mydict.items():
    if v == max_freq:
        print(k)
Input: "hello people"
Output:
{'o': 2, 'p': 2, 'h': 1, ' ': 0, 'e': 3, 'l': 3}
the highest frequency of occurrence: 3
the characters are:
e
l
Maybe something like this:
testList = [1, 2, 3, 4, 2, 2, 1, 4, 4]
print(max(set(testList), key = testList.count))