I want to find the mode of a NumPy array and, when there are ties, get the first one.
For example:
in [1,2,2,3,3,4] there are two modes (most frequently appearing values): 2 and 3.
However, in this case I want to get the leftmost one, 2.
There are some examples of getting the mode with numpy, scipy, or statistics.
My array is a NumPy array, so if I can do it with only numpy, that would be simplest.
How can I do this?
Have you had a look at collections.Counter?
import numpy as np
import collections
x = np.array([1, 2, 3, 2, 4, 3])
c = collections.Counter(x)
# most_common(1) returns a one-element list of (value, count) pairs
mode, count = c.most_common(1)[0]
The documentation of Counter.most_common states:
Elements with equal counts are ordered in the order first encountered
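That guarantee relies on dictionaries preserving insertion order, so it holds on Python 3.7+. A quick check with the asker's own array (a minimal sketch; the exact printed repr of the NumPy scalar may vary by version):
import collections
import numpy as np
x = np.array([1, 2, 2, 3, 3, 4])
c = collections.Counter(x)
# 2 and 3 both appear twice; 2 was encountered first, so it wins the tie
print(c.most_common(1)[0])  # -> value 2 with count 2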
If you want to use only numpy, you could try this, based on the scipy mode implementation:
a = np.array([1,2,2,3,3,4])
# Unique values
scores = set(np.ravel(a))
# Retrieve keys, counts
keys, counts = zip(*[(score, np.sum(a == score)) for score in scores])
# Maximum key
keys[np.argmax(counts)]
>>> 2
where the documentation of argmax states:
"In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned."
I have a 3D int64 Numpy array, which is output from skimage.measure.label. I need a list of 3D indices that match each of our possible (previously known) values, separated out by which indices correspond to each value.
Currently, we do this by the following idiom:
for cur_idx, count in values_counts.items():
    region = labels[:, :, :] == cur_idx
    dim1_indices, dim2_indices, dim3_indices = np.nonzero(region)
While this code works and produces correct output, it is quite slow, especially the np.nonzero part, as we call this 200+ times on a large array. I realize there is probably a faster way to do this via, say, numba, but we'd like to avoid adding additional requirements unless needed.
Ultimately, what we're looking for is a list of indices that correspond to each (nonzero) value relatively efficiently. Assume our number of values <1000 but our array size >100x1000x1000. So, for example, on the array created by the following:
x = np.zeros((4,4,4))
x[3,3,3] = 1; x[1,0,3] = 2; x[1,2,3] = 2
we would want some idx_value dict/array such that idx_value_1[2] = 1, idx_value_2[2] = 2, idx_value_3[2] = 3.
I've tried tackling problems similar to the one you describe, and I think the np.argwhere function is probably your best option for reducing runtime (see the numpy docs). See the code example below for how this could be used per the constraints you identify above.
import numpy as np
x = np.zeros((4,4,4))
x[3,3,3] = 1; x[1,0,3] = 2; x[1,2,3] = 3
# Instantiate dictionary/array to store indices
idx_value = {}
# Get the indices for each value we set
idx_value[3] = np.argwhere(x == 3)
idx_value[2] = np.argwhere(x == 2)
idx_value[1] = np.argwhere(x == 1)
# Display idx_value - consistent with indices we set before
>>> idx_value
{3: array([[1, 2, 3]]), 2: array([[1, 0, 3]]), 1: array([[3, 3, 3]])}
For the first use case, I think you would still have to use a for loop to iterate over the values you're searching over, but it could be done as:
# Instantiate dictionary/array
idx_value = {}
# Now loop, incrementally adding key/value pairs
for cur_idx, count in values_counts.items():
    idx_value[cur_idx] = np.argwhere(labels == cur_idx)
NOTE: This incrementally creates a dictionary where each key is an idx to search for, and each value is a np.array object of shape (N_matches, 3).
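If the per-value scans are still too slow, a hedged sketch of a different technique (argsort-based grouping, my own suggestion rather than anything from the answer above) sorts the flattened array once and splits the sorted order into per-value groups, touching the array once instead of once per value:
import numpy as np
x = np.zeros((4, 4, 4))
x[3, 3, 3] = 1; x[1, 0, 3] = 2; x[1, 2, 3] = 2
flat = x.ravel()
order = np.argsort(flat, kind='stable')  # flat indices, grouped by value
sorted_vals = flat[order]
# boundaries between runs of equal values
splits = np.flatnonzero(np.diff(sorted_vals)) + 1
idx_value = {}
for group in np.split(order, splits):
    val = flat[group[0]]
    if val == 0:
        continue  # skip the background label
    # convert flat indices back to (dim1, dim2, dim3) triples
    idx_value[val] = np.column_stack(np.unravel_index(group, x.shape))
print(idx_value)  # -> value 1 at [3,3,3]; value 2 at [1,0,3] and [1,2,3]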
Suppose I have two arrays of equal length:
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
Now I want to pick elements from these two arrays, in the given sequence, to form a new array of the same length as a and b, by randomly selecting between a and b at each position, in a ratio of a:b = 1:4.68, i.e. for every 1 value picked from a, there should be 4.68 values picked from b in the resultant array.
So effectively the resultant array could be something like:
res = [0,1,1,0,1, 1(from a), 0(from a), 1,1,0,0,1,1,0, 0(from a), 0,0]
The res array: the first 5 values are from b, the 6th and 7th from a, the 8th-14th from b, the 15th from a, and the 16th-17th from b.
The overall ratio of values from a:b in this example res array is 1:4.67 (3 from a, 14 from b).
Thus values have to be chosen from the two arrays at random, but the sequence needs to be maintained: you cannot take the 7th value from one array and the 3rd from the other. If the value to be populated in the resultant array is the 3rd, the choice is between the 3rd elements of the two input arrays, made at random. The overall ratio needs to be maintained as well.
Can you please help me develop an efficient, Pythonic way of reaching this resultant solution? The solution need not be consistent across runs with respect to the values chosen.
Borrowing the a_count calculation from Barmar's answer (because it seems to work and I can't be bothered to reinvent it), this solution preserves the ordering of the values chosen from a and b:
from future_builtins import zip # Only on Python 2, to avoid temporary list of tuples
import random
# int() unnecessary on Python 3
a_count = int(round(1/(1 + 4.68) * len(a)))
# Use range on Python 3, xrange on Python 2, to avoid making actual list
a_indices = frozenset(random.sample(xrange(len(a)), a_count))
res = [aval if i in a_indices else bval for i, (aval, bval) in enumerate(zip(a, b))]
The basic idea here is that you determine how many a values you need, get a unique sample of the possible indices of that size, then iterate a and b in parallel, keeping the a value for the selected indices, and the b value for all others.
If you don't like the complexity of the list comprehension, you could use a different approach, copying b, then filling in the a values one by one:
res = b[:] # Copy b in its entirety
# Replace selected indices with a values
# No need to convert to frozenset for efficiency here, and it's clean
# enough to just iterate the sample directly without storing it
for i in random.sample(xrange(len(a)), a_count):
    res[i] = a[i]
I believe this should work. You specify how many values you want from a (you can use your ratio to figure out that number), randomly generate a 'mask' of numbers, and choose from a or b based on a cutoff (notice that you only sort to find the cutoff; you use the unsorted mask afterwards).
import numpy as np
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
mask = np.random.random(len(a))
from_a = 3
cutoff = np.sort(mask)[from_a]  # the from_a-th smallest mask value
res = []
for i in range(len(a)):
    # the from_a positions with the smallest mask values take their value from a
    if mask[i] < cutoff:
        res.append(a[i])
    else:
        res.append(b[i])
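For what it's worth, a NumPy-only sketch of the same idea (the boolean pick_a mask and the np.where call are my own illustration, not from either answer above):
import numpy as np
a = np.array([0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1])
b = np.array([0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0])
n = len(a)
n_from_a = int(round(n / (1 + 4.68)))  # about 3 of the 17 positions come from a
pick_a = np.zeros(n, dtype=bool)
pick_a[np.random.choice(n, n_from_a, replace=False)] = True
# position i keeps a[i] where pick_a is True and b[i] elsewhere,
# so the in-sequence constraint is preserved
res = np.where(pick_a, a, b)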
I have two matrices (I want them for part-of-speech tagging). The first one contains the POS tag probabilities and the second contains the word probabilities. I need to extract the numbers and sum the matrices. The problem is that when I access a cell, the string part appears too, but I need just the numbers. How can I access them? (Is this a correct way of making matrices? If not, how can I correct it so the tags are headers of the rows and columns?)
import numpy as np
A = np.array([[{'ARTART':0}],[{'ARTN':1}],[{'ARTV':0}],[{'ARTP':0}],
[{'NART':0}],[{'NN':0.13}],[{'NV':0.43}],[{'NP':0.44}],
[{'VART':0.65}],[{'VN':0.35}],[{'VV':0}],[{'VP':0}],
[{'PART':0.74}],[{'PN':0.26}],[{'PV':0}],[{'PP':0}],
[{'NULLART':0.71}],[{'NULLN':0.29}],[{'NULLV':0}],[{'NULLP':0}]]).reshape(5,4)
#print (A)
B = np.array([[{'ARTflies':0}],[{'ARTlike':0}],[{'ARTa':0.36}],[{'ARTflower':0}],
[{'Nflies':0.025}],[{'Nlike':0.012}],[{'Na':0.001}],[{'Nflower':0.063}],
[{'Vflies':0.076}],[{'Vlike':0.1}],[{'Va':0}],[{'Vflower':0.05}],
[{'Pflies':0}],[{'Plike':0.068}],[{'Pa':0}],[{'Pflower':0}]]).reshape(4,4)
#print (B)
#print (A[4][0])
I think you could achieve this with just two dictionaries, one for each array you are currently making:
A = {'ARTART':0, 'ARTN':1, 'ARTV': 0} # and so on
Then you can grab the values of each entry in the dictionary with:
A_val = A.values()
And finally you can sum the values with:
A_sum = sum(A_val)
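If you do want a real matrix with tag "headers", a hedged sketch is to keep the labels in separate lists and the probabilities in a purely numeric array (the row/column tag lists are my reading of the question's data):
import numpy as np
row_tags = ['ART', 'N', 'V', 'P', 'NULL']
col_tags = ['ART', 'N', 'V', 'P']
A = np.array([[0.00, 1.00, 0.00, 0.00],
              [0.00, 0.13, 0.43, 0.44],
              [0.65, 0.35, 0.00, 0.00],
              [0.74, 0.26, 0.00, 0.00],
              [0.71, 0.29, 0.00, 0.00]])
# look up a cell by tag names instead of raw indices
print(A[row_tags.index('N'), col_tags.index('V')])  # 0.43
This keeps NumPy's vectorized arithmetic (e.g. summing two such matrices of equal shape) available while the tags stay human-readable.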
I want to print the index of the row containing the minimum element of the matrix.
My matrix is matrix = [[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]]
and the code is:
matrix = [[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]]
a = np.array(matrix)
buff_min = a.argmin(axis=0)
print(buff_min) #index of the row containing the minimum element
min = np.array(matrix[buff_min])
print(str(min.min(axis=0))) #print the minium of that row
print(min.argmin(axis = 0)) #index of the minimum
print(matrix[buff_min]) # print all row containing the minimum
After running, my result is:
1
3
1
[22, 3, 4, 12]
The first number should be 2, because the minimum is 2 and it is in the third list ([34,6,4,5,8,2]), but it returns 1. And it returns 3 as the minimum of the matrix.
What's the error?
I am not sure which version of Python you are using; I tested this on Python 2.7 and 3.2. As mentioned, your syntax for argmin is not correct; it should be in the format:
import numpy as np
np.argmin(array_name,axis)
Next, while NumPy knows about arrays of arbitrary objects, it is optimized for homogeneous arrays of numbers with fixed dimensions. If you really need arrays of arrays, it's better to use a nested list. But depending on the intended use of your data, different data structures might be even better, e.g. a masked array if you have some invalid data points.
If you really want flexible Numpy arrays, use something like this:
np.array([[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]], dtype=object)
However this will create a one-dimensional array that stores references to lists, which means that you will lose most of the benefits of Numpy (vector processing, locality, slicing, etc.).
Also, if you can resize your rows to equal length, things might work; I haven't tested it, but conceptually that should be an easy solution. Still, I would prefer a nested list for this kind of input matrix.
Does this work?
np.where(a == a.min())[0][0]
Note that all rows of the matrix need to contain the same number of elements.
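To illustrate that note, here is a hedged sketch that pads the ragged rows with a large sentinel value (my own choice) so a true 2-D array exists and the lookup returns the expected row index 2:
import numpy as np
matrix = [[22, 33, 44, 55], [22, 3, 4, 12], [34, 6, 4, 5, 8, 2]]
width = max(len(row) for row in matrix)
# pad short rows with a value larger than any real entry
padded = np.array([row + [10**9] * (width - len(row)) for row in matrix])
print(np.where(padded == padded.min())[0][0])  # 2, the row holding the minimum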
Problem: From two input arrays, I want to output an array with the frequency of True values (from input_2) corresponding to each value of input_1.
import numpy as np
from scipy.stats import itemfreq
input_1 = np.array([3,6,6,3,6,4])
input_2 = np.array([False, True, True, False, False, True])
For this example, the output that I want is:
output_1 = np.array([0,2,2,0,2,1])
My current approach involves editing input_1, so only the values corresponding to True remain:
locs = np.where(input_2 == True, input_1, 0)
Then I count the frequency of each value, create a dictionary, and replace the appropriate keys of input_1 with those values (the True frequencies).
loc_freq = itemfreq(locs)
dic = {}
for key, val in loc_freq:
    dic[key] = val
print dic
for k, v in dic.iteritems():
    input_1[input_1 == k] = v
which outputs [3,2,2,3,2,1].
The problem here is twofold:
1) this still does not do anything with the keys that are not in the dictionary (and should therefore be changed to 0). For example, how can I get the 3s transformed into 0s?
2) This seems very inelegant / ineffective. Is there a better way to approach this?
np.bincount is what you are looking for.
output_1 = np.bincount(input_1[input_2])[input_1]
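To unpack the one-liner, here is an annotated sketch; the minlength argument is my added safeguard (not part of the original answer) for the case where the largest value in input_1 never occurs with True:
import numpy as np
input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])
true_vals = input_1[input_2]  # array([6, 6, 4]): the values where input_2 is True
# counts[v] = occurrences of v among true_vals; minlength makes every
# value of input_1 a valid index into counts
counts = np.bincount(true_vals, minlength=input_1.max() + 1)
print(counts[input_1])  # [0 2 2 0 2 1]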
@memecs' solution is correct, +1. However, it will be very slow and take a lot of memory if the values in input_1 are really large, i.e. if they are not indices of an array but, say, seconds or some other integer data that can take very large values.
In that case, np.bincount(input_1[input_2]).size equals the largest integer in input_1 with a True value in input_2, plus one.
It is much faster to use unique and bincount. We use the first to extract the indices of the unique elements of input_1, and then use bincount to count how often these indices appear in that same array, weighting them 1 or 0 based on the value of the array input_2 (True or False):
# extract unique elements and the indices to reconstruct the array
unq, idx = np.unique(input_1, return_inverse=True)
# calculate the weighted frequencies of these indices
freqs_idx = np.bincount(idx, weights=input_2)
# reconstruct the array of frequencies of the elements
frequencies = freqs_idx[idx]
print(frequencies)
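# -> [0. 2. 2. 0. 2. 1.] (floats, since the weights argument makes bincount return float64)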
This solution is really fast and has the minimum memory impact. Credit goes to @Jaime; see his comment below. Below I report my original answer, using unique in a different manner.
OTHER POSSIBILITY
It may be faster to go for another solution, using unique:
import numpy as np
input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])
non_zero_hits, counts = np.unique(input_1[input_2], return_counts=True)
all_hits, idx = np.unique(input_1, return_inverse=True)
frequencies = np.zeros_like(all_hits)
#2nd step, with broadcasting
idx_non_zero_hits_in_all_hits = np.where(non_zero_hits[:, np.newaxis] - all_hits == 0)[1]
frequencies[idx_non_zero_hits_in_all_hits] = counts
print(frequencies[idx])
This has the drawback that it will require a lot of memory if the number of unique elements in input_1 with a True value in input_2 is large, because of the 2D array created and passed to where. To reduce the memory footprint, you could instead use a for loop for the 2nd step of the algorithm:
#2nd step, but with a for loop.
for j, val in enumerate(non_zero_hits):
    index = np.where(val == all_hits)[0]
    frequencies[index] = counts[j]
print(frequencies[idx])
This second solution has a very small memory footprint but requires a for loop. Which solution is best depends on your typical input data sizes and values.
The currently accepted bincount solution is quite elegant, but the numpy_indexed package provides more general solutions to problems of this kind:
import numpy_indexed as npi
idx = npi.as_index(input_1)
unique_labels, true_count_per_label = npi.group_by(idx).sum(input_2)
print(true_count_per_label[idx.inverse])