The function numpy.where() can be used to obtain an array of indices into a NumPy array where a logical condition is true. How do we generate a list of index arrays, one for each contiguous region where the condition is true?
For example,
import numpy as np
a = np.array( [0.,.1,.2,.3,.4,.5,.4,.3,.2,.1,0.] )
idx = np.where( (np.abs(a-.2) <= .1) )
print( 'idx =', idx)
print( 'a[idx] =', a[idx] )
produces the following output,
idx = (array([1, 2, 3, 7, 8, 9]),)
a[idx] = [0.1 0.2 0.3 0.3 0.2 0.1]
The question is: how, in a simple way, do we obtain a list of arrays of indices, one such array for each contiguous section? For example, like this:
idx = (array([1, 2, 3]),), (array([7, 8, 9]),)
a[idx[0]] = [0.1 0.2 0.3]
a[idx[1]] = [0.3 0.2 0.1]
You can simply use np.split() to split your idx[0] into contiguous runs:
ia = idx[0]
out = np.split(ia, np.where(ia[1:] != ia[:-1] + 1)[0] + 1)
>>> out
[array([1, 2, 3]), array([7, 8, 9])]
This should work:
a = np.array([0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
idx = np.nonzero(np.abs(a - 0.2) <= 0.1)[0]
splits = np.split(idx, np.nonzero(np.diff(idx) > 1)[0] + 1)
print(splits)
It gives:
[array([1, 2, 3]), array([7, 8, 9])]
You can check if the difference between the shifted idx arrays is 1, then split the array at the corresponding indices.
import numpy as np
a = np.array( [0.,.1,.2,.3,.4,.5,.4,.3,.2,.1,0.] )
idx = np.where( np.abs(a-.2) <= .1 )[0]
# Get the indices where the increment of values is larger than 1.
split_idcs = np.argwhere( idx[1:]-idx[:-1] > 1 ) + 1
# Split the array at the corresponding indices (argwhere returns a column vector, hence the ravel).
result = np.split(idx, split_idcs.ravel())
print(result)
# [array([1, 2, 3], dtype=int64), array([7, 8, 9], dtype=int64)]
It works for your example; however, I am unsure whether this implementation works for arbitrary sequences.
You can achieve the goal with:
diff = np.diff(idx[0], prepend=idx[0][0]-1)
result = np.split(idx[0], np.where(diff != 1)[0])
or
idx = np.where((np.abs(a - .2) <= .1))[0]
diff = np.diff(idx, prepend=idx[0]-1)
result = np.split(idx, np.where(diff != 1)[0])
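For completeness, a minimal end-to-end sketch of this second variant (assuming NumPy >= 1.16 for the prepend argument of np.diff), reusing the array a from the question:
import numpy as np

a = np.array([0., .1, .2, .3, .4, .5, .4, .3, .2, .1, 0.])
idx = np.where(np.abs(a - .2) <= .1)[0]
# Split wherever the step from the previous index is larger than 1.
diff = np.diff(idx, prepend=idx[0] - 1)
result = np.split(idx, np.where(diff != 1)[0])
print(result)                   # [array([1, 2, 3]), array([7, 8, 9])]
print([a[r] for r in result])   # [array([0.1, 0.2, 0.3]), array([0.3, 0.2, 0.1])]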
...and that reference comes from a separate matrix.
This question is an extension of an earlier answered question where the reference element came directly from the same column it was being compared against. Some clever sorting and referencing the index of the sort seemed to solve that one.
Broadcasting has been suggested in both the original and this new question. I run out of memory at around n ~ 3000 and need to handle sizes another order of magnitude larger yet.
The target (production-grade) scaling definitions:
So as to keep the proposed approaches fair and mutually comparable, both in the [SPACE]- and the [TIME]-domain,
let's assume n = 50000; m = 20; k = 50; a = np.random.rand( n, m ); ...
I'm now interested in a more general form where the reference value comes from another matrix of reference values.
Original question:
Vectorized pythonic way to get count of elements greater than current element
New question: Can we write a vectorized form to perform the following role?
The function receives two 2-D arrays as input,
A = n x m
B = k x m
and returns
C = k x m.
C[i,j] is the proportion of observations in A[:,j] ( just the j-th column ) that are larger than B[i,j]
Here is my embarrassingly slow double for loop implementation.
import numpy as np
n = 100
m = 20
k = 50
a = np.random.rand(n,m)
b = np.random.rand(k,m)
c = np.zeros((k,m))
for j in range(0, m):  # cols
    for i in range(0, k):  # rows
        r = b[i, j]
        c[i, j] = (a[:, j] > r).sum() / n
Approach #1
We could again use the argsort trick as discussed in this solution, but in a slightly modified manner. We concatenate the second array onto the first array and then perform the argsort-ing. We need to use argsort on both the concatenated array and on the second array alone to obtain our desired output. The implementation would look something like this -
ab = np.vstack((a,b))
len_a, len_b = len(a), len(b)
b_argoffset = b.argsort(0).argsort(0)
total_args = ab.argsort(0).argsort(0)[-len_b:]
out = len_a - total_args + b_argoffset
Explanation
Concatenate the second array, whose counts are to be computed, onto the first array.
Since we are appending, its elements take index positions that come after the end of the first array.
We use one argsort to get the relative positions of the second array w.r.t. the entire concatenated array, and one more argsort to trace those indices back to the original order (see the small rank illustration after this list).
We need to repeat the double argsort-ing for the second array on itself, so as to compensate for the concatenation.
These indices are, for each element in b, for the comparison a[:,j] > b[i,j]. These index orders are 0-based, i.e. an index closer to 0 represents a greater number of elements in a[:,j] larger than the current element b[i,j], and thus a greater count, and vice versa. So, we need to subtract those indices from the length of a[:,j] for the final output.
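As a small illustration of the double argsort-as-ranks idea used above (a hedged sketch on a made-up column):
import numpy as np

col = np.array([0.3, 0.1, 0.9, 0.5])
# argsort applied twice yields, for each element, its 0-based rank in sorted order.
print(col.argsort().argsort())   # [1 0 3 2]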
Approach #1 - Improvement
We can optimize it further by using array-assignment, again inspired by Approach #2 from the same solution. So, those arg outputs, b_argoffset and total_args, could alternatively be computed like so -
def unqargsort(a):
    n, m = a.shape
    idx = a.argsort(0)
    out = np.zeros((n, m), dtype=int)
    out[idx, np.arange(m)] = np.arange(n)[:, None]
    return out
b_argoffset = unqargsort(b)
total_args = unqargsort(ab)[-len_b:]
Approach #2
We could also leverage searchsorted for an altogether different approach -
k, m = b.shape
sidx = a.argsort(0)
out = np.empty((k, m), dtype=int)
for i in range(m):  # cols
    # side='right' counts elements <= b[:, i], so ties are not counted as "larger"
    out[:, i] = np.searchsorted(a[:, i], b[:, i], side='right', sorter=sidx[:, i])
out = len(a) - out
Explanation
We get the sorted order indices for each column of a.
Then, use those indices to find where the values of b would be placed into the sorted a with searchsorted. This gives us the same output as steps 3 and 4 in Approach #1.
Note that these approaches give us the count. So, for the final output, divide the output thus obtained by n.
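To make that last division step concrete, here is a minimal sketch (array names follow the question; the helper name prop_greater is just for illustration) that wraps Approach #2 into a function and checks it against the original double loop:
import numpy as np

def prop_greater(a, b):
    # Proportion of a[:, j] strictly greater than b[i, j], per column.
    n = len(a)
    k, m = b.shape
    sidx = a.argsort(0)
    out = np.empty((k, m), dtype=int)
    for i in range(m):
        # side='right' counts elements <= b[:, i]; n minus that is the strict "greater" count.
        out[:, i] = np.searchsorted(a[:, i], b[:, i], side='right', sorter=sidx[:, i])
    return (n - out) / n

n, m, k = 100, 20, 50
a = np.random.rand(n, m)
b = np.random.rand(k, m)
c = np.array([[(a[:, j] > b[i, j]).mean() for j in range(m)] for i in range(k)])
assert np.allclose(prop_greater(a, b), c)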
I think you can use broadcasting:
c = (a[:,None,:] > b).mean(axis=0)
Demo:
In [207]: n = 5
In [208]: m = 3
In [209]: a = np.random.randint(10, size=(n,m))
In [210]: b = np.random.randint(10, size=(n,m))
In [211]: c = np.zeros((n,m))
In [212]: a
Out[212]:
array([[2, 2, 8],
[5, 0, 8],
[2, 5, 7],
[4, 4, 4],
[2, 6, 7]])
In [213]: b
Out[213]:
array([[3, 6, 8],
[2, 7, 5],
[8, 9, 2],
[9, 8, 7],
[2, 7, 2]])
In [214]: for j in range(0,m): #cols
...: for i in range(0,n): # rows
...: r = b[i,j]
...: c[i,j] = ( ( a[:,j] > r).sum() ) / (n)
...:
...:
In [215]: c
Out[215]:
array([[0.4, 0. , 0. ],
[0.4, 0. , 0.8],
[0. , 0. , 1. ],
[0. , 0. , 0.4],
[0.4, 0. , 1. ]])
In [216]: (a[:,None,:] > b).mean(axis=0)
Out[216]:
array([[0.4, 0. , 0. ],
[0.4, 0. , 0.8],
[0. , 0. , 1. ],
[0. , 0. , 0.4],
[0.4, 0. , 1. ]])
check:
In [217]: ((a[:,None,:] > b).mean(axis=0) == c).all()
Out[217]: True
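Since the question mentions running out of memory at around n ~ 3000, a hedged variant of the same broadcasting idea: process b in blocks of rows so the temporary (n, block, m) boolean array stays bounded. The block size of 128 is an arbitrary assumption to tune:
import numpy as np

def prop_greater_blocked(a, b, block=128):
    # Same result as (a[:, None, :] > b).mean(axis=0), computed block-by-block over the rows of b.
    out = np.empty(b.shape)
    for start in range(0, len(b), block):
        chunk = b[start:start + block]
        out[start:start + block] = (a[:, None, :] > chunk).mean(axis=0)
    return out

a = np.random.rand(50000, 20)
b = np.random.rand(50, 20)
c = prop_greater_blocked(a, b)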
So given this numpy array:
import numpy as np
vector = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
# len(vector) == 12
# 2 x ones, 4 x two, 6 x three
How can I convert this into a vector of relative frequencies?
That is, for each value, the output should contain the count of that value divided by the length of the vector:
array([0.16, 0.33, 0.33, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.33, 0.33, 0.16])
[Updated to a more general version]
How about this one, using np.histogram:
import numpy as np
l = np.array([1,2,2,3,3,3,3,3,3,2,2,1])
_u, _l = np.unique(l, return_inverse=True)
np.histogram(_l, bins=np.arange(_u.size+1))[0][_l] / _l.size
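A shorter variant of the same idea (assuming NumPy >= 1.9 for return_counts):
import numpy as np

l = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
# return_inverse maps each element to its slot among the unique values,
# return_counts gives the count of each unique value.
_, inv, counts = np.unique(l, return_inverse=True, return_counts=True)
print(counts[inv] / l.size)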
This essentially requires a grouping operation, which numpy isn't great at... but pandas is. You can do this with groupby + transform + count, and divide the result by the length of vector.
import pandas as pd
s = pd.Series(vector)
vector = (s.groupby(s).transform('count') / len(s)).values
vector
array([ 0.16666667, 0.33333333, 0.33333333, 0.5 , 0.5 ,
0.5 , 0.5 , 0.5 , 0.5 , 0.33333333,
0.33333333, 0.16666667])
You can use collections.Counter to first determine the frequency of each element. Then create an intermediate mapping dict with each element as the key and its frequency as the value. Finally, use numpy.vectorize to transform the array into your desired format.
>>> import numpy as np
>>> from collections import Counter
>>> v = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
>>> freq_dict = Counter(v)
At this point, freq_dict contains the frequency of each element, like so:
>>> freq_dict
>>> Counter({3: 6, 2: 4, 1: 2})
Next, build a probability dict in the format element: probability, using a dict comprehension:
>>> prob_dict = dict((k,round(val/len(v),3)) for k, val in freq_dict.items())
>>> prob_dict
>>> {1: 0.167, 2: 0.333, 3: 0.5}
Finally, use numpy.vectorize to get your desired output:
>>> out = np.vectorize(prob_dict.get)(v)
This will produce:
>>> out
>>> array([ 0.167, 0.333, 0.333, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.333, 0.333, 0.167])
Simple problem, but I cannot seem to get it to work. I want to calculate the percentage with which each number occurs in a list of arrays and output these percentages accordingly.
I have a list of arrays which looks like this:
import numpy as np
# Create some data
listvalues = []
arr1 = np.array([0, 0, 2])
arr2 = np.array([1, 1, 2, 2])
arr3 = np.array([0, 2, 2])
listvalues.append(arr1)
listvalues.append(arr2)
listvalues.append(arr3)
listvalues
>[array([0, 0, 2]), array([1, 1, 2, 2]), array([0, 2, 2])]
Now I count the occurrences using collections, which returns a list of collections.Counter objects:
import collections
counter = []
for i in xrange(len(listvalues)):
    counter.append(collections.Counter(listvalues[i]))
counter
>[Counter({0: 2, 2: 1}), Counter({1: 2, 2: 2}), Counter({0: 1, 2: 2})]
The result I am looking for is an array with 3 columns, representing the values 0 to 2, and len(listvalues) rows. Each cell should be filled with the percentage of that value occurring in the corresponding array:
# Result
66.66 0 33.33
0 50 50
33.33 0 66.66
So 0 occurs 66.66% in array 1, 0% in array 2 and 33.33% in array 3, and so on..
What would be the best way to achieve this?
Many thanks!
Here's an approach -
# Get lengths of each element in input list
lens = np.array([len(item) for item in listvalues])
# Form group ID array to ID elements in flattened listvalues
ID_arr = np.repeat(np.arange(len(lens)),lens)
# Extract all values and, using the group IDs as row offsets, do the counting with bincount
vals = np.concatenate(listvalues)
out_shp = [ID_arr.max()+1, vals.max()+1]
counts = np.bincount(ID_arr*out_shp[1] + vals, minlength=out_shp[0]*out_shp[1])  # minlength ensures the reshape below always works
# Finally get the percentages by dividing by the group lengths
out = 100*np.true_divide(counts.reshape(out_shp), lens[:,None])
Sample run with an additional fourth array in input list -
In [316]: listvalues
Out[316]: [array([0, 0, 2]),array([1, 1, 2, 2]),array([0, 2, 2]),array([4, 0, 1])]
In [317]: print out
[[ 66.66666667 0. 33.33333333 0. 0. ]
[ 0. 50. 50. 0. 0. ]
[ 33.33333333 0. 66.66666667 0. 0. ]
[ 33.33333333 33.33333333 0. 0. 33.33333333]]
The numpy_indexed package has a utility function for this, called count_table, which can be used to solve your problem efficiently, as follows:
import numpy_indexed as npi
arrs = [arr1, arr2, arr3]
idx = [np.ones(len(a))*i for i, a in enumerate(arrs)]
(rows, cols), table = npi.count_table(np.concatenate(idx), np.concatenate(arrs))
table = table / table.sum(axis=1, keepdims=True)
print(table * 100)
You can get a sorted list of all values and then simply iterate over the individual arrays to get the percentages:
values = sorted(set(y for row in listvalues for y in row))
print [[(a==x).sum()*100.0/len(a) for x in values] for a in listvalues]
You can create a list with the percentages with the following code:
percentage_list = [((counter[i].get(j) if counter[i].get(j) else 0) * 10000)
                   // len(listvalues[i]) / 100.0
                   for i in range(len(listvalues))
                   for j in range(3)]
After that, create a NumPy array from that list:
results = np.array(percentage_list)
Reshape it so we get the desired result:
results = results.reshape(3,3)
This should allow you to get what you wanted.
This is most likely not efficient, nor the best way to do it, but it has the merit of working.
Do not hesitate to ask if you have any questions.
I would like to use a functional style to solve this problem. For example:
>>> import numpy as np
>>> import pprint
>>>
>>> arr1 = np.array([0, 0, 2])
>>> arr2 = np.array([1, 1, 2, 2])
>>> arr3 = np.array([0, 2, 2])
>>>
>>> arrays = (arr1, arr2, arr3)
>>>
>>> u = np.unique(np.hstack(arrays))
>>>
>>> result = [[1.0 * c.get(uk, 0) / l
... for l, c in ((len(arr), dict(zip(*np.unique(arr, return_counts=True))))
... for arr in arrays)] for uk in u]
>>>
>>> pprint.pprint(result)
[[0.6666666666666666, 0.0, 0.3333333333333333],
[0.0, 0.5, 0.0],
[0.3333333333333333, 0.5, 0.6666666666666666]]
I have two measurements, position and temperature, which are sampled at a fixed sampling rate. Some positions might occur multiple times in the data. Now I want to plot the temperature over the position rather than over time. Instead of displaying two points at the same position, I want to replace the temperature measurements with the mean value for the given location. How can this be done nicely in Python with NumPy?
My solution so far looks like this:
import matplotlib.pyplot as plt
import numpy as np
# x = Position Data
# y = Temperature Data
x = np.random.permutation([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9])
y = (x + np.random.rand(len(x)) * 1 - 0.5).round(2)
# Get correct order
idx = np.argsort(x)
x, y = x[idx], y[idx]
plt.plot(x, y) # Plot with multiple points at same location
# Calculate means for duplicates
new_x = []
new_y = []
skip_next = False
for idx in range(len(x)):
    if skip_next:
        skip_next = False
        continue
    if idx < len(x) - 1 and x[idx] == x[idx + 1]:
        new_x.append(x[idx])
        new_y.append((y[idx] + y[idx + 1]) / 2)
        skip_next = True
    else:
        new_x.append(x[idx])
        new_y.append(y[idx])
        skip_next = False
x, y = np.array(new_x), np.array(new_y)
plt.plot(x, y) # Plots desired output
This solution does not take into account that some positions might occur more than twice in the data. To replace all values, the loop must be run multiple times. I know there must be a better solution to this!
One approach using np.bincount -
import numpy as np
# x = Position Data
# y = Temperature Data
x = np.random.permutation([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9])
y = (x + np.random.rand(len(x)) * 1 - 0.5).round(2)
# Find unique sorted values for x
x_new = np.unique(x)
# Use bincount to get the accumulated summation for each unique x, and
# divide each summation by the respective count of each unique value in x
y_new_mean = np.bincount(x, weights=y)/np.bincount(x)
Sample run -
In [16]: x
Out[16]: array([7, 0, 2, 8, 5, 4, 1, 9, 6, 8, 1, 3, 5])
In [17]: y
Out[17]:
array([ 6.7 , 0.12, 2.33, 8.19, 5.19, 3.68, 0.62, 9.46, 6.01,
8. , 1.07, 3.07, 5.01])
In [18]: x_new
Out[18]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [19]: y_new_mean
Out[19]:
array([ 0.12 , 0.845, 2.33 , 3.07 , 3.68 , 5.1 , 6.01 , 6.7 ,
8.095, 9.46 ])
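Note that np.bincount expects the positions to be small non-negative integers. If the positions are float measurements, a hedged sketch of the same grouping idea, mapping positions to group IDs with np.unique first:
import numpy as np

x = np.array([0.5, 1.25, 1.25, 2.0, 0.5, 3.75])
y = np.array([10.0, 11.0, 13.0, 20.0, 12.0, 30.0])

# Map each position to the index of its unique value, then average per group.
x_new, inv = np.unique(x, return_inverse=True)
y_new_mean = np.bincount(inv, weights=y) / np.bincount(inv)
print(x_new)        # [0.5  1.25 2.   3.75]
print(y_new_mean)   # [11. 12. 20. 30.]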
If I understand what you're asking, here's one way to do it that is a lot simpler.
Given some dataset that is randomly arranged, but where each position is paired with a temperature:
data = np.random.permutation([(1, 5.6), (1, 3.4), (1, 4.5), (2, 5.3), (3, 2.2), (3, 6.8)])
>> array([[ 3. , 2.2],
[ 3. , 6.8],
[ 1. , 3.4],
[ 1. , 5.6],
[ 2. , 5.3],
[ 1. , 4.5]])
We can sort the data and put each position in a dictionary as a key, while keeping track of the temperatures for that position in a list in the dictionary. We use some error handling here: if the key (position) is not yet in our dictionary, Python will complain with a KeyError, so we add it.
results = {}
for entry in sorted(data, key=lambda t: t[0]):
    try:
        results[entry[0]] = results[entry[0]] + [entry[1]]
    except KeyError:
        results[entry[0]] = [entry[1]]
print(results)
>> {1.0: [3.3999999999999999, 5.5999999999999996, 4.5],
2.0: [5.2999999999999998],
3.0: [2.2000000000000002, 6.7999999999999998]}
And with a final list comprehension we can flatten this and get the resulting array.
np.array([[key, np.mean(results[key])] for key in results.keys()])
>> array([[ 1. , 4.5],
[ 2. , 5.3],
[ 3. , 4.5]])
This can be put in a function:
def flatten_by_position(data):
    results = {}
    for entry in sorted(data, key=lambda t: t[0]):
        try:
            results[entry[0]] = results[entry[0]] + [entry[1]]
        except KeyError:
            results[entry[0]] = [entry[1]]
    return np.array([[key, np.mean(results[key])] for key in results.keys()])
Tested with a variety of inputs, this solution should be fast enough for datasets under 1,000,000 entries.
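For reference, collections.defaultdict can replace the try/except; a minimal sketch (the function name is just for illustration):
import numpy as np
from collections import defaultdict

def flatten_by_position_dd(data):
    # defaultdict(list) supplies an empty list for unseen keys, so no KeyError handling is needed.
    results = defaultdict(list)
    for position, temperature in data:
        results[position].append(temperature)
    # Sort by position at the end instead of sorting the input data.
    return np.array(sorted([key, np.mean(vals)] for key, vals in results.items()))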