Split numpy array into contiguous sections using numpy.where() - python

The function numpy.where() can be used to obtain an array of indices into a numpy array where a logical condition is true. How do we generate a list of index arrays, one for each contiguous region where the condition is true?
For example,
import numpy as np
a = np.array( [0.,.1,.2,.3,.4,.5,.4,.3,.2,.1,0.] )
idx = np.where( (np.abs(a-.2) <= .1) )
print( 'idx =', idx)
print( 'a[idx] =', a[idx] )
produces the following output,
idx = (array([1, 2, 3, 7, 8, 9]),)
a[idx] = [0.1 0.2 0.3 0.3 0.2 0.1]
The question is: how do we obtain, in a simple way, a list of arrays of indices, one such array for each contiguous section? For example, like this:
idx = (array([1, 2, 3]),), (array([7, 8, 9]),)
a[idx[0]] = [0.1 0.2 0.3]
a[idx[1]] = [0.3 0.2 0.1]

You can simply use np.split() to split your idx[0] into contiguous runs:
ia = idx[0]
out = np.split(ia, np.where(ia[1:] != ia[:-1] + 1)[0] + 1)
>>> out
[array([1, 2, 3]), array([7, 8, 9])]
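If you need this more than once, the same idea can be wrapped in a small helper. A minimal sketch (the name contiguous_regions is ours, not a numpy function):
import numpy as np

def contiguous_regions(condition):
    # indices where the condition holds, split wherever consecutive indices jump by more than 1
    idx = np.flatnonzero(condition)
    return np.split(idx, np.where(np.diff(idx) > 1)[0] + 1)

contiguous_regions(np.abs(a - .2) <= .1)
# [array([1, 2, 3]), array([7, 8, 9])]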

This should work:
a = np.array([0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
idx = np.nonzero(np.abs(a - 0.2) <= 0.1)[0]
splits = np.split(idx, np.nonzero(np.diff(idx) > 1)[0] + 1)
print(splits)
It gives:
[array([1, 2, 3]), array([7, 8, 9])]

You can check if the difference between the shifted idx arrays is 1, then split the array at the corresponding indices.
import numpy as np
a = np.array( [0.,.1,.2,.3,.4,.5,.4,.3,.2,.1,0.] )
idx = np.where( np.abs(a-.2) <= .1 )[0]
# Get the positions where the gap between consecutive indices is larger than 1.
split_idcs = np.argwhere( idx[1:]-idx[:-1] > 1 ).flatten() + 1
# Split the array at the corresponding indices.
result = np.split(idx, split_idcs)
print(result)
# [array([1, 2, 3], dtype=int64), array([7, 8, 9], dtype=int64)]
It works for your example, and since flatten() collects every split point, it also handles sequences with more than two contiguous runs.

You can achieve the goal with np.diff and np.split (the prepend argument of np.diff requires numpy >= 1.16):
diff = np.diff(idx[0], prepend=idx[0][0]-1)
result = np.split(idx[0], np.where(diff != 1)[0])
or
idx = np.where((np.abs(a - .2) <= .1))[0]
diff = np.diff(idx, prepend=idx[0]-1)
result = np.split(idx, np.where(diff != 1)[0])
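Either way, the result is the same split as in the earlier answers:
print(result)
# [array([1, 2, 3]), array([7, 8, 9])]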

Related

Numpy: concatenation and sorting of matrices

I have the following problem: I have two vectors containing time moments:
a = np.array((0.23, 1.70))
a_ = np.array((0, 0.5, 1, 1.5, 2))
and two vectors containing the values of the function at these points in time:
b = np.array((3, -1.2))
b_ = np.array((0, 3, 3, 3, -1.2))
I want to combine the vectors a with a_ and b with b_, and sort by time in ascending order. The final result should look like this:
A = np.array((0, 0.23, 0.5, 1, 1.5, 1.70, 2))
B = np.array((0, 3, 3, 3, 3, -1.2, -1.2))
How can I do this? I gave a simple example here, but in general I will work with longer vectors. I thought of concatenating a with a_ and b with b_, stacking them into a matrix and sorting by time (i.e. the first row), but if I sort by the first row alone, the values in the second row don't move with it. Afterwards I also want to access the elements and compute the differences between successive entries (time and value increments).
Here's what I tried: I first convert them into key-value pairs and then sort them based on the keys.
Here's the Code:
import numpy as np
a = np.array((0.23, 1.70))
a_ = np.array((0, 0.5, 1, 1.5, 2))
b = np.array((3, -1.2))
b_ = np.array((0, 3, 3, 3, -1.2))
sol = {}
for i, j in zip(list(a), list(b)):
    sol[i] = j
for i, j in zip(list(a_), list(b_)):
    sol[i] = j
sol = dict(sorted(sol.items(), key = lambda kv:(kv[0], kv[1])))
A = np.array(list(sol.keys()))
B = np.array(list(sol.values()))
print(f'{A}\n{B}')
Result:
[0. 0.23 0.5 1. 1.5 1.7 2. ]
[ 0. 3. 3. 3. 3. -1.2 -1.2]
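If duplicate time stamps should be kept rather than merged (the dict above overwrites values that share a time), a pure-numpy sketch using np.argsort does the same in one pass:
import numpy as np

A = np.concatenate((a, a_))
B = np.concatenate((b, b_))
order = np.argsort(A, kind='stable')  # stable sort keeps the original order of equal times
A, B = A[order], B[order]
print(f'{A}\n{B}')
This reproduces the A and B arrays from the question.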

Python convert a vector into inverse frequencies

So given this numpy array:
import numpy as np
vector = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
# len(vector) == 12
# 2 x ones, 4 x two, 6 x three
How can I convert this into a vector of inverse frequencies?
Such that each element is replaced by the relative frequency of its value, i.e. the count of that value divided by the length of the vector:
array([0.16, 0.33, 0.33, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.33, 0.33, 0.16])
[Update] Here is a more general version using np.histogram:
import numpy as np
l = np.array([1,2,2,3,3,3,3,3,3,2,2,1])
_u, _l = np.unique(l, return_inverse=True)
np.histogram(_l, bins=np.arange(_u.size+1))[0][_l] / _l.size
This essentially requires a grouping operation, which numpy isn't great at... but pandas is. You can do this with groupby + transform + count, and divide the result by the length of vector.
import pandas as pd
s = pd.Series(vector)
vector = (s.groupby(s).transform('count') / len(s)).values
vector
array([ 0.16666667, 0.33333333, 0.33333333, 0.5 , 0.5 ,
0.5 , 0.5 , 0.5 , 0.5 , 0.33333333,
0.33333333, 0.16666667])
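A close variant of the same idea maps each element to its count with value_counts (a sketch):
counts = s.value_counts()
vector = s.map(counts).values / len(s)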
You can use collections.Counter to first determine the frequency of each element. Then create an intermediate mapping dict with each element as key and its frequency as value. Finally, use numpy.vectorize to transform the array into your desired format.
>>> import numpy as np
>>> from collections import Counter
>>> v = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
>>> freq_dict = Counter(v)
At this point freq_dict will contain the frequency of each element, like
>>> freq_dict
>>> Counter({3: 6, 2: 4, 1: 2})
Next, build a probability dict of the form element: probability, using a dict comprehension
>>> prob_dict = dict((k,round(val/len(v),3)) for k, val in freq_dict.items())
>>> prob_dict
>>> {1: 0.167, 2: 0.333, 3: 0.5}
Finally, use numpy.vectorize to get your desired output
>>> out = np.vectorize(prob_dict.get)(v)
This will produce:
>>> out
>>> array([ 0.167, 0.333, 0.333, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.333, 0.333, 0.167])
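For completeness, the same result can be computed in pure numpy with a single call to np.unique (a sketch):
import numpy as np

v = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
_, inv, counts = np.unique(v, return_inverse=True, return_counts=True)
out = counts[inv] / len(v)  # each element's count divided by the total length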

Calculate percentage of count for a list of arrays

Simple problem, but I cannot seem to get it to work. I want to calculate the percentage a number occurs in a list of arrays and output this percentage accordingly.
I have a list of arrays which looks like this:
import numpy as np
# Create some data
listvalues = []
arr1 = np.array([0, 0, 2])
arr2 = np.array([1, 1, 2, 2])
arr3 = np.array([0, 2, 2])
listvalues.append(arr1)
listvalues.append(arr2)
listvalues.append(arr3)
listvalues
>[array([0, 0, 2]), array([1, 1, 2, 2]), array([0, 2, 2])]
Now I count the occurrences using collections, which returns a list of collections.Counter:
import collections
counter = []
for i in xrange(len(listvalues)):
    counter.append(collections.Counter(listvalues[i]))
counter
>[Counter({0: 2, 2: 1}), Counter({1: 2, 2: 2}), Counter({0: 1, 2: 2})]
The result I am looking for is an array with 3 columns, representing the values 0 to 2, and len(listvalues) rows. Each cell should be filled with the percentage of that value occurring in the array:
# Result
66.66 0 33.33
0 50 50
33.33 0 66.66
So 0 occurs 66.66% in array 1, 0% in array 2 and 33.33% in array 3, and so on.
What would be the best way to achieve this?
Many thanks!
Here's an approach -
# Get lengths of each element in input list
lens = np.array([len(item) for item in listvalues])
# Form group ID array to ID elements in flattened listvalues
ID_arr = np.repeat(np.arange(len(lens)),lens)
# Extract all values & count them, using the row ID as an offset so each row gets its own block of bins
vals = np.concatenate(listvalues)
out_shp = [ID_arr.max()+1,vals.max()+1]
counts = np.bincount(ID_arr*out_shp[1] + vals)
# Finally get the percentages with dividing by group counts
out = 100*np.true_divide(counts.reshape(out_shp),lens[:,None])
Sample run with an additional fourth array in input list -
In [316]: listvalues
Out[316]: [array([0, 0, 2]),array([1, 1, 2, 2]),array([0, 2, 2]),array([4, 0, 1])]
In [317]: print out
[[ 66.66666667 0. 33.33333333 0. 0. ]
[ 0. 50. 50. 0. 0. ]
[ 33.33333333 0. 66.66666667 0. 0. ]
[ 33.33333333 33.33333333 0. 0. 33.33333333]]
The numpy_indexed package has a utility function for this, called count_table, which can be used to solve your problem efficiently as such:
import numpy_indexed as npi
arrs = [arr1, arr2, arr3]
idx = [np.ones(len(a))*i for i, a in enumerate(arrs)]
(rows, cols), table = npi.count_table(np.concatenate(idx), np.concatenate(arrs))
table = table / table.sum(axis=1, keepdims=True)
print(table * 100)
You can get a list of all values and then simply iterate over the individual arrays to get the percentages:
values = sorted(set(y for row in listvalues for y in row))
print [[(a==x).sum()*100.0/len(a) for x in values] for a in listvalues]
You can create a list with the percentages with the following code:
percentage_list = [counter[i].get(j, 0) * 10000 // len(listvalues[i]) / 100.0 for i in range(len(listvalues)) for j in range(3)]
After that, create a np array from that list :
results = np.array(percentage_list)
Reshape it so we get the desired result:
results = results.reshape(3,3)
This should allow you to get what you wanted.
This is most likely not efficient, and not the best way to do this, but it has the merit of working.
Do not hesitate to ask if you have any questions.
I would like to use a functional style to solve this problem. For example:
>>> import numpy as np
>>> import pprint
>>>
>>> arr1 = np.array([0, 0, 2])
>>> arr2 = np.array([1, 1, 2, 2])
>>> arr3 = np.array([0, 2, 2])
>>>
>>> arrays = (arr1, arr2, arr3)
>>>
>>> u = np.unique(np.hstack(arrays))
>>>
>>> result = [[1.0 * c.get(uk, 0) / l
... for l, c in ((len(arr), dict(zip(*np.unique(arr, return_counts=True))))
... for arr in arrays)] for uk in u]
>>>
>>> pprint.pprint(result)
[[0.6666666666666666, 0.0, 0.3333333333333333],
[0.0, 0.5, 0.0],
[0.3333333333333333, 0.5, 0.6666666666666666]]
Note that here each row corresponds to one of the values 0, 1, 2 and each column to one of the input arrays, i.e. the transpose of the table asked for in the question.
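Another pure-numpy option (a sketch, assuming listvalues as defined in the question) builds the table with np.add.at; the columns correspond to the distinct values found in the data:
import numpy as np

vals = np.concatenate(listvalues)
row_ids = np.repeat(np.arange(len(listvalues)), [len(a) for a in listvalues])
_, col_ids = np.unique(vals, return_inverse=True)
table = np.zeros((len(listvalues), col_ids.max() + 1))
np.add.at(table, (row_ids, col_ids), 1)  # count each (array, value) pair
print(100 * table / table.sum(axis=1, keepdims=True))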

Python Numpy: Replace duplicate values with mean value

I have two measurements, position and temperature, which are sampled at a fixed sampling rate. Some positions might occur multiple times in the data. Now I want to plot temperature against position rather than time. Instead of displaying two points at the same position, I want to replace the temperature measurements with the mean value for the given location. How can this be done nicely in Python with numpy?
My solution so far looks like this:
import matplotlib.pyplot as plt
import numpy as np
# x = Position Data
# y = Temperature Data
x = np.random.permutation([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9])
y = (x + np.random.rand(len(x)) * 1 - 0.5).round(2)
# Get correct order
idx = np.argsort(x)
x, y = x[idx], y[idx]
plt.plot(x, y) # Plot with multiple points at same location
# Calculate means for duplicates
new_x = []
new_y = []
skip_next = False
for idx in range(len(x)):
    if skip_next:
        skip_next = False
        continue
    if idx < len(x)-1 and x[idx] == x[idx+1]:
        new_x.append(x[idx])
        new_y.append((y[idx] + y[idx+1]) / 2)
        skip_next = True
    else:
        new_x.append(x[idx])
        new_y.append(y[idx])
        skip_next = False
x, y = np.array(new_x), np.array(new_y)
plt.plot(x, y) # Plots desired output
This solution does not take into account that some positions might occur more than twice in the data; to replace all values, the loop would have to be run multiple times. I know there must be a better solution to this!
One approach using np.bincount -
import numpy as np
# x = Position Data
# y = Temperature Data
x = np.random.permutation([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9])
y = (x + np.random.rand(len(x)) * 1 - 0.5).round(2)
# Find unique sorted values for x
x_new = np.unique(x)
# Use bincount to get the accumulated summation for each unique x, and
# divide each summation by the respective count of each unique value in x
y_new_mean = np.bincount(x, weights=y)/np.bincount(x)
Sample run -
In [16]: x
Out[16]: array([7, 0, 2, 8, 5, 4, 1, 9, 6, 8, 1, 3, 5])
In [17]: y
Out[17]:
array([ 6.7 , 0.12, 2.33, 8.19, 5.19, 3.68, 0.62, 9.46, 6.01,
8. , 1.07, 3.07, 5.01])
In [18]: x_new
Out[18]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [19]: y_new_mean
Out[19]:
array([ 0.12 , 0.845, 2.33 , 3.07 , 3.68 , 5.1 , 6.01 , 6.7 ,
8.095, 9.46 ])
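Note that np.bincount requires the bins (here x) to be non-negative integers. If the positions are floats or otherwise arbitrary, the same idea still works after mapping them to group indices with np.unique. A sketch:
x_new, inv = np.unique(x, return_inverse=True)
y_new_mean = np.bincount(inv, weights=y) / np.bincount(inv)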
If I understand what you're asking, here's one way to do it that is a lot simpler.
Given some dataset that is randomly arranged, but each position is connected with each temperature:
data = np.random.permutation([(1, 5.6), (1, 3.4), (1, 4.5), (2, 5.3), (3, 2.2), (3, 6.8)])
>> array([[ 3. , 2.2],
[ 3. , 6.8],
[ 1. , 3.4],
[ 1. , 5.6],
[ 2. , 5.3],
[ 1. , 4.5]])
We can sort and put each position in a dictionary as its key while keeping track of the temperatures for that position in a list in the dictionary. We use some error handling here: if the key (position) is not yet in our dictionary, Python will complain with a KeyError, so we add it.
results = {}
for entry in sorted(data, key=lambda t: t[0]):
    try:
        results[entry[0]] = results[entry[0]] + [entry[1]]
    except KeyError:
        results[entry[0]] = [entry[1]]
print(results)
>> {1.0: [3.3999999999999999, 5.5999999999999996, 4.5],
2.0: [5.2999999999999998],
3.0: [2.2000000000000002, 6.7999999999999998]}
And with a final list comprehension we can flatten this and get the resulting array.
np.array([[key, np.mean(results[key])] for key in results.keys()])
>> array([[ 1. , 4.5],
[ 2. , 5.3],
[ 3. , 4.5]])
This can be put in a function:
def flatten_by_position(data):
    results = {}
    for entry in sorted(data, key=lambda t: t[0]):
        try:
            results[entry[0]] = results[entry[0]] + [entry[1]]
        except KeyError:
            results[entry[0]] = [entry[1]]
    return np.array([[key, np.mean(results[key])] for key in results.keys()])
Tested with a variety of inputs, this solution should be fast enough for datasets under 1,000,000 entries.
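A quick usage sketch with the data array from above (row order follows the dictionary's key order):
flatten_by_position(data)
>> array([[ 1. , 4.5],
[ 2. , 5.3],
[ 3. , 4.5]])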

Sort a list based on a given distribution

While answering another question, I ended up with a solution that I believe was a roundabout way of doing something that could be done better, but I was clueless.
There are two lists:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
optimal_partition is one of the integer partitions of the number 8 into 4 parts.
I would like to sort optimal_partition so that it matches the percentage distribution as closely as possible, meaning each individual part should line up with a percentage of similar magnitude.
So 3 -> 0.4, 2 -> 0.27 and 0.23 and 1 -> 0.1
So the final result should be
[2, 2, 3, 1]
The way I ended up solving this was
>>> percent = [0.23, 0.27, 0.4, 0.1]
>>> optimal_partition = [3, 2, 2, 1]
>>> optimal_partition_percent = zip(sorted(optimal_partition),
sorted(enumerate(percent),
key = itemgetter(1)))
>>> optimal_partition = [e for e, _ in sorted(optimal_partition_percent,
key = lambda e: e[1][0])]
>>> optimal_partition
[2, 2, 3, 1]
Can you suggest an easier way to solve this?
By easier I mean, without the need to implement multiple sorting, and storing and later rearranging based on index.
Couple of more examples:
percent = [0.25, 0.25, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
result = [2, 2, 3, 1]
percent = [0.2, 0.2, 0.4, 0.2]
optimal_partition = [3, 2, 2, 1]
result = [1, 2, 3, 2]
from numpy import take, argsort
# argsort(perc)[::-1] lists the positions of perc from largest to smallest;
# argsort-ing that again gives each position its descending rank, which then
# picks the matching element from opt (sorted largest to smallest, as in the examples)
take(opt, argsort(argsort(perc)[::-1]))
or without imports:
zip(*sorted(zip(sorted(range(len(perc)), key=perc.__getitem__)[::-1],opt)))[1]
#Test
l=[([0.23, 0.27, 0.4, 0.1],[3, 2, 2, 1]),
([0.25, 0.25, 0.4, 0.1],[3, 2, 2, 1]),
([0.2, 0.2, 0.4, 0.2],[3, 2, 2, 1])]
def f1(perc,opt):
    return take(opt,argsort(argsort(perc)[::-1]))
def f2(perc,opt):
    return zip(*sorted(zip(sorted(range(len(perc)),
                               key=perc.__getitem__)[::-1],opt)))[1]
for i in l:
    perc, opt = i
    print f1(perc,opt), f2(perc,opt)
# output:
# [2 2 3 1] (2, 2, 3, 1)
# [2 2 3 1] (2, 2, 3, 1)
# [1 2 3 2] (1, 2, 3, 2)
Use the fact that the percentages sum to 1:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
total = sum(optimal_partition)
output = [total*i for i in percent]
Now you need to figure out a way to redistribute the fractional components somehow. Thinking out loud:
from operator import itemgetter
intermediate = [[i[0], int(i[1]), i[1] - int(i[1])] for i in enumerate(output)]  # lists, not tuples, so the entries can be modified in place
# Sort the list by the fractional component
s = sorted(intermediate, key=itemgetter(2))
# Now, distribute the first item's fractional component to the rest, starting at the top:
for i, tup in enumerate(s):
    fraction = tup[2]
    # Go through the remaining items in reverse order
    for index in range(len(s)-1, i, -1):
        this_fraction = s[index][2]
        if fraction + this_fraction >= 1:
            # increment this item by 1, clear the fraction, carry the remainder
            new_fraction = fraction + this_fraction - 1
            s[index][1] = s[index][1] + 1
            s[index][2] = 0
            fraction = new_fraction
        else:
            # just add the fraction to this element, clear the original element
            s[index][2] = s[index][2] + fraction
Now, I'm not sure I'd say that's "easier". I haven't tested it thoroughly, and the logic in that last section may still need work. But it's a different approach.
