Python zip() built-in and rounding behaviour

I've just seen what seems to be rather curious behaviour around the Python zip() built-in. I passed it a NumPy array of rounded decimals, but it spits out an expanded version.
This is the original array; my goal is to generate a dictionary with the proportion of occurrences of each unique element (np is NumPy):
a = np.array([1, 2, 3, 1, 1, 2, 1])
So I do
elems, counts = np.unique(a, return_counts=True)
which returns (array([1, 2, 3]), array([4, 2, 1])). Correct. But now I want the proportion rather than the count (rounded to the third digit), so I do
counts = np.round(counts/a.size, 3)
which gives array([ 0.571, 0.286, 0.143]) for counts. Now I zip this into the desired dict:
dict(zip(*(elems, counts)))
This returns {1: 0.57099999999999995, 2: 0.28599999999999998, 3: 0.14299999999999999}, so it looks like the rounded counts have had some digits added!

NumPy simply displays a different number of significant digits when printing arrays; the values themselves are unchanged. 0.571 cannot be represented exactly in binary floating point, so the stored double is the closest representable value, and that is what the dict repr shows. You can adjust the printing precision with np.set_printoptions.
Example using your data:
import numpy as np
a = np.array([1, 2, 3, 1, 1, 2, 1])
elems, counts = np.unique(a, return_counts=True)
counts = np.round(counts/a.size, 3)
np.set_printoptions(precision=20)
print(counts)
outputs:
[ 0.57099999999999995204 0.28599999999999997646 0.14299999999999998823]
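If you also want the dict itself to print the short rounded form, one option (just an illustrative sketch, assuming Python 3's shortest-repr float printing, not part of the answer above) is to convert the NumPy scalars to plain Python numbers before zipping:
import numpy as np

a = np.array([1, 2, 3, 1, 1, 2, 1])
elems, counts = np.unique(a, return_counts=True)
props = np.round(counts / a.size, 3)

# .tolist() turns NumPy scalars into plain Python ints/floats, whose repr
# uses the shortest decimal that round-trips, so the dict prints cleanly.
result = dict(zip(elems.tolist(), props.tolist()))
print(result)  # {1: 0.571, 2: 0.286, 3: 0.143}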

Related

Taking away one numpy ndarray from another

I have two numpy.ndarrays, one of which is a random sample from the other. I wish to take the smaller one (the random sample) and remove those data points from the larger one.
What is the code to do so?
delete and remove do not work on ndarrays.
Thank you
Maybe this can help:
a = np.array([1, 2, 3, 2, 4, 1])
b = np.array([3, 4, 5, 6])
np.setdiff1d(a, b) # array([1, 2])
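Note that np.setdiff1d returns the sorted unique values of a that are not in b, so duplicates and the original order are lost. If that matters, a boolean mask built with np.isin is a possible alternative (a sketch, not part of the answer above):
import numpy as np

a = np.array([1, 2, 3, 2, 4, 1])
b = np.array([3, 4, 5, 6])

# keep every element of a that does not appear in b, preserving order and duplicates
mask = ~np.isin(a, b)
print(a[mask])  # [1 2 2 1]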

Rapid creation of nD array from generator yielding numpy arrays?

I have a generator that yields NumPy arrays, and need a way to rapidly construct another NumPy array from the results of the generator (array of arrays) by taking a specific number of yields from the generator. Speed is the critical aspect in my problem. I've tried np.fromiter but it seems it doesn't support constructing from arrays:
import numpy as np

def generator():
    for i in range(5):
        yield np.array([i] * 10)

arr = np.fromiter(iter(generator()), dtype=np.ndarray, count=3)
This throws an error, as described in several other SO posts:
Calling np.sum(np.fromiter(generator))
Numpy ValueError: setting an array element with a sequence
However, I haven't found any answer that offers a rapid way to source arrays from the generator without having to do:
it = iter(generator())
arr = np.array([next(it) for _ in range(3)])
Here it is indeed shown that np.fromiter is much faster: Faster way to convert list of objects to numpy array
Is it possible to rapidly source numpy arrays from the generator without having use the slow list to array conversion? I specifically want to avoid the np.array(list(...)) construct, because I will be calling it hundreds of thousands of times, and the delay will eventually add up and make a big difference in execution time.
What about using itertools.islice?
from itertools import islice
g = generator()
arr = np.array(list(islice(g, 3)))
# or in one line:
# arr = np.array(list(islice(generator(), 3)))
output:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]])
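If every yielded array has the same known length, another option (an illustrative sketch under that assumption, not claiming to be the fastest) is to preallocate the output and copy each yielded array into it, avoiding the intermediate list entirely:
import numpy as np
from itertools import islice

def generator():
    for i in range(5):
        yield np.array([i] * 10)

n_rows, row_len = 3, 10  # assumed to be known up front
out = np.empty((n_rows, row_len), dtype=int)
for row, chunk in zip(out, islice(generator(), n_rows)):
    row[:] = chunk  # copy the yielded array into its preallocated slot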

Numpy array normalization by group ids

Suppose data and labels are numpy arrays as below:
import numpy as np
data=np.array([[0,4,5,6,8],[0,6,8,9],[1,9,5],[1,45,7],[1,8,3]], dtype=object) # Note: length of each row is different, so dtype=object is needed
labels=np.array([4,6,10,4,6])
The first element in each row in data shows an id of a group. I want to normalize (see below example) the labels based on the group ids:
For example the first two rows in data have id=0; thus, their label must be:
normalized_labels[0]=labels[0]/(4+6)=0.4
normalized_labels[1]=labels[1]/(4+6)=0.6
The expected output should be:
normalized_labels=[0.4,0.6,0.5,0.2,0.3]
I have a naive solution as:
ids = np.array([data[i][0] for i in range(data.shape[0])])  # group id of each row
out = []
for i in set(ids):
    ind = np.where(ids == i)
    out.extend(list(labels[ind] / np.sum(labels[ind])))
out = np.array(out)
print(out)
Are there any numpy functions to perform such a task? Any suggestion is appreciated!
I found a fairly subtle way to transform labels into per-group sums with respect to indices = [n[0] for n in data]. After this step, data is not needed any more:
indices = [n[0] for n in data]                    # group id of each row
u, inv = np.unique(indices, return_inverse=True)  # inv relabels the ids as 0, 1, 2, ...
bincnt = np.bincount(inv, weights=labels)         # per-group sums of labels
sums = bincnt[inv]                                # broadcast the group sums back to rows
Now sums is array([10., 10., 20., 20., 20.]). The rest is simple:
normalized_labels = labels / sums
Remarks: np.bincount calculates weighted sums of items labelled 0, 1, 2, ...; this is why the re-indexing indices -> inv is needed. For example, indices = [8, 6, 4, 3, 4, 6, 8, 8] is mapped to inv = [3, 2, 1, 0, 1, 2, 3, 3].
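Putting the pieces together on the question's data, a complete sketch (using dtype=object for the ragged rows) reproduces the expected output:
import numpy as np

data = np.array([[0, 4, 5, 6, 8], [0, 6, 8, 9], [1, 9, 5], [1, 45, 7], [1, 8, 3]],
                dtype=object)  # ragged rows need dtype=object
labels = np.array([4, 6, 10, 4, 6])

indices = [row[0] for row in data]             # group id of each row: [0, 0, 1, 1, 1]
u, inv = np.unique(indices, return_inverse=True)
sums = np.bincount(inv, weights=labels)[inv]   # per-row group totals: [10. 10. 20. 20. 20.]
normalized_labels = labels / sums
print(normalized_labels)                       # [0.4 0.6 0.5 0.2 0.3]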

How to get a numpy array from frequency and indices

I have a numpy array like this:
nparr = np.asarray([[u'fals', u'nazi', u'increas', u'technolog', u'equip', u'princeton',
u'realiti', u'civilian', u'credit', u'ten'],
[u'million', u'thousand', u'nazi', u'stick', u'visibl', u'realiti',
u'west', u'singl', u'jack', u'charl']])
What I need to do is calculate the frequency of each item and get another numpy array with the corresponding frequency of each item in the same position.
So, as my array's shape here is (2, 10), I need a numpy array of shape (2, 10) but holding the frequency values. Thus, the output for the above would be:
[[1, 2, 1, 1, 1, 1, 2, 1, 1, 1]
[1, 1, 2, 1, 1, 2, 1, 1, 1, 1]]
What I have done so far:
unique, indices, count = np.unique(nparr, return_index=True, return_counts=True)
However, this way count holds the frequency of the unique values and does not give me the same shape as the original array.
You need to use return_inverse rather than return_index:
_, i, c = np.unique(nparr, return_inverse=True, return_counts=True)
_ is a convention to denote discarded return values. You don't need the unique values to know where the counts go.
You can get the counts arranged in the order of the original array with a simple indexing operation. Unraveling to the original shape is necessary, of course:
c[i].reshape(nparr.shape)
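For completeness, a short sketch on the array from the question shows the full pipeline and the expected result:
import numpy as np

nparr = np.asarray([[u'fals', u'nazi', u'increas', u'technolog', u'equip', u'princeton',
                     u'realiti', u'civilian', u'credit', u'ten'],
                    [u'million', u'thousand', u'nazi', u'stick', u'visibl', u'realiti',
                     u'west', u'singl', u'jack', u'charl']])

# counts of the unique values, scattered back to each element's position
_, i, c = np.unique(nparr, return_inverse=True, return_counts=True)
print(c[i].reshape(nparr.shape))
# [[1 2 1 1 1 1 2 1 1 1]
#  [1 1 2 1 1 2 1 1 1 1]]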

Generate unique values based on rows in a numpy array

I have a 3D numpy array, arr, with shape m*n*k.
For every set of values along the m axis (e.g. arr[:, 0, 0]) I want to generate a single value to represent that set, so that I end up with a 2D matrix of shape n*k.
If a set of values along the m axis is repeated, we should generate the same value each time.
I.e. it is a hashing problem.
I created a solution to the problem using a dictionary, but it drastically reduces performance. For each set of values, I call this function:
def getCellId(self, valueSet):
    # Turn the set of values (a numpy vector) into a tuple so it can be hashed
    key = tuple(valueSet)
    # Try to simply return an existing ID for this key
    try:
        return self.attributeDict[key]
    except KeyError:
        # If the key is new (it didn't exist), generate a new ID by adding one to the
        # max of all current IDs. This fails the very first time (there are no IDs yet),
        # so in that case just assign the value 1 to newId.
        try:
            newId = max(self.attributeDict.values()) + 1
        except ValueError:
            newId = 1
        self.attributeDict[key] = newId
        return newId
The array itself is typically of the size 30*256*256, so a single set of values will have 30 values.
I have hundreds of these arrays to process at any one time.
Currently, doing all the processing that needs to happen up to calculating the hash takes 1.3 s for a block of 100 arrays. Including the hashing bumps that up to 75 s.
Is there a faster way to generate the single representative value?
This could be one approach using basic numpy functions -
import numpy as np
# Random input for demo
arr = np.random.randint(0,3,[2,5,4])
# Get dimensions for later usage
m,n,k = arr.shape
# Reshape arr to a 2D array that has each slice arr[:, n, k] in each row
arr2d = np.transpose(arr,(1,2,0)).reshape([-1,m])
# Perform lexsort & get corresponding indices and sorted array
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d = arr2d[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_arr2d,axis=0)
# Look for changes along df1 that represent new labels to be put there
df2 = np.append([False],np.any(df1!=0,1),0)
# Get unique labels
labels = df2.cumsum(0)
# Store those unique labels in a n x k shaped 2D array
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n,k])
Sample run -
In [216]: arr
Out[216]:
array([[[2, 1, 2, 1],
        [1, 0, 2, 1],
        [2, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 2]],

       [[2, 1, 2, 2],
        [0, 0, 2, 1],
        [2, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 0]]])

In [217]: out
Out[217]:
array([[6, 4, 6, 5],
       [1, 0, 6, 4],
       [6, 3, 1, 1],
       [3, 0, 4, 1],
       [1, 3, 3, 2]], dtype=int32)
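On newer NumPy versions (assumption: np.unique gained an axis argument in 1.13), the same per-cell labelling can be sketched more compactly. The label values differ from the lexsort-based numbering above, but equal sets along the m axis still get equal labels:
import numpy as np

arr = np.random.randint(0, 3, (2, 5, 4))
m, n, k = arr.shape

arr2d = arr.reshape(m, -1).T                               # one m-length set per row
_, labels = np.unique(arr2d, axis=0, return_inverse=True)  # equal rows -> equal labels
out = labels.reshape(n, k)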
Depending on how many new keys vs. old keys need to be generated, it's hard to say what will be optimal. But using your logic, the following should be fairly fast:
import collections
import hashlib

_key = 0

def _get_new_key():
    global _key
    _key += 1
    return _key

attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    global attributes
    # Note: ndarray.tostring() is deprecated in newer NumPy; .tobytes() is the modern spelling.
    return attributes[hashlib.md5(series.tostring()).digest()]
Edit:
I have now updated the code to loop over all data series, as in your question, by using strides:
In [99]: import numpy as np
In [100]: A = np.random.random((30, 256, 256))
In [101]: A_strided = np.lib.stride_tricks.as_strided(A, (A.shape[1] * A.shape[2], A.shape[0]), (A.itemsize, A.itemsize * A.shape[1] * A.shape[2]))
In [102]: %timeit tuple(get_cell_id(S) for S in A_strided)
10 loops, best of 3: 169 ms per loop
The above does 256x256 lookups/assignments of 30 element arrays each.
There is of course no guarantee that the md5 hash won't collide. If that is an issue, you can of course switch to other hashes in the same lib.
Edit 2:
Given that you seem to do the majority of costly operations on the first axis of your 3D array, I would suggest you reorganize your array:
In [254]: A2 = np.random.random((256, 256, 30))
In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))
In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop
Not having to jump around long distances in memory makes for about a 25% speed-up.
Edit 3:
If there is no actual need to cache a hash-to-int lookup and you just need the actual hashes, and if the 3D array is of int8 type, then given the A2 and A2_strided organization the time can be reduced some more. Of this, 15 ms is the tuple looping.
In [9]: from hashlib import md5
In [10]: %timeit tuple(md5(series.tostring()).digest() for series in A2_strided)
10 loops, best of 3: 72.2 ms per loop
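As an aside (my observation, not part of the original answer), the as_strided views above appear to be equivalent to plain reshape/transpose views, which are a bit easier to read and check:
import numpy as np

A = np.random.random((30, 256, 256))
A2 = np.random.random((256, 256, 30))

A_rows = A.reshape(A.shape[0], -1).T     # shape (256*256, 30), same strides as A_strided
A2_rows = A2.reshape(-1, A2.shape[2])    # shape (256*256, 30), same strides as A2_strided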
If it is just about hashing, try this:
import numpy as np
import numpy.random
# create random data
a = numpy.random.randint(10,size=(5,3,3))
# create some identical 0-axis data
a[:,0,0] = np.arange(5)
a[:,0,1] = np.arange(5)
# create matrix with the hash values
h = np.apply_along_axis(lambda x: hash(tuple(x)),0,a)
h[0,0]==h[0,1]
# Output: True
However, use it with caution and test this code against your own code first; all I can say is that it works for this simple example.
In addition, two different values may end up with the same hash value. This can always happen when using a hash function, but such collisions are very unlikely.
Edit: In order to compare with the other solutions
%timeit np.apply_along_axis(lambda x: hash(tuple(x)), 0, a)
# output: 1 loops, best of 3: 677 ms per loop
