Vectorized relabeling of NumPy array to consecutive numbers and retrieving back - python

I have a huge training dataset with 4 classes. These classes are labeled non-consecutively. To be able to apply a sequential neural network, the classes have to be relabeled so that their unique values are consecutive. In addition, at the end of the script I have to relabel them back to their old values.
I know how to relabel them with loops:
def relabel(old_classes, new_classes):
    indexes = [np.where(old_classes == np.unique(old_classes)[i]) for i in range(len(new_classes))]
    for i in range(len(new_classes)):
        old_classes[indexes[i]] = new_classes[i]
    return old_classes
>>> old_classes = np.array([0,1,2,6,6,2,6,1,1,0])
>>> new_classes = np.arange(len(np.unique(old_classes)))
>>> relabel(old_classes,new_classes)
array([0, 1, 2, 3, 3, 2, 3, 1, 1, 0])
But this isn't nice coding and it takes quite a lot of time.
Any idea how to vectorize this relabeling?
To be clear, I also want to be able to relabel them back to their old values:
>>> relabeled_classes=np.array([0, 1, 2, 3, 3, 2, 3, 1, 1, 0])
>>> old_classes = np.array([0,1,2,6])
>>> relabel(relabeled_classes,old_classes )
array([0,1,2,6,6,2,6,1,1,0])

We can use the optional argument return_inverse with np.unique to get those unique sequential IDs/tags, like so -
unq_arr, unq_tags = np.unique(old_classes, return_inverse=True)
Index into unq_arr with unq_tags to retrieve back -
old_classes_retrieved = unq_arr[unq_tags]
Sample run -
In [69]: old_classes = np.array([0,1,2,6,6,2,6,1,1,0])
In [70]: unq_arr, unq_tags = np.unique(old_classes, return_inverse=True)
In [71]: unq_arr
Out[71]: array([0, 1, 2, 6])
In [72]: unq_tags
Out[72]: array([0, 1, 2, 3, 3, 2, 3, 1, 1, 0])
In [73]: old_classes_retrieved = unq_arr[unq_tags]
In [74]: old_classes_retrieved
Out[74]: array([0, 1, 2, 6, 6, 2, 6, 1, 1, 0])
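Putting both directions together, here's a minimal sketch of a relabel/restore pair built on return_inverse (the function names are illustrative, not from the question):

import numpy as np

def relabel(old_classes):
    # Map arbitrary labels to consecutive integers 0..k-1,
    # keeping the lookup table needed to go back.
    unq_arr, unq_tags = np.unique(old_classes, return_inverse=True)
    return unq_tags, unq_arr

def restore(new_classes, unq_arr):
    # Invert the mapping by indexing the lookup table.
    return unq_arr[new_classes]

old = np.array([0, 1, 2, 6, 6, 2, 6, 1, 1, 0])
new, table = relabel(old)   # new -> array([0, 1, 2, 3, 3, 2, 3, 1, 1, 0])
assert np.array_equal(restore(new, table), old)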

Append numpy arrays with different dimensions

I am trying to attach or concatenate two numpy arrays with different dimensions. It does not look good so far.
So, as an example,
a = np.arange(0,4).reshape(1,4)
b = np.arange(0,3).reshape(1,3)
And I am trying
G = np.concatenate(a,b,axis=0)
I get an error as a and b are not the same dimension. The reason I need to concatenate a and b is that I am trying to solve a model recursively and the state space is changing over time. So I need to call the last value function as an input to get a value function for the next time period, etc.:
for t in range(T-1,0,-1):
    VG,CG = findv(VT[-1])
    VT = np.append(VT,VG,axis=0)
    CT = np.append(CT,CG,axis=0)
But, VT has a different dimension from the time period to the next.
Does anyone know how to deal with VT and CT numpy arrays that keep changing dimension?
OK - thanks for the input ... I need the output to be of the following form:
G = [[0, 1, 2, 3],
     [0, 1, 2]]
So if I write G[-1] I will get the last element,
[0, 1, 2].
I do not know if that can be a numpy array?
Thanks, Jesper.
In [71]: a,b,c = np.arange(0,4), np.arange(0,3), np.arange(0,7)
It's easy to put those arrays in a list, either all at once, or incrementally:
In [72]: [a,b,c]
Out[72]: [array([0, 1, 2, 3]), array([0, 1, 2]), array([0, 1, 2, 3, 4, 5, 6])]
In [73]: G =[a,b]
In [74]: G.append(c)
In [75]: G
Out[75]: [array([0, 1, 2, 3]), array([0, 1, 2]), array([0, 1, 2, 3, 4, 5, 6])]
We can make an object dtype array from that list.
In [76]: np.array(G)
Out[76]:
array([array([0, 1, 2, 3]), array([0, 1, 2]),
array([0, 1, 2, 3, 4, 5, 6])], dtype=object)
Be aware that sometimes this could produce a 2d array (if all subarrays were the same size), or an error. Usually it's better to stick with the list.
Repeated append or concatenate to an array is usually not recommended. It's trickier to do right, and slower when it does work.
But let's demonstrate:
In [80]: G = np.array([a,b])
In [81]: G
Out[81]: array([array([0, 1, 2, 3]), array([0, 1, 2])], dtype=object)
c gets 'expanded' with a simple concatenate:
In [82]: np.concatenate((G,c))
Out[82]:
array([array([0, 1, 2, 3]), array([0, 1, 2]), 0, 1, 2, 3, 4, 5, 6],
dtype=object)
Instead we need to wrap c in an object dtype array of its own:
In [83]: cc = np.array([None])
In [84]: cc[0]= c
In [85]: cc
Out[85]: array([array([0, 1, 2, 3, 4, 5, 6])], dtype=object)
In [86]: np.concatenate((G,cc))
Out[86]:
array([array([0, 1, 2, 3]), array([0, 1, 2]),
array([0, 1, 2, 3, 4, 5, 6])], dtype=object)
In general when we concatenate, the dtypes have to match, or at least be compatible. Here, all inputs need to be object dtype. The same would apply when joining compound dtypes (structured arrays). It's only when joining simple numeric dtypes (and strings) that we can ignore dtypes (provided we don't care about integers becoming floats, etc).
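As a quick illustration of that promotion caveat (an assumed example, not from the original answer):

import numpy as np

# Joining int and float arrays silently upcasts the result to float64.
mixed = np.concatenate((np.array([1, 2]), np.array([3.5])))
print(mixed, mixed.dtype)   # [1.  2.  3.5] float64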
You can't really stack arrays with different dimensions or sizes of dimensions.
This is a list (kind of your desired output, if I understand correctly):
G = [[0, 1, 2, 3],
     [0, 1, 2]]
Transformed to numpy array:
G_np = np.array(G)
>>> G_np.shape
(2,)
>>> G_np
array([list([0, 1, 2, 3]), list([0, 1, 2])], dtype=object)
Solution in your case (based on your requirements):
a = np.arange(0,4)
b = np.arange(0,3)
G_npy = np.array([a,b])
>>> G_npy.shape
(2,)
>>> G_npy
array([array([0, 1, 2, 3]), array([0, 1, 2])], dtype=object)
>>> G_npy[-1]
array([0, 1, 2])
Edit: in relation to your question in the comments
I must admit I have no idea how to do it in a correct way.
But if a hacky way is OK (maybe it is the correct way), then:
G_npy = np.array([a,b])
G_npy = np.append(G_npy,None) # Allocate space for your new array
G_npy[-1] = np.arange(5) # populate the new space with new array
>>> G_npy
array([array([0, 1, 2, 3]), array([0, 1, 2]), array([0, 1, 2, 3, 4])],
dtype=object)
Or this way, though then there is no point in using numpy:
temp = [i for i in G_npy]
temp.append(np.arange(5))
G_npy = np.array(temp)
NOTE:
To be honest, I don't think numpy is good for collecting objects (lists like this).
If I were you, I would just keep appending to a real list and transform it to numpy at the end. But after all, I don't know your application, so I don't know what the best approach is.
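A minimal sketch of that list-first pattern (the lengths below are stand-ins for the changing state-space sizes):

import numpy as np

# Collect variable-length arrays in a plain Python list first...
chunks = []
for n in (4, 3, 5):
    chunks.append(np.arange(n))

# ...then convert once at the end. Pre-allocating an object array and
# filling it avoids numpy trying to build a ragged array directly
# (which newer versions refuse to do without dtype=object).
G = np.empty(len(chunks), dtype=object)
G[:] = chunks
print(G[-1])   # [0 1 2 3 4]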
Try this way:
import numpy as np
a = np.arange(4).reshape(2,2)
b = np.arange(6).reshape(2,3)
c = np.arange(8).reshape(2,4)
a
# array([[0, 1],
# [2, 3]])
b
# array([[0, 1, 2],
# [3, 4, 5]])
c
# array([[0, 1, 2, 3],
# [4, 5, 6, 7]])
np.hstack((a,b,c))
#array([[0, 1, 0, 1, 2, 0, 1, 2, 3],
# [2, 3, 3, 4, 5, 4, 5, 6, 7]])
Hope it helps.
You are missing a pair of parentheses there.
Please refer to the concatenate documentation below.
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.concatenate.html
import numpy as np
a = np.arange(0,4).reshape(1,4)
b = np.arange(0,3).reshape(1,3)
c = np.concatenate((a,b), axis=1)  # axis 1, as you have reshaped the arrays
The above will give you the concatenated output c as:
array([[0, 1, 2, 3, 0, 1, 2]])

Python: Concatenate many dicts of numpy arrays with same keys and size

I have a function, called within a loop, that returns a dict (dsst_mean) with roughly 50 variables. All variables are numpy arrays of length 10.
The loop iterates roughly 3000 times. I'm currently concatenating towards the end of each iteration, so that I have a 'dsst_mean_all' dict that grows larger on each iteration.
source = [dsst_mean_all, dsst_mean]
for key in source[0]:
    dsst_mean_all[key] = np.concatenate([d[key] for d in source])
It works, but I know this isn't efficient. I also have problems with the initialization of the 'dsst_mean_all' dict. (I'm currently using dict.fromkeys() to do this.)
My question is: what are some options to do this more efficiently? I'm thinking I could store the dsst_mean dicts in a list and then do one concatenate at the end. But I'm not sure whether holding 3000+ dicts of numpy arrays in memory is a good idea. I know this depends on the size, but unfortunately right now I don't have an estimate of the size of each 'dsst_mean' dict.
Thanks.
Normally we recommend collecting values in a list and making an array once, at the end. The new thing here is that we need to iterate over the keys of a dictionary to do this collection.
For example:
A function to make the individual dictionaries:
In [804]: def foo(i):
     ...:     return {k:np.arange(5) for k in ['A','B','C']}
     ...:
In [805]: foo(0)
Out[805]:
{'A': array([0, 1, 2, 3, 4]),
'B': array([0, 1, 2, 3, 4]),
'C': array([0, 1, 2, 3, 4])}
A collector dictionary:
In [806]: dd = {k:[] for k in ['A','B','C']}
Iteration, collecting arrays in the lists:
In [807]: for _ in range(3):
     ...:     x = foo(None)
     ...:     for k,v in dd.items():
     ...:         v.append(x[k])
     ...:
In [808]: dd
Out[808]:
{'A': [array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])],
'B': [array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])],
'C': [array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])]}
Another iteration on the dictionary to turn the lists into some sort of array (stack, concatenate, your choice):
In [810]: for k,v in dd.items():
     ...:     dd[k]=np.stack(v,axis=0)
     ...:
In [811]: dd
Out[811]:
{'A': array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]]), 'B': array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]]), 'C': array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])}
A list of 3000 arrays of length 10 will take up somewhat more memory than one array of 30,000 numbers, but not drastically more.
You could collect the whole dictionaries in one list the first time around, but you would still need to combine them into one dictionary like this.
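For completeness, a small sketch of that alternative (reusing the toy foo from the session above):

import numpy as np

def foo(i):
    # Same toy generator as in the session above.
    return {k: np.arange(5) for k in ['A', 'B', 'C']}

# Collect the whole dicts in one list, then concatenate per key once.
results = [foo(i) for i in range(3000)]
dsst_mean_all = {k: np.concatenate([d[k] for d in results])
                 for k in results[0]}
print(dsst_mean_all['A'].shape)   # (15000,)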

How to find the indices of a vectorised matrix numpy

I have an n-dimensional matrix in numpy (n x n x n), which I vectorise in order to do some sampling of my data in a particular way, giving me (1 x n^3).
I would like to take the individual vectorised indices and convert them back to n-dimensional indices in the form (n x n x n). I'm not sure how numpy actually vectorises matrices.
Can anyone advise?
Numpy has a function unravel_index which does pretty much that: given a set of 'flat' indices, it will return a tuple of arrays of indices in each dimension:
>>> indices = np.arange(25, dtype=int)
>>> np.unravel_index(indices, (5, 5))
(array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4,
4, 4], dtype=int64),
array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
3, 4], dtype=int64))
You can then zip them to get your original indices.
Be aware however that matrices can be represented as 'sequences of rows' (C convention, 'C') or 'sequence of columns' (Fortran convention, 'F'), or the corresponding convention in higher dimensions. Typical flattening of matrices in numpy will preserve that order, so [[1, 2], [3, 4]] can be flattened into [1, 2, 3, 4] (if it has 'C' order) or [1, 3, 2, 4] (if it has 'F' order). unravel_index takes an optional order parameter if you want to change the default (which is 'C'), so you can do:
>>> # Typically, transposition will change the order for
>>> # efficiency reasons: no need to change the data !
>>> n = np.random.random((2, 2, 2)).transpose()
>>> n.flags.f_contiguous
True
>>> n.flags.c_contiguous
False
>>> x, y, z = np.unravel_index([1,2,3,7], (2, 2, 2), order='F')
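To make the round trip explicit, a short sketch: zip pairs the per-dimension index arrays back into (i, j, k) tuples, and ravel_multi_index (with the same shape and order) inverts unravel_index.

import numpy as np

flat = [1, 2, 3, 7]
coords = np.unravel_index(flat, (2, 2, 2), order='F')
print(list(zip(*coords)))   # [(1, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 1)]

# ravel_multi_index is the inverse mapping.
print(np.ravel_multi_index(coords, (2, 2, 2), order='F'))   # [1 2 3 7]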

numpy select fixed amount of values among duplicate values in array

Starting from a simple array with duplicate values:
a = np.array([2,3,2,2,3,3,2,1])
I'm trying to keep at most 2 occurrences of each unique value. The resulting array would appear as:
b = np.array([2,3,2,3,1])
no matter the order of the items. So far I tried to find unique values with:
In [20]: c = np.unique(a,return_counts=True)
In [21]: c
Out[21]: (array([1, 2, 3]), array([1, 4, 3]))
which is useful because it returns the frequency of values as well, but I'm stuck on filtering by frequency.
You could use np.repeat to generate the desired array from the array of uniques and counts:
import numpy as np
a = np.array([2,3,2,2,3,3,2,1])
uniques, count = np.unique(a,return_counts=True)
np.repeat(uniques, np.clip(count, 0, 2))
yields
array([1, 2, 2, 3, 3])
np.clip is used to force all values in count to be between 0 and 2. Thus, you get at most two copies of each unique value.
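For instance, applied to the counts from the sample above:
np.clip(count, 0, 2)
# array([1, 2, 2])
so np.repeat emits one 1, two 2s, and two 3s.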
You can use a list comprehension within np.concatenate() and limit the number of items by slicing:
>>> np.concatenate([a[a==i][:2] for i in np.unique(a)])
array([1, 2, 2, 3, 3])
Here's an approach to keep the order as in the input array -
N = 2  # Number of duplicates to keep for each unique element
# Sorting indices, so that equal values become adjacent
sortidx = a.argsort()
# Starting position of each unique value in the sorted order
_, id_arr = np.unique(a[sortidx], return_index=True)
# Up to N positions from each start, clipped to stay in bounds
valid_ind = np.unique((id_arr[:,None] + np.arange(N)).ravel().clip(max=a.size-1))
# Map back to original positions and restore input order
out = a[np.sort(sortidx[valid_ind])]
Sample run -
In [253]: a
Out[253]: array([ 0, -3, 0, 2, 0, 3, 2, 0, 2, 3, 3, 2, 1, 5, 0, 2])
In [254]: N
Out[254]: 3
In [255]: out
Out[255]: array([ 0, -3, 0, 2, 0, 3, 2, 2, 3, 3, 1, 5])
In [256]: np.unique(out,return_counts=True)[1] # Verify the counts to be <= N
Out[256]: array([1, 3, 1, 3, 3, 1])

Finding differences between all values in an List

I want to find the differences between all values in a numpy array and append them to a new list.
Example: a = [1,4,2,6]
Result: newlist = [3,1,5,3,2,2,1,2,4,5,2,4]
i.e. for each value of a, the differences from all the other values in the list.
At this point I have been unable to find a solution.
You can do this:
a = [1,4,2,6]
newlist = [abs(i-j) for i in a for j in a if i != j]
Output:
print(newlist)
[3, 1, 5, 3, 2, 2, 1, 2, 4, 5, 2, 4]
Note that i != j compares values rather than positions, so this skips every pair of equal values; it matches the expected output here because all the values in a are distinct.
I believe what you are trying to do is to calculate the absolute differences between elements of the input list, excluding the self-differences. With that idea, this could be one vectorized (array-programming) approach -
# Input list
a = [1,4,2,6]
# Convert input list to a numpy array
arr = np.array(a)
# Calculate absolute differences between each element
# against all elements to give us a 2D array
sub_arr = np.abs(arr[:,None] - arr)
# Get diagonal indices for the 2D array
N = arr.size
rem_idx = np.arange(N)*(N+1)
# Remove the diagonal elements for the final output
out = np.delete(sub_arr,rem_idx)
Sample run to show the outputs at each step -
In [60]: a
Out[60]: [1, 4, 2, 6]
In [61]: arr
Out[61]: array([1, 4, 2, 6])
In [62]: sub_arr
Out[62]:
array([[0, 3, 1, 5],
[3, 0, 2, 2],
[1, 2, 0, 4],
[5, 2, 4, 0]])
In [63]: rem_idx
Out[63]: array([ 0, 5, 10, 15])
In [64]: out
Out[64]: array([3, 1, 5, 3, 2, 2, 1, 2, 4, 5, 2, 4])
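As a quick sanity check (not part of the original answers), the vectorized output matches the list comprehension from the first answer whenever all values are distinct:

import numpy as np

a = [1, 4, 2, 6]
newlist = [abs(i - j) for i in a for j in a if i != j]

arr = np.array(a)
sub_arr = np.abs(arr[:, None] - arr)
out = np.delete(sub_arr, np.arange(arr.size) * (arr.size + 1))

assert out.tolist() == newlist   # both: [3, 1, 5, 3, 2, 2, 1, 2, 4, 5, 2, 4]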
