Assume I have a dict, call it coeffs:
coeffs = {'X1': 0.1, 'X2':0.2, 'X3':0.4, ..., 'Xn':0.09}
How can I convert the values into a 1 x n ndarray?
Into an n x m ndarray?
Here's an example of using your coeffs to fill in an array, with value indices derived from the dictionary keys:
In [591]: coeffs = {'X1': 0.1, 'X2':0.2, 'X3':0.4, 'X4':0.09}
In [592]: alist = [[int(k[1:]),v] for k,v in coeffs.items()]
In [593]: alist
Out[593]: [[4, 0.09], [3, 0.4], [1, 0.1], [2, 0.2]]
Here I stripped off the initial character and converted the rest to an integer. You could do your own conversion.
Now just initialize an empty array and fill in the values:
In [594]: X = np.zeros((5,))
In [595]: for k,v in alist: X[k] = v
In [596]: X
Out[596]: array([ 0. , 0.1 , 0.2 , 0.4 , 0.09])
Obviously I could have used X = np.zeros((1,5)). An (n,m) array doesn't make sense unless there's a basis for choosing n for each dictionary item.
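If you do want an explicit 1 x n row with the values in key order, here's a minimal sketch (assuming all keys follow the 'X<number>' pattern):
import numpy as np

coeffs = {'X1': 0.1, 'X2': 0.2, 'X3': 0.4, 'X4': 0.09}
keys = sorted(coeffs, key=lambda k: int(k[1:]))     # ['X1', 'X2', 'X3', 'X4']
row = np.array([coeffs[k] for k in keys])[None, :]  # shape (1, 4)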
Just for laughs, here's another way of making an array from a dictionary - put the keys and values into the fields of a structured array:
In [613]: X = np.zeros(len(coeffs),dtype=[('keys','S3'),('values',float)])
In [614]: X
Out[614]:
array([(b'', 0.0), (b'', 0.0), (b'', 0.0), (b'', 0.0)],
dtype=[('keys', 'S3'), ('values', '<f8')])
In [615]: for i,(k,v) in enumerate(coeffs.items()):
     .....:     X[i] = (k,v)
     .....:
In [616]: X
Out[616]:
array([(b'X4', 0.09), (b'X3', 0.4), (b'X1', 0.1), (b'X2', 0.2)],
dtype=[('keys', 'S3'), ('values', '<f8')])
In [617]: X['keys']
Out[617]:
array([b'X4', b'X3', b'X1', b'X2'],
dtype='|S3')
In [618]: X['values']
Out[618]: array([ 0.09, 0.4 , 0.1 , 0.2 ])
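Since the dict's iteration order isn't meaningful here, a sort on the 'keys' field puts the records in X1..X4 order (continuing the session above):
In [619]: np.sort(X, order='keys')
Out[619]:
array([(b'X1', 0.1), (b'X2', 0.2), (b'X3', 0.4), (b'X4', 0.09)],
      dtype=[('keys', 'S3'), ('values', '<f8')])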
The scipy sparse module has a sparse matrix format (dok_matrix) that stores its values in a dictionary; in fact, it is a subclass of dict. The keys in this dictionary are (i,j) tuples, the indices of the nonzero elements. sparse has tools for quickly converting such a matrix into other, more computation-friendly sparse formats, and into regular dense arrays.
I learned in other SO questions that a fast way to build such a matrix is to use the regular dictionary update method to copy values from another dictionary.
Inspired by @user's 2d version of this problem, here's how such a sparse matrix could be created.
Start with @user's sample coeffs:
In [24]: coeffs
Out[24]:
{'Y8': 22,
'Y2': 16,
'Y6': 20,
'X5': 20,
'Y9': 23,
'X2': 17,
...
'Y1': 15,
'X4': 19}
Define a little function that converts the X3 style of key to (0,3) style:
In [25]: def decodekey(akey):
    ....:     pt1, pt2 = akey[0], akey[1:]
    ....:     i = {'X':0, 'Y':1}[pt1]
    ....:     j = int(pt2)
    ....:     return i,j
    ....:
Apply it with a dictionary comprehension to coeffs (or use a regular loop in earlier Python versions):
In [26]: coeffs1 = {decodekey(k):v for k,v in coeffs.items()}
In [27]: coeffs1
Out[27]:
{(1, 2): 16,
(0, 1): 16,
(0, 0): 15,
(1, 4): 18,
(1, 5): 19,
...
(0, 8): 23,
(0, 2): 17}
Import sparse and define an empty dok matrix:
In [28]: from scipy import sparse
In [29]: M=sparse.dok_matrix((2,10),dtype=int)
In [30]: M.A
Out[30]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Fill it with the coeffs1 dictionary values:
In [31]: M.update(coeffs1)
In [33]: M.A # convert to dense array
Out[33]:
array([[15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])
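dok is a good format for building a matrix incrementally; for computation you would normally convert it first (a quick sketch; note that newer SciPy releases may disallow the dict-style update used above, in which case a loop of item assignments does the same job):
Mcsr = M.tocsr()   # compressed sparse row: fast arithmetic and row slicing
Mcoo = M.tocoo()   # coordinate format: exposes row/col/data arrays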
Actually, I don't need to use sparse to convert coeffs1 into an array. The (i,j) tuple can index an array directly; A[(i,j)] is the same as A[i,j].
In [34]: A=np.zeros((2,10),int)
In [35]: for k,v in coeffs1.items():
    ....:     A[k] = v
    ....:
In [36]: A
Out[36]:
array([[15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])
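The fill loop can also be collapsed into a single fancy-indexing assignment; a dict's keys() and values() iterate in matching order, so (a sketch):
rows, cols = zip(*coeffs1.keys())
A2 = np.zeros((2, 10), int)
A2[list(rows), list(cols)] = list(coeffs1.values())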
Concerning an n x m array
@hpaulj's answer assumed (rightly) that the numbers after the X were supposed to be positions. If you had data like
coeffs = {'X1': 3, 'X2' : 5, ..., 'Xn' : 34, 'Y1': 5, 'Y2' : -3, ..., 'Yn': 32}
You could do as follows. Given sample data like
{'Y3': 17, 'Y2': 16, 'Y8': 22, 'Y5': 19, 'Y6': 20, 'Y4': 18, 'Y9': 23, 'Y1': 15, 'X8': 23, 'X9': 24, 'Y7': 21, 'Y0': 14, 'X2': 17, 'X3': 18, 'X0': 15, 'X1': 16, 'X6': 21, 'X7': 22, 'X4': 19, 'X5': 20}
created by
a = {}
for i in range(10):
    a['X'+str(i)] = 15 + i
for i in range(10):
    a['Y'+str(i)] = 14 + i
Put it into a nested dictionary (inefficient, but easy):
b = {}
for k, v in a.items():
    letter = k[0]
    index = int(k[1:])
    if letter not in b:
        b[letter] = {}
    b[letter][index] = v
gives
>>> b
{'Y': {0: 14, 1: 15, 2: 16, 3: 17, 4: 18, 5: 19, 6: 20, 7: 21, 8: 22, 9: 23}, 'X': {0: 15, 1: 16, 2: 17, 3: 18, 4: 19, 5: 20, 6: 21, 7: 22, 8: 23, 9: 24}}
Find out the target dimensions of the array. (This assumes all params are the same length and you have all values given.)
row_length = max(next(iter(b.values()))) + 1  # indices run 0..9, so 10 columns
row_indices = sorted(b.keys())
Create the array via
X = np.empty((len(row_indices), row_length))
and insert the data:
for i, row in enumerate(row_indices):
    for j in range(row_length):
        X[i, j] = b[row][j]
Result
>>> X
array([[ 15., 16., 17., 18., 19., 20., 21., 22., 23., 24.],
       [ 14., 15., 16., 17., 18., 19., 20., 21., 22., 23.]])
Old answer
coeffs.values() gives the dict's values (a view in Python 3). Just create an array from it:
np.array(list(coeffs.values()))
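One caution, not in the original answer: this relies on the dict's iteration order (insertion order in Python 3.7+). If the columns must follow the key numbering, sort the keys first (a sketch assuming the 'X<number>' pattern):
vals = np.array([coeffs[k] for k in sorted(coeffs, key=lambda k: int(k[1:]))])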
In general, when you have an object like coeffs, you can type
help(coeffs)
in the interpreter, to get a list of all it can do.
Related
unigram is an array of shape (N, M, 100). seq is a 1D array of size M, and M may be up to 10000. I would like to remove the for loop and vectorize the calculation:
batch_size, seq_len, num_labels = unigram_scores.shape
broadcast = np.broadcast_to(seq, (batch_size, seq_len))
n_seq = np.empty((seq_len, batch_size))  # preallocation; missing from the original snippet
for i in range(0, broadcast.shape[1]):
    n_seq[i] = unigram_scores[np.arange(batch_size), i, broadcast[:, i]]
edit:
The answer by @hpaulj worked perfectly, and has the advantage of not requiring any extra dependency. The speedup was lower than I expected, though, so I ended up installing numba:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def calculate_unigram_probability(unigram_scores, seq):
    batch_size, seq_len, num_labels = unigram_scores.shape
    broadcast = np.broadcast_to(seq, (batch_size, seq_len))
    n_seq = np.empty((seq_len, batch_size))  # preallocation; missing from the original snippet
    for i in prange(broadcast.shape[1]):
        n_seq[i] = unigram_scores[np.arange(batch_size), i, broadcast[:, i]]
    return n_seq
which is also taking a bit too long. Currently I am trying to move it from the CPU to CUDA, which should bring the speedup I am hoping for.
In [129]: N,M = 5,3
In [130]: unigram=np.arange(N*M*4).reshape(N,M,4)
In [131]: seq = np.arange(M)
In [132]: b_seq = np.broadcast_to(seq, (N,M))
For a single i:
In [133]: i=0; unigram[np.arange(N),i,b_seq[:,i]]
Out[133]: array([ 0, 12, 24, 36, 48])
For all i in the range:
In [136]: i=np.arange(M)[:,None]
In [137]: unigram[np.arange(N),i,b_seq[:,i]]
Out[137]:
array([[[ 0, 12, 24, 36, 48],
[ 5, 17, 29, 41, 53],
[10, 22, 34, 46, 58]],
...
[[ 0, 12, 24, 36, 48],
[ 5, 17, 29, 41, 53],
[10, 22, 34, 46, 58]]])
That's a (5,3,5) array. This (5,3) version is probably what you want:
In [141]: i=np.arange(M); unigram[np.arange(N)[:,None],i,b_seq[:,i]]
Out[141]:
array([[ 0, 5, 10],
[12, 17, 22],
[24, 29, 34],
[36, 41, 46],
[48, 53, 58]])
We don't need to index b_seq: unigram[np.arange(N)[:,None],i,b_seq]
Or don't even use b_seq; let the indexing broadcast seq:
unigram[np.arange(N)[:,None],i,seq]
and with the help of ix_:
In [145]: I,J=np.ix_(np.arange(N), np.arange(M))
In [146]: unigram[I,J,seq]
To get a visual idea of what this indexing does, look at unigram. It pulls 'diagonals' from successive blocks/batches:
In [147]: unigram
Out[147]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]],
...
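Pulling the pieces together, a fully vectorized version of the question's calculate_unigram_probability might look like this (a sketch; note it returns a (batch_size, seq_len) array instead of filling the loop's transposed n_seq):
def calculate_unigram_probability_vec(unigram_scores, seq):
    batch_size, seq_len, _ = unigram_scores.shape
    # result[n, m] = unigram_scores[n, m, seq[m]]
    return unigram_scores[np.arange(batch_size)[:, None],
                          np.arange(seq_len), seq]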
You can use x.flatten() to reshape a 3d array to a 1d array (x must be a numpy array). In your case:
broadcast = broadcast.flatten()
This will transform an array of shape (N, M, 100) into a one-dimensional array.
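As a side note (not from the original answer), np.ravel does the same job and avoids a copy when the memory layout allows; for a broadcast view like this one a copy is made either way:
broadcast = broadcast.ravel()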
Let's say I have an array like the following:
x = np.arange(30).reshape((10,3))
x
Out[52]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17],
[18, 19, 20],
[21, 22, 23],
[24, 25, 26],
[27, 28, 29]])
How do I add a fourth column to each row such that this column is an exponential function of the row index, ending up with something like this:
array([[ 0, 1, 2, 2.718281828],
       [ 3, 4, 5, 7.389056099],
       [ 6, 7, 8, 20.08553692],
       [ 9, 10, 11, 54.59815003],
       [12, 13, 14, 148.4131591],
       [15, 16, 17, 403.4287935],
       [18, 19, 20, 1096.633158],
       [21, 22, 23, 2980.957987],
       [24, 25, 26, 8103.083928],
       [27, 28, 29, 22026.46579]])
Computing the exponential is easy:
ex = np.exp(np.arange(x.shape[0]) + 1)
What you want to do with it is a whole different story. Numpy doesn't allow heterogeneous arrays, unlike say pandas. So with the simple answer, your result will be float64 (x is most likely int64 or int32):
x = np.concatenate((x, ex[:, None]), axis=1)
An alternative is using structured arrays, which will let you preserve the input types:
d = [('', x.dtype)] * x.shape[1] + [('', ex.dtype)]
out = np.empty(ex.shape, dtype=d)
Bulk assignment is a bit tricky, but can be done with a view obtained from the raw ndarray constructor:
view = np.ndarray(buffer=out, dtype=x.dtype, shape=x.shape, strides=(out.dtype.itemsize, x.dtype.itemsize))
view[...] = x
np.ndarray(buffer=out, dtype=ex.dtype, shape=ex.shape, strides=(out.dtype.itemsize,), offset=x.strides[0])[:] = ex
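A quick sanity check on the view trick (a sketch; with empty name strings, numpy auto-names the fields 'f0'..'f3'):
print(out[0])   # e.g. (0, 1, 2, 2.718281828459045)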
A simpler approach would be to use recarray, as @PaulPanzer suggests:
out = np.core.records.fromarrays([*x.T, ex])
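If you pass names, the record fields stay accessible as attributes and keep their input dtypes (the field names below are made up for illustration):
out = np.core.records.fromarrays([*x.T, ex], names='c0,c1,c2,e')
out.e[:3]   # array([ 2.71828183,  7.3890561 , 20.08553692])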
Try this:
import numpy as np
a = np.arange(30).reshape((10,3))
b = np.zeros((a.shape[0], a.shape[1] + 1))
b[:, :-1] = a
b[:, 3] = np.exp(np.arange(len(b)) + 1)  # start at e**1 to match the desired output
To create a single array of powers of e starting at one, you can use
powers = np.power(np.e, np.arange(10) + 1)
This takes the number e and raises it to the powers given by the array np.arange(10) + 1, i.e. the numbers [1...10].
You can then add this as an additional column by first reshaping it and then adding it using np.hstack.
powers = powers.reshape(-1, 1)
x = np.hstack((x, powers))
You can construct such column with:
>>> np.exp(np.arange(1, 11))
array([2.71828183e+00, 7.38905610e+00, 2.00855369e+01, 5.45981500e+01,
1.48413159e+02, 4.03428793e+02, 1.09663316e+03, 2.98095799e+03,
8.10308393e+03, 2.20264658e+04])
So we can first obtain the number of rows, and then use np.hstack:
rows = x.shape[0]
result = np.hstack((x, np.exp(np.arange(1, rows+1)).reshape(-1, 1)))
We then obtain:
>>> np.hstack((x, np.exp(np.arange(1, 11)).reshape(-1, 1)))
array([[0.00000000e+00, 1.00000000e+00, 2.00000000e+00, 2.71828183e+00],
[3.00000000e+00, 4.00000000e+00, 5.00000000e+00, 7.38905610e+00],
[6.00000000e+00, 7.00000000e+00, 8.00000000e+00, 2.00855369e+01],
[9.00000000e+00, 1.00000000e+01, 1.10000000e+01, 5.45981500e+01],
[1.20000000e+01, 1.30000000e+01, 1.40000000e+01, 1.48413159e+02],
[1.50000000e+01, 1.60000000e+01, 1.70000000e+01, 4.03428793e+02],
[1.80000000e+01, 1.90000000e+01, 2.00000000e+01, 1.09663316e+03],
[2.10000000e+01, 2.20000000e+01, 2.30000000e+01, 2.98095799e+03],
[2.40000000e+01, 2.50000000e+01, 2.60000000e+01, 8.10308393e+03],
[2.70000000e+01, 2.80000000e+01, 2.90000000e+01, 2.20264658e+04]])
I want to extract a complete slice from a 3D numpy array using ndenumerate or something similar.
arr = np.random.rand(4, 3, 3)
I want to extract all possible arr[:, x, y] where x, y range from 0 to 2
ndindex is a convenient way of generating the indices corresponding to a shape:
In [33]: arr = np.arange(36).reshape(4,3,3)
In [34]: for xy in np.ndindex((3,3)):
    ...:     print(xy, arr[:,xy[0],xy[1]])
    ...:
(0, 0) [ 0 9 18 27]
(0, 1) [ 1 10 19 28]
(0, 2) [ 2 11 20 29]
(1, 0) [ 3 12 21 30]
(1, 1) [ 4 13 22 31]
(1, 2) [ 5 14 23 32]
(2, 0) [ 6 15 24 33]
(2, 1) [ 7 16 25 34]
(2, 2) [ 8 17 26 35]
It uses nditer, but doesn't have any speed advantages over a nested pair of for loops.
In [35]: for x in range(3):
    ...:     for y in range(3):
    ...:         print((x,y), arr[:,x,y])
ndenumerate uses arr.flat as the iterator, but using it like this:
In [38]: for xy, _ in np.ndenumerate(arr[0,:,:]):
    ...:     print(xy, arr[:,xy[0],xy[1]])
does the same thing, iterating on the elements of a 3x3 subarray. As with ndindex it generates the indices. The element won't be the size 4 array that you want, so I ignored that.
A different approach is to flatten the later axes, transpose, and then just iterate on the (new) first axis:
In [43]: list(arr.reshape(4,-1).T)
Out[43]:
[array([ 0, 9, 18, 27]),
array([ 1, 10, 19, 28]),
array([ 2, 11, 20, 29]),
array([ 3, 12, 21, 30]),
array([ 4, 13, 22, 31]),
array([ 5, 14, 23, 32]),
array([ 6, 15, 24, 33]),
array([ 7, 16, 25, 34]),
array([ 8, 17, 26, 35])]
or with the print as before:
In [45]: for a in arr.reshape(4,-1).T:print(a)
Why not just
[arr[:, x, y] for x in range(3) for y in range(3)]
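If you need a single array rather than a list of slices, wrapping the comprehension does it (a small sketch):
out = np.array([arr[:, x, y] for x in range(3) for y in range(3)])   # shape (9, 4)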
In python/numpy, how can I subset a multidimensional array where another one, of the same shape, is maximum along some axis (e.g. the first one)?
Suppose I have two 3*2*4 arrays, a and b. I want to obtain a 2*4 array containing the values of b at the locations where a has its maximal values along the first axis.
import numpy as np
np.random.seed(7)
a = np.random.rand(3*2*4).reshape((3,2,4))
b = np.random.rand(3*2*4).reshape((3,2,4))
print(a)
#[[[ 0.07630829 0.77991879 0.43840923 0.72346518]
# [ 0.97798951 0.53849587 0.50112046 0.07205113]]
#
# [[ 0.26843898 0.4998825 0.67923 0.80373904]
# [ 0.38094113 0.06593635 0.2881456 0.90959353]]
#
# [[ 0.21338535 0.45212396 0.93120602 0.02489923]
# [ 0.60054892 0.9501295 0.23030288 0.54848992]]]
print(a.argmax(axis=0))  # (I would like b at these locations along axis 0)
#[[1 0 2 1]
# [0 2 0 1]]
I can do this really ugly manual subsetting:
index = list(zip(a.argmax(axis=0).flatten(),
                 [0]*a.shape[2] + [1]*a.shape[2],   # a.shape[2] = 4 here
                 list(range(a.shape[2])) * 2))
# [(1, 0, 0), (0, 0, 1), (2, 0, 2), (1, 0, 3),
#  (0, 1, 0), (2, 1, 1), (0, 1, 2), (1, 1, 3)]
Which would allow me to obtain my desired result:
b_where_a_is_max_along0 = np.array([b[i] for i in index]).reshape(2,4)
# For verification:
print(a.max(axis=0) == np.array([a[i] for i in index]).reshape(2,4))
#[[ True True True True]
# [ True True True True]]
What is the smart, numpy way to achieve this? Thanks :)
Use advanced-indexing -
m,n = a.shape[1:]
b_out = b[a.argmax(0),np.arange(m)[:,None],np.arange(n)]
Sample run -
Set up input array a and get its argmax along the first axis -
In [185]: a = np.random.randint(11,99,(3,2,4))
In [186]: idx = a.argmax(0)
In [187]: idx
Out[187]:
array([[0, 2, 1, 2],
[0, 1, 2, 0]])
In [188]: a
Out[188]:
array([[[49*, 58, 13, 69], # * are the max positions
[94*, 28, 55, 86*]],
[[34, 17, 57*, 50],
[48, 73*, 22, 80]],
[[19, 89*, 42, 71*],
[24, 12, 66*, 82]]])
Verify results with b -
In [193]: b
Out[193]:
array([[[18*, 72, 35, 51], # Mark * at the same positions in b
[74*, 57, 50, 84*]], # and verify
[[58, 92, 53*, 65],
[51, 95*, 43, 94]],
[[85, 23*, 13, 17*],
[17, 64, 35*, 91]]])
In [194]: b[a.argmax(0),np.arange(2)[:,None],np.arange(4)]
Out[194]:
array([[18, 23, 53, 17],
[74, 95, 35, 84]])
You could use ogrid
>>> x = np.random.random((2,3,4))
>>> x
array([[[ 0.87412737, 0.11069105, 0.86951092, 0.74895912],
[ 0.48237622, 0.67502597, 0.11935148, 0.44133397],
[ 0.65169681, 0.21843482, 0.52877862, 0.72662927]],
[[ 0.48979028, 0.97103611, 0.36459645, 0.80723839],
[ 0.90467511, 0.79118429, 0.31371856, 0.99443492],
[ 0.96329039, 0.59534491, 0.15071331, 0.52409446]]])
>>> y = np.argmax(x, axis=1)
>>> y
array([[0, 1, 0, 0],
[2, 0, 0, 1]])
>>> i, j = np.ogrid[:2,:4]
>>> x[i ,y, j]
array([[ 0.87412737, 0.67502597, 0.86951092, 0.74895912],
[ 0.96329039, 0.97103611, 0.36459645, 0.99443492]])
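As a side note for newer NumPy (1.15+), np.take_along_axis does the same selection without building index grids (a sketch using the question's a and b):
idx = a.argmax(axis=0)[None, ...]                 # shape (1, 2, 4)
b_at_max = np.take_along_axis(b, idx, axis=0)[0]  # shape (2, 4)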
I create two matrices
import numpy as np
arrA = np.zeros((9000,3))
arrB = np.zeros((9000,6))
I want to concatenate pieces of those matrices.
But when I try to do:
arrC = np.hstack((arrA, arrB[:,1]))
I get an error:
ValueError: all the input arrays must have same number of dimensions
I guess it's because np.shape(arrB[:,1]) is equal to (9000,) instead of (9000,1), but I cannot figure out how to resolve it.
Could you please comment on this issue?
You could preserve dimensions by passing a list of indices, not an index:
>>> arrB[:,1].shape
(9000,)
>>> arrB[:,[1]].shape
(9000, 1)
>>> out = np.hstack([arrA, arrB[:,[1]]])
>>> out.shape
(9000, 4)
This is easier to see visually.
Assume:
>>> arrA=np.arange(9000*3).reshape(9000,3)
>>> arrA
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
...,
[26991, 26992, 26993],
[26994, 26995, 26996],
[26997, 26998, 26999]])
>>> arrB=np.arange(9000*6).reshape(9000,6)
>>> arrB
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[ 12, 13, 14, 15, 16, 17],
...,
[53982, 53983, 53984, 53985, 53986, 53987],
[53988, 53989, 53990, 53991, 53992, 53993],
[53994, 53995, 53996, 53997, 53998, 53999]])
If you take a slice of arrB, you produce a 1-D array that looks more like a row:
>>> arrB[:,1]
array([ 1, 7, 13, ..., 53983, 53989, 53995])
What you need is an array shaped like a column, (9000, 1), to add to arrA:
>>> arrB[:,[1]]
array([[ 1],
[ 7],
[ 13],
...,
[53983],
[53989],
[53995]])
Then hstack works as expected:
>>> arrC=np.hstack((arrA, arrB[:,[1]]))
>>> arrC
array([[ 0, 1, 2, 1],
[ 3, 4, 5, 7],
[ 6, 7, 8, 13],
...,
[26991, 26992, 26993, 53983],
[26994, 26995, 26996, 53989],
[26997, 26998, 26999, 53995]])
An alternate form is to specify -1 for one dimension and the desired number of rows or columns for the other in .reshape():
>>> arrB[:,1].reshape(-1,1) # one col
array([[ 1],
[ 7],
[ 13],
...,
[53983],
[53989],
[53995]])
>>> arrB[:,1].reshape(-1,6) # 6 cols
array([[ 1, 7, 13, 19, 25, 31],
[ 37, 43, 49, 55, 61, 67],
[ 73, 79, 85, 91, 97, 103],
...,
[53893, 53899, 53905, 53911, 53917, 53923],
[53929, 53935, 53941, 53947, 53953, 53959],
[53965, 53971, 53977, 53983, 53989, 53995]])
>>> arrB[:,1].reshape(2,-1) # 2 rows
array([[ 1, 7, 13, ..., 26983, 26989, 26995],
[27001, 27007, 27013, ..., 53983, 53989, 53995]])
There is more on array shaping and stacking here
I would try something like this:
np.vstack((arrA.transpose(), arrB[:,1])).transpose()
There are several ways of making your selection from arrB a (9000,1) array:
np.hstack((arrA,arrB[:,[1]]))
np.hstack((arrA,arrB[:,1][:,None]))
np.hstack((arrA,arrB[:,1].reshape(9000,1)))
np.hstack((arrA,arrB[:,1].reshape(-1,1)))
One uses indexing with a list, the next adds a new axis (np.newaxis, written here as None), and the last two use reshape. These are all basic numpy array manipulation tasks.
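As one more option (not from the original answers), np.column_stack promotes 1-D inputs to columns automatically, so no explicit reshape is needed:
arrC = np.column_stack((arrA, arrB[:, 1]))   # shape (9000, 4)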