I have a numpy array, a list of start/end indexes that define ranges within the array, and a list of values, where the number of values is the same as the number of ranges. Doing this assignment in a loop is currently very slow, so I'd like to assign the values to the corresponding ranges in the array in a vectorized way. Is this possible to do?
Here's a concrete, simplified example:
a = np.zeros([10])
Here are a list of start indexes and a list of end indexes that define ranges within a:
starts = [0, 2, 4, 6]
ends = [2, 4, 6, 8]
And here's a list of values I'd like to assign to each range:
values = [1, 2, 3, 4]
I have two problems. The first is that I can't figure out how to index into the array using multiple slices at the same time, since the list of ranges is constructed dynamically in the actual code. The second is that, once I'm able to extract the ranges, I'm not sure how to assign multiple values at once, one value per range.
Here's how I've tried creating a list of slices and the problems I've run into when using that list to index into the array:
slices = [slice(start, end) for start, end in zip(starts, ends)]
In [97]: a[slices]
...
IndexError: too many indices for array
In [98]: a[np.r_[slices]]
...
IndexError: arrays used as indices must be of integer (or boolean) type
If I use a static list, I can extract multiple slices at once, but then assignment doesn't work the way I want:
In [106]: a[np.r_[0:2, 2:4, 4:6, 6:8]] = [1, 2, 3]
/usr/local/bin/ipython:1: DeprecationWarning: assignment will raise an error in the future, most likely because your index result shape does not match the value array shape. You can use `arr.flat[index] = values` to keep the old behaviour.
#!/usr/local/opt/python/bin/python2.7
In [107]: a
Out[107]: array([ 1., 2., 3., 1., 2., 3., 1., 2., 0., 0.])
What I actually want is this:
np.array([1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
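This will do the trick in a fully vectorized manner (starts and ends are converted to arrays first so that they can be subtracted):
starts, ends = np.asarray(starts), np.asarray(ends)
counts = ends - starts
idx = np.ones(counts.sum(), dtype=int)
idx[np.cumsum(counts)[:-1]] -= counts[:-1]
idx = np.cumsum(idx) - 1 + np.repeat(starts, counts)
a[idx] = np.repeat(values, counts)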
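As a cross-check on the index construction, here is a minimal sketch of the same idea built from np.arange instead of the ones/cumsum trick. The helper name range_indices is hypothetical, and the sketch assumes the ranges are half-open [start, end) and do not overlap.

```python
import numpy as np

def range_indices(starts, ends):
    """Return the flat indices covered by the half-open ranges [start, end)."""
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    counts = ends - starts
    # Position in the flat output where each range begins.
    offsets = np.cumsum(counts) - counts
    # Position of each element within its own range.
    within = np.arange(counts.sum()) - np.repeat(offsets, counts)
    # Shift every element by the start of the range it belongs to.
    return within + np.repeat(starts, counts), counts

a = np.zeros(10)
idx, counts = range_indices([0, 2, 4, 6], [2, 4, 6, 8])
a[idx] = np.repeat([1, 2, 3, 4], counts)
print(a)
```

Because the index array and the repeated values are built from the same counts, the two always have matching lengths, even when the ranges differ in size.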
One possibility is to zip the start, end index with the values and broadcast the index and values manually:
import numpy as np
starts = [0, 2, 4, 6]
ends = [2, 4, 6, 8]
values = [1, 2, 3, 4]
a = np.zeros(10)
# calculate the index array and value array by zipping the starts, ends and values and expand it
idx, val = zip(*[(list(range(s, e)), [v] * (e-s)) for s, e, v in zip(starts, ends, values)])
# assign values
a[np.concatenate(idx)] = np.concatenate(val)  # concatenate also handles ranges of unequal length
a
# array([ 1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
Or write a for loop to assign the values one range at a time:
for s, e, v in zip(starts, ends, values):
    a[s:e] = v
a
# array([ 1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
Given the following numpy arrays:
import numpy
a=numpy.array([[1,1,1],[1,1,1],[1,1,1]])
b=numpy.array([[2,2,2],[2,2,2],[2,2,2]])
c=numpy.array([[3,3,3],[3,3,3],[3,3,3]])
and this dictionary containing them all:
mydict={0:a,1:b,2:c}
What is the most efficient way of iterating through mydict to compute the average numpy array, which has (1+2+3)/3=2 as values?
My attempt fails as I am giving it too many values to unpack. It is also extremely inefficient as it has an O(n^3) time complexity:
aver=numpy.empty([a.shape[0],a.shape[1]])
for c,v in mydict.values():
    for i in range(0,a.shape[0]):
        for j in range(0,a.shape[1]):
            aver[i][j]=mydict[c][i][j] #<-too many values to unpack
The final result should be:
In[17]: aver
Out[17]:
array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]])
EDIT
I am not looking for an average value for each numpy array. I am looking for an average value for each element of my collection of numpy arrays. This is a minimal example, but the real thing I am working on has over 120,000 elements per array, and for the same position the values change from array to array.
I think you're making this harder than it needs to be. Either sum them and divide by the number of terms:
In [42]: v = mydict.values()
In [43]: sum(v) / len(v)
Out[43]:
array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]])
Or stack them into one big array -- which it sounds like is the format they probably should have been in to start with -- and take the mean over the stacked axis:
In [44]: np.array(list(v)).mean(axis=0)
Out[44]:
array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]])
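For reference, the stacking approach can also be sketched with np.stack, which makes the new leading axis explicit. The arrays below are rebuilt with np.full purely for illustration; they hold the same values as a, b and c in the question.

```python
import numpy as np

mydict = {
    0: np.full((3, 3), 1),
    1: np.full((3, 3), 2),
    2: np.full((3, 3), 3),
}

# Stack along a new leading axis: shape (n_arrays, rows, cols) = (3, 3, 3).
stacked = np.stack(list(mydict.values()))

# Averaging over axis 0 yields one mean per element position.
aver = stacked.mean(axis=0)
print(aver)
```

The same call works unchanged for arrays with 120,000 elements each, as long as all arrays share one shape.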
You really shouldn't be using a dict of numpy.arrays. Just use a multi-dimensional array:
>>> bigarray = numpy.array([arr.tolist() for arr in mydict.values()])
>>> bigarray
array([[[1, 1, 1],
[1, 1, 1],
[1, 1, 1]],
[[2, 2, 2],
[2, 2, 2],
[2, 2, 2]],
[[3, 3, 3],
[3, 3, 3],
[3, 3, 3]]])
>>> bigarray.mean(axis=0)
array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]])
>>>
You should modify your code to not even work with a dict. Especially not a dict with integer keys...
I am having issues understanding how X and y are referenced for training.
I have a simple csv file with 5 numeric columns that I am loading into a NumPy array as follows:
url = "http://www.xyz/shortDataFinal.data"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attributes
X = dataset[:,0:3] #Does this mean columns 1-4?
y = dataset[:,4] #Is this the 5th column?
I think I am referencing my X values incorrectly.
Here is what I need:
X values reference columns 1-4 and my y value is the last column, which is the 5th. If I understand correctly, I should be referencing array indices 0:3 for the X values and number 4 for the y as I have done above. However, those values aren't correct. In other words, the values returned by the array don't match the values in the data - they are off by one column (index).
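Your row/column reading is correct, but the slice is off by one. dataset is a 2-D array in this case, so the numpy indexing operators ([]) use the conventional row, column format.
X = dataset[:,0:3] is interpreted as "all rows, columns 0 up to but not including 3", which is only the first three columns; use dataset[:,0:4] to get the first four. y = dataset[:,4] is interpreted as "all rows for column 4", which is the fifth column.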
Using a multiline string as a stand-in for a csv file:
In [332]: txt=b"""0, 1, 2, 4, 5
.....: 6, 7, 8, 9, 10
.....: """
In [333]: data=np.loadtxt(txt.splitlines(), delimiter=',')
In [334]: data
Out[334]:
array([[ 0., 1., 2., 4., 5.],
[ 6., 7., 8., 9., 10.]])
In [335]: data.shape
Out[335]: (2, 5)
In [336]: data[:,0:4]
Out[336]:
array([[ 0., 1., 2., 4.],
[ 6., 7., 8., 9.]])
In [337]: data[:,4]
Out[337]: array([ 5., 10.])
numpy indexing starts at 0; [0:4] is the same (more or less) as the list of numbers starting at 0, up to, but not including 4.
In [339]: np.arange(0,4)
Out[339]: array([0, 1, 2, 3])
Another way to get all but the last column is to use -1 indexing
In [352]: data[:,:-1]
Out[352]:
array([[ 0., 1., 2., 4.],
[ 6., 7., 8., 9.]])
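Putting the two slices together, a minimal sketch of the feature/target split might look like this. The inline CSV text is a hypothetical stand-in for the real file, and the sketch assumes the target is always the last column.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file: four feature columns, one target column.
csv_text = "0,1,2,4,5\n6,7,8,9,10\n"
dataset = np.loadtxt(io.StringIO(csv_text), delimiter=",")

X = dataset[:, :-1]  # every column except the last -> shape (2, 4)
y = dataset[:, -1]   # only the last column -> shape (2,)

print(X.shape, y.shape)
```

Slicing relative to the end of the row avoids hard-coding the column count, so the split stays correct if a feature column is added later.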
Often a CSV file is a mix of numeric and string values. The loadtxt dtype parameter has a short explanation of how you can load and access that as a structured array. genfromtxt is easier to use for that (though no less confusing).
Say I had the following:
x = np.array([[1.,2.,3.,4.],[2.,3.,4.,5,],[1.,3.,5.,6.]])
What would the syntax be in order to select say, the first two columns of every row? (So [[1.,2.],[2.,3.],[1.,3.]]).
Ultimately I want to run a loop of the form:
for j in range(len(x)):
    a = x[1,2:j] * #something
Where x[1,2:j] refers to what I am trying to achieve in my question. Thanks in advance!
You can use np.hsplit() (which splits an array into multiple sub-arrays horizontally, i.e. column-wise) and then choose the first part:
>>> np.hsplit(x,2)[0]
array([[ 1., 2.],
[ 2., 3.],
[ 1., 3.]])
Or you can just use slicing :
>>> x[:, :2]
array([[ 1., 2.],
[ 2., 3.],
[ 1., 3.]])
You can slice axis 1 of the array x:
>>> x[:, :2]
array([[ 1., 2.],
[ 2., 3.],
[ 1., 3.]])
The : for axis 0 effectively means "every row". The :2 in axis 1 means "get the first two columns (0 and 1)".
Slicing in multiple dimensions works similarly to Python lists and other iterables,
start:stop:step
You can specify a slice for each dimension of the array, or use : to get everything along the axis.
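A short sketch of per-axis slicing on the array from the question; the variable names on the left are only illustrative.

```python
import numpy as np

x = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [1., 3., 5., 6.]])

first_two = x[:, :2]     # all rows, columns 0 and 1
every_other = x[:, ::2]  # all rows, columns 0 and 2 (step of 2)
row1_tail = x[1, 2:]     # row 1 only, columns 2 onward -> 1-D array

print(first_two)
print(every_other)
print(row1_tail)
```

Note that using a plain integer for an axis (as in x[1, 2:]) drops that axis, while a slice keeps it.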
I have two arrays A and B:
A=array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B=array([[ 1., 1., 2.],
[ 3., 2., 1.]])
Anywhere there is a "1" in B I want to sum the same row and column locations in A.
So, for example, for the 1's the answer would be 5+5+9=19
I would want this to continue for 2,3....n (all unique values in B)
So for the 2's... it would be 9+5=14 and for the 3's it would be 8
I found the unique values by using:
numpy.unique(B)
I realize this may take multiple steps, but I can't really wrap my head around using the index matrix to sum those locations in another matrix.
For each unique value x, you can do
A[B == x].sum()
Example:
>>> A[B == 1.0].sum()
19.0
I think numpy.bincount is what you want. If B is an array of small non-negative integers like in your example, you can do something like this:
import numpy
A = numpy.array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B = numpy.array([[ 1, 1, 2],
[ 3, 2, 1]])
print(numpy.bincount(B.ravel(), weights=A.ravel()))
# [ 0. 19. 14. 8.]
or if B has anything but small integers you can do something like this
import numpy
A = numpy.array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B = numpy.array([[ 1., 1., 2.],
[ 3., 2., 1.]])
uniqB, inverse = numpy.unique(B, return_inverse=True)
# ravel the inverse too: some NumPy versions return it with the shape of B
print(uniqB, numpy.bincount(inverse.ravel(), weights=A.ravel()))
# [ 1. 2. 3.] [ 19. 14. 8.]
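The unique/bincount combination can be wrapped in a small helper. This is a sketch under the assumption that values and labels have the same shape; the name sum_by_label is hypothetical.

```python
import numpy as np

def sum_by_label(values, labels):
    """Sum the entries of `values` grouped by the matching entry in `labels`."""
    values = np.asarray(values).ravel()
    labels = np.asarray(labels).ravel()
    # inverse[i] is the position of labels[i] within the sorted unique labels.
    uniq, inverse = np.unique(labels, return_inverse=True)
    # bincount adds up the weights that share the same inverse index.
    sums = np.bincount(inverse, weights=values)
    return uniq, sums

A = np.array([[5., 5., 5.],
              [8., 9., 9.]])
B = np.array([[1., 1., 2.],
              [3., 2., 1.]])

uniq, sums = sum_by_label(A, B)
print(uniq, sums)
```

Flattening both inputs before calling np.unique keeps the inverse one-dimensional, which is what np.bincount expects.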
[(val, np.sum(A[B==val])) for val in np.unique(B)] gives you a list of tuples where the first element is one of the unique values in B, and the second element is the sum of elements in A where the corresponding value in B is that value.
>>> [(val, np.sum(A[B==val])) for val in np.unique(B)]
[(1.0, 19.0), (2.0, 14.0), (3.0, 8.0)]
The key is that you can use A[B==val] to access items in A at positions where B equals val.
Edit: If you just want the sums, just do [np.sum(A[B==val]) for val in np.unique(B)].
I'd use numpy masked arrays. These are standard numpy arrays with a mask associated with them that blocks off certain values. The process is pretty straightforward: create a masked array using
numpy.ma.masked_array(data, mask)
where mask is generated by using a masked function
mask = numpy.ma.masked_not_equal(B, 1).mask
and data is A
for i in numpy.unique(B):
    print(numpy.ma.masked_array(A, numpy.ma.masked_not_equal(B, i).mask).sum())
19.0
14.0
8.0
I found an old question about this. One of its answers was:
def sum_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    values.cumsum(out=values)
    index = np.ones(len(groups), 'bool')
    index[:-1] = groups[1:] != groups[:-1]
    values = values[index]
    groups = groups[index]
    values[1:] = values[1:] - values[:-1]
    return values, groups
in your case, you can flatten your array
aflat = A.flatten()
bflat = B.flatten()
sum_by_group(aflat, bflat)
In NumPy, how can you efficiently make a 1-D object into a 2-D object where the singleton dimension is inferred from the current object (i.e. a list should go to either a 1xlength or lengthx1 vector)?
# This comes from some other, unchangeable code that reads data files.
my_list = [1,2,3,4]
# What I want to do:
my_numpy_array[some_index,:] = numpy.asarray(my_list)
# The above doesn't work because of a broadcast error, so:
my_numpy_array[some_index,:] = numpy.reshape(numpy.asarray(my_list),(1,len(my_list)))
# How to do the above without the call to reshape?
# Is there a way to directly convert a list, or vector, that doesn't have a
# second dimension, into a 1 by length "array" (but really it's still a vector)?
In the most general case, the easiest way to add extra dimensions to an array is by using the keyword None when indexing at the position to add the extra dimension. For example
my_array = numpy.array([1,2,3,4])
my_array[None, :] # shape 1x4
my_array[:, None] # shape 4x1
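A brief sketch confirming the resulting shapes. np.newaxis is simply an alias for None, and reshape(1, -1) is shown as an equivalent alternative for comparison.

```python
import numpy as np

my_array = np.array([1, 2, 3, 4])

row = my_array[np.newaxis, :]       # shape (1, 4); np.newaxis is None
col = my_array[:, None]             # shape (4, 1)
also_row = my_array.reshape(1, -1)  # shape (1, 4); -1 lets NumPy infer the length

print(row.shape, col.shape, also_row.shape)
```

Both indexing forms return views of the original data, so no copy is made.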
Why not simply add square brackets?
>>> my_list
[1, 2, 3, 4]
>>> numpy.asarray([my_list])
array([[1, 2, 3, 4]])
>>> numpy.asarray([my_list]).shape
(1, 4)
.. wait, on second thought, why is your slice assignment failing? It shouldn't:
>>> my_list = [1,2,3,4]
>>> d = numpy.ones((3,4))
>>> d
array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
>>> d[0,:] = my_list
>>> d[1,:] = numpy.asarray(my_list)
>>> d[2,:] = numpy.asarray([my_list])
>>> d
array([[ 1., 2., 3., 4.],
[ 1., 2., 3., 4.],
[ 1., 2., 3., 4.]])
even:
>>> d[1,:] = (3*numpy.asarray(my_list)).T
>>> d
array([[ 1., 2., 3., 4.],
[ 3., 6., 9., 12.],
[ 1., 2., 3., 4.]])
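One plausible cause of a broadcast error in this kind of assignment is a column-shaped value. The sketch below shows which value shapes fit into a row slice of shape (4,); the flag name failed is only for illustration.

```python
import numpy as np

d = np.ones((3, 4))

d[0, :] = np.array([1, 2, 3, 4])    # shape (4,): matches the row exactly
d[1, :] = np.array([[1, 2, 3, 4]])  # shape (1, 4): the leading length-1 axis is dropped

try:
    # shape (4, 1): a column cannot be broadcast into a row of shape (4,)
    d[2, :] = np.array([[1], [2], [3], [4]])
    failed = False
except ValueError:
    failed = True

print(failed)
print(d)
```

If the incoming data is a column, flattening it with ravel() before the assignment makes it fit.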
import numpy as np
a = np.random.random(10)
sel = np.atleast_2d(a)[idx]
What about expand_dims?
np.expand_dims(np.array([1,2,3,4]), 0)
has shape (1,4) while
np.expand_dims(np.array([1,2,3,4]), 1)
has shape (4,1).
You can always use dstack() to replicate your array:
import numpy
my_list = numpy.array([1,2,3,4])
my_list_2D = numpy.dstack((my_list,my_list))  # shape (1, 4, 2)