I am having issues understanding how X and y are referenced for training.
I have a simple csv file with 5 numeric columns that I am loading into a NumPy array as follows:
url = "http://www.xyz/shortDataFinal.data"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attributes
X = dataset[:,0:3] #Does this mean columns 1-4?
y = dataset[:,4] #Is this the 5th column?
I think I am referencing my X values incorrectly.
Here is what I need:
X values reference columns 1-4 and my y value is the last column, which is the 5th. If I understand correctly, I should be referencing array indices 0:3 for the X values and number 4 for the y as I have done above. However, those values aren't correct. In other words, the values returned by the array don't match the values in the data - they are off by one column (index).
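Almost. dataset is a 2-D array in this case, so the numpy indexing operator ([]) uses the conventional row, column format, and y = dataset[:,4] is indeed interpreted as "all rows for column 4" (the 5th column).
The off-by-one is in X: a slice excludes its stop index, so X = dataset[:,0:3] is interpreted as "all rows for columns 0, 1 and 2", which is only three columns. To get the first four columns, use X = dataset[:,0:4].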
Using a multiline string as a stand-in for a CSV file:
In [332]: txt=b"""0, 1, 2, 4, 5
.....: 6, 7, 8, 9, 10
.....: """
In [333]: data=np.loadtxt(txt.splitlines(), delimiter=',')
In [334]: data
Out[334]:
array([[ 0., 1., 2., 4., 5.],
[ 6., 7., 8., 9., 10.]])
In [335]: data.shape
Out[335]: (2, 5)
In [336]: data[:,0:4]
Out[336]:
array([[ 0., 1., 2., 4.],
[ 6., 7., 8., 9.]])
In [337]: data[:,4]
Out[337]: array([ 5., 10.])
numpy indexing starts at 0; [0:4] is the same (more or less) as the list of numbers starting at 0, up to, but not including 4.
In [339]: np.arange(0,4)
Out[339]: array([0, 1, 2, 3])
Another way to get all but the last column is to use -1 indexing:
In [352]: data[:,:-1]
Out[352]:
array([[ 0., 1., 2., 4.],
[ 6., 7., 8., 9.]])
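Applied to the original question, the same negative indexing gives a feature/target split that does not depend on knowing the column count. This is a minimal sketch; the small `dataset` array is a stand-in for the loaded CSV, not the asker's real data.

```python
import numpy as np

# Stand-in for the array returned by np.loadtxt (5 numeric columns)
dataset = np.array([[0., 1., 2., 4., 5.],
                    [6., 7., 8., 9., 10.]])

X = dataset[:, :-1]  # all rows, every column except the last
y = dataset[:, -1]   # all rows, last column only (1-D result)

print(X.shape, y.shape)
```

Because the stop index of a slice is excluded, `:-1` stops just before the last column, whatever the width of the file.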
Often a CSV file is a mix of numeric and string values. The loadtxt dtype parameter has a short explanation of how you can load and access that as a structured array. genfromtxt is easier to use for that (though no less confusing).
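As a rough illustration of the structured-array route, the sketch below reads a small mixed CSV with genfromtxt. The column names and values are made up for the example, and an in-memory string stands in for a real file.

```python
import io
import numpy as np

# Hypothetical CSV content with one string column and two numeric columns
csv_text = "name,height,weight\nann,1.6,55\nbob,1.8,80\n"

# names=True takes field names from the header row;
# dtype=None lets genfromtxt infer a type per column
table = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                      names=True, dtype=None, encoding="utf-8")

heights = table["height"]  # access a column by field name
print(table.dtype.names, heights)
```

The result is a 1-D array with one record per row, so columns are selected by name rather than by a second integer index.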
I have a numpy array, a list of start/end indexes that define ranges within the array, and a list of values, where the number of values is the same as the number of ranges. Doing this assignment in a loop is currently very slow, so I'd like to assign the values to the corresponding ranges in the array in a vectorized way. Is this possible to do?
Here's a concrete, simplified example:
a = np.zeros([10])
Here's the list of start and a list of end indexes that define ranges within a, like this:
starts = [0, 2, 4, 6]
ends = [2, 4, 6, 8]
And here's a list of values I'd like to assign to each range:
values = [1, 2, 3, 4]
I have two problems. The first is that I can't figure out how to index into the array using multiple slices at the same time, since the list of ranges is constructed dynamically in the actual code. Once I'm able to extract the ranges, I'm not sure how to assign multiple values at once - one value per range.
Here's how I've tried creating a list of slices and the problems I've run into when using that list to index into the array:
slices = [slice(start, end) for start, end in zip(starts, ends)]
In [97]: a[slices]
...
IndexError: too many indices for array
In [98]: a[np.r_[slices]]
...
IndexError: arrays used as indices must be of integer (or boolean) type
If I use a static list, I can extract multiple slices at once, but then assignment doesn't work the way I want:
In [106]: a[np.r_[0:2, 2:4, 4:6, 6:8]] = [1, 2, 3]
/usr/local/bin/ipython:1: DeprecationWarning: assignment will raise an error in the future, most likely because your index result shape does not match the value array shape. You can use `arr.flat[index] = values` to keep the old behaviour.
#!/usr/local/opt/python/bin/python2.7
In [107]: a
Out[107]: array([ 1., 2., 3., 1., 2., 3., 1., 2., 0., 0.])
What I actually want is this:
np.array([1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
This will do the trick in a fully vectorized manner:
starts = np.asarray(starts)  # lists do not support elementwise subtraction
ends = np.asarray(ends)
counts = ends - starts
idx = np.ones(counts.sum(), dtype=int)
idx[np.cumsum(counts)[:-1]] -= counts[:-1]
idx = np.cumsum(idx) - 1 + np.repeat(starts, counts)
a[idx] = np.repeat(values, counts)
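For reference, here is a self-contained sketch of the same idea that can be run as is. The helper name `assign_ranges` is hypothetical, and it builds the within-range offsets with np.arange instead of the cumsum trick; it assumes the ranges do not overlap and that starts, ends and values have the same length.

```python
import numpy as np

def assign_ranges(a, starts, ends, values):
    """Assign values[i] to a[starts[i]:ends[i]] without a Python loop."""
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    counts = ends - starts                        # length of each range
    # position in the flat index array where each range begins
    first = np.repeat(np.cumsum(counts) - counts, counts)
    within = np.arange(counts.sum()) - first      # 0, 1, ... inside each range
    idx = np.repeat(starts, counts) + within      # absolute indices into a
    a[idx] = np.repeat(values, counts)
    return a

# Ranges of unequal length: [0:2], [3:6], [7:8]
a = assign_ranges(np.zeros(10), [0, 3, 7], [2, 6, 8], [1, 2, 3])
print(a)
```

The ranges here have different lengths on purpose, since that is the case where stacking per-range index lists into a rectangular array would not work.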
One possibility is to zip the start, end index with the values and broadcast the index and values manually:
import numpy as np
starts = [0, 2, 4, 6]
ends = [2, 4, 6, 8]
values = [1, 2, 3, 4]
a = np.zeros(10)
# build one index list and one value list per range by zipping starts, ends and values
idx, val = zip(*[(list(range(s, e)), [v] * (e-s)) for s, e, v in zip(starts, ends, values)])
# assign values; np.concatenate also copes with ranges of unequal length
a[np.concatenate(idx)] = np.concatenate(val)
a
# array([ 1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
Or write a for loop that assigns one range at a time:
for s, e, v in zip(starts, ends, values):
    a[s:e] = v
a
# array([ 1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
I have several .mat files, each containing a matrix. I need to import them into Python using h5py, because they were saved with -v7.3.
For example:
myfile.mat contains a matrix X of size (10, 20).
I use the following commands in Python:
import numpy as np, h5py
f = h5py.File('myfile.mat', 'r')
data = np.array(f['X'])
data.shape  # -> (20, 10) Here is the problem!
The matrix X is transposed. How can I import the X without being transposed?
I think you have to live with transposing. MATLAB is F ordered (column-major), numpy is C ordered (row-major) by default. Somewhere along the line scipy.io.loadmat does that transposing. h5py does not, so you have to do some sort of transposing or reordering yourself.
And by the way, transpose is one of the cheapest operations on a numpy array.
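To make the "cheap" claim concrete, the sketch below uses a plain NumPy array as a stand-in for what h5py returns (no .mat file is read here). It shows that `.T` gives a view on the same memory with the axes swapped, so no data is copied.

```python
import numpy as np

# Stand-in for np.array(f['X']) as loaded by h5py: shape (20, 10)
data = np.arange(200.0).reshape(20, 10)

X = data.T  # view with swapped strides, shape (10, 20)

print(X.shape)
print(np.shares_memory(data, X))   # True: same buffer, nothing copied
print(X.flags['F_CONTIGUOUS'])     # True: the view is column-major
```

If later code needs a C-contiguous array, `np.ascontiguousarray(X)` would make a copy at that point.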
Save a (2,3) array in Octave:
octave:27> x=[0,1,2;3,4,5]
octave:28> save 'x34_7.mat' '-7' x
octave:33> save 'x34_h5.mat' '-hdf5' x
octave:32> reshape(x,[1,6])
ans = 0 3 1 4 2 5
Load it. The shape is (2,3), but it is F ordered:
In [102]: x7=loadmat('x34_7.mat')
In [103]: x7['x']
Out[103]:
array([[ 0., 1., 2.],
[ 3., 4., 5.]])
In [104]: _.flags
Out[104]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
Look at the h5 version:
In [110]: f=h5py.File('x34_h5.mat','r')
In [111]: x5=f['x']['value'][:]
Out[111]:
array([[ 0., 3.],
[ 1., 4.],
[ 2., 5.]])
# C_contiguous
and the data in x5 buffer is in the same order as in Octave:
In [134]: np.frombuffer(x5.data, float)
Out[134]: array([ 0., 3., 1., 4., 2., 5.])
So is the data from loadmat (though I have to transpose it, to make it C-contiguous, before looking at it with frombuffer):
In [139]: np.frombuffer(x7['x'].T.data,float)
Out[139]: array([ 0., 3., 1., 4., 2., 5.])
(Is there a better way of verifying that x5.data and x7['x'].data have the same content?)
This pattern holds in higher dimensions. In MATLAB it's the 1st dimension that varies most rapidly. Loaded by h5py, that dimension corresponds to the last. So MATLAB's x(:,2,2,2) would correspond to x[1,1,1,:] in the h5py array, and to x.T[:,1,1,1].
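The index correspondence can be checked without MATLAB or h5py. In this sketch `x_matlab` is a NumPy stand-in for a 4-D MATLAB array (built with order="F" to mimic column-major storage) and `x_h5` is an assumption about how h5py would present the same buffer, namely with the axes reversed.

```python
import numpy as np

# Stand-in for a MATLAB array of size (2, 3, 4, 5), stored column-major
x_matlab = np.arange(120).reshape((2, 3, 4, 5), order="F")

# Assumed h5py view of the same buffer: axes reversed, C-contiguous
x_h5 = np.ascontiguousarray(x_matlab.T)
print(x_h5.shape)  # (5, 4, 3, 2)

# MATLAB x(:,2,2,2) is x_matlab[:, 1, 1, 1] with 0-based indices
same_last = np.array_equal(x_matlab[:, 1, 1, 1], x_h5[1, 1, 1, :])
same_transposed = np.array_equal(x_matlab[:, 1, 1, 1], x_h5.T[:, 1, 1, 1])
print(same_last, same_transposed)

# Comparing flattened arrays in memory order is one way to check the buffers match
same_buffer = np.array_equal(x_h5.ravel(order="C"), x_matlab.ravel(order="F"))
print(same_buffer)
```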
Say I had the following:
x = np.array([[1.,2.,3.,4.],[2.,3.,4.,5,],[1.,3.,5.,6.]])
What would the syntax be in order to select say, the first two columns of every row? (So [[1.,2.],[2.,3.],[1.,3.]]).
Ultimately I want to run a loop of the form:
for j in range(len(x)):
    a = x[1,2:j] * #something
Where x[1,2:j] refers to what I am trying to achieve in my question. Thanks in advance!
You can use np.hsplit() (which splits an array into multiple sub-arrays horizontally, i.e. column-wise), then choose the first part:
>>> np.hsplit(x,2)[0]
array([[ 1., 2.],
[ 2., 3.],
[ 1., 3.]])
Or you can just use slicing:
>>> x[:, :2]
array([[ 1., 2.],
[ 2., 3.],
[ 1., 3.]])
You can slice axis 1 of the array x:
>>> x[:, :2]
array([[ 1., 2.],
[ 2., 3.],
[ 1., 3.]])
The : for axis 0 effectively means "every row". The :2 in axis 1 means "get the first two columns (0 and 1)".
Slicing in multiple dimensions works similarly to Python lists and other iterables,
start:stop:step
You can specify a slice for each dimension of the array, or use : to get everything along the axis.
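A few more start:stop:step combinations on the array from the question may help. This is only an illustrative sketch; the variable names are made up for the example.

```python
import numpy as np

x = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [1., 3., 5., 6.]])

first_two = x[:, :2]       # every row, columns 0 and 1
every_other = x[:, ::2]    # every row, columns 0 and 2 (step of 2)
middle_block = x[1:, 1:3]  # rows 1 and 2, columns 1 and 2
last_row_rev = x[-1, ::-1] # last row, columns in reverse order

print(first_two.shape, every_other.shape, middle_block.shape, last_row_rev.shape)
```

Note that an integer index such as `-1` drops that axis, while a slice keeps it, which is why `last_row_rev` is 1-D.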
I have two arrays A and B:
A=array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B=array([[ 1., 1., 2.],
[ 3., 2., 1.]])
Anywhere there is a "1" in B I want to sum the same row and column locations in A.
So for example, for the 1's the answer would be 5+5+9=19
I would want this to continue for 2,3....n (all unique values in B)
So for the 2's... it would be 9+5=14 and for the 3's it would be 8
I found the unique values by using:
numpy.unique(B)
I realize this may take multiple steps but I can't really wrap my head around using the index matrix to sum those locations in another matrix.
For each unique value x, you can do
A[B == x].sum()
Example:
>>> A[B == 1.0].sum()
19.0
I think numpy.bincount is what you want. If B is an array of small integers like in your example you can do something like this:
import numpy
A = numpy.array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B = numpy.array([[ 1, 1, 2],
[ 3, 2, 1]])
print(numpy.bincount(B.ravel(), weights=A.ravel()))
# [ 0. 19. 14. 8.]
or if B has anything but small integers you can do something like this
import numpy
A = numpy.array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B = numpy.array([[ 1., 1., 2.],
[ 3., 2., 1.]])
uniqB, inverse = numpy.unique(B, return_inverse=True)
# ravel() keeps this working on NumPy versions where inverse has the shape of B
print(uniqB, numpy.bincount(inverse.ravel(), weights=A.ravel()))
# [ 1. 2. 3.] [ 19. 14. 8.]
[(val, np.sum(A[B==val])) for val in np.unique(B)] gives you a list of tuples where the first element is one of the unique values in B, and the second element is the sum of elements in A where the corresponding value in B is that value.
>>> [(val, np.sum(A[B==val])) for val in np.unique(B)]
[(1.0, 19.0), (2.0, 14.0), (3.0, 8.0)]
The key is that you can use A[B==val] to access items in A at positions where B equals val.
Edit: If you just want the sums, just do [np.sum(A[B==val]) for val in np.unique(B)].
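To see what the boolean mask does step by step, here is a small sketch using the arrays from the question. The `sums` dictionary is just one convenient way to collect the results and is not part of the answer above.

```python
import numpy as np

A = np.array([[5., 5., 5.],
              [8., 9., 9.]])
B = np.array([[1., 1., 2.],
              [3., 2., 1.]])

mask = (B == 1)     # boolean array with the same shape as B
selected = A[mask]  # 1-D array of the elements of A where mask is True
print(mask)
print(selected)

# One sum per unique value in B, keyed by that value
sums = {float(val): float(A[B == val].sum()) for val in np.unique(B)}
print(sums)
```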
I'd use numpy masked arrays. These are standard numpy arrays with a mask associated with them blocking off certain values. The process is pretty straightforward: create a masked array using
numpy.ma.masked_array(data, mask)
where mask is generated by using a masked function
mask = numpy.ma.masked_not_equal(B, 1).mask
and data is A
for i in numpy.unique(B):
    print(numpy.ma.masked_array(A, numpy.ma.masked_not_equal(B, i).mask).sum())
19.0
14.0
8.0
I found an old question here.
One of the answers was:
def sum_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    values.cumsum(out=values)
    index = np.ones(len(groups), 'bool')
    index[:-1] = groups[1:] != groups[:-1]
    values = values[index]
    groups = groups[index]
    values[1:] = values[1:] - values[:-1]
    return values, groups
In your case, you can flatten your arrays:
aflat = A.flatten()
bflat = B.flatten()
sum_by_group(aflat, bflat)
I want to center multi-dimensional data in an n x m matrix (<class 'numpy.matrixlib.defmatrix.matrix'>), let's say X. I defined a new array ones(645), let's say centVector, to hold the mean of every row in matrix X. Now I want to iterate over every row in X, compute the mean and assign this value to the corresponding index in centVector. Isn't this possible in a single line in scipy/numpy? I am not used to this language and am thinking of something like:
centVector = ones(645)
for key, val in X:
    centVector[key] = centVector[key] * (val.sum/val.size)
Afterwards I just need to subtract the mean in every Row:
X = X - centVector
How can I simplify this?
EDIT: And besides, the above code is not actually working - for a key-value loop I need something like enumerate(X). And I am not sure if X - centVector is returning the proper solution.
First, some example data:
>>> import numpy as np
>>> X = np.matrix(np.arange(25).reshape((5,5)))
>>> print(X)
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
numpy conveniently has a mean function. By default however, it'll give you the mean over all the values in the array. Since you want the mean of each row, you need to specify the axis of the operation:
>>> np.mean(X, axis=1)
matrix([[ 2.],
[ 7.],
[ 12.],
[ 17.],
[ 22.]])
Note that axis=1 says: find the mean along the columns (for each row), where 0 = rows and 1 = columns (and so on). Now, you can subtract this mean from your X, as you did originally.
Unsolicited advice
Usually, it's best to avoid the matrix class (see docs). If you remove the np.matrix call from the example data, then you get a normal numpy array.
Unfortunately, in this particular case, using an array slightly complicates things because np.mean will return a 1D array:
>>> X = np.arange(25).reshape((5,5))
>>> r_means = np.mean(X, axis=1)
>>> print(r_means)
[ 2. 7. 12. 17. 22.]
If you try to subtract this from X, r_means gets broadcast to a row vector, instead of a column vector:
>>> X - r_means
array([[ -2., -6., -10., -14., -18.],
[ 3., -1., -5., -9., -13.],
[ 8., 4., 0., -4., -8.],
[ 13., 9., 5., 1., -3.],
[ 18., 14., 10., 6., 2.]])
So, you'll have to reshape the 1D array into an N x 1 column vector:
>>> X - r_means.reshape((-1, 1))
array([[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.]])
The -1 passed to reshape tells numpy to figure out this dimension based on the original array shape and the rest of the dimensions of the new array. Alternatively, you could have reshaped the array using r_means[:, np.newaxis].
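The two reshaping options can be compared side by side, together with a third one that the answer does not mention: np.mean accepts keepdims=True, which keeps the reduced axis as length 1 so the result already has column shape. The sketch below uses a float copy of the example data and made-up variable names.

```python
import numpy as np

X = np.arange(25, dtype=float).reshape((5, 5))

# Three ways to get an (N, 1) column of row means for broadcasting
centered_reshape = X - X.mean(axis=1).reshape((-1, 1))
centered_newaxis = X - X.mean(axis=1)[:, np.newaxis]
centered_keepdims = X - X.mean(axis=1, keepdims=True)  # not in the answer above

# All three agree, and every row of the result has mean zero
print(np.array_equal(centered_reshape, centered_newaxis))
print(np.array_equal(centered_reshape, centered_keepdims))
print(centered_keepdims.mean(axis=1))
```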