Related
I'm writing a library that uses NumPy arrays and I have a scalar operation I would like to perform on any dtype. This works fine for most structured arrays, however I run into a problem when creating structured arrays with multiple dimensions for structured elements. As an example,
x = np.zeros(10, np.dtype('3float32,int8'))
print(x.dtype)
print(x.shape)
shows
[('f0', '<f4', (3,)), ('f1', 'i1')]
(10,)
but
x = np.zeros(10, np.dtype('3float32'))
print(x.dtype)
print(x.shape)
yields
float32
(10, 3)
that is, creating a structured array with a single multidimensional field appears to instead expand the array shape. This means that the number of dimensions for the last example is 2, not 1 as I was expecting. Is there anything I'm missing here, or a known workaround?
Use the same dtype notation as displayed in the first working example:
In [92]: x = np.zeros(3, np.dtype([('f0','<f4',(3,))]))
In [93]: x
Out[93]:
array([([0., 0., 0.],), ([0., 0., 0.],), ([0., 0., 0.],)],
dtype=[('f0', '<f4', (3,))])
I don't normally use the string shorthand,
In [99]: np.dtype('3float32')
Out[99]: dtype(('<f4', (3,))) # no field name assigned
In [100]: np.zeros(3,_)
Out[100]:
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]], dtype=float32)
A couple of comma separated strings creates named fields:
In [102]: np.dtype('3float32,i4')
Out[102]: dtype([('f0', '<f4', (3,)), ('f1', '<i4')])
I have two numpy arrays containing integers which I'm comparing with numpy.testing.assert_array_equal. The arrays are "equal enough", i.e. a few elements differ but given the size of my arrays, that's OK (in this specific case). But of course the test fails:
AssertionError:
Arrays are not equal
(mismatch 0.0010541406645359075%)
x: array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],...
y: array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],...
----------------------------------------------------------------------
Ran 1 test in 0.658s
FAILED (failures=1)
Of course one might argue that the (long-term) clean solution to this would be to adapt the reference solution or whatnot, but what I'd prefer is to simply allow for some mismatch without the test failing. I would have hoped for assert_array_equal to have an option for this, but this is not the case.
I've written a function which allows me to do exactly what I want, so the problem might be considered solved, but I'm just wondering whether there is a better, more elegant way to do this. Also, the approach of parsing the error string feels pretty hacky, but I haven't found a better way to get the mismatch percentage value.
def assert_array_equal_tolerant(arr1,arr2,threshold):
"""Compare equality of two arrays while allowing a certain mismatch.
Arguments:
- arr1, arr2: Arrays to compare.
- threshold: Mismatch (in percent) above which the test fails.
"""
try:
np.testing.assert_array_equal(arr1,arr2)
except AssertionError as e:
for arg in e.args[0].split("\n"):
match = re.search(r'mismatch ([0-9.]+)%',arg)
if match:
mismatch = float(match.group(1))
break
else:
raise
if mismatch > threshold:
raise
Just to be clear: I'm not talking about assert_array_almost_equal, and using it is also not feasible, because the errors are not small, they might be huge for a single element, but are confined to a very small number of elements.
You could try (if they are integers) to check for the number of elements that are not equal without regular expressions
unequal_pos = np.where(arr1 != arr2)
len(unequal_pos[0]) # gives you the number of elements that are not equal.
I don't know if you consider this more elegant.
Since the result of np.where can be used as index you can get the elements that do not match with
arr1[unequal_pos]
So you can do pretty much every test you like with that result. Depends on how you want to define the mismatch either by number of different elements or difference between the elements or something even fancier.
Here's a crude comparison, but it seems to be in the spirit of what numpy.testing.assert_array_equal does:
In [71]: x=np.arange(100).reshape(10,10)
In [72]: y=np.arange(100).reshape(10,10)
In [73]: y[(5,7),(3,5)]=(3,5)
In [74]: np.sum(np.abs(x-y)>1)
Out[74]: 2
In [80]: np.sum(x!=y)
Out[80]: 2
count_nonzero is a faster counter (because it is used frequently in other numpy code to allocate space)
In [90]: np.count_nonzero(x!=y)
Out[90]: 2
The function that you are using does:
assert_array_compare(operator.__eq__, x, y, err_msg=err_msg)
np.testing.utils.assert_array_compare is a longish function, but most of it has to do with testing shape, and handling nan and inf. Otherwise it comes down to doing
x==y
and doing a count on the number of mismatches, and generating the err_msg. Note that the err_msg can be customized, so parsing it could simplified.
If you know the shapes match, and you aren't worried about nan like values, then just filtering the numeric difference should work just fine.
I need to write a programm that collects different datasets and unites them. For this I have to read in a comma seperated matrix: In this case each row represents an instance (in this case proteins), each column represents an attribute of the instances. If an instance has an attribute, it is represented by a 1, otherwise 0. The matrix looks like the example given below, but much larger, with 35000 instances and hundreds of attributes.
Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
I need a way to store the matrix before writing into a new file with other information about the instances. I thought of using numpy arrays, since i would like to be able to select and check single columns. I tried to use numpy.empty to create the array of the given size, but it seems that you have to preselect the lengh of the strings and cannot change them afterwards.
Is there a better way to deal with such data? I also thought of dictionarys of lists but then iI cannot select single columns.
You can use numpy.loadtxt, for example:
import numpy as np
a = np.loadtxt(filename, delimiter=',',usecols=(1,2,3,4),
skiprows=1, dtype=float)
Which will result in something like:
#array([[ 1., 1., 1., 0.],
# [ 0., 1., 0., 1.],
# [ 1., 0., 0., 0.],
# [ 1., 1., 1., 0.],
# [ 0., 0., 0., 0.],
# [ 1., 1., 1., 1.]])
Or, using structured arrays (`np.recarray'):
a = np.loadtxt('stack.txt', delimiter=',',usecols=(1,2,3,4),
skiprows=1, dtype=[('Attribute 1', float),
('Attribute 2', float),
('Attribute 3', float),
('Attribute 4', float)])
from where you can get each field like:
a['Attribute 1']
#array([ 1., 0., 1., 1., 0., 1.])
Take a look at pandas.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
You could use genfromtxt instead:
data = np.genfromtxt('file.txt', dtype=None)
This will create a structured array (aka record array) of your table.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
What's the simplest way to extend a numpy array in 2 dimensions?
I've been frustrated as a Matlab user switching over to python because I don't know all the tricks and get stuck hacking together code until it works. Below is an example where I have a matrix that I want to add a dummy column to. Surely, there is a simpler way then the zip vstack zip method below. It works, but it is totally a noob attempt. Please enlighten me. Thank you in advance for taking the time for this tutorial.
# BEGIN CODE
from pylab import *
# Find that unlike most things in python i must build a dummy matrix to
# add stuff in a for loop.
H = ones((4,10-1))
print "shape(H):"
print shape(H)
print H
### enter for loop to populate dummy matrix with interesting data...
# stuff happens in the for loop, which is awesome and off topic.
### exit for loop
# more awesome stuff happens...
# Now I need a new column on H
H = zip(*vstack((zip(*H),ones(4)))) # THIS SEEMS LIKE THE DUMB WAY TO DO THIS...
print "shape(H):"
print shape(H)
print H
# in conclusion. I found a hack job solution to adding a column
# to a numpy matrix, but I'm not happy with it.
# Could someone educate me on the better way to do this?
# END CODE
Use np.column_stack:
In [12]: import numpy as np
In [13]: H = np.ones((4,10-1))
In [14]: x = np.ones(4)
In [15]: np.column_stack((H,x))
Out[15]:
array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [16]: np.column_stack((H,x)).shape
Out[16]: (4, 10)
There are several functions that let you concatenate arrays in different dimensions:
np.vstack along axis=0
np.hstack along axis=1
np.dstack along axis=2
In your case, the np.hstack looks what you want. np.column_stack stacks a set 1D arrays as a 2D array, but you have already a 2D array to start with.
Of course, nothing prevents you to do it the hard way:
>>> new = np.empty((a.shape[0], a.shape[1]+1), dtype=a.dtype)
>>> new.T[:a.shape[1]] = a.T
Here, we created an empty array with an extra column, then used some tricks to set the first columns to a (using the transpose operator T, so that new.T has an extra row compared to a.T...)
Suppose I have a NxN matrix M (lil_matrix or csr_matrix) from scipy.sparse, and I want to make it (N+1)xN where M_modified[i,j] = M[i,j] for 0 <= i < N (and all j) and M[N,j] = 0 for all j. Basically, I want to add a row of zeros to the bottom of M and preserve the remainder of the matrix. Is there a way to do this without copying the data?
Scipy doesn't have a way to do this without copying the data but you can do it yourself by changing the attributes that define the sparse matrix.
There are 4 attributes that make up the csr_matrix:
data: An array containing the actual values in the matrix
indices: An array containing the column index corresponding to each value in data
indptr: An array that specifies the index before the first value in data for each row. If the row is empty then the index is the same as the previous column.
shape: A tuple containing the shape of the matrix
If you are simply adding a row of zeros to the bottom all you have to do is change the shape and indptr for your matrix.
x = np.ones((3,5))
x = csr_matrix(x)
x.toarray()
>> array([[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.]])
# reshape is not implemented for csr_matrix but you can cheat and do it yourself.
x._shape = (4,5)
# Update indptr to let it know we added a row with nothing in it. So just append the last
# value in indptr to the end.
# note that you are still copying the indptr array
x.indptr = np.hstack((x.indptr,x.indptr[-1]))
x.toarray()
array([[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 0., 0., 0., 0., 0.]])
Here is a function to handle the more general case of vstacking any 2 csr_matrices. You still end up copying the underlying numpy arrays but it is still significantly faster than the scipy vstack method.
def csr_vappend(a,b):
""" Takes in 2 csr_matrices and appends the second one to the bottom of the first one.
Much faster than scipy.sparse.vstack but assumes the type to be csr and overwrites
the first matrix instead of copying it. The data, indices, and indptr still get copied."""
a.data = np.hstack((a.data,b.data))
a.indices = np.hstack((a.indices,b.indices))
a.indptr = np.hstack((a.indptr,(b.indptr + a.nnz)[1:]))
a._shape = (a.shape[0]+b.shape[0],b.shape[1])
return a
Not sure if you're still looking for a solution, but maybe others can look into hstack and vstack - http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html. I think we can define a csr_matrix for the single additional row and then vstack it with the previous matrix.
I don't think that there is any way to really escape from doing the copying. Both of those types of sparse matrices store their data as Numpy arrays (in the data and indices attributes for csr and in the data and rows attributes for lil) internally and Numpy arrays can't be extended.
Update with more information:
LIL does stand for LInked List, but the current implementation doesn't quite live up to the name. The Numpy arrays used for data and rows are both of type object. Each of the objects in these arrays are actually Python lists (an empty list when all values are zero in a row). Python lists aren't exactly linked lists, but they are kind of close and quite frankly a better choice due to O(1) look-up. Personally, I don't immediately see the point of using a Numpy array of objects here rather than just a Python list. You could fairly easily change the current lil implementation to use Python lists instead which would allow you to add a row without copying the whole matrix.