strange behaviour of numpy masked array - python

I have trouble understanding the behaviour of numpy masked arrays.
Here is the snippet that puzzles me for two reasons:
arr = numpy.ma.array([(1,2),(3,4)],dtype=[("toto","int"),("titi","int")])
arr[0][0] = numpy.ma.masked
When I do this, nothing happens: no mask is applied to element [0][0].
If I change the data to [[1,2],[3,4]] (instead of [(1,2),(3,4)]), I get the following error:
TypeError: expected a readable buffer object
It seems that I completely misunderstood how to set up (and use) a masked array.
Could you tell me what is wrong with this code?
thanks
EDIT: without specifying the dtypes, it works as expected

The purpose of a masked array is to mark some elements of the array as invalid, i.e. masked, so that operations on the array ignore them.
For example, you have an array:
a = np.array([[2, 1000], [3, 1000]])
And you want to ignore any operations with the elements >100. You create a masked array like:
b = np.ma.array(a, mask=(a>100))
You can perform some operations in both arrays to see the differences:
a.sum()
# 2005
b.sum()
# 5
a.prod()
# 6000000
b.prod()
# 6
As you see, the masked items are ignored...
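To make the masking concrete, here is a minimal sketch on a plain (non-structured) integer array. The names masked_count, filled and plain are only illustrative. It shows where the mask lives, how to substitute a fill value, and that assigning np.ma.masked to a single element does work when no structured dtype is involved, which is consistent with the EDIT in the question.

```python
import numpy as np

a = np.array([[2, 1000], [3, 1000]])
b = np.ma.array(a, mask=(a > 100))

# The mask is a boolean array of the same shape; True means "ignored".
masked_count = int(b.mask.sum())

# filled() returns a regular ndarray with masked entries replaced.
filled = b.filled(0)

# Masking one element of a plain masked array (no structured dtype).
plain = np.ma.array([[1, 2], [3, 4]])
plain[0, 0] = np.ma.masked
remaining_sum = int(plain.sum())  # sums only the unmasked entries
```

Note the single index plain[0, 0] rather than the chained plain[0][0] used in the question.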

Related

Reinterpret data in numpy ndarray

I have a numpy array with dtype=uint8 and shape=(N,4) and I want to reinterpret the 4 bytes along the axis=1 efficiently as dtype=int32 and get a resulting shape=(N,) but nothing I've tried works. The equivalent in c would be brutally casting the pointer of the array.
The initial array is created like this from a pandas dataframe:
tmp=df[['data_1','data_2','data_3','data_4']].values.astype('uint8')
This works, but it's not vectorized:
tmp1=np.empty((tmp.shape[0],),dtype=np.int32)
for i in range(tmp.shape[0]):
    tmp2=tmp[i].copy()
    tmp1[i]=tmp2.view('<i4')
And this, which I understand as the efficient way to do it, doesn't:
tmp1=tmp.view('<i4')
Giving the error:
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.
But the size should be correct as far as I understand.
edit: added the reinterpreted explanation
Assuming you actually want the output shape to be (N*4,) (not (N,) as you wrote initially), you can just flatten it and then cast it to your desired type:
tmp1 = tmp.flatten().astype('int32', copy=False)
EDIT:
If you actually want the same underlying data to be interpreted as a different type and get a (N,) array out, the view method is in fact the way to go. This for example works for me:
import numpy as np
N = 5
a = np.arange(N*4, dtype='uint8').reshape((N,4))
a.view('int32')[:,0]
That view is then array([ 50462976, 117835012, 185207048, 252579084, 319951120], dtype=int32).
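One possible explanation for the ValueError in the question (an assumption, since the DataFrame itself is not shown) is that tmp is not C-contiguous, so the 4 bytes of a row are not adjacent in memory. The sketch below fakes that situation with a Fortran-ordered stand-in array and copies it into C order before taking the view. The names contig and as_int32 are illustrative.

```python
import numpy as np

N = 5
# Stand-in for the DataFrame-derived array: same values, Fortran order.
tmp = np.asfortranarray(np.arange(N * 4, dtype='uint8').reshape(N, 4))

# Copy into C order so each row's 4 bytes are adjacent in memory.
contig = np.ascontiguousarray(tmp)

# Reinterpret every 4-byte row as one little-endian int32, then drop
# the leftover length-1 axis to get shape (N,).
as_int32 = contig.view('<i4').reshape(-1)
```

The copy made by np.ascontiguousarray costs one pass over the data, but the reinterpretation itself stays vectorized.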

Keep Numpy Arrays 2D

I'm doing a lot of vector algebra and want to use numpy arrays to remove any need for loops and run faster.
What I've found is that if I have a matrix A of size [N,P] I constantly need to use np.array([A[:,0]]).T to force A[:,0] to be a column vector of size (N,1)
Is there a way to keep the single row or column of a 2D array as a 2D array because it makes the following arithmetic sooo much easier. For example, I often have to multiply a column vector (from a matrix) with a row vector (also created from a matrix) to create a new matrix: eg
C = A[:,i] * B[j,:]
it'd be great if I didn't have to keep using:
C = np.array([A[:,i]]).T * np.array([B[j,:]])
It really obfuscates the code - in MATLAB it'd simply be C = A(:,i) * B(j,:) which is easier to read and compare with the underlying mathematics, especially if there's a lot of terms like this in the same line, but unfortunately most of my colleagues don't have MATLAB licenses.
Note this isn't the only use case, so a specific function for this column x row operation isn't too helpful
Even MATLAB/Octave squeezes out excess dimensions:
>> ones(2,3,4)(:,:,1)
ans =
1 1 1
1 1 1
>> size(ones(2,3,4)(1,:)) # some indexing "flattens" outer dims
ans =
1 12
When I started with MATLAB (v3.5), the 2d matrix was all it had; cells, structs and higher dimensions were later additions (as demonstrated by the above examples).
Your:
In [760]: A=np.arange(6).reshape(2,3)
In [762]: np.array([A[:,0]]).T
Out[762]:
array([[0],
       [3]])
is more convoluted than needed. It makes a list, then a (1,N) array from that, and finally a (N,1) array by transposing.
A[:,[0]], A[:,0,None], A[:,0:1] are more direct. Even A[:,0].reshape(-1,1)
I can't think of something simple that treats a scalar and list index the same.
Functions like np.atleast_2d can conditionally add a new dimension, but it will be a leading (outer) one. But by the rules of broadcasting leading dimensions are usually 'automatic'.
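As a quick check of the alternatives above, this sketch builds the same column four ways and also shows where np.atleast_2d puts its new axis. The variable names are just for illustration.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)

col_list = A[:, [0]]                  # list index keeps the axis
col_slice = A[:, 0:1]                 # length-1 slice keeps the axis
col_newaxis = A[:, 0, None]           # scalar index, then a new trailing axis
col_reshape = A[:, 0].reshape(-1, 1)  # explicit reshape to a column

# np.atleast_2d adds the new axis in front, giving a row, not a column.
row = np.atleast_2d(A[:, 0])
```

All four col_* results have shape (2, 1), while row has shape (1, 2).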
basic v advanced indexing
In the underlying Python, scalars can't be indexed, and lists can only be indexed with scalars and slices. The underlying syntax allows indexing with tuples, but lists reject those. It's numpy that has extended the indexing considerably - not with syntax but with how it handles those tuples.
numpy indexing with slices and scalars is basic indexing. That's where the dimension loss can occur. That's consistent with list indexing
In [768]: [[1,2,3],[4,5,6]][1]
Out[768]: [4, 5, 6]
In [769]: np.array([[1,2,3],[4,5,6]])[1]
Out[769]: array([4, 5, 6])
Indexing with lists and arrays is advanced indexing, without any list counterparts. This is perhaps where the differences between MATLAB and numpy are ugliest :)
>> A([1,2],[1,2])
produces a (2,2) block. In numpy that produces a "diagonal"
In [781]: A[[0,1],[0,1]]
Out[781]: array([0, 4])
To get the block we have to use lists (or arrays) that "broadcast" against each other:
In [782]: A[[[0],[1]],[0,1]]
Out[782]:
array([[0, 1],
       [3, 4]])
To get the "diagonal" in MATLAB we have to use sub2ind([2,2],[1,2],[1,2]) to get the [1,4] flat indices.
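If writing the nested [[0],[1]] list by hand feels awkward, np.ix_ builds the broadcastable index arrays for you. A short sketch contrasting the "diagonal" and the block selection on the same A:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)

# Paired index lists pick elements (0,0) and (1,1): the "diagonal".
diagonal = A[[0, 1], [0, 1]]

# np.ix_ reshapes the lists so they broadcast into a (2, 2) block,
# which is what MATLAB's A([1,2],[1,2]) returns.
block = A[np.ix_([0, 1], [0, 1])]
```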
What kind of multiplication?
In
np.array([A[:,i]]).T * np.array([B[j,:]])
is this elementwise (.*) or matrix?
For a (N,1) and (1,M) pair, A*B and A@B produce the same (N,M) result, but one uses broadcasting to generalize the outer product, and the other is the inner/matrix product (with sum-of-products).
https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
Returns a matrix from an array-like object, or from a string of data. A matrix is a specialized 2-D array that retains its 2-D nature through operations. It has certain special operators, such as * (matrix multiplication) and ** (matrix power).
I'm not sure how to re-implement it though, it's an interesting exercise.
As mentioned, matrix will be deprecated. But with np.array, you can specify the minimum number of dimensions with the argument ndmin=2:
np.array([1, 2, 3], ndmin=2)
You can keep the dimension in the following way (using @ for matrix multiplication)
C = A[:,[i]] @ B[[j],:]
Notice the brackets around i and j, otherwise C won't be a 2-dimensional matrix.
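A small sketch to confirm that, for a column times a row, the matrix product, the broadcast elementwise product and np.outer all agree. The shapes of A and B and the indices i and j are arbitrary choices for the example.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
i, j = 1, 2

C_matmul = A[:, [i]] @ B[[j], :]      # (2,1) @ (1,4) matrix product
C_broadcast = A[:, [i]] * B[[j], :]   # (2,1) * (1,4) via broadcasting
C_outer = np.outer(A[:, i], B[j, :])  # outer product of two 1d slices
```

np.outer takes the plain 1d slices, so it is an option when you would rather not add brackets around every index.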

Why does insert and append for numpy ndarray return a new array instead of modifying the original array?

A numpy ndarray has no append or insert methods like the ones native python lists have.
a = np.array([1, 2, 3])
a.append(5) # this does not work
a = np.append(a, 5) # this is the only way
Whereas for native python lists,
a = [1, 2, 3]
a.append(4) # this modifies a
a # [1, 2, 3, 4]
Why was numpy ndarray designed to be this way? I'm writing a subclass of ndarray, is there any way of implementing "append" like native python lists?
NumPy makes heavy use of views, a feature that Python lists do not support. A view is an array that uses the memory of another object rather than owning its own memory; for example, in the following snippet
a = numpy.arange(5)
b = a[1:3]
b is a view of a.
Views would interact very poorly with an in-place append or other in-place size-changing operations. Arrays would suddenly not be views of arrays they should be views of, or they would be views of deallocated memory, or it would be unpredictable whether an append on one array would affect an array it was a view of, or all sorts of other problems. For example, what would a look like after b.append(6)? Or what would b look like after a.clear()? And what kind of performance guarantees could you make? Probably not the amortized constant time guarantee of list.append.
If you want to append, you probably shouldn't be using NumPy arrays; you should use a list, and build an array from the list when you're done appending.
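Both points can be seen in a few lines: a write through the view b shows up in a, and the suggested pattern is to grow a list and convert once at the end. The names shares, values and arr are illustrative.

```python
import numpy as np

a = np.arange(5)
b = a[1:3]          # b is a view: it uses a's memory

b[0] = 99           # writing through the view changes a as well
shares = np.shares_memory(a, b)

# Append to a list while the size is unknown, then build the array once.
values = []
for k in range(4):
    values.append(k * k)
arr = np.array(values)
```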
ndarray is created with a fixed size databuffer - just big enough to hold the bytes representing the elements.
arr.nbytes == arr.itemsize * arr.size
arr.resize can change the array inplace. But read its docs to see the limitations, especially about owning its own data. It's one of the few inplace operations, and not used that often.
In contrast a Python list stores object pointers in a buffer. The buffer has some growth room allowing for efficient append. It just has to add a new pointer to the buffer. When the buffer fills up, it allocates a new larger buffer and copies the pointers.
For a 1d array the buffers for ndarray and list will be similar in size, at least for 4- or 8-byte numeric dtypes. But for multidimensional arrays, the databuffer can be very large (the product of all dimensions times the item size), while the top buffer of an equivalent nested list just contains pointers to the outer layer of lists (the 'rows').
Object dtype arrays store pointers like a list, but the databuffer still has the fixed size (no growth space). Performance lies between numeric arrays and lists.
I can imagine writing an inplace append that uses the resize method, followed by copying the new value(s) to the 0 fills.
In [96]: arr = np.array([[1,3],[2,7]])
In [97]: arr.resize(3,2)
In [98]: arr
Out[98]:
array([[1, 3],
       [2, 7],
       [0, 0]])
In [99]: arr[-1,:] = 10,11
In [100]: arr
Out[100]:
array([[ 1,  3],
       [ 2,  7],
       [10, 11]])
But notice what happens to values when we resize an inner axis:
In [101]: arr = np.array([[1,3],[2,7]])
In [102]: arr.resize(2,3)
In [103]: arr
Out[103]:
array([[1, 3, 2],
       [7, 0, 0]])
So this kind of append is quite limited compared to concatenate (and all of its 'stack' derivatives).
Have you looked at the code for np.append? After making sure the arguments are arrays, and tweaking their shapes, it does:
concatenate((arr, values), axis=axis)
In other words, it is just an alternative way of calling concatenate. It's probably best for adding a single value to a 1d array. It shouldn't be used repeatedly in a loop, precisely because it returns a new array, and thus is relatively expensive. Beyond that, its use trips up many users. Some ignore the axis parameter. Others have problems creating a correct 'empty' array to start with. Concatenate also has those problems, but at least users have to consciously deal with the issue of matching shapes.
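A minimal sketch of the alternative to calling np.append in a loop: collect the pieces in a list and call concatenate once. It also confirms that np.append leaves its input untouched. The shapes and fill values are made up for the example.

```python
import numpy as np

# Collect row blocks in a list, then concatenate a single time.
pieces = []
for k in range(3):
    pieces.append(np.full((1, 2), k))
stacked = np.concatenate(pieces, axis=0)

# np.append returns a new array; the original is not modified.
original = np.array([1, 2, 3])
appended = np.append(original, 5)
```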
np.insert is much more complicated. It does different things depending on whether the indices (obj) is a number, slice or list of numbers. One approach is to create a target array of the right size, and copy slices from the original and insert values to the right slots. Another is to use a boolean mask to copy values into the right locations. Both have to accommodate multidimensions - it inserts along one axis, but must use the appropriate slice(None) for the other dimensions. This is much more complicated than the list insert, which inserts one object (pointer) at one location in 1d.
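The first approach described above (allocate a target of the right size, then copy slices around the new value) can be sketched for the simplest case: one value, one integer index, 1d input. insert_1d is a hypothetical helper written for illustration, not NumPy's actual implementation, and it skips all the multidimensional handling mentioned above.

```python
import numpy as np

def insert_1d(arr, index, value):
    """Return a new 1d array with value placed before position index."""
    out = np.empty(arr.size + 1, dtype=arr.dtype)
    out[:index] = arr[:index]        # elements before the insertion point
    out[index] = value               # the new element
    out[index + 1:] = arr[index:]    # elements after it, shifted by one
    return out

source = np.array([1, 2, 3, 4])
result = insert_1d(source, 2, 9)
reference = np.insert(source, 2, 9)
```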

What is the exact meaning of multi-dimensional array for numpy?

Can someone tell me why a works while b does not with ValueError: setting an array element with a sequence? This says the "multi-dimensional" reason, but in my case, I think a and b are the same.
import numpy as np
a=np.array([[1],2,3])
b=np.array([1,2,[3]])
Numpy is observing the first element to see what dtype the array is going to have. For a it sees a list and therefore produces an object array. It happily moves on to fill in the rest of the elements into the object array. For b, it sees a numeric value and assumes it's going to be some numeric dtype. Then it borks when it gets to a list.
You can override this by stating object dtype in the first place
a=np.array([[1],2,3])
b=np.array([1,2,[3]], 'object')
print(a, b, sep='\n\n')
[list([1]) 2 3]
[1 2 list([3])]
Mind you, that may not be exactly how Numpy is identifying dtype but it's got to be pretty close.
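Since the exact detection rule is uncertain, the safest option is probably to request object dtype for both arrays explicitly. As far as I know, recent NumPy releases reject both a and b from the question unless you do this, so treat the sketch below as the version-independent spelling.

```python
import numpy as np

# Explicit object dtype: each top-level item is stored as a Python object.
a = np.array([[1], 2, 3], dtype=object)
b = np.array([1, 2, [3]], dtype=object)

first_of_a = a[0]   # the list [1], stored as-is
last_of_b = b[2]    # the list [3], stored as-is
```

Both results are 1d arrays of shape (3,) holding references to ordinary Python objects, so numeric operations on them are not vectorized in the usual way.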

How do I declare an array in Python?

variable = []
Now variable refers to an empty list*.
Of course this is an assignment, not a declaration. There's no way to say in Python "this variable should never refer to anything other than a list", since Python is dynamically typed.
*The default built-in Python type is called a list, not an array. It is an ordered container of arbitrary length that can hold a heterogeneous collection of objects (their types do not matter and can be freely mixed). This should not be confused with the array module, which offers a type closer to the C array type; the contents must be homogeneous (all of the same type), but the length is still dynamic.
This is a surprisingly complex topic in Python.
Practical answer
Arrays are represented by class list (see reference and do not mix them with generators).
Check out usage examples:
# empty array
arr = []
# init with values (can contain mixed types)
arr = [1, "eels"]
# get item by index (can be negative to access end of array)
arr = [1, 2, 3, 4, 5, 6]
arr[0] # 1
arr[-1] # 6
# get length
length = len(arr)
# supports append and insert
arr.append(8)
arr.insert(6, 7)
Theoretical answer
Under the hood Python's list is a wrapper for a real array which contains references to items. Also, underlying array is created with some extra space.
Consequences of this are:
random access is really cheap (arr[6653] costs the same as arr[0])
append operation is 'for free' as long as some extra space remains
insert operation is expensive, because the following items have to be shifted
Check this awesome table of operations complexity.
Also, please see this picture, where I've tried to show most important differences between array, array of references and linked list:
You don't actually declare things, but this is how you create an array in Python:
from array import array
intarray = array('i')
For more info see the array module: http://docs.python.org/library/array.html
Now possibly you don't want an array, but a list, but others have answered that already. :)
I think you want a list with the first 30 cells already filled.
So
f = []
for i in range(30):
    f.append(0)
An example to where this could be used is in Fibonacci sequence.
See problem 2 in Project Euler
This is how:
my_array = [1, 'rebecca', 'allard', 15]
For calculations, use numpy arrays like this:
import numpy as np
a = np.ones((3,2)) # a 2D array with 3 rows, 2 columns, filled with ones
b = np.array([1,2,3]) # a 1D array initialised using a list [1,2,3]
c = np.linspace(2,3,100) # an array with 100 points between (and including) 2 and 3
print(a*1.5) # all elements of a times 1.5
print(a.T+b) # b added to the transpose of a
these numpy arrays can be saved and loaded from disk (even compressed) and complex calculations with large amounts of elements are C-like fast.
Much used in scientific environments. See here for more.
JohnMachin's comment should be the real answer.
All the other answers are just workarounds in my opinion!
So:
array=[0]*element_count
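One caveat worth knowing with this idiom: it is fine for immutable items such as 0, but repeating a list of lists repeats references to the same inner list. A short sketch, with element_count set to an arbitrary value:

```python
element_count = 5
flat = [0] * element_count          # five independent zeros

# Repeating a nested list copies the reference, not the inner list.
shared_rows = [[0] * 3] * 2
shared_rows[0][0] = 1               # also visible through shared_rows[1]

# A comprehension builds a separate inner list for each row.
independent_rows = [[0] * 3 for _ in range(2)]
independent_rows[0][0] = 1
```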
A couple of contributions suggested that arrays in python are represented by lists. This is incorrect. Python has an independent implementation of array() in the standard library module array "array.array()" hence it is incorrect to confuse the two. Lists are lists in python so be careful with the nomenclature used.
list_01 = [4, 6.2, 7-2j, 'flo', 'cro']
list_01
Out[85]: [4, 6.2, (7-2j), 'flo', 'cro']
There is one very important difference between list and array.array(). While both of these objects are ordered sequences, array.array() is an ordered homogeneous sequence whereas a list is a non-homogeneous sequence.
You don't declare anything in Python. You just use it. I recommend you start out with something like http://diveintopython.net.
I would normally just do a = [1,2,3] which is actually a list but for arrays look at this formal definition
To add to Lennart's answer, an array may be created like this:
from array import array
float_array = array("f",values)
where values can take the form of a tuple, list, or np.array, but not array:
values = [1,2,3]
values = (1,2,3)
values = np.array([1,2,3],'f')
# 'i' will work here too, but if array is 'i' then values have to be int
wrong_values = array('f',[1,2,3])
# TypeError: 'array.array' object is not callable
and the output will still be the same:
print(float_array)
print(float_array[1])
print(isinstance(float_array[1],float))
# array('f', [1.0, 2.0, 3.0])
# 2.0
# True
Most methods for list work with array as well, common
ones being pop(), extend(), and append().
Judging from the answers and comments, it appears that the array
data structure isn't that popular. I like it though, the same
way as one might prefer a tuple over a list.
The array structure has stricter rules than a list or np.array, and this can
reduce errors and make debugging easier, especially when working with numerical
data.
Attempts to insert/append a float to an int array will throw a TypeError:
values = [1,2,3]
int_array = array("i",values)
int_array.append(float(1))
# or int_array.extend([float(1)])
# TypeError: integer argument expected, got float
Keeping values which are meant to be integers (e.g. list of indices) in the array
form may therefore prevent a "TypeError: list indices must be integers, not float", since arrays can be iterated over, similar to np.array and lists:
int_array = array('i',[1,2,3])
data = [11,22,33,44,55]
sample = []
for i in int_array:
    sample.append(data[i])
Annoyingly, appending an int to a float array will cause the int to become a float, without throwing an exception.
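Both behaviours side by side, as a minimal Python 3 sketch. The flag name raised is only for illustration, and the exact TypeError message differs between Python versions, so it is not checked here.

```python
from array import array

# An int appended to a float array is silently converted to float.
float_array = array('f', [1, 2, 3])
float_array.append(4)
last_item = float_array[3]

# A float appended to an int array is rejected.
int_array = array('i', [1, 2, 3])
try:
    int_array.append(1.5)
    raised = False
except TypeError:
    raised = True
```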
np.array retains the same data type for its entries too, but instead of giving an error it will change its data type to fit new entries (usually to double or str):
import numpy as np
numpy_int_array = np.array([1,2,3],'i')
for i in numpy_int_array:
    print(type(i))
# <class 'numpy.int32'>
numpy_int_array_2 = np.append(numpy_int_array,int(1))
# still <class 'numpy.int32'>
numpy_float_array = np.append(numpy_int_array,float(1))
# <class 'numpy.float64'> for all values
numpy_str_array = np.append(numpy_int_array,"1")
# <class 'numpy.str_'> for all values
data = [11,22,33,44,55]
sample = []
for i in numpy_int_array_2:
    sample.append(data[i])
# no problem here, but TypeError for the other two
This is true during assignment as well. If the data type is specified, np.array will, wherever possible, transform the entries to that data type:
int_numpy_array = np.array([1,2,float(3)],'i')
# 3 becomes an int
int_numpy_array_2 = np.array([1,2,3.9],'i')
# 3.9 gets truncated to 3 (same as int(3.9))
invalid_array = np.array([1,2,"string"],'i')
# ValueError: invalid literal for int() with base 10: 'string'
# Same error as int('string')
str_numpy_array = np.array([1,2,3],'str')
print(str_numpy_array)
print([type(i) for i in str_numpy_array])
# ['1' '2' '3']
# [<class 'numpy.str_'>, <class 'numpy.str_'>, <class 'numpy.str_'>]
or, in essence:
data = [1.2,3.4,5.6]
list_1 = np.array(data,'i').tolist()
list_2 = [int(i) for i in data]
print(list_1 == list_2)
# True
while array will simply give:
invalid_array = array('i',[1,2,3.9])
# TypeError: integer argument expected, got float
Because of this, it is not a good idea to use np.array for type-specific commands. The array structure is useful here. list preserves the data type of the values.
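If you do want NumPy to refuse a lossy conversion instead of truncating, one option (not mentioned above, so take it as a suggestion rather than the usual idiom) is the casting argument of ndarray.astype. A sketch:

```python
import numpy as np

data = np.array([1.2, 3.4, 5.6])

# Default casting is permissive: the fractional parts are dropped.
truncated = data.astype('i4')

# With casting='safe', float -> int is not allowed and raises TypeError.
try:
    data.astype('i4', casting='safe')
    rejected = False
except TypeError:
    rejected = True
```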
And for something I find rather pesky: the data type is specified as the first argument in array(), but (usually) the second in np.array(). :|
The relation to C is referred to here:
Python List vs. Array - when to use?
Have fun exploring!
Note: The typed and rather strict nature of array leans more towards C rather than Python, and by design Python does not have many type-specific constraints in its functions. Its unpopularity also creates a positive feedback in collaborative work, and replacing it mostly involves an additional [int(x) for x in file]. It is therefore entirely viable and reasonable to ignore the existence of array. It shouldn't hinder most of us in any way. :D
How about this...
>>> a = list(range(12))
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> a[7]
7
Following on from Lennart, there's also numpy which implements homogeneous multi-dimensional arrays.
Python calls them lists. You can write a list literal with square brackets and commas:
>>> [6,28,496,8128]
[6, 28, 496, 8128]
I had an array of strings and needed an array of booleans of the same length, initialized to True. This is what I did
strs = ["Hi","Bye"]
bools = [ True for s in strs ]
You can create lists and convert them into arrays, or you can create an array using the numpy module. Below are a few examples to illustrate this. Numpy also makes it easier to work with multi-dimensional arrays.
import numpy as np
a = np.array([1, 2, 3, 4])
#For custom inputs
a = np.array([int(x) for x in input().split()])
You can also reshape this array into a 2X2 matrix using reshape function which takes in input as the dimensions of the matrix.
mat = a.reshape(2, 2)
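Two details about reshape that are easy to miss, shown as a small sketch: one dimension can be given as -1 and is then inferred, and for a contiguous array like this one the result is a view, so writes to mat show up in a.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# -1 lets NumPy work out the second dimension from the array size.
mat = a.reshape(2, -1)

# mat shares memory with a here, so this write is visible in a too.
mat[0, 0] = 10
```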
# This creates a list of 5000 zeros
a = [0] * 5000
You can read and write to any element in this list with a[n] notation in the same way as you would with an array.
It does seem to have the same random access performance as an array. I cannot say how it allocates memory because it also supports a mix of different types including strings and objects if you need it to.
