Numpy - Different behavior for 1-d and 2-d arrays - python

I was reviewing some numpy code and came across this issue: numpy exhibits different behavior for a 1-d array and a 2-d array. In the first case it creates a reference (a view), while in the second it creates a deep copy.
Here's the code snippet:
import numpy as np
# Case 1: when using 1d-array
arr = np.array([1,2,3,4,5])
slice_arr = arr[:3] # taking first three elements, behaving like reference
slice_arr[2] = 100 # modifying the value
print(slice_arr)
print(arr) # here also the value gets changed
# Case 2: when using 2d-array
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
slice_arr = arr[:,[0,1]] # taking all rows and first two columns, behaving like deep copy
slice_arr[0,1] = 100 # modifying the value
print(slice_arr)
print() # newline for clarity
print(arr) # here the value doesn't change
Can anybody explain the reason for this behavior?

The reason is that you are not indexing in the same way; it's not about 1D vs 2D.
slice_arr = arr[:3]
Here you are using the slicing operator, so numpy can return a view of your original data.
slice_arr = arr[:,[0,1]]
Here you are indexing with a list of the columns you want. That is advanced (fancy) indexing, not a slice (even though it could be expressed as one), so numpy returns a copy.
Both of these are "getter" expressions, so they may return either a view or a copy. When the same expressions appear on the left-hand side of an assignment (as a "setter", e.g. arr[:,[0,1]] = 0), numpy always modifies the original array in place.
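A quick way to check which behavior you got is np.shares_memory, which reports whether two arrays overlap in memory. A minimal sketch using the arrays from the question:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

basic = arr[:3]          # basic slicing -> view
fancy = arr[[0, 1, 2]]   # advanced (list) indexing -> copy

print(np.shares_memory(arr, basic))  # True: the slice is a view
print(np.shares_memory(arr, fancy))  # False: fancy indexing copied the data
```

This avoids having to mutate an element just to see whether the change propagates.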

Related

Is there a way to not make a copy when a numpy array is sliced?

I need to handle some large numpy arrays in my project. After such an array is loaded from the disk, over half of my computer's memory will be consumed.
After the array is loaded, I make several slices of it (almost half of the array will be selected), and then I receive an error telling me the memory is insufficient.
By doing a little experiment I understand that I receive the error because when a numpy array is sliced this way, a copy is created:
import numpy as np
tmp = np.linspace(1, 100, 100)
inds = list(range(100))
tmp_slice = tmp[inds]
assert id(tmp) == id(tmp_slice)
This raises an AssertionError.
Is there a way that a slice of a numpy array only refers to the memory addresses of the original array thus data entries are not copied?
In Python, slice is a well-defined class with start, stop, and step values. It is used when we index a list with alist[1:10:2]; this makes a new list with copies of the pointers from the original. In numpy, slices are used in basic indexing, e.g. arr[:3, -3:]. This creates a view of the original: the view shares the data buffer but has its own shape and strides.
But when we index arrays with lists, arrays or boolean arrays (mask), it has to make a copy, an array with its own data buffer. The selection of elements is too complex or irregular to express in terms of the shape and strides attributes.
In some cases the index array is small (compared to the original) and copy is also small. But if we are permuting the whole array, then the index array, and copy will both be as large as the original.
Reading through this, this, and this I think your problem is in using advanced indexing, and to reiterate one of the answers -- the numpy docs clearly state that
Advanced indexing always returns a copy of the data (contrast with
basic slicing that returns a view).
So instead of doing:
inds = list(range(100))
tmp_slice = tmp[inds]
you should rather use:
tmp_slice = tmp[:100]
This will result in a view rather than a copy. You can notice the difference by trying:
tmp[0] = 5
In the first case tmp_slice[0] will still return 1.0 (it is a copy), but in the second it will return 5.0 (it is a view).

How to build a numpy array row by row in a for loop?

This is basically what I am trying to do:
array = np.array()  # initialize the array. This is where the error described below is thrown
for i in xrange(?):  # in the full version of this code, this loop goes through the length of a file. I won't know the length until I go through it. The point of the question is to see if you can build the array without knowing its exact size beforehand
    A = random.randint(0, 10)
    B = random.randint(0, 10)
    C = random.randint(0, 10)
    D = random.randint(0, 10)
    row = [A, B, C, D]
    array[i:] = row  # this is supposed to add a row to the array with A, B, C, D as column values
This code doesn't work. First, it complains: TypeError: Required argument 'object' (pos 1) not found -- but I don't know the final size of the array.
Second, I know the last line is incorrect, but I am not sure how to express this in python/numpy. So how can I do this?
A numpy array must be created with a fixed size. You can create a small one (e.g., one row) and then append rows one at a time, but that will be inefficient. There is no way to efficiently grow a numpy array gradually to an undetermined size. You need to decide ahead of time what size you want it to be, or accept that your code will be inefficient. Depending on the format of your data, you can possibly use something like numpy.loadtxt or various functions in pandas to read it in.
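A sketch of the "decide ahead of time" approach: over-allocate a fixed-size array, fill it row by row, and trim afterwards. The capacity of 1000 and the loop of 5 rows here are placeholder assumptions standing in for the unknown file length:

```python
import numpy as np
import random

capacity = 1000                # assumed upper bound; pick one that fits your data
array = np.empty((capacity, 4))

n_rows = 0
for _ in range(5):             # stand-in for "until the file ends"
    row = [random.randint(0, 10) for _ in range(4)]
    array[n_rows] = row
    n_rows += 1

array = array[:n_rows]         # trim the unused capacity
print(array.shape)             # (5, 4)
```

If the data can exceed the chosen capacity, you would need to reallocate (e.g. double the size and copy), which is exactly the bookkeeping a python list does for you.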
Use a list of 1D numpy arrays, or a list of lists, and then convert it to a numpy 2D array (or use more nesting and get more dimensions if you need to).
import numpy as np
a = []
for i in range(5):
    a.append(np.array([1, 2, 3]))  # or a.append([1, 2, 3])
a = np.asarray(a)  # a list of 1D arrays (or lists) becomes a 2D array
print(a.shape)
print(a)

numpy.ndarray sent as argument doesn't need loop for iteration?

In this code, np.linspace() assigns 200 evenly spaced numbers from -20 to 20 to inputs.
This function works. What I don't understand is how it could work: how can inputs be passed as an argument to output_function() without a loop to iterate over the numpy.ndarray?
import numpy as np
import matplotlib.pyplot as plt

def output_function(x):
    return 100 - x ** 2

inputs = np.linspace(-20, 20, 200)
plt.plot(inputs, output_function(inputs), 'b-')
plt.show()
numpy works by defining operations on vectors the way that you really want to work with them mathematically. So, I can do something like:
a = np.arange(10)
b = np.arange(10)
c = a + b
And it works as you might hope -- each element of a is added to the corresponding element of b and the result is stored in a new array c. If you want to know how numpy accomplishes this, it's all done via the magic methods in the python data model. Specifically in my example case, the __add__ method of numpy's ndarray would be overridden to provide the desired behavior.
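You can see the data-model dispatch directly: the + operator, the __add__ method, and the np.add ufunc all produce the same result. A minimal sketch:

```python
import numpy as np

a = np.arange(10)
b = np.arange(10)

# `a + b` dispatches to ndarray.__add__, which applies np.add element-wise:
c = a + b
d = a.__add__(b)
e = np.add(a, b)

print(np.array_equal(c, d) and np.array_equal(c, e))  # True
```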
If your function could not operate on whole arrays directly, you could use numpy.vectorize, which behaves similarly to the python builtin map.
Here is one way you can use numpy.vectorize:
outputs = (np.vectorize(output_function))(inputs)
(In this particular case it is unnecessary, since 100 - x ** 2 already works element-wise on an array.)
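Where np.vectorize genuinely helps is with a scalar function containing Python-level control flow, which does not broadcast on its own. A small sketch with a hypothetical clip_sign function:

```python
import numpy as np

def clip_sign(x):
    if x < 0:       # a plain `if` raises an error on a whole-array argument
        return -1
    return 1

vec_clip_sign = np.vectorize(clip_sign)
print(vec_clip_sign(np.array([-3, 0, 2])))  # [-1  1  1]
```

Note that np.vectorize is a convenience, not a performance tool: under the hood it still loops in Python.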
You asked why it works: it works because numpy arrays can perform operations on their elements en masse. For example:
a = np.array([1,2,3,4]) # a numpy array of 4 elements: [1, 2, 3, 4]
b = a - 1 # subtracts 1 from every element, giving [0, 1, 2, 3]
Because of this property, you can perform certain operations on every element of a numpy array very quickly without using a loop (as you would with a regular python list).

Is there any way to use the "out" argument of a Numpy function when modifying an array in place?

If I want to get the dot product of two arrays, I can get a performance boost by specifying an array to store the output in, instead of creating a new array each time (if I am performing this operation many times):
import numpy as np
a = np.array([[1.0,2.0],[3.0,4.0]])
b = np.array([[2.0,2.0],[2.0,2.0]])
out = np.empty([2,2])
np.dot(a,b, out = out)
Is there any way I can take advantage of this feature if I need to modify an array in place? For instance, if I want:
out = np.array([[3.0,3.0],[3.0,3.0]])
out *= np.dot(a,b)
Yes, you can use the out argument to modify an array in place, e.g. np.multiply(array, 3, out=array) with array = np.ones(10).
You can also use in-place operator syntax, e.g. array *= 2.
To confirm that the array was updated in place, check its memory address (array.ctypes.data) before and after the modification.
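Putting that together for the questioner's out *= np.dot(a, b) case, a minimal sketch that verifies the buffer is reused:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[2.0, 2.0], [2.0, 2.0]])
out = np.array([[3.0, 3.0], [3.0, 3.0]])

addr_before = out.ctypes.data
np.multiply(out, np.dot(a, b), out=out)   # equivalent to out *= np.dot(a, b)
addr_after = out.ctypes.data

print(addr_before == addr_after)  # True: the same buffer was reused
print(out)                        # [[18. 18.] [42. 42.]]
```

The np.dot(a, b) call still allocates a temporary; to avoid that too, you could keep a second preallocated buffer and pass it as np.dot's own out argument.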

Why does the Numpy diag function behave weirdly?

The result of the diag function does not seem to be an independent copy of the diagonal:
import numpy as np
A = np.random.rand(4, 4)
d = np.diag(A)
print(d)
# above gives the diagonal entries of A
# let us change one entry
A[0, 0] = 0
print(d)
# above gives the updated diagonal entries of A
Why does the diag function behave in this fashion?
np.diag returns a view of the original array, so later changes to the original array are reflected in the result. (The upside is that this is much faster than creating a copy.)
Note that this is version-dependent: in some versions of numpy, np.diag returns a copy instead.
To "freeze" the result, copy it: d = np.diag(A).copy()
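A small sketch with a deterministic array (so the values are predictable) showing the frozen copy versus re-reading the diagonal:

```python
import numpy as np

A = np.arange(16.0).reshape(4, 4)  # diagonal is [0, 5, 10, 15]
d_copy = np.diag(A).copy()         # .copy() detaches the result from A

A[1, 1] = 99
print(d_copy[1])      # 5.0  -- the copy is unaffected
print(np.diag(A)[1])  # 99.0 -- re-reading the diagonal sees the change
```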
