default values for numpy ndarray - python

I was working with numpy.ndarray and something interesting happened.
I created an array with the shape of (2, 2) and left everything else with the default values.
It created an array for me with these values:
array([[2.12199579e-314, 0.00000000e+000],
[5.35567160e-321, 7.72406468e-312]])
I created another array with the same default values and it also gave me the same result.
Then I created a new array (using the default values and the shape (2, 2)) and filled it with zeros using the 'fill' method.
The interesting part is that now whenever I create a new array with ndarray it gives me an array with 0 values.
So what is going on behind the scenes?

See https://numpy.org/doc/stable/reference/generated/numpy.empty.html#numpy.empty:
(Precisely as #Michael Butscher commented)
np.empty([2, 2]) creates an array without touching the contents of the memory chunk allocated for the array; thus, the array may look as if filled with some more or less random values.
np.ndarray([2, 2]) does the same.
Other creation methods, however, fill the memory with some values:
np.zeros([2, 2]) fills the memory with zeros,
np.full([2, 2], 9) fills the memory with nines, etc.
Now, if you create a new array via np.empty() after creating (and disposing of, i.e. automatically garbage collected) an array filled with e.g. ones, your new array may be allocated the same chunk of memory and thus look as if "filled" with ones.
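A minimal sketch of the difference (the values in the uninitialized arrays are arbitrary and will differ from run to run):
import numpy as np

# np.empty / np.ndarray: memory is allocated but not initialized,
# so the contents are whatever happened to be in that chunk of memory.
a = np.empty((2, 2))
b = np.ndarray((2, 2))

# np.zeros / np.full: memory is explicitly filled with a value.
c = np.zeros((2, 2))    # all 0.0
d = np.full((2, 2), 9)  # all 9

print(a)  # arbitrary values, do not rely on them
print(c)  # [[0. 0.] [0. 0.]]
print(d)  # [[9 9] [9 9]]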

np.empty explicitly says it returns:
Array of uninitialized (arbitrary) data of the given shape, dtype, and
order. Object arrays will be initialized to None.
It's compiled code so I can't say for sure, but I strongly suspect it just calls np.ndarray with shape and dtype.
ndarray describes itself as a low-level function, and lists many better alternatives.
In an ipython session I can make two arrays:
In [2]: arr = np.empty((2,2), dtype='int32'); arr
Out[2]:
array([[ 927000399, 1267404612],
[ 1828571807, -1590157072]])
In [3]: arr1 = np.ndarray((2,2), dtype='int32'); arr1
Out[3]:
array([[ 927000399, 1267404612],
[ 1828571807, -1590157072]])
The values are the same, but when I check the "location" of their data buffers, I see that they are different:
In [4]: arr.__array_interface__['data'][0]
Out[4]: 2213385069328
In [5]: arr1.__array_interface__['data'][0]
Out[5]: 2213385068176
We can't use that number in code to fiddle with the values, but it's useful as a human-readable indicator of where the data is stored. (Do you understand the basics of how arrays are stored, with shape, dtype, strides, and data-buffer?)
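If those basics are unfamiliar, here is a quick way to inspect them (the strides shown assume a C-ordered int32 array; the buffer address will differ on every run):
import numpy as np

arr = np.empty((2, 2), dtype='int32')

# An ndarray is a small wrapper (shape, dtype, strides) around a flat data buffer.
print(arr.shape)    # (2, 2)
print(arr.dtype)    # int32
print(arr.strides)  # (8, 4) - bytes to step per row and per element
print(arr.__array_interface__['data'][0])  # address of the data buffer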
Why the "uninitialized values" are the same is anyones guess; my guess it's just an artifact of the how that bit of memory was used before. np.empty stresses that we shouldn't place an significance to those values.
Doing the ndarray again, produces different values and location:
In [9]: arr1 = np.ndarray((2,2), dtype='int32'); arr1
Out[9]:
array([[1469865440, 515],
[ 0, 0]])
In [10]: arr1.__array_interface__['data'][0]
Out[10]: 2213403372816
apparent reuse
If I don't assign the array to a variable, or otherwise "hang on to it", numpy may reuse the data buffer memory:
In [17]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[17]: 2213403374512
In [18]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[18]: 2213403374512
In [19]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[19]: 2213403374512
In [20]: np.empty((2,2), dtype='int').__array_interface__['data'][0]
Out[20]: 2213403374512
Again, we shouldn't attach too much significance to this reuse, and certainly not count on it for any calculations.
object dtype
If we specify the object dtype, then the values are initialized to None. This dtype contains references/pointers to objects in memory, and "random" pointers wouldn't be safe.
In [14]: arr1 = np.ndarray((2,2), dtype='object'); arr1
Out[14]:
array([[None, None],
[None, None]], dtype=object)
In [15]: arr1 = np.ndarray((2,2), dtype='U3'); arr1
Out[15]:
array([['', ''],
['', '']], dtype='<U3')

Related

How to copy dtype when doing numpy array assignment or when appending to a numpy array

I'm pretty illiterate in using Python/numpy.
I have the following piece of code:
data = np.array([])
for i in range(10):
    data = np.append(data, GetData())
return data
GetData() returns a numpy array with a custom dtype. However, when executing the above piece of code, the numbers are converted to float64, which I suspect is the culprit for other issues I'm having. How can I copy/append the function's output while preserving the dtype as well?
Given the comments stating that you will only know the type of data once you run GetData(), and that multiple types are expected, you could do it like so:
# [...]
dataByType = {}  # dictionary to store the dtypes encountered and the arrays with given dtype
for i in range(10):
    newData = GetData()
    if newData.dtype not in dataByType:
        # If the dtype has not been encountered yet,
        # create an empty array with that dtype and store it in the dict
        dataByType[newData.dtype] = np.array([], dtype=newData.dtype)
    # Append the new data to the corresponding array in dict, depending on dtype
    dataByType[newData.dtype] = np.append(dataByType[newData.dtype], newData)
Taking into account hpaulj's answer, if you wish to preserve the different types you might encounter without creating a new array at each iteration, you can adapt the above to:
# [...]
dataByType = {}  # dictionary to store the dtypes encountered and the list storing data with given dtype
for i in range(10):
    newData = GetData()
    if newData.dtype not in dataByType:
        # If the dtype has not been encountered yet,
        # create an empty list for that dtype and store it in the dict
        dataByType[newData.dtype] = []
    # Append the new data to the corresponding list in dict, depending on dtype
    dataByType[newData.dtype].append(newData)
# At this point, you have all your data pieces stored according to their original dtype inside the dataByType dictionary.
# Now if you wish you can convert them to numpy arrays as well
# Either by concatenation, updating what is stored in the dict
for dataType in dataByType:
    dataByType[dataType] = np.concatenate(dataByType[dataType])
# No need to specify the dtype in concatenate here, since the previous step ensures all data pieces are the same type
# Or by creating an array directly, to store each data piece at a different index
for dataType in dataByType:
    dataByType[dataType] = np.array(dataByType[dataType])
# As for concatenate, no need to specify the dtype here
A little example:
import numpy as np
# to get something similar to GetData in the example structure:
getData = [
    np.array([1., 2.], dtype=np.float64),
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4], dtype=np.int64),
    np.array([3., 4.], dtype=np.float64)
]  # dtype specified here for clarity, but not needed
dataByType = {}
for i in range(len(getData)):
    newData = getData[i]
    if newData.dtype not in dataByType:
        dataByType[newData.dtype] = []
    dataByType[newData.dtype].append(newData)
print(dataByType)  # output formatted below for clarity
# {dtype('float64'):
# [array([1., 2.]), array([3., 4.])],
# dtype('int64'):
# [array([1, 2], dtype=int64), array([3, 4], dtype=int64)]}
Now if we use concatenate on that dataset, we get 1D arrays, preserving the original type (dtype=float64 is not shown in the output since it is the default type for floating point values):
for dataType in dataByType:
    dataByType[dataType] = np.concatenate(dataByType[dataType])
print(dataByType) # once again output formatted for clarity
# {dtype('float64'):
# array([1., 2., 3., 4.]),
# dtype('int64'):
# array([1, 2, 3, 4], dtype=int64)}
And if we use array, we get 2D arrays:
for dataType in dataByType:
    dataByType[dataType] = np.array(dataByType[dataType])
print(dataByType)
# {dtype('float64'):
# array([[1., 2.],
# [3., 4.]]),
# dtype('int64'):
# array([[1, 2],
# [3, 4]], dtype=int64)}
An important thing to note: using array will not work as intended if the arrays to combine don't all have the same shape:
import numpy as np
print(repr(np.array([
    np.array([1, 2, 3]),
    np.array([4, 5])
])))
# array([array([1, 2, 3]), array([4, 5])], dtype=object)
You get an array of dtype object, which are all in this case arrays of different lengths.
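Note that on recent NumPy versions (1.24 and later, as far as I know) this ragged construction raises an error instead of silently producing an object array, and on somewhat older versions it emits a VisibleDeprecationWarning; passing dtype=object explicitly keeps it working. A sketch of the explicit form:
import numpy as np

# Explicit dtype=object makes the ragged intent unambiguous.
ragged = np.array([np.array([1, 2, 3]), np.array([4, 5])], dtype=object)
print(ragged)        # [array([1, 2, 3]) array([4, 5])]
print(ragged.dtype)  # object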
Your use of [] and append indicates that you are naively copying that common list idiom:
alist = []
for x in another_list:
    alist.append(x)
Your data is not a clone of the [] list:
In [220]: np.array([])
Out[220]: array([], dtype=float64)
It's an array with shape (0,) and dtype float.
np.append is not a list append clone. I stress that because too many new users make that mistake, and the result is many different errors. It is really just a cover for np.concatenate, one that takes 2 arguments instead of a list of arguments. As the docs stress, it returns a new array, and when used iteratively, that means a lot of copying.
It is best to collect your arrays in a list, and give it to concatenate. List append is in-place, and better when done iteratively. If you give concatenate a list of arrays, the resulting dtype will be the common one (or whatever promotion requires). (New versions do let you specify dtype when calling concatenate.)
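A minimal sketch of that collect-then-concatenate pattern, assuming GetData() returns arrays of a single dtype (the GetData stand-in below is purely hypothetical):
import numpy as np

def GetData():
    # stand-in for the real GetData(); assumed to return an int64 array
    return np.array([1, 2, 3], dtype=np.int64)

chunks = []                    # plain list append is cheap and in-place
for i in range(10):
    chunks.append(GetData())

data = np.concatenate(chunks)  # one copy at the end, dtype preserved
print(data.dtype, data.shape)  # int64 (30,)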
Keep the numpy documentation at hand (python too if necessary), and look up functions. Pay attention to how they are called, including the keyword parameters. And practice with small examples. I keep an interactive python session at hand, even when writing answers.
When working with arrays, pay close attention to shape and dtype. Don't make assumptions.
concatenating 2 int arrays:
In [238]: np.concatenate((np.array([1,2]),np.array([4,3])))
Out[238]: array([1, 2, 4, 3])
making one a float array (just by adding a decimal point to one number):
In [239]: np.concatenate((np.array([1,2]),np.array([4,3.])))
Out[239]: array([1., 2., 4., 3.])
It won't let me change the result to int:
In [240]: np.concatenate((np.array([1,2]),np.array([4,3.])), dtype=int)
Traceback (most recent call last):
File "<ipython-input-240-91b4e3fec07a>", line 1, in <module>
np.concatenate((np.array([1,2]),np.array([4,3.])), dtype=int)
File "<__array_function__ internals>", line 180, in concatenate
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'same_kind'
If an element is a string, the result is also a string dtype:
In [241]: np.concatenate((np.array([1,2]),np.array(['4',3.])))
Out[241]: array(['1', '2', '4', '3.0'], dtype='<U32')
Sometimes it is necessary to adjust dtypes after a calculation:
In [243]: np.concatenate((np.array([1,2]),np.array(['4',3.]))).astype(float)
Out[243]: array([1., 2., 4., 3.])
In [244]: np.concatenate((np.array([1,2]),np.array(['4',3.]))).astype(float).astype(int)
Out[244]: array([1, 2, 4, 3])

How to apply np.ceil to a structured numpy array

I'm trying to use the np.ceil function on a structured numpy array, but all I get is the error message:
TypeError: ufunc 'ceil' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Here's a simple example of what that array would look like:
arr = np.array([(1.4,2.3), (3.2,4.1)], dtype=[("x", "<f8"), ("y", "<f8")])
When I try
np.ceil(arr)
I get the above mentioned error. When I just use one column, it works:
In [77]: np.ceil(arr["x"])
Out[77]: array([ 2., 4.])
But I need to get the entire array. Is there any way other than going column by column, or not using structured arrays altogether?
Here's a dirty solution based on viewing the array without its structure, taking the ceiling, and then converting it back to a structured array.
# sample array
arr = np.array([(1.4,2.3), (3.2,4.1)], dtype = [("x", "<f8"), ("y", "<f8")])
# remove struct and take the ceiling
arr1 = np.ceil(arr.view((float, len(arr.dtype.names))))
# coerce it back into the struct
arr = np.array(list(tuple(t) for t in arr1), dtype = arr.dtype)
# kill the intermediate copy
del arr1
And here it is as an unreadable one-liner, but without assigning the intermediate copy arr1:
arr = np.array(
    list(tuple(t) for t in np.ceil(arr.view((float, len(arr.dtype.names))))),
    dtype=arr.dtype
)
# array([(2., 3.), (4., 5.)], dtype=[('x', '<f8'), ('y', '<f8')])
I don't claim this is a great solution, but it should help you move on with your project until something better is proposed.
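If your NumPy is recent enough (1.16+), numpy.lib.recfunctions may offer a cleaner round trip, assuming all the fields share a compatible dtype; a sketch of that alternative:
import numpy as np
from numpy.lib import recfunctions as rfn

arr = np.array([(1.4, 2.3), (3.2, 4.1)], dtype=[("x", "<f8"), ("y", "<f8")])

# View the records as a plain 2D float array, apply the ufunc, convert back.
plain = rfn.structured_to_unstructured(arr)   # shape (2, 2), float64
ceiled = rfn.unstructured_to_structured(np.ceil(plain), dtype=arr.dtype)
print(ceiled)
# [(2., 3.) (4., 5.)]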

Numpy and Applying Method to a Column

I have a numpy array that contains objects.
For example my array is:
a = np.array({'a':1,'b':2},....,{'a':n,'b':n+1})
The data is not that important, but what I need to do is for each column call a property on that object.
Using my dictionary example, I want to call keys() to print out a list of keys on that row and return as a numpy array:
a[0].keys()
If I were using Pandas, I could leverage apply() on the column and use lambda functions to do this. For this case, I CANNOT use Pandas, so how can I do the same operation on a single numpy array column?
I tried using apply_along_axis but the lambda passes the arr as a whole not one row at a time, so I need to basically use a for loop inside my lambda to get my method.
np.apply_along_axis(lambda b: b.keys(), axis=0, arr=self.data)
The above code does not work! (I know this).
Is there a way to do a pandas.apply() using a numpy array?
The desired result in this case would be N row numpy array with lists of [a,b] in them.
An object array like this can be treated as a list:
In [110]: n=2;a = np.array(({'a':1,'b':2},{'a':n,'b':n+1}))
In [111]: a
Out[111]: array([{'a': 1, 'b': 2}, {'a': 2, 'b': 3}], dtype=object)
In [112]: [d.keys() for d in a]
Out[112]: [dict_keys(['a', 'b']), dict_keys(['a', 'b'])]
You could also use frompyfunc which will apply a function to all elements of an array (or broadcasted elements of several arrays)
In [114]: np.frompyfunc(lambda d:d.keys(),1,1)(a)
Out[114]: array([dict_keys(['a', 'b']), dict_keys(['a', 'b'])], dtype=object)
It returns an object array, which is fine in this case. np.vectorize uses this function as well, but takes an otypes parameter.
As a general rule, iterating on an object dtype array is faster than iterating on a numeric array (since all it has to do is return a pointer), but slower than the equivalent iteration on a list. Calculations on object dtype arrays are not as fast as the compiled numeric array calculations.
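For example, a sketch of the vectorize route, returning the keys as plain lists (the exact repr of the result may vary between NumPy versions):
import numpy as np

a = np.array([{'a': 1, 'b': 2}, {'a': 2, 'b': 3}], dtype=object)

# otypes=[object] tells vectorize the result dtype up front,
# so it doesn't have to probe the first element to infer it.
get_keys = np.vectorize(lambda d: list(d.keys()), otypes=[object])
print(get_keys(a))
# [list(['a', 'b']) list(['a', 'b'])]  - an object array holding one key list per row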

How to efficiently extract values from nested numpy arrays generated by loadmat function?

Is there a more efficient method in python to extract data from a nested python list such as A = array([[array([[12000000]])]], dtype=object)? I have been using A[0][0][0][0], but it does not seem to be an efficient method when you have lots of data like A.
I have also used
numpy.squeeze(array([[array([[12000000]])]], dtype=object)), but this gives me
array(array([[12000000]]), dtype=object)
PS: The nested array was generated by loadmat() function in scipy module to load a .mat file which consists of nested structures.
Creating such an array is a bit tedious, but loadmat does it to handle the MATLAB cells and 2d matrix:
In [5]: A = np.empty((1,1),object)
In [6]: A[0,0] = np.array([[1.23]])
In [7]: A
Out[7]: array([[array([[ 1.23]])]], dtype=object)
In [8]: A.any()
Out[8]: array([[ 1.23]])
In [9]: A.shape
Out[9]: (1, 1)
squeeze compresses the shape, but does not cross the object boundary
In [10]: np.squeeze(A)
Out[10]: array(array([[ 1.23]]), dtype=object)
but if you have one item in an array (regardless of shape) item() can extract it. Indexing also works, A[0,0]
In [11]: np.squeeze(A).item()
Out[11]: array([[ 1.23]])
item again to extract the number from that inner array:
In [12]: np.squeeze(A).item().item()
Out[12]: 1.23
Or we don't even need the squeeze:
In [13]: A.item().item()
Out[13]: 1.23
loadmat has a squeeze_me parameter.
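For example (the file name here is just a placeholder):
from scipy.io import loadmat

# squeeze_me=True collapses singleton MATLAB dimensions at load time,
# so many of the nested (1, 1) wrappers never appear in the first place.
data = loadmat('myfile.mat', squeeze_me=True)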
Indexing is just as easy:
In [17]: A[0,0]
Out[17]: array([[ 1.23]])
In [18]: A[0,0][0,0]
Out[18]: 1.23
astype can also work (though it can be picky about the number of dimensions).
In [21]: A.astype(float)
Out[21]: array([[ 1.23]])
With single item arrays like this, efficiency isn't much of an issue. All these methods are quick. Things become more complicated when the array has many items, or the items are themselves large.

How to access elements of numpy ndarray?

You could use A.all() or A.any() to get a scalar. This would only work if A contains one element.
Try A.flatten()[0]
This will flatten the array into a single dimension and extract the first item from it. In your case, the first item is the only item.
What worked in my case was the following:
import scipy.io
xcat = scipy.io.loadmat(os.path.join(dir_data, file_name))
pars = xcat['pars'] # Extract numpy.void element from the loadmat object
# Note that you are dealing with a numpy structured array object when you enter pars[0][0].
# Thus you can acces names and all that...
dict_values = [x[0][0] for x in pars[0][0]] # Extract all elements in one go
dict_keys = list(pars.dtype.names) # Extract the corresponding names/tags
dict_xcat = dict(zip(dict_keys, dict_values)) # Pack it up again in a dict
The idea behind this is to first extract ALL the values I want and format them in a nice python dict.
This saves me from cumbersome indexing later in the file.
Of course, this is a very specific solution, since in my case the values I needed were all floats/ints.

Removing blank rows from Memmapped array

I have around 6000 json.gz files totalling 24GB which I need to do various calculations on.
Because I have no clue about the number of lines I'm going to pick up from each JSON file (since I would reject some lines with invalid data), I estimated a maximum of 2000 lines from each JSON.
I created a memmapped numpy array with shape (6000*2000, 10) and parsed the data from the json.gz files into it [total size = 2.5GB].
In the end it turned out that because of the overestimation, the last 10-15% of the rows are all zeroes. Now because of the nature of my computation I need to remove these invalid rows from the Memmapped numpy. Priority is of course time and after that memory.
What could be the best method to do this? I know programmatically the exact indices of the rows to be removed.
1. Create another memmapped array with the correct shape and size, and slice the original array into it.
2. Use the delete() function.
3. Use masking.
4. Something else?
You can use arr.base.resize to truncate or enlarge the array, then arr.flush() to save the change to disk:
In [169]: N = 10**6
In [170]: filename = '/tmp/test.dat'
In [171]: arr = np.memmap(filename, mode='w+', shape=(N,))
In [172]: arr.base.resize(N//2)
In [173]: arr.flush()
In [174]: arr = np.memmap(filename, mode='r')
In [175]: arr.shape
Out[175]: (500000,)
In [176]: arr.size == N//2
Out[176]: True
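If the rows to drop were not all at the end, truncation alone wouldn't do it; here is a sketch of option 1 from the question, copying the kept rows into a fresh memmap in chunks (file names and sizes below are placeholders, not from the original setup):
import numpy as np

# Build a small demo memmap whose last rows are "invalid".
nrows, ncols = 1000, 10
src = np.memmap('/tmp/full.dat', dtype='float64', mode='w+', shape=(nrows, ncols))
src[:] = np.random.rand(nrows, ncols)
src[900:] = 0.0                      # pretend these are the rows to drop
src.flush()

keep = np.arange(900)                # known-good row indices
dst = np.memmap('/tmp/clean.dat', dtype='float64', mode='w+',
                shape=(len(keep), ncols))

chunk = 200                          # copy in chunks to bound peak memory
for start in range(0, len(keep), chunk):
    idx = keep[start:start + chunk]
    dst[start:start + len(idx)] = src[idx]
dst.flush()
print(dst.shape)                     # (900, 10)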
