Conversion from U3 dtype to ascii - python

I am reading data from a .mat file. The data is in the form of a numpy array.
[array([u'ABT'], dtype='<U3')]
This is one element of the array. I want to get only the value 'ABT' from the array. Unicode normalization and encoding to ASCII do not work.

encode is a string method, so it can't be applied directly to an array of strings. But there are several ways of applying it to each element.
Here I'm working in Py3, so the default string type is unicode.
In [179]: A=np.array(['one','two'])
In [180]: A
Out[180]:
array(['one', 'two'],
      dtype='<U3')
plain iteration:
In [181]: np.array([s.encode() for s in A])
Out[181]:
array([b'one', b'two'],
      dtype='|S3')
np.char has functions that apply string methods to each element of an array:
In [182]: np.char.encode(A)
Out[182]:
array([b'one', b'two'],
      dtype='|S3')
but it looks like this is one of the conversions that astype can handle:
In [183]: A.astype('<S3')
Out[183]:
array([b'one', b'two'],
      dtype='|S3')
And inspired by a recent question about np.chararray:
What happened to numpy.chararray
In [191]: Ac=np.char.array(A)
In [192]: Ac
Out[192]:
chararray(['one', 'two'],
          dtype='<U3')
In [193]: Ac.encode()
Out[193]:
array([b'one', b'two'],
      dtype='|S3')
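To answer the extraction itself: the question's element is a one-item array inside a list, so plain indexing (or item()) pulls out the Python string, and encode then gives bytes if ASCII is really needed. A minimal sketch, assuming x is the list from the question:
x = [np.array([u'ABT'], dtype='<U3')]
x[0][0]               # 'ABT' - indexing returns the string
x[0].item()           # 'ABT' - item() extracts the lone element as a plain str
x[0].item().encode()  # b'ABT'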

default values for numpy ndarray

I was working with numpy.ndarray and something interesting happened.
I created an array with the shape of (2, 2) and left everything else with the default values.
It created an array for me with these values:
array([[2.12199579e-314, 0.00000000e+000],
       [5.35567160e-321, 7.72406468e-312]])
I created another array with the same default values and it also gave me the same result.
Then I created a new array (using the default values and the shape (2, 2)) and filled it with zeros using the 'fill' method.
The interesting part is that now whenever I create a new array with ndarray it gives me an array with 0 values.
So what is going on behind the scenes?
See https://numpy.org/doc/stable/reference/generated/numpy.empty.html#numpy.empty:
(Precisely as @Michael Butscher commented)
np.empty([2, 2]) creates an array without touching the contents of the memory chunk allocated for the array; thus, the array may look as if filled with some more or less random values.
np.ndarray([2, 2]) does the same.
Other creation methods, however, fill the memory with some values (see the sketch after this list):
np.zeros([2, 2]) fills the memory with zeros,
np.full([2, 2], 9) fills the memory with nines, etc.
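A quick sketch of the difference (the empty values are whatever happened to be in that memory):
import numpy as np
a = np.empty((2, 2))    # uninitialized; contents are arbitrary
b = np.zeros((2, 2))    # guaranteed all 0.0
c = np.full((2, 2), 9)  # guaranteed all 9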
Now, if you create a new array via np.empty() after creating (and disposing of, i.e. letting the garbage collector reclaim) an array filled with e.g. ones, your new array may be allocated the same chunk of memory and thus look as if "filled" with ones.
np.empty explicitly says it returns:
Array of uninitialized (arbitrary) data of the given shape, dtype, and order. Object arrays will be initialized to None.
It's compiled code so I can't say for sure, but I strongly suspect np.empty just calls np.ndarray with the shape and dtype. The ndarray docs describe it as a low-level constructor, and list many better alternatives.
In an IPython session I can make two arrays:
In [2]: arr = np.empty((2,2), dtype='int32'); arr
Out[2]:
array([[ 927000399, 1267404612],
       [ 1828571807, -1590157072]])
In [3]: arr1 = np.ndarray((2,2), dtype='int32'); arr1
Out[3]:
array([[ 927000399, 1267404612],
       [ 1828571807, -1590157072]])
The values are the same, but when I check the "location" of their data buffers, I see that they are different:
In [4]: arr.__array_interface__['data'][0]
Out[4]: 2213385069328
In [5]: arr1.__array_interface__['data'][0]
Out[5]: 2213385068176
We can't use that number in code to fiddle with the values, but it's useful as a human-readable indicator of where the data is stored. (Do you understand the basics of how arrays are stored, with shape, dtype, strides, and data-buffer?)
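For reference, those pieces are all visible as array attributes; a quick sketch (the strides shown assume a C-ordered int32 array):
arr = np.empty((2, 2), dtype='int32')
arr.shape     # (2, 2)
arr.dtype     # dtype('int32')
arr.strides   # (8, 4): 8 bytes to step to the next row, 4 to the next column
arr.__array_interface__['data'][0]  # integer address of the data buffer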
Why the "uninitialized values" are the same is anyones guess; my guess it's just an artifact of the how that bit of memory was used before. np.empty stresses that we shouldn't place an significance to those values.
Doing the ndarray again, produces different values and location:
In [9]: arr1 = np.ndarray((2,2), dtype='int32'); arr1
Out[9]:
array([[1469865440,        515],
       [         0,          0]])
In [10]: arr1.__array_interface__['data'][0]
Out[10]: 2213403372816
apparent reuse
If I don't assign the array to a variable, or otherwise "hang on to it", numpy may reuse the data buffer memory:
In [17]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[17]: 2213403374512
In [18]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[18]: 2213403374512
In [19]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[19]: 2213403374512
In [20]: np.empty((2,2), dtype='int').__array_interface__['data'][0]
Out[20]: 2213403374512
Again, we shouldn't attach too much significance to this reuse, and certainly shouldn't count on it for any calculations.
object dtype
If we specify the object dtype, then the values are initialized to None. This dtype contains references/pointers to objects in memory, and "random" pointers wouldn't be safe.
In [14]: arr1 = np.ndarray((2,2), dtype='object'); arr1
Out[14]:
array([[None, None],
       [None, None]], dtype=object)
In [15]: arr1 = np.ndarray((2,2), dtype='U3'); arr1
Out[15]:
array([['', ''],
       ['', '']], dtype='<U3')

Convert numpy array from space separated to comma separated in python

This is data from a .csv file. Generally we expect an array/list with comma separated values like [1,2,3,4], but that is not what happens in this case:
data = pd.read_csv('file.csv')
data_array = data.values
print(data_array)
print(type(data_array[0]))
and here is the output data
[16025788 179 '179batch1640694482' 18055630 8317948789 '2021-12-28'
 8315780000.0 '6214' 'CA' Nan Nan 'Wireless' '2021-12-28 12:32:46'
 '2021-12-28 12:32:46']
<class 'numpy.ndarray'>
So I am looking for a way to get the array with comma separated values.
Okay, so simply make these changes:
converted_str = numpy.array_str(data_array)
converted_str = converted_str.replace(' ', ',')  # str.replace returns a new string; reassign it
print(converted_str)
Now, if you want the output as a <class 'numpy.ndarray'>, simply convert it back to a numpy array. I hope this helps! 😉
Without the csv or dataframe (or at least a sample) there's some ambiguity as to what your data array is like. But let me illustrate with a sample.
In [166]: df = pd.DataFrame([['one',2],['two',3]])
the dataframe display:
In [167]: df
Out[167]:
     0  1
0  one  2
1  two  3
The array derived from the frame:
In [168]: data = df.values
In [169]: data
Out[169]:
array([['one', 2],
       ['two', 3]], dtype=object)
In my IPython session, the display is actually the repr representation of the array. Note the commas, the word 'array', and the dtype.
In [170]: print(repr(data))
array([['one', 2],
       ['two', 3]], dtype=object)
A print of the array omits those words and commas. That's the str format. Omitting the commas is normal for numpy arrays, and helps distinguish them from lists. But let me stress that this is just the display style.
In [171]: print(data)
[['one' 2]
 ['two' 3]]
In [172]: print(data[0])
['one' 2]
We can convert the array to a list:
In [173]: alist = data.tolist()
In [174]: alist
Out[174]: [['one', 2], ['two', 3]]
Commas are a standard part of list display.
But let me stress: commas or not, this is just display. Don't confuse that with the underlying distinction between a pandas dataframe, a numpy array, and a Python list.
Convert to a normal python list first:
print(list(data_array))
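Note that list() only converts the outer dimension; each row is still an array, while tolist() converts all the way down. A quick sketch with the sample frame from above:
list(data)     # [array(['one', 2], dtype=object), array(['two', 3], dtype=object)]
data.tolist()  # [['one', 2], ['two', 3]]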

Python How to convert ['0.12' '0.23'] <class 'numpy.ndarray'> to a normal numpy array

I am using a package that is fetching values from a csv file for me. If I print out the result I get ['0.12' '0.23']. I checked the type, which is <class 'numpy.ndarray'>. I want to convert it to a numpy array like [0.12, 0.23].
I tried np.asarray(variable) but that did not resolve the problem.
Solution
import numpy as np
array = array.astype(float)  # np.float was removed in NumPy 1.24; use float or np.float64
# If you are just initializing the array you can do this
arr = np.array(your_list, dtype=float)
It might help to know how the csv was read. But for whatever reason it appears to have created a numpy array with a string dtype:
In [106]: data = np.array(['0.12', '0.23'])
In [107]: data
Out[107]: array(['0.12', '0.23'], dtype='<U4')
In [108]: print(data)
['0.12' '0.23']
The str formatting of such an array omits the commas; the repr display keeps them.
A list equivalent also displays with comma:
In [109]: data.tolist()
Out[109]: ['0.12', '0.23']
We call this a numpy array, but technically it is of class numpy.ndarray:
In [110]: type(data)
Out[110]: numpy.ndarray
It can be converted to an array of floats with:
In [111]: data.astype(float)
Out[111]: array([0.12, 0.23])
It is still an ndarray; just the dtype is different. You may need to read more in the numpy docs about dtype.
The error:
If I want to calculate with it it gives me an error TypeError: only size-1 arrays can be converted to Python scalars
has a different source. data has 2 elements. You don't show the code that generates this error, but we often see it in plotting calls. The parameter is supposed to be a single number (often an integer), whereas your array, even with a numeric dtype, contains two numbers.
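A minimal reproduction of that error, using the string array from above:
data = np.array(['0.12', '0.23']).astype(float)
float(data)     # TypeError: only size-1 arrays can be converted to Python scalars
float(data[0])  # 0.12 - a single element converts cleanly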

Numpy and Applying Method to a Column

I have a numpy array that contains objects.
For example my array is:
a = np.array({'a':1,'b':2},....,{'a':n,'b':n+1})
The data is not that important, but what I need to do is for each column call a property on that object.
Using my dictionary example, I want to call keys() to print out a list of keys on that row and return as a numpy array:
a[0].keys()
If I were using Pandas, I could leverage apply() on the column and use lambda functions to do this. For this case, I CANNOT use Pandas, so how can I do the same operation on a single numpy array column?
I tried using apply_along_axis, but the lambda receives arr as a whole, not one row at a time, so I would basically need a for loop inside my lambda to call my method.
np.apply_along_axis(lambda b: b.keys(), axis=0, arr=self.data)
The above code does not work! (I know this).
Is there a way to do a pandas.apply() using a numpy array?
The desired result in this case would be N row numpy array with lists of [a,b] in them.
An object array like this can be treated like a list:
In [110]: n=2;a = np.array(({'a':1,'b':2},{'a':n,'b':n+1}))
In [111]: a
Out[111]: array([{'a': 1, 'b': 2}, {'a': 2, 'b': 3}], dtype=object)
In [112]: [d.keys() for d in a]
Out[112]: [dict_keys(['a', 'b']), dict_keys(['a', 'b'])]
You could also use frompyfunc, which will apply a function to all elements of an array (or to broadcasted elements of several arrays):
In [114]: np.frompyfunc(lambda d:d.keys(),1,1)(a)
Out[114]: array([dict_keys(['a', 'b']), dict_keys(['a', 'b'])], dtype=object)
It returns an object array, which is fine in this case. np.vectorize uses this function as well, but takes an otypes parameter.
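For completeness, a sketch of the vectorize equivalent; otypes=[object] stops it from trying to infer an output dtype, and wrapping in list() gives the [a, b] lists the question asked for:
f = np.vectorize(lambda d: list(d.keys()), otypes=[object])
f(a)  # array([list(['a', 'b']), list(['a', 'b'])], dtype=object)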
As a general rule, iterating on an object dtype array is faster than iterating on a numeric array (since all it has to do is return a pointer), but slower than the equivalent iteration on a list. Calculations on object dtype arrays are not as fast as the compiled numeric array calculations.
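If you want to check that claim on your own machine, a rough timing sketch (absolute numbers will vary):
from timeit import timeit
lst = list(range(100_000))
obj_arr = np.array(lst, dtype=object)
num_arr = np.array(lst)
timeit(lambda: [x + 1 for x in lst], number=10)      # list: fastest to iterate
timeit(lambda: [x + 1 for x in obj_arr], number=10)  # object array: in between
timeit(lambda: [x + 1 for x in num_arr], number=10)  # numeric array: slowest to iterate
timeit(lambda: num_arr + 1, number=10)               # compiled whole-array math: fastest of all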

How to efficiently extract values from nested numpy arrays generated by loadmat function?

Is there a more efficient method in Python to extract data from a nested numpy array such as A = array([[array([[12000000]])]], dtype=object)? I have been using A[0][0][0][0], but it does not seem to be an efficient method when you have lots of data like A.
I have also used
numpy.squeeze(array([[array([[12000000]])]], dtype=object)) but this gives me
array(array([[12000000]]), dtype=object)
PS: The nested array was generated by the loadmat() function in the scipy module to load a .mat file, which consists of nested structures.
Creating such an array is a bit tedious, but loadmat does it to handle the MATLAB cells and 2d matrices:
In [5]: A = np.empty((1,1),object)
In [6]: A[0,0] = np.array([[1.23]])
In [7]: A
Out[7]: array([[array([[ 1.23]])]], dtype=object)
In [8]: A.any()
Out[8]: array([[ 1.23]])
In [9]: A.shape
Out[9]: (1, 1)
squeeze compresses the shape, but does not cross the object boundary
In [10]: np.squeeze(A)
Out[10]: array(array([[ 1.23]]), dtype=object)
but if you have one item in an array (regardless of shape), item() can extract it. Indexing also works: A[0,0].
In [11]: np.squeeze(A).item()
Out[11]: array([[ 1.23]])
item again to extract the number from that inner array:
In [12]: np.squeeze(A).item().item()
Out[12]: 1.23
Or we don't even need the squeeze:
In [13]: A.item().item()
Out[13]: 1.23
loadmat has a squeeze_me parameter.
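A sketch of that option (hypothetical file name):
from scipy.io import loadmat
data = loadmat('file.mat', squeeze_me=True)  # size-1 dimensions are collapsed on load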
Indexing is just as easy:
In [17]: A[0,0]
Out[17]: array([[ 1.23]])
In [18]: A[0,0][0,0]
Out[18]: 1.23
astype can also work (though it can be picky about the number of dimensions).
In [21]: A.astype(float)
Out[21]: array([[ 1.23]])
With single-item arrays like this, efficiency isn't much of an issue. All these methods are quick. Things become more complicated when the array has many items, or when the items are themselves large.
You could use A.all() or A.any() to get a scalar. This would only work if A contains one element.
Try A.flatten()[0]
This will flatten the array into a single dimension and extract the first item from it. In your case, the first item is the only item.
What worked in my case was the following:
import os
import scipy.io
xcat = scipy.io.loadmat(os.path.join(dir_data, file_name))
pars = xcat['pars'] # Extract the numpy.void element from the loadmat object
# Note that you are dealing with a numpy structured array object when you enter pars[0][0].
# Thus you can access names and all that...
dict_values = [x[0][0] for x in pars[0][0]] # Extract all elements in one go
dict_keys = list(pars.dtype.names) # Extract the corresponding names/tags
dict_xcat = dict(zip(dict_keys, dict_values)) # Pack it up again in a dict
where the idea behind this is: first extract ALL the values I want, and format them in a nice python dict. This saves me from cumbersome indexing later in the file... Of course, this is a very specific solution, since in my case the values I needed were all floats/ints.
