Preventing numpy from upcasting numerical values to strings

Normally, I'm happy with the way numpy determines the minimum type required to hold the objects of the sequence in np.array:
>>> np.array([42, 4.2])
array([42. ,  4.2])
That is quite intuitive: I need to upcast an integer to a float in order to handle the data.
However, the following case seems to be less intuitive to me:
>>> np.array([42, 4.2, 'aa'])
array(['42', '4.2', 'aa'], dtype='<U32')
I would prefer the resulting array to have dtype object. I don't want to call
np.array(my_list, dtype=object)
because I would like to keep the old behavior in the case of my_list = [42, 4.2] and also in the case of my_list = ['aa'] (which would result in dtype <U2).
Is it possible to tweak the default behavior in order to prevent the upcasting of numerical values to a string, or is there any workaround with the same effect?

It looks like you want to do a bit of pre-processing on your data before you let numpy determine the data type. From what I understood of your criteria: if all the objects in the list are numbers, or none of them are, you want to let numpy determine the type; if the categories are mixed, you want to use dtype object.
Fortunately, all numbers in Python have the abstract base class numbers.Number hooked in:
from numbers import Number

isnum = lambda x: isinstance(x, Number)
isntnum = lambda x: not isinstance(x, Number)

if all(map(isnum, my_list)) or all(map(isntnum, my_list)):
    dtype = None
else:
    dtype = object

my_arr = np.array(my_list, dtype=dtype)
The naming here isn't ideal, but it should work and give you a starting point for something more elegant and efficient.
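For example, wrapping that check in a small helper and running it over the three inputs from the question reproduces the desired behavior (a quick sketch; to_array is just a name introduced here for illustration):

import numpy as np
from numbers import Number

def to_array(my_list):
    # Force dtype=object only when numbers and non-numbers are mixed
    isnum = lambda x: isinstance(x, Number)
    if all(map(isnum, my_list)) or not any(map(isnum, my_list)):
        dtype = None  # let numpy pick the minimal dtype as usual
    else:
        dtype = object
    return np.array(my_list, dtype=dtype)

print(to_array([42, 4.2]).dtype)        # float64, as before
print(to_array(['aa']).dtype)           # <U2, as before
print(to_array([42, 4.2, 'aa']).dtype)  # object, no string upcast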

After looking through all of the C code that I could in ~30 minutes, I've concluded there is no great way of doing this.
My best bet would be the following:
a = np.array([4.2, 42, '42'])
if str(a.dtype)[:2] == '<U':
    a = np.array([4.2, 42, '42'], dtype=object)
I'll admit that this is really hacky, since it relies on the fact that np.array casts these mixed string/float inputs to a unicode string dtype, but it should work well, at least for small arrays.
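If you want to avoid the string slicing, a slightly more robust variant of the same check (my tweak, not part of the original answer) is to test the dtype's kind code, which is 'U' for any unicode string dtype:

a = np.array([4.2, 42, '42'])
if a.dtype.kind == 'U':  # numpy chose a unicode string dtype
    a = np.array([4.2, 42, '42'], dtype=object)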


What is the pythonic way of "iterating" over a single item?

I come across this issue often, and I would be surprised if there wasn't some very simple and pythonic one-liner solution to it.
Suppose I have a method or a function that takes a list or some other iterable object as an argument. I want for an operation to be performed once for each item in the object.
Sometimes, only a single item (say, a float value) is passed to this function. In this situation, my for-loop doesn't know what to do. And so, I find myself peppering my code with the following snippet of code:
from collections.abc import Sequence

def my_function(value):
    if not isinstance(value, Sequence):
        value = [value]
    # rest of my function
This works, but it seems wasteful and not particularly legible. In searching StackOverflow I've also discovered that strings are considered sequences, and so this code could easily break given the wrong argument. It just doesn't feel like the right approach.
I come from a MATLAB background, and this is neatly solved in that language since scalars are treated like 1x1 matrices. I'd expect, at the very least, for there to be a built-in, something like numpy's atleast_1d function, that automatically converts anything into an iterable if it isn't one.
The short answer is nope, there is no simple built-in. And yep, if you want str (or bytes or other bytes-like stuff) to act as a scalar value, it gets uglier. Python expects callers to adhere to the interface contract: if you say you accept sequences, it's on the caller to wrap any individual arguments.
If you must do this, there's two obvious ways to do it:
First is to make your function accept varargs instead of a single argument, and leave it up to the caller to unpack any sequences, so you can always iterate the varargs received:
def my_function(*values):
    for val in values:
        ...  # rest of function
A caller with individual items calls you with my_function(a, b), a caller with a sequence calls you with my_function(*seq). The latter does incur some overhead to unpack the sequence to a new tuple to be received by my_function, but in many cases this is fine.
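A concrete toy version, just to make the two calling conventions explicit (print stands in for the real per-item work):

def my_function(*values):
    for val in values:
        print(val)  # placeholder for the real work

my_function(1.5)         # a single scalar
my_function(1, 2, 3)     # several individual items
my_function(*[1, 2, 3])  # unpack an existing sequence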
If that's not acceptable for whatever reason, the other solution is to roll your own "ensure iterable" converter function, following whatever rules you care about:
from collections.abc import ByteString

def ensure_iterable(obj):
    if isinstance(obj, (str, ByteString)):
        return (obj,)  # Treat strings and bytes-like objects as scalars and wrap them
    try:
        iter(obj)  # The simplest way to test for iterability is to try making an iterator
    except TypeError:
        return (obj,)  # Not iterable; wrap it
    else:
        return obj  # Already iterable
which my_function can use with:
def my_function(value):
    value = ensure_iterable(value)
    ...  # rest of function
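A quick check of the wrapper's behavior on a few inputs:

print(ensure_iterable([1, 2, 3]))  # [1, 2, 3] -- already iterable, passed through
print(ensure_iterable('abc'))      # ('abc',)  -- string treated as a scalar
print(ensure_iterable(1.5))        # (1.5,)    -- non-iterable, wrapped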
Python is a general purpose language, with true scalars as well as iterables like lists.
MATLAB does not have true scalars; its base object is a 2d matrix. It did not start as a general purpose language.
numpy adds MATLAB-like arrays to Python, but it too can have 0d arrays (scalar arrays), which may give wayward MATLAB users headaches.
Many numpy functions have a provision for converting their input to an array, so that they work with a list input as well as an array:
In [10]: x = np.array(3)
In [11]: x
Out[11]: array(3)
In [12]: x.shape
Out[12]: ()
In [13]: for i in x: print(x)
Traceback (most recent call last):
  Input In [13] in <cell line: 1>
    for i in x: print(x)
TypeError: iteration over a 0-d array
It also has utility functions that ensure the array is at least 1d, or 2d, etc.:
In [14]: x = np.atleast_1d(1)
In [15]: x
Out[15]: array([1])
In [16]: for i in x: print(i)
1
But as with old-fashioned MATLAB, we prefer to avoid iteration in numpy. numpy doesn't have the JIT compilation that lets current MATLAB users get away with iteration. Technically numpy functions do use iteration, but it is usually in compiled code.
np.sin applied to various inputs:
In [17]: np.sin(1) # scalar
Out[17]: 0.8414709848078965
In [18]: np.sin([1,2,3]) # list
Out[18]: array([0.84147098, 0.90929743, 0.14112001])
In [19]: np.sin(np.array([1,2,3]).reshape(3,1))
Out[19]:
array([[0.84147098],
[0.90929743],
[0.14112001]])
Technically, the [17] result is a numpy scalar, not a base python float:
In [20]: type(Out[17])
Out[20]: numpy.float64
I would duck type:
def first(item):
    try:
        it = iter(item)
    except TypeError:
        it = iter([item])
    return next(it)
Test it:
tests = [[1, 2, 3], 'abc', 1, 1.23]
for e in tests:
    print(e, first(e))
Prints:
[1, 2, 3] 1
abc a
1 1
1.23 1.23

Assignment to numpy structured array

How does one assign to numpy structured arrays?
import numpy as np

baz_dtype = np.dtype([("baz1", "str"),
                      ("baz2", "uint16"),
                      ("baz3", np.float32)])
dtype = np.dtype([("foo", "str"),
                  ("bar", "uint16"),
                  ("baz", baz_dtype)])
xx = np.zeros(2, dtype=dtype)
xx["foo"][0] = "A"
Here xx remains unchanged. The docs https://docs.scipy.org/doc/numpy/user/basics.rec.html are a little vague on this.
On a related note, is it possible to make one or more of the subtypes be lists or numpy arrays of the specified dtype?
Any tips welcome.
You're performing the assignment correctly. The part you've screwed up is the dtypes. NumPy string dtypes are fixed-size, and if you try to use "str" as a dtype, it's treated as size 0 - the empty string is the only possible value! Your "A" gets truncated to 0 characters to fit.
Specify a size - for example, 'S10' is 10-byte bytestrings, or 'U10' is 10-code-point unicode strings - or use object to store ordinary Python string objects and avoid the length restrictions and treatment of '\0' as a null terminator.
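A minimal sketch of the corrected dtypes (the sizes here are my choice; pick whatever fits your data):

import numpy as np

baz_dtype = np.dtype([("baz1", "U10"),  # fixed-size unicode instead of "str"
                      ("baz2", "uint16"),
                      ("baz3", np.float32)])
dtype = np.dtype([("foo", "U10"),
                  ("bar", "uint16"),
                  ("baz", baz_dtype)])
xx = np.zeros(2, dtype=dtype)
xx["foo"][0] = "A"
print(xx["foo"])  # ['A' ''] -- the assignment now sticks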

change numpy array type from string to float

I have a 40-dimensional numpy vector with values like -0.399917, 0.441786, ...
The data type of the vector is |S8 (a string type) by default.
When I try to change the dtype using the astype method, I get the following error:
ValueError: could not convert string to float:
My partial code:
value = np.array(value)
value.astype(float)
I don't have your exact dataset, but the following code worked for me:
a = np.array(['-0.399917', '0.441786']) # (Line 1)
dim = (2, 2, 2, 2)
a = np.tile(a, dim)
b = a.astype(float)
I know the dimensions aren't the same as you have, but that shouldn't make a difference. What is likely happening is that some of the values in your vector are not of the form you specified. For example, the following both raise your ValueError when they are used in (Line 1):
a = np.array(['-0.399917', '0.441786', ''])
a = np.array(['-0.399917', '0.441786', 'spam'])
It is likely that you have either empty values, string values, or something similar in your array somewhere.
If the values are empty, you can do something like was suggested here:
a[a==''] = '0'
or whatever value you want it to be. You can do a similar thing with other string values as long as they have a pattern. If they don't have a pattern, you can still do this, but it may not be feasible to go through looking for all the possibilities.
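Putting the two steps together (a sketch with made-up values):

a = np.array(['-0.399917', '0.441786', ''])
a[a == ''] = '0'     # replace empty entries with a sentinel value
b = a.astype(float)  # the conversion now succeeds
print(b)             # [-0.399917  0.441786  0.      ]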
EDIT: If you don't care what the strings are and just want them turned into nan, you can use np.genfromtxt as explained here. That might or might not be dangerous, depending on your application. Often, codes are given to indicate something about an array element, and you might not want to treat them all the same.
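If you'd rather not rely on genfromtxt, a plain-Python fallback that maps every unparseable string to nan (my sketch, not from the linked answer):

def to_float(s):
    try:
        return float(s)
    except ValueError:
        return float('nan')  # anything unparseable becomes nan

a = np.array(['-0.399917', '0.441786', 'spam'])
b = np.array([to_float(s) for s in a])
print(b)  # [-0.399917  0.441786        nan]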

How to convert numpy object array into str/unicode array?

Update: In the latest versions of numpy (e.g., v1.8.1), this is no longer an issue. All the methods mentioned here now work as expected.
Original question: Using the object dtype to store a string array is sometimes convenient, especially when one needs to modify the contents of a large array without prior knowledge of the maximum string length, e.g.,
>>> import numpy as np
>>> a = np.array([u'abc', u'12345'], dtype=object)
At some point, one might want to convert the dtype back to unicode or str. However, simple conversion will truncate the string at length 4 or 1 (why?), e.g.,
>>> b = np.array(a, dtype=unicode)
>>> b
array([u'abc', u'1234'], dtype='<U4')
>>> c = a.astype(unicode)
>>> c
array([u'a', u'1'], dtype='<U1')
Of course, one can always iterate over the entire array explicitly to determine the max length,
>>> d = np.array(a, dtype='<U{0}'.format(np.max([len(x) for x in a])))
>>> d
array([u'abc', u'12345'], dtype='<U5')
Yet, this is a little bit awkward in my opinion. Is there a better way to do this?
Edit to add: According to this closely related question,
>>> len(max(a, key=len))
is another way to find out the longest string length, and this step seems to be unavoidable...
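For completeness, plugging that into the same construction gives the identical result:

>>> np.array(a, dtype='<U{0}'.format(len(max(a, key=len))))
array([u'abc', u'12345'], dtype='<U5')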
I know this is an old question but in case anyone comes across it and is looking for an answer, try
c = a.astype('U')
and you should get the result you expect:
array([u'abc', u'12345'], dtype='<U5')
At least in Python 3.5 with Jupyter 4, I can use:
a = np.array([u'12345', u'abc'], dtype=object)
b = a.astype(str)
b
This works just fine for me and returns:
array(['12345', 'abc'], dtype='<U5')

Why does numpy.sum return a float64 instead of a uint64 when adding the elements of a generator?

I just came across this strange behaviour of numpy.sum:
>>> import numpy
>>> ar = numpy.array([1,2,3], dtype=numpy.uint64)
>>> gen = (el for el in ar)
>>> lst = [el for el in ar]
>>> numpy.sum(gen)
6.0
>>> numpy.sum(lst)
6
>>> numpy.sum(iter(lst))
<listiterator object at 0x87d02cc>
According to the documentation, the result should have the same dtype as the iterable, so why is a numpy.float64 returned in the first case instead of a numpy.uint64?
And how come the last example does not return any kind of sum and does not raise any error either?
In general, numpy functions don't always do what you might expect when working with generators. To create a numpy array, you need to know its size and type before creating it, and this isn't possible for generators. So many numpy functions either don't work with generators, or do this sort of thing where they fall back on Python builtins.
However, for the same reason, using generators often isn't that useful in Numpy contexts. There's no real advantage to making a generator from a Numpy object, because you already have to have the entire Numpy object in memory anyway. If you need all the types to stay as you specify, you should just not wrap your Numpy objects in generators.
Some more info: Technically, the argument to np.sum is supposed to be an "array-like" object, not an iterable. Array-like is defined in the documentation as:
An array, any object exposing the array interface, an object whose __array__ method returns an array, or any (nested) sequence.
The array interface is documented here. Basically, arrays have to have a fixed shape and a uniform type.
Generators don't fit this protocol and so aren't really supported. Many numpy functions are nice and will accept other sorts of objects that don't technically qualify as array-like, but a strict reading of the docs implies you can't rely on this behavior. The operations may work, but you can't expect all the types to be preserved perfectly.
If the argument is a generator, Python's built-in sum gets used.
You can see this in the source code of numpy.sum (numpy/core/fromnumeric.py):
if isinstance(a, _gentype):
    res = _sum_(a)
    if out is not None:
        out[...] = res
        return out
    return res
_gentype is just an alias of types.GeneratorType, and _sum_ is an alias of the built-in sum.
If you apply the built-in sum to both gen and lst, you can see that the results are the same: 6.0.
The second parameter of sum is start, which defaults to 0; this is part of what makes your result a float64: there is no numpy integer type that can hold both the full uint64 range and a (possibly negative) Python int, so the addition is promoted to float64.
In [1]: import numpy as np
In [2]: type(np.uint64(1) + np.uint64(2))
Out[2]: numpy.uint64
In [3]: type(np.uint64(1) + 0)
Out[3]: numpy.float64
EDIT: By the way, I found a ticket on this issue, which was marked as wontfix: http://projects.scipy.org/numpy/ticket/669
