Prevent strings being truncated when replacing values in a numpy array - python

Let's say I have arrays a and b
a = np.array([1,2,3])
b = np.array(['red','red','red'])
If I were to apply some fancy indexing like this to these arrays
b[a<3]="blue"
the output I get is
array(['blu', 'blu', 'red'], dtype='<U3')
I understand that this happens because numpy initially allocates space for only 3 characters, so it can't fit the whole word blue into the array. What workaround can I use?
Currently I am doing
b = np.array([" "*100 for i in range(3)])
b[a>2] = "red"
b[a<3] = "blue"
but that's just a workaround. Is this a fault in my code, or an issue with numpy? How can I fix it?

You can handle variable length strings by setting the dtype of b to be "object":
import numpy as np
a = np.array([1,2,3])
b = np.array(['red','red','red'], dtype="object")
b[a<3] = "blue"
print(b)
this outputs:
['blue' 'blue' 'red']
This dtype will handle strings, or other general Python objects. This also necessarily means that under the hood you'll have a numpy array of pointers, so don't expect the performance you get when using a primitive datatype.

A marginal improvement on your current approach (which is potentially very wasteful in space):
import numpy as np
a = np.array([1,2,3])
b = np.array(['red','red','red'])
replacement = "blue"
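# itemsize is in bytes; divide by the size of one character to get the current width of b
b = b.astype('<U{}'.format(max(len(replacement), b.dtype.itemsize // np.dtype('U1').itemsize)))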
b[a<3] = replacement
print(b)
This accounts for strings already in the array, so the allocated space only increases if the replacement is longer than all existing strings in the array.
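The widening step can also be wrapped in a small helper. This is only a minimal sketch: `assign_widening` is a hypothetical name (not a NumPy function), and it assumes the array already has a fixed-width Unicode (`U`) dtype.

```python
import numpy as np

def assign_widening(arr, mask, value):
    """Return a copy of `arr` with `value` written where `mask` is True.

    Hypothetical helper for illustration; assumes arr.dtype.kind == 'U'.
    The dtype is widened first if `value` would not fit.
    """
    # itemsize is in bytes; dividing by the size of one character gives the width
    current_width = arr.dtype.itemsize // np.dtype('U1').itemsize
    width = max(current_width, len(value))
    out = arr.astype('<U{}'.format(width))  # astype returns a new array
    out[mask] = value
    return out

a = np.array([1, 2, 3])
b = np.array(['red', 'red', 'red'])
b = assign_widening(b, a < 3, 'blue')
print(b)  # ['blue' 'blue' 'red']
```

Because the helper returns a new array, any other reference to the old `b` still sees the original, narrower data.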

If you construct such array, the type looks like:
>>> b
array(['red', 'red', 'red'], dtype='<U3')
This means that the strings have a length of at most 3 characters. In case you assign longer strings, these strings are truncated.
You can change the data type to make the maximum length longer, for example:
b2 = b.astype('<U10')
So now we have an array that can store strings up to 10 characters. Note however that if you make the maximum length larger, the array will use more memory, because every element reserves space for the maximum length.
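To see the cost of a wider dtype, you can compare `itemsize` and `nbytes` before and after the conversion. A small sketch using the same three-element array:

```python
import numpy as np

b = np.array(['red', 'red', 'red'])
b2 = b.astype('<U10')

# NumPy stores each Unicode character in 4 bytes
print(b.dtype, b.itemsize, b.nbytes)     # <U3 12 36
print(b2.dtype, b2.itemsize, b2.nbytes)  # <U10 40 120
```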

Related

Replace elements in Numpy array by value and location

I am working on a program which will create contour data out of numpy arrays, and trying to avoid calls to matplotlib.
I have an array of length L which contains NxN arrays of booleans. I want to convert this into an LxNxN array where, for example, the "True"s in the first inner array get replaced by "red", in the second, by "blue" and so forth.
The following code works as expected:
import numpy as np
import pdb
def new_layer(N,p):
    return np.random.choice(a=[False,True],size=(N,N),p=[p,1-p])
a = np.array([new_layer(3,0.5),new_layer(3,0.5),new_layer(3,0.5)]).astype('object')
colors = np.array(["red","green","blue"])
for i in range(np.shape(a)[0]):
    b = a[i]
    b[np.where(b==True)] = colors[i]
    a[i] = b
print(a)
But I am wondering if there is a way to accomplish the same using Numpy's built-in tools, e.g., indexing. I am a newcomer to Numpy and I suspect there is a better way to do this but I can't think what it would be. Thank you.
You could use np.copyto:
np.copyto(a, colors[:, None, None], where=a.astype(bool))
Here's one way -
a_bool = a.astype(bool)
a[a_bool] = np.repeat(colors,a_bool.sum((1,2)))
Another with extending colors to 3D -
a_bool = a.astype(bool)
colors3D = np.broadcast_to(colors[:,None,None],a.shape)
a[a_bool] = colors3D[a_bool]
You can use a combination of boolean indexes and np.indices. Also you can use a as index to itself. Then you could do what you did in the for loop with this line (although I don't think it necessarily is a good idea):
a[a.astype(bool)] = colors[np.indices(a.shape)[0][a.astype(bool)]]
Also, for the new_layer function you could just use np.random.rand(N,N) > p (not sure if the actual distribution will be exactly the same as what you had).
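If you don't need to keep the False entries, np.where with a broadcast colors array is another option. This is a sketch under that assumption: it builds a new string array rather than modifying the object array in place, and it puts an empty string where the mask is False. The names `mask` and `labelled` are made up for the example.

```python
import numpy as np

# Fixed boolean layers (shape L x N x N) so the result is reproducible
mask = np.array([
    [[True, False], [False, True]],
    [[False, True], [True, False]],
    [[True, True], [False, False]],
])
colors = np.array(["red", "green", "blue"])

# colors[:, None, None] has shape (3, 1, 1) and broadcasts against (3, 2, 2)
labelled = np.where(mask, colors[:, None, None], "")
print(labelled[0])
```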

How to convert numpy object array into str/unicode array?

Update: In the latest version of numpy (e.g., v1.8.1), this is no longer an issue. All the methods mentioned here now work as expected.
Original question: Using object dtype to store string array is convenient sometimes, especially when one needs to modify the content of a large array without prior knowledge about the maximum length of the strings, e.g.,
>>> import numpy as np
>>> a = np.array([u'abc', u'12345'], dtype=object)
At some point, one might want to convert the dtype back to unicode or str. However, simple conversion will truncate the string at length 4 or 1 (why?), e.g.,
>>> b = np.array(a, dtype=unicode)
>>> b
array([u'abc', u'1234'], dtype='<U4')
>>> c = a.astype(unicode)
>>> c
array([u'a', u'1'], dtype='<U1')
Of course, one can always iterate over the entire array explicitly to determine the max length,
>>> d = np.array(a, dtype='<U{0}'.format(np.max([len(x) for x in a])))
array([u'abc', u'12345'], dtype='<U5')
Yet, this is a little bit awkward in my opinion. Is there a better way to do this?
Edit to add: According to this closely related question,
>>> len(max(a, key=len))
is another way to find out the longest string length, and this step seems to be unavoidable...
I know this is an old question but in case anyone comes across it and is looking for an answer, try
c = a.astype('U')
and you should get the result you expect:
c = array([u'abc', u'12345'], dtype='<U5')
At least in Python 3.5 Jupyter 4 I can use:
a=np.array([u'12345',u'abc'],dtype=object)
b=a.astype(str)
b
works just fine for me and returns:
array(['12345', 'abc'],dtype='<U5')
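A quick way to confirm that nothing was truncated is to check the resulting dtype and compare the values. A minimal sketch, assuming Python 3 and a reasonably recent NumPy:

```python
import numpy as np

a = np.array(['abc', '12345'], dtype=object)
c = a.astype(str)  # the width is inferred from the longest element

print(c.dtype)
print(c.tolist() == a.tolist())  # True if no string was cut short
```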

Weird behaviour initializing a numpy array of string data

I am having some seemingly trivial trouble with numpy when the array contains string data. I have the following code:
my_array = numpy.empty([1, 2], dtype = str)
my_array[0, 0] = "Cat"
my_array[0, 1] = "Apple"
Now, when I print it with print my_array[0, :], the response I get is ['C', 'A'], which is clearly not the expected output of Cat and Apple. Why is that, and how can I get the right output?
Thanks!
Numpy requires string arrays to have a fixed maximum length. When you create an empty array with dtype=str, it sets this maximum length to 1 by default. You can see this by inspecting my_array.dtype; it will show "|S1", meaning "one-character string". Subsequent assignments into the array are truncated to fit this structure.
You can pass an explicit datatype with your maximum length by doing, e.g.:
my_array = numpy.empty([1, 2], dtype="S10")
The "S10" will create an array of length-10 strings. You have to decide how big will be big enough to hold all the data you want to hold.
I got a "codec error" when I tried to use a non-ascii character with dtype="S10"
You also get an array with binary strings, which confused me.
I think it is better to use:
my_array = numpy.empty([1, 2], dtype="<U10")
Here 'U10' translates to "Unicode string of length 10; little endian format"
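To make the difference concrete, here is a small side-by-side sketch (Python 3 assumed, so dtype=str gives a Unicode dtype of width 1); the names `default` and `wide` are just for illustration:

```python
import numpy as np

default = np.empty([1, 2], dtype=str)   # width defaults to 1 character
default[0, 0] = "Cat"
default[0, 1] = "Apple"
print(default.dtype, default[0, :])     # <U1 ['C' 'A']

wide = np.empty([1, 2], dtype="<U10")   # room for 10 characters per cell
wide[0, 0] = "Cat"
wide[0, 1] = "Apple"
print(wide.dtype, wide[0, :])           # <U10 ['Cat' 'Apple']
```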
The numpy string array is limited by its fixed length (length 1 by default). If you're unsure what length you'll need for your strings in advance, you can use dtype=object and get arbitrary length strings for your data elements:
my_array = numpy.empty([1, 2], dtype=object)
I understand there may be efficiency drawbacks to this approach, but I don't have a good reference to support that.
In case anyone is new here, there is another way to do this, with a little extra work:
my_array = np.full([1, 2], "", dtype=object)
Use np.full instead of np.empty, and fill the array with an empty string (the dtype is object).
Another alternative is to initialize as follows:
my_array = np.array([["CAT","APPLE"],['','']], dtype=str)
In other words, first you write a regular array with what you want, then you turn it into a numpy array. However, this will fix your max string length to the length of the longest string at initialization. So if you were to add
my_array[1,0] = 'PINEAPPLE'
then the string stored would be 'PINEA'.
What works best if you are doing a for loop is to start a list comprehension, which will allow you to allocate the right memory.
data = ['CAT', 'APPLE', 'CARROT']
my_array = [name for name in data]

Swap Array Data in NumPy

I have many large multidimensional NP arrays (2D and 3D) used in an algorithm. There are numerous iterations in this, and during each iteration the arrays are recalculated by performing calculations and saving into temporary arrays of the same size. At the end of a single iteration the contents of the temporary arrays are copied into the actual data arrays.
Example:
global A, B # ndarrays
A_temp = numpy.zeros(A.shape)
B_temp = numpy.zeros(B.shape)
for i in xrange(num_iters):
    # Calculate new values from A and B storing in A_temp and B_temp...
    # Then copy values from temps to A and B
    A[:] = A_temp
    B[:] = B_temp
This works fine, however it seems a bit wasteful to copy all those values when A and B could just swap. The following would swap the arrays:
A, A_temp = A_temp, A
B, B_temp = B_temp, B
However there can be other references to the arrays in other scopes which this won't change.
It seems like NumPy could have an internal method for swapping the internal data pointer of two arrays, such as numpy.swap(A, A_temp). Then all variables pointing to A would be pointing to the changed data.
Even though your way should work just as well (I suspect the problem is somewhere else), you can try doing it explicitly:
import numpy as np
A, A_temp = np.frombuffer(A_temp), np.frombuffer(A)
It's not hard to verify that your method works as well:
>>> import numpy as np
>>> arr = np.zeros(100)
>>> arr2 = np.ones(100)
>>> print arr.__array_interface__['data'][0], arr2.__array_interface__['data'][0]
152523144 152228040
>>> arr, arr2 = arr2, arr
>>> print arr.__array_interface__['data'][0], arr2.__array_interface__['data'][0]
152228040 152523144
... pointers successfully switched
Perhaps you could solve this by adding a level of indirection.
You could have an "array holder" class. All that would do is keep a reference to the underlying NumPy array. Implementing a cheap swap operation for a pair of these would be trivial.
If all external references are to these holder objects and not directly to the arrays, none of those references would get invalidated by a swap.
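A minimal sketch of that idea follows. `ArrayHolder` and `swap` are hypothetical names invented for this example, not part of NumPy:

```python
import numpy as np

class ArrayHolder:
    """Hypothetical wrapper: keeps a reference to one NumPy array."""
    def __init__(self, data):
        self.data = data

def swap(x, y):
    """Exchange the arrays held by two holders without copying any elements."""
    x.data, y.data = y.data, x.data

A = ArrayHolder(np.zeros(3))
A_temp = ArrayHolder(np.ones(3))
alias = A          # another reference held elsewhere in the program

swap(A, A_temp)
print(alias.data)  # [1. 1. 1.]
```

The swap only rebinds two attributes, so its cost does not depend on the array size; the trade-off is that all code has to go through `.data` instead of holding the ndarray directly.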
I realize this is an old question, but for what it's worth you could also swap data between two ndarray buffers (without a temp copy) by performing an xor swap:
A_bytes = A.view('ubyte')
A_temp_bytes = A_temp.view('ubyte')
A_bytes ^= A_temp_bytes
A_temp_bytes ^= A_bytes
A_bytes ^= A_temp_bytes
Since this was done on views, if you look at the original A and A_temp arrays (in whatever their original dtype was) their values should be correctly swapped. This is basically equivalent to the numpy.swap(A, A_temp) you were looking for. It's unfortunate that it requires 3 loops--if this were implemented as a ufunc (maybe it should be) it would be a lot faster.

How do I declare an array in Python?

How do I declare an array in Python?
variable = []
Now variable refers to an empty list*.
Of course this is an assignment, not a declaration. There's no way to say in Python "this variable should never refer to anything other than a list", since Python is dynamically typed.
*The default built-in Python type is called a list, not an array. It is an ordered container of arbitrary length that can hold a heterogenous collection of objects (their types do not matter and can be freely mixed). This should not be confused with the array module, which offers a type closer to the C array type; the contents must be homogenous (all of the same type), but the length is still dynamic.
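A short sketch of that difference, using only the standard library (the variable names are just for illustration):

```python
from array import array

mixed = [1, "eels", 3.5]        # a list may mix types freely
ints = array('i', [1, 2, 3])    # an array holds one C type ('i' = signed int)
ints.append(4)                  # the length is still dynamic

try:
    ints.append("eels")         # wrong type for an 'i' array
except TypeError as exc:
    print("rejected:", exc)

print(ints.tolist())            # [1, 2, 3, 4]
```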
This is a surprisingly complex topic in Python.
Practical answer
Arrays are represented by class list (see reference and do not mix them with generators).
Check out usage examples:
# empty array
arr = []
# init with values (can contain mixed types)
arr = [1, "eels"]
# get item by index (can be negative to access end of array)
arr = [1, 2, 3, 4, 5, 6]
arr[0] # 1
arr[-1] # 6
# get length
length = len(arr)
# supports append and insert
arr.append(8)
arr.insert(6, 7)
Theoretical answer
Under the hood, Python's list is a wrapper for a real array which contains references to items. Also, the underlying array is created with some extra space.
Consequences of this are:
random access is really cheap (arr[6653] costs the same as arr[0])
append operation is 'for free' while some extra space remains
insert operation is expensive
Check this awesome table of operations complexity.
Also, please see this picture, where I've tried to show most important differences between array, array of references and linked list:
You don't actually declare things, but this is how you create an array in Python:
from array import array
intarray = array('i')
For more info see the array module: http://docs.python.org/library/array.html
Now, possibly you don't want an array but a list, but others have answered that already. :)
I think you meant that you want a list with the first 30 cells already filled.
So
f = []
for i in range(30):
    f.append(0)
An example to where this could be used is in Fibonacci sequence.
See problem 2 in Project Euler
This is how:
my_array = [1, 'rebecca', 'allard', 15]
For calculations, use numpy arrays like this:
import numpy as np
a = np.ones((3,2)) # a 2D array with 3 rows, 2 columns, filled with ones
b = np.array([1,2,3]) # a 1D array initialised using a list [1,2,3]
c = np.linspace(2,3,100) # an array with 100 points between (and including) 2 and 3
print(a*1.5) # all elements of a times 1.5
print(a.T+b) # b added to the transpose of a
these numpy arrays can be saved and loaded from disk (even compressed) and complex calculations with large amounts of elements are C-like fast.
Much used in scientific environments. See here for more.
JohnMachin's comment should be the real answer.
All the other answers are just workarounds in my opinion!
So:
array=[0]*element_count
A couple of contributions suggested that arrays in python are represented by lists. This is incorrect. Python has an independent implementation of array() in the standard library module array "array.array()" hence it is incorrect to confuse the two. Lists are lists in python so be careful with the nomenclature used.
list_01 = [4, 6.2, 7-2j, 'flo', 'cro']
list_01
Out[85]: [4, 6.2, (7-2j), 'flo', 'cro']
There is one very important difference between list and array.array(). While both of these objects are ordered sequences, array.array() is an ordered homogeneous sequence whereas a list is a non-homogeneous sequence.
You don't declare anything in Python. You just use it. I recommend you start out with something like http://diveintopython.net.
I would normally just do a = [1,2,3] which is actually a list but for arrays look at this formal definition
To add to Lennart's answer, an array may be created like this:
from array import array
float_array = array("f",values)
where values can take the form of a tuple, list, or np.array, but not array:
values = [1,2,3]
values = (1,2,3)
values = np.array([1,2,3],'f')
# 'i' will work here too, but if array is 'i' then values have to be int
wrong_values = array('f',[1,2,3])
# TypeError: 'array.array' object is not callable
and the output will still be the same:
print(float_array)
print(float_array[1])
print(isinstance(float_array[1],float))
# array('f', [1.0, 2.0, 3.0])
# 2.0
# True
Most methods for list work with array as well, common
ones being pop(), extend(), and append().
Judging from the answers and comments, it appears that the array
data structure isn't that popular. I like it though, the same
way as one might prefer a tuple over a list.
The array structure has stricter rules than a list or np.array, and this can
reduce errors and make debugging easier, especially when working with numerical
data.
Attempts to insert/append a float to an int array will throw a TypeError:
values = [1,2,3]
int_array = array("i",values)
int_array.append(float(1))
# or int_array.extend([float(1)])
# TypeError: integer argument expected, got float
Keeping values which are meant to be integers (e.g. list of indices) in the array
form may therefore prevent a "TypeError: list indices must be integers, not float", since arrays can be iterated over, similar to np.array and lists:
int_array = array('i',[1,2,3])
data = [11,22,33,44,55]
sample = []
for i in int_array:
    sample.append(data[i])
Annoyingly, appending an int to a float array will cause the int to become a float, without throwing an exception.
np.array retains the same data type for its entries too, but instead of giving an error it will change its data type to fit new entries (usually to double or str):
import numpy as np
numpy_int_array = np.array([1,2,3],'i')
for i in numpy_int_array:
    print(type(i))
# <class 'numpy.int32'>
numpy_int_array_2 = np.append(numpy_int_array,int(1))
# still <class 'numpy.int32'>
numpy_float_array = np.append(numpy_int_array,float(1))
# <class 'numpy.float64'> for all values
numpy_str_array = np.append(numpy_int_array,"1")
# <class 'numpy.str_'> for all values
data = [11,22,33,44,55]
sample = []
for i in numpy_int_array_2:
    sample.append(data[i])
# no problem here, but TypeError for the other two
This is true during assignment as well. If the data type is specified, np.array will, wherever possible, transform the entries to that data type:
int_numpy_array = np.array([1,2,float(3)],'i')
# 3 becomes an int
int_numpy_array_2 = np.array([1,2,3.9],'i')
# 3.9 gets truncated to 3 (same as int(3.9))
invalid_array = np.array([1,2,"string"],'i')
# ValueError: invalid literal for int() with base 10: 'string'
# Same error as int('string')
str_numpy_array = np.array([1,2,3],'str')
print(str_numpy_array)
print([type(i) for i in str_numpy_array])
# ['1' '2' '3']
# <class 'numpy.str_'>
or, in essence:
data = [1.2,3.4,5.6]
list_1 = np.array(data,'i').tolist()
list_2 = [int(i) for i in data]
print(list_1 == list_2)
# True
while array will simply give:
invalid_array = array('i',[1,2,3.9])
# TypeError: integer argument expected, got float
Because of this, it is not a good idea to use np.array for type-specific commands. The array structure is useful here. list preserves the data type of the values.
And for something I find rather pesky: the data type is specified as the first argument in array(), but (usually) the second in np.array(). :|
The relation to C is referred to here:
Python List vs. Array - when to use?
Have fun exploring!
Note: The typed and rather strict nature of array leans more towards C rather than Python, and by design Python does not have many type-specific constraints in its functions. Its unpopularity also creates a positive feedback in collaborative work, and replacing it mostly involves an additional [int(x) for x in file]. It is therefore entirely viable and reasonable to ignore the existence of array. It shouldn't hinder most of us in any way. :D
How about this...
>>> a = range(12)
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> a[7]
7
Following on from Lennart, there's also numpy which implements homogeneous multi-dimensional arrays.
Python calls them lists. You can write a list literal with square brackets and commas:
>>> [6,28,496,8128]
[6, 28, 496, 8128]
I had an array of strings and needed an array of the same length of booleans initiated to True. This is what I did
strs = ["Hi","Bye"]
bools = [ True for s in strs ]
You can create lists and convert them into arrays, or you can create arrays using the numpy module. Below are a few examples to illustrate this. Numpy also makes it easier to work with multi-dimensional arrays.
import numpy as np
a = np.array([1, 2, 3, 4])
#For custom inputs
a = np.array([int(x) for x in input().split()])
You can also reshape this array into a 2x2 matrix using the reshape function, which takes the dimensions of the matrix as input.
mat = a.reshape(2, 2)
# This creates a list of 5000 zeros
a = [0] * 5000
You can read and write to any element in this list with a[n] notation in the same way as you would with an array.
It does seem to have the same random access performance as an array. I cannot say how it allocates memory because it also supports a mix of different types including strings and objects if you need it to.
