Assignment to numpy structured array - python

How does one assign to numpy structured arrays?
import numpy as np
baz_dtype = np.dtype([("baz1", "str"),
("baz2", "uint16"),
("baz3", np.float32)])
dtype = np.dtype([("foo", "str"),
("bar", "uint16"),
("baz", baz_dtype)])
xx = np.zeros(2, dtype=dtype)
xx["foo"][0] = "A"
Here xx remains unchanged. The docs https://docs.scipy.org/doc/numpy/user/basics.rec.html are a little vague on this.
On a related note, is it possible to make one or more of the subtypes be lists or numpy arrays of the specified dtype?
Any tips welcome.

You're performing the assignment correctly. The part you've screwed up is the dtypes. NumPy string dtypes are fixed-size, and if you try to use "str" as a dtype, it's treated as size 0 - the empty string is the only possible value! Your "A" gets truncated to 0 characters to fit.
Specify a size - for example, 'S10' is 10-byte bytestrings, or 'U10' is 10-code-point unicode strings - or use object to store ordinary Python string objects and avoid the length restrictions and treatment of '\0' as a null terminator.

Related

Is there a Numpy function for casting just one element to a different type than the rest?

My code is the following:
file_ = open('file.txt', 'r')
lines = file_.readlines()
data = []
for row in lines:
temp = row.split()
data.append(np.array(temp).astype(np.float64))
I want to cast every item in the array to float EXCEPT the final one, which I want to remain a string.
How can I do this?
No, there is no function to cast elements of the same array to different types. Unlike regular Python lists, numpy arrays are homogeneous and store elements with fixed physical record sizes, so each element of the array must always have the same type.
You could handle the strings separately and parse only the numeric part into a numpy array:
for row in lines:
temp = row.split()
numbers = temp[:-1]
stringbit = temp[-1]
data.append(np.array(numbers).astype(np.float64))
Alternatively, if your data is very consistent and each line always has the same type structure, you might be able to use a more complex numpy dtype and numpy.genfromtext to make each line an element of a larger array.
You might also find a pandas.DataFrame fits better for working with this kind of heterogeneous data.
A related question with useful details: NumPy array/matrix of mixed types
You can use recarrays.
Of your rows are records with similar data, you can create a custom dtype that does what you want. The requirement for a homogenous datatype in this case is that the number of elements is constant and there is an upper bound on the number of characters in the final string.
Here is an example that assumes the string only holds ASCII characters:
max_len = 10
dtype = np.dtype([('c1', np.float_), ('c2', np.float_), ('c3', np.float_), ('str', f'S{max_len}')])
row = [(10.0, 1.2, 4.5, b'abc')]
result = np.array(row, dtype)
If you don't want to name each float column separately, you can make that field a subarray:
dtype = np.dtype([('flt', np.float_, 3), ('str', f'S{max_len}')])
row = [([10.0, 1.2, 4.5], b'abc')]
If the strings are not of a known length, you can use the object dtype in that field and simply store a reference.
Even though it's possible, you may find it simpler to just load the floats into one array and the strings into another. I generally find it simpler to work with arrays of a homogenous built in dtype than recarrays.

converting to a pandas dataframe with numpy dtype of string

This is a super basic question but I couldn't find a solution anywhere. I've looked at the docs and on Stack Overflow. I need to get a numpy array of strings which I will then save as a csv. I have a line like this:
np.savetxt(path + 'section.csv', np.asarray(seclist, dtype=np.dtype('a2')), fmt='%s', delimiter=',')
and it returns an error:
ValueError: cannot set an array element with a sequence.
If I remove dtype, it works fine except it prints objects formatted as strings so each item has a bracket and '' around it. When I try dtype=string that does not work it gives me an error:
TypeError: data type not understood.
I've tried dtype='string', dtype = np.str, dtype = np.str_ and quite a few more but none of them get the dtype to be string. Any help would be appreciated,
Cameron
I had a python list of lists. However some lists were length 0 and some were length 1. Thus the numpy couldn't accept the string dtype. When I padded the lists so they were all the same length that fixed everything.

How may numpy return a length one object array?

In numpy 1.8.2, when I index a numpy array of fixed length strings and ask for one value, I get back a numpy array of length 1, supporting numpy operations. So:
import numpy as np
strs = np.array(('aa', 'bbb', 'c'), dtype=np.dtype('|S4'))
print type(strs[(0,)])
I get
<type 'numpy.string_'>
If I do the same with an array of objects:
strs = np.array(('aa', 'bbb', 'c'), dtype=np.dtype('object'))
print type(strs[(0,)])
I get
<type 'str'>
and any numpy specific property/method (e.g. .shape) returns an exception
How may I ensure that numpy returns me a length one array of objects from slicing?
From the Data type objects Documentation:
"An item extracted from an array, e.g., by indexing, will be a Python object whose type is the scalar type associated with the data type of the array.
Note that the scalar types are not dtype objects, even though they can be used in place of one whenever a data type specification is needed in Numpy."
By using the dtype=np.dtype('object'), you are therefore creating a numpy array that is only filled with scalar types. A numpy array is defined as:
"homogeneous, and contains elements described by a dtype object. A dtype object can be constructed from different combinations of fundamental numeric types."
Numpy Array Docs
Technically, strs[(0,)] is returning a 0d array, shape ().
Using a list index, gives a 1d array
In [170]: strs[[0]]
Out[170]:
array([b'aa'],
dtype='|S4')
A true slice returns that same: strs[0:1], strs[:1].
These methods also work with the object array
In [175]: strs1[:1]
Out[175]: array(['aa'], dtype=object)
strs[(0,)] is the same as strs[0]. In fact, the later is short for the former. It selects an item from the array, reducing the dimensionality by 1 (e.g. from 1d to 0d).
dtype=object is an odd ball case, stretching the normal numpy behavior. So it's not surprising that it behaves a little different in this case. Whether it should or not is something you could ask on the github developers' site.

numpy arrays dimension mismatch

I am using numpy and pandas to attempt to concatenate a number of heterogenous values into a single array.
np.concatenate((tmp, id, freqs))
Here are the exact values:
tmp = np.array([u'DNMT3A', u'p.M880V', u'chr2', 25457249], dtype=object)
freqs = np.array([0.022831050228310501], dtype=object)
id = "id_23728"
The dimensions of tmp, 17232, and freqs are as follows:
[in] tmp.shape
[out] (4,)
[in] np.array(17232).shape
[out] ()
[in] freqs.shape
[out] (1,)
I have also tried casting them all as numpy arrays to no avail.
Although the variable freqs will frequently have more than one value.
However, with both the np.concatenate and np.append functions I get the following error:
*** ValueError: all the input arrays must have same number of dimensions
These all have the same number of columns (0), why can't I concatenate them with either of the above described numpy methods?
All I'm looking to obtain is[(tmp), 17232, (freqs)] in one single dimensional array, which is to be appended onto the end of a pandas dataframe.
Thanks.
Update
It appears I can concatenate the two existing arrays:
np.concatenate([tmp, freqs],axis=0)
array([u'DNMT3A', u'p.M880V', u'chr2', 25457249, 0.022831050228310501], dtype=object)
However, the integer, even when casted cannot be used in concatenate.
np.concatenate([tmp, np.array(17571)],axis=0)
*** ValueError: all the input arrays must have same number of dimensions
What does work, however is nesting append and concatenate
np.concatenate((np.append(tmp, 17571), freqs),)
array([u'DNMT3A', u'p.M880V', u'chr2', 25457249, 17571,
0.022831050228310501], dtype=object)
Although this is kind of messy. Does anyone have a better solution for concatenating a number of heterogeneous arrays?
The problem is that id, and later the integer np.array(17571), are not an array_like object. See here how numpy decides whether an object can be converted automatically to a numpy array or not.
The solution is to make id array_like, i.e. to be an element of a list or tuple, so that numpy understands that id belongs to a 1D array_like structure
It all boils down to
concatenate((tmp, (id,), freqs))
or
concatenate((tmp, [id], freqs))
To avoid this sort of problems when dealing with input variables in functions using numpy, you can use atleast_1d, as pointed out by #askewchan. See about it this question/answer.
Basically, if you are unsure if in different scenarios your variable id will be a single str or a list of str, you are better off using
concatenate((tmp, atleast_1d(id), freqs))
because the two options above will fail if id is already a list/tuple of strings.
EDIT: It may not be obvious why np.array(17571) is not an array_like object. This happens because np.array(17571).shape==(), so it is not iterable as it has no dimensions.

How do I declare an array in Python?

How do I declare an array in Python?
variable = []
Now variable refers to an empty list*.
Of course this is an assignment, not a declaration. There's no way to say in Python "this variable should never refer to anything other than a list", since Python is dynamically typed.
*The default built-in Python type is called a list, not an array. It is an ordered container of arbitrary length that can hold a heterogenous collection of objects (their types do not matter and can be freely mixed). This should not be confused with the array module, which offers a type closer to the C array type; the contents must be homogenous (all of the same type), but the length is still dynamic.
This is surprisingly complex topic in Python.
Practical answer
Arrays are represented by class list (see reference and do not mix them with generators).
Check out usage examples:
# empty array
arr = []
# init with values (can contain mixed types)
arr = [1, "eels"]
# get item by index (can be negative to access end of array)
arr = [1, 2, 3, 4, 5, 6]
arr[0] # 1
arr[-1] # 6
# get length
length = len(arr)
# supports append and insert
arr.append(8)
arr.insert(6, 7)
Theoretical answer
Under the hood Python's list is a wrapper for a real array which contains references to items. Also, underlying array is created with some extra space.
Consequences of this are:
random access is really cheap (arr[6653] is same to arr[0])
append operation is 'for free' while some extra space
insert operation is expensive
Check this awesome table of operations complexity.
Also, please see this picture, where I've tried to show most important differences between array, array of references and linked list:
You don't actually declare things, but this is how you create an array in Python:
from array import array
intarray = array('i')
For more info see the array module: http://docs.python.org/library/array.html
Now possible you don't want an array, but a list, but others have answered that already. :)
I think you (meant)want an list with the first 30 cells already filled.
So
f = []
for i in range(30):
f.append(0)
An example to where this could be used is in Fibonacci sequence.
See problem 2 in Project Euler
This is how:
my_array = [1, 'rebecca', 'allard', 15]
For calculations, use numpy arrays like this:
import numpy as np
a = np.ones((3,2)) # a 2D array with 3 rows, 2 columns, filled with ones
b = np.array([1,2,3]) # a 1D array initialised using a list [1,2,3]
c = np.linspace(2,3,100) # an array with 100 points beteen (and including) 2 and 3
print(a*1.5) # all elements of a times 1.5
print(a.T+b) # b added to the transpose of a
these numpy arrays can be saved and loaded from disk (even compressed) and complex calculations with large amounts of elements are C-like fast.
Much used in scientific environments. See here for more.
JohnMachin's comment should be the real answer.
All the other answers are just workarounds in my opinion!
So:
array=[0]*element_count
A couple of contributions suggested that arrays in python are represented by lists. This is incorrect. Python has an independent implementation of array() in the standard library module array "array.array()" hence it is incorrect to confuse the two. Lists are lists in python so be careful with the nomenclature used.
list_01 = [4, 6.2, 7-2j, 'flo', 'cro']
list_01
Out[85]: [4, 6.2, (7-2j), 'flo', 'cro']
There is one very important difference between list and array.array(). While both of these objects are ordered sequences, array.array() is an ordered homogeneous sequences whereas a list is a non-homogeneous sequence.
You don't declare anything in Python. You just use it. I recommend you start out with something like http://diveintopython.net.
I would normally just do a = [1,2,3] which is actually a list but for arrays look at this formal definition
To add to Lennart's answer, an array may be created like this:
from array import array
float_array = array("f",values)
where values can take the form of a tuple, list, or np.array, but not array:
values = [1,2,3]
values = (1,2,3)
values = np.array([1,2,3],'f')
# 'i' will work here too, but if array is 'i' then values have to be int
wrong_values = array('f',[1,2,3])
# TypeError: 'array.array' object is not callable
and the output will still be the same:
print(float_array)
print(float_array[1])
print(isinstance(float_array[1],float))
# array('f', [1.0, 2.0, 3.0])
# 2.0
# True
Most methods for list work with array as well, common
ones being pop(), extend(), and append().
Judging from the answers and comments, it appears that the array
data structure isn't that popular. I like it though, the same
way as one might prefer a tuple over a list.
The array structure has stricter rules than a list or np.array, and this can
reduce errors and make debugging easier, especially when working with numerical
data.
Attempts to insert/append a float to an int array will throw a TypeError:
values = [1,2,3]
int_array = array("i",values)
int_array.append(float(1))
# or int_array.extend([float(1)])
# TypeError: integer argument expected, got float
Keeping values which are meant to be integers (e.g. list of indices) in the array
form may therefore prevent a "TypeError: list indices must be integers, not float", since arrays can be iterated over, similar to np.array and lists:
int_array = array('i',[1,2,3])
data = [11,22,33,44,55]
sample = []
for i in int_array:
sample.append(data[i])
Annoyingly, appending an int to a float array will cause the int to become a float, without throwing an exception.
np.array retain the same data type for its entries too, but instead of giving an error it will change its data type to fit new entries (usually to double or str):
import numpy as np
numpy_int_array = np.array([1,2,3],'i')
for i in numpy_int_array:
print(type(i))
# <class 'numpy.int32'>
numpy_int_array_2 = np.append(numpy_int_array,int(1))
# still <class 'numpy.int32'>
numpy_float_array = np.append(numpy_int_array,float(1))
# <class 'numpy.float64'> for all values
numpy_str_array = np.append(numpy_int_array,"1")
# <class 'numpy.str_'> for all values
data = [11,22,33,44,55]
sample = []
for i in numpy_int_array_2:
sample.append(data[i])
# no problem here, but TypeError for the other two
This is true during assignment as well. If the data type is specified, np.array will, wherever possible, transform the entries to that data type:
int_numpy_array = np.array([1,2,float(3)],'i')
# 3 becomes an int
int_numpy_array_2 = np.array([1,2,3.9],'i')
# 3.9 gets truncated to 3 (same as int(3.9))
invalid_array = np.array([1,2,"string"],'i')
# ValueError: invalid literal for int() with base 10: 'string'
# Same error as int('string')
str_numpy_array = np.array([1,2,3],'str')
print(str_numpy_array)
print([type(i) for i in str_numpy_array])
# ['1' '2' '3']
# <class 'numpy.str_'>
or, in essence:
data = [1.2,3.4,5.6]
list_1 = np.array(data,'i').tolist()
list_2 = [int(i) for i in data]
print(list_1 == list_2)
# True
while array will simply give:
invalid_array = array([1,2,3.9],'i')
# TypeError: integer argument expected, got float
Because of this, it is not a good idea to use np.array for type-specific commands. The array structure is useful here. list preserves the data type of the values.
And for something I find rather pesky: the data type is specified as the first argument in array(), but (usually) the second in np.array(). :|
The relation to C is referred to here:
Python List vs. Array - when to use?
Have fun exploring!
Note: The typed and rather strict nature of array leans more towards C rather than Python, and by design Python does not have many type-specific constraints in its functions. Its unpopularity also creates a positive feedback in collaborative work, and replacing it mostly involves an additional [int(x) for x in file]. It is therefore entirely viable and reasonable to ignore the existence of array. It shouldn't hinder most of us in any way. :D
How about this...
>>> a = range(12)
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> a[7]
6
Following on from Lennart, there's also numpy which implements homogeneous multi-dimensional arrays.
Python calls them lists. You can write a list literal with square brackets and commas:
>>> [6,28,496,8128]
[6, 28, 496, 8128]
I had an array of strings and needed an array of the same length of booleans initiated to True. This is what I did
strs = ["Hi","Bye"]
bools = [ True for s in strs ]
You can create lists and convert them into arrays or you can create array using numpy module. Below are few examples to illustrate the same. Numpy also makes it easier to work with multi-dimensional arrays.
import numpy as np
a = np.array([1, 2, 3, 4])
#For custom inputs
a = np.array([int(x) for x in input().split()])
You can also reshape this array into a 2X2 matrix using reshape function which takes in input as the dimensions of the matrix.
mat = a.reshape(2, 2)
# This creates a list of 5000 zeros
a = [0] * 5000
You can read and write to any element in this list with a[n] notation in the same as you would with an array.
It does seem to have the same random access performance as an array. I cannot say how it allocates memory because it also supports a mix of different types including strings and objects if you need it to.

Categories