numpy arrays dimension mismatch - python

I am using numpy and pandas to attempt to concatenate a number of heterogenous values into a single array.
np.concatenate((tmp, id, freqs))
Here are the exact values:
tmp = np.array([u'DNMT3A', u'p.M880V', u'chr2', 25457249], dtype=object)
freqs = np.array([0.022831050228310501], dtype=object)
id = "id_23728"
The dimensions of tmp, 17232, and freqs are as follows:
[in] tmp.shape
[out] (4,)
[in] np.array(17232).shape
[out] ()
[in] freqs.shape
[out] (1,)
I have also tried casting them all as numpy arrays to no avail.
Although the variable freqs will frequently have more than one value.
However, with both the np.concatenate and np.append functions I get the following error:
*** ValueError: all the input arrays must have same number of dimensions
These all have the same number of columns (0), why can't I concatenate them with either of the above described numpy methods?
All I'm looking to obtain is[(tmp), 17232, (freqs)] in one single dimensional array, which is to be appended onto the end of a pandas dataframe.
Thanks.
Update
It appears I can concatenate the two existing arrays:
np.concatenate([tmp, freqs],axis=0)
array([u'DNMT3A', u'p.M880V', u'chr2', 25457249, 0.022831050228310501], dtype=object)
However, the integer, even when casted cannot be used in concatenate.
np.concatenate([tmp, np.array(17571)],axis=0)
*** ValueError: all the input arrays must have same number of dimensions
What does work, however is nesting append and concatenate
np.concatenate((np.append(tmp, 17571), freqs),)
array([u'DNMT3A', u'p.M880V', u'chr2', 25457249, 17571,
0.022831050228310501], dtype=object)
Although this is kind of messy. Does anyone have a better solution for concatenating a number of heterogeneous arrays?

The problem is that id, and later the integer np.array(17571), are not an array_like object. See here how numpy decides whether an object can be converted automatically to a numpy array or not.
The solution is to make id array_like, i.e. to be an element of a list or tuple, so that numpy understands that id belongs to a 1D array_like structure
It all boils down to
concatenate((tmp, (id,), freqs))
or
concatenate((tmp, [id], freqs))
To avoid this sort of problems when dealing with input variables in functions using numpy, you can use atleast_1d, as pointed out by #askewchan. See about it this question/answer.
Basically, if you are unsure if in different scenarios your variable id will be a single str or a list of str, you are better off using
concatenate((tmp, atleast_1d(id), freqs))
because the two options above will fail if id is already a list/tuple of strings.
EDIT: It may not be obvious why np.array(17571) is not an array_like object. This happens because np.array(17571).shape==(), so it is not iterable as it has no dimensions.

Related

Reinterpret data in numpy ndarray

I have a numpy array with dtype=uint8 and shape=(N,4) and I want to reinterpret the 4 bytes along the axis=1 efficiently as dtype=int32 and get a resulting shape=(N,) but nothing I've tried works. The equivalent in c would be brutally casting the pointer of the array.
The initial array is created like this from a pandas dataframe:
tmp=df[['data_1','data_2','data_3','data_4']].values.astype('uint8')
But then this works but it's not vectorized:
tmp1=np.empty((tmp.shape[0],),dtype=np.int32)
for i in range(tmp.shape[0]):
tmp2=tmp[i].copy()
tmp1[i]=tmp2.view('<i4')
And this, which I understand as the efficient way to do it, doesn't:
tmp1=tmp.view('<i4')
Giving the error:
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.
But the size should be correct as far as I understand.
edit: added the reinterpeted explanation
Assuming you actually want the output shape to be (N*4,) (not (N,) as you wrote initially), you can just flatten it and then cast it to your desired type:
tmp1 = tmp.flatten().astype('int32', copy=False)
EDIT:
If you actually want the same underlying data to be interpreted as a different type and get a (N,) array out, the view method is in fact the way to go. This for example works for me:
import numpy as np
N = 5
a = np.arange(N*4, dtype='uint8').reshape((N,4))
a.view('int32')[:,0]
That view is then array([ 50462976, 117835012, 185207048, 252579084, 319951120], dtype=int32).

what the differences between ndarray and list in python?

last week, my teacher asks us: when storing integers from one to one hundred, what the differences between using list and using ndarray. I never use numpy before, so I search this question on the website.
But all my search result told me, they just have dimension difference. Ndarray can store N dimension data, while list storge one. That doesn't satisfy me. Is it really simple, just my overthinking, Or I didn't find the right keyword to search?
I need help.
There are several differences:
-You can append elements to a list, but you can't change the size of a ´numpy.ndarray´ without making a full copy.
-Lists can containt about everything, in numpy arrays all the elements must have the same type.
-In practice, numpy arrays are faster for vectorial functions than mapping functions to lists.
-I think than modification times is not an issue, but iteration over the elements is.
Numpy arrays have many array related methods (´argmin´, ´min´, ´sort´, etc).
I prefer to use numpy arrays when I need to do some mathematical operations (sum, average, array multiplication, etc) and list when I need to iterate in 'items' (strings, files, etc).
A one-dimensional array is like one row graph paper .##
You can store one thing inside of each box
The following picture is an example of a 2-dimensional array
Two-dimensional arrays have rows and columns
I should have changed the numbers.
When I was drawing the picture I just copied the first row many times.
The numbers can be completely different on each row.
import numpy as np
lol = [[1, 2, 3], [4, 5, 6]]
# `lol` is a list of lists
arr_har = np.array(lol, np.int32)
print(type(arr_har)) # <class 'numpy.ndarray'>
print("BEFORE:")
print(arr_har)
# change the value in row 0 and column 2.
arr_har[0][2] = 999
print("\n\nAFTER arr_har[0][2] = 999:")
print(arr_har)
The following picture is an example of a 3-dimensional array
Summary/Conclusion:
A list in Python acts like a one-dimensional array.
ndarray is an abbreviation of "n-dimensional array" or "multi-dimensional array"
The difference between a Python list and an ndarray, is that an ndarray has 2 or more dimensions

NumPy: is assignment of a scalar to a slice broadcasting?

I know in Python,
[1,2,3][0:2]=7
doesn't work because the right side must be an iterable.
However, the same thing works for NumPy ndarrays:
a=np.array([1,2,3])
a[0:2]=9
a
Is this the same mechanism as broadcasting? On https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html, it is said broadcasting is only for arithmetic operations.
Yes, assignment follows the same rules of broadcasting because you can also assign an array to another array's items. This however requires that the second array's shape to be broadcastable to destination slice/array shape.
This is also mentioned in Assigning values to indexed arrays documentation:
As mentioned, one can select a subset of an array to assign to using a single index, slices, and index and mask arrays. The value being assigned to the indexed array must be shape consistent (the same shape or broadcastable to the shape the index produces).

numpy array concatenation error: 0-d arrays can't be concatenated

I am trying to concatenate two numpy arrays, but I got this error. Could some one give me a bit clue about what this actually means?
Import numpy as np
allValues = np.arange(-1, 1, 0.5)
tmp = np.concatenate(allValues, np.array([30], float))
Then I got
ValueError: 0-d arrays can't be concatenated
If I do
tmp = np.concatenate(allValues, np.array([50], float))
There is no error message but tmp variable does not reflect the concatenation either.
You need to put the arrays you want to concatenate into a sequence (usually a tuple or list) in the argument.
tmp = np.concatenate((allValues, np.array([30], float)))
tmp = np.concatenate([allValues, np.array([30], float)])
Check the documentation for np.concatenate. Note that the first argument is a sequence (e.g. list, tuple) of arrays. It does not take them as separate arguments.
As far as I know, this API is shared by all of numpy's concatenation functions: concatenate, hstack, vstack, dstack, and column_stack all take a single main argument that should be some sequence of arrays.
The reason you are getting that particular error is that arrays are sequences as well. But this means that concatenate is interpreting allValues as a sequence of arrays to concatenate. However, each element of allValues is a float rather than an array, and is therefore being interpreted as a zero-dimensional array. As the error says, these "arrays" cannot be concatenated.
The second argument is taken as the second (optional) argument of concatenate, which is the axis to concatenate on. This only works because there is a single element in the second argument, which can be cast as an integer and therefore is a valid value. If you had put an array with more elements in the second argument, you would have gotten a different error:
a = np.array([1, 2])
b = np.array([3, 4])
np.concatenate(a, b)
# TypeError: only length-1 arrays can be converted to Python scalars
Also make sure you are concatenating two numpy arrays. I was concatenating one python array with a numpy array and it was giving me the same error:
ValueError: 0-d arrays can't be concatenated
It took me some time to figure this out since all the answers in stackoverflow were assuming that you had two numpy arrays.
Pretty silly but easily overlooked mistake. Hence posting just in case this helps someone.
Here are the links to converting an existing python array using np.asarray
or
create np arrays, if it helps.
Another way to get this error is to have two numpy objects of different... types?
I get this error when I try np.concatenate([A,B])
and ValueError: all the input arrays must have same number of dimensions when I run np.concatenate([B,A])
Just as #mithunpaul mentioned, my types are off: A is an array of 44279x204 and B is a <44279x12 sparse matrix of type '<class 'numpy.float64'>' with 88558 stored elements in Compressed Sparse Row format>)
So that's why the error is happening. Don't know how to solve it yet though.

numpy loadtxt single line/row as list

I have a data file with only one line like:
1.2 2.1 3.2
I used numpy version 1.3.0 loadtxt to load it
a,b,c = loadtxt("data.dat", usecols(0,1,2), unpack=True)
The output was a float instead of array like
a = 1.2
I expect it would be:
a = array([1.2])
If i read a file with multiple lines, it's working.
Simply use the numpy's inbuit loadtxt parameter ndmin.
a,b,c=np.loadtxt('data.dat',ndmin=2,unpack=True)
output
a=[1.2]
What is happening is that when you load the array you obtain a monodimensional one. When you unpack it, it obtain a set of numbers, i.e. array without dimension. This is because when you unpack an array, it decrease it's number of dimension by one. starting with a monodimensional array, it boil down to a simple number.
If you test for the type of a, it is not a float, but a numpy.float, that has all the properties of an array but a void tuple as shape. So it is an array, just is not represented as one.
If what you need is a monodimensional array with just one element, the simplest way is to reshape your array before unpacking it:
#note the reshape function to transform the shape
a,b,c = loadtxt("text.txt").reshape((-1,1))
This gives you the expected result. What is happening is that whe reshaped it into a bidimensional array, so that when you unpack it, the number of dimensions go down to one.
EDIT:
If you need it to work normally for multidimensional array and to keep one-dimensional when you read onedimensional array, I thik that the best way is to read normally with loadtxt and reshape you arrays in a second phase, converting them to monodimensional if they are pure numbers
a,b,c = loadtxt("text.txt",unpack=True)
for e in [a,b,c]
e.reshape(e.shape if e.shape else (-1,))
The simple way without using reshape is, to explicitly typecast the list
a,b,c = loadtxt("data.dat", usecols(0,1,2), unpack=True)
a,b,c = (a,b,c) if usi.shape else ([a], [b], [c])
This works faster than the reshape!

Categories