Convert numpy array from space separated to comma separated in Python

This is data from a .csv file. Generally I expect an array/list with comma-separated values like [1, 2, 3, 4], but that does not seem to happen in this case:
data = pd.read_csv('file.csv')
data_array = data.values
print(data_array)
print(type(data_array[0]))
Here is the output data:
[16025788 179 '179batch1640694482' 18055630 8317948789 '2021-12-28'
8315780000.0 '6214' 'CA' Nan Nan 'Wireless' '2021-12-28 12:32:46'
'2021-12-28 12:32:46']
<class 'numpy.ndarray'>
So I am looking for a way to get the array with comma-separated values.

Okay, so simply make these changes:
import numpy
converted_str = numpy.array_str(data_array)
converted_str = converted_str.replace(' ', ',')  # str.replace returns a new string
print(converted_str)
Now, if you want the output as a <class 'numpy.ndarray'>, simply convert it back to a numpy array. I hope this helps! 😉
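If all that's wanted is a comma-separated view of a single row, a simpler sketch (assuming data_array is the object array read above) is to convert the row to a plain Python list, which prints with commas:
row = data_array[0].tolist()  # ndarray row -> plain Python list
print(row)  # e.g. [16025788, 179, '179batch1640694482', ...]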

Without the csv or dataframe (or at least a sample) there's some ambiguity as to what your data array is like. But let me illustrate things with a sample.
In [166]: df = pd.DataFrame([['one',2],['two',3]])
the dataframe display:
In [167]: df
Out[167]:
0 1
0 one 2
1 two 3
The array derived from the frame:
In [168]: data = df.values
In [169]: data
Out[169]:
array([['one', 2],
['two', 3]], dtype=object)
In my IPython session, the display is actually the repr representation of the array. Note the commas, the word 'array', and the dtype.
In [170]: print(repr(data))
array([['one', 2],
['two', 3]], dtype=object)
A print of the array omits those words and commas. That's the str format. Omitting the commas is normal for numpy arrays, and helps distinguish them from lists. But let me stress that this is just the display style.
In [171]: print(data)
[['one' 2]
['two' 3]]
In [172]: print(data[0])
['one' 2]
We can convert the array to a list:
In [173]: alist = data.tolist()
In [174]: alist
Out[174]: [['one', 2], ['two', 3]]
Commas are a standard part of list display.
But let me stress: commas or not, this is part of the display. Don't confuse that with the underlying distinction between a pandas dataframe, a numpy array, and a Python list.

Convert to a normal python list first:
print(list(data_array))
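A small caveat worth sketching (not part of the original answer): for a 2-D array, list() only converts the outer level, so each row still prints numpy-style without commas, whereas tolist() converts every level:
import numpy as np
data = np.array([['one', 2], ['two', 3]], dtype=object)
print(list(data))     # [array(['one', 2], dtype=object), array(['two', 3], dtype=object)]
print(data.tolist())  # [['one', 2], ['two', 3]]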

Related

Adding ndarray into dataframe and then back to ndarray

I have an ndarray which looks like this:
x
I wanted to add this into an existing dataframe so that I could export it as a csv, and then use that csv in a separate python script, pull out the ndarray and carry out some analysis, mainly so that I don't have one really long python script.
To add it to a dataframe I've done the following:
data["StandardisedFeatures"] = x.tolist()
This looks OK to me. However, in my next script, when I try to pull the data out and put it back as an array, it doesn't appear the same; it's wrapped in single quotes and treated as a string:
data['StandardisedFeatures'].to_numpy()
I've tried astype(float) but it doesn't seem to work. Can anyone suggest a way to fix this?
Thanks.
If your list objects in a DataFrame have become strings while processing (happens sometimes), you can use eval or ast.literal_eval functions to convert back from string to list, and use map to do it for every element.
Here is an example which will give you an idea of how to deal with this:
import pandas as pd
import numpy as np
dic = {"a": [1,2,3], "b":[4,5,6], "c": [[1,2,3], [4,5,6], [1,2,3]]}
df = pd.DataFrame(dic)
print("DataFrame:", df, sep="\n", end="\n\n")
print("Column of list to numpy:", df.c.to_numpy(), sep="\n", end="\n\n")
temp = df.c.astype(str).to_numpy()
print("Since your list objects have somehow become str objects while working with df:", temp, sep="\n", end="\n\n")
print("Magic for what you want:", np.array(list(map(eval, temp))), sep="\n", end="\n\n")
Output:
DataFrame:
a b c
0 1 4 [1, 2, 3]
1 2 5 [4, 5, 6]
2 3 6 [1, 2, 3]
Column of list to numpy:
[list([1, 2, 3]) list([4, 5, 6]) list([1, 2, 3])]
Since your list objects have somehow become str objects while working with df:
['[1, 2, 3]' '[4, 5, 6]' '[1, 2, 3]']
Magic for what you want:
[[1 2 3]
[4 5 6]
[1 2 3]]
Note: I have used eval in the example only because more people are familiar with it. You should prefer using ast.literal_eval instead whenever you need eval. This SO post nicely explains why you should do this.
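For reference, a minimal sketch of the same conversion with ast.literal_eval in place of eval (assuming temp is the string array from the snippet above):
import numpy as np
from ast import literal_eval
# literal_eval only parses Python literals, so it is safer than eval
restored = np.array(list(map(literal_eval, temp)))
print(restored)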
Perhaps an alternative and simpler way of solving this issue is to use numpy.save and numpy.load functions. Then you can save the array as a numpy array object and load it again in the next script directly as a numpy array:
import numpy as np
x = np.array([[1, 2], [3, 4]])
# Save the array in the working directory as "x.npy" (extension is automatically inserted)
np.save("x", x)
# Load "x.npy" as a numpy array
x_loaded = np.load("x.npy")
You can save objects of any type in a DataFrame.
You retain their type, but they will be classified as "object" in the pandas.DataFrame.info().
Example: save lists
df = pd.DataFrame(dict(my_list=[[1,2,3,4], [1,2,3,4]]))
print(type(df.loc[0, 'my_list']))
# Prints: <class 'list'>
This is useful if you use your objects directly with pandas.DataFrame.apply().
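A quick sketch of that point (the column name my_list is just the example from above): since each cell is a real list object, apply() can use list operations directly:
import pandas as pd
df = pd.DataFrame(dict(my_list=[[1, 2, 3, 4], [1, 2, 3, 4]]))
print(df['my_list'].apply(len))  # each row reports the length of its stored list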

pandas.factorize with custom array datatype

Let's start off with a random (reproducible) data array -
# Setup
In [11]: np.random.seed(0)
...: a = np.random.randint(0,9,(7,2))
...: a[2] = a[0]
...: a[4] = a[1]
...: a[6] = a[1]
# Check values
In [12]: a
Out[12]:
array([[5, 0],
[3, 3],
[5, 0],
[5, 2],
[3, 3],
[6, 8],
[3, 3]])
# Check its itemsize
In [13]: a.dtype.itemsize
Out[13]: 8
Let's view each row as a scalar using a custom datatype that covers two elements. We will use the void dtype for this purpose. As mentioned in the docs (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html#specifying-and-constructing-data-types, https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#arrays-interface) and in Stack Overflow Q&As, it seems that would be -
In [23]: np.dtype((np.void, 16)) # 8 is the itemsize, so 8x2=16
Out[23]: dtype('V16')
# Create new view of the input
In [14]: b = a.view('V16').ravel()
# Check new view array
In [15]: b
Out[15]:
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
dtype='|V16')
# Use pandas.factorize on the new view
In [16]: pd.factorize(b)
Out[16]:
(array([0, 1, 0, 0, 1, 2, 1]),
array(['\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
dtype=object))
Two things about factorize's output that I could not understand, and hence the follow-up questions -
The fourth element of the first output (=0) looks wrong, because it has the same ID as the third element, but in b, the fourth and third elements are different. Why so?
Why does the second output have an object dtype, while the dtype of b was V16? Is this also causing the wrong value mentioned in 1.?
A bigger question could be - Does pandas.factorize cover custom datatypes? From the docs, I see:
values : sequence A 1-D sequence. Sequences that aren’t pandas objects
are coerced to ndarrays before factorization.
In the provided sample case, we have a NumPy array, so one would assume no issues with the input, unless the docs didn't clarify about the custom datatype part?
System setup: Ubuntu 16.04, Python 2.7.12, NumPy 1.16.2, Pandas 0.24.2.
On Python-3.x
System setup: Ubuntu 16.04, Python 3.5.2, NumPy 1.16.2, Pandas 0.24.2.
Running the same setup, I get -
In [18]: b
Out[18]:
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
dtype='|V16')
In [19]: pd.factorize(b)
Out[19]:
(array([0, 1, 0, 2, 1, 3, 1]),
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
dtype=object))
So, the first output of factorize looks alright here. But the second output has object dtype again, different from the input. So, the same question - why this dtype change?
Compiling the questions/tl;dr
With such a custom datatype :
Why wrong labels, uniques and different uniques dtype on Python2.x?
Why different uniques dtype on Python3.x?
As for why V16 is coerced to object, many functions in pandas convert data to one of the data types the internal functions can handle, here. If the data type is not in the list, it becomes an object – and pandas doesn't convert the result back into the original dtype, it appears.
Regarding the discrepancy between Python 2 and Python 3: There's only one pandas codebase for both, so why do they give different results?
Turns out that Python 2 uses the string type (which is just an array of bytes) to represent your data¹, and Python 3 the bytes type. The effect of this is that Python 2 uses a StringHashTable for the factorization and Python 3 uses a PyObjectHashTable, and the StringHashTable gives incorrect results in your case. I believe that this is because the strings in the StringHashTable are assumed to be zero-terminated, which is not the case for your strings – and indeed, if you only compare the rows up to the first zero byte, the first and fourth rows look identical.
Conclusion: It's a bug, and we should probably file an issue for it.
¹ More detail: This call to ensure_object returns an array of strings in Python 2, but an array of bytes in Python 3 (as can be seen by the b prefix). Correspondingly, the hashtable chosen here is different.
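A small sketch that makes the zero-termination hypothesis visible (reproducing the setup above and assuming 64-bit integers, so the itemsize is 8):
import numpy as np
np.random.seed(0)
a = np.random.randint(0, 9, (7, 2))
a[2] = a[0]; a[4] = a[1]; a[6] = a[1]
b = a.view('V16').ravel()
# Rows 0 and 3 differ over the full 16 bytes, but agree up to the first
# zero byte, which is all a zero-terminated string comparison would see.
r0, r3 = b[0].tobytes(), b[3].tobytes()
print(r0 == r3)                                      # False
print(r0.split(b'\x00')[0] == r3.split(b'\x00')[0])  # True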

How to efficiently extract values from nested numpy arrays generated by loadmat function?

Is there a more efficient method in Python to extract data from a nested python list such as A = array([[array([[12000000]])]], dtype=object)? I have been using A[0][0][0][0], but it does not seem to be an efficient method when you have lots of data like A.
I have also used
numpy.squeeze(array([[array([[12000000]])]], dtype=object)) but this gives me
array(array([[12000000]]), dtype=object)
PS: The nested array was generated by loadmat() function in scipy module to load a .mat file which consists of nested structures.
Creating such an array is a bit tedious, but loadmat does it to handle the MATLAB cells and 2d matrix:
In [5]: A = np.empty((1,1),object)
In [6]: A[0,0] = np.array([[1.23]])
In [7]: A
Out[7]: array([[array([[ 1.23]])]], dtype=object)
In [8]: A.any()
Out[8]: array([[ 1.23]])
In [9]: A.shape
Out[9]: (1, 1)
squeeze compresses the shape, but does not cross the object boundary
In [10]: np.squeeze(A)
Out[10]: array(array([[ 1.23]]), dtype=object)
but if you have one item in an array (regardless of shape) item() can extract it. Indexing also works, A[0,0]
In [11]: np.squeeze(A).item()
Out[11]: array([[ 1.23]])
item again to extract the number from that inner array:
In [12]: np.squeeze(A).item().item()
Out[12]: 1.23
Or we don't even need the squeeze:
In [13]: A.item().item()
Out[13]: 1.23
loadmat has a squeeze_me parameter.
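For completeness, a minimal sketch (the .mat file name here is hypothetical) of letting loadmat unwrap things itself:
from scipy.io import loadmat
# squeeze_me=True squeezes away the singleton MATLAB matrix dimensions on load
mat = loadmat('file.mat', squeeze_me=True)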
Indexing is just as easy:
In [17]: A[0,0]
Out[17]: array([[ 1.23]])
In [18]: A[0,0][0,0]
Out[18]: 1.23
astype can also work (though it can be picky about the number of dimensions).
In [21]: A.astype(float)
Out[21]: array([[ 1.23]])
With single item arrays like this, efficiency isn't much of an issue. All these methods are quick. Things become more complicated when the array has many items, or the items are themselves large.
How to access elements of numpy ndarray?
You could use A.all() or A.any() to get a scalar. This would only work if A contains one element.
Try A.flatten()[0]
This will flatten the array into a single dimension and extract the first item from it. In your case, the first item is the only item.
What worked in my case was the following:
import scipy.io
xcat = scipy.io.loadmat(os.path.join(dir_data, file_name))
pars = xcat['pars'] # Extract numpy.void element from the loadmat object
# Note that you are dealing with a numpy structured array object when you enter pars[0][0].
# Thus you can access names and all that...
dict_values = [x[0][0] for x in pars[0][0]] # Extract all elements in one go
dict_keys = list(pars.dtype.names) # Extract the corresponding names/tags
dict_xcat = dict(zip(dict_keys, dict_values)) # Pack it up again in a dict
The idea behind this is: first extract ALL the values I want, and format them in a nice Python dict. This saves me from cumbersome indexing later in the file.
Of course, this is a very specific solution, since in my case the values I needed were all floats/ints.

Conversion from U3 dtype to ascii

I am reading data from a .mat file. The data is in the form of a numpy array.
[array([u'ABT'], dtype='<U3')]
This is one element of the array. I want to get only the value 'ABT' from the array. Unicode normalize and Encode to ascii functions do not work.
encode is a string method, so it can't work directly on an array of strings. But there are several ways of applying it to each string.
Here I'm working in Py3, so the default is unicode.
In [179]: A=np.array(['one','two'])
In [180]: A
Out[180]:
array(['one', 'two'],
dtype='<U3')
plain iteration:
In [181]: np.array([s.encode() for s in A])
Out[181]:
array([b'one', b'two'],
dtype='|S3')
np.char has functions that apply string methods to each element of an array:
In [182]: np.char.encode(A)
Out[182]:
array([b'one', b'two'],
dtype='|S3')
but it looks like this is one of the conversions that astype can handle:
In [183]: A.astype('<S3')
Out[183]:
array([b'one', b'two'],
dtype='|S3')
And inspired by a recent question about np.chararray:
What happened to numpy.chararray
In [191]: Ac=np.char.array(A)
In [192]: Ac
Out[192]:
chararray(['one', 'two'],
dtype='<U3')
In [193]: Ac.encode()
Out[193]:
array([b'one', b'two'],
dtype='|S3')
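If the goal is simply the plain Python string 'ABT' out of a single element, indexing or item() is enough (a sketch using the array from the question):
import numpy as np
element = np.array([u'ABT'], dtype='<U3')
print(element[0])      # 'ABT' as a numpy str_ scalar
print(element.item())  # 'ABT' as a plain Python str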

Pandas df.columns.values.tostring()

When I use the following on a df...
df.columns.values.tostring()
I get the following, which is not at all like my column names (and there are far fewer columns than that). When I omit "tostring()", I just get the column names.
b'0\x16B\n\x00\x00\x00\x00p\x84P\n\x00\x00\x00\x00\xf0\xe7x\t\x00\x00\x00\x00\xb0\xf3J\n\x00\x00\x00\x00p\xfc\t\x0c\x00\x00\x00\x000\xad\xd7\x00\x00\x00\x00\x00p\xae\xd7\x00\x00\x00\x00\x00\xf0\xab\xd7\x00\x00\x00\x00\x00(9\x05\x01\x00\x00\x00\x00\xf0\xa7\xdd\x0b\x00\x00\x00\x00p\xac\xdd\x0b\x00\x00\x00\x00\xf0\xed\xc1\x00\x00\x00\x00\x00\xb0\xa3\xdd\x0b\x00\x00\x00\x000g\xdd\x0b\x00\x00\x00\x00p\xf2\xb2\x0c\x00\x00\x00\x000\xf1\xb2\x0c\x00\x00\x00\x00\xf0\xf0\xb2\x0c\x00\x00\x00\x00\xb0\xf0\xb2\x0c\x00\x00\x00\x00\xa0w\x9a\x05\x00\x00\x00\x000\xae\xd7\x00\x00\x00\x00\x00\x90\x9c\xe4\x00\x00\x00\x00\x00\xd0U\n\x0c\x00\x00\x00\x00\xb0\xfa\t\x0c\x00\x00\x00\x00\xb0\n\xca\x00\x00\x00\x00\x00\x88\x8e\xbb\x00\x00\x00\x00\x00\xf0\x05\xca\x00\x00\x00\x00\x00\x90<y\t\x00\x00\x00\x00\x18?y\t\x00\x00\x00\x00\xb0\x01\xca\x00\x00\x00\x00\x00\xb0=y\t\x00\x00\x00\x00\xf8=y\t\x00\x00\x00\x00p\xac\xd7\x00\x00\x00\x00\x00\xb0\xad\xd7\x00\x00\x00\x00\x00'
I can't figure out why. The df is a product of several instances of pd.merge and type conversions.
This isn't really a pandas thing, it's a numpy thing. df.columns.values gives us a numpy array:
>>> df = pd.DataFrame({"A": [1,2,3], "B": [4,5,6]})
>>> df
A B
0 1 4
1 2 5
2 3 6
>>> df.columns
Index(['A', 'B'], dtype='object')
>>> df.columns.values
array(['A', 'B'], dtype=object)
The tostring method of a numpy array promises:
Construct Python bytes containing the raw data bytes in the array.
Constructs Python bytes showing a copy of the raw contents of data memory. The bytes object can be produced in either ‘C’ or ‘Fortran’, or ‘Any’ order (the default is ‘C’-order). ‘Any’ order means C-order unless the F_CONTIGUOUS flag in the array is set, in which case it means ‘Fortran’ order.
This function is a compatibility alias for tobytes. Despite its name it returns bytes not strings.
which is why you get something messy:
>>> df.columns.values.tostring()
b'\xe0N\x0e\xb7\x00\\\x14\xb7'
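If the goal was just to get readable column names out (rather than the raw memory bytes), a couple of sketches:
>>> df.columns.tolist()     # plain Python list of names
['A', 'B']
>>> ','.join(df.columns)    # one comma-separated string
'A,B'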
