Adding ndarray into dataframe and then back to ndarray - python

I have a ndarray which looks like this:
x
I wanted to add this into an existing dataframe so that I could export it as a csv, and then use that csv in a separate python script, pull out the ndarray and carry out some analysis, mainly so that I don't have one really long python script.
To add it to a dataframe I've done the following:
data["StandardisedFeatures"] = x.tolist()
This looks ok to me. However, in my next script, when I try to pull out the data and put it back into an array, it doesn't look the same: each row is wrapped in single quotes and treated as a string:
data['StandardisedFeatures'].to_numpy()
I've tried astype(float) but it doesn't seem to work, can anyone suggest a way to fix this?
Thanks.

If your list objects in a DataFrame have become strings while processing (happens sometimes), you can use eval or ast.literal_eval functions to convert back from string to list, and use map to do it for every element.
Here is an example which will give you an idea of how to deal with this:
import pandas as pd
import numpy as np
dic = {"a": [1,2,3], "b":[4,5,6], "c": [[1,2,3], [4,5,6], [1,2,3]]}
df = pd.DataFrame(dic)
print("DataFrame:", df, sep="\n", end="\n\n")
print("Column of list to numpy:", df.c.to_numpy(), sep="\n", end="\n\n")
temp = df.c.astype(str).to_numpy()
print("Since your list objects have somehow become str objects while working with df:", temp, sep="\n", end="\n\n")
print("Magic for what you want:", np.array(list(map(eval, temp))), sep="\n", end="\n\n")
Output:
DataFrame:
a b c
0 1 4 [1, 2, 3]
1 2 5 [4, 5, 6]
2 3 6 [1, 2, 3]
Column of list to numpy:
[list([1, 2, 3]) list([4, 5, 6]) list([1, 2, 3])]
Since your list objects have somehow become str objects while working with df:
['[1, 2, 3]' '[4, 5, 6]' '[1, 2, 3]']
Magic for what you want:
[[1 2 3]
[4 5 6]
[1 2 3]]
Note: I have used eval in the example only because more people are familiar with it. You should prefer using ast.literal_eval instead whenever you need eval. This SO post nicely explains why you should do this.
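For the CSV round trip described in the question, a minimal sketch with ast.literal_eval might look like the following. The sample values and the "id" column are made up for illustration, and an in-memory buffer stands in for the real csv file.

```python
import ast
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's standardised feature array.
x = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
data = pd.DataFrame({"id": [1, 2, 3]})
data["StandardisedFeatures"] = x.tolist()

# Write to CSV text and read it back (a buffer replaces the file on disk).
buffer = io.StringIO()
data.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After the round trip every cell is a string such as "[0.1, 0.2]".
first_cell = loaded["StandardisedFeatures"].iloc[0]

# literal_eval only parses Python literals, so it does not execute code.
parsed = loaded["StandardisedFeatures"].map(ast.literal_eval)
restored = np.array(parsed.tolist())
```

The same parser can also be passed to read_csv through its converters argument, so the column arrives as lists straight away.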

Perhaps an alternative and simpler way of solving this issue is to use numpy.save and numpy.load functions. Then you can save the array as a numpy array object and load it again in the next script directly as a numpy array:
import numpy as np
x = np.array([[1, 2], [3, 4]])
# Save the array in the working directory as "x.npy" (extension is automatically inserted)
np.save("x", x)
# Load "x.npy" as a numpy array
x_loaded = np.load("x.npy")

You can save objects of any type in a DataFrame.
You retain their type, but they will be classified as "object" in the pandas.DataFrame.info().
Example: save lists
df = pd.DataFrame(dict(my_list=[[1,2,3,4], [1,2,3,4]]))
print(type(df.loc[0, 'my_list']))
# Prints: <class 'list'>
This is useful if you use your objects directly with pandas.DataFrame.apply().
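As a small illustration of that last point, here is a sketch that applies a function to each stored list. It uses Series.apply on a single column, and the column name "total" is made up for the example.

```python
import pandas as pd

df = pd.DataFrame(dict(my_list=[[1, 2, 3, 4], [5, 6]]))

# The column holds real Python lists, reported by pandas as dtype "object".
column_dtype = str(df["my_list"].dtype)

# apply receives each list unchanged, so any list function works on it.
df["total"] = df["my_list"].apply(sum)
```

This only holds while the frame stays in memory; writing it to CSV turns the lists into text.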

Related

Convert numpy array from space separated to comma separated in python

This is data from a .csv file.
Generally we expect an array/list with comma-separated values, like [1,2,3,4],
but that doesn't seem to be what happens in this case:
data = pd.read_csv('file.csv')
data_array = data.values
print(data_array)
print(type(data_array[0]))
and here is the output data
[16025788 179 '179batch1640694482' 18055630 8317948789 '2021-12-28'
8315780000.0 '6214' 'CA' Nan Nan 'Wireless' '2021-12-28 12:32:46'
'2021-12-28 12:32:46']
<class 'numpy.ndarray'>
So, I am looking for a way to get an array with comma-separated values.
Okay so simply make the changes:
import numpy
converted_str = numpy.array_str(data_array)
# str.replace returns a new string, so the result has to be assigned
converted_str = converted_str.replace(' ',',')
print(converted_str)
Now, if you want to get the output in <class 'numpy.ndarray'> simply convert it back to a numpy array. I hope this helps! 😉
Without the csv or dataframe (or at least a sample) there's some ambiguity as to what your data array is like. But let me illustrate things with a sample.
In [166]: df = pd.DataFrame([['one',2],['two',3]])
the dataframe display:
In [167]: df
Out[167]:
0 1
0 one 2
1 two 3
The array derived from the frame:
In [168]: data = df.values
In [169]: data
Out[169]:
array([['one', 2],
['two', 3]], dtype=object)
In my Ipython session, the display is actually the repr representation of the array. Note the commas, word 'array', and dtype.
In [170]: print(repr(data))
array([['one', 2],
['two', 3]], dtype=object)
A print of the array omits those words and commas. That's the str format. Omitting the commas is normal for numpy arrays, and helps distinguish them from lists. But let me stress that this is just the display style.
In [171]: print(data)
[['one' 2]
['two' 3]]
In [172]: print(data[0])
['one' 2]
We can convert the array to a list:
In [173]: alist = data.tolist()
In [174]: alist
Out[174]: [['one', 2], ['two', 3]]
Commas are a standard part of list display.
But let me stress that commas, or their absence, are only part of the display. Don't confuse that with the underlying distinction between a pandas dataframe, a numpy array, and a Python list.
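To make the display point concrete, here is a short sketch comparing the three text forms of the same array. It also uses np.array2string with a separator, which is one way to get a comma-separated display without changing the data; the variable names are only illustrative.

```python
import numpy as np

data = np.array([["one", 2], ["two", 3]], dtype=object)

as_str = str(data)    # what print(data) shows: no commas
as_repr = repr(data)  # what the interactive prompt shows: commas, "array", dtype

# Same data, but formatted with an explicit separator between elements.
with_commas = np.array2string(data, separator=", ")

# The underlying values are identical in every case.
as_list = data.tolist()
```

Only the strings differ; data itself is never modified by any of these calls.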
Convert to a normal python list first:
print(list(data_array))

pandas.factorize with custom array datatype

Let's start off with a random (reproducible) data array -
# Setup
In [11]: np.random.seed(0)
...: a = np.random.randint(0,9,(7,2))
...: a[2] = a[0]
...: a[4] = a[1]
...: a[6] = a[1]
# Check values
In [12]: a
Out[12]:
array([[5, 0],
[3, 3],
[5, 0],
[5, 2],
[3, 3],
[6, 8],
[3, 3]])
# Check its itemsize
In [13]: a.dtype.itemsize
Out[13]: 8
Let's view each row as a scalar using a custom datatype that covers two elements. We will use a void dtype for this purpose. As mentioned in the docs (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html#specifying-and-constructing-data-types, https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#arrays-interface) and in a Stack Overflow Q&A, it seems that would be -
In [23]: np.dtype((np.void, 16)) # 8 is the itemsize, so 8x2=16
Out[23]: dtype('V16')
# Create new view of the input
In [14]: b = a.view('V16').ravel()
# Check new view array
In [15]: b
Out[15]:
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
dtype='|V16')
# Use pandas.factorize on the new view
In [16]: pd.factorize(b)
Out[16]:
(array([0, 1, 0, 0, 1, 2, 1]),
array(['\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
dtype=object))
Two things in factorize's output that I could not understand, and hence the follow-up questions -
1. The fourth element of the first output (=0) looks wrong, because it has the same ID as the third element, but in b, the fourth and third elements are different. Why so?
2. Why does the second output have an object dtype, while the dtype of b was V16? Is this also causing the wrong value mentioned in 1.?
A bigger question could be - Does pandas.factorize cover custom datatypes? From docs, I see :
values : sequence A 1-D sequence. Sequences that aren’t pandas objects
are coerced to ndarrays before factorization.
In the provided sample case, we have a NumPy array, so one would assume no issues with the input, unless the docs didn't clarify about the custom datatype part?
System setup : Ubuntu 16.04, Python : 2.7.12, NumPy : 1.16.2, Pandas : 0.24.2.
On Python-3.x
System setup : Ubuntu 16.04, Python : 3.5.2, NumPy : 1.16.2, Pandas : 0.24.2.
Running the same setup, I get -
In [18]: b
Out[18]:
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
dtype='|V16')
In [19]: pd.factorize(b)
Out[19]:
(array([0, 1, 0, 2, 1, 3, 1]),
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
dtype=object))
So, the first output off factorize looks alright here. But, the second output has object dtype again, different from the input. So, the same question - Why this dtype change?
Compiling the questions/tl;dr
With such a custom datatype :
Why wrong labels, uniques and different uniques dtype on Python2.x?
Why different uniques dtype on Python3.x?
As for why V16 is coerced to object: many functions in pandas convert data to one of the data types the internal functions can handle. If the data type is not among them, it becomes an object, and it appears that pandas doesn't convert the result back into the original dtype.
Regarding the discrepancy between Python 2 and Python 3: There's only one pandas codebase for both, so why do they give different results?
Turns out that Python 2 uses the string type (which are just arrays of bytes) to represent your data¹, and Python 3 the bytes type. The effect of this is that Python 2 uses a StringHashTable for the factorization and Python 3 uses a PyObjectHashTable, and the StringHashTable gives incorrect results in your case. I believe that this is because the strings in the StringHashTable are assumed to be zero-terminated, which is not the case for your strings – and indeed, if you only compare the rows up to the first zero byte, the first and fourth row look identical.
Conclusion: It's a bug, and we should probably file an issue for it.
¹ More detail: This call to ensure_object returns an array of strings in Python 2, but an array of bytes in Python 3 (as can be seen by the b prefix). Correspondingly, the hashtable chosen here is different.
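The zero-termination hypothesis can be checked without pandas. The sketch below is not pandas' hashtable code: factorize_in_order is a hypothetical helper that labels keys by first appearance, once on the full 16-byte rows and once on the rows cut at the first zero byte. The "<i8" dtype is an assumption chosen to match the byte dumps in the question.

```python
import numpy as np

# Same rows as the question; "<i8" pins 8-byte little-endian integers.
a = np.array([[5, 0], [3, 3], [5, 0], [5, 2], [3, 3], [6, 8], [3, 3]], dtype="<i8")

def factorize_in_order(keys):
    """Label each key by order of first appearance (hypothetical helper)."""
    seen = {}
    return [seen.setdefault(key, len(seen)) for key in keys]

# Full 16-byte representation of every row.
full_keys = [row.tobytes() for row in a]

# The same bytes, cut at the first zero byte like a C string would be.
truncated_keys = [key.split(b"\x00")[0] for key in full_keys]

labels_full = factorize_in_order(full_keys)
labels_truncated = factorize_in_order(truncated_keys)
```

With these inputs, the full keys give the labels reported on Python 3 and the truncated keys give the labels reported on Python 2, which is consistent with the explanation above but does not prove it.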

Python loop through text and set numpy array index

Given a block of text with matrix rows and columns separated by commas and semicolons, I want to parse the text and set the indices of numpy arrays. Here is the code with the variable 'matrixText' representing the base text.
I first create the matrices and then split the text by semicolons and then by commas. I loop through the split text and set each index. However with the text ...
1,2,3;4,5,6;7,8,9
I get the result
7,7,7;8,8,8;9,9,9
temp1=matrixText.split(';')
temp2=temp1[0].split(',')
rows=len(temp1)
columns=len(temp2)
rA=np.zeros((rows, columns))
arrayText=matrixText.split(';')
rowText=range(len(arrayText))
for rowIndex, rowItem in enumerate(arrayText):
    rowText[rowIndex]=arrayText[rowIndex].split(',')
    for colIndex, colItem in enumerate(rowText[rowIndex]):
        rA[[rowIndex, colIndex]]=rowText[rowIndex][colIndex]
I thought that by setting each index, I would avoid any copy by reference issues.
To provide more info, in the first iteration, the 0,0 index is set to 1 and the output of that is then 1,1,1;0,0,0;0,0,0 which I can't figure out since setting one index in the numpy array sets three.
In the second iteration, the index 0-1 is set to 2 and the result is then 2,2,2;2,2,2;0,0,0
The third iteration sets 0-2 to 3 but the result is 3,3,3;2,2,2;3,3,3
Any suggestions?
You can (ab-) use the matrix constructor plus the A property
np.matrix('1,2,3;4,5,6;7,8,9').A
Output:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
matrixText = '1,2,3;4,5,6;7,8,9'
temp1=matrixText.split(';')
temp2=temp1[0].split(',')
rows=len(temp1)
columns=len(temp2)
rA=np.empty((rows, columns),dtype=int)  # np.int was removed in NumPy 1.24; the builtin int works
for n, line in enumerate(temp1):
    rA[n,:]=line.split(',')
Using a nested list-comprehension:
Having defined:
s = "1,2,3;4,5,6;7,8,9"
we can use a nice one-liner:
np.array([[int(c) for c in r.split(",")] for r in s.split(";")])
which would give the following array:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Another one-liner (+1 import):
from io import StringIO
rA = np.loadtxt(StringIO(matrixText.replace(';','\n')), delimiter=',')
So, the problem here was the double brackets in rA[[rowIndex, colIndex]]. A list inside the brackets is an index array that selects whole rows, here rows rowIndex and colIndex, so every cell in those rows was set. This should be rA[rowIndex, colIndex].
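A minimal sketch of the difference, using a fresh 3x3 array for each case (the names rows_selected and cell_selected are only for illustration):

```python
import numpy as np

# A list inside the brackets is an index array: it picks rows 0 and 1.
rows_selected = np.zeros((3, 3))
rows_selected[[0, 1]] = 2

# Two comma-separated integers address a single cell: row 0, column 1.
cell_selected = np.zeros((3, 3))
cell_selected[0, 1] = 2
```

The first result has two full rows of 2s, matching the 2,2,2;2,2,2;0,0,0 seen in the second iteration of the question, while the second changes exactly one value.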

Input a list of arrays of numbers in h5py

I was trying to input a list of numeric arrays into an HDF5 file using h5py. Consider for example:
f = h5py.File('tester.hdf5','w')
b = [[1,2],[1,2,3]]
Storing b as a dataset throws an error.
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
So I am assuming HDF5 doesn't support this.
Just as you can store a list of strings by using a special datatype, is there a way to do the same for a list of numeric arrays?
H5py store list of list of strings
If not, what are the other suitable ways to store a list like this which I can access later from memory.
Thanks for the help in advance
You could split your list of lists into separate datasets and store them separately:
import h5py
my_list = [[1,2],[1,2,3]]
f = h5py.File('tester.hdf5','w')
grp=f.create_group('list_of_lists')
for i,sublist in enumerate(my_list):  # avoid shadowing the builtin name "list"
    grp.create_dataset(str(i),data=sublist)
After doing so, you can slice through your datasets like before with a little variation:
In[1]: grp[str(0)][:].tolist()
Out[1]: [1, 2]
In[2]: grp[str(1)][:].tolist()
Out[2]: [1, 2, 3]
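Another possible layout, not taken from the answer above, is to flatten the ragged list into one values array plus one lengths array. Both are regular numeric arrays, so each could presumably be written with create_dataset like any other array. The sketch below shows only the NumPy side, and the names flat, lengths and restored are made up for illustration.

```python
import numpy as np

my_list = [[1, 2], [1, 2, 3]]

# One flat array with all values, one array with the length of each sublist.
flat = np.concatenate([np.asarray(sub) for sub in my_list])
lengths = np.array([len(sub) for sub in my_list])

# Rebuild the ragged list by splitting at the cumulative lengths.
split_points = np.cumsum(lengths)[:-1]
restored = [part.tolist() for part in np.split(flat, split_points)]
```

This keeps the number of datasets fixed at two no matter how many sublists there are, at the cost of an extra reconstruction step on load.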

finding the max of a column in an array

def maxvalues():
    for n in range(1,15):
        dummy=[]
        for k in range(len(MotionsAndMoorings)):
            dummy.append(MotionsAndMoorings[k][n])
        max(dummy)
        L = [x + [max(dummy)]] ## to be corrected (adding columns with value max(dummy))
        ## suggest code to add new row to L and for next function call, it should save values here.
I have an array of size (k x n) and I need to pick the max value of each column in that array. Is there a simpler way than what I tried? My main aim is to append the result to L as columns rather than rows. If I just append, it adds the values at the end. I would like this to be done in columns for row 0 of L, because I'll call this function again, add a new row to L and do the same. Please suggest.
General suggestions for your code
First of all it's not very handy to access globals in a function. It works but it's not considered good style. So instead of using:
def maxvalues():
    do_something_with(MotionsAndMoorings)
you should do it with an argument:
def maxvalues(array):
    do_something_with(array)
MotionsAndMoorings = something
maxvalues(MotionsAndMoorings) # pass it to the function.
The next strange thing is that you seem to exclude the first column of your array:
for n in range(1,15):
I think that's unintended. The first element of a list has the index 0 and not 1. So I guess you wanted to write:
for n in range(0,15):
or even better for arbitrary lengths:
for n in range(len(array[0])): # length of the first row, i.e. the number of columns
Alternatives to your iterations
But looping like this is not needed, because the max function already accepts a very handy keyword argument (key), so you don't have to iterate over the whole array yourself:
import operator
column = 2
max(array, key=operator.itemgetter(column))[column]
Here max(...) returns the row whose element at index column is maximal (you just define your wanted column as this element). Since that is the whole row, the trailing [column] extracts just that element.
So to get a list of all your maximums for each column you could do:
[max(array, key=operator.itemgetter(column))[column] for column in range(len(array[0]))]
For your L I'm not sure what this is but for that you should probably also pass it as argument to the function:
def maxvalues(array, L): # another argument here
but since I don't know what x and L are supposed to be I'll not go further into that. But it looks like you want to make the columns of MotionsAndMoorings to rows and the rows to columns. If so you can just do it with:
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))] for i in range(len(MotionsAndMoorings[0]))]
that's a list comprehension that converts a list like:
[[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
to an "inverted" column/row list:
[[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
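The same transposition can be sketched with the built-in zip, which also gives the column maxima directly. This is an alternative spelling, not the code from the answer, and it assumes all rows have the same length.

```python
array = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]

# zip(*array) pairs up the i-th elements of every row, i.e. the columns.
transposed = [list(column) for column in zip(*array)]

# One maximum per column, without indexing by position.
column_maxima = [max(column) for column in zip(*array)]
```

If the rows had different lengths, zip would silently stop at the shortest one, so the equal-length assumption matters.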
Alternative packages
But like roadrunner66 already said sometimes it's easiest to use a library like numpy or pandas that already has very advanced and fast functions that do exactly what you want and are very easy to use.
For example you convert a python list to a numpy array simply by:
import numpy as np
Motions_numpy = np.array(MotionsAndMoorings)
you get the maximum of the columns by using:
maximums_columns = np.max(Motions_numpy, axis=0)
you don't even need to convert it to a np.array to use np.max or transpose it (make rows to columns and the columns to rows):
transposed = np.transpose(MotionsAndMoorings)
I hope this answer is not too unstructured. Some parts are suggestions for your function and some are alternatives. You should pick the parts that you need and if you have any trouble with it, just leave a comment or ask another question. :-)
An example with a random input array, showing that you can take the max in either axis easily with one command.
import numpy as np
aa = np.random.random([4,3])
# print is a function in Python 3
print(aa)
print()
print(np.max(aa,axis=0))
print()
print(np.max(aa,axis=1))
Output:
[[ 0.51972266 0.35930957 0.60381998]
[ 0.34577217 0.27908173 0.52146593]
[ 0.12101346 0.52268843 0.41704152]
[ 0.24181773 0.40747905 0.14980534]]
[ 0.51972266 0.52268843 0.60381998]
[ 0.60381998 0.52146593 0.52268843 0.40747905]
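For the part of the question about adding one row of maxima to L per call, a possible sketch is to have the function return the column maxima and let the caller append them. The function name column_maxima and the sample inputs are assumptions, since the question does not show what x and L contain.

```python
import numpy as np

def column_maxima(array):
    """Return the maximum of each column as a plain list (hypothetical helper)."""
    return np.max(np.asarray(array), axis=0).tolist()

# L collects one row of column maxima per call.
L = []
L.append(column_maxima([[1, 2, 3], [4, 5, 6]]))
L.append(column_maxima([[7, 0, 1], [2, 9, 3]]))
```

Returning the values instead of modifying a global keeps the function reusable, in line with the suggestion above to pass data in as arguments.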
