I would like to convert everything but the first column of a pandas DataFrame into a NumPy array. For some reason, using the columns= parameter of DataFrame.as_matrix() is not working.
df:
  viz  a1_count  a1_mean     a1_std
0   n         3        2   0.816497
1   n         0      NaN        NaN
2   n         2       51  50.000000
I tried X = df.as_matrix(columns=[df[1:]]), but this yields an array of all NaNs.
The easy way is to use the values property: df.iloc[:, 1:].values
a = df.iloc[:, 1:]         # still a DataFrame
b = df.iloc[:, 1:].values  # a NumPy array
print(type(df))
print(type(a))
print(type(b))
which prints the following types:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
Please use the pandas to_numpy() method. Below is an example:
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1, 2], "B":[3, 4], "C":[5, 6]})
>>> df
A B C
0 1 3 5
1 2 4 6
>>> s_array = df[["A", "B", "C"]].to_numpy()
>>> s_array
array([[1, 3, 5],
       [2, 4, 6]])
>>> t_array = df[["B", "C"]].to_numpy()
>>> print (t_array)
[[3 5]
 [4 6]]
Hope this helps. You can select any number of columns using
columns = ['col1', 'col2', 'col3']
df1 = df[columns]
Then apply the to_numpy() method, as sketched below.
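A minimal sketch of those two steps (the column names col1, col2, col3 are placeholders; substitute your own):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6], 'other': [7, 8]})
columns = ['col1', 'col2', 'col3']
arr = df[columns].to_numpy()  # select the columns, then convert
print(arr)
# [[1 3 5]
#  [2 4 6]]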
The columns parameter accepts a collection of column names. You're passing a list containing a dataframe with two rows:
>>> [df[1:]]
[  viz  a1_count  a1_mean  a1_std
1    n         0      NaN     NaN
2    n         2       51      50]
>>> df.as_matrix(columns=[df[1:]])
array([[ nan,  nan],
       [ nan,  nan],
       [ nan,  nan]])
Instead, pass the column names you want:
>>> df.columns[1:]
Index(['a1_count', 'a1_mean', 'a1_std'], dtype='object')
>>> df.as_matrix(columns=df.columns[1:])
array([[ 3.      ,  2.      ,  0.816497],
       [ 0.      ,       nan,       nan],
       [ 2.      , 51.      , 50.      ]])
Hope this easy one-liner helps:
cols_as_np = df[df.columns[1:]].to_numpy()
The best way to convert to a NumPy array is .to_numpy(dtype=None, copy=False), which is new in version 0.24.0 (see the pandas reference).
You can also use .array (see the pandas reference).
Note that .as_matrix() has been deprecated since version 0.23.0.
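A minimal sketch of the difference between the two (behavior as of pandas >= 0.24):
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.to_numpy())   # [1 2 3]  -> a numpy.ndarray
print(type(s.array))  # the pandas ExtensionArray backing the Series, not an ndarray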
Instead of .as_matrix(), use .values, because the former was deprecated and has since been removed. Calling it now raises:
AttributeError: 'DataFrame' object has no attribute 'as_matrix'
The fastest and easiest way used to be .as_matrix() (note: it has since been removed from pandas; see the to_numpy() answers above). One short line:
df.iloc[:, [1, 2, 3]].as_matrix()
Gives:
array([[3, 2, 0.816497],
       [0, 'NaN', 'NaN'],
       [2, 51, 50.0]], dtype=object)
Because this uses column indices rather than names, the code works for any dataframe, regardless of its column names.
Here are the steps for your example:
import pandas as pd
columns = ['viz', 'a1_count', 'a1_mean', 'a1_std']
index = [0,1,2]
vals = {'viz': ['n','n','n'], 'a1_count': [3,0,2], 'a1_mean': [2,'NaN', 51], 'a1_std': [0.816497, 'NaN', 50.000000]}
df = pd.DataFrame(vals, columns=columns, index=index)
Gives:
  viz  a1_count  a1_mean    a1_std
0   n         3        2  0.816497
1   n         0      NaN       NaN
2   n         2       51        50
Then:
x1 = df.iloc[:,[1,2,3]].as_matrix()
Gives:
array([[3, 2, 0.816497],
       [0, 'NaN', 'NaN'],
       [2, 51, 50.0]], dtype=object)
where x1 is a numpy.ndarray.
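Since .as_matrix() has been removed from recent pandas versions, a modern equivalent of that last step (assuming the df built above) would be:
x1 = df.iloc[:, [1, 2, 3]].to_numpy()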
Related
I have a 1d NumPy array with NaN and non-NaN values:
arr = np.array([4, np.nan, np.nan, 3, np.nan, 5])
I need to replace the NaN values with the previous non-NaN value, and replace the non-NaN values with NaN, as below:
result = np.array([np.nan, 4, 4, np.nan, 3, 5])
Thanks in advance.
Tommaso
If Pandas is available, you can use ffill(), and then replace the original non-NaN values using a boolean mask:
import numpy as np
import pandas as pd

arr = np.array([4, np.nan, np.nan, 3, np.nan, 5])
arr2 = pd.Series(arr).ffill()
mask = ~np.isnan(arr)  # all elements which started non-NaN
mask[-1] = False       # keep the last element, which is never forward-filled onward
arr2[mask] = np.nan    # replace the original non-NaNs with NaN
Output:
arr2
0    NaN
1    4.0
2    4.0
3    NaN
4    3.0
5    5.0
dtype: float64
If not using pandas:
arr = np.array([4.1, np.nan, np.nan, 3.1, np.nan, 5.1])
arr2 = np.array([])
val = 0
for n in arr:
    if not np.isnan(n):
        val = n                         # remember the latest non-NaN value
        arr2 = np.append(arr2, np.nan)  # non-NaN positions become NaN
    else:
        arr2 = np.append(arr2, val)     # NaN positions get the remembered value
output:
arr
array([4.1, nan, nan, 3.1, nan, 5.1])
arr2
array([nan, 4.1, 4.1, nan, 3.1, nan])
(Note that, unlike the pandas version above, this turns the final non-NaN value into NaN instead of keeping it.)
I have a pandas DataFrame df
                 L   C
0        [1, 2, 3]   5
1      [4, nan, 6]   0
2  [nan, nan, nan]  15
and another DataFrame other
    C
0   0
1  25
2   0
Then I append other to df, which adds 3 rows with NaN values in the L column:
                 L   C
0        [1, 2, 3]   5
1      [4, nan, 6]   0
2  [nan, nan, nan]  15
0              NaN   0
1              NaN  25
2              NaN   0
I want to create a column that gets the value 1 if the L column is NaN and C is 0, and the value 0 otherwise. I also do computations with the rows that do not contain NaN values, but that is outside the scope of this post.
I found that pandas detects NaN values with pd.isna().
I created the function
def check_cols(L, C):
    if pd.isna(L) and C == 0:
        return 1
    elif pd.isna(L) and C != 0:
        return 0
and I apply the function on every row
df['col'] = df.apply(lambda row: check_cols(row.L, row.C), axis=1)
but I get the error
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
because pd.isna() checks every element of the list for NaN. I don't want to check whether the elements of the list are NaN; I want to check whether the cell holds a list (even one whose elements are all NaN) or a NaN value. Another way to do it is to create a column with pd.isna() like this:
                 L   C  is_NaN
0        [1, 2, 3]   5   False
1      [4, nan, 6]   0   False
2  [nan, nan, nan]  15   False
0              NaN   0    True
1              NaN  25    True
2              NaN   0    True
and then pass the three columns as arguments to the function, which works. But I want to do the same check (is the cell a list, or a NaN value?) inside the function, without having to create the extra column.
If someone could explain why the first case checks every element of the list while the second does the check I want, and/or provide some sources, that would be great.
The reason behind the exception is that the if condition cannot reduce the expression to a single True or False: the output is a whole Series of booleans. (Also note that for element-wise logic you must use & instead of and, and wrap the comparison in parentheses, since & binds more tightly than ==.) Example:
pd.isna(df.L) & (df.C == 0)
0    False
1    False
2    False
0     True
1    False
2     True
dtype: bool
The result above cannot be evaluated by an if condition.
Here's a solution that directly returns the condition you mentioned:
import pandas as pd
import numpy as np

def check_cols(L, C):
    # operate on the arguments that were passed in
    return pd.isna(L) & (C == 0)

data = {
    'L': [[1, 2, 3], [4, np.nan, 6], [np.nan, np.nan, np.nan], np.nan, np.nan, np.nan],
    'C': [5, 0, 15, 0, 25, 0]}
df = pd.DataFrame(data=data, index=[0, 1, 2, 0, 1, 2])
res = check_cols(df.L, df.C)
df['res'] = res
df
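Returning to the earlier point: the boolean Series cannot drive an if directly. A minimal sketch of the workaround, using the df just built, is to reduce the Series to a single bool first:
mask = pd.isna(df.L) & (df.C == 0)  # a boolean Series, not a single bool
# if mask: ...                      # would raise the ambiguity error
print(mask.any())                   # True  - at least one row matches
print(mask.all())                   # False - not every row matches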
EDIT: Updated solution according to comments
The issue is that you are applying pd.isna to a list: in the first row, for example, L = [1, 2, 3], and the resulting element-wise array cannot be evaluated by the if condition.
import pandas as pd
import numpy as np

def check_cols(L, C):
    if not isinstance(L, list) and np.isnan(L) and C == 0:
        return 1
    elif not isinstance(L, list) and np.isnan(L) and C != 0:
        return 0
    else:
        # when L is a list
        return 1

data = {
    'L': [[1, 2, 3], [4, np.nan, 6], [np.nan, np.nan, np.nan], np.nan, np.nan, np.nan],
    'C': [5, 0, 15, 0, 25, 0]
}
df = pd.DataFrame(data=data, index=[0, 1, 2, 0, 1, 2])
df['col'] = df.apply(lambda row: check_cols(row.L, row.C), axis=1)
df
EDIT 2: I decided to go with np.isnan, but it also works with pd.isna.
How do I remove NaN values from a NumPy array?
[1, 2, NaN, 4, NaN, 8] ⟶ [1, 2, 4, 8]
To remove NaN values from a NumPy array x:
x = x[~numpy.isnan(x)]
Explanation
The inner function numpy.isnan returns a boolean/logical array which has the value True everywhere that x is not-a-number. Since we want the opposite, we use the logical-not operator ~ to get an array with Trues everywhere that x is a valid number.
Lastly, we use this logical array to index into the original array x, in order to retrieve just the non-NaN values.
filter(lambda v: v == v, x)
works for both lists and NumPy arrays, since v != v holds only for NaN. (In Python 3, filter returns a lazy iterator, so wrap it in list() if you need a list.)
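A quick sketch of that identity trick (the values here are illustrative):
import numpy as np

x = np.array([1, 2, np.nan, 4, np.nan, 8])
print(list(filter(lambda v: v == v, x)))  # [1.0, 2.0, 4.0, 8.0] (scalar reprs may vary by NumPy version)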
For me the answer by @jmetz didn't work; however, using pandas isnull() did.
x = x[~pd.isnull(x)]
Try this:
import math
print([value for value in x if not math.isnan(value)])
For more, read up on list comprehensions.
@jmetz's answer is probably the one most people need; however, it yields a one-dimensional array, so it cannot be used to remove entire rows or columns of matrices.
To do so, one should reduce the logical array to one dimension, then index the target array. For instance, the following will remove rows which have at least one NaN value:
x = x[~numpy.isnan(x).any(axis=1)]
See more detail here.
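A companion sketch for the column case (same idea, but reducing over axis 0 instead; the sample array is illustrative):
import numpy as np

x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, 6.0]])
# drop columns containing at least one NaN
print(x[:, ~np.isnan(x).any(axis=0)])
# [[1. 3.]
#  [4. 6.]]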
As shown by others,
x[~numpy.isnan(x)]
works. But it will throw an error if the NumPy dtype is not a native data type, for example if it is object. In that case, use pandas:
x[~pandas.isna(x)] or x[~pandas.isnull(x)]
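For instance, a sketch with an object-dtype array (where np.isnan raises a TypeError):
import numpy as np
import pandas as pd

x = np.array([1.0, np.nan, "a"], dtype=object)
# np.isnan(x) would raise TypeError because of the object dtype
print(x[~pd.isna(x)])  # [1.0 'a']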
If you're using NumPy:
# first get a boolean mask of the finite values
ii = np.isfinite(x)
# then index with it
x = x[ii]
Note that np.isfinite is also False for ±inf, so this drops infinities as well as NaNs.
The accepted answer changes the shape for 2d arrays.
I present a solution here, using the pandas dropna() functionality.
It works for 1D and 2D arrays. In the 2D case you can choose whether to drop the row or the column containing np.nan.
import pandas as pd
import numpy as np
def dropna(arr, *args, **kwarg):
    assert isinstance(arr, np.ndarray)
    dropped = pd.DataFrame(arr).dropna(*args, **kwarg).values
    if arr.ndim == 1:
        dropped = dropped.flatten()
    return dropped
x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan ,1700])
y = np.array([[1400, 1500, 1600], [np.nan, 0, np.nan] ,[1700,1800,np.nan]] )
print('='*20+' 1D Case: ' +'='*20+'\nInput:\n',x,sep='')
print('\ndropna:\n',dropna(x),sep='')
print('\n\n'+'='*20+' 2D Case: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna (rows):\n',dropna(y),sep='')
print('\ndropna (columns):\n',dropna(y,axis=1),sep='')
print('\n\n'+'='*20+' x[np.logical_not(np.isnan(x))] for 2D: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna:\n',x[np.logical_not(np.isnan(x))],sep='')
Result:
==================== 1D Case: ====================
Input:
[1400. 1500. 1600. nan nan nan 1700.]
dropna:
[1400. 1500. 1600. 1700.]
==================== 2D Case: ====================
Input:
[[1400. 1500. 1600.]
[ nan 0. nan]
[1700. 1800. nan]]
dropna (rows):
[[1400. 1500. 1600.]]
dropna (columns):
[[1500.]
[ 0.]
[1800.]]
==================== x[np.logical_not(np.isnan(x))] for 2D: ====================
Input:
[[1400. 1500. 1600.]
[ nan 0. nan]
[1700. 1800. nan]]
dropna:
[1400. 1500. 1600. 1700.]
When doing the above:
x = x[~numpy.isnan(x)]
or
x = x[numpy.logical_not(numpy.isnan(x))]
I found that reassigning to the same variable (x) did not remove the actual NaN values and I had to use a different variable. Assigning to a different variable removed the NaNs, e.g.:
y = x[~numpy.isnan(x)]
In case it helps, for simple 1d arrays:
x = np.array([np.nan, 1, 2, 3, 4])
x[~np.isnan(x)]
>>> array([1., 2., 3., 4.])
but if you wish to expand to matrices and preserve the shape:
x = np.array([
[np.nan, np.nan],
[np.nan, 0],
[1, 2],
[3, 4]
])
x[~np.isnan(x).any(axis=1)]
>>> array([[1., 2.],
           [3., 4.]])
I encountered this issue when using pandas' .shift() functionality, and I wanted to avoid using .apply(..., axis=1) at all costs due to its inefficiency. Simply fill the NaNs with a chosen value:
import numpy

x = numpy.array([
    [0.99929941, 0.84724713, -0.1500044],
    [-0.79709026, numpy.nan, -0.4406645],
    [-0.3599013, -0.63565744, -0.70251352]])
x[numpy.isnan(x)] = .555
print(x)
# [[ 0.99929941 0.84724713 -0.1500044 ]
# [-0.79709026 0.555 -0.4406645 ]
# [-0.3599013 -0.63565744 -0.70251352]]
pandas offers a way to represent missing values that works across data types:
https://pandas.pydata.org/docs/user_guide/missing_data.html
The np.isnan() function is not compatible with all data types, e.g.
>>> import numpy as np
>>> values = [np.nan, "x", "y"]
>>> np.isnan(values)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The pd.isna() and pd.notna() functions are compatible with many data types and pandas introduces a pd.NA value:
>>> import numpy as np
>>> import pandas as pd
>>> values = pd.Series([np.nan, "x", "y"])
>>> values
0 NaN
1 x
2 y
dtype: object
>>> values.loc[pd.isna(values)]
0 NaN
dtype: object
>>> values.loc[pd.isna(values)] = pd.NA
>>> values.loc[pd.isna(values)]
0 <NA>
dtype: object
>>> values
0 <NA>
1 x
2 y
dtype: object
#
# using map with lambda, or a list comprehension
#
>>> values = [np.nan, "x", "y"]
>>> list(map(lambda x: pd.NA if pd.isna(x) else x, values))
[<NA>, 'x', 'y']
>>> [pd.NA if pd.isna(x) else x for x in values]
[<NA>, 'x', 'y']
The simplest way is:
numpy.nan_to_num(x)
Note that this replaces NaN with 0.0 (and ±inf with large finite numbers) rather than removing elements. Documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
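A quick sketch of its behavior (the nan= keyword requires NumPy >= 1.17):
import numpy as np

x = np.array([1.0, np.nan, np.inf])
print(np.nan_to_num(x))            # NaN -> 0.0, inf -> ~1.8e308
print(np.nan_to_num(x, nan=-1.0))  # NaN -> -1.0 instead of 0.0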
Suppose we've got some data series:
0 'one'
1 'two'
2 NAN
3 'three'
4 NAN
5 NAN
Now I would like to get the indices of all the NaN elements. Using python's pandas lib, I would do something like this:
import pandas as pd
import numpy as np
data = pd.Series(['one', 'two', np.nan, 'three', np.nan, np.nan])
nan_index = data.index.difference(data.dropna().index)
However, I get the feeling that this is not a very pandonic (idiomatic) way of doing it.
By using isnull
data[data.isnull()].index
Out[739]: Int64Index([2, 4, 5], dtype='int64')
Or
data.isnull().nonzero()
In [11]: data.index[data.isnull()]
Out[11]: Int64Index([2, 4, 5], dtype='int64')
or
In [12]: np.where(data.isnull())[0]
Out[12]: array([2, 4, 5], dtype=int64)