I have a very simple function like this one:
import numpy as np
from numba import jit
import pandas as pd

@jit
def f_(n, x, y, z):
    for i in range(n):
        z[i] = x[i] * y[i]

f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
To which I pass
df = pd.DataFrame({"x": [1, 2, 3], "y": [3, 4, 5], "z": np.NaN})
I expected the function to modify the z column of the DataFrame in place, like this:
>>> f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df
x y z
0 1 3 3.0
1 2 4 8.0
2 3 5 15.0
This works fine most of the time, but in some cases it somehow fails to modify the data.
I double-checked things and:
I haven't found any problems with the data points that could explain this behavior.
I can see that the data is modified as expected when I print the result.
If I return the z array from the function, it is modified as expected.
Unfortunately I couldn't reduce the problem to a minimal reproducible case. For example, removing unrelated columns seems to "fix" the problem, making reduction impossible.
Am I using jit in a way that is not intended? Are there any edge cases I should be aware of? Or is this likely to be a bug?
Edit:
I found the source of the problem. It occurs when the data contains duplicated column names:
>>> df_ = pd.read_json('{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"v","type":"integer"},{"name":"y","type":"integer"},
... {"name":"v","type":"integer"},{"name":"x","type":"integer"},{"name":"z","type":"number"}],"primaryKey":["index"],"pandas_version":"0.20.
... 0"}, "data": [{"index":0,"v":0,"y":3,"v":0,"x":1,"z":null}]}', orient="table")
>>> f_(df_.shape[0], df_["x"].values, df_["y"].values, df_["z"].values)
>>> df_
v y v x z
0 0 3 0 1 NaN
If the duplicate is removed, the function works as expected:
>>> df_.drop("v", axis="columns", inplace=True)
>>> f_(df_.shape[0], df_["x"].values, df_["y"].values, df_["z"].values)
>>> df_
y x z
0 3 1 3.0
Ah, that's because in your "failing case" df["z"].values returns a copy of what is stored in the 'z' column of df. It has nothing to do with the numba function:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> np.shares_memory(df['z'].values, df['z'])
False
While in the "working case" it's a view into the 'z' column:
>>> df = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])
>>> np.shares_memory(df['z'].values, df['z'])
True
NB: It's actually quite funny that this works, because the copy is made when you do df['z'], not when you access .values.
The take-away here is that you cannot expect indexing a DataFrame or accessing the .values of a Series to always return a view, so updating the column in place may not change the values of the original. Duplicate column names are not the only case where this happens: when the values property returns a copy and when it returns a view is not always clear (except for a pd.Series, where it's always a view). But these are implementation details, so it's never a good idea to rely on a specific behavior here. The only guarantee .values makes is that it returns a numpy.ndarray containing the same values.
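If you want to check up front whether an in-place update will even be visible, one (implementation-dependent) option is np.shares_memory, as used above; a minimal sketch, reusing the DataFrame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [3, 4, 5], "z": np.nan})
z = df["z"].values
# True means z is a view into df's storage, so in-place writes are visible
# in df; False means you would be writing into a throwaway copy.
print(np.shares_memory(z, df["z"]))

Bear in mind this only inspects the current pandas internals; it's a sanity check, not something to rely on.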
However it's pretty easy to avoid that problem by simply returning the modified z column from the function:
import numba as nb
import numpy as np
import pandas as pd

@nb.njit
def f_(n, x, y, z):
    for i in range(n):
        z[i] = x[i] * y[i]
    return z  # this is new
Then assign the result of the function to the column:
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df['z'] = f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df
v y v x z
0 0 3 0 1 3.0
>>> df = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])
>>> df['z'] = f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df
v y x z
0 0 3 1 3.0
In case you're interested in what happens in your specific case currently (as mentioned, these are implementation details, so don't take this as a given; it's just the way it's implemented right now): a DataFrame stores the columns that have the same dtype together in a multidimensional NumPy array. This can be seen by accessing the blocks attribute (deprecated, because the internal storage may change in the near future):
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df.blocks
{'float64':
z
0 NaN
,
'int64':
v y v x
0 0 3 0 1}
Normally it's very easy to create a view into such a block by translating the column name to the column index within the corresponding block. However, if you have a duplicate column name, accessing an arbitrary column cannot be guaranteed to be a view. For example, to access 'v' it has to index the int64 block at indices 0 and 2:
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df['v']
v v
0 0 0
Technically it would be possible to index the non-duplicated columns as views (and in this case even the duplicated column, for example by using Int64Block[::2], but that's a very special case...). Pandas opts for the safe option of always returning a copy if there are duplicate column names (which makes sense if you think about it: why should indexing one column return a view while another returns a copy?). DataFrame indexing has an explicit check for duplicate columns and treats them differently (resulting in copies):
def _getitem_column(self, key):
    """ return the actual column """
    # get column
    if self.columns.is_unique:
        return self._get_item_cache(key)

    # duplicate columns & possible reduce dimensionality
    result = self._constructor(self._data.get(key))
    if result.columns.is_unique:
        result = result[key]
    return result
The columns.is_unique check is the important line here. It's True for your "normal case" but False for your "failing case".
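That check also suggests a cheap guard for your own code; a minimal sketch, assuming you'd rather fail fast than silently work on copies:

import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
# with duplicate column names, single-column indexing returns copies,
# so in-place updates through .values would be silently lost
if not df.columns.is_unique:
    raise ValueError("duplicate column names: in-place update would be lost")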
Related
How do I remove NaN values from a NumPy array?
[1, 2, NaN, 4, NaN, 8] ⟶ [1, 2, 4, 8]
To remove NaN values from a NumPy array x:
x = x[~numpy.isnan(x)]
Explanation
The inner function numpy.isnan returns a boolean/logical array which has the value True everywhere that x is not-a-number. Since we want the opposite, we use the logical-not operator ~ to get an array with Trues everywhere that x is a valid number.
Lastly, we use this logical array to index into the original array x, in order to retrieve just the non-NaN values.
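A minimal end-to-end example (the array values are arbitrary):

import numpy as np

x = np.array([1, 2, np.nan, 4, np.nan, 8])
mask = ~np.isnan(x)  # True where the value is a real number
print(x[mask])       # [1. 2. 4. 8.]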
filter(lambda v: v == v, x)
works for both lists and NumPy arrays, since v != v holds only for NaN.
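Note that in Python 3 filter returns a lazy iterator, so you usually want to materialize the result:

x = [1, 2, float("nan"), 4, float("nan"), 8]
# filter() yields an iterator in Python 3; wrap it to get a concrete list
cleaned = list(filter(lambda v: v == v, x))
print(cleaned)  # [1, 2, 4, 8]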
For me the answer by @jmetz didn't work; however, using pandas isnull() did.
x = x[~pd.isnull(x)]
Try this:
import math
print([value for value in x if not math.isnan(value)])
For more, read on List Comprehensions.
@jmetz's answer is probably the one most people need; however, it yields a one-dimensional array, making it unusable for removing entire rows or columns of matrices.
To do so, one should reduce the logical array to one dimension, then index the target array. For instance, the following will remove rows which have at least one NaN value:
x = x[~numpy.isnan(x).any(axis=1)]
See more detail here.
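The same pattern works for columns by reducing along axis 0 instead:

import numpy as np

x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, 6.0]])
# drop every column that contains at least one NaN
print(x[:, ~np.isnan(x).any(axis=0)])
# [[1. 3.]
#  [4. 6.]]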
As shown by others
x[~numpy.isnan(x)]
works. But it will throw an error if the NumPy dtype is not a native data type, for example if it is object. In that case you can use pandas:
x[~pandas.isna(x)] or x[~pandas.isnull(x)]
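For example, with an object-dtype array (the mixed values here are just for illustration):

import numpy as np
import pandas as pd

x = np.array([1, "a", np.nan, 4], dtype=object)
# np.isnan(x) would raise a TypeError on object dtype;
# pd.isna handles it element-wise.
print(x[~pd.isna(x)])  # [1 'a' 4]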
If you're using numpy
# first get the indices where the values are finite
ii = np.isfinite(x)
# second get the values
x = x[ii]
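Note that np.isfinite is stricter than ~np.isnan: it also filters out infinities, which may or may not be what you want:

import numpy as np

x = np.array([1.0, np.nan, np.inf, 4.0])
print(x[~np.isnan(x)])    # [ 1. inf  4.] -- keeps inf
print(x[np.isfinite(x)])  # [1. 4.]       -- drops both NaN and inf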
The accepted answer changes shape for 2d arrays.
I present a solution here, using the Pandas dropna() functionality.
It works for 1D and 2D arrays. In the 2D case you can choose whether to drop the row or column containing np.nan.
import pandas as pd
import numpy as np

def dropna(arr, *args, **kwarg):
    assert isinstance(arr, np.ndarray)
    dropped = pd.DataFrame(arr).dropna(*args, **kwarg).values
    if arr.ndim == 1:
        dropped = dropped.flatten()
    return dropped
x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan, 1700])
y = np.array([[1400, 1500, 1600], [np.nan, 0, np.nan], [1700, 1800, np.nan]])
print('='*20+' 1D Case: ' +'='*20+'\nInput:\n',x,sep='')
print('\ndropna:\n',dropna(x),sep='')
print('\n\n'+'='*20+' 2D Case: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna (rows):\n',dropna(y),sep='')
print('\ndropna (columns):\n',dropna(y,axis=1),sep='')
print('\n\n'+'='*20+' y[np.logical_not(np.isnan(y))] for 2D: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna:\n',y[np.logical_not(np.isnan(y))],sep='')
Result:
==================== 1D Case: ====================
Input:
[1400. 1500. 1600. nan nan nan 1700.]
dropna:
[1400. 1500. 1600. 1700.]
==================== 2D Case: ====================
Input:
[[1400. 1500. 1600.]
[ nan 0. nan]
[1700. 1800. nan]]
dropna (rows):
[[1400. 1500. 1600.]]
dropna (columns):
[[1500.]
[ 0.]
[1800.]]
==================== y[np.logical_not(np.isnan(y))] for 2D: ====================
Input:
[[1400. 1500. 1600.]
[ nan 0. nan]
[1700. 1800. nan]]
dropna:
[1400. 1500. 1600.    0. 1700. 1800.]
Doing the above:
x = x[~numpy.isnan(x)]
or
x = x[numpy.logical_not(numpy.isnan(x))]
I found that reassigning to the same variable (x) did not remove the actual NaN values, and I had to use a different variable. Setting the result to a different variable removed the NaNs.
e.g.
y = x[~numpy.isnan(x)]
In case it helps, for simple 1d arrays:
x = np.array([np.nan, 1, 2, 3, 4])
x[~np.isnan(x)]
>>> array([1., 2., 3., 4.])
but if you wish to expand to matrices and preserve the shape:
x = np.array([
[np.nan, np.nan],
[np.nan, 0],
[1, 2],
[3, 4]
])
x[~np.isnan(x).any(axis=1)]
>>> array([[1., 2.],
[3., 4.]])
I encountered this issue when dealing with the pandas .shift() functionality, and I wanted to avoid using .apply(..., axis=1) at all costs due to its inefficiency.
Simply fill the NaN entries with a value of your choice:
x = numpy.array([
    [0.99929941, 0.84724713, -0.1500044],
    [-0.79709026, numpy.NaN, -0.4406645],
    [-0.3599013, -0.63565744, -0.70251352]])
x[numpy.isnan(x)] = .555
print(x)
# [[ 0.99929941 0.84724713 -0.1500044 ]
# [-0.79709026 0.555 -0.4406645 ]
# [-0.3599013 -0.63565744 -0.70251352]]
pandas provides missing-value handling that works across data types.
https://pandas.pydata.org/docs/user_guide/missing_data.html
The np.isnan() function is not compatible with all data types, e.g.
>>> import numpy as np
>>> values = [np.nan, "x", "y"]
>>> np.isnan(values)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The pd.isna() and pd.notna() functions are compatible with many data types and pandas introduces a pd.NA value:
>>> import numpy as np
>>> import pandas as pd
>>> values = pd.Series([np.nan, "x", "y"])
>>> values
0 NaN
1 x
2 y
dtype: object
>>> values.loc[pd.isna(values)]
0 NaN
dtype: object
>>> values.loc[pd.isna(values)] = pd.NA
>>> values.loc[pd.isna(values)]
0 <NA>
dtype: object
>>> values
0 <NA>
1 x
2 y
dtype: object
#
# using map with lambda, or a list comprehension
#
>>> values = [np.nan, "x", "y"]
>>> list(map(lambda x: pd.NA if pd.isna(x) else x, values))
[<NA>, 'x', 'y']
>>> [pd.NA if pd.isna(x) else x for x in values]
[<NA>, 'x', 'y']
The simplest way is:
numpy.nan_to_num(x)
Documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
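Keep in mind that nan_to_num replaces NaNs (with 0.0 by default) instead of removing them, so the array keeps its shape:

import numpy as np

x = np.array([1.0, np.nan, 3.0])
print(np.nan_to_num(x))  # [1. 0. 3.] -- NaN replaced, not dropped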
I am fairly new to Python. I came across Pandas: Group by combination of two columns on SO. Unfortunately, the accepted answer no longer works with pandas version 0.23.4. The objective of that post is to figure out combinations of group variables and create a dictionary for the values, i.e. group_by should ignore the order of grouping.
Here's the accepted answer:
import pandas as pd
from collections import Counter
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Here, ...apply(sorted) throws the following exception:
ValueError: Must have equal len keys and value when setting with an iterable
Here's my pandas version:
> pd.__version__
Out: '0.23.4'
Here's what I tried after reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html:
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Unfortunately, this also throws an error:
KeyError: 'x'
Expected output:
score count
x y
a b {1: 1, 3: 2} 2
c {2: 1} 1
Can someone please help me? On a side note, it would be great if you could also explain how to compute the count of keys() in the score column. I am looking for a vectorized solution.
I am using python 3.6.7
Many thanks.
The problem is that sorted returns lists, so it is necessary to convert the result to a Series:
d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)
But it is faster to use numpy.sort with the DataFrame constructor, because apply loops under the hood:
import numpy as np
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
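To see what the row-wise sort does to the sample data (a quick check, continuing from the block above):

>>> np.sort(d[['x', 'y']], axis=1)
array([['a', 'b'],
       ['a', 'c'],
       ['a', 'b'],
       ['a', 'b']], dtype=object)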
Then select the column for aggregation with a list of aggregation functions - e.g. nunique for the number of unique values:
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
Counter nunique
x y
a b {1: 1, 3: 2} 2
c {2: 1} 1
Or count by DataFrameGroupBy.size:
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
Counter size
x y
a b {1: 1, 3: 2} 3
c {2: 1} 1
Use -
a=d[['x','y']].values
a.sort(axis=1)
d[['x','y']] = a
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Output
score
x y
a b {1: 1, 3: 2}
c {2: 1}
Adding result_type = 'broadcast' as one of the args to .apply() worked.
>>> d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
...                  columns=['x', 'y', 'score'])
>>> d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
>>> x = d.groupby(['x', 'y']).agg(Counter)
>>> print(x)
score
x y
a b {1: 1, 3: 2}
c {2: 1}
Note the difference with and without result_type = 'broadcast'.
>>> d[['x', 'y']].apply(sorted, axis=1)
0 [a, b]
1 [a, c]
2 [a, b]
3 [a, b]
dtype: object
>>> d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
x y
0 a b
1 a c
2 a b
3 a b
As you can see, result_type = 'broadcast' splits (broadcasts) the result of .apply() back from a list into the respective columns, allowing the assignment to d[['x', 'y']].
How should I map indices of a numpy matrix?
For example:
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6]])
The row/column indices are 0, 1, 2.
So:
>>> mx[0,0]
5
Let's say I need to map these indices, converting 0, 1, 2 into e.g. 10, 'A', 'B', so that:
mx[10,10] #returns 5
mx[10,'A'] #returns 6 and so on..
I could just set up a dict and use it to access the elements, but I would like to know if it is possible to do something like what I just described.
I would suggest using a pandas DataFrame, with its index and columns set to the new row and column mappings respectively, for ease of indexing. It lets us select a single element or an entire row or column with the familiar colon operator.
Consider a generic (non-square 4x3 shaped matrix) -
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6],[4,5,2]])
Consider the mappings for rows and columns -
row_idx = [10, 'A', 'B','C']
col_idx = [10, 'A', 'B']
Let's take a look on the workflow with the given sample -
# Get data into dataframe with given mappings
In [57]: import pandas as pd
In [58]: df = pd.DataFrame(mx,index=row_idx, columns=col_idx)
# Here's how dataframe data looks like
In [60]: df
Out[60]:
10 A B
10 5 6 2
A 3 3 7
B 0 1 6
C 4 5 2
# Get one scalar element
In [61]: df.loc['C',10]
Out[61]: 4
# Get one entire col
In [63]: df.loc[:,10].values
Out[63]: array([5, 3, 0, 4])
# Get one entire row
In [65]: df.loc['A'].values
Out[65]: array([3, 3, 7])
And best of all, we are not making any extra copies, as the dataframe and its slices still index into the original matrix/array memory space -
In [98]: np.shares_memory(mx,df.loc[:,10].values)
Out[98]: True
Try this:
import numpy as np
A = np.array(((1,2),(3,4),(50,100)))
dt = np.dtype([('ID', np.int32), ('Ring', np.int32)])
B = np.array(list(map(tuple, A)), dtype=dt)
print(B['ID'])
You can use the __getitem__ and __setitem__ special methods and create a new class as shown.
Store the index map as a dictionary in an instance variable self.index_map.
import numpy as np

class Matrix(np.matrix):
    def __init__(self, lis):
        self.matrix = np.matrix(lis)
        self.index_map = {}

    def setIndexMap(self, index_map):
        self.index_map = index_map

    def getIndex(self, key):
        if type(key) is slice:
            return key
        elif key not in self.index_map.keys():
            return key
        else:
            return self.index_map[key]

    def __getitem__(self, idx):
        return self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])]

    def __setitem__(self, idx, value):
        self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])] = value
Usage:
Creating a matrix.
>>> mx = Matrix([[5,6,2],[3,3,7],[0,1,6]])
>>> mx
Matrix([[5, 6, 2],
[3, 3, 7],
[0, 1, 6]])
Defining the Index Map.
>>> mx.setIndexMap({10:0, 'A':1, 'B':2})
Different ways to index the matrix.
>>> mx[0,0]
5
>>> mx[10,10]
5
>>> mx[10,'A']
6
It also handles slicing as shown.
>>> mx[1:3, 1:3]
matrix([[3, 7],
[1, 6]])
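Assignment goes through __setitem__ in the same way, so the mapped labels also work for writes (a small sketch, reusing the index map defined above):

>>> mx[10, 'A'] = 42
>>> mx[0, 1]
42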
I'm working with the dataset outlined here:
https://archive.ics.uci.edu/ml/datasets/Balance+Scale
I'm trying to create a general function that can parse any categorical data following these two rules:
Must have a column labeled Class containing the class of the object
Each row must have the same number of columns
Minimal example of the data that I'm working with:
Class,LW,LD,RW,RD
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
This provides 3 unique classes: B, L, R. It also provides 4 features which pertain to each entry: LW, LD, RW and RD.
The following is part of my function to handle generic cases, but my issue is that I don't know how to check whether any column labels are simply missing:
import pandas as pd
import sys

dataframe = pd.read_csv('Balance_Data.csv')
columns = list(dataframe.columns.values)
if "Class" not in columns:
    sys.exit("'Class' is not a column in the data")
if "Class.1" in columns:
    sys.exit("Cannot specify more than one 'Class' column")
columns.remove("Class")
inputX = dataframe.loc[:, columns].as_matrix()
inputY = dataframe.loc[:, ['Class']].as_matrix()
At this point, the correct values are:
inputX = array([[1, 1, 1, 1],
                [1, 2, 1, 1],
                [1, 2, 1, 3],
                [2, 2, 4, 5]])
inputY = array([['B'],
                ['L'],
                ['R'],
                ['R']], dtype=object)
But if I remove the last column label (RD) and reprocess,
Class,LW,LD,RW
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
I get:
inputX = array([[1, 1, 1],
                [2, 1, 1],
                [2, 1, 3],
                [2, 4, 5]])
inputY = array([[1],
                [1],
                [1],
                [2]])
This indicates that it reads label values from right to left instead of left to right, which means that if the input to this function doesn't have the right number of labels, it won't work correctly.
How can I check that the dimension of the rows matches the number of column labels? (It can be assumed that there are no gaps in the data itself, and that each data row always has the same number of elements.)
I would pull it out as follows:
In [11]: df = pd.read_csv('Balance_Data.csv', index_col=0)
In [12]: df
Out[12]:
LW LD RW RD
Class
B 1 1 1 1
L 1 2 1 1
R 1 2 1 3
R 2 2 4 5
That way the assertion check can be:
if "Class" in df.columns:
sys.exit("class must be the first and only the column and number of columns must match all rows")
and then check that there are no NaNs in the last column:
In [21]: df.iloc[:, -1].notnull().all()
Out[21]: True
Note: this happens e.g. with the following (bad) csv:
In [31]: !cat bad.csv
A,B,C
1,2
3,4
In [32]: df = pd.read_csv('bad.csv', index_col=0)
In [33]: df
Out[33]:
B C
A
1 2 NaN
3 4 NaN
In [34]: df.iloc[:, -1].notnull().all()
Out[34]: False
I think these are the only two failing cases (but the error messages could be made clearer)...
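Putting both checks together, a minimal validation sketch (the function name and messages are mine, not from the original code):

import sys
import pandas as pd

def load_checked(path):
    df = pd.read_csv(path, index_col=0)
    # the class labels must have become the index, not stayed a column
    if "Class" in df.columns:
        sys.exit("'Class' must be the first column")
    # short rows leave NaNs in the last column after parsing
    if not df.iloc[:, -1].notnull().all():
        sys.exit("some rows have fewer values than there are column labels")
    return df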
I have some time series data, say:
# [ [time] [ data ] ]
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4], ['f','g','h']]
and I would like an output with some filler value, let's say None for now:
a_new = [[0,1,2,3,4],['a','b','c','d','e']]
b_new = [[0,1,2,3,4],['f',None,None,'g','h']]
Is there a built in function in python/numpy to do this (or something like this)? Basically I would like to have all of my time vectors of equal size so I can calculate statistics (np.mean) and deal with the missing data accordingly.
How about this? (I'm assuming your definition of b was a typo, and I'm also assuming you know in advance how many entries you want.)
>>> b = [[0,3,4], ['f','g','h']]
>>> b_new = [list(range(5)), [None] * 5]
>>> for index, value in zip(*b): b_new[1][index] = value
>>> b_new
[[0, 1, 2, 3, 4], ['f', None, None, 'g', 'h']]
smarx has a fine answer, but pandas was made exactly for things like this.
import pandas as pd

# your data
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4],['f','g','h']]

# make an empty DataFrame (can do this faster but I'm going slow so you see how it works)
df_a = pd.DataFrame()
df_a['time'] = a[0]
df_a['A'] = a[1]
df_a.set_index('time', inplace=True)

# same for b (a faster way this time)
df_b = pd.DataFrame({'B': b[1]}, index=b[0])

# now merge the two Series together (the NaNs are in the right place)
df = pd.merge(df_a, df_b, left_index=True, right_index=True, how='outer')
In [28]: df
Out[28]:
A B
0 a f
1 b NaN
2 c NaN
3 d g
4 e h
Now the fun is just beginning. Within a DataFrame you can
compute all of your summary statistics (e.g. df.mean())
make plots (e.g. df.plot())
slice/dice your data basically however you want (e.g. df.groupby())
fill in or drop missing data using a specified method (e.g. df.fillna())
take quarterly or monthly averages (e.g. df.resample()) and a lot more.
If you're just getting started (sorry for the infomercial if you aren't), I recommend reading 10 minutes to pandas for a quick overview.
Here's a vectorized NumPythonic approach -
import numpy as np

def align_arrays(A):
    time, data = A
    time_new = np.arange(np.max(time) + 1)
    data_new = np.full(time_new.size, None, dtype=object)
    data_new[np.in1d(time_new, time)] = data
    return time_new, data_new
Sample runs -
In [113]: a = [[0,1,2,3,4],['a','b','c','d','e']]
In [114]: align_arrays(a)
Out[114]: (array([0, 1, 2, 3, 4]), array(['a', 'b', 'c', 'd', 'e'], dtype=object))
In [115]: b = [[0,3,4],['f','g','h']]
In [116]: align_arrays(b)
Out[116]: (array([0, 1, 2, 3, 4]), array(['f', None, None, 'g', 'h'], dtype=object))
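One caveat (my note, not from the answer above): with None as the filler the data array has object dtype, so np.mean won't work on it even for numeric data. If the values are numeric, using np.nan as the filler keeps the dtype float and lets you use the NaN-aware reductions:

import numpy as np

time = np.array([0, 3, 4])
data = np.array([1.0, 2.0, 3.0])
aligned = np.full(5, np.nan)  # np.nan instead of None keeps dtype float64
aligned[time] = data
print(np.nanmean(aligned))    # 2.0 -- ignores the missing slots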