I have an array like the one below
np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}])
and a pandas DataFrame like the one below
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}]})
When I apply np.isreal to the DataFrame:
df.applymap(np.isreal)
Out[811]:
A
0 False
1 False
2 True
3 False
4 False
5 True
When I apply np.isreal to the numpy array:
np.isreal( np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}]))
Out[813]: array([ True, True, True, True, True, True], dtype=bool)
I must be using np.isreal in the wrong use case, but can you help me understand why the results are different?
A partial answer is that isreal is only intended to be used on an array-like as the first argument.
You want to use isrealobj on each element to get the behavior you see here:
In [11]: a = np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}])
In [12]: a
Out[12]:
array(['hello', 'world', {'a': 5, 'b': 6, 'c': 8}, 'usa', 'india',
{'d': 9, 'e': 10, 'f': 11}], dtype=object)
In [13]: [np.isrealobj(aa) for aa in a]
Out[13]: [True, True, True, True, True, True]
In [14]: np.isreal(a)
Out[14]: array([ True, True, True, True, True, True], dtype=bool)
That does leave the question: what does np.isreal do on something that isn't array-like? E.g.
In [21]: np.isrealobj("")
Out[21]: True
In [22]: np.isreal("")
Out[22]: False
In [23]: np.isrealobj({})
Out[23]: True
In [24]: np.isreal({})
Out[24]: True
It turns out this stems from .imag, since the test that isreal does is:
return imag(x) == 0 # note imag == np.imag
and that's it.
In [31]: np.imag(a)
Out[31]: array([0, 0, 0, 0, 0, 0], dtype=object)
In [32]: np.imag("")
Out[32]:
array('',
dtype='<U1')
In [33]: np.imag({})
Out[33]: array(0, dtype=object)
This looks up the .imag attribute on the array.
In [34]: np.asanyarray("").imag
Out[34]:
array('',
dtype='<U1')
In [35]: np.asanyarray({}).imag
Out[35]: array(0, dtype=object)
I'm not sure why this isn't set in the string case yet...
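Putting those pieces together, the two puzzling scalar results above can be reproduced by hand, since np.isreal(x) boils down to np.imag(x) == 0:
import numpy as np

print(np.imag("") == 0)   # False: .imag of a string array is '', and '' != 0
print(np.imag({}) == 0)   # True:  .imag of an object array is 0, and 0 == 0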
I think this is a small bug in NumPy, to be honest. Here pandas is just looping over each item in the column and calling np.isreal() on it. E.g.:
>>> np.isreal("a")
False
>>> np.isreal({})
True
I think the paradox here has to do with how np.isreal() treats inputs of dtype=object. My guess is it's taking the object pointer and treating it like an int, so of course np.isreal(<some object>) returns True. Over an array of mixed types like np.array(["A", {}]), the array is of dtype=object, so np.isreal() treats all the elements (including the strings) the way it would anything with dtype=object.
To be clear, I think the bug is in how np.isreal() treats arbitrary objects in a dtype=object array, but I haven't confirmed this explicitly.
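If the goal is simply to flag the elements that are actual numbers, a plain isinstance test avoids the dtype=object quirk altogether. A minimal sketch (note that, unlike np.isreal, it also reports the dicts as False):
import numbers
import pandas as pd

df = pd.DataFrame({'A': ["hello", "world", {"a": 5, "b": 6, "c": 8},
                         "usa", "india", {"d": 9, "e": 10, "f": 11}]})

# True only for real numbers; strings and dicts both come out False.
df['A'].map(lambda x: isinstance(x, numbers.Number) and not isinstance(x, complex))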
There are a couple of things going on here. The first, pointed out by the previous answers, is that np.isreal acts strangely when passed objects.
However, I think you are also confused about what applymap is doing. "Difference between map, applymap and apply methods in Pandas" is always a great reference.
In this case, what you think you are doing is actually:
df.apply(np.isreal, axis=1)
which essentially calls np.isreal(df), whereas df.applymap(np.isreal) calls np.isreal on each individual element of df, e.g.
np.isreal(df.A)
array([ True, True, True, True, True, True], dtype=bool)
np.array([np.isreal(x) for x in df.A])
array([False, False, True, False, False, True], dtype=bool)
Related
I am trying to use the broadcasting feature of numpy on my large data. I have list columns that can have hundreds of elements in many rows. I need to filter rows based on whether the value in col_a is present in the list column col_b: if the number in col_a is present in col_b, I need to keep that row.
Sample data:
import pandas as pd
import numpy as np
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [[1],[2],[5],[1],[2]],
'col_b': [[2],[2,4],[2,5,7],[4],[3,2]],
})
dt
id col_a col_b
0 a [1] [2]
1 a [2] [2, 4]
2 a [5] [2, 5, 7]
3 b [1] [4]
4 b [2] [3, 2]
I tried the code below to add a dimension to col_b and check whether the col_a value is present:
(dt['col_a'] == dt['col_b'][:,None]).any(axis = 1)
but I get below error:
ValueError: ('Shapes must match', (5,), (5, 1))
Could someone please let me know what the correct approach is?
import pandas as pd
import numpy as np
from itertools import product
Expand the list columns into one row per combination, using the Cartesian product:
dt2 = pd.DataFrame([j for i in dt.values for j in product(*i)], columns=dt.columns)
Filter to where col_a equals col_b:
dt2 = dt2[dt2['col_a'] == dt2['col_b']]
Results in:
  id col_a col_b
1  a     2     2
4  a     5     5
8  b     2     2
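A pandas-native alternative (assuming pandas >= 0.25, where DataFrame.explode is available) avoids itertools entirely:
# Explode each list column to one row per element (the chained calls
# produce the same Cartesian product as itertools.product), then filter.
dt2 = dt.explode('col_a').explode('col_b')
dt2 = dt2[dt2['col_a'] == dt2['col_b']]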
I think you've been told that numpy "vectorization" is the key to speeding up your code, but you don't have a good grasp of what that means. It isn't something magical that you can apply to any pandas task. It's just shorthand for making full use of numpy array methods, which means actually learning numpy.
But let's explore your task:
In [205]: dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
...: 'col_a': [[1],[2],[5],[1],[2]],
...: 'col_b': [[2],[2,4],[2,5,7],[4],[3,2]],
...: })
In [206]: dt
Out[206]:
id col_a col_b
0 a [1] [2]
1 a [2] [2, 4]
2 a [5] [2, 5, 7]
3 b [1] [4]
4 b [2] [3, 2]
In [207]: dt.dtypes
Out[207]:
id object
col_a object
col_b object
dtype: object
Because the columns contain lists, their dtype is object; they have references to lists.
Doing things like == on columns (pandas Series) is not the same as doing them on the arrays of their values.
But to focus on the numpy aspect, let's get the numpy arrays:
In [208]: a = dt['col_a'].to_numpy()
In [209]: b = dt['col_b'].to_numpy()
In [210]: a
Out[210]:
array([list([1]), list([2]), list([5]), list([1]), list([2])],
dtype=object)
In [211]: b
Out[211]:
array([list([2]), list([2, 4]), list([2, 5, 7]), list([4]), list([3, 2])],
dtype=object)
The fast numpy operations use compiled code and, for the most part, only work with numeric dtypes. Arrays like this, containing references to lists, are basically the same as lists. Math, and other operations like equality, operate at list-comprehension speeds. That may be faster than pandas speeds, but nowhere near the highly vaunted "vectorized" numpy speeds.
So let's do a list comprehension over the elements of these arrays. This is a lot like pandas apply, though I think it's faster (pandas apply is notoriously slow).
In [212]: [i in j for i,j in zip(a,b)]
Out[212]: [False, False, False, False, False]
Oops, no matches - that must be because each i from a is a list. Let's extract the number:
In [213]: [i[0] in j for i,j in zip(a,b)]
Out[213]: [False, True, True, False, True]
Making col_a contain lists instead of numbers does not help you.
Since a and b are arrays, we can use ==, but that's essentially the same operation as [212] (its timeit is slightly better):
In [214]: a==b
Out[214]: array([False, False, False, False, False])
We could make b into a (5,1) array, but why?
In [215]: b[:,None]
Out[215]:
array([[list([2])],
[list([2, 4])],
[list([2, 5, 7])],
[list([4])],
[list([3, 2])]], dtype=object)
I think what you were trying to imitate is an array comparison like this, broadcasting a (5,) against a (3,1) to produce a (3,5) truth table:
In [216]: x = np.arange(5); y = np.array([3,5,1])
In [217]: x==y[:,None]
Out[217]:
array([[False, False, False, True, False],
[False, False, False, False, False],
[False, True, False, False, False]])
In [218]: (x==y[:,None]).any(axis=1)
Out[218]: array([ True, False, True])
isin can do the same sort of comparison:
In [219]: np.isin(x,y)
Out[219]: array([False, True, False, True, False])
In [220]: np.isin(y,x)
Out[220]: array([ True, False, True])
While this works for numbers, it does not work for the arrays of lists, and especially not in your case, where you want to test each list in a against the corresponding list in b. You aren't testing all of a against all of b.
Since the lists in a are all the same size, we can join them into one number array:
In [225]: np.hstack(a)
Out[225]: array([1, 2, 5, 1, 2])
We cannot do the same for b because those lists vary in size. As a general rule, when you have lists (or arrays) that vary in size, you cannot do the fast numeric numpy math and comparisons.
We could test (5,) a against (5,1) b, producing a (5,5) truth table:
In [227]: a==b[:,None]
Out[227]:
array([[False, True, False, False, True],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False]])
But that is True for a couple of cells in the first row; that's where the list([2]) from b matches the same list in a.
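So for this task, the per-row membership test from [213] is the practical approach. Wiring it back into the DataFrame as a filter:
mask = [i[0] in j for i, j in zip(dt['col_a'], dt['col_b'])]
dt[mask]     # keeps rows 1, 2 and 4, where the col_a number appears in col_b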
I have multiple Series, given as a list. The number of series may vary.
s1 = pandas.Series(data=['Bob', 'John', '10', 10, 'i'])
s2 = pandas.Series(data=['John', 'John', 10, 10, 'j'])
s3 = pandas.Series(data=['Bob', 'John', '10', 10, 'k'])
series = [s1,s2,s3]
What I want is to check, element-wise, whether the series are all equal, and get back a list of indexes or a numpy array of booleans.
What I have tried:
numpy.equal.reduce([s for s in series])
or
numpy.equal.reduce([s.values for s in series])
But with the given series I get:
array([ True, True, True, True, True])
I expected:
array([ False, True, False, True, False])
Is there any elegant way to do this job without constructing big iterating methods?
Thank you!
You can simply construct a DataFrame and check the number of unique values per position:
print (pd.DataFrame(series).nunique().eq(1))
0 False
1 True
2 False
3 True
4 False
dtype: bool
Or as an array:
print (pd.DataFrame(series).nunique().eq(1).to_numpy())
[False True False True False]
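If you'd rather stay in numpy, stack the series into one object array and compare every row against the first. A minimal sketch (assuming pandas >= 0.24 for Series.to_numpy; the object dtype is what keeps '10' and 10 unequal):
import numpy as np

arr = np.array([s.to_numpy() for s in series], dtype=object)
print((arr == arr[0]).all(axis=0))
# [False  True False  True False]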
I would like to compare many m-by-n boolean numpy arrays and get an array of the same shape whose entries are True if the corresponding entry in at least one of the inputs is True.
The easiest way I've found to do this is:
In [5]: import numpy as np
In [6]: a = np.array([True, False, True])
In [7]: b = np.array([True, True, False])
In [8]: a + b
Out[8]: array([ True, True, True])
But I can also use
In [11]: np.stack([a, b]).sum(axis=0) > 0
Out[11]: array([ True, True, True])
Are these equivalent operations? Are there any gotchas I should be aware of? Is one method preferable to the other?
You can use np.logical_or
a = np.array([True, False, True])
b = np.array([True, True, False])
np.logical_or(a,b)
It also works for (m, n) arrays:
a = np.random.rand(3,4) < 0.5
b = np.random.rand(3,4) < 0.5
print('a\n',a)
print('b\n',b)
np.logical_or(a,b)
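And because np.logical_or is a ufunc, its reduce method covers the "many arrays" case from the question in a single call:
c = np.random.rand(3,4) < 0.5
np.logical_or.reduce([a, b, c])   # True wherever at least one of a, b, c is True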
I have a masked numpy array like the following:
mar = np.ma.array([0, 0, 100, 100], mask=[False, True, True, False], fill_value=-1)
So the two values in the middle are masked; calling mar.filled() would return [0, -1, -1, 100].
I want to compare this array to a scalar 0, i.e.:
mar == 0
which returns
masked_array(data = [True -- -- False],
mask = [False True True False],
fill_value = True)
Note that the fill_value is now True, which is the default fill value for bool arrays but does not make sense to me in this case (I would have expected it to be set to -1 == 0, which is False).
To illustrate my problem more clearly: (mar == 0).filled() and mar.filled() == 0 do not return the same result.
Is this intended behaviour or is it a bug? In any case, is there a workaround to achieve my desired behaviour? I know that I can just convert to a normal array before comparison using .filled() but I would like to avoid that if possible, since the code should not care whether it is a masked array or a normal one.
mar == 0 uses mar.__eq__(0)
The docs for that method say:
When either of the elements is masked, the result is masked as well,
but the underlying boolean data are still set, with self and other
considered equal if both are masked, and unequal otherwise.
That method in turn uses mar._comparison
This first performs the comparison on the .data attributes
In [16]: mar.data
Out[16]: array([ 0, 0, 100, 100])
In [17]: mar.data == 0
Out[17]: array([ True, True, False, False])
But then it compares the masks and adjusts values. 0 is not masked, so its 'mask' is False. Since the mask for the masked elements of mar is True, the masks don't match, and those positions in the comparison's .data are set to False.
In [19]: np.ma.getmask(0)
Out[19]: False
In [20]: mar.mask
Out[20]: array([False, True, True, False])
In [21]: (mar==0).data
Out[21]: array([ True, False, False, False])
I get a different fill_value in the comparison. That could be a change in v 1.14.0.
In [24]: mar==0
Out[24]:
masked_array(data=[True, --, --, False],
mask=[False, True, True, False],
fill_value=-1)
In [27]: (mar==0).filled()
Out[27]: array([True, -1, -1, False], dtype=object)
This is confusing. Comparisons (and in general most functions) on masked arrays have to deal with the .data, the mask, and the fill. NumPy code that isn't ma-aware usually works on the .data and ignores the masking. ma methods may work with the filled() values, or the compressed() ones. This comparison method attempts to take all three attributes into account.
Testing the equality with a masked 0 array (same mask and fill_value):
In [34]: mar0 = np.ma.array([0, 0, 0, 0], mask=[False, True, True, False], fill_value=-1)
In [35]: mar0
Out[35]:
masked_array(data=[0, --, --, 0],
mask=[False, True, True, False],
fill_value=-1)
In [36]: mar == mar0
Out[36]:
masked_array(data=[True, --, --, False],
mask=[False, True, True, False],
fill_value=-1)
In [37]: _.data
Out[37]: array([ True, True, True, False])
mar == 0 is the same as mar == np.ma.array([0, 0, 0, 0], mask=False)
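A quick sanity check of that equivalence (data and mask agree; only the fill_value handling varies across versions, as noted above):
zeros = np.ma.array([0, 0, 0, 0], mask=False)
res1, res2 = mar == 0, mar == zeros
print((res1.data == res2.data).all())   # True
print((res1.mask == res2.mask).all())   # True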
I don't know why (mar == 0) does not yield the desired output, but you can consider
np.equal(mar, 0)
which retains the original fill_value.
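If you need (mar == 0).filled() to agree with mar.filled() == 0 without converting up front, one workaround (a sketch, not a documented ma feature) is to set the result's fill_value yourself from a comparison of the original one:
res = mar == 0
res.fill_value = mar.fill_value == 0   # -1 == 0 -> False
print(res.filled())        # [ True False False False]
print(mar.filled() == 0)   # [ True False False False] -- now consistent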
In NumPy, I can generate a boolean array like this:
>>> arr = np.array([1, 2, 1, 2, 3, 6, 9])
>>> arr > 2
array([False, False, False, False, True, True, True], dtype=bool)
How can we chain comparisons together? For example:
>>> 6 > arr > 2
array([False, False, False, False, True, False, False], dtype=bool)
Attempting to do so results in the error message
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
AFAIK the closest you can get is to use &, |, and ^:
>>> arr = np.array([1, 2, 1, 2, 3, 6, 9])
>>> (2 < arr) & (arr < 6)
array([False, False, False, False, True, False, False], dtype=bool)
>>> (2 < arr) | (arr < 6)
array([ True, True, True, True, True, True, True], dtype=bool)
>>> (2 < arr) ^ (arr < 6)
array([ True, True, True, True, False, True, True], dtype=bool)
I don't think you'll be able to get a < b < c-style chaining to work.
You can use the numpy logical operators to do something similar.
>>> arr = np.array([1, 2, 1, 2, 3, 6, 9])
>>> arr > 2
array([False, False, False, False, True, True, True], dtype=bool)
>>> np.logical_and(arr > 2, arr < 6)
array([False, False, False, False,  True, False, False], dtype=bool)
Chained comparisons are not allowed in numpy. You need to write the left and right comparisons separately and chain them with bitwise operators. You'll also need to parenthesise both expressions due to operator precedence (&, | and ^ have higher precedence than the comparison operators). In this case, since you want both conditions to be satisfied, you need a bitwise AND (&):
(2<arr) & (arr<6)
# array([False, False, False, False, True, False, False])
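To see why the parentheses are required, drop them: & binds tighter than <, so the expression parses as a chained comparison again and raises the same error:
2 < arr & arr < 6   # parsed as 2 < (arr & arr) < 6
# ValueError: The truth value of an array with more than one element is ambiguous.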
It was actually proposed to make this possible in PEP 535, though it still remains deferred. The PEP includes an explanation of why this occurs. Chaining comparisons in the way posed in the question yields:
2<arr<6
ValueError: The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
The problem here is that Python internally expands the above to:
2<arr and arr<6
That expansion is what causes the error, since and implicitly calls bool, and NumPy only permits implicit coercion to a boolean value for single elements (not arrays with size > 1): a boolean array with many values evaluates to neither True nor False. It is due to this ambiguity that this isn't allowed, and evaluating a multi-element array in a boolean context always raises a ValueError.
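You can trigger the same coercion directly, which also shows the one case where it is allowed:
bool(arr > 2)
# ValueError: The truth value of an array with more than one element is ambiguous.
bool(np.array([5]) > 2)
# True -- a single-element array has an unambiguous truth value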