Compare elements in multiple pandas Series, given as a list - python

I have multiple Series, given as a list. The number of Series may vary.
s1 = pandas.Series(data=['Bob', 'John', '10', 10, 'i'])
s2 = pandas.Series(data=['John', 'John', 10, 10, 'j'])
s3 = pandas.Series(data=['Bob', 'John', '10', 10, 'k'])
series = [s1,s2,s3]
What I want is to check, position by position, whether the elements of all the Series in the list are equal, and get back either a list of indexes or a numpy.array of booleans.
What I have tried:
numpy.equal.reduce([s for s in series])
or
numpy.equal.reduce([s.values for s in series])
But with the given series I get:
array([ True, True, True, True, True])
I expected:
array([ False, True, False, True, False])
Is there an elegant way to do this without writing long iteration code?
Thank you!

You can simply construct a DataFrame (one row per Series) and check whether each column has exactly one unique value:
print (pd.DataFrame(series).nunique().eq(1))
0 False
1 True
2 False
3 True
4 False
dtype: bool
Or as an array:
print (pd.DataFrame(series).nunique().eq(1).to_numpy())
[False True False True False]
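As a rough alternative that stays in NumPy, the Series can be stacked into a 2-D object array and each row compared with the first one. This is only a sketch: the names `stacked` and `all_equal` are made up for illustration, and it assumes all Series have the same length.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['Bob', 'John', '10', 10, 'i'])
s2 = pd.Series(['John', 'John', 10, 10, 'j'])
s3 = pd.Series(['Bob', 'John', '10', 10, 'k'])
series = [s1, s2, s3]

# Stack the Series into a 2-D object array: one row per Series.
stacked = np.array([s.to_numpy() for s in series], dtype=object)

# Compare every row with the first row, then require agreement in all rows.
all_equal = (stacked == stacked[0]).all(axis=0)
print(all_equal)
```

Because the comparison is element by element, the string '10' and the integer 10 count as different values, which matches the expected result in the question.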

Related

How can I get the index of the row where a value has a certain value?

Good morning,
I have a dataframe that contains only True and False values, and I want to get the row index where the value True exists.
I tried this:
[i for i in df_str[df_str.columns.values] if i== True]
But this returns an empty list.
How can I do this?
Here's a way to do that. I'm using synthetic data for the sake of demonstration.
df = pd.DataFrame({"a": np.random.choice([True, False], 10),
                   "b": np.random.choice([True, False], 10)})
print(df)
# a b
# 0 False True
# 1 True False
# 2 False True
# 3 True True
# 4 False False
# 5 True False
# 6 False True
# 7 True False
# 8 True False
# 9 True True
# 'a' and 'b' are the columns you'd like to search
df[df[["a", "b"]].sum(axis=1) > 0].index.to_list()
# ==> [0, 1, 2, 3, 5, 6, 7, 8, 9]
Here is the solution
# for single column
df.index[df['col_name'] == True].tolist()
#for multiple columns
df[df[["a", "b"]].sum(axis=1) > 0].index.to_list()
The best way to get the indices where True is present (for the provided sample) is to use any. The following code will give you all the indices where any value in a particular row is True.
df=pd.DataFrame({"A":[True, False, False, True],"B":[True, True, False, False]})
indices=df[df.any(axis=1)].index
Expected output (pandas 2.x prints this as Index([0, 1, 3], dtype='int64')):
Int64Index([0, 1, 3], dtype='int64')
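To tie the answers above together, here is a small sketch of why the list comprehension in the question comes back empty and how the row labels can be collected instead. The variable names (`labels`, `rows_with_true`, `true_cells`) are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [True, False, False, True],
                   "B": [True, True, False, False]})

# Iterating over a DataFrame yields its column labels, not its cells,
# so comparing each item to True never matches.
labels = [c for c in df]

# Row labels where at least one column is True.
rows_with_true = df.index[df.any(axis=1)].tolist()

# (row label, column label) pairs for every True cell, in row-major order.
row_pos, col_pos = np.nonzero(df.to_numpy())
true_cells = [(df.index[i], df.columns[j]) for i, j in zip(row_pos, col_pos)]

print(labels)
print(rows_with_true)
print(true_cells)
```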

Fast way to create numpy 1d bool array with known nonzero entry indexes

Given a 1d np.ndarray containing the indexes that should be True, e.g. [1, 2, 4], and the length of the target np.ndarray, 6,
how can we quickly construct the actual np.ndarray, which should be [False, True, True, False, True, False]?
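import numpy as np

idx = [1, 2, 4]  # indexes that should be True
s = 6            # length of the target array
a = np.zeros(s, dtype=bool)  # start with all False
a[idx] = True
print(a)
Output:
[False True True False True False]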
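The same idea can be wrapped in a small helper. This is a minimal sketch; `mask_from_indexes` is a hypothetical name, not a NumPy function, and `np.flatnonzero` is shown only to illustrate the inverse operation.

```python
import numpy as np

def mask_from_indexes(indexes, length):
    """Return a boolean array of `length` that is True at `indexes`."""
    mask = np.zeros(length, dtype=bool)
    # Casting to an integer index type also handles an empty index list.
    mask[np.asarray(indexes, dtype=np.intp)] = True
    return mask

mask = mask_from_indexes([1, 2, 4], 6)
print(mask)

# Going back from the mask to the indexes.
recovered = np.flatnonzero(mask)
print(recovered)
```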

How to create a multi-column list of booleans from a given list of integers in Python?

I am new to Python. I want to do following.
Input: A list of integers of size n. Each integer is in a range of 0 to 3.
Output: A multi-column (4 column in this case as integer range in 0-3 = 4) numpy list of size n. Each row of the new list will have the column corresponding to the integer value of Input list as True and rest of the columns as False.
E.g. Input list : [0, 3, 2, 1, 1, 2], size = 6, Each integer is in range of 0-3
Output list :
Row 0: True False False False
Row 1: False False False True
Row 2: False False True False
Row 3: False True False False
Row 4: False True False False
Row 5: False False True False
Now, I can start with 4 columns, traverse the input list, and build the output as follows:
output_columns = [[False] * n for _ in range(4)]
for row, i in enumerate(input_list):
    output_columns[i][row] = True
Then create an output numpy list from output_columns.
Is this the best way to do this in Python? Especially for creating numpy list as an output.
If yes, How do I merge output_columns[] at the end to create numpy multidimensional list with each dimension as a column of output_columns.
If not, what would be the best (most time efficient way) to do this in Python?
Thank you,
Is this the best way to do this in Python?
No, a more Pythonic and probably the best way is to use a simple broadcasting comparison, as follows:
In [196]: a = np.array([0, 3, 2, 1, 1, 2])
In [197]: r = list(range(0, 4))
In [198]: a[:,None] == r
Out[198]:
array([[ True, False, False, False],
[False, False, False, True],
[False, False, True, False],
[False, True, False, False],
[False, True, False, False],
[False, False, True, False]])
You are creating so-called one-hot vectors (each row of the matrix is a one-hot vector, meaning that only one value is True).
mylist = [0, 3, 2, 1, 1, 2]
# The builtin bool works across NumPy versions; the np.bool alias is missing in some releases.
one_hot = np.zeros((len(mylist), 4), dtype=bool)
for i, v in enumerate(mylist):
    one_hot[i, v] = True
Output
array([[ True, False, False, False],
[False, False, False, True],
[False, False, True, False],
[False, True, False, False],
[False, True, False, False],
[False, False, True, False]], dtype=bool)
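Another common way to build the same matrix, shown here only as a sketch, is to index a boolean identity matrix with the input array. The name `n_classes` is an assumption for the number of distinct values (4 in the question).

```python
import numpy as np

a = np.array([0, 3, 2, 1, 1, 2])
n_classes = 4  # assumed: values lie in the range 0..3

# Row i of the identity matrix is the one-hot vector for class i,
# so fancy indexing with `a` picks the right row for each input value.
one_hot = np.eye(n_classes, dtype=bool)[a]

# Same result via the broadcasting comparison from the answer above.
same = a[:, None] == np.arange(n_classes)

print(one_hot)
```

Both forms avoid the explicit Python loop; the broadcasting comparison does not need the intermediate identity matrix.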

np.isreal behavior different in pandas.DataFrame and numpy.array

I have an array like below
np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}])
and a pandas DataFrame like below
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}]})
When I apply np.isreal to the DataFrame:
df.applymap(np.isreal)
Out[811]:
A
0 False
1 False
2 True
3 False
4 False
5 True
When I apply np.isreal to the numpy array:
np.isreal( np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}]))
Out[813]: array([ True, True, True, True, True, True], dtype=bool)
I must be using np.isreal in the wrong use case, but can you help me understand why the results are different?
A partial answer is that isreal is only intended to be used on array-like as the first argument.
You want to use isrealobj on each element to get the behavior you see here:
In [11]: a = np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}])
In [12]: a
Out[12]:
array(['hello', 'world', {'a': 5, 'b': 6, 'c': 8}, 'usa', 'india',
{'d': 9, 'e': 10, 'f': 11}], dtype=object)
In [13]: [np.isrealobj(aa) for aa in a]
Out[13]: [True, True, True, True, True, True]
In [14]: np.isreal(a)
Out[14]: array([ True, True, True, True, True, True], dtype=bool)
That does leave the question, what does np.isreal do on something that isn't array-like e.g.
In [21]: np.isrealobj("")
Out[21]: True
In [22]: np.isreal("")
Out[22]: False
In [23]: np.isrealobj({})
Out[23]: True
In [24]: np.isreal({})
Out[24]: True
It turns out this stems from .imag since the test that isreal does is:
return imag(x) == 0 # note imag == np.imag
and that's it.
In [31]: np.imag(a)
Out[31]: array([0, 0, 0, 0, 0, 0], dtype=object)
In [32]: np.imag("")
Out[32]:
array('',
dtype='<U1')
In [33]: np.imag({})
Out[33]: array(0, dtype=object)
This looks up the .imag attribute on the array.
In [34]: np.asanyarray("").imag
Out[34]:
array('',
dtype='<U1')
In [35]: np.asanyarray({}).imag
Out[35]: array(0, dtype=object)
I'm not sure why this isn't set in the string case yet...
I think this a small bug in Numpy to be honest. Here Pandas is just looping over each item in the column and calling np.isreal() on it. E.g.:
>>> np.isreal("a")
False
>>> np.isreal({})
True
I think the paradox here has to do with how np.real() treats inputs of dtype=object. My guess is it's taking the object pointer and treating it like an int, so of course np.isreal(<some object>) returns True. Over an array of mixed types like np.array(["A", {}]), the array is of dtype=object so np.isreal() is treating all the elements (including the strings) the way it would anything with dtype=object.
To be clear, I think the bug is in how np.isreal() treats arbitrary objects in a dtype=object array, but I haven't confirmed this explicitly.
There are a couple of things going on here. First, as the previous answers point out, np.isreal acts strangely when passed objects.
However, I think you are also confused about what applymap is doing. Difference between map, applymap and apply methods in Pandas is always a great reference.
In this case what you think you are doing is actually:
df.apply(np.isreal, axis=1)
Which essentially calls np.isreal(df), whereas df.applymap(np.isreal) essentially calls np.isreal on each individual element of df, e.g.:
np.isreal(df.A)
array([ True, True, True, True, True, True], dtype=bool)
np.array([np.isreal(x) for x in df.A])
array([False, False, True, False, False, True], dtype=bool)
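If the underlying goal is to flag elements by their Python type, an explicit isinstance check avoids depending on how np.isreal treats arbitrary objects. This is a sketch under that assumption (the question does not say what the check is for), and the names `is_dict` and `is_real_number` are illustrative.

```python
import numbers
import pandas as pd

df = pd.DataFrame({'A': ["hello", "world", {"a": 5, "b": 6, "c": 8},
                         "usa", "india", {"d": 9, "e": 10, "f": 11}]})

# Element-wise type checks: Series.map calls the function once per element.
is_dict = df['A'].map(lambda x: isinstance(x, dict))
is_real_number = df['A'].map(lambda x: isinstance(x, numbers.Real))

print(is_dict.tolist())
print(is_real_number.tolist())
```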

Convert two boolean columns to class ID in Pandas

I have two boolean columns:
df = pd.DataFrame([[True, True],
[True, False],
[False, True],
[True, True],
[False, False]],
columns=['col1', 'col2'])
I need to generate a new column that identifies which unique combination they belong to:
result = pd.Series([0, 1, 2, 0, 3])
Seems like there should be a very simple way to do this but it's escaping me. Maybe something using sklearn.preprocessing? Simple Pandas or Numpy solutions are equally preferred.
EDIT: Would be really nice if the solution could scale to more than 2 columns
The simplest is to create tuples and use factorize:
print (pd.Series(pd.factorize(df.apply(tuple, axis=1))[0]))
0 0
1 1
2 2
3 0
4 3
dtype: int64
Another solution with cast to string and sum:
print (pd.Series(pd.factorize(df.astype(str).sum(axis=1))[0]))
0 0
1 1
2 2
3 0
4 3
dtype: int64
I've never used pandas before but here is a solution with plain python that I'm sure wouldn't be hard to adapt to pandas:
a = [[True, True],
     [True, False],
     [False, True],
     [True, True],
     [False, False]]
ids, result = [], []  # ids keeps the previously seen items; result keeps the output
for x in a:
    if x in ids:  # x has been seen before
        row_id = ids.index(x)  # find old id
    else:  # x hasn't been seen before
        row_id = len(ids)  # create new id
        ids.append(x)
    result.append(row_id)
print(result)  # [0, 1, 2, 0, 3]
This works with any number of columns, to get the result into a series just use:
result = pd.Series(result)
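The plain-Python loop above searches a list for every row, which can get slow for many distinct combinations. A dictionary keyed by the row tuple does the same job with constant-time lookups. This is a sketch, and `combination_ids` is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame([[True, True],
                   [True, False],
                   [False, True],
                   [True, True],
                   [False, False]],
                  columns=['col1', 'col2'])

def combination_ids(rows):
    """Number each distinct row in order of first appearance."""
    seen = {}
    # setdefault returns the stored id, or stores the next free id
    # (the current dict size) for a combination not seen before.
    return [seen.setdefault(tuple(row), len(seen)) for row in rows]

# itertuples(index=False, name=None) yields each row as a plain tuple.
result = pd.Series(combination_ids(df.itertuples(index=False, name=None)))
print(result.tolist())
```

Like the factorize approach, this works for any number of columns.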
