Why numpy .isin function gives incorrect output - python

My requirement is that I have a large dataframe with millions of rows. I encoded all strings as numeric values in order to use NumPy's vectorization to increase processing speed.
So I was looking for a way to quickly check whether a number exists in another list column. Previously I was using a list comprehension with string values, but after converting to np.arrays I was looking for a similar function.
I stumbled across this link: check if values of a column are in values of another numpy array column in pandas
In order to try numpy.isin, I ran the code below:
dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 5, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 5 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
When I enter:
np.isin(dt['col_a'], dt['col_b'])
The output is:
array([False, True, False, False, True])
This is incorrect, since the third row has 5 in both col_a and col_b.
Whereas if I change the value to 4, as below:
dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 4, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 4 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
and execute the same code:
np.isin(dt['col_a'], dt['col_b'])
I get the correct result:
array([False, True, True, False, True])
Can someone please explain why it gives different results?

Since col_b contains not only lists but also integers, you may need to use apply and treat the two cases differently:
dt.apply(lambda x: x['col_a'] in x['col_b'] if type(x['col_b']) is list
         else x['col_a'] == x['col_b'], axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool

np.isin checks, for each element of dt['col_a'], whether it is present anywhere in the whole dt['col_b'] column, i.e.:
[
    1 in dt['col_b'],
    2 in dt['col_b'],
    5 in dt['col_b'],
    ...
]
There is no 5 among the top-level elements of dt['col_b'] (the 5s sit inside the nested lists), but there is a top-level 4.
From the docs
isin is an element-wise function version of the python keyword in. isin(a, b) is roughly equivalent to np.array([item in b for item in a]) if a and b are 1-D sequences.
Also, your issue is that dt['col_b'] is inconsistent (some values are numbers, some are lists). I think the easiest approach is to use apply:
def isin(row):
    if isinstance(row['col_b'], int):
        return row['col_a'] == row['col_b']
    else:
        return row['col_a'] in row['col_b']

dt.apply(isin, axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
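If apply turns out to be too slow at millions of rows, a hedged alternative (my own sketch, not part of either answer) is to normalize col_b so every entry is list-like, then do one plain Python pass, which sidesteps np.isin's column-wide semantics entirely:

```python
import pandas as pd

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 5, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})

# Wrap scalar entries of col_b in a one-element list so every row is a list.
as_list = dt['col_b'].map(lambda v: v if isinstance(v, list) else [v])

# Row-wise membership test over paired values, not over the whole column.
result = [a in b for a, b in zip(dt['col_a'], as_list)]
# result == [False, True, True, False, True]
```

This gives the same answer as the apply versions above while keeping the membership test explicit.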

Related

Pandas: adjust value of DataFrame that is sliced multiple times

Imagine I have the following pandas DataFrame:
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [1, 2, 3, 4, 5, 6]
})
I want to adjust the first value when type == 'B' to 999, i.e. the fourth row's value should become 999.
Initially I imagined that
df.loc[df['type'] == 'B'].iloc[0, -1] = 999
or something similar would work. But as far as I can see, slicing the df twice no longer points to the original df, so its value is not updated.
My other attempt is
df.loc[df.loc[df['type'] == 'B'].index[0], df.columns[-1]] = 999
which works, but is quite ugly.
So I'm wondering -- what would be the best approach in such situation?
You can use idxmax, which returns the index of the first occurrence of the max value. Applied to a boolean Series, that is the first True:
df.loc[(df['type'] == 'B').idxmax(), 'value'] = 999
Output:
type value
0 A 1
1 A 2
2 A 3
3 B 999
4 B 5
5 B 6
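One caveat with idxmax (my note, not from the answer): if no row matches, the boolean Series is all False and idxmax silently returns the first index, so row 0 would be overwritten. An equivalent spelling that fails loudly on no match could look like this:

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'value': [1, 2, 3, 4, 5, 6]})

# Label of the first matching row; raises IndexError if nothing matches,
# instead of silently writing to row 0 the way idxmax would.
first_b = df.index[df['type'] == 'B'][0]
df.loc[first_b, 'value'] = 999
```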

Pandas transform inconsistent behavior for list

I have sample snippet that works as expected:
import pandas as pd
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)
The result is:
label wave y new
0 a 1 0 (1,)
1 b 2 0 (2, 3)
2 b 3 0 (2, 3)
3 c 4 0 (4,)
It works analogously if, instead of tuple, I pass set, frozenset, or dict to transform, but if I pass list I get a completely unexpected result:
df['new'] = df.groupby(['label'])[['wave']].transform(list)
label wave y new
0 a 1 0 1
1 b 2 0 2
2 b 3 0 3
3 c 4 0 4
There is a workaround to get expected result:
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)['wave'].apply(list)
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
I thought about mutability/immutability (list vs. tuple), but for set/frozenset it is consistent.
The question is why it works in this way?
I've come across a similar issue before. The underlying issue, I think, is that when the number of elements in the list matches the number of records in the group, transform tries to unpack the list so that each element maps to a record in the group.
For example, this will cause the list to unpack, since the length of the list matches the length of each group:
df.groupby(['label'])[['wave']].transform(lambda x: list(x))
wave
0 1
1 2
2 3
3 4
However, if the length of the list is not the same as the group's, you get the desired behaviour:
df.groupby(['label'])[['wave']].transform(lambda x: list(x)+[0])
wave
0 [1, 0]
1 [2, 3, 0]
2 [2, 3, 0]
3 [4, 0]
I think this is a side effect of the list unpacking functionality.
I think that is a bug in pandas. Can you open a ticket on their GitHub page, please?
At first I thought it might be because list is just not handled correctly as an argument to .transform, but if I do:
def create_list(obj):
    print(type(obj))
    return obj.to_list()

df.groupby(['label'])[['wave']].transform(create_list)
I get the same unexpected result. If however the agg method is used, it works directly:
df.groupby(['label'])['wave'].agg(list)
Out[179]:
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
I can't imagine that this is intended behaviour.
Btw., I also find the different behaviour suspicious that shows up if you apply tuple to a grouped Series versus a grouped DataFrame. E.g. if transform is applied to a Series instead of a DataFrame, the result is not a Series containing tuples but a Series containing ints (remember that for [['wave']], which creates a one-column DataFrame, transform(tuple) indeed returned tuples):
df.groupby(['label'])['wave'].transform(tuple)
Out[177]:
0 1
1 2
2 3
3 4
Name: wave, dtype: int64
If I do that again with agg instead of transform, it works for both ['wave'] and [['wave']].
I was using version 0.25.0 on an Ubuntu x86_64 system for my tests.
Since DataFrames are mainly designed to handle 2D data, including arrays instead of scalar values may stumble upon a caveat such as this one.
pd.DataFrame.transform was originally implemented on top of .agg:
# pandas/core/generic.py
@Appender(_shared_docs["transform"] % dict(axis="", **_shared_doc_kwargs))
def transform(self, func, *args, **kwargs):
    result = self.agg(func, *args, **kwargs)
    if is_scalar(result) or len(result) != len(self):
        raise ValueError("transforms cannot produce aggregated results")
    return result
However, transform must always return a DataFrame with the same length as self, which is essentially the input.
When you do an .agg function on the DataFrame, it works fine:
df.groupby('label')['wave'].agg(list)
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
The problem gets introduced when transform tries to return a Series with the same length.
In the process of transforming a groupby element, which is a slice of self, and then concatenating it again, lists get unpacked to the same length as the index, as @Allen mentioned.
However, when the lengths don't align, they don't get unpacked:
df.groupby(['label'])[['wave']].transform(lambda x: list(x) + [1])
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]
A workaround for this problem might be to avoid transform altogether:
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df = df.merge(df.groupby('label')['wave'].agg(list).rename('new'), on='label')
df
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
Another interesting workaround, which works for strings, is:
df = df.applymap(str)  # Make them all strings... best used on non-numeric data.
df.groupby(['label'])['wave'].transform(' '.join).str.split()
Output:
0 [1]
1 [2, 3]
2 [2, 3]
3 [4]
Name: wave, dtype: object
The suggested answers do not work on pandas 1.2.4 anymore. Here is a workaround:
df.groupby(['label'])[['wave']].transform(lambda x: [list(x) + [1]]*len(x))
The idea behind it is the same as explained in the other answers (e.g. @Allen's). The solution here is to wrap the list in another list and repeat it as many times as the group length, so that when pandas transform unwraps it, each row gets the inner list.
Output:
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]
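Another way to sidestep transform entirely (a sketch along the same lines as the merge workaround above) is to aggregate to lists once with agg, then broadcast the result back onto the rows with map:

```python
import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', 'b', 'c'],
                   'wave': [1, 2, 3, 4],
                   'y': [0, 0, 0, 0]})

# agg(list) never triggers transform's unpacking; map then aligns the
# per-group lists back onto the rows via the label values.
lists = df.groupby('label')['wave'].agg(list)
df['new'] = df['label'].map(lists)
# df['new'] holds [1], [2, 3], [2, 3], [4] row by row
```

Unlike the merge version, this preserves the original index. One caveat: rows of the same group share the same list object.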

Select DataFrame data base on series value

I have a pandas' DataFrame and when I perform an operation on the dataframe, I get back a series. How can I use that series to select out only records where I find a match?
Right now I'm appending the column onto the DataFrame, running a query against the dataframe, then dropping the column. I really do not like this solution, though, so I'm hoping to find a better one.
data = [[1,2,3], [1,3,4], [3,4,5]]
columns = ['a', 'b', 'c']
df = pd.DataFrame(data, columns=columns)
series = df.myoperation()
df['myoperation'] = series
res = df[df['myoperation'] == True]
del res['myoperation']
The series object will produce a 1-1 match, so index item 1 will match item 1 in the dataframe object.
Above is my hacky code to get it done, but I'm afraid that when the dataframe has many columns or much more data than this simple example, it will be slow.
Thank you
If series is a boolean Series with the same index and length as df, you can use it directly - this is called boolean indexing:
series = pd.Series([True, False, True], index=df.index)
res = df[series]
print (res)
a b c
0 1 2 3
2 3 4 5
It also works with a boolean list or numpy array; only the length has to be the same as df's:
L = [True, False, True]
res = df[L]
print (res)
a b c
0 1 2 3
2 3 4 5
arr = np.array([True, False, True])
res = df[arr]
print (res)
a b c
0 1 2 3
2 3 4 5
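Putting that together for the original question: whatever df.myoperation() returns, as long as it is an aligned boolean Series it can index the frame directly, with no helper column and no del. A minimal sketch (using a simple column comparison as a stand-in for the questioner's myoperation):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 3, 4], [3, 4, 5]], columns=['a', 'b', 'c'])

# Stand-in for df.myoperation(): any expression that yields a boolean
# Series aligned with df's index works the same way.
mask = df['a'] == 1
res = df[mask]  # keeps rows 0 and 1; no temporary column needed
```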

Pandas GroupBy Index

I have a dataframe with a column that I want to group by. Within each group, I want to check whether the first value is less than the second value times some scalar, e.g. (x < y * .5). If it is, the first value is set to True and all other values to False. Otherwise, all values are False.
I have a sample data frame here:
d = pd.DataFrame(np.array([[0, 0, 1, 1, 2, 2, 2],
                           [3, 4, 5, 6, 7, 8, 9],
                           [1.25, 10.1, 2.3, 2.4, 1.2, 5.5, 5.7]]).T,
                 columns=['a', 'b', 'c'])
I can use a stacked groupby to get the data I want:
g = d.groupby('a')['c'].nsmallest(2).groupby(level='a')
This results in three groups, each with two entries. By adding an apply, I can call a function that returns a boolean mask:
def func(group):
    if group.iloc[0] < group.iloc[1] * .5:
        return [True, False]
    else:
        return [False, False]
g = d.groupby('a')['c'].nsmallest(2).groupby(level='a').apply(func)
Unfortunately, this destroys the index into the original dataframe and removes the ability to handle cases where more than 2 elements are present.
Two questions:
Is it possible to maintain the index into the original dataframe and update a column with the results of a groupby? This is made slightly harder because the .nsmallest call results in a Series on the 'c' column.
Does a more elegant method exist to compute a boolean array for groups in a dataframe based on some custom criterion, e.g. this ratio test?
Looks like transform is what you need:
>>> def func(group):
...     res = [False] * len(group)
...     if group.iloc[0] < group.iloc[1] * .5:
...         res[0] = True
...     return res
>>> d['res'] = d.groupby('a')['c'].transform(func).astype('bool')
>>> d
a b c res
0 0 3 1.25 True
1 0 4 10.10 False
2 1 5 2.30 False
3 1 6 2.40 False
4 2 7 1.20 True
5 2 8 5.50 False
6 2 9 5.70 False
From the documentation:
The transform method returns an object that is indexed the same (same
size) as the one being grouped. Thus, the passed transform function
should return a result that is the same size as the group chunk. For
example, suppose we wished to standardize the data within each group
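If calling a Python function per group ever becomes a bottleneck, a mostly vectorized sketch (my own, assuming each group's minimum row is unique) uses idxmin to locate the smallest row per group and takes a second min after dropping it:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(np.array([[0, 0, 1, 1, 2, 2, 2],
                           [3, 4, 5, 6, 7, 8, 9],
                           [1.25, 10.1, 2.3, 2.4, 1.2, 5.5, 5.7]]).T,
                 columns=['a', 'b', 'c'])

idx_min = d.groupby('a')['c'].idxmin()            # row label of smallest c per group
second = d.drop(idx_min).groupby('a')['c'].min()  # smallest c once the minimum is removed
# Both Series come out ordered by group key, so positional comparison lines up.
passes = d.loc[idx_min, 'c'].to_numpy() < second.to_numpy() * .5

d['res'] = False
d.loc[idx_min[passes], 'res'] = True
```

This keeps the original index intact and handles groups with more than two elements, at the cost of assuming unique minima.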

Filtering rows from pandas dataframe using concatenated strings

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the dataframe, True if that row corresponds to one of the identifiers and False otherwise (hoping to be able to filter afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the dataframe remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids plays the role of your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the Series method .isin().
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that together make up your ids? We can get around that by combining two column-wise isin checks:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so adding the two boolean series gives something like an OR operation (using | is the clearer spelling). Then, as before, we can index into this boolean series:
In [29]: new = df.loc[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
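For completeness, the questioner's third attempt (concatenate, then isin) has the right shape; the usual pitfall is that isin filtering returns a new frame that must be assigned somewhere, it does not modify df in place. A small self-contained sketch with invented column values for illustration:

```python
import pandas as pd

df = pd.DataFrame({'AcNo': ['A1', 'A2', 'A3'],
                   'Sortcode': ['x', 'y', 'z']})
acids = pd.Series(['A1x', 'A3z'])  # hypothetical identifier series

# Build the combined id and keep matching rows; note the assignment --
# df[...] on its own returns a new frame and leaves df untouched.
filtered = df[(df['AcNo'] + df['Sortcode']).isin(acids)]
```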
