Select DataFrame data based on series value - python

I have a pandas DataFrame, and when I perform an operation on it, I get back a Series. How can I use that Series to select only the records where there is a match?
Right now I'm appending the Series as a column onto the DataFrame, querying against that column, and then dropping it again. I really don't like this solution, so I'm hoping for a better one.
data = [[1,2,3], [1,3,4], [3,4,5]]
columns = ['a', 'b', 'c']
df = pd.DataFrame(data, columns=columns)
series = df.myoperation()
df['myoperation'] = series
res = df[df['myoperation'] == True]
del res['myoperation']
The series object produces a 1-1 match, so index item 1 will match item 1 in the dataframe object.
Above is my hacky code to get it done, but I'm afraid that when the dataframe has many columns or much more data than this simple example, it will be slow.
Thank you

I think you can use boolean indexing, if series is a boolean Series with the same index and the same length as df:
series = pd.Series([True, False, True], index=df.index)
res = df[series]
print (res)
a b c
0 1 2 3
2 3 4 5
It also works with a boolean list or a numpy array; only the length has to be the same as df:
L = [True, False, True]
res = df[L]
print (res)
a b c
0 1 2 3
2 3 4 5
arr = np.array([True, False, True])
res = df[arr]
print (res)
a b c
0 1 2 3
2 3 4 5
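With the asker's setup, and assuming df.myoperation() stands in for some row-wise predicate (the condition df['a'] == 1 below is a made-up stand-in), the temporary column can be skipped entirely:

```python
import pandas as pd

data = [[1, 2, 3], [1, 3, 4], [3, 4, 5]]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

series = df['a'] == 1   # stand-in for df.myoperation(); a boolean Series
res = df[series]        # boolean indexing: keeps rows where series is True
print(res)
```

No helper column is added or dropped, so there is nothing to clean up afterwards.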


Why numpy .isin function gives incorrect output

My requirement is that I have a large dataframe with millions of rows. I encoded all strings to numeric values in order to use numpy's vectorization to increase processing speed.
So I was looking for a way to quickly check whether a number exists in another list column. Previously I was using a list comprehension with string values, but after converting to np.arrays I was looking for a similar function.
I stumbled across this link: check if values of a column are in values of another numpy array column in pandas
In order to use numpy.isin, I tried running the code below:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,5,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 5 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
When I enter:
np.isin(dt['col_a'], dt['col_b'])
The output is:
array([False, True, False, False, True])
Which is incorrect, as the 3rd row has 5 in both columns col_a and col_b.
Whereas if I change the value to 4 as below:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,4,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 4 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
and execute same code:
np.isin(dt['col_a'], dt['col_b'])
I get correct result:
array([False, True, True, False, True])
Can someone please let me know why it's giving different results?
Since col_b not only has lists but also integers, you may need to use apply and treat them differently:
dt.apply(lambda x: x['col_a'] in x['col_b'] if type(x['col_b']) is list
         else x['col_a'] == x['col_b'], axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
np.isin for each element from dt['col_a'] checks whether it is present in the whole dt['col_b'] column, i.e.:
[
1 in dt['col_b'],
2 in dt['col_b'],
5 in dt['col_b'],
...
]
There's no 5 among the elements of dt['col_b'], but there is a 4.
From the docs
isin is an element-wise function version of the python keyword in. isin(a, b) is roughly equivalent to np.array([item in b for item in a]) if a and b are 1-D sequences.
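That equivalence is easy to check on this data with plain Python (a sketch of what np.isin is doing here, not pandas code). Note that 5 never equals any single entry of col_b: the list [2, 5, 4] is an element, but the number 5 itself is not.

```python
col_a = [1, 2, 5, 1, 2]
col_b = [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]

# roughly what np.isin computes: membership of each item of col_a
# against the *whole* of col_b, not row by row
result = [item in col_b for item in col_a]
print(result)  # [False, True, False, False, True]
```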
Also, your issue is that you have an inconsistent dt['col_b'] column (some values are numbers, some are lists). I think the easiest approach is to use apply:
def isin(row):
    if isinstance(row['col_b'], int):
        return row['col_a'] == row['col_b']
    else:
        return row['col_a'] in row['col_b']

dt.apply(isin, axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
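Another option (my own sketch, not from the answers above) is to normalize col_b so every entry is a list, then do the row-wise membership test in one comprehension:

```python
import pandas as pd

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 5, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})

# wrap scalars so every entry of col_b is a list
norm = dt['col_b'].apply(lambda v: v if isinstance(v, list) else [v])
mask = pd.Series([a in b for a, b in zip(dt['col_a'], norm)], index=dt.index)
print(mask.tolist())  # [False, True, True, False, True]
```

The comprehension avoids apply's per-row function-call overhead on the comparison itself, though the normalization step still pays it once.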

Combine multiple columns into one category column using the column names as value labels

I have this data
ID A B C
0 0 True False False
1 1 False True False
2 2 False False True
And want to transform it into
ID group
0 0 A
1 1 B
2 2 C
I want to use the column names as value labels for the category column.
There is a maximum of only one True value per row.
This is the MWE
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame({
'ID': range(3),
'A': [True, False, False],
'B': [False, True, False],
'C': [False, False, True]
})
result = pd.DataFrame({
'ID': range(3),
'group': ['A', 'B', 'C']
})
result.group = result.group.astype('category')
print(df)
print(result)
I could do df.apply(lambda row: ...magic.., axis=1). But isn't there a more elegant way with pandas' own tools?
You can use df.dot:
df['group'] = df[['A', 'B', 'C']].dot(df.columns[1:])
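Spelled out against the MWE, here is a sketch of how the dot trick works: bool-times-string multiplication repeats each column name once per True, so each row's product-sum is exactly the matching label.

```python
import pandas as pd

df = pd.DataFrame({'ID': range(3),
                   'A': [True, False, False],
                   'B': [False, True, False],
                   'C': [False, False, True]})

# True * 'A' == 'A' and False * 'A' == '', so the dot product of each
# boolean row with the column labels concatenates to the single True label
df['group'] = df[['A', 'B', 'C']].dot(df.columns[1:]).astype('category')
result = df[['ID', 'group']]
print(result)
```

Note this relies on at most one True per row (as the question guarantees); two Trues would concatenate both labels.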
You could use pd.melt() to reshape and rename, then filter on the 'value' column using query:
pd.melt(df,id_vars=['ID'],var_name= 'group').query('value')
ID group value
0 0 A True
4 1 B True
8 2 C True
Chaining .drop('value',axis=1).reset_index(drop=True) will give your final output:
ID group
0 0 A
1 1 B
2 2 C
Yet another way:
(df.set_index(['ID'])
   .rename_axis('group', axis=1)  # getting the column name correct
   .stack()                       # reshaping: column headers into rows
   .loc[lambda x: x]              # filtering for True
   .reset_index()                 # moving ID back into dataframe columns
   .drop(0, axis=1))              # dropping the boolean column
Output:
ID group
0 0 A
1 1 B
2 2 C
More verbose than melt, but this drops the invalid columns during the reshaping:
(df.set_index('ID')
.rename_axis(columns='group')
.replace(False, pd.NA)
.stack().reset_index().drop(columns=0)
)
output:
ID group
0 0 A
1 1 B
2 2 C
You can use melt, then a lookup based on the column where the values are True, to get the results you are expecting:
df = df.melt(id_vars = 'ID', var_name = 'group')
df.loc[df['value'] == True][['ID', 'group']]
idxmax
s = df.set_index('ID')
s.idxmax(axis=1).where(s.any(axis=1))
ID
0 A
1 B
2 C
dtype: object
Try with apply and a lambda:
df.set_index('ID').apply(lambda x : x.index[x][0],axis=1)
Out[39]:
ID
0 A
1 B
2 C
dtype: object
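As an aside, recent pandas versions (1.5+) ship pd.from_dummies, which inverts a one-hot frame like this directly; a sketch assuming that version is available:

```python
import pandas as pd

df = pd.DataFrame({'ID': range(3),
                   'A': [True, False, False],
                   'B': [False, True, False],
                   'C': [False, False, True]})

# from_dummies returns a one-column DataFrame whose values are the
# column names of the True entries (requires exactly one True per row)
group = pd.from_dummies(df[['A', 'B', 'C']]).iloc[:, 0].astype('category')
print(group.tolist())  # ['A', 'B', 'C']
```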

selecting pandas rows: why isin() and not i in mylist?

I have a dataframe with a multi-index and need to select only the rows where the first index is not in a list. This works:
df = df.iloc[~(df.index.get_level_values(0).isin(mylist))]
This doesn't:
df = df.iloc[df.index.get_level_values(0) not in mylist]
I get an error about the truth value of the array.
Why? What does it mean? Is it documented in the official docs?
Say, you have a dataframe df as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(30).reshape((6,5)))
tuples = [(i//2, i%2) for i in range(6)]
df.index = pd.MultiIndex.from_tuples(tuples)
print(df)
0 1 2 3 4
0 0 0.623671 0.335741 0.035219 0.902372 0.349697
1 0.487387 0.325101 0.361753 0.935972 0.425735
1 0 0.147836 0.599608 0.888232 0.712804 0.604688
1 0.156712 0.286682 0.680316 0.104996 0.389507
2 0 0.212923 0.580401 0.02415 0.712987 0.803497
1 0.804538 0.035597 0.611101 0.328159 0.140793
df.index.get_level_values(0) will return an array: Int64Index([0, 0, 1, 1, 2, 2], dtype='int64')
The error says that with the in operator it is not clear whether you want to check that all elements of that array are in the list, or that any element is. You are comparing the whole array against the list, and in does not do the element-wise comparison you want; even if the intent were clear, it would return a single value.
If you try df.index.get_level_values(0).isin([0, 1]), on the other hand, it will return an array of boolean values: array([ True, True, True, True, False, False], dtype=bool). It checks whether the first 0 is in the list, whether the second 0 is in the list, whether 1 is in the list, and so on. Those boolean values are then used to slice the dataframe (i.e. show only the rows where the array has a True value):
In [12]: df.iloc[[ True, True, True, True, False, False]]
Out [12]: 0 1 2 3 4
0 0 0.623671 0.335741 0.035219 0.902372 0.349697
1 0.487387 0.325101 0.361753 0.935972 0.425735
1 0 0.147836 0.599608 0.888232 0.712804 0.604688
1 0.156712 0.286682 0.680316 0.104996 0.389507
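A minimal reproduction of both behaviours (a sketch with a plain array standing in for the index level):

```python
import numpy as np

arr = np.array([0, 0, 1, 1, 2, 2])   # like df.index.get_level_values(0)

mask = np.isin(arr, [0, 1])          # element-wise membership test
print(mask)                          # [ True  True  True  True False False]

err = None
try:
    arr in [0, 1]                    # list membership compares arr against
except ValueError as e:              # each list element, producing boolean
    err = e                          # arrays whose truth value is ambiguous
print(err)
```

The captured error is the familiar "The truth value of an array with more than one element is ambiguous" message from the question.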

Filtering rows from pandas dataframe using concatenated strings

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I added a new column containing the concatenated strings/identifiers, and then tried to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the .isin() method.
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that hold your ids? In that case we can combine two isin masks, with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so adding the two boolean series from the two isin() calls gives something like an OR operation (the idiomatic spelling is the | operator). Then like before we can index into this boolean series:
In [29]: new = df.loc[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
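Tying this back to the question: concatenating the two columns and passing the result to isin should work, provided both columns really are strings. The column values below are hypothetical stand-ins for the asker's data:

```python
import pandas as pd

df = pd.DataFrame({'AcNo': ['A1', 'A2', 'A3'],
                   'Sortcode': ['01', '02', '03'],
                   'val': [10, 20, 30]})
acids = pd.Series(['A101', 'A303'])   # hypothetical identifiers

mask = (df['AcNo'] + df['Sortcode']).isin(acids)
res = df[mask]
print(res)
```

If this keeps every row, a likely culprit is a dtype mismatch (e.g. numeric codes on one side, strings on the other), which would make every membership test fail or succeed unexpectedly.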

How can I map True/False to 1/0 in a Pandas DataFrame?

I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
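For example, with a made-up single-column frame:

```python
import pandas as pd

df = pd.DataFrame({'somecolumn': [True, False, True]})

# cast the boolean column to integers in place
df['somecolumn'] = df['somecolumn'].astype(int)
print(df['somecolumn'].tolist())  # [1, 0, 1]
```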
Just multiply your Dataframe by 1 (int)
[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print(data)
0 1 2
0 True False True
1 False False True
[3]: print(data*1)
0 1 2
0 1 0 1
1 0 0 1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
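A quick sketch (made-up columns) showing that replace leaves the non-boolean columns alone:

```python
import pandas as pd

df = pd.DataFrame({'flag': [True, False],
                   'name': ['x', 'y'],
                   'score': [1.5, 2.5]})

# only the boolean values are rewritten; strings and floats pass through
out = df.replace({False: 0, True: 1})
print(out)
```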
You also can do this directly on Frames
In [104]: df = DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view to convert booleans to integers (note that Series.view is deprecated in recent pandas versions):
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation on your data frame: given a DataFrame of booleans, e.g.
df = pd.DataFrame(my_data_condition)
transform True/False into 1/0 by multiplying by 1:
df = df * 1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Please find below how to map the column 'type', which has values FAKE/REAL, to 0/1 (note: the same applies to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
This is a reproducible example based on some of the existing answers:
import pandas as pd

def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert the boolean to binary representation, maintain NaN values."""
    return s.replace({True: 1, False: 0})

# generate a random dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
    a_bool=lambda df: df["a"] > 5,
    b_bool=lambda df: df["b"] % 2 == 0,
)

# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]

# apply the new coding to a new dataframe (or can replace the existing one);
# c=c binds each column name early, avoiding the lambda late-binding pitfall
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})
Tried and tested (note this maps the strings 'True'/'False'; if the column holds actual booleans, map {True: 1, False: 0} instead):
df[col] = df[col].map({'True': 1, 'False': 0})
If there is more than one column with True/False values, loop over them:
for col in bool_cols:
    df[col] = df[col].map({'True': 1, 'False': 0})
As AMC wrote in a comment, if the column is of type object:
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
