Use boolean series of different length to select rows from dataframe - python

I have a dataframe that looks like this:
df = pd.DataFrame({"piece": ["piece1", "piece2", "piece3", "piece4"], "No": [1, 1, 2, 3]})
   No   piece
0   1  piece1
1   1  piece2
2   2  piece3
3   3  piece4
I have a series with an index that corresponds to the "No"-column in the dataframe. It assigns boolean variables to the "No"-values, like so:
s = pd.Series([True, False, True, True])
0 True
1 False
2 True
3 True
dtype: bool
I would like to select those rows from the dataframe where in the series the "No"-value is True. This should result in
   No   piece
2   2  piece3
3   3  piece4
I've tried a lot of indexing, with df["No"].isin(s), and things like df[s["No"] == True]... but nothing has worked yet.

I think you need to map the values in the No column to the True/False condition and use the result for subsetting:
df[df.No.map(s)]
#    No   piece
# 2   2  piece3
# 3   3  piece4
df.No.map(s)
# 0 False
# 1 False
# 2 True
# 3 True
# Name: No, dtype: bool

You are trying to index into s using df['No'], then use the result as a mask on df itself:
df[s[df['No']].values]
The final mask needs to be extracted as an array using values because the duplicates in the index cause an error otherwise.
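Both answers can be checked end to end; a self-contained sketch reconstructing the question's data:

```python
import pandas as pd

df = pd.DataFrame({"piece": ["piece1", "piece2", "piece3", "piece4"],
                   "No": [1, 1, 2, 3]})
s = pd.Series([True, False, True, True])

# Approach 1: map each "No" value through s, giving a mask aligned to df's rows
mask1 = df["No"].map(s)

# Approach 2: index s by the "No" values, then strip the duplicated index
mask2 = s[df["No"]].values

print(mask1.tolist())               # [False, False, True, True]
print(df[mask1]["piece"].tolist())  # ['piece3', 'piece4']
```

Both produce the same mask; map has the advantage of returning a Series already aligned with df's index, so no .values extraction is needed.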

Related

Pandas dataframe check if a value exists in multiple columns for one row

I want to print the index of any row where the value is True in more than one column.
For example if data frame is the following:
  Remove Ignore Repair
0   True  False  False
1  False   True   True
2  False   True  False
I want it to print:
1
Is there an elegant way to do this instead of bunch of if statements?
You can use sum and pass axis=1 to sum across the columns:
import pandas as pd
df = pd.DataFrame({'a': [False, True, False], 'b': [False, True, False], 'c': [True, False, False]})
print(df)
print("Ans: ",df[(df.sum(axis=1)>1)].index.tolist())
output:
       a      b      c
0  False  False   True
1   True   True  False
2  False  False  False
Ans: [1]
To get the first row that meets the criteria:
df.index[df.sum(axis=1).gt(1)][0]
Output:
Out[14]: 1
Since you can get multiple matches, drop the [0] to get all of the rows that meet your criteria.
You can just sum booleans as they will be interpreted as True=1, False=0:
df.sum(axis=1) > 1
So to filter to rows where this evaluates as True:
df.loc[df.sum(axis=1) > 1]
Or the same thing but being more explicit about converting the booleans to integers:
df.loc[df.astype(int).sum(axis=1) > 1]
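Applied to the columns from the question itself (DataFrame reconstructed from the example above), the whole pipeline is:

```python
import pandas as pd

# DataFrame reconstructed from the question's example
df = pd.DataFrame({"Remove": [True, False, False],
                   "Ignore": [False, True, True],
                   "Repair": [False, True, False]})

# True counts as 1 and False as 0, so a row sum greater than 1
# means more than one column is True in that row
rows = df.index[df.sum(axis=1) > 1].tolist()
print(rows)  # [1]
```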

How to determine if a dataframe column contains a particular list, independently of its order?

I have this dataframe:
df = pd.DataFrame()
df['Col1'] = [['B'],['A','D','B'],['D','C']]
df['Col2'] = [1,2,4]
df
        Col1  Col2
0        [B]     1
1  [A, D, B]     2
2     [D, C]     4
I would like to know whether Col1 contains the list [B, A, D], ignoring the order of the elements (both in the column's lists and in the list to check).
The answer here should therefore be True.
How can I do this?
Thanks
If values are not duplicated you can compare sets:
L = ['B','A','D']
print (df['Col1'].map(set).eq(set(L)))
0 False
1 True
2 False
Name: Col1, dtype: bool
If you want a scalar output (True or False), test whether at least one value in the column is True with Series.any:
print (df['Col1'].map(set).eq(set(['B','A','D'])).any())
True
Use:
l=['B','A','D']
[set(i)==set(l) for i in df['Col1']]
#[False, True, False]
IIUC, you can use the get_dummies method:
l=['B','A','D']
df.Col1.str.join(',').str.get_dummies(',')[l].all(1)
Out[197]:
0 False
1 True
2 False
dtype: bool
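One caveat worth noting: the get_dummies route only checks that every element of l is present, while set equality also rules out extra elements. A sketch with an extra, hypothetical superset row added to make the difference visible:

```python
import pandas as pd

# Question's data, plus one hypothetical superset row for illustration
df = pd.DataFrame({"Col1": [["B"], ["A", "D", "B"], ["D", "C"],
                            ["A", "B", "C", "D"]]})
L = ["B", "A", "D"]

# Strict set equality: the list must contain exactly the elements of L
eq = df["Col1"].map(set).eq(set(L))
print(eq.tolist())  # [False, True, False, False]

# get_dummies route: only checks that every element of L is present,
# so the superset row also matches
contains = df["Col1"].str.join(",").str.get_dummies(",")[L].all(axis=1)
print(contains.tolist())  # [False, True, False, True]
```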

Create columns in dataframe based on csv field

I have a pandas dataframe with the column "Values" that has comma separated values:
Row|Values
1|1,2,3,8
2|1,4
I want to create one column per distinct value and assign a boolean indicating whether the row contains that value, as follows:
Row|1,2,3,4,8
1|true,true,true,false,true
2|true,false,false,true,false
How can I accomplish that?
Thanks in advance
You can use get_dummies (see the pandas documentation for str.get_dummies); astype(bool) then converts 1 to True and 0 to False:
df.set_index('Row')['Values'].str.get_dummies(',').astype(bool)
Out[318]:
         1      2      3      4      8
Row
1     True   True   True  False   True
2     True  False  False   True  False
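For reference, a self-contained version (the DataFrame is reconstructed from the question's example):

```python
import pandas as pd

# DataFrame reconstructed from the question's example
df = pd.DataFrame({"Row": [1, 2], "Values": ["1,2,3,8", "1,4"]})

# str.get_dummies splits each string on ',' and builds one indicator
# column per distinct value; astype(bool) turns 0/1 into False/True
out = df.set_index("Row")["Values"].str.get_dummies(",").astype(bool)
print(out)
```

Note that the resulting column labels are strings ('1', '2', ...), since they come from splitting the text.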

Why does pandas dataframe indexing change axis depending on index type?

When you index into a pandas dataframe using a list of ints, it returns columns.
E.g. df[[0, 1, 2]] returns the first three columns.
Why does indexing with a boolean vector return rows instead?
E.g. df[[True, False, True]] returns the first and third rows (and errors out if there aren't exactly 3 rows).
Why? Shouldn't it return the first and third columns?
Thanks!
Because if you use:
df[[True, False, True]]
it is called boolean indexing by mask:
[True, False, True]
Sample:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
print (df)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
print (df[[True, False, True]])
   A  B  C
0  1  4  7
2  3  6  9
The boolean mask is the same as:
print (df.B != 5)
0 True
1 False
2 True
Name: B, dtype: bool
print (df[df.B != 5])
   A  B  C
0  1  4  7
2  3  6  9
There are very specific slicing accessors to target rows and columns in specific ways (see the pandas docs on mixed position- and label-based selection, selection by position, and selection by label):
loc[], at[], and get_value() take row and column labels and return the appropriate slice.
iloc[] and iat[] take row and column positions and return the appropriate slice.
What you are seeing is the result of pandas trying to infer what you are trying to do. As you have noticed, this is inconsistent at times. In fact, it is more pronounced than just what you've highlighted... but I won't go into that now.
See also the pandas docs:
"However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc."
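A minimal sketch of why the explicit accessors matter, using an integer index whose labels do not match positions:

```python
import pandas as pd

# Integer-labeled index where label and position disagree
df = pd.DataFrame({"A": [10, 20, 30]}, index=[2, 0, 1])

print(df.loc[0, "A"])   # 20 -- label-based: the row labeled 0
print(df.iloc[0]["A"])  # 10 -- position-based: the first row

# Boolean masks are unambiguous: they always select rows by position
print(df[[True, False, True]]["A"].tolist())  # [10, 30]
```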

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/pandas and am struggling to extract the correct data from a pd.DataFrame. What I actually have is a DataFrame with 3 columns:
data =
Position  Letter  Value
1         a       TRUE
2         f       FALSE
3         c       TRUE
4         d       TRUE
5         k       FALSE
What I want to do is put all of the TRUE rows into a new Dataframe so that the answer would be:
answer =
Position  Letter  Value
1         a       TRUE
3         c       TRUE
4         d       TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use boolean indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
   Position Letter  Value
0         1      a   True
2         3      c   True
3         4      d   True
*Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!) then use:
data['Value'] == 'TRUE'
You can wrap your value/values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This will return a new dataframe consisting of rows where your list items match your column name in df.
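Applied to the question's data, a self-contained sketch of the isin approach:

```python
import pandas as pd

# DataFrame reconstructed from the question's example
data = pd.DataFrame({"Position": [1, 2, 3, 4, 5],
                     "Letter": ["a", "f", "c", "d", "k"],
                     "Value": [True, False, True, True, False]})

# isin with a one-element list keeps only the rows whose Value is True
answer = data.loc[data["Value"].isin([True])]
print(answer["Letter"].tolist())  # ['a', 'c', 'd']

# With actual booleans, the column itself works directly as a mask:
same = data.loc[data["Value"]]
```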
