Select columns in pandas dataframe by value in rows - python

I have a pandas.DataFrame with too many columns. I want to select all columns whose values consist only of 0 and 1. Every column has dtype int64, so I can't pick them out by dtype. How can I do this?

IIUC you can use isin to build a boolean mask and filter the columns:
In [169]:
df = pd.DataFrame({'a':[0,1,1,0], 'b':list('abcd'), 'c':[1,2,3,4]})
df
Out[169]:
   a  b  c
0  0  a  1
1  1  b  2
2  1  c  3
3  0  d  4
In [174]:
df[df.columns[df.isin([0,1]).all()]]
Out[174]:
   a
0  0
1  1
2  1
3  0
The output from the inner condition:
In [175]:
df.isin([0,1]).all()
Out[175]:
a     True
b    False
c    False
dtype: bool
We can use the boolean mask to filter the columns:
In [176]:
df.columns[df.isin([0,1]).all()]
Out[176]:
Index(['a'], dtype='object')
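Equivalently, you can pass the boolean mask straight to loc on the column axis instead of going through df.columns; a minimal self-contained sketch of the same approach, using the example frame above:

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 0], 'b': list('abcd'), 'c': [1, 2, 3, 4]})

# True for each column whose values are all 0 or 1
mask = df.isin([0, 1]).all()

# loc accepts a boolean mask on the column axis
result = df.loc[:, mask]  # keeps only column 'a'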

Related

How to select values from a dataframe by a series of column names?

I have a dataframe df:
   A  B
0  1  4
1  2  5
2  3  6
And a series s:
0    A
1    B
2    A
Now I want to pick values from df with column names specified in s. The expected result is:
0    1    <- from column A
1    5    <- from column B
2    3    <- from column A
How can I get this done efficiently?
Use Index.get_indexer to translate the Series of column names into column positions, then pick the values with numpy indexing on the 2d array:
a = df.to_numpy()
b = a[np.arange(len(df)), df.columns.get_indexer(s)]
print(b)
[1 5 3]

s1 = pd.Series(b, s.index)
print(s1)
0    1
1    5
2    3
dtype: int64
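For completeness, a self-contained version of this answer with the imports it needs (the frames are reconstructed from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.Series(['A', 'B', 'A'])

# Map each label in s to its column position in df
cols = df.columns.get_indexer(s)            # array([0, 1, 0])

# Fancy indexing: for row i, take the value at column cols[i]
picked = df.to_numpy()[np.arange(len(df)), cols]

result = pd.Series(picked, index=s.index)   # 1, 5, 3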

Checking column values in Python pandas

How do I check whether the column values in a pandas table are the same and record the result in a fourth column:
original
   red  blue  green
a    1     1      1
b    1     2      1
c    2     2      2
becomes:
   red  blue  green  match
a    1     1      1      1
b    1     2      1      0
c    2     2      2      1
Originally I only had 2 columns and it was possible to achieve something similar by doing this:
df['match']=df['blue']-df['red']
but this won't work with 3 columns.
Your help is greatly appreciated!
To make it more generic, compare the row values with the apply method.
Using set()
In [54]: df['match'] = df.apply(lambda x: len(set(x)) == 1, axis=1).astype(int)
In [55]: df
Out[55]:
   red  blue  green  match
a    1     1      1      1
b    1     2      1      0
c    2     2      2      1
Alternatively, use pd.Series.nunique to count the number of unique values in each row.
In [56]: (df.apply(pd.Series.nunique, axis=1) == 1).astype(int)
Out[56]:
a    1
b    0
c    1
dtype: int32
Or, take the first column's values with df.iloc[:, 0] and compare them against the whole frame with eq:
In [57]: df.eq(df.iloc[:, 0], axis=0).all(axis=1).astype(int)
Out[57]:
a    1
b    0
c    1
dtype: int32
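DataFrame.nunique also accepts axis=1, so the nunique idea can be written without apply; a sketch, run on the colour columns before the match column is added:

# One unique value per row means all three columns agree
df['match'] = df[['red', 'blue', 'green']].nunique(axis=1).eq(1).astype(int)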
You can try this:
df["match"] = df.apply(lambda x: int(x[0]==x[1]==x[2]), axis=1)
where:
x[0]==x[1]==x[2] : test for the eaquality of the 3 first columns
axis=1: columns wise
Alternatively, you can refer to the columns by name:
df["match"] = df.apply(lambda x: int(x["red"]==x["blue"]==x["green"]), axis=1)
This is more convenient when you have many columns and only want to compare a subset of them without relying on their positions.
If you want to compare all the columns, use John Galt's solution above; a vectorized version of the subset comparison follows.
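The subset comparison can also be done without apply by comparing every column of the subset to its first column with eq; a sketch, assuming the column names from the example above:

subset = df[['red', 'blue', 'green']]
# True where a row's values all equal that row's value in the first subset column
df['match'] = subset.eq(subset.iloc[:, 0], axis=0).all(axis=1).astype(int)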

How to select cells greater than a value in a multi-index Pandas dataframe?

Try 1:
df[df > 1.0]: this returned the whole frame with non-matching cells set to NaN.
Try 2:
df.loc[df > 1.0]: this returned KeyError: 0
df[df['A'] > 1.0]: this works, but I want to apply the filter condition to all columns.
If what you are trying to do is to select only the rows where any one column meets the condition, you can use DataFrame.any() along with axis=1 (to reduce row-wise). Example -
In [3]: df
Out[3]:
   A  B  C
0  1  2  3
1  3  4  5
2  3  1  4
In [6]: df[(df <= 2).any(axis=1)]
Out[6]:
   A  B  C
0  1  2  3
2  3  1  4
Alternatively, if you are filtering for rows where all columns meet the condition, use .all() in place of .any(). Example of all -
In [8]: df = pd.DataFrame([[1,2,3],[3,4,5],[3,1,4],[1,2,1]],columns=['A','B','C'])
In [9]: df
Out[9]:
   A  B  C
0  1  2  3
1  3  4  5
2  3  1  4
3  1  2  1
In [11]: df[(df <= 2).all(axis=1)]
Out[11]:
   A  B  C
3  1  2  1
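Applied back to the original question (cells greater than 1.0), the pattern is the same; a sketch, which works unchanged on a MultiIndex frame because the reduction runs on the values, not the index labels:

# Rows where every column is greater than 1.0
df[(df > 1.0).all(axis=1)]

# Rows where at least one column is greater than 1.0
df[(df > 1.0).any(axis=1)]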

Getting all rows with NaN value

I have a table with a column that has some NaN values in it:
A  B  C  D
2  3  2  NaN
3  4  5  5
2  3  1  NaN
I'd like to get all rows where D is NaN. How can I do this?
Creating a df for illustration (containing NaN)
In [86]: df =pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[np.nan, 4,5]})
In [87]: df
Out[87]:
   a  b    c
0  1  3  NaN
1  2  4    4
2  3  5    5
Checking which indices have null for column c
In [88]: pd.isnull(df['c'])
Out[88]:
0     True
1    False
2    False
Name: c, dtype: bool
Checking which indices don't have null for column c
In [90]: pd.notnull(df['c'])
Out[90]:
0    False
1     True
2     True
Name: c, dtype: bool
Selecting rows of df where c is not null
In [91]: df[pd.notnull(df['c'])]
Out[91]:
   a  b  c
1  2  4  4
2  3  5  5
Selecting rows of df where c is null
In [93]: df[pd.isnull(df['c'])]
Out[93]:
   a  b    c
0  1  3  NaN
Selecting rows of column c of df where c is not null
In [94]: df['c'][pd.notnull(df['c'])]
Out[94]:
1    4
2    5
Name: c, dtype: float64
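In more recent pandas the same selections read naturally through the Series methods isna and notna; a sketch on the frame above:

df[df['c'].isna()]     # rows where c is null
df[df['c'].notna()]    # rows where c is not null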
For a solution that doesn't involve pandas, you can do something like:
goodind = np.where(np.isnan(y).sum(axis=1) == 0)[0]  # indices of rows not containing NaNs
(or the negation if you want the rows with NaN) and use the indices to slice the data.
Summing booleans works, but np.any and np.all do accept an axis parameter, so the row-wise combination can also be written directly, as in the sketch below.
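A sketch of that direct form, assuming y is a 2d float array as in the answer above:

import numpy as np

y = np.array([[2.0, 3.0, np.nan],
              [3.0, 4.0, 5.0]])

good = ~np.isnan(y).any(axis=1)   # True for rows containing no NaN
rows_without_nan = y[good]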

Pandas: Selection with MultiIndex

Considering the following DataFrames
In [136]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'C':np.arange(10,30,5)}).set_index(['A','B'])
df
Out[136]:
      C
A B
1 1  10
  2  15
2 1  20
  2  25
In [130]:
vals = pd.DataFrame({'A':[1,2],'values':[True,False]}).set_index('A')
vals
Out[130]:
   values
A
1    True
2   False
How can I select only the rows of df with corresponding True values in vals?
If I reset_index on both frames I can now merge/join them and slice however I want, but how can I do it using the (multi)indexes?
boolean indexing all the way...
In [65]: df[pd.Series(df.index.get_level_values('A')).isin(vals[vals['values']].index)]
Out[65]:
      C
A B
1 1  10
  2  15
Note that you can use xs on a multiindex.
In [66]: df.xs(1)
Out[66]:
C
B
1 10
2 15
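A slightly shorter equivalent: df.index.get_level_values returns an Index, which has its own isin, so the pd.Series wrapper isn't needed; a sketch on the frames above:

keep = vals.index[vals['values']]              # labels of A where values is True
df[df.index.get_level_values('A').isin(keep)]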
