Pandas: Selection with MultiIndex

Consider the following DataFrames:
In [136]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'C':np.arange(10,30,5)}).set_index(['A','B'])
df
Out[136]:
      C
A B
1 1  10
  2  15
2 1  20
  2  25
In [130]:
vals = pd.DataFrame({'A':[1,2],'values':[True,False]}).set_index('A')
vals
Out[130]:
   values
A
1    True
2   False
How can I select only the rows of df with corresponding True values in vals?
If I reset_index on both frames I can now merge/join them and slice however I want, but how can I do it using the (multi)indexes?

boolean indexing all the way...
In [65]: df[pd.Series(df.index.get_level_values('A')).isin(vals[vals['values']].index)]
Out[65]:
      C
A B
1 1  10
  2  15
Note that you can use xs on a multiindex.
In [66]: df.xs(1)
Out[66]:
    C
B
1  10
2  15
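Putting the accepted approach together as a runnable sketch (note that pandas Index objects support .isin directly, so the pd.Series wrapper above is optional in modern versions):

```python
import numpy as np
import pandas as pd

# rebuild the frames from the question
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 1, 2],
                   'C': np.arange(10, 30, 5)}).set_index(['A', 'B'])
vals = pd.DataFrame({'A': [1, 2], 'values': [True, False]}).set_index('A')

# the 'A' keys flagged True in vals, then boolean-index df on level 'A'
true_keys = vals[vals['values']].index
result = df[df.index.get_level_values('A').isin(true_keys)]
print(result)
```

This keeps both frames indexed throughout; no reset_index or merge is needed.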

Related

intersect dataframes python, keep one dataframe columns

I would like to join 2 dataframes so that the result is the intersection of the two datasets on the key column.
By doing this:
result = pd.merge(df1,df2,on='key', how='inner')
I get what I need, but with the extra columns of df2. I want only the df1 columns in the result (I do not want to delete them afterwards).
Any ideas?
Here is a generic solution which works for one or for multiple key (joining) columns:
Setup:
In [28]: a = pd.DataFrame({'a':[1,2,3,4], 'b':[10,20,30,40], 'c':list('abcd')})
In [29]: b = pd.DataFrame({'a':[3,4,5,6], 'b':[30,41,51,61], 'c':list('efgh')})
In [30]: a
Out[30]:
a b c
0 1 10 a
1 2 20 b
2 3 30 c
3 4 40 d
In [31]: b
Out[31]:
a b c
0 3 30 e
1 4 41 f
2 5 51 g
3 6 61 h
multiple joining keys:
In [32]: join_cols = ['a','b']
In [33]: a.merge(b[join_cols], on=join_cols)
Out[33]:
a b c
0 3 30 c
single joining key:
In [34]: join_cols = ['a']
In [35]: a.merge(b[join_cols], on=join_cols)
Out[35]:
a b c
0 3 30 c
1 4 40 d
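The same intersection can also be expressed without merge, staying on this thread's theme of (multi)index selection — a sketch, assuming the key columns jointly identify rows:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40], 'c': list('abcd')})
b = pd.DataFrame({'a': [3, 4, 5, 6], 'b': [30, 41, 51, 61], 'c': list('efgh')})

join_cols = ['a', 'b']
# build a (Multi)Index from the key columns on each side and test
# membership; rows of `a` keep all their columns, nothing from `b` leaks in
mask = a.set_index(join_cols).index.isin(b.set_index(join_cols).index)
print(a[mask])
```

With a single key column this degenerates to an ordinary Index.isin test, so the same line covers both cases.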

Python: given list of columns and list of values, return subset of dataframe that meets all criteria

I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be column B; columns B and C; or any number of other columns.
I have written code that saves the column names (other than A) into a list, as well as the unique permutations of the values in those columns. For instance, in this example, columns B and C are saved into col:
col = ['B','C']
The permutations in the simple df are 1,7; 2,8; 3,9. For simplicity assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
df[df[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop the NaN values later.) But how should I do this so it stays dynamic when there are multiple columns? (Ultimately, I'll run through a loop and save the different iterations, based upon multiple permutations in the columns other than A.)
Use the intersection of boolean Series (where both conditions are true). First, the setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np
def select_rows(df, columns, values):
    # compare each row positionally against the values array
    return df[(df[columns] == np.array(values)).all(axis=1)]
I figured out a solution. Aaron's answer above works well if I only have two columns, but I need a solution that works regardless of the size of the df (which will have 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
you can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
pay attention to the result:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
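The extra rows appear because isin is an elementwise membership test, not a positional comparison — each cell is only asked "does this value occur anywhere in the list?". A minimal sketch of the failure and the fix:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 2], 'b': [2, 0], 'c': [2, 0]})
permutation = [0, 2, 2]

# row 1 is (2, 0, 0): every cell's value occurs somewhere in [0, 2, 2],
# so the membership test passes even though the row is not (0, 2, 2)
membership = df.isin(permutation)

# comparing against a positionally aligned array checks column-by-column,
# so only the genuine permutation survives
positional = df[(df == np.array(permutation)).all(axis=1)]
print(positional)
```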

How to select cells greater than a value in a multi-index Pandas dataframe?

Try 1:
df[df > 1.0]: this returned all non-matching cells as NaN.
Try 2:
df.loc[df > 1.0]: this returned KeyError: 0
df[df['A'] > 1.0]: this works, but I want to apply the filter condition to all columns.
If what you are trying to do is select only the rows where any one column meets the condition, you can use DataFrame.any() along with axis=1 (to reduce across each row). Example -
In [3]: df
Out[3]:
A B C
0 1 2 3
1 3 4 5
2 3 1 4
In [6]: df[(df <= 2).any(axis=1)]
Out[6]:
A B C
0 1 2 3
2 3 1 4
Alternatively, if you are filtering for rows where all columns meet the condition, use .all() in place of .any(). Example of all -
In [8]: df = pd.DataFrame([[1,2,3],[3,4,5],[3,1,4],[1,2,1]],columns=['A','B','C'])
In [9]: df
Out[9]:
A B C
0 1 2 3
1 3 4 5
2 3 1 4
3 1 2 1
In [11]: df[(df <= 2).all(axis=1)]
Out[11]:
A B C
3 1 2 1
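Applied back to the original question (cells greater than 1.0), the same pattern reads, as a sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [3, 1, 4]], columns=['A', 'B', 'C'])

# df[df > 1.0] keeps the frame's shape and masks failing cells with NaN,
# which is why "all cells in NaN" appeared; reducing the boolean frame
# with all(axis=1) instead keeps only rows where every column passes
rows_all_gt = df[(df > 1.0).all(axis=1)]
print(rows_all_gt)
```

Swap all for any to keep rows where at least one column exceeds 1.0.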

Select columns in pandas dataframe by value in rows

I have a pandas.DataFrame with too many columns. I want to select all columns whose values in every row are 0 or 1. The dtype of all columns is int64, so I can't select them by object or another dtype. How can I do this?
IIUC then you can use isin and filter the columns:
In [169]:
df = pd.DataFrame({'a':[0,1,1,0], 'b':list('abcd'), 'c':[1,2,3,4]})
df
Out[169]:
a b c
0 0 a 1
1 1 b 2
2 1 c 3
3 0 d 4
In [174]:
df[df.columns[df.isin([0,1]).all()]]
Out[174]:
a
0 0
1 1
2 1
3 0
The output from the inner condition:
In [175]:
df.isin([0,1]).all()
Out[175]:
a True
b False
c False
dtype: bool
We can use the boolean mask to filter the columns:
In [176]:
df.columns[df.isin([0,1]).all()]
Out[176]:
Index(['a'], dtype='object')
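An equivalent, arguably more direct spelling passes the boolean mask to .loc on the column axis — a sketch of the same idea:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 0], 'b': list('abcd'), 'c': [1, 2, 3, 4]})

# .all() with no axis argument reduces over rows, yielding one boolean per
# column; .loc accepts that mask directly on the column axis
only_01 = df.loc[:, df.isin([0, 1]).all()]
print(only_01.columns.tolist())
```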

Pandas remove column by index

Suppose I have a DataFrame like this:
>>> df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['a','b','b'])
>>> df
a b b
0 1 2 3
1 4 5 6
2 7 8 9
And I want to remove the second 'b' column. If I just use the del statement, it'll delete both 'b' columns:
>>> del df['b']
>>> df
a
0 1
1 4
2 7
I can select a column by index with .iloc[] and reassign the DataFrame, but how can I delete only the second 'b' column, for example by index?
df = df.drop(['b'], axis=1).join(df['b'].iloc[:, 0:1])
>>> df
a b
0 1 2
1 4 5
2 7 8
Or just for this case:
df = df.iloc[:, 0:2]
But I think there are better ways.
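For the specific case of a repeated label, columns.duplicated() gives a positional mask without hard-coding column offsets — a sketch, not from the original answer:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'b'])

# columns.duplicated() marks each repeat of a label after its first
# occurrence, so negating it keeps the first 'b' and drops the second
deduped = df.loc[:, ~df.columns.duplicated()]
print(deduped)
```

To drop an arbitrary column purely by position, a list comprehension over range(df.shape[1]) excluding the unwanted offset, fed to .iloc, works the same way.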
