Selecting rows based on multiple column values in pandas dataframe MultiIndex - python

I have a pandas DataFrame MultiIndex:
f1 f2 value
2 2 4
2 3 5
3 3 4
4 1 3
4 4 3
I would like to have an output where f1 == f2:
f1 f2 value
2 2 4
3 3 4
4 4 3
Can you suggest an elegant way to select those rows?

Use boolean indexing if f1 and f2 are columns:
df = df[df.f1 == df.f2]
print(df)
f1 f2 value
0 2 2 4
2 3 3 4
4 4 4 3
If the levels of the MultiIndex are f1 and f2, use Index.get_level_values:
df = df[df.index.get_level_values('f1') == df.index.get_level_values('f2')]
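A minimal runnable sketch of the MultiIndex case, assuming f1 and f2 are index levels rather than columns:

```python
import pandas as pd

# Build the sample data with f1 and f2 as index levels.
df = pd.DataFrame({'value': [4, 5, 4, 3, 3]},
                  index=pd.MultiIndex.from_tuples(
                      [(2, 2), (2, 3), (3, 3), (4, 1), (4, 4)],
                      names=['f1', 'f2']))

# Compare the two levels element-wise and keep the matching rows.
out = df[df.index.get_level_values('f1') == df.index.get_level_values('f2')]
print(out)
```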
Or, whether f1 and f2 are column names or MultiIndex levels, use DataFrame.query:
df = df.query('f1 == f2')
print(df)
f1 f2 value
0 2 2 4
2 3 3 4
4 4 4 3

Related

pandas: How to sort values in order of most repeated to least repeated?

Suppose a df such as:
A B ...
2 .
3 .
2 .
3
2
1
I expect output to be:
A B ...
2 .
2 .
2 .
3
3
1
Because 2 is repeated most, then 3, and so on.
This works:
# Suppose you have a df like this:
import pandas as pd
df = pd.DataFrame({'A':[2,3,2,3,2,1], 'B':range(6)})
A B
0 2 0
1 3 1
2 2 2
3 3 3
4 2 4
5 1 5
# you can pass a sorting function to sort_values as key:
df = df.sort_values(by='A', key=lambda x: x.map(x.value_counts()), ascending=False)
A B
0 2 0
2 2 2
4 2 4
1 3 1
3 3 3
5 1 5
This would work:
df['Frequency'] = df.groupby('A')['A'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)
Try value_counts and argsort:
out = df.iloc[(-df.A.value_counts().reindex(df.A)).argsort()]
Out[647]:
   A     B  ...
0  2     .  NaN
2  2     .  NaN
4  2  None  NaN
1  3     .  NaN
3  3  None  NaN
5  1  None  NaN
First add a new column counting the repetitions:
>>> df['C'] = df.groupby('A')['A'].transform('count')
Then sort by this new column:
>>> df.sort_values(['C','A'], ascending=False)
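A self-contained sketch of this count-then-sort approach, re-using the sample df from the accepted answer (the helper column C is an assumption and is dropped at the end):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 2, 3, 2, 1], 'B': range(6)})

# Count how often each value of A occurs, then sort by that count (descending).
df['C'] = df.groupby('A')['A'].transform('count')
out = df.sort_values(['C', 'A'], ascending=False).drop(columns='C')
print(out)
```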

How to remove columns from a df, if columns are not in df2

I have multiple dfs I need to compare; however, because of the way the data was gathered, one df has 25 columns and another has 20. The column label names are the same (the 20 columns all exist in the 25-column df).
I can't figure out how to remove the columns from df_cont that don't exist in df_red, and also not include the columns of df_red that aren't currently in df_cont.
df_cont:
            A  B  C  D  E  F
01-01-2019  1  2  3  4  5  5
02-01-2019  1  3  4  4  6  5
df_red:
            A  B  D  F  G
01-01-2019  2  5  6  4  3
02-01-2019  2  5  6  4  3
Code:
df_cont1 = df_cont.query(df_cont.columns == df_red.columns)
Expected:
df_cont1:
            A  B  D  F
01-01-2019  1  2  4  5
02-01-2019  1  3  4  5
As @busybear already stated, in your particular case you can use
df_cont = df_cont[df_red.columns]
This alternative solution is a bit safer if you don't know which DataFrame is the bigger one:
df_cont[df_cont.columns.intersection(df_red.columns)]
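A runnable sketch of the intersection approach, using the sample frames from the question (index values are the dates as strings, an assumption for illustration):

```python
import pandas as pd

df_cont = pd.DataFrame({'A': [1, 1], 'B': [2, 3], 'C': [3, 4], 'D': [4, 4],
                        'E': [5, 6], 'F': [5, 5]},
                       index=['01-01-2019', '02-01-2019'])
df_red = pd.DataFrame({'A': [2, 2], 'B': [5, 5], 'D': [6, 6], 'F': [4, 4],
                       'G': [3, 3]},
                      index=['01-01-2019', '02-01-2019'])

# Keep only the columns the two frames share, regardless of which is bigger.
df_cont1 = df_cont[df_cont.columns.intersection(df_red.columns)]
print(df_cont1)
```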

Is there an easy way to compute the intersection of two different indexes in a dataframe?

For example, if I have a DataFrame consisting of 5 rows (0-4) and 5 columns (A-E), I want to compute something like 0A * 3E, or in pseudo-code, df[0,A] * df[3,E]?
I think you need to select the values with DataFrame.loc and then multiply:
a = df.loc[0,'A'] * df.loc[3,'E']
Sample:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print(df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
a = df.loc[0,'A'] * df.loc[3,'E']
print(a)
16
Btw, your pseudo-code is very close to the real solution.
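For a single scalar lookup, DataFrame.at is an equivalent accessor optimized for scalar access; a small sketch using the same seeded sample:

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))

# .at is optimized for getting/setting a single scalar value.
a = df.at[0, 'A'] * df.at[3, 'E']
print(a)  # same result as the .loc version
```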

Sum pandas dataframe column values based on condition of column name

I have a DataFrame with column names in the shape of x.y, where I would like to sum up all columns with the same value on x without having to explicitly name them. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option: extract the prefix from the column names and use it as the grouping variable:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
You can first create a MultiIndex by splitting the column names, and then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print(df)
   x      y
   1  2   8  92
0  1  5  19  10
1  2  4   2   9
2  3  3   1   2
3  4  2   3   4
df = df.groupby(axis=1, level=0).sum()
print(df)
x y
0 6 29
1 6 11
2 6 3
3 6 7
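Note that groupby(axis=1) is deprecated in newer pandas releases; a sketch of an equivalent that groups the transpose by prefix instead:

```python
import pandas as pd

df = pd.DataFrame({'x.1': [1, 2, 3, 4], 'x.2': [5, 4, 3, 2],
                   'y.8': [19, 2, 1, 3], 'y.92': [10, 9, 2, 4]})

# Transpose, group the rows by the prefix before the dot, sum, transpose back.
out = df.T.groupby(lambda name: name.split('.')[0]).sum().T
print(out)
```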

Pandas merge on aggregated columns

Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
Which makes sense because the grouped table has a Multi-index and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the original df, you should use transform; it produces a Series whose index is aligned with your original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
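To end up with the nc column name from the question rather than the a_x/a_y suffixes, one option is to rename the aggregated column before merging; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 13, 15], "b": [4, 5, 6, 6, 6],
                   "c": ["wish", "you", "were", "here", "here"]})
gb = df.groupby(['b', 'c']).agg({"a": "nunique"})

# Turn the MultiIndex back into columns and rename the aggregate, so the
# merge produces no column-name collision.
out = df.merge(gb.reset_index().rename(columns={'a': 'nc'}), on=['b', 'c'])
print(out)
```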
