Accessing key of MultiIndexed groupby - python

I have a pandas groupby object, from two keys.
gb = df.groupby(['A','B'])
How can I access a specific key say (2,4), how do I do it?
The group_by() method works well if there is only one key.
Any ideas?

I think you are looking for get_group:
In [1]: df = pd.DataFrame([[2, 4, 1], [2, 4, 2], [3, 4, 1]], columns=['A', 'B', 'C'])
In [2]: df
Out[2]:
A B C
0 2 4 1
1 2 4 2
2 3 4 1
In [3]: g = df.groupby(['A', 'B'])
In [4]: g.get_group((2,4))
Out[4]:
A B C
0 2 4 1
1 2 4 2

Use a tuple in get_group
In [49]: df = DataFrame(np.random.randint(10,size=15).reshape(5,3),columns=list('ABC'))
In [50]: df
Out[50]:
A B C
0 8 9 2
1 7 5 3
2 3 1 2
3 2 4 0
4 6 9 4
In [51]: df.groupby(['A','B']).sum()
Out[51]:
C
A B
2 4 0
3 1 2
6 9 4
7 5 3
8 9 2
In [52]: df.groupby(['A','B']).get_group((6,9))
Out[52]:
A B C
4 6 9 4

Related

Pandas max for rows, top n max

Im trying to create top columns, which is the max of a couple of column rows. Pandas has a method nlargest but I cannot get it to work in rows. Pandas also has max and idxmax which does exactly what I want to do but only for the absolute max value.
df = pd.DataFrame(np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]), columns=['a', 'b', 'c', 'd', 'e', 'f'])
cols = df.columns[:-1].tolist()
df['max_1_val'] = df[cols].max(axis=1)
df['max_1_col'] = df[cols].idxmax(axis=1)
Output:
a b c d e f max_1_val max_1_col
0 1 2 3 5 1 9 5 d
1 4 5 6 2 5 9 6 c
2 7 8 9 2 5 10 9 c
But I am trying to get max_n_val and max_n_col so the expected output for top 3 would be:
a b c d e f max_1_val max_1_col max_2_val max_2_col max_3_val max_3_col
0 1 2 3 5 1 9 5 d 3 c 2 b
1 4 5 6 2 5 9 6 c 5 b 5 e
2 7 8 9 2 5 10 9 c 8 b 7 a
For improve performance is used numpy.argsort for positions, for correct order is used the last 3 items, reversed by indexing:
N = 3
a = df[cols].to_numpy().argsort()[:, :-N-1:-1]
print (a)
[[3 2 1]
[2 4 1]
[2 1 0]]
Then get columns names by indexing to c and for reordering values in d use this solution:
c = np.array(cols)[a]
d = df[cols].to_numpy()[np.arange(a.shape[0])[:, None], a]
Last create DataFrames, join by concat and reorder columns names by DataFrame.reindex:
df1 = pd.DataFrame(c).rename(columns=lambda x : f'max_{x+1}_col')
df2 = pd.DataFrame(d).rename(columns=lambda x : f'max_{x+1}_val')
c = df.columns.tolist() + [y for x in zip(df2.columns, df1.columns) for y in x]
df = pd.concat([df, df1, df2], axis=1).reindex(c, axis=1)
print (df)
a b c d e f max_1_val max_1_col max_2_val max_2_col max_3_val \
0 1 2 3 5 1 9 5 d 3 c 2
1 4 5 6 2 5 9 6 c 5 e 5
2 7 8 9 2 5 10 9 c 8 b 7
max_3_col
0 b
1 b
2 a

use meshgrid for rows with common values in column

my dataframes:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]),columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]),columns=['a', 'b', 'c'])
df1,df2:
a b c
0 1 2 3
1 4 2 3
2 7 8 8
a b c
0 1 2 3
1 4 2 3
2 5 8 8
I want to combine rows from columns a from both df's in all sequences but only where values in column b and c are equal.
Right now I have only solution for all in general with this code:
x = np.array(np.meshgrid(df1.a.values,
df2.a.values)).T.reshape(-1,2)
df = pd.DataFrame(x)
print(df)
0 1
0 1 1
1 1 4
2 1 5
3 4 1
4 4 4
5 4 5
6 7 1
7 7 4
8 7 5
expected output for df1.a and df2.a only for rows where df1.b==df2.b and df1.c==df2.c:
0 1
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
so basically i need to group by common rows in selected columns band c
You should try DataFrame.merge using inner merge:
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
a_x a_y
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5

Expanding the rows of a data frame based on its column containing lists [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 4 years ago.
Say I have the following Pandas Dataframe:
df = pd.DataFrame({"a" : [1,2,3], "b" : [[1,2],[2,3,4],[5]]})
a b
0 1 [1, 2]
1 2 [2, 3, 4]
2 3 [5]
How would I "unstack" the lists in the "b" column in order to transform it into the dataframe:
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Starting from Pandas 0.25.0, there is internal method DataFrame.explode(), which was designed just for that:
res = df.explode("b")
output
In [98]: res
Out[98]:
a b
0 1 1
0 1 2
1 2 2
1 2 3
1 2 4
2 3 5
Solution for Pandas versions < 0.25: generic vectorized approach - will work also for multiple columns DFs:
assuming we have the following DF:
In [159]: df
Out[159]:
a b c
0 1 [1, 2] 5
1 2 [2, 3, 4] 6
2 3 [5] 7
Solution:
In [160]: lst_col = 'b'
In [161]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.difference([lst_col])
...: }).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns.tolist()]
...:
Out[161]:
a b c
0 1 1 5
1 1 2 5
2 2 2 6
3 2 3 6
4 2 4 6
5 3 5 7
Setup:
df = pd.DataFrame({
"a" : [1,2,3],
"b" : [[1,2],[2,3,4],[5]],
"c" : [5,6,7]
})
Vectorized NumPy approach:
In [124]: pd.DataFrame({'a':np.repeat(df.a.values, df.b.str.len()),
'b':np.concatenate(df.b.values)})
Out[124]:
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
OLD answer:
Try this:
In [89]: df.set_index('a', append=True).b.apply(pd.Series).stack().reset_index(level=[0, 2], drop=True).reset_index()
Out[89]:
a 0
0 1 1.0
1 1 2.0
2 2 2.0
3 2 3.0
4 2 4.0
5 3 5.0
Or bit nicer solution provided by #Boud:
In [110]: df.set_index('a').b.apply(pd.Series).stack().reset_index(level=-1, drop=True).astype(int).reset_index()
Out[110]:
a 0
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Here is another approach with itertuples -
df = pd.DataFrame({"a" : [1,2,3], "b" : [[1,2],[2,3,4],[5]]})
data = []
for i in df.itertuples():
lst = i[2]
for col2 in lst:
data.append([i[1], col2])
df_output = pd.DataFrame(data =data, columns=df.columns)
df_output
Output is -
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Edit: You can also compress the loops into a single code and populate data as -
data = [[i[1], col2] for i in df.itertuples() for col2 in i[2]]

Keeping the last N duplicates in pandas

Given a dataframe:
>>> import pandas as pd
>>> lol = [['a', 1, 1], ['b', 1, 2], ['c', 1, 4], ['c', 2, 9], ['b', 2, 10], ['x', 2, 5], ['d', 2, 3], ['e', 3, 5], ['d', 2, 10], ['a', 3, 5]]
>>> df = pd.DataFrame(lol)
>>> df.rename(columns={0:'value', 1:'key', 2:'something'})
value key something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
The goal is to keep the last N rows for the unique values of the key column.
If N=1, I could simply use the .drop_duplicates() function as such:
>>> df.drop_duplicates(subset='key', keep='last')
value key something
2 c 1 4
8 d 2 10
9 a 3 5
How do I keep the last 3 rows for each unique values of key?
I could try this for N=3:
>>> from itertools import chain
>>> unique_keys = {k:[] for k in df['key']}
>>> for idx, row in df.iterrows():
... k = row['key']
... unique_keys[k].append(list(row))
...
>>>
>>> df = pd.DataFrame(list(chain(*[v[-3:] for k,v in unique_keys.items()])))
>>> df.rename(columns={0:'value', 1:'key', 2:'something'})
value key something
0 a 1 1
1 b 1 2
2 c 1 4
3 x 2 5
4 d 2 3
5 d 2 10
6 e 3 5
7 a 3 5
But there must be a better way...
Is this what you want ?
df.groupby('key').tail(3)
Out[127]:
value key something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Does this help:
for k,v in df.groupby('key'):
print v[-2:]
value key something
1 b 1 2
2 c 1 4
value key something
6 d 2 3
8 d 2 10
value key something
7 e 3 5
9 a 3 5

Comparing columns of 2 dataframes

I am trying to get the columns that are unique to a data frame.
DF_A has 10 columns
DF_B has 3 columns (all three match column names in DF_A).
Before I was using:
cols_to_use = DF_A.columns - DF_B.columns.
Since my pandas update, I am getting this error:
TypeError: cannot perform sub with this index type:
What should I be doing now instead?
Thank you!
You can use difference method:
Demo:
In [12]: df
Out[12]:
a b c d
0 0 8 0 3
1 3 4 1 7
2 0 5 4 0
3 0 9 7 0
4 5 8 5 4
In [13]: df2
Out[13]:
a d
0 4 3
1 3 1
2 1 2
3 3 4
4 0 3
In [14]: df.columns.difference(df2.columns)
Out[14]: Index(['b', 'c'], dtype='object')
In [15]: cols = df.columns.difference(df2.columns)
In [16]: df[cols]
Out[16]:
b c
0 8 0
1 4 1
2 5 4
3 9 7
4 8 5

Categories