Pandas: How to group by and sum MultiIndex

I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
print(y)
print(y.groupby(level=[0,1]).sum())
Output:
11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64

11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64
The stacked output and the grouped sum are exactly the same.
However, the result I expect is:
11  x     2
11  y     6
12  x     6
12  y    10
EDIT 2:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
Output:
11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64
EDIT 3:
An issue has been logged:
https://github.com/pydata/pandas/issues/10417

With pandas 0.16.2 and Python 3, I was able to get the correct result via:
x.stack().reset_index().groupby(['level_0','level_1']).sum()
Which produces:
                  0
level_0 level_1
11      x         2
        y         6
12      x         6
        y        10
You can then change the index and column names to more desirable ones, e.g. with rename() or by assigning to index.names and columns directly.
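For instance, a minimal sketch (the names 'id', 'var', and 'total' below are placeholders of my choosing):
res = x.stack().reset_index().groupby(['level_0', 'level_1']).sum()
res.index.names = ['id', 'var']  # placeholder level names
res.columns = ['total']          # placeholder column name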
Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is on Series, which is what x.stack() produces. My workaround is to turn the Series into a DataFrame via reset_index(). In this case the DataFrame no longer has a MultiIndex - I'm just grouping on labeled columns.
To make sure that grouping and summing works on a DataFrame with a MultiIndex, you can try this to get the same correct output:
x.stack().reset_index().set_index(['level_0', 'level_1'], drop=True) \
    .groupby(level=[0, 1]).sum()
Either of these workarounds should take care of things until the bug is resolved.
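For what it's worth, the linked issue has since been resolved: on reasonably recent pandas versions the original expression should return the expected sums directly, with no workaround needed (a quick sketch, assuming a modern install):
import pandas as pd

x = pd.DataFrame({'x': [1, 1, 3, 3], 'y': [3, 3, 5, 5]}, index=[11, 11, 12, 12])
print(x.stack().groupby(level=[0, 1]).sum())
# 11  x    2
#     y    6
# 12  x    6
#     y    10
# dtype: int64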
I wonder if the bug has something to do with the MultiIndex instances that are created on a Series vs. a DataFrame. For example:
In [1]: obj = x.stack()
        type(obj)
Out[1]: pandas.core.series.Series

In [2]: obj.index
Out[2]: MultiIndex(levels=[[11, 11, 12, 12], ['x', 'y']],
                   labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
vs.
In [3]: obj = x.stack().reset_index().set_index(['level_0', 'level_1'], drop=True)
        type(obj)
Out[3]: pandas.core.frame.DataFrame

In [4]: obj.index
Out[4]: MultiIndex(levels=[[11, 12], ['x', 'y']],
                   labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
                   names=['level_0', 'level_1'])
Notice how the MultiIndex on the DataFrame describes the levels correctly ([11, 12]), whereas the one on the Series repeats them ([11, 11, 12, 12]).

sum allows you to specify the index levels to sum over when the object has a MultiIndex:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
y.sum(level=[0,1])
11  x     2
    y     6
12  x     6
    y    10
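Note that in later pandas releases the level argument to sum() was deprecated (pandas 1.3) and removed (pandas 2.0); on a modern version the equivalent spelling is the groupby form:
y.groupby(level=[0, 1]).sum()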

Using pandas 0.15.2, you just need one more iteration of groupby:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
prints
11  x     2
    y     6
12  x     6
    y    10

Related

Numpy / Pandas slicing based on intervals

Trying to figure out a way to slice non-contiguous and non-equal length rows of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row contains a start idx and an (inclusive) end idx for the corresponding column of x
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different-length slices of x based on the rows of y:
x[y] = 0
"""
x afterwards:
array([[ 0, 1, 2, 0],
[ 0, 5, 0, 7],
[ 0, 0, 0, 11]])
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible, as in the sketch below.
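For example, a sketch of the same loop driven by the underlying array (y.values), which avoids the per-iteration .iloc overhead:
bounds = y.values  # ndarray of shape (4, 2): start and inclusive end per column
mask = np.zeros_like(x, dtype=bool)
for i, (start, end) in enumerate(bounds):
    mask[start:end + 1, i] = True
x[mask] = 0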
I adapted a broadcasting-based answer to this problem:
y_t = y.values.transpose()
r = np.arange(x.shape[0])
# the end indices in y are inclusive, hence the '>= r' comparison
mask = ((y_t[0, :, None] <= r) & (y_t[1, :, None] >= r)).transpose()
res = x.where(~mask, 0)
res
#    0  1  2   3
# 0  0  1  2   0
# 1  0  5  0   7
# 2  0  0  0  11

Python: combine columns then rows by group

I have this dataframe:
In [182]: data_set
Out[182]:
  name parent  distance  rank
0    x    aaa        10     1
1    x    bbb         5     1
2    x    fff         3     2
3    y    aaa         2     2
4    y    bbb        10     1
5    z    ccc         8     2
I want to reshape it to:
name  Combined
x     ('aaa',10,1),('bbb',5,1),('fff',3,2)
y     ('aaa',2,2),('bbb',10,1)
z     ('ccc',8,2)
Then I want to convert it into a 3x2 dataframe with two columns, name and combined.
I was thinking of using zip or groupby, but those return different outputs.
First combine the columns into a tuple per row, then groupby name and aggregate the tuples into a list.
df['combined'] = df[['parent', 'distance', 'rank']].apply(tuple, axis=1)
res = df.groupby('name')['combined'].apply(list).reset_index()
print(res)
  name                                   combined
0    x  [(aaa, 10, 1), (bbb, 5, 1), (fff, 3, 2)]
1    y               [(aaa, 2, 2), (bbb, 10, 1)]
2    z                             [(ccc, 8, 2)]
By using groupby and apply:
df.groupby('name')[['parent','distance','rank']].apply(lambda x : x.values.tolist())
Out[14]:
name
x    [[aaa, 10, 1], [bbb, 5, 1], [fff, 3, 2]]
y                 [[aaa, 2, 2], [bbb, 10, 1]]
z                               [[ccc, 8, 2]]
dtype: object
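If you need the same two-column layout as the first answer, you can reset the index on the result; a sketch (the column name 'combined' is my choice):
res = (df.groupby('name')[['parent', 'distance', 'rank']]
         .apply(lambda g: [tuple(r) for r in g.values])
         .reset_index(name='combined'))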

Create columns for every category in Pandas DataFrame

I have a data frame with many binary columns indicating the presence of a category in the observation. Each observation has exactly 3 categories with a value of 1, the rest 0. I want to create 3 new columns, one for each category, where the value is instead the name of the category (i.e. the name of the binary column) when it equals one. To make it clearer:
I have:
x|y|z|k|w
---------
0|1|1|0|1
To be:
cat1|cat2|cat3
--------------
y |z |w
Can I do this?
For better performance, use a numpy solution:
print(df)
   x  y  z  k  w
0  0  1  1  0  1
1  1  1  0  0  1

c = df.columns.values
df = pd.DataFrame(c[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat')
print(df)
  cat0 cat1 cat2
0    y    z    w
1    x    y    w
Details:
#get indices of 1s
print(np.where(df))
(array([0, 0, 0, 1, 1, 1], dtype=int64), array([1, 2, 4, 0, 1, 4], dtype=int64))

#select the second array (the column indices)
print(np.where(df)[1])
[1 2 4 0 1 4]

#reshape to 3 columns
print(np.where(df)[1].reshape(-1, 3))
[[1 2 4]
 [0 1 4]]

#index into the array of column names
print(c[np.where(df)[1].reshape(-1, 3)])
[['y' 'z' 'w']
 ['x' 'y' 'w']]
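One caveat (my addition): reshape(-1, 3) relies on every row containing exactly three 1s, as the question guarantees. If the count could vary per row, a safer sketch groups the column indices by row instead:
rows, cols = np.where(df)
print(pd.Series(c[cols]).groupby(rows).apply(list))  # one list of names per row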
Timings:
df = pd.concat([df] * 1000, ignore_index=True)
#jezrael solution
In [390]: %timeit (pd.DataFrame(df.columns.values[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat'))
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 503 µs per loop
#jpp solution
In [391]: %timeit (pd.DataFrame(df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()))
10 loops, best of 3: 111 ms per loop
#Zero's solution works only with a one-row DataFrame, so it is not included
Here is one way:
import pandas as pd
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 1], 'w': [1, 1]})
split = df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()
df2 = pd.DataFrame(split)
#    0  1  2     3
# 0  w  y  z  None
# 1  k  w  x     y
You could do:
In [13]: pd.DataFrame([df.columns[df.astype(bool).values[0]]]).add_prefix('cat')
Out[13]:
  cat0 cat1 cat2
0    y    z    w
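A sketch generalizing this one-row answer to any number of rows (rows with fewer 1s would be padded with None):
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 0], 'w': [1, 1]})
pd.DataFrame([df.columns[r].tolist() for r in df.astype(bool).values]).add_prefix('cat')
#   cat0 cat1 cat2
# 0    y    z    w
# 1    x    y    w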

python pandas isin method?

I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to put the keys in a list called 'stopword' if the value is more than 5 and the key is not in another dataframe 'df'. Here is the df dataframe:
       word  freq
1  paradies     1
5   tucuman     1
and here is the code I am using:
stopword = []
for k, v in wordfreq.items():
    if v >= 5:
        if k not in list_c:
            stopword.append(k)
Does anybody know how I can do the same thing with the isin() method, or at least more efficiently?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
    freq        word
0      1    frogfeet
1      1     tucuman
2     57    paradies
3      1       d8848
4   5000     jobvark
5    100     midgley
6      1  jiaoyuwang
7     30   techsmart
8      2     weisman
9     19      walter
10     2      amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
   freq       word
4  5000    jobvark
5   100    midgley
7    30  techsmart
9    19     walter
So the boolean condition looks for freq values greater than 5, and for words that are not present in the other df, using isin and inverting the resulting boolean mask with ~.
You can then now get a list easily:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
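For comparison, a plain-Python sketch of the same filter that skips the intermediate DataFrame entirely (a set lookup stands in for isin):
exclude = set(df1['word'])
stopword = [k for k, v in wordfreq.items() if v > 5 and k not in exclude]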

Pandas column as index for numpy array

How can I use a pandas column as an index into a numpy array? Say I have
>>> grid = arange(10,20)
>>> df = pd.DataFrame([0,1,1,5], columns=['i'])
I would like to do
>>> df['j'] = grid[df['i']]
IndexError: unsupported iterator index
What is a short and clean way to actually perform this operation?
Update
To be precise, I want an additional column whose values are the grid entries at the indices stored in the first column: df['j'][0] = grid[df['i'][0]] in row 0, and so on.
expected output:
index  i   j
0      0  10
1      1  11
2      1  11
3      5  15
Parallel Case: Numpy-to-Numpy
Just to show where the idea comes from, in standard python / numpy, if you have
>>> keys = [0, 1, 1, 5]
>>> grid = arange(10,20)
>>> grid[keys]
array([10, 11, 11, 15])
Which is exactly what I want to do, except that my keys are not stored in a vector; they are stored in a column.
This is a numpy bug that surfaced with pandas 0.13.0 / numpy 1.8.0.
You can do:
In [5]: grid[df['i'].values]
Out[5]: array([10, 11, 11, 15])

In [6]: Series(grid)[df['i']]
Out[6]:
i
0    10
1    11
1    11
5    15
dtype: int64
This matches your expected output. You can assign an array to a column, as long as the length of the array/list is the same as the length of the frame (otherwise, how would you align it?):
In [14]: grid[keys]
Out[14]: array([10, 11, 11, 15])

In [15]: df['j'] = grid[df['i'].values]

In [17]: df
Out[17]:
   i   j
0  0  10
1  1  11
2  1  11
3  5  15
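On current pandas versions, to_numpy() is the recommended accessor for the underlying array, so a sketch of the same assignment today would be:
df['j'] = grid[df['i'].to_numpy()]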
