Change pivot table from Series to DataFrame - python

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': ['X', 'Y', 'Z'] * 3,
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
>>> df
A B C
0 1 X 1
1 1 Y 2
2 1 Z 3
3 2 X 1
4 2 Y 2
5 2 Z 3
6 3 X 1
7 3 Y 2
8 3 Z 3
result = df.pivot_table(index=['B'], values='C', aggfunc=sum)
>>> result
B
X 3
Y 6
Z 9
Name: C, dtype: int64
How can I have the column name for C show up above the sums, and how can I sort result either ascending or descending? result is a Series, not a DataFrame, and seems non-sortable.
Python: 2.7.11 and Pandas: 0.17.1

You were very close. Note that the brackets around values coerce the result into a DataFrame instead of a Series (i.e. values=['C'] instead of values='C').
result = df.pivot_table(index=['B'], values=['C'], aggfunc=sum)
>>> result
C
B
X 3
Y 6
Z 9
As result is now a DataFrame, you can use sort_values on it:
>>> result.sort_values('C', ascending=False)
C
B
Z 9
Y 6
X 3
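For completeness, a runnable sketch of the whole round trip (bracketed values plus the sort). As an aside, Series also gained sort_values around pandas 0.17, so the Series result should be sortable directly as well, though only the DataFrame route shows the C header asked about:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': ['X', 'Y', 'Z'] * 3,
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

# values=['C'] (a list) keeps the result a DataFrame with column 'C'
result = df.pivot_table(index=['B'], values=['C'], aggfunc='sum')
result = result.sort_values('C', ascending=False)
print(result)
```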


Computed/Reactive column in Pandas?

I would like to emulate an Excel formula in Pandas. I've tried this:
df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
df['c'] = lambda x : df.a + df.b + 1 # Displays <function <lambda> ..> instead of the result
df['d'] = df.a + df.b + 1 # Static computation
df.a *= 2
df # Result of column c and d not updated :(
a b c d
0 6 5 <function <lambda> at 0x7f2354ddcca0> 9
1 4 3 <function <lambda> at 0x7f2354ddcca0> 6
2 2 2 <function <lambda> at 0x7f2354ddcca0> 4
3 0 1 <function <lambda> at 0x7f2354ddcca0> 2
What I expect is:
df
a b c
0 6 5 12
1 4 3 8
2 2 2 5
3 0 1 2
df.a /= 2
a b c
0 3 5 9
1 2 3 6
2 1 2 4
3 0 1 2
Is this possible to have a computed column dynamically in Pandas?
Maybe this code might give you a step in the right direction:
import pandas as pd
c_list = []
df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
c_list2 = list(map(lambda x: x + df.b + 1, list(df.a)))
for i in range(0, 4):
    c_list.append(pd.DataFrame(c_list2[i])["b"][i])
df['c'] = c_list
df['d'] = df.a + df.b # Static computation
df.a *= 2
df
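Neither snippet makes column c truly reactive; plain pandas simply has no reactive columns. A common workaround (my own sketch, not part of the answer above) is to keep the formula in a function and re-evaluate it after each mutation:

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})

def compute_c(frame):
    # the "Excel formula": c = a + b + 1
    return frame.a + frame.b + 1

df['c'] = compute_c(df)
df.a *= 2
df['c'] = compute_c(df)  # re-run the formula after mutating a
print(df.c.tolist())  # [12, 8, 5, 2]
```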
Reactivity between columns in a DataFrame does not seem practically feasible. My cellopype package does give you Excel-like reactivity between DataFrames. Here's my take on your question:
pip install cellopype
import pandas as pd
from cellopype import Cell
# define source df and wrap it in a Cell:
df_ab = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
cell_ab = Cell(recalc=lambda: df_ab.copy())
# define the dependent/reactive Cell (with a single column 'c')
cell_c = Cell(
    recalc=lambda df: pd.DataFrame(df.a + df.b, columns=['c']),
    sources=[cell_ab]
)
# and get its value
print(cell_c.value)
c
0 8
1 5
2 3
3 1
# change source df and recalc its Cell...
df_ab.loc[0,'a']=100
cell_ab.recalc()
# cell_c has been updated in response
print(cell_c.value)
c
0 105
1 5
2 3
3 1
Also see my response to this question.

Set Multi-Index DataFrame column by Series with Index

I'm struggling with a MultiIndex dataframe a whose column x has to be set from b, which is not a MultiIndex and has only one index level (the first level of a). I have an index ix selecting the values to change, which is why I am using .loc[] for indexing. The problem is that the way missing index entries are filled in a is not what I require (see example).
>>> a = pd.DataFrame({'a': [1, 2, 3], 'b': ['b', 'b', 'b'], 'x': [4, 5, 6]}).set_index(['a', 'b'])
>>> a
x
a b
1 b 4
2 b 5
3 b 6
>>> b = pd.DataFrame({'a': [1, 4], 'x': [9, 10]}).set_index('a')
>>> b
x
a
1 9
4 10
>>> ix = a.index[[0, 1]]
>>> ix
MultiIndex(levels=[[1, 2, 3], [u'b']],
codes=[[0, 1], [0, 0]],
names=[u'a', u'b'])
>>> a.loc[ix]
x
a b
1 b 4
2 b 5
>>> a.loc[ix, 'x'] = b['x']
>>> # wrong result (at least not what I want)
>>> a
x
a b
1 b NaN
2 b NaN
3 b 6.0
>>> # expected result
>>> a
x
a b
1 b 9 # index: a=1 is part of DataFrame b
2 b 5 # other indices don't exist in b and...
3 b 6 # ... x-values remain unchanged
# if there were more [1, ...] indices...
# ...x would also be set to 9
I think you want to merge a and b; you should consider using the concat, merge, or join functions.
I can't think of any one-liner, so here's a multi-step approach:
tmp_df = a.loc[ix, ['x']].reset_index(level=1, drop=True)
# take b's value where its index matches, keep the old value otherwise
tmp_df['x'] = b['x'].reindex(tmp_df.index).fillna(tmp_df['x'])
tmp_df.index = ix
a.loc[ix, 'x'] = tmp_df['x']
Output:
x
a b
1 b 9.0
2 b 5.0
3 b 6.0
Edit: I assume the b values in the index are just placeholders; otherwise the code will fail at a.loc[ix, 'x']. For
a = pd.DataFrame({'a': [1, 1, 2, 3],
                  'b': ['b', 'b', 'b', 'b'],
                  'x': [4, 5, 3, 6]}).set_index(['a', 'b'])
a.loc[ix, 'x'] gives:
a b
1 b 4
b 5
b 4
b 5
Name: x, dtype: int64
You are trying to use a 1-level-index frame with a 2-level-index frame; just use .values:
EDIT:
import pandas as pd
a = pd.DataFrame({'a': [1, 2, 3], 'b': ['b', 'b', 'b'], 'x': [4, 5, 6]}).set_index(['a', 'b'])
b = pd.DataFrame({'a': [1, 4], 'x': [9, 10]}).set_index('a')
a_ix = a.index.get_level_values('a')[[0, 1]]  # 'a' labels of the rows to change
mask = a_ix.isin(b.index)                     # which of those labels exist in b
rows = a.index[[0, 1]][mask]
a.loc[rows, 'x'] = b.loc[a_ix[mask], 'x'].values
a:
x
a b
1 b 9
2 b 5
3 b 6
I first reset the multi-index of a and then set the (single-column) index to a:
a = a.reset_index()
a = a.set_index('a')
print(a)
b x
a
1 b 4
2 b 5
3 b 6
print(b)
x
a
1 9
4 10
Then, make the assignment you require using loc, and also re-set the multi-index afterwards.
Now, since we are using loc, your ix = a.index[[0, 1]] becomes similar to [1, 0]: the 1 is an index label in a, while b.iloc[0, 0] addresses b by position:
a.loc[1, 'x'] = b.iloc[0,0]
a.reset_index(inplace=True)
a = a.set_index(['a','b'])
print(a)
x
a b
1 b 9
2 b 5
3 b 6
EDIT:
Alternatively, reset the multi-index of a and don't set it to a single-column index. Then [0, 1] can be used as labels with loc (not positions as with iloc): 0 refers to the row label in a and 1 to the index label in b.
a = a.reset_index()
print(a)
a b x
0 1 b 4
1 2 b 5
2 3 b 6
a.loc[0, 'x'] = b.loc[1,'x']
a = a.set_index(['a','b'])
print(a)
x
a b
1 b 9
2 b 5
3 b 6
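A more general pattern (my own sketch, not from the answers above) is to align b's column on the first index level with reindex, then fall back to the existing values wherever b has no entry:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3], 'b': ['b', 'b', 'b'],
                  'x': [4, 5, 6]}).set_index(['a', 'b'])
b = pd.DataFrame({'a': [1, 4], 'x': [9, 10]}).set_index('a')

ix = a.index[[0, 1]]  # the rows of a we are allowed to change

# look up b by the first index level; NaN where b has no matching label
aligned = b['x'].reindex(ix.get_level_values('a'))
aligned.index = ix
a.loc[ix, 'x'] = aligned.fillna(a.loc[ix, 'x'])
print(a)
```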

Create a column in a dataframe that is a string of characters summarizing data in other columns

I have a dataframe like this where the columns are the scores of some metrics:
A B C D
4 3 3 1
2 5 2 2
3 5 2 4
I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:
A B C D NewCol
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD
I tried using a series of np.where:
df['NewCol'] = np.where(df['A'] > 2, 'A', '')
df['NewCol'] = np.where(df['B'] > 3, 'B', '')
etc.
but realized the result was overwriting with the last metric any time all four metrics didn't meet the conditions, like so:
A B C D NewCol
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD
I am pretty sure there is an easier and correct way to do this.
You could do:
import pandas as pd
data = [[4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}
df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]
print(df)
Output
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
Using dot
s = pd.Series([2, 3, 1, 3], index=df.columns)
df.gt(s, 1).dot(df.columns)
Out[179]:
0 AC
1 BC
2 ABCD
dtype: object
# df['New'] = df.gt(s, 1).dot(df.columns)
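The dot idiom works because in Python True * 'A' is 'A' and False * 'A' is '', so the matrix product concatenates the names of the passing columns. A commented, runnable sketch of the same approach:

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 2, 3], 'B': [3, 5, 5],
                   'C': [3, 2, 2], 'D': [1, 2, 4]})
s = pd.Series([2, 3, 1, 3], index=df.columns)

# gt builds a boolean mask per cell; dot with the column names then
# joins the names of the True columns row by row
df['NewCol'] = df.gt(s, axis=1).dot(df.columns)
print(df['NewCol'].tolist())  # ['AC', 'BC', 'ABCD']
```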
Another option that operates in an array fashion. It would be interesting to compare performance.
import pandas as pd
import numpy as np
# Data to test.
data = pd.DataFrame(
    [
        [4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]
    ],
    columns=['A', 'B', 'C', 'D']
)
# Series to hold the thresholds.
thresholds = pd.Series([2, 3, 1, 3], index=['A', 'B', 'C', 'D'])
# Subtract the thresholds from the data (broadcasting over columns), pick the
# column name where the difference is positive, then concatenate row by row.
# Wrapping the np.where result in a DataFrame lets sum(axis=1) join the strings
# (a plain ndarray of strings cannot be reduced with sum).
data['result'] = pd.DataFrame(
    np.where(data - thresholds > 0, data.columns, ''), index=data.index
).sum(axis=1)
print(data)
Gives:
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD

How can I map to a new dataframe by Multi Index level?

I have a dataframe with columns A, B, C, D and the index is a time series.
I want to create a new dataframe with the same index but many more columns, arranged in a MultiIndex whose first level is A, B, C, D. Every column in the new dataframe should carry the same value that A, B, C, or D did, according to its first MultiIndex level.
In other words, if I have a data frame like this:
A B C D
0 2 3 4 5
1 X Y Z 1
I want to make a new dataframe that looks like this
A B C D
0 1 2 3 4 5 6 7
0 2 2 2 3 3 4 5 5
1 X X X Y Y Z 1 1
In other words - I want to do the equivalent of an "HLOOKUP" in excel, using the first level of the multi-index and looking up on the original dataframe.
The new multi-index is pre-determined.
As suggested by cᴏʟᴅsᴘᴇᴇᴅ in the comments, you can use DataFrame.reindex with the columns and level arguments:
In [35]: mi
Out[35]:
MultiIndex(levels=[['A', 'B', 'C', 'D'], ['0', '1', '2', '3', '4', '5', '6', '7']],
labels=[[0, 0, 0, 1, 1, 2, 3, 3], [0, 1, 2, 3, 4, 5, 6, 7]])
In [36]: df
Out[36]:
A B C D
0 2 3 4 5
1 X Y Z 1
In [37]: df.reindex(columns=mi, level=0)
Out[37]:
A B C D
0 1 2 3 4 5 6 7
0 2 2 2 3 3 4 5 5
1 X X X Y Y Z 1 1
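A self-contained sketch of the same call, building the MultiIndex with from_arrays (the labels argument in the session above is from an older pandas):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 'X'], 'B': [3, 'Y'], 'C': [4, 'Z'], 'D': [5, 1]})
mi = pd.MultiIndex.from_arrays([
    ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'D'],
    ['0', '1', '2', '3', '4', '5', '6', '7'],
])

# level=0 matches each original column against the first level of mi,
# broadcasting its values into every sub-column
wide = df.reindex(columns=mi, level=0)
print(wide)
```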

Unusual reshaping of Pandas DataFrame

I have a DataFrame like this:
df = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'y': [1, 2, 3, 4, 5, 6]})
which looks like:
x y
0 a 1
1 a 2
2 b 3
3 b 4
4 b 5
5 c 6
I need to reshape it in the way to keep 'x' column unique:
x y_1 y_2 y_3
0 a 1 2 NaN
1 b 3 4 5
2 c 6 NaN NaN
So the max N of the 'y_N' columns has to be equal to
max(df.groupby('x').count().values)
and the x column has to contain unique values.
For now I don't see how to build the y_N columns.
Thanks.
You can use pandas.crosstab with a cumcount column as the columns parameter:
(pd.crosstab(df.x, df.groupby('x').cumcount() + 1, df.y,
             aggfunc=lambda x: x.iloc[0])
   .rename(columns='y_{}'.format)
   .reset_index())
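An equivalent route without crosstab (my sketch): number the rows within each 'x' group with cumcount, then pivot:

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'y': [1, 2, 3, 4, 5, 6]})

# cumcount gives each row its position within its 'x' group: 0,1,0,1,2,0
out = (df.assign(n=df.groupby('x').cumcount() + 1)
         .pivot(index='x', columns='n', values='y')
         .rename(columns='y_{}'.format)
         .reset_index())
print(out)
```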
