Pandas sum across columns and divide each cell by that value - python

I have read a CSV file and pivoted it to get the following structure:
pivoted = df.pivot('user_id', 'group', 'value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0)
Section of the result:
             0     1     2    3     4    5   6  7    8   9  10  11  12  13 group
user_id
2        33653  2325   916  720   867  187  31  0    6   3  42  56  92  15   l-1
4        18895   414  1116  570  1190   55  92  0  122  23  78   6   4   2   l-2
16        1383    70    27   17    17    1   0  0    0   0   1   0   0   0   l-2
50         396    72    34    5    18    0   0  0    0   0   0   0   0   0   l-3
51        3915  1170   402  832  2791  316  12  5  118  51  32   9  62  27   l-4
I want to sum across columns 0 to 13 for each row and divide each cell by the sum of its row. I am still getting used to pandas; if I understand correctly, we should try to avoid for loops when doing things like this. In other words, how can I do this in a 'pandas' way?

More simply:
result.div(result.sum(axis=1), axis=0)
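Note: if result still carries the non-numeric group column from the question, the row sum will choke on (or silently mix in) the strings, so it can be worth restricting the division to the numeric columns first. A minimal sketch, assuming group is the only non-numeric column:
num = result.select_dtypes(include='number')       # numeric columns only
normalized = num.div(num.sum(axis=1), axis=0)      # each cell / its row sum
normalized = normalized.join(result['group'])      # reattach the label column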

Try the following:
In [1]: import pandas as pd
In [2]: df = pd.read_csv("test.csv")
In [3]: df
Out[3]:
id value1 value2 value3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
In [4]: df["sum"] = df.sum(axis=1)
In [5]: df
Out[5]:
id value1 value2 value3 sum
0 A 1 2 3 6
1 B 4 5 6 15
2 C 7 8 9 24
In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)
In [7]: df_new
Out[7]:
value1 value2 value3
0 0.166667 0.333333 0.500
1 0.266667 0.333333 0.400
2 0.291667 0.333333 0.375
Or you can do the following:
In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)
In [9]: df
Out[9]:
id value1 value2 value3 sum
0 A 0.166667 0.333333 0.500 6
1 B 0.266667 0.333333 0.400 15
2 C 0.291667 0.333333 0.375 24
Or just straight up from the beginning:
In [10]: df = pd.read_csv("test.csv")
In [11]: df
Out[11]:
id value1 value2 value3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1, numeric_only=True), axis=0)
In [13]: df
Out[13]:
id value1 value2 value3
0 A 0.166667 0.333333 0.500
1 B 0.266667 0.333333 0.400
2 C 0.291667 0.333333 0.375
Changing 'value1' and the like to your column headers should work similarly.

It's easier to work per column:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
(df.T / df.T.sum()).T
result:
0 1 2
0 0.166667 0.333333 0.500
1 0.266667 0.333333 0.400
2 0.291667 0.333333 0.375
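A quick self-contained check that the transpose trick matches the explicit div form (division broadcasts over columns by default, which is why normalizing the transpose and transposing back normalizes the rows):
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
out = (df.T / df.T.sum()).T                        # column-normalize the transpose
assert out.equals(df.div(df.sum(axis=1), axis=0))  # same as the row-wise div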

The following seemed to work fine for me:
In [39]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols] = result[cols].apply(lambda row: row / row.sum(), axis=1)
result
Out[39]:
0 1 2 3 4 5 6 \
user_id
2 0.864827 0.059749 0.023540 0.018503 0.022280 0.004806 0.000797
4 0.837285 0.018345 0.049453 0.025258 0.052732 0.002437 0.004077
16 0.912269 0.046174 0.017810 0.011214 0.011214 0.000660 0.000000
50 0.754286 0.137143 0.064762 0.009524 0.034286 0.000000 0.000000
51 0.401868 0.120099 0.041265 0.085403 0.286491 0.032437 0.001232
7 8 9 10 11 12 13 \
user_id
2 0.000000 0.000154 0.000077 0.001079 0.001439 0.002364 0.000385
4 0.000000 0.005406 0.001019 0.003456 0.000266 0.000177 0.000089
16 0.000000 0.000000 0.000000 0.000660 0.000000 0.000000 0.000000
50 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
51 0.000513 0.012113 0.005235 0.003285 0.000924 0.006364 0.002772
group
user_id
2 l-1
4 l-2
16 l-2
50 l-3
51 l-4
OK scratch the above, the following will be much faster:
result[cols] = result[cols].div(result[cols].sum(axis=1), axis=0)
And just to prove the result is the same:
In [47]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.all(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(), axis=1))
Out[47]:
True
And that it's faster:
In [48]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0)
%timeit result[cols].apply(lambda row: row / row.sum(), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop

result.iloc[:,:-1].div(result.iloc[:,:-1].sum(axis=1), axis=0)
result.iloc[:,:-1] selects all rows and all columns except the last (the group column)
result.iloc[:,:-1].sum(axis=1) sums across each row because of axis=1; the default, axis=0, sums down each column
.div(..., axis=0) needs an explicit axis=0 because div defaults to aligning on columns (axis=1); axis=0 aligns the row sums with the rows instead
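Spelled out as separate steps, assuming group is the last column as in the question:
values = result.iloc[:, :-1]               # all rows, every column except 'group'
row_sums = values.sum(axis=1)              # one sum per row
normalized = values.div(row_sums, axis=0)  # axis=0 aligns the sums with the rows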

Related

How to select column in data after aggregation [duplicate]

I have a multi-index data frame with columns 'A' and 'B'.
Is there a way to select rows by filtering on one column of the multi-index without resetting the index to a single-column index?
For example:
# has multi-index (A, B)
df
# can I do this? I know this doesn't work because the index is a multi-index, so I need to specify a tuple
df.ix[df.A == 1]
One way is to use the get_level_values Index method:
In [11]: df
Out[11]:
0
A B
1 4 1
2 5 2
3 6 3
In [12]: df.iloc[df.index.get_level_values('A') == 1]
Out[12]:
0
A B
1 4 1
In 0.13 you'll be able to use xs with drop_level argument:
df.xs(1, level='A', drop_level=False) # axis=1 if columns
Note: if this were column MultiIndex rather than index, you could use the same technique:
In [21]: df1 = df.T
In [22]: df1.iloc[:, df1.columns.get_level_values('A') == 1]
Out[22]:
A 1
B 4
0 1
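For copy-pasting, here is a self-contained version of the frame above; note the mask from get_level_values can be fed to plain [] (or .loc) just as well as to iloc:
import pandas as pd

idx = pd.MultiIndex.from_tuples([(1, 4), (2, 5), (3, 6)], names=['A', 'B'])
df = pd.DataFrame({0: [1, 2, 3]}, index=idx)
print(df[df.index.get_level_values('A') == 1])  # keeps only the (1, 4) row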
You can also use query, which is very readable in my opinion and straightforward to use:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 50, 80], 'C': [6, 7, 8, 9]})
df = df.set_index(['A', 'B'])
C
A B
1 10 6
2 20 7
3 50 8
4 80 9
For what you had in mind you can now simply do:
df.query('A == 1')
C
A B
1 10 6
You can also build more complex queries using and:
df.query('A >= 1 and B >= 50')
C
A B
3 50 8
4 80 9
and using or:
df.query('A == 1 or B >= 50')
C
A B
1 10 6
3 50 8
4 80 9
You can also mix index levels and ordinary columns in a query, e.g.
df.query('A == 1 or C >= 8')
will return
C
A B
1 10 6
3 50 8
4 80 9
If you want to use variables inside your query, you can use @:
b_threshold = 20
c_threshold = 8
df.query('B >= @b_threshold and C <= @c_threshold')
C
A B
2 20 7
3 50 8
You can use DataFrame.xs():
In [36]: df = pd.DataFrame(np.random.randn(10, 4))
In [37]: df.columns = [np.random.choice(['a', 'b'], size=4).tolist(), np.random.choice(['c', 'd'], size=4)]
In [38]: df.columns.names = ['A', 'B']
In [39]: df
Out[39]:
A b a
B d d d d
0 -1.406 0.548 -0.635 0.576
1 -0.212 -0.583 1.012 -1.377
2 0.951 -0.349 -0.477 -1.230
3 0.451 -0.168 0.949 0.545
4 -0.362 -0.855 1.676 -2.881
5 1.283 1.027 0.085 -1.282
6 0.583 -1.406 0.327 -0.146
7 -0.518 -0.480 0.139 0.851
8 -0.030 -0.630 -1.534 0.534
9 0.246 -1.558 -1.885 -1.543
In [40]: df.xs('a', level='A', axis=1)
Out[40]:
B d d
0 -0.635 0.576
1 1.012 -1.377
2 -0.477 -1.230
3 0.949 0.545
4 1.676 -2.881
5 0.085 -1.282
6 0.327 -0.146
7 0.139 0.851
8 -1.534 0.534
9 -1.885 -1.543
If you want to keep the A level (the drop_level keyword argument is only available starting from v0.13.0):
In [42]: df.xs('a', level='A', axis=1, drop_level=False)
Out[42]:
A a
B d d
0 -0.635 0.576
1 1.012 -1.377
2 -0.477 -1.230
3 0.949 0.545
4 1.676 -2.881
5 0.085 -1.282
6 0.327 -0.146
7 0.139 0.851
8 -1.534 0.534
9 -1.885 -1.543
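The same call works on a row MultiIndex, which is the question's actual case; a minimal sketch assuming the (A, B) index from the question:
import pandas as pd

idx = pd.MultiIndex.from_tuples([(1, 4), (2, 5), (3, 6)], names=['A', 'B'])
df = pd.DataFrame({0: [1, 2, 3]}, index=idx)
print(df.xs(1, level='A', axis=0, drop_level=False))  # rows where level A == 1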
Understanding how to access a multi-indexed pandas DataFrame can help you with all kinds of tasks like this.
Copy-paste this into your code to generate the example:
import numpy as np
import pandas as pd

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Running it generates the health_data table; its contents appear piecewise in the outputs below.
Standard access by column
health_data['Bob']
type HR Temp
year visit
2013 1 22.0 38.6
2 52.0 38.3
2014 1 30.0 38.9
2 31.0 37.3
health_data['Bob']['HR']
year visit
2013 1 22.0
2 52.0
2014 1 30.0
2 31.0
Name: HR, dtype: float64
# filtering by column/subcolumn - your case:
health_data['Bob']['HR'] == 22
year visit
2013 1 True
2 False
2014 1 False
2 False
health_data['Bob']['HR'][2013]
visit
1 22.0
2 52.0
Name: HR, dtype: float64
health_data['Bob']['HR'][2013][1]
22.0
Access by row
health_data.loc[2013]
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
visit
1 22.0 38.6 40.0 38.9 53.0 37.5
2 52.0 38.3 42.0 34.6 30.0 37.7
health_data.loc[2013,1]
subject type
Bob HR 22.0
Temp 38.6
Guido HR 40.0
Temp 38.9
Sue HR 53.0
Temp 37.5
Name: (2013, 1), dtype: float64
health_data.loc[2013,1]['Bob']
type
HR 22.0
Temp 38.6
Name: (2013, 1), dtype: float64
health_data.loc[2013,1]['Bob']['HR']
22.0
Slicing multi-index
idx = pd.IndexSlice
health_data.loc[idx[:,1], idx[:,'HR']]
subject Bob Guido Sue
type HR HR HR
year visit
2013 1 22.0 40.0 53.0
2014 1 30.0 52.0 45.0
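One caveat worth knowing: range slices through pd.IndexSlice generally require a lexsorted MultiIndex, so if pandas raises UnsortedIndexError, sort first:
# sort rows and columns before slicing; from_product indexes happen to be
# sorted already, but real data often isn't
health_data = health_data.sort_index().sort_index(axis=1)
idx = pd.IndexSlice
print(health_data.loc[idx[:, 1], idx[:, 'HR']])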
You can use DataFrame.loc:
>>> df.loc[1]
Example
>>> print(df)
       result
A B C
1 1 1       6
    2       9
  2 1       8
    2      11
2 1 1       7
    2      10
  2 1       9
    2      12
>>> print(df.loc[1])
     result
B C
1 1       6
  2       9
2 1       8
  2      11
>>> print(df.loc[2, 1])
   result
C
1       7
2      10
Another option is:
filter1 = df.index.get_level_values('A') == 1
filter2 = df.index.get_level_values('B') == 4
df.iloc[filter1 & filter2]
Out[11]:
0
A B
1 4 1
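The same selection reads a bit more directly with .loc and a single combined mask:
mask = ((df.index.get_level_values('A') == 1) &
        (df.index.get_level_values('B') == 4))
df.loc[mask]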
You can use MultiIndex slicing. For example:
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["A", "B"])
df = pd.DataFrame(np.random.randint(9, size=(8, 2)), index=index, columns=["col1", "col2"])
col1 col2
A B
bar one 0 8
two 4 8
baz one 6 0
two 7 3
foo one 6 8
two 2 6
qux one 7 0
two 6 4
To select all from A and two from B:
df.loc[(slice(None), 'two'), :]
Output:
col1 col2
A B
bar two 4 8
baz two 7 3
foo two 2 6
qux two 6 4
To select bar and baz from A and two from B:
df.loc[(['bar', 'baz'], 'two'), :]
Output:
col1 col2
A B
bar two 4 8
baz two 7 3
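The same two selections can be written with pd.IndexSlice, which some find more readable than raw slice(None) tuples:
idx = pd.IndexSlice
df.loc[idx[:, 'two'], :]                # everything from A, 'two' from B
df.loc[idx[['bar', 'baz'], 'two'], :]   # 'bar'/'baz' from A, 'two' from B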

Pandas: how to apply normalization on a column with a condition

I have a data frame like the one below, and I want to normalize the values per customer. Please help me figure out how to achieve this. I tried MinMaxScaler from sklearn on the complete price column, but it gives me values close to zero.
Dataframe
Customer price
A 0
A 3
A 7
A 0
A 0
B 2
B 2
B 0
C 5
C 1
D 0
D 0
D 15
D 0
If you want it per customer:
from sklearn.preprocessing import MinMaxScaler

df.groupby('Customer').price.transform(
    lambda s: MinMaxScaler().fit_transform(s.values.reshape(-1, 1)).ravel()
)
0 0.000000
1 0.428571
2 1.000000
3 0.000000
4 0.000000
5 1.000000
6 1.000000
7 0.000000
8 1.000000
9 0.000000
10 0.000000
11 0.000000
12 1.000000
13 0.000000
Name: price, dtype: float64
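If pulling in sklearn just for this feels heavy, the same per-customer min-max scaling can be done in plain pandas; a sketch (it yields NaN for a customer whose prices are all identical, since the range is zero there):
g = df.groupby('Customer')['price']
lo, hi = g.transform('min'), g.transform('max')
df['scaled'] = (df['price'] - lo) / (hi - lo)  # NaN where hi == lo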
You can solve it without MinMaxScaler:
df["norm"] = df.groupby("Customer").apply(
    lambda grp: grp.price.div(grp.price.max())
).values
Customer price norm
0 A 0 0.000000
1 A 3 0.428571
2 A 7 1.000000
3 A 0 0.000000
4 A 0 0.000000
5 B 2 1.000000
6 B 2 1.000000
7 B 0 0.000000
8 C 5 1.000000
9 C 1 0.200000
10 D 0 0.000000
11 D 0 0.000000
12 D 15 1.000000
13 D 0 0.000000
Edit:
For another normalization, you can divide by grp.price.sum() instead of grp.price.max().
Edit2:
For more columns you can do:
cols = ["price", "weight"]  # the requested column names
df2 = df.groupby("Customer").apply(lambda grp: grp[cols].div(grp[cols].max()))
new_df = pd.concat([df, df2], axis=1)
You must then rename the trailing, normalized columns:
new_df.columns
Index(['Customer', 'price', 'weight', 'price', 'weight'], dtype='object')
new_df.columns = df.columns.append(pd.Index(["norm_" + c for c in df2.columns]))
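A variant that skips the concat-and-rename dance, using transform so the per-group maxima align with the original rows (same hypothetical cols list as above):
cols = ["price", "weight"]
maxes = df.groupby("Customer")[cols].transform("max")
normed = df[cols] / maxes
for c in cols:
    df["norm_" + c] = normed[c]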

GroupBy Transformation on hierarchically indexed dataframe

I would like to take my Pandas dataframe with hierarchically indexed columns and normalize the values such that the values with the same outer index sum to one. For example:
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)])
X = pd.DataFrame(np.arange(20).reshape(5,4), columns=cols)
gives a dataframe X:
A B
1 2 1 2
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
I would like to normalize the rows so that the A columns sum to 1 and the B columns sum to 1. I.e. to generate:
A B
1 2 1 2
0 0.000000 1.000000 0.400000 0.600000
1 0.444444 0.555556 0.461538 0.538462
2 0.470588 0.529412 0.476190 0.523810
3 0.480000 0.520000 0.482759 0.517241
4 0.484848 0.515152 0.486486 0.513514
The following for loop works:
res = []
for (k, g) in X.groupby(axis=1, level=0):
    g = g.div(g.sum(axis=1), axis=0)
    res.append(g)
res = pd.concat(res, axis=1)
But the one-liner fails:
X.groupby(axis=1, level=0).transform(lambda x: x.div(x.sum(axis=1), axis=0))
With the error message:
ValueError: transform must return a scalar value for each group
Any idea what the issue might be?
Is this what you want?
In [33]: X.groupby(level=0, axis=1).apply(lambda x: x.div(x.sum(axis=1), axis=0))
Out[33]:
A B
1 2 1 2
0 0.000000 1.000000 0.400000 0.600000
1 0.444444 0.555556 0.461538 0.538462
2 0.470588 0.529412 0.476190 0.523810
3 0.480000 0.520000 0.482759 0.517241
4 0.484848 0.515152 0.486486 0.513514
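For the record, the reason apply works where transform fails: apply may return a like-shaped DataFrame per group, while transform (in the pandas version in question) insisted on a scalar per group, hence the ValueError. A self-contained check, assuming the X from the question (note that axis=1 groupby has since been deprecated in recent pandas; transposing, grouping on the index, and transposing back is the modern equivalent):
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)])
X = pd.DataFrame(np.arange(20).reshape(5, 4), columns=cols)
out = X.groupby(level=0, axis=1).apply(lambda x: x.div(x.sum(axis=1), axis=0))
print(out)  # the A columns sum to 1 per row, as do the B columns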
