Consider the following dataframe df:

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    S=np.random.rand(20),
    W=np.random.rand(20),
    G=np.random.choice(list('ABCD'), 20)
))
print(df)
G S W
0 B 0.444939 0.278735
1 D 0.407554 0.609862
2 C 0.460148 0.085823
3 B 0.465239 0.836997
4 A 0.462691 0.739635
5 A 0.016545 0.866059
6 D 0.850445 0.691271
7 C 0.817744 0.377185
8 B 0.777962 0.225146
9 C 0.757983 0.435280
10 C 0.934829 0.700900
11 A 0.831104 0.700946
12 C 0.879891 0.796487
13 A 0.926879 0.018688
14 D 0.721535 0.700566
15 D 0.117642 0.900749
16 D 0.145906 0.764869
17 C 0.199844 0.253200
18 B 0.437564 0.548054
19 A 0.100702 0.778883
I want to compute a weighted z-score of the 'S' column, using the weights in 'W', within each group defined by 'G'.
To make the definition of the weighted z-score concrete, this is how you'd calculate it over the entire set:
(df.S - (df.S * df.W).mean()) / df.S.std()
Question(s)
What is the most elegant way to calculate this?
What is the most key-stroke efficient way to calculate this?
What is the most time-efficient way to calculate this?
I calculated the answer as
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
Here you go:
>>> df.groupby('G').apply(lambda x: (x.S - (x.S * x.W).mean()) / x.S.std())
G
A 4 0.619677
5 -0.461462
11 1.512449
13 1.744537
19 -0.257527
B 0 1.291729
3 1.414926
8 3.312825
18 1.246969
C 2 0.394302
7 1.645083
9 1.436054
10 2.054617
12 1.862456
17 -0.516180
D 1 0.288806
6 1.625974
14 1.236770
15 -0.586493
16 -0.501159
Name: S, dtype: float64
We first split the dataframe into groups on G, then apply the weighted z-score function to each group.
transform
P = df.S * df.W                         # weighted values S * W
m = P.groupby(df.G).transform('mean')   # per-group mean of S * W, broadcast back to each row
z = df.groupby('G').S.transform('std')  # per-group std of S, broadcast back to each row
(df.S - m) / z
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
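For the keystroke-efficiency question, the same computation also fits in a single expression:

(df.S - (df.S * df.W).groupby(df.G).transform('mean')) / df.groupby('G').S.transform('std')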
agg + join + eval
# Compute the per-group pieces (std of S, mean of P = S * W) once with
# named aggregation (pandas >= 0.25), then broadcast them back with a join.
stats = df.assign(P=df.S * df.W).groupby('G').agg(Std=('S', 'std'), Mean=('P', 'mean'))
df.join(stats, on='G').eval('(S - Mean) / Std')
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
naive timing
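For the time-efficiency question, a minimal IPython %timeit harness over the three approaches; no numbers are quoted here, since results vary by machine, data size, and pandas version:

%timeit df.groupby('G').apply(lambda x: (x.S - (x.S * x.W).mean()) / x.S.std())

%timeit (df.S - (df.S * df.W).groupby(df.G).transform('mean')) / df.groupby('G').S.transform('std')

%timeit df.join(df.assign(P=df.S * df.W).groupby('G').agg(Std=('S', 'std'), Mean=('P', 'mean')), on='G').eval('(S - Mean) / Std')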
Related
I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle can be explained as follows: if A and B are together in one row, and B and C are together in another row, then A, B and C should be together. Looking at the dataframe above, what I want as an outcome is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
This seems like a graph-theory problem dealing with connected components. You can use the networkx library:
import networkx as nx

# Treat each row as an edge; connected components are the merged groups
# (columns assumed to be named 'a' and 'b', as in the output below).
g = nx.from_pandas_edgelist(df, 'a', 'b')
pd.concat([pd.Series([sorted(i)[0],
                      ' '.join(map(str, sorted(i)[1:]))],
                     index=['a', 'b'])
           for i in nx.connected_components(g)], axis=1).T
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
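If you'd rather not pull in networkx, a small union-find (disjoint-set) pass computes the same components. This is a sketch under the same column-name assumption ('a' and 'b'), not from the original answer:

from collections import defaultdict
import pandas as pd

def find(parent, x):
    # Walk up to the root of x's set, compressing the path on the way back.
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

parent = {}
for a, b in zip(df['a'], df['b']):             # columns assumed to be 'a' and 'b'
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    parent[find(parent, a)] = find(parent, b)  # union the two sets

groups = defaultdict(list)
for x in parent:
    groups[find(parent, x)].append(x)

# Same shape as the networkx output: first member in 'a', the rest joined in 'b'.
result = pd.DataFrame([(g[0], ' '.join(map(str, g[1:])))
                       for g in sorted(sorted(v) for v in groups.values())],
                      columns=['a', 'b'])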
Assume I have a data frame like:
import pandas as pd

df = pd.DataFrame({"user_id": [1, 5, 11],
                   "user_type": ["I", "I", "II"],
                   "joined_for": [1.4, 9.4, 18.1]})
Now I'd like to:
Take each user's joined_for and get the ceiling integer.
Based on the integer, create a new data frame containing number sequences where the maximum is the ceiling number.
This is how I do it now:
import math

new_df = pd.DataFrame()
for i in range(df.shape[0]):
    ceil_num = math.ceil(df.iloc[i]["joined_for"])
    new_df = new_df.append(pd.DataFrame({"user_id": df.iloc[i]["user_id"],
                                         "joined_month": range(1, ceil_num + 1)}),
                           ignore_index=True)
new_df = new_df.merge(df.drop(columns="joined_for"), on="user_id")
new_df is what I want, but this is very time-consuming when there are lots of users and the joined_for values can be large. Is there a better way to do this, faster or neater?
Using a comprehension
pd.DataFrame([
    [t.user_id, m, t.user_type]
    for t in df.itertuples(index=False)
    for m in range(1, math.ceil(t.joined_for) + 1)
], columns=['user_id', 'joined_month', 'user_type'])
user_id joined_month user_type
0 1 1 I
1 1 2 I
2 5 1 I
3 5 2 I
4 5 3 I
5 5 4 I
6 5 5 I
7 5 6 I
8 5 7 I
9 5 8 I
10 5 9 I
11 5 10 I
12 11 1 II
13 11 2 II
14 11 3 II
15 11 4 II
16 11 5 II
17 11 6 II
18 11 7 II
19 11 8 II
20 11 9 II
21 11 10 II
22 11 11 II
23 11 12 II
24 11 13 II
25 11 14 II
26 11 15 II
27 11 16 II
28 11 17 II
29 11 18 II
30 11 19 II
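A vectorized sketch that avoids the Python-level loop entirely, using index.repeat plus a per-user cumcount; it assumes the same df as above and the requested column order:

import numpy as np

# Repeat each row ceil(joined_for) times, then number the repeats per user.
reps = np.ceil(df['joined_for']).astype(int)
out = df.loc[df.index.repeat(reps), ['user_id', 'user_type']].copy()
out['joined_month'] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)[['user_id', 'joined_month', 'user_type']]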
I have a pandas DataFrame with a two-level MultiIndex: a Date level and a Gender level. It looks like this:
Division North South West East
Date Gender
2016-05-16 19:00:00 F 0 2 3 3
M 12 15 12 12
2016-05-16 20:00:00 F 12 9 11 11
M 10 13 8 9
2016-05-16 21:00:00 F 9 4 7 1
M 5 1 12 10
Now if I want to find the average values for each hour, I know I can do something like:

df.groupby(df.index.hour).mean()

but this does not seem to work when you have a MultiIndex. I found that I could reach the Date level like this:

df.groupby(df.index.get_level_values('Date').hour).mean()

which does average over the hours of the day, but I lose track of the Gender level...

So my question is: how can I find the average hourly values for each Division, by Gender?
I think you can group by the hour values together with a level of the MultiIndex; mixing the two in groupby needs pandas 0.20.1+:
df1 = df.groupby([df.index.get_level_values('Date').hour,'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Another solution:
df1 = df.groupby([df.index.get_level_values('Date').hour,
df.index.get_level_values('Gender')]).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Or simply create columns from MultiIndex:
df = df.reset_index()
df1 = df.groupby([df['Date'].dt.hour, 'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
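In all three variants the first level of the result is still labeled Date even though it now holds hours. If that bothers you, an optional cleanup (the level name 'Hour' is my choice, not from the original):

df1 = (df.groupby([df.index.get_level_values('Date').hour, 'Gender'])
         .mean()
         .rename_axis(['Hour', 'Gender']))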
Consider this data:
df = pd.DataFrame(np.random.randint(0, 20, size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
date A B C D
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
where date is the index. What I want to get back is a tuple of (date, <max>, <series_name>) for each column:
2/1/2016, 18, 'A'
4/1/2016, 17, 'B'
1/1/2016, 19, 'C'
4/1/2016, 18, 'D'
How can this be done in idiomatic pandas?
You could use idxmax and max with axis=0 for that, and then concatenate the two results:
np.random.seed(632)
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)), columns=list('ABCD'))
In [28]: df
Out[28]:
A B C D
0 10 14 16 1
1 12 13 8 8
2 8 16 11 1
3 8 1 17 12
4 4 2 1 7
In [29]: df.idxmax(axis=0)
Out[29]:
A 1
B 2
C 3
D 3
dtype: int64
In [30]: df.max(axis=0)
Out[30]:
A 12
B 16
C 17
D 12
dtype: int32
In [32]: pd.concat([df.idxmax(axis=0) , df.max(axis=0)], axis=1)
Out[32]:
0 1
A 1 12
B 2 16
C 3 17
D 3 12
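For something more compact on recent pandas, both pieces can be computed in a single agg call; a small alternative sketch:

# One call computes both; transpose to get one row per column.
df.agg(['idxmax', 'max']).T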
I think you can concat max and idxmax. Then you can reset_index, rename the column holding the index, and reorder the columns:
print df
A B C D
date
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
print pd.concat([df.max(),df.idxmax()], axis=1, keys=['max','date'])
max date
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = pd.concat([df.max(), df.idxmax()], axis=1, keys=['max', 'date']) \
       .reset_index() \
       .rename(columns={'index': 'name'})
#change order of columns
df = df[['date','max','name']]
print df
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Another solution with rename_axis (new in pandas 0.18.0):
print pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date'])
max date
name
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max', 'date']) \
       .reset_index()
#change order of columns
df = df[['date','max','name']]
print df
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Setup
import numpy as np
import pandas as pd
np.random.seed(314)
df = pd.DataFrame(np.random.randint(0, 20, size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
print df
A B C D
2016-04-01 8 13 9 19
2016-04-02 10 14 16 7
2016-04-03 2 7 16 3
2016-04-04 12 7 4 0
2016-04-05 4 13 8 16
Solution
stacked = df.stack()
stacked = stacked[stacked.groupby(level=1).idxmax()]
produces
print stacked
2016-04-04 A 12
2016-04-02 B 14
C 16
2016-04-01 D 19
dtype: int32
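If you want the literal (date, max, name) tuples the question asks for, the stacked result's MultiIndex can be unpacked; a small follow-up sketch, not part of the original answer:

# stacked is indexed by (date, column name); items() yields ((date, name), value).
tuples = [(date, val, name) for (date, name), val in stacked.items()]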
I am currently trying to make use of the pandas MultiIndex. I am trying to restructure an existing DataFrame df_original based on its columns in a smart way, and was therefore thinking of a MultiIndex.

print df_original
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame, with A, B and C, and by_portfolio as the index levels, looking like:
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried turning all columns of df_original and the sought-after indices into lists, and creating a new DataFrame from those. This seems a bit cumbersome, and I can't figure out how to add the actual values afterwards.

Perhaps some sort of groupby is better for this purpose? The thing is, I will need to add this data to another, similar DataFrame later on, so the resulting DataFrame has to be addable to another one.
Thanks
You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
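Since the question mentions adding the result to another, similar DataFrame: once both frames carry the same (letter, by_portfolio) index and the same currency columns, plain arithmetic aligns on the labels. A sketch with a hypothetical second frame df2_other:

# df2_other stands in for another frame built the same way (here just a copy).
df2_other = df2.copy()
total = df2 + df2_other  # pandas aligns on the MultiIndex and columns before adding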