Multiindex on DataFrames and sum in Pandas - python

I am currently trying to make use of the pandas MultiIndex feature. I am trying to group an existing DataFrame object df_original based on its columns in a smart way, and was therefore thinking of a MultiIndex.
print df_original
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndexed DataFrame, with A, B and C, and by_portfolio as indices, looking like:
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried turning all columns in df_original and the sought-after indices into list objects and creating a new DataFrame from them. This seems a bit cumbersome, and I can't figure out how to add the actual values afterwards.
Perhaps some sort of groupby is better for this purpose? The thing is, I will need to add this data to another, similar DataFrame, so the resulting DataFrame must support being added to another one later on.
Thanks

You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
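Since the question mentions adding the result to another, similar DataFrame: once both frames share this MultiIndex layout, a plain add lines them up by index and columns. A minimal sketch (df_other here is a hypothetical second frame with the same index and currency columns):
combined = df2.add(df_other, fill_value=0)  # fill_value handles labels present in only one frame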

Related

Find missing numbers in a column dataframe pandas

I have a dataframe with stores and their invoice numbers, and I need to find the missing consecutive invoice numbers per store. For example:
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']
Store Invoice
0 A 1
1 A 2
2 A 5
3 A 6
4 A 8
5 B 20
6 B 23
7 B 24
8 B 30
9 C 200
10 C 202
11 C 203
12 D 204
13 D 206
And I want a dataframe like this:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Thanks in advance!
You can use groupby.apply to compute a set difference with the range from the min to max value. Then explode:
(df1.astype({'Invoice': int})
    .groupby('Store')['Invoice']
    .apply(lambda s: set(range(s.min(), s.max())).difference(s))
    .explode().reset_index()
)
NB: if you want to ensure sorted values, use lambda s: sorted(set(range(s.min(), s.max())).difference(s)).
Output:
Store Invoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
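For completeness, the sorted variant mentioned in the note, spelled out in full (same logic, with the set difference wrapped in sorted()):
(df1.astype({'Invoice': int})
    .groupby('Store')['Invoice']
    .apply(lambda s: sorted(set(range(s.min(), s.max())).difference(s)))
    .explode().reset_index()
)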
Here's an approach:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)
df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
df2['MissInvoice'] = [[]]*len(df2)
for store, row in df2.iterrows():
    # every invoice number in the store's min..max range that is absent from the data
    df2.at[store, 'MissInvoice'] = np.setdiff1d(np.arange(row['min'], row['max'] + 1),
                                                df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns=['min', 'max']).reset_index()
The resulting dataframe df2:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
Note: Store D is absent from the resulting dataframe because my df1 above omits the Store D rows that appear in the question's data.

Merging rows in a dataframe based on reoccurring values

I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle can be explained as follows: if A and B are together in one row, and B and C are together in another row, then A, B and C should all end up together. What I want as an outcome, looking at the dataframe above, is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
This seems like a graph theory problem dealing with connected components. You can use the networkx library:
import networkx as nx

g = nx.from_pandas_edgelist(df, 'a', 'b')
pd.concat([pd.Series([list(i)[0],
                      ' '.join(map(str, list(i)[1:]))],
                     index=['a', 'b'])
           for i in list(nx.connected_components(g))], axis=1).T
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
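Note that from_pandas_edgelist above assumes the two value columns are named 'a' and 'b'. A minimal reconstruction of the sample data under that assumption:
import pandas as pd

df = pd.DataFrame({'a': [0, 4, 8, 10, 14, 16, 16, 16, 17, 17, 18, 20],
                   'b': [1, 5, 9, 11, 15, 17, 18, 19, 18, 19, 19, 21]})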

Creating a new column from the values of a column - Pandas

I want to create a new column D in pandas based on the info that I have in column C. The data I have has 50k rows, so it is impossible for me to do it manually.
A sample of the data is:
A B C
12 12 3:02
13 13 2:02
14 14 3:03
15 15 1:04
16 16 2:05
I need to divide the values in column C into two parts at the colon symbol:
if the first value is bigger than the second, as in row 1 (3:02, 3 > 02), the value in column D will be A
if both values are equal, as in rows 2 and 3 (2:02 and 3:03), the value in column D will be B
if the second value is bigger than the first, as in rows 4 and 5 (1:04 and 2:05), the value in column D will be C
so the new data will look like:
A B C D
12 12 3:02 A
13 13 2:02 B
14 14 3:03 B
15 15 1:04 C
16 16 2:05 C
Thanks in advance.
Use numpy.select with a new DataFrame created by Series.str.split with expand=True:
df1 = df['C'].str.split(':', expand=True).astype(int)
print(df1)
0 1
1 3 2
2 2 2
3 3 3
4 1 4
5 2 5
df['D'] = np.select([df1[0] > df1[1], df1[0] == df1[1], df1[0] < df1[1]], ['A','B','C'])
print (df)
A B C D
1 12 12 3:02 A
2 13 13 2:02 B
3 14 14 3:03 B
4 15 15 1:04 C
5 16 16 2:05 C
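An alternative sketch for the same labelling, using np.sign on the split columns instead of np.select (this reuses df1 from above; it is not part of the original answer):
# +1, 0, -1 from the comparison map directly onto 'A', 'B', 'C'
df['D'] = np.sign(df1[0] - df1[1]).map({1: 'A', 0: 'B', -1: 'C'})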

weighted zscore within groups

Consider the following dataframe df
np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    S=np.random.rand(20),
    W=np.random.rand(20),
    G=np.random.choice(list('ABCD'), 20)
))
print(df)
G S W
0 B 0.444939 0.278735
1 D 0.407554 0.609862
2 C 0.460148 0.085823
3 B 0.465239 0.836997
4 A 0.462691 0.739635
5 A 0.016545 0.866059
6 D 0.850445 0.691271
7 C 0.817744 0.377185
8 B 0.777962 0.225146
9 C 0.757983 0.435280
10 C 0.934829 0.700900
11 A 0.831104 0.700946
12 C 0.879891 0.796487
13 A 0.926879 0.018688
14 D 0.721535 0.700566
15 D 0.117642 0.900749
16 D 0.145906 0.764869
17 C 0.199844 0.253200
18 B 0.437564 0.548054
19 A 0.100702 0.778883
I want to perform a weighted z-score of the 'S' column, using weights 'W', within each group defined by 'G'.
So that we know what the definition of the weighted z-score is, this is how you'd calculate it over the entire set:
(df.S - (df.S * df.W).mean()) / df.S.std()
Question(s)
What is the most elegant way to calculate this?
What is the most key-stroke efficient way to calculate this?
What is the most time-efficient way to calculate this?
I calculated the answer as
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
Here you go:
>>> df.groupby('G').apply(lambda x: (x.S - (x.S * x.W).mean()) / x.S.std())
G
A 4 0.619677
5 -0.461462
11 1.512449
13 1.744537
19 -0.257527
B 0 1.291729
3 1.414926
8 3.312825
18 1.246969
C 2 0.394302
7 1.645083
9 1.436054
10 2.054617
12 1.862456
17 -0.516180
D 1 0.288806
6 1.625974
14 1.236770
15 -0.586493
16 -0.501159
Name: S, dtype: float64
We first split on each group in G, then apply the weighted z-score function to each group's dataframe.
transform
P = df.S * df.W
m = P.groupby(df.G).transform('mean')   # weighted mean per group, broadcast back to the original rows
z = df.groupby('G').S.transform('std')  # per-group std of S, broadcast likewise
(df.S - m) / z
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
agg + join + eval
f = dict(S=dict(Std='std'), P=dict(Mean='mean'))
stats = df.assign(P=df.S * df.W).groupby('G').agg(f)
stats.columns = stats.columns.droplevel()
df.join(stats, on='G').eval('(S - Mean) / Std')
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
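Note that the nested-dict aggregation spec used above was removed in later pandas versions; a rough equivalent using named aggregation (available from pandas 0.25) would be:
stats = (df.assign(P=df.S * df.W)
           .groupby('G')
           .agg(Std=('S', 'std'), Mean=('P', 'mean')))
df.join(stats, on='G').eval('(S - Mean) / Std')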

pandas: finding maximum for each series in dataframe

Consider this data:
df = pd.DataFrame(np.random.randint(0, 20, size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
date A B C D
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
where date is the index
what I want to get back is a tuple of (date, <max>, <series_name>) for each column:
2/1/2016, 18, 'A'
4/1/2016, 17, 'B'
1/1/2016, 19, 'C'
4/1/2016, 18, 'D'
How can this be done in idiomatic pandas?
You could use idxmax and max with axis=0 for that and then concatenate them:
np.random.seed(632)
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)), columns=list('ABCD'))
In [28]: df
Out[28]:
A B C D
0 10 14 16 1
1 12 13 8 8
2 8 16 11 1
3 8 1 17 12
4 4 2 1 7
In [29]: df.idxmax(axis=0)
Out[29]:
A 1
B 2
C 3
D 3
dtype: int64
In [30]: df.max(axis=0)
Out[30]:
A 12
B 16
C 17
D 12
dtype: int32
In [32]: pd.concat([df.idxmax(axis=0) , df.max(axis=0)], axis=1)
Out[32]:
0 1
A 1 12
B 2 16
C 3 17
D 3 12
I think you can concat max and idxmax. Then you can reset_index, rename the column index, and reorder the columns:
print df
A B C D
date
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
print pd.concat([df.max(),df.idxmax()], axis=1, keys=['max','date'])
max date
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = (pd.concat([df.max(), df.idxmax()], axis=1, keys=['max','date'])
        .reset_index()
        .rename(columns={'index':'name'}))
#change order of columns
df = df[['date','max','name']]
print df
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Another solution with rename_axis (new in pandas 0.18.0):
print pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date'])
max date
name
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = (pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date'])
        .reset_index())
#change order of columns
df = df[['date','max','name']]
print df
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Setup
import numpy as np
import pandas as pd
np.random.seed(314)
df = pd.DataFrame(np.random.randint(0, 20, size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
print df
A B C D
2016-04-01 8 13 9 19
2016-04-02 10 14 16 7
2016-04-03 2 7 16 3
2016-04-04 12 7 4 0
2016-04-05 4 13 8 16
Solution
stacked = df.stack()                                   # long format: (date, column) -> value
stacked = stacked[stacked.groupby(level=1).idxmax()]   # keep only the max entry per column
produces
print stacked
2016-04-04 A 12
2016-04-02 B 14
C 16
2016-04-01 D 19
dtype: int32
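If you specifically need the (date, max, name) tuples from the question, one way to finish from the stacked Series (its MultiIndex holds the date and the column name; this step is not part of the original answer):
tuples = [(date, value, col) for (date, col), value in stacked.items()]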
