collapse a pandas MultiIndex - python

Suppose I have a DataFrame with MultiIndex columns. How can I collapse the levels to a concatenation of the values so that I only have one level?
Setup
import numpy as np
import pandas as pd

np.random.seed([3, 14])
col = pd.MultiIndex.from_product([list('ABC'), list('DE'), list('FG')])
df = pd.DataFrame(np.random.rand(4, 12) * 10, columns=col).astype(int)
print(df)
   A           B           C
   D     E     D     E     D     E
   F  G  F  G  F  G  F  G  F  G  F  G
0  2  1  1  7  5  9  9  2  7  4  0  3
1  3  7  1  1  5  3  1  4  3  5  6  0
2  2  6  9  9  9  5  7  0  1  2  7  5
3  2  2  8  0  3  9  4  7  0  8  2  5
I want the result to look like this:
   ADF  ADG  AEF  AEG  BDF  BDG  BEF  BEG  CDF  CDG  CEF  CEG
0    2    1    1    7    5    9    9    2    7    4    0    3
1    3    7    1    1    5    3    1    4    3    5    6    0
2    2    6    9    9    9    5    7    0    1    2    7    5
3    2    2    8    0    3    9    4    7    0    8    2    5

Solution
I did this
def collapse_columns(df):
    df = df.copy()
    if isinstance(df.columns, pd.MultiIndex):
        # Join each tuple of level values into one string, e.g. ('A', 'D', 'F') -> 'ADF'.
        df.columns = df.columns.to_series().apply(lambda x: "".join(x))
    return df
I had to check whether it's a MultiIndex because, if it wasn't, I'd be splitting a string and recombining it with whatever separator I chose in the join.
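For reference, applying the function to the df from the setup above (a quick usage sketch):
collapsed = collapse_columns(df)
print(collapsed.columns.tolist()[:4])
# ['ADF', 'ADG', 'AEF', 'AEG']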

You may try this:
In [200]: cols = pd.Series(df.columns.tolist()).apply(pd.Series).sum(axis=1)
In [201]: cols
Out[201]:
0     ADF
1     ADG
2     AEF
3     AEG
4     BDF
5     BDG
6     BEF
7     BEG
8     CDF
9     CDG
10    CEF
11    CEG
dtype: object
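To actually apply it, assign the result back to the columns (this final step is implied but not shown above):
df.columns = cols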

df.columns = df.columns.to_series().apply(''.join)
This will give no separation, but you can sub in '_' for '' or any other separator you might want.
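For instance, a quick variant with an underscore separator (assuming the df from the setup):
df.columns = df.columns.to_series().apply('_'.join)
# columns become 'A_D_F', 'A_D_G', 'A_E_F', ...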

Solution 1:
df.columns = df.columns.to_series().str.join('_')
print(df.columns.shape)  # (1, X), i.e. a 2-D array
Or, better, Solution 2:
pivoteCols = df.columns.to_series().str.join('_')
pivoteCols = pivoteCols.values.reshape(len(pivoteCols))
df.columns = pivoteCols
print(df.columns.shape)  # one-dimensional
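For comparison, a plain list comprehension sidesteps the shape question entirely, since it builds an ordinary list of strings (a minimal sketch, not from the original answers):
df.columns = ['_'.join(c) for c in df.columns]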

Related

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
Or you can also use pd.merge (not a very clean solution, though):
In [1297]: df['tmp'] = 1
In [1298]: df1['tmp'] = 1
In [1309]: pd.merge(df, df1, on=['tmp'], left_index=True, right_index=True).drop('tmp', axis=1)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
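For completeness, pd.concat along the columns gives the same result with less ceremony (a sketch assuming both frames share the same index):
new_df = pd.concat([df, df1], axis=1)
Like join, concat keeps each frame's columns in their original order.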

dataframe: drop object-type columns with certain kinds of values

How do I drop columns with more than 50 distinct values using a function?
The columns to drop here are: date_dispatch, con_birth_dt, dat_cust_open, cust_mgr_team, mng_issu_date, created_date
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
label               1
date_dispatch    2883
con_birth_dt    12617
con_sex_mf          2
dat_cust_open     264
cust_mgr_team    2250
mng_issu_date    1796
um_num             38
created_date     2900
hqck_flag           2
dqck_flag           2
tzck_flag           2
yhlcck_flag         2
bzjck_flag          2
gzck_flag           2
jjsz_flag           2
e_yhlcck_flag       2
zq_flag             2
xtsz_flag           1
whsz_flag           1
hjsz_flag           2
yb_flag             2
qslc_flag           2
Use drop with index values filtered by boolean indexing:
a = app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
df = app_train.drop(a.index[a > 50], axis=1)
Another solution is to add reindex for the missing columns and then filter by the inverted condition <=:
a = (app_train.select_dtypes('object')
              .apply(pd.Series.nunique, axis=0)
              .reindex(app_train.columns, fill_value=0))
df = app_train.loc[:, a <= 50]
Sample:
app_train = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (app_train)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
a = (app_train.select_dtypes('object')
              .apply(pd.Series.nunique, axis=0)
              .reindex(app_train.columns, fill_value=0))
df = app_train.loc[:, a <= 5]
print (df)
B C D E F
0 4 7 1 5 a
1 5 8 3 3 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
nunique + loc
You can use nunique followed by loc with Boolean indexing:
n = 5 # maximum number of unique values permitted
counts = app_train.select_dtypes(['object']).apply(pd.Series.nunique)
df = app_train.loc[:, ~app_train.columns.isin(counts[counts > n].index)]
# data from jezrael
print(df)
B C D E F
0 4 7 1 5 a
1 5 8 3 3 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
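Since the question asks for a function, the same idea can be wrapped up like this (the name drop_high_cardinality and its signature are illustrative, not from the answers above):
def drop_high_cardinality(df, max_unique=50):
    # Count distinct values in each object-dtype column.
    counts = df.select_dtypes('object').apply(pd.Series.nunique)
    # Drop every column whose count exceeds the threshold.
    return df.drop(counts.index[counts > max_unique], axis=1)

app_train = drop_high_cardinality(app_train)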

Creating a list of sliced dataframes

I am trying to create a list of dataframes where each dataframe is 3 rows of a larger dataframe.
dframes = [df[0:3], df[3:6],...,df[2000:2003]]
I am still fairly new to programming; why does the following raise an error?
x = 3
dframes = []
for i in range(0, len(df)):
    dframes = dframes.append(df[i:x])
    i = x
    x = x + 3
which fails with:
    dframes = dframes.append(df[i:x])
AttributeError: 'NoneType' object has no attribute 'append'
Use np.split
Setup
Consider the dataframe df
df = pd.DataFrame(dict(A=range(15), B=list('abcdefghijklmno')))
Solution
dframes = np.split(df, range(3, len(df), 3))
Output
for d in dframes:
    print(d, '\n')
A B
0 0 a
1 1 b
2 2 c
A B
3 3 d
4 4 e
5 5 f
A B
6 6 g
7 7 h
8 8 i
A B
9 9 j
10 10 k
11 11 l
A B
12 12 m
13 13 n
14 14 o
Python raises this error because list.append returns None, so on the next pass through the loop your variable dframes is None.
You can use this:
dframes = [df[i:i+3] for i in range(0, len(df), 3)]
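If you would rather keep the explicit loop, the fix is to call append for its side effect instead of assigning its return value (a minimal sketch):
dframes = []
for i in range(0, len(df), 3):
    dframes.append(df[i:i+3])  # append mutates the list in place and returns None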
You can use a list comprehension with groupby on a NumPy array created from the index positions floor-divided by 3:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
5 4 3 7 1 1
6 7 7 0 2 9
7 9 3 2 5 8
8 1 0 7 6 2
9 0 8 2 5 1
dfs = [x for i, x in df.groupby(np.arange(len(df.index)) // 3)]
print (dfs)
[ A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8, A B C D E
3 4 0 9 6 2
4 4 1 5 3 4
5 4 3 7 1 1, A B C D E
6 7 7 0 2 9
7 9 3 2 5 8
8 1 0 7 6 2, A B C D E
9 0 8 2 5 1]
If the index is the default monotonic one (0, 1, 2, ...), the solution can be simplified:
dfs = [x for i, x in df.groupby(df.index // 3)]

Python Pandas multiply based on columns and add suffix

I have two DataFrame objects which I would like to multiply based on the column names and output the new column with a suffix...
df1 = pd.DataFrame(np.random.randint(0,10, size=(5,5)), columns=list('ABCDE'))
A B C D E
0 6 2 1 7 2
1 0 0 2 1 8
2 7 2 6 6 9
3 2 5 5 1 3
4 9 1 6 7 4
df2 = pd.DataFrame(np.random.randint(1, 10, size=(5,3)), columns=list('ABC'))
A B C
0 2 1 2
1 7 5 1
2 2 1 4
3 7 8 5
4 9 2 2
I would like the output to be listed as below, with columns A_x, B_x and C_x being the product of the aligned columns in df1 and df2:
A B C A_x B_x C_x D E
0 6 2 1 12 2 2 7 2
1 0 0 2 0 0 2 1 8
2 7 2 6 14 2 24 6 9
3 2 5 5 14 40 25 1 3
4 9 1 6 81 2 12 7 4
You can use intersection to get the common column names, then multiply with mul, add a suffix with add_suffix, and finally concat with df1:
cols = df1.columns.intersection(df2.columns)
df = df1[cols].mul(df2[cols], axis=1).add_suffix('_x')
df = pd.concat([df1, df], axis=1)
print (df)
A B C D E A_x B_x C_x
0 6 2 1 7 2 12 2 2
1 0 0 2 1 8 0 0 2
2 7 2 6 6 9 14 2 24
3 2 5 5 1 3 14 40 25
4 9 1 6 7 4 81 2 12
If you need to change the order of the columns:
cols = df1.columns.intersection(df2.columns)
df = df1[cols].mul(df2[cols], axis=1).add_suffix('_x')
cols1 = cols.tolist() + \
        df.columns.tolist() + \
        df1.columns.difference(df2.columns).tolist()
df = pd.concat([df1, df], axis=1)
print (df[cols1])
A B C A_x B_x C_x D E
0 6 2 1 12 2 2 7 2
1 0 0 2 0 0 2 1 8
2 7 2 6 14 2 24 6 9
3 2 5 5 14 40 25 1 3
4 9 1 6 81 2 12 7 4

How do I put a series (such as the result of a pandas groupby.apply(f)) into a new column of the dataframe?

I have a dataframe that I want to calculate statistics on (value_count, mode, mean, etc.) and then put the result in a new column. My current solution is O(n**2) or so, and I'm sure there is a faster, more obvious method that I'm overlooking.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(100, 10)),
                  columns=list('abcdefghij'))
df['result'] = 0
groups = df.groupby([df.i, df.j])
for g in groups:
    icol_eq = df.i == g[0][0]
    jcol_eq = df.j == g[0][1]
    i_and_j = icol_eq & jcol_eq
    df['result'][i_and_j] = len(g[1])
The above works, but is extremely slow for large dataframes.
I tried
df['result'] = df.groupby([df.i, df.j]).apply(len)
but it doesn't seem to work.
Nor does
def f(g):
    g['result'] = len(g)
    return g

df.groupby([df.i, df.j]).apply(f)
Nor can I merge in the Series resulting from df.groupby(...).apply(lambda x: len(x)).
You want to use transform:
In [98]:
df['result'] = df.groupby([df.i, df.j]).transform(len)
df
Out[98]:
a b c d e f g h i j result
0 6 1 3 0 1 1 4 2 8 6 6
1 1 3 9 7 5 5 3 5 4 4 1
2 1 5 0 1 8 1 4 7 3 9 1
3 6 8 6 4 6 0 8 0 6 5 6
4 7 9 7 2 8 9 9 6 0 6 7
5 3 5 5 7 2 7 7 3 2 8 3
6 5 0 4 7 5 7 5 7 9 1 5
7 3 2 5 4 3 6 8 4 2 0 3
8 2 3 0 4 8 5 7 9 7 2 2
9 1 1 3 2 3 5 6 6 5 6 1
10 3 0 2 7 1 8 1 3 5 4 3
....
transform returns a Series with an index aligned to your original df, so you can then add it as a column.
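As a side note, recent pandas versions offer the named 'size' aggregation, which expresses the same thing without passing len (a sketch, not part of the original answer):
df['result'] = df.groupby(['i', 'j'])['i'].transform('size')
Selecting a single column first guarantees the result is a Series before it is assigned.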
