Pandas 'partial melt' or 'group melt' - python

I have a DataFrame like this
>>> df = pd.DataFrame([[1,1,2,3,4,5,6],[2,7,8,9,10,11,12]],
columns=['id', 'ax','ay','az','bx','by','bz'])
>>> df
id ax ay az bx by bz
0 1 1 2 3 4 5 6
1 2 7 8 9 10 11 12
and I want to transform it into something like this
id name x y z
0 1 a 1 2 3
1 2 a 7 8 9
2 1 b 4 5 6
3 2 b 10 11 12
This is an unpivot / melt problem, but I don't know of any way to melt by keeping these groups intact. I know I can create projections across the original dataframe and then concat those but I'm wondering if I'm missing some common melt tricks from my toolbelt.

Set_index, convert columns to multi index and stack,
df = df.set_index('id')
df.columns = [df.columns.str[1], df.columns.str[0]]
new_df = df.stack().reset_index().rename(columns = {'level_1': 'name'})
id name x y z
0 1 a 1 2 3
1 1 b 4 5 6
2 2 a 7 8 9
3 2 b 10 11 12

Not melt wide_to_long with stack and unstack
pd.wide_to_long(df,['a','b'],i='id',j='drop',suffix='\w+').stack().unstack(1)
Out[476]:
drop x y z
id
1 a 1 2 3
b 4 5 6
2 a 7 8 9
b 10 11 12

An addition to the already excellent answers; pivot_longer from pyjanitor can help to abstract the reshaping :
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index = 'id',
names_to = ('name', '.value'),
names_pattern = r"(.)(.)")
id name x y z
0 1 a 1 2 3
1 2 a 7 8 9
2 1 b 4 5 6
3 2 b 10 11 12

Related

Pandas - split text with values in parenthesis into multiple columns

I have a dataframe column with values as below:
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(6)Hex(7)Fuc(1)NeuAc(3)
HexNAc(5)Hex(4)NeuAc(1)
HexNAc(6)Hex(7)
I want to split this information into multiple columns:
HexNAc Hex Fuc NeuAc
6 7 1 3
6 7 1 3
5 4 0 1
6 7 0 0
What is the best way to do this?
Can be done with a combination of string splits and explode (pandas version >= 0.25) then pivot. The rest cleans up some of the columns and fills missing values.
import pandas as pd
s = pd.Series(['HexNAc(6)Hex(7)Fuc(1)NeuAc(3)', 'HexNAc(6)Hex(7)Fuc(1)NeuAc(3)',
'HexNAc(5)Hex(4)NeuAc(1)', 'HexNAc(6)Hex(7)'])
(pd.DataFrame(s.str.split(')').explode().str.split('\(', expand=True))
.pivot(columns=0, values=1)
.rename_axis(None, axis=1)
.dropna(how='all', axis=1)
.fillna(0, downcast='infer'))
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 0 4 5 1
3 0 7 6 0
Check
pd.DataFrame(s.str.findall('\w+').map(lambda x : dict(zip(x[::2], x[1::2]))).tolist())
Out[207]:
Fuc Hex HexNAc NeuAc
0 1 7 6 3
1 1 7 6 3
2 NaN 4 5 1
3 NaN 7 6 NaN

Select Columns of a DataFrame based on another DataFrame

I am trying to select a subset of a DataFrame based on the columns of another DataFrame.
The DataFrames look like this:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I want to get all rows of the first Dataframe for the columns which are included in both DataFrames. My result should look like this:
a b
0 0 1
1 4 5
2 8 9
3 12 13
You can use pd.Index.intersection or its syntactic sugar &:
intersection_cols = df1.columns & df2.columns
res = df1[intersection_cols]
import pandas as pd
data1=[[0,1,2,3,],[4,5,6,7],[8,9,10,11],[12,13,14,15]]
data2=[[0,1],[2,3],[4,5],[6,7],[8,9]]
df1 = pd.DataFrame(data=data1,columns=['a','b','c','d'])
df2 = pd.DataFrame(data=data2,columns=['a','b'])
df1[(df1.columns) & (df2.columns)]

Is there an easy way to compute the intersection of two different indexes in a dataframe?

For example, if I have a DataFrame consisting of 5 rows (0-4) and 5 columns (A-E), I want to say, 0A * E3. Or more pseudo-like df[0,A] * df[3,E]?
I think you need select values by DataFrame.loc and then multiple:
a = df.loc[0,'A'] * df.loc[3,'E']
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
a = df.loc[0,'A'] * df.loc[3,'E']
print (a)
16
Btw, your pseodo code is very close to real solution.

Sum pandas dataframe column values based on condition of column name

I have a DataFrame with column names in the shape of x.y, where I would like to sum up all columns with the same value on x without having to explicitly name them. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option, you can extract the prefix from the column names and use it as a group variable:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
You can first create Multiindex by split and then groupby by first level and aggregate sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
x y
1 2 8 92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7

Pandas merge on aggregated columns

Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
Which makes sense because the grouped table has a Multi-index and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add some aggregated column from groupby operation back to the df you should be using transform, this produces a Series with its index aligned with your orig df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2

Categories