Transpose two or more columns in a dataframe - python

I have a dataframe which looks like:
PRIO  Art  Name   Value
1     A    Alpha  0
1     A    Alpha  0
1     A    Beta   1
2     A    Alpha  3
2     B    Theta  2
How can I transpose the dataframe so that each unique name becomes a column, with the corresponding value next to it (duplicate rows should be ignored)?
So in this case:
PRIO  Art  Alpha  Alpha_value  Beta  Beta_value  Theta  Theta_value
1     A    1      0            1     1           NaN    NaN
2     A    1      3            NaN   NaN         NaN    NaN
2     B    NaN    NaN          NaN   NaN         1      2

Here's one way using pivot_table. A few tricky things to keep in mind:
- You need to specify both 'PRIO' and 'Art' as the pivot index.
- Two aggregation functions (a lambda marking presence, 'first' for the value) get it done in a single call.
- The level-0 column names have to be renamed to tell the two aggregations apart, so you swap levels and rename.
out = df.pivot_table(index=['PRIO', 'Art'], columns='Name', values='Value',
                     aggfunc=[lambda x: 1, 'first'])
# get the column names right
d = {'<lambda>':'is_present', 'first':'value'}
out = out.rename(columns=d, level=0)
out.columns = out.swaplevel(1,0, axis=1).columns.map('_'.join)
print(out.reset_index())
   PRIO Art  Alpha_is_present  Beta_is_present  Theta_is_present  Alpha_value  \
0     1   A               1.0              1.0               NaN          0.0
1     2   A               1.0              NaN               NaN          3.0
2     2   B               NaN              NaN               1.0          NaN

   Beta_value  Theta_value
0         1.0          NaN
1         NaN          NaN
2         NaN          2.0

Group by twice: first to pivot Name and suffix the columns with _value; next group by the same keys and count the unique values. Join the two; in the join, drop the duplicate columns and rename the others as appropriate.
g = (df.groupby(['Art', 'PRIO', 'Name'])['Value']
       .first().unstack().reset_index().add_suffix('_value'))
counts = (df.groupby(['PRIO', 'Art', 'Name'])['Value']
            .nunique().unstack('Name').reset_index())
print(g.join(counts).drop(columns=['PRIO_value', 'Art'])
       .rename(columns={'Art_value': 'Art'}))
Name Art  Alpha_value  Beta_value  Theta_value  PRIO  Alpha  Beta  Theta
0      A          0.0         1.0          NaN     1    1.0   1.0    NaN
1      A          3.0         NaN          NaN     2    1.0   NaN    NaN
2      B          NaN         NaN          2.0     2    NaN   NaN    1.0

This is an example of pd.crosstab() and groupby().
df = pd.concat([pd.crosstab([df['PRIO'], df['Art']], df['Name']),
                df.groupby(['PRIO', 'Art', 'Name'])['Value'].sum()
                  .unstack().add_suffix('_value')],
               axis=1).reset_index()
df
| | Alpha | Beta | Theta | Alpha_value | Beta_value | Theta_value |
|:---------|--------:|-------:|--------:|--------------:|-------------:|--------------:|
| (1, 'A') | 1 | 1 | 0 | 0 | 1 | nan |
| (2, 'A') | 1 | 0 | 0 | 3 | nan | nan |
| (2, 'B') | 0 | 0 | 1 | nan | nan | 2 |
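If you want the exact column layout from the question, with the presence flag named Alpha right next to Alpha_value, here is one sketch combining pd.crosstab with a first-value pivot (duplicate rows collapse to a single 1; assumes df is the question's frame):
presence = pd.crosstab([df['PRIO'], df['Art']], df['Name']).clip(upper=1)
presence = presence.mask(presence.eq(0))  # any count -> 1, zero -> NaN
values = df.pivot_table(index=['PRIO', 'Art'], columns='Name',
                        values='Value', aggfunc='first')
out = presence.join(values.add_suffix('_value'))
# interleave: Alpha, Alpha_value, Beta, Beta_value, Theta, Theta_value
out = out[[c for n in presence.columns for c in (n, f'{n}_value')]].reset_index()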

Get a column containing the first non-NaN value from a group of columns

Given data as:
| | a | b | c |
|---:|----:|----:|----:|
| 0 | nan | nan | 1 |
| 1 | nan | 2 | nan |
| 2 | 3 | 3 | 3 |
I would like to create a column d containing [1, 2, 3].
There can be an arbitrary number of columns (though it will be fewer than 30).
Using
df.isna().apply(lambda x: x.idxmin(), axis=1)
Will give me:
0 c
1 b
2 a
dtype: object
Which seems useful, but I'm drawing a blank on how to access the columns with this, or whether there's a more suitable approach.
Repro:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)
Try this:
df.fillna(method='bfill', axis=1).iloc[:, 0]
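Note that fillna(method='bfill') is deprecated in recent pandas; df.bfill(axis=1).iloc[:, 0] is the current spelling. And if you specifically wanted to go through the per-row column labels, as in the question, here is one sketch using numpy indexing (DataFrame.lookup, the old tool for this job, was removed in pandas 2.0):
import numpy as np

# label of the first non-NaN column in each row
first_col = df.notna().idxmax(axis=1)
# pick each row's value at that column via (row position, column position)
df['d'] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(first_col)]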
What if you use min on axis=1?
df['min_val'] = df.min(axis=1)
     a    b    c  min_val
0  NaN  NaN  1.0      1.0
1  NaN  2.0  NaN      2.0
2  3.0  3.0  3.0      3.0
(Note that min coincides with the first non-NaN value only when, as in this example, each row's first non-NaN value is also its smallest.)
And to get the respective columns:
df['min_val_col'] = df.idxmin(axis=1)
     a    b    c min_val_col
0  NaN  NaN  1.0           c
1  NaN  2.0  NaN           b
2  3.0  3.0  3.0           a

Python explode multiple columns where some rows are NaN

I am trying to apply the Python explode function to unpack a few columns that are | delimited. Within a row every column has the same | delimited length (e.g. A has the same number of |s as B), but rows can differ from one another (e.g. row 1 has length 3 and row 2 has length 2).
Some rows have an NaN here and there (e.g. in A and C), which causes the following error: "columns must have matching element counts".
Current data:
A          B                C
1 | 2 | 3  app | ban | cor  NaN
4 | 5      dep | exp        NaN
NaN        for | gep        NaN
Expected output:
A    B    C
1    app  NaN
2    ban  NaN
3    cor  NaN
4    dep  NaN
5    exp  NaN
NaN  for  NaN
NaN  gep  NaN
cols = ['A', 'B', 'C']
for col in cols:
    df_test[col] = df_test[col].str.split('|')
    # tried replacing the NaNs with an empty list, but same error
    df_test[col] = df_test[col].fillna({i: [] for i in df_test.index})
df_long = df_test.explode(cols)
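The empty-list fill still leaves mismatched lengths (an empty list against, say, a 2-element list). One sketch that satisfies explode's matching-element-counts requirement is to pad every NaN cell with a list of NaNs as long as the row's longest list (assumes pandas >= 1.3 for multi-column explode):
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'A': ['1|2|3', '4|5', np.nan],
                        'B': ['app|ban|cor', 'dep|exp', 'for|gep'],
                        'C': [np.nan, np.nan, np.nan]})

cols = ['A', 'B', 'C']
split = df_test[cols].apply(lambda s: s.str.split('|'))

# each row's target length is the length of its longest list
lengths = split.apply(
    lambda row: max((len(v) for v in row if isinstance(v, list)), default=1),
    axis=1)

# replace each NaN cell with [NaN, NaN, ...] of the row's target length
for col in cols:
    split[col] = [v if isinstance(v, list) else [np.nan] * n
                  for v, n in zip(split[col], lengths)]

df_long = split.explode(cols).reset_index(drop=True)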

Merging columns within a dataframe with pandas

I'm trying to merge two different columns within a data frame.
So if you have columns A and B: I want A to remain the default value unless it is empty; if it is empty, use the value from B.
pd.merge looks like it only works when merging data frames, not columns within an existing single data frame.
| A   | B   |
|-----|-----|
| 2   | 4   |
| NaN | 3   |
| 5   | NaN |
| NaN | 6   |
| 7   | 8   |
Desired Result:
| A |
|---|
| 2 |
| 3 |
| 5 |
| 6 |
| 7 |
Credit to Scott Boston for the comment on the OP:
import pandas as pd
df = pd.DataFrame(
    {
        'A': [2, None, 5, None, 7],
        'B': [4, 3, None, 6, 8]
    }
)
df.head()
"""
A B
0 2.0 4.0
1 NaN 3.0
2 5.0 NaN
3 NaN 6.0
4 7.0 8.0
"""
df['A'] = df['A'].fillna(df['B'])
df.head()
"""
A B
0 2.0 4.0
1 3.0 3.0
2 5.0 NaN
3 6.0 6.0
4 7.0 8.0
"""

Pandas combining sparse columns in dataframe

I am using Python and pandas for data analysis. I have sparsely distributed data in different columns, like the following:
| id | col1a | col1b | col2a | col2b | col3a | col3b |
|----|-------|-------|-------|-------|-------|-------|
| 1 | 11 | 12 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 21 | 86 | NaN | NaN |
| 3 | 22 | 87 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 545 | 32 |
I want to combine this sparsely distributed data into tightly packed columns, like the following:
| id | group | cola | colb |
|----|-------|-------|-------|
| 1 | g1 | 11 | 12 |
| 2 | g2 | 21 | 86 |
| 3 | g1 | 22 | 87 |
| 4 | g3 | 545 | 32 |
What I have tried is the following, but I have not been able to get it right:
df['cola']=np.nan
df['colb']=np.nan
df['cola'].fillna(df.col1a,inplace=True)
df['colb'].fillna(df.col1b,inplace=True)
df['cola'].fillna(df.col2a,inplace=True)
df['colb'].fillna(df.col2b,inplace=True)
df['cola'].fillna(df.col3a,inplace=True)
df['colb'].fillna(df.col3b,inplace=True)
But I think there must be a more concise and efficient way of doing this. How can I do this in a better way?
You can use df.stack(), assuming 'id' is your index (otherwise set 'id' as the index first), then pd.pivot_table:
df = df.stack().reset_index(name='val', level=1)
# 'col1a' -> group 'g1'
df['group'] = 'g' + df['level_1'].str.extract(r'col(\d+)', expand=False)
# 'col1a' -> 'cola'
df['level_1'] = df['level_1'].str.replace(r'\d+', '', regex=True)
df.pivot_table(index=['id', 'group'], columns='level_1', values='val')
level_1    cola  colb
id group
1  g1      11.0  12.0
2  g2      21.0  86.0
3  g1      22.0  87.0
4  g3     545.0  32.0
Another alternative with pd.wide_to_long
m = pd.wide_to_long(df, ['col'], 'id', 'j', suffix=r'\d+\w+').reset_index()
# split the suffix '1a' into its digit (column 0) and its letter (column 1),
# use the digit for the group and pivot the letter back out as cola/colb
(m.join(pd.DataFrame(m.pop('j').agg(list).tolist()))
  .assign(group=lambda x: x[0].radd('g'))
  .set_index(['id', 'group', 1])['col'].unstack().dropna()
  .rename_axis(None, axis=1).add_prefix('col').reset_index())
   id group  cola  colb
0   1    g1    11    12
1   2    g2    21    86
2   3    g1    22    87
3   4    g3   545    32
Use:
import re

def fx(s):
    s = s.dropna()
    group = 'g' + re.search(r'\d+', s.index[0])[0]
    return pd.Series([group] + s.tolist(), index=['group', 'cola', 'colb'])

df1 = df.set_index('id').agg(fx, axis=1).reset_index()
# print(df1)
   id group   cola  colb
0   1    g1   11.0  12.0
1   2    g2   21.0  86.0
2   3    g1   22.0  87.0
3   4    g3  545.0  32.0
This would be a way of doing it:
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'col1a': [11, np.nan, 22, np.nan],
                   'col1b': [12, np.nan, 87, np.nan],
                   'col2a': [np.nan, 21, np.nan, np.nan],
                   'col2b': [np.nan, 86, np.nan, np.nan],
                   'col3a': [np.nan, np.nan, np.nan, 545],
                   'col3b': [np.nan, np.nan, np.nan, 32]})
df_new = df.copy(deep=False)
df_new['group'] = 'g'+df_new['id'].astype(str)
df_new['cola'] = df_new[[x for x in df_new.columns if x.endswith('a')]].sum(axis=1)
df_new['colb'] = df_new[[x for x in df_new.columns if x.endswith('b')]].sum(axis=1)
df_new = df_new[['id','group','cola','colb']]
print(df_new)
Output:
   id group   cola  colb
0   1    g1   11.0  12.0
1   2    g2   21.0  86.0
2   3    g3   22.0  87.0
3   4    g4  545.0  32.0
So if you have more suffixes (colc, cold, cole, colf, etc...) you can create a loop and then use:
suffixes = ['a', 'b', 'c', 'd', 'e', 'f']
cols = ['id', 'group'] + ['col' + x for x in suffixes]
for i in suffixes:
    df_new['col' + i] = df_new[[x for x in df_new.columns if x.endswith(i)]].sum(axis=1)
df_new = df_new[cols]
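One caveat on the output above: the group column is derived from id, so it reads g1 through g4, whereas in the desired output the group comes from the column number (g1, g2, g1, g3). A sketch deriving it from the first non-null value column instead:
# take the digit from the first non-null col* column in each row
value_cols = [c for c in df.columns if c.startswith('col')]
df_new['group'] = 'g' + (df[value_cols].notna().idxmax(axis=1)
                                       .str.extract(r'col(\d+)', expand=False))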
Thanks to @CeliusStingher for providing the code for the dataframe.
One suggestion is to set id as the index and rebuild the columns as a MultiIndex, with the group numbers extracted from the column names; then stack to get the final result:
# set id as index
df = df.set_index("id")
# pull the number out of each column name, so that 'col1a' becomes
# ('cola', 'g1'), 'col1b' becomes ('colb', 'g1'), and so on,
# then reassign the result as a MultiIndex
df.columns = pd.MultiIndex.from_tuples(
    [("".join((first, last)), f"g{second}")
     for first, second, last in df.columns.str.split(r"(\d)")],
    names=[None, "group"])
# stack the data to get the result
df.stack()
          cola  colb
id group
1  g1     11.0  12.0
2  g2     21.0  86.0
3  g1     22.0  87.0
4  g3    545.0  32.0
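A version note: this relies on the classic stack behavior of dropping all-NaN rows. With the new stack implementation (opt-in from pandas 2.1), the empty (id, group) combinations are kept, so drop them explicitly; a sketch:
df.stack(future_stack=True).dropna(how='all')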

Is there a way to apply a function on a MultiIndex column?

I have a dataframe that looks like this:
   id sex  isActive  score
0   1   M         1     10
1   2   F         0     20
2   2   F         1     30
3   2   M         0     40
4   3   M         1     50
I want to pivot the dataframe on the index id and columns sex and isActive (the values should come from score). Each score should then be expressed as a fraction of that id's total score within its sex group.
In the end, my dataframe should look like this:
sex         F         M
isActive    0    1    0    1
id
1         NaN  NaN  NaN  1.0
2         0.4  0.6  1.0  NaN
3         NaN  NaN  NaN  1.0
I tried pivoting first:
p = df.pivot_table(index='id', columns=['sex', 'isActive'], values='score')
print(p)
sex          F           M
isActive     0     1     0     1
id
1          NaN   NaN   NaN  10.0
2         20.0  30.0  40.0   NaN
3          NaN   NaN   NaN  50.0
Then, I summed up the scores for each group:
row_sum = p.sum(axis=1, level=[0])
print(row_sum)
sex      F     M
id
1      0.0  10.0
2     50.0  40.0
3      0.0  50.0
This is where I'm getting stuck. I'm trying to use DataFrame.apply to divide each column of p by the matching column of row_sum. However, attempts of this form keep raising errors:
p.apply(lambda col: col/row_sum)
I may be overthinking this problem. Is there some better approach out there?
I think a simple division of p by row_sum would work:
print (p/row_sum)
sex         F         M
isActive    0    1    0    1
id
1         NaN  NaN  NaN  1.0
2         0.4  0.6  1.0  NaN
3         NaN  NaN  NaN  1.0
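One version note: DataFrame.sum(axis=1, level=[0]) was removed in pandas 2.0. A sketch of the equivalent row_sum step on current pandas, grouping the transposed frame by its first column level:
# pandas >= 2.0: group the columns by their first level ('sex') via a transpose
row_sum = p.T.groupby(level=0).sum().T
print(p / row_sum)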
