Given data as:
| | a | b | c |
|---:|----:|----:|----:|
| 0 | nan | nan | 1 |
| 1 | nan | 2 | nan |
| 2 | 3 | 3 | 3 |
I would like to create a column d containing [1, 2, 3], i.e. the first non-NaN value in each row.
There can be an arbitrary number of columns (though it will be fewer than 30).
Using
df.isna().apply(lambda x: x.idxmin(), axis=1)
will give me:
0 c
1 b
2 a
dtype: object
Which seems useful, but I'm drawing a blank on how to access the columns with this, or whether there's a more suitable approach.
Repro:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'))
Try this:
df.fillna(method='bfill', axis=1).iloc[:, 0]
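Note that the method argument of fillna is deprecated in newer pandas (2.1+); the equivalent there is:
df.bfill(axis=1).iloc[:, 0]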
What if you use min on axis=1?
df['min_val'] = df.min(axis=1)
a b c min_val
0 NaN NaN 1.0 1.0
1 NaN 2.0 NaN 2.0
2 3.0 3.0 3.0 3.0
And to get the respective columns:
df['min_val_col'] = df.idxmin(axis=1)
a b c min_val_col
0 NaN NaN 1.0 c
1 NaN 2.0 NaN b
2 3.0 3.0 3.0 a
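If you need the values behind those idxmin labels rather than the labels themselves (which is what the question asks for), one option is a plain row-wise lookup; a minimal sketch, assuming a unique index and that df holds only the a/b/c data columns:
first_col = df.isna().idxmin(axis=1)  # first non-NaN column per row, as in the question
df['d'] = [df.at[i, c] for i, c in first_col.items()]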
I am trying to use pandas' explode to unpack a few columns that are |-delimited. Within a row the columns all have the same number of |-separated elements (e.g. A has as many as B), but rows can differ in length from one another (e.g. row 1 has length 3 and row 2 has length 2).
Some rows have a NaN here and there (e.g. in A and C), which causes the following error: "columns must have matching element counts"
Current data:
A      B            C
1|2|3  app|ban|cor  NaN
4|5    dep|exp      NaN
NaN    for|gep      NaN
Expected output:
A    B    C
1    app  NaN
2    ban  NaN
3    cor  NaN
4    dep  NaN
5    exp  NaN
NaN  for  NaN
NaN  gep  NaN
cols = ['A', 'B', 'C']
for col in cols:
    df_test[col] = df_test[col].str.split('|')
    df_test[col] = df_test[col].fillna({i: [] for i in df_test.index})  # tried replacing the NaN with an empty list, but same error
df_long = df_test.explode(cols)
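For what it's worth, the error comes from the NaN cells staying scalar (or becoming empty lists) while the split cells are lists of length 2 or 3: a multi-column explode needs every listed column in a row to hold the same number of elements. One way around it is to pad each missing cell with a list of NaNs of the row's length before exploding; a sketch building on the df_test above, assuming pandas >= 1.3 for multi-column explode:
import numpy as np
cols = ['A', 'B', 'C']
# split on '|'; astype('string') keeps the all-NaN column usable with .str
split = df_test[cols].apply(lambda s: s.astype('string').str.split('|'))
# per-row element count, taken from whichever cells actually split into lists
n = split.apply(lambda row: max((len(v) for v in row if isinstance(v, list)), default=1), axis=1)
# pad the missing cells with lists of NaN so all columns match in length
for col in cols:
    split[col] = pd.Series([v if isinstance(v, list) else [np.nan] * k
                            for v, k in zip(split[col], n)], index=split.index)
df_long = split.explode(cols).reset_index(drop=True)  # note A's values stay strings after the split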
I'm trying to merge two different columns within a data frame.
Say you have columns A and B; you want A to remain the default value unless it is empty, in which case you want to use the value from B.
pd.merge looks like it only works for merging data frames, not columns within a single existing data frame.
| A | B |
|----:|----:|
| 2 | 4 |
| NaN | 3 |
| 5 | NaN |
| NaN | 6 |
| 7 | 8 |
Desired Result:
| A |
|----:|
| 2 |
| 3 |
| 5 |
| 6 |
| 7 |
Credit to Scott Boston for the comment on the OP:
import pandas as pd
df = pd.DataFrame(
    {
        'A': [2, None, 5, None, 7],
        'B': [4, 3, None, 6, 8]
    }
)
df.head()
"""
A B
0 2.0 4.0
1 NaN 3.0
2 5.0 NaN
3 NaN 6.0
4 7.0 8.0
"""
df['A'] = df['A'].fillna(df['B'])
df.head()
"""
A B
0 2.0 4.0
1 3.0 3.0
2 5.0 NaN
3 6.0 6.0
4 7.0 8.0
"""
I am using Python and pandas for data analysis. I have sparsely distributed data in different columns, like the following:
| id | col1a | col1b | col2a | col2b | col3a | col3b |
|----|-------|-------|-------|-------|-------|-------|
| 1 | 11 | 12 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 21 | 86 | NaN | NaN |
| 3 | 22 | 87 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 545 | 32 |
I want to combine this sparsely distributed data into tightly packed columns, like the following:
| id | group | cola | colb |
|----|-------|-------|-------|
| 1 | g1 | 11 | 12 |
| 2 | g2 | 21 | 86 |
| 3 | g1 | 22 | 87 |
| 4 | g3 | 545 | 32 |
What I have tried is the following, but I have not been able to do it properly:
df['cola']=np.nan
df['colb']=np.nan
df['cola'].fillna(df.col1a,inplace=True)
df['colb'].fillna(df.col1b,inplace=True)
df['cola'].fillna(df.col2a,inplace=True)
df['colb'].fillna(df.col2b,inplace=True)
df['cola'].fillna(df.col3a,inplace=True)
df['colb'].fillna(df.col3b,inplace=True)
But I think there must be a more concise and efficient way of doing this. How can I do this in a better way?
You can use df.stack(), assuming 'id' is your index (else set 'id' as the index first), then pd.pivot_table:
df = df.stack().reset_index(name='val', level=1)
df['group'] = 'g' + df['level_1'].str.extract(r'col(\d+)', expand=False)
df['level_1'] = df['level_1'].str.replace(r'\d+', '', regex=True)
df.pivot_table(index=['id', 'group'], columns='level_1', values='val')
level_1 cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
Another alternative with pd.wide_to_long:
m = pd.wide_to_long(df, ['col'], 'id', 'j', suffix=r'\d+\w+').reset_index()
(m.join(pd.DataFrame(m.pop('j').agg(list).tolist()))
   .assign(group=lambda x: x[0].radd('g'))
   .set_index(['id', 'group', 1])['col'].unstack().dropna()
   .rename_axis(None, axis=1).add_prefix('col').reset_index())
id group cola colb
0 1 g1 11 12
1 2 g2 21 86
2 3 g1 22 87
3 4 g3 545 32
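One caveat: splitting the suffix into characters with agg(list) assumes single-digit group numbers. If groups can reach col10a and beyond, a regex extract is safer; a rough sketch along the same lines:
m = pd.wide_to_long(df, ['col'], 'id', 'j', suffix=r'\d+\w+').reset_index()
parts = m['j'].str.extract(r'(\d+)(\w+)')  # e.g. '12a' -> ('12', 'a')
out = (m.assign(group='g' + parts[0], letter=parts[1])
         .pivot_table(index=['id', 'group'], columns='letter', values='col')
         .dropna()
         .rename_axis(None, axis=1).add_prefix('col').reset_index())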
Use:
import re
def fx(s):
    s = s.dropna()
    group = 'g' + re.search(r'\d+', s.index[0])[0]
    return pd.Series([group] + s.tolist(), index=['group', 'cola', 'colb'])

df1 = df.set_index('id').agg(fx, axis=1).reset_index()
# print(df1)
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g1 22.0 87.0
3 4 g3 545.0 32.0
This would be a way of doing it:
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'col1a': [11, np.nan, 22, np.nan],
                   'col1b': [12, np.nan, 87, np.nan],
                   'col2a': [np.nan, 21, np.nan, np.nan],
                   'col2b': [np.nan, 86, np.nan, np.nan],
                   'col3a': [np.nan, np.nan, np.nan, 545],
                   'col3b': [np.nan, np.nan, np.nan, 32]})
df_new = df.copy(deep=False)
df_new['group'] = 'g'+df_new['id'].astype(str)
df_new['cola'] = df_new[[x for x in df_new.columns if x.endswith('a')]].sum(axis=1)
df_new['colb'] = df_new[[x for x in df_new.columns if x.endswith('b')]].sum(axis=1)
df_new = df_new[['id','group','cola','colb']]
print(df_new)
Output:
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g3 22.0 87.0
3 4 g4 545.0 32.0
So if you have more suffixes (colc, cold, cole, colf, etc.) you can build the columns in a loop:
suffixes = ['a', 'b', 'c', 'd', 'e', 'f']
cols = ['id', 'group'] + ['col' + x for x in suffixes]
for i in suffixes:
    df_new['col' + i] = df_new[[x for x in df_new.columns if x.endswith(i)]].sum(axis=1)
df_new = df_new[cols]
Thanks to @CeliusStingher for providing the code for the dataframe.
One suggestion is to set id as the index and rearrange the columns, with the group numbers extracted from the column names. Create a MultiIndex from these, reassign it to the columns, and stack to get the final result:
#set id as index
df = df.set_index("id")
#pull out the numbers from each column
#so that you have (cola,1), (colb,1) ...
#add g to the numbers ... (cola, g1),(colb,g1), ...
#create a MultiIndex
#and reassign to the columns
df.columns = pd.MultiIndex.from_tuples(
    [("".join((first, last)), f"g{second}")
     for first, second, last in df.columns.str.split(r"(\d)")],
    names=[None, "group"])
#stack the data
#to get your result
df.stack()
cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
I have a dataframe that looks like this:
id sex isActive score
0 1 M 1 10
1 2 F 0 20
2 2 F 1 30
3 2 M 0 40
4 3 M 1 50
I want to pivot the dataframe on the index id and columns sex and isActive (the value should be score), with each score expressed as a fraction of the id's total score within its sex group.
In the end, my dataframe should look like this:
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 1.0
2 0.4 0.6 1.0 NaN
3 NaN NaN NaN 1.0
I tried pivoting first:
p = df.pivot_table(index='id', columns=['sex', 'isActive'], values='score')
print(p)
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 10.0
2 20.0 30.0 40.0 NaN
3 NaN NaN NaN 50.0
Then, I summed up the scores for each group:
row_sum = p.sum(axis=1, level=[0])
print(row_sum)
sex F M
id
1 0.0 10.0
2 50.0 40.0
3 0.0 50.0
This is where I'm getting stuck. I'm trying to use DataFrame.apply to divide each column by the matching group total in the second dataframe, but I keep getting errors with attempts of this form:
p.apply(lambda col: col/row_sum)
I may be overthinking this problem. Is there some better approach out there?
I think a simple division of p by row_sum would work, since pandas aligns row_sum's columns (named sex) with the matching level of p's MultiIndex columns and broadcasts across isActive:
print (p/row_sum)
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 1.0
2 0.4 0.6 1.0 NaN
3 NaN NaN NaN 1.0
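A note for newer pandas: DataFrame.sum no longer accepts level= (deprecated in 1.3 and removed in 2.0), so the group totals can instead be computed by transposing and grouping on the column level:
row_sum = p.T.groupby(level='sex').sum().T
print(p / row_sum)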