How to split dataframe cells using a delimiter into different dataframes, with conditions - Python

There are other questions on the same topic and they helped but I have an extra twist.
I have a dataframe with multiple values in each (but not all) cells.
df = pd.DataFrame({'a':["10-30-410","20-40-500","25-50"], 'b':["5-8-9","4", "99"]})
index          a      b
0      10-30-410  5-8-9
1      20-40-500      4
2          25-50     99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
index   a   b
0      10   5
1      20   4
2      25  99
And df2 would be:
index   a  b
0      30  8
1      40
2      50
And likewise for df3:
index    a  b
0      410  9
1      500
2
I got df1 with this
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank.
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
index   a   b
0      30   8
1      40   4   <- should be blank
2      50  99   <- should be blank
Is this the right approach? I'm pretty good on regex but fuzzy with groups. Thank you.

Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
    (df['a'].str.split('-', expand=True),
     df['b'].str.split('-', expand=True)),
    keys=('a', 'b'),
    axis=1
).stack(dropna=False).droplevel(0)
new_df:
      a     b
0    10     5
1    30     8
2   410     9
0    20     4
1    40  None
2   500  None
0    25    99
1    50  None
2  None  None
An expandable option for n columns:
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[    a   b
 0  10   5
 1  20   4
 2  25  99,
      a     b
 0   30     8
 1   40  None
 2   50  None,
       a     b
 0   410     9
 1   500  None
 2  None  None]
Complete Working Example:
import pandas as pd

df = pd.DataFrame({
    'a': ["10-30-410", "20-40-500", "25-50"],
    'b': ["5-8-9", "4", "99"]
})

cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)

dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
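The question asks for NA or blank in the later frames; since str.split leaves None for the missing pieces, a fillna pass per frame turns those into blanks (a small sketch on top of the code above):
# Replace the None placeholders with empty strings, as the question requested
dfs = [g.reset_index(drop=True).fillna('') for _, g in new_df.groupby(level=0)]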

You can also try with filter:
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
NOTE: To strip the digit from the column names, use: k.filter(like='0').rename(columns=lambda x: x.split('-')[0])
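To answer the regex part of the question directly: the df2 attempt leaves 4 and 99 untouched because replace only substitutes where the pattern matches, and unmatched cells pass through unchanged. A sketch using str.extract with optional groups instead, which yields NaN for absent parts (assuming every cell is digits separated by dashes):
import pandas as pd

df = pd.DataFrame({'a': ["10-30-410", "20-40-500", "25-50"],
                   'b': ["5-8-9", "4", "99"]})

# The 2nd and 3rd parts are wrapped in optional non-capturing groups
pattern = r'^(\d+)(?:-(\d+))?(?:-(\d+))?$'
parts = {c: df[c].str.extract(pattern) for c in df.columns}

# Capture group i of every column becomes frame i; fillna('') gives the blanks
df1, df2, df3 = (pd.DataFrame({c: parts[c][i] for c in df.columns}).fillna('')
                 for i in range(3))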

Related

Dataframe Transpose after groupby

This code picks the top 3 values for each id from df and puts them in df2. I want to transpose the df2 data into a dataframe with an id column and 3 columns for the 3 top values, as shown below. I tried df2_transposed = df2.T but it doesn't work. Can you please help?
import pandas as pd
df = pd.DataFrame([[1,1], [1,6], [1,39],[1,30],[1,40],[1,140], [2,2], [2,1], [2,20], [2,15], [2,99], [2,9]], columns=['id', 'value'])
print(df)
df2 = df.groupby('id')['value'].nlargest(3)
df:
    id  value
0    1      1
1    1      6
2    1     39
3    1     30
4    1     40
5    1    140
6    2      2
7    2      1
8    2     20
9    2     15
10   2     99
11   2      9
What I want:
   id  top1  top2  top3
0   1   140    40    39
1   2    99    20    15
I have a solution for you; it is tested. Please check it out:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 1], [1, 6], [1, 39], [1, 30], [1, 40], [1, 140],
                   [2, 2], [2, 1], [2, 20], [2, 15], [2, 99], [2, 9]],
                  columns=['id', 'value'])

# Filter the largest 3 values per id
x = (df.groupby('id')['value']
       .apply(lambda x: x.nlargest(3))
       .reset_index(level=1, drop=True)
       .to_frame('value'))

# Transpose using unstack
x = x.set_index(np.arange(len(x)) % 3, append=True)['value'].unstack().add_prefix('top')
x = x.reset_index()
x
Hope this will help you :)
Single expression:
df2 = (df.groupby('id')
         .apply(lambda x: x['value'].nlargest(3).reset_index(drop=True).T)
         .set_axis(['top1', 'top2', 'top3'], axis=1)
         .reset_index())
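A sketch of an apply-free alternative under the same assumptions (each id contributes up to 3 values): sort descending, keep the head of each group, number the rows with cumcount, then pivot:
top = df.sort_values('value', ascending=False).groupby('id').head(3)
top = top.assign(rank=top.groupby('id').cumcount() + 1)
out = (top.pivot(index='id', columns='rank', values='value')
          .add_prefix('top')            # columns 1, 2, 3 -> top1, top2, top3
          .rename_axis(columns=None)
          .reset_index())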

Remove any 0 value from a row, order values descending per row, and for each non-0 value in the row return the index, column name, and score to a new df

I'm looking for a more efficient way of doing the below (perhaps using boolean masks and vectorization).
I'm new to this forum so apologies if my first question is not quite what was expected.
#order each row by values descending
#remove any 0 value column from row
#for each non 0 value return the index, column name, and score to a new dataframe
import pandas as pd
import numpy as np

test_data = {'a': [1, 0, 8, 5],
             'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0],
             'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data, columns=['a', 'b', 'c', 'd'])

column_names = ['index_row', 'header', 'score']
# create empty df with final output columns
df_result = pd.DataFrame(columns=column_names)

row_index = list(df.index.values)
for row in row_index:
    working_row = row
    # change all 0 values to null and drop any extraneous columns
    # (np.nan replaces the removed pd.np.nan alias)
    subset_cols = (df.loc[[working_row], :]
                     .replace(0, np.nan)
                     .dropna(axis=1, how='any')
                     .columns.to_list())
    # order by score
    sub_df = df.loc[[working_row], subset_cols].sort_values(by=row, axis=1, ascending=False)
    s_cols = sub_df.columns.to_list()
    scores = sub_df.values.tolist()[0]
    index_row = []
    header = []
    score = []
    for count, value in enumerate(scores):
        header.append(s_cols[count])
        score.append(value)
        index_row.append(row)
    data = {'index_row': index_row,
            'header': header,
            'score': score}
    result_frame = pd.DataFrame(data, columns=column_names)
    df_result = pd.concat([df_result, result_frame], ignore_index=True)

df_result
You could do it directly with melt and some additional processing:
df_result = (df.reset_index()
               .rename(columns={'index': 'index_row'})
               .melt(id_vars='index_row', var_name='header', value_name='score')
               .query("score != 0")
               .sort_values(['index_row', 'score'], ascending=[True, False])
               .reset_index(drop=True))
It gives the expected result:
index_row header score
0 0 b 36
1 0 d 7
2 0 c 2
3 0 a 1
4 1 c 8
5 1 d 8
6 1 b 2
7 2 c 100
8 2 d 9
9 2 a 8
10 3 d 50
11 3 b 6
12 3 a 5
df_result = pd.DataFrame(columns=['index_row', 'header', 'score'])
for index in df.index:
    temp_df = df.loc[index].reset_index().reset_index()
    temp_df.columns = ['index_row', 'header', 'score']
    temp_df['index_row'] = index
    temp_df.sort_values(by=['score'], ascending=False, inplace=True)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df_result = pd.concat([df_result, temp_df[temp_df.score != 0]], ignore_index=True)
test_data = {'a': [1, 0, 8, 5],
             'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0],
             'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data, columns=['a', 'b', 'c', 'd'])
df = df.reset_index()

results = pd.melt(df, id_vars='index', var_name='header', value_name='score')
mask = results['score'] != 0   # 'mask' avoids shadowing the built-in filter
print(results[mask].sort_values(by=['index', 'score'], ascending=[True, False]))
output:
index header score
4 0 b 36
12 0 d 7
8 0 c 2
0 0 a 1
9 1 c 8
13 1 d 8
5 1 b 2
10 2 c 100
14 2 d 9
2 2 a 8
15 3 d 50
7 3 b 6
3 3 a 5
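One more vectorized sketch, starting from the original df (before the reset_index above) and assuming 0 is the only value to drop: replace zeros with NaN and let stack remove them, giving the (row, column, value) triples in one pass (numpy imported as np):
s = df.replace(0, np.nan).stack()   # stack drops NaN by default
df_result = (s.rename_axis(['index_row', 'header'])
              .rename('score')
              .reset_index()
              .sort_values(['index_row', 'score'], ascending=[True, False])
              .reset_index(drop=True))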

Insert a blank row between each grouping in a dataframe BUT only display the first header

The following code, courtesy of @jezrael, displays a blank row AND a header for each grouping:
data = {
    'MARKET_SECTOR_DES': ['A', 'A', 'B', 'B', 'B', 'B'],
    'count': [10, 9, 20, 19, 18, 17]
}
df = pd.DataFrame(data)
print(df)
print("")

# retrieve column headers
df2 = pd.DataFrame([[''] * len(df.columns), df.columns], columns=df.columns)

# for each grouping, append the blank row and the header row
# (pd.concat replaces the removed DataFrame.append)
df1 = (df.groupby('MARKET_SECTOR_DES', group_keys=False)
         .apply(lambda d: pd.concat([d, df2]))
         .iloc[:-2]
         .reset_index(drop=True))
print(df1)
Output:
MARKET_SECTOR_DES count
0 A 10
1 A 9
2
3 MARKET_SECTOR_DES count
4 B 20
5 B 19
6 B 18
7 B 17
Desired output:
MARKET_SECTOR_DES count
0 A 10
1 A 9
4 B 20
5 B 19
6 B 18
7 B 17
So only the single header at the top.
Change your df2 to a single blank row:
df2 = pd.DataFrame([[''] * len(df.columns)], columns=df.columns)
Since each group now appends only one extra row, trim one trailing row instead of two, i.e. use .iloc[:-1] in place of .iloc[:-2].
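Put together, a sketch of the adjusted snippet (with pd.concat standing in for the removed DataFrame.append):
df2 = pd.DataFrame([[''] * len(df.columns)], columns=df.columns)
df1 = (df.groupby('MARKET_SECTOR_DES', group_keys=False)
         .apply(lambda d: pd.concat([d, df2]))   # one blank row per group
         .iloc[:-1]                              # drop the trailing blank row
         .reset_index(drop=True))
print(df1)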

Pandas multiply two data frames to get product

I have two data frames with different variable names
df1 = pd.DataFrame({'A':[2,2,3],'B':[5,5,6]})
>>> df1
A B
0 2 5
1 2 5
2 3 6
df2 = pd.DataFrame({'C':[3,3,3],'D':[5,5,6]})
>>> df2
C D
0 3 5
1 3 5
2 3 6
I want to create a third data frame where the n-th column is the product of the n-th columns in the first two data frames. In the above example, df3 would have two columns X and Y, where df3.X = df1.A * df2.C and df3.Y = df1.B * df2.D.
df3 = pd.DataFrame({'X':[6,6,9],'Y':[25,25,36]})
>>> df3
X Y
0 6 25
1 6 25
2 9 36
Is there a simple pandas function that allows me to do this?
You can use mul to multiply df1 by the values of df2:
df3 = df1.mul(df2.values)
df3.columns = ['X','Y']
>>> df3
X Y
0 6 25
1 6 25
2 9 36
You can also use numpy:
df3 = np.multiply(df1, df2)
Note: most numpy operations will accept a Pandas Series or DataFrame.
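A caveat worth noting: recent pandas aligns on labels in arithmetic, including numpy ufuncs, so multiplying two frames whose column names differ can produce all-NaN columns. A sketch that sidesteps alignment by dropping to the underlying arrays:
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3], 'B': [5, 5, 6]})
df2 = pd.DataFrame({'C': [3, 3, 3], 'D': [5, 5, 6]})

# Multiply positionally, ignoring column labels entirely
df3 = pd.DataFrame(df1.to_numpy() * df2.to_numpy(), columns=['X', 'Y'])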

Pandas DataFrame merge summing column

I'm trying to merge two DataFrames while summing a column's values.
>>> print(df1)
id name weight
0 1 A 0
1 2 B 10
2 3 C 10
>>> print(df2)
id name weight
0 2 B 15
1 3 C 10
I need to sum the weight values during the merge for rows that match on the common columns.
merge = pd.merge(df1, df2, how='inner')
So the output will be something like following.
id name weight
1 2 B 25
2 3 C 20
This solution also works if you want to sum more than one column. Assume the data frames
>>> df1
id name weight height
0 1 A 0 5
1 2 B 10 10
2 3 C 10 15
>>> df2
id name weight height
0 2 B 25 20
1 3 C 20 30
You can concatenate them and group by index columns.
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
id name weight height
0 1 A 0 5
1 2 B 35 30
2 3 C 30 45
In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id name
2 B 25
3 C 20
dtype: int64
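This comes back as a Series; to get a frame shaped like the desired output, name the summed column while resetting the index (a small sketch):
pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1).reset_index(name='weight')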
If you set the common columns as the index, you can just sum the two dataframes, much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
weight
id name
1 A NaN
2 B 25
3 C 20
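Starting from the original (un-indexed) frames, if you would rather keep rows present in only one frame (so id 1, name A gets weight 0 instead of NaN), add with fill_value; a sketch:
out = (df1.set_index(['id', 'name'])
          .add(df2.set_index(['id', 'name']), fill_value=0)
          .reset_index())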
