Dataframe Transpose after groupby - python

This code picks the top 3 values for each id from df and puts them in df2. I want to reshape df2 into a dataframe with an id column and three columns holding the top 3 values, as shown below. I tried df2_transposed = df2.T, but it doesn't work. Can you please help?
import pandas as pd
df = pd.DataFrame([[1,1], [1,6], [1,39],[1,30],[1,40],[1,140], [2,2], [2,1], [2,20], [2,15], [2,99], [2,9]], columns=['id', 'value'])
print(df)
df2 = df.groupby('id')['value'].nlargest(3)
df:
    id  value
0    1      1
1    1      6
2    1     39
3    1     30
4    1     40
5    1    140
6    2      2
7    2      1
8    2     20
9    2     15
10   2     99
11   2      9
What I want:
   id  top1  top2  top3
0   1   140    40    39
1   2    99    20    15

I have a solution for you; it is tested. Please check it out:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,1], [1,6], [1,39],[1,30],[1,40],[1,140], [2,2], [2,1], [2,20], [2,15], [2,99], [2,9]], columns=['id', 'value'])
# Keep the 3 largest values per id
x = (df.groupby('id')['value']
       .apply(lambda x: x.nlargest(3))
       .reset_index(level=1, drop=True)
       .to_frame('value'))
# Number the rows 0..2 within each id, then unstack that counter into columns
# (the resulting columns are named top0, top1, top2)
x = x.set_index(np.arange(len(x)) % 3, append=True)['value'].unstack().add_prefix('top')
x = x.reset_index()
x
Hope this will help you :)

Single expression:
df2 = df.groupby('id').apply(
    lambda x: x['value'].nlargest(3).reset_index(drop=True).T) \
    .set_axis(['top1', 'top2', 'top3'], axis=1).reset_index()
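Another way to get the same table is to rank the values inside each id with cumcount and pivot that rank into columns. This is only a sketch; the `rank` column name is introduced here for illustration and is not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1], [1, 6], [1, 39], [1, 30], [1, 40], [1, 140],
     [2, 2], [2, 1], [2, 20], [2, 15], [2, 99], [2, 9]],
    columns=['id', 'value'])

# Sort so the largest values come first within each id, then keep 3 rows per id
top = (df.sort_values(['id', 'value'], ascending=[True, False])
         .groupby('id')
         .head(3))

# 'rank' is a helper column (1, 2, 3 within each id) used only for the pivot
top = top.assign(rank=top.groupby('id').cumcount() + 1)

# Spread the rank into columns and name them top1..top3
result = (top.pivot(index='id', columns='rank', values='value')
             .add_prefix('top')
             .reset_index())
result.columns.name = None
print(result)
```

If an id has fewer than 3 rows, the missing positions come out as NaN instead of raising an error.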

Related

How to split dataframe cells using a delimiter into different dataframes, with conditions

There are other questions on the same topic and they helped but I have an extra twist.
I have a dataframe with multiple values in each (but not all) cells.
df = pd.DataFrame({'a':["10-30-410","20-40-500","25-50"], 'b':["5-8-9","4", "99"]})
index  a          b
0      10-30-410  5-8-9
1      20-40-500  4
2      25-50      99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
index  a   b
0      10  5
1      20  4
2      25  99
And df2 would be:
index  a   b
0      30  8
1      40
2      50
And likewise for df3:
index  a    b
0      410  9
1      500
2
I got df1 with this
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank.
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
index  a   b
0      30  8
1      40  4   (should be blank)
2      50  99  (should be blank)
Is this the right approach? I'm pretty good on regex but fuzzy with groups. Thank you.
Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
    (df['a'].str.split('-', expand=True),
     df['b'].str.split('-', expand=True)),
    keys=('a', 'b'),
    axis=1
).stack(dropna=False).droplevel(0)
new_df:
      a     b
0    10     5
1    30     8
2   410     9
0    20     4
1    40  None
2   500  None
0    25    99
1    50  None
2  None  None
Expandable option for n cols:
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[    a   b
 0  10   5
 1  20   4
 2  25  99,
     a     b
 0  30     8
 1  40  None
 2  50  None,
       a     b
 0   410     9
 1   500  None
 2  None  None]
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
    'a': ["10-30-410", "20-40-500", "25-50"],
    'b': ["5-8-9", "4", "99"]
})
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
You can also try with filter:
# axis must be passed by keyword; pandas 2.0 no longer accepts it positionally
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
NOTE: To strip the digit from columns use : k.filter(like='0').rename(columns= lambda x: x.split('-')[0])
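On the original question of whether regex is the right approach: splitting and then picking a position avoids the group-matching problem, because a position that does not exist comes back as NaN instead of falling through to the first value. A minimal sketch, assuming at most three dash-separated values per cell; the `parts` dictionary is a name made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ["10-30-410", "20-40-500", "25-50"],
                   'b': ["5-8-9", "4", "99"]})

# For each position i, split every column on '-' and take the i-th piece.
# str.get returns NaN when a cell has fewer than i + 1 pieces.
parts = {i: df.apply(lambda col: col.str.split('-').str.get(i))
         for i in range(3)}

df1, df2, df3 = parts[0], parts[1], parts[2]
print(df1)
print(df2)
print(df3)
```

Call `.fillna('')` on each result if blanks are preferred over NaN.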

Insert a blank row between each grouping in a dataframe BUT only display the first header

The following code, courtesy of @jezrael, displays a blank row and a header for each grouping:
import pandas as pd
data = {
    'MARKET_SECTOR_DES': ['A', 'A', 'B', 'B', 'B', 'B'],
    'count': [10, 9, 20, 19, 18, 17]
}
df = pd.DataFrame(data)
print(df)
print("")
# Rows to insert after each group: a blank row and a copy of the column headers
df2 = pd.DataFrame([[''] * len(df.columns), df.columns], columns=df.columns)
# For each group, append the blank row and the header row
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df1 = (df.groupby('MARKET_SECTOR_DES', group_keys=False)
         .apply(lambda d: pd.concat([d, df2]))
         .iloc[:-2]
         .reset_index(drop=True))
print(df1)
Output:
MARKET_SECTOR_DES count
0 A 10
1 A 9
2
3 MARKET_SECTOR_DES count
4 B 20
5 B 19
6 B 18
7 B 17
Desired output:
MARKET_SECTOR_DES count
0 A 10
1 A 9
4 B 20
5 B 19
6 B 18
7 B 17
So only the single header at the top.
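Change your df2 so that it holds only the blank row, and change .iloc[:-2] to .iloc[:-1], because only one row is now appended after each group:
df2 = pd.DataFrame([[''] * len(df.columns)], columns=df.columns)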
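Put together, the single-header version could look like the sketch below. It swaps the groupby/apply call for a plain loop over the groups and uses pd.concat, so it does not depend on DataFrame.append; the `blank`, `pieces` and `out` names are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'MARKET_SECTOR_DES': ['A', 'A', 'B', 'B', 'B', 'B'],
    'count': [10, 9, 20, 19, 18, 17],
})

# One blank row with the same columns as df
blank = pd.DataFrame([[''] * len(df.columns)], columns=df.columns)

# Collect each group followed by a blank separator row
pieces = []
for _, group in df.groupby('MARKET_SECTOR_DES'):
    pieces.append(group)
    pieces.append(blank)

# Drop the trailing blank row, then stack everything under a single header
out = pd.concat(pieces[:-1], ignore_index=True)
print(out)
```

Because the blank row holds empty strings, the count column ends up with object dtype, which is usually fine for display or export.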

Pandas multiply two data frames to get product

I have two data frames with different variable names
df1 = pd.DataFrame({'A':[2,2,3],'B':[5,5,6]})
>>> df1
A B
0 2 5
1 2 5
2 3 6
df2 = pd.DataFrame({'C':[3,3,3],'D':[5,5,6]})
>>> df2
C D
0 3 5
1 3 5
2 3 6
I want to create a third data frame where the n-th column is the product of the n-th columns in the first two data frames. In the above example, df3 would have two columns X and Y, where df3.X = df1.A * df2.C and df3.Y = df1.B * df2.D
df3 = pd.DataFrame({'X':[6,6,9],'Y':[25,25,36]})
>>> df3
X Y
0 6 25
1 6 25
2 9 36
Is there a simple pandas function that allows me to do this?
You can use mul to multiply df1 by the values of df2 (passing df2.values drops the column labels, so the columns are paired by position):
df3 = df1.mul(df2.values)
df3.columns = ['X','Y']
>>> df3
X Y
0 6 25
1 6 25
2 9 36
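You can also use numpy:
import numpy as np
# Pass arrays rather than DataFrames: recent pandas versions align DataFrames on
# their column labels before applying a ufunc, and df1 and df2 share no labels
df3 = pd.DataFrame(np.multiply(df1.to_numpy(), df2.to_numpy()), columns=['X', 'Y'])
Note: most numpy operations accept a pandas Series or DataFrame.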
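The reason the labels have to be stripped can be seen in a short sketch: multiplying the two frames directly aligns them on column names, and since A/B and C/D never match, every cell becomes NaN. The variable names `naive` and `df3` are used only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3], 'B': [5, 5, 6]})
df2 = pd.DataFrame({'C': [3, 3, 3], 'D': [5, 5, 6]})

# Label-aligned multiplication: columns A, B, C, D, all NaN
naive = df1 * df2
print(naive)

# Positional multiplication: drop df2's labels by converting it to an array
df3 = df1.mul(df2.to_numpy())
df3.columns = ['X', 'Y']
print(df3)
```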

Pandas : Sum multiple columns and get results in multiple columns

I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to compute sums across rows and across columns.
Summing the rows is not a big deal.
I got the result like this.
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this
(simply adding the values in columns A and B, as well as in columns C and D).
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this code?
By the way, I would rather not do it like this
(it looks too clumsy, but if it is the only way, I'll accept it):
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
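When you pass a dictionary or callable to groupby, it is applied to the labels of an axis. Here the dictionary maps each column name to its group, so the grouping has to run over the columns. groupby(..., axis=1) is deprecated in recent pandas versions, so this version transposes, groups the rows, and transposes back:
d = dict(A='AB', B='AB', C='CD', D='CD')
df[list(d)].T.groupby(d).sum().T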
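A self-contained sketch of the dictionary-based grouping on the sample data follows. It groups on the transposed frame rather than passing axis=1, and it reads the sample from an in-memory string instead of sample.txt; both choices are assumptions made for this example.

```python
import io
import pandas as pd

# The sample data from the question, whitespace-separated for this sketch
text = """idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col="idx")

# Map each source column to the name of the summed column it belongs to
d = dict(A='AB', B='AB', C='CD', D='CD')

# Select the mapped columns, transpose so they become rows,
# group those rows by the mapping, sum, and transpose back
result = df[list(d)].T.groupby(d).sum().T
print(result)
```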
Use concat with sum:
df = df.set_index('idx')  # skip this line if the file was read with index_col='idx'
df = pd.concat([df[['A', 'B']].sum(1), df[['C', 'D']].sum(1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
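The row-wise apply can usually be replaced by a column-wise sum, which tends to be faster and scales to any number of groups. A minimal sketch, where the `groups` mapping (new column name to list of source columns) is a hypothetical helper introduced for this example:

```python
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C', 'D'],
                  data=[[1, 2, 3, 4], [1, 2, 3, 4]])

# Hypothetical mapping: name of the new column -> columns to add together
groups = {'AB': ['A', 'B'], 'CD': ['C', 'D']}

# Build one summed column per entry; sum(axis=1) adds across each row
summed = pd.DataFrame({name: df[cols].sum(axis=1)
                       for name, cols in groups.items()})
print(summed)
```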

Pandas DataFrame merge summing column

I'm trying to merge two DataFrames, summing the values of a column.
>>> print(df1)
id name weight
0 1 A 0
1 2 B 10
2 3 C 10
>>> print(df2)
id name weight
0 2 B 15
1 3 C 10
I need to sum weight values during merging for similar values in the common column.
merge = pd.merge(df1, df2, how='inner')
So the output would be something like the following.
id name weight
1 2 B 25
2 3 C 20
This solution also works if you want to sum more than one column. Assume these data frames:
>>> df1
id name weight height
0 1 A 0 5
1 2 B 10 10
2 3 C 10 15
>>> df2
id name weight height
0 2 B 25 20
1 3 C 20 30
You can concatenate them and group by index columns.
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
id name weight height
0 1 A 0 5
1 2 B 35 30
2 3 C 30 45
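Note that the concatenated result keeps id 1, which appears in only one frame, whereas the question asked for inner-join behaviour. The sketch below reproduces the answer's data and then restricts the totals to keys present in both frames; the `common` and `both` names are introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C'],
                    'weight': [0, 10, 10], 'height': [5, 10, 15]})
df2 = pd.DataFrame({'id': [2, 3], 'name': ['B', 'C'],
                    'weight': [25, 20], 'height': [20, 30]})

# Stack both frames and add up the numeric columns per (id, name)
total = pd.concat([df1, df2]).groupby(['id', 'name'], as_index=False).sum()
print(total)

# Keys that occur in both frames (inner join on the key columns only)
common = df1[['id', 'name']].merge(df2[['id', 'name']], on=['id', 'name'])

# Keep only the totals for those shared keys
both = total.merge(common, on=['id', 'name'])
print(both)
```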
Alternatively, merge on the key columns and sum the two resulting weight columns row-wise:
In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id name
2 B 25
3 C 20
dtype: int64
If you set the common columns as the index, you can just sum the two dataframes, much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
weight
id name
1 A NaN
2 B 25
3 C 20
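The NaN for id 1 appears because that key is missing from df2. Depending on what is wanted, it can be treated as zero with add and fill_value, or dropped to mimic an inner join. A short sketch using the question's data; the `filled` and `common` names are used only here.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C'],
                    'weight': [0, 10, 10]}).set_index(['id', 'name'])
df2 = pd.DataFrame({'id': [2, 3], 'name': ['B', 'C'],
                    'weight': [15, 10]}).set_index(['id', 'name'])

# Treat a key missing from one frame as 0 before adding
filled = df1.add(df2, fill_value=0)
print(filled)

# Or keep only the keys present in both frames and restore integer dtype
common = (df1 + df2).dropna().astype(int)
print(common)
```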
