Pandas groupby two columns and expand the third - python

I have a Pandas dataframe with the following structure:
A B C
a b 1
a b 2
a b 3
c d 7
c d 8
c d 5
c d 6
c d 3
e b 4
e b 3
e b 2
e b 1
And I would like to transform it into this:
A B C1 C2 C3 C4 C5
a b 1 2 3 NAN NAN
c d 7 8 5 6 3
e b 4 3 2 1 NAN
In other words, something like groupby on A and B and expand C into different columns.
Note that the length of each group differs, C is already ordered, and shorter groups can be padded with NaN or NULL (empty) values; it does not matter which.

Use GroupBy.cumcount and Series.add with 1, to start naming the new columns from 1 onwards, then pass this to DataFrame.pivot and use DataFrame.add_prefix to rename the columns (C1, C2, C3, etc.). Finally, use DataFrame.rename_axis to remove the columns' original name ('g') and DataFrame.reset_index to transform the MultiIndex back into the columns A, B:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
A B C1 C2 C3 C4 C5
0 a b 1.0 2.0 3.0 NaN NaN
1 c d 7.0 8.0 5.0 6.0 3.0
2 e b 4.0 3.0 2.0 1.0 NaN
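As a self-contained sketch of the same steps (the sample frame is rebuilt from the question; pivot is called with keyword arguments, which newer pandas versions prefer):

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

# Number rows within each (A, B) group, starting at 1.
df['g'] = df.groupby(['A', 'B']).cumcount().add(1)

# Pivot the counter into columns, prefix them with 'C', drop the
# columns' axis name, and turn the (A, B) index back into columns.
out = (df.pivot(index=['A', 'B'], columns='g', values='C')
         .add_prefix('C')
         .rename_axis(columns=None)
         .reset_index())
print(out)
```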
Because NaN is by default of type float, if you need the columns' dtype to be integer, add DataFrame.astype with Int64:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = (df.pivot(['A','B'], 'g', 'C')
.add_prefix('C')
.astype('Int64')
.rename_axis(columns=None)
.reset_index())
print (df)
A B C1 C2 C3 C4 C5
0 a b 1 2 3 <NA> <NA>
1 c d 7 8 5 6 3
2 e b 4 3 2 1 <NA>
EDIT: If there is a maximum of N new columns to be added, it means that A, B will be duplicated. Therefore, we need to add helper groups g1, g2 with integer and modulo division, adding a new level to the index:
N = 4
g = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
.add_prefix('C')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())
print (df)
A B C1 C2 C3 C4
0 a b 1.0 2.0 3.0 NaN
1 c d 7.0 8.0 5.0 6.0
2 c d 3.0 NaN NaN NaN
3 e b 4.0 3.0 2.0 1.0
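The same EDIT, runnable end to end (sample data rebuilt from the question; N = 4 as above):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

N = 4
g = df.groupby(['A', 'B']).cumcount()
# g1 says which output row a value lands in, g2 which column (1..N).
df['g1'], df['g2'] = g // N, (g % N) + 1

out = (df.pivot(index=['A', 'B', 'g1'], columns='g2', values='C')
         .add_prefix('C')
         .droplevel(-1)            # drop the helper g1 level
         .rename_axis(columns=None)
         .reset_index())
print(out)
```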

Another option: cast C to string, join the values within each group, then split them back out into columns:
df1.astype({'C':str}).groupby([*'AB'])\
    .agg(','.join).C.str.split(',',expand=True)\
    .add_prefix('C').reset_index()
A B C0 C1 C2 C3 C4
0 a b 1 2 3 None None
1 c d 7 8 5 6 3
2 e b 4 3 2 1 None
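For reference, the join/split route run end to end; note it returns the new C columns as strings, since the values pass through ','.join:

```python
import pandas as pd

# Sample frame from the question (named df1, as in the answer above).
df1 = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

out = (df1.astype({'C': str})
          .groupby([*'AB'])            # [*'AB'] is just ['A', 'B']
          .agg(','.join).C
          .str.split(',', expand=True)
          .add_prefix('C')
          .reset_index())
print(out)
print(out.dtypes)  # the new C columns are object (strings), not numeric
```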

The accepted solution but avoiding the deprecation warning:
N = 3
g = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
.add_prefix('C_')
.astype('Int64')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())

Related

Merging columns using pandas

I am trying to merge multiple-choice question columns using pandas so I can then manipulate them. An example of what my questions look like is:
  C1 C2 C3
0  A     A
1     B  B
2     C  C
3  D     D
The data is currently presented as C1 and C2 but I need it to be combined into 1 column as represented in C3.
One option, assuming NaN in empty cells, is to bfill the first column and copy it:
df['C3'] = df[['C1', 'C2']].bfill(axis=1)['C1']
This way is extensible to any number of initial columns.
Output:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
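A minimal runnable version of this approach, assuming NaN in the empty cells:

```python
import numpy as np
import pandas as pd

# Sample with NaN in the empty cells, as the answer assumes.
df = pd.DataFrame({'C1': ['A', np.nan, np.nan, 'D'],
                   'C2': [np.nan, 'B', 'C', np.nan]})

# Back-fill across columns so C1 picks up the first non-null value
# in each row, then copy that column into C3.
df['C3'] = df[['C1', 'C2']].bfill(axis=1)['C1']
print(df)
```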
You may try fillna:
df['C3'] = df['C1'].fillna(df['C2'])
df
Out[483]:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
You can also use combine_first:
df['C3'] = df['C1'].combine_first(df['C2'])
print(df)
# Output
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
If your cells contain empty strings and not null values, replace them temporarily with NaN:
df['C3'] = df['C1'].replace('', np.nan).combine_first(df['C2'])
print(df)
# Output
C1 C2 C3
0 A A
1 B B
2 C C
3 D D
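A runnable version of the empty-string case (note it needs numpy imported as np, which the snippet above assumes):

```python
import numpy as np
import pandas as pd

# Empty strings instead of NaN in the blank cells.
df = pd.DataFrame({'C1': ['A', '', '', 'D'],
                   'C2': ['', 'B', 'C', '']})

# Treat empty strings as missing, then take the first non-null per row.
df['C3'] = df['C1'].replace('', np.nan).combine_first(df['C2'])
print(df)
```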

Python pandas dataframe apply result of function to multiple columns where NaN

I have a dataframe with three columns and a function that calculates the values of columns y and z given the value of column x. I need to calculate the values only where they are missing (NaN).
def calculate(x):
    return 1, 2
df = pd.DataFrame({'x':['a', 'b', 'c', 'd', 'e', 'f'], 'y':[np.NaN, np.NaN, np.NaN, 'a1', 'b2', 'c3'], 'z':[np.NaN, np.NaN, np.NaN, 'a2', 'b1', 'c4']})
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d a1 a2
4 e b2 b1
5 f c3 c4
mask = (df.isnull().any(axis=1))
df[['y', 'z']] = df[mask].apply(calculate, axis=1, result_type='expand')
However, I get the following result, although I only apply to the masked set; I'm unsure what I'm doing wrong.
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d NaN NaN
4 e NaN NaN
5 f NaN NaN
If the mask is inverted I get the following result:
df[['y', 'z']] = df[~mask].apply(calculate, axis=1, result_type='expand')
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d 1.0 2.0
4 e 1.0 2.0
5 f 1.0 2.0
Expected result:
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d a1 a2
4 e b2 b1
5 f c3 c4
You can fillna after calculating for the full dataframe, using set_axis to label the result's columns:
out = (df.fillna(df.apply(calculate, axis=1, result_type='expand')
.set_axis(['y','z'],inplace=False,axis=1)))
print(out)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
Try:
df.loc[mask,["y","z"]] = pd.DataFrame(df.loc[mask].apply(calculate, axis=1).to_list(), index=df[mask].index, columns = ["y","z"])
print(df)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
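The loc-based assignment works because the replacement frame is built with the masked rows' own index, so alignment only touches those rows (the original `df[['y', 'z']] = ...` failed because assigning a DataFrame replaces the whole columns, aligned on index, leaving rows missing from the result as NaN). A self-contained run with the stand-in calculate from the question:

```python
import numpy as np
import pandas as pd

def calculate(row):
    # Stand-in for the real computation, as in the question.
    return 1, 2

df = pd.DataFrame({'x': list('abcdef'),
                   'y': [np.nan, np.nan, np.nan, 'a1', 'b2', 'c3'],
                   'z': [np.nan, np.nan, np.nan, 'a2', 'b1', 'c4']})

mask = df.isnull().any(axis=1)

# Build the replacement values only for masked rows, keeping their index,
# then assign in place with .loc so unmasked rows are untouched.
df.loc[mask, ['y', 'z']] = pd.DataFrame(
    df.loc[mask].apply(calculate, axis=1).to_list(),
    index=df[mask].index, columns=['y', 'z'])
print(df)
```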

How to compare values of certain columns of one dataframe with the values of same set of columns in another dataframe?

I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df3 which match the values of the primary keys of df4.
For eg, in this case,
I need to get row with
index = 4 from df3,
index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
   A  B   C  A(df2) B(df2) C(df2)  A(df3) B(df3) C(df3)
4  5  e  e5                             5      e      Q
1  2  b  b2       2      b      Y
Any ideas on how to work this out will be very helpful.
Use two consecutive DataFrame.merge operations, applying DataFrame.add_suffix to the right dataframe, to left-merge the dataframes df4, df2, df3; finally use DataFrame.fillna to replace the missing values with an empty string:
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
# print(df)
   A  B   C  A(df2) B(df2) C(df2)  A(df3) B(df3) C(df3)
0  5  e  e5                             5      e      Q
1  2  b  b2       2      b      Y
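A self-contained run of this answer; df4 is fixed to the two rows shown in the question rather than taken as a random sample:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': list('abcdef'),
                    'C': ['a1', 'b2', 'c3', 'd4', 'e5', 'f6']})
df2 = df1.iloc[:3].assign(C=['X', 'Y', 'Z'])
df3 = df1.iloc[3:].assign(C=['P', 'Q', 'R'])

# The two rows shown in the question, instead of df1.sample(n=2).
df4 = df1.loc[[4, 1]]

df = (df4.merge(df2.add_suffix('(df2)'),
                left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
         .merge(df3.add_suffix('(df3)'),
                left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
         .fillna(''))
print(df)
```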
Here's how I would do it on the entire data set. If you want to sample first, just update the merge statements at the end by replacing df1 with df4, or just take a sample of t.
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R

Merge columns with \n

ex)
C1 C2 C3 C4 C5 C6
0 A B nan C A nan
1 B C D nan B nan
2 D E F nan C nan
3 nan nan A nan nan B
I'm merging columns, but I want to insert '\n\n' between the values in the merging process.
so output what I want
C
0 A
B
C
A
1 B
C
D
B
2 D
E
F
C
3 A
B
I want the 'nan' values to be dropped.
I tried
df['merge'] = df['C1'].map(str) + '\n\n' + df['C2'].map(str) + '\n\n' + df['C3'].map(str) + '\n\n' + df['C4'].map(str)
However, this includes all nan values.
thank you for reading.
Use DataFrame.stack to get a Series; missing values are removed, so you can aggregate with join:
df['merge'] = df.stack().groupby(level=0).agg('\n\n'.join)
#for filter only C columns
df['merge'] = df.filter(like='C').stack().groupby(level=0).agg('\n\n'.join)
Or remove missing values with Series.dropna and join per row:
df['merge'] = df.apply(lambda x: '\n\n'.join(x.dropna()), axis=1)
#for filter only C columns
df['merge'] = df.filter(like='C').apply(lambda x: '\n\n'.join(x.dropna()), axis=1)
print (df)
C1 C2 C3 C4 C5 C6 merge
0 A B NaN C A NaN A\n\nB\n\nC\n\nA
1 B C D NaN B NaN B\n\nC\n\nD\n\nB
2 D E F NaN C NaN D\n\nE\n\nF\n\nC
3 NaN NaN A NaN NaN B A\n\nB
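A runnable sketch of the dropna route, with the sample frame rebuilt from the question:

```python
import numpy as np
import pandas as pd

# Table from the question.
df = pd.DataFrame({'C1': ['A', 'B', 'D', np.nan],
                   'C2': ['B', 'C', 'E', np.nan],
                   'C3': [np.nan, 'D', 'F', 'A'],
                   'C4': ['C', np.nan, np.nan, np.nan],
                   'C5': ['A', 'B', 'C', np.nan],
                   'C6': [np.nan, np.nan, np.nan, 'B']})

# Per row: drop the NaN cells, then join the rest with blank lines.
df['merge'] = df.apply(lambda x: '\n\n'.join(x.dropna()), axis=1)
print(df['merge'].tolist())
```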

Replace column and extend index in DataFrame

I have DataFrame x and I would like to replace one column with Series y
x = DataFrame([[1,2],[3,4]], columns=['C1','C2'], index=['a','b'])
C1 C2
a 1 2
b 3 4
y = Series([5,6,7], index=['a','b','c'])
a 5
b 6
c 7
Simple replacement works fine but keeps the original index of the DataFrame:
x['C1'] = y
C1 C2
a 5 2
b 6 4
I need the union of the indices of x and y. One solution would be to reindex before replacement:
x = x.reindex(x.index.union(y.index), copy=False)
x['C1'] = y
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
Is there simpler way?
combine_first
Turn y into a DataFrame first with to_frame
y.to_frame('C1').combine_first(x)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
align and assign
Use align to... align the indices
x, y = x.align(y, axis=0)
x.assign(C1=y)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
You can try using join:
pd.DataFrame(y,columns=['C1']).join(x[['C2']])
Output:
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
Similar to your solution but more succinct: use reindex, then assign:
res = x.reindex(x.index | y.index).assign(C1=y)
print(res)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
You can use concat but you will have to fix the column names, i.e.
import pandas as pd
pd.concat([x.loc[:, 'C2'], y], axis = 1)
which gives,
C2 0
a 2.0 5
b 4.0 6
c NaN 7
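The column fix the answer mentions can be a rename of the Series' default 0 label, e.g.:

```python
import pandas as pd

x = pd.DataFrame([[1, 2], [3, 4]], columns=['C1', 'C2'], index=['a', 'b'])
y = pd.Series([5, 6, 7], index=['a', 'b', 'c'])

# concat aligns on the union of the indices; the unnamed Series arrives
# as column 0, so rename it back to C1 and restore the column order.
res = (pd.concat([x['C2'], y], axis=1)
         .rename(columns={0: 'C1'})[['C1', 'C2']])
print(res)
```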
