Pandas groupby two columns and expand the third - python

I have a Pandas dataframe with the following structure:
A B C
a b 1
a b 2
a b 3
c d 7
c d 8
c d 5
c d 6
c d 3
e b 4
e b 3
e b 2
e b 1
And I would like to transform it into this:
A B C1 C2 C3 C4 C5
a b 1 2 3 NAN NAN
c d 7 8 5 6 3
e b 4 3 2 1 NAN
In other words, something like groupby on A and B and expand C into different columns.
Note that the length of each group differs, C is already ordered, and shorter groups can be padded with NaN or NULL (empty) values; it does not matter which.

Use GroupBy.cumcount and Series.add with 1, to start naming the new columns from 1 onwards, then pass this to DataFrame.pivot and use DataFrame.add_prefix to rename the columns (C1, C2, C3, etc.). Finally, use DataFrame.rename_axis to remove the columns' original name ('g') and DataFrame.reset_index to transform the MultiIndex back into the columns A, B:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
A B C1 C2 C3 C4 C5
0 a b 1.0 2.0 3.0 NaN NaN
1 c d 7.0 8.0 5.0 6.0 3.0
2 e b 4.0 3.0 2.0 1.0 NaN
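As a self-contained sketch of the same steps (the sample frame is rebuilt from the question; pivot is called with keyword arguments, which newer pandas versions prefer):

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

# Number rows within each (A, B) group, starting at 1.
df['g'] = df.groupby(['A', 'B']).cumcount().add(1)

# Pivot the counter into columns, prefix them with 'C', drop the
# columns' axis name, and turn the (A, B) index back into columns.
out = (df.pivot(index=['A', 'B'], columns='g', values='C')
         .add_prefix('C')
         .rename_axis(columns=None)
         .reset_index())
print(out)
```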
Because NaN is by default of type float, if you need the columns' dtype to be integer, add DataFrame.astype with Int64:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = (df.pivot(['A','B'], 'g', 'C')
.add_prefix('C')
.astype('Int64')
.rename_axis(columns=None)
.reset_index())
print (df)
A B C1 C2 C3 C4 C5
0 a b 1 2 3 <NA> <NA>
1 c d 7 8 5 6 3
2 e b 4 3 2 1 <NA>
EDIT: If there is a maximum of N new columns to be added, it means that A, B will be duplicated. Therefore, we need to add helper groups g1, g2 with integer and modulo division, adding a new level to the index:
N = 4
g = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
.add_prefix('C')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())
print (df)
A B C1 C2 C3 C4
0 a b 1.0 2.0 3.0 NaN
1 c d 7.0 8.0 5.0 6.0
2 c d 3.0 NaN NaN NaN
3 e b 4.0 3.0 2.0 1.0
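The same EDIT, runnable end to end (sample data rebuilt from the question; N = 4 as above):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

N = 4
g = df.groupby(['A', 'B']).cumcount()
# g1 says which output row a value lands in, g2 which column (1..N).
df['g1'], df['g2'] = g // N, (g % N) + 1

out = (df.pivot(index=['A', 'B', 'g1'], columns='g2', values='C')
         .add_prefix('C')
         .droplevel(-1)            # drop the helper g1 level
         .rename_axis(columns=None)
         .reset_index())
print(out)
```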

Another option: cast C to string, join the values within each group, then split them back out into columns:
df1.astype({'C':str}).groupby([*'AB'])\
    .agg(','.join).C.str.split(',',expand=True)\
    .add_prefix('C').reset_index()
A B C0 C1 C2 C3 C4
0 a b 1 2 3 None None
1 c d 7 8 5 6 3
2 e b 4 3 2 1 None
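For reference, the join/split route run end to end; note it returns the new C columns as strings, since the values pass through ','.join:

```python
import pandas as pd

# Sample frame from the question (named df1, as in the answer above).
df1 = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

out = (df1.astype({'C': str})
          .groupby([*'AB'])            # [*'AB'] is just ['A', 'B']
          .agg(','.join).C
          .str.split(',', expand=True)
          .add_prefix('C')
          .reset_index())
print(out)
print(out.dtypes)  # the new C columns are object (strings), not numeric
```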

The accepted solution but avoiding the deprecation warning:
N = 3
g = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
.add_prefix('C_')
.astype('Int64')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())

Related

Merging columns using pandas

I am trying to merge multiple-choice question columns using pandas so I can then manipulate them. An example of what my questions look like is:
  C1 C2 C3
0  A     A
1     B  B
2     C  C
3  D     D
The data is currently presented as C1 and C2 but I need it to be combined into 1 column as represented in C3.
One option, assuming NaN in empty cells, is to bfill the first column and copy it:
df['C3'] = df[['C1', 'C2']].bfill(axis=1)['C1']
This way is extensible to any number of initial columns.
Output:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
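A minimal runnable version of this approach, assuming NaN in the empty cells:

```python
import numpy as np
import pandas as pd

# Sample with NaN in the empty cells, as the answer assumes.
df = pd.DataFrame({'C1': ['A', np.nan, np.nan, 'D'],
                   'C2': [np.nan, 'B', 'C', np.nan]})

# Back-fill across columns so C1 picks up the first non-null value
# in each row, then copy that column into C3.
df['C3'] = df[['C1', 'C2']].bfill(axis=1)['C1']
print(df)
```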
You may try fillna:
df['C3'] = df['C1'].fillna(df['C2'])
df
Out[483]:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
You can also use combine_first:
df['C3'] = df['C1'].combine_first(df['C2'])
print(df)
# Output
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
If your cells contain empty strings and not null values, replace them temporarily with NaN:
df['C3'] = df['C1'].replace('', np.nan).combine_first(df['C2'])
print(df)
# Output
C1 C2 C3
0 A A
1 B B
2 C C
3 D D
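A runnable version of the empty-string case (note it needs numpy imported as np, which the snippet above assumes):

```python
import numpy as np
import pandas as pd

# Empty strings instead of NaN in the blank cells.
df = pd.DataFrame({'C1': ['A', '', '', 'D'],
                   'C2': ['', 'B', 'C', '']})

# Treat empty strings as missing, then take the first non-null per row.
df['C3'] = df['C1'].replace('', np.nan).combine_first(df['C2'])
print(df)
```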

Python pandas dataframe apply result of function to multiple columns where NaN

I have a dataframe with three columns and a function that calculates the values of columns y and z given the value of column x. I need to calculate the values only where they are missing (NaN).
def calculate(x):
    return 1, 2
df = pd.DataFrame({'x':['a', 'b', 'c', 'd', 'e', 'f'], 'y':[np.NaN, np.NaN, np.NaN, 'a1', 'b2', 'c3'], 'z':[np.NaN, np.NaN, np.NaN, 'a2', 'b1', 'c4']})
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d a1 a2
4 e b2 b1
5 f c3 c4
mask = (df.isnull().any(axis=1))
df[['y', 'z']] = df[mask].apply(calculate, axis=1, result_type='expand')
However, I get the following result, although I only apply to the masked set; I'm unsure what I'm doing wrong.
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d NaN NaN
4 e NaN NaN
5 f NaN NaN
If the mask is inverted I get the following result:
df[['y', 'z']] = df[~mask].apply(calculate, axis=1, result_type='expand')
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d 1.0 2.0
4 e 1.0 2.0
5 f 1.0 2.0
Expected result:
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d a1 a2
4 e b2 b1
5 f c3 c4
You can fillna after calculating for the full dataframe, using set_axis to label the result's columns:
out = (df.fillna(df.apply(calculate, axis=1, result_type='expand')
.set_axis(['y','z'],inplace=False,axis=1)))
print(out)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
Try:
df.loc[mask,["y","z"]] = pd.DataFrame(df.loc[mask].apply(calculate, axis=1).to_list(), index=df[mask].index, columns = ["y","z"])
print(df)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
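The loc-based assignment works because the replacement frame is built with the masked rows' own index, so alignment only touches those rows (the original `df[['y', 'z']] = ...` failed because assigning a DataFrame replaces the whole columns, aligned on index, leaving rows missing from the result as NaN). A self-contained run with the stand-in calculate from the question:

```python
import numpy as np
import pandas as pd

def calculate(row):
    # Stand-in for the real computation, as in the question.
    return 1, 2

df = pd.DataFrame({'x': list('abcdef'),
                   'y': [np.nan, np.nan, np.nan, 'a1', 'b2', 'c3'],
                   'z': [np.nan, np.nan, np.nan, 'a2', 'b1', 'c4']})

mask = df.isnull().any(axis=1)

# Build the replacement values only for masked rows, keeping their index,
# then assign in place with .loc so unmasked rows are untouched.
df.loc[mask, ['y', 'z']] = pd.DataFrame(
    df.loc[mask].apply(calculate, axis=1).to_list(),
    index=df[mask].index, columns=['y', 'z'])
print(df)
```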

How to compare values of certain columns of one dataframe with the values of same set of columns in another dataframe?

I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df3 which match the values of the primary keys of df4.
For eg, in this case,
I need to get row with
index = 4 from df3,
index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
   A  B   C  A(df2) B(df2) C(df2)  A(df3) B(df3) C(df3)
4  5  e  e5                             5      e      Q
1  2  b  b2       2      b      Y
Any ideas on how to work this out will be very helpful.
Use two consecutive DataFrame.merge operations, applying DataFrame.add_suffix to the right dataframe, to left-merge the dataframes df4, df2, df3; finally use DataFrame.fillna to replace the missing values with an empty string:
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
# print(df)
   A  B   C  A(df2) B(df2) C(df2)  A(df3) B(df3) C(df3)
0  5  e  e5                             5      e      Q
1  2  b  b2       2      b      Y
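A self-contained run of this answer; df4 is fixed to the two rows shown in the question rather than taken as a random sample:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': list('abcdef'),
                    'C': ['a1', 'b2', 'c3', 'd4', 'e5', 'f6']})
df2 = df1.iloc[:3].assign(C=['X', 'Y', 'Z'])
df3 = df1.iloc[3:].assign(C=['P', 'Q', 'R'])

# The two rows shown in the question, instead of df1.sample(n=2).
df4 = df1.loc[[4, 1]]

df = (df4.merge(df2.add_suffix('(df2)'),
                left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
         .merge(df3.add_suffix('(df3)'),
                left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
         .fillna(''))
print(df)
```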
Here's how I would do it on the entire data set. If you want to sample first, just update the merge statements at the end by replacing df1 with df4, or just take a sample of t.
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R

Merge columns with \n

ex)
C1 C2 C3 C4 C5 C6
0 A B nan C A nan
1 B C D nan B nan
2 D E F nan C nan
3 nan nan A nan nan B
I'm merging columns, but I want to insert '\n\n' between the values in the merging process.
so output what I want
C
0 A
B
C
A
1 B
C
D
B
2 D
E
F
C
3 A
B
I want the 'nan' values to be dropped.
I tried
df['merge'] = df['C1'].map(str) + '\n\n' + df['C2'].map(str) + '\n\n' + df['C3'].map(str) + '\n\n' + df['C4'].map(str)
However, this includes all nan values.
thank you for reading.
Use DataFrame.stack to get a Series; missing values are removed, so you can aggregate with join:
df['merge'] = df.stack().groupby(level=0).agg('\n\n'.join)
#for filter only C columns
df['merge'] = df.filter(like='C').stack().groupby(level=0).agg('\n\n'.join)
Or remove missing values with Series.dropna and join per row:
df['merge'] = df.apply(lambda x: '\n\n'.join(x.dropna()), axis=1)
#for filter only C columns
df['merge'] = df.filter(like='C').apply(lambda x: '\n\n'.join(x.dropna()), axis=1)
print (df)
C1 C2 C3 C4 C5 C6 merge
0 A B NaN C A NaN A\n\nB\n\nC\n\nA
1 B C D NaN B NaN B\n\nC\n\nD\n\nB
2 D E F NaN C NaN D\n\nE\n\nF\n\nC
3 NaN NaN A NaN NaN B A\n\nB
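A runnable sketch of the dropna route, with the sample frame rebuilt from the question:

```python
import numpy as np
import pandas as pd

# Table from the question.
df = pd.DataFrame({'C1': ['A', 'B', 'D', np.nan],
                   'C2': ['B', 'C', 'E', np.nan],
                   'C3': [np.nan, 'D', 'F', 'A'],
                   'C4': ['C', np.nan, np.nan, np.nan],
                   'C5': ['A', 'B', 'C', np.nan],
                   'C6': [np.nan, np.nan, np.nan, 'B']})

# Per row: drop the NaN cells, then join the rest with blank lines.
df['merge'] = df.apply(lambda x: '\n\n'.join(x.dropna()), axis=1)
print(df['merge'].tolist())
```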

Replace column and extend index in DataFrame

I have DataFrame x and I would like to replace one column with Series y
x = DataFrame([[1,2],[3,4]], columns=['C1','C2'], index=['a','b'])
C1 C2
a 1 2
b 3 4
y = Series([5,6,7], index=['a','b','c'])
a 5
b 6
c 7
Simple replacement works fine but keeps the original index of the DataFrame:
x['C1'] = y
C1 C2
a 5 2
b 6 4
I need the union of the indices of x and y. One solution would be to reindex before replacement:
x = x.reindex(x.index.union(y.index), copy=False)
x['C1'] = y
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
Is there simpler way?
combine_first
Turn y into a DataFrame first with to_frame
y.to_frame('C1').combine_first(x)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
align and assign
Use align to... align the indices
x, y = x.align(y, axis=0)
x.assign(C1=y)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
You can try using join:
pd.DataFrame(y,columns=['C1']).join(x[['C2']])
Output:
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
Similar to your solution but more succinct: use reindex, then assign:
res = x.reindex(x.index | y.index).assign(C1=y)
print(res)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
You can use concat but you will have to fix the column names, i.e.
import pandas as pd
pd.concat([x.loc[:, 'C2'], y], axis = 1)
which gives,
C2 0
a 2.0 5
b 4.0 6
c NaN 7
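The column fix the answer mentions can be a rename of the Series' default 0 label, e.g.:

```python
import pandas as pd

x = pd.DataFrame([[1, 2], [3, 4]], columns=['C1', 'C2'], index=['a', 'b'])
y = pd.Series([5, 6, 7], index=['a', 'b', 'c'])

# concat aligns on the union of the indices; the unnamed Series arrives
# as column 0, so rename it back to C1 and restore the column order.
res = (pd.concat([x['C2'], y], axis=1)
         .rename(columns={0: 'C1'})[['C1', 'C2']])
print(res)
```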
