pandas data transformation long-wide-long - python

how to get data from this form (long representation of data):
import pandas as pd
df = pd.DataFrame({
'c0': ['A','A','B'],
'c1': ['b','c','d'],
'c2': [1, 3,4]})
print(df)
Out:
c0 c1 c2
0 A b 1
2 A c 3
3 B d 4
to this form :
c0 c1 c2
0 A b 1
2 A c 3
3 A d NaN
4 B b NaN
5 B c NaN
6 B d 4
Is long to wide to long transformation the only approach to doing this?

method 1
unstack and stack
df.set_index(['c0', 'c1']).unstack().stack(dropna=False).reset_index()
method 2
reindex with product
df.set_index(['c0', 'c1']).reindex(
pd.MultiIndex.from_product([df.c0.unique(), df.c1.unique()], names=['c0', 'c1'])
).reset_index()

Related

Pandas groupby two columns and expand the third

I have a Pandas dataframe with the following structure:
A B C
a b 1
a b 2
a b 3
c d 7
c d 8
c d 5
c d 6
c d 3
e b 4
e b 3
e b 2
e b 1
And I will like to transform it into this:
A B C1 C2 C3 C4 C5
a b 1 2 3 NAN NAN
c d 7 8 5 6 3
e b 4 3 2 1 NAN
In other words, something like groupby A and B and expand C into different columns.
Knowing that the length of each group is different.
C is already ordered
Shorter groups can have NAN or NULL values (empty), it does not matter.
Use GroupBy.cumcount and pandas.Series.add with 1, to start naming the new columns from 1 onwards, then pass this to DataFrame.pivot, and add DataFrame.add_prefix to rename the columns (C1, C2, C3, etc...). Finally use DataFrame.rename_axis to remove the indexes original name ('g') and transform the MultiIndex into columns by using DataFrame.reset_indexcolumns A,B:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
A B C1 C2 C3 C4 C5
0 a b 1.0 2.0 3.0 NaN NaN
1 c d 7.0 8.0 5.0 6.0 3.0
2 e b 4.0 3.0 2.0 1.0 NaN
Because NaN is by default of type float, if you need the columns dtype to be integers add DataFrame.astype with Int64:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = (df.pivot(['A','B'], 'g', 'C')
.add_prefix('C')
.astype('Int64')
.rename_axis(columns=None)
.reset_index())
print (df)
A B C1 C2 C3 C4 C5
0 a b 1 2 3 <NA> <NA>
1 c d 7 8 5 6 3
2 e b 4 3 2 1 <NA>
EDIT: If there's a maximum N new columns to be added, it means that A,B are duplicated. Therefore, it will beneeded to add helper groups g1, g2 with integer and modulo division, adding a new level in index:
N = 4
g = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
.add_prefix('C')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())
print (df)
A B C1 C2 C3 C4
0 a b 1.0 2.0 3.0 NaN
1 c d 7.0 8.0 5.0 6.0
2 c d 3.0 NaN NaN NaN
3 e b 4.0 3.0 2.0 1.0
df1.astype({'C':str}).groupby([*'AB'])\
.agg(','.join).C.str.split(',',expand=True)\
.add_prefix('C').reset_index()
A B C0 C1 C2 C3 C4
0 a b 1 2 3 None None
1 c d 7 8 5 6 3
2 e b 4 3 2 1 None
The accepted solution but avoiding the deprecation warning:
N = 3
g = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
.add_prefix('C_')
.astype('Int64')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())

Merging columns using pandas

I am trying to merge multiple-choice question columns using pandas so I can then manipulate them. An example of what my questions look like is:
C1 C2 C3
0 A A
1 B B
2 C C
3 D D
The data is currently presented as C1 and C2 but I need it to be combined into 1 column as represented in C3.
One option, assuming NaN in empty cells, is to bfill the first column and copy it:
df['C3'] = df[['C1', 'C2']].bfill(axis=1)['C1']
This way is extensible to any number of initial columns.
Output:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
You may try with fillna
df['C3'] = df['C1'].fillna(df['C2'])
df
Out[483]:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
You can also use combine_first:
df['C3'] = df['C1'].combine_first(df['C2'])
print(df)
# Output
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
If your cells contain empty strings and not null values, replace them temporary by NaN:
df['C3'] = df['C1'].replace('', np.nan).combine_first(df['C2'])
print(df)
# Output
C1 C2 C3
0 A A
1 B B
2 C C
3 D D

Python pandas dataframe apply result of function to multiple columns where NaN

I have a dataframe with three columns and a function that calculates the values of column y and z given the value of column x. I need to only calculate the values if they are missing NaN.
def calculate(x):
return 1, 2
df = pd.DataFrame({'x':['a', 'b', 'c', 'd', 'e', 'f'], 'y':[np.NaN, np.NaN, np.NaN, 'a1', 'b2', 'c3'], 'z':[np.NaN, np.NaN, np.NaN, 'a2', 'b1', 'c4']})
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d a1 a2
4 e b2 b1
5 f c3 c4
mask = (df.isnull().any(axis=1))
df[['y', 'z']] = df[mask].apply(calculate, axis=1, result_type='expand')
However, I get the following result, although I only apply to the masked set. Unsure what I'm doing wrong.
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d NaN NaN
4 e NaN NaN
5 f NaN NaN
If the mask is inverted I get the following result:
df[['y', 'z']] = df[~mask].apply(calculate, axis=1, result_type='expand')
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d 1.0 2.0
4 e 1.0 2.0
5 f 1.0 2.0
Expected result:
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d a1 a2
4 e b2 b1
5 f c3 c4
you can fillna after calculating for the full dataframe and set_axis
out = (df.fillna(df.apply(calculate, axis=1, result_type='expand')
.set_axis(['y','z'],inplace=False,axis=1)))
print(out)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
Try:
df.loc[mask,["y","z"]] = pd.DataFrame(df.loc[mask].apply(calculate, axis=1).to_list(), index=df[mask].index, columns = ["y","z"])
print(df)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4

How to compare values of certain columns of one dataframe with the values of same set of columns in another dataframe?

I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df1 which matches the values of the primary keys of df4.
For eg, in this case,
I need to get row with
index = 4 from df3,
index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
4 5 e e5 5 e Q
1 2 b b2 2 b Y
Any ideas on how to work this out will be very helpful.
Use two consecutive DataFrame.merge operations along with using DataFrame.add_suffix on the right dataframe to left merge the dataframes df4, df2, df3, finally use Series.fillna to replace the missing values with empty string:
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 5 e e5 5 e Q
1 2 b b2 2 b Y
Here's how I would do it on the entire data set. If you want to sample first, just update the merge statements at the end by replacing df1 with df4 or just take a sample of t
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R

Python pandas; fill in data frame with pivot_table

I have a large python script, which makes two dataframes A and B, and at the end, I want to fill in dataframe A with the values of dataframe B, and keep the columns of dataframe A, but it is not going well.
Dataframe A is like this
A B C D
1 ab
2 bc
3 cd
Dataframe B:
A BB CC
1 C 10
2 C 11
3 D 12
My output must be:
new dataframe
A B C D
1 ab 10
2 bc 11
3 cd 12
But my output is
A B C D
1 ab
2 bc
3 cd
Why is it not filling in the values of dataframe B?
My command is
dfnew = dfB.pivot_table(index='A', columns='BB', values='CC').reindex(index=dfA.index, columns=dfA.columns).fillna(dfA)
I think you need set_index by index column of df for align data, fillna or combine_first and last reset_index:
dfA = pd.DataFrame({'A':[1,2,3], 'B':['ab','bc','cd'], 'C':[np.nan] * 3,'D':[np.nan] * 3})
print (dfA)
A B C D
0 1 ab NaN NaN
1 2 bc NaN NaN
2 3 cd NaN NaN
dfB = pd.DataFrame({'A':[1,2,3], 'BB':['C','C','D'], 'CC':[10,11,12]})
print (dfB)
A BB CC
0 1 C 10
1 2 C 11
2 3 D 12
df = dfB.pivot_table(index='A', columns='BB', values='CC')
print (df)
BB C D
A
1 10.0 NaN
2 11.0 NaN
3 NaN 12.0
dfA = dfA.set_index('A').fillna(df).reset_index()
#dfA = dfA.set_index('A').combine_first(df).reset_index()
print (dfA)
A B C D
0 1 ab 10.0 NaN
1 2 bc 11.0 NaN
2 3 cd NaN 12.0

Categories