Rearrange Python Pandas DataFrame Rows into a Single Row

I have a Pandas dataframe that looks something like:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [50, 60, 70, 80]}, index=['A', 'B', 'C', 'D'])
   col1  col2
A     1    50
B     2    60
C     3    70
D     4    80
However, I want to automatically rearrange it so that it looks like:
   col1 A  col1 B  col1 C  col1 D  col2 A  col2 B  col2 C  col2 D
0       1       2       3       4      50      60      70      80
I want to combine the row name with the column name, and end up with only one row.

df2 = df.unstack()
df2.index = [' '.join(x) for x in df2.index.values]
df2 = pd.DataFrame(df2).T
df2
   col1 A  col1 B  col1 C  col1 D  col2 A  col2 B  col2 C  col2 D
0       1       2       3       4      50      60      70      80
If you want to have the original row labels in front of the column names ("A col1", ...), just change ' '.join(x) to ' '.join(x[::-1]):
df2 = df.unstack()
df2.index = [' '.join(x[::-1]) for x in df2.index.values]
df2 = pd.DataFrame(df2).T
df2
   A col1  B col1  C col1  D col1  A col2  B col2  C col2  D col2
0       1       2       3       4      50      60      70      80

Here's one way to do it; there could be a simpler way:
In [562]: df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [50, 60, 70, 80]},
     ...:                   index=['A', 'B', 'C', 'D'])

In [563]: pd.DataFrame([df.values.T.ravel()],
     ...:              columns=[y + x for y in df.columns for x in df.index])
Out[563]:
   col1A  col1B  col1C  col1D  col2A  col2B  col2C  col2D
0      1      2      3      4     50     60     70     80
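On newer pandas you can get the same single row without going through numpy. A minimal sketch, assuming the df above: unstack() returns a Series indexed by (column, row-label) pairs in column-major order, so you can flatten those pairs yourself and transpose.

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [50, 60, 70, 80]},
                  index=['A', 'B', 'C', 'D'])

# unstack() walks the values column by column -- the same order
# that df.values.T.ravel() produces
s = df.unstack()
s.index = [f"{col}{row}" for col, row in s.index]  # e.g. 'col1A'
out = s.to_frame().T
print(out)
#    col1A  col1B  col1C  col1D  col2A  col2B  col2C  col2D
# 0      1      2      3      4     50     60     70     80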

How to add interleaving rows as result of sort / groups?

I have the following sample input data:
import pandas as pd
df = pd.DataFrame({'col1': ['x', 'y', 'z'], 'col2': [1, 2, 3], 'col3': ['a', 'a', 'b']})
I would like to sort and group by col3, interleaving each group's summary row on top of the corresponding group in col1, to get the following output:
  col1  col2
0    a     3
1    x     1
2    y     2
3    b     3
4    z     3
I can of course do the part:
df.sort_values(by=['col3']).groupby(by=['col3']).sum()
      col2
col3
a        3
b        3
but I am not sure how to interleave the group labels on top of col1.
Use a custom function that prepends the summary row to each group:
def f(x):
    # build a one-row summary (group label plus group sum) and prepend it
    top = pd.DataFrame({'col1': x.name, 'col2': x['col2'].sum()}, index=[0])
    return pd.concat([top, x])

df = (df.sort_values(by=['col3'])
        .groupby(by=['col3'], group_keys=False)
        .apply(f)
        .drop(columns='col3')
        .reset_index(drop=True))
print(df)
  col1  col2
0    a     3
1    x     1
2    y     2
3    b     3
4    z     3
A more performant solution uses GroupBy.ngroup for the indices, aggregates the sum, and finally joins the values with concat, sorting the index with the stable mergesort:
df = df.sort_values(by=['col3'])
df1 = df.groupby(by=['col3'])['col2'].sum().rename_axis('col1').reset_index()
df2 = df.set_index(df.groupby(by=['col3']).ngroup())
df = (pd.concat([df1, df2])
        .sort_index(kind='mergesort', ignore_index=True)
        .drop(columns='col3'))
print(df)
  col1  col2
0    a     3
1    x     1
2    y     2
3    b     3
4    z     3
What about:
(df.melt(id_vars='col2')
   .rename(columns={'value': 'col1'})
   .groupby('col1')['col2'].sum()
   .reset_index()
)
Output:
  col1  col2
0    a     3
1    b     3
2    x     1
3    y     2
4    z     3
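Note that this sorts all the labels alphabetically, so the summary rows only land directly above their groups when the group keys happen to sort ahead of their members; it does not reproduce the interleaved ordering asked for.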
def function1(dd: pd.DataFrame):
    # write the group's summary at a fractional index just above the group
    df.loc[dd.index.min() - 0.5, ['col1', 'col2']] = [dd.name, dd.col2.sum()]

df.groupby('col3').apply(function1).pipe(
    lambda dd: df.sort_index(ignore_index=True)
).drop('col3', axis=1)
Output:
  col1  col2
0    a     3
1    x     1
2    y     2
3    b     3
4    z     3
Or use the pandasql library:
def function1(dd: pd.DataFrame):
    return dd.sql(
        "select '{}' as col1, {} as col2 union select col1, col2 from self"
        .format(dd.name, dd.col2.sum())
    )

df.groupby('col3').apply(function1).reset_index(drop=True)
  col1  col2
0    a     3
1    x     1
2    y     2
3    b     3
4    z     3

Pandas get column value based on row value [duplicate]

This question already has answers here: Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer (4 answers). Closed 1 year ago.
I have the following dataframe:
df = pd.DataFrame(data={'flag': ['col3', 'col2', 'col2'],
                        'col1': [1, 3, 2],
                        'col2': [5, 2, 4],
                        'col3': [6, 3, 6],
                        'col4': [0, 4, 4]},
                  index=pd.Series(['A', 'B', 'C'], name='index'))
       flag  col1  col2  col3  col4
index
A      col3     1     5     6     0
B      col2     3     2     3     4
C      col2     2     4     6     4
For each row, I want to get the value of the column whose name matches the flag:
       flag  col1  col2  col3  col4  col_val
index
A      col3     1     5     6     0        6
B      col2     3     2     3     4        2
C      col2     2     4     6     4        4
– Index A has a flag of col3. So col_val should be 6 because df['col3'] for that row is 6.
– Index B has a flag of col2. So col_val should be 2 because df['col2'] for that row is 2.
– Index C has a flag of col2. So col_val should be 4 because df['col2'] for that row is 4.
Per this page:
import numpy as np

idx, cols = pd.factorize(df['flag'])
df['col_val'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Output:
>>> df
       flag  col1  col2  col3  col4  col_val
index
A      col3     1     5     6     0        6
B      col2     3     2     3     4        2
C      col2     2     4     6     4        4
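For small frames, a simpler (though slower, row-wise) sketch does the same lookup with plain apply; this assumes the df from the question:

# each row looks itself up at the column named by its own flag
df['col_val'] = df.apply(lambda row: row[row['flag']], axis=1)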
The docs have an example that you can adapt; the solution below is just another option. It flips the dataframe into a MultiIndex dataframe, selects the relevant columns, and trims the result to the non-nulls:
cols = [(ent, ent) for ent in df.flag.unique()]
(df.assign(col_val=df.pivot(index=None, columns='flag')
                     .loc(axis=1)[cols]
                     .sum(axis=1))
)
       flag  col1  col2  col3  col4  col_val
index
A      col3     1     5     6     0      6.0
B      col2     3     2     3     4      2.0
C      col2     2     4     6     4      4.0
Try this:
cond = ([df.columns.values[1:]] * df.shape[0]) == df.flag.values.reshape(-1, 1)
df1 = df.set_index('flag', append=True)
df1.join(df1.where(cond).ffill(axis=1).col4.rename('res')).reset_index('flag')
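Here cond marks, row by row, the cell whose column name matches flag; where keeps only that cell and blanks the rest, and the forward fill along the columns carries the matched value out to col4, which is then joined back as res.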

Can pandas perform an aggregating operation involving two columns?

Given the following dataframe, is it possible to calculate the sum of col2 and the sum of col2 + col3 in a single aggregating function?
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'], 'col2': [1, 2, 3, 4], 'col3': [10, 20, 30, 40]})
  col1  col2  col3
0    a     1    10
1    a     2    20
2    b     3    30
3    b     4    40
In R's dplyr I would do it with a single summarize call, and I was wondering what the equivalent in pandas might be:
df %>% group_by(col1) %>% summarize(col2_sum = sum(col2), col23_sum = sum(col2 + col3))
Desired result:
  col1  col2_sum  col23_sum
0    a         3         33
1    b         7         77
Let us try assigning the new column first:
out = df.assign(col23=df.col2 + df.col3).groupby('col1', as_index=False).sum()

Out[81]:
  col1  col2  col3  col23
0    a     3    30     33
1    b     7    70     77
From my understanding, apply is closer to summarize in R:
out = (df.groupby('col1')
         .apply(lambda x: pd.Series({'col2_sum': x['col2'].sum(),
                                     'col23_sum': (x['col2'] + x['col3']).sum()}))
         .reset_index())

Out[83]:
  col1  col2_sum  col23_sum
0    a         3         33
1    b         7         77
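The closest single-call pandas idiom to summarize is probably named aggregation; a minimal sketch, assuming the question's df, that combines it with assign:

out = (df.assign(col23=df['col2'] + df['col3'])
         .groupby('col1', as_index=False)
         .agg(col2_sum=('col2', 'sum'), col23_sum=('col23', 'sum')))
print(out)
#   col1  col2_sum  col23_sum
# 0    a         3         33
# 1    b         7         77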
You can do it easily with datar:
>>> from datar.all import f, tibble, group_by, summarize, sum
>>> df = tibble(
... col1=['a', 'a', 'b', 'b'],
... col2=[1, 2, 3, 4],
... col3=[10, 20, 30, 40]
... )
>>> df >> group_by(f.col1) >> summarize(
... col2_sum = sum(f.col2),
... col23_sum = sum(f.col2 + f.col3)
... )
      col1  col2_sum  col23_sum
  <object>   <int64>    <int64>
0        a         3         33
1        b         7         77
I am the author of the datar package.

Pandas melt with column names and top row as column

I have a dataframe df as where Col1, Col2 and Col3 are column names:
Col1  Col2  Col3
         a     b
B        2     3
C       10     6
The first row above, with values a and b, is a subcategory row, so Col1 is empty for it.
I am trying to get the following:
B Col2 a 2
B Col3 b 3
C Col2 a 10
C Col3 b 6
I am not sure how to approach above.
Edit:
df.to_dict()
Out[16]:
{'Unnamed: 0': {0: nan, 1: 'B', 2: 'C'},
'Col2': {0: 'a', 1: '2', 2: '10'},
'Col3': {0: 'b', 1: '3', 2: '6'}}
Use stack and join:
df_final = (df.iloc[1:].set_index('Col1').stack().reset_index(0)
              .join(df.iloc[0, 1:].rename('1')).sort_values('Col1'))
Out[345]:
     Col1   0  1
Col2    B   2  a
Col3    B   3  b
Col2    C  10  a
Col3    C   6  b
You can try this, replacing that NaN with a blank (or any string you want the column to be named):
df.fillna('').set_index('Col1').T\
  .set_index('', append=True).stack().reset_index()
Output:
  level_0     Col1   0
0    Col2  a     B   2
1    Col2  a     C  10
2    Col3  b     B   3
3    Col3  b     C   6
df.fillna('Col0').set_index('Col1').T\
  .set_index('Col0', append=True).stack().reset_index(level=[1, 2])
Output:
     Col0 Col1   0
Col2    a    B   2
Col2    a    C  10
Col3    b    B   3
Col3    b    C   6
import numpy as np

df = pd.DataFrame.from_dict({'Col1': {0: np.nan, 1: 'B', 2: 'C'},
                             'Col2': {0: 'a', 1: '2', 2: '10'},
                             'Col3': {0: 'b', 1: '3', 2: '6'}})
# set the index as a multi-index built from the first row
df.index = pd.MultiIndex.from_product([df.iloc[0, :]])
# get rid of the subcategory row and reset the index
df = df.iloc[1:, :].reset_index()
answer = pd.melt(df, id_vars=['Col1', 0], value_vars=['Col2', 'Col3'], value_name='vals')
answer[['Col1', 'variable', 0, 'vals']]
  Col1 variable  0 vals
0    B     Col2  a    2
1    C     Col2  b   10
2    B     Col3  a    3
3    C     Col3  b    6
You can do the following:
df = pd.DataFrame({'Col1': {0: np.nan, 1: 'B', 2: 'C'},
                   'Col2': {0: 'a', 1: '2', 2: '10'},
                   'Col3': {0: 'b', 1: '3', 2: '6'}})
melted = pd.melt(df, id_vars=['Col1'],
                 value_vars=['Col3', 'Col2']).dropna().reset_index(drop=True)
subframe = pd.DataFrame({'Col2': ['a'], 'Col3': ['b']}).melt()
melted.merge(subframe, on='variable')
Out[1]:
  Col1 variable value_x value_y
0    B     Col3       3       b
1    C     Col3       6       b
2    B     Col2       2       a
3    C     Col2      10       a
Then you can rename your columns as you want.
You can melt the dataframe, create a new column that holds the value only on the rows where Col1 is null, forward-fill it, and then filter out the header rows where value and temp match:
(
    df.melt("Col1")
      .assign(temp=lambda x: np.where(x.Col1.isna(), x.value, np.nan))
      .ffill()
      .query("value != temp")
)
  Col1 variable value temp
1    B     Col2     2    a
2    C     Col2    10    a
4    B     Col3     3    b
5    C     Col3     6    b
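The trick works because temp is populated only on the subcategory header rows (where Col1 is NaN); ffill then copies each letter down through its block, and the query discards the header rows themselves, on which value and temp are equal.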

Pandas: consolidating columns in DataFrame

Using the DataFrame below as an example:
import pandas as pd
df = pd.DataFrame({'col1':[1, 2, 3, 2, 1] , 'col2':['A', 'A', 'B', 'B','C']})
   col1 col2
0     1    A
1     2    A
2     3    B
3     2    B
4     1    C
how can I get
   col1 col2
0     1  A,C
1     2  A,B
2     3    B
You can groupby on 'col1' and then apply a lambda that joins the values:
In [88]:
df = pd.DataFrame({'col1': [1, 2, 3, 2, 1], 'col2': ['A', 'A', 'B', 'B', 'C']})
df.groupby('col1')['col2'].apply(lambda x: ','.join(x)).reset_index()

Out[88]:
   col1 col2
0     1  A,C
1     2  A,B
2     3    B
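Equivalently, you can hand the join straight to agg and skip the lambda; a small sketch on the same df:

df.groupby('col1', as_index=False)['col2'].agg(','.join)
#    col1 col2
# 0     1  A,C
# 1     2  A,B
# 2     3    B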
