I would like to transform a Pandas DataFrame of the following wide format
df = pd.DataFrame([['A', '1', '2', '3'], ['B', '4', '5', '6'], ['C', '7', '8', '9']], columns=['ABC', 'def', 'ghi', 'jkl'])
df =
ABC def ghi jkl
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
into a long format, where the values from the first column still correspond to the values in the lower-case columns. The column names cannot be used as stub names. The names of the new columns are irrelevant and could be renamed later.
The output should look something like this:
df =
0 1
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
I am not sure how to do this efficiently. Can it be done with wide_to_long()? If so, I would not know how to deal with the stub names. Ideally it would be an efficient one-liner that can be used on a large table.
Many thanks!!
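Use DataFrame.melt with ignore_index=False, then DataFrame.sort_index, and drop the variable column:
# a stable sort keeps the original left-to-right order within each row
df1 = (df.melt("ABC", value_name='new', ignore_index=False)
         .sort_index(kind='stable', ignore_index=True)
         .drop('variable', axis=1)
      )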
print (df1)
ABC new
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
If you need a more dynamic solution, take the name of the first column from df.columns:
first = df.columns[0]
df1 = (df.melt(first, value_name='new', ignore_index=False)
         .sort_index(kind='stable', ignore_index=True)
         .drop('variable', axis=1))
You can use df.stack:
>>> df.set_index('ABC') \
.stack() \
.reset_index(level='ABC') \
.reset_index(drop=True)
ABC 0
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
or use df.melt as suggested by @MustafaAydın:
>>> df.melt('ABC') \
.sort_values('ABC', kind='stable') \
.drop(columns='variable') \
.reset_index(drop=True)
ABC value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
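For a large table, a NumPy-based variant may be worth trying, since it needs no sort at all. This is only a sketch, assuming the first column holds the identifiers and every other column is a value column; the names id_col, value_cols and long_df are made up for illustration, and I have not benchmarked it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['A', '1', '2', '3'], ['B', '4', '5', '6'], ['C', '7', '8', '9']],
    columns=['ABC', 'def', 'ghi', 'jkl'],
)

id_col = df.columns[0]        # assumption: the first column is the identifier
value_cols = df.columns[1:]   # assumption: all remaining columns hold values

long_df = pd.DataFrame({
    # repeat each identifier once per value column: A, A, A, B, B, B, ...
    id_col: np.repeat(df[id_col].to_numpy(), len(value_cols)),
    # flatten the value block row by row: 1, 2, 3, 4, 5, 6, ...
    'value': df[value_cols].to_numpy().ravel(),
})
print(long_df)
```

Because the values are flattened row by row, the result comes out in the same order as the melt and stack versions above.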
Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all df2 column values to df1, repeated on every row, to create the following result. It is assumed that the two data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, you can use DataFrame.assign and unpack the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values across df1.index with DataFrame.reindex and then use DataFrame.join (this works here because the first index value of df2 is the same as the first index value of df1):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the original data has no missing values, you can instead forward fill the missing values as a last step, but the dtypes of the new columns then change to floats (thanks @Dishin H Goyan):
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
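If the float conversion is a problem, a cross join might be another option. This is a sketch rather than a tested recommendation: it assumes pandas 1.2 or later (where how='cross' is available) and that you do not need to keep the original index of df1, because merge returns a fresh 0..n-1 index.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'c': [7], 'd': [8]})

# Cross join: every row of df1 is paired with every row of df2.
# With a single-row df2 this simply repeats its values on each row of df1.
result = df1.merge(df2, how='cross')
print(result)
```

No missing values are created along the way, so the integer columns stay integers.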
I want to shift a specific column down by one (I don't know whether another library could help me).
import pandas as pd
#pd.set_option('display.max_rows',100)
fac=pd.read_excel('TEST.xlsm',sheet_name="DC - Consumables",header=None, skiprows=1)
df = pd.DataFrame(fac)
df1=df.iloc[0:864,20:39]
df2=df.iloc[0:864,40:59]
df1=pd.concat([df1,df2])
print (df1)
I want one column to be below the other column
A B C   A B C
1 2 3   6 7 8
4 5 8   4 1 9
My code prints this:
A B C
1 2 3
4 5 8
A B C
6 7 8
4 1 9
I need the second column (dataframe) to be below the first column, like this:
A B C
1 2 3
4 5 8
A B C
6 7 8
4 1 9
Please help me
Try pd.concat().
df3 = pd.concat([df1, df2])
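One thing to watch for: pd.concat aligns on column labels. With header=None the two slices carry different integer labels, so the blocks may end up offset and padded with NaN instead of stacked. A minimal sketch of one way around that, assuming both blocks have the same number of columns (the toy frame wide and the names left, right and stacked are made up for illustration):

```python
import pandas as pd

# Toy stand-in for a sheet read with header=None: columns are labelled 0..5
wide = pd.DataFrame([[1, 2, 3, 6, 7, 8],
                     [4, 5, 8, 4, 1, 9]])

left = wide.iloc[:, 0:3]    # columns labelled 0, 1, 2
right = wide.iloc[:, 3:6]   # columns labelled 3, 4, 5

# Give the second block the same labels as the first so concat stacks them
right = right.set_axis(left.columns, axis=1)

stacked = pd.concat([left, right], ignore_index=True)
print(stacked)
```

ignore_index=True just renumbers the rows of the stacked result from 0.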
I use the following code to try to change the values in columns 4, 5 and 6 of a DataFrame to percentage format, but it returns errors.
df.iloc[:,4:7].apply('{:.2%}'.format)
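apply passes each column to the function as a whole Series, which the format spec cannot handle. To format element by element, you can use DataFrame.applymap (renamed to DataFrame.map in pandas 2.1):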
df = pd.DataFrame({
'a':list('abcdef'),
'b':list('aaabbb'),
'c':[4,5,4,5,5,4],
'd':[7,8,9,4,2,3],
'e':[5,3,6,9,2,4],
'f':[7,8,9,4,2,3],
'g':[1,3,5,7,1,0],
'h':[7,8,9,4,2,3],
'i':[1,3,5,7,1,0]
})
df.iloc[:,4:7] = df.iloc[:,4:7].applymap('{:.2%}'.format)
print (df)
a b c d e f g h i
0 a a 4 7 500.00% 700.00% 100.00% 7 1
1 b a 5 8 300.00% 800.00% 300.00% 8 3
2 c a 4 9 600.00% 900.00% 500.00% 9 5
3 d b 5 4 900.00% 400.00% 700.00% 4 7
4 e b 5 2 200.00% 200.00% 100.00% 2 1
5 f b 4 3 400.00% 300.00% 0.00% 3 0
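Note that the % format multiplies by 100, which is why the integers above show up as 500.00% and so on. Here is a small sketch with fractional inputs that should run on both older and newer pandas; the frame, the column names and the helper name elementwise are made up for illustration, and the version check relies on DataFrame.map only existing from pandas 2.1.

```python
import pandas as pd

df = pd.DataFrame({
    'name': list('abc'),
    'x': [0.05, 0.5, 1.0],
    'y': [0.125, 0.25, 0.0],
})

cols = ['x', 'y']  # hypothetical columns to format

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions
elementwise = getattr(pd.DataFrame, 'map', None) or pd.DataFrame.applymap

formatted = df.copy()
# Assigning by column label replaces the columns, so the float -> str change is fine
formatted[cols] = elementwise(df[cols], '{:.2%}'.format)
print(formatted)
```

The formatted columns hold strings, so it is usually best to do this only as a final display step.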
I am trying to select a subset of a DataFrame based on the columns of another DataFrame.
The DataFrames look like this:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I want to get all rows of the first Dataframe for the columns which are included in both DataFrames. My result should look like this:
a b
0 0 1
1 4 5
2 8 9
3 12 13
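You can use pd.Index.intersection. Older pandas versions also allowed & as shorthand for it, but in current pandas & on two Index objects is an element-wise logical operation, so the explicit method is the safe choice:
intersection_cols = df1.columns.intersection(df2.columns)
res = df1[intersection_cols]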
import pandas as pd
data1=[[0,1,2,3,],[4,5,6,7],[8,9,10,11],[12,13,14,15]]
data2=[[0,1],[2,3],[4,5],[6,7],[8,9]]
df1 = pd.DataFrame(data=data1,columns=['a','b','c','d'])
df2 = pd.DataFrame(data=data2,columns=['a','b'])
df1[df1.columns.intersection(df2.columns)]
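A boolean mask built with Index.isin is another way to get the same subset, and it always keeps the columns in the order they have in df1. A minimal sketch using the example data from the question (mask and res are just illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]],
                   columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]],
                   columns=['a', 'b'])

# True for every column of df1 whose name also appears in df2
mask = df1.columns.isin(df2.columns)

# Keep all rows, but only the shared columns
res = df1.loc[:, mask]
print(res)
```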
I have a DataFrame with column names in the shape of x.y, where I would like to sum up all columns with the same value on x without having to explicitly name them. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option: extract the prefix from the column names and use it as the grouping variable:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
You can first create a MultiIndex by splitting the column names, then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
x y
1 2 8 92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7
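Newer pandas versions (2.1 and later) warn that groupby with axis=1 is deprecated. A sketch of one way to get the same result without it is to transpose, group the rows by prefix, and transpose back; prefix and result are just illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'x.1': [1, 2, 3, 4], 'x.2': [5, 4, 3, 2],
                   'y.8': [19, 2, 1, 3], 'y.92': [10, 9, 2, 4]})

# Part of each column name before the first dot: 'x', 'x', 'y', 'y'
prefix = df.columns.str.split('.').str[0]

# After transposing, the old columns are rows, so an ordinary row-wise groupby works
result = df.T.groupby(prefix).sum().T
print(result)
```

The double transpose is fine for a frame where all columns share one dtype, as here; with mixed dtypes it may be costly or change the dtypes.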