Use a different row as labels in pandas after read - python

I need to use the third row as the labels for a dataframe, but keep the first two rows for other uses. How can I replace the column labels of an existing dataframe with the values of one of its rows?
So basically this dataframe
A B C D
1 2 3 4
5 7 8 9
a b c d
6 4 2 1
becomes
a b c d
6 4 2 1
And I cannot just set the headers when the file is read in because I need the first two rows and labels for some processing

One way would be just to take a slice and then overwrite the columns:
In [71]:
df1 = df.loc[3:]
df1.columns = df.loc[2].values
df1
Out[71]:
a b c d
3 6 4 2 1
You can then assign back to df a slice of the rows of interest:
In [73]:
df = df[:2]
df
Out[73]:
A B C D
0 1 2 3 4
1 5 7 8 9
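Put together as a self-contained sketch (the sample frame is rebuilt here from the question's data, read in as strings):

```python
import pandas as pd

# Rebuild the question's frame: two data rows, a header row, then data
df = pd.DataFrame([['1', '2', '3', '4'],
                   ['5', '7', '8', '9'],
                   ['a', 'b', 'c', 'd'],
                   ['6', '4', '2', '1']], columns=list('ABCD'))

# Rows from index 3 onward become the new frame; row 2 supplies its labels
df1 = df.loc[3:].copy()          # .copy() so df1 is independent of df
df1.columns = df.loc[2].values

# The first two rows stay available under the original labels
df = df.loc[:1]
```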

First copy the first two rows into a new DataFrame. Then rename the columns using the data contained in the third row (index 2). Finally, delete the first three rows of data.
import pandas as pd
df = pd.DataFrame({'A': {0: '1', 1: '5', 2: 'a', 3: '6'},
                   'B': {0: '2', 1: '7', 2: 'b', 3: '4'},
                   'C': {0: '3', 1: '8', 2: 'c', 3: '2'},
                   'D': {0: '4', 1: '9', 2: 'd', 3: '1'}})
df2 = df.loc[:1, :].copy()
df.columns = [c for c in df.loc[2, :]]
df.drop(df.index[:3], inplace=True)
>>> df
a b c d
3 6 4 2 1
>>> df2
A B C D
0 1 2 3 4
1 5 7 8 9

Related

Combining sets of multiple rows in a single Dataframe to generate new Dataframe

I'm having trouble thinking of a vectorized (or efficient) solution to a problem involving two large dataframes. One dataframe (df1) is filled with data (floats, ints, and NaNs). The other (df2) contains a column n, where n is the number of rows that I want to combine. The index of each n entry matches the index in df1 where the combining should start, grabbing and combining the n rows from that point on. I.e.:
Input
mylist = [
    {'1': 'A', '2': 'B', '3': '1'},
    {'1': 'C', '2': 'D', '3': pd.NA},
    {'1': 'E', '2': 'F', '3': '3'},
    {'1': 'G', '2': pd.NA, '3': '4'},
    {'1': 'I', '2': 'J', '3': '5'}]
df1 = pd.DataFrame(mylist)
df1
Output
1 2 3
0 A B 1
1 C D <NA>
2 E F 3
3 G <NA> 4
4 I J 5
Input
info_list = [{'n':2},{'n':2}]
df2 = pd.DataFrame(info_list,index=[0,2])
df2
Output
n
0 2
2 2
The index of df2 represents the index in df1 that marks the starting row, and n represents how many total rows I would like to combine into a single row. The resulting dataframe df3 would look like:
Desired Result
1 2 3 4 5 6
0 A B 1 C D <NA>
2 E F 3 G <NA> 4
4 I J 5 NaN NaN NaN
I can accomplish with a complicated iterrows() function, but it's slow over the entire dataframe. Any help appreciated.
You can use the indexes of df2 to indicate where the groupings start. So a group starts at index 0 and another at index 2, and since the sum of n is 4, any rows from position 4 onward in df1 must form their own group(s).
Based on that, you can number the groups using cumsum, and then create an incremental counter per group, which determines how the rows within each group are split.
Once you have that, you can just concatenate the split groups side by side.
import pandas as pd
mylist = [
    {'1': 'A', '2': 'B', '3': '1'},
    {'1': 'C', '2': 'D', '3': pd.NA},
    {'1': 'E', '2': 'F', '3': '3'},
    {'1': 'G', '2': pd.NA, '3': '4'},
    {'1': 'I', '2': 'J', '3': '5'}]
df1 = pd.DataFrame(mylist)

info_list = [{'n': 2}, {'n': 2}]
df2 = pd.DataFrame(info_list, index=[0, 2])

df1 = df1.join(df2)
# rows beyond the range covered by df2 start their own group(s)
df1.iloc[df2['n'].sum():, df1.columns.get_loc('n')] = 1
df1['n'] = (~df1.n.isnull()).cumsum()   # group number per row
df1['n'] = df1.groupby('n').cumcount()  # position of the row within its group
out = pd.concat([x.drop(columns='n').reset_index(drop=True)
                 for _, x in df1.groupby('n')], axis=1)
print(out)
Output
1 2 3 1 2 3
0 A B 1 C D <NA>
1 E F 3 G <NA> 4
2 I J 5 NaN NaN NaN
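To see why this works, it helps to inspect the helper column at each stage. This small sketch reuses the answer's data; the intermediate values in the comments were worked out by hand:

```python
import pandas as pd

df1 = pd.DataFrame([{'1': 'A', '2': 'B', '3': '1'},
                    {'1': 'C', '2': 'D', '3': pd.NA},
                    {'1': 'E', '2': 'F', '3': '3'},
                    {'1': 'G', '2': pd.NA, '3': '4'},
                    {'1': 'I', '2': 'J', '3': '5'}])
df2 = pd.DataFrame([{'n': 2}, {'n': 2}], index=[0, 2])

df1 = df1.join(df2)                  # n: 2, NaN, 2, NaN, NaN
df1.iloc[df2['n'].sum():, df1.columns.get_loc('n')] = 1   # n: 2, NaN, 2, NaN, 1
group = (~df1.n.isnull()).cumsum()   # 1, 1, 2, 2, 3 -- each non-NaN starts a group
within = df1.groupby(group).cumcount()  # 0, 1, 0, 1, 0 -- row's slot inside its group
```

Rows sharing the same `within` value end up concatenated into the same output row.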

How to separate strings from a column in pandas?

I have 2 columns:
A          B
1          ABCSD
2          SSNFs
3 CVY KIP  NaN
4 MSSSQ    NaN
5          ABCSD
6 MMS LLS  NaN
7          QQLL
This is just an example; the actual files contain these kinds of cases across 1000+ rows.
I want to split the alphabetic part off column A and fill it into column B where B is empty:
Expected Output:
A  B
1  ABCSD
2  SSNFs
3  CVY KIP
4  MSSSQ
5  ABCSD
6  MMS LLS
7  QQLL
So far I have tried this, which works, but I am looking for a better way:
import numpy as np

df['B2'] = df['A'].str.split(' ').str[1:]

def try_join(l):
    try:
        return ' '.join(map(str, l))
    except TypeError:
        return np.nan

df['B2'] = [try_join(l) for l in df['B2']]
df = df.replace('', np.nan)
append = df['B2']
df['B'] = df['B'].combine_first(append)
df['A'] = [str(x).split(' ')[0] for x in df['A']]
df.drop(['B2'], axis=1, inplace=True)
df
You could try one of the following.
Either use str.extractall with two named capture groups (generic: (?P<name>...)) called A and B: the first for the digit(s) at the start of the string, the second for the rest. (You can easily adjust these patterns if your actual strings are less straightforward.) Then drop the added index level (1) using df.droplevel.
Or use str.split with n=1 and expand=True and rename the resulting columns (0 and 1 to A and B).
Either option can be passed to df.update with overwrite=True to get the desired outcome.
import pandas as pd
import numpy as np
data = {'A': {0: '1', 1: '2', 2: '3 CVY KIP', 3: '4 MSSSQ',
              4: '5', 5: '6 MMS LLS', 6: '7'},
        'B': {0: 'ABCSD', 1: 'SSNFs', 2: np.nan, 3: np.nan,
              4: 'ABCSD', 5: np.nan, 6: 'QQLL'}}
df = pd.DataFrame(data)
df.update(df.A.str.extractall(r'(?P<A>^\d+)\s(?P<B>.*)').droplevel(1),
          overwrite=True)
# or in this case probably easier:
# df.update(df.A.str.split(pat=' ', n=1, expand=True)
#             .rename(columns={0: 'A', 1: 'B'}), overwrite=True)
df['A'] = df.A.astype(int)
print(df)
A B
0 1 ABCSD
1 2 SSNFs
2 3 CVY KIP
3 4 MSSSQ
4 5 ABCSD
5 6 MMS LLS
6 7 QQLL
You can split on ' ' as it seems that the numeric value is always at the beginning and the text is after a space.
split = df.A.str.split(' ', n=1)
df.loc[df.B.isnull(), 'B'] = split.str[1]
df.loc[:, 'A'] = split.str[0]
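A self-contained version of this approach (the frame is rebuilt from the question's data; n=1 is passed by keyword, since the positional form is deprecated in recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3 CVY KIP', '4 MSSSQ', '5', '6 MMS LLS', '7'],
                   'B': ['ABCSD', 'SSNFs', np.nan, np.nan, 'ABCSD', np.nan, 'QQLL']})

split = df['A'].str.split(' ', n=1)           # e.g. '3 CVY KIP' -> ['3', 'CVY KIP']
df.loc[df['B'].isnull(), 'B'] = split.str[1]  # fill B only where it is empty
df['A'] = split.str[0]                        # keep just the leading number
```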
You could use str.split() if your number appears first.
df['A'].str.split(n=1,expand=True).set_axis(df.columns,axis=1).combine_first(df)
or
df['A'].str.extract(r'(?P<A>\d+) (?P<B>[A-Za-z ]+)').combine_first(df)
Output:
A B
0 1 ABCSD
1 2 SSNFs
2 3 CVY KIP
3 4 MSSSQ
4 5 ABCSD
5 6 MMS LLS
6 7 QQLL

In pandas, how to re-arrange the dataframe to simultaneously combine groups of columns?

I hope someone could help me solve my issue.
Given a pandas dataframe as depicted in the image below,
I would like to re-arrange it into a new dataframe, combining several sets of columns (the sets all have the same size) such that each set becomes a single column, as shown in the desired-result image below.
Thank you in advance for any tips.
For a general solution, you can try one of these two options:
You could try this: use OrderedDict to collect the distinct alphabetic column-name characters in order of first appearance, pd.DataFrame.filter to select the columns whose names contain each letter, and then concatenate the values with pd.DataFrame.stack:
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame([[0,1,2,3,4],[5,6,7,8,9]], columns=['a1','a2','b1','b2','c'])
newdf = pd.DataFrame()
for col in list(OrderedDict.fromkeys(''.join(df.columns)).keys()):
    if col.isalpha():
        newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reset_index(drop=True)
Output:
df
a1 a2 b1 b2 c
0 0 1 2 3 4
1 5 6 7 8 9
newdf
a b c
0 0 2 4
1 1 3 4
2 5 7 9
3 6 8 9
Another way to get the column names could be using re and set like this, and then sort columns alphabetically:
import re
newdf = pd.DataFrame()
for col in set(re.findall(r'[^\W\d_]', ''.join(df.columns))):
    newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reindex(sorted(newdf.columns), axis=1).reset_index(drop=True)
Output:
newdf
a b c
0 0 2 4
1 1 3 4
2 5 7 9
3 6 8 9
You can do this with pd.wide_to_long and rename the 'c' column:
df_out = pd.wide_to_long(df.reset_index().rename(columns={'c': 'c1'}),
                         ['a', 'b', 'c'], 'index', 'no')
df_out = df_out.reset_index(drop=True).ffill().astype(int)
df_out
Output:
a b c
0 0 2 4
1 1 3 4
2 5 7 9
3 6 8 9
Same dataframe; only the sorting is different.
pd.wide_to_long(df, ['a','b'], 'c', 'no').reset_index().drop('no', axis=1)
Output:
c a b
0 4 0 2
1 9 5 7
2 4 1 3
3 9 6 8
The fact that letter c had only one column, versus two for the other letters, made it a bit tricky. I first stacked the dataframe and stripped the numbers from the column names. Then for a and b I pivoted the dataframe and removed all NaNs. For c, I repeated each value so its length matched a and b, and then merged it with them.
input:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a1': {0: 0, 1: 5},
                   'a2': {0: 1, 1: 6},
                   'b1': {0: 2, 1: 7},
                   'b2': {0: 3, 1: 8},
                   'c': {0: 4, 1: 9}})
df
code:
df1 = df.copy().stack().reset_index().replace('[0-9]+', '', regex=True)
dfab = df1[df1['level_1'].isin(['a', 'b'])].pivot(index=0, columns='level_1', values=0) \
       .apply(lambda x: pd.Series(x.dropna().values)).astype(int)
dfc = pd.DataFrame(np.repeat(df['c'].values, 2, axis=0)).rename({0: 'c'}, axis=1)
df2 = pd.merge(dfab, dfc, how='left', left_index=True, right_index=True)
df2
df2
output:
a b c
0 0 2 4
1 1 3 4
2 5 7 9
3 6 8 9
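A further alternative can be sketched under the assumption that every column name is a letter prefix optionally followed by a digit suffix: encode the suffix as a second column level, then stack that level into the rows and forward-fill the short c set.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
                  columns=['a1', 'a2', 'b1', 'b2', 'c'])

# Split each name into (letter, suffix); the bare 'c' gets suffix '1'
prefixes = df.columns.str.extract(r'([a-z]+)(\d*)')
df.columns = pd.MultiIndex.from_arrays([prefixes[0],
                                        prefixes[1].replace('', '1')])

out = df.stack(level=1)  # the suffix level becomes rows
out = out.ffill().astype(int).reset_index(drop=True)
```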

What does "col_level" do in the melt function?

From the documentation:
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
What does col_level do?
Examples with different values of col_level would be great.
My current dataframe is created by the following:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df.columns = [list('ABC'), list('DEF'), list('GHI')]
Thanks.
You can check melt:
col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.
And examples:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
# use MultiIndex.from_arrays to set the level names
df.columns = pd.MultiIndex.from_arrays([list('ABC'), list('DEF'), list('GHI')],
                                       names=list('abc'))
print (df)
a A B C
b D E F
c G H I
0 a 1 2
1 b 3 4
2 c 5 6
#melt by first level of MultiIndex
print (df.melt(col_level=0))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level a of MultiIndex
print (df.melt(col_level='a'))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level c of MultiIndex
print (df.melt(col_level='c'))
c value
0 G a
1 G b
2 G c
3 H 1
4 H 3
5 H 5
6 I 2
7 I 4
8 I 6
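An integer col_level indexes the levels by position, so the same frame can also be melted by its second level (a small sketch continuing the example above):

```python
import pandas as pd

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df.columns = pd.MultiIndex.from_arrays([list('ABC'), list('DEF'), list('GHI')],
                                       names=list('abc'))

# level 1 is the level named 'b', so the variable column holds D, E, F
melted = df.melt(col_level=1)
```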

How can I map to a new dataframe by Multi Index level?

I have a dataframe with columns A, B, C, D and the index is a time series.
I want to create a new dataframe with the same index, but many more columns in a multi index. A, B, C, D are the first level of the multi index. I want every column in the new dataframe to have the same value that A, B, C, D did, according to its multi index level.
In other words, if I have a data frame like this:
A B C D
0 2 3 4 5
1 X Y Z 1
I want to make a new dataframe that looks like this
A B C D
0 1 2 3 4 5 6 7
0 2 2 2 3 3 4 5 5
1 X X X Y Y Z 1 1
In other words - I want to do the equivalent of an "HLOOKUP" in excel, using the first level of the multi-index and looking up on the original dataframe.
The new multi-index is pre-determined.
As suggested by cᴏʟᴅsᴘᴇᴇᴅ in the comments, you can use DataFrame.reindex with the columns and level arguments:
In [35]: mi
Out[35]:
MultiIndex(levels=[['A', 'B', 'C', 'D'], ['0', '1', '2', '3', '4', '5', '6', '7']],
labels=[[0, 0, 0, 1, 1, 2, 3, 3], [0, 1, 2, 3, 4, 5, 6, 7]])
In [36]: df
Out[36]:
A B C D
0 2 3 4 5
1 X Y Z 1
In [37]: df.reindex(columns=mi, level=0)
Out[37]:
A B C D
0 1 2 3 4 5 6 7
0 2 2 2 3 3 4 5 5
1 X X X Y Y Z 1 1
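A self-contained sketch of the same idea. The target MultiIndex here is illustrative; in practice it is whatever pre-determined index you already have:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 'X'], 'B': [3, 'Y'], 'C': [4, 'Z'], 'D': [5, 1]})

# Pre-determined wide MultiIndex: A -> 0,1,2; B -> 3,4; C -> 5; D -> 6,7
mi = pd.MultiIndex.from_arrays([list('AAABBCDD'), list('01234567')])

# level=0 tells reindex to match df's flat columns against the first level,
# broadcasting each original column across all of its sub-columns
wide = df.reindex(columns=mi, level=0)
```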
