How to separate strings from a column in pandas?

I have 2 columns:
   A            B
   1            ABCSD
   2            SSNFs
   3 CVY KIP
   4 MSSSQ
   5            ABCSD
   6 MMS LLS
   7            QQLL
This is only an example; the actual file contains these kinds of cases in 1000+ rows.
I want to separate the letters from the numbers in column A and move them into column B:
Expected Output:
   A            B
   1            ABCSD
   2            SSNFs
   3            CVY KIP
   4            MSSSQ
   5            ABCSD
   6            MMS LLS
   7            QQLL
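So far I have tried this, which works, but I am looking for a better way:
import numpy as np
# everything after the first space-separated token, as a list
df['B2'] = df['A'].str.split(' ').str[1:]
def try_join(l):
    # join the list back into one string; non-list values (NaN) stay NaN
    try:
        return ' '.join(map(str, l))
    except TypeError:
        return np.nan
df['B2'] = [try_join(l) for l in df['B2']]
df = df.replace('', np.nan)
append = df['B2']
# fill the gaps in B with the text taken from A
df['B'] = df['B'].combine_first(append)
# keep only the leading number in A
df['A'] = [str(x).split(' ')[0] for x in df['A']]
df.drop(['B2'], axis=1, inplace=True)
df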

You could try either of the following.
Option 1: use str.extractall with two named capture groups (generic form: (?P<name>...)) called A and B. The first one captures the digit(s) at the start, the second one captures the rest of the string. (You can easily adjust these patterns if your actual strings are less straightforward.) Finally, drop the added index level (1) by using df.droplevel.
Option 2: use str.split with n=1 and expand=True, and rename the resulting columns (0 and 1) to A and B.
Either result can be passed to df.update with overwrite=True to get the desired outcome.
import pandas as pd
import numpy as np
data = {'A': {0: '1', 1: '2', 2: '3 CVY KIP', 3: '4 MSSSQ',
              4: '5', 5: '6 MMS LLS', 6: '7'},
        'B': {0: 'ABCSD', 1: 'SSNFs', 2: np.nan, 3: np.nan,
              4: 'ABCSD', 5: np.nan, 6: 'QQLL'}
        }
df = pd.DataFrame(data)
df.update(df.A.str.extractall(r'(?P<A>^\d+)\s(?P<B>.*)').droplevel(1),
          overwrite=True)
# or in this case probably easier:
# df.update(df.A.str.split(pat=' ', n=1, expand=True)\
#           .rename(columns={0:'A',1:'B'}),overwrite=True)
df['A'] = df.A.astype(int)
print(df)
   A        B
0  1    ABCSD
1  2    SSNFs
2  3  CVY KIP
3  4    MSSSQ
4  5    ABCSD
5  6  MMS LLS
6  7     QQLL
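Why df.update leaves the already-clean rows alone can be checked on a smaller frame. The following is only a sketch: it uses str.extract instead of extractall for brevity, and the shortened data and the name parts are illustrative assumptions, not part of the answer above. Rows that do not match the pattern come back as NaN, and update skips NaN cells.

```python
import numpy as np
import pandas as pd

# shortened, illustrative version of the frame from the question
df = pd.DataFrame({'A': ['1', '2', '3 CVY KIP', '4 MSSSQ'],
                   'B': ['ABCSD', 'SSNFs', np.nan, np.nan]})

# the named groups become the column names A and B;
# rows without "digits, whitespace, text" yield NaN in both columns
parts = df['A'].str.extract(r'^(?P<A>\d+)\s+(?P<B>.*)$')

# update() copies only the non-NaN cells of `parts` into df
df.update(parts)
print(df)
```

Because the non-matching rows are NaN in parts, their existing A and B values survive the update.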

You can split on ' ', since the numeric value always seems to be at the beginning and the text comes after a space. Note that n has to be passed by keyword in recent pandas versions.
split = df.A.str.split(' ', n=1)
df.loc[df.B.isnull(), 'B'] = split.str[1]
df.loc[:, 'A'] = split.str[0]
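To see what the split produces before it is written back, here is a minimal sketch on a few illustrative values (the Series s and the names first and rest are assumptions made for the example). Limiting the split to one cut keeps multi-word text in a single piece, and rows without any text give NaN at position 1.

```python
import pandas as pd

s = pd.Series(['1', '3 CVY KIP', '6 MMS LLS'])  # illustrative values

# at most one split, so 'CVY KIP' is not broken up further
split = s.str.split(' ', n=1)

first = split.str[0]  # the leading number, still stored as a string
rest = split.str[1]   # NaN where nothing follows the number

print(first.tolist())
print(rest.tolist())
```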

You could use str.split() if your number appears first.
df['A'].str.split(n=1,expand=True).set_axis(df.columns,axis=1).combine_first(df)
or
df['A'].str.extract(r'(?P<A>\d+) (?P<B>[A-Za-z ]+)').combine_first(df)
Output:
   A        B
0  1    ABCSD
1  2    SSNFs
2  3  CVY KIP
3  4    MSSSQ
4  5    ABCSD
5  6  MMS LLS
6  7     QQLL

Related

Add a comment row after a row if condition is met in a Pandas Dataframe

I can figure out a way to do it with iterrows or itertuples but I was looking for a more efficient way (probably with a lambda function).
Code and data for the purpose of this question:
import pandas
a = {'a': [1,3,5,7], 'b': [2,4,6,8], 'c': [3,5,7,9]}
b = pandas.DataFrame(a)
b
Out[4]:
   a  b  c
0  1  2  3
1  3  4  5
2  5  6  7
3  7  8  9
If the sum of the values in a row satisfies sum_row % 4 == 0, the program should add a row below that row. The added row is not divided into columns; it consists of a single cell with a comment.
The desired resulting dataframe should look like this:
   a  b  c
0  1  2  3
1  3  4  5
2  sum_row % 4 yields no remainder
3  5  6  7
4  7  8  9
5  sum_row % 4 yields no remainder
Thanks.

This is my interpretation: the asker needs to evaluate mod 4 (% 4) for every row. Here is an old-school loop over the DataFrame using iterrows. Inside the loop you could build a new DataFrame, or add a column D holding the % 4 result.
import pandas as pd
# Loop over the DataFrame
a = {'a': [1, 3, 5, 7], 'b': [2, 4, 6, 8], 'c': [3, 5, 7, 9]}
b = pd.DataFrame(a)
# Print the DataFrame
# print(b)
for index, x in b.iterrows():
    # Print the row
    print(x['a'], x['b'], x['c'])
    result = (x['a'] + x['b'] + x['c']) % 4
    if result == 0:
        # Print the comment for rows whose sum is divisible by 4
        print('sum_row % 4 yields no remainder')

The desired output as shown is not possible. You can do something like this:
import pandas as pd
spam = {'a': [1,3,5,7], 'b': [2,4 ,6, 8], 'c': [3, 5,7,9]}
df = pd.DataFrame(spam)
df['remainder'] = (df.sum(axis=1) % 4).astype(bool)
df['comment'] = df.remainder.apply(lambda x: 'remainder' if x else 'no remainder')
print(df)
output
   a  b  c  remainder       comment
0  1  2  3       True     remainder
1  3  4  5      False  no remainder
2  5  6  7       True     remainder
3  7  8  9      False  no remainder
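If a result that mixes numbers and text in the same column is acceptable, the layout from the question can be approximated by rebuilding the frame row by row. This is only a sketch: the name out and the use of empty strings for the unused cells are assumptions, and the resulting columns have object dtype, so numeric operations on them no longer work directly.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 7], 'b': [2, 4, 6, 8], 'c': [3, 5, 7, 9]})

rows = []
for values in df.itertuples(index=False):
    rows.append(list(values))
    # add a comment row after every row whose sum is divisible by 4
    if sum(values) % 4 == 0:
        rows.append(['sum_row % 4 yields no remainder', '', ''])

# the comment text sits in the first column; the other cells are empty strings
out = pd.DataFrame(rows, columns=df.columns)
print(out)
```

Keeping the comment in a separate column, as in the answer above, remains the safer choice when the numbers are needed for further calculations.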

In pandas, how to re-arrange the dataframe to simultaneously combine groups of columns?

I hope someone could help me solve my issue.
Given a pandas dataframe as depicted in the image below,
I would like to re-arrange it into a new dataframe, combining several sets of columns (the sets have all the same size) such that each set becomes a single column as shown in the desired result image below.
Thank you in advance for any tips.

For a general solution, you can try one of these two options:
You could use OrderedDict to get the letters of the column names in order of first appearance, pd.DataFrame.filter to select the columns whose names contain each letter, and pd.DataFrame.stack to concatenate their values:
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame([[0,1,2,3,4],[5,6,7,8,9]], columns=['a1','a2','b1','b2','c'])
newdf = pd.DataFrame()
# unique characters of all column names, in order of first appearance
for col in list(OrderedDict.fromkeys(''.join(df.columns)).keys()):
    if col.isalpha():
        # stack every column whose name contains this letter into one column
        newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reset_index(drop=True)
Output:
df
   a1  a2  b1  b2  c
0   0   1   2   3  4
1   5   6   7   8  9
newdf
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
Another way to get the column names is with re and set, iterating over the letters in sorted order:
import re
newdf = pd.DataFrame()
# sorted() makes the iteration order deterministic; set order is arbitrary
for col in sorted(set(re.findall(r'[^\W\d_]', ''.join(df.columns)))):
    newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reset_index(drop=True)
Output:
newdf
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9

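You can do this with pd.wide_to_long after renaming the 'c' column to 'c1' so that it follows the same naming pattern as the others:
df_out = pd.wide_to_long(df.reset_index().rename(columns={'c':'c1'}),
                         ['a','b','c'],'index','no')
# sort by original row first so that ffill copies c within each row
df_out = df_out.sort_index().reset_index(drop=True).ffill().astype(int)
df_out
Output:
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9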
Alternatively, use c as the id column. The result is the same dataframe; only the sorting is different.
pd.wide_to_long(df, ['a','b'], 'c', 'no').reset_index().drop('no', axis=1)
Output:
   c  a  b
0  4  0  2
1  9  5  7
2  4  1  3
3  9  6  8

The fact that c has only one column while the other letters have two makes this a bit tricky. I first stacked the dataframe and stripped the digits from the column names. Then, for a and b, I pivoted the result and removed all NaNs. For c, I repeated each value twice so that its length matches a and b, and then merged it with a and b.
input:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a1': {0: 0, 1: 5},
                   'a2': {0: 1, 1: 6},
                   'b1': {0: 2, 1: 7},
                   'b2': {0: 3, 1: 8},
                   'c': {0: 4, 1: 9}})
df
code:
df1=df.copy().stack().reset_index().replace('[0-9]+', '', regex=True)
dfab = df1[df1['level_1'].isin(['a','b'])].pivot(index=0, columns='level_1', values=0) \
.apply(lambda x: pd.Series(x.dropna().values)).astype(int)
dfc = pd.DataFrame(np.repeat(df['c'].values,2,axis=0)).rename({0:'c'}, axis=1)
df2=pd.merge(dfab, dfc, how='left', left_index=True, right_index=True)
df2
output:
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
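The same idea (repeat the single-column group so that it lines up with the two-column groups) can also be sketched with NumPy arrays instead of a pivot and merge. This is an illustration under stated assumptions: column names are a letter stem followed by optional digits, every group has either the maximum number of columns or exactly one, and the names stems, width and pieces are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
                  columns=['a1', 'a2', 'b1', 'b2', 'c'])

# strip trailing digits: a1 -> a, b2 -> b, c -> c
stems = df.columns.str.replace(r'\d+$', '', regex=True)
# size of the largest group of columns (2 in this example)
width = stems.value_counts().max()

pieces = {}
for stem in dict.fromkeys(stems):  # unique stems, in original order
    block = df.loc[:, stems == stem].to_numpy()
    if block.shape[1] == 1:
        # a single column is repeated so it matches the wider groups
        block = np.repeat(block, width, axis=1)
    # flatten row by row: row 0 values first, then row 1, and so on
    pieces[stem] = block.ravel()

newdf = pd.DataFrame(pieces)
print(newdf)
```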

how to set the index as character for pandas

I am trying to create a pandas df like this post.
df = pd.DataFrame(np.arange(9).reshape(3,3) , columns=list('123'))
df
this piece of code gives
describe() gives
Is there a way to set the name of each row (i.e. the index) in df to 'A', 'B', 'C' instead of 0, 1, 2?

Use df.index:
df.index=['A', 'B', 'C']
print(df)
   1  2  3
A  0  1  2
B  3  4  5
C  6  7  8
A more scalable and general solution is to use a list comprehension (this assumes the default integer index and yields letters for up to 26 rows):
df.index = [chr(ord('a') + x).upper() for x in df.index]
print(df)
   1  2  3
A  0  1  2
B  3  4  5
C  6  7  8
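For frames with more than 26 rows, one possible extension is to continue with two-letter labels, spreadsheet style. The helper below is a hypothetical sketch (letter_labels is not a pandas function) built only on the standard library.

```python
import itertools
import string

import numpy as np
import pandas as pd

def letter_labels(n):
    """Return n labels: A, B, ..., Z, AA, AB, ... (hypothetical helper)."""
    labels = []
    size = 1
    while len(labels) < n:
        # all letter combinations of the current length, in alphabetical order
        for combo in itertools.product(string.ascii_uppercase, repeat=size):
            labels.append(''.join(combo))
            if len(labels) == n:
                break
        size += 1
    return labels

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('123'))
df.index = letter_labels(len(df))
print(df)
```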

Add index parameter in DataFrame constructor:
df = pd.DataFrame(np.arange(9).reshape(3,3),
                  index=list('ABC'),
                  columns=list('123'))
print(df)
   1  2  3
A  0  1  2
B  3  4  5
C  6  7  8

Pandas sort data frame by logical day

I have the following resulting pandas DataFrame:
How can I get this to sort properly? For example, Day 2 should come after Day 1, not after Day 11, as seen in Group 2 below.

set_levels + sort_index
The issue is that your strings are being sorted as strings rather than numerically. First convert the 'Day' index level (the second level) to numeric, then sort by index:
# split by whitespace, take last split, convert to integers
new_index_values = df.index.levels[1].str.split().str[-1].astype(int)
# set 'Day' level
df.index = df.index.set_levels(new_index_values, level='Day')
# sort by index
df = df.sort_index()
print(df)
           Value
Group Day
A     0        1
      2        3
      11       2
B     5        5
      7        6
      10       4
Setup
The above demonstration uses this example setup:
df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Day': ['Day 0', 'Day 11', 'Day 2', 'Day 10', 'Day 5', 'Day 7'],
                   'Value': [1, 2, 3, 4, 5, 6]}).set_index(['Group', 'Day'])
print(df)
              Value
Group Day
A     Day 0       1
      Day 11      2
      Day 2       3
B     Day 10      4
      Day 5       5
      Day 7       6

You need to sort integers instead of strings:
import pandas as pd
x = pd.Series([1,2,3,4,6], index=[3,2,1,11,12])
x.sort_index()
1 3
2 2
3 1
11 4
12 6
dtype: int64
y = pd.Series([1,2,3,4,5], index=['3','2','1','11','12'])
y.sort_index()
1 3
11 4
12 5
2 2
3 1
dtype: int64
I would suggest keeping only numbers in the column instead of strings such as 'Day ...'.
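If the 'Day N' labels have to stay, one way to follow this advice is to sort on a temporary numeric column and drop it afterwards. The sketch below assumes the data is available as a flat frame with Group, Day and Value columns (the asker's actual frame was only shown as an image); day_num and out are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Day': ['Day 0', 'Day 11', 'Day 2', 'Day 10', 'Day 5', 'Day 7'],
                   'Value': [1, 2, 3, 4, 5, 6]})

# pull the number out of each label so it can be compared numerically
df['day_num'] = df['Day'].str.extract(r'(\d+)', expand=False).astype(int)

# sort by group, then by the numeric day; the helper column is dropped again
out = (df.sort_values(['Group', 'day_num'])
         .drop(columns='day_num')
         .set_index(['Group', 'Day']))
print(out)
```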

Use a different row as labels in pandas after read

I need to use the third row as the labels for a dataframe, but keep the first two rows for other uses. How can you change the labels on an existing dataframe to an existing row?
So basically this dataframe
A B C D
1 2 3 4
5 7 8 9
a b c d
6 4 2 1
becomes
a b c d
6 4 2 1
And I cannot just set the headers when the file is read in because I need the first two rows and labels for some processing

One way would be just to take a slice and then overwrite the columns:
In [71]:
df1 = df.loc[3:].copy()
df1.columns = df.loc[2].values
df1
Out[71]:
   a  b  c  d
3  6  4  2  1
You can then assign back to df a slice of the rows of interest:
In [73]:
df = df[:2]
df
Out[73]:
   A  B  C  D
0  1  2  3  4
1  5  7  8  9

First copy the first two rows into a new DataFrame. Then rename the columns using the data contained in the third row (index label 2). Finally, delete the first three rows of data.
import pandas as pd
df = pd.DataFrame({'A': {0: '1', 1: '5', 2: 'a', 3: '6'},
                   'B': {0: '2', 1: '7', 2: 'b', 3: '4'},
                   'C': {0: '3', 1: '8', 2: 'c', 3: '2'},
                   'D': {0: '4', 1: '9', 2: 'd', 3: '1'}})
df2 = df.loc[:1, :].copy()
df.columns = [c for c in df.loc[2, :]]
df.drop(df.index[:3], inplace=True)
>>> df
   a  b  c  d
3  6  4  2  1
>>> df2
   A  B  C  D
0  1  2  3  4
1  5  7  8  9
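The three steps can also be wrapped in a small function that leaves the original frame untouched. This is a sketch only: split_at_header is a hypothetical helper name, and it assumes the index labels are unique so that get_loc returns a single position.

```python
import pandas as pd

def split_at_header(df, header_label):
    """Split df into the rows above header_label and the rows below it,
    using the header row's values as the column names of the lower part."""
    pos = df.index.get_loc(header_label)   # position of the header row
    head = df.iloc[:pos].copy()            # rows kept for other processing
    body = df.iloc[pos + 1:].copy()        # rows that get the new labels
    body.columns = df.iloc[pos].tolist()
    return head, body

df = pd.DataFrame({'A': ['1', '5', 'a', '6'],
                   'B': ['2', '7', 'b', '4'],
                   'C': ['3', '8', 'c', '2'],
                   'D': ['4', '9', 'd', '1']})

head, body = split_at_header(df, 2)
print(head)
print(body)
```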
