I have a column containing values that I want to split based on a regex. If the regex matches, the original value should be replaced with the left side of the split, and a new column should contain the right side of the split.
Below is some sample code. I feel I am close but it isn't quite working.
import pandas as pd
import re
df = pd.DataFrame({ 'A' : ["test123","foo"]})
# Regex to split a value into a prefix and any trailing digits
r = r"^(.+?)(\d*)$"
df['A'], df['B'] = zip(*df['A'].apply(lambda x: x.split(r, 1)))
print(df)
In the example above I would expect the following output
A B
0 test 123
1 foo
I am fairly new to Python and assumed this would be the way to go. However, it appears that I haven't quite hit the mark. Is anyone able to help me correct this example?
Just based on your own regex (note this also needs import numpy as np):
df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
Out[158]:
1 2
0 test 123
1 foo
df[['A','B']]=df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
df
Out[160]:
A B
0 test 123
1 foo
Your regex is working just fine; use it with str.extract (as a raw string, to avoid an invalid-escape warning on \d):
df = pd.DataFrame({ 'A' : ["test123","foo", "12test3"]})
df[['A', 'B']] = df['A'].str.extract(r"^(.+?)(\d*)$", expand=True)
A B
0 test 123
1 foo
2 12test 3
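For reference, here is a self-contained sketch of this extract approach, using named capture groups so the extracted columns are labelled directly (the names left and right are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ["test123", "foo", "12test3"]})

# Named groups become the column names of the extracted frame
parts = df['A'].str.extract(r"^(?P<left>.+?)(?P<right>\d*)$")
df['A'], df['B'] = parts['left'], parts['right']
```

Rows without trailing digits simply get an empty string in B, since \d* can match zero digits.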
def bar(x):
    els = re.findall(r'^(.+?)(\d*)$', x)[0]
    if len(els):
        return els
    else:
        return x, None

def foo():
    df = pd.DataFrame({'A': ["test123", "foo"]})
    df['A'], df['B'] = zip(*df['A'].apply(bar))
    print(df)
result:
A B
0 test 123
1 foo
Related
I want to split the column on some specific words while keeping the delimiter at the same time. I tried to split the column with str.split, but the result isn't what I want.
example data(test.csv):
a
abc123and321abcor213cba
abc321or123cbaand321cba
my code:
import pandas as pd
df = pd.read_csv('test.csv')
df[['b','c']] = df['a'].str.split("and", n=1, expand=True)
df[['c','d']] = df['c'].str.split("or", n=1, expand=True)
print(df)
my result:
a b c d
0 abc123and321abcor213cba abc123 321abc 213cba
1 abc321or123cbaand321cba abc321or123cba 321cba None
Desired result:
a b c d
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
How can I do this?
Borrowing from Tim's answer, using a lookbehind regex to split on and or or without consuming the separating string in the split:
import pandas as pd

d = {'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]}
df = pd.DataFrame(data=d)
df[["b", "c", "d"]] = df['a'].str.split(r'(?<=and)|(?<=or)', expand=True)
Output:
a b c d
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
If split gives you trouble here, check that your pandas version isn't too old.
You could also use str.extractall and unstack. I'd also recommend using join to add the columns if you don't know the number of matches/columns in advance.
df = pd.DataFrame({'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]})
df.join(df['a'].str.extractall(r'(.*?(?:and|or)|.+$)')[0].unstack('match'))
Output:
a 0 1 2
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
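If you prefer named columns over the 0/1/2 that unstack produces, here is a sketch that relabels the matches before joining (the names b/c/d are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]})

# Each match keeps its delimiter; the final alternative (.+$) catches the tail
parts = df['a'].str.extractall(r'(.*?(?:and|or)|.+$)')[0].unstack('match')
parts.columns = ['b', 'c', 'd']  # illustrative names
out = df.join(parts)
```

This only works as-is when every row has the same number of matches; otherwise the missing cells come back as NaN.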
Try splitting on a lookbehind. Note that Python's re module requires fixed-width lookbehinds, so (?<=and|or) must be written as two alternated lookbehinds:
df[['b', 'c', 'd']] = df['a'].str.split(r'(?<=and)|(?<=or)', expand=True)
I have a column with the headlines of articles. The headlines look like this (in Greek):
[\n, [Μητσοτάκης: Έχει μεγάλη σημασία οι φωτισ..
How can I remove these characters ([\n,)?
I have tried this but nothing happened:
df['Title'].replace('\n', '', regex=True)
.replace() does not change the dataframe by default; it returns a new one. Use the inplace parameter.
>>> import pandas
>>> df = pandas.DataFrame([{"x": "a\n"}, {"x": "b\n"}, {"x": "c\n"}])
>>> df['x'].replace('\n', '', regex=True) # does not change df
0 a
1 b
2 c
Name: x, dtype: object
>>> df # df is unchanged
x
0 a\n
1 b\n
2 c\n
>>> df['x'].replace('\n', '', regex=True, inplace=True)
>>> df # df is changed
x
0 a
1 b
2 c
You're looking for
df['Title'].str.replace('\n', '')
Also remember that this replacement doesn't happen in-place. To change the original dataframe, you're going to have to do
df['Title'] = df['Title'].str.replace('\n', '')
Series.str provides vectorized string functions that operate on each value in the column, so df['Title'].str.replace('\n', '') runs the equivalent of str.replace() on every element.
df.replace(), by contrast, replaces entire cell values that match the given value exactly (unless regex=True is passed).
For example,
data = [{"x": "hello\n"}, {"x": "yello\n"}, {"x": "jello\n"}]
df = pd.DataFrame(data)
# df:
# x
# 0 hello\n
# 1 yello\n
# 2 jello\n
df["x"].str.replace('\n', '')
# returns (df itself is unchanged):
# 0    hello
# 1    yello
# 2    jello
df["x"].replace('yello\n', 'bello\n')
# returns (df itself is unchanged):
# 0    hello\n
# 1    bello\n
# 2    jello\n
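As a side note, df.replace only matches entire cell values in its default (non-regex) mode; passing regex=True switches it to substring substitution, which is why the .replace('\n', '', regex=True) answer above works. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": ["hello\n", "yello\n"]})

# Default mode: only exact, full-cell matches are replaced
out1 = df["x"].replace("hello", "H")          # no change: the cells are "hello\n"

# regex=True: the pattern is applied as a substring regex
out2 = df["x"].replace("\n", "", regex=True)  # newline stripped from every cell
```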
I want to remove everything after the last underscore in a dataframe column. My data looks like this:
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
col1
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')
In [2387]: df
Out[2387]:
col1
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
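The same result can be sketched with str.rsplit, which splits from the right so only the last underscore matters (and values without an underscore pass through unchanged):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['AA_XX', 'AAA_BB_XX', 'AA_A_B_YXX', 'AA']})

# n=1 limits the split to the rightmost underscore; [0] keeps the left part
df['col1'] = df['col1'].str.rsplit('_', n=1).str[0]
```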
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explanation:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply to loop through the column you want to edit.
I split each string at _ and then joined all parts except the last back together with _.
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values like AA (values without an underscore), change the lambda like this:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
Here is another way of going about it.
import pandas as pd
data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)
def cond1(s):
    temp_s = s.split('_')
    if len(temp_s) == 1:
        return s
    return '_'.join(temp_s[:-1])

df['result'] = df['s'].apply(cond1)
I'm new to pandas and trying to figure out how to put two different variables' values into the same column.
import pandas as pd
import requests
from bs4 import BeautifulSoup
itemproducts = pd.DataFrame()
url = 'https://www.trwaftermarket.com/en/catalogue/product/BCH720/'
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')
code_name = soup.find_all('div',{'class':'col-sm-6 intro-section reset-margin'})
for head in code_name:
    item_code = head.find('span', {'class': 'heading'}).text
    item_name = head.find('span', {'class': 'subheading'}).text

for tab_ in tab_4:
    ab = tab_.find_all('td')
    make_name1 = ab[0].text.replace('Make', '')
    code1 = ab[1].text.replace('OE Number', '')
    make_name2 = ab[2].text.replace('Make', '')
    code2 = ab[3].text.replace('OE Number', '')
    itemproducts = itemproducts.append({'CODE': item_code,
                                        'NAME': item_name,
                                        'MAKE': [make_name1, make_name2],
                                        'OE NUMBER': [code1, code2]}, ignore_index=True)
Output (Excel image)
What I actually want
In pandas, all columns must have the same length. So, in this case, I suggest you give each column a fixed-length list, appending NaN to the shorter ones so the lengths match.
I found a similar question here on Stack Overflow that can help you. Another approach is to use the explode method of a pandas DataFrame.
Below I put an example from pandas documentation.
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
>>> df
A B
0 [1, 2, 3] 1
1 foo 1
2 [] 1
3 [3, 4] 1
>>> df.explode('A')
A B
0 1 1
0 2 1
0 3 1
1 foo 1
2 NaN 1
3 3 1
3 4 1
I couldn't reproduce the results from your script. However, based on your end dataframe, perhaps you can make use of explode together with apply on the dataframe at the end:
#creating your dataframe
itemproducts = pd.DataFrame({'CODE':'BCH720','MAKE':[['HONDA','HONDA']],'NAME':['Brake Caliper'],'OE NUMBER':[['43019-SAA-J51','43019-SAA-J50']]})
>>> itemproducts
CODE MAKE NAME OE NUMBER
0 BCH720 ['HONDA', 'HONDA'] Brake Caliper ['43019-SAA-J51', '43019-SAA-J50']
#using apply method with explode on 'MAKE' and 'OE NUMBER'
>>> itemproducts.apply(lambda x: x.explode() if x.name in ['MAKE', 'OE NUMBER'] else x)
CODE MAKE NAME OE NUMBER
0 BCH720 HONDA Brake Caliper 43019-SAA-J51
0 BCH720 HONDA Brake Caliper 43019-SAA-J50
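If your pandas version is 1.3 or newer, DataFrame.explode also accepts a list of columns, so the apply can be sketched more directly (the list-valued columns must have equal lengths within each row):

```python
import pandas as pd

itemproducts = pd.DataFrame({'CODE': ['BCH720'],
                             'MAKE': [['HONDA', 'HONDA']],
                             'NAME': ['Brake Caliper'],
                             'OE NUMBER': [['43019-SAA-J51', '43019-SAA-J50']]})

# Explode both list columns in lockstep; scalar columns are repeated per row
out = itemproducts.explode(['MAKE', 'OE NUMBER'])
```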
I have a dataframe with column a. I need to get the data after the second _.
a
0 abc_def12_0520_123
1 def_ghij123_0120_456
raw_data = {'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']}
df = pd.DataFrame(raw_data, columns = ['a'])
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
What I have tried:
df['b'] = df['a'].str.replace(r'\D+', '')
I tried removing the alphabetic characters first, but it's getting complex. Any suggestions?
Here is how:
df['b'] = ['_'.join(s.split('_')[2:]) for s in df['a']]
print(df)
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
Explanation:
lst = ['_'.join(s.split('_')[2:]) for s in df['a']]
is the equivalent of:
lst = []
for s in df['a']:
a = s.split('_')[2:] # List all strings in list of substrings splitted '_' besides the first 2
lst.append('_'.join(a))
Try:
df['b'] = df['a'].str.split('_', n=2).str[-1]
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
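For completeness, the same result can be sketched with str.extract, capturing everything after the second underscore (expand=False returns a Series so it can be assigned directly to a column):

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']})

# Two non-underscore runs and their underscores, then capture the rest
df['b'] = df['a'].str.extract(r'^[^_]*_[^_]*_(.*)$', expand=False)
```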