I have a string with several spaces followed by commas in a pandas column. This is how the strings are organized:
original_string = "okay, , , , humans"
I want to remove the spaces and the subsequent commas so that the string will be:
goodstring = "okay,humans"
But when I use this regex pattern: [\s,]+ what I get is different. I get
badstring = "okayhumans".
It removes the comma after okay but I want it to be like in goodstring.
How can I do that?
Replace:
[\s,]*,[\s,]*
With:
,
[\s,]* - 0+ leading whitespace characters or commas;
, - a literal comma (this ensures we don't replace a lone space);
[\s,]* - 0+ trailing whitespace characters or commas.
In Pandas, this would translate to something like:
df[<YourColumn>].str.replace(r'[\s,]*,[\s,]*', ',', regex=True)
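For a quick sanity check, here is a minimal, self-contained sketch; the column name text is only an assumption for this demo:

import pandas as pd

# Hypothetical column name "text", used only for this demo
df = pd.DataFrame({"text": ["okay, , , , humans"]})
df["text"] = df["text"].str.replace(r'[\s,]*,[\s,]*', ',', regex=True)
print(df["text"].iloc[0])  # okay,humans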
You have two issues with your code:
Since [\s,]+ matches any combination of spaces and commas (e.g. a single comma ,), you should not remove the match but replace it with ','.
[\s,]+ matches any combination of spaces and commas, e.g. just a space ' '; that is not what we are looking for: we must be sure that at least one comma is present in the match.
Code:
import re
text = 'okay, , ,,,, humans! A,B,C'
result = re.sub(r'\s*,[\s,]*', ',', text)
print(result)
Pattern:
\s* - zero or more (leading) whitespaces
, - comma (we must be sure that we have at least one comma in a match)
[\s,]* - arbitrary combination of spaces and commas
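With the sample text above, this should print:
okay,humans! A,B,C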
Please try this
re.sub(r'[,\s]+', ',', original_string)
You want to replace ",[space]," with ",".
You could use substitution:
import re

pattern = r'[\s,]+'
original_string = "okay, , , , humans"
print(re.sub(pattern, ',', original_string))
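This should print:
okay,humans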
Related
I need to know how to exclude words that are in between commas using regex, i.e., "Lobasso, Jr., Sion" (I don't want the Jr.). So I have two ideas: use regex to include only the words that are in between the two commas, "ha,hello,bla" (hello), or to exclude the words that are between the commas, "he,blabla,lado" (helado).
Sometimes people will add additional designations to their name. There may also be 0 or more whitespaces before/after a comma. To cover those cases (and avoid having to import re), consider using split() followed by strip():
strings = [
    "Lobasso, Jr., Sion",
    "Lobasso, Jr., B.Sc., Sion",
    "Lobasso , Jr. , B.Sc. , Sion",
    "Lobasso,Sion"
]

for string in strings:
    result = string.split(",")
    print(result[0].strip(), result[-1].strip())
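Each of these cases should print the same first/last pair:
Lobasso Sion
Lobasso Sion
Lobasso Sion
Lobasso Sion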
You can exclude everything between commas like this:
import re

print(re.sub(',.*,', '', "Lobasso, Jr., Sion"))
print(re.sub(',.*,', '', "he,blabla,lado"))
Output:
Lobasso Sion
helado
Exclude: result = re.sub(r',([^,]*),', '', string)
>>> print(re.sub(r',[^,]*,', '', "Lobasso, Jr., Sion"))
Lobasso Sion
>>> print(re.sub(r',[^,]*,', '', "he,blabla,lado"))
helado
Include: result = ''.join(re.findall(r',([^,]*),', string))
>>> print(''.join(re.findall(r',([^,]*),', "ha,hello,bla")))
hello
In both cases, the regex is of the pattern
r',([^,]*),'
( ) - a capture group, containing (the parentheses are only necessary in Include)
* - zero or more occurrences of
[^,] - any character other than ','
, , - with a ',' on both sides
If a regex contains exactly one capture group then re.findall() will return only whatever is found in that capture group instead of what's in the entire matching string, so in this case both expressions will act on whatever was matched by [^,]* - the thing between the commas.
To include, we find all the occurrences of text surrounded by commas, take them out, and then use ''.join() to stitch them back together without anything in between.
To exclude, we replace all occurrences of text surrounded by commas, and the surrounding commas, with the empty string.
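Putting both together in one self-contained snippet (using the sample strings from the question):

import re

names = ["Lobasso, Jr., Sion", "he,blabla,lado", "ha,hello,bla"]
for name in names:
    excluded = re.sub(r',([^,]*),', '', name)           # drop the part between the commas
    included = ''.join(re.findall(r',([^,]*),', name))  # keep only the part between the commas
    print(excluded, '|', included)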
I have a DataFrame with a list of strings, as below:
df
text
,info_concern_blue,replaced_mod,replaced_rad
,info_concern,info_concern_red,replaced_unit
,replaced_link
I want to replace all words starting with info_concern (e.g. info_concern_blue, info_concern_red) with just info_concern, i.e. drop everything after info_concern until it encounters a comma.
I tried the following regex:
df['replaced_text'] = [re.sub(r'info_concern[^,]*.+?,', 'info_concern,', x) for x in df['text']]
But this is giving me incorrect results.
Desired output:
replaced_text
,info_concern,replaced_mod,replaced_rad
,info_concern,info_concern,replaced_unit
,replaced_link
Please suggest/advise.
You can use
df['replaced_text'] = df['text'].str.replace(r'(info_concern)[^,]*', r'\1', regex=True)
If you want to make sure the match starts right after a comma or start of string, add the (?<![^,]) lookbehind at the start of the pattern:
df['replaced_text'] = df['text'].str.replace(r'(?<![^,])(info_concern)[^,]*', r'\1', regex=True)
Details:
(?<![^,]) - right before, there should be either , or start of string
(info_concern) - Group 1: info_concern string
[^,]* - zero or more chars other than a comma.
The \1 replacement replaces the match with Group 1 value.
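As a quick check with plain re.sub on the sample rows from the question, this should reproduce the desired output:

import re

samples = [
    ",info_concern_blue,replaced_mod,replaced_rad",
    ",info_concern,info_concern_red,replaced_unit",
    ",replaced_link"
]
for s in samples:
    print(re.sub(r"(?<![^,])(info_concern)[^,]*", r"\1", s))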
The issue is that the pattern info_concern[^,]*.+?, first matches up to (but not including) the first comma using [^,]*.
Then the .+?, part matches at least one character (which can also be a comma, due to the .) and continues up to and including the next comma.
So if there is a second comma, it will overmatch and remove too much.
You could also assert info_concern to the left, and match any chars except a comma, to be replaced with an empty string.
If there has to be a comma to the right, you can assert it.
(?<=\binfo_concern)[^,]*(?=,)
The pattern matches:
(?<=\binfo_concern) Positive lookbehind, assert info_concern to the left
[^,]* Match 0+ times any char except ,
(?=,) Positive lookahead, assert , directly to the right
If the comma is not mandatory, you can omit the lookahead
(?<=\binfo_concern)[^,]*
For example
import pandas as pd
texts = [
    ",info_concern_blue,replaced_mod,replaced_rad",
    ",info_concern,info_concern_red,replaced_unit",
    ",replaced_link"
]
df = pd.DataFrame(texts, columns=["text"])
df['replaced_text'] = df['text'].str.replace(r'(?<=\binfo_concern)[^,]*(?=,)', '', regex=True)
print(df)
Output
text replaced_text
0 ,info_concern_blue,replaced_mod,replaced_rad ,info_concern,replaced_mod,replaced_rad
1 ,info_concern,info_concern_red,replaced_unit ,info_concern,info_concern,replaced_unit
2 ,replaced_link ,replaced_link
I need to truncate a string at the special characters '-', '(', '/' when they have one leading whitespace, i.e. ' -', ' (', ' /'.
How can I do that?
patterns = r'[-/()]'
try:
    return row.split(re.findall(patterns, row)[0], 1)[0]
except:
    return row
The above code picks up all the special characters, but without requiring the leading space.
patterns=r'[s-/()]'
This one does not work.
Try this pattern
patterns=r'^\s[-/()]'
or remove ^ depending on your needs.
It looks like you want to get the part of the string before the first occurrence of the \s[-(/] pattern.
Use
return re.sub(r'\s[-(/].*', '', row)
This code returns the row string without all the chars from the first occurrence of a whitespace (\s) followed by -, ( or / ([-(/]) onwards.
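A minimal sketch of how this could be used; the function name and the sample string are only hypothetical, for illustration:

import re

def truncate(row):
    # Hypothetical helper: drop everything from the first " -", " (" or " /" onwards
    return re.sub(r'\s[-(/].*', '', row)

print(truncate("Some product - extra info"))  # Some product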
Please try this pattern patterns = r'\s+-|\s\/|\s\(|\s\)'
I want to split a string only if there's a space before and after that character. In my case the character is the dash, i.e. '-'.
Example
Opzione - AAAA-11
Should be split into
Opzione AAAA-11
and not into
Opzione AAAA 11
Language is Python.
Thanks
You can use lookaround
(?<=\s)-(?=\s)
(?<=\s) -> Positive lookbehind; checks for a preceding whitespace.
- -> Matches -.
(?=\s) -> Positive lookahead; checks for a following whitespace.
On a side note: \s will also match \r, \t and \n. If you want to consider only a literal space, you can use:
(?<= )-(?= )
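A quick sketch of how the pattern could be used with re.split; the strip() is only there to drop the spaces left around each part:

import re

s = "Opzione - AAAA-11"
parts = [p.strip() for p in re.split(r"(?<=\s)-(?=\s)", s)]
print(parts)  # ['Opzione', 'AAAA-11']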
You can do it with regex, but how about a non-regex way using split() and join():
s = 'Opzione - AAAA-11'
result = ' '.join(s.split(' - '))
print(result)
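This should print:
Opzione AAAA-11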
str="Opzione - AAAA-11"
str=re.sub('(\s([\S])\s[\S]?)','',str)
This (\s([\S])\s[\S]?) means: anything except a space, between two spaces, then optionally one more non-whitespace character. With this you will be able to match a case like 'g h h g'.
So, both h characters are between two spaces, but when you match only with \s([\S])\s the second h will not be matched; with (\s([\S])\s[\S]?) both will match.
I have a large list of chemical data that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, which have numbers in their names separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
Let me explain a little bit, based on eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
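Output:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
['Lead', 'Paints/Pigments', 'Zinc']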
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace and followed by non-digit characters. Similarly, (?<=\D),\s* matches a , that is preceded by non-digit characters and followed by zero or more whitespace. The whole regex will find , that satisfy either case, which is equivalent to your requirement: ',' with only two non-numeric characters on either side.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree-structure explanation of your regex)
Use regex with lookbehind/lookahead assertions:
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']