Using regex to include or exclude everything between commas (python) - python

I need to know how to exclude words that sit between commas using regex. For example, in "Lobasso, Jr., Sion" I don't want the "Jr.". I have two ideas: use a regex to include only the words between the two commas ("ha,hello,bla" gives "hello"), or to exclude the words between the commas ("he,blabla,lado" gives "helado").

Sometimes people add additional designations to their name. There may also be zero or more whitespace characters before or after a comma. To cover those cases (and avoid having to import re), consider using split() followed by strip():
strings = [
"Lobasso, Jr., Sion",
"Lobasso, Jr., B.Sc., Sion",
"Lobasso , Jr. , B.Sc. , Sion",
"Lobasso,Sion"
]
for string in strings:
    # first field is the surname, last field is the given name
    result = string.split(",")
    print(result[0].strip(), result[-1].strip())

You can exclude everything between commas like this:
import re
print(re.sub(',.*,', '', "Lobasso, Jr., Sion"))
print(re.sub(',.*,', '', "he,blabla,lado"))
Output:
Lobasso Sion
helado

Exclude: result = re.sub(r',([^,]*),', '', string)
>>> print(re.sub(r',[^,]*,', '', "Lobasso, Jr., Sion"))
Lobasso Sion
>>> print(re.sub(r',[^,]*,', '', "he,blabla,lado"))
helado
Include: result = ''.join(re.findall(r',([^,]*),', string))
>>> print(''.join(re.findall(r',([^,]*),', "ha,hello,bla")))
hello
In both cases, the regex is of the pattern
r',([^,]*),'
( )     a capture group (only necessary in Include), containing
*       zero or more occurrences of
[^,]    any character other than ','
, ,     with a ',' on both sides
If a regex contains exactly one capture group, re.findall() returns only what that group captured instead of the entire match, so the Include expression yields just the text matched by [^,]*, the thing between the commas. re.sub() always replaces the entire match, commas included, which is why the group is optional in Exclude.
To include, we find all the occurrences of text surrounded by commas, take them out, and then use ''.join() to stitch them back together without anything in between.
To exclude, we replace all occurrences of text surrounded by commas, and the surrounding commas, with the empty string.
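The greedy pattern ',.*,' and the negated-class pattern ',[^,]*,' shown above agree on the two sample strings, but they diverge once more than one designation sits between the first and last comma. A minimal sketch, reusing one of the sample names from the split() answer (the variable names are only illustrative):

```python
import re

name = "Lobasso, Jr., B.Sc., Sion"

# Greedy: .* runs from the first comma all the way to the last one.
greedy = re.sub(r',.*,', '', name)

# Negated class: each match stops at the next comma, and the comma that
# closes one match cannot also open the following one.
negated = re.sub(r',[^,]*,', '', name)

print(greedy)   # Lobasso Sion
print(negated)  # Lobasso B.Sc., Sion
```

Which behaviour is wanted depends on the data; for names with several designations the greedy form (or the split() approach) is probably closer to the intent.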

Related

Regex match two characters following each other

I have a string with several spaces followed by commas in a pandas column. These are how the strings are organized.
original_string = "okay, , , , humans"
I want to remove the spaces and the subsequent commas so that the string will be:
goodstring = "okay,humans"
But when I use this regex pattern: [\s,]+ what I get is different. I get
badstring = "okayhumans".
It removes the comma after okay but I want it to be like in goodstring.
How can I do that?
Replace:
[\s,]*,[\s,]*
With:
,
[\s,]* - 0+ leading whitespace-characters or comma;
, - A literal comma (ensure we don't replace a single space);
[\s,]* - 0+ trailing whitespace-characters or comma.
In Pandas, this would translate to something like:
df[<YourColumn>].str.replace(r'[\s,]*,[\s,]*', ',', regex=True)
You have two issues with your code:
Since [\s,]+ matches any combination of spaces and commas (e.g. single comma ,) you should not remove the match but replace it with ','
[\s,]+ matches any combination of spaces and commas, e.g. just a space ' '; it is not what we are looking for, we must be sure that at least one comma is present in the match.
Code:
import re
text = 'okay, , ,,,, humans! A,B,C'
result = re.sub(r'\s*,[\s,]*', ',', text)
Pattern:
\s* - zero or more (leading) whitespaces
, - comma (we must be sure that we have at least one comma in a match)
[\s,]* - arbitrary combination of spaces and commas
Please try this:
re.sub(r'[,\s]+', ',', original_string)
This replaces each run of commas and whitespace, such as ",[space],", with a single ",".
You could use substitution:
import re
pattern = r'[\s,]+'
original_string = "okay, , , , humans"
re.sub(pattern, ',', original_string)
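The answers above differ on whether a comma is required inside the match. A small sketch of what that changes, using a made-up input that also contains ordinary spaces between words (the input string and variable names are assumptions for illustration):

```python
import re

text = "okay, , , , humans are here"

# Any run of whitespace and/or commas: also hits the plain spaces between words.
loose = re.sub(r'[\s,]+', ',', text)

# Requires a literal comma inside the match, so plain spaces survive.
strict = re.sub(r'\s*,[\s,]*', ',', text)

print(loose)   # okay,humans,are,here
print(strict)  # okay,humans are here
```

On the original example both give "okay,humans"; the difference only shows up when the text has spaces that are not next to a comma.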

Find all occurrences of regex pattern, but ignore occurrences that contain another pattern

I have a block of text that I'm trying to parse:
「<%sM_item2><%sM_plusnum2>の| <%sM_slot>の部分を| <%sM_change_color>に カラーリングするのですね?|<br>|「それでは <%sM_item>が 10本と| <%nM_gold>ゴールドが必要ですが よろしいですか?|<yesno><close>
In this block of text, I'm trying to regex split on all occurrences of <???>, EXCEPT for when it matches on <%???>.
I have it mostly working with this:
re.split(r'<((?!%).+?)>', source_text)
['「<%sM_item2><%sM_plusnum2>の|\u3000<%sM_slot>の部分を|\u3000<%sM_change_color>に\u3000カラーリングするのですね?|', 'br', '|「それでは\u3000<%sM_item>が\u300010本と|\u3000<%nM_gold>ゴールドが必要ですが\u3000よろしいですか?|', 'yesno', '', 'close', '']
My problem is although it kept the <%???> tags in place, it somehow stripped the <> characters from the matches (notice 'yesno', 'close', and 'br' tags no longer have those characters).
Based on the documentation of re.split:
Split string by the occurrences of pattern. If capturing parentheses are used
in pattern, then the text of all groups in the pattern are also returned as
part of the resulting list.
In this case, my parentheses need to be placed around the whole match, angle brackets included, to preserve the <>.
re.split(r'(<(?!%).+?>)', source_text)
['「<%sM_item2><%sM_plusnum2>の|\u3000<%sM_slot>の部分を|\u3000<%sM_change_color>に\u3000カラーリングするのですね?|', '<br>', '|「それでは\u3000<%sM_item>が\u300010本と|\u3000<%nM_gold>ゴールドが必要ですが\u3000よろしいですか?|', '<yesno>', '', '<close>', '']
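The effect of where the capture group sits can be seen on a much shorter string. This is only a sketch; the sample text is a made-up stand-in for the real block:

```python
import re

sample = "a<%x>b<br>c<yes><no>"  # hypothetical stand-in for source_text

# No capture group: the matched tags are dropped from the result.
no_group = re.split(r'<(?!%).+?>', sample)

# Group inside the brackets: only the tag names are returned.
inner_group = re.split(r'<((?!%).+?)>', sample)

# Group around the whole tag: the brackets are kept.
outer_group = re.split(r'(<(?!%).+?>)', sample)

print(no_group)     # ['a<%x>b', 'c', '', '']
print(inner_group)  # ['a<%x>b', 'br', 'c', 'yes', '', 'no', '']
print(outer_group)  # ['a<%x>b', '<br>', 'c', '<yes>', '', '<no>', '']
```

The empty strings come from adjacent tags and from a tag at the very end; they can be filtered out afterwards if they are not needed.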

Splitting whitespace string into list but not splitting whitespace in quotes and also allow special characters (like $, %, etc) in quotes in Python

s = 'hello "ok and #com" name'
s.split()
Is there a way to split this into a list on whitespace, without splitting the whitespace inside quotes, while still allowing special characters inside the quotes?
["hello", '"ok and #com"', "name"]
I want it to be able to output like this but also allow the special characters in it no matter what.
Can someone help me with this?
(I've looked at other posts that are related to this, but those posts don't allow the special characters when I have tested it.)
You can do it with re.split(). Regex pattern from: https://stackoverflow.com/a/11620387/42346
import re
re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)',s)
Returns:
['hello', '"ok and #com"', 'name']
Explanation of regex:
\s+ # match whitespace
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
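In other words, the lookahead only allows a split where an even number of double quotes remains up to the end of the string. A short sketch on a made-up string with two quoted sections, assuming the quotes are balanced (the constant name is illustrative):

```python
import re

# Split on whitespace only when an even number of quotes remains ahead.
OUTSIDE_QUOTES = re.compile(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)')

s = 'a "b c" d "e $ f %" g'
parts = OUTSIDE_QUOTES.split(s)

print(parts)  # ['a', '"b c"', 'd', '"e $ f %"', 'g']
```

With an unbalanced quote the parity check no longer lines up with the quoted regions, so the result may not be what is expected.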
One option is to use regular expressions to capture the strings in quotes, delete them, and then split the remaining text on whitespace. Note that this won't work if the order of the resulting list matters.
import re
items = []
s = 'hello "ok and #com" name'
# regex to find quoted strings
patt = re.compile(r'(".*?")')
# collect every quoted string (re.search would stop at the first one)
items.extend(patt.findall(s))
# split on whitespace after removing quoted strings
items.extend(patt.sub('', s).split())
>>> items
['"ok and #com"', 'hello', 'name']
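The ordering caveat can be avoided by matching tokens instead of deleting them. The following is a sketch of a different technique from the answer above, not a drop-in replacement; it assumes every quoted section starts at the beginning of a token and that quotes are balanced (the names are illustrative):

```python
import re

# A token is either a double-quoted run or a run of non-whitespace.
TOKEN = re.compile(r'"[^"]*"|\S+')

s = 'hello "ok and #com" name'
tokens = TOKEN.findall(s)

print(tokens)  # ['hello', '"ok and #com"', 'name']
```

Because findall() scans left to right, the tokens come back in their original order.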

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But, if i could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, that have numbers in their name separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
Let me explain a little, based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The pattern (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, which covers your requirement (a ',' with non-numeric characters on either side) while leaving commas between two digits, as in 2,4,5-TP, alone.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
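The pattern explained above can be wrapped in a small helper and checked against both sample entries. A minimal sketch; the function and constant names are made up for illustration:

```python
import re

# Comma preceded by a non-digit, or comma followed by a non-digit.
SEPARATOR = re.compile(r'(?<=\D),\s*|\s*,(?=\D)')

def split_chemicals(entry):
    """Split on commas that have a non-digit on at least one side."""
    return SEPARATOR.split(entry)

print(split_chemicals("2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"))
# ['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
print(split_chemicals("Lead,Paints/Pigments,Zinc"))
# ['Lead', 'Paints/Pigments', 'Zinc']
```

Note that a comma with a digit on only one side is still treated as a separator, which may or may not suit every entry in a large data set.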
Use regex and lookbehind/lookahead assertions:
>>> import re
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']

Splitting strings separated by multiple possible characters?

...note that values will be delimited by one or more space or TAB characters
How can I use the split() method if there are multiple separating characters of different types, as in this case?
By default, split() handles multiple types of whitespace. I'm not sure if it's enough for what you need, but try it:
>>> s = "a \tb c\t\t\td"
>>> s.split()
['a', 'b', 'c', 'd']
It certainly works for multiple spaces and tabs mixed.
Split using regular expressions and not just one separator:
http://docs.python.org/2/library/re.html
I had the same problem with some strings separated by different whitespace chars, and used \s as shown in the Regular Expressions library specification.
\s matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v].
You will need to import re, the regular expression module:
import re
line = "something separated\t by \t\t\t different \t things"
workstr = re.sub(r'\s+', '\t', line)
So any run of one or more (+) whitespace characters (\s) is replaced by a single tab (\t), which you can then process with split('\t'):
# workstr is now "something\tseparated\tby\tdifferent\tthings"
newline = workstr.split('\t')
# newline is now ['something', 'separated', 'by', 'different', 'things']
Do a text substitution first then split.
e.g. replace all tabs with spaces, then split on space.
You can use regular expressions first:
import re
re.sub(r'\s+', ' ', 'text with whitespace etc').split()
['text', 'with', 'whitespace', 'etc']
For whitespace delimiters, str.split() already does what you may want. From the Python Standard Library,
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and ' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
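For values delimited by one or more space or TAB characters, the two approaches discussed above can be compared side by side. A small sketch with a made-up input line (the variable names are illustrative):

```python
import re

line = "  12.5\t\t7  abc \t x "

# str.split() with no argument: whitespace runs collapse, ends are trimmed.
plain = line.split()

# re.split on spaces/TABs only, after trimming the ends first.
with_regex = re.split(r'[ \t]+', line.strip())

# Without strip(), leading/trailing delimiters yield empty strings.
untrimmed = re.split(r'[ \t]+', line)

print(plain)       # ['12.5', '7', 'abc', 'x']
print(with_regex)  # ['12.5', '7', 'abc', 'x']
print(untrimmed)   # ['', '12.5', '7', 'abc', 'x', '']
```

The regex version is mainly useful when only some whitespace characters should count as delimiters; otherwise plain split() is simpler.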
