matching pattern difference between `([, ]+)` and `([, ])+` - python

IDLE 1.1.4
>>> import re
>>> some_text = 'alpha, beta,,,,gamma delta'
>>> re.split('[, ]+', some_text)
['alpha', 'beta', 'gamma', 'delta']
# when the pattern doesn't contain parentheses, the returned values
# include only the substrings between matches, not the separators.
>>> re.split('([, ]+)', some_text)
['alpha', ', ', 'beta', ',,,,', 'gamma', ' ', 'delta']
# returned values include separators and I can guess how it works.
>>> re.split('([, ])+', some_text)
['alpha', ' ', 'beta', ',', 'gamma', ' ', 'delta']
# Now I cannot even guess what is going on here.
Question> What is the difference between '([, ]+)' and '([, ])+'?
How does it affect the returned values?

The former places all instances of " " and "," in a run into a group, whereas the latter returns a group only containing the last.

([, ]+) says "match one or more commas and/or spaces and capture them as a group", thus in your second example you see the whole string of separator characters returned in one group.
([, ])+ says "match one comma or space, and capture one or more groups of these". So in your third example, each separator character is captured in its own group, and you're only getting the last of these each time.
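
To see this outside of re.split, here is a minimal sketch using re.search on the question's string. It shows what each pattern captures for the first separator run; the variable names inner and outer are illustrative only.

```python
import re

some_text = 'alpha, beta,,,,gamma delta'

# Quantifier inside the group: the group spans the whole run.
inner = re.search(r'([, ]+)', some_text)

# Quantifier outside the group: the group is overwritten on each
# repetition, so only the last repetition survives.
outer = re.search(r'([, ])+', some_text)

print(repr(inner.group(0)), repr(inner.group(1)))  # ', ' ', '
print(repr(outer.group(0)), repr(outer.group(1)))  # ', ' ' '
```

Both patterns match the same text (group 0); they differ only in what group 1 holds, and group 1 is what re.split puts into the result list.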

If your string contains ,,,, then with the pattern ([, ]+) the group returns ,,,, and with the pattern ([, ])+ it returns only , (the last comma).

Watch your matching groups.
([, ]+) this matches 1 or more occurrences of , or a space and returns them all, which catches long chains of those characters.
([, ])+ this matches a single space or , as a group and repeats that group, so only the last character of the run is returned.

Change your commas to ABCD as below to see it visually:
some_text2 = 'alphaA betaABCDgamma   delta'
re.split('([ABCD ])+', some_text2)
['alpha', ' ', 'beta', 'D', 'gamma', ' ', 'delta']
It's actually matching each separator character, but as a one-character group. The + turns it into a greedy match that continues until the next character is no longer in the character class, and the group keeps only the last character matched.
Try without the +
re.split('([ABCD ])', some_text2)
['alpha',
'A',
'',
' ',
'beta',
'A',
'',
'B',
'',
'C',
'',
'D',
'gamma',
' ',
'',
' ',
'',
' ',
'delta']
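
If you want a quantified group and still want the whole run back, one option (a sketch, not something proposed in the answers above) is to make the repeated group non-capturing and wrap the repetition in an outer capturing group. The sample string here assumes three spaces between gamma and delta, as the output above suggests.

```python
import re

some_text2 = 'alphaA betaABCDgamma   delta'

# (?:...) groups without capturing; the outer (...) captures the full run.
parts = re.split(r'((?:[ABCD ])+)', some_text2)
print(parts)
# ['alpha', 'A ', 'beta', 'ABCD', 'gamma', '   ', 'delta']
```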

Related

How to escape specific whitespaces when splitting line into words with regex

I want to split a string into a list of words (here "word" means arbitrary sequence of non-whitespace characters), but also keep the groups of consecutive whitespaces that have been used as separators (because the number of whitespaces is significant in my data). For this simple task, I know that the following regex would do the job (I use Python as an illustrative language, but the code can be easily adapted to any language including regexes):
import re
regexA = re.compile(r"(\S+)")
print(regexA.split("aa b+b cc dd! :ee "))
produces the expected output:
['', 'aa', ' ', 'b+b', ' ', 'cc', ' ', 'dd!', ' ', ':ee', ' ']
Now the hard part: when a word includes an opening parenthesis, all the whitespaces encountered until the matching closing parenthesis should not be considered as word separators. In other words:
regexB.split("aa b+b cc(dd! :ee (ff gg) hh) ii ")
should produce:
['', 'aa', ' ', 'b+b', ' ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', ' ']
Using
regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')
works for a single pair of parentheses, but fails when there are inner parentheses. How could I improve the regex to correctly skip inner parentheses?
And the final question: in my data, only words starting with % should be tested for the "parenthesis rule" (regexB), the other words should be treated by regexA. I have no idea how to combine two regexes in a single split.
Any hint is warmly welcome...
The PCRE regex engine supports subroutine calls, and a recursive pattern seems workable for balanced nested parentheses.
(?m)\s+(?=[^()]*(\([^()]*(?1)?[^()]*\))*[^()]*$)
Here (?1) calls subroutine 1, (\([^()]*(?1)?[^()]*\)), that is, a recursive pattern that contains its own caller, (?1).
But Python's re module does not support subroutine patterns.
So I tried first replacing every ( and ) with another distinctive character (# in this example), then applying the regex to split, and finally turning each # back into ( or ) in my Python script.
Regex for splitting:
(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)
Here I changed your separator \S+ to consecutive whitespace \s+, because #, ( and ) all belong to the set of characters matched by \S.
The Python script may look like this:
import re
ss = """aa b+b cc(dd! :ee ((ff gg)) hh) ii """
ss = re.sub(r"\(|\)", "#", ss)  # replace every `(` and `)` with `#`
regx = re.compile(r"(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)")
m = regx.split(ss)
for i in range(len(m)):  # turn `#` back into `(` or `)` respectively
    n = m[i].count('#')
    if n < 2:
        continue
    for j in range(n // 2):  # the first half of the `#` become `(`
        k = m[i].find('#')
        m[i] = m[i][:k] + '(' + m[i][k+1:]
    m[i] = m[i].replace("#", ')')  # the remaining `#` become `)`
print(m)
Output is
['aa', ' ', 'b+b', ' ', 'cc(dd! :ee ((ff gg)) hh)', ' ', 'ii', ' ', '']
Finally, after testing several ideas based on the answers proposed by @Wiktor Stribiżew and @Thm Lee, I came up with a bunch of solutions dealing with different levels of complexity. To reduce dependencies, I wanted to stick to the re module from the Python standard library, so here is the code:
import re
text = "aa b%b( %cc(dd! (:ee ff) gg) %hh ii) "
# Solution 1: don't process parentheses at all
regexA = re.compile(r'(\S+)')
print(regexA.split(text))
# Solution 2: works for non-nested parentheses
regexB = re.compile(r'(%[^(\s]*\([^)]*\)|\S+)')
print(regexB.split(text))
# Solution 3: works for one level of nested parentheses
regexC = re.compile(r'(%[^(\s]*\((?:[^()]*\([^)]*\))*[^)]*\)|\S+)')
print(regexC.split(text))
# Solution 4: works for arbitrary levels of nested parentheses
n, words = 0, []
for word in regexA.split(text):
    if n: words[-1] += word
    else: words.append(word)
    if n or (word and word[0] == '%'):
        n += word.count('(') - word.count(')')
print(words)
Here is the generated output:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
As stated in the question, for my specific data, whitespace inside parentheses only has to be escaped for words starting with %; other parentheses (e.g. the word b%b( in my example) are not considered special. If you want to escape whitespace inside any pair of parentheses, simply remove the % character from the regexes. Here is the result with that modification:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg) %hh ii)', ' ']
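
Solution 4 can also be wrapped in a reusable function. The sketch below is one possible variant for the case where every parenthesis counts (no % rule). The name split_outside_parens is made up here, and the max(..., 0) clamp is an added assumption so that a stray ) cannot push the depth below zero.

```python
import re

def split_outside_parens(text):
    """Split text into words and whitespace runs, treating whitespace
    inside (possibly nested) parentheses as part of the word."""
    pieces, depth = [], 0
    for piece in re.split(r'(\S+)', text):
        if depth > 0:
            pieces[-1] += piece      # still inside parentheses: glue on
        else:
            pieces.append(piece)     # outside: start a new piece
        # Track nesting depth; clamp so an unmatched ')' is ignored.
        depth = max(depth + piece.count('(') - piece.count(')'), 0)
    return pieces

print(split_outside_parens("aa b+b cc(dd! :ee (ff gg) hh) ii "))
# ['', 'aa', ' ', 'b+b', ' ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', ' ']
```

Because it only counts parentheses per chunk, this stays within the standard re module and works for any nesting depth.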

Python how to split a string into words that contain words with a single quote?

I have a string a, and I would like to get a list b containing the words in a that do not start with @ or # and do not contain any non-word characters.
However, I'm having trouble keeping words like "They're" as a single word. Please note that words like "Okay....so" should be split into two words, "okay" and "so".
I think the problem could be solved by just revising the regular expression. Thanks!
import re
a = "#luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
a = a.split()
b = []
for word in a:
    if word != "" and word[0] != "@" and word[0] != "#":
        for item in re.split(r'\W+\'\W|\W+', word):
            if item != "":
                b.append(item)
            else:
                continue
    else:
        continue
print(b)
It's easier to combine all these rules into one regex:
import re
a = "#luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
b = re.findall(r"(?<![@#])\b\w+(?:'\w+)?", a)
print(b)
Result:
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
The regex works like this:
Checks to make sure that it's not coming right after @ or #, using (?<![@#]).
Checks that it's at the beginning of a word using \b. This is important so that the @/# check doesn't just skip one character and go on.
Matches a sequence of one or more "word" type characters with \w+.
Optionally matches an apostrophe and some more word type characters with (?:'\w+)?.
Note that the fourth step is written that way so that they're will count as one word, but only this, that, and these from this, 'that', these will match.
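
The four steps can be checked on small inputs. This is a sketch: the sample strings are made up here, and the lookbehind is assumed to exclude both @ and #.

```python
import re

word_re = re.compile(r"(?<![@#])\b\w+(?:'\w+)?")

# An apostrophe is kept only when word characters follow it.
print(word_re.findall("this, 'that', these"))
# ['this', 'that', 'these']

# Words glued to a leading # or @ are skipped entirely.
print(word_re.findall("they're #tag @user ok"))
# ["they're", 'ok']
```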
The following code (a) treats .... as a word separator, (b) removes trailing non-word characters, such as question marks and exclamation points, and (c) rejects any words that start with @ or # or otherwise contain non-alpha characters:
import re
a = "#luke5sos are you awake now?!!! me #hashtag time! is# over, now okay....so they're rich....and hopefully available?"
a = a.replace('....', ' ')
a = re.sub('[?!@#$%^&]+( |$)', ' ', a)
result = [w for w in a.split() if w[0] not in '@#' and w.replace("'",'').isalpha()]
print(result)
This produces the following result (over is missing because its trailing comma is not stripped, so "over," fails the isalpha test):
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
import re
v = re.findall(r'(?:\s|^)([\w\']+)\b', a)
Gives (so and and are missed, because they follow dots rather than whitespace):
['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now',
'okay', "they're", 'rich', 'hopefully', 'available']
From what I understand, you don't want words with digits in them and you want to disregard all the other special characters except the single quote. You could try something like this:
import re
a = re.sub(r"[^0-9a-zA-Z']+", ' ', a)
b = a.split()
I haven't been able to test this, but hopefully it works. What I suggest is replacing every run of characters that are not alphanumeric or a single quote with a single space. This results in a string in which the required words are separated by whitespace. Calling split with no argument then splits the string into words and takes care of repeated whitespace as well. Hope it helps.
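
A quick check of this idea, as a sketch on a shortened sample string (the sample is made up here). Note that this approach keeps digits and does not drop words that start with #; it only strips the # itself.

```python
import re

sample = "#tag okay....so they're rich?"

# Every run of characters outside [0-9a-zA-Z'] becomes one space.
cleaned = re.sub(r"[^0-9a-zA-Z']+", ' ', sample)
words = cleaned.split()
print(words)  # ['tag', 'okay', 'so', "they're", 'rich']
```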

Splitting strings but keeping "split" character

I've split a string by [ and ] but I want these characters to still appear. How do I do this?
words = [beginning for ending in x.split('[') for beginning in ending.split(']')]
I think you need re.split to do this easily:
>>> import re
>>> s = 'Hello, my name is [name] and I am [age] years old'
>>> re.split(r'(\[|\])', s)
['Hello, my name is ', '[', 'name', ']', ' and I am ', '[', 'age', ']', ' years old']
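
When two brackets are adjacent, or the string starts or ends with one, re.split also returns empty strings between them. A small sketch (the sample string is made up here) that uses a character class instead of the alternation and drops the empty pieces:

```python
import re

s = '[name][age] years'

# ([\[\]]) captures either bracket; "if p" filters out empty strings.
parts = [p for p in re.split(r'([\[\]])', s) if p]
print(parts)  # ['[', 'name', ']', '[', 'age', ']', ' years']
```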
I would need to know more about the context of your list and what x, beginning, and ending are, but here are some suggestions.
You can add [ and ] to each item in the list, and return a new list, like this:
["[%s]" % s for s in some_list]
Or, string.join will return a string from the items in a list joined by a given string:
"[".join(some_list)

RegEx Tokenizer: split text into words, digits, punctuation, and spacing (do not delete anything)

I almost found the answer to this question in this thread (samplebias's answer); however I need to split a phrase into words, digits, punctuation marks, and spaces/tabs. I also need this to preserve the order in which each of these things occurs (which the code in that thread already does).
So, what I've found is something like this:
from nltk.tokenize import *
txt = "Today it's 07.May 2011. Or 2.999."
regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\S+')
['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
But this is the kind of list I need to yield:
['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']
Regex has always been one of my weak points, so after a couple of hours of research I'm still stumped. Thank you!!
I think that something like this should work for you. There is probably more in that regex than there needs to be, but your requirements are somewhat vague and don't exactly match up with the expected output you provided.
>>> import re
>>> txt = "Today it's \t07.May 2011. Or 2.999."
>>> p = re.compile(r"\d+|[-'a-z]+|[ ]+|\s+|[.,]+|\S+", re.I)
>>> slice_starts = [m.start() for m in p.finditer(txt)] + [None]
>>> [txt[s:e] for s, e in zip(slice_starts, slice_starts[1:])]
['Today', ' ', "it's", ' ', '\t', '07', '.', 'May', ' ', '2011', '.', ' ', 'Or', ' ', '2', '.', '999', '.']
In the regex \w+([.,]\w+)*|\S+, \w+([.,]\w+)* captures words and \S+ captures other non-whitespace.
In order to capture spaces and tabs as well, try this: \w+([.,]\w+)*|\S+|[ \t].
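
As a sketch of that suggestion using only the standard re module (re.findall here stands in for NLTK's regexp_tokenize, which is an assumption of this example): the group is made non-capturing so that findall returns whole matches, and the sample text assumes a space followed by a tab after it's.

```python
import re

txt = "Today it's \t07.May 2011. Or 2.999."

# (?:...) keeps findall returning whole tokens instead of group contents.
token_re = re.compile(r"\w+(?:[.,]\w+)*|\S+|[ \t]")
tokens = token_re.findall(txt)
print(tokens)
# ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.',
#  ' ', 'Or', ' ', '2.999', '.']

# Nothing is deleted: joining the tokens gives back the input.
print(''.join(tokens) == txt)  # True
```

Whitespace other than spaces and tabs (for example newlines) is not covered by this pattern and would be dropped.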
Not fully compliant with the expected output you provided, some more details in the question would help, but anyway:
>>> txt = "Today it's \t07.May 2011. Or 2.999."
>>> regexp_tokenize(txt, pattern=r"\w+([.',]\w+)*|[ \t]+")
['Today', ' ', "it's", ' \t', '07.May', ' ', '2011', ' ', 'Or', ' ', '2.999']

Product code looks like abcd2343, how to split by letters and numbers?

I have a list of product codes in a text file, on each line is the product code that looks like:
abcd2343 abw34324 abc3243-23A
So it is letters followed by numbers and other characters.
I want to split on the first occurrence of a number.
import re
s = 'abcd2343 abw34324 abc3243-23A'
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A']
Or, if you want to split on the first occurrence of a digit:
re.findall(r'\d*\D+', s)
> ['abcd', '2343 abw', '34324 abc', '3243-', '23A']
\d+ matches 1-or-more digits.
\d*\D+ matches 0-or-more digits followed by 1-or-more non-digits.
\d+|\D+ matches 1-or-more digits or 1-or-more non-digits.
Consult the docs for more about Python's regex syntax.
re.split(pat, s) will split the string s using pat as the delimiter. If pat begins and ends with parentheses (so as to be a "capturing group"), then re.split will return the substrings matched by pat as well. For instance, compare:
re.split(r'\d+', s)
> ['abcd', ' abw', ' abc', '-', 'A'] # <-- just the non-matching parts
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A'] # <-- both the non-matching parts and the captured groups
In contrast, re.findall(pat, s) returns only the parts of s that match pat:
re.findall(r'\d+', s)
> ['2343', '34324', '3243', '23']
Thus, if s ends with a digit, you could avoid ending with an empty string by using re.findall(r'\d+|\D+', s) instead of re.split(r'(\d+)', s):
s = 'abcd2343 abw34324 abc3243-23A 123'
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123', '']
re.findall(r'\d+|\D+', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123']
This function handles floats and negative numbers as well.
import re
def separate_number_chars(s):
    res = re.split(r'([-+]?\d+\.\d+)|([-+]?\d+)', s.strip())
    res_f = [r.strip() for r in res if r is not None and r.strip() != '']
    return res_f
For example:
utils.separate_number_chars('-12.1grams')
> ['-12.1', 'grams']
import re
m = re.match(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$",input)
m.group('letters')
m.group('the_rest')
This covers your corner case of abc3243-23A and will output abc for the letters group and 3243-23A for the_rest.
Since you said they are all on individual lines, you'll need to pass one line at a time as input.
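
Feeding one line at a time could look like the sketch below. The helper name split_code and the list lines are made up for illustration; in practice the lines would come from iterating over the opened text file.

```python
import re

CODE_RE = re.compile(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$")

def split_code(line):
    """Return (letters, the_rest) for one product code, or None if the
    line does not start with letters followed by something else."""
    m = CODE_RE.match(line.strip())
    if m is None:
        return None
    return m.group('letters'), m.group('the_rest')

lines = ['abcd2343', 'abw34324', 'abc3243-23A']
print([split_code(line) for line in lines])
# [('abcd', '2343'), ('abw', '34324'), ('abc', '3243-23A')]
```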
def firstIntIndex(string):
    result = -1
    for k in range(0, len(string)):
        if (bool(re.match(r'\d', string[k]))):
            result = k
            break
    return result
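
The same index can be found without an explicit loop. This is a sketch of an alternative, not the answer's own code: re.search returns the first match, and its start() is the position of the first digit. The name first_digit_index is made up here.

```python
import re

def first_digit_index(s):
    """Index of the first digit in s, or -1 if s contains no digit."""
    m = re.search(r'\d', s)
    return m.start() if m else -1

code = 'abc3243-23A'
i = first_digit_index(code)
# Slice at the first digit; with no digit, everything counts as letters.
letters, rest = (code[:i], code[i:]) if i != -1 else (code, '')
print(letters, rest)  # abc 3243-23A
```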
To partition on the first digit
parts = re.split(r'(\d.*)', 'abcd2343') # => ['abcd', '2343', '']
parts = re.split(r'(\d.*)', 'abc3243-23A') # => ['abc', '3243-23A', '']
So the two parts are always parts[0] and parts[1].
Of course, you can apply this to multiple codes:
>>> s = "abcd2343 abw34324 abc3243-23A"
>>> results = [re.split(r'(\d.*)', pcode) for pcode in s.split(' ')]
>>> results
[['abcd', '2343', ''], ['abw', '34324', ''], ['abc', '3243-23A', '']]
If each code is on its own line, use s.splitlines() instead of s.split(' ').
Try this code; it splits once, at the first space that is followed by a digit:
import re
text = "MARIA APARECIDA 99223-2000 / 98450-8026"
parts = re.split(r' (?=\d)', text, maxsplit=1)
print(parts)
Output:
['MARIA APARECIDA', '99223-2000 / 98450-8026']
