Extract items delimited with square brackets using python regular expressions - python

I'm trying to split out words/phrases delimited by square brackets using a Python regular expression, keeping everything in the output. The condition is that any section of text beginning and ending with square brackets should become its own element.
This is what I have so far but it doesn't work properly:
import re
t="word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word [xyz 2345"
re.split("(\[)(.*)(\])+",t)
Output:
['word1 word2 3456 ',
'[',
'abc def] [ghi jkl] [1234] [-abcd',
']',
' word [xyz 2345']
I want the output to be something like:
['word1 word2 3456 ',
'[abc def]',
' ',
'[ghi jkl]',
' ',
'[1234]',
' ',
'[-abcd]',
' word [xyz 2345']
Note only the items with both an opening and closing square bracket are split out.
I've also tried this:
re.split("(\[.*\])+",t)
but that only splits on the first opening and the last closing square bracket:
['word1 word2 3456 ', '[abc def] [ghi jkl] [1234] [-abcd]', ' word [xyz 2345']

Use .+? instead of .*:
>>> re.split("(\[.+?\])", t)
['word1 word2 3456 ', '[abc def]', ' ', '[ghi jkl]', ' ', '[1234]', ' ', '[-abcd]', ' word [xyz 2345']
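For contrast, here is a small illustrative snippet (my own addition, not part of the original answer) showing why the greedy .* swallows everything up to the last ] while the lazy .+? stops at the first one:
import re
t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word [xyz 2345"
# Greedy: a single match spanning from the first '[' to the last ']'
print(re.findall(r"\[.*\]", t))   # ['[abc def] [ghi jkl] [1234] [-abcd]']
# Lazy: each bracketed group is matched separately
print(re.findall(r"\[.+?\]", t))  # ['[abc def]', '[ghi jkl]', '[1234]', '[-abcd]']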

You can use this regex to split your strings:
\s(?=\[)|(?<=\])\s
Working demo
But since it splits on those spaces, it consumes them, and your generated output will be:
word1 word2 3456
[abc def]
[ghi jkl]
[1234]
[-abcd]
word 2345
So, as a workaround, you can use the above regex to replace the matches with a custom token like ||| ||| and generate something like:
word1 word2 3456||| |||[abc def]||| |||[ghi jkl]||| |||[1234]||| |||[-abcd]||| |||word 2345
Then you can use the split method with your custom token ||| and it will keep the spaces too:
'word1 word2 3456'
' '
'[abc def]'
' '
'[ghi jkl]'
' '
'[1234]'
' '
'[-abcd]'
' '
'word 2345'
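A minimal Python sketch of that replace-then-split workaround (my own illustration; it assumes the simpler sample string used in this answer, without the trailing unclosed [xyz):
import re
t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"
# Replace each separating space with the custom token, keeping a real space between the two markers
tokenized = re.sub(r"\s(?=\[)|(?<=\])\s", "||| |||", t)
# Splitting on the token keeps those spaces as their own list elements
print(tokenized.split("|||"))
# ['word1 word2 3456', ' ', '[abc def]', ' ', '[ghi jkl]', ' ', '[1234]', ' ', '[-abcd]', ' ', 'word 2345']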

Try this instead:
re.findall(r"[^\]\[]*|\[[^\]\[]*?\]", t)
This will return
['word1 word2 3456 ', '', 'abc def', '', ' ', '', 'ghi jkl', '', ' ', '', '1234', '', ' ', '', '-abcd', '', ' word 2345', '']
To remove the empty strings, do:
list(filter(None, re.findall(r"[^\]\[]*|\[[^\]\[]*?\]", t)))
which returns
['word1 word2 3456 ',
'abc def',
' ',
'ghi jkl',
' ',
'1234',
' ',
'-abcd',
' word 2345']
To explain the regex:
re.compile(r"""
[^\]\[]* # Zero or more characters that aren't [ or ]
| # OR
\[ # a literal [
[^\]\[]*? # Zero or more characters that aren't [ or ]
\] # a literal ]""", re.X)

Related

Regex to extract irregular delimiters

I have a column of data containing id numbers that are between 4 and 10 digits in length. However, these id numbers are manually entered and have no systematic delimiters. In some cases, id numbers are delimited by a comment. With the caveat that the real data is unpredictable, here is an example of values in a python list.
[ '13796352',
'2113146, 2113148, 2113147',
'asdf ee A070_321 on 4.3.99 - MC',
'blah blah3',
'1914844\xa0, 3310339, 1943270, 2190351, 1215262',
'789702/ 89057',
'1 of 5 blah blah',
'688327/ 6712563/> 5425153',
'1820196/1964143/ 249805/ 300510',
'731862\n\nAccepted: 176666\nRejected: 8787' ]
Here is the regex that is not working:
r'^[0-9]{4,10}([\s\S]*)[[0-9]{4,10}]*'
The desired output (looping through the list) is:
[''],
[', ', ', '],
[''],
[''],
['\xa0, ', ', ', ', ', ', '],
['/ '],
[''],
['/ ', '/> '],
['/', '/ ', '/ '],
['\n\nAccepted: ','\nRejected: ']
I am not getting this with the regex above. What am I doing wrong?
If you want to extract the ids, you could use for example:
import re
data = [
'13796352',
'2113146, 2113148, 2113147',
'asdf ee A070_321 on 4.3.99 - MC',
'blah blah3',
'1914844\xa0, 3310339, 1943270, 2190351, 1215262',
'789702/ 89057',
'1 of 5 blah blah',
'688327/ 6712563/> 5425153',
'1820196/1964143/ 249805/ 300510',
'731862\n\nAccepted: 176666\nRejected: 8787'
]
for el in data:
    print(re.findall(r'(?<!\d)\d{4,10}(?!\d)', el))
Resulting in:
['13796352']
['2113146', '2113148', '2113147']
[]
[]
['1914844', '3310339', '1943270', '2190351', '1215262']
['789702', '89057']
[]
['688327', '6712563', '5425153']
['1820196', '1964143', '249805', '300510']
['731862', '176666', '8787']
(?<!\d)\d{4,10}(?!\d) means match a sequence of 4 to 10 digits that is not preceded or followed by a digit.
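To see why the lookarounds matter, here is a small aside (not from the original answer): without them, a run of more than 10 digits yields spurious partial matches instead of being rejected.
import re
s = '12345678901234'  # 14 digits: too long to be a valid id
print(re.findall(r'\d{4,10}', s))               # ['1234567890', '1234'] - the run is chopped up
print(re.findall(r'(?<!\d)\d{4,10}(?!\d)', s))  # [] - the run is rejected entirely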
This is just a quick sketch, but it looks pretty close to what you want. Basically: try to match 4 or more digits, split at the matches, and exclude empty strings as well as entries without any matches.
>>> data = [...] # your sample
>>> num_re = re.compile(r'\d{4,}')
>>> [[x for x in num_re.split(d) if x] if num_re.search(d) else [] for d in data]
[[],
[', ', ', '],
[],
[],
['\xa0, ', ', ', ', ', ', '],
['/ '],
[],
['/ ', '/> '],
['/', '/ ', '/ '],
['\n\nAccepted: ', '\nRejected: ']]

How to escape specific whitespaces when splitting line into words with regex

I want to split a string into a list of words (here "word" means arbitrary sequence of non-whitespace characters), but also keep the groups of consecutive whitespaces that have been used as separators (because the number of whitespaces is significant in my data). For this simple task, I know that the following regex would do the job (I use Python as an illustrative language, but the code can be easily adapted to any language including regexes):
import re
regexA = re.compile(r"(\S+)")
print(regexA.split("aa b+b cc dd! :ee "))
produces the expected output:
['', 'aa', ' ', 'b+b', ' ', 'cc', ' ', 'dd!', ' ', ':ee', ' ']
Now the hard part: when a word includes an opening parenthesis, all the whitespaces encountered until the matching closing parenthesis should not be considered as word separators. In other words:
regexB.split("aa b+b cc(dd! :ee (ff gg) hh) ii ")
should produce:
['', 'aa', ' ', 'b+b', ' ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', ' ']
Using
regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')
works for a single pair of parentheses, but fails when there are inner parentheses. How could I improve the regex to correctly skip inner parentheses?
And the final question: in my data, only words starting with % should be tested for the "parenthesis rule" (regexB), the other words should be treated by regexA. I have no idea how to combine two regexes in a single split.
Any hint is warmly welcome...
The PCRE regex engine supports subroutine calls, so a recursive pattern can handle balanced nested parentheses.
(?m)\s+(?=[^()]*(\([^()]*(?1)?[^()]*\))*[^()]*$)
Demo, in which (?1) means calling subroutine 1, i.e. the group (\([^()]*(?1)?[^()]*\)), making it a recursive pattern that contains its own caller, (?1).
But Python's built-in re module does not support subroutine patterns.
So I first replaced every ( and ) with another distinctive character (# in this example), applied the regex to split, and finally turned each # back into ( or ) in my Python script.
Regex for splitting:
(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)
Demo, in which I changed your separator \S+ to consecutive spaces \s+, because #, ( and ) are all among the characters matched by \S.
Python script may be like this
import re
ss="""aa b+b cc(dd! :ee ((ff gg)) hh) ii """
ss = re.sub(r"\(|\)", "#", ss)  # replacing every `(`, `)` with `#`
regx = re.compile(r"(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)")
m = regx.split(ss)
for i in range(len(m)):  # turn `#` back to `(` or `)` respectively
    n = m[i].count('#')
    if n < 2: continue
    else:
        for j in range(int(n / 2)):
            k = m[i].find('#'); m[i] = m[i][:k] + '(' + m[i][k + 1:]
        m[i] = m[i].replace("#", ')')
print(m)
Output is
['aa', ' ', 'b+b', ' ', 'cc(dd! :ee ((ff gg)) hh)', ' ', 'ii', ' ', '']
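As an aside to the note above that Python's built-in re module lacks subroutine calls: the third-party regex module on PyPI does support recursive patterns, so a balanced group of nested parentheses can be matched directly. A small hedged illustration (my own, not part of this answer):
import regex  # pip install regex -- the third-party module, not the stdlib re
s = "aa b+b cc(dd! :ee ((ff gg)) hh) ii "
# (?R) recursively re-enters the whole pattern, matching arbitrarily nested parentheses
print(regex.findall(r"\((?:[^()]|(?R))*\)", s))  # ['(dd! :ee ((ff gg)) hh)']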
Finally, after having tested several ideas based on the answers proposed by @Wiktor Stribiżew and @Thm Lee, I came to a bunch of solutions dealing with different levels of complexity. To reduce dependencies, I wanted to stick to the re module from the Python standard library, so here is the code:
import re
text = "aa b%b( %cc(dd! (:ee ff) gg) %hh ii) "
# Solution 1: don't process parentheses at all
regexA = re.compile(r'(\S+)')
print(regexA.split(text))
# Solution 2: works for non-nested parentheses
regexB = re.compile(r'(%[^(\s]*\([^)]*\)|\S+)')
print(regexB.split(text))
# Solution 3: works for one level of nested parentheses
regexC = re.compile(r'(%[^(\s]*\((?:[^()]*\([^)]*\))*[^)]*\)|\S+)')
print(regexC.split(text))
# Solution 4: works for arbitrary levels of nested parentheses
n, words = 0, []
for word in regexA.split(text):
    if n: words[-1] += word
    else: words.append(word)
    if n or (word and word[0] == '%'):
        n += word.count('(') - word.count(')')
print(words)
Here is the generated output:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
As stated in the OP, for my specific data, escaping whitespace inside parentheses only has to be done for words starting with %; other parentheses (e.g. the word b%b( in my example) are not considered special. If you want to escape whitespace inside any pair of parentheses, simply remove the % character from the regexes. Here is the result with that modification:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg) %hh ii)', ' ']

Removing all punctuation from a list and return the entire list in Python

I have a list that I'm trying to strip all punctuation and the character "·" from, and then return that list without any of the above. However, when I try to return the list, only the first word appears, and I'm not sure where I went wrong.
Here is the list I'm trying to strip punctuation from:
['in·vis·i·ble', 'in·vis·i·bil·i·ty, ', 'in·vis·i·ble·ness, ', 'in·vis·i·bly, ', 'qua·si-in·vis·i·ble, ', 'qua·si-in·vis·i·bly, ', 'inˌvisiˈbility, ', 'inˈvisibleness, ', 'inˈvisibly, ']
Here's what I'm getting: ['invisible']
Here is a portion of my code (it's part of a larger function)
syl = []
for words in span:
    if words not in syl:
        syl.append(words)
for text in syl:
    drop_sep = re.sub(r'·', '', text)
    return drop_sep
The reason you only get one word back is most likely that the return statement sits inside the for loop, so the function exits after processing the first element. Instead, use a list comprehension where each element of the resulting list is the original string with every occurrence of the '·' character replaced by the empty string '':
[word.replace('·', '') for word in words]
Example
>>> words = ['in·vis·i·ble',
... 'in·vis·i·bil·i·ty, ',
... 'in·vis·i·ble·ness, ',
... 'in·vis·i·bly, ',
... 'qua·si-in·vis·i·ble, ',
... 'qua·si-in·vis·i·bly, ',
... 'inˌvisiˈbility, ',
... 'inˈvisibleness, ',
... 'inˈvisibly, ']
>>>
>>> from pprint import pprint
>>> pprint([word.replace('·', '') for word in words])
['invisible',
'invisibility, ',
'invisibleness, ',
'invisibly, ',
'quasi-invisible, ',
'quasi-invisibly, ',
'inˌvisiˈbility, ',
'inˈvisibleness, ',
'inˈvisibly, ']
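Since the question also asks about stripping ordinary punctuation, the same idea can be extended (a sketch of mine, not part of the original answer) with str.translate and string.punctuation plus the '·' separator:
import string
words = ['in·vis·i·ble', 'in·vis·i·bil·i·ty, ', 'qua·si-in·vis·i·ble, ']
# Translation table that deletes ASCII punctuation characters as well as '·'
table = str.maketrans('', '', string.punctuation + '·')
print([word.translate(table) for word in words])
# ['invisible', 'invisibility ', 'quasiinvisible ']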

matching pattern difference between `([, ]+)` and `([, ])+`

IDLE 1.1.4
>>> import re
>>> some_text = 'alpha, beta,,,,gamma delta'
>>> re.split('[, ]+', some_text)
['alpha', 'beta', 'gamma', 'delta']
# when the pattern doesn't contain parentheses, the returned values
# include only the substrings between the separators, not the separators themselves.
>>> re.split('([, ]+)', some_text)
['alpha', ', ', 'beta', ',,,,', 'gamma', ' ', 'delta']
# returned values include separators and I can guess how it works.
>>> re.split('([, ])+', some_text)
['alpha', ' ', 'beta', ',', 'gamma', ' ', 'delta']
# Now I cannot even guess what is going on here.
Question> What is the difference between '([, ]+)' and '([, ])+'?
How does it affect the returned values?
The former places the whole run of " " and "," characters into the group, whereas the latter returns a group containing only the last one.
([, ]+) says "match one or more commas and/or spaces and capture them as a group", thus in your second example you see the whole string of separator characters returned in one group.
([, ])+ says "match one comma or space, and capture one or more repetitions of this". So in your third example, each separator character is captured by its own repetition of the group, and you're only getting the last of these each time.
If there is ,,,, in your string: with the pattern ([, ]+) the group returns ,,,, whereas with the pattern ([, ])+ it returns just the final ,.
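A small snippet (my own, not from these answers) makes the "only the last repetition is kept" behaviour concrete:
import re
m = re.match(r'([, ])+', ',,,, ')
print(m.group(0))  # ',,,, '  -- the full repeated match
print(m.group(1))  # ' '      -- group 1 holds only the last repetition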
Watch your matching groups.
([, ]+) this matches 1 or more occurrences of , or a space and returns them all, which catches long chains of those characters.
([, ])+ this matches either a space or a , and returns it as a group.
Change your commas to ABCD as below to see it visually:
some_text2 = 'alphaA betaABCDgamma delta'
re.split('([ABCD ])+', some_text2)
['alpha', ' ', 'beta', 'D', 'gamma', ' ', 'delta']
It's actually matching each comma, but as a one-character group. The + turns it into a repeated (greedy) match that keeps going until it no longer matches the characters in the character class.
Try it without the +:
re.split('([ABCD ])', some_text2)
['alpha',
'A',
'',
' ',
'beta',
'A',
'',
'B',
'',
'C',
'',
'D',
'gamma',
' ',
'delta']

Product code looks like abcd2343, how to split by letters and numbers?

I have a list of product codes in a text file, on each line is the product code that looks like:
abcd2343
abw34324
abc3243-23A
So it is letters followed by numbers and other characters.
I want to split on the first occurrence of a number.
import re
s='abcd2343 abw34324 abc3243-23A'
re.split('(\d+)',s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A']
Or, if you want to split on the first occurrence of a digit:
re.findall('\d*\D+',s)
> ['abcd', '2343 abw', '34324 abc', '3243-', '23A']
\d+ matches 1-or-more digits.
\d*\D+ matches 0-or-more digits followed by 1-or-more non-digits.
\d+|\D+ matches 1-or-more digits or 1-or-more non-digits.
Consult the docs for more about Python's regex syntax.
re.split(pat, s) will split the string s using pat as the delimiter. If pat begins and ends with parentheses (so as to be a "capturing group"), then re.split will return the substrings matched by pat as well. For instance, compare:
re.split('\d+', s)
> ['abcd', ' abw', ' abc', '-', 'A'] # <-- just the non-matching parts
re.split('(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A'] # <-- both the non-matching parts and the captured groups
In contrast, re.findall(pat, s) returns only the parts of s that match pat:
re.findall('\d+', s)
> ['2343', '34324', '3243', '23']
Thus, if s ends with a digit, you could avoid ending with an empty string by using re.findall('\d+|\D+', s) instead of re.split('(\d+)', s):
s='abcd2343 abw34324 abc3243-23A 123'
re.split('(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123', '']
re.findall('\d+|\D+', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123']
This function handles float and negative numbers as well.
def separate_number_chars(s):
    res = re.split(r'([-+]?\d+\.\d+)|([-+]?\d+)', s.strip())
    res_f = [r.strip() for r in res if r is not None and r.strip() != '']
    return res_f
For example:
utils.separate_number_chars('-12.1grams')
> ['-12.1', 'grams']
import re
m = re.match(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$",input)
m.group('letters')
m.group('the_rest')
This covers your corner case of abc3243-23A and will output abc for the letters group and 3243-23A for the_rest.
Since you said they are all on individual lines, you'll obviously need to feed input one line at a time.
import re

def firstIntIndex(string):
    # Return the index of the first digit in the string, or -1 if there is none
    result = -1
    for k in range(0, len(string)):
        if bool(re.match(r'\d', string[k])):
            result = k
            break
    return result
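A small usage sketch (my own addition, not part of the answer) showing how such a helper could be used to split a single code:
code = 'abc3243-23A'
i = firstIntIndex(code)
if i != -1:
    letters, rest = code[:i], code[i:]
    print(letters, rest)  # abc 3243-23A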
To partition on the first digit
parts = re.split('(\d.*)','abcd2343') # => ['abcd', '2343', '']
parts = re.split('(\d.*)','abc3243-23A') # => ['abc', '3243-23A', '']
So the two parts are always parts[0] and parts[1].
Of course, you can apply this to multiple codes:
>>> s = "abcd2343 abw34324 abc3243-23A"
>>> results = [re.split('(\d.*)', pcode) for pcode in s.split(' ')]
>>> results
[['abcd', '2343', ''], ['abw', '34324', ''], ['abc', '3243-23A', '']]
If each code is in an individual line then instead of s.split( ) use s.splitlines().
Try this code; it works fine:
import re
text = "MARIA APARECIDA 99223-2000 / 98450-8026"
parts = re.split(r' (?=\d)',text, 1)
print(parts)
Output:
['MARIA APARECIDA', '99223-2000 / 98450-8026']
