Regex to extract irregular delimiters - python

I have a column of data containing id numbers that are between 4 and 10 digits in length. However, these id numbers are manually entered and have no systematic delimiters. In some cases, id numbers are delimited by a comment. With the caveat that the real data is unpredictable, here is an example of values in a python list.
[ '13796352',
'2113146, 2113148, 2113147',
'asdf ee A070_321 on 4.3.99 - MC',
'blah blah3',
'1914844\xa0, 3310339, 1943270, 2190351, 1215262',
'789702/ 89057',
'1 of 5 blah blah',
'688327/ 6712563/> 5425153',
'1820196/1964143/ 249805/ 300510',
'731862\n\nAccepted: 176666\nRejected: 8787' ]
Here is the regex that is not working:
r'^[0-9]{4,10}([\s\S]*)[[0-9]{4,10}]*'
The desired output (looping through the list) is:
[''],
[', ',', '],
[''],
[''],
['\xa0, ',', ',', ',', '],
['/ '],
[''],
['/ ', '/> '],
['/', '/ ', '/ '],
['\n\nAccepted: ','\nRejected: ']
I am not getting this with the regex above. What am I doing wrong?

If you want to extract the ids, you could use for example:
import re
data = [
'13796352',
'2113146, 2113148, 2113147',
'asdf ee A070_321 on 4.3.99 - MC',
'blah blah3',
'1914844\xa0, 3310339, 1943270, 2190351, 1215262',
'789702/ 89057',
'1 of 5 blah blah',
'688327/ 6712563/> 5425153',
'1820196/1964143/ 249805/ 300510',
'731862\n\nAccepted: 176666\nRejected: 8787'
]
for el in data:
    print(re.findall(r'(?<!\d)\d{4,10}(?!\d)', el))
Resulting in:
['13796352']
['2113146', '2113148', '2113147']
[]
[]
['1914844', '3310339', '1943270', '2190351', '1215262']
['789702', '89057']
[]
['688327', '6712563', '5425153']
['1820196', '1964143', '249805', '300510']
['731862', '176666', '8787']
(?<!\d)\d{4,10}(?!\d) means match a sequence of 4 to 10 digits that is not preceded or followed by a digit.
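To see why the lookarounds matter, here is a small comparison against the bare \d{4,10} pattern. The sample string is made up for illustration and is not taken from the question's data.

```python
import re

# Made-up sample: a 12-digit run (too long to be an id) followed by a valid 4-digit id.
sample = 'ref 123456789012 and 4567'

# Without lookarounds, the first 10 digits of the long run are accepted as an id.
naive = re.findall(r'\d{4,10}', sample)

# With lookarounds, a run of digits only matches if it is 4 to 10 digits long as a whole.
strict = re.findall(r'(?<!\d)\d{4,10}(?!\d)', sample)

print(naive)   # ['1234567890', '4567']
print(strict)  # ['4567']
```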

This is just a quick sketch, but it looks pretty close to what you want. Basically, try to match 4 or more digits, split at the matches, drop the empty strings, and return an empty list for entries without any matches.
>>> import re
>>> data = [...] # your sample
>>> num_re = re.compile(r'\d{4,}')
>>> [[x for x in num_re.split(d) if x] if num_re.search(d) else [] for d in data]
[[],
[', ', ', '],
[],
[],
['\xa0, ', ', ', ', ', ', '],
['/ '],
[],
['/ ', '/> '],
['/', '/ ', '/ '],
['\n\nAccepted: ', '\nRejected: ']]
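If both the ids and the text between them are needed, one possible variation (a sketch, not part of either answer above) is to wrap the lookaround pattern in a capturing group, so that re.split returns separators and ids alternately. The helper name split_ids is made up for this example, and it returns an empty list rather than [''] when there is no delimiter.

```python
import re

# Capturing group: re.split keeps the matched ids in the result list.
ID_RE = re.compile(r'((?<!\d)\d{4,10}(?!\d))')

def split_ids(value):
    """Return (ids, delimiters) for one cell; split_ids is a hypothetical helper."""
    parts = ID_RE.split(value)
    ids = parts[1::2]            # odd positions hold the captured ids
    delimiters = parts[2:-1:2]   # even positions strictly between two ids
    return ids, delimiters

print(split_ids('789702/ 89057'))
# (['789702', '89057'], ['/ '])
print(split_ids('731862\n\nAccepted: 176666\nRejected: 8787'))
# (['731862', '176666', '8787'], ['\n\nAccepted: ', '\nRejected: '])
print(split_ids('blah blah3'))
# ([], [])
```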

Related

How to escape specific whitespaces when splitting line into words with regex

I want to split a string into a list of words (here "word" means arbitrary sequence of non-whitespace characters), but also keep the groups of consecutive whitespaces that have been used as separators (because the number of whitespaces is significant in my data). For this simple task, I know that the following regex would do the job (I use Python as an illustrative language, but the code can be easily adapted to any language including regexes):
import re
regexA = re.compile(r"(\S+)")
print(regexA.split("aa b+b cc dd! :ee "))
produces the expected output:
['', 'aa', ' ', 'b+b', ' ', 'cc', ' ', 'dd!', ' ', ':ee', ' ']
Now the hard part: when a word includes an opening parenthesis, all the whitespaces encountered until the matching closing parenthesis should not be considered as word separators. In other words:
regexB.split("aa b+b cc(dd! :ee (ff gg) hh) ii ")
should produce:
['', 'aa', ' ', 'b+b', ' ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', ' ']
Using
regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')
works for a single pair of parentheses, but fails when there are inner parentheses. How could I improve the regex to correctly skip inner parentheses?
And the final question: in my data, only words starting with % should be tested for the "parenthesis rule" (regexB), the other words should be treated by regexA. I have no idea how to combine two regexes in a single split.
Any hint is warmly welcome...
The PCRE regex engine supports subroutine calls, so a recursive pattern is workable for balanced nested parentheses:
(?m)\s+(?=[^()]*(\([^()]*(?1)?[^()]*\))*[^()]*$)
Here (?1) calls subroutine 1, (\([^()]*(?1)?[^()]*\)), that is, a recursive pattern that includes its own caller, (?1).
But Python's re module does not support subroutine patterns.
So I first replaced every ( and ) with another distinctive character (# in this example), applied the regex to split, and finally turned each # back into ( or ) in my Python script.
Regex for splitting:
(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)
Here I changed your separator from \S+ to consecutive whitespace, \s+, because #, ( and ) are all among the characters that \S can match.
The Python script may look like this:
import re
ss = """aa b+b cc(dd! :ee ((ff gg)) hh) ii """
ss = re.sub(r"\(|\)", "#", ss)  # replace every `(` and `)` with `#`
regx = re.compile(r"(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)")
m = regx.split(ss)
for i in range(len(m)):  # turn `#` back to `(` or `)` respectively
    n = m[i].count('#')
    if n < 2: continue
    else:
        for j in range(int(n/2)):
            k = m[i].find('#'); m[i] = m[i][:k] + '(' + m[i][k+1:]
        m[i] = m[i].replace("#", ')')
print(m)
Output is
['aa', ' ', 'b+b', ' ', 'cc(dd! :ee ((ff gg)) hh)', ' ', 'ii', ' ', '']
Finally, after testing several ideas based on the answers proposed by @Wiktor Stribiżew and @Thm Lee, I arrived at a set of solutions dealing with different levels of complexity. To reduce dependencies, I wanted to stick to the re module from the Python standard library, so here is the code:
import re
text = "aa b%b( %cc(dd! (:ee ff) gg) %hh ii) "
# Solution 1: don't process parentheses at all
regexA = re.compile(r'(\S+)')
print(regexA.split(text))
# Solution 2: works for non-nested parentheses
regexB = re.compile(r'(%[^(\s]*\([^)]*\)|\S+)')
print(regexB.split(text))
# Solution 3: works for one level of nested parentheses
regexC = re.compile(r'(%[^(\s]*\((?:[^()]*\([^)]*\))*[^)]*\)|\S+)')
print(regexC.split(text))
# Solution 4: works for arbitrary levels of nested parentheses
n, words = 0, []
for word in regexA.split(text):
    if n: words[-1] += word
    else: words.append(word)
    if n or (word and word[0] == '%'):
        n += word.count('(') - word.count(')')
print(words)
Here is the generated output:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
As stated in the question, for my specific data, whitespace inside parentheses only has to be escaped for words starting with %; other parentheses (e.g. the word b%b( in my example) are not considered special. If you want to escape whitespace inside any pair of parentheses, simply remove the % character from the regexes. Here is the result with that modification:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg) %hh ii)', ' ']
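For comparison, the same depth-counting idea can be written without any regex by walking the string one character at a time. This is only a sketch under a few assumptions: it treats every parenthesis as special (no % rule), it ignores unmatched closing parentheses, and the function name split_outside_parens is made up for this example. Unlike re.split, it does not emit a leading empty string.

```python
def split_outside_parens(text):
    """Split into alternating word / whitespace tokens.

    Whitespace inside (possibly nested) parentheses is kept as part of the word.
    Hypothetical helper, not taken from the answers above.
    """
    tokens = []
    current = ''
    current_is_space = False
    depth = 0  # current parenthesis nesting level
    for ch in text:
        # Whitespace only counts as a separator at nesting level 0.
        is_space = ch.isspace() and depth == 0
        if current and is_space != current_is_space:
            tokens.append(current)
            current = ''
        current += ch
        current_is_space = is_space
        if ch == '(':
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
    if current:
        tokens.append(current)
    return tokens

print(split_outside_parens("aa b+b cc(dd! :ee (ff gg) hh) ii "))
# ['aa', ' ', 'b+b', ' ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', ' ']
```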

Removing all punctuation from a list and return the entire list in Python

I have a list that I'm trying to strip all punctuation and the character "·" from and then returning that list without any of the above. However, when I try to return the list, only the first word of the list appears and I'm not sure where I went wrong with this.
Here is the list I'm trying to strip punctuation from:
['in·vis·i·ble', 'in·vis·i·bil·i·ty, ', 'in·vis·i·ble·ness, ', 'in·vis·i·bly, ', 'qua·si-in·vis·i·ble, ', 'qua·si-in·vis·i·bly, ', 'inˌvisiˈbility, ', 'inˈvisibleness, ', 'inˈvisibly, ']
Here's what I'm getting: ['invisible']
Here is a portion of my code (it's part of a larger function)
syl = []
for words in span:
    if words not in syl:
        syl.append(words)
for text in syl:
    drop_sep = re.sub(r'·', '', text)
    return drop_sep
Use a list comprehension in which each element of the resulting list is the original string with every occurrence of the middle dot '·' replaced by the empty string '':
[word.replace('·', '') for word in words]
Example
>>> words = ['in·vis·i·ble',
... 'in·vis·i·bil·i·ty, ',
... 'in·vis·i·ble·ness, ',
... 'in·vis·i·bly, ',
... 'qua·si-in·vis·i·ble, ',
... 'qua·si-in·vis·i·bly, ',
... 'inˌvisiˈbility, ',
... 'inˈvisibleness, ',
... 'inˈvisibly, ']
>>>
>>> from pprint import pprint
>>> pprint([word.replace('·', '') for word in words])
['invisible',
'invisibility, ',
'invisibleness, ',
'invisibly, ',
'quasi-invisible, ',
'quasi-invisibly, ',
'inˌvisiˈbility, ',
'inˈvisibleness, ',
'inˈvisibly, ']
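The loop in the question returns during its first iteration, which is why only one word comes back. A minimal sketch of a fix that also strips ASCII punctuation, as the title asks, could look like this. The helper name strip_punctuation is made up, and it assumes that "punctuation" means string.punctuation, which also removes hyphens and leaves the stress marks ˌ and ˈ untouched.

```python
import string

def strip_punctuation(words):
    """Hypothetical helper: drop '·' and ASCII punctuation from every word."""
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans('', '', string.punctuation + '·')
    cleaned = []
    for word in words:
        cleaned.append(word.translate(table).strip())
    return cleaned  # return after the loop has finished, not inside it

words = ['in·vis·i·ble', 'in·vis·i·bil·i·ty, ', 'qua·si-in·vis·i·ble, ', 'inˈvisibly, ']
print(strip_punctuation(words))
# ['invisible', 'invisibility', 'quasiinvisible', 'inˈvisibly']
```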

Align numbers in sublist

I have a set of numbers that I want to align considering the comma:
10 3
200 4000,222 3 1,5
200,21 0,3 2
30000 4,5 1
mylist = [['10', '3', '', ''],
['200', '4000,222', '3', '1,5'],
['200,21', '', '0,3', '2'],
['30000', '4,5', '1', '']]
What I want is to align this list considering the comma:
expected result:
mylist = [['   10   ', '   3    ', '   ', '   '],
          ['  200   ', '4000,222', '3  ', '1,5'],
          ['  200,21', '        ', '0,3', '2  '],
          ['30000   ', '   4,5  ', '1  ', '   ']]
I tried to turn the list:
mynewlist = list(zip(*mylist))
and to find the longest part after the comma in every sublist:
for m in mynewlist:
    max([x[::-1].find(',') for x in m])
and to use rjust and ljust but I don't know how to ljust after a comma and rjust before the comma, both in the same string.
How can I resolve this without using format()?
(I want to align with ljust and rjust)
Here's another approach that does the trick. I can't currently see a simpler way to write it, so I'll just explain it step by step. r is the result list, created beforehand:
r = [[] for i in range(4)]
Then we loop through the columns and also grab an index with enumerate:
for ind1, vals in enumerate(zip(*mylist)):
Inside the loop we grab the max length of the decimal digits present (plus one for the comma) and the max length of the word (the part before the decimal digits):
    l = max(len(v.partition(',')[2]) for v in vals) + 1
    mw = max(len(v if ',' not in v else v.split(',')[0]) for v in vals)
Now we go through the values inside the tuple vals and build our results (yup, I can't currently think of a way to avoid this nesting):
    for ind2, v in enumerate(vals):
If it contains a comma, it should be formatted differently. Specifically, we rjust the part before the comma to the max word length mw, then add the comma, the decimal digits and any whitespace needed:
        if ',' in v:
            n, d = v.split(',')
            v = "".join((n.rjust(mw),',', d, " " * (l - 1 - len(d))))
In the opposite case, we simply rjust and then add whitespace:
        else:
            v = "".join((v.rjust(mw) + " " * l))
Finally, we append to r:
        r[ind1].append(v)
All together:
r = [[] for i in range(4)]
for ind1, vals in enumerate(zip(*mylist)):
    l = max(len(v.partition(',')[2]) for v in vals) + 1
    mw = max(len(v if ',' not in v else v.split(',')[0]) for v in vals)
    for ind2, v in enumerate(vals):
        if ',' in v:
            n, d = v.split(',')
            v = "".join((n.rjust(mw),',', d, " " * (l - 1 - len(d))))
        else:
            v = "".join((v.rjust(mw) + " " * l))
        r[ind1].append(v)
Now, we can print it out:
>>> print(*map(list,zip(*r)), sep='\n')
['   10   ', '   3    ', '   ', '   ']
['  200   ', '4000,222', '3  ', '1,5']
['  200,21', '        ', '0,3', '2  ']
['30000   ', '   4,5  ', '1  ', '   ']
Here's a slightly different solution that doesn't transpose mylist but instead iterates over it twice. On the first pass it generates a list of pairs, one for each column. Each pair holds two numbers: the first is the length before the comma and the second is the length of the comma and everything following it. For example, '4000,222' results in (4, 4). On the second pass it formats the data based on the formatting info generated on the first pass.
from functools import reduce
mylist = [['10', '3', '', ''],
['200', '4000,222', '3', '1,5'],
['200,21', '', '0,3', '2'],
['30000', '4,5', '1', '']]
# Return tuple (left part length, right part length) for given string
def part_lengths(s):
    left, sep, right = s.partition(',')
    return len(left), len(sep) + len(right)
# Return string formatted based on part lengths
def format(s, first, second):
    left, sep, right = s.partition(',')
    return left.rjust(first) + sep + right.ljust(second - len(sep))
# Generator yielding part lengths row by row
parts = ((part_lengths(c) for c in r) for r in mylist)
# Combine part lengths to find maximum for each column
# For example data it looks like this: [[5, 3], [4, 4], [1, 2], [1, 2]]
sizes = reduce(lambda x, y: [[max(z) for z in zip(a, b)] for a, b in zip(x, y)], parts)
# Format result based on part lengths
res = [[format(c, *p) for c, p in zip(r, sizes)] for r in mylist]
print(*res, sep='\n')
Output:
['   10   ', '   3    ', '   ', '   ']
['  200   ', '4000,222', '3  ', '1,5']
['  200,21', '        ', '0,3', '2  ']
['30000   ', '   4,5  ', '1  ', '   ']
This works for Python 2 and 3. I didn't use ljust or rjust, though; I just added as many spaces before and after each number as it is missing compared with the widest number in its column:
mylist = [['10', '3', '', ''],
['200', '4000,222', '3', '1,5'],
['200,21', '', '0,3', '2'],
['30000', '4,5', '1', '']]
transposed = list(zip(*mylist))
sizes = [[(x.index(",") if "," in x else len(x), len(x) - x.index(",") if "," in x else 0)
for x in l] for l in transposed]
maxima = [(max([x[0] for x in l]), max([x[1] for x in l])) for l in sizes]
withspaces = [
[' ' * (maxima[i][0] - sizes[i][j][0]) + number + ' ' * (maxima[i][1] - sizes[i][j][1])
for j, number in enumerate(l)] for i, l in enumerate(transposed)]
result = list(zip(*withspaces))
Printing the result in python3:
>>> print(*result, sep='\n')
('   10   ', '   3    ', '   ', '   ')
('  200   ', '4000,222', '3  ', '1,5')
('  200,21', '        ', '0,3', '2  ')
('30000   ', '   4,5  ', '1  ', '   ')
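Since the question asks specifically for rjust and ljust, the two-pass idea above can be condensed into a single function that uses only those two methods plus str.partition. This is a sketch, not any answerer's code; the name align_on_comma is made up for this example.

```python
def align_on_comma(rows):
    """Pad every cell so that the commas in each column line up.

    Hypothetical helper; assumes all rows have the same number of cells.
    """
    # First pass: per column, the widest part before the comma and the
    # widest part from the comma onwards (comma included).
    widths = []
    for column in zip(*rows):
        parts = [cell.partition(',') for cell in column]
        left = max(len(head) for head, _, _ in parts)
        right = max(len(sep) + len(tail) for _, sep, tail in parts)
        widths.append((left, right))

    # Second pass: right-justify before the comma, left-justify after it.
    aligned = []
    for row in rows:
        out = []
        for cell, (left, right) in zip(row, widths):
            head, sep, tail = cell.partition(',')
            out.append(head.rjust(left) + (sep + tail).ljust(right))
        aligned.append(out)
    return aligned

mylist = [['10', '3', '', ''],
          ['200', '4000,222', '3', '1,5'],
          ['200,21', '', '0,3', '2'],
          ['30000', '4,5', '1', '']]
for row in align_on_comma(mylist):
    print(row)
```

Every cell in a column ends up with the same width, so joining each row with a separator gives an aligned table.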

Extract items delimited with square brackets using python regular expressions

I'm trying to split out words/phrases delimited by square brackets using a Python regular expression. Each section of text that begins and ends with square brackets should be split into a separate element.
This is what I have so far but it doesn't work properly:
import re
t="word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"
re.split("(\[)(.*)(\])+",t)
Output:
['word1 word2 3456 ',
'[',
'abc def] [ghi jkl] [1234] [-abcd',
']',
' word [xyz 2345']
I want the output to be something like:
['word1 word2 3456 ',
'[abc def]',
' ',
'[ghi jkl]',
' ',
'[1234]',
' ',
'[-abcd]',
' word [xyz 2345']
Note only the items with both an opening and closing square bracket are split out.
I've also tried this:
re.split("(\[.*\])+",t)
but that only splits by the first and last square bracket
['word1 word2 3456 ', '[abc def] [ghi jkl] [1234] [-abcd]', ' word [xyz 2345']
Use the non-greedy .+? instead of the greedy .*:
>>> re.split(r"(\[.+?\])", t)
['word1 word2 3456 ', '[abc def]', ' ', '[ghi jkl]', ' ', '[1234]', ' ', '[-abcd]', ' word 2345']
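The difference between the greedy and non-greedy quantifier is easier to see with re.findall on the same string. The third pattern, a negated character class, is an alternative added here for comparison and is not part of the answer above.

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

# Greedy: .* runs to the last ']' in the string, so everything is one match.
greedy = re.findall(r"\[.*\]", t)

# Non-greedy: .+? stops at the first ']' it can.
lazy = re.findall(r"\[.+?\]", t)

# Negated class: "anything except ]" cannot run past the closing bracket.
negated = re.findall(r"\[[^\]]+\]", t)

print(greedy)   # ['[abc def] [ghi jkl] [1234] [-abcd]']
print(lazy)     # ['[abc def]', '[ghi jkl]', '[1234]', '[-abcd]']
print(negated)  # ['[abc def]', '[ghi jkl]', '[1234]', '[-abcd]']
```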
You can use this regex to split your strings:
\s(?=\[)|(?<=\])\s
But since it splits those spaces it will consume them and your generated output will be:
word1 word2 3456
[abc def]
[ghi jkl]
[1234]
[-abcd]
word 2345
So, as a workaround, you can use the above regex to replace the matches with a custom token such as ||| ||| to generate something like:
word1 word2 3456||| |||[abc def]||| |||[ghi jkl]||| |||[1234]||| |||[-abcd]||| |||word 2345
Then you can split on the custom token ||| and the spaces are kept as well:
'word1 word2 3456'
' '
'[abc def]'
' '
'[ghi jkl]'
' '
'[1234]'
' '
'[-abcd]'
' '
'word 2345'
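In Python, the token workaround described above might be sketched as follows. It assumes the token ||| never occurs in the input, and the variable names are made up for this example.

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

# A whitespace character directly before '[' or directly after ']'.
sep = re.compile(r"\s(?=\[)|(?<=\])\s")

# Step 1: replace each separator with token + space + token.
tokenized = sep.sub("||| |||", t)

# Step 2: split on the token; the space between two tokens survives as its own item.
parts = tokenized.split("|||")

print(parts)
# ['word1 word2 3456', ' ', '[abc def]', ' ', '[ghi jkl]', ' ', '[1234]', ' ', '[-abcd]', ' ', 'word 2345']
```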
Try this instead:
re.findall(r"[^\]\[]*|\[[^\]\[]*?\]", t)
This will return
['word1 word2 3456 ', '', 'abc def', '', ' ', '', 'ghi jkl', '', ' ', '', '1234', '', ' ', '', '-abcd', '', ' word 2345', '']
To remove the empty strings, do:
list(filter(None, re.findall(r"[^\]\[]*|\[[^\]\[]*?\]", t)))
which returns
['word1 word2 3456 ',
'abc def',
' ',
'ghi jkl',
' ',
'1234',
' ',
'-abcd',
' word 2345']
To explain the regex:
re.compile(r"""
[^\]\[]* # Zero or more characters that aren't [ or ]
| # OR
\[ # a literal [
[^\]\[]*? # Zero or more characters that aren't [ or ]
\] # a literal ]""", re.X)
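The output shown above drops the brackets and needs an extra filter step for the empty strings. The handling of empty matches in re.findall also changed in Python 3.7, so results may differ on newer versions. A variation that avoids empty matches altogether (a sketch, not the answerer's pattern) is to try the bracketed alternative first and to require at least one character elsewhere. The third alternative, which keeps stray brackets as single-character tokens, is an assumption about what should happen to unmatched brackets.

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

token_re = re.compile(r"""
    \[ [^\]\[]* \]   # a complete [...] group without nested brackets
  | [^\]\[]+         # OR one or more characters that aren't [ or ]
  | [\]\[]           # OR a single stray bracket
""", re.X)

print(token_re.findall(t))
# ['word1 word2 3456 ', '[abc def]', ' ', '[ghi jkl]', ' ', '[1234]', ' ', '[-abcd]', ' word 2345']
```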

Product code looks like abcd2343, how to split by letters and numbers?

I have a list of product codes in a text file, on each line is the product code that looks like:
abcd2343 abw34324 abc3243-23A
So it is letters followed by numbers and other characters.
I want to split on the first occurrence of a number.
import re
s='abcd2343 abw34324 abc3243-23A'
re.split('(\d+)',s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A']
Or, if you want to split on the first occurrence of a digit:
re.findall('\d*\D+',s)
> ['abcd', '2343 abw', '34324 abc', '3243-', '23A']
\d+ matches 1-or-more digits.
\d*\D+ matches 0-or-more digits followed by 1-or-more non-digits.
\d+|\D+ matches 1-or-more digits or 1-or-more non-digits.
Consult the docs for more about Python's regex syntax.
re.split(pat, s) will split the string s using pat as the delimiter. If pat begins and ends with parentheses (so as to be a "capturing group"), then re.split will return the substrings matched by pat as well. For instance, compare:
re.split('\d+', s)
> ['abcd', ' abw', ' abc', '-', 'A'] # <-- just the non-matching parts
re.split('(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A'] # <-- both the non-matching parts and the captured groups
In contrast, re.findall(pat, s) returns only the parts of s that match pat:
re.findall('\d+', s)
> ['2343', '34324', '3243', '23']
Thus, if s ends with a digit, you could avoid ending with an empty string by using re.findall('\d+|\D+', s) instead of re.split('(\d+)', s):
s='abcd2343 abw34324 abc3243-23A 123'
re.split('(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123', '']
re.findall('\d+|\D+', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123']
This function handles float and negative numbers as well.
import re
def separate_number_chars(s):
    res = re.split('([-+]?\d+\.\d+)|([-+]?\d+)', s.strip())
    res_f = [r.strip() for r in res if r is not None and r.strip() != '']
    return res_f
For example:
utils.separate_number_chars('-12.1grams')
> ['-12.1', 'grams']
import re
m = re.match(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$",input)
m.group('letters')
m.group('the_rest')
This covers your corner case of abc3243-23A and will output abc for the letters group and 3243-23A for the_rest
Since you said they are all on individual lines you'll obviously need to put a line at a time in input
def firstIntIndex(string):
    result = -1
    for k in range(0, len(string)):
        if (bool(re.match('\d', string[k]))):
            result = k
            break
    return result
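The function above returns the index of the first digit, or -1 if there is none. A shorter equivalent, sketched here with made-up names, lets re.search do the scanning and then slices the string at that index.

```python
import re

def first_digit_index(s):
    """Index of the first digit in s, or -1 if s contains no digit."""
    m = re.search(r'\d', s)
    return m.start() if m else -1

def split_at_first_digit(s):
    """Split s into (letters, rest) at the first digit; hypothetical helper."""
    i = first_digit_index(s)
    if i == -1:
        return s, ''
    return s[:i], s[i:]

for code in ['abcd2343', 'abw34324', 'abc3243-23A']:
    print(split_at_first_digit(code))
# ('abcd', '2343')
# ('abw', '34324')
# ('abc', '3243-23A')
```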
To partition on the first digit
parts = re.split('(\d.*)','abcd2343') # => ['abcd', '2343', '']
parts = re.split('(\d.*)','abc3243-23A') # => ['abc', '3243-23A', '']
So the two parts are always parts[0] and parts[1].
Of course, you can apply this to multiple codes:
>>> s = "abcd2343 abw34324 abc3243-23A"
>>> results = [re.split('(\d.*)', pcode) for pcode in s.split(' ')]
>>> results
[['abcd', '2343', ''], ['abw', '34324', ''], ['abc', '3243-23A', '']]
If each code is on its own line, use s.splitlines() instead of s.split(' ').
Try this code, which splits once at the first space that is followed by a digit:
import re
text = "MARIA APARECIDA 99223-2000 / 98450-8026"
parts = re.split(r' (?=\d)',text, 1)
print(parts)
Output:
['MARIA APARECIDA', '99223-2000 / 98450-8026']
