Python regexp: exclude specific pattern from sub - python

Given a string like aa5f5 aa5f5, I am trying to split the tokens wherever a non-digit meets a digit, like this:
re.sub(r'([^\d])(\d{1})', r'\1 \2', 'aa5f5 aa5f5')
Out: aa 5f 5 aa 5f 5
Now I want to prevent tokens with a specific prefix character ($) from being split: for $aa5f5 aa5f5, the desired output is $aa5f5 aa 5f 5
The problem is that I only came up with this ugly loop:
import re
sentence = '$aa5f5 aa5f5'
new_sentence = []
for s in sentence.split():
    if s.startswith('$'):
        # keep $-prefixed tokens untouched
        new_sentence.append(s)
    else:
        new_sentence.append(re.sub(r'([^\d])(\d{1})', r'\1 \2', s))
print(' '.join(new_sentence)) # $aa5f5 aa 5f 5
But I could not find a way to do this with a single-line regexp. Any help is appreciated, thank you.

You may use
new_sentence = re.sub(r'(\$\S*)|(?<=\D)\d', lambda x: x.group(1) if x.group(1) else rf' {x.group()}', sentence)
See the Python demo.
Here, (\$\S*)|(?<=\D)\d matches either $ followed by any 0+ non-whitespace characters (with (\$\S*) capturing the value in Group 1), or a digit that is preceded by a non-digit char (see the (?<=\D)\d pattern part).
If Group 1 matched, it is pasted back as is (see x.group(1) if x.group(1) in the replacement), else, the space is inserted before the matched digit (see else rf' {x.group()}').
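As a rough sketch, the same replacement can be wrapped in a small helper so the two branches are easier to see. The name split_unprefixed is made up for this illustration; the pattern is the one from the answer above.

```python
import re

# Pattern from the answer: either a whole $-prefixed token (Group 1),
# or a single digit that is preceded by a non-digit character.
PATTERN = re.compile(r'(\$\S*)|(?<=\D)\d')


def split_unprefixed(sentence):
    """Hypothetical helper: put a space before digits, leaving $-tokens alone."""
    def repl(match):
        if match.group(1) is not None:
            # A $-prefixed token was matched: paste it back unchanged.
            return match.group(1)
        # Otherwise a digit was matched: insert a space in front of it.
        return ' ' + match.group()
    return PATTERN.sub(repl, sentence)


print(split_unprefixed('$aa5f5 aa5f5'))  # $aa5f5 aa 5f 5
```

Because the $-token alternative comes first, the digits inside it are consumed before the second alternative ever gets a chance to match them.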
With the PyPI regex module, which supports variable-width lookbehinds, you may do it in a simpler way:
import regex
sentence = '$aa5f5 aa5f5'
print( regex.sub(r'(?<!\$\S*)(?<=\D)(\d)', r' \1', sentence) )
See this online Python demo.
The (?<!\$\S*)(?<=\D)(\d) pattern matches and captures into Group 1 any digit ((\d)) that is preceded with a non-digit ((?<=\D)) and not preceded with $ and then any 0+ non-whitespace chars ((?<!\$\S*)).

This is not something a regular expression handles cleanly. Even if it can be done, the regex will be complex and hard to understand, and a new developer joining your team will not understand it right away. It is better to keep the loop you already wrote. For the regex part, the following code will probably do the splitting correctly:
' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
>>> s = "aa5f5 aa5f53r12"
>>> ' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
'aa 5 f 5 aa 5 f 53 r 12'

Related

Use regex to contextually replace dots in a string

I want to remove all occurrences of dots separated by single characters, I also want to replace all occurrences of dots separated by more than one consecutive character with a space (if one side has len > 1 char).
For example. Given a string,
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
After processing the output should look like:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
Notice that in the case of A.B.C.D.E., all dots are removed (this should be true for when there is no trailing dot also)
Notice that in the case of K.L.M.NO, the first two dots are removed, the last one is replaced with a space (because NO is not a single character)
Notice that in the case of PQ.R.S, the first dot is replaced with a space, the second dot is removed.
I almost have a working solution:
re.sub(r'(?<!\w)([A-Z])\.', r'\1', s)
But in the example given, T.U.VWXYZ gets translated to TUVWXYZ, whereas it should be TU VWXYZ
Note: it's not important for this to be solved with a single regex, or even regex at all for that matter.
Edit: changed PQ.RS to PQ.R.S in the example string.
I'd take two steps.
replace (\b[A-Z])\.(?=[A-Z]\b|\s|$) with r'\1'
replace (\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,}) with r'\1\2 '
Sample
import re
re1 = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
re2 = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
r = re2.sub(r'\1\2 ', re1.sub(r'\1', s)).strip()
print(r)
outputs
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
which matches your desired result:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
re1 matches all dots that are preceded by a free-standing letter and followed by either another free-standing letter, or whitespace, or the end of the string.
re2 matches all dots that are preceded by at least 2 letters and followed by at least 1 letter (or the other way around).
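The two steps can also be checked against the edited example string (the one containing PQ.R.S). This is only a sketch: the names fix_dots, RE_REMOVE and RE_SPACE are invented here, and the r'\1\2 ' replacement relies on Python 3.5+ substituting an empty string for a group that did not take part in the match.

```python
import re

# Step 1: dots between free-standing letters (or before whitespace/end) are removed.
RE_REMOVE = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
# Step 2: dots next to a run of two or more letters become a space.
RE_SPACE = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')


def fix_dots(text):
    """Hypothetical helper applying both substitutions in order."""
    without_single_dots = RE_REMOVE.sub(r'\1', text)
    return RE_SPACE.sub(r'\1\2 ', without_single_dots).strip()


print(fix_dots(' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'))
# ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
```

The order matters: step 1 first collapses K.L.M and R.S, so that step 2 only sees dots that border a multi-letter run.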
You can first replace all dots followed by two characters by spaces, and then remove the remaining dots:
re.sub(r'\.([A-Z]{2})', r' \1', s).replace(".", "")
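This gives " ABCDE FGH IJ KLM NO PQ RS TU VWXYZ" on your original example (with PQ.RS). On the edited string, PQ.R.S becomes PQRS, because neither of its dots is followed by two letters.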
Hopefully this is slightly neater:
import re
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
s = re.sub(r"\.(\w{2})", r" \1", s)
s = re.sub(r"(\w{2})\.(\w)", r"\1 \2", s)
s = re.sub(r"\.", "",s)
s = s.strip()
print(s)
You can use a single regex solution if you consider using a dynamic replacement:
import re
rx = r'\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?)|\.'
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
print( re.sub(rx, lambda x: x.group(1).replace('.', '') if x.group(1) else ' ', s.strip()) )
# => ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
See the Python demo and a regex demo.
The regex matches:
\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?) - a word boundary, then Group 1 (that will be replaced with itself after stripping off all periods) capturing:
[A-Z] - an uppercase ASCII letter
(?:\.[A-Z])+ - one or more sequences of a dot and an uppercase ASCII letter
\b - word boundary
(?:\.(?![A-Z]))? - an optional sequence of . that is not followed with an uppercase ASCII letter
| - or
\. - a . in any other context (it will be replaced with a space).
The lambda x: x.group(1).replace('.', '') if x.group(1) else ' ' replacement means that if Group 1 matches, the replacement string is Group 1 value without dots, and if Group 1 does not match the replacement is a single regular space.

Removing characters/whitespace up to the first chosen character python

Here is a string:
. 68.00 68.00 .
I am trying to remove the first . and the fourth token (the trailing .), while adding a comma between the two numbers.
output should look like:
68.00,68.00
Have tried strip and some initial character removal functions but having issues e.g.
[1:]
Any help would be appreciated.
A regex works for this problem. Instead of removing the dots and whitespace, I grab the numbers and then join them with ','.join(). Note that the sample below uses integers; on 68.00 the pattern [0-9]+ would also split at the decimal point.
>>> import re
>>> inp = '. 68 68 . '
>>> print(','.join(re.findall(r'[0-9]+', inp)))
68,68
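For the decimal input from the question, a small variation that allows an optional fractional part might look like this. It is a sketch only; join_numbers is a made-up name and the pattern assumes plain unsigned numbers.

```python
import re


def join_numbers(text):
    """Hypothetical helper: collect integers or decimals and join them with commas."""
    # \d+ matches the integer part; (?:\.\d+)? optionally matches a fractional part.
    return ','.join(re.findall(r'\d+(?:\.\d+)?', text))


print(join_numbers('. 68.00 68.00 .'))  # 68.00,68.00
```

The stand-alone dots are skipped because the pattern always has to start with a digit.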
You could try this with re.findall():
import re
st='. 68.00 68.00 . '
print(','.join(re.findall(r'(?<!\d)(\d{2}\.00)(?!\d)', st)))
Output:
68.00,68.00
If you have more numbers with different lengths and you only want those with length two, you could try this:
import re
st='. 68 67 . 600 '
print(','.join(re.findall(r'(?<!\d)(\d{2})(?!\d)', st))) # you can change \d{2} to \d{n}, with n as the length you want
See the explanation of the regular expression here.
Output:
68,67
Edit:
Another option without using regex:
st='. 68 68 .'
ls=[s for s in st.split() if all(let.isdigit() for let in s)]
print(','.join(ls))
Output:
68,68
Alternatively to using regex, the result could also be achieved by the combination of replace, split and join functions:
x = '. 68 68 .'
result = ','.join(x.replace('.','').split())
You can do a simple (yet convoluted looking) regex search and replace. You may have to adjust this regex to your needs.
import re
the_input = '. 68.00 68.00 . '
print(the_input)
# r = raw string
# the_regex = r'\s*\.\s*(\d+)\s+(\d+)\s+\.\s+'
the_regex = r'\s*\.\s*(\d+\.\d+)\s+(\d+\.\d+)\s+\.\s+' # with decimals numbers
the_output = re.sub(the_regex, r'\1,\2', the_input)
print(the_output)
the_regex is a bit convoluted, so here it is broken down.
\s* - whitespace, 0 or more
\. - one literal dot
(\d+) - the 1st pair of parentheses is capture group 1; the 2nd pair is capture group 2.
\d+ - number digits, 1 or more
\s+ - spaces, 1 or more
In re.sub(), \1 is capture group 1. \2 is capture group 2.
To test your regex and to have an explanation generated, punch your regex and input-string here:
https://regex101.com/
I suggest you use regex101 to truly understand this regex. Type out the input and then slowly type out the regex. Anything that matches the regex is highlighted and color coded.

regex two group matches everything until pattern

I have the following examples:
Tortillas Bolsa 2a 1kg 4118
Tortillinas 50p 1 31Kg TAB TR 46113
Bollos BK 4in 36p 1635g SL 131
Super Pan Bco Ajonjoli 680g SP WON 100
Pan Blanco Bimbo Rendidor 567g BIM 49973
Gansito ME 5p 250g MTA MLA 49860
I want to keep everything before the first number, but I also want to drop the two-letter uppercase words, for example ME and BK. I'm using ^((\D*).*?) [^A-Z]{2,3}
The expected result should be
Tortillas Bolsa
Tortillinas
Bollos
Super Pan Bco Ajonjoli
Pan Blanco Bimbo Rendidor
Gansito
With the regex I'm using I'm still getting the two capital letter words Bollos BK and Gansito ME
Pre-compile a regex pattern with a lookahead (explained below) and call its match method inside a list comprehension:
>>> import re
>>> p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')
>>> [p.match(x).group() for x in data]
[
'Tortillas Bolsa',
'Tortillinas',
'Bollos',
'Super Pan Bco Ajonjoli',
'Pan Blanco Bimbo Rendidor',
'Gansito'
]
Here, data is your list of strings.
Details
\D+?               # anything that isn't a digit (non-greedy)
(?=                # regex-lookahead
    \s*            # zero or more wsp chars
    ([A-Z]{2})?    # two optional uppercase letters
    \s*
    \d             # digit
)
In the event of any string not containing the pattern you're looking for, the list comprehension will error out (with an AttributeError), since re.match returns None in that instance. You can then employ a loop and test the value of re.match before extracting the matched portion.
matches = []
for x in data:
    m = p.match(x)
    if m:
        matches.append(m.group())
Or, if you want a placeholder None when there's no match:
matches = []
for x in data:
    m = p.match(x)
    matches.append(m.group() if m else None)
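Put together, a self-contained version of this answer could look like the sketch below. The data list is copied from the question, the extra 'no digits here' entry is added only to exercise the no-match branch, and leading_name is a made-up helper name.

```python
import re

data = [
    'Tortillas Bolsa 2a 1kg 4118',
    'Tortillinas 50p 1 31Kg TAB TR 46113',
    'Bollos BK 4in 36p 1635g SL 131',
    'Super Pan Bco Ajonjoli 680g SP WON 100',
    'Pan Blanco Bimbo Rendidor 567g BIM 49973',
    'Gansito ME 5p 250g MTA MLA 49860',
]

# Non-digits, as few as possible, up to an optional two-letter
# uppercase word followed by a digit.
p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')


def leading_name(line):
    """Hypothetical helper: return the leading name, or None when nothing matches."""
    m = p.match(line)
    return m.group() if m else None


names = [leading_name(x) for x in data + ['no digits here']]
print(names)
```

Returning None keeps the result list aligned with the input list, which is handy if the two are zipped together later.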
My 2 cents
^.*?(?=\s[\d]|\s[A-Z]{2,})
https://regex101.com/r/7xD7DS/1/
You may use the lookahead feature:
import re
I_WANT = r'(.+?)' # This is what you want
I_DO_NOT_WANT = r'\s(?:[0-9]|(?:[A-Z]{2,3}\s))' # Stop-patterns
RE = '{}(?={})'.format(I_WANT, I_DO_NOT_WANT) # Combine the parts
# test_strings is the list of input lines from the question
[re.findall(RE, x)[0] for x in test_strings]
#['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli',
# 'Pan Blanco Bimbo Rendidor', 'Gansito']
Supposing that:
All the words you want to match in your capture group start with an uppercase letter
The rest of each word contains only lowercase letters
Words are separated by a single space
...you can use the following regular expressions:
Using Unicode character properties (not supported by Python's built-in re module; the third-party regex module accepts them):
^((\p{Lu}\p{Ll}+ )+)
> Try this regex on regex101.
Without Unicode support:
^(([A-Z][a-z]+ )+)
> Try this regex on regex101.
I suggest splitting at the first two-letter uppercase word or digit and grabbing the first item:
r = re.compile(r'\b[A-Z]{2}\b|\d')
[r.split(item)[0].strip() for item in my_list]
# => ['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli', 'Pan Blanco Bimbo Rendidor', 'Gansito']
See the Python demo
Pattern details
\b[A-Z]{2}\b - a whole (since \b are word boundaries) two uppercase ASCII letter word
| - or
\d - a digit.
With .strip(), all trailing and leading whitespace will get trimmed.
A slight variation for a re.sub:
re.sub(r'\s*(?:\b[A-Z]{2}\b|\d).*', '', s)
See the regex demo
Details
\s* - 0+ whitespace chars
(?:\b[A-Z]{2}\b|\d) - either a two uppercase letter word or a digit
.* - the rest of the line.
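A quick way to try the re.sub variation on the lines from the question is sketched below; STOP and strip_tail are names invented for this example.

```python
import re

# Optional whitespace, then either a two-letter uppercase word or a digit,
# then everything up to the end of the line.
STOP = re.compile(r'\s*(?:\b[A-Z]{2}\b|\d).*')


def strip_tail(line):
    """Hypothetical helper: cut the line at the first stop pattern."""
    return STOP.sub('', line)


for line in ['Bollos BK 4in 36p 1635g SL 131',
             'Super Pan Bco Ajonjoli 680g SP WON 100',
             'Gansito ME 5p 250g MTA MLA 49860']:
    print(strip_tail(line))
```

A line that contains neither a digit nor a two-letter uppercase word is returned unchanged, since the pattern then never matches.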

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
This means: match one or more word characters (letters, digits, or underscore), as indicated by the '+', then a '-', followed by one or more word characters again.
str is a built-in name, so it is better not to use it as a variable name.
import re
st = 'word semi-column peace'
# \w+ word, then -, then \w+ word after the hyphen
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
> a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match in words with hyphens at the beginning or end. A string like "-not-wanted-" will still produce a match (not-wanted) with both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
regex101 demo
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
print(result)
# ['semi-column', 'one-word', 'multi-hyphenated-word']
ideone demo
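If the words are always separated by whitespace, a lookaround-free alternative is to split first and test each token as a whole. This is a different technique from the answer above, shown only as a sketch; hyphenated_words is a made-up name, and unlike the lookaround version it rejects tokens with attached punctuation such as "semi-column,".

```python
import re

# One or more word characters, then one or more "-word" groups.
HYPHENATED = re.compile(r'\w+(?:-\w+)+')


def hyphenated_words(text):
    """Hypothetical helper: keep whitespace-separated tokens that fully match."""
    return [tok for tok in text.split() if HYPHENATED.fullmatch(tok)]


text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
print(hyphenated_words(text))
# ['semi-column', 'one-word', 'multi-hyphenated-word']
```

fullmatch requires the whole token to fit the pattern, which is what rules out leading and trailing hyphens here.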
You can also use the following regex:
>>> import re
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match until there is whitespace in either direction from the hyphen. I also check whether the word is surrounded by hyphens (e.g. -test-cats-), and if it is, I make sure not to include them. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# the character class excludes spaces, '|', '^' and '-'
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
    print(m.group(1))

Python Regex match a mac address from the end?

I have the following re to extract MAC address:
re.sub( r'(\S{2,2})(?!$)\s*', r'\1:', '0x0000000000aa bb ccdd ee ff' )
However, this gave me 0x:00:00:00:00:00:aa:bb:cc:dd:ee:ff.
How do I modify this regex to stop after matching the first 6 pairs starting from the end, so that I get aa:bb:cc:dd:ee:ff?
Note: the string has whitespace in between which is to be ignored. Only the last 12 characters are needed.
Edit1: re.findall( r'(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*$',a) finds the last 6 pairs in the string. I still don't know how to compress this regex. Again this still depends on the fact that the strings are in pairs.
Ideally the regex should take the last 12 valid \S characters starting from the end and string them with :
Edit2: Inspired by @Mariano's answer, which works great but depends on the last 12 characters starting with a pair, I came up with the following solution. It is kludgy but still seems to work for all inputs.
string = '0x0000000000a abb ccddeeff'
':'.join(''.join(i) for i in re.findall(r'(\S)\s*(\S)(?!(?:\s*\S\s*){11})', string))
'aa:bb:cc:dd:ee:ff'
Edit3: @Mariano has updated his answer, which now works for all inputs.
This will work for the last 12 characters, ignoring whitespace.
Code:
import re
text = "0x0000000000aa bb ccdd ee ff"
result = re.sub( r'.*?(?!(?:\s*\S){13})(\S)\s*(\S)', r':\1\2', text)[1:]
print(result)
Output:
aa:bb:cc:dd:ee:ff
DEMO
Regex breakdown:
The expression used in this code uses re.sub() to replace the following in the subject text:
.*? # consume the subject text as few as possible
(?!(?:\s*\S){13}) # CONDITION: Can't be followed by 13 chars
# so it can only start matching when there are 12 to $
(\S)\s*(\S) # Capture a char in group 1, next char in group 2
#
# The match is replaced with :\1\2
# For this example, re.sub() returns ":aa:bb:cc:dd:ee:ff"
# We'll then apply [1:] to the returned value to discard the leading ":"
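You can use re.finditer to find all the pairs and then join the result. Note that ([a-z])\1+ matches runs of one repeated lowercase letter, so it works for this sample (aa, bb, ...) but not for pairs made of two different characters: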
>>> my_string='0x0000000000aa bb ccdd ee ff'
>>> ':'.join([i.group() for i in re.finditer( r'([a-z])\1+',my_string )])
'aa:bb:cc:dd:ee:ff'
You may do like this,
>>> import re
>>> s = '0x0000000000aa bb ccdd ee ff'
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
>>> s = '???767aa bb ccdd ee ff'
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
>>> s = '???767aa bb ccdd eeff '
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
I know this is not a direct answer to your question, but do you really need a regular expression? If your format is fixed, this should also work:
>>> s = '0x0000000000aa bb ccdd ee ff'
>>> ':'.join([s[-16:-8].replace(' ', ':'), s[-8:].replace(' ', ':')])
'aa:bb:cc:dd:ee:ff'
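In the same no-regex spirit, here is a sketch that does not depend on where the spaces fall: drop all whitespace, keep the last 12 characters, and group them in pairs. The function name last_mac is invented for this example, and no check is made that the characters are valid hex digits.

```python
def last_mac(text):
    """Hypothetical helper: format the last 12 non-whitespace characters as a MAC."""
    compact = ''.join(text.split())  # remove every whitespace character
    tail = compact[-12:]             # keep the last 12 remaining characters
    if len(tail) < 12:
        raise ValueError('need at least 12 non-whitespace characters')
    # Group the 12 characters into 6 pairs and join them with colons.
    return ':'.join(tail[i:i + 2] for i in range(0, 12, 2))


print(last_mac('0x0000000000aa bb ccdd ee ff'))  # aa:bb:cc:dd:ee:ff
print(last_mac('0x0000000000a abb ccddeeff'))    # aa:bb:cc:dd:ee:ff
```

Unlike the slicing version above, this handles the irregularly spaced input from Edit2 of the question as well.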
