I have the following regex to extract a MAC address:
re.sub( r'(\S{2,2})(?!$)\s*', r'\1:', '0x0000000000aa bb ccdd ee ff' )
However, this gave me 0x:00:00:00:00:00:aa:bb:cc:dd:ee:ff.
How do I modify this regex to stop after matching the first 6 pairs starting from the end, so that I get aa:bb:cc:dd:ee:ff?
Note: the string has whitespace in between which is to be ignored. Only the last 12 characters are needed.
Edit1: re.findall( r'(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*$',a) finds the last 6 pairs in the string. I still don't know how to compress this regex. Again this still depends on the fact that the strings are in pairs.
Ideally the regex should take the last 12 valid \S characters starting from the end and string them with :
Edit2: Inspired by @Mariano's answer, which works great but depends on the last 12 characters starting with a pair, I came up with the following solution. It is kludgy but seems to work for all inputs.
string = '0x0000000000a abb ccddeeff'
':'.join( ''.join( i ) for i in re.findall( r'(\S)\s*(\S)(?!(?:\s*\S\s*){11})', string ) )
'aa:bb:cc:dd:ee:ff'
Edit3: @Mariano has updated his answer, which now works for all inputs.
This will work for the last 12 characters, ignoring whitespace.
Code:
import re
text = "0x0000000000aa bb ccdd ee ff"
result = re.sub( r'.*?(?!(?:\s*\S){13})(\S)\s*(\S)', r':\1\2', text)[1:]
print(result)
Output:
aa:bb:cc:dd:ee:ff
Regex breakdown:
The expression used in this code uses re.sub() to replace the following in the subject text:
.*?                 # consume as little of the subject text as possible
(?!(?:\s*\S){13})   # CONDITION: can't be followed by 13 non-whitespace chars,
                    #   so it can only start matching when at most 12 remain before $
(\S)\s*(\S)         # capture a char in group 1, the next char in group 2
#
# The match is replaced with :\1\2
# For this example, re.sub() returns ":aa:bb:cc:dd:ee:ff"
# We then apply [1:] to the returned value to discard the leading ":"
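As a rough sketch, the same substitution can be wrapped in a small helper to check it against unevenly spaced input. The name `last_six_pairs` is made up for illustration; the pattern is the one from the answer, and the second input is the irregular string from the question's Edit2.

```python
import re

# Pattern from the answer: lazily skip text until at most 12
# non-whitespace characters remain, then capture the next two of them.
PAIR = re.compile(r'.*?(?!(?:\s*\S){13})(\S)\s*(\S)')

def last_six_pairs(text):
    # Hypothetical helper: every match becomes ":xy", so the result
    # starts with a ":" that is sliced off at the end.
    return PAIR.sub(r':\1\2', text)[1:]

print(last_six_pairs('0x0000000000aa bb ccdd ee ff'))  # aa:bb:cc:dd:ee:ff
print(last_six_pairs('0x0000000000a abb ccddeeff'))    # aa:bb:cc:dd:ee:ff
```

Compiling the pattern once is only a convenience here; calling re.sub() directly as in the answer behaves the same.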
You can use re.finditer to find all the pairs, then join the result (this relies on every pair being a repeated lowercase letter, as in the sample):
>>> my_string='0x0000000000aa bb ccdd ee ff'
>>> ':'.join([i.group() for i in re.finditer( r'([a-z])\1+',my_string )])
'aa:bb:cc:dd:ee:ff'
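You can do it like this. The inner re.sub keeps only the last 12 letters, and the outer one inserts the colons. The (?<=[a-z]) lookbehind keeps the separator from matching a second time right after a replaced space; on Python 3.7+ an empty match adjacent to a previous match is also replaced, which would otherwise produce "::".
>>> import re
>>> s = '0x0000000000aa bb ccdd ee ff'
>>> re.sub(r'(?<=[a-z])\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
>>> s = '???767aa bb ccdd ee ff'
>>> re.sub(r'(?<=[a-z])\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
>>> s = '???767aa bb ccdd eeff '
>>> re.sub(r'(?<=[a-z])\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'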
I know this is not a direct answer to your question, but do you really need a regular expression? If your format is fixed, this should also work:
>>> s = '0x0000000000aa bb ccdd ee ff'
>>> ':'.join([s[-16:-8].replace(' ', ':'), s[-8:].replace(' ', ':')])
'aa:bb:cc:dd:ee:ff'
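If the spacing is not fixed, a regex-free variant is still possible. The following is only a sketch, assuming that whitespace is the only thing to ignore and that the last 12 remaining characters are the address; `mac_from_tail` is a made-up name.

```python
def mac_from_tail(text):
    # Hypothetical helper: drop all whitespace, keep the last 12 characters.
    tail = ''.join(text.split())[-12:]
    # Re-group the tail into two-character chunks joined by ":".
    return ':'.join(tail[i:i + 2] for i in range(0, len(tail), 2))

print(mac_from_tail('0x0000000000aa bb ccdd ee ff'))  # aa:bb:cc:dd:ee:ff
print(mac_from_tail('0x0000000000a abb ccddeeff'))    # aa:bb:cc:dd:ee:ff
```

Unlike the slicing above, this does not depend on where the spaces fall, at the cost of building an intermediate string.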
I want to remove all occurrences of dots separated by single characters, I also want to replace all occurrences of dots separated by more than one consecutive character with a space (if one side has len > 1 char).
For example, given the string
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
After processing the output should look like:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
Notice that in the case of A.B.C.D.E., all dots are removed (this should be true for when there is no trailing dot also)
Notice that in the case of K.L.M.NO, the first two dots are removed, the last one is replaced with a space (because NO is not a single character)
Notice that in the case of PQ.R.S, the first dot is replaced with a space, the second dot is removed.
I almost have a working solution:
re.sub(r'(?<!\w)([A-Z])\.', r'\1', s)
But in the example given, T.U.VWXYZ gets translated to TUVWXYZ, whereas it should be TU VWXYZ
Note: it's not important for this to be solved with a single regex, or even regex at all for that matter.
Edit: changed PQ.RS to PQ.R.S in the example string.
I'd take two steps:
1. Replace (\b[A-Z])\.(?=[A-Z]\b|\s|$) with r'\1'
2. Replace (\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,}) with r'\1\2 '
Sample
import re
re1 = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
re2 = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
r = re2.sub(r'\1\2 ', re1.sub(r'\1', s)).strip()
print(r)
outputs
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
which matches your desired result:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
re1 matches all dots that are preceded by a free-standing letter and followed by either another free-standing letter, or whitespace, or the end of the string.
re2 matches all dots that are preceded by at least 2 letters and followed by at least 1 letter (or the other way around).
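To see what each pass contributes, here is a small sketch that prints the intermediate string. It uses the edited example from the question (with PQ.R.S); the variable names `step1` and `step2` are only for illustration.

```python
import re

re1 = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
re2 = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')

s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'

# Pass 1: drop dots that sit between free-standing letters
# (or after one, before whitespace or the end of the string).
step1 = re1.sub(r'\1', s)
print(repr(step1))  # ' ABCDE FGH.IJ KLM.NO PQ.RS TU.VWXYZ'

# Pass 2: the remaining dots border a multi-letter run, so they become spaces.
step2 = re2.sub(r'\1\2 ', step1).strip()
print(repr(step2))  # 'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
```

Note that r'\1\2 ' relies on unmatched groups being replaced with an empty string, which is the behaviour of re.sub() since Python 3.5.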
You can first replace every dot that is followed by two letters with a space, and then remove the remaining dots:
re.sub(r'\.([A-Z]{2})', r' \1', s).replace(".", "")
This gives " ABCDE FGH IJ KLM NO PQ RS TU VWXYZ" on your example.
Hopefully this is slightly neater:
import re
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
s = re.sub(r"\.(\w{2})", r" \1", s)
s = re.sub(r"(\w{2})\.(\w)", r"\1 \2", s)
s = re.sub(r"\.", "",s)
s = s.strip()
print(s)
You can use a single regex solution if you consider using a dynamic replacement:
import re
rx = r'\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?)|\.'
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
print( re.sub(rx, lambda x: x.group(1).replace('.', '') if x.group(1) else ' ', s.strip()) )
# => ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
See the Python demo and a regex demo.
The regex matches:
\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?) - a word boundary, then Group 1 (that will be replaced with itself after stripping off all periods) capturing:
[A-Z] - an uppercase ASCII letter
(?:\.[A-Z])+ - one or more sequences of a dot and an uppercase ASCII letter
\b - word boundary
(?:\.(?![A-Z]))? - an optional sequence of . that is not followed with an uppercase ASCII letter
| - or
\. - a . in any other context (it will be replaced with a space).
The lambda x: x.group(1).replace('.', '') if x.group(1) else ' ' replacement means that if Group 1 matches, the replacement string is Group 1 value without dots, and if Group 1 does not match the replacement is a single regular space.
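If the lambda is hard to read, the same replacement logic can be written as a named function. This is just a sketch of that idea; `fix_dots` and `clean` are hypothetical names, and the regex is unchanged from the answer.

```python
import re

RX = re.compile(r'\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?)|\.')

def fix_dots(match):
    # Group 1 holds a run of dot-separated single letters, e.g. "K.L.M".
    initials = match.group(1)
    if initials:
        return initials.replace('.', '')
    # Otherwise the match is a lone dot in some other context.
    return ' '

def clean(text):
    return RX.sub(fix_dots, text.strip())

print(clean(' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'))
# ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
```

A named function also makes it easier to add further cases later without growing a one-line conditional.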
Given a string like aa5f5 aa5f5, I am trying to split the tokens wherever a non-digit meets a digit, like this:
re.sub(r'([^\d])(\d{1})', r'\1 \2', 'aa5f5 aa5f5')
Out: aa 5f 5 aa 5f 5
Now I want to prevent tokens that start with a specific prefix character ($) from being split: for $aa5f5 aa5f5, the desired output is $aa5f5 aa 5f 5
The problem is that I could only come up with this ugly loop:
import re

sentence = '$aa5f5 aa5f5'
new_sentence = []
for s in sentence.split():
    if s.startswith('$'):
        # tokens with the "$" prefix are kept as they are
        new_sentence.append(s)
    else:
        new_sentence.append(re.sub(r'([^\d])(\d{1})', r'\1 \2', s))
print(' '.join(new_sentence)) # $aa5f5 aa 5f 5
I could not find a way to do this with a single-line regexp. Any help is appreciated, thank you.
You may use
new_sentence = re.sub(r'(\$\S*)|(?<=\D)\d', lambda x: x.group(1) if x.group(1) else rf' {x.group()}', sentence)
See the Python demo.
Here, (\$\S*)|(?<=\D)\d matches $ and any 0+ non-whitespace characters (with (\$\S*) capturing the value in Group 1), or a digit that is preceded with a non-digit char (see the (?<=\D)\d pattern part).
If Group 1 matched, it is pasted back as is (see x.group(1) if x.group(1) in the replacement), else, the space is inserted before the matched digit (see else rf' {x.group()}').
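The callback can also be spelled out as a named function, which may be easier to follow. A minimal sketch, where `split_digit` and `split_unprotected` are made-up names and the second input is an extra example that is not in the question:

```python
import re

TOKEN = re.compile(r'(\$\S*)|(?<=\D)\d')

def split_digit(match):
    # Group 1 is set only for "$"-prefixed tokens, which are returned unchanged.
    protected = match.group(1)
    if protected:
        return protected
    # Otherwise the match is a single digit preceded by a non-digit.
    return ' ' + match.group()

def split_unprotected(sentence):
    return TOKEN.sub(split_digit, sentence)

print(split_unprotected('$aa5f5 aa5f5'))  # $aa5f5 aa 5f 5
print(split_unprotected('aa5f5 $b7 c9'))  # aa 5f 5 $b7 c 9
```

The trick is that the first alternative consumes the whole protected token, so the second alternative never gets a chance to match inside it.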
With PyPi regex module, you may do it in a simple way:
import regex
sentence = '$aa5f5 aa5f5'
print( regex.sub(r'(?<!\$\S*)(?<=\D)(\d)', r' \1', sentence) )
See this online Python demo.
The (?<!\$\S*)(?<=\D)(\d) pattern matches and captures into Group 1 any digit ((\d)) that is preceded with a non-digit ((?<=\D)) and not preceded with $ and then any 0+ non-whitespace chars ((?<!\$\S*)).
This is not really a job for a single regular expression. If it can be done, the regex will be complex and hard to understand, and a new developer joining your team will not grasp it right away. It's better to keep the code the way you already wrote it. For the regex part, the following will probably do the splitting correctly:
' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
>>> s = "aa5f5 aa5f53r12"
>>> ' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
'aa 5 f 5 aa 5 f 53 r 12'
I have a python string
'AAAAA BBB AAAAA AA BBBBBB'
with the blank spaces in between.
In the output, every run of non-blank characters shorter than a certain length should be replaced by blank spaces.
For example, if I need to replace runs shorter than 4 characters, my output should look like:
'AAAAA     AAAAA    BBBBBB'
with the position of other characters being the same.
Use a regular expression, using the re module:
import re
re.sub(r'\b\w{1,3}\b', lambda m: ' ' * len(m.group()), inputstring)
The 3 is your maximum number of consecutive characters.
Breaking this down:
re.sub(pattern, replacement, string) finds matches in string using pattern, uses the replacement pattern or function to produce replacements, and returns a new string.
The pattern \b\w{1,3}\b uses:
\b word boundaries; these match between word and non-word characters or at the start or end; here between a space and a letter. Putting these at either end of \w{1,3} means we only want matches that have spaces or the start or end of the string on each side.
\w matches 'word' characters, which are letters and digits and underscores.
{n,m} states a pattern must be repeated between n and m times; you can leave one or the other out for none or as many as you like. {1,3} means between 1 and 3 times a character that matches \w.
The replacement is a function, that is passed a match object for each matching substring. Here, it returns a number of spaces matching the input string length.
See the Regular Expression HOWTO for more info.
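A quick way to convince yourself that the positions are preserved is to compare lengths and offsets before and after. This is only a sketch; `blanked` is a name chosen for illustration, and the input is assumed to use single spaces between the groups.

```python
import re

inputstring = 'AAAAA BBB AAAAA AA BBBBBB'
blanked = re.sub(r'\b\w{1,3}\b', lambda m: ' ' * len(m.group()), inputstring)

# repr() makes the runs of spaces visible.
print(repr(blanked))  # 'AAAAA     AAAAA    BBBBBB'

# Every replaced run keeps its width, so nothing shifts.
assert len(blanked) == len(inputstring)
print(blanked.index('BBBBBB') == inputstring.index('BBBBBB'))  # True
```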
If you want to keep the length variable, use formatting to add the number into the pattern:
def blank_out_up_to(string, length):
    return re.sub(
        rf'\b\w{{1,{length}}}\b',
        lambda m: ' ' * len(m.group()),
        string)
Demo:
>>> example = 'AAAAA BBB AAAAA AA BBBBBB'
>>> for i in range(1, 6):
...     print(f'{i}: {blank_out_up_to(example, i)}')
...
1: AAAAA BBB AAAAA AA BBBBBB
2: AAAAA BBB AAAAA    BBBBBB
3: AAAAA     AAAAA    BBBBBB
4: AAAAA     AAAAA    BBBBBB
5:                    BBBBBB
Here is another variation using re,
inp = 'AAAAA BBB AAAAA AA BBBBBB'
''.join([x if len(x) > 3 else ' ' * len(x) for x in re.split(r'(\s+)', inp)])
>> 'AAAAA     AAAAA    BBBBBB'
Here's an anti-regex solution using itertools.
This works if, as in your example, your groups consist of identical characters. If this is not guaranteed, you should use a regex method.
from itertools import groupby, chain
x = 'AAAAA BBB AAAAA AA BBBBBB'
res = ''.join(chain.from_iterable(i if len(i)>3 else ' '*len(i) for i in
                                  (''.join(j) for _, j in groupby(x))))
print(res)
# "AAAAA     AAAAA    BBBBBB"
I'm now trying to extract the text from a structured string by regex.
For instance,
string = "field1:afield3:bfield2:cfield3:d"
All I want are the values of field3, which are 'b' and 'd'.
I tried regex = "(field1:.*?)?(field2:.*?)?field3:"
and split the raw string by it,
but I got this:
['', 'field1:a', None, 'b', None, 'field2:c', 'd']
So, what is the solution?
The real case is:
string = "1st sentence---------------------- Forwarded by Michelle
Cash/HOU/ECT on ---------------------------Ava Syon#ENRON To: Michelle
Cash/HOU/ECT#ECTcc: Twanda Sweet/HOU/ECT#ECT Subject: 2nd sentence---------
------------- Forwarded by Michelle Cash/HOU/ECT on -----------------------
----Ava Syon#ENRON To: Michelle Cash/HOU/ECT#ECTcc: Twanda
Sweet/HOU/ECT#ECT Subject: 3rd sentence"
(one line string, without \n)
a list
re = ["1st sentence","2nd sentence","3rd sentence"]
is the result needed
Thanks!
Use
re.findall(r'field3:(.*?)(?=field\d+:|$)', s)
See the regex demo. NOTE: re.findall returns the contents of the capturing group, thus, you do not need a lookbehind in the pattern, a capturing group will do.
The regex matches:
field3: - a literal char sequence
(.*?) - any 0+ chars other than line break (if you use re.DOTALL modifier, the dot will match a newline, too)
(?=field\d+:|$) - a positive lookahead that requires (but does not consume, does not add to the match or capture) the presence of field, 1+ digits, : or the end of string after the current position.
Python demo:
import re
rx = r"field3:(.*?)(?=field\d+:|$)"
s = "field1:afield3:b and morefield2:cfield3:d and here"
res = re.findall(rx, s)
print(res)
# => ['b and more', 'd and here']
NOTE: A more efficient (unrolled) version of the same regex is
field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)
See the regex demo
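For the "real case" in the question, one possible approach is to split on the forwarding header rather than match the sentences. The sketch below assumes that every block to discard starts with a run of dashes followed by "Forwarded by" and ends at the next "Subject:"; the `header`, `text` and `separator` names are made up, and the dash runs are shortened compared to the question's string.

```python
import re

# Shortened stand-in for the one-line mail text from the question.
header = ('---------- Forwarded by Michelle Cash/HOU/ECT on '
          '----------Ava Syon#ENRON To: Michelle Cash/HOU/ECT#ECTcc: '
          'Twanda Sweet/HOU/ECT#ECT Subject: ')
text = '1st sentence' + header + '2nd sentence' + header + '3rd sentence'

# Assumption: a separator runs from the dashes before "Forwarded by"
# up to and including the next "Subject:" and the whitespace after it.
separator = re.compile(r'-+\s*Forwarded by.*?Subject:\s*')

sentences = separator.split(text)
print(sentences)  # ['1st sentence', '2nd sentence', '3rd sentence']
```

If real messages contain "Subject:" inside the body or use other header layouts, the separator pattern would need to be adjusted.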
You could use a positive lookbehind. It will find any character directly after field3:
>>> import re
>>> string = "field1:afield3:bfield2:cfield3:d"
>>> re.findall(r'(?<=field3:).', string)
['b', 'd']
This only works for single-character values. I could add a positive lookahead, but then it would become the same answer as Wiktor's.
So here's an alternative with re.split():
>>> string = "field1:afield3:boatfield2:cfield3:dolphin"
>>> elements = re.split(r'(field\d+:)',string)
>>> [elements[i+1] for i, x in enumerate(elements) if x == 'field3:']
['boat', 'dolphin']
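Building on the same re.split() call, the values of all fields can be collected in one pass. This is a sketch rather than part of the answer; `group_fields` is a hypothetical helper name.

```python
import re
from collections import defaultdict

def group_fields(text):
    # The capturing group keeps the "fieldN:" labels in the result,
    # so after the first item labels and values alternate.
    parts = re.split(r'(field\d+:)', text)
    grouped = defaultdict(list)
    for label, value in zip(parts[1::2], parts[2::2]):
        grouped[label.rstrip(':')].append(value)
    return dict(grouped)

print(group_fields('field1:afield3:boatfield2:cfield3:dolphin'))
# {'field1': ['a'], 'field3': ['boat', 'dolphin'], 'field2': ['c']}
```

With the dictionary in hand, the values of field3 are simply `group_fields(string)['field3']`.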
A complex solution to get field values by field number using built-in str.replace(), str.split() and str.startswith() functions:
def getFieldValues(s, field_number):
    delimited = s.replace('field', '|field') # setting delimiter between fields
    return [i.split(':')[1] for i in delimited.split('|') if i.startswith('field' + str(field_number))]
s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"
print(getFieldValues(s, 3))
# ['b some text', 'd and data']
print(getFieldValues(s, 1))
# ['a hello-again']
print(getFieldValues(s, 2))
# ['c another text']
I have the following string as an example:
string = "## cat $$ ##dog$^"
I want to extract all the substrings that are enclosed between "##" and "$", so the output will be:
[" cat ","dog"]
I only know how to extract the first occurrence:
import re
r = re.compile('##(.*?)$')
m = r.search(string)
if m:
    result_str = m.group(1)
Thoughts and suggestions on how to catch them all are welcome.
Use re.findall() to get every occurrence of your substring. $ is a special character in regular expressions that acts as the end-of-string anchor, so you need to escape it to match a literal $.
>>> import re
>>> s = '## cat $$ ##dog$^'
>>> re.findall(r'##(.*?)\$', s)
[' cat ', 'dog']
To remove the leading and trailing whitespace, you can simply match it outside of the capture group.
>>> re.findall(r'##\s*(.*?)\s*\$', s)
['cat', 'dog']
Also, if the context has a possibility of spanning across newlines, you may consider using negation.
>>> re.findall(r'##\s*([^$]*)\s*\$', s)
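To illustrate the newline point, here is a small sketch with a made-up input in which the first value spans two lines. The third pattern, with a lazy `[^$]*?`, is an addition here (not part of the answer) that also trims the trailing whitespace.

```python
import re

# Hypothetical input: the first "##...$" section contains a newline.
s = '## cat\nfish $$ ##dog$^'

# "." does not match a newline by default, so the first section is skipped.
dot_result = re.findall(r'##\s*(.*?)\s*\$', s)
print(dot_result)      # ['dog']

# The negated class matches newlines, but being greedy it keeps the
# trailing space before "$".
negated_result = re.findall(r'##\s*([^$]*)\s*\$', s)
print(negated_result)  # ['cat\nfish ', 'dog']

# A lazy negated class leaves the trailing whitespace to \s*.
lazy_result = re.findall(r'##\s*([^$]*?)\s*\$', s)
print(lazy_result)     # ['cat\nfish', 'dog']
```

Passing re.DOTALL to the first pattern would be another way to let it cross line breaks.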