Replacing consecutive non-space elements less than a specified length - python

I have a python string
'AAAAA BBB AAAAA AA BBBBBB'
with the blank spaces in between.
I need the output to have the non-space islands shorter than a certain length replaced by blank spaces.
Say, for example, I need to replace strings shorter than 4 characters; then my output should look like:
'AAAAA     AAAAA    BBBBBB'
with the position of other characters being the same.

Use a regular expression, using the re module:
import re
re.sub(r'\b\w{1,3}\b', lambda m: ' ' * len(m.group()), inputstring)
The 3 is your maximum number of consecutive characters.
Breaking this down:
re.sub(pattern, replacement, string) will find matches in string using pattern, then uses the replacement pattern or function to produce replacements, and a new string is returned.
The pattern \b\w{1,3}\b uses:
\b word boundaries; these match between word and non-word characters or at the start or end of the string; here, between a space and a letter. Putting these at either end of \w{1,3} means we only accept matches that have a non-word character (here a space) or the start or end of the string on each side.
\w matches 'word' characters, which are letters, digits and underscores.
{n,m} states that the preceding pattern must be repeated between n and m times; leave out n for a minimum of zero, or m for no upper limit. {1,3} means between 1 and 3 characters that match \w.
The replacement is a function that is passed a match object for each matching substring. Here, it returns a string of spaces as long as the matched substring.
See the Regular Expression HOWTO for more info.
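To see which substrings the pattern actually selects, a quick check with re.findall can help. This is only an illustrative sketch; the second input string is made up to show that \b also treats punctuation such as a hyphen as a boundary.

```python
import re

pattern = r'\b\w{1,3}\b'

# Only the short "words" are selected; longer runs never match in part.
print(re.findall(pattern, 'AAAAA BBB AAAAA AA BBBBBB'))  # ['BBB', 'AA']

# Hypothetical input: \w covers letters, digits and underscores, and a
# hyphen counts as a non-word character, so 'b-c' is seen as two words.
print(re.findall(pattern, 'a_1 b-c'))  # ['a_1', 'b', 'c']
```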
If you want to keep the length variable, use formatting to add the number into the pattern:
def blank_out_up_to(string, length):
    return re.sub(
        rf'\b\w{{1,{length}}}\b',
        lambda m: ' ' * len(m.group()),
        string)
Demo:
>>> example = 'AAAAA BBB AAAAA AA BBBBBB'
>>> for i in range(1, 6):
... print(f'{i}: {blank_out_up_to(example, i)}')
...
1: AAAAA BBB AAAAA AA BBBBBB
2: AAAAA BBB AAAAA    BBBBBB
3: AAAAA     AAAAA    BBBBBB
4: AAAAA     AAAAA    BBBBBB
5:                    BBBBBB
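The question title speaks of non-space elements, while \w only covers letters, digits and underscores. If the islands may contain other characters, a variant built on \S with whitespace lookarounds could be used instead. This is a minimal sketch under that assumption; the function name and the sample input with a hyphen are hypothetical.

```python
import re

def blank_short_islands(text, max_len):
    # (?<!\S) and (?!\S) require whitespace or a string edge on each side,
    # so only whole islands of at most max_len non-space characters match.
    pattern = rf'(?<!\S)\S{{1,{max_len}}}(?!\S)'
    return re.sub(pattern, lambda m: ' ' * len(m.group()), text)

print(repr(blank_short_islands('AAAAA B-B AAAAA AA BBBBBB', 3)))
# 'AAAAA     AAAAA    BBBBBB'
```

With the \b\w{1,3}\b pattern, the same input would keep the hyphen, because each B on either side of it counts as a separate one-letter word.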

Here is another variation using re.split:
import re
inp = 'AAAAA BBB AAAAA AA BBBBBB'
''.join([x if len(x) > 3 else ' ' * len(x) for x in re.split(r'(\s+)', inp)])
# 'AAAAA     AAAAA    BBBBBB'

Here's an anti-regex solution using itertools.
This works if, as in your example, your groups consist of identical characters. If this is not guaranteed, you should use a regex method.
from itertools import groupby, chain
x = 'AAAAA BBB AAAAA AA BBBBBB'
res = ''.join(chain.from_iterable(i if len(i)>3 else ' '*len(i) for i in
                                  (''.join(j) for _, j in groupby(x))))
print(res)
# "AAAAA     AAAAA    BBBBBB"
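If the groups are not made of identical characters but a regex is still unwanted, splitting on single spaces is another option. A minimal sketch, assuming the separators are plain space characters; the function name and the min_keep parameter are hypothetical.

```python
def blank_short_words(text, min_keep=4):
    # str.split(' ') keeps empty strings for consecutive spaces,
    # so joining with ' ' restores every original position.
    return ' '.join(
        word if len(word) >= min_keep else ' ' * len(word)
        for word in text.split(' ')
    )

print(repr(blank_short_words('AAAAA BBB AAAAA AA BBBBBB')))
# 'AAAAA     AAAAA    BBBBBB'
```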


Python regex match a pattern for multiple times

I've got a list of strings.
input=['XX=BB|3|3|1|1|PLP|KLWE|9999|9999', 'XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999', '999|999|999|9999|999', ....]
This type '999|999|999|9999|999' remains unchanged.
I need to replace 9999|9999 with 12|21
I wrote (?<=BB\|\d\|\d\|\d\|\d\|\S{3}\|\S{4}\|)9{2,9}\|9{2,9} to match 9999|9999. However, there can be 4 to 6 repetitions of \|\d in the middle. How can I match this \|\d pattern a variable number of times?
Desired result:
['XX=BB|3|3|1|1|PLP|KLWE|12|21', 'XX=BB|3|3|1|1|2|PLP|KPOK|12|21', '999|999|999|9999|999'...]
thanks
You can use
re.sub(r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}(?!\d)', r'\g<1>12|21', text)
See the regex demo.
Details:
(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|) - Capturing group 1:
BB - a BB string
(?:\|\d){4,6} - four, five or six repetitions of | followed by a single digit
\| - a | char
[^\s|]{3} - three chars other than whitespace and a pipe
\|[^\s|]{4}\| - a |, four chars other than whitespace and a pipe, and then a pipe char
9{2,9}\|9{2,9} - two to nine 9 chars, | and again two to nine 9 chars...
(?!\d) - not followed with another digit (note you may remove this if you do not need to check for the digit boundary here. You may also use (?![^|]) instead if you need to check if there is a | char or end of string immediately on the right).
The \g<1>12|21 replacement includes an unambiguous backreference to Group 1 (\g<1>) and a 12|21 substring appended to it.
See the Python demo:
import re
texts=['XX=BB|3|3|1|1|PLP|KLWE|9999|9999', 'XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999', '999|999|999|9999|999']
pattern = r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}(?!\d)'
repl = r'\g<1>12|21'
for text in texts:
    print( re.sub(pattern, repl, text) )
Output:
XX=BB|3|3|1|1|PLP|KLWE|12|21
XX=BB|3|3|1|1|2|PLP|KPOK|12|21
999|999|999|9999|999
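To illustrate how the {4,6} quantifier bounds the number of |digit fields, here is a small sketch that reuses the pattern above on two made-up inputs with three and seven such fields. The function name and both extra inputs are hypothetical.

```python
import re

PATTERN = re.compile(
    r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}(?!\d)'
)

def replace_tail(text):
    # Keep Group 1 (everything up to the 9-runs) and append the new tail.
    return PATTERN.sub(r'\g<1>12|21', text)

# Hypothetical inputs with 3 and 7 single-digit fields after BB.
too_few = 'XX=BB|3|3|1|PLP|KLWE|9999|9999'
too_many = 'XX=BB|1|2|3|4|5|6|7|PLP|KLWE|9999|9999'
print(replace_tail(too_few) == too_few)    # True: left unchanged
print(replace_tail(too_many) == too_many)  # True: left unchanged
print(replace_tail('XX=BB|3|3|1|1|PLP|KLWE|9999|9999'))
# XX=BB|3|3|1|1|PLP|KLWE|12|21
```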
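I would just use re.sub here and search for the pattern \b9{2,9}\|9{2,9}\b. Note that this simpler pattern would also rewrite a plain string such as '999|999|999|9999|999', so it only fits if those strings are filtered out beforehand:
import re
inp = ["XX=BB|3|3|1|1|PLP|KLWE|9999|9999", "XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999"]
output = [re.sub(r'\b9{2,9}\|9{2,9}\b', '12|21', i) for i in inp]
print(output)
# ['XX=BB|3|3|1|1|PLP|KLWE|12|21', 'XX=BB|3|3|1|1|2|PLP|KPOK|12|21']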

Use regex to contextually replace dots in a string

I want to remove all dots that sit between single characters. I also want to replace with a space every dot that has more than one consecutive character on at least one side.
For example. Given a string,
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
After processing the output should look like:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
Notice that in the case of A.B.C.D.E., all dots are removed (this should be true for when there is no trailing dot also)
Notice that in the case of K.L.M.NO, the first two dots are removed, the last one is replaced with a space (because NO is not a single character)
Notice that in the case of PQ.R.S, the first dot is replaced with a space, the second dot is removed.
I almost have a working solution:
re.sub(r'(?<!\w)([A-Z])\.', r'\1', s)
But in the example given, T.U.VWXYZ gets translated to TUVWXYZ, whereas it should be TU VWXYZ
Note: it's not important for this to be solved with a single regex, or even regex at all for that matter.
Edit: changed PQ.RS to PQ.R.S in the example string.
I'd take two steps.
replace (\b[A-Z])\.(?=[A-Z]\b|\s|$) with r'\1'
replace (\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,}) with r'\1\2 '
Sample
import re
re1 = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
re2 = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
r = re2.sub(r'\1\2 ', re1.sub(r'\1', s)).strip()
print(r)
outputs
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
which matches your desired result:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
re1 matches all dots that are preceded by a free-standing letter and followed by either another free-standing letter, or whitespace, or the end of the string.
re2 matches all dots that are preceded by at least 2 letters and followed by at least 1 letter (or the other way around).
You can first replace all dots followed by two characters by spaces, and then remove the remaining dots:
re.sub(r'\.([A-Z]{2})', r' \1', s).replace(".", "")
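This gives " ABCDE FGH IJ KLM NO PQ RS TU VWXYZ" on the original example (with PQ.RS). With the edited string (PQ.R.S) it yields PQRS instead, because neither of those dots is followed by two letters.
Hopefully this is slightly neater: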
import re
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
s = re.sub(r"\.(\w{2})", r" \1", s)
s = re.sub(r"(\w{2})\.(\w)", r"\1 \2", s)
s = re.sub(r"\.", "",s)
s = s.strip()
print(s)
You can use a single regex solution if you consider using a dynamic replacement:
import re
rx = r'\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?)|\.'
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
print( re.sub(rx, lambda x: x.group(1).replace('.', '') if x.group(1) else ' ', s.strip()) )
# => ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
See the Python demo and a regex demo.
The regex matches:
\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?) - a word boundary, then Group 1 (that will be replaced with itself after stripping off all periods) capturing:
[A-Z] - an uppercase ASCII letter
(?:\.[A-Z])+ - one or more sequences of a dot and an uppercase ASCII letter
\b - word boundary
(?:\.(?![A-Z]))? - an optional sequence of . that is not followed with an uppercase ASCII letter
| - or
\. - a . in any other context (it will be replaced with a space).
The lambda x: x.group(1).replace('.', '') if x.group(1) else ' ' replacement means that if Group 1 matches, the replacement string is Group 1 value without dots, and if Group 1 does not match the replacement is a single regular space.
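The same single-regex idea can be wrapped in a small function, which makes it easier to try on both versions of the example (PQ.RS and PQ.R.S) and on input without a trailing dot. This is only a sketch; the function name is hypothetical.

```python
import re

RX = re.compile(r'\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?)|\.')

def clean_dots(text):
    def repl(match):
        initials = match.group(1)
        # Group 1 matched: drop its dots. Otherwise a lone dot was matched
        # and it becomes a single space.
        return initials.replace('.', '') if initials else ' '
    return RX.sub(repl, text.strip())

print(clean_dots(' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'))
# ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
```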

Python regexp: exclude specific pattern from sub

Having a string like this: aa5f5 aa5f5, I try to split the tokens where a non-digit meets a digit, like this:
re.sub(r'([^\d])(\d{1})', r'\1 \2', 'aa5f5 aa5f5')
Out: aa 5f 5 aa 5f 5
Now I want to prevent tokens with a specific prefix character ($) from being split: for $aa5f5 aa5f5, the desired output is $aa5f5 aa 5f 5
The problem is that I only came up with this ugly loop:
import re
sentence = '$aa5f5 aa5f5'
new_sentence = []
for s in sentence.split():
    if s.startswith('$'):
        new_sentence.append(s)
    else:
        new_sentence.append(re.sub(r'([^\d])(\d{1})', r'\1 \2', s))
print(' '.join(new_sentence)) # $aa5f5 aa 5f 5
But I could not find a way to do this with a single-line regexp. I need help with this, thank you.
You may use
new_sentence = re.sub(r'(\$\S*)|(?<=\D)\d', lambda x: x.group(1) if x.group(1) else rf' {x.group()}', sentence)
See the Python demo.
Here, (\$\S*)|(?<=\D)\d matches $ and any 0+ non-whitespace characters (with (\$\S*) capturing the value in Group 1), or a digit that is preceded by a non-digit char (see the (?<=\D)\d pattern part).
If Group 1 matched, it is pasted back as is (see x.group(1) if x.group(1) in the replacement), else, the space is inserted before the matched digit (see else rf' {x.group()}').
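Written out as a named function, the callback logic described above might look like the following sketch. The function and variable names are hypothetical, and the second input in the usage lines is made up for illustration.

```python
import re

def split_digits(sentence):
    def repl(match):
        protected = match.group(1)
        if protected is not None:
            # A $-prefixed token was matched: paste it back unchanged.
            return protected
        # Otherwise a digit after a non-digit was matched: prepend a space.
        return ' ' + match.group()
    return re.sub(r'(\$\S*)|(?<=\D)\d', repl, sentence)

print(split_digits('$aa5f5 aa5f5'))  # $aa5f5 aa 5f 5
print(split_digits('x1 $y2 z3'))     # x 1 $y2 z 3
```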
With the PyPI regex module, which supports variable-length lookbehinds, you may do it in a simple way:
import regex
sentence = '$aa5f5 aa5f5'
print( regex.sub(r'(?<!\$\S*)(?<=\D)(\d)', r' \1', sentence) )
See this online Python demo.
The (?<!\$\S*)(?<=\D)(\d) pattern matches and captures into Group 1 any digit ((\d)) that is preceded with a non-digit ((?<=\D)) and not preceded with $ and then any 0+ non-whitespace chars ((?<!\$\S*)).
This is not something a regular expression does well. If it can be done, it will be a complex regex that is hard to understand, and a new developer joining your team will not understand it right away. It is better to keep it the way you already wrote it. For the regex part, the following code will probably do the splitting correctly:
' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
>>> s = "aa5f5 aa5f53r12"
>>> ' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
'aa 5 f 5 aa 5 f 53 r 12'

Python Regex match a mac address from the end?

I have the following re to extract MAC address:
re.sub( r'(\S{2,2})(?!$)\s*', r'\1:', '0x0000000000aa bb ccdd ee ff' )
However, this gave me 0x:00:00:00:00:00:aa:bb:cc:dd:ee:ff.
How do I modify this regex to stop after matching the first 6 pairs starting from the end, so that I get aa:bb:cc:dd:ee:ff?
Note: the string has whitespace in between which is to be ignored. Only the last 12 characters are needed.
Edit1: re.findall( r'(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*$',a) finds the last 6 pairs in the string. I still don't know how to compress this regex. Again this still depends on the fact that the strings are in pairs.
Ideally the regex should take the last 12 valid \S characters starting from the end and string them with :
Edit2: Inspired by @Mariano's answer, which works great but depends on the last 12 characters starting with a pair, I came up with the following solution. It is kludgy but still seems to work for all inputs.
string = '0x0000000000a abb ccddeeff'
':'.join(''.join(i) for i in re.findall(r'(\S)\s*(\S)(?!(?:\s*\S\s*){11})', string))
'aa:bb:cc:dd:ee:ff'
Edit3: @Mariano has updated his answer, which now works for all inputs.
This will work for the last 12 characters, ignoring whitespace.
Code:
import re
text = "0x0000000000aa bb ccdd ee ff"
result = re.sub( r'.*?(?!(?:\s*\S){13})(\S)\s*(\S)', r':\1\2', text)[1:]
print(result)
Output:
aa:bb:cc:dd:ee:ff
DEMO
Regex breakdown:
The expression used in this code uses re.sub() to replace the following in the subject text:
.*?                  # consume as little of the subject text as possible
(?!(?:\s*\S){13})    # CONDITION: can't be followed by 13 non-whitespace chars,
                     #   so it can only start matching when 12 remain before $
(\S)\s*(\S)          # capture a char in group 1, the next char in group 2
#
# The match is replaced with :\1\2
# For this example, re.sub() returns ":aa:bb:cc:dd:ee:ff"
# We'll then apply [1:] to the returned value to discard the leading ":"
You can use re.finditer to find all the pairs and then join the result (this relies on every pair being a doubled lowercase letter, as in the example):
>>> my_string='0x0000000000aa bb ccdd ee ff'
>>> ':'.join([i.group() for i in re.finditer( r'([a-z])\1+',my_string )])
'aa:bb:cc:dd:ee:ff'
You may do it like this:
>>> import re
>>> s = '0x0000000000aa bb ccdd ee ff'
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
>>> s = '???767aa bb ccdd ee ff'
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
>>> s = '???767aa bb ccdd eeff '
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
I know this is not a direct answer to your question, but do you really need a regular expression? If your format is fixed, this should also work:
>>> s = '0x0000000000aa bb ccdd ee ff'
>>> ':'.join([s[-16:-8].replace(' ', ':'), s[-8:].replace(' ', ':')])
'aa:bb:cc:dd:ee:ff'
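If the spacing is not fixed, the same no-regex idea can be generalised by dropping all whitespace first and then pairing up the last 12 characters. A minimal sketch, assuming the input always contains at least 12 non-whitespace characters; the function name is hypothetical.

```python
def last_mac(text):
    # Remove every whitespace character, then keep the final 12 characters.
    compact = ''.join(text.split())
    if len(compact) < 12:
        raise ValueError('need at least 12 non-whitespace characters')
    tail = compact[-12:]
    # Group the tail into 6 pairs and join them with colons.
    return ':'.join(tail[i:i + 2] for i in range(0, 12, 2))

print(last_mac('0x0000000000aa bb ccdd ee ff'))  # aa:bb:cc:dd:ee:ff
print(last_mac('0x0000000000a abb ccddeeff'))    # aa:bb:cc:dd:ee:ff
```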

Elegant way to use regex to match order-indifferent groups of characters while limiting how many times a given character can appear?

I am looking for a way to use python regular expressions to match groups of characters with limits on how many times a character can appear in the match. The main problem is that the order of characters does not matter.
I would like to find a simple and extensible pattern for saying things like:
Find 3 characters together.
All of the characters must be from the group 'ABC'
0 to 3 of them can be 'A'
0 to 3 of them can be 'B'
0 to 1 of them can be 'C'
In which case the following would match:
ABC ACB CBA BCA AAB ABA BAA AAC ACA CAA ABB BAB BBA BBB
and the following would not match:
CCC CCA CAC ACC CCB CBC BCC
I have tried several approaches using lookahead but have yet to find one that handles all cases. For instance:
(?=C?[AB]{2,3})(?=[AB]{2,3}C?)(?=[AB]C?[AB])([ABC]{3})
which you can see here at regex101.
Is there a pattern for this type of match that doesn't involve listing all possible combinations?
UPDATE: Thank you for the great answers. You expanded my understanding of regular expressions. Since the original question didn't specify whether I wanted to match a substring, and gave examples that implied the contrary, I will select the answer most in line with the spirit of the original question and post a new question specific to the substring issue.
Use this pattern:
(?!.?C.?C)([ABC]{3})
Demo
To match a substring, use this pattern:
(?!CC.|C.C|.CC)([ABC]{3})
For ABCD with A{0,4} B{0,4} C{0,2} D{0,1}, use this pattern:
(?!([ABD]?C){3}|([ABC]?D){2})([ABCD]{4})
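(?!(.*?C){2})[ABC]{3}
Try this. See demo.
http://regex101.com/r/aU6gF1/2
import re
p = re.compile(r'(?!(.*?C){2})[ABC]{3}', re.IGNORECASE)
test_str = "ABC\nACB\nCBA\nBCA\nAAB\nABA\nBAA\nAAC\nACA\nCAA\nABB\nBAB\nBBA\nBBB\nCCC\nCCA\nCAC\nACC\nCCB\nCBC\nBCC\n\n\n\n"
# The group sits inside a negative lookahead, so it never captures anything;
# collect the whole matches with finditer rather than findall.
[m.group() for m in p.finditer(test_str)]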
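The lookahead idea can be generalised so that the pattern is built from a table of per-character limits rather than written by hand. The sketch below is one possible way to do that, not the answerer's own code: the function name and the (low, high) tuple layout are assumptions, and the pattern is meant to be used with fullmatch on whole strings.

```python
import re

def build_pattern(length, limits):
    # limits maps each allowed character to a (low, high) pair.
    parts = []
    for ch, (low, high) in limits.items():
        c = re.escape(ch)
        if low > 0:
            # Require at least `low` occurrences somewhere ahead.
            parts.append(rf'(?=(?:.*?{c}){{{low}}})')
        if high < length:
            # Forbid `high + 1` occurrences ahead.
            parts.append(rf'(?!(?:.*?{c}){{{high + 1}}})')
    chars = re.escape(''.join(limits))
    return re.compile(''.join(parts) + rf'[{chars}]{{{length}}}')

pattern = build_pattern(3, {'A': (0, 3), 'B': (0, 3), 'C': (0, 1)})
print(pattern.pattern)                    # (?!(?:.*?C){2})[ABC]{3}
print(bool(pattern.fullmatch('ACB')))     # True
print(bool(pattern.fullmatch('CAC')))     # False
```

For the question's limits this reduces to the same single negative lookahead shown above, since only C has an upper bound below the total length.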
One thing you could do is programmatically generate an explicit alternation that you can then embed in other regexes:
from collections import Counter, namedtuple
from itertools import product
# You could just hardcode tuples in `limits` instead and access their indices in
# `test`; I just happen to like `namedtuple`.
Limit = namedtuple('Limit', ['low', 'high'])
# conditions
length = 3
valid_characters = 'ABC'
limits = {
    'A': Limit(low=0, high=3),
    'B': Limit(low=0, high=3),
    'C': Limit(low=0, high=1)
}
# determines whether a single string is valid
def is_valid(string):
    if len(string) != length:
        return False
    counts = Counter(string)
    for character in limits:
        if not (limits[character].low <= counts[character] <= limits[character].high):
            return False
    return True
# constructs a (foo|bar|baz)-style alternation of all valid strings
def generate_alternation():
    possible_strings = map(''.join,
                           product(valid_characters, repeat=length))
    valid_strings = filter(is_valid,
                           possible_strings)
    alternation = '(' + '|'.join(valid_strings) + ')'
    return alternation
Given the conditions I included above, generate_alternation() would give:
(AAA|AAB|AAC|ABA|ABB|ABC|ACA|ACB|BAA|BAB|BAC|BBA|BBB|BBC|BCA|BCB|CAA|CAB|CBA|CBB)
Which would do what you wanted. You can embed the resulting alternation in further regexes freely.
There is always the permutation approach.
http://regex101.com/r/kO4uW8/1
# (?!(C.?C)|.?CC)[ABC]{3}
(?!
    ( C .? C )
  | .? CC
)
[ABC]{3}
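In Python, the expanded layout above can be compiled directly with re.VERBOSE, which ignores whitespace outside character classes and allows comments. A small sketch that checks the pattern against the example words from the question; the comments inside the pattern are my own reading of it.

```python
import re

pattern = re.compile(r'''
    (?!
        ( C .? C )   # two Cs at the start, optionally one character apart
      | .? CC        # or two adjacent Cs, optionally after one character
    )
    [ABC]{3}
''', re.VERBOSE)

should_match = 'ABC ACB CBA BCA AAB ABA BAA AAC ACA CAA ABB BAB BBA BBB'.split()
should_fail = 'CCC CCA CAC ACC CCB CBC BCC'.split()

print(all(pattern.fullmatch(w) for w in should_match))  # True
print(any(pattern.fullmatch(w) for w in should_fail))   # False
```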
