Remove consecutively repeated substring in a string using regex - python

import re
input_text = "((PERS)Yo), ((PERS)Yo) ((PERS)yo) hgasghasghsa ((PERS)Yo) ((PERS)Yo) ((PERS)Yo) ((PERS)yo) jhsjhsdhjsdsdh ((PERS)Yo) jhdjfjhdffdj ((PERS)ella) ((PERS)Ella) ((PERS)ellos) asassaasasasassaassaas ((PERS)yo) ssdsdsd"
pattern = re.compile(r'\(\(PERS\)\s*yo\s*\)(?:\(\(PERS\)\s*yo\s*\))+', flags = re.IGNORECASE)
modified_text = re.sub(pattern, '((PERS)yo)', input_text)
print(modified_text)
Why doesn't this code remove the consecutive repeated occurrences of the character sequence ((PERS)\s*yo\s*)?
This should be the correct output:
"((PERS)Yo), ((PERS)yo) hgasghasghsa ((PERS)yo) jhsjhsdhjsdsdh ((PERS)yo) jhdjfjhdffdj ((PERS)ella) ((PERS)Ella) ((PERS)ellos) asassaasasasassaassaas ((PERS)yo) ssdsdsd"

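One likely reason, offered as a sketch rather than a definitive answer: in the pattern above the optional \s* sits inside each token, before the closing \), so nothing is allowed between the end of one ((PERS)yo) and the start of the next. The tokens in the input are separated by spaces, so the repeated group never matches. The version below adds \s* in front of each repeated token. The name collapse_repeats is hypothetical. The sketch leaves lone tokens untouched, so it will not lowercase a single ((PERS)Yo) the way the expected output does in one place.

```python
import re

# The question's pattern, with \s* added in front of each repeated token.
PERS_YO_RUN = re.compile(
    r'\(\(PERS\)\s*yo\s*\)(?:\s*\(\(PERS\)\s*yo\s*\))+',
    flags=re.IGNORECASE,
)

def collapse_repeats(text):
    """Replace each run of two or more ((PERS)yo) tokens with a single one."""
    return PERS_YO_RUN.sub('((PERS)yo)', text)

sample = "((PERS)Yo), ((PERS)Yo) ((PERS)yo) abc ((PERS)Yo) ((PERS)Yo) ((PERS)Yo) def"
print(collapse_repeats(sample))
# ((PERS)Yo), ((PERS)yo) abc ((PERS)yo) def
```

The first token is followed by a comma, so it does not belong to a run and is kept as written.
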
Related

Python regexp: exclude specific pattern from sub

Having a string like this: aa5f5 aa5f5, I try to split the tokens where a non-digit meets a digit, like this:
re.sub(r'([^\d])(\d{1})', r'\1 \2', 'aa5f5 aa5f5')
Out: aa 5f 5 aa 5f 5
Now I want to prevent tokens with a specific prefix character ($) from being split: for $aa5f5 aa5f5, the desired output is $aa5f5 aa 5f 5
The problem is that I only came up with this ugly loop:
sentence = '$aa5f5 aa5f5'
new_sentence = []
for s in sentence.split():
    if s.startswith('$'):
        new_sentence.append(s)
    else:
        new_sentence.append(re.sub(r'([^\d])(\d{1})', r'\1 \2', s))
print(' '.join(new_sentence)) # $aa5f5 aa 5f 5
But I could not find a way to do this with a single-line regexp. I need help with this, thank you.
You may use
new_sentence = re.sub(r'(\$\S*)|(?<=\D)\d', lambda x: x.group(1) if x.group(1) else rf' {x.group()}', sentence)
See the Python demo.
Here, (\$\S*)|(?<=\D)\d matches $ and any 0+ non-whitespace characters (with (\$\S*) capturing the value in Group 1), or it matches a digit that is preceded by a non-digit char (see the (?<=\D)\d pattern part).
If Group 1 matched, it is pasted back as is (see x.group(1) if x.group(1) in the replacement); otherwise, a space is inserted before the matched digit (see else rf' {x.group()}').
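For readers who find the lambda dense, the same replacement logic can be written as a named function. This is only a restatement of the one-liner above. The names TOKEN_OR_DIGIT, _replace and split_digits are hypothetical.

```python
import re

# Group 1 catches whole $-prefixed tokens; the second branch matches
# a digit that follows a non-digit character.
TOKEN_OR_DIGIT = re.compile(r'(\$\S*)|(?<=\D)\d')

def _replace(match):
    # A $-prefixed token is returned unchanged.
    if match.group(1):
        return match.group(1)
    # Otherwise the match is a single digit: put a space in front of it.
    return ' ' + match.group()

def split_digits(sentence):
    return TOKEN_OR_DIGIT.sub(_replace, sentence)

print(split_digits('$aa5f5 aa5f5'))  # $aa5f5 aa 5f 5
```

Because the $-token branch comes first in the alternation, the digits inside a protected token are consumed before the digit branch can see them.
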
With the PyPI regex module, you may do it in a simpler way:
import regex
sentence = '$aa5f5 aa5f5'
print( regex.sub(r'(?<!\$\S*)(?<=\D)(\d)', r' \1', sentence) )
See this online Python demo.
The (?<!\$\S*)(?<=\D)(\d) pattern matches and captures into Group 1 any digit ((\d)) that is preceded with a non-digit ((?<=\D)) and not preceded with $ and then any 0+ non-whitespace chars ((?<!\$\S*)).
This is not something a regular expression can do. If it can, it will be a complex regex that is hard to understand, and a new developer who joins your team will not understand it right away. It is better to keep it the way you already wrote it. For the regex part, the following code will probably do the splitting correctly:
' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
>>> s = "aa5f5 aa5f53r12"
>>> ' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
'aa 5 f 5 aa 5 f 53 r 12'

Strip special characters in front of the first alphanumeric character

I have a regex that picks up address from data.
The regex looks for 'Address | address | etc.' and picks up the characters following that until occurrence of 4 integers together.
This includes special characters, which need to be stripped.
I run a loop to exclude all the special characters as seen in the code below. I need the code to drop only the special characters that are present in front of the first alphanumeric character.
Input (from OCR on an image):
Service address —_Unit8-10 LEWIS St, BERRI,SA 5343
possible_addresses = list(re.findall('Address(.* \d{4}|)|address(.*\d{4})|address(.*)', data))
address = str(possible_addresses[0])
for k in address.split("\n"):
    if k != ['(A-Za-Z0-9)']:
        address_2 = re.sub(r"[^a-zA-Z0-9]+", ' ', k)
Got now:
address : —_Unit 8 - 10 LEWIS ST, BERRI SA 5343
address_2 : Unit 8 10 LEWIS ST BERRI SA 5343
[\W_]* matches the special characters that follow "address", so they are skipped before the capture group starts.
import re
data='Service address —_Unit8-10 LEWIS St, BERRI,SA 5343'
possible_addresses = re.search(r'address[\W_]*(.*?\d{4})', data, re.I)
address = possible_addresses[1]
print('Address : ' + address)
Address : Unit8-10 LEWIS St, BERRI,SA 5343
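If the address has already been extracted and only the leading junk has to go, the same character class can be anchored at the start of the string. This is a minimal sketch. The name strip_leading_specials is hypothetical, and the sample value is the OCR text from the question with its dash written as an escape.

```python
import re

def strip_leading_specials(text):
    # ^[\W_]+ matches only the run of non-alphanumeric characters at the
    # very start, so punctuation inside the address is left alone.
    return re.sub(r'^[\W_]+', '', text)

raw = '\u2014_Unit8-10 LEWIS St, BERRI,SA 5343'
print(strip_leading_specials(raw))  # Unit8-10 LEWIS St, BERRI,SA 5343
```

Unlike replacing every [^a-zA-Z0-9]+ with a space, this keeps the hyphen and commas inside the address.
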
I'm guessing that the expression we wish to design here should capture everything from address to a four-digit zip, excluding some defined chars such as _. Let's then start with a simple expression with an i flag, such as:
(address:)[^_]*|[^_]*\d{4}
Any char that we do not want to have would go in here [^_]. For instance, if we are excluding !, our expression would become:
(address:)[^_!]*|[^_!]*\d{4}
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(address:)[^_]*|[^_]*\d{4}"
test_str = ("Address: Unit 8 - 10 LEWIS ST, BERRI SA 5343 \n"
"Address: Got now: __Unit 8 - 10 LEWIS ST, BERRI SA 5343\n"
"aDDress: Unit 8 10 LEWIS ST BERRI SA 5343\n"
"address: any special chars here !##$%^&*( 5343\n"
"address: any special chars here !##$%^&*( 5343")
matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Regex - replace space in string that has '.com Capital letter' with new line

I have a string with a space that I need to replace. The pattern is .com followed by a space and then any capital letter.
An example would be:
".com T"
The space between the .com and T needs to be replaced by a new line.
Use a regex with a lookbehind and a lookahead.
Example:
import re
l = "AAS asdasd asdasd Hello.com T"
m = re.sub(r"(?<=\.com)\s+(?=[A-Z])", "\n", l)  # \. matches a literal dot; the raw string keeps \s intact
print(m)
Output:
AAS asdasd asdasd Hello.com
T
You can use this to replace spaces after .com before a capital letter:
import re
data = """some.com Tata
dir.com Wube
asa.com alas
null.com 1234
"""
pattern = r'(\.com)(\s)([A-Z])' # captures .com as \1 and the capital letter as \3
repl = r"\1\n\3" # replaces the match with \1+newline+\3
print(re.sub(pattern,repl,data))
Output:
some.com
Tata
dir.com
Wube
asa.com alas
null.com 1234
See: https://regex101.com/r/hYOb3a/1
Use re.sub
import re
text = re.sub(r'\.com\s+([A-Z])', r'.com\n\1', text)
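The answers above use two styles: lookarounds that match only the whitespace, and capture groups that match the surrounding text and put it back. The sketch below runs both on the same input to show that they produce the same result here. The function names and the sample string are made up for illustration.

```python
import re

def newline_with_lookarounds(text):
    # Only the whitespace is matched; .com and the capital letter are
    # asserted by the lookbehind and lookahead, not consumed.
    return re.sub(r'(?<=\.com)\s+(?=[A-Z])', '\n', text)

def newline_with_groups(text):
    # .com, the whitespace and the capital letter are all matched,
    # and the replacement writes .com and the letter back.
    return re.sub(r'\.com\s+([A-Z])', r'.com\n\1', text)

sample = "some.com Tata asa.com alas null.com 1234"
print(newline_with_lookarounds(sample) == newline_with_groups(sample))  # True
print(repr(newline_with_lookarounds(sample)))
# 'some.com\nTata asa.com alas null.com 1234'
```

Which style to use is mostly a matter of taste; the lookaround version keeps the replacement string trivial, while the group version works in regex flavours without lookbehind support.
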

Regex for time format HH:MM AM/am/PM/pm in python

I have written a regex to capture HH:MM AM/PM/am/pm, but it does not extract the exact pattern.
Code for the regex:
import re
def replace_entities(example):
    res = ''
    # TIME
    m = re.findall("\d{2}:\d{2} (:?AM|PM|am|pm)", example)
    if m:
        for id in m:
            res = res +"\n{} :TIMESTR".format(id)
    m = re.findall("\d{2}:\d{2}:\d{3} (:?AM|PM|am|pm)", example)
    if m:
        for id in m:
            res = res +"\n{} :TIMESTR".format(id)
print(replace_entities('My name is sayli, Todays time is 12:10 PM Date is 21/08/2018 otal amount is www.amazon.com chandanpatil#yahoo.com euros 10,2018/13/09 saylijawale#gmail.com. https://imarticus.com Account number is Accountsortcode:abca123456'))
But I am not able to capture the time 12:10 PM as TIMESTR.
Link to the regex I tried: https://regex101.com/r/Z8lUIW/2
How do I correct it? Any suggestions? Please help.
Try this one:
\s(\d{2}\:\d{2}\s?(?:AM|PM|am|pm))
Explanation:
\s matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group (\d{2}\:\d{2}\s?(?:AM|PM|am|pm))
\d{2} matches a digit (equal to [0-9]); the {2} quantifier matches exactly 2 times
\: matches the character : literally
\d{2} matches a digit (equal to [0-9]); the {2} quantifier matches exactly 2 times
\s? matches any whitespace character (equal to [\r\n\t\f\v ]) zero or one time
Non-capturing group (?:AM|PM|am|pm)
1st alternative AM matches the characters AM literally (case sensitive)
2nd alternative PM, 3rd alternative am and 4th alternative pm work the same way
In action:
>>> import re
>>> re.findall(r'\s(\d{2}\:\d{2}\s?(?:AM|PM|am|pm))', 'Time today is 10:30 PM')
['10:30 PM']
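It may help to see why the original pattern returned the wrong thing. (:?AM|PM|am|pm) is a capturing group whose first alternative starts with an optional colon, not a non-capturing group, and re.findall returns only the group when a pattern has exactly one. The sketch below shows that behaviour next to a variant that uses (?:...). The \b word boundaries are my own substitution for the leading \s, so that a time at the very start of the string is also found. TIME_RE and find_times are hypothetical names.

```python
import re

text = 'Todays time is 12:10 PM Date is 21/08/2018'

# The question's pattern: one capturing group, so findall returns just that group.
print(re.findall(r'\d{2}:\d{2} (:?AM|PM|am|pm)', text))  # ['PM']

# (?:...) groups without capturing, so findall returns the whole match.
TIME_RE = re.compile(r'\b\d{2}:\d{2}\s?(?:AM|PM|am|pm)\b')

def find_times(s):
    return TIME_RE.findall(s)

print(find_times(text))  # ['12:10 PM']
```
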

Python Regex match a mac address from the end?

I have the following re to extract MAC address:
re.sub( r'(\S{2,2})(?!$)\s*', r'\1:', '0x0000000000aa bb ccdd ee ff' )
However, this gave me 0x:00:00:00:00:00:aa:bb:cc:dd:ee:ff.
How do I modify this regex to stop after matching the first 6 pairs starting from the end, so that I get aa:bb:cc:dd:ee:ff?
Note: the string has whitespace in between which is to be ignored. Only the last 12 characters are needed.
Edit1: re.findall( r'(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*(\S{2})\s*$',a) finds the last 6 pairs in the string. I still don't know how to compress this regex. Again this still depends on the fact that the strings are in pairs.
Ideally the regex should take the last 12 valid \S characters starting from the end and string them with :
Edit2: Inspired by @Mariano's answer, which works great but depends on the last 12 characters starting with a pair, I came up with the following solution. It is kludgy but still seems to work for all inputs.
string = '0x0000000000a abb ccddeeff'
':'.join( ''.join( i ) for i in re.findall( r'(\S)\s*(\S)(?!(?:\s*\S\s*){11})', string ) )
'aa:bb:cc:dd:ee:ff'
Edit3: @Mariano has updated his answer, which now works for all inputs.
This will work for the last 12 characters, ignoring whitespace.
Code:
import re
text = "0x0000000000aa bb ccdd ee ff"
result = re.sub( r'.*?(?!(?:\s*\S){13})(\S)\s*(\S)', r':\1\2', text)[1:]
print(result)
Output:
aa:bb:cc:dd:ee:ff
Regex breakdown:
The expression used in this code uses re.sub() to replace the following in the subject text:
.*? # consume as little of the subject text as possible
(?!(?:\s*\S){13}) # CONDITION: Can't be followed by 13 chars
# so it can only start matching when there are 12 to $
(\S)\s*(\S) # Capture a char in group 1, next char in group 2
#
# The match is replaced with :\1\2
# For this example, re.sub() returns ":aa:bb:cc:dd:ee:ff"
# We'll then apply [1:] to the returned value to discard the leading ":"
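The breakdown above can also live inside the code itself by compiling the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is the same expression as in the answer, only laid out differently. MAC_TAIL and mac_from_tail are hypothetical names.

```python
import re

MAC_TAIL = re.compile(r"""
    .*?                 # skip as little text as possible
    (?!(?:\s*\S){13})   # only continue once fewer than 13 non-space chars remain
    (\S)\s*(\S)         # capture the two characters of one pair
""", re.VERBOSE)

def mac_from_tail(text):
    # Each match becomes ":xy"; [1:] drops the leading colon.
    return MAC_TAIL.sub(r':\1\2', text)[1:]

print(mac_from_tail('0x0000000000aa bb ccdd ee ff'))  # aa:bb:cc:dd:ee:ff
print(mac_from_tail('0x0000000000a abb ccddeeff'))    # aa:bb:cc:dd:ee:ff
```
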
You can use re.finditer to find all the pairs, then join the result:
>>> my_string='0x0000000000aa bb ccdd ee ff'
>>> ':'.join([i.group() for i in re.finditer( r'([a-z])\1+',my_string )])
'aa:bb:cc:dd:ee:ff'
You may do it like this:
>>> import re
>>> s = '0x0000000000aa bb ccdd ee ff'
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
>>> s = '???767aa bb ccdd ee ff'
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
>>> s = '???767aa bb ccdd eeff '
>>> re.sub(r'(?!^)\s*(?=(?:\s*[a-z]{2})+$)', ':', re.sub(r'.*?((?:\s*[a-z]){12})\s*$', r'\1', s ))
'aa:bb:cc:dd:ee:ff'
I know this is not a direct answer to your question, but do you really need a regular expression? If your format is fixed, this should also work:
>>> s = '0x0000000000aa bb ccdd ee ff'
>>> ':'.join([s[-16:-8].replace(' ', ':'), s[-8:].replace(' ', ':')])
'aa:bb:cc:dd:ee:ff'
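In the same spirit, here is a regex-free sketch that does not rely on fixed slice offsets. It follows the question's stated goal of taking the last 12 non-whitespace characters and joining them in pairs. The name last_mac is hypothetical, and the function assumes the input has at least 12 such characters.

```python
def last_mac(text):
    # Drop all whitespace, then keep the final 12 characters.
    tail = ''.join(text.split())[-12:]
    # Group the characters in pairs and join the pairs with colons.
    return ':'.join(tail[i:i + 2] for i in range(0, len(tail), 2))

print(last_mac('0x0000000000aa bb ccdd ee ff'))  # aa:bb:cc:dd:ee:ff
print(last_mac('0x0000000000a abb ccddeeff'))    # aa:bb:cc:dd:ee:ff
```

Because the whitespace is removed first, it does not matter where the spaces fall inside the last 12 characters.
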
