Change a text between two strings in Python with Regex

I found several similar questions, but I cannot fit my problem to any of them. I am trying to find and replace the text between two other strings.
reg = "%s(.*?)%s" % (str1,str2)
r = re.compile(reg,re.DOTALL)
result = r.sub(newstring, originaltext)
The problem is that the code above also replaces str1 and str2, whereas I want to replace only the text between them. Is there something obvious that I am missing?
Update: here is a simplified example:
text = 'abcdefghijklmnopqrstuvwxyz'
str1 = 'gh'
str2 = 'op'
newstring = 'stackexchange'
reg = "%s(.*?)%s" % (str1,str2)
r = re.compile(reg,re.DOTALL)
result = r.sub(newstring, text)
print(result)
The result is abcdefstackexchangeqrstuvwxyz, whereas I need abcdefghstackexchangeopqrstuvwxyz.

Use a combination of lookarounds in your regular expression.
reg = "(?<=%s).*?(?=%s)" % (str1,str2)
Explanation:
Lookarounds are zero-width assertions. They don't consume any characters in the string, so str1 and str2 stay out of the match and are not replaced.
(?<=   # look behind to see if there is:
  gh   #   'gh'
)      # end of look-behind
.*?    # any character, as few as possible (includes \n when re.DOTALL is set)
(?=    # look ahead to see if there is:
  op   #   'op'
)      # end of look-ahead
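A minimal runnable sketch of the lookaround approach, applied to the simplified example from the question. The re.escape calls and the function replacement are additions that the original answer does not contain; they assume the delimiters or the new text might contain regex metacharacters or backslashes. The second variant, which captures the delimiters and writes them back, is an alternative technique rather than the answer's method.

```python
import re

text = 'abcdefghijklmnopqrstuvwxyz'
str1 = 'gh'
str2 = 'op'
newstring = 'stackexchange'

# Lookaround version: the delimiters are asserted but never consumed.
# re.escape guards against metacharacters in the delimiters (an assumption
# about the inputs; it changes nothing for plain strings such as 'gh' and 'op').
lookaround = re.compile(
    "(?<=%s).*?(?=%s)" % (re.escape(str1), re.escape(str2)),
    re.DOTALL,
)
# A function replacement inserts newstring literally, even if it contains backslashes.
result = lookaround.sub(lambda m: newstring, text)
print(result)  # abcdefghstackexchangeopqrstuvwxyz

# Alternative: capture both delimiters and put them back around the new text.
grouped = re.compile(
    "(%s).*?(%s)" % (re.escape(str1), re.escape(str2)),
    re.DOTALL,
)
result2 = grouped.sub(lambda m: m.group(1) + newstring + m.group(2), text)
print(result2)  # abcdefghstackexchangeopqrstuvwxyz
```

Python's re module only accepts fixed-width look-behinds, which holds here because str1 is a literal string. If the opening delimiter has to be a variable-width pattern, the capturing-group variant is the one to use.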

Related

Please advise on the Python regular expression

message = "<#U0104FGR7SL> test111 <#U0106LSJ> test33"
I have the string above and would like to split it at each marker that matches the pattern <#U......>, keeping every marker together with the text that follows it.
regex = re.compile("<#U[^>]+>")
match = regex.split(message)
If I do this, I only get the text between the markers ("test111", "test33"); the markers themselves are dropped.
<#U0104FGR7SL> test111
<#U0106LSJ> test33
I'd like to split it this way.
Please advise me what to do.

You can do the following:
import re
message = "<#U0104FGR7SL> test111 <#U0106LSJ> test33"
matches = re.findall(r"<\S+>\s\S+", message)
for x in matches:
    print(x)
# <#U0104FGR7SL> test111
# <#U0106LSJ> test33

Another one - using the newer regex module which supports splitting by lookarounds:
import regex as re
string = "<#U0104FGR7SL> test111 <#U0106LSJ> test33"
parts = re.split(r'(?<!\A)(?=<#)', string)
print(parts)
This yields
['<#U0104FGR7SL> test111 ', '<#U0106LSJ> test33']
See a demo on regex101.com.

You may use either of the two re.split solutions:
re.split(r'\s+(?=<#U[^>]+>)', message) # Any Python version, if matches are whitespace separated
[x.strip() for x in re.split(r'(?=<#U[^>]+>)', message) if x] # Starting with Python 3.7
NOTE: Since Python 3.7, re.split can split on empty (zero-width) matches.
Details
\s+ - 1+ whitespaces
(?=<#U[^>]+>) - a positive lookahead that requires <#U, 1+ chars other than > and then > immediately to the right of the current location.
See the Python demo:
import re
message = '<#U0104FGR7SL> test111 <#U0106LSJ> test33'
print(re.split(r'\s+(?=<#U[^>]+>)', message))
# => ['<#U0104FGR7SL> test111', '<#U0106LSJ> test33']
print([x.strip() for x in re.split(r'(?=<#U[^>]+>)', message) if x])
# => ['<#U0104FGR7SL> test111', '<#U0106LSJ> test33']

How to use regex to tell if first and last character of a string match?

I'm relatively new to using Python and regex, and I want to check whether a string's first and last characters are the same.
If first and last characters are same, then return 'True' (Ex: 'aba')
If first and last characters are not same, then return 'False' (Ex: 'ab')
Below is the code, I've written:
import re
string = 'aba'
pattern = re.compile(r'^/w./1w$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
But the code above does not produce any output.

If and only if you really want to use regex (for learning purposes):
import re
string = 'aba'
string2 = 'no match'
pattern = re.compile(r'^(.).*\1$')
if re.match(pattern, string):
    print('ok')
else:
    print('nok')
if re.match(pattern, string2):
    print('ok')
else:
    print('nok')
output:
ok
nok
Explanations:
^(.).*\1$
^ start-of-line anchor
(.) matches the first character of the line and stores it in group 1
.* matches any character (except a newline), any number of times
\1 backreference to group 1, which forces the last character to equal the first
$ end-of-line anchor
Demo: https://regex101.com/r/DaOPEl/1/
Otherwise, the simplest approach is to compare the characters directly with string[0] == string[-1]:
string = 'aba'
if string[0] == string[-1]:
    print('same')
output:
same
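The two approaches can be put side by side in a small sketch. The function names same_ends and same_ends_regex are hypothetical, and the regex is a variant of the pattern above (re.fullmatch instead of explicit anchors, plus re.DOTALL so that '.' also matches newlines). The edge cases are worth noting: the regex needs at least two characters, while the index comparison treats a one-character string as matching and would raise an IndexError on an empty string without the guard.

```python
import re

def same_ends(s):
    # Direct comparison; the bool(s) guard avoids an IndexError on ''.
    return bool(s) and s[0] == s[-1]

def same_ends_regex(s):
    # fullmatch makes the ^ and $ anchors unnecessary;
    # re.DOTALL lets '.' match newline characters as well.
    return re.fullmatch(r'(.).*\1', s, re.DOTALL) is not None

for sample in ['aba', 'ab', 'a', '']:
    print(repr(sample), same_ends(sample), same_ends_regex(sample))
# 'aba' True True
# 'ab' False False
# 'a' True False
# '' False False
```

Whether a single character should count as "first and last are the same" depends on the use case, so pick the variant whose answer for 'a' is the one you want.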

Why overengineer this with a regex at all? One principle of programming is to keep it simple:
string[0] == string[-1]
Or is there a need for regex?

The answer above by @Tobias is simple and sufficient, but if you want a solution using regex, try the code below.
Code:
import re
string = 'abbaaaa'
pattern = re.compile(r'^(.).*\1$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
Output :
<_sre.SRE_Match object; span=(0, 7), match='abbaaaa'>

I think this is the regex you are trying to execute:
Code:
import re
string = 'aba'
pattern = re.compile(r'^(\w).(\1)$')
matches = pattern.finditer(string)
for match in matches:
    print(match.group(0))
Output:
aba

If you want to check with regex, use the code below:
import re
string = 'aba is a cowa'
pat = r'^(.).*\1$'
if re.findall(pat, string):
    print(string)
This pattern matches only when the first and last characters of the string are the same. In that case findall returns the matching character and the string is printed; otherwise nothing is printed.

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend "\x" to every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.

Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
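The same substitution can be wrapped in a small helper and checked on a short input. This is only a sketch: the function name prefix_hex_bytes and the shortened sample string are hypothetical, and the character class is narrowed to actual hex digits ([0-9A-Fa-f]) instead of the [0-9A-Z] used above. It assumes, as in the question, that every run of unprefixed digits has even length.

```python
import re

def prefix_hex_bytes(s):
    # (?:\\x)? optionally consumes an existing literal \x,
    # ([0-9A-Fa-f]{2}) captures one byte written as two hex digits,
    # and the replacement always writes \x followed by that byte.
    return re.sub(r'(?:\\x)?([0-9A-Fa-f]{2})', r'\\x\1', s)

sample = r"3033\x90\x016F"  # shortened, made-up input in the question's format
print(prefix_hex_bytes(sample))  # \x30\x33\x90\x01\x6F
```

Because bytes that already carry a prefix are matched together with their \x, they are rewritten unchanged rather than being prefixed twice.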

You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add it back before every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)//2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00

You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
    print('\n\nNew string:')
    print('\\x' + '\\x'.join(match))
    #for elem in match: # match gives you a list of strings with the hex values
    #    print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)

This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x\1"
result = re.sub(regex, subst, test_str, flags=re.IGNORECASE)
if result:
    print(result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring that neither of the following matches at the current position.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexadecimal digit.
([a-f\d]{2}) Capture any two hexadecimal digits into capture group 1.

About how to find all desired format in a str

I have a text in this format:
s = '[aaa]foo[bbb]bar[ccc]foobar'
Actually, the text is a Chinese car review like this:
【最满意】整车都很满意，最满意就是性价比，...【空间】空间真的超乎想象，毫不夸张，...【内饰】内饰还可以吧，没有多少可以说的...
Now I want to split it into these parts:
[aaa]foo
[bbb]bar
[ccc]foobar
First I tried
>>> re.findall(r'\[.*?\].*?',s)
['[aaa]', '[bbb]', '[ccc]']
but that only got the bracketed half of each part.
Then I tried
>>> re.findall(r'(\[.*?\].*?)\[?',s)
['[aaa]', '[bbb]', '[ccc]']
which still only got the bracketed half.
In the end I had to get the two halves separately and then zip them:
>>> re.findall(r'\[.*?\]',s)
['[aaa]', '[bbb]', '[ccc]']
>>> re.split(r'\[.*?\]',s)
['', 'foo', 'bar', 'foobar']
>>> for t in zip(re.findall(r'\[.*?\]',s),[e for e in re.split(r'\[.*?\]',s) if e]):
...     print(''.join(t))
...
[aaa]foo
[bbb]bar
[ccc]foobar
So I want to know whether some regex could split it into these parts directly.

One of the approaches:
import re
s = '[aaa]foo[bbb]bar[ccc]foobar'
result = re.findall(r'\[[^]]+\][^\[\]]+', s)
print(result)
The output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
\[ or \] - matches the bracket literally
[^]]+ - matches one or more characters except ]
[^\[\]]+ - matches any character(s) except brackets \[\]
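The same idea can be carried over to the Chinese review text from the question, which uses the full-width brackets 【 and 】 and contains punctuation that \w would not match. The following is a sketch that assumes every part starts with a 【label】 and that 【 never occurs inside the body text.

```python
import re

review = '【最满意】整车都很满意，最满意就是性价比，...【空间】空间真的超乎想象，毫不夸张，...【内饰】内饰还可以吧，没有多少可以说的...'

# 【[^】]+】  a label enclosed in full-width brackets
# [^【]+      everything up to the next opening bracket (or the end of the text)
parts = re.findall(r'【[^】]+】[^【]+', review)
for part in parts:
    print(part)
# 【最满意】整车都很满意，最满意就是性价比，...
# 【空间】空间真的超乎想象，毫不夸张，...
# 【内饰】内饰还可以吧，没有多少可以说的...
```

Using negated character classes instead of \w keeps commas, ellipses and other punctuation inside each part.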

I think this could work:
r'\[.+?\]\w+'

Here it is:
>>> re.findall(r"(\[\w*\]\w+)",s)
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Explanation:
The parentheses define the group to search for. That group:
starts with an opening bracket \[ followed by some word characters \w*
then has the matching closing bracket \] followed by more word characters \w+
Note that you have to escape the brackets with \.

I think if the input string format is "strict enough", it's possible to try something without regex. It may look like a micro-optimisation, but it could be interesting as a challenge.
result = map(lambda x: '[' + x, s[1:].split("["))
So I checked the performance over 1 million iterations, and here are my results (seconds):
result = map(lambda x: '[' + x, s[1:].split("[")) # 0.89862203598
result = re.findall(r'\[[^]]+\][^\[\]]+', s) # 1.48306798935
result = re.findall(r'\[.+?\]\w+', s) # 1.47224497795
result = re.findall(r'(\[\w*\]\w+)', s) # 1.47370815277

\[.*?\][a-zA-Z]*
This regex should capture anything that starts with [somethinghere] followed by any letters from a to z, in either case.
You can experiment on regex101 to try out different patterns; it's easy to build your own regex there.

All you need is findall and here is very simple pattern without making it complicated:
import re
print(re.findall(r'\[\w+\]\w+','[aaa]foo[bbb]bar[ccc]foobar'))
output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Detailed solution:
import re
string_1='[aaa]foo[bbb]bar[ccc]foobar'
pattern=r'\[\w+\]\w+'
print(re.findall(pattern,string_1))
explanation:
\[\w+\]\w+
\[ matches the character [ literally
\w matches any word character (equal to [a-zA-Z0-9_] for ASCII text)
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed

Python: Ignore a # / and random numbers in a string

I use some code to read a website, scrape some information, pass it to Google, and print some directions.
I'm having an issue with some of the information: the site I use sometimes adds a # followed by 3 random digits, then a / and another 3 digits, e.g. #037/100.
How can I use Python to ignore this "#037/100" string?
I currently use
for i, part in enumerate(list(addr_p)):
    if '#' in part:
        del addr_p[i]
        break
to remove the # if found, but I'm not sure how to do it for the random numbers.
Any ideas?

If you find yourself wanting to remove "a # followed by three digits, a forward slash, and three more digits" from a string s, you could do
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\d{3}/\d{3}', '', s)
print(t)
Result:
this is a string  with other stuff
Explanation:
# - literal character '#'
\d{3} - exactly three digits
/ - forward slash (no escaping is needed in a Python regex)
\d{3} - exactly three digits
And the whole thing that matches the above (if it's present) is replaced with '' - i.e. "removed". Note that the space on each side of the match is kept, so the result contains two consecutive spaces.
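Applied to a list of address parts like the one in the question, the substitution could look like the sketch below. The helper name strip_marker and the sample addr_p values are hypothetical; the pattern additionally swallows whitespace in front of the marker so that no double space is left behind.

```python
import re

# Optional leading whitespace, then '#', three digits, '/', three digits.
MARKER = re.compile(r'\s*#\d{3}/\d{3}')

def strip_marker(part):
    # Remove the marker wherever it occurs, then trim leftover edge whitespace.
    return MARKER.sub('', part).strip()

# Hypothetical address parts, for illustration only.
addr_p = ['12 Main St', '#037/100', 'Springfield #123/234']

# Clean every part and drop the ones that become empty.
cleaned = [p for p in (strip_marker(part) for part in addr_p) if p]
print(cleaned)  # ['12 Main St', 'Springfield']

print(strip_marker("this is a string #123/234 with other stuff"))
# this is a string with other stuff
```

Building a new list avoids deleting from addr_p while iterating over it, which is easy to get wrong once more than one part has to be removed.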

import re
addr_p[i] = re.sub(r'#[0-9]+/[0-9]+$', '', addr_p[i])
I'm no wizard with regular expressions, but I'd imagine you could do something like this.
You could even handle '#' in the regexp as well.

If the format is always the same, then you could check if the line starts with a #, then set the string to itself without the first 8 characters.
if part[0:1] == '#':
    part = part[8:]
If the first character is a #, this sets the string to itself from index 8 (the ninth character) to the end.

I'd double your problems and match against a regular expression for this.
import re
regex = re.compile(r'([\w\s]+)#\d+\/\d+([\w\s]+)')
m = regex.match('This is a string with a #123/987 in it')
if m:
    s = m.group(1) + m.group(2)
    print(s)

A more concise way:
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\S+', '', s)
print(t)
