Regular expression find and replace multiple - python

I am trying to write a regular expression that will match all cases of
[[any text or char her]]
in a series of text.
Eg:
My name is [[Sean]]
There is a [[new and cool]] thing here.
This all works fine using my regex.
data = "this is my tes string [[ that does some matching ]] then returns."
p = re.compile("\[\[(.*)\]\]")
data = p.sub('STAR', data)
The problem is when I have multiple instances of the match occurring: [[hello]] and [[bye]]
Eg:
data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"
p = re.compile("\[\[(.*)\]\]")
data = p.sub('STAR', data)
This will match the opening bracket of hello and the closing bracket of bye. I want it to replace them both.

.* is greedy and matches as much text as it can, including ]] and [[, so it plows on through your "tag" boundaries.
A quick solution is to make the star lazy by adding a ?:
p = re.compile(r"\[\[(.*?)\]\]")
A better (more robust and explicit but slightly slower) solution is to make it clear that we cannot match across tag boundaries:
p = re.compile(r"\[\[((?:(?!\]\]).)*)\]\]")
Explanation:
\[\[ # Match [[
( # Match and capture...
(?: # ...the following regex:
(?!\]\]) # (only if we're not at the start of the sequence ]])
. # any character
)* # Repeat any number of times
) # End of capturing group
\]\] # Match ]]
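As a rough side-by-side comparison of the three patterns discussed above, here is a minimal sketch run against the string from the question. The variable names (greedy, lazy, tempered) are only illustrative and do not come from the original post.

```python
import re

data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"

greedy = re.compile(r"\[\[(.*)\]\]")                 # pattern from the question
lazy = re.compile(r"\[\[(.*?)\]\]")                  # lazy quantifier
tempered = re.compile(r"\[\[((?:(?!\]\]).)*)\]\]")   # never steps over a ]]

greedy_result = greedy.sub("STAR", data)
lazy_result = lazy.sub("STAR", data)
tempered_result = tempered.sub("STAR", data)

print(greedy_result)       # one replacement swallowing both tags
print(lazy_result)         # each tag replaced separately
print(tempered_result)     # same result as the lazy version for this input
print(lazy.findall(data))  # ['hello', 'bye']
```

For this input the lazy and tempered patterns give the same result; the greedy one collapses everything from the first [[ to the last ]] into a single STAR.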

Use ungreedy matching .*? <~~ the ? after a + or * makes it match as few characters as possible. The default is to be greedy, and consume as many characters as possible.
p = re.compile(r"\[\[(.*?)\]\]")

You can use this:
p = re.compile(r"\[\[[^\]]+\]\]")
>>> data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"
>>> p = re.compile(r"\[\[[^\]]+\]\]")
>>> data = p.sub('STAR', data)
>>> data
'this is my new string it contains STAR and STAR and nothing else'
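One thing worth checking with the negated character class is what happens when the text between the brackets itself contains a single ]. The sketch below compares it with the lazy pattern. The input strings are made up for illustration and are not from the question.

```python
import re

negated = re.compile(r"\[\[[^\]]+\]\]")  # pattern from the answer above
lazy = re.compile(r"\[\[.*?\]\]")        # lazy alternative

plain = "keep [[hello]] and [[bye]]"
tricky = "keep [[a]b]] here"             # hypothetical tag containing a single ]

plain_negated = negated.sub("STAR", plain)
tricky_negated = negated.sub("STAR", tricky)
tricky_lazy = lazy.sub("STAR", tricky)

print(plain_negated)   # keep STAR and STAR
print(tricky_negated)  # unchanged: [^\]]+ cannot step over the lone ]
print(tricky_lazy)     # keep STAR here
```

If the tags never contain ], the negated class is a perfectly reasonable choice; the difference only matters for inputs like the second one.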


Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
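You need to use re.search, since re.match only tries to match at the beginning of the string. To match until a space or period is encountered (note the space inside the lookbehind; without it the character class stops immediately at the space that follows the colon and the match is empty):
re.search(r'(?<=test : )[^.\s]*',text)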
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
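The difference between re.match and re.search can be checked directly with the pattern from the question. This is only a small sketch; the variable names at_start and anywhere are illustrative.

```python
import re

text = "test : match this."
pattern = r"(?<=test :).*"

at_start = re.match(pattern, text)   # only tries position 0, where the lookbehind fails
anywhere = re.search(pattern, text)  # scans forward until the lookbehind succeeds

print(at_start)                # None
print(repr(anywhere.group()))  # ' match this.' (leading space and trailing period included)
```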
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return match this. (with the trailing period included). See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
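The raw-string warning is easy to verify. In the sketch below the second pattern deliberately drops the r prefix (with the other backslashes doubled so that only \b changes meaning). The "contest" input is a made-up example.

```python
import re

s = "test : match this."

raw = re.search(r"\btest\s*:\s*(.*)\.", s)
# Without the r prefix, "\b" is a single backspace character, not a word boundary.
not_raw = re.search("\btest\\s*:\\s*(.*)\\.", s)

print(len("\b"), len(r"\b"))  # 1 2
print(raw.group(1))           # match this
print(not_raw)                # None

# \b also stops "test" from matching inside a longer word.
inside_word = re.search(r"\btest\s*:\s*(.*)\.", "contest : match this.")
print(inside_word)            # None
```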
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
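A minimal sketch of that generalization follows. The helper name extract_after and its stop parameter are hypothetical. It also assumes you want surrounding whitespace stripped and that a missing stop character means "take the rest of the line" (str.find returns -1 in that case, which would otherwise silently cut off the last character).

```python
def extract_after(line, prefix, stop="."):
    """Return the text between prefix and the first stop character, or None."""
    if not line.startswith(prefix):
        return None
    end = line.find(stop, len(prefix))
    if end == -1:        # no stop character found: keep everything after the prefix
        end = len(line)
    return line[len(prefix):end].strip()


print(extract_after("test: match this.", "test:"))  # match this
print(extract_after("test: no period", "test:"))    # no period
print(extract_after("other: x.", "test:"))          # None
```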

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend "\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
Code demo
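Note that the class [0-9A-Z] only covers uppercase letters, which happens to be enough for the sample string. If the input may contain lowercase hex digits, a variant along these lines could be used. The function name add_hex_prefix and the short sample are made up for illustration.

```python
import re


def add_hex_prefix(text):
    # Optional existing \x prefix, then exactly two hex digits in either case.
    return re.sub(r"(?:\\x)?([0-9A-Fa-f]{2})", r"\\x\1", text)


sample = r"6f6D\x90ab"  # hypothetical input mixing lowercase and uppercase hex

uppercase_only = re.sub(r"(?:\\x)?([0-9A-Z]{2})", r"\\x\1", sample)
both_cases = add_hex_prefix(sample)

print(uppercase_only)  # 6f\x6D\x90ab  (lowercase bytes are skipped)
print(both_cases)      # \x6f\x6D\x90\xab
```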
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add it back at every 2 characters (// is needed in Python 3 so that range() receives an integer).
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)//2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
    print('\n\nNew string:')
    print('\\x' + '\\x'.join(match))
    #for elem in match: # match gives you a list of strings with the hex values
    #    print('\\x{}'.format(elem), end='')
    print('\n\nOriginal string:')
    print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
See code in use here
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x\1"  # Python's re.sub uses \1 (not $1) for group references
result = re.sub(regex, subst, test_str, flags=re.IGNORECASE)
if result:
    print(result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexadecimal digit.
([a-f\d]{2}) Capture any two hexadecimal digits into capture group 1.

Regex: skip the first match of a character in group?

From this string
s = 'stringalading-0.26.0-1'
I'd like to extract the part 0.26.0-1. I can think of various ways to achieve this, using split or a regular expression using a pattern like this
pattern = r'\d+\.\d+\.\d+\-\d+'
I also tried to use a group of characters, like so:
pattern = r'[.\-\d]+'
This gives me:
In [30]: re.findall(pattern, s)
Out[30]: ['-0.26.0-1']
So I wondered: is it possible to skip the first occurrence of a character in a group, in this case the first occurrence of -?
is it possible to skip the first occurrence of a character in a group, in this case the first occurrence of -?
NO, because when matching, the regex engine processes the string from left to right, and once the matching pattern is found, the matched chunk of text is written to the match buffer. Thus, either write a regex that only matches what you need, or post-process the found result by stripping unwanted characters from the left.
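Both options mentioned above can be sketched in a couple of lines. The second pattern, which forces the match to start with a digit, is my own illustration of "a regex that only matches what you need" and not something taken from the question.

```python
import re

s = 'stringalading-0.26.0-1'

# Option 1: keep the character class, then strip the unwanted leading hyphen.
stripped = re.findall(r'[.\-\d]+', s)[0].lstrip('-')

# Option 2: require the match to begin with a digit.
digit_first = re.findall(r'\d[.\-\d]*', s)

print(stripped)     # 0.26.0-1
print(digit_first)  # ['0.26.0-1']
```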
I think you do not need a regex here. You can split the string with - and pass the maxsplit argument set to 1, then just access the second item:
s = 'stringalading-0.26.0-1'
print(s.split("-", 1)[1]) # => '0.26.0-1'
See the Python demo
Also, your first regex works well:
import re
s = 'stringalading-0.26.0-1'
pat = r'\d+\.\d+\.\d+-\d+'
print(re.findall(pat, s)) # => ['0.26.0-1']
Do:
-(.*)
and get captured group 1.
Example:
In [9]: s = 'stringalading-0.26.0-1'
In [10]: re.search(r'-(.*)', s).group(1)
Out[10]: '0.26.0-1'

Slicing by start and stop string values in Python

I have a string in which there are certain values that I need to extract from it. For example: "FEFEWFSTARTFFFPENDDCDC". How could I make an expression that would take a slice from "START" all the way to "END"?
I tried doing this previously by creating functions which used a for loop and string.find("START") to locate the beginning and ends, but this didn't appear to work effectively and seemed overly complex. Is there an easier way to do this without using complex loops?
EDIT:
Forgot this part. What if there were different end values? In other words, instead of just ending with "END", the values "DONE" and "NOMORE" would also end it? And in addition to that, there were multiple starts and ends throughout the string. For example: "STARTFFEFFDONEFEWFSTARTFEFFENDDDW".
EDIT2: Sample run: Start value: ATG. End values: TAG,TAA,TGA
"Enter a string": TTATGTTTTAAGGATGGGGCGTTAGTT
TTT
GGGCGT
And
"Enter a string": TGTGTGTATAT
"No string found"
That's a perfect fit for a regular expression:
>>> import re
>>> s = "FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER"
>>> re.findall("START.*?(?:END|DONE|NOMORE)", s)
['STARTFFFPEND', 'STARTDOINVOIJHSDFDONE']
.* matches any number of characters (except newlines); the additional ? makes the quantifier lazy, telling it to match as few characters as possible. Otherwise, there would be only one match, namely STARTFFFPENDDCDCSTARTDOINVOIJHSDFDONE.
As @BurhanKhalid noted, if you add a capturing group, only the substring matched by that part of the regex will be captured:
>>> re.findall("START(.*?)(?:END|DONE|NOMORE)", s)
['FFFP', 'DOINVOIJHSDF']
Explanation:
START # Match "START"
( # Match and capture in group number 1:
.*? # Any character, any number of times, as few as possible
) # End of capturing group 1
(?: # Start a non-capturing group that matches...
END # "END"
| # or
DONE # "DONE"
| # or
NOMORE # "NOMORE"
) # End of non-capturing group
And if your real goal is to match gene sequences, you need to make sure that you always match triplets:
re.findall("ATG(?:.{3})*?(?:TA[AG]|TGA)", s)
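To reproduce the sample run from EDIT2, the triplet pattern can be combined with a capturing group so that only the part between the start and stop codons is returned. This is a sketch under that assumption; the names GENE and find_genes are hypothetical.

```python
import re

# ATG, then as few whole triplets as possible (captured), then a stop codon.
GENE = re.compile(r"ATG((?:.{3})*?)(?:TA[AG]|TGA)")


def find_genes(dna):
    """Return the sequences found between ATG and the nearest in-frame stop codon."""
    return GENE.findall(dna)


for dna in ("TTATGTTTTAAGGATGGGGCGTTAGTT", "TGTGTGTATAT"):
    genes = find_genes(dna)
    if genes:
        print("\n".join(genes))
    else:
        print("No string found")
```

For the first string this prints TTT and GGGCGT, and for the second it prints No string found, matching the sample run in the question.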
a="FEFEWFSTARTFFFPENDDCDC"
a[a.find('START'):]
'STARTFFFPENDDCDC'
The simple way (no loop, no regex):
s = "FEFEWFSTARTFFFPENDDCDC"
tmp = s[s.find("START") + len("START"):]
result = tmp[:tmp.find("END")]
yourString = 'FEFEWFSTARTFFFPENDDCDC'
substring = yourString[yourString.find("START") + len("START") : yourString.find("END")]
Not that efficient but does work.
>>> s = "FEFEWFSTARTFFFPENDDCDC"
>>> s[s.index('START'):s.index('END')+len('END')]
'STARTFFFPEND'
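One caveat with the find-based snippets above is that str.find returns -1 when a marker is missing, which produces a wrong slice instead of an error. A small helper along these lines makes that case explicit. The name slice_between is hypothetical, and it assumes you want the text between the markers, excluding the markers themselves.

```python
def slice_between(s, start, end):
    """Return the text between the first start marker and the next end marker, or None."""
    i = s.find(start)
    if i == -1:
        return None
    i += len(start)
    j = s.find(end, i)  # only look for the end marker after the start marker
    if j == -1:
        return None
    return s[i:j]


print(slice_between("FEFEWFSTARTFFFPENDDCDC", "START", "END"))  # FFFP
print(slice_between("ENDxxSTARTyy", "START", "END"))            # None
```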

How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?

Given a string like this:
ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",
With regex, how do I get a tuple that looks like the following:
('ORTH', ['cali.ber,kl','calf','done'])
I've been doing it as such:
txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
e1 = txt.partition(" ")[0]
vs = re.search(r"<([A-Za-z0-9_]+)>", txt)
v = vs.group(1)
v1 = [i[1:-1] for i in vs.strip().strip("<>").split(",")]
print v1
But I'm getting None for re.search().group(1). How should it be done to get the desired output?
The reason you don't get a match is that your regex doesn't match:
r"<([A-Za-z0-9_]+)>" is missing comma, quotation marks and the space character, which all can occur inside the < > according to your sample.
This one would match:
re.search(r"< ([A-Za-z0-9_.,\"' ]+) >", txt)
What also may trip you up is the fact that the list of names is delimited by comma, which itself can be part of the values, unescaped.
That means you can't just split that string by ',', but instead need to consider the two different quotation characters(' and " ) in order to separate the fields.
So I'd use this approach:
Use re.match to split the string into PREFIX < NAMES > parts, and discard the rest.
Use re.findall() to split the names into fields according to quotation marks
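To illustrate why a plain split on ',' is not enough for the second step, here is a small sketch on the names part of the sample. The backreference \1, which requires the closing quote to be the same character as the opening one, is my own variation and is slightly stricter than matching either quote character at both ends.

```python
import re

names_str = '''"cali.ber,kl", 'calf' , "done"'''

# Naive split: the comma inside the first value breaks it into two pieces.
naive = [part.strip() for part in names_str.split(",")]

# Quote-aware extraction: capture the opening quote, then match lazily up to the same quote.
quoted = [value for _quote, value in re.findall(r'''(["'])(.*?)\1''', names_str)]

print(naive)   # four pieces instead of three
print(quoted)  # ['cali.ber,kl', 'calf', 'done']
```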
Edit:
1) According to your first comment, your data can also contain a preamble before the prefix that contains newlines. The default behavior for . is to match everything except newlines.
From the Python re docs:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
So you need to construct that regex with the re.DOTALL flag. You do this by compiling it first and passing the ORed flags:
re.compile(pattern, flags=re.DOTALL)
2) If you include the space character before PREFIX in the regex, it will only match for data that actually contains that space - but not anymore for your first piece of example data. So I use .*?([A-Z\.]*)... to cover both cases. The ? is for non-greedy matching, so it matches the shortest possible match instead of the longest.
3) To cover PREFIX.FOO just extend the pattern for the prefix to ([A-Z\.]*) by including the . character and escaping it.
Updated example covering all the cases you mentioned:
import re
TEST_VALUES = [
"""ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",""",
"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""
]
EXPECTED = ('ORTH.FOO', ['cali.ber,kl','calf','done'])
pattern = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)
for value in TEST_VALUES:
    prefix, names_str = pattern.match(value).groups()
    names = re.findall('[\'"](.*?)["\']', names_str)
    result = prefix, names
    assert result == EXPECTED
    print(result)
