I am trying to make a warning waiver which can look for known warnings in a log file.
The warnings in the waiver file are copied directly from the log file during a review of the warnings.
The mission here is to make it as simple as possible, but I found that directly copying the warnings was a bit problematic due to the fact that they can contain absolute paths.
So I added a "tag" which can be inserted into a warning and which the system should look for. The whole string then looks like this:
WARNING:HDLParsers:817 - ":RE[.*]:/modules/top/hdl_src/top.vhd" Line :RE[.*]: Choice . is not a locally static expression.
The tag is :RE[Insert RegEx here]:.
In the above warning string there are two of these tags, which I am trying to find using Python 3's re module. My pattern is the following:
(:RE\[.*\]\:)
See RegEx101 for reference
My problem with the above is that when there are two tags in my string, it finds only one result, extending from the first tag to the last. How do I set up the regex so it will find each tag?
Regards
You can use re.findall with the following regex, which assumes that the regular expression inside the square brackets spans from :RE[ up to the first ] that is followed by a colon:
:RE\[.*?]:
See the regex demo. The .*? matches 0 or more characters other than a newline, but as few as possible. See rexegg.com's description of the lazy quantifier solution:
The lazy .*? guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed.
See IDEONE demo
import re
p = re.compile(r':RE\[.*?]:')
test_str = "# Even more commments\nWARNING:HDLParsers:817 - \":RE[.*]:/modules/top/hdl_src/cpu_0342.vhd\" Line :RE[.*]: Choice . is not a locally static expression."
print(p.findall(test_str))
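For the test string above, this prints both tags: [':RE[.*]:', ':RE[.*]:']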
If you need to get the contents between the [ and ], use a capturing group so that re.findall can extract just those contents:
p = re.compile(r':RE\[(.*?)]:')
See another demo
To obtain indices, use re.finditer (see this demo):
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
p = re.compile(r':RE\[(.*?)]:')
print([x.start(1) for x in p.finditer(test_str)])
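If the end goal is to match whole log lines against whole waiver lines, one possible approach (a sketch only, not part of the original answer; waiver_line_to_pattern and the sample log line are made up for illustration) is to escape the literal chunks of a waiver line and splice in the user-supplied regex fragments:
import re
TAG = re.compile(r':RE\[(.*?)]:')
def waiver_line_to_pattern(line):
    # Escape the literal chunks and splice in the regex from each :RE[...]: tag.
    parts, last = [], 0
    for m in TAG.finditer(line):
        parts.append(re.escape(line[last:m.start()]))  # literal text
        parts.append(m.group(1))                       # user-supplied regex
        last = m.end()
    parts.append(re.escape(line[last:]))
    return ''.join(parts)
waiver = 'WARNING:HDLParsers:817 - ":RE[.*]:/modules/top/hdl_src/top.vhd" Line :RE[.*]: Choice . is not a locally static expression.'
log = 'WARNING:HDLParsers:817 - "/work/proj/modules/top/hdl_src/top.vhd" Line 42: Choice . is not a locally static expression.'
print(bool(re.fullmatch(waiver_line_to_pattern(waiver), log)))  # True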
I am looking for a regex or a regex flag in python/BigQuery that enables me to find overlapping occurrences.
For example, I have the string 1.2.5.6.8.10.12
and I would like to extract:
[1., 1.2., 1.2.5., 1.2.5.6., ..., 1.2.5.6.8.10.12]
I tried running the Python code
re.findall("^(\d+(?:\.|$))+", string)
and it resulted in ['12']
Use the query below (BigQuery):
select text,
array(
select regexp_extract(text, r'((?:[^.]+.){' || i || '})')
from unnest(generate_array(1, array_length(split(text, '.')))) i
) as extracted
from your_table
with output: the extracted array for each row holds the requested prefixes 1., 1.2., 1.2.5., …, 1.2.5.6.8.10.12
As the regex engine walks down the string, each matched position gets consumed. To extract substrings that share the same starting position, you need to look behind and capture matches towards the start. Capturing overlapping matches has to be done inside a lookaround so the captured parts are not consumed. Python's re does not support lookbehinds of variable length, but the PyPI regex module does.
import regex as re
s = "1.2.5.6.8.10.12"
res = re.findall(r"(?<=(.*\d(?:\.|$)))", s)
See this Python demo at tio.run or a Regex101 demo (captures will be in the first group).
The PyPI regex module even has an overlapped=True option, which avoids having to capture inside a lookbehind. Together with (?r), another interesting flag that performs a reverse search, the same result can be achieved:
res = re.findall(r'(?r).*\d(?:\.|$)', s, overlapped=True)[::-1]
The result just needs to be reversed afterwards to get the desired order: Python demo
Using the standard re module, one idea is to reverse the string and capture inside a lookahead. The needed parts get captured from the reversed string; each list item then gets reversed again, and finally the entire list is reversed. I don't know if this is worth the effort, but it seems to work as well.
res = [x[::-1] for x in re.findall(r'(?=((?:\.\d|^).*))', s[::-1])][::-1]
Another Python demo at tio.run or a Regex101 demo (shows matching on the reversed string).
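For reference, a self-contained run of that standard-re variant with the sample string from the question:
import re
s = "1.2.5.6.8.10.12"
res = [x[::-1] for x in re.findall(r'(?=((?:\.\d|^).*))', s[::-1])][::-1]
print(res)
# ['1.', '1.2.', '1.2.5.', '1.2.5.6.', '1.2.5.6.8.', '1.2.5.6.8.10.', '1.2.5.6.8.10.12']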
I have the following path stored as a Python string, 'C:\ABC\DEF\GHI\App\Module\feature\src', and I would like to extract the word Module that is located between \App\ and \feature\ in the path name. Note that the file separators '\' in between ought not to be extracted; only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but unfortunately it goes beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind.
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself, we have to escape it; hence, \\
.* matches any character, zero or more times.
? specifies that the match is not greedy, so the match stops at the first \feature\ that follows \App\.
The pattern in general also assumes that there are no \ characters between \App\ and \feature\.
The full code would be something like:
import re
str = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
pattern = rf"(?<=\{start}\).*?(?=\{end}\)"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, str)[0]) # Module
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
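As a side note (not part of the original answer), the same pattern can be built in a slightly less error-prone way by letting re.escape handle the backslashes:
import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start, end = '\\App\\', '\\feature\\'
# re.escape escapes each backslash, so no manual \ juggling is needed in the f-string
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(re.search(pattern, path)[0])  # Module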
We can do that with str.find, something like:
str = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
print(str[str.find(start)+len(start):str.rfind(end)])
output
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The brackets ( ) define a match group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1
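In Python source (where the path in memory contains single backslashes, unlike the doubled backslashes in the regex101 test string), the pattern only needs one escaped backslash per separator; a quick sketch:
import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
m = re.search(r'(?:App\\)([A-Za-z0-9]*)(?:\\feature)', path)
if m:
    print(m.group(1))  # Module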
The URL pattern is:
http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500
This website has many similar URLs. The unique identifier for this URL is -p-.
The URL pattern always has -p- before the word at the end of the URL.
I used the following regex
(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z
It matched, but it also matches many other URLs on this website.
For example, the regex should match the URL above but it shouldn't match
http://www.hepsiburada.com/bilgisayarlar-c-2147483646
Since you are using re.match, you really need to match the string from the beginning. However, the main problem is that your -p- is inside a character class and is thus treated as separate symbols that can be matched. The same goes for the \w+ - it is treated as \w and + separately.
So, use a sequence instead:
(.*)hepsiburada\.com/([\w.-]+)(-p-\w+)$
See this regex demo
Or
^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$
See the regex demo
Note that most probably you even have no need in the capture groups, and (...) parentheses can be removed from the pattern.
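As a quick check (a sketch, not from the original answer), the anchored variant matches the product URL from the question and rejects the category URL:
import re
pattern = re.compile(r'^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$')
urls = [
    'http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500',
    'http://www.hepsiburada.com/bilgisayarlar-c-2147483646',
]
for url in urls:
    print(bool(pattern.match(url)))  # True for the first URL, False for the second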
I want to replace some "markdown" tags with HTML tags.
for example:
#Title1#
##title2##
Welcome to **My Home Page**
will be turned into
<h1>Title1</h1>
<h2>title2</h2>
Welcome to <b>My Home Page</b>
I just don't know how to do that... For Title1, I tried this:
#!/usr/bin/env python3
import re
text = '''
#Title1#
##title2##
'''
p = re.compile('^#\w*#\n$')
print(p.sub('<h1>\w*</h1>',text))
but nothing happens; the output is unchanged:
#Title1#
##title2##
How can this kind of bbcode/markdown syntax be turned into HTML tags?
Check this regex: demo
Here you can see how I substituted #...# with <h1>...</h1>.
I believe you can get this to work with double # and so on to cover other markdown features, but you should still listen to @Thomas's and @nhahtdh's comments and use a markdown parser. Using regexes in such cases is unreliable, slow and unsafe.
As for inline text like **...** to <b>...</b>, you can try this regex with substitution: demo. Hopefully you can tweak this for other features like underlining and so on.
Your regular expression does not work because in the default mode, ^ and $ (respectively) matches the beginning and the end of the whole string.
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline (my emph.)
'$'
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
(7.2.1. Regular Expression Syntax)
Add the flag re.MULTILINE in your compile line and drop the \n before the $ anchor ($ already matches just before the newline at the end of each line):
p = re.compile('^#(\w*)#$', re.MULTILINE)
and it should work – at least for single words, such as your example. A better check would be
p = re.compile('^#([^#]*)#$', re.MULTILINE)
– any sequence that does not contain a #.
In both expressions, you need to add parentheses around the part you want to copy so you can use that text in your replacement code. See the official documentation on Grouping for that.
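Putting it together, a minimal sketch of the full substitution (the \1 backreference is standard re replacement syntax; the ## and ** rules are added here for completeness and were not spelled out in the answers above):
import re
text = '''
#Title1#
##title2##
Welcome to **My Home Page**
'''
# Handle the more specific ## rule before the single-# rule.
text = re.sub(r'^##([^#]*)##$', r'<h2>\1</h2>', text, flags=re.MULTILINE)
text = re.sub(r'^#([^#]*)#$', r'<h1>\1</h1>', text, flags=re.MULTILINE)
text = re.sub(r'\*\*(.*?)\*\*', r'<b>\1</b>', text)
print(text)
# <h1>Title1</h1>
# <h2>title2</h2>
# Welcome to <b>My Home Page</b>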
I'm confused about python greedy/not-greedy characters.
"Given multi-line html, return the final tag on each line."
I would think this would be correct:
re.findall('<.*?>$', html, re.MULTILINE)
I'm irked because I expected a list of single tags like:
"</html>", "<ul>", "</td>".
My O'Reilly's Pocket Reference says that *? will "match 0 or more times, but as few times as possible."
So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?
Your problem stems from the fact that you have an end-of-line anchor ('$'). The way non-greedy matching works is that the engine first searches for the first unconstrained pattern on the line ('<' in your case). It then looks for the first '>' character (which you have constrained, with the $ anchor, to be at the end of the line). So a non-greedy * is not any different from a greedy * in this situation.
Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack...see @Mark's answer. '<[^><]*>$' will work.
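To see the difference concretely, here is a small self-contained example (the HTML snippet is made up):
import re
html = "<html>\n<body><p>text</p>\n<ul><li>item</li></ul>\n</body></html>"
print(re.findall(r'<.*?>$', html, re.MULTILINE))
# ['<html>', '<body><p>text</p>', '<ul><li>item</li></ul>', '</body></html>']
print(re.findall(r'<[^><]*>$', html, re.MULTILINE))
# ['<html>', '</p>', '</ul>', '</html>']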