Ignore special characters when creating a regular expression in python [duplicate] - python

This question already has answers here:
Escaping regex string
(4 answers)
Closed 6 years ago.
Is there a way to ignore special character meaning when creating a regular expression in python? In other words, take the string "as is".
I am writing code that uses internally the expect method from a Telnet object, which only accepts regular expressions. Therefore, the answer cannot be the obvious "use == instead of regular expression".
I tried this
import re
SPECIAL_CHARACTERS = "\\.^$*+?{}[]|():" # backslash must be placed first
def str_to_re(s):
    result = s
    for c in SPECIAL_CHARACTERS:
        result = result.replace(c, '\\' + c)
    return re.compile(result)
TEST = "Bob (laughing). Do you know 1/2 equals 2/4 [reference]?"
re_bad = re.compile(TEST)
re_good = str_to_re(TEST)
print(re_bad.match(TEST))
print(re_good.match(TEST))
It works, since the first one does not recognize the string, and the second one does. I looked at the options in the python documentation, and was not able to find a simpler way. Or are there any cases my solution does not cover (I used python docs to build SPECIAL_CHARACTERS)?
P.S. The problem can apply to other libraries. It does not apply to the pexpect library, because it provides the expect_exact method which solves this problem. However, someone could want to specify a mix of strings (as is) and regular expressions.

If 'reg' is the regex itself, written as a literal, use a raw string as follows
pat = re.compile(r'reg')
If reg is a name bound to a string that should be matched as is, escape it with re.escape first
reg = re.escape(reg)
pat = re.compile(reg)
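As a rough illustration of the re.escape approach, the sketch below matches the question's TEST string literally and also mixes an escaped literal with a regex fragment, which is the "mix of strings and regular expressions" case mentioned in the question. The prompt value and the \d+ suffix are made up for the example.

```python
import re

TEST = "Bob (laughing). Do you know 1/2 equals 2/4 [reference]?"

# re.escape backslash-escapes the characters that are special in a regex,
# so the compiled pattern matches the string literally.
literal = re.compile(re.escape(TEST))
matched_literally = literal.match(TEST) is not None

# Mixing an "as is" string with a regex fragment: escape only the literal part.
prompt = "user[1]> "  # hypothetical literal text, not from the question
mixed = re.compile(re.escape(prompt) + r"\d+")
matched_mixed = mixed.match("user[1]> 42") is not None

print(matched_literally, matched_mixed)  # True True
```

Because re.escape is maintained alongside the re module, it avoids having to keep a hand-written list of special characters up to date.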

Related

Finding correct regex for a bolded/underlined strings (Python) [duplicate]

This question already has answers here:
How to extract the substring between two markers?
(22 answers)
Closed 2 years ago.
So I have 2 sets of criteria that I would like to find in a string. For example:
import re
bold_pattern = re.compile() #pattern for finding all words in between ** **
underline_pattern = re.compile() # pattern for finding all words in between __ __
a = "__Hello__ **This** __is__ **Lego**"
How would I go about doing that with regex?
Use capture groups to capture the text between two delimiters:
bold_pattern = re.compile(r'\*\*(.*?)\*\*') # pattern for finding all words in between ** **
underline_pattern = re.compile(r'__(.*?)__') # pattern for finding all words in between __ __
Then use them in a re.findall:
bolds = re.findall(bold_pattern, a)
# or: bold_pattern.findall(a)
underlines = re.findall(underline_pattern, a)
# or: underline_pattern.findall(a)
Using re.findall we can try:
a = "__Hello__ **This** __is__ **Lego**"
terms = re.findall(r'\*\*(.*?)\*\*', a)
print(terms)
This prints:
['This', 'Lego']
Hope this helps :) You first need to define the pattern with re.compile and then use the findall method to extract the strings. You can also do it in one line by passing the pattern directly to re.findall, as @Tim Biegeleisen suggested.
import re
bold_pattern = re.compile(r'\*\*(.*?)\*\*')
underline_pattern = re.compile(r'\_\_(.*?)\_\_')
a = "__Hello__ **This** __is__ **Lego**"
print(bold_pattern.findall(a))
print(underline_pattern.findall(a))
Suggestion:
If you're dealing with multiline text (i.e. \n), then you'll need to pass the argument: flags=re.DOTALL to your re.findall() method.
Case: Multiline text
# string to be searched
a = """
__Hello__ **This
is a multiline test** __it is__ **Lego
**
"""
# pattern variations
bold_pattern = r'\*\*(.*?)\*\*'
# call re functions
match = re.findall(pattern=bold_pattern, string=a)
flag_match = re.findall(pattern=bold_pattern, string=a, flags=re.DOTALL)
# print results for observation
print(match)
print(flag_match) # using the flag
Returns:
[' __it is__ ']
['This\nis a multiline test', 'Lego\n']
From the Python 3.8.2 documentation:
"The expression’s behaviour can be modified by specifying a flags value."
Dealing with (\n)
Depending on your needs, there are a few different ways you can deal with \n. If I need to, I'll use re.sub() on the entire text body prior to doing anything else to remove them all.
To Compile or Not to Compile?
From the Python 3.8.2 documentation:
"Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form...
...but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."
and
"The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions."
So unless you're using a whole bunch of patterns, you shouldn't see a noticeable improvement from compiling.
If you're working in IPython or Jupyter, you can also use the %%time magic command to test both options and see if you notice an advantage locally!
Good luck!
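Outside IPython, a similar comparison can be sketched with the standard timeit module. This is only a rough measurement under assumed settings (the sample string and the repetition count of 10000 are arbitrary), and the numbers will vary by machine and Python version.

```python
import re
import timeit

text = "__Hello__ **This** __is__ **Lego**"
bold_pattern = re.compile(r'\*\*(.*?)\*\*')

# Module-level function: the pattern string goes through re's internal cache on each call.
uncompiled = timeit.timeit(lambda: re.findall(r'\*\*(.*?)\*\*', text), number=10000)

# Precompiled pattern object: the cache lookup is skipped.
compiled = timeit.timeit(lambda: bold_pattern.findall(text), number=10000)

print(f"module-level: {uncompiled:.4f}s  compiled: {compiled:.4f}s")
```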

How to input a variable string into re.search in python [duplicate]

This question already has answers here:
How to use a variable inside a regular expression?
(12 answers)
Closed 4 years ago.
Initially I had my date regex working as follows, to capture "February 12, 2018" for example
match = re.search(r'(January|February|March|April|May|June|July|August|September?|October?|November|December)\s+\d{1,2},\s+\d{4}', date).group()
But I want to make it more flexible and insert a string variable into the regex. I can't get it to work, even after looking through many Stack Overflow threads about similar issues. I'm quite a novice, so I'm not sure what's going wrong. I'm aware that simply writing MONTHS won't work. Thank you
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
match = re.search(r'(MONTHS)\s+\d{1,2},\s+\d{4}', date).group()
print(match)
'NoneType' object has no attribute 'group'
You've got MONTHS as just a part of the pattern string, so Python doesn't know that it's supposed to be referencing a variable that stores another string.
So instead, try:
match = re.search(r'(' + MONTHS + r')\s+\d{1,2},\s+\d{4}', date).group()
That will concatenate (stick together) three strings, the first bit, then the string stored in your MONTHS variable, and then the last bit.
If you want to substitute something into a string, you need to use either format strings (whether an f-string literal or the format or format_map methods on string objects) or printf-style formatting (or template strings, or a third-party library… but usually one of the first two).
Normally, format strings are the easiest solution, but they don't play nice with strings that need braces for other purposes. You don't want that {4} to be treated as "fill in the 4th argument", and escaping it as {{4}} makes things less readable (and when you're dealing with regular expressions, they're already unreadable enough…).
So, printf-style formatting is probably a better option here:
pattern = r'(%s)\s+\d{1,2},\s+\d{4}' % (MONTHS,)
… or:
pattern = r'(%(MONTHS)s)\s+\d{1,2},\s+\d{4}' % {'MONTHS': MONTHS}
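Putting the printf-style version together, a minimal sketch might look like the following. The date text is a made-up sample, and the check for None is an addition that avoids the 'NoneType' object has no attribute 'group' error when nothing matches.

```python
import re

MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"

# Substitute the alternation into the pattern; {1,2} and {4} are left untouched by %.
pattern = re.compile(r'(%s)\s+\d{1,2},\s+\d{4}' % MONTHS)

date = "Filed on February 12, 2018 by the clerk"  # hypothetical input text
match = pattern.search(date)

# re.search returns None when there is no match, so test before calling .group().
found = match.group() if match else None
print(found)  # February 12, 2018
```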

Python: check if string meets specific format

Programming in Python3.
I am having difficulty in controlling whether a string meets a specific format.
So, I know that Python does not have a .contain() method like Java but that we can use regex.
My code hence will probably look something like this, where lowpan_headers is a dictionary with a field that is a string that should meet a specific format.
So the code will probably be like this:
import re
lowpan_headers = self.converter.lowpan_string_to_headers(lowpan_string)
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(lowpan_headers[dest_addrS])
However, my issue is in the format and I have not been able to get it right.
The format should be like bbbb00000000000000170d0000306fb6, where the first 4 characters should be bbbb and all the rest, with that exact length, should be hexadecimal values (so from 0-9 and a-f).
So two questions:
(1) any easier way of doing this except through importing re
(2) If not, can you help me out with the regex?
As for the regex you're looking for I believe that
^bbbb[0-9a-f]{28}$
should validate correctly for your requirements.
As for whether there is an easier way than using the re module, I would say there isn't really one for what you're trying to achieve. The in keyword in Python works the way you would expect a contains method to work for a string, but you actually want to know whether a string is in a correct format. As such, the best and relatively simple solution is to use a regular expression, and thus the re module.
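A minimal sketch of that check in Python could look like this. The helper name is_valid_address is invented for the example, and re.fullmatch is used in place of the ^ and $ anchors, since $ would also accept a string ending in a newline.

```python
import re

# 'bbbb' followed by exactly 28 lowercase hexadecimal characters (32 in total).
ADDRESS_RE = re.compile(r'bbbb[0-9a-f]{28}')

def is_valid_address(value):
    """Return True if value has the expected format (hypothetical helper)."""
    return ADDRESS_RE.fullmatch(value) is not None

print(is_valid_address('bbbb00000000000000170d0000306fb6'))  # True
print(is_valid_address('bbbb00000000000000170d0000306fx6'))  # False: 'x' is not hex
print(is_valid_address('bbbb0000'))                          # False: too short
```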
Here is a solution that does not use regex:
lowpan_headers = 'bbbb00000000000000170d0000306fb6'
if lowpan_headers[:4] == 'bbbb' and len(lowpan_headers) == 32:
    try:
        int(lowpan_headers[4:], 16) # tries interpreting the last 28 characters as hexadecimal
        print('Input is valid!')
    except ValueError:
        print('Invalid Input') # hex test failed!
else:
    print('Invalid Input') # either length test or 'bbbb' prefix test failed!
In fact, Python does have an equivalent to the .contains() method. You can use the in operator:
if 'substring' in long_string:
    return True
A similar question has already been answered here.
For your case, however, I'd still stick with regex, as you're indeed trying to validate a certain string format. To ensure that your string only has hexadecimal values, i.e. 0-9 and a-f, the following regex should do it: ^[a-fA-F0-9]+$. The additional "complication" is the four 'b' characters at the start of your string. I think an easy fix would be to include them as follows: ^(bbbb)?[a-fA-F0-9]+$.
>>> import re
>>> pattern = re.compile('^(bbbb)?[a-fA-F0-9]+$')
>>> test_1 = 'bbbb00000000000000170d0000306fb6'
>>> test_2 = 'bbbb00000000000000170d0000306fx6'
>>> pattern.match(test_1)
<_sre.SRE_Match object; span=(0, 32), match='bbbb00000000000000170d0000306fb6'>
>>> pattern.match(test_2)
>>>
The part that is currently missing is checking for the exact length of the string for which you could either use the string length method or extend the regex -- but I'm sure you can take it from here :-)
As I mentioned in the comment, Python does have a contains() equivalent.
if "blah" not in somestring:
    continue
(source) (PythonDocs)
If you would prefer to use a regex instead to validate your input, you can use this:
^b{4}[0-9a-f]{28}$ - Regex101 Demo with explanation

Python generate string based on regex format [duplicate]

This question already has answers here:
Reversing a regular expression in Python
(8 answers)
Closed 1 year ago.
I have some difficulties learning regex in Python. I want to turn my Tornado web route patterns, together with their arguments, into request path strings without using the handler's request.path.
For example, I have route with patterns like:
/entities/([0-9]+)
/product/([0-9]+)/actions
Combined with an integer parameter (123), the expected results are strings like:
/entities/123
/product/123/actions
How do I generate a string based on that pattern?
Thank you very much in advance!
This might be a duplicate of:
Reversing a regular expression in Python
Generate a String that matches a RegEx in Python
Using the answer provided by @bjmc, a solution works like this:
>>> import rstr
>>> intermediate = rstr.xeger(r'\d+')
>>> path = '/product/' + intermediate + '/actions'
Depending on how long you want the intermediate integer to be, you could restrict the regex, for example to \d{1,3}
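If the goal is to fill in known values such as 123, and not to generate random ones, a standard-library-only sketch is to replace each capture group in the route pattern with the next argument. The fill_route helper below is hypothetical and assumes simple routes: no nested groups, no escaped parentheses, and one argument per group.

```python
import re

def fill_route(pattern, *args):
    """Replace each simple capture group in pattern with the next argument.

    Hypothetical helper: assumes groups are not nested and that the
    pattern contains no escaped parentheses.
    """
    values = iter(args)
    # \([^()]*\) matches one parenthesized group that has no parentheses inside it.
    return re.sub(r'\([^()]*\)', lambda m: str(next(values)), pattern)

print(fill_route('/entities/([0-9]+)', 123))         # /entities/123
print(fill_route('/product/([0-9]+)/actions', 123))  # /product/123/actions
```

Unlike the rstr approach, this does not check that the supplied value actually matches the group's sub-pattern.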

python .replace() regex [duplicate]

This question already has answers here:
How to input a regex in string.replace?
(7 answers)
Closed 5 years ago.
I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?
z.write(article.replace('</html>.+', '</html>'))
No. Regular expressions in Python are handled by the re module.
article = re.sub(r'(?is)</html>.+', '</html>', article)
In general:
str_output = re.sub(regex_search_term, regex_replacement, str_input)
To replace text using a regular expression, use the re.sub function:
sub(pattern, repl, string[, count, flags])
It replaces non-overlapping occurrences of pattern in string with repl. If you need to inspect each match, for instance to extract information from specific captured groups, you can pass a function as the repl argument.
Examples
>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'
>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
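To illustrate passing a function as the replacement, here is a small sketch that reuses the path from the example above. The double_id function is invented for the illustration; re.sub calls it once per match and substitutes whatever string it returns.

```python
import re

def double_id(match):
    # match.group(1) holds the digits captured by (\d+).
    return '/' + str(int(match.group(1)) * 2)

result = re.sub(r'/(\d+)', double_id, '/andre/23/abobora/43435')
print(result)  # /andre/46/abobora/86870
```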
You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like
z.write(article[:article.index("</html>") + 7])
This is much cleaner, and should be much faster than a regex based solution.
For this particular case, if using the re module is overkill, how about using the split (or rsplit) method, as in
se='</html>'
z.write(article.split(se)[0]+se)
For example,
#!/usr/bin/python
article='''<html>Larala
Ponta Monta
</html>Kurimon
Waff Moff
'''
z=open('out.txt','w')
se='</html>'
z.write(article.split(se)[0]+se)
outputs out.txt as
<html>Larala
Ponta Monta
</html>
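Another regex-free variant, sketched here as an alternative and not taken from the answers above, uses str.partition. The truncate_after name is made up. One difference from the index and split versions is how a missing tag is handled: index raises ValueError, split followed by +se appends a tag that was never there, while partition leaves the text unchanged.

```python
def truncate_after(text, marker='</html>'):
    """Drop everything after the first occurrence of marker (hypothetical helper)."""
    # partition returns (head, marker, tail); marker and tail are '' if not found.
    head, sep, _tail = text.partition(marker)
    return head + sep

article = '<html>Larala\nPonta Monta\n</html>Kurimon\nWaff Moff\n'
print(truncate_after(article))
```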
