newbie about regular expression in Python

I have the following code to match a pattern in str:
match = re.search(r'word:\w\w\w', str)
I tried to store the pattern in a variable:
pat='word:\w\w\w' # and use it in re.search
match = re.search(rpat, str)
I got a compile error.
How do I create a variable for a pattern in Python?

You can't replace r'word:\w\w\w' by erasing the string and writing a variable name after the r. The r is part of the string literal, so it has to move with the literal.
pat=r'word:\w\w\w'
match = re.search(pat, str)
In this particular case the pattern also works without the r, because \w is not a recognized string escape sequence and Python leaves the backslash in place. Recent Python 3 versions emit a warning for such unrecognized escapes, though, so the raw string is the safer choice.
pat='word:\w\w\w'
match = re.search(pat, str)
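As a minimal sketch of the point above (the sample string `text` is made up for illustration and is not from the question): the r prefix stays on the literal, and the variable simply holds the resulting string.

```python
import re

# Hypothetical sample input, chosen only for illustration.
text = "prefix word:cat suffix"

# The r prefix belongs to the literal; pat is an ordinary str afterwards.
pat = r'word:\w\w\w'

match = re.search(pat, text)
found = match.group() if match else None
print(found)
```

The name `text` is used instead of `str` so that the built-in `str` type is not shadowed.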

Related

Compiling user defined regex object and searching/matching in Python [duplicate]

This question already has answers here:
How to create raw string from string variable in python?
(3 answers)
Closed 4 years ago.
I am writing an online regex checker, which takes input from the user in the form of a pattern, and flags, and uses that to compile a regex object. The regex object is then used to check if the test string matches within the format provided by the regex pattern or not. As of this moment, the compile function looks like this:
import re

class RegexObject:
    ...
    def compile(self):
        flags = 0  # start with no flags set; an int can be combined with |=
        if self.multiline_flag:
            flags |= re.M
        if self.dotall_flag:
            flags |= re.S
        if self.verbose_flag:
            flags |= re.X
        if self.ignorecase_flag:
            flags |= re.I
        if self.unicode_flag:
            flags |= re.U
        regex = re.compile(self.pattern, flags)
        return regex
Please note, the self.pattern and all the flags are class attributes defined by the user using a simple form. However, one thing I noticed in the docs is that there is usually an r before the pattern in the compile functions, like this:
re.compile(r'(?<=abc)def')
How do I place that r in my code before my variable name? Also, if I want to tell the user if the test input is valid or not, should I be using the match method, or the search method?
Thanks for any help.
Edit: This question is not a duplicate of this one, because that question has nothing to do with regular expressions.

Don't worry about the r, you don't need it here.
The r stands for "raw", not "regex". In an r string, you can put backslashes without escaping them. R strings are often used in regexes because there are often many backslashes in regexes. Escaping the backslashes can be annoying. See this shell output:
>>> s = r"\a"
>>> s2 = "\a"
>>> s2
'\x07'
>>> s
'\\a'
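To make this concrete, here is a small sketch. The variable `user_pattern` is a stand-in for the form input and is an assumption of this example. It shows that a raw literal and an escaped literal produce the same string, and that a string arriving at runtime is passed to re.compile unchanged.

```python
import re

# Both literals produce the same two-character string: a backslash and 'd'.
raw = r"\d"
escaped = "\\d"
same = raw == escaped

# Stand-in for a pattern typed into a form (an assumption for this sketch).
# It is already a plain str, so there is nothing to "prefix" with r.
user_pattern = "(?<=abc)def"
regex = re.compile(user_pattern)
hit = regex.search("abcdef")
print(same, hit.group())
```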
You should also use search, because match only looks at the start of the string. From the docs:
re.search(pattern, string, flags=0)
Scan through string looking for
the first location where the regular expression pattern produces a
match, and return a corresponding match object. Return None if no
position in the string matches the pattern; note that this is
different from finding a zero-length match at some point in the
string.
re.match(pattern, string, flags=0)
If zero or more characters at the
beginning of string match the regular expression pattern, return a
corresponding match object. Return None if the string does not match
the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the
beginning of the string and not at the beginning of each line.
If you want to locate a match anywhere in string, use search() instead
(see also search() vs. match()).
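The difference quoted above can be seen in a short sketch. The pattern and test strings are made up for illustration. `fullmatch` is included as a third option, on the assumption that "valid input" might mean the whole string has to match.

```python
import re

regex = re.compile(r"\d+")
text = "abc 123"

at_start = regex.match(text)       # None: the string does not start with digits
anywhere = regex.search(text)      # finds "123" later in the string
whole = regex.fullmatch(text)      # None: the entire string is not digits
whole_ok = regex.fullmatch("123")  # matches: the entire string is digits

print(at_start, anywhere.group(), whole, whole_ok.group())
```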

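You don't need the r. Instead, you should use re.escape. Whether to use match or search should again be decided by user input. Note that re.escape makes every metacharacter literal, so it is only appropriate if the input should be matched as plain text rather than as a regex.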

using OR operator (|) in variable for regular expression in python

I need to match against a list of string values. I'm using '|'.join() to build a string that is passed into re.match:
import re
line='GigabitEthernet0/1 is up, line protocol is up'
interfacenames=[
    'Loopback',
    'GigabitEthernet'
]
rex="r'" + '|'.join(interfacenames) + "'"
print rex
interface=re.match(rex,line)
print interface
The code result is:
r'Loopback|GigabitEthernet'
None
However, if I copy and paste the string directly into match:
interface=re.match(r'Loopback|GigabitEthernet',line)
It works:
r'Loopback|GigabitEthernet'
<_sre.SRE_Match object at 0x7fcdaf2f4718>
I did try to replace .join with actual "Loopback|GigabitEthernet" in rex and it didn't work either. It looks like the pipe symbol is not treated as operator when passed from string.
Any thoughts how to fix it?

You added the r' prefix to the contents of the string, but it is only part of string literal syntax. This is how it could be used:
rex=r'|'.join(interfacenames)
If the interfacenames may contain special regex metacharacters, escape the values like this:
rex=r'|'.join([re.escape(x) for x in interfacenames])
Also, if you plan to match the strings not only at the start of the string, use re.search rather than re.match. See What is the difference between Python's re.search and re.match?

You don't need to put "r'" at the beginning and "'" at the end. That's part of the syntax for raw string literals; it's not part of the string itself.
rex = '|'.join(interfacenames)
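Putting the two answers together, a sketch using the data from the question might look like this. The `broken` variable reproduces the original attempt for comparison: there the quote characters and the letter r end up inside the pattern, so neither alternative matches.

```python
import re

line = 'GigabitEthernet0/1 is up, line protocol is up'
interfacenames = ['Loopback', 'GigabitEthernet']

# Escape each name in case it contains regex metacharacters, then join with |.
rex = '|'.join(re.escape(name) for name in interfacenames)
interface = re.match(rex, line)

# The original attempt: the pattern text is literally r'Loopback|GigabitEthernet'
broken = re.match("r'" + rex + "'", line)

print(rex)
print(interface.group())
print(broken)
```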

Python RegEx: find specific word followed by roman numeral [duplicate]

I tested this at http://regex101.com/r/oU6eI5/1 and it seems to work, but when I put it in Python it matches the whole string.
str = "galley/files/tew/tewt/tweqt/"
re.sub('^.+/+([^/]+/$)', "\1", str)
I want to get "tweqt/".

You need to use a raw string for the replacement:
str = "galley/files/tew/tewt/tweqt/"
re.sub('^.+/+([^/]+/$)', r"\1", str)
#                        ^
Otherwise, "\1" is interpreted as a string escape for the character with code 1 (\x01), not as a backreference. On my console, it shows up as a little smiley.
If you don't want to use a raw string for some reason, you'll have to escape the backslash:
re.sub('^.+/+([^/]+/$)', "\\1", str)
It's also good practice to write regex strings as raw strings and to use consistent quotes, so I would advise using:
re.sub(r'^.+/+([^/]+/$)', r'\1', str)
Other notes
It might be simpler to match (using re.search) instead of using re.sub:
re.search(r'[^/]+/$', str).group()
# => tweqt/
You might also want to use a variable name other than str, because this name shadows the built-in str type.

It would be better if you define the pattern or regex as raw string.
>>> import re
>>> s = "galley/files/tew/tewt/tweqt/"
>>> m = re.sub(r'^.+/+([^/]+/$)', r'\1', s)
               ^                  ^
>>> m
'tweqt/'
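For reference, a short sketch comparing the two replacement strings, using the input from the question. Only the variable names `wrong` and `right` are introduced here.

```python
import re

s = "galley/files/tew/tewt/tweqt/"

# In a normal literal, "\1" is an octal escape for the character with code 1.
is_control_char = "\1" == chr(1)

wrong = re.sub(r'^.+/+([^/]+/$)', "\1", s)   # replaces the match with chr(1)
right = re.sub(r'^.+/+([^/]+/$)', r"\1", s)  # backreference to group 1

print(is_control_char, repr(wrong), repr(right))
```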

can't use variable inside regex

So, I have a long sequence of Unicode characters that I want to match using regular expressions:
char_set = '\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
(These are all the uppercase characters in the Unicode range 0-382. Most of them are accented. PEP8 discourages the use of non-ASCII characters in Python scripts, so I'm using the Unicode escape codes instead of the characters themselves.)
If I simply compile that long string directly, it works. For instance, this matches all the words that begin with one of those characters:
regex = re.compile(u'\A[\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D]')
But I want to re-use that same sequence of characters in several other regular expressions. I could simply copy and paste it every time, but that's ugly. So based on previous answers to similar questions I've tried this:
regex = re.compile(u'\A[%s]' % char_set)
No good. Somehow the above expression seems to match ANY character, not just the ones hardcoded under the variable 'char_set'.
I've also tried this:
regex = re.compile(u'\A[' + char_set + ']')
And this:
regex = re.compile(u'\A[' + re.escape(char_set) + ']')
And this too:
regex = re.compile(u'\A[{ }]'.format(char_set))
None of which works as expected.
Any thoughts? What am I doing wrong?
(I'm using Python 2.7 and Mac OS X 10.6)

When you're using a pattern with a set of characters in square brackets, you don't want to put any vertical bar (|) characters in the set. Instead, just string the characters together and it should work. Here's a session where I tried out your characters with no problems after stripping the | chars:
>>> import re
>>> char_set = u'\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
>>> fixed_char_set = char_set.replace("|", "") # remove the unneeded vertical bars
>>> pattern = ur"\A[{}]".format(fixed_char_set) # create a pattern string
>>> regex = re.compile(pattern) # compile the pattern into a regex object
>>> print regex.match("%foo") # "%" is not in the character set, so match returns None
None
edit: Actually, it seems like there must be some other issue going on, since I don't match "%foo" even if I use your original char_set without stripping out anything. Please give examples of text that is matching when it shouldn't!
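Here is a sketch of the same idea in Python 3, using a shortened subset of the ranges from the question for illustration. It shows that | inside square brackets is just a literal character, so leaving it in makes the class accept "|" as well. As a likely explanation for the original symptom (an assumption, not confirmed in the thread): in Python 2.7 the question's char_set has no u prefix, so the \u sequences are not converted to characters before they reach the regex. The answer's session adds the u prefix.

```python
import re

# Shortened subset of the question's ranges, for illustration only.
# In Python 3 the \u escapes are converted to characters in a normal literal.
char_set = '\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE'

with_bars = re.compile(r'\A[%s]' % char_set)
without_bars = re.compile(r'\A[%s]' % char_set.replace('|', ''))

bar_kept = with_bars.match('|foo')           # matches: | is a literal in [...]
bar_removed = without_bars.match('|foo')     # None once the bars are stripped
accented = without_bars.match('\u00C9mile')  # U+00C9 lies in the C0-D6 range
percent = without_bars.match('%foo')         # None: % is not in the set

print(bool(bar_kept), bar_removed, bool(accented), percent)
```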

Python regular expression with string in it

I would like to match a string with something like:
re.match(r'<some_match_symbols><my_match><some_other_match_symbols>', mystring)
where <my_match> is a string I would like it to find. The problem is that it may change from time to time, and it is stored in a variable. Would it be possible to add a variable to a regex?

Nothing prevents you from simply doing this:
re.match('<some_match_symbols>' + '<my_match>' + '<some_other_match_symbols>', mystring)
Regular expressions are nothing more than strings containing some special characters specific to the regular expression syntax. They are still strings, so you can do whatever you are used to doing with strings.
The r'…' syntax is, by the way, raw string syntax, which just prevents escape sequences inside the literal from being interpreted. So r'\n' is the same as '\\n', a string containing a backslash and an n, while '\n' contains a line break.

import re
url = "www.dupe.com"
# Build the pattern by formatting the variable into an ordinary string
expression = re.compile('<p>%s</p>' % url)
result = expression.match("<p>www.dupe.com</p>BBB")
if result:
    print(result.start(), result.end())

The r'' notation applies only to string literals. To build a pattern from a variable, pass the string to re.compile.
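One caveat when inserting a variable into a pattern: if the variable should be matched as literal text, characters such as "." keep their regex meaning unless they are escaped. A sketch using the url from the answer above follows. The second test string is made up for illustration.

```python
import re

url = "www.dupe.com"

# re.escape turns the dots into literal dots before the string is inserted.
escaped = re.compile('<p>%s</p>' % re.escape(url))
unescaped = re.compile('<p>%s</p>' % url)

ok = escaped.match("<p>www.dupe.com</p>BBB")
lookalike = "<p>wwwXdupeYcom</p>"        # hypothetical input with no dots
strict = escaped.match(lookalike)        # None: dots must be literal dots
loose = unescaped.match(lookalike)       # matches: "." accepts any character

print(ok.span(), strict, bool(loose))
```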
