Python regular expression with string in it - python

I would like to match a string with something like:
re.match(r'<some_match_symbols><my_match><some_other_match_symbols>', mystring)
where <my_match> is a string I would like it to find. The problem is that it may differ from run to run, and it is stored in a variable. Is it possible to put a variable into a regex?

Nothing prevents you from simply doing this:
re.match('<some_match_symbols>' + '<my_match>' + '<some_other_match_symbols>', mystring)
A regular expression is nothing more than a string containing some characters that are special in regular expression syntax. It is still a string, so you can do with it whatever you normally do with strings.
The r'…' syntax, by the way, is raw string syntax, which simply prevents escape sequences inside the literal from being evaluated. So r'\n' is the same as '\\n', a string containing a backslash and an n, while '\n' contains a line break.

import re
url = "www.dupe.com"
# re.escape makes the dots in url match literally instead of "any character"
expression = re.compile('<p>%s</p>' % re.escape(url))
result = expression.match("<p>www.dupe.com</p>BBB")
if result:
    print result.start(), result.end()

The r'' notation only applies to string literals written in the source code. To build a pattern from a variable, assemble an ordinary string and pass it to re.compile.
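The two answers above can be combined into a small helper. This is only a sketch in Python 3 syntax: the function name build_pattern and its prefix/suffix parameters are made up for illustration, and it assumes the variable part should be matched literally, which is why it goes through re.escape.

```python
import re

def build_pattern(needle, prefix=r"<p>", suffix=r"</p>"):
    """Compile prefix + escaped needle + suffix (hypothetical helper)."""
    # re.escape neutralises metacharacters such as "." in the variable part,
    # so the needle is matched literally.
    return re.compile(prefix + re.escape(needle) + suffix)

url = "www.dupe.com"
pattern = build_pattern(url)

hit = pattern.match("<p>www.dupe.com</p>BBB")
# Without re.escape the dots would match any character and this would succeed.
miss = pattern.match("<p>wwwXdupeXcom</p>")

print(hit.span() if hit else None)
print(miss)
```

If the variable is itself meant to be a regex fragment rather than literal text, drop the re.escape call and concatenate it as-is.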

Related

using OR operator (|) in variable for regular expression in python

I need to match against a list of string values. I'm using '|'.join() to build a string that is passed into re.match:
import re
line='GigabitEthernet0/1 is up, line protocol is up'
interfacenames=[
'Loopback',
'GigabitEthernet'
]
rex="r'" + '|'.join(interfacenames) + "'"
print rex
interface=re.match(rex,line)
print interface
The code result is:
r'Loopback|GigabitEthernet'
None
However, if I copy and paste the string directly into match:
interface=re.match(r'Loopback|GigabitEthernet',line)
It works:
r'Loopback|GigabitEthernet'
<_sre.SRE_Match object at 0x7fcdaf2f4718>
I also tried replacing the .join call with the literal "Loopback|GigabitEthernet" in rex, and it didn't work either. It looks like the pipe symbol is not treated as an operator when it is passed in from a string.
Any thoughts how to fix it?
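The r'...' prefix and quotes are part of string literal syntax, not of the string's value. By concatenating "r'" and "'" you put a literal r and two quote characters into the pattern, so it becomes the alternation of r'Loopback and GigabitEthernet', neither of which matches at the start of the line. Build the pattern like this: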
rex=r'|'.join(interfacenames)
If the interfacenames may contain special regex metacharacters, escape the values like this:
rex=r'|'.join([re.escape(x) for x in interfacenames])
Also, if you plan to match the strings not only at the start of the string, use re.search rather than re.match. See What is the difference between Python's re.search and re.match?
You don't need to put "r'" at the beginning and "'" at the end. Those are part of the syntax for raw string literals, not part of the string itself.
rex = '|'.join(interfacenames)
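Putting the suggestions from both answers together, a minimal Python 3 sketch might look like the following. The non-capturing group around the alternation is an addition not mentioned in the thread; it only matters if more pattern text is appended later. The last lines reproduce the question's broken pattern for comparison.

```python
import re

line = "GigabitEthernet0/1 is up, line protocol is up"
interface_names = ["Loopback", "GigabitEthernet"]

# Escape each name, then join with "|" to form an alternation.
# The (?:...) group keeps the alternation self-contained if more
# pattern text is added before or after it.
rex = "(?:" + "|".join(re.escape(name) for name in interface_names) + ")"

match = re.match(rex, line)
print(rex)
print(match.group(0) if match else None)

# The question's version: the "r" and the quote characters end up
# inside the pattern itself, so nothing matches.
broken = "r'" + "|".join(interface_names) + "'"
print(re.match(broken, line))
```

Use re.search instead of re.match if the interface name may appear anywhere in the line rather than only at the start.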

Python - Should I be using string prefix r when looking for a period (full stop or .) using regex?

I would like to know the reason I get the same result when using string prefix "r" or not when looking for a period (full stop) using python regex.
After reading a number of sources (links below) multiple times and experimenting in code, only to get the same result (again, see below), I am still unsure of:
What is the difference when using string prefix "r" and not using string prefix "r", when looking for a period using regex?
Which way is considered the correct way of finding a period in a string using python regex with string prefix "r" or without string prefix "r"?
>>> re.compile("\.").sub("!", "blah.")
'blah!'
>>> re.compile(r"\.").sub("!", "blah.")
'blah!'
>>> re.compile(r"\.").search("blah.").group()
'.'
>>> re.compile("\.").search("blah.").group()
'.'
Sources I have looked at:
Python docs: string literals
http://docs.python.org/2/reference/lexical_analysis.html#string-literals
Regular expression to replace "escaped" characters with their originals
Python regex - r prefix
r prefix is for raw strings
http://forums.udacity.com/questions/7000217/r-prefix-is-for-raw-strings
The raw string notation is just that, a notation to specify a string value. The notation results in different string values when it comes to backslash escapes recognized by the normal string notation. Because regular expressions also attach meaning to the backslash character, raw string notation is quite handy as it avoids having to use excessive escaping.
Quoting from the Python Regular Expression HOWTO:
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
The \. combination has no special meaning in regular Python strings, so there is no difference at all between the results of '\.' and r'\.'; you can use either:
>>> len('\.')
2
>>> len(r'\.')
2
Raw strings only make a difference when the backslash + other characters do have special meaning in regular string notation:
>>> '\b'
'\x08'
>>> r'\b'
'\\b'
>>> len('\b')
1
>>> len(r'\b')
2
The \b combination has special meaning; in a regular string it is interpreted as the backspace character. But regular expressions see \b as a word boundary anchor, so you'd have to use \\b in your Python string every time you wanted to use this in a regular expression. Using r'\b' instead makes it much easier to read and write your expressions.
The regular expression functions are passed string values; the result of Python interpreting your string literal. The functions do not know if you used raw or normal string literal syntax.
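The point about \b can be checked end to end with a short script. This is a sketch in Python 3 syntax with made-up sample text. It avoids writing '\.' in a normal string because recent Python 3 releases warn about unrecognised escape sequences, which is one more reason to prefer raw strings for patterns.

```python
import re

raw = r"\bcat\b"        # backslash + "b": the regex word-boundary anchor
escaped = "\\bcat\\b"   # the same string value, written with doubled backslashes
backspace = "\bcat\b"   # here Python turns "\b" into the backspace character

text = "a cat sat"

# The raw and the doubled-backslash literals produce identical strings.
print(raw == escaped)

# The third literal is shorter: each "\b" collapsed into one character.
print(len(raw), len(backspace))

# Only the pattern that still contains a real backslash finds the word.
print(bool(re.search(raw, text)))
print(bool(re.search(backspace, text)))
```

The re functions only ever see the resulting string values, so raw and escaped behave identically once the literals have been parsed.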

can't use variable inside regex

So, I have a long sequence of Unicode characters that I want to match using regular expressions:
char_set = '\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
(These are all the uppercase characters in the Unicode range 0-382. Most of them are accented. PEP8 discourages the use of non-ASCII characters in Python scripts, so I'm using the Unicode escape codes instead of the literal characters.)
If I simply compile that long string directly, it works. For instance, this matches all the words that begin with one of those characters:
regex = re.compile(u'\A[\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D]')
But I want to re-use that same sequence of characters in several other regular expressions. I could simply copy and paste it every time, but that's ugly. So based on previous answers to similar questions I've tried this:
regex = re.compile(u'\A[%s]' % char_set)
No good. Somehow the above expression seems to match ANY character, not just the ones hardcoded under the variable 'char_set'.
I've also tried this:
regex = re.compile(u'\A[' + char_set + ']')
And this:
regex = re.compile(u'\A[' + re.escape(char_set) + ']')
And this too:
regex = re.compile(u'\A[{ }]'.format(char_set))
None of which works as expected.
Any thoughts? What am I doing wrong?
(I'm using Python 2.7 and Mac OS X 10.6)
When you're using a pattern with a set of characters in square brackets, you don't want to put any vertical bar (|) characters in the set. Instead, just string the characters together and it should work. Here's a session where I tried out your characters with no problems after stripping the | chars:
>>> import re
>>> char_set = u'\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
>>> fixed_char_set = char_set.replace("|", "") # remove the unneeded vertical bars
>>> pattern = ur"\A[{}]".format(fixed_char_set) # create a pattern string
>>> regex = re.compile(pattern) # compile the pattern into a regex object
>>> print regex.match("%foo") # "%" is not in the character set, so match returns None
None
edit: Actually, it seems like there must be some other issue going on, since I don't match "%foo" even if I use your original char_set without stripping out anything. Please give examples of text that is matching when it shouldn't!
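For reference, here is how a reusable character set might be shared between several patterns. This is a Python 3 sketch, not the thread's Python 2.7 code: the set is shortened to a few entries for brevity, and the names UPPER, starts_upper, all_upper and with_bars are made up. It also shows that a "|" inside square brackets is just another member of the set.

```python
import re

# Shortened stand-in for the question's list (an assumption for brevity):
# ranges and single code points, written without "|" separators.
UPPER = "\u0041-\u005A\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102\u0104"

# The same fragment is reused in two different patterns.
starts_upper = re.compile(r"\A[%s]" % UPPER)
all_upper = re.compile(r"\A[{}]+\Z".format(UPPER))

print(bool(starts_upper.match("\u00c9clair")))  # starts with E-acute (U+00C9)
print(bool(starts_upper.match("%foo")))
print(bool(all_upper.match("\u00c0B")))
print(bool(all_upper.match("\u00c0b")))

# Inside [...] the bar is a literal character, so this set also accepts "|".
with_bars = re.compile(r"\A[A-Z|]")
print(bool(with_bars.match("|x")))
```

In Python 2 the character set would additionally need the u prefix so that the \u escapes are interpreted, as in the session above.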

Regular Expression Not matching the value

I have a file saving IP addresses to names in format
<<%#$192.168.8.40$#% %##Name_of_person##% >>
I read this file and now want to extract the entries using Python's regular expressions:
list=re.findall("<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>",ace)
print list
But the result is always an empty list.
Can anyone tell me where the mistake in the regular expression is?
Edit: ace is the variable holding the contents read from the file.
$ is a special character in regular expressions, meaning "end of line" (or "end of string", depending on the flavour). Your regex has other characters following the $, and as such only matches strings that have those characters after the end, which is impossible.
You will need to escape the $, like so: \$
I would suggest the following regular expression (formatted as a raw string since you are using Python):
r"<<%#\$([^$]+)\$#%\s%##([^#]+)##%\s>>"
That is, <<%#$, then one or more non-$ characters, $#%, a whitespace character, %##, one or more non-# characters, ##%, whitespace, >>.
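A quick way to try the suggested pattern is shown below, in Python 3 syntax. The variable name ace comes from the question; the second record and the names record, pairs and unescaped are made-up sample data and identifiers for illustration only.

```python
import re

# Two records in the question's format; the second one is invented sample data.
ace = (
    "<<%#$192.168.8.40$#% %##Name_of_person##% >>\n"
    "<<%#$10.0.0.7$#% %##Jane Doe##% >>"
)

# Escaped "$" characters match literally; [^$]+ and [^#]+ capture the fields.
record = re.compile(r"<<%#\$([^$]+)\$#%\s%##([^#]+)##%\s>>")
pairs = record.findall(ace)

for ip, name in pairs:
    print(ip, name)

# The question's original pattern: the unescaped "$" demands end of string
# before more characters, so it can never match.
unescaped = re.findall(r"<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>", ace)
print(unescaped)
```

Because [^#]+ accepts spaces, names consisting of several words are captured as well.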
Something like:
text = '<<%#$192.168.8.40$#% %##Name_of_person##% >>'
ip, name = [el[1] for el in re.findall(r'%#(.)(.+?)\1#%', text)]
If you can get away with just splitting on runs of '#' and '$', then:
from operator import itemgetter
ip, name = itemgetter(1, 3)(re.split(r'[#$]+', text))
You could also just use built-in string functions:
tmp = text.split('$')
ip, name = tmp[1], tmp[2].split('#')[3]
Your pattern does not escape the $ characters, so it can never match. In the findall call, replace
"<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>"
with
r"<\%#\$(\S+)\$#\%\s\%##(\w+\s*\w*)##\%\s>>"
Good luck!

How to write a regular expression to match a string literal where the escape is a doubling of the quote character?

I am writing a parser using ply that needs to identify FORTRAN string literals. These are quoted with single quotes with the escape character being doubled single quotes. i.e.
'I don''t understand what you mean'
is a valid escaped FORTRAN string.
Ply takes its token definitions as regular expressions. My attempt so far does not work and I don't understand why.
t_STRING_LITERAL = r"'[^('')]*'"
Any ideas?
A string literal is:
An open single-quote, followed by:
Any number of doubled-single-quotes and non-single-quotes, then
A close single quote.
Thus, our regex is:
r"'(''|[^'])*'"
You want something like this:
r"'([^']|'')*'"
This says that inside the single quotes you can have either a doubled single quote ('') or any character that is not a single quote.
The brackets define a character class, in which you list the characters that may or may not match. It doesn't allow anything more complicated than that, so trying to use parentheses and match a multiple-character sequence ('') doesn't work. Instead your [^('')] character class is equivalent to [^'()], i.e. it matches anything that's not a single quote or a left or right parenthesis.
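The difference between the two patterns is easy to see with the re module directly, outside of ply. The following is a Python 3 sketch: the names FORTRAN_STRING and unquote and the sample source line are made up for illustration, and a non-capturing group is used so that findall returns whole literals.

```python
import re

# A quote, then any mix of doubled quotes and non-quote characters, then a quote.
FORTRAN_STRING = re.compile(r"'(?:''|[^'])*'")

def unquote(literal):
    """Strip the outer quotes and collapse doubled quotes (hypothetical helper)."""
    return literal[1:-1].replace("''", "'")

source = "x = 'I don''t understand what you mean' // 'ok'"

literals = FORTRAN_STRING.findall(source)
for lit in literals:
    print(lit, "->", unquote(lit))

# The question's character class stops at the first quote it meets,
# so the escaped quote splits the literal in two.
naive = re.findall(r"'[^('')]*'", source)
print(naive)
```

In a ply lexer the same pattern string would be assigned to the token rule, as in the question's t_STRING_LITERAL.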
It's usually easy to get something quick-and-dirty for parsing particular string literals that are giving you problems, but for a general solution you can get a very powerful and complete regex for string literals from the pyparsing module:
>>> import pyparsing
>>> pyparsing.quotedString.reString
'(?:"(?:[^"\\n\\r\\\\]|(?:"")|(?:\\\\x[0-9a-fA-F]+)|(?:\\\\.))*")|(?:\'(?:[^\'\\n\\r\\\\]|(?:\'\')|(?:\\\\x[0-9a-fA-F]+)|(?:\\\\.))*\')'
I'm not sure about significant differences between FORTRAN's string literals and Python's, but it's a handy reference if nothing else.
import re
ch ="'I don''t understand what you mean' and you' ?"
print re.search("'.*?'",ch).group()
print re.search("'.*?(?<!')'(?!')",ch).group()
Result:
'I don'
'I don''t understand what you mean'
