How to match double quote in python regex? - python

I use this statement result=re.match(r"\[.+\]",sentence) to match sentence="[balabala]". But I always get None. Why? I tried many times and online regex test shows it works.

Those double-quotes in your regular expression are delimiting the string rather than part of the regular expression. If you want them to be part of the actual expression, you'll need to add more, and escape them with a backslash (r"\"\[.+\]\""). Alternatively, enclose the string in single quotes instead (r'"\[.+\]"').
re.match() only produces a match if the expression is found at the beginning of the string. Since, in your example, there is a double quote character at the beginning of the string, and the regular expression doesn't include a double quote, it does not produce a match. Try re.search() or re.findall() instead.
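A minimal sketch of the difference, assuming the sentence really does contain the surrounding double quotes as the answers above infer (the variable names other than sentence are illustrative):

```python
import re

# Assumption: the double quotes are part of the data itself.
sentence = '"[balabala]"'
pattern = r"\[.+\]"

anchored = re.match(pattern, sentence)          # must match at index 0, where the '"' sits
anywhere = re.search(pattern, sentence)         # may match at any position
with_quotes = re.match(r'"\[.+\]"', sentence)   # quotes included in the pattern

print(anchored)
print(anywhere.group())
print(with_quotes.group())
```

Here re.match fails only because of the anchoring at the start of the string; the pattern itself is fine.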

Regex
["][\w\s]+["]
This will match any words enclosed within double quotes, for example "asd"
import re
s='hello dear "Akhil", how are you?'
print(re.findall(r'["][\w\s]+["]',s))
This will return ['"Akhil"'], a list containing the match together with its quotes.
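If only the text between the quotes is wanted, a capturing group is one option. A small sketch reusing the same sample string; when the pattern has exactly one group, re.findall returns the group contents rather than the whole match:

```python
import re

s = 'hello dear "Akhil", how are you?'

with_quotes = re.findall(r'"[\w\s]+"', s)        # whole match, quotes included
without_quotes = re.findall(r'"([\w\s]+)"', s)   # only the captured group

print(with_quotes)
print(without_quotes)
```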

Related

Search a regex containing special characters

I want to find if there is a pattern $*.* in text.
But I cannot figure out how to do that with regular expressions in Python.
Escape the dollar and the dot:
re.search(r'\$\.', inputstring)
Rules of thumb:
Use a raw string literal, so you don't have to double the backslashes (both Python and regular expressions derive meaning from the backslash)
When in doubt, escape the character to make it match a literal character.
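When it is unclear which characters need escaping, re.escape from the standard library can do it for you. A short sketch; needle and haystack are illustrative names, not taken from the question:

```python
import re

needle = "$."        # literal text to look for
haystack = "tes$.t"

pattern = re.escape(needle)   # backslash-escapes the regex metacharacters
print(pattern)
print(bool(re.search(pattern, haystack)))
print(bool(re.search(pattern, "test")))
```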
Since you are looking for a specific substring, you don't even need a regular expression for this. This should do:
"$." in my_string
Example:
>>> "$." in "tes$.t"
True
>>> "$." in "test"
False

python regex re.compile match

I am trying to match (using regex in python):
http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg
in the following string:
http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'
My code has something like this:
temp="http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"
dummy=str(re.compile(r'.com'',,''(.*?)'',,''Model Photo').search(str(temp)).group(1))
I do not think the "dummy" is correct & I am unsure how I "escape" the single and double quotes in the regex re.compile command.
I tried googling for the problem, but I couldn't find anything relevant.
Would appreciate any guidance on this.
Thanks.
The easiest way to deal with strings in Python that contain escape characters and quotes is to triple double-quote the string (""") and prefix it with r. For example:
my_str = r"""This string would "really "suck"" to write if I didn't
know how to tell Python to parse it as "raw" text with the 'r' character and
triple " quotes. Especially since I want \n to show up as a backslash followed
by n. I don't want \0 to be the null byte either!"""
The r means "take escape characters as literal". The triple double-quotes (""") prevent single-quotes, double-quotes, and double double-quotes from prematurely ending the string.
EDIT: I expanded the example to include things like \0 and \n. In a normal string (not a raw string) a \ (the escape character) signifies that the next character has special meaning. For example \n means "the newline character". If you literally wanted the character \ followed by n in your string you would have to write \\n, or just use a raw string instead, as I show in the example above.
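The difference is easy to check in the interpreter. A small sketch comparing a normal and a raw literal (the variable names are illustrative):

```python
normal = 'a\nb'   # \n is processed into a single newline character
raw = r'a\nb'     # the backslash and the n stay as two separate characters

print(len(normal), len(raw))
print(raw == 'a\\nb')   # doubling the backslash yields the same string

# A raw triple-quoted literal can hold both quote styles and backslashes unescaped.
mixed = r"""He said "it's \d+" twice"""
print(mixed)
```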
You can also read about string literals in the Python documentation here:
For beginners: http://docs.python.org/tutorial/introduction.html#strings
Complex explanation: http://docs.python.org/reference/lexical_analysis.html#string-literals
Try triple quotes:
import re
# Triple quotes let the single quotes in the data appear unescaped.
tmp=""".*http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg.*"""
text="""http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"""
x=re.match(tmp,text)
if x is not None:
    print(x.group())
Also you were missing the .* in the beginning of the pattern and at the end. I added that too.
If you use double quotes (which have the same meaning as single quotes in Python), you don't have to escape anything in this case. You can even use a string literal without the leading r, since the pattern contains no backslashes:
re.compile(".com','(.*?)','Model Photo")
Commas don't need to be escaped, and single quotes don't need to be escaped if you use double quotes to create the string:
>>> dummy=re.compile(r".com','(.*?)','Model Photo").search(temp).group(1)
>>> print(dummy)
http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg
Note that I also removed some unnecessary str() calls, and for future reference if you do ever need to escape single or double quotes (say your string contains both), use a backslash like this:
'.com\',\'(.*?)\',\'Model Photo'
As mykhal pointed out in comments, this doesn't work very nicely with regex because you can no longer use the raw string (r'...') literal. A better solution would be to use triple quoted strings as other answers suggested.
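For completeness, a sketch of the raw triple-quoted form applied to the question's data. The dot before com is escaped here so that it matches only a literal dot, which is a small change from the patterns shown above:

```python
import re

temp = "http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"

# Raw + triple-quoted: neither the single quotes nor the backslash need extra escaping.
pattern = re.compile(r"""\.com','(.*?)','Model Photo""")

dummy = pattern.search(temp).group(1)
print(dummy)
```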

python "re" package, strange phenomenon with "raw" string

I am seeing the following phenomenon, couldn't seem to figure it out, and didn't find anything with some search through archives:
if I type in:
>>> if re.search(r'\n',r'this\nis\nit'):
...     print('found it!')
... else:
...     print("didn't find it")
...
I will get:
didn't find it
However, if I type in:
>>> if re.search(r'\\n',r'this\nis\nit'):
...     print('found it!')
... else:
...     print("didn't find it")
...
Then I will get:
found it!
(The first one only has one backslash on the r'\n' whereas the second one has two backslashes in a row on the r'\\n' ... even this interpreter is removing one of them.)
I can guess what is going on, but I don't understand the official mechanism as to why this is happening: in the first case, I need to escape two things: both the regular expression and the special strings. "Raw" lets me escape the special strings, but not the regular expression.
But there will never be a regular expression in the second string, since it is the string being matched. So there is only a need to escape once.
However, something doesn't seem consistent to me: how am I supposed to ensure that the characters REALLY ARE taken literally in the first case? Can I type rr'' ? Or do I have to ensure that I escape things twice?
On a similar vein, how do I ensure that a variable is taken literally (or that it is NOT taken literally)? E.g., what if I had a variable tmp = 'this\nis\nmy\nhome', and I really wanted to find the literal combination of a slash and an 'n', instead of a newline?
Thanks! Mike
re.search(r'\n', r'this\nis\nit')
As you said, "there will never be a regular expression in the second string." So we need to look at these strings differently: the first string is a regex, the second just a string. Usually your second string will not be raw, so any backslashes are Python-escapes, not regex-escapes.
So the first string consists of a literal "\" and an "n". This is interpreted by the regex parser as a newline (docs: "Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser"). So your regex will be searching for a newline character.
Your second string consists of the string "this" followed by a literal "\" and an "n". So this string does not contain an actual newline character. Your regex will not match.
As for your second regex:
re.search(r'\\n', r'this\nis\nit')
This version matches because your regex contains three characters: a literal "\", another literal "\" and an "n". The regex parser interprets the two backslashes as a single "\" character, followed by an "n". So your regex will be searching for a "\" followed by an "n", which is found within the string. But that isn't very helpful, since it has nothing to do with newlines.
Most likely what you want is to drop the r from the second string, thus treating it as a normal Python string.
re.search(r'\n', 'this\nis\nit')
In this case, your regex (as before) is searching for a newline character. And, it finds it, because the second string contains the word "this" followed by a newline.
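The four combinations discussed above can be checked side by side. A small sketch; raw_text and real_text are illustrative names:

```python
import re

raw_text = r'this\nis\nit'    # contains backslash + n, no newline characters
real_text = 'this\nis\nit'    # contains actual newline characters

results = {
    "newline regex vs raw text": bool(re.search(r'\n', raw_text)),
    "backslash-n regex vs raw text": bool(re.search(r'\\n', raw_text)),
    "newline regex vs real text": bool(re.search(r'\n', real_text)),
    "backslash-n regex vs real text": bool(re.search(r'\\n', real_text)),
}

for label, found in results.items():
    print(label, "->", found)
```

Only the pairs where the regex and the data agree on what is being looked for (a newline, or a backslash followed by n) produce a match.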
Escaping special sequences in string literals is one thing, escaping regular expression special characters is another. The raw string modifier only affects the former.
Technically, re.search accepts two strings and passes the first to the regex builder with re.compile. The compiled regex object is used to search patterns inside simple strings. The second string is never compiled and thus it is not subject to regex special character rules.
If the regex builder receives a \n after the string literal is processed, it converts this sequence to a newline character. You also have to escape the backslash if you need to match the literal sequence instead.
All rationale behind this is that regular expressions are not part of the language syntax. They are rather handled within the standard library inside the re module with common building blocks of the language.
The re.compile function uses special characters and escaping rules compatible with most commonly used regex implementations. However, the Python interpreter is not aware of the whole regular expression concept and it does not know whether a string literal will be compiled into a regex object or not. As a result, Python can't provide any kind of syntax simplification such as the ones you suggested.
Regexes have their own meaning for literal backslashes, such as character classes like \d. If you actually want a literal backslash character, you will in fact need to double-escape it. It's really not supposed to be parallel since you're comparing a regex to a string.
Raw strings are just a convenience, and it would be way overkill to have double-raw strings.
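For the last part of the question (treating a variable's contents literally), re.escape is one way to avoid escaping by hand. A sketch, assuming tmp holds backslash-n pairs rather than real newlines; needle is an illustrative name:

```python
import re

tmp = r'this\nis\nmy\nhome'   # assumption: backslash + n, not newline characters
needle = r'\n'                # the two characters to find literally

literal_pattern = re.escape(needle)   # escapes the backslash for the regex parser
print(literal_pattern)
print(re.findall(literal_pattern, tmp))
print(needle in tmp)   # a fixed substring needs no regex at all
```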

How to detect an invalid C escaped string using a regular expression?

I would like to find a regular expression (regex) that detects invalid escapes in a C double-quoted escaped string (where double quotes can appear only in escaped form).
I consider valid \\ \n \r \" (the test string is using ")
A partial solution to this is to use (?<!\\)\\[^\"\\nr] but this one fails to detect bad escapes like \\\.
Here is a test string that I use to test the matching:
...\n...\\b...\"...\\\\...\\\E...\...\\\...\\\\\..."...\E...
The expression should match the last 6 blocks as invalid, the first 4 are valid. The problem is that my current version does find only 2/5 errors.
(?:^|[^\\])(?:\\\\)*((?:\"|\\(?:[^\"\\nr]|$)))
That's the start of a string, or something that's not a backslash. Then some (possibly zero) properly escaped backslashes, then either an unescaped " or another backslash; if it's another backslash, it must be followed by something that is neither ", \, n, nor r, or the end of the string.
The incorrect escape is captured for you as well.
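A sketch of how that expression could be tried out from Python. The helper name find_invalid_escape and the sample strings are made up for illustration; only the pattern comes from the answer above:

```python
import re

# Pattern from the answer above, written as a raw string.
INVALID = re.compile(r'(?:^|[^\\])(?:\\\\)*((?:\"|\\(?:[^\"\\nr]|$)))')

def find_invalid_escape(text):
    """Return the first offending sequence, or None if every escape is valid."""
    m = INVALID.search(text)
    return m.group(1) if m else None

samples = [r'abc\n', r'a\\b', r'say \"hi\"', r'bad\E', r'x\\\E', 'un"escaped', 'trailing\\']
for sample in samples:
    print(repr(sample), '->', repr(find_invalid_escape(sample)))
```

Note that, as written, the pattern also reports an unescaped double quote, not only a bad backslash escape.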
Try this regular expression:
^(?:[^\\]+|\\[\\rn"])*(\\(?:[^\\rn"]|$))
If you have a match, you have an invalid escape sequence.

How to write a regular expression to match a string literal where the escape is a doubling of the quote character?

I am writing a parser using ply that needs to identify FORTRAN string literals. These are quoted with single quotes with the escape character being doubled single quotes. i.e.
'I don''t understand what you mean'
is a valid escaped FORTRAN string.
Ply takes input in regular expression. My attempt so far does not work and I don't understand why.
t_STRING_LITERAL = r"'[^('')]*'"
Any ideas?
A string literal is:
An open single-quote, followed by:
Any number of doubled-single-quotes and non-single-quotes, then
A close single quote.
Thus, our regex is:
r"'(''|[^'])*'"
You want something like this:
r"'([^']|'')*'"
This says that inside of the single quotes you can have either a doubled single quote ('') or a non-quote character.
The brackets define a character class, in which you list the characters that may or may not match. It doesn't allow anything more complicated than that, so trying to use parentheses and match a multiple-character sequence ('') doesn't work. Instead your [^('')] character class is equivalent to [^'()], i.e. it matches anything that's not a single quote or a left or right parenthesis.
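A quick comparison of the two patterns outside of ply, as a sketch. The names string_literal, broken and source are illustrative, and the group is written as non-capturing so that findall returns whole literals:

```python
import re

string_literal = re.compile(r"'(?:''|[^'])*'")   # doubled quote or any non-quote character
broken = re.compile(r"'[^('')]*'")               # character class: same as '[^'()]*'

source = "'I don''t understand what you mean'"

print(broken.match(source).group())   # stops at the first inner quote
m = string_literal.match(source)
print(m.group())

# Strip the delimiters and undo the quote doubling to get the string's value.
value = m.group()[1:-1].replace("''", "'")
print(value)
```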
It's usually easy to get something quick-and-dirty for parsing particular string literals that are giving you problems, but for a general solution you can get a very powerful and complete regex for string literals from the pyparsing module:
>>> import pyparsing
>>> pyparsing.quotedString.reString
'(?:"(?:[^"\\n\\r\\\\]|(?:"")|(?:\\\\x[0-9a-fA-F]+)|(?:\\\\.))*")|(?:\'(?:[^\'\\n\\r\\\\]|(?:\'\')|(?:\\\\x[0-9a-fA-F]+)|(?:\\\\.))*\')'
I'm not sure about significant differences between FORTRAN's string literals and Python's, but it's a handy reference if nothing else.
import re
ch ="'I don''t understand what you mean' and you' ?"
print(re.search("'.*?'",ch).group())
print(re.search("'.*?(?<!')'(?!')",ch).group())
result
'I don'
'I don''t understand what you mean'
