Python error because of regex inside a Google Big Query - python

I am writing Google BigQuery wrappers in Python. One of the queries contains a regex, and it is rejected as a syntax error.
Here is the regex
WHEN tier2 CONTAINS '-' THEN REGEXP_EXTRACT(tier2,'(.*)\s-')
the error is Invalid string literal: '(.*)\s-'>
The error is for \ in the regex.
Any suggestions on how to overcome it?

You need to escape the backslash by preceding it with another backslash.
The backslash \ is an escape character, so you need to escape it for it to be treated as a normal character.
Try
'(.*)\\s-'
Based on your comments, it looks like the above is exactly what you are already using in BigQuery, so in this case you need to escape each of the two backslashes in the Python string:
'(.*)\\\\s-'
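As a minimal sketch of the two escaping layers (Python first, then BigQuery), the snippet below builds the same query fragment twice: once with a normal string literal and four backslashes, and once with a raw string literal and two. The variable names are only illustrative; the point is that both spellings send the same text, with two backslashes, to BigQuery.

```python
# Normal string literal: Python collapses each "\\" pair into one backslash,
# so four backslashes in the source become two in the resulting string.
escaped = "WHEN tier2 CONTAINS '-' THEN REGEXP_EXTRACT(tier2,'(.*)\\\\s-')"

# Raw string literal: Python leaves backslashes alone, so two in the source
# stay two in the resulting string.
raw = r"WHEN tier2 CONTAINS '-' THEN REGEXP_EXTRACT(tier2,'(.*)\\s-')"

print(escaped == raw)    # both spellings produce the same query text
print(raw.count("\\"))   # number of backslashes actually sent to BigQuery
```

The raw string is usually easier to read, because the query text in the source file is then exactly what BigQuery receives.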

Related

Python lint issue : invalid escape sequence '\/'

This is the line of my Python code that gives me an invalid escape sequence '\/' lint issue.
pattern = 'gs:\/\/([a-z0-9-]+)\/(.+)$' # for regex matching
It reports that error for every backslash I used here.
Any idea how to resolve this?
There are two issues here:
Since this is not a raw string, the backslashes are string escapes, not regexp escapes. Since \/ is not a valid string escape sequence, you get that warning. Use a raw string so that the backslashes will be ignored by the string parser and passed to the regexp engine. See What exactly is a "raw string regex" and how can you use it?
In some languages / is part of the regular expression syntax (it's the delimiter around the regexp), so they need to be escaped. But Python doesn't use / this way, so there's no need to escape them in the first place.
Use this:
pattern = r'gs://([a-z0-9-]+)/(.+)$' # for regex matching
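A short usage sketch of the corrected pattern follows. The URL `gs://my-bucket/path/to/file.txt` is a made-up example, not something from the question.

```python
import re

# Raw string, and no backslashes before "/" since Python regexes do not need them.
pattern = re.compile(r'gs://([a-z0-9-]+)/(.+)$')

# Hypothetical URL used only for illustration.
match = pattern.match('gs://my-bucket/path/to/file.txt')
bucket, path = match.groups()

print(bucket)
print(path)
```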

Invalid regular expression: invalid escape \ sequence, Postgres, Django

I'm trying to escape a string яблуко* for Postgres regex query:
name = re.escape('яблуко*')
Model.objects.filter(name__iregex='^%s' % name)
This gives me:
Invalid regular expression: invalid escape \ sequence
What am I doing wrong?
P.S. I know that I can do it with istartswith, just wondering why regex is not working.
The problem here is that re.escape escapes far too much for PostgreSQL: it escapes all non-ASCII characters, while PostgreSQL doesn't support escape sequences for unknown characters, which in this case covers all the Unicode characters:
>>> print re.escape('яблуко*')
\я\б\л\у\к\о\*
In the end, it's not really possible to mix the Python regexp engine (for escaping) with the database regexp engine (for evaluation). Unfortunately, Django doesn't provide a way to do this. In Weblate, I've solved this by writing a custom function to escape the regexp; see https://github.com/WeblateOrg/weblate/commit/7425a749b44abafe36d8f1c9db018f57684e5983
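A custom escaping function of that kind could look roughly like the sketch below. This is not the Weblate implementation; `escape_regex` and `SPECIAL_CHARS` are hypothetical names, and the character set is an assumption about which characters are regex metacharacters outside a bracket expression. The idea is simply to put a backslash in front of metacharacters only and leave letters such as Cyrillic ones untouched. (On Python 3.7 and later, re.escape itself only escapes regex-special characters, so the output shown above comes from an older Python.)

```python
# Assumption: these are the characters that need escaping in the target
# regex dialect when they appear outside a bracket expression.
SPECIAL_CHARS = set(r'\.^$*+?()[]{}|')

def escape_regex(text):
    """Backslash-escape regex metacharacters only, leaving letters alone."""
    return ''.join('\\' + ch if ch in SPECIAL_CHARS else ch for ch in text)

name = escape_regex('яблуко*')
print(name)
```

The result can then be interpolated into the name__iregex lookup in the same way as in the question.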

Python docstring + Neo4J cypher query seems to cause regexp error

I have a cypher snippet like this:
where my_node.my_column =~ ("(?i).*\\." + {my_var})
The idea is to match a path-like string. For example, my_column could have a value of db.schema.MY_TABLE and I want to pass "My_TaBlE" in my Python cypher statement. This should match.
However, I am getting a Cypher error on that statement; specifically, it does not like the final "." in the regexp. It looks as if I am not escaping it correctly. The docs say Java regexp is used under the hood.
Right now I am using:
where my_node.my_column =~ ('(?i).*' + '.' + {table_name})
This seems to work, but I can't honestly say if the period is matching any character or the literal period character.
If it matters, my Cypher query is in a Python docstring.
How can I escape the period? Is there a better way to express what I am looking for?
In Python string literals, the backslash ("\") character is used to start an escape sequence. In particular, "\\" is the escape sequence for the backslash character itself.
So, in order to produce a string literal with 2 adjacent backslash characters, you actually need to use 4 adjacent backslash characters in your code. For example, your Python snippet should look like this:
where my_node.my_column =~ ("(?i).*\\\\." + {my_var})
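To check this without running a query, here is a small sketch comparing a normal triple-quoted string (four backslashes in the source) with a raw triple-quoted string (two backslashes in the source). The variable names are illustrative only; a raw docstring-style literal is offered as an alternative spelling, not as something the original answer proposed.

```python
# Normal triple-quoted string: each literal backslash must be doubled.
plain = """where my_node.my_column =~ ("(?i).*\\\\." + {my_var})"""

# Raw triple-quoted string: backslashes are kept exactly as typed.
raw = r"""where my_node.my_column =~ ("(?i).*\\." + {my_var})"""

print(plain == raw)     # both produce the same Cypher text
print(raw.count("\\"))  # backslashes that actually reach Cypher
```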

regex to not get the escaped quote

The example string in python is "sasi0'sada1\'adad2'theend"
I want the single quotes which are not escaped, so quotes after 0 and 2 should be selected but not the quote after 1.
I tried re.findall(r"[\d]'") but I'm getting all three quotes.
Any help?
Let me describe the actual scenario.
I'm writing a script to extract sql queries from code.
perl code:
ad.pl:$query = "Select * from (Select ((select cast(sysdate as ts) from dual)||(select c_r from v\$r_limit where r_n=\'sessions\')||\',\'||(select c_u from v\$r_l where r_n=\'t\')) as \"D,B,HH,AS,CT\" from dual)";
The regex:
re.compile(r'''(('|")(insert |update |delete |select )(.*?)(?<!\)(\2)(;?))''',re.IGNORECASE)
But the backreference is catching the escaped double quote,
so I am getting only half of the query.
I don't think I can add an extra backslash automatically to escape it, as Python fails to read the \ in the first place.
Escaping manually is impossible because it is a huge project with lots of queries.
Any help?
The following regex will work
(?<!\\)(?=')
or
(?=(?<!\\)')
If your requirement is as simple as you mentioned, then you don't even need lookaround. It can simply be written as follows (note that this also consumes the character before the quote):
[^\\]'
The reason your regex matches every quote is that Python interprets \' inside a normal string literal as an escape for ', since Python strings can be delimited by either single or double quotes. So the string actually being matched is
sasi0'sada1'adad2'theend
This string no longer contains any \', so every ' is matched. To keep the backslash in a normal string, escape it by writing it twice:
sasi0'sada1\\'adad2'theend
What's the solution then?
Use a raw string instead of a normal string, by putting r in front of the opening quote:
r"sasi0'sada1\'adad2'theend"
In a normal string, the \ in \' acts as an escape for ', so you need to escape the \ as well, like this: \\'
re.findall(r"[^\\]'","sasi0'sada1\\'adad2'theend")
["0'", "2'"]
This one seems to be working for me. \w((?<!\\)([\w']+))
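Putting the pieces together, here is a minimal sketch that uses a raw string for the input (so the backslash survives) and a negative lookbehind to find only the unescaped quotes. It reports positions rather than the quotes themselves, which is a choice made here for illustration.

```python
import re

# Raw string, so the backslash before the second quote is preserved.
text = r"sasi0'sada1\'adad2'theend"

# (?<!\\)' matches a quote that is not immediately preceded by a backslash.
unescaped = [m.start() for m in re.finditer(r"(?<!\\)'", text)]

print(unescaped)  # positions of the quotes after 0 and after 2
```

Unlike [^\\]', the lookbehind does not consume the preceding character, and it also matches a quote at the very start of the string.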

pep8 warning on regex string in Python, Eclipse

Why is pep8 complaining about the following string in the code?
import re
re.compile("\d{3}")
The warning I receive:
ID:W1401 Anomalous backslash in string: '\d'. String constant might be missing an r prefix.
Can you explain what the message means? What do I need to change in the code so that warning W1401 goes away?
The code passes the tests and runs as expected. Moreover, \d{3} is a valid regex.
"\d" is the same as "\\d" because there is no string escape sequence for d, but that is not obvious to a reader of the code.
Consider \t, however: "\t" represents a tab character, while r"\t" represents a literal \ followed by the character t.
So use raw string when you mean literal \ and d:
re.compile(r"\d{3}")
or escape backslash explicitly:
re.compile("\\d{3}")
Python cannot parse '\d' as a string escape sequence, which is why the warning is produced.
The backslash is then kept and passed down to the regex parser literally, where \d works fine as a regex escape sequence.
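The difference between the two kinds of literal can be checked directly. This is a small sketch with illustrative variable names, and it deliberately avoids writing "\d" in a normal string so that it does not trigger the warning itself.

```python
import re

tab = "\t"       # normal string: \t is the tab escape, one character
raw_tab = r"\t"  # raw string: a backslash followed by "t", two characters
print(len(tab), len(raw_tab))

# The raw spelling and the explicitly escaped spelling are the same string.
print(r"\d{3}" == "\\d{3}")

# Either way, the regex engine sees \d{3} and matches three digits.
print(re.fullmatch(r"\d{3}", "123") is not None)
```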
