regex for parsing SQL statements

regex for parsing SQL statements - python

I've got an IronPython script that executes a bunch of SQL statements against a SQL Server database. The statements are large strings that actually contain multiple statements, separated by the "GO" keyword. That works when they're run from SQL Server Management Studio and some other tools, but not in ADO. So I split up the strings using the Python 2.5 "re" module like so:
import re
splitter = re.compile(r'\bGO\b', re.IGNORECASE)
for script in splitter.split(scriptBlob):
    if script:
        [... execute the query ...]
This breaks in the rare case that there's the word "go" in a comment or a string. How in the heck would I work around that? i.e. correctly parse this string into two scripts:
-- this is a great database script! go team go!
INSERT INTO myTable(stringColumn) VALUES ('go away!')
/*
here are some comments that go with this script.
*/
GO
INSERT INTO myTable(stringColumn) VALUES ('this is the next script')
EDIT:
I searched more and found this SQL documentation:
http://msdn.microsoft.com/en-us/library/ms188037(SQL.90).aspx
As it turns out, GO must be on its own line, as some answers suggested. However, it can be followed by a "count" integer, which executes the statement batch that many times (has anybody actually used that before?), and it can be followed by a single-line comment on the same line (but not a multi-line comment; I tested this). So the magic regex would look something like:
"(?m)^\s*GO\s*\d*\s*$"
Except this doesn't account for:
a possible single-line comment ("--" followed by any character except a line break) at the end.
the whole line being inside a larger multi-line comment.
I'm not concerned about capturing the "count" argument and using it. Now that I have some technical documentation, I'm tantalizingly close to writing this "to spec" and never having to worry about it again.

Is "GO" always on a line by itself? You could just split on "^GO$".

Since you can have comments inside comments, nested comments, comments inside queries, etc., there is no sane way to do it with regexes.
Just imagine the following script:
INSERT INTO table (name) VALUES (
-- GO NOW GO
'GO to GO /* GO */ GO' +
/* some comment 'go go go'
-- */ 'GO GO' /*
GO */
)
And that is without mentioning:
INSERT INTO table (go) values ('xxx') GO
The only way would be to build a stateful parser instead. One that reads a char at a time, and has a flag that will be set when it is inside a comment/quote-delimited string/etc and reset when it ends, so the code can ignore "GO" instances when inside those.
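The stateful approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a complete T-SQL lexer: the names split_batches, scan_line and is_go_line are made up for this example, strings are assumed to be '...' with quotes escaped by doubling, nested block comments are not tracked, and a separator is taken to be a line holding only GO, an optional count and an optional trailing -- comment.

```python
def is_go_line(line):
    """True if the line is only GO, an optional count and an optional -- comment."""
    code = line.split("--", 1)[0].split()
    if not code or code[0].upper() != "GO":
        return False
    return len(code) == 1 or (len(code) == 2 and code[1].isdigit())


def scan_line(line, state):
    """Walk one line a character at a time and return the state at its end."""
    i = 0
    while i < len(line):
        two = line[i:i + 2]
        if state == "code":
            if two == "--":
                break                      # rest of the line is a comment
            if two == "/*":
                state = "block_comment"
                i += 2
                continue
            if line[i] == "'":
                state = "string"
        elif state == "block_comment":
            if two == "*/":
                state = "code"
                i += 2
                continue
        elif state == "string":
            if line[i] == "'":
                state = "code"             # a doubled '' re-enters the string on the next char
        i += 1
    return state


def split_batches(script):
    """Split a script on GO lines that are not inside a comment or a string."""
    batches, current, state = [], [], "code"
    for line in script.splitlines(True):
        # A GO line only counts when the previous line ended in plain code.
        if state == "code" and is_go_line(line):
            batches.append("".join(current))
            current = []
            continue
        current.append(line)
        state = scan_line(line, state)
    batches.append("".join(current))
    return [batch for batch in batches if batch.strip()]
```

Tracking the state only at line boundaries is enough here because a separator has to occupy a whole line; the per-character scan is what keeps a GO inside a multi-line string or comment from being treated as one.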

If GO is always on a line by itself you can use split like this:
#!/usr/bin/python
import re
sql = """-- this is a great database script! go team go!
INSERT INTO myTable(stringColumn) VALUES ('go away!')
/*
here are some comments that go with this script.
*/
GO 5 --this is a test
INSERT INTO myTable(stringColumn) VALUES ('this is the next script')"""
statements = re.split(r"(?m)^\s*GO\s*(?:[0-9]+)?\s*(?:--.*)?$", sql)
for statement in statements:
    print("the statement is\n%s\n" % statement)
(?m) turns on multiline matching, that is, ^ and $ will match at the start and end of each line (instead of only at the start and end of the string).
^ matches at the start of a line
\s* matches zero or more whitespace characters (space, tab, etc.)
GO matches a literal GO
\s* matches as before
(?:[0-9]+)? matches an optional integer number (with possible leading zeros)
\s* matches as before
(?:--.*)? matches an optional end-of-line comment
$ matches at the end of a line
The split will consume the GO line, so you won't have to worry about it. This will leave you with a list of statements.
This modified split has a problem: it will not give you back the number after the GO. If that is important, I would say it is time to move to a parser of some form.
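If the count does matter and a full parser feels like too much, one possible middle ground is to put the digits in a capturing group, because re.split then returns the captured text between the pieces. The sketch below is an illustration only: GO_LINE and split_with_counts are names invented here, a missing count is assumed to mean 1, and, like the split above, it does not notice a GO line that sits inside a multi-line comment or string.

```python
import re

# One capturing group (the count), so re.split returns
# [batch0, count0, batch1, count1, ..., last_batch].
GO_LINE = re.compile(
    r"^[ \t]*GO[ \t]*(\d*)[ \t]*(?:--.*)?$\n?",
    re.IGNORECASE | re.MULTILINE,
)


def split_with_counts(sql):
    """Return (batch, count) pairs; the count comes from the GO line that ends the batch."""
    parts = GO_LINE.split(sql)
    result = []
    for i in range(0, len(parts), 2):
        batch = parts[i]
        # The last batch has no GO line after it, so it has no count entry.
        count = parts[i + 1] if i + 1 < len(parts) else ""
        if batch.strip():
            result.append((batch, int(count) if count else 1))
    return result


print(split_with_counts("SELECT 1\nGO 5 --this is a test\nSELECT 2"))
# [('SELECT 1\n', 5), ('SELECT 2', 1)]
```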

This won't detect if GO ever is used as a variable name inside some statement, but should take care of those inside comments or strings.
EDIT: This now works if GO is part of the statement, as long as it is not on its own line.
import re
line_comment = r'(?:--|#).*$'
block_comment = r'/\*[\S\s]*?\*/'
single_quote_string = r"'(?:\\.|[^'\\])*'"
double_quote_string = r'"(?:\\.|[^"\\])*"'
go_word = r'^[^\S\n]*(?P<GO>GO)[^\S\n]*\d*[^\S\n]*(?:(?:--|#).*)?$'
full_pattern = re.compile(r'|'.join((
    line_comment,
    block_comment,
    single_quote_string,
    double_quote_string,
    go_word,
)), re.IGNORECASE | re.MULTILINE)
def split_sql_statements(statement_string):
    last_end = 0
    for match in full_pattern.finditer(statement_string):
        # Comments and strings are matched too, but only a GO line ends a batch
        if match.group('GO'):
            yield statement_string[last_end:match.start()]
            last_end = match.end()
    yield statement_string[last_end:]
Example usage:
statement_string = r"""
-- this is a great database script! go team go!
INSERT INTO go(go) VALUES ('go away!')
go 7 -- foo
INSERT INTO go(go) VALUES (
'I have to GO " with a /* comment to GO inside a /* GO string /*'
)
/*
here are some comments that go with this script.
*/
GO
INSERT INTO go(go) VALUES ('this is the next script')
"""
for statement in split_sql_statements(statement_string):
    print('=======')
    print(statement)
Output:
=======
-- this is a great database script! go team go!
INSERT INTO go(go) VALUES ('go away!')
=======
INSERT INTO go(go) VALUES (
'I have to GO " with a /* comment to GO inside a /* GO string /*'
)
/*
here are some comments that go with this script.
*/
=======
INSERT INTO go(go) VALUES ('this is the next script')

Related

Toggle multiline or single line in parenthesis Vim

When coding Python I usually find myself wanting to break lines containing long lists or functions containing several arguments into multiple lines.
Between this:
# Example 1
foo(this_is_a_long_variable_1, this_is_a_long_variable_2, this_is_a_long_variable_3, this_is_a_long_variable_4)
# Example 2
def bar():
    return [this_is_a_long_variable_1, this_is_a_long_variable_2, this_is_a_long_variable_3, this_is_a_long_variable_4]
and this:
# Example 1
foo(
    this_is_a_long_variable_1,
    this_is_a_long_variable_2,
    this_is_a_long_variable_3,
    this_is_a_long_variable_4,
)
# Example 2
def bar():
    return [
        this_is_a_long_variable_1,
        this_is_a_long_variable_2,
        this_is_a_long_variable_3,
        this_is_a_long_variable_4,
    ]
What is the best way to do this?
From what I can gather I want to connect a special action to object-select and the actions themselves should be relatively okay to do by regex replace with some special handling of adding an extra comma before the end of the block.
But I have never really done anything this advanced before with Vim and don't really know where to start.

First, the low-level pieces…
Change the content of the parentheses to this:
foo(
)
with:
ci(<CR><Esc>
leaving the cursor on ).
Put the result of the following expression on the line above the cursor:
:put!=getreg('\"')->split(', *')->map('v:val . \",\"')<CR>
The expression in details:
getreg('\"') gets the content of the default register, here it is what used to be between the parentheses,
split(', *') splits it into individual arguments,
map('v:val . \",\"') appends a , to each item.
NOTE: the command above makes use of the new-ish "method" notation. In older Vims, it should look like this:
:put!=map(split(getreg('\"'), ', *'), 'v:val . \",\"')
Format what we just put:
='[
Second, putting it together…
Now that we have a working solution, we may want to make it a bit easier on the fingers with a simple visual mode mapping, which makes sense because it will keep it simple and agnostic:
xnoremap <key> c<CR><Esc><Cmd>put!=getreg('\"')->split(', *')->map('v:val . \",\"')<CR>='[
Which we can use like this:
vi(<key>
vi[<key>
vip<key>
etc.
NOTE: the mapping above makes use of the new-ish <Cmd> "special text". In older Vims, it should look like this:
xnoremap <key> c<CR><Esc>:<C-u>put!=map(split(getreg('\"'), ', *'), 'v:val . \",\"')<CR>='[

To complement romainl's answer, I'll give the reverse command so you can toggle between single line and multi-line.
Single line -> Multi line:
xnoremap <Key1> c<CR><Esc>:put!=map(split(getreg('\"'), ', *'), 'v:val . \",\"')<CR>='[
Multi-line -> Single line
xnoremap <Key2> :s/\%V\n\s*\(.*,\)$/\1 / <Bar> s/\%V\(.*\),\s\n/\1/ <CR>:noh<CR>
Usage:
va(<Key2>
Explanation:
:s/match/replace/flags is search and replace which follows regular regexp matching
\%V is to only search in selection
s/\%V\n\s*\(.*,\)$/\1/ for lines with a comma at the end move these to the preceding line and remove extra whitespace
<Bar> this is a pipe
s/\%V\(.*\),\s\n/\1/ remove the extra last comma, space and newline
However, note that neither can handle complex cases, such as lists containing other lists.

"# this is a string", How python identifies it as a string but not a comment?

I really want to know how Python identifies a # inside quotes as part of a string and a bare # as the start of a comment.
I mean, how does the code that tells these apart actually work? Does Python read a line and somehow exclude the string to find the comment?
"# this is a string" # this is a comment
How is the comment identified? Does Python exclude the string first, and if so, how?
How can we write code that does the same, for example when designing a compiler for our own language in Python?
I am a newbie, please help.

You need to know that whether something is a string or a comment can be determined from just one single character. That is the job of the scanner (or lexical analyzer if you want to sound fancy).
If it starts with a ", it's a string. If it starts with #, it's a comment.
In the code that makes up Python itself, there's probably a loop that goes something like this:
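source = '"# this is a string" # this is a comment\n'
pos = 0
# While there is still source code to read
while pos < len(source):
    # Get the current character
    current = source[pos]
    # If the current character is a pound sign
    if current == "#":
        # While we are not at the end of the line
        while pos < len(source) and source[pos] != "\n":
            # Move on to the next character
            pos += 1
    elif current == '"':
        # Skip to the closing quote (simplified: no escapes or other quote styles)
        pos += 1
        while pos < len(source) and source[pos] != '"':
            pos += 1
        pos += 1
    else:
        # Any other character: just move on
        pos += 1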
In the real Python lexer, there are probably dozens more of those if statements, but I hope you have a better idea of how it works now. :)

Because of the quotes
# This is a comment
x = "# this is a string"
x = '# this is also a string'
x = """# this string
spans
multiple
lines"""

"# this is a string" # this is a comment
In simple terms, the interpreter sees the first ", then it takes everything that follows as part of the string until it finds the matching " which terminates the string. Then it sees the subsequent # and interprets everything to follow as a comment. The first # is ignored because it is between the two quotes, and hence is taken as part of the string.
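One way to watch this happen is to run the line through the tokenize module from Python's standard library, which exposes the tokens produced for a piece of source code. The helper name string_and_comment_tokens below is made up for this illustration; it is a small sketch, not a description of how the interpreter is implemented internally.

```python
import io
import tokenize

source = '"# this is a string" # this is a comment\n'


def string_and_comment_tokens(code):
    """Return (token kind, text) pairs for the STRING and COMMENT tokens in code."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (tokenize.STRING, tokenize.COMMENT):
            found.append((tokenize.tok_name[tok.type], tok.string))
    return found


print(string_and_comment_tokens(source))
# [('STRING', '"# this is a string"'), ('COMMENT', '# this is a comment')]
```

The first # ends up inside the STRING token, so only the second one starts a COMMENT token.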

Incorrect syntax near GO in SQL

I am concatenating many SQL statements and am running into the following error.
"Incorrect syntax near GO" and "Incorrect syntax near "-
It seems that when I delete the trailing space, the GO, and the space after the GO, and then press CTRL+Z to put the GO back, the error goes away. It's pretty weird.
Why?
How could I code it in Python? Thanks.
')
END TRY
BEGIN CATCH
print ERROR_MESSAGE()
END CATCH
GO

As already mentioned in the comments, GO is not part of the SQL syntax; rather, it is a batch delimiter in Management Studio.
You can work around it in two ways: use subprocess to call sqlcmd, or cut the scripts up within Python. The subprocess + sqlcmd approach will only really work for you if you don't care about query results, as you would need to parse console output to get those.
I needed to build a database from SSMS-generated scripts in the past and created the function below as a result (updating, as I now have a better version that leaves comments in):
import re

def partition_script(sql_script: str) -> list:
    """ Function will take the string provided as parameter and cut it on every line that contains only a "GO" string.
    Contents of the script are also checked for commented GO's; these are rewritten as "-- GO" inside the comment if found.
    If a GO was left in a multi-line comment,
    the cutting step would generate invalid code missing a multi-line comment marker in each part.
    :param sql_script: str
    :return: list
    """
    # Regex for finding GO's that are the only entry in a line
    find_go = re.compile(r'^\s*GO\s*$', re.IGNORECASE | re.MULTILINE)
    # Regex to find multi-line comments
    find_comments = re.compile(r'/\*.*?\*/', flags=re.DOTALL)
    # Get a list of multi-line comments that also contain lines with only GO
    go_check = [comment for comment in find_comments.findall(sql_script) if find_go.search(comment)]
    for comment in go_check:
        # Change the 'GO' entry to '-- GO', making it invisible for the cutting step
        sql_script = sql_script.replace(comment, re.sub(find_go, '-- GO', comment))
    # Removing single line comments, uncomment if needed
    # sql_script = re.sub(r'--.*$', '', sql_script, flags=re.MULTILINE)
    # Returning everything besides empty strings
    return [part for part in find_go.split(sql_script) if part != '']
Using this function, you can run scripts containing GO like this:
import pymssql
conn = pymssql.connect(server, user, password, "tempdb")
cursor = conn.cursor()
for part in partition_script(your_script):
    cursor.execute(part)
conn.close()
I hope this helps.

Ending a comment in Python

I have already read this: Why doesn't Python have multiline comments?
So in IDLE, I wrote a comment:
Hello#World
Anything after the d of World is also part of the comment. In C++, I am aware of a way to close the comment, like:
/*Mycomment*/
Is there a way to end a comment in Python?
NOTE: I would prefer not to use triple quotes.

You've already read there are no multiline comments, only single line. Comments cause Python to ignore everything until the end of the line. You "close" them with a newline!
I don't particularly like it, but some people use multiline strings as comments. Since you're just throwing away the value, you can approximate a comment this way. The only time it's really doing anything is when it's the first line in a function or class block, in which case it is treated as a docstring.
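The docstring special case is easy to check. In the sketch below (the function names are made up for the illustration), the same kind of string literal becomes the docstring only when it is the first statement in the function body; anywhere else it is evaluated and thrown away.

```python
def documented():
    """This string is the first statement, so it becomes the docstring."""
    return 1


def not_documented():
    x = 1
    """This string is evaluated and discarded; it is not a docstring."""
    return x


print(documented.__doc__)       # This string is the first statement, so it becomes the docstring.
print(not_documented.__doc__)   # None
```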
Also, this may be more of a shell scripting convention, but what's so bad about using multiple single line comments?
#####################################################################
# It is perfectly fine and natural to write "multi-line" comments   #
# using multiple single line comments. Some people even draw boxes  #
# with them!                                                        #
#####################################################################

You can't close a comment in Python other than by ending the line.
There are a number of things you can do to provide a comment in the middle of an expression or statement, if that's really what you want to do.
First, with functions you annotate arguments -- an annotation can be anything:
def func(arg0: "arg0 should be a str or int", arg1: (tuple, list)):
    ...
If you start an expression with ( the expression continues beyond newlines until a matching ) is encountered. Thus
assert (
    str
    # some comment
    .
    # another comment
    join
) == str.join

You can emulate comments by using strings. They are not exactly comments, since they execute, but they don't return anything.
print("Hello", end = " ");"Comment";print("World!")

if you start with triple quotes, end with triple quotes

Regular expression optimization - C enum typedef

For a certain project of mine I need to parse enum typedefs from a .h file.
For example, let's take the following simple case:
typedef enum
{
    data1, /*aaagege*/
    data2,
    data3
}ESample;
This is a very simple declaration (without assignments or anything more complex), and yet the regular expression that I wrote seems to perform very poorly.
Here is my expression:
typedef\s+enum\s*\{(?:\s+(\w+)[^\n]*)+\s*\}(\w+)\s*;
I've tested the expression on one of my files (about 2000 lines of code) and it took ages..
The first thing that I tried to do is to make everything possible not greedy like so:
typedef\s+?enum\s*?\{(?:\s+?(\w+?)[^\n]*?)+?\s*?\}(\w+?)\s*?;
But that only made things worse.
Any suggestions as to how I can make this perform better? If you could add an explanation of your suggested solution and why it is better than mine, it would help me a lot.
Thanks in advance,
Kfir

The reason it's slow is because of your nested repeats (marked with ^):
(?:\s+(\w+)[^\n]*)+
                ^ ^
This causes nested backtracking, which leads to exponential running times.
But you have a larger problem which is that putting a group inside a repeat means that only the last match of the group is kept:
>>> print m.groups()
('data3', 'ESample')
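Since a repeated group only keeps its last capture, one possible workaround is to do the job in two passes: first match each enum as a whole with a pattern that has no nested repeats, then pull the names out of the captured body. The sketch below is an assumption-laden illustration, not the answerer's code: parse_enums, ENUM and COMMENT are names invented here, only /* ... */ comments are stripped, and it is not a real C parser.

```python
import re

HEADER = """
typedef enum
{
    data1, /*aaagege*/
    data2,
    data3
}ESample;
"""

# Pass 1: one match per enum. [^}]* is a single repeat, so there is
# no nested backtracking between an inner and an outer quantifier.
ENUM = re.compile(r"typedef\s+enum\s*\{([^}]*)\}\s*(\w+)\s*;")
# Block comments are removed from the body before it is split.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def parse_enums(text):
    """Return {typedef name: [enumerator names]} for simple enum typedefs."""
    enums = {}
    for match in ENUM.finditer(text):
        body = COMMENT.sub("", match.group(1))
        # Pass 2: split on commas and drop any "= value" part.
        names = [entry.split("=")[0].strip() for entry in body.split(",")]
        enums[match.group(2)] = [name for name in names if name]
    return enums


print(parse_enums(HEADER))
# {'ESample': ['data1', 'data2', 'data3']}
```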

You can't parse C with a regex:
// w00t /* "testing */ "strings n comments \"here"//
printf("/* haha gotcha\" epic stuff") /* "more text // */;
/* typedef test {
val,
"string",
*/ typedef test ??<
val,
"commentstring/*\"//",
??>
But if you just want a quick hack to parse all the typedefs:
typedef\s+enum\s*{[^}]*}[^;]+;

The first thing that I tried to do is to make everything possible not greedy... But that only made things worse.
Of course it did! How couldn't it? Look at this regex:
\w+\s
It will (greedily) eat up all the word characters, and when those are out, it will look for a space character. Now consider:
\w+?\s
This eats up one word character, then checks for a space. Failing that, it eats another word character and checks for a space. It checks every word character to see if it's a space.
Generally, non-greedy is slower than greedy because it has to check the same characters twice. Sometimes, non-greedy produces different results, but when it doesn't, always use greedy. In fact, Perl has possessive quantifiers:
\w++\s
Which means "be greedy, and if that fails to match don't bother giving any characters back because you're too greedy." The example above works fine, and may be optimizable, but you can really understand it with this:
\w++h
That example will always fail, because any "h" character at the end of a word will get permanently eaten up by \w++, whereas if it was just \w+ it'd get eaten up, but then given back once the match failed once to see if it would succeed.
Unfortunately, older versions of Python don't have the possessive form (the standard re module only gained it in Python 3.11; in the comments, @tchrist suggests an alternative Python regex library), so on those versions the first example is about as fast as I suspect you'll get. You might also find a speedup by searching for occurrences of the string "enum" and working from there instead of using a single giant regex to search through an entire file.
