I need help figuring out some Python Regex

I have tried to wrap my head around the regex below, but I still have a big hole in my reasoning. What is ?::, and could someone explain it properly for me?
rule_syntax = re.compile('(\\\\*)'\
'(?:(?::([a-zA-Z_][a-zA-Z_0-9]*)?()(?:#(.*?)#)?)'\
'|(?:<([a-zA-Z_][a-zA-Z_0-9]*)?(?::([a-zA-Z_]*)'\
'(?::((?:\\\\.|[^\\\\>]+)+)?)?)?>))')

There are two tools that you may wish to look into to help with your understanding:
Regexper creates a visual representation of a regex.
Regexpal is a tool that lets you enter a regex and various strings and see what matches.

(?:expr) is just like normal parentheses (expr), except that for purposes of retrieving groups later (backreferences, re.sub, or MatchObject.group), parenthesized groups beginning with ?: are excluded. This can be useful if you need to wrap a complex expression in parentheses to apply another operator like * to it, but don't want it mixed in with the groups that you'll actually need to retrieve later.
?:: is simply ?: followed by a literal :.
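To see the difference in practice, here is a small sketch. The patterns and input strings are made up for illustration and are not taken from rule_syntax; they only show how a capturing group, a non-capturing group, and a non-capturing group followed by a literal colon behave.

```python
import re

# Capturing group: the parenthesised part is recorded as group 1.
capturing = re.match(r"(ab)+(\d)", "abab7")

# Non-capturing group: matches the same text, but no group is recorded for it.
non_capturing = re.match(r"(?:ab)+(\d)", "abab7")

print(capturing.groups())      # ('ab', '7')
print(non_capturing.groups())  # ('7',)

# "(?::" is "(?:" followed by a literal ":" that must appear in the input.
literal_colon = re.match(r"(?::(\w+))", ":name")
print(literal_colon.group(1))  # name
```

The only thing that changes between the first two patterns is the group numbering, which is why (?:...) is handy in a long expression like the one in the question.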

Related

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes a character set (it matches exactly one of the listed characters); () denotes a logical grouping.
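The question is about a JavaScript regex, but character sets and groups behave the same way in Python's re module for this case, so here is a minimal Python sketch of the difference. The query strings are invented for illustration.

```python
import re

# Character set: matches ONE character out of w, d, |, o, r, q, then "=".
char_set = re.compile(r"[wd|word|qw]=")

# Group with alternation: matches the whole word wd, word, or qw, then "=".
group = re.compile(r"(?:wd|word|qw)=")

query = "s?word=regex"
print(char_set.search(query).group())  # d=
print(group.search(query).group())     # word=

# The character set also accepts input that was never intended, such as "|=".
print(char_set.search("a|=1") is not None)  # True
print(group.search("a|=1") is not None)     # False
```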
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
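As a rough illustration of the unescaped dot problem, the sketch below compares the pattern as written with a version where the dot in baidu.com is escaped. Both URLs are made up for the example.

```python
import re

# Unescaped dot: "baidu.com" also matches "baidugcom", "baidu-com", and so on.
loose = re.compile(r".*baidu.com.*[/?].*(wd|word|qw)=")

# Escaped dot: only a literal "." is accepted between "baidu" and "com".
strict = re.compile(r".*baidu\.com.*[/?].*(wd|word|qw)=")

spoof = "http://baidugcom.example/?wd=test"
real = "http://www.baidu.com/s?wd=test"

print(loose.match(spoof) is not None)   # True
print(strict.match(spoof) is not None)  # False
print(strict.match(real).group(1))      # wd
```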
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?

How do I extract definitions from a html file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I do have so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fhand = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3 (or reaches the end of the text), and only then go on and "notice" the rest of the regex (the </dd></dl>), backing up until that part can match. That is why the match ends at the last </dd></dl> it can reach rather than the first.
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
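For comparison, here is a minimal sketch of the non-regex route using the standard library's html.parser. The DdCollector class and the sample markup are hypothetical; the real page is more complicated, so treat this as a starting point under those assumptions, not a finished extractor.

```python
from html.parser import HTMLParser


class DdCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside each <dd> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <dd> elements we are currently inside
        self.items = []  # one string per top-level <dd>

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self.depth += 1
            if self.depth == 1:
                self.items.append("")

    def handle_endtag(self, tag):
        if tag == "dd" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside any <dd> (for example inside <dt>) is ignored.
        if self.depth:
            self.items[-1] += data


# Made-up sample in the same general shape as a definition list.
sample_html = (
    "<dl><dt>abs(x)</dt><dd><p>Return the absolute value of a number.</p></dd></dl>"
    "<dl><dt>all(iterable)</dt><dd><p>Return True if all elements are true.</p></dd></dl>"
)

collector = DdCollector()
collector.feed(sample_html)
print(collector.items)
```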
By default all quantifiers are greedy which means they want to match as many characters as possible. You can use ? after quantifier to make it lazy which matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'
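The greedy and reluctant versions can be compared on a tiny sample. The HTML string below is invented (and deliberately contains no digits) so the behaviour is easy to follow; it is not the real functions.html.

```python
import re

# Made-up sample with two definitions, in the same shape as the question's markup.
sample = (
    "<dl><dd><p>abs(x): absolute value</dd></dl>"
    "<dl><dd><p>all(it): true if all</dd></dl>"
)

greedy = re.findall(r"<dd><p>([\D3]+)</dd></dl>", sample)
lazy = re.findall(r"<dd><p>([\D3]+?)</dd></dl>", sample)

print(len(greedy))  # 1 (one match running through to the last </dd></dl>)
print(lazy)         # ['abs(x): absolute value', 'all(it): true if all']
```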

Extracting parenthesis with a specific format with Python

I am fairly new to Python, so I apologize if this is quite a novice question, but I am trying to extract text in parentheses that has a specific format from a raw text file.
I have tried this with regular expressions, but please let me know if there is a better method.
To show what I want to do by example:
s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
From this string I want a result something like:
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
The regular expression I have tried so far is
"(\(.+[,] [0-9]{4}\))"
in conjunction with re.findall(), however this only gives me the result:
['(Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)']
So, as you may have guessed, I am trying to extract the bibliographic references from a .txt file. But I don't want to extract anything that happens to be in parentheses that is not a bibliographic reference.
Again, I apologize if this is novice, and again if there is a question like this out there already. I have searched, but no luck as yet.
Use [^()] instead of . (the dot). This makes sure there are no nested parentheses inside the match.
>>> re.findall("(\([^()]+[,] [0-9]{4}\))", s)
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
Assuming that you will have no nested brackets, you could use something like so: (\([^()]+?, [0-9]{4}\)). This will match any run of non-bracket characters inside a set of parentheses, followed by a comma, a space, four digits, and a closing parenthesis.
I would suggest something like \(\w+,\s+[0-9]{4}\). A couple changes from your original:
Match word characters (letters/numbers/underscores) instead of any character in the source name.
Match one or more space characters after the comma, instead of limiting yourself to a single literal space.
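The suggestions in this thread can be tried side by side. The sketch below uses the string from the question plus one extra, made-up citation ("Smith and Jones") to show where the \w+ version is stricter than the [^()] version.

```python
import re

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"

# Original attempt: the greedy .+ runs across the parentheses in between.
greedy = re.findall(r"(\(.+[,] [0-9]{4}\))", s)

# [^()] refuses to cross a parenthesis, so each citation is matched separately.
no_nesting = re.findall(r"\([^()]+, [0-9]{4}\)", s)

# \w+ only allows letters, digits and underscores in the source name.
word_only = re.findall(r"\(\w+,\s+[0-9]{4}\)", s)

print(greedy)      # ['(Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)']
print(no_nesting)  # ['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
print(word_only)   # ['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']

# Hypothetical multi-word source: accepted by [^()], rejected by \w+.
multi = "See (Smith and Jones, 2001) for details"
print(re.findall(r"\([^()]+, [0-9]{4}\)", multi))   # ['(Smith and Jones, 2001)']
print(re.findall(r"\(\w+,\s+[0-9]{4}\)", multi))    # []
```

Which version is preferable depends on what the source names in the file look like.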

Use regular expression to handle nested parenthesis in math equation?

If I have:
statement = "(2*(3+1))*2"
I want to be able to handle multiple parentheses within parentheses for a math reader I'm writing. Perhaps I'm going about this the wrong way, but my goal was to recursively go deeper into the parentheses until there were none, and then I would perform the math operations. Thus, I would first want to focus on
"(2*(3+1))"
then focus on
"(3+1)"
I hoped to do this by assigning the focus value to the start index of the regex and the end index of the regex. I have yet to figure out how to find the end index, but I'm more interested in first matching the regex. The pattern
r"\(.+\)"
failed to match. I wanted it to read as "any one or more characters contained within a set of parentheses". Could someone explain why the above expression will not match the above statement in Python?
I love regular expressions. I use them all the time.
Don't use regular expressions for this.
You want an actual parser that will actually parse your math expressions. You might want to read this:
http://effbot.org/zone/simple-top-down-parsing.htm
Once you have actually parsed the expression, it's trivial to walk the parse tree and compute the result.
EDIT: @Lattyware suggested pyparsing, which should also be a good way to go, and might be easier than the EFFBot solution posted above.
https://github.com/pyparsing/pyparsing
Here's a direct link to the pyparsing sample code for a four-function algebraic expression evaluator:
http://pyparsing.wikispaces.com/file/view/fourFn.py
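To give a feel for what "an actual parser" means here, below is a minimal recursive-descent sketch for the four basic operations and parentheses. It is not the effbot or pyparsing code; the evaluate function and its helpers are hypothetical names, and the tokenizer silently skips characters it does not recognise, which a real implementation should not do.

```python
import re


def evaluate(text):
    """Evaluate +, -, *, / and parentheses with the usual precedence."""
    # Numbers and single-character operators; anything else is skipped (sketch only).
    tokens = re.findall(r"\d+(?:\.\d+)?|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        # expr := term (("+" | "-") term)*
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():
        # term := factor (("*" | "/") factor)*
        value = factor()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():
        # factor := number | "(" expr ")"
        token = take()
        if token == "(":
            value = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        return float(token)

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected token")
    return result


print(evaluate("(2*(3+1))*2"))  # 16.0
```

The nesting is handled by factor calling expr again, so the recursion depth follows the parentheses without any regex having to match them in pairs.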
for what it's worth, here's a little more context:
regular expressions are called "regular" because they're associated with regular grammars, and regular grammars cannot describe (an unlimited number of) nested parentheses (they can describe a bunch of random parentheses, but cannot make them match in neat pairs).
one way to understand this is to understand that regular expressions can (modulo some details which i will explain at the end) be converted to deterministic finite automatons. which sounds intimidating but really just means that they can be converted into lists of "rules", where the rules depend on what you matched, and describe what you can match.
for example, the regular expression ab*c can be converted to:
1. at the start, you can only match a. then go to 2.
2. now, you can match b and go back to 2, or match c and go to 3
3. you're done! the match was a success!
and that is a "deterministic finite automaton".
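the three rules above can be written down directly as a lookup table. this is only a sketch: the names TRANSITIONS, ACCEPTING and dfa_accepts are made up for the example, and the state numbers follow the rule numbers in the list.

```python
# (current state, character read) -> next state, for the expression ab*c
TRANSITIONS = {
    (1, "a"): 2,  # at the start, only "a" is allowed
    (2, "b"): 2,  # any number of "b"s keeps us in state 2
    (2, "c"): 3,  # "c" finishes the match
}
ACCEPTING = {3}


def dfa_accepts(text):
    state = 1
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no rule applies, so the input is rejected
            return False
    return state in ACCEPTING


print(dfa_accepts("abbbc"))  # True
print(dfa_accepts("ac"))     # True
print(dfa_accepts("ab"))     # False
```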
anyway, the interesting part of this is that if you sit down and try to make something like that for matching pairs of parentheses you can't! try it. you can match a finite number by making more and more rules, but you can't write a general set of rules that match an unlimited number of parentheses (i should add that the rules have to be of the form "if you match X go to Y").
now obviously you could modify that in various ways. you could allow more complex rules (like extending them to let you keep a count of the parentheses), and you could then get something that worked as you expect. but it wouldn't be a regular grammar.
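as a sketch of the "keep a count" extension mentioned above: one integer counter is enough to check that parentheses pair up, and that counter is exactly the piece of unbounded memory a finite set of "if you match X go to Y" rules does not have. the function name balanced is made up for the example.

```python
def balanced(text):
    depth = 0  # number of "(" currently open
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "("
                return False
    return depth == 0      # everything opened was closed


print(balanced("(2*(3+1))*2"))  # True
print(balanced("(2*(3+1)*2"))   # False
print(balanced(")("))           # False
```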
given that regular expressions are limited in this way, why are they used rather than something more complex? it turns out that they're something of a sweet spot - they can do a lot, while remaining fairly simple and efficient. more complex grammars (kinds of rules) can be more powerful, but are also harder to implement, and have more problems with efficiency.
final disclaimer and promised extra details: in practice many regular expressions these days actually are more powerful than this (and should not really be called "regular expressions"). but the above is still the basic explanation of why you should not use a regexp for this.
ps jesse's suggested solution gets round this by using a regexp multiple times; the argument here is for a single use of the regexp.
I probably agree with steveha, and don't recommend regex for this, but to answer your question specifically, you need unescaped parens to pull out results groups (your pattern only has escaped parens):
>>> re.match(r"\((.+)\)", "(2*(3+1))*2").group(1)
'2*(3+1)'
If you go that route, you could iterate over the match results until you run out of matches, and then reverse the results list to work inside out.
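A rough sketch of that loop is below. The function name nested_groups is made up, and the approach only works when the parentheses form a single nested chain; for something like "(1+2)*(3+4)" the greedy .+ would capture "1+2)*(3+4", which is one more reason a parser is the safer choice.

```python
import re


def nested_groups(statement):
    """Collect the contents of each nesting level, innermost first."""
    found = []
    match = re.search(r"\((.+)\)", statement)
    while match:
        inner = match.group(1)
        found.append(inner)
        # Search again inside what was just captured.
        match = re.search(r"\((.+)\)", inner)
    return found[::-1]


print(nested_groups("(2*(3+1))*2"))  # ['3+1', '2*(3+1)']
```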

Small problem with reg exps in python

So I have one variable that has all the code from some file. I need to remove all comments from this file. One of my regexp lines is this
x=re.sub('\/\*.*\*\/','',x,re.M,re.S);
What I want this to do is remove all multi-line comments. For some odd reason though, it's skipping two instances of */, and removing everything up to the third instance of */.
I'm pretty sure the reason is that this third instance of */ has code after it, while the first two are by themselves on the line. I'm not sure why this matters, but I'm pretty sure that's why.
Any ideas?
.* will always match as many characters as possible. Try (.*?) - most implementations should try to match as few characters as possible then (should work without the brackets but not sure right now). So your whole pattern should look like this: \/\*.*?\*\/ or \/\*(.*?)\*\/
The expression .* is greedy, meaning that it will attempt to match as many characters as possible. Instead, use (.*?) which will stop matching characters as soon as possible.
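Here is a small sketch of the greedy and non-greedy versions on a made-up snippet of source text. One side note on the call in the question: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a fourth positional argument lands in count, not flags. Passing flags by keyword, as below, avoids that.

```python
import re

# Invented sample with three comments, the second one spanning two lines.
source = (
    "int a; /* first */\n"
    "int b; /* second\n"
    "   spans lines */\n"
    "int c; /* third */ int d;\n"
)

# Greedy: .* runs from the first "/*" to the LAST "*/" (re.S lets . match newlines).
greedy = re.sub(r"/\*.*\*/", "", source, flags=re.S)

# Non-greedy: .*? stops at the first "*/" after each "/*".
lazy = re.sub(r"/\*.*?\*/", "", source, flags=re.S)

print(repr(greedy))  # 'int a;  int d;\n'
print(repr(lazy))    # 'int a; \nint b; \nint c;  int d;\n'
```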
The regular expression is "greedy" and, when presented with several stopping points, will take the farthest one. Regex has some patterns to help control this, in particular the
(?<!...)
which matches the following expression only if it is not preceded by a match of the pattern in parens.
(?*...) was not in Python 2.4 but is a good choice if you are using a later version.
