RegEx anomalous behavior in a program - python

I have written the following regex to match a set of e-mails from HTML files. The e-mails can take various formats such as
alice # so.edu
alice at sm.so.edu
alice # sm.com
<a href="mailto:alice at bob dot com">
I generally use RegexPal to test my regular expressions before implementing them in a programing language. I observe a strange behavior on the last e-mail example posted. RegexPal shows me a match for my regex but while using the same regex in a Python program it doesn't give me a hit. What could be the reason?
mail_regex = (?:[a-zA-Z]+[\w+\.]+[a-zA-Z]+)\s*(?:#|\bat\b)\s*(?:(?:(?:(?:[a-zA-Z]+)\s*
(?:\.|dot|dom)\s*(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*)(?:edu|com))|(?:(?:[a-zA-Z]+\s*(?:\.|dot|dom)\s*(?:edu|com))))
The RegEx is a little bit complex to accommodate variety of other examples (email patterns found in the dataset). You can also run and inspect the Python program on CodePad - http://codepad.org/W2p6waBb
Edit
Just to give a perspective the same regex works on - http://pythonregex.com/

It looks like the specific issue here is that you need to use a raw string:
mail_re = r"(?:[a-zA-Z]+[\w+\.]+[a-zA-Z]+)\s*(?:#|\bat\b)\s*(?:(?:(?:(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*)(?:edu|com))|(?:(?:[a-zA-Z]+\s*(?:\.|dot|dom)\s*(?:edu|com))))"
Otherwise, for instance \b will be backspace instead of word boundary.
Also, you're using a JavaScript tester. Python has different syntax and behavior. To avoid surprises, it would better to test with the Python-specific syntax.

Related

Python auto-detect and split regex groups

I have a regex pattern which is too long to type it here, but you can read it from here:
https://linksnappy.com/api/REGEX
I want to re.compile it straight away, but I am getting AssertionError and inability to compile more than 100 named groups.
I tried writing a pattern to split the above one, but it's way too difficult to make it work and not raise any exceptions from sre_*.py.
Is there a function which can automatically split capture groups/alternatives, similar to sre_parse, but make a list with the regex alternatives from the above pattern?
I copied the strings and compiled it in python3, but did not get AssertionError. The only skill I employed is to use the literal string '''regex'''.
I also pasted it in regex101. It is also valid and gives very detailed explanations of all alternatives and capturing groups.
For python2, I did saw in the source code that the number of capturing groups is limited to 100. In this case, python3 is the best option. If python2 is required, you may have to separate/shorten the regular expressions or choose not to use it.
In the particular example you provided, you can change your regex into 5 independent ones because it consists of 127 alternatives. It has the pattern of (a|b|c|d|e|...), but each alternative contains capturing groups as well. Check the regex explanation link regex101. Just make sure each regex has less than 100 capturing groups.
I hope it helps you solve the problem.

Non-halting regular expressions (python re module)

What can cause non-halting behavior in regular expression match() operation (with Python's re module)?
I'm current wracking my brains trying to work out a problem that has stumped me for hours. Why does the below line hang?
re.compile(r'.*?(now|with that|at this time|ready|stop|wrap( it|) up|at this point|happy|pleased|concludes|concluded|we will|like to)(,)*(( |\n)[a-z]+(\'[a-z]+)*,*){0,20}( |\n)(take|for|to|open|entertain|answer|address)(( |\n|)[a-z]+(\'[a-z]+)*,*){0,20}( |\n|)(questions|Q *& *A).*?', re.DOTALL| re.IGNORECASE).match("I would now like to turn the presentation over to your host for today's call, Mr. Mitch Feiger, please proceed.")
In short, I'm using match(), the regular expression is r'.*?(now|with that|at this time|ready|stop|wrap( it|) up|at this point|happy|pleased|concludes|concluded|we will|like to)(,)*(( |\n)[a-z]+(\'[a-z]+)*,*){0,20}( |\n)(take|for|to|open|entertain|answer|address)(( |\n|)[a-z]+(\'[a-z]+)*,*){0,20}( |\n|)(questions|Q *& *A).*?'
And the text is: "I would now like to turn the presentation over to your host for today's call, Mr. Mitch Feiger, please proceed."
I understand my regular expression is a bit of a mess, it's been built over time to somewhat cheatily match paragraphs in which the speakers announces the start of a question session. My main confusion right now is trying to find what in there could be causing what I assume is a non-halting search.
It gets stuck on a lot of other pieces of text my program uses, but far from all of them (the program processes thousands of text files, each with ~100 of these text pieces it needs to do matching on), and I can't see any common factors. To be clear, this is not supposed to return a match, but this check does need to be done, and I can't understand why it hangs like it does.
More generally, what are the sorts of things that could cause a Python regular expression match to hang indefinitely? I'd love to have the information so I can work out the problem myself, but at this point, I'd take a cheap answer...
Perl-compatible regular expressions (PCRE), which is what Python's re module uses, are no longer "regular" in the Computer Science sense. Because of this, they can suffer from catastrophic backtracking: https://swtch.com/~rsc/regexp/regexp1.html
This doesn't help you much with your problem. What would help you is:
break down your regexp in multiple small blocks
see how long each block takes to execute
start putting the blocks together to get closer to your original huge regexp
You might have to stop trying to do everything with 1 single regexp and you might use 1 or 2 and a bit of code to put the 2 parts together more efficiently.

Remove Script tag and on attributes from HTML

I have the following HTML and I need to remove the script tags and any script related attributes in the HTML. By script related attributes I mean any attribute that starts with on.
<body>
<script src="...">
</script>
<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv" style="overflow:auto;WIDTH:100%">
<script type="text/javascript" language="javascript">
//<![CDATA[
function CreateFixedHeaders() {}//]]>
</script>
<script>
var ClientReportfb64a4706a3749c484169e...
</script>
</body>
My first thought was to use BeautifulSoup to remove the tags and attributes. Unfortunately, I am unable to use BeautifulSoup. Seeing that BeautifulSoup is off the table I can see two options for doing this. The first option I see is splitting the strings and parsing based on index. This seems like a bad solution to me.
The other option is to use Regular Expressions. However, we know that isn't a good solution either (Cthulhu Parsing).
Now with that in mind, I personally feel it is alright to use regular expressions to strip the attributes. After all, with those it is still simple string manipulation.
So for removing the attributes I have:
script_attribute_regex = r'\son[a-zA-Z]+="[a-zA-Z0-0\.;\(\)_]+"'
result = re.sub(script_attribute_regex, "", page_source)
As I've said before, I personally think the above perfectly acceptable use of Regular Expression with HTML. But still I would like to get some opinions on the above usage.
Then there is the question of the script tags. I'm very tempted to go with Regular Expressions for this because I know them and I know what I need is pretty simple. Something like:
<script(.*)</script>
The above would start to get me close to what I need. And yes I realize the above RegEx will grab everything starting at the first opening script tag until the last closing script tag, but it's a starting example.
I'm very tempted to use Regular Expressions as I'm familiar with them (more so than Python) and I know that is the quickest way to achieve the results I want, at least for me it is.
So I need help to go against my nature and not be evil. I want to be evil and use RegEx so somebody please show me the light and guide me to the promised land on non-Regular Expressions.
Thanks
Update:
It looks like I wasn't very clear about what my question actually is, I apologize for that. My question is how can I parse the HTML using pure Python without Regular Expressions?
<script(.*)</script>
As for the above code example, it's wrong. I know it is wrong, I was using it as an example of a starting point.
I hope this clears up my question some
Update 2
I just wanted to add a few more notes about what I am doing.
I am crawling a web site to get the data I need.
Once we have the page that contains the data we need it is saved to the database.
Then the saved web page is displayed to the user.
The issue I am trying to solve happens here. The application throws a script error when you attempt to interact with the page that forces the user to click on a confirmation box. The application is not a web browser but uses the web browser DLL in Windows (I cannot remember the name at the moment).
The error in question only happens in this one page for this one web site.
Update 3
After adding the update I realized I was over thinking the problem, I was looking for a more generic solution. However, in this case that isn't what is needed.
The page is dynamically generated, however the script tags will stay static. With that in mind the solution becomes much simpler. With that I no longer need to treat it like HTML but as static strings.
So the solution I'm looking at is
import re
def strip_script_tags(page_source: str) -> str:
pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
result = re.sub(pattern, "", page_source)
pattern2 = re.compile(r'<script[\s\S]+?/script>')
result = re.sub(pattern2, "", result)
return result
I would like to avoid Regular Expression however, since I'm limited to only using the standard library regular expressions seems like the best solution in this case. Which means #skamazin's answer is correct.
As for removing all the attributes that start with on, you can try this
It uses the regex:
\s?on\w+="[^"]+"\s?
And substitutes with the empty string (deletion). So in Python it should be:
pattern = re.compile(ur'\s?on\w+="[^"]+"\s?')
subst = u""
result = re.sub(pattern, subst, file)
If you are trying to match anything between the script tags try:
<script[\s\S]+?/script>
DEMO
The problem with your regex is that that dot (.) doesn't match newline character. Using a complemented set will match every single character possible. And make sure use the ? in [\s\S]+? so that it is lazy instead of greedy.

Creating custom extensions to a regular expression engine

Is there any easy way to go about adding custom extensions to a
Regular Expression engine? (For Python in particular, but I would take
a general solution as well).
It might be easier to explain what I'm trying to build with an
example. Here is the use case that I have in mind:
I want users to be able to match strings that may contain arbitrary
ASCII characters. Regular Expressions are a good start, but aren't
quite enough for the type of data I have in mind. For instance, say I
have data that contains strings like this:
<STX>12.3,45.6<ETX>
where <STX> and <ETX> are the Start of Text/End of Text characters
0x02 and 0x03. To capture the two numbers, it would be very
convenient for the user to be able to specify any ASCII
character in their expression. Something like so:
\x02(\d\d\.\d),(\d\d\.\d)\x03
Where the "\x02" and "\x03" are matching the control characters and
the first and second match groups are the numbers. So, something like
regular expressions with just a few domain-specific add-ons.
How should I go about doing this? Is this even the right way to go?
I have to believe this sort of problem has been solved, but my initial
searches didn't turn up anything promising. Regular Expression have
the advantage of being well known, keeping the learning curve down.
A few notes:
I am not looking for a fixed parser for a particular protocol - it needs to be general and user configurable
I really don't want to write my own regex engine
Although it would be nice, I am not looking for "regex macros" where I create shortcuts for a handful of common expressions. (perhaps a follow-up question...)
Bonus: Have you heard of any academic work, i.e "Creating Domain Specific search languages"
EDIT: Thanks for the replies so far, I hadn't realized Python re supported arbitrary ascii chars. However, this is still not quite what I'm looking for. Here is another example that hopefully give the breadth of what I want in the end:
Suppose I have data that contains strings like this:
$\x01\x02\x03\r\n
Where the 123 forms two 12-bit integers (0x010 and 0x023). So how could I add syntax so the user could match it with a regex like this:
\$(\int12)(\int12)\x0d\x0a
Where the \int12's each pull out 12 bits. This would be handy if trying to search for packed data.
\x escapes are already supported by the Python regular expression parser:
>>> import re
>>> regex = re.compile(r'\x02(\d\d\.\d),(\d\d\.\d)\x03')
>>> regex.match('\x0212.3,45.6\x03')
<_sre.SRE_Match object at 0x7f551b0c9a48>

Python:Pattern detection and rule generation

I have a need for a pattern interpretation and rule generating system. Basically how it will work is that it should parse through text and interpret patterns from it, and based on those interprtation, i need to output a set of rules. Here is an example. Lets say i have an HTTP header which looks like
GET https://website.com/api/1.0/download/8hqcdzt9oaq8llapjai1bpp2q27p14ah/2139379149 HTTP/1.1
Host: website.com
User-Agent: net.me.me/2.7.1;OS/iOS-5.0.1;Apple/iPad 2 (GSM)
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
The parser would run through this and output
req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"
The above rule contains a modified version of regex. Each variable e.g STRING:auth_token or STRING:id is to be extracted.
For parsing through the text(header in this case) i will have to tell the parser that it needs to extract whatever comes after the "download". So basically there is a definition of a set of rules which this parser will use to parse through the text and eventually output the final rule.
Now the question is, is there any such module available in python for pattern matching,detection,generation that can help me with this? This is somewhat like a compiler's parser part. I wanted to ask before going deep into trying to make one myself. Any help ?
I think that this has been already answered in:
Parser generation
Python parser Module tutorial
I can assure that what you want is easy with pyparsing module.
Sorry if this is not quite what you're looking for, but I'm a little rushed for time.
The re module documentaiton for Python contains a section on writing a tokenizer.
It's under-documented, but might help you in making something workable.
Certainly easier than tokenizing things yourself, though may not provide the flexibility you seem to be after.
You'd best do this yourself. It is not much work.
As you say, you'd have to define regular expressions as rules. Your program would then find the matching regular expression and transform the match into an output rule.
** EDIT **
I do not think there is a library to do this. If I understand you correctly, you want to specify a set of rules like this one:
EXTRACT AFTER download
And this will output a text like this:
req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"
For this you'd have to create a parser that would parse your rules. Depending on the complexity of the rule syntax, you could use pyparsing, use regular expressions or do it by hand. My rule of the thumb is, if your syntax is recursive (i.e. like html), then it makes sense to use pyparsing, otherwise it is not worth it.
From these parsed rules your program would have to create new regular expressions to match the input text. Basically, your program would translate rules into regular expressions.
Using these regular expressions you'd match extract the data from your input text.

Categories