Massage with BeautifulSoup or clean with Regex - python

Could someone tell me whats a better way to clean up bad HTML so BeautifulSoup can handle it - should one use the massage methods of BeautifulSoup or clean it up using regular expressions?

Thought I should reword my answer.
The built-in massages are good for light damage (extra whitespace, no closing slashes, etc). I would certainly try and get away with these before getting any more involved.
You can pass in your own massages and I would suggest you extend the default set:
import copy, re
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
You're probably better off doing it this way as it all goes into one parsing pot, gaining BeautifulSoups optimisations... Although the runtime performance is probably pretty similar.

From the documentation, massage methods are just pairs of (regular expression, replacement function) so I don't think it's really a case of use massaging or regexps.
e.g. to tidy up malformed comments:
(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))
If you look at the source of the _feed method in BeautifulSoup.py you will see that these are just run in sequence against the markup:
for fix, m in self.markupMassage:
markup = fix.sub(m, markup)
So whilst you could do some regexp processing of your own before BeautifulSoup gets to see the markup you are probably better combining any additional tidying needed with the default builtin MARKUP_MASSAGE as shown in Oli's answer.

Related

Python3 - Generate string matching multiple regexes, without modifying them

I would like to generate string matching my regexes using Python 3. For this I am using handy library called rstr.
My regexes:
^[abc]+.
[a-z]+
My task:
I must find a generic way, how to create string that would match both my regexes.
What I cannot do:
Modify both regexes or join them in any way. This I consider as ineffective solution, especially in the case if incompatible regexes:
import re
import rstr
regex1 = re.compile(r'^[abc]+.')
regex2 = re.compile(r'[a-z]+')
for index in range(0, 1000):
generated_string = rstr.xeger(regex1)
if re.fullmatch(regex2, generated_string):
break;
else:
raise Exception('Regexes are probably incompatibile.')
print('String matching both regexes is: {}'.format(generated_string))
Is there any workaround or any magical library that can handle this? Any insights appreciated.
Questions which are seemingly similar, but not helpful in any way:
Match a line with multiple regex using Python
Asker already has the string, which he just want to check against multiple regexes in the most elegant way. In my case we need to generate string in a smart way that would match regexes.
If you want really generic way, you can't really use brute force approach.
What you look for is create some kind of representation of regexp (as rstr does through call of sre_parse.py) and then calling some SMT solver to satisfy both criteria.
For Haskell there is https://github.com/audreyt/regex-genex which uses Yices SMT solver to do just that, but I doubt there is anything like this for Python. If I were you, I'd bite a bullet and call it as external program from your python program.
I don't know if there is something that can fulfill your needs much smother.
But I would do it something like (as you've done it already):
Create a Regex object with the re.compile() function.
Generate String based on 1st regex.
Pass the string you've got into the 2nd regex object using search() method.
If that passes... your done, string passed both regexs.
Maybe you can create a function and pass both regexes as parameters and test "2 by 2" using the same logic.
And then if you have 8 regexes to match...
Just do:
call (regex1, regex2)
call (regex2, regex3)
call (regex4, regex5)
...
I solved this using a little alternative approach. Notice second regex is basically insurance so only lowercase letters are generated in our new string.
I used Google's python package sre_yield which allows charset limitation. Package is also available on PyPi. My code:
import sre_yield
import string
sre_yield.AllStrings(r'^[abc]+.', charset=string.ascii_lowercase)[0]
# returns `aa`

Simple parser, but not a calculator

I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic like" things.
I have a very simple DSL, for example:
ELEMENT TYPE<TYPE> elemName {
TYPE<TYPE> memberName;
}
Where the <TYPE> part is optional and valid only for some types.
Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:
How do I look for tokens that are longer than 1 char?
How do I break up the text in the different parts? For example, after a TYPE I can have a whitespace or a < or a whitespace followed by a <. How do I address that?
Short answer
All your questions boil down to the fact that you are not tokenizing your string before parsing it.
Long answer
The process of parsing is actually split in two distinct parts: lexing and parsing.
Lexing
What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking How do I break up the text in the different parts?
You can do it by yourself by checking your string against a list of regexp using re, or you can use some well-known librairy such as PLY. Although if you are using Python3, I will be biased toward a lexing-parsing librairy that I wrote, which is ComPyl.
So proceeding with ComPyl, the syntax you are looking for seems to be the following.
from compyl.lexer import Lexer
rules = [
(r'\s+', None),
(r'\w+', 'ID'),
(r'< *\w+ *>', 'TYPE'), # Will match your <TYPE> token with inner whitespaces
(r'{', 'L_BRACKET'),
(r'}', 'R_BRACKET'),
]
lexer = Lexer(rules=rules, line_rule='\n')
# See ComPyl doc to figure how to proceed from here
Notice that the first rule (r'\s+', None), is actually what solves your issue about whitespace. It basically tells the lexer to match any whitespace character and to ignore them. Of course if you do not want to use a lexing tool, you can simply add a similar rule in your own re implementation.
Parsing
You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there exist a lot of tools that can do that for you (PLY and ComPyl librairies offer LR(1) parsers which are more powerful but harder to hand-write, see the difference between LL(1) and LR(1) here).
Simply notice that now that you know how to tokenize your string, the issue of How do I look for tokens that are longer than 1 char? has been solved. You are now parsing, not a stream of characters, but a stream of tokens that encapsulate the matched words.
Olivier's answer regarding lexing/tokenizing and then parsing is helpful.
However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks - there is good documentation to help.
An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser .
It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.

Regex to find a string python

I have a string
<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />
What is the Regex to find ABCDXYZ in Python
Don't use regex to parse HTML. Use BeautifulSoup.
from bs4 import BeautifulSoup as BS
text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
soup = BS(text)
print soup.find('img').attrs['alt']
If you're looking for the value of that alt attribute, you can do this:
>>> r = r'alt="(.*?)"'
Then:
>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'
And you can use re.findall if you want to find more than one.
However, this code will be easily fooled by something like this:
<span>Here's some text explaining how to do alt="foo" in an img tag.</span>
On the other hand, it'll also fail to pick up something like this:
<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.
It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine—and, on top of that, it's embedded in a Turing-complete programming language. So obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't be forcing yourself to use regexes even when there's an easier way to do what you want. For quick&dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.
One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.
If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.
First, a disclaimer: You shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this
Next, if you are actually serious about using regular expressions and the above is the exact case you want then you could do something like:
<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/]+" alt="([a-zA-Z0-9/]+)" />
and you could access the text via the match object's groups attribute.

avoid regex [python]

I'd like to know if it's a good idea avoid regex.
actually I have avoided it in any case and some peoples has been giving me advice that i shouldn't avoid it, since if you know what means every thing like:
[] '|' \A \B \d \D \W \w \S \Z $ * ? ...
it would be easy to read, right? but i fell like avoiding regex i would have a more readable code.
it gets more unreadable when it's bigger, example: validators.py
email_re = re.compile(
r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*" # dot-atom
r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' # quoted-string
r')#(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$', re.IGNORECASE) # domain
so, I'd like to know a reason to not avoid regex?
No, don't avoid regular expressions. They're actually quite a nifty little tool and will save you a lot of work if you use them wisely.
What you do need to avoid is trying to use it for everything, a malaise that appears to strike those new to regular expressions before they become a little more tempered and a little less enamoured :-)
For example, don't use it to validate email addresses. The way you validate an email address is to send an email to it with a link that the receiver has to click on to complete the "transaction".
There are billions of valid email addresses (according to the RFCs) that have no physical email receiver behind them. The only way to be certain that there is a receiver is to send an email and wait for proof positive that it was received and acted upon.
If I find myself writing a regular expression that's more than, let's say, 60 characters, I step back to see if there's a more readable way. Similarly, if I write a regular expression and come back a week later and can't instantly recognise what it does, I think about replacing it. This particular paragraph consists of my opinions of course, but they've served me well :-)
Regular expressions are a tool. They are perfectly suited to some tasks and not to others. Like any tool, use them when they are the right tool for the job. Don't just avoid them because somebody said they were bad. Learn how to use them and then you can decide for yourself rather then depending on someone elses dogma.
If you choose to use a more general parsing approach, like pyparsing or PLY, you will never require regular expressions (which can only match a small subset of the languages matchable with such general parsers). However, lexers such as the one in PLY are typically built around regular expressions (which are a perfect match for a lexer's needs!), so you will probably have to avoid that (as well as powerful tools such as BeautifulSoup when any "normal" user would be able to keep using and enjoying it by simply passing a regular expression object as the selector, since BeautifulSoup fully supports that) and will have to recode a lot of such existing parsers with your chosen general-purpose parsing package.
Performance may suffer greatly, of course, by using extremely general tools in cases where simpler, highly optimized and concise ones would be a perfect solution -- and the size of your code may "blow up" to being very large in many common cases. But if you don't mind having programs twice as big and twice as slow, and are determined to avoid regular expressions at all costs, you can do that.
On the other hand, if your main concern is with readability (quite an understandable and commendable concern, too), then the re.VERBOSE option, by allowing abundant use of whitespace and comments within the RE's pattern, can really do wonders for that goal without removing any of REs' advantages (except by diluting a sometimes-excessive conciseness;-). You WILL want to also keep at least one general-purpose parsing system under your belt, of course (rather than stretch REs to do tasks they're wrong for, as so many people unfortunately do!) -- but a minimal command of REs will serve you well in so many cases (including, for example, full use of BeautifulSoup and many other tools which can accept REs as parameters to apply them appropriately) that I think it's quite to be recommended.
Just for some comparisions, here my version email format check not with regexp (with test cases) and one readable regexp offered to me as alternative (though sending email after it is accepted, is great idea):
# -*- coding: utf8 -*-
import string
print("Valid letters in this computer are: "+string.letters)
import re
def validateEmail(a):
sep=[x for x in a if not (x.isalpha() or
x.isdigit() or
x in r"!#$%&'*+-/=?^_`{|}~]") ]
sepjoined=''.join(sep)
## sep joined must be ..#.... form
if len(a)>255 or sepjoined.strip('.') != '#': return False
end=a
for i in sep:
part,i,end=end.partition(i)
if len(part)<2: return False
return len(end)>1
def emailval(address):
pattern = "[\.\w]{2,}[#]\w+[.]\w+"
return re.match(pattern, address)
if __name__ == '__main__':
emails = [ "test.#web.com","test+john#web.museum", "test+john#web.m",
"a#n.dk", "and.bun#webben.de","marjaliisa.hämäläinen#hel.fi",
"marja-liisa.hämäläinen#hel.fi", "marjaliisah#hel.",'tony#localhost',
'1234#23.45','me#somewhere']
print('\n\t'.join(["Valid emails are:"] +
filter(validateEmail,emails)))
print('\n\t'.join(["Regexp gives wrong answer:"] +
filter(emailval,emails)))
""" Output:
Valid letters in this computer are: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Valid emails are:
test+john#web.museum
and.bun#webben.de
tony#localhost
1234#23.45
me#somewhere
Regexp gives wrong answer:
test.#web.com
and.bun#webben.de
1234#23.45
"""
EDIT: cleaned up the regex filter function from this ancient code, edited for #detly link based more permissive version. Good enough for form filling first check for me before sending the confirmation email. Finaly put the 255 character length limit check mentioned in comments.
This code by purpose does not accept the normal a#b as valid email address, but does accept me#somewhere. Another thing is that it depends of what isalpha returns. So this output, which is from Ideone.com has not accepted the scandinavian öä even they are valid nowadays. When run in my home computer, those are accepted. This is even when coding line is there.
(Deleted a regular expression which purported to be an "official" one but is in fact not found in the RFC it claimed to be from.)
This regex may be amusing as it is an attempt to precisely match the e-mail address grammar provided in an older version of the Internet mail standards.
Regular expressions are likely the right tool for extracting/validating email addresses...
To extract one or more email addresses from raw text:
import re
pat_e = re.compile(r'(?P<email>[\w.+-]+#(?:[\w-]+\.)+[a-zA-Z]{2,})')
emails = []
for r in pat_e.finditer(text):
emails.append(r.group('email'))
return emails
To see if a single piece of text is a valid email:
import re
pat_m = re.compile(r'([\w.+-]+#(?:[\w-]+\.)+[a-zA-Z]{2,}$)')
if pat_m.match(text):
return True
return False

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
http://www.ptop.se
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
this should work, although there might be more elegant ways.
import re
url='http://www.ptop.se'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
this regex can help you, you should get the first group by \1 or whatever method you have in your language.
href="([^"]*)
example:
amgheziName
result:
http://www.amghezi.com
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.
This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Oputput:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']

Categories