I'm using the Python library SGMLParser to parse some HTML.
I encounter an HTML tag of the form
<td class="school">Texas A&M</td>
I'd like to read out "Texas A&M". But when handle_data gets called, it gets called with "Texas A", and then, separately, "M" (quotes for clarity).
How do I replace the &amp; string with an & before the call, without replacing all special ampersands in the whole string (some of which I may need)?
Thanks!
If you switch from the deprecated SGMLParser to a modern alternative such as lxml (which also handles HTML), this becomes trivial:
>>> from lxml import etree
>>> etree.fromstring('''<td class="school">Texas A&amp;M</td>''').text
'Texas A&M'
SGMLParser has a convert_entityref() method, but instead of the deprecated SGMLParser I would recommend lxml or Beautiful Soup, which have better parser APIs.
Entity references like &amp; are handled by handle_entityref(). Check that this method knows how to translate &amp;: the default implementation should call handle_data('&'), but you may have accidentally overridden it.
Also, if possible, consider using the far more advanced lxml instead.
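For reference, if you do stay with SGMLParser, here is a minimal sketch (Python 2 sgmllib; the class name and the buffering pattern are my own) relying on the default handle_entityref(), which already turns &amp; into a handle_data('&') call, so buffering the chunks and joining them reassembles the text:

from sgmllib import SGMLParser

class CellParser(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.pieces = []

    def handle_data(self, data):
        # Called with 'Texas A', then '&' (via the default
        # handle_entityref), then 'M'; buffer them instead of printing.
        self.pieces.append(data)

parser = CellParser()
parser.feed('<td class="school">Texas A&amp;M</td>')
parser.close()
print ''.join(parser.pieces)  # Texas A&M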
<script type="text/javascript">var csrfMagicToken = "sid:bf8be784734837a64a47fcc30b9df99,162591180";var csrfMagicName = "__csrf_magic";</script>
The above script tag is from a webpage.
script = soup.find_all('script')[5]
By using the above line of code I was able to extract the script tag I want, but I need to extract the values of the variables inside it. I am using BeautifulSoup in my Python script to extract the data.
You could use
(?:var|let)\s+(\w+)\s*=\s*"([^"]+)"
See a demo on regex101.com.
Note: However, there are a couple of drawbacks in general to using regular expressions on code. E.g. with the above, something like let x = -10; would not be matched but would be perfectly valid JavaScript. Also, single quotes are not supported (yet) - it depends entirely on your actual input.
That being said, you could go for:
(?:var|let)\s+
(?P<key>\w+)\s*=\s*
(['"])?(?(2)(?P<value1>.+?)\2|(?P<value2>[^;]+))
See another demo on regex101.com.
This still leaves you helpless against escaped quotes like let x = "some \" string"; or against variable declarations in comments. In general, favour a parser solution.
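For completeness, a minimal end-to-end sketch (assuming bs4, a shortened sample page, and the simpler double-quoted pattern from above):

import re
from bs4 import BeautifulSoup

html = '''<script type="text/javascript">var csrfMagicToken = "sid:abc123";var csrfMagicName = "__csrf_magic";</script>'''

soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script')  # the question uses soup.find_all('script')[5] on the full page

# Build a dict of variable name -> string value
variables = dict(re.findall(r'(?:var|let)\s+(\w+)\s*=\s*"([^"]+)"', script.string))
print(variables['csrfMagicName'])  # __csrf_magic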
I have a string
<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />
What is the regex to find ABCDXYZ in Python?
Don't use regex to parse HTML. Use BeautifulSoup.
from bs4 import BeautifulSoup as BS
text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
soup = BS(text, 'html.parser')
print(soup.find('img').attrs['alt'])
If you're looking for the value of that alt attribute, you can do this:
>>> import re
>>> r = r'alt="(.*?)"'
Then:
>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'
And you can use re.findall if you want to find more than one.
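For instance (sample string made up):
>>> re.findall(r'alt="(.*?)"', '<img alt="one" /><img alt="two" />')
['one', 'two']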
However, this code will be easily fooled by something like this:
<span>Here's some text explaining how to do alt="foo" in an img tag.</span>
On the other hand, it'll also fail to pick up something like this:
<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.
It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine - and, on top of that, it's embedded in a Turing-complete programming language. So obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in Perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't be forcing yourself to use regexes even when there's an easier way to do what you want. For quick-and-dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.
One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.
If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.
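For contrast, a minimal bs4 sketch (names assumed) that copes with both examples above: the attribute-lookalike text inside the <span> is ignored, and the single-quoted <img> is still picked up:

from bs4 import BeautifulSoup

html = '''<span>Here's some text explaining how to do alt="foo" in an img tag.</span>
<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />'''

soup = BeautifulSoup(html, 'html.parser')
# Only real <img> elements with an alt attribute are returned
print([img['alt'] for img in soup.find_all('img', alt=True)])  # ['ABCDXYZ']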
First, a disclaimer: you shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.
Next, if you are actually serious about using regular expressions and the above is the exact case you want then you could do something like:
<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/]+" alt="([a-zA-Z0-9/]+)" />
and you could access the text via the match object's groups() method.
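A quick sketch of that in Python (using the string from the question):

import re

text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
pattern = r'<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/._]+" alt="([a-zA-Z0-9/]+)" />'

m = re.match(pattern, text)
if m:
    print(m.groups())  # ('ABCDXYZ',)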
Could someone tell me whats a better way to clean up bad HTML so BeautifulSoup can handle it - should one use the massage methods of BeautifulSoup or clean it up using regular expressions?
Thought I should reword my answer.
The built-in massages are good for light damage (extra whitespace, no closing slashes, etc). I would certainly try and get away with these before getting any more involved.
You can pass in your own massages and I would suggest you extend the default set:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
import copy, re

# A sample malformed string (assumed here): the opening comment is missing a dash
badString = 'Foo<!-This comment is malformed.-->Bar<br />Baz'

myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)

BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
You're probably better off doing it this way as it all goes into one parsing pot, gaining BeautifulSoup's optimisations... although the runtime performance is probably pretty similar.
From the documentation, massages are just pairs of (regular expression, replacement function), so I don't think it's really a case of massaging versus regexps.
e.g. to tidy up malformed comments:
(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))
If you look at the source of the _feed method in BeautifulSoup.py you will see that these are just run in sequence against the markup:
for fix, m in self.markupMassage:
    markup = fix.sub(m, markup)
So whilst you could do some regexp processing of your own before BeautifulSoup gets to see the markup, you are probably better off combining any additional tidying with the default built-in MARKUP_MASSAGE, as shown in Oli's answer.
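If you do want to run a pass of your own first, it amounts to the same loop; a minimal sketch with a made-up sample:

import re

# The same (pattern, replacement) pair as above, applied by hand
massages = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]

markup = 'Foo<!-This comment is malformed.-->Bar<br />Baz'
for fix, m in massages:
    markup = fix.sub(m, markup)

print(markup)  # Foo<!--This comment is malformed.-->Bar<br />Baz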
I'm scraping a html page, then using xml.dom.minidom.parseString() to create a dom object.
however, the html page has a '&'. I can use cgi.escape to convert this into &amp; but it also converts all my html <> tags into &lt;&gt; which makes parseString() unhappy.
how do i go about this? i would rather not just hack it and straight replace the "&"s
thanks
For scraping, try to use a library that can handle such HTML "tag soup", like lxml, which has an HTML parser (as well as a dedicated html package in lxml.html), or BeautifulSoup. You will also find that these libraries contain other things that make scraping/working with HTML easier, aside from being able to handle ill-formed documents: getting information out of forms, making hyperlinks absolute, using CSS selectors...
i would rather not just hack it and straight replace the "&"s
Er, why? That's what cgi.escape is doing - effectively just a search and replace operation for certain characters that have to be escaped.
If you only want to replace a single character, just replace the single character:
yourstring.replace('&', '&amp;')
Don't beat around the bush.
If you want to make sure that you don't accidentally re-escape an already escaped & (i.e. not transform &amp; into &amp;amp; or &szlig; into &amp;szlig;), you could
import re
newstring = re.sub(r"&(?![A-Za-z])", "&amp;", oldstring)
This will leave &s alone when they are followed by a letter.
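A quick check of the idea in the interpreter (sample string made up):
>>> import re
>>> re.sub(r"&(?![A-Za-z])", "&amp;", 'Fish & Chips, but &amp; is left alone')
'Fish &amp; Chips, but &amp; is left alone'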
You shouldn't use an XML parser to parse data that isn't XML. Find an HTML parser instead, you'll be happier in the long run. The standard library has a few (HTMLParser and htmllib), and BeautifulSoup is a well-loved third-party package.
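For instance, a minimal sketch with bs4 (any of the parsers mentioned would do), showing that the stray '&' that made parseString() unhappy is handled gracefully:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<td class="school">Fish & Chips</td>', 'html.parser')
print(soup.td.get_text())  # Fish & Chips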
I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
<a href="http://www.ptop.se">http://www.ptop.se</a>
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
this should work, although there might be more elegant ways.
import re
url = '<a href="http://www.ptop.se">http://www.ptop.se</a>'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
This regex can help you; you should get the first group by \1 or whatever method your language provides.
href="([^"]*)
example:
<a href="http://www.amghezi.com">amgheziName</a>
result:
http://www.amghezi.com
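In Python, the equivalent of \1 is group(1); a quick sketch using the example above:

import re

html = '<a href="http://www.amghezi.com">amgheziName</a>'
m = re.search(r'href="([^"]*)', html)
if m:
    print(m.group(1))  # http://www.amghezi.com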
There are tonnes of them on regexlib.
Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.
This works pretty well, using a non-capturing group for href= so that only the link itself is captured. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Output:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']
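For example, with re.findall (sample links made up):

import re

html = '''<a href="/p/123411/">first</a> and <a href='http://www.ptop.se'>second</a>'''
print(re.findall(r"""<a[^>]+href=["'](.*?)["']""", html))
# ['/p/123411/', 'http://www.ptop.se']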