Python: How to remove html entities from list items? [duplicate] - python

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>£682m</p>")
>>> text = soup.find("p").string
>>> print text
£682m
How can I decode the HTML entities in text to get "£682m" instead of "£682m".

Python 3.4+
Use html.unescape():
import html
print(html.unescape('£682m'))
FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It will be removed from the language soon.
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
For Python 2.6-2.7 it's in HTMLParser
For Python 3 it's in html.parser
>>> try:
... # Python 2.6-2.7
... from HTMLParser import HTMLParser
... except ImportError:
... # Python 3
... from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
£682m

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.
Beautiful Soup 3
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>£682m</p>",
... convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>
Beautiful Soup 4
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>£682m</p>")
<html><body><p>£682m</p></body></html>

You can use replace_entities from w3lib.html library
In [202]: from w3lib.html import replace_entities
In [203]: replace_entities("£682m")
Out[203]: u'\xa3682m'
In [204]: print replace_entities("£682m")
£682m

Beautiful Soup 4 allows you to set a formatter to your output
If you pass in formatter=None, Beautiful Soup will not modify strings
at all on output. This is the fastest option, but it may lead to
Beautiful Soup generating invalid HTML/XML, as in these examples:
print(soup.prettify(formatter=None))
# <html>
# <body>
# <p>
# Il a dit <<Sacré bleu!>>
# </p>
# </body>
# </html>
link_soup = BeautifulSoup('A link')
print(link_soup.a.encode(formatter=None))
# A link

I had a similar encoding issue. I used the normalize() method. I was getting a Unicode error using the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this and it worked...
import unicodedata
The dataframe object can be whatever you like, let's call it table...
table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
table.index+= 1
encode table data so that we can export it to out .html file in templates folder(this can be whatever location you wish :))
#this is where the magic happens
html_data=unicodedata.normalize('NFKD',table.to_html()).encode('ascii','ignore')
export normalized string to html file
file = open("templates/home.html","w")
file.write(html_data)
file.close()
Reference: unicodedata documentation

This probably isnt relevant here. But to eliminate these html entites from an entire document, you can do something like this: (Assume document = page and please forgive the sloppy code, but if you have ideas as to how to make it better, Im all ears - Im new to this).
import re
import HTMLParser
regexp = "&.+?;"
list_of_html = re.findall(regexp, page) #finds all html entites in page
for e in list_of_html:
h = HTMLParser.HTMLParser()
unescaped = h.unescape(e) #finds the unescaped value of the html entity
page = page.replace(e, unescaped) #replaces html entity with unescaped value

Related

Convert HTML String Tag into String Python

I am trying to convert the HTML String Tag into String using Python.
Here is the content I'm trying to convert:
htmltxt = "<b>Hello World</b>".
The result should appear like Hello World in bold. But I'm getting like
<html><body><b>Hello World</b></body></html>
with the below snippet of code
from bs4 import BeautifulSoup
htmltxt = "<b>Hello World</b>"
soup = BeautifulSoup(htmltxt, 'lxml')
Can anyone suggest me how to convert?
In this situation you're trying to find a tag from within your soup object. Given this is the only one and there is no id or class name you can use:
hello_world_tag = soup.find("b")
hello_world_tag_text = hello_world_tag.text
print(hello_world_tag_text) # Output: 'Hello World'
The key here is '.text'. Using beautiful soup to find a specific tag will return that entire tag, but the .text method returns just the text from within that tag.
Edit following comment:
I would still recommend using bs4 to parse html. Once you have your text if you'd like it in bold you may print with:
print('\033[1m' + text)
Note You won't get out a bold string per se, it is something that always have to be done by interpreting or formating.
Extracting text from HTML string with BeautifulSoup you can call the methods text or get_text():
from bs4 import BeautifulSoup
htmltxt = "<b>Hello World</b>"
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text

problem finding anchor tag in a line with python regex module

>>> data
'পূর্ববর্তী ৫০টি) (পরবর্তী ৫০টি'
>>> next = re.findall('<a href="(.*?)".*?>পরবর্তী ৫০টি</a>',data)
>>> next
['/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=0&hidelinks=1']
see the Image here
I am trying to find what inside the second anchor tag, But why am I getting data from the first tag?
Any reason why you're using regex to parse HTML? Why not? Read this.
You should be using BeautifulSoup for example:
from bs4 import BeautifulSoup
a = 'পূর্ববর্তী ৫০টি) (পরবর্তী ৫০টি'
print(BeautifulSoup(a, "html.parser").find_all("a"))
This gets you the second anchor.
from bs4 import BeautifulSoup
a = 'পূর্ববর্তী ৫০টি) (পরবর্তী ৫০টি'
print([i.get("href") for i in BeautifulSoup(a, "html.parser").find_all("a") if i.text == "পরবর্তী ৫০টি"])
Output:
/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=950505&hidelinks=1&back=776017

Get URL from BeautifulSoup object

Somebody is handing my function a BeautifulSoup object (BS4) that he has gotten using the typical call:
soup = BeautifulSoup(url)
my code:
def doSomethingUseful(soup):
url = soup.???
How do I get the original URL from the soup object? I tried reading the docs AND the BeautifulSoup source code... I'm still not sure.
If the url variable is a string of an actual URL, then you should just forget the BeautifulSoup here and use the same variable url. You should be using BeautifulSoup to parse HTML code, not a simple URL. In fact, if you try to use it like this, you get a warning:
>>> from bs4 import BeautifulSoup
>>> url = "https://foo"
>>> soup = BeautifulSoup(url)
C:\Python27\lib\site-packages\bs4\__init__.py:336: UserWarning: "https://foo" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
Since the URL is just a string, BeautifulSoup doesn't really know what to do with it when you "soupify" it, except for wrapping it up in basic HTML:
>>> soup
<html><body><p>https://foo</p></body></html>
If you still wanted to extract the URL from this, you could just use .text on the object, since it's the only thing in there:
>>> print(soup.text)
https://foo
If on the other hand url is not really a URL at all but rather a bunch of HTML code (in which case the variable name would be very misleading), then how you'd extract a specific link inside would beg the question of how it's in your code. Doing a find to get the first a tag, then extracting the href value would be one way.
>>> actual_html = '<html><body>My link text</body></html>'
>>> newsoup = BeautifulSoup(actual_html)
>>> newsoup.find('a')['href']
'http://moo'

Scraping in Python with BeautifulSoup

I've read quite a few posts here about this, but I'm very new to Python in general so I was hoping for some more info.
Essentially, I'm trying to write something that will pull word definitions from a site and write them to a file. I've been using BeautifulSoup, and I've made quite some progress, but here's my issue -
from __future__ import print_function
import requests
import urllib2, urllib
from BeautifulSoup import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)
print(visible_text, file=wordlist)
this seems to pull what I need, but puts it in this format
[u'passable\n adj 1: able to be passed or traversed or crossed; "the road is\n passable"
but I need it to be in plaintext. I've tried using a sanitizer (I was running it through bleach, but that didn't work. I've read some of the other answers here, but they don't explain HOW the code works, and I don't want to add something if I don't understand how it works.
Is there any way to just pull the plaintext?
edit: I ended up doing
from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)[0]
print(visible_text, file=wordlist)
The code is already giving you plaintext, it just happens to have some characters encoded as entity references. In this case, special characters, which form part of the XML/HTML syntax are encoded to prevent them from breaking the structure of the text.
To decode them, use the HTMLParser module:
import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('"the road is passable"')
>>> u'"the road is passable"'

Passing lxml output to BeautifulSoup

My offline code works fine but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication then lxml to parse (it gives a good result with the specific pages we need to scrape) then to BeautifulSoup.
#! /usr/bin/python
import urllib.request
import urllib.error
from io import StringIO
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly
With that working, I tried to feed it a page via urllib:
# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes
Trying to deal with the error message, I tried:
# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file
I didn't know what to do about the OSError so I sought another method. I found suggestions to use lxml.html instead of lxml.etree so the next attempt is:
attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer
This clearly gives a structure of some sort, but how to pass it to BeautifulSoup? My question is twofold: How can I pass a page from urllib to lxml.etree (as in attampt 1, closest to my working code)? or, How can I pass a lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around datatypes but don't know what to do about them.
python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to python. Thanks to the internet for code fragments and examples.
BeautifulSoup can use the lxml parser directly, no need to go to these lengths.
BeautifulSoup(doc, 'lxml')

Categories