Convert html entities file to Unicode (with BeautifulSoup and Python?) - python

I have installed Python 2.7.13, pip and beautifulsoup on Win10. I want to convert a big file with html entities into Unicode characters and I am not sure how to go about it (I don't know much about Python). The file contents look like this:
<b>γέρων</b>, <i>οντος, ὁ</i>, Wurzel <i>ΓΕΡ</i>, verwandt mit <i>γέρας, γεραρός, γεραιός</i>
I can do small parts with EmEditor (using Edit > Encode/Decode Selection -> HTML/XML character reference to Unicode) but it is too slow and cannot cope with a big file conversion).
I would be happy for any (offline) solution for this.

this is html encoded, try with this:
from HTMLParser import HTMLParser
f = open("myfile.txt")
h = HTMLParser()
new_file_content = h.unescape(f.read())
new_file = open("newfile.txt", 'w')
new_file.write(new_file_content)

BeautifulSoup has a built in function for doing this called .decode(). Simply add this to the end of the line when you read in the file!
Example:
site_read = site_download.read().decode('utf-8')

import bs4
html = '''<b>γέρων</b>, <i>οντος, ὁ</i>, Wurzel <i>ΓΕΡ</i>, verwandt mit <i>γέρας, γεραρός, γεραιός</i>'''
soup = bs4.BeautifulSoup(html, 'lxml')
out:
<html><body><b>γέρων</b>, <i>οντος, ὁ</i>, Wurzel <i>ΓΕΡ</i>, verwandt mit <i>γέρας, γεραρός, γεραιός</i></body></html>
Document:
To parse a document, pass it into the BeautifulSoup constructor. You
can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
>
> soup = BeautifulSoup(open("index.html")) # you can open you file in here
>
> soup = BeautifulSoup("<html>data</html>")
First, the document is
converted to Unicode, and HTML entities are converted to Unicode
characters:

Thank you for your help, I did manage to do it quite easily with the latest version of EmEditor which proved to be quite fast:
Select text > Edit > Encode/Decode Selection -> HTML/XML character reference to Unicode

Related

Weird character not exists in html source python BeautifulSoup

I have watched a video that teaches how to use BeautifulSoup and requests to scrape a website
Here's the code
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
pages_to_scrape = 1
for i in range(1,pages_to_scrape+1):
url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
pages.append(url)
for item in pages:
page = requests.get(item)
soup = bs4(page.text, 'html.parser')
#print(soup.prettify())
for j in soup.findAll('p', class_='price_color'):
price=j.getText()
print(price)
The code i working well. But as for the results I noticed weird character before the euro symbol and when checking the html source, I didn't find that character. Any ideas why this character appears? and how this be fixed .. is using replace enough or there is a better approach?
Seems for me you explained your question wrongly. I assume that you are using Windows where your terminal IDLE is using the default encoding of cp1252,
But you are dealing with UTF-8, you've to configure your terminal/IDLE with UTF-8
import requests
from bs4 import BeautifulSoup
def main(url):
with requests.Session() as req:
for item in range(1, 10):
r = req.get(url.format(item))
print(r.url)
soup = BeautifulSoup(r.content, 'html.parser')
goal = [(x.h3.a.text, x.select_one("p.price_color").text)
for x in soup.select("li.col-xs-6")]
print(goal)
main("http://books.toscrape.com/catalogue/page-{}.html")
try to always use The DRY Principle which means Don’t Repeat Yourself”.
Since you are dealing with the same host so you've to maintain the same session instead of keep open tcp socket stream and then close it and then open it again. That's can lead to block your requests and consider it as DDOS attack where the TCP flags got captured by the back-end. imagine that you open your browser and then open a website then you close it and repeat the circle!
Python functions is usually looks nice and easy to read instead of letting code looks like journal text.
Notes: the usage of range() and {} format string, CSS selectors.
You could use page.content.decode('utf-8') instead of page.text. As people in the comments said, it is an encoding issue, and .content returns HTML as bytes, then you can convert it into string with right encoding using .decode('utf-8'), whereas .text returns string with bad encoding (maybe cp1252). The final code may look like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
pages_to_scrape = 1
pages = [] # You forgot this line
for i in range(1,pages_to_scrape+1):
url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
pages.append(url)
for item in pages:
page = requests.get(item)
soup = bs4(page.content.decode('utf-8'), 'html.parser') # Replace .text with .content.decode('utf-8')
#print(soup.prettify())
for j in soup.findAll('p', class_='price_color'):
price=j.getText()
print(price)
This should hopefully work
P.S: Sorry for directly writing the answer, I don't have enought reputation to write in comments :D

Python: How to remove html entities from list items? [duplicate]

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>£682m</p>")
>>> text = soup.find("p").string
>>> print text
£682m
How can I decode the HTML entities in text to get "£682m" instead of "£682m".
Python 3.4+
Use html.unescape():
import html
print(html.unescape('£682m'))
FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It will be removed from the language soon.
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
For Python 2.6-2.7 it's in HTMLParser
For Python 3 it's in html.parser
>>> try:
... # Python 2.6-2.7
... from HTMLParser import HTMLParser
... except ImportError:
... # Python 3
... from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
£682m
Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.
Beautiful Soup 3
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>£682m</p>",
... convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>
Beautiful Soup 4
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>£682m</p>")
<html><body><p>£682m</p></body></html>
You can use replace_entities from w3lib.html library
In [202]: from w3lib.html import replace_entities
In [203]: replace_entities("£682m")
Out[203]: u'\xa3682m'
In [204]: print replace_entities("£682m")
£682m
Beautiful Soup 4 allows you to set a formatter to your output
If you pass in formatter=None, Beautiful Soup will not modify strings
at all on output. This is the fastest option, but it may lead to
Beautiful Soup generating invalid HTML/XML, as in these examples:
print(soup.prettify(formatter=None))
# <html>
# <body>
# <p>
# Il a dit <<Sacré bleu!>>
# </p>
# </body>
# </html>
link_soup = BeautifulSoup('A link')
print(link_soup.a.encode(formatter=None))
# A link
I had a similar encoding issue. I used the normalize() method. I was getting a Unicode error using the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this and it worked...
import unicodedata
The dataframe object can be whatever you like, let's call it table...
table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
table.index+= 1
encode table data so that we can export it to out .html file in templates folder(this can be whatever location you wish :))
#this is where the magic happens
html_data=unicodedata.normalize('NFKD',table.to_html()).encode('ascii','ignore')
export normalized string to html file
file = open("templates/home.html","w")
file.write(html_data)
file.close()
Reference: unicodedata documentation
This probably isnt relevant here. But to eliminate these html entites from an entire document, you can do something like this: (Assume document = page and please forgive the sloppy code, but if you have ideas as to how to make it better, Im all ears - Im new to this).
import re
import HTMLParser
regexp = "&.+?;"
list_of_html = re.findall(regexp, page) #finds all html entites in page
for e in list_of_html:
h = HTMLParser.HTMLParser()
unescaped = h.unescape(e) #finds the unescaped value of the html entity
page = page.replace(e, unescaped) #replaces html entity with unescaped value

Python: parsing UNICODE characters using bs4

I am building a python3 web crawler/scraper using bs4. The program crashes whenever it meets a UNICODE code character like a Chinese symbol. How do I modify my scraper so that it supports UNICODE?
Here's the code:
import urllib.request
from bs4 import BeautifulSoup
def crawlForData(url):
r = urllib.request.urlopen(url)
soup = BeautifulSoup(r.read(),'html.parser')
result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
for p in result:
print(p)
url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)
You can try unicode() method. It decodes unicode strings.
or a way to go is
content.decode('utf-8','ignore')
where content is your string
The complete solution may be:
html = urllib2.urlopen("your url")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)

Scraping in Python with BeautifulSoup

I've read quite a few posts here about this, but I'm very new to Python in general so I was hoping for some more info.
Essentially, I'm trying to write something that will pull word definitions from a site and write them to a file. I've been using BeautifulSoup, and I've made quite some progress, but here's my issue -
from __future__ import print_function
import requests
import urllib2, urllib
from BeautifulSoup import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)
print(visible_text, file=wordlist)
this seems to pull what I need, but puts it in this format
[u'passable\n adj 1: able to be passed or traversed or crossed; "the road is\n passable"
but I need it to be in plaintext. I've tried using a sanitizer (I was running it through bleach, but that didn't work. I've read some of the other answers here, but they don't explain HOW the code works, and I don't want to add something if I don't understand how it works.
Is there any way to just pull the plaintext?
edit: I ended up doing
from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)[0]
print(visible_text, file=wordlist)
The code is already giving you plaintext, it just happens to have some characters encoded as entity references. In this case, special characters, which form part of the XML/HTML syntax are encoded to prevent them from breaking the structure of the text.
To decode them, use the HTMLParser module:
import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('"the road is passable"')
>>> u'"the road is passable"'

How to get unicode string when extract data in Python?

I am trying to extract text from a Vietnamese website, which charset is in utf-8. However, the text I got is always in Ascii, and I can't find a way to convert them to unicode or get exactly the text on the website. As a result, I can't save them into file as expected.
I know this is the very popular problem with unicode in Python, but I still hope someone will help me to figure it out. Thanks.
My code:
import requests, re, io
import simplejson as json
from lxml import html, etree
base = "http://www.amthuc365.vn/cong-thuc/"
page = requests.get(base + "trang-" + str(1) + ".html")
pageTree = html.fromstring(page.text)
links = pageTree.xpath('//ul[contains(#class, "mt30")]/li/a/#href')
names = pageTree.xpath('//h3[#class="title"]/a/text()')
for name in names[:1]:
print name
# Làm bánh oreo nhân bÆ¡ Äậu phá»ng thÆ¡m bùi
but what I need is "Làm bánh oreo nhân bơ đậu phộng thơm bùi"
Thanks.
Just switching from page.text to page.content should make it work.
Explanation here.
Also see:
What is the difference between 'content' and 'text'
HTML encoding and lxml parsing

Categories