I need to get all the text from a page using BeautifulSoup. The BeautifulSoup documentation shows that you can call soup.get_text() to do this. When I tried it on reddit.com, I got this error:
UnicodeEncodeError in soup.py:16
'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence
I get errors like that on most of the sites I checked.
I got similar errors when I did soup.prettify() too, but I fixed it by changing it to soup.prettify('UTF-8'). Is there any way to fix this? Thanks in advance!
Update June 24
I've found a bit of code that seems to work for other people, but I still need to use UTF-8 instead of the default. Code:
import re

texts = soup.findAll(text=True)

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    elif re.match('\n', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)
print visible_texts
Error is different, though. Progress?
UnicodeEncodeError in soup.py:29
'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range(128)
soup.get_text() returns a Unicode string, which is why you're getting the error.
You can solve this in a number of ways including setting the encoding at the shell level.
export PYTHONIOENCODING=UTF-8
You can reload sys and set the default encoding by including this in your script (Python 2 only):

import sys

if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding("utf-8")
Or you can encode the string as utf-8 in code. For your reddit problem something like the following would work:
import urllib
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/python"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# get text
text = soup.get_text()
print(text.encode('utf-8'))
You can't call str(text) if the page may contain non-ASCII Unicode. Instead of str(), use unicode().
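As a minimal sketch of the last option (in Python 3 terms, where every str is Unicode), encoding explicitly produces bytes that are independent of whatever codec the console uses:

```python
# A string containing the non-breaking space (U+00A0) from the error above.
text = 'reddit\xa0text'

# Encoding explicitly yields UTF-8 bytes, regardless of the console codec.
data = text.encode('utf-8')
print(data)  # b'reddit\xc2\xa0text'
```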
Related
I made a script to parse SpanishDict for the verb conjugations, but I always get the error
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 3: ordinal not in range(128)
when I try to print the characters with accents. Looking at other similar questions, I realized that I need to encode and decode the accented characters, but I don't know how to do this in my code.
Here is my code:
from requests import get
from bs4 import BeautifulSoup

url = "http://www.spanishdict.com/translate/hacer"
response = get(url)
#print(response.text[:500])
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.find_all("a", class_="vtable-word-text")
for n in range(30):
    try:
        print(container[n].text)
    except Exception as e:
        print("accented character")
I have also added a try/except to tell me which ones are the accented characters. Could anyone help me with my problem? Thanks in advance.
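For what it's worth (this assumes the failure is in printing, not parsing, which the 'ascii' codec message suggests), writing UTF-8 bytes straight to the underlying binary stream bypasses the console's text codec entirely:

```python
import sys

# Hypothetical conjugation text containing í (U+00ED), the character from the error.
word = 'hac\xeda'  # 'hacía'

# Write UTF-8 bytes directly to the binary buffer,
# sidestepping the console's (possibly ASCII) text encoding.
sys.stdout.buffer.write(word.encode('utf-8') + b'\n')
```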
Example code:
import bs4 as bs
import urllib.request
from urllib.request import Request

req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
              headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching for a solution online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8','ignore'),'lxml')
would solve this for me but this hasn't fixed anything for me.
Am I mistaken in thinking that the decode function with the ignore argument should allow me to print(soup) without error even if it isn't successfully decoded completely?
Just re-read your question and I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to utf-8 encoding, or if you are using an IDE like sublime text, change it to render utf-8 encoding.
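One way to force UTF-8 output (a sketch; reconfigure() assumes Python 3.7+) is below; the same mechanism is shown against an in-memory stream so the resulting bytes are visible:

```python
import io
import sys

# On Python 3.7+, the console stream can be switched in place:
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')

# The same idea against an in-memory binary stream:
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding='utf-8')
out.write('\xa0')       # the U+00A0 character from the error above
out.flush()
print(raw.getvalue())   # b'\xc2\xa0', the UTF-8 encoding of U+00A0
```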
I am using bs4 to extract text from a web document, but its output is very strange, like:
Ú©Ø¨Ú¾Û ÛÛ Ø¨Ø§Øª سÙجھ ÙÛÚº ÙÛ Ø§ÙØªÛ ØªÚ¾ÛÛ Ù¾Ú¾Ø± اÙÛØ³ØªÛ Ø§ÙÛØ³ØªÛ Ø¬Ø¨ Ú©ÚÚ¾ عÙ٠اÙÙÛ Ø´Ø±Ùع ÛÙØ¦Û ØªÙ Ø¨Ø§Øª Ú©ÚÚ¾ Ù¾ÙÛ Ù¾ÚÛÛÙÛک٠اب ÛÛ Ø¨Ø§Øª اÛØ³Û Ø³Ùجھ ÙÛÚº اÙØ¦Û Ú©Û Ø³ÙÚا اÙÙ¾ Ú©ÛÙÙÚ¯ÙÚº Ú©Û Ø¨Ú¾Û Ø´Ûئر کرÙÚºÛ ÚÙد رÙز Ùب٠ÙÛرا 8 Ù
I think it is some encoding issue. I am a new user of bs4; please guide me on how to decode it so it displays as Urdu text.
Here is the document source whose title I want to extract. The following code is what I am using to do it:
from bs4 import BeautifulSoup
import requests

url = "http://blog.jang.com.pk/blog_details.asp?id=11058"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print str(soup.title)
Burhan Khalid's answer works, but because the original web page is encoded in utf-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
You should update the requests response's encoding to match the original page's encoding:
from bs4 import BeautifulSoup
import requests

url = "http://blog.jang.com.pk/blog_details.asp?id=11058"
r = requests.get(url)
# Update encoding to match source
r.encoding = "utf-8"
data = r.text
soup = BeautifulSoup(data, 'lxml')
print str(soup.title)
Now any field you access will have the correct encoding, rather than you having to fix it up on a per-field basis.
If you simply try to print the string, you'll get garbage characters out:
>>> import requests
>>> from bs4 import BeautifulSoup as bs4
>>> r = requests.get('http://blog.jang.com.pk/blog_details.asp?id=11058')
>>> s = bs4(r.text, 'lxml')
>>> print s.title.text
Ú©ÚÚ¾ تÙØ¬Û Ø§Ø³ طر٠بھÛ!
You need to encode it properly, since the result is a Unicode string.
>>> print s.title.text.encode('iso-8859-1')
کچھ توجہ اس طرف بھی!
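The mojibake in the first print is the classic pattern of UTF-8 bytes decoded as Latin-1; a quick round-trip (a sketch, independent of the site) shows why re-encoding as ISO-8859-1 recovers the original text:

```python
original = 'کچھ'  # Urdu text, as on the page

# The server sent UTF-8 bytes, but they were decoded as ISO-8859-1 (Latin-1):
garbled = original.encode('utf-8').decode('iso-8859-1')
print(garbled)    # mojibake

# Re-encoding as ISO-8859-1 hands back the raw UTF-8 bytes,
# which then decode correctly:
recovered = garbled.encode('iso-8859-1').decode('utf-8')
print(recovered == original)  # True
```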
If it displays the glyphs correctly, but in the wrong order (ie, they are not right-to-left), then this is a problem with your operating system/terminal/shell/program you are using to run the application.
The above is from gnome-terminal, which doesn't support Arabic RTL properly.
If I run the same code in mlterm, the text renders correctly. The white box is there because I am using an Arabic font, which doesn't have all the characters in the Urdu language.
I think what is happening is that there is some badly formed UTF-8 in the website response:
----> 1 r.content.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 1106: invalid continuation byte
Hence the text is being decoded using the ANSI codec, which is obviously wrong. You can work around this issue by calling decode with the option errors='ignore' (we use content rather than text because content is the raw binary response from the website):
data = r.content.decode(errors='ignore')
soup = BeautifulSoup(data, 'lxml')
print(str(soup.title))
<title>کچھ توجہ اس طرف بھی!</title>
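In isolation, errors='ignore' simply drops the bytes that are not valid UTF-8 (a minimal sketch with made-up bytes, not the site's actual response):

```python
# 0xd8 starts a two-byte UTF-8 sequence, but 'c' is not a valid continuation byte,
# reproducing the "invalid continuation byte" error from above.
bad = b'ab\xd8cd'

try:
    bad.decode('utf-8')
except UnicodeDecodeError as e:
    print(e.reason)  # invalid continuation byte

# With errors='ignore', the offending byte is silently dropped:
print(bad.decode('utf-8', errors='ignore'))  # abcd
```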
I'm building a web scraper using BeautifulSoup on Python 3.3
However, I hit a problem which prevents me from getting a valid string that I can use with BeautifulSoup. That is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 7047: invalid continuation byte
I know there are dozens of similar questions but I haven't so far found a method that can help me to diagnose what's wrong with the following code:
import urllib.request
URL = "<url>" # sorry, I cannot show the url for privacy reasons, but it's a normal html document
page = urllib.request.urlopen(URL)
page = page.read().decode("utf-8") # from bytes to <source encodings>
As far as I can tell, this error occurs only with some URLs and not with others. Even with the same URL, I wasn't getting this error until yesterday; then today I ran the program again and the error popped up.
Any clue on how to diagnose the error?
You should not decode the response. First of all, you are incorrectly assuming the response is UTF-8 encoded (it is not, as the error shows), but more importantly, BeautifulSoup will detect the encoding for you. See the Encodings section of the BeautifulSoup documentation.
Pass a byte string to BeautifulSoup and it'll use any <meta> header proclaiming the correct encoding, or do a great job of autodetecting the encoding for you.
In the event that auto-detection fails, you can always fall back to the server-provided encoding:
encoding = page.info().get_charset()
page = page.read()
soup = BeautifulSoup(page)
if encoding is not None and soup.original_encoding != encoding:
    print('Server and BeautifulSoup disagree')
    print('Content-Type states it is {}, BS4 thinks it is {}'.format(
        encoding, soup.original_encoding))
    print('Forcing encoding to server-supplied codec')
    soup = BeautifulSoup(page, from_encoding=encoding)
This still leaves the actual decoding to BeautifulSoup, but if the server included a charset parameter in the Content-Type header then the above assumes that the server is correctly configured and forces BeautifulSoup to use that encoding.
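The server-supplied value comes out of the response's email-style headers; the parsing behind get_charset() can be sketched with the stdlib alone (a toy Message standing in for page.info()):

```python
from email.message import Message

# page.info() on a urllib response returns an email-style message object;
# this toy Message stands in for it here.
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

print(headers.get_content_charset())  # utf-8
```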
I'm not familiar with BeautifulSoup's encoding handling. When I work with some pages, some attributes are Chinese, and I want to use such a Chinese attribute to extract tags.
For example, given HTML like the below:
<P class=img_s>
<A href="/pic/93/b67793.jpg" target="_blank" title="查看大图">
<IMG src="/pic/93/s67793.jpg">
</A>
</P>
I want to extract '/pic/93/b67793.jpg', so what I did is:
img_urls = form_soup.findAll('a',title='查看大图')
and encounter:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 0: ordinal not in range(128)
To tackle this, I have tried two methods; both failed.
One way is:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
Another way is:
response = unicode(response, 'gb2312','ignore').encode('utf-8','ignore')
You need to pass in unicode to the findAll method:
# -*- coding: utf-8
...
img_urls = form_soup.findAll('a', title=u'查看大图')
Note the u unicode literal marker in front of the title value. You do need to specify an encoding on your source file for this to work (the coding comment at the top of the file), or switch to unicode escape codes instead:
img_urls = form_soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
Internally, BeautifulSoup uses unicode, but you are passing it a byte-string with non-ascii characters in them. BeautifulSoup tries to decode that to unicode for you and fails as it doesn't know what encoding you used. By providing it with ready-made unicode instead you side-step the issue.
Working example:
>>> from BeautifulSoup import BeautifulSoup
>>> example = u'<P class=img_s>\n<A href="/pic/93/b67793.jpg" target="_blank" title="\u67e5\u770b\u5927\u56fe">\n<IMG src="/pic/93/s67793.jpg"></A></P>'
>>> soup = BeautifulSoup(example)
>>> soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
[<a href="/pic/93/b67793.jpg" target="_blank" title="查看大图"><img src="/pic/93/s67793.jpg" /></a>]
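As a quick check, the escape form and the literal spell the same string, so either works as the title value:

```python
# The \u escapes avoid any dependence on the source file's encoding declaration.
assert '\u67e5\u770b\u5927\u56fe' == '查看大图'
print(len('\u67e5\u770b\u5927\u56fe'))  # 4
```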
Beautiful Soup 4.1.0 will automatically convert attribute values from UTF-8, which solves this problem.