BeautifulSoup4 cannot get the printing right. Python3

BeautifulSoup4 cannot get the printing right. Python3 - python

I'm currently in the learning process of Python3, I am scraping a site for some data, which works fine, but when it comes to printing out the p tags I just can't get it to work as I expect.
import urllib
import lxml
from urllib import request
from bs4 import BeautifulSoup
data = urllib.request.urlopen('www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style' : 'padding-left: 10px';})
dialog = stat.findChildren('p')
for child in dialog:
childtext = child.get_text()
#have tried child.string aswell (exactly the same result)
childlist.append(childtext.encode('utf-8', 'ignore')
#Have tried with str(childtext.encode('utf-8', 'ignore'))
print (childlist)
That all works, but the printing is "bytes"
b'This is a ptag.string'
b'\xc2\xa0 (probably &nbsp'
b'this is anotherone'
Real sample text that is ascii encoded:
b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
Note that Announcement is p and the rest is 'strong' under a p tag.
Same sample with utf-8 encode
b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "
I WISH to get:
"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
As you see, the incorrect chars are stripped in "ascii", but as some are that destroys some linebreaks and I have yet to figure out how to print that correctly, also, the b's are still there then!
I really can't figure out how to remove b's and encode or decode properly. I have tried every "solution" that I can google up.
HTML Content = utf-8
I would most rather not change the full data before processing because it will mess up my other work and I don't think it is needed.
Prettify does not work.
Any suggestions?

First, you're getting output of the form b'stuff' because you are calling .encode(), which returns a bytes object. If you want to print strings for reading, keep them as strings!
As a guess, I assume you're looking to print strings from HTML nicely, pretty much as they would be seen in a browser. For that, you need to decode the HTML string encoding, as described in this SO answer, which for Python 3.5 means:
import html
html.unescape(childtext)
Among other things, this will convert any sequences in the HTML string into '\xa0' characters, which are printed as spaces. However, if you want to break lines on these characters despite literally meaning "non-breaking space", you'll have to replace those with actual spaces before printing, e.g. using x.replace('\xa0', ' ').

Related

How can I 'translate' all unicode codes in a string to the actual symbols using Python 3?

I'm parsing web content to isolate the body of news articles from a certain site, for which I'm using urllib.request to retrieve the source code for the article webpage and isolate the main text. However, urllib takes characters like "ç" and puts it into a python string as its utf-8 notation, "c387". It does the same for the '”' and "„" characters, which print as an 'e' followed by a set of numbers. This is very irritating when trying to read the article and thus needs to be resolved. I could loop through the article and change every recognizable utf-8 code to the actual character using a tedious function, but I was wondering if there was a way to do that more easily.
For an example, the current output of my program might be:
e2809eThis country doesn't...e2809d
I would like it to be:
„This country doesn't...”
Note: I've already checked the source code of the web page, which just uses these 'special' characters, so it's definitely a urllib issue.
Thanks in advance!

urllib returns bytes:
>import urllib
>url = 'https://stackoverflow.com/questions/62085906'
>data = urllib.request.urlopen(url).read()
>type(data)
bytes
>idx = data.index(b'characters like')
>data[idx:idx+20]
b'characters like "\xc3\xa7"'
Now, let's try to interpret this as utf-8:
>data[idx:idx+20].decode('utf-8')
'characters like "ç"'
Et voilà!

Fetching URL and converting to UTF-8 Python

I would like to do my first project in python but I have problem with coding. When I fetch data it shows coded letters instead of my native letters, for example '\xc4\x87' instead of 'ć'. The code is below:
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test)
print(sys.stdin.encoding)
z = "ł"
print(z)
print(z.encode("utf-8"))
I know that code here is poor but I tried many options to change encoding. I wrote z = "ł" to check if it can print any 'special' letter and it shows. I tried to encode it and it works also as it should. Sys.stdin.encoding shows cp852.

The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.
You appear to have downloaded UTF-8 data; you'd have to decode that data first before you had text:
test = page.read().decode('utf8')
However, it is up to the server to tell you what data was received. Check for a characterset in the headers:
encoding = page.info().getparam('charset')
This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.
You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.

The urlopen is returning to you a bytes object. That means it's a raw, encoded stream of bytes. Python 3 prints that in a repr format, which uses escape codes for non-ASCII characters. To get the canonical unicode you would have to decode it. The right way to do that would be to inspect the header and look for the encoding declaration. But for this we can assume UTF-8 and you can simply decode it as such, not encode it.
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test.decode("utf-8")) # <- note change
Now, Python 3 defaults to UTF-8 source encoding. So you can embed non-ASCII like this if your editor supports unicode and saving as UTF-8.
z = "ł"
print(z)
Printing it will only work if your terminal supports UTF-8 encoding. On Linux and OSX they do, so this is not a problem there.

The others are correct, but I'd like to offer a simpler solution. Use requests. It's 3rd party, so you'll need to install it via pip:
pip install requests
But it's a lot simpler to use than the urllib libraries. For your particular case, it handles the decoding for you out of the box:
import requests
r = requests.get("http://olx.pl/")
print(r.encoding)
# UTF-8
print(type(r.text))
# <class 'str'>
print(r.text)
# The HTML
Breakdown:
get sends an HTTP GET request to the server and returns the respose.
We print the encoding requests thinks the text is in. It chooses this based on the response header Martijin mentions.
We show that r.text is already a decoded text type (unicode in Python 2 and str in Python 3)
Then we actually print the response.
Note that we don't have to print the encoding or type; I've just done so for diagnostic purposes to show what requests is doing. requests is designed to simplify a lot of other details of working with HTTP requests, and it does a good job of it.

lxml.html parsing and utf-8 with requests

i used requests to retrieve a url which contains some unicode characters, and want to do some processing with it , then write it out.
r=requests.get(url)
f=open('unicode_test_1.html','w');f.write(r.content);f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
in unicode_test_1.html, all chars looks fine, but in unicode_test_2.html, some chars changed to gibberish, why is that ?
i then tried
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html,encoding='latin1')
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
it seems it's working now. but i don't know why is this happening, always use latin1 ?
what's the difference between r.text and r.content, and why can't i write html out using encoding='utf-8' ?

You've not specified if you're using python 2 or 3. Encoding is handled quite differently depending on which version you're using. The following advice is more or less universal anyway.
The difference between r.text and r.content is in the Requests docs. Simply put Requests will attempt to figure out the character encoding for you and return Unicode after decoding it. This which is accessible via r.text. To get just the bytes use r.content.
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always use one encoding over another. Make a Unicode sandwich by doing any I/O in bytes and work with Unicode in your application. If you start with bytes (isinstance(mytext, str)) you need to know the encoding to decode to Unicode, if you start with Unicode (isinstance(mytext, unicode)) you should encode to UTF-8 as it will handle all the worlds characters.
Make sure your editor, files, server and database are configured to UTF-8 also otherwise you'll get more 'gibberish'.
If you want further help post the source files and output of your script.

BeautifulSoup 4 converting HTML entities to unicode, but getting junk characters when using print

I am trying to scrape text from the web using BeautifulSoup 4 to parse it out. I am running into an issue when printing bs4 processed text out to the console. Whenever I hit a character that was originally an HTML entity, like ’ I get garbage characters on the console. I believe bs4 is converting these entities to unicode correctly because if I try using another encoding to print out the text, it will complain about the appropriate lack of unicode mapping for a character (like u'\u2019.) I'm not sure why the print function gets confused over these characters. I've tried changing around fonts, which changes the garbage characters, and am on a Windows 7 machine with US-English locale. Here is my code for reference, any help is appreciated. Thanks in advance!
#!/usr/bin/python
import json
import urllib2
import cookielib
import bs4
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Tiguan\
&page=0&api-key=blah"
response = opener.open(url)
articles = response.read()
decoded = json.loads(articles)
totalpages = decoded['response']['meta']['hits']/10
for page in range(totalpages + 1):
if page>0:
url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?\
q=Tiguan&page=" + str(page) + "&api-key=blah"
response = opener.open(url)
articles = response.read()
decoded = json.loads(articles)
for url in decoded['response']['docs']:
print url['web_url']
urlstring = url['web_url']
art = opener.open(urlstring)
soup = bs4.BeautifulSoup(art.read())
goodstuff = soup.findAll('nyt_text')
for tag in goodstuff:
print tag.prettify().encode("UTF")

The problem has nothing to do with bs4, or HTML entities, or anything else. You could reproduce the exact same behavior, on most Windows systems, with a one-liner program to print out the same characters that are appearing as garbage when you try to print them, like this:
print u'\u2019'.encode('UTF-8')
The problem here is that, like the vast majority of Windows systems (and nothing else anyone uses in 2013), your default character set is not UTF-8, but something like CP1252.
So, when you encode your Unicode strings to UTF-8 and print those bytes to the console, the console interprets them as CP1252. Which, in this case, means you get â€™ instead of ’.
Changing fonts won't help. The UTF-8 encoding of \u2013 is the three bytes \xe2, \x80, and \x99, and the CP1252 meaning of those three bytes is â, €, and ™.
If you want to encode manually for the console, you need to encode to the right character set, the one your console actually uses. You may be able to get that as sys.stdout.encoding.
Of course you may get an exception trying to encode things for the right character set, because 8-bit character sets like CP1252 can only handle about 240 of the 110K characters in Unicode. The only way to handle that is to use the errors argument to encode to either ignore them or replace them with replacement characters.
Meanwhile, if you haven't read the Unicode HOWTO, you really need to. Especially if you plan to stick with Python 2.x and Windows.
If you're wondering why a few command-line programs seem to be able to get around these problems: Microsoft's solution to the character set problem is to create a whole parallel set of APIs that use 16-bit characters instead of 8-bit, and those APIs always use UTF-16. Unfortunately, many things, like the portable stdio wrappers that Microsoft provides for talking to the console and that Python 2.x relies on, only have the 8-bit API. Which means the problem isn't solved at all. Python 3.x no longer uses those wrappers, and there have been recurring discussions on making some future version talk UTF-16 to the console. But even if that happens in 3.4 (which seems very unlikely), that won't help you as long as you're using 2.x.

#abarnert's answer contains a good explanation of the issue.
In your particular case, you could just pass encoding parameter to prettify() instead of default utf-8.
If you are printing to console, you could try to print Unicode directly:
print soup.prettify(encoding=None, formatter='html') # print Unicode
It may fail. If you pass ascii; then BeautifulSoup may use numerical character references instead of non-ascii characters:
print soup.prettify('ascii', formatter='html')
It assumes that current Windows codepage is ascii-based encoding (most of them do). It should also work if the output is redirected to a file or another program via a pipe.
For portability, you could always print Unicode (encoding=None above) and use PYTHONIOENCODING to get appropriate character encoding e.g., utf-8 for files, pipes and ascii:xmlcharrefreplace to avoid garbage in a console.

Python Beautiful Soup, saved text not displaying properly in original encoding

I'm having trouble with a saved file not displaying properly when using it's original encoding.
I'm downloading a web page, searching it for content I want and then writing that content to a file.
The encoding on the site is 'iso-8859-1' or so chrome and beautiful soup tell me and it appears perfectly when viewed using that encoding on the original site.
When I download the page and try to view it however I end up with strange characters (HTML Entities?) like these:
â€œ , â€™
If I manually set Chromes encoding to 'Utf-8' when viewing the saved page it appears normally, as does the original page if I set that to 'Utf-8'.
I'm not sure what to do with this, I would change the encoding before writing the text to a file but I get ascii errors when I try that.
Here is a sample page (possible adult content):
http://original.adultfanfiction.net/story.php?no=600106516
And the code I am using to get the text from the page:
site = requests.post(url, allow_redirects=False)
html = site.text
soup = BeautifulSoup(html)
rawStory = soup.findAll("td",{"colspan" : '3'})
story = str(rawStory)
return story
I turn the ResultSet into a string so that I can write it to a file, I don't know if that could be part of the problem, if I print the html to the console after requesting it but before doing anything to it it displays improperly in the console as well.

I'm 90% sure that your problem is just that you're asking BeautifulSoup for a UTF-8 fragment and then trying to use it as ISO-8859-1, which obviously isn't going to work. The documentation explains all of this pretty nicely.
You're calling str. As Non pretty printing explains:
If you just want a string, with no fancy formatting, you can call unicode() or str() on a BeautifulSoup object, or a Tag within it… The str() function returns a string encoded in UTF-8.
As Output encoding explains:
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with.
It then follows up with an example of almost exactly what you're doing—parsing a Latin-1 HTML document and writing it back as UTF-8—and then immediately explaining how to fix it:
If you don’t want UTF-8, you can pass an encoding into prettify()… You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string…
So, that's all you have to do.
However, you've got another problem before you get there. When you call findAll, you don't get back a tag, you get back a ResultSet, which is basically a list of tags. Just as calling str on a list of strings gives you brackets, commas, and the repr of each string (with extraneous quotes and backlash escapes for non-printable-ASCII characters) instead of the strings themselves, calling str on a ResultSet gives you something similar. And you obviously can't call Tag methods on a ResultSet.
Finally, I'm not sure what problem you're actually trying to solve. You're creating a HTML fragment. Ignoring the fact that a fragment isn't a valid document, and browsers shouldn't strictly speaking display it in the first place, it doesn't specify an encoding, meaning the browser can only get that information from some out-of-band place, like you selecting a menu item. Changing it to Latin-1 won't actually "fix" things, it'll just mean that now you get the right display when you pick Latin-1 in the menu and wrong when you pick UTF-8 instead of vice-versa. Why not actually create a full HTML document that actually has a meta http-equiv that actually means what you want it to mean, instead of trying to figure out how to trick Chrome into guessing what you want it to guess?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.