I'm scraping news articles from various sites, using GAE and Python.
The code where I scrape one article url at a time leads to the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)
Here's my code in its simplest form:
from google.appengine.api import urlfetch

def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        return result.content
Here is another variant I have tried, with the same result:
def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        s = s.decode('utf-8')
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s
Here's the ugly, brittle one, which also doesn't work:
def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        try:
            s = s.decode('iso-8859-1')
        except:
            pass
        try:
            s = s.decode('ascii')
        except:
            pass
        try:
            s = s.decode('GB2312')
        except:
            pass
        try:
            s = s.decode('Windows-1251')
        except:
            pass
        try:
            s = s.decode('Windows-1252')
        except:
            s = "did not work"
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s
The last variant returns s as the string "did not work" from the last except.
So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?
Why have I decided to scrape the entire HTML rather than just the parts I need with BeautifulSoup? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.
Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.
I had the same problem some time ago and there is nothing 100% accurate. What I did was:

1. Get the encoding from the Content-Type header
2. Get the encoding from the meta tags
3. Detect the encoding with the chardet Python module
4. Decode the text from the most common encoding to Unicode
5. Process the text/html
It is better to simply read the Content-Type from the meta tags or headers. Note that Chrome (unlike Opera) does not guess the encoding: if UTF-8 or anything else is not declared in either of those places, it treats the site as using the Windows default encoding. So only really badly-behaved sites fail to define it. A sketch of the whole fallback chain is below.
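A minimal sketch of steps 1-4, assuming the requests and chardet packages are available; the detect_encoding helper and its fallback order are my own, not a library API:

import re
import chardet
import requests

def detect_encoding(response):
    # 1. Content-Type header, e.g. "text/html; charset=iso-8859-1"
    match = re.search(r'charset=([\w-]+)', response.headers.get('content-type', ''))
    if match:
        return match.group(1)
    # 2. HTML meta tag in the raw bytes of the page
    match = re.search(br'<meta[^>]*charset=["\']?([\w-]+)', response.content, re.I)
    if match:
        return match.group(1).decode('ascii')
    # 3. statistical guess with chardet
    guess = chardet.detect(response.content)
    if guess['encoding']:
        return guess['encoding']
    # 4. fall back to the most common encoding on the web
    return 'utf-8'

response = requests.get('http://example.com/')
text = response.content.decode(detect_encoding(response), 'replace')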
Related
I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant but I have no chance to change it.
What is the way to access a resource pointed by a URL containing non-ascii characters using Python?
edit: In other words, can urlopen open a URL like the following, and if so, how?
http://example.org/Ñöñ-ÅŞÇİİ/
Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
import re, urlparse

def urlEncodeNonAscii(b):
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

def iriToUri(iri):
    parts = urlparse.urlparse(iri)
    return urlparse.urlunparse(
        part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )
>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA-encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
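For example, a minimal Python 2 sketch of that construct-it-yourself approach; the host and path values here are just illustrative:

import urllib

host = u'www.a\u0131b.com'.encode('idna')           # IDNA-encode only the hostname
path = urllib.quote(u'/a\u0131b'.encode('utf-8'))   # UTF-8 then percent-encode the path
url = 'http://%s%s' % (host, path)
print url  # http://www.xn--ab-hpa.com/a%C4%B1b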
In python3, use the urllib.parse.quote function on the non-ascii string:
>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Python 3 has libraries to handle this situation. Use
urllib.parse.urlsplit to split the URL into its components, and
urllib.parse.quote to properly quote/escape the unicode characters
and urllib.parse.urlunsplit to join it back together.
>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
It is more complex than @bobince's accepted answer suggests:
netloc should be encoded using IDNA;
non-ascii URL path should be encoded to UTF-8 and then percent-escaped;
non-ascii query parameters should be encoded to the encoding of a page URL was extracted from (or to the encoding server uses), then percent-escaped.
This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy is using); see w3lib.url.safe_url_string:
from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")
An easy way to check whether a URL-escaping implementation is incorrect or incomplete is to check whether it provides a 'page encoding' argument.
Based on @darkfeline's answer:
from urllib.parse import urlsplit, urlunsplit, quote

def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        netloc = netloc.encode('idna').decode('utf-8')
        path = quote(path)
        query = quote(query)
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))
    return uri
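For example, with the IRI from earlier in this thread:

>>> iri2uri('http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%C4%B1b'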
For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
For example, with http://bücher.ch:
>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
Encode the unicode to UTF-8, then URL-encode.
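For example, a minimal Python 2 sketch of those two steps, with an illustrative path segment:

import urllib

path = u'\u0131stanbul'            # illustrative non-ASCII text
utf8_bytes = path.encode('utf-8')  # 1. encode the unicode to UTF-8
print urllib.quote(utf8_bytes)     # 2. URL-encode -> %C4%B1stanbul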
Use the iri2uri method of httplib2. It does the same thing as bobince's answer (is he/she the author of that?).
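A hedged sketch of using it, assuming an httplib2 release where the function lives in the httplib2.iri2uri submodule:

from httplib2.iri2uri import iri2uri

print(iri2uri(u'http://example.org/\u0131'))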
Another option to convert an IRI to an ASCII URI is to use furl package:
gruns/furl: 🌐 URL parsing and manipulation made easy. - https://github.com/gruns/furl
Python's standard urllib and urlparse modules provide a number of URL
related functions, but using these functions to perform common URL
operations proves tedious. Furl makes parsing and manipulating URLs
easy.
Examples
Non-ASCII domain
http://国立極地研究所.jp/english/ (Japanese National Institute of Polar Research website)
import furl
url = 'http://国立極地研究所.jp/english/'
furl.furl(url).tostr()
'http://xn--vcsoey76a2hh0vtuid5qa.jp/english/'
Non-ASCII path
https://ja.wikipedia.org/wiki/日本語 ("Japanese" article in Wikipedia)
import furl
url = 'https://ja.wikipedia.org/wiki/日本語'
furl.furl(url).tostr()
'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
Works! Finally.
I could not avoid these strange characters at first, but in the end I got through it.
import urllib.request
import os

url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as file:
    html = file.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(html.decode('utf-8'))
os.system("marketingturismo.html")
# -*- coding: utf-8 -*-

def scrapeFacebookPageFeedStatus(page_id, access_token):
    with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link",
                    "status_published", "num_likes", "num_comments", "num_shares"])

        has_next_page = True
        num_processed = 0  # keep a count of how many we've processed
        scrape_starttime = datetime.datetime.now()

        print "Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime)

        statuses = getFacebookPageFeedData(page_id, access_token, 100)

        while has_next_page:
            for status in statuses['data']:
                w.writerow(processFacebookPageFeedStatus(status))

                # output progress occasionally to make sure code is not stalling
                num_processed += 1
                if num_processed % 1000 == 0:
                    print "%s Statuses Processed: %s" % (num_processed, datetime.datetime.now())

            # if there is no next page, we're done.
            if 'paging' in statuses.keys():
                statuses = json.loads(request_until_succeed(statuses['paging']['next']))
            else:
                has_next_page = False

        print "\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime)

scrapeFacebookPageFeedStatus(page_id, access_token)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 40-43: ordinal not in range(128)
I'm writing code to scrape through Facebook pages to gather all the posts in a CSV file.
The code works properly when there is only English, but the error above appears when I try to scrape pages that post in Arabic.
I know the solution is to use UTF-8, but I don't know how to implement it in the code.
Your problem is probably not in this code; I suspect it is in your processFacebookPageFeedStatus function. But when you are formatting your fields, you'll want to make sure that any which may contain unicode characters are all encoded (or decoded, as appropriate) as UTF-8.
field_a = u"some unicode text in here"
field_a.encode('utf-8')                  # unicode -> UTF-8 encoded bytes
field_a.encode('utf-8').decode('utf-8')  # back to the original unicode
Python 2's csv module doesn't support unicode, so you need to encode each field in your source data to UTF-8 before writing it.
Debugging unicode is a pain, but there are a lot of SO posts about different problems related to encoding/decoding unicode
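A hedged sketch of what that per-field encoding could look like in the question's loop; encode_row is my own helper name, while w and processFacebookPageFeedStatus come from the question's code:

def encode_row(row):
    # Python 2's csv module writes bytes, so turn every unicode field
    # into UTF-8 bytes and leave non-string fields alone.
    return [field.encode('utf-8') if isinstance(field, unicode) else field
            for field in row]

w.writerow(encode_row(processFacebookPageFeedStatus(status)))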
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

I added this piece of code and it works fine when I open the file in pandas. There are no other errors whatsoever for now.
I'm currently trying to hit the google tts url, http://translate.google.com/translate_tts with japanese characters and phrases in python using the requests library.
Here is an example:
http://translate.google.com/translate_tts?tl=ja&q=ひとつ
However, when I try to use the python requests library to download the mp3 that the endpoint returns, the resulting mp3 is blank. I have verified that I can hit this URL in requests using non-unicode characters (via romanji) and have gotten correct responses back.
Here is a part of the code I am using to make the request
langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
Also, if I print text or url within this snippet, the kana/kanji is rendered correctly in my console.
Edit:
If I attempt to encode the unicode and quote it as such, I still get the same response.
# -*- coding: utf-8 -*-
from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    text = urllib.quote(text.encode('utf-8'))
    url = 'http://translate.google.com/translate_tts?tl=%(glang)s&q=%(text)s' % locals()
    print url
    if download:
        result = requests.get(url)
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
Which returns this:
http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
Which seems like it should work, but doesn't.
Edit 2:
If I attempt to use urllib/urllib2, I get a 403 error.
Edit 3:
So, it seems that this problem/behavior is limited to this endpoint. If I try a different endpoint, like the following URL:
http://www.kanjidamage.com/kanji/13-un-%E4%B8%8D
from within requests and my browser, I get the same response (they match). The same happens if I send only ASCII characters to the TTS server, like this URL:
http://translate.google.com/translate_tts?tl=ja&q=sayonara
I get the same response as well (they match again). But if I attempt to send unicode characters to this URL, I get a correct audio file in my browser, while requests returns an audio file with no sound.
http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
So, it seems like this behavior is limited to the Google TTS URL?
The user agent can be part of the problem, however, it is not in this case. The translate_tts service rejects (with HTTP 403) some user agents, e.g. any that begin with Python, curl, wget, and possibly others. That is why you are seeing a HTTP 403 response when using urllib2.urlopen() - it sets the user agent to Python-urllib/2.7 (the version might vary).
You found that setting the user agent to Mozilla/5.0 fixed the problem, but that might work because the API might assume a particular encoding based on the user agent.
What you actually should do is to explicitly specify the URL character encoding with the ie field. Your URL request should look like this:
http://translate.google.com/translate_tts?ie=UTF-8&tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
Note the ie=UTF-8, which explicitly sets the URL character encoding. The spec does state that UTF-8 is the default, but that doesn't seem to be entirely true, so you should always set ie in your requests.
The API supports kanji, hiragana, and katakana (possibly others?). These URLs all produce "nihongo", although the audio produced for hiragana input has a slightly different inflection to the others.
import requests

one = u'\u3072\u3068\u3064'
kanji = u'\u65e5\u672c\u8a9e'
hiragana = u'\u306b\u307b\u3093\u3054'
katakana = u'\u30cb\u30db\u30f3\u30b4'

url = 'http://translate.google.com/translate_tts'

for text in one, kanji, hiragana, katakana:
    r = requests.get(url, params={'ie': 'UTF-8', 'tl': 'ja', 'q': text})
    print u"{} -> {}".format(text, r.url)
    open(u'/tmp/{}.mp3'.format(text), 'wb').write(r.content)
I made this little method a while back to help me with UTF-8 encoding. I was having issues printing Cyrillic and CJK languages to CSVs and this did the trick.
def assist(unicode_string):
    utf8 = unicode_string.encode('utf-8')
    read = utf8.decode('string_escape')
    return read  # UTF-8 encoded string
Also, make sure you have these two lines at the beginning of your .py.
#!/usr/bin/python
# -*- coding: utf-8 -*-
The first line is just a good Python habit: it specifies which interpreter should run the .py (really only useful if you have more than one version of Python on your machine). The second line specifies the encoding of the Python file. A slightly longer answer for this is given here.
Setting the User-Agent to Mozilla/5.0 fixes this issue.
from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text}, headers={'User-Agent': 'Mozilla/5.0'})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
I am reading text from two different .txt files and concatenating them together, then adding the result to the body of an email via webbrowser.
One text file contains English characters (ASCII) and the other Japanese (UTF-8). The text displays fine if I write it to a text file, but if I use webbrowser to insert the text into an email body, the Japanese text displays as question marks.
I have tried running the script on multiple machines that have different mail clients as their defaults. Initially I thought maybe that was the issue, but that does not appear to be the case: Thunderbird and Mail (Mac OS X) both display question marks.
Hello. Today is 2014-05-09
????????????????2014-05-09????
I have looked at similar issues around on SO but they have not solved the issue.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Japanese in python function
Printing out Japanese (Chinese) characters
python utf-8 japanese
Is there a way to have the Japanese (UTF-8) display in the body of an email created with webbrowser in python? I could use the email functionality but the requirement is the script needs to open the default mail client and insert all the information.
The code and text files I am using are below. I have simplified it to focus on the issue.
email-template.txt
Hello. Today is {{date}}
email-template-jp.txt
こんにちは。今日は {{date}} です。
Python Script
#
# -*- coding: utf-8 -*-
#

import sys
import re
import os
import glob
import webbrowser
import codecs

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

# vars
date_range = sys.argv[1:][0]

email_template_en = "email-template.txt"
email_template_jp = "email-template-jp.txt"
email_to_send = "email-to-send.txt"  # finished email is saved here

# Default values for the composed email that will be opened
mail_list = "test@test.com"
cc_list = "test1@test.com, test2@test.com"
subject = "Email Subject"

# Open email templates and insert the date from the parameters sent in
try:
    f_en = open(email_template_en, "r")
    f_jp = codecs.open(email_template_jp, "r", "UTF-8")
    try:
        email_content_en = f_en.read()
        email_content_jp = f_jp.read()
        email_en = re.sub(r'{{date}}', date_range, email_content_en)
        email_jp = re.sub(r'{{date}}', date_range, email_content_jp).encode("UTF-8")
        # this throws an error
        # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 26: ordinal not in range(128)
        # email_en_jp = (email_en + email_jp).encode("UTF-8")
        email_en_jp = (email_en + email_jp)
    finally:
        f_en.close()
        f_jp.close()
except Exception, e:
    raise e

# Open the default mail client and fill in all the information
try:
    f = open(email_to_send, "w")
    try:
        f.write(email_en_jp)
        # Does not send Japanese text to the mail client, but writes to the .txt file fine. Unsure why.
        webbrowser.open("mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)  # open mail client with prefilled info
    finally:
        f.close()
except Exception, e:
    raise e
edit: Forgot to add I am using Python 2.7.1
EDIT 2: Found a workable solution after all.
Replace your webbrowser call with this.
import subprocess
[... other code ...]
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp)
subprocess.call(["open", arg])
This will open your default email client on MacOS. For other OSes please replace "open" in the subprocess line with the proper executable.
EDIT: I looked into it a bit more and Mark's comment above made me read the RFC (2368) for mailto URL scheme.
The special hname "body" indicates that the associated hvalue is the
body of the message. The "body" hname should contain the content for
the first text/plain body part of the message. The mailto URL is
primarily intended for generation of short text messages that are
actually the content of automatic processing (such as "subscribe"
messages for mailing lists), not general MIME bodies.
And a bit further down:
8-bit characters in mailto URLs are forbidden. MIME encoded words (as
defined in [RFC2047]) are permitted in header values, but not for any
part of a "body" hname."
So it looks like this is not possible as per RFC, although that makes me question why the JavaScript solution in the JSFiddle provided by naota works at all.
I leave my previous answer as is below, although it does not work.
I have run into the same issues with Python 2.7.x quite a few times now, and every time a different solution somehow worked.
So here are several suggestions that may or may not work, as I haven't tested them.
a) Force unicode strings:
webbrowser.open(u"mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)
Notice the small u right after the opening ( and before the ".
b) Force the regex to use unicode:
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp).encode("UTF-8")
# or maybe
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp)
c) Another idea regarding the regex, try compiling it first with the re.UNICODE flag, before applying it.
pattern = re.compile(ur'{{date}}', re.UNICODE)
d) Not directly related, but I noticed you write the combined text via the normal open method. Try using the codecs.open here as well.
f = codecs.open(email_to_send, "w", "UTF-8")
Hope this helps.
Hello, everyone.
For my research projects, I have collected some web pages.
For example, http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3
As you can see on that web page, the committer's name is not in English.
Other web pages likewise have committers' names written in various languages other than English.
The following code is for handling committers' names.
import csv
import re
import urllib

def get_page(link):
    k = 1
    while k == 1:
        try:
            f = urllib.urlopen(link)
            htmlSource = f.read()
            return htmlSource
        except EnvironmentError:
            print ('Error occurred:', link)
        else:
            k = 2
            f.close()

def get_commit_info(commit_page):
    commit_page_string = str(commit_page)

    author_pattern = re.compile(r'<tr><th>author</th><td>(.*?)</td><td class=', re.DOTALL)
    t_author = author_pattern.findall(commit_page_string)
    t_author_string = str(t_author)

    author_point = re.search(" <", t_author_string)
    author = t_author_string[:author_point.start()]
    print author

git_url = "http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3"

commit_page = get_page(git_url)
get_commit_info(commit_page)
The result of 'print author' is as follows:
\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 \xd0\x9d\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b
How can I print the name exactly?
WELL... this will do what you want
author = 'Мирослав Николић'
print author.decode('utf8') # Мирослав Николић
But it also won't work if the encoding isn't UTF8...
Mostly things use utf8. Mostly.
Unicode is complicated stuff to get your head around. 'author' is a string object that contains bytes, and there is no information in those bytes to tell you what they represent. Absolutely none. You have to tell Python that this string of bytes is text encoded as UTF-8, so that it can look each byte sequence up in the UTF-8 tables and find the Unicode character it represents.
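A small illustration of that point (Python 2): the same two bytes, decoded two different ways.

data = '\xd0\x9c'            # raw bytes; nothing in them says what encoding they use
print data.decode('utf-8')   # М (correct: these bytes really are UTF-8)
print data.decode('latin-1') # mojibake (a wrong guess still "succeeds")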
You could detect the encoding for each page by looking at the meta tags. In html5 they would look like this:
<meta charset="utf-8">.
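A hedged sketch of reading that declaration from a fetched page (Python 2, regex-based; detect_meta_charset is my own helper name, and real pages may instead use the older http-equiv form, which this ignores):

import re

def detect_meta_charset(html, default='utf-8'):
    # look for an HTML5-style <meta charset="..."> declaration
    match = re.search(r'<meta\s+charset=["\']?([\w-]+)', html, re.IGNORECASE)
    return match.group(1) if match else default

html = get_page(git_url)  # reusing the helper and URL from the question
encoding = detect_meta_charset(html)
print author.decode(encoding)  # 'author' being the byte string from the question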