I am teaching myself how to parse Google search results with JSON, but when I run this code (which should work), I get this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\u2014' in position 5: character maps to <undefined>. Can someone help me?
import urllib
import simplejson

query = urllib.urlencode({'q': 'site:example.com'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s&start=50' \
    % (query)
search_results = urllib.urlopen(url)
json = simplejson.loads(search_results.read())
results = json['responseData']['results']
for i in results:
    print i['title'] + ": " + i['url']
This error may be caused by the encoding your console uses when Python sends unicode data to stdout. There's an article that talks about it.
Check stdout's encoding:
>>> import sys
>>> sys.stdout.encoding # On my machine I get this result:
'UTF-8'
Use unicode literals.
print i[u'title'] + u": " + i[u'url']
Also:
jsondata = simplejson.load(search_results)
My guess is that the error occurs on the simplejson.loads(search_results.read()) line, possibly because the default encoding your Python picks up is not UTF-8 while Google returns UTF-8.
Try: simplejson.loads(unicode(search_results.read(), "utf8"))
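In Python 3 the same idea, decoding the raw bytes explicitly before parsing, looks like this (the response body here is made up, since the original Google AJAX API no longer exists):

```python
import json

# Hypothetical raw response body, as urlopen().read() would return it: bytes.
raw = '{"title": "results \u2014 page 1"}'.encode('utf-8')

# Decode explicitly as UTF-8 before handing the text to the JSON parser.
parsed = json.loads(raw.decode('utf-8'))
print(parsed['title'])
```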
Related
I want to send zlib-compressed data from a file to a server using a POST request. The following is the code I am trying to use:
import zlib
from urllib2 import Request, urlopen

orig = open('fileName', 'rb').read()
comp = zlib.compress(orig, 9)
req = Request(url, comp)
urlopen(req)
But I get the following error: UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 2: invalid start byte
I tried comp.encode('utf-8'), but that doesn't work either; I get the same UnicodeDecodeError at some position. How can I resolve this?
The docs for urllib2.Request's data parameter state:
data should be a buffer in the standard application/x-www-form-urlencoded format.
You can encode your buffer using urllib.quote:
>>> import zlib
>>> from urllib import quote
>>> from urllib2 import Request
>>> orig = 'aaaaabbbccddxddaaabb'
>>> comp = zlib.compress(orig, 9)
>>> comp
'x\xdaKL\x04\x82\xa4\xa4\xa4\xe4\xe4\x94\x94\x8a\x94\x140\x07\x00Q\x19\x07\xc1'
>>> quoted = quote(comp)
>>> quoted
'x%DAKL%04%82%A4%A4%A4%E4%E4%94%94%8A%94%140%07%00Q%19%07%C1'
>>> req = Request('http://example.com', quoted)
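A Python 3 sketch of the same idea (urllib.quote moved to urllib.parse.quote there), including the reverse step a server or test would perform:

```python
import zlib
from urllib.parse import quote, unquote_to_bytes

orig = b'aaaaabbbccddxddaaabb'
comp = zlib.compress(orig, 9)

# Percent-encode the compressed bytes so they are safe to send in a
# form-encoded request body.
quoted = quote(comp)

# The receiving side can reverse the process: unquote back to bytes,
# then decompress.
restored = zlib.decompress(unquote_to_bytes(quoted))
assert restored == orig
```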
I need to use UTF-8 characters in dryscrape's set method, but after running it I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
My code (for example):
site = dryscrape.Session()
site.visit("https://www.website.com")
search = site.at_xpath('//*[@name="search"]')
search.set(u'فارسی')
search.form().submit()
I also tried changing u'فارسی' to search.set(unicode('فارسی', 'utf-8')), but it raises the same error.
It's very easy... This method works perfectly with Google. You can also try it with any other site if you know its URL params.
import urllib.parse

import dryscrape as d

d.start_xvfb()
br = d.Session()

query = urllib.parse.quote("فارسی")
print(query)  # prints: %D9%81%D8%A7%D8%B1%D8%B3%DB%8C
url = "http://google.com/search?q=" + query
br.visit(url)
print(br.xpath('//title')[0].text())
# prints: Google Search - فارسی
# You can also check it with br.render("url_screenshot.png")
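The crucial step is the percent-encoding; here is a minimal standalone check of the round trip, with no browser needed:

```python
from urllib.parse import quote, unquote

# quote() percent-encodes the UTF-8 bytes of the text.
query = quote("فارسی")
print(query)           # %D9%81%D8%A7%D8%B1%D8%B3%DB%8C

# unquote() restores the original text.
print(unquote(query))  # فارسی
```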
Scraping a site with Chinese symbols.
How do I scrape Chinese symbols?
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring

URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'

def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)
    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)

def main():
    parse_items()

if __name__ == '__main__':
    main()
The error looks like this:
http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
File "parser.py", line 27, in <module>
main()
File "parser.py", line 24, in main
parse_items()
File "parser.py", line 20, in parse_items
print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
From the print syntax and the imports, I assume you are using Python 3, which matters for unicode.
So we can expect that href, title and title2 are all Python 3 (unicode) strings. But the print function has to convert those strings to an encoding acceptable to the output system, and for a reason I cannot know, your system defaults to ASCII, hence the error.
How to fix it:
The best way would be to make your system accept unicode. On Linux and other Unixes, you can declare a UTF-8 charset in the LANG environment variable (export LANG=en_US.UTF-8); on Windows you can try chcp 65001, but this latter is far from guaranteed to work.
If that does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out the offending characters, because Python 3 natively uses unicode strings.
I would use:
import sys

def u_filter(s, encoding=sys.stdout.encoding):
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)
That means: if s is a string, encode it with the encoding used for stdout, replacing any unconvertible character with a replacement character, then decode it back into a now-clean string.
and next:
def fprint(*args, **kwargs):
    fargs = [u_filter(arg) for arg in args]
    print(*fargs, **kwargs)
This means: filter any offending characters out of the string arguments and print the rest unchanged.
With that you can safely replace your print throwing the exception with:
fprint(href, title, title2)
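A quick standalone demonstration of the errors='replace' behaviour the filter relies on (ASCII is forced here so the result is deterministic; the Chinese characters are just an illustration):

```python
def u_filter(s, encoding='ascii'):
    # Encode with replacement, then decode back: any character the target
    # encoding cannot represent becomes '?'.
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)

print(u_filter('price: \u4e2d\u6587 10'))  # price: ?? 10
print(u_filter(42))                        # non-strings pass through unchanged
```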
I'm trying to get a response from urllib and decode it
to a readable format. The text is in Hebrew and also contains characters like { and /
top page coding is:
# -*- coding: utf-8 -*-
The raw bytes are:
b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
Now I'm trying to decode it using:
data = data.decode()
and I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Your problem is that that is not UTF-8. You have UTF-16 encoded data, decode it as such:
>>> data = b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
>>> data.decode('utf16')
'{ \r\n"id" : "1404830064696",\r\n"title" : "פיקוד העורף התרעה במרחב ",\r\n"data" : []\r\n}\r\n\r\n'
>>> import json
>>> json.loads(data.decode('utf16'))
{'title': 'פיקוד העורף התרעה במרחב ', 'id': '1404830064696', 'data': []}
If you loaded this from a website with urllib.request, the Content-Type header should contain a charset parameter telling you this; if response is the returned urllib.request response object, then use:
codec = response.info().get_content_charset('utf-8')
This defaults to UTF-8 when no charset parameter has been set, which is the appropriate default for JSON data.
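To see get_content_charset in isolation: urllib's response.info() returns an http.client.HTTPMessage, which subclasses email.message.Message, so the lookup can be sketched with a bare Message (the header values here are made up):

```python
from email.message import Message

# Simulate a response header carrying an explicit charset parameter.
msg = Message()
msg['Content-Type'] = 'application/json; charset=utf-16'
print(msg.get_content_charset('utf-8'))   # utf-16

# Without a charset parameter, the fallback argument is returned.
msg2 = Message()
msg2['Content-Type'] = 'application/json'
print(msg2.get_content_charset('utf-8'))  # utf-8
```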
Alternatively, use the requests library to load the JSON response, it handles decoding automatically (including UTF-codec autodetection specific to JSON responses).
One further note: the PEP 263 source code codec comment is used only to interpret your source code, including string literals. It has nothing to do with encodings of external sources (files, network data, etc.).
I got this error in Django with Python 3.4, while trying to get this to work with django-rest-framework.
This is the code that fixed the UnicodeDecodeError: 'utf-8' codec can't decode byte error for me.
This is the passing test:
import os
from os.path import join, dirname
import uuid

from rest_framework.test import APITestCase

class AttachmentTests(APITestCase):
    def setUp(self):
        self.base_dir = dirname(dirname(dirname(__file__)))
        self.image = join(self.base_dir, "source/test_in/aaron.jpeg")
        self.image_filename = os.path.split(self.image)[1]

    def test_create_image(self):
        id = str(uuid.uuid4())
        with open(self.image, 'rb') as data:
            # data = data.read()
            post_data = {
                'id': id,
                'filename': self.image_filename,
                'file': data
            }
            response = self.client.post("/api/admin/attachments/", post_data)
            self.assertEqual(response.status_code, 201)
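The decisive detail is opening the file in binary mode ('rb'): text mode tries to decode the bytes as UTF-8 and fails on binary data such as a JPEG. A minimal reproduction with a throwaway file:

```python
import os
import tempfile

# Write a few bytes that are not valid UTF-8 (like a JPEG header).
path = os.path.join(tempfile.mkdtemp(), 'blob.bin')
with open(path, 'wb') as f:
    f.write(b'\xff\xd8\xff\xe0')

try:
    open(path, 'r', encoding='utf-8').read()  # text mode: decode fails
except UnicodeDecodeError as e:
    print('text mode failed:', e.reason)

data = open(path, 'rb').read()  # binary mode: returns the raw bytes
assert data == b'\xff\xd8\xff\xe0'
```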
I am really struggling to find an answer to this.
I am writing a simple CGI script, and the input GET parameters will be URL-encoded,
e.g. £ -> %A3
Here are two test URLs I'm using in my browser:
?a=%7B&b=%A3
?a={&b=£
When I loop through the parameters from cgi.FieldStorage, I get an exception for the b parameter.
I know it's related to encoding of some form, but I just can't work out a solution.
key = a
value = {
key = b
ERROR: 'ascii' codec can't encode character '\ufffd' in position 12: ordinal not in range(128)
key = a
value = {
key = b
ERROR: 'ascii' codec can't encode character '\xa3' in position 12: ordinal not in range(128)
The following is the test CGI script.
#!/opt/python-3.3.4/bin/python3
import cgitb
import cgi
import sys
print("Content-Type: text/html; charset=utf-8")
print("")
print("<html>")
print("<body>")
print("<h1>Hello</h1>")
form = cgi.FieldStorage()
#form = cgi.FieldStorage(encoding="utf8")
for i in form.keys():
print("<br>key = ", i)
try:
tmp = form[i].value
print("<br>value = %s" % tmp)
except Exception as err:
print("<br>ERROR:", err)
print("</body>")
print("</html>")
I believe that GET only supports ASCII characters: the parameters travel in the URL, and anything else must be percent-encoded. Note that %A3 is the Latin-1 byte for £, while a UTF-8-aware server expects %C2%A3, which is why decoding fails.
Therefore you need to percent-encode your parameters as UTF-8, or use POST, for non-ASCII characters.
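A sketch of the difference between the two percent-encodings, using urllib.parse (which the cgi module sits on top of):

```python
from urllib.parse import quote, unquote

# UTF-8 percent-encoding (the default) vs. Latin-1.
print(quote('£'))                       # %C2%A3
print(quote('£', encoding='latin-1'))   # %A3

# Decoding %A3 as UTF-8 cannot produce '£'; as Latin-1 it succeeds.
print(unquote('%A3', encoding='latin-1'))  # £
```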