I wrote a Python script to retrieve an image from a URL:
url = 'https://uploads0.wikiart.org/images/albrecht-durer/watermill-at-the-montaсa.jpg'
urllib.request.urlretrieve(url, STYLE_IMAGE_UPLOAD + "wikiart" + "/" + url)
When I run it, I get this message:
UnicodeEncodeError: 'ascii' codec can't encode character '\u0441' in position 49: ordinal not in range(128)
I think the problem comes from the image URL:
'https://uploads0.wikiart.org/images/albrecht-durer/watermill-at-the-monta\u0441a.jpg',
How can I fix this problem?
The URL contains a non-ASCII character (a Cyrillic letter that looks like a Latin "c").
Escape this character using the urllib.parse.quote function:
url = 'https://uploads0.wikiart.org' + urllib.parse.quote('/images/albrecht-durer/watermill-at-the-montaсa.jpg')
urllib.request.urlretrieve(url, '/tmp/watermill.jpg')
Don't put the entire URL through the quote function, otherwise it will also escape the colon (":") in "https://".
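Alternatively, a minimal sketch (assuming the non-ASCII characters only ever appear in the path) is to split the URL first and quote just the path component, so you don't have to hard-code where the host ends:
import urllib.parse
import urllib.request

url = 'https://uploads0.wikiart.org/images/albrecht-durer/watermill-at-the-montaсa.jpg'

# Split the URL, percent-encode only the path, and put it back together.
parts = urllib.parse.urlsplit(url)
safe_url = urllib.parse.urlunsplit(parts._replace(path=urllib.parse.quote(parts.path)))

urllib.request.urlretrieve(safe_url, '/tmp/watermill.jpg')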
This is my code:
import urllib.request
imglinks = ["http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"]
for link in imglinks:
    filename = link.split('/')[-1]
    urllib.request.urlretrieve(link, filename)
It gives me the error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9'
How do I solve this? I tried using .encode('utf-8'), but it gives me:
TypeError: cannot use a string pattern on a bytes-like object
The problem here is not the encoding itself, but the fact that the URL you pass to urllib.request must be properly quoted first.
You need to quote the url as follows:
import urllib.request
import urllib.parse
imglinks = ["http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"]
for link in imglinks:
    link = urllib.parse.quote(link, safe=':/')  # <- here
    filename = link.split('/')[-1]
    urllib.request.urlretrieve(link, filename)
This way the © symbol is encoded as %C2%A9, which is what the web server expects.
The safe parameter prevents quote from also escaping the : after http (and the / separators).
It's up to you to modify the code so the file is saved under its original filename. ;)
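For example, a small sketch (assuming the last path segment is the filename you want to keep) that unquotes just that segment before saving:
import urllib.parse
import urllib.request

imglinks = ["http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"]
for link in imglinks:
    quoted = urllib.parse.quote(link, safe=':/')
    # Undo the quoting on the last path segment so the file keeps its original name.
    filename = urllib.parse.unquote(quoted.split('/')[-1])
    urllib.request.urlretrieve(quoted, filename)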
BLUF: Why is the decode() method on a bytes object failing to decode ç?
I am receiving a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position .... Upon tracking down the character, it is the ç character. So when I get to reading the response from the server:
conn = http.client.HTTPConnection(host = 'something.com')
conn.request('GET', url = '/some/json')
resp = conn.getresponse()
content = resp.read().decode() # throws error
I am unable to get the content. If I just do content = resp.read() it is successful, and I can write it to a file opened with wb, but then wherever the ç is, it is replaced with 0xE7 in the file upon writing. Even if I open the file in Notepad++ and set the encoding to UTF-8, the character only shows as the hex value.
Why am I not able to decode this UTF-8 character from an HTTPResponse? Am I not correctly writing it to file either?
When you have issues with encoding/decoding, you should take a look at the UTF-8 Encoding Debugging Chart.
If you look in the chart for the Windows-1252 code point 0xE7, you will find that the expected character is ç, showing that the response is actually encoded as CP1252, not UTF-8.
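Assuming the body really is CP1252 (which is what the chart points to), a minimal fix is to decode with that codec and re-encode as UTF-8 when writing the file (the output path here is just a placeholder):
import http.client

conn = http.client.HTTPConnection(host='something.com')
conn.request('GET', url='/some/json')
resp = conn.getresponse()

# Decode the CP1252 body instead of assuming UTF-8.
content = resp.read().decode('cp1252')

# Re-encode as UTF-8 on the way out so editors expecting UTF-8 show ç correctly.
with open('output.json', 'w', encoding='utf-8') as f:
    f.write(content)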
I'm scraping a £ value in Python, and when I try to write it into an Excel sheet the process breaks and I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
The £ sign prints without any error in the cmd prompt. Could someone suggest how I can write the value (£1,750) into my sheet (with or without the £ sign)? Many thanks...
import requests
from bs4 import BeautifulSoup as soup
import csv

session = requests.Session()         # session object (not shown in the original snippet)
base_url = 'https://www.saa.gov.uk'  # assumed site root for the relative result links

outputfilename = 'Ed_Streets2.csv'
outputfile = open(outputfilename, 'wb')
writer = csv.writer(outputfile)
writer.writerow(['Rateable Value'])
url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results'
response = session.get(url)
html = soup(response.text, 'lxml')
prop_link = html.find_all("a", class_="pagelink button small")
for link in prop_link:
    prop_url = base_url + link["href"]
    response = session.get(prop_url)
    prop = soup(response.content, "lxml")
    RightBlockData = prop.find_all("div", class_="columns small-7 cell")
    Rateable_Value = RightBlockData[0].get_text().strip()
    print(Rateable_Value)
    writer.writerow([Rateable_Value])
You need to encode your unicode object into bytes explicitly. Otherwise, Python will implicitly try to encode it with the ascii codec, which fails on non-ASCII characters. So, adding this:
Rateable_Value = Rateable_Value.encode('utf8')
before you
writer.writerow([Rateable_Value])
should do the trick.
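Put together, a minimal Python 2 sketch of just the CSV-writing step (with a hard-coded value standing in for the scraped one):
# -*- coding: utf-8 -*-
import csv

value = u'\xa31,750'  # the scraped u'£1,750' value

outputfile = open('Ed_Streets2.csv', 'wb')
writer = csv.writer(outputfile)
writer.writerow(['Rateable Value'])
writer.writerow([value.encode('utf8')])  # encode to UTF-8 bytes before writing
outputfile.close()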
I have already decoded a lot of email attachment filenames in my code.
But this particular filename breaks my code.
Here is a minimal example:
from email.header import decode_header
encoded_filename = '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
decoded_header = decode_header(encoded_filename)  # --> [('SalesInvoiceQ1|\x04\xb5I\x95\xc1\xbd\xc9\xd0\xb9\xc1\x91\x98', 'utf-8')]
filename = str(decoded_header[0][0]).decode(decoded_header[0][1])
Exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 16: invalid start byte
Don't ask me how, but Thunderbird is able to decode this filename to: SalesInvoice-Report.pdf
How can I decode this with python like email clients apparently are able to?
There are two encoded-word sections in that header. You'd have to detect where one ends and the next begins:
>>> print decode_header(encoded_filename[:28])[0]
('SalesInvoice', 'utf-8')
>>> print decode_header(encoded_filename[28:])[0]
('-Report.pdf', 'utf-8')
Apparently that's what Thunderbird does in this case: it splits the string into =?charset?encoding?data?= chunks. Normally these should be separated by \r\n (CARRIAGE RETURN + LINE FEED) characters, but in your case they are mashed together. If you re-introduce the \r\n separator, the value decodes correctly:
>>> decode_header(encoded_filename[:28] + '\r\n' + encoded_filename[28:])[0]
('SalesInvoice-Report.pdf', 'utf-8')
You could use a regular expression to extract the parts and re-introduce the separator:
import re
from email.header import decode_header
quopri_entry = re.compile(r'=\?[\w-]+\?[QB]\?[^?]+?\?=')
def decode_multiple(encoded, _pattern=quopri_entry):
    fixed = '\r\n'.join(_pattern.findall(encoded))
    output = [b.decode(c) for b, c in decode_header(fixed)]
    return ''.join(output)
Demo:
>>> encoded_filename = '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
>>> decode_multiple(encoded_filename)
u'SalesInvoice-Report.pdf'
Of course, it could be that you have a bug in how you read the header in the first place. Make sure you don't accidentally destroy an existing \r\n separator when extracting the encoded_filename value.
I have a list of HTML pages which may contain certain encoded characters. Some examples are below:
<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada#graphics.maestro.com</em>
<em>mel#graphics.maestro.com</em>
I would like to decode (or unescape, I'm unsure of the correct terminology) these strings to:
<a href="mailto:lad at maestro dot com">
<em>ada#graphics.maestro.com</em>
<em>mel#graphics.maestro.com</em>
Note: the HTML pages are held as strings. Also, I DO NOT want to use any external library like BeautifulSoup or lxml; only native Python libraries are OK.
Edit:
The solution below isn't perfect. HTMLParser unescaping combined with urllib2 unquoting throws a
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)
error in some cases.
You need to unescape HTML entities, and URL-unquote.
The standard library has HTMLParser and urllib2 to help with those tasks.
import HTMLParser, urllib2
markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada#graphics.maestro.com</em>
<em>mel#graphics.maestro.com</em>'''
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"):
    print(line)
Result:
<a href="mailto:lad at maestro dot com">
<em>ada#graphics.maestro.com</em>
<em>mel#graphics.maestro.com</em>
Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded has charset set to cp-1252, so let's try decoding from that to Unicode:
import codecs

with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()

result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))

with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)
Edit2:
If you don't care about the non-ASCII characters you can simplify a bit:
with open(filename) as fin:
    decoded = fin.read().decode('ascii', 'ignore')
...