I am writing a script in Python 2.7 that works fine, except that after running for a few seconds it hits this error:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file # z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Traceback (most recent call last):
  File "shopify_sitemap_scraper.py", line 38, in <module>
    print(prod, variants).encode('utf-8')
AttributeError: 'NoneType' object has no attribute 'encode'
The script is to get data from a Shopify website and then print it to console. Code here:
# -*- coding: utf-8 -*-
from __future__ import print_function
from lxml.html import fromstring
import requests
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Log file location, change "z://shopify_output.txt" to your location.
logFileLocation = "z:\shopify_output.txt"
log = open(logFileLocation, "w")
# URL of Shopify website from user input (for testing, just use store.highsnobiety.com during input)
url = 'http://' + raw_input("Enter Shopify website URL (without HTTP): ") + '/sitemap_products_1.xml'
print ('Scraping! Check log file # ' + logFileLocation + ' to see output.')
print ("!!! Also make sure to clear file every hour or so !!!")
while True:
    page = requests.get(url)
    tree = fromstring(page.content)
    # skip first url tag with no image:title
    url_tags = tree.xpath("//url[position() > 1]")
    data = [(e.xpath("./image/title//text()")[0], e.xpath("./loc/text()")[0]) for e in url_tags]
    for prod, url in data:
        # add xml extension to url
        page = requests.get(url + ".xml")
        tree = fromstring(page.content)
        variants = tree.xpath("//variants[@type='array']//id[@type='integer']//text()")
        print(prod, variants).encode('utf-8')
The most crazy part about it is that when I take out the .encode('utf-8') it gives me a UnicodeEncodeError seen here:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file # z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Copper Bracelet - 5mm - Brushed ['3726247811']
Copper Bracelet - 7mm - Polished ['3726253635']
Highsnobiety x EARLY - Leather Pouch ['14541472963', '14541473027', '14541473091']
Traceback (most recent call last):
  File "shopify_sitemap_scraper.py", line 38, in <module>
    print(prod, variants)
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 13: character maps to <undefined>
Any ideas? Have no idea what else to try after hours of googling.
snakecharmerb almost got it, but missed the cause of your first error. Your code
print(prod, variants).encode('utf-8')
means you print the values of prod and variants, then try to call encode() on the return value of print(). Unfortunately, print() (as a function via the __future__ import in Python 2, and always in Python 3) returns None. To fix it, use the following instead:
print(prod.encode("utf-8"), variants)
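A minimal sketch makes the first traceback obvious (Python 3 shown; the __future__ import in the question's script makes print behave the same way in Python 2):

```python
# print() always returns None, so chaining .encode() onto its result
# raises AttributeError, exactly as in the question's traceback.
result = print("Copper Bracelet - 3mm - Polished", ['3723603267'])
assert result is None

# Encode the text first, then print, to control the output encoding.
prod = u"Copper Bracelet \xae"
encoded = prod.encode("utf-8")
```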
Your console has a default encoding of cp437, and cp437 is unable to represent the character u'\xae'.
>>> print (u'\xae')
®
>>> print (u'\xae'.encode('utf-8'))
b'\xc2\xae'
>>> print (u'\xae'.encode('cp437'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/encodings/cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xae' in position 0: character maps to <undefined>
You can see that it's trying to convert to cp437 in the traceback:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
(I reproduced the problem in Python 3.5, but it's the same issue in both versions of Python.)
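If changing the console code page isn't an option, one workaround (a sketch, using the ® character from the traceback) is to pick a lossy error handler so unmappable characters degrade instead of raising:

```python
# cp437 cannot represent u'\xae' (the registered-trademark sign), so a
# strict encode raises UnicodeEncodeError; "replace" and
# "backslashreplace" substitute instead of failing.
text = u"Copper Bracelet \xae"
replaced = text.encode("cp437", errors="replace")          # '?' for unmappable
escaped = text.encode("cp437", errors="backslashreplace")  # '\xae' escape
```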
Related
My program gets a JSON string from a webserver using requests. It is then converted to a dictionary with json.loads(). After that I write some elements from this dictionary to a file in a loop:
parsedJSON = json.loads(cleanJSON)
for i in range(len(parsedJSON['list'])):
    f.write(html.unescape(parsedJSON['list'][i][4]) + ' - ' + html.unescape(parsedJSON['list'][i][3]) + '\n')
The problem is that the JSON can contain Japanese/Chinese characters and other special symbols. In the JSON string I receive, they are stored as HTML entities (for example the string '&# 12493;&# 12467;&# 12496;&# 12473;' is ネコバス).
To convert HTML entities to a human-readable form, I use html.unescape('someHTMLEntity'). On my Debian 8 and some other Linux systems it works perfectly: the character codes are converted to the actual characters. But on Windows (7, 8.1 and 10) I get this error:
Traceback (most recent call last):
  File "main.py", line 144, in <module>
    f.write(html.unescape(parsedJSON['list'][i][4]) + ' - ' + html.unescape(parsedJSON['list'][i][3]) + '\n')
  File "C:\Users\dangerous\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 12-15: character maps to <undefined>
The program crashes when html.unescape('someHieroglyphCode') is executed.
As I understand it, this is some Windows-specific encoding problem, but I can't figure out what exactly.
Fixed it by explicitly using utf-8 encoding in open():
f = open('./dump', 'a', encoding='utf-8')
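Put together, a sketch of the fixed loop (the JSON below is a stand-in following the question's field layout, and io.StringIO stands in for the utf-8 file handle):

```python
import html
import io
import json

# Stand-in payload: field 4 is a title, field 3 holds HTML entities.
cleanJSON = '{"list": [[0, 0, 0, "&#12493;&#12467;&#12496;&#12473;", "Title"]]}'
parsedJSON = json.loads(cleanJSON)

# With open('./dump', 'a', encoding='utf-8') the write no longer depends
# on the Windows default codec (cp1251 in the traceback above).
f = io.StringIO()
for i in range(len(parsedJSON['list'])):
    f.write(html.unescape(parsedJSON['list'][i][4]) + ' - ' + html.unescape(parsedJSON['list'][i][3]) + '\n')
output = f.getvalue()
```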
# -*- coding: UTF-8 -*-
import urllib.request
import re
import os

os.system("cls")
url = input("Url Link : ")
if url[0:8] == "https://":
    url = url[:4] + url[5:]
if url[0:7] != "http://":
    url = "http://" + url
value = urllib.request.urlopen(url).read().decode('UTF8')
par = '<title>(.+?)</title>'
result = re.findall(par, value)
print(result)
This is a title-parsing program. It works well when parsing sites like Google or Gmail, but when I try to parse my school's website I get an error. Is the problem with the school site, or with my code?
You can increase the timeout:
value=urllib.request.urlopen(url,timeout=60).read().decode('UTF8')
Using Python Requests (http://docs.python-requests.org/en/latest/) I was able to download http://jakjeon.icems.kr/main.do without error, although some of the text was garbled due to the inability to install the Korean code page (949) on Windows.
Here is the script:
import requests
url='http://jakjeon.icems.kr/main.do'
r = requests.get(url)
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
print(r.text)
Running it printed:
200 # r.status_code
text/html; charset=UTF-8 # r.headers['content-type']
UTF-8 # r.encoding
Followed by all of the text of the page (r.text)
This printed successfully to the Windows cmd.exe console only after setting its code page to 65001 (Unicode (UTF-8), see https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx).
Attempting to redirect output to a file resulted in a UnicodeEncodeError because the default Windows encoding for a file on my platform is code page 1252 (ANSI Latin 1; Western European (Windows)). Here is the error message from attempting to print to a file:
Traceback (most recent call last):
  File "URLDownloadDemo.py", line 12, in <module>
    print(r.text)
  File "C:\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 600-601: character maps to <undefined>
The entire cmd.exe console transcript is available at https://goo.gl/Cyav17 and a copy of the script is at https://goo.gl/W4Sk9S.
Due to the encoding error when redirecting output to a file, I enhanced the script to write its output directly to a file with UTF-8 encoding. Here is the new script:
import requests
url='http://jakjeon.icems.kr/main.do'
r = requests.get(url)
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
fout = open('URLDownloadDemo.output.txt', mode='wt', encoding='UTF-8')
fout.write(r.text)
fout.close()
Running this worked perfectly (no errors) and the output file contained Korean alphabet symbols identical to those in the source of the web page.
The new script is available at https://goo.gl/VJs2Na and its output file is at https://goo.gl/4BKe8C.
I'm trying to read from an ods (Opendocument spreadsheet) document with the odfpy modules. So far I've been able to extract some data but whenever a cell contains non-standard input the script errors out with:
Traceback (most recent call last):
  File "python/test.py", line 26, in <module>
    print x.firstChild
  File "/usr/lib/python2.7/site-packages/odf/element.py", line 247, in __str__
    return self.data.encode()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 4: ordinal not in range(128)
I tried to force an encoding on the output but apparently it does not go well with print:
Traceback (most recent call last):
  File "python/test.py", line 27, in <module>
    print x.firstChild.encode('utf-8', 'ignore')
AttributeError: Text instance has no attribute 'encode'
What is the problem here and how could it be solved without editing the module code (which I'd like to avoid at all cost)? Is there an alternative to running encode on output that could work?
Here is my code:
from odf.opendocument import Spreadsheet
from odf.opendocument import load
from odf.table import Table, TableRow, TableCell
from odf.text import P
import sys, codecs

doc = load(sys.argv[1])
d = doc.spreadsheet
tables = d.getElementsByType(Table)
for table in tables:
    tName = table.attributes[(u'urn:oasis:names:tc:opendocument:xmlns:table:1.0', u'name')]
    print tName
    rows = table.getElementsByType(TableRow)
    for row in rows[:2]:
        cells = row.getElementsByType(TableCell)
        for cell in cells:
            tps = cell.getElementsByType(P)
            if len(tps) > 0:
                for x in tps:
                    #print x.firstChild
                    print x.firstChild.encode('utf-8', 'ignore')
Maybe you are not using the latest odfpy; in the latest version, the __str__ method of Text is implemented as:
def __str__(self):
    return self.data
Update odfpy to the latest version, and modify your code as:
print x.firstChild.__str__().encode('utf-8', 'ignore')
UPDATE
There is another method for getting the raw unicode data from a Text node: __unicode__. So if you don't want to update odfpy, modify your code as follows:
print x.firstChild.__unicode__().encode('utf-8', 'ignore')
It seems the library itself is calling encode():
return self.data.encode()
This uses the system default encoding, which in your case seems to be ascii. You can check that with:
import sys
sys.getdefaultencoding()
From the traceback, it seems the actual data lives in an attribute called data. Try this instead:
print x.firstChild.data
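The distinction can be sketched without odfpy at all (FakeText below is a hypothetical stand-in mimicking the Text node from the traceback: the unicode payload sits in .data, and there is no .encode() on the node itself):

```python
class FakeText(object):
    """Hypothetical stand-in for odfpy's Text node."""
    def __init__(self, data):
        self.data = data  # the raw unicode payload

# u'\u0105' is the Polish character from the question's traceback.
node = FakeText(u"ksi\u0105\u017cka")
payload = node.data                   # unicode, safe to print or encode
encoded = node.data.encode("utf-8")   # explicit encoding, no implicit ASCII step
```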
I have a little Python 3.3 script that successfully sends (SendMessage) a WM_COPYDATA message (inspired from here , works with XYplorer):
import win32api
import win32gui
import struct
import array

def sendScript(window, message):
    CopyDataStruct = "IIP"
    dwData = 0x00400001  # value required by XYplorer
    buffer = array.array("u", message)
    cds = struct.pack(CopyDataStruct, dwData, buffer.buffer_info()[1] * 2 + 1, buffer.buffer_info()[0])
    win32api.SendMessage(window, 0x004A, 0, cds)  # 0x004A is the WM_COPYDATA id

message = "helloworld"
sendScript(window, message)  # I write the hwnd manually during debugging
Now I need to write a receiver script, still in Python. The script in this answer seems to work (after converting all the print statements to print() calls). Seems, because it prints all the properties of the received message (hwnd, wparam, lparam, etc.) except the content of the message.
Instead I get an error, UnicodeEncodeError. More specifically:
Python WNDPROC handler failed
Traceback (most recent call last):
  File "C:\Python\xxx.py", line 45, in OnCopyData
    print(ctypes.wstring_at(pCDS.contents.lpData))
  File "C:\Python\python-3.3.2\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-13: character maps to <undefined>
I don't know how to fix it, especially since I'm not using any "fancy" characters in the message, so I really can't see why I get this error. I also tried setting a different length in print(ctypes.wstring_at(pCDS.contents.lpData)), as well as simply using string_at, but without success (in the latter case I get a byte string).
ctypes.wstring_at (in the line print(ctypes.wstring_at(pCDS.contents.lpData))) may not match the string type the sender sent. Try changing it to:
print (ctypes.string_at(pCDS.contents.lpData))
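The narrow/wide distinction is easy to reproduce with ctypes alone (a sketch independent of the WM_COPYDATA plumbing): string_at reads a byte buffer, wstring_at reads wchar_t data, and reading a buffer with the wrong width garbles it.

```python
import ctypes

# A byte buffer and a wide-character buffer holding the same text.
narrow = ctypes.create_string_buffer(b"helloworld")
wide = ctypes.create_unicode_buffer(u"helloworld")

# Read each buffer back with the matching-width reader.
as_bytes = ctypes.string_at(ctypes.addressof(narrow), 10)
as_wide = ctypes.wstring_at(ctypes.addressof(wide), 10)
```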
I am trying to read some French text and do a frequency analysis of the words. I want the characters with accents and other diacritics to stay. So, I did this for testing:
>>> import codecs
>>> f = codecs.open('file','r','utf-8')
>>> for line in f:
...     print line
...
Faites savoir à votre famille que vous êtes en sécurité.
So far, so good. But, I have a list of French files which I iterate over in the following way:
import codecs, sys, os

path = sys.argv[1]
for f in os.listdir(path):
    french = codecs.open(os.path.join(path, f), 'r', 'utf-8')
    for line in french:
        print line
Here, it gives the following error:
rdholaki74: python TestingCodecs.py ../frenchResources | more
Traceback (most recent call last):
  File "TestingCodecs.py", line 7, in <module>
    print line
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 14: ordinal not in range(128)
Why does the same file throw an error when passed as an argument, but not when given explicitly in the code?
Thanks.
Because you're misinterpreting the cause: piping the output means Python can't detect which encoding to use. If stdout is not a TTY, you'll need to encode to UTF-8 manually before outputting.
It is a print error due to redirection. You could use:
PYTHONIOENCODING=utf-8 python ... | ...
Specify another encoding if your terminal doesn't use utf-8.
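If you'd rather fix it in the script than in the environment, a sketch of the manual-encoding approach (Python 3 shown; io.BytesIO stands in for the piped stdout, and write_utf8 is an illustrative helper, not a library function):

```python
import io
import sys

def write_utf8(line, stream=None):
    # When stdout is piped, encode explicitly instead of relying on the
    # interpreter's guess (ASCII in the question's Python 2 traceback).
    stream = stream or sys.stdout.buffer
    stream.write(line.encode("utf-8") + b"\n")

buf = io.BytesIO()  # stand-in for a redirected stdout
write_utf8(u"Faites savoir \xe0 votre famille", buf)
data = buf.getvalue()
```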