Unicode UTF-8 error while moving working code to a separate module - Python

I am a newbie Python programmer. I am struggling with a strange error which pops up only when I move working code from the main script file into a separate module (file) as a function. The error is SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xbf in position 58: invalid start byte.
If the function lives in the main file there is no error and the code works properly...
The code does some web scraping with Selenium and XPath.
#main file:
import requests
import lxml.html as lh
import pandas as pd
import numpy
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import funkcje as f
spolka = "https://mojeinwestycje.interia.pl/gie/prof/spolki/notowania?wlid=213"
wynik = f.listaTransakcji(spolka)
#module file with function definition (funkcje.py):
from selenium import webdriver
import lxml.html as lh
import numpy

def listaTransakcji(spolka):
    driver = webdriver.Firefox()
    driver.implicitly_wait(30)
    driver.get(spolka)
    driver.find_element_by_xpath("//button[@class='rodo-popup-agree']").click()
    driver.find_element_by_xpath("//input[@type='radio' and @name='typ' and @value='wsz']").click()
    driver.find_element_by_xpath("//input[@type='submit' and @name='Submit' and @value='pokaż']").click()
    page = driver.page_source
    #end of selenium-----------------------------------------------------------------------------
    #Store the contents of the website under doc
    doc = lh.fromstring(page)
    #extract the transaction records - xpath------------------------------------------------------
    tr_elements = doc.xpath('//table//tr[@bgcolor="#FFFFFF" or @bgcolor="#F7FAFF"]/td')
    rekord = numpy.array([])
    length = len(tr_elements)
    for i in range(0, length):
        if (tr_elements[i].text == 'TRANSAKCJA') or (tr_elements[i].text == 'WIDEŁKI STAT') or (tr_elements[i].text == 'WIDEŁKI DYN'):
            new_rekord = [tr_elements[i-5].text, tr_elements[i-4].text, tr_elements[i-3].text, tr_elements[i-2].text, tr_elements[i-1].text, tr_elements[i].text]
            rekord = numpy.concatenate((rekord, new_rekord))
    ilosc = (len(rekord)) // 6
    tablica = rekord.reshape(ilosc, 6)
    header = numpy.array(["godzina", "cena", "zmiana", "wolumen", "numer", "typ operacji"])
    header = header.reshape(1, 6)
    tablica = numpy.concatenate((header, tablica))
    return tablica
Offending line 10 (scraper.py):
import funkcje as f
Offending line 34 (funkcje.py):
driver.find_element_by_xpath("//input[@type='submit' and @name='Submit' and @value='pokaż']").click()
expected result:
["11:17:40","0,4930","0,00",24300,76,"TRANSAKCJA"]
actual result:
Traceback (most recent call last):
File "C:/Users/Vox/PycharmProjects/interia/scraper.py", line 10, in <module>
import funkcje as f
File "C:\Users\Vox\PycharmProjects\interia\funkcje.py", line 34
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xbf in position 58: invalid start byte

Thanks Marat! Putting # -*- coding: utf-8 -*- at the top of the new file (replacing utf-8 with whatever encoding is actually used for "pokaż") solved the issue... no idea why it happened in the first place, though. Is a new file that is not the main file not UTF-8 by default?
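For reference, a minimal sketch of the fix at the top of funkcje.py. Byte 0xBF happens to be "ż" in the Windows-1250 and ISO-8859-2 codepages, which suggests the editor saved the new file in a legacy Polish encoding rather than UTF-8; cp1250 is therefore assumed below, so substitute whatever encoding your editor actually used, or simply re-save the file as UTF-8.
# -*- coding: cp1250 -*-
# First line of funkcje.py: declares the encoding the source file was saved in,
# so Python can decode the "pokaż" literal. cp1250 is an assumption here;
# re-saving the file as UTF-8 (and dropping this line) works just as well.

przycisk = "pokaż"  # hypothetical example of a non-ASCII literal in this module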

Related

How to render xml with python 2.7 ElementTree module with unicode

I have some code that is throwing a fatal exception on unicode input. I am using ElementTree to build up an xml document and tostring() to print it. I've tried passing in unicode objects, and encoding them as UTF-8 bytestrings, and it makes no difference. I can't figure out if I'm doing something wrong or if there is a bug in the module.
Here is a small sample to reproduce it.
#!/usr/bin/python
from __future__ import unicode_literals, print_function
from xml.etree.ElementTree import Element, SubElement, tostring
import xml.etree.ElementTree as ET
import time

def main():
    xml = Element('build_summary')
    mpversion = SubElement(xml, 'magpy_version')
    mpversion.text = '1.2.3.4'
    version = SubElement(xml, 'version')
    version.text = '11.22.33.44'
    date = SubElement(xml, 'date')
    date.text = time.strftime("%a %b %-d %Y", time.localtime())
    args = SubElement(xml, 'args')
    args.text = 'build args'
    issues = SubElement(xml, 'issues')
    # Add the repos and the changes in them
    changelog = 'this is the changelog \u2615'
    #changelog = 'this is the changelog'
    print("adding changelog:", changelog)
    repository = SubElement(issues, 'repo')
    reponame = SubElement(repository, 'reponame')
    reponame.text = 'repo name'
    repoissues = SubElement(repository, 'repoissues')
    #repoissues.text = changelog.encode('UTF-8', 'replace')
    repoissues.text = changelog
    # Generate a string, reparse it, and pretty-print it.
    #ET.dump(xml)
    #xml.write('myoutput.xml')
    rough = tostring(xml, encoding='UTF-8', method='xml')
    #rough = tostring(xml)
    print(rough)

if __name__ == '__main__':
    main()
This yields the following:
msoulier@anton:~$ python treetest.py
adding changelog: this is the changelog ☕
Traceback (most recent call last):
File "treetest.py", line 38, in <module>
main()
File "treetest.py", line 33, in main
rough = tostring(xml, encoding='UTF-8', method='xml')
File "/usr/local/Cellar/python#2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
return "".join(data)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 22: ordinal not in range(128)
So, what am I doing wrong here? Oddly, ElementTree.dump works fine, but the docs say not to use it for anything but debugging.

How to use utf-8 characters in dryscrape in Python?

I need to use UTF-8 characters in dryscrape's set method, but after running I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
My code (for example):
site = dryscrape.Session()
site.visit("https://www.website.com")
search = site.at_xpath('//*[@name="search"]')
search.set(u'فارسی')
search.form().submit()
I also tried changing u'فارسی' to search.set(unicode('فارسی', 'utf-8')), but it shows the same error.
It's very easy... This method works perfectly with Google. You can also try it with any other site if you know the URL params:
import dryscrape as d
d.start_xvfb()
br = d.Session()
import urllib.parse
query = urllib.parse.quote("فارسی")
print(query) #it prints : '%D9%81%D8%A7%D8%B1%D8%B3%DB%8C'
Url = "http://google.com/search?q="+query
br.visit(Url)
print(br.xpath('//title')[0].text())
#it prints : Google Search - فارسی
#You can also check it with br.render("url_screenshot.png")

Encode error scraping

Scraping a site with Chinese symbols.
How do I scrape Chinese symbols?
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring

URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'

def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)
    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)

def main():
    parse_items()

if __name__ == '__main__':
    main()
Error looks like this:
http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
File "parser.py", line 27, in <module>
main()
File "parser.py", line 24, in main
parse_items()
File "parser.py", line 20, in parse_items
print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
From the print syntax and the imports, I assume you are using Python 3, since that matters for unicode.
So we can expect that href, title and title2 are all unicode strings (Python 3 strings). But the print function will try to convert the strings to an encoding acceptable by the output system; for a reason I cannot know, your system defaults to ASCII, hence the error.
How to fix:
the best way would be to make your system accept unicode. On Linux or other unixes, you can declare a UTF-8 charset in the LANG environment variable (export LANG=en_US.UTF-8); on Windows you can try chcp 65001, but this last one is far from guaranteed to work
if that does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out offending characters, because Python 3 natively uses unicode strings.
I would use:
import sys

def u_filter(s, encoding=sys.stdout.encoding):
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)
That means: if s is a unicode string, encode it in the encoding used for stdout, replacing any non-convertible character with a replacement character, then decode it back into a now-clean string.
and next:
def fprint(*args, **kwargs):
    fargs = [u_filter(arg) for arg in args]
    print(*fargs, **kwargs)
This means: filter out any offending characters from the unicode strings and print the rest unchanged.
With that you can safely replace your print throwing the exception with:
fprint(href, title, title2)
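As a side note that is not part of the original answer: on Python 3.7 and later you can get a similar effect without a wrapper by reconfiguring stdout itself to replace characters it cannot encode. A minimal sketch, assuming a recent interpreter:
import sys

# Tell the existing stdout text stream to substitute a replacement character for
# anything its encoding cannot represent, instead of raising UnicodeEncodeError.
sys.stdout.reconfigure(errors='replace')

print("Suning \u82cf\u5b81\u6613\u8d2d")  # degrades to '?' characters on an ASCII-only console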

Strange UnicodeEncodeError/AttributeError in my script

I am currently writing a script in Python 2.7 that works fine, except that after running for a few seconds it hits an error:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file @ z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Traceback (most recent call last):
File "shopify_sitemap_scraper.py", line 38, in <module>
print(prod, variants).encode('utf-8')
AttributeError: 'NoneType' object has no attribute 'encode'
The script gets data from a Shopify website and prints it to the console. Code here:
# -*- coding: utf-8 -*-
from __future__ import print_function
from lxml.html import fromstring
import requests
import time
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

# Log file location, change "z://shopify_output.txt" to your location.
logFileLocation = "z:\shopify_output.txt"
log = open(logFileLocation, "w")

# URL of Shopify website from user input (for testing, just use store.highsnobiety.com during input)
url = 'http://' + raw_input("Enter Shopify website URL (without HTTP): ") + '/sitemap_products_1.xml'

print('Scraping! Check log file @ ' + logFileLocation + ' to see output.')
print("!!! Also make sure to clear file every hour or so !!!")

while True:
    page = requests.get(url)
    tree = fromstring(page.content)
    # skip first url tag with no image:title
    url_tags = tree.xpath("//url[position() > 1]")
    data = [(e.xpath("./image/title//text()")[0], e.xpath("./loc/text()")[0]) for e in url_tags]
    for prod, url in data:
        # add xml extension to url
        page = requests.get(url + ".xml")
        tree = fromstring(page.content)
        variants = tree.xpath("//variants[@type='array']//id[@type='integer']//text()")
        print(prod, variants).encode('utf-8')
The craziest part is that when I take out the .encode('utf-8') it gives me a UnicodeEncodeError, seen here:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file @ z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Copper Bracelet - 5mm - Brushed ['3726247811']
Copper Bracelet - 7mm - Polished ['3726253635']
Highsnobiety x EARLY - Leather Pouch ['14541472963', '14541473027', '14541473091']
Traceback (most recent call last):
File "shopify_sitemap_scraper.py", line 38, in <module>
print(prod, variants)
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 13: character maps to <undefined>
Any ideas? Have no idea what else to try after hours of googling.
snakecharmerb almost got it, but missed the cause of your first error. Your code
print(prod, variants).encode('utf-8')
means you print the values of the prod and variants variables, then try to run the encode() function on the output of print. Unfortunately, print() (as a function in Python 2 and always in Python 3) returns None. To fix it, use the following instead:
print(prod.encode("utf-8"), variants)
Your console has a default encoding of cp437, and cp437 is unable to represent the character u'\xae'.
>>> print (u'\xae')
®
>>> print (u'\xae'.encode('utf-8'))
b'\xc2\xae'
>>> print (u'\xae'.encode('cp437'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/encodings/cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xae' in position 0: character maps to <undefined>
You can see that it's trying to convert to cp437 in the traceback:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
(I reproduced the problem in Python3.5, but it's the same issue in both versions of Python)
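Not part of either answer, but putting the two observations together (print() returning None, and a console stuck on cp437), here is a hedged sketch of a print call that should survive any console encoding in Python 2; the sample values are made up for illustration.
from __future__ import print_function
import sys

# Encode to whatever encoding the console reports (falling back to UTF-8), replacing
# anything the codepage cannot represent, before handing the text to print().
console_encoding = sys.stdout.encoding or 'utf-8'
prod = u'Highsnobiety\xae Leather Pouch'  # hypothetical product title containing u'\xae'
variants = ['14541472963']
print(prod.encode(console_encoding, 'replace'), variants)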

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2

I am using BeautifulSoup 3.1.0.1 with Python 2.5.2 to parse a web page in French. However, as soon as I call findAll, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1146: ordinal not in range(128)
Below is the code I am currently running:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://fr.encarta.msn.com/encyclopedia_761561798/Paris.html")
soup = BeautifulSoup(page, fromEncoding="latin1")
r = soup.findAll("table")
print r
Does anybody have an idea why?
Thanks!
UPDATE: As requested, below is the full traceback.
Traceback (most recent call last):
File "[...]\test.py", line 6, in <module>
print r
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1146-1147: ordinal not in range(128)
Here is another idea. Your terminal is not capable of displaying a unicode string from Python, and the interpreter tries to convert it to ASCII first. You should encode it explicitly before printing. I don't know the exact semantics of soup.findAll(), but it is probably something like:
for t in soup.findAll("table"):
    print t.encode('latin1')
That is, if t really is a string. Maybe it's just another object from which you have to build the data that you want to display.
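Building on that idea, a minimal sketch of what the loop might look like when t is a Tag rather than a plain string; the unicode(t) conversion is an assumption based on the BeautifulSoup 3 documentation, and the replace error handler keeps print from ever falling back to ASCII.
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://fr.encarta.msn.com/encyclopedia_761561798/Paris.html")
soup = BeautifulSoup(page, fromEncoding="latin1")
for t in soup.findAll("table"):
    # unicode(t) renders the tag's markup as a unicode string (assumed from the BS3 docs);
    # encode it explicitly so the terminal's ASCII default never gets involved.
    print unicode(t).encode('latin1', 'replace')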
