I created a Django application with Python 3.4 on the Windows platform. Now I am trying to host it on an AWS Linux instance. The first time, I faced the following error:
Non-ASCII character '\xe2'
I resolved this issue by adding a UTF-8 coding declaration at the top of each file:
# -*- coding: utf-8 -*-
Now I am facing the following error:
'ascii' codec can't decode byte 0xe2 in position 18: ordinal not in
range(128)
Code:
class TaskTodo:
    @classmethod
    def validate_search(cls, form_data):
        try:
            search_url = 'https://www.foo.com/s-{search}/page-{page}'
            url = search_url.format(page=1, search=form_data['keywords'])
            url = url.encode('utf-8')
            r = requests.get(url)
            not_found_text = 'Sorry, but we didn’t find any results. Below you can find some tips to help you in your search.'
            if not_found_text in r.text.encode('utf-8'):
                return
            # after encoding its not working on localhost
            # 'str' does not support the buffer interface
            if r.status_code == 200:
                content = r.text
                soup = BeautifulSoup(content, "html.parser")
                total = soup.find('span', {"class": 'count'}).text.replace('words', '').replace(',', '').strip()
                pages = 1
                last_page = soup.find('a', {"class": 'last follows'})
                if last_page:
                    href = last_page['href'].split('/')
                    pages = int(href[len(href) - 1].replace('somewords', '').strip())
        except Exception as ex:
            raise ex
I have searched and tried to implement encoding fixes etc., but nothing works. I have completed the application, and most of its functions make HTTP requests, parse HTML, etc. It is really worrying for me to have to debug on the production server and add encoding calls to every function.
I am using Apache on the production server and have tried with both Python 2.7 and 3.5.
Any idea how I can resolve this issue? Thanks.
After working with the OP in a chatroom, it was still unclear where the actual problem came from.
I noticed that the text 'Sorry, but we didn’t …' contains a non-ASCII RIGHT SINGLE QUOTATION MARK (’).
Therefore, I recommended making not_found_text a unicode string by prefixing the literal with u.
I also recommended removing all spurious .encode() and .decode() calls.
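For reference, a minimal sketch of the suggested fix (assuming the requests-based code from the question; the u prefix matters on Python 2 and is a harmless no-op on Python 3):
not_found_text = u'Sorry, but we didn\u2019t find any results. Below you can find some tips to help you in your search.'
r = requests.get(url)           # pass the URL to requests as-is, without .encode()
if not_found_text in r.text:    # r.text is already unicode, so compare unicode to unicode
    return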
Related
I'm hoping to scrape data from the table for passengers going through TSA security lines, but I keep getting this error.
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 33780: character maps to <undefined>
from this code
url = "https://www.tsa.gov/coronavirus/passenger-throughput"
page = requests.get(url).content
soup = BeautifulSoup(page, features = 'lxml')
text = soup.get_text()
soup.prettify()
print(soup)
Are there any suggestions?
Well, let me explain what actually happened.
Read the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 33780: character maps to <undefined>
Now, from my side, if I run the following:
print("\u2713")
The output will be the following Unicode character:
✓
I believe that you are using Windows, where the default encoding is cp1252, not UTF-8.
You can verify that using the following:
import sys
print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)
Or directly via cmd by running the following command: chcp
Now you can change the console code page by opening cmd and running the following command:
chcp 65001
Check the official doc.
Identifier   .NET Name   Additional information
65001        utf-8       Unicode (UTF-8)
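Alternatively, here is a minimal sketch (my assumption: Python 3.7+, where the stream reconfigure method exists) that switches the standard streams to UTF-8 from inside the script, so you don't have to touch the console code page at all:
import sys

# Available since Python 3.7: rewrap stdout/stderr with UTF-8
# regardless of the console's code page
sys.stdout.reconfigure(encoding='utf-8')
sys.stderr.reconfigure(encoding='utf-8')

print("\u2713")  # now prints the check mark instead of raising UnicodeEncodeError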
Note that if you are using VSCode with Code-Runner, kindly run your code in the terminal as py code.py, or append the following setting:
{
"code-runner.executorMap": {
"python": "set PYTHONIOENCODING=utf8 && python"
}
}
Check my previous answer for a similar issue here.
I need to use UTF-8 characters in dryscrape's set method. But after running, it shows this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
My code (for example):
site = dryscrape.Session()
site.visit("https://www.website.com")
search = site.at_xpath('//*[@name="search"]')
search.set(u'فارسی')
search.form().submit()
I also changed u'فارسی' to search.set(unicode('فارسی', 'utf-8')), but it shows the same error.
It's very easy... This method works perfectly with Google. You can also try it with any other site if you know the URL params:
import dryscrape as d
d.start_xvfb()
br = d.Session()
import urllib.parse
query = urllib.parse.quote("فارسی")
print(query) #it prints : '%D9%81%D8%A7%D8%B1%D8%B3%DB%8C'
url = "http://google.com/search?q=" + query
br.visit(url)
print(br.xpath('//title')[0].text())
#it prints : Google Search - فارسی
#You can also check it with br.render("url_screenshot.png")
I'm trying to scrape yahoo finance web pages to get stock price data with Python 3.3, httplib2, and beautifulsoup4. Here is the code:
def getData (symbol = 'GOOG', period = 'm'):
baseUrl = 'http://finance.yahoo.com/q/hp?s='
url = baseUrl + symbol + '&g=' + period
h = httplib2.Http('.cache')
response, content = h.request(url)
soup = BeautifulSoup(content)
print(soup.prettify())
getData()
I get the following error trace:
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/mac_roman.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xd7' in position 11875: character maps to <undefined>
I'm new to python and the libraries and would greatly appreciate your help!
This is due to the encoding of your console.
Depending on which console you're working in (Windows, Mac, Linux), the console is trying to display characters it doesn't recognize and therefore can't print them to the screen.
You could try converting the output string into the encoding of your console.
I found that an easy way was to just convert your data into a string, and it prints just fine.
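For example, a minimal sketch of the conversion idea (errors='replace' is my assumption; it substitutes ? for anything the console encoding can't represent):
import sys

# Re-encode the output using the console's own encoding, replacing unmappable
# characters, then decode back to str so print() gets something it can display
safe = soup.prettify().encode(sys.stdout.encoding, errors='replace').decode(sys.stdout.encoding)
print(safe)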
I am teaching myself how to parse Google results with JSON, but when I run this code (which should work), I am getting this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\u2014' in position 5: character maps to <undefined>. Can someone help me?
import urllib
import simplejson
query = urllib.urlencode({'q' : 'site:example.com'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s&start=50' \
% (query)
search_results = urllib.urlopen(url)
json = simplejson.loads(search_results.read())
results = json['responseData']['results']
for i in results:
print i['title'] + ": " + i['url']
This error may be caused by the encoding that your console application uses when sending unicode data to stdout. There's an article that talks about it.
Check stdout's encoding:
>>> import sys
>>> sys.stdout.encoding # On my machine I get this result:
'UTF-8'
Use unicode literals.
print i[u'title'] + u": " + i[u'url']
Also:
jsondata = simplejson.load(search_results)
My guess is that the error is in the simplejson.loads(search_results.read()) line, possibly because the default encoding your Python is picking up is not UTF-8 and Google is returning UTF-8.
Try: simplejson.loads(unicode(search_results.read(), "utf8")).
I am using Python 2.7 and lxml. My code is as below:
import urllib
from lxml import html

def get_value(el):
    return get_text(el, 'value') or el.text_content()

response = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Frisco/DavidMcDavidHondaofFrisco/fullsales-504210667.html').read()
dom = html.fromstring(response)
try:
    description = get_value(dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='sales-review-paragraph loose-spacing']")[0])
except IndexError, e:
    description = ''
The code crashes inside the try, giving an error
UnicodeDecodeError at /
'utf8' codec can't decode byte 0x92 in position 85: invalid start byte
The string that could not be encoded/decoded was: ouldn�t be
I have tried using a lot of techniques, including .encode('utf8'), but none of them solves the problem. I have 2 questions:
How do I solve this problem?
How can my app crash when the problem code is inside a try/except?
The page is being served up with charset=ISO-8859-1. Decode from that to unicode.
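A minimal sketch of that fix (reusing the urllib call from the question, with the URL abbreviated to url; note that byte 0x92 is the curly apostrophe in windows-1252, a superset of ISO-8859-1, so decoding with 'cp1252' would render the intended character too):
response = urllib.urlopen(url).read()   # bytes
text = response.decode('iso-8859-1')    # decode with the charset the server declares
dom = html.fromstring(text)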
Your except clause only handles exceptions of the IndexError type. The problem was a UnicodeDecodeError, which is not an IndexError - so the exception is not handled by that except clause.
It's also not clear what 'get_value' does, and that may well be where the actual problem is arising.
Skip characters on error, or decode the input correctly to unicode.
You only catch IndexError, not UnicodeDecodeError.
Decode the response to unicode, properly handling errors (ignoring them), before parsing with html.fromstring (see the sketch below).
Catch the UnicodeDecodeError, or all errors.
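A minimal sketch combining these suggestions (the 'ignore' argument follows the advice above; with the all-byte 'iso-8859-1' codec it is effectively a no-op, but it guards against stricter charsets):
try:
    raw = urllib.urlopen(url).read()
    text = raw.decode('iso-8859-1', 'ignore')  # decode to unicode before parsing
    dom = html.fromstring(text)
    div = dom.xpath("//div[@class='description item vcard']")[0]
    description = get_value(div.xpath(".//p[@class='sales-review-paragraph loose-spacing']")[0])
except (IndexError, UnicodeDecodeError):
    description = ''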