I need to use UTF-8 characters with dryscrape's set method, but when I run it I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
My code (for example):
site = dryscrape.Session()
site.visit("https://www.website.com")
search = site.at_xpath('//*[@name="search"]')
search.set(u'فارسی')
search.form().submit()
I also changed u'فارسی' to search.set(unicode('فارسی', 'utf-8')), but it shows the same error.
It's very easy... This method works perfectly with Google. You can also try it with any other site if you know the URL params:
import dryscrape as d
d.start_xvfb()
br = d.Session()
import urllib.parse
query = urllib.parse.quote("فارسی")
print(query) #it prints : '%D9%81%D8%A7%D8%B1%D8%B3%DB%8C'
Url = "http://google.com/search?q="+query
br.visit(Url)
print(br.xpath('//title')[0].text())
#it prints : Google Search - فارسی
#You can also check it with br.render("url_screenshot.png")
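For reference, quote percent-encodes each UTF-8 byte of the text, and unquote reverses it; a quick stdlib-only sketch:

```python
from urllib.parse import quote, unquote

# quote() UTF-8-encodes the text, then percent-escapes each byte;
# unquote() reverses both steps.
query = quote("فارسی")
print(query)           # %D9%81%D8%A7%D8%B1%D8%B3%DB%8C
print(unquote(query))  # فارسی
```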
I receive the following byte message via a socket connection and I want to convert it into a string for further processing. I am using Python 3.7.
Below is the code I have tried so far:
import codecs
a = b'0400F224648188E0801200000040000000001941678904000010237890000000000000222220418151856038556051259950760020806002468060046010403319 HSBCBSB8001101234567890MC 100 WITH ORDERIN FO AU009006Q\x00\x00\x00\x83\x00007\xa0\x00\x00\x00\x00%\x02010003855604181518562468000000000460100000'
b= codecs.decode(a, 'utf-8')
print(b)
I am getting the error below:
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position > 208: invalid start byte
How can I convert the data to a string and process it further?
Thanks in advance.
Your data is not utf-8 encoded. You can use BeautifulSoup to decode unknown encodings:
from bs4 import BeautifulSoup
soup = BeautifulSoup(b'0400F224648188E0801200000040000000001941678904000010237890000000000000222220418151856038556051259950760020806002468060046010403319 HSBCBSB8001101234567890MC 100 WITH ORDERIN FO AU009006Q\x00\x00\x00\x83\x00007\xa0\x00\x00\x00\x00%\x02010003855604181518562468000000000460100000')
print(soup.contents[0])
print(soup.original_encoding)
to get
0400F224648188E0801200000040000 ... # etc
and
windows-1252
You can also use bs4's encoding detector, UnicodeDammit, separately, and give it suggestions about which encodings to try first (or not to try at all) to fine-tune it.
More info on SO:
How to determine the encoding of text?
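If you cannot install bs4, a stdlib-only fallback is to try a list of candidate encodings in order; a rough sketch (the candidate list is an assumption you would tune for your data, and latin-1 is last because it accepts every byte):

```python
def decode_with_fallback(data: bytes,
                         encodings=("utf-8", "windows-1252", "latin-1")):
    """Try each candidate encoding in turn; return (text, encoding used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue

# \x83 is an invalid UTF-8 start byte, but a valid cp1252 character (ƒ),
# so this sample falls through to windows-1252.
text, enc = decode_with_fallback(b'HSBC\x00\x83\xa0 example')
print(enc)  # windows-1252
```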
I'm hoping to scrape data from the table for passengers going through TSA security lines, but I keep getting this error.
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 33780: character maps to <undefined>
from this code
url = "https://www.tsa.gov/coronavirus/passenger-throughput"
page = requests.get(url).content
soup = BeautifulSoup(page, features = 'lxml')
text = soup.get_text()
soup.prettify()
print(soup)
Are there any suggestions?
Let me explain what actually happened. Read the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 33780: character maps to <undefined>
Now, if I run the following on my side:
print("\u2713")
the output is the following Unicode character:
✓
I believe you are using Windows, where the default console encoding is cp1252, not UTF-8.
You can verify that using the following:
import sys
print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)
Or directly via cmd by running the following command: chcp
Now you can change the console encoding by opening cmd and running the following command:
chcp 65001
Check the official doc.
Identifier .NET Name Additional information
65001 utf-8 Unicode (UTF-8)
Note that if you are using VSCode with Code Runner, run your code in the terminal as py code.py, or add the following setting:
{
"code-runner.executorMap": {
"python": "set PYTHONIOENCODING=utf8 && python"
}
}
Check my previous answer for a similar issue here.
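As a code-level alternative (Python 3.7+), the output stream itself can be switched to UTF-8 at runtime, which avoids touching the console code page; a minimal sketch:

```python
import sys

# Since Python 3.7, io.TextIOWrapper streams can be re-encoded in place.
# The hasattr guard skips the call if stdout has been replaced by a
# non-reconfigurable object.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

print("\u2713")  # prints the check mark instead of raising UnicodeEncodeError
```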
I created a Django application with Python 3.4 on Windows. Now I am trying to host it on an AWS Linux instance. The first time, I faced the following error:
Non-ASCII character '\xe2'
I resolved this issue by adding a UTF-8 coding declaration at the top of each file:
# -*- coding: utf-8 -*-
Now I am facing the following error:
'ascii' codec can't decode byte 0xe2 in position 18: ordinal not in
range(128)
Code:
class TaskTodo:

    @classmethod
    def validate_search(cls, form_data):
        try:
            search_url = 'https://www.foo.com/s-{search}/page-{page}'
            url = search_url.format(page=1, search=form_data['keywords'])
            url = url.encode('utf-8')
            r = requests.get(url)
            not_found_text = 'Sorry, but we didn’t find any results. Below you can find some tips to help you in your search.'
            if not_found_text in r.text.encode('utf-8'):
                return
            # after encoding it's not working on localhost:
            # 'str' does not support the buffer interface
            if r.status_code == 200:
                content = r.text
                soup = BeautifulSoup(content, "html.parser")
                total = soup.find('span', {"class": 'count'}).text.replace('words', '').replace(',', '').strip()
                pages = 1
                last_page = soup.find('a', {"class": 'last follows'})
                if last_page:
                    href = last_page['href'].split('/')
                    pages = int(href[len(href) - 1].replace('somewords', '').strip())
        except Exception as ex:
            raise ex
I have searched and tried various encoding fixes, but nothing works. The application is complete, and most of its functions make HTTP requests, parse HTML, etc. It is really worrying to have to debug on the production server and add encoding to each function.
I am using Apache on the production server and have tried both Python 2.7 and 3.5.
Any idea how I can resolve this issue? Thanks.
After working with the OP in a chatroom it was still unclear where the actual problem came from.
I noticed that the text 'Sorry, but we didn’t …' contains a non-ASCII RIGHT SINGLE QUOTATION MARK (’).
Therefore, I recommended making not_found_text a Unicode string by prefixing the literal with u''.
I also recommended removing all spurious .encode()s and .decode()s.
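In Python 2 the difference looks like this (a minimal sketch; the strings are illustrative, and the code also runs unchanged on Python 3, where every str literal is already Unicode):

```python
# u'' makes the literal a Unicode string, so the membership test compares
# text to text and never forces an implicit ascii decode of bytes
# containing '\u2019' (the right single quotation mark).
not_found_text = u'Sorry, but we didn\u2019t find any results.'
page_text = u'... Sorry, but we didn\u2019t find any results. Below ...'
print(not_found_text in page_text)  # True
```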
I want to use the Python requests module to get data from a server, but I always get bytes data, even though I set headers={'content-type':'application/json;charset=utf-8'}.
My code:
import requests
from io import BytesIO
headers={'content-type':'application/json;charset=utf-8'}
#response=requests.get("https://api-dev.creams.io/buildings/2/contract-templates",headers=headers)
r = requests.get('https://developer.github.com/v3/timeline.json',headers=headers)
print(r.headers)
# response = urlopen("https://beta.creams.io/")
When I print the headers, content-type is still text/html;charset=utf-8,
and I always get bytes data. When I use r.text, I get an error: UnicodeEncodeError: 'ascii' codec can't encode character '\u2022' in position 382: ordinal not in range(128). When I use r.content, I always get bytes data (starting with b'). I just want to get a UTF-8-encoded string. How can I resolve this?
This should work just fine:
import requests as req
r = req.get('https://developer.github.com/v3/timeline.json')
print(r.text)
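The key point is that r.content is always bytes, while r.text is that same payload decoded to str using the detected charset (the UnicodeEncodeError on printing comes from the console encoding, not from requests). A stdlib-only sketch of the relationship, where raw stands in for r.content:

```python
raw = '{"note": "bullet \u2022 demo"}'.encode("utf-8")  # stands in for r.content
text = raw.decode("utf-8")                              # what r.text does for you

print(type(raw))   # <class 'bytes'>
print(type(text))  # <class 'str'>
```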
I'm trying to scrape yahoo finance web pages to get stock price data with Python 3.3, httplib2, and beautifulsoup4. Here is the code:
def getData(symbol='GOOG', period='m'):
    baseUrl = 'http://finance.yahoo.com/q/hp?s='
    url = baseUrl + symbol + '&g=' + period
    h = httplib2.Http('.cache')
    response, content = h.request(url)
    soup = BeautifulSoup(content)
    print(soup.prettify())

getData()
I get the following error trace:
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/mac_roman.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xd7' in position 11875: character maps to <undefined>
I'm new to python and the libraries and would greatly appreciate your help!
This is due to the encoding of your console.
Depending on which console you're working in (Windows, Mac, Linux) the console is trying to display characters it doesn't recognize and therefore can't print to screen.
You could try converting the output string into the encoding of your console.
One easy way I found was to just encode the data into a plain printable string before printing, and it prints just fine.
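One robust variant of that idea is to re-encode the text for the console with errors="replace", so unmappable characters degrade to '?' instead of raising; a sketch (ascii stands in for a limited console codec like mac_roman):

```python
text = "price \xd7 volume"  # '\xd7' is the multiplication sign from the error
# errors="replace" substitutes '?' for anything the codec can't represent.
safe = text.encode("ascii", errors="replace").decode("ascii")
print(safe)  # price ? volume
```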