Python encoding issue w/ ascii to utf-8 - python

I'm currently running into an issue when trying to write data from an API GET request into a file. The error message is: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 1: ordinal not in range(128)"
I know this means I must convert the text from ascii to utf-8, but I'm not sure how to do this. This is the code that I have so far:
import urllib2
import json

def moviesearch(query):
    title = query
    api_key = ""
    f = open('movie_ID_name.txt', 'w')
    for i in range(1,15,1):
        api_key = "http://api.themoviedb.org/3/search/movie?api_key=b4a53d5c860f2d09852271d1278bec89&query="+title+"&page="+str(i)
        json_obj = urllib2.urlopen(api_key)
        json_obj.encode('utf-8')
        data = json.load(json_obj)
        for item in data['results']:
            f.write("<"+str(item['id'])+", "+str(item['title'])+'>\n')
    f.close()

moviesearch("life")
When I run this I get the following error: AttributeError: addinfourl instance has no attribute 'encode'
What can I do to solve this?
Thanks in advance!

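Encoding and decoding only make sense on byte strings and unicode strings. The object returned by urllib2.urlopen() is a file-like response, not a string, which is why it has no encode method. The strings in the data dictionary are already Unicode, which is good, since this makes your life easy. The original error came from str(item['title']), which implicitly encodes the title with the ascii codec. Just encode the value as UTF-8 instead: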
import urllib2
import json

def moviesearch(query):
    title = query
    api_key = ""
    with open('movie_ID_name.txt', 'w') as f:
        for i in range(1,15,1):
            api_key = "http://api.themoviedb.org/3/search/movie?api_key=b4a53d5c860f2d09852271d1278bec89&query="+title+"&page="+str(i)
            json_obj = urllib2.urlopen(api_key)
            data = json.load(json_obj)
            for item in data['results']:
                f.write("<"+str(item['id'])+", "+item['title'].encode('utf-8')+'>\n')

moviesearch("life")
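As a rough illustration of why the ascii codec fails here, and of one alternative that lets the file object do the encoding, the sketch below avoids the network entirely. The `results` list, its titles, and the temporary output path are made up for the example; `io.open` is used because it accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical stand-in for data['results']; not real API output.
results = [
    {"id": 1, "title": u"Life"},
    {"id": 2, "title": u"La vie r\xeav\xe9e des anges"},
]

# Encoding a non-ASCII character as ASCII is what raises the original error.
try:
    u"\xe2".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent the same character (as two bytes).
utf8_bytes = u"\xe2".encode("utf-8")

# Let the file object encode: build unicode lines and write them as-is.
out_path = os.path.join(tempfile.mkdtemp(), "movie_ID_name.txt")
with io.open(out_path, "w", encoding="utf-8") as f:
    for item in results:
        f.write(u"<{0}, {1}>\n".format(item["id"], item["title"]))

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

With this approach there is no per-field `.encode('utf-8')` call to forget; the trade-off is that every string written must be unicode text rather than bytes.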

Related

Can't get rid of illegible contents while writing to a csv file

I've written a script in Python using POST requests to scrape the JSON content from a webpage. When I run my script, I get the result in the console as expected. However, I encounter an issue when I try to write the same content to a csv file.
When I try like:
with open ("outputContent.csv","w",newline="") as f:
I encounter the following error:
Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\all_reviews_grabber.py", line 27, in <module>
    writer.writerow([nom,ville,region])
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 16: character maps to <undefined>
When I try the following instead, the script does produce a csv file with data in it:
with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:
But, the csv file contains some illegible contents, as in:
Beijingshì
Xinjiangwéiwúerzìzhìqu
Shànghaishì
Qingpuqu
Shànghaishì
Xúhuìqu
Putuóqu
This is my script so far:
import csv
import requests
from bs4 import BeautifulSoup

baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"

with requests.Session() as s:
    req = s.get(baseUrl)
    sauce = BeautifulSoup(req.text,"lxml")
    token = sauce.select_one("input[name='_token']")['value']
    payload = {
        'data': 'country=0&type=0&input_search=',
        '_token': token
    }
    res = s.post(postUrl,data=payload)

with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(['nom','ville','region'])
    for item in res.json():
        nom = item['prospect_nom']
        ville = item['prospect_ville']
        region = item['prospect_region']
        print(nom,ville,region)
        writer.writerow([nom,ville,region])
How can I write the content in the right way in a csv file?
Take a look at this - http://www.pgbovine.net/unicode-python-errors.htm
Check your default encoding in your interpreter:
import sys
sys.stdout.encoding
An old version of Python can also cause this error.
Would using pandas to parse and then write alleviate the issue?
import pandas as pd
import requests
from bs4 import BeautifulSoup

baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"

with requests.Session() as s:
    req = s.get(baseUrl)
    sauce = BeautifulSoup(req.text,"lxml")
    token = sauce.select_one("input[name='_token']")['value']
    payload = {
        'data': 'country=0&type=0&input_search=',
        '_token': token
    }
    res = s.post(postUrl,data=payload)
    jsonObj = res.json()

# Collect the rows first and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0.
rows = []
for item in jsonObj:
    nom = item['prospect_nom']
    ville = item['prospect_ville']
    region = item['prospect_region']
    #print(id_,nom,ville,region)
    rows.append([nom,ville,region])

results = pd.DataFrame(rows, columns=['nom','ville','region'])
results.to_csv("outputContent.csv", index=False)
The code works correctly, as long as the print statement is removed*.
The corrupted data that you are seeing is because you are decoding the file data from cp1252, rather than UTF-8 when you view it.
>>> s = 'Xinjiangwéiwúerzìzhìqu'
>>> encoded = s.encode('utf-8')
>>> encoded.decode('cp1252')
'XinjiangwÃ©iwÃºerzÃ¬zhÃ¬qu'
If you are viewing the data by opening the csv file in Python, ensure that you specify UTF-8 encoding** when you open it:
open('outputContent.csv', 'r', encoding='utf-8'...
If you are opening the file with an application such as Excel, ensure that you specify that the encoding is UTF-8 when opening it.
If you don't specify an encoding the default cp1252 encoding will be used to decode the data in the file, and you will see garbage data.
* print will automatically use the default encoding, so you'll get an exception if it tries to encode characters which can't be encoded as cp1252.
** It may also be worth trying the 'utf-8-sig' encoding, which is identical to UTF-8 except that it writes a byte-order mark or BOM (b'\xef\xbb\xbf') at the beginning of the encoded data. Microsoft applications such as Excel generally use that marker to recognise a file as UTF-8.
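The points above can be checked without the website. The following is a minimal sketch, assuming Python 3 and made-up sample rows written to a temporary file: it writes the csv with 'utf-8-sig', reads it back with the matching codec, and then decodes the same bytes as cp1252 to reproduce the garbled text.

```python
import csv
import os
import tempfile

# Sample data for illustration only; not taken from the site.
rows = [["nom", "ville"], ["Exemple", "Xinjiangwéiwúerzìzhìqu"]]

out_path = os.path.join(tempfile.mkdtemp(), "outputContent.csv")

# 'utf-8-sig' writes a BOM first, then plain UTF-8.
with open(out_path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

# Reading with the matching codec strips the BOM and recovers the text.
with open(out_path, "r", newline="", encoding="utf-8-sig") as f:
    read_back = list(csv.reader(f))

# Decoding the same bytes as cp1252 reproduces the garbled output.
with open(out_path, "rb") as f:
    raw = f.read()
garbled = raw[3:].decode("cp1252")  # skip the 3-byte BOM
```

Here `read_back` equals `rows`, while `garbled` contains 'XinjiangwÃ©iwÃºerzÃ¬zhÃ¬qu': the file itself is fine, and only the decoding step differs.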

UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' python 3

This is my code:
import urllib.request

imglinks = ["http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"]
for link in imglinks:
    filename = link.split('/')[-1]
    urllib.request.urlretrieve(link, filename)
It gives me the error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9'
How do I solve this? I tried using .encode('utf-8'), but it gives me:
TypeError: cannot use a string pattern on a bytes-like object
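The problem here is not the text encoding of the string itself but the form of the URL you pass to urllib.request: a URL may only contain ASCII characters, so anything else has to be percent-encoded first.
You need to quote the url as follows: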
import urllib.request
import urllib.parse

imglinks = ["http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"]
for link in imglinks:
    link = urllib.parse.quote(link,safe=':/') # <- here
    filename = link.split('/')[-1]
    urllib.request.urlretrieve(link, filename)
This way your © symbol is encoded as %C2%A9, as the web server expects.
The safe parameter is specified to prevent quote from also modifying the : after http.
It is up to you to modify the code to save the file with the correct original filename. ;)
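One way to keep the original filename is to take it from the link before quoting. The sketch below is only an illustration of that idea: it quotes just the path component via `urllib.parse.urlsplit` rather than the whole string (a variation on the answer's `safe=':/'` approach), and the download call is left commented out so the snippet runs without network access.

```python
import urllib.parse

link = "http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"

# Take the local file name from the original link, before quoting,
# so it keeps the literal © rather than %C2%A9.
filename = link.split("/")[-1]

# Percent-encode only the path; scheme and host are left untouched.
parts = urllib.parse.urlsplit(link)
quoted = urllib.parse.urlunsplit(
    parts._replace(path=urllib.parse.quote(parts.path))
)

# urllib.request.urlretrieve(quoted, filename)  # network call, not run here
```

Quoting only the path avoids having to list every URL delimiter in `safe`; a URL with a query string would need its query part handled separately.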

'UnicodeEncodeError: 'ascii' codec' error when trying to write a £ sign into an Excel sheet using Python

I'm scraping a £ value in Python, and when I try to write it into an Excel sheet the process breaks and I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
The £ sign prints without any error in the cmd prompt. Could someone suggest how I can write the value (£1,750) into my sheet (with or without the £ sign)? Many thanks.
import requests
from bs4 import BeautifulSoup as soup
import csv

outputfilename = 'Ed_Streets2.csv'
outputfile = open(outputfilename, 'wb')
writer = csv.writer(outputfile)
writer.writerow(['Rateable Value'])
url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results'
response = session.get(url)
html = soup(response.text, 'lxml')
prop_link = html.find_all("a", class_="pagelink button small")
for link in prop_link:
    prop_url = base_url+(link["href"])
    response = session.get(prop_url)
    prop = soup(response.content,"lxml")
    RightBlockData = prop.find_all("div", class_="columns small-7 cell")
    Rateable_Value = RightBlockData[0].get_text().strip()
    print (Rateable_Value)
    writer.writerow([Rateable_Value])
You need to encode your unicode object into bytes explicitly. Otherwise Python 2 will try to encode it implicitly with the ascii codec, which fails on non-ASCII characters such as £. So adding this:
Rateable_Value = Rateable_Value.encode('utf8')
before you call
writer.writerow([Rateable_Value])
should do the trick.

Python, UnicodeEncodeError: 'charmap' codec can't encode characters in position

I want to write the HTML of a website to a file I created. Even though I decode the data as utf-8, it still raises the error in the title. When I use print(data1), the HTML is printed properly. I am using Python 3.5.0.
import re
import urllib.request
city = input("city name")
url = "http://www.weather-forecast.com/locations/"+city+"/forecasts/latest"
data = urllib.request.urlopen(url).read()
data1 = data.decode("utf-8")
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt","w")
f.write(data1)
You've opened a file with the default system encoding:
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt", "w")
You need to specify your encoding explicitly:
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt", "w", encoding='utf8')
See the open() function documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
On your system, the default is a codec that cannot handle your data.
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt","w",encoding='utf8')
f.write(data1)
This should work, it did for me
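To see which encoding `open()` would fall back to on a given machine, and to check that an explicit encoding round-trips, something like the following sketch can help. It assumes Python 3; the sample text and the temporary path are made up and stand in for the decoded HTML and the Desktop path from the question.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)

# Sample text standing in for the decoded HTML (data1 in the question).
data1 = "Temp\u00e9rature: 21\u00b0C \u2192 ensoleill\u00e9"

out_path = os.path.join(tempfile.mkdtemp(), "scrape.txt")

# An explicit encoding makes the result independent of the locale;
# the with-block also closes the file, which the original snippet never did.
with open(out_path, "w", encoding="utf-8") as f:
    f.write(data1)

# Read back with the same explicit encoding.
with open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

If `default_encoding` reports something like cp1252, any character outside that code page (the arrow in the sample, for instance) would fail on a plain `open(path, "w")`.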

Encoding CSV lists in python

I need some help with the encoding of a list. I'm new to Python, sorry.
First, I'm using Python 2.7.3.
I have two lists (entidad & valores), and I need to get them encoded or something like that.
My code:
import urllib
from bs4 import BeautifulSoup
import csv
sock = urllib.urlopen("http://www.fatm.com.es/Datos_Equipo.asp?Cod=01HU0010")
htmlSource = sock.read()
sock.close()
soup = BeautifulSoup(htmlSource)
form = soup.find("form", {'id': "FORM1"})
table = form.find("table")
entidad = [item.text.strip() for item in table.find_all('td')]
valores = [item.get('value') for item in form.find_all('input')]
valores.remove('Imprimir')
valores.remove('Cerrar')
header = entidad
values = valores
print values
out = open('tomate.csv', 'w')
w = csv.writer(out)
w.writerow(header)
w.writerow(values)
out.close()
The log: UnicodeEncodeError: 'ascii' codec can't encode character
Any ideas? Thanks in advance!
You have to encode your data to UTF-8 yourself; in Python 2, csv.writer doesn't do it for you:
w.writerow([s.encode("utf-8") for s in header])
w.writerow([s.encode("utf-8") for s in values])
#w.writerow(header)
#w.writerow(values)
This appears to be the same type of problem as the one described in "UnicodeEncodeError in csv writer in Python":
Today I was writing a program that generates a csv file after some processing. But I got the following error while trying on some test data:
writer.writerow(csv_li)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 5: ordinal not in range(128)
I looked into the documentation of csv module in Python and found a class named UnicodeWriter. So I changed my code to
writer = UnicodeWriter(open("filename.csv", "wb"))
Then I tried to run it again. It got rid of the previous UnicodeEncodeError but got into another error.
self.writer.writerow([s.encode("utf-8") for s in row])
AttributeError: 'int' object has no attribute 'encode'
So, before writing the list, I had to change every value to string.
row = [str(item) for item in row]
I think this line can be added in the writerow function of UnicodeWriter class.
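Note that on Python 2, `str(item)` applied to a unicode value containing non-ASCII characters would raise the same UnicodeEncodeError again, so the conversion has to treat text and non-text values differently. A minimal Python 3 sketch of that idea follows; `to_text` is a hypothetical helper name and the row is made-up data, neither comes from the quoted post.

```python
import csv
import io

def to_text(value):
    """Return value as text: decode bytes as UTF-8, keep str, str() the rest."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, str):
        return value
    return str(value)

# Made-up mixed-type row: text, int, float, and UTF-8 bytes.
row = ["¿Qué?", 42, 3.5, b"caf\xc3\xa9"]

# Write into an in-memory text buffer instead of a real file.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([to_text(item) for item in row])
line = buffer.getvalue()
```

When writing to a real file, opening it with `encoding="utf-8"` and `newline=""` takes over the job the Python 2 UnicodeWriter recipe did by hand.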
