Python, UnicodeEncodeError: 'charmap' codec can't encode characters in position - python

I want to write the HTML of a website to a file I created. Even though I decode to utf-8, it still raises an error like this. When I use print(data1), the HTML is printed properly. I am using Python 3.5.0.
import re
import urllib.request
city = input("city name")
url = "http://www.weather-forecast.com/locations/"+city+"/forecasts/latest"
data = urllib.request.urlopen(url).read()
data1 = data.decode("utf-8")
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt","w")
f.write(data1)

You've opened a file with the default system encoding:
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt", "w")
You need to specify your encoding explicitly:
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt", "w", encoding='utf8')
See the open() function documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
On your system, the default is a codec that cannot handle your data.
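Putting it together for the script in the question, a minimal corrected sketch might look like this (the with block is an addition here so the file is closed and flushed automatically):
import urllib.request
city = input("city name")
url = "http://www.weather-forecast.com/locations/" + city + "/forecasts/latest"
data = urllib.request.urlopen(url).read()
data1 = data.decode("utf-8")
# Open the output file with an explicit UTF-8 encoding so Windows'
# default cp1252 codec is never involved; the with block closes the file.
with open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt", "w", encoding="utf-8") as f:
    f.write(data1)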

f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt","w",encoding='utf8')
f.write(data1)
This should work; it did for me.

Related

Can't get rid of illegible contents while writing to a csv file

I've written a script in python using post requests to scrape the json content from a webpage. When I run my script, I get the result in the console as expected. However, I encounter an issue, when I try to write the same in a csv file.
When I try like:
with open ("outputContent.csv","w",newline="") as f:
I encounter the following error:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\all_reviews_grabber.py", line 27, in <module>
writer.writerow([nom,ville,region])
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 16: character maps to <undefined>
When I try the following, the script does produce a csv file containing data:
with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:
However, the csv file contains some illegible content, such as:
Beijingshì
Xinjiangwéiwúerzìzhìqu
Shànghaishì
Qingpuqu
Shànghaishì
Xúhuìqu
Putuóqu
This is my script so far:
import csv
import requests
from bs4 import BeautifulSoup
baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"
with requests.Session() as s:
    req = s.get(baseUrl)
    sauce = BeautifulSoup(req.text,"lxml")
    token = sauce.select_one("input[name='_token']")['value']
    payload = {
        'data': 'country=0&type=0&input_search=',
        '_token': token
    }
    res = s.post(postUrl,data=payload)
    with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(['nom','ville','region'])
        for item in res.json():
            nom = item['prospect_nom']
            ville = item['prospect_ville']
            region = item['prospect_region']
            print(nom,ville,region)
            writer.writerow([nom,ville,region])
How can I write the content in the right way in a csv file?
Take a look at this - http://www.pgbovine.net/unicode-python-errors.htm
Check your default encoding in your interpreter:
import sys
sys.stdout.encoding
An old version of Python can also cause this error.
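If the console encoding turns out to be the culprit, one workaround is to re-wrap stdout (a sketch, assuming Python 3.7+ where TextIOWrapper.reconfigure() exists; on older versions the PYTHONIOENCODING=utf-8 environment variable serves the same purpose):
import sys
# Make print() emit UTF-8 regardless of the console's default codec.
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')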
Would using pandas to parse and then write alleviate the issue?
import pandas as pd
import requests
from bs4 import BeautifulSoup
baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"
with requests.Session() as s:
    req = s.get(baseUrl)
    sauce = BeautifulSoup(req.text,"lxml")
    token = sauce.select_one("input[name='_token']")['value']
    payload = {
        'data': 'country=0&type=0&input_search=',
        '_token': token
    }
    res = s.post(postUrl,data=payload)
    jsonObj = res.json()
results = pd.DataFrame()
for item in jsonObj:
    nom = item['prospect_nom']
    ville = item['prospect_ville']
    region = item['prospect_region']
    #print(id_,nom,ville,region)
    temp_df = pd.DataFrame([[nom,ville,region]], columns = ['nom','ville','region'])
    results = results.append(temp_df)
results = results.reset_index(drop=True)
results.to_csv("outputContent.csv", index=False)
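If the csv is going to be opened in Excel, it may also help to write it with a BOM so Excel detects UTF-8 automatically; a small variation on the last line (encoding is an optional parameter of DataFrame.to_csv):
# 'utf-8-sig' prepends a BOM, which Excel uses to recognise UTF-8 CSV files.
results.to_csv("outputContent.csv", index=False, encoding="utf-8-sig")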
The code works correctly, as long as the print statement is removed*.
The corrupted data that you are seeing is because you are decoding the file data from cp1252, rather than UTF-8 when you view it.
>>> s = 'Xinjiangwéiwúerzìzhìqu'
>>> encoded = s.encode('utf-8')
>>> encoded.decode('cp1252')
'Xinjiangwéiwúerzìzhìqu'
If you are viewing the data by opening the csv file in Python, ensure that you specify UTF-8 encoding** when you open it:
open('outputContent.csv', 'r', encoding='utf-8'...
If you are opening the file with an application such as Excel, ensure that you specify that the encoding is UTF-8 when opening it.
If you don't specify an encoding the default cp1252 encoding will be used to decode the data in the file, and you will see garbage data.
* print will automatically use the default encoding, so you'll get an exception if it tries to encode characters which can't be encoded as cp1252.
** It may also be worth trying the 'utf-8-sig' encoding, which is a Microsoft-specific version of UTF-8 that inserts a byte-order-mark or BOM (b'\xef\xbb\xbf') at the beginning of encoded strings, but is otherwise identical to UTF-8.
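As a quick interpreter illustration of that difference (not specific to this script), 'utf-8-sig' simply prepends the BOM bytes to an otherwise identical encoding:
>>> 'é'.encode('utf-8')
b'\xc3\xa9'
>>> 'é'.encode('utf-8-sig')
b'\xef\xbb\xbf\xc3\xa9'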

Python reading a PE file and changing resource section

I am trying to open a Windows PE file and alter some strings in the resource section.
f = open('c:\test\file.exe', 'rb')
file = f.read()
if b'A'*10 in file:
s = file.replace(b'A'*10, newstring)
In the resource section I have a string that is just:
AAAAAAAAAA
And I want to replace that with something else. When I read the file I get:
\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A
I have tried opening with UTF-16 and decoding as UTF-16, but then I run into an error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1604-1605: illegal encoding
Everyone I have seen with the same issue fixed it by decoding as UTF-16. I am not sure why this doesn't work for me.
If the resource inside the binary file is encoded as UTF-16, you shouldn't change the file's encoding; encode your search string instead.
Try this:
f = open('c:\\test\\file.exe', 'rb')
file = f.read()
unicode_str = u'AAAAAAAAAA'
encoded_str = unicode_str.encode('UTF-16')
new_utf_string = u'BBBBBBBBBB'  # placeholder for whatever same-length string you want to write instead
if encoded_str in file:
    s = file.replace(encoded_str, new_utf_string.encode('UTF-16'))
Keep in mind that inside a binary file everything is encoded.
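One caveat worth checking (an interpreter sketch, assuming Python 3 byte-string reprs): the plain 'UTF-16' codec prepends a byte-order mark when encoding, while the bytes dumped in the question have no BOM, so one of the explicit endian-specific codecs may be what actually matches the file:
>>> u'AAAA'.encode('utf-16')     # BOM (\xff\xfe) followed by little-endian bytes
b'\xff\xfeA\x00A\x00A\x00A\x00'
>>> u'AAAA'.encode('utf-16-le')  # no BOM, little-endian
b'A\x00A\x00A\x00A\x00'
>>> u'AAAA'.encode('utf-16-be')  # no BOM, big-endian, the \x00A pattern shown in the question
b'\x00A\x00A\x00A\x00A'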

unable to decode this string using python

I have this text.ucs file which I am trying to decode using python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried decoding with utf-16 and utf-8:
content.decode('utf-16')
and I get this error:
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 32-33: illegal encoding
Please let me know if I am missing anything or my approach is wrong
The string is encoded as UTF16-BE (Big Endian), this works:
content.decode("utf-16-be")
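A minimal end-to-end sketch (assuming Python 2, as in the question, and that the whole file really is UTF-16-BE):
# Read the raw bytes rather than text, then decode explicitly as big-endian UTF-16.
with open('text.ucs', 'rb') as f:
    content = f.read()
text = content.decode('utf-16-be')
print text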
As I understand it, you are using Python 2.x, but the encoding parameter of open() was only added in Python 3.x. I am no expert on Python 2.x, but you can use io.open instead, for example:
file = io.open('text.ucs', 'r', encoding='utf-8')
content = file.read()
print content
Remember that you need to import the io module first.
You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
    text = f.read()
Your string needs to be decoded with the utf-8 codec. You can do what I did below to decode your string:
f = open('text.ucs', 'r', encoding='utf-8')
print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 0: invalid start byte

This is my code.
import csv
stock_code = open('/home/ubuntu/trading/456.csv', 'r')
csvReader = csv.reader(stock_code)
for st in csvReader:
    eventcode = st[1]
    print(eventcode)
I want to read the contents of the CSV file (exported from Excel).
But I get a UnicodeDecodeError.
How can I fix it?
The CSV docs say,
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding...
The error message shows that your system is expecting the file to be using UTF-8 encoding.
Solutions:
Make sure the file is using the correct encoding.
For example, open the file using NotePad++, select Encoding from the menu
and select UTF-8. Then resave the file.
Alternatively, specify the encoding of the file when calling open(), like this
my_encoding = 'UTF-8'  # or whatever the encoding of the file actually is.
with open('/home/ubuntu/trading/456.csv', 'r', encoding=my_encoding) as stock_code:
    csvReader = csv.reader(stock_code)
    for st in csvReader:
        eventcode = st[1]
        print(eventcode)
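If you do not know the file's encoding, a rough way to guess it first (a sketch, assuming the third-party chardet package is installed; treat the result as a hint, not a guarantee):
import chardet
# Read the raw bytes and let chardet guess the most likely encoding.
with open('/home/ubuntu/trading/456.csv', 'rb') as raw:
    guess = chardet.detect(raw.read())
print(guess['encoding'], guess['confidence'])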

Why does printing to a utf-8 file fail?

So I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.
This is related to a problem I had the other week: python check if utf-8 string is uppercase
Basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree
outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
          u'R\xc9SUM\xc9', # RESUME with accents
          u'R\xe9sum\xe9', # Resume with accents
          u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word
print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
it fails with the following:
Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
but if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w') it works perfectly.
So what's happening?
Since encoding='utf-8' is specified in etree.tostring(), is it encoding the output again?
If I keep codecs.open() and remove encoding='utf-8', the file then becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
But etree.tostring() is simply being written out, and is then redirected to a file that was created as a utf-8 file?
Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and want to figure out why my solution works.
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
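A minimal Python 2 sketch of that pattern (the file names are illustrative only):
# Decode once on input: bytes -> unicode.
with open('input.txt', 'rb') as f:
    text = f.read().decode('utf-8')
# ... work with `text` as a unicode object internally ...
# Encode once on output: unicode -> bytes.
with open('output.txt', 'wb') as f:
    f.write(text.encode('utf-8'))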
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree
root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')
words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']
for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')
In addition to MRAB's answer, here are some lines of code:
import codecs
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
