Writing to file in python gives ascii error

I'm trying to write the results of web scraping to an HTML file. I'm using Beautiful Soup to scrape links and text from web pages. Then, when I create the file and write to it, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 939-940: ordinal not in range(128)
The line writing to file looks like this:
file_object.write(file_content)
And when I instead do this:
file_object.write(file_content.encode('utf-8'))
I don't get an error, but it can't print special characters, like å or ä.
I realize this is some kind of encoding error, but I can't understand how to get around it. The project in its entirety is located here, line 81, since I had trouble extracting runnable and logical sub parts.
I'm using a Mac, but had similar problem running the same script on a pc. Using python 2.7

Yes, use open() from the codecs module or, in Python 3, the normal (built-in) open(), like this:
f = open(path, "wt", encoding="UTF-8")
But if you don't want to change your code much, you do not need anything special.
The trick is to add the UTF-8 BOM (byte order mark) at the beginning of your file, so that an editor that opens it knows that it is a UTF-8 file and treats it as such.
The change you should make:
file_object.write('\xef\xbb\xbf'+file_content.encode('utf-8'))
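A minimal sketch of the same idea without writing the BOM bytes by hand, assuming a Python version where io.open() is available (2.7 and 3.x). The "utf-8-sig" codec emits the BOM itself. The helper name write_html, the sample text and the temporary path are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_html(path, text, bom=True):
    """Write unicode text as UTF-8, optionally prefixed with a BOM."""
    # "utf-8-sig" writes the bytes EF BB BF before the first character.
    encoding = "utf-8-sig" if bom else "utf-8"
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


# Hypothetical sample content and a temporary output path.
sample = u"<p>\u00e5 \u00e4 \u00f6</p>"
out_path = os.path.join(tempfile.mkdtemp(), "result.html")
write_html(out_path, sample)

# Read the file back as raw bytes to inspect what was written.
with io.open(out_path, "rb") as f:
    raw = f.read()
print(repr(raw[:3]))
```

The text passed in has to be a unicode string (u"..." in Python 2), not an already encoded byte string.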

Related

I keep getting 'charmap' codec can't encode characters error when trying to save python script's output to clipboard or text file [duplicate]

I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?

I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
    f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you want to use a different encoding than UTF-8, specify whatever your actual encoding is for encoding.

I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).

In Python 3.7, and running Windows 10 this worked (I am not sure whether it will work on other platforms and/or other versions of Python)
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
This works because the file is now encoded as UTF-8, which can represent every Unicode character, so the write no longer fails when it encounters a character that is not supported by the platform's default encoding.
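The failure can be reproduced without any file at all, which may make the cause easier to see. This is only a sketch: it assumes the platform default is cp1252, as in the traceback in the question, and the sample string and the helper name can_encode are invented.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Polish letters that have no slot in the 8-bit cp1252 table.
sample = "g\u0119\u015b"

fits_cp1252 = can_encode(sample, "cp1252")
fits_utf8 = can_encode(sample, "utf-8")
print(fits_cp1252, fits_utf8)
```

Opening a file without encoding= on such a system runs into the same limitation as the cp1252 call here.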

set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
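reconfigure() only exists on text streams from Python 3.7 onwards, so a guarded variant may be safer in scripts that also run on older versions. A small sketch, demonstrated on an in-memory stream rather than the real console; the helper name use_utf8 is hypothetical.

```python
import io


def use_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure() (3.7+)."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(encoding="utf-8")
    return stream


# An in-memory stream that starts out as cp1252, standing in for stdout.
buffer = io.BytesIO()
demo = use_utf8(io.TextIOWrapper(buffer, encoding="cp1252"))
demo.write("\u0119")  # would raise UnicodeEncodeError under cp1252
demo.flush()
```

In a real script the same helper would presumably be called on sys.stdout and sys.stderr near the top of the program.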
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252

While saving the response of a GET request, the same error was thrown on Python 3.7 on Windows 10. The response received from the URL was encoded as UTF-8, so it is worth checking the response's encoding and passing it to open(); this kind of trivial issue can waste a lot of time in production.
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open('NiftyList.txt', 'w') as f:
    f.write(resp.text)
When I added encoding="utf-8" to the open() call, the file was saved with the correct content:
with open('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)

I faced the same encoding issue, which can occur when you try to print the data, read or write it, or open it. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it.
soup.encode("utf-8")
If you are trying to write the scraped data to a file, open the file with encoding="utf-8":
with open(filename_csv, 'w', newline='', encoding="utf-8") as csv_file:

For those still getting this error, adding encode("utf-8") to soup will also fix this.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)

There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.
Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get an UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252" because this limited 8-bit character set has no way to represent these characters.
Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.
Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like HÃ©llÃ¶ instead of Héllö, this is your issue.
In short, then, you need to know:
What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.
What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically?
Perhaps see also How to display utf-8 in windows console
If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.
Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.)
Perhaps see also https://utf8everywhere.org/
Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, a common symptom when viewing UTF-8 text using Windows code page 1252 would result, for example, in Héllö displaying as HÃ©llÃ¶.
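The Héllö example can be reproduced in a few lines, which also shows that this kind of damage is often reversible. A sketch, assuming the text was written as UTF-8 and then read as Windows code page 1252:

```python
original = "H\u00e9ll\u00f6"  # Héllö

# Written correctly as UTF-8 ...
utf8_bytes = original.encode("utf-8")

# ... but displayed by a tool that assumes cp1252: each two-byte
# sequence turns into two separate characters.
garbled = utf8_bytes.decode("cp1252")

# Undoing the wrong decoding step recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
```

The round trip only works while every garbled character still maps back to a cp1252 byte; once characters have been dropped or replaced, the original is lost.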
If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)
To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
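The table lookup can also be approximated in code by trying a handful of candidate codecs against a string whose intended text is known. This is a rough sketch; the candidate list is an arbitrary selection, not a complete survey of legacy encodings.

```python
raw = b"H\x8ell\x9a"
expected = "H\u00e9ll\u00f6"  # Héllö

# A few legacy 8-bit codecs that ship with Python (arbitrary selection).
candidates = ["cp1252", "latin-1", "cp437", "cp850", "mac_roman"]

matches = []
for name in candidates:
    try:
        if raw.decode(name) == expected:
            matches.append(name)
    except UnicodeDecodeError:
        # Some codecs leave certain byte values undefined.
        pass

print(matches)
```

With more sample strings the list of surviving candidates usually shrinks quickly.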
Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252.
(Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
If you are on Windows and write UTF-8 to a text file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out - rather than guess - which encoding to use.

From Python 3.7 onwards,
Set the environment variable PYTHONUTF8 to 1.
The following script also sets some other useful system environment variables.
rem Enable Python UTF-8 mode system-wide
setx /m PYTHONUTF8 1
rem Allow Python files to be run from CMD without typing the extension
setx PATHEXT "%PATHEXT%;.PY"
rem Set the default Python version for the py launcher
setx /m PY_PYTHON 3.10
Source

I got the same error, so I passed encoding="utf-8" and it solved the error.
This generally happens when the text data contains a character that the default encoding cannot represent.
with open("text.txt", "w", encoding='utf-8') as f:
    f.write(data)
This will solve your problem.

If you are using Windows, try passing encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'.
Example:
csv_data = pd.read_csv(csvpath, encoding='iso-8859-1')
print(soup.encode('iso-8859-1'))

Change encoding for locally stored .html files downloaded with urllib.request.urlretrieve()

I used the following python code to save an html file to local storage:
url = "some_url.html"
urllib.request.urlretrieve(url, 'save/to/path')
This successfully saves the file with a .html extension. When I attempt to open the file with:
html_doc = open('save/to/path/some_url.html', 'r')
I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 36255: ordinal not in range(128)
I think this means I am attempting to read a utf-8 file with a ascii codec. I attempted the solution found at:
Convert Unicode to ASCII without errors in Python
But this, as well as other solutions I have found, only seem to work for encoding the file for immediate viewing and not saved files. I cannot find one that works for altering the encoding of a locally stored file.

The open() function has an optional encoding parameter.
Its default is platform dependent, and in your case it apparently defaults to ASCII.
If you know the correct codec (e.g. from an HTTP header), you can specify it:
html_doc = open('path/to/file.html', 'r', encoding='cp1252')
If you don't know it, chances are that it is written in the file.
You can open the file in binary mode:
html_doc = open('path/to/file.html', 'rb')
and then try to find an encoding declaration and decode the whole thing in memory.
However, don't do that.
There's not much use in opening and processing HTML like a text file.
You should use an HTML parser to walk through the document tree and extract whatever you need.
Python's standard library has one, but you might find Beautiful Soup easier to use.

UnicodeDecodeError when reading a text file

I am a beginner to Python (I am using 3.4). This is the relevant part of my code.
fileObject = open("countable nouns raw.txt", "rt")
bigString = fileObject.read()
fileObject.close()
Whenever I try to read this file I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 82273: character maps to <undefined>
I have been reading around and it seems to be something to do with my default encoding not matching the text file encoding. I've read in another post that you can use this method to read a file with a specific encoding:
import codecs
f = codecs.open("file.txt", "r", "utf-8")
But you have to know it in advance. The thing is I don't know how the text file is encoded. A few posts suggested using Chardet. I've installed it but I have no idea how to get it to read a text file.
Any ideas on how to get around this??

There is no need to use codecs.open(); that's advice for Python 2.
In Python 3 open() takes an encoding argument:
fileObject = open("countable nouns raw.txt", "rt", encoding='utf8')
This does require that you know what codec was used for the file, of course. Generally speaking, there is no easy way for Python to figure that out; individual file formats may include codec information or have standardised on a given codec, but if all you have is a generic text file, you'll have to figure out what created it and what codec that used to write the data.
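The byte 0x9d from the question is a useful hint: it is undefined in Windows code page 1252, but it is the last byte of a UTF-8 encoded right curly quote. A sketch of how that plays out, using an invented sample rather than the actual file:

```python
# Invented sample: the word "nouns" in curly quotes, stored as UTF-8.
data = "\u201cnouns\u201d".encode("utf-8")

# Decoding with cp1252 (a typical Windows default) fails on byte 0x9d.
try:
    data.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Decoding with the codec the data was written in succeeds.
text = data.decode("utf-8")
```

This does not prove the file in the question is UTF-8, but it makes UTF-8 a reasonable first guess to pass as encoding=.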

In addition to using the correct Python method to specify the encoding when using open, you could try to get the encoding using the file tool.
A file foo.txt containing
ÙÚÛÜ
can be checked using
$ file foo.txt
foo.txt: UTF-8 Unicode text
$ wc foo.txt
1 1 9 foo.txt
As you can see by using wc, it contains nine bytes, two for each character, one newline.
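The byte count reported by wc can be checked from Python as well. A short sketch, assuming the file holds exactly the four characters shown plus a trailing newline:

```python
text = "\u00d9\u00da\u00db\u00dc\n"  # ÙÚÛÜ followed by a newline

# Total size of the text once encoded as UTF-8.
byte_count = len(text.encode("utf-8"))

# Size of each character on its own: two bytes per letter, one for "\n".
per_char = [len(ch.encode("utf-8")) for ch in text]

# The same text in an 8-bit encoding needs one byte per character.
latin1_count = len(text.encode("latin-1"))
```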

To add to Martijn Pieters' answer, you may want to check out this link:
http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/
if you are a Mac user and have trouble figuring out what encoding a particular file you have is in.

One way you can detect the encoding on any operating system is by using the library chardet.
If you don't have it, make sure you run pip install chardet . After that, it is fairly simple:
import chardet
import requests
content = requests.get("http://yahoo.co.jp/").content
detect = chardet.detect(content)
print(detect)
This library tries to detect what the encoding is. This doesn't mean that it is 100% right, just that it will likely be correct. Then you can just read the file:
open('file.txt', encoding=detect['encoding'])

How to save the output of airport -s -x to file with Python

i am learning python, and i am having troubles with saving the output of a small function to file. My python function is the following:
#!/usr/local/bin/python
import subprocess
import codecs
airport = '/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport'
def getAirportInfo():
    arguments = [airport, "--scan" , "--xml"]
    execute = subprocess.Popen(arguments, stdout=subprocess.PIPE)
    out, err = execute.communicate()
    print out
    return out
airportInfo = getAirportInfo()
outFile = codecs.open('wifi-data.txt', 'w')
outFile.write(airportInfo)
outFile.close()
I guess that this would only work on a Mac, as it references some PrivateFrameworks.
Anyways, the code almost works as it should. The print statement prints a huge xml file, that i'd like to store in a file, for future processing. And here start the problems.
In the version above, the script executes without any errors, however, when i try to open the file, i get an error message, along the lines of error with utf-8 encoding. Ignoring this, opens the file, and most of the things look fine, except for a couple of things:
some SSID have non-ascii characters, like ä, ö and ü. When printing those on the screen, they are correctly displayed as \xc3\xa4 and so on. When I open the file it is displayed incorrectly, the usual random garbage.
some of the xml values look like these when printed on screen: Data("\x00\x11WLAN-0024FE056185\x01\x08\x82\x84\x8b\x96\x0c\ … x10D\x00\x01\x02") but like this when read from file: //8AAAAAAAAAAAAAAAAAAA==
I thought it could be an encoding error (seen as the Umlauts have problems, the error message says something about the utf-8 encoding being messed up, and the text containing \x type of characters), and i tried looking here for possible solutions. However, no matter what i do, there are still errors:
adding an additional argument 'utf-8' to the codecs.open yields a
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 24227: ordinal not in range(128) and the generated file is empty.
explicitly encoding to utf-8 with outFile.write(airportInfo.encode('utf-8')) before saving results in the same error
not being an expert, i tried decoding it, maybe i was just doing the exact opposite of what needed to be done, but i got an UnicodeDecodeError: 'utf8' codec can't decode byte 0x8a in position 8980: invalid start byte
The only thing that worked (unsurprisingly) was to write the repr() of the string to file, but that is just not what i need, and also i can't make a nice .plist of a file full of escape symbols.
So please, please, can somebody help me? What am i missing?
If it helps, the type that gets saved in airportInfo is str (as in type(airportInfo) == str) and not u

You don't need re-encoding when your text is already unicode. Just write the text to a file. It should just work.
In [1]: t = 'äïöú'
In [2]: with open('test.txt', 'w') as f:
   ...:     f.write(t)
   ...:
Additionally, you can make getAirportInfo simpler by using subprocess.check_output(). Also, mixed case names should only be used for classes, not functions. See PEP8.
import subprocess
def get_airport_info():
    args = ['/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport',
            '--scan', '--xml']
    return subprocess.check_output(args)
airport_info = get_airport_info()
# check_output() returns bytes on Python 3 (str on Python 2), so write in binary mode
with open('wifi-data.txt', 'wb') as outf:
    outf.write(airport_info)
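Since the command's output is a byte string that may contain raw binary data, one way to avoid codec errors altogether is to store it unchanged in binary mode and decode only when the text is actually needed. A sketch using made-up stand-in bytes instead of running airport, which only exists on macOS:

```python
import os
import tempfile

# Hypothetical stand-in for the bytes returned by the airport command:
# some UTF-8 text plus bytes that are not valid UTF-8.
payload = b"<string>caf\xc3\xa9</string><data>\x00\x11\x8a</data>"

path = os.path.join(tempfile.mkdtemp(), "wifi-data.txt")

# "wb" writes the bytes exactly as they are; no codec is involved.
with open(path, "wb") as f:
    f.write(payload)

# Reading back in binary mode returns the identical byte string.
with open(path, "rb") as f:
    round_trip = f.read()
```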

I should have read this before my original answer:
What is the difference between encode/decode?
I always get confused between string and unicode conversion. On my mac, import sys; sys.getfilesystemencoding() suggests that subprocess returns a 'utf-8' string - so I don't know why airportInfo.encode('utf-8') fails. Is it possible to do airportInfo.encode('utf-8', 'ignore') and throw out the invalid characters?
Also - have you tried writing your file as wb: outFile = codecs.open('wifi-data.txt', 'wb') - doesn't 'w' open an ascii file?
Regarding your text editor - it may handle unicode characters differently. If it's reading a unicode text file as ascii, then the unicode characters may appear as a garbled mess. You might try naming the file .xml, in which case, depending on your text editor, it may be read correctly as unicode.
