I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?
I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you want to use a different encoding than UTF-8, specify whatever your actual encoding is for encoding.
I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).
In Python 3.7, and running Windows 10 this worked (I am not sure whether it will work on other platforms and/or other versions of Python)
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
The reason why it is working is because the encoding is changed to UTF-8 when using the file, so characters in UTF-8 are able to be converted to text, instead of returning an error when it encounters a UTF-8 character that is not suppord by the current encoding.
set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252
While saving the response of get request, same error was thrown on Python 3.7 on window 10. The response received from the URL, encoding was UTF-8 so it is always recommended to check the encoding so same can be passed to avoid such trivial issue as it really kills lots of time in production
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open ('NiftyList.txt', 'w') as f:
f.write(resp.text)
When I added encoding="utf-8" with the open command it saved the file with the correct response
with open ('NiftyList.txt', 'w', encoding="utf-8") as f:
f.write(resp.text)
Even I faced the same issue with the encoding that occurs when you try to print it, read/write it or open it. As others mentioned above adding .encoding="utf-8" will help if you are trying to print it.
soup.encode("utf-8")
If you are trying to open scraped data and maybe write it into a file, then open the file with (......,encoding="utf-8")
with open(filename_csv , 'w', newline='',encoding="utf-8") as csv_file:
For those still getting this error, adding encode("utf-8") to soup will also fix this.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)
There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.
Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get an UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252" because this limited 8-bit character set has no way to represent these characters.
Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.
Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like Héllö instead of Héllö, this is your issue.
In short, then, you need to know:
What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.
What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically?
Perhaps see also How to display utf-8 in windows console
If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.
Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.)
Perhaps see also https://utf8everywhere.org/
Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, a common symptom when viewing UTF-8 text using Windows code page 1252 would result, for example, in Héllö displaying as Héllö.
If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)
To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252.
(Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
If you are on Windows and write UTF-8 to a text file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out - rather than guess - which encoding to use.
From Python 3.7 onwards,
Set the the environment variable PYTHONUTF8 to 1
The following script included other useful variables too which set System Environment Variables.
setx /m PYTHONUTF8 1
setx PATHEXT "%PATHEXT%;.PY" ; In CMD, Python file can be executed without extesnion.
setx /m PY_PYTHON 3.10 ; To set default python version for py
Source
I got the same error so I use (encoding="utf-8") and it solve the error.
This generally happens when we got some unidentified symbol or pattern in text data that our encoder does not understand.
with open("text.txt", "w", encoding='utf-8') as f:
f.write(data)
This will solve your problem.
if you are using windows try to pass encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'
example:
csv_data = pd.read_csv(csvpath,encoding='iso-8859-1')
print(print(soup.encode('iso-8859-1')))
Some people use the following to declare the encoding method for the text of their Python source code:
# -*- coding: utf-8 -*-
Back in 2001, it is said the default encoding method that Python interpreter assumes is ASCII. I have dealt with strings using non-ASCII characters in my Python code, without declaring encoding method of my code, and I don't remember I have bumped into encoding error before. What is the default encoding for code assumed by Python interpreter now?
I am not sure if this is relevant.
My OS is Ubuntu, and I am using the default Python interpreter, and gedit or emacs for editing.
Will the default encoding method by Python interpreter changes if the above changes?
Thanks.
Without any explicit encoding declaration, the assumed encoding for your source code will be
ascii for Python 2.x
utf-8 for Python 3.x
See PEP 0263 and Using source code encoding for Python 2.x, and PEP 3120 for the new default of utf-8 for Python 3.x.
So the default encoding assumened for source code will be directly dependent of the version of the Python interpreter, and it is not configurable.
Note that the source code encoding is something entirely different than dealing with non-ASCII characters as part of your data in strings.
There are two distinct cases where you may encounter non-ASCII characters:
As part of your programs data, during runtime
As part of your source code (and since you can't have non-ASCII characters in identifiers, that usually means hard coded string data in your source code or comments).
The source code encoding declaration affects what encoding your source code will be interpreted with - so it's only needed if you decide to directly put non-ASCII characters in your source code.
So, the following code will eventually have to deal with the fact that there might be non-ASCII characters in data.txt:
with open('data.txt') as f:
for line in f:
# do something with `line`
But it doesn't contain any non-ASCII characters in the source code, therefore it doesn't need an encoding declaration at the top of the file. It will however need to properly decode line if it wants to turn it into unicode. Simply doing unicode(line) will use the system default encoding, which is ascii (different from the default source encoding, but happens to also be ascii). So to explicitely decode the string using utf-8 you'd need to do line.decode('utf-8').
This code however does contain non-ASCII characters directly in its source code:
TEST_DATA = 'Bär' # <--- non-ASCII character on this line
print TEST_DATA
And it will fail with a SyntaxError similar to this, unless you declare an explicit source code encoding:
SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 1, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
So assuming your text editor is configured to save files in utf-8, you'd need to put the line
# -*- coding: utf-8 -*-
at the top of the file for Python to interpret the source code correctly.
My advice however would be to generally avoid putting non-ASCII characters in your source code, exactly because if it depends on your and your co-workers editor and terminal settings wheter it will be written and read correctly.
Instead you can use escaped strings to safely enter non-ASCII characters in your code:
TEST_DATA = 'B\xc3\xa4r'
By default, Python source files are treated as encoded in UTF-8. In that encoding, — although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow. To display all these characters properly, the editor must recognize that the file is UTF-8, and it must use a font that supports all the characters in the file.
It is also possible to specify a different encoding for source files. In order to do this, we put the below code on top of our code !
# -*- coding: encoding -*-
https://docs.python.org/dev/tutorial/interpreter.html
I am on Mac OS X 10.8.2
When I try to find files with filenames that contain non-ASCII-characters I get no results although I know for sure that they are existing. Take for example the console input
> find */Bärlauch*
I get no results. But if I try without the umlaut I get
> find */B*rlauch*
images/Bärlauch1.JPG
So the file is definitely existing. If I rename the file replacing 'ä' by 'ae' the file is being found.
Similarily the Python module glob is not able to find the file:
>>> glob.glob('*/B*rlauch*')
['images/Bärlauch1.JPG']
>>> glob.glob('*/Bärlauch*')
[]
I figured out it must have something to do with the encoding but my terminal is set to be utf-8 and I am using Python 3.3.0 which uses unicode strings.
Mac OS X uses denormalized characters always for filenames on HFS+. Use unicodedata.normalize('NFD', pattern) to denormalize the glob pattern.
import unicodedata
glob.glob(unicodedata.normalize('NFD', '*/Bärlauch*'))
Python programs are fundamentally text files. Conventionally, people write them using only characters from the ASCII character set, and thus do not have to think about the encoding they write them in: all character sets agree on how ASCII characters should be decoded.
You have written a Python program using a non-ASCII character. Your program thus comes with an implicit encoding (which you haven't mentioned): to save such a file, you have to decide how you are going to represent a-umlaut on disk. I would guess that perhaps your editor has chosen something non-Unicode for you.
Anyway, there are two ways around such a problem: either you can restrict yourself to using only ASCII characters in the source code of your program, or you can declare to Python that you want it to read the text file with a specific encoding.
To do the former, you should replace the a-umlaut with its Unicode escape sequence (which I think is \x0228 but can't test at the moment). To do the latter, you should add a coding declaration at the top of the file:
# -*- coding: <your encoding> -*-
this is my code:
#!/usr/bin/python
#-*-coding:utf-8-*-
import xlrd,sys,re
data = xlrd.open_workbook('a.xls',encoding_override="utf-8")
a = data.sheets()[0]
s=''
for i in range(a.nrows):
if 9<i<20:
#stage
print a.row_values(i)[1].decode('shift_jis')+'\n'
but it show :
????
????????
??????
????
????
????
????????
so what can i do ,
thanks
Background: In a "modern" (Excel 97-2003) XLS file, text is effectively stored as Unicode. In older files, text is stored as 8-bit strings, and a "codepage" record tells how it is encoded e.g. the integer 1252 corresponds to the encoding known as cp1252 or windows-1252. In either case, xlrd presents extracted text as unicode objects.
Please insert this line into your code:
print data.biff_version, data.codepage, data.encoding
If you have a new file, you should see
80 1200 utf_16_le
In any case, please edit your question to report the outcome.
Problem 1: encoding_override is required ONLY if the file is an old file AND you know/suspect that the codepage record is omitted or wrong. It is ignored if the file is a new file. Do you really know that the file is pre-Excel-97 and the text is encoded in UTF-8? If so, it can only have been created by some seriously deluded 3rd-party software, and Excel will blow up if you try to open it with Excel; visit the author with a baseball bat. Otherwise, don't use encoding_override.
Problem 2: You should have unicode objects. To display them, you need to encode (not decode) them from unicode to str using a suitable encoding. It is very suprising that print unicode_object.decode('shift-jis') doesn't raise an exception and prints question marks.
To help understand this, please change your code to be like this:
text = a.rowvalues(i)[1]
print i, repr(text)
print repr(text.decode('shift-jis'))
and report the outcome.
So that we can help you choose an appropriate encoding (if any), tell us what version of what operating system you are using, and what the following display:
print sys.stdout.encoding
import locale
print locale.getpreferredencoding()
Further reading:
(1) the xlrd documentation (section on Unicode, right up the front) ... included in the distribution, or get the latest commit here.
(2) the Python Unicode HOWTO.
Why isn't your encoding override on open shift-jis?
data = xlrd.open_workbook('a.xls',encoding_override="shift-jis")
If the file is really shift-JIS, there are lots of code points (well frankly, almost all of them) that don't overlap with valid UTF-8 code points. If you are getting illegal characters (?) and your file is really UTF-8 and you want to output Shift-JIS, might I suggest that your output shell (for print - probably a file would be fine) can't handle the encoding.
I have made some adaptations to the script from this answer. and I am having problems with unicode. Some of the questions end up being written poorly.
Some answers and responses end up looking like:
Yeah.. I know.. I’m a simpleton.. So what’s a Singleton? (2)
How can I make the ’ to be translated to the right character?
Note: If that matters, I'm using python 2.6, on a French windows.
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
EDIT1: Based on Ryan Ginstrom's post, I have been able to correct a part of the output, but I am having problems with python's unicode.
In Idle / python shell:
Yeah.. I know.. I’m a simpleton.. So
what’s a Singleton?
In a text file, when redirecting stdout
Yeah.. I know.. I’m a simpleton.. So
what’s a Singleton?
How can I correct that ?
Edit2: I have tried Jarret Hardie's solution but it didn't do anything.
I am on windows, using python 2.6, so my site-packages folder is at:
C:\Python26\Lib\site-packages
There was no siteconfig.py file, so I created one, pasted the code provided by Jarret Hardie, started a python interpreter, but seems like it has not been loaded.
sys.getdefaultencoding()
'ascii'
I noticed there is a site.py file at :
C:\Python26\Lib\site.py
I tried changing the encoding in the function
def setencoding():
"""Set the string encoding used by the Unicode implementation. The
default is 'ascii', but if you're willing to experiment, you can
change this."""
encoding = "ascii" # Default value set by _PyUnicode_Init()
if 0:
# Enable to support locale aware default string encodings.
import locale
loc = locale.getdefaultlocale()
if loc[1]:
encoding = loc[1]
if 0:
# Enable to switch off string to Unicode coercion and implicit
# Unicode to string conversion.
encoding = "undefined"
if encoding != "ascii":
# On Non-Unicode builds this will raise an AttributeError...
sys.setdefaultencoding(encoding) # Needs Python Unicode build !
to set the encoding to utf-8. It worked (after a restart of python of course).
>>> sys.getdefaultencoding()
'utf-8'
The sad thing is that it didn't correct the caracters in my program. :(
You should be able to convert HTMl/XML entities into Unicode characters. Check out this answer in SO:
Decoding HTML Entities With Python
Basically you want something like this:
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
Does changing your default encoding in siteconfig.py work?
In your site-packages file (on my OS X system it's in /Library/Python/2.5/site-packages/) create a file called siteconfig.py. In this file put:
import sys
sys.setdefaultencoding('utf-8')
The setdefaultencoding method is removed from the sys module once siteconfig.py is processed, so you must put it in site-packages so that Python will read it when the interpreter starts up.