Downloading different language webpage using web python


I am trying to download a webpage (in Russian) using the mechanize module in Python (my computer uses only English). I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 50-59
Can somebody tell me how to correct this type of error, or what it means?

Long story short, your original string contains characters that are not ASCII, so when you try to print it Python doesn't know what to do: the character codes are outside the range the ASCII codec can encode.
Here's the ASCII table and what characters it supports: http://www.asciitable.com/
You can convert your characters using say:
Python - Encoding string - Swedish Letters
Or you can do:
(This is a solution to a lot of problems encoding wise)
Edit: C:\Python??\Lib\Site.py
Replace "del sys.setdefaultencoding" with "pass" like so:
Then,
Put this in the top of your code:
sys.setdefaultencoding('latin-1')
The holy grail of fixing the Swedish/non-UTF8 compatible characters.
Note that latin-1 does not cover Russian characters (and neither does ISO-8859-15). For Cyrillic text you need an encoding that does, for example UTF-8, windows-1251, KOI8-R or ISO-8859-5.
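A less invasive alternative to changing the default encoding is to decode the downloaded bytes explicitly and pick the output encoding yourself. The following is only a sketch, assuming Python 3; `raw_page` is a hard-coded stand-in for the bytes a download would return, and windows-1251 as the page's charset is an assumption made for illustration.

```python
# Stand-in for downloaded bytes (assumed here to be windows-1251 encoded).
raw_page = "Привет, мир".encode("cp1251")

# Decode with the page's actual charset to get Unicode text.
text = raw_page.decode("cp1251")

# ASCII cannot represent Cyrillic, which is what the original error reports.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Latin-1 has no Cyrillic letters either.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can encode any Unicode text, so it is a safe output encoding.
utf8_bytes = text.encode("utf-8")
```

In a real script the charset would normally come from the HTTP headers or the page's meta tag rather than being hard-coded.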

Related

Decode / encode html escaped special characters in Python

I have some text that has HTML escape codes in it that I am struggling to fully decode / encode to display properly with Python (ultimately in a Django application).
"&quot;Coup d'&Atilde;&permil;tat&quot;" being a troublesome snippet.
I have used html.unescape() to successfully unescape most of the HTML codes, but I am struggling with the decoding of the special characters, "&Atilde;&permil;", in this example. Ideally this would display as "Coup d'État", but despite trying some decoding/encoding combinations I am getting "Coup d'Ã‰tat".
What is the correct way to convert "&quot;Coup d'&Atilde;&permil;tat&quot;" into "Coup d'État"?
Thanks for your help, and apologies if this has been answered elsewhere. I've tried searching, but no success.

You have a Mojibake, double-encoded data. You not only have HTML entities, your data was incorrectly decoded from bytes to text before the HTML entities were applied.
For your example, the two entities &Atilde; and &permil; decode to the Unicode characters Ã and ‰. Those two characters are also known (from the Unicode standard) as U+00C3 LATIN CAPITAL LETTER A WITH TILDE and U+2030 PER MILLE SIGN. This is typical of UTF-8 data being misinterpreted as a Latin variant encoding (such as ISO 8859-1 or a Windows Latin codepage variant).
If we assume that the original character was meant to be É, or U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, then the original would have been encoded to the bytes C3 and 89 if using UTF-8. That Ã (U+00C3!) shows up here is not a coincidence, it is typical of UTF-8 -> Latin variant Mojibakes to end up with such combinations. The 89 mapping tells us that the most likely candidate for the wrong encoding is the Windows CP 1252 encoding, which maps the hex value 89 to U+2030 PER MILLE SIGN.
You could manually encode to bytes then decode as the correct encoding, but the trick is to know what encoding was used incorrectly, and sometimes that mistake leads to data loss, because the CP-1252 codepage doesn't have a Unicode character mapping for 5 specific byte values. That's not a direct problem for the example in your question, but can be for other text. Manually decoding would work like this:
>>> import html
>>> broken = "&quot;Coup d'&Atilde;&permil;tat&quot;"
>>> html.unescape(broken)
'"Coup d\'Ã‰tat"'
>>> html.unescape(broken).encode("cp1252")
b'"Coup d\'\xc3\x89tat"'
>>> html.unescape(broken).encode("cp1252").decode("utf-8")
'"Coup d\'État"'
A better option is to use the special ftfy library (the name is an acronym for Fixed That For You), which uses detailed knowledge about how to recognize such mistakes and undo the damage.
ftfy also handles the HTML-entity decoding, all in one step:
>>> import ftfy
>>> ftfy.fix_text("&quot;Coup d'&Atilde;&permil;tat&quot;")
'"Coup d\'État"'
The library includes sloppy variants of the text codecs often involved in a Mojibake to help with repairing. It also encodes information about how to recognize the specific errors that a given wrong codec choice produces, so it knows what to do to reverse the damage.
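For cases where the wrong codec is already known, the manual round trip can be wrapped in a small helper. This is a minimal sketch using only the standard library, not a substitute for ftfy; `fix_mojibake` is a hypothetical name, the defaults assume the UTF-8-read-as-CP-1252 case discussed above, and the entity names &Atilde; and &permil; are assumed from the characters they decode to.

```python
import html


def fix_mojibake(text, wrong="cp1252", right="utf-8"):
    """Unescape HTML entities, then undo one round of UTF-8 read as cp1252."""
    unescaped = html.unescape(text)
    try:
        return unescaped.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not damaged in the assumed way; leave it as it is.
        return unescaped


fixed = fix_mojibake("&quot;Coup d'&Atilde;&permil;tat&quot;")
untouched = fix_mojibake("Coup d'État")
```

Returning the unescaped text when the round trip fails keeps already-correct strings intact, at the cost of silently skipping text that was damaged by some other codec.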

encoding issue. Replace special character

I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö ,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?

Now, encoding is a minefield, and I might be off on this one; please correct me if that's the case.
From what I've gathered over the years, Python 2 assumes ASCII unless you define an encoding at the top of your script, mainly because either it's compiled that way or the OS/terminal uses ASCII as its primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
is the ASCII representation of a unicode string. Somehow Python needs to tell you there's an ö in there, but it can't with ASCII because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
Here '\xf6' and 'ö' are byte strings, not unicode strings. To match them against a unicode string, Python 2 first decodes them with the default ASCII codec, and the byte 0xf6 is outside the accepted byte range of ASCII, so that raises an exception. The replacement 'ö' is not valid ASCII either and would cause the same exception.
Hence the "'ascii' codec can't decode byte..." message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output will be the same as in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of to convert your representation into an ASCII ö, because there really isn't such a thing. There are just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
In Python 3, text strings are unicode by default and byte data is a separate bytes type, so for the most part this is a lot easier.
There are numerous ways to change the encoding that are not standard practice.
The closest to "good" practice would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assumes/uses ASCII as its encoder/decoder.
ö is not part of ASCII, therefore you'll always see \xf6 in the representation if you've somehow gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown with an ö because of automatic encoding. If you need to verify that input matches that string, compare unicode with unicode (u'ö' == u'\xf6'). If other systems depend on this data, encode it with something they understand, e.g. .encode('UTF-8'). But replacing \xf6 with ö changes nothing: they are the same character, ö just doesn't exist in ASCII, so you need to write u'ö', which results in the same data in the end.
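The point that \xf6 and ö are one and the same character can be checked directly. A small sketch, assuming Python 3 (where every str is unicode) and a UTF-8 source file:

```python
import unicodedata

escaped = "Ganztags ge\xf6ffnet"   # escape sequence for U+00F6
literal = "Ganztags geöffnet"      # the same character typed directly

same = escaped == literal
char_name = unicodedata.name("\xf6")

# ascii() gives an ASCII-only representation, similar to Python 2's repr().
ascii_repr = ascii(literal)
```

Here `same` is True and `ascii_repr` contains the `\xf6` escape again, which is why a replace from one spelling to the other has no visible effect.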

As you are using the German language, you should be aware of non-ASCII characters. You should know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode, things are cleaner, and you should just accept the fact that u'ö' and u'\xf6' are the same character; the latter is simply independent of the Python source file charset.
If you have to output byte strings or store them in files, you should encode them in UTF8 (can process any unicode character, but characters with a code above 127 use more than 1 byte) or Latin1 (one byte per character, but only supports unicode code points below 256).
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.
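The trade-off between the two encodings can be seen by comparing the byte strings they produce. A short sketch, assuming Python 3; the euro sign is just an example of a character above U+00FF:

```python
text = "Ganztags ge\xf6ffnet"

latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # ö takes two bytes

# Latin-1 stops at U+00FF, so a character such as the euro sign fails.
try:
    "\u20ac".encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
```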

how to convert u'\uf04a' to unicode in python [duplicate]

I am trying to decode u'\uf04a' in Python so that I can print it without errors. In other words, I need to convert stupid Microsoft Windows-1252 characters to actual Unicode.
The source of html containing the unusual errors comes from here http://members.lovingfromadistance.com/showthread.php?12338-HAVING-SECOND-THOUGHTS
Read about u'\uf04a' and u'\uf04c' by clicking here http://www.fileformat.info/info/unicode/char/f04a/index.htm
one example looks like this:
"Oh god please some advice ":
Out[408]: u'Oh god please some advice \uf04c'
Given a thread like this as one example for test:
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread.decode('utf8')
print u'\uf04a'
print u'\uf04a'.decode('utf8') # error!!!
'charmap' codec can't encode character u'\uf04a' in position 1526: character maps to undefined
With the help of two Python scripts, I successfully converted u'\x92', but I am still stuck with u'\uf04a'. Any suggestions?
References
https://github.com/AnthonyBRoberts/NNS/blob/master/tools/killgremlins.py
Handling non-standard American English Characters and Symbols in a CSV, using Python
Solution:
Following the comments below, I replaced these characters with a question mark ('?'):
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread = thread.replace(u'\uf04a', '?')
thread = thread.replace(u'\uf04c', '?')
Hope this is helpful to other beginners.

The notation u'\uf04a' denotes the Unicode codepoint U+F04A, which is by definition a private use codepoint. This means that the Unicode standard does not assign any character to it, and never will; instead, it can be used by private agreements.
It is thus meaningless to talk about printing it. If there is a private agreement on using it in some context, then you print it using a font that has a glyph allocated to that codepoint. Different agreements and different fonts may allocate completely different characters and glyphs to the same codepoint.
It is possible that U+F04A is a result of erroneous processing (e.g., wrong conversions) of character data at some earlier phase.
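Private use codepoints can be detected without listing them one by one, because the Unicode database assigns them the general category "Co". A minimal sketch, assuming Python 3; `strip_private_use` is a hypothetical helper name:

```python
import unicodedata


def strip_private_use(text, replacement=""):
    """Replace characters whose Unicode category is 'Co' (private use)."""
    return "".join(
        replacement if unicodedata.category(ch) == "Co" else ch
        for ch in text
    )


thread = "who are you \uf04a Why you are so harsh to her \uf04c"
cleaned = strip_private_use(thread, "?")
```

This generalises the two hard-coded replace() calls from the question to any private use character that shows up in the scraped text.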

u'\uf04a'
already is a Unicode object, which means there's nothing to decode. The only thing you can do with it is encode it, if you're targeting a specific file encoding like UTF-8 (which is not the same as Unicode, but is confused with it all the time).
u'\uf04a'.encode("utf-8")
gives you a string (Python 2) or bytes object (Python 3) which you can then write to a file or a UTF-8 terminal etc.
You won't be able to encode it as a plain Windows string because cp1252 doesn't have that character.
What you can do is convert it to an encoding that doesn't have those offending characters by telling the encoder to replace missing characters by ?:
>>> u'who\uf04a why\uf04c'.encode("ascii", errors="replace")
'who? why?'
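Besides "replace", the standard codecs accept a few other error handlers that may suit different situations. A short sketch, assuming Python 3, where encode() returns bytes:

```python
text = "who\uf04a why\uf04c"

replaced = text.encode("ascii", errors="replace")            # '?' per character
ignored = text.encode("ascii", errors="ignore")              # dropped silently
escaped = text.encode("ascii", errors="backslashreplace")    # Python escapes
as_refs = text.encode("ascii", errors="xmlcharrefreplace")   # numeric entities
```

The last two keep enough information to recover the original codepoints later, which "replace" and "ignore" do not.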

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode, however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate some various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
dmessage,
])
Does anyone have any advice on removing unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe

You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like â€™ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage= contact.message.encode('cp1252', 'ignore')
Alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage= contact.message.encode('ascii', 'ignore')
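In Python 3 the csv module works on text, so the encoding step moves to the file object instead of each field. The following is a sketch under that assumption; `rows_to_csv_bytes` is a hypothetical helper, and it uses the "replace" handler so that characters outside code page 1252 become question marks.

```python
import csv
import io


def rows_to_csv_bytes(rows, encoding="cp1252"):
    """Serialise rows to CSV bytes, replacing unencodable characters with '?'."""
    buffer = io.BytesIO()
    wrapper = io.TextIOWrapper(
        buffer, encoding=encoding, errors="replace", newline=""
    )
    csv.writer(wrapper).writerows(rows)
    wrapper.flush()
    data = buffer.getvalue()
    wrapper.detach()  # keep the wrapper from closing the buffer
    return data


data = rows_to_csv_bytes([["It\u2019s here"], ["\u4e00"]])
```

The returned bytes can be written to a file opened in binary mode; U+2019 survives as the single cp1252 byte 0x92, while the Chinese character is replaced.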

Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.

[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
    0x2019: u"'",
    # etc
}
smashed = input_string.translate(smashcii)
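A slightly fuller version of the same idea, combined with the NFKD normalisation the question already experimented with, might look like this. It is a sketch assuming Python 3 (str.translate in place of unicode.translate); the table entries and the `to_ascii` name are illustrative choices, not a complete mapping.

```python
import unicodedata

# Hypothetical table of typographic characters and their ASCII stand-ins.
SMASHCII = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2013: "-",   # en dash
    0x2014: "-",   # em dash
}


def to_ascii(text):
    """Map known characters, strip accents, then drop whatever is left."""
    translated = text.translate(SMASHCII)
    decomposed = unicodedata.normalize("NFKD", translated)
    return decomposed.encode("ascii", "ignore").decode("ascii")


result = to_ascii("It\u2019s a \u201ccaf\xe9\u201d")
```

Translating before normalising matters here: NFKD leaves the curly quotes unchanged, so without the table they would simply be dropped by the "ignore" handler.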

Python Encoding issue

Why am I getting this issue, and how do I resolve it?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 24: unexpected code byte
Thank you

Somewhere, perhaps subtly, you are asking Python to turn a stream of bytes into a "string" of characters.
Don't think of a string as "bytes". A string is a list of numbers, each number having an agreed meaning in Unicode. (#65 = Latin Capital A. #19968 = Chinese Character "One"/"First") .
There are many methods of encoding a list of Unicode entities into a stream of bytes. Python is assuming your stream of bytes is the result of a particular such method, called "UTF-8".
However, your stream of bytes has data that does not correspond to that method. Thus the error is raised.
You need to figure out the encoding of the stream of bytes, and tell Python that encoding.
It's important to know if you're using Python 2 or 3, and the code leading up to this exception to see where your bytes came from and what the appropriate way to deal with them is.
If it's from reading a file, you can explicitly deal with the bytes read. But you must be sure of the file encoding.
If it's from a string that is part of your source code, then Python is assuming the "wrong thing" about your source files... perhaps $LC_ALL or $LANG needs to be set. This is a good time to firmly understand the concept of encoding, and how text editors choose an encoding to write, and what is standard for your language and operating system.

In addition to what Joe said, chardet is a useful tool to detect encoding of the source data.

Somewhere you have a plain string encoded as "Windows-1252" (or "cp1252") containing a "RIGHT SINGLE QUOTATION MARK" (’) instead of an APOSTROPHE ('). This could come from a file you read, or even in a Python source file of yours; you could be running Python 2.x and have a # -*- coding: utf8 -*- line somewhere near the script's beginning, or you could be running Python 3.x.
You don't give enough data; however, somewhere you have a cp1252-encoded string, which you try (explicitly or implicitly) to decode to unicode as utf-8. This won't work.
Give us more info, and we'll try again to help you.
Joe Koberg's answer reminded me of an older answer of mine, which some people have found helpful: Python UnicodeDecodeError - Am I misunderstanding encode?
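The failure and one possible way around it can be reproduced with a single byte string. This is a sketch assuming Python 3; `decode_with_fallback` is a hypothetical helper, and trying cp1252 second is a guess based on the 0x92 byte, not something the question confirms.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn and report which one worked."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: mark undecodable bytes with U+FFFD instead of failing.
    return raw.decode(encodings[0], errors="replace"), None


raw = b"It\x92s a byte string"
text, used = decode_with_fallback(raw)
```

Guessing like this is only a stopgap; knowing the real encoding of the source, as the answers above stress, remains the reliable fix.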
