Suppose I have the following string that I want to decode as UTF-8:
st = '\\u00d7\\u0090\\u00d7\\u0090\\u00d7\\u0090'
# expect 'אאא'
Using Python 3, I would expect the following to work, but it doesn't:
bytes(st, 'ascii').decode('unicode-escape')
# prints '×××'
bytes(st, 'ascii').decode('utf-8')
# prints '\\u00d7\\u0090\\u00d7\\u0090\\u00d7\\u0090'
Any help?
You can do it with multiple trips through encode/decode.
print(st.encode('ascii').decode('unicode-escape').encode('iso-8859-1').decode('utf-8'))
The first encode is the preferred alternative to the bytes constructor. The second step converts the escape sequences to their equivalent characters. The third takes advantage of the fact that Unicode's first 256 code points coincide with ISO-8859-1, so those characters convert directly back into bytes. Finally, you can decode the UTF-8 bytes.
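Putting the steps together (a quick sketch; the intermediate variable names are mine):
st = '\\u00d7\\u0090\\u00d7\\u0090\\u00d7\\u0090'
step1 = st.encode('ascii')              # b'\\u00d7\\u0090...': bytes, escapes still literal
step2 = step1.decode('unicode-escape')  # '×\x90×\x90×\x90': escapes resolved to code points
step3 = step2.encode('iso-8859-1')      # b'\xd7\x90\xd7\x90\xd7\x90': code points back to bytes
print(step3.decode('utf-8'))            # אאא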
Related
I just made my first web scraper, which goes onto Wikipedia and downloads the HTML of the whole page. I managed to get just the content of a list; the values in the list are numbers, either positive or negative.
But instead of printing out a '-2' it gives me a '\xe2\x88\x922'. I tried string.replace("\xe2\x88\x92", "-") but this doesn't seem to work due to the backslashes.
Do you know how I can convert these UTF things into their real symbols?
I used urllib to get the HTML content, if this is important.
You can use bytes.decode to convert it:
>>> b'\xe2\x88\x922'.decode("utf8")
'−2'
And if your data doesn't start with b (i.e. if it is not a bytes object), you can first convert it to bytes then decode:
>>> s = '\xe2\x88\x922'
>>> byte_object = bytes(ord(c) for c in s)
>>> byte_object.decode("utf8")
'−2'
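Since every character in such a string is below U+0100, encoding it as Latin-1 yields the same bytes; a small equivalent sketch (not part of the original answer):
>>> s = '\xe2\x88\x922'
>>> s.encode('latin-1') == bytes(ord(c) for c in s)
True
>>> s.encode('latin-1').decode('utf8')
'−2'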
That is unfortunately common when reading data from web pages: they contain characters that look like standard ASCII characters but are not.
Here you have a MINUS SIGN character (Unicode U+2212) −, which looks like the normal HYPHEN-MINUS (Unicode U+002D, ASCII 0x2D) -.
In UTF-8 the minus sign is encoded as b'\xe2\x88\x92'; the trailing 2 in your data is just the digit. It probably means that you read the page as if it were Latin-1 encoded while it is actually UTF-8 encoded.
A trick to correctly recode it is to encode it as Latin-1 and decode it back as UTF-8:
t = '\xe2\x88\x922'
print(t.encode('latin1').decode('utf-8'))
−2
I have a byte string which I'm decoding to unicode in Python using .decode('unicode-escape'). This returns a unicode string, but encoding that string to get bytes again returns a different byte string. Why is this, and how can I decode and encode in a way that preserves the original data?
Examples:
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'
some_bytes.decode('unicode-escape')
yields: 7Q¬qo»5<ëD¾Þù¦XNÿ¡
some_bytes.decode('unicode-escape').encode()
yields: b'7Q\xc2\x82\xc2\xacqo\xc2\xbb\x0f\x03\x105\xc2\x93<\xc3\xabD\xc2\xbe\xc3\x9e\xc2\xad\xc2\x82\xc3\xb9\xc2\xa6\x1cX\x01N\xc2\x8c\xc3\xbf\xc2\x9e\xc2\x84\x1e\xc2\xa1\xc2\x97'
\xc2 and \xc3 are UTF-8 lead bytes: every code point in the range U+0080 to U+00FF is encoded in UTF-8 as two bytes, the first of which is \xc2 or \xc3. For example, for superscript two (²), the UTF-8 encoding is \xc2\xb2.
unicode-escape mapped each of your original bytes to the code point with the same number, so when you re-encode with the default UTF-8 codec, a lead byte is added before every code point above \x7f.
For more details, see the table at:
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex
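If the goal is to get the original bytes back unchanged, you can re-encode with Latin-1 instead of UTF-8, since code points U+0000 to U+00FF map straight back to single bytes. A minimal sketch (it assumes, as in the example above, that the bytes contain no backslash that unicode-escape would interpret as an escape sequence):
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'
round_trip = some_bytes.decode('unicode-escape').encode('latin-1')
assert round_trip == some_bytes  # lossless, unlike the default UTF-8 encode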
I have this function in Python:
Str = "Ã¼";
print Str
def correctText( str ):
    str = str.upper()
    correctedText = str.decode('UTF8').encode('Windows-1252')
    return correctedText;
corText = correctText(Str);
print corText
It works and converts characters like Ã¼ and Ã©; however, it fails when I try Ã? and Â¶.
Is there a way i can fix it?
According to UTF-8, Ã? and Â¶ are not valid sequences: their bytes do not form well-formed multi-byte characters. What you need to do is either use some other kind of encoding, or strip out the errors in your str by using the unicode() function with its errors parameter. I recommend the latter.
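A minimal Python 2 sketch of that error-stripping approach (the sample bytes here are mine, not from the question):
s = '\xc3?'  # 0xC3 must be followed by a continuation byte, so this is not valid UTF-8
print repr(unicode(s, 'utf8', errors='replace'))  # u'\ufffd?' -- bad byte replaced with U+FFFD
print repr(unicode(s, 'utf8', errors='ignore'))   # u'?' -- bad byte dropped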
What you are trying to do is to compose valid UTF-8 byte sequences from several consecutive Windows-1252 characters.
For example, to get ü: the Windows-1252 code of Ã is C3 and that of ¼ is BC, and together the bytes C3 BC happen to be the UTF-8 encoding of ü.
Now, for Ã?, the Windows-1252 bytes are C3 3F, which is not valid UTF-8 (because the second byte does not start with the bits 10).
Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of its UTF-8 encoding (C3 A0) is Ã followed by a non-printable character (a non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
For Â¶ the Windows-1252 encoding is C2 B6. Shouldn't it be Ã¶, whose Windows-1252 bytes C3 B6 equal the UTF-8 encoding of ö?
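For reference, a small sketch of the mechanism described in this answer, in Python 3 syntax:
print('Ã¼'.encode('Windows-1252'))                 # b'\xc3\xbc': valid UTF-8
print('Ã¼'.encode('Windows-1252').decode('UTF8'))  # ü
print('Ã?'.encode('Windows-1252'))                 # b'\xc3?': invalid UTF-8, decoding it raises UnicodeDecodeError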
Parsing some HTML content I got the following string:
АБВ\u003d\"res
The common advice on handling it appears to be to decode using unicode_escape. However, this results in the following:
ÐÐÐ="res
The escaped characters get correctly decoded, but the Cyrillic letters get mangled. Other than using regexes to extract everything that looks like a Unicode escape, decoding only those with unicode_escape, and then piecing everything back into a new string, what other methods exist to decode strings with Unicode code points in Python?
unicode_escape treats the input as Latin-1 encoded; any bytes that do not form a Python string-literal escape sequence are decoded by mapping each byte directly to the Unicode code point with the same value. You gave it UTF-8 bytes, so the Cyrillic characters, represented by 2 bytes each, were decoded to two Latin-1 characters each, one of which is U+00D0 Ð and the other unprintable:
>>> print repr('АБВ\\u003d\\"res')
'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print repr('АБВ\\u003d\\"res'.decode('latin1'))
u'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print 'АБВ\\u003d\\"res'.decode('latin1')
ÐÐÐ\u003d\"res
This kind of mis-decoding is called a Mojibake, and can be repaired by re-encoding to Latin-1, then decoding from the correct codec (UTF-8 in your case):
>>> print 'АБВ\\u003d\\"res'.decode('unicode_escape')
ÐÐÐ="res
>>> print 'АБВ\\u003d\\"res'.decode('unicode_escape').encode('latin1').decode('utf8')
АБВ="res
Note that this will fail if the \uhhhh escape sequences encode codepoints outside of the Latin-1 range (U+0000-U+00FF).
The Python 3 equivalent of the above uses codecs.decode():
>>> import codecs
>>> codecs.decode('АБВ\\u003d\\"res', 'unicode_escape').encode('latin1').decode('utf8')
'АБВ="res'
The regex really is the easiest solution (Python 3):
import re

text = 'АБВ\\u003d\\"res'
re.sub(r'(?i)(?<!\\)((?:\\\\)*)\\u([0-9a-f]{4})',
       lambda m: m.group(1) + chr(int(m.group(2), 16)), text)
This works fine with any 4-nibble Unicode escape, and can be pretty easily extended to other escapes.
For Python 2, make all strings u'' strings, and use unichr.
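A quick usage check (Python 3; note the \" stays literal, since the pattern only handles \u escapes):
>>> import re
>>> text = 'АБВ\\u003d\\"res'
>>> re.sub(r'(?i)(?<!\\)((?:\\\\)*)\\u([0-9a-f]{4})', lambda m: m.group(1) + chr(int(m.group(2), 16)), text)
'АБВ=\\"res'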
The printed results look the same, so what is the point of encoding and decoding with UTF-8?
And is it encode('utf8') or encode('utf-8')?
u = 'abc'
print(u)
u=u.encode('utf-8')
print(u)
uu = u.decode('utf-8')
print(uu)
str.encode encodes the string (or unicode string) into a series of bytes. In Python 3 this is a bytes object, in Python 2 it's str again (confusingly). When you encode a unicode string, you are left with bytes, not unicode; remember that UTF-8 is not Unicode, it's an encoding method that can turn Unicode code points into bytes.
str.decode will decode the serialized byte stream with the selected codec, picking the proper unicode codepoints and giving you a unicode string.
So, what you're doing in Python 2 is: 'abc' > 'abc' > u'abc', and in Python 3 is:
'abc' > b'abc' > 'abc'. Try printing repr(u) or type(u) in addition to see what's changing where.
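A quick check of that (Python 3):
>>> u = 'abc'
>>> type(u)
<class 'str'>
>>> b = u.encode('utf-8')
>>> b, type(b)
(b'abc', <class 'bytes'>)
>>> b.decode('utf-8'), type(b.decode('utf-8'))
('abc', <class 'str'>)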
utf_8 is the codec's underlying module name, but 'utf8', 'utf-8', and 'UTF-8' are all accepted aliases, so it doesn't really matter which you use.
Usually Python will first try to decode it to Unicode before it can encode it back to UTF-8. There are also encodings that have nothing to do with character sets and can be applied to 8-bit strings.
For example:
data = u'\u00c3'  # Unicode data
data = data.encode('utf8')
print repr(data)  # '\xc3\x83'