I have a hex string and I want to convert it to UTF-8 to insert into MySQL (my database is utf8).
hex_string = 'kitap ara\xfet\xfdrmas\xfd'
...
result = 'kitap araştırması'
How can I do that?
Try (Python 3.x):
import codecs
codecs.decode("707974686f6e2d666f72756d2e696f", "hex").decode('utf-8')
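# -> 'python-forum.io'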
Assuming Python 2.6,
>>> print('kitap ara\xfet\xfdrmas\xfd'.decode('iso-8859-9'))
kitap araştırması
>>> 'kitap ara\xfet\xfdrmas\xfd'.decode('iso-8859-9').encode('utf-8')
'kitap ara\xc5\x9ft\xc4\xb1rmas\xc4\xb1'
Try
hex_string.decode("cp1254").encode("utf-8")
(cp1254 and iso-8859-9 are the Turkish code pages, the former being the usual name on Windows platforms; in Python, both work equally well here)
First you need to decode it from the encoded bytes you have. That appears to be ISO-8859-9 (latin-5), or, if you are using Windows, probably code page 1254, which is based on latin-5.
>>> 'kitap ara\xfet\xfdrmas\xfd'.decode('cp1254')
u'kitap ara\u015ft\u0131rmas\u0131' # u'kitap araştırması'
If you are using Windows, then depending on where you are getting those bytes, it might be more appropriate to decode them as mbcs, which translates to ‘whichever code page the local system is using’. If the string is just sitting in a .py file, you would be better off just writing u'kitap araştırması' in the source and setting a -*- coding declaration to direct Python to decode it. See PEP 263.
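A minimal sketch of that approach, using the string from this question:
# -*- coding: utf-8 -*-
title = u'kitap araştırması'  # decoded at parse time per the coding declaration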
As to how to encode unicode strings to UTF-8 for the database, well, if you want to you can do it manually:
>>> u'kitap ara\u015ft\u0131rmas\u0131'.encode('utf-8')
'kitap ara\xc5\x9ft\xc4\xb1rmas\xc4\xb1'
but a good data access layer is likely to do that automatically for you, if you've got the COLLATION of the tables the data is going into right.
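For example, with the MySQL-python driver you can let the connection do the encoding. This is a hedged sketch; the connection parameters and the books/title table are placeholders:
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='mydb', charset='utf8', use_unicode=True)
cur = conn.cursor()
# Pass a unicode object; the driver encodes it to UTF-8 on the way out.
cur.execute("INSERT INTO books (title) VALUES (%s)",
            (u'kitap ara\u015ft\u0131rmas\u0131',))
conn.commit()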
The String literals section of the language reference explains how to use UTF-8 strings in Python source.
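Putting the pieces together for the original question, in Python 3 (a sketch; the hex digits here are just the cp1254 bytes of the sample string):
hex_string = '6b6974617020617261fe74fd726d6173fd'
raw = bytes.fromhex(hex_string)    # b'kitap ara\xfet\xfdrmas\xfd'
text = raw.decode('cp1254')        # 'kitap araştırması'
utf8_bytes = text.encode('utf-8')  # ready for a utf8 MySQL column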
How can I correctly print Vietnamese CP1258-encoded characters in Python 3? My terminal doesn't seem to be the issue, as it prints the first print statement in my code correctly. I am trying to decode hex characters to Vietnamese.
Code:
import binascii
data = 'tạm biệt'
print(data) # tạm biệt
a = binascii.hexlify(data.encode('cp1258', errors='backslashreplace'))
print(a) # b'745c75316561316d2062695c753165633774'
# Without the error handler, the encode raises UnicodeEncodeError for \u1ea1
print(
    binascii.unhexlify(a).decode('cp1258')  # t\u1ea1m bi\u1ec7t (the escapes, not the characters)
)
There seems to be an omission in Python's support for code page 1258. The legacy codec does support Vietnamese by way of combining diacritics, but Python doesn't know how to convert Unicode to these combinations. I guess you will have to perform your own conversion.
As a first step, observe that unicodedata.normalize('NFD', data) splits the representation into a base character and a sequence of combining diacritics.
>>> unicodedata.normalize('NFD', data).encode('utf-8')
b'ta\xcc\xa3m bie\xcc\xa3\xcc\x82t'
>>> '{0:04x}'.format(ord(b'\xcc\xa3'.decode('utf-8')))
'0323'
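You can double-check which mark that is by name:
>>> import unicodedata
>>> unicodedata.name('\u0323')
'COMBINING DOT BELOW'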
So U+0323 is the combining Unicode diacritic for dot-under, and this correspondence should be known to the codec (the Wikipedia page I link to above shows the same Unicode character code for the CP1258 code point 0xF2).
I don't know enough about the target codec to tell you how to map these to CP1258, but if you are lucky, there is already some sort of mapping of these in the Python codec.
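For what it's worth, the codec does map the five combining tone marks individually, so one possible manual conversion is to split each character into a CP1258-precomposable base plus its tone marks. A hedged sketch (TONE_MARKS and encode_cp1258 are my own, hypothetical names):
import unicodedata

# The five Vietnamese tone marks that CP1258 keeps as separate combining
# code points: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {'\u0300', '\u0301', '\u0303', '\u0309', '\u0323'}

def encode_cp1258(text):
    out = bytearray()
    for ch in text:
        decomposed = unicodedata.normalize('NFD', ch)
        tones = [c for c in decomposed if c in TONE_MARKS]
        rest = ''.join(c for c in decomposed if c not in TONE_MARKS)
        # Recompose base + shape modifiers (e.g. e + circumflex -> ê), which
        # CP1258 stores precomposed, then append the combining tone mark(s).
        out += unicodedata.normalize('NFC', rest).encode('cp1258')
        for tone in tones:
            out += tone.encode('cp1258')
    return bytes(out)

print(encode_cp1258('tạm biệt').hex())  # 7461f26d206269eaf274
The output matches the iconv result below.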
iconv on my macOS Mojave system seems to convert this without a hitch:
$ iconv -t cp1258 <<<'tạm biệt' | xxd
00000000: 7461 f26d 2062 69ea f274 0a ta.m bi..t.
From this, it looks like the diacritic applies straightforwardly as a suffix -- 61 is a, and as noted above, f2 is the combining diacritic to place a dot under the main glyph.
If you have a working iconv, a quick and dirty workaround might be to run it as a subprocess.
import subprocess

converted = subprocess.run(
    ['iconv', '-t', 'cp1258'],
    input=data.encode('utf-8'), stdout=subprocess.PIPE).stdout
If my understanding is correct, this should really be reported as a bug in Python. It should definitely know how to round-trip between this codec and Unicode if it wants to claim that it supports it.
I figured it out. Decoding with unicode-escape does the trick: errors='backslashreplace' wrote literal \uXXXX escape sequences into the byte string, and the unicode-escape codec turns those back into the original characters (so the cp1258 codec is effectively bypassed).
import binascii
data = u'tạm biệt'
print(data) # tạm biệt
a = binascii.hexlify(data.encode('cp1258', errors='backslashreplace'))
print(a) # b'745c75316561316d2062695c753165633774'
# Without the error handler, the encode raises UnicodeEncodeError for \u1ea1
print(
    binascii.unhexlify(a).decode('unicode-escape')  # tạm biệt
)
When doing the following concatenation:
a = u'Hello there '
b = 'pirate ®'
c = a + b # This will raise UnicodeDecodeError
In Python 2, 'pirate ®' is automatically converted to the unicode type by decoding it with the ascii codec. Since the string contains non-ASCII bytes (®), this fails.
Is there a way to change this default encoding to utf8?
It is possible, although it's considered a hack. You have to reload sys:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
See this blog post for some explanation of the potential issues this raises:
http://blog.startifact.com/posts/older/changing-the-python-default-encoding-considered-harmful.html
It may be the only option you have, but you should be aware that it can lead to further problems, which is why it's not a simple and easy thing to set.
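The safer alternative is to decode the byte string explicitly at the point of concatenation, so no process-wide default has to change (assuming the bytes are UTF-8, as in the question):
a = u'Hello there '
b = 'pirate ®'
c = a + b.decode('utf-8')  # decode just this value; no sys hack needed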
From the Python Unicode Howto:
Ideally, you’d want to be able to write literals in your language’s natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.
Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = u'abcdé'
print ord(u[-1])
suffixes = {
    1: ["ो", "े", "ू", "ु", "ी", "ि", "ा"]}
When I run this, the message given by IDLE is
Unsupported characters in input
and I also don't see the proper font in the MS-DOS console.
What encoding is your source file in?
If it is UTF8, put the comment
# -*- coding: utf-8 -*-
at the top of the file.
If you don't declare an encoding in the first or second line of your Python source file, the Python 2 interpreter uses ASCII to decode the characters in the file. As the characters you used can't be decoded as ASCII, errors happen.
The solution is as @RemcoGerlich said. Here is the doc:
The encoding is used for all lexical analysis, in particular to find the end of a string, and to interpret the contents of Unicode literals. String literals are converted to Unicode for syntactical analysis, then converted back to their original encoding before interpretation starts. The encoding declaration must appear on a line of its own.
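So a minimal version of the file that the interpreter will accept looks like this (a sketch; on Python 2, also make the literals unicode so the vowel signs survive further processing):
# -*- coding: utf-8 -*-
suffixes = {
    1: [u"ो", u"े", u"ू", u"ु", u"ी", u"ि", u"ा"]}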
This seems to be a known bug in the 2.x IDLE console: http://bugs.python.org/issue15809. A fix was made for Python 3.x, but it doesn't appear to have been backported.
Instead, use an alternative console, such as IPython/Jupyter, or a fully-fledged IDE, such as PyCharm.
I feel stuck here trying to change encodings with Python 2.5.
I have an XML response, which I encode to UTF-8: response.encode('utf-8'). That is fine, but the program which uses this info doesn't like this encoding and I have to convert it to another code page. A real example is that I use the ghostscript python module to embed pdfmark data in a PDF file, and the end result has wrong characters in Acrobat.
I've done numerous combinations with .encode() and .decode() between 'utf-8' and 'latin-1' and it drives me crazy as I can't output correct result.
If I output the string to a file with .encode('utf-8'), convert this file from UTF-8 to CP1252 (a superset of latin-1) with e.g. iconv.exe, and embed the data, everything is fine.
Basically, can someone help me convert e.g. the character á, which is UTF-8 encoded as hex C3 A1, to latin-1 as hex E1?
Instead of .encode('utf-8'), use .encode('latin-1').
data="UTF-8 data"
udata=data.decode("utf-8")
data=udata.encode("latin-1","ignore")
Should do it.
Can you provide more details about what you are trying to do? In general, if you have a unicode string, you can use encode to convert it into string with appropriate encoding. Eg:
>>> a = u"\u00E1"
>>> type(a)
<type 'unicode'>
>>> a.encode('utf-8')
'\xc3\xa1'
>>> a.encode('latin-1')
'\xe1'
If the previous answers do not solve your problem, check the source of the data that won't print/convert properly.
In my case, I was using json.load on data incorrectly read from a file by not opening it with encoding="utf-8". Trying to de-/encode the resulting string to latin-1 simply does not help...
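In other words, the fix belongs where the file is opened, not at the string afterwards (a sketch; 'data.json' is a placeholder name):
import json

with open('data.json', encoding='utf-8') as f:  # Python 3: pass the encoding here
    obj = json.load(f)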
I have a Python 2.6 script that is gagging on special characters, encoded in Latin-1, that I am retrieving from a SQL Server database. I would like to print these characters, but I'm somewhat limited because I am using a library that calls the unicode factory, and I don't know how to make Python use a codec other than ascii.
The script is a simple tool to return lookup data from a database without having to execute the SQL directly in a SQL editor. I use the PrettyTable 0.5 library to display the results.
The core of the script is this bit of code. The tuples I get from the cursor contain integer and string data, and no Unicode data. (I'd use adodbapi instead of pyodbc, which would get me Unicode, but adodbapi gives me other problems.)
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)

t = PrettyTable(columns)
for rec in r:
    t.add_row(rec)
r.close()
x.close()

t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t
But the Name column can contain characters that fall outside the ASCII range. I'll sometimes get an error message like this, in line 222 of prettytable.pyc, when it gets to the t.add_row call:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 12: ordinal not in range(128)
This is line 222 in prettytable.py. It uses unicode, which is the source of my problems, and not just in this script, but in other Python scripts that I have written.
for i in range(0, len(row)):
    if len(unicode(row[i])) > self.widths[i]:  # This is line 222
        self.widths[i] = len(unicode(row[i]))
Please tell me what I'm doing wrong here. How can I make unicode work without hacking prettytable.py or any of the other libraries that I use? Is there even a way to do this?
EDIT: The error occurs not at the print statement, but at the t.add_row call.
EDIT: With Bastien Léonard's help, I came up with the following solution. It's not a panacea, but it works.
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)

t = PrettyTable(columns)
for rec in r:
    urec = [s.decode('latin-1') if isinstance(s, str) else s for s in rec]
    t.add_row(urec)
r.close()
x.close()

t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t.get_string().encode('latin-1')
I ended up having to decode on the way in and encode on the way out. All of this makes me hopeful that everybody ports their libraries to Python 3.x sooner than later!
Add this at the beginning of the module:
# coding: latin1
Or decode the string to Unicode yourself.
[Edit]
It's been a while since I played with Unicode, but hopefully this example will show how to convert from Latin1 to Unicode:
>>> s = u'ééé'.encode('latin1') # a string you may get from the database
>>> s.decode('latin1')
u'\xe9\xe9\xe9'
[Edit]
Documentation:
http://docs.python.org/howto/unicode.html
http://docs.python.org/library/codecs.html
Maybe try to decode the latin1-encoded strings into unicode?
t.add_row([value.decode('latin1') if isinstance(value, str) else value
           for value in rec])  # leave non-string columns (the integer IDs) as-is
After a quick peek at the source for PrettyTable, it appears that it works on unicode objects internally (see _stringify_row, add_row and add_column, for example). Since it doesn't know what encoding your input strings are using, it uses the default encoding, usually ascii.
Now ascii is a subset of latin-1, which means if you're converting from ascii to latin-1, you shouldn't have any problems. The reverse however, isn't true; not all latin-1 characters map to ascii characters. To demonstrate this:
>>> s = '\xed\x31\x32\x33'  # latin-1 bytes for 'í123'
>>> unicode(s)              # what PrettyTable effectively does
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 0: ordinal not in range(128)
>>> print s.decode('latin-1')
í123
Explicitly converting the strings to unicode (like you eventually did) fixes things, and makes more sense, IMO -- you're more likely to know what charset your data is using than the author of PrettyTable :). One caveat: unicode(s, 'latin-1') and s.decode('latin-1') are equivalent for byte strings, but neither accepts other objects, so the isinstance check in your list comprehension is still needed for the integer columns.
One last thing: don't forget to check the character set of your database and tables -- you don't want to assume 'latin-1' in code, when the data is actually being stored as something else ('utf-8'?) in the database. In MySQL, you can use the SHOW CREATE TABLE <table_name> command to find out what character set a table is using, and SHOW CREATE DATABASE <db_name> to do the same for a database.
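For example, from Python (a hedged sketch using the MySQL-python driver; the connection parameters and 'mytable' are placeholders):
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='mydb')
cur = conn.cursor()
cur.execute("SHOW CREATE TABLE mytable")
print cur.fetchone()[1]  # the CREATE TABLE statement, including DEFAULT CHARSET=...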