Python: Convert Unicode-Hex-String to Unicode

I have a hex string made from a Unicode string with this function:
def toHex(s):
    res = ""
    for c in s:
        res += "%02X" % ord(c)  # at least 2 hex digits, can be more
    return res

hex_str = toHex(u"...")
This returns a string like this one:
"80547CFB4EBA5DF15B585728"
That's a sequence of 6 Chinese characters.
But
u"Knรถdel"
converts to
"4B6EF664656C"
What I need now is a function to convert this back to the original Unicode. The Chinese characters seem to have 2-byte representations, while every character in the second example has a 1-byte representation. So I can't just call unichr() on each 1- or 2-byte block.
I've already tried
binascii.unhexlify(hex_str)
but this seems to convert byte-by-byte and returns a string, not unicode. I've also tried
binascii.unhexlify(hex_str).decode(...)
with various encodings, but I never got the original Unicode string back.
Thank you a lot in advance!

This seems to work just fine:
binascii.unhexlify(binascii.hexlify(u"Knödel".encode('utf-8'))).decode('utf-8')
This round-trips back to the original object. You can do the same for the Chinese text if it's encoded properly; however, ord(x) has already destroyed the text you started from. You need to encode the string first, and only then treat it as a string of bytes.
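A minimal sketch of that fix (Python 2), with helper names of my own choosing:

import binascii

def to_hex(s):
    # encode the unicode object to bytes first, then hex the bytes
    return binascii.hexlify(s.encode('utf-8'))

def from_hex(h):
    # reverse: hex -> bytes -> unicode
    return binascii.unhexlify(h).decode('utf-8')

assert from_hex(to_hex(u"Kn\xf6del")) == u"Kn\xf6del"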

Can't be done. Using %02X loses too much information. You should be using something like UTF-8 first and converting that, instead of inventing a broken encoding.
>>> u"Knรถdel".encode('utf-8').encode('hex')
'4b6ec3b664656c'
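Going the other way, Python 2's hex codec reverses it:

>>> '4b6ec3b664656c'.decode('hex').decode('utf-8')
u'Kn\xf6del'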

When I was working with Unicode in a VB app a while ago, the first 1 or 2 digits would be removed if they were "0", meaning "&H00A2" would automatically be converted to "&HA2". I just created a small function to check the length of the string and, if it was less than 4 characters, add the missing zeros. I'm not sure if this is what's happening to you, but I thought I'd mention it as something to be aware of.
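For what it's worth, in Python that padding check is a one-liner (a sketch, not from the original answer):

>>> "A2".zfill(4)  # restore leading zeros that were dropped
'00A2'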

Related

How to get a character from its UTF-16 code points in Python 3?

I have a list of UTF-16 code points that I need to convert to the actual characters they represent programmatically. This seems unbelievably hard to do in Python 3.
For example, I have the numbers 55357 and 56501 for one character, which I know is this banknote emoji: 💵 But I have no idea how to convert that in Python. I first tried chr(55357) + chr(56501), but Python seems to assume that it is UTF-8 encoded and thus gives me broken Unicode.
I then tried re-encoding the string, but since it's broken UTF-8, it gives me what seems to be broken UTF-16. If I tell it to leave it alone with (chr(55357) + chr(56501)).encode('utf-8', 'surrogatepass'), I can actually get a valid bytes object for the character, but it's encoded in... CESU-8, for reasons I cannot yet grasp. This is not an encoding Python supports natively, and I can't find a codec to convert it.
I think I could probably write these to the disk and then read them with the right encoding, but that sounds really terrible.
Is there a reasonable way to do this in Python 3?
The trick is not to mess with chr but rather to convert to a byte array, which you can then decode into a string:
a, b = 55357, 56501
x = a.to_bytes(2, 'little') + b.to_bytes(2, 'little')  # 4 bytes of UTF-16LE
print(x.decode('UTF-16'))  # 💵
This can be generalized for any number of integers:
data = [55357, 56501]
b = bytes([x for c in data for x in c.to_bytes(2, 'little')])
result = b.decode('utf-16')  # '💵'
The reason something like chr(55357) + chr(56501) doesn't work is that chr assumes no encoding. It works on the raw Unicode code points, so you are combining two distinct characters. As the other answer points out, you then have to encode this two character string and re-decode it, or just get the bytes and decode once as I'm suggesting.
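As a side note (a sketch of my own, not part of the original answers), the surrogate-pair arithmetic behind this can also be done by hand:

# 55357 (0xD83D) is a high surrogate, 56501 (0xDCB5) a low surrogate
hi, lo = 55357, 56501
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(hex(cp), chr(cp))  # 0x1f4b5 💵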
The following code works:
cp1 = 55357
cp2 = 56501
(chr(cp1) + chr(cp2)).encode('utf-16', 'surrogatepass').decode('utf-16')
# 💵

Convert utf-8 string to bytes and back in Python 2.7

I have a program which takes a string, makes a list with its byte representation, and then converts the list back to a string. This is really easy if the string contains only ASCII characters:
def messagetobitlist(message):
    bitlist = []
    for i in message:
        for x in format(ord(i), '08b'):
            bitlist.append(int(x))
    return bitlist
And then I simply convert it back with unichr (or also chr would work).
I want, however, to expand the code to accept strings with accents and foreign characters. To do this I thought of encoding the string in UTF-8 before creating the bit list, but when I try to convert it back it doesn't work, since the characters are represented with different numbers of bytes and the code can't tell beforehand whether it has to read just one byte or more. I tried encoding every character with 4 bytes (the maximum for UTF-8), but that really seems a waste of space, and it didn't work anyway.
Is there a solution to have a function that does this while still being somewhat space-conservative?
EDIT: Whoops, wrote Python 3 instead of Python 2.7
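A minimal sketch of one approach (Python 2.7), assuming the goal is a lossless round trip: encode to UTF-8 bytes before building the bit list, and decode only once at the very end, so the variable-length characters take care of themselves:

def message_to_bitlist(message):
    # unicode -> UTF-8 bytes first; accented and CJK characters become 2-4 bytes each
    data = message.encode('utf-8')
    return [int(bit) for byte in data for bit in format(ord(byte), '08b')]

def bitlist_to_message(bitlist):
    # regroup bits into bytes, then decode the whole UTF-8 byte string once
    byte_chars = [chr(int(''.join(str(b) for b in bitlist[i:i + 8]), 2))
                  for i in range(0, len(bitlist), 8)]
    return ''.join(byte_chars).decode('utf-8')

assert bitlist_to_message(message_to_bitlist(u"Kn\xf6del")) == u"Kn\xf6del"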

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a unicode string and it seems that this is not working. Could this be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>" + string[0] + "</b>" + string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (Hebrew goes right to left):
>>> first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: in Python 2.x a plain str is a sequence of raw bytes, so indexing works byte by byte.
To work seamlessly with Unicode data, you first need to let Python 2.x know you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behavior abstracted: you take a str and you return a str.
Ideally you should convert all incoming data from raw UTF-8 bytes to unicode objects at the very beginning of your code (I am assuming your source encoding is UTF-8, since that is the standard used by most applications these days), and convert back to raw bytes only at the very end, e.g. when saving to a database or responding to a client. Some frameworks handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')  # bytes -> unicode before indexing
    string = "<b>" + string[0] + "</b>" + string[index:]
    return string.encode('utf8')    # back to bytes on the way out
This will also work for ASCII because UTF-8 is a superset of ASCII. You can get a better understanding of Unicode in general, and of how Python handles it, by reading http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is already a Unicode type, so you don't have to do any of this explicitly.
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character, while Unicode strings use one code unit per character (at least for characters in the BMP, the first 65536 code points, on Python 2):
#coding:utf8
import io  # Python 2's built-in open() has no encoding parameter

s = u"הקדמה"
t = u'<b>' + s[0] + u'</b>' + s[1:]
print(t)
with io.open('out.htm', 'w', encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as: [screenshot of the rendered page, not included]

Lxml trying to extract data with windows-1250 characters

Hello, I am experimenting with Python and lxml, and I am stuck with the problem of extracting data from a webpage which contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text, parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this:
\xc2\x9ea
instead of the characters. Then I have problems storing this in the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf-8, without success. Can I somehow convert this back to the original characters?
Try:
text = "\xc2\x9ea"
print text.decode('windows-1250').encode('utf-8')
Output:
Âža
And save nice chars in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while it might work in one case, it might not in another. You could instead extract the character ordinals and work with their hex values directly:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
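A sketch of such a guard (Python 3), using '%02x' to keep the zero padding that hex(n)[2:] would drop for ordinals below 16:

def checked_ord(ch):
    # refuse any character that does not fit in a single byte
    n = ord(ch)
    if not 0 <= n < 0x100:
        raise ValueError('character %r does not fit in one byte' % ch)
    return n

hex_str = ''.join('%02x' % checked_ord(ch) for ch in '\x9ea')
print(bytes.fromhex(hex_str).decode('cp1250'))  # ža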

How to encode a string in a SQL CHAR

'admin' encoded is CHAR(97, 100, 109, 105, 110).
I would like to know if there is a module or a way to convert each letter of a string to SQL CHARs. If not, how do I convert it myself? I have access to a chart that says a=97, b=98, etc., if that helps.
I'm not sure why you need this at all. It's not hard to get the string representation of a CHAR field holding ASCII or Unicode or whatever code points. But I'm pretty sure you don't need that, because databases already know how to compare those to strings passed in SQL, etc. Unless you're trying to, say, generate a dump that looks exactly like the ones you get from some other tool. But, assuming you do need to do this, here's how.
I think you're looking for the ord function:
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('\u2020') returns 8224. This is the inverse of chr().
This works because Python has access to that same chart that you haveโ€”in fact, to a bunch of different ones, one for each encoding it knows about. In fact, that chart is pretty much what an encoding is.
So, for example:
def encode_as_char(s):
    return 'CHAR({})'.format(', '.join(str(ord(c)) for c in s))
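For example, applied to the string from the question:

>>> encode_as_char('admin')
'CHAR(97, 100, 109, 105, 110)'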
Or, if you just wanted a list of numbers, not a string made out of those numbers, it's even simpler:
def encode_as_char(s):
    return [ord(c) for c in s]
This is all assuming that either (a) your database is storing Unicode characters and you're using Python 3, or (b) your database is storing 8-bit characters and you're using Python 2. Otherwise, you need an encode or decode step in there as well.
For a Python 3 Unicode string to a UTF-8 database (notice that we don't need ord here, because a Python 3 bytes is actually a sequence of numbers):
def encode_as_utf8_char(s):
    return 'CHAR({})'.format(', '.join(str(c) for c in s.encode('utf-8')))
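For example (the byte values are the UTF-8 encoding of each character):

>>> encode_as_utf8_char('Knödel')
'CHAR(75, 110, 195, 182, 100, 101, 108)'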
For Python 2 UTF-8 string to a Unicode database:
def encode_utf8_as_char(s):
    return 'CHAR({})'.format(', '.join(str(ord(c)) for c in s.decode('utf-8')))
