How to encode a string in a SQL CHAR - python

'admin' encoded is CHAR(97, 100, 109, 105, 110)
I would like to know if there is a module or a way to convert each letter of a string to SQL CHARs. If not, how do I convert it myself? I have access to a chart that says a=97, b=98, etc., if that helps.

I'm not sure why you need this at all. It's not hard to get the string representation of a CHAR field holding ASCII or Unicode or whatever code points. But I'm pretty sure you don't need that, because databases already know how to compare those to strings passed in SQL, etc. Unless you're trying to, say, generate a dump that looks exactly like the ones you get from some other tool. But, assuming you do need to do this, here's how.
I think you're looking for the ord function:
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('\u2020') returns 8224. This is the inverse of chr().
This works because Python has access to that same chart that you have—in fact, to a bunch of different ones, one for each encoding it knows about. In fact, that chart is pretty much what an encoding is.
So, for example:
def encode_as_char(s):
    return 'CHAR({})'.format(', '.join(str(ord(c)) for c in s))
Or, if you just wanted a list of numbers, not a string made out of those numbers, it's even simpler:
def encode_as_char(s):
    return [ord(c) for c in s]
This is all assuming that either (a) your database is storing Unicode characters and you're using Python 3, or (b) your database is storing 8-bit characters and you're using Python 2. Otherwise, you need an encode or decode step in there as well.
For a Python 3 Unicode string to a UTF-8 database (notice that we don't need ord here, because a Python 3 bytes is actually a sequence of numbers):
def encode_as_utf8_char(s):
    return 'CHAR({})'.format(', '.join(str(c) for c in s.encode('utf-8')))
For a Python 2 UTF-8 string to a Unicode database:
def encode_utf8_as_char(s):
    return 'CHAR({})'.format(', '.join(str(ord(c)) for c in s.decode('utf-8')))
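As a quick check of the functions above, here is a minimal Python 3 sketch that reproduces the 'admin' example from the question. The helper decode_char_expression is hypothetical, added only to show the inverse direction with chr(); it assumes the input has exactly the CHAR(n, n, ...) shape produced here.

```python
def encode_as_char(s):
    """Build a SQL CHAR(...) expression from the code points of s."""
    return 'CHAR({})'.format(', '.join(str(ord(c)) for c in s))


def encode_as_utf8_char(s):
    """Same idea, but emit the UTF-8 bytes instead of the code points."""
    return 'CHAR({})'.format(', '.join(str(b) for b in s.encode('utf-8')))


def decode_char_expression(expr):
    """Hypothetical inverse: turn 'CHAR(97, 100)' back into 'ad'."""
    inner = expr[len('CHAR('):-1]
    if not inner.strip():
        return ''
    return ''.join(chr(int(n)) for n in inner.split(','))


print(encode_as_char('admin'))        # CHAR(97, 100, 109, 105, 110)
print(encode_as_utf8_char('\u00e9'))  # CHAR(195, 169)
print(decode_char_expression('CHAR(97, 100, 109, 105, 110)'))  # admin
```

The two encoders only differ for non-ASCII input: one non-ASCII character is one code point but several UTF-8 bytes.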

Related

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a unicode string and it seems that this is not working. Could this be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)
def bold_letters(string, index):
    return "<b>"+string[0]+"</b>"+string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: when you are dealing with raw bytes, i.e. str in Python 2.x, indices work over individual bytes.
To work seamlessly with Unicode data, you first need to let Python 2.x know that you are dealing with Unicode, then do the string manipulation. Finally, you can convert the result back to raw bytes so that the behavior stays abstracted, i.e. you get a str and you return a str.
Ideally you should convert all data from UTF-8 encoded bytes to unicode objects (I am assuming your source encoding is UTF-8 because that is the standard used by most applications these days) at the very beginning of your code, and convert back to raw bytes only at the very end, such as when saving to a DB or responding to a client. Some frameworks might handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = "<b>"+string[0]+"</b>"+string[index:]
    return string.encode('utf8')
This will also work for ASCII because UTF-8 is a superset of ASCII. To better understand how Unicode works, and how it works in Python specifically, read http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is a Unicode object, so you don't have to do anything explicitly.
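To illustrate the difference between indexing characters and indexing bytes, here is a small Python 3 sketch. The function name bold_first_letter is taken from the desired-output example in the question; the empty-string guard is an added assumption about what the caller wants.

```python
def bold_first_letter(text):
    """Wrap the first character of a str in <b>...</b>."""
    if not text:
        return text
    return "<b>" + text[0] + "</b>" + text[1:]


word = "הקדמה"
raw = word.encode("utf-8")

# One index per character in a str, but each Hebrew letter is two UTF-8 bytes.
print(len(word))  # 5
print(len(raw))   # 10

bolded = bold_first_letter(word)

# Slicing the bytes at index 1 cuts the first letter in half,
# which is where the '?' characters in the question's output come from.
broken = raw[:1].decode("utf-8", errors="replace")
```

Slicing the str keeps whole characters together, whereas slicing the UTF-8 bytes can split one.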
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character. Unicode strings use one index per character (at least for those in the BMP on Python 2, i.e. the first 65536 code points):
#coding:utf8
from io import open  # so open() accepts encoding= on Python 2 as well
s = u"הקדמה"
t = u'<b>'+s[0]+u'</b>'+s[1:]
print(t)
with open('out.htm','w',encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as:

Lxml trying to extract data with windows-1250 characters

Hello, I am experimenting with Python and lxml, and I am stuck on the problem of extracting data from a webpage that contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text,parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this :
\xc2\x9ea
instead of characters. Then I have the problem with storing into database
So how can I accomplish this? I tried putting 'windows-1250' instead of 'utf-8', without success. Can I convert this back to the original characters somehow?
Try:
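text = "\xc2\x9ea"
# Decoding these bytes directly as windows-1250 would give 'Âža',
# so first recover the original single bytes, then decode those.
print text.decode('utf-8').encode('latin-1').decode('windows-1250').encode('utf-8')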
Output:
ža
And save nice chars in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead map each character to its ordinal and work with the ordinals:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
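One possible way to write that stricter version in Python 3 is sketched below. The name narrow_to_bytes is hypothetical, and raising ValueError is an assumption about how the caller wants failures reported. It also builds the bytes directly from the ordinals, which avoids the hex round trip (hex(n)[2:] yields a single digit for ordinals below 16).

```python
def narrow_to_bytes(text):
    """Turn a str whose code points all fit in one byte back into bytes.

    Raises ValueError instead of silently producing garbage when a
    character does not satisfy 0 <= ord(ch) < 0x100.
    """
    for position, ch in enumerate(text):
        if not 0 <= ord(ch) < 0x100:
            raise ValueError(
                'character {!r} at index {} does not fit in one byte'.format(ch, position)
            )
    return bytes(ord(ch) for ch in text)


fixed = narrow_to_bytes('\x9ea').decode('cp1250')
print(fixed == '\u017ea')  # True: U+017E is the letter z with caron

# The same narrowing is what the latin-1 codec does, including the error case.
print('\x9ea'.encode('latin-1') == narrow_to_bytes('\x9ea'))  # True
```

With this check, a string such as '\x9e\u724b\x61' is rejected rather than being reinterpreted as four unrelated single-byte characters.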

Mapping Unicode to ASCII in Python

I receive strings after querying via urlopen in JSON format:
def get_clean_text(text):
    return text.translate(maketrans("!?,.;():", "        ")).lower().strip()
for track in json["tracks"]:
    print track["name"].lower()
    get_clean_text(track["name"].lower())
For the string "türlich, türlich (sicher, dicker)" I then get
  File "main.py", line 23, in get_clean_text
    return text.translate(maketrans("!?,.;():", "        ")).lower().strip()
TypeError: character mapping must return integer, None or unicode
I want to format the string to be "türlich türlich sicher dicker".
The question is not a complete self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self contained, correct example. (That, and the fact that various other people—some of them probably smarter than me—likely ignored your question because it was ambiguous.)
Assuming you're using 2.x, and you've done a from string import * to get maketrans, and track["name"] is unicode rather than str/bytes, here's your problem:
There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).
The string.maketrans function makes old-style 8-bit translation tables. So you can't use it with unicode.translate.
You can always write your own "makeunitrans" function as a drop-in replacement, something like this:
def makeunitrans(frm, to):
    return {ord(f):ord(t) for (f,t) in zip(frm, to)}
But if you just want to map out certain characters, you could do something a bit more special purpose:
def makeunitrans(frm):
    return {ord(f):ord(' ') for f in frm}
However, from your final comment, I'm not sure translate is even what you want:
I want to format the string to be "türlich türlich sicher dicker"
If you get this right, you're going to format the string to be "türlich  türlich  sicher  dicker" (with doubled spaces wherever a punctuation mark sat next to a space), because you're mapping all those punctuation characters to spaces, not nothing.
With new-style translation tables you can map anything you want to None, which solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table in-line every time through if that were an issue) or using a trivial regular expression.
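Here is a minimal sketch of both suggestions, written for Python 3, where str.translate takes the sparse dict-style table described above. The names PUNCTUATION, DELETE_TABLE, and get_clean_text_re are hypothetical; the set of punctuation characters is the one from the question.

```python
import re

PUNCTUATION = "!?,.;():"

# Sparse table: map each punctuation code point to None, which deletes it.
DELETE_TABLE = {ord(ch): None for ch in PUNCTUATION}


def get_clean_text(text):
    """Drop punctuation with translate(), then lower-case and trim."""
    return text.translate(DELETE_TABLE).lower().strip()


def get_clean_text_re(text):
    """Same result using a trivial regular expression."""
    return re.sub(r"[!?,.;():]", "", text).lower().strip()


title = "türlich, türlich (sicher, dicker)"
cleaned = get_clean_text(title)
print(cleaned == "türlich türlich sicher dicker")  # True
print(cleaned == get_clean_text_re(title))         # True
```

The table is built once at module level, so the translate() version does not rebuild it on every call.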

Python: Convert Unicode-Hex-String to Unicode

I have a hex-string made from a unicode string with that function:
def toHex(s):
    res = ""
    for c in s:
        res += "%02X" % ord(c) #at least 2 hex digits, can be more
    return res
hex_str = toHex(u"...")
This returns a string like this one:
"80547CFB4EBA5DF15B585728"
That's a sequence of 6 Chinese characters.
But
u"Knödel"
converts to
"4B6EF664656C"
What I need now is a function to convert this back to the original unicode. The Chinese characters seem to have a 2-byte representation while the second example has 1-byte representations for all characters. So I can't just use unichr() for each 1- or 2-byte block.
I've already tried
binascii.unhexlify(hex_str)
but this seems to convert byte-by-byte and returns a string, not unicode. I've also tried
binascii.unhexlify(hex_str).decode(...)
with different formats. Never got the original unicode string.
Thank you a lot in advance!
This seems to work just fine:
binascii.unhexlify(binascii.hexlify(u"Knödel".encode('utf-8'))).decode('utf-8')
This round-trips back to the original object. You can do the same for the Chinese text if it's encoded properly; however, formatting ord(x) with "%02X" already destroys the text you started from. You'll need to encode it first and only then treat it like a string of bytes.
Can't be done. Using %02X loses too much information. You should be using something like UTF-8 first and converting that, instead of inventing a broken encoding.
>>> u"Knödel".encode('utf-8').encode('hex')
'4b6ec3b664656c'
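For Python 3, where bytes has no .encode('hex'), a reversible version of the same idea could look like the sketch below. The function names to_hex and from_hex are hypothetical; the technique is simply UTF-8 followed by binascii.hexlify, and the reverse.

```python
import binascii


def to_hex(text):
    """Encode text as UTF-8, then as lowercase hex digits."""
    return binascii.hexlify(text.encode('utf-8')).decode('ascii')


def from_hex(hex_str):
    """Reverse of to_hex: hex digits -> UTF-8 bytes -> text."""
    return binascii.unhexlify(hex_str).decode('utf-8')


print(to_hex('Kn\u00f6del'))  # 4b6ec3b664656c

# The first three code points of the question's hex string: 8054, 7CFB, 4EBA.
chinese = '\u8054\u7cfb\u4eba'
print(from_hex(to_hex(chinese)) == chinese)  # True
```

Because UTF-8 is self-delimiting, the decoder can tell where each character ends, which is exactly the information the variable-width "%02X" output throws away.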
When I was working with Unicode in a VB app a while ago, the first one or two digits would be removed if they were "0", meaning "&H00A2" would automatically be converted to "&HA2". I created a small function to check the length of the string and, if it was less than 4 characters, add the missing 0's. I'm not sure if this is what's happening to you, but I thought I would share it as something to be aware of.

How do you store raw bytes as text without losing information in python 2.x?

Suppose I have any data stored in bytes. For example:
0110001100010101100101110101101
How can I store it as printable text? The obvious way would be to convert every 0 to the character '0' and every 1 to the character '1'. In fact this is what I'm currently doing. I'd like to know how I could pack them more tightly, without losing information.
I thought of converting bits in groups of eight to ASCII, but some bit combinations are not accepted in that format. Any other ideas?
What about an encoding that only uses "safe" characters like base64?
http://en.wikipedia.org/wiki/Base64
EDIT: That is assuming that you want to safely store the data in text files and such?
In Python 2.x, strings should be fine (Python doesn't use null terminated strings, so don't worry about that).
Else in 3.x check out the bytes and bytearray objects.
http://docs.python.org/3.0/library/stdtypes.html#bytes-methods
Not sure what you're talking about.
>>> sample = "".join( chr(c) for c in range(256) )
>>> len(sample)
256
>>> sample
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\
x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABC
DEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83
\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97
\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab
\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf
\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3
\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7
\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb
\xfc\xfd\xfe\xff'
The string sample contains all 256 distinct byte values. There is no such thing as "bit combinations ... not accepted".
To make it printable, simply use repr(sample): non-printable and non-ASCII bytes are escaped, as you can see above.
Try the standard array module or the struct module. These support storing bytes in a space efficient way -- but they don't support bits directly.
You can also try http://cobweb.ecn.purdue.edu/~kak/dist/BitVector-1.2.html or http://ilan.schnell-web.net/prog/bitarray/
For Python 2.x, your best bet is to store them in a string. Once you have that string, you can encode it into safe ASCII characters using the base64 module that comes with Python.
import base64
encoded = base64.b64encode(bytestring)
This will be much more condensed than storing "1" and "0".
For more information on the base64 module, see the python docs.
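Putting the pieces together, here is a rough Python 3 sketch that packs a string of '0'/'1' characters into bytes and then into base64 text, and back again. The function names bits_to_text and text_to_bits are hypothetical, and passing the original bit count separately is an assumption made so that the zero padding of the last byte can be dropped on the way back.

```python
import base64


def bits_to_text(bits):
    """Pack a '0'/'1' string into bytes, then base64-encode to ASCII text."""
    padded = bits + "0" * (-len(bits) % 8)  # pad up to a whole number of bytes
    raw = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return base64.b64encode(raw).decode("ascii")


def text_to_bits(text, bit_count):
    """Reverse of bits_to_text; bit_count trims the padding bits."""
    raw = base64.b64decode(text)
    bits = "".join(format(byte, "08b") for byte in raw)
    return bits[:bit_count]


bits = "0110001100010101100101110101101"
text = bits_to_text(bits)
print(len(bits), len(text))                    # 31 8
print(text_to_bits(text, len(bits)) == bits)   # True
```

base64 spends four printable characters on every three bytes, so the 31 bits above shrink from 31 characters to 8.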
