I receive strings after querying via urlopen in JSON format:
def get_clean_text(text):
    return text.translate(maketrans("!?,.;():", " ")).lower().strip()
for track in json["tracks"]:
    print track["name"].lower()
    get_clean_text(track["name"].lower())
For the string "türlich, türlich (sicher, dicker)" I then get
File "main.py", line 23, in get_clean_text
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
TypeError: character mapping must return integer, None or unicode
I want to format the string to be "türlich türlich sicher dicker".
The question is not a complete, self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, and so on. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self-contained, correct example. (That, and the fact that various other people—some of them probably smarter than me—likely ignored your question because it was ambiguous.)
Assuming you're using 2.x, you've done a from string import * to get maketrans, and track["name"] is unicode rather than str/bytes, here's your problem:
There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).
The string.maketrans function makes old-style 8-bit translation tables. So you can't use it with unicode.translate.
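To see the difference concretely, here's a minimal Python 2 sketch (assuming a from string import maketrans, as in the question):
from string import maketrans

table = maketrans("!?", "  ")  # old-style table: a 256-character str
print "hi!".translate(table)  # str.translate accepts it: prints "hi "
print u"hi!".translate({ord(u"!"): u" "})  # unicode.translate with a dict table works
u"hi!".translate(table)  # TypeError, just like the one in your traceback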
You can always write your own "makeunitrans" function as a drop-in replacement, something like this:
def makeunitrans(frm, to):
    return {ord(f): ord(t) for (f, t) in zip(frm, to)}
But if you just want to map out certain characters, you could do something a bit more special purpose:
def makeunitrans(frm):
    return {ord(f): ord(' ') for f in frm}
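With either helper, the original function just switches to the dict-based table (a sketch, assuming text is unicode):
def get_clean_text(text):
    # makeunitrans builds a sparse dict table that unicode.translate accepts.
    return text.translate(makeunitrans(u"!?,.;():")).lower().strip()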
However, from your final comment, I'm not sure translate is even what you want:
I want to format the string to be "türlich türlich sicher dicker"
If you get this right, you're going to format the string to be "türlich  türlich  sicher  dicker", with doubled spaces inside, because you're mapping all those punctuation characters to spaces, not to nothing (and the strip only takes care of the leading and trailing ones).
With new-style translation tables you can map anything you want to None, which solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table in-line every time through if that were an issue) or using a trivial regular expression.
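For example, a regex version that also collapses the runs of spaces translate would leave behind (the exact punctuation class here is my guess, not something from your question):
import re

def get_clean_text(text):
    # Replace each run of punctuation and whitespace with a single space.
    return re.sub(ur'[!?,.;():\s]+', u' ', text).lower().strip()

print get_clean_text(u"türlich, türlich (sicher, dicker)")
# türlich türlich sicher dicker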
Related
I want to retrieve multiple strings in one row of my terminal. Right now I'm using instr(), but that only extracts the string at that exact position. The function that should actually do this is inchstr(), but that doesn't seem to work in Python—or does it?
No. Python's curses binding does not extend the underlying curses library (much). There's more than one related curses function which Python might use, depending on what you are looking at, but none reads more than a single line of text:
int instr(char *str);
int inwstr(wchar_t *wstr);
int inchstr(chtype *chstr);
int in_wchstr(cchar_t *wchstr);
The first (instr) and third (inchstr) both read from the screen, but the latter returns attributes (color, underline, etc.) along with the text.
Python's instr appears to use the former, since its documentation states:
Return a bytes object of characters, extracted from the window starting at the current cursor position, or at y, x if specified. Attributes are stripped from the characters. If n is specified, instr() returns a string at most n characters long (exclusive of the trailing NUL).
The second (inwstr) and fourth (in_wchstr) differ from the other two by allowing for reading wide characters directly. Python actually should provide for using either set (narrow or wide character interfaces), since ncurses' wide-character interface is better suited to returning Unicode strings, but it uses the narrow interface in either case, returning a bytes object (and requiring the application to puzzle out how to convert the data into a string).
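So you get bytes back and decode them yourself. A minimal sketch of reading a row back with Python's instr (the 11 is just the length of the sample text):
import curses

def read_back(stdscr):
    stdscr.addstr(0, 0, "hello world")
    stdscr.refresh()
    # instr returns a bytes object with the attributes stripped off;
    # converting it to str is left to the application.
    raw = stdscr.instr(0, 0, 11)
    return raw.decode('ascii')

print(curses.wrapper(read_back))  # hello world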
Why does a regex fail on a string cast from a bytes object when line breaks are present?
That is, why does this fail to find a match (i.e., print 'Green') in a string created from str(obj):
import re
s = str(b'Package Name: Green\r\n Release version: 8.1\r\n')
match = re.search(r'Package Name: (.*)\r\n', s)
print(match.group(1))
When this succeeds using a string created from obj.decode()?
import re
s = (b'Package Name: Green\r\n Release version: 8.1\r\n').decode()
match = re.search(r'Package Name: (.*)\r\n', s)
print(match.group(1))
No matter what search pattern was tried, searching the string created by str(obj) failed to find a match...
The reason you get different results is that you’re doing different things. Calling str on a bytes object containing CR/LF characters returns a string with literal backslash-r and backslash-n sequences in it; calling decode returns a string with actual CR and LF characters in it. So, if you’re searching the results for those control characters, the second one will succeed, and the first will fail. And it’s the second one that you wanted.
In other words, using decode here is right, and str is wrong; that’s why you get different results. If you can’t think through the difference, try just printing them out: print(b.decode()); print(str(b)) and you’ll see the difference immediately.
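Concretely:
b = b'Package Name: Green\r\n Release version: 8.1\r\n'
print(b.decode())  # real CR/LF: the text comes out on two lines
print(str(b))      # one line, with literal backslashes: b'Package Name: Green\r\n ...'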
In fact, you should usually be decoding the strings as soon as you receive them, and never looking at the bytes again. Then you never have to worry about the str representation of bytes objects (except maybe in some code that logs errors caused by invalid strings that you couldn’t decode). The only exception is when you know the bytes are some kind of encoded text, but can’t be sure what the encoding is. For example, if you’re parsing HTTP headers or email messages or Python source code, you don’t know the character set until you read part of the file and search it for special ASCII-encoded strings. Or, if you’re converting a bunch of old text files from Windows to Unix line endings and some are cp1252 while others are cp1250, you don’t care which is which because they both encode line endings the same way. For those cases, just stick with bytes, and search for b'\n' instead of '\n'.
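In those cases the search itself looks the same, just with bytes on both sides (a quick illustration):
import re

raw = b'Content-Type: text/plain\r\n\r\nbody'
# Both the pattern and the match are bytes; nothing is ever decoded.
m = re.search(rb'Content-Type: (.*)\r\n', raw)
print(m.group(1))  # b'text/plain'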
If you want to know why Python makes this so complicated:
bytes objects are used to store strings encoded in your default encoding—but they’re also used to store strings encoded in different encodings, and binary data that isn’t a string at all. And a bytes object has no idea which of those it’s storing; they’re all just sequences of numbers.
Python 2 effectively assumed that a bytes was being used to store a string in your default encoding, so it let you convert back and forth to Unicode by calling functions like str, or even concatenating a bytes and Unicode string. That turned out to be one of the biggest sources of errors in the language. You still see Python 2 users posting questions here every few days asking why they got a UnicodeEncodeError when they weren’t calling encode anywhere (or, worse, when they were calling decode), and fixing that was one of the main reasons for Python 3’s existence.
The human-readable representation of a bytes object has to be something that can be produced without error, and read unambiguously, whether it’s a string in the default encoding, a string in a completely different encoding, or a sequence of pixel brightness values ranging from 0 to 255. The compromise solution (for things like that HTTP headers case above) is the backslash-escaped quoted string.
By the way, during the Python 2 to 3 transition, the core devs assumed multiple people would come up with clever EncodedBytes types that carried around their encoding, and could therefore act more like Python 2 byte strings but without all the associated errors, and after a couple years one of them would be the clear winner on PyPI and maybe they could add it to Python 3.3 or so. That’s what you’re probably instinctively reaching for here. But, as it turned out, nobody used any such libraries, because it’s almost always easier to just decode and encode at the edges of your program and use Unicode everywhere, and the exceptions are almost always cases where you don’t know the encoding so EncodedBytes wouldn’t help.
One last thing: thinking of functions like str or float as “casts” is misleading. While it looks superficially similar to the way you do explicit casts in C or Java or Go or whatever language you’re used to, it has a very different meaning: it doesn’t reinterpret the object, it asks the object to build a brand-new string (or float) representing itself.
I'm new to encryption, and programming in general. I'm just trying to get my head wrapped around some basic concepts.
I'm using python, Crypto.Hash.SHA256
from Crypto.Hash import SHA256
In the REPL if I type
print SHA256.new('password').digest()
which prints mostly unprintable garbage like j���*�rBo��)'s`=
vs
SHA256.new('password').digest()
which echoes back "^\x88H\x98\xda(\x04qQ\xd0\xe5o\x8d\xc6)'s`=\rj\xab\xbd\xd6*\x11\xefr\x1d\x15B\xd8"
What are these two outputs?
How are they supposed to be interpreted?
In the first case, you are using print, so Python is trying to convert the bytes to printable characters. Unfortunately, not every byte is printable, so you get some strange output.
In the second case, since you are not calling print, the Python interpreter does something different. It takes the return value, which is a string in this case, and shows the internal representation of the string. Which is why for some characters, you get something that is printable, but in other cases, you get an escaped sequence, like \x88.
The two outputs happen to just be two representations of the same digest.
FYI, when working with pycrypto and looking at hash function outputs, I highly recommend using hexdigest instead of digest.
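For example, hexdigest gives you the same 32 bytes as unambiguous hex text (this matches the repr shown above, byte for byte):
from Crypto.Hash import SHA256

print SHA256.new('password').hexdigest()
# 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8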
'admin' encoded is CHAR(97, 100, 109, 105, 110)
I would like to know if there is a module or a way to convert each letter of a string to SQL CHARs. If not, how do I convert it myself? I have access to a chart that says a=97, b=98, etc., if that helps.
I'm not sure why you need this at all. It's not hard to get the string representation of a CHAR field holding ASCII or Unicode or whatever code points. But I'm pretty sure you don't need that, because databases already know how to compare those to strings passed in SQL, etc. Unless you're trying to, say, generate a dump that looks exactly like the ones you get from some other tool. But, assuming you do need to do this, here's how.
I think you're looking for the ord function:
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('\u2020') returns 8224. This is the inverse of chr().
This works because Python has access to that same chart that you have—in fact, to a bunch of different ones, one for each encoding it knows about. In fact, that chart is pretty much what an encoding is.
So, for example:
def encode_as_char(s):
    return 'CHAR({})'.format(', '.join(str(ord(c)) for c in s))
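For example, with the 'admin' from your title:
print(encode_as_char('admin'))
# CHAR(97, 100, 109, 105, 110)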
Or, if you just wanted a list of numbers, not a string made out of those numbers, it's even simpler:
def encode_as_char(s):
    return [ord(c) for c in s]
This is all assuming that either (a) your database is storing Unicode characters and you're using Python 3, or (b) your database is storing 8-bit characters and you're using Python 2. Otherwise, you need an encode or decode step in there as well.
For a Python 3 Unicode string to a UTF-8 database (notice that we don't need ord here, because a Python 3 bytes is actually a sequence of numbers):
def encode_as_utf8_char(s):
    return 'CHAR({})'.format(', '.join(str(c) for c in s.encode('utf-8')))
For Python 2 UTF-8 string to a Unicode database:
def encode_utf8_as_char(s):
    return 'CHAR({})'.format(', '.join(str(ord(c)) for c in s.decode('utf-8')))
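To see what the encode step buys you, here's the Python 3 version on a non-ASCII character (é encodes to the two bytes 0xC3 0xA9 in UTF-8):
print(encode_as_utf8_char('é'))
# CHAR(195, 169)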
Would you know by any chance how to get rid of the bytes identifier (the little b) in front of a string in a Python list? Perhaps there is some global setting that can be amended?
I retrieve a query from Postgres 9.3 and create a list from that query. It looks like Python 3.3 interprets records in columns that are of type char(4) as if they were byte strings, for example:
Funds[1][1]
b'FND3'
Funds[1][1].__class__
<class 'bytes'>
So the implication is:
Funds[1][1]=='FND3'
False
I have some control over that database so I could change the column type to varchar(4), and it works well:
Funds[1][1]=='FND3'
True
But this is only a temporary solution.
The little b has made my life a nightmare for the last two days ;), and I would appreciate your help with that problem.
You have to either manually implement __str__/__repr__ or, if you're willing to take the risk, do some sort of regex replace over the string.
Example __repr__:
def stringify(lst):
    return "[{}]".format(", ".join(repr(x)[1:] if isinstance(x, bytes) else repr(x)
                                   for x in lst))
The b isn't part of the string, any more than the quotes around it are; they're just part of the representation when you print the string out. So, you're chasing the wrong problem, one that doesn't exist.
The problem is that the byte string b'FND3' is not the same thing as the string 'FND3'. In this particular example, that may seem silly, but if you might ever have any non-ASCII characters anywhere, it stops being silly.
For example, the string 'é' is the same as the byte string b'\xe9' in Latin-1, and it's also the same as the byte string b'\xc3\xa9' in UTF-8. And of course b'\xc3\xa9' is the same as the string 'Ã©' in Latin-1.
So, you have to be explicit about what encoding you're using:
Funds[1][1].decode('utf-8')=='FND3'
But why is PostgreSQL returning you byte strings? Well, that's what a char column is. It's up to the Python bindings to decide what to do with them. And without knowing which of the multiple PostgreSQL bindings you're using, and which version, it's impossible to tell you what to do. But, for example, in recent-ish psycopg you just have to set an encoding on the connection (e.g., conn.set_client_encoding('UTF-8')); in older versions you had to register a standard typecaster and do some more stuff; in py-postgresql you have to register lambda s: s.decode('utf-8'); and so on.
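If you'd rather handle it in Python than in the driver, here's a minimal sketch that decodes whole rows as they come back (assuming UTF-8, and rows shaped like your Funds):
def decode_row(row, encoding='utf-8'):
    # Turn every bytes value into str; leave everything else alone.
    return tuple(v.decode(encoding) if isinstance(v, bytes) else v for v in row)

Funds = [decode_row(row) for row in Funds]
Funds[1][1] == 'FND3'  # now True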