Python re.sub() and unicode

I have what feels to me like a really basic question, but for the life of me I can't figure it out.
I have a whole bunch of text I'm going through and converting to the International Phonetic Alphabet. I'm using the re.sub() method a lot, and in many cases this means replacing a character of string type with a character of unicode type. For example:
for row in responsesIPA:
    re.sub("3", u"\u0259", row)
I'm getting TypeError: expected string or buffer. The docs on Python re say that the type for the replacement has to match the type for what you're searching, so maybe that's the problem? I tried putting str() around u"\u0259", but I'm still getting the type error. Is there a way for me to do this replacement?

The error you're getting is telling you that row isn't a valid string or buffer (str, bytes, unicode, or anything else readable). Double-check what is actually stored in row by adding a print(row) just before the re.sub() call.
To prove that this is the case, the following works:
import re
print(re.sub("3", u"\u0259", "12345"))
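Once each row really is a string, the replacement itself is straightforward. Here's a minimal sketch (the sample data is hypothetical, since responsesIPA isn't shown in the question); note that re.sub() returns a new string rather than modifying its argument, so the results have to be collected explicitly:
import re

responsesIPA = ["3 sounds", "another 3"]   # hypothetical sample data
converted = [re.sub("3", u"\u0259", row) for row in responsesIPA]
for row in converted:
    print(row)   # each "3" replaced with the schwa character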

Related

How to modify unicode code as it is a string

I have a list of partial Unicode codes for cuneiform characters.
For example, I have 12220, which Python couldn't render as 𒈠, which is what I wanted. Then I realized that adding \U000 in front of these partial codes produces the results I want. The problem is that I can't build the escape dynamically.
"\U000{}".format(12220) doesn't work; clearly you can't concatenate a string onto a Unicode escape like that. I don't want to hand-merge 375 characters. Can anyone help me with this?
Use this:
print(chr(int("12220", 16)))
The chr function returns the character for an integer code point, and the second parameter of int() is the base the string should be interpreted in (here, 16 for hexadecimal).
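Applied to the whole list, a minimal sketch (the sample codes below are hypothetical stand-ins for the 375 real ones) could look like:
codes = ["12220", "12000"]                       # hypothetical sample codes
characters = [chr(int(code, 16)) for code in codes]
print(characters)                                # e.g. ['𒈠', '𒀀']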

How to distinguish between a correct and a botched unicode encoded string in Python?

I have string data in various languages where parts of the strings have seen some wrong encoding/decoding while others are correct; I need to fix the wrong ones.
Here's an example for the German word "Zubehör":
correct = "ZUBEHÖR"
incorrect = "ZUBEHÃ\x96R"
I already found out that I can correct the errors like this:
incorrect.encode("raw_unicode_escape").decode("utf8")
However, using this on the correct strings raises an error. I could iterate over all strings and use a try statement, but I don't know whether that would work reliably, and I'd like to know a more elegant way.
Also, while the \x96 is written out when printing, it's actually only one character:
incorrect[-3]
Out[34]: 'Ã'
incorrect[-2]
Out[33]: '\x96'
How can I reliably only find those strings that have these odd unicode characters in them like ZUBEHÃ\x96R?
EDIT:
Here's something else I stumbled upon while experimenting:
When I do incorrect.encode("raw_unicode_escape") then the result is b'ZUBEH\xc3\x96R'.
But when I do this with e.g. a cyrillic word like this:
"Персонализированные".encode("raw_unicode_escape")
Then the result is b'\\u041f\\u0435\\u0440\\u0441\\u043e\\u043d\\u0430\\u043b\\u0438\\u0437\\u0438\\u0440\\u043e\\u0432\\u0430\\u043d\\u043d\\u044b\\u0435'
Why am I getting \x-escapes in the first case and \u-escapes in the second case while doing the exact same thing?
And why can I .decode("utf8") back the \x-escapes into a readable format but not the \u-escapes?
You should try the fixes-text-for-you library (ftfy):
>>> import ftfy
>>> ftfy.fix_text("ZUBEHÃ\x96R")
'ZUBEHÖR'
It operates line by line, so if you have a string with clean and corrupt strings, but on separate lines, ftfy can probably handle it.
Note: This is not an exact science.
The way ftfy works involves a lot of educated guesses.
The tool is very well made, but it may not guess correctly in all cases you have.
If you can, it is always better to fix the errors at the source (i.e., make sure all text is correctly decoded in the first place).
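Since fix_text generally leaves already-correct strings alone, it can simply be mapped over mixed data. A quick sketch (the sample list reuses strings from the question):
import ftfy

strings = ["ZUBEHÃ\x96R", "ZUBEHÖR", "Персонализированные"]
fixed = [ftfy.fix_text(s) for s in strings]
print(fixed)   # expected: all three come back as correct, readable text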

Can't regex string between string and line break - str(x) versus x.decode()

Why does a regex fail on a string cast from a bytes object when line breaks are present?
That is, why does this fail to find a match (i.e., print 'Green') in a string created from str(obj):
import re
s = str(b'Package Name: Green\r\n Release version: 8.1\r\n')
match = re.search(r'Package Name: (.*)\r\n', s)
print(match.group(1))
When this succeeds using a string created from obj.decode()?
import re
s = (b'Package Name: Green\r\n Release version: 8.1\r\n').decode()
match = re.search(r'Package Name: (.*)\r\n', s)
print(match.group(1))
No matter what search pattern was tried, searching the string created by str(obj) failed to find a match...
The reason you get different results is that you’re doing different things. Calling str on a bytes with newline characters returns a string with a literal backslash and n; calling decode returns a string with a newline character in it. So, if you’re searching the results for newline characters, the second one will succeed, and the first will fail. And it’s the second one that you wanted.
In other words, using decode here is right, and str is wrong; that’s why you get different results. If you can’t think through the difference, try just printing them out: print(b.decode()); print(str(b)) and you’ll see the difference immediately.
In fact, you should usually be decoding the strings as soon as you receive them, and never looking at the bytes again. Then you never have to worry about the str representation of bytes objects (except maybe in some code that logs errors caused by invalid strings that you couldn’t decode). The only exception is when you know the bytes are some kind of encoded text, but can’t be sure what the encoding is. For example, if you’re parsing HTTP headers or email messages or Python source code, you don’t know the character set until you read part of the file and search it for special ASCII-encoded strings. Or, if you’re converting a bunch of old text files from Windows to Unix line endings and some are cp1252 while others are cp1250, you don’t care which is which because they both encode line endings the same way. For those cases, just stick with bytes, and search for b'\n' instead of '\n'.
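To make the difference concrete, here is a short sketch using the same bytes literal as the question:
import re

b = b'Package Name: Green\r\n Release version: 8.1\r\n'
print(str(b))      # the repr: b'Package Name: Green\r\n ...' with literal backslashes
print(b.decode())  # the decoded text, containing real \r\n line breaks

print(re.search(r'Package Name: (.*)\r\n', str(b)))                 # None
print(re.search(r'Package Name: (.*)\r\n', b.decode()).group(1))    # Green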
If you want to know why Python makes this so complicated:
bytes objects are used to store strings encoded in your default encoding—but they’re also used to store strings encoded in different encodings, and binary data that isn’t a string at all. And a bytes object has no idea which of those it’s storing; they’re all just sequences of numbers.
Python 2 effectively assumed that a bytes was being used to store a string in your default encoding, so it let you convert back and forth to Unicode by calling functions like str, or even concatenating a bytes and Unicode string. That turned out to be one of the biggest sources of errors in the language. You still see Python 2 users posting questions here every few days asking why they got a UnicodeEncodeError when they weren’t calling encode anywhere (or, worse, when they were calling decode), and fixing that was one of the main reasons for Python 3’s existence.
The human-readable representation of a bytes object has to be something that can be produced without error, and read unambiguously, whether it’s a string in the default encoding, a string in a completely different encoding, or a sequence of pixel brightness values ranging from 0 to 255. The compromise solution (for things like that HTTP headers case above) is the backslash-escaped quoted string.
By the way, during the Python 2 to 3 transition, the core devs assumed multiple people would come up with clever EncodedBytes types that carried around their encoding, and could therefore act more like Python 2 byte strings but without all the associated errors, and after a couple years one of them would be the clear winner on PyPI and maybe they could add it to Python 3.3 or so. That’s what you’re probably instinctively reaching for here. But, as it turned out, nobody used any such libraries, because it’s almost always easier to just decode and encode at the edges of your program and use Unicode everywhere, and the exceptions are almost always cases where you don’t know the encoding so EncodedBytes wouldn’t help.
One last thing: thinking of functions like str or float as "casts" is misleading. While it looks superficially similar to the way you do explicit casts in C or Java or Go or whatever language you're used to, it has a very different meaning.

Bytes string in Python

Would you know by any chance how to get rid of the bytes identifier in front of a string in a Python list? Perhaps there is some global setting that can be amended?
I retrieve a query from Postgres 9.3 and create a list from that query. It looks like Python 3.3 interprets records in columns of type char(4) as if they were byte strings, for example:
Funds[1][1]
b'FND3'
Funds[1][1].__class__
<class 'bytes'>
So the implication is:
Funds[1][1]=='FND3'
False
I have some control over that database so I could change the column type to varchar(4), and it works well:
Funds[1][1]=='FND3'
True
But this is only a temporary solution.
The little b has made my life a nightmare for the last two days ;), and I would appreciate your help with this problem.
Thanks and Regards
Peter
You have to either manually implement __str__/__repr__ or, if you're willing to take the risk, do some sort of Regex-replace over the string.
Example __repr__:
def stringify(lst):
    return "[{}]".format(", ".join(repr(x)[1:] if isinstance(x, bytes) else repr(x) for x in lst))
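For example, with hypothetical data shaped like the Funds list:
Funds = [[1, b'FND3'], [2, b'FND4']]
print(stringify(Funds[1]))   # prints [2, 'FND4']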
The b isn't part of the string, any more than the quotes around it are; they're just part of the representation when you print the string out. So, you're chasing the wrong problem, one that doesn't exist.
The problem is that the byte string b'FND3' is not the same thing as the string 'FND3'. In this particular example, that may seem silly, but if you might ever have any non-ASCII characters anywhere, it stops being silly.
For example, the string 'é' is the same as the byte string b'\xe9' in Latin-1, and it's also the same as the byte string b'\xc3\xa9' in UTF-8. And of course b'\xc3\xa9' is the same as the string 'Ã©' in Latin-1.
So, you have to be explicit about what encoding you're using:
Funds[1][1].decode('utf-8')=='FND3'
But why is PostgreSQL returning you byte strings? Well, that's what a char column is. It's up to the Python bindings to decide what to do with them. And without knowing which of the multiple PostgreSQL bindings you're using, and which version, it's impossible to tell you what to do. But, for example, in recent-ish psycopg you just have to set an encoding on the connection (e.g., conn.set_client_encoding('UTF-8')); in older versions you had to register a standard typecaster and do some more stuff; in py-postgresql you have to register lambda s: s.decode('utf-8'); and so on.
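A rough sketch of the psycopg option mentioned above, assuming the psycopg2 bindings (the connection parameters are hypothetical):
import psycopg2

conn = psycopg2.connect("dbname=funds user=peter")   # hypothetical DSN
conn.set_client_encoding('UTF8')   # ask the driver to hand back decoded text

Failing that, decoding explicitly at the point of comparison also works, e.g. Funds[1][1].decode('utf-8') == 'FND3'.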

How to escape special char

I have the following code to handle a Chinese-character problem, and some special characters, in a PowerPoint file, because I would like to use the content of the ppt as the filename to save to.
If the content contains special characters, an exception is thrown, so I use the following code to handle it.
It works fine under Python 2.7, but when I run it with Python 3.0 it gives me the following error:
if not (char in '<>:"/\|?*'):
TypeError: 'in <string>' requires string as left operand, not int
I Googled the error message, but I don't understand how to resolve it. The code if not (char in '<>:"/\|?*'): is supposed to convert the character to an ASCII code number, right?
Is there any example to fix my problem in Python 3?
def rm_invalid_char(self, str):
    final = ""
    dosnames = ['CON', 'PRN', 'AUX', 'NUL', 'COM1', 'COM2', 'COM3', 'COM4', 'COM5', 'COM6', 'COM7', 'COM8', 'COM9', 'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5', 'LPT6', 'LPT7', 'LPT8', 'LPT9']
    for char in str:
        if not (char in '<>:"/\|?*'):
            if ord(char) > 31:
                final += char
    if final in dosnames:
        # oh dear...
        raise SystemError('final string is a DOS name!')
    elif final.replace('.', '') == '':
        print('final string is all periods!')
        pass
    return final
Simple: use this
re.escape(YourStringHere)
From the docs:
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
You are passing an iterable whose first element is an integer (232) to rm_invalid_char(). The problem does not lie with this function, but with the caller.
Some debugging is in order: right at the beginning of rm_invalid_char(), you should do print(repr(str)): you will not see a string, contrary to what is expected by rm_invalid_char(). You must fix this until you see the string that you were expecting, by adjusting the code before rm_invalid_char() is called.
The problem is likely due to how Python 2 and Python 3 handle strings (in Python 2, str objects are strings of bytes, while in Python 3, they are strings of characters).
I'm curious why there is something in "str" that is acting like an integer; something strange is going on with the input.
However, I suspect if you:
Change the name of your str value to something else, e.g. char_string
Right after for char in char_string coerce whatever your input is to a string
then the problem you describe will be solved (a rough sketch follows below).
You might also consider adding a random bit to the end of your generated file name so you don't have to worry about colliding with the DOS reserved names.
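A rough sketch of that renaming/coercion suggestion (the sample input and the utf-8 assumption are hypothetical, and the DOS-name and all-periods checks are omitted for brevity):
def rm_invalid_char(char_string):
    # In Python 3, iterating over bytes yields ints, which is what triggers
    # the TypeError above, so decode bytes into a real string first.
    if isinstance(char_string, bytes):
        char_string = char_string.decode('utf-8')
    final = ""
    for char in char_string:
        if char not in '<>:"/\\|?*' and ord(char) > 31:
            final += char
    return final

print(rm_invalid_char(b'my:file*name'))   # myfilename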
