Python repr function problem

I'm parsing some text in Python, and for that it helps to apply repr() to each string before I parse it. After parsing, though, I need to convert some of the parsed substrings back to their original form so I can print them, and I can't work out how. I thought str() would turn the string back into its more human-readable form, but when I apply str() to a substring, nothing changes.
As I said, I need to print the string in human-readable form, without escape sequences like \n, \t, etc. showing up literally.
So my question is: how do I convert the repr() output back into the human-readable string?
Thanks for every reply.

str() has no effect on objects that are already strings. To undo a repr() you need to evaluate the result, which eval() does where possible. Prefer ast.literal_eval(), though: it only accepts Python literals, so it won't execute arbitrary code.
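A minimal sketch of that round trip, assuming Python 3 and a made-up sample string (the variable names are only illustrative):

```python
import ast

# Hypothetical sample text containing a newline and a tab.
original = "first line\n\tsecond line, indented"

escaped = repr(original)              # quoted, with \n and \t shown as escape sequences
unchanged = str(escaped)              # str() on a str gives back the same text
restored = ast.literal_eval(escaped)  # parses the literal back into the original string

print(escaped)   # escape sequences are visible
print(restored)  # real newline and tab are printed
```

ast.literal_eval() raises ValueError or SyntaxError if the text is not a valid Python literal, so it only works on strings that really came from repr().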

Related

How to modify unicode code as it is a string

I have a list of partial Unicode codes for cuneiform characters.
For example, I have 12220, which Python couldn't render as 𒈠, the character I wanted. Then I realized that putting \U000 in front of these partial codes gives the result I want. The problem is that I can't build the escape sequence programmatically.
"\U000{}".format(12220) doesn't work; apparently you can't complete a Unicode escape by adding a string to it. I don't want to merge 375 characters by hand. Can anyone help me with this?
Use this:
print(chr(int("12220", 16)))
The chr function returns the character for an integer code point, and the second parameter of int is the base in which the string is interpreted (16 here, because the codes are hexadecimal).
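Applied to a whole list, the same idea might look like the sketch below. It assumes Python 3, and the list of codes is a made-up stand-in for the 375 partial codes in the question:

```python
# Hypothetical sample of hexadecimal code points stored as strings.
partial_codes = ["12220", "12000", "1202D"]

# int(code, 16) parses the hex digits; chr() turns the number into a character.
signs = [chr(int(code, 16)) for code in partial_codes]

print(signs)
```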

Python re.sub() and unicode

I have what feels to me like a really basic question, but for the life of me I can't figure it out.
I have a whole bunch of text I'm going through and converting to the International Phonetic Alphabet. I'm using the re.sub() method a lot, and in many cases this means replacing a character of string type with a character of unicode type. For example:
for row in responsesIPA:
    re.sub("3", u"\u0259", row)
I'm getting TypeError: expected string or buffer. The docs on Python re say that the type for the replacement has to match the type for what you're searching, so maybe that's the problem? I tried putting str() around u"\u0259", but I'm still getting the type error. Is there a way for me to do this replacement?
The error is telling you that row isn't a valid string or buffer (str, bytes, unicode, or anything else that can be read as one). Double-check what is actually stored in row by adding a print(row) just before the re.sub() call.
To show that the types of the pattern and the replacement are not the problem, this works:
import re
print(re.sub("3", u"\u0259", "12345"))
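One way to track down the offending rows is to check the type before substituting. The sketch below assumes Python 3 and uses invented sample data in place of responsesIPA; it also keeps the return value, since re.sub() returns a new string rather than changing row in place:

```python
import re

# Hypothetical stand-in for responsesIPA; the None entry mimics a bad row.
responses_ipa = ["b3d", None, "w3rd"]

converted = []
for row in responses_ipa:
    if not isinstance(row, str):
        # This is the kind of value that triggers the TypeError in re.sub().
        print("skipping non-string row:", repr(row))
        continue
    converted.append(re.sub("3", "\u0259", row))

print(converted)
```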

Load JSON file in Python without the 'u in the key

I was doing some work in Python with graphs and wanted to save some structures to files so I could load them quickly when I resumed work. One of those was a dictionary, which I saved in JSON format using json.dump.
When I load it back with json.load, the keys have changed from "1" to u'1'. Why is that? What does it mean? How can I change it? I use the keys later to build some lists, which I then use with the original graph, whose nodes are the keys (in integer form), and this causes problems in comparisons...
The u prefix signifies a Unicode string. In Python 2.x, you can convert it to a regular string with str(), provided it contains only ASCII characters. That shouldn't really be necessary, though; u'1' == '1' because Python does the conversion for you before comparing.
The u'' or u"" just means that this is a unicode string, which in general should not be a problem unless you need a byte string. I would expect that your original data was already unicode anyway, so it should not be a problem.
It is a unicode string. You can treat it as a normal Python string in most cases. If you really want to convert it to a normal string, use str(). If you need to convert it to a bytes type, use object.encode(encoding), where encoding is the name of the encoding to produce, usually 'utf-8'.
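If the comparisons fail because the graph's nodes are integers, the cause is probably not the u prefix: JSON object keys are always strings, so integer keys come back as text. A minimal sketch, assuming Python 3 and a made-up adjacency dictionary, of converting the keys back after loading:

```python
import json

# Hypothetical graph with integer node ids as keys.
graph = {1: [2, 3], 2: [1], 3: [1]}

text = json.dumps(graph)    # integer keys are written as "1", "2", "3"
loaded = json.loads(text)   # keys come back as strings

# Convert the keys back to integers so they compare equal to the node ids.
restored = {int(key): value for key, value in loaded.items()}

print(list(loaded.keys()))
print(restored == graph)
```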

Python .split() without 'u

In Python, if I have a string like:
a = " Hello - to - everybody"
And I do
a.split('-')
then I get
[u'Hello', u'to', u'everybody']
This is just an example.
How can I get a simple list without that annoying u'??
The u means that it's a unicode string - your original string must also have been a unicode string. Generally it's a good idea to keep strings Unicode as trying to convert to normal strings could potentially fail due to characters with no equivalent.
The u is purely used to let you know it's a unicode string in the representation - it will not affect the string itself.
In general, unicode strings work exactly as normal strings, so there should be no issue with leaving them as unicode strings.
In Python 3.x, unicode strings are the default and their representation has no u prefix; instead, bytes (the equivalent of the old strings) are shown with a b prefix.
If you really, really need to convert to a normal string (rarely the case, but potentially an issue if you are using an extension library that doesn't support unicode strings, for example), take a look at unicode.encode(), and at str.decode() for the reverse direction. You can do this either before the split, or after the split using a list comprehension.
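The "after the split" variant could be sketched as follows. This assumes Python 3, where encoding produces bytes objects, and picks UTF-8 as the encoding; both are assumptions rather than anything stated in the question:

```python
text = "Hello - to - everybody"

# Split first, trimming the spaces around each piece.
parts = [piece.strip() for piece in text.split("-")]

# Encode each piece only if a byte string is really required.
encoded = [piece.encode("utf-8") for piece in parts]

print(parts)
print(encoded)
```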
I had the opposite problem. The str '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀' needed to be split on the Unicode character. I made a mistake and wrote split('\u'), which led to a Unicode syntax error.
I should have written split('\u3000').
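For illustration, a short Python 3 sketch of that split; the escape has to be complete (four hex digits after \u) for the string literal to be valid:

```python
title = "第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀"

# U+3000 is the ideographic space; splitting on it separates the two parts.
parts = title.split("\u3000")

print(parts)
```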

Mapping Unicode to ASCII in Python

I receive strings after querying via urlopen in JSON format:
def get_clean_text(text):
    return text.translate(maketrans("!?,.;():", " ")).lower().strip()
for track in json["tracks"]:
    print track["name"].lower()
    get_clean_text(track["name"].lower())
For the string "türlich, türlich (sicher, dicker)" I then get
File "main.py", line 23, in get_clean_text
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
TypeError: character mapping must return integer, None or unicode
I want to format the string to be "türlich türlich sicher dicker".
The question is not a complete self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self contained, correct example. (That, and the fact that various other people—some of them probably smarter than me—likely ignored your question because it was ambiguous.)
Assuming you're using 2.x, you've done a from string import * to get maketrans, and track["name"] is unicode rather than str/bytes, here's your problem:
There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).
The string.maketrans function makes old-style 8-bit translation tables. So you can't use it with unicode.translate.
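Python 3 still has both kinds of table, which makes the difference easy to see. The sketch below is only an illustration in Python 3 (the answer itself assumes 2.x): bytes.maketrans builds the 256-entry table, while str.maketrans builds the sparse dict.

```python
# Old-style table: one output byte for each of the 256 possible input bytes.
old_style = bytes.maketrans(b"!?", b"  ")

# New-style table: a dict from code point to code point, listing only the changes.
new_style = str.maketrans("!?", "  ")

print(len(old_style))
print(new_style)
```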
You can always write your own "makeunitrans" function as a drop-in replacement, something like this:
def makeunitrans(frm, to):
    return {ord(f):ord(t) for (f,t) in zip(frm, to)}
But if you just want to map out certain characters, you could do something a bit more special purpose:
def makeunitrans(frm):
    return {ord(f):ord(' ') for f in frm}
However, from your final comment, I'm not sure translate is even what you want:
I want to format the string to be "türlich türlich sicher dicker"
If you get this right, you're going to format the string to be "türlich  türlich  sicher  dicker ", with doubled spaces and a trailing one, because you're mapping all those punctuation characters to spaces, not to nothing.
With new-style translation tables you can map anything you want to None, which solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table in-line every time through if that were an issue) or using a trivial regular expression.
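A minimal sketch of the map-to-None approach, assuming Python 3 (where every str is Unicode). get_clean_text is the function name from the question; PUNCTUATION and DELETE_TABLE are hypothetical names introduced here. The table is built once, outside the function, rather than on every call:

```python
PUNCTUATION = "!?,.;():"

# Sparse table: mapping a code point to None deletes that character.
DELETE_TABLE = {ord(ch): None for ch in PUNCTUATION}

def get_clean_text(text):
    """Drop the punctuation, lowercase, and trim surrounding whitespace."""
    return text.translate(DELETE_TABLE).lower().strip()

print(get_clean_text("türlich, türlich (sicher, dicker)"))
```

Because the punctuation is removed instead of replaced, the single spaces already present in the input are all that remain between the words.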
