Convert string literal to string or raise error, Python - python

I have a string, which may or may not contain a syntactically valid Python string literal. If it does, I want to convert it to the string it represents, otherwise I want to raise an error. Is there a better way to accomplish this than
import ast

# 'x' contains the putative string literal
s = ast.literal_eval(x)
if not isinstance(s, basestring):
    raise ValueError("not a valid string literal: " + x)
In particular, because of the origin of this string, it could potentially contain the repr of a complex object, and I don't want to waste time parsing that and then throwing it away.
Another way to put it is that I want the behavior of float or int when applied to a string, only for, well, strings.
[Note: The existing question Python convert string literals to strings recommends ast.literal_eval, but that is what I am hoping to be able to beat.]

I think that you could use a regular expression. A syntactically valid Python string is:
'' on one line containing anything except \n or ' preceded by an even number of \
"" on one line containing anything except \n or " preceded by an even number of \
""" """ containing anything except """ preceded by an even number of \
''' ''' containing anything except ''' preceded by an even number of \
Theoretically you should be able to write a regex to match one of those, and I think that should work.
It might not be any faster or better than ast.literal_eval, even with a complex object.
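As a rough illustration of the regex idea, here is a minimal sketch that covers only the two single-line forms, without prefixes such as r or b and without triple quotes. The name looks_like_simple_string_literal is made up for this example, and the pattern does not validate escape sequences (a malformed \x escape still matches), so treat it as a cheap screen rather than a full parser.

```python
import re

# Assumption: only unprefixed, single-line '...' and "..." literals are handled.
SIMPLE_LITERAL = re.compile(
    r"""
      '(?:[^'\\\n]|\\.)*'    # single-quoted body: ordinary chars or backslash pairs
    | "(?:[^"\\\n]|\\.)*"    # double-quoted body: same rule for the other quote
    """,
    re.VERBOSE,
)


def looks_like_simple_string_literal(text):
    """Return True if the whole text has the shape of a simple string literal."""
    return SIMPLE_LITERAL.fullmatch(text) is not None


print(looks_like_simple_string_literal("'abc'"))
print(looks_like_simple_string_literal("[1, 2]"))
```

Anything that passes this screen would still need to go through ast.literal_eval to get the actual string value.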
Now that I think about it, you could simply do:
if x.lstrip().startswith(("'", '"')): #Might be a string
as a pre-filter.
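Putting the pre-filter together with ast.literal_eval might look like the sketch below. It assumes Python 3 (so str replaces basestring), the helper name parse_string_literal is hypothetical, and the quote check deliberately rejects prefixed literals such as r'...' or b'...'.

```python
import ast


def parse_string_literal(text):
    """Return the string a literal represents, or raise ValueError."""
    stripped = text.strip()
    # Cheap pre-filter: skip the parser entirely for anything that cannot
    # be an unprefixed string literal (e.g. the repr of a list or dict).
    if not stripped.startswith(("'", '"')):
        raise ValueError("not a valid string literal: " + text)
    try:
        value = ast.literal_eval(stripped)
    except (SyntaxError, ValueError) as exc:
        raise ValueError("not a valid string literal: " + text) from exc
    # Input such as "'a', 'b'" starts with a quote but evaluates to a tuple.
    if not isinstance(value, str):
        raise ValueError("not a valid string literal: " + text)
    return value


print(parse_string_literal("'hello'"))
```

The final isinstance check is still needed because the pre-filter only looks at the first character.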

Related

Replace Unicode code point with actual character using regex

I have a large file where any unicode character that wasn't in UTF-8 got replaced by its code point in angle brackets (e.g. the "👍" was converted to "<U+0001F44D>"). Now I want to revert this with a regex substitution.
I've tried to accomplish this with
re.sub(r'<U\+([A-F0-9]+)>',r'\U\1', str)
but obviously this won't work because we cannot insert the group into this unicode escape.
What's the best/easiest way to do this? I found many questions trying to do the exact opposite but nothing useful to 're-encode' these code points as actual characters...
When you have the number of a character, you can call chr(number) to get the character with that code point.
Because we have a string, we need to read it as int with base 16.
Both of those together:
>>> chr(int("0001F44D", 16))
'👍'
However, now we have a small function rather than a fixed replacement string. A quick search shows that you can pass a function to re.sub.
Now we get:
re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
PS: Don't name your string str; you'll shadow the built-in str type.
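For reference, a self-contained version of the same idea, assuming Python 3 and that every placeholder holds a valid code point (chr raises ValueError above 0x10FFFF). The names CODE_POINT and restore_code_points are chosen for this sketch only.

```python
import re

# Matches placeholders such as <U+0001F44D>; the hex digits are group 1.
CODE_POINT = re.compile(r"<U\+([0-9A-F]+)>")


def restore_code_points(text):
    """Replace each <U+XXXX> placeholder with the character it names."""
    return CODE_POINT.sub(lambda match: chr(int(match.group(1), 16)), text)


print(restore_code_points("Nice <U+0001F44D>"))
```

Compiling the pattern once is optional, but it keeps the call site short when the substitution runs over many lines of a large file.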

Python: Why does %r not accurately represent the raw data on \" versus \'

I've been going through Zed's LPTHW and I have been messing around with escape characters after doing lesson 10. While fooling around with %r I came across this, and I have no idea why it's happening (I'm so new to any form of programming/coding it hurts):
test = "10'5\""
test_2 = '10\'5"'
print "%r" % test
print "%r" % test_2
When I run this, I get:
'10\'5"'
'10\'5"'
I'm confused. I had assumed that I would get output in the following:
"10'5\""
'10\'5"'
It was my understanding that %r would return the string identical to how it is written, yet it seems to convert it to test_2 by moving the \ to the left.
Am I missing something here?
Thanks.
"It was my understanding that %r would return the string identical to how it is written"
Your understanding is incorrect. Python does not "remember" how a string was written in the source code; all that matters to the interpreter is that it contains the characters:
10'5"
Printing a repr of that string will use whichever type of quotation marks Python feels is most appropriate for its contents. Since both strings contain the same characters, they are printed identically by repr (and, hence, by the %r format string).
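A short check of this behaviour, written for Python 3 (the question uses the Python 2 print statement, but repr picks quotes the same way). The variable names mirror the question; the two extra strings are illustrative.

```python
test = "10'5\""
test_2 = '10\'5"'

# Both literals produce the same five characters, so the objects are equal.
print(test == test_2)

# The string contains both quote types, so repr uses single quotes
# and escapes the inner single quote.
print(repr(test))

# With only a single quote inside, repr switches to double quotes.
print(repr("it's"))

# With only a double quote inside, repr keeps single quotes.
print(repr('say "hi"'))
```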
Python string literals or constants can start with either a single-quote (') or a double-quote ("). When one begins with a single-quote ('), it must end with a single-quote (').
The same logic applies to a string literal or constant beginning with a double-quote ("): it must end with a double-quote (").
Now the interesting thing to notice here is: if a string literal or constant begins with a double-quote ("), then a single-quote (') inside that string literal is just another character. You don't need the escape character (\) to tell the Python interpreter to keep that single-quote (') intact.
Likewise, if a string literal or constant begins with a single-quote ('), then a double-quote (") inside it is just another character, and you don't need the escape character (\) to keep that double-quote (") intact.
So in the code fragment below, you initialized your variable test with a string literal or constant delimited by double-quotes ("..."). Inside it you put two digits (10), a single-quote ('), a digit (5), and a double-quote ("), which needed escaping with the escape character.
You did the same for your variable test_2, but there you delimited the string literal or constant with single-quotes ('...'), so you needed to escape the single-quote (') after the first two digits (10).
test = "10'5\""
test_2 = '10\'5"'
print "%r" % test
print "%r" % test_2
If you print your variables using the format %s instead of the repr format %r, as follows, you will get the same string value for both variables, which is: 10'5"
print "%s" % test
print "%s" % test_2
But for the repr format %r of your two Python string variables test and test_2, the Python interpreter chose to represent the value beginning and ending with a single-quote, printing both as: '10\'5"'. This has no bearing on how you defined your string literals, whether with double-quotes ("...") or single-quotes ('...').

Explicitly make a string into raw string

I am reading a registry path from a text file. The registry path is
HKEY_LOCAL_MACHINE\Software\MYAPP\6.3
I store this registry in a variable :
REGISTRY_KEY
Then I strip the HKEY_LOCAL_MACHINE part from the string and try to read the value at the key.
if REGISTRY_KEY.split('\\')[0] == "HKEY_LOCAL_MACHINE":
    keyPath = REGISTRY_KEY.strip("HKEY_LOCAL_MACHINE\\")
    try:
        key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, keyPath)
        value = winreg.QueryValueEx(key, "InstallPath")[0]
    except IOError as err:
        print(err)
I get the following error
[WinError 2] The system cannot find the file specified
However if I do it manually like
key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,r'Software\MYAPP\6.3')
OR
key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,"Software\\MYAPP\\6.3")
it works.
So is there any way I can make the keyPath variable either be a raw string or contain double '\'?
PS: I am using Python 3.3
A raw str is a way of entering the string so you do not need to escape special characters. Another way to enter the same str is to escape the special characters (backslash being one of them). Both produce the same data, so really your question doesn't have an answer.
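A quick way to convince yourself of this, using the path from the question (the variable names are only for illustration):

```python
raw_form = r"Software\MYAPP\6.3"
escaped_form = "Software\\MYAPP\\6.3"

# Both spellings create exactly the same string object contents.
print(raw_form == escaped_form)

# Printing shows single backslashes; the doubling exists only in source code.
print(escaped_form)
```

A string read from a text file never goes through literal parsing at all, so it already contains single backslashes.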
You are also using strip incorrectly, although it does not matter for this particular string: the first character after the first \ is S, which is not among the characters passed to strip, and your key ends in a digit that is not among them either. You will still want to fix it so other keys are not mangled. You got lucky on this string.
>>> r"HKEY_LOCAL_MACHINE\Software\MYAPP\6.3".strip("HKEY_LOCAL_MACHINE\\")
'Software\\MYAPP\\6.3'
As for your real problem, there is something else about the string that is wrong. Try print(repr(keyPath)) before your call to OpenKey.
EDIT: looks like SylvainDefresne guessed correctly about a newline character on the end of the string
Your REGISTRY_KEY.strip() call is not doing what you think it's doing. It doesn't remove the string HKEY_LOCAL_MACHINE\ from the beginning of the string. Instead, it removes the characters H, K, E, etc., in any order, from both ends of the string. This is why it works when you manually put in what you expect.
As for your original question, a double backslash is an escape sequence that produces a single backslash in your string, so it is not necessary to convert keyPath to double slashes.
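One possible way to drop the hive prefix without strip's character-set behaviour, and to discard the trailing newline mentioned above, is sketched here. The names HIVE_PREFIX and split_key_path are hypothetical, and the winreg call is left out so the snippet runs on any platform.

```python
HIVE_PREFIX = "HKEY_LOCAL_MACHINE\\"


def split_key_path(registry_key):
    """Return the sub-key path that follows the HKEY_LOCAL_MACHINE prefix."""
    # strip() with no arguments removes surrounding whitespace, including
    # the newline that comes along when reading a line from a text file.
    cleaned = registry_key.strip()
    if not cleaned.startswith(HIVE_PREFIX):
        raise ValueError("unexpected registry hive: " + repr(cleaned))
    # Slice off the prefix as a whole instead of stripping characters.
    return cleaned[len(HIVE_PREFIX):]


print(split_key_path("HKEY_LOCAL_MACHINE\\Software\\MYAPP\\6.3\n"))

# For comparison, strip(chars) eats matching characters from both ends:
print("HKEY_LOCAL_MACHINE\\MYAPP".strip("HKEY_LOCAL_MACHINE\\"))
```

The second print shows why the original call is fragile: a key named MYAPP directly under the hive loses its leading M, Y and A.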

How to escape special char

I have the following code to handle Chinese characters and other special characters in a PowerPoint file, because I would like to use the content of the ppt as the filename when saving.
If the content contains certain special characters, saving throws an exception, so I use the following code to remove them.
It works fine under Python 2.7, but when I run it with Python 3.0 it gives me the following error:
if not (char in '<>:"/\|?*'):
TypeError: 'in <string>' requires string as left operand, not int
I Googled the error message but I don't understand how to resolve it. I know the code if not (char in '<>:"/\|?*'): is to convert the character to ASCII code number, right?
Is there any example to fix my problem in Python 3?
def rm_invalid_char(self,str):
    final=""
    dosnames=['CON', 'PRN', 'AUX', 'NUL', 'COM1', 'COM2', 'COM3', 'COM4', 'COM5', 'COM6', 'COM7', 'COM8', 'COM9', 'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5', 'LPT6', 'LPT7', 'LPT8', 'LPT9']
    for char in str:
        if not (char in '<>:"/\|?*'):
            if ord(char)>31:
                final+=char
    if final in dosnames:
        #oh dear...
        raise SystemError('final string is a DOS name!')
    elif final.replace('.', '')=='':
        print ('final string is all periods!')
        pass
    return final
Simple: use this
re.escape(YourStringHere)
From the docs:
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
You are passing an iterable whose first element is an integer (232) to rm_invalid_char(). The problem does not lie with this function, but with the caller.
Some debugging is in order: right at the beginning of rm_invalid_char(), you should do print(repr(str)): you will not see a string, contrary to what is expected by rm_invalid_char(). You must fix this until you see the string that you were expecting, by adjusting the code before rm_invalid_char() is called.
The problem is likely due to how Python 2 and Python 3 handle strings (in Python 2, str objects are strings of bytes, while in Python 3, they are strings of characters).
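To illustrate the difference, the sketch below shows that iterating over a bytes object in Python 3 yields integers, and one way the filter could accept either type. It assumes the incoming bytes are UTF-8 encoded, which the question does not state; remove_invalid_chars and FORBIDDEN are names invented for this example.

```python
# Iterating over bytes in Python 3 gives integers, not 1-character strings.
print(list(b"hi"))

# Characters not allowed in Windows file names.
FORBIDDEN = set('<>:"/\\|?*')


def remove_invalid_chars(name):
    """Drop forbidden and control characters from a candidate file name."""
    if isinstance(name, bytes):
        # Assumption: the caller's bytes are UTF-8; adjust if they are not.
        name = name.decode("utf-8")
    return "".join(ch for ch in name if ch not in FORBIDDEN and ord(ch) > 31)


print(remove_invalid_chars(b"a<b>:c"))
```

Decoding at the boundary keeps the rest of the function working on str only, which is the type the in test expects on its left-hand side.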
I'm curious why there is something in "str" that is acting like an integer - something strange is going on with the input.
However, I suspect that if you:
1. Change the name of your str value to something else, e.g. char_string
2. Right after for char in char_string, coerce whatever your input is to a string
then the problem you describe will be solved.
You might also consider adding a random bit to the end of your generated file name so you don't have to worry about colliding with the DOS reserved names.

Mapping Unicode to ASCII in Python

I receive strings after querying via urlopen in JSON format:
def get_clean_text(text):
    return text.translate(maketrans("!?,.;():", " ")).lower().strip()

for track in json["tracks"]:
    print track["name"].lower()
    get_clean_text(track["name"].lower())
For the string "türlich, türlich (sicher, dicker)" I then get
File "main.py", line 23, in get_clean_text
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
TypeError: character mapping must return integer, None or unicode
I want to format the string to be "türlich türlich sicher dicker".
The question is not a complete self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self contained, correct example. (That, and the fact that various other people—some of them probably smarter than me—likely ignored your question because it was ambiguous.)
Assuming you're using 2.x, and you've done a from string import * to get maketrans, and json["name"] is unicode rather than str/bytes, here's your problem:
There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).
The string.maketrans function makes old-style 8-bit translation tables. So you can't use it with unicode.translate.
You can always write your own "makeunitrans" function as a drop-in replacement, something like this:
def makeunitrans(frm, to):
    return {ord(f):ord(t) for (f,t) in zip(frm, to)}
But if you just want to map out certain characters, you could do something a bit more special purpose:
def makeunitrans(frm):
    return {ord(f):ord(' ') for f in frm}
However, from your final comment, I'm not sure translate is even what you want:
I want to format the string to be "türlich türlich sicher dicker"
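If you get this right, you're going to format the string to be "türlich  türlich  sicher  dicker " (with doubled spaces where a punctuation mark sat next to a space, and a trailing space before strip), because you're mapping all those punctuation characters to spaces, not nothing.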
With new-style translation tables you can map anything you want to None, which solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table in-line every time through if that were an issue) or using a trivial regular expression.
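A minimal sketch of the map-to-None approach, assuming Python 3, where str is Unicode and str.translate accepts a dict keyed by code point. PUNCTUATION and DELETE_TABLE are names chosen for this example; the table is built once, outside the function.

```python
PUNCTUATION = "!?,.;():"

# Mapping a code point to None tells translate() to delete that character.
DELETE_TABLE = {ord(ch): None for ch in PUNCTUATION}


def get_clean_text(text):
    """Remove punctuation, lowercase, and trim surrounding whitespace."""
    return text.translate(DELETE_TABLE).lower().strip()


print(get_clean_text("türlich, türlich (sicher, dicker)"))
```

Because the punctuation is deleted rather than replaced by spaces, the original single spaces between words are the only ones left.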
