How to escape special char - python

I have the following code to handle Chinese characters and other special characters in a PowerPoint file, because I would like to use the content of the ppt as the filename when saving.
If the content contains certain special characters, saving throws an exception, so I use the following code to strip them out.
It works fine under Python 2.7, but when I run it with Python 3.0 it gives me the following error:
if not (char in '<>:"/\|?*'):
TypeError: 'in <string>' requires string as left operand, not int
I Googled the error message but I don't understand how to resolve it. As I understand it, the line if not (char in '<>:"/\|?*'): converts the character to its ASCII code number, right?
Is there an example of how to fix this in Python 3?
def rm_invalid_char(self,str):
    final=""
    dosnames=['CON', 'PRN', 'AUX', 'NUL', 'COM1', 'COM2', 'COM3', 'COM4', 'COM5', 'COM6', 'COM7', 'COM8', 'COM9', 'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5', 'LPT6', 'LPT7', 'LPT8', 'LPT9']
    for char in str:
        if not (char in '<>:"/\|?*'):
            if ord(char)>31:
                final+=char
    if final in dosnames:
        #oh dear...
        raise SystemError('final string is a DOS name!')
    elif final.replace('.', '')=='':
        print ('final string is all periods!')
        pass
    return final

Simple: use this
re.escape(YourStringHere)
From the docs:
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
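As a rough illustration of what re.escape is for, the sketch below searches for a string as literal text even though it contains regex metacharacters. The helper name literal_search is made up for this example. Note that re.escape only prepares a string for use inside a regular expression; it does not remove characters that are invalid in filenames. Also, on recent Python 3 versions only characters that are special in a regex get a backslash, not every non-alphanumeric.

```python
import re

def literal_search(needle, haystack):
    """Return True if needle occurs in haystack as literal text.

    literal_search is a hypothetical helper for this example only.
    """
    # re.escape backslashes regex metacharacters such as '.' and '*'
    return re.search(re.escape(needle), haystack) is not None

found_literal = literal_search("a.b*c", "xx a.b*c yy")   # exact text is present
found_pattern = literal_search("a.b*c", "xx aXbbc yy")   # would match only as a regex
```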

You are passing an iterable whose first element is an integer (232) to rm_invalid_char(). The problem does not lie with this function, but with the caller.
Some debugging is in order: right at the beginning of rm_invalid_char(), you should do print(repr(str)): you will not see a string, contrary to what is expected by rm_invalid_char(). You must fix this until you see the string that you were expecting, by adjusting the code before rm_invalid_char() is called.
The problem is likely due to how Python 2 and Python 3 handle strings (in Python 2, str objects are strings of bytes, while in Python 3, they are strings of characters).
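A minimal sketch of how this can happen, assuming the caller passes a bytes object (the sample value and the latin-1 encoding are assumptions for illustration): iterating over bytes in Python 3 yields integers, which reproduces the TypeError, while decoding first yields one-character strings.

```python
INVALID = '<>:"/\\|?*'

# Assumed input: bytes rather than str, as a Python 3 caller might pass by mistake
data = "è.ppt".encode("latin-1")

first = next(iter(data))      # iterating bytes gives ints in Python 3
try:
    _ = first in INVALID      # int on the left of 'in <string>'
    error = None
except TypeError as exc:
    error = str(exc)

# Decoding to str first makes every element a one-character string
text = data.decode("latin-1")
kept = "".join(c for c in text if c not in INVALID and ord(c) > 31)
```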

I'm curious why there is something in "str" that is acting like an integer - something strange is going on with the input.
However, I suspect if you:
Change the name of your str value to something else, e.g. char_string
Right after for char in char_string coerce whatever your input is to a string
then the problem you describe will be solved.
You might also consider adding a random bit to the end of your generated file name so you don't have to worry about colliding with the DOS reserved names.
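One possible way to combine these suggestions is sketched below. The function name safe_filename, the utf-8 default, and the 8-character suffix are all assumptions made for this example, not part of the original code.

```python
import uuid

INVALID_CHARS = '<>:"/\\|?*'

def safe_filename(char_string, encoding="utf-8"):
    """Hypothetical helper: strip invalid characters and append a random suffix."""
    # Coerce the input to str so that iteration yields characters, not ints
    if isinstance(char_string, bytes):
        char_string = char_string.decode(encoding, errors="replace")
    else:
        char_string = str(char_string)
    cleaned = "".join(
        c for c in char_string if c not in INVALID_CHARS and ord(c) > 31
    )
    # A random suffix means the result can never be exactly 'CON', 'NUL', etc.
    suffix = uuid.uuid4().hex[:8]
    return "{}_{}".format(cleaned, suffix)

name_from_bytes = safe_filename(b"CON")
name_from_text = safe_filename('a<b>:c')
```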

Related

Replace Unicode code point with actual character using regex

I have a large file where any unicode character that wasn't in UTF-8 got replaced by its code point in angle brackets (e.g. the "👍" was converted to "<U+0001F44D>"). Now I want to revert this with a regex substitution.
I've tried to accomplish this with
re.sub(r'<U\+([A-F0-9]+)>',r'\U\1', str)
but obviously this won't work because we cannot insert the group into this unicode escape.
What's the best/easiest way to do this? I found many questions trying to do the exact opposite but nothing useful to 're-encode' these code points as actual characters...
When you have the number (code point) of a character, you can call chr(number) to get the character for that number.
Because we have a string, we need to read it as int with base 16.
Both of those together:
>>> chr(int("0001F44D", 16))
'👍'
However, now we have a small function, not a string to simply replace! Quick search returned that you can pass a function to re.sub
Now we get:
re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
PS Don't name your string just str - you'll shadow the built-in str type.
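Put together as a self-contained sketch (the names CODE_POINT and restore_code_points, and the sample text, are only for this example):

```python
import re

# Matches placeholders such as <U+0001F44D> and captures the hex digits
CODE_POINT = re.compile(r'<U\+([A-F0-9]+)>')

def restore_code_points(text):
    """Replace each <U+XXXX> placeholder with the character it stands for."""
    # re.sub calls the function once per match and uses its return value
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), text)

restored = restore_code_points("Nice <U+0001F44D> and <U+00E9>")
```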

Python re.sub() and unicode

I have what feels to me like a really basic question, but for the life of me I can't figure it out.
I have a whole bunch of text I'm going through and converting to the International Phonetic Alphabet. I'm using the re.sub() method a lot, and in many cases this means replacing a character of string type with a character of unicode type. For example:
for row in responsesIPA:
re.sub("3", u"\u0259", row)
I'm getting TypeError: expected string or buffer. The docs on Python re say that the type for the replacement has to match the type for what you're searching, so maybe that's the problem? I tried putting str() around u"\u0259", but I'm still getting the type error. Is there a way for me to do this replacement?
The error is telling you that row isn't a valid string or buffer (str, bytes, unicode, or anything else readable as one). Double-check what is stored in row by adding a print(row) just before the re.sub call.
To confirm that this is the case, note that the following works:
import re
print(re.sub("3", u"\u0259", "12345"))
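A sketch of one likely cause, assuming each row is itself a list of strings (for example, rows read from a CSV file); the sample data and the helper name to_ipa are made up for illustration. Note also that re.sub returns a new string, so its result has to be stored somewhere.

```python
import re

# Assumed shape of the data: rows may be strings or lists of strings
responses = [["3bout", "ag3in"], "s3cond"]

def to_ipa(row):
    """Hypothetical helper: apply the substitution to a string or to each cell."""
    if isinstance(row, str):
        return re.sub("3", "\u0259", row)   # re.sub returns the new string
    return [to_ipa(cell) for cell in row]

try:
    re.sub("3", "\u0259", responses[0])     # passing a list reproduces the error
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

converted = [to_ipa(row) for row in responses]
```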

Ignoring escape sequences

I'm using Python 2.6 and I have a variable which contains a string (I have sent it through sockets and now I want to do something with it).
The problem is that I get the following error:
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
After I looked it up I found out that the problem is probably that the string I'm sending contains '\0', but it isn't a string literal that I can just edit with a double backslash or by adding an 'r' beforehand. So is there a way to tell Python to ignore the escape sequences and treat the whole thing as a string?
(For example - I don't want python to treat the sequence \0 as a null char, but rather I want it to be treated as a backslash char followed by a zero char)
Judging from the comments, this looks like incorrect use of the PIL/Pillow API, namely the Image.open function, which requires a file name instead of file data.
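A rough sketch of the difference between a file name and file data, written for Python 3 rather than the Python 2.6 in the question. The payload bytes are invented for this example, and Pillow itself is not called: passing raw data where a path is expected fails on the NUL bytes, while wrapping the data in io.BytesIO gives a file-like object, which Image.open also accepts.

```python
import io

# Invented payload: raw file data received over a socket, containing NUL bytes
payload = b"\x89PNG\x00\x00rest-of-data"

# Treating the data as a file name fails because paths cannot contain NUL
try:
    open(payload.decode("latin-1"))
    outcome = "opened"
except (ValueError, TypeError, OSError) as exc:
    outcome = type(exc).__name__

# Wrapping the data in a file-like object keeps it as data
stream = io.BytesIO(payload)
header = stream.read(4)
# With Pillow installed, the stream could be passed on as Image.open(stream)
```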

Why is 3 behaving differently from int(3)? [duplicate]

This question already has answers here:
Why doesn't 2.__add__(3) work in Python?
(2 answers)
Closed 8 years ago.
I was playing with the Python interpreter (Python 3.2.3) and tried the following:
>>> dir(1)
This gave me all the attributes and methods of the int object. Next I tried:
>>> 1.__class__
However this threw an exception:
File "<stdin>", line 1
1.__class__
^
SyntaxError: invalid syntax
When I tried out the same with a float I got what I expected:
>>> 2.0.__class__
<class 'float'>
Why do int and float literals behave differently?
It's probably a consequence of the parsing algorithm used. A simple mental model is that the tokenizer attempts to match all the token patterns there are, and recognizes the longest match it finds. On a lower level, the tokenizer works character by character, and makes a decision based only on the current state and input character – there shouldn't be any backtracking or re-reading of input.
After joining patterns with common prefixes – in this case, the pattern for int literals and the integral part of the pattern of float literals – what happens in the tokenizer is that it:
Reads the 1, and enters the state that indicates "reading either a float or an int literal"
Reads the ., and enters the state "reading a float literal"
Reads the _, which cannot be part of a float literal. The tokenizer emits 1. as a float literal token.
Carries on parsing starting with the _, and eventually emits __class__ as an identifier token.
Aside: This tokenizing approach is also the reason why common languages have the syntax restrictions they have. E.g. identifiers contain letters, digits, and underscores, but cannot start with a digit. If that was allowed, 123abc could be intended as either an identifier, or the integer 123 followed by the identifier abc. A lex-like tokenizer would recognize this as the former since it leads to the longest single token, but nobody likes having to keep details like this in their head when trying to read code. Or when trying to write and debug the tokenizer for that matter.
The parser then tries to process the token stream:
<FloatLiteral: '1.'> <Identifier: '__class__'>
In Python, a literal directly followed by an identifier – without an operator between the tokens – makes no sense, so the parser bails. This also means that the reason why Python would complain about 123abc being invalid syntax isn't the tokenizer error "the character a isn't valid in an integer literal", but the parser error "the identifier abc cannot directly follow the integer literal 123"
The reason why the tokenizer can't recognize the 1 as an int literal is that the character that makes it leave the float-or-int state determines what it just read. If it's ., it was the start of a float literal, which might continue afterwards. If it's something else, it was a complete int literal token.
It's not possible for the tokenizer to "go back" and re-read the previous input as something else. In fact, the tokenizer is at too low a level to care about what an "attribute access" is and handle such ambiguities.
Now, your second example is valid because the tokenizer knows a float literal can only have one . in it. More precisely: the first . makes it transition from the float-or-int state to the float state. In this state, it only expects digits (or an E for scientific/engineering notation, a j for complex numbers…) to continue the float literal. The first character that's not a digit etc. (i.e. the second .) is definitely no longer part of the float literal and the tokenizer can emit the finished token. The token stream for your second example will thus be:
<FloatLiteral: '2.0'> <Operator: '.'> <Identifier: '__class__'>
Which, of course, the parser then recognizes as valid Python. Now we also know enough why the suggested workarounds help. In Python, separating tokens with whitespace is optional – unlike, say, in Lisp. Conversely, whitespace does separate tokens. (That is, no tokens except string literals may contain whitespace, it's merely skipped between tokens.) So the code:
1 .__class__
is always tokenized as
<IntLiteral: '1'> <Operator: '.'> <Identifier: '__class__'>
And since a closing parenthesis cannot appear in an int literal, this:
(1).__class__
gets read as this:
<Operator: '('> <IntLiteral: '1'> <Operator: ')'> <Operator: '.'> <Identifier: '__class__'>
The above implies that, amusingly, the following is also valid:
1..__class__ # => <type 'float'>
The decimal part of a float literal is optional, and the second . read will make the preceding input be recognized as one.
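The token streams described above can be inspected with the standard tokenize module. This is only a sketch: the helper name tokens is made up, and the module reports generic type names (NUMBER, OP, NAME) rather than the FloatLiteral/IntLiteral labels used in this answer. The invalid input 1.__class__ is left out on purpose, since how the tokenizer reports it differs between Python versions.

```python
import io
import tokenize

def tokens(source):
    """Return (type name, text) pairs for number, operator, and name tokens."""
    wanted = {tokenize.NUMBER, tokenize.OP, tokenize.NAME}
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type in wanted
    ]

spaced = tokens("1 .__class__")      # whitespace ends the int literal
wrapped = tokens("(1).__class__")    # ')' ends the int literal
double_dot = tokens("1..__class__")  # '1.' is a float, the second '.' an operator
```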
It is a tokenization issue... the . is parsed as the beginning of the fractional part of a floating point number.
You can use
(1).__class__
to avoid the problem
If there's a . after a number, Python thinks you're creating a float. When it then encounters something that isn't a digit, it throws an error.
In a float, however, Python doesn't expect another . to be part of the value, so the second . is treated as attribute access. Hence the result: it works. :)
How do we get the attributes, then?
You can easily wrap it in parentheses. For example, see this console session:
>>> (1).__class__
<type 'int'>
Now, Python knows that you're not trying to make a float, but to refer to the int itself.
Bonus: putting a blank space after the number works as well.
>>> 1 .__class__
<type 'int'>
Also, if you only want to get the __class__, type(1) will do it for you.
Hope this helps!
Or you can even do this:
>>> getattr(1 , '__class__')
<type 'int'>
You need parentheses to surround the number:
>>> (1).__class__
<type 'int'>
>>>
Otherwise, Python sees the . after the number and it tries to interpret the whole thing as a float.
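The workarounds from the answers above can be checked in one place with a short sketch like the following (the helper name is_syntax_error is made up for this example; compile and eval are used only so that the invalid form can be tested without breaking the script itself):

```python
def is_syntax_error(source):
    """Return True if source does not compile as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return True
    return False

bare_fails = is_syntax_error("1.__class__")      # the form from the question

# Each workaround reaches the attribute of the int (or float) object
via_parens = eval("(1).__class__")
via_space = eval("1 .__class__")
via_double_dot = eval("1..__class__")            # '1.' is a float literal
via_getattr = getattr(1, "__class__")
via_type = type(1)
```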
