In Python 2, when dealing with regular expressions we write r'expression'. Do we still need to prepend "r" in Python 3, given that Python 3 uses Unicode by default?
Yes. Backslash escape sequences are still present in Python 3 strings, thus raw strings prefixed with r make a difference as shown in this simple example:
>>> s = 'hello\n'
>>> raw = r'hello\n'
>>> s
'hello\n'
>>> raw
'hello\\n'
>>> print(s)
hello
>>> print(raw)
hello\n
Raw strings are still useful for writing characters like \ without escaping them. This is generally useful in regular expressions, Windows paths, and similar cases.
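As a small illustration of why the prefix still matters for regular expressions in Python 3, here is a minimal sketch. The sample text and pattern are made up for this example. In an ordinary literal `\b` is a backspace character, while in a raw literal it reaches the `re` module intact and means "word boundary".

```python
import re

text = "cat concatenate cat."

# In a normal literal, "\b" is the backspace character (U+0008),
# so this pattern looks for backspace + "cat" + backspace.
plain_matches = re.findall("\bcat\b", text)

# In a raw literal, the backslash reaches the regex engine intact,
# where \b means "word boundary".
raw_matches = re.findall(r"\bcat\b", text)

print(plain_matches)  # []
print(raw_matches)    # ['cat', 'cat']
```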
I know that we can use the r (raw string) and u (unicode) flags before a string to get what we actually want. However, I am wondering how these work with strings. I tried this in IDLE:
a = r"This is raw string and \n will come as is"
print a
# "This is raw string and \n will come as is"
help(r)
# ..... Will get NameError
help(r"")
# Prints empty
How does Python know that it should treat the r or u in front of a string as a flag? Or as part of a string literal, to be specific? If I want to learn more about string literals and their limitations, where can I learn about them?
The u and r prefixes are part of the string literal, as defined in the Python grammar. When the Python interpreter parses a textual command in order to understand what the command does, it reads r"foo" as a single string literal with the value "foo". Likewise, it reads b"foo" as a single bytes literal with an equivalent value.
For more information, you can refer to the literals section in Python's documentation. Python also has an ast module that allows you to explore the way Python parses commands.
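As a rough sketch of that kind of exploration (assuming Python 3.8 or later, where literals parse to `ast.Constant` nodes), the following parses two small expressions and inspects the nodes the parser produced. The variable names are only for illustration.

```python
import ast

# Parse two expressions and look at the literal node the parser produced.
# The outer r'...' is only there so this source text contains a real backslash.
raw_node = ast.parse(r'r"foo\n"', mode="eval").body
bytes_node = ast.parse('b"foo"', mode="eval").body

# The prefix itself is gone after parsing; only the resulting value remains.
print(type(raw_node).__name__, repr(raw_node.value))
print(type(bytes_node).__name__, repr(bytes_node.value))
```

The raw literal ends up as a five-character str (the backslash and the n stay separate), and the b literal ends up as a bytes object.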
Hi, I am using Python 3 and I want to convert a UTF-8 value to a string (decode).
Here is my code so far:
s1 = '\u54c7'
print(chr(ord(s1))) # print 哇
This works if the input is a single character, but how do I convert a whole string?
s2 = '\u300c\u54c7\u54c8\u54c8!!\u300d'
print(chr(ord(s2))) # Error! I want print "「哇哈哈!!」"
Thanks
Edit:
Hi all, I have updated the question.
Suppose the string I receive is "s3", as below, and I use replace to change its format, but printing "s3" does not show "哇哈哈!!".
If I initialize s4 with '\u54c7\u54c8\u54c8!!' and print s4, it looks correct, so how can I fix s3?
s3 = '&#x54c7;&#x54c8;&#x54c8;!!'
s3 = s3.replace("&#x","\\u").replace(";","") # s3 = \u54c7\u54c8\u54c8!!
s4 = '\u54c7\u54c8\u54c8!!'
print(s3) # \u54c7\u54c8\u54c8!!
print(s4) # 哇哈哈!!
If you are in fact using python3, you don't need to do anything. You can just print the string. Also you can just copy and paste the literals into a python string and it will work.
'「哇哈哈!!」' == '\u300c\u54c7\u54c8\u54c8!!\u300d'
In regards to the updated question, the difference is escaping. If you type a string literal, some sequences of characters are changed to characters that can't be easily typed or be displayed. The string is not stored as the series of characters you see but as a list of values created from characters like 'a', ';', and '\300'. Note that all of those have a len of 1 because they are all one character.
To actually convert those values you could use eval, the answer provided by Iron Fist, or find a library that converts the string you have. I would suggest the last since the rules surrounding such things can be complex and rarely are covered by simple replacements. I don't recognize the particular pattern of escaping, so I cannot recommend anything, sorry.
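One possible way to avoid eval is sketched below. It assumes the input contains only literal backslash-u sequences followed by exactly four hex digits (so it does not handle surrogate pairs or other escape forms), and the helper name `decode_u_escapes` is hypothetical.

```python
import re

def decode_u_escapes(text):
    """Replace each literal backslash-u-XXXX sequence with its character."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# Raw literal: the backslashes here are real characters in the string,
# just like in a string built at runtime with str.replace.
s3 = r"\u54c7\u54c8\u54c8!!"
print(len(s3))               # 20
print(decode_u_escapes(s3))  # 哇哈哈!!
```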
Regarding your s3 string, this looks to me like HTML entities or text in HTML format, so use the html.parser module, this way:
>>> s3 = '&#x54c7;&#x54c8;&#x54c8;!!'
>>> from html.parser import HTMLParser
>>>
>>> p = HTMLParser()
>>>
>>> p.unescape(s3)
'哇哈哈!!'
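Or, more simply, with html.unescape (HTMLParser.unescape was deprecated in Python 3.4 and removed in Python 3.9, so this is the form to use on current versions):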
>>> import html
>>>
>>> html.unescape(s3)
'哇哈哈!!'
Quoting from Python docs on html.unescape:
html.unescape(s)
Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.
...
Apparently the ur"" syntax has been disabled in Python 3. However, I need it! "Why?", you may ask. Well, I need the u prefix because it is a unicode string and my code needs to work on Python 2. As for the r prefix, maybe it's not essential, but the markup format I'm using requires a lot of backslashes and it would help avoid mistakes.
Here is an example that does what I want in Python 2 but is illegal in Python 3:
tamil_letter_ma = u"\u0bae"
marked_text = ur"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
After coming across this problem, I found http://bugs.python.org/issue15096 and noticed this quote:
It's easy to overcome the limitation.
Would anyone care to offer an idea about how?
Related: What exactly do "u" and "r" string flags do in Python, and what are raw string literals?
Why don't you just use a raw string literal (r'....')? You don't need to specify u, because in Python 3 strings are Unicode strings.
>>> tamil_letter_ma = "\u0bae"
>>> marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
>>> marked_text
'\\aம\\bthe Tamil\\cletter\\dMa\\e'
To make it also work in Python 2.x, add the following Future import statement at the very beginning of your source code, so that all the string literals in the source code become unicode.
from __future__ import unicode_literals
The preferred way is to drop the u'' prefix and use from __future__ import unicode_literals, as @falsetru suggested. But in your specific case, you could exploit the fact that in Python 2 "ascii-only string" % unicode returns Unicode:
>>> tamil_letter_ma = u"\u0bae"
>>> marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
>>> marked_text
u'\\a\u0bae\\bthe Tamil\\cletter\\dMa\\e'
Unicode strings are the default in Python 3.x, so using r alone will produce the same as ur in Python 2.
What Unicode code-point conversion does the stringprefix "r" (or "R") actually perform on string literals in Python 3 (literals/files parsed as UTF-8)?
I am using Python 3.4 on Windows 7.
I want to parse this "evil" path on Windows:
>>> a = 'c:\a\b\f\v'
>>> a
'c:\x07\x08\x0c\x0b'
>>> a.encode(encoding='utf-8')
b'c:\x07\x08\x0c\x0b'
With the prefix "r", I get:
>>> b = r'c:\a\b\f\v'
>>> b
'c:\\a\\b\\f\\v'
My question: How do I mimic (exactly) the "raw" code-point mapping/conversion on a Unicode string object in memory (not a string literal)? I could use str.translate and str.maketrans, but what exact mapping are we talking about then?
Context: Generally, I want to be able to support all kinds of weird directory names on Windows (and other platforms) that are handed to my application as strings via command-line parameters. How can I do that?
What Unicode code-point conversion does the string prefix "r" (or "R") actually perform on string literals in Python 3 (literals/files parsed as UTF-8)?
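Python 3 strings are sequences of Unicode code points (and source files are read as UTF-8 by default); the r prefix performs no conversion at all.
Without the r prefix, backslash escape sequences in the literal are converted to the characters they stand for. See the table of escape sequences in the Python language reference: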
\a gives the code for a bell (a - alarm) 0x07
\b gives the code for a backspace 0x08
\f is a form feed 0x0c
\v is a vertical tab 0x0b
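The mapping listed above can be checked directly. This is only a small sketch; the `escapes` dictionary and variable names are made up for the illustration.

```python
# Each escape in a normal literal becomes a single control character.
escapes = {"\\a": "\a", "\\b": "\b", "\\f": "\f", "\\v": "\v"}
for written, char in escapes.items():
    print(written, hex(ord(char)))

normal = 'c:\a\b\f\v'   # escapes are processed by the parser
raw = r'c:\a\b\f\v'     # backslashes are kept as ordinary characters

print(len(normal))  # 6
print(len(raw))     # 10
```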
So, if you have (what you call) weird Windows path names, then always use raw strings, or use a / for a directory separator instead. However you only need to worry about those that are hard-coded because they are parsed by python, those entered by the user should be fine.
Edit:
if you do this:
>>> import os.path
>>> os.path.normpath('C:\bash')
'C:\x08ash'
>>> var = input("Enter a filename: ")
Enter a filename: C:\bash
>>> print(var)
C:\bash
>>> os.path.normpath(var)
'C:\\bash'
Double back-slashing has the same effect as using raw strings.
>>> 'c:\a\b\f\v'
'c:\x07\x08\x0c\x0b'
When you type a string literal like this in Python source code, you need to either double the backslashes or use r for a raw string.
>>> 'c:\\a\\b\\f\\v'
'c:\\a\\b\\f\\v'
>>> r'c:\a\b\f\v'
'c:\\a\\b\\f\\v'
>>> print('c:\\a\\b\\f\\v')
c:\a\b\f\v
>>> print(r'c:\a\b\f\v')
c:\a\b\f\v
This has nothing to do with Unicode. It's the Python interpreter which is evaluating backslash escape sequences in string literals.
This is only the case with string literals in your source code. If you read a string from the command line or from a file you don't have to worry about any of this. Python does not interpret backslashes in these cases.
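A minimal sketch of the command-line case, using argparse with a simulated argument list (the argument name `path` is an assumption for this example). The raw literal is needed only because the list is typed into source code here; a real argument arriving from the shell would already contain the backslashes as ordinary characters.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("path")

# Simulate the argument list the operating system would hand to the program.
args = parser.parse_args([r"c:\a\b\f\v"])

print(args.path)          # c:\a\b\f\v
print("\a" in args.path)  # False
```

No escape processing happens at runtime, so the string can be passed on to functions such as os.path.normpath unchanged.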
I'm trying to store a string and then tokenize it with nltk in Python. But I can't understand why, after tokenizing it (which creates a list), I can't see the strings in the list.
Can anyone help me, please?
Here is the code:
#a="Γεια σου"
#b=nltk.word_tokenize(a)
#b
['\xc3\xe5\xe9\xe1', '\xf3\xef\xf5']
I just want to be able to see the content of the list in readable form.
Thanks in advance.
You are using Python 2, where unprefixed quotes denote a byte string as opposed to a character string (if you're not sure about the difference, read this). Either switch to Python 3, where this has been fixed, or prefix all character strings with u and print the strings (as opposed to showing their repr, which differs in Python 2.x):
>>> import nltk
>>> a = u'Γεια σου'
>>> b = nltk.word_tokenize(a)
>>> print(u'\n'.join(b))
Γεια
σου
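For comparison, here is a rough sketch of the Python 3 behaviour. To keep it self-contained, `str.split` stands in for `nltk.word_tokenize`, which is an assumption made only for this illustration, and the output shown assumes a terminal that can display Greek.

```python
a = 'Γεια σου'
b = a.split()  # stand-in for nltk.word_tokenize(a)

# In Python 3 the repr of a list of str shows printable non-ASCII
# characters as they are, not as escape sequences.
print(b)             # ['Γεια', 'σου']
print('\n'.join(b))
```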
You can see the strings. The characters are represented by escape sequences because of your terminal encoding settings. Configure your terminal to accept input, and present output, in UTF-8.