Raw unicode literal that is valid in Python 2 and Python 3?

Apparently the ur"" syntax has been disabled in Python 3. However, I need it! "Why?", you may ask. Well, I need the u prefix because it is a unicode string and my code needs to work on Python 2. As for the r prefix, maybe it's not essential, but the markup format I'm using requires a lot of backslashes and it would help avoid mistakes.
Here is an example that does what I want in Python 2 but is illegal in Python 3:
tamil_letter_ma = u"\u0bae"
marked_text = ur"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
After coming across this problem, I found http://bugs.python.org/issue15096 and noticed this quote:
It's easy to overcome the limitation.
Would anyone care to offer an idea about how?
Related: What exactly do "u" and "r" string flags do in Python, and what are raw string literals?

Why not just use a raw string literal (r'....')? You don't need to specify u, because in Python 3 all strings are Unicode strings.
>>> tamil_letter_ma = "\u0bae"
>>> marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
>>> marked_text
'\\aம\\bthe Tamil\\cletter\\dMa\\e'
To make it also work in Python 2.x, add the following Future import statement at the very beginning of your source code, so that all the string literals in the source code become unicode.
from __future__ import unicode_literals

The preferred way is to drop the u'' prefix and use from __future__ import unicode_literals, as @falsetru suggested. But in your specific case, you could abuse the fact that, in Python 2, "ascii-only string" % unicode returns Unicode:
>>> tamil_letter_ma = u"\u0bae"
>>> marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
>>> marked_text
u'\\a\u0bae\\bthe Tamil\\cletter\\dMa\\e'

Unicode strings are the default in Python 3.x, so using r alone will produce the same as ur in Python 2.
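A quick check of that claim, as a minimal sketch for Python 3. The variable names come from the question; the escaped variable is only added here for comparison. On Python 2 the same literals would additionally need from __future__ import unicode_literals at the very top of the module.

```python
# Python 3: every str literal is Unicode, so the r prefix alone is enough.
tamil_letter_ma = "\u0bae"  # non-raw literal: the escape becomes one code point
marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma

# The same text written without r needs every backslash doubled.
escaped = "\\a%s\\bthe Tamil\\cletter\\dMa\\e" % tamil_letter_ma

print(marked_text == escaped)      # the two spellings build the same string
print(len(tamil_letter_ma) == 1)   # \u0bae was interpreted, not kept raw
```

Note that the Tamil letter itself has to come from a non-raw literal (or be typed directly), because \u0bae inside r"..." stays as six separate characters in Python 3.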

Related

Converting Address List to Hash 160 [duplicate]

What's the correct way to convert bytes to a hex string in Python 3?
I see claims of a bytes.hex method and bytes.decode codecs, and have tried other possible functions of least astonishment, to no avail. I just want my bytes as hex!
Since Python 3.5 this is finally no longer awkward:
>>> b'\xde\xad\xbe\xef'.hex()
'deadbeef'
and reverse:
>>> bytes.fromhex('deadbeef')
b'\xde\xad\xbe\xef'
This also works with the mutable bytearray type.
Reference: https://docs.python.org/3/library/stdtypes.html#bytes.hex
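For illustration, here is a small round trip through both types. It is only a sketch for Python 3.5 or later, and the variable names are made up for this example.

```python
data = b'\xde\xad\xbe\xef'

# bytes -> hex str -> bytes
hex_text = data.hex()
round_trip = bytes.fromhex(hex_text)

# The same two methods exist on the mutable bytearray type.
mutable = bytearray(data)
mutable_hex = mutable.hex()
restored = bytearray.fromhex(mutable_hex)

print(hex_text)             # deadbeef
print(round_trip == data)   # True
print(restored == mutable)  # True
```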
Use the binascii module:
>>> import binascii
>>> binascii.hexlify('foo'.encode('utf8'))
b'666f6f'
>>> binascii.unhexlify(_).decode('utf8')
'foo'
See this answer:
Python 3.1.1 string to hex
Python has bytes-to-bytes standard codecs that perform convenient transformations like quoted-printable (fits into 7bits ascii), base64 (fits into alphanumerics), hex escaping, gzip and bz2 compression. In Python 2, you could do:
b'foo'.encode('hex')
In Python 3, str.encode / bytes.decode are strictly for bytes<->str conversions. Instead, you can do this, which works across Python 2 and Python 3 (s/encode/decode/g for the inverse):
import codecs
codecs.getencoder('hex')(b'foo')[0]
Starting with Python 3.4, there is a less awkward option:
codecs.encode(b'foo', 'hex')
These misc codecs are also accessible inside their own modules (base64, zlib, bz2, uu, quopri, binascii); the API is less consistent, but for compression codecs it offers more control.
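To make the comparison concrete, here is a rough sketch (Python 3.4+) that runs the same payload through the codecs interface and through the dedicated modules. The variable names are only for illustration.

```python
import base64
import binascii
import codecs

payload = b'foo'

# Hex: the generic codec interface and binascii agree.
via_codec = codecs.encode(payload, 'hex')
via_module = binascii.hexlify(payload)
decoded = codecs.decode(via_codec, 'hex')

# Base64: the codec appends a trailing newline, b64encode does not,
# which is one example of the APIs not being fully consistent.
b64_codec = codecs.encode(payload, 'base64')
b64_module = base64.b64encode(payload)

print(via_codec, via_module, decoded)
print(b64_codec, b64_module)
```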
New in Python 3.8: you can pass a separator argument (and, optionally, the number of bytes per group) to the hex method, as in this example:
>>> value = b'\xf0\xf1\xf2'
>>> value.hex('-')
'f0-f1-f2'
>>> value.hex('_', 2)
'f0_f1f2'
>>> b'UUDDLRLRAB'.hex(' ', -4)
'55554444 4c524c52 4142'
https://docs.python.org/3/library/stdtypes.html#bytes.hex
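One place this comes in handy is formatting things like hardware addresses. A small sketch, assuming Python 3.8 or later; the MAC address bytes are made up.

```python
# A made-up 6-byte hardware address.
mac = b'\x00\x1a\x2b\x3c\x4d\x5e'
mac_text = mac.hex(':')

value = b'\xf0\xf1\xf2'
# Positive group size: groups are counted from the right.
grouped_from_right = value.hex('_', 2)
# Negative group size: groups are counted from the left.
grouped_from_left = value.hex('_', -2)

print(mac_text)            # 00:1a:2b:3c:4d:5e
print(grouped_from_right)  # f0_f1f2
print(grouped_from_left)   # f0f1_f2
```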
The method binascii.hexlify() will convert bytes to a bytes representing the ascii hex string. That means that each byte in the input will get converted to two ascii characters. If you want a true str out then you can .decode("ascii") the result.
I've included a snippet that illustrates it.
import binascii
with open("addressbook.bin", "rb") as f:  # or any binary file like '/bin/ls'
    in_bytes = f.read()
print(in_bytes) # b'\n\x16\n\x04'
hex_bytes = binascii.hexlify(in_bytes)
print(hex_bytes) # b'0a160a04' which is twice as long as in_bytes
hex_str = hex_bytes.decode("ascii")
print(hex_str) # 0a160a04
From the hex string "0a160a04" you can get back to the bytes with binascii.unhexlify("0a160a04"), which gives back b'\n\x16\n\x04'.
import codecs
codecs.getencoder('hex_codec')(b'foo')[0]
This works in Python 3.3 (note "hex_codec" instead of "hex").
You can also use the format specifier %02x, which formats a value as two-digit hex. For example:
>>> foo = b"tC\xfc}\x05i\x8d\x86\x05\xa5\xb4\xd3]Vd\x9cZ\x92~'6"
>>> res = ""
>>> for b in foo:
...     res += "%02x" % b
...
>>> print(res)
7443fc7d05698d8605a5b4d35d56649c5a927e2736
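The same idea can be written without the explicit loop. A sketch for Python 3 (where iterating over bytes yields ints), using a shortened version of the input above:

```python
foo = b"tC\xfc}\x05"

# Join the two-digit hex form of every byte in one expression.
res = "".join("%02x" % b for b in foo)

# format() with the '02x' spec gives the same digits.
same = "".join(format(b, "02x") for b in foo)

print(res)          # 7443fc7d05
print(res == same)  # True
```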
OK, the following answer is slightly beyond-scope if you only care about Python 3, but this question is the first Google hit even if you don't specify the Python version, so here's a way that works on both Python 2 and Python 3.
I'm also interpreting the question to be about converting bytes to the str type: that is, bytes-y on Python 2, and Unicode-y on Python 3.
Given that, the best approach I know is:
import six
bytes_to_hex_str = lambda b: ' '.join('%02x' % i for i in six.iterbytes(b))
The following assertion will be true for either Python 2 or Python 3, assuming you haven't activated the unicode_literals future in Python 2:
assert bytes_to_hex_str(b'jkl') == '6a 6b 6c'
(Or you can use ''.join() to omit the space between the bytes, etc.)
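If you'd rather not depend on six, one possible alternative (a swapped-in technique, not what the answer above uses) is to go through bytearray, which iterates as ints on both Python 2 and Python 3. The sep parameter is an addition for this sketch.

```python
def bytes_to_hex_str(b, sep=' '):
    # bytearray yields integers on both Python 2 and Python 3,
    # so no six.iterbytes is needed here.
    return sep.join('%02x' % i for i in bytearray(b))

print(bytes_to_hex_str(b'jkl'))          # 6a 6b 6c
print(bytes_to_hex_str(b'jkl', sep=''))  # 6a6b6c
```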
If you want to convert b'\x61' to 97 or '0x61', you can try this:
[python3.5]
>>> from struct import unpack
>>> temp = unpack('B', b'\x61')[0]  # convert the byte to an unsigned int
>>> temp
97
>>> hex(temp)  # convert the int to its hexadecimal string representation
'0x61'
Reference:https://docs.python.org/3.5/library/struct.html
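The same approach extends to more than one byte. A short sketch for Python 3; the two-byte input is made up, and int.from_bytes is shown only as an alternative to struct.

```python
import struct

raw = b'\x61\x62'

# 'BB' unpacks two separate unsigned bytes.
first, second = struct.unpack('BB', raw)

# '>H' reads both bytes as one big-endian unsigned 16-bit integer.
big = struct.unpack('>H', raw)[0]

# int.from_bytes does the same without a format string.
same_big = int.from_bytes(raw, byteorder='big')

print(first, second)       # 97 98
print(hex(big))            # 0x6162
print(big == same_big)     # True
```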

Does Python3 still need raw string in regular expression?

In Python 2, when dealing with regular expressions we use r'expression'. Do we still need to prepend "r" in Python 3, given that Python 3 uses Unicode by default?
Yes. Backslash escape sequences are still present in Python 3 strings, thus raw strings prefixed with r make a difference as shown in this simple example:
>>> s = 'hello\n'
>>> raw = r'hello\n'
>>> s
'hello\n'
>>> raw
'hello\\n'
>>> print(s)
hello
>>> print(raw)
hello\n
Raw strings are still useful for writing characters like \ without escaping them. This is generally useful in regexes, Windows paths, etc.
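Here is a small regex case where the prefix changes the result. It is only a sketch; the sample text is made up.

```python
import re

text = "a cat sat"

# Raw string: the regex engine receives \b and treats it as a word boundary.
raw_match = re.search(r"\bcat\b", text)

# Plain string: Python turns \b into a backspace character (0x08) first,
# so the pattern looks for backspace + "cat" + backspace and finds nothing.
plain_match = re.search("\bcat\b", text)

print(raw_match.group())    # cat
print(plain_match is None)  # True
```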

What exact code-point conversion does string literal prefix "r" imply (Python 3.4)?

What Unicode code-point conversion does the stringprefix "r" (or "R") actually perform on string literals in Python 3 (literals/files parsed as UTF-8)?
I am using Python 3.4 on Windows 7.
I want to parse this "evil" path on Windows:
>>> a = 'c:\a\b\f\v'
>>> a
'c:\x07\x08\x0c\x0b'
>>> a.encode(encoding='utf-8')
b'c:\x07\x08\x0c\x0b'
With the prefix "r", I get:
>>> b = r'c:\a\b\f\v'
>>> b
'c:\\a\\b\\f\\v'
My question: How do I mimic (exactly) the "raw" code-point mapping/conversion on a Unicode string object in memory (not a string literal)? I could use str.translate and str.maketrans, but what exact mapping are we talking about then?
Context: Generally, I want to be able to support all kinds of weird directory names on Windows (and other platforms) being handed to my application as strings via command line parameters. How can I?
What Unicode code-point conversion does the string prefix "r" (or "R") actually perform on string literals in Python 3 (literals/files parsed as UTF-8)?
Python 3 native strings are already Unicode (and source files are read as UTF-8 by default); no conversions are done with the r prefix.
Without the r prefix, escape sequences introduced by \ are converted:
\a gives the code for a bell (a - alarm) 0x07
\b gives the code for a backspace 0x08
\f is a form feed 0x0c
\v is a vertical tab 0x0b
So, if you have (what you call) weird Windows path names, then always use raw strings, or use a / as the directory separator instead. However, you only need to worry about paths that are hard-coded, because those are parsed by Python; paths entered by the user should be fine.
Edit:
if you do this:
>>> import os.path
>>> os.path.normpath('C:\bash')
'C:\x08ash'
>>> var = input("Enter a filename: ")
Enter a filename: C:\bash
>>> print(var)
C:\bash
>>> os.path.normpath(var)
'C:\\bash'
Double back-slashing has the same effect as using raw strings.
>>> 'c:\a\b\f\v'
'c:\x07\x08\x0c\x0b'
When you type a string literal like this in Python source code, you need to either double the backslashes or use r for a raw string.
>>> 'c:\\a\\b\\f\\v'
'c:\\a\\b\\f\\v'
>>> r'c:\a\b\f\v'
'c:\\a\\b\\f\\v'
>>> print('c:\\a\\b\\f\\v')
c:\a\b\f\v
>>> print(r'c:\a\b\f\v')
c:\a\b\f\v
This has nothing to do with Unicode. It's the Python interpreter which is evaluating backslash escape sequences in string literals.
This is only the case with string literals in your source code. If you read a string from the command line or from a file you don't have to worry about any of this. Python does not interpret backslashes in these cases.
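If a path has already been mangled by a non-raw literal and you want to repair it after the fact, a str.translate table along the following lines is one possible best-effort sketch. The helper name unmangle and the table are hypothetical, the mapping only covers the single-letter escapes, and it will also rewrite control characters that were genuinely meant to be there, so fixing the literal is the better option.

```python
# Hypothetical helper: map the control characters produced by common
# single-letter escapes back to their two-character source spelling.
_ESCAPED = {
    "\a": r"\a", "\b": r"\b", "\f": r"\f", "\v": r"\v",
    "\n": r"\n", "\r": r"\r", "\t": r"\t",
}
_TABLE = str.maketrans(_ESCAPED)


def unmangle(path):
    """Best-effort reversal of escape processing (lossy, see caveats above)."""
    return path.translate(_TABLE)


# A literal that was written without r, so the escapes were interpreted.
mangled = 'c:\a\b\f\v'

# A string assembled at runtime, standing in for user or argv input:
# it contains real backslashes and no control characters.
from_user = "\\".join(["c:", "a", "b", "f", "v"])

print(unmangle(mangled) == from_user)    # True
print(unmangle(from_user) == from_user)  # True: runtime input is untouched
```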

Greek encoding in PYTHON

I'm trying to store a string and then tokenize it with nltk in Python. But I can't understand why, after tokenizing it (which creates a list), I can't see the strings in the list.
Can anyone help me, please?
Here is the code:
>>> a = "Γεια σου"
>>> b = nltk.word_tokenize(a)
>>> b
['\xc3\xe5\xe9\xe1', '\xf3\xef\xf5']
I just want to be able to see the content of the list normally.
Thanks in advance.
You are using Python 2, where unprefixed quotes denote a byte as opposed to a character string (if you're not sure about the difference, read this). Either switch to Python 3, where this has been fixed, or prefix all character strings with u and print the strings (as opposed to showing their repr, which differs in Python 2.x):
>>> import nltk
>>> a = u'Γεια σου'
>>> b = nltk.word_tokenize(a)
>>> print(u'\n'.join(b))
Γεια
σου
You can see the strings. The characters are represented by escape sequences because of your terminal encoding settings. Configure your terminal to accept input, and present output, in UTF-8.
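To see that nothing was lost, the escaped bytes from the question can be decoded back into text. This is a Python 3 sketch, and it assumes the bytes are ISO-8859-7 (Greek), which is a guess based on the byte values shown in the question.

```python
# The two tokens exactly as shown in the question, as byte strings.
raw_tokens = [b'\xc3\xe5\xe9\xe1', b'\xf3\xef\xf5']

# Assumption: the console encoded the input as ISO-8859-7.
tokens = [t.decode('iso-8859-7') for t in raw_tokens]

# In Python 3 the repr of a list of str keeps printable non-ASCII
# characters, so the list itself is readable.
shown = repr(tokens)
joined = ' '.join(tokens)
```

Here joined is the original sentence from the question, and shown is the list display with the Greek letters intact.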

Unicode literals causing invalid syntax

The following code:
s = s.replace(u"&", u"&amp;")
is causing an error in python:
SyntaxError: invalid syntax
Removing the u's before the quotes fixes the problem, but shouldn't this work as is? I'm using Python 3.1.
The u prefix is no longer used in Python 3 (it is a syntax error in 3.0 through 3.2). String literals are unicode by default. See What's New in Python 3.0.
You can no longer use u"..." literals for Unicode text. However, you must use b"..." literals for binary data.
On Python 3, strings are unicode. There is no need to (and as you've discovered, you can't) put a u before the string literal to designate unicode.
Instead, you have to put a b before a byte literal to designate that it isn't unicode.
In Python 3.3+ the unicode literal is valid again; see What's New In Python 3.3:
New syntax features:
New yield from expression for generator delegation.
The u'unicode' syntax is accepted again for str objects.
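A quick way to convince yourself on Python 3.3 or later (a minimal sketch; the sample strings are made up):

```python
text = u"&"   # accepted again since Python 3.3
same = "&"    # the plain literal is the same type and value
data = b"&"   # bytes still need the b prefix

print(type(text) is str, text == same)     # True True
print(type(data) is bytes, data == text)   # True False

# The replace call from the question works unchanged on 3.3+.
s = "a & b".replace(u"&", u"&amp;")
print(s)  # a &amp; b
```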
