I need to load the third column of this text file as a hex string:
http://www.netmite.com/android/mydroid/1.6/external/skia/emoji/gmojiraw.txt
>>> open('gmojiraw.txt').read().split('\n')[0].split('\t')[2]
'\\xF3\\xBE\\x80\\x80'
How do I open the file so that I get the third column as a hex string:
'\xF3\xBE\x80\x80'
I also tried binary mode and hex mode, with no success.
You can:
Remove the \x-es
Use .decode('hex') on the resulting string
Code:
>>> '\\xF3\\xBE\\x80\\x80'.replace('\\x', '').decode('hex')
'\xf3\xbe\x80\x80'
Note the appropriate interpretation of backslashes. When the string representation is '\xf3' it means it's a single-byte string with the byte value 0xF3. When it's '\\xf3', which is your input, it means a string consisting of 4 characters: \, x, f and 3
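On Python 3 the 'hex' codec is no longer available through str.decode, so the one-liner above fails there. A minimal sketch of the same two steps for Python 3, assuming the column text looks like the sample in the question (the helper name escaped_hex_to_bytes is made up for illustration):

```python
def escaped_hex_to_bytes(text):
    """Turn the literal text \\xF3\\xBE\\x80\\x80 into the 4 bytes it spells out."""
    # Step 1: drop every literal backslash-x pair, leaving only hex digits.
    hex_digits = text.replace('\\x', '')
    # Step 2: parse the digit pairs into raw bytes.
    return bytes.fromhex(hex_digits)

raw = escaped_hex_to_bytes('\\xF3\\xBE\\x80\\x80')
print(raw)       # b'\xf3\xbe\x80\x80'
print(len(raw))  # 4
```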
Quick'n'dirty reply
your_string.decode('string_escape')
>>> a='\\xF3\\xBE\\x80\\x80'
>>> a.decode('string_escape')
'\xf3\xbe\x80\x80'
>>> len(_)
4
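The string_escape codec exists only in Python 2. A rough Python 3 counterpart, offered as a sketch rather than a drop-in replacement, goes through unicode_escape and then maps each resulting code point back to one byte with latin-1. It assumes the input is pure ASCII and contains only \x-style escapes (no \u escapes, which would not fit in a single byte):

```python
a = '\\xF3\\xBE\\x80\\x80'   # 16 characters: backslash, x, F, 3, ...

# unicode_escape interprets the backslash escapes and yields a str whose
# code points are 0xF3, 0xBE, 0x80, 0x80; latin-1 maps each one to a byte.
raw = a.encode('ascii').decode('unicode_escape').encode('latin-1')

print(raw)       # b'\xf3\xbe\x80\x80'
print(len(raw))  # 4
```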
Bonus info
>>> u='\uDBB8\uDC03'
>>> u.decode('unicode_escape')
Some trivia
What's interesting, is that I have Python 2.6.4 on Karmic Koala Ubuntu (sys.maxunicode==1114111) and Python 2.6.5 on Gentoo (sys.maxunicode==65535); on Ubuntu, the unicode_escape-decode result is \uDBB8\uDC03 and on Gentoo it's u'\U000fe003', both correctly of length 2. Unless it's something fixed between 2.6.4 and 2.6.5, I'm impressed the 2-byte-per-unicode-character Gentoo version reports the correct character.
If you are using Python2.6+ here is a safe way to use eval
>>> from ast import literal_eval
>>> item='\\xF3\\xBE\\x80\\x80'
>>> literal_eval("'%s'"%item)
'\xf3\xbe\x80\x80'
After stripping out the "\x" as in Eli's answer, you can convert the remaining hex digits to an integer:
int("F3BE8080",16)
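Note that int() gives a number rather than a byte string. If the integer form is what you are after, here is a sketch of the round trip in Python 3. int.to_bytes needs an explicit length and byte order; 'big' is chosen here on the assumption that the bytes should stay in the order they appear in the text:

```python
value = int("F3BE8080", 16)
print(hex(value))                 # 0xf3be8080

# Convert back to the four raw bytes, most significant byte first.
raw = value.to_bytes(4, 'big')
print(raw)                        # b'\xf3\xbe\x80\x80'
```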
If you trust the source, you can use eval('"%s"' % data)
Related
What's the correct way to convert bytes to a hex string in Python 3?
I see claims of a bytes.hex method and bytes.decode codecs, and have tried other possible functions of least astonishment, to no avail. I just want my bytes as hex!
Since Python 3.5 this is finally no longer awkward:
>>> b'\xde\xad\xbe\xef'.hex()
'deadbeef'
and reverse:
>>> bytes.fromhex('deadbeef')
b'\xde\xad\xbe\xef'
This also works with the mutable bytearray type.
Reference: https://docs.python.org/3/library/stdtypes.html#bytes.hex
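A short sketch of the bytearray case mentioned above (Python 3.5+):

```python
buf = bytearray(b'\xde\xad\xbe\xef')
print(buf.hex())                      # deadbeef

# fromhex is a classmethod, so calling it on bytearray returns a bytearray.
restored = bytearray.fromhex('deadbeef')
print(restored == buf)                # True
restored[0] = 0x00                    # mutable, unlike bytes
print(restored.hex())                 # 00adbeef
```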
Use the binascii module:
>>> import binascii
>>> binascii.hexlify('foo'.encode('utf8'))
b'666f6f'
>>> binascii.unhexlify(_).decode('utf8')
'foo'
See this answer:
Python 3.1.1 string to hex
Python has bytes-to-bytes standard codecs that perform convenient transformations like quoted-printable (fits into 7bits ascii), base64 (fits into alphanumerics), hex escaping, gzip and bz2 compression. In Python 2, you could do:
b'foo'.encode('hex')
In Python 3, str.encode / bytes.decode are strictly for bytes<->str conversions. Instead, you can do this, which works across Python 2 and Python 3 (s/encode/decode/g for the inverse):
import codecs
codecs.getencoder('hex')(b'foo')[0]
Starting with Python 3.4, there is a less awkward option:
codecs.encode(b'foo', 'hex')
These misc codecs are also accessible inside their own modules (base64, zlib, bz2, uu, quopri, binascii); the API is less consistent, but for compression codecs it offers more control.
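As an illustration of the bytes-to-bytes codecs described above, a sketch using codecs.encode and codecs.decode on Python 3.4+ (the sample input b'foo' is the one used earlier in this answer):

```python
import codecs

hexed = codecs.encode(b'foo', 'hex')
print(hexed)                              # b'666f6f'
print(codecs.decode(hexed, 'hex'))        # b'foo'

# The base64 codec appends a trailing newline to its output.
b64 = codecs.encode(b'foo', 'base64')
print(b64)                                # b'Zm9v\n'
print(codecs.decode(b64, 'base64'))       # b'foo'
```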
New in Python 3.8: you can pass a separator argument to the hex method, as in this example:
>>> value = b'\xf0\xf1\xf2'
>>> value.hex('-')
'f0-f1-f2'
>>> value.hex('_', 2)
'f0_f1f2'
>>> b'UUDDLRLRAB'.hex(' ', -4)
'55554444 4c524c52 4142'
https://docs.python.org/3/library/stdtypes.html#bytes.hex
The method binascii.hexlify() will convert bytes to a bytes representing the ascii hex string. That means that each byte in the input will get converted to two ascii characters. If you want a true str out then you can .decode("ascii") the result.
I included a snippet that illustrates it.
import binascii
with open("addressbook.bin", "rb") as f:  # or any binary file like '/bin/ls'
    in_bytes = f.read()
print(in_bytes)  # b'\n\x16\n\x04'
hex_bytes = binascii.hexlify(in_bytes)
print(hex_bytes)  # b'0a160a04' which is twice as long as in_bytes
hex_str = hex_bytes.decode("ascii")
print(hex_str)  # 0a160a04
From the hex string "0a160a04" you can get back to the bytes with binascii.unhexlify("0a160a04"), which gives b'\n\x16\n\x04'.
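The same round trip without a file, as a self-contained sketch (the sample bytes are the ones shown in the comments of the snippet):

```python
import binascii

in_bytes = b'\n\x16\n\x04'
hex_str = binascii.hexlify(in_bytes).decode('ascii')
print(hex_str)                            # 0a160a04

# unhexlify accepts the hex digits as an ASCII str or as bytes.
print(binascii.unhexlify(hex_str) == in_bytes)   # True
```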
import codecs
codecs.getencoder('hex_codec')(b'foo')[0]
works in Python 3.3 (so "hex_codec" instead of "hex").
You can also use the format specifier %02x, which formats a value as two zero-padded hex digits. For example:
>>> foo = b"tC\xfc}\x05i\x8d\x86\x05\xa5\xb4\xd3]Vd\x9cZ\x92~'6"
>>> res = ""
>>> for b in foo:
... res += "%02x" % b
...
>>> print(res)
7443fc7d05698d8605a5b4d35d56649c5a927e2736
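The loop can be written more compactly with str.join, which also avoids repeated string concatenation. A sketch for Python 3, where iterating over bytes yields integers:

```python
foo = b"tC\xfc}\x05i\x8d\x86\x05\xa5\xb4\xd3]Vd\x9cZ\x92~'6"

# Each b is an int in range(256), so "%02x" applies to it directly.
res = "".join("%02x" % b for b in foo)
print(res)
print(res == foo.hex())   # True on Python 3.5+
```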
OK, the following answer is slightly beyond-scope if you only care about Python 3, but this question is the first Google hit even if you don't specify the Python version, so here's a way that works on both Python 2 and Python 3.
I'm also interpreting the question to be about converting bytes to the str type: that is, bytes-y on Python 2, and Unicode-y on Python 3.
Given that, the best approach I know is:
import six
bytes_to_hex_str = lambda b: ' '.join('%02x' % i for i in six.iterbytes(b))
The following assertion will be true for either Python 2 or Python 3, assuming you haven't activated the unicode_literals future in Python 2:
assert bytes_to_hex_str(b'jkl') == '6a 6b 6c'
(Or you can use ''.join() to omit the space between the bytes, etc.)
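If adding six as a dependency is not an option, one alternative (a sketch of my own, not something the approach above relies on) is to wrap the input in bytearray, whose iteration yields integers on both Python 2.6+ and Python 3. The sep parameter is a hypothetical convenience added for illustration:

```python
def bytes_to_hex_str(b, sep=' '):
    # bytearray(b) iterates as integers on Python 2 and Python 3 alike.
    return sep.join('%02x' % i for i in bytearray(b))

print(bytes_to_hex_str(b'jkl'))          # 6a 6b 6c
print(bytes_to_hex_str(b'jkl', sep=''))  # 6a6b6c
```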
If you want to convert b'\x61' to 97 or '0x61', you can try this:
[python3.5]
>>> from struct import unpack
>>> temp = unpack('B', b'\x61')[0]  # convert bytes to unsigned int
>>> temp
97
>>> hex(temp)  # convert int to its hexadecimal string representation
'0x61'
Reference:https://docs.python.org/3.5/library/struct.html
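For a single byte the struct module is not strictly needed in Python 3. A sketch of two built-in alternatives (the two-byte sample value is made up for illustration):

```python
data = b'\x61'

first = data[0]                          # indexing bytes gives an int in Python 3
print(first)                             # 97
print(hex(first))                        # 0x61

# int.from_bytes handles more than one byte; the byte order must be stated.
print(int.from_bytes(b'\x01\x00', 'big'))   # 256
```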
Apparently the ur"" syntax has been disabled in Python 3. However, I need it! "Why?", you may ask. Well, I need the u prefix because it is a unicode string and my code needs to work on Python 2. As for the r prefix, maybe it's not essential, but the markup format I'm using requires a lot of backslashes and it would help avoid mistakes.
Here is an example that does what I want in Python 2 but is illegal in Python 3:
tamil_letter_ma = u"\u0bae"
marked_text = ur"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
After coming across this problem, I found http://bugs.python.org/issue15096 and noticed this quote:
It's easy to overcome the limitation.
Would anyone care to offer an idea about how?
Related: What exactly do "u" and "r" string flags do in Python, and what are raw string literals?
Why don't you just use raw string literal (r'....'), you don't need to specify u because in Python 3, strings are unicode strings.
>>> tamil_letter_ma = "\u0bae"
>>> marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
>>> marked_text
'\\aம\\bthe Tamil\\cletter\\dMa\\e'
To make it also work in Python 2.x, add the following Future import statement at the very beginning of your source code, so that all the string literals in the source code become unicode.
from __future__ import unicode_literals
The preferred way is to drop the u'' prefix and use from __future__ import unicode_literals, as @falsetru suggested. But in your specific case, you could abuse the fact that "ascii-only string" % unicode returns Unicode:
>>> tamil_letter_ma = u"\u0bae"
>>> marked_text = r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma
>>> marked_text
u'\\a\u0bae\\bthe Tamil\\cletter\\dMa\\e'
Unicode strings are the default in Python 3.x, so using r alone will produce the same as ur in Python 2.
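If the source has to run unchanged on Python 2 and on Python 3.3+ (where the u prefix is accepted again), another option is to give up the raw prefix and double the backslashes instead. A sketch showing that the result matches the raw-string version:

```python
tamil_letter_ma = u"\u0bae"

# No r prefix: every literal backslash is written as "\\".
marked_text = u"\\a%s\\bthe Tamil\\cletter\\dMa\\e" % tamil_letter_ma

# This equals the raw-string form used in the answers.
print(marked_text == r"\a%s\bthe Tamil\cletter\dMa\e" % tamil_letter_ma)  # True
```

The cost is readability: with many backslashes in the markup, the doubled form is easier to get wrong than the raw one.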
In Python 2, when dealing with regular expressions we use r'expression'. Do we still need to prepend "r" in Python 3, given that Python 3 uses Unicode by default?
Yes. Backslash escape sequences are still present in Python 3 strings, thus raw strings prefixed with r make a difference as shown in this simple example:
>>> s = 'hello\n'
>>> raw = r'hello\n'
>>> s
'hello\n'
>>> raw
'hello\\n'
>>> print(s)
hello
>>> print(raw)
hello\n
Raw strings are still useful for writing characters like \ without escaping them. This is generally useful in regular expressions, Windows paths, etc.
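A small sketch of why this matters for regular expressions in Python 3 (the sample text is made up for illustration):

```python
import re

text = 'cat concat cat'

# Raw string: \b reaches the regex engine as a word-boundary assertion.
print(re.findall(r'\bcat\b', text))   # ['cat', 'cat']

# Non-raw string: '\b' is a single backspace character, so nothing matches.
print(re.findall('\bcat\b', text))    # []

print(len(r'\b'), len('\b'))          # 2 1
```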
Consider that Python will allow one to type the name 'Jânis' into the Python CLI, if it is known that the 'â' character is hex "E2" in CP-1252:
>>> 'J\xe2nis'
'Jânis'
How might one type that name into the Python CLI if the Unicode code point is known, but not the CP-1252 value? The code point in question is U+00E2. Also, the UTF-8 encoded character is %C3%A2; is there any way to type that into the Python CLI if only that is known?
I am using Python 3.2 on Kubuntu Linux 12.10.
Use unicode escape sequence (\unnnn):
>>> 'J\u00e2nis'
'Jânis'
If you know the UTF-8 bytes, use bytes.decode (UTF-8 is the default encoding in Python 3.x, so the argument is optional):
>>> b'J\xc3\xa2nis'.decode('utf-8')
'Jânis'
If you have %C3%A2, use urllib.parse.unquote:
>>> import urllib.parse
>>> urllib.parse.unquote('J%c3%a2nis', encoding='utf-8')
'Jânis'
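Putting the options side by side, here is a sketch for Python 3 showing that each spelling produces the same string. The \N{...} escape and chr() are additions not mentioned above; the character name is the one unicodedata reports for U+00E2:

```python
import unicodedata

by_code_point = 'J\u00e2nis'                       # \u escape, 4 hex digits
by_chr        = 'J' + chr(0xE2) + 'nis'            # code point as an integer
by_name       = 'J\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}nis'
by_utf8       = b'J\xc3\xa2nis'.decode('utf-8')    # from the UTF-8 bytes
by_cp1252     = b'J\xe2nis'.decode('cp1252')       # from the CP-1252 byte

print(by_code_point == by_chr == by_name == by_utf8 == by_cp1252)  # True
print(unicodedata.name('\u00e2'))   # LATIN SMALL LETTER A WITH CIRCUMFLEX
```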
I'm trying to store a string and then tokenize it with nltk in Python, but I can't understand why, after tokenizing it (which creates a list), I can't see the strings in the list.
Can anyone help me, please?
Here is the code:
#a="Γεια σου"
#b=nltk.word_tokenize(a)
#b
['\xc3\xe5\xe9\xe1', '\xf3\xef\xf5']
I just want to be able to see the content of the list as normal text.
Thanks in advance.
You are using Python 2, where unprefixed quotes denote a byte as opposed to a character string (if you're not sure about the difference, read this). Either switch to Python 3, where this has been fixed, or prefix all character strings with u and print the strings (as opposed to showing their repr, which differs in Python 2.x):
>>> import nltk
>>> a = u'Γεια σου'
>>> b = nltk.word_tokenize(a)
>>> print(u'\n'.join(b))
Γεια
σου
You can see the strings. The characters are represented by escape sequences because of your terminal encoding settings. Configure your terminal to accept input, and present output, in UTF-8.
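To check that the escape sequences in the question really are the Greek words, here is a sketch for Python 3. It uses str.split as a stand-in for nltk.word_tokenize so that it runs without NLTK, and it assumes the original terminal used ISO-8859-7; the byte values in the question are consistent with that encoding, but the question does not say which one was in use:

```python
a = 'Γεια σου'
tokens = a.split()          # stand-in for nltk.word_tokenize(a)

# In Python 3 the list repr shows the characters themselves.
print(tokens)               # ['Γεια', 'σου']

# Encoding the tokens reproduces the bytes shown in the question.
encoded = [t.encode('iso8859_7') for t in tokens]
print(encoded == [b'\xc3\xe5\xe9\xe1', b'\xf3\xef\xf5'])   # True
```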