I have a file that contains both hex data and non-hex data.
For example, var _0x36ba=["\x69\x73\x41\x72\x72\x61\x79","\x63\x61\x6C\x6C","\x74\x6F\x53\x74\x72\x69\x6E\x67",]
When I paste this code directly into the Python console, I get var _0x36ba=["isArray","call","toString",]
But when I try to read the file and print contents, it gives me var _0x36ba=["\\x69\\x73\\x41\\x72\\x72\\x61\\x79","\\x63\\x61\\x6C\\x6C","\\x74\\x6F\\x53\\x74\\x72\\x69\\x6E\\x67","\\
Seems like backslashes are parsed as they are.
How can I read the file and obtain readable output?
You have string literals with \xhh hex escapes. You can decode these with the string_escape codec (Python 2 only):
text.decode('string_escape')
See the Python Specific Encodings section of the codecs module documentation:
string_escape
Produce a string that is suitable as string literal in Python source code
Decoding reverses that encoding:
>>> "\\x69\\x73\\x41\\x72\\x72\\x61\\x79".decode('string_escape')
'isArray'
>>> "\\x63\\x61\\x6C\\x6C".decode('string_escape')
'call'
>>> "\\x74\\x6F\\x53\\x74\\x72\\x69\\x6E\\x67".decode('string_escape')
'toString'
Being a built-in codec, this is a lot faster than using regular expressions:
>>> from timeit import timeit
>>> import re
>>> def unescape(text):
...     return re.sub(r'\\x([0-9a-fA-F]{2})',
...                   lambda m: chr(int(m.group(1), 16)), text)
...
>>> value = "\\x69\\x73\\x41\\x72\\x72\\x61\\x79"
>>> timeit('unescape(value)', 'from __main__ import unescape, value')
6.254786968231201
>>> timeit('value.decode("string_escape")', 'from __main__ import value')
0.43862390518188477
That's about 14 times faster.
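The string_escape codec does not exist in Python 3. As a rough sketch of an equivalent there, you can read the file in binary mode and decode the bytes with the unicode_escape codec. The helper names unescape_bytes and read_unescaped are made up for this illustration, and the approach assumes the file is plain ASCII apart from the escapes (unicode_escape treats any other bytes as Latin-1):

```python
def unescape_bytes(data):
    """Decode backslash escapes (for example hex escapes) in raw bytes."""
    # unicode_escape interprets \xhh, \n, \uXXXX and similar sequences.
    return data.decode('unicode_escape')


def read_unescaped(path):
    """Hypothetical helper: read a file as bytes and unescape its contents."""
    with open(path, 'rb') as handle:
        return unescape_bytes(handle.read())


# Stand-in for the file contents: the backslashes are literal characters here.
sample = b'var _0x36ba=["\\x69\\x73\\x41\\x72\\x72\\x61\\x79","\\x63\\x61\\x6C\\x6C"]'
print(unescape_bytes(sample))  # var _0x36ba=["isArray","call"]
```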
EDIT: Please use Martijn's solution. I didn't know about text.decode('string_escape') yet, and of course it is much faster. Below follows my original answer.
Use this regular expression to unescape all escaped hexadecimal expressions within the string:
import re

def unescape(text):
    # Match an escaped backslash first so that \\x41 is not mistaken for \x41.
    return re.sub(r'\\\\|\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)) if m.group(1)
                  else '\\', text)
If you know that the input will not contain a double backslash followed by an x (e.g. foo bar \\x41 bloh, which probably should be interpreted as foo bar \x41 bloh instead of as foo bar \A bloh), then you can simplify this to:
def unescape(text):
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)
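To make the difference between the two variants concrete, here is a small side-by-side sketch. The names unescape_strict and unescape_simple are only labels for this comparison; the regular expressions are the ones given above:

```python
import re


def unescape_strict(text):
    # Consumes an escaped backslash (\\) before looking for \xhh.
    return re.sub(r'\\\\|\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)) if m.group(1) else '\\',
                  text)


def unescape_simple(text):
    # Only looks for \xhh, so the second backslash of \\x41 starts a match.
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)


sample = r'foo bar \\x41 bloh'   # two literal backslashes, then x41
print(unescape_strict(sample))   # foo bar \x41 bloh
print(unescape_simple(sample))   # foo bar \A bloh
```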
I tried using:
>>> wbpat='\btest\b'
>>> re.findall(wbpat, 'a test tested in testing')
The result I expected was ['test'], but somehow I am getting an empty list. What could be the problem...
\b is an escape code for a backspace (length 1 string). Use r'\btest\b'. The leading r indicates to the Python interpreter that it should interpret each character in the string as a literal single character (a "raw" string) and ignore escape sequences.
Example:
>>> len('\btest\b') # <backspace>test<backspace>
6
>>> len(r'\btest\b') # <backslash>btest<backslash>b
8
>>> import re
>>> re.findall(r'\btest\b','a test tested in testing')
['test']
It's a good habit to use a raw string for regular expressions in Python.
I have a file of strings one per line in which non-ascii characters have been escaped with decimal code points. One example line is:
mj\\195\\164ger
(The double backslashes are in the file exactly as printed)
I would like to process this string to produce
mjäger
Conventionally, Python uses hexadecimal escapes rather than decimal escapes (e.g., the above string would be written as mj\xc3\xa4ger), which Python can decode:
>>> by=b'mj\xc3\xa4ger'
>>> by.decode('utf-8')
'mjäger'
Python, however, doesn't recognize the decimal escape right away.
I have written a method that correctly manipulates the strings to produce hexadecimal escapes, but these escapes are themselves escaped. How can I get python to process these hexadecimal escapes to create the final string?
import re
hexconst=["0","1","2","3","4","5","6","7","8","9","a","b","c","d","e","f"]
escapes=re.compile(r"\\[0-9]{3}")
def dec2hex(matchobj):
    dec=matchobj.group(0)
    dec=int(dec[1:])
    digit1=dec//16 #integer division
    digit2=dec%16
    hex="\\x" + hexconst[digit1] + hexconst[digit2]
    return hex
line=r'mj\195\164ger'
print(escapes.sub(dec2hex,line)) #Outputs mj\xc3\xa4ger
What is the final step I'm missing to convert the output of the above from mj\xc3\xa4ger to mjäger? Thanks!
It's much easier. re.sub() can take a callback function instead of a replacement string as an argument (Python 2 shown here):
>>> import re
>>> line=r'mj\195\164ger'
>>> def replace(match):
...     return chr(int(match.group(1)))
>>> regex = re.compile(r"\\(\d{1,3})")
>>> new = regex.sub(replace, line)
>>> new
'mj\xc3\xa4ger'
>>> print new
mjäger
In Python 3, strings are Unicode strings, so if you're working with encoded input (like UTF-8 encoded content), then you need to use the proper type which is bytes:
>>> line = rb'mj\195\164ger'
>>> regex = re.compile(rb"\\(\d{1,3})")
>>> def replace(match):
...     return int(match.group(1)).to_bytes(1, byteorder="big")
>>> new = regex.sub(replace, line)
>>> new
b'mj\xc3\xa4ger'
>>> print(new.decode("utf-8"))
mjäger
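If the lines come out of the file as str rather than bytes, the same idea can be wrapped in a small function. This is only a sketch: the name decode_decimal_escapes is invented here, and it assumes each line is ASCII apart from the escapes and that every escaped value fits in one byte (a value above 255 would raise ValueError):

```python
import re

# Matches a backslash followed by one to three decimal digits, on bytes.
_DECIMAL_ESCAPE = re.compile(rb"\\(\d{1,3})")


def decode_decimal_escapes(line, encoding="utf-8"):
    """Turn decimal byte escapes such as \\195\\164 into decoded text."""
    raw = line.encode("ascii")
    # Replace each escape with the single byte it stands for.
    unescaped = _DECIMAL_ESCAPE.sub(lambda m: bytes([int(m.group(1))]), raw)
    return unescaped.decode(encoding)


print(decode_decimal_escapes(r"mj\195\164ger"))  # mjäger
```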
I am making a class that relies heavily on regular expressions.
Let's say my class looks like this:
class Example:
    def __init__(self, regex):
        self.regex = regex
    def __repr__(self):
        return 'Example({})'.format(repr(self.regex.pattern))
And let's say I use it like this:
import re
example = Example(re.compile(r'\d+'))
If I do repr(example), I get 'Example('\\\\d+')', but I want 'Example(r'\\d+')'. Note the extra backslash; upon printing, it appears correctly. I suppose I could implement it to return "r'{}'".format(regex.pattern), but that doesn't sit well with me. In the unlikely event that the Python Software Foundation someday changes the syntax for raw string literals, my code won't reflect that. That's hypothetical, though. My main concern is whether or not this always works. I can't think of an edge case off the top of my head. Is there a more formal way of doing this?
EDIT: Nothing seems to appear in the Format Specification Mini-Language, the printf-style String Formatting guide, or the string module.
The problem with raw string representation is that you cannot represent everything in a portable manner (i.e. without using control characters). For example, if you had a line break in your string, you would have to literally break the string onto the next line, because a line break cannot be represented in a raw string.
That said, the actual way to get rawstring representation is what you already gave:
"r'{}'".format(regex.pattern)
The definition of raw strings is that no escape rules are applied, except that the string ends at the quotation character it starts with and that said quotation character can be escaped with a backslash (the backslash then stays in the string). Thus, for example, you cannot store a string consisting of a single backslash in raw string representation (r"\" yields a SyntaxError and r"\\" yields "\\\\").
If you really want to do this, you should use a wrapper like:
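def rawstr(s):
    """
    Return the raw string representation (using r'' literals) of the string
    *s* if it is available. If any invalid characters are encountered (or a
    string which cannot be represented as a rawstr), the default repr() result
    is returned.
    """
    # Control characters (including newlines) cannot be written portably.
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)
    # A raw string literal cannot end in an odd number of backslashes.
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)
    pattern = "r'{0}'"
    if '"' in s:
        # Both quote characters present: fall back to repr().
        if "'" in s:
            return repr(s)
    elif "'" in s:
        # Only single quotes present: delimit with double quotes instead.
        pattern = 'r"{0}"'
    return pattern.format(s)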
Tests:
>>> test1 = "\\"
>>> test2 = "foobar \n"
>>> test3 = r"a \valid rawstring"
>>> test4 = "foo \\\\\\"
>>> test5 = r"foo \\"
>>> test6 = r"'"
>>> test7 = r'"'
>>> print(rawstr(test1))
'\\'
>>> print(rawstr(test2))
'foobar \n'
>>> print(rawstr(test3))
r'a \valid rawstring'
>>> print(rawstr(test4))
'foo \\\\\\'
>>> print(rawstr(test5))
r'foo \\'
>>> print(rawstr(test6))
r"'"
>>> print(rawstr(test7))
r'"'
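As for whether the plain "r'{}'".format(...) approach always works, one way to probe it is a round-trip check: build the literal, evaluate it with ast.literal_eval, and compare with the original string. The following Python 3 sketch does that; the function name raw_literal_roundtrips is made up for this illustration:

```python
import ast


def raw_literal_roundtrips(s):
    """Return True if r'<s>' is valid source that evaluates back to s."""
    candidate = "r'{}'".format(s)
    try:
        return ast.literal_eval(candidate) == s
    except (SyntaxError, ValueError):
        # The candidate is not a single valid string literal.
        return False


print(raw_literal_roundtrips(r'\d+'))   # True
print(raw_literal_roundtrips('\\'))     # False: trailing single backslash
print(raw_literal_roundtrips("it's"))   # False: unescaped quote character
print(raw_literal_roundtrips('a\nb'))   # False: literal line break
```

The failing cases are the ones the wrapper above guards against by falling back to repr().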
In this post: Print a string as hex bytes? I learned how to print a string as an "array" of hex bytes. Now I need something the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Do note, however, that this function will only work on ASCII-encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless, but that is easy to arrange with a simple
string = string.replace('.','')
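Putting the two steps above together for the input from the question, a minimal Python 3 sketch could look like this (the variable names are arbitrary):

```python
import binascii

dotted = "73.69.67.6e.61.74.75.72.65"

# Drop the dots, then turn each pair of hex digits into one byte.
raw = binascii.unhexlify(dotted.replace(".", ""))

# unhexlify returns bytes; decode them to get a str.
text = raw.decode("ascii")
print(text)  # signature
```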
Another (arguably safer) method would be to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print(encoded)
data = base64.b16decode(encoded)
print(data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print(data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using Python 3.2; on other versions you may not need the b in front of the string to denote bytes.
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: works with unicode:
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
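The 'hex' string codec used above is Python 2 only. In Python 3, a comparable sketch uses bytes.fromhex and picks an encoding explicitly; UTF-8 is an assumption made for this example:

```python
text = "Hellö Wörld!"

# Encode to UTF-8 bytes and write each byte as two hex digits, dot-separated.
dotted = ".".join("{:02x}".format(byte) for byte in text.encode("utf-8"))

# Reverse it: drop the dots, parse the hex digits, decode the bytes.
restored = bytes.fromhex(dotted.replace(".", "")).decode("utf-8")
print(restored)  # Hellö Wörld!
```

With UTF-8, each ö takes two bytes, so the dotted form has more entries than the text has characters.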
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>
How can you use string methods like strip() on a unicode string? And can't you access characters of a unicode string like with ordinary strings (e.g. mystring[0:4])?
It's working as usual, as long as they are actually unicode, not str (note: every string literal must be preceded by u, like in this example):
>>> a = u"coțofană"
>>> a
u'co\u021bofan\u0103'
>>> a[-1]
u'\u0103'
>>> a[2]
u'\u021b'
>>> a[3]
u'o'
>>> a.strip(u'ă')
u'co\u021bofan'
Maybe it's a bit late to answer to this, but if you are looking for the library function and not the instance method, you can use that as well.
Just use:
yourunicodestring = u' a unicode string with spaces all around '
unicode.strip(yourunicodestring)
In some cases it's easier to use this one, for example inside a map function like:
unicodelist=[u'a',u' a ',u' foo is just...foo ']
map(unicode.strip, unicodelist)
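In Python 3 there is no separate unicode type, so the same trick would use str.strip instead. A short sketch, with list() added because map returns an iterator there:

```python
unicode_list = ['a', ' a ', ' foo is just...foo ']

# str.strip is the unbound method; map calls it on each element.
stripped = list(map(str.strip, unicode_list))
print(stripped)  # ['a', 'a', 'foo is just...foo']
```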
You can do every string operation. In fact, in Python 3 all str objects are Unicode.
>>> my_unicode_string = u"abcşiüğ"
>>> my_unicode_string[4]
u'i'
>>> my_unicode_string[3]
u'\u015f'
>>> print(my_unicode_string[3])
ş
>>> my_unicode_string[3:]
u'\u015fi\xfc\u011f'
>>> print(my_unicode_string[3:])
şiüğ
>>> print(my_unicode_string.strip(u"ğ"))
abcşiü
See the Python docs on Unicode strings and the following section on string methods. Unicode strings support the same methods and operations as normal ASCII strings.
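For comparison, here is roughly what the same operations look like in Python 3, where no u prefix is needed. The word is written with \u escapes only so the example does not depend on how the source file is encoded:

```python
word = "co\u021bofan\u0103"   # coțofană

# Indexing and slicing work per character, not per byte.
print(word[2] == "\u021b")    # True
print(word[0:4])              # coțo

# strip() accepts non-ASCII characters to remove as well.
print(word.strip("\u0103"))   # coțofan
```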