Check if a string is hexadecimal - python

I know the easiest way is using a regular expression, but I wonder if there are other ways to do this check.
Why do I need this? I am writing a Python script that reads text messages (SMS) from a SIM card. In some situations, hex messages arrives and I need to do some processing for them, so I need to check if a received message is hexadecimal.
When I send following SMS:
Hello world!
And my script receives
00480065006C006C006F00200077006F0072006C00640021
But in some situations, I receive normal text messages (not hex). So I need to do a if hex control.
I am using Python 2.6.5.
UPDATE:
The reason of that problem is, (somehow) messages I sent are received as hex while messages sent by operator (info messages and ads.) are received as a normal string. So I decided to make a check and ensure that I have the message in the correct string format.
Some extra details: I am using a Huawei 3G modem and PyHumod to read data from the SIM card.
Possible best solution to my situation:
The best way to handle such strings is using a2b_hex (a.k.a. unhexlify) and utf-16 big endian encoding (as #JonasWielicki mentioned):
from binascii import unhexlify # unhexlify is another name of a2b_hex
mystr = "00480065006C006C006F00200077006F0072006C00640021"
unhexlify(mystr).encode("utf-16-be")
>> u'Hello world!'

(1) Using int() works nicely for this, and Python does all the checking for you :)
int('00480065006C006C006F00200077006F0072006C00640021', 16)
6896377547970387516320582441726837832153446723333914657L
will work. In case of failure you will receive a ValueError exception.
Short example:
int('af', 16)
175
int('ah', 16)
...
ValueError: invalid literal for int() with base 16: 'ah'
(2) An alternative would be to traverse the data and make sure all characters fall within the range of 0..9 and a-f/A-F. string.hexdigits ('0123456789abcdefABCDEF') is useful for this as it contains both upper and lower case digits.
import string
all(c in string.hexdigits for c in s)
will return either True or False based on the validity of your data in string s.
Short example:
s = 'af'
all(c in string.hexdigits for c in s)
True
s = 'ah'
all(c in string.hexdigits for c in s)
False
Notes:
As #ScottGriffiths notes correctly in a comment below, the int() approach will work if your string contains 0x at the start, while the character-by-character check will fail with this. Also, checking against a set of characters is faster than a string of characters, but it is doubtful this will matter with short SMS strings, unless you process many (many!) of them in sequence in which case you could convert stringhexditigs to a set with set(string.hexdigits).

You can:
test whether the string contains only hexadecimal digits (0…9,A…F)
try to convert the string to integer and see whether it fails.
Here is the code:
import string
def is_hex(s):
hex_digits = set(string.hexdigits)
# if s is long, then it is faster to check against a set
return all(c in hex_digits for c in s)
def is_hex(s):
try:
int(s, 16)
return True
except ValueError:
return False

I know the op mentioned regular expressions, but I wanted to contribute such a solution for completeness' sake:
def is_hex(s):
return re.fullmatch(r"^[0-9a-fA-F]$", s or "") is not None
Performance
In order to evaluate the performance of the different solutions proposed here, I used Python's timeit module. The input strings are generated randomly for three different lengths, 10, 100, 1000:
s=''.join(random.choice('0123456789abcdef') for _ in range(10))
Levon's solutions:
# int(s, 16)
10: 0.257451018987922
100: 0.40081690801889636
1000: 1.8926858339982573
# all(_ in string.hexdigits for _ in s)
10: 1.2884491360164247
100: 10.047717947978526
1000: 94.35805322701344
Other answers are variations of these two. Using a regular expression:
# re.fullmatch(r'^[0-9a-fA-F]$', s or '')
10: 0.725040541990893
100: 0.7184272820013575
1000: 0.7190397029917222
Picking the right solution thus depends on the length on the input string and whether exceptions can be handled safely. The regular expression certainly handles large strings much faster (and won't throw a ValueError on overflow), but int() is the winner for shorter strings.

One more simple and short solution based on transformation of string to set and checking for subset (doesn't check for '0x' prefix):
import string
def is_hex_str(s):
return set(s).issubset(string.hexdigits)
More information here.

Another option:
def is_hex(s):
hex_digits = set("0123456789abcdef")
for char in s:
if not (char in hex_digits):
return False
return True

Most of the solutions proposed above do not take into account that any decimal integer may be also decoded as hex because decimal digits set is a subset of hex digits set. So Python will happily take 123 and assume it's 0123 hex:
>>> int('123',16)
291
This may sound obvious but in most cases you'll be looking for something that was actually hex-encoded, e.g. a hash and not anything that can be hex-decoded. So probably a more robust solution should also check for an even length of the hex string:
In [1]: def is_hex(s):
...: try:
...: int(s, 16)
...: except ValueError:
...: return False
...: return len(s) % 2 == 0
...:
In [2]: is_hex('123')
Out[2]: False
In [3]: is_hex('f123')
Out[3]: True

This will cover the case if the string starts with '0x' or '0X': [0x|0X][0-9a-fA-F]
d='0X12a'
all(c in 'xX' + string.hexdigits for c in d)
True

In Python3, I tried:
def is_hex(s):
try:
tmp=bytes.fromhex(hex_data).decode('utf-8')
return ''.join([i for i in tmp if i.isprintable()])
except ValueError:
return ''
It should be better than the way: int(x, 16)

Using Python you are looking to determine True or False, I would use eumero's is_hex method over Levon's method one. The following code contains a gotcha...
if int(input_string, 16):
print 'it is hex'
else:
print 'it is not hex'
It incorrectly reports the string '00' as not hex because zero evaluates to False.

Since all the regular expression above took about the same amount of time, I would guess that most of the time was related to converting the string to a regular expression. Below is the data I got when pre-compiling the regular expression.
int_hex
0.000800 ms 10
0.001300 ms 100
0.008200 ms 1000
all_hex
0.003500 ms 10
0.015200 ms 100
0.112000 ms 1000
fullmatch_hex
0.001800 ms 10
0.001200 ms 100
0.005500 ms 1000

Simple solution in case you need a pattern to validate prefixed hex or binary along with decimal
\b(0x[\da-fA-F]+|[\d]+|0b[01]+)\b
Sample: https://regex101.com/r/cN4yW7/14
Then doing int('0x00480065006C006C006F00200077006F0072006C00640021', 0) in python gives
6896377547970387516320582441726837832153446723333914657
The base 0 invokes prefix guessing behaviour.
This has saved me a lot of hassle. Hope it helps!

Most of the solution are not properly in checking string with prefix 0x
>>> is_hex_string("0xaaa")
False
>>> is_hex_string("0x123")
False
>>> is_hex_string("0xfff")
False
>>> is_hex_string("fff")
True

Here's my solution:
def to_decimal(s):
'''input should be int10 or hex'''
isString = isinstance(s, str)
if isString:
isHex = all(c in string.hexdigits + 'xX' for c in s)
return int(s, 16) if isHex else int(s)
else:
return int(hex(s), 16)
a = to_decimal(12)
b = to_decimal(0x10)
c = to_decimal('12')
d = to_decimal('0x10')
print(a, b, c, d)

Related

Decode emoji into two (or more) code points, using standard libraries

I'd like to be able to decode an emoji into its corresponding code points as seen here. I'm limited to using standard libraries in 2.7.
For example:
🇲🇩 -> U+1F1F2 U+1F1E9
I've managed to get the first code point using this code, but I can't figure out how to pull the second. Some emoji have even more code points.
to_decode = u'🇲🇩'
code = ord(to_decode[0])
if 0xd800 <= code <= 0xdbff:
code = (code - 0xd800) * 1024 + (ord(to_decode[1]) - 0xdc00) + + 0x010000
print(hex(code))
A combination of encode and struct.unpack can give you what you need.
>>> import struct
>>> b = to_decode.encode('utf_32_le')
>>> count = len(b) // 4
>>> count
2
>>> cp = struct.unpack('<%dI' % count, b)
>>> [hex(x) for x in cp]
['0x1f1f2', '0x1f1e9']
This is sort of an hack, but you can use the repr of the unicode string:
>>> repr(to_decode)
"u'\\U0001f1f2\\U0001f1e9'"
so:
>>> hex(int(repr(to_decode)[4:12], 16))
'0x1f1f2'
and
>>> hex(int(repr(to_decode)[14:22], 16))
'0x1f1e9'
You must extend this method to support emojis with more than two code points. You may consider using a combination of the above with .split("\\U").
For this problem, you actually need list() which will break a Unicode character into its constituent code points
to_decode = u'🇲🇩'
list(to_decode)
['🇲', '🇩']
As an example of what you can do with this, I created a unicode visualization of the Bengali Alphabet
https://www.kaggle.com/jamesmcguigan/unicode-visualization-of-the-bengali-alphabet

Using strings and byte-like objects compatibly in code to run in both Python 2 & 3

I'm trying to modify the code shown far below, which works in Python 2.7.x, so it will also work unchanged in Python 3.x. However I'm encountering the following problem I can't solve in the first function, bin_to_float() as shown by the output below:
float_to_bin(0.000000): '0'
Traceback (most recent call last):
File "binary-to-a-float-number.py", line 36, in <module>
float = bin_to_float(binary)
File "binary-to-a-float-number.py", line 9, in bin_to_float
return struct.unpack('>d', bf)[0]
TypeError: a bytes-like object is required, not 'str'
I tried to fix that by adding a bf = bytes(bf) right before the call to struct.unpack(), but doing so produced its own TypeError:
TypeError: string argument without an encoding
So my questions are is it possible to fix this issue and achieve my goal? And if so, how? Preferably in a way that would work in both versions of Python.
Here's the code that works in Python 2:
import struct
def bin_to_float(b):
""" Convert binary string to a float. """
bf = int_to_bytes(int(b, 2), 8) # 8 bytes needed for IEEE 754 binary64
return struct.unpack('>d', bf)[0]
def int_to_bytes(n, minlen=0): # helper function
""" Int/long to byte string. """
nbits = n.bit_length() + (1 if n < 0 else 0) # plus one for any sign bit
nbytes = (nbits+7) // 8 # number of whole bytes
bytes = []
for _ in range(nbytes):
bytes.append(chr(n & 0xff))
n >>= 8
if minlen > 0 and len(bytes) < minlen: # zero pad?
bytes.extend((minlen-len(bytes)) * '0')
return ''.join(reversed(bytes)) # high bytes at beginning
# tests
def float_to_bin(f):
""" Convert a float into a binary string. """
ba = struct.pack('>d', f)
ba = bytearray(ba)
s = ''.join('{:08b}'.format(b) for b in ba)
s = s.lstrip('0') # strip leading zeros
return s if s else '0' # but leave at least one
for f in 0.0, 1.0, -14.0, 12.546, 3.141593:
binary = float_to_bin(f)
print('float_to_bin(%f): %r' % (f, binary))
float = bin_to_float(binary)
print('bin_to_float(%r): %f' % (binary, float))
print('')
To make portable code that works with bytes in both Python 2 and 3 using libraries that literally use the different data types between the two, you need to explicitly declare them using the appropriate literal mark for every string (or add from __future__ import unicode_literals to top of every module doing this). This step is to ensure your data types are correct internally in your code.
Secondly, make the decision to support Python 3 going forward, with fallbacks specific for Python 2. This means overriding str with unicode, and figure out methods/functions that do not return the same types in both Python versions should be modified and replaced to return the correct type (being the Python 3 version). Do note that bytes is a reserved word, too, so don't use that.
Putting this together, your code will look something like this:
import struct
import sys
if sys.version_info < (3, 0):
str = unicode
chr = unichr
def bin_to_float(b):
""" Convert binary string to a float. """
bf = int_to_bytes(int(b, 2), 8) # 8 bytes needed for IEEE 754 binary64
return struct.unpack(b'>d', bf)[0]
def int_to_bytes(n, minlen=0): # helper function
""" Int/long to byte string. """
nbits = n.bit_length() + (1 if n < 0 else 0) # plus one for any sign bit
nbytes = (nbits+7) // 8 # number of whole bytes
ba = bytearray(b'')
for _ in range(nbytes):
ba.append(n & 0xff)
n >>= 8
if minlen > 0 and len(ba) < minlen: # zero pad?
ba.extend((minlen-len(ba)) * b'0')
return u''.join(str(chr(b)) for b in reversed(ba)).encode('latin1') # high bytes at beginning
# tests
def float_to_bin(f):
""" Convert a float into a binary string. """
ba = struct.pack(b'>d', f)
ba = bytearray(ba)
s = u''.join(u'{:08b}'.format(b) for b in ba)
s = s.lstrip(u'0') # strip leading zeros
return (s if s else u'0').encode('latin1') # but leave at least one
for f in 0.0, 1.0, -14.0, 12.546, 3.141593:
binary = float_to_bin(f)
print(u'float_to_bin(%f): %r' % (f, binary))
float = bin_to_float(binary)
print(u'bin_to_float(%r): %f' % (binary, float))
print(u'')
I used the latin1 codec simply because that's what the byte mappings are originally defined, and it seems to work
$ python2 foo.py
float_to_bin(0.000000): '0'
bin_to_float('0'): 0.000000
float_to_bin(1.000000): '11111111110000000000000000000000000000000000000000000000000000'
bin_to_float('11111111110000000000000000000000000000000000000000000000000000'): 1.000000
float_to_bin(-14.000000): '1100000000101100000000000000000000000000000000000000000000000000'
bin_to_float('1100000000101100000000000000000000000000000000000000000000000000'): -14.000000
float_to_bin(12.546000): '100000000101001000101111000110101001111110111110011101101100100'
bin_to_float('100000000101001000101111000110101001111110111110011101101100100'): 12.546000
float_to_bin(3.141593): '100000000001001001000011111101110000010110000101011110101111111'
bin_to_float('100000000001001001000011111101110000010110000101011110101111111'): 3.141593
Again, but this time under Python 3.5)
$ python3 foo.py
float_to_bin(0.000000): b'0'
bin_to_float(b'0'): 0.000000
float_to_bin(1.000000): b'11111111110000000000000000000000000000000000000000000000000000'
bin_to_float(b'11111111110000000000000000000000000000000000000000000000000000'): 1.000000
float_to_bin(-14.000000): b'1100000000101100000000000000000000000000000000000000000000000000'
bin_to_float(b'1100000000101100000000000000000000000000000000000000000000000000'): -14.000000
float_to_bin(12.546000): b'100000000101001000101111000110101001111110111110011101101100100'
bin_to_float(b'100000000101001000101111000110101001111110111110011101101100100'): 12.546000
float_to_bin(3.141593): b'100000000001001001000011111101110000010110000101011110101111111'
bin_to_float(b'100000000001001001000011111101110000010110000101011110101111111'): 3.141593
It's a lot more work, but in Python3 you can more clearly see that the types are done as proper bytes. I also changed your bytes = [] to a bytearray to more clearly express what you were trying to do.
I had a different approach from #metatoaster's answer. I just modified int_to_bytes to use and return a bytearray:
def int_to_bytes(n, minlen=0): # helper function
""" Int/long to byte string. """
nbits = n.bit_length() + (1 if n < 0 else 0) # plus one for any sign bit
nbytes = (nbits+7) // 8 # number of whole bytes
b = bytearray()
for _ in range(nbytes):
b.append(n & 0xff)
n >>= 8
if minlen > 0 and len(b) < minlen: # zero pad?
b.extend([0] * (minlen-len(b)))
return bytearray(reversed(b)) # high bytes at beginning
This seems to work without any other modifications under both Python 2.7.11 and Python 3.5.1.
Note that I zero padded with 0 instead of '0'. I didn't do much testing, but surely that's what you meant?
In Python 3, integers have a to_bytes() method that can perform the conversion in a single call. However, since you asked for a solution that works on Python 2 and 3 unmodified, here's an alternative approach.
If you take a detour via hexadecimal representation, the function int_to_bytes() becomes very simple:
import codecs
def int_to_bytes(n, minlen=0):
hex_str = format(n, "0{}x".format(2 * minlen))
return codecs.decode(hex_str, "hex")
You might need some special case handling to deal with the case when the hex string gets an odd number of characters.
Note that I'm not sure this works with all versions of Python 3. I remember that pseudo-encodings weren't supported in some 3.x version, but I don't remember the details. I tested the code with Python 3.5.

Hashing same character multiple times

I'm doing a programming challenge and I'm going crazy with one of the challenges. In the challenge, I need to compute the MD5 of a string. The string is given in the following form:
n[c]: Where n is a number and c is a character. For example: b3[a2[c]] => baccaccacc
Everything went ok until I was given the following string:
1[2[3[4[5[6[7[8[9[10[11[12[13[a]]]]]]]]]]]]]
This strings turns into a string with 6227020800 a's. This string is more than 6GB, so it's nearly impossible to compute it in practical time. So, here is my question:
Are there any properties of MD5 that I can use here?
I know that there has to be a form to make it in short time, and I suspect it has to be related to the fact that all the string has is the same character repeated multiple times.
You probably have created a (recursive) function to produce the result as a single value. Instead you should use a generator to produce the result as a stream of bytes. These you can then feed byte by byte into your MD5 hash routine. The size of the stream does not matter this way, it will just have an impact on the computation time, not on the memory used.
Here's an example using a single-pass parser:
import re, sys, md5
def p(s, pos, callBack):
while pos < len(s):
m = re.match(r'(d+)[', s[pos:])
if m: # repetition?
number = m.group(1)
for i in range(int(number)):
endPos = p(s, pos+len(number)+1, callBack)
pos = endPos
elif s[pos] == ']':
return pos + 1
else:
callBack(s[pos])
pos += 1
return pos + 1
digest = md5.new()
def feed(s):
digest.update(s)
sys.stdout.write(s)
sys.stdout.flush()
end = p(sys.argv[1], 0, feed)
print
print "MD5:", digest.hexdigest()
print "finished parsing input at pos", end
All hash functions are designed to work with byte streams, so you should not first generate the whole string, and after that hash it - you should write generator, which produces chunks of string data, and feed it to MD5 context.
And, MD5 uses 64-byte (or char) buffer so it would be a good idea to feed 64-byte chunks of data to the context.
Take advantage of the good properties of hashes:
import hashlib
cruncher = hashlib.md5()
chunk = 'a' * 100
for i in xrange(100000):
cruncher.update(chunk)
print cruncher.hexdigest()
Tweak the number of rounds (x = 10000) and the length of the chunk (y = 100) so that x * y = 13!. The point is that your are feeding the algorithm with chunks of your string (each one x characters long), one after the other, for y times.

Unpack format characters in Python

I need the Python analog for this Perl string:
unpack("nNccH*", string_val)
I need the nNccH* - data format in Python format characters.
In Perl it unpack binary data to five variables:
16 bit value in "network" (big-endian)
32 bit value in "network" (big-endian)
Signed char (8-bit integer) value
Signed char (8-bit integer) value
Hexadecimal string, high nibble first
But I can't do it in Python
More:
bstring = ''
while DataByte = client[0].recv(1):
bstring += DataByte
print len(bstring)
if len(bstring):
a, b, c, d, e = unpack("nNccH*", bstring)
I never wrote in Perl or Python, but my current task is to write a multithreading Python server that was written in Perl...
The Perl format "nNcc" is equivalent to the Python format "!HLbb".
There is no direct equivalent in Python for Perl's "H*".
There are two problems.
Python's struct.unpack does not accept the wildcard character, *
Python's struct.unpack does not "hexlify" data strings
The first problem can be worked-around using a helper function like unpack.
The second problem can be solved using binascii.hexlify:
import struct
import binascii
def unpack(fmt, data):
"""
Return struct.unpack(fmt, data) with the optional single * in fmt replaced with
the appropriate number, given the length of data.
"""
# http://stackoverflow.com/a/7867892/190597
try:
return struct.unpack(fmt, data)
except struct.error:
flen = struct.calcsize(fmt.replace('*', ''))
alen = len(data)
idx = fmt.find('*')
before_char = fmt[idx-1]
n = (alen-flen)//struct.calcsize(before_char)+1
fmt = ''.join((fmt[:idx-1], str(n), before_char, fmt[idx+1:]))
return struct.unpack(fmt, data)
data = open('data').read()
x = list(unpack("!HLbbs*", data))
# x[-1].encode('hex') works in Python 2, but not in Python 3
x[-1] = binascii.hexlify(x[-1])
print(x)
When tested on data produced by this Perl script:
$line = pack("nNccH*", 1, 2, 10, 4, '1fba');
print "$line";
The Python script yields
[1, 2, 10, 4, '1fba']
The equivalent Python function you're looking for is struct.unpack. Documentation of the format string is here: http://docs.python.org/library/struct.html
You will have a better chance of getting help if you actually explain what kind of unpacking you need. Not everyone knows Perl.

Efficient string to hex function

I'm using an old version of python on an embedded platform ( Python 1.5.2+ on Telit platform ). The problem that I have is my function for converting a string to hex. It is very slow. Here is the function:
def StringToHexString(s):
strHex=''
for c in s:
strHex = strHex + hexLoookup[ord(c)]
return strHex
hexLookup is a lookup table (a python list) containing all the hex representation of each character.
I am willing to try everything (a more compact function, some language tricks I don't know about). To be more clear here are the benchmarks (resolution is 1 second on that platform):
N is the number of input characters to be converted to hex and the time is in seconds.
N | Time (seconds)
50 | 1
150 | 3
300 | 4
500 | 8
1000 | 15
1500 | 23
2000 | 31
Yes, I know, it is very slow... but if I could gain something like 1 or 2 seconds it would be a progress.
So any solution is welcomed, especially from people who know about python performance.
Thanks,
Iulian
PS1: (after testing the suggestions offered - keeping the ord call):
def StringToHexString(s):
hexList=[]
hexListAppend=hexList.append
for c in s:
hexListAppend(hexLoookup[ord(c)])
return ''.join(hexList)
With this function I obtained the following times: 1/2/3/5/12/19/27 (which is clearly better)
PS2 (can't explain but it's blazingly fast) A BIG thank you Sven Marnach for the idea !!!:
def StringToHexString(s):
return ''.join( map(lambda param:hexLoookup[param], map(ord,s) ) )
Times:1/1/2/3/6/10/12
Any other ideas/explanations are welcome!
Make your hexLoookup a dictionary indexed by the characters themselves, so you don't have to call ord each time.
Also, don't concatenate to build strings – that used to be slow. Use join on a list instead.
from string import join
def StringToHexString(s):
strHex = []
for c in s:
strHex.append(hexLoookup[c])
return join(strHex, '')
Building on Petr Viktorin's answer, you could further improve the performance by avoiding global vairable and attribute look-ups in favour of local variable look-ups. Local variables are optimized to avoid a dictionary look-up on each access. (They haven't always been, by I just double-checked this optimization was already in place in 1.5.2, released in 1999.)
from string import join
def StringToHexString(s):
strHex = []
strHexappend = strHex.append
_hexLookup = hexLoookup
for c in s:
strHexappend(_hexLoookup[c])
return join(strHex, '')
Constantly reassigning and adding strings together using the + operator is very slow. I guess that Python 1.5.2 isn't yet optimizing for this. So using string.join() would be preferable.
Try
import string
def StringToHexString(s):
listhex = []
for c in s:
listhex.append(hexLookup[ord(c)])
return string.join(listhex, '')
and see if that is any faster.
Try:
from string import join
def StringToHexString(s):
charlist = []
for c in s:
charlist.append(hexLoookup[ord(c)])
return join(charlist, '')
Each string addition takes time proportional to the length of the string so, while join will also take time proportional to the length of the entire string, but you only have to do it once.
You could also make hexLookup a dict mapping characters to hex values, so you don't have to call ord for every character. It's a micro-optimization, so probably won't be significant.
def StringToHexString(s):
return ''.join( map(lambda param:hexLoookup[param], map(ord,s) ) )
Seems like this is the fastest! Thank you Sven Marnach!

Categories