Converting "0x08h, 0x8ah" to [int,int] in Python

I've got a string like x='0x08h, 0x0ah' in Python and want to convert it to [8,10] (as unsigned ints). I could split and index it like [int(a[-3:-1],16) for a in x.split(', ')], but is there a better way to convert it to a list of ints?
Would it matter if I had y='080a'?
Edit (for plus points :)): what (sane) string-based hexadecimal notations have python support, and which not?

You really have to know what the pattern you're trying to parse is, before you write a parser.
But it looks like your pattern is: optional 0x, then hex digits, then optional h. At least that's the most reasonable thing I can come up with that handles both '0x08h' and '080a'. So:
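def parse_hex(s):
    # Slice off the prefix: lstrip('0x') strips any leading '0'/'x' characters, so '0x00h' would end up empty
    digits = s[2:] if s[:2].lower() == '0x' else s
    return int(digits.rstrip('h'), 16)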
Then:
numbers = [parse_hex(s) for s in x.split(', ')]
Of course you don't actually need to remove the 0x prefix, because Python accepts that as part of a hex string, so you could write it as:
def parse_hex(s):
    return int(s.rstrip('h'), 16)
However, I think the intention is clearer if you're more explicit.
From your edit:
edit what (sane) string-based hexadecimal notations have python support, and which not?
See the documentation for int:
Base-2, -8, and -16 literals can be optionally prefixed with 0b/0B, 0o/0O, or 0x/0X, as with integer literals in code.
That's it. (If you read the rest of the paragraph, if you're guaranteed to have 0x/0X, you don't have to explicitly use base=16; base=0 will infer it. But that doesn't help you here, so that one sentence is really all you need.) The docs on Numeric Types and Numeric literals detail exactly what "as with integer literals in code" means; the only surprising things there are that negative numbers aren't literals, complex numbers aren't literals (but pure imaginary numbers are), and non-ASCII digits can be used but the documentation doesn't explain how.
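As a quick illustration of the prefix rules quoted above, here is a minimal sketch. The accepts helper is a hypothetical name introduced only for this example; everything else is the built-in int.

```python
# int() accepts an optional base prefix when it matches the base (or when base is 0).
print(int("0x0a", 16))   # prefix allowed with base 16
print(int("0a", 16))     # prefix is optional
print(int("0x0a", 0))    # base 0: infer the base from the prefix
print(int("0b1010", 0))
print(int("0o12", 0))


def accepts(text, base):
    """Hypothetical helper: True if int(text, base) parses, False on ValueError."""
    try:
        int(text, base)
        return True
    except ValueError:
        return False


# A trailing 'h' or a leading '$' are not notations int() knows about,
# so they have to be stripped by hand before calling int().
print(accepts("0x08h", 16))
print(accepts("$0a", 16))
```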

You can also use map (on Python 3, wrap it in list() to get a list rather than an iterator): list(map(lambda s: int(s.lower().replace('0x', '').replace('h', ''), 16), x.split(', ')))

Related

Uniquely encode any ASCII string into a string that uses a subset of ASCII

For this question, please assume python, but it doesn't necessarily matter.
Imagine you have an arbitrary ASCII string, for example:
jrioj4oi3m_=\.,ei9#
Sparing the extensive details, I need to pass this string as a "label" on to another program, but that program doesn't support "labels" containing "special characters" or even numbers. So I'm trying to encode an ASCII string into a string that uses an arbitrary subset of ASCII.
One very naive solution would be to convert the original string into binary, then convert 0s into "a" and 1s into "b". This works to solve my problem, but I would like to learn a better solution here, to become a better programmer.
First of all, what exactly is this problem called?
This is not exactly a hashing problem, because IIRC hashing generally involves encoding into a string that is shorter than the original, and involves collisions.
I need no collisions, and I don't really care how long the encoded string is, as long as it's shorter than the naive case. (Ideally it would be the shortest length possible given the subset)
In fact, it would be ideal to specify exactly what the allowed character set is, then use a generalized encoding algorithm to do the encoding.
Decoding would be nice to know also.
A simple solution would be to first convert to a hex encoding:
jrioj4oi3m_=.,ei9# => 6a72696f6a346f69336d5f3d2e2c65693923
and then translate any numbers into non-hex letters:
6a72696f6a346f69336d5f3d2e2c65693923 => waxswzwfwatuwfwzttwdvftdsescwvwztzst
So the output string would always be exactly twice the length of the input string and only ever contain characters in the range a-z.
This can be easily achieved in python like this:
>>> enc = str.maketrans('0123456789', 'qrstuvwxyz')
>>> dec = str.maketrans('qrstuvwxyz', '0123456789')
>>> s = 'jrioj4oi3m_=.,ei9#'
>>> x = s.encode('ascii').hex().translate(enc)
>>> x
'waxswzwfwatuwfwzttwdvftdsescwvwztzst'
>>> bytes.fromhex(x.translate(dec)).decode('ascii')
'jrioj4oi3m_=.,ei9#'
Interestingly, this actually turns out to be a really simple and common math problem: base conversion. As a programmer, you probably know, at least in theory, how to convert between base 2, 10, and 16 representations of a value. There are 95 printable ASCII characters (codes 32 to 126, counting the space), so any printable ASCII string can be considered a base 95 representation of a (probably very large) value. If your label only accepts 64 characters (uppercase, lowercase, digits, and 2 others, for instance), then you simply need to convert your base 95 representation into a base 64 representation of the same value.
Decoding is simply converting your base 64 representation back to the base 96 representation.
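The base-conversion idea can be sketched roughly as follows. This is only one way to do it, under some assumptions: it treats the encoded bytes as a base 256 number (rather than base 95) to keep the code short, it defaults to the lowercase letters as the allowed alphabet, and it prepends a sentinel byte so that leading zero bytes are not lost. The names encode and decode are made up for this example.

```python
import string


def encode(text, alphabet=string.ascii_lowercase):
    """Re-express an ASCII string using only the characters in `alphabet`."""
    # The leading 0x01 sentinel byte keeps leading NUL bytes from vanishing
    # when the byte string is read as one big integer.
    n = int.from_bytes(b"\x01" + text.encode("ascii"), "big")
    base = len(alphabet)
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
    return "".join(reversed(digits))


def decode(label, alphabet=string.ascii_lowercase):
    """Inverse of encode(); `alphabet` must be the same one used to encode."""
    base = len(alphabet)
    n = 0
    for ch in label:
        n = n * base + alphabet.index(ch)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:].decode("ascii")  # drop the sentinel byte


s = "jrioj4oi3m_=.,ei9#"
label = encode(s)
print(label, len(label))
print(decode(label) == s)
```

With a 26-letter alphabet each output character carries about 4.7 bits, so the label comes out shorter than the two-characters-per-byte hex approach. Passing a larger alphabet (any string of distinct allowed characters) shortens it further.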

Python's newly featured numeric literal (eg; 234_432) doesn't work with .isdigit()?

I always understood that if a string can be converted to an integer (i.e. it is the string representation of a number), isdigit() returns True. This is not the case with the new feature. Here is the sample below:
Code Sample
But why?
To answer your question, look at the Python 3.6 documentation for the isdigit method:
Return true if all characters in the string are digits and there is at least one character, false otherwise.
Since an underscore isn't a digit, the new format will not work with the current implementation of isdigit. As I commented before, the immediate workaround, if you want to avoid a try-except block around int, would be s.replace("_", "").isdigit(), where s is the string containing the newly formatted number.
You also need to strip the minus sign so that negative integers work as well: s.replace("_", "").lstrip("-").isdigit().
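A small sketch comparing that workaround with what int() itself accepts (Python 3.6 or later is assumed; looks_like_int and parses_as_int are names invented for this example). The string check is looser than int(): int() only allows single underscores between digits and a single sign.

```python
def looks_like_int(text):
    """The workaround above: drop underscores and leading minus signs, then isdigit()."""
    return text.replace("_", "").lstrip("-").isdigit()


def parses_as_int(text):
    """Ground truth: does int() actually accept the string?"""
    try:
        int(text)
        return True
    except ValueError:
        return False


for candidate in ["234_432", "-1_000", "1__0", "--5", "12a"]:
    print(candidate, looks_like_int(candidate), parses_as_int(candidate))
```

The two functions disagree on inputs such as "1__0" and "--5", so the try-except version is the safer choice when the input is not trusted.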

Why is it not possible to convert "1.7" to integer directly, without converting to float first?

When I type int("1.7") Python raises an error (specifically, ValueError). I know that I can convert it to an integer with int(float("1.7")). I would like to know why the first method raises an error.
From the documentation:
If x is not a number or if base is given, then x must be a string or Unicode object representing an integer literal in radix base ...
Obviously, "1.7" does not represent an integer literal in radix base.
If you want to know why the Python devs decided to limit themselves to integer literals in radix base, there are possibly an infinite number of reasons, and you'd have to ask Guido et al. to know for sure. One guess would be ease of implementation plus efficiency. You might think it would be easy for them to implement it as:
1. Interpret the number as a float
2. Truncate to an integer
Unfortunately, that doesn't work in Python, as integers can have arbitrary precision and floats cannot. Special-casing big numbers could lead to inefficiency for the common case (see footnote 1).
Additionally, forcing you to write int(float(...)) has the added benefit of clarity: it makes it more obvious what the input string probably looks like, which can help in debugging elsewhere. In fact, I might argue that even if int accepted strings like "1.7", it'd be better to write int(float("1.7")) anyway for the increased code clarity.
Footnote 1: Assuming some validation. Other languages skip this, e.g. Ruby will evaluate '1e6'.to_i and give you 1, since it stops parsing at the first non-integral character. Seems like that could lead to fun bugs to track down ...
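The precision argument can be seen with a short sketch (the literal below is an arbitrary 23-digit value picked for this illustration): going through float silently changes a large integer, which is why int() cannot simply be defined as "parse as float, then truncate".

```python
big = "12345678901234567890123"   # arbitrary large integer, chosen for illustration

exact = int(big)            # arbitrary-precision parse, no information lost
via_float = int(float(big)) # a float keeps only about 53 bits of mantissa

print(exact)
print(via_float)
print(exact == via_float)   # False

# For small values the explicit two-step conversion is fine:
print(int(float("1.7")))    # 1
```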
We have a good, obvious idea of what "make an int out of this float" means because we think of a float as two parts and we can throw one of them away.
It's not so obvious when we have a string. Make this string into a float implies all kinds of subtle things about the contents of the string, and that is not the kind of thing a sane person wants to see in code where the value is not obvious.
So the short answer is: Python likes obvious things and discourages magic.
Here is a good description of why you cannot do this found in the python documentation.
https://docs.python.org/2/library/functions.html#int
If x is not a number or if base is given, then x must be a string or Unicode object representing an integer literal in radix base. Optionally, the literal can be preceded by + or - (with no space in between) and surrounded by whitespace. A base-n literal consists of the digits 0 to n-1, with a to z (or A to Z) having values 10 to 35. The default base is 10. The allowed values are 0 and 2-36. Base-2, -8, and -16 literals can be optionally prefixed with 0b/0B, 0o/0O/0, or 0x/0X, as with integer literals in code. Base 0 means to interpret the string exactly as an integer literal, so that the actual base is 2, 8, 10, or 16.
Basically to typecast to an integer from a string, the string must not contain a "."
Breaks backwards-compatibility. It is certainly possible, however this would be a terrible idea since it would break backwards-compatibility with the very old and well-established Python idiom of relying on a try...except ladder ("Easier to ask forgiveness than permission") to determine the type of the string's contents. This idiom has been around and used since at least Python 1.5, AFAIK; here are two citations: [1] [2]
s = "foo12.7"
# s = "-12.7"
# s = -12
try:
    n = int(s)  # raises ValueError if s is not an integer literal
    print("Do integer stuff with", n)
except ValueError:
    try:
        f = float(s)  # raises ValueError if s is not a float either
        print("Do float stuff with", f)
    except ValueError:
        print("Handle case for when s is neither float nor integer")
        raise  # if you want to reraise the exception
And another minor thing: it's not just about whether the number contains '.'. Scientific notation, or arbitrary letters, could also break the int-ness of the string.
Examples: "6e7" is not an integer literal in base 10, so int("6e7") raises ValueError. However, int("6e7", 16) = 1767 is an integer in base 16 (and the string is valid in any base >= 15). But int("6e-7") is never an int.
(And if you expand the base to base-36, any legal alphanumeric string (or Unicode) can be interpreted as representing an integer, but doing that by default would generally be a terrible behavior, since "dog" or "cat" are unlikely to be references to integers).
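The examples above can be checked with a short sketch. The try_int helper is a name invented here; it returns None instead of raising so the cases can be printed side by side.

```python
def try_int(text, base=10):
    """Hypothetical helper: int(text, base), or None if int() raises ValueError."""
    try:
        return int(text, base)
    except ValueError:
        return None


print(try_int("6e7"))        # None: 'e' is not a base-10 digit
print(try_int("6e7", 16))    # 1767
print(try_int("6e7", 14))    # None: 'e' (value 14) needs a base of at least 15
print(try_int("6e-7", 16))   # None: '-' is only allowed as a leading sign
print(try_int("dog", 36))    # parses, which is rarely what anyone wants by default
print(int(float("6e7")))     # 60000000: scientific notation is float syntax
```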

Alternative ways for binary conversion in python

I often need to convert status code to bit representation in order to determine what error/status are active on analyzers using plain-text or binary communication protocol.
I use Python to poll data and to parse it. Sometimes I get confused because there are so many ways to solve a problem. Today I had to convert a string where each character is a hexadecimal digit to its binary representation. That is, each hexadecimal character must be converted into 4 bits, with the MSB on the left. Note: I need a char-by-char conversion, with leading zeros.
I managed to build the following function, which does the trick in a quasi one-liner fashion.
import math

def convertStatus(s, base=16):
    n = int(math.log2(base))  # bits per digit (4 for hexadecimal)
    b = "".join(["{{:0>{}b}}".format(n).format(int(x, base)) for x in s])
    return b
E.g., this converts the following input:
0123456789abcdef
into:
0000000100100011010001010110011110001001101010111100110111101111
Which was my goal.
Now, I am wondering what other elegant solutions I could have used to reach my goal. I would also like to better understand the advantages and drawbacks of each solution. The function signature can be changed, but usually it is a string for input and output. Let's get imaginative...
This is simple in two steps:
Converting a string to an int is almost trivial: use int(aString, base=...)
the first parameter can be a string!
and with base, almost every option is possible
Converting a number to a string is easy with format() and the format specification mini-language
So converting hex strings to binary can be done as
def h2b(x):
    val = int(x, base=16)
    return format(val, 'b')
Here the two steps are explicit. It may be better to do it in one line, or even inline.
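One caveat: format(val, 'b') drops leading zeros, while the question asks for 4 bits per hex character. A minimal sketch of two variants that keep them, assuming the input is a non-empty string of hex digits (the function names are made up for this example):

```python
def hex_to_bits(hex_string):
    """Convert the whole string at once, zero-padded to 4 bits per hex digit."""
    width = 4 * len(hex_string)
    return format(int(hex_string, 16), "0{}b".format(width))


def hex_to_bits_per_char(hex_string):
    """Convert character by character; each hex digit becomes exactly 4 bits."""
    return "".join(format(int(ch, 16), "04b") for ch in hex_string)


status = "0123456789abcdef"
print(hex_to_bits(status))
print(hex_to_bits_per_char(status) == hex_to_bits(status))
```

The whole-string version does a single int() call, which is simpler; the per-character version raises on the first bad character and generalizes more easily to protocols where each character has its own meaning.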

Python .split() without 'u

In Python, if I have a string like:
a =" Hello - to - everybody"
And I do
a.split('-')
then I get
[u'Hello', u'to', u'everybody']
This is just an example.
How can I get a simple list without that annoying u'??
The u means that it's a unicode string - your original string must also have been a unicode string. Generally it's a good idea to keep strings Unicode as trying to convert to normal strings could potentially fail due to characters with no equivalent.
The u is purely used to let you know it's a unicode string in the representation - it will not affect the string itself.
In general, unicode strings work exactly as normal strings, so there should be no issue with leaving them as unicode strings.
In Python 3.x, unicode strings are the default, and don't have the u prepended (instead, bytes (the equivalent to old strings) are prepended with b).
If you really, really need to convert to a normal string (rarely the case, but potentially an issue if you are using an extension library that doesn't support unicode strings, for example), take a look at unicode.encode() and unicode.decode(). You can either do this before the split, or after the split using a list comprehension.
I have the opposite problem. The str '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀' needs to be split on the Unicode character \u3000. I mistakenly wrote split('\u'), which led to a Unicode syntax error.
I should have written split('\u3000').
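A minimal Python 3 sketch of that split. The escape must be complete: '\u' followed by fewer than four hex digits is a truncated escape and fails at compile time, whereas '\u3000' is the single ideographic-space character.

```python
title = "第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀"

# Split only on the ideographic space (U+3000); the ASCII space is left alone.
parts = title.split("\u3000")
print(parts)

# With no argument, split() breaks on any Unicode whitespace, including U+3000.
print(title.split())
```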
