String of text to unique integer method? - python

Is there a method that converts a string of text such as 'you' to a number other than
y = tuple('you')
for k in y:
    k = ord(k)
which only converts one character at a time?

In order to convert a string to a number (and the reverse), you should first always work with bytes. Since you are using Python 3, strings are actually Unicode strings and as such may contain characters that have an ord() value higher than 255. A bytes object, on the other hand, holds exactly one byte per item; so you should always convert between those two types first.
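For example (a quick illustration of my own), a single character whose ord() value is above 255 becomes several bytes when encoded:
>>> ord('你')
20320
>>> '你'.encode('utf-8')
b'\xe4\xbd\xa0'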
So basically, you are looking for a way to convert a bytes string (which is basically a list of bytes, a list of numbers 0–255) into a single number, and the inverse. You can use int.to_bytes and int.from_bytes for that:
import math

def convertToNumber(s):
    return int.from_bytes(s.encode(), 'little')

def convertFromNumber(n):
    return n.to_bytes(math.ceil(n.bit_length() / 8), 'little').decode()
>>> convertToNumber('foo bar baz')
147948829660780569073512294
>>> x = _
>>> convertFromNumber(x)
'foo bar baz'
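One caveat worth knowing (my own observation, not part of the original answer): because the byte length is recomputed from bit_length(), a string whose encoding ends in a zero byte does not survive the round trip:
>>> convertFromNumber(convertToNumber('a\x00'))
'a'
If that matters for your data, store the length separately or pad with a sentinel byte.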

Treat the string's bytes as the digits of a base-256 number (base 256 rather than 255, since each byte can take 256 distinct values):
from functools import reduce

y = 'you'  # any input string

# Reverse the digits to make reconstructing the string more efficient
digits = reversed(y.encode())  # iterating bytes yields ints in Python 3
n = reduce(lambda acc, b: acc * 256 + b, digits, 0)

new_y = b""
while n > 0:
    n, b = divmod(n, 256)
    new_y += bytes([b])
assert y == new_y.decode()
(Note this is essentially the same as poke's answer, but written explicitly rather than using available methods for converting between a byte string and an integer.)

You don't need to convert the string into a tuple. In your loop, k is overwritten on every iteration, so only the last value survives. Collect the items using something like a list comprehension:
>>> text = 'you'
>>> [ord(ch) for ch in text]
[121, 111, 117]
To get the text back, use chr, and join the characters using str.join:
>>> numbers = [ord(ch) for ch in text]
>>> ''.join(chr(n) for n in numbers)
'you'
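This also round-trips for non-ASCII characters (a quick illustration of my own):
>>> nums = [ord(ch) for ch in 'héllo']
>>> nums
[104, 233, 108, 108, 111]
>>> ''.join(chr(n) for n in nums)
'héllo'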

Though there are a number of ways to accomplish this task, I prefer the hashing approach because it has the following nice properties:
it ensures that the numbers you get are spread out approximately uniformly at random
it ensures that even a small change in your input string leads to a significant difference in the output integer
it is an irreversible process, i.e. you can't tell which string was the input based on the integer output
(Strictly speaking, hash collisions are possible, so the integer is not guaranteed to be unique; for a 128-bit digest such as MD5 they are astronomically unlikely.)
import hashlib

# There are a number of hashing functions you can pick from; they provide
# digests of different lengths and security levels.
hashing_func = hashlib.md5

# The lambda does three things:
# 1. hash a given string using the given algorithm
# 2. retrieve its hex digest
# 3. convert the hex digest to an integer
str2int = lambda s: int(hashing_func(s.encode()).hexdigest(), 16)
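A quick sanity check of the first two properties (my own illustration; the concrete integers depend on the digest, so they are not shown):
a = str2int('you')
b = str2int('yot')           # one character changed
assert a != b                # different digests (overwhelmingly likely)
assert str2int('you') == a   # deterministic: same input, same integer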
To see how uniformly the resulting integers are distributed, we first need a random string generator:
import string
import numpy as np

# candidate characters
letters = string.ascii_letters
# total number of candidates
L = len(letters)
# control the seed of the PRNG for reproducible results
prng = np.random.RandomState(1234)
# generate a random string of length 10
prng_string = lambda: "".join([letters[k] for k in prng.randint(0, L, size=10)])
Now we generate a sufficient number of random strings and obtain the corresponding integers:
ss = [prng_string() for x in range(50000)]
vv = np.array([str2int(s) for s in ss])
Let us check the randomness by comparing the theoretical mean and standard deviation of a uniform distribution with those we observe:
for max_num in [256, 512, 1024, 4096]:
    ints = vv % max_num
    print("distribution comparisons for max_num = {:4d}\n"
          "\t[theoretical] {:7.2f} +/- {:8.3f} | [observed] {:7.2f} +/- {:8.3f}".format(
              max_num, max_num / 2., np.sqrt(max_num**2 / 12), np.mean(ints), np.std(ints)))
Finally, you will see the results below, which indicate that the numbers you get are very close to uniformly distributed:
distribution comparisons for max_num =  256
    [theoretical]  128.00 +/-   73.901 | [observed]  127.21 +/-   73.755
distribution comparisons for max_num =  512
    [theoretical]  256.00 +/-  147.802 | [observed]  254.90 +/-  147.557
distribution comparisons for max_num = 1024
    [theoretical]  512.00 +/-  295.603 | [observed]  512.02 +/-  296.519
distribution comparisons for max_num = 4096
    [theoretical] 2048.00 +/- 1182.413 | [observed] 2048.67 +/- 1181.422
It is worth calling out that the other posted answers may not have these properties. For example, poke's convertToNumber solution gives:
distribution comparisons for max_num =  256
    [theoretical]  128.00 +/-   73.901 | [observed]   93.48 +/-   17.663
distribution comparisons for max_num =  512
    [theoretical]  256.00 +/-  147.802 | [observed]  220.71 +/-  129.261
distribution comparisons for max_num = 1024
    [theoretical]  512.00 +/-  295.603 | [observed]  477.67 +/-  277.651
distribution comparisons for max_num = 4096
    [theoretical] 2048.00 +/- 1182.413 | [observed] 1816.51 +/- 1059.643
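The reason for the skew is easy to see (my own note, not from the original answers): with the 'little' byte order, the lowest byte of convertToNumber's result is simply the first character of the string, so modulo 256 the value is confined to the ASCII-letter range 65-122 rather than the full 0-255:
>>> convertToNumber('zebra') % 256 == ord('z')
True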

I was trying to find a way to convert a numpy character array into a unique numeric array in order to do some other stuff. I have implemented the following functions, including the answers by @poke and @falsetrue (those methods were giving me some trouble when the strings were too large). I have also added the hash method (a hash is a fixed-size integer that identifies a particular value).
import numpy as np

def str_to_num(x):
    """Converts a string into a unique concatenated UNICODE representation.

    Args:
        x (string): input string

    Raises:
        ValueError: x must be a string
    """
    if isinstance(x, str):
        x = [str(ord(c)) for c in x]
        x = int(''.join(x))
    else:
        raise ValueError('x must be a string.')
    return x

def chr_to_num(x):
    return int.from_bytes(x.encode(), 'little')

def char_arr_to_num(arr, type='hash'):
    """Converts a character array into a unique hash representation.

    Args:
        arr (np.array): numpy character array.
    """
    if type == 'unicode':
        vec_fun = np.vectorize(str_to_num)
    elif type == 'byte':
        vec_fun = np.vectorize(chr_to_num)
    elif type == 'hash':
        vec_fun = np.vectorize(hash)
    out = np.apply_along_axis(vec_fun, 0, arr)
    out = out.astype(float)
    return out
a = np.array([['x', 'y', 'w'], ['x', 'z','p'], ['y', 'z', 'w'], ['x', 'w','y'], ['w', 'z', 'q']])
char_arr_to_num(a, type = 'unicode')
char_arr_to_num(a, type = 'byte')
char_arr_to_num(a, type = 'hash')
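One caveat I should add: since Python 3.3, str hashes are salted per interpreter process, so the type = 'hash' variant is not reproducible across runs unless the hash seed is pinned, e.g. (illustrative invocation):
# PYTHONHASHSEED=0 python my_script.py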

Related

int 111 to binary 111 (decimal 7)

Problem: take a number, for example 37 (binary 100101). Count the binary 1s, build a binary number made of that many 1s (here 111), and print the decimal value of that binary number (7).
num = bin(int(input()))
st = str(num)
count = 0
for i in st:
    if i == "1":
        count += 1
del st
vt = ""
for i in range(count):
    vt = vt + "1"
vt = int(vt)
print(vt)
I am a newbie and stuck here.
I wouldn't recommend your approach, but to show where you went wrong:
num = bin(int(input()))
st = str(num)
count = 0
for i in st:
    if i == "1":
        count += 1
del st
# start the string representation of the binary value correctly
vt = "0b"
for i in range(count):
    vt = vt + "1"
# tell the `int()` function that it should consider the string as a binary number (base 2)
vt = int(vt, 2)
print(vt)
Note that the code below does the exact same thing as yours, but a bit more concisely:
ones = bin(int(input())).count('1')
vt = int('0b' + '1' * ones, 2)
print(vt)
It uses the standard string method count() to get the number of ones into ones, and it uses Python's ability to repeat a string a number of times with the multiplication operator *.
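On Python 3.10 and newer you could also count the bits directly with int.bit_count(); a small variation of my own, with a guard for an input of 0 (which the versions above would choke on, since int('0b', 2) raises ValueError):
ones = int(input()).bit_count()
print(int('1' * ones, 2) if ones else 0)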
Try this once you have the required binary:
def binaryToDecimal(binary):
    decimal, i = 0, 0
    while binary != 0:
        dec = binary % 10
        decimal = decimal + dec * pow(2, i)
        binary = binary // 10
        i += 1
    print(decimal)
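For example, feeding it the all-ones binary from the problem statement (my own quick check):
>>> binaryToDecimal(111)
7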
In one line:
print(int(format(int(input()), 'b').count('1') * '1', 2))
Let's break it down, inside out:
format(int(input()), 'b')
This built-in function takes an integer number from the input, and returns a formatted string according to the Format Specification Mini-Language. In this case, the argument 'b' gives us a binary format.
Then, we have
.count('1')
This str method returns the total number of occurrences of '1' in the string returned by the format function.
In Python, you can multiply a string by a number n to get the same string repeated n times:
x = 'a' * 3
print(x) # prints 'aaa'
Thus, if we take the number returned by the count method and multiply the string '1' by it, we get a string containing exactly as many ones as our original input has in its binary representation. Now we can interpret this string as a binary number by converting it with base 2, like this:
int(number_string, 2)
So, we have
int(format(int(input()), 'b').count('1') * '1', 2)
Finally, let's print the whole thing:
print(int(format(int(input()), 'b').count('1') * '1', 2))

Python internal metric unit conversion function

I'm trying to build a function to do internal metric conversion in a wavelength-to-frequency conversion program, and I have been having a hard time getting it to behave properly. It is super slow and will not assign the correct labels to the output. If anyone can help with either a different method of computing this, or a reason why this is happening and any fixes I could do, that would be amazing!
def convert_SI_l(n):
    if n in range(int(1e-12), int(9e-11)):
        return n/0.000000000001, 'pm'
    else:
        if n in range(int(1e-10), int(9e-8)):
            return n/0.000000001, 'nm'
        else:
            if n in range(int(1e-7), int(9e-5)):
                return n/0.000001, 'um'
            else:
                if n in range(int(1e-4), int(9e-3)):
                    return n/0.001, 'mm'
                else:
                    if n in range(int(0.01), int(0.99)):
                        return n/0.01, 'cm'
                    else:
                        if n in range(1, 999):
                            return n/1000, 'm'
                        else:
                            if n in range(1000, 299792459):
                                return n/1000, 'km'
                            else:
                                return n, 'm'

def convert_SI_f(n):
    if n in range(1, 999):
        return n, 'Hz'
    else:
        if n in range(1000, 999999):
            return n/1000, 'kHz'
        else:
            if n in range(int(1e6), 999999999):
                return n/1e6, 'MHz'
            else:
                if n in range(int(1e9), int(1e13)):
                    return n/1e9, 'GHz'
                else:
                    return n, 'Hz'
c = 299792458
i = input("Are we starting with a frequency or a wavelength? ( F / L ): ")

# Error statements
if i.lower() == ("f"):
    True
else:
    if not i.lower() == ("l"):
        print("Error invalid input")

# Cases
if i.lower() == ("f"):
    f = float(input("Please input frequency (in Hz): "))
    size_l = c/f
    print(convert_SI_l(size_l))
if i.lower() == ("l"):
    l = float(input("Please input wavelength (in meters): "))
    size_f = (l/c)
    print(convert_SI_f(size_f))
You are using range() in a way that is close to how it is used in natural language, to express a contiguous segment of the real number line, as in in the range 4.5 to 5.25. But range() doesn't mean that in Python. It means a bunch of integers. So your floating-point values, even if they are in the range you specify, will not occur in the bunch of integers that the range() function generates.
Your first test is
if n in range( int(1e-12),int(9e-11)):
and I am guessing you wrote it like this because what you actually wanted was range(1e-12, 9e-11) but you got TypeError: 'float' object cannot be interpreted as an integer.
But if you do this at the interpreter prompt
>>> range(int(1e-12),int(9e-11))
range(0, 0)
>>> list(range(int(1e-12),int(9e-11)))
[]
you will see it means something quite different to what you obviously expect.
To test whether a floating-point number falls in a given range, do:
if lower_bound <= mynumber <= upper_bound:
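For instance (my own quick demonstration of the difference):
>>> n = 5e-11
>>> n in range(int(1e-12), int(9e-11))   # this is range(0, 0) -- always empty
False
>>> 1e-12 <= n <= 9e-11
True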
You don't need ranges, and your logic will be more robust if you base it on fixed threshold points that delimit the unit magnitude. This would typically be a unit of one in the given scale.
Here's a generalized approach to all unit scale determination:
SI_Length = [(1/1000000000000, "pm"),
             (1/1000000000, "nm"),
             (1/1000000, "um"),
             (1/1000, "mm"),
             (1/100, "cm"),
             (1, "m"),
             (1000, "km")]

SI_Frequency = [(1, "Hz"), (1000, "kHz"), (1000000, "MHz"), (1000000000, "GHz")]

def convert(n, units):
    useFactor, useName = units[0]
    for factor, name in units:
        if n >= factor:
            useFactor, useName = factor, name
    return (n/useFactor, useName)

print(convert(0.0035, SI_Length))       # 3.5 mm
print(convert(12332.55, SI_Frequency))  # 12.33255 kHz
Each unit array must be in order of smallest to largest multiplier.
EDIT: Actually, range is a function which is generally used in iteration to generate numbers. So when you write if n in range(min_value, max_value) with a float n, Python falls back to scanning the generated integers one by one until it finds a match or reaches max_value.
The range type represents an immutable sequence of numbers and is commonly used for looping a specific number of times in for loops.
Instead of writing:
if n in range(int(1e-10), int(9e-8)):
    return n/0.000000001, 'nm'
you should write:
if 1e-10 <= n < 9e-8:
    return n/0.000000001, 'nm'
Also keep in mind that range only works on integers, not floats.
More EDIT:
For your specific use case, you can define an ordered collection of (value, symbol) pairs, like below:
import collections

symbols = collections.OrderedDict(
    [(1e-12, u'p'),
     (1e-9, u'n'),
     (1e-6, u'μ'),
     (1e-3, u'm'),
     (1e-2, u'c'),
     (1e-1, u'd'),
     (1e0, u''),
     (1e1, u'da'),
     (1e2, u'h'),
     (1e3, u'k'),
     (1e6, u'M'),
     (1e9, u'G'),
     (1e12, u'T')])
Then use the bisect.bisect function to find the "insertion point" of your value in that ordered collection. The insertion point can be used to get the simplified value and the SI symbol to use.
For instance:
import bisect

def convert_to_si(value):
    if value < 0:
        value, symbol = convert_to_si(-value)
        return -value, symbol
    elif value > 0:
        orders = list(symbols.keys())
        order_index = bisect.bisect(orders, value / 10.0)
        order = orders[min(order_index, len(orders) - 1)]
        return value / order, symbols[order]
    else:
        return value, u""
Demonstration:
for value in [1e-12, 3.14e-11, 0, 2, 20, 3e+9]:
    print(*convert_to_si(value), sep="")
You get:
1.0p
0.0314n
0
2.0
2.0da
3.0G
You can adapt this function to your needs…

Two's complement in Python (shift left on many bits with rounding)

How could we code the reverse complement of a DNA sequence from its numeric code?
A DNA sequence can contain 4 different characters: A, C, G, T; where A is the complement of T and C is the complement of G.
The reverse complement of a DNA sequence is the complement of the sequence read in reverse (we take the complement of each character from right to left).
Example: the reverse complement of AA is TT, the reverse complement of AC is GT, and so on...
In general, using python we code a sequence by mapping each character to a number going from 0 to 3,
{A:0, C:1, G:2, T:3}
then the coding of AA is: 0, the coding of AC is:
AC = 0*4^0+1*4^1 = 4
the coding of GT is:
GT = 2*4^0+3*4^1 = 14
How could I transform the code of each sequence into the code of its reverse complement in python, without creating a dictionary? For the above example: convert 4 to 14, and 0 to 15...
Your symbol set is too small for a hash map to actually be efficient. And mixing two's complement into your problem has just caused confusion.
symbols = 'ACGT'
complements = symbols[::-1]  # 'TGCA': reversing the order pairs A<->T and C<->G

table = str.maketrans(symbols, complements)

sample = 'ACCGTT'
print(sample[::-1].translate(table))
# output: AACGGT
Converting to some bitpacked format would take less space but require a lot more special handling, as you'd need to track sizes separately, perform arbitrarily wide shifts and so on. Python can certainly do it, in particular with int() accepting many bases and creating arbitrary width results, but it's likely a counterproductive detour.
import string

digits = string.digits[:len(symbols)]
length = len(sample)
digitmap = str.maketrans(symbols, digits)
number = int(sample.translate(digitmap), len(digits))

# default function is the identity (the builtin id() would return object
# addresses, not values, so it is not suitable here)
def reversemapnumber(function=lambda x: x, number=0, radix=0b100, length=0):
    result = 0
    for i in range(length):
        number, digit = divmod(number, radix)
        result = result * radix + function(digit)
    return result

revcomplemented = reversemapnumber(function=lambda x: 3 - x,
                                   number=number, length=length)

# binary form
print('{:0{}b}'.format(revcomplemented, length * 2))
# back to text form
print(''.join(symbols[(revcomplemented >> i) & 0b11]
              for i in range(2 * length - 2, -2, -2)))
In that jumble of code I've used division rather than shifts to be somewhat more generic (supporting radix not a power of two), but the printing examples rely on the width exactly. In the end it's just tricky and unclear.
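As an aside (my own sketch, not part of either answer): with the little-endian base-4 coding from the question, the reverse complement of a length-L code n also has a purely arithmetic closed form, namely (4**L - 1) minus the digit-reversal of n, since complementing a digit d gives 3 - d:
def revcomp_code(n, length):
    rev = 0
    for _ in range(length):
        n, d = divmod(n, 4)   # peel off the next base-4 digit
        rev = rev * 4 + d     # push it onto the reversed number
    return (4 ** length - 1) - rev

assert revcomp_code(4, 2) == 14   # AC -> GT
assert revcomp_code(0, 2) == 15   # AA -> TT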
The reverse of a list in Python:
>>> xs = [1,2,3]
>>> reversed(xs)
<listreverseiterator object at 0x10089c9d0>
>>> list(reversed(xs))
[3, 2, 1]
>>>
def complement(x):
    return ~x & 15  # as 15 == int('1111', 2)
The 15 is a bitmask representing the binary 1111. We then apply the bitwise AND operator.
>>> "{0:b}".format(complement(int('1111',2)))
'0'
>>> "{0:b}".format(complement(int('0001',2)))
'1110'
>>> "{0:b}".format(complement(int('1001',2)))
'110'
>>> xs = [int('1111',2), int('1001',2), int('0110',2), int('1011',2)]
>>> map(complement, xs)
[0, 6, 9, 4]
>>> list(reversed(map(complement, xs)))
[4, 9, 6, 0]
Based on your example, where:
given a sequence of 6 characters ACCGTT, the complement of A is T, and the complement of C is G; so the reverse complement of ACCGTT is AACGGT.
assume that you have a complement function complement and a reverse function reverse.
We have reverse(ACCGTT) = TTGCCA and complement(ACCGTT) = TGGCAA. Reversing a list after applying a function to each element is the same as applying the function to each element and then reversing:
complement(reverse(ACCGTT)) = reverse(complement(ACCGTT))
So the other part of the question is that you want to map
{A:0, C:1, G:2, T:3}
A -> T | 0 -> 3
T -> A | 3 -> 0
C -> G | 1 -> 2
G -> C | 2 -> 1
which in binary would be
a = int('00', 2)  # 0
c = int('01', 2)  # 1
g = int('10', 2)  # 2
t = int('11', 2)  # 3

def complement(x):
    return ~x & 3  # this 3 is the same as int('11', 2)

def reverse_complement(list_of_ints):
    # a list comprehension works in both Python 2 and 3
    # (in Python 3, reversed() cannot consume a map object directly)
    return [complement(x) for x in reversed(list_of_ints)]
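A quick check against the worked example above (my own addition), encoding ACCGTT as integers:
>>> reverse_complement([0, 1, 1, 2, 3, 3])   # ACCGTT
[0, 0, 1, 2, 2, 3]                           # AACGGT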

Python Decimal - engineering notation for milli (10^-3) and micro (10^-6)

Here is the example which is bothering me:
>>> x = decimal.Decimal('0.0001')
>>> print x.normalize()
0.0001
>>> print x.normalize().to_eng_string()
0.0001
Is there a way to have engineering notation for representing milli (10^-3) and micro (10^-6)?
Here's a function that does things explicitly, and also has support for using SI suffixes for the exponent:
import math

def eng_string(x, format='%s', si=False):
    '''
    Returns float/int value <x> formatted in a simplified engineering format -
    using an exponent that is a multiple of 3.

    format: printf-style string used to format the value before the exponent.

    si: if true, use SI suffix for exponent, e.g. k instead of e3, n instead of
    e-9, etc.

    E.g. with format='%.2f':
        1.23e-08 => 12.30e-9
        123 => 123.00
        1230.0 => 1.23e3
        -1230000.0 => -1.23e6

    and with si=True:
        1230.0 => 1.23k
        -1230000.0 => -1.23M
    '''
    sign = ''
    if x < 0:
        x = -x
        sign = '-'
    exp = int(math.floor(math.log10(x)))
    exp3 = exp - (exp % 3)
    x3 = x / (10 ** exp3)

    if si and exp3 >= -24 and exp3 <= 24 and exp3 != 0:
        exp3_text = 'yzafpnum kMGTPEZY'[(exp3 - (-24)) / 3]  # Python 2 integer division
    elif exp3 == 0:
        exp3_text = ''
    else:
        exp3_text = 'e%s' % exp3

    return ('%s' + format + '%s') % (sign, x3, exp3_text)
EDIT:
Matplotlib implemented the engineering formatter, so one option is to use Matplotlib's formatter directly, e.g.:
import matplotlib as mpl
formatter = mpl.ticker.EngFormatter()
formatter(10000)
result: '10 k'
Original answer:
Based on Julian Smith's excellent answer (and this answer), I changed the function to improve on the following points:
Python3 compatible (integer division)
Compatible for 0 input
Rounding to significant number of digits, by default 3, no trailing zeros printed
so here's the updated function:
import math

def eng_string(x, sig_figs=3, si=True):
    """
    Returns float/int value <x> formatted in a simplified engineering format -
    using an exponent that is a multiple of 3.

    sig_figs: number of significant figures

    si: if true, use SI suffix for exponent, e.g. k instead of e3, n instead of
    e-9 etc.
    """
    x = float(x)
    sign = ''
    if x < 0:
        x = -x
        sign = '-'
    if x == 0:
        exp = 0
        exp3 = 0
        x3 = 0
    else:
        exp = int(math.floor(math.log10(x)))
        exp3 = exp - (exp % 3)
        x3 = x / (10 ** exp3)
        x3 = round(x3, -int(math.floor(math.log10(x3)) - (sig_figs - 1)))
        if x3 == int(x3):  # prevent from displaying .0
            x3 = int(x3)

    if si and exp3 >= -24 and exp3 <= 24 and exp3 != 0:
        exp3_text = 'yzafpnum kMGTPEZY'[exp3 // 3 + 8]
    elif exp3 == 0:
        exp3_text = ''
    else:
        exp3_text = 'e%s' % exp3

    return ('%s%s%s') % (sign, x3, exp3_text)
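For instance, checking against the value from the question and the earlier docstring examples (my own quick check, assuming I am reading the rounding logic right):
>>> eng_string(0.0001)
'100u'
>>> eng_string(-1230000.0)
'-1.23M'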
The decimal module is following the Decimal Arithmetic Specification, which states:
This is outdated - see below
to-scientific-string – conversion to numeric string
[...]
The coefficient is first converted to a string in base ten using the characters 0 through 9 with no leading zeros (except if its value is zero, in which case a single 0 character is used).
Next, the adjusted exponent is calculated; this is the exponent, plus the number of characters in the converted coefficient, less one. That is, exponent+(clength-1), where clength is the length of the coefficient in decimal digits.
If the exponent is less than or equal to zero and the adjusted exponent is greater than or equal to -6, the number will be converted
to a character form without using exponential notation.
[...]
to-engineering-string – conversion to numeric string
This operation converts a number to a string, using engineering
notation if an exponent is needed.
The conversion exactly follows the rules for conversion to scientific
numeric string except in the case of finite numbers where exponential
notation is used. In this case, the converted exponent is adjusted to be a multiple of three (engineering notation) by positioning the decimal point with one, two, or three characters preceding it (that is, the part before the decimal point will range from 1 through 999).
This may require the addition of either one or two trailing zeros.
If after the adjustment the decimal point would not be followed by a digit then it is not added. If the final exponent is zero then no indicator letter and exponent is suffixed.
Examples:
For each abstract representation [sign, coefficient, exponent] on the left, the resulting string is shown on the right.

Representation    String
[0,123,1]         "1.23E+3"
[0,123,3]         "123E+3"
[0,123,-10]       "12.3E-9"
[1,123,-12]       "-123E-12"
[0,7,-7]          "700E-9"
[0,7,1]           "70"
Or, in other words:
>>> for n in (10 ** e for e in range(-1, -8, -1)):
... d = Decimal(str(n))
... print d.to_eng_string()
...
0.1
0.01
0.001
0.0001
0.00001
0.000001
100E-9
I realize that this is an old thread, but it does come up near the top of a search for python engineering notation, so it seems prudent to have this information here.
I am an engineer who likes the "engineering 101" engineering units. I don't even like designations such as 0.1uF; I want that to read 100nF. I played with the Decimal class and didn't really like its behavior over the range of possible values, so I rolled a package called engineering_notation, which is pip-installable.
pip install engineering_notation
From within Python:
>>> from engineering_notation import EngNumber
>>> EngNumber('1000000')
1M
>>> EngNumber(1000000)
1M
>>> EngNumber(1000000.0)
1M
>>> EngNumber('0.1u')
100n
>>> EngNumber('1000m')
1
This package also supports comparisons and other simple numerical operations.
https://github.com/slightlynybbled/engineering_notation
The «full» quote shows what is wrong!
The decimal module is indeed following the proprietary (IBM) Decimal Arithmetic Specification.
Quoting this IBM specification in its entirety clearly shows what is wrong with decimal.to_eng_string() (emphasis added):
to-engineering-string – conversion to numeric string
This operation converts a number to a string, using engineering
notation if an exponent is needed.
The conversion exactly follows the rules for conversion to scientific
numeric string except in the case of finite numbers where exponential
notation is used. In this case, the converted exponent is adjusted to be a multiple of three (engineering notation) by positioning the decimal point with one, two, or three characters preceding it (that is, the part before the decimal point will range from 1 through 999). This may require the addition of either one or two trailing zeros.
If after the adjustment the decimal point would not be followed by a digit then it is not added. If the final exponent is zero then no indicator letter and exponent is suffixed.
This proprietary IBM specification actually admits to applying engineering notation only when an exponent is needed at all; numbers that the scientific-string rules render without an exponent (such as 0.0001) are left untouched! This is obviously incorrect behaviour, for which a Python bug report was opened.
Solution
from math import floor, log10

def powerise10(x):
    """Returns x as a*10**b with 0 <= a < 10"""
    if x == 0:
        return 0, 0
    Neg = x < 0
    if Neg:
        x = -x
    a = 1.0 * x / 10**(floor(log10(x)))
    b = int(floor(log10(x)))
    if Neg:
        a = -a
    return a, b

def eng(x):
    """Return a string representing x in an engineer-friendly notation"""
    a, b = powerise10(x)
    if -3 < b < 3:
        return "%.4g" % x
    a = a * 10**(b % 3)
    b = b - b % 3
    return "%.4gE%s" % (a, b)
Source: https://code.activestate.com/recipes/578238-engineering-notation/
Test result
>>> eng(0.0001)
'100E-6'
Like the answers above, but a bit more compact:
from math import log10, floor

def eng_format(x, precision=3):
    """Returns string in engineering format, i.e. 100.1e-3"""
    x = float(x)  # inplace copy
    if x == 0:
        a, b = 0, 0
    else:
        sgn = 1.0 if x > 0 else -1.0
        ax = abs(x)  # keep x intact so the small-exponent branch preserves the sign
        a = sgn * ax / 10**(floor(log10(ax)))
        b = int(floor(log10(ax)))

    if -3 < b < 3:
        return ("%." + str(precision) + "g") % x
    else:
        a = a * 10**(b % 3)
        b = b - b % 3
        return ("%." + str(precision) + "gE%s") % (a, b)
Trial:
In [10]: eng_format(-1.2345e-4,precision=5)
Out[10]: '-123.45E-6'

Is there a faster way to convert an arbitrary large integer to a big endian sequence of bytes?

I have this Python code to do this:
from struct import pack as _pack

def packl(lnum, pad=1):
    if lnum < 0:
        raise ValueError("Cannot use packl to convert a negative integer "
                         "to a string.")
    count = 0
    l = []
    while lnum > 0:
        l.append(lnum & 0xffffffffffffffffL)
        count += 1
        lnum >>= 64
    if count <= 0:
        return '\0' * pad
    elif pad >= 8:
        lens = 8 * count % pad
        pad = ((lens != 0) and (pad - lens)) or 0
        l.append('>' + 'x' * pad + 'Q' * count)
        l.reverse()
        return _pack(*l)
    else:
        l.append('>' + 'Q' * count)
        l.reverse()
        s = _pack(*l).lstrip('\0')
        lens = len(s)
        if (lens % pad) != 0:
            return '\0' * (pad - lens % pad) + s
        else:
            return s
This takes approximately 174 usec to convert 2**9700 - 1 to a string of bytes on my machine. If I'm willing to use the Python 2.7 and Python 3.x specific bit_length method, I can shorten that to 159 usecs by pre-allocating the l array to be the exact right size at the very beginning and using l[something] = syntax instead of l.append.
Is there anything I can do that will make this faster? This will be used to convert large prime numbers used in cryptography as well as some (but not many) smaller numbers.
Edit
This is currently the fastest option in Python < 3.2, it takes about half the time either direction as the accepted answer:
import binascii

def packl(lnum, padmultiple=1):
    """Packs the lnum (which must be convertable to a long) into a
    byte string 0 padded to a multiple of padmultiple bytes in size. 0
    means no padding whatsoever, so that packing 0 results in an empty
    string. The resulting byte string is the big-endian two's
    complement representation of the passed in long."""
    if lnum == 0:
        return b'\0' * padmultiple
    elif lnum < 0:
        raise ValueError("Can only convert non-negative numbers.")
    s = hex(lnum)[2:]
    s = s.rstrip('L')
    if len(s) & 1:
        s = '0' + s
    s = binascii.unhexlify(s)
    if (padmultiple != 1) and (padmultiple != 0):
        filled_so_far = len(s) % padmultiple
        if filled_so_far != 0:
            s = b'\0' * (padmultiple - filled_so_far) + s
    return s

def unpackl(bytestr):
    """Treats a byte string as a sequence of base 256 digits
    representing an unsigned integer in big-endian format and converts
    that representation into a Python integer."""
    return int(binascii.hexlify(bytestr), 16) if len(bytestr) > 0 else 0
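A quick round-trip sanity check (my own illustration):
n = 2**64 + 12345
assert unpackl(packl(n)) == n
assert packl(0, 4) == b'\0\0\0\0'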
In Python 3.2 the int class has to_bytes and from_bytes functions that can accomplish this much more quickly that the method given above.
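For example (my own illustration of that newer API):
>>> (2**64 + 255).to_bytes(9, 'big')
b'\x01\x00\x00\x00\x00\x00\x00\x00\xff'
>>> int.from_bytes(b'\x01\x00\x00\x00\x00\x00\x00\x00\xff', 'big')
18446744073709551871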
Here is a solution calling the Python/C API via ctypes. Currently, it uses NumPy, but if NumPy is not an option, it could be done purely with ctypes.
import numpy
import ctypes

PyLong_AsByteArray = ctypes.pythonapi._PyLong_AsByteArray
PyLong_AsByteArray.argtypes = [ctypes.py_object,
                               numpy.ctypeslib.ndpointer(numpy.uint8),
                               ctypes.c_size_t,
                               ctypes.c_int,
                               ctypes.c_int]

def packl_ctypes_numpy(lnum):
    a = numpy.zeros(lnum.bit_length()//8 + 1, dtype=numpy.uint8)
    PyLong_AsByteArray(lnum, a, a.size, 0, 1)
    return a
On my machine, this is 15 times faster than your approach.
Edit: Here is the same code using ctypes only and returning a string instead of a NumPy array:
import ctypes

PyLong_AsByteArray = ctypes.pythonapi._PyLong_AsByteArray
PyLong_AsByteArray.argtypes = [ctypes.py_object,
                               ctypes.c_char_p,
                               ctypes.c_size_t,
                               ctypes.c_int,
                               ctypes.c_int]

def packl_ctypes(lnum):
    a = ctypes.create_string_buffer(lnum.bit_length()//8 + 1)
    PyLong_AsByteArray(lnum, a, len(a), 0, 1)
    return a.raw
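Example use (my own check; note that _PyLong_AsByteArray is a private CPython API, and the extra byte from bit_length()//8 + 1 leaves room for the sign bit since is_signed is passed as 1):
>>> packl_ctypes(2**64 - 1)
b'\x00\xff\xff\xff\xff\xff\xff\xff\xff'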
This is another two times faster, totalling to a speed-up factor of 30 on my machine.
For completeness and for future readers of this question:
Starting in Python 3.2, there are functions int.from_bytes() and int.to_bytes() that perform the conversion between bytes and int objects in a choice of byte orders.
I suppose you really should just be using numpy, which I'm sure has something or other built in for this. It might also be faster to hack around with the array module. But I'll take a stab at it anyway.
IMX, creating a generator and using a list comprehension and/or built-in summation is faster than a loop that appends to a list, because the appending can be done internally. Oh, and 'lstrip' on a large string has got to be costly.
Also, some style points: special cases aren't special enough; and you appear not to have gotten the memo about the new x if y else z construct. :) Although we don't need it anyway. ;)
from struct import pack as _pack

Q_size = 64
Q_bitmask = (1L << Q_size) - 1L

def quads_gen(a_long):
    while a_long:
        yield a_long & Q_bitmask
        a_long >>= Q_size

def pack_long_big_endian(a_long, pad=1):
    if a_long < 0:
        raise ValueError("Cannot use packl to convert a negative integer "
                         "to a string.")
    # reversed() needs a sequence, so realize the generator as a list first
    qs = list(quads_gen(a_long))[::-1]
    if not qs:
        return '\x00' * pad
    # Pack the first one separately so we can lstrip nicely.
    first = _pack('>Q', qs[0]).lstrip('\x00')
    rest = _pack('>%dQ' % (len(qs) - 1), *qs[1:])
    count = len(first) + len(rest)
    # A little math trick that depends on Python's behaviour of modulus
    # for negative numbers - but it's well-defined and documented
    return '\x00' * (-count % pad) + first + rest
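For reference, the modulus trick in that last line (my own note): -count % pad is the number of zero bytes needed to round count up to the next multiple of pad, and it is 0 when count is already aligned:
>>> -5 % 4, -8 % 4
(3, 0)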
Just wanted to post a follow-up to Sven's answer (which works great). The opposite operation - going from an arbitrarily long bytes object to a Python integer object - requires the following (because there is no PyLong_FromByteArray() C API function that I can find):
import binascii

def unpack_bytes(stringbytes):
    # binascii.hexlify will be obsolete in python3 soon
    # They will add a .tohex() method to the bytes class
    # Issue 3532 on bugs.python.org
    return int(binascii.hexlify(stringbytes), 16)
