This question is a follow-up to this one. In Sun's math library (in C), the expression
*(1+(int*)&x)
is used to retrieve the high word of the floating-point number x. Here, doubles are assumed to be 64-bit IEEE values stored little-endian.
How can the C expression above be translated into Python? The difficulty is how to translate the '&' and '*' in the expression. By the way, maybe Python has some built-in function that retrieves the high word of a floating point number?
You can do this more easily with struct:
high_word = struct.pack('<d', x)[4:8]
return struct.unpack('<i', high_word)[0]
Here, high_word is a bytes object (or a str in 2.x) consisting of the four most significant bytes of x in little endian order (using IEEE 64-bit floating point format). We then unpack it back into a 32-bit integer (which is returned in a singleton tuple, hence the [0]).
This always uses little-endian for everything, regardless of your platform's underlying endianness. If you need to use native endianness, replace the < with = (and use > or ! to force big endian). It also guarantees 64-bit doubles and 32-bit ints, which C does not. You can remove that guarantee as well, but there is no good reason to do so since it makes your question nonsensical.
While this could be done with pointer arithmetic, it would involve messing around with ctypes and the conversion from Python float to C float would still be relatively expensive. The struct code is much easier to read.
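For completeness, here is a self-contained version of the snippet above, plus a quick sanity check (the expected values follow from the IEEE 754 layout: the high word of 1.0 is 0x3FF00000):

import struct

def high_word(x):
    hw = struct.pack('<d', x)[4:8]  # four most significant bytes, little endian
    return struct.unpack('<i', hw)[0]

print(hex(high_word(1.0)))  # 0x3ff00000
print(high_word(-1.0))      # -1074790400, i.e. 0xbff00000 read as a signed 32-bit int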
This question is about two's complement, or I guess binary in general. I'm obviously quite new to coding, so I'll appreciate any help here.
I just started learning about converting numbers into binary, specifically two's complement. The course presented the following code for converting:
num = 19
if num < 0:
    isNeg = True
    num = abs(num)
else:
    isNeg = False
result = ''
if num == 0:
    result = '0'
while num > 0:
    result = str(num % 2) + result
    num = num // 2
if isNeg:
    result = '-' + result
This raised a couple of questions with me and after doing some research (mostly here on Stack Overflow), I found myself more confused than I was before. Hoping somebody can break things down a bit more for me. Here are some of those questions:
I thought it was outright wrong that the code suggested just prepending a - to a binary number to show its negative counterpart. It looks like bin() does the same thing, but don't you have to flip the bits and add a 1 or something? Is there a reason for this other than making it easy to comprehend/read?
Was reading here and one of the answers in particular said that Python doesn't really work in two's complement, but something else that mimics it. The disconnect here for me is that Python shows me one thing but is storing the numbers a different way. Again, is this just for ease of use? Is bin() using two's complement or Python's method?
Follow-up to that one, how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
The Professor doesn't talk at all about 8-bit, 16-bit, 64-bit, etc., which I saw a lot of while reading up on this. Where does this distinction come from, and does Python use one? Or are those designations specific to the program that I might be coding?
A lot of the posts I've read only reference how Python stores integers. Is that suggesting that it stores floats a different way, or are they just speaking broadly?
As I wrote this all up, I sort of realized that maybe I'm diving into the deep end before learning how to swim, but I'm curious like that and like to have a deeper understanding of stuff before moving on.
I thought it was outright wrong that the code suggested just prepending a - to a binary number to show its negative counterpart. It looks like bin() does the same thing, but don't you have to flip the bits and add a 1 or something? Is there a reason for this other than making it easy to comprehend/read?
You have to somehow designate the number being negative. You can add another symbol (-), add a sign bit at the very beginning, use ones'-complement, use two's-complement, or some other completely made-up scheme that works. Both the ones'- and two's-complement representation of a number require a fixed number of bits, which doesn't exist for Python integers:
>>> 2**1000
1071508607186267320948425049060001810561404811705533607443750
3883703510511249361224931983788156958581275946729175531468251
8714528569231404359845775746985748039345677748242309854210746
0506237114187795418215304647498358194126739876755916554394607
7062914571196477686542167660429831652624386837205668069376
The natural solution is to just prepend a minus sign. You can similarly write your own version of bin() that requires you to specify the number of bits and return the two's-complement representation of the number.
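A minimal sketch of such a function (the name to_bin and the range check are my own choices): masking with 2**bits - 1 yields the two's-complement bit pattern directly, because Python's bitwise operators already behave as if integers had infinitely many sign bits.

def to_bin(n, bits):
    # Render n as a fixed-width two's-complement bit string.
    if not -(1 << (bits - 1)) <= n < (1 << (bits - 1)):
        raise ValueError('%d does not fit in %d bits' % (n, bits))
    return format(n & ((1 << bits) - 1), '0%db' % bits)

print(to_bin(19, 8))   # 00010011
print(to_bin(-19, 8))  # 11101101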
Was reading here and one of the answers in particular said that Python doesn't really work in two's complement, but something else that mimics it. The disconnect here for me is that Python shows me one thing but is storing the numbers a different way. Again, is this just for ease of use? Is bin() using two's complement or Python's method?
Python is a high-level language, so you don't really know (or care) how your particular Python runtime internally stores integers. Whether you use CPython, Jython, PyPy, IronPython, or something else, the language specification only defines how they should behave, not how they should be represented in memory. bin() just takes a number and prints it out using binary digits, the same way you'd convert 123 into base 2.
Follow-up to that one, how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
Sign-magnitude usually encodes a number n as 0bXYYYYYY..., where X is the sign bit and YY... are the binary digits of the non-negative magnitude. Arithmetic with numbers encoded as two's-complement is more elegant due to the representation, while sign-magnitude encoding requires special handling for operations on numbers of opposite signs. For example, in 4 bits, -5 is 1101 in sign-magnitude (a sign bit followed by the bits of 5) but 1011 in two's complement (flip 0101 to 1010, then add 1).
The Professor doesn't talk at all about 8-bit, 16-bit, 64-bit, etc., which I saw a lot of while reading up on this. Where does this distinction come from, and does Python use one? Or are those designations specific to the program that I might be coding?
No, Python doesn't define a maximum size for its integers because it's not that low-level. 2**1000000 computes fine, as will 2**10000000 if you have enough memory. n-bit numbers arise when your hardware makes it more beneficial to make your numbers a certain size. For example, processors have instructions that quickly work with 32-bit numbers but not with 87-bit numbers.
A lot of the posts I've read only reference how Python stores integers. Is that suggesting that it stores floats a different way, or are they just speaking broadly?
It depends on what your Python runtime uses. Usually floating point numbers are like C doubles, but that's not required.
don't you have to flip the bits and add a 1 or something?
Yes, for two's-complement notation you invert all the bits and add one to get the negative counterpart.
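As a concrete walkthrough (using the question's num = 19 and an assumed 8-bit width):

n = 19                           # 0b00010011
inverted = ~n & 0xFF             # flip all 8 bits: 0b11101100
negated = (inverted + 1) & 0xFF  # add one:         0b11101101
print(format(negated, '08b'))    # 11101101, the two's-complement encoding of -19
print(negated - 256)             # -19: the top bit counts as -2**7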
Is bin() using two's complement or Python's method?
Two's complement is a practical way to represent negative numbers in electronics that can hold only 0s and 1s. Internally, all modern microprocessors use two's complement for negative numbers. For more info, see your textbook on computer architecture.
how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
You should look at what this code does and why it is there:
while num > 0:
    result = str(num % 2) + result
    num = num // 2
Direct conversion from float to Decimal was implemented in python-2.7, both in Decimal's constructor and with the Decimal.from_float() classmethod.
Python-2.6 instead throws a TypeError suggesting to convert to string first:
TypeError: Cannot convert float to Decimal. First convert the float to a string
so my usual workaround is this:
if sys.version_info < (2, 7):
    Decimal.from_float = classmethod(lambda cls, x: cls(str(x)))
That's just a literal translation of the error message - and I just don't bother implementing it in the constructor too.
If it's that simple, why didn't they implement it in the first place instead of telling the user to do it in a TypeError? Is that the best method available (and by extension is that the one used in python-2.7 and newer?)
Your workaround is not the RightWayToDoIt(tm) because it loses information. The lossless way to convert is shown in the recipe for float_to_decimal() shown in the Decimal FAQ.
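For reference, the FAQ recipe looks roughly like this (reconstructed here, so check the FAQ for the canonical version): it divides the float's exact integer ratio under an expanding precision until the division raises no Inexact flag.

from decimal import Decimal, Context, Inexact

def float_to_decimal(f):
    # Convert a float to a Decimal with no loss of information.
    n, d = f.as_integer_ratio()
    numerator, denominator = Decimal(n), Decimal(d)
    ctx = Context(prec=60)
    result = ctx.divide(numerator, denominator)
    while ctx.flags[Inexact]:
        ctx.flags[Inexact] = False
        ctx.prec *= 2
        result = ctx.divide(numerator, denominator)
    return result

print(float_to_decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625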
The reason we didn't include Decimal.from_float in Python 2.6 is because we were being conservative about introducing unintentional interactions between binary floats and decimal floats. By Python 2.7, this was all worked out and you can just write Decimal(f) where f is a binary float.
Other than the small nuisance in 2.6, I hope you're enjoying the Decimal module!
Probably because the behavior of a direct conversion can be counterintuitive if you don't know a few implementation details about floats. As stated in the docs:
Note: Decimal.from_float(0.1) is not the same as Decimal('0.1').
Since 0.1 is not exactly representable in binary floating point, the
value is stored as the nearest representable value which is
0x1.999999999999ap-4. That equivalent value in decimal is
0.1000000000000000055511151231257827021181583404541015625.
If you convert to a string, you can control the precision you want to use, so you can get an accurate conversion to Decimal.
The new method was introduced in Python 2.7 - that's why it isn't in 2.6. New features are not backported to older versions.
According to 2.7's What's New documentation (emphasis added)
Conversions between floating-point numbers and strings are now
correctly rounded on most platforms. These conversions occur in many
different places: str() on floats and complex numbers; the float and
complex constructors; numeric formatting; serializing and
deserializing floats and complex numbers using the marshal, pickle and
json modules; parsing of float and imaginary literals in Python code;
and Decimal-to-float conversion.
So, although they could have done that prior to 2.7, they apparently didn't feel comfortable doing that with the rounding problems that existed.
How do I represent minimum and maximum values for integers in Python? In Java, we have Integer.MIN_VALUE and Integer.MAX_VALUE.
See also: What is the maximum float in Python?.
Python 3
In Python 3, this question doesn't apply. The plain int type is unbounded.
However, you might actually be looking for information about the current interpreter's word size, which will be the same as the machine's word size in most cases. That information is still available in Python 3 as sys.maxsize, which is the maximum value representable by a signed word. Equivalently, it's the size of the largest possible list or in-memory sequence.
Generally, the maximum value representable by an unsigned word will be sys.maxsize * 2 + 1, and the number of bits in a word will be math.log2(sys.maxsize * 2 + 2). See this answer for more information.
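For example, a quick interactive check (the exact numbers depend on your build; the comments assume a 64-bit one):

import math
import sys

print(sys.maxsize)                     # 9223372036854775807, i.e. 2**63 - 1
print(sys.maxsize * 2 + 1)             # 18446744073709551615, max unsigned word
print(math.log2(sys.maxsize * 2 + 2))  # 64.0, the word size in bits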
Python 2
In Python 2, the maximum value for plain int values is available as sys.maxint:
>>> sys.maxint # on my system, 2**63-1
9223372036854775807
You can calculate the minimum value with -sys.maxint - 1 as shown in the docs.
Python seamlessly switches from plain to long integers once you exceed this value. So most of the time, you won't need to know it.
If you just need a number that's bigger than all others, you can use
float('inf')
In similar fashion, a number smaller than all others:
float('-inf')
This works in both Python 2 and 3.
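A quick check that these compare the way you'd expect, even against enormous integers:

import sys

print(float('inf') > sys.maxsize ** 10)      # True
print(float('-inf') < -(sys.maxsize ** 10))  # True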
The sys.maxint constant has been removed from Python 3.0 onward, instead use sys.maxsize.
Integers
PEP 237: Essentially, long renamed to int. That is, there is only one built-in integral type, named int; but it behaves mostly like the old long type.
...
The sys.maxint constant was removed, since there is no longer a limit to the value of integers. However, sys.maxsize can be used as an integer larger than any practical list or string index. It conforms to the implementation’s “natural” integer size and is typically the same as sys.maxint in previous releases on the same platform (assuming the same build options).
For Python 3, it is
import sys
max_size = sys.maxsize
min_size = -sys.maxsize - 1
In Python 2, integers will automatically switch from a fixed-size int representation into a variable-width long representation once you pass the value sys.maxint, which is either 2**31 - 1 or 2**63 - 1 depending on your platform. Notice the L that gets appended here:
>>> 9223372036854775807
9223372036854775807
>>> 9223372036854775808
9223372036854775808L
From the Python manual:
Numbers are created by numeric literals or as the result of built-in functions and operators. Unadorned integer literals (including binary, hex, and octal numbers) yield plain integers unless the value they denote is too large to be represented as a plain integer, in which case they yield a long integer. Integer literals with an 'L' or 'l' suffix yield long integers ('L' is preferred because 1l looks too much like eleven!).
Python tries very hard to pretend its integers are mathematical integers and are unbounded. It can, for instance, calculate a googol with ease:
>>> 10**100
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L
You may use 'inf' like this:
import math
bool_true = 0 < math.inf
bool_false = 0 < -math.inf
Refer: math — Mathematical functions
If you want the max for array or list indices (equivalent to size_t in C/C++), you can use numpy:
np.iinfo(np.intp).max
This is the same as sys.maxsize; the advantage is that you don't need to import sys just for this.
If you want the max for a native int on the machine:
np.iinfo(np.intc).max
You can look at the other available types in the docs.
For floats you can also use sys.float_info.max.
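A small sketch tying these together (the printed values assume a typical 64-bit platform with 32-bit C ints):

import sys
import numpy as np

print(np.iinfo(np.intp).max == sys.maxsize)  # True
print(np.iinfo(np.intc).max)                 # 2147483647
print(sys.float_info.max)                    # 1.7976931348623157e+308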
sys.maxsize is not actually the maximum integer value which is supported. You can double maxsize and multiply it by itself and it stays a valid and correct value.
However, if you try sys.maxsize ** sys.maxsize, it will hang your machine for a significant amount of time. As many have pointed out, the byte and bit size does not seem to be relevant because it practically doesn't exist. Python just happily expands its integers when it needs more memory space. So in general there is no limit.
Now, if you're talking about packing or storing integers in a safe way where they can later be retrieved with integrity, then of course that is relevant. I'm really not sure about packing, but I know Python's pickle module handles those things well. String representations obviously have no practical limit.
So really, the bottom line is: what is your application's limit? What does it require for numeric data? Use that limit instead of Python's fairly nonexistent integer limit.
I rely heavily on commands like this.
python -c 'import sys; print(sys.maxsize)'
Max int returned: 9223372036854775807
For more references on 'sys', see:
https://docs.python.org/3/library/sys.html
https://docs.python.org/3/library/sys.html#sys.maxsize
The code given below will help you.
For the maximum value you can use sys.maxsize, and for the minimum, -sys.maxsize - 1.
import sys
ni = sys.maxsize
print(ni)
I have a need for a high-performance string hashing function in Python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). There are several other questions like this one on Stack Overflow, but of those, every accepted/upvoted answer I could find fell into one of a few categories, which don't apply (for the given reasons).
Use the built-in hash() function. This function, at least on the machine I'm developing for (with Python 2.7 and a 64-bit CPU), produces an integer that fits within 32 bits - not large enough for my purposes.
Use hashlib. hashlib provides cryptographic hash routines, which are far slower than they need to be for non-cryptographic purposes. I find this self-evident, but if you require benchmarks and citations to convince you of this fact then I can provide that.
Use the string.__hash__() function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around 32 bits - again, too small for my use! Very frustrating, it's so close to perfect!
An ideal solution would have the following properties, in a relative, loose order of importance.
Have an output range extending at least 34 bits long, likely 64 bits, while preserving consistent avalanche properties over all bits. (Concatenating 32-bit hashes tends to violate the avalanche properties, at least with my dumb examples.)
Portable. Given the same input string on two different machines, I should get the same result both times. These values will be stored in a file for later re-use.
High-performance. The faster the better as this function will get called roughly 20 billion times during the execution of the program I'm running (it is the performance-critical code at the moment.) It doesn't need to be written in C, it really just needs to outperform md5 (somewhere in the realm of the built-in hash() for strings).
Accept a 'perturbation' (what's the better word to use here?) integer as input to modify the output. I put an example below (the list formatting rules wouldn't let me place it nearer.) I suppose this isn't 100% necessary since it can be simulated by perturbing the output of the function manually, but having it as input gives me a nice warm feeling.
Written entirely in Python. If it absolutely, positively needs to be written in C then I guess that can be done, but I'd take a 20% slower function written in python over the faster one in C, just due to project coordination headache of using two different languages. Yes, this is a cop-out, but this is a wish list here.
'Perturbed' hash example, where the hash value is changed drastically by a small integer value n
def perturb_hash(key, n):
    return hash((key, n))
Finally, if you're curious as to what the heck I'm doing that I need such a specific hash function, I'm doing a complete re-write of the pybloom module to enhance its performance considerably. I succeeded at that (it now runs about 4x faster and uses about 50% of the space) but I noticed that sometimes if the filter got large enough it was suddenly spiking in false-positive rates. I realized it was because the hash function wasn't addressing enough bits. 32 bits can only address 4 billion bits (mind you, the filter addresses bits and not bytes) and some of the filters I'm using for genomic data double that or more (hence 34 bit minimum.)
Thanks!
Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. Should be possible to port this to Python, pure or as a C extension. (Update: the author recommends using the 128-bit variant and throwing away the bits you don't need.)
If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.
Update: I did a quick Python wrapper for the Murmur3 hash function. The GitHub project is here, and you can find it on the Python Package Index as well; it just needs a C++ compiler to build; no Boost required.
Usage example and timing comparison:
import murmur3
import timeit
# without seed
print murmur3.murmur3_x86_64('samplebias')
# with seed value
print murmur3.murmur3_x86_64('samplebias', 123)
# timing comparison with str __hash__
t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
print 'murmur3:', t.timeit()
t = timeit.Timer("str.__hash__('hello')")
print 'str.__hash__:', t.timeit()
Output:
15662901497824584782
7997834649920664675
murmur3: 0.264422178268
str.__hash__: 0.219163894653
BE CAREFUL WITH THE BUILT-IN HASH FUNCTION!
Since Python 3.3, the hash function is fed with a different seed every time the interpreter starts (hash randomization, controllable via the PYTHONHASHSEED environment variable), so it generates different values for strings on every run -- but not for native numeric types.
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-1756730906053498061 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4556027264747844925 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4403217265550417031 322818021289917443
Use the built-in hash() function. This function, at least on the machine I'm developing for (with Python 2.7 and a 64-bit CPU), produces an integer that fits within 32 bits - not large enough for my purposes.
That's not true. The built-in hash function will generate a 64-bit hash on a 64-bit system.
This is the Python str hashing function from Objects/stringobject.c (Python version 2.7):
static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;  /* Notice the 64-bit hash, at least on a 64-bit system */

    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}
"strings": I'm presuming you wish to hash Python 2.x str objects and/or Python3.x bytes and/or bytearray objects.
This may violate your first constraint, but: consider using something like
(zlib.adler32(strg, perturber) << N) ^ hash(strg)
to get a (32+N)-bit hash.
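A rough, runnable sketch of that idea (the function name is mine, and masking hash() down to N bits is an assumption to pin the result to exactly 32+N bits):

import zlib

def wide_hash(strg, perturber=0, N=2):
    # (32+N)-bit value: adler32 seeded with `perturber` in the high 32 bits,
    # with the low N bits of the built-in hash() mixed into the bottom.
    return (zlib.adler32(strg, perturber) << N) ^ (hash(strg) & ((1 << N) - 1))

print(wide_hash(b'ACGTACGT', perturber=7))  # a 34-bit hash with N=2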
If you can use Python 3.2, the hash result on 64-bit Windows is now a 64-bit value.
Have a look at xxHash, there's also the pip package.
xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical across all platforms (little / big endian).
I've been using xxHash for a long time (my typical use case is to hash strings -- not for security purposes) and I'm really satisfied with the performance.
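A minimal usage sketch, assuming the xxhash pip package and its xxh64 interface:

import xxhash  # pip install xxhash

# 64-bit, seedable, non-cryptographic; identical results across platforms
h = xxhash.xxh64(b'some input string', seed=123).intdigest()
print(h)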
While doing some random experimentation with a factorial program in C, Python, and Scheme, I came across this fact:
In C, using the 'unsigned long long' data type, the largest factorial I can print is that of 65, which is '9223372036854775808' - that is, 19 digits, as specified here.
In Python, I can find the factorial of a number as large as 999 which consists of a large number of digits, much more than 19.
How does CPython achieve this? Does it use a data type like 'octaword'?
I might be missing some fundamental facts here. So, I would appreciate some insights and/or references to read. Thanks!
UPDATE: Thank you all for the explanation. Does that mean CPython is using the GNU Multi-Precision library (or some other similar library)?
UPDATE 2: I am looking for Python's 'bignum' implementation in the sources. Where exactly is it? It's here: http://svn.python.org/view/python/trunk/Objects/longobject.c?view=markup. Thanks Baishampayan.
It's called Arbitrary Precision Arithmetic. There's more here: http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
Looking at the Python source code, it seems the long type (at least in pre-Python 3 code) is defined in longintrepr.h like this -
/* Long integer representation.
   The absolute value of a number is equal to
        SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero.  Also, in all cases, for all valid i,
        0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.

   CAUTION: Generic code manipulating subtypes of PyVarObject has to
   be aware that longs abuse ob_size's sign bit.
*/

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
The actual usable interface of the long type is then defined in longobject.h by creating a new type PyLongObject like this -
typedef struct _longobject PyLongObject;
And so on.
There is more stuff happening inside longobject.c, you can take a look at those for more details.
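In Python 3 you can even inspect the digit layout from the interpreter; sys.int_info was added in 3.1 (the values shown are typical for a 64-bit build):

import sys

# Bits carried per internal "digit" and its size in bytes
print(sys.int_info)  # e.g. sys.int_info(bits_per_digit=30, sizeof_digit=4)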
Data types such as int in C are directly mapped (more or less) to the data types supported by the processor. So the limits on C's int are essentially the limits imposed by the processor hardware.
But one can implement one's own int data type entirely in software. You can, for example, use an array of digits as your underlying representation. Maybe like this:
class MyInt {
    private int[] digits;

    public MyInt(int noOfDigits) {
        digits = new int[noOfDigits];
    }
}
Once you do that, you may use this class and store integers containing as many digits as you want, as long as you don't run out of memory.
Perhaps Python is doing something like this inside its virtual machine. You may want to read this article on Arbitrary Precision Arithmetic to get the details.
Not octaword. CPython implements a bignum structure to store arbitrary-precision numbers.
Python assigns to long integers (all ints in Python 3) just as much space as they need -- an array of "digits" (base being a power of 2) allocated as needed.
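A quick demonstration of that seamless growth (20! is the largest factorial that fits in an unsigned 64-bit integer, yet Python keeps going without any extra code):

import math

print(math.factorial(20))             # 2432902008176640000, still fits in 64 bits
print(len(str(math.factorial(999))))  # 2565 -- the digit count of 999!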