How multiarray.correlate2(a, v, mode) is actually implemented?

How multiarray.correlate2(a, v, mode) is actually implemented? - python

On my way to understand how the Numpy.correlate() function actually works, I get to it's implementation in pure Python, but what I saw was very disappointing:
def correlate(a, v, mode='valid', old_behavior=False):
mode = _mode_from_name(mode)
if old_behavior:
warnings.warn("""Warning.""", DeprecationWarning)
return multiarray.correlate(a, v, mode)
else:
return multiarray.correlate2(a, v, mode)
So I started to look for implementation of the multiarray.correlate2(a, v, mode) function, but unfortunately I can't find it. I'll just say, that I'm looking for it, because I try to implement the autocorrelation function by myself, and I'm missing the functionality similar to mode='full' parameter in Numpy.correlate() that makes function to return result as a 1D array. Thank you for help in advance.

The speed of python code can be very poor compared to other languages like c. numpy aims to provide highly performant operations on arrays, therefore the developers decided to implement some operations in c.
Unfortunately, won't find a python implementation of correlate in numpy's code base, but if you are familiar with C and python's extension modules, you can have find the relevant code here.
The different modes just specify the length of the output array.
You can simulate them by transforming your inputs:
import numpy as np
a = [1, 2, 3]
v = [0, 1, 0.5]
np.correlate(a, v, mode="full")
returns:
array([ 0.5, 2. , 3.5, 3. , 0. ])
You can get the same result by filling v with zeros:
np.correlate(a, [0, 0] + v + [0, 0])
returns the same result:
array([ 0.5, 2. , 3.5, 3. , 0. ])

np.core.multiarray.correlate2
dir(np.core.multiarray.correlate2) # to inspect
print (numpy.__version__)
print numpy.__version__ # python 2
found it! it might be a private API, can't find the docs after first search with numpy.multiarray or with the newly discovered 'correct' name.
optimal search query is 'np.core.multiarray.correlate2 github'
return multiarray.correlate2(a, v, mode) # mode is int data type
if you're plan to customize the code for your purposes, be careful.
/*
* simulates a C-style 1-3 dimensional array which can be accessed using
* ptr[i] or ptr[i][j] or ptr[i][j][k] -- requires pointer allocation
* for 2-d and 3-d.
*
* For 2-d and up, ptr is NOT equivalent to a statically defined
* 2-d or 3-d array. In particular, it cannot be passed into a
* function that requires a true pointer to a fixed-size array.
*/
/*NUMPY_API
* Simulate a C-array
* steals a reference to typedescr -- can be NULL
*/
NPY_NO_EXPORT int
PyArray_AsCArray(PyObject **op, void *ptr, npy_intp *dims, int nd,
PyArray_Descr* type
# NPY_NO_EXPORT int NPY_NUMUSERTYPES = 0;
# omit code
switch(mode) {
case 0:
length = length - n + 1;
n_left = n_right = 0;
break;
case 1:
n_left = (npy_intp)(n/2);
n_right = n - n_left - 1;
break;
case 2:
n_right = n - 1;
n_left = n - 1;
length = length + n - 1;
break;
default:
PyErr_SetString(PyExc_ValueError, "mode must be 0, 1, or 2");
return NULL;
}
don't mess with the internal APIs if this is your first crack at the codebase and you are on a deadline. Too late for me.

Related

equivalent of "double[:,::1] u_tp1 not None" in python?

I am trying to write the short-code below in python (it is from a .pyx file). my issue is the lines with "double[:,::1]" in them. Is there any equivalent in python for it? also, how does "cdef unsigned int i, j" translate to python? I am new to programming and most of what I found online is over my head. any suggestion or help is appreciated.
def _step_scalar(
double[:,::1] u_tp1 not None,
double[:,::1] u_t not None,
double[:,::1] u_tm1 not None,
unsigned int x1, unsigned int x2, unsigned int z1, unsigned int z2,
double dt, double ds,
double[:,::1] vel not None):
"""
Perform a single time step in the Finite Difference solution for scalar
waves 4th order in space
"""
cdef unsigned int i, j
for i in xrange(z1, z2):
for j in xrange(x1, x2):
u_tp1[i,j] = (2.*u_t[i,j] - u_tm1[i,j]
+ ((vel[i,j]*dt/ds)**2)*(
(-u_t[i,j + 2] + 16.*u_t[i,j + 1] - 30.*u_t[i,j] +
16.*u_t[i,j - 1] - u_t[i,j - 2])/12. +
(-u_t[i + 2,j] + 16.*u_t[i + 1,j] - 30.*u_t[i,j] +
16.*u_t[i - 1,j] - u_t[i - 2,j])/12.))

They're type declarations to help Cython speed up the code. Python is dynamically typed (accepts variables of any type) so they aren't meaningful in Cython. Therefore you can get rid of them.
double[:,::1] defines the variable as a 2D, C contiguous memoryview of doubles. This means the function expects something similar to a 2D numpy array (as this is still what you should pass your Cython function).
u_tp1 is the variable name. You should keep this.
not None tells Cython to assume that you won't pass None into the function (so it disables some checks for extra speed). This can be deleted in Python.
cdef unsigned int i, j defines i and j as C integers, for extra speed. In Python i and j are created when they are needed in the for loop so the definition can be deleted completely.

Times-two faster than bit-shift, for Python 3.x integers?

I was looking at the source of sorted_containers and was surprised to see this line:
self._load, self._twice, self._half = load, load * 2, load >> 1
Here load is an integer. Why use bit shift in one place, and multiplication in another? It seems reasonable that bit shifting may be faster than integral division by 2, but why not replace the multiplication by a shift as well? I benchmarked the the following cases:
(times, divide)
(shift, shift)
(times, shift)
(shift, divide)
and found that #3 is consistently faster than other alternatives:
# self._load, self._twice, self._half = load, load * 2, load >> 1
import random
import timeit
import pandas as pd
x = random.randint(10 ** 3, 10 ** 6)
def test_naive():
a, b, c = x, 2 * x, x // 2
def test_shift():
a, b, c = x, x << 1, x >> 1
def test_mixed():
a, b, c = x, x * 2, x >> 1
def test_mixed_swapped():
a, b, c = x, x << 1, x // 2
def observe(k):
print(k)
return {
'naive': timeit.timeit(test_naive),
'shift': timeit.timeit(test_shift),
'mixed': timeit.timeit(test_mixed),
'mixed_swapped': timeit.timeit(test_mixed_swapped),
}
def get_observations():
return pd.DataFrame([observe(k) for k in range(100)])
The question:
Is my test valid? If so, why is (multiply, shift) faster than (shift, shift)?
I run Python 3.5 on Ubuntu 14.04.
Edit
Above is the original statement of the question. Dan Getz provides an excellent explanation in his answer.
For the sake of completeness, here are sample illustrations for larger x when multiplication optimizations do not apply.

This seems to be because multiplication of small numbers is optimized in CPython 3.5, in a way that left shifts by small numbers are not. Positive left shifts always create a larger integer object to store the result, as part of the calculation, while for multiplications of the sort you used in your test, a special optimization avoids this and creates an integer object of the correct size. This can be seen in the source code of Python's integer implementation.
Because integers in Python are arbitrary-precision, they are stored as arrays of integer "digits", with a limit on the number of bits per integer digit. So in the general case, operations involving integers are not single operations, but instead need to handle the case of multiple "digits". In pyport.h, this bit limit is defined as 30 bits on 64-bit platform, or 15 bits otherwise. (I'll just call this 30 from here on to keep the explanation simple. But note that if you were using Python compiled for 32-bit, your benchmark's result would depend on if x were less than 32,768 or not.)
When an operation's inputs and outputs stay within this 30-bit limit, the operation can be handled in an optimized way instead of the general way. The beginning of the integer multiplication implementation is as follows:
static PyObject *
long_mul(PyLongObject *a, PyLongObject *b)
{
PyLongObject *z;
CHECK_BINOP(a, b);
/* fast path for single-digit multiplication */
if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
#ifdef HAVE_LONG_LONG
return PyLong_FromLongLong((PY_LONG_LONG)v);
#else
/* if we don't have long long then we're almost certainly
using 15-bit digits, so v will fit in a long. In the
unlikely event that we're using 30-bit digits on a platform
without long long, a large v will just cause us to fall
through to the general multiplication code below. */
if (v >= LONG_MIN && v <= LONG_MAX)
return PyLong_FromLong((long)v);
#endif
}
So when multiplying two integers where each fits in a 30-bit digit, this is done as a direct multiplication by the CPython interpreter, instead of working with the integers as arrays. (MEDIUM_VALUE() called on a positive integer object simply gets its first 30-bit digit.) If the result fits in a single 30-bit digit, PyLong_FromLongLong() will notice this in a relatively small number of operations, and create a single-digit integer object to store it.
In contrast, left shifts are not optimized this way, and every left shift deals with the integer being shifted as an array. In particular, if you look at the source code for long_lshift(), in the case of a small but positive left shift, a 2-digit integer object is always created, if only to have its length truncated to 1 later: (my comments in /*** ***/)
static PyObject *
long_lshift(PyObject *v, PyObject *w)
{
/*** ... ***/
wordshift = shiftby / PyLong_SHIFT; /*** zero for small w ***/
remshift = shiftby - wordshift * PyLong_SHIFT; /*** w for small w ***/
oldsize = Py_ABS(Py_SIZE(a)); /*** 1 for small v > 0 ***/
newsize = oldsize + wordshift;
if (remshift)
++newsize; /*** here newsize becomes at least 2 for w > 0, v > 0 ***/
z = _PyLong_New(newsize);
/*** ... ***/
}
Integer division
You didn't ask about the worse performance of integer floor division compared to right shifts, because that fit your (and my) expectations. But dividing a small positive number by another small positive number is not as optimized as small multiplications, either. Every // computes both the quotient and the remainder using the function long_divrem(). This remainder is computed for a small divisor with a multiplication, and is stored in a newly-allocated integer object, which in this situation is immediately discarded.
Or at least, that was the case when this question was originally asked. In CPython 3.6, a fast path for small int // was added, so // now beats >> for small ints too.

Swift equivalent of Python slice assignment

In Python, one can have a list (similar to an array in swift):
>>> li=[0,1,2,3,4,5]
And perform a slice assignment on any / all of the list:
>>> li[2:]=[99] # note then end index is not needed if you mean 'to the end'
>>> li
[0, 1, 99]
Swift has a similar slice assignment (this is in the swift interactive shell):
1> var arr=[0,1,2,3,4,5]
arr: [Int] = 6 values {
[0] = 0
[1] = 1
[2] = 2
[3] = 3
[4] = 4
[5] = 5
}
2> arr[2...arr.endIndex-1]=[99]
3> arr
$R0: [Int] = 3 values {
[0] = 0
[1] = 1
[2] = 99
}
So far, so good. But, there are a couple of issues.
First, swift does not work for an empty list or if the index is after the endIndex. Python appends if the slice index is after then end index:
>>> li=[] # empty
>>> li[2:]=[6,7,8]
>>> li
[6, 7, 8]
>>> li=[0,1,2]
>>> li[999:]=[999]
>>> li
[0, 1, 2, 999]
The equivalent in swift is an error:
4> var arr=[Int]()
arr: [Int] = 0 values
5> arr[2...arr.endIndex-1]=[99]
fatal error: Can't form Range with end < start
That is easy to test and code around.
Second issue is the killer: it is really slow in swift. Consider this Python code to to perform exact summations of a list of floats:
def msum(iterable):
"Full precision summation using multiple floats for intermediate values"
# Rounded x+y stored in hi with the round-off stored in lo. Together
# hi+lo are exactly equal to x+y. The inner loop applies hi/lo summation
# to each partial so that the list of partial sums remains exact.
# Depends on IEEE-754 arithmetic guarantees. See proof of correctness at:
# www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps
partials = [] # sorted, non-overlapping partial sums
for x in iterable:
i = 0
for y in partials:
if abs(x) < abs(y):
x, y = y, x
hi = x + y
lo = y - (hi - x)
if lo:
partials[i] = lo
i += 1
x = hi
partials[i:] = [x]
return sum(partials, 0.0)
It works by maintaining a hi/lo partial summations so that msum([.1]*10) produces 1.0 exactly rather than 0.9999999999999999. The C equivalent of msum is part of the math library in Python.
I have attempted to replicate in swift:
func msum(it:[Double])->Double {
// Full precision summation using multiple floats for intermediate values
var partials=[Double]()
for var x in it {
var i=0
for var y in partials{
if abs(x) < abs(y){
(x, y)=(y, x)
}
let hi=x+y
let lo=y-(hi-x)
if abs(lo)>0.0 {
partials[i]=lo
i+=1
}
x=hi
}
// slow part trying to replicate Python's slice assignment partials[i:]=[x]
if partials.endIndex>i {
partials[i...partials.endIndex-1]=[x]
}
else {
partials.append(x)
}
}
return partials.reduce(0.0, combine: +)
}
Test the function and speed:
import Foundation
var arr=[Double]()
for _ in 1...1000000 {
arr+=[10, 1e100, 10, -1e100]
}
print(arr.reduce(0, combine: +)) // will be 0.0
var startTime: CFAbsoluteTime!
startTime = CFAbsoluteTimeGetCurrent()
print(msum(arr), arr.count*5) // should be arr.count * 5
print(CFAbsoluteTimeGetCurrent() - startTime)
On my machine, that takes 7 seconds to complete. Python native msum takes 2.2 seconds (about 4x faster) and the library fsum function takes 0.09 seconds (almost 90x faster)
I have tried to replace partials[i...partials.endIndex-1]=[x] with arr.removeRange(i..<arr.endIndex) and then appending. Little faster but not much.
Question:
Is this idiomatic swift: partials[i...partials.endIndex-1]=[x]
Is there a faster / better way?

First (as already said in the comments), there is a huge
difference between non-optimized and optimised code in Swift
("-Onone" vs "-O" compiler option, or Debug vs. Release configuration), so for performance test make sure that the "Release" configuration
is selected. ("Release" is also the default configuration if you
profile the code with Instruments).
It has some advantages to use half-open ranges:
var arr = [0,1,2,3,4,5]
arr[2 ..< arr.endIndex] = [99]
print(arr) // [0, 1, 99]
In fact, that's how a range is stored internally, and it allows you
to insert a slice at the end of the array (but not beyond that as in Python):
var arr = [Int]()
arr[0 ..< arr.endIndex] = [99]
print(arr) // [99]
So
if partials.endIndex > i {
partials[i...partials.endIndex-1]=[x]
}
else {
partials.append(x)
}
is equivalent to
partials[i ..< partials.endIndex] = [x]
// Or: partials.replaceRange(i ..< partials.endIndex, with: [x])
However, that is not a performance improvement. It seems that
replacing a slice is slow in Swift. Truncating the array and
appending the new element with
partials.replaceRange(i ..< partials.endIndex, with: [])
partials.append(x)
reduced the time for your test code from about 1.25 to 0.75 seconds on my
computer.

As #MartinR points out, replaceRange is faster than slice assignment.
If you want maximum speed (based on my tests), your best bet is probably:
partials.replaceRange(i..<partials.endIndex, with: CollectionOfOne(x))
CollectionOfOne is faster than [x] because it just stores the element inline within the struct, rather than allocating memory like an array.

Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3?

It is my understanding that the range() function, which is actually an object type in Python 3, generates its contents on the fly, similar to a generator.
This being the case, I would have expected the following line to take an inordinate amount of time because, in order to determine whether 1 quadrillion is in the range, a quadrillion values would have to be generated:
1_000_000_000_000_000 in range(1_000_000_000_000_001)
Furthermore: it seems that no matter how many zeroes I add on, the calculation more or less takes the same amount of time (basically instantaneous).
I have also tried things like this, but the calculation is still almost instant:
# count by tens
1_000_000_000_000_000_000_000 in range(0,1_000_000_000_000_000_000_001,10)
If I try to implement my own range function, the result is not so nice!
def my_crappy_range(N):
i = 0
while i < N:
yield i
i += 1
return
What is the range() object doing under the hood that makes it so fast?
Martijn Pieters's answer was chosen for its completeness, but also see abarnert's first answer for a good discussion of what it means for range to be a full-fledged sequence in Python 3, and some information/warning regarding potential inconsistency for __contains__ function optimization across Python implementations. abarnert's other answer goes into some more detail and provides links for those interested in the history behind the optimization in Python 3 (and lack of optimization of xrange in Python 2). Answers by poke and by wim provide the relevant C source code and explanations for those who are interested.

The Python 3 range() object doesn't produce numbers immediately; it is a smart sequence object that produces numbers on demand. All it contains is your start, stop and step values, then as you iterate over the object the next integer is calculated each iteration.
The object also implements the object.__contains__ hook, and calculates if your number is part of its range. Calculating is a (near) constant time operation *. There is never a need to scan through all possible integers in the range.
From the range() object documentation:
The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed).
So at a minimum, your range() object would do:
class my_range:
def __init__(self, start, stop=None, step=1, /):
if stop is None:
start, stop = 0, start
self.start, self.stop, self.step = start, stop, step
if step < 0:
lo, hi, step = stop, start, -step
else:
lo, hi = start, stop
self.length = 0 if lo > hi else ((hi - lo - 1) // step) + 1
def __iter__(self):
current = self.start
if self.step < 0:
while current > self.stop:
yield current
current += self.step
else:
while current < self.stop:
yield current
current += self.step
def __len__(self):
return self.length
def __getitem__(self, i):
if i < 0:
i += self.length
if 0 <= i < self.length:
return self.start + i * self.step
raise IndexError('my_range object index out of range')
def __contains__(self, num):
if self.step < 0:
if not (self.stop < num <= self.start):
return False
else:
if not (self.start <= num < self.stop):
return False
return (num - self.start) % self.step == 0
This is still missing several things that a real range() supports (such as the .index() or .count() methods, hashing, equality testing, or slicing), but should give you an idea.
I also simplified the __contains__ implementation to only focus on integer tests; if you give a real range() object a non-integer value (including subclasses of int), a slow scan is initiated to see if there is a match, just as if you use a containment test against a list of all the contained values. This was done to continue to support other numeric types that just happen to support equality testing with integers but are not expected to support integer arithmetic as well. See the original Python issue that implemented the containment test.
* Near constant time because Python integers are unbounded and so math operations also grow in time as N grows, making this a O(log N) operation. Since it’s all executed in optimised C code and Python stores integer values in 30-bit chunks, you’d run out of memory before you saw any performance impact due to the size of the integers involved here.

The fundamental misunderstanding here is in thinking that range is a generator. It's not. In fact, it's not any kind of iterator.
You can tell this pretty easily:
>>> a = range(5)
>>> print(list(a))
[0, 1, 2, 3, 4]
>>> print(list(a))
[0, 1, 2, 3, 4]
If it were a generator, iterating it once would exhaust it:
>>> b = my_crappy_range(5)
>>> print(list(b))
[0, 1, 2, 3, 4]
>>> print(list(b))
[]
What range actually is, is a sequence, just like a list. You can even test this:
>>> import collections.abc
>>> isinstance(a, collections.abc.Sequence)
True
This means it has to follow all the rules of being a sequence:
>>> a[3] # indexable
3
>>> len(a) # sized
5
>>> 3 in a # membership
True
>>> reversed(a) # reversible
<range_iterator at 0x101cd2360>
>>> a.index(3) # implements 'index'
3
>>> a.count(3) # implements 'count'
1
The difference between a range and a list is that a range is a lazy or dynamic sequence; it doesn't remember all of its values, it just remembers its start, stop, and step, and creates the values on demand on __getitem__.
(As a side note, if you print(iter(a)), you'll notice that range uses the same listiterator type as list. How does that work? A listiterator doesn't use anything special about list except for the fact that it provides a C implementation of __getitem__, so it works fine for range too.)
Now, there's nothing that says that Sequence.__contains__ has to be constant time—in fact, for obvious examples of sequences like list, it isn't. But there's nothing that says it can't be. And it's easier to implement range.__contains__ to just check it mathematically ((val - start) % step, but with some extra complexity to deal with negative steps) than to actually generate and test all the values, so why shouldn't it do it the better way?
But there doesn't seem to be anything in the language that guarantees this will happen. As Ashwini Chaudhari points out, if you give it a non-integral value, instead of converting to integer and doing the mathematical test, it will fall back to iterating all the values and comparing them one by one. And just because CPython 3.2+ and PyPy 3.x versions happen to contain this optimization, and it's an obvious good idea and easy to do, there's no reason that IronPython or NewKickAssPython 3.x couldn't leave it out. (And in fact, CPython 3.0-3.1 didn't include it.)
If range actually were a generator, like my_crappy_range, then it wouldn't make sense to test __contains__ this way, or at least the way it makes sense wouldn't be obvious. If you'd already iterated the first 3 values, is 1 still in the generator? Should testing for 1 cause it to iterate and consume all the values up to 1 (or up to the first value >= 1)?

Use the source, Luke!
In CPython, range(...).__contains__ (a method wrapper) will eventually delegate to a simple calculation which checks if the value can possibly be in the range. The reason for the speed here is we're using mathematical reasoning about the bounds, rather than a direct iteration of the range object. To explain the logic used:
Check that the number is between start and stop, and
Check that the stride value doesn't "step over" our number.
For example, 994 is in range(4, 1000, 2) because:
4 <= 994 < 1000, and
(994 - 4) % 2 == 0.
The full C code is included below, which is a bit more verbose because of memory management and reference counting details, but the basic idea is there:
static int
range_contains_long(rangeobject *r, PyObject *ob)
{
int cmp1, cmp2, cmp3;
PyObject *tmp1 = NULL;
PyObject *tmp2 = NULL;
PyObject *zero = NULL;
int result = -1;
zero = PyLong_FromLong(0);
if (zero == NULL) /* MemoryError in int(0) */
goto end;
/* Check if the value can possibly be in the range. */
cmp1 = PyObject_RichCompareBool(r->step, zero, Py_GT);
if (cmp1 == -1)
goto end;
if (cmp1 == 1) { /* positive steps: start <= ob < stop */
cmp2 = PyObject_RichCompareBool(r->start, ob, Py_LE);
cmp3 = PyObject_RichCompareBool(ob, r->stop, Py_LT);
}
else { /* negative steps: stop < ob <= start */
cmp2 = PyObject_RichCompareBool(ob, r->start, Py_LE);
cmp3 = PyObject_RichCompareBool(r->stop, ob, Py_LT);
}
if (cmp2 == -1 || cmp3 == -1) /* TypeError */
goto end;
if (cmp2 == 0 || cmp3 == 0) { /* ob outside of range */
result = 0;
goto end;
}
/* Check that the stride does not invalidate ob's membership. */
tmp1 = PyNumber_Subtract(ob, r->start);
if (tmp1 == NULL)
goto end;
tmp2 = PyNumber_Remainder(tmp1, r->step);
if (tmp2 == NULL)
goto end;
/* result = ((int(ob) - start) % step) == 0 */
result = PyObject_RichCompareBool(tmp2, zero, Py_EQ);
end:
Py_XDECREF(tmp1);
Py_XDECREF(tmp2);
Py_XDECREF(zero);
return result;
}
static int
range_contains(rangeobject *r, PyObject *ob)
{
if (PyLong_CheckExact(ob) || PyBool_Check(ob))
return range_contains_long(r, ob);
return (int)_PySequence_IterSearch((PyObject*)r, ob,
PY_ITERSEARCH_CONTAINS);
}
The "meat" of the idea is mentioned in the comment lines:
/* positive steps: start <= ob < stop */
/* negative steps: stop < ob <= start */
/* result = ((int(ob) - start) % step) == 0 */
As a final note - look at the range_contains function at the bottom of the code snippet. If the exact type check fails then we don't use the clever algorithm described, instead falling back to a dumb iteration search of the range using _PySequence_IterSearch! You can check this behaviour in the interpreter (I'm using v3.5.0 here):
>>> x, r = 1000000000000000, range(1000000000000001)
>>> class MyInt(int):
... pass
...
>>> x_ = MyInt(x)
>>> x in r # calculates immediately :)
True
>>> x_ in r # iterates for ages.. :(
^\Quit (core dumped)

To add to Martijn’s answer, this is the relevant part of the source (in C, as the range object is written in native code):
static int
range_contains(rangeobject *r, PyObject *ob)
{
if (PyLong_CheckExact(ob) || PyBool_Check(ob))
return range_contains_long(r, ob);
return (int)_PySequence_IterSearch((PyObject*)r, ob,
PY_ITERSEARCH_CONTAINS);
}
So for PyLong objects (which is int in Python 3), it will use the range_contains_long function to determine the result. And that function essentially checks if ob is in the specified range (although it looks a bit more complex in C).
If it’s not an int object, it falls back to iterating until it finds the value (or not).
The whole logic could be translated to pseudo-Python like this:
def range_contains (rangeObj, obj):
if isinstance(obj, int):
return range_contains_long(rangeObj, obj)
# default logic by iterating
return any(obj == x for x in rangeObj)
def range_contains_long (r, num):
if r.step > 0:
# positive step: r.start <= num < r.stop
cmp2 = r.start <= num
cmp3 = num < r.stop
else:
# negative step: r.start >= num > r.stop
cmp2 = num <= r.start
cmp3 = r.stop < num
# outside of the range boundaries
if not cmp2 or not cmp3:
return False
# num must be on a valid step inside the boundaries
return (num - r.start) % r.step == 0

If you're wondering why this optimization was added to range.__contains__, and why it wasn't added to xrange.__contains__ in 2.7:
First, as Ashwini Chaudhary discovered, issue 1766304 was opened explicitly to optimize [x]range.__contains__. A patch for this was accepted and checked in for 3.2, but not backported to 2.7 because "xrange has behaved like this for such a long time that I don't see what it buys us to commit the patch this late." (2.7 was nearly out at that point.)
Meanwhile:
Originally, xrange was a not-quite-sequence object. As the 3.1 docs say:
Range objects have very little behavior: they only support indexing, iteration, and the len function.
This wasn't quite true; an xrange object actually supported a few other things that come automatically with indexing and len,* including __contains__ (via linear search). But nobody thought it was worth making them full sequences at the time.
Then, as part of implementing the Abstract Base Classes PEP, it was important to figure out which builtin types should be marked as implementing which ABCs, and xrange/range claimed to implement collections.Sequence, even though it still only handled the same "very little behavior". Nobody noticed that problem until issue 9213. The patch for that issue not only added index and count to 3.2's range, it also re-worked the optimized __contains__ (which shares the same math with index, and is directly used by count).** This change went in for 3.2 as well, and was not backported to 2.x, because "it's a bugfix that adds new methods". (At this point, 2.7 was already past rc status.)
So, there were two chances to get this optimization backported to 2.7, but they were both rejected.
* In fact, you even get iteration for free with indexing alone, but in 2.3 xrange objects got a custom iterator.
** The first version actually reimplemented it, and got the details wrong—e.g., it would give you MyIntSubclass(2) in range(5) == False. But Daniel Stutzbach's updated version of the patch restored most of the previous code, including the fallback to the generic, slow _PySequence_IterSearch that pre-3.2 range.__contains__ was implicitly using when the optimization doesn't apply.

The other answers explained it well already, but I'd like to offer another experiment illustrating the nature of range objects:
>>> r = range(5)
>>> for i in r:
print(i, 2 in r, list(r))
0 True [0, 1, 2, 3, 4]
1 True [0, 1, 2, 3, 4]
2 True [0, 1, 2, 3, 4]
3 True [0, 1, 2, 3, 4]
4 True [0, 1, 2, 3, 4]
As you can see, a range object is an object that remembers its range and can be used many times (even while iterating over it), not just a one-time generator.

It's all about a lazy approach to the evaluation and some extra optimization of range.
Values in ranges don't need to be computed until real use, or even further due to extra optimization.
By the way, your integer is not such big, consider sys.maxsize
sys.maxsize in range(sys.maxsize) is pretty fast
due to optimization - it's easy to compare given integer just with min and max of range.
but:
Decimal(sys.maxsize) in range(sys.maxsize) is pretty slow.
(in this case, there is no optimization in range, so if python receives unexpected Decimal, python will compare all numbers)
You should be aware of an implementation detail but should not be relied upon, because this may change in the future.

TL;DR
The object returned by range() is actually a range object. This object implements the iterator interface so you can iterate over its values sequentially, just like a generator, list, or tuple.
But it also implements the __contains__ interface which is actually what gets called when an object appears on the right-hand side of the in operator. The __contains__() method returns a bool of whether or not the item on the left-hand side of the in is in the object. Since range objects know their bounds and stride, this is very easy to implement in O(1).

Due to optimization, it is very easy to compare given integers just with min and max range.
The reason that the range() function is so fast in Python3 is that here we use mathematical reasoning for the bounds, rather than a direct iteration of the range object.
So for explaining the logic here:
Check whether the number is between the start and stop.
Check whether the step precision value doesn't go over our number.
Take an example, 997 is in range(4, 1000, 3) because:
4 <= 997 < 1000, and (997 - 4) % 3 == 0.

Try x-1 in (i for i in range(x)) for large x values, which uses a generator comprehension to avoid invoking the range.__contains__ optimisation.

TLDR;
the range is an arithmetic series so it can very easily calculate whether the object is there. It could even get the index of it if it were list like really quickly.

__contains__ method compares directly with the start and end of the range

Define a matrix depending on variable in Mathematica

I am translating my code from Python to Mathematica. I am trying to define a matrix, whose values depend on a variable chosen by the user, called kappa.
In Python the code looked like that:
def getA(kappa):
matrix = zeros((n, n), float)
for i in range(n):
for j in range(n):
matrix[i][j] = 2*math.cos((2*math.pi/n)*(abs(j-i))*kappa)
n = 5
return matrix
What I have done so far in Mathematica is the following piece of code:
n = 5
getA[kappa_] :=
A = Table[0.0, {n}, {n}];
For[i = 0, i < n, i++,
For[ j = 0, j < n, j++,
A[[i, j]] = 2*Cos[(2*pi/n)*(abs (j - i))*kappa]]];
b = getA[3]
But when I try to evaluate this matrix for a value of kappa equal to 3, I get the following error:
Set::partd: "Part specification A[[i,j]] is longer than depth of object.
How can I fix it?

Try something like this
n = 5;
A = Table[2*Cos[(2 \[Pi]/n) (Abs[ j - i]) \[Kappa]], {i, 1, n}, {j, 1, n}];
b = A /. \[Kappa]->3
I'll leave you to package this into a function if you want to.
You write that you are trying to translate Python into Mathematica; your use of For loops suggests that you are trying to translate to C-in-Mathematica. The first rule of Mathematica club is don't use loops.
Besides that you've made a number of small syntactical errors, such as using abs() where you should have had Abs[] (Mathematica's built-in functions all have names beginning with a capital letter, they wrap their arguments in [ and ], not ( and )), pi is not the name of the value of the ratio of a circle's diameter to its radius (it's called \[Pi]). Note too that I've omitted the multiplication operator which is often not required.

In your particular case, this would be the fastest and the most straightforward solution:
getA[κ_, n_] := ToeplitzMatrix[2 Cos[2 π κ Range[0, n - 1] / n]]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.