How globally accurate is Python's time.time function? [duplicate] - python

Is there a way to measure time with high precision in Python --- more precise than one second? I doubt that there is a cross-platform way of doing that; I'm interested in high-precision time on Unix, particularly Solaris running on a Sun SPARC machine.
timeit seems to be capable of high-precision time measurement, but rather than measure how long a code snippet takes, I'd like to directly access the time values.

The standard time.time() function provides sub-second precision, though that precision varies by platform. On Linux and macOS the precision is roughly +/- 1 microsecond (0.001 milliseconds). On Windows it is roughly +/- 16 milliseconds, because the clock is driven by the process-interrupt timer. The timeit module can provide higher resolution if you're measuring execution time.
>>> import time
>>> time.time() #return seconds from epoch
1261367718.971009
Python 3.7 introduces new functions to the time module that provide higher resolution:
>>> import time
>>> time.time_ns()
1530228533161016309
>>> time.time_ns() / (10 ** 9) # convert to floating-point seconds
1530228544.0792289

Python tries hard to use the most precise time function for your platform to implement time.time():
/* Implement floattime() for various platforms */

static double
floattime(void)
{
    /* There are three ways to get the time:
      (1) gettimeofday() -- resolution in microseconds
      (2) ftime() -- resolution in milliseconds
      (3) time() -- resolution in seconds
      In all cases the return value is a float in seconds.
      Since on some systems (e.g. SCO ODT 3.0) gettimeofday() may
      fail, so we fall back on ftime() or time().
      Note: clock resolution does not imply clock accuracy! */
#ifdef HAVE_GETTIMEOFDAY
    {
        struct timeval t;
#ifdef GETTIMEOFDAY_NO_TZ
        if (gettimeofday(&t) == 0)
            return (double)t.tv_sec + t.tv_usec*0.000001;
#else /* !GETTIMEOFDAY_NO_TZ */
        if (gettimeofday(&t, (struct timezone *)NULL) == 0)
            return (double)t.tv_sec + t.tv_usec*0.000001;
#endif /* !GETTIMEOFDAY_NO_TZ */
    }
#endif /* !HAVE_GETTIMEOFDAY */
    {
#if defined(HAVE_FTIME)
        struct timeb t;
        ftime(&t);
        return (double)t.time + (double)t.millitm * (double)0.001;
#else /* !HAVE_FTIME */
        time_t secs;
        time(&secs);
        return (double)secs;
#endif /* !HAVE_FTIME */
    }
}
( from http://svn.python.org/view/python/trunk/Modules/timemodule.c?revision=81756&view=markup )

David's post was attempting to show what the clock resolution is on Windows. I was confused by his output, so I wrote some code that shows that time.time() on my Windows 8 x64 laptop has a resolution of 1 msec:
import time

# measure the smallest time delta by spinning until the time changes
def measure():
    t0 = time.time()
    t1 = t0
    while t1 == t0:
        t1 = time.time()
    return (t0, t1, t1-t0)

samples = [measure() for i in range(10)]

for s in samples:
    print s
Which outputs:
(1390455900.085, 1390455900.086, 0.0009999275207519531)
(1390455900.086, 1390455900.087, 0.0009999275207519531)
(1390455900.087, 1390455900.088, 0.0010001659393310547)
(1390455900.088, 1390455900.089, 0.0009999275207519531)
(1390455900.089, 1390455900.09, 0.0009999275207519531)
(1390455900.09, 1390455900.091, 0.0010001659393310547)
(1390455900.091, 1390455900.092, 0.0009999275207519531)
(1390455900.092, 1390455900.093, 0.0009999275207519531)
(1390455900.093, 1390455900.094, 0.0010001659393310547)
(1390455900.094, 1390455900.095, 0.0009999275207519531)
And a way to do a 1000 sample average of the delta:
reduce( lambda a,b:a+b, [measure()[2] for i in range(1000)], 0.0) / 1000.0
Which output on two consecutive runs:
0.001
0.0010009999275207519
So time.time() on my Windows 8 x64 has a resolution of 1 msec.
A similar run on time.clock() returns a resolution of 0.4 microseconds:
def measure_clock():
    t0 = time.clock()
    t1 = time.clock()
    while t1 == t0:
        t1 = time.clock()
    return (t0, t1, t1-t0)

reduce( lambda a,b:a+b, [measure_clock()[2] for i in range(1000000)] )/1000000.0
Returns:
4.3571334791658954e-07
Which is ~0.4e-06
An interesting thing about time.clock() is that it returns the time since the method was first called, so if you wanted microsecond resolution wall time you could do something like this:
class HighPrecisionWallTime():
    def __init__(self,):
        self._wall_time_0 = time.time()
        self._clock_0 = time.clock()

    def sample(self,):
        dc = time.clock()-self._clock_0
        return self._wall_time_0 + dc
(which would probably drift after a while, but you could re-sync it occasionally; for example, re-syncing whenever dc > 3600 would correct it every hour)
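A minimal sketch of that occasional re-sync, keeping the answer's time.clock()-based approach (the one-hour threshold is arbitrary, and note that time.clock() was removed in Python 3.8):
import time

class HighPrecisionWallTime():
    def __init__(self, resync_every=3600.0):
        self._resync_every = resync_every
        self._resync()

    def _resync(self):
        # re-anchor the wall-clock origin to limit accumulated drift
        self._wall_time_0 = time.time()
        self._clock_0 = time.clock()

    def sample(self):
        dc = time.clock() - self._clock_0
        if dc > self._resync_every:
            self._resync()
            dc = time.clock() - self._clock_0
        return self._wall_time_0 + dc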

If Python 3 is an option, you have two choices:
time.perf_counter, which always uses the highest-resolution clock available on your platform. It does include time spent outside of the process (e.g. while sleeping).
time.process_time, which returns the CPU time of the current process. It does NOT include time spent outside of the process.
The difference between the two can be shown with:
from time import (
    process_time,
    perf_counter,
    sleep,
)

print(process_time())
sleep(1)
print(process_time())

print(perf_counter())
sleep(1)
print(perf_counter())
Which outputs:
0.03125
0.03125
2.560001310720671e-07
1.0005455362793145

You can also use time.clock(). On Unix it counts the CPU time used by the process, and on Windows it counts the wall-clock time since the first call to it. It is more precise than time.time(), and it is the function usually used to measure performance.
Just call
import time
t_ = time.clock()
#Your code here
print 'Time in function', time.clock() - t_
EDIT: Oops, I misread the question; you want to know the exact time, not the time spent...

Python 3.7 introduces 6 new time functions with nanosecond resolution, for example instead of time.time() you can use time.time_ns() to avoid floating point imprecision issues:
import time
print(time.time())
# 1522915698.3436284
print(time.time_ns())
# 1522915698343660458
These 6 functions are described in PEP 564:
time.clock_gettime_ns(clock_id)
time.clock_settime_ns(clock_id, time:int)
time.monotonic_ns()
time.perf_counter_ns()
time.process_time_ns()
time.time_ns()
These functions are similar to the version without the _ns suffix, but
return a number of nanoseconds as a Python int.
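A small illustration of why the integer variants matter: converting the nanosecond count back to float seconds rounds it to a granularity of a few hundred nanoseconds (a double carries only about 15-16 significant digits), while the int keeps every digit.
import time

t_ns = time.time_ns()          # exact integer nanoseconds since the epoch
t_float = t_ns / 10**9         # float seconds; the last few digits are rounded away

print(t_ns)
print(int(t_float * 10**9))    # usually differs from t_ns in the last digits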

time.clock() has 13 decimal places on Windows but only two on Linux.
time.time() has 17 decimal places on Linux and 16 on Windows, but the actual precision is different.
I don't agree with the documentation that time.clock() should be used for benchmarking on Unix/Linux. It is not precise enough, so which timer to use depends on the operating system.
On Linux, the time resolution is high in time.time():
>>> time.time(), time.time()
(1281384913.4374139, 1281384913.4374161)
On Windows, however, the time function appears to keep returning the last value it produced:
>>> time.time()-int(time.time()), time.time()-int(time.time()), time.time()-time.time()
(0.9570000171661377, 0.9570000171661377, 0.0)
Even if I write the calls on different lines on Windows, it still returns the same value, so the real precision is lower.
So for serious measurements, a platform check (import platform, platform.system()) has to be done to determine whether to use time.clock() or time.time().
(Tested on Windows 7 and Ubuntu 9.10 with python 2.6 and 3.1)
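A minimal sketch of such a platform check, for the Python 2.x era this answer targets (on Python 3.3+ you would simply use time.perf_counter() instead):
import platform
import time

if platform.system() == 'Windows':
    precise_timer = time.clock   # sub-microsecond resolution on Windows, per the measurements above
else:
    precise_timer = time.time    # microsecond resolution on Linux/Unix

t0 = precise_timer()
# ... code being measured ...
elapsed = precise_timer() - t0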

The original question specifically asked for Unix, but multiple answers have touched on Windows, and as a result there is misleading information about Windows. The default timer resolution on Windows is 15.6 ms; you can verify this here.
Using a slightly modified script from cod3monk3y, I can show that the Windows timer resolution is ~15 milliseconds by default. I'm using a tool available here to modify the resolution.
Script:
import time

# measure the smallest time delta by spinning until the time changes
def measure():
    t0 = time.time()
    t1 = t0
    while t1 == t0:
        t1 = time.time()
    return t1-t0

samples = [measure() for i in range(30)]

for s in samples:
    print(f'time delta: {s:.4f} seconds')
These results were gathered on windows 10 pro 64-bit running python 3.7 64-bit.

The comment left by tiho on Mar 27 '14 at 17:21 deserves to be its own answer:
In order to avoid platform-specific code, use timeit.default_timer()
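For example (a minimal sketch; since Python 3.3, timeit.default_timer is simply an alias for time.perf_counter):
import timeit

start = timeit.default_timer()
# ... code being measured ...
elapsed = timeit.default_timer() - start
print('elapsed: {:.6f} seconds'.format(elapsed))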

I observed that the resolution of time.time() is different between Windows 10 Professional and Education versions.
On a Windows 10 Professional machine, the resolution is 1 ms.
On a Windows 10 Education machine, the resolution is 16 ms.
Fortunately, there's a tool that increases Python's time resolution in Windows:
https://vvvv.org/contribution/windows-system-timer-tool
With this tool, I was able to achieve 1 ms resolution regardless of Windows version. You will need to keep it running while executing your Python code.

For those stuck on Windows (version >= Server 2012 or Windows 8) and Python 2.7:
import ctypes

class FILETIME(ctypes.Structure):
    _fields_ = [("dwLowDateTime", ctypes.c_uint),
                ("dwHighDateTime", ctypes.c_uint)]

def time():
    """Accurate version of time.time() for Windows; returns UTC time in terms of seconds since 01/01/1601
    """
    file_time = FILETIME()
    ctypes.windll.kernel32.GetSystemTimePreciseAsFileTime(ctypes.byref(file_time))
    return (file_time.dwLowDateTime + (file_time.dwHighDateTime << 32)) / 1.0e7
See the GetSystemTimePreciseAsFileTime function documentation for details.
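A hypothetical usage sketch (Windows only): shift from the 1601 FILETIME epoch to the Unix epoch so the result is comparable with the standard library; 11644473600 is the number of seconds between the two epochs, and the standard module is imported under an alias because time() is redefined above.
import time as stdlib_time

SECONDS_1601_TO_1970 = 11644473600

precise_unix_time = time() - SECONDS_1601_TO_1970
print(precise_unix_time)
print(precise_unix_time - stdlib_time.time())   # should be a tiny fraction of a second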

On the same Win10 system, using two distinct measurement approaches, there appears to be a difference of roughly 0.5 ms between the resulting mean deltas. If you care about sub-millisecond precision, check my code below.
The modified code is based on code from users cod3monk3y and Kevin S.
Python: 3.7.3 (default, date, time) [MSC v.1915 64 bit (AMD64)] on Windows 10
import sys
import time

def measure1(mean):
    for i in range(1, my_range+1):
        x = time.time()
        td = x - samples1[i-1][2]
        if i-1 == 0:
            td = 0
        td = f'{td:.6f}'
        samples1.append((i, td, x))
        mean += float(td)
        print(mean)
        sys.stdout.flush()
        time.sleep(0.001)
    mean = mean/my_range
    return mean

def measure2(nr):
    t0 = time.time()
    t1 = t0
    while t1 == t0:
        t1 = time.time()
    td = t1-t0
    td = f'{td:.6f}'
    return (nr, td, t1, t0)

samples1 = [(0, 0, 0)]
my_range = 10
mean1 = 0.0
mean2 = 0.0

mean1 = measure1(mean1)

for i in samples1: print(i)
print('...\n\n')

samples2 = [measure2(i) for i in range(11)]

for s in samples2:
    #print(f'time delta: {s:.4f} seconds')
    mean2 += float(s[1])
    print(s)
mean2 = mean2/my_range

print('\nMean1 : ' f'{mean1:.6f}')
print('Mean2 : ' f'{mean2:.6f}')
The measure1 results:
nr, td, t0
(0, 0, 0)
(1, '0.000000', 1562929696.617988)
(2, '0.002000', 1562929696.6199884)
(3, '0.001001', 1562929696.620989)
(4, '0.001001', 1562929696.62199)
(5, '0.001001', 1562929696.6229906)
(6, '0.001001', 1562929696.6239917)
(7, '0.001001', 1562929696.6249924)
(8, '0.001000', 1562929696.6259928)
(9, '0.001001', 1562929696.6269937)
(10, '0.001001', 1562929696.6279945)
...
The measure2 results:
nr, td , t1, t0
(0, '0.000500', 1562929696.6294951, 1562929696.6289947)
(1, '0.000501', 1562929696.6299958, 1562929696.6294951)
(2, '0.000500', 1562929696.6304958, 1562929696.6299958)
(3, '0.000500', 1562929696.6309962, 1562929696.6304958)
(4, '0.000500', 1562929696.6314962, 1562929696.6309962)
(5, '0.000500', 1562929696.6319966, 1562929696.6314962)
(6, '0.000500', 1562929696.632497, 1562929696.6319966)
(7, '0.000500', 1562929696.6329975, 1562929696.632497)
(8, '0.000500', 1562929696.633498, 1562929696.6329975)
(9, '0.000500', 1562929696.6339984, 1562929696.633498)
(10, '0.000500', 1562929696.6344984, 1562929696.6339984)
End result:
Mean1 : 0.001001 # (measure1 function)
Mean2 : 0.000550 # (measure2 function)

Here is a Python 3 solution for Windows building upon the answer posted above by CyberSnoopy (using GetSystemTimePreciseAsFileTime). We borrow some code from jfs's answer to "Python datetime.utcnow() returning incorrect datetime" and get a precise timestamp (Unix time) in microseconds:
#! python3
import ctypes.wintypes

def utcnow_microseconds():
    system_time = ctypes.wintypes.FILETIME()
    # system call used by time.time()
    #ctypes.windll.kernel32.GetSystemTimeAsFileTime(ctypes.byref(system_time))
    # getting high precision:
    ctypes.windll.kernel32.GetSystemTimePreciseAsFileTime(ctypes.byref(system_time))
    large = (system_time.dwHighDateTime << 32) + system_time.dwLowDateTime
    return large // 10 - 11644473600000000

for ii in range(5):
    print(utcnow_microseconds()*1e-6)
References
https://learn.microsoft.com/en-us/windows/win32/sysinfo/time-functions
https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsystemtimepreciseasfiletime
https://support.microsoft.com/en-us/help/167296/how-to-convert-a-unix-time-t-to-a-win32-filetime-or-systemtime

1. Python 3.7 or later
If using Python 3.7 or later, use the modern, cross-platform time module functions such as time.monotonic_ns(), here: https://docs.python.org/3/library/time.html#time.monotonic_ns. It provides nanosecond-resolution timestamps.
import time

time_ns = time.monotonic_ns()
# or on Unix or Linux you can also use (a clock id such as time.CLOCK_MONOTONIC is required):
time_ns = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
# or on Windows:
time_ns = time.perf_counter_ns()
# etc. etc. There are others. See the link above.
From my other answer from 2016, here: How can I get millisecond and microsecond-resolution timestamps in Python?:
You might also try time.clock_gettime_ns() on Unix or Linux systems. Based on its name, it appears to call the underlying clock_gettime() C function which I use in my nanos() function in C in my answer here and in my C Unix/Linux library here: timinglib.c.
2. Python 3.3 or later
On Windows, in Python 3.3 or later, you can use time.perf_counter(), as shown by @ereOn here. See: https://docs.python.org/3/library/time.html#time.perf_counter. This provides roughly a 0.5 us resolution timestamp, in floating-point seconds. Ex:
import time
# For Python 3.3 or later
time_sec = time.perf_counter() # Windows only, I think
# or on Unix or Linux (I think only those)
time_sec = time.monotonic()
3. Pre-Python 3.3 (ex: Python 3.0, 3.1, 3.2), or later
Summary:
See my other answer from 2016 here for 0.5-us-resolution timestamps, or better, in Windows and Linux, and for versions of Python as old as 3.0, 3.1, or 3.2 even! We do this by calling C or C++ shared object libraries (.dll on Windows, or .so on Unix or Linux) using the ctypes module in Python.
I provide these functions:
millis()
micros()
delay()
delayMicroseconds()
Download GS_timing.py from my eRCaGuy_PyTime repo, then do:
import GS_timing
time_ms = GS_timing.millis()
time_us = GS_timing.micros()
GS_timing.delay(10) # delay 10 ms
GS_timing.delayMicroseconds(10000) # delay 10000 us
Details:
In 2016, I was working in Python 3.0 or 3.1, on an embedded project on a Raspberry Pi, which I also tested and ran frequently on Windows. I needed nanosecond resolution for some precise timing I was doing with ultrasonic sensors. The Python language at the time did not provide this resolution, and neither did any answer to this question, so I came up with this separate Q&A here: How can I get millisecond and microsecond-resolution timestamps in Python?. I stated in the question at the time:
I read other answers before asking this question, but they rely on the time module, which prior to Python 3.3 did NOT have any type of guaranteed resolution whatsoever. Its resolution is all over the place. The most upvoted answer here quotes a Windows resolution (using their answer) of 16 ms, which is 32000 times worse than my answer provided here (0.5 us resolution). Again, I needed 1 ms and 1 us (or similar) resolutions, not 16000 us resolution.
Zero, I repeat: zero answers here on 12 July 2016 had any resolution better than 16-ms for Windows in Python 3.1. So, I came up with this answer which has 0.5us or better resolution in pre-Python 3.3 in Windows and Linux. If you need something like that for an older version of Python, or if you just want to learn how to call C or C++ dynamic libraries in Python (.dll "dynamically linked library" files in Windows, or .so "shared object" library files in Unix or Linux) using the ctypes library, see my other answer here.

I created a tiny C-Extension that uses GetSystemTimePreciseAsFileTime to provide an accurate timestamp on Windows:
https://win-precise-time.readthedocs.io/en/latest/api.html#win_precise_time.time
Usage:
>>> import win_precise_time
>>> win_precise_time.time()
1654539449.4548845

def start(self):
    sec_arg = 10.0
    cptr = 0
    time_start = time.time()
    time_init = time.time()
    while True:
        cptr += 1
        time_start = time.time()
        time.sleep(((time_init + (sec_arg * cptr)) - time_start))

        # AND YOUR CODE .......
        t00 = threading.Thread(name='thread_request', target=self.send_request, args=([]))
        t00.start()

Related

Why is the following simple parallelized code much slower than a simple loop in Python?

A simple program which calculates square of numbers and stores the results:
import time
from joblib import Parallel, delayed
import multiprocessing
array1 = [ 0 for i in range(100000) ]
def myfun(i):
    return i**2
#### Simple loop ####
start_time = time.time()
for i in range(100000):
    array1[i] = i**2
print( "Time for simple loop --- %s seconds ---" % ( time.time()
- start_time
)
)
#### Parallelized loop ####
start_time = time.time()
results = Parallel( n_jobs = -1,
verbose = 0,
backend = "threading"
)(
map( delayed( myfun ),
range( 100000 )
)
)
print( "Time for parallelized method --- %s seconds ---" % ( time.time()
- start_time
)
)
#### Output ####
# >>> ( executing file "Test_vr20.py" )
# Time for simple loop --- 0.015599966049194336 seconds ---
# Time for parallelized method --- 7.763299942016602 seconds ---
Could it be the difference in array handling for the two options? My actual program would have something more complicated but this is the kind of calculation that I need to parallelize, as simply as possible, but not with such results.
System Model: HP ProBook 640 G2, Windows 7,
IDLE for Python System Type: x64-based PC Processor:
Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz,
2401 MHz,
2 Core(s),
4 Logical Processor(s)
From the documentation of threading:
If you know that the function you are calling is based on a compiled
extension that releases the Python Global Interpreter Lock (GIL)
during most of its computation ...
The problem is that in this case, you don't know that. Python itself will only allow one thread to run at once (the Python interpreter locks the GIL every time it executes a Python operation).
threading is only going to be useful if myfun() spends most of its time in a compiled Python extension, and that extension releases the GIL.
The Parallel code is so embarrassingly slow because you are doing a huge amount of work to create multiple threads - and then you only execute one thread at a time anyway.
If you use the multiprocessing backend, then you have to copy the input data into each of four or eight processes (one per core), do the processing in each process, and then copy the output data back. The copying is going to be slow, but if the processing is a little bit more complex than just calculating a square, it might be worth it. Measure and see.
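For instance, a minimal sketch of switching joblib to the multiprocessing backend (whether it pays off depends entirely on how heavy myfun really is):
from joblib import Parallel, delayed

def myfun(i):
    return i ** 2

if __name__ == '__main__':   # the guard is required on Windows when spawning processes
    results = Parallel(n_jobs=-1, backend="multiprocessing")(
        delayed(myfun)(i) for i in range(100000)
    )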
Why? Because it is trying to use tools in cases where those tools principally cannot and DO NOT offset their costs of entry:
I love python.
I pray educators better explain the costs of tools, otherwise we get lost in these wish-to-get [PARALLEL]-schedules.
A few facts:
No.0: With a lot of simplification, python intentionally uses GIL to [SERIAL]-ise access to variables and thus avoiding any potential collision from [CONCURRENT] modifications - paying these add-on costs of GIL-stepped dancing in extra time
No.1: [PARALLEL]-code execution is way harder than a "just"-[CONCURRENT] ( read more )
No.2: [SERIAL]-process has to pay extra costs, if trying to split work onto [CONCURRENT]-workers
No.3: If a process does inter-worker communication, immense extra costs per data exchange are paid
No.4: If hardware has few resources for [CONCURRENT] processes, results get way worse further
To have some smell of what can be done in standard python 2.7.13:
Efficiency is in better using silicon, not in bulldozing syntax-constructors into territories, where they are legal, but their performance has adverse effects on the experiment-under-test end-to-end speed:
You pay about 8 ~ 11 [ms] just to iteratively assemble an empty array1
>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
9751L
10146L
10625L
9942L
10346L
9359L
10473L
9171L
8328L
( the Stopwatch().stop() method yields [us] from .start() )
while, the memory-efficient, vectorisable, GIL-free approach can do the same about +230x ~ +450x faster:
>>> import numpy as np
>>>
>>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
15L
22L
21L
23L
19L
22L
>>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
43L
47L
42L
44L
47L
So, using the proper tools just starts the story of performance:
>>> def test_SERIAL_python( nLOOPs = 100000 ):
... aClk.start()
... for i in xrange( nLOOPs ): # py3 range() ~ xrange() in py27
... array1[i] = i**2 # your loop-code
... _ = aClk.stop()
... return _
While a naive [SERIAL]-iterative implementation works, you pay immense costs for opting to do so ~ 70 [ms] for a 100000-D vector:
>>> test_SERIAL_python( nLOOPs = 100000 )
70318L
69211L
77825L
70943L
74834L
73079L
Using a more suitable / appropriate tool costs just ~ 0.2 [ms] i.e. ++350x FASTER
>>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
189L
171L
173L
187L
183L
188L
193L
and with another glitch, a.k.a. an inplace modus-operandi:
>>> aClk.start();arrayNP[:] *=arrayNP[:];aClk.stop()
138L
139L
136L
137L
136L
136L
137L
Yields ~ +514x SPEEDUP, just from using appropriate tool
The art of performance is not in following marketing-sounding claims about parallelizing (at any cost), but in using know-how based methods that pay the least costs for the biggest achievable speedups.
For "small"-problems, typical costs of distributing "thin"-work-packages are indeed hard to get covered by any potentially achievable speedups, so "problem-size" actually limits one's choice of methods, that could reach positive gain ( speedups of 0.9 or even << 1.0 are so often reported here, on StackOverflow, that you need not feel lost or alone in this sort of surprise ).
Epilogue
Processor number counts.
Core number counts.
But cache-sizes + NUMA-irregularities count more than that.
Smart, vectorised, HPC-cured, GIL-free libraries matter ( numpy et al - thanks a lot Travis OLIPHANT & al ... Great Salute to his team ... )
As an overhead-strict Amdahl Law (re-)-formulation explains, why even many-N-CPU parallelised code execution may ( and indeed often does ) suffer from speedups << 1.0
Overhead-strict formulation of the Amdahl's Law speedup S includes the very costs of the paid [PAR]-Setup + [PAR]-Terminate Overheads, explicitly:
                          1
S  =  _________________________________ ;  where s, ( 1 - s ), N were defined above
                  ( 1 - s )                pSO := [PAR]-Setup-Overhead     add-on
       s + pSO + ___________ + pTO         pTO := [PAR]-Terminate-Overhead add-on
                      N
( an interactive animated tool for 2D visualising effects of these performance constraints is cited here )
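A tiny helper, as a sketch, for plugging numbers into this overhead-strict formulation (the names s, N, pSO, pTO follow the formula above):
def overhead_strict_amdahl_speedup(s, N, pSO=0.0, pTO=0.0):
    # s:   serial fraction of the work
    # N:   number of parallel workers
    # pSO: parallel setup overhead, pTO: parallel terminate overhead
    #      (both expressed as fractions of the original run time)
    return 1.0 / (s + pSO + (1.0 - s) / N + pTO)

# e.g. a 95% parallelisable job on 4 workers with 10% total overhead:
print(overhead_strict_amdahl_speedup(s=0.05, N=4, pSO=0.05, pTO=0.05))   # ~2.6x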

use time.time() as Timer failed on `**`

I'm a Python 3 learner and recently I was confused by a strange behavior of time.time().
I wrote two pieces of code and timed them with time.time():
pow version
from time import time
t0 = time()
x = pow(2, 1000000000)
t1 = time()
print(t1 - t0)
** version
from time import time
t0 = time()
x = 2 ** 1000000000
t1 = time()
print(t1 - t0)
pow: time_cost = 1.544sec, output = 1.43625617027
**: time_cost = 1.526sec, output = 0.0
Why doesn't time.time() work for the ** version?
More Info:
sys.version='3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:14:34) [MSC v.1900 32 bit (Intel)]',
sys.winver='3.6-32',
sys.platform='win32',
sys.implementation=namespace(cache_tag='cpython-36', hexversion=50725616, name='cpython', version=sys.version_info(major=3, minor=6, micro=2, releaselevel='final', serial=0))
It has to do with the way Python 3 processes certain constant expressions while processing a file. If you run the program:
print("start")
x = 2**1000000000
print("end")
you will see that there is a long start-up delay, and then "start" and "end" are printed almost simultaneously. While the program:
print("start")
x = pow(2, 1000000000)
print("end")
prints "start", then pauses for a while, and prints "end".
Python "pre-computes" the expression 2**1000000000 while it's initially processing the file (at the byte compilation stage), before the program actually starts running. In contrast, the expression pow(2,1000000000) isn't precomputed; it's byte compiled as a function call and computed when the program actually runs.
Here's another way to see this is happening. If you create two module files:
# starstar.py
x = 2**1000000000
# pow.py
x = pow(2,1000000000)
and import them into another program:
# main.py
import starstar
import pow
and run python main.py, then Python will produce byte compiled versions of the modules (maybe in a subdirectory called __pycache__ or something). You'll see that the byte-compiled version of the starstar.pyc module is huge -- it actually contains a copy of the pre-computed value of 2**1000000000; but the pow.pyc file will be tiny.

Why single addition takes longer than single additions plus single assignment?

Here is the Python code; I use Python 3.5.2 on an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz:
import time

empty_loop_t = 0.14300823211669922
N = 10000000

def single_addition(n):
    a = 1.0
    b = 0.0
    start_t = time.time()
    for i in range(0, n):
        a + b
    end_t = time.time()
    cost_t = end_t - start_t - empty_loop_t
    print(n, "iterations single additions:", cost_t)
    return cost_t

single_addition(N)

def single_addition_plus_single_assignment(n):
    a = 1.0
    b = 0.0
    c = 0.0
    start_t = time.time()
    for i in range(0, n):
        c = a + b
    end_t = time.time()
    cost_t = end_t - start_t - empty_loop_t
    print(n, "iterations single additions and single assignments:", cost_t)
    return cost_t

single_addition_plus_single_assignment(N)
The output is:
10000000 iterations single additions: 0.19701123237609863
10000000 iterations single additions and single assignments: 0.1890106201171875
Normally, to get a more reliable result, it is better to do the test using K-fold. However, since the K-fold loop itself influences the result, I don't use it in my test. And I'm sure this inequality can be reproduced, at least on my machine. So the question is: why does this happen?
I run it with pypy (had to set empty_loop_t = 0) and got the following results:
(10000000, 'iterations single additions:', 0.014394044876098633)
(10000000, 'iterations single additions and single assignments:', 0.018398046493530273)
So I guess it comes down to what the interpreter does with the source code and how it executes it. With the non-JIT interpreter, the explicit assignment may take fewer operations and less work than discarding the result, whereas the JIT compiler forces the code to perform the actual number of operations.
Furthermore, using the JIT interpreter makes your script run ~50 times faster on my configuration. If your general aim is to optimize the running time of your script, that is probably the direction to look.
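One way to peek at what CPython actually executes in each loop body is dis (a sketch; exact opcode names vary between versions): the bare addition ends with a POP_TOP that throws the result away, while the assignment ends with a STORE_FAST.
import dis

def discard(a, b):
    a + b        # result computed, then discarded (POP_TOP)

def assign(a, b):
    c = a + b    # result stored in a local variable (STORE_FAST)

dis.dis(discard)
dis.dis(assign)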

Trying to read an ADC with Cython on an RPi 2 b+ via SPI (MCP3304, MCP3204, or MCP3008)?

I'd like to read differential voltage values from an MCP3304 (5v VDD, 3.3v Vref, main channel = 7, diff channel = 6) connected to an RPi 2 b+ as close as possible to the MCP3304's max sample rate of 100ksps. Preferably, I'd get > 1 sample per 100µs (> 10 ksps).
A kind user recently suggested I try porting my code to C for some speed gains. I'm VERY new to C, so thought I'd give Cython a shot, but can't seem to figure out how to tap into the C-based speed gains.
My guess is that I need to write a .pyx file that includes a more-raw means of accessing the ADC's bits/bytes via SPI than the python package I'm currently using (the python-wrapped gpiozero package). 1) Does this seem correct, and, if so, might someone 2) please help me understand how to manipulate bits/bytes appropriately for the MCP3304 in a way that will produce speed gains from Cython? I've seen C tutorials for the MCP3008, but am having trouble adapting this code to fit the timing laid out in the MCP3304's spec sheet; though I might be able to adapt a Cython-specific MCP3008 (or other ADC) tutorial to fit the MCP3304.
Here's a little .pyx loop I wrote to test how fast I'm reading voltage values. (Timing how long it takes to read 25,000 samples). It's ~ 9% faster than running it straight in Python.
# Load packages
import time
from gpiozero import MCP3304

# create a class to ping PD every ms for 1 minute
def pd_ping():
    cdef char *FILENAME = "testfile.txt"
    cdef double v
    # cdef int adc_channel_pd = 7
    cdef size_t i

    # send message to user re: expected timing
    print "Running for ", 25000, "iterations. Fingers crossed!"
    print time.clock()

    s = []
    for i in range(25000):
        v = MCP3304(channel = 7, diff = True).value * 3.3
        # concatenate
        s.append( str(v) )

    print "donezo", time.clock()

    # write to file
    out = '\n'.join(s)
    f = open(FILENAME, "w")
    f.write(out)
There is probably no need to create an MCP3304 object for each iteration. Also conversion from float to string can be delayed.
s = []
mcp = MCP3304(channel = 7, diff = True)
for i in range(25000):
    v = mcp.value * 3.3
    s.append(v)
out = '\n'.join('{:.2f}'.format(v) for v in s) + '\n'
If that multiplication by 3.3 is not strictly necessary at that point, it could be done later.
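For example, a sketch of deferring that multiplication until the values are formatted for output, so the 3.3 V scale factor stays out of the sampling loop:
s = []
mcp = MCP3304(channel = 7, diff = True)
for i in range(25000):
    s.append(mcp.value)   # store the raw reading, no scaling in the hot loop

# apply the 3.3 V reference scaling only while writing the results out
out = '\n'.join('{:.2f}'.format(v * 3.3) for v in s) + '\n'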

Why is this call to a dynamic library function so slow?

I am writing a shared library for python to call. Since this is my first time using python's ctypes module, and nearly my first time writing a shared library, I have been writing both C and python code to call the library's functions.
For the heck of it I put some timing code in and found that, while most calls of the C program to the library are very fast, the first is slow, considerably slower than its python counterpart in fact. This goes against everything I expected and was hoping that someone could tell me why.
Here is a stripped down version of the header file from my C library.
typedef struct MdaDataStruct
{
    int numPts;
    int numDists;
    float* data;
    float* dists;
} MdaData;

//allocate the structure
void* makeMdaStruct(int numPts, int numDist);

//deallocate the structure
void freeMdaStruct(void* strPtr);

//assign the data array
void setData(void* strPtr, float* divData);
Here is the C program that calls the functions:
#include <stdio.h>
#include <time.h>
/* plus the library header declaring makeMdaStruct() etc., shown above */

int main(int argc, char* argv[])
{
    clock_t t1, t2;
    t1 = clock();
    long long int diff;

    //test the allocate function
    t1 = clock();
    MdaData* dataPtr = makeMdaStruct(10, 3);
    t2 = clock();
    diff = (((t2-t1)*1000000)/CLOCKS_PER_SEC);
    printf("make struct, took: %lld microseconds\n", diff);

    //make some data
    float testArr[10] = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9};

    //test the set data function
    t1 = clock();
    setData(dataPtr, testArr);
    t2 = clock();
    diff = (((t2-t1)*1000000)/CLOCKS_PER_SEC);
    printf("set data, took: %lld microseconds\n", diff);

    //test the deallocate function
    t1 = clock();
    freeMdaStruct(dataPtr);
    t2 = clock();
    diff = (((t2-t1)*1000000)/CLOCKS_PER_SEC);
    printf("free struct, took: %lld microseconds\n", diff);

    //exit
    return 0;
}
and here is the python script that calls the functions:
# load the library
t1 = time.time()
cs_lib = cdll.LoadLibrary("./libChiSq.so")
t2 = time.time()
print "load library, took", int((t2-t1)*1000000), "microseconds"
# tell python the function will return a void pointer
cs_lib.makeMdaStruct.restype = c_void_p
# make the strcuture to hold the MdaData with 50 data points and 8 L dists
t1 = time.time()
mdaData = cs_lib.makeMdaStruct(10,3)
t2 = time.time()
print "make struct, took", int((t2-t1)*1000000), "microseconds"
# make an array with the test data
divDat = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], np.float32)
#run the function to load the array into the struct
t1 = time.time()
cs_lib.setData(mdaData, divDat.ctypes.data)
t2 = time.time()
print "set data, took", int((t2-t1)*1000000), "microseconds"
#free the structure
t1 = time.time()
cs_lib.freeMdaStruct(mdaData)
t2 = time.time()
print "free struct, took", int((t2-t1)*1000000), "microseconds"
and finally, here is the output of running the two consecutively:
[]$ ./tester
make struct, took: 60 microseconds
set data, took: 2 microseconds
free struct, took: 2 microseconds
[]$ python so_py_tester.py
load library, took 77 microseconds
make struct, took 3 microseconds
set data, took 23 microseconds
free struct, took 10 microseconds
As you can see, the C call to makeMdaStruct takes 60us and the python call to makeMdaStruct takes 3us, which is highly confusing.
My best guess was that somehow the C code pays the cost of loading the library at the first call? Which confuses me because I thought that the library was loaded when the program was loaded into memory.
Edit: I think there might be a kernel of truth to the guess because I put an extra untimed call to makeMdaStruct and freeMdaStruct before the timed call to makeMdaStruct and got the following output in testing:
[]$ ./tester
make struct, took: 1 microseconds
set data, took: 1 microseconds
free struct, took: 0 microseconds
[]$ python so_py_tester.py
load library, took 70 microseconds
make struct, took 4 microseconds
set data, took 23 microseconds
free struct, took 12 microseconds
My best guess was that somehow the C code pays the cost of loading the library at the first call? Which confuses me because I thought that the library was loaded when the program was loaded into memory.
You are correct in both cases. The library is loaded when the program is loaded. However, the dynamic loader/linker defers symbol resolution until function invocation time.
Calls to shared libraries are done so indirectly, via an entry in the procedure linkage table (PLT). Initially, all of the entries in the PLT point to ld.so. Upon the first call to a function, ld.so looks up the actual address of the symbol, updates the entry in the PLT, and jumps to the function. This is "lazy" symbol resolution.
You can set the LD_BIND_NOW environment variable to change this behavior. From ld.so(8):
LD_BIND_NOW
    (libc5; glibc since 2.1.1) If set to a nonempty string, causes the dynamic
    linker to resolve all symbols at program startup instead of deferring
    function call resolution to the point when they are first referenced.
    This is useful when using a debugger.
This behavior can also be changed at link time. From ld(1):
-z keyword
    The recognized keywords are:
    ...
    lazy
        When generating an executable or shared library, mark it to
        tell the dynamic linker to defer function call resolution to
        the point when the function is called (lazy binding), rather
        than at load time. Lazy binding is the default.
Further reading:
http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/
