Python passlib: what is the best value for "rounds"?

From the passlib documentation:
For most public facing services, you can generally have signin take upwards of 250ms - 400ms before users start getting annoyed.
So what is the best value for rounds for a login/registration flow, considering that there is one database call per login attempt? It uses MongoDB with non-blocking calls (Mongotor), and the email is used as the _id, so it is indexed by default; the query is fast: 0.00299978256226 s (tested, of course, against a database with 3 records...).
import passlib.hash
import time
hashh = passlib.hash.pbkdf2_sha512
beg1 = time.time()
password = hashh.encrypt("test", salt_size = 32, rounds = 12000)
print time.time() - beg1  # 0.142999887466
beg2 = time.time()
hashh.verify("test", password)
print time.time() - beg2  # 0.143000125885
Now if I use a lower rounds value:
password = hashh.encrypt("test", salt_size = 32, rounds = 4000) # 0.0720000267029
hashh.verify("test", password) # 0.0709998607635
I'm using Windows 7 64-bit on a Dell XPS 15 (i7, 2.0 GHz).
NB: I installed bcrypt too, and of course it's painfully slow when used directly with its default values (rounds = 12):
hashh = passlib.hash.bcrypt
beg1 = time.time()
password = hashh.encrypt("test", rounds = 12)
print time.time() - beg1  # 0.406000137329
beg2 = time.time()
hashh.verify("test", password)
print time.time() - beg2  # 0.40499997139
With half the rounds value:
password = hashh.encrypt("test", rounds = 6) # 0.00699996948242 wonderful?
hashh.verify("test", password) # 0.00600004196167
Can you suggest a good rounds value for pbkdf2_sha512 that would be suitable for production?

(passlib developer here)
The amount of time pbkdf2_sha512 takes is linearly proportional to its rounds parameter (elapsed_time = rounds / native_speed). Using the data for your system, native_speed = 12000 / .143 = 83916 iterations/second, which means you'll need around 83916 * .350 = 29575 rounds to get a ~350ms delay.
Things are a little trickier for bcrypt, because the amount of time it takes grows exponentially with its rounds parameter (elapsed_time = (2 ** rounds) / native_speed). Using the data for your system, native_speed = (2 ** 12) / .405 = 10113 iterations/second, which means you'll need around log(10113 * .350, 2) = 11.79 rounds to get a ~350 ms delay. But since BCrypt only accepts integer rounds parameters, you'll need to pick rounds=11 (~200ms) or rounds=12 (~400ms).
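For reference, the arithmetic above can be reproduced inline with a short calibration snippet. This is only a sketch (the helper name is mine, and it uses the encrypt() call from the question, which newer passlib releases rename to hash()):

import math
import time
from passlib.hash import pbkdf2_sha512, bcrypt

def measure(hasher, **kwds):
    # time a single hash at a known cost setting
    start = time.time()
    hasher.encrypt("test", **kwds)
    return time.time() - start

# pbkdf2_sha512: cost is linear in rounds
native_speed = 12000 / measure(pbkdf2_sha512, rounds=12000)   # rounds per second
print(int(native_speed * 0.350))                              # rounds for a ~350 ms delay

# bcrypt: cost doubles with every +1 of rounds
native_speed = (2 ** 12) / measure(bcrypt, rounds=12)         # iterations per second
print(round(math.log(native_speed * 0.350, 2), 2))            # ~11.79 -> pick 11 or 12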
All of this is something I'm hoping to fix in a future release of passlib. As a work in progress, passlib's mercurial repo currently contains a simple little script, choose_rounds.py, which takes care of choosing the correct rounds value for a given target time. You can download and run it directly as follows (it may take 20s or so to run):
$ python choose_rounds.py -h
usage: python choose_rounds.py <hash_name> [<target_in_milliseconds>]
$ python choose_rounds.py pbkdf2_sha512 350
hash............: pbkdf2_sha512
speed...........: 83916 iterations/second
target time.....: 350 ms
target rounds...: 29575
$ python choose_rounds.py bcrypt 350
hash............: bcrypt
speed...........: 10113 iterations/second
target time.....: 350 ms
target rounds...: 11 (200ms -- 150ms faster than requested)
target rounds...: 12 (400ms -- 50ms slower than requested)
(edit: added response regarding secure minimum rounds...)
Disclaimer: Determining a secure minimum is a surprisingly tricky question - there are a number of hard to quantify parameters, very little real world data, and some rigorously unhelpful theory. Lacking a good authority, I've been researching the topic myself; and for off-the-cuff calculations, I've boiled the raw data down to a short formula (below), which is generally what I use. Just be aware that behind it are a couple of pages of assumptions and rough estimates, making it more of a Fermi Estimation than an exact answer :|
My rule of thumb (mid 2012) for attacking PBKDF2-HMAC-SHA512 using GPUs is:
days * dollars = 2**(n-31) * rounds
days is the number of days before the attacker has a 50/50 chance of guessing the password.
dollars is the attackers' hardware budget (in $USD).
n is the average amount of entropy in your user's passwords (in bits).
To answer your script-kiddie question: if an average password has 32 bits of entropy, and the attacker has a $2000 system with a good GPU, then at 30000 rounds they will need 30 days (2**(32-31)*30000/2000) to have a 50/50 chance of cracking a given hash. I'd recommend playing around with the values until you arrive at a rounds/days tradeoff that you're comfortable with.
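If it helps, the rule of thumb can be written out as a throwaway helper (just the formula above rearranged for days; the names are mine):

def days_to_crack(dollars, rounds, entropy_bits, tech_factor=31):
    # days until a 50/50 chance of a guess, per the rule of thumb above
    return 2 ** (entropy_bits - tech_factor) * rounds / float(dollars)

print(days_to_crack(dollars=2000, rounds=30000, entropy_bits=32))   # ~30 days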
Some things to keep in mind:
The success rate of a dictionary attack isn't linear; it's more of a "long tail" situation, so think of the 50/50 mark as more of a half-life.
That 31 is the key factor, as it encodes an estimation of the cost of attacking a specific algorithm using a specific technology level. The actual value, 2**-31, measures the "dollar-days per round" it will cost an attacker. For comparison, attacking PBKDF2-HMAC-SHA512 using an ASIC has a factor closer to 46 -- larger numbers mean more bang for the attacker's buck, and less security per round for you, though script kiddies generally won't have that kind of budget :)

Related

Numba Python - how to exploit parallelism effectively?

I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing is multiplying an array by a float32 and then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda

lengthY = 16       # 2D array Y axis
lengthX = 2**16    # X axis
totalops = lengthY * lengthX * 2  # MAC operation has 2 operations
iters = 100
doParallel = True

@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
    output = (float)(0.0)
    multconst = (float)(.99)
    output = np.sum(np.multiply(testarray, multconst))
    return output

@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(lengthX):
            output += multconst*testarray[y,x]
    return output

@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(int(lengthX/4)):
            xn = x*4
            output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
    return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)

# ==== MAC_numpy ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_numpy(testarray)
    end = timer()
    if((end-start) < time):  # keep the shortest time
        time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
    start = timer()
    output = MAC_01(testarray)
    end = timer()
    if((end-start) < time):  # keep the shortest time
        time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

# ==== MAC_04 ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_04(testarray)
    end = timer()
    if((end-start) < time):  # keep the shortest time
        time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
Q : How can I get better results?
1st : Learn how to avoid doing useless work - you can straight away eliminate HALF of the FLOPs, and also avoid about half of the RAM I/O, each write-back costing on the order of +100~350 [ns].
Because MUL distributes over ADD, ( a.C + b.C ) == ( a + b ).C, it is better to first np.sum( A ) and only then multiply the sum by the (float) constant.
# output = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum( testarray)*multconst #######################
2nd : Learn how to best align data with the order of processing (cache-line re-use gets you roughly ~100x faster access to already pre-fetched data). Not aligning vectorised code with these already pre-fetched data makes your code pay the RAM-access latencies many times over, instead of smartly re-using the data blocks that were already paid for. Designing work units along this principle costs a few more SLOCs, but the reward is worth it - who else gets an effective ~100x speedup for free, right now, just by not writing badly or naively designed looping iterators?
3rd : Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code blocks, and avoid forcing numba to spend time auto-analysing call signatures (you pay extra for this auto-analysis on every call, even though you designed the code and know exactly which data types will go there - so why pay for signature analysis each time a numba block gets called? See the sketch at the end of this answer).
4th : Learn where the extended Amdahl's Law - with all the relevant add-on costs and processing atomicity put into the game - supports your wish for a speedup, so that you never pay more than you get back (paying extra costs without any reward is possible, yet it has no beneficial impact on your code's performance - rather the opposite).
5th : Learn when and how manually created inline(s) may save your code, once steps 1-4 are well learnt and routinely exercised with proper craftsmanship (using popular COTS frameworks is fine, yet they may deliver results after a few days of work, while a hand-crafted, single-purpose, smartly designed piece of code got the same results in about 12 minutes(!) instead of several days, without any GPU/CPU tricks - just by not doing a single step more than was needed for the numerical processing of the large matrix data).
Did I mention that float32 may be processed slower than float64 at small scales, while at larger data scales of ~ n [GB] the RAM I/O times grow slower thanks to the more efficient float32 pre-fetches? That never happens here, because a float64 array gets processed throughout - unless one explicitly instructs the constructor(s) to down-convert the default data type, like this:
np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )
>>> np.random.rand( 10, 2 ).dtype
dtype('float64')
Avoiding extensive memory allocations is another performance trick, supported in numpy call signatures. Using it for large arrays saves a lot of time otherwise wasted on mem-allocs for large interim arrays. Re-using already pre-allocated memory zones and wisely controlled gc-policing are further marks of a professional focused on low-latency, design-for-performance code.
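To make the 3rd point and the allocation note concrete, here is a small sketch (the names and the exact signature string are mine, not a prescription): an eagerly compiled, explicitly typed reduction that multiplies once at the end, plus a pre-allocated out= buffer for the cases where you really do need the product array:

import numpy as np
from numba import njit

@njit("float64(float64[:, :])", fastmath=True)   # explicit signature: compiled once, up front
def mac_sum(a):
    total = 0.0
    for y in range(a.shape[0]):
        for x in range(a.shape[1]):
            total += a[y, x]
    return total * 0.99                          # ( a + b ).C -- one MUL, after the SUM

a = np.random.rand(16, 2**16)
buf = np.empty_like(a)                           # allocated once ...
np.multiply(a, 0.99, out=buf)                    # ... and re-used: no interim array per call
print(mac_sum(a))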

How to generate a time-ordered uid in Python?

Is this possible? I've heard Cassandra has something similar : https://datastax.github.io/python-driver/api/cassandra/util.html
I have been using a ISO timestamp concatenated with a uuid4, but that ended up way too large (58 characters) and probably overkill.
Keeping a sequential number doesn't work in my context (DynamoDB NoSQL)
Worth noting that for my application it doesn't matter if items created in batch/the same second end up in a random order, as long as the uids don't collide.
I have no specific restriction on maximum length; ideally I would like to see the collision chances for different lengths, but it needs to be smaller than 58 (my original attempt).
This is to use with DynamoDB(NoSQL Database) as Sort-key
Why uuid.uuid1 is not sequential
uuid.uuid1(node=None, clock_seq=None) is effectively:
60 bits of timestamp (representing number of 100-ns intervals after 1582-10-15 00:00:00)
14 bits of "clock sequence"
48 bits of "Node info" (generated from network card's mac-address or from hostname or from RNG).
If you don't provide any arguments, then the system function is called to generate the uuid. In that case:
It's unclear if "clock sequence" is sequential or random.
It's unclear if it's safe to be used in multiple processes (can clock_seq be repeated in different processes or not?). In Python 3.7 this info is now available.
If you provide clock_seq or node, then the "pure Python implementation" is used. In this case, even with a "fixed value" for clock_seq:
the timestamp part is guaranteed to be sequential for all calls in the current process, even under threaded execution;
the clock_seq part is randomly generated, but that is no longer critical, because the timestamp is sequential and unique;
it is NOT safe for multiple processes (processes that call uuid1 with the same clock_seq and node might return conflicting values if called during the "same 100-ns time interval").
Solution that reuses uuid.uuid1
It's easy to see that you can make uuid1 sequential by providing the clock_seq or node arguments (forcing the pure Python implementation).
import time
from random import getrandbits
from uuid import uuid1, getnode

_my_clock_seq = getrandbits(14)
_my_node = getnode()

def sequential_uuid(node=None):
    return uuid1(node=node, clock_seq=_my_clock_seq)
# .hex attribute of this value is 32-characters long string

def alt_sequential_uuid(clock_seq=None):
    return uuid1(node=_my_node, clock_seq=clock_seq)


if __name__ == '__main__':
    from itertools import count

    old_n = uuid1()             # "Native"
    old_s = sequential_uuid()   # Sequential

    native_conflict_index = None
    t_0 = time.time()
    for x in count():
        new_n = uuid1()
        new_s = sequential_uuid()

        if old_n > new_n and not native_conflict_index:
            native_conflict_index = x

        if old_s >= new_s:
            print("OOops: non-sequential results for `sequential_uuid()`")
            break

        if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
            print('No issues for `sequential_uuid()`')
            break

        old_n = new_n
        old_s = new_s

    print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')
Multiple processes issues
BUT if you are running some parallel processes on the same machine, then:
node which defaults to uuid.get_node() will be the same for all the processes;
clock_seq has a small chance of being the same for some processes (a 1/16384 chance for any given pair).
That might lead to conflicts! That is a general concern when using uuid.uuid1 in parallel processes on the same machine, unless you have access to SafeUUID from Python 3.7.
If you make sure to also set node to unique value for each parallel process that runs this code, then conflicts should not happen.
Even if you are using SafeUUID, and set unique node, it's still possible to have non-sequential (but unique) ids if they are generated in different processes.
If some lock-related overhead is acceptable, then you can store clock_seq in some external atomic storage (for example in a "locked" file) and increment it with each call: this allows having the same node value on all parallel processes and also makes the ids sequential. For the case where all parallel processes are subprocesses created using multiprocessing, clock_seq can be "shared" using multiprocessing.Value, as in the sketch below.
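A hedged sketch of that multiprocessing.Value idea (the helper names are mine; it only shows the shared counter, with the same node in every worker):

import multiprocessing as mp
from uuid import uuid1, getnode

_node = getnode()
_shared_seq = None          # set in each worker by the pool initializer

def _init(seq):
    global _shared_seq
    _shared_seq = seq

def shared_sequential_uuid():
    with _shared_seq.get_lock():                        # atomic increment across processes
        _shared_seq.value = (_shared_seq.value + 1) & 0x3FFF
        seq = _shared_seq.value
    return uuid1(node=_node, clock_seq=seq)

def worker(_):
    return [shared_sequential_uuid().hex for _ in range(3)]

if __name__ == '__main__':
    seq = mp.Value('i', 0)                              # one 14-bit counter for all children
    with mp.Pool(2, initializer=_init, initargs=(seq,)) as pool:
        print(pool.map(worker, range(2)))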
As a result you always have to remember:
If you are running multiple processes on the same machine, then you must:
Ensure uniqueness of node. The drawback of this option: you can't be sure to have sequential ids from different processes generated during the same 100-ns interval. But it is a very "light" operation, executed once at process startup, and achieved by "adding" something to the default node, e.g. int(time.time()*1e9) - 0x118494406d1cc000, or by adding some counter from a machine-level atomic db.
Ensure a "machine-level atomic clock_seq" and the same node for all processes on one machine. That way you'll have some overhead for "locking" clock_seq, but the ids are guaranteed to be sequential even if generated in different processes during the same 100-ns interval (unless you are calling uuid from several threads in the same process).
For processes on different machines:
either you have to use some "global counter service";
or it's not possible to have sequential ids generated on different machines during the same 100-ns interval.
Reducing size of the id
The general approach to generating UUIDs is quite simple, so it's easy to implement something similar from scratch, and for example use fewer bits for the node_info part:
import time
from random import getrandbits

_my_clock_seq = getrandbits(14)
_last_timestamp_part = 0
_used_clock_seq = 0

timestamp_multiplier = 1e7  # I'd recommend to use this value

# Next values are enough up to year 2116:
if timestamp_multiplier == 1e9:
    time_bits = 62  # Up to year 2116, also reduces chances for non-sequential id-s generated in different processes
elif timestamp_multiplier == 1e8:
    time_bits = 60  # Up to year 2335
elif timestamp_multiplier == 1e7:
    time_bits = 56  # Up to year 2198.
else:
    raise ValueError('Please calculate and set time_bits')

time_mask = 2**time_bits - 1
seq_bits = 16
seq_mask = 2**seq_bits - 1
node_bits = 12
node_mask = 2**node_bits - 1

max_hex_len = len(hex(2**(node_bits+seq_bits+time_bits) - 1)) - 2  # 21

_default_node_number = getrandbits(node_bits)  # or `uuid.getnode() & node_mask`

def sequential_uuid(node_number=None):
    """Return a 21-character long hex string that is sequential and unique for each call in the current process.

    Results from different processes may "overlap" but are guaranteed to
    be unique if `node_number` is different in each process.
    """
    global _my_clock_seq
    global _last_timestamp_part
    global _used_clock_seq

    if node_number is None:
        node_number = _default_node_number
    if not 0 <= node_number <= node_mask:
        raise ValueError("Node number out of range")

    timestamp_part = int(time.time() * timestamp_multiplier) & time_mask
    _my_clock_seq = (_my_clock_seq + 1) & seq_mask

    if _last_timestamp_part >= timestamp_part:
        timestamp_part = _last_timestamp_part
        if _used_clock_seq == _my_clock_seq:
            timestamp_part = (timestamp_part + 1) & time_mask
    else:
        _used_clock_seq = _my_clock_seq

    _last_timestamp_part = timestamp_part

    return hex(
        (timestamp_part << (node_bits+seq_bits))
        |
        (_my_clock_seq << (node_bits))
        |
        node_number
    )[2:]
Notes:
Maybe it's better to simply store the integer value (not a hex string) in the database.
If you are storing it as text/char, then it's better to convert the integer to a base64 string instead of a hex string. That way it will be shorter (21-char hex string → 16-char b64-encoded string):
from base64 import b64encode

total_bits = time_bits + seq_bits + node_bits
total_bytes = total_bits // 8 + 1 * bool(total_bits % 8)

def int_to_b64(int_value):
    return b64encode(int_value.to_bytes(total_bytes, 'big'))
Collision chances
Single process: collisions not possible
Multiple processes with manually set unique clock_seq or unique node in each process: collisions not possible
Multiple processes with randomly set node (48-bits, "fixed" in time):
Chance to have the node collision in several processes:
in 2 processes out of 10000: ~0.000018%
in 2 processes out of 100000: 0.0018%
Chance to have single collision of the id per second in 2 processes with the "colliding" node:
for "timestamp" interval of 100-ns (default for uuid.uuid1 , and in my code when timestamp_multiplier == 1e7): proportional to 3.72e-19 * avg_call_frequency²
for "timestamp" interval of 10-ns (timestamp_multiplier == 1e8): proportional to 3.72e-21 * avg_call_frequency²
In the page you've linked to, cassandra.util.uuid_from_time(time_arg, node=None, clock_seq=None) seems to be exactly what you're looking for.
def uuid_from_time(time_arg, node=None, clock_seq=None):
    """
    Converts a datetime or timestamp to a type 1 :class:`uuid.UUID`.

    :param time_arg:
      The time to use for the timestamp portion of the UUID.
      This can either be a :class:`datetime` object or a timestamp
      in seconds (as returned from :meth:`time.time()`).
    :type datetime: :class:`datetime` or timestamp

    :param node:
      None integer for the UUID (up to 48 bits). If not specified, this
      field is randomized.
    :type node: long

    :param clock_seq:
      Clock sequence field for the UUID (up to 14 bits). If not specified,
      a random sequence is generated.
    :type clock_seq: int

    :rtype: :class:`uuid.UUID`
    """
    if hasattr(time_arg, 'utctimetuple'):
        seconds = int(calendar.timegm(time_arg.utctimetuple()))
        microseconds = (seconds * 1e6) + time_arg.time().microsecond
    else:
        microseconds = int(time_arg * 1e6)

    # 0x01b21dd213814000 is the number of 100-ns intervals between the
    # UUID epoch 1582-10-15 00:00:00 and the Unix epoch 1970-01-01 00:00:00.
    intervals = int(microseconds * 10) + 0x01b21dd213814000

    time_low = intervals & 0xffffffff
    time_mid = (intervals >> 32) & 0xffff
    time_hi_version = (intervals >> 48) & 0x0fff

    if clock_seq is None:
        clock_seq = random.getrandbits(14)
    else:
        if clock_seq > 0x3fff:
            raise ValueError('clock_seq is out of range (need a 14-bit value)')

    clock_seq_low = clock_seq & 0xff
    clock_seq_hi_variant = 0x80 | ((clock_seq >> 8) & 0x3f)

    if node is None:
        node = random.getrandbits(48)

    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node), version=1)
There's nothing Cassandra specific to a Type 1 UUID...
You should be able to encode a timestamp precise to the second for a time range of 135 years in 32 bits. That will only take 8 characters to represent in hex. Added to the hex representation of the uuid (32 hex characters) that will amount to only 40 hex characters.
Encoding the time stamp that way requires that you pick a base year (e.g. 2000) and compute the number of days up to the current date (time stamp). Multiply this number of days by 86400, then add the seconds since midnight. This will give you values that are less than 2^32 until you reach year 2135.
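A minimal sketch of that computation (the variable names are mine, chosen to match the snippets further down; the base year 2000 is the example above):

from datetime import datetime, timezone

base = datetime(2000, 1, 1, tzinfo=timezone.utc)
now = datetime.now(timezone.utc)

daysSince2000 = (now - base).days
secondsSinceMidnight = now.hour * 3600 + now.minute * 60 + now.second

timeStamp = daysSince2000 * 86400 + secondsSinceMidnight   # stays below 2**32 until ~2135
print(format(timeStamp, '08x'))                            # 8 hex chars, zero-padded for sorting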
Note that you have to keep leading zeroes in the hex encoded form of the timestamp prefix in order for alphanumeric sorting to preserve the chronology.
With a few bits more in the time stamp, you could increase the time range and/or the precision. With 8 more bits (two hex characters), you could go up to 270 years with a precision to the hundredth of a second.
Note that you don't have to model the fraction of seconds in a base-10 range. You will get optimal bit usage by breaking it down into 128ths instead of 100ths for the same number of characters. With the doubling of the year range, this still fits within 8 bits (2 hex characters).
The collision probability, within the time precision (i.e. per second or per 100th or 128th of a second) is driven by the range of the uuid so it will be 1 in 2^128 for the chosen precision. Increasing the precision of the time stamp has the most impact on reducing the collision chances. It is also the factor that has the lowest impact on total size of the key.
More efficient character encoding: 27 to 29 character keys
You could significantly reduce the size of the key by encoding it in base 64 instead of 16, which would give you 27 to 29 characters (depending on your choice of precision).
Note that, for the timestamp part, you need to use an encoding function that takes an integer as input and that preserves the collating sequence of digit characters.
For example:
import uuid

def encode64(number, size):
    chars = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    result = list()
    for _ in range(size):
        result.append(chars[number % 64])
        number //= 64
    return "".join(reversed(result))

a = encode64(1234567890, 6)           # '-7ZU9G'
b = encode64(9876543210, 6)           # '7Ag-Pe'
print(a < b)                          # True

u = encode64(int(uuid.uuid4()), 22)   # '1QA2LtMg30ztnugxaokVMk'
key = a + u                           # '-7ZU9G1QA2LtMg30ztnugxaokVMk' (28 characters)
You can save some more characters by combining the time stamp and uuid into a single number before encoding instead of concatenating the two encoded values.
The encode64() function needs one character every 6 bits.
So, for 135 years with precision to the second: (32+128)/6 = 26.7 --> 27 characters
instead of (32/6 = 5.3 --> 6) + (128/6 = 21.3 --> 22) ==> 28 characters
uid = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
key = encode64( timeStamp<<128 | int(uid) ,27)
with a 270 year span and 128th of a second precision: (40+128)/6 = 28 characters
uid = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
precision = 128
timeStamp = timeStamp * precision + int(fractionOfSecond * precision)
key = encode64( timeStamp<<128 | int(uid) ,28)
With 29 characters you can raise precision to 1024th of a second and year range to 2160 years.
UUID masking: 17 to 19 characters keys
To be even more efficient, you could strip out the first 64 bits of the uuid (which is already a time stamp) and combine it with your own time stamp. This would give you keys with a length of 17 to 19 characters with practically no loss of collision avoidance (depending on your choice of precision).
mask = (1<<64)-1
key = encode64( timeStamp<<64 | (int(uid) & mask) ,19)
Integer/Numeric keys ?
As a final note, if your database supports very large integers or numeric fields (140 bits or more) as keys, you don't have to convert the combined number to a string. Just use it directly as the key. The numerical sequence of timeStamp<<128 | int(uid) will respect the chronology.
The uuid6 module (pip install uuid6) solves the problem. It aims to implement the corresponding draft for a new UUID variant standard.
Example code:
import time
import uuid6

for i in range(0, 30):
    u = uuid6.uuid7()
    print(u)
    time.sleep(0.1)
The package suggests using uuid6.uuid7():
Implementations SHOULD utilize UUID version 7 over UUID version 1 and
6 if possible.
UUID version 7 features a time-ordered value field derived from the
widely implemented and well known Unix Epoch timestamp source, the
number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds
excluded. As well as improved entropy characteristics over versions 1
or 6.

Why is the following simple parallelized code much slower than a simple loop in Python?

A simple program which calculates squares of numbers and stores the results:
import time
from joblib import Parallel, delayed
import multiprocessing

array1 = [ 0 for i in range(100000) ]

def myfun(i):
    return i**2

#### Simple loop ####
start_time = time.time()
for i in range(100000):
    array1[i] = i**2
print( "Time for simple loop --- %s seconds ---" % ( time.time() - start_time ) )

#### Parallelized loop ####
start_time = time.time()
results = Parallel( n_jobs = -1,
                    verbose = 0,
                    backend = "threading"
                    )(
                    map( delayed( myfun ),
                         range( 100000 )
                         )
                    )
print( "Time for parallelized method --- %s seconds ---" % ( time.time() - start_time ) )

#### Output ####
# >>> ( executing file "Test_vr20.py" )
# Time for simple loop --- 0.015599966049194336 seconds ---
# Time for parallelized method --- 7.763299942016602 seconds ---
Could it be the difference in array handling between the two options? My actual program would be more complicated, but this is the kind of calculation that I need to parallelize, as simply as possible, and without results like these.
System Model: HP ProBook 640 G2, Windows 7, IDLE for Python
System Type: x64-based PC
Processor: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz, 2401 MHz, 2 Core(s), 4 Logical Processor(s)
From the documentation of threading:
If you know that the function you are calling is based on a compiled
extension that releases the Python Global Interpreter Lock (GIL)
during most of its computation ...
The problem is that, in this case, you don't know that. Python itself only allows one thread to run at a time (the Python interpreter holds the GIL whenever it executes Python bytecode).
threading is only going to be useful if myfun() spends most of its time in a compiled Python extension, and that extension releases the GIL.
The Parallel code is so embarrassingly slow because you are doing a huge amount of work to create multiple threads - and then you only execute one thread at a time anyway.
If you use the multiprocessing backend, then you have to copy the input data into each of four or eight processes (one per core), do the processing in each process, and then copy the output data back. The copying is going to be slow, but if the processing is a little bit more complex than just calculating a square, it might be worth it. Measure and see - a sketch follows below.
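For instance, a hedged sketch of the same benchmark with a process-based backend and explicit batching (for a job as tiny as i**2 the copying will still dominate; the point is only how to measure it):

import time
from joblib import Parallel, delayed

def myfun(i):
    return i**2

if __name__ == '__main__':
    start_time = time.time()
    results = Parallel( n_jobs = -1,
                        backend = "multiprocessing",   # or the default "loky" in newer joblib
                        batch_size = 1000              # amortise the IPC cost over many calls
                        )( delayed( myfun )( i ) for i in range( 100000 ) )
    print( "Time for process-based method --- %s seconds ---" % ( time.time() - start_time ) )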
Why? Because trying to use tools in cases where they principally cannot and do not pay back their costs of entry:
I love python.
I pray educators would better explain the costs of these tools; otherwise we get lost in these wish-to-get [PARALLEL] schedules.
A few facts:
No.0: With a lot of simplification, python intentionally uses the GIL to [SERIAL]-ise access to variables and thus avoid any potential collision from [CONCURRENT] modifications - paying the add-on costs of this GIL-stepped dancing in extra time
No.1: [PARALLEL] code execution is way harder than "just" [CONCURRENT] execution
No.2: A [SERIAL] process has to pay extra costs if it tries to split work onto [CONCURRENT] workers
No.3: If a process does inter-worker communication, immense extra costs per data exchange are paid
No.4: If the hardware has too few resources for [CONCURRENT] processes, the results get even worse
To get a feel for what can be done in standard python 2.7.13:
Efficiency comes from making better use of the silicon, not from bulldozing syntax constructors into territories where they are legal, but where their performance has adverse effects on the end-to-end speed of the experiment under test:
You pay about 8 ~ 11 [ms] just to iteratively assemble an empty array1
>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
9751L
10146L
10625L
9942L
10346L
9359L
10473L
9171L
8328L
( the Stopwatch().stop() method yields [us] from .start() )
while the memory-efficient, vectorisable, GIL-free approach can do the same about +230x ~ +450x faster:
>>> import numpy as np
>>>
>>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
15L
22L
21L
23L
19L
22L
>>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
43L
47L
42L
44L
47L
So, using the proper tools just starts the story of performance:
>>> def test_SERIAL_python( nLOOPs = 100000 ):
...     aClk.start()
...     for i in xrange( nLOOPs ):      # py3 range() ~ xrange() in py27
...         array1[i] = i**2            # your loop-code
...     _ = aClk.stop()
...     return _
While a naive [SERIAL]-iterative implementation works, you pay immense costs for opting to do so: ~ 70 [ms] for a 100000-D vector:
>>> test_SERIAL_python( nLOOPs = 100000 )
70318L
69211L
77825L
70943L
74834L
73079L
Using a more suitable / appropriate tool costs just ~ 0.2 [ms], i.e. about +350x FASTER:
>>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
189L
171L
173L
187L
183L
188L
193L
and with one more trick, an in-place modus operandi:
>>> aClk.start();arrayNP[:] *=arrayNP[:];aClk.stop()
138L
139L
136L
137L
136L
136L
137L
Yields a ~ +514x SPEEDUP, just from using the appropriate tool.
The art of performance is not in following marketing-sounding claims about parallelising (at any cost), but in using know-how based methods that pay the least costs for the biggest achievable speedups.
For "small" problems, the typical costs of distributing "thin" work-packages are indeed hard to cover with any potentially achievable speedup, so the "problem size" actually limits one's choice of methods that could reach a positive gain (speedups of 0.9 or even << 1.0 are reported here on StackOverflow so often that you need not feel lost or alone in this sort of surprise).
Epilogue
Processor number counts.
Core number counts.
But cache-sizes + NUMA-irregularities count more than that.
Smart, vectorised, HPC-cured, GIL-free libraries matter ( numpy et al - thanks a lot Travis OLIPHANT & al ... Great Salute to his team ... )
An overhead-strict (re-)formulation of Amdahl's Law explains why even many-N-CPU parallelised code execution may (and indeed often does) suffer from speedups << 1.0
The overhead-strict formulation of the Amdahl's Law speedup S explicitly includes the very costs of the paid [PAR]-Setup + [PAR]-Terminate overheads:
S = 1 / ( s + pSO + ( 1 - s ) / N + pTO )
where s is the [SERIAL] fraction of the work, ( 1 - s ) the parallelisable fraction, N the number of parallel workers, pSO the [PAR]-Setup-Overhead add-on and pTO the [PAR]-Terminate-Overhead add-on.
( an interactive animated tool for 2D visualising effects of these performance constraints is cited here )
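To get a quick feel for those numbers, the overhead-strict formula above can be evaluated directly (a throwaway helper, names mine):

def overhead_strict_speedup( s, N, pSO, pTO ):
    # s   .. [SERIAL] fraction of the work ( 0.0 .. 1.0 )
    # N   .. number of parallel workers
    # pSO .. [PAR]-Setup-Overhead,     as a fraction of the original runtime
    # pTO .. [PAR]-Terminate-Overhead, as a fraction of the original runtime
    return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO )

print( overhead_strict_speedup( s = 0.02, N = 4, pSO = 0.00, pTO = 0.00 ) )   # ~3.8x  classic Amdahl
print( overhead_strict_speedup( s = 0.02, N = 4, pSO = 0.40, pTO = 0.10 ) )   # ~1.3x  once overheads bite
print( overhead_strict_speedup( s = 0.02, N = 4, pSO = 2.00, pTO = 0.50 ) )   # <1.0   a net slowdown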

High-speed alternatives to replace byte array processing bottlenecks

>> See EDIT below <<
I am working on processing data from a special pixelated CCD camera over serial, using FTDI D2xx drivers via pyUSB.
The camera can operate at high bandwidth to the PC, up to 80 frames/sec. I would love that speed, but I know it isn't feasible with pure Python, it being an interpreted language; still, I'd like to know how close I can get - whether through optimizations I missed in my code, threading, or some other approach. My immediate thought is to break out the most time-consuming loops and put them in C, but I don't have much experience with C and am not sure of the best way to get Python to interact with it inline, if that's possible. I have complex algorithms heavily developed in Python with SciPy/NumPy, which are already optimized and perform acceptably, so I only need a way to speed up the acquisition of the data fed back to Python, if that's the best approach.
The difficulty, and the reason I used Python and not some other language, is the need to run it easily cross-platform (I develop on Windows, but am putting the code on an embedded Linux board, making a stand-alone system). If you suggest that I use another language, like C, how would I be able to work cross-platform? I have never compiled a lower-level language like C for both Windows and Linux, so I would want to be sure of that process - I would have to compile it for each system, right? What do you suggest?
Here are my functions, with current execution times:
ReadStream: 'RXcount' is 114733 for a device read, formatting from string to byte equivalent
Returns a list of bytes (0-255), representing binary values
Current execution time: 0.037 sec
def ReadStream(RXcount):
    global ftdi
    RXdata = ftdi.read(RXcount)
    RXdata = list(struct.unpack(str(len(RXdata)) + 'B', RXdata))
    return RXdata
ProcessRawData: To reshape the byte list into an array that matches the pixel orientations
Results in a 3584x32 array, after trimming off some un-needed bytes.
Data is unique in that every block of 14 rows represents 14 bits of one row of pixels on the device (with 32 bytes across @ 8 bits/byte = 256 bits across), which is 256x256 pixels. The processed array has 32 columns of bytes because each byte, in binary, represents 8 pixels (32 bytes * 8 bits = 256 pixels). Still working on how to do that one... I have already posted a question for that previously
Current execution time: 0.01 sec ... not bad, it's just Numpy
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        ProcessedMatrix = np.ndarray((1, 114733), dtype=int)
        np.copyto(ProcessedMatrix, RawData)
        ProcessedMatrix = ProcessedMatrix[:, 1:-44]
        ProcessedMatrix = np.reshape(ProcessedMatrix, (-1, 32))
        return ProcessedMatrix
    else:
        return None
Finally,
GetFrame: The device has a mode where it just outputs whether a pixel detected anything or not, using the lowest bit of the array (every 14th row) - Get that data and convert to int for each pixel
Results in 256x256 array, after processing every 14th row, which are bytes to be read as binary (32 bytes across ... 32 bytes * 8 bits = 256 pixels across)
Current execution time: 0.04 sec
def GetFrame(ProcessedMatrix):
    if np.shape(ProcessedMatrix) == (3584, 32):
        FrameArray = np.zeros((256, 256), dtype='B')
        DataRows = ProcessedMatrix[13::14]
        for i in range(256):
            RowData = ""
            for j in range(32):
                RowData = RowData + "{:08b}".format(DataRows[i, j])
            FrameArray[i] = [int(RowData[b:b+1], 2) for b in range(256)]
        return FrameArray
    else:
        return False
Goal:
I would like to target a total execution time of ~0.02 secs/frame by whatever suggestions you make (currently it's 0.25 secs/frame, with the GetFrame function being the weakest). The device I/O is not the limiting factor, as it outputs a data packet every 0.0125 secs. If I get the execution time down, can I then just run the acquisition and processing in parallel with some threading?
Let me know what you suggest as the best path forward - Thank you for the help!
EDIT, thanks to @Jaime:
Functions are now:
def ReadStream(RXcount):
    global ftdi
    return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)
... time 0.013 sec
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        return RawData[1:-44].reshape(-1, 32)
    return None
... time 0.000007 sec!
def GetFrame(ProcessedMatrix):
    if ProcessedMatrix.shape == (3584, 32):
        return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
    return False
... time 0.00006 sec!
So, with pure Python, I am now able to acquire the data at the desired frame rate! After a few tweaks to the D2xx USB buffers and latency timing, I just clocked it at 47.6 FPS!
The last step is whether there is any way to make this run in parallel with my processing algorithms - I need some way to pass the result of GetFrame to another loop running in parallel.
There are several places where you can speed things up significantly. Perhaps the most obvious is rewriting GetFrame:
def GetFrame(ProcessedMatrix):
    if ProcessedMatrix.shape == (3584, 32):
        return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
    return False
This requires that ProcessedMatrix be an ndarray of type np.uint8, but other than that, on my systems it runs 1000x faster.
With your other two functions, I think that in ReadStream you should do something like:
def ReadStream(RXcount):
    global ftdi
    return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)
Even if it doesn't speed up that function much, because it is the reading taking up most of the time, it will already give you a numpy array of bytes to work on. With that, you can then go on to ProcessRawData and try:
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        return RawData[1:-44].reshape(-1, 32)
    return None
Which is 10x faster than your version.
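On the question's last point (running acquisition and processing in parallel): since the blocking USB read spends its time waiting on the driver rather than executing Python bytecode, a simple producer/consumer pair with a queue is a reasonable first attempt. This is only a sketch, assuming the ReadStream/ProcessRawData/GetFrame functions above:

import threading
import queue                             # Queue on Python 2

frames = queue.Queue(maxsize=8)          # small buffer keeps memory bounded

def acquire(stop_event):
    while not stop_event.is_set():
        raw = ReadStream(114733)         # blocking read, waits on the FTDI driver
        matrix = ProcessRawData(raw)
        if matrix is not None:
            frames.put(GetFrame(matrix))

stop = threading.Event()
threading.Thread(target=acquire, args=(stop,), daemon=True).start()

while not stop.is_set():
    frame = frames.get()                 # the analysis loop consumes frames here
    # ... run the SciPy/NumPy algorithms on `frame` ...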

Python: How to get the number of milliseconds per jiffy

I'd like to know the HZ value of the system, i.e. how many milliseconds one jiffy is, from Python code.
There is USER_HZ
>>> import os
>>> os.sysconf_names['SC_CLK_TCK']
2
>>> os.sysconf(2)
100
which is what the kernel uses to report time in /proc.
From the time(7) manual page:
The Software Clock, HZ, and Jiffies
The accuracy of various system calls that set timeouts (e.g., select(2), sigtimedwait(2)) and measure CPU time (e.g., getrusage(2)) is limited by the resolution of the software clock, a clock maintained by the kernel which measures time in jiffies. The size of a jiffy is determined by the value of the kernel constant HZ.
The value of HZ varies across kernel versions and hardware platforms. On i386 the situation is as follows: on kernels up to and including 2.4.x, HZ was 100, giving a jiffy value of 0.01 seconds; starting with 2.6.0, HZ was raised to 1000, giving a jiffy of 0.001 seconds. Since kernel 2.6.13, the HZ value is a kernel configuration parameter and can be 100, 250 (the default) or 1000, yielding a jiffies value of, respectively, 0.01, 0.004, or 0.001 seconds. Since kernel 2.6.20, a further frequency is available: 300, a number that divides evenly for the common video frame rates (PAL, 25 HZ; NTSC, 30 HZ).
The times(2) system call is a special case. It reports times with a granularity defined by the kernel constant USER_HZ. Userspace applications can determine the value of this constant using sysconf(_SC_CLK_TCK).
If you absolutely must know SYSTEM_HZ:
>>> from ctypes import *
>>> rt = CDLL('librt.so')
>>> CLOCK_REALTIME = 0
>>> class timespec(Structure):
... _fields_ = [("tv_sec", c_long), ("tv_nsec", c_long)]
...
>>> res = timespec()
>>> rt.clock_getres(CLOCK_REALTIME, byref(res))
0
>>> res.tv_sec, res.tv_nsec
(0, 4000250)
>>> SYSTEM_HZ = round(1/(res.tv_sec + (res.tv_nsec/10.0**9)))
Gives 250 on my laptop (which sounds about right) and 1000000000 in a VM…
sysconf(SC_CLK_TCK) does not give the frequency of the timer interrupts in Linux. It gives the frequency of the jiffies counter that is visible to userspace, e.g. in the counters under various directories in /proc.
The actual frequency is hidden from userspace, deliberately. Indeed, some systems use dynamic ticks or "tickless" kernels, so there aren't really any fixed-rate timer interrupts at all.
All the userspace interfaces use the value from SC_CLK_TCK, which as far as I can see is always 100 under Linux.
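So from Python, the portable answer is the USER_HZ-based one; for the value the question actually asks about (milliseconds per userspace-visible jiffy), something like this is enough:

import os

user_hz = os.sysconf('SC_CLK_TCK')   # ticks per second as seen by userspace, typically 100
ms_per_jiffy = 1000.0 / user_hz      # e.g. 10.0 ms when USER_HZ == 100
print(ms_per_jiffy)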
I wrote this:
https://github.com/peppelinux/xt_recent_parser
output is like this:
python3 xt_recent_parser.py
XT_RECENT python parser
<giuseppe.demarco@unical.it>
114.241.108.160, last seen: 2017-03-25 18:21:42 after 13 Connections
46.165.210.17, last seen: 2017-03-25 13:07:54 after 10 Connections
61.53.219.162, last seen: 2017-03-25 17:39:17 after 20 Connections
179.37.141.232, last seen: 2017-03-25 18:08:23 after 2 Connections
114.42.117.39, last seen: 2017-03-25 13:22:14 after 18 Connections
177.12.84.234, last seen: 2017-03-25 16:22:14 after 17 Connections
I think it will be easy to adapt if you need millisecond conversion; you only have to extend its JiffyTimeConverter Python class.
