Related
Background
I am trying to speed-up computation by parallelization (via joblib) using more available cores in Python 3.8, but observed that it does scale poorly.
Trials
I wrote a little script to test and demonstrate the behavior which can be found later. The script (see later) is designed to have a completely independent task doing some iterations of dummy operations using NumPy and Pandas. There is no input and no output to the task, no disc or other I/O, nor any communication or shared memory, just plain CPU and RAM usage. The processes do not use any other resources either other than the occasional request for the current time. Amdahl's Law should not apply to the code here, since there is no common code at all except for process setup.
I ran some experiments with increased workloads by duplicating the tasks using sequential vs. parallelization processing and measured the time it takes for each iteration and the whole (parallel) processes to complete. I ran the script on my Windows 10 laptop, and two AWS EC2 Linux (Amazon Linux 2) machines. The number of parallel processed never exceeded the number of available cores.
Observation
I observed the following (see results later for details, duration in seconds):
In case the number of parallel processed was less than the number of available cores, the total average CPUs utilization (user) never was more than 93%, system calls did not exceed 4%, and no iowait (measured with iostat -hxm 10)
The workload seems to be distributed equally over the available cores, though, which might be an indication for frequent switches between processes even though there are plenty of cores available
Interestingly, for sequential processing, the CPU utilization (user) was around 48%
The summed duration of all iterations is only slightly less than the total duration of a process, hence the process setup does not seem to be a major factor
For each doubling of the number of parallel processes there is a decrease in speed per each iteration/process of 50%
Whereas the duration for sequential processing approx. doubles as expected with doubling the workload (total number of iterations),
the duration for the parallel processing also increased significantly by approx. 50% per each doubling
These findings in this magnitude are unexpected to me.
Questions
What is the cause for this beavior?
Am I missing something?
How can it be remedied in order to utilize the full prospect of using more cores?
Detailed results
Windows 10
6 CPUs, 12 cores
Call: python .\time_parallel_processing.py 1,2,4,8 10
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParallelCount
Joblib 1 4.363902 0.195268 43.673971 10
2 6.322100 0.140654 63.870973 20
4 9.270582 0.464706 93.631790 40
8 15.489000 0.222859 156.670544 80
Seq 1 4.409772 0.126686 44.133441 10
2 4.465326 0.113183 89.377296 20
4 4.534959 0.125097 181.528372 40
8 4.444790 0.083315 355.849860 80
AWS c5.4xlarge
8 CPUs, 16 cores
Call: python time_parallel_processing.py 1,2,4,8,16 10
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 2.196086 0.009798 21.987626 10
2 3.392873 0.010025 34.297323 20
4 4.519174 0.126054 45.967140 40
8 6.888763 0.676024 71.815990 80
16 12.191278 0.156941 123.287779 160
Seq 1 2.192089 0.010873 21.945536 10
2 2.184294 0.008955 43.735713 20
4 2.201437 0.027537 88.156621 40
8 2.145312 0.009631 171.805374 80
16 2.137723 0.018985 342.393953 160
AWS c5.9xlarge
18 CPUs, 36 cores
Call: python time_parallel_processing.py 1,2,4,8,16,32 10
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 1.888071 0.023799 18.905295 10
2 2.797132 0.009859 28.307708 20
4 3.349333 0.106755 34.199839 40
8 4.273267 0.705345 45.998927 80
16 6.383214 1.455857 70.469109 160
32 10.974141 4.220783 129.671016 320
Seq 1 1.891170 0.030131 18.934494 10
2 1.866365 0.007283 37.373133 20
4 1.893082 0.041085 75.813468 40
8 1.855832 0.007025 148.643725 80
16 1.896622 0.007573 303.828529 160
32 1.864366 0.009142 597.301383 320
Script code
import argparse
import sys
import time
from argparse import Namespace
from typing import List
import numpy as np
import pandas as pd
from joblib import delayed
from joblib import Parallel
from tqdm import tqdm
RESULT_COLUMNS = {"Mode": str, "ParCount": int, "ProcessId": int, "IterId": int, "Duration": float}
def _create_empty_data_frame() -> pd.DataFrame:
return pd.DataFrame({key: [] for key, _ in RESULT_COLUMNS.items()}).astype(RESULT_COLUMNS)
def _do_task() -> None:
for _ in range(10):
array: np.ndarray = np.random.rand(2500, 2500)
_ = np.matmul(array, array)
data_frame: pd.DataFrame = pd.DataFrame(np.random.rand(250, 250), columns=list(map(str, list(range(250)))))
_ = data_frame.merge(data_frame)
def _process(process_id: int, iter_count: int) -> pd.DataFrame:
durations: pd.DataFrame = _create_empty_data_frame()
for i in tqdm(range(iter_count)):
iter_start_time: float = time.time()
_do_task()
durations = durations.append(
{
"Mode": "",
"ParCount": 0,
"ProcessId": process_id,
"IterId": i,
"Duration": time.time() - iter_start_time,
},
ignore_index=True,
)
return durations
def main(args: Namespace) -> None:
"""Execute main script."""
iter_durations: List[pd.DataFrame] = []
mode_durations: List[pd.DataFrame] = []
for par_count in list(map(int, args.par_counts.split(","))):
total_iter_count: int = par_count * int(args.iter_count)
print(f"\nRunning {par_count} processes in parallel and {total_iter_count} iterations in total")
start_time_joblib: float = time.time()
with Parallel(n_jobs=par_count) as parallel:
joblib_durations: List[pd.DataFrame] = parallel(
delayed(_process)(process_id, int(args.iter_count)) for process_id in range(par_count)
)
iter_durations.append(pd.concat(joblib_durations).assign(**{"Mode": "Joblib", "ParCount": par_count}))
end_time_joblib: float = time.time()
print(f"\nRunning {par_count} processes sequentially with {total_iter_count} iterations in total")
start_time_seq: float = time.time()
seq_durations: List[pd.DataFrame] = []
for process_id in range(par_count):
seq_durations.append(_process(process_id, int(args.iter_count)))
iter_durations.append(pd.concat(seq_durations).assign(**{"Mode": "Seq", "ParCount": par_count}))
end_time_seq: float = time.time()
mode_durations.append(
pd.DataFrame(
{
"Mode": ["Joblib", "Seq"],
"ParCount": [par_count] * 2,
"Duration": [end_time_joblib - start_time_joblib, end_time_seq - start_time_seq],
"TotalIterCount": [total_iter_count] * 2,
}
)
)
print("\nDuration in seconds")
grouping_columns: List[str] = ["Mode", "ParCount"]
print(
pd.concat(iter_durations)
.groupby(grouping_columns)
.agg({"Duration": ["mean", "std"]})
.merge(
pd.concat(mode_durations).groupby(grouping_columns).agg({"Duration": ["mean"], "TotalIterCount": "mean"}),
on=grouping_columns,
suffixes=["/Iter", " total"],
how="inner",
)
)
if __name__ == "__main__":
print(f"Command line: {sys.argv}")
parser: argparse.ArgumentParser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"par_counts",
help="Comma separated list of parallel processes counts to start trials for (e.g. '1,2,4,8,16,32')",
)
parser.add_argument("iter_count", help="Number of iterations per parallel process to carry out")
args: argparse.Namespace = parser.parse_args()
start_time: float = time.time()
main(args)
print(f"\nTotal elapsed time: {time.time() - start_time:.2f} seconds")
Environment
Created with' conda env create -f environment.time_parallel.yaml
environment.time_parallel.yaml:
name: time_parallel
channels:
- defaults
- conda-forge
dependencies:
- python=3.8.5
- pip=20.3.3
- pandas=1.2.0
- numpy=1.19.2
- joblib=1.0.0
- tqdm=4.55.1
Update 1
Thanks to the coment of #sholderbach I investigated into the NumPy/Pandas usage and found out a couple of things.
1)
NumPy uses a linear algebra backend which automatically will run some commands (including matrix multiplication) in parallel threads which results in too many threads altogether clogging the system, the more parallel processes, the more, hence the increasing duration per iteration.
I tested this hypthesis by removing NumPy and Pandas operations in method _do_task adn replacing it by simple math operations only:
def _do_task() -> None:
for _ in range(10):
for i in range(10000000):
_ = 1000 ^ 2 % 200
The results are exactly as expected in that the duration of an iteration does not change when increasing the number of processes (beyond the number of cores available).
Windows 10
Call python time_parallel_processing.py 1,2,4,8 5
Duration in seconds
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 2.562570 0.015496 13.468393 5
2 2.556241 0.021074 13.781174 10
4 2.565614 0.054754 16.171828 20
8 2.630463 0.258474 20.328055 40
Seq 2 2.576542 0.033270 25.874965 10
AWS c5.9xlarge
Call python time_parallel_processing.py 1,2,4,8,16,32 10
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 2.082849 0.022352 20.854512 10
2 2.126195 0.034078 21.596860 20
4 2.287874 0.254493 27.420978 40
8 2.141553 0.030316 21.912917 80
16 2.156828 0.137937 24.483243 160
32 3.581366 1.197282 42.884399 320
Seq 2 2.076256 0.004231 41.571033 20
2)
Following the hint of #sholderbach I found a number of other links which cover the topic of linear algebra backends using multiple threads automatically and how to turn this off:
NumPy issue (from #sholderbach)
threadpoolctl package
Nice article
Pinning process to a specific CPU with Python (and package psutil)
Add to _process:
proc = psutil.Process()
proc.cpu_affinity([process_id])
with threadpool_limits(limits=1):
...
Add to environment:
- threadpoolctl=2.1.0
- psutil=5.8.0
Note: I had to replace joblib by multiprocessing, since pinning did not work properly with joblib (only one half of the processes got spawned at a time on Linux).
I did some tests with mixed results. Monitoring shows that pinnng and restricting to one thread per process works for both Windows 10 and Linux/AWS c5.9xlarge. Unfortunately, the absolute duration per iteration increases by these "fixes".
Also, the duration per iteration still begins to increase at some point of parallelization.
Here are the results:
Windows 10
Call: python time_parallel_processing.py 1,2,4,8 5
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 9.502184 0.046554 47.542230 5
2 9.557120 0.092897 49.488612 10
4 9.602235 0.078271 50.249238 20
8 10.518716 0.422020 60.186707 40
Seq 2 9.493682 0.062105 95.083382 10
AWS c5.9xlarge
Call python time_parallel_processing.py 1,2,4,8,16,20,24,28,32 5
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Parallel 1 5.271010 0.008730 15.862883 3
2 5.400430 0.016094 16.271649 6
4 5.708021 0.069001 17.428172 12
8 6.088623 0.179789 18.745922 24
16 8.330902 0.177772 25.566504 48
20 10.515132 3.081697 47.895538 60
24 13.506221 4.589382 53.348917 72
28 16.318631 4.961513 57.536180 84
32 19.800182 4.435462 64.717435 96
Seq 2 5.212529 0.037129 31.332297 6
What is the cause for this behavior?
Very generally, this type of slowdown usually indicates some combination of being blocked by the GIL, context-switching between cores, or doing a lot of pickling
Am I missing something?
You may be missing some small issues - try profiling (some sampling profiler may be much more performant than cProfile) to see where the time is spent!
However, there's still a finite limit to how fast this can be made before you are reimplementing the suggestions below
How can it be remedied in order to utilize the full prospect of using more cores?
Take a look at numba and dask, which can allow you to get tremendous speedups on numpy and pandas code through parallelization that steps outside of the GIL
numba compiles numpy code and caches it for greater speed and practical processor operations
dask is a framework full of good tricks for efficient parallelization on a single and multiple systems
When I had a scaling issue with ipyparallel, it was caused by garbage collection during the run.
The source code of timeit shows how to disable gc properly.
Is this possible? I've heard Cassandra has something similar : https://datastax.github.io/python-driver/api/cassandra/util.html
I have been using a ISO timestamp concatenated with a uuid4, but that ended up way too large (58 characters) and probably overkill.
Keeping a sequential number doesn't work in my context (DynamoDB NoSQL)
Worth noticing that for my application it doesn't matter if items created in batch/same second are in a random order, as long as the uid don't collapse.
I have no specific restriction on maximum length, ideally I would like to see the different collision chance for different lengths, but it needs to be smaller than 58 (my original attempt)
This is to use with DynamoDB(NoSQL Database) as Sort-key
Why uuid.uuid1 is not sequential
uuid.uuid1(node=None, clock_seq=None) is effectively:
60 bits of timestamp (representing number of 100-ns intervals after 1582-10-15 00:00:00)
14 bits of "clock sequence"
48 bits of "Node info" (generated from network card's mac-address or from hostname or from RNG).
If you don't provide any arguments, then System function is called to generate uuid. In that case:
It's unclear if "clock sequence" is sequential or random.
It's unclear if it's safe to be used in multiple processes (can clock_seq be repeated in different processes or not?). In Python 3.7 this info is now available.
If you provide clock_seq or node, then "pure python implementation is used". IN this case even with "fixed value" for clock_seq:
timestamp part is guaranteed to be sequential for all the calls in current process even in threaded execution.
clock_seq part is randomly generated. But that is not critical annymore because timestamp is sequential and unique.
It's NOT safe for multiple processes (processes that call uuid1 with the same clock_seq, node might return conflicting values if called during the "same 100-ns time interval")
Solution that reuses uuid.uuid1
It's easy to see, that you can make uuid1 sequential by providing clock_seq or node arguments (to use python implementation).
import time
from uuid import uuid1, getnode
_my_clock_seq = getrandbits(14)
_my_node = getnode()
def sequential_uuid(node=None):
return uuid1(node=node, clock_seq=_my_clock_seq)
# .hex attribute of this value is 32-characters long string
def alt_sequential_uuid(clock_seq=None):
return uuid1(node=_my_node, clock_seq=clock_seq)
if __name__ == '__main__':
from itertools import count
old_n = uuid1() # "Native"
old_s = sequential_uuid() # Sequential
native_conflict_index = None
t_0 = time.time()
for x in count():
new_n = uuid1()
new_s = sequential_uuid()
if old_n > new_n and not native_conflict_index:
native_conflict_index = x
if old_s >= new_s:
print("OOops: non-sequential results for `sequential_uuid()`")
break
if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
print('No issues for `sequential_uuid()`')
break
old_n = new_n
old_s = new_s
print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')
Multiple processes issues
BUT if you are running some parallel processes on the same machine, then:
node which defaults to uuid.get_node() will be the same for all the processes;
clock_seq has small chance to be the same for some processes (chance of 1/16384)
That might lead to conflicts! That is general concern for using
uuid.uuid1 in parallel processes on the same machine unless you have access to SafeUUID from Python3.7.
If you make sure to also set node to unique value for each parallel process that runs this code, then conflicts should not happen.
Even if you are using SafeUUID, and set unique node, it's still possible to have non-sequential (but unique) ids if they are generated in different processes.
If some lock-related overhead is acceptable, then you can store clock_seq in some external atomic storage (for example in "locked" file) and increment it with each call: this allows to have same value for node on all parallel processes and also will make id-s sequential. For cases when all parallel processes are subprocesses created using multiprocessing: clock_seq can be "shared" using multiprocessing.Value
As a result you always have to remember:
If you are running multiple processes on the same machine, then you must:
Ensure uniqueness of node. The problem for this solution: you can't be sure to have sequential ids from different processes generated during the same 100-ns interval. But this is very "light" operation executed once on process startup and achieved by: "adding" something to default node, e.g. int(time.time()*1e9) - 0x118494406d1cc000, or by adding some counter from machine-level atomic db.
Ensure "machine-level atomic clock_seq" and the same node for all processes on one machine. That way you'll have some overhead for "locking" clock_seq, but id-s are guaranteed to be sequential even if generated in different processes during the same 100-ns interval (unless you are calling uuid from several threads in the same process).
For processes on different machines:
either you have to use some "global counter service";
or it's not possible to have sequential ids generated on different machines during the same 100-ns interval.
Reducing size of the id
General approach to generate UUIDs is quite simple, so it's easy to implement something similar from scratch, and for example use less bits for node_info part:
import time
from random import getrandbits
_my_clock_seq = getrandbits(14)
_last_timestamp_part = 0
_used_clock_seq = 0
timestamp_multiplier = 1e7 # I'd recommend to use this value
# Next values are enough up to year 2116:
if timestamp_multiplier == 1e9:
time_bits = 62 # Up to year 2116, also reduces chances for non-sequential id-s generated in different processes
elif timestamp_multiplier == 1e8:
time_bits = 60 # up to year 2335
elif timestamp_multiplier == 1e7:
time_bits = 56 # Up to year 2198.
else:
raise ValueError('Please calculate and set time_bits')
time_mask = 2**time_bits - 1
seq_bits = 16
seq_mask = 2**seq_bits - 1
node_bits = 12
node_mask = 2**node_bits - 1
max_hex_len = len(hex(2**(node_bits+seq_bits+time_bits) - 1)) - 2 # 21
_default_node_number = getrandbits(node_bits) # or `uuid.getnode() & node_mask`
def sequential_uuid(node_number=None):
"""Return 21-characters long hex string that is sequential and unique for each call in current process.
Results from different processes may "overlap" but are guaranteed to
be unique if `node_number` is different in each process.
"""
global _my_clock_seq
global _last_timestamp_part
global _used_clock_seq
if node_number is None:
node_number = _default_node_number
if not 0 <= node_number <= node_mask:
raise ValueError("Node number out of range")
timestamp_part = int(time.time() * timestamp_multiplier) & time_mask
_my_clock_seq = (_my_clock_seq + 1) & seq_mask
if _last_timestamp_part >= timestamp_part:
timestamp_part = _last_timestamp_part
if _used_clock_seq == _my_clock_seq:
timestamp_part = (timestamp_part + 1) & time_mask
else:
_used_clock_seq = _my_clock_seq
_last_timestamp_part = timestamp_part
return hex(
(timestamp_part << (node_bits+seq_bits))
|
(_my_clock_seq << (node_bits))
|
node_number
)[2:]
Notes:
Maybe it's better to simply store integer value (not hex-string) in the database
If you are storing it as text/char, then its better to convert integer to base64-string instead of converting it to hex-string. That way it will be shorter (21 chars hex-string → 16 chars b64-encoded string):
from base64 import b64encode
total_bits = time_bits+seq_bits+node_bits
total_bytes = total_bits // 8 + 1 * bool(total_bits % 8)
def int_to_b64(int_value):
return b64encode(int_value.to_bytes(total_bytes, 'big'))
Collision chances
Single process: collisions not possible
Multiple processes with manually set unique clock_seq or unique node in each process: collisions not possible
Multiple processes with randomly set node (48-bits, "fixed" in time):
Chance to have the node collision in several processes:
in 2 processes out of 10000: ~0.000018%
in 2 processes out of 100000: 0.0018%
Chance to have single collision of the id per second in 2 processes with the "colliding" node:
for "timestamp" interval of 100-ns (default for uuid.uuid1 , and in my code when timestamp_multiplier == 1e7): proportional to 3.72e-19 * avg_call_frequency²
for "timestamp" interval of 10-ns (timestamp_multiplier == 1e8): proportional to 3.72e-21 * avg_call_frequency²
In the article you've linked too, the cassandra.util.uuid_from_time(time_arg, node=None, clock_seq=None)[source] seems to be exactly what you're looking for.
def uuid_from_time(time_arg, node=None, clock_seq=None):
"""
Converts a datetime or timestamp to a type 1 :class:`uuid.UUID`.
:param time_arg:
The time to use for the timestamp portion of the UUID.
This can either be a :class:`datetime` object or a timestamp
in seconds (as returned from :meth:`time.time()`).
:type datetime: :class:`datetime` or timestamp
:param node:
None integer for the UUID (up to 48 bits). If not specified, this
field is randomized.
:type node: long
:param clock_seq:
Clock sequence field for the UUID (up to 14 bits). If not specified,
a random sequence is generated.
:type clock_seq: int
:rtype: :class:`uuid.UUID`
"""
if hasattr(time_arg, 'utctimetuple'):
seconds = int(calendar.timegm(time_arg.utctimetuple()))
microseconds = (seconds * 1e6) + time_arg.time().microsecond
else:
microseconds = int(time_arg * 1e6)
# 0x01b21dd213814000 is the number of 100-ns intervals between the
# UUID epoch 1582-10-15 00:00:00 and the Unix epoch 1970-01-01 00:00:00.
intervals = int(microseconds * 10) + 0x01b21dd213814000
time_low = intervals & 0xffffffff
time_mid = (intervals >> 32) & 0xffff
time_hi_version = (intervals >> 48) & 0x0fff
if clock_seq is None:
clock_seq = random.getrandbits(14)
else:
if clock_seq > 0x3fff:
raise ValueError('clock_seq is out of range (need a 14-bit value)')
clock_seq_low = clock_seq & 0xff
clock_seq_hi_variant = 0x80 | ((clock_seq >> 8) & 0x3f)
if node is None:
node = random.getrandbits(48)
return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
clock_seq_hi_variant, clock_seq_low, node), version=1)
There's nothing Cassandra specific to a Type 1 UUID...
You should be able to encode a timestamp precise to the second for a time range of 135 years in 32 bits. That will only take 8 characters to represent in hex. Added to the hex representation of the uuid (32 hex characters) that will amount to only 40 hex characters.
Encoding the time stamp that way requires that you pick a base year (e.g. 2000) and compute the number of days up to the current date (time stamp). Multiply this number of days by 86400, then add the seconds since midnight. This will give you values that are less than 2^32 until you reach year 2135.
Note that you have to keep leading zeroes in the hex encoded form of the timestamp prefix in order for alphanumeric sorting to preserve the chronology.
With a few bits more in the time stamp, you could increase the time range and/or the precision. With 8 more bits (two hex characters), you could go up to 270 years with a precision to the hundredth of a second.
Note that you don't have to model the fraction of seconds in a base 10 range. You will get optimal bit usage by breaking it down in 128ths instead of 100ths for the same number of characters. With the doubling of the year range, this still fits within 8 bits (2 hex characters)
The collision probability, within the time precision (i.e. per second or per 100th or 128th of a second) is driven by the range of the uuid so it will be 1 in 2^128 for the chosen precision. Increasing the precision of the time stamp has the most impact on reducing the collision chances. It is also the factor that has the lowest impact on total size of the key.
More efficient character encoding: 27 to 29 character keys
You could significantly reduce the size of the key by encoding it in base 64 instead of 16 which would give you 27 to 29 characters (depending on you choice of precision)
Note that, for the timestamp part, you need to use an encoding function that takes an integer as input and that preserves the collating sequence of digit characters.
For example:
def encode64(number, size):
chars = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
result = list()
for _ in range(size):
result.append(chars[number%64])
number //= 64
return "".join(reversed(result))
a = encode64(1234567890,6) # '-7ZU9G'
b = encode64(9876543210,6) # '7Ag-Pe'
print(a < b) # True
u = encode64(int(uuid.uuid4()),22) # '1QA2LtMg30ztnugxaokVMk'
key = a+u # '-7ZU9G1QA2LtMg30ztnugxaokVMk' (28 characters)
You can save some more characters by combining the time stamp and uuid into a single number before encoding instead of concatenating the two encoded values.
The encode64() function needs one character every 6 bits.
So, for 135 years with precision to the second: (32+128)/6 = 26.7 --> 27 characters
instead of (32/6 = 5.3 --> 6) + (128/6 = 21.3 --> 22) ==> 28 characters
uid = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
key = encode64( timeStamp<<128 | int(uid) ,27)
with a 270 year span and 128th of a second precision: (40+128)/6 = 28 characters
uid = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
precision = 128
timeStamp = timeStamp * precision + int(factionOfSecond * precision)
key = encode64( timeStamp<<128 | int(uid) ,28)
With 29 characters you can raise precision to 1024th of a second and year range to 2160 years.
UUID masking: 17 to 19 characters keys
To be even more efficient, you could strip out the first 64 bits of the uuid (which is already a time stamp) and combine it with your own time stamp. This would give you keys with a length of 17 to 19 characters with practically no loss of collision avoidance (depending on your choice of precision).
mask = (1<<64)-1
key = encode64( timeStamp<<64 | (int(uid) & mask) ,19)
Integer/Numeric keys ?
As a final note, if your database supports very large integers or numeric fields (140 bits or more) as keys, you don't have to convert the combined number to a string. Just use it directly as the key. The numerical sequence of timeStamp<<128 | int(uid) will respect the chronology.
The uuid6 module (pip install uuid6) solves the problem. It aims at implementing the corresponding draft for a new uuid variant standard, see here.
Example code:
import uuid6
for i in range(0, 30):
u = uuid6.uuid7()
print(u)
time.sleep(0.1)
The package suggests to use uuid6.uuid7():
Implementations SHOULD utilize UUID version 7 over UUID version 1 and
6 if possible.
UUID version 7 features a time-ordered value field derived from the
widely implemented and well known Unix Epoch timestamp source, the
number of milliseconds seconds since midnight 1 Jan 1970 UTC, leap
seconds excluded. As well as improved entropy characteristics over
versions 1 or 6.
Is there a way to measure time with high-precision in Python --- more precise than one second? I doubt that there is a cross-platform way of doing that; I'm interesting in high precision time on Unix, particularly Solaris running on a Sun SPARC machine.
timeit seems to be capable of high-precision time measurement, but rather than measure how long a code snippet takes, I'd like to directly access the time values.
The standard time.time() function provides sub-second precision, though that precision varies by platform. For Linux and Mac precision is +- 1 microsecond or 0.001 milliseconds. Python on Windows uses +- 16 milliseconds precision due to clock implementation problems due to process interrupts. The timeit module can provide higher resolution if you're measuring execution time.
>>> import time
>>> time.time() #return seconds from epoch
1261367718.971009
Python 3.7 introduces new functions to the time module that provide higher resolution:
>>> import time
>>> time.time_ns()
1530228533161016309
>>> time.time_ns() / (10 ** 9) # convert to floating-point seconds
1530228544.0792289
Python tries hard to use the most precise time function for your platform to implement time.time():
/* Implement floattime() for various platforms */
static double
floattime(void)
{
/* There are three ways to get the time:
(1) gettimeofday() -- resolution in microseconds
(2) ftime() -- resolution in milliseconds
(3) time() -- resolution in seconds
In all cases the return value is a float in seconds.
Since on some systems (e.g. SCO ODT 3.0) gettimeofday() may
fail, so we fall back on ftime() or time().
Note: clock resolution does not imply clock accuracy! */
#ifdef HAVE_GETTIMEOFDAY
{
struct timeval t;
#ifdef GETTIMEOFDAY_NO_TZ
if (gettimeofday(&t) == 0)
return (double)t.tv_sec + t.tv_usec*0.000001;
#else /* !GETTIMEOFDAY_NO_TZ */
if (gettimeofday(&t, (struct timezone *)NULL) == 0)
return (double)t.tv_sec + t.tv_usec*0.000001;
#endif /* !GETTIMEOFDAY_NO_TZ */
}
#endif /* !HAVE_GETTIMEOFDAY */
{
#if defined(HAVE_FTIME)
struct timeb t;
ftime(&t);
return (double)t.time + (double)t.millitm * (double)0.001;
#else /* !HAVE_FTIME */
time_t secs;
time(&secs);
return (double)secs;
#endif /* !HAVE_FTIME */
}
}
( from http://svn.python.org/view/python/trunk/Modules/timemodule.c?revision=81756&view=markup )
David's post was attempting to show what the clock resolution is on Windows. I was confused by his output, so I wrote some code that shows that time.time() on my Windows 8 x64 laptop has a resolution of 1 msec:
# measure the smallest time delta by spinning until the time changes
def measure():
t0 = time.time()
t1 = t0
while t1 == t0:
t1 = time.time()
return (t0, t1, t1-t0)
samples = [measure() for i in range(10)]
for s in samples:
print s
Which outputs:
(1390455900.085, 1390455900.086, 0.0009999275207519531)
(1390455900.086, 1390455900.087, 0.0009999275207519531)
(1390455900.087, 1390455900.088, 0.0010001659393310547)
(1390455900.088, 1390455900.089, 0.0009999275207519531)
(1390455900.089, 1390455900.09, 0.0009999275207519531)
(1390455900.09, 1390455900.091, 0.0010001659393310547)
(1390455900.091, 1390455900.092, 0.0009999275207519531)
(1390455900.092, 1390455900.093, 0.0009999275207519531)
(1390455900.093, 1390455900.094, 0.0010001659393310547)
(1390455900.094, 1390455900.095, 0.0009999275207519531)
And a way to do a 1000 sample average of the delta:
reduce( lambda a,b:a+b, [measure()[2] for i in range(1000)], 0.0) / 1000.0
Which output on two consecutive runs:
0.001
0.0010009999275207519
So time.time() on my Windows 8 x64 has a resolution of 1 msec.
A similar run on time.clock() returns a resolution of 0.4 microseconds:
def measure_clock():
t0 = time.clock()
t1 = time.clock()
while t1 == t0:
t1 = time.clock()
return (t0, t1, t1-t0)
reduce( lambda a,b:a+b, [measure_clock()[2] for i in range(1000000)] )/1000000.0
Returns:
4.3571334791658954e-07
Which is ~0.4e-06
An interesting thing about time.clock() is that it returns the time since the method was first called, so if you wanted microsecond resolution wall time you could do something like this:
class HighPrecisionWallTime():
def __init__(self,):
self._wall_time_0 = time.time()
self._clock_0 = time.clock()
def sample(self,):
dc = time.clock()-self._clock_0
return self._wall_time_0 + dc
(which would probably drift after a while, but you could correct this occasionally, for example dc > 3600 would correct it every hour)
If Python 3 is an option, you have two choices:
time.perf_counter which always use the most accurate clock on your platform. It does include time spent outside of the process.
time.process_time which returns the CPU time. It does NOT include time spent outside of the process.
The difference between the two can be shown with:
from time import (
process_time,
perf_counter,
sleep,
)
print(process_time())
sleep(1)
print(process_time())
print(perf_counter())
sleep(1)
print(perf_counter())
Which outputs:
0.03125
0.03125
2.560001310720671e-07
1.0005455362793145
You can also use time.clock() It counts the time used by the process on Unix and time since the first call to it on Windows. It's more precise than time.time().
It's the usually used function to measure performance.
Just call
import time
t_ = time.clock()
#Your code here
print 'Time in function', time.clock() - t_
EDITED: Ups, I miss the question as you want to know exactly the time, not the time spent...
Python 3.7 introduces 6 new time functions with nanosecond resolution, for example instead of time.time() you can use time.time_ns() to avoid floating point imprecision issues:
import time
print(time.time())
# 1522915698.3436284
print(time.time_ns())
# 1522915698343660458
These 6 functions are described in PEP 564:
time.clock_gettime_ns(clock_id)
time.clock_settime_ns(clock_id, time:int)
time.monotonic_ns()
time.perf_counter_ns()
time.process_time_ns()
time.time_ns()
These functions are similar to the version without the _ns suffix, but
return a number of nanoseconds as a Python int.
time.clock() has 13 decimal points on Windows but only two on Linux.
time.time() has 17 decimals on Linux and 16 on Windows but the actual precision is different.
I don't agree with the documentation that time.clock() should be used for benchmarking on Unix/Linux. It is not precise enough, so what timer to use depends on operating system.
On Linux, the time resolution is high in time.time():
>>> time.time(), time.time()
(1281384913.4374139, 1281384913.4374161)
On Windows, however the time function seems to use the last called number:
>>> time.time()-int(time.time()), time.time()-int(time.time()), time.time()-time.time()
(0.9570000171661377, 0.9570000171661377, 0.0)
Even if I write the calls on different lines in Windows it still returns the same value so the real precision is lower.
So in serious measurements a platform check (import platform, platform.system()) has to be done in order to determine whether to use time.clock() or time.time().
(Tested on Windows 7 and Ubuntu 9.10 with python 2.6 and 3.1)
The original question specifically asked for Unix but multiple answers have touched on Windows, and as a result there is misleading information on windows. The default timer resolution on windows is 15.6ms you can verify here.
Using a slightly modified script from cod3monk3y I can show that windows timer resolution is ~15milliseconds by default. I'm using a tool available here to modify the resolution.
Script:
import time
# measure the smallest time delta by spinning until the time changes
def measure():
t0 = time.time()
t1 = t0
while t1 == t0:
t1 = time.time()
return t1-t0
samples = [measure() for i in range(30)]
for s in samples:
print(f'time delta: {s:.4f} seconds')
These results were gathered on windows 10 pro 64-bit running python 3.7 64-bit.
The comment left by tiho on Mar 27 '14 at 17:21 deserves to be its own answer:
In order to avoid platform-specific code, use timeit.default_timer()
I observed that the resolution of time.time() is different between Windows 10 Professional and Education versions.
On a Windows 10 Professional machine, the resolution is 1 ms.
On a Windows 10 Education machine, the resolution is 16 ms.
Fortunately, there's a tool that increases Python's time resolution in Windows:
https://vvvv.org/contribution/windows-system-timer-tool
With this tool, I was able to achieve 1 ms resolution regardless of Windows version. You will need to be keep it running while executing your Python codes.
For those stuck on windows (version >= server 2012 or win 8)and python 2.7,
import ctypes
class FILETIME(ctypes.Structure):
_fields_ = [("dwLowDateTime", ctypes.c_uint),
("dwHighDateTime", ctypes.c_uint)]
def time():
"""Accurate version of time.time() for windows, return UTC time in term of seconds since 01/01/1601
"""
file_time = FILETIME()
ctypes.windll.kernel32.GetSystemTimePreciseAsFileTime(ctypes.byref(file_time))
return (file_time.dwLowDateTime + (file_time.dwHighDateTime << 32)) / 1.0e7
GetSystemTimePreciseAsFileTime function
On the same win10 OS system using "two distinct method approaches" there appears to be an approximate "500 ns" time difference. If you care about nanosecond precision check my code below.
The modifications of the code is based on code from user cod3monk3y and Kevin S.
OS: python 3.7.3 (default, date, time) [MSC v.1915 64 bit (AMD64)]
def measure1(mean):
for i in range(1, my_range+1):
x = time.time()
td = x- samples1[i-1][2]
if i-1 == 0:
td = 0
td = f'{td:.6f}'
samples1.append((i, td, x))
mean += float(td)
print (mean)
sys.stdout.flush()
time.sleep(0.001)
mean = mean/my_range
return mean
def measure2(nr):
t0 = time.time()
t1 = t0
while t1 == t0:
t1 = time.time()
td = t1-t0
td = f'{td:.6f}'
return (nr, td, t1, t0)
samples1 = [(0, 0, 0)]
my_range = 10
mean1 = 0.0
mean2 = 0.0
mean1 = measure1(mean1)
for i in samples1: print (i)
print ('...\n\n')
samples2 = [measure2(i) for i in range(11)]
for s in samples2:
#print(f'time delta: {s:.4f} seconds')
mean2 += float(s[1])
print (s)
mean2 = mean2/my_range
print ('\nMean1 : ' f'{mean1:.6f}')
print ('Mean2 : ' f'{mean2:.6f}')
The measure1 results:
nr, td, t0
(0, 0, 0)
(1, '0.000000', 1562929696.617988)
(2, '0.002000', 1562929696.6199884)
(3, '0.001001', 1562929696.620989)
(4, '0.001001', 1562929696.62199)
(5, '0.001001', 1562929696.6229906)
(6, '0.001001', 1562929696.6239917)
(7, '0.001001', 1562929696.6249924)
(8, '0.001000', 1562929696.6259928)
(9, '0.001001', 1562929696.6269937)
(10, '0.001001', 1562929696.6279945)
...
The measure2 results:
nr, td , t1, t0
(0, '0.000500', 1562929696.6294951, 1562929696.6289947)
(1, '0.000501', 1562929696.6299958, 1562929696.6294951)
(2, '0.000500', 1562929696.6304958, 1562929696.6299958)
(3, '0.000500', 1562929696.6309962, 1562929696.6304958)
(4, '0.000500', 1562929696.6314962, 1562929696.6309962)
(5, '0.000500', 1562929696.6319966, 1562929696.6314962)
(6, '0.000500', 1562929696.632497, 1562929696.6319966)
(7, '0.000500', 1562929696.6329975, 1562929696.632497)
(8, '0.000500', 1562929696.633498, 1562929696.6329975)
(9, '0.000500', 1562929696.6339984, 1562929696.633498)
(10, '0.000500', 1562929696.6344984, 1562929696.6339984)
End result:
Mean1 : 0.001001 # (measure1 function)
Mean2 : 0.000550 # (measure2 function)
Here is a python 3 solution for Windows building upon the answer posted above by CyberSnoopy (using GetSystemTimePreciseAsFileTime). We borrow some code from jfs
Python datetime.utcnow() returning incorrect datetime
and get a precise timestamp (Unix time) in microseconds
#! python3
import ctypes.wintypes
def utcnow_microseconds():
system_time = ctypes.wintypes.FILETIME()
#system call used by time.time()
#ctypes.windll.kernel32.GetSystemTimeAsFileTime(ctypes.byref(system_time))
#getting high precision:
ctypes.windll.kernel32.GetSystemTimePreciseAsFileTime(ctypes.byref(system_time))
large = (system_time.dwHighDateTime << 32) + system_time.dwLowDateTime
return large // 10 - 11644473600000000
for ii in range(5):
print(utcnow_microseconds()*1e-6)
References
https://learn.microsoft.com/en-us/windows/win32/sysinfo/time-functions
https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsystemtimepreciseasfiletime
https://support.microsoft.com/en-us/help/167296/how-to-convert-a-unix-time-t-to-a-win32-filetime-or-systemtime
1. Python 3.7 or later
If using Python 3.7 or later, use the modern, cross-platform time module functions such as time.monotonic_ns(), here: https://docs.python.org/3/library/time.html#time.monotonic_ns. It provides nanosecond-resolution timestamps.
import time
time_ns = time.monotonic_ns()
# or on Unix or Linux you can also use:
time_ns = time.clock_gettime_ns()
# or on Windows:
time_ns = time.perf_counter_ns()
# etc. etc. There are others. See the link above.
From my other answer from 2016, here: How can I get millisecond and microsecond-resolution timestamps in Python?:
You might also try time.clock_gettime_ns() on Unix or Linux systems. Based on its name, it appears to call the underlying clock_gettime() C function which I use in my nanos() function in C in my answer here and in my C Unix/Linux library here: timinglib.c.
2. Python 3.3 or later
On Windows, in Python 3.3 or later, you can use time.perf_counter(), as shown by #ereOn here. See: https://docs.python.org/3/library/time.html#time.perf_counter. This provides roughly a 0.5us-resolution timestamp, in floating point seconds. Ex:
import time
# For Python 3.3 or later
time_sec = time.perf_counter() # Windows only, I think
# or on Unix or Linux (I think only those)
time_sec = time.monotonic()
3. Pre-Python 3.3 (ex: Python 3.0, 3.1, 3.2), or later
Summary:
See my other answer from 2016 here for 0.5-us-resolution timestamps, or better, in Windows and Linux, and for versions of Python as old as 3.0, 3.2 or 3.2 even! We do this by calling C or C++ shared object libraries (.dll on Windows, or .so on Unix or Linux) using the ctypes module in Python.
I provide these functions:
millis()
micros()
delay()
delayMicroseconds()
Download GS_timing.py from my eRCaGuy_PyTime repo, then do:
import GS_timing
time_ms = GS_timing.millis()
time_us = GS_timing.micros()
GS_timing.delay(10) # delay 10 ms
GS_timing.delayMicroseconds(10000) # delay 10000 us
Details:
In 2016, I was working in Python 3.0 or 3.1, on an embedded project on a Raspberry Pi, and which I tested and ran frequently on Windows also. I needed nanosecond resolution for some precise timing I was doing with ultrasonic sensors. The Python language at the time did not provide this resolution, and neither did any answer to this question, so I came up with this separate Q&A here: How can I get millisecond and microsecond-resolution timestamps in Python?. I stated in the question at the time:
I read other answers before asking this question, but they rely on the time module, which prior to Python 3.3 did NOT have any type of guaranteed resolution whatsoever. Its resolution is all over the place. The most upvoted answer here quotes a Windows resolution (using their answer) of 16 ms, which is 32000 times worse than my answer provided here (0.5 us resolution). Again, I needed 1 ms and 1 us (or similar) resolutions, not 16000 us resolution.
Zero, I repeat: zero answers here on 12 July 2016 had any resolution better than 16-ms for Windows in Python 3.1. So, I came up with this answer which has 0.5us or better resolution in pre-Python 3.3 in Windows and Linux. If you need something like that for an older version of Python, or if you just want to learn how to call C or C++ dynamic libraries in Python (.dll "dynamically linked library" files in Windows, or .so "shared object" library files in Unix or Linux) using the ctypes library, see my other answer here.
I created a tiny C-Extension that uses GetSystemTimePreciseAsFileTime to provide an accurate timestamp on Windows:
https://win-precise-time.readthedocs.io/en/latest/api.html#win_precise_time.time
Usage:
>>> import win_precise_time
>>> win_precise_time.time()
1654539449.4548845
def start(self):
sec_arg = 10.0
cptr = 0
time_start = time.time()
time_init = time.time()
while True:
cptr += 1
time_start = time.time()
time.sleep(((time_init + (sec_arg * cptr)) - time_start ))
# AND YOUR CODE .......
t00 = threading.Thread(name='thread_request', target=self.send_request, args=([]))
t00.start()
>> See EDIT below <<
I am working on processing data from a special pixelated CCD camera over serial, using FTDI D2xx drivers via pyUSB.
The camera can operate at high bandwidth to the PC, up to 80 frames/sec. I would love that speed, but know that it isn't feasible with Python, due to it being a scripted language, but would like to know how close I can get - whether it be some optimizations that I missed in my code, threading, or using some other approach. I immediately think that breaking-out the most time consuming loops and putting them in C code, but I don't have much experience with C code and not sure the best way to get Python to interact inline with it, if that's possible. I have complex algorithms heavily developed in Python with SciPy/Numpy, which are already optimized and have acceptable performance, so I would need a way to just speed-up the acquisition of the data to feed-back to Python, if that's the best approach.
The difficulty, and the reason I used Python, and not some other language, is due to the need to be able to easily run it cross-platform (I develop in Windows, but am putting the code on an embedded Linux board, making a stand-alone system). If you suggest that I use another code, like C, how would I be able to work cross-platform? I have never worked with compiling a lower-level language like C between Windows and Linux, so I would want to be sure of that process - I would have to compile it for each system, right? What do you suggest?
Here are my functions, with current execution times:
ReadStream: 'RXcount' is 114733 for a device read, formatting from string to byte equivalent
Returns a list of bytes (0-255), representing binary values
Current execution time: 0.037 sec
def ReadStream(RXcount):
global ftdi
RXdata = ftdi.read(RXcount)
RXdata = list(struct.unpack(str(len(RXdata)) + 'B', RXdata))
return RXdata
ProcessRawData: To reshape the byte list into an array that matches the pixel orientations
Results in a 3584x32 array, after trimming off some un-needed bytes.
Data is unique in that every block of 14 rows represents 14-bits of one row of pixels on the device (with 32 bytes across # 8 bits/byte = 256 bits across), which is 256x256 pixels. The processed array has 32 columns of bytes because each byte, in binary, represents 8 pixels (32 bytes * 8 bits = 256 pixels). Still working on how to do that one... I have already posted a question for that previously
Current execution time: 0.01 sec ... not bad, it's just Numpy
def ProcessRawData(RawData):
if len(RawData) == 114733:
ProcessedMatrix = np.ndarray((1, 114733), dtype=int)
np.copyto(ProcessedMatrix, RawData)
ProcessedMatrix = ProcessedMatrix[:, 1:-44]
ProcessedMatrix = np.reshape(ProcessedMatrix, (-1, 32))
return ProcessedMatrix
else:
return None
Finally,
GetFrame: The device has a mode where it just outputs whether a pixel detected anything or not, using the lowest bit of the array (every 14th row) - Get that data and convert to int for each pixel
Results in 256x256 array, after processing every 14th row, which are bytes to be read as binary (32 bytes across ... 32 bytes * 8 bits = 256 pixels across)
Current execution time: 0.04 sec
def GetFrame(ProcessedMatrix):
if np.shape(ProcessedMatrix) == (3584, 32):
FrameArray = np.zeros((256, 256), dtype='B')
DataRows = ProcessedMatrix[13::14]
for i in range(256):
RowData = ""
for j in range(32):
RowData = RowData + "{:08b}".format(DataRows[i, j])
FrameArray[i] = [int(RowData[b:b+1], 2) for b in range(256)]
return FrameArray
else:
return False
Goal:
I would like to target a total execution time of ~0.02 secs/frame by whatever suggestions you make (currently it's 0.25 secs/frame with the GetFrame function being the weakest). The device I/O is not the limiting factor, as that outputs a data packet every 0.0125 secs. If I get the execution time down, then can I just run the acquisition and processing in parallel with some threading?
Let me know what you suggest as the best path forward - Thank you for the help!
EDIT, thanks to #Jaime:
Functions are now:
def ReadStream(RXcount):
global ftdi
return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)
... time 0.013 sec
def ProcessRawData(RawData):
if len(RawData) == 114733:
return RawData[1:-44].reshape(-1, 32)
return None
... time 0.000007 sec!
def GetFrame(ProcessedMatrix):
if ProcessedMatrix.shape == (3584, 32):
return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
return False
... time 0.00006 sec!
So, with pure Python, I am now able to acquire the data at the desired frame rate! After a few tweaks to the D2xx USB buffers and latency timing, I just clocked it at 47.6 FPS!
Last step is if there is any way to make this run in parallel with my processing algorithms? Need some way to pass the result of GetFrame to another loop running in parallel.
There are several places where you can speed things up significantly. Perhaps the most obvious is rewriting GetFrame:
def GetFrame(ProcessedMatrix):
if ProcessedMatrix.shape == (3584, 32):
return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
return False
This requires that ProcessedMatrix be an ndarray of type np.uint8, but other than that, on my systems it runs 1000x faster.
With your other two functions, I think that in ReadStream you should do something like:
def ReadStream(RXcount):
global ftdi
return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)
Even if it doesn't speed up that function much, because it is the reading taking up most of the time, it will already give you a numpy array of bytes to work on. With that, you can then go on to ProcessRawData and try:
def ProcessRawData(RawData):
if len(RawData) == 114733:
return RawData[1:-44].reshape(-1, 32)
return None
Which is 10x faster than your version.
from the passlib documentation
For most public facing services, you can generally have signin take upwards of 250ms - 400ms before users start getting annoyed.
so what is the best value for rounds in a login/registration if we consider that there is one call for the database for the login attempt, and it uses MongoDB with non-blocking call. (using Mongotor, and using the email as the _id, so it is by default indexed, the query is fast: 0.00299978256226 and of course tested with a database that has 3 records...)
import passlib.hash
import time
hashh = passlib.hash.pbkdf2_sha512
beg1 = time.time()
password = hashh.encrypt("test", salt_size = 32, rounds = 12000)
print time.time()- beg1 # returns 0.142999887466
beg2 = time.time()
hashh.verify("test", password) # returns 0.143000125885
print time.time()- beg2
now if i use half value:
password = hashh.encrypt("test", salt_size = 32, rounds = 4000) # returns 0.0720000267029
hashh.verify("test", password) # returns 0.0709998607635
am using Windows 7 64 bits on Dell XPS 15 i7 2.0 Ghz
NB: installed bcrypt, and of course, it's a real pain using it directly as its default values (rounds = 12):
hashh = passlib.hash.bcrypt
beg1 = time.time()
password = hashh.encrypt("test", rounds = 12) # returns 0.406000137329
print time.time()- beg1
beg2 = time.time()
hashh.verify("test", password) # returns 0.40499997139
print time.time()- beg2
half value:
password = hashh.encrypt("test", rounds = 12) # 0.00699996948242 wonderful?
hashh.verify("test", password) # 0.00600004196167
can you suggest me a good rounds value when using pbkdf2_sha512 that will be good for production?
(passlib developer here)
The amount of time pbkdf2_sha512 takes is linearly proportional to it's rounds parameter (elapsed_time = rounds * native_speed). Using the data for your system, native_speed = 12000 / .143 = 83916 iterations/second, which means you'll need around 83916 * .350 = 29575 rounds to get ~350ms delay.
Things are a little tricker for bcrypt, because the amount of time it takes is logarithmically proportional to it's rounds parameter (elapsed_time = (2 ** rounds) * native_speed). Using the data for your system, native_speed = (2 ** 12) / .405 = 10113 iterations/second, which means you'll need around log(10113 * .350, 2) = 11.79 rounds to get ~350 ms delay. But since BCrypt only accepts integer rounds parameters, so you'll need to pick rounds=11 (~200ms) or rounds=12 (~400ms).
All of this is something I'm hoping to fix in a future release of passlib. As a work in progress, passlib's mercurial repo currently contains a simple little script, choose_rounds.py, which takes care of choosing the correct rounds value for a given target time. You can download and run it directly as follows (it may take 20s or so to run):
$ python choose_rounds.py -h
usage: python choose_rounds.py <hash_name> [<target_in_milliseconds>]
$ python choose_rounds.py pbkdf2_sha512 350
hash............: pbkdf2_sha512
speed...........: 83916 iterations/second
target time.....: 350 ms
target rounds...: 29575
$ python choose_rounds.py bcrypt 350
hash............: bcrypt
speed...........: 10113 iterations/second
target time.....: 350 ms
target rounds...: 11 (200ms -- 150ms faster than requested)
target rounds...: 12 (400ms -- 50ms slower than requested)
(edit: added response regarding secure minimum rounds...)
Disclaimer: Determining a secure minimum is a surprisingly tricky question - there are a number of hard to quantify parameters, very little real world data, and some rigorously unhelpful theory. Lacking a good authority, I've been researching the topic myself; and for off-the-cuff calculations, I've boiled the raw data down to a short formula (below), which is generally what I use. Just be aware that behind it are a couple of pages of assumptions and rough estimates, making it more of a Fermi Estimation than an exact answer :|
My rule of thumb (mid 2012) for attacking PBKDF2-HMAC-SHA512 using GPUs is:
days * dollars = 2**(n-31) * rounds
days is the number of days before the attacker has a 50/50 chance of guessing the password.
dollars is the attackers' hardware budget (in $USD).
n is the average amount of entropy in your user's passwords (in bits).
To answer your script-kiddie question: if an average password has 32 bits of entropy, and the attacker has a $2000 system with a good GPU, then at 30000 rounds they will need 30 days (2**(32-31)*30000/2000) to have a 50/50 chance of cracking a given hash. I'd recommend playing around with the values until you arrive at a rounds/days tradeoff that you're comfortable with.
Some things to keep in mind:
The success rate of a dictionary attack isn't linear, it's more of "long tail" situation, so think of the 50/50 mark as more of a half-life.
That 31 is the key factor, as it encodes an estimation of the cost of attacking a specific algorithm using a specific technology level. The actual value, 2**-31, measures the "dollar-days per round" it will cost an attacker. For comparison, attacking PBKDF2-HMAC-SHA512 using an ASIC has a factor closer to 46 -- larger numbers mean more bang for the attacker's buck, and less security per round for you, though script kiddies generally won't have that kind of budget :)