How to generate a time-ordered uid in Python? - python

Is this possible? I've heard Cassandra has something similar : https://datastax.github.io/python-driver/api/cassandra/util.html
I have been using a ISO timestamp concatenated with a uuid4, but that ended up way too large (58 characters) and probably overkill.
Keeping a sequential number doesn't work in my context (DynamoDB NoSQL)
Worth noticing that for my application it doesn't matter if items created in batch/same second are in a random order, as long as the uid don't collapse.
I have no specific restriction on maximum length, ideally I would like to see the different collision chance for different lengths, but it needs to be smaller than 58 (my original attempt)
This is to use with DynamoDB(NoSQL Database) as Sort-key

Why uuid.uuid1 is not sequential
uuid.uuid1(node=None, clock_seq=None) is effectively:
60 bits of timestamp (representing number of 100-ns intervals after 1582-10-15 00:00:00)
14 bits of "clock sequence"
48 bits of "Node info" (generated from network card's mac-address or from hostname or from RNG).
If you don't provide any arguments, then System function is called to generate uuid. In that case:
It's unclear if "clock sequence" is sequential or random.
It's unclear if it's safe to be used in multiple processes (can clock_seq be repeated in different processes or not?). In Python 3.7 this info is now available.
If you provide clock_seq or node, then "pure python implementation is used". IN this case even with "fixed value" for clock_seq:
timestamp part is guaranteed to be sequential for all the calls in current process even in threaded execution.
clock_seq part is randomly generated. But that is not critical annymore because timestamp is sequential and unique.
It's NOT safe for multiple processes (processes that call uuid1 with the same clock_seq, node might return conflicting values if called during the "same 100-ns time interval")
Solution that reuses uuid.uuid1
It's easy to see, that you can make uuid1 sequential by providing clock_seq or node arguments (to use python implementation).
import time
from uuid import uuid1, getnode
_my_clock_seq = getrandbits(14)
_my_node = getnode()
def sequential_uuid(node=None):
return uuid1(node=node, clock_seq=_my_clock_seq)
# .hex attribute of this value is 32-characters long string
def alt_sequential_uuid(clock_seq=None):
return uuid1(node=_my_node, clock_seq=clock_seq)
if __name__ == '__main__':
from itertools import count
old_n = uuid1() # "Native"
old_s = sequential_uuid() # Sequential
native_conflict_index = None
t_0 = time.time()
for x in count():
new_n = uuid1()
new_s = sequential_uuid()
if old_n > new_n and not native_conflict_index:
native_conflict_index = x
if old_s >= new_s:
print("OOops: non-sequential results for `sequential_uuid()`")
break
if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
print('No issues for `sequential_uuid()`')
break
old_n = new_n
old_s = new_s
print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')
Multiple processes issues
BUT if you are running some parallel processes on the same machine, then:
node which defaults to uuid.get_node() will be the same for all the processes;
clock_seq has small chance to be the same for some processes (chance of 1/16384)
That might lead to conflicts! That is general concern for using
uuid.uuid1 in parallel processes on the same machine unless you have access to SafeUUID from Python3.7.
If you make sure to also set node to unique value for each parallel process that runs this code, then conflicts should not happen.
Even if you are using SafeUUID, and set unique node, it's still possible to have non-sequential (but unique) ids if they are generated in different processes.
If some lock-related overhead is acceptable, then you can store clock_seq in some external atomic storage (for example in "locked" file) and increment it with each call: this allows to have same value for node on all parallel processes and also will make id-s sequential. For cases when all parallel processes are subprocesses created using multiprocessing: clock_seq can be "shared" using multiprocessing.Value
As a result you always have to remember:
If you are running multiple processes on the same machine, then you must:
Ensure uniqueness of node. The problem for this solution: you can't be sure to have sequential ids from different processes generated during the same 100-ns interval. But this is very "light" operation executed once on process startup and achieved by: "adding" something to default node, e.g. int(time.time()*1e9) - 0x118494406d1cc000, or by adding some counter from machine-level atomic db.
Ensure "machine-level atomic clock_seq" and the same node for all processes on one machine. That way you'll have some overhead for "locking" clock_seq, but id-s are guaranteed to be sequential even if generated in different processes during the same 100-ns interval (unless you are calling uuid from several threads in the same process).
For processes on different machines:
either you have to use some "global counter service";
or it's not possible to have sequential ids generated on different machines during the same 100-ns interval.
Reducing size of the id
General approach to generate UUIDs is quite simple, so it's easy to implement something similar from scratch, and for example use less bits for node_info part:
import time
from random import getrandbits
_my_clock_seq = getrandbits(14)
_last_timestamp_part = 0
_used_clock_seq = 0
timestamp_multiplier = 1e7 # I'd recommend to use this value
# Next values are enough up to year 2116:
if timestamp_multiplier == 1e9:
time_bits = 62 # Up to year 2116, also reduces chances for non-sequential id-s generated in different processes
elif timestamp_multiplier == 1e8:
time_bits = 60 # up to year 2335
elif timestamp_multiplier == 1e7:
time_bits = 56 # Up to year 2198.
else:
raise ValueError('Please calculate and set time_bits')
time_mask = 2**time_bits - 1
seq_bits = 16
seq_mask = 2**seq_bits - 1
node_bits = 12
node_mask = 2**node_bits - 1
max_hex_len = len(hex(2**(node_bits+seq_bits+time_bits) - 1)) - 2 # 21
_default_node_number = getrandbits(node_bits) # or `uuid.getnode() & node_mask`
def sequential_uuid(node_number=None):
"""Return 21-characters long hex string that is sequential and unique for each call in current process.
Results from different processes may "overlap" but are guaranteed to
be unique if `node_number` is different in each process.
"""
global _my_clock_seq
global _last_timestamp_part
global _used_clock_seq
if node_number is None:
node_number = _default_node_number
if not 0 <= node_number <= node_mask:
raise ValueError("Node number out of range")
timestamp_part = int(time.time() * timestamp_multiplier) & time_mask
_my_clock_seq = (_my_clock_seq + 1) & seq_mask
if _last_timestamp_part >= timestamp_part:
timestamp_part = _last_timestamp_part
if _used_clock_seq == _my_clock_seq:
timestamp_part = (timestamp_part + 1) & time_mask
else:
_used_clock_seq = _my_clock_seq
_last_timestamp_part = timestamp_part
return hex(
(timestamp_part << (node_bits+seq_bits))
|
(_my_clock_seq << (node_bits))
|
node_number
)[2:]
Notes:
Maybe it's better to simply store integer value (not hex-string) in the database
If you are storing it as text/char, then its better to convert integer to base64-string instead of converting it to hex-string. That way it will be shorter (21 chars hex-string → 16 chars b64-encoded string):
from base64 import b64encode
total_bits = time_bits+seq_bits+node_bits
total_bytes = total_bits // 8 + 1 * bool(total_bits % 8)
def int_to_b64(int_value):
return b64encode(int_value.to_bytes(total_bytes, 'big'))
Collision chances
Single process: collisions not possible
Multiple processes with manually set unique clock_seq or unique node in each process: collisions not possible
Multiple processes with randomly set node (48-bits, "fixed" in time):
Chance to have the node collision in several processes:
in 2 processes out of 10000: ~0.000018%
in 2 processes out of 100000: 0.0018%
Chance to have single collision of the id per second in 2 processes with the "colliding" node:
for "timestamp" interval of 100-ns (default for uuid.uuid1 , and in my code when timestamp_multiplier == 1e7): proportional to 3.72e-19 * avg_call_frequency²
for "timestamp" interval of 10-ns (timestamp_multiplier == 1e8): proportional to 3.72e-21 * avg_call_frequency²

In the article you've linked too, the cassandra.util.uuid_from_time(time_arg, node=None, clock_seq=None)[source] seems to be exactly what you're looking for.
def uuid_from_time(time_arg, node=None, clock_seq=None):
"""
Converts a datetime or timestamp to a type 1 :class:`uuid.UUID`.
:param time_arg:
The time to use for the timestamp portion of the UUID.
This can either be a :class:`datetime` object or a timestamp
in seconds (as returned from :meth:`time.time()`).
:type datetime: :class:`datetime` or timestamp
:param node:
None integer for the UUID (up to 48 bits). If not specified, this
field is randomized.
:type node: long
:param clock_seq:
Clock sequence field for the UUID (up to 14 bits). If not specified,
a random sequence is generated.
:type clock_seq: int
:rtype: :class:`uuid.UUID`
"""
if hasattr(time_arg, 'utctimetuple'):
seconds = int(calendar.timegm(time_arg.utctimetuple()))
microseconds = (seconds * 1e6) + time_arg.time().microsecond
else:
microseconds = int(time_arg * 1e6)
# 0x01b21dd213814000 is the number of 100-ns intervals between the
# UUID epoch 1582-10-15 00:00:00 and the Unix epoch 1970-01-01 00:00:00.
intervals = int(microseconds * 10) + 0x01b21dd213814000
time_low = intervals & 0xffffffff
time_mid = (intervals >> 32) & 0xffff
time_hi_version = (intervals >> 48) & 0x0fff
if clock_seq is None:
clock_seq = random.getrandbits(14)
else:
if clock_seq > 0x3fff:
raise ValueError('clock_seq is out of range (need a 14-bit value)')
clock_seq_low = clock_seq & 0xff
clock_seq_hi_variant = 0x80 | ((clock_seq >> 8) & 0x3f)
if node is None:
node = random.getrandbits(48)
return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
clock_seq_hi_variant, clock_seq_low, node), version=1)
There's nothing Cassandra specific to a Type 1 UUID...

You should be able to encode a timestamp precise to the second for a time range of 135 years in 32 bits. That will only take 8 characters to represent in hex. Added to the hex representation of the uuid (32 hex characters) that will amount to only 40 hex characters.
Encoding the time stamp that way requires that you pick a base year (e.g. 2000) and compute the number of days up to the current date (time stamp). Multiply this number of days by 86400, then add the seconds since midnight. This will give you values that are less than 2^32 until you reach year 2135.
Note that you have to keep leading zeroes in the hex encoded form of the timestamp prefix in order for alphanumeric sorting to preserve the chronology.
With a few bits more in the time stamp, you could increase the time range and/or the precision. With 8 more bits (two hex characters), you could go up to 270 years with a precision to the hundredth of a second.
Note that you don't have to model the fraction of seconds in a base 10 range. You will get optimal bit usage by breaking it down in 128ths instead of 100ths for the same number of characters. With the doubling of the year range, this still fits within 8 bits (2 hex characters)
The collision probability, within the time precision (i.e. per second or per 100th or 128th of a second) is driven by the range of the uuid so it will be 1 in 2^128 for the chosen precision. Increasing the precision of the time stamp has the most impact on reducing the collision chances. It is also the factor that has the lowest impact on total size of the key.
More efficient character encoding: 27 to 29 character keys
You could significantly reduce the size of the key by encoding it in base 64 instead of 16 which would give you 27 to 29 characters (depending on you choice of precision)
Note that, for the timestamp part, you need to use an encoding function that takes an integer as input and that preserves the collating sequence of digit characters.
For example:
def encode64(number, size):
chars = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
result = list()
for _ in range(size):
result.append(chars[number%64])
number //= 64
return "".join(reversed(result))
a = encode64(1234567890,6) # '-7ZU9G'
b = encode64(9876543210,6) # '7Ag-Pe'
print(a < b) # True
u = encode64(int(uuid.uuid4()),22) # '1QA2LtMg30ztnugxaokVMk'
key = a+u # '-7ZU9G1QA2LtMg30ztnugxaokVMk' (28 characters)
You can save some more characters by combining the time stamp and uuid into a single number before encoding instead of concatenating the two encoded values.
The encode64() function needs one character every 6 bits.
So, for 135 years with precision to the second: (32+128)/6 = 26.7 --> 27 characters
instead of (32/6 = 5.3 --> 6) + (128/6 = 21.3 --> 22) ==> 28 characters
uid = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
key = encode64( timeStamp<<128 | int(uid) ,27)
with a 270 year span and 128th of a second precision: (40+128)/6 = 28 characters
uid = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
precision = 128
timeStamp = timeStamp * precision + int(factionOfSecond * precision)
key = encode64( timeStamp<<128 | int(uid) ,28)
With 29 characters you can raise precision to 1024th of a second and year range to 2160 years.
UUID masking: 17 to 19 characters keys
To be even more efficient, you could strip out the first 64 bits of the uuid (which is already a time stamp) and combine it with your own time stamp. This would give you keys with a length of 17 to 19 characters with practically no loss of collision avoidance (depending on your choice of precision).
mask = (1<<64)-1
key = encode64( timeStamp<<64 | (int(uid) & mask) ,19)
Integer/Numeric keys ?
As a final note, if your database supports very large integers or numeric fields (140 bits or more) as keys, you don't have to convert the combined number to a string. Just use it directly as the key. The numerical sequence of timeStamp<<128 | int(uid) will respect the chronology.

The uuid6 module (pip install uuid6) solves the problem. It aims at implementing the corresponding draft for a new uuid variant standard, see here.
Example code:
import uuid6
for i in range(0, 30):
u = uuid6.uuid7()
print(u)
time.sleep(0.1)
The package suggests to use uuid6.uuid7():
Implementations SHOULD utilize UUID version 7 over UUID version 1 and
6 if possible.
UUID version 7 features a time-ordered value field derived from the
widely implemented and well known Unix Epoch timestamp source, the
number of milliseconds seconds since midnight 1 Jan 1970 UTC, leap
seconds excluded. As well as improved entropy characteristics over
versions 1 or 6.

Related

How to convert hours and minutes into decimal values in Python in a CSV file?

I have a CSV file having online Teams meeting data. It has two columns, one is "names" and the other with their duration of attendance during the meeting. I want to convert this information into a graph to quickly identify who remained for a long time in the meeting and who for less time. The duration column data is in this format, 3h 54m. This means it has the characters h and m in the column. See the picture below too.
Now how can I convert this data into decimal values like 3.54 or 234?
3.54 will mean hours and 234 will mean minutes in total. I am happy with any solution like either hour i.e. 3.54 or minutes 234.
Numpy or Pandas are both welcome.
The conversion of the timestring can be achieved with the datetime module:
from datetime import datetime
time_string = '3h 54m'
datetime_obj = datetime.strptime(time_string, '%Hh %Mm')
total_mins = (datetime_obj.hour * 60) + datetime_obj.minute
time_in_hours = total_mins / 60
# Outputs: `3.9 234`
print(time_in_hours, total_mins)
Here '%H' means the number of hours and '%M' is the number of minutes (both zero-padded). If this is encapsulated in a function, it can applied on the data from your spreadsheet that has been read-in via numpy or pandas.
REFERENCE:
https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
3h 54m is not 3.54. It's 3 + 54/60 = 3.9. Also since you have some items with seconds, it may be best to do all the conversions to seconds so you don't lose significant digits due to rounding if you need to add any of the items. For example, if you have 37 minutes, that's 0.61666667 hours. So using seconds, you get more precise results if you have to combine things.
The function below can handle h m and s and any combo of them. I provide examples at the bottom. I'm sure there are hundreds of ways to do this conversion. There are tons of examples on StackOverflow of reading and using CSV files so I didn't provide info. I like playing with the math stuff. :)
def hms_to_seconds(time_string):
timeparts = []
# split string. Assumes blank space(s) between each time componet
timeparts = time_string.split()
h = 0
m = 0
s = 0
# loop through the componets and get values
for part in timeparts:
if (part[-1] == "h"):
h = int( part.partition("h")[0] )
if (part[-1] == "m"):
m = int( part.partition("m")[0] )
if (part[-1] == "s"):
s = int( part.partition("s")[0] )
return (h*3600 + m*60 + s)
print(hms_to_seconds("3h 45m")) # 13500 sec
print(hms_to_seconds("1m 1s")) # 61 sec
print(hms_to_seconds("1h 1s")) # 3601 sec
print(hms_to_seconds("2h")) # 7200 sec
print(hms_to_seconds("10m")) # 600 sec
print(hms_to_seconds("33s")) # 33 sec
1. Read .csv
Reading the csv file into memory can be done using pd.read_csv()
2. Parse time string
Parser
Ideally you want to write a parser that can parse the time string for you. This can be done using ANTLR or by writing one yourself. See e.g. this blog post. This would be the most interesting solution and also the most challenging one. There might be a parser already out there which can handle this as the format is quite common, so you might wanna search for one.
RegEx
A quick and dirty solution however can be implemented using RegEx. The idea is to use this RegEx [0-9]*(?=h) for hours, [0-9]*(?=m) for minutes, [0-9]*(?=s) for seconds to extract the actual values before those identifiers and then calculate the total duration in seconds.
Please note: this is not an ideal solution and only works for "h", "m", "s" and there are edge cases that are not handled in this implementation but it does work in principle.
import re
duration_calcs = {
"h": 60 * 60, # convert hours to seconds => hours * 60 * 60
"m": 60, # convert minutes to seconds => minutes * 60
"s": 1, # convert seconds to seconds => seconds * 1
}
tests_cases = [
"3h 54m",
"4h 9m",
"12s",
"41m 1s"
]
tests_results = [
int(14040), # 3 * 60 * 60 + 54 * 60
14940, # 4 * 60 * 60 + 9 * 60
12, # 12 * 1
2461 # 41 * 60 + 1 * 1
]
def get_duration_component(identifier, text):
"""
Gets one component of the duration from the string and returns the duration in seconds.
:param identifier: either "h", "m" or "s"
:type identifier: str
:param text: text to extract the information from
:type text: str
:return duration in seconds
"""
# RegEx using positive lookahead to either "h", "m" or "s" to extract number before that identifier
regex = re.compile(f"[0-9]*(?={identifier})")
match = regex.search(text)
if not match:
return 0
return int(match.group()) * duration_calcs[identifier]
def get_duration(text):
"""
Get duration from text which contains duration like "4h 43m 12s".
Only "h", "m" and "s" supported.
:param text: text which contains a duration
:type text: str
:return: duration in seconds
"""
idents = ["h", "m", "s"]
total_duration = 0
for ident in idents:
total_duration += get_duration_component(ident, text)
return total_duration
def run_tests(tests, test_results):
"""
Run tests to verify that this quick and dirty implementation works at least for the given test cases.
:param tests: test cases to run
:type tests: list[str]
:param test_results: expected test results
:type test_results: list[int]
:return:
"""
for i, (test_inp, expected_output) in enumerate(zip(tests, test_results)):
result = get_duration(test_inp)
print(f"Test {i}: Input={test_inp}\tExpected Output={expected_output}\tResult: {result}", end="")
if result == expected_output:
print(f"\t[SUCCESS]")
else:
print(f"\t[FAILURE]")
run_tests(tests_cases, tests_results)
Expected result
Test 0: Input=3h 54m Expected Output=14040 Result: 14040 [SUCCESS]
Test 1: Input=4h 9m Expected Output=14940 Result: 14940 [SUCCESS]
Test 2: Input=12s Expected Output=12 Result: 12 [SUCCESS]
Test 3: Input=41m 1s Expected Output=2461 Result: 2461 [SUCCESS]
split()
Another (even simpler) solution could be to split() at space and use some if statements to determine whether a part is "h", "m" or "s" and then parse the preceding string to int and convert is as shown above. The idea is similar, so I did not write a program for this.

Convert large numpy arrays of BCD to decimal

I have binary data files in the multiple GB range that I am memory mapping with numpy. The start of each data packet contains a BCD timestamp. Where each hex number is coded into the time format of 0DDD:HH:MM:SS.ssss I need this timestamp turned into total seconds of the current year.
Example:
The the first time stamp 0x0261 1511 2604 6002 Would be: 261:15:11:26.046002 or
261*86400 + 15*3600 + 11*60 + 26.046002 = 22551986.046002
Currently I am doing this to compute the timestamps:
import numpy as np
rawData = np.memmap('dataFile.bin',dtype='u1',mode='r')
#findFrameStart returns the index to the start of each data packet [0,384,768,...]
fidx = findFrameStart(rawData)
# Do lots of bit shifting and multiplying and type casting....
day1 = ((rawData[fidx ]>>4)*10 + (rawData[fidx ]&0x0F)).astype('f8')
day2 = ((rawData[fidx+1]>>4)*10 + (rawData[fidx+1]&0x0F)).astype('f8')
hour = ((rawData[fidx+2]>>4)*10 + (rawData[fidx+2]&0x0F)).astype('f8')
mins = ((rawData[fidx+3]>>4)*10 + (rawData[fidx+3]&0x0F)).astype('f8')
sec1 = ((rawData[fidx+4]>>4)*10 + (rawData[fidx+4]&0x0F)).astype('f8')
sec2 = ((rawData[fidx+5]>>4)*10 + (rawData[fidx+5]&0x0F)).astype('f8')
sec3 = ((rawData[fidx+6]>>4)*10 + (rawData[fidx+6]&0x0F)).astype('f8')
sec4 = ((rawData[fidx+7]>>4)*10 + (rawData[fidx+7]&0x0F)).astype('f8')
time = (day1*100+day2)*86400 + hour*3600 + mins*60 + sec1 + sec2/100 + sec3/10000 + sec4/1000000
Note I had to cast each of the intermediate vars (day1, day2, etc.) to double to get the time to compute correctly.
Given that there are lots of frames, fidx can get kind of large (~10e6 elements or more). This results in lots of math operations, bit shifts, casting, etc. in my current method. So far it is working OK on a smaller test file (~180ms on a 150MB data file). However, I am worried about when I hit some larger data(4-5GB) there might be memory issues with all of the intermediate arrays.
So if possible I was looking for a different method that might shortcut some of the overhead. The BCD to decimal operations are similar for each byte so it seems I should maybe be able to iterate over something and maybe convert an array in place ... at least reducing the memory footprint.
Any help would be appreciated. FYI, I am using Python 3.7
I made the following adjustments to my code. This modifies the time array in place & removed the need for all of the intermediate arrays. I haven't timed the result but it should require less memory.
time = np.zeros(fidx.shape,dtype='f8')
scale = np.array([8640000, 86400, 3600, 60, 1, .01, .0001, .000001],dtype='f8')
for ii,sf in enumerate(scale):
time = time + ((rawData[fidx+ii]>>4)*10 + (rawData[fidx+ii]&0x0F))*sf

What is the fastest way to sort strings in Python if locale is a non-concern?

I was trying to find a fast way to sort strings in Python and the locale is a non-concern i.e. I just want to sort the array lexically according to the underlying bytes. This is perfect for something like radix sort. Here is my MWE
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N) :
things = [f%x for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = randChar("id%010d", N//K, N) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds
As you can see it took 6.8 seconds which is almost 10x slower than R's radix sort below.
N = 1e7
K = 100
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE)
system.time(sort(id3,method="radix"))
I understand that Python's .sort() doesn't use radix sort, is there an implementation somewhere that allows me to sort strings as performantly as R?
AFAIK both R and Python "intern" strings so any optimisations in R can also be done in Python.
The top google result for "radix sort strings python" is this gist which produced an error when sorting on my test array.
It is true that R interns all strings, meaning it has a "global character cache" which serves as a central dictionary of all strings ever used by your program. This has its advantages: the data takes less memory, and certain algorithms (such as radix sort) can take advantage of this structure to achieve higher speed. This is particularly true for the scenarios such as in your example, where the number of unique strings is small relative to the size of the vector. On the other hand it has its drawbacks too: the global character cache prevents multi-threaded write access to character data.
In Python, afaik, only string literals are interned. For example:
>>> 'abc' is 'abc'
True
>>> x = 'ab'
>>> (x + 'c') is 'abc'
False
In practice it means that, unless you've embedded data directly into the text of the program, nothing will be interned.
Now, for your original question: "what is the fastest way to sort strings in python"? You can achieve very good speeds, comparable with R, with python datatable package. Here's the benchmark that sorts N = 10⁸ strings, randomly selected from a set of 1024:
import datatable as dt
import pandas as pd
import random
from time import time
n = 10**8
src = ["%x" % random.getrandbits(10) for _ in range(n)]
f0 = dt.Frame(src)
p0 = pd.DataFrame(src)
f0.to_csv("test1e8.csv")
t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0))
t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0))
t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time()-t0))
Which produces:
datatable: 1.465s / 1.462s / 1.460s (multiple runs)
list.sort: 44.352s
pandas: 395.083s
The same dataset in R (v3.4.2):
> require(data.table)
> DT = fread("test1e8.csv")
> system.time(sort(DT$C1, method="radix"))
user system elapsed
6.238 0.585 6.832
> system.time(DT[order(C1)])
user system elapsed
4.275 0.457 4.738
> system.time(setkey(DT, C1)) # sort in-place
user system elapsed
3.020 0.577 3.600
Jeremy Mets posted in the comments of this blog post that Numpy can sort string fairly by converting the array to np.araray. This indeed improve performance, however it is still slower than Julia's implementation.
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N) :
things = [f%x for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = np.array(randChar("id%010d", N//K, N)) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds

How floats are stored in python

I have been using python for my assignments for past few days. I noticed one strange thing that -
When I convert string to float - it gives exactly same number of digits that were in string.
When I put this number in file using struct.pack() with 4 bytes floats and read it back using struct.unpack(), it gives a number not exactly same but some longer string which I expect if as per the floating point storage
Ex. - String - 0.931973
Float num - 0.931973
from file - 0.931972980499 (after struct pack and unpack into 4 bytes)
So I am unable to understand how python actually stored my number previously when I read it from string.
EDIT
Writing the float (I think in python 2.7 on ubuntu its other way around, d- double and f-float)
buf = struct.pack("f", float(self.dataArray[i]))
fout.write(buf)
Query -
buf = struct.pack("f", dataPoint)
dataPoint = struct.unpack("f", buf)[0]
node = root
while(node.isBPlusNodeLeaf()) == False:
node = node.findNextNode(dataPoint)
findNextNode -
def findNextNode(self, num):
i = 0
for d in self.dataArray:
if float(num) > float(d):
i = i + 1
continue
else:
break
ptr = self.pointerArray[i]
#open the node before passing on the pointer to it
out, tptr = self.isNodeAlive(ptr)
if out == False:
node = BPlusNode(name = ptr)
node.readBPlusNode(ptr)
return node
else:
return BPlusNode.allNodes[tptr]
once I reach to leaf it reads the leaf and check if the datapoint exist there.
for data in node.dataArray:
if data == dataPoint:
return True
return False
So in this case it returns unsuccessful search for datapoint - 0.931972980499 which is there although.
While following code works fine -
for data in node.dataArray:
if round(float(data), 6) == dataPoint:
return True
return False
I am not able to understand why this is happening
float in Python is actually what C programmers call double, i.e. it is 64 bits (or perhaps even wider on some platforms). So when you store it in 4 bytes (32 bits), you lose precision.
If you use the d format instead of f, you should see the results you expect:
>>> struct.unpack('d', struct.pack('d', float('0.931973')))
(0.931973,)

Hashing same character multiple times

I'm doing a programming challenge and I'm going crazy with one of the challenges. In the challenge, I need to compute the MD5 of a string. The string is given in the following form:
n[c]: Where n is a number and c is a character. For example: b3[a2[c]] => baccaccacc
Everything went ok until I was given the following string:
1[2[3[4[5[6[7[8[9[10[11[12[13[a]]]]]]]]]]]]]
This strings turns into a string with 6227020800 a's. This string is more than 6GB, so it's nearly impossible to compute it in practical time. So, here is my question:
Are there any properties of MD5 that I can use here?
I know that there has to be a form to make it in short time, and I suspect it has to be related to the fact that all the string has is the same character repeated multiple times.
You probably have created a (recursive) function to produce the result as a single value. Instead you should use a generator to produce the result as a stream of bytes. These you can then feed byte by byte into your MD5 hash routine. The size of the stream does not matter this way, it will just have an impact on the computation time, not on the memory used.
Here's an example using a single-pass parser:
import re, sys, md5
def p(s, pos, callBack):
while pos < len(s):
m = re.match(r'(d+)[', s[pos:])
if m: # repetition?
number = m.group(1)
for i in range(int(number)):
endPos = p(s, pos+len(number)+1, callBack)
pos = endPos
elif s[pos] == ']':
return pos + 1
else:
callBack(s[pos])
pos += 1
return pos + 1
digest = md5.new()
def feed(s):
digest.update(s)
sys.stdout.write(s)
sys.stdout.flush()
end = p(sys.argv[1], 0, feed)
print
print "MD5:", digest.hexdigest()
print "finished parsing input at pos", end
All hash functions are designed to work with byte streams, so you should not first generate the whole string, and after that hash it - you should write generator, which produces chunks of string data, and feed it to MD5 context.
And, MD5 uses 64-byte (or char) buffer so it would be a good idea to feed 64-byte chunks of data to the context.
Take advantage of the good properties of hashes:
import hashlib
cruncher = hashlib.md5()
chunk = 'a' * 100
for i in xrange(100000):
cruncher.update(chunk)
print cruncher.hexdigest()
Tweak the number of rounds (x = 10000) and the length of the chunk (y = 100) so that x * y = 13!. The point is that your are feeding the algorithm with chunks of your string (each one x characters long), one after the other, for y times.

Categories