generate alpha-numerically ordered UUIDs over time - python

Following up on my question: Unique Linux filename, sortable by time
I need to generate a UUID that is itself alpha-numerically sequential over time. I assume I'll need to prefix it with the system date (seconds since epoch plus nanoseconds). This means I really just need a UUID algorithm that is alpha-numerically sequential within a given nanosecond.
So for example, I'm thinking of uuid's something like:
SECONDS_SINCE_EPOCH.NANOSECONDS.UID
The following bash:
for i in `seq 1 10`;
do
echo `date '+%s.%N'`.`uuidgen -t`
done
Results in:
1424718695.481439000.c8fef5d4-bb8f-11e4-92c7-00215e673861
1424718695.484130000.c8ff5eb6-bb8f-11e4-ae12-00215e673861
1424718695.486718000.c8ffc2ca-bb8f-11e4-ae15-00215e673861
1424718695.489267000.c90025bc-bb8f-11e4-a624-00215e673861
1424718695.491803000.c90089f8-bb8f-11e4-95ac-00215e673861
1424718695.494381000.c900ed76-bb8f-11e4-9058-00215e673861
1424718695.496899000.c901513a-bb8f-11e4-8018-00215e673861
1424718695.499460000.c901b440-bb8f-11e4-b382-00215e673861
1424718695.502007000.c90217a0-bb8f-11e4-89cd-00215e673861
1424718695.504532000.c90279d4-bb8f-11e4-b515-00215e673861
These file names appear as though they would suffice... but my fear is that I can't promise the names will be alpha-numerically sequential IF two files are created within the same nanosecond (think large-scale enterprise system with tens of cores running many concurrent users). At that point I'm relying solely on the UUID algorithm for my unique name, and all the UUID algorithm promises is uniqueness, not "alpha-numeric-sequential-ness".
Any ideas for a method that can guarantee uniqueness AND alpha-numeric sequential order? Because we're dealing with large enterprise systems, I need to keep my requirements as old-school as possible but I can probably swing some older versions of Python and whatnot if a solution in pure bash isn't readily available.

Based on another answer, you could reorder the time portions of the UUID so that the most-significant value shows up first, on down to the least significant. This is the more "natural" way that, say, UNIX time is presented and produces the sort order that you are looking for.
So the following bash should do the trick in your case:
for i in `seq 1 10`; do
  # cut(1) cannot reorder fields, so use awk to put the most-significant time field first
  echo $(date '+%s.%N').$(uuidgen -t | awk -F- '{OFS="-"; print $3,$2,$1,$4,$5}')
done
Bear in mind that there are no guarantees. Given enough tries and enough time, a collision will occur. If at all possible, you may want to do some sanity checking further down the process chain that can correct any such mistakes before the data gets entered into a permanent record.
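If some Python is acceptable, a minimal sketch of the same idea is below: a zero-padded nanosecond timestamp plus a per-process counter, so names created within the same nanosecond by one process still sort in creation order. The make_sortable_id helper and the field widths are illustrative, and time.time_ns needs Python 3.7+ (older versions would have to fall back to time.time()).

import itertools
import os
import time
import uuid

# Process-local counter; breaks ties for IDs created in the same nanosecond.
_counter = itertools.count()

def make_sortable_id():
    """Return a string that sorts lexicographically by creation time (hypothetical helper)."""
    ns = time.time_ns()              # Python 3.7+; use int(time.time() * 1e9) on older versions
    seq = next(_counter) % 10**6     # per-process tie-breaker, zero-padded below
    return "{:020d}.{:06d}.{:05d}.{}".format(ns, seq, os.getpid() % 10**5, uuid.uuid4())

if __name__ == "__main__":
    for _ in range(5):
        print(make_sortable_id())

Zero-padding matters here: without it, lexicographic order and numeric order diverge once the numbers grow another digit.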


Python Shared Memory Dictionary for Mapping Big Data

I've been having a hard time using a large dictionary (~86GB, 1.75 billion keys) to process a big dataset (2TB) using multiprocessing in Python.
Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded, worker processes (ideally >32) are created that must look up values in the dictionary but not modify its contents, in order to process the ~2TB dataset. The dataset needs to be processed in parallel, otherwise the task would take over a month.
Here are the nine approaches (all failing) that I have tried:
1. Store the dictionary as a global variable in the Python program and then fork the ~32 worker processes. Theoretically this method might work, since the dictionary is not being modified and therefore the COW mechanism of fork on Linux would mean that the data structure would be shared and not copied among processes. However, when I attempt this, my program crashes on os.fork() inside of multiprocessing.Pool.map with OSError: [Errno 12] Cannot allocate memory. I'm convinced that this is because the kernel is configured to never overcommit memory (/proc/sys/vm/overcommit_memory is set to 2, and I can't configure this setting on the machine since I don't have root access).
2. Load the dictionary into a shared-memory dictionary with multiprocessing.Manager.dict. With this approach I was able to fork the 32 worker processes without crashing, but the subsequent data processing is orders of magnitude slower than another version of the task that required no dictionary (the only difference being no dictionary lookup). I theorize that this is because of the inter-process communication between the manager process holding the dictionary and each worker process, which is required for every single dictionary lookup. Although the dictionary is not being modified, it is being accessed many, many times, often simultaneously by many processes.
3. Copy the dictionary into a C++ std::map and rely on Linux's COW mechanism to prevent it from being copied (like approach #1 except with the dictionary in C++). With this approach, it took a long time to load the dictionary into std::map, and it subsequently crashed with ENOMEM on os.fork() just as before.
4. Copy the dictionary into pyshmht. It takes far too long to copy the dictionary into pyshmht.
5. Try using SNAP's HashTable. The underlying implementation in C++ allows it to be made and used in shared memory. Unfortunately the Python API does not offer this functionality.
6. Use PyPy. The crash still happened as in #1.
7. Implement my own shared-memory hash table in Python on top of multiprocessing.Array. This approach still resulted in the out-of-memory error that occurred in #1.
8. Dump the dictionary into dbm. After trying to dump the dictionary into a dbm database for four days and seeing an ETA of "33 days", I gave up on this approach.
9. Dump the dictionary into Redis. When I try to dump the dictionaries (the 86GB dict is loaded from 1024 smaller dicts) into Redis using redis.mset, I get a connection-reset-by-peer error. When I try to dump the key-value pairs using a loop, it takes an extremely long time.
How can I process this dataset in parallel efficiently without requiring inter-process communication for every lookup into this dictionary? I would welcome any suggestions for solving this problem!
I'm using Python 3.6.3 from Anaconda on Ubuntu on a machine with 1TB RAM.
Edit: What finally worked:
I was able to get this to work using Redis. To get around the issue in #9, I had to chunk the large key-value insertion and lookup queries into "bite-sized" chunks, so that it was still processing in batches but didn't time out from too large a query. Doing this allowed the insertion of the 86GB dictionary to be performed in 45 minutes (with 128 threads and some load balancing), and the subsequent processing was not hampered in performance by the Redis lookup queries (finished in 2 days).
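As a rough sketch of that chunking (assuming the redis-py client on localhost; the batch size of 10000 is an arbitrary illustration, and big_dict stands in for the real mapping rather than the exact code used):

import redis

def chunked(items, size):
    """Yield lists of at most `size` (key, value) pairs."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

r = redis.Redis(host='localhost', port=6379, db=0)
big_dict = {'k%d' % i: 'v%d' % i for i in range(100)}   # stand-in for the 86GB mapping

# Insert in bite-sized batches so no single MSET is large enough to reset the connection.
for batch in chunked(big_dict.items(), 10000):
    r.mset(dict(batch))

# Lookups can be batched the same way with MGET.
values = r.mget(['k1', 'k2', 'k3'])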
Thank you all for your help and suggestions.
You should probably use a system that's meant for sharing large amounts of data with many different processes -- like a Database.
Take your giant dataset and create a schema for it and dump it into a database. You could even put it on a separate machine.
Then launch as many processes as you want, across as many hosts as you want, to process the data in parallel. Pretty much any modern database will be more than capable of handling the load.
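If a full client/server database is more than you want to set up, even the standard-library sqlite3 can play this role locally. A sketch only; the file name, column names, and big_dict placeholder are illustrative:

import sqlite3

big_dict = {'foo': 'bar', 'key': 'value'}   # stand-in for the real mapping

# Build the lookup table once; the PRIMARY KEY gives you an index for free.
con = sqlite3.connect('mapping.db')
con.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
con.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)', big_dict.items())
con.commit()
con.close()

def lookup(con, key):
    row = con.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    return row[0] if row else None

# Each worker process opens its own read-only connection and does indexed lookups.
worker_con = sqlite3.connect('file:mapping.db?mode=ro', uri=True)
print(lookup(worker_con, 'foo'))   # 'bar'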
Instead of using a dictionary, use a data structure that compresses data, but still has fast lookups.
e.g.:
keyvi: https://github.com/cliqz-oss/keyvi
keyvi is an FSA-based key-value data structure optimized for space & lookup speed. Multiple processes reading from keyvi will reuse the memory, because a keyvi structure is memory-mapped and it uses shared memory. Since your worker processes don't need to modify the data structure, I think this would be your best bet.
marisa trie: https://github.com/pytries/marisa-trie static trie structure for Python, based on the marisa-trie C++ library. Like keyvi, marisa-trie also uses memory-mapping. Multiple processes using the same trie will use the same memory.
EDIT:
To use keyvi for this task, you can first install it with pip install pykeyvi. Then use it like this:
from pykeyvi import StringDictionaryCompiler, Dictionary

# Create the dictionary
compiler = StringDictionaryCompiler()
compiler.Add('foo', 'bar')
compiler.Add('key', 'value')
compiler.Compile()
compiler.WriteToFile('test.keyvi')

# Use the dictionary
dct = Dictionary('test.keyvi')
dct['foo'].GetValue()  # > 'bar'
dct['key'].GetValue()  # > 'value'
marisa trie is just a trie, so it wouldn't work as a mapping out of the box, but you can, for example, use a delimiter char to separate keys from values.
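A tiny sketch of the marisa-trie route, assuming its BytesTrie variant (which stores bytes payloads per key and so avoids the delimiter trick); the file name is illustrative:

import marisa_trie

# Build the trie once from (key, value) pairs; values are stored as bytes.
pairs = [(u'foo', b'bar'), (u'key', b'value')]   # stand-in for the real mapping
trie = marisa_trie.BytesTrie(pairs)
trie.save('mapping.marisa')

# Each worker memory-maps the same file, so the OS shares the pages between processes.
shared = marisa_trie.BytesTrie()
shared.mmap('mapping.marisa')
print(shared[u'foo'])   # [b'bar'] -- BytesTrie allows multiple values per key, hence the list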
If you can successfully load that data into a single process in point 1, you can most likely work around the problem of fork doing copies by using gc.freeze introduced in https://bugs.python.org/issue31558
You have to use Python 3.7+ and call that function before you fork (or before you do the map over the process pool).
Since this requires a virtual copy of the whole memory for the CoW to work, you need to make sure your overcommit settings allow you to do that.
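A minimal sketch of that ordering, assuming the dictionary can be loaded once in the parent; load_big_dict and extract_key are hypothetical helpers, and the pool size and file name are illustrative:

import gc
import multiprocessing as mp

big_dict = load_big_dict()        # hypothetical loader; fills the parent's memory once

def worker(line):
    # Read-only lookups against the inherited, copy-on-write dictionary.
    return big_dict.get(extract_key(line))   # extract_key is also hypothetical

if __name__ == "__main__":
    # Python 3.7+: move all current objects to a permanent generation so the GC
    # never touches (and dirties) their pages after the fork.
    gc.freeze()
    with mp.Pool(processes=32) as pool:
        results = pool.map(worker, open("dataset.txt"))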
As most people here already mentioned:
Don't use that big a dictionary; dump it into a database instead!
After dumping your data into a database, using indexes will help reduce data retrieval times.
A good indexing explanation for PostgreSQL databases can be found here.
You can optimize your database even further (I give a PostgreSQL example because that is what I mostly use, but those concepts apply to almost every database).
Assuming you did the above (or if you want to use the dictionary either way...), you can implement a parallel and asynchronous processing routine using Python's asyncio (needs Python version >= 3.4).
The basic idea is to create a mapping method that assigns (maps) an asynchronous task to each item of an iterable and registers each task with asyncio's event_loop.
Finally, we will collect all those promises with asyncio.gather and we will wait to receive all the results.
A skeleton code example of this idea:
import asyncio

async def my_processing(value):
    # do stuff with the value...
    processed_value = value
    return processed_value

def my_async_map(my_coroutine, my_iterable):
    my_loop = asyncio.get_event_loop()
    my_future = asyncio.gather(
        *(my_coroutine(val) for val in my_iterable)
    )
    return my_loop.run_until_complete(my_future)

my_async_map(my_processing, my_ginormous_iterable)
You can use gevent instead of asyncio, but keep in mind that asyncio is part of the standard library.
Gevent implementation:
import gevent
from gevent.pool import Group

def my_processing(value):
    # do stuff with the value...
    processed_value = value
    return processed_value

def my_async_map(my_coroutine, my_iterable):
    my_group = Group()
    return my_group.map(my_coroutine, my_iterable)

my_async_map(my_processing, my_ginormous_iterable)
The already mentioned keyvi (http://keyvi.org) sounds like the best option to me, because "python shared memory dictionary" describes exactly what it is. I am the author of keyvi, call me biased, but give me the chance to explain:
Shared memory makes it scalable, especially for Python, where the GIL forces you to use multiprocessing rather than threading. That's why a heap-based in-process solution wouldn't scale. Shared memory can also be bigger than main memory; parts can be swapped in and out.
External, network-based solutions require an extra network hop, which you can avoid by using keyvi; this makes a big performance difference even on the local machine. The question is also whether the external process is single-threaded and therefore introduces a bottleneck again.
I wonder about your dictionary size: 86GB. There is a good chance that keyvi compresses that nicely, but it's hard to say without knowing the data.
As for processing: note that keyvi works nicely in pySpark/Hadoop.
Your use case, by the way, is exactly what keyvi is used for in production, even at a higher scale.
The Redis solution sounds good, at least better than a database solution. To saturate the cores you should use several instances and divide the key space using consistent hashing. Still, I am sure keyvi would scale much better. You should try it if you have to repeat the task and/or need to process more data.
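One way that key-space split could look (a sketch of a simple consistent-hash ring over several Redis instances; the hosts, ports, and virtual-node count are made up):

import bisect
import hashlib
import redis

class RedisRing:
    """Tiny consistent-hash ring: each key is owned by the first ring point clockwise from its hash."""

    def __init__(self, nodes, vnodes=64):
        self.ring = []          # list of (hash, client) points
        for host, port in nodes:
            client = redis.Redis(host=host, port=port)
            for i in range(vnodes):
                h = self._hash("%s:%s:%d" % (host, port, i))
                self.ring.append((h, client))
        self.ring.sort(key=lambda pair: pair[0])
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def client_for(self, key):
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = RedisRing([("localhost", 6379), ("localhost", 6380)])
ring.client_for("some-key").get("some-key")

The virtual nodes mean that adding or removing an instance only remaps a small fraction of the keys, which is the point of consistent hashing over plain modulo sharding.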
Last but not least, you will find nice material on the website, explaining the above in more detail.
Maybe you should try to do it in a database, or try using Dask to solve your problem; let Dask take care of the low-level multiprocessing so you can focus on the main question you want to solve with that large data.
Here is the link you may want to look at: Dask
Well, I do believe that Redis or a database would be the easiest and quickest fix.
But from what I understood, why not reduce the problem from your second solution? That is, first try to load a portion of the billion keys into memory (say 50 million). Then, using multiprocessing, create a pool to work on the 2 TB file. If the lookup for a line exists in the table, push the data to a list of processed lines; if it doesn't, push it to a separate list. Once you finish reading the dataset, pickle your list and flush the keys you have stored from memory. Then load the next 50 million and repeat the process, this time reading from your list instead of the original file. Once it is finished completely, read all your pickle objects.
This should handle the speed issue that you were facing. Of course, I have very little knowledge of your data set and do not know if this is even feasible. Of course, you might be left with lines that did not get a proper dictionary key read, but at this point your data size would be significantly reduced.
Don't know if that is of any help.
Another solution could be to use some existing database driver which can allocate / retire pages as necessary and deal with the index lookup quickly.
dbm has a nice dictionary interface available and, with automatic caching of pages, may be fast enough for your needs. If nothing is modified, you should be able to effectively cache the whole file at the VFS level.
Just remember to disable locking, open in unsynchronized mode, and open with 'r' only so nothing impacts caching or concurrent access.
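With the GNU dbm backend those options look roughly like this (a sketch; the 'u' flag that skips locking is gdbm-specific, so check what your platform's dbm module supports):

import dbm.gnu   # Python 3; the gdbm backend exposes the extra open flags

# 'r' = read-only, 'u' = do not lock the database file (gdbm-specific),
# so many worker processes can read the same file concurrently.
db = dbm.gnu.open("mapping.gdbm", "ru")

value = db[b"some-key"]    # keys and values are bytes
db.close()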
Since you're only looking to create a read-only dictionary it is possible that you can get better speed than some off the shelf databases by rolling your own simple version. Perhaps you could try something like:
import os.path
import functools

db_dir = '/path/to/my/dbdir'

def write(key, value):
    path = os.path.join(db_dir, key)
    with open(path, 'w') as f:
        f.write(value)

@functools.lru_cache(maxsize=None)
def read(key):
    path = os.path.join(db_dir, key)
    with open(path) as f:
        return f.read()
This will create a folder full of text files. The name of each file is the dictionary key and the contents are the value. Timing this myself I get about 300us per write (using a local SSD). Using those numbers theoretically the time taken to write your 1.75 billion keys would be about a week but this is easily parallelisable so you might be able to get it done a lot faster.
For reading I get about 150us per read with warm cache and 5ms cold cache (I mean the OS file cache here). If your access pattern is repetitive you could memoize your read function in process with lru_cache as above.
You may find that storing this many files in one directory is not possible with your filesystem, or that it is inefficient for the OS. In that case you can do what the .git/objects folder does: store the key abcd in a file called ab/cd (i.e. in a file cd inside a folder ab).
The above would take something like 15TB on disk based on a 4KB block size. You could make it more efficient on disk and for OS caching by trying to group together keys by the first n letters so that each file is closer to the 4KB block size. The way this would work is that you have a file called abc which stores key value pairs for all keys that begin with abc. You could create this more efficiently if you first output each of your smaller dictionaries into a sorted key/value file and then mergesort as you write them into the database so that you write each file one at a time (rather than repeatedly opening and appending).
While the majority suggestion of "use a database" here is wise and proven, it sounds like you may want to avoid using a database for some reason (and you are finding the load into the db to be prohibitive), so essentially it seems you are IO-bound, and/or processor-bound. You mention that you are loading the 86GB index from 1024 smaller indexes. If your key is reasonably regular, and evenly-distributed, is it possible for you to go back to your 1024 smaller indexes and partition your dictionary? In other words, if, for example, your keys are all 20 characters long, and comprised of the letters a-z, create 26 smaller dictionaries, one for all keys beginning with 'a', one for keys beginning 'b' and so on. You could extend this concept to a large number of smaller dictionaries dedicated to the first 2 characters or more. So, for example, you could load one dictionary for the keys beginning 'aa', one for keys beginning 'ab' and so on, so you would have 676 individual dictionaries. The same logic would apply for a partition over the first 3 characters, using 17,576 smaller dictionaries. Essentially I guess what I'm saying here is "don't load your 86GB dictionary in the first place". Instead use a strategy that naturally distributes your data and/or load.
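A rough sketch of that partitioning idea (the prefix width, file names, and the candidate_keys iterable are all hypothetical; the point is that only one slice of the mapping is ever resident):

import pickle

def prefix_of(key, width=2):
    # Partition the key space on the first `width` characters (illustrative choice).
    return key[:width]

def load_partition(prefix):
    # One pickled dict per prefix, e.g. 'partition_aa.pkl'; the file layout is hypothetical.
    with open('partition_%s.pkl' % prefix, 'rb') as f:
        return pickle.load(f)

# candidate_keys is a hypothetical iterable of the keys your 2TB pass will ask about.
# Process one prefix at a time so only that slice of the mapping is ever in memory.
for prefix in sorted({prefix_of(k) for k in candidate_keys}):
    mapping = load_partition(prefix)
    for key in (k for k in candidate_keys if prefix_of(k) == prefix):
        value = mapping.get(key)
        # ... do the real processing with `value` here ...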

Equivalent to python's -R option that affects the hash of ints

We have a large collection of python code that takes some input and produces some output.
We would like to guarantee that, given the identical input, we produce identical output regardless of python version or local environment. (e.g. whether the code is run on Windows, Mac, or Linux, in 32-bit or 64-bit)
We have been enforcing this in an automated test suite by running our program both with and without the -R option to python and comparing the output, assuming that would shake out any spots where our output accidentally wound up dependent on iteration over a dict. (The most common source of non-determinism in our code)
However, as we recently adjusted our code to also support python 3, we discovered a place where our output depended in part on iteration over a dict that used ints as keys. This iteration order changed in python3 as compared to python2, and was making our output different. Our existing tests (all on python 2.7) didn't notice this. (Because -R doesn't affect the hash of ints) Once found, it was easy to fix, but we would like to have found it earlier.
Is there any way to further stress-test our code and give us confidence that we've ferreted out all places where we end up implicitly depending on something that will possibly be different across python versions/environments? I think that something like -R or PYTHONHASHSEED that applied to numbers as well as to str, bytes, and datetime objects could work, but I'm open to other approaches. I would however like our automated test machine to need only a single python version installed, if possible.
Another acceptable alternative would be some way to run our code with pypy tweaked so as to use a different order when iterating items out of a dict; I think our code runs on pypy, though it's not something we've ever explicitly supported. However, if some pypy expert gives us a way to tweak dictionary iteration order on different runs, it's something we'll work towards.
Using PyPy is not the best choice here, given that it always retains the insertion order in its dicts (with a method that makes dicts use less memory). We can of course make it change the order dicts are enumerated in, but that defeats the point.
Instead, I'd suggest to hack at the CPython source code to change the way the hash is used inside dictobject.c. For example, after each hash = PyObject_Hash(key); if (hash == -1) { ..error.. }; you could add hash ^= HASH_TWEAK; and compile different versions of CPython with different values for HASH_TWEAK. (I did such a thing at one point, but I can't find it any more. You need to be a bit careful about where the hash values are the original ones or the modified ones.)

Unable to store huge strings in-memory

I have data in the following form:
## De
A B C.
## dabc
xyz def ghi.
## <MyName_1>
Here is example.
## Df
A B C.
## <MyName_2>
De another one.
## <MyName_3>
Df next one.
## dabc1
xyz def ghi.
## <MyName_4>
dabc this one.
Convert it into the following form:
A B#1 C. //step 1 -- 1 assigned to the first occurrence of A B C.
xyz def#1 ghi. //1 assigned to first occurrence of xyz def ghi
Here is example
A B#2 C. //step 1 -- 2 assigned in increasing order
B#1 another one. //step 2
B#2 next one.
xyz def ghi.
def#1 this one.
// Here, '//' marks comments; they are not part of the output.
The algorithm is the following.
1. If the line following ## gets repeated, then append #number to the middle word, where number is a numeric identifier assigned in increasing order of repetition of that line.
2. Replace ##... with word#number wherever it occurs.
3. Remove all ## markers whose following line does not get repeated.
In order to achieve this I am storing all the triples and then finding their occurrences in order to assign numbers in increasing order. Is there some other way to achieve this in Python? My file is actually 500GB, and it is not possible to store all the triples in memory in order to find their occurrences.
If you need something that's like a dict, but too big to hold in memory, what you need is a key-value database.
The simplest way to do this is with a dbm-type library, which is a very simple key-value database with almost exactly the same interface as a dict, except that it only allows strings for keys and values, and has some extra methods to control persistence and caching and the like. Depending on your platform and how your Python 2.7 was built, you may have any of:
dbm
gdbm
dumbdbm
dbhash
bsddb
bsddb185
bsddb3
PyBSDDB
The last three are all available on PyPI if your Python install doesn't include them, as long as you have the relevant version of libbsddb itself and don't have any problems with its license.
The problem is that, depending on your platform, the various underlying database libraries may not exist (although of course you can download the C library, install it, then build and install the Python wrapper), or may not support databases this big, or may do so but only in a horribly inefficient way (or, in a few cases, in a buggy way…).
Hopefully one of them will work for you, but the only way you'll really know is to test all of the ones you have.
Of course, if I understand properly, you're mapping strings to integers, not to strings. You could use the shelve module, which wraps any dbm-like library to allow you to use string keys but anything picklable as values… but that's huge overkill (and may kill your performance) for a case like this; you just need to change code like this:
counts.setdefault(key, 0)
counts[key] += 1
… into this:
counts.setdefault(key, '0')
counts[key] = str(int(counts[key]) + 1)
And of course you can easily write a wrapper class that does this for you (maybe even one that supports the Counter interface instead of the dict interface).
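Such a wrapper might look something like this (a sketch; IntValueDB is a made-up name, and it uses only plain item access so it should work with any of the dbm variants above):

class IntValueDB(object):
    """Wrap a dbm-style mapping (string keys/values) to expose integer counts."""

    def __init__(self, db):
        self.db = db            # e.g. the object returned by anydbm.open(...)

    def increment(self, key):
        try:
            self.db[key] = str(int(self.db[key]) + 1)
        except KeyError:
            self.db[key] = '1'

    def count(self, key):
        try:
            return int(self.db[key])
        except KeyError:
            return 0

Usage would be roughly counts = IntValueDB(anydbm.open('counts.db', 'c')), after which counts.increment(key) replaces the two-line pattern above.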
If that doesn't work, you need a more powerful database.
Most builds of Python come with sqlite3 in the stdlib, but using it will require learning a pretty low-level API, and learning SQL, which is a whole different language that's very unlike Python. (There are also a variety of different relational databases out there, but you shouldn't need any of them.)
There are also a variety of query expression libraries and even full object-relational mappers, like SQLAlchemy (which can be used either way) that let you write your queries in a much more Pythonic way, but it's still not going to be as simple as using a dict or dbm. (That being said, it's not that hard to wrap a dbm-like interface around SQLAlchemy.)
There are also a wide variety of non-relational or semi-relational databases that are generally lumped under the term NoSQL, the simplest of which are basically dbm on steroids. Again, they'll usually require learning a pretty low-level API, and sometimes a query language as well—but some of them will have nice Python libraries that make them easier to use.

Displaying a continuous stream of integers on a web page

I want to make a web page that generates a uniform random number between 0 and 99 every 10 seconds, and displays a list of the 100 most recent numbers (which are the same for everyone visiting the site). It should update live.
My design is the following:
A long-running Python process (e.g. using supervisord) that runs in an eternal loop, generating numbers at 10-second intervals, and writing the numbers to a file or SQL database, and pruning the old numbers since they are no longer needed.
Then the web server process simply reads the file and displays to the user (either on initial load, or from an Ajax call to get the most recent numbers)
I don't feel great about this solution. It's pretty heavy on file system I/O, which is not really a bottleneck or anything, but I just wonder if there's a smarter way that is still simple. If I could store the list as an in-memory data structure shared between processes, I could have one process push and pop values every 10 seconds, and then the web server processes could just read that data structure. I read a bit about Unix domain sockets, but it wasn't clear that this was a great fit for my problem.
Is there a more efficient approach that is still simple?
EDIT: the approach suggested by Martijn Pieters in his answer (don't generate anything until someone visits) is sensible and I am considering it too, since the website doesn't get very heavy traffic. The problem I see is with race conditions, since you then have multiple processes trying to write to the same file/DB. If the values in the file/DB are stale, we need to generate new ones, but one process might read the old values before another process has had the chance to update them. File locking as described in this question is a possibility, but many people in the answers warn about having multiple processes write to the same file.
You are overcomplicating things.
Don't generate any numbers until you have an actual request. Then see how old your last number is, generate enough numbers to cover the intervening time period, update your tables, return the result.
There is no actual need to generate a random number every 10 seconds here. You only need to produce the illusion that the numbers have been generated every 10 seconds; that more than suffices for your use case.
A good database will handle concurrent access for you, and most will also let you set exclusive locks. Grab a lock when you need to update the numbers. Fail to grab the lock? Something else is already updating the numbers.
Pre-generate numbers; nothing says you actually have to only generate the numbers for the past time slot. Randomize what requests pre-generate to minimize lock contention. Append the numbers to the end of the pool, so that if you accidentally run this twice all you get is double the extra random numbers, so you can wait twice as long before you need to generate more.
Most of all, generating a sequence of random numbers is cheap, so doing this during any request is hardly going to slow down your responses.
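A small sketch of that lazy scheme, assuming the numbers and the timestamp of the last one live in some shared store; get_state and save_state are hypothetical stand-ins for your file or database access (done under the lock described above):

import random
import time

INTERVAL = 10          # seconds between "generated" numbers
HISTORY = 100          # how many recent numbers to keep

def numbers_for_request():
    last_ts, numbers = get_state()            # hypothetical: read from file/DB under a lock
    now = time.time()
    missing = int((now - last_ts) // INTERVAL)
    if missing:
        # Catch up: pretend the numbers were generated every 10 seconds all along.
        numbers.extend(random.randrange(100) for _ in range(missing))
        numbers = numbers[-HISTORY:]
        save_state(last_ts + missing * INTERVAL, numbers)   # hypothetical write-back
    return numbers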
I would pre-generate a lot of numbers (say, enough numbers for 1 week; do the math) and store them. That way, the Ajax calls would only load the next number on the list. When you are running out of numbers, pre-generate again. The process of generating and writing into the DB would only be executed once in a while (e.g. once a week).
EDIT: For a full week, you would need 60480 numbers at most. Using what Martijn Pieters recommends (only reading a new number when a visitor actually asks for one), and depending on your specific needs (as you may still need to burn the numbers even if nobody is seeing them), those numbers may last much longer than a week.

Counting number of symbols in Python script

I have a Telit module which runs Python 1.5.2+ (http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16k) and the amount of RAM (~1MB). Refer to pages 113-114 for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and memory usage (stack and heap usage).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object, [default]) provided that it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even then, this doesn't always work and will raise a TypeError when an object's size cannot be determined, if you don't specify the default parameter.
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
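In the same spirit, a very small sketch of counting symbols via reflection. It is written to run on old Python 2-era interpreters as well as new ones; whether sys.getsizeof exists on the Telit build is doubtful, so the size column is optional.

import sys

def dump_symbols(module):
    # List every top-level name a module defines; on interpreters that provide
    # sys.getsizeof (CPython 2.6+), also report an approximate size in bytes.
    names = dir(module)
    print("%d symbols in %s" % (len(names), module.__name__))
    for name in names:
        obj = getattr(module, name)
        if hasattr(sys, 'getsizeof'):
            print("%-30s %6d" % (name, sys.getsizeof(obj, 0)))
        else:
            print(name)

dump_symbols(sys)      # pass your own module here instead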
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post makes me recall my pain with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this fact back then. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer, because the Python interpreter is not a full-fledged version. What I did was to reuse the same local variable names as much as possible. I also deleted the docstrings for functions (those count too) and replaced them with # comments.
In the end, I want to say that this module is good for small applications. The python interpreter does not support threads or interrupts so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.
