Data structure to manipulate long strings of bits

Data structure to manipulate long strings of bits - python

In Python, what is the best data structure of n bits (where n is about 10000) on which performing the usual binary operations (&, |, ^) with other such data structures is fast?

"Fast" is always relative :)
The BitVector package seems to do what you need. I have no experience concerning performance with it though.
There is also a BitString implementation. Perhaps you do some measurements to find out, which one is more performant for your specific needs?
If you don't want a specific class and do not need things such as slicing or bit counting, you might be ok simply using python's long values which are arbitrary length integers. This might be the most performant implementation.
This qestion seems to be similar, although the author need fewer bits and requires a standard library.

In addition to those mentioned by MartinStettner, there's also the bitarray module, which I've used on multiple occations with great results.
PS: My 100th answer, wohooo!

Related

Most efficient string similarity metric function

I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).
I want to compare strings with an average of 10kb in size and I can't take any shortcuts like comparing line-by-line, I need to compare the entire thing. I don't really care, what exact metric will be used, as long as the results are reasonable and computation is fast. Here's what I've tried so far:
difflib.SequenceMatcher from the standard lib. ratio() gives good results, but takes >100ms for 10kb text. quick_ratio() takes only half the time, but the results are sometimes far of the real value.
python-Levenshtein: levenshtein is an acceptable metric for my use case, but Levenshtein.ratio('foo', 'bar') is not faster than the SequenceMatcher.
Before I start benchmarking every lib on pypi that provides functions for measuring string similarity, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10ms (on commodity hardware), if possible.

edlib seems to be fast enough for my use case.
It's a C++ lib with Python bindings that calculates the Levehnstein distance for texts <100kb in less than 10ms each (on my machine). 10kb texts are done in ~1ms, which is 100x faster than difflib.SequenceMatcher.

I've had some luck with RapidFuzz, I don't know how it compares to the others but it was much faster than thefuzz/fuzzywuzzy.
Don't know if it's applicable for your use-case, but this one of the first things you find when you google fast string similarity python

Based on a lot of reading up I've been able to do, something like tfidf_matcher worked well for me. Returns the best k matches. Also, it's easily 1000x faster than Fuzzywuzzy.

Python complexity reference?

Is there any Python complexity reference? In cppreference, for example, for many functions (such as std::array::size or std::array::fill) there's a complexity section which describes their running complexity, in terms of linear in the size of the container or constant.
I would expect the same information to appear in the python website, perhaps, at least for the CPython implementation. For example, in the list reference, in list.insert I would expect to see complexity: linear; I know this case (and many other container-related operations) is covered here, but many other cases are not. Here are a few examples:
What is the complexity of tuple.__le__? It seems like when comparing two tuples of size n, k, the complexity is about O(min(n,k)) (however, for small n's it looks different).
What is the complexity of random.shuffle? It appears to be O(n). It also appears that the complexity of random.randint is O(1).
What is the complexity of the __format__ method of strings? It appears to be linear in the size of the input string; however, it also grows when the number of relevant arguments grow (compare ("{0}"*100000).format(*(("abc",)*100000)) with ("{}"*100000).format(*(("abc",)*100000))).
I'm aware that (a) each of these questions may be answered by itself, (b) one may look at the code of these modules (even though some are written in C), and (c) StackExchange is not a python mailing list for user requests. So: this is not a doc-feature request, just a question of two parts:
Do you know if such a resource exists?
If not, do you know what is the place to ask for such, or can you suggest why I don't need such?

CPython is pretty good about its algorithms, and the time complexity of an operation is usually just the best you would expect of a good standard library.
For example:
Tuple ordering has to be O(min(n,m)), because it works by comparing element-wise.
random.shuffle is O(n), because that's the complexity of the modern Fisher–Yates shuffle.
.format I imagine is linear, since it only requires one scan through the template string. As for the difference you see, CPython might just be clever enough to cache the same format code used twice.
The docs do mention time complexity, but generally only when it's not what you would expect — for example, because a deque is implemented with a doubly-linked list, it's explicitly mentioned as having O(n) for indexing in the middle.
Would the docs benefit from having time complexity called out everywhere it's appropriate? I'm not sure. The docs generally present builtins by what they should be used for and have implementations optimized for those use cases. Emphasizing time complexity seems like it would either be useless noise or encourage developers to second-guess the Python implementation itself.

Storing arbitrary precision integers

I am writing a little arbitrary precision in c(I know such libraries exist, such as gmp, nut I find it more fun to write it, just to exercise myself), and I would like to know if arrays are the best way to represent very long integers, or if there is a better solution(maybe linked chains)? And secondly how does python work to have big integers?(does it use arrays or another technique ?)
Thanks in advance for any response

Try reading documentation about libgmp, it already implements bigints. From what I see, it seems like integers are implemented as a dynamically allocated which is realloc'd when the number needs to grow. See http://gmplib.org/manual/Integer-Internals.html#Integer-Internals.

Python long integers are implemented as a simple structure with just an object header and an array of integers (which might be either 16- or 32-bit ints, depending on platform).
Code is at http://hg.python.org/cpython/file/f8942b8e6774/Include/longintrepr.h

Best way to store boolean values to save memory in python

What is the best way to store between a million to 450,000 Boolean values in a dictionary like collection indexed by a long number? I need to use the least amount of memory possible. True and Int both take up more than 22 bytes per entry. Is there a lower memory per Boolean possible?

Check this question. Bitarray seems to be the preferred choice.

The two main modules for this are bitarray and bitstring (I wrote the latter). Each will do what you need, but some plus and minus points for each:
bitarray
Written as a C extension so very quick.
Python 2 only.
bitstring
Pure Python.
Python 2.6+ and Python 3.x
Richer array of methods for reading and interpreting data.
So it depends on what you need to do with your data. If it's just storage and retrieval then both will be fine, but for performance critical stuff it's better to use bitarray if you can. Take a look at the docs (bitstring, bitarray) to see which you prefer.

Have you thought about using a hybrid list/bitstring?
Use your list to store one dimension of your bits. Each list item would hold a bitstring of fixed length. You would use your list to focus your search to the bitstring of interest, then use the bitstring to find/modify your bit of interest.
The list should allow the most efficent recall of the bitstrings, the bitstrings should allow you to pack all your data as efficiently as possible, and the hybrid list/bitstring should allow a compromise between speed (slightly slower accessing the bit string in the list) and storage (bit packed data plus list overhead.)

FSharp runs my algorithm slower than Python

Years ago, I solved a problem via dynamic programming:
https://www.thanassis.space/fillupDVD.html
The solution was coded in Python.
As part of expanding my horizons, I recently started learning OCaml/F#. What better way to test the waters, than by doing a direct port of the imperative code I wrote in Python to F# - and start from there, moving in steps towards a functional programming solution.
The results of this first, direct port... are disconcerting:
Under Python:
bash$ time python fitToSize.py
....
real 0m1.482s
user 0m1.413s
sys 0m0.067s
Under FSharp:
bash$ time mono ./fitToSize.exe
....
real 0m2.235s
user 0m2.427s
sys 0m0.063s
(in case you noticed the "mono" above: I tested under Windows as well, with Visual Studio - same speed).
I am... puzzled, to say the least. Python runs code faster than F# ? A compiled binary, using the .NET runtime, runs SLOWER than Python's interpreted code?!?!
I know about startup costs of VMs (mono in this case) and how JITs improve things for languages like Python, but still... I expected a speedup, not a slowdown!
Have I done something wrong, perhaps?
I have uploaded the code here:
https://www.thanassis.space/fsharp.slower.than.python.tar.gz
Note that the F# code is more or less a direct, line-by-line translation of the Python code.
P.S. There are of course other gains, e.g. the static type safety offered by F# - but if the resulting speed of an imperative algorithm is worse under F# ... I am disappointed, to say the least.
EDIT: Direct access, as requested in the comments:
the Python code: https://gist.github.com/950697
the FSharp code: https://gist.github.com/950699

Dr Jon Harrop, whom I contacted over e-mail, explained what is going on:
The problem is simply that the program has been optimized for Python. This is common when the programmer is more familiar with one language than the other, of course. You just have to learn a different set of rules that dictate how F# programs should be optimized...
Several things jumped out at me such as the use of a "for i in 1..n do" loop rather than a "for i=1 to n do" loop (which is faster in general but not significant here), repeatedly doing List.mapi on a list to mimic an array index (which allocated intermediate lists unnecessarily) and your use of the F# TryGetValue for Dictionary which allocates unnecessarily (the .NET TryGetValue that accepts a ref is faster in general but not so much here)
... but the real killer problem turned out to be your use of a hash table to implement a dense 2D matrix. Using a hash table is ideal in Python because its hash table implementation has been extremely well optimized (as evidenced by the fact that your Python code is running as fast as F# compiled to native code!) but arrays are a much better way to represent dense matrices, particularly when you want a default value of zero.
The funny part is that when I first coded this algorithm, I DID use a table -- I changed the implementation to a dictionary for reasons of clarity (avoiding the array boundary checks made the code simpler - and much easier to reason about).
Jon transformed my code (back :-)) into its array version, and it runs at 100x speed.
Moral of the story:
F# Dictionary needs work... when using tuples as keys, compiled F# is slower than interpreted Python's hash tables!
Obvious, but no harm in repeating: Cleaner code sometimes means... much slower code.
Thank you, Jon -- much appreciated.
EDIT: the fact that replacing Dictionary with Array makes F# finally run at the speeds a compiled language is expected to run, doesn't negate the need for a fix in Dictionary's speed (I hope F# people from MS are reading this). Other algorithms depend on dictionaries/hashes, and can't be easily switched to using arrays; making programs suffer "interpreter-speeds" whenever one uses a Dictionary, is arguably, a bug. If, as some have said in the comments, the problem is not with F# but with .NET Dictionary, then I'd argue that this... is a bug in .NET!
EDIT2: The clearest solution, that doesn't require the algorithm to switch to arrays (some algorithms simply won't be amenable to that) is to change this:
let optimalResults = new Dictionary<_,_>()
into this:
let optimalResults = new Dictionary<_,_>(HashIdentity.Structural)
This change makes the F# code run 2.7x times faster, thus finally beating Python (1.6x faster). The weird thing is that tuples by default use structural comparison, so in principle, the comparisons done by the Dictionary on the keys are the same (with or without Structural). Dr Harrop theorizes that the speed difference may be attributed to virtual dispatch: "AFAIK, .NET does little to optimize virtual dispatch away and the cost of virtual dispatch is extremely high on modern hardware because it is a "computed goto" that jumps the program counter to an unpredictable location and, consequently, undermines branch prediction logic and will almost certainly cause the entire CPU pipeline to be flushed and reloaded".
In plain words, and as suggested by Don Syme (look at the bottom 3 answers), "be explicit about the use of structural hashing when using reference-typed keys in conjunction with the .NET collections". (Dr. Harrop in the comments below also says that we should always use Structural comparisons when using .NET collections).
Dear F# team in MS, if there is a way to automatically fix this, please do.

As Jon Harrop has pointed out, simply constructing the dictionaries using Dictionary(HashIdentity.Structural) gives a major performance improvement (a factor of 3 on my computer). This is almost certainly the minimally invasive change you need to make to get better performance than Python, and keeps your code idiomatic (as opposed to replacing tuples with structs, etc.) and parallel to the Python implementation.

Edit: I was wrong, it's not a question of value type vs reference type. The performance problem was related to the hash function, as explained in other comments. I keep my answer here because there's an interessant discussion. My code partially fixed the performance issue, but this is not the clean and recommended solution.
--
On my computer, I made your sample run twice as fast by replacing the tuple with a struct. This means, the equivalent F# code should run faster than your Python code. I don't agree with the comments saying that .NET hashtables are slow, I believe there's no significant difference with Python or other languages implementations. Also, I don't agree with the "You can't 1-to-1 translate code expect it to be faster": F# code will generally be faster than Python for most tasks (static typing is very helpful to the compiler). In your sample, most of the time is spent doing hashtable lookups, so it's fair to imagine that both languages should be almost as fast.
I think the performance issue is related to gabage collection (but I haven't checked with a profiler). The reason why using tuples can be slower here than structures has been discussed in a SO question ( Why is the new Tuple type in .Net 4.0 a reference type (class) and not a value type (struct)) and a MSDN page (Building tuples):
If they are reference types, this
means there can be lots of garbage
generated if you are changing elements
in a tuple in a tight loop. [...]
F# tuples were reference types, but
there was a feeling from the team that
they could realize a performance
improvement if two, and perhaps three,
element tuples were value types
instead. Some teams that had created
internal tuples had used value instead
of reference types, because their
scenarios were very sensitive to
creating lots of managed objects.
Of course, as Jon said in another comment, the obvious optimization in your example is to replace hashtables with arrays. Arrays are obviously much faster (integer index, no hashing, no collision handling, no reallocation, more compact), but this is very specific to your problem, and it doesn't explain the performance difference with Python (as far as I know, Python code is using hashtables, not arrays).
To reproduce my 50% speedup, here is the full code: http://pastebin.com/nbYrEi5d
In short, I replaced the tuple with this type:
type Tup = {x: int; y: int}
Also, it seems like a detail, but you should move the List.mapi (fun i x -> (i,x)) fileSizes out of the enclosing loop. I believe Python enumerate does not actually allocate a list (so it's fair to allocate the list only once in F#, or use Seq module, or use a mutable counter).

Hmm.. if the hashtable is the major bottleneck, then it is properly the hash function itself. Havn't look at the specific hash function but For one of the most common hash functions namely
((a * x + b) % p) % q
The modulus operation % is painfully slow, if p and q is of the form 2^k - 1, we can do modulus with an and, add and a shift operation.
Dietzfelbingers universal hash function h_a : [2^w] -> [2^l]
lowerbound(((a * x) % 2^w)/2^(w-l))
Where is a random odd seed of w-bit.
It can be computed by (a*x) >> (w-l), which is magnitudes of speed faster than the first hash function. I had to implement a hash table with linked list as collision handling. It took 10 minutes to implement and test, we had to test it with both functions, and analyse the differens of speed. The second hash function had as I remember around 4-10 times of speed gain dependend on the size of the table.
But the thing to learn here is if your programs bottleneck is hashtable lookup the hash function has to be fast too

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.