How to find a complexity of a built-in function in python - python

I have the special case of the problem, but it would be nice to know whether it is possible for any function.
So I want to find the position of a substring in a string. Ok, in python there is a find method which does exactly what is needed.
string.find(s, sub[, start[, end]])
Return the lowest index in s where
the substring sub is found such that sub is wholly contained in
s[start:end]. Return -1 on failure. Defaults for start and end and
interpretation of negative values is the same as for slices.
Amazing, but the problem is that finding a big substring in a big string can run from O(n*m) to O(n) (which is a huge deal) depending on the algorithm. Documentation gives no information about time complexity, nor information about the underlying algorithm.
I see few approaches how to resolve this:
benchmark
go to source code and try to understand it
Both does not sound really easy (I hope that there is an easier way). So how can I find a complexity of a built-in function?

You say, "go to source code and try to understand it," but it might be easier than you think. Once you get to the actual implementation code, in Objects/stringlib/fastsearch.h, you find:
/* fast search/count implementation, based on a mix between boyer-
moore and horspool, with a few more bells and whistles on the top.
for some more background, see: http://effbot.org/zone/stringlib.htm */
The URL referenced there has a good discussion of the algorithm and its complexity.

Related

How to find the reverse of power_mod(a,b,c)

When using power_mod(a,b,c) we get a^b % c returning x. I have a,c, and x but am having a difficult time reversing the %c.
Is there a function that already exists to reverse this, or would I need to implement Euclidean algorithm to find what b is and return that?
This sounds like the "discrete logarithm" problem. If not, please be clearer (e.g., give a specific numeric example). But, if so, as the cited article says, there is no efficient approach known for the general case.
Since you tagged "sage" in the question, see the docs for the Sage discrete_log() function too. It implements some of the approaches named in the earlier cited Wikipedia article ("Pohlig-Hellman and baby step giant step").

The difference in application between SequenceMatcher in edit distance and that in difflib?

I know the implementation of the edit distance algorithm. By dynamic programming, we first fill the first column and first row and then the entries immediately right and below of the filled entries by comparing three paths from the left, above, and left above. While for the Ratcliff/Obershelp algorithm, we first extract the longest common substring out from the two strings, then we do recursive operations for the left side two sub-strings and right side two sub-strings until no characters are left.
Both of them can be utilized to calculate the similarity between two strings and transform one string into another using four operations: delete, replace, copy, and insert.
But I wonder when to use which between SequenceMatcher in edit distance and that in difflib?
Here is what I found on the internet that makes me think that this question would also benefit others:
In the documentation of edit distance it reads that
Similar to the difflib SequenceMatcher, but uses Levenshtein/edit distance.
In this answer to a question on calculating edit distance, an answer on Ratcliff/Obershelp algorithm was provided.
There are only a few resources about the Ratcliff/Obershelp algorithm, let alone its comparison to edit distance that I thought is the most well known string alignment algorithm.
So far as I know, I have the following ideas:
I find that edit distance and the Ratcliff/Obershelp algorithm can both be used for spell checking. But when to use which?
I thought the edit distance is employed to find the minimal edit sequence while the Ratcliff/Obershelp algorithm yields matches that "look right" to people. However, 'look right' seems too vague a term, especially in real world applications. What's more, when is the minimum edit sequence a must/preferred?
Any suggestions would be highly appreciated, and thanks in advance.
"Looks right to people" needn't be all that vague. Search the web for discussion of why, e.g., the very widely used git source control system added "patience" and "histogram" differencing algorithms, as options. Variations of "minimal edit distance" routinely produce diffs that are jarring to humans, and I'm not going to reproduce examples here that are easily found by searching.
From a formal perspective, Levenshtein is more in line with what a mathematician means by "distance". Chiefly, difflib's .ratio() can depend on the order of the arguments passed to it, but Levenshtein is insensitve to order:
>>> import difflib
>>> difflib.SequenceMatcher(None, "tide", "diet").ratio()
0.25
>>> difflib.SequenceMatcher(None, "diet", "tide").ratio()
0.5
For the rest, I don't think you're going to get crisp answers. There are many notions of "similarity", not just the two you mentioned, and they all have their fans. "Minimal" was probably thought to be more important back when disk space and bandwidth were scarce and expensive.
The physical realities constraining genetic mutation have made measures that take into account spatial locality much more important in that field - doesn't matter one whit if it's "minimal" if it's also physically implausible ;-) Terms to search for: Smith–Waterman, and Needleman–Wunsch.

Any hints on bisection searching?

I do not want the answer. Could some one please give me a hint on how to approach this problem? Thanks.
Exercise 10
To check whether a word is in the word list, you could use the in operator, but it would be slow because it searches through the words in order.
Because the words are in alphabetical order, we can speed things up with a bisection search (also known as binary search), which is similar to what you do when you look a word up in the dictionary. You start in the middle and check to see whether the word you are looking for comes before the word in the middle of the list. If so, you search the first half of the list the same way. Otherwise you search the second half.
Either way, you cut the remaining search space in half. If the word list has 113,809 words, it will take about 17 steps to find the word or conclude that it’s not there.
Write a function called in_bisect that takes a sorted list and a target value and returns the index of the value in the list if it’s there, or None if it’s not.
There's a python library for doing this: the bisect library.
There's a general programming paradigm that is called for in this situation. Given the problem you've been given, I'm guessing that this paradigm has been discussed in class. It starts with an "r".
Once you get it, it will seem pretty cool.

Compare two glob expressions

Does anyone know an algorithm to check which of two wild card expression is more general than the other?
For example I'd like to compare
*/foo/foo.bar
with
*.bar
Clearly the first expression is contained in the second. I know that is not possible for regex (at least not if you don't have a looooot of time, as far as I remember this is in complexity class Non elementary), but it could be possible for wild card expression which are far less expressive. I tried to put together a simple python algorithm, but it get's very nasty when it comes to special cases.
Anybody has an idea if there is an algorithm for that problem?
UPDATE:
I do not want to use any brute force algorithm, since this won't work in general, because of performance reasons
Regards,
Gerald
You basically need to find a string somehow that matches the more general glob but not the more specific one. Just being captain obvious...
Probably by replacing * character with 0 or more random symbols.

Writing a parser for regular expressions

Even after years of programming, I'm ashamed to say that I've never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it's a technique that I find myself using increasingly often.
So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.
To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real-world. It merely has to correctly match or fail to match a pattern in a string.
The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.
EDIT: I should clarify that while I'm going to implement the regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand enough of it to make it worth my while.
Writing an implementation of a regular expression engine is indeed a quite complex task.
But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:
Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.
I've already given a +1 to Mark Byers - but as far as I remember the paper doesn't really say that much about how regular expression matching works beyond explaining why one algorithm is bad and another much better. Maybe something in the links?
I'll focus on the good approach - creating finite automata. If you limit yourself to deterministic automata with no minimisation, this isn't really too difficult.
What I'll (very quickly) describe is the approach taken in Modern Compiler Design.
Imagine you have the following regular expression...
a (b c)* d
The letters represent literal characters to match. The * is the usual zero-or-more repetitions match. The basic idea is to derive states based on dotted rules. State zero we'll take as the state where nothing has been matched yet, so the dot goes at the front...
0 : .a (b c)* d
The only possible match is 'a', so the next state we derive is...
1 : a.(b c)* d
We now have two possibilities - match the 'b' (if there's at least one repeat of 'b c') or match the 'd' otherwise. Note - we are basically doing a digraph search here (either depth first or breadth first or whatever) but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we'll need to queue one of our cases for later consideration, but I'll ignore that issue from here on. Anyway, we've discovered two new states...
2 : a (b.c)* d
3 : a (b c)* d.
State 3 is an end state (there may be more than one). For state 2, we can only match the 'c', but we need to be careful with the dot position afterwards. We get "a.(b c)* d" - which is the same as state 1, so we don't need a new state.
IIRC, the approach in Modern Compiler Design is to translate a rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to...
1 : a.b c (b c)* d
a.d
That is, your next option is either to match the first repetition or to skip the repetition. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all your past matches (everything before the '.') as you only care about future matches. This typically gives a smaller state model (but not necessarily a minimal one).
EDIT If you do discard already matched details, your state description is a representation of the set of strings that can occur from this point on.
In terms of abstract algebra, this is a kind of set closure. An algebra is basically a set with one (or more) operators. Our set is of state descriptions, and our operators are our transitions (character matches). A closed set is one where applying any operator to any members in the set always produces another member that is in the set. The closure of a set is the mimimal larger set that is closed. So basically, starting with the obvious start state, we are constructing the minimal set of states that is closed relative to our set of transition operators - the minimal set of reachable states.
Minimal here refers to the closure process - there may be a smaller equivalent automata which is normally referred to as minimal.
With this basic idea in mind, it's not too difficult to say "if I have two state machines representing two sets of strings, how to I derive a third representing the union" (or intersection, or set difference...). Instead of dotted rules, your state representations will a current state (or set of current states) from each input automaton and perhaps additional details.
If your regular grammars are getting complex, you can minimise. The basic idea here is relatively simple. You group all your states into one equivalence class or "block". Then you repeatedly test whether you need to split blocks (the states aren't really equivalent) with respect to a particular transition type. If all states in a particular block can accept a match of the same character and, in doing so, reach the same next-block, they are equivalent.
Hopcrofts algorithm is an efficient way to handle this basic idea.
A particularly interesting thing about minimisation is that every deterministic finite automaton has precisely one minimal form. Furthermore, Hopcrofts algorithm will produce the same representation of that minimal form, no matter what representation of what larger case it started from. That is, this is a "canonical" representation which can be used to derive a hash or for arbitrary-but-consistent orderings. What this means is that you can use minimal automata as keys into containers.
The above is probably a bit sloppy WRT definitions, so make sure you look up any terms yourself before using them yourself, but with a bit of luck this gives a fair quick introduction to the basic ideas.
BTW - have a look around the rest of Dick Grunes site - he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but as you'll see, there's a second edition imminent.
"A play on regular expressions: functional pearl" takes an interesting approach. The implementation is given in Haskell, but it's been reimplemented in Python at least once.
The developed program is based on an old technique to turn regular expressions into finite automata which makes it efficient both in terms of worst-case time and space bounds and actual performance: despite its simplicity, the Haskell implementation can compete with a recently published professional C++ program for the same problem.
There's an interesting (if slightly short) chapter in Beautiful Code by Brian Kernighan, appropriately called "A Regular Expression Matcher". In it he discusses a simple matcher that can match literal characters, and the .^$* symbols.
I do agree that writing a regex engine will improve understanding but have you taken a look at ANTLR??. It generates the parsers automatically for any kind of language. So maybe you can try your hand by taking one of the language grammars listed at Grammar examples and run through the AST and parser that it generates. It generates a really complicated code but you will have a good understanding on how a parser works.

Categories