String matching in Python

Does anyone know which string matching algorithm is implemented in Python?

Per the sources, it's a "fast search/count implementation, based on a mix between Boyer-Moore and Horspool, with a few more bells and whistles on the top." For some more background, see http://effbot.org/zone/stringlib.htm
The essay in question is really well worth reading!
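For illustration, here is a minimal pure-Python sketch of the Boyer-Moore-Horspool idea that comment describes (the function name and the simplifications are mine; the real fastsearch.h adds more tricks, such as a bloom-filter-style bitmask over the needle's characters):

    def horspool_find(haystack, needle):
        # Return the index of the first occurrence of needle, or -1.
        n, m = len(haystack), len(needle)
        if m == 0:
            return 0
        if m > n:
            return -1
        # Bad-character table: how far the window may shift when the
        # haystack character under the needle's last position mismatches.
        skip = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
        pos = 0
        while pos <= n - m:
            if haystack[pos:pos + m] == needle:
                return pos
            pos += skip.get(haystack[pos + m - 1], m)
        return -1

    print(horspool_find("hello world", "world"))  # 6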

I assume you're talking about CPython. In that case, you could always check the source (see fastsearch.h).


Numpy argmax source

I can't seem to find the code for numpy argmax.
The source link in the docs leads me here, which doesn't have any actual code.
I went through every function that mentions argmax using the github search tool and still no luck. I'm sure I'm missing something.
Can someone lead me in the right direction?
Thanks
Numpy is written in C. It uses a template engine that parses special comments to generate many versions of the same generic function (typically one per supported type). This tool is very helpful for generating fast code, since C does not provide (proper) templates, unlike C++ for example. However, it also makes the code more cryptic than necessary, since function names are often generated: a generic function name can look like #TYPE#_#OP#, where #TYPE# and #OP# are two macros that can each take different values. On top of all of this, the CPython binding makes the code more complex still, since the C functions have to be wrapped so they can be called from Python, handling arrays (possibly with a high number of dimensions and custom user types) and decoding the CPython arguments.
_PyArray_ArgMinMaxCommon is quite a good entry point, but it is only a wrapper function, not the main computational one. It is only useful if you plan to change the prototype of the Numpy function as seen from Python.
The main computational function can be found here. The comment just above the function is the one used to generate the variants of the function (e.g. CDOUBLE_argmax). Note that there are alternative implementations for specific types below the main one, such as OBJECT_argmax, since CPython objects and strings must be handled a bit differently. Thank you for contributing to Numpy.
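To make the structure concrete, here is a rough pure-Python analogue of the loop that each generated variant (e.g. DOUBLE_argmax) implements for its own type (a sketch only; the real C loops are type-specialized and heavily optimized):

    def argmax_sketch(values):
        # Index of the first occurrence of the maximum value.
        best_idx, best = 0, values[0]
        for i in range(1, len(values)):
            if values[i] > best:
                best_idx, best = i, values[i]
        return best_idx

    print(argmax_sketch([3.0, 7.5, 1.2]))  # 1, like np.argmax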
As mentioned in the comments, you'll likely find what you are searching for in the C implementation (here, under _PyArray_ArgMinMaxCommon). The code itself can be very convoluted, so if your intent was to open an issue on numpy with a broad idea, I would do it on the page you linked anyway.

In Python, is there a way to check how alike two files are and get the percentage of differences they have?

I'm trying to compare a lot of scripts at once and most of them have small differences, like a different name inside a variable and such.
For the most part, the scripts should be identical in function, and I'd like to be able to test how actually different they are.
What I'm thinking of doing is taking in all of the input from both files and comparing them against each other, character by character, increasing a count of some sort whenever a difference arises. I'm not sure what I would compare this count against to get a percentage, or if this is even the best way to go about it.
If you have an idea or advice to give me I would greatly appreciate it!
Two suggestions:
1) Check out Python's difflib and this SO question, which specifically asks about difflib (a short difflib sketch follows after these suggestions).
Also, a guy named Doug Hellmann has an excellent series of blog posts called Python Module of the Week (PyMOTW). Here is his post about difflib.
2) If those don't work for you, try searching for language-independent file-comparison algorithms first, and think about which ones can be most easily implemented in Python. A simple Google search for "file comparison algorithms" turns up several decent-looking possibilities that you could implement in Python:
Here is a published PDF with a diff algorithm
This site has a discussion of several different algorithms with links
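As promised above, difflib.SequenceMatcher turns the comparison directly into a ratio you can read as a percentage (the file names here are hypothetical):

    import difflib

    def similarity_percent(path_a, path_b):
        # Ratio of matching content between the two files, 0.0-100.0.
        with open(path_a) as fa, open(path_b) as fb:
            a, b = fa.read(), fb.read()
        return difflib.SequenceMatcher(None, a, b).ratio() * 100

    print(f"{similarity_percent('script1.py', 'script2.py'):.1f}% similar")

SequenceMatcher works on blocks of text rather than single characters, so it handles things like renamed variables far more gracefully than a character-by-character count would.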

Syntax recognizer in Python

I need a module or strategy for detecting that a piece of data is written in a programming language, not syntax highlighting where the user specifically chooses a syntax to highlight. My question has two levels; I would greatly appreciate any help:
Is there any package in Python that receives a string (a piece of data) and returns whether it belongs to any programming language's syntax?
I don't necessarily need to recognize which syntax it is, just to know whether the string is source code at all.
Any clues are deeply appreciated.
Maybe you can use existing multi-language syntax highlighters. Many of them can detect the language a file is written in.
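Pygments, for instance, exposes its guessing heuristics directly (a small sketch; the snippet and the printed name are illustrative):

    from pygments.lexers import guess_lexer
    from pygments.util import ClassNotFound

    snippet = "def greet(name):\n    return 'Hello, ' + name\n"
    try:
        # guess_lexer asks every lexer to score the text and picks the best.
        print(guess_lexer(snippet).name)  # e.g. "Python"
    except ClassNotFound:
        print("no lexer claimed this text")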
You could have a look at methods around Bayesian filtering.
My answer somewhat depends on the amount of code you're going to be given. If you're given 30+ lines of code, it should be fairly easy to identify some unique features of each language that are fairly common. For example, tell the program that if anything matches an expression like from * import *, then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things that usually differ slightly between languages are class definitions (Python always starts with class, while C starts with the return type, so you could check for a line that begins with a data type and is formatted like a method declaration), conditionals, and so on.
If you wanted to make it more accurate, you could introduce some sort of weighting system: features that are more unique and less likely to be the result of a mismatched regexp get a higher weight, things that are commonly mismatched get a lower weight for that language, and at the end you calculate which language has the highest composite score. You could also define features that you feel are 100% unique and tell the detector to stop parsing as soon as it hits one, because it already knows the answer (things like the shebang line). A sketch of this scoring approach is shown below.
This would, of course, require you to know enough about the languages you want to identify to find unique features to look for, or to be able to find people who do know such unique structures.
If you're given fewer than 30 or so lines of code, answers from that kind of parsing are going to be far less accurate. In that case, the easiest way to do it would probably be to take an appliance similar to Travis and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in, errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.
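Here is the promised sketch of the weighting idea (the feature table and weights are entirely hypothetical; a real detector would need far more features per language):

    import re

    # Hypothetical feature table: (compiled regex, language, weight).
    # Higher weight = more unique to that language, less likely to be
    # a coincidental match.
    FEATURES = [
        (re.compile(r"^from \w+ import ", re.M), "python", 5),
        (re.compile(r"^\s*def \w+\(.*\):", re.M), "python", 4),
        (re.compile(r"^#include\s*<\w+\.h>", re.M), "c", 5),
        (re.compile(r"\bpublic static void main\b"), "java", 5),
        (re.compile(r"^\s*function \w+\(", re.M), "javascript", 3),
    ]

    def guess_language(source):
        # Return the language with the highest composite score, or None.
        scores = {}
        for pattern, lang, weight in FEATURES:
            if pattern.search(source):
                scores[lang] = scores.get(lang, 0) + weight
        return max(scores, key=scores.get) if scores else None

    print(guess_language("from os import path\ndef main():\n    pass\n"))
    # -> python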

Are inner-classes unpythonic?

My colleague just pointed out that my use of inner-classes seemed to be "unpythonic". I guess it violates the "flat is better than nested" heuristic.
What do people here think? Are inner-classes something which are more appropriate to Java etc than Python?
NB: I don't think this is a "subjective" question. Surely style and aesthetics are objective within a programming community.
Related Question: Is there a benefit to defining a class inside another class in Python?
This may not deserve a [subjective] tag on StackOverflow, but it's subjective on the larger stage: some language communities encourage nesting and others discourage it. So why would the Python community discourage nesting? Because Tim Peters put it in The Zen of Python? Does it apply to every scenario, always, without exception? Rules should be taken as guidelines, meaning you should not switch off your brain when applying them. You must understand what the rule means and why it's important enough that someone bothered to make a rule.
The biggest reason I know to keep things flat is because of another philosophy: do one thing and do it well. Lots of little special purpose classes living inside other classes is a sign that you're not abstracting enough. I.e., you should be removing the need and desire to have inner classes, not just moving them outside for the sake of following rules.
But sometimes you really do have some behavior that should be abstracted into a class, and it's a special case that only obtains within another single class. In that case you should use an inner class because it makes sense, and it tells anyone else reading the code that there's something special going on there.
Don't slavishly follow rules.
Do understand the reason for a rule and respect that.
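To make that "legitimate inner class" case concrete, here is a toy sketch (entirely hypothetical code, just to illustrate the point):

    class LinkedList:
        # _Node is a special case that only obtains inside LinkedList,
        # so nesting it documents that it is not for general use.
        class _Node:
            def __init__(self, value, next=None):
                self.value = value
                self.next = next

        def __init__(self):
            self._head = None

        def push(self, value):
            self._head = LinkedList._Node(value, self._head)

        def __iter__(self):
            node = self._head
            while node is not None:
                yield node.value
                node = node.next

    items = LinkedList()
    items.push(1)
    items.push(2)
    print(list(items))  # [2, 1]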
"Flat is better than nested" is focused on avoiding excessive nesting -- i.e., seriously deep hierarchies. One level of nesting, per se, is no big deal: as long as your nesting still respects (a weakish form of) the Law of Demeter, as typically confirmed by the fact that you don't find yourself writing stuff like onething.another.andyet.anotherone (too many dots in an expression are a "code smell" suggesting you've gone too deep in nesting and need to refactor to flatten things out), I wouldn't worry too much.
Actually, I'm not sure if I agree with the whole premise that "Flat is better than nested". Sometimes, quite often actually, the best way to represent something is hierarchically... Even nature itself, often uses hierarchy.

Regular expression implementation details

A question that I answered got me wondering:
How are regular expressions implemented in Python? What sort of efficiency guarantees are there? Is the implementation "standard", or is it subject to change?
I thought that regular expressions would be implemented as DFAs, and therefore were very efficient (requiring at most one scan of the input string). Laurence Gonsalves raised an interesting point that not all Python regular expressions are regular. (His example is r"(a+)b\1", which matches some number of a's, a b, and then the same number of a's as before). This clearly cannot be implemented with a DFA.
So, to reiterate: what are the implementation details and guarantees of Python regular expressions?
It would also be nice if someone could give some sort of explanation (in light of the implementation) as to why the regular expressions "cat|catdog" and "catdog|cat" lead to different search results in the string "catdog", as mentioned in the question that I referenced before.
Python's re module was based on PCRE, but has moved on to its own implementation.
Here is the link to the C code.
It appears the library is based on recursive backtracking: when an incorrect path has been taken, the engine backs up and tries the next alternative.
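That also answers the "cat|catdog" versus "catdog|cat" puzzle from the question: a backtracking engine tries alternatives left to right and commits to the first branch that lets the overall match succeed, rather than preferring the longest one:

    import re

    # Alternation is ordered, not longest-match-first.
    print(re.search(r"cat|catdog", "catdog").group())  # cat
    print(re.search(r"catdog|cat", "catdog").group())  # catdog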
[Graph: regular-expression matching time vs. text size n, for the pattern a?^n a^n matching the text a^n; backtracking blows up exponentially. Keep in mind that this graph is not representative of normal regex searches.]
http://swtch.com/~rsc/regexp/regexp1.html
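A quick way to reproduce the blow-up that graph describes (timings vary by machine; even modest values of n can take seconds, roughly doubling with each increment):

    import re
    import time

    # Pathological case from the article: a?^n a^n matched against a^n
    # forces a backtracking engine to explore exponentially many paths.
    n = 22
    pattern = re.compile("a?" * n + "a" * n)
    start = time.perf_counter()
    pattern.match("a" * n)
    print(f"n={n}: {time.perf_counter() - start:.2f}s")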
There are no "efficiency guarantees" on Python REs, any more than on any other part of the language (C++'s standard library is the only widespread language standard I know of that even tries to establish such complexity guarantees; and there is no standard, even in C++, specifying that, say, multiplying two ints must take constant time, or anything like that); nor is there any guarantee that big optimizations won't be applied at any time.
Today, F. Lundh (originally responsible for implementing Python's current RE module), presenting Unladen Swallow at PyCon Italia, mentioned that one of the avenues they'll be exploring is to compile regular expressions directly to LLVM intermediate code (rather than to their own bytecode flavor interpreted by an ad-hoc runtime). Since ordinary Python code is also getting compiled to LLVM (in a soon-forthcoming release of Unladen Swallow), a RE and its surrounding Python code could then be optimized together, sometimes in quite aggressive ways. I doubt anything like that will be anywhere close to "production-ready" very soon, though ;-).
Matching regular expressions with backreferences is NP-hard, which makes it at least as hard as the NP-complete problems. That basically means it's as hard as any problem you're likely to encounter, and most computer scientists believe it may require exponential time in the worst case. If you could match such "regular" expressions (which really aren't regular, in the technical sense) in polynomial time, you could win a million bucks.
