Syntax recognizer in Python

I need a module or strategy for detecting that a piece of data is written in a programming language; this is not syntax highlighting, where the user specifically chooses a syntax to highlight. My question has two levels, and I would greatly appreciate any help:
Is there any package in Python that receives a string (a piece of data) and returns whether it belongs to any programming language's syntax?
I don't necessarily need to recognize which syntax it is; I just need to know whether the string is source code at all.
Any clues are deeply appreciated.

Maybe you can use existing multi-language syntax highlighters. Many of them can detect the language a file is written in.
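For example, Pygments (a multi-language highlighter written in Python) exposes a guess_lexer() function. A minimal sketch, assuming Pygments is installed; the guess is best-effort and raises ClassNotFound when nothing plausible matches:

import os
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound

snippet = "import os\n\ndef greet(name):\n    return 'Hello, ' + name\n"
try:
    print(guess_lexer(snippet).name)  # best-effort guess, e.g. "Python"
except ClassNotFound:
    print("No lexer matched; probably not source code")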

You could have a look at methods around Bayesian filtering.

My answer somewhat depends on the amount of code you're going to be given. If you're given 30+ lines, it should be fairly easy to identify unique features of each language that appear often. For example, tell the program that anything matching an expression like from * import * is Python (I'm not 100% sure that construct is unique to Python, but you get the gist). Other things that usually differ slightly are class definitions (Python always starts with 'class', while C starts with the return type, so you could check for a line that begins with a data type and is shaped like a function declaration), the formatting of conditionals, and so on.
If you wanted to make it more accurate, you could introduce a weighting system: features that are more unique and less likely to be the result of a mismatched regexp get a higher weight, things that are commonly mismatched get a lower weight, and at the end you calculate which language has the highest composite score. You could also define features you feel are 100% unique (such as the shebang line) and tell the program to stop parsing as soon as it hits one of those, because then it already knows the answer.
This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people that do know unique structures that would help.
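Here is a minimal sketch of that weighted-feature scoring; the patterns and weights are purely illustrative, not a tuned rule set:

import re

# Illustrative feature patterns per language, each with a weight.
FEATURES = {
    "Python": [
        (re.compile(r"^\s*from \w+ import ", re.MULTILINE), 3),
        (re.compile(r"^\s*def \w+\(.*\):", re.MULTILINE), 2),
        (re.compile(r"^#!.*python", re.MULTILINE), 10),  # shebang: near-certain
    ],
    "C": [
        (re.compile(r"^\s*#include\s*<\w+\.h>", re.MULTILINE), 3),
        (re.compile(r"\bint\s+main\s*\(", re.MULTILINE), 3),
    ],
}

def score_languages(code):
    # Sum the weights of every feature that matches; highest composite score wins.
    return {lang: sum(weight for pattern, weight in feats if pattern.search(code))
            for lang, feats in FEATURES.items()}

print(score_languages("#include <stdio.h>\nint main(void) { return 0; }"))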
If you're given fewer than 30 or so lines of code, answers from parsing like that are going to be far less accurate. In that case, the easiest way would probably be to take an approach similar to Travis CI and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (that is, errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.

Related

Visualize Python function flow (e.g. as tree or concept map)

I have taught myself Python in quite a haphazard way, so my question perhaps won't be very pythonic.
When I write functions in classes, I often lose the overview of what each function does. I do try to name them properly, but there are often smaller pieces of code where it seems arbitrary which function to put them in. So whenever I want to make changes, I still need to go through the entire code to figure out how my functions actually flow.
From what I understand, we have objects and we have functions, and these are our units for structuring code. But this only gives you a flat structure. It doesn't allow any kind of nesting over multiple levels, like in a tree diagram. In particular, the code in my file doesn't automatically order itself so that the functions that are called first and most often sit further up, while helper functions sit further down in the document, or are even nested.
In fact, even being able to visually nest lower-order subroutines "inside" the higher-order functions that call them would seem helpful. But that's not something Python's syntax supports. (Plus, it wouldn't quite suffice, because a subroutine might be used by several higher-order functions.)
I imagine it would be useful to see all functions in my code visualized as a tree, or as a concept map:
Each function would be a dot, and the calling order would be visualized by arrows. This way, I would also easily see which functions are central, which are outliers, and which are orphaned.
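For what it's worth, the edge list for such a map can be scraped with the stdlib ast module. A minimal sketch (the file name is hypothetical, and call resolution is deliberately naive: only plain name(...) calls are counted):

import ast

# Collect (caller, callee) edges from one module; feed them to
# Graphviz or networkx to draw the actual dots and arrows.
with open("my_module.py") as f:  # hypothetical module to visualize
    tree = ast.parse(f.read())

edges = []
for func in ast.walk(tree):
    if isinstance(func, ast.FunctionDef):
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                edges.append((func.name, node.func.id))

print(edges)  # e.g. [('main', 'load_data'), ('main', 'report')]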
Yet perhaps this isn't even a case for another tool. Perhaps this is more a case of me not understanding proper coding. Perhaps there is something I can do differently in order to get the kind of intuitive overview over how my program works, without needing another tool.
Firstly, I am not quite sure why this isn't asked more often. Reading code is not intuitive at all! We should be able to visualize the evolution of a process or function so well that almost everyone can understand its behavior. In the 60s, people had to be reasonably sure their programs would run, because getting access to the computer took time; today we execute or compile our program, run tests if we have them, and know immediately whether it works. The mental effort has shifted: we execute a bit less code in our heads, and the computer executes a bit more. But we must still think about how the program behaves mid-execution in order to debug. In the future, it would be nice if the computer could just tell us how the program behaves.
You propose looking at a sort of tree of the program as a solution, and after all, the abstract syntax tree is literally a tree. But I don't think this is where we ought to spend our efforts when it comes to visualizing systems. What would be preferable is an interactive view of how the program changes its intermediate data structures as a function of time.
Currently we have debuggers, but that's akin to studying a function by asking for its value at many points, when we would much rather look at its graph. A lot of programming is done by writing something you feel is right, observing whether the behavior is correct, and, if it's not, reacting to and correcting that behavior.
Bret Victor discusses this topic in his essay Learnable Programming. I highly recommend it; even though it won't help you right now, maybe you can help others in the future by making these ideas more prevalent.
Onwards, then, to what you can do right now. In his book Clean Code, Robert C. Martin proposes structuring code much like a newspaper is laid out.
Think of a well-written newspaper article. You read it vertically. At the top you expect a headline that will tell you what the story is about and allows you to decide whether it is something you want to read. The first paragraph gives you a synopsis of the whole story, hiding all the details while giving you the broad-brush concepts. As you continue downward, the details increase until you have all the dates, names, quotes, claims, and other minutia.
What is proposed is to organize your program top-down, with higher-level procedures that call mid-level procedures, which in turn call the lower-level procedures. At any place in the code, it should be obvious that (1) you are at the appropriate level of abstraction, and (2) you are looking at the part of the program implementing the behavior you seek to modify.
This means storing state at the level where it belongs, and not exposing it anywhere else. Procedures should take only the parameters they need, because more parameters means more things to think about when reasoning about the code.
This is the primary reason for decomposing procedures into smaller procedures. For example, here is some code I've written previously. It's not perfect, but you can see very clearly which part of the program you need to go to if you want to change anything.
Of course, higher-order procedures are listed before any others. I'm telling you what I'm going to do before I show you how I do it.
function formatLinks(matches, stripped) {
  let formatted_links = []
  for (const match of matches) {
    let diff = difference(match, stripped)
    if (isSimpleLink(diff)) {
      formatted_links.push(formatAsSimpleLink(diff))
    } else if (hasPrefix(diff)) {
      formatted_links.push(formatAsPrefixedLink(diff))
    } else if (hasSuffix(diff)) {
      formatted_links.push(formatAsSuffixedLink(diff))
    }
  }
  // There might be multiple links within a word,
  // in which case we join them and strip any duplicated parts
  // (which are otherwise interpreted as suffixes)
  if (formatted_links.length > 1) {
    return combineLinks(formatted_links)
  }
  // Default case
  return formatted_links[0]
}
If JavaScript were a typed language, and if we could see an image of the decisions made in the code as a function of input and time, this could be even better.
I think Quokka.js and VS Code Debug Visualizer are both doing interesting work in this space.

Building an index of term usage in python code

Brief version
I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).
Clarification: This is about indexing python source code, not English text that talks about python.
Background
My materials are in the form of IPython notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".
I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.
The dumb solution: read the file, tokenize on whitespace and word boundaries, and index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"].
Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:
for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in ast.walk(tree):  # trees have no .walk() method; ast.walk(tree) is the API
        ...  # Get node's keyword, identifier etc., and line number -- how?
        print(term, source, line)  # I do know how to make an index
So, is this a reasonable approach? Is there a better one? How should this be done?
Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.
I expect that your major difficulty here is not traversing the text, but in the pattern-matching to find these things. For instance, how do you recognize introducing for loops? This would be the word for "near" the word loop, with a for command "soon" after. That command would be a line beginning with for and ending with a colon.
That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator expression (both explicit and built-in).
Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.
Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.
Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)
Yes, AST is overkill -- internally. Externally -- it works, it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.
This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find a concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
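For completeness, here is a minimal sketch of the ast approach from the question, recording the first occurrence of each AST node type; mapping node types to your own index terms is the part you would adapt:

import ast

first_seen = {}  # term -> (source file, line number)

for source in source_files:  # source_files as in the question's snippet
    with open(source) as fp:
        tree = ast.parse(fp.read())
    for node in ast.walk(tree):
        term = type(node).__name__  # e.g. "For", "ListComp", "Import"
        line = getattr(node, "lineno", None)  # not every node carries a location
        if line is not None and term not in first_seen:
            first_seen[term] = (source, line)

for term, (source, line) in sorted(first_seen.items()):
    print(term, source, line)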

In Python, is there a way to check how alike two files are and get the percentage of differences they have?

I'm trying to compare a lot of scripts at once and most of them have small differences, like a different name inside a variable and such.
For the most part, the scripts should be identical in function, and I'd like to be able to test how actually different they are.
What I'm thinking of doing is taking all of the input from both files and comparing them character by character, increasing a counter whenever a difference arises. I'm not sure what I would compare that count against to get a percentage, or whether this is even the best way to go about it.
If you have an idea or advice to give me I would greatly appreciate it!
Two suggestions:
1) Check out Python's difflib, and this SO question, which specifically asks about difflib.
Also, a guy named Doug Hellmann has an excellent series of blog posts called Python Module of the Week (PyMOTW). Here is his post about difflib.
2) If those don't work for you, try searching for language-independent algorithms for file comparisons first, and think about which ones can be most easily implemented in Python. A simple Google search for "file comparison algorithms" came up with several decent looking possibilities that you could try to implement in Python:
Here is a published PDF with a diff algorithm
This site has a discussion of several different algorithms with links
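That said, for a quick percentage the stdlib difflib from suggestion (1) already gets you most of the way. A minimal sketch, assuming both scripts fit in memory (the file names are placeholders):

import difflib

def percent_different(path_a, path_b):
    # SequenceMatcher.ratio() returns a similarity in [0, 1],
    # so (1 - ratio) * 100 is a rough percentage of difference.
    with open(path_a) as fa, open(path_b) as fb:
        ratio = difflib.SequenceMatcher(None, fa.read(), fb.read()).ratio()
    return (1 - ratio) * 100

print(percent_different("script1.py", "script2.py"))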

How do I design a class in Python?

I've had some really awesome help on my previous questions for detecting paws and toes within a paw, but all these solutions only work for one measurement at a time.
Now I have data that consists of:
about 30 dogs;
each has 24 measurements (divided into several subgroups);
each measurement has at least 4 contacts (one for each paw) and
each contact is divided into 5 parts and
has several parameters, like contact time, location, total force etc.
Obviously sticking everything into one big object isn't going to cut it, so I figured I needed to use classes instead of the current slew of functions. But even though I've read Learning Python's chapter about classes, I fail to apply it to my own code (GitHub link)
I also feel like it's rather strange to process all the data every time I want to get out some information. Once I know the locations of each paw, there's no reason for me to calculate this again. Furthermore, I want to compare all the paws of the same dog to determine which contact belongs to which paw (front/hind, left/right). This would become a mess if I continue using only functions.
So now I'm looking for advice on how to create classes that will let me process my data (link to the zipped data of one dog) in a sensible fashion.
How to design a class.
Write down the words. You started to do this. Some people don't and wonder why they have problems.
Expand your set of words into simple statements about what these objects will be doing. That is to say, write down the various calculations you'll be doing on these things. Your short list of 30 dogs, 24 measurements, 4 contacts, and several "parameters" per contact is interesting, but only part of the story. Your "locations of each paw" and "compare all the paws of the same dog to determine which contact belongs to which paw" are the next step in object design.
Underline the nouns. Seriously. Some folks debate the value of this, but I find that for first-time OO developers it helps. Underline the nouns.
Review the nouns. Generic nouns like "parameter" and "measurement" need to be replaced with specific, concrete nouns that apply to your problem in your problem domain. Specifics help clarify the problem. Generics simply elide details.
For each noun ("contact", "paw", "dog", etc.) write down the attributes of that noun and the actions in which that object engages. Don't short-cut this. Every attribute. "Data Set contains 30 Dogs" for example is important.
For each attribute, identify if this is a relationship to a defined noun, or some other kind of "primitive" or "atomic" data like a string or a float or something irreducible.
For each action or operation, you have to identify which noun has the responsibility, and which nouns merely participate. It's a question of "mutability". Some objects get updated, others don't. Mutable objects must own total responsibility for their mutations.
At this point, you can start to transform nouns into class definitions (see the sketch below). Some collective nouns are lists, dictionaries, tuples, sets or namedtuples, and you don't need to do very much work. Other classes are more complex, either because of complex derived data or because of some update/mutation which is performed.
Don't forget to test each class in isolation using unittest.
Also, there's no law that says classes must be mutable. In your case, for example, you have almost no mutable data. What you have is derived data, created by transformation functions from the source dataset.
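To make the noun-to-class step concrete with this question's own nouns, here is a minimal sketch; the attribute names are guesses from the problem statement:

from collections import namedtuple

# A "contact" holds data and has no behavior of its own, so a namedtuple suffices.
Contact = namedtuple("Contact", ["paw", "contact_time", "location", "total_force"])

class Measurement:
    """One measurement session: a collection of paw contacts."""
    def __init__(self, contacts):
        self.contacts = list(contacts)

class Dog:
    """A dog and its measurements; derived data can hang off methods here."""
    def __init__(self, name, measurements):
        self.name = name
        self.measurements = list(measurements)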
The following advice (similar to S.Lott's advice above) is from the book Beginning Python: From Novice to Professional:
Write down a description of your problem (what should the program do?). Underline all the nouns, verbs, and adjectives.
Go through the nouns, looking for potential classes.
Go through the verbs, looking for potential methods.
Go through the adjectives, looking for potential attributes.
Allocate methods and attributes to your classes.
To refine the class, the book also advises we can do the following:
Write down (or dream up) a set of use cases (scenarios of how your program may be used). Try to cover all the functionality.
Think through every use case step by step, making sure that everything we need is covered.
I like the TDD approach...
So start by writing tests for what you want the behaviour to be. And write code that passes. At this point, don't worry too much about design, just get a test suite and software that passes. Don't worry if you end up with a single big ugly class, with complex methods.
Sometimes, during this initial process, you'll find a behaviour that is hard to test and needs to be decomposed, just for testability. This may be a hint that a separate class is warranted.
Then the fun part... refactoring. After you have working software you can see the complex pieces. Often little pockets of behaviour will become apparent, suggesting a new class, but if not, just look for ways to simplify the code. Extract service objects and value objects. Simplify your methods.
If you're using git properly (you are using git, aren't you?), you can very quickly experiment with some particular decomposition during refactoring, and then abandon it and revert back if it doesn't simplify things.
By writing tested, working code first, you should gain an intimate insight into the problem domain that you couldn't easily get with a design-first approach. Writing tests and code pushes you past that "where do I begin" paralysis.
The whole idea of OO design is to make your code map to your problem, so when, for example, you want the first footstep of a dog, you do something like:
dog.footstep(0)
Now, it may be that for your case you need to read in your raw data file and compute the footstep locations. All this could be hidden in the footstep() function so that it only happens once. Something like:
class Dog:
    def __init__(self):
        self._footsteps = None

    def footstep(self, n):
        if self._footsteps is None:  # "if not self._footsteps" would also re-read an empty list
            self.readInFootsteps(...)
        return self._footsteps[n]
[This is now a sort of caching pattern. The first time it goes and reads the footstep data; subsequent times it just gets it from self._footsteps.]
But yes, getting OO design right can be tricky. Think more about the things you want to do to your data, and that will inform what methods you'll need to apply to what classes.
After skimming your linked code, it seems to me that you are better off not designing a Dog class at this point. Rather, you should use Pandas and dataframes. A dataframe is a table with columns. Your dataframe would have columns such as: dog_id, contact_part, contact_time, contact_location, etc.
Pandas uses Numpy arrays behind the scenes, and it has many convenience methods for you:
Select a dog, e.g.: my_measurements[my_measurements['dog_id'] == 'Charly']
Save the data: my_measurements.to_pickle('filename.pickle')
Consider using pandas.read_csv() instead of manually reading the text files.
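Putting those together, a minimal sketch (the column and file names are hypothetical):

import pandas as pd

df = pd.read_csv("measurements.csv")   # columns: dog_id, contact_part, contact_time, ...
charly = df[df["dog_id"] == "Charly"]  # all rows for one dog
df.to_pickle("measurements.pickle")    # persist the processed table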
Writing out your nouns, verbs, and adjectives is a great approach, but I prefer to think of class design as asking the question: what data should be hidden?
Imagine you had a Query object and a Database object:
The Query object will help you create and store a query. Store is the key here, as a function could help you create one just as easily. Maybe you could say: Query().select('Country').from_table('User').where('Country == "Brazil"'). The exact syntax doesn't matter (that is your job!); the key is that the object is helping you hide something, in this case the data necessary to store and output a query. The power of the object comes from the syntax of using it (in this case some clever chaining) and from not needing to know what it stores to make it work. If done right, the Query object could output queries for more than one database. Internally it would store a specific format but could easily convert to other formats when outputting (Postgres, MySQL, MongoDB).
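A minimal sketch of such a chainable Query object; the names and the emitted SQL are illustrative, not a real library:

class Query:
    def __init__(self):
        self._columns, self._table, self._where = [], None, None

    def select(self, *columns):
        self._columns = list(columns)
        return self  # returning self is what makes the chaining work

    def from_table(self, table):
        self._table = table
        return self

    def where(self, condition):
        self._where = condition
        return self

    def to_sql(self):
        # The stored pieces could just as well be rendered for another backend.
        sql = "SELECT %s FROM %s" % (", ".join(self._columns), self._table)
        if self._where:
            sql += " WHERE %s" % self._where
        return sql

print(Query().select("Country").from_table("User").where("Country = 'Brazil'").to_sql())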
Now let's think through the Database object. What does this hide and store? Well clearly it can't store the full contents of the database, since that is why we have a database! So what is the point? The goal is to hide how the database works from people who use the Database object. Good classes will simplify reasoning when manipulating internal state. For this Database object you could hide how the networking calls work, or batch queries or updates, or provide a caching layer.
The problem is this Database object is HUGE. It represents how to access a database, so under the covers it could do anything and everything. Clearly networking, caching, and batching are quite hard to deal with depending on your system, so hiding them away would be very helpful. But, as many people will note, a database is insanely complex, and the further from the raw DB calls you get, the harder it is to tune for performance and understand how things work.
This is the fundamental tradeoff of OOP. If you pick the right abstraction, it makes coding simpler (String, Array, Dictionary); if you pick an abstraction that is too big (Database, EmailManager, NetworkingManager), it may become too complex to really understand how it works or what to expect. The goal is to hide complexity, but some complexity is necessary. A good rule of thumb is to start out avoiding Manager objects, and instead create classes that are like structs: all they do is hold data, with some helper methods to create/manipulate the data to make your life easier. For example, in the case of EmailManager, start with a function called sendEmail that takes an Email object. This is a simple starting point and the code is very easy to understand.
As for your example, think about what data needs to be together to calculate what you are looking for. If you wanted to know how far an animal was walking, for example, you could have AnimalStep and AnimalTrip (collection of AnimalSteps) classes. Now that each Trip has all the Step data, then it should be able to figure stuff out about it, perhaps AnimalTrip.calculateDistance() makes sense.

Merging duplicates in a list? - Question is more complex than it seems

So I have a huge list of entries in a DB (MySQL).
I'm using Python and Django in the creation of my web application.
This is the base Django model I'm using:
class DJ(models.Model):
alias = models.CharField(max_length=255)
#other fields...
In my DB I now have duplicates, e.g. Above & Beyond, Above and Beyond, Above Beyond, DJ Above and Beyond, Disk Jokey Above and Beyond, ...
This is a problem... as it blows a big hole in my DB and therefore my application.
I'm sure other people have encountered this problem and thought about it.
My ideas are the following:
Create a set of rules so a new entry cannot be created?
e.g. "DJ Above and Beyond" cannot be created because "Above & Beyond" is in the DB.
Relate these aliases to each other somehow?
e.g. relate "DJ Above and Beyond" to "Above & Beyond".
I have literally no clue how to go on about this, even if someone could point me into a direction that would be very helpful.
Any help would be very much appreciated! Thank you guys.
I guess you could do something based on Levenshtein distance, but there's no real way to do this automatically - without creating a fairly complex rules-based system.
Unless you can define a rules system that can work out for any x and y whether x is a duplicate of y, you're going to have to deal with this in a fuzzy, human way.
Stack Overflow has a fairly decent way of dealing with this - warn users if something may be a duplicate, based on something like Levenshtein distance (and perhaps some kind of rules engine), and then allow a subset of your users to merge things as duplicates if other users ignore the warnings.
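A minimal sketch of that warning step using only the stdlib; difflib's get_close_matches uses SequenceMatcher ratios rather than true Levenshtein distance, but it serves the same fuzzy-warning purpose (the sample aliases are made up):

import difflib

existing_aliases = ["Above & Beyond", "Armin van Buuren", "Tiesto"]

def possible_duplicates(new_alias, cutoff=0.6):
    # Return up to 3 existing aliases that look suspiciously similar.
    return difflib.get_close_matches(new_alias, existing_aliases, n=3, cutoff=cutoff)

print(possible_duplicates("DJ Above and Beyond"))  # -> ['Above & Beyond']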
From the examples you give, it sounds like you have more a natural language problem than an exact matching problem. Given that natural language matching is inexact by nature you're unlikely to come up with a perfect solution.
String distance doesn't really work, as strings that are algorithmically close may not be semantically close (e.g. "DJ Above & Beyond" should match "Above and Beyond" but not "DJ Above & Beyond 2", which is closer in Levenshtein distance).
Some cheap alternatives to natural language parsing are Soundex, which matches by phonetic sound, and stemming, which removes prefixes/suffixes to normalize on word stems. I suppose you could create a linked list of word roots, but this wouldn't be terribly accurate either.
If this is a user-facing program, you could echo "near misses" to the user, e.g. "Is one of these what you meant to enter?"
You could normalize the entries in some way so that different entries map to the same normalized value (e.g. case normalize, "&" -> "And", etc, etc. which some of the above suggestions might be a step towards) to find near misses or map multiple inputs to a single value.
Add the caveat that my experience only applies to English; e.g. an English PorterStemmer won't recognize the one French title you put in there.
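A minimal sketch of such a normalization pass; the rules are illustrative and, per the caveat above, English-only:

import re

def normalize(alias):
    s = alias.lower()
    s = s.replace("&", "and")
    s = re.sub(r"\b(dj|disk jokey|disc jockey)\b", "", s)  # drop honorifics
    s = re.sub(r"[^a-z0-9]+", " ", s)  # collapse punctuation to spaces
    return " ".join(s.split())

# Both inputs map to the same normalized key:
print(normalize("DJ Above & Beyond"))  # -> "above and beyond"
print(normalize("Above and Beyond"))   # -> "above and beyond"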
I think this is more of a social problem than a programming problem. Any sort of programatic solution to natural language processing like this is going to be buggy and error prone. It's very hard to distinguish things that are close, but legitimately different from the sort of undesired duplicates that you're talking about.
As Dominic mentioned, Stack Overflow's tagging system is a pretty good model for this. It provides cues to the user that encourage them to use existing tags if appropriate (drop down lists as the user types), it allows trusted users to retag individual questions, and it allows moderators to do mass retags.
This is really a process that has to have a person directly involved.
This is not a complete solution, but one thought I had:
class DJ(models.Model):
    pass  # other fields, no alias!

class DJAlias(models.Model):
    dj = models.ForeignKey(DJ, on_delete=models.CASCADE)  # on_delete is required in modern Django
    alias = models.CharField(max_length=255)  # the alias text itself lives here now
This would allow you to have several aliases for the same DJ.
But you will still need to find a proper way to ensure the aliases are added to the right DJ; see Dominic's post.
But if you check an alias against several other aliases pointing to one DJ, the algorithms might work better.
You could try to solve this problem for this instance only (replacing "&" with "and", expanding "DJ" to "Disk Jokey", or ignoring "DJ" entirely, etc.). If your table only contains DJs, you could set up a bunch of rules like those.
If your table contains more diverse entries, you will have to go with a more structured approach. Could you give a sample of your dataset?
First of all, of course, the programming task (NLP etc., as mentioned) is interesting. But as mentioned, aiming to perfect it is overkill.
The other view is, as mentioned, the "social" one: who enters the data, who views it, how long and how correct should it be? So it's a naming-convention issue, and it reminds me of the great project musicbrainz.org. Should your site "just work", or do you prefer to follow standards? In the latter case I would orient myself along the MusicBrainz project, in case you haven't done that and haven't heard of it.
E.g., see here for Above & Beyond: they have one alias defined, and they use it to match user searches.
http://musicbrainz.org/show/artist/aliases.html?artistid=58438
Also check out the Artist_Alias page in the wiki.
The data model is worth a look, and there are even several API bindings to sync data, including in Python.
How about changing the model so that "alias" becomes a list of keys into another table that looks like this (skipping small words like "the", "and", etc.):
1 => Above;
2 => Beyond;
3 => Disk;
4 => Jokey;
Then when you want to insert a new record, just check how many of the significant words from the title are already in the table and match existing model entities. If it's more than 50% (for example), you may have a match, and you can show the list of candidates to the visitor and ask "did you mean one of these?".
Looks like fuzzywuzzy is a perfect match to your needs.
This article explains the reason it was set up, which very closely matches your requirements -- basically, to handle situations in which two different things were named slightly differently:
One of our most consistently frustrating issues is trying to figure out whether two ticket listings are for the same real-life event (that is, without enlisting the help of our army of interns).
…
To achieve this, we've built up a library of "fuzzy" string matching routines to help us along.
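A minimal sketch of how it is typically used; token_sort_ratio ignores word order, and process.extractOne picks the best match from a list of candidates (the sample data is made up):

from fuzzywuzzy import fuzz, process

print(fuzz.token_sort_ratio("DJ Above and Beyond", "Above & Beyond"))  # similarity score, 0-100

choices = ["Above & Beyond", "Armin van Buuren", "Tiesto"]
print(process.extractOne("Above and Beyond", choices))  # -> ('Above & Beyond', <score>)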
If you're only after artist names, or media-related names generally, it might be much better to just use the API of last.fm or EchoNest, as they already have a huge rule set and a huge database to draw on.
