Brief version
I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).
Clarification: This is about indexing python source code, not English text that talks about python.
Background
My materials are in the form of ipython Notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".
I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.
The dumb solution: Read the file, tokenize on whitespace and word boundaries, index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and of course attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"].
Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:
import ast

for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in ast.walk(tree):
        ...  # Get the node's keyword, identifier etc., and line number -- how?
        print(term, source, line)  # I do know how to make an index
So, is this a reasonable approach? Is there a better one? How should this be done?
Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.
I expect that your major difficulty here is not traversing the text, but in the pattern-matching to find these things. For instance, how do you recognize introducing for loops? This would be the word for "near" the word loop, with a for command "soon" after. That command would be a line beginning with for and ending with a colon.
That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator comprehension (both explicit and built-in).
Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.
Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.
Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)
Yes, AST is overkill -- internally. Externally -- it works, it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.
This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find each concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
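A rough sketch of that traversal using Python's own ast module (the concept-to-node mapping here is a placeholder I made up; you would build it from your input list):

import ast

# Placeholder mapping from ast node classes to the concepts they signal.
WANTED = {ast.For: "for loop", ast.ListComp: "list comprehension",
          ast.Lambda: "lambda", ast.With: "with statement"}

first_seen = {}   # concept -> (source file, line number)

def index_file(source, wanted):
    with open(source) as fp:
        tree = ast.parse(fp.read())
    for node in ast.walk(tree):
        concept = wanted.get(type(node))
        if concept is not None:
            first_seen[concept] = (source, node.lineno)
            del wanted[type(node)]          # stop looking once a concept is indexed
        if isinstance(node, ast.Name):      # builtins and identifiers appear as Name nodes
            first_seen.setdefault(node.id, (source, node.lineno))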
Related
I'd like to replace assembly instructions within the code (.text) section of a PE image with semantically equivalent instructions that are of the same length. An example would be replacing an "add 5" with a "sub -5", though certainly I have much more elaborate plans than this.
I've found that LIEF is great for working with higher-level features in a PE file and can give you a dump of the code section data. The same seems to apply for the pefile library. They don't seem to be great tools for manipulating individual instructions though, unless I'm missing something.
The goal would be to perform some initial disassembly and loop through the instructions to locate interesting or desirable ones (e.g., the add instruction from above). Then I'd prefer not to have to figure out all of the opcodes and generate bytes by hand. If possible, I'd be looking for something more developer-friendly that computes it for you (e.g., this.instr = 'add' and this.op1 = '5'). Ideally it wouldn't try to re-assemble the entire section after the change is made... this could cause other differences to occur, and my use case requires that this individual instruction be the only change present from a bit-level perspective. Again, the idea is that I'd only be selecting semantically equivalent instructions whose lengths are equal, which is what would allow this scenario to occur without re-assembling from scratch.
How can I do something like this using Python?
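For concreteness, here is a rough sketch of the workflow I have in mind, assuming the capstone (disassembler) and keystone (assembler) packages are the right tools for the instruction-level part; the add/sub example and the section handling are illustrative only:

import capstone
import keystone

md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_32)
ks = keystone.Ks(keystone.KS_ARCH_X86, keystone.KS_MODE_32)

def patch_equivalent(code_bytes, section_rva):
    # code_bytes / section_rva would come from LIEF's or pefile's view of .text
    patched = bytearray(code_bytes)
    for insn in md.disasm(bytes(code_bytes), section_rva):
        if insn.mnemonic == "add" and insn.op_str.endswith(", 5"):
            reg = insn.op_str.split(",")[0]
            # assemble the semantically equivalent replacement at the same address
            encoding, _ = ks.asm(f"sub {reg}, -5", insn.address)
            if encoding and len(encoding) == insn.size:   # only patch if the length matches
                offset = insn.address - section_rva
                patched[offset:offset + insn.size] = bytes(encoding)
    return bytes(patched)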
I have a big text file from which I want to count the occurrences of known phrases. I currently read the whole text file line by line into memory and use the 'find' function to check whether a particular phrase exists in the text file or not:
found = txt.find(phrase)
This is very slow for a large file. Building an index of all possible phrases and storing them in a dict would help, but the problem is that it's challenging to create all meaningful phrases myself. I know that the Lucene search engine supports phrase search. When using Lucene to create an index for a text set, do I need to come up with my own tokenization method, especially for my phrase-search purpose above? Or does Lucene have an efficient way to automatically create an index for all possible phrases without me having to worry about how to create the phrases?
My main purpose is to find a good way to count occurrences in a big text.
Summary: Lucene will take care of allocating higher matching scores to indexed text which more closely matches your input phrases, without you having to "create all meaningful phrases" yourself.
Start Simple
I recommend you start with a basic Lucene analyzer, and see what effect that has. There is a reasonably good chance that it will meet your needs.
If that does not give you satisfactory results, then you can certainly investigate more specific/targeted analyzers/tokenizers/filters (for example if you need to analyze non-Latin character sets).
It is hard to be more specific without looking at the source data and the phrase matching requirements in more detail.
But, having said that, here are two examples (and I am assuming you have basic familiarity with how to create a Lucene index, and then query it).
All of the code is based on Lucene 8.4.
CAVEAT - I am not familiar with Python implementations of Lucene. So, with apologies, my examples are in Java - not Python (as your question is tagged). I would imagine that the concepts are somewhat translatable. Apologies if that's a showstopper for you.
A Basic Multi-Purpose Analyzer
Here is a basic analyzer - using the Lucene "service provider interface" syntax and a CustomAnalyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("asciiFolding")
        .build();
The above analyzer tokenizes your input text using Unicode whitespace rules, as encoded into the ICU libraries. It then standardizes on lowercase, and maps accents/diacritics/etc. to their ASCII equivalents.
An Example Using Shingles
If the above approach proves to be weak for your specific phrase matching needs (i.e. false positives scoring too highly), then one technique you can try is to use shingles as your tokens. Read more about shingles here (Elasticsearch has great documentation).
Here is an example analyzer using shingles, and using the more "traditional" syntax:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
...
StopwordAnalyzerBase.TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream tokenStream = source;
    tokenStream = new LowerCaseFilter(tokenStream);
    tokenStream = new ASCIIFoldingFilter(tokenStream);
    // default shingle size is 2:
    tokenStream = new ShingleFilter(tokenStream);
    return new Analyzer.TokenStreamComponents(source, tokenStream);
}
In this example, the default shingle size is 2 (two words per shingle) - which is a good place to start.
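To make that concrete, here is a tiny illustration in plain Python (not Lucene code) of what word-level shingles of size 2 look like:

def shingles(tokens, size=2):
    # each shingle is `size` consecutive words joined back into one token
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

print(shingles("please count these known phrases".split()))
# ['please count', 'count these', 'these known', 'known phrases']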
Finally...
Even if you think this is a one-time exercise, it is probably still worth going to the trouble to build some Lucene indexes in a repeatable/automated way (which may take a while depending on the amount of data you have).
That way, it will be fast to run your set of known phrases against the index, to see how effective each index is.
I have deliberately not said anything about your ultimate objective ("to count occurrences"), because that part should be relatively straightforward, assuming you really do want to find exact matches for known phrases. It's possible I have misinterpreted your question - but at a high level I think this is what you need.
I'm working on getting Twitter trends using tweepy in Python, and I'm able to find the world's top 50 trends. As a sample, I'm getting results like these:
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore the non-English words.)
So here I need to parse every hashtag and convert it into proper English words. I also checked how people write hashtags and found the ways below:
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(sometimes hashtags have numbers as well)
So keeping all of these in mind, I thought that if I'm able to split the string below, then all of the above cases will be covered.
string = "pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub, but it is not giving me the desired results.
Regex is the wrong tool for the job. You need a clearly-defined pattern in order to write a good regex, and in this case, you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, THATSand and differentiate between "THATS and" and "THAT Sand".
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above - how do you differentiate between two (or more) perfectly valid ways to parse the same input? Now you'd need a trie of common sentences, build one for each language you plan to parse, and still need to worry about properly parsing the nonsensical tags Twitter often comes up with.
The question becomes, why do you need to split the string at all? I would recommend finding a way to omit this requirement, because it's almost certainly going to be easier to change the problem than it is to develop this particular solution.
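That said, for the specific camel-case example in the question a regex does recover the desired split; the point is that it can't do anything sensible with tags like #thisisawesome or the THATSand case above:

import re

string = "pleaseHelpMeSPLITThisString8989"
# capital runs not followed by lowercase, Capitalized words, lowercase runs, digit runs
parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+", string)
print(parts)  # ['please', 'Help', 'Me', 'SPLIT', 'This', 'String', '8989']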
For a single large text (~4GB) I need to search for ~1 million phrases and replace them with complementary phrases. Both the raw text and the replacements can easily fit in memory. The naive solution literally takes years to finish, as a single replacement takes about a minute.
Naive solution:
for search, replace in replacements.iteritems():
    text = text.replace(search, replace)
The regex method using re.sub is about 10x slower:
for search, replace in replacements.iteritems():
    text = re.sub(search, replace, text)
At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick; but these methods, as they are generally implemented, only work for searching the string and not also replacing it.
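For the searching half, something like the pyahocorasick package (an assumption on my part; I haven't verified it is the right fit) can at least report every phrase location in a single pass over the text:

import ahocorasick  # the pyahocorasick package

automaton = ahocorasick.Automaton()
for search, replace in replacements.items():
    automaton.add_word(search, (search, replace))
automaton.make_automaton()

# iter() yields the index of the last character of each match
locations = []
for end_index, (search, replace) in automaton.iter(text):
    start = end_index - len(search) + 1
    locations.append((start, end_index + 1, replace))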
Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.
Thanks!
There's probably a better way than this:
re.sub('|'.join(map(re.escape, replacements)), lambda match: replacements[match.group()], text)  # re.escape treats the phrases as literal text
This does one search pass, but it's not a very efficient search. The re2 module may speed this up dramatically.
Outside of python, sed is usually used for this sort of thing.
For example (taken from here), to replace the word ugly with beautiful in the file sue.txt:
sed -i 's/ugly/beautiful/g' /home/bruno/old-friends/sue.txt
You haven't posted any profiling of your code; you should try some timings before you do any premature optimization. Searching and replacing text in a 4GB file is a computationally-intensive operation.
ALTERNATIVE
Ask: should I be doing this at all?
You discuss below doing an entire search and replace of the Wikipedia corpus in under 10ms. This rings some alarm bells, as it doesn't sound like great design. Unless there's an obvious reason not to, you should modify whatever code you use to present and/or load the data so that it does the search and replace on the subset being loaded/viewed. It's unlikely you'll be doing many operations on the entire 4GB of data, so restrict your search and replace operations to what you're actually working on. Additionally, your timing is still very imprecise because you don't know how big the file you're working on is.
On a final point, you note that:
the speedup has to be algorithmic not chaining millions of sed calls
But you indicated that the data you're working with was a "single large text (~4GB)", so there shouldn't be any chaining involved, if I understand what you mean by that correctly.
UPDATE:
Below you indicate that the operation on a ~4KB file (I'm assuming) takes 90s. This seems very strange to me - sed operations don't normally take anywhere close to that. If the file is actually 4MB (I'm hoping), then scaling that 90s up to the full ~4GB works out to roughly 24 hours to evaluate (not ideal, but probably acceptable?).
I had this use case as well, where I needed to do ~100,000 search and replace operations on the full text of Wikipedia. Using sed, awk, or perl would take years. I wasn't able to find any implementation of Aho-Corasick that did search-and-replace, so I wrote my own: fsed. This tool happens to be written in Python (so you can hack into the code if you like), but it's packaged up as a command line utility that runs like sed.
You can get it with:
pip install fsed
"they are generally implemented only work for searching the string and not also replacing it"
Perfect, that's exactly what you need. Searching with an ineffective algorithm in a 4GB text is bad enough, but doing several replacements is probably even worse... you potentially have to move gigabytes of text to make space for the expansion/shrinking caused by the size difference of the source and target text.
Just find the locations, then join the pieces with the replacements parts.
So a dumb analogy would be "_".join("a b c".split(" ")), but of course you don't want to create copies the way split does.
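A minimal sketch of that idea in plain Python, using str.find for clarity (a real implementation would get the match locations from something like Aho-Corasick in a single pass):

def replace_all(text, replacements):
    # collect (start, end, replacement) for every match of every phrase
    hits = []
    for search, replace in replacements.items():
        start = text.find(search)
        while start != -1:
            hits.append((start, start + len(search), replace))
            start = text.find(search, start + len(search))
    hits.sort()
    # stitch the untouched pieces and the replacements together in one pass
    pieces, last = [], 0
    for start, end, replace in hits:
        if start < last:            # skip matches overlapping an earlier one
            continue
        pieces.append(text[last:start])
        pieces.append(replace)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)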
Note: any reason to do this in python?
I need a module or strategy for detecting that a piece of data is written in a programming language, not syntax highlighting where the user specifically chooses a syntax to highlight. My question has two levels, I would greatly appreciate any help, so:
Is there any package in Python that receives a string (a piece of data) and returns whether it belongs to any programming language's syntax?
I don't necessarily need to recognize the specific syntax, just to know whether the string is source code at all.
Any clues are deeply appreciated.
Maybe you can use existing multi-language syntax highlighters. Many of them can detect the language a file is written in.
You could have a look at methods based on Bayesian filtering.
My answer somewhat depends on the amount of code you're going to be given. If you're given 30+ lines of code, it should be fairly easy to identify some unique features of each language that are fairly common. For example, tell the program that if anything matches an expression like from * import * then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things that are usually slightly different include class definitions (Python always starts with 'class', while C starts with the return type, so you could check for a line that starts with a data type and has the formatting of a method declaration), the formatting of conditionals, and so on. If you wanted to make it more accurate, you could introduce some sort of weighting system: features that are more unique and less likely to be the result of a mismatched regexp get a higher weight, things that are commonly mismatched get a lower weight for that language, and you calculate which language has the highest composite score at the end. You could also define features that you feel are 100% unique, and tell it that as soon as it hits one of those, to stop parsing because it knows the answer (things like the shebang line).
This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people that do know unique structures that would help.
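A toy sketch of that weighting idea (the feature table is invented for illustration; you would fill it in with the features you identify):

import re

# Invented example features: (compiled pattern, weight) per language.
FEATURES = {
    "python": [
        (re.compile(r"^\s*from\s+\w+\s+import\s+", re.M), 3),
        (re.compile(r"^\s*def\s+\w+\(.*\):", re.M), 2),
        (re.compile(r"^#!.*\bpython", re.M), 10),   # near-certain feature
    ],
    "c": [
        (re.compile(r"^\s*#include\s*<\w+\.h>", re.M), 10),
        (re.compile(r"\bint\s+main\s*\(", re.M), 3),
    ],
}

def guess_language(source):
    # sum the weights of the features that appear, then pick the best-scoring language
    scores = {lang: sum(weight for pattern, weight in features if pattern.search(source))
              for lang, features in FEATURES.items()}
    return max(scores, key=scores.get)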
If you're given fewer than 30 or so lines of code, your answers from parsing like that are going to be far less accurate. In that case, the easiest way to do it would probably be to take an approach similar to Travis, and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in, they are errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.