Will a Lucene index help speed up counting occurrences? - python

I have a big text file from which I want to count the occurrences of known phrases. I currently read the whole text file line by line into memory and use the 'find' function to check whether a particular phrase exists in the text file or not:
found = txt.find(phrase)
This is very slow for a large file. Building an index of all possible phrases and storing them in a dict would help, but the problem is that it's challenging to create all meaningful phrases myself. I know that the Lucene search engine supports phrase search. When using Lucene to create an index for a text set, do I need to come up with my own tokenization method, especially for the phrase-search purpose above? Or does Lucene have an efficient way to automatically create an index for all possible phrases, without me having to worry about how to create the phrases?
My main purpose is to find a good way to count occurrences in a big text.
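For reference, here is a minimal sketch of the brute-force baseline described above (the file name and phrase list are placeholders), streaming the file and counting exact, non-overlapping matches with str.count:

from collections import Counter

phrases = ["climate change", "machine learning"]   # hypothetical known phrases
counts = Counter()

with open("big.txt", encoding="utf-8") as fh:
    for line in fh:                                 # stream instead of loading everything
        for phrase in phrases:
            counts[phrase] += line.count(phrase)    # exact, case-sensitive matches

print(counts.most_common())

Note that phrases spanning a line break are missed, and the cost is still proportional to (file size x number of phrases), which is what the answer below tries to get around.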

Summary: Lucene will take care of allocating higher matching scores to indexed text which more closely match your input phrases, without you having to "create all meaningful phrases" yourself.
Start Simple
I recommend you start with a basic Lucene analyzer, and see what effect that has. There is a reasonably good chance that it will meet your needs.
If that does not give you satisfactory results, then you can certainly investigate more specific/targeted analyzers/tokenizers/filters (for example if you need to analyze non-Latin character sets).
It is hard to be more specific without looking at the source data and the phrase matching requirements in more detail.
But, having said that, here are two examples (and I am assuming you have basic familiarity with how to create a Lucene index, and then query it).
All of the code is based on Lucene 8.4.
CAVEAT - I am not familiar with Python implementations of Lucene. So, with apologies, my examples are in Java - not Python (as your question is tagged). I would imagine that the concepts are somewhat translatable. Apologies if that's a showstopper for you.
A Basic Multi-Purpose Analyzer
Here is a basic analyzer - using the Lucene "service provider interface" syntax and a CustomAnalyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("asciiFolding")
        .build();
The above analyzer tokenizes your input text using the Unicode text-segmentation (word-break) rules implemented by the ICU libraries. It then standardizes the tokens to lowercase, and maps accents/diacritics/etc. to their ASCII equivalents.
An Example Using Shingles
If the above approach proves to be too weak for your specific phrase-matching needs (i.e. false positives scoring too highly), then one technique you can try is to use shingles as your tokens. Elasticsearch's documentation has a good introduction to shingles.
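To make the idea concrete, here is what two-word shingles look like, sketched in plain Python purely to illustrate the concept (in Lucene, the ShingleFilter produces these at the token-stream level):

def shingles(tokens, size=2):
    """Return overlapping word n-grams ('shingles') from a token list."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

print(shingles("the quick brown fox".split()))
# ['the quick', 'quick brown', 'brown fox']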
Here is an example analyzer using shingles, and using the more "traditional" syntax:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
...
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream tokenStream = source;
    tokenStream = new LowerCaseFilter(tokenStream);
    tokenStream = new ASCIIFoldingFilter(tokenStream);
    // default shingle size is 2 (two adjacent tokens per shingle):
    tokenStream = new ShingleFilter(tokenStream);
    return new Analyzer.TokenStreamComponents(source, tokenStream);
}
In this example, the default shingle size is 2 (two words per shingle) - which is a good place to start.
Finally...
Even if you think this is a one-time exercise, it is probably still worth going to the trouble to build some Lucene indexes in a repeatable/automated way (which may take a while depending on the amount of data you have).
That way, it will be fast to run your set of known phrases against the index, to see how effective each index is.
I have deliberately not said anything about your ultimate objective ("to count occurrences"), because that part should be relatively straightforward, assuming you really do want to find exact matches for known phrases. It's possible I have misinterpreted your question - but at a high level I think this is what you need.

Related

Split string using Regular Expression that includes lowercase, camelcase, numbers

I'm working on getting Twitter trends using tweepy in Python, and I'm able to find the world's top 50 trends. As a sample, I'm getting results like these:
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore the non-English words.)
So here I need to parse every hashtag and convert it into proper English words. I also checked how people write hashtags and found the following variations:
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(sometimes hashtags have numbers as well)
So keeping all these cases in mind, I thought that if I'm able to split the string below, then all of the above cases will be covered.
string ="pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub, but it is not giving me the desired results.
Regex is the wrong tool for the job. You need a clearly-defined pattern in order to write a good regex, and in this case, you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, THATSand and differentiate between "THATS and" and "THAT Sand".
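To illustrate the point (this sketch is mine, not part of the original answer): you can write a regex that happens to handle the example string, but it has to guess at ambiguous boundaries:

import re

# splits on lowercase runs, Capitalized words, ALL-CAPS runs and digit runs
pattern = r'[A-Z]{2,}(?=[A-Z][a-z]|\b)|[A-Z]?[a-z]+|[A-Z]+|\d+'

print(re.findall(pattern, "pleaseHelpMeSPLITThisString8989"))
# ['please', 'Help', 'Me', 'SPLIT', 'This', 'String', '8989']

print(re.findall(pattern, "THATSand"))
# ['THAT', 'Sand'] -- but "THATS", "and" is an equally valid reading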
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above - how do you differentiate between two (or more) perfectly valid ways to parse the same inputs? Now you'd need to get a trie of common sentences, build one for each language you plan to parse, and still need to worry about properly parsing the nonsensical tags twitter often comes up with.
The question becomes, why do you need to split the string at all? I would recommend finding a way to omit this requirement, because it's almost certainly going to be easier to change the problem than it is to develop this particular solution.

Building an index of term usage in python code

Brief version
I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).
Clarification: This is about indexing python source code, not English text that talks about python.
Background
My materials are in the form of ipython Notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".
I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.
The dumb solution: read the file, tokenize on whitespace and word boundaries, index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and of course attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"].
Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:
for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in tree.walk():
        ... # Get node's keyword, identifier etc., and line number -- how?
        print(term, source, line) # I do know how to make an index
So, is this a reasonable approach? Is there a better one? How should this be done?
Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.
I expect that your major difficulty here is not traversing the text, but in the pattern-matching to find these things. For instance, how do you recognize introducing for loops? This would be the word for "near" the word loop, with a for command "soon" after. That command would be a line beginning with for and ending with a colon.
That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator expression (both explicit and built-in).
Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.
Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.
Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)
Yes, AST is overkill -- internally. Externally -- it works, it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.
This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find each concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
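As a rough illustration of the ast-based approach (my own sketch, not from the answers above; source_files is assumed to be a list of paths already extracted from the notebooks):

import ast
from collections import defaultdict

occurrences = defaultdict(list)   # AST node type -> list of (file, line) locations

for source in source_files:
    with open(source) as fp:
        tree = ast.parse(fp.read(), filename=source)
    for node in ast.walk(tree):                  # note: ast.walk(tree), not tree.walk()
        name = type(node).__name__               # e.g. 'For', 'ListComp', 'Lambda'
        line = getattr(node, "lineno", None)     # some nodes (Module, operators) have no lineno
        occurrences[name].append((source, line))

for name, places in sorted(occurrences.items()):
    print(name, len(places), "uses; e.g.", places[0])

Note that ast.walk traverses breadth-first, so sort the recorded locations if you need the strict "first use" per file; mapping node types back to the exact keyword or builtin name is the part you would still tailor to your topic list.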

Extracting Wikipedia entries from text

I have a large text, and I want to parse this text and identify the entities (e.g. Wikipedia entries) that exist within it.
I thought of using regular expression, something like:
pattern = 'New York|Barak Obama|Russian Federation|Olympic Games'
re.findall(pattern, text)
... etc, but this would be millions of characters long, and re doesn't accept that...
The other way I thought about was to tokenize my text and search the Wikipedia entries for each token, but this doesn't look very efficient, especially if my text is very big...
Any ideas how to do this in Python?
Another way would be to get all Wikipedia articles and pages and then use the sentence tokenizer from NLTK.
Put the resulting sentences, sentence by sentence, into a Lucene index, so that each sentence represents its own "document" in the Lucene index.
Then you can look up, for example, all sentences with "Barak Obama" to find patterns in the sentences.
Access to Lucene is pretty fast; I myself use a Lucene index containing over 42,000,000 sentences from Wikipedia.
To get a clean Wikipedia text file, you can download Wikipedia as an XML file from here: http://en.wikipedia.org/wiki/Wikipedia:Database_download
and then use the WikipediaExtractor from the Università di Pisa.
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
I would use NLTK to tokenize the text and look for valid Wikipedia entries among the tokens. If you don't want to store the whole text in memory, you can work line by line or in chunks of text.
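A minimal sketch of that suggestion (assuming NLTK is installed and its tokenizer models have been downloaded; wiki_titles is a placeholder for whatever set of known entry names you have):

import nltk
# nltk.download('punkt')   # one-time download of the tokenizer models

wiki_titles = {"Python", "New York", "Olympic Games"}   # hypothetical known entries

found = set()
with open("big_text.txt", encoding="utf-8") as fh:
    for line in fh:                                     # stream instead of loading everything
        tokens = nltk.word_tokenize(line)
        # check single tokens and adjacent pairs against the known titles
        candidates = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        found.update(c for c in candidates if c in wiki_titles)

print(found)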
Do you have to do this with Python? grep --fixed-strings is a good fit for what you want to do, and should do it fairly efficiently: http://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html#index-g_t_0040command_007bgrep_007d-programs-175
If you want to do it in pure Python, you'll probably have a tough time getting faster than:
for name in articles:
    if name in text:
        print('found name')
The algorithm used by fgrep is called the Aho-Corasick algorithm, but a pure Python implementation of it is likely to be slow.
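If a C-backed implementation is acceptable, the third-party pyahocorasick package (my suggestion, not mentioned in the original answers) exposes exactly this algorithm:

import ahocorasick   # pip install pyahocorasick

names = ["New York", "Barak Obama", "Russian Federation", "Olympic Games"]

automaton = ahocorasick.Automaton()
for name in names:
    automaton.add_word(name, name)    # key to search for, value to report back
automaton.make_automaton()            # build the Aho-Corasick transitions

text = "Barak Obama visited New York before the Olympic Games."
for end_index, name in automaton.iter(text):
    print(name, "ends at offset", end_index)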
The Gensim library has a threaded iterator for the ~13GB Wikipedia dump, so if you're after specific terms (n-grams) you can write a custom regex and process each article of text. It may take a day of CPU power to do the search.
You may need to adjust the library if you're after the uri source.

Are there any radix/patricia/critbit trees for Python?

I have about 10,000 words used as a set of inverted indices to about 500,000 documents. Both are normalized so the index is a mapping of integers (word id) to a set of integers (ids of documents which contain the word).
My prototype uses Python's set as the obvious data type.
When I do a search for a document I find the list of N search words and their corresponding N sets. I want to return the set of documents in the intersection of those N sets.
Python's "intersect" method is implemented as a pairwise reduction. I think I can do better with a parallel search of sorted sets, so long as the library offers a fast way to get the next entry after i.
I've been looking for something like that for some time. Years ago I wrote PyJudy but I no longer maintain it and I know how much work it would take to get it to a stage where I'm comfortable with it again. I would rather use someone else's well-tested code, and I would like one which supports fast serialization/deserialization.
I can't find any, or at least not any with Python bindings. There is avltree, which does what I want, but since even the pairwise set merges take longer than I want, I suspect I want to have all my operations done in C/C++.
Do you know of any radix/patricia/critbit tree libraries written as C/C++ extensions for Python?
Failing that, what is the most appropriate library which I should wrap? The Judy Array site hasn't been updated in 6 years, with 1.0.5 released in May 2007. (Although it does build cleanly so perhaps It Just Works.)
(Edit: to clarify what I'm looking for from an API, I want something like:
def merge(document_sets):
    probe_i = 0
    probe_set = document_sets[probe_i]
    document_id = GET_FIRST(probe_set)
    while IS_VALID(document_id):
        # See if the document is present in all sets
        for i in range(1, len(document_sets)):
            # dynamically adapt to favor the least matching set
            target_i = (i + probe_i) % len(document_sets)
            target_set = document_sets[target_i]
            if document_id not in target_set:
                probe_i = target_i
                probe_set = document_sets[probe_i]
                document_id = GET_NEXT(probe_set, document_id)
                break
        else:
            yield document_id
            document_id = GET_NEXT(probe_set, document_id)
I'm looking for something which implements GET_NEXT() to return the next entry which occurs after the given entry. This corresponds to Judy1N and the similar entries for other Judy arrays.
This algorithm dynamically adapts to the data and should preferentially favor sets with few hits. For the type of data I work with, this has given a 5-10% increase in performance.)
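For illustration only (not part of the original question): over a plain sorted Python list, the GET_NEXT contract can be expressed with the standard-library bisect module; what is really wanted is a C-backed structure (Judy, radix/PATRICIA trie, etc.) that supports the same operation natively:

import bisect

def get_next(sorted_ids, document_id):
    """Return the smallest entry strictly greater than document_id, or None."""
    i = bisect.bisect_right(sorted_ids, document_id)
    return sorted_ids[i] if i < len(sorted_ids) else None

ids = [3, 7, 19, 42]
print(get_next(ids, 7))    # 19
print(get_next(ids, 42))   # None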
Yes, there are some, though I'm not sure whether they're suitable for your use case; it seems none of them are exactly what you asked for.
BioPython has a Trie implementation in C.
Ah, here's a nice discussion including benchmarks: http://bugs.python.org/issue9520
Other (some very stale) implementations:
http://pypi.python.org/pypi/radix
py-radix is an implementation of a radix tree data structure for the storage and retrieval of IPv4 and IPv6 network prefixes.
https://bitbucket.org/markon/patricia-tree/src
A Python implementation of a patricia tree.
http://pypi.python.org/pypi/trie
A prefix tree (trie) implementation.
http://pypi.python.org/pypi/logilab-common/0.50.3
patricia.py: A Python implementation of the PATRICIA trie (Practical Algorithm to Retrieve Information Coded in Alphanumeric).
I've recently added iteration support to datrie; you may give it a try.

Merging duplicates in a list? - Question is more complex than it seems

So I have a huge list of entries in a DB (MySQL).
I'm using Python and Django in the creation of my web application.
This is the base Django model I'm using:
class DJ(models.Model):
    alias = models.CharField(max_length=255)
    # other fields...
In my DB I now have duplicates,
e.g. Above & Beyond, Above and Beyond, Above Beyond, DJ Above and Beyond, Disk Jokey Above and Beyond, ...
This is a problem... as it blows a big hole in my DB and therefore my application.
I'm sure other people have encountered this problem and thought about it.
My ideas are the following:
Create a set of rules so a new entry cannot be created?
e.g. "DJ Above and Beyond" cannot be created because "Above & Beyond" is in the DB
Relate these aliases to each other somehow?
e.g. relate "DJ Above and Beyond" to "Above & Beyond"
I have literally no clue how to go about this; if someone could even just point me in a direction, that would be very helpful.
Any help would be very much appreciated! Thank you guys.
I guess you could do something based on Levenshtein distance, but there's no real way to do this automatically without creating a fairly complex rules-based system.
Unless you can define a rules system that can work out for any x and y whether x is a duplicate of y, you're going to have to deal with this in a fuzzy, human way.
Stack Overflow has a fairly decent way of dealing with this - warn users if something may be a duplicate, based on something like Levenshtein distance (and perhaps some kind of rules engine), and then allow a subset of your users to merge things as duplicates if other users ignore the warnings.
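As a rough sketch of the "warn on probable duplicates" idea (my own illustration, using the standard-library difflib ratio rather than true Levenshtein distance; the 0.6 threshold is an arbitrary placeholder):

from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity score between two alias strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

existing = ["Above & Beyond", "Armin van Buuren"]
candidate = "DJ Above and Beyond"

# warn if any existing alias looks suspiciously similar to the new one
suspects = [name for name in existing if similarity(candidate, name) > 0.6]
print(suspects)   # likely ['Above & Beyond']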
From the examples you give, it sounds like you have more a natural language problem than an exact matching problem. Given that natural language matching is inexact by nature you're unlikely to come up with a perfect solution.
String distance doesn't really work, as strings that are algorithmically close may not be semantically close (e.g. "DJ Above & Beyond" should match "Above and Beyond" but not "DJ Above & Beyond 2", which is closer in Levenshtein distance).
Some cheap alternatives to natural language parsing are soundex, which will match by phonetic sounds, and stemming, which removes prefixes/suffixes to normalize on word stems. I suppose you could create a linked list of word roots, but this wouldn't be terribly accurate either.
If this is an interactive program, you could echo "near misses" to the user, e.g. "Is one of these what you meant to enter?"
You could also normalize the entries in some way so that different entries map to the same normalized value (e.g. case normalization, "&" -> "and", etc., which some of the above suggestions might be a step towards) to find near misses or to map multiple inputs to a single value.
I'll add the caveat that my experience only applies to English; e.g. an English PorterStemmer won't recognize the one French title you put in there.
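Here is a minimal sketch of the normalization idea (my own illustration; the specific substitutions are only examples and would need tuning to your data):

import re

def normalize_alias(name):
    """Map different spellings of the same artist to one canonical key."""
    key = name.lower()
    key = key.replace("&", "and")
    key = re.sub(r"\b(dj|disk jokey)\b", "", key)   # drop common prefixes
    key = re.sub(r"[^a-z0-9]+", " ", key)           # strip punctuation
    return " ".join(key.split())                     # collapse whitespace

for alias in ["Above & Beyond", "DJ Above and Beyond", "Disk Jokey Above and Beyond"]:
    print(alias, "->", normalize_alias(alias))
# all three normalize to "above and beyond"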
I think this is more of a social problem than a programming problem. Any sort of programmatic solution to natural language processing like this is going to be buggy and error-prone. It's very hard to distinguish things that are close but legitimately different from the sort of undesired duplicates that you're talking about.
As Dominic mentioned, Stack Overflow's tagging system is a pretty good model for this. It provides cues to the user that encourage them to use existing tags if appropriate (drop down lists as the user types), it allows trusted users to retag individual questions, and it allows moderators to do mass retags.
This is really a process that has to have a person directly involved.
This is not a complete solution, but one thought I had:
class DJ(models.Model):
    pass  # other fields, no alias!

class DJAlias(models.Model):
    dj = models.ForeignKey(DJ)
    alias = models.CharField(max_length=255)  # same alias field as in the original DJ model
This would allow you to have several aliases for the same DJ.
But you will still need to find a proper way to ensure the aliases are added to the right DJ; see Dominic's post.
However, if you check a new alias against the several other aliases pointing to one DJ, the algorithms might work better.
You could try to solve this problem for this instance only (replacing "&" with "and", replacing "DJ" with "Disk Jokey" or ignoring "DJ", etc.). If your table only contains DJs, you could set up a bunch of rules like those.
If your table contains more diverse stuff you will have to go with a more structural approach. Could you give a sample of your dataset?
First of all, the programming task (NLP etc., as mentioned) is of course interesting. But as also mentioned, aiming to perfect it is overkill.
The other view, as mentioned, is the "social" one: who enters the data, who views it, and how long and how correct should it be? So it's a naming-convention issue, and it reminds me of the great project musicbrainz.org. Should your site "just work", or do you prefer to follow standards? In the latter case I would orient myself along the MusicBrainz project, in case you haven't looked at it or heard of it.
i.e. see here for Above & Beyond: they have one alias defined, and they use it to match user searches.
http://musicbrainz.org/show/artist/aliases.html?artistid=58438
Also check out the Artist_Alias page in the wiki.
The data model is worth a look, and there are even several API bindings to sync data, including in Python.
How about changing the model so that "alias" becomes a list of keys into another table that looks like this (skipping small words like "the", "and", etc.):
1 => Above;
2 => Beyond;
3 => Disk;
4 => Jokey;
Then, when you want to insert a new record, just check how many of the significant words from the title are already in the table and match currently existing model entities. If more than 50% (for example), you may have a match, and you can show a list of the candidates to the visitor, asking "do you mean one of these?".
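A rough sketch of that word-overlap check (my own illustration; the stop-word list and the 50% threshold are arbitrary placeholders):

STOP_WORDS = {"the", "and", "of"}    # "small words" to skip; extend as needed

def significant_words(title):
    return {w for w in title.lower().split() if w not in STOP_WORDS}

def looks_like_duplicate(new_title, existing_title, threshold=0.5):
    """True if at least threshold of the new title's significant words already exist."""
    new_words = significant_words(new_title)
    overlap = new_words & significant_words(existing_title)
    return bool(new_words) and len(overlap) / len(new_words) >= threshold

print(looks_like_duplicate("DJ Above and Beyond", "Above & Beyond"))   # True (2 of 3 words)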
Looks like fuzzywuzzy is a perfect match for your needs.
This article explains the reason it was set up, which very closely matches your requirements -- basically, to handle situations in which two different things were named slightly differently:
One of our most consistently frustrating issues is trying to figure out whether two ticket listings are for the same real-life event (that is, without enlisting the help of our army of interns).
…
To achieve this, we've built up a library of "fuzzy" string matching routines to help us along.
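For example (my own sketch of fuzzywuzzy usage, assuming the package is installed):

from fuzzywuzzy import fuzz, process   # pip install fuzzywuzzy

print(fuzz.token_sort_ratio("DJ Above and Beyond", "Above & Beyond"))    # high score
print(fuzz.token_sort_ratio("DJ Above and Beyond", "Armin van Buuren"))  # low score

# or pick the closest existing alias for a new entry
print(process.extractOne("DJ Above and Beyond", ["Above & Beyond", "Armin van Buuren"]))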
If you're only after artist names, or media-related names in general, it might be much better to just use the API of last.fm or echonest, as they already have a huge rule set and a huge database to settle on.
