I have a large text, and I want to parse it and identify the Wikipedia entries (e.g. article titles) that exist within it.
I thought of using regular expressions, something like:
pattern = 'New York|Barak Obama|Russian Federation|Olympic Games'
re.findall(pattern, text)
... etc., but this pattern would be millions of characters long, and re doesn't accept a pattern that large...
The other way I thought of was to tokenize my text and search the Wikipedia entries for each token, but that doesn't look very efficient, especially if my text is very big...
Any ideas how to do this in Python?
Another way would be to get all Wikipedia articles and pages and then use the sentence tokenizer from NLTK.
Put the resulting sentences, one by one, into a Lucene index, so that each sentence represents its own "document" in the index.
Then you can look up, for example, all sentences containing "Barak Obama" to find patterns in the sentences.
Access to Lucene is pretty fast; I myself use a Lucene index containing over 42,000,000 sentences from Wikipedia.
To get a clean Wikipedia text file, you can download Wikipedia as an XML dump from here: http://en.wikipedia.org/wiki/Wikipedia:Database_download
and then use the WikipediaExtractor from the Università di Pisa.
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
I would use NLTK to tokenize the text and look for valid Wikipedia entries among the tokens. If you don't want to hold the whole text in memory, you can work line by line or in chunks of text.
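A minimal sketch of that idea, assuming the known entry titles have already been loaded into a lowercased Python set (the file name and titles below are placeholders):

import nltk

# hypothetical set of known Wikipedia entry titles, lowercased for comparison
wiki_titles = {"new york", "russian federation", "olympic games"}

found = set()
with open("large_text.txt") as f:           # work line by line to limit memory use
    for line in f:
        tokens = nltk.word_tokenize(line)   # requires the NLTK 'punkt' data package
        # check single tokens and adjacent token pairs against the title set
        candidates = tokens + [" ".join(pair) for pair in nltk.bigrams(tokens)]
        for candidate in candidates:
            if candidate.lower() in wiki_titles:
                found.add(candidate)

print(found)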
Do you have to do this with Python? grep --fixed-strings is a good fit for what you want to do, and should do it fairly efficiently: http://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html#index-g_t_0040command_007bgrep_007d-programs-175
If you want to do it in pure Python, you'll probably have a tough time getting faster than:
for name in articles:
    if name in text:
        print('found name')
The algorithm used by fgrep is the Aho-Corasick algorithm, but a pure Python implementation of it is likely to be slow.
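If you do want Aho-Corasick from Python, one option is a C-backed extension such as the pyahocorasick package. A sketch, assuming that package is installed and reusing the articles list and text string from the loop above:

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for idx, name in enumerate(articles):   # 'articles' holds the known titles
    automaton.add_word(name, (idx, name))
automaton.make_automaton()              # build the Aho-Corasick automaton once

# scan the text in a single pass, reporting every title that occurs in it
for end_index, (idx, name) in automaton.iter(text):
    print('found', name, 'ending at offset', end_index)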
The Gensim library has a threaded iterator for the ~13 GB Wikipedia dump, so if you're after specific terms (n-grams) you can write a custom regex and process the text of each article. It may take a day of CPU time to do the search.
You may need to adjust the library if you're after the URI of the source.
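A rough sketch of that idea using Gensim's WikiCorpus (hedged: the exact class arguments and token types vary between Gensim versions, and the dump file name below is just a placeholder):

import re
from gensim.corpora.wikicorpus import WikiCorpus

# Gensim's preprocessing typically lowercases tokens, so match case-insensitively
pattern = re.compile(r'\b(new york|russian federation|olympic games)\b', re.IGNORECASE)

# streams articles out of the compressed dump without loading it all into memory
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})
for tokens in wiki.get_texts():         # each article arrives as a list of tokens
    article_text = ' '.join(tokens)
    for match in pattern.finditer(article_text):
        print(match.group())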
Related
I have a huge list of large spaCy documents and a list of words that I want to look up in the documents.
An example: I want to look up the word "Aspirin" in a website text, which was parsed with spaCy.
The list of keywords I want to look up is quite long.
Naive approach
Don't use spaCy at all and just use if keyword in website_text: as a simple matcher. Of course this has the downside that token boundaries are ignored, so a search for "test" will yield false positives on words like "tested", "attested", etc.
Use spaCy's matchers
Matchers are an option, but I would need to automatically build a lot of matchers based on my list of keywords.
Is there a recommended way to achieve this task?
I'd go with your naive approach, but you can use regular expressions to get a smarter match that won't pick up false positives.
For example, \b(test|aspirin)\b picks up the words "test" and "aspirin", but not "aspiring", "attested", or "testing". You could add other words inside the parentheses, separated by pipes, to pick up more keywords.
To actually apply that in Python code, you can use the re module; here's an example of it working.
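(A minimal sketch; the keyword list and sample text are made up for illustration.)

import re

keywords = ["test", "aspirin"]                    # illustrative keyword list
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keywords)) + r')\b',
                     re.IGNORECASE)

website_text = "Aspirin was tested on attested patients."   # made-up sample
print(pattern.findall(website_text))              # ['Aspirin'] - no false positives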
I have a big text file from which I want to count the occurrences of known phrases. I currently read the whole text file line by line into memory and use the 'find' function to check whether a particular phrase exists in the text file or not:
found = txt.find(phrase)
This is very slow for a large file. Building an index of all possible phrases and storing them in a dict would help, but the problem is that it's challenging to create all the meaningful phrases myself. I know that the Lucene search engine supports phrase search. When using Lucene to create an index for a text set, do I need to come up with my own tokenization method, especially for my phrase-search purpose above? Or does Lucene have an efficient way to automatically create an index for all possible phrases, without me having to worry about how to create the phrases?
My main purpose is to find a good way to count occurrences in a big text.
Summary: Lucene will take care of allocating higher matching scores to indexed text which more closely match your input phrases, without you having to "create all meaningful phrases" yourself.
Start Simple
I recommend you start with a basic Lucene analyzer, and see what effect that has. There is a reasonably good chance that it will meet your needs.
If that does not give you satisfactory results, then you can certainly investigate more specific/targeted analyzers/tokenizers/filters (for example if you need to analyze non-Latin character sets).
It is hard to be more specific without looking at the source data and the phrase matching requirements in more detail.
But, having said that, here are two examples (and I am assuming you have basic familiarity with how to create a Lucene index, and then query it).
All of the code is based on Lucene 8.4.
CAVEAT - I am not familiar with Python implementations of Lucene. So, with apologies, my examples are in Java - not Python (as your question is tagged). I would imagine that the concepts are somewhat translatable. Apologies if that's a showstopper for you.
A Basic Multi-Purpose Analyzer
Here is a basic analyzer - using the Lucene "service provider interface" syntax and a CustomAnalyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("asciiFolding")
        .build();
The above analyzer tokenizes your input text using Unicode whitespace rules, as encoded into the ICU libraries. It then standardizes on lowercase, and maps accents/diacritics/etc. to their ASCII equivalents.
An Example Using Shingles
If the above approach proves to be weak for your specific phrase matching needs (i.e. false positives scoring too highly), then one technique you can try is to use shingles as your tokens. Read more about shingles here (Elasticsearch has great documentation).
Here is an example analyzer using shingles, and using the more "traditional" syntax:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
...
// this method goes inside a class that extends StopwordAnalyzerBase (or Analyzer directly):
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream tokenStream = source;
    tokenStream = new LowerCaseFilter(tokenStream);
    tokenStream = new ASCIIFoldingFilter(tokenStream);
    // default shingle size is 2:
    tokenStream = new ShingleFilter(tokenStream);
    return new Analyzer.TokenStreamComponents(source, tokenStream);
}
In this example, the default shingle size is 2 (two words per shingle) - which is a good place to start.
Finally...
Even if you think this is a one-time exercise, it is probably still worth going to the trouble to build some Lucene indexes in a repeatable/automated way (which may take a while depending on the amount of data you have).
That way, it will be fast to run your set of known phrases against the index, to see how effective each index is.
I have deliberately not said anything about your ultimate objective ("to count occurrences"), because that part should be relatively straightforward, assuming you really do want to find exact matches for known phrases. It's possible I have misinterpreted your question - but at a high level I think this is what you need.
I'm working on getting Twitter trends using Tweepy in Python, and I'm able to fetch the world's top 50 trends, so as a sample I'm getting results like these:
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore non English words)
So here I need to parse every hashtag and convert it into proper English words. I also checked how people write hashtags and found the ways below:
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(sometimes hashtags have numbers as well)
So keeping all of these in mind, I thought that if I'm able to split the string below, then all of the above cases will be covered.
string = "pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub, but it's not giving me the desired results.
Regex is the wrong tool for the job. You need a clearly defined pattern in order to write a good regex, and in this case you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, THATSand and decide between "THATS and" and "THAT Sand".
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above: how do you differentiate between two (or more) perfectly valid ways to parse the same input? Now you'd need a trie of common sentences, built for each language you plan to parse, and you'd still need to worry about properly parsing the nonsensical tags Twitter often comes up with.
The question becomes, why do you need to split the string at all? I would recommend finding a way to omit this requirement, because it's almost certainly going to be easier to change the problem than it is to develop this particular solution.
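That said, if you do decide to attempt a best-effort split anyway, a regex along case and digit boundaries handles the specific example string from the question. This is only a sketch; it cannot resolve the ALL-CAPS ambiguities described above:

import re

string = "pleaseHelpMeSPLITThisString8989"

# split into ALL-CAPS runs (not followed by a lowercase letter), Capitalized words,
# lowercase runs, and digit runs
parts = re.findall(r'[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+', string)
print(parts)   # ['please', 'Help', 'Me', 'SPLIT', 'This', 'String', '8989']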
I downloaded the Wikipedia article titles file, which contains the name of every Wikipedia article. I need to search it for all the article titles that may be a possible match. For example, I might have the word "hockey", but the Wikipedia article for hockey that I would want is "Ice_hockey". It should be a case-insensitive search too.
I'm using Python; is there a more efficient way than just doing a line-by-line search? I'll be performing this search 500 or 1000 times per minute, ideally. If line by line is my only option, are there some optimizations I can do within it?
I think there are several million lines in the file.
Any ideas?
Thanks.
If you've got a fixed data set and variable queries, then the usual technique is to reorganise the data set into something that can be searched more easily. At an abstract level, you could break up each article title into individual lowercase words, and add each of them to a Python dictionary data structure. Then, whenever you get a query, convert the query word to lower case and look it up in the dictionary. If each dictionary entry value is a list of titles, then you can easily find all the titles that match a given query word.
This works for straightforward words, but you will have to consider whether you want to do matching on similar words, such as finding "smoking" when the query is "smoke".
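A minimal sketch of that index-then-lookup idea (the titles file name and the word-splitting rule are assumptions for illustration):

from collections import defaultdict
import re

# build the index once: map each lowercase word to the titles containing it
index = defaultdict(list)
with open("enwiki-all-titles-in-ns0") as f:        # assumed name of the titles dump
    for line in f:
        title = line.strip()
        for word in re.split(r'[_\W]+', title.lower()):
            if word:
                index[word].append(title)

# each query is then a single dictionary lookup
def lookup(query):
    return index.get(query.lower(), [])

print(lookup("hockey")[:5])   # would include "Ice_hockey" if it is in the file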
Greg's answer is good if you want to match on individual words. If you want to match on substrings you'll need something a bit more complicated, like a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree). Once constructed, a suffix tree can efficiently answer queries for arbitrary substrings, so in your example it could match "Ice_Hockey" when someone searched for "hock".
I'd suggest you put your data into an SQLite database and use the SQL LIKE operator for your searches.
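For example, a sketch using Python's built-in sqlite3 module (the database, table, and dump file names are made up; for real workloads you would also want an index or full-text search rather than a plain LIKE scan):

import sqlite3

conn = sqlite3.connect("titles.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT)")

# load the titles once, one title per line in the dump file
with open("enwiki-all-titles-in-ns0") as f:
    cur.executemany("INSERT INTO articles (title) VALUES (?)",
                    ((line.strip(),) for line in f))
conn.commit()

# LIKE is case-insensitive for ASCII text by default in SQLite
rows = cur.execute("SELECT title FROM articles WHERE title LIKE ?",
                   ("%hockey%",)).fetchall()
print(rows[:5])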
I have a simple question. I'm doing some light crawling, so new content arrives every few days. I've written a tokenizer and would like to use it for some text-mining purposes. Specifically, I'm using Mallet's topic modeling tool, and one of its pipes tokenizes the text into tokens before further processing can be done. With the amount of text in my database, it takes a substantial amount of time to tokenize the text (I'm using regex here).
As such, is it the norm to store the tokenized text in the database so that the tokenized data is readily available and tokenizing can be skipped when I need it for other text-mining purposes, such as topic modeling or POS tagging? What are the cons of this approach?
Caching Intermediate Representations
It's pretty normal to cache the intermediate representations created by slower components in your document processing pipeline. For example, if you needed dependency parse trees for all the sentences in each document, it would be pretty crazy to do anything except parsing the documents once and then reusing the results.
Slow Tokenization
However, I'm surprised that tokenization is really slow for you, since the stuff downstream of tokenization is usually the real bottleneck.
What package are you using to do the tokenization? If you're using Python and you wrote your own tokenization code, you might want to try one of the tokenizers included in NLTK (e.g., TreebankWordTokenizer).
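For instance, a quick sketch of calling NLTK's Treebank tokenizer (the sample sentence is made up):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
text = "The crawler fetched 1,200 new pages; tokenization shouldn't be the bottleneck."
print(tokenizer.tokenize(text))
# e.g. ['The', 'crawler', 'fetched', '1,200', 'new', 'pages', ';', 'tokenization', ...]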
Another good tokenizer, albeit one that is not written in Python, is the PTBTokenizer included with the Stanford Parser and the Stanford CoreNLP end-to-end NLP pipeline.
I store tokenized text in a MySQL database. While I don't always like the overhead of communicating with the database, I've found that there are lots of processing tasks I can ask the database to do for me (like searching the dependency parse trees for complex syntactic patterns).