Normalize all words in a document - python

I need to normalize all words in a huge corpus. Any ideas how to optimize this code? It's too slow...
texts = [ [ list(morph.normalize(word.upper()))[0] for word in document.split() ]
for document in documents ]
documents is a list of strings, where each string is a single book's text.
morph.normalize works only on upper-case input, so I apply .upper() to all words. Moreover, it returns a set with one element, which is the normalized word (a string).

The first and obvious thing I'd do would be to cache the normalized words in a local dict, so as to avoid calling morph.normalize() more than once for a given word.
A second optimization is to alias methods to local variables - this avoids going through the whole attribute lookup + function descriptor invocation + method object instantiation on each turn of the loop.
Then, since it's a "huge" corpus, you probably want to avoid creating a full list of lists at once, which might eat all your RAM, make your computer start to swap (which is guaranteed to make it snail slow) and finally crash with a memory error. I don't know what you're supposed to do with this list of lists nor how huge each document is, but as an example I iterate over the per-document results and write them to stdout - what should really be done depends on the context and concrete use case.
NB: untested code, obviously, but at least this should get you started
def iterdocs(documents, morph):
    # keep track of already normalized words
    # beware this dict might get too big if you
    # have a lot of different words. Depending on
    # your corpus, you may want to either use a LRU
    # cache instead and/or use a per-document cache
    # and/or any other appropriate caching strategy...
    cache = {}
    # aliasing methods as local variables
    # is faster for tight loops
    normalize = morph.normalize
    def norm(word):
        upw = word.upper()
        if upw in cache:
            return cache[upw]
        nw = cache[upw] = normalize(upw).pop()
        return nw
    for doc in documents:
        words = [norm(word) for word in doc.split() if word]
        yield words
for text in iterdocs(documents, morph):
    # if you need all the texts for further use
    # at least write them to disk or other persistence
    # mean and re-read them when needed.
    # Here I just write them to sys.stdout as an example
    print(text)
Also, I don't know where you get your documents from, but if they are text files, you may want to avoid loading them all in memory. Just read them one by one, and if they are themselves huge, don't even read a whole file at once (you can iterate over a file line by line - the most obvious choice for text).
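As a rough illustration of the line-by-line advice, here is a minimal sketch combined with a bounded cache. The function name iter_normalized_lines, the paths argument and the cache size are assumptions made for the example; normalize stands in for morph.normalize and is assumed to take an upper-case word and return a one-element set, as described in the question.

```python
from functools import lru_cache


def iter_normalized_lines(paths, normalize, cache_size=100_000):
    """Yield one list of normalized words per line, reading files lazily."""
    # `normalize` is a stand-in for morph.normalize (assumption): it takes an
    # upper-case word and returns a set holding a single normalized string.
    @lru_cache(maxsize=cache_size)
    def norm(upper_word):
        return next(iter(normalize(upper_word)))

    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:  # only one line is held in memory at a time
                yield [norm(word.upper()) for word in line.split()]
```

The bounded LRU cache keeps memory use predictable when the vocabulary is very large, at the cost of occasionally re-normalizing a rare word.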
Finally, once you've made sure your code doesn't eat up too much memory for a single document, the next obvious optimisation is parallelisation - run a process per available core and split the corpus between processes (each writing its results to a given place). Then you just have to combine the results if you need them all at once...
Oh and yes: if that's still not enough you may want to distribute the work with some map-reduce framework - your problem looks like a perfect fit for map-reduce.

Related

Is there a way to stop creation of the vocabulary in gensim.WikiCorpus when it reaches 2000000 tokens?

I downloaded the latest wiki dump (multi-stream bz2). I call the WikiCorpus class from gensim.corpora, and after 90000 documents the vocabulary reaches its maximum size (2000000 tokens).
I got this in terminal:
keeping 2000000 tokens which were in no less than 0 and no more than 580000 (=100.0%) documents
resulting dictionary: Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
adding document #580000 to Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
The WikiCorpus class continues to work until the end of the documents in my bz2.
Is there a way to stop it, or to split the bz2 file into a smaller sample?
thanks for help!
There's no specific parameter to cap the number of tokens. But when you use WikiCorpus.get_texts(), you don't have to read them all: you can stop at any time.
If, as suggested by another question of yours, you plan to use the article texts for Gensim Word2Vec (or a similar model), you don't need the constructor to do its own expensive full-scan vocabulary-discovery. If you supply any dummy object (such as an empty dict) as the optional dictionary parameter, it'll skip this unnecessary step. EG:
wiki_corpus = WikiCorpus(filename, dictionary={})
If you also want to use some truncated version of the full set of articles, I'd suggest manually iterating over just a subset of the articles. For example if the subset will easily fit as a list in RAM, say 50000 articles, that's as simple as:
import itertools
subset_corpus = list(itertools.islice(wiki_corpus, 50000))
If you want to create a subset larger than RAM, iterate over the set number of articles, writing their tokenized texts to a scratch text file, one per line. Then use that file as your later input. (By spending the WikiCorpus extraction/tokenization effort only once, then reusing the file from disk, this can sometimes offer a performance boost even if you don't need to do it.)
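The scratch-file approach could look roughly like the sketch below. It is written against a generic iterable of token lists rather than WikiCorpus itself; the helper names dump_token_lists and read_token_lists are made up for the example, and it assumes each article arrives as a list of str tokens (as an iterable such as wiki_corpus.get_texts() would be expected to yield in recent Gensim versions).

```python
import itertools


def dump_token_lists(token_lists, out_path, limit):
    """Write the first `limit` token lists to `out_path`, one article per line."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # islice stops pulling from the source after `limit` articles
        for tokens in itertools.islice(token_lists, limit):
            out.write(" ".join(tokens) + "\n")
            written += 1
    return written


def read_token_lists(path):
    """Stream the token lists back from disk, one article at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()
```

Because read_token_lists is a generator function, it has to be called again for each pass over the data; a model that needs several passes would want a small restartable iterable wrapped around it.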

Add progress bar (verbose) when creating gensim dictionary

I want to create a gensim dictionary from the rows of a dataframe. Each entry of df.preprocessed_text is a list of words.
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary
def create_dict(df, bigram=True, min_occ_token=3):
    token_ = df.preprocessed_text.values
    if not bigram:
        return Dictionary(token_)
    bigram = Phrases(token_,
                     min_count=3,
                     threshold=1,
                     delimiter=b' ')
    bigram_phraser = Phraser(bigram)
    bigram_token = []
    for sent in token_:
        bigram_token.append(bigram_phraser[sent])
    dictionary = Dictionary(bigram_token)
    dictionary.filter_extremes(no_above=0.8, no_below=min_occ_token)
    dictionary.compactify()
    return dictionary
I couldn't find a progress bar option for it, and the callbacks don't seem to work for it either. Since my corpus is huge, I would really appreciate a way to show the progress. Is there one?
I'd recommend against changing prune_at for monitoring purposes, as it changes the behavior around which bigrams/words are remembered, possibly discarding many more than is strictly required for capping memory usage.
Wrapping tqdm around the iterables used (including the token_ use in the Phrases constructor and the bigram_token use in the Dictionary constructor) should work.
Alternatively, enabling INFO or greater logging should display logging that, while not as pretty/accurate as a progress-bar, will give some indication of progress.
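For the logging route, a minimal sketch using only the standard library is shown here; the format string is just one reasonable choice, and force=True assumes Python 3.8 or later.

```python
import logging

# Route INFO-level messages (including Gensim's progress lines) to the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,  # Python 3.8+: replace any handlers configured earlier
)
```

This needs to run before the Phrases and Dictionary objects are built so that their progress messages are not dropped.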
Further, if as shown in the code, the use of bigram_token is only to support the next Dictionary, it need not be created as a full in-memory list. You should be able to just use layered iterators to transform the text, & tally the Dictionary, item-by-item. EG:
# ...
dictionary = Dictionary(tqdm(bigram_phraser[token_]))
# ...
(Also, if you're only using the Phraser once, you may not be getting any benefit from creating it at all - it's an optional memory optimization for when you want to keep applying the same phrase-creation operation without the full overhead of the original Phrases survey object. But if the Phrases is still in-scope, and all of it will be discarded immediately after this step, it might be just as fast to use the Phrases object directly without ever taking a detour to create the Phraser - so give that a try.)

Handling Memory Error when dealing with really large number of words (>100 million) for LDA analysis

I have 50,000k files - that have a combined total of 162 million words. I wanted to do topic modelling using Gensim similar to this tutorial here
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
So, I have these files read into a pandas dataframe (the 'content' column has the text) and do the following to create a list of the texts. (Image of dataframe attached here.)
texts = [[word for word in row[1]['content'].lower().split() if word not in stopwords] for row in df.iterrows()]
However, I have been running into a memory error, because of the large word count.
I also tried TfidfVectorizer from scikit-learn and got a memory error for this too.
def simple_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words
vectorizer = TfidfVectorizer(use_idf=True, tokenizer=simple_tokenizer, stop_words='english')
X = vectorizer.fit_transform(df['content'])
How do I handle tokenizing these really long documents in a way it can be processed for LDA Analysis?
I have an i7, 16GB Desktop if that matters.
EDIT
Since Python was unable to store really large lists, I rewrote the code to read each file (originally stored as HTML), convert it to text, create a text vector, append it to a list, and then send it to the LDA code. It worked!
So, LDA requires one to tokenize the documents into words and then
create a word frequency dictionary.
If the only output you need from this is a dictionary with the word count, I would do the following:
Process files one by one in a loop. This way you store only one file in memory. Process it, then move to the next one:
# for all files in your directory/directories:
with open(current_file, 'r') as f:
    for line in f:
        # your logic to update the dictionary with the word count
# here the file is closed and the loop moves to the next one
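Filled in, the loop above might look like the following sketch. The function name count_words and the stopwords parameter are assumptions for illustration, and the tokenization is the same naive lower().split() used in the question.

```python
from collections import Counter


def count_words(paths, stopwords=frozenset()):
    """Build a word-frequency Counter while holding only one line in memory."""
    counts = Counter()
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                # update the running tally with this line's words
                counts.update(w for w in line.lower().split() if w not in stopwords)
        # here the file is closed and the loop moves to the next one
    return counts
```

Only the Counter itself grows with the corpus; its size depends on the number of distinct words, not on the 162 million tokens.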
EDIT: When it comes to keeping a really large dictionary in memory, remember that Python reserves a lot of memory to keep the dict sparse - the price of fast lookups. Therefore, you may need to look for another way of storing the key-value pairs, e.g. a list of tuples, at the cost of much slower lookups. This question is about that and has some nice alternatives described there.

Size issues with Python shelve module

I want to store a few dictionaries using the shelve module, however, I am running into a problem with the size. I use Python 3.5.2 and the latest shelve module.
I have a list of words and I want to create a map from the bigrams (character level) to the words. The structure will look something like this:
'aa': 'aardvark', 'and', ...
'ab': 'absolute', 'dab', ...
...
I read in a large file consisting of approximately 1.3 million words. So the dictionary gets pretty large. This is the code:
self.bicharacters // part of class
def _create_bicharacters(self):
    '''
    Creates a bicharacter index for calculating Jaccard coefficient.
    '''
    with open('wordlist.txt', encoding='ISO-8859-1') as f:
        for line in f:
            word = line.split('\t')[2]
            for i in range(len(word) - 1):
                bicharacter = (word[i] + word[i+1])
                if bicharacter in self.bicharacters:
                    get = self.bicharacters[bicharacter]
                    get.append(word)
                    self.bicharacters[bicharacter] = get
                else:
                    self.bicharacters[bicharacter] = [word]
When I ran this code using a regular Python dictionary, I did not run into issues, but I can't spare those kinds of memory resources due to the rest of the program also having quite a large memory footprint.
So I tried using the shelve module. However, when I run the code above using shelve the program stops after a while due to no more memory on disk, the shelve db that was created was around 120gb, and it had still not read even half the 1.3M word list from the file. What am I doing wrong here?
The problem here is not so much the number of keys, but that each key references a list of words.
While in memory as one (huge) dictionary, this isn't that big a problem as the words are simply shared between the lists; each list is simply a sequence of references to other objects and here many of those objects are the same, as only one string per word needs to be referenced.
In shelve, however, each value is pickled and stored separately, meaning that a concrete copy of the words in a list will have to be stored for each value. Since your setup ends up adding a given word to a large number of lists, this multiplies your data needs rather drastically.
I'd switch to using a SQL database here. Python comes bundled with sqlite3. If you create one table for individual words, a second table for each possible bigram, and a third that simply links between the two (a many-to-many mapping, linking bigram row id to word row id), this can be done very efficiently. You can then do very efficient lookups, as SQLite is quite adept at managing memory and indices for you.
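One possible shape for that three-table layout is sketched below. The table and column names, and the helpers build_index and words_for_bigram, are assumptions chosen for the example. Unlike the list-based code in the question, this sketch records each (bigram, word) pair only once even when a bigram repeats inside a word.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS words   (id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS bigrams (id INTEGER PRIMARY KEY, bigram TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS bigram_words (
    bigram_id INTEGER NOT NULL REFERENCES bigrams(id),
    word_id   INTEGER NOT NULL REFERENCES words(id),
    PRIMARY KEY (bigram_id, word_id)
);
"""


def build_index(conn, words):
    """Store every word once and link it to each character bigram it contains."""
    conn.executescript(SCHEMA)
    for word in words:
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)).fetchone()[0]
        for bigram in {word[i:i + 2] for i in range(len(word) - 1)}:
            conn.execute("INSERT OR IGNORE INTO bigrams (bigram) VALUES (?)", (bigram,))
            bigram_id = conn.execute(
                "SELECT id FROM bigrams WHERE bigram = ?", (bigram,)).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO bigram_words VALUES (?, ?)", (bigram_id, word_id))
    conn.commit()


def words_for_bigram(conn, bigram):
    """Return all words containing `bigram`, in alphabetical order."""
    rows = conn.execute(
        "SELECT w.word FROM words w "
        "JOIN bigram_words bw ON bw.word_id = w.id "
        "JOIN bigrams b ON b.id = bw.bigram_id "
        "WHERE b.bigram = ? ORDER BY w.word", (bigram,))
    return [row[0] for row in rows]
```

For example, after conn = sqlite3.connect('bigrams.db') and build_index(conn, words), calling words_for_bigram(conn, 'ab') returns the matching words without loading the whole index into memory. For 1.3 million words, batching the inserts with executemany would likely be noticeably faster than the row-by-row loop shown here.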

Effective 1-5 grams extraction with python

I have a huge file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.:
This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .
Currently, I am doing it as below, but this doesn't seem to be an efficient way to extract the 1-5grams:
#!/usr/bin/env python -*- coding: utf-8 -*-
import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')
with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
     io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file.
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1] # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1] # Removes the final '<s>'
# Unigrams count.
src_unigrams = Counter(src_words)
trg_unigrams = Counter(trg_words)
# Sum of unigram counts.
src_sum_unigrams = sum(src_unigrams.values())
trg_sum_unigrams = sum(trg_unigrams.values())
# Bigrams count.
src_bigrams = Counter(zip(src_words,src_words[1:]))
trg_bigrams = Counter(zip(trg_words,trg_words[1:]))
# Sum of bigram counts.
src_sum_bigrams = sum(src_bigrams.values())
trg_sum_bigrams = sum(trg_bigrams.values())
# Trigrams count.
src_trigrams = Counter(zip(src_words,src_words[1:], src_words[2:]))
trg_trigrams = Counter(zip(trg_words,trg_words[1:], trg_words[2:]))
# Sum of trigram counts.
src_sum_trigrams = sum(src_trigrams.values())
trg_sum_trigrams = sum(trg_trigrams.values())
Is there any other way to do this more efficiently?
How to optimally extract different N ngrams simultaneously?
From Fast/Optimize N-gram implementations in python, essentially this:
zip(*[words[i:] for i in range(n)])
when hard-coded is this for bigrams, n=2:
zip(src_words,src_words[1:])
and is this for trigrams, n=3:
zip(src_words,src_words[1:],src_words[2:])
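Put together, the zip idiom can produce all orders in one helper. This is a small sketch; the name count_ngrams and the max_n parameter are made up for the example.

```python
from collections import Counter


def count_ngrams(words, max_n=5):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    return {
        # zip stops at the shortest slice, so incomplete n-grams are never emitted
        n: Counter(zip(*[words[i:] for i in range(n)]))
        for n in range(1, max_n + 1)
    }
```

Note that each words[i:] slice copies the list, so for n = 5 this temporarily holds about five copies of the token list in memory.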
If you are interested only in the most common (frequent) n-grams (which is your case I suppose), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought as the number of lines that a given n-gram is contained in, it efficiently searches for all such n-grams.
The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (may be optimized as discussed later), scan the whole corpus and get all the 1-grams, i.e. bare tokens, and select those which are contained at least s_min times. This gives you subset F1 of frequent 1-grams. Then test all the possible 2-grams by combining all the 1-grams from F1. Again, select those which hold the s_min criterion and you'll get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you'll get F3. Repeat for as long as Fn is non-empty.
Many optimizations can be done here. When combining n-grams from Fn, you can exploit the fact that n-grams x and y may only be combined to form (n+1)-gram iff x[1:] == y[:-1] (may be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GBs), you can extremely speed up the query function. For each 1-gram, store a hash-set of line indices containing the given 1-gram. When combining two n-grams into an (n+1)-gram, use intersection of the two corresponding sets, obtaining a set of lines where the (n+1)-gram may be contained.
The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered as the algorithm runs, saving computational time for the frequent ones only.
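A compact sketch of this Apriori-style search follows, under some simplifying assumptions: the corpus is a list of already tokenized lines that fits in memory, tokenization is a plain split(), and support is counted as the number of lines containing the n-gram. The names contains and frequent_ngrams are invented for the example.

```python
from collections import defaultdict


def contains(tokens, ngram):
    """True if `ngram` occurs as a contiguous run inside `tokens`."""
    n = len(ngram)
    return any(tuple(tokens[i:i + n]) == ngram for i in range(len(tokens) - n + 1))


def frequent_ngrams(lines, s_min, max_n=5):
    """Return {n-gram tuple: number of lines containing it} with support >= s_min."""
    tokenized = [line.split() for line in lines]
    # F1: for every token, the set of line indices that contain it.
    lines_of = defaultdict(set)
    for idx, tokens in enumerate(tokenized):
        for token in tokens:
            lines_of[(token,)].add(idx)
    frequent = {g: s for g, s in lines_of.items() if len(s) >= s_min}
    result = {g: len(s) for g, s in frequent.items()}
    for _ in range(2, max_n + 1):
        # Group by the first n-1 tokens so that joins are dictionary lookups.
        by_prefix = defaultdict(list)
        for gram in frequent:
            by_prefix[gram[:-1]].append(gram)
        candidates = {}
        for x, x_lines in frequent.items():
            for y in by_prefix.get(x[1:], ()):  # guarantees x[1:] == y[:-1]
                cand = x + y[-1:]
                # Only lines holding both parents can hold the candidate.
                maybe = x_lines & frequent[y]
                if len(maybe) < s_min:
                    continue
                hits = {i for i in maybe if contains(tokenized[i], cand)}
                if len(hits) >= s_min:
                    candidates[cand] = hits
        if not candidates:
            break
        frequent = candidates
        result.update({g: len(s) for g, s in frequent.items()})
    return result
```

The set intersection only narrows down which lines need to be checked; the contains call is still required because both parents can occur in a line without being adjacent.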
I am giving you a bunch of pointers regarding the general problems you are trying to solve. One or more of these should be useful for you and help you figure this out.
For what you are doing (I am guessing some sort of machine translation experiment) you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you have provided). Processing them separately will be less expensive in terms of the amount of stuff you need to hold in memory at a given time.
You are reading a ton of data into memory, processing it (which takes even more memory), and then holding the results in some in-memory data-structures. Instead of doing that, you should strive to be lazier. Learn about python generators and write a generator which streams out all the ngrams from a given text without needing to hold the entire text in memory at any given point in time. The itertools python package will probably come in handy while writing this.
Beyond a point, it will no longer be feasible for you to hold all this data in memory. You should consider looking at map-reduce to help you break this down. Check out the mrjob python package, which lets you write map-reduce jobs in python. In the mapper step, you will break text down into its ngrams, and in the reducer stage you will count the number of times you see each ngram to get its overall count. mrjob jobs can also be run locally, which obviously won't give you any parallelization benefits, but will be nice because mrjob will still do a lot of heavy lifting for you.
If you are compelled to hold all the counts in memory at the same time (for a massive amount of text), then either implement some pruning strategy to prune out very rare ngrams, or consider using some persistent file-based lookup table such as sqlite to hold all the data for you.
Assuming you don't want to count ngrams between lines, and assuming naive tokenization:
import collections

def ngrams(n, f):
    deque = collections.deque(maxlen=n)
    for line in f:
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1]) # pre-seed so 5-gram counter doesn't count incomplete 5-grams
        for word in words[n-1:]:
            deque.append(word)
            yield tuple(str(w) for w in deque) # n-gram tokenization

# one counter per order: 1-grams through 5-grams
counters = [collections.Counter(ngrams(n, open('somefile.txt'))) for n in range(1, 6)]
edit: added beginning/end line tokens
The resultant data object is I believe about as sparse as possible. 3m lines with 40 words is ~120m tokens. With ~1m words in English (though less commonly used), you'll probably get a rather long tail. If you can imagine your data to be exchangeable / iid, then you can add some pruning in the middle:
def ngrams(n, f, prune_after=10000):
    counter = collections.Counter()
    deque = collections.deque(maxlen=n)
    for i, line in enumerate(f):
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1])
        for word in words[n-1:]:
            deque.append(word)
            ngram = tuple(str(w) for w in deque)
            if i < prune_after or ngram in counter:
                counter[ngram] += 1
    return counter
Relaxing the exchangeability assumption would require something like Tregoreg's answer for efficient pruning, but in most cases exchangeability should hold.
As far as raw speed, I think zip (like the original code) vs deque is the fundamental question. zip removes the innermost loop, so it is likely already very fast. deque requires the innermost loop but also consumes the data iteratively, so its working memory footprint should be much smaller. Which is better will likely depend on your machine, but I'd imagine for large machines/small data that zip would be faster. Once you start running out of memory (especially if you start talking about pruning), however, deque gets a few more advantages.
