Efficient 1-5 gram extraction with Python

I have a huge file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.:
This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .
Currently, I am doing it as below, but this does not seem to be an efficient way to extract the 1-5 grams:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')

with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
     io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file.
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1]  # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1]  # Removes the final '<s>'

    # Unigram counts.
    src_unigrams = Counter(src_words)
    trg_unigrams = Counter(trg_words)
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())

    # Bigram counts.
    src_bigrams = Counter(zip(src_words, src_words[1:]))
    trg_bigrams = Counter(zip(trg_words, trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())

    # Trigram counts.
    src_trigrams = Counter(zip(src_words, src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words, trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_trigrams.values())
    trg_sum_trigrams = sum(trg_trigrams.values())
Is there a more efficient way to do this?
And how can I optimally extract n-grams of the different orders (1 through 5) simultaneously?
From Fast/Optimize N-gram implementations in python, essentially this:
zip(*[words[i:] for i in range(n)])
which, hard-coded for bigrams (n=2), is:
zip(src_words, src_words[1:])
and for trigrams (n=3):
zip(src_words, src_words[1:], src_words[2:])
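For reference, one way to reuse that idiom for all orders (1 through 5) at once is sketched below; note that this still holds the whole word list in memory, which is exactly what the answers below try to avoid:

from collections import Counter

def ngram_counts(words, max_n=5):
    # Note: unigrams come out as 1-tuples here, e.g. ('foo',) rather than 'foo'.
    return {n: Counter(zip(*[words[i:] for i in range(n)]))
            for n in range(1, max_n + 1)}

src_counts = ngram_counts(src_words)   # src_counts[2] holds the bigram counts, etc.
src_sums = {n: sum(c.values()) for n, c in src_counts.items()}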

If you are interested only in the most common (frequent) n-grams (which I suppose is your case), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought of as the number of lines that contain a given n-gram, it efficiently searches for all such n-grams.
The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (it may be optimized as discussed later), scan the whole corpus, get all the 1-grams, i.e. bare tokens, and select those which are contained at least s_min times. This gives you the subset F1 of frequent 1-grams. Then test all the possible 2-grams by combining all the 1-grams from F1. Again, select those which satisfy the s_min criterion and you'll get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you'll get F3. Repeat for as long as Fn is non-empty.
Many optimizations can be done here. When combining n-grams from Fn, you can exploit the fact that n-grams x and y can only be combined to form an (n+1)-gram if x[1:] == y[:-1] (this can be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GBs), you can extremely speed up the query function. For each 1-gram, store a hash set of the line indices containing that 1-gram. When combining two n-grams into an (n+1)-gram, use the intersection of the two corresponding sets to obtain the set of lines where the (n+1)-gram may be contained.
The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered out as the algorithm runs, saving computational time for the frequent ones only.
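For concreteness, here is one way that idea might be sketched in Python (illustrative names; support is counted per line as defined above, candidate verification is done naively, and a real implementation at corpus scale would stream lines from disk rather than keep them in a list):

from collections import defaultdict

def frequent_ngrams(lines, s_min, max_n=5):
    tokenized = [line.split() for line in lines]
    # F1: map each 1-gram to the set of line indices containing it.
    occurrences = defaultdict(set)
    for i, tokens in enumerate(tokenized):
        for tok in tokens:
            occurrences[(tok,)].add(i)
    frequent = {g: s for g, s in occurrences.items() if len(s) >= s_min}
    result = dict(frequent)
    n = 1
    while frequent and n < max_n:
        grams = list(frequent)
        candidates = {}
        for x in grams:
            for y in grams:
                if x[1:] == y[:-1]:                        # the combination condition above
                    lines_xy = frequent[x] & frequent[y]   # lines that *may* contain the candidate
                    if len(lines_xy) >= s_min:
                        candidates[x + y[-1:]] = lines_xy
        # Verify candidates against the actual lines (the intersection is only an upper bound).
        frequent = {}
        for gram, line_ids in candidates.items():
            hits = {i for i in line_ids
                    if any(tuple(tokenized[i][j:j + len(gram)]) == gram
                           for j in range(len(tokenized[i]) - len(gram) + 1))}
            if len(hits) >= s_min:
                frequent[gram] = hits
        result.update(frequent)
        n += 1
    return result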

I am giving you a bunch of pointers regarding the general problems you are trying to solve. One or more of these should be useful to you and help you figure this out.
For what you are doing (I am guessing some sort of machine translation experiment) you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you have provided). Processing them separately will be less expensive in terms of the amount of data you need to hold in memory at a given time.
You are reading a ton of data into memory, processing it (which takes even more memory), and then holding the results in some in-memory data-structures. Instead of doing that, you should strive to be lazier. Learn about python generators and write a generator which streams out all the ngrams from a given text without needing to hold the entire text in memory at any given point in time. The itertools python package will probably come in handy while writing this.
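For instance, a streaming generator along these lines keeps only one line in memory at a time (a sketch only; it reuses the <s>/</s> boundary tokens and io.open from the question):

import io
from collections import Counter

def line_ngrams(path, n):
    with io.open(path, 'r', encoding='utf8') as fin:
        for line in fin:
            tokens = ['<s>'] + line.split() + ['</s>']
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])

# e.g. count source-side trigrams without ever loading the whole file:
src_trigrams = Counter(line_ngrams('train-1.tok.en', 3))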
Beyond a point, it will no longer be feasible for you to hold all this data in memory. You should consider looking at map-reduce to help you break this down. Check out the mrjob python package, which lets you write map-reduce jobs in python. In the mapper step you will break the text down into its n-grams, and in the reducer stage you will count the number of times you see each n-gram to get its overall count. mrjob jobs can also be run locally, which obviously won't give you any parallelization benefits, but is nice because mrjob will still do a lot of the heavy lifting for you.
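A rough sketch of such a job, assuming mrjob's standard mapper/reducer interface (the 1-5 gram emission here is illustrative, not a drop-in solution):

from mrjob.job import MRJob

class MRNgramCount(MRJob):

    def mapper(self, _, line):
        tokens = ['<s>'] + line.split() + ['</s>']
        for n in range(1, 6):                      # emit 1-grams through 5-grams
            for i in range(len(tokens) - n + 1):
                yield ' '.join(tokens[i:i + n]), 1

    def reducer(self, ngram, counts):
        yield ngram, sum(counts)

if __name__ == '__main__':
    MRNgramCount.run()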
If you are compelled to hold all the counts in memory at the same time (for a massive amount of text), then either implement some pruning strategy to drop very rare n-grams, or consider using a persistent file-based lookup table such as sqlite to hold all the data for you.

Assuming you don't want to count n-grams that span lines, and assuming naive tokenization:
import collections

def ngrams(n, f):
    deque = collections.deque(maxlen=n)
    for line in f:
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1])  # pre-seed so 5-gram counter doesn't count incomplete 5-grams
        for word in words[n-1:]:
            deque.append(word)
            yield tuple(str(w) for w in deque)  # n-gram tokenization

counters = [collections.Counter(ngrams(i, open('somefile.txt'))) for i in range(1, 6)]
edit: added beginning/end line tokens
The resultant data object is, I believe, about as sparse as possible. 3m lines with 40 words each is ~120m tokens. With ~1m words in English (though not all commonly used), you'll probably get a rather long tail. If you can imagine your data to be exchangeable / iid, then you can add some pruning in the middle:
def ngrams(n, f, prune_after=10000):
    counter = collections.Counter()
    deque = collections.deque(maxlen=n)
    for i, line in enumerate(f):
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1])
        for word in words[n-1:]:
            deque.append(word)
            ngram = tuple(str(w) for w in deque)
            if i < prune_after or ngram in counter:
                counter[ngram] += 1
    return counter
Relaxing the exchangeability assumption would require something like Tregoreg's answer for efficient pruning, but in most cases exchangeability should hold.
As far as raw speed goes, I think zip (like the original code) vs deque is the fundamental question. zip removes the innermost loop, so it is likely already very fast. deque requires the innermost loop but also consumes the data iteratively, so its working memory footprint should be much smaller. Which is better will likely depend on your machine, but I'd imagine that for large machines/small data, zip would be faster. Once you start running out of memory (especially if you start talking about pruning), however, deque gets a few more advantages.

Related

Handling Memory Error when dealing with really large number of words (>100 million) for LDA analysis

I have 50,000k files that have a combined total of 162 million words. I wanted to do topic modelling using Gensim, similar to this tutorial here.
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
So, I have these files read into a pandas dataframe (the 'content' column has the text) and do the following to create a list of the texts. [image of dataframe attached here]
texts = [[word for word in row[1]['content'].lower().split() if word not in stopwords] for row in df.iterrows()]
However, I have been running into a memory error because of the large word count.
I also tried the TfidfVectorizer in Python. I got a memory error for this too.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words

vectorizer = TfidfVectorizer(use_idf=True, tokenizer=simple_tokenizer, stop_words='english')
X = vectorizer.fit_transform(df['content'])
How do I handle tokenizing these really long documents in a way it can be processed for LDA Analysis?
I have an i7, 16GB Desktop if that matters.
EDIT
Since Python was unable to store such large lists in memory, I rewrote the code to read each file (originally stored as HTML), convert it to text, create a text vector, append it to a list, and then send the list to the LDA code. It worked!
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
If the only output you need from this is a dictionary with the word count, I would do the following:
Process files one by one in a loop. This way you store only one file in memory. Process it, then move to the next one:
# for all files in your directory/directories:
with open(current_file, 'r') as f:
    for line in f:
        # your logic to update the dictionary with the word count
# here the file is closed and the loop moves to the next one
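For instance, if the only goal is a word-count dictionary, the loop body could simply update a collections.Counter (a sketch only; list_of_files and stopwords stand in for whatever you already have):

from collections import Counter

word_counts = Counter()
for current_file in list_of_files:
    with open(current_file, 'r') as f:
        for line in f:
            word_counts.update(w for w in line.lower().split() if w not in stopwords)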
EDIT: When it comes to keeping a really large dictionary in memory, you have to remember that Python reserves a lot of memory to keep the dict at low density - the price of fast lookups. Therefore, you may have to search for another way of storing the key-value pairs, e.g. a list of tuples, but the cost will be much slower lookups. This question is about exactly that and has some nice alternatives described there.

Normalize all words in a document

I need to normalize all the words in a huge corpus. Any ideas how to optimize this code? It's too slow...
texts = [[list(morph.normalize(word.upper()))[0] for word in document.split()]
         for document in documents]
documents is a list of strings, where each string is a single book's text.
morph.normalize works only on upper-case input, so I apply .upper() to all words. Moreover, it returns a set with one element, which is the normalized word (a string).
The first and obvious thing I'd do would be to cache the normalized words in a local dict, to avoid calling morph.normalize() more than once for a given word.
A second optimization is to alias methods to local variables - this avoids going through the whole attribute lookup + function descriptor invocation + method object instantiation on each turn of the loop.
Then, since it's a "huge" corpus, you probably want to avoid creating the full list of lists at once, which might eat all your RAM, make your computer start to swap (which is guaranteed to make it snail slow) and finally crash with a memory error. I don't know what you're supposed to do with this list of lists nor how huge each document is, but as an example I iterate over per-document results and write them to stdout - what should really be done depends on the context and the concrete use case.
NB : untested code, obviously, but at least this should get you started
def iterdocs(documents, morph):
    # Keep track of already-normalized words.
    # Beware: this dict might get too big if you
    # have lots of different words. Depending on
    # your corpus, you may want to either use an LRU
    # cache instead and/or use a per-document cache
    # and/or any other appropriate caching strategy...
    cache = {}
    # Aliasing methods as local variables
    # is faster in tight loops.
    normalize = morph.normalize

    def norm(word):
        upw = word.upper()
        if upw in cache:
            return cache[upw]
        nw = cache[upw] = normalize(upw).pop()
        return nw

    for doc in documents:
        words = [norm(word) for word in doc.split() if word]
        yield words

for text in iterdocs(docs, morph):
    # If you need all the texts for further use,
    # at least write them to disk or some other persistent
    # storage and re-read them when needed.
    # Here I just write them to sys.stdout as an example.
    print(text)
Also, I don't know where you get your documents from, but if they are text files, you may want to avoid loading them all into memory. Just read them one by one, and if they are themselves huge, don't even read a whole file at once (you can iterate over a file line by line - the most obvious choice for text).
Finally, once you've made sure your code doesn't eat up too much memory for a single document, the next obvious optimisation is parallelisation - run a process per available core and split the corpus between the processes (each writing its results to a given place); see the sketch below. Then you just have to sum up the results if you need them all at once...
Oh, and yes: if that's still not enough, you may want to distribute the work with some map-reduce framework - your problem looks like a perfect fit for map-reduce.
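For example, the per-core parallelisation could be sketched with multiprocessing.Pool as below. make_morph and document_paths are placeholders for however you construct the analyzer and enumerate your files; each worker process keeps its own analyzer and cache:

import io
from multiprocessing import Pool

def init_worker():
    global morph, cache
    morph = make_morph()   # build the analyzer once per worker process
    cache = {}

def normalize_file(path):
    words = []
    with io.open(path, encoding='utf8') as f:
        for line in f:                      # read line by line, never the whole file
            for word in line.split():
                upw = word.upper()
                if upw not in cache:
                    cache[upw] = morph.normalize(upw).pop()
                words.append(cache[upw])
    return path, words

if __name__ == '__main__':
    with Pool(initializer=init_worker) as pool:
        for path, words in pool.imap_unordered(normalize_file, document_paths):
            pass   # persist each result (e.g. write it to disk) instead of keeping them all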

Size issues with Python shelve module

I want to store a few dictionaries using the shelve module; however, I am running into a problem with their size. I use Python 3.5.2 and the latest shelve module.
I have a list of words and I want to create a map from the bigrams (character level) to the words. The structure will look something like this:
{'aa': ['aardvark', 'and', ...],
 'ab': ['absolute', 'dab', ...],
 ...}
I read in a large file consisting of approximately 1.3 million words. So the dictionary gets pretty large. This is the code:
self.bicharacters  # part of the class

def _create_bicharacters(self):
    '''
    Creates a bicharacter index for calculating the Jaccard coefficient.
    '''
    with open('wordlist.txt', encoding='ISO-8859-1') as f:
        for line in f:
            word = line.split('\t')[2]
            for i in range(len(word) - 1):
                bicharacter = (word[i] + word[i+1])
                if bicharacter in self.bicharacters:
                    get = self.bicharacters[bicharacter]
                    get.append(word)
                    self.bicharacters[bicharacter] = get
                else:
                    self.bicharacters[bicharacter] = [word]
When I ran this code using a regular Python dictionary, I did not run into issues, but I can't spare those kinds of memory resources because the rest of the program also has quite a large memory footprint.
So I tried using the shelve module. However, when I run the code above using shelve, the program stops after a while due to running out of disk space; the shelve db that was created was around 120 GB, and it had still not read even half of the 1.3M-word list from the file. What am I doing wrong here?
The problem here is not so much the number of keys, but that each key references a list of words.
While in memory as one (huge) dictionary, this isn't that big a problem as the words are simply shared between the lists; each list is simply a sequence of references to other objects and here many of those objects are the same, as only one string per word needs to be referenced.
In shelve, however, each value is pickled and stored separately, meaning that a concrete copy of the words in a list will have to be stored for each value. Since your setup ends up adding a given word to a large number of lists, this multiplies your data needs rather drastically.
I'd switch to using a SQL database here. Python comes bundled with sqlite3. If you create one table for individual words, a second table for each possible bigram, and a third that simply links the two (a many-to-many mapping, linking bigram row id to word row id), this can be done very efficiently. You can then do very efficient lookups, as SQLite is quite adept at managing memory and indices for you.
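A rough sqlite3 sketch of that three-table layout (table and column names here are illustrative, not taken from the original code):

import sqlite3

conn = sqlite3.connect('bicharacters.db')
conn.executescript('''
    CREATE TABLE IF NOT EXISTS words   (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS bigrams (id INTEGER PRIMARY KEY, bigram TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS word_bigrams (
        bigram_id INTEGER REFERENCES bigrams(id),
        word_id   INTEGER REFERENCES words(id),
        PRIMARY KEY (bigram_id, word_id));
''')

def add_word(word):
    conn.execute('INSERT OR IGNORE INTO words (word) VALUES (?)', (word,))
    word_id = conn.execute('SELECT id FROM words WHERE word = ?', (word,)).fetchone()[0]
    for a, b in zip(word, word[1:]):
        bigram = a + b
        conn.execute('INSERT OR IGNORE INTO bigrams (bigram) VALUES (?)', (bigram,))
        bigram_id = conn.execute('SELECT id FROM bigrams WHERE bigram = ?', (bigram,)).fetchone()[0]
        conn.execute('INSERT OR IGNORE INTO word_bigrams VALUES (?, ?)', (bigram_id, word_id))
    # call conn.commit() periodically when bulk-loading

def words_for_bigram(bigram):
    return [row[0] for row in conn.execute(
        'SELECT w.word FROM words w '
        'JOIN word_bigrams wb ON wb.word_id = w.id '
        'JOIN bigrams b ON b.id = wb.bigram_id '
        'WHERE b.bigram = ?', (bigram,))]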

Is there any algorithm to mine continuous closed sequences from a sequence database?

I am working on text compression and I want to use the knowledge of mining closed frequent sequences. The existing algorithms like GSP, CloSpan, ClaSP, and BIDE mine all frequent sequences, both continuous and non-continuous. Can you help me find such an algorithm?
For example if the sequence database is
SID  Sequence
1    CAABC
2    ABCB
3    CABC
4    ABBCA
and minimum support is 2
the existing algorithms consider 'CB' a subsequence of the sequence with id 1, but I don't want that.
Modern sequential pattern mining algorithms try to prune the search space to reduce running time. That search space is exponentially larger when "non-continuous" subsequences are allowed, since such a subsequence can be almost any combination of items from the input sequences. In your case the search space is far smaller, given that the subsequences are continuous, i.e. we already know the candidate combinations. So you can probably write an algorithm for this yourself, and it would still be reasonably fast.
Here's how a crude recursive example could look:
def f(patt, db, minSupp):
    freq = 0
    if len(patt) == 0:
        return ""
    # count frequency
    for trans in db:
        if patt in trans:
            freq += 1
    if freq >= minSupp:
        return patt
    else:  # break it down
        r = []
        r.append(f(patt[1:], db, minSupp))   # all but the first element
        r.append(f(patt[:-1], db, minSupp))  # all but the last element
        return r
This just demonstrates one way of doing it. Of course, it's crude.
To make it much faster, you can probably add some conditions so that a recursive call is not made when the pattern is already known.
An even faster way would be to maintain an inverted index of all the patterns and then incrementally extend them into super-patterns using the Apriori condition. For this, you can refer to a slide show explaining how the Apriori algorithm works (it uses the same candidate-generation idea as the algorithm above).
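For illustration, a level-wise miner restricted to continuous substrings could be sketched like this (hypothetical names; support is the number of distinct sequences containing the substring, as in your example, and only sequences known to contain a frequent pattern are scanned when extending it):

from collections import defaultdict

def frequent_substrings(db, min_supp):
    # Level 1: single symbols and the set of sequence ids containing each.
    frequent = defaultdict(set)
    for sid, seq in enumerate(db):
        for ch in seq:
            frequent[ch].add(sid)
    frequent = {p: s for p, s in frequent.items() if len(s) >= min_supp}
    result = dict(frequent)
    k = 1
    while frequent:
        candidates = defaultdict(set)
        for p, sids in frequent.items():
            for sid in sids:                      # only scan sequences that contain p
                seq = db[sid]
                start = seq.find(p)
                while start != -1:
                    if start + k < len(seq):      # extend p by one symbol to the right
                        candidates[seq[start:start + k + 1]].add(sid)
                    start = seq.find(p, start + 1)
        frequent = {p: s for p, s in candidates.items() if len(s) >= min_supp}
        result.update(frequent)
        k += 1
    return result

# e.g. frequent_substrings(['CAABC', 'ABCB', 'CABC', 'ABBCA'], 2) includes 'ABC' but not 'CB'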
As such, no off-the-shelf algorithm exists for finding continuous sequences for compression. You can modify the existing algorithms to mine only continuous sequences; I suggest modifying the BIDE algorithm to find continuous subsequences only.

Fastest way to search a 1GB+ string of data for the first occurrence of a pattern in Python

There's a 1 Gigabyte string of arbitrary data which you can assume to be equivalent to something like:
1_gb_string=os.urandom(1*gigabyte)
We will be searching this string, 1_gb_string, for an infinite number of fixed-width, 1 kilobyte patterns, 1_kb_pattern. Every time we search, the pattern will be different, so caching opportunities are not apparent. The same 1 gigabyte string will be searched over and over. Here is a simple generator to describe what's happening:
def findit(1_gb_string):
    1_kb_pattern = get_next_pattern()
    yield 1_gb_string.find(1_kb_pattern)
Note that only the first occurrence of the pattern needs to be found. After that, no other major processing should be done.
What can I use that's faster than Python's built-in find for matching 1KB patterns against 1GB or greater data strings?
(I am already aware of how to split up the string and searching it in parallel, so you can disregard that basic optimization.)
Update: Please bound memory requirements to 16GB.
As you clarify that long-ish preprocessing is acceptable, I'd suggest a variant of Rabin-Karp: "an algorithm of choice for multiple pattern search", as wikipedia puts it.
Define a "rolling hash" function, i.e., one such that, when you know the hash for haystack[x:x+N], computing the hash for haystack[x+1:x+N+1] is O(1). (Normal hashing functions such as Python's built-in hash do not have this property, which is why you have to write your own, otherwise the preprocessing becomes exhaustingly long rather than merely long-ish;-). A polynomial approach is fruitful, and you could use, say, 30-bit hash results (by masking if needed, i.e., you can do the computation w/more precision and just store the masked 30 bits of choice). Let's call this rolling hash function RH for clarity.
So, compute 1G of RH results as you roll along the haystack 1GB string; if you just stored these, it would give you an array H of 1G 30-bit values (4GB) mapping index-in-haystack -> RH value. But you want the reverse mapping, so use instead an array A of 2**30 entries (1G entries) that for each RH value gives you all the indices of interest in the haystack (indices at which that RH value occurs); for each entry you store the index of the first possibly-interesting haystack index into another array B of 1G indices into the haystack, which is ordered to keep all indices into the haystack with identical RH values ("collisions" in hashing terms) adjacent. H, A and B each have 1G entries of 4 bytes, so 12GB total.
Now for each incoming 1K needle, compute its RH, call it k, and use it as an index into A; A[k] gives you the first index ib into B at which it's worth comparing. So, do:
ib = A[k]
b = B[ib]
while b < len(haystack) - 1024:
    if H[b] != k: return "not found"
    if needle == haystack[b:b+1024]: return "found at", b
    ib += 1
    b = B[ib]
with a good RH you should have few collisions, so the while should execute very few times until returning one way or another. So each needle-search should be really really fast.
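As an illustration only (not the exact flat-array layout described above: a dict of offset lists stands in for A and B, BASE is an arbitrary choice, and a Python 3 bytes haystack is assumed), a 30-bit polynomial rolling hash plus the verification step might look like this:

from collections import defaultdict

MASK = (1 << 30) - 1   # keep 30-bit hash values, as suggested above
BASE = 257             # polynomial base

def poly_hash(block):
    # Plain polynomial hash of a bytes block, masked to 30 bits.
    h = 0
    for b in block:
        h = (h * BASE + b) & MASK
    return h

def rolling_hashes(haystack, n=1024):
    # Yield (offset, hash) for every n-byte window, updating in O(1) per step.
    top = pow(BASE, n - 1, MASK + 1)   # BASE**(n-1) mod 2**30, used to drop the oldest byte
    h = poly_hash(haystack[:n])
    yield 0, h
    for i in range(1, len(haystack) - n + 1):
        h = ((h - haystack[i - 1] * top) * BASE + haystack[i + n - 1]) & MASK
        yield i, h

def build_index(haystack, n=1024):
    # Reverse mapping: hash value -> offsets in the haystack with that hash.
    index = defaultdict(list)
    for i, h in rolling_hashes(haystack, n):
        index[h].append(i)
    return index

def search(needle, haystack, index):
    for i in index.get(poly_hash(needle), []):       # few collisions expected per value
        if haystack[i:i + len(needle)] == needle:    # verify the candidate directly
            return i
    return -1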
There are a number of string matching algorithms used in the field of genetics to find substrings. You might try this paper or this paper.
As far as I know, the standard find algorithm is a naive algorithm requiring about n*m comparisons, because it checks the pattern against every possible offset. There are some more effective algorithms, requiring about n+m comparisons.
If your string is not a natural-language string, you can try the Knuth-Morris-Pratt algorithm. The Boyer-Moore search algorithm is fast and simple enough too.
Are you willing to spend a significant time preprocessing the string?
If you are, what you can do is build a list of n-grams with offsets.
Suppose your alphabet is hex bytes and you are using 1-grams.
Then for 00-ff, you can create a dictionary that looks like this (perlese, sorry):
$offset_list{00} = #array_of_offsets
$offset_list{01} = #...etc
where you walk down the string and build the #array_of_offsets from all the points where each byte occurs. You can do this for arbitrary n-grams.
This provides a "start point for search" that you can use to walk.
Of course, the downside is that you have to preprocess the string, so that's your tradeoff.
edit:
The basic idea here is to match prefixes. This may bomb out badly if the information is super-similar, but if it has a fair amount of divergence between n-grams, you should be able to match prefixes pretty well.
Let's quantify divergence, since you've not discussed the kind of info you're analyzing. For the purposes of this algorithm, we can characterize divergence as a distance function: you need a decently high Hamming distance. If the Hamming distance between n-grams is, say, 1, the above idea won't work. But if it's n-1, the above algorithm will be much easier.
To improve on my algorithm, let's build an algorithm that does some successive elimination of possibilities:
We can invoke Shannon entropy to define the information content of a given n-gram. Take your search string and successively build a prefix based upon its first m characters. When the entropy of the m-prefix is "sufficiently high", use it later:
1. Define p to be an m-prefix of the search string.
2. Search your 1 GB string and create an array of offsets that match p.
3. Extend the m-prefix to be some k-prefix, k > m, with the entropy of the k-prefix higher than that of the m-prefix.
4. Keep the elements of the offset array defined above that match the k-prefix string. Discard the non-matching elements.
5. Repeat steps 3-4 until the entire search string is consumed.
In a sense, this is like reversing Huffman encoding.
With infinite memory, you can hash every 1k string along with its position in the 1 GB file.
With less than infinite memory, you will be bounded by how many memory pages you touch when searching.
I don't know definitively whether the find() method for strings is faster than the search() method provided by Python's re (regular expressions) module, but there's only one way to find out.
If you're just searching for a string, what you want is this:
import re

def findit(1_gb_string):
    yield re.search(1_kb_pattern, 1_gb_string)
However, if you really only want the first match, you might be better off using finditer(), which returns an iterator; with such large operations it might actually be faster.
http://www.youtube.com/watch?v=V5hZoJ6uK-s
will be of most value to you. It's an MIT lecture on dynamic programming.
If the patterns are fairly random, you can precompute the locations of n-byte prefixes of strings.
Instead of going over all options for n-prefixes, just use the actual ones in the 1GB string - there will be fewer than 1 Gig of those. Use as big a prefix as fits in your memory; I don't have 16GB of RAM to check, but a prefix of 4 could work (at least with memory-efficient data structures); if not, try 3 or even 2.
For a random 1GB string and random 1KB patterns, you should get a few tens of locations per prefix if you use 3-byte prefixes, but 4-byte prefixes should get you an average of 0 or 1, so lookup should be fast.
Precompute Locations
def find_all(pattern, string):
    cur_loc = 0
    while True:
        next_loc = string.find(pattern, cur_loc)
        if next_loc < 0: return
        yield next_loc
        cur_loc = next_loc + 1

big_string = ...
CHUNK_SIZE = 1024
PREFIX_SIZE = 4
precomputed_indices = {}
for i in xrange(len(big_string) - CHUNK_SIZE):
    prefix = big_string[i:i+PREFIX_SIZE]
    if prefix not in precomputed_indices:
        precomputed_indices[prefix] = tuple(find_all(prefix, big_string))
Look up a pattern
def find_pattern(pattern):
    prefix = pattern[:PREFIX_SIZE]
    # optimization - big prefixes will result in many misses
    if prefix not in precomputed_indices:
        return -1
    for loc in precomputed_indices[prefix]:
        if big_string[loc:loc+CHUNK_SIZE] == pattern:
            return loc
    return -1
Someone hinted at a possible way to index this thing if you have abundant RAM (or possibly even disk/swap) available.
Imagine if you performed a simple 32-bit CRC on a 1K block extending from each character in the original gigabyte string. This would result in 4 bytes of checksum data for each byte offset from the beginning of the data.
By itself this might give a modest improvement in search speed. The checksum of each 1K search target could be checked against each CRC, with each collision tested for a true match. That should still be a couple of orders of magnitude faster than a normal linear search.
That, obviously, costs us 4GB of RAM for the CRC array (plus the original gigabyte for the original data and a little more overhead for the environment and our program).
If we have ~16GB, we could sort the checksums and store a list of offsets where each is found. That becomes an indexed search (an average of about 16 probes per target search, worst case around 32 or 33 - there might be a fence post in there).
It's possible that a 16GB file index would still give better performance than a linear checksum search, and it would almost certainly be better than a linear raw search (unless you have extremely slow filesystems/storage).
(Adding): I should clarify that this strategy is only beneficial given that you've described a need to do many searches on the same one gigabyte data blob.
You might use a threaded approach to building the index (while reading it as well as having multiple threads performing the checksumming). You might also offload the indexing into separate processes or a cluster of nodes (particularly if you use a file-based index --- the ~16GB option described above). With a simple 32-bit CRC you might be able to perform the checksums/indexing as fast as your reader thread can get the data (but we are talking about 1024 checksums for each 1K of data so perhaps not).
You might further improve performance by coding a Python module in C for actually performing the search ... and/or possibly for performing the checksumming/indexing.
The development and testing of such C extensions entail other trade-offs, obviously enough. It sounds like this would have near zero re-usability.
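As a toy illustration of the CRC indexing idea, using zlib.crc32 from the standard library (a plain dict of offset lists stands in for the sorted checksum array described above, trading memory for simplicity; building it recomputes the CRC for every 1K window rather than doing anything clever):

import zlib
from collections import defaultdict

BLOCK = 1024

def build_crc_index(data):
    index = defaultdict(list)
    for i in range(len(data) - BLOCK + 1):
        index[zlib.crc32(data[i:i + BLOCK])].append(i)
    return index

def crc_search(needle, data, index):
    for i in index.get(zlib.crc32(needle), []):   # collisions are resolved by direct comparison
        if data[i:i + BLOCK] == needle:
            return i
    return -1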
One efficient but complex way is full-text indexing with the Burrows-Wheeler transform. It involves performing a BWT on your source text, then using a small index on that to quickly find any substring in the text that matches your input pattern.
The time complexity of this algorithm is roughly O(n) in the length of the pattern you're matching - and independent of the length of the text being searched! Further, the size of the index is not much larger than the input data, and with compression it can even be reduced below the size of the source text.
