Size issues with Python shelve module - python

I want to store a few dictionaries using the shelve module; however, I am running into a problem with the size. I use Python 3.5.2 and the latest shelve module.
I have a list of words and I want to create a map from the bigrams (character level) to the words. The structure will look something like this:
'aa': ['aardvark', 'and', ...],
'ab': ['absolute', 'dab', ...],
...
I read in a large file consisting of approximately 1.3 million words. So the dictionary gets pretty large. This is the code:
self.bicharacters  # part of class

def _create_bicharacters(self):
    '''
    Creates a bicharacter index for calculating Jaccard coefficient.
    '''
    with open('wordlist.txt', encoding='ISO-8859-1') as f:
        for line in f:
            word = line.split('\t')[2]
            for i in range(len(word) - 1):
                bicharacter = (word[i] + word[i+1])
                if bicharacter in self.bicharacters:
                    get = self.bicharacters[bicharacter]
                    get.append(word)
                    self.bicharacters[bicharacter] = get
                else:
                    self.bicharacters[bicharacter] = [word]
When I ran this code using a regular Python dictionary, I did not run into issues, but I can't spare that kind of memory because the rest of the program also has quite a large memory footprint.
So I tried using the shelve module. However, when I run the code above with shelve, the program stops after a while because it runs out of disk space: the shelve db that was created was around 120 GB, and it still had not processed even half of the 1.3M-word list. What am I doing wrong here?

The problem here is not so much the number of keys, but that each key references a list of words.
While in memory as one (huge) dictionary, this isn't that big a problem: the words are shared between the lists. Each list is just a sequence of references to other objects, and here many of those objects are the same, since only one string object per word needs to exist.
In shelve, however, each value is pickled and stored separately, meaning that a concrete copy of the words in a list will have to be stored for each value. Since your setup ends up adding a given word to a large number of lists, this multiplies your data needs rather drastically.
I'd switch to using a SQL database here. Python comes bundled with sqlite3. If you create one table for individual words, a second table for each possible bigram, and a third that simply links between the two (a many-to-many mapping, linking bigram row id to word row id), this can be done very efficiently. You can then do very efficient lookups, as SQLite is quite adept at managing memory and indices for you.
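As a rough, untested sketch of that three-table layout (the database file, table and column names here are illustrative, not from the original answer), loading the word list and querying a bigram could look something like this:

import sqlite3

conn = sqlite3.connect('bigrams.db')
conn.executescript('''
    CREATE TABLE IF NOT EXISTS words   (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS bigrams (id INTEGER PRIMARY KEY, bigram TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS bigram_words (
        bigram_id INTEGER REFERENCES bigrams(id),
        word_id   INTEGER REFERENCES words(id),
        PRIMARY KEY (bigram_id, word_id)
    );
''')

with open('wordlist.txt', encoding='ISO-8859-1') as f:
    for line in f:
        word = line.split('\t')[2]
        conn.execute('INSERT OR IGNORE INTO words (word) VALUES (?)', (word,))
        word_id = conn.execute('SELECT id FROM words WHERE word = ?', (word,)).fetchone()[0]
        for i in range(len(word) - 1):
            bigram = word[i:i + 2]
            conn.execute('INSERT OR IGNORE INTO bigrams (bigram) VALUES (?)', (bigram,))
            bigram_id = conn.execute('SELECT id FROM bigrams WHERE bigram = ?', (bigram,)).fetchone()[0]
            conn.execute('INSERT OR IGNORE INTO bigram_words VALUES (?, ?)', (bigram_id, word_id))
conn.commit()

# all words sharing the bigram 'aa'
rows = conn.execute('''
    SELECT w.word FROM words w
    JOIN bigram_words bw ON bw.word_id = w.id
    JOIN bigrams b ON b.id = bw.bigram_id
    WHERE b.bigram = ?''', ('aa',)).fetchall()

Doing the whole load inside one transaction (a single commit at the end, as above) keeps the insert loop reasonably fast.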

Related

Handling Memory Error when dealing with really large number of words (>100 million) for LDA analysis

I have 50,000k files that have a combined total of 162 million words. I wanted to do topic modelling using Gensim, similar to this tutorial here.
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
So, I have these files read into a pandas dataframe (the 'content' column has the text) and do the following to create a list of the texts (image of the dataframe attached in the original post):
texts = [[word for word in row[1]['content'].lower().split() if word not in stopwords] for row in df.iterrows()]
However, I have been running into a memory error, because of the large word count.
I also tried the TfidfVectorizer from scikit-learn, but I got a memory error for that too.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words

vectorizer = TfidfVectorizer(use_idf=True, tokenizer=simple_tokenizer, stop_words='english')
X = vectorizer.fit_transform(df['content'])
How do I handle tokenizing these really long documents in a way it can be processed for LDA Analysis?
I have an i7, 16GB Desktop if that matters.
EDIT
Since Python was unable to store such a large list, I rewrote the code to read each file (originally stored as HTML), convert it to text, create a text vector, append it to a list, and then send it to the LDA code. It worked!
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
If the only output you need from this is a dictionary with the word count, I would do the following:
Process files one by one in a loop. This way you store only one file in memory. Process it, then move to the next one:
# for all files in your directory/directories:
with open(current_file, 'r') as f:
    for line in f:
        pass  # your logic to update the dictionary with the word count
# here the file is closed and the loop moves to the next one
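For instance, a minimal sketch of that per-file loop using collections.Counter (the corpus/*.txt glob and the whitespace tokenization are assumptions, not part of the original answer):

from collections import Counter
import glob

word_counts = Counter()

for current_file in glob.glob('corpus/*.txt'):   # hypothetical file layout
    with open(current_file, 'r') as f:
        for line in f:
            word_counts.update(line.lower().split())
    # the file is closed here and the loop moves on to the next one

print(word_counts.most_common(10))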
EDIT: When it comes to issues with keeping a really large dictionary in memory, you have to remember that Python reserves a lot of memory to keep the dict's density low - the price paid for fast lookups. Therefore, you have to search for another way of storing the key-value pairs, e.g. a list of tuples, but the cost will be much slower lookup. This question is about that and has some nice alternatives described there.

pickle.load() a dictionary with duplicate keys and reduce

For reasons concerning MemoryError, I am appending a series of dictionaries to a file like pickle.dump(mydict, open(filename, "a")). The entirety of the dictionary, as far as I can tell, can't be constructed in my laptop's memory. As a result I have identical keys in the same pickled file. The file is essentially a hash table of doublets and strings. The data looks like:
{
    'abc': [list of strings1],
    'efg': [list of strings2],
    'abc': [list of strings3]
}
Main Question: When I use pickle.load(open(filename, "r")) is there a way to join the duplicate dictionary keys?
Question 2: Does it matter that there are duplicates? Will calling the duplicate key give me all applicable results?
For example:
mydict = pickle.load(open(filename, "r"))
mydict['abc'] = <<sum of all lists with this key>>
One solution I've considered, but I'm not clear on from a Python-knowledge standpoint:
x = mydict['abc']
if type(x[0]) is list:
    reduce(lambda a, b: a.extend(b), x)
    <<do processing on list items>>
Edit 1: Here's the flow of data, roughly speaking.
Daily: update table_ownership with 100-500 new records. Each record contains 1 or 2 strings (names of people).
Create a new hash table of 3-letter groups, tied to the strings that contain the doublet. The key is the doublet, the value is a list of strings (actually a tuple containing the string and the primary key for the table_ownership record).
Hourly: update table_people with 10-40 new names to match.
Use the hash table to pull the most likely matches PRIOR to running fuzzy matching. We get the doublets from myString and collect candidates with potential_matches.append(hashTable[doublet]) for doublet in get_doublets(myString).
Sort by shared doublet count.
Apply fuzzy matching to top 5000 potential_matches, storing results of high quality in table_fuzzymatches
So this works very well, and it's 10-100 times faster than fuzzymatching straight away. With only 200k records, I can make the hash table in memory and pickle.dump() but with the full 1.65M records I can't.
Edit 2: I'm looking into 2 things:
x64 Python
'collections.defaultdict'
I'll report back.
Answers:
32bit Python has a 2GB memory limit. x64 fixed my problem right away.
But what if I hadn't had 64bit Python available?
Chunk the input.
When I used a 10**5 chunk size and wrote to a dictionary piecemeal, it worked out.
For timing, my chunking process took 2000 seconds. 64bit Python sped it up to 380 seconds.
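For the main question about joining the duplicate keys: a file built by repeated pickle.dump() calls actually contains a series of separately pickled dicts rather than one dict with duplicate keys, so one way to merge them on load is a sketch like the following (the filename is illustrative, and pickle files should be opened in binary mode):

import pickle
from collections import defaultdict

merged = defaultdict(list)

with open('doublets.pkl', 'rb') as f:
    while True:
        try:
            chunk = pickle.load(f)          # one of the dumped dicts
        except EOFError:
            break                           # no more pickles in the file
        for key, strings in chunk.items():
            merged[key].extend(strings)     # "duplicate" keys get joined here

# merged['abc'] now holds the concatenation of every list dumped under 'abc'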

Normalize all words in a document

I need to normalize all words in a huge corpus. Any ideas how to optimize this code? It's too slow...
texts = [[list(morph.normalize(word.upper()))[0] for word in document.split()]
         for document in documents]
documents is a list of strings, where each string is a single book's text.
morph.normalize works only for upper case, so I apply .upper() to all words. Moreover, it returns a set with one element, which is the normalized word (a string).
The first and obvious thing I'd do would be to cache the normalized words in a local dict, so as to avoid calling morph.normalize() more than once for a given word.
A second optimization is to alias methods to local variables - this avoids going through the whole attribute lookup + function descriptor invocation + method object instantiation on each turn of the loop.
Then, since it's a "huge" corpus, you probably want to avoid creating the full list of lists at once, which might eat all your RAM, make your computer start to swap (which is guaranteed to make it snail slow) and finally crash with a memory error. I don't know what you're supposed to do with this list of lists nor how huge each document is, but as an example I iterate over a per-document result and write it to stdout - what should really be done depends on the context and concrete use case.
NB: untested code, obviously, but at least this should get you started.
def iterdocs(documents, morph):
    # keep track of already normalized words
    # beware this dict might get too big if you
    # have lots of different words. Depending on
    # your corpus, you may want to either use an LRU
    # cache instead and/or use a per-document cache
    # and/or any other appropriate caching strategy...
    cache = {}

    # aliasing methods as local variables
    # is faster for tight loops
    normalize = morph.normalize

    def norm(word):
        upw = word.upper()
        if upw in cache:
            return cache[upw]
        nw = cache[upw] = normalize(upw).pop()
        return nw

    for doc in documents:
        words = [norm(word) for word in doc.split() if word]
        yield words

for text in iterdocs(docs, morph):
    # if you need all the texts for further use,
    # at least write them to disk or some other persistence
    # mechanism and re-read them when needed.
    # Here I just write them to sys.stdout as an example
    print(text)
Also, I don't know where you get your documents from, but if they are text files, you may want to avoid loading them all into memory. Just read them one by one, and if they are themselves huge don't even read a whole file at once (you can iterate over a file line by line - the most obvious choice for text).
Finally, once you've made sure your code doesn't eat up too much memory for a single document, the next obvious optimisation is parallelisation - run a process per available core and split the corpus between processes (each writing its results to a given place). Then you just have to sum up the results if you need them all at once...
Oh, and yes: if that's still not enough you may want to distribute the work with some map-reduce framework - your problem looks like a perfect fit for map-reduce.
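A rough sketch of that parallelisation with multiprocessing, reusing iterdocs() from above (the chunking, output paths and worker count are assumptions; on platforms that spawn rather than fork, morph and documents would need to be importable at module level):

from multiprocessing import Pool

def process_chunk(args):
    chunk_id, docs_chunk = args
    out_path = 'normalized_%d.txt' % chunk_id      # illustrative output location
    with open(out_path, 'w') as out:
        for words in iterdocs(docs_chunk, morph):  # morph assumed to be module-level
            out.write(' '.join(words) + '\n')
    return out_path

if __name__ == '__main__':
    n_workers = 4
    chunk_size = len(documents) // n_workers + 1
    chunks = [(i, documents[i * chunk_size:(i + 1) * chunk_size])
              for i in range(n_workers)]
    with Pool(n_workers) as pool:
        result_files = pool.map(process_chunk, chunks)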

Searching a string from a file - python

I have a human dictionary file that looks like this in eng.dic (imagine that there are close to a billion words in that list). And I have to run different word queries quite often.
apple
pear
foo
bar
foo bar
dictionary
sentence
I have a string, let's say "foo-bar"; is there a better (more efficient) way of searching through that file to see whether it exists? If it exists, return that it exists; if it doesn't, append it to the dictionary file.
dic_file = open('en_dic', 'ra', 'utf8')
query = "foo-bar"
wordlist = list(dic_file.readlines().replace(" ", "-"))
en_dic = map(str.strip, wordlist)
if query in en_dic:
    return 1
else:
    print>>dic_file, query
Is there any in-built search functions in python? or any libraries that i can import to run such searches without much overheads?
As I already mentioned, going through the whole file when its size is significant is not a good idea. Instead you should use established solutions and:
index the words in the document,
store the results of indexing in an appropriate form (I suggest a database),
check if the word exists in the file (by checking the database),
if it does not exist, add it to the file and the database.
Storing data in a database is really a lot more efficient than trying to reinvent the wheel. If you use SQLite, the database will also be a file, so the setup procedure is minimal.
So again, I am proposing storing words in an SQLite database and querying it when you want to check if the word exists in the file, then updating both if you are adding it (a minimal sketch follows below).
To read more on the solution see answers to this question:
The most efficient way to index words in a document
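Here is a minimal sketch of that SQLite-backed lookup; the database name, table name and word-file path are illustrative, and the plain-text word file is only ever appended to so anything else that reads it keeps working:

import sqlite3

conn = sqlite3.connect('en_dic.db')
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')

def word_exists(word):
    return conn.execute('SELECT 1 FROM words WHERE word = ?', (word,)).fetchone() is not None

def add_word(word, dic_path='en_dic'):
    conn.execute('INSERT OR IGNORE INTO words (word) VALUES (?)', (word,))
    conn.commit()
    with open(dic_path, 'a', encoding='utf8') as f:   # keep the text file in sync
        f.write(word + '\n')

query = "foo-bar"
if not word_exists(query):
    add_word(query)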
The most efficient way depends on the most frequent operation you will perform on this dictionary.
If you need to re-read the file each time, you can use a loop reading the file line by line until you either find your word or reach the end of the file. This is necessary if you have several concurrent workers that can update the file at the same time.
If you don't need to re-read the file each time (e.g. you have only one process working with the dictionary), you can definitely write a more efficient implementation: 1) read all lines into a set (instead of a list), 2) for each "new" word perform both actions - update the set with add and append the word to the file.
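A short sketch of that single-process variant (the file name is an assumption and words are assumed to be one per line):

with open('en_dic', 'r', encoding='utf8') as f:
    known_words = set(line.strip() for line in f)   # one-time load into a set

def lookup_or_add(word, dic_path='en_dic'):
    if word in known_words:
        return True
    known_words.add(word)                           # update the in-memory set...
    with open(dic_path, 'a', encoding='utf8') as f:
        f.write(word + '\n')                        # ...and append to the file
    return False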
If it is "pretty large" file, then access the lines sequentially and don't read the whole file into memory:
with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            pass  # do_something

Storing huge hash table in a file in Python

Hey. I have a function I want to memoize; however, it has too many possible values. Is there any convenient way to store the values in a text file and make it read from them? For example, something like storing a pre-computed list of primes up to 10^9 in a text file? I know it's slow to read from a text file, but there's no other option if the amount of data is really huge. Thanks!
For a list of primes up to 10**9, why do you need a hash? What would the KEYS be?! Sounds like a perfect opportunity for a simple, straightforward binary file! By the Prime Number Theorem, there are about 10**9/ln(10**9) such primes -- i.e. about 50 million or a bit less. At 4 bytes per prime, that's only 200 MB or less -- perfect for an array.array("L") and its methods such as fromfile, etc. (see the docs). In many cases you could actually suck all of the 200 MB into memory, but, worst case, you can get a slice of those (e.g. via mmap and the fromstring method of array.array), do binary searches there (e.g. via bisect), etc., etc.
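A hedged sketch of that binary-file approach: 'primes.bin' is an assumed file of raw unsigned ints written earlier with array.tofile(), and the typecode 'I' is used here because 'L' is 8 bytes on many modern 64-bit builds.

from array import array
from bisect import bisect_left
import os

primes = array('I')                    # 4-byte unsigned ints on most builds
with open('primes.bin', 'rb') as f:
    n_items = os.path.getsize('primes.bin') // primes.itemsize
    primes.fromfile(f, n_items)        # read the whole (sorted) prime list

def is_prime(n):
    i = bisect_left(primes, n)         # binary search in the sorted array
    return i < len(primes) and primes[i] == n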
When you DO need a huge key-values store -- gigabytes, not a paltry 200 MB!-) -- I used to recommend shelve but after unpleasant real-life experience with huge shelves (performance, reliability, etc), I currently recommend a database engine instead -- sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones such as CouchDB can be better still, and so forth.
You can use the shelve module to store a dictionary like structure in a file. From the Python documentation:
import shelve

d = shelve.open(filename)   # open -- file may get suffix added by low-level library

d[key] = data               # store data at key (overwrites old data if
                            # using an existing key)
data = d[key]               # retrieve a COPY of data at key (raises KeyError
                            # if no such key)
del d[key]                  # delete data stored at key (raises KeyError
                            # if no such key)

flag = key in d             # true if the key exists
klist = list(d.keys())      # a list of all existing keys (slow!)

# as d was opened WITHOUT writeback=True, beware:
d['xx'] = [0, 1, 2]         # this works as expected, but...
d['xx'].append(3)           # *this doesn't!* -- d['xx'] is STILL [0, 1, 2]!

# having opened d without writeback=True, you need to code carefully:
temp = d['xx']              # extracts the copy
temp.append(5)              # mutates the copy
d['xx'] = temp              # stores the copy right back, to persist it

# or, d = shelve.open(filename, writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.

d.close()                   # close it
You could also just go with the ultimate brute force, and create a Python file with just a single statement in it:
seedprimes = [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, ...
and then just import it. (Here is a file with the primes up to 1e5: http://python.pastebin.com/f177ec30).
from primes_up_to_1e9 import seedprimes
For Project Euler, I stored a precomputed list of primes up to 10**8 in a text file just by writing them in comma separated format. It worked well for that size, but it doesn't scale well to going much larger.
If your "huge" is not really that huge, I would use something simple like this; otherwise I would go with shelve, as the others have said.
Just naively sticking a hash table onto disk will result in about 5 orders of magnitude performance loss compared to an in-memory implementation (or at least 3 if you have an SSD). When dealing with hard disks you'll want to extract every bit of data locality and caching you can get.
The correct choice will depend on details of your use case. How much performance do you need? What kind of operations do you need on data-structure? Do you need to only check if the table contains a key or do you need to fetch a value based on the key? Can you precompute the table or do you need to be able to modify it on the fly? What kind of hit rate are you expecting? Can you filter out a significant amount of the operations using a bloom filter? Are the requests uniformly distributed or do you expect some kind of temporal locality? Can you predict the locality clusters ahead of time?
If you don't need ultimate performance or can parallelize and throw hardware at the problem check out some distributed key-value stores.
You can also go one step down the ladder and use pickle. Shelve imports from pickle (link), so if you don't need the added functionality of shelve, this may spare you some clock cycles (although they really don't matter to you, as you have chosen Python to do large-number storing).
Let's see where the bottleneck is. When you're going to read a file, the hard drive has to turn enough to be able to read from it; then it reads a big block and caches the results.
So you want some method that will guess exactly what position in the file you're going to read from and then do it exactly once. I'm pretty sure the standard DB modules will work for you, but you can do it yourself -- just open the file in binary mode for reading/writing and store your values as, say, 30-digit (= 100-bit = 13-byte) numbers.
Then use standard file methods.
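A hedged sketch of that fixed-width-record idea: with 13-byte records, value number i lives at byte offset i * 13, so a single seek() plus read() or write() fetches or stores it (the file name is illustrative):

RECORD_SIZE = 13   # 13 bytes = 104 bits, enough for 30-digit decimal numbers

def write_value(f, index, value):
    f.seek(index * RECORD_SIZE)
    f.write(value.to_bytes(RECORD_SIZE, 'big'))

def read_value(f, index):
    f.seek(index * RECORD_SIZE)
    return int.from_bytes(f.read(RECORD_SIZE), 'big')

# 'w+b' creates the file; reopen with 'r+b' on later runs to keep existing data
with open('memo.bin', 'w+b') as f:
    write_value(f, 42, 10**29 + 7)
    assert read_value(f, 42) == 10**29 + 7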
