I have a big dictionary (1mil keys) in the form of:
{
    key1: {
        file1: [number_list1],
        file7: [number_list2],
        file10: [number_list3],
        ...
    },
    key2: {
        file1: [number_list4],
        file5: [number_list5],
        file2: [number_list6],
        ...
    },
    ...
}
Due to various constraints, after building it I can't keep it in memory and have to dump it on disk in its pickled form. However, I still want fast lookup from disk to any one of the keys.
My idea was to divide the big dict into smaller chunks (ballpark of 0.5-1MB). This requires an additional key:chunk mapping but allows me to load only the necessary chunk during lookup. I came up with the following algorithm:
from collections import defaultdict

def split_to_pages(self, big_dict):
    page_buffer = defaultdict(lambda: defaultdict(list))
    page_size = 0
    page_number = 0
    symbol2page = {}  # key -> chunk (page) mapping, kept in memory
    for symbol, files in big_dict.items():
        page_buffer[symbol] = files
        symbol2page[symbol] = page_number
        page_size += deep_sizeof_bytes(files)
        if page_size > max_page_size:
            save_page_to_file(page_number, page_buffer)
            page_buffer.clear()
            page_size = 0
            page_number += 1
    if page_size > 0:  # flush the last, partially filled page
        save_page_to_file(page_number, page_buffer)
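For completeness, the lookup side then looks something like the sketch below (assuming a hypothetical load_page_from_file counterpart to save_page_to_file that unpickles one page, and keeping only the small symbol2page mapping in memory):

import pickle

def load_page_from_file(page_number):
    # hypothetical counterpart to save_page_to_file
    with open('page_%d.pkl' % page_number, 'rb') as fh:
        return pickle.load(fh)

def lookup(symbol, symbol2page):
    page_number = symbol2page[symbol]        # small mapping kept in memory
    page = load_page_from_file(page_number)  # load only the needed chunk
    return page[symbol]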
This solution performs well for a static dict. However, since it represents a dynamic entity, it's very likely that a new key will be introduced to, or removed from, the dict during operation. To reflect such a change, my solution requires partitioning the entire dict from scratch. Is there a better way to handle this scenario? I have a feeling that this is a common problem which I'm just not aware of, and that better solutions have already been proposed.
EDIT:
I tried shelve, about 0.5s key lookup time for a small database (2k keys), which is very slow. My half-baked paging algorithm described above was about 0.01s.
sqlite3 did 0.4s lookup time for a 1mil key table, and I doubt mongo will be faster. There's just too much overhead for my use case. I guess I'll go on with my own implementation of a partitioned database.
Try an on-disk key-value store for Python. Check out my project RocksDict, which is an efficient solution for your problem: https://github.com/Congyuwang/RocksDict. It is much faster than dbm, and therefore much faster than shelve, which is built on top of dbm. If you wish to try RocksDict, install it with pip install rocksdict.
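Basic usage is dict-like; a rough sketch (see the project README for the exact, current API):

from rocksdict import Rdict

db = Rdict("./symbol_store")     # open (or create) an on-disk store

# values can be ordinary Python objects (lists, dicts, ...)
db["key1"] = {"file1": [1, 2, 3], "file7": [4, 5]}

print(db["key1"])                # single-key lookup straight from disk
db.close()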
Related
I am trying to get all data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways of looping through all documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1000 lines of data took me 70 seconds, but the view has about 85000 lines, so getting all the data would take far too long, especially since a manual File->Export in Lotus Notes takes only about 2 minutes to export all the data to CSV.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.Getfirstentry()
while ent:
    row = []
    for v in ent.Columnvalues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
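For completeness, the CSV-writing step itself is cheap once the rows are collected; a minimal sketch with the standard csv module (assuming each ColumnValues entry is a flat list):

import csv

with open('export.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerows(data)   # 'data' as collected in the loop above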
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using "GetFirstDocument" and "GetNextDocument". Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly so.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
My suggestion: Identify the REAL bottleneck of your code by commenting out single sections to find out when it starts to get slower:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java, maybe). Even a basic LotusScript agent can export thousands of docs per minute. A further alternative may be to look at the Notes C-API (not an easy option, and it requires API calls from Python).
I want to do hierarchical key-value storage in Python, which basically boils down to storing dictionaries to files. By that I mean any type of dictionary structure that may contain other dictionaries, numpy arrays, serializable Python objects, and so forth. Not only that, I want it to store numpy arrays space-optimized and play nice between Python 2 and 3.
Below are methods I know are out there. My question is what is missing from this list and is there an alternative that dodges all my deal-breakers?
Python's pickle module (deal-breaker: inflates the size of numpy arrays a lot)
Numpy's save/savez/load (deal-breaker: Incompatible format across Python 2/3)
PyTables replacement for numpy.savez (deal-breaker: only handles numpy arrays)
Using PyTables manually (deal-breaker: I want this for constantly changing research code, so it's really convenient to be able to dump dictionaries to files by calling a single function)
The PyTables replacement of numpy.savez is promising, since I like the idea of using hdf5 and it compresses the numpy arrays really efficiently, which is a big plus. However, it does not take any type of dictionary structure.
Lately, what I've been doing is to use something similar to the PyTables replacement, but enhancing it to be able to store any type of entry. This actually works pretty well, but I find myself storing primitive data types in length-1 CArrays, which is a bit awkward (and ambiguous with actual length-1 arrays), even though I set chunksize to 1 so it doesn't take up that much space.
Is there something like that already out there?
Thanks!
After asking this two years ago, I started coding my own HDF5-based replacement for pickle/np.save. Ever since, it has matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for:
https://github.com/uchicago-cs/deepdish
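Basic usage is a one-liner in each direction; a rough sketch (see the project docs for the full set of options):

import numpy as np
import deepdish as dd

d = {'weights': np.random.randn(3, 3),
     'meta': {'epochs': 10, 'name': 'run1'}}

dd.io.save('run1.h5', d)      # nested dict -> HDF5 file
d2 = dd.io.load('run1.h5')    # HDF5 file -> nested dict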
I recently found myself with a similar problem, for which I wrote a couple of functions for saving the contents of dicts to a group in a PyTables file, and loading them back into dicts.
They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.
import warnings

import tables
import cPickle


def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each item in
    the dict must have a type and shape compatible with PyTables Array.

    If 'force == True', any existing child group of the parent node with the
    same name as the new group will be overwritten.

    If 'recursive == True' (default), new groups will be created recursively
    for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError as ne:
        if force:
            pathstr = parent._v_pathname + '/' + groupname
            f.remove_node(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise ne
    for key, item in dictin.iteritems():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                item = '_None'
            f.create_array(g, key, item)
    return g


def group2dict(f, g, recursive=True, warn=True, warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them as
    a Python dictionary, with the node names as the dictionary keys.

    If 'recursive == True' (default), we will recursively traverse child
    groups and put their children into sub-dictionaries, otherwise sub-
    groups will be skipped.

    Since this might potentially result in huge arrays being loaded into
    system memory, the 'warn' option will prompt the user to confirm before
    loading any individual array that is bigger than some threshold (default
    is 100MB)
    """
    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print '[!] "%s" is %iMB in size [!]' % (child._v_pathname, mem / 1E6)
            confirm = raw_input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            else:
                print "Skipping item \"%s\"..." % g._v_pathname
        else:
            return True

    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if memtest(child):
                    item = child.read()
                    if isinstance(item, str):
                        if item == '_None':
                            item = None
                else:
                    continue
            outdict.update({child._v_name: item})
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
            pass
    return outdict
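Usage could look roughly like this (a sketch against the functions above; note they target Python 2 and assume a PyTables file opened with the create_group/create_array API):

import numpy as np
import tables

f = tables.open_file('store.h5', mode='w')
data = {'spikes': np.arange(10), 'params': {'rate': 5, 'label': 'test'}}

dict2group(f, f.root, 'mydict', data)      # dict -> HDF5 group
restored = group2dict(f, f.root.mydict)    # HDF5 group -> dict
f.close()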
It's also worth mentioning joblib.dump and joblib.load, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use np.save for numpy arrays and cPickle for everything else.
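A sketch of the round trip (numpy arrays inside the structure are stored efficiently; the compress argument is optional):

import numpy as np
import joblib

data = {'features': np.zeros((1000, 50)), 'labels': [0, 1, 1, 0]}

joblib.dump(data, 'data.joblib', compress=3)   # write to disk
data2 = joblib.load('data.joblib')             # read it back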
I tried playing with np.memmap for saving an array of dictionaries. Say we have the dictionary:
a = np.array([str({'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]})])
first I tried to directly save it to a memmap:
f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = a
# CRASHES when reopening since it loses the memory pointer

f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = a
# CRASHES when reopening for the same reason
The way that worked was converting the dictionary to a string:
f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(a)
This works, and afterwards you can eval(f[0]) to get the value back.
I do not know the advantage of this approach over the others, but it deserves a closer look.
I absolutely recommend a Python object database like ZODB. It seems pretty well suited for your situation, considering you store objects (literally whatever you like) in a dictionary - this means you can store dictionaries inside dictionaries. I've used it in a wide range of problems, and the nice thing is that you can just hand somebody the database file (the one with a .fs extension). With this, they'll be able to read it in, perform any queries they wish, and modify their own local copies. If you wish to have multiple programs simultaneously accessing the same database, I'd make sure to look at ZEO.
Just a silly example of how to get started:
from ZODB import DB
from ZODB.FileStorage import FileStorage
from ZODB.PersistentMapping import PersistentMapping
import transaction
from persistent import Persistent
from persistent.dict import PersistentDict
from persistent.list import PersistentList
# Defining database type and creating connection.
storage = FileStorage('/path/to/database/zodbname.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
# Define and populate the structure.
root['Vehicle'] = PersistentDict() # Upper-most dictionary
root['Vehicle']['Tesla Model S'] = PersistentDict() # Object 1 - also a dictionary
root['Vehicle']['Tesla Model S']['range'] = "208 miles"
root['Vehicle']['Tesla Model S']['acceleration'] = 5.9
root['Vehicle']['Tesla Model S']['base_price'] = "$71,070"
root['Vehicle']['Tesla Model S']['battery_options'] = ["60kWh","85kWh","85kWh Performance"]
# more attributes here
root['Vehicle']['Mercedes-Benz SLS AMG E-Cell'] = PersistentDict() # Object 2 - also a dictionary
# more attributes here
# add as many objects with as many characteristics as you like.
# committing changes; up until this point things can be rolled back
transaction.get().commit()
transaction.get().abort()
connection.close()
db.close()
storage.close()
Once the database is created it's very easy to use. Since it's an object database (a dictionary), you can access objects very easily:
#after it's opened (lines from the very beginning, up to and including root = connection.root() )
>>> root['Vehicle']['Tesla Model S']['range']
'208 miles'
You can also display all of the keys (and do all other standard dictionary things you might want to do):
>>> root['Vehicle']['Tesla Model S'].keys()
['acceleration', 'range', 'battery_options', 'base_price']
Last thing I want to mention is that keys can be changed: Changing the key value in python dictionary. Values can also be changed - so if your research results change because you change your method or something you don't have to start the entire database from scratch (especially if everything else is still okay). Be careful with doing both of these. I put in safety measures in my database code to make sure I'm aware of my attempts to overwrite keys or values.
** ADDED **
# added imports
import numpy as np
from tempfile import TemporaryFile
outfile = TemporaryFile()
# insert into definition/population section
np.save(outfile,np.linspace(-1,1,10000))
root['Vehicle']['Tesla Model S']['arraydata'] = outfile
# check to see if it worked
>>> root['Vehicle']['Tesla Model S']['arraydata']
<open file '<fdopen>', mode 'w+b' at 0x2693db0>
outfile.seek(0)# simulate closing and re-opening
A = np.load(root['Vehicle']['Tesla Model S']['arraydata'])
>>> print A
array([-1. , -0.99979998, -0.99959996, ..., 0.99959996,
0.99979998, 1. ])
You could also use numpy.savez() (or numpy.savez_compressed() for compressed saving) to store multiple numpy arrays in this exact same way.
This is not a direct answer. Anyway, you may also be interested in JSON. Have a look at 13.10. Serializing Datatypes Unsupported by JSON. It shows how to extend the format for unsupported types.
The whole chapter from "Dive into Python 3" by Mark Pilgrim is definitely a good read, if only to know what is already out there.
Update: Possibly an unrelated idea, but... I have read somewhere that one of the reasons why XML was finally adopted for data exchange in heterogeneous environments was a study that compared a specialized binary format with zipped XML. The conclusion for you could be to use a possibly less space-efficient solution and compress it via zip or another well-known algorithm. Using a known algorithm helps when you need to debug (unzip and then look at the text file by eye).
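As a sketch of that extension idea for one unsupported type (numpy arrays), a custom encoder/decoder pair might look like this:

import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {'__ndarray__': obj.tolist(), 'dtype': str(obj.dtype)}
        return json.JSONEncoder.default(self, obj)

def decode_ndarray(dct):
    if '__ndarray__' in dct:
        return np.array(dct['__ndarray__'], dtype=dct['dtype'])
    return dct

blob = json.dumps({'a': np.arange(3)}, cls=NumpyEncoder)
restored = json.loads(blob, object_hook=decode_ndarray)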
Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40kB, and is a large (ha) source of the problem. I've included some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""

    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']

    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)

    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks

    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgeable sages?
Database files of GB size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1 kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection again has a cache for the last 100 statements, as you can see in the function description:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache size, it is not likely that you will benefit from actively caching statements or pages at application start.
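For example, both knobs can be set from Python when the connection is created (a sketch; cache_size is measured in database pages):

import sqlite3

# keep up to 200 prepared statements and 4000 pages in the cache
conn = sqlite3.connect('/some/file/path.db', cached_statements=200)
conn.execute("PRAGMA cache_size = 4000")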
I have a script which processes a list of URLs. The script may be called at any time with a fresh list of URLs. I want to avoid processing an URL which has already been processed at any time in the past.
At this point, all I want to match are URLs, which are really long strings, against all previously processed URLs, to ensure uniqueness.
My question is: how does an SQL query matching a text URL against a MySQL database of only URLs (say 40000 long text URLs) compare with my other idea of hashing the URLs and saving the hashes using, say, Python's shelve module?
shelf[hash(url)] = 1
Is shelve usable for a dictionary with 40000 string keys? What about with 40000 numerical keys with binary values? Any gotchas with choosing shelve over MySQL for this simple requirement?
Or, if I use a DB, is there a huge benefit to store URL hashes in my MySQL DB instead of string URLs?
Store the URLs in a set, which assures O(1) time for finding items, and shelve it. With this number of URLs, storing and restoring will take very little time and memory:
import shelve
# Write URLS to shelve
urls= ['http://www.airmagnet.com/', 'http://www.alcatel-lucent.com/',
'http://www.ami.com/', 'http://www.apcc.com/', 'http://www.stk.com/',
'http://www.apani.com/', 'http://www.apple.com/',
'http://www.arcoide.com/', 'http://www.areca.com.tw/',
'http://www.argus-systems.com/', 'http://www.ariba.com/',
'http://www.asus.com.tw/']
s=set(urls) # Store URLs as set - Search is O(1)
sh=shelve.open('/tmp/shelve.tmp') # Dump set (as one unit) to shelve file
sh['urls']=s
sh.close()
sh=shelve.open('/tmp/shelve.tmp') # Retrieve set from file
s=sh['urls']
print 'http://www.apple.com/' in s # True
print 'http://matan.name/' in s # False
This approach is quite fast:
import random
import string
import shelve
import datetime
urls=[''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(50))
for i in range(40000)]
s=set(urls)
start=datetime.datetime.now()
sh=shelve.open('/tmp/test.shelve')
sh['urls']=s
end=datetime.datetime.now()
print end-start
Using a shelve is in general a bad idea for large amounts of data. A database is better suited when you have lots of data.
Options are:
ZODB (Python Object Database)
any RDBMS
noSQL world (like MongoDB which is easy approachable and very fast)
Hashing is a good idea. To search strings in databases, indexes are used. Since one can define a comparison operation on strings, it's possible to build an index which is a search tree and process each query with logarithmic complexity.
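To make the indexed-lookup idea concrete, here is a minimal sqlite3 sketch (illustration only; the same pattern applies to MySQL) that stores one hash per URL and answers membership queries through the primary-key index:

import hashlib
import sqlite3

conn = sqlite3.connect('seen_urls.db')
conn.execute("CREATE TABLE IF NOT EXISTS seen (url_hash TEXT PRIMARY KEY)")

def url_hash(url):
    return hashlib.sha1(url.encode('utf-8')).hexdigest()

def already_processed(url):
    row = conn.execute("SELECT 1 FROM seen WHERE url_hash = ?",
                       (url_hash(url),)).fetchone()
    return row is not None

def mark_processed(url):
    conn.execute("INSERT OR IGNORE INTO seen (url_hash) VALUES (?)", (url_hash(url),))
    conn.commit()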
I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs->positions of occurrence.
The data model looks something like:
{word : { doc_name : [location_list] } }
Building the index in memory works fine, but when I try to serialize to disk, I hit a MemoryError. Here's my code:
# Write the index out to disk
serializedIndex = open(sys.argv[3], 'wb')
cPickle.dump(index, serializedIndex, cPickle.HIGHEST_PROTOCOL)
Right before serialization, my program is using about 50% memory (1.6 Gb). As soon as I make the call to cPickle, my memory usage skyrockets to 80% before crashing.
Why is cPickle using so much memory for serialization? Is there a better way to be approaching this problem?
cPickle needs to use a bunch of extra memory because it does cycle detection. You could try using the marshal module if you are sure your data has no cycles.
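A sketch of the swap (keep in mind that marshal only handles core built-in types such as dicts, lists, strings and numbers, and that its format is not guaranteed to be stable across Python versions):

import marshal

index = {'word': {'doc1': [1, 5, 9], 'doc2': [3]}}   # built-ins only, no cycles

with open('index.marshal', 'wb') as fh:
    marshal.dump(index, fh)

with open('index.marshal', 'rb') as fh:
    index = marshal.load(fh)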
There's the other pickle library you could try. Also there might be some cPickle settings you could change.
Other options: Break your dictionary into smaller pieces and cPickle each piece. Then put them back together when you load everything in.
Sorry this is vague, I'm just writing off the top of my head. I figured it might still be helpful since no one else has answered.
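A rough sketch of the split-and-reassemble idea (using the stdlib pickle module; cPickle would be the Python 2 equivalent):

import pickle

def dump_in_chunks(index, n_chunks, prefix='index_part'):
    # split the top-level keys into roughly equal groups and pickle each group
    keys = list(index)
    step = len(keys) // n_chunks + 1
    for i in range(0, len(keys), step):
        part = {k: index[k] for k in keys[i:i + step]}
        with open('%s_%d.pkl' % (prefix, i // step), 'wb') as fh:
            pickle.dump(part, fh, pickle.HIGHEST_PROTOCOL)

def load_chunks(filenames):
    # merge the pieces back into a single dict
    index = {}
    for name in filenames:
        with open(name, 'rb') as fh:
            index.update(pickle.load(fh))
    return index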
You may well be using the wrong tool for this job. If you want to persist a huge amount of indexed data, I'd strongly suggest using an SQLite on-disk database (or, of course, just a normal database) with an ORM like SQLObject or SQL Alchemy.
These will take care of the mundane things like compatibility, optimising the format for its purpose, and not holding all the data in memory simultaneously so that you don't run out of memory...
Added: Because I was working on a near identical thing anyway, but mainly because I'm such a nice person, here's a demo that appears to do what you need (it'll create an SQLite file in your current dir, and delete it if a file with that name already exists, so put it somewhere empty first):
import sqlobject
from sqlobject import SQLObject, UnicodeCol, ForeignKey, IntCol, SQLMultipleJoin
import os

DB_NAME = "mydb"
ENCODING = "utf8"

class Document(SQLObject):
    dbName = UnicodeCol(dbEncoding=ENCODING)

class Location(SQLObject):
    """ Location of each individual occurrence of a word within a document.
    """
    dbWord = UnicodeCol(dbEncoding=ENCODING)
    dbDocument = ForeignKey('Document')
    dbLocation = IntCol()

TEST_DATA = {
    'one' : {
        'doc1' : [1, 2, 10],
        'doc3' : [6],
    },
    'two' : {
        'doc1' : [2, 13],
        'doc2' : [5, 6, 7],
    },
    'three' : {
        'doc3' : [1],
    },
}

if __name__ == "__main__":
    db_filename = os.path.abspath(DB_NAME)
    if os.path.exists(db_filename):
        os.unlink(db_filename)
    connection = sqlobject.connectionForURI("sqlite:%s" % (db_filename))
    sqlobject.sqlhub.processConnection = connection

    # Create the tables
    Document.createTable()
    Location.createTable()

    # Import the dict data:
    for word, locs in TEST_DATA.items():
        for doc, indices in locs.items():
            sql_doc = Document(dbName=doc)
            for index in indices:
                Location(dbWord=word, dbDocument=sql_doc, dbLocation=index)

    # Let's check out the data... where can we find 'two'?
    locs_for_two = Location.selectBy(dbWord = 'two')
    # Or...
    # locs_for_two = Location.select(Location.q.dbWord == 'two')

    print "Word 'two' found at..."
    for loc in locs_for_two:
        print "Found: %s, p%s" % (loc.dbDocument.dbName, loc.dbLocation)

    # What documents have 'one' in them?
    docs_with_one = Location.selectBy(dbWord = 'one').throughTo.dbDocument

    print
    print "Word 'one' found in documents..."
    for doc in docs_with_one:
        print "Found: %s" % doc.dbName
This is certainly not the only way (or necessarily the best way) to do this. Whether the Document or Word tables should be separate tables from the Location table depends on your data and typical usage. In your case, the "Word" table could probably be a separate table with some added settings for indexing and uniqueness.