Inadequate RAM usage by Redis

Inadequate RAM usage by Redis - python

I'm developing an API using Go and Redis. The problem is that RAM usage is inadequate and I can't find the root of the problem.
TL;DR version
There are hundreds/thousands of hash objects. Each one of 1 KB objects (key+value) takes ~0.5 MB of RAM. However, there is no memory fragmentation (INFO shows none).
Also, dump.rdb is 70x times smaller than the RAM set (360KB dump.rdb vs 25MB RAM for 50 objects, and 35.5MB vs 2.47GB for 5000 objects).
Long version
Redis instance is filled mostly with task:123 hashes of the following kind:
"task_id" : int
"client_id" : int
"worker_id" : int
"text" : string (0..255 chars)
"is_processed" : boolean
"timestamp" : int
"image" : byte array (1 kbyte)
Also, there are a couple of integer counters, one list and one sorted set (both consist of task_id's).
RAM usage has a linear dependency on the number of task objects.
INFO output for 50 tasks:
# Memory
used_memory:27405872
used_memory_human:26.14M
used_memory_rss:45215744
used_memory_peak:31541400
used_memory_peak_human:30.08M
used_memory_lua:35840
mem_fragmentation_ratio:1.65
mem_allocator:jemalloc-3.6.0
and 5000 tasks:
# Memory
used_memory:2647515776
used_memory_human:2.47G
used_memory_rss:3379187712
used_memory_peak:2651672840
used_memory_peak_human:2.47G
used_memory_lua:35840
mem_fragmentation_ratio:1.28
mem_allocator:jemalloc-3.6.0
Size of dump.rdb for 50 tasks is 360kB and for 5000 tasks it's 35553kB.
Every task object has serializedlength of ~7KB:
127.0.0.1:6379> DEBUG OBJECT task:2000
Value at:0x7fcb403f5880 refcount:1 encoding:hashtable serializedlength:7096 lru:6497592 lru_seconds_idle:180
I've written a Python script trying to reproduce the problem:
import redis
import time
import os
from random import randint
img_size = 1024 * 1 # 1 kb
r = redis.StrictRedis(host='localhost', port=6379, db=0)
for i in range(0, 5000):
values = {
"task_id" : randint(0, 65536),
"client_id" : randint(0, 65536),
"worker_id" : randint(0, 65536),
"text" : "",
"is_processed" : False,
"timestamp" : int(time.time()),
"image" : bytearray(os.urandom(img_size)),
}
key = "task:" + str(i)
r.hmset(key, values)
if i % 500 == 0: print(i)
And it consumes just 80MB of RAM!
I would appreciate any ideas on how to figure out what's going on.

You have lots and lots of small HASH objects, and that's fine. But each of them has a lot of overhead in the redis memory, since it has a separate dictionary. There is a small optimization for this that usually improves things significantly, and it's to keep hashes in a memory optimized but slightly slower data structure, which at these object sizes should not matter much. From the config:
# Hashes are encoded using a memory efficient data structure when they have a
# small number of entries, and the biggest entry does not exceed a given
# threshold. These thresholds can be configured using the following directives.
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
Now, you have large values which causes this optimization not to work.
I'd set hash-max-ziplist-value to a few kbs (depending on the size of your largest object), and it should improve this (you should not see any performance degradation in this HASH size).
Also, keep in mind that redis compresses its RDB files relative to what you have in memory, so a ~50% reduction over memory is to be expected anyway.
[EDIT] After re-reading your question and seeing it's a go only problem, and considering the fact that the compressed rdb is small, something tells me you're writing a bigger size than you'd expect for the image. Any chance you're writing that off a []byte slice? If so, perhaps you did not trim it and you're writing a much bigger buffer or something similar? I've worked like this with redigo tons of times and never seen what you're describing.

Related

Neo4j: dependence of execution speed on batch size of input parameters

I'm using Neo4J to identify the connections between different node labels.
Neo4J 4.4.4 Community Edition
DB rolled out in docker container with k8s orchestrating.
MATCH (source_node: Person) WHERE source_node.name in $inputs
MATCH (source_node)-[r]->(child_id:InternalId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
WITH [type(r), toString(date(r.valid_from)), child_id.id] as child_path, child_id, false as filtered
CALL apoc.do.when(filtered,
'RETURN child_path as full_path, NULL as issuer_id',
'OPTIONAL MATCH p_path = (child_id)-[:HAS_PARENT_ID*0..50]->(parent_id:InternalId)
WHERE all(a in relationships(p_path) WHERE a.valid_from <= datetime($actualdate) < a.valid_to) AND
NOT EXISTS{ MATCH (parent_id)-[q:HAS_PARENT_ID]->() WHERE q.valid_from <= datetime($actualdate) < q.valid_to}
WITH DISTINCT last(nodes(p_path)) as i_source,
reduce(st = [], q IN relationships(p_path) | st + [type(q), toString(date(q.valid_from)), endNode(q).id])
as parent_path, CASE WHEN length(p_path) = 0 THEN NULL ELSE parent_id END as parent_id, child_path
OPTIONAL MATCH (i_source)-[r:HAS_ISSUER_ID]->(issuer_id:IssuerId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
RETURN DISTINCT CASE issuer_id WHEN NULL THEN child_path + parent_path + [type(r), NULL, "NOT FOUND IN RELATION"]
ELSE child_path + parent_path + [type(r), toString(date(r.valid_from)), toInteger(issuer_id.id)]
END as full_path, issuer_id, CASE issuer_id WHEN NULL THEN true ELSE false END as filtered',
{filtered: filtered, child_path: child_path, child_id: child_id, actualdate: $actualdate}
)
YIELD value
RETURN value.full_path as full_path, value.issuer_id as issuer_id, value.filtered as filtered
When query executing on a large number of incoming names (Person), it is processed quickly for example for 100,000 inputs it takes ~2.5 seconds. However, if 100,000 names are divided into small batches and fore each batch query is executed sequentially, the overall processing time increases dramatically:
100 names batch is ~2 min
1000 names batch is ~10 sec
Could you please provide me a clue why exactly this is happening? And how I could get the same executions time as for the entire dataset regardless the batch size?
Is the any possibility to divide transactions into multiple processes? I tried Python multiprocessing using Neo4j Driver. It works faster but still cannot achieve the target execution time of 2.5 sec for some reasons.
Is it any possibility to keep entire graph into memory during the whole container lifecycle? Could it help resolve the issue with the execution speed on multiple batches instead the entire dataset?
Essentially, the goal is to use as small batches as possible in order to process the entire dataset.
Thank you.
PS: Any suggestions to improve the query are very welcome.)

You pass in a list - then it will use an index to efficiently filter down the results by passing the list to the index, and you do additional aggressive filtering on properties.
So if you run the query with PROFILE you will see how much data is loaded / touched at each step.
A single execution makes more efficient use of resources like heap and page-cache.
For individual batched executions it has to go through the whole machinery (driver, query-parsing, planning, runtime), depending if you execute your queries in parallel (do you?) or sequentially, the next query needs to wait until your previous one has finished.
Multiple executions also content for resources like memory, IO, network.
Python is also not the fastest driver esp. if you send/receive larger volumes of data, try one of the other languages if that serves you better.
Why don't you just always execute one large batch then?
With Neo4j EE (e.g. on Aura) or CE 5 you will also get better runtimes and execution.
Yes if you configure your page-cache large enough to hold the store, it will keep the graph in memory during the execution.
If you run PROFILE with your query you should also see page-cache faults, when it needs to fetch data from disk.

Efficiently processing ~50 million record file in python

We have a file with about 46 million records in CSV format. Each record has about 18 fields and one of them is a 64 byte ID. We have another file with about 167K unique IDs in it. The records corresponding to the IDs needs to be yanked out. So, we have written a python program that reads the 167K IDs into an array and processes the 46 million records file checking if the ID exists in each one of the those records. Here is the snippet of the code:
import csv
...
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
lineCounts['orig'] += 1
if fieldAry[CUSTOMER_ID] not in idArray:
csvWriteHandler.writerow(fieldAry)
lineCounts['mod'] += 1
Tested the program on a small set of data, here are the processing times:
lines: 117929 process time: 236.388447046 sec
lines: 145390 process time: 277.075321913 sec
We have started running the program on the 46 million records file (which is about 13 GB size) last night ~3:00am EST, now it is around 10am EST and it is still processing!
Questions:
Is there a better way to process these records to improve on processing times?
Is python the right choice? Would awk or some other tool help?
I am guessing 64 byte ID look-up on 167K array in the following statement is the culprit:
if fieldAry[CUSTOMER_ID] not in idArray:
Is there a better alternative?
Thank you!
Update: This is processed on an EC2 instance with EBS attached volume.

You should must use a set instead of a list; before the for loop do:
idArray = set(idArray)
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
lineCounts['orig'] += 1
if fieldAry[CUSTOMER_ID] not in idArray:
csvWriteHandler.writerow(fieldAry)
lineCounts['mod'] += 1
And see the unbelievable speed-up; you're using days worth of unnecessary processing time just because you chose the wrong data structure.
in operator with set has O(1) time complexity, whereas O(n) time complexity with list. This might sound like "not a big deal" but actually it is the bottleneck in your script. Even though set will have somewhat higher constants for that O. So your code is using something like 30000 more time on that single in operation than is necessary. If in the optimal version it'd require 30 seconds, now you spend 10 days on that single line alone.
See the following test: I generate 1 million IDs and take 10000 aside into another list - to_remove. I then do a for loop like you do, doing the in operation for each record:
import random
import timeit
all_ids = [random.randint(1, 2**63) for i in range(1000000)]
to_remove = all_ids[:10000]
random.shuffle(to_remove)
random.shuffle(all_ids)
def test_set():
to_remove_set = set(to_remove)
for i in all_ids:
if i in to_remove_set:
pass
def test_list():
for i in all_ids:
if i in to_remove:
pass
print('starting')
print('testing list', timeit.timeit(test_list, number=1))
print('testing set', timeit.timeit(test_set, number=1))
And the results:
testing list 227.91903045598883
testing set 0.14897623099386692
It took 149 milliseconds for the set version; the list version required 228 seconds. Now these were small numbers: in your case you have 50 million input records against my 1 million; thus there you need to multiply the testing set time by 50: with your data set it would take about 7.5 seconds.
The list version, on the other hand, you need to multiply that time by 50 * 17 - not only are there 50 times more input records, but 17 times more records to match against. Thus we get 227 * 50 * 17 = 192950.
So your algorithm spends 2.2 days doing something that by using correct data structure can be done in 7.5 seconds. Of course this does not mean that you can scan the whole 50 GB document in 7.5 seconds, but it probably ain't more than 2.2 days either. So we changed from:
2 days 2.2 days
|reading and writing the files||------- doing id in list ------|
to something like
2 days 7.5 seconds (doing id in set)
|reading and writing the files||

The easiest way to speed things a bit would be to parallelize the line-processing with some distributed solution. On of the easiest would be to use multiprocessing.Pool. You should do something like this (syntax not checked):
from multiprocessing import Pool
p = Pool(processes=4)
p.map(process_row, csvReadHandler)
Despite that, python is not the best language to do this kind of batch processing (mainly because writing to disk is quite slow). It is better to leave all the disk-writing management (buffering, queueing, etc...) to the linux kernel so using a bash solution would be quite better. Most efficient way would be to split the input files into chunks and simply doing an inverse grep to filter the ids.
for file in $list_of_splitted_files; then
cat $file | grep -v (id1|id2|id3|...) > $file.out
done;
if you need to merge afterwards simply:
for file in $(ls *.out); then
cat $file >> final_results.csv
done
Considerations:
Don't know if doing a single grep with all the ids is more/less
efficient than looping through all the ids and doing a single-id
grep.
When writing parallel solutions try to read/write to different
files to minimize the I/O bottleneck (all threads trying to write to
the same file) in all languages.
Put timers on your code for each processing part. This way you'll see which part is wasting more time. I really recommend this because I had a similar program to write and I thought the processing part was the bottleneck (similar to your comparison with the ids vector) but in reality it was the I/O which was dragging down all the execution.

i think its better to use a database to solve this problem
first create a database like MySql or anything else
then write your data from file into 2 table
and finally use a simple sql query to select rows
somthing like:
select * from table1 where id in(select ids from table2)

Spark MLlib - trainImplicit warning

I keep seeing these warnings when using trainImplicit:
WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB).
The maximum recommended task size is 100 KB.
And then the task size starts to increase. I tried to call repartition on the input RDD but the warnings are the same.
All these warnings come from ALS iterations, from flatMap and also from aggregate, for instance the origin of the stage where the flatMap is showing these warnings (w/ Spark 1.3.0, but they are also shown in Spark 1.3.1):
org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1065)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:530)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
and from aggregate:
org.apache.spark.rdd.RDD.aggregate(RDD.scala:968)
org.apache.spark.ml.recommendation.ALS$.computeYtY(ALS.scala:1112)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1064)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:538)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)

Similar problem was described in Apache Spark mail lists - http://apache-spark-user-list.1001560.n3.nabble.com/Large-Task-Size-td9539.html
I think you can try to play with number of partitions (using repartition() method), depending of how many hosts, RAM, CPUs do you have.
Try also to investigate all steps via Web UI, where you can see number of stages, memory usage on each stage, and data locality.
Or just never mind about this warnings unless everything works correctly and fast.
This notification is hard-coded in Spark (scheduler/TaskSetManager.scala)
if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
!emittedTaskSizeWarning) {
emittedTaskSizeWarning = true
logWarning(s"Stage ${task.stageId} contains a task of very large size " +
s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
}
.
private[spark] object TaskSetManager {
// The user will be warned if any stages contain a task that has a serialized size greater than
// this.
val TASK_SIZE_TO_WARN_KB = 100
}

MongoDB, pymongo: algorithm to speed up identification of consecutive documents with matching values

I'm new to MongoDB and pymongo and looking for some guidance in terms of algorithms and performance for a specific task described below. I have posted a link to an image of the data sample and also my sample python code below.
I have a single collection that grows about 5 to 10 Million documents every month. It receives all this info from other systems, which I have no access to modify in any way (they are in different companies). Each document represent sort of a financial transaction. I need to group documents that are part of a same "transaction group".
Each document has hundreds of keys. Almost all keys vary between documents (which is why they moved from MySQL to MongoDB - no easy way to align schema). However, I found out that three keys are guaranteed to always be in all of them. I'll call these keys key1, key2 and key3 in this example. These keys are my only option to identify the transactions that are part of the same transaction group.
The basic rule is:
- If consecutive documents have the same key1, and the same key2, and the same key3, they are all in the same "transaction group". Then I must give it some integer id in a new key named 'transaction_group_id'
- Else, consecutive documents that do not matck key1, key2 and key3 are all in their own individual "transaction_groups".
It's really easy to understand it by looking at the screenshot of a data sample (better than my explanation anyway). See here:
As you can see in the sample:
- Documents 1 and 2 are in the same group, because they match key1, key2 and key3;
- Documents 3 and 4 also match and are in their own group;
- Following the same logic, documents 18 and 19 are a group obviously. However, even though they match the values of documents 1 and 3, they are not in the same group (because the documents are not consecutive).
I created a very simplified version of the current python function, to give you guys an idea of the current implementation:
def groupTransactions(mongo_host,
mongo_port,
mongo_db,
mongo_collection):
"""
Group transactions if Keys 1, 2 and 3 all match in consecutive docs.
"""
mc = MongoClient(mongo_host, mongo_port)
db = mc['testdb']
coll = db['test_collection']
# The first document transaction group must always be equal to 1.
first_doc_id = coll.find_one()['_id']
coll.update({'_id': first_doc_id},
{"$set": {"transaction_group_id": 1}},
upsert=False, multi=False)
# Cursor order is undetermined unless we use sort(), no matter what the _id is. We learned it the hard way.
cur = coll.find().sort('subtransaction_id', ASCENDING)
doc_count = cur.count()
unique_data = []
unique_data.append(cur[0]['key1'], cur[0]['key2'], cur[0]['key3'])
transaction_group_id = 1
i = 1
while i < doc_count:
doc_id = cur[i]['_id']
unique_data.append(cur[i]['key1'], cur[i]['key2'], cur[i]['key3'])
if unique_data[i] != unique_data[i-1]:
# New group find, increase group id by 1
transaction_group_id = transaction_group_id + 1
# Update the group id in the database
coll.update({'_id': doc_id},
{"$set": {"transaction_group_id": transaction_group_id}},
upsert=False, multi=False)
i = i + 1
print "%d subtransactions were grouped into %d transaction groups." % (doc_count, i)
return 1
This is the code, more or less, and it works. But it takes between 2 to 3 days to finish, which is starting to become unacceptable. The hardware is good: VMs in last generation Xeon, local MongoDB in SSD, 128GB RAM). It will probably run fast if we decide to run it on AWS, use threading/subprocesses, etc - which are all obviously good options to try at some point.
However, I'm not convinced this is the best algorithm. It's just the best I could come up with.There must be obvious ways to improve it that I'm not seeing.
Moving to c/c++ or out of NoSQL is out of the question at this point. I have to make it work the way it is.
So basically the question is: Is this the best possible algorithm (using MongoDB/pymongo) in terms of speed? If not, I'd appreciate it if you could point me in the right direction.
EDIT: Just so you can have an idea of how slow this code performance is: Last time I measured it, it took 22 hours to run on 1.000.000 results. As a quick workaround, I wrote something else to load the data to a Pandas DataFrame first and then apply the same logic of this code more or less. It took 3 to 4 minutes to group everything, using the same hardware. I mean, I know Pandas is efficient, etc. But there's something wrong, there can't be such a huge gap between between the two solutions performances (4min vs 1,320min).

It is the case that most of the time is spent writing to the database, which includes the round trip of sending work to the DB, plus the DB doing the work. I will point out a couple of places where you can speed up each of those.
Speeding up the back-and-forth of sending write requests to the DB:
One of the best ways to improve the latency of the requests to the DB is to minimize the number of round trips. In your case, it's actually possible because multiple documents will get updated with the same transaction_group_id. If you accumulate their values and only send a "multi" update for all of them, then it will cut down on the back-and-forth. The larger the transaction groups the more this will help.
In your code, you would replace the current update statement:
coll.update( {'_id': doc_id},
{"$set": {"transaction_group_id": transaction_group_id}},
upsert=False, multi=False)
With an accumulator of doc_id values (appending them to a list should be just fine). When you detect the "pattern" change and transaction group go to the next one, you would then run one update for the whole group as:
coll.update( {'_id': {$in: list-of-docids },
{"$set": {"transaction_group_id": transaction_group_id}},
upsert=False, multi=True)
A second way of increasing parallelism of this process and speeding up end-to-end work would be to split the job between more than one client - the downside of this is that you need a single unit of work to pre-calculate how many transaction_group_id values there will be and where the split points are. Then you can have multiple clients like this one which only handle range of specific subtransaction_id values and their transaction_group_id starting value is not 1 but whatever is given to them by the "pre-work" process.
Speeding up the actual write on the DB:
The reason I asked about existence of the transaction_group_id field is because if a field that's being $set does not exist, it will be created and that increases the document size. If there is not enough space for the increased document, it has to be relocated and that's less efficient than the in-place update.
MongoDB stores documents in BSON format. Different BSON values have different sizes. As a quick demonstration, here's a shell session that shows total document size based on the type and size of value stored:
> db.sizedemo.find()
{ "_id" : ObjectId("535abe7a5168d6c4735121c9"), "transaction_id" : "" }
{ "_id" : ObjectId("535abe7d5168d6c4735121ca"), "transaction_id" : -1 }
{ "_id" : ObjectId("535abe815168d6c4735121cb"), "transaction_id" : 9999 }
{ "_id" : ObjectId("535abe935168d6c4735121cc"), "transaction_id" : NumberLong(123456789) }
{ "_id" : ObjectId("535abed35168d6c4735121cd"), "transaction_id" : " " }
{ "_id" : ObjectId("535abedb5168d6c4735121ce"), "transaction_id" : " " }
> db.sizedemo.find().forEach(function(doc) { print(Object.bsonsize(doc)); })
43
46
46
46
46
53
Note how the empty string takes up three bytes fewer than double or NumberLong do. The string " " takes the same amount as a number and longer strings take proportionally longer. To guarantee that your updates that $set the transaction group never cause the document to grow, you want to set transaction_group_id to the same size value on initial load as it will be updated to (or larger). This is why I suggested -1 or some other agreed upon "invalid" or "unset" value.
You can check if the updates have been causing document moves by looking at the value in db.serverStatus().metrics.record.moves - this is the number of document moves caused by growth since the last time server was restarted. You can compare this number before and after your process runs (or during) and see how much it goes up relative to the number of documents you are updating.

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40kB, and are a large (ha) source of the problem. I've some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time
ref_uid_index = """CREATE INDEX ref_uid_idx
ON data(ref_uid)"""
def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
def_schema_split0 = """
CREATE TABLE main (
uid INTEGER PRIMARY KEY,
name TEXT,
label INTEGER,
ignore INTEGER default 0,
fold INTEGER default 0)"""
def_schema_split1 = """
CREATE TABLE data (
uid INTEGER PRIMARY KEY,
ref_uid INTEGER REFERENCES main(uid),
data BLOB)"""
def_insert_split0 = """
INSERT INTO main (name, label, fold)
VALUES (?,?,?)"""
def_insert_split1 = """
INSERT INTO data (ref_uid, data)
VALUES (?,?)"""
blob_size= 5000
k_folds = 5
some_names = ['apple', 'banana', 'cherry', 'date']
dbconn = sqlite3.connect(db_file)
dbconn.execute(def_schema_split0)
dbconn.execute(def_schema_split1)
rng = np.random.RandomState()
for n in range(num_points):
if n%1000 == 0 and VERBOSE:
print n
# Make up some data
data = buffer(rng.rand(blob_size).astype(float))
fold = rng.randint(k_folds)
label = rng.randint(num_classes)
rng.shuffle(some_names)
# And add it
dbconn.execute(def_insert_split0,[some_names[0], label, fold])
ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
dbconn.execute(def_insert_split1,[ref_uid,data])
dbconn.execute(ref_uid_index)
dbconn.commit()
return dbconn
def timeit_join(dbconn, n_times=10, num_rows=500):
qmarks = "?,"*(num_rows-1)+"?"
q_join = """SELECT data.data, main.uid, main.label
FROM data INNER JOIN main ON main.uid=data.ref_uid
WHERE main.uid IN (%s)"""%qmarks
row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
tstamps = []
for n in range(n_times):
now = time.time()
uids = np.random.randint(low=1,high=row_max,size=num_rows).tolist()
res = dbconn.execute(q_join, uids).fetchall()
tstamps += [time.time()-now]
print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgable sages?

Database files with GB size are never loaded into the memory entirely. They are split into a tree of socalled pages. These pages are cached in the memory, the default is 2000 pages.
You can use the following statement to e.g. double the number of cached pages of 1kB size.
conn.execute("""PRAGMA cache_size = 4000""")
The connection again has a cache for the last 100 statements, as you can see in the function description:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects and integer and defaults to 100.
Except from setting up the cache size, it is not likely that you benefit from actively caching statements or pages at the application start.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.