We have a CSV file with about 46 million records. Each record has about 18 fields, one of which is a 64-byte ID. We have another file with about 167K unique IDs in it. The records corresponding to those IDs need to be pulled out, so we wrote a Python program that reads the 167K IDs into an array and processes the 46-million-record file, checking whether the ID in each record is in that array. Here is a snippet of the code:
import csv
...
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1
We tested the program on a small data set; here are the processing times:
lines: 117929 process time: 236.388447046 sec
lines: 145390 process time: 277.075321913 sec
We started running the program on the 46-million-record file (about 13 GB in size) last night around 3:00 AM EST; it is now around 10 AM EST and it is still processing!
Questions:
Is there a better way to process these records to improve on processing times?
Is python the right choice? Would awk or some other tool help?
I am guessing the 64-byte ID look-up against the 167K-element array in the following statement is the culprit:
if fieldAry[CUSTOMER_ID] not in idArray:
Is there a better alternative?
Thank you!
Update: This is processed on an EC2 instance with EBS attached volume.
You should use a set instead of a list; before the for loop, do:
idArray = set(idArray)
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1
And see the unbelievable speed-up; you're wasting days of unnecessary processing time just because you chose the wrong data structure.
The in operator has O(1) time complexity with a set, but O(n) with a list. This might sound like "not a big deal", but it is the bottleneck in your script, even though a set has somewhat higher constant factors hiding in that O(1). Your code is spending something like 30,000 times more time on that single in operation than is necessary: if the optimal version needed 30 seconds, the list version spends 10 days on that single line alone.
See the following test: I generate 1 million IDs and set 10,000 of them aside in another list, to_remove. I then do a for loop like yours, performing the in operation for each record:
import random
import timeit

all_ids = [random.randint(1, 2**63) for i in range(1000000)]
to_remove = all_ids[:10000]
random.shuffle(to_remove)
random.shuffle(all_ids)

def test_set():
    to_remove_set = set(to_remove)
    for i in all_ids:
        if i in to_remove_set:
            pass

def test_list():
    for i in all_ids:
        if i in to_remove:
            pass

print('starting')
print('testing list', timeit.timeit(test_list, number=1))
print('testing set', timeit.timeit(test_set, number=1))
And the results:
testing list 227.91903045598883
testing set 0.14897623099386692
The set version took 149 milliseconds; the list version required 228 seconds. And these were small numbers: you have roughly 50 million input records against my 1 million, so you need to multiply the set-version time by 50; with your data set it would take about 7.5 seconds.
For the list version, you need to multiply that time by 50 * 17: not only are there 50 times more input records, there are roughly 17 times more IDs to match against. Thus we get 227 * 50 * 17 = 192,950 seconds.
So your algorithm spends 2.2 days doing something that, with the correct data structure, can be done in 7.5 seconds. Of course this does not mean you can scan the whole 13 GB file in 7.5 seconds, but the total probably isn't more than 2.2 days either. So we changed from:
    ~2 days                               ~2.2 days
    |-- reading and writing the files --||------- doing "id in list" -------|

to something like

    ~2 days                               ~7.5 seconds
    |-- reading and writing the files --|| (doing "id in set")
The easiest way to speed things up a bit would be to parallelize the line processing with some distributed solution. One of the easiest would be to use multiprocessing.Pool. You should do something like this (syntax not checked):
from multiprocessing import Pool
p = Pool(processes=4)
p.map(process_row, csvReadHandler)
That said, Python is not the best language for this kind of batch processing (mainly because writing to disk is quite slow). It is better to leave all the disk-writing management (buffering, queueing, etc.) to the Linux kernel, so a bash solution would be quite a bit better. The most efficient way would be to split the input file into chunks and simply do an inverse grep to filter out the IDs:
for file in $list_of_splitted_files; do
    grep -vE 'id1|id2|id3|...' "$file" > "$file.out"
done
If you need to merge afterwards, simply:

for file in *.out; do
    cat "$file" >> final_results.csv
done
Considerations:
I don't know whether doing a single grep with all the IDs is more or less efficient than looping through all the IDs and doing one single-ID grep per pass.
When writing parallel solutions, try to read from and write to different files to minimize the I/O bottleneck (all threads trying to write to the same file); this holds in all languages.
Put timers on your code for each processing part, so you can see which part is wasting the most time. I really recommend this because I had a similar program to write, and I thought the processing part was the bottleneck (similar to your comparison with the IDs vector), but in reality it was the I/O that was dragging down the whole execution.
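Per-part timers like the ones suggested above can be sketched in a few lines; the phase names ("parse", "io") are made-up examples:

```python
import time
from collections import defaultdict

timers = defaultdict(float)

def timed(phase, fn, *args):
    # Accumulate wall-clock time per processing phase
    start = time.perf_counter()
    result = fn(*args)
    timers[phase] += time.perf_counter() - start
    return result

# Toy usage: time a "parse" phase and an "io" phase separately,
# then inspect `timers` at the end of the run to find the bottleneck
timed("parse", sum, range(1000))
timed("io", len, "abc")
```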
I think it's better to use a database to solve this problem:

first create a database (MySQL or anything else),
then load your data from the files into two tables,
and finally use a simple SQL query to select the rows, something like:

select * from table1 where id in (select ids from table2)
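The same idea is easy to prototype with SQLite from Python before standing up a MySQL server; the table and column names below are made up. Note that for the original question you would want NOT IN rather than IN, since the matching records are the ones to drop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (customer_id TEXT, payload TEXT)")
conn.execute("CREATE TABLE ids_to_remove (id TEXT PRIMARY KEY)")

# Toy data: three records, one of which should be filtered out
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [("c1", "row1"), ("c2", "row2"), ("c3", "row3")])
conn.executemany("INSERT INTO ids_to_remove VALUES (?)", [("c2",)])

# Anti-join: keep only records whose ID is absent from the removal table
kept = conn.execute(
    "SELECT payload FROM records "
    "WHERE customer_id NOT IN (SELECT id FROM ids_to_remove)").fetchall()
```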
Related
I'm using Neo4J to identify the connections between different node labels.
Neo4J 4.4.4 Community Edition
DB rolled out in docker container with k8s orchestrating.
MATCH (source_node: Person) WHERE source_node.name in $inputs
MATCH (source_node)-[r]->(child_id:InternalId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
WITH [type(r), toString(date(r.valid_from)), child_id.id] as child_path, child_id, false as filtered
CALL apoc.do.when(filtered,
'RETURN child_path as full_path, NULL as issuer_id',
'OPTIONAL MATCH p_path = (child_id)-[:HAS_PARENT_ID*0..50]->(parent_id:InternalId)
WHERE all(a in relationships(p_path) WHERE a.valid_from <= datetime($actualdate) < a.valid_to) AND
NOT EXISTS{ MATCH (parent_id)-[q:HAS_PARENT_ID]->() WHERE q.valid_from <= datetime($actualdate) < q.valid_to}
WITH DISTINCT last(nodes(p_path)) as i_source,
reduce(st = [], q IN relationships(p_path) | st + [type(q), toString(date(q.valid_from)), endNode(q).id])
as parent_path, CASE WHEN length(p_path) = 0 THEN NULL ELSE parent_id END as parent_id, child_path
OPTIONAL MATCH (i_source)-[r:HAS_ISSUER_ID]->(issuer_id:IssuerId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
RETURN DISTINCT CASE issuer_id WHEN NULL THEN child_path + parent_path + [type(r), NULL, "NOT FOUND IN RELATION"]
ELSE child_path + parent_path + [type(r), toString(date(r.valid_from)), toInteger(issuer_id.id)]
END as full_path, issuer_id, CASE issuer_id WHEN NULL THEN true ELSE false END as filtered',
{filtered: filtered, child_path: child_path, child_id: child_id, actualdate: $actualdate}
)
YIELD value
RETURN value.full_path as full_path, value.issuer_id as issuer_id, value.filtered as filtered
When the query executes on a large number of incoming names (Person), it is processed quickly: for example, for 100,000 inputs it takes ~2.5 seconds. However, if the 100,000 names are divided into small batches and the query is executed sequentially for each batch, the overall processing time increases dramatically:
batches of 100 names: ~2 min
batches of 1,000 names: ~10 sec
Could you please give me a clue why exactly this is happening? And how could I get the same execution time as for the entire dataset, regardless of batch size?
Is there any possibility of dividing the transactions into multiple processes? I tried Python multiprocessing with the Neo4j driver. It works faster, but for some reason still cannot reach the target execution time of 2.5 sec.
Is there any possibility of keeping the entire graph in memory for the whole container lifecycle? Could that help resolve the issue with execution speed on multiple batches instead of the entire dataset?
Essentially, the goal is to use batches as small as possible while still processing the entire dataset.
Thank you.
PS: Any suggestions to improve the query are very welcome.
When you pass in a list, Neo4j uses an index to efficiently filter down the results (the list is handed to the index), and then you do additional aggressive filtering on properties.
So if you run the query with PROFILE you will see how much data is loaded / touched at each step.
A single execution makes more efficient use of resources like heap and page-cache.
For individual batched executions, each batch has to go through the whole machinery (driver, query parsing, planning, runtime), and depending on whether you execute your queries in parallel (do you?) or sequentially, the next query needs to wait until the previous one has finished.
Multiple executions also contend for resources like memory, I/O, and network.
Python is also not the fastest driver, especially if you send/receive larger volumes of data; try one of the other language drivers if that serves you better.
Why don't you just always execute one large batch then?
With Neo4j EE (e.g. on Aura) or CE 5 you will also get better runtimes and execution.
Yes if you configure your page-cache large enough to hold the store, it will keep the graph in memory during the execution.
If you run PROFILE with your query you should also see page-cache faults, when it needs to fetch data from disk.
I have 100s of files that are 10s of Gbyte each. I need to reformat the files and combine them into a more usable table format so I can group, sum, average, etc. the data. Reformatting the data using Python would take over a week. Even after I reformat it to a table, I don't know if it would be too large for a dataframe, but one problem at a time.
Can anyone suggest a faster method to reformat the text files? I'll consider anything: C++, Perl, etc.
Sample data:
Scenario: Modeling_5305 (0.0001)
Position: NORTHERN UTILITIES SR NT,
" ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",
"2017/12/31",16.0137 T,4.4194 % SEMI 30/360,0.4980 % SEMI 30/360,"6,934,452.0000 USD","6,884,052.0000 USD","7,000,000.0000 USD",371.6160 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8548 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,
"2018/01/12",15.5666 T,4.8499 % SEMI 30/360,0.4980 % SEMI 30/360,"6,477,803.7492 USD","6,418,163.7492 USD","7,000,000.0000 USD",356.9428 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8219 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,
Scenario: Modeling_5305 (0.0001)
Position: OLIVIA ISSUER TR SER A (A,
" ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",
"2017/12/31",1.3160 T,19.0762 % SEMI 30/360,0.2990 % SEMI 30/360,"3,862,500.0000 USD","3,862,500.0000 USD","5,000,000.0000 USD",2.3811 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.3288 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,
"2018/01/12",1.2766 T,21.9196 % SEMI 30/360,0.2990 % SEMI 30/360,"3,815,391.3467 USD","3,815,391.3467 USD","5,000,000.0000 USD",2.2565 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.2959 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,
I'd like to reformat to this csv table so I can import to dataframe:
Position, Scenario, TimeSteps, THEO/Value
NORTHERN UTILITIES SR NT, Modeling_5305, 2018/01/12, 6477803.7492
OLIVIA ISSUER TR SER A (A, Modeling_5305, 2018/01/12, 3815391.3467
There are two big bottlenecks when you have to manipulate large files or a great number of files. One is your file system, which is limited by your HDD or SSD (the storage medium), the connection to that medium, and your operating system. Usually you cannot change that, but you have to ask yourself: what is my top speed? How fast can the system read and write? You can never be faster than that.
A rough estimation of the maximum speed would be the time you need to read all your data plus the time you need to write all your data.
The other bottleneck is the library you use to make the changes. Not all Python packages are created equal; there are huge speed differences. I recommend trying several approaches on a small test sample until you find one that works for you.
Keep in mind that most file systems like to read or write huge chunks of data, so you should try to avoid situations where you alternate between reading one line and writing one line. In other words, not only the library matters, but also how you use it.
Different programming languages, while they might provide a good library for this task and can be a good idea, will not speed up the process in any meaningful way (you won't get 10 times the speed or anything like it).
I would use C/C++ with memory mapping.
With memory mapping you can go through the data as if it were one big array of bytes (this also avoids a copy of the data from kernel space to user space, at least on Windows; I'm not sure about Linux).
For very large files you could map one block at a time (for example 10GB).
For writing, use a buffer (say 1MB) to store the results, and then write that buffer to the file each time (using fwrite()).
Whatever you do, do not use streaming I/O or readline().
The process should take no longer (or at least not much longer) than the time it would take to copy the files on disk (or over network since you're using network file storage).
If you have the option, write the data to a different (physical) disk than the one you are reading from.
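The answer above recommends C/C++, but the same memory-mapping idea is available from Python's mmap module; a minimal sketch using a throwaway sample file:

```python
import mmap
import os
import tempfile

# Write a small sample file to stand in for one of the 10s-of-GB inputs
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"line one\nline two\nline three\n")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The whole file behaves like one big bytes-like array,
        # so scanning it avoids per-line read calls
        count = mm[:].count(b"\n")
```

For a real multi-GB file you would slice the map (e.g. `mm[offset:offset + block]`) rather than copy it all with `mm[:]` as done here for brevity.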
So I am working on a project in which I have to read a large database (large for me) of 10 million records. I cannot really filter them, because I have to treat them all individually: for each record I must apply a formula and then write the result into multiple files, depending on certain conditions of the record.
I have implemented a few algorithms, and finishing the whole processing takes around 2-3 days. This is a problem because I am trying to optimise a process that already takes this long; 1 day would be acceptable.
So far I have tried indexes on the database and threading (of the processing of each record, not of the I/O operations). I cannot get a shorter time.
I am using Django, and I fail to measure how long it really takes to start treating the data due to its lazy behaviour. I would also like to know if I can start treating the data as soon as I receive it, rather than waiting for all the data to be loaded into memory before I can actually process it. It could also be my understanding of write operations in Python. Lastly, it could be that I need a better machine (I doubt it: I have 4 cores and 4 GB RAM, which should be able to give better speeds).
Any ideas? I really appreciate the feedback. :)
Edit: Code
Explanation:
The records I talked about are IDs of customers (passports), and the conditions are whether there are agreements between the different terminals of the company (countries). The process is a hashing.
The first strategy tries to treat the whole database... At the beginning there is some preparation for the condition part of the algorithm (agreements between countries), then a large membership check against a set.
Since I've been trying to improve it on my own, for the second strategy I tried to cut the problem into parts, treating the query by parts (obtaining the records that belong to a country and writing into the files of the countries that have an agreement with it).
The threaded strategy is not shown because it was designed for a single country and I got awful results compared with the non-threaded version. I honestly have the intuition it has to be a memory and SQL thing.
def create_all_files(strategy=0):
    if strategy == 0:
        set_countries_agreements = set()
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        for each_country in set_countries:
            set_agreements = frozenset(get_agreements(each_country))
            set_countries_agreements.add(set_agreements)
        print("All agreements obtained")
        set_passports = Passport.objects.all()
        print("All passports obtained")
        for each_passport in set_passports:
            for each_agreement in set_countries_agreements:
                for each_country in each_agreement:
                    if each_passport.nationality == each_country:
                        with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % iter(each_agreement).next()), "a") as f:
                            f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                        print(".")
                print("_")
            print("-")
        print("~")

    if strategy == 1:
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        while len(set_countries) != 0:
            country = set_countries.pop()
            list_countries = get_agreements(country)
            list_passports = Passport.objects.filter(nationality=country)
            for each_passport in list_passports:
                for each_country in list_countries:
                    with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % each_country), "a") as f:
                        f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                    print("r")
                print("c")
            print("p")
        print("P")
In your question, you are describing an ETL process. I suggest you use an ETL tool.
To mention a Python ETL tool, I can point to pygrametl, written by Christian Thomsen; in my opinion it runs nicely and its performance is impressive. Test it and come back with results.
I can't post this answer without mentioning MapReduce. This programming model can fit your requirements if you are planning to distribute the task across nodes.
It looks like you have a file for each country that you append hashes to. Instead of opening and closing handles to these files 10 million+ times, you should open each one once and close them all at the end.
countries = {}  # country -> file handle
with open(os.path.join(PROJECT_ROOT, 'list_countries')) as country_file:
    for line in country_file:
        country = line.strip()
        countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")

for country in countries:
    agreements = get_agreements(country)
    for passport in Passport.objects.filter(nationality=country):
        for agreement in agreements:
            countries[agreement].write(generate_hash(passport.nationality + "<" + passport.id_passport, agreement) + "\n")

for country, file in countries.items():
    file.close()
I don't know how big a list Passport.objects.filter(nationality=country) will return; if it is massive and memory is an issue, you will have to start thinking about chunking/paginating the query using limits.
You are using sets for your list of countries and their agreements. If that is because your file containing the list of countries is not guaranteed to be unique, the dictionary solution may error when you attempt to open another handle to the same file. This can be avoided by adding a simple check to see if the country is already a member of countries.
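That duplicate-handle guard can be sketched like this; the country codes and the temporary directory are toy stand-ins for the real list_countries file and generated_indexes directory:

```python
import os
import tempfile

out_dir = tempfile.mkdtemp()
handles = {}
# "fr" appears twice, as it might if list_countries is not unique
for country in ["fr", "de", "fr"]:
    if country not in handles:  # guard: never open the same file twice
        handles[country] = open(os.path.join(out_dir, country), "a")

for h in handles.values():
    h.close()
```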
I busted through my daily free quota on a new project this weekend. For reference, that's 0.05 million writes, or 50,000 if my math is right.
Below is the only code in my project that is making any Datastore write operations.
old = Streams.query().fetch(keys_only=True)
ndb.delete_multi(old)

try:
    r = urlfetch.fetch(url=streams_url,
                       method=urlfetch.GET)
    streams = json.loads(r.content)
    for stream in streams['streams']:
        stream = Streams(channel_id=stream['_id'],
                         display_name=stream['channel']['display_name'],
                         name=stream['channel']['name'],
                         game=stream['channel']['game'],
                         status=stream['channel']['status'],
                         delay_timer=stream['channel']['delay'],
                         channel_url=stream['channel']['url'],
                         viewers=stream['viewers'],
                         logo=stream['channel']['logo'],
                         background=stream['channel']['background'],
                         video_banner=stream['channel']['video_banner'],
                         preview_medium=stream['preview']['medium'],
                         preview_large=stream['preview']['large'],
                         videos_url=stream['channel']['_links']['videos'],
                         chat_url=stream['channel']['_links']['chat'])
        stream.put()
    self.response.out.write("Done")
except urlfetch.Error, e:
    self.response.out.write(e)
This is what I know:
There will never be more than 25 "stream" entries in "streams." It's guaranteed to call .put() exactly 25 times.
I delete everything from the table at the start of this call because everything needs to be refreshed every time it runs.
Right now, this code is on a cron running every 60 seconds. It will never run more often than once a minute.
I have verified all of this by enabling Appstats and I can see the datastore_v3.Put count go up by 25 every minute, as intended.
I have to be doing something wrong here, because 25 a minute is 1,500 writes an hour, not the ~50,000 that I'm seeing now.
Thanks
You are mixing up two different things here: write API calls (what your code makes) and low-level datastore write operations. See the billing docs for the relation between them: Pricing of Costs for Datastore Calls (second section).
This is the relevant part:
New Entity Put (per entity, regardless of entity size) = 2 writes + 2 writes per indexed property value + 1 write per composite index value
In your case Streams has 15 indexed properties resulting in: 2 + 15 * 2 = 32 write OPs per write API call.
Total per hour: 60 (requests/hour) * 25 (puts/request) * 32 (operations/put) = 48,000 datastore write operations per hour
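The arithmetic in that answer, spelled out:

```python
# Each put costs 2 writes plus 2 writes per indexed property
indexed_properties = 15
writes_per_put = 2 + 2 * indexed_properties  # write ops per entity put

# The cron fires every minute and puts 25 entities each run
puts_per_request = 25
requests_per_hour = 60
ops_per_hour = requests_per_hour * puts_per_request * writes_per_put
```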
It seems as though I've finally figured out what was going on, so I wanted to update here.
I found this older answer: https://stackoverflow.com/a/17079348/1452497.
I missed somewhere along the line that the indexed properties were multiplying the writes by a factor of at least 10; I did not expect that. I didn't need everything indexed, and after turning off the indexing in my model, I noticed the write ops drop DRAMATICALLY, down to about where I expected them.
Thanks guys!
It is 1,500 * 24 = 36,000 writes/day, which is very near the daily quota.
Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40 kB, and they are a large (ha) source of the problem. I have some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']

    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)
    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgable sages?
Database files of GB size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1 kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection also keeps a cache of the last 100 statements, as you can see in the function signature:

sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])

cached_statements expects an integer and defaults to 100.
Apart from setting the cache sizes, it is not likely that you will benefit from actively caching statements or pages at application start.
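Both knobs can be exercised from Python in a few lines; the values here are arbitrary examples, not tuned recommendations:

```python
import sqlite3

# cached_statements sizes the per-connection prepared-statement cache
# (the default is 100)
conn = sqlite3.connect(":memory:", cached_statements=200)

# cache_size sets how many pages SQLite keeps in its page cache;
# reading the pragma back confirms the setting took effect
conn.execute("PRAGMA cache_size = 4000")
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
```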