I am trying to compare two tables in an sqlite3 database in python. One of the answers to this question:
Comparing two sqlite3 tables using python
gives a solution:
Alternatively, import them into SQLite tables. Then you can use queries like the following:
SELECT * FROM a INTERSECT SELECT * FROM b;
SELECT * FROM a EXCEPT SELECT * FROM b;
to get rows that exist in both tables, or only in one table.
This works great for tables with fewer than a million rows, but is far too slow for my program, which requires comparing tables with more than ten billion rows. (The script took over ten minutes for just 100 million rows.)
Is there a faster way to compare two sqlite3 tables in python?
I thought about comparing the hashes of the two database files, but the overview of the dbhash program on sqlite.org notes that even when the contents of two database files are identical, certain operations "can potentially cause vast changes to the raw database file, and hence cause very different SHA1 hashes at the file level." So a file-level hash would not work unless I ran some script to query all the data in an ordered fashion and then hashed that (as the dbhash program does), but would that even be faster?
Or should I be using another database entirely that can perform this comparison faster than sqlite3?
Any ideas or suggestions would be greatly appreciated.
Edit: There have been some good ideas put forward so far, but to clarify: the order of the tables doesn't matter, just the contents.
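To make the ordered-query-and-hash idea concrete, here is a minimal sketch of what I mean; the table and column names are placeholders:

```python
import hashlib
import sqlite3

def table_digest(conn, table, columns):
    """Hash a table's contents in a canonical order, so two tables with
    the same rows (in any physical order) produce the same digest."""
    h = hashlib.sha1()
    order = ", ".join(columns)
    cur = conn.execute(f"SELECT {order} FROM {table} ORDER BY {order}")
    for row in cur:
        # repr() gives an unambiguous, type-aware encoding of each row
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

# Placeholder tables 'a' and 'b' with columns (x, y)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, y TEXT);
    CREATE TABLE b (x INTEGER, y TEXT);
    INSERT INTO a VALUES (1, 'foo'), (2, 'bar');
    INSERT INTO b VALUES (2, 'bar'), (1, 'foo');  -- same rows, different order
""")
print(table_digest(conn, "a", ["x", "y"]) == table_digest(conn, "b", ["x", "y"]))
```

This still requires a full ordered scan of both tables, though, so I'm not sure it would actually beat INTERSECT/EXCEPT.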
You could resort to the following workaround:
Add a column to each table where you store a hash over the content of all other columns.
Add an index to the new column.
Compute and store the hash with the record.
Compare the hash columns of your tables instead of using intersect/except.
If altering the tables isn't an option you can perhaps create new tables that relate a hash to the primary key or rowid of the hashed record.
With that, you shift part of the processing time needed for the comparison to the time you insert/update the records. I would expect this to be significantly faster at the time you execute the comparison than comparing all columns of all rows just then.
Of course your hash must be aware of the order of values and produce unique values for every permutation; a simple checksum won't suffice. Suggestion:
Convert every column value to a string.
Concatenate the strings with a separator that's guaranteed not to occur in the values themselves.
Use SHA1 or a similarly sophisticated hashing algorithm over the concatenated string.
You can test whether storing the hash as string, blob or integer (provided it fits into 64 bit) makes a difference in speed.
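A minimal sketch of the steps above; the table name, columns, and separator byte are assumptions:

```python
import hashlib
import sqlite3

SEP = "\x1f"  # unit separator, assumed not to occur in the data itself

def row_hash(values):
    """Order-sensitive SHA1 over the stringified column values."""
    joined = SEP.join("" if v is None else str(v) for v in values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER, y TEXT, row_sha TEXT)")
conn.execute("CREATE INDEX a_sha ON a (row_sha)")  # index the hash column

def insert(x, y):
    # Compute and store the hash with the record
    conn.execute("INSERT INTO a (x, y, row_sha) VALUES (?, ?, ?)",
                 (x, y, row_hash((x, y))))

insert(1, "foo")
# Comparing tables then reduces to set operations on the hash columns, e.g.:
#   SELECT row_sha FROM a EXCEPT SELECT row_sha FROM b;
```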
Yes, it will take a lot of time for a single thread (or even several) on a single hard drive to crawl billions of rows.
It can obviously be better with stronger DB engines but indexing all your columns would not really help in the end.
You have to resort to precalculation or distributing your dataset amongst multiple systems...
If you have a LOT of RAM, you can try copying the SQLite files into /dev/shm first, allowing you to read your data straight from memory and get a performance boost.
Related
I have a set of large tables with many records each. I'm writing a Python program that SELECTs a large number of records from these tables, based on the value of multiple columns on those records.
Essentially, these are going to be lots of queries of the form:
SELECT <some columns> FROM <some table> WHERE <column1=val1 AND column2=val2...>
Each table has a different set of columns, but otherwise the SELECT formula above holds.
By default, I was going to just run all these queries through the psycopg2 PostgreSQL database driver, each as a separate query. But I'm wondering if there's a more efficient way of going about it, given that there will be a very large number of such queries - thousands or more.
If the SELECT list entries are the same for all queries (the same number of entries and the same data types), you can use UNION ALL to combine several such queries. This won't reduce the amount of work for the database, but it will reduce the number of client-server round trips. This can be a huge improvement, because for short queries network latency is often the dominant cost.
If all your queries have different SELECT lists, there is nothing you can do.
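For example, batching several identical-shape lookups into one statement could look like the following (shown with sqlite3 so it runs self-contained; the same SQL works through psycopg2 against Postgres, and the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 TEXT, c2 INTEGER, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [("a", 1, "first"), ("b", 2, "second"), ("c", 3, "third")])

# Combine several point lookups with identical SELECT lists into one
# statement, so the server gets one round trip instead of len(filters)
filters = [("a", 1), ("c", 3)]
sql = " UNION ALL ".join(
    "SELECT payload FROM t WHERE c1 = ? AND c2 = ?" for _ in filters)
params = [p for f in filters for p in f]
rows = conn.execute(sql, params).fetchall()
print(rows)
```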
I'm looking to maintain a (Postgres) SQL database collecting data from third parties. As most data is static and I get a full dump every day, I want to store only the data that is new. I.e., every day I get 100K records with, say, 300 columns, and about 95K rows will be the same as the day before. To do this efficiently, I was thinking of inserting a hash of each record (coming from a Pandas dataframe or a Python dict) alongside the data. Some other data is stored as well, like when it was loaded into the database. Then, prior to inserting data, I could hash the incoming records and easily verify whether each one is already in the database, instead of having to check all 300 columns.
My question: which hash function should I pick, given that I'm in Python and prefer a very fast and solid solution that requires little coding on my side while handling all kinds of data (ints, floats, strings, datetimes, etc.)?
Python's built-in hash is unsuited, as it changes between sessions (see, e.g., Create hash value for each row of data with selected columns in dataframe in python pandas).
md5 and sha1 are cryptographic hashes. I don't need the crypto part, as this is not for security. They might be a bit slow as well, and I had some trouble with strings, as these require encoding.
Is a solution like CRC good enough?
For options two and three: if you recommend one of them, how can I implement it for arbitrary dicts and pandas rows? I have had little success keeping this simple. For instance, for strings I needed to explicitly define the encoding, and the order of the fields in the record should not change the hash.
Edit: I just realized that it might be tricky to depend on Python for this, if I change programming language I might end up with different hashes. Tying it to the database seems the more sensible choice.
Have you tried pandas.util.hash_pandas_object?
Not sure how efficient this is, but maybe you could use it like this:
hashes = pd.util.hash_pandas_object(df, index=False)
This will get you a pandas Series with one uint64 hash for each row in the df (pass index=False so a row's position doesn't affect its hash).
There have been several similar questions online about processing large csv files into multiple postgresql tables with python. However, none seem to address a couple of concerns around optimizing database reads/writes and system memory/processing.
Say I have a row of product data that looks like this:
name,sku,datetime,decimal,decimal,decimal,decimal,decimal,decimal
Where the name and sku are stored in one table (parent), then each decimal field is stored in a child EAV table that essentially contains the decimal, parent_id, and datetime.
Let's say I have 20000 of these rows in a csv file, so I end up chunking them up. Right now, I take chunks of 2000 of these rows and loop line by line. Each iteration checks to see if the product exists and creates it if not, retrieving the parent_id. Then, I have a large list of insert statements generated for the child table with the decimal values. If the user has selected to only overwrite non-modified decimal values, then this also checks each individual decimal value to see if it has been modified before adding to the insert list.
In this example, if I had the worst case scenario, I'd end up doing 160,000 database reads and anywhere from 10-20010 writes. I'd also be storing up to 12000 insert statements in a list in memory for each chunk (however, this would only be one list, so that part isn't as bad).
My main question is:
How can I optimize this to be faster, use fewer database operations (since this also affects network traffic), and use less processing and memory? I'd also rather have the processing speed be slower if it could save on the other two, as those cost more money when translated to server/database pricing on something like AWS.
Some sub questions are:
Is there a way I can combine all the product read/writes and replace them in the file before doing the decimals?
Should I be doing a smaller chunk size to help with memory?
Should I be utilizing threads or keeping it linear?
Could I have it build a more efficient sql query that does the product create if not exists and referencing inline, thus moving some of the processing into sql rather than python?
Could I optimize the child insert statements to do something better than thousands of INSERT INTO statements?
A fun question, but one that's hard to answer precisely, since there are many variables defining the best solution that may or may not apply.
Below is one approach, based on the following assumptions -
You don't need the database code to be portable.
The csv is structured with a header, or at the least the attribute names are known and fixed.
The sku (or name/sku combo) in the product table has a unique constraint.
Likewise, the EAV table has a unique constraint on product_id and attr_name.
Corollary - you didn't specify, but I also assume that the EAV table has a field for the attribute name.
The process boils down to -
Load the data into the database by the fastest path possible
Unpivot the csv from a tabular structure to EAV structure during or after the load
"Upsert" the resulting records - update if present, insert otherwise.
Approach -
All that background, given a similar problem, here is the approach I would take.
Create temp tables mirroring the final destination, but without pks, types, or constraints
The temp tables will get deleted when the database session ends
Load the .csv straight into the temp tables in a single pass; two SQL executions per row
One for product
One for the EAV, using the 'multi-value' insert - insert into tmp_eav (sku, attr_name, attr_value) values (%s, %s, %s), (%s, %s, %s)....
psycopg2 has a custom method to do this for you: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.execute_values
Select from tmp tables to upsert into final tables, using a statement like insert into product (name, sku) select name, sku from tmp_product on conflict (sku) do nothing
This requires PostgreSQL 9.5+.
For the user-selectable requirement to optionally update fields based on the csv, you can change do nothing to do update set col = excluded.col. excluded is the input row that conflicted
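Here is a self-contained sketch of the upsert step. It uses sqlite3 so it runs as-is, but the ON CONFLICT clause is the same in PostgreSQL 9.5+ (table names mirror the ones above; the sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, sku TEXT UNIQUE);
    CREATE TEMP TABLE tmp_product (name TEXT, sku TEXT);
""")
conn.execute("INSERT INTO product (name, sku) VALUES ('Widget', 'W-1')")
# In Postgres this bulk load would be execute_values or COPY
conn.executemany("INSERT INTO tmp_product VALUES (?, ?)",
                 [("Widget v2", "W-1"), ("Gadget", "G-1")])

# Upsert: insert new skus, update names for existing ones
conn.execute("""
    INSERT INTO product (name, sku)
    SELECT name, sku FROM tmp_product WHERE true
    ON CONFLICT (sku) DO UPDATE SET name = excluded.name
""")
print(sorted(conn.execute("SELECT sku, name FROM product")))
```

Swapping "do update set ..." for "do nothing" gives the keep-existing-values behavior.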
Alternative approach -
Create the temp table based on the structure of the csv (this assumes you have enough metadata to do this on each run, or that the csv structure is fixed and can be consistently translated to a table)
Load the csv into the database using the COPY command (supported in psycopg2 via the cursor.copy_from method, passing in the csv as a file object). This will be faster than anything you write in Python.
Caveat: this works if the csv is very dependable (same number of cols on every row) and the temp table is very lax w/ nulls, all strings w/ no type coercion.
You can 'unpivot' the csv rows with a union all query that combines a select-per-column to row transpose. The 6 decimals in your example should be manageable.
For example:
select sku, 'foo' as attr_name, foo as attr_value from tmp_csv union all
select sku, 'bar' as attr_name, bar as attr_value from tmp_csv union all
...
order by sku;
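If the attribute columns are known (or read from the csv header), the unpivot statement can be generated rather than hand-written; a sketch, with hypothetical column names:

```python
# Generate the unpivot query for a known list of attribute columns.
# The column names here are hypothetical placeholders.
attrs = ["foo", "bar", "baz"]
parts = [
    f"select sku, '{a}' as attr_name, {a} as attr_value from tmp_csv"
    for a in attrs
]
unpivot_sql = " union all\n".join(parts) + "\norder by sku;"
print(unpivot_sql)
```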
This solution hits a couple of the things you were interested in:
Python application memory remains flat
Network I/O is limited to what it takes to get the .csv into the db and issue the right follow-up sql statements
A little general advice to close out -
Optimal and "good enough" are almost never the same thing
Optimal is only required under very specific situations
So, aim for "good enough", but be precise about what "good enough" means - i.e., pick one or two measures
Iterate, solving for one variable at a time. In my experience, the first hurdle (say, "end-to-end processing time less than X seconds") is often sufficient.
I am currently facing the problem of having to frequently access a large but simple data set on a smallish (700 MHz) device in real time. The data set contains around 400,000 mappings from abbreviations to abbreviated words, e.g. "frgm" to "fragment". Reading will happen frequently when the device is used and should not take more than 15-20 ms.
My first attempt was to utilize SQLite to create a simple database which merely contains a single table where two strings constitute a record:
CREATE TABLE WordMappings (key text, word text)
This table is created once and although alterations are possible, only read-access is time critical.
Following this guide, my SELECT statement looks as follows:
def databaseQuery(self, query_string):
    # Parameterized query: avoids SQL injection and quoting bugs
    self.cursor.execute(
        "SELECT word FROM WordMappings WHERE key = ? LIMIT 1;",
        (query_string,))
    result = self.cursor.fetchone()
    return result[0] if result else None
However, using this code on a test database with 20,000 abbreviations, I am unable to fetch data quicker than ~60 ms, which is far too slow.
Any suggestions on how to improve performance using SQLite or would another approach yield more promising results?
You can speed up lookups on the key column by creating an index for it:
CREATE INDEX key_index ON WordMappings(key);
To check whether a query uses an index or scans the entire table, use EXPLAIN QUERY PLAN.
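A quick way to do both checks from Python; the index and table names follow the answer above, and the sample row is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WordMappings (key TEXT, word TEXT)")
conn.execute("CREATE INDEX key_index ON WordMappings(key)")
conn.execute("INSERT INTO WordMappings VALUES ('frgm', 'fragment')")

# EXPLAIN QUERY PLAN reports whether the lookup uses the index
# or scans the whole table
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT word FROM WordMappings WHERE key = ?",
    ("frgm",)).fetchall()
print(plan)  # the detail column should mention key_index
```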
A long time ago I tried to use SQLite for sequential data and it was not fast enough for my needs. At the time, I was comparing it against an existing in-house binary format, which I ended up using.
I have not used it personally, but a friend uses PyTables for large time-series data; it may be worth looking into.
It turns out that defining a primary key speeds up individual queries by an order of magnitude.
Individual queries on a test table with 400,000 randomly created entries (10/20 characters long) took no longer than 5 ms, which satisfies the requirements.
The table is now created as follows:
CREATE TABLE WordMappings (key text PRIMARY KEY, word text)
A primary key is used because
It is implicitly unique, which is a property of the abbreviations stored
It cannot be NULL; in our case, a NULL key would mean the database is corrupt
Other users have suggested using an index; however, indexes are not necessarily unique, and according to the accepted answer to this question they unnecessarily slow down update/insert/delete performance. Nevertheless, using an index may well increase read performance, though this has not been tested by the original author.
I have approximately 12GB of tab-separated data in a very simple format:
mainIdentifier, altIdentifierType, altIdentifierText
mainIdentifier is not a unique row identifier; only the combination of all 3 columns is unique. My main use-case is looking up corresponding entries, going either from mainIdentifier or from the two different types of alternative identifiers.
From what I can glean, I would need to construct a lookup index for each direction to make this fast. However, given the simplicity of the task, I do not really need an index pointing to a record: the index itself is the answer.
I've tried sqlite3 in Python, but as expected, the result is not as fast as I would've liked. I am now considering just storing two sorted lists and searching them binary-search style; however, I do not want to re-invent the wheel. Is there an existing solution for this?
Also, I intend to run this as a REST-enabled service, so it's not feasible for the lookup table to be held entirely in memory.
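For reference, here is a sketch of the covering-index layout I'm considering before giving up on sqlite3; column names are shortened, and the sample row is made up. WITHOUT ROWID stores the table itself as a B-tree on the full key, so, as above, "the index itself is the answer" and no second lookup into row data is needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mapping (
        main_id TEXT, alt_type TEXT, alt_text TEXT,
        PRIMARY KEY (main_id, alt_type, alt_text)
    ) WITHOUT ROWID
""")
# A second covering index serves lookups in the reverse direction
conn.execute("CREATE INDEX alt_to_main ON mapping(alt_type, alt_text, main_id)")
conn.execute("INSERT INTO mapping VALUES ('ID1', 'isbn', '978-0')")

forward = conn.execute(
    "SELECT alt_type, alt_text FROM mapping WHERE main_id = ?",
    ("ID1",)).fetchall()
reverse = conn.execute(
    "SELECT main_id FROM mapping WHERE alt_type = ? AND alt_text = ?",
    ("isbn", "978-0")).fetchall()
print(forward, reverse)
```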