I have a database with a single indexed UNIQUE text column, currently at 100m rows, but it could go up to a billion. This is for doing millions of contains queries on the DB, so the index is necessary for fast lookups.
The problem I'm having is that imports have slowed down a lot as the DB has grown. I've tried a lot of ideas, including something like this (reads roughly 100k lines from the file and does a batch insert):
rows_modified = 0
lines = f.readlines(3355629)  # read roughly 100k lines per batch (size hint in bytes)
while lines:
    c.execute('BEGIN TRANSACTION')
    rows_modified += c.executemany(
        'INSERT OR IGNORE INTO mydb (name) VALUES (?)',
        map(lambda name: (name.strip().lower(),), lines)).rowcount
    c.execute('COMMIT')
    lines = f.readlines(3355629)
The above takes many minutes to insert 100k lines once the DB is at 100m rows.
I've imported to Postgres with a Python script which can get 100k inserts on an indexed column done in 2 seconds at 100m rows (using psycopg2.extras.execute_values).
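The Postgres side is roughly something like this (a simplified sketch, not the exact script; the connection string and file name are placeholders):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

with open('names.txt') as f:  # placeholder file name
    while True:
        lines = f.readlines(3355629)  # same ~100k-line batches as above
        if not lines:
            break
        rows = [(name.strip().lower(),) for name in lines]
        execute_values(cur,
                       'INSERT INTO mydb (name) VALUES %s ON CONFLICT DO NOTHING',
                       rows)
        conn.commit()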
I keep seeing claims that SQLite can easily handle terabytes of data, yet I can't figure out how to get the data in. BTW, I can't drop the index and recreate it because then I'd need additional code to make sure the data is unique, and you can't change a column to UNIQUE after creation. Multiple tables are an option for a speed increase, though it makes things more complex.
Related
I have a set of large tables with many records each. I'm writing a Python program that SELECTs a large number of records from these tables, based on the value of multiple columns on those records.
Essentially, these are going to be lots of queries of the form:
SELECT <some columns> FROM <some table> WHERE <column1=val1 AND column2=val2...>
Each table has a different set of columns, but otherwise the SELECT formula above holds.
By default, I was going to just run all these queries through the psycopg2 PostgreSQL database driver, each as a separate query. But I'm wondering if there's a more efficient way of going about it, given that there will be a very large number of such queries - thousands or more.
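Concretely, the default version would just be a loop of single queries, something like this sketch (table and column names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# criteria is a list of (val1, val2) tuples, one per lookup
for val1, val2 in criteria:
    cur.execute('SELECT id, col1, col2 FROM widgets WHERE col1 = %s AND col2 = %s',
                (val1, val2))
    rows = cur.fetchall()
    # ... process rows ...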
If the SELECT list entries are the same for all queries (the same number of entries and the same data types), you can use UNION ALL to combine several such queries. This won't reduce the amount of work for the database, but it will reduce the number of client-server round trips. This can be a huge improvement, because for short queries network latency is often the dominant cost.
If all your queries have different SELECT lists, there is nothing you can do.
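For example, a batch of identically shaped lookups can be combined into a single statement; a sketch reusing the placeholder table and columns from the question:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

lookups = [(1, 'a'), (2, 'b'), (3, 'c')]  # (val1, val2) pairs to fetch

single = 'SELECT id, col1, col2 FROM widgets WHERE col1 = %s AND col2 = %s'
sql = ' UNION ALL '.join([single] * len(lookups))
params = [p for pair in lookups for p in pair]  # flatten params to match the %s order

cur.execute(sql, params)
for row in cur.fetchall():
    print(row)

One round trip now returns the rows for all of the lookups instead of one round trip per lookup.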
There have been several similar questions online about processing large csv files into multiple PostgreSQL tables with Python. However, none seem to address a couple of concerns around optimizing database reads/writes and system memory/processing.
Say I have a row of product data that looks like this:
name,sku,datetime,decimal,decimal,decimal,decimal,decimal,decimal
The name and sku are stored in one (parent) table, and each decimal field is stored in a child EAV table that essentially contains the decimal, parent_id, and datetime.
Let's say I have 20000 of these rows in a csv file, so I end up chunking them up. Right now, I take chunks of 2000 of these rows and loop line by line. Each iteration checks to see if the product exists and creates it if not, retrieving the parent_id. Then, I have a large list of insert statements generated for the child table with the decimal values. If the user has selected to only overwrite non-modified decimal values, then this also checks each individual decimal value to see if it has been modified before adding to the insert list.
In this example, if I had the worst case scenario, I'd end up doing 160,000 database reads and anywhere from 10-20010 writes. I'd also be storing up to 12000 insert statements in a list in memory for each chunk (however, this would only be one list, so that part isn't as bad).
My main question is:
How can I optimize this to be faster, use less database operations (since this also affects network traffic), and use less processing and memory? I'd also rather have the processing speed to be slower if it could save on the other two optimizations, as those ones cost more money when translated to server/database processing pricing in something like AWS.
Some sub questions are:
Is there a way I can combine all the product read/writes and replace them in the file before doing the decimals?
Should I be doing a smaller chunk size to help with memory?
Should I be utilizing threads or keeping it linear?
Could I have it build a more efficient sql query that creates the product if it doesn't exist and references it inline, thus moving some of the processing into sql rather than python?
Could I optimize the child insert statements to do something better than thousands of INSERT INTO statements?
A fun question, but one that's hard to answer precisely, since there are many variables defining the best solution that may or may not apply.
Below is one approach, based on the following assumptions -
You don't need the database code to be portable.
The csv is structured with a header, or at the least the attribute names are known and fixed.
The sku (or name/sku combo) in the product table has a unique constraint.
Likewise, the EAV table has a unique constraint on product_id and attr_name.
Corollary - you didn't specify, but I also assume that the EAV table has a field for the attribute name.
The process boils down to -
Load the data into the database by the fastest path possible
Unpivot the csv from a tabular structure to EAV structure during or after the load
"Upsert" the resulting records - update if present, insert otherwise.
Approach -
With all that background, given a similar problem, here is the approach I would take.
Create temp tables mirroring the final destination, but without pks, types, or constraints
The temp tables will get deleted when the database session ends
Load the .csv straight into the temp tables in a single pass; two SQL executions per row
One for product
One for the EAV, using the 'multi-value' insert - insert into tmp_eav (sku, attr_name, attr_value) values (%s, %s, %s), (%s, %s, %s)....
psycopg2 has a custom method to do this for you: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.execute_values
Select from tmp tables to upsert into final tables, using a statement like insert into product (name, sku) select name, sku from tmp_product on conflict (sku) do nothing
This requires PostgreSQL 9.5+.
For the user-selectable requirement to optionally update fields based on the csv, you can change do nothing to do update set col = excluded.col (excluded is the input row that conflicted).
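Putting those steps together, a rough psycopg2 sketch might look like this (table and column names such as product.id, product_id, attr_name, and attr_value are assumptions based on the description; the optional "only overwrite unmodified values" check is left out):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# Temp tables vanish when the session ends; no constraints, lax (text) types.
cur.execute("CREATE TEMP TABLE tmp_product (name text, sku text)")
cur.execute("CREATE TEMP TABLE tmp_eav (sku text, attr_name text, attr_value text)")

# ... load the csv rows into tmp_product / tmp_eav here (execute_values or COPY) ...

# Upsert products; duplicate skus are simply skipped.
cur.execute("""
    INSERT INTO product (name, sku)
    SELECT name, sku FROM tmp_product
    ON CONFLICT (sku) DO NOTHING
""")

# Upsert attribute values, joining back to product to resolve product_id.
cur.execute("""
    INSERT INTO eav (product_id, attr_name, attr_value)
    SELECT p.id, t.attr_name, t.attr_value
    FROM tmp_eav t
    JOIN product p ON p.sku = t.sku
    ON CONFLICT (product_id, attr_name)
    DO UPDATE SET attr_value = excluded.attr_value
""")
conn.commit()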
Alternative approach -
Create the temp table based on the structure of the csv (assumes you have enough metadata to do this on each run or that the csv structure is fixed and can be consistently translated to a table)
Load the csv into the database using the COPY command (supported in psycopg2 via the cursor.copy_from method, passing in the csv as a file object). This will be faster than anything you write in Python.
Caveat: this works if the csv is very dependable (same number of cols on every row) and the temp table is very lax w/ nulls, all strings w/ no type coercion.
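A minimal copy_from sketch, assuming a header row to skip and no quoted commas in the data (the column names just mirror the foo/bar placeholders used in the unpivot example below):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# Lax temp table: all text, nullable, one column per csv field.
cur.execute("""
    CREATE TEMP TABLE tmp_csv (
        name text, sku text, dt text,
        foo text, bar text, baz text, qux text, quux text, corge text)
""")

with open('products.csv') as f:  # placeholder file name
    next(f)  # skip the header row
    cur.copy_from(f, 'tmp_csv', sep=',')  # plain COPY; no csv quoting handled
conn.commit()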
You can 'unpivot' the csv rows with a union all query that combines a select for each column-to-row transpose. The 6 decimals in your example should be manageable.
For example:
select sku, 'foo' as attr_name, foo as attr_value from tmp_csv union all
select sku, 'bar' as attr_name, bar as attr_value from tmp_csv union all
...
order by sku;
This solution hits a couple of the things you were interested in:
Python application memory remains flat
Network I/O is limited to what it takes to get the .csv into the db and issue the right follow-up sql statements
A little general advice to close out -
Optimal and "good enough" are almost never the same thing
Optimal is only required under very specific situations
So, aim for "good enough", but be precise about what "good enough" means - i.e., pick one or two measures.
Iterate, solving for one variable at a time. In my experience, the first hurdle (say, "end to end processing time less than X seconds") is often sufficient.
When inserting rows via INSERT INTO tbl VALUES (...), (...), ...;, what is the maximum number of values I can use?
To clarify, PostgreSQL supports using VALUES to insert multiple rows at once. My question isn't how many columns I can insert, but rather how many rows of columns I can insert into a single VALUES clause. The table in question has only ~10 columns.
Can I insert 100K+ rows at a time using this format?
I am assembling my statements using SQLAlchemy Core / psycopg2 if that matters.
As pointed out by Gordon, there doesn't appear to be a predefined limit on the number of value sets you can have in your statement. But you would want to keep this to a reasonable limit to avoid consuming too much memory on both the client and the server: the client needs to build the string, and the server needs to parse it.
If you want to insert a large number of rows speedily, COPY FROM is what you are looking for.
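If you do stay with multi-row VALUES, psycopg2's execute_values will chunk the statement for you; a sketch with made-up table and column names:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

rows = [(i, i % 2) for i in range(100000)]  # 100k rows of (id, flag)
execute_values(cur,
               'INSERT INTO items (id, flag) VALUES %s',
               rows,
               page_size=1000)  # at most 1000 value sets per statement
conn.commit()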
I've got a script that is attempting to insert a large number of rows into a postgresql table. When I say large, I mean up to 200,000. I'm inserting data from python using sql alchemy. Each row is made up of one unique ID and a number of 0/1 flags.
When I try to insert a small number of rows, it works just fine. I have even inserted around 18,000 without any problems, and I think it only took a handful of seconds.
Lately, I have stepped it up to try inserting a much larger data set of around 150,000 records. I had my script print the time that it started doing this, and this insert has been running for 12+ hours at this point. It seems disproportionately long when compared to the fast 20k row insert. Here is the code that I'm using.
import sqlalchemy

sql_engine = sqlalchemy.create_engine("postgresql://database")
meta = sqlalchemy.MetaData(sql_engine)
my_table = sqlalchemy.Table('table_name', meta, autoload=True, autoload_with=sql_engine)

# IDs already present in the table, so they can be skipped
already_inserted = [i for i in sql_engine.execute(sqlalchemy.select([some_column]))]

table_rows = []
for i in summary:
    if i[some_column] not in already_inserted:
        table_rows.append(
            {logic that builds row of 0s and 1s})
if len(table_rows) > 0:
    my_table.insert().execute(table_rows)
Are there any tips to getting this to work? Should I be inserting in smaller chunks? Furthermore, would inserts go faster if I only tried to insert the flags that are equal to 1 and left the zeros as null?
I have a load of data in CSV format. I need to be able to index this data based on a single text field (the primary key), so I'm thinking of entering it into a database. I'm familiar with sqlite from previous projects, so I've decided to use that engine.
After some experimentation, I realized that storing a hundred million records in one table won't work well: the indexing step slows to a crawl pretty quickly. I could come up with two solutions to this problem:
partition the data into several tables
partition the data into several databases
I went with the second solution (it yields several large files instead of one huge file). My partition method is to look at the first two characters of the primary key: each partition has approximately 2 million records, and there are approximately 50 partitions.
I'm doing this in Python with the sqlite3 module. I keep 50 open database connections and open cursors for the entire duration of the process. For each row, I look at the first two characters of the primary key, fetch the right cursor via dictionary lookup, and perform a single insert statement (via calling execute on the cursor).
Unfortunately, the insert speed still decreases to an unbearable level after a while (approx. 10 million total processed records). What can I do to get around this? Is there a better way to do what I'm doing?
Wrap all insert commands into a single transaction.
Use prepared statements.
Create the index only after inserting all the data (i.e., don't declare a primary key).
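A minimal sqlite3 sketch of those three points (file and table names are placeholders; it assumes the key is the first csv field and the keys are already unique, since the index is only built at the end):

import sqlite3

conn = sqlite3.connect('data.db')  # placeholder database file
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS records (key TEXT)')  # no PRIMARY KEY/UNIQUE yet

# All inserts go through one parameterized statement inside one transaction.
with open('data.csv') as f:  # placeholder csv file
    cur.executemany('INSERT INTO records (key) VALUES (?)',
                    ((line.split(',')[0],) for line in f))
conn.commit()

# Build the unique index once, after all the data is in.
cur.execute('CREATE UNIQUE INDEX idx_records_key ON records (key)')
conn.commit()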
I think the problem you have is that once the processing can no longer fit in in-memory buffers, your hard disk head is just jumping randomly between 50 locations, and that is dog slow.
Something you can try is just processing one subset at a time:
seen = set()  # key prefixes already processed
while True:
    k0 = None  # prefix handled in this pass
    for L in all_the_data:
        k = L[0][:2]
        if k not in seen:
            if k0 is None:
                k0 = k  # first unseen prefix becomes this pass's prefix
            if k0 == k:
                store_into_database(L)
    if k0 is None:
        break  # every prefix has been processed
    seen.add(k0)
This will do n+1 passes over the data (where n is the number of prefixes) but will only access two disk locations (one for reading and one for writing). It should work even better if you have separate physical devices.
PS: Are you really really sure an SQL database is the best solution for this problem?