I have a database with a single indexed UNIQUE text column, currently at 100m rows, but it could grow to a billion. It is used for millions of contains queries, so the index is necessary for fast lookups.
The problem I'm having is that imports have slowed down a lot as the DB has grown. I've tried a lot of ideas, including something like this (it reads roughly 100k lines from the file and does a batch insert):
lines = f.readlines(3355629)  # size hint in bytes, roughly 100k lines
while lines:
    c.execute('BEGIN TRANSACTION')
    rows_modified += c.executemany(
        'INSERT OR IGNORE INTO mydb (name) values (?)',
        ((name.strip().lower(),) for name in lines),
    ).rowcount
    c.execute('COMMIT')
    lines = f.readlines(3355629)
The above takes many minutes to insert 100k lines when the DB is at 100m rows.
I've imported into Postgres with a Python script that manages 100k inserts on an indexed column in 2 seconds at 100m rows (using psycopg2.extras.execute_values).
I keep seeing claims that SQLite can easily handle terabytes of data, yet I can't figure out how to get the data in. BTW, I can't drop the index and recreate it, because then I'd need additional code to make sure the data is unique, and you can't change a column to UNIQUE after creation. Multiple tables are an option for a speed increase, though that makes things more complex.
I'm working on a Python script that grabs every field in an mssql database along with various metadata about each, then generates a series of data dictionaries in XLSX format.
I've almost finished, but I'm now trying to grab 10 unique values from each field as an example of the data each field contains (for dates I'm using max & min). Currently I'm using select distinct top 10 X from table; for each field, but with a largish database, this is incredibly slow going.
Is there a quicker/better alternative?
It would seem that by running select distinct top 10 * from table; and then parsing that data with Python, I save an incredible amount of time. I may not end up with 10 values per field, but it's good enough!
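A minimal sketch of the "parse it in Python" step, assuming the rows from the single query are already available as a list of tuples. The function name sample_distinct_values and the example data are hypothetical, and the values are assumed to be hashable.

```python
def sample_distinct_values(columns, rows, limit=10):
    """Collect up to `limit` distinct non-null values per column, in first-seen order."""
    samples = {name: [] for name in columns}
    seen = {name: set() for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            # Skip nulls and values already recorded for this column.
            if value is None or value in seen[name]:
                continue
            if len(samples[name]) < limit:
                seen[name].add(value)
                samples[name].append(value)
    return samples


# Hypothetical rows, standing in for cursor.fetchall() on the single query.
columns = ["city", "status"]
rows = [("Oslo", "open"), ("Oslo", "closed"), ("Bergen", "open"), ("Bergen", None)]
samples = sample_distinct_values(columns, rows)
```

Because the distinct applies to whole rows, a column with few distinct values may yield fewer than 10 samples, which matches the "good enough" trade-off described above.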
I'm storing business/statistical data by date in different collections.
Every day, thousands upon thousands of rows are inserted.
In some cases my application fetches or generates information that includes, let's say, the last 20 days with new values, so I need to update that old information in MongoDB with the new values for those dates.
The first option I thought of is to remove all rows from 20 days ago until now, filtering by date, and then insert the new data with insertMany().
The problem with this is that the number of rows is huge and it blocks the database, which sometimes causes my worker process to die (it's a Python Celery task).
The second option I thought of is to split the incoming data into chunks per date (using Pandas dataframes), perform a "removal" and then an "insert" for that date, and repeat that process for each date until today. This is the same approach, but in smaller chunks.
Is the last option a good idea?
Is there any better approach for this type of problem?
Thanks a lot
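The second option described above (replace one date at a time) could be sketched as follows. This is only an illustration under assumptions: the field name "date", the function replace_by_date, and the InMemoryCollection class are hypothetical. The stand-in class exists only so the sketch runs without a server; with pymongo, a real collection's delete_many and insert_many methods would be called in the same way.

```python
from collections import defaultdict


class InMemoryCollection:
    """Hypothetical stand-in exposing the two collection methods used below."""

    def __init__(self, docs=()):
        self.docs = list(docs)

    def delete_many(self, query):
        # Supports only a single {field: value} equality filter.
        (field, value), = query.items()
        self.docs = [d for d in self.docs if d.get(field) != value]

    def insert_many(self, docs):
        self.docs.extend(docs)


def replace_by_date(collection, new_docs, date_field="date"):
    """Replace documents one date at a time: delete that date, then insert its new docs."""
    by_date = defaultdict(list)
    for doc in new_docs:
        by_date[doc[date_field]].append(doc)
    for day in sorted(by_date):
        collection.delete_many({date_field: day})
        collection.insert_many(by_date[day])
    return len(by_date)


coll = InMemoryCollection([
    {"date": "2024-01-01", "value": 1},
    {"date": "2024-01-02", "value": 2},
    {"date": "2024-01-03", "value": 3},
])
new_docs = [{"date": "2024-01-02", "value": 20}, {"date": "2024-01-03", "value": 30}]
days_replaced = replace_by_date(coll, new_docs)
```

Note that the delete and the insert for a given date are two separate operations here, so a reader could briefly see that date with no data.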
There have been several similar questions online about processing large csv files into multiple postgresql tables with python. However, none seem to address a couple of concerns around optimizing database reads/writes and system memory/processing.
Say I have a row of product data that looks like this:
name,sku,datetime,decimal,decimal,decimal,decimal,decimal,decimal
Where the name and sku are stored in one table (parent), then each decimal field is stored in a child EAV table that essentially contains the decimal, parent_id, and datetime.
Let's say I have 20000 of these rows in a csv file, so I end up chunking them up. Right now, I take chunks of 2000 of these rows and loop line by line. Each iteration checks to see if the product exists and creates it if not, retrieving the parent_id. Then, I have a large list of insert statements generated for the child table with the decimal values. If the user has selected to only overwrite non-modified decimal values, then this also checks each individual decimal value to see if it has been modified before adding to the insert list.
In this example, if I had the worst case scenario, I'd end up doing 160,000 database reads and anywhere from 10-20010 writes. I'd also be storing up to 12000 insert statements in a list in memory for each chunk (however, this would only be one list, so that part isn't as bad).
My main question is:
How can I optimize this to be faster, use fewer database operations (since this also affects network traffic), and use less processing and memory? I'd also rather have slower processing if that saves on the other two, since those cost more money when translated to server/database processing pricing on something like AWS.
Some sub questions are:
Is there a way I can combine all the product read/writes and replace them in the file before doing the decimals?
Should I be doing a smaller chunk size to help with memory?
Should I be utilizing threads or keeping it linear?
Could I have it build a more efficient SQL query that creates the product if it does not exist and references it inline, thus moving some of the processing into SQL rather than Python?
Could I optimize the child insert statements to do something better than thousands of INSERT INTO statements?
A fun question, but one that's hard to answer precisely, since there are many
variables defining the best solution that may or may not apply.
Below is one approach, based on the following assumptions -
You don't need the database code to be portable.
The csv is structured with a header, or at the least the attribute names are
known and fixed.
The sku (or name/sku combo) in the product table has a unique constraint.
Likewise, the EAV table has a unique constraint on product_id and
attr_name.
Corollary - you didn't specify, but I also assume that the EAV table has a field
for the attribute name.
The process boils down to -
Load the data into the database by the fastest path possible
Unpivot the csv from a tabular structure to EAV structure during or after the load
"Upsert" the resulting records - update if present, insert otherwise.
Approach -
With all that as background, here is the approach I would take for a similar problem.
Create temp tables mirroring the final destination, but without pks, types, or constraints
The temp tables will get deleted when the database session ends
Load the .csv straight into the temp tables in a single pass; two SQL executions per row
One for product
One for the EAV, using the 'multi-value' insert - insert into tmp_eav (sku, attr_name, attr_value) values (%s, %s, %s), (%s, %s, %s)....
psycopg2 has a custom method to do this for you: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.execute_values
Select from tmp tables to upsert into final tables, using a statement like insert into product (name, sku) select name, sku from tmp_product on conflict (sku) do nothing
This requires PostgreSQL 9.5+.
For the user-selectable requirement to optionally update fields based on the csv, you can change do nothing to do update set col = excluded.col, where excluded is the input row that conflicted.
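The staging-then-upsert steps above can be sketched end to end. As an assumption made purely so the example runs without a server, it uses Python's built-in sqlite3 module (SQLite 3.24 or later) as a stand-in for PostgreSQL. The table layouts and sample rows are hypothetical. The "where true" is needed only by SQLite's parser and can be dropped in PostgreSQL, where the parameter placeholders with psycopg2 would be %s rather than ?.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table product (id integer primary key, name text, sku text unique)")
conn.execute("insert into product (name, sku) values ('Existing widget', 'W-1')")

# Staging table: no keys or constraints; it disappears when the connection closes.
conn.execute("create temp table tmp_product (name text, sku text)")
csv_rows = [("Widget (csv)", "W-1"), ("Gadget", "G-7")]  # hypothetical csv content
conn.executemany("insert into tmp_product (name, sku) values (?, ?)", csv_rows)

# Set-based upsert, variant 1: keep existing rows untouched.
conn.execute(
    "insert into product (name, sku) "
    "select name, sku from tmp_product where true "
    "on conflict (sku) do nothing"
)
products = conn.execute("select name, sku from product order by sku").fetchall()

# Variant 2: overwrite existing rows with the incoming (excluded) values.
conn.execute(
    "insert into product (name, sku) "
    "select name, sku from tmp_product where true "
    "on conflict (sku) do update set name = excluded.name"
)
updated = conn.execute("select name, sku from product order by sku").fetchall()
```

The point of the design is that the per-row "does this product exist?" check moves into one statement that the database resolves against the unique constraint.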
Alternative approach -
Create the temp table based on the structure of the csv (assumes you
have enough metadata to do this on each run or that the csv structure is
fixed and can be consistently translated to a table)
Load the csv into the database using the COPY command (supported in psycopg2
via the cursor.copy_from method, passing in the csv as a file object).
This will be faster than anything you write in Python
Caveat: this works if the csv is very dependable (same number of cols on
every row) and the temp table is very lax w/ nulls, all strings w/ no
type coercion.
You can 'unpivot' the csv rows with a union all query that combines a
select for each column to row transpose. The 6 decimals in your example
should be manageable.
For example:
select sku, 'foo' as attr_name, foo as attr_value from tmp_csv union all
select sku, 'bar' as attr_name, bar as attr_value from tmp_csv union all
...
order by sku;
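For more than a handful of columns, the union all query above could be generated rather than typed out. The sketch below is an assumption-laden illustration: build_unpivot_query is a hypothetical helper, sqlite3 is used only as a stand-in so the example runs locally, and table and column names are interpolated directly into the SQL, so they must come from trusted metadata (such as a known csv header), never from user input.

```python
import sqlite3


def build_unpivot_query(table, key, value_columns):
    """Build one select per value column and combine them with union all."""
    selects = [
        f"select {key}, '{col}' as attr_name, {col} as attr_value from {table}"
        for col in value_columns
    ]
    return " union all ".join(selects) + f" order by {key}, attr_name"


conn = sqlite3.connect(":memory:")
# Lax staging table: every column is text, with no type coercion.
conn.execute("create temp table tmp_csv (sku text, foo text, bar text)")
conn.executemany(
    "insert into tmp_csv values (?, ?, ?)",
    [("A1", "1.5", "2.5"), ("B2", "3.0", "4.0")],
)

query = build_unpivot_query("tmp_csv", "sku", ["foo", "bar"])
eav_rows = conn.execute(query).fetchall()
```

Each csv row becomes one EAV row per value column, which is the shape needed for the upsert into the child table.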
This solution hits a couple of the things you were interested in:
Python application memory remains flat
Network I/O is limited to what it takes to get the .csv into the db and issue
the right follow up sql statements
A little general advice to close out -
Optimal and "good enough" are almost never the same thing
Optimal is only required under very specific situations
So, aim for "good enough", but be precise about what "good enough" means -
i.e., pick one or two measures
Iterate, solving for one variable at a time. In my experience, the first hurdle (say, "end to end processing time less than
X seconds") is often sufficient.
I'm working on implementing a relatively large (5,000,000 and growing) set of time series data in an HDF5 table. I need a way to remove duplicates on it, on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during the data retrieval process than ensure no dups go in.
What is the best way to remove dups from a pytable? All of my reading is pointing me towards importing the whole table into pandas, getting a unique-valued data frame, and writing it back to disk by recreating the table with each data run. This seems counter to the point of pytables, though, and in time I don't know that the whole data set will efficiently fit into memory. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...
See this related question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates. The user is responsible for this.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates().
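If loading the whole table into pandas is a concern, one alternative (not the drop_duplicates() approach mentioned above, and only a sketch) is to stream the rows in chunks and remember just the two-column keys already seen. It assumes the set of keys fits in memory even if the full rows do not. The function iter_unique, the column names, and the sample chunks are hypothetical; in practice the chunks would be slices read from the table.

```python
def iter_unique(chunks, key_columns):
    """Yield each row whose key has not been seen before, preserving order."""
    seen = set()
    for chunk in chunks:
        for row in chunk:
            key = tuple(row[col] for col in key_columns)
            if key not in seen:
                seen.add(key)
                yield row


# Hypothetical chunks of rows; two columns together define a unique record.
chunks = [
    [{"symbol": "ABC", "ts": 1, "price": 10.0}, {"symbol": "ABC", "ts": 2, "price": 10.5}],
    [{"symbol": "ABC", "ts": 2, "price": 10.5}, {"symbol": "XYZ", "ts": 1, "price": 7.0}],
]
unique_rows = list(iter_unique(chunks, ("symbol", "ts")))
```

The unique rows can then be written to a fresh table chunk by chunk, so only one chunk plus the key set is held in memory at a time.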