There have been several similar questions online about processing large CSV files into multiple PostgreSQL tables with Python. However, none seem to address a couple of concerns around optimizing database reads/writes and system memory/processing.
Say I have a row of product data that looks like this:
name,sku,datetime,decimal,decimal,decimal,decimal,decimal,decimal
Where the name and sku are stored in one table (parent), then each decimal field is stored in a child EAV table that essentially contains the decimal, parent_id, and datetime.
Let's say I have 20000 of these rows in a csv file, so I end up chunking them up. Right now, I take chunks of 2000 of these rows and loop line by line. Each iteration checks to see if the product exists and creates it if not, retrieving the parent_id. Then, I have a large list of insert statements generated for the child table with the decimal values. If the user has selected to only overwrite non-modified decimal values, then this also checks each individual decimal value to see if it has been modified before adding to the insert list.
In this example, in the worst case, I'd end up doing 160,000 database reads and anywhere from 10 to 20,010 writes. I'd also be storing up to 12,000 insert statements in a list in memory for each chunk (though it's only ever one list at a time, so that part isn't as bad).
My main question is:
How can I optimize this to be faster, use less database operations (since this also affects network traffic), and use less processing and memory? I'd also rather have the processing speed to be slower if it could save on the other two optimizations, as those ones cost more money when translated to server/database processing pricing in something like AWS.
Some sub questions are:
Is there a way I can combine all the product read/writes and replace them in the file before doing the decimals?
Should I be doing a smaller chunk size to help with memory?
Should I be utilizing threads or keeping it linear?
Could I have it build a more efficient sql query that does the product create if not exists and referencing inline, thus moving some of the processing into sql rather than python?
Could I optimize the child insert statements to do something better than thousands of INSERT INTO statements?
A fun question, but one that's hard to answer precisely, since there are many
variables defining the best solution that may or may not apply.
Below is one approach, based on the following assumptions -
You don't need the database code to be portable.
The csv is structured with a header, or at the least the attribute names are
known and fixed.
The sku (or name/sku combo) in the product table has a unique constraint.
Likewise, the EAV table has a unique constraint on product_id and attr_name.
Corollary - you didn't specify, but I also assume that the EAV table has a field
for the attribute name.
The process boils down to -
Load the data into the database by the fastest path possible
Unpivot the csv from a tabular structure to EAV structure during or after the load
"Upsert" the resulting records - update if present, insert otherwise.
Approach -
With all that background, given a similar problem, here is the approach I would take.
Create temp tables mirroring the final destination, but without pks, types, or constraints
The temp tables will get deleted when the database session ends
Load the .csv straight into the temp tables in a single pass; two SQL executions per row
One for product
One for the EAV, using the 'multi-value' insert - insert into tmp_eav (sku, attr_name, attr_value) values (%s, %s, %s), (%s, %s, %s)....
psycopg2 has a custom method to do this for you: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.execute_values
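The load step above can be sketched as follows. This is a minimal sketch, not the answer's exact code: the attribute column names (attr1..attr6) and the tmp_eav table are placeholders matching the question's layout, and the psycopg2 call is isolated in its own function so the unpivot logic stands alone.

```python
import csv
import io

# Placeholder attribute names; in the question these are six decimal columns.
ATTR_COLS = ["attr1", "attr2", "attr3", "attr4", "attr5", "attr6"]

def unpivot_rows(csv_file):
    """Yield (sku, attr_name, attr_value) tuples for the tmp_eav load."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        for attr in ATTR_COLS:
            yield (row["sku"], attr, row[attr])

def load_tmp_eav(conn, csv_file):
    """Bulk-load the unpivoted rows with multi-value INSERTs.
    Requires psycopg2; the connection is supplied by the caller."""
    from psycopg2.extras import execute_values
    with conn.cursor() as cur:
        execute_values(
            cur,
            "insert into tmp_eav (sku, attr_name, attr_value) values %s",
            unpivot_rows(csv_file),
            page_size=1000,  # rows per multi-value statement
        )
```

execute_values batches the generator's output into pages of 1000 value tuples per statement, so the whole csv never has to be materialized in memory at once.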
Select from tmp tables to upsert into final tables, using a statement like insert into product (name, sku) select name, sku from tmp_product on conflict (sku) do nothing
This requires PostgreSQL 9.5+.
For the user-selectable requirement to optionally update fields based on the csv, you can change do nothing to do update set col = excluded.col. excluded is the input row that conflicted
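One way to handle the user-selectable behavior is to generate the upsert statement from a column list. This is a sketch; the table/column names are taken from the example above, and the tmp_ prefix convention is an assumption.

```python
def upsert_sql(table, cols, conflict_cols, update_cols=None):
    """Build an INSERT ... ON CONFLICT statement (PostgreSQL 9.5+).
    With update_cols=None this is DO NOTHING; otherwise DO UPDATE SET
    col = excluded.col for each listed column."""
    col_list = ", ".join(cols)
    sql = (f"insert into {table} ({col_list}) "
           f"select {col_list} from tmp_{table} "
           f"on conflict ({', '.join(conflict_cols)}) ")
    if update_cols:
        sets = ", ".join(f"{c} = excluded.{c}" for c in update_cols)
        return sql + f"do update set {sets}"
    return sql + "do nothing"
```

Passing update_cols only when the user opts in to overwriting gives you both behaviors from one code path.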
Alternative approach -
Create the temp table based on the structure of the csv (assumes you have enough metadata to do this on each run, or that the csv structure is fixed and can be consistently translated to a table)
Load the csv into the database using the COPY command (supported in psycopg2
via the cursor.copy_from method, passing in the csv as a file object).
This will be faster than anything you write in Python
Caveat: this works if the csv is very dependable (same number of cols on
every row) and the temp table is very lax w/ nulls, all strings w/ no
type coercion.
You can 'unpivot' the csv rows with a union all query that combines a
select for each column to row transpose. The 6 decimals in your example
should be manageable.
For example:
select sku, 'foo' as attr_name, foo as attr_value from tmp_csv union all
select sku, 'bar' as attr_name, bar as attr_value from tmp_csv union all
...
order by sku;
This solution hits a couple of the things you were interested in:
Python application memory remains flat
Network I/O is limited to what it takes to get the .csv into the db and issue
the right follow up sql statements
A little general advice to close out -
Optimal and "good enough" are almost never the same thing
Optimal is only required under very specific situations
So, aim for "good enough", but be precise about what "good enough" means -
i.e., pick one or two measures
Iterate, solving for one variable at a time. In my experience, the first hurdle (say, "end to end processing time less than
X seconds") is often sufficient.
I tried the cx_Oracle package with fetchall(), but this ended up consuming a lot of RAM.
I tried Pandas as well, but that also doesn't seem to be efficient when we have billions of records.
My use case: fetch each row from an Oracle table into Python, do some processing, and load it into another table.
PS: I was expecting something like fetchmany(); I tried it but was not able to get it to work.
With your large data set, since your machine's memory is "too small", you will have to do batch processing of sets of rows, reinserting each set before fetching the next.
Tuning Cursor.arraysize and Cursor.prefetchrows (new in cx_Oracle 8) will be important for your fetch performance. The value is used for internal buffer sizes, regardless of whether you use fetchone() or fetchmany(). See Tuning cx_Oracle.
Use executemany() when reinserting the values, see Batch Statement Execution and Bulk Loading.
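The fetch/reinsert loop can be written once against the generic DB-API, since fetchmany(), executemany(), and the arraysize attribute behave the same in cx_Oracle as in other drivers. This sketch demonstrates it with sqlite3 standing in for Oracle; the table names and the doubling transform are purely illustrative.

```python
import sqlite3

def batch_copy(src_cur, dst_cur, insert_sql, transform, batch_size=1000):
    """Fetch batch_size rows at a time, process them in Python, and
    reinsert with executemany(); works with any DB-API driver
    (cx_Oracle, sqlite3, ...). Memory stays bounded by batch_size."""
    src_cur.arraysize = batch_size  # sizes the driver's fetch buffer
    while True:
        rows = src_cur.fetchmany(batch_size)
        if not rows:
            break
        dst_cur.executemany(insert_sql, [transform(r) for r in rows])

# Demo with sqlite3 standing in for Oracle:
conn = sqlite3.connect(":memory:")
conn.executescript(
    "create table src (n integer); create table dst (n integer);")
conn.executemany("insert into src values (?)", [(i,) for i in range(10)])
src, dst = conn.cursor(), conn.cursor()
src.execute("select n from src")
batch_copy(src, dst, "insert into dst values (?)",
           lambda r: (r[0] * 2,), batch_size=4)
```

With cx_Oracle you would also set Cursor.prefetchrows alongside arraysize, as the tuning documentation linked above describes.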
The big question is whether you even need to fetch into Python, or whether you can do the manipulation in PL/SQL or SQL - using these two will remove the need to transfer data across the network so will be much, much more efficient.
Generally your strategy should be the following.
Write an sql query that gets the data you want.
Execute the query
loop: Read a tuple from the result.
Process the tuple.
Write the processed stuff to the target relation
Lots of details to deal with however.
Generally when executing a query of this type, you'll want to read a snapshot version of the database. If there is simultaneous writing to the same data, you'll get a stale version, but you won't hold up the writers. If you want the most recent version, you have a hard problem to solve.
When executing step 4, don't open a new connection, reuse a connection. Opening a new connection is akin to launching a battleship.
Best if you don't have any indexes on the target relation, since you'll pay the cost of updating the target relation each time you write a tuple. It will work, but it will be slower. You can build the index after finishing the processing.
If this implementation is too slow for you, an easy way to speed things up is to process, say, 1000 tuples at once and write them in batches. If that's too slow, you've entered into the murky world of database performance tuning.
You can use the offset-limit property. If the data can be ordered by a column (a primary key, or a combination of columns, so that the ordering is the same across executions), here's what I tried.
I take the total count() of the table.
Based on a chunk size (e.g. 100000), create a list of values [0, chunk, chunk*2, ..., count(*)].
ex: if count(*)=2222 and chunk=100, output list will be:
[0,100,200,...2222]
And then, using the values of the above list, I fetch each partition with 'offset i rows fetch next j rows only'.
This will select the data in partitions without overlapping any data.
I change values of i and j in each iteration so that I get the complete data.
At runtime the chunk given is 100,000, so that much data is loaded into memory at a time. You can change it to any value based on your system spec.
At each iteration, save the data wherever you want.
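The steps above can be sketched as follows. This demo uses sqlite3 with its LIMIT/OFFSET syntax; on Oracle the equivalent clause is OFFSET :i ROWS FETCH NEXT :j ROWS ONLY, and the table/column names here are illustrative.

```python
import sqlite3

def chunk_offsets(total, chunk):
    """Offsets [0, chunk, 2*chunk, ...) covering `total` rows."""
    return list(range(0, total, chunk))

# Demo: paginate an ordered table in fixed-size, non-overlapping chunks.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key)")
conn.executemany("insert into t values (?)", [(i,) for i in range(22)])
total = conn.execute("select count(*) from t").fetchone()[0]
seen = []
for off in chunk_offsets(total, 10):
    rows = conn.execute(
        "select id from t order by id limit ? offset ?", (10, off)).fetchall()
    seen.extend(r[0] for r in rows)  # process/save this partition here
```

Because the query is ordered by a stable key, each partition is disjoint and together they cover every row exactly once.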
I am trying to compare two tables in an sqlite3 database in python. One of the answers to this question:
Comparing two sqlite3 tables using python
gives a solution:
Alternatively, import them into SQLite tables. Then you can use queries like the following:
SELECT * FROM a INTERSECT SELECT * FROM b;
SELECT * FROM a EXCEPT SELECT * FROM b;
to get rows that exist in both tables, or only in one table.
This works great for tables with less than a million rows, but is far too slow for my program which requires comparing tables with more than ten billion rows. (Script took over ten minutes for just 100 million rows.)
Is there a faster way to compare two sqlite3 tables in python?
I thought about trying to compare the hashes of the two database files, but an overview of a program called dbhash on sqlite.org claims that even if the contents of two database files are the same, certain operations "can potentially cause vast changes to the raw database file, and hence cause very different SHA1 hashes at the file level". That makes me think this would not work unless I ran some sort of script to query all the data in an ordered fashion and then hashed that (like the dbhash program does); but would that even be faster?
Or should I be using another database entirely that can perform this comparison faster than sqlite3?
Any ideas or suggestions would be greatly appreciated.
Edit: There have been some good ideas put forward so far, but to clarify: the order of the tables doesn't matter, just the contents.
You could resort to the following workaround:
Add a column to each table where you store a hash over the content of all other columns.
Add an index to the new column.
Compute and store the hash with the record.
Compare the hash columns of your tables instead of using intersect/except.
If altering the tables isn't an option you can perhaps create new tables that relate a hash to the primary key or rowid of the hashed record.
With that you shift part of the processing time needed for the comparison to the time you insert/update the records. I would expect this to be significantly faster at comparison time than comparing all columns of all rows just then.
Of course your hash must be aware of the order of values and produce unique values for every permutation; a simple checksum won't suffice. Suggestion:
Convert every column value to a string.
Concatenate the strings with a separator that's guaranteed not to occur in the values themselves.
Use SHA1 or a similarly sophisticated hashing algorithm over the concatenated string.
You can test whether storing the hash as string, blob or integer (provided it fits into 64 bit) makes a difference in speed.
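The suggestion above can be sketched like this. The separator character, table layout, and column names are assumptions for illustration; the point is that rows are compared on a single indexed hash column instead of on every column.

```python
import hashlib
import sqlite3

SEP = "\x1f"  # unit separator: assumed never to occur in the values

def row_hash(*values):
    """SHA-1 over the separator-joined string forms of the columns;
    order matters, so permutations of the same values hash differently."""
    joined = SEP.join(str(v) for v in values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

# Demo: store the hash alongside each row, index it, then compare tables
# on the hash column with EXCEPT.
conn = sqlite3.connect(":memory:")
for t in ("a", "b"):
    conn.execute(f"create table {t} (x text, y text, h text)")
    conn.execute(f"create index {t}_h on {t}(h)")
rows_a = [("1", "foo"), ("2", "bar")]
rows_b = [("1", "foo"), ("2", "baz")]
conn.executemany("insert into a values (?, ?, ?)",
                 [(x, y, row_hash(x, y)) for x, y in rows_a])
conn.executemany("insert into b values (?, ?, ?)",
                 [(x, y, row_hash(x, y)) for x, y in rows_b])
only_in_a = conn.execute(
    "select h from a except select h from b").fetchall()
```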
Yes, it will take a lot of time for a single thread (or even several) on a single hard drive to crawl billions of rows.
It can obviously be better with stronger DB engines but indexing all your columns would not really help in the end.
You have to resort to precalculation or distributing your dataset amongst multiple systems...
If you have a LOT of RAM you can try copying the SQLite files into /dev/shm first, allowing you to read your data straight from memory and benefit from a performance boost.
I am currently facing the problem of having to frequently access a large but simple data set on a smallish (700 Mhz) device in real time. The data set contains around 400,000 mappings from abbreviations to abbreviated words, e.g. "frgm" to "fragment". Reading will happen frequently when the device is used and should not require more than 15-20ms.
My first attempt was to utilize SQLite in order to create a simple data base which merely contains a single table where two strings constitute a data set:
CREATE TABLE WordMappings (key text, word text)
This table is created once and although alterations are possible, only read-access is time critical.
Following this guide, my SELECT statement looks as follows:
def databaseQuery(self, query_string):
    # Bind the key as a parameter rather than concatenating it into the
    # SQL, which avoids quoting problems and SQL injection:
    self.cursor.execute(
        "SELECT word FROM WordMappings WHERE key = ? LIMIT 1;",
        (query_string,))
    result = self.cursor.fetchone()
    return result[0]
However, using this code on a test database with 20,000 abbreviations, I am unable to fetch data quicker than ~60ms, which is far too slow.
Any suggestions on how to improve performance using SQLite or would another approach yield more promising results?
You can speed up lookups on the key column by creating an index for it:
CREATE INDEX key_index ON WordMappings(key);
To check whether a query uses an index or scans the entire table, use EXPLAIN QUERY PLAN.
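A short, runnable sketch of both steps, using an in-memory database and a single sample row for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WordMappings (key text, word text)")
conn.execute("INSERT INTO WordMappings VALUES ('frgm', 'fragment')")

def plan(sql):
    # The human-readable plan detail is the last column of each row.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT word FROM WordMappings WHERE key = 'frgm'")
conn.execute("CREATE INDEX key_index ON WordMappings(key)")
after = plan("SELECT word FROM WordMappings WHERE key = 'frgm'")
```

Before the index, the plan reports a full table scan; afterwards it reports a search using the index, which is what turns the lookup from O(n) into O(log n).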
A long time ago I tried to use SQLite for sequential data and it was not fast enough for my needs. At the time, I was comparing it against an existing in-house binary format, which I ended up using.
I have not personally used it, but a friend uses PyTables for large time-series data; it may be worth looking into.
It turns out that defining a primary key speeds up individual queries by an order of magnitude.
Individual queries on a test table with 400,000 randomly created entries (10/20 characters long) took no longer than 5ms which satisfies the requirements.
The table is now created as follows:
CREATE TABLE WordMappings (key text PRIMARY KEY, word text)
A primary key is used because
It is implicitly unique, which is a property of the abbreviations stored
It cannot be NULL, so every row must have a key; in our case, if one were missing, the data would be corrupt
Other users have suggested using an index; however, indexes are not necessarily unique, and according to the accepted answer to this question, they unnecessarily slow down update/insert/delete performance. Nevertheless, using an index may also increase performance; this has, however, not been tested by the original author.
I have a load of data in CSV format. I need to be able to index this data based on a single text field (the primary key), so I'm thinking of entering it into a database. I'm familiar with sqlite from previous projects, so I've decided to use that engine.
After some experimentation, I realized that storing a hundred million records in one table won't work well: the indexing step slows to a crawl pretty quickly. I could come up with two solutions to this problem:
partition the data into several tables
partition the data into several databases
I went with the second solution (it yields several large files instead of one huge file). My partition method is to look at the first two characters of the primary key: each partition has approximately 2 million records, and there are approximately 50 partitions.
I'm doing this in Python with the sqlite3 module. I keep 50 open database connections and open cursors for the entire duration of the process. For each row, I look at the first two characters of the primary key, fetch the right cursor via dictionary lookup, and perform a single insert statement (via calling execute on the cursor).
Unfortunately, the insert speed still decreases to an unbearable level after a while (approx. 10 million total processed records). What can I do to get around this? Is there a better way to do what I'm doing?
Wrap all insert commands into a single transaction.
Use prepared statements.
Create the index only after inserting all the data (i.e., don't declare a primary key).
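The three tips above can be sketched together in sqlite3 (parameterized statements are prepared and cached by the driver internally; the table layout here is a stand-in for the real data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (pk text, payload text)")
rows = [("k%05d" % i, "v%d" % i) for i in range(10000)]

# One transaction around the whole bulk insert, via the connection as a
# context manager; executemany reuses a single prepared statement.
with conn:
    conn.executemany("insert into t values (?, ?)", rows)

# Declare the index only once all the data is in place, so inserts never
# pay the per-row index-maintenance cost:
conn.execute("create unique index t_pk on t(pk)")
count = conn.execute("select count(*) from t").fetchone()[0]
```

Building the index in one pass over sorted data is far cheaper than updating it incrementally for every insert.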
I think the problem you have is that once the processing cannot just use in-memory buffers your hard disk head is just jumping randomly between 50 locations and this is dog slow.
Something you can try is just processing one subset at a time:
seen = set()  # Key prefixes already processed
while True:
    k0 = None  # Current prefix
    for L in all_the_data:
        k = L[0][:2]
        if k not in seen:
            if k0 is None:
                k0 = k
            if k0 == k:
                store_into_database(L)
    if k0 is None:
        break  # every prefix has been processed
    seen.add(k0)
This will do n+1 passes over the data (where n is the number of prefixes) but will only access two disk locations (one for reading and one for writing). It should work even better if you have separate physical devices.
PS: Are you really really sure an SQL database is the best solution for this problem?
I am new with SQL/Python.
I was wondering if there is a way for me to sort or categorize expense items into three primary categories.
That is, I have a 56,000-row list with 100+ different expense categories. They vary from things like Payroll, Credit Card Pmt, telephone, etc.
I would like to put them into three categories, for the sake of analysis.
I know I could do a GIANT IF statement in Excel, but that would be really time consuming, based on the fact that there are 100+ sub categories.
Is there any way to expedite the process with Python or even in Excel?
Also, I don't know if this is material or not, but I am preparing this file to be uploaded to a SQL database.
You should create a table called something like ExpenseCategories, with the columns ExpenseCategory, PrimaryCategory.
This table would have one row for each expense category (which you can enforce with a constraint if you like). You would then join this table with your existing data in SQL.
By the way, in Excel, you could do this with a vlookup() rather than an if(). The vlookup() is analogous to using a lookup table in SQL. The equivalent of an if() would be a giant case statement, which is another possibility.
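A minimal sketch of the lookup-table join, using sqlite3 and hypothetical table/column names (Expenses, ExpenseCategories, PrimaryCategory) following the answer's suggestion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Expenses (item text, ExpenseCategory text, amount real);
    create table ExpenseCategories (
        ExpenseCategory text primary key,  -- enforces one row per category
        PrimaryCategory text);
""")
conn.executemany("insert into ExpenseCategories values (?, ?)", [
    ("Payroll", "Operations"),
    ("Credit Card Pmt", "Financing"),
    ("telephone", "Overhead"),
])
conn.executemany("insert into Expenses values (?, ?, ?)", [
    ("March wages", "Payroll", 1000.0),
    ("Office line", "telephone", 40.0),
])
# Join each expense row to its primary category and roll up:
rolled_up = conn.execute("""
    select c.PrimaryCategory, sum(e.amount)
    from Expenses e join ExpenseCategories c using (ExpenseCategory)
    group by c.PrimaryCategory order by c.PrimaryCategory
""").fetchall()
```

The 100+ sub-categories live as data in the lookup table rather than as branches in a giant IF, so adding or re-mapping a category is a one-row change.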