Python SQLAlchemy - large insert into postgres table hangs

I've got a script that is attempting to insert a large number of rows into a postgresql table. When I say large, I mean up to 200,000. I'm inserting data from python using sql alchemy. Each row is made up of one unique ID and a number of 0/1 flags.
When I try to insert a small number of rows, it works just fine. I have even inserted around 18,000 without any problems, and I think it only took a handful of seconds.
Lately, I have stepped it up and tried inserting a much larger data set of around 150,000 records. I had my script print the time it started, and this insert has now been running for 12+ hours, which seems disproportionately long compared to the fast ~18k-row insert. Here is the code I'm using.
sql_engine = sqlalchemy.create_engine("postgresql://database")
meta = sqlalchemy.MetaData(sql_engine)
my_table = sqlalchemy.Table('table_name', meta, autoload=True, autoload_with=sql_engine)
already_inserted = [i for i in sql_engine.execute(sqlalchemy.select([some_column]))]
table_rows = []
for i in summary:
    if i[some_column] not in already_inserted:
        table_rows.append(
            {...}  # logic that builds the row of 0s and 1s
        )
if len(table_rows) > 0:
    my_table.insert().execute(table_rows)
Are there any tips for getting this to work? Should I be inserting in smaller chunks? Furthermore, would inserts go faster if I only inserted the flags that are equal to 1 and left the zeros as NULL?
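A minimal sketch of the chunked-insert idea raised in the question, assuming the same sql_engine, my_table, summary, and some_column objects as above; the batch size, the set-based membership test, and the single surrounding transaction are illustrative choices rather than a confirmed fix:

# Hypothetical sketch: keep the already-inserted IDs in a set (O(1) membership test
# instead of scanning a 150k-element list) and send rows to Postgres in fixed-size batches.
already_inserted = {row[0] for row in sql_engine.execute(sqlalchemy.select([some_column]))}

BATCH_SIZE = 10000          # illustrative value
batch = []

with sql_engine.begin() as conn:            # one transaction for the whole run
    for i in summary:
        if i[some_column] in already_inserted:
            continue
        batch.append({...})                 # same row-building logic as above
        if len(batch) >= BATCH_SIZE:
            conn.execute(my_table.insert(), batch)   # executemany-style batch insert
            batch = []
    if batch:
        conn.execute(my_table.insert(), batch)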

Related

Optimized number of rows that can be inserted at each dump insertion to DB2 using JDBC

Using a JDBC connection, I have about 50 million rows to insert to DB2 from Python.
I initially tried to insert one row at a time, which resulted in a huge processing time lag.
Therefore, I changed it to insert 1,000 rows per attempt, which clearly made processing faster. However, I wonder whether there is a clear guideline for this type of insertion.
How does one dynamically find an optimal batch size for a bulk insert without holding the connection open for too long?
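The question doesn't show code or name a JDBC bridge, so here is a hypothetical sketch of the 1,000-rows-per-batch idea using the jaydebeapi package; the driver jar path, URL, table, and columns are all placeholders:

import jaydebeapi

# Hypothetical connection details; the DB2 JDBC driver jar path is an assumption.
conn = jaydebeapi.connect(
    "com.ibm.db2.jcc.DB2Driver",
    "jdbc:db2://host:50000/MYDB",
    ["user", "password"],
    "/path/to/db2jcc4.jar",
)
cur = conn.cursor()

BATCH_SIZE = 1000
batch = []
for row in rows_to_insert:                     # any iterable of (a, b, c) tuples
    batch.append(row)
    if len(batch) == BATCH_SIZE:
        cur.executemany("INSERT INTO my_table (a, b, c) VALUES (?, ?, ?)", batch)
        conn.commit()                          # committing per batch keeps transactions short
        batch = []
if batch:
    cur.executemany("INSERT INTO my_table (a, b, c) VALUES (?, ?, ?)", batch)
    conn.commit()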

How to read fast with Python from MS SQL

How can I quickly search for specific data in MS SQL from Python when the query depends on three variables and the table has 10 million rows? Should I use the pyodbc library or something else?
Starting from the fact that 10 million rows for a table in an RDBMS is not a huge number, I personally used TurboODBC with MSSQL to migrate a table with ~2 billion rows. I was very satisfied.
You should carefully review and optimize your query/schema (adding indexes, exploiting them, etc.) regardless of the technology you will use to run it. Most likely the database query execution will be the bottleneck of your process (assuming that you will not process this data in python).
A columnstore index is appropriate for a high volume of rows in a table (many millions at least). If the criteria in the WHERE clause of your SELECT are the three variables, create such an index on those three columns. As a hypothetical example:
CREATE NONCLUSTERED COLUMNSTORE INDEX XC_MyTable ON MyTable (Col1, Col2, Col3);
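A minimal sketch of the TurboODBC read path mentioned above, assuming an ODBC DSN named "mssql" is already configured; the DSN, table, columns, and filter values are placeholders:

from turbodbc import connect

connection = connect(dsn="mssql")        # hypothetical DSN configured in your ODBC data sources
cursor = connection.cursor()

value1, value2, value3 = 1, 2, 3         # placeholder values for the three filter variables
cursor.execute(
    "SELECT Col1, Col2, Col3 FROM MyTable WHERE Col1 = ? AND Col2 = ? AND Col3 = ?",
    [value1, value2, value3],
)
result = cursor.fetchallnumpy()          # returns the columns as NumPy arrays, fast for bulk reads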

Can't match Postgres insert speed with SQLite3 in Python

I have a database with a single indexed UNIQUE text column, currently at 100m rows, but it could go up to a billion. This is for doing millions of contains queries on the DB, so the index is necessary for fast lookup.
The problem I'm having is that imports have slowed down a lot as the DB has grown. I've tried a lot of ideas, including something like this (it reads roughly 100k lines from a file and does a batch insert):
lines = f.readlines(3355629)
while lines:
    c.execute('BEGIN TRANSACTION')
    rows_modified += c.executemany(
        'INSERT OR IGNORE INTO mydb (name) VALUES (?)',
        map(lambda name: (name.strip().lower(),), lines)).rowcount
    c.execute('COMMIT')
    lines = f.readlines(3355629)
The above takes many minutes to insert 100k lines when the DB is at 100m rows.
I've imported to Postgres with a python script which can get 100k inserts on an indexed column in 2 seconds at 100m rows (using psycopg2.extras.execute_values).
I keep seeing claims that SQLite can easily handle terabytes of data, yet I can't figure out how to get the data in. BTW, I can't drop the index and recreate it because then I'd need additional code to make sure the data is unique, and you can't change a column to UNIQUE after creation. Multiple tables are an option for a speed increase, though it makes things more complex.
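For comparison, a minimal sketch of the psycopg2.extras.execute_values path the asker mentions for Postgres; the connection string, table name, and the ON CONFLICT clause (standing in for SQLite's INSERT OR IGNORE) are assumptions:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")      # placeholder connection string
cur = conn.cursor()

# `lines` is the same ~100k-line batch read from the file in the SQLite version above.
rows = [(name.strip().lower(),) for name in lines]
execute_values(
    cur,
    "INSERT INTO mytable (name) VALUES %s ON CONFLICT DO NOTHING",
    rows,
    page_size=1000,     # rows per generated multi-row INSERT statement
)
conn.commit()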

How to process 10 million records in Oracle DB using Python. (cx_Oracle / Pandas)

I tried the cx_Oracle package with fetchall(), but this ended up consuming a lot of RAM.
I tried Pandas as well, but that also doesn't seem to be efficient when we have billions of records.
My use case: fetch each row from an Oracle table into Python, do some processing, and load it into another table.
PS: I was expecting something like fetchmany(); I tried it but couldn't get it to work.
With your large data set, since your machine's memory is "too small", you will have to process batches of rows, reinserting each batch before fetching the next one.
Tuning Cursor.arraysize and Cursor.prefetchrows (new in cx_Oracle 8) will be important for your fetch performance. The value is used for internal buffer sizes, regardless of whether you use fetchone() or fetchmany(). See Tuning cx_Oracle.
Use executemany() when reinserting the values, see Batch Statement Execution and Bulk Loading.
The big question is whether you even need to fetch into Python, or whether you can do the manipulation in PL/SQL or SQL - using these two will remove the need to transfer data across the network so will be much, much more efficient.
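A minimal sketch of the fetchmany()/executemany() batch pattern described above, assuming cx_Oracle 8+; the connection details, table names, columns, and the transform() step are placeholders:

import cx_Oracle

conn = cx_Oracle.connect("user", "password", "dbhost/service_name")   # placeholder credentials
src = conn.cursor()
dst = conn.cursor()

BATCH = 10000
src.arraysize = BATCH            # sizes the internal fetch buffer
src.prefetchrows = BATCH + 1     # cx_Oracle 8+: avoids an extra round trip per batch

src.execute("SELECT id, payload FROM source_table")
while True:
    rows = src.fetchmany(BATCH)
    if not rows:
        break
    processed = [(r[0], transform(r[1])) for r in rows]   # transform() is the placeholder processing step
    dst.executemany("INSERT INTO target_table (id, payload) VALUES (:1, :2)", processed)
    conn.commit()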
Generally, your strategy should be the following:
1. Write a SQL query that gets the data you want.
2. Execute the query.
3. Loop: read a tuple from the result.
4. Process the tuple.
5. Write the processed tuple to the target relation.
There are lots of details to deal with, however.
Generally, when executing a query like this, you'll be reading a snapshot of the database. If there is simultaneous writing to the same data, you'll see a slightly stale version, but you won't hold up the writers. If you need the most recent version, you have a hard problem to solve.
When writing the processed tuples (step 5), don't open a new connection each time; reuse a connection. Opening a new connection is akin to launching a battleship.
It's best if you don't have any indexes on the target relation, since you'll pay the cost of updating each index every time you write a tuple. It will still work, just more slowly. You can build the indexes after the processing is finished.
If this implementation is too slow for you, an easy way to speed things up is to process, say, 1000 tuples at once and write them in batches. If that's too slow, you've entered into the murky world of database performance tuning.
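A small sketch of the "no indexes on the target until the end" point, assuming the same cx_Oracle connection as in the sketch above; the index and table names are placeholders:

# Hypothetical: load the target table without indexes, then index it once at the end.
cur = conn.cursor()
# ... bulk load into target_table with executemany(), as in the sketch above ...
cur.execute("CREATE INDEX target_table_id_ix ON target_table (id)")
conn.commit()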
You can use the offset/fetch property. If the data can be ordered by a column (a primary key, or a combination of columns so that the ordering is stable across executions), here's what I tried.
I take the total count(*) of the table.
Based on a chunk size (e.g. 100,000), I create a list of offsets [0, chunk, 2*chunk, ..., count(*)].
For example, if count(*) = 2222 and chunk = 100, the list will be:
[0, 100, 200, ..., 2200]
Then, using the values of that list, I fetch each partition with 'OFFSET i ROWS FETCH NEXT j ROWS ONLY'.
This will select the data in partitions without overlapping any data.
I change values of i and j in each iteration so that I get the complete data.
At any one time, only one chunk (here 100,000 rows) is loaded into memory. You can change the chunk size based on your system's specs.
At each iteration, save the data wherever you want.
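A minimal sketch of this offset/fetch partitioning, assuming Oracle 12c+ (for OFFSET ... FETCH NEXT) and cx_Oracle; the table, columns, and ordering key are placeholders:

import cx_Oracle

conn = cx_Oracle.connect("user", "password", "dbhost/service_name")   # placeholder credentials
cur = conn.cursor()

cur.execute("SELECT COUNT(*) FROM source_table")
total = cur.fetchone()[0]

CHUNK = 100000
offsets = list(range(0, total, CHUNK))     # e.g. count=2222, chunk=100 -> [0, 100, ..., 2200]

for off in offsets:
    cur.execute(
        """SELECT id, payload
             FROM source_table
            ORDER BY id
           OFFSET :off ROWS FETCH NEXT :chunk ROWS ONLY""",
        off=off,
        chunk=CHUNK,
    )
    rows = cur.fetchall()
    # ... process / save this partition before fetching the next one ...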

Table updates using daily data from other tables Postgres/Python

I have a database and a CSV file that gets updated once a day. I managed to update my table1 from this file by creating a separate log file with a record of the last insert.
Now I have to create a new table, table2, where I keep calculations derived from table1.
My issue is that those calculations are based on 10, 20 and 90 previous rows from table1.
The question is: how can I efficiently update table2 from the data in table1 on a daily basis? I don't want to re-do the calculations from the beginning of the table every day, since that would be very time-consuming.
Thanks for your help!
The answer is "as well as one could possibly expect."
Without seeing your tables, data, and queries, and the specs of your machine, it is hard to be too specific. However, in general, an update does three things; this is a bit of an oversimplification, but it lets you estimate performance.
First it selects the necessary data. Then it marks the updated rows as deleted, and finally it inserts new rows with the new data into the table. In general, your limit is usually the data selection: as long as you can efficiently run the SELECT query that gets the data you want, the update should perform relatively well.
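As a purely hypothetical illustration of "an efficient SELECT drives the update" for the daily case above: compute the rolling figures in an inner query (so the 10/20/90-row windows can still see older history) and insert only the days table2 doesn't have yet. All table, column, and window-function choices here are assumptions, not from the question:

import psycopg2

conn = psycopg2.connect("dbname=mydb")     # placeholder connection string
cur = conn.cursor()

cur.execute("""
    INSERT INTO table2 (day, avg10, avg20, avg90)
    SELECT day, avg10, avg20, avg90
    FROM (
        SELECT day,
               AVG(value) OVER (ORDER BY day ROWS BETWEEN 9  PRECEDING AND CURRENT ROW) AS avg10,
               AVG(value) OVER (ORDER BY day ROWS BETWEEN 19 PRECEDING AND CURRENT ROW) AS avg20,
               AVG(value) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) AS avg90
        FROM table1
    ) t
    WHERE day > (SELECT COALESCE(MAX(day), DATE '1900-01-01') FROM table2)
""")
conn.commit()

The inner query still scans table1, so in practice you would also bound it to roughly the last 90-plus-new rows; the point is only that the heavy lifting sits in one SELECT that Postgres can plan and index.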
