How to increase the efficiency of inserting data into PostGIS with Python?

I need to insert 46 million points into a PostGIS database in a decent time. Inserting 14 million points took around 40 minutes, which is awful and inefficient.
I created the database with a spatial GIST index and wrote this code:
import psycopg2
import time

start = time.time()
conn = psycopg2.connect(host='localhost', port='5432', dbname='test2', user='postgres', password='alfabet1')
filepath = "C:\\Users\\nmt1m.csv"
curs = conn.cursor()
with open(filepath, 'r') as text:
    for i in text:
        i = i.replace("\n", "")
        i = i.split(sep=" ")
        curs.execute(f"INSERT INTO nmt_1 (geom, Z) VALUES (ST_GeomFromText('POINTZ({i[0]} {i[1]} {i[2]})',0), {i[2]});")
        conn.commit()
end = time.time()
print(end - start)
curs.close()
conn.close()
I'm looking for the best way to insert the data; it does not have to be in Python.
Thanks ;)

Hi! Welcome to SO.
There are a few things you can do to speed up your bulk insert:
If the target table is empty or is not being used in a production system, consider dropping the indexes right before inserting the data and recreating them once the insert is complete. This keeps PostgreSQL from updating the index after every insert, which in your case would happen 46 million times.
If the target table can be entirely rebuilt from your CSV file, consider creating an UNLOGGED TABLE. Unlogged tables are much faster than "normal" tables, since (as the name suggests) they are not recorded in the WAL (write-ahead log). Unlogged tables might be lost in case of a database crash or an unclean shutdown!
Use either the PostgreSQL COPY command or copy_from, as @MauriceMeyer pointed out. If for some reason you must stick to inserts, make sure you're not committing after every insert ;-) A batched-insert sketch follows below.
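If inserts really are required, batching them with psycopg2's execute_values helper cuts the per-statement round trips dramatically. This is only a minimal sketch, not a drop-in replacement: it assumes the nmt_1 table and the space-separated file from your question, uses ST_SetSRID(ST_MakePoint(...)) as an equivalent of your ST_GeomFromText call, and the 10,000-row batch size is an arbitrary choice.

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(host='localhost', port='5432', dbname='test2',
                        user='postgres', password='alfabet1')
curs = conn.cursor()

sql = "INSERT INTO nmt_1 (geom, Z) VALUES %s"
template = "(ST_SetSRID(ST_MakePoint(%s, %s, %s), 0), %s)"

def flush(batch):
    # send one multi-row INSERT for the whole batch
    execute_values(curs, sql, batch, template=template, page_size=len(batch))

batch = []
with open("C:\\Users\\nmt1m.csv", 'r') as text:
    for line in text:
        x, y, z = (float(v) for v in line.split())
        batch.append((x, y, z, z))
        if len(batch) >= 10000:      # flush in batches of 10,000 rows
            flush(batch)
            batch = []
if batch:                            # insert the final partial batch
    flush(batch)

conn.commit()                        # commit once, at the very end
curs.close()
conn.close()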
Cheers

Thanks Jim for the help; following your instructions, a better way to insert the data is:
import psycopg2
import time
start = time.time()
conn = psycopg2.connect(host='localhost', port='5432', dbname='test2',
user='postgres', password='alfabet1')
curs = conn.cursor()
filepath = "C:\\Users\\Jakub\\PycharmProjects\\test2\\testownik9_NMT\\nmt1m.csv"
curs.execute("CREATE UNLOGGED TABLE nmt_10 (id_1 FLOAT, id_2 FLOAT, id_3 FLOAT);")
with open(filepath, 'r') as text:
    curs.copy_from(text, 'nmt_10', sep=" ")
curs.execute("SELECT AddGeometryColumn('nmt_10', 'geom', 2180, 'POINTZ', 3);")
curs.execute("CREATE INDEX nmt_10_index ON nmt_10 USING GIST (geom);")
curs.execute("UPDATE nmt_10 SET geom = ST_SetSRID(ST_MakePoint(id_1, id_2, id_3), 2180);")
conn.commit()
end = time.time()
print(end - start)
cheers
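One note on the snippet above: the GIST index is created before the UPDATE that fills geom, so PostgreSQL has to maintain the index while every row is rewritten. Following the first point of the earlier answer, it should be somewhat faster to populate the geometry column first and build the index last. A hedged sketch of that ordering, reusing the same statements:

# same connection, COPY, and UNLOGGED TABLE setup as in the snippet above
curs.execute("SELECT AddGeometryColumn('nmt_10', 'geom', 2180, 'POINTZ', 3);")
curs.execute("UPDATE nmt_10 SET geom = ST_SetSRID(ST_MakePoint(id_1, id_2, id_3), 2180);")
# build the spatial index only after the geometry column is fully populated
curs.execute("CREATE INDEX nmt_10_index ON nmt_10 USING GIST (geom);")
conn.commit()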

Related

Exponential in SQLite3 in Python

I'm writing Python code that creates a SQLite database and does some calculations on massive tables. To begin with, the reason I'm doing it in SQLite through Python is memory: my data is huge and would raise a memory error if run in, say, pandas, and if chunked it would take ages, mainly because pandas is slow with merges, group-bys, etc.
My issue is that at some point I want to calculate the exponential of one column in a table (sample code below), but it seems that SQLite doesn't have an EXP function.
I could write the data to a dataframe and then use numpy to calculate the EXP, but that defeats the whole point that pushed me towards databases, and adds the time of reading/writing back and forth between the DB and Python.
So my question is this: is there a way to calculate the exponential within the database? I've read that I can create the function within sqlite3 in Python, but I have no idea how. If you know how, or can direct me to where I can find relevant info, I would be thankful. Thanks.
A sample of my code where I'm trying to do the calculation; note that here I'm just providing a sample where the table comes directly from a CSV, but in my process it's actually created within the DB after lots of merges and group-bys:
import sqlite3
import pandas as pd

#set path and file names
folderPath = 'C:\\SCP\\'
inputDemandFile = 'demandFile.csv'
#set connection to database (dataBaseName is defined elsewhere in the full script)
conn = sqlite3.connect(folderPath + dataBaseName)
cur = conn.cursor()
#read demand file into db
inputDemand = pd.read_csv(folderPath + inputDemandFile)
inputDemand.to_sql('inputDemand', conn, if_exists='replace', index=False)
#create new table and calculate EXP
cur.execute('CREATE TABLE demand_exp AS SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand FROM inputDemand;')
I've read that I can create the function within sqlite3 in Python, but I have no idea how.
That's conn.create_function()
https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function
>>> import math
>>> conn.create_function('EXP', 1, math.exp)
>>> cur.execute('select EXP(1)')
>>> cur.fetchone()
(2.718281828459045,)
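Applied to the snippet in the question, the registration only has to happen on the same connection before the CREATE TABLE ... AS SELECT runs. A minimal sketch, reusing the conn, cur, and table names from the question:

import math

# register EXP on the existing connection before referencing it in SQL
conn.create_function('EXP', 1, math.exp)

cur.execute('CREATE TABLE demand_exp AS '
            'SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand '
            'FROM inputDemand;')
conn.commit()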

Fastest way to load .xlsx file into MySQL database

I'm trying to import data from a .xlsx file into a SQL database.
Right now, I have a python script which uses the openpyxl and MySQLdb modules to
establish a connection to the database
open the workbook
grab the worksheet
loop through the rows of the worksheet, extracting the columns I need
and insert each record into the database, one by one
Unfortunately, this is painfully slow. I'm working with a huge data set, so I need to find a faster way to do this (preferably with Python). Any ideas?
wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']
conn = MySQLdb.connect()
cursor = conn.cursor()
cursor.execute("SET autocommit = 0")
for row in ws.iter_rows(row_offset=1):
    sql_row = # data i need
    cursor.execute("INSERT sql_row")
conn.commit()
Disable autocommit if it is on! Autocommit is a function which causes MySQL to immediately try to push your data to disk. This is good if you only have one insert, but this is what causes each individual insert to take a long time. Instead, you can turn it off and try to insert the data all at once, committing only once you've run all of your insert statements.
Something like this might work:
import MySQLdb

con = MySQLdb.connect(
    host="your db host",
    user="your username",
    passwd="your password",
    db="your db name"
)
cursor = con.cursor()
cursor.execute("SET autocommit = 0")
data = # some code to get data from excel
for datum in data:
    cursor.execute("your insert statement".format(datum))
con.commit()
con.close()
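Going one step further in the same direction, the per-row loop can be replaced by a single parameterized executemany call, which also avoids building SQL strings with format. This is only a sketch under assumptions: my_table and its three columns are hypothetical placeholders for your real table.

import MySQLdb

con = MySQLdb.connect(host="your db host", user="your username",
                      passwd="your password", db="your db name")
cursor = con.cursor()

data = []  # list of (col1, col2, col3) tuples pulled from the worksheet
cursor.executemany(
    "INSERT INTO my_table (col1, col2, col3) VALUES (%s, %s, %s)",
    data)
con.commit()
con.close()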
Consider saving the workbook's worksheet as a CSV, then use MySQL's LOAD DATA INFILE. This is usually very fast.
sql = """LOAD DATA INFILE '/path/to/data.csv'
INTO TABLE myTable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\n'"""
cursor.execute(sql)
con.commit()
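The CSV itself can be written straight from the openpyxl worksheet with the standard csv module before running the statement above. A minimal sketch, assuming Python 3 and that ws is the worksheet object from the question:

import csv

# write the worksheet to /path/to/data.csv before running LOAD DATA INFILE
with open('/path/to/data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in ws.iter_rows(min_row=2):          # skip the header row
        writer.writerow([cell.value for cell in row])

Note that LOAD DATA INFILE reads the file on the MySQL server; if the file lives on the client machine, LOAD DATA LOCAL INFILE (with local_infile enabled) is the usual alternative.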

Fast data moving from CSV to SQLite by Python

I have a problem. There are hundreds of CSV files, ca. 1,000,000 lines each.
I need to move that data in a specific way, but the script works very slowly (it processes only a few tens of thousands of rows per hour).
My code:
import sqlite3 as lite
import csv
import os
my_file = open('file.csv', 'r')
reader = csv.reader(my_file, delimiter=',')
date = '2014-09-29'
con = lite.connect('test.db', isolation_level = 'exclusive')
for row in reader:
    position = row[0]
    item_name = row[1]
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % item_name)
    cur.execute("INSERT INTO [%s] VALUES(?, ?)" % item_name, (date, position))
con.commit()
I found some information about isolation_level and exclusive access to the database, but it didn't help much.
The lines of the CSV files have this structure: 1,item1 | 2,item2
Could anyone help me? Thanks!
Don't do SQL inserts. Prepare the CSV file first, then do:
.separator <separator>
.import <loadFile> <tableName>
See here: http://cs.stanford.edu/people/widom/cs145/sqlite/SQLiteLoad.html
You certainly don't want to create a new cursor object for each row you insert, and checking for table creation at each line will certainly slow you down as well.
I'd suggest doing this in two passes: on the first pass you create the needed tables, and on the second pass you record the data. If it is still slow, you could build a more sophisticated in-memory collection of the data to be inserted and use executemany, but that requires grouping the data by table name in memory before committing.
import sqlite3 as lite
import csv
import os

my_file = open('file.csv', 'r')
reader = csv.reader(my_file, delimiter=',')
date = '2014-09-29'
con = lite.connect('test.db', isolation_level='exclusive')
cur = con.cursor()

# first pass: create all the needed tables
table_names = set(row[1] for row in reader)
my_file.seek(0)
for name in table_names:
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % name)

# second pass: insert the data
for row in reader:
    position = row[0]
    item_name = row[1]
    cur.execute("INSERT INTO [%s] VALUES(?, ?)" % item_name, (date, position))
con.commit()
The code is inefficient in that it performs two SQL statements for each row of the CSV. Try to optimize:
Is there a way to process the CSV first and convert it to SQL statements?
Are the rows in the CSV grouped by table (item name)? If yes, you can accumulate the rows to be inserted into the same table (generate a set of INSERT statements for the same table) and prefix each group with CREATE TABLE IF NOT EXISTS only once, instead of before every statement.
If possible, use bulk (multi-row) inserts; if I remember correctly, multi-row INSERT ... VALUES support was added in SQLite 3.7.11. More on this: Is it possible to insert multiple rows at a time in an SQLite database?
If needed, bulk insert in chunks. More on this: Bulk insert huge data into SQLite using Python
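To make the executemany idea from the answers above concrete, here is a hedged sketch that groups the rows by item name in memory and then issues one CREATE TABLE and one executemany per table; it assumes the same file layout and column types as the question:

import csv
import sqlite3 as lite
from collections import defaultdict

date = '2014-09-29'
con = lite.connect('test.db', isolation_level='exclusive')
cur = con.cursor()

# group rows by their target table name in memory
rows_by_table = defaultdict(list)
with open('file.csv', 'r') as my_file:
    for position, item_name in csv.reader(my_file, delimiter=','):
        rows_by_table[item_name].append((date, position))

# one CREATE TABLE and one multi-row insert per table
for table, rows in rows_by_table.items():
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % table)
    cur.executemany("INSERT INTO [%s] VALUES (?, ?)" % table, rows)

con.commit()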
I had the same problem, and now it is solved! I would like to share the method with everyone who is facing the same problem.
We use an sqlite3 database as an example; other databases may also work, but I am not sure. We use the pandas and sqlite3 modules in Python.
This can quickly convert a list of CSV files [file1, file2, ...] into tables [table1, table2, ...].
import pandas as pd
import sqlite3 as sql
DataBasePath="C:\\Users\\...\\database.sqlite"
conn=sql.connect(DataBasePath)
filePath="C:\\Users\\...\\filefolder\\"
datafiles=["file1","file2","file3",...]
for f in datafiles:
    df = pd.read_csv(filePath + f + ".csv")
    df.to_sql(name=f, con=conn, if_exists='append', index=False)
conn.close()
What's more, this code will create the database file if it doesn't exist. The if_exists argument of to_sql() is important: its default value, "fail", raises an error if the table already exists; "replace" drops the table first if it exists, then creates a new table and imports the data; "append" inserts the data into the existing table, or creates the table if it doesn't exist.
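If a single CSV is too large to read in one go, the same pattern also works in chunks. A minimal sketch of an alternative loop, reusing the paths and file list defined above (the 100,000-row chunk size is an arbitrary choice):

for f in datafiles:
    # read and append in chunks to keep memory bounded
    for chunk in pd.read_csv(filePath + f + ".csv", chunksize=100000):
        chunk.to_sql(name=f, con=conn, if_exists='append', index=False)
conn.close()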

python+MySQLdb, simple select is too slow compared to flat file access

I have one simple table with 80 000 rows.
I'm trying to select and save all rows into a Python list as fast as possible.
It takes around 4 - 10 seconds.
In contrast, if I dump the exact same table into a CSV file and process it with this code:
f = open('list.csv','rb')
lines = f.read().splitlines()
f.close()
print len(lines)
it takes only 0.08 - 0.3 seconds.
I tried MySQLdb and mysql.connector using fetchall() or fetchone()
import time
start = time.time()
import MySQLdb as mdb
con = mdb.connect('127.0.0.1', 'login', 'p', 'db');
with con:
    cur = con.cursor()
    cur.execute("SELECT * FROM table")
    rows = cur.fetchall()
    print len(rows)
print 'MySQLdb %s' % (time.time()-start)
That took 3.7 - 8 seconds with high CPU load.
Is it possible to achieve the same speed as with the CSV file?
EDIT
My MySQL server seems to be ok.
In mysql console:
SELECT * from TABLE;
....
80789 rows in set (0.21 sec)
The entire query result is stored on the client side as a list once cur.execute(...) is done; check the self._rows attribute in MySQLdb/cursors.py to see this.
That is to say, the time cost of reading a file's content differs from that of fetching query results from a MySQL database. As we all know, a built-in function is generally faster than a third-party one, so I don't think there is a way to make cursor.execute() as fast as open().
As for why open() is faster, I suggest you look into the Python source code.
Hope it helps.

Converting dbf to sqlite using Python is not populating table

I've struggled over this issue for over an hour now. I'm trying to create a Sqlite database using a dbf table. When I create a list of records derived from a dbf to be used as input for the Sqlite executemany statement, the Sqlite table comes out empty. When I try to replicate the issue using Python interactively, the Sqlite execution is successful. The list generated from the dbf is populated when I run it - so the problem lies in the executemany statement.
import sqlite3
from dbfpy import dbf
streets = dbf.Dbf("streets_sample.dbf")
conn = sqlite3.connect('navteq.db')
conn.execute('PRAGMA synchronous = OFF')
conn.execute('PRAGMA journal_mode = MEMORY')
conn.execute('DROP TABLE IF EXISTS STREETS')
conn.execute('''CREATE TABLE STREETS
                (blink_id CHAR(8) PRIMARY KEY,
                 bst_name VARCHAR(39),
                 bst_nm_pref CHAR(2));''')
alink_id = []
ast_name = []
ast_nm_pref = []
for i in streets:
    alink_id.append(i["LINK_ID"])
    ast_name.append(i["ST_NAME"])
    ast_nm_pref.append(i["ST_NM_PREF"])
streets_table = zip(alink_id, ast_name, ast_nm_pref)
conn.executemany("INSERT OR IGNORE INTO STREETS VALUES(?,?,?)", streets_table)
conn.close()
This may not be the only issue, but you want to call conn.commit() to save the changes to the SQLite database. Reference: http://www.python.org/dev/peps/pep-0249/#commit
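In the snippet from the question, that amounts to adding one line before closing the connection; a minimal sketch of how the end of the script could look:

conn.executemany("INSERT OR IGNORE INTO STREETS VALUES(?,?,?)", streets_table)
conn.commit()   # persist the inserted rows before closing
conn.close()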
