python + MySQLdb: simple SELECT is too slow compared to flat file access

I have one simple table with 80,000 rows.
I'm trying to select and save all rows to a Python list as fast as possible.
It takes around 4-10 seconds.
In contrast, if I dump the exact same table into a CSV file and process it with this code:
f = open('list.csv','rb')
lines = f.read().splitlines()
f.close()
print len(lines)
it takes only 0.08-0.3 seconds.
I tried both MySQLdb and mysql.connector, using fetchall() or fetchone():
import time
start = time.time()
import MySQLdb as mdb
con = mdb.connect('127.0.0.1', 'login', 'p', 'db')
with con:
    cur = con.cursor()
    cur.execute("SELECT * FROM table")
    rows = cur.fetchall()
    print len(rows)
print 'MySQLdb %s' % (time.time()-start)
This took 3.7-8 seconds with high CPU load.
Is it possible to achieve the same speed as with the CSV file?
EDIT
My MySQL server seems to be ok.
In mysql console:
SELECT * from TABLE;
....
80789 rows in set (0.21 sec)

The entire query result is stored on the client side as a list once cur.execute(...) finishes; see the self._rows attribute in MySQLdb/cursor.py.
That is to say, the cost of reading a file and the cost of fetching query results from a MySQL database are very different things. As we all know, a built-in function will always be faster than a third-party one, so I don't think there is a way to make cursor.execute() as fast as open().
As for why open() is so fast, I suggest you look into the Python source code.
Hope it helps.
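To see that point concretely, here is a rough timing sketch (reusing the placeholder credentials and table name from the question) that separates the cost of execute() from fetchall():
import time
import MySQLdb as mdb

con = mdb.connect('127.0.0.1', 'login', 'p', 'db')  # placeholder credentials from the question
cur = con.cursor()

t0 = time.time()
cur.execute("SELECT * FROM table")  # default (buffered) cursor: the rows are transferred here
t1 = time.time()
rows = cur.fetchall()               # result is already client-side, so this part should be cheap
t2 = time.time()

print('execute: %.2fs, fetchall: %.2fs' % (t1 - t0, t2 - t1))
con.close()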

Related

Exporting MySQL tables to CSV through a Python script?

There are many ways to export MySQL tables to CSV using Python, and I want to know the best way of doing it.
Currently I am exporting around 50 tables to CSV, which takes around 47 minutes and more than 16 GB of memory.
The code is:
import os
import urllib.parse
import pandas as pd
from sqlalchemy import create_engine
from tqdm import tqdm

sqlEngine = create_engine(
    f'mysql+pymysql://{MYSQL_READER_USERNAME}:{urllib.parse.quote(MYSQL_READER_PASSWORD)}'
    f'@{MYSQL_READER_HOST}/{MYSQL_DB_NAME}', pool_recycle=3600)

def export_table(name, download_location):
    # read the whole table into a DataFrame, then dump it to CSV
    table = pd.read_sql(f'select /*+ MAX_EXECUTION_TIME(100000000) */ * from {name}', sqlEngine)
    table.to_csv(os.path.join(download_location, name + '.csv'), index=False)

tables = ['table1', ... , 'table50']
for table in tqdm(tables):
    print(f'\t => \t Storing {table}')
    export_table(table, store_dir)
I have seen several methods for writing to CSV, such as:
using a cursor directly
using the pyodbc library
the pandas read_sql method
Is there any other method or technique, and which one is best for reducing
memory use or execution time?
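Not a full answer, but one option that may reduce peak memory (though not necessarily total time) is to stream each table in chunks instead of loading it whole. A minimal sketch, assuming the same sqlEngine, table list and store_dir as above; the chunk size is an arbitrary example:
import os
import pandas as pd

def export_table_chunked(name, download_location, chunksize=100000):
    # with chunksize set, read_sql yields DataFrames of at most `chunksize` rows,
    # so only one chunk of the table is held in memory at a time
    out_path = os.path.join(download_location, name + '.csv')
    first = True
    for chunk in pd.read_sql(f'select * from {name}', sqlEngine, chunksize=chunksize):
        chunk.to_csv(out_path, mode='w' if first else 'a', header=first, index=False)
        first = False
Depending on the driver, the raw result set may still be buffered on the client side, so the saving applies mainly to the DataFrame part; it is worth measuring against one large table first.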

How to increase efficiency of inserting data into PostGIS with Python?

I need to insert 46 million points into a PostGIS database in a decent time. Inserting 14 million points took around 40 minutes, which is awful and inefficient.
I created the database with a spatial GiST index and wrote this code:
import psycopg2
import time

start = time.time()
conn = psycopg2.connect(host='localhost', port='5432', dbname='test2', user='postgres', password='alfabet1')
filepath = "C:\\Users\\nmt1m.csv"
curs = conn.cursor()
with open(filepath, 'r') as text:
    for i in text:
        i = i.replace("\n", "")
        i = i.split(sep=" ")
        curs.execute(f"INSERT INTO nmt_1 (geom, Z) VALUES (ST_GeomFromText('POINTZ({i[0]} {i[1]} {i[2]})',0), {i[2]});")
        conn.commit()
end = time.time()
print(end - start)
curs.close()
conn.close()
I'm looking for the best way to insert the data; it doesn't have to be in Python.
Thanks ;)
Hi! Welcome to SO.
There are a few things you can do to speed up your bulk insert:
If the target table is empty or is not being used in a production system, consider dropping the indexes right before inserting the data and recreating them after the insert is complete. This avoids having PostgreSQL update the index after every insert, which in your case means 46 million times.
If the target table can be entirely built from your CSV file, consider creating an UNLOGGED TABLE. Unlogged tables are much faster than "normal" tables, since they (as the name suggests) are not written to the WAL (write-ahead log). Unlogged tables may be lost in case of a database crash or an unclean shutdown!
Use either the PostgreSQL COPY command or copy_from, as @MauriceMeyer pointed out. If for some reason you must stick to inserts, make sure you're not committing after every insert ;-)
Cheers
Thanks, Jim, for the help. Following your instructions, a better way to insert the data is:
import psycopg2
import time

start = time.time()
conn = psycopg2.connect(host='localhost', port='5432', dbname='test2',
                        user='postgres', password='alfabet1')
curs = conn.cursor()
filepath = "C:\\Users\\Jakub\\PycharmProjects\\test2\\testownik9_NMT\\nmt1m.csv"
curs.execute("CREATE UNLOGGED TABLE nmt_10 (id_1 FLOAT, id_2 FLOAT, id_3 FLOAT);")
with open(filepath, 'r') as text:
    curs.copy_from(text, 'nmt_10', sep=" ")
curs.execute("SELECT AddGeometryColumn('nmt_10', 'geom', 2180, 'POINTZ', 3);")
curs.execute("CREATE INDEX nmt_10_index ON nmt_10 USING GIST (geom);")
curs.execute("UPDATE nmt_10 SET geom = ST_SetSRID(ST_MakePoint(id_1, id_2, id_3), 2180);")
conn.commit()
end = time.time()
print(end - start)
cheers

exponential in SQLite3 in python

I'm writing a Python script that creates a SQLite database and does some calculations on massive tables. To begin with, the reason I'm doing it in SQLite through Python is memory: my data is huge and would hit a memory error if run in, say, pandas, and if chunked it would take ages, mostly because pandas is slow with merges, group-bys, etc.
My issue is that at some point I want to calculate the exponential of one column in a table (sample code below), but it seems that SQLite doesn't have an EXP function.
I could write the data to a DataFrame and then use NumPy to calculate the exponential, but that defeats the whole point that pushed me towards databases: not having the additional time of reading/writing back and forth between the DB and Python.
So my question is this: is there a way to calculate the exponential within the database? I've read that I can create the function within sqlite3 in Python, but I have no idea how. If you know how, or can direct me to where I can find relevant info, I would be thankful. Thanks.
A sample of my code where I'm trying to do the calculation. Note that here I'm just providing a sample where the table comes directly from a CSV, but in my process it's actually created within the DB after lots of merges and group-bys:
import sqlite3
import pandas as pd

# set path and file names (dataBaseName is defined elsewhere in the full script)
folderPath = 'C:\\SCP\\'
inputDemandFile = 'demandFile.csv'

# set connection to database
conn = sqlite3.connect(folderPath + dataBaseName)
cur = conn.cursor()

# read demand file into db
inputDemand = pd.read_csv(folderPath + inputDemandFile)
inputDemand.to_sql('inputDemand', conn, if_exists='replace', index=False)

# create new table and calculate EXP
cur.execute('CREATE TABLE demand_exp AS SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand FROM inputDemand;')
I've read that I can create the function within sqlite3 in Python, but I have no idea how.
That's conn.create_function():
https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function
>>> import math
>>> conn.create_function('EXP', 1, math.exp)
>>> cur.execute('select EXP(1)')
>>> cur.fetchone()
(2.718281828459045,)
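Applied to the question's code, that would look roughly like this (a sketch assuming the same conn, cur and inputDemand table as in the sample above):
import math

# register a 1-argument SQL function named EXP on this connection
conn.create_function('EXP', 1, math.exp)
cur.execute('CREATE TABLE demand_exp AS '
            'SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand '
            'FROM inputDemand;')
conn.commit()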

How to extract all rows from a large postgres table using python efficiently?

I have been able to extract close to 3.5 million rows from a Postgres table using Python and write them to a file. However, the process is extremely slow and I'm sure it is not the most efficient.
Following is my code:
import psycopg2, time, csv

conn_string = "host='compute-1.amazonaws.com' dbname='re' user='data' password='reck' port=5433"
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
query = '''select data from table;'''
cursor.execute(query)

def get_data():
    while True:
        recs = cursor.fetchmany(10000)
        if not recs:
            break
        for columns in recs:
            # do transformation of data here
            yield(columns)

solr_input = get_data()
count = 0
with open('prc_ind.csv', 'a') as fh:
    for i in solr_input:
        count += 1
        if count % 1000 == 0:
            print(count)
        a, b, c, d = i['Skills'], i['Id'], i['History'], i['Industry']
        fh.write("{0}|{1}|{2}|{3}\n".format(a, b, c, d))
The table has about 8 million rows. I want to ask if there is a better, faster and less memory-intensive way to accomplish this.
I can see four fields, so I'll assume you are selecting only these.
But even then, you are still loading 8 million × 4 × n bytes of data from what seems to be another server. So yes, it'll take some time.
Rather than reinventing the wheel, why not use the PostgreSQL client?
psql -d dbname -t -A -F"," -c "select * from users" > output.csv
Psycopg2's copy_to command does the exact same thing as a psql dump, as Loïc suggested, except it's on the Python side of things. I've found this to be the fastest way to get a table dump.
The formatting for certain data types (such as hstore/json and composite types) is a bit funky, but the command is very simple.
f = open('foobar.dat', 'wb')
cursor.copy_to(f, 'table', sep='|', columns=['skills', 'id', 'history', 'industry'])
Docs here: http://initd.org/psycopg/docs/cursor.html#cursor.copy_to
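If the formatting quirks mentioned above are a problem, one possible workaround (a sketch, not tested against this schema, reusing the column names from the copy_to example) is copy_expert with an explicit CSV COPY statement, so the server handles quoting and escaping:
# COPY ... TO STDOUT WITH CSV makes the server produce proper CSV quoting;
# 'table' is the placeholder table name from the question
with open('prc_ind.csv', 'w') as fh:
    cursor.copy_expert(
        "COPY (SELECT skills, id, history, industry FROM table) "
        "TO STDOUT WITH CSV HEADER",
        fh)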

python script hangs when calling cursor.fetchall() with large data set

I have a query that returns over 125K rows.
The goal is to write a script that iterates through the rows and, for each one, populates a second table with data processed from the result of the query.
To develop the script, I created a duplicate database with a small subset of the data (4,126 rows).
On the small database, the following code works:
import os
import sys
import random
import mysql.connector

cnx = mysql.connector.connect(user='dbuser', password='thePassword',
                              host='127.0.0.1',
                              database='db')
cnx_out = mysql.connector.connect(user='dbuser', password='thePassword',
                                  host='127.0.0.1',
                                  database='db')
ins_curs = cnx_out.cursor()
curs = cnx.cursor(dictionary=True)
#curs = cnx.cursor(dictionary=True,buffered=True) #fail

with open('sql\\getRawData.sql') as fh:
    sql = fh.read()

curs.execute(sql, params=None, multi=False)
result = curs.fetchall()  #<=== script stops at this point
print len(result)  #<=== this line never executes
print curs.column_names

curs.close()
cnx.close()
cnx_out.close()
sys.exit()
The line curs.execute(sql, params=None, multi=False) succeeds on both the large and small databases.
If I use curs.fetchone() in a loop, I can read all records.
If I alter the line:
curs = cnx.cursor(dictionary=True)
to read:
curs = cnx.cursor(dictionary=True,buffered=True)
The script hangs at curs.execute(sql, params=None, multi=False).
I can find no documentation on any limits to fetchall(), nor can I find any way to increase the buffer size, and no way to tell how large a buffer I even need.
There are no exceptions raised.
How can I resolve this?
I was having this same issue, first on a query that returned ~70k rows and then on one that only returned around 2k rows (and for me RAM was also not the limiting factor). I switched from using mysql.connector (i.e. the mysql-connector-python package) to MySQLdb (i.e. the mysql-python package) and then was able to fetchall() on large queries with no problem. Both packages seem to follow the python DB API, so for me MySQLdb was a drop-in replacement for mysql.connector, with no code changes necessary beyond the line that sets up the connection. YMMV if you're leveraging something specific about mysql.connector.
Pragmatically speaking, if you don't have a specific reason to be using mysql.connector the solution to this is just to switch to a package that works better!
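For reference, the swap described above amounts to changing only the connection setup; a sketch using the placeholder credentials from the question:
import MySQLdb
import MySQLdb.cursors

# was: cnx = mysql.connector.connect(user='dbuser', password='thePassword',
#                                    host='127.0.0.1', database='db')
cnx = MySQLdb.connect(user='dbuser', passwd='thePassword',
                      host='127.0.0.1', db='db')

# rough equivalent of dictionary=True: ask for a DictCursor
curs = cnx.cursor(MySQLdb.cursors.DictCursor)
The keyword names differ slightly (passwd/db instead of password/database), but the execute() and fetchall() calls stay the same.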
