Retrieving Data from MySQL in batches via Python

I would like to run this process in batches because of the data volume.
Here's my code:
getconn = conexiones()
con = getconn.mysqlDWconnect()
with con:
    cur = con.cursor(mdb.cursors.DictCursor)
    cur.execute("SELECT id, date, product_id, sales FROM sales")
    rows = cur.fetchall()
How can I implement an index to fetch the data in batches?

First point: a Python db-api cursor is an iterator, so unless you really need to load a whole batch into memory at once, you can just start by using this feature. I.e., instead of:
cursor.execute("SELECT * FROM mytable")
rows = cursor.fetchall()
for row in rows:
    do_something_with(row)
you could just:
cursor.execute("SELECT * FROM mytable")
for row in cursor:
    do_something_with(row)
Then if your db connector's implementation still doesn't make proper use of this feature, it will be time to add LIMIT and OFFSET to the mix:
# py2 / py3 compat
try:
    # xrange is defined in py2 only
    xrange
except NameError:
    # py3 range is actually py2 xrange
    xrange = range

cursor.execute("SELECT count(*) FROM mytable")
count = cursor.fetchone()[0]

batch_size = 42  # whatever
for offset in xrange(0, count, batch_size):
    cursor.execute(
        "SELECT * FROM mytable LIMIT %s OFFSET %s",
        (batch_size, offset))
    for row in cursor:
        do_something_with(row)
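If you are using MySQLdb, another option (not from the original answer, so treat it as a sketch) is a server-side cursor combined with fetchmany(), which streams rows from the server instead of buffering the whole result set in client memory:
import MySQLdb as mdb
import MySQLdb.cursors

# connection details are placeholders for illustration
con = mdb.connect(host="localhost", user="user", passwd="pwd", db="dw")

# SSDictCursor keeps the result set on the server side
cur = con.cursor(mdb.cursors.SSDictCursor)
cur.execute("SELECT id, date, product_id, sales FROM sales")

while True:
    batch = cur.fetchmany(1000)  # pull 1000 rows per round trip
    if not batch:
        break
    for row in batch:
        do_something_with(row)  # same placeholder as above

cur.close()
con.close()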

You can use
SELECT id, date, product_id, sales FROM sales LIMIT X OFFSET Y;
where X is the size of the batch you need and Y is the current offset (for example, X times the number of iterations already completed).
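For example, with a batch size of 1000 the third iteration would run with OFFSET 2000. A minimal sketch of how this could be driven from Python (the cursor cur is the one from the question; the batch size is illustrative):
batch_size = 1000
offset = 0
while True:
    cur.execute(
        "SELECT id, date, product_id, sales FROM sales LIMIT %s OFFSET %s",
        (batch_size, offset))
    rows = cur.fetchall()
    if not rows:
        break
    # process this batch of rows here
    offset += batch_size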

To expand on akalikin's answer, you can use a stepped iteration to split the query into chunks, and then use LIMIT and OFFSET to execute the query.
cur = con.cursor(mdb.cursors.DictCursor)
cur.execute("SELECT COUNT(*) FROM sales")
count = cur.fetchone()[0]
for i in range(0, count, 5):
    cur2 = con.cursor(mdb.cursors.DictCursor)
    cur2.execute("SELECT id, date, product_id, sales FROM sales LIMIT %s OFFSET %s", (5, i))
    rows = cur2.fetchall()
    print rows

Thank you, here's how I implemented it with your suggestions:
control = True
index = 0
while control == True:
    getconn = conexiones()
    con = getconn.mysqlDWconnect()
    with con:
        cur = con.cursor(mdb.cursors.DictCursor)
        query = "SELECT id, date, product_id, sales FROM sales LIMIT 10 OFFSET " + str(10 * index)
        cur.execute(query)
        rows = cur.fetchall()
        index = index + 1
        if len(rows) == 0:
            control = False
        for row in rows:
            dataset.append(row)

Related

How to make a for loop to update all the rows of a column in sqlite with numbers

I want to make a for loop such that, upon deletion or addition of a row, the numbers in the column decrease or increase accordingly. Basically, I want a "Sl.no" column for my database which corrects itself when rows are added or deleted.
import sqlite3

conn = sqlite3.connect('time.db')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS time(id INTEGER PRIMARY KEY, sl real)''')
conn.commit()

for sl in range(100):
    cur.execute("INSERT INTO time VALUES(NULL, :sl)", {'sl': sl})
    conn.commit()

cur.execute("SELECT sl FROM time")
fulllist = cur.fetchall()
print(fulllist)
print("next" '\n')
print(len(fulllist))
print("next" '\n')

cmy = []
for dx in fulllist:
    for cx in dx:
        cmy.append(cx)
print(cmy)
print(str(len(cmy)) + '\n' + 'next' + '\n')

for slx in range(1, len(cmy) + 1):
    # note: the original used {'id': id}, which refers to the id() builtin;
    # using slx assumes the row ids run 1..N
    cur.execute('''UPDATE time SET sl=:sl WHERE id=:id''', {'sl': slx, 'id': slx})
    conn.commit()
    print(slx)

print("slno_setting" + '\n')
cur.execute("SELECT sl FROM time")
newlist = cur.fetchall()
print(newlist)

Batch downloading of table using cx_oracle

I need to download a large table from an Oracle database onto a Python server, using cx_Oracle to do so. However, RAM is limited on the Python server, so I need to do it in batches.
I already know how to fetch the whole table in one go:
usr = ''
pwd = ''
tns = '(Description = ...'
orcl = cx_Oracle.connect(usr, pwd, tns)
curs = orcl.cursor()
printHeader = True
tabletoget = 'BIGTABLE'
sql = "SELECT * FROM " + "SCHEMA." + tabletoget
curs.execute(sql)
data = pd.read_sql(sql, orcl)
data.to_csv(tabletoget + '.csv')
I'm not sure what to do, though, to load, say, a batch of 10000 rows at a time, save it off to a CSV, and then rejoin them.
You can use cx_Oracle directly to perform this sort of batch:
curs.arraysize = 10000
curs.execute(sql)
while True:
    rows = curs.fetchmany()
    if rows:
        write_to_csv(rows)
    if len(rows) < curs.arraysize:
        break
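The write_to_csv() helper used above is not defined in the answer; a minimal sketch using the standard csv module (the file name and append mode are assumptions) could look like this:
import csv

def write_to_csv(rows, filename="BIGTABLE.csv"):
    # append each fetched batch to the same CSV file
    with open(filename, "a", newline="") as f:
        csv.writer(f).writerows(rows)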
If you are using Oracle Database 12c or higher you can also use the OFFSET and FETCH NEXT ROWS options, like this:
offset = 0
numRowsInBatch = 10000
while True:
    curs.execute("select * from tabletoget offset :offset rows fetch next :nrows rows only",
                 offset=offset, nrows=numRowsInBatch)
    rows = curs.fetchall()
    if rows:
        write_to_csv(rows)
    if len(rows) < numRowsInBatch:
        break
    offset += len(rows)
This option isn't as efficient as the first one and involves giving the database more work to do but it may be better for you depending on your circumstances.
None of these examples use pandas directly. I am not particularly familiar with that package, but if you (or someone else) can adapt this appropriately, hopefully this will help!
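If you do want to stay with pandas, one option (a sketch, not part of the original answer) is the chunksize parameter of pandas.read_sql, which yields the result as an iterator of DataFrames:
import cx_Oracle
import pandas as pd

orcl = cx_Oracle.connect(usr, pwd, tns)  # connection details as in the question
sql = "SELECT * FROM SCHEMA.BIGTABLE"    # schema/table name assumed from the question

first = True
for chunk in pd.read_sql(sql, orcl, chunksize=10000):
    # write the header only for the first chunk, then append
    chunk.to_csv("BIGTABLE.csv", mode="w" if first else "a", header=first, index=False)
    first = False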
You can achieve your result like this. Here I am loading the data into a DataFrame.
import cx_Oracle
import time
import pandas

user = "test"
pw = "test"
dsn = "localhost:port/TEST"
con = cx_Oracle.connect(user, pw, dsn)
start = time.time()
cur = con.cursor()
cur.arraysize = 10000
try:
    cur.execute("select * from test_table")
    names = [x[0] for x in cur.description]
    rows = cur.fetchall()
    df = pandas.DataFrame(rows, columns=names)
    print(df.shape)
    print(df.head())
finally:
    if cur is not None:
        cur.close()
elapsed = (time.time() - start)
print(elapsed, "seconds")
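Note that this example still pulls the whole table into memory with fetchall(). Since the question is about limited RAM, a variation (a sketch, not from the original answer) is to fetch arraysize-sized chunks and append each one to the CSV:
import cx_Oracle
import pandas

con = cx_Oracle.connect("test", "test", "localhost:port/TEST")  # same credentials as above
cur = con.cursor()
cur.arraysize = 10000

cur.execute("select * from test_table")
names = [x[0] for x in cur.description]

first = True
while True:
    rows = cur.fetchmany()  # returns up to cur.arraysize rows per call
    if not rows:
        break
    df = pandas.DataFrame(rows, columns=names)
    df.to_csv("test_table.csv", mode="w" if first else "a", header=first, index=False)
    first = False

cur.close()
con.close()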

How to change the cursor to the next row using pyodbc in Python

I am trying to fetch records at a regular interval from a database table that keeps growing. I am using Python and its pyodbc package to carry out the fetching. While fetching, how can I point the cursor to the row after the one that was read/fetched last, so that with every fetch I only get the newly inserted records?
To explain more,
my table has 100 records and they are fetched.
After an interval the table has 200 records and I want to fetch rows 101 to 200, and so on.
Is there a way to do this with a pyodbc cursor?
Any other suggestion would also be very helpful.
Below is the code I am trying:
#!/usr/bin/python
import pyodbc
import csv
import time

conn_str = (
    "DRIVER={PostgreSQL Unicode};"
    "DATABASE=postgres;"
    "UID=userid;"
    "PWD=database;"
    "SERVER=localhost;"
    "PORT=5432;"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

def fetch_table(**kwargs):
    qry = kwargs['qrystr']
    try:
        #cursor = conn.cursor()
        cursor.execute(qry)
        all_rows = cursor.fetchall()
        rowcnt = cursor.rowcount
        rownum = cursor.description
        #return (rowcnt, rownum)
        return all_rows
    except pyodbc.ProgrammingError as e:
        print("Exception occurred as:", type(e), e)

def poll_db():
    for i in [1, 2]:
        stmt = "select * from my_database_table"
        rows = fetch_table(qrystr=stmt)
        print("***** For i = ", i, " ******")
        for r in rows:
            print("ROW-> ", r)
        time.sleep(10)

poll_db()
conn.close()
I don't think you can use pyodbc, or any other ODBC package, to find the "new" rows. But if there is a timestamp column in your table, or if you can add one (some databases can populate it automatically with the insertion time, so you don't have to change your insert queries), then you can change the query to select only the rows whose timestamp is greater than the previous timestamp, and keep updating the prev_timestamp variable on each iteration.
import datetime  # needed for datetime.datetime.now() below

def poll_db():
    prev_timestamp = ""
    for i in [1, 2]:
        if prev_timestamp == "":
            stmt = "select * from my_database_table"
        else:
            # convert your timestamp str to match the database's format
            stmt = "select * from my_database_table where timestamp > " + str(prev_timestamp)
        rows = fetch_table(qrystr=stmt)
        prev_timestamp = datetime.datetime.now()
        print("***** For i = ", i, " ******")
        for r in rows:
            print("ROW-> ", r)
        time.sleep(10)
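Building the timestamp into the SQL string as above means you also have to quote and format it the way the database expects; a safer variant (a sketch, not part of the original answer) passes it as a pyodbc query parameter, reusing the module-level cursor from the question:
import datetime
import time

def poll_db_parametrized():
    prev_timestamp = None
    for i in [1, 2]:
        if prev_timestamp is None:
            cursor.execute("select * from my_database_table")
        else:
            # the driver handles quoting/formatting of the timestamp parameter
            cursor.execute("select * from my_database_table where timestamp > ?",
                           prev_timestamp)
        rows = cursor.fetchall()
        prev_timestamp = datetime.datetime.now()
        for r in rows:
            print("ROW-> ", r)
        time.sleep(10)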

Select records incrementally in MySQL and save to csv in Python

I need to query the database for some data analysis, and I have more than 20 million records. I have limited access to the DB and my query times out after 8 minutes. So I am trying to break up the query into smaller portions and save the results to Excel for processing later.
This is what I have so far. How can I get Python to loop the query over every x records (e.g. 1,000,000) and store them in the same CSV until all (20M+) records are covered?
import MySQLdb
import csv

db_main = MySQLdb.connect(host="localhost",
                          port=1234,
                          user="user1",
                          passwd="test123",
                          db="mainDB")
cur = db_main.cursor()
cur.execute("""SELECT a.user_id, b.last_name, b.first_name,
    FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date) / 365) age,
    DATEDIFF(b.left_date, b.join_date) workDays
    FROM users a
    INNER JOIN users_signup b ON a.user_id = b.user_id
    INNER JOIN users_personal c ON a.user_id = c.user_id
    INNER JOIN
    (
        SELECT DISTINCT d.user_id FROM users_signup d
        WHERE (d.user_id >= 1 AND d.user_id < 1000000)
        AND d.join_date >= '2013-01-01' AND d.join_date < '2014-01-01'
    ) AS t ON a.user_id = t.user_id""")
result = cur.fetchall()

c = csv.writer(open("temp.csv", "wb"))
for row in result:
    c.writerow(row)
Your code should look like the one below. You can tune its performance via the per_query variable.
c = csv.writer(open("temp.csv", "wb"))
offset = 0
per_query = 10000
while True:
    cur.execute("__the_query__ LIMIT %s OFFSET %s", (per_query, offset))
    rows = cur.fetchall()
    if len(rows) == 0:
        break  # escape the loop at the end of the data
    for row in rows:
        c.writerow(row)
    offset += per_query
Untested code but this should get you started...
SQL = """
SELECT a.user_id, b.last_name, b.first_name,
FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date) / 365) age,
DATEDIFF(b.left_date, b.join_date) workDays
FROM users a
INNER JOIN users_signup b ON a.user_id a = b.user_id
INNER JOIN users_personal c ON a.user_id a = c.user_id
INNER JOIN
(
SELECT distinct d.a.user_id FROM users_signup d
WHERE (user_id >=1 AND user_id <1000000)
AND d.join_date >= '2013-01-01' and d.join_date < '2014-01-01'
)
AS t ON a.user_id = t.user_id
OFFSET %s LIMIT %s
"""
BATCH_SIZE = 100000
with open("temp.csv","wb") as f:
writer = csv.writer(f)
cursor = db_main.cursor()
offset = 0
limit = BATCH_SIZE
while True:
cursor.execute(SQL, (offset, limit))
for row in cursor:
writer.writerow(row)
else:
# no more rows, we're done
break
offset += BATCH_SIZE
cursor.close()
Here is an example implementation that might help you:
from contextlib import contextmanager
import MySQLdb
import csv

connection_args = {"host": "localhost", "port": 1234, "user": "user1", "passwd": "test123", "db": "mainDB"}

@contextmanager
def get_cursor(**kwargs):
    ''' The context manager allows the cursor to be
    closed automatically.
    '''
    db = MySQLdb.connect(**kwargs)
    cursor = db.cursor()
    try:
        yield cursor
    finally:
        cursor.close()

# note the placeholders for the limits
query = """ SELECT a.user_id, b.last_name, b.first_name,
    FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date) / 365) age,
    DATEDIFF(b.left_date, b.join_date) workDays
    FROM users a
    INNER JOIN users_signup b ON a.user_id = b.user_id
    INNER JOIN users_personal c ON a.user_id = c.user_id
    INNER JOIN
    (
        SELECT DISTINCT d.user_id FROM users_signup d
        WHERE (d.user_id >= 1 AND d.user_id < 1000000)
        AND d.join_date >= '2013-01-01' AND d.join_date < '2014-01-01'
    ) AS t ON a.user_id = t.user_id LIMIT %s OFFSET %s """

csv_file = csv.writer(open("temp.csv", "wb"))

# One million at a time
STEP = 1000000
for step_nb in xrange(0, 20):
    with get_cursor(**connection_args) as cursor:
        cursor.execute(query, (STEP, step_nb * STEP))  # query the DB in slices of STEP rows
        for row in cursor:  # use the cursor instead of fetching everything in memory
            csv_file.writerow(row)
Edited: I had misunderstood what the batch was keyed on (it was on user_id).

Can I store a cursor.fetchone() in a variable

Hello guys, I just jumped into Python and I'm having a hard time figuring this out.
I have 2 queries, query1 and query2. Now, how can I tell
row = cursor.fetchone() that I am referring to query1 and not query2?
cursor = conn.cursor()
query1 = cursor.execute("select * FROM spam")
query2 = cursor.execute("select * FROM eggs")
row = cursor.fetchone ()
thanks guys
Once you perform the second query, the results from the first are gone. (The return value of execute isn't useful.) The correct way to work with two queries simultaneously is to have two cursors:
cursor1 = conn.cursor()
cursor2 = conn.cursor()
cursor1.execute("select * FROM spam")
cursor2.execute("select * FROM eggs")
cursor1.fetchone() #first result from query 1
cursor2.fetchone() #first result from query 2
It doesn't work that way; the return value from cursor.execute is meaningless. Per PEP 249:
.execute(operation[,parameters])
Prepare and execute a database operation (query or
command)...
[...]
Return values are not defined.
You can't do it the way you're trying to. Do something like this instead:
cursor = conn.cursor()
cursor.execute("select * FROM spam")
results1 = cursor.fetchall()
cursor.execute("select * FROM eggs")
if results1 is not None and len(results1) > 0:
    print "First row from query1: ", results1[0]
row = cursor.fetchone()
if row is not None:
    print "First row from query2: ", row
