I need to download a large table from an oracle database into a python server, using cx_oracle to do so. However, the ram is limited on the python server and so I need to do it in a batch way.
I know already how to do generally a whole table
usr = ''
pwd = ''
tns = '(Description = ...'
orcl = cx_Oracle.connect(user, pwd, tns)
curs = orcl.cursor()
printHeader=True
tabletoget = 'BIGTABLE'
sql = "SELECT * FROM " + "SCHEMA." + tabletoget
curs.execute(sql)
data = pd.read_sql(sql, orcl)
data.to_csv(tabletoget + '.csv'
I'm not sure what to do though to load say a batch of 10000 rows at a time and then save it off to a csv and then rejoin.
You can use cx_Oracle directly to perform this sort of batch:
curs.arraysize = 10000
curs.execute(sql)
while True:
rows = cursor.fetchmany()
if rows:
write_to_csv(rows)
if len(rows) < curs.arraysize:
break
If you are using Oracle Database 12c or higher you can also use the OFFSET and FETCH NEXT ROWS options, like this:
offset = 0
numRowsInBatch = 10000
while True:
curs.execute("select * from tabletoget offset :offset fetch next :nrows only",
offset=offset, nrows=numRowsInBatch)
rows = curs.fetchall()
if rows:
write_to_csv(rows)
if len(rows) < numRowsInBatch:
break
offset += len(rows)
This option isn't as efficient as the first one and involves giving the database more work to do but it may be better for you depending on your circumstances.
None of these examples use pandas directly. I am not particularly familiar with that package, but if you (or someone else) can adapt this appropriately, hopefully this will help!
You can achieve your result like this. Here I am loading data to df.
import cx_Oracle
import time
import pandas
user = "test"
pw = "test"
dsn="localhost:port/TEST"
con = cx_Oracle.connect(user,pw,dsn)
start = time.time()
cur = con.cursor()
cur.arraysize = 10000
try:
cur.execute( "select * from test_table" )
names = [ x[0] for x in cur.description]
rows = cur.fetchall()
df=pandas.DataFrame( rows, columns=names)
print(df.shape)
print(df.head())
finally:
if cur is not None:
cur.close()
elapsed = (time.time() - start)
print(elapsed, "seconds")
Related
I would like to read a mysql database in chunks and write its contents to a bunch of csv files.
While this can be done easily with pandas using below:
df_chunks = pd.read_sql_table(table_name, con, chunksize=CHUNK_SIZE)
for i, df in enumerate(chunks):
df.to_csv("file_{}.csv".format(i)
Assuming I cannot use pandas, what other alternative can I use? I tried using
import sqlalchemy as sqldb
import csv
CHUNK_SIZE = 100000
table_name = "XXXXX"
host = "XXXXXX"
user = "XXXX"
password = "XXXXX"
database = "XXXXX"
port = "XXXX"
engine = sqldb.create_engine('mysql+pymysql://{}:{}#{}:{}/{}'.format(user,password,host,port,database))
con = engine.connect()
metadata = sqldb.MetaData()
table = sqldb.Table(table_name, metadata, autoload=True, autoload_with=engine)
query = table.select()
proxy = con.execution_options(stream_results=True).execute(query)
cols = [""] + [column.name for column in table.c]
file_num = 0
while True:
batch = proxy.fetchmany(CHUNK_SIZE)
if not batch:
break
csv_writer = csv.writer("file_{}.csv".format(file_num), delimiter=',')
csv_writer.writerow(cols)
#csv_writer.writerows(batch) # while this work, it does not have the index similar to df.to_csv()
for i, row in enumerate(batch):
csv_writer.writerow(i + row) # will error here
file_num += 1
proxy.close()
While using .writerows(batch) works fine, it does not have the index like the result you get from df.to_csv(). I would like to add the row number equivalent as well, but cant seem to add to the row which is a sqlalchemy.engine.result.RowProxy. How can I do it? Or what other faster alternative can I use?
Look up SELECT ... INTO OUTFILE ...
It will do the task in 1 SQL statement; 0 lines of Python (other than invoking that SQL).
I am trying to fetch records after a regular interval from a database table which growing with records. I am using Python and its pyodbc package to carry out the fetching of records. While fetching, how can I point the cursor to the next row of the row which was read/fetched last so that with every fetch I can only get the new set of records inserted.
To explain more,
my table has 100 records and they are fetched.
after an interval the table has 200 records and I want to fetch rows from 101 to 200. And so on.
Is there a way with pyodbc cursor?
Or any other suggestion would be very helpful.
Below is the code I am trying:
#!/usr/bin/python
import pyodbc
import csv
import time
conn_str = (
"DRIVER={PostgreSQL Unicode};"
"DATABASE=postgres;"
"UID=userid;"
"PWD=database;"
"SERVER=localhost;"
"PORT=5432;"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
def fetch_table(**kwargs):
qry = kwargs['qrystr']
try:
#cursor = conn.cursor()
cursor.execute(qry)
all_rows = cursor.fetchall()
rowcnt = cursor.rowcount
rownum = cursor.description
#return (rowcnt, rownum)
return all_rows
except pyodbc.ProgrammingError as e:
print ("Exception occured as :", type(e) , e)
def poll_db():
for i in [1, 2]:
stmt = "select * from my_database_table"
rows = fetch_table(qrystr = stmt)
print("***** For i = " , i , "******")
for r in rows:
print("ROW-> ", r)
time.sleep(10)
poll_db()
conn.close()
I don't think you can use pyodbc, or any other odbc package, to find "new" rows. But if there is a 'timestamp' column in your database, or if you can add such a column (some databases allow for it to be automatically populated as the time of insertion so you don't have to change the insert queries) then you can change your query to select only the rows whose timestamp is greater than the previous timestamp. And you can keep changing the prev_timestamp variable on each iteration.
def poll_db():
prev_timestamp = ""
for i in [1, 2]:
if prev_timestamp == "":
stmt = "select * from my_database_table"
else:
# convert your timestamp str to match the database's format
stmt = "select * from my_database_table where timestamp > " + str(prev_timestamp)
rows = fetch_table(qrystr = stmt)
prev_timestamp = datetime.datetime.now()
print("***** For i = " , i , "******")
for r in rows:
print("ROW-> ", r)
time.sleep(10)
I have 74 relatively large Pandas DataFrames (About 34,600 rows and 8 columns) that I am trying to insert into a SQL Server database as quickly as possible. After doing some research, I learned that the good ole pandas.to_sql function is not good for such large inserts into a SQL Server database, which was the initial approach that I took (very slow - almost an hour for the application to complete vs about 4 minutes when using mysql database.)
This article, and many other StackOverflow posts have been helpful in pointing me in the right direction, however I've hit a roadblock:
I am trying to use SQLAlchemy's Core rather than the ORM for reasons explained in the link above. So, I am converting the dataframe to a dictionary, using pandas.to_dict and then doing an execute() and insert():
self._session_factory.engine.execute(
TimeSeriesResultValues.__table__.insert(),
data)
# 'data' is a list of dictionaries.
The problem is that insert is not getting any values -- they appear as a bunch of empty parenthesis and I get this error:
(pyodbc.IntegretyError) ('23000', "[23000] [FreeTDS][SQL Server]Cannot
insert the value NULL into the column...
There are values in the list of dictionaries that I passed in, so I can't figure out why the values aren't showing up.
EDIT:
Here's the example I'm going off of:
def test_sqlalchemy_core(n=100000):
init_sqlalchemy()
t0 = time.time()
engine.execute(
Customer.__table__.insert(),
[{"name": 'NAME ' + str(i)} for i in range(n)]
)
print("SQLAlchemy Core: Total time for " + str(n) +
" records " + str(time.time() - t0) + " secs")
I've got some sad news for you, SQLAlchemy actually doesn't implement bulk imports for SQL Server, it's actually just going to do the same slow individual INSERT statements that to_sql is doing. I would say that your best bet is to try and script something up using the bcp command line tool. Here is a script that I've used in the past, but no guarantees:
from subprocess import check_output, call
import pandas as pd
import numpy as np
import os
pad = 0.1
tablename = 'sandbox.max.pybcp_test'
overwrite=True
raise_exception = True
server = 'P01'
trusted_connection= True
username=None
password=None
delimiter='|'
df = pd.read_csv('D:/inputdata.csv', encoding='latin', error_bad_lines=False)
def get_column_def_sql(col):
if col.dtype == object:
width = col.str.len().max() * (1+pad)
return '[{}] varchar({})'.format(col.name, int(width))
elif np.issubdtype(col.dtype, float):
return'[{}] float'.format(col.name)
elif np.issubdtype(col.dtype, int):
return '[{}] int'.format(col.name)
else:
if raise_exception:
raise NotImplementedError('data type {} not implemented'.format(col.dtype))
else:
print('Warning: cast column {} as varchar; data type {} not implemented'.format(col, col.dtype))
width = col.str.len().max() * (1+pad)
return '[{}] varchar({})'.format(col.name, int(width))
def create_table(df, tablename, server, trusted_connection, username, password, pad):
if trusted_connection:
login_string = '-E'
else:
login_string = '-U {} -P {}'.format(username, password)
col_defs = []
for col in df:
col_defs += [get_column_def_sql(df[col])]
query_string = 'CREATE TABLE {}\n({})\nGO\nQUIT'.format(tablename, ',\n'.join(col_defs))
if overwrite == True:
query_string = "IF OBJECT_ID('{}', 'U') IS NOT NULL DROP TABLE {};".format(tablename, tablename) + query_string
query_file = 'c:\\pybcp_tempqueryfile.sql'
with open (query_file,'w') as f:
f.write(query_string)
if trusted_connection:
login_string = '-E'
else:
login_string = '-U {} -P {}'.format(username, password)
o = call('sqlcmd -S {} {} -i {}'.format(server, login_string, query_file), shell=True)
if o != 0:
raise BaseException("Failed to create table")
# o = call('del {}'.format(query_file), shell=True)
def call_bcp(df, tablename):
if trusted_connection:
login_string = '-T'
else:
login_string = '-U {} -P {}'.format(username, password)
temp_file = 'c:\\pybcp_tempqueryfile.csv'
#remove the delimiter and change the encoding of the data frame to latin so sql server can read it
df.loc[:,df.dtypes == object] = df.loc[:,df.dtypes == object].apply(lambda col: col.str.replace(delimiter,'').str.encode('latin'))
df.to_csv(temp_file, index = False, sep = '|', errors='ignore')
o = call('bcp sandbox.max.pybcp_test2 in c:\pybcp_tempqueryfile.csv -S "localhost" -T -t^| -r\n -c')
This just recently been updated as of SQLAchemy ver: 1.3.0 just in case anyone else needs to know. Should make your dataframe.to_sql statement much faster.
https://docs.sqlalchemy.org/en/latest/changelog/migration_13.html#support-for-pyodbc-fast-executemany
engine = create_engine(
"mssql+pyodbc://scott:tiger#mssql2017:1433/test?driver=ODBC+Driver+13+for+SQL+Server",
fast_executemany=True)
I have this code:
dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '#' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
print rows
template = rows[0][0]
orcl.close()
print template.read()
When I do print rows, I get this:
[(<cx_Oracle.LOB object at 0x0000000001D49990>,)]
However, when I do print template.read(), I get this error:
cx_Oracle.DatabaseError: Invalid handle!
Do how do I get and read this data? Thanks.
I've found out that this happens in case when connection to Oracle is closed before the cx_Oracle.LOB.read() method is used.
orcl = cx_Oracle.connect(usrpass+'#'+dbase)
c = orcl.cursor()
c.execute(sq)
dane = c.fetchall()
orcl.close() # before reading LOB to str
wkt = dane[0][0].read()
And I get: DatabaseError: Invalid handle!
But the following code works:
orcl = cx_Oracle.connect(usrpass+'#'+dbase)
c = orcl.cursor()
c.execute(sq)
dane = c.fetchall()
wkt = dane[0][0].read()
orcl.close() # after reading LOB to str
Figured it out. I have to do something like this:
curs.execute(sql)
for row in curs:
print row[0].read()
You basically have to loop through the fetchall object
dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '#' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
for x in rows:
list_ = list(x)
print(list_)
There should be an extra comma in the for loop, see in below code, i have supplied an extra comma after x in for loop.
dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '#' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
for x, in rows:
print(x)
I had the same problem with in a slightly different context. I needed to query a +27000 rows table and it turns out that cx_Oracle cuts the connection to the DB after a while.
While a connection to the db is open, you can use the read() method of the cx_Oracle.Lob object to transform it into a string. But if the query brings a table that is too big, it won´t work because the connection will stop at some point and when you want to read the results from the query you´ll gt an error on the cx_Oracle objects.
I tried many things, like setting
connection.callTimeout = 0 (according to documentation, this means it would wait indefinetly), using fetchall() and then putting the results on a dataframe or numpy array but I could never read the cx_Oracle.Lob objects.
If I try to run the query using pandas.DataFrame.read_sql(query, connection) The dataframe would contain cx_Oracle.Lob objects with the connection closed, making them useless. (Again this only happens if the table is very big)
In the end I found a way of getting around this by querying and creating a csv file inmediatlely after, even though I know it´s not ideal.
def csv_from_sql(sql: str, path: str="dataframe.csv") -> bool:
try:
with cx_Oracle.connect(config.username, config.password, config.database, encoding=config.encoding) as connection:
connection.callTimeout = 0
data = pd.read_sql(sql, con=connection)
data.to_csv(path)
print("FILE CREATED")
except cx_Oracle.Error as error:
print(error)
return False
finally:
print("PROCESS ENDED\n")
return True
def make_query(sql: str, path: str="dataframe.csv") -> pd.DataFrame:
if csv_from_sql(sql, path):
dataframe = pd.read_csv("dataframe.csv")
return dataframe
return pd.DataFrame()
This took a long time (about 4 to 5 minutes) to bring my +27000-rows table, but it worked when everything else didn´t.
If anyone knows a better way, it would be helpful for me too.
import cx_Oracle
import wx
print "Start..." + str(wx.Now())
base = cx_Oracle.makedsn('xxx', port, 'yyyy')
connection = cx_Oracle.connect(user name, password, base)
cursor = connection.cursor()
cursor.execute('select data from t_table')
li_row = cursor.fetchall()
data = []
for row in li_row:
data.append(row[0])
cursor.close()
connection.close()
print "End..." + str(wx.Now())
print "DONE!!!"
Is there a way to integrate pre-fetch in this program? My goal is to get data from database as quick as possible.
Fetching 10000 rows...
cursor.arraysize = 10000