Pandas read_sql reading Access Database: First row missing - python

I have an external database (pre MS Access 2007) that I am connecting to with pypyodbc and converting to a DataFrame with pandas. That by itself seems to work fine; however, the first row of data seems to be missing.
import pypyodbc
import pandas as pd

conn_str = (
    r'DRIVER={Microsoft Access Driver (*.mdb)};'
    r'DBQ=c:\path_to_my.mdb;'
)
cnxn = pypyodbc.connect(conn_str)
data = pd.read_sql("SELECT * FROM Table1 ORDER BY date DESC", cnxn)
cnxn.close()

# print a few lines and only specific columns
print(data.loc[:, ('date', 'time')].head())
As a result, the first entry in row 0 of the printed output is from 04:00 o'clock. In the database itself, however, I have a more recent entry from 06:00.
Why is the first row omitted in my read?
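One way to narrow this down (a sketch only, reusing the conn_str and Table1 from the code above) is to run the same query through a plain pypyodbc cursor and compare it with what read_sql returns: if the raw cursor already misses the 06:00 row, the driver or the SQL is the problem; if the cursor sees it, the row is being lost on the pandas side. Bracketing date, which is a reserved word in Access, is also worth ruling out.

import pypyodbc

cnxn = pypyodbc.connect(conn_str)
cur = cnxn.cursor()
# same query, with [date] bracketed to avoid the Access reserved word
cur.execute("SELECT * FROM Table1 ORDER BY [date] DESC")
rows = cur.fetchall()
print(len(rows))   # compare with len(data) from read_sql
print(rows[0])     # does the 06:00 entry show up here?
cur.close()
cnxn.close()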

Related

Move panda df to teradata table: [HY000] [Teradata][ODBC Teradata Driver][Teradata Database] Invalid timestamp

I have a df that I want to move to a Teradata table. I am using a framework that was discussed on this platform, but I am getting a few errors.
The logic behind loading the df to Teradata is:
1) If the table doesn't exist, create it; otherwise skip creation.
2) Start loading the df into the table. (Note: I will be passing multiple xlsx files into a df and eventually appending it to the Teradata table.)
I have written a BTEQ script to create the table, which goes like this:
FROM DBC.TABLES WHERE DATABASENAME = 'abc' AND TABLENAME = 'sample';
.IF ACTIVITYCOUNT <> 0 THEN .GOTO SKIP_CREATION
.IF ACTIVITYCOUNT = 0 THEN .GOTO TABLE_NOT_EXISTS
.LABEL TABLE_NOT_EXISTS
CREATE TABLE abc.sample (
    col1 VARCHAR(400) CHARACTER SET LATIN NOT CASESPECIFIC,
    col2 VARCHAR(400) CHARACTER SET LATIN NOT CASESPECIFIC,
    ...
    col23 TIMESTAMP(0) WITH TIME ZONE FORMAT 'YYYY-MM-DD HH:MI:SSZ',
    col24 TIMESTAMP(0) WITH TIME ZONE FORMAT 'YYYY-MM-DD HH:MI:SSZ'
);
.LABEL SKIP_CREATION
.LOGOFF
My Python code to move the df to Teradata is:
import teradata
import numpy as np
import pandas as pd

df = some data frame
host, username, password = 'host', 'username', 'password'
num_of_chunks = 1000
insert_query = "INSERT INTO abc.sample VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"

udaExec = teradata.UdaExec(appName="IMC", version="1.0", logConsole=False)
with udaExec.connect(method="odbc", system=host, username=username,
                     password=password, driver="Teradata") as session:
    # run the BTEQ script first so the table exists
    file_exist = session.execute(file=r"Path of the bteq file", fileType="bteq", ignoreErrors=[3803])
    schedule_chunks = np.array_split(df, num_of_chunks)
    for i, _ in enumerate(schedule_chunks):
        data = [tuple(x) for x in schedule_chunks[i].to_records(index=False)]
        session.executemany(insert_query, data, batch=True)
When I run this, I get the following error message:
DatabaseError: [HY000] [Teradata][ODBC Teradata Driver][Teradata
Database] Invalid timestamp.
Can someone help me see where I am going wrong? I also need some suggestions on whether I am writing the BTEQ script correctly. I want to avoid dropping the table and creating a new one each time.
I was able to push my dataframe into Teradata successfully. All I did was convert the timestamp columns in my dataframe from datetime64 to object. Below is the only line of code I added before running the code above:
df=df.astype(object).where(pd.notnull(df),'')
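If you prefer to control exactly how the timestamps are rendered, a small variant (a sketch only; col23/col24 are the placeholder column names from the BTEQ DDL above, so substitute your real ones) is to format the datetime columns as strings before blanking out the nulls:

import pandas as pd

ts_cols = ['col23', 'col24']  # hypothetical names; use your actual timestamp columns
for col in ts_cols:
    # render each timestamp as text; adjust the format string to match the
    # TIMESTAMP ... FORMAT clause in the table DDL (NaT entries become NaN)
    df[col] = pd.to_datetime(df[col]).dt.strftime('%Y-%m-%d %H:%M:%S%z')

# replace remaining NaN/NaT values with empty strings, as in the answer above
df = df.astype(object).where(pd.notnull(df), '')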

Python Running SQL Query With Temp Tables

I am new to the Python-SQL connectivity world. My goal is to retrieve data from SQL into a pandas DataFrame by executing long SQL queries through my Python script.
Most of my SQL queries are long, with multiple interim temp tables before the final SELECT statement from the last temp table. When I run such a monolithic query in Python I get the error:
"pandas.io.sql.DatabaseError: Execution failed on sql"
even though the queries run absolutely fine in SQL Server Management Studio.
I suspect this is due to the interim temp tables, because if I split the long query into two pieces (everything before the final SELECT in the first section, the final SELECT in the second) and run the two sections sequentially, they work fine.
Can someone explain why this is, or alternatively what the best way is to run long queries with temp tables/views and retrieve the results in a pandas DataFrame?
Here is my sample Python code that should take a file name as input and run the SQL to retrieve results into a DataFrame; however, it fails for a query with temp tables:
import pyodbc as db
import pandas as pd

filename = 'file.sql'
username = 'XXXX'
password = 'YYYYY'
driver = '{ODBC Driver 13 for SQL Server}'
database = 'DB'
server = 'local'

conn = db.connect('DRIVER=' + driver + '; PORT=1433; SERVER=' + server +
                  '; DATABASE=' + database + '; UID=' + username + '; PWD=' + password)

fd = open(filename, 'r')
sqlfile = fd.read()
fd.close()

sqlcommand1 = sqlfile
df_table = pd.read_sql(sqlcommand1, conn)
If I break my SQL query into two pieces (one with all the temp tables and a second with the final SELECT), it runs fine. Below is a modified version that splits the long query at '/**/', and it works:
"""
This Function Reads a SQL Script From an Extrenal File and Executes The
Script in SQL. If The SQL Script Has Bunch of Tem Tables/Views
Followed By a Select Statement to Retrieve Data From Those Views Then Input
SQL File Should Have '/**/' Immediately Before the Final
Select Statement. This is to Esnure Final Select Statement is Executed on
the Temporary Views Already Run by Python.
Input is a SQL File Name and Output is a DataFrame
"""
import pyodbc as db
import pandas as pd

filename = 'filename.sql'
username = 'XXXX'
password = 'YYYYY'
driver = '{ODBC Driver 13 for SQL Server}'
database = 'DB'
server = 'local'

conn = db.connect('DRIVER=' + driver + '; PORT=1433; SERVER=' + server +
                  '; DATABASE=' + database + '; UID=' + username + '; PWD=' + password)

fd = open(filename, 'r')
sqlfile = fd.read()
fd.close()

sql = sqlfile.split('/**/')
sqlcommand1 = sql[0]  # 1st section of the query, with the temp tables
sqlcommand2 = sql[1]  # 2nd section of the query, with the final SELECT statement
conn.execute(sqlcommand1)
df_table = pd.read_sql(sqlcommand2, conn)
Quick and dirty answer: if you are using T-SQL, put the line SET NOCOUNT ON at the beginning of your query.
As @Parfait mentioned above, the pandas read_sql method only supports one result set. However, when you populate a temp table in T-SQL you also generate a result set of the form "(XX row(s) affected)", which is what causes your original query to fail. By setting NOCOUNT ON you eliminate those early returns and only get the results from your final SELECT statement.
Alternatively, if you use a pyodbc cursor instead of pandas, you can use nextset() to skip the result sets produced by the temp table(s). More info on pyodbc here.
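As a concrete illustration (a sketch only, reusing the conn and sqlfile variables from the question's code), the two options look roughly like this:

import pandas as pd

# Option 1: suppress the "(XX row(s) affected)" result sets up front
df_table = pd.read_sql('SET NOCOUNT ON;\n' + sqlfile, conn)

# Option 2: with a raw pyodbc cursor, step past result sets that return no rows
cur = conn.cursor()
cur.execute(sqlfile)
while cur.description is None and cur.nextset():
    pass  # skip intermediate statements until a real result set appears
rows = [tuple(r) for r in cur.fetchall()]
columns = [col[0] for col in cur.description]
df_table2 = pd.DataFrame.from_records(rows, columns=columns)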

Error while inserting dataframe into oracle table using python

I am querying an Oracle table (Table1) into a pandas dataframe and then trying to insert the same data into another Oracle table (Table2) with the same table/column definition.
I am also changing the 'object' dtypes to VARCHAR in my code to avoid CLOB columns and speed up the insert.
import cx_Oracle
import pandas as pd
from sqlalchemy import types

conn = cx_Oracle.connect('username/password@dbname')
df = pd.read_sql(sql='SELECT * FROM TABLE1', con=conn)

dtyp = {c: types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}

df.to_sql('TABLE2', conn, index=False, if_exists='append', dtype=dtyp, chunksize=10**4)
The primary key column ACCOUNT_ID is VARCHAR2(100) in both source and target tables. I am getting the error:
ValueError: ACCOUNT_ID (VARCHAR(8)) not a string
Please help me.
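One thing that commonly trips up this kind of insert is NaN values ending up in a VARCHAR bind, and to_sql is also more reliable against a SQLAlchemy engine than a raw cx_Oracle connection. A sketch of both changes (the connection URL, TABLE1 and table2 names are placeholders standing in for the question's objects, not a verified fix for this exact error):

import pandas as pd
from sqlalchemy import create_engine, types

# placeholder DSN; substitute real credentials and TNS alias
engine = create_engine('oracle+cx_oracle://username:password@dbname')

df = pd.read_sql('SELECT * FROM TABLE1', con=engine)

# make sure object columns hold genuine strings and NaN does not leak into VARCHAR binds
obj_cols = df.columns[df.dtypes == 'object'].tolist()
df[obj_cols] = df[obj_cols].fillna('').astype(str)

# pass an integer length to VARCHAR; fall back to 1 for all-empty columns
dtyp = {c: types.VARCHAR(int(df[c].str.len().max() or 1)) for c in obj_cols}

# lowercase name lets SQLAlchemy map onto the unquoted (uppercase) Oracle table
df.to_sql('table2', engine, index=False, if_exists='append', dtype=dtyp, chunksize=10**4)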

Importing from Excel to MySQL Table Using Python 2.7

I'm trying to insert into a MySQL table from data in this Excel sheet: https://www.dropbox.com/s/w7m282386t08xk3/GA.xlsx?dl=0
The script should start from the second sheet "Daily Metrics" at row 16. The MySQL table already has the fields called date, campaign, users, and sessions.
Using Python 2.7, I've already created the MySQL connection and opened the sheet, but I'm not sure how to loop over those rows and insert into the database.
import MySQLdb as db
from openpyxl import load_workbook
wb = load_workbook('GA.xlsx')
sheetranges = wb['Daily Metrics']
print(sheetranges['A16'].value)
conn = db.connect('serverhost','username','password','database')
cursor = conn.cursor()
cursor.execute('insert into test_table ...')
conn.close()
Thank you for your help!
Try this and see if it does what you are looking for. You will need to update it with the correct workbook name and location. Also, update the range that you want to iterate over in for rw in wb["Daily Metrics"].iter_rows("A16:B20"):
from openpyxl import load_workbook

wb = load_workbook("c:/testing.xlsx")
for rw in wb["Daily Metrics"].iter_rows("A16:B20"):
    for cl in rw:
        print cl.value
Only basic knowledge of MySQL and openpyxl is needed; you can solve this by reading the tutorials on your own.
Before executing the script you need to create the database and table; I'm assuming you've already done that.
import openpyxl
import MySQLdb

wb = openpyxl.load_workbook('/path/to/GA.xlsx')
ws = wb['Daily Metrics']

# map is a convenient way to construct a list; slicing the Worksheet instance
# gives a tuple of row tuples, and Worksheet.max_row gives the last row of the sheet
data = map(lambda x: {'date': x[0].value,
                      'campaign': x[1].value,
                      'users': x[2].value,
                      'sessions': x[3].value},
           ws[16: ws.max_row])

# filter is another builtin function; use it to drop rows with blank cells if needed
data = filter(lambda x: None not in x.values(), data)

db = MySQLdb.connect('host', 'user', 'password', 'database')
cursor = db.cursor()
for row in data:
    # build and execute the raw MySQL statement via str.format
    cursor.execute('insert into table (date, campaign, users, sessions) '
                   'values ("{date}", "{campaign}", {users}, {sessions});'
                   .format(**row))
db.commit()
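Interpolating values directly into the SQL string works, but it breaks on quotes in the data and invites injection. A parameterized variant of the same loop (a sketch only, reusing the data list and db connection from above and the test_table name from the question) is safer, because MySQLdb escapes each value itself:

insert_sql = ('insert into test_table (date, campaign, users, sessions) '
              'values (%s, %s, %s, %s)')
rows = [(r['date'], r['campaign'], r['users'], r['sessions']) for r in data]
cursor = db.cursor()
cursor.executemany(insert_sql, rows)  # one round trip per batch, values bound safely
db.commit()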

python pandas with to_sql(), SQLAlchemy and schema in exasol

I'm trying to upload a pandas data frame to an SQL table. It seemed to me that the pandas to_sql function is the best solution for larger data frames, but I can't get it to work. I can easily extract data, but I get an error message when trying to write it to a new table:
# connect to Exasol DB
exaString='DSN=exa'
conDB = pyodbc.connect(exaString)
# get some data from somewhere, works without error
sqlString = "SELECT * FROM SOMETABLE"
data = pd.read_sql(sqlString, conDB)
# now upload this data to a new table
data.to_sql('MYTABLENAME', conDB, flavor='mysql')
conDB.close()
The error message I get is
pyodbc.ProgrammingError: ('42000', "[42000] [EXASOL][EXASolution driver]syntax error, unexpected identifier_chain2, expecting
assignment_operator or ':' [line 1, column 6] (-1)
(SQLExecDirectW)")
Unfortunately I have no idea what the query that caused this syntax error looks like, or what else is wrong. Can someone please point me in the right direction?
(Second) EDIT:
Following Humayun's and Joris' suggestions, I now use pandas version 0.14 and SQLAlchemy in combination with the Exasol dialect (?). Since I am connecting to a defined schema, I am using the metadata option, but the program crashes with "Bus error (core dumped)".
engine = create_engine('exa+pyodbc://uid:passwd@exa/mySchemaName', echo=True)

# get some data
sqlString = "SELECT * FROM SOMETABLE"  # SOMETABLE is a view in mySchemaName
df = pd.read_sql(sqlString, con=engine)  # works

print engine.has_table('MYTABLENAME')  # MYTABLENAME is a view in mySchemaName
# prints "True"

# upload it to a new table
meta = sqlalchemy.MetaData(engine, schema='mySchemaName')
meta.reflect(engine, schema='mySchemaName')
pdsql = sql.PandasSQLAlchemy(engine, meta=meta)
pdsql.to_sql(df, 'MYTABLENAME')
I am not sure about setting "mySchemaName" in create_engine(..), but the outcome is the same.
Pandas does not support the EXASOL syntax out of the box, so it needs to be tweaked a bit; here is a working example of your code without SQLAlchemy:
import pyodbc
import pandas as pd

con = pyodbc.connect('DSN=EXA')
con.execute('OPEN SCHEMA TEST2')

# configure pandas to understand EXASOL as a mysql flavor
pd.io.sql._SQL_TYPES['int']['mysql'] = 'INT'
pd.io.sql._SQL_SYMB['mysql']['br_l'] = ''
pd.io.sql._SQL_SYMB['mysql']['br_r'] = ''
pd.io.sql._SQL_SYMB['mysql']['wld'] = '?'
pd.io.sql.PandasSQLLegacy.has_table = \
    lambda self, name: name.upper() in [t[0].upper() for t in con.execute('SELECT table_name FROM cat').fetchall()]

data = pd.read_sql('SELECT * FROM services', con)
data.to_sql('SERVICES2', con, flavor='mysql', index=False)
If you use the EXASolution Python package, the code would look as follows:
import exasol
con = exasol.connect(dsn='EXA') # normal pyodbc connection with additional functions
con.execute('OPEN SCHEMA TEST2')
data = con.readData('SELECT * FROM services') # pandas data frame per default
con.writeData(data, table = 'services2')
The problem is that in pandas 0.14 the read_sql and to_sql functions still cannot deal with schemas, but using Exasol without schemas makes no sense. This will be fixed in 0.15. If you want to use it now, look at this pull request: https://github.com/pydata/pandas/pull/7952
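For reference, once that schema support landed (pandas 0.15 and later), the SQLAlchemy route from the question reduces to something like this sketch (the DSN, schema and table names are the placeholders used above, not a tested setup):

import pandas as pd
from sqlalchemy import create_engine

# placeholder DSN/credentials from the question
engine = create_engine('exa+pyodbc://uid:passwd@exa')

df = pd.read_sql('SELECT * FROM mySchemaName.SOMETABLE', con=engine)

# to_sql now accepts a schema= keyword, so no manual MetaData/PandasSQLAlchemy plumbing is needed
df.to_sql('MYTABLENAME', engine, schema='mySchemaName', if_exists='replace', index=False)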
