Pandas read_sql inconsistent behaviour dependent on driver? - python

When I run a query from my local machine to a SQL server db, data is returned. If I run the same query from a JupyterHub server (using ssh), the following is returned:
TypeError: 'NoneType' object is not iterable
This implies it isn't getting any data back.
The connection strings are fine on both systems (albeit different), because running the same stored procedure works on both systems with these connection strings:
Local= "Driver={SQL Server};Server=DNS-based-address;Database=name;uid=user;pwd=pwd"
Hub = "DRIVER=FreeTDS;SERVER=IP.add.re.ss;PORT=1433;DATABASE=name;UID=dbuser;PWD=pwd;TDS_Version=8.0"
Is there something in the FreeTDS driver that affects chunksize, or that means a SET NOCOUNT is required in the original query, as per NoneType object is not iterable error in pandas? I tried that fix, by the way, and got nowhere.
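For reference, the SET NOCOUNT workaround from that linked question is usually applied by just prefixing the statement; a minimal sketch, assuming a pyodbc connection called conn built from the strings above and a hypothetical stored procedure name (this is the fix that did not help in my case):
import pandas as pd

# Sketch of the SET NOCOUNT workaround: suppressing "rows affected" messages
# so the driver does not hand pandas an empty result set first.
query = "SET NOCOUNT ON; EXEC dbo.my_stored_procedure"  # procedure name is illustrative
df = pd.read_sql(query, conn)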

Are you using pymssql, which is built on top of FreeTDS?
For SQL Server you could also try the Microsoft JDBC driver with the Python package jaydebeapi: https://github.com/microsoft/mssql-jdbc.
import pandas as pd
import pymssql
conn = pymssql.connect(
    host=r'192.168.254.254',
    port='1433',
    user=r'user',
    password=r'password',
    database='DB_NAME'
)
query = """SELECT * FROM db_table"""
df = pd.read_sql(con=conn, sql=query)
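If you want to try the JDBC route mentioned above, a minimal sketch with jaydebeapi might look like the following; the jar path, host, database and credentials are placeholders, not values from the question:
import jaydebeapi
import pandas as pd

# Hypothetical connection through the Microsoft JDBC driver (mssql-jdbc jar).
conn = jaydebeapi.connect(
    "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "jdbc:sqlserver://my-server:1433;databaseName=my_db",
    ["my_user", "my_password"],
    "/path/to/mssql-jdbc.jar",
)
df = pd.read_sql("SELECT * FROM db_table", conn)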

Related

pyodbc is returning binary data in char field

pyodbc with the driver "iSeries Access ODBC Driver" is returning binary output, e.g.:
original data in the table: B06300
what it returns: b'\xc2\xf0\xf6\xf3\xf0\xf0########################'
My code:
import pyodbc
import pandas as pd

connection = pyodbc.connect(
    driver='{iSeries Access ODBC Driver}',
    System='**************',
    port='****',
    uid='**************',
    pwd='**************')
df = pd.read_sql_query("SELECT * FROM schema_name.F4108", connection)
I tried adding add_output_converter and an encoder to the connection, but that didn't work.
I suspect the problem is that the data is defined on the server as CCSID 65535, which means to not translate the data.
Using the TRANSLATE=1 connection string keyword will cause the ODBC driver to translate the data using the current EBCDIC CCSID for the job. Other ODBC connection properties can be found here: https://www.ibm.com/docs/en/i/7.1?topic=keywords-connection-string-conversion-properties
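A minimal sketch of what that might look like; the system name and credentials are placeholders, and TRANSLATE=1 is the keyword described above:
import pyodbc
import pandas as pd

# Hypothetical connection: TRANSLATE=1 asks the iSeries Access ODBC driver to
# convert CCSID 65535 (untranslated) fields using the job's EBCDIC CCSID.
connection = pyodbc.connect(
    'DRIVER={iSeries Access ODBC Driver};'
    'SYSTEM=my_system;UID=my_user;PWD=my_password;'
    'TRANSLATE=1')
df = pd.read_sql_query("SELECT * FROM schema_name.F4108", connection)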

How to query a (Postgres) RDS DB through an AWS Jupyter Notebook?

I'm trying to query an RDS (Postgres) database through Python, more specifically from a Jupyter Notebook. So far, this is what I've tried:
import boto3
client = boto3.client('rds-data')
response = client.execute_sql(
    awsSecretStoreArn='string',
    database='string',
    dbClusterOrInstanceArn='string',
    schema='string',
    sqlStatements='string'
)
The error I've been receiving is:
BadRequestException: An error occurred (BadRequestException) when calling the ExecuteSql operation: ERROR: invalid cluster id: arn:aws:rds:us-east-1:839600708595:db:zprime
In the end, it was much simpler than I thought, nothing fancy or specific. It was basically a solution I had used before when accessing one of my local DBs: simply import the library for your database type (Postgres, MySQL, etc.) and connect to it in order to execute queries through Python.
I don't know if this is the best solution, since queries made through Python will probably be slower than running them directly, but it's what works for now.
import psycopg2
conn = psycopg2.connect(database='database_name',
                        user='user',
                        password='password',
                        host='host',
                        port='port')
cur = conn.cursor()
cur.execute('''
SELECT *
FROM table;
''')
cur.fetchall()
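If you want the result as a pandas DataFrame (as in the other questions on this page), the same psycopg2 connection can be handed to pandas directly; a small sketch reusing the connection from above:
import pandas as pd

# pandas runs the query over the existing psycopg2 connection and returns a DataFrame.
df = pd.read_sql('SELECT * FROM table;', conn)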

(psycopg2.OperationalError) Invalid - opcode

I am trying to connect to Netezza using sqlalchemy.create_engine(). The reason I want to use SQLAlchemy is that I want to be able to read and write through pandas DataFrames.
What works is as follow:
import pandas as pd
import pyodbc
conn = pyodbc.connect('DSN=NZDWW')
df2 = pd.read_sql(Query,conn)
The above code runs fine. But in order to write the df DataFrame to Netezza, I need to use to_sql(), which requires SQLAlchemy. This is what my code looks like:
from sqlalchemy import create_engine
username = os.getenv('REDSHIFT_USER')
password = os.getenv('REDSHIFT_PASS')
DATABASE = "SHP_TARGET"
HOST = "Netezza1"
PORT = 5480
conn_str = "postgresql://" + username + ":" + password + "@" + HOST + ':' + str(PORT) + '/' + DATABASE
engine3 = create_engine(conn_str)
df = pd.read_sql(Query, engine3)
When I execute this, I get the following error:
OperationalError: (psycopg2.OperationalError) Invalid - opcode
Invalid - opcodeInvalid packet length (Background on this error at: http://sqlalche.me/e/e3q8)
Any leads will be much appreciated. Thanks.
Database: Netezza
Python version: 3.6
OS: Windows
The SQLAlchemy dialect for Postgres isn't compatible with Netezza.
The error you're receiving comes from the psycopg2 module, which facilitates the connection; it is basically complaining that it can't make sense of what the server is "saying".
There appears to be a dialect for Netezza though. You may want to try that out.
Update: the formal dialect for Netezza has now been released.
It can be used as documented here - https://github.com/IBM/nzalchemy#prerequisites
Example
from sqlalchemy import create_engine
from urllib.parse import quote_plus
import os

# assumes NZ_HOST, NZ_DATABASE, NZ_USER, NZ_PASSWORD are set in the environment
params = quote_plus(f"DRIVER=NetezzaSQL;SERVER={os.environ['NZ_HOST']};"
                    f"DATABASE={os.environ['NZ_DATABASE']};USER={os.environ['NZ_USER']};"
                    f"PASSWORD={os.environ['NZ_PASSWORD']}")
engine = create_engine(f"netezza+pyodbc:///?odbc_connect={params}",
                       echo=True)
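Since the original goal was pandas' to_sql(), the engine above should then behave like any other SQLAlchemy engine; a short sketch with an illustrative table name and a throwaway DataFrame:
import pandas as pd

# Hypothetical round trip through the nzalchemy engine built above.
df = pd.DataFrame({'a': [1, 2, 3]})
df.to_sql('my_table', engine, if_exists='replace', index=False)
df2 = pd.read_sql('SELECT * FROM my_table', engine)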

Sqlalchemy pyodbc DSN-less URL connection string does not seem to work

I am trying to connect to a Sybase ASE 15 database on Windows using SQLAlchemy (1.0.9) and pyodbc. If I use a DSN URL everything works as expected:
url = r'sybase+pyodbc://username:password@dsn'
engine = create_engine(url, echo=True)
Session = sessionmaker(bind=engine)
sess = Session()
conn = sess.connection()
However, if I avoid the DSN and use
url = 'sybase+pyodbc://username:password@host:port/database?driver=Adaptive Server Enterprise'
I get an error:
DBAPIError: (pyodbc.Error) ('01S00', '[01S00] [SAP][ASE ODBC
Driver]Invalid port number (30011) (SQLDriverConnect)')
The port number is correct and it is the same port as specified in the DSN.
Any ideas?
It may be worth trying to work with some version of the semicolon-separated DSN-less format used by pyODBC (and ODBC in general). Some examples here:
http://www.connectionstrings.com/adaptive-server-enterprise-odbc-driver/
This question tackles a similar issue with FreeTDS, but the concept is the same, as the connect string is basically passed through to the low-level ODBC connect:
SqlAlchemy equivalent of pyodbc connect string using FreeTDS
The URL is being parsed down to this type of string for ultimate connection by pyodbc through to SQLDriverConnect (in the ODBC API), so specifying the ASE ODBC DSN-less connection string directly may work better.
Update: Ran a quick test to see what connect arguments are produced for this URL:
from sqlalchemy.engine.url import *
from sqlalchemy.connectors.pyodbc import *

connector = PyODBCConnector()
url = make_url("sybase+pyodbc://username:password@host:5555/database?driver=Adaptive Server Enterprise")
print(connector.create_connect_args(url))
This results in:
[['DRIVER={Adaptive Server Enterprise};Server=host,5555;Database=database;UID=username;PWD=password'], {}]
Note that the hostname and port are separated by a comma; per http://www.connectionstrings.com/adaptive-server-enterprise-odbc-driver/tds-based-odbc-driver-from-sybase-ocs-125/, this format works for the TDS-based ODBC driver for Sybase 12.5:
Driver={Sybase ASE ODBC Driver};NetworkAddress=myServerAddress,5000;
Db=myDataBase;Uid=myUsername;Pwd=myPassword;
However, the ASE 15 format (http://www.connectionstrings.com/adaptive-server-enterprise-odbc-driver/adaptive-server-enterprise-150/) specifies server=myServerAddress;port=myPortnumber with port as a key passed in the semicolon-delimited string:
Driver={Adaptive Server Enterprise};app=myAppName;server=myServerAddress;
port=myPortnumber;db=myDataBase;uid=myUsername;pwd=myPassword;
If you "cheat" on the port spec by using host;port=5555, you get:
[['DRIVER={Adaptive Server Enterprise};Server=host;port=5555;Database=database;UID=username;PWD=password'], {}]
But this just feels like a Bad Idea™, even if it works. I'd also note that the generated string is using Database as the key vs. Db in the Sybase connection string reference. This may prove to be an issue as well.
Using ?odbc_connect as in the linked question is probably your best option for controlling the exact connect arguments being sent to ODBC.
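A minimal sketch of that approach under the SQLAlchemy version in the question, building the ASE 15 style string from the reference above and passing it through ?odbc_connect; server, port, database and credentials are placeholders:
from urllib.parse import quote_plus
from sqlalchemy import create_engine

# Hypothetical DSN-less connect string following the ASE 15 reference format above.
conn_str = ("Driver={Adaptive Server Enterprise};server=myServerAddress;"
            "port=5555;db=database;uid=username;pwd=password;")
engine = create_engine("sybase+pyodbc:///?odbc_connect=" + quote_plus(conn_str))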

Get data from pandas into a SQL server with PYODBC

I am trying to understand how Python could pull data from an FTP server into pandas and then move it into SQL Server. My code here is very rudimentary, to say the least, and I am looking for any advice or help at all. Loading the data from the FTP server works fine. If I remove that code and replace it with a SELECT from MS SQL Server it is also fine, so the connection string works, but the insertion into SQL Server seems to be causing problems.
import pyodbc
import pandas
from ftplib import FTP
from StringIO import StringIO
import csv
ftp = FTP('ftp.xyz.com', 'user', 'pass')
ftp.set_pasv(True)
r = StringIO()
ftp.retrbinary('RETR filname.csv', r.write)  # retrbinary expects the full RETR command
df = pandas.read_table(StringIO(r.getvalue()), delimiter=',')

connStr = 'DRIVER={SQL Server Native Client 10.0};SERVER=localhost;DATABASE=TESTFEED;UID=sa;PWD=pass'
conn = pyodbc.connect(connStr)
cursor = conn.cursor()
cursor.execute("INSERT INTO dbo.tblImport(Startdt, Enddt, x, y, z) "
               "VALUES (x, x, x, x, x)")
conn.commit()
cursor.close()
conn.close()
print("Script has successfully run!")
When I remove the ftp code this runs perfectly, but I do not understand how to make the next jump to get this into Microsoft SQL server, or even if it is possible without saving into a file first.
For the 'write to sql server' part, you can use the convenient to_sql method of pandas (so no need to iterate over the rows and do the insert manually). See the docs on interacting with SQL databases with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql
You will need at least pandas 0.14 to have this working, and you also need sqlalchemy installed. An example, assuming df is the DataFrame you got from read_table:
import sqlalchemy
import pyodbc
engine = sqlalchemy.create_engine("mssql+pyodbc://<username>:<password>@<dsnname>")
# write the DataFrame to a table in the sql database
df.to_sql("table_name", engine)
See also the documentation page of to_sql.
More info on how to create the connection engine with SQLAlchemy for SQL Server with pyodbc can be found here: http://docs.sqlalchemy.org/en/rel_1_1/dialects/mssql.html#dialect-mssql-pyodbc-connect
But if your goal is just to get the CSV data into the SQL database, you could also consider doing this directly from SQL. See e.g. Import CSV file into SQL Server.
Python3 version using a LocalDB SQL instance:
from sqlalchemy import create_engine
import urllib
import pyodbc
import pandas as pd
df = pd.read_csv("./data.csv")
quoted = urllib.parse.quote_plus("DRIVER={SQL Server Native Client 11.0};SERVER=(localDb)\ProjectsV14;DATABASE=database")
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted))
df.to_sql('TargetTable', schema='dbo', con = engine)
result = engine.execute('SELECT COUNT(*) FROM [dbo].[TargetTable]')
result.fetchall()
Yes, the bcp utility seems to be the best solution for most cases.
If you want to stay within Python, the following code should work.
from sqlalchemy import create_engine
import urllib
import pyodbc
quoted = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER=YOUR\ServerName;DATABASE=YOur_Database")
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted))
df.to_sql('Table_Name', schema='dbo', con = engine, chunksize=200, method='multi', index=False, if_exists='replace')
Don't skip method='multi'; it significantly reduces the task execution time.
Sometimes you may encounter the following error.
ProgrammingError: ('42000', '[42000] [Microsoft][ODBC SQL Server
Driver][SQL Server]The incoming request has too many parameters. The
server supports a maximum of 2100 parameters. Reduce the number of
parameters and resend the request. (8003) (SQLExecDirectW)')
In such a case, determine the number of columns in your dataframe: df.shape[1]. Divide the maximum supported number of parameters by this value and use the result's floor as a chunk size.
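As a small illustration of that calculation, reusing df and engine from the snippet above (2100 is the limit from the error message; subtract a little margin if you still hit it):
# With method='multi', each inserted row consumes df.shape[1] parameters,
# so the chunk size must keep rows * columns at or below 2100.
max_params = 2100
chunksize = max_params // df.shape[1]
df.to_sql('Table_Name', schema='dbo', con=engine, chunksize=chunksize,
          method='multi', index=False, if_exists='replace')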
I found that using the bcp utility (https://learn.microsoft.com/en-us/sql/tools/bcp-utility) works best when you have a large dataset. I have 2.7 million rows that insert at 80K rows/sec. You can store your data frame as a CSV file (use tabs as the separator if your data doesn't contain tabs, and UTF-8 encoding). With bcp I've used the "-c" format and it has worked without issues so far.
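A rough sketch of that workflow, assuming df is the DataFrame to load; the table, server and credentials are placeholders (-c selects character format, -t sets the field terminator):
import subprocess

# Dump the DataFrame tab-separated, without index or header, in UTF-8.
df.to_csv('data.tsv', sep='\t', index=False, header=False, encoding='utf-8')

# Hypothetical bcp invocation; adjust table, server, database and credentials.
subprocess.run(['bcp', 'dbo.MyTable', 'in', 'data.tsv',
                '-S', 'my_server', '-d', 'my_database',
                '-U', 'my_user', '-P', 'my_password',
                '-c', '-t', '\t'], check=True)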
This worked for me on Python 3.5.2:
import sqlalchemy as sa
import urllib
import pyodbc
conn = urllib.parse.quote_plus('DRIVER={ODBC Driver 17 for SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
engine = sa.create_engine('mssql+pyodbc:///?odbc_connect={}'.format(conn))
frame.to_sql("myTable", engine, schema='dbo', if_exists='append', index=False, index_label='myField')
"As the Connection represents an open resource against the database, we want to always limit the scope of our use of this object to a specific context, and the best way to do that is by using Python context manager form, also known as the with statement."
https://docs.sqlalchemy.org/en/14/tutorial/dbapi_transactions.html
The example would then be:
from sqlalchemy import create_engine
import urllib
import pyodbc
connection_string = (
    "Driver={SQL Server Native Client 11.0};"
    "Server=myserver;"
    "UID=myuser;"
    "PWD=mypwd;"
    "Database=mydb;"
)
quoted = urllib.parse.quote_plus(connection_string)
engine = create_engine(f'mssql+pyodbc:///?odbc_connect={quoted}')
with engine.connect() as cnn:
    df.to_sql('mytable', con=cnn, if_exists='replace', index=False)
The following is what worked for me using SQLAlchemy. Pay attention to the last part: ?driver=SQL+Server.
import sqlalchemy
import pyodbc
engine = sqlalchemy.create_engine('mssql+pyodbc://MyUser:MyPWD@dataserver.sandbox.myserver/MY_DB?driver=SQL+Server')
dt.to_sql("PatientResultTest", engine,if_exists='append')
The SQL table needs an index column at the beginning to store the index values of the DataFrame.
# using class function
import pandas as pd
import pyodbc
import sqlalchemy
import urllib
class data_frame_to_sql():
    def __init__(self, dataFrame, sql_table_name):
        self.dataFrame = dataFrame
        self.sql_table_name = sql_table_name

    def conversion(self):
        params = urllib.parse.quote_plus("DRIVER={SQL Server};"
                                         "SERVER=######;"
                                         "DATABASE=####;"
                                         "UID=#####;"
                                         "PWD=###;")
        try:
            engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params))
            return f"Table '{self.sql_table_name}' added successfully in database", self.dataFrame.to_sql(self.sql_table_name, engine)
        except Exception as e:
            e = str(e).replace(".", "")
            print(f"{e} in Database.")
data = {"BusinessEntityID": ["1", "2", "3"], "FirstName": ["raj", "abhi", "amir"], "LastName": ["kapoor", "bachn", "khhan"]}
df = pd.DataFrame(data, columns=['BusinessEntityID', 'FirstName', 'LastName'])
ab = data_frame_to_sql(df, "ab").conversion()
print(ab)
It's not necessary to use SQLAlchemy; one can create a connection with pyodbc directly and use it with pandas, as below (query here is your SQL statement):
with pyodbc.connect('DRIVER={ODBC Driver 18 for SQL Server};SERVER=' + server
                    + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password) as newconn:
    df = pd.read_sql(query, newconn)
