How to read from an Azure Data Warehouse into Python using Databricks? - python

I tested some sample code that I found online. This is it.
import pandas as pd
import pypyodbc
conn = "DRIVER={SQL Server Native Client 11.0};SERVER=name.database.windows.net;DATABASE=db_name;UID=my_id;PWD=my_pwd"
df = pd.read_sql_query('select * from dbo.main_table', conn)
So, I would expect this to import the dataset, which sits in SQL Server, into the dataframe, but I'm getting this error.
ArgumentError: Could not parse rfc1738 URL from string 'DRIVER={SQL Server Native Client 11.0};SERVER=etc., etc., etc.
I'm working in a Databricks environment. Thanks for the look.
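Not part of the original post, but for context: the error occurs because pandas hands the string to SQLAlchemy, which expects an rfc1738 URL or an open DBAPI connection, not a raw ODBC connection string. A minimal sketch of the usual workaround, assuming the same placeholder credentials and that the ODBC driver is available on the cluster:
import pandas as pd
import pypyodbc

# Same placeholder connection string as above.
conn_str = ("DRIVER={SQL Server Native Client 11.0};"
            "SERVER=name.database.windows.net;DATABASE=db_name;"
            "UID=my_id;PWD=my_pwd")

# Open an actual DBAPI connection and hand that to pandas
# instead of passing the raw connection string.
conn = pypyodbc.connect(conn_str)
df = pd.read_sql_query('select * from dbo.main_table', conn)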

Related

Data corruption - pyodbc to SQL server

I have a Python script which connects to SQL Server and fetches data from it. Everything works fine, but sometimes the data gets corrupted. There is nothing on the terminal, no warning or error, and the data is perfectly fine in SQL Server.
Snapshot of data at SQL Server.
Snapshot of data after it is read by Python.
Sample code
import pyodbc
import pandas as pd
connStr = (
    r'Driver={ODBC Driver for SQL Server};'
    r'Server=ServerName;'
    r'database=DBName;'
    r'Trusted_Connection=yes;'
)
conn = pyodbc.connect(connStr)
sqlQuery = pd.read_sql_query('''
    Select * from TableA
''', conn)
dfSQL = pd.DataFrame(sqlQuery)
I have absolutely no clue what's happening over here. Can someone please help?
@AlwaysLearning, thank you for your comment. I'm pasting the snapshot over here as comments don't accept it. I ran the query in SQL Server and can see errant bytes.

(psycopg2.OperationalError) Invalid - opcode

I am trying to connect to Netezza using SQLAlchemy's create_engine(). The reason I want to use SQLAlchemy is because I want to be able to read and write through a pandas dataframe.
What works is as follow:
import pandas as pd
import pyodbc
conn = pyodbc.connect('DSN=NZDWW')
df2 = pd.read_sql(Query,conn)
The above code runs fine. But in order to write the df dataframe to Netezza, I need to use the to_sql() function, which needs SQLAlchemy. This is what my code looks like:
from sqlalchemy import create_engine
import os

username = os.getenv('REDSHIFT_USER')
password = os.getenv('REDSHIFT_PASS')
DATABASE = "SHP_TARGET"
HOST = "Netezza1"
PORT = 5480
conn_str = "postgresql://" + username + ":" + password + "@" + HOST + ':' + str(PORT) + '/' + DATABASE
engine3 = create_engine(conn_str)
df = pd.read_sql(Query, engine3)
When I execute this, I get the following error:
OperationalError: (psycopg2.OperationalError) Invalid - opcode
Invalid - opcode Invalid packet length (Background on this error at: http://sqlalche.me/e/e3q8)
Any leads will be much appreciated. Thanks.
Database: Netezza
Python version: 3.6
OS: Windows
The SQLAlchemy dialect for Postgres isn't compatible with Netezza.
The error you're receiving is from the psycopg2 module, which facilitates the connection: it's complaining that it can't make sense of what the server is "saying", basically.
There appears to be a dialect for Netezza though. You may want to try that out.
The formal dialect for Netezza has since been released.
It can be used as documented here - https://github.com/IBM/nzalchemy#prerequisites
Example
from sqlalchemy import create_engine
from urllib.parse import quote_plus
import os

# assumes NZ_HOST, NZ_DATABASE, NZ_USER, NZ_PASSWORD are set
params = quote_plus(f"DRIVER=NetezzaSQL;SERVER={os.environ['NZ_HOST']};"
                    f"DATABASE={os.environ['NZ_DATABASE']};USER={os.environ['NZ_USER']};"
                    f"PASSWORD={os.environ['NZ_PASSWORD']}")
engine = create_engine(f"netezza+pyodbc:///?odbc_connect={params}",
                       echo=True)
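Once that engine exists, the read and the write the question asks about should both go through pandas; a short sketch (table names here are placeholders, not from the original answer):
import pandas as pd

df = pd.read_sql("SELECT * FROM my_table", engine)                   # read into a dataframe
df.to_sql("my_table_copy", engine, if_exists="append", index=False)  # write a dataframe back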

Python sqlalchemy trying to write pandas dataframe to SQL Server using .to_sql

I have some Python code through which I am getting a pandas dataframe "df". I am trying to write this dataframe to Microsoft SQL Server. I am trying to connect through the following code, but I am getting an error.
import pyodbc
from sqlalchemy import create_engine
engine = create_engine('mssql+pyodbc:///?odbc_connect=DRIVER={SQL Server};SERVER=bidept;DATABASE=BIDB;UID=sdcc\neils;PWD=neil!pass')
engine.connect()
df.to_sql(name='[BIDB].[dbo].[Test]',con=engine, if_exists='append')
However at the engine.connect() line I am getting the following error
sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('08001', '[08001] [Microsoft][ODBC SQL Server Driver]Neither DSN nor SERVER keyword supplied (0) (SQLDriverConnect)')
Can anyone tell me what I am missing? I am using Microsoft SQL Server Management Studio 14.0.17177.0.
I connect to the SQL server through the following
Server type: Database Engine
Server name: bidept
Authentication: Windows Authentication
for which I log into my windows using username : sdcc\neils
and password : neil!pass
I am new to databases and Python. Kindly let me know if you need any additional details. Any help will be greatly appreciated. Thank you in advance.
I was finally able to make it run.
import pyodbc
from sqlalchemy import create_engine
import urllib
import sys

params = urllib.quote_plus(r'DRIVER={SQL Server};SERVER=bidept;DATABASE=BIDB;Trusted_Connection=yes')
### For Python 3.5+: use urllib.parse.quote_plus instead
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)

reload(sys)                     # Python 2 only
sys.setdefaultencoding('utf8')  # Python 2 only
df.to_sql(name='Test', con=engine, if_exists='append', index=False)
Thanks to @gord-thompson who answered here.
Although in my SQL Server all the tables are under the 'dbo' schema (i.e. dbo.Test1, dbo.Other_Tables), this query puts my table in the 'sdcc\neils' schema (i.e. sdcc\neils.Test1, sdcc\neils.Other_Tables). Any solution to this?
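Not part of the original exchange, but pandas' to_sql takes a schema argument; passing schema='dbo' is the usual way to keep the table out of the login's default schema. A sketch reusing the engine above:
df.to_sql(name='Test', con=engine, schema='dbo', if_exists='append', index=False)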

How to create sql alchemy connection for pandas read_sql with sqlalchemy+pyodbc and multiple databases in MS SQL Server?

I am trying to use 'pandas.read_sql_query' to copy data from MS SQL Server into a pandas DataFrame. I need to do multiple joins in my SQL query. The tables being joined are on the same server but in different databases. The query I am passing to pandas works fine inside MS SQL Server Management Studio. In a Jupyter Notebook I tried to query data like so (to make things readable the query itself is simplified to just 2 joins and generic names are used):
import pandas as pd
import sqlalchemy as sql
import pyodbc
server = '100.10.10.10'
driver = 'SQL+Server+Native+Client+11.0'
myQuery = '''SELECT first.Field1, second.Field2
FROM db1.schema.Table1 AS first
JOIN db2.schema.Table2 AS second
ON first.Id = second.FirstId
'''
engine = sql.create_engine('mssql+pyodbc://{}?driver={}'.format(server, driver))
df = pd.read_sql_query(myQuery, engine)
This does not work and returns an error:
DBAPIError: (pyodbc.Error) ('IM010', '[IM010] [Microsoft][ODBC Driver Manager] Data source name too long (0) (SQLDriverConnect)')
It seems that the problem is in the engine, which does not include any information about the database, because everything works fine with the following kind of code, where I include the database in the engine:
myQuery = 'select Field1 from schema.Table1'
db = 'db1'
engine = sql.create_engine('mssql+pyodbc://{}/{}?driver={}'.format(server, db, driver))
df = pd.read_sql_query(myQuery, engine)
but it breaks, just like the code with joins above, if I don't include the database in the engine and instead add it to the query like so:
myQuery = 'select Field1 from db1.schema.Table1'
engine = sql.create_engine('mssql+pyodbc://{}?driver={}'.format(server,
driver))
df = pd.read_sql_query(myQuery, engine)
So how should I specify the pandas.read_sql_query 'sql' and 'con' parameters in
this case when I need to join tables from different databases but the same server?
P.S. I only have read access to this server I am connecting to. I can not create new tables or views or anything like that.
Update:
The MS SQL Server version is 2008 R2.
Update 2: I am using Python 3.6 and Windows 10.
So I have found a workaround: use pymssql instead of pyodbc (both in the import statement and in the engine). It lets you build your joins using database names without specifying them in the engine, and there is no need to specify a driver in this case.
There might be a problem if you are using Python 3.6, which is not officially supported by pymssql yet, but you can find unofficial wheels for Python 3.6 here. It works as it is supposed to with my queries.
Here is the original code with joins, rebuilt to work with pymssql:
import pandas as pd
import sqlalchemy as sql
import pymssql
server = '100.10.10.10'
myQuery = '''SELECT first.Field1, second.Field2
FROM db1.schema.Table1 AS first
JOIN db2.schema.Table2 AS second
ON first.Id = second.FirstId'''
engine = sql.create_engine('mssql+pymssql://{}'.format(server))
df = pd.read_sql_query(myQuery, engine)
As for the unofficial wheels, you need to download the file for Python 3.6 from the link I gave above, then cd to the download folder and run pip install <filename>, where <filename> is the name of the downloaded wheel file.
UPDATE:
Actually, it is possible to use pyodbc too. I am not sure if this should work for any SQL Server setup, but everything worked for me after I had set 'master' as my database in the engine. The resulting code would look like this:
import pandas as pd
import sqlalchemy as sql
import pyodbc
server = '100.10.10.10'
driver = 'SQL+Server'
db = 'master'
myQuery = '''SELECT first.Field1, second.Field2
FROM db1.schema.Table1 AS first
JOIN db2.schema.Table2 AS second
ON first.Id = second.FirstId'''
engine = sql.create_engine('mssql+pyodbc://{}/{}?driver={}'.format(server, db, driver))
df = pd.read_sql_query(myQuery, engine)
The following code is working for me; I am using SQL Server (connecting directly with pyodbc):
import pyodbc
import pandas as pd
cnxn = pyodbc.connect('DRIVER=ODBC Driver 17 for SQL Server;SERVER=your_db_server_id,your_db_server_port;DATABASE=pangard;UID=your_db_username;PWD=your_db_password')
query = "SELECT * FROM database.tablename;"
df = pd.read_sql(query, cnxn)
print(df)

Get data from pandas into a SQL server with PYODBC

I am trying to understand how Python could pull data from an FTP server into pandas and then move it into SQL Server. My code here is very rudimentary, to say the least, and I am looking for any advice or help at all. I have tried to load the data from the FTP server first, which works fine. If I then remove this code and change it to a SELECT from MS SQL Server, it is fine, so the connection string works; but the insertion into SQL Server seems to be causing problems.
import pyodbc
import pandas
from ftplib import FTP
from StringIO import StringIO
import csv
ftp = FTP ('ftp.xyz.com','user','pass' )
ftp.set_pasv(True)
r = StringIO()
ftp.retrbinary('filname.csv', r.write)
pandas.read_table (r.getvalue(), delimiter=',')
connStr = ('DRIVER={SQL Server Native Client 10.0};SERVER=localhost;DATABASE=TESTFEED;UID=sa;PWD=pass')
conn = pyodbc.connect(connStr)
cursor = conn.cursor()
cursor.execute("INSERT INTO dbo.tblImport(Startdt, Enddt, x,y,z,)" "VALUES (x,x,x,x,x,x,x,x,x,x.x,x)")
cursor.close()
conn.commit()
conn.close()
print"Script has successfully run!"
When I remove the FTP code this runs perfectly, but I do not understand how to make the next jump to get this into Microsoft SQL Server, or even if it is possible without saving into a file first.
For the 'write to sql server' part, you can use the convenient to_sql method of pandas (so no need to iterate over the rows and do the insert manually). See the docs on interacting with SQL databases with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql
You will need at least pandas 0.14 to have this working, and you also need sqlalchemy installed. An example, assuming df is the DataFrame you got from read_table:
import sqlalchemy
import pyodbc
engine = sqlalchemy.create_engine("mssql+pyodbc://<username>:<password>#<dsnname>")
# write the DataFrame to a table in the sql database
df.to_sql("table_name", engine)
See also the documentation page of to_sql.
More info on how to create the connection engine with sqlalchemy for SQL Server with pyodbc can be found here: http://docs.sqlalchemy.org/en/rel_1_1/dialects/mssql.html#dialect-mssql-pyodbc-connect
But if your goal is to just get the CSV data into the SQL database, you could also consider doing this directly from SQL. See e.g. Import CSV file into SQL Server.
Python3 version using a LocalDB SQL instance:
from sqlalchemy import create_engine
import urllib
import pyodbc
import pandas as pd
df = pd.read_csv("./data.csv")
quoted = urllib.parse.quote_plus("DRIVER={SQL Server Native Client 11.0};SERVER=(localDb)\ProjectsV14;DATABASE=database")
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted))
df.to_sql('TargetTable', schema='dbo', con = engine)
result = engine.execute('SELECT COUNT(*) FROM [dbo].[TargetTable]')
result.fetchall()
Yes, the bcp utility seems to be the best solution for most cases.
If you want to stay within Python, the following code should work.
from sqlalchemy import create_engine
import urllib
import pyodbc
quoted = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER=YOUR\ServerName;DATABASE=YOur_Database")
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted))
df.to_sql('Table_Name', schema='dbo', con = engine, chunksize=200, method='multi', index=False, if_exists='replace')
Don't skip method='multi'; it significantly reduces the task execution time.
Sometimes you may encounter the following error.
ProgrammingError: ('42000', '[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request. (8003) (SQLExecDirectW)')
In such a case, determine the number of columns in your dataframe: df.shape[1]. Divide the maximum supported number of parameters by this value and use the result's floor as a chunk size.
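A minimal sketch of that calculation, reusing the to_sql call from above (2100 is the limit quoted in the error message):
import math

MAX_PARAMS = 2100                             # SQL Server's per-request parameter limit
chunk = math.floor(MAX_PARAMS / df.shape[1])  # rows per chunk so one chunk stays under the limit
df.to_sql('Table_Name', schema='dbo', con=engine, chunksize=chunk,
          method='multi', index=False, if_exists='replace')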
I found that using the bcp utility (https://learn.microsoft.com/en-us/sql/tools/bcp-utility) works best when you have a large dataset. I have 2.7 million rows that insert at 80K rows/sec. You can store your data frame as a CSV file (use tabs as the separator if your data doesn't contain tabs, and UTF-8 encoding). With bcp, I've used the "-c" format and it works without issues so far.
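A hedged sketch of that bcp flow; the server, database, table, and file names are placeholders, -c is the character format mentioned above, and -T uses Windows authentication (swap in -U/-P for SQL logins):
import subprocess

# Dump the dataframe as a tab-separated file first (no header row; bcp loads raw rows).
df.to_csv("export.tsv", sep="\t", index=False, header=False, encoding="utf-8")

# Then bulk-load it with the bcp command-line utility.
# Tab is bcp's default field terminator, so no -t flag is needed here.
subprocess.run([
    "bcp", "dbo.TargetTable", "in", "export.tsv",
    "-S", "myserver", "-d", "mydb",
    "-T",   # trusted (Windows) authentication
    "-c",   # character format, as described above
], check=True)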
This worked for me on Python 3.5.2:
import sqlalchemy as sa
import urllib
import pyodbc
conn= urllib.parse.quote_plus('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
engine = sa.create_engine('mssql+pyodbc:///?odbc_connect={}'.format(conn))
frame.to_sql("myTable", engine, schema='dbo', if_exists='append', index=False, index_label='myField')
"As the Connection represents an open resource against the database, we want to always limit the scope of our use of this object to a specific context, and the best way to do that is by using Python context manager form, also known as the with statement."
https://docs.sqlalchemy.org/en/14/tutorial/dbapi_transactions.html
The example would then be
from sqlalchemy import create_engine
import urllib
import pyodbc
connection_string = (
"Driver={SQL Server Native Client 11.0};"
"Server=myserver;"
"UID=myuser;"
"PWD=mypwd;"
"Database=mydb;"
)
quoted = urllib.parse.quote_plus(connection_string)
engine = create_engine(f'mssql+pyodbc:///?odbc_connect={quoted}')
with engine.connect() as cnn:
    df.to_sql('mytable', con=cnn, if_exists='replace', index=False)
The following is what worked for me using sqlalchemy. Pay attention to the last part: ?driver=SQL+Server.
import sqlalchemy
import pyodbc
engine = sqlalchemy.create_engine('mssql+pyodbc://MyUser:MyPWD@dataserver.sandbox.myserver/MY_DB?driver=SQL+Server')
dt.to_sql("PatientResultTest", engine,if_exists='append')
The SQL table needs an index column at the beginning to store the index value of the dataframe.
# using a class
import pandas as pd
import pyodbc
import sqlalchemy
import urllib

class data_frame_to_sql():
    def __init__(self, dataFrame, sql_table_name):
        self.dataFrame = dataFrame
        self.sql_table_name = sql_table_name

    def conversion(self):
        params = urllib.parse.quote_plus("DRIVER={SQL Server};"
                                         "SERVER=######;"
                                         "DATABASE=####;"
                                         "UID=#####;"
                                         "PWD=###;")
        try:
            engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params))
            return f"Table '{self.sql_table_name}' added successfully in database", self.dataFrame.to_sql(self.sql_table_name, engine)
        except Exception as e:
            e = str(e).replace(".", "")
            print(f"{e} in Database.")

data = {"BusinessEntityID": ["1", "2", "3"], "FirstName": ["raj", "abhi", "amir"], "LastName": ["kapoor", "bachn", "khhan"]}
df = pd.DataFrame(data, columns=['BusinessEntityID', 'FirstName', 'LastName'])
ab = data_frame_to_sql(df, "ab").conversion()
print(ab)
It's not necessary to use SQLAlchemy; one can create a connection with pyodbc directly and use it with pandas, as below (query here stands for whatever SQL you want to run):
with pyodbc.connect('DRIVER={ODBC Driver 18 for SQL Server};SERVER=' + server
                    + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password) as newconn:
    df = pd.read_sql(query, newconn)
