pyspark jdbc dataframe creation using LDAP connections from Oracle - python

I am trying to create a PySpark JDBC DataFrame from Oracle using an LDAP-based connection.
The code below works when the DataFrame is created from a normal JDBC connection string:
creds = {"user": "USER_NAME",
         "password": "PASSWORD",
         "driver": "oracle.jdbc.OracleDriver"}
connection_string = "jdbc:oracle:thin:@//hostname.com:1521/myint.domain.com"
df = spark_session.read.jdbc(url=connection_string, table=query, properties=creds)
Our database configuration has now changed, and we are required to use LDAP-based authentication only.
So I tried changing connection_string as follows:
connection_string = "myint"
But this throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o51.jdbc.
: java.lang.NullPointerException
As a test without Spark, I connected using the cx_Oracle module (a Python module for connecting to Oracle), and it worked.
Before:
import cx_Oracle

host = 'myint.domain.com'  # used as the service name below
ip = 'hostname.com'  # makedsn expects a bare host name, not a JDBC URL
port = 1521
conn = cx_Oracle.makedsn(ip, port, service_name=host)
db = cx_Oracle.connect('USER_NAME', 'PASSWORD', conn) # giving the conn object
cursor = db.cursor()
cursor.execute("""select * from myschema.mytable fetch first 5 rows only""")
for row in cursor:
    print(row)
After:
db = cx_Oracle.connect('USER_NAME', 'PASSWORD', "myint") # Now, just giving only the database
cursor = db.cursor()
cursor.execute("""select * from myschema.mytable fetch first 5 rows only""")
for row in cursor:
    print(row)
I need to achieve the same with a PySpark JDBC DataFrame by passing only the database name. Please suggest what needs to be done.
I expect the same question applies to Scala Spark as well.
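One possible direction, sketched under assumptions rather than confirmed for this setup: Spark passes the url string straight to the Oracle thin driver, so a bare alias such as "myint" is not a JDBC URL. The thin driver does accept an LDAP form of the URL, jdbc:oracle:thin:@ldap://host:port/service,context. The directory host ldap.domain.com, port 389, and the context cn=OracleContext,dc=domain,dc=com below are placeholders; the real values would normally be found in the ldap.ora file that the working cx_Oracle connection relies on. The helper names are hypothetical.

```python
def build_oracle_ldap_url(ldap_host, ldap_port, service, context):
    """Return a thin-driver URL that resolves `service` through an LDAP directory."""
    return f"jdbc:oracle:thin:@ldap://{ldap_host}:{ldap_port}/{service},{context}"


def read_table(spark_session, url, table, user, password):
    """Read `table` into a DataFrame; requires a live SparkSession and the Oracle JDBC jar."""
    props = {"user": user,
             "password": password,
             "driver": "oracle.jdbc.OracleDriver"}
    return spark_session.read.jdbc(url=url, table=table, properties=props)


# Placeholder directory host and context: replace with the values from ldap.ora.
url = build_oracle_ldap_url("ldap.domain.com", 389, "myint",
                            "cn=OracleContext,dc=domain,dc=com")
print(url)
```

With a running session, read_table(spark_session, url, query, "USER_NAME", "PASSWORD") would then replace the original read.jdbc call. The same URL string should work from Scala Spark, since it is interpreted by the JDBC driver and not by Spark.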

Related

How to use pyodbc to read in SQL results generated from joining 2 CTEs from 2 databases

I have a SQL script that joins two CTEs, one from database1 and the other from database2. It runs successfully in SQL Server.
However, I'd like to connect Python to SQL Server using the pyodbc package (as below) so that I can read in the results directly. Since only one database can be specified in the following code, how do I establish the connection if my SQL script references two different databases?
import pandas as pd
import pyodbc

conn = pyodbc.connect('Driver={SQL Server Native Client 11.0};'
                      'Server=server;'
                      'Database=database1;'
                      'InitialCatalog=dbo;'
                      'Trusted_Connection=yes;')
# Read the SQL script and load its result set into a DataFrame
with open(file_path, 'r') as query:
    df = pd.read_sql_query(query.read(), conn)
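A minimal sketch of one way this is usually handled, assuming both databases live on the same SQL Server instance and the login can read both: the Database value in the connection string only sets the default database, and a query can still reach another database through three-part names (database.schema.table). The table and column names below (orders, customers, id, amount, region) and the helper names are hypothetical.

```python
def three_part(database, schema, table):
    """Quote a SQL Server object name as [database].[schema].[table]."""
    return ".".join(f"[{part.replace(']', ']]')}]" for part in (database, schema, table))


# Each CTE names its own database explicitly, so one connection is enough.
query = f"""
WITH a AS (SELECT id, amount FROM {three_part('database1', 'dbo', 'orders')}),
     b AS (SELECT id, region FROM {three_part('database2', 'dbo', 'customers')})
SELECT a.id, a.amount, b.region
FROM a JOIN b ON a.id = b.id;
"""


def run(conn_str, sql):
    """Execute `sql` over a single pyodbc connection and return a DataFrame."""
    import pandas as pd   # third-party, imported lazily
    import pyodbc         # third-party, imported lazily
    conn = pyodbc.connect(conn_str)
    try:
        return pd.read_sql_query(sql, conn)
    finally:
        conn.close()


print(query)
```

If the existing script already uses fully qualified names, it can probably be passed to pd.read_sql_query unchanged.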

Use Turbodbc to connect to SQL Server, simple select

I cannot get Python turbodbc to read or write user tables in SQL Server, simple as that seems. I have established an ODBC connection, however, and can print a list of objects from it.
1. List objects from the server to test the connection. This seems to work:
from turbodbc import connect, make_options
options = make_options(prefer_unicode=True)
connection = connect(dsn='FPA', turbodbc_options=options)
cursor = connection.cursor()
cursor.execute('''SELECT * FROM sys.objects WHERE schema_id = SCHEMA_ID('dbo');''')
2. Simple select. This does not work:
cursor.execute('''SELECT * from [dbo].[Kits_Rec];''')
From #1 I get the list of objects (screenshot not included).
From #2 I get the message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name 'dbo.Kits_Rec'.
A SQL Server instance contains multiple databases. Your DSN "FPA" probably doesn't specify a database name, so you are connecting to the master database instead of the database that contains your Kits_Rec table.
Fix the DSN entry to specify the correct database, or use this syntax instead:
connection = connect(driver="MSSQL Driver",
                     server="hostname",
                     port="1433",
                     database="myDataBase",
                     uid="myUsername",
                     pwd="myPassword")
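To check the diagnosis before changing the DSN, a small sketch like the following could report which database the connection actually landed in. It relies only on the standard DB-API cursor methods execute and fetchone and on SQL Server's DB_NAME() function; the helper name current_database is hypothetical.

```python
def current_database(cursor):
    """Return the name of the database the cursor's connection is using."""
    cursor.execute("SELECT DB_NAME();")
    row = cursor.fetchone()
    return row[0]
```

Calling current_database(connection.cursor()) on the "FPA" connection would be expected to return master if the DSN does not name a database.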

Unable to execute Excel SQL Query from Python application

From my Python application, I am trying to open an ADODB connection to an Excel workbook. The code is as below:
import win32com.client

# Create Connection object and connect to database.
ado_conn = win32com.client.gencache.EnsureDispatch('ADODB.Connection')
ado_conn.ConnectionString = "Provider = Microsoft.ACE.OLEDB.12.0; Data Source = C:\\Test1.xlsx; Extended Properties ='Excel 12.0 Xml;HDR=YES'"
ado_conn.Open()
# Now create a RecordSet object and open a table
oRS = win32com.client.gencache.EnsureDispatch('ADODB.Recordset')
oRS.ActiveConnection = ado_conn # Set the recordset to connect thru ado_conn
oRS.Open("SELECT * FROM [Orders]")
When I debug the application, it throws the error:
com_error(-2147352567, 'Exception occurred.', (0, 'Microsoft Access Database Engine', "The Microsoft Access database engine could not find the object 'Orders'. Make sure the object exists and that you spell its name and the path name correctly. If 'Orders' is not a local object, check your network connection or contact the server administrator.", None, 5003011, -2147217865), None)
In the Excel sheet the connection string looks like this:
Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=Orders;Extended Properties=""
The Command Text is:
Select * from [Orders]
Even though the connection exists, this error is thrown.
How can I execute the above Excel query from a Python application?
Edit: Excel connection screenshot added below
To query Excel workbooks with the Jet/ACE SQL engine, you must reference each worksheet with the $ suffix, which can be extended with a specific cell range:
SELECT * from [Orders$]
SELECT * from [Orders$B4:Y100]
That said, consider querying Excel workbooks directly from Python through either the OLEDB provider or the ODBC driver, which avoids handling the Windows ADO COM objects yourself:
# OLEDB PROVIDER
import adodbapi
conn = adodbapi.connect("PROVIDER=Microsoft.ACE.OLEDB.12.0;"
                        "Data Source=C:\\Test1.xlsx;"
                        "Extended Properties='Excel 12.0 Xml;HDR=YES'")
cursor = conn.cursor()
cursor.execute("SELECT * FROM [Orders$]")
for row in cursor.fetchall():
    print(row)
# ODBC DRIVER
import pyodbc
# Raw string keeps the backslashes in the path literal;
# autocommit=True because the Excel ODBC driver does not support transactions
conn = pyodbc.connect("DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
                      r"DBQ=C:\Path\To\Excel.xlsx;", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT * FROM [Orders$]")
for row in cursor.fetchall():
    print(row)
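The $ sheet reference used in the answer above can be generated with a small helper, shown here as a sketch. The function name sheet_ref is hypothetical, and it only covers sheet names without square brackets.

```python
def sheet_ref(sheet, cell_range=None):
    """Build the bracketed table name Jet/ACE expects for a worksheet.

    sheet_ref("Orders")            gives [Orders$]
    sheet_ref("Orders", "B4:Y100") gives [Orders$B4:Y100]
    """
    if "[" in sheet or "]" in sheet:
        raise ValueError("sheet names containing brackets are not handled here")
    return f"[{sheet}${cell_range or ''}]"


query = f"SELECT * FROM {sheet_ref('Orders')}"
print(query)
print(f"SELECT * FROM {sheet_ref('Orders', 'B4:Y100')}")
```

The resulting string can be passed to cursor.execute in either the adodbapi or the pyodbc variant.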

How can I connect to Impala using a keytab?

I am trying to connect to an Impala database from a Python script using a keytab instead of the normal user/password combination, but I have been unable to find any tutorials online. The code I am currently using is:
conn = connect(host=impala_host, port=impala_port, use_ssl=True, auth_mechanism="PLAIN", user=username, password=pwd, database=impala_db)
cursor = conn.cursor()
However, I want to connect using keytab instead of my password.
It appears that you are using this library: https://github.com/cloudera/impyla
Please have a look at the Usage section of its README.md:
from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()
...
...
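The README usage above does not cover the keytab case itself. A rough sketch of one common approach, assuming the cluster uses Kerberos, follows: obtain a ticket from the keytab with kinit, then connect with auth_mechanism="GSSAPI" so that no password is passed. The keytab path, the principal, the service name "impala", and both helper names are assumptions for illustration, and impyla needs its Kerberos/SASL dependencies installed for GSSAPI to work.

```python
import subprocess


def kinit_command(keytab_path, principal):
    """Build the kinit invocation that obtains a ticket from a keytab."""
    return ["kinit", "-kt", keytab_path, principal]


def connect_with_keytab(host, port, database, keytab_path, principal):
    """Get a Kerberos ticket from the keytab, then open an impyla connection."""
    from impala.dbapi import connect  # third-party (impyla), imported lazily
    subprocess.run(kinit_command(keytab_path, principal), check=True)
    return connect(host=host, port=port, database=database, use_ssl=True,
                   auth_mechanism="GSSAPI",
                   kerberos_service_name="impala")  # assumed service name


# Placeholder keytab path and principal
print(kinit_command("/path/to/user.keytab", "user@EXAMPLE.COM"))
```

The returned connection is then used exactly as in the question: conn.cursor(), cursor.execute(...), and so on.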

How to retrieve each column value using sql alchemy

I am using SQLAlchemy to connect to my Oracle 11g database, which resides on a Linux machine. I am writing a Python script to connect to the database and retrieve values from a particular table.
The code I wrote is:
import configparser
from sqlalchemy import create_engine

def db_connection():
    # Read the connection settings from the config file
    config = configparser.RawConfigParser()
    config.read('CMDC_Analyser.cfg')
    USER = config.get('DB_Connector', 'db.user_name')
    PASSWORD = config.get('DB_Connector', 'db.password')
    SID = config.get('DB_Connector', 'db.sid')
    IP = config.get('DB_Connector', 'db.ip')
    PORT = config.get('DB_Connector', 'db.port')
    engine = create_engine('oracle://{user}:{pwd}@{ip}:{port}/{sid}'.format(user=USER, pwd=PASSWORD, ip=IP, port=PORT, sid=SID), echo=False)
    global connection
    connection = engine.connect()
    p = connection.execute("select * from ssr.bouquet")
    for columns in p:
        print(columns)
    connection.close()
This prints every value in the table. I want to select the value of one particular column only, so I used the following code:
for columns in p:
    print(columns['BOUQUET_ID'])
But here I get the following error:
sqlalchemy.exc.NoSuchColumnError: "Could not locate column in row for column 'BOUQUET_ID'"
How to fix this?
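A likely cause, offered as a hedged note: SQLAlchemy's Oracle dialect treats all-uppercase Oracle names as case-insensitive and reports them in lowercase, so columns['bouquet_id'] would be the first thing to try. The sketch below shows a case-insensitive lookup that works on any mapping-like row; the helper name column_value is hypothetical, and on SQLAlchemy 1.4 or later the row's _mapping attribute would be passed instead of the row itself.

```python
def column_value(row, name):
    """Look up `name` in a mapping-like row, ignoring upper/lower case."""
    wanted = name.lower()
    for key in row.keys():
        if key.lower() == wanted:
            return row[key]
    raise KeyError(name)


# Plain dict standing in for a result row with lowercase keys
sample_row = {"bouquet_id": 7, "bouquet_name": "basic"}
print(column_value(sample_row, "BOUQUET_ID"))
```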
