Is there a way to connect to Databricks using ADO.NET? - python

I want to build an API which queries Databricks tables and outputs the results as JSON. One way to achieve this is calling the Databricks Jobs REST API to execute a job and read the job output, but that has data size limitations (a max of 5 MB, and my API result set can exceed 20 MB). Instead, can I connect to Databricks using the JDBC/ODBC endpoint provided by the cluster from Microsoft .NET, or is there any other way to connect directly? My API layer preferably needs to be built in Microsoft .NET; however, I'm willing to try Python.

I found a few ways to connect to the Databricks cluster:
Connect using an ODBC connection with the Simba drivers (https://pages.databricks.com/ODBC-Driver-Download.html), also shared in the comment above by @EdHarper.
Use the CData NuGet package (https://www.cdata.com/drivers/spark/ado/), but there's a license cost involved.
Use the JDBC connection string provided by the cluster; this may need Java code.
I went ahead with option #1, and below is sample C# code.
// Requires: using System.Data.Odbc;
// Build the connection string
OdbcConnectionStringBuilder odbcConnectionStringBuilder = new OdbcConnectionStringBuilder
{
    Driver = "Simba Spark ODBC Driver"
};
odbcConnectionStringBuilder.Add("Host", "adb-xxxxxxxxxxxxx.7.xxxxxxxbricks.net");
odbcConnectionStringBuilder.Add("Port", "443");
odbcConnectionStringBuilder.Add("SSL", "1");
odbcConnectionStringBuilder.Add("ThriftTransport", "2");
odbcConnectionStringBuilder.Add("AuthMech", "3"); // token-based authentication
odbcConnectionStringBuilder.Add("UID", "token");
odbcConnectionStringBuilder.Add("PWD", "<Access token generated in Databricks>");
odbcConnectionStringBuilder.Add("HTTPPath", "sql/protocolv1/o/xxxxxxxxxxxxxxx/yyyy8-dfcccf-tyyujjk8");

using (OdbcConnection connection = new OdbcConnection(odbcConnectionStringBuilder.ConnectionString))
using (OdbcCommand command = new OdbcCommand("select * from yourdb.TableName", connection))
{
    connection.Open();
    using (OdbcDataReader reader = command.ExecuteReader())
    {
        // Print the column headers...
        for (int i = 0; i < reader.FieldCount; i++)
        {
            Console.Write(reader.GetName(i) + "\t");
        }
        Console.Write("\n");
        // ...then the rows
        while (reader.Read())
        {
            for (int i = 0; i < reader.FieldCount; i++)
            {
                Console.Write(reader[i] + "\t");
            }
            Console.Write("\n");
        }
    }
}
Additionally, you can connect using a DSN if you prefer; more details here - https://www.simba.com/products/Spark/doc/v1/ODBC_InstallGuide/win/content/odbc/hi/windows/dsn.htm
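If you do end up trying Python instead of .NET, a lighter option is Databricks' own SQL connector. This is a minimal sketch, assuming the databricks-sql-connector package (pip install databricks-sql-connector); the host, HTTP path, and token placeholders mirror the ODBC example above:

from databricks import sql

# All three values are placeholders; copy them from the cluster's
# JDBC/ODBC tab and a personal access token generated in Databricks.
with sql.connect(server_hostname="adb-xxxxxxxxxxxxx.7.xxxxxxxbricks.net",
                 http_path="sql/protocolv1/o/xxxxxxxxxxxxxxx/yyyy8-dfcccf-tyyujjk8",
                 access_token="<Access token generated in Databricks>") as connection:
    with connection.cursor() as cursor:
        cursor.execute("select * from yourdb.TableName")
        for row in cursor.fetchall():
            print(row)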

Related

execute copy command from aws glue to connect to redshift

I'm trying to execute a copy command in Redshift via Glue:
redshiftsql = "copy table from 's3://bucket/test' credentials 'aws-iam-role' format as json 'auto';"
I'm connecting using the below syntax:
from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir="", transformation_ctx="")
What is the value I need to pass for frame? Any thoughts? I appreciate your response.
Apparently you can pass "extracopyoptions":"" in the connection_options object for Redshift, anything from here: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
See this archived question from AWS premium support.
https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
By the way, this is very poorly documented in my opinion.
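To tie it together, here is a hedged sketch of how frame and extracopyoptions fit into from_jdbc_conf - the frame argument is the DynamicFrame holding the rows to load, and the connection name, database, table, and S3 paths below are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# `frame` is whatever DynamicFrame you built earlier in the job,
# e.g. read from the Glue Data Catalog (database/table are placeholders)
frame = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="mytable")

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="my-redshift-connection",
    connection_options={
        "database": "dev",
        "dbtable": "public.test",
        # extra parameters appended to the COPY command Glue issues
        "extracopyoptions": "FORMAT AS JSON 'auto'",
    },
    redshift_tmp_dir="s3://bucket/temp/",
)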

Migrating multiple tables data from SQL Server to Oracle

I have a scenario to migrate SQL Server tables (30-40 tables) to Oracle. I cannot depend on SSIS, as the number of tables to be migrated to Oracle changes regularly and I cannot always create or update a DFT whenever there is a change in schema.
Is there any other way where the movement of data can be handled dynamically and work effectively? Like using Python or any other programming language?
C# approach - SchemaMapper library
Since you are open to a solution using a programming language, I think you can benefit from the SchemaMapper class library, which is an open-source project published on GitHub. A full description can be found in the Readme file at the link above.
Important Note: Yesterday I added support for reading data from databases (SQL Server, Oracle, ...) and the ability to export data to Oracle.
In this answer I will provide information on importing SQL Server tables, creating the appropriate SchemaMapper class for each one (since they have different schemas and you need to import them to different schemas), and exporting data to Oracle.
//First of all, list the table names that need to be imported
string[] TableNameFilter = new[] { "Table1", "Table2" };

//Create an instance of the Oracle export class
SchemaMapper.Exporters.OracleExport expOracle = new SchemaMapper.Exporters.OracleExport(oracleconnectionstring);

//Create an SQL Server import class
using (SchemaMapper.Converters.SqlServerCeImport ssImport = new SchemaMapper.Converters.SqlServerCeImport(sqlconnectionstring))
{
    //Retrieve table names
    ssImport.getSchemaTable();

    //Loop over tables matching the filter
    foreach (DataRow drRowSchema in ssImport.SchemaTable.AsEnumerable().Where(x =>
             TableNameFilter.Contains(x["TABLE_NAME"].ToString())).ToList())
    {
        string SQLTableName = drRowSchema["TABLE_NAME"].ToString();
        string SQLTableSchema = drRowSchema["TABLE_SCHEMA"].ToString();
        DataTable dtSQL = ssImport.GetDataTable(SQLTableSchema, SQLTableName);

        //Create a schema mapping class
        using (SchemaMapper.SchemaMapping.SchemaMapper sm = new SchemaMapper.SchemaMapping.SchemaMapper(SQLTableSchema, SQLTableName))
        {
            foreach (DataColumn dc in dtSQL.Columns)
            {
                SchemaMapper_Column smCol = new SchemaMapper_Column();
                smCol.Name = dc.ColumnName;
                smCol.DataType = smCol.GetCorrespondingDataType(dc.DataType.ToString(), dc.MaxLength);
                sm.Columns.Add(smCol);
            }

            //Create the destination table in Oracle
            expOracle.CreateDestinationTable(sm);

            //Insert the data
            expOracle.InsertUsingOracleBulk(sm, dtSQL);

            //There are other methods, such as:
            //expOracle.InsertIntoDb(sm, dtSQL);
            //expOracle.InsertIntoDbWithParameters(sm, dtSQL);
        }
    }
}
Note: this is an open-source project; it is not fully tested and not all data types are supported. If you encounter errors, feel free to give feedback or add an issue on GitHub.
Other approach - SQL Server Import and Export Wizard
If you can do this without scheduling a Job, then you can use the Import and Export Wizard which allows you to import multiple tables into Oracle without the need to build the packages manually. It will create packages, destination tables, map columns and import data.
Start the SQL Server Import and Export Wizard
Connect to an Oracle Data Source (SQL Server Import and Export Wizard)
Here is the approach I decided to go with, considering the time constraint (using C# was taking more time). For an 8 GB table it takes 11 minutes to move the data from SQL Server to Oracle.
Steps:
Dump the SQL tables' data into flat files (used BIML for automating the DFT creation).
Transfer these flat files to the destination server.
Use SQL*Loader to load data from the flat files into Oracle.
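Since the question also asked about Python: here is a rough, untested sketch of the dynamic table-list approach using pandas and SQLAlchemy, assuming the pyodbc and cx_Oracle drivers are installed; both connection strings, the schema, and the chunk size are placeholders:

import pandas as pd
from sqlalchemy import create_engine, inspect

# Placeholder connection strings for source (SQL Server) and target (Oracle)
src = create_engine("mssql+pyodbc://user:pwd@SqlServerHost/SourceDb?driver=ODBC+Driver+17+for+SQL+Server")
dst = create_engine("oracle+cx_oracle://user:pwd@OracleHost:1521/?service_name=ORCL")

# Discover the tables at run time instead of hard-coding a DFT per table
for table in inspect(src).get_table_names(schema="dbo"):
    # Stream rows in chunks so large tables do not exhaust memory
    for chunk in pd.read_sql_table(table, src, schema="dbo", chunksize=50000):
        chunk.to_sql(table.lower(), dst, if_exists="append", index=False)

This avoids the per-table package maintenance, though bulk tools (BCP, SQL*Loader) will still be faster for very large tables, as the timing above suggests.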

Portable remote connection sucking data from MySQL to Excel

Are there any alternatives for Excel users to pull data from MySQL through a remote connection, without bothering to establish an ODBC connection in the Control Panel, download MySQL extensions, or do anything else on the user side?
Are there any connectors on the MySQL side which would turn data into a format readable by Excel? I am looking for dynamic solutions - reading from the database. I am not looking for static solutions like export to CSV and import to Excel. Solutions done in any programming language like Python are acceptable.
This ADO connection looked promising to me, however extra work on the user side is still required at start-up: How can VBA connect to MySQL database in Excel?
I would like to make a portable Excel file with a macro that will work on any computer, in any case.
Hope you are looking for something like this:
Sub test1()
    Dim con As ADODB.Connection
    Dim rec As ADODB.Recordset
    Dim qry1 As String
    Dim i As Long

    Set con = New ADODB.Connection
    Set rec = New ADODB.Recordset
    rec.CursorLocation = adUseClient

    ' Note: this example uses the SQL Server OLE DB provider; for MySQL you
    ' would supply a MySQL ODBC/OLE DB connection string instead.
    con.Open ("Provider=SQLOLEDB;Data Source=.;Initial Catalog=databasename;user ID=sa; password=sa#123;")

    qry1 = "select * from [dbo].[FARA];"
    rec.Open qry1, con

    For i = 1 To rec.RecordCount
        Debug.Print rec(0), rec(1), rec(2)
        rec.MoveNext ' advance the cursor, otherwise the same row repeats
    Next i
End Sub
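If a server-side Python solution is acceptable, another option is to generate the workbook outside Excel entirely, so users need nothing installed. A rough sketch, assuming the mysql-connector-python and openpyxl packages; the host, credentials, and table are placeholders:

import mysql.connector
from openpyxl import Workbook

# Placeholder credentials; run this on any machine that can reach the MySQL server
conn = mysql.connector.connect(host="db.example.com", user="user",
                               password="secret", database="mydb")
cursor = conn.cursor()
cursor.execute("SELECT * FROM FARA")

wb = Workbook()
ws = wb.active
ws.append([col[0] for col in cursor.description])  # header row
for row in cursor.fetchall():
    ws.append(list(row))
wb.save("fara.xlsx")
conn.close()

The trade-off is that the data is only as fresh as the last run, so you would schedule this script rather than refresh from inside Excel.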

Extracting raw data from a PowerPivot model using Python

What seemed like a trivial task turned into a real nightmare when I had to read in some data from a PowerPivot model using Python. I believe I've researched this very well over the last couple of days but now I hit a brick wall and would appreciate some help from the Python/SSAS/ADO community.
Basically, all I want to do is programmatically access raw data stored in PowerPivot models - my idea was to connect to the underlying PowerPivot (i.e. MS Analysis Services) engine via one of the methods listed below, list the tables contained in the model, then extract the raw data from each table using a simple DAX query (something like EVALUATE (table_name)). Easy peasy, right? Well, maybe not.
0. Some Background Information
As you can see, I've tried several different approaches. I'll try to document everything as carefully as possible so that those uninitiated in PowerPivot functionality will have a good idea of what I'd like to do.
First of all, some background on programmatic access to Analysis Services engine (it says 2005 SQL Server, but all of it ought to still be applicable): SQL Server Data Mining Programmability and Data providers used for Analysis Services connections.
The sample Excel/PowerPivot file I'll be using in the example below can be found here: Microsoft PowerPivot for Excel 2010 and PowerPivot in Excel 2013 Samples.
Also, note that I'm using Excel 2010, so some of my code is version-specific. E.g. wb.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection should be wb.Model.DataModelConnection.ModelConnection.ADOConnection if you're using Excel 2013.
The connection string I'll be using throughout this question is based on the information found here: Connect to PowerPivot engine with C#. Additionally, some of the methods apparently require some sort of initialization of the PowerPivot model prior to data retrieval. See here: Automating PowerPivot Refresh operation from VBA.
Finally, here are a couple of links showing that this should be achievable (note, however, that these links mainly refer to C#, not Python):
Made connection to PowerPivot DataModel, how can I fill a dataset with it?
Connecting to PowerPivot with C#
2013 C# connection to PowerPivot DataModel
Connecting Tableau and PowerPivot. It just works. (showing that external apps can in fact read PowerPivot model data - note that the Tableau add-in installs Interop.ADODB.dll assembly, which I guess is what it uses to access the PowerPivot data)
1. Using ADOMD
import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
import Microsoft.AnalysisServices.AdomdClient as ADOMD
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = ADOMD.AdomdConnection(ConnString)
Connection.Open()
Here, it appears the problem is that the PowerPivot model has not been initialized:
AdomdConnectionException: A connection cannot be made. Ensure that the server is running.
2. Using AMO
import clr
clr.AddReference("Microsoft.AnalysisServices")
import Microsoft.AnalysisServices as AMO
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = AMO.Server()
Connection.Connect(ConnString)
Same story, "the server is not running":
ConnectionException: A connection cannot be made. Ensure that the server is running.
Note that AMO is technically not used for querying data, but I included it as one of the potential ways of connecting to the PowerPivot model.
3. Using ADO.NET
import clr
clr.AddReference("System.Data")
import System.Data.OleDb as ADONET
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = ADONET.OleDbConnection()
Connection.ConnectionString = ConnString
Connection.Open()
This is similar to What's the simplest way to access mssql with python or ironpython?. Unfortunately, this also doesn't work:
OleDbException: OLE DB error: OLE DB or ODBC error: The following system error occurred:
The requested name is valid, but no data of the requested type was found.
4. Using ADO via adodbapi module
import adodbapi
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = adodbapi.connect(ConnString)
Similar to Opposite Workings of OLEDB/ODBC between Python and MS Access VBA. The error I get is:
OperationalError: (com_error(-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB
Provider for SQL Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The
following system error occurred: The requested name is valid, but no data of the requested
type was found...
This is basically the same problem as with ADO.NET above.
5. Using ADO via Excel/win32com module
from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
Connection = Workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
Recordset = Dispatch('ADODB.Recordset')
Query = "EVALUATE(dbo_DimDate)" #sample DAX query
Recordset.Open(Query, Connection)
The idea for this approach came from this blog post that uses VBA: Export a table or DAX query from Power Pivot to CSV using VBA. Note that this approach uses an explicit Refresh command that initializes the model (i.e. "server"). Here's the error message:
com_error: (-2147352567, 'Exception occurred.', (0, u'ADODB.Recordset', u'Arguments are of
the wrong type, are out of acceptable range, or are in conflict with one another.',
u'C:\\Windows\\HELP\\ADO270.CHM', 1240641, -2146825287), None)
It appears, however, that the ADO connection has been established:
type(Connection) returns instance
print(Connection) returns Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member Mode=Error;Subqueries=2;Optimize Response=3;Cell Error Mode=TextValue
It seems the problem lies in the creation of the ADODB.Recordset object.
6. Using ADO via Excel/win32com, direct use of ADODB.Connection
from win32com.client import Dispatch
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnString)
Similar to Connection to Access from Python [duplicate] and Query access using ADO in Win32 platform (Python recipe). Unfortunately, the error Python spits out is the same as in the two examples above:
com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The following system
error occurred: The requested name is valid, but no data of the requested type was found.
..', None, 0, -2147467259), None)
7. Using ADO via Excel/win32com, direct use of ADODB.Connection plus model refresh
from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
ConnStringInternal = "Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=
Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX
Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member
Mode=Error;Optimize Response=3;Cell Error Mode=TextValue"
Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnStringInternal)
I was hoping I could initialize an instance of Excel, then initialize the PowerPivot model, and then create a connection using the internal connection string Excel uses for embedded PowerPivot data (similar to How do you copy the powerpivot data into the excel workbook as a table? - note that the connection string is different from the one I've used elsewhere). Unfortunately, this doesn't work and my guess is that Python starts the ADODB.Connection process in a separate instance (as I get the same error message when I execute the last three rows without first initializing Excel, etc.):
com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'Either the user, ****** (masked), does not have access
to the Microsoft_SQLServer_AnalysisServices database, or the database does not exist.',
None, 0, -2147467259), None)
Lo and behold, I finally managed to crack the problem - it turns out that accessing Power Pivot data using Python is indeed possible! Below is a short recap of what I did - you can find a more detailed description here: Analysis Services (SSAS) on a shoestring. Note: the code has been optimized neither for efficiency nor elegance.
Install Microsoft Power BI Desktop (comes with free Analysis Services server, so no need for a costly SQL Server license - however, the same approach obviously also works if you have a proper license).
Fire up the AS engine by first creating the msmdsrv.ini settings file, then restore the database from the ABF file (using AMO.NET), then extract data using ADOMD.NET.
Here's the Python code that illustrates the AS engine + AMO.NET parts:
import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):
    #create random folder
    os.chdir(pathTarget)
    folder = os.getcwd() + str(random.randrange(10**6, 10**7))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            source = archive.open(member)
            target = file(os.path.join(folder, abfname), 'wb')  #Python 2 'file'; use open() on Python 3
            shutil.copyfileobj(source, target)
            del target
    archive.close()

    #start the cmd.exe process to get its PID
    listPIDpre = [proc for proc in psutil.process_iter()]
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    listPIDpost = [proc for proc in psutil.process_iter()]
    pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
    pid = str(pid).split("=")[1].split(",")[0]

    #msmdsrv.ini template; {0}, {1}, {2} are filled below
    msmdsrvText = '''<ConfigurationSettings>
<DataDir>{0}</DataDir>
<TempDir>{0}</TempDir>
<LogDir>{0}</LogDir>
<BackupDir>{0}</BackupDir>
<DeploymentMode>2</DeploymentMode>
<RecoveryModel>1</RecoveryModel>
<DisklessModeRequested>0</DisklessModeRequested>
<CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
<AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
<Network>
  <Requests>
    <EnableBinaryXML>1</EnableBinaryXML>
    <EnableCompression>1</EnableCompression>
  </Requests>
  <Responses>
    <EnableBinaryXML>1</EnableBinaryXML>
    <EnableCompression>1</EnableCompression>
    <CompressionLevel>9</CompressionLevel>
  </Responses>
  <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
</Network>
<Port>{1}</Port>
<PrivateProcess>{2}</PrivateProcess>
<InstanceVisible>0</InstanceVisible>
<Language>1033</Language>
<Debug>
  <CallStackInError>0</CallStackInError>
</Debug>
<Log>
  <Exception>
    <CrashReportsFolder>{0}</CrashReportsFolder>
  </Exception>
  <FlightRecorder>
    <Enabled>0</Enabled>
  </FlightRecorder>
</Log>
<AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
<ResourceGovernance>
  <GovernIMBIScheduler>0</GovernIMBIScheduler>
</ResourceGovernance>
<Feature>
  <ManagedCodeEnabled>1</ManagedCodeEnabled>
</Feature>
<VertiPaq>
  <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
  <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
</VertiPaq>
</ConfigurationSettings>'''

    #save ini file to disk, fill it with required parameters
    msmdsrvini = open(folder + "\\msmdsrv.ini", "w")
    msmdsrvText = msmdsrvText.format(folder, port, pid)  #{0},{1},{2}
    msmdsrvini.write(msmdsrvText)
    msmdsrvini.close()

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/", "\\"), folder)
    process.stdin.write(initString + " \n")

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()
    return process
And the data-extraction part:
def runQuery(query, port, flag):
    #connect via the ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand()
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count  #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
    data = []
    while DataReader.Read():
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except:
                row.append(DataReader[j])
        data.append(row)
    df = pandas.DataFrame(data)
    df.columns = columnNames

    if flag == 0:
        DataReader.Close()
        ADOMDConn.Close()
        return df
    else:
        #also build a metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata)
        metadf.columns = metadataColumnNames

        DataReader.Close()
        ADOMDConn.Close()
        return df, metadf
The raw data are then extracted via something like this:
pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)
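Note that endSession isn't defined in the excerpt above; a minimal sketch, assuming session is the subprocess.Popen handle returned by restorePowerPivot, is simply to terminate the cmd.exe shell hosting the engine:

def endSession(process):
    # Kill the cmd.exe host; the private msmdsrv.exe instance bound to its
    # PID through the ini's <PrivateProcess> setting shuts down with it.
    process.kill()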
The problem with getting data out of PowerPivot is that the tabular engine in PowerPivot runs in-process inside Excel and the only way to connect to that engine is to have your code running inside Excel too. (I suspect that it may use shared memory or some other transport, but it's definitely not listening on a TCP port or a named pipe or anything like that which would allow an external process to connect)
We do this in DAX Studio by running a C# VSTO Excel add-in in Excel. However, that was only designed to work for testing analytic queries, not for doing bulk data extraction. We marshal the data across from the add-in to the UI using a string variable, so the entire dataset must be less than 2 GB or the response gets truncated and you will see an "unrecognizable response" error (the data is serialized into an XMLA rowset, which is quite verbose, so you may see it break when extracting only a few hundred MB of data).
If you wanted to build a script to automate extracting all the raw data from a model, I don't think you will be able to do it with Python, as I don't believe you can get the Python interpreter running in-process inside Excel. I would look at using a VBA macro like this one: http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power-pivot-to-csv-using-vba/
You should find that you can query the model for a list of tables with something like "SELECT * FROM $SYSTEM.DBSCHEMA_TABLES" - you could then loop over each table and extract with a variation of the code in the above link.
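In Python terms, reusing the runQuery helper from the solution above (once the model has been restored to a local AS instance), that loop might look like this sketch; the port matches the restorePowerPivot call, and the filter on names starting with "$" is an assumption to skip internal rowsets:

# Hedged sketch: enumerate the model's tables via the DMV, then dump each
# one with a DAX EVALUATE. Table names with spaces must be quoted in DAX.
tables = runQuery("SELECT * FROM $SYSTEM.DBSCHEMA_TABLES", 60000, 0)
for name in tables["TABLE_NAME"]:
    name = str(name)
    if not name.startswith("$"):  # assumed filter for internal tables; adjust as needed
        df = runQuery("EVALUATE '{0}'".format(name), 60000, 0)
        df.to_csv(name + ".csv", index=False, encoding="utf-8")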
I got in touch with Tom Gleeson (aka Gobán Saor) who was kind enough to let me post his emails here. There are some interesting nuggets in them, so hopefully others will also find them useful.
Email #1
When you say Python, you mean running Python.NET as a standalone exe?
If that’s the case, you’re out of luck with Excel PP models (different
story for Power BI desktop though). I’ve accessed PP models (2010+)
successfully from both VBA, and from Python.NET (via AMO) using
similar code to that in your SO question. The difference being (in
both VBA & .NET version) is that my code is running in-process within
Excel using Excel’s various add-in technologies. (Likely Tableau is
also running as an add-in or has embedded Excel within itself enabling
similar behaviour). DAX Studio (a useful C# code base to learn the
how-tos of PP access) runs both as an Excel add-in and as a standalone
EXE, but only as an add-in can it access Excel based PP models.
Email #2
You might find the process of using Python.NET for this somewhat
challenging. You would need to embed a Python engine using C#/VB.NET
Excel add-in code. I’ve used Excel-DNA (a fantastic open source
project) rather than MS’s highly cumbersome "official" method for
developing such .NET addins in the past, but I mainly stick to VBA
where at all possible.
Using VBA you’ll not be able to access the
.NET-only AMO (so no ability to create calculated columns on the fly),
but by loading the resulting dataset into an ADO recordset you should
be able to output to a worksheet OR to a corporate-database/MS Access
OR to a flat-file/CSV etc.
Unlike the 1M worksheet limit, for a
flat-file or database output memory (RAM) will be the limiting factor,
but, assuming you’re using 64bit Excel and have enough memory to hold
the compacted model and the workspace for the largest of the model’s
tables in un-compacted form (i.e. a row based rather than column based
format that’ll result from a DAX Query), multiplied by 2ish (one
instance within PP workspace the other within VBA’s ADO workspace) you
should be okay.
Having said that, I’ve never attempted extracting a
very large dataset, and using models as a dataset exchange medium is
not one of PP’s "use-cases"; so, very large tables might hit some
other bug/constraint!

How to connect Python to Db2

Is there a way to connect Python to Db2?
The documentation is difficult to find, and once you find it, it's pretty abysmal. Here's what I've found over the past 3 hours.
You need to install ibm_db using pip, as follows:
pip install ibm_db
You'll want to create a connection object. The documentation is here.
Here's what I wrote:
from ibm_db import connect

# Careful with the punctuation here - we have 3 arguments.
# The first is a big string with semicolons in it.
# (Strings separated by only whitespace, newlines included,
# are automatically joined together, in case you didn't know.)
# The last two are empty strings.
connection = connect('DATABASE=<database name>;'
                     'HOSTNAME=<database ip>;'  # 127.0.0.1 or localhost works if it's local
                     'PORT=<database port>;'
                     'PROTOCOL=TCPIP;'
                     'UID=<database username>;'
                     'PWD=<username password>;', '', '')
Next you should know that commands to ibm_db never actually give you results. Instead, you need to call one of the fetch methods on the command, repeatedly, to get the results. I wrote this helper function to deal with that.
def results(command):
    from ibm_db import fetch_assoc

    ret = []
    result = fetch_assoc(command)
    while result:
        # This builds a list in memory. Theoretically, if there's a lot of rows,
        # we could run out of memory. In practice, I've never had that happen.
        # If it's ever a problem, you could use
        #     yield result
        # Then this function would become a generator. You lose the ability to access
        # results by index or slice them or whatever, but you retain
        # the ability to iterate on them.
        ret.append(result)
        result = fetch_assoc(command)
    return ret  # Ditch this line if you choose to use a generator.
Now with that helper function defined, you can easily do something like get the information on all the tables in your database with the following:
from ibm_db import tables
t = results(tables(connection))
If you'd like to see everything in a given table, you could do something like this now:
from ibm_db import exec_immediate
sql = 'SELECT * FROM ' + t[170]['TABLE_NAME'] # Using our list of tables t from before...
rows = results(exec_immediate(connection, sql))
And now rows contains a list of rows from the 170th table in your database, where every row contains a dict of column name: value.
Hope this all helps.
After lots of digging I discovered how to connect to DB2 using ibm_db.
First off, if you use a Python version higher than 3.2, use
pip install ibm_db==2.0.8a
as version 2.0.8 (the latest) will fail to install.
Then use the following to connect:
import ibm_db_dbi as db
conn = db.connect("DATABASE=name;HOSTNAME=host;PORT=60000;PROTOCOL=TCPIP;UID=username;PWD=password;", "", "")
list tables with
for t in conn.tables():
    print(t)
and execute SQL with
cursor = conn.cursor()
cursor.execute("SELECT * FROM Schema.Table")
for r in cursor.fetchall():
    print(r)
Check the official documentation as well, though it's not entirely accurate.
ibm-db, the official DB2 driver for Python and Django, is here:
https://code.google.com/p/ibm-db/
Here's a recent tutorial for how to install everything on Ubuntu Linux:
http://programmingzen.com/2011/05/12/installing-python-django-and-db2-on-ubuntu-11-04/
I should mention that there were several older unofficial DB2 drivers for Python. ibm-db is the one you should be using.
In addition to @prof1990's response:
Since 2.0.9 (Aug 16th, 2018), you can simply use the following with Python 3 as well:
pip install ibm_db
Reference:
https://github.com/ibmdb/python-ibmdb#updated-ibm_db
Example of connection here:
import ibm_db
ibm_db.connect("DATABASE=<dbname>;HOSTNAME=<host>;PORT=<60000>;PROTOCOL=TCPIP;UID=<username>;PWD=<password>;", "", "")
Full API documentation here:
https://github.com/ibmdb/python-ibmdb/wiki/APIs
You can connect to DB2 from Python using jaydebeapi.
First install the library by running pip install jaydebeapi and
download db2jcc4.jar.
Then you can connect using the code below, passing the hostname, port number, user ID, password, and database name:
import jaydebeapi

conn_src = jaydebeapi.connect(
    'com.ibm.db2.jcc.DB2Driver',
    'jdbc:db2://YourHostName:PortNo/DatabaseName',
    ['userid', 'password'],
    'C:/db2jcc4.jar'
)
cursor = conn_src.cursor()
sql = 'SELECT * FROM schemaname.TableName FETCH FIRST 100 ROWS ONLY'
cursor.execute(sql)
print("fetchall:")
result = cursor.fetchall()
for r in result:
    print(r)
There is a way to connect to IBM Db2 using nothing but the Python requests library. It worked for me.
STEP 1:
Go to the IBM Cloud dashboard -> navigate to your IBM Db2 instance -> click on 'Service Credentials'.
A default one should be there; if not, create one. This service credential is a dictionary. Copy the service credentials.
STEP 2:
import requests

db2id = {}  # paste the service credential dictionary here

api = "/dbapi/v3"
host = db2id['https_url'] + api
userinfo = {"userid": db2id['username'], "password": db2id['password']}

service = '/auth/tokens'
r = requests.post(host + service, json=userinfo)
access_token = r.json()['token']
auth_header = {"Authorization": "Bearer " + access_token}
# Connection to database established
STEP 3
Now you can run SELECT, INSERT, DELETE, and UPDATE queries.
The format for INSERT, DELETE, and UPDATE queries is the same. After an INSERT, DELETE, or UPDATE query, a COMMIT query has to be sent, or the changes aren't reflected. (You should be committing your changes anyway.)
INSERT / UPDATE / DELETE QUERIES
sql = " your insert/update/delete query here "
sql_command = {"commands":sql,"limit":1000,"separator":";","stop_on_error":"yes"}
service = "/sql_jobs"
r = requests.post(host+service,headers=auth_header,json=sql_command)
sql_command = {"commands":"COMMIT","limit":1000,"separator":";","stop_on_error":"yes"}
service = "/sql_jobs"
r = requests.post(host+service,headers=auth_header,json=sql_command)
You can use the variable r to check status of your request
SELECT QUERIES
sql = " your select query here "
service = "/sql_jobs"
r = requests.post(host+service,headers=auth_header,json=sql_command)
jobid = r.json()['id']
r = requests.get(host+service+"/"+jobid,headers=auth_header)
results = r.json()['results']
rows = results[0]['rows']
The variable rows will have the results of your query. Use it as per your convenience.
I didn't use any DDL queries, but I think they should work like the DML queries. Not sure, though!
IBM's Db2 is available for various platforms. If you are trying to connect to a Db2 which lives on an IBM i server (formerly known as AS/400, iSeries, or System i), then ibm_db requires a product called Db2 Connect, which is rather expensive. Most people who use Python to connect to Db2 for i use ODBC (usually through PyODBC).
I'm not completely sure about the situation with Db2 on their z (mainframe) servers, but I would think it also requires Db2 Connect.
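A minimal sketch of that ODBC route, assuming pyodbc and the IBM i Access ODBC driver are installed; the system name, credentials, and library/table are placeholders:

import pyodbc

# Driver name as registered by IBM i Access Client Solutions; adjust if
# your installation registers a different name.
conn = pyodbc.connect(
    "DRIVER={IBM i Access ODBC Driver};SYSTEM=myibmi.example.com;UID=user;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT * FROM mylib.mytable FETCH FIRST 10 ROWS ONLY")
for row in cursor.fetchall():
    print(row)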
There are many ways to connect from Python to Db2. I am trying to provide a summary of the options. Note that in many environments SSL/TLS is enforced now, which requires additional parameters (see below).
Db2 and Python drivers
Db2 does not offer one, but four drivers (clients) for Python. The Db2 documentation page "Python, SQLAlchemy, and Django Framework application development for IBM Database servers" provides a good overview of the four drivers:
ibm_db is based on the IBM-defined API,
ibm_db_dbi is a driver for the Python database API (DBI),
ibm_db_sa implements the Python SQLAlchemy interface and
ibm_db_django serves as Db2 driver in the Django Framework.
Note that there are additional Python database interfaces which make use of existing JDBC or ODBC drivers which can be used to connect to Db2. You can use SQLAlchemy (ibm_db_sa) with the popular Flask framework. To use Db2 with pandas utilize ibm_db_dbi. All of the above Db2 drivers are available on GitHub and are based on the CLI (Call Level Interface / ODBC). There are additional ways to connect to Db2, e.g., by using 3rd party ODBC-based wrappers and more.
Db2 connections
Typical connection information is made up of the Db2 server (hostname), the port, the database name and username / password information. If nothing else is specified, most drivers assume that the connection is not encrypted. Thus, to connect over an encrypted connection more parameters are needed. They depend on the Db2 version, the type of Db2 product and some more. Let's start easy.
Newer Db2 versions simplified the use of SSL/TLS because certificates are now part of the package. A typical connection string would then look like this:
conn_str='database=MYDB;hostname=db2host.example.com;port=50001;protocol=tcpip;uid=db2inst1;pwd=secret;security=SSL'
ibm_db_conn = ibm_db.connect(conn_str,'','')
An important parameter is "security=SSL" to tell the driver to use encryption for the data in transit.
Db2 connection strings can have even more options. It depends on what security plugin is enabled. See this blog post on connecting from Python to Db2 for more links and discussions.
SQL Alchemy connection
When using Db2 with SQLAlchemy, pass a URI similar to
ibm_db_sa://user:password@hostname:port/database?Security=SSL
to get the connection established.
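For illustration, a minimal sketch assuming the ibm_db_sa dialect and SQLAlchemy are installed (pip install ibm_db_sa sqlalchemy); the host, credentials, and database name are placeholders:

from sqlalchemy import create_engine, text

engine = create_engine("ibm_db_sa://db2inst1:secret@db2host.example.com:50001/MYDB?Security=SSL")
with engine.connect() as conn:
    # syscat.tables is a Db2 catalog view, handy as a smoke test
    for row in conn.execute(text("SELECT tabname FROM syscat.tables FETCH FIRST 5 ROWS ONLY")):
        print(row)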
You can use the ibm_db library to connect to Db2:
import ibm_db

query_str = "SELECT COUNT(*) FROM table_name"
conn = ibm_db.pconnect("dsn=write", "username", "secret")
query_stmt = ibm_db.prepare(conn, query_str)
ibm_db.execute(query_stmt)
row = ibm_db.fetch_tuple(query_stmt)  # fetch the single result row
print(row[0])
This is for future reference:
Official installation docs say:
Python 2.5 or later, excluding Python 3.X.
pip install ibm_db
It only worked on Python 2.7 for me; it didn't for 3.X. Also, I had to make Python 2.7 default (instead of Python 3) so that the installation would work (otherwise, there would be installation errors).
Official docs sample usage:
import ibm_db
ibm_db.connect("DATABASE=name;HOSTNAME=host;PORT=60000;PROTOCOL=TCPIP;UID=username; PWD=password;", "", "")
Version: ibm-db 3.0.2 - ibm-db==3.0.2
pip install ibm-db
Released: Jun 17, 2020
Connect to a local or cataloged database:
import ibm_db
conn = ibm_db.connect("database","username","password")
Connect to an uncataloged database:
import ibm_db
ibm_db.connect("DATABASE=name;HOSTNAME=host;PORT=60000;PROTOCOL=TCPIP;UID=username;
PWD=password;", "", "")
How I managed to do it in 2021.
What you will need:
Python 3.7
PipEnv
Ibm-db
The ibm-db version is not important, but this lib only works with Python 3.7 (the current Python version is 3.9).
Install Python 3.7.6 on your machine (this is the version that worked).
In your IDE, create a new Python file.
Let's create a virtual environment to make sure we will use Python 3.7:
pip install pipenv
After installing
pipenv install --python 3.7
Activate the Virtual Environment
pipenv shell
You can use pip list to verify that you are in the new virtual environment - if the list shows only 3 or 4 libs, you are.
Now you can install ibm_db:
pip install ibm-db
You may add this to your code to confirm which version you are using:
from platform import python_version
print(python_version())
Now, accessing Db2:
import ibm_db_dbi as db
# Connect to DB2B1 (keep Protocol as TCPIP)
conn = db.connect("DATABASE=DBNAME;HOSTNAME=hostname;PORT=port;PROTOCOL=TCPIP;UID=Your User;PWD=Your Password;", "", "")
Check all the available tables:
for t in conn.tables():
    print(t)
Your SQL code:
sql_for_df = """SELECT *
FROM TABLE
WHERE ..."""
Visualizing as DataFrame
First install pandas as it will not be present in your Virtual Environment
pip install pandas
After that, import it in your code and play around:
import pandas as pd
df = pd.read_sql(sql_for_df, conn)
df.head()
To exit the virtual environment, just type exit in your terminal.
If you want to remove the virtual environment, run pipenv --rm in the terminal.
That's pretty much all I could learn so far.
I hope it helps you all.
# Install : ibm_db package
# Command : pip install ibm_db
import ibm_db
import sys

def get_connection():
    db_name = ""
    db_host_name = ""
    db_port = ""
    db_protocol = ""
    db_username = ""
    db_password = ""
    try:
        conn = ibm_db.connect(
            f"DATABASE = {db_name}; HOSTNAME = {db_host_name}; PORT = {db_port}; PROTOCOL = {db_protocol}; "
            f"UID = {db_username}; PWD = {db_password};", "", "")
        return conn
    except:
        print("no connection:", ibm_db.conn_errormsg())
        sys.exit(1)

get_connection()
