I have a scenario to migrate SQL Server tables(30- 40 tables) to Oracle. I Cannot depend on SSIS as the no of tables to be migrated to Oracle will change regularly and I cannot always create or update a DFT when ever there is a change in schema.
Is there any other way where the movement of data can be handled dynamically and can work effectively ? Like using Python or any other Programming languages ?
C# approach - SchemaMapper library
Since you are open to a solution using a programming language, i think you can benefit from SchemaMapper class library which is an open-source project published on GitHub. A full description can be found in the Readme file on the link above.
Important Note: Yesterday i added the support of reading data from databases (SQL Server , Oracle ...) and the ability to export data to Oracle.
In this answer i will provide information on importing SQL Server tables, create the appropriate SchemaMapper class for each one (since they have different schema and you need to import them to different schemas), and how to export data to Oracle.
//First of all list the tables names need to import
string[] TableNameFilter = new[] { "Table1", "Table2" };
//Create an instance of the oracle import class
SchemaMapper.Exporters.OracleExport expOracle = new SchemaMapper.Exporters.OracleExport(oracleconnectionstring);
//Create an SQL Server import class
using (SchemaMapper.Converters.SqlServerCeImport ssImport = new SchemaMapper.Converters.SqlServerCeImport(sqlconnectionstring))
{
//Retrieve tables names
ssImport.getSchemaTable();
//loop over tables matching the filter
foreach(DataRow drRowSchema in ssImport.SchemaTable.AsEnumerable().Where(x =>
TableNameFilter.Contains(x["TABLE_NAME"].ToString())).ToList())
{
string SQLTableName = drRowSchema["TABLE_NAME"].ToString();
string SQLTableSchema = drRowSchema["TABLE_SCHEMA"].ToString();
DataTable dtSQL = ssImport.GetDataTable(SQLTableSchema, SQLTableName);
//Create a schema mapping class
using (SchemaMapper.SchemaMapping.SchemaMapper sm = new SchemaMapper.SchemaMapping.SchemaMapper(SQLTableSchema, SQLTableName))
{
foreach (DataColumn dc in dtSQL.Columns)
{
SchemaMapper_Column smCol = new SchemaMapper_Column();
smCol.Name = dc.ColumnName;
smCol.Name = dc.ColumnName;
smCol.DataType = smCol.GetCorrespondingDataType(dc.DataType.ToString(), dc.MaxLength);
sm.Columns.Add(smCol);
}
//create destination table in oracle
expOracle.CreateDestinationTable(sm);
//Insert data
expOracle.InsertUsingOracleBulk(sm, dtSQL);
//there are other methods such as :
//expOracle.InsertIntoDb(sm, dtSQL);
//expOracle.InsertIntoDbWithParameters(sm, dtSQL);
}
}
}
Note: this is an open-source project: it is not fully tested and not all data types are supported, if you encountered some errors feel free to give a feedback, or add an Issue in GitHub
Other approach - SQL Server Import and Export Wizard
If you can do this without scheduling a Job, then you can use the Import and Export Wizard which allows you to import multiple tables into Oracle without the need to build the packages manually. It will create packages, destination tables, map columns and import data.
Start the SQL Server Import and Export Wizard
Connect to an Oracle Data Source (SQL Server Import and Export Wizard)
Here is the approach I have decided to go considering the time constraint( using C# is taking more time).For 8 GB table it is taking 11 minutes to move the data SQL to Oracle.
Steps:
Dump the SQL tables data into flat files.(Used BIML for automating
the DFT creation)
Transfer these flat files to the Destination server.
Using SQL*Loader to load data from flat files to Oracle.
Related
I'm trying to create a temp view using spark, from a csv file.
To reproduce my production scenario, I need to test my script locally, however in production I'm using Glue Jobs (AWS) where there are databases and tables.
In the code below, I'm creating a database in my spark session and using it, after that, I create a temp view.
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("pulsar_data").getOrCreate()
df = spark.read.format('csv')\
.options(infer_schema=True)\
.options(header=True)\
.load('pulsar_stars.csv')
spark.sql('CREATE DATABASE IF NOT EXISTS MYDB')
spark.sql('USE MYDB')
df.createOrReplaceTempView('MYDB.TB_PULSAR_STARS')
spark.catalog.listTables()
spark.sql('SELECT * FROM MYDB.TB_PULSAR_STARS').show()
However, when I try to select db.table, Spark can't find the relation between my temp view and my database and throws following error:
*** pyspark.sql.utils.AnalysisException: Table or view not found: MYDB.TB_PULSAR_STARS; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [MYDB, TB_PULSAR_STARS], [], false
Debugging my code with pdb, I have listed my spark session catalog, where I find that my table is in fact associated:
(Pdb) spark.catalog.listTables()
[Table(name='tb_pulsar_stars', database='MYDB', description=None, tableType='TEMPORARY', isTemporary=True)]
How can I make this relationship work?
Temporary view name associated to a DataFrame can only be one segment. This is explicitly checked here in Spark code. I would expect your code to throw AnalysisException: CREATE TEMPORARY VIEW or the corresponding Dataset APIs only accept single-part view names, but got: MYDB.TB_PULSAR_STARS - not sure why in your case it's a bit different.
Anyway, use:
df.createOrReplaceTempView('TB_PULSAR_STARS')
spark.sql('SELECT * FROM TB_PULSAR_STARS').show()
And if you need to actually write this data to a table, create it using:
spark.sql("CREATE TABLE MYDB.TB_PULSAR_STARS AS SELECT * FROM TB_PULSAR_STARS")
I'm creating a Snowflake procedure using Snowpark (python) package executing a query into a snowflake dataframe and I would like to export that into Excel, how can I accomplish that? Is it a better approach to do this? The end goal is to export it the query results into Excel. Needs to be in a Snowflake procedure since we already have others "parent" procedures. Thanks!
CREATE OR REPLACE PROCEDURE EXPORT_SP()
RETURNS string not null
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'run'
AS
$$
import pandas
def run(snowpark_session):
## Execute the query into a Snowflake dataframe
results_df = snowpark_session.sql('''
SELECT * FROM
MY TABLES
;
''').collect()
return results_df
$$
;
In general, you can do this by:
"Unloading" the data from the table using the COPY INTO <location> command.
Using the GET command to copy the data to your local filesystem.
Open the file with Excel! If you used the CSV format and the appropriate format options in step 1, you should be able to easily open the resulting data with Excel.
Snowpark directly supports step 1 in the DataFrameWriter.copy_into_location method. An instance of DataFrameWriter contained in the DataFrame.write attribute, as described here.
Snowpark also directly supports step 2 in the FileOperation.get method. As per the example in that documentation page, you can access this method using the .file attribute of your Snowpark session object.
Putting this all together, you should be able to do something like this in Snowpark to save a single exported file into the current working directory:
source_table = "my_table"
unload_location = "#my_stage/export.csv"
def run(session):
df = session.table(source_table)
df.write.copy_into_location(
unload_location,
file_format_type="csv",
format_type_options=dict(
compression="none",
field_delimiter="\t",
),
single=True,
header=True,
)
session.file.get(unload_location, ".")
You can of course use session.sql() instead of session.table() as needed. You might also want to consider unloading data to the stage associated with the source data, instead of creating a separate stage, i.e. if the data is from table my_table then you would unload to the stage #%my_table.
For more details, refer to the documentation pages I linked, which contain important reference information as well as several examples.
Note that I am not sure if session.file is accessible from inside a stored procedure; you will have to experiment to see what works in your specific situation.
As always, remember that this is untested code written by an unpaid volunteer. Always triple-check and test any code that is provided here. Please do ask questions in the comments if anything is still unclear.
I want to build an API which queries the databricks's tables and output the results as a JSON. One way to achieve this is calling the databricks Jobs REST API to execute a job and read the job output but that has data size limitations(max of 5MB and my API result set can exceed beyond 20MB). Instead, can I connect to databricks using the JDBC/ODBC endpoint provided by the cluster using Microsoft.Net or if there's any other way to connect directly? My API layer preferably needs to be built in Microsoft.Net. However, I'm willing to try Python.
I found few ways to connect to the databricks cluster
Connect using ODBC connection with Simba drivers (https://pages.databricks.com/ODBC-Driver-Download.html). Also, shared in the comment above by #EdHarper
Use cdata nuget package - https://www.cdata.com/drivers/spark/ado/ but, there's a license cost involved
Using JDBC connection string provided, may need Java code
I went ahead with option #1 and below is a sample c# code.
// Build connection string
OdbcConnectionStringBuilder odbcConnectionStringBuilder = new OdbcConnectionStringBuilder
{
Driver = "Simba Spark ODBC Driver"
};
odbcConnectionStringBuilder.Add("Host", "adb-xxxxxxxxxxxxx.7.xxxxxxxbricks.net");
odbcConnectionStringBuilder.Add("Port", "443");
odbcConnectionStringBuilder.Add("SSL", "1");
odbcConnectionStringBuilder.Add("ThriftTransport", "2");
odbcConnectionStringBuilder.Add("AuthMech", "3");
odbcConnectionStringBuilder.Add("UID", "token");
odbcConnectionStringBuilder.Add("PWD", "<Access token generated in databricks>");
odbcConnectionStringBuilder.Add("HTTPPath", "sql/protocolv1/o/xxxxxxxxxxxxxxx/yyyy8-dfcccf-tyyujjk8");
using (OdbcConnection connection = new OdbcConnection(odbcConnectionStringBuilder.ConnectionString))
{
string sqlQuery = "select * from yourdb.TableName";
OdbcCommand command = new OdbcCommand(sqlQuery, connection);
connection.Open();
OdbcDataReader reader = command.ExecuteReader();
for (int i = 0; i < reader.FieldCount; i++)
{
Console.Write(reader.GetName(i) + "\t");
}
Console.Write("\n");
reader.Close();
command.Dispose();
}
Additionally, you connect using a DSN if you prefer, more details here - https://www.simba.com/products/Spark/doc/v1/ODBC_InstallGuide/win/content/odbc/hi/windows/dsn.htm
Are there any alternatives for Excel users to suck data from MySQL through remote connection without bothering to establish ODBC connection in control panel nor downloading MySQL extensions nor doing anything on user side?
Are there any connectors on MySQL side which would turn data into format readable for Excel? I am looking for dynamic solutions - reading from data base. I am not looking for static solutions like export to csv and import to Excel. Solutions done in any programming language like Python are acceptable.
This ADO connection looked promising for me however still extra work on user side is required at start up: How can VBA connect to MySQL database in Excel?
I would like to make a portable Excel file with macro that will work on any computer, in any case.
Hope you are looking for some this as below
Sub test1()
Dim con As ADODB.Connection
Dim rec As ADODB.Recordset
Set con = New ADODB.Connection
Set rec = New ADODB.Recordset
rec.CursorLocation = adUseClient
con.Open ("Provider=SQLOLEDB;Data Source=.;Initial Catalog=databasename;user ID=sa; password=sa#123;")
qry1 = "select * from [dbo].[FARA];"
rec.Open qry1, con
For i = 1 To rec.RecordCount
Debug.Print rec(0), rec(1), rec(2)
Next i
End Sub
I would like to get some understanding on the question that I was pretty sure was clear for me. Is there any way to create table using psycopg2 or any other python Postgres database adapter with the name corresponding to the .csv file and (probably the most important) with columns that are specified in the .csv file.
I'll leave you to look at the psycopg2 library properly - this is off the top of my head (not had to use it for a while, but IIRC the documentation is ample).
The steps are:
Read column names from CSV file
Create "CREATE TABLE whatever" ( ... )
Maybe INSERT data
import os.path
my_csv_file = '/home/somewhere/file.csv'
table_name = os.path.splitext(os.path.split(my_csv_file)[1])[0]
cols = next(csv.reader(open(my_csv_file)))
You can go from there...
Create a SQL query (possibly using a templating engine for the fields and then issue the insert if needs be)