I used .saveAsTable on my DataFrame and now it is stored in the Hive warehouse directory on HDFS and registered in the metastore. How can I load this back into Spark SQL? I have deleted my cluster (Azure HDInsight) and created a new one, confirmed my Hive metastore location is the same, and the directory is still there.
I need to load this again as a persistent table, not as a temp table, because I am using the PowerBI/Spark connector. The only way I have found so far is to load the directory back into a DataFrame and then run .saveAsTable again, which writes the files again and takes a long time to process. I'm hopeful there is a better way!
After you use .saveAsTable you can query the table directly with SQL.
df.write.saveAsTable("tableName")
myOtherDf = spark.sql("select * from tableName")
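If the new cluster points at the same external Hive metastore, the table should already be registered there and can be read back without rewriting any data. A minimal PySpark sketch of both cases; the table name, warehouse path, and Parquet format are assumptions (saveAsTable writes Parquet by default):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Case 1: the metastore entry survived the cluster rebuild -- just read the table back
df = spark.table("tableName")

# Case 2: only the warehouse directory survived -- re-register the existing files as an
# external table without rewriting them (the path below is a placeholder)
spark.catalog.createTable("tableName", path="/hive/warehouse/tablename", source="parquet")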
I'm not sure how to go about doing a one-time load of the existing data I have in Oracle to MariaDB. I have DBeaver, which I am using to access the databases. I saw an option in DBeaver to migrate the data from the source (Oracle) to the target (MariaDB) with a few clicks, but I'm not sure if that's the best approach.
Is writing a Python script a better way of doing it? Should I download another tool to do a one-time load? We are using CData Sync to do the incremental loads. Basically, it copies data from one database to another (Oracle to SQL Server, for example) and does incremental loads. I'm not sure if I can use it to do a full, one-time load of all the data I have in my Oracle database to MariaDB. I'm new to this, I've never loaded data before. The thing is, I have over 1100 tables, so I can't manually write the schema for each table and do a "CREATE TABLE" statement for all 1100 tables...
Option 1 DBeaver
If DBeaver can do it in a few clicks, I'd try it on a few small tables first and see what it produces.
Option 2 MariaDB connect
Alternatively, there is the MariaDB CONNECT storage engine, which can reach Oracle over ODBC or JDBC.
Note that you don't need to write out the table structure for every table, but you do need the list of tables and to generate a CREATE TABLE t1 ENGINE=CONNECT TABLE_TYPE=ODBC tabname='T1' CONNECTION='DSN=XE;.. statement for each one (a way to generate these is sketched after the SQL below).
Then it would be:
create database mariadb_migration;
create table mariadb_migration.t1 like t1;
insert into mariadb_migration.t1 select * from t1;
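With over 1100 tables, those per-table statements are best generated rather than written by hand. A rough Python sketch of that idea; the table list, DSN, and output file name are placeholders, and the generated statements simply mirror the ones shown above:

# generate_migration.py -- emit the per-table CONNECT and copy statements
table_names = ["T1", "T2", "T3"]  # placeholder: pull the real list from Oracle's ALL_TABLES

with open("migrate.sql", "w") as out:
    out.write("create database mariadb_migration;\n")
    for name in table_names:
        t = name.lower()
        # CONNECT table pointing at the Oracle table over ODBC (DSN is a placeholder)
        out.write(
            f"create table {t} engine=CONNECT table_type=ODBC "
            f"tabname='{name}' connection='DSN=XE';\n"
        )
        # native copy of structure and data, as in the statements above
        out.write(f"create table mariadb_migration.{t} like {t};\n")
        out.write(f"insert into mariadb_migration.{t} select * from {t};\n")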
Option 3 MariaDB Oracle Mode
This uses the Oracle compatibility mode of MariaDB.
Take a SQL dump from Oracle.
Prepend SET SQL_MODE='ORACLE'; to the start of the dump.
Import this to MariaDB.
Option 4 SQLines
SQLines offers an Oracle to MariaDB migration tool.
Small disclaimer, I've not done any of these personally, I just know these options exist.
I just linked an Azure storage account (Storage gen2) with its underlying containers to my Databricks environment. Inside the storage account are two containers each with some subdirectories. Inside the folders are .csv files.
I have connected an Azure service principal with Azure Blob Data Contributor access to the storage account inside Databricks, so I can read and write to the storage account.
I am trying to figure out the best way to convert the existing storage account into a delta lake (tables inside the metastore + convert the files to parquet (delta tables).
What is the easiest way to do that?
My naive approach as a beginner might be
Read the folder using
spark.read.format("csv).load("{container}#{storage}..../directory)
Write to a new folder with a similar name (so if the folder is directory, write it to directory_parquet) using df.write.format("delta").save("{container}#{storage}.../directory_parquet")
And then I'm not sure about the last steps. This would create a new folder with a new set of files, but it wouldn't be a table in Databricks that shows up in the Hive metastore. I do get Parquet files, though.
Alternatively, I can use df.write.format("delta").saveAsTable("tablename"), but that doesn't create the table in the storage account; it creates it inside the Databricks file system, although it does show up in the Hive metastore.
Finally, delete the existing data files if desired (or leave them duplicated).
Preferably this can be done in a Databricks workbook using python as preferred, or scala/sql if necessary.
*As a possible fallback, if the effort to do this properly is monumental, just converting to Parquet and registering the table information for each subfolder in the Hive metastore as database = containerName, tableName = subdirectoryName would be acceptable.
The folder structure is pretty flat at the moment, so only rootcontainer/Subfolders deep.
Perhaps an external table is what you're looking for:
df.write.format("delta").option("path", "some/external/path").saveAsTable("tablename")
This post has more info on external tables vs managed tables.
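Putting the external-table idea together with the folder layout described in the question, here is a rough PySpark sketch for a Databricks notebook (where spark and dbutils are predefined). The container and storage-account names, the abfss path format, the header option, and the _delta suffix are all assumptions to adjust:

# convert each top-level subfolder of CSVs into a Delta external table in the metastore
container = "rootcontainer"                       # hypothetical container name
storage = "mystorageaccount"                      # hypothetical storage account name
base = f"abfss://{container}@{storage}.dfs.core.windows.net"

spark.sql(f"CREATE DATABASE IF NOT EXISTS {container}")

for folder in dbutils.fs.ls(base):                # one table per top-level subfolder
    if not folder.isDir():
        continue
    table = folder.name.rstrip("/")
    df = spark.read.format("csv").option("header", "true").load(folder.path)
    (df.write.format("delta")
       .mode("overwrite")
       .option("path", f"{base}/{table}_delta")   # external location keeps the data in the storage account
       .saveAsTable(f"{container}.{table}"))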
I have created a local Spark environment in Docker. I intend to use this as part of a CI/CD pipeline for unit testing code executed in the Spark environment. I have two scripts which I want to use: one will create a set of persistent Spark databases and tables, and the other will read those tables. Even though the tables should be persistent, they only persist in that specific Spark session. If I create a new Spark session, I cannot access the tables, even though they are visible in the file system. Code examples are below:
Create db and table
Create_script.py
from pyspark.sql import SparkSession
def main():
    spark = SparkSession.builder.appName('Example').getOrCreate()
    columns = ["language", "users_count"]
    data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
    rdd = spark.sparkContext.parallelize(data)
    df = rdd.toDF(columns)
    spark.sql("create database if not exists schema1")
    df.write.mode("ignore").saveAsTable('schema1.table1')

if __name__ == "__main__":
    main()
Load Data
load_data.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.sql("select * from schema1.table1")
I know there is a problem because when I run print(spark.catalog.listDatabases()) it can only find the database default. But if I import Create_script.py, then it will find the schema1 db.
How do I make persistent tables across all spark sessions?
The files in /repo/test/spark-warehouse are only the data of the tables, without the metadata for the databases/tables/columns.
If you don't enable Hive, Spark uses an InMemoryCatalog, which is ephemeral, intended only for testing, and only available within the same Spark context. This InMemoryCatalog doesn't provide any function to load databases/tables from the file system.
So there are two ways:
Columnar Format
Write the data in ORC/Parquet format with df.write.orc() in your Create_script.py script; the ORC/Parquet format stores the column info alongside the data.
Then df = spark.read.orc(...), and createOrReplaceTempView if you need to use it in SQL.
Use Embedded Hive
You don't need to install Hive; Spark can work with an embedded Hive in just two steps (a PySpark sketch follows below).
Add the spark-hive dependency. (I'm using Java, which uses pom.xml to manage dependencies; I don't know how to do it in PySpark.)
SparkSession.builder().enableHiveSupport()
Then the data will be in /repo/test/spark-warehouse/schema1.db and the metadata in /repo/test/metastore_db, which contains the files of an embedded Derby database. You can then read and write the tables across all Spark sessions.
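For PySpark, a minimal sketch of the same idea; the pip-installed pyspark distribution already bundles the Hive classes, so usually only the builder change is needed (the warehouse path below is just an example):

from pyspark.sql import SparkSession

# enableHiveSupport() switches from the ephemeral InMemoryCatalog to a persistent
# Hive catalog backed by an embedded Derby metastore, so tables survive across sessions.
spark = (SparkSession.builder
         .appName('Example')
         .config("spark.sql.warehouse.dir", "/repo/test/spark-warehouse")  # example path
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("select * from schema1.table1")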
I have a basic CSV report that is produced by another team on a daily basis; each report has 50k rows, and the reports are saved to a share drive every day. And I have an Oracle DB.
I need to create an auto-scheduled (or at least less manual) process to import those CSV reports into the Oracle DB. What solution would you recommend?
I did not find such a solution in SQL Developer, since that is an upload from a file and not a query. I was thinking about a Python cron script that would run daily, transform the CSV report into a txt file with the needed SQL syntax (insert into ...), and then connect to the Oracle DB and run that txt file as SQL commands to insert the data.
But this looks complicated.
Maybe you know another solution that you would recommend?
Create an external table to allow you to access the content of the CSV as if it were a regular table. This assumes the file name does not change day-to-day.
Create a scheduled job to import the data in that external table and do whatever you want with it.
One common blocking issue that prevents using external tables is that they require the data to be on the machine hosting the database, and not everyone has access to those servers. Sometimes the transfer of the data to that machine plus the data load into the DB is also slower than doing a direct path load from the remote machine.
SQL*Loader with direct path load may be an option: https://docs.oracle.com/en/database/oracle/oracle-database/19/sutil/oracle-sql-loader.html#GUID-8D037494-07FA-4226-B507-E1B2ED10C144 This will be faster than Python.
If you do want to use Python, then read the cx_Oracle manual's Batch Statement Execution and Bulk Loading section. There is an example of reading from a CSV file.
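A rough sketch of that cx_Oracle approach; the connection details, table name, column list, and batch size are placeholders, and the CSV is assumed to have a header row:

import csv
import cx_Oracle

# placeholders: real credentials/DSN and table definition will differ
connection = cx_Oracle.connect(user="scott", password="tiger", dsn="dbhost/orclpdb1")
cursor = connection.cursor()

sql = "insert into daily_report (col1, col2, col3) values (:1, :2, :3)"

with open("report.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    batch = []
    for row in reader:
        batch.append(tuple(row))
        if len(batch) == 5000:        # load in batches to keep memory bounded
            cursor.executemany(sql, batch)
            batch = []
    if batch:
        cursor.executemany(sql, batch)

connection.commit()
cursor.close()
connection.close()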
I have some large (+500 Mbytes) .CSV files that I need to import into a Postgres SQL database.
I am looking for a script or tool that helps me to:
Generate the CREATE TABLE SQL code for the columns, ideally taking into account the data in the .CSV file in order to choose the optimal data type for each column.
Use the header of the .CSV as the name of the column.
It would be perfect if such functionality existed in PostgreSQL or could be added as an add-on.
Thank you very much
You can use the open source tool pgfutter to create a table from your CSV file.
GitHub link
PostgreSQL also has COPY functionality; however, COPY expects that the table already exists.
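If you'd rather script it yourself, here is a rough Python sketch of the same idea using psycopg2: read the header to get the column names, create the table, and bulk load with COPY. The file name, table name, and connection string are placeholders, and every column is created as text, so refine the type inference as needed:

import csv
import psycopg2

csv_path = "big_file.csv"          # placeholder path
table = "imported_data"            # placeholder table name

# read the header row to get the column names
with open(csv_path, newline="") as f:
    header = next(csv.reader(f))

columns = ", ".join(f'"{name}" text' for name in header)   # naive: everything as text

conn = psycopg2.connect("dbname=mydb user=me")              # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(f'create table if not exists "{table}" ({columns})')
    with open(csv_path, newline="") as f:
        # COPY ... FROM STDIN streams the file instead of building INSERT statements
        cur.copy_expert(f'copy "{table}" from stdin with (format csv, header true)', f)
conn.close()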