Local Spark config - python

I have created a local Spark environment in Docker, which I intend to use as part of a CI/CD pipeline for unit testing code executed in the Spark environment. I have two scripts: one creates a set of persistent Spark databases and tables, and the other reads those tables. Even though the tables should be persistent, they only persist within that specific Spark session. If I create a new Spark session, I cannot access the tables, even though they are visible in the file system. Code examples are below:
Create db and table
Create_script.py
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName('Example').getOrCreate()
    columns = ["language", "users_count"]
    data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
    rdd = spark.sparkContext.parallelize(data)
    df = rdd.toDF(columns)
    spark.sql("create database if not exists schema1")
    df.write.mode("ignore").saveAsTable('schema1.table1')

if __name__ == "__main__":
    main()
Load Data
load_data.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.sql("select * from schema1.table1")
I know there is a problem because when I run print(spark.catalog.listDatabases()) it can only find the default database. But if I import Create_script.py, it will find the schema1 database.
How do I make tables persist across all Spark sessions?

The files in /repo/test/spark-warehouse contain only the table data, without any metadata about the database/table/columns.
If you don't enable Hive, Spark uses an InMemoryCatalog, which is ephemeral, intended only for testing, and only available within the same Spark context. This InMemoryCatalog provides no way to load databases or tables from the file system.
So there are two ways:
Columnar Format
Write the data in ORC or Parquet format in your Create_script.py script, e.g. df.write.orc(path); these columnar formats store the column information alongside the data.
Then read it back with df = spark.read.orc(path), and call createOrReplaceTempView if you need to use it in SQL.
Use Embedded Hive
You don't need to install Hive; Spark can work with an embedded Hive in just two steps:
Add the spark-hive dependency. (In Java this is managed via pom.xml; the standard pyspark distribution already bundles the Hive classes, so there is usually nothing extra to add.)
Call SparkSession.builder.enableHiveSupport() when building the session.
Then the data will be under /repo/test/spark-warehouse/schema1.db and the metadata will be in /repo/test/metastore_db, which contains the files of an embedded Derby database. You can then read and write the tables across all Spark sessions.

Related

AttachDistributedSequence is not supported in Unity Catalog

I'm trying to read a table on Databricks to a DataFrame using the pyspark.pandas.read_table and receive the following error:
AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity Catalog.;
AttachDistributedSequence[__index_level_0__#767L, _c0#734, carat#735, cut#736, color#737, clarity#738, depth#739, table#740, price#741, x#742, y#743, z#744] Index: __index_level_0__#767L
+- SubqueryAlias spark_catalog.default.diamonds
+- Relation hive_metastore.default.diamonds[_c0#734,carat#735,cut#736,color#737,clarity#738,depth#739,table#740,price#741,x#742,y#743,z#744] csv
The table was created following the Databricks Quick Start notebook:
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
I'm trying to read the table with
import pyspark.pandas as ps
psdf = ps.read_table("hive_metastore.default.diamonds")
and get the error above.
Reading the table into spark.sql.DataFrame works fine with
df = spark.read.table("hive_metastore.default.diamonds")
The cluster versions are
Databricks Runtime Version 11.2
Apache Spark 3.3.0
Scala 2.12
I'm familiar with pandas already and would like to use pyspark.pandas.DataFrame since I assume it will have a familiar API and be quick for me to learn and use.
The questions I have:
What does the error mean?
What can I do to read the tables to pyspark.pandas.DataFrame?
Alternatively, should I just learn pyspark.sql.DataFrame and use that? If so, why?
The AttachDistributedSequence is a special extension used by Pandas on Spark to create a distributed index. Right now it's not supported on Shared clusters enabled for Unity Catalog due to the restricted set of operations allowed on such clusters. The workarounds are:
Use a single-user Unity Catalog enabled cluster
Read the table using the Spark API, and then use the pandas_api function (doc) to convert it into a Pandas on Spark DataFrame (in Spark 3.2.x/3.3.x it's called to_pandas_on_spark (doc)):
pdf = spark.read.table("abc").pandas_api()
P.S. It's not recommended to use .toPandas as it will pull all data to the driver node.

How to use Spark BigQuery connector to join multiple tables and then fetch the data into dataframe?

I have to read three different BigQuery tables and then join them to get some data, which will be stored in a GCS bucket. I am using the Spark BQ connector.
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
    .option('table', bq_dataset + bq_table) \
    .load()
bqdf.createOrReplaceTempView('bqdf')
This reads the entire table into a dataframe. I know that I can apply filters to the tables and also select only the required columns, and then create three dataframes and join them to get the output.
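The approach I have in mind looks roughly like this sketch (the project/dataset/table names and columns are placeholders; as I understand it, the connector pushes the 'filter' option and the selected columns down to BigQuery instead of reading whole tables):

```python
# Sketch only: table and column names below are placeholders.
t1 = (spark.read.format('bigquery')
      .option('table', 'my-project.my_dataset.orders')
      .option('filter', "order_date >= '2020-01-01'")  # pushed down to BigQuery
      .load()
      .select('order_id', 'customer_id', 'amount'))    # column pruning

t2 = (spark.read.format('bigquery')
      .option('table', 'my-project.my_dataset.customers')
      .load()
      .select('customer_id', 'customer_name'))

t3 = (spark.read.format('bigquery')
      .option('table', 'my-project.my_dataset.regions')
      .load()
      .select('customer_id', 'region'))

# Join the three filtered dataframes and write the result to GCS.
joined = t1.join(t2, 'customer_id').join(t3, 'customer_id')
joined.write.mode('overwrite').csv('gs://my-bucket/output/', header=True)
```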
Is there any equivalent way to achieve this?
I have the option of using the BigQuery client API (https://googleapis.dev/python/bigquery/latest/index.html) and importing it from the pyspark script. However, if I can achieve this through the Spark BQ connector, I don't want to use the API call from the Python script.
Please help.

transform data in azure data factory using python data bricks

I have the task of transforming and consolidating millions of single JSON files into big CSV files.
The operation would be very simple using a copy activity and mapping the schemas; I have already tested this. The problem is that a massive number of the files have a bad JSON format.
I know what the error is, and the fix is very simple too. I figured that I could use a Python Databricks activity to fix the string and then pass the output to a copy activity that consolidates the records into a big CSV file.
I have something in mind like this, but I'm not sure if this is the proper way to address the task. I don't know how to use the output of the Copy Activity in the Databricks activity.
It sounds like you want to transform a large number of single JSON files using Azure Data Factory, but as @KamilNowinski said, that is not currently supported on Azure. However, since you are already using Azure Databricks, writing a simple Python script to do the same thing is easier. So a workaround is to use the Azure Storage SDK and the pandas Python package directly, in a few steps on Azure Databricks.
Assuming these JSON files are all in a container of Azure Blob Storage, you need to list them in the container via list_blob_names and generate their URLs with a SAS token for the pandas read_json function, as in the code below.
from azure.storage.blob.baseblobservice import BaseBlobService  # legacy azure-storage-blob (< v12) SDK
from azure.storage.blob import ContainerPermissions
from datetime import datetime, timedelta

account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<your container name>'

service = BaseBlobService(account_name=account_name, account_key=account_key)
# Generate a read-only SAS token valid for one hour.
token = service.generate_container_shared_access_signature(
    container_name,
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1),
)
blob_names = service.list_blob_names(container_name)
blob_urls_with_token = (
    f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}"
    for blob_name in blob_names
)
# print(list(blob_urls_with_token))
Then you can read these JSON files directly from the blobs via the read_json function to create their pandas DataFrames.
import pandas as pd

for blob_url_with_token in blob_urls_with_token:
    df = pd.read_json(blob_url_with_token)
Even if you want to merge them into one big CSV file, you can first merge them into one big DataFrame via the pandas functions listed under Combining / joining / merging, such as append or concat.
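A small sketch of that merge step, assuming the per-blob DataFrames have been collected into a list first (the sample frames below stand in for the ones read from blob storage):

```python
import pandas as pd

# Stand-ins for the per-file DataFrames read from blob storage.
frames = [
    pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}),
    pd.DataFrame({"id": [3], "value": ["c"]}),
]

# Concatenate into one big DataFrame, then write a single CSV.
big_df = pd.concat(frames, ignore_index=True)
big_df.to_csv("/tmp/big.csv", index=False)
```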
To write a DataFrame to a CSV file, the to_csv function makes it very easy. Or you can convert a pandas DataFrame to a PySpark DataFrame on Azure Databricks, as in the code below.
from pyspark.sql import SQLContext
from pyspark import SparkContext

# On Databricks a SparkSession named `spark` already exists,
# so spark.createDataFrame(df) works as well.
sc = SparkContext()
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
So whatever you want to do next is simple. And if you want to schedule the script as a notebook in Azure Databricks, you can refer to the official document Jobs to run Spark jobs.
Hope it helps.
Copy the JSON files to storage (e.g. Blob) so that you can access the storage from Databricks. Then you can fix the files using Python, and even transform them to the required format with the cluster running.
So, in the Copy Data activity, copy the files to Blob if they aren't there yet.

AWS Glue - Pick Dynamic File

Does anyone know how to pick up a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, my issue is that new files will arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run the crawler if the files are located in the same place. For example, if your data folder is s3://bucket/data/<files> then you can add new files to it and run the ETL job; the new files will be picked up automatically.
However, if the data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files>, then you need either to run the crawler or execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register the new partitions in the Glue Catalog before starting the Glue ETL job.
When data is loaded into a DynamicFrame or a Spark DataFrame, you can apply filters to use only the data you need. If you still want to work with file names, you can add the file name as a column using the input_file_name Spark function and then filter on it:
from pyspark.sql.functions import col, input_file_name

df.withColumn("filename", input_file_name()) \
    .where(col("filename") == "your-filename")
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, i.e. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from using pushdown predicates in AWS Glue.
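With such a partition layout, a Glue ETL job can restrict the read to specific partitions using a pushdown predicate. A sketch, assuming a Glue job environment with the awsglue library available; the database, table, and partition column names are placeholders:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only S3 partitions matching the predicate are listed and read;
# 'my_db', 'my_table', and the year/month/day columns are placeholders.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    push_down_predicate="year == '2023' and month == '01' and day == '15'",
)
```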

Saved table to Hive metastore with .saveAsTable(), how do I reload?

I used .saveAsTable on my DataFrame and now it is stored in my HDFS Hive warehouse metastore. How can I load this back into Spark SQL? I have deleted my cluster (Azure HDInsight) and created a new one, and I confirmed that my Hive metastore location is the same and the directory is still there.
I need to load this again as a persistent table, not as a temp table, because I am using the PowerBI/Spark connector. The only way I have found to do so far is to load the directory back into a DataFrame and then run .saveAsTable again, which writes the files again and takes a long time to process. I'm hoping there is a better way!
After you use .saveAsTable you can query the table directly with SQL:
df.write.saveAsTable("tableName")
myOtherDf = sqlContext.sql("select * from tableName")
