I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table and receive the following error:
AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity Catalog.;
AttachDistributedSequence[__index_level_0__#767L, _c0#734, carat#735, cut#736, color#737, clarity#738, depth#739, table#740, price#741, x#742, y#743, z#744] Index: __index_level_0__#767L
+- SubqueryAlias spark_catalog.default.diamonds
+- Relation hive_metastore.default.diamonds[_c0#734,carat#735,cut#736,color#737,clarity#738,depth#739,table#740,price#741,x#742,y#743,z#744] csv
The table was created following the Databricks Quick Start notebook:
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
I'm trying to read the table with
import pyspark.pandas as ps
psdf = ps.read_table("hive_metastore.default.diamonds")
and get the error above.
Reading the table into spark.sql.DataFrame works fine with
df = spark.read.table("hive_metastore.default.diamonds")
The cluster versions are
Databricks Runtime Version 11.2
Apache Spark 3.3.0
Scala 2.12
I'm familiar with pandas already and would like to use pyspark.pandas.DataFrame since I assume it will have a familiar API and be quick for me to learn and use.
The questions I have:
What does the error mean?
What can I do to read the tables to pyspark.pandas.DataFrame?
Alternatively, should I just learn pyspark.sql.DataFrame and use that? If so, why?
AttachDistributedSequence is a special extension used by Pandas on Spark to create a distributed index. Right now it's not supported on the Shared clusters enabled for Unity Catalog due to the restricted set of operations enabled on such clusters. The workarounds are:
Use a single-user Unity Catalog-enabled cluster.
Read the table using the Spark API, and then use the pandas_api function (doc) to convert it into a Pandas on Spark DataFrame (in Spark 3.2.x/3.3.x it's called to_pandas_on_spark (doc)):
pdf = spark.read.table("abc").pandas_api()
P.S. It's not recommended to use .toPandas as it will pull all data to the driver node.
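Applied to the diamonds table from the question, a minimal sketch (on the questioner's Spark 3.3.0 runtime the method is named to_pandas_on_spark):
# Read with the Spark API first, then convert to Pandas on Spark.
df = spark.read.table("hive_metastore.default.diamonds")
psdf = df.to_pandas_on_spark()  # Spark 3.2.x/3.3.x
# psdf = df.pandas_api()        # newer Spark versions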
I have a BigQuery data warehouse containing all the data from a MongoDB database; the data is synced once a day.
I would like to add a column to one of my tables: that column is a cleaned + lemmatized version of another column (of type string). I can't do that with DBT because I need to use the Python library spaCy. How could I run such a transformation on my table without having to pull all the data locally and send 10M UPDATEs to BigQuery? Are there GCP tools for running Python functions against BigQuery, like Dataflow or something like that?
And more generally, how do you transform data when tools like DBT are not enough?
Thanks for your help!
You can try Dataflow batch processing for your requirement, since Dataflow is a fully managed service that can run a transformation on your table without downloading the data locally, and the spaCy library can be used within Dataflow pipelines. Although BigQuery and Dataflow are managed services that can process large amounts of data, it is always best practice to split larger NLP jobs into smaller ones, as discussed here.
Note: since you want to add a column that is a lemmatized and cleaned version of an existing column, it would be better to write to a new destination table.
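A hedged sketch of what such a pipeline could look like (all project, bucket, dataset, table and column names, as well as the schema, are placeholders; spaCy and its model must also be installed on the Dataflow workers, e.g. via a setup.py or requirements file):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class LemmatizeColumn(beam.DoFn):
    def setup(self):
        # Load the spaCy model once per worker, not once per element.
        import spacy
        self.nlp = spacy.load("en_core_web_sm")

    def process(self, row):
        # Rows read from BigQuery arrive as dicts; add the cleaned column.
        doc = self.nlp(row.get("description") or "")
        row["description_lemmatized"] = " ".join(tok.lemma_ for tok in doc)
        yield row


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromBigQuery(
               table="my-project:my_dataset.source_table")
         | "Lemmatize" >> beam.ParDo(LemmatizeColumn())
         | "Write" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.cleaned_table",
               schema="id:STRING,description:STRING,description_lemmatized:STRING",
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))


if __name__ == "__main__":
    run()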
I have created a local Spark environment in Docker. I intend to use this as part of a CI/CD pipeline for unit testing code executed in the Spark environment. I have two scripts I want to use: one will create a set of persistent Spark databases and tables, and the other will read those tables. Even though the tables should be persistent, they only persist in that specific Spark session. If I create a new Spark session, I cannot access the tables, even though they are visible in the file system. Code examples are below:
Create db and table
Create_script.py
from pyspark.sql import SparkSession
def main():
    spark = SparkSession.builder.appName('Example').getOrCreate()
    columns = ["language", "users_count"]
    data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
    rdd = spark.sparkContext.parallelize(data)
    df = rdd.toDF(columns)
    spark.sql("create database if not exists schema1")
    df.write.mode("ignore").saveAsTable('schema1.table1')

if __name__ == "__main__":
    main()
Load Data
load_data.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.sql("select * from schema1.table1")
I know there is a problem because when I run print(spark.catalog.listDatabases()), it can only find the default database. But if I import Create_script.py, then it will find the schema1 db.
How do I make tables persist across all Spark sessions?
The files in /repo/test/spark-warehouse are only the data of the tables, without the meta info of the databases/tables/columns.
If you don't enable Hive, Spark uses an InMemoryCatalog, which is ephemeral, intended only for testing, and only available within the same Spark context. This InMemoryCatalog doesn't provide any way to load databases/tables from the file system.
So there are two ways:
Columnar Format
Write the data in ORC/Parquet format with df.write.orc() in your Create_script.py script; ORC/Parquet store the column info alongside the data.
Read it back with df = spark.read.orc(...), then call createOrReplaceTempView if you need to use it in SQL, as sketched below.
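A minimal PySpark sketch of this route (the file path is a placeholder):
# In Create_script.py: write the DataFrame as ORC instead of saveAsTable.
df.write.mode("overwrite").orc("/repo/test/data/table1")

# In load_data.py: read the files back and register a temp view for SQL.
df = spark.read.orc("/repo/test/data/table1")
df.createOrReplaceTempView("table1")
spark.sql("select * from table1").show()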
Use Embedded Hive
You don't need to install Hive; Spark can work with an embedded Hive in just two steps:
Add the spark-hive dependency. (I'm using Java, which uses pom.xml to manage dependencies; I don't know how to do it in PySpark.)
SparkSession.builder().enableHiveSupport()
Then the data will be in /repo/test/spark-warehouse/schema1.db and the meta info in /repo/test/metastore_db, which contains the files of a Derby database. You can then read and write tables across all Spark sessions.
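For reference, a minimal PySpark sketch of step 2 (assuming your Spark build ships with Hive support, as the standard prebuilt distributions do):
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark persist catalog metadata in a local Derby
# metastore (metastore_db) instead of the ephemeral in-memory catalog.
spark = (SparkSession.builder
         .appName("Example")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("create database if not exists schema1")
# Tables written with saveAsTable are now visible to any later session that
# starts with the same warehouse directory and metastore_db.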
I'm new to Spark programming and have a question about how to read partitioned tables using pyspark.
Let us say we have a table partitioned as below:
~/$table_name/category=$category/year=$year/month=$month/day=$day
Now, I want to read data from all the categories, but want to restrict the data by time period. Is there any way to specify this with wildcards rather than writing out all the individual paths?
Something to the effect of
table_path = ["~/$table_name/category=*/year=2019/month=03",
"~/$table_name/category=*/year=2019/month=04"]
table_df_raw = spark.read.option(
"basePath", "~/$table_name").parquet(*table_path)
Also, as a bonus, is there a more pythonic way to specify time ranges that may fall in different years, rather than listing the paths individually?
Edit: To clarify a few things, I don't have access to the Hive metastore for this table and hence can't access it with just a SQL query. Also, the size of the data doesn't allow filtering after conversion to a dataframe.
You can try this
Wildcards can also be used to specify a set or range of months:
table_df_raw = (spark.read
                .option("basePath", "~/$table_name")
                .parquet("~/$table_name/category=*/year=2019/month={3,4,8}"))
Or
table_df_raw = (spark.read
                .option("basePath", "~/$table_name")
                .parquet("~/$table_name/category=*/year=2019/month=[3-4]"))
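For the bonus part of the question (time ranges that may span years), one option is to build the month paths programmatically, e.g. with pandas; a sketch under the question's placeholder layout (the date range here is made up):
import pandas as pd

# Build one wildcard path per month in the desired (possibly multi-year) range.
months = pd.period_range("2018-11", "2019-04", freq="M")
table_path = [
    "~/$table_name/category=*/year={}/month={:02d}".format(m.year, m.month)
    for m in months
]

table_df_raw = (spark.read
                .option("basePath", "~/$table_name")
                .parquet(*table_path))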
Are you using a Hortonworks HDP cluster? If yes, try using the Hive Warehouse Connector. It allows Spark to access the Hive catalog. After that, you can perform any Spark SQL command over Hive tables: https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html
If you aren't using Hortonworks, I suggest you look at this link: https://acadgild.com/blog/how-to-access-hive-tables-to-spark-sql
I want to use Zeppelin to query databases. I currently see two possibilities, but neither is sufficient for me:
Configure a database connection as an "interpreter", name it e.g. "sql1", use it in a paragraph, run a SQL query and use the built-in plotting tools. It seems that all the tutorials and tips deal with this, but then the documentation suddenly stops! But I want to do more with the data: I want to filter and process it. If I want to plot it again (with other restrictions), I have to run the query (which may take seconds or minutes) again (see my other question Zeppelin SQL: reuse data of query without another interpreter or a new query).
Use Spark with Python, Scala or similar. But the documentation only seems to load CSV data, put it into a dataframe and then access that dataframe with SQL; there is no accessing the SQL data in the first place. What is the best way to access the SQL data? Can I use an already configured "interpreter" (database connection)?
You can use Zeppelin API to retrieve paragraph data:
val buffer = scala.io.Source.fromURL("http://XXXXX:9995/api/notebook/2CN2QP93H/paragraph/20170713-092810_1633770798").mkString
val df = sqlContext.read.json(sc.parallelize(buffer :: Nil)).select("body.text")
df.first.getAs[String](0)
These Spark Scala lines retrieve the SQL query used by a paragraph. You could do the same thing to get the results, I think.
I cannot find a solution for 1., but I have put together a short solution for 2. that works within Zeppelin with Python (2.7), sqlalchemy (SQL wrapper), mysqldb (MySQL implementation) and pandas (make sure you have these packages installed; all of them are available in Debian 9). I wonder why I have not found such a solution before...
%python
from sqlalchemy import create_engine
import pandas as pd
sql = "select col1, col2 from table limit 10"
df = pd.read_sql(sql,
                 create_engine('mysql+mysqldb://user:password@host:3306/database').connect())
z.show(df)
If you want to connect to another database such as DB2 or Oracle, you have to use other Python packages and adjust the first part of the create_engine string.
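A sketch of what the adjusted connection strings could look like (host, port and credentials are placeholders; the dialect+driver prefix depends on which client package you install, e.g. cx_Oracle for Oracle or ibm_db_sa for DB2):
from sqlalchemy import create_engine

# Oracle via the cx_Oracle driver:
engine = create_engine('oracle+cx_oracle://user:password@host:1521/sidname')

# DB2 via the ibm_db_sa package:
engine = create_engine('db2+ibm_db://user:password@host:50000/database')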
I have been prototyping a Beam pipeline using the Python SDK and have been able to use the BigQuerySink to output my final PCollection just fine using this:
beam.io.Write(beam.io.BigQuerySink('dataset.table',
self.get_schema(),
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
Modifying the table to include a partition decorator, such as dataset.table$20170517, triggers the following error when trying to run this pipeline with the DirectRunner:
"code": 400,
"message": "Cannot read partition information from a table that is not partitioned:
I have studied the examples found here but found no trace of partition use:
https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples
How can Beam sink data into partitioned BigQuery tables?
The apache_beam Python SDK does accept partition decorators for the BigQuerySink. Experimenting with the different write_disposition options available reveals more information.
WRITE_TRUNCATE will not write to table partitions. Using the $YYYYmmdd partition decorator in the table name will result in this error (this differs from the Google Python SDK behaviour, which does accept the partition decorator):
Table IDs must be alphanumeric (plus underscores) and
must be at most 1024 characters long.
WRITE_EMPTY will accept the partition decorator.
WRITE_APPEND will accept the partition decorator.
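Put together with the code from the question, a minimal sketch (the schema string is a placeholder):
import apache_beam as beam

# The $YYYYmmdd partition decorator works with WRITE_APPEND (or WRITE_EMPTY),
# per the notes above.
sink = beam.io.BigQuerySink(
    'dataset.table$20170517',
    schema='field1:STRING,field2:INTEGER',  # placeholder schema
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

# ... | beam.io.Write(sink)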