How to Use DataFrame Created in Scala in Databricks' PySpark

My Databricks notebook is in Python.
Some cells in the notebook are written in Scala (using the %scala magic command), and one of them creates a DataFrame.
When I switch back to Python/PySpark (the default mode), how can I use or access the DataFrame that was created in the Scala cell?
Is it even possible?
Thanks

You can access DataFrames created in one language from another language through temp tables in Spark SQL.
For instance, say you have a DataFrame in Scala called scalaDF. You can create a temporary view of it and make it accessible from a Python cell, for instance:
scalaDF.createOrReplaceTempView("my_table")
Then in a Python cell you can run
pythonDF = spark.sql("select * from my_table")
pythonDF.show()
The same approach works for passing DataFrames between these languages and R; the common construct is a Spark SQL temporary view/table.
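Putting it together in a Databricks notebook, the round trip might look roughly like this (the example DataFrame and its columns are made up purely for illustration):
%scala
// Scala cell: build a DataFrame and expose it as a temporary view
val scalaDF = Seq((1, "a"), (2, "b")).toDF("id", "letter")
scalaDF.createOrReplaceTempView("my_table")
%python
# Python cell: read the same data back through Spark SQL
pythonDF = spark.sql("select * from my_table")
pythonDF.show()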

Related

Pyspark external table compression does not work

I am trying to save an external table from PySpark in Parquet format, and I need to compress it. The PySpark version I am using is 2.4.7. I am updating the table after the initial creation and appending data to it in a loop.
So far I have set the following options:
.config("spark.sql.parquet.compression.codec", "snappy") df.write.mode("append").format("parquet").option("compression","snappy").saveAsTable(...) df.write.mode("overwrite").format("parquet").option("compression","snappy").saveAsTable(...)
Is there anything else that I need to set or am I doing something wrong?
Thank you

DataFrame.write.parquet - Parquet-file cannot be read by HIVE or Impala

I wrote a DataFrame with pySpark into HDFS with this command:
df.repartition(col("year"))\
.write.option("maxRecordsPerFile", 1000000)\
.parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')
When I take a look at HDFS, I can see that the files are there as expected. However, when I try to read the table with Hive or Impala, the table cannot be found.
What is going wrong here? Am I missing something?
Interestingly, df.write.format('parquet').saveAsTable("tablename") works properly.
This is expected behaviour from Spark:
df.write.parquet(path) writes the data to the HDFS location and does not create any table in Hive,
whereas df.write.saveAsTable(name) creates the table in Hive and writes the data to it.
If the table already exists, the behaviour of this function depends on the save mode, specified by the mode function (which defaults to throwing an exception). When the mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.
That is why you cannot find the table in Hive after performing df.write.parquet(...).
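A minimal sketch of the difference (the table name and path are placeholders, not taken from the question):
# Writes Parquet files to the HDFS path only; nothing is registered in the Hive
# metastore, so Hive/Impala cannot see a table for these files.
df.write.mode("overwrite").parquet("/path/tablename")
# Writes the files and also registers the table in the metastore, so Hive can find it.
df.write.mode("overwrite").format("parquet").saveAsTable("tablename")
If you want to keep the plain .parquet() write, you have to create an external table over that HDFS location yourself (and add or repair its partitions) before Hive or Impala can see it.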

Neo4j as data source for pyspark

I have a requirement where I have to pull data from Neo4j and create Spark RDDs out of that data. I'm using Python in my project. There is this connector for the same purpose, but it's written in Scala. So I can think of the following workarounds for now:
Query data from Neo4j in small chunks/batches and convert each chunk to a Spark RDD using the parallelize() method. Finally, merge all the RDDs using the union() method to get a single RDD, and then run transformations and actions on it (a rough sketch of this follows below).
Another approach is to read data from Neo4j and create a Kafka producer out of it. Then use Kafka as a data source for Spark. e.g.
Neo4j -> Kafka -> Spark
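A rough sketch of the first workaround, using the official Neo4j Python driver (the connection URI, credentials, Cypher query, node label, and batch size below are placeholders, not taken from the question):
from neo4j import GraphDatabase
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("neo4j-to-rdd").getOrCreate()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

BATCH_SIZE = 10000
rdds = []

with driver.session() as session:
    skip = 0
    while True:
        # Pull one batch of records from Neo4j as a list of dicts
        batch = session.run(
            "MATCH (n:Person) RETURN n.name AS name SKIP $skip LIMIT $limit",
            skip=skip, limit=BATCH_SIZE,
        ).data()
        if not batch:
            break
        # Convert the batch into an RDD
        rdds.append(spark.sparkContext.parallelize(batch))
        skip += BATCH_SIZE

driver.close()

# Merge all per-batch RDDs into a single RDD
full_rdd = spark.sparkContext.union(rdds) if rdds else spark.sparkContext.emptyRDD()
Note that SparkContext.union() accepts a list of RDDs, so the per-batch RDDs can be merged in a single call.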
I want to know which one is more efficient for large volumes of data, and if there is a better approach to this problem, please help me out with it.
Note: I did try to extend the PySpark API in order to create a custom RDD in Python, but the PySpark API is very different from Spark's Scala/Java API. With the Scala API, a custom RDD can be created by extending the RDD class and overriding the compute() and getPartitions() methods, but in the PySpark API I couldn't find compute() under the RDD class in rdd.py.
This blog post by Michael Hunger talks about using Spark to get CSV data into Neo4j, but maybe some of the Spark code will help you out. There is also Mazerunner, a Spark/Neo4j/GraphX integration tool for using Spark to pass subgraphs to Neo4j.

Python script to Python UDF

I have a Python script which reads data from a Hive table, formats it using pandas, and writes the updated data back to a new Hive table.
Can I convert this normal Python script (which contains raw script and other functions too) to a Python UDF?
If yes, can I access Hive from the UDF too?
Thanks in advance

How to write a Python UDF for User Defined Aggregate Function in Hive

I would like to do some aggregation work on an aggregate column (after GROUP BY) in Hive using Python. I found that there is a UDAF for this purpose, but all I can find are Java examples. Is there an example of writing one in Python?
Or is there no difference between a UDF and a UDAF for Python? For a UDAF, do I just need to write it like a reducer? Please advise.
You can make use of Hive's streaming UDF functionality (TRANSFORM) to use a Python UDF which reads from stdin and writes to stdout. You haven't found any Python "UDAF" examples because UDAF refers to the Hive Java class you extend, so it only exists in Java.
When using a streaming UDF, Hive will choose whether to launch a map or a reduce job, so there is no need to specify one (for more on this functionality see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform).
Basically, your implementation would be a Python script which reads from stdin, calculates some aggregate number, and writes it to stdout. To implement this in Hive, do the following:
1) First add your python script to your resource library in Hive so that it gets distributed across your cluster:
add file script.py;
2) Then call your transform function and input the columns you want to aggregate. Here is an example:
select transform(input cols)
using 'python script.py' as (output cols)
from table
;
Depending on what you need to do, you may need a separate mapper and reducer script. If you need to aggregate based on column value, remember to use Hive's CLUSTER BY/DISTRIBUTE BY syntax in your mapper stage so that partitioned data gets sent to the reducer.
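For illustration, a minimal script.py along these lines might look like the following; the tab-separated (key, value) input layout and the sum aggregation are assumptions, not something given in the question:
import sys

# Assumes each input line is tab-separated as (group_key, value) and that rows for the
# same key arrive together (e.g. via CLUSTER BY group_key in the mapper stage).
current_key = None
running_sum = 0.0

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if current_key is not None and key != current_key:
        # The key changed: emit the aggregate for the previous group
        print("%s\t%s" % (current_key, running_sum))
        running_sum = 0.0
    current_key = key
    running_sum += float(value)

# Emit the last group
if current_key is not None:
    print("%s\t%s" % (current_key, running_sum))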
Let me know if this helps.
