Unable to gather data using PySpark connection with Hive - python

I am currently trying to run queries via PySpark. The connection and database access work fine. Unfortunately, when I run a query, the only output displayed is the column names followed by None.
I read through the documentation but could not find any answers. Posted below is how I accessed the database.
import sys

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql import SQLContext
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)

sc = SparkContext('local', 'pyspark')
sqlctx = SQLContext(sc)
df = sqlctx.read.format("jdbc").option("url", "jdbc:hive2://.....").option("dbtable", "(SELECT * FROM dtable LIMIT 10) df").load()
print(df.show())
The output of df.show() is just the column names. When I run the same query using PyHive, data is returned, so I assume it has something to do with the way I am trying to load the table using PySpark.
Thanks!
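Since the behaviour only differs on the JDBC path, one thing worth checking is reading the table through a Hive-enabled SparkSession instead of the hive2 JDBC driver. A minimal sketch, assuming the Hive metastore is reachable from Spark and the table dtable from the question lives in the default database:
from pyspark.sql import SparkSession

# Build a session with Hive support so Spark talks to the metastore directly
spark = SparkSession.builder \
    .appName("hive-read-check") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the same rows through Spark SQL instead of the hive2 JDBC driver
df = spark.sql("SELECT * FROM dtable LIMIT 10")
df.show()  # show() prints the rows itself and returns None, so don't wrap it in print()
Note that the trailing None in the original output comes from print(df.show()): show() prints the rows and returns None.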

Related

Running SPARK with SQL file from python gives error

I am trying to call a .sql file containing Hive queries from a Python file using Spark. It gives the error: AttributeError: 'Builder' object has no attribute 'SparkContext'
I looked at multiple posts with a similar error and tried their suggestions, but none of them worked for me. Here is my code.
from pyspark import SparkContext, SparkConf, SQLContext

sc = SparkSession.SparkContext.getOrCreate()

with open("/apps/home/p1.sql") as fr:
    query = fr.read()

results = sc.sql(query)
The p1.sql file contains the SQL queries. How do I pass parameters to the SQL file? And what is different when the SQL returns rows versus when it does not return any rows? I am new to Spark, so I would appreciate an answer that includes the code lines. Thanks
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

with open("/apps/home/p1.sql") as fr:
    query = fr.read()

results = spark.sql(query)
You can refer to https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession
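On the follow-up about passing parameters: Spark does not substitute values into a .sql file for you, so a common approach is to put placeholders in the file and fill them in from Python before calling spark.sql(). A minimal sketch, assuming the file uses str.format-style placeholders such as {run_date} (the placeholder name is made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

with open("/apps/home/p1.sql") as fr:
    query_template = fr.read()

# e.g. the file contains: SELECT * FROM some_table WHERE dt = '{run_date}'
query = query_template.format(run_date="2020-01-01")

results = spark.sql(query)

# A SELECT gives a DataFrame with rows you can inspect; DDL/DML statements
# also return a DataFrame, it is just empty.
results.show()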

Cannot import parse_url in pyspark

I have this SQL query, written in HiveQL, for PySpark:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
I would like to translate it into a functional DataFrame query like:
df.select(split(parse_url(col('page.viewed_page'), 'HOST')))
but when I import the parse_url function I get:
----> 1 from pyspark.sql.functions import split, parse_url
ImportError: cannot import name 'parse_url' from 'pyspark.sql.functions' (/usr/local/opt/apache-spark/libexec/python/pyspark/sql/functions.py)
Could you point me in the right direction for importing the parse_url function?
Cheers
parse_url is a Hive UDF, so you need to enable Hive support when creating the SparkSession object:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
Then your following query should work:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
If your Spark is <2.2:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)

query = 'SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df'

hiveContext.sql(query)  # this will work
sqlContext.sql(query)   # this will not work
EDIT:
parse_url is a Spark SQL built-in from Spark v2.3. It's not available in pyspark.sql.functions as of yet (11/28/2020). You can still use it on a PySpark DataFrame via selectExpr, like this:
df.selectExpr('parse_url(mycolumn, "HOST")')
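Equivalently, if you prefer staying inside a select() chain, the same built-in can be reached through pyspark.sql.functions.expr. A small sketch, reusing the column name mycolumn from the answer above:
from pyspark.sql.functions import expr

# expr() parses a SQL expression, so Spark SQL builtins like parse_url are available
df.select(expr('parse_url(mycolumn, "HOST")').alias('host'))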

How to fix AWS Glue code in displaying count and schema of partitioned table from AWS S3

I'm trying to count the records and print the schema of my partitioned table (stored as Parquet). I'm doing it only in the AWS Glue console (since I don't have access to connect to a developer endpoint). However, I don't think my query is producing any result. See my code below. Any suggestion?
%pyspark
from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "s3",
    table_name = "subscriber",
    push_down_predicate = "(date=='2018-12-06')",
    transformation_ctx = "datasource0"
)

df = datasource0.toDF()
print(df.count())
df.printSchema()
I'm not sure about using print in Glue... I would recommend using logging to print results. You can get the logger object and use it like this:
spark = glueContext.spark_session
log4jLogger = spark.sparkContext._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)
logger.info(df.count())
From the job console you can then access the logs of the specific job run. There you should be able to see your DataFrame count, for example.
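Since df.printSchema() also writes to stdout rather than to the log, you can route the schema through the same logger. A small sketch, reusing the logger and df set up above:
# df.schema is a StructType; simpleString() gives a compact one-line rendering of the schema
logger.info("Row count: {}".format(df.count()))
logger.info("Schema: {}".format(df.schema.simpleString()))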

How to handle empty table from Glue's data catalog in pyspark

I'd like to execute Spark SQL on SageMaker via AWS Glue, but haven't succeeded.
What I want to do is parameterize the Glue job, so it needs to be able to accept empty tables as input. However, when glueContext.create_dynamic_frame.from_catalog is given an empty table, it raises an error.
Here's the code that raises the error:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

df1 = glueContext.create_dynamic_frame.from_catalog(
    database = "<glue's database name>",
    table_name = "<glue's table name>",  # I want this to be parameterized
    transformation_ctx = "df1"
)

df1 = df1.toDF() # Here raises an Error
df1.createOrReplaceTempView('tmp_table')
df_sql = spark.sql("""SELECT ...""")
And this is the error:
Unable to infer schema for Parquet. It must be specified manually.
Is it impossible to use an empty table as an input to DynamicFrame? Thank you in advance.
df1 = df1.toDF() # Here raises an Error
Replace this line with the following (note that DynamicFrame needs to be imported from awsglue.dynamicframe):
from awsglue.dynamicframe import DynamicFrame

dynamic_df = DynamicFrame.fromDF(df1, glueContext, 'sample_job')  # Load the PySpark DataFrame into a DynamicFrame
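If the underlying problem is that Spark cannot infer a Parquet schema from an empty table, another way to keep a parameterized job from failing is to catch that case and fall back to an empty DataFrame with an explicit schema. A minimal sketch under that assumption, reusing glueContext and spark from the question above; the column names are made up for illustration and should be replaced with the table's real schema:
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema for illustration only; use your table's actual columns here
expected_schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", StringType(), True),
])

try:
    df1 = glueContext.create_dynamic_frame.from_catalog(
        database = "<glue's database name>",
        table_name = "<glue's table name>",
        transformation_ctx = "df1"
    ).toDF()
except Exception as e:
    # We assume the "Unable to infer schema for Parquet" error surfaces here;
    # fall back to an empty DataFrame so the downstream SQL can still run
    if "Unable to infer schema" not in str(e):
        raise
    df1 = spark.createDataFrame([], expected_schema)

df1.createOrReplaceTempView('tmp_table')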

Getting org.bson.BsonInvalidOperationException: Invalid state INITIAL while printing pyspark dataframe

I am able to connect to MongoDB from my Spark job, but when I try to view the data that is being loaded from the database I get the error mentioned in the title. I am using the pyspark module of Apache Spark.
The code snippet is:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import sys

print(sys.stdin.encoding, sys.stdout.encoding)

conf = SparkConf()
conf.set('spark.mongodb.input.uri', 'mongodb://127.0.0.1/github.users')
conf.set('spark.mongodb.output.uri', 'mongodb://127.0.0.1/github.users')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()

df = df.sort('followers', ascending=True)
df.take(1)
