I am able to connect to MongoDB from my Spark job, but when I try to view the data that is being loaded from the database I get the error mentioned in the title. I am using the pyspark module of Apache Spark.
The code snippet is:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import sys

print(sys.stdin.encoding, sys.stdout.encoding)

conf = SparkConf()
conf.set('spark.mongodb.input.uri', 'mongodb://127.0.0.1/github.users')
conf.set('spark.mongodb.output.uri', 'mongodb://127.0.0.1/github.users')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()

df = df.sort('followers', ascending=True)
df.take(1)
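For reference, the same read can also be written against a SparkSession, the Spark 2.x entry point, instead of the older SQLContext. This is only a sketch; it assumes the mongo-spark-connector package is on the classpath (the package coordinates below are an example and must match your Spark/Scala version):

from pyspark.sql import SparkSession

# Sketch only: assumes the MongoDB Spark connector is available, e.g.
# spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
spark = (SparkSession.builder
         .appName("mongo-read")
         .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/github.users")
         .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/github.users")
         .getOrCreate())

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()

# sort() returns a new DataFrame; take(1) collects a single Row to the driver
print(df.sort("followers", ascending=True).take(1))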
I am trying to call a .sql file with Hive queries from a Python .py file using Spark. It gives this error: AttributeError: 'Builder' object has no attribute 'SparkContext'. I have looked at multiple posts with a similar error and tried their suggestions, but none of them worked for me. Here is my code:
from pyspark import SparkContext, SparkConf, SQLContext

sc = SparkSession.SparkContext.getOrCreate()

with open("/apps/home/p1.sql") as fr:
    query = fr.read()

results = sc.sql(query)
The p1.sql file contains SQL queries. How do I pass parameters to the SQL file? And what will be different when the SQL returns rows versus when it does not? I am new to Spark, so I would appreciate it if the answer included the code lines. Thanks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

with open("/apps/home/p1.sql") as fr:
    query = fr.read()

results = spark.sql(query)
You can refer to https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession
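On passing parameters: spark.sql() has no special mechanism for a plain .sql file, so one simple approach is ordinary string substitution before the call. A minimal sketch, assuming p1.sql uses Python {placeholder} markers (the run_date name is just an example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes p1.sql contains something like:
#   SELECT * FROM some_table WHERE load_date = '{run_date}'
with open("/apps/home/p1.sql") as fr:
    query = fr.read().format(run_date="2020-01-01")

results = spark.sql(query)

# A SELECT gives you rows to inspect; DDL/DML statements come back as an
# empty DataFrame, so show() prints nothing useful.
results.show()
print(results.count())

Note that spark.sql() executes a single statement, so if the file holds several queries separated by semicolons you will need to split them and run them one at a time.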
I am running a job on a Databricks notebook that connects to my MySQL database on AWS RDS and inserts data. When I ran the notebook manually, I was able to connect to the endpoint URL and insert my data. Now I have the notebook running on a cron job every 30 minutes. The first job was successful, but every job after that failed with this error:
MySQLInterfaceError: MySQL server has gone away
I then tried to run the job manually again and got the same error on tweets_pdf.to_sql(name='tweets', con=engine, if_exists = 'replace', index=False). This is the code that runs in the Databricks notebook:
from __future__ import print_function
import sys
import pymysql
import os
import re
import mysql.connector
from sqlalchemy import create_engine
from operator import add
import pandas as pd
from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import SQLContext
import json
import boto
import boto3
from boto.s3.key import Key
import boto.s3.connection
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import *
# Get AWS credentials
aws_key_id = os.environ.get("accesskeyid")
aws_key = os.environ.get("secretaccesskey")
# Start spark instance
conf = SparkConf().setAppName("first")
sc = SparkContext.getOrCreate(conf=conf)
# Allow spark to access my S3 bucket
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId",aws_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",aws_key)
config_dict = {"fs.s3n.awsAccessKeyId": aws_key_id,
               "fs.s3n.awsSecretAccessKey": aws_key}
bucket = "diego-twitter-stream-sink"
prefix = "/2020/*/*/*/*"
filename = "s3n://{}/{}".format(bucket, prefix)
# Convert file from S3 bucket to an RDD
rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.Text',
                    'org.apache.hadoop.io.LongWritable',
                    conf=config_dict)
spark = SparkSession.builder.appName("PythonWordCount").config("spark.files.overwrite","true").getOrCreate()
# Map RDD to specific columns
df = spark.read.json(rdd.map(lambda x: x[1]))
features_of_interest = ["ts", "text", "sentiment"]
df_reduce = df.select(features_of_interest)
# Convert RDD to Pandas Dataframe
tweets_pdf = df_reduce.toPandas()
engine = create_engine(f'mysql+mysqlconnector://admin:{os.environ.get("databasepassword")}@{os.environ.get("databasehost")}/twitter-data')
tweets_pdf.to_sql(name='tweets', con=engine, if_exists = 'replace', index=False)
Does anyone know what the issue could be? All of the database config variables are correct, the S3 bucket that PySpark is streaming from has data, and the AWS RDS instance is nowhere near any capacity or compute limits.
The default max_allowed_packet (4 MB) can cause this issue.
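This error typically shows up when a single insert from to_sql exceeds the server's max_allowed_packet, or when an idle connection is closed by a timeout. A hedged sketch of how you might check the setting and shrink the writes from the Python side (the chunksize value is just an example; tweets_pdf is the pandas DataFrame built in the notebook code above):

import os
from sqlalchemy import create_engine, text

engine = create_engine(
    f'mysql+mysqlconnector://admin:{os.environ.get("databasepassword")}'
    f'@{os.environ.get("databasehost")}/twitter-data'
)

# Check the current packet limit (bytes); the 4 MB default is easy to exceed
# when to_sql() sends many rows in one statement.
with engine.connect() as conn:
    print(conn.execute(text("SHOW VARIABLES LIKE 'max_allowed_packet'")).fetchone())

# Writing in smaller batches keeps each packet under the limit.
tweets_pdf.to_sql(name='tweets', con=engine, if_exists='replace',
                  index=False, chunksize=500)

On AWS RDS the max_allowed_packet value itself is raised through the DB parameter group rather than SET GLOBAL.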
I have this SQL query, written in HiveQL for PySpark:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
And I would like to translate it into a functional (DataFrame API) query like:
df.select(split(parse_url(col('page.viewed_page'), 'HOST')))
but when I import the parse_url function I get:
----> 1 from pyspark.sql.functions import split, parse_url
ImportError: cannot import name 'parse_url' from 'pyspark.sql.functions' (/usr/local/opt/apache-spark/libexec/python/pyspark/sql/functions.py)
Could you point me in the right direction to import the parse_url function?
Cheers
parse_url is a Hive UDF, so you need to enable Hive support while creating the SparkSession object:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
Then your following query should work:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
If your Spark is <2.2:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)

query = 'SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df'

hiveContext.sql(query)  # this will work
sqlContext.sql(query)   # this will not work
EDIT:
parse_url is a Spark SQL built-in from Spark v2.3. It's not available in pyspark.sql.functions as of yet (11/28/2020). You can still use it on a PySpark DataFrame via selectExpr, like this:
df.selectExpr('parse_url(mycolumn, "HOST")')
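If you prefer to stay in the DataFrame API for the rest of the projection, expr() can wrap just the SQL built-in. A rough equivalent of the original query, assuming df has the page.viewed_page column:

from pyspark.sql.functions import expr, split

# parse_url only exists as a SQL built-in, so wrap it in expr(); split() and
# the [1] array index remain ordinary DataFrame operations.
df.select(
    split(expr('parse_url(page.viewed_page, "PATH")'), "/")[1].alias("path")
)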
I'd like to execute Spark SQL on SageMaker via AWS Glue, but haven't succeeded.
What I want to do is parameterize the Glue job, so I need it to handle empty tables gracefully. However, when the method glueContext.create_dynamic_frame.from_catalog is given an empty table, it raises an error.
Here's the code that raises the error:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session  # needed for spark.sql() below

df1 = glueContext.create_dynamic_frame.from_catalog(
    database="<glue's database name>",
    table_name="<glue's table name>",  # I want this to be parameterized
    transformation_ctx="df1"
)

df1 = df1.toDF() # Here raises an Error
df1.createOrReplaceTempView('tmp_table')
df_sql = spark.sql("""SELECT ...""")
And this is the error:
Unable to infer schema for Parquet. It must be specified manually.
Is it impossible to use an empty table as an input to DynamicFrame? Thank you in advance.
df1 = df1.toDF() # Here raises an Error
Replace this line with:
from awsglue.dynamicframe import DynamicFrame

dynamic_df = DynamicFrame.fromDF(df1, glueContext, 'sample_job')  # load a PySpark DataFrame into a DynamicFrame
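If the real goal is simply to let the job tolerate an empty catalog table, another workaround is to count the DynamicFrame before converting it, so that toDF() is only called when there is data to infer a schema from. This is only a sketch; it assumes count() itself succeeds on the empty frame, and the count forces a scan, which can be slow on large tables:

# Sketch: skip the SQL step when the catalog table has no data.
if df1.count() > 0:
    spark_df = df1.toDF()
    spark_df.createOrReplaceTempView('tmp_table')
    df_sql = spark.sql("""SELECT ...""")
else:
    print("Input table is empty; skipping this source")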
I am currently trying to run queries via PySpark. All went well with the connection and accessing the database. Unfortunately, when I run a query, the only output displayed is the column names followed by None.
I read through the documentation but could not find any answers. Posted below is how I accessed the database.
import sys

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql import SQLContext
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)

sc = SparkContext('local', 'pyspark')
sqlctx = SQLContext(sc)

df = sqlctx.read.format("jdbc").option("url", "jdbc:hive2://.....").option("dbtable", "(SELECT * FROM dtable LIMIT 10) df").load()

print df.show()
The output of df.show() is just the column names. When I run the same query using PyHive, data is populated, so I assume it has something to do with the way I am trying to load the table using PySpark.
Thanks!