I am running a job in a Databricks notebook that connects to my MySQL database on AWS RDS and inserts data. When I ran the notebook manually, I was able to connect to the endpoint URL and insert my data. Now I have the notebook running on a cron job every 30 minutes. The first job was successful, but every job after that failed with this error:
MySQLInterfaceError: MySQL server has gone away
I then tried to run the job manually again, and I got the same error on tweets_pdf.to_sql(name='tweets', con=engine, if_exists='replace', index=False). This is the code running in the Databricks notebook:
from __future__ import print_function
import sys
import pymysql
import os
import re
import mysql.connector
from sqlalchemy import create_engine
from operator import add
import pandas as pd
from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import SQLContext
import json
import boto
import boto3
from boto.s3.key import Key
import boto.s3.connection
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import *
# Get AWS credentials
aws_key_id = os.environ.get("accesskeyid")
aws_key = os.environ.get("secretaccesskey")
# Start spark instance
conf = SparkConf().setAppName("first")
sc = SparkContext.getOrCreate(conf=conf)
# Allow spark to access my S3 bucket
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId",aws_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",aws_key)
config_dict = {"fs.s3n.awsAccessKeyId":aws_key_id,
"fs.s3n.awsSecretAccessKey":aws_key}
bucket = "diego-twitter-stream-sink"
prefix = "/2020/*/*/*/*"
filename = "s3n://{}/{}".format(bucket, prefix)
# Convert file from S3 bucket to an RDD
rdd = sc.hadoopFile(filename,
'org.apache.hadoop.mapred.TextInputFormat',
'org.apache.hadoop.io.Text',
'org.apache.hadoop.io.LongWritable',
conf=config_dict)
spark = SparkSession.builder.appName("PythonWordCount").config("spark.files.overwrite","true").getOrCreate()
# Map RDD to specific columns
df = spark.read.json(rdd.map(lambda x: x[1]))
features_of_interest = ["ts", "text", "sentiment"]
df_reduce = df.select(features_of_interest)
# Convert RDD to Pandas Dataframe
tweets_pdf = df_reduce.toPandas()
engine = create_engine(f'mysql+mysqlconnector://admin:{os.environ.get("databasepassword")}@{os.environ.get("databasehost")}/twitter-data')
tweets_pdf.to_sql(name='tweets', con=engine, if_exists = 'replace', index=False)
Does anyone know what the issue could be? All of the database config variables are correct, the S3 bucket that PySpark is streaming from has data, and the AWS RDS instance is nowhere near any capacity or compute limits.
A default max_allowed_packet (4M) can cause this issue.
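For context, here is a minimal sketch (with placeholder credentials and host, and reusing tweets_pdf from the question) of how you might check the server's current packet limit and keep each INSERT below it by writing in chunks, while also letting SQLAlchemy recycle stale pooled connections between cron runs:
from sqlalchemy import create_engine, text

# Placeholder credentials/host; substitute your own values.
engine = create_engine(
    "mysql+mysqlconnector://admin:password@my-rds-endpoint/twitter-data",
    pool_pre_ping=True,   # validate pooled connections before reuse
    pool_recycle=3600,    # drop connections older than the server's wait_timeout
)

# Inspect the current packet limit on the server.
with engine.connect() as conn:
    print(conn.execute(text("SHOW VARIABLES LIKE 'max_allowed_packet'")).fetchall())

# Writing in smaller chunks keeps each INSERT statement under max_allowed_packet.
tweets_pdf.to_sql(name='tweets', con=engine, if_exists='replace', index=False, chunksize=500)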
I have the code below, which connects to a MongoDB database, selects the specified JSON documents, flattens them and exports them as a CSV.
My problem is that some of the JSON documents in the MongoDB database are huge, with thousands of rows, so I am trying to filter the collection down so that I only bring in data from the last 7 days.
from pymongo import MongoClient
import pandas as pd
import os, uuid, sys
import collections
from azure.storage.filedatalake import DataLakeServiceClient
from azure.core._match_conditions import MatchConditions
from azure.storage.filedatalake._models import ContentSettings
from pandas import json_normalize
# Connect to MongoDB and pull every document from the Report collection
mongo_client = MongoClient("connstring")
db = mongo_client.nhdb
table = db.Report
document = table.find()
mongo_docs = list(document)
# Flatten the nested JSON documents and write them out as CSV
mongo_docs = json_normalize(mongo_docs)
mongo_docs.to_csv("Report.csv", sep=",", index=False)
Any help will be much appreciated.
Note: I know a way to do it in Azure Data Factory using the expression below, but I am not sure how to go about it in Python:
{"createdDatetime":{$gt: ISODate("#{adddays(utcnow(),-7)}")}}
In Python, create a datetime object to use as the filter; for example, this returns only documents from the last 7 days:
from datetime import datetime, timedelta
document = table.find({'createdDatetime': {'$gt': datetime.utcnow() - timedelta(days=7)}})
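Assuming createdDatetime is stored as a BSON date rather than a string, the filter slots straight into the original export; a sketch reusing the names from the question:
from datetime import datetime, timedelta
from pymongo import MongoClient
from pandas import json_normalize

mongo_client = MongoClient("connstring")  # placeholder connection string from the question
table = mongo_client.nhdb.Report

# Only fetch documents created within the last 7 days.
cutoff = datetime.utcnow() - timedelta(days=7)
recent_docs = list(table.find({'createdDatetime': {'$gt': cutoff}}))

json_normalize(recent_docs).to_csv("Report.csv", sep=",", index=False)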
I have this SQL query, written in HiveQL, for PySpark:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
and I would like to translate it into a functional (DataFrame API) query like:
df.select(split(parse_url(col('page.viewed_page'), 'HOST')))
but when I import the parse_url function I get:
----> 1 from pyspark.sql.functions import split, parse_url
ImportError: cannot import name 'parse_url' from 'pyspark.sql.functions' (/usr/local/opt/apache-spark/libexec/python/pyspark/sql/functions.py)
Could you point me in the right direction for importing the parse_url function?
Cheers
parse_url is a Hive UDF, so you need to enable Hive support when creating the SparkSession object:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
Then the following query should work:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
If your Spark version is < 2.2:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)
query = 'SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df'
hiveContext.sql(query)  # this will work
sqlContext.sql(query)   # this will not work
EDIT:
parse_url is a Spark SQL built-in from Spark v2.3. It is not yet available in pyspark.sql.functions (as of 11/28/2020). You can still use it on a PySpark dataframe via selectExpr, like this:
df.selectExpr('parse_url(mycolumn, "HOST")')
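If you prefer a column expression over selectExpr, pyspark.sql.functions.expr gives the same result; a sketch assuming df has the page.viewed_page column from the question:
from pyspark.sql.functions import expr

# Same Hive built-in, wrapped in an expression string instead of a Python function.
df.select(
    expr('split(parse_url(page.viewed_page, "PATH"), "/")[1]').alias('path')
)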
Python shell jobs were introduced in AWS Glue. The announcement mentioned:
You can now use Python shell jobs, for example, to submit SQL queries to services such as ... Amazon Athena ...
OK. There is an example of reading data from Athena tables here:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="persons_json")
print("Count: ", persons.count())
persons.printSchema()
# TODO query all persons
However, it uses Spark instead of the Python shell. The libraries that are normally available with the Spark job type are missing, and I get this error:
ModuleNotFoundError: No module named 'awsglue.transforms'
How can I rewrite the code above to make it executable in the Python Shell job type?
The thing is, the Python shell job type has its own limited set of built-in libraries.
I only managed to achieve my goal using Boto3 to query the data and Pandas to read it into a dataframe.
Here is the code snippet:
import time
import boto3
import pandas as pd

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
athena_client = boto3.client(service_name='athena', region_name='us-east-1')
bucket_name = 'bucket-with-csv'
print('Working bucket: {}'.format(bucket_name))

def run_query(client, query):
    # Submit the query to Athena and write the result as CSV into the bucket
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'sample-db'},
        ResultConfiguration={'OutputLocation': 's3://{}/fromglue/'.format(bucket_name)},
    )
    return response

def validate_query(client, query_id):
    terminal_states = ["FAILED", "SUCCEEDED", "CANCELLED"]
    response = client.get_query_execution(QueryExecutionId=query_id)
    # wait until query finishes
    while response["QueryExecution"]["Status"]["State"] not in terminal_states:
        time.sleep(1)  # brief pause between polls to avoid hammering the API
        response = client.get_query_execution(QueryExecutionId=query_id)
    return response["QueryExecution"]["Status"]["State"]

def read(query):
    print('start query: {}\n'.format(query))
    qe = run_query(athena_client, query)
    qstate = validate_query(athena_client, qe["QueryExecutionId"])
    print('query state: {}\n'.format(qstate))
    # Athena writes the result to <OutputLocation>/<QueryExecutionId>.csv
    file_name = "fromglue/{}.csv".format(qe["QueryExecutionId"])
    obj = s3_client.get_object(Bucket=bucket_name, Key=file_name)
    return pd.read_csv(obj['Body'])
time_entries_df = read('SELECT * FROM sample-table')
SparkContext won't be available in a Glue Python shell job, so you need to rely on Boto3 and Pandas to handle the data retrieval. But querying Athena with Boto3 and polling the QueryExecutionId until the query finishes adds a lot of overhead.
Recently awslabs released a new package called AWS Data Wrangler. It extends the power of the Pandas library to AWS, making it easy to interact with Athena and many other AWS services.
Reference link:
https://github.com/awslabs/aws-data-wrangler
https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/006%20-%20Amazon%20Athena.ipynb
Note: the AWS Data Wrangler library won't be available by default inside the Glue Python shell. To include it in a Python shell job, follow the instructions at the following link:
https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-python-shell-jobs
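For reference, a minimal sketch of the Data Wrangler approach, assuming the library has been added to the Python shell job as described in the install link and reusing the sample-db database from the snippet above:
import awswrangler as wr

# One call submits the query to Athena, waits for it, and returns a Pandas dataframe.
df = wr.athena.read_sql_query('SELECT * FROM "sample-table"', database='sample-db')
print(df.head())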
I have been using Glue for a few months; I use:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
data_frame = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.load(<CSVs THAT IS USING FOR ATHENA - STRING>)
I'd like to execute Spark SQL on SageMaker via AWS Glue, but haven't succeeded.
What I want to do is parameterize the Glue job, so I need it to accept empty tables as input. However, when glueContext.create_dynamic_frame.from_catalog is given an empty table, it raises an error.
Here's the code that raises the error:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session  # needed for spark.sql below
df1 = glueContext.create_dynamic_frame.from_catalog(
    database = "<glue's database name>",
    table_name = "<glue's table name>", # I want here to be parameterized
    transformation_ctx = "df1"
)
df1 = df1.toDF() # Here raises an Error
df1.createOrReplaceTempView('tmp_table')
df_sql = spark.sql("""SELECT ...""")
And this is the error:
Unable to infer schema for Parquet. It must be specified manually.
Is it impossible to use an empty table as input to a DynamicFrame? Thank you in advance.
df1 = df1.toDF() # Here raises an Error
Replace this line with:
from awsglue.dynamicframe import DynamicFrame
dynamic_df = DynamicFrame.fromDF(df1, glueContext, 'sample_job')  # load the PySpark dataframe into a DynamicFrame
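As an alternative (my own guard, not part of the answer above), you could skip the conversion when the catalog table is empty, so Spark never has to infer a Parquet schema from zero files:
# Guard against empty tables before converting the DynamicFrame.
if df1.count() > 0:
    spark_df = df1.toDF()
    spark_df.createOrReplaceTempView('tmp_table')
    df_sql = spark.sql("""SELECT ...""")
else:
    print("Catalog table is empty; skipping this run")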
I am able to connect to MongoDB from my Spark job, but when I try to view the data that is being loaded from the database I get the error mentioned in the title. I am using the pyspark module of Apache Spark.
The code snippet is:
from pyspark import SparkConf,SparkContext
from pyspark.sql import SQLContext
import sys
print(sys.stdin.encoding, sys.stdout.encoding)
conf = SparkConf()
# Point the MongoDB Spark connector at the github.users collection
conf.set('spark.mongodb.input.uri', 'mongodb://127.0.0.1/github.users')
conf.set('spark.mongodb.output.uri', 'mongodb://127.0.0.1/github.users')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Load the collection into a dataframe and inspect it
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
df = df.sort('followers', ascending=True)
df.take(1)