Unable to load an S3-hosted CSV into a Spark DataFrame in a Jupyter Notebook.
I believe I loaded the two required packages with the os.environ line below. If I did it incorrectly, please show me how to install them correctly. The Jupyter Notebook is hosted on an EC2 instance, which is why I'm pulling the CSV from an S3 bucket.
Here is my code:
import os
import pyspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
spark
Output:
Then I do:
%%time
df = spark.read.csv(f"s3://{AWS_BUCKET}/prices/{ts}_all_prices_{user}.csv", inferSchema = True, header = True)
And I get this error:
WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3://blah-blah-blah/prices/1655999784.356597_blah_blah_blah.csv.
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
Here is an example using s3a.
import os
import pyspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
filePath = "s3a://yourBucket/yourFile.parquet"
df = sqlContext.read.parquet(filePath) # Parquet file read example
Here is a more complete example, taken from here, of building the Spark session with the required config.
from pyspark.sql import SparkSession
def get_spark():
    spark = SparkSession.builder.master("local[4]").appName('SparkDelta') \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.jars.packages",
                "io.delta:delta-core_2.12:1.1.0,"
                "org.apache.hadoop:hadoop-aws:3.2.2,"
                "com.amazonaws:aws-java-sdk-bundle:1.12.180") \
        .getOrCreate()

    # This configuration is required on the Spark session to use AWS S3.
    spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
    spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
    # spark.sparkContext.setLogLevel("DEBUG")

    return spark
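Calling the helper then gives back a session that is already wired for S3A (a small usage sketch):
spark = get_spark()
# the session can now resolve s3a:// paths, e.g. the s3a read shown just below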
Try using s3a as the protocol; you can find more details here.
df = spark.read.csv(f"s3a://{AWS_BUCKET}/prices/{ts}_all_prices_{user}.csv", inferSchema = True, header = True)
Related
I am working with Python and PySpark in a Jupyter notebook. I am trying to read several Parquet files from an AWS S3 bucket and convert them into a single JSON file.
This is what I have:
from functools import reduce
from pyspark.sql import DataFrame

import boto3  # the s3 resource below is assumed to come from boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket(name='mybucket')
keys = []
for key in bucket.objects.all():
    keys.append(key.key)
print(keys[0])

from pyspark.sql import SparkSession

# initialise sparkContext
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()

sc = spark.sparkContext
But I am getting:
Exception: Java gateway process exited before sending its port number with pyspark
I am not sure how to fix this, thank you!
You're getting this error because PySpark is not able to communicate with your cluster. You need to set a few environment variables, like this:
import os
import findspark
findspark.init()

# Use either 'yarn' or 'local' for --master, depending on where the job should run.
os.environ['PYSPARK_SUBMIT_ARGS'] = """--name job_name --master local
    --conf spark.dynamicAllocation.enabled=true
    pyspark-shell"""
os.environ['PYSPARK_PYTHON'] = "python3.6"  # whatever version of Python you are using
os.environ['python'] = "python3.6"
The findspark package is optional, but it is a good idea to use it when running PySpark from a notebook.
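One detail worth making explicit (this is my reading of why the fix works, so treat it as an assumption): PYSPARK_SUBMIT_ARGS is read when the Java gateway is launched, i.e. when the first SparkSession/SparkContext is created, so the variables have to be set before the builder from the question runs. A minimal sketch of that ordering:
import os
import findspark

# 1. Locate the Spark installation and set the submit arguments first.
findspark.init()
os.environ['PYSPARK_SUBMIT_ARGS'] = "--name job_name --master local pyspark-shell"

# 2. Only then create the session; the gateway starts with the args above.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('myAppName').getOrCreate()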
I have a bucket with a few small Parquet files that I would like to consolidate into a bigger one.
To do this, I would like to create a Spark job that consumes them and writes a new file.
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession.builder \
.master("local") \
.appName("Consolidated tables") \
.getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "secret")
df = spark.read.parquet("s3://lake/bronze/appx/contextb/*")
This code throws an exception: No FileSystem for scheme: s3. If I switch to s3a://..., I get the error: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
I'm trying to run this code as python myfile.py.
Any idea on what's wrong?
Download hadoop-aws-2.7.5.jar (or the latest version) and make the jar available to Spark:
spark = SparkSession \
.builder \
.config("spark.jars", "/path/to/hadoop-aws-2.7.5.jar")\
.getOrCreate()
from boto3.session import Session
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession.builder \
.master("local") \
.appName("Consolidated tables") \
.getOrCreate()
ACCESS_KEY='your_access_key'
SECRET_KEY='your_secret_key'
session = Session(aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY)
s3 = session.resource('s3')
df = spark.read.parquet("s3://lake/bronze/appx/contextb/*")
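Note that the boto3 Session above only provides credentials to boto3; it does not pass anything to Spark. For the s3a connector, the same keys typically also have to be set on the Hadoop configuration (as in the question) and the path read with the s3a:// scheme. A minimal sketch, reusing the names from the snippet above:
# Hand the same credentials to the s3a connector; boto3's Session is not enough on its own.
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)

# Read with the s3a scheme rather than s3.
df = spark.read.parquet("s3a://lake/bronze/appx/contextb/*")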
I think I'm following the proper documentation to get PySpark to write Avro files. I'm running Spark 2.4.4 and using JupyterLab to run the PySpark shell.
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.12:2.4.4 pyspark-shell'
spark_conf = SparkConf().setMaster("local").setAppName("app")\
    .set('spark.jars.packages', 'org.apache.spark:spark-avro_2.12:2.4.4')\
    .set('spark.driver.memory', '3g')
sc = SparkContext(conf=spark_conf)
spark = SparkSession(sc)
...
df.write.format("avro").save('file.avro')
But I'm getting the following error. I'm not concerned about backward compatibility with Avro. Any ideas?
Py4JJavaError: An error occurred while calling o41.jdbc.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
Shaido had the right idea: using the spark-avro package built for Scala 2.11 works, because the package's Scala suffix has to match the Scala version your Spark build was compiled against (2.11 here).
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
spark_conf = SparkConf().setMaster("local").setAppName("app")\
    .set('spark.jars.packages', 'org.apache.spark:spark-avro_2.11:2.4.3')
sc = SparkContext(conf=spark_conf)
spark = SparkSession(sc)
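With the matching Scala build, the write from the question should go through, and reading the file back uses the same short format name (a small usage sketch):
df.write.format("avro").save('file.avro')   # as in the question
df_back = spark.read.format("avro").load('file.avro')
df_back.show()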
I have installed Hadoop and Spark on my local machine. I tried to connect to AWS S3 and was successful, using hadoop-aws-2.8.0.jar for this purpose. However, I have been trying to connect to DynamoDB using the EMR-provided jar file emr-ddb-hadoop.jar. I have installed all the AWS dependencies and they are available locally, but I keep getting the following exception.
java.lang.ClassCastException: org.apache.hadoop.dynamodb.read.DynamoDBInputFormat cannot be cast to org.apache.hadoop.mapreduce.InputFormat
Here is my code snippet.
import sys
import os

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = "/usr/local/Cellar/spark"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/tools/lib/emr-ddb-hadoop.jar,' \
                                    '/home/aws-java-sdk/1.11.201/lib/aws-java-sdk-1.11.201.jar pyspark-shell'

sys.path.append("/usr/local/Cellar/spark/python")
sys.path.append("/usr/local/Cellar/spark/python/lib/py4j-0.10.4-src.zip")

try:
    from pyspark.sql import SparkSession, SQLContext, Row
    from pyspark import SparkConf, SparkContext
    from pyspark.sql.window import Window
    import pyspark.sql.functions as func
    from pyspark.sql.functions import lit, lag, col, udf
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DoubleType, TimestampType, LongType
except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)

spark = SparkSession \
    .builder \
    .master("spark://xxx.local:7077") \
    .appName("Sample") \
    .getOrCreate()

sc = spark.sparkContext

conf = {"dynamodb.servicename": "dynamodb",
        "dynamodb.input.tableName": "test-table",
        "dynamodb.endpoint": "http://dynamodb.us-east-1.amazonaws.com/",
        "dynamodb.regionid": "us-east-1",
        "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat"}

dynamo_rdd = sc.newAPIHadoopRDD('org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
                                'org.apache.hadoop.io.Text',
                                'org.apache.hadoop.dynamodb.DynamoDBItemWritable',
                                conf=conf)
dynamo_rdd.collect()
I have not used newAPIHadoopRDD; using the old API it works without issues.
Here is the working sample I followed,
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
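For reference, a minimal sketch of the old-API read, following the pattern in that blog post and reusing the table, region, and endpoint values from the question (treat the exact settings as assumptions for your setup):
# DynamoDBInputFormat implements the old mapred InputFormat, hence hadoopRDD instead of newAPIHadoopRDD.
conf = {"dynamodb.servicename": "dynamodb",
        "dynamodb.input.tableName": "test-table",
        "dynamodb.endpoint": "http://dynamodb.us-east-1.amazonaws.com/",
        "dynamodb.regionid": "us-east-1",
        "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat"}

dynamo_rdd = sc.hadoopRDD('org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
                          'org.apache.hadoop.io.Text',
                          'org.apache.hadoop.dynamodb.DynamoDBItemWritable',
                          conf=conf)
print(dynamo_rdd.count())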
I am using Apache Spark / PySpark (spark-1.5.2-bin-hadoop2.6) on Windows 7.
I keep getting this error when I run my python script in pyspark.
An error occurred while calling o23.load. java.sql.SQLException: No suitable driver found for jdbc:oracle:thin:------------------------------------connection
Here is my Python file:
import os
os.environ["SPARK_HOME"] = "C:\\spark-1.5.2-bin-hadoop2.6"
os.environ["SPARK_CLASSPATH"] = "L:\\Pyspark_Snow\\ojdbc6.jar"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df = (sqlContext
.load(source="jdbc",
url="jdbc:oracle:thin://x.x.x.x/xdb?user=xxxxx&password=xxxx",
dbtable="x.users")
)
sc.stop()
Unfortunately, changing the SPARK_CLASSPATH environment variable won't work. You need to declare
spark.driver.extraClassPath L:\\Pyspark_Snow\\ojdbc6.jar
in your /path/to/spark/conf/spark-defaults.conf, or simply run the spark-submit job with the additional --jars argument:
spark-submit --jars "L:\\Pyspark_Snow\\ojdbc6.jar" yourscript.py
To set the jars programmatically, set the following config:
spark.yarn.dist.jars with a comma-separated list of jars.
E.g.:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark config example") \
.config("spark.yarn.dist.jars", "<path-to-jar/test1.jar>,<path-to-jar/test2.jar>") \
.getOrCreate()
Or as below:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
spark_config.set("spark.yarn.dist.jars", "L:\\Pyspark_Snow\\ojdbc6.jar")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
Or pass --jars to spark-submit, with the jar paths separated by commas.
You can also add the jar using --jars and --driver-class-path and then set the driver specifically. See https://stackoverflow.com/a/36328672/1547734
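For example, reusing the jar path from earlier in the thread, the combined invocation could look like this (adjust the path for your environment):
spark-submit --jars "L:\\Pyspark_Snow\\ojdbc6.jar" --driver-class-path "L:\\Pyspark_Snow\\ojdbc6.jar" yourscript.py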