I think I'm following the proper documentation to get PySpark to write Avro files. I'm running Spark 2.4.4 and using JupyterLab to run the PySpark shell.
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.12:2.4.4 pyspark-shell'

spark_conf = SparkConf().setMaster("local").setAppName("app")\
    .set('spark.jars.packages', 'org.apache.spark:spark-avro_2.12:2.4.4')\
    .set('spark.driver.memory', '3g')

sc = SparkContext(conf=spark_conf)
spark = SparkSession(sc)
...
df.write.format("avro").save('file.avro')
But I'm getting the following error. I'm not concerned about backward compatibility with Avro. Any ideas?
Py4JJavaError: An error occurred while calling o41.jdbc.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
Shaido had the right idea: using the Scala 2.11 build of spark-avro works, since the prebuilt Spark 2.4 binaries are compiled against Scala 2.11 and the Scala versions of Spark and the package have to match.
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'

spark_conf = SparkConf().setMaster("local").setAppName("app")\
    .set('spark.jars.packages', 'org.apache.spark:spark-avro_2.11:2.4.3')

sc = SparkContext(conf=spark_conf)
spark = SparkSession(sc)
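With the matching Scala build on the classpath, the Avro write from the question goes through. A minimal sketch reusing the session above (the toy DataFrame and output path are just illustrative):
# Small DataFrame only to exercise the Avro writer; the schema is made up.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
# Same call as in the question; "avro" resolves to the external spark-avro package on Spark 2.4.
df.write.format("avro").mode("overwrite").save("file.avro")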
Related
Unable to load an S3-hosted CSV into a Spark DataFrame in a Jupyter Notebook.
I believe I loaded the two required packages with the os.environ line below. If I did it incorrectly, please show me how to install them correctly. The Jupyter Notebook is hosted on an EC2 instance, which is why I'm trying to pull the CSV from an S3 bucket.
Here is my code:
import os
import pyspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
spark
Then I do:
%%time
df = spark.read.csv(f"s3://{AWS_BUCKET}/prices/{ts}_all_prices_{user}.csv", inferSchema = True, header = True)
And I get this error:
WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3://blah-blah-blah/prices/1655999784.356597_blah_blah_blah.csv.
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
Here is an example using s3a.
import os
import pyspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
filePath = "s3a://yourBucket/yourFile.parquet"
df = sqlContext.read.parquet(filePath) # Parquet file read example
Here is a more complete example, taken from here, of building the Spark session with the required config.
from pyspark.sql import SparkSession
def get_spark():
    spark = SparkSession.builder.master("local[4]").appName('SparkDelta') \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.jars.packages",
                "io.delta:delta-core_2.12:1.1.0,"
                "org.apache.hadoop:hadoop-aws:3.2.2,"
                "com.amazonaws:aws-java-sdk-bundle:1.12.180") \
        .getOrCreate()

    # These Hadoop settings are required on the Spark session to use AWS S3
    spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
    spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
    # spark.sparkContext.setLogLevel("DEBUG")

    return spark
Try using s3a as the protocol; you can find more details here.
df = spark.read.csv(f"s3a://{AWS_BUCKET}/prices/{ts}_all_prices_{user}.csv", inferSchema = True, header = True)
I'm trying to get JSON objects from an S3 bucket using PySpark (on Windows, using a WSL2 terminal).
I can do this using boto3 as an intermediate step, but when I try to use the spark.read.json method I get an error.
Code:
import findspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os
import multiprocessing
#----------------APACHE CONFIGURATIONS--------------
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
#---------------spark--------------
conf = (
    SparkConf()
    .set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .setAppName('pyspark_aws')
    .setMaster(f"local[{multiprocessing.cpu_count()}]")
    .setIfMissing("spark.executor.memory", "2g")
)
sc = SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
spark = SparkSession(sc)
#--------------hadoop--------------
accessKeyId='xxxxxxxxxxxx'
secretAccessKey='xxxxxxxxx'
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-eu-west-1.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.multipart.size', '419430400')
hadoopConf.set('fs.s3a.multipart.threshold', '2097152000')
hadoopConf.set('fs.s3a.connection.maximum', '500')
hadoopConf.set('s3a.connection.timeout', '600000')
s3_df = spark.read.json('s3a://{bucket}/{directory}/{object}.json')
Error:
py4j.protocol.Py4JJavaError: An error occurred while call...
: java.lang.NumberFormatException: For input string: "32M"
    at java.base/java.lang.NumberFormatException.forI...
    at java.base/java.lang.Long.parseLong(Long.java:6...
    at java.base/java.lang.Long.parseLong(Long.java:8...
    at org.apache.hadoop.conf.Configuration.getLong(C...
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getDefa...
    at org.apache.hadoop.fs.FileSystem.getDefaultBloc...
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFile...
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFile...
    at org.apache.hadoop.fs.FileSystem.exists(FileSys...
    at org.apache.spark.sql.execution.datasources.Dat...
    at org.apache.spark.sql.execution.datasources.Dat...
    at org.apache.spark.util.ThreadUtils$.$anonfun$pa...
    at java.base/java.util.concurrent.ForkJoinTask...
    ... (remainder of the stack trace truncated)
I added the multipart.size, multipart.threshold, connection.maximum, and connection.timeout Hadoop conf settings when I was getting a similar error earlier (that earlier error had '64M' instead of '32M' and changed when I added these settings).
I'm new to Spark, so any and all tips/pointers would be helpful!
If needed: the "32M" is the default of "fs.s3a.block.size". The S3A code in the hadoop-aws 2.7.x package reads these settings as plain longs, so it most likely cannot parse suffixed defaults such as "32M" (which would also explain the earlier "64M" error). Try
hadoopConf.set('fs.s3a.block.size', '33554432')
Go to https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html and you will find the explanations of the "32M" and the "64M" defaults.
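Applied to the question's code, that is one more hadoopConf line next to the existing numeric overrides. A sketch, assuming the hadoopConf object from the question; the readahead line is an assumption about another key with a suffixed default and may not be needed:
# 33554432 bytes = 32 MB, i.e. the documented "32M" default written as a plain long
hadoopConf.set('fs.s3a.block.size', '33554432')
# Assumption: fs.s3a.readahead.range ("64K" by default) may need the same treatment
hadoopConf.set('fs.s3a.readahead.range', '65536')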
I am working with Python and PySpark in a Jupyter notebook. I am trying to read several Parquet files from an AWS S3 bucket and convert them into a single JSON file.
This is what I have:
import boto3
from functools import reduce
from pyspark.sql import DataFrame

# boto3 is needed for the s3 resource used below
s3 = boto3.resource('s3')
bucket = s3.Bucket(name='mybucket')

keys = []
for key in bucket.objects.all():
    keys.append(key.key)
print(keys[0])

from pyspark.sql import SparkSession

# initialise the SparkSession
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
sc = spark.sparkContext
But I am getting:
Exception: Java gateway process exited before sending its port number with pyspark
I am not sure how to fix this, thank you!
You're getting this error because PySpark is not able to communicate with your cluster. You need to set the values of some environment variables, like this:
import os
import findspark
findspark.init()

os.environ['PYSPARK_SUBMIT_ARGS'] = """--name job_name --master local
    --conf spark.dynamicAllocation.enabled=true
    pyspark-shell"""  # use --master yarn instead of local when running on a cluster
os.environ['PYSPARK_PYTHON'] = "python3.6"  # whatever version of Python you're using
os.environ['python'] = "python3.6"
The findspark package is optional, but it's good to use with PySpark.
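After those variables are set, the session from the question can be created as usual. A minimal sketch (the master and app name are already passed via PYSPARK_SUBMIT_ARGS above):
from pyspark.sql import SparkSession

# getOrCreate() picks up the options passed through PYSPARK_SUBMIT_ARGS
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext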
I have a Spark cluster that I use in local mode. I want to read a CSV with the Databricks external library spark-csv. I start my app as follows:
import os
import sys
os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
from pyspark import SparkContext, SparkConf, SQLContext
try:
    sc
except NameError:
    print('initializing SparkContext...')
    sc = SparkContext()
sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("/my/path/to/my/file.csv")
When I run it, I get the following error:
java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
My question: how can I load the databricks spark-csv library INSIDE my Python code? I don't want to load it from outside (using --packages), for instance.
I tried to add the following lines but it did not work:
os.environ["SPARK_CLASSPATH"] = '/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar'
If you create the SparkContext from scratch, you can for example set PYSPARK_SUBMIT_ARGS before the SparkContext is initialized:
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)
sc = SparkContext()
If for some reason you expect that the SparkContext has already been initialized, as your code suggests, this won't work. In local mode you could try to use the Py4J gateway and URLClassLoader, but it doesn't look like a good idea and won't work in cluster mode.
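Putting it together with the code from the question, the whole flow would look roughly like this (a sketch; the CSV path is the questioner's placeholder):
import os

# Must be set before the SparkContext (and hence the driver JVM) is created
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sq = SQLContext(sc)
df = (sq.read.format('com.databricks.spark.csv')
        .options(header='true', inferschema='true')
        .load("/my/path/to/my/file.csv"))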
I am using Apache Spark / PySpark (spark-1.5.2-bin-hadoop2.6) on Windows 7.
I keep getting this error when I run my Python script in PySpark:
An error occurred while calling o23.load. java.sql.SQLException: No suitable driver found for jdbc:oracle:thin:------------------------------------connection
Here is my Python file:
import os
os.environ["SPARK_HOME"] = "C:\\spark-1.5.2-bin-hadoop2.6"
os.environ["SPARK_CLASSPATH"] = "L:\\Pyspark_Snow\\ojdbc6.jar"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df = (sqlContext
      .load(source="jdbc",
            url="jdbc:oracle:thin://x.x.x.x/xdb?user=xxxxx&password=xxxx",
            dbtable="x.users")
      )
sc.stop()
Unfortunately, changing the SPARK_CLASSPATH environment variable won't work. You need to declare
spark.driver.extraClassPath L:\\Pyspark_Snow\\ojdbc6.jar
in your /path/to/spark/conf/spark-defaults.conf, or simply execute the spark-submit job with the additional --jars argument:
spark-submit --jars "L:\\Pyspark_Snow\\ojdbc6.jar" yourscript.py
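If you want to stay inside the Python script instead, the same flags can be passed through PYSPARK_SUBMIT_ARGS before the SparkContext is created, as in other answers in this thread. A sketch using the questioner's jar (forward slashes to avoid escaping issues when the argument string is split):
import os

# Both flags are applied when the driver JVM is launched
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars L:/Pyspark_Snow/ojdbc6.jar "
    "--driver-class-path L:/Pyspark_Snow/ojdbc6.jar pyspark-shell"
)

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

sc = SparkContext(conf=SparkConf().setMaster("local[8]"))
sqlContext = SQLContext(sc)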
To set the jars programmatically, set the following config: spark.yarn.dist.jars with a comma-separated list of jars.
E.g.:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark config example") \
.config("spark.yarn.dist.jars", "<path-to-jar/test1.jar>,<path-to-jar/test2.jar>") \
.getOrCreate()
Or as below:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
spark_config.set("spark.yarn.dist.jars", "L:\\Pyspark_Snow\\ojdbc6.jar")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
Or pass --jars to spark-submit with the paths of the jar files, separated by commas.
You can also add the jar using --jars and --driver-class-path and then set the driver specifically. See https://stackoverflow.com/a/36328672/1547734