Python -> Py4j -> Spark -> Cassandra

I would like to test a simple Spark row-count job on a test Cassandra table with only four rows, just to verify that everything works.
I can quickly get this working from Java:
JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions sparkContextJavaFunctions = CassandraJavaUtil.javaFunctions(sc);
CassandraJavaRDD<CassandraRow> table = sparkContextJavaFunctions.cassandraTable("demo", "playlists");
long count = table.count();
Now, I'd like to get the same thing working in Python. The Spark distribution comes with a set of unbundled PySpark source code for using Spark from Python. It uses a library called Py4J to launch a Java server and marshal Java commands through a TCP gateway. I'm using that gateway directly to get this working.
I specify the following extra jars to the Java SparkSubmit host via the --driver-class-path option (a sketch of one way to pass these follows the list):
spark-cassandra-connector-java_2.11-1.2.0-rc1.jar
spark-cassandra-connector_2.11-1.2.0-rc1.jar
cassandra-thrift-2.1.3.jar
cassandra-clientutil-2.1.3.jar
cassandra-driver-core-2.1.5.jar
libthrift-0.9.2.jar
joda-convert-1.2.jar
joda-time-2.3.jar
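For context, here is a hedged sketch of one way those jars might be handed to the SparkSubmit host that launch_gateway() spawns; in Spark releases of that era launch_gateway() typically reads PYSPARK_SUBMIT_ARGS, and the jar paths below are placeholders:
import os

# Hypothetical wiring: point --driver-class-path at the extra jars before
# launch_gateway() starts the JVM. Adjust the paths to wherever the jars live.
extra_jars = ":".join([
    "/path/to/spark-cassandra-connector-java_2.11-1.2.0-rc1.jar",
    "/path/to/spark-cassandra-connector_2.11-1.2.0-rc1.jar",
    # ...the remaining jars from the list above...
])
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path %s pyspark-shell" % extra_jars
)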
Here is the core Python code to do the row count test:
from pyspark.java_gateway import launch_gateway
jvm_gateway = launch_gateway()
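# `conf` below is not shown in the original question; presumably it is a JVM
# SparkConf built through the same gateway. A minimal, hypothetical version:
conf = jvm_gateway.jvm.org.apache.spark.SparkConf() \
    .setMaster("local[*]") \
    .setAppName("cassandra-count") \
    .set("spark.cassandra.connection.host", "127.0.0.1")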
sc = jvm_gateway.jvm.org.apache.spark.api.java.JavaSparkContext(conf)
spark_cass_functions = jvm_gateway.jvm.com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions(sc)
table = spark_cass_functions.cassandraTable("demo", "playlists")
On this last line, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o5.cassandraTable.
: com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.auth.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:38)
at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:18)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.<init>(CassandraTableScanRDD.scala:59)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$.apply(CassandraTableScanRDD.scala:182)
at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:88)
at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Clearly, there is some configuration or setup issue. I'm not sure how to debug this or what to try next. Can anyone with more Cassandra/Python/Spark expertise offer some advice? Thank you!
EDIT: A coworker set up a spark-defaults.conf file that was the root of this. I don't fully understand why it caused problems from Python and not from Java, but it doesn't matter. I don't want that conf file, and removing it resolved my issue.

That is a known bug in the Spark Cassandra Connector in 1.2.0-rc1 and 1.2.0-rc2; it will be fixed in rc3.
Relevant Tickets
https://datastax-oss.atlassian.net/browse/SPARKC-102
https://datastax-oss.atlassian.net/browse/SPARKC-105
https://datastax-oss.atlassian.net/browse/SPARKC-108

You could always try out pyspark_cassandra. It's built against 1.2.0-rc3 and is probably a lot easier to work with when using Cassandra from PySpark.
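For reference, a rough sketch of what the pyspark_cassandra route might look like; the class and method names (CassandraSparkContext, cassandraTable) are taken from that project's README and should be treated as assumptions, so check its current docs:
from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext  # assumed entry point

# Placeholder host; point it at your Cassandra node
conf = (SparkConf()
        .setAppName("cassandra-count")
        .set("spark.cassandra.connection.host", "127.0.0.1"))
sc = CassandraSparkContext(conf=conf)

# Row count of the demo.playlists table from the question
print(sc.cassandraTable("demo", "playlists").count())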

Related

How to increase speed when writing Spark DataFrame to Redis?

I am developing a book recommendation API based on Flask, and I found that to handle multiple requests I'll need to pre-calculate a similarity matrix and store it somewhere for future queries. This matrix is created using PySpark from ~1.5 million database entries with book id, name, and metadata, and the result can be described by this schema (i and j are the book indexes, dot is the similarity of their metadata):
StructType(List(StructField(i,IntegerType,true),StructField(j,IntegerType,true),StructField(dot,DoubleType,true)))
Initially, my intention was to store it in Redis using the spark-redis connector. However, the following command runs very slowly (even when the initial book database query is limited to a very modest 40k batch):
similarities.write.format("org.apache.spark.sql.redis").option("table", "similarities").option("key.column", "i").save()
It took around 6 hours to advance through 3 of the 9 stages Spark split the initial task into. Strangely, storage memory usage by the Spark executors was very low, around 20 KB. A typical active stage is described like this by the Spark application UI:
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
Is it possible to somehow speed up this process? My Spark session is set up this way:
SUBMIT_ARGS = " --driver-memory 2G --executor-memory 2G --executor-cores 4 --packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf().set("spark.jars", "spark-redis/target/spark-redis_2.11-2.4.3-SNAPSHOT-jar-with-dependencies.jar").set("spark.executor.memory", "4g")
sc = SparkContext('local','example', conf=conf)
sql_sc = SQLContext(sc)
You may try to use Append save mode to avoid checking if the data already exists in the table:
similarities.write.format("org.apache.spark.sql.redis").option("table", "similarities").mode('append').option("key.column", "i").save()
Also, you may want to change
sc = SparkContext('local','example', conf=conf)
to
sc = SparkContext('local[*]','example', conf=conf)
to utilize all cores on your machine.
BTW, is it correct to use i as a key in Redis? Shouldn't it be a composition of both i and j?
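If a composite key does make sense for your data, here is a hedged sketch of one way to build it (the ij column name is made up; spark-redis simply uses whatever column key.column points at):
from pyspark.sql.functions import concat_ws, col

# Combine both indexes into a single key column, e.g. "12:345"
keyed = similarities.withColumn("ij", concat_ws(":", col("i"), col("j")))

(keyed.write.format("org.apache.spark.sql.redis")
      .option("table", "similarities")
      .option("key.column", "ij")
      .mode("append")
      .save())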

Pyspark: how to read a .csv file in google bucket?

I have some files stored in a Google bucket. These are my settings, as suggested here.
spark = SparkSession.builder.\
master("local[*]").\
appName("TestApp").\
config("spark.serializer", KryoSerializer.getName).\
config("spark.jars", "/usr/local/.sdkman/candidates/spark/2.4.4/jars/gcs-connector-hadoop2-2.1.1.jar").\
config("spark.kryo.registrator", GeoSparkKryoRegistrator.getName).\
getOrCreate()
#Recommended settings for using GeoSpark
spark.conf.set("spark.driver.memory", 6)
spark.conf.set("spark.network.timeout", 1000)
#spark.conf.set("spark.driver.maxResultSize", 5)
spark.conf.set
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# Required when using a service account (set to true in that case)
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'false')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "myJson.json")
path = 'mBucket-c892b51f8579.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = path
client = storage.Client()
name = 'https://console.cloud.google.com/storage/browser/myBucket/'
bucket_id = 'myBucket'
bucket = client.get_bucket(bucket_id)
I can read them simple using the following:
df = pd.read_csv('gs://myBucket/myFile.csv.gz', compression='gzip')
df.head()
time_zone_name province_short
0 America/Chicago US.TX
1 America/Chicago US.TX
2 America/Los_Angeles US.CA
3 America/Chicago US.TX
4 America/Los_Angeles US.CA
I am trying to read the same file with PySpark:
myTable = spark.read.format("csv").schema(schema).load('gs://myBucket/myFile.csv.gz', compression='gzip')
but I get the following error
Py4JJavaError: An error occurred while calling o257.load.
: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Welcome to Hadoop dependency hell!
1. Use packages rather than jars
Your configuration is basically correct, but when you add the gcs-connector as a local jar you also need to manually ensure that all of its dependencies are available on the JVM classpath.
It's usually easier to add the connector as a package and let Spark deal with the dependencies, so instead of config("spark.jars", "/usr/local/.sdkman/candidates/spark/2.4.4/jars/gcs-connector-hadoop2-2.1.1.jar") use config('spark.jars.packages', 'com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.1').
2. Manage ivy2 dependencies resolution issues
When you do as above, Spark will likely complain that it can't download some dependencies, due to resolution differences between Maven (used for publication) and ivy2 (used by Spark for dependency resolution).
You can usually fix this by simply asking Spark to ignore the unresolved dependencies using spark.jars.excludes, so add a new config line such as
config('spark.jars.excludes','androidx.annotation:annotation,org.slf4j:slf4j-api')
3. Manage classpath conflicts
Once this is done, the SparkSession will start, but the filesystem will still fail because the standard distribution of pyspark packages an old version of the Guava library that doesn't implement the API the gcs-connector relies on.
You need to ensure that the gcs-connector finds its expected version first, by using the following configs: config('spark.driver.userClassPathFirst','true') and config('spark.executor.userClassPathFirst','true').
4. Manage dependency conflicts
Now you may think everything is OK, but actually no: the default pyspark distribution bundles version 2.7.3 of the Hadoop libraries, while gcs-connector 2.1.1 relies on 2.8+ only APIs.
Now your options are:
use a custom build of spark with a newer hadoop (or the package with no built-in hadoop libraries)
use an older version of gcs-connector (version 1.9.17 works fine)
5. A working config at last
Assuming you want to stick with the latest PyPI or Anaconda distribution of pyspark, the following config should work as expected.
I've included only the GCS-relevant configs, moved the Hadoop config directly into the Spark config, and assumed you are correctly setting your GOOGLE_APPLICATION_CREDENTIALS:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
master("local[*]").\
appName("TestApp").\
config('spark.jars.packages',
'com.google.cloud.bigdataoss:gcs-connector:hadoop2-1.9.17').\
config('spark.jars.excludes',
'javax.jms:jms,com.sun.jdmk:jmxtools,com.sun.jmx:jmxri').\
config('spark.driver.userClassPathFirst','true').\
config('spark.executor.userClassPathFirst','true').\
config('spark.hadoop.fs.gs.impl',
'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem').\
config('spark.hadoop.fs.gs.auth.service.account.enable', 'false').\
getOrCreate()
Note that gcs-connector version 1.9.17 has a different set of excludes than 2.1.1 because why not...
PS: You also need to ensure you're using a Java 1.8 JVM because Spark 2.4 doesn't work on newer JVMs.
In addition to @rluta's great answer, you can also replace the userClassPathFirst lines by putting the Guava jars specifically on extraClassPath:
spark.driver.extraClassPath=/root/.ivy2/jars/com.google.guava_guava-27.0.1-jre.jar:/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar:/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
spark.executor.extraClassPath=/root/.ivy2/jars/com.google.guava_guava-27.0.1-jre.jar:/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar:/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
It's a bit hackish as you need the exact local ivy2 path, although you can also download/copy the jars to somewhere more permanent.
But this reduces other potential dependency conflicts, such as with Livy, which throws java.lang.NoClassDefFoundError: org.apache.livy.shaded.json4s.jackson.Json4sModule if the gcs-connector's Jackson dependencies come first on the classpath.
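For completeness, a hypothetical sketch of applying the same extraClassPath settings from the PySpark session builder (same ivy2-cache paths as above, so still somewhat hackish):
from pyspark.sql import SparkSession

# Exact Guava jars from the local ivy2 cache; copy them somewhere permanent if preferred
guava_jars = ":".join([
    "/root/.ivy2/jars/com.google.guava_guava-27.0.1-jre.jar",
    "/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar",
    "/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar",
])

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.driver.extraClassPath", guava_jars)
         .config("spark.executor.extraClassPath", guava_jars)
         .getOrCreate())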

Provide blob type to read an Azure append blob from PySpark

The ultimate goal is to be able to read the data in my Azure container into a PySpark dataframe.
Steps until now
The steps I have followed so far:
Wrote this code:
spark = SparkSession(SparkContext())
spark.conf.set(
"fs.azure.account.key.%s.blob.core.windows.net" % AZURE_ACCOUNT_NAME,
AZURE_ACCOUNT_KEY
)
spark.conf.set(
"fs.wasbs.impl",
"org.apache.hadoop.fs.azure.NativeAzureFileSystem"
)
container_path = "wasbs://%s@%s.blob.core.windows.net" % (
AZURE_CONTAINER_NAME, AZURE_ACCOUNT_NAME
)
blob_folder = "%s/%s" % (container_path, AZURE_BLOB_NAME)
df = spark.read.format("text").load(blob_folder)
print(df.count())
Set public access and anonymous access to my Azure container.
Added two jars hadoop-azure-2.7.3.jar and azure-storage-2.2.0.jar to the path.
Problem
But now I am stuck with this error: Caused by: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual UNSPECIFIED..
I have not been able to find anything which talks about / resolves this issue. The closest I have found is this which does not work / is outdated.
EDIT
I found that the azure-storage-2.2.0.jar did not support APPEND_BLOB. I upgraded to azure-storage-4.0.0.jar and it changed the error from Expected BLOCK_BLOB, actual UNSPECIFIED. to Expected BLOCK_BLOB, actual APPEND_BLOB.. Does anyone know how to pass the correct type to expect?
Can someone please help me resolve this?
I have minimal experience working with Azure, but I don't think it should be this difficult to read the data and create a Spark dataframe from it. What am I doing wrong?

java.lang.NoSuchFieldError: DECIMAL128 mongoDB spark

I'm writing a Spark job using PySpark; it should simply read from a MongoDB collection and print the content on the screen. The code is the following:
import pyspark
from pyspark.sql import SparkSession
my_spark = SparkSession.builder.appName("myApp").config("spark.mongodb.input.uri", "mongodb://127.0.0.1/marco.weather_test").config("spark.mongodb.output.uri", "mongodb://127.0.0.1/marco.out").getOrCreate()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/marco.weather_test").load()
#df.printSchema()
df.show()
The problem is that printing the schema works, but when I try to print the content of the DataFrame with the show() function I get back this error:
#java.lang.NoSuchFieldError: DECIMAL128
the command I use is:
#bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.3 /home/erca/Scrivania/proveTesi/test_batch.py
I got the same error because the wrong jar was being used for mongo-java-driver.
The NoSuchFieldError is raised from org.bson.BsonType.Decimal128; the Decimal128 field was only added to the BsonType class in mongo-java-driver 3.4. While org.mongodb.spark:mongo-spark-connector_2.11:2.2.4 contains mongo-java-driver 3.6.2, an existing mongo-java-driver jar with version 3.2.1 was located on the driverExtraClassPath.
Just start the spark-shell with verbose:
spark-shell --verbose
and inspect the output, e.g.:
***
***
Parsed arguments:
master local[*]
deployMode null
executorMemory 1024m
executorCores 1
totalExecutorCores null
propertiesFile /etc/ecm/spark-conf/spark-defaults.conf
driverMemory 1024m
driverCores null
driverExtraClassPath /opt/apps/extra-jars/*
driverExtraLibraryPath /usr/lib/hadoop-current/lib/native
***
***
and pay attention to driverExtraClassPath and driverExtraLibraryPath.
Check these paths, and remove the mongo-java-driver jar if it exists within them.
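To quickly check whether a stray driver jar is sitting on that path, a tiny hypothetical helper (the directory comes from the verbose output above):
import glob
# List any mongo-java-driver jars on the extra classpath directory
print(glob.glob("/opt/apps/extra-jars/mongo-java-driver*.jar"))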
Importing the library explicitly might solve the issue or give you a better error message for debugging.
import org.bson.types.Decimal128
I pinned the BSON jar to a newer version and forced it, and it worked
(In gradle)
compile group: 'org.mongodb', name: 'bson', version: '3.10.2', 'force': "true"

Spark 1.4 increase maxResultSize memory

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16 GB of memory, so there should be no problem there since my file is only 300 MB. However, when I try to convert a Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:
serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I tried to fix this by changing the spark config file and I'm still getting the same error. I've heard that this is a problem with Spark 1.4; do you know how to solve it? Any help is much appreciated.
You can set the spark.driver.maxResultSize parameter in the SparkConf object:
from pyspark import SparkConf, SparkContext
# In Jupyter you have to stop the current context first
sc.stop()
# Create new config
conf = (SparkConf()
.set("spark.driver.maxResultSize", "2g"))
# Create new context
sc = SparkContext(conf=conf)
You should probably create a new SQLContext as well:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.
Tuning spark.driver.maxResultSize is good practice considering the running environment. However, it is not the solution to your problem, as the amount of data may change from time to time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, then you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:
Do not turn on spark.sql.parquet.binaryAsString. String objects take more space
Use spark.rdd.compress to compress RDDs when you collect them
Try to collect it using pagination (code in Scala, adapted from another answer, Scala: How to get a range of rows in a dataframe):
// assumes df was declared as a var so it can be reassigned
var count = df.count()
val limit = 50
while (count > 0) {
  val df1 = df.limit(limit)
  df1.show()            // will print 50 rows, then the next 50, etc.
  df = df.except(df1)
  count = count - limit
}
It looks like you are collecting the RDD, so it will definitely pull all the data to the driver node; that's why you are facing this issue.
You should avoid collecting an RDD if it isn't required, and if it is necessary, then increase spark.driver.maxResultSize. There are two ways of defining this variable:
1 - create the Spark config with this variable set:
conf.set("spark.driver.maxResultSize", "3g")
2 - or set this variable in the spark-defaults.conf file in Spark's conf folder, like
spark.driver.maxResultSize 3g
and then restart Spark.
When starting the job or shell, you can use
--conf spark.driver.maxResultSize="0"
to remove the limit (0 means unlimited).
There is also a Spark bug
https://issues.apache.org/jira/browse/SPARK-12837
that gives the same error
serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize
even though you may not be pulling data to the driver explicitly.
SPARK-12837 addresses a bug where, prior to Spark 2, accumulators/broadcast variables were pulled to the driver unnecessarily, causing this problem.
You can set spark.driver.maxResultSize to 2 GB when you start the pyspark shell:
pyspark --conf "spark.driver.maxResultSize=2g"
This allows 2 GB for spark.driver.maxResultSize.
