I am developing a book recommendation API based on Flask, and I found that to handle multiple requests I need to pre-calculate a similarity matrix and store it somewhere for future queries. The matrix is built with PySpark from ~1.5 million database entries containing book id, name and metadata, and the result has this schema (i and j are book indexes, dot is the similarity of their metadata):
StructType(List(StructField(i,IntegerType,true),StructField(j,IntegerType,true),StructField(dot,DoubleType,true)))
Initially, I intended to store it in Redis using the spark-redis connector. However, the following command runs very slowly (even when the initial book database query is limited to a very modest 40k batch):
similarities.write.format("org.apache.spark.sql.redis").option("table", "similarities").option("key.column", "i").save()
It took around 6 hours to advance through 3 of the 9 stages Spark split the initial task into. Strangely, storage memory usage by the Spark executors was very low, around 20 KB. A typical active stage is described like this in the Spark Application UI:
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
Is it possible to somehow speed up this process? My Spark session is set up this way:
SUBMIT_ARGS = " --driver-memory 2G --executor-memory 2G --executor-cores 4 --packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf().set("spark.jars", "spark-redis/target/spark-redis_2.11-2.4.3-SNAPSHOT-jar-with-dependencies.jar").set("spark.executor.memory", "4g")
sc = SparkContext('local','example', conf=conf)
sql_sc = SQLContext(sc)
You can try the Append save mode, which avoids checking whether the data already exists in the table:
similarities.write.format("org.apache.spark.sql.redis").option("table", "similarities").mode('append').option("key.column", "i").save()
Also, you may want to change
sc = SparkContext('local','example', conf=conf)
to
sc = SparkContext('local[*]','example', conf=conf)
to utilize all cores on your machine.
BTW, is it correct to use i as a key in Redis? Shouldn't it be a composition of both i and j?
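As a rough sketch of the composite-key idea raised above (not a tested recipe): if several rows share the same i, keying on i alone would map them to the same Redis key, so one option is to build a single column from both indexes and pass that as key.column. The names pair_key and write_similarities are hypothetical; the writer options are the ones already used in this thread, and the function assumes pyspark and the spark-redis jar are available.

```python
def pair_key(i, j, sep=":"):
    """Combine both book indexes into one key part, e.g. 3 and 7 -> '3:7'."""
    return f"{i}{sep}{j}"


def write_similarities(df, table="similarities", key_col="pair_key"):
    """Write a DataFrame with columns i, j, dot to Redis, keyed on (i, j).

    key_col is a made-up column name for this example.
    """
    from pyspark.sql import functions as F  # imported lazily: needs pyspark

    # Cast explicitly so the integer indexes concatenate as strings.
    keyed = df.withColumn(
        key_col,
        F.concat_ws(":", F.col("i").cast("string"), F.col("j").cast("string")),
    )
    (keyed.write
          .format("org.apache.spark.sql.redis")
          .option("table", table)
          .option("key.column", key_col)
          .mode("append")
          .save())
```

pair_key only illustrates the key layout on the Python side; the Spark column built in write_similarities follows the same "i:j" pattern.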
I have some files stored in a Google bucket. These are my settings, as suggested here.
spark = SparkSession.builder.\
master("local[*]").\
appName("TestApp").\
config("spark.serializer", KryoSerializer.getName).\
config("spark.jars", "/usr/local/.sdkman/candidates/spark/2.4.4/jars/gcs-connector-hadoop2-2.1.1.jar").\
config("spark.kryo.registrator", GeoSparkKryoRegistrator.getName).\
getOrCreate()
#Recommended settings for using GeoSpark
spark.conf.set("spark.driver.memory", 6)
spark.conf.set("spark.network.timeout", 1000)
#spark.conf.set("spark.driver.maxResultSize", 5)
spark.conf.set
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# This is required if you are using service account and set true,
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'false')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "myJson.json")
path = 'mBucket-c892b51f8579.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = path
client = storage.Client()
name = 'https://console.cloud.google.com/storage/browser/myBucket/'
bucket_id = 'myBucket'
bucket = client.get_bucket(bucket_id)
I can read them simply with pandas:
df = pd.read_csv('gs://myBucket/myFile.csv.gz', compression='gzip')
df.head()
time_zone_name province_short
0 America/Chicago US.TX
1 America/Chicago US.TX
2 America/Los_Angeles US.CA
3 America/Chicago US.TX
4 America/Los_Angeles US.CA
I am trying to read the same file with pyspark
myTable = spark.read.format("csv").schema(schema).load('gs://myBucket/myFile.csv.gz', compression='gzip')
but I get the following error
Py4JJavaError: An error occurred while calling o257.load.
: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Welcome to Hadoop dependency hell!
1. Use packages rather than jars
Your configuration is basically correct, but when you add the gcs-connector as a local jar you also need to manually ensure that all its dependencies are available on the JVM classpath.
It's usually easier to add the connector as a package and let Spark deal with the dependencies. So instead of config("spark.jars", "/usr/local/.sdkman/candidates/spark/2.4.4/jars/gcs-connector-hadoop2-2.1.1.jar"), use config('spark.jars.packages', 'com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.1')
2. Manage ivy2 dependencies resolution issues
When you do as above, spark will likely complain that it can't download some dependencies due to resolution differences between maven (used for publication) and ivy2 (used by spark for dependency resolution).
You can usually fix this by simply asking Spark to ignore the unresolved dependencies using spark.jars.excludes, so add a new config line such as
config('spark.jars.excludes','androidx.annotation:annotation,org.slf4j:slf4j-api')
3. Manage classpath conflicts
When this is done, the SparkSession will start, but the filesystem will still fail because the standard distribution of pyspark packages an old version of the guava library that doesn't implement the API the gcs-connector relies on.
You need to ensure that gcs-connector finds its expected version first by using the configs config('spark.driver.userClassPathFirst','true') and config('spark.executor.userClassPathFirst','true')
4. Manage dependency conflicts
Now you may think everything is OK, but it is not: the default pyspark distribution contains version 2.7.3 of the Hadoop libraries, while gcs-connector version 2.1.1 relies on APIs that exist only in 2.8+.
Now your options are:
use a custom build of spark with a newer hadoop (or the package with no built-in hadoop libraries)
use an older version of gcs-connector (version 1.9.17 works fine)
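The second option can be captured in a small helper. This is only a sketch of the rule described above (2.1.1 for Hadoop 2.8+, 1.9.17 otherwise); the function name is hypothetical, the Hadoop version string has to come from your own installation, and Hadoop 3.x builds are deliberately not covered.

```python
GCS_ARTIFACT = "com.google.cloud.bigdataoss:gcs-connector"


def gcs_connector_package(hadoop_version):
    """Return a spark.jars.packages coordinate for a Hadoop 2.x version string."""
    major, minor = (int(part) for part in hadoop_version.split(".")[:2])
    if major != 2:
        raise ValueError("this sketch only covers Hadoop 2.x builds")
    # Per the answer above: 2.1.1 relies on 2.8+ APIs, 1.9.17 works on 2.7.x.
    version = "hadoop2-2.1.1" if minor >= 8 else "hadoop2-1.9.17"
    return f"{GCS_ARTIFACT}:{version}"


# The default pyspark distribution mentioned above bundles Hadoop 2.7.3.
package = gcs_connector_package("2.7.3")
```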
5. A working config at last
Assuming you want to stick with the PyPi or Anaconda latest distribution of pyspark, the following config should work as expected.
I've included only the gcs relevant configs, moved the Hadoop config directly into the spark config and assumed you are correctly setting your GOOGLE_APPLICATION_CREDENTIALS:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
master("local[*]").\
appName("TestApp").\
config('spark.jars.packages',
'com.google.cloud.bigdataoss:gcs-connector:hadoop2-1.9.17').\
config('spark.jars.excludes',
'javax.jms:jms,com.sun.jdmk:jmxtools,com.sun.jmx:jmxri').\
config('spark.driver.userClassPathFirst','true').\
config('spark.executor.userClassPathFirst','true').\
config('spark.hadoop.fs.gs.impl',
'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem').\
config('spark.hadoop.fs.gs.auth.service.account.enable', 'false').\
getOrCreate()
Note that gcs-connector version 1.9.17 has a different set of excludes than 2.1.1 because why not...
PS: You also need to ensure you're using a Java 1.8 JVM because Spark 2.4 doesn't work on newer JVMs.
In addition to @rluta's great answer, you can also replace the userClassPathFirst lines by specifically putting the guava jars in extraClassPath:
spark.driver.extraClassPath=/root/.ivy2/jars/com.google.guava_guava-27.0.1-jre.jar:/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar:/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
spark.executor.extraClassPath=/root/.ivy2/jars/com.google.guava_guava-27.0.1-jre.jar:/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar:/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
It's a bit hackish as you need the exact local ivy2 path, although you can also download/copy the jars to somewhere more permanent.
But, this reduces other potential dependency conflicts such as with livy, which throws java.lang.NoClassDefFoundError: org.apache.livy.shaded.json4s.jackson.Json4sModule if gcs-connector's jackson dependencies are in front of the classpath.
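If you copy the jars somewhere more permanent, the long extraClassPath value can be assembled rather than typed by hand. A minimal sketch, assuming the three jar file names shown above; the helper name and the directory argument are made up for illustration.

```python
import os

# Jar file names as they appear in the extraClassPath lines above.
GUAVA_JARS = [
    "com.google.guava_guava-27.0.1-jre.jar",
    "com.google.guava_failureaccess-1.0.1.jar",
    "com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar",
]


def guava_extra_classpath(jars_dir, jar_names=GUAVA_JARS):
    """Join the jar paths with the platform classpath separator."""
    return os.pathsep.join(os.path.join(jars_dir, name) for name in jar_names)


# The same string would be used for both spark.driver.extraClassPath
# and spark.executor.extraClassPath.
classpath = guava_extra_classpath("/root/.ivy2/jars")
```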
I am trying to create a Spark Dataframe from a Pandas Dataframe and have tried many workarounds but continue to fail. This is my code; I am simply trying to follow one of the many basic examples:
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
test = pd.DataFrame([1,2,3,4,5])
type(test)
sc = SparkContext(master="local[4]")
sqlCtx = SQLContext(sc)
spark_df = sqlCtx.createDataFrame(test)
I was trying the above with a pandas dataframe having 2000 columns and hundreds of thousands of rows but I created the above test df just to make sure it wasn't a problem with the dataframe. And indeed I get the exact same error:
Py4JJavaError: An error occurred while calling o596.get.
: java.util.NoSuchElementException: spark.sql.execution.pandas.respectSessionTimeZone
at org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:884)
at org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:884)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:884)
at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Do you have this set?
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
Also, just to be sure, add the path to the py4j zip in the Spark directory (mine is py4j-0.10.1-src.zip):
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH
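The two export lines above can also be approximated from inside Python, which is handy in a notebook where the shell environment is not under your control. This is a sketch only: the function name is hypothetical, and it assumes the usual layout where the py4j zip sits under python/lib in the Spark directory.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Prepend Spark's Python sources and its bundled py4j zip to sys.path."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version differs between Spark releases, so match it by pattern.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    added = [python_dir] + py4j_zips[-1:]
    for entry in reversed(added):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return added
```

Calling add_spark_to_path(os.environ["SPARK_HOME"]) before importing pyspark would have the same effect as the exports, provided SPARK_HOME is set.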
I resolved the issue - I forgot to add the following lines of code to the beginning of my anaconda notebook:
import findspark
findspark.init()
In my Spark application I am saving a dataframe as a parquet file as follows:
comp_df.write.mode("overwrite").saveAsTable("cdr_step1", format="parquet", path="/data/intermediate_data/cdr_step1/")
If my dataframe is small this works fine, but as the dataset size increases I get the error below.
I searched for this issue online, and in most places people resolve it by changing their code design. In my case I have only a single write operation, and I don't understand what I need to change.
17/02/02 13:22:56 ERROR datasources.DefaultWriterContainer: Job job_201702021228_0000 aborted.
17/02/02 13:22:56 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
17/02/02 13:22:56 WARN scheduler.TaskSetManager: Lost task 1979.0 in stage 3.0 (TID 1984, slv3.cdh-prod.xxxx.com, executor 86): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /data/intermediate_data/cdr_step1/_temporary/0/_temporary/attempt_201702021322_0003_m_001979_0/part-r-01979-9fe33b7c-0b14-4e63-8e96-6e83aabbe807.gz.parquet (inode 2144221): File does not exist. Holder DFSClient_NONMAPREDUCE_-1523564925_148 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3635)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3438)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3294)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:679)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:214)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:489)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
I'm invoking Pyspark with Spark 2.0 in local mode with the following command:
pyspark --executor-memory 4g --driver-memory 4g
The input dataframe is read from a tsv file and has 580 K rows x 28 columns. I perform a few operations on the dataframe and then try to export it to a tsv file, which fails with the error below.
df.coalesce(1).write.save("sample.tsv",format = "csv",header = 'true', delimiter = '\t')
Any pointers on how to get rid of this error? I can easily display the df or count the rows.
The output dataframe has 3100 rows and 23 columns.
Error:
Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 1073, localhost): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:129)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextRow(WindowExec.scala:300)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.<init>(WindowExec.scala:309)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:289)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:288)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
Driver stacktrace:
I believe the cause of this problem is coalesce(): although it avoids the full shuffle that repartition would do, it still has to squeeze the data into the requested number of partitions.
Here, you are requesting all the data to fit into one partition, thus one task (and only one task) has to work with all the data, which may cause its container to suffer from memory limitations.
So, either ask for more partitions than 1, or avoid coalesce() in this case.
Otherwise, you could try the solutions provided in the links below, for increasing your memory configurations:
Spark java.lang.OutOfMemoryError: Java heap space
Spark runs out of memory when grouping by key
In my case, replacing coalesce(1) with repartition(1) worked.
The problem for me was indeed coalesce().
What I did was write the data without coalesce(), as parquet, using df.write.parquet("testP"). Then I read the file back and exported that with coalesce(1).
Hopefully it works for you as well.
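The two-step workaround just described might look like the sketch below. The function and parameter names are hypothetical; spark is assumed to be an existing SparkSession and df the dataframe to export.

```python
def export_via_parquet(spark, df, tmp_path, out_path):
    """Materialize df as parquet first, then write a single tsv from the copy."""
    # Step 1: write with the existing partitioning, so the expensive
    # upstream work runs across all partitions.
    df.write.parquet(tmp_path)
    # Step 2: the re-read data has no heavy lineage behind it, so squeezing
    # it into one partition only moves the finished rows.
    (spark.read.parquet(tmp_path)
          .coalesce(1)
          .write
          .csv(out_path, sep="\t", header=True))
```

The intermediate parquet directory is left in place; cleaning it up is up to the caller.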
As stated in other answers, use repartition(1) instead of coalesce(1). The reason is that repartition(1) ensures that upstream processing is done in parallel (multiple tasks/partitions), rather than on only one executor.
To quote the Dataset.coalesce() Spark docs:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(1) instead. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
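Applied to the tsv export from the question, the advice in the quoted docs could be sketched as follows. The helper name is made up, and df is assumed to be the dataframe being exported.

```python
def write_single_tsv(df, path):
    """Write df as one tab-separated part file.

    repartition(1) adds a shuffle, so the upstream stages keep their
    parallelism and only the finished rows land in the single partition.
    """
    (df.repartition(1)
       .write
       .csv(path, sep="\t", header=True))
```

As with any Spark write, path ends up as a directory containing one part file rather than a single plain file.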
In my case the driver was smaller than the workers. Issue was resolved by making the driver larger.
I would like to test a simple Spark row count job on a test Cassandra table with only four rows, just to verify that everything works.
I can quickly get this working from Java:
JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions sparkContextJavaFunctions = CassandraJavaUtil.javaFunctions(sc);
CassandraJavaRDD<CassandraRow> table = sparkContextJavaFunctions.cassandraTable("demo", "playlists");
long count = table.count();
Now, I'd like to get the same thing working in Python. The Spark distribution comes with a set of unbundled PySpark source code to use Spark from Python. It uses a library called py4j to launch a Java server and marshal java commands through a TCP gateway. I'm using that gateway directly to get this working.
I specify the following extra jars to the Java SparkSubmit host via the --driver-class-path option:
spark-cassandra-connector-java_2.11-1.2.0-rc1.jar
spark-cassandra-connector_2.11-1.2.0-rc1.jar
cassandra-thrift-2.1.3.jar
cassandra-clientutil-2.1.3.jar
cassandra-driver-core-2.1.5.jar
libthrift-0.9.2.jar
joda-convert-1.2.jar
joda-time-2.3.jar
Here is the core Python code to do the row count test:
from pyspark.java_gateway import launch_gateway
jvm_gateway = launch_gateway()
sc = jvm_gateway.jvm.org.apache.spark.api.java.JavaSparkContext(conf)
spark_cass_functions = jvm_gateway.jvm.com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions(sc)
table = spark_cass_functions.cassandraTable("demo", "playlists");
On this last line, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o5.cassandraTable.
: com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.auth.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:38)
at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:18)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.<init>(CassandraTableScanRDD.scala:59)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$.apply(CassandraTableScanRDD.scala:182)
at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:88)
at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Clearly, there is some configuration or setup issue. I'm not sure how to reasonably debug or investigate or what I could try. Can anyone with more Cassandra/Python/Spark expertise provide some advice? Thank you!
EDIT: A coworker set up a spark-defaults.conf file that was the root of this. I don't fully understand why this caused problems from Python and not from Java, but it doesn't matter. I don't want that conf file, and removing it resolved my issue.
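Before deleting such a file, it can help to see which spark.cassandra.* properties it actually sets. A small sketch that scans the text of a spark-defaults.conf; the function name and the sample host line are illustrative, and the parser only handles the simple "key value" and "key=value" forms.

```python
import re


def cassandra_keys(conf_text):
    """Return the spark.cassandra.* property names found in the conf text."""
    keys = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key = re.split(r"[\s=:]+", line, maxsplit=1)[0]
        if key.startswith("spark.cassandra."):
            keys.append(key)
    return keys


sample = """
# example spark-defaults.conf content
spark.master                            local[*]
spark.cassandra.connection.host         127.0.0.1
spark.cassandra.connection.conf.factory=com.example.Factory
"""
found = cassandra_keys(sample)
```

Any name in the result that the connector version in use does not recognize is a candidate for the "Invalid Config Variables" error shown above.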
That is a known bug in the Spark Cassandra Connector in 1.2.0-rc1 and 1.2.0-rc2; it will be fixed in rc3.
Relevant Tickets
https://datastax-oss.atlassian.net/browse/SPARKC-102
https://datastax-oss.atlassian.net/browse/SPARKC-105
https://datastax-oss.atlassian.net/browse/SPARKC-108
You could always try out pyspark_cassandra. It's built against 1.2.0-rc3 and probably is a lot easier when working with Cassandra in pyspark.