I have a Spark dataframe that I'm trying to save to a Google Storage bucket with the line
df.write.format("com.databricks.spark.csv").save('gs://some-test-bucket-delete-me')
But Pyspark raises the following exception
Py4JJavaError: An error occurred while calling o55.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
What I've tried:
The typical solution to this exception is to make sure the environment variable HADOOP_CLASSPATH points at the gcs-connector-latest-hadoop2.jar file, which it does. I've tried both the Hadoop version 1 and version 2 jars in case that was the problem. I've also tried explicitly pointing at it within the Jupyter notebook with
sc._jsc.hadoopConfiguration().set("spark.driver.extraClassPath", "/home/****/hadoop-2.8.2/share/hadoop/common/lib/gcs-connector-latest-hadoop1.jar")
to no avail.
If I run hadoop fs -ls gs://some-test-bucket-delete-me from bash, the command returns fine, which should indicate that the Google Cloud Storage connector works, but for some reason I can't get this functionality to work in PySpark.
Things that may be important:
Spark Version 2.2.0
Python 3.6.1 :: Anaconda custom (64-bit)
I'm running PySpark locally
You should run gcloud init first
Then try df.write.csv('gs://some-test-bucket-delete-me/file_name')
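One thing to note: spark.driver.extraClassPath is a Spark property that must be set before the driver JVM starts, so setting it at runtime through sc._jsc.hadoopConfiguration() has no effect. If the connector jar still isn't being picked up, it may be worth handing it to Spark explicitly when the session is built and registering the gs:// filesystem classes it provides. A minimal sketch, assuming the gcs-connector jar sits at /home/user/jars/gcs-connector-latest-hadoop2.jar (adjust the path to yours):

from pyspark.sql import SparkSession

# Put the GCS connector on the classpath and register its gs:// filesystem
# implementations via Hadoop configuration properties.
spark = (
    SparkSession.builder
    .appName("gcs-write-test")
    .config("spark.jars", "/home/user/jars/gcs-connector-latest-hadoop2.jar")  # hypothetical path
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .getOrCreate()
)

spark.range(10).write.csv("gs://some-test-bucket-delete-me/test-output")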
Related
I am trying to connect to Azure Cache for Redis from Databricks.
I have installed the package com.redislabs:spark-redis:2.3.0 from Maven in Databricks, and I have created a Spark session with the code below:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.redis.host", "my host") \
    .config("spark.redis.port", "6379") \
    .config("spark.redis.auth", "passwd") \
    .getOrCreate()
But when I run
df.write.format("org.apache.spark.sql.redis").option("table", "people").option("key.column", "name").save()
I get the error below:
Py4JJavaError: An error occurred while calling o390.save.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.redis. Please find packages at
http://spark.apache.org/third-party-projects.html
Could you please let me know the detailed steps to install all the necessary libraries/jars to access Redis in Databricks?
I have seen the code below in the spark-redis Python docs, but I don't know how to run it in Databricks.
$ ./bin/pyspark --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar
Also, please let me know what the latest spark-redis version is.
Redis has a Spark Package that you can download and attach to your cluster
The following notebook shows how to use Redis with Apache Spark in Azure Databricks.
For more details, refer to Azure Databricks - Redis.
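As a rough end-to-end sketch, assuming the com.redislabs:spark-redis Maven coordinate is attached to the cluster (Clusters > Libraries > Install New > Maven) and that the host and credentials below are replaced with your own:

from pyspark.sql import SparkSession

# On Databricks a `spark` session already exists; the spark.redis.* settings
# can also go in the cluster's Spark config instead of being set here.
spark = (
    SparkSession.builder
    .appName("redis-example")
    .config("spark.redis.host", "my host")   # placeholder host
    .config("spark.redis.port", "6379")
    .config("spark.redis.auth", "passwd")
    .getOrCreate()
)

df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])

# Write the DataFrame to Redis, keyed by the `name` column ...
df.write.format("org.apache.spark.sql.redis") \
    .option("table", "people") \
    .option("key.column", "name") \
    .save()

# ... and read it back the same way.
people = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "people") \
    .option("key.column", "name") \
    .load()
people.show()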
I'm building a PySpark app using a Jupyter notebook; so far I'm using it in standalone mode.
Now I have at my disposal 3 virtual machines with Spark on them, and I want to start PySpark on a cluster.
Here is my code to start it in standalone mode:
(I'm using Spark 3.1.2 with Hadoop 3.2.)
I've searched for ways to do this and haven't gotten it to work, and there are some articles saying that PySpark doesn't work on clusters, so if you know how I can change this code and launch my session on a cluster, please help.
Thank you.
You must have a cluster of some sort.
I use Kubernetes and https://github.com/bjornjorgensen/jlpyk8s
This way I have a notebook that I can run PySpark on interactively.
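If you would rather use the three VMs you already have, Spark's built-in standalone cluster manager is another option: start a master on one VM and workers on the others using the scripts under sbin/, then point the notebook's session builder at the master URL. A minimal sketch, assuming the master runs on a host called spark-master at the default port 7077 (replace with your master VM's address):

from pyspark.sql import SparkSession

# Point the notebook's session at the standalone master instead of local mode.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")    # hypothetical hostname for the master VM
    .appName("jupyter-on-cluster")
    .config("spark.executor.memory", "2g")  # optional resource tuning
    .getOrCreate()
)

print(spark.sparkContext.master)  # should print the cluster URL rather than local[*]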
I am trying to deploy a google cloud function to use the universal-sentence-encoder model.
However, if I add in the dependencies to my requirements.txt:
tensorflow==2.1
tensorflow-hub==0.8.0
then the function fails to deploy with the following error:
Build failed: {"error": {"canonicalCode": "INTERNAL", "errorMessage": "gzip_tar_runtime_package gzip /tmp/tmpOBr2rZ.tar -1\nexited with error [Errno 12] Cannot allocate memory\ngzip_tar_runtime_package is likely not on the path", "errorType": "InternalError", "errorId": "F57B9E18"}}
What does this error mean?
How can I fix it?
Note that the code for the function itself is just the demo code provided by Google when you click "create function" in the web console. It deploys when I remove these requirements; when I add them, it breaks.
This error can happen when the size of the deployed files is larger than the available Cloud Function memory. The gzip_tar_runtime_package could not be installed because memory could not be allocated.
Make sure you are only using the required dependencies. If you are uploading static files, make sure you only upload necessary files.
After that, if you are still facing the issue, try increasing the Cloud Function memory by setting the --memory flag in the gcloud functions deploy command, as explained here.
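For example, something along these lines (the function name, runtime, and trigger are placeholders for whatever you are actually deploying):

gcloud functions deploy my_function --runtime python37 --trigger-http --memory 2048MB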
EDIT:
There is currently a known issue with TensorFlow 2.1 in Cloud Functions.
The current workaround is to use TensorFlow 2.0.0 or 2.0.1.
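In other words, the workaround amounts to pinning the requirements like this (tensorflow-hub version kept as in the question):

tensorflow==2.0.1
tensorflow-hub==0.8.0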
I've created a spark cluster on EMR.
But I'm unable to access pyspark when I open it with a notebook.
Configuration:
Example:
from pyspark import SparkContext
I also cannot access sc, which I was under the impression would be available.
sc.list_packages()
NameError: name 'sc' is not defined
I feel like I'm missing something very basic here, but I'm completely new to EMR and have spent a bunch of time on this already.
Does anyone have ideas on how I should debug this?
When I opened my notebook with "JupyterLab" instead of "Jupyter" all libraries were available.
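For reference, once the notebook is attached to the cluster with the PySpark kernel, a quick sanity check like the one below should work without constructing anything yourself; sc and spark are injected by the kernel when the first cell runs, and list_packages is an EMR Notebooks helper, so it is only expected to exist there:

# `spark` and `sc` are created by the EMR PySpark kernel on first use;
# there is no need to import or build a SparkContext manually.
print(spark.version)
print(sc.applicationId)

# EMR Notebooks helper for listing the Python packages available to the kernel.
sc.list_packages()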
I'm new to Spark. I installed Spark 2.3.0 in standalone mode on an Ubuntu 16.04.3 server. That runs well so far. Now I would like to start developing with PySpark because I've got more experience using Python than Scala.
OK. Even after using Google for a while, I'm not sure how I should set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.
Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
Give it a try and let us know how it goes.
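As a concrete starting point, a module like the one below can be run unchanged in both places; only the --master you pass to spark-submit changes (the host name spark-server is a placeholder for your Ubuntu machine):

# my_job.py -- a minimal job that runs locally and on the standalone cluster.
#
# Local run:    spark-submit --master "local[*]" my_job.py
# Cluster run:  spark-submit --master spark://spark-server:7077 my_job.py
from pyspark.sql import SparkSession

def main():
    # Don't hard-code .master() here; let spark-submit decide where to run.
    spark = SparkSession.builder.appName("my_job").getOrCreate()
    df = spark.range(100).selectExpr("id", "id * 2 AS doubled")
    print(df.count())
    spark.stop()

if __name__ == "__main__":
    main()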