What is the standard development process, involving some kind of IDE, for Spark with Python, for
data exploration on the cluster and
application development?
I found the following answers, which do not satisfy me:
a) Zeppelin/Jupyter notebooks running "on the cluster"
b)
Install Spark and PyCharm locally,
use some local files containing dummy data to develop locally,
change references in the code to some real files on the cluster,
execute the script using spark-submit in the console on the cluster.
source: https://de.hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-python/
I would love to do a) and b) using some locally installed IDE which communicates with the cluster directly, because I dislike the idea of creating local dummy files and changing the code before running it on the cluster. I would also prefer an IDE over a notebook. Is there a standard way to do this, or are my answers above already "best practice"?
You should be able to use any IDE with PySpark. Here are some instructions for Eclipse and PyDev:
set the HADOOP_HOME variable to the location of winutils.exe
set the SPARK_HOME variable to your local Spark folder
set SPARK_CONF_DIR to the folder where you have copied the actual cluster config (spark-defaults and log4j)
add %SPARK_HOME%/python/lib/pyspark.zip and
%SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
For testing purposes you can add code like:
spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate()
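To address the wish to avoid local dummy files, here is a minimal sketch of a script you would run from the local IDE against the cluster; the master URL, namenode address, and file path are placeholders I made up, not values from the tutorial:

from pyspark.sql import SparkSession

# Placeholder master URL; replace with your cluster's values
spark = (SparkSession.builder
         .appName("local-ide-exploration")
         .master("spark://my-cluster-master-node:7077")
         .getOrCreate())

# Read a real file that lives on the cluster instead of a local dummy copy (placeholder path)
df = spark.read.csv("hdfs://my-namenode:8020/data/some_real_file.csv", header=True)
df.printSchema()
df.show(5)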
With the proper configuration file in SPARK_CONF_DIR, it should work with just SparkSession.builder.getOrCreate(). Alternatively, you could set up your run configurations to use spark-submit directly. Some websites with similar instructions for other IDEs include:
PyCharm
Spyder
PyCharm & Spark
Jupyter Notebook
PySpark
I'm running a Keras application in R with VS Code, using the following code (in the R console):
library(foreign)
library(dplyr)
library(tidyverse)
library(tidytext)
library(keras)
library(data.table)
options(scipen=999)
dat <- read.csv("https://www.dropbox.com/s/31wmgva0n151dyq/consumers.csv?dl=1")
max_words <- 2000 # Maximum number of words to consider as features
maxlen <- 64 # Text cutoff after n words
# Prepare to tokenize the text
texts <- as.character(dat$consumer_complaint_narrative)
tokenizer <- text_tokenizer(num_words = max_words) %>%
fit_text_tokenizer(texts)
But it says:
Python was not found but can be installed from the Microsoft Store: https://go.microsoft.com/fwlink?linkID=2082640
Error in python_config(python_version, required_module, python_versions) :
Error 9009 occurred running C:\Users\my_working_directory\AppData\Local\MICROS~1\WINDOW~1\python.exe
It seems to suggest that I have not installed Python on my device, but I actually have: I ran similar Keras Python code in my Jupyter notebook without a problem, and I just want to try doing this in R.
I have found that others have asked similar questions before, but I could not figure out what the problem is, at least in my case. It would be really appreciated if someone could help me with this.
Have you checked that Python is in the default PATH?
From the docs:
3.6. Configuring Python
To run Python conveniently from a command prompt, you might consider changing some default environment variables in Windows. While the installer provides an option to configure the PATH and PATHEXT variables for you, this is only reliable for a single, system-wide installation. If you regularly use multiple versions of Python, consider using the Python Launcher for Windows.
3.6.1. Excursus: Setting environment variables
Windows allows environment variables to be configured permanently at both the User level and the System level, or temporarily in a command prompt.
To temporarily set environment variables, open Command Prompt and use
the set command:
C:\>set PATH=C:\Program Files\Python 3.8;%PATH%
C:\>set PYTHONPATH=%PYTHONPATH%;C:\My_python_lib
C:\>python
These changes will
apply to any further commands executed in that console, and will be
inherited by any applications started from the console.
Including the variable name within percent signs will expand to the
existing value, allowing you to add your new value at either the start
or the end. Modifying PATH by adding the directory containing
python.exe to the start is a common way to ensure the correct version
of Python is launched.
To permanently modify the default environment variables, click Start
and search for ‘edit environment variables’, or open System
properties, Advanced system settings and click the Environment
Variables button. In this dialog, you can add or modify User and
System variables. To change System variables, you need non-restricted
access to your machine (i.e. Administrator rights).
Note Windows will concatenate User variables after System variables,
which may cause unexpected results when modifying PATH. The PYTHONPATH
variable is used by all versions of Python 2 and Python 3, so you
should not permanently configure this variable unless it only includes
code that is compatible with all of your installed Python versions.
The path specified in the code snippet C:\Program Files\Python 3.8 must be adapted to reflect where your Python is actually located.
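If you want to confirm which interpreter (if any) Windows actually resolves from PATH, a quick check from any working Python is the sketch below; it uses only the standard library and nothing R-specific:

import shutil
import sys

print(sys.executable)          # the interpreter currently running this snippet
print(shutil.which("python"))  # the first python.exe found on PATH, or None if none is found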
I use the RemoteFS extension in VS Code to connect to my remote SSH server. When I open a .py file on the remote server in VS Code and then add a #%% comment to the .py file, I don't get the option to run a Jupyter cell like I would locally.
Has anybody gotten VS Code's Python extension, with its built-in Jupyter support, working with the RemoteFS extension?
We had an overly restrictive file check for when we allowed our "Run Cell" commands to show up, and it was limiting them to local files only. I've fixed that issue in the following PR:
https://github.com/Microsoft/vscode-python/pull/4191
I verified that I was seeing the cell commands using Remote FS after that. Sadly this just missed the cutoff for our recent January release, so it won't show up in the extension until later in February. If you want to check out the fix you can access our daily development build here:
https://github.com/Microsoft/vscode-python/blob/master/CONTRIBUTING.md#development-build
That build has the fix already, but it's not the full public release.
I found the configuration for C++ (https://github.com/Microsoft/vscode-cpptools/blob/master/Documentation/Debugger/gdb/Windows%20Subsystem%20for%20Linux.md) and tried to adapt it for Python debugging, but it doesn't work. Any suggestions for making it work?
It should be mentioned that the Python extension for VS Code does not officially support WSL yet, but the enhancement request has been made and we do plan on supporting it.
Beyond extension installation, the IDE_PROJECT_ROOTS environment variable may also affect the debugger. For usual standalone Python code debugging under WSL, making sure this variable is not set (or is set to the location of the files) when VS Code is opened helps.
For "step into" debugging of a Jupyter notebook, having the Python file path(s) as part of IDE_PROJECT_ROOTS (for example, export IDE_PROJECT_ROOTS="/tmp:/foo_pythonfilespath" set in .bashrc) will help to carry out "step into" Python-code debugging in VS Code.
This is now supported and just requires installing the Microsoft Python extension and then to quote the documentation on remote debugging with WSL:
Once you've opened a folder in WSL, you can use VS Code's debugger in
the same way you would when running the application locally. For
example, if you select a launch configuration in launch.json and start
debugging (F5), the application will start on remote host and attach
the debugger to it.
See the debugging documentation for details on configuring VS Code's
debugging features in .vscode/launch.json
I run a Python program which uses a couple of source paths at runtime.
I put these lines in my ~/.bashrc file:
source home/raphael/kaldi/aspire/s5/cmd.sh
source home/raphael/kaldi/aspire/s5/path.sh
So when I'm running from the terminal, everything works fine and Python manages to locate the paths.
However, when I'm trying to run through PyCharm, mostly for debugging purposes, it seems that PyCharm can't locate the paths.
Is there any way to add the paths manually in PyCharm or make it read the ~/.bashrc file? What am I missing?
You can try using the options available in the Run/Debug Configuration settings (Run > Edit Configurations...)
You can set environment variables individually (such as $PATH), or at the bottom is a section to define external tools (scripts) to be run when your Python code is run or debugged. From that sub-section, you could set your bash scripts to be run each time you start debugging.
Alternatively, see if using os.environ would work for your project. Check the docs for more information.
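For instance, a minimal sketch of the os.environ route is below; the variable names and values are placeholders mimicking what sourcing cmd.sh / path.sh would export in the shell, not something PyCharm or Kaldi requires verbatim:

import os

# Placeholder values standing in for what the sourced scripts would set
os.environ["KALDI_ROOT"] = "/home/raphael/kaldi"
os.environ["PATH"] = "/home/raphael/kaldi/src/bin" + os.pathsep + os.environ["PATH"]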
I have some third-party database client libraries in Java. I want to access them through
java_gateway.py
E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:
java_import(gateway.jvm, "org.mydatabase.MyDBClient")
It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work. I get:
Py4JError: Trying to call a package
Also, when comparing to Hive: the Hive JAR files are not loaded via compute-classpath.sh, so that makes me suspicious. There seems to be some other mechanism in place to set up the JVM-side classpath.
You could add the path to the jar file using the Spark configuration at runtime.
Here is an example:
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)
Refer to the documentation for more information.
You can add external jars as arguments to pyspark
pyspark --jars file1.jar,file2.jar
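Once the shell is up with those jars on the classpath, the classes they contain can be reached through the JVM view the pyspark shell exposes. A sketch, reusing the class name from the question as a placeholder and the spark session object the shell creates for you:

from py4j.java_gateway import java_import

# Make the client class addressable by its short name on the shell's JVM view
java_import(spark._jvm, "org.mydatabase.MyDBClient")
client_class = spark._jvm.MyDBClient  # placeholder; instantiate per the library's own API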
You could add --jars xxx.jar when using spark-submit:
./bin/spark-submit --jars xxx.jar your_spark_script.py
or set the environment variable SPARK_CLASSPATH:
SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py
where your_spark_script.py is written with the PySpark API.
None of the above answers worked for me.
What I had to do with pyspark was:
pyspark --py-files /path/to/jar/xxxx.jar
For Jupyter Notebook:
spark = (SparkSession
.builder
.appName("Spark_Test")
.master('yarn-client')
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.config("spark.executor.cores", "4")
.config("spark.executor.instances", "2")
.config("spark.sql.shuffle.partitions","8")
.enableHiveSupport()
.getOrCreate())
# Do this
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
Link to the source where I found it:
https://github.com/graphframes/graphframes/issues/104
Extract the downloaded jar file.
Edit the system environment variables:
Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.
E.g., if you have extracted the jar file to a folder named sparkts on the C drive,
its value should be: C:\sparkts
Restart your cluster.
Apart from the accepted answer, you also have the options below:
If you are in a virtual environment, you can place it in, e.g.,
lib/python3.7/site-packages/pyspark/jars
If you want Java to discover it, you can place it where your JRE is installed, under the ext/ directory.
One more thing you can do is to add the jar to the pyspark jars folder where pyspark is installed, usually /python3.6/site-packages/pyspark/jars.
Be careful if you are using a virtual environment: the jar needs to go into the pyspark installation inside the virtual environment.
This way you can use the jar without passing it on the command line or loading it in your code.
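If you are not sure where that folder lives on a particular machine, a quick check (a sketch using only the installed pyspark package) is:

import os
import pyspark

# The jars/ directory sits alongside the installed pyspark package
print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))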
I've worked around this by dropping the jars into a drivers directory and then creating a spark-defaults.conf file in the conf folder. Steps to follow:
To get the conf path:
cd ${SPARK_HOME}/conf
vi spark-defaults.conf
spark.driver.extraClassPath /Users/xxx/Documents/spark_project/drivers/*
Then run your Jupyter notebook.
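To confirm the setting was actually picked up by the running session, a small sketch (assuming a default session created in the notebook) reads it back from the active SparkContext:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Prints the extra classpath picked up from spark-defaults.conf, or "not set" otherwise
print(spark.sparkContext.getConf().get("spark.driver.extraClassPath", "not set"))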
For Java/Scala libraries accessed from PySpark, neither --jars nor spark.jars worked for me in version 2.4.0 and earlier (I didn't check newer versions). I'm surprised how many people claim that it is working.
The main problem is that for a classloader retrieved in the following way:
jvm = SparkSession.builder.getOrCreate()._jvm
clazz = jvm.my.scala.SomeClass
# or
clazz = jvm.java.lang.Class.forName('my.scala.SomeClass')
it works only when you copy the jar files to ${SPARK_HOME}/jars (this one works for me).
But when your only option is to use --jars or spark.jars, a different classloader is used (a child classloader), which is set on the current thread. So your Python code needs to look like:
clazz = jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(f"{object_name}$")
Hope that explains your troubles. Give me a shout if not.
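Putting that together, a minimal sketch could look like the following; the jar path and class name are placeholders, and the trailing $ from above is only needed when loading a Scala object:

from pyspark.sql import SparkSession

# Placeholder jar shipped via spark.jars rather than copied into ${SPARK_HOME}/jars
spark = (SparkSession.builder
         .config("spark.jars", "/path/to/my-lib.jar")
         .getOrCreate())
jvm = spark._jvm

# Classes shipped via spark.jars are visible to the thread's context classloader,
# not necessarily to the root view behind jvm.<package>.<Class>
loader = jvm.java.lang.Thread.currentThread().getContextClassLoader()
clazz = loader.loadClass("my.scala.SomeClass")  # placeholder class name
print(clazz.getName())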