I have been trying to do this. In the PySpark shell, I get the SparkContext as sc. But when I use the addPyFile method, it makes the resulting SparkContext None:
>>> sc2 = sc.addPyFile("/home/ec2-user/redis.zip")
>>> sc2 is None
True
What's wrong?
Below is the source code for PySpark's (v1.1.1) addPyFile. (The source links for 1.4.1 in the official PySpark docs are broken as I'm writing this.)
It returns None because there is no return statement. See also: In Python, if a function doesn't have a return statement, what does it return?
So, if you do sc2 = sc.addPyFile("mymodule.py"), of course sc2 will be None, because .addPyFile() does not return anything!
Instead, simply call sc.addPyFile("mymodule.py") and keep using sc as the SparkContext.
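(A two-line illustration: a function that falls off the end without a return statement returns None.)
def no_return():
    pass  # no return statement, so the call evaluates to None

print(no_return() is None)  # True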
def addPyFile(self, path):
    """
    Add a .py or .zip dependency for all tasks to be executed on this
    SparkContext in the future. The C{path} passed can be either a local
    file, a file in HDFS (or other Hadoop-supported filesystems), or an
    HTTP, HTTPS or FTP URI.
    """
    self.addFile(path)
    (dirname, filename) = os.path.split(path)  # dirname may be directory or HDFS/S3 prefix

    if filename.endswith('.zip') or filename.endswith('.ZIP') or filename.endswith('.egg'):
        self._python_includes.append(filename)
        # for tests in local mode
        sys.path.append(os.path.join(SparkFiles.getRootDirectory(), filename))
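To illustrate the correct usage with the zip from the question (a sketch only; it assumes redis.zip actually contains an importable redis package, and the map step is just a placeholder):
# addPyFile is called for its side effect and returns None by design.
sc.addPyFile("/home/ec2-user/redis.zip")

# sc is still a valid SparkContext, and the zip's contents are now importable
# inside tasks running on the executors:
def uses_redis(x):
    import redis  # resolved from the shipped redis.zip on the workers
    return x

sc.parallelize([1, 2, 3]).map(uses_redis).collect()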
Here I'm attaching the actual error shown. I'm using MLRun with Docker, specifically MLRun 1.2.0.
--------------------------------------------------------------------------
RunError Traceback (most recent call last)
<ipython-input-20-aab97e08b914> in <module>
1 serving_fn.with_code(body=" ") # adds the serving wrapper, not required with MLRun >= 1.0.3
----> 2 project.deploy_function(serving_fn)
/opt/conda/lib/python3.8/site-packages/mlrun/projects/project.py in deploy_function(self, function, dashboard, models, env, tag, verbose, builder_env, mock)
2307 :param mock: deploy mock server vs a real Nuclio function (for local simulations)
2308 """
-> 2309 return deploy_function(
2310 function,
2311 dashboard=dashboard,
/opt/conda/lib/python3.8/site-packages/mlrun/projects/operations.py in deploy_function(function, dashboard, models, env, tag, verbose, builder_env, project_object, mock)
344 )
345
--> 346 address = function.deploy(
347 dashboard=dashboard, tag=tag, verbose=verbose, builder_env=builder_env
348 )
/opt/conda/lib/python3.8/site-packages/mlrun/runtimes/serving.py in deploy(self, dashboard, project, tag, verbose, auth_info, builder_env)
621 logger.info(f"deploy root function {self.metadata.name} ...")
622
--> 623 return super().deploy(
624 dashboard, project, tag, verbose, auth_info, builder_env=builder_env
625 )
/opt/conda/lib/python3.8/site-packages/mlrun/runtimes/function.py in deploy(self, dashboard, project, tag, verbose, auth_info, builder_env)
550 self.status = data["data"].get("status")
551 self._update_credentials_from_remote_build(data["data"])
--> 552 self._wait_for_function_deployment(db, verbose=verbose)
553
554 # NOTE: on older mlrun versions & nuclio versions, function are exposed via NodePort
/opt/conda/lib/python3.8/site-packages/mlrun/runtimes/function.py in _wait_for_function_deployment(self, db, verbose)
620 if state != "ready":
621 logger.error("Nuclio function failed to deploy", function_state=state)
--> 622 raise RunError(f"function {self.metadata.name} deployment failed")
623
624 #min_nuclio_versions("1.5.20", "1.6.10")
RunError: function serving deployment failed
I don't have any idea what the reason behind this error is. I'm a newbie here, so could someone please help me resolve it?
I see two steps to solve the issue:
1. Relevant installation
The MLRun Community Edition in Docker Desktop has to be installed under the relevant HOST_IP (not localhost or 127.0.0.1, but a stable IP address; see ipconfig) and with the relevant SHARED_DIR. See the relevant command line (on Windows):
set HOST_IP=192.168.0.150
set SHARED_DIR=c:\Apps\mlrun-data
set TAG=1.2.0
mkdir %SHARED_DIR%
docker-compose -f "c:\Apps\Desktop Docker Tools\compose.with-jupyter.yaml" up
BTW: for the YAML file, see https://docs.mlrun.org/en/latest/install/local-docker.html
2. Access to the port
If you call serving_fn.invoke, you have to open the relevant port (from deploy_function) on your IP address (based on the HOST_IP setting; see the first point).
Typically this port can be blocked by your firewall policy or your local antivirus, which means you have to open access to this port before the invoke call.
BTW: see also the related issue https://github.com/mlrun/mlrun/issues/2102
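Once both points are in place, here is a minimal sketch of the flow, reusing serving_fn and project from the question (the model route and payload below are hypothetical and depend on what your serving graph actually exposes):
serving_fn.with_code(body=" ")       # serving wrapper; not required with MLRun >= 1.0.3
project.deploy_function(serving_fn)  # should now reach "ready" instead of raising RunError

# invoke() goes over HTTP to the deployed Nuclio function, so the port reported by
# deploy_function must be reachable from your machine (firewall/antivirus, see point 2):
resp = serving_fn.invoke("/v2/models/my_model/infer", body={"inputs": [[1, 2, 3]]})
print(resp)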
I am using Anaconda on Windows 10 x64. I'm using VS Code. Recently I successfully ran the following:
(I)
import pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder.master("local[1]").appName("local").getOrCreate()
rdd=spark.sparkContext.parallelize([1,2,3,4,56])
print(rdd.count())
Then I clicked File -> Open Folder and opened a folder, so now it appears in a pane on the left-hand side of my screen. Question #1: What does this do? I used to think that it was just a quick way to see some frequently-used files.
Now that my folder is in my left-hand pane, the above code errors out (see below) with an error that includes the phrase "Python worker failed to connect back". Question #2: Why does this happen?
Question #3: If I want to be able to avoid the above error while also having a folder open in VS Code, what should I do? Any ideas about what settings I should look at?
If I close the folder, my code works again.
-----------------------------------------------------------------------------------------------------------------------------------------
The Error: Every time it gives me a slightly different error, but it always begins with something like this:
Py4JJavaError Traceback (most recent call last)
\\pathtomyfile\temp.py in
----> 240 spark.createDataFrame(pandas_df).toDF(*columns).show()
C:\Spark\spark-3.1.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in show(self, n, truncate, vertical)
482 """
483 if isinstance(truncate, bool) and truncate:
--> 484 print(self._jdf.showString(n, 20, vertical))
485 else:
486 print(self._jdf.showString(n, int(truncate), vertical))
C:\Spark\spark-3.1.1-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
C:\Spark\spark-3.1.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
C:\Spark\spark-3.1.1-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o77.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (blablabla executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
<a bunch more stuff after that>
(It's always "Task 0 in stage x failed y times", but x isn't always 3 and y isn't always 1.)
-----------------------------------------------------------------------------------------------------------------------------------------
Other Results: I get a similar error if I run the following:
(II)
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
spark=SparkSession.builder.master("local[1]").appName("local").getOrCreate()
pandas_df = pd.DataFrame(np.random.randint(0,100, size=(5,5)))
columns = ["col"+str(i) for i in range(5)]
print(columns)
display(pandas_df.head())
spark.createDataFrame(pandas_df).toDF(*columns).show()
I don't get an error if I run the following:
(III)
from pyspark.sql import SparkSession
spark=SparkSession.builder.master("local[1]").appName("local").getOrCreate()
rocks = spark.read.format("csv").option("header", "true").load("C:\\rocksamples.csv")
rocks.show(10)
Also, none of the above three code blocks (I, II, III) gives an error if I run it from an .ipynb file inside the folder I opened.
-----------------------------------------------------------------------------------------------------------------------------------------
Background: I have the following files and folders on my machine:
Anaconda directory: C:\users\me\Anaconda3
Spark directory: C:\Spark\spark-3.1.1-bin-hadoop2.7
Java directory: C:\Spark\java\jre1.8.0_231
Nothing called pyspark in C:\users\me\Anaconda3\Lib\site-packages
a .csv file at C:\rocksamples.csv
Winutils file in C:\Hadoop\bin\winutils.exe
My current environment variables (which I would like to clean up but now I'm afraid to) include the following:
HADOOP_HOME=C:\Hadoop
JAVA_HOME=C:\Spark\java\jre1.8.0_231
Path=(other stuff);C:\Spark\spark-3.1.1-bin-hadoop2.7\bin;C:\Spark\java\jre1.8.0_231\bin
PYSPARK_DRIVER_PYTHON=jupyter
PYSPARK_DRIVER_PYTHON_OPTS=notebook
PYSPARK_PYTHON=python
PYTHONPATH=C:\Spark\spark-3.1.1-bin-hadoop2.7\python
SPARK_HOME=C:\Spark\spark-3.1.1-bin-hadoop2.7
I think Python is finding the correct PySpark, because if I try from pyspark import this_does_not_exist I get ImportError: cannot import name 'this_does_not_exist' from 'pyspark' (C:\Spark\spark-3.1.1-bin-hadoop2.7\python\pyspark\__init__.py).
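(An equivalent, more direct way to make the same check:)
import sys
import pyspark

print(pyspark.__file__)  # expect C:\Spark\spark-3.1.1-bin-hadoop2.7\python\pyspark\__init__.py
print(sys.executable)    # the Python interpreter the VS Code session is actually running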
The folder I opened using "open folder" in VS Code is on a UNC path containing a space (i.e. \\blablabla\bla\my folder).
When I create a Spark session, it throws an error:
Unable to create a spark session
Using PySpark, here are the code snippet and the error:
ValueError Traceback (most recent call last)
<ipython-input-13-2262882856df> in <module>()
37 if __name__ == "__main__":
38 conf = SparkConf()
---> 39 sc = SparkContext(conf=conf)
40 # print(sc.version)
41 # sc = SparkContext(conf=conf)
~/anaconda3/lib/python3.5/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
131 " note this option will be removed in Spark 3.0")
132
--> 133 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
134 try:
135 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
~/anaconda3/lib/python3.5/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
330 " created by %s at %s:%s "
331 % (currentAppName, currentMaster,
--> 332 callsite.function, callsite.file, callsite.linenum))
333 else:
334 SparkContext._active_spark_context = instance
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local[*]) created by __init__ at <ipython-input-7-edf43bdce70a>:33
Imports:
from pyspark import SparkConf, SparkContext
I tried this alternative approach, which fails too:
spark = SparkSession(sc).builder.appName("Detecting-Malicious-URL App").getOrCreate()
This throws another error:
NameError: name 'SparkSession' is not defined
Try this -
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Detecting-Malicious-URL App").getOrCreate()
Before Spark 2.0 we had to create a SparkConf and SparkContext to interact with Spark.
In Spark 2.0, SparkSession is the entry point to Spark SQL. Now we don't need to create SparkConf, SparkContext, or SQLContext, as they're encapsulated within the SparkSession.
Please refer to this blog for more details: How to use SparkSession in Apache Spark 2.0
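For example, the old entry points are still reachable as attributes of the session (a small sketch):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Detecting-Malicious-URL App").getOrCreate()

# SparkContext and the runtime configuration are exposed by the session itself,
# so there is no need to construct a SparkConf/SparkContext by hand.
sc = spark.sparkContext
print(sc.master, sc.appName)
print(spark.conf.get("spark.app.name"))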
SparkContext is used to connect to the cluster through a resource manager.
SparkConf is required to create the SparkContext object; it stores configuration parameters like appName (to identify your Spark driver), the number of cores, and the memory size of the executors running on worker nodes. To use the SQL, Hive, or Streaming APIs, separate contexts needed to be created.
SparkSession, by contrast, provides a single point of entry to interact with the underlying Spark functionality and lets you program Spark with DataFrames and the Dataset API. You don't need to create a separate session to use SQL, Hive, etc.
To create a SparkSession you might use the following builder:
spark = SparkSession.builder \
    .master("local") \
    .appName("Detecting-Malicious-URL App") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
To overcome the error
NameError: name 'SparkSession' is not defined
you need to import it first:
from pyspark.sql import SparkSession
pyspark.sql provides SparkSession, which is used to create DataFrames, register DataFrames as tables, and so on.
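Continuing from the builder above, a minimal sketch (the column names and rows here are made up for illustration):
# Create a DataFrame and register it so it can also be queried with SQL.
df = spark.createDataFrame(
    [("http://example.com", 0), ("http://bad.example", 1)],
    ["url", "label"],
)
df.createOrReplaceTempView("urls")
spark.sql("SELECT url FROM urls WHERE label = 1").show()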
As for the other error you mentioned
(ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local[*]) created by __init__ at <ipython-input-7-edf43bdce70a>:33)
this might be helpful: ValueError: Cannot run multiple SparkContexts at once in spark with pyspark
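A minimal sketch of the usual workaround for that one, assuming a notebook that has already created a context:
from pyspark import SparkConf, SparkContext

# Reuse the context the notebook already created instead of constructing a second one.
sc = SparkContext.getOrCreate(SparkConf())

# Or, if a fresh context is really needed, stop the old one first:
# sc.stop()
# sc = SparkContext(conf=SparkConf())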
I faced a similar problem on Windows 10 and solved it the following way:
Go to the Conda prompt and run the following command:
# Install OpenJDK 11
conda install openjdk
Then run:
SparkSession.builder.appName('...').getOrCreate()
I am using nltk (installed and working fine in IDLE) for a project in Python 2.7.4 which runs perfectly fine in IDLE, but when I use the same code under XAMPP or WAMP (cgi-bin), everything related to nltk fails. This is the error shown after adding these lines:
import cgitb
cgitb.enable()
The error lines are marked with '=>' and the details are enclosed within ** and **. I've tried printing sys.path after importing os, and it shows the site-packages directory in which nltk resides. I even copied and pasted this folder into C:\Python27\, but I still get the same errors. Things other than nltk work as desired.
<type 'exceptions.ValueError'> Python 2.7.4: C:\python27\python.exe
Wed May 14 16:13:34 2014
A problem occurred in a Python script. Here is the sequence of function calls
leading up to the error, in the order they occurred.
C:\xampp\cgi-bin\Major\project.py in ()
=> 73 import nltk
**nltk undefined**
C:\python27\nltk\__init__.py in ()
159 import cluster; from cluster import *
160
=> 161 from downloader import download, download_shell
162 try:
163 import Tkinter
**downloader undefined, download undefined, download_shell undefined**
C:\python27\nltk\downloader.py in ()
2199
2200 # Aliases
=> 2201 _downloader = Downloader()
2202 download = _downloader.download
2203 def download_shell(): DownloaderShell(_downloader).run()
**_downloader undefined, Downloader = None**
C:\python27\nltk\downloader.py in __init__(self=<nltk.downloader.Downloader object>, server_index_url=None, download_dir=None)
425 # decide where we're going to save things to.
426 if self._download_dir is None:
=> 427 self._download_dir = self.default_download_dir()
428
429 #/////////////////////////////////////////////////////////////////
**self = <nltk.downloader.Downloader object>, self._download_dir = None, self.default_download_dir = <bound method Downloader.default_download_dir of <nltk.downloader.Downloader object>>**
C:\python27\nltk\downloader.py in default_download_dir(self=<nltk.downloader.Downloader object>)
926 homedir = os.path.expanduser('~/')
927 if homedir == '~/':
=> 928 raise ValueError("Could not find a default download directory")
929
930 # append "nltk_data" to the home directory
**builtin ValueError = <type 'exceptions.ValueError'>**
<type 'exceptions.ValueError'>: Could not find a default download directory
args = ('Could not find a default download directory',)
message = 'Could not find a default download directory'
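(For reference, a minimal check that can be dropped into the CGI script to see which environment it actually runs under. The os.path.expanduser('~/') call in the last frame above returns '~/' unchanged when none of HOME, USERPROFILE, or HOMEDRIVE/HOMEPATH are set, which is the assumption this sketch is meant to confirm; the interpreter path in the shebang follows the one shown in the traceback.)
#!C:/python27/python.exe
import os
import sys

print("Content-Type: text/plain\n")
print("interpreter: " + sys.executable)
# os.path.expanduser('~/') only expands when one of these is present:
for var in ("HOME", "USERPROFILE", "HOMEDRIVE", "HOMEPATH"):
    print(var + " = " + repr(os.environ.get(var)))
print("\n".join(sys.path))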
I'm trying to talk to supervisor over xmlrpc. Based on supervisorctl (especially this line), I have the following, which seems like it should work, and indeed it works, in so far as it connects enough to receive an error from the server:
#socketpath is the full path to the socket, which exists
# None and None are the default username and password in the supervisorctl options
In [12]: proxy = xmlrpclib.ServerProxy('http://127.0.0.1', transport=supervisor.xmlrpc.SupervisorTransport(None, None, serverurl='unix://'+socketpath))
In [13]: proxy.supervisor.getState()
Resulting in this error:
---------------------------------------------------------------------------
ProtocolError Traceback (most recent call last)
/home/marcintustin/webapps/django/oneclickcosvirt/oneclickcos/<ipython-input-13-646258924bc2> in <module>()
----> 1 proxy.supervisor.getState()
/usr/local/lib/python2.7/xmlrpclib.pyc in __call__(self, *args)
1222 return _Method(self.__send, "%s.%s" % (self.__name, name))
1223 def __call__(self, *args):
-> 1224 return self.__send(self.__name, args)
1225
1226 ##
/usr/local/lib/python2.7/xmlrpclib.pyc in __request(self, methodname, params)
1576 self.__handler,
1577 request,
-> 1578 verbose=self.__verbose
1579 )
1580
/home/marcintustin/webapps/django/oneclickcosvirt/lib/python2.7/site-packages/supervisor/xmlrpc.pyc in request(self, host, handler, request_body, verbose)
469 r.status,
470 r.reason,
--> 471 '' )
472 data = r.read()
473 p, u = self.getparser()
ProtocolError: <ProtocolError for 127.0.0.1/RPC2: 401 Unauthorized>
This is the unix_http_server section of supervisord.conf:
[unix_http_server]
file=/home/marcintustin/webapps/django/oneclickcosvirt/tmp/supervisor.sock ; (the path to the socket file)
;chmod=0700 ; socket file mode (default 0700)
;chown=nobody:nogroup ; socket file uid:gid owner
;username=user ; (default is no username (open server))
;password=123 ; (default is no password (open server))
So, there should be no authentication problems.
It seems like my code is in all material respects identical to the equivalent code from supervisorctl, but supervisorctl actually works. What am I doing wrong?
Your code looks substantially correct. I'm running Supervisor 3.0 with Python 2.7, and given the following:
import supervisor.xmlrpc
import xmlrpclib
p = xmlrpclib.ServerProxy('http://127.0.0.1',
transport=supervisor.xmlrpc.SupervisorTransport(
None, None,
'unix:///home/lars/lib/supervisor/tmp/supervisor.sock'))
print p.supervisor.getState()
I get:
{'statename': 'RUNNING', 'statecode': 1}
Are you certain that your running Supervisor instance is using the configuration file you think it is? If you run supervisord in debug mode, do you see the connection?
I don't use ServerProxy from xmlrpclib; I use the Server class instead, and I don't have to define any transports or paths to sockets. Not sure if your purposes require that, but here's a thin client I use fairly frequently. It's pretty much straight out of the docs.
python -c "import xmlrpclib;\
supervisor_client = xmlrpclib.Server('http://localhost:9001/RPC2');\
print( supervisor_client.supervisor.stopProcess(<some_proc_name>) )"
I faced the same issue; the problem was simple: supervisord was not running!
First:
supervisord
And then:
supervisorctl start all
Done! :)
If you've set nodaemon to true, you must keep the process running in another tab of your terminal.