PySpark - Using a custom function in a map transform

I run the following file (called test_func.py) with py.test:
import findspark
findspark.init()
from pyspark.context import SparkContext

def filtering(data):
    return data.map(lambda p: modif(p)).count()

def modif(row):
    row.split(",")

class Test(object):
    sc = SparkContext('local[1]')

    def test_filtering(self):
        data = self.sc.parallelize(['1', '2', ''])
        assert filtering(data) == 2
And, because the modif function is used inside the map transform, it fails with the following error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named clustering.test_func
PySpark does not manage to find the modif function. Note that the file test_func.py is in the clustering directory and that I run py.test from inside that directory.
What astonishes me is that if I use the modif function outside of a map, it works fine. For instance, if I do: modif(data.first())
Any idea why I get such importErrors and how I could fix it?
EDIT
I had already tested what is proposed in Avihoo Mamka's answer, i.e. adding test_func.py to the files copied to all workers. However, it had no effect. That is no surprise to me, because I think the main file, from which the Spark application is created, is always sent to all workers.
I think it may come from the fact that PySpark is looking for clustering.test_func and not test_func.

The key here is the Traceback you got.
PySpark is telling you that the worker process doesn't have access to clustering.test_func.py. When you initialize the SparkContext, you can pass a list of files that should be copied to the workers:
sc = SparkContext("local[1]", "App Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])
More information: https://spark.apache.org/docs/1.5.2/programming-guide.html
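If shipping the main file itself does not help (as noted in the question's edit), a common workaround is to keep the functions used inside map in a separate module and ship that module to the workers. A sketch only, using a hypothetical helpers.py that is not part of the original project:
# helpers.py -- hypothetical helper module kept next to test_func.py
def modif(row):
    return row.split(",")

# test_func.py -- ship the helper module to the workers
import findspark
findspark.init()
from pyspark.context import SparkContext
from helpers import modif

sc = SparkContext('local[1]', 'tests', pyFiles=['helpers.py'])
# or, once a context already exists:
# sc.addPyFile('helpers.py')

def filtering(data):
    return data.map(modif).count()
Because the mapped function now lives in a plain module rather than in the test file itself, the workers can resolve it by that module name instead of looking for clustering.test_func.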

Related

servable panel app ModuleNotFoundError on included local module

I'm having issues deploying my Panel app to be served up as HTML.
Following instructions at https://panel.holoviz.org/user_guide/Running_in_Webassembly.html, with script.py as
import panel as pn
from mymodule import MyObj

pn.extension('terminal', 'tabulator', sizing_mode="stretch_width")

gspec = pn.GridSpec(sizing_mode='stretch_width', max_height=100, ncols=3)

PI = MyObj()
gspec[0, 0:1] = PI.tabs
gspec[0, 2] = PI.view

pn.Column(
    "Text Text Text"
    , gspec
).servable()
and mymodule/__init__.py:
__version__ = '0.1.0'
from .my_obj import MyObj
and mymodule/my_obj.py:
from .param_classes import ClassA
from .param_classes import ClassB
# etc
from .compute_obj import ComputeObj
class MyObj(object):
    # all the details of the panel build, calling in turn
    # param classes detailed in another file, and also calling another module
    # to handle all the computation behind the panel
panel serve script.py --autoreload works perfectly, but
$ panel convert script.py --to pyodide-worker --out pyodide
$ python3 -m http.server
does not work. At http://localhost:8000/pyodide/script.html I get a big ModuleNotFoundError: no module named 'mymodule' display, an endlessly spinning loading graphic, and the following in the Developer Tools Console output:
pyodide.asm.js:10 Uncaught (in promise) PythonError: Traceback (most recent call last):
  File "/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/lib/python3.10/site-packages/_pyodide/_base.py", line 506, in eval_code_async
    await CodeRunner(
  File "/lib/python3.10/site-packages/_pyodide/_base.py", line 359, in run_async
    await coroutine
  File "<exec>", line 11, in <module>
ModuleNotFoundError: No module named 'mymodule'
at new_error (pyodide.asm.js:10:218123)
at pyodide.asm.wasm:0xdef7c
at pyodide.asm.wasm:0xe37ae
at method_call_trampoline (pyodide.asm.js:10:218037)
at pyodide.asm.wasm:0x126317
at pyodide.asm.wasm:0x1f6f2e
at pyodide.asm.wasm:0x161a32
at pyodide.asm.wasm:0x126827
at pyodide.asm.wasm:0x126921
at pyodide.asm.wasm:0x1269c4
at pyodide.asm.wasm:0x1e0697
at pyodide.asm.wasm:0x1da6a5
at pyodide.asm.wasm:0x126a07
at pyodide.asm.wasm:0x1e248c
at pyodide.asm.wasm:0x1e00d9
at pyodide.asm.wasm:0x1da6a5
at pyodide.asm.wasm:0x126a07
at pyodide.asm.wasm:0xe347a
at Module.callPyObjectKwargs (pyodide.asm.js:10:119064)
at Module.callPyObject (pyodide.asm.js:10:119442)
at wrapper (pyodide.asm.js:10:183746)
I should add that I'm using Poetry to manage packages and build the venv, and I'm running all of the above from within the activated .venv (via poetry shell).
I've tried all the tips and tricks around appending the local path to sys.path. Looking at the .js file that the convert utility generates, I gather it would work if all the code were in one file, but forcing bad coding practice doesn't sound right.
I imagine there could be some kind of C++ build-style --include argument to panel convert, but there are no man pages, and the closest I can get with the online documentation is --requirements ./mymodule, but no joy.
Could anybody please advise?

"Module not found error" while trying to call a function that is in another directory

I am trying to call a function, yamlCreation(), which is in another directory, from this Python file.
The calling file is kubectl.py, at path bss_micro_svcs/bss_micro_svcs/kubectl.py:
from bss_micro_svcs.kubectl_utils.templating_to_yaml import yamlCreation

class Kube(Svcs):
    def startService(self, cluster_info, service, service_profile):
        yamlCreation()

k = Kube()
k.startService()
The yamlCreation() function that I want to call is in the file templating_to_yaml.py, at path bss_micro_svcs/kubectl_utils/templating_to_yaml.py:
file_loader = FileSystemLoader('bss_micro_svcs/bss_micro_svcs/yamls/templates')
env = Environment(loader=file_loader)
#deployment.yaml templating
deployment_template=env.get_template('deployment.tmpl')
deployment_output=deployment_template.render(deployment_name='nginx-deployment',deployment_app='nginx',deployment_replicas='3',deployment_container_name='nginx',deployment_image='nginx:1.14.2',deployment_container_port=80)
deployment_outfile=open('bss_micro_svcs/bss_micro_svcs/yamls/services/integration/deployment.yaml','w')
deployment_outfile.write(deployment_output)
deployment_outfile.close()
#*********************************************************************************************************************************************************#
#service.yaml templating
service_template=env.get_template('services.tmpl')
service_output=service_template.render(service_name='integration-service',service_name_space='demo-sandbox',service_app='integration',service_port=8090,target_port=8090)
service_outfile=open('bss_micro_svcs/bss_micro_svcs/yamls/services/integration/services.yaml','w')
service_outfile.write(service_output)
service_outfile.close()
I am getting the following error:
┌──(bss_micro_svcs-jnO67Li7)(RK㉿kali)-[~/Desktop/servicelaunchmgr]
└─$ /home/RK/.local/share/virtualenvs/bss_micro_svcs-jnO67Li7/bin/python /home/RK/Desktop/servicelaunchmgr/bss_micro_svcs/bss_micro_svcs/kubectl.py
Traceback (most recent call last):
  File "/home/RK/Desktop/servicelaunchmgr/bss_micro_svcs/bss_micro_svcs/kubectl.py", line 2, in <module>
    from bss_micro_svcs.kubectl_utils.templating_to_yaml import yamlCreation
ModuleNotFoundError: No module named 'bss_micro_svcs'
You could set the environment variable PYTHONPATH so the interpreter knows where to start looking for the module to import. I would suggest setting it to the root of the project.
Reference: https://bic-berkeley.github.io/psych-214-fall-2016/using_pythonpath.html
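For example, a sketch only (the directory is guessed from the traceback in the question; point it at whatever directory actually contains the top-level bss_micro_svcs package from the import statement):
# Make the top-level package importable, then run the script as before.
export PYTHONPATH=/home/RK/Desktop/servicelaunchmgr/bss_micro_svcs
python /home/RK/Desktop/servicelaunchmgr/bss_micro_svcs/bss_micro_svcs/kubectl.py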

No output operations registered, so nothing to execute in PySpark

I am trying to integrate Spark with Kafka. The Kafka consumer has JSON data, and I want to integrate the consumer with Spark for processing. When I run the code below, the following error is thrown.
bin\spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 test.py localhost:9092 maktest
My test.py is below
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    print(lines)
    ssc.start()
    ssc.awaitTermination()
I got the error below
18/12/10 16:41:40 INFO VerifiableProperties: Verifying properties
18/12/10 16:41:40 INFO VerifiableProperties: Property group.id is overridden to
18/12/10 16:41:40 INFO VerifiableProperties: Property zookeeper.connect is overridden to
<pyspark.streaming.kafka.KafkaTransformedDStream object at 0x000002A6DA9FE6A0>
18/12/10 16:41:40 ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
    at scala.Predef$.require(Predef.scala:224)
Traceback (most recent call last):
  File "C:/Users/maws/Desktop/spark-2.2.1-bin-hadoop2.7/test.py", line 12, in <module>
    ssc.start()
py4j.protocol.Py4JJavaError: An error occurred while calling o25.start.
: java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
You're not using a supported Spark Streaming DStream output operation.
For the pyspark API, you should use:
pprint()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
print() can't be used with the PySpark API, so when you follow Spark Streaming examples written for Scala or Java, make sure you change print() to pprint().
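For example, a minimal sketch of the change to the test.py above: register pprint() as an output operation before starting the context.
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])

    # pprint() registers a DStream output operation, so start() no longer fails
    # with "No output operations registered, so nothing to execute".
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()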

zeppelin pyspark how to connect remote spark?

My Zeppelin is using local Spark now.
I got ValueError: Cannot run multiple SparkContexts at once when I tried to create a remote SparkContext.
Following multiple SparkContexts error in tutorial, I wrote the code below:
from pyspark import SparkConf, SparkContext
sc.stop()
conf = SparkConf().setAppName('train_etl').setMaster('spark://xxxx:7077')
sc = SparkContext(conf=conf)
I got another error:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-6681108227268089746.py", line 363, in <module>
    sc.setJobGroup(jobGroup, jobDesc)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 944, in setJobGroup
    self._jsc.setJobGroup(groupId, description, interruptOnCancel)
AttributeError: 'NoneType' object has no attribute 'setJobGroup'
What should I do?
By default, Spark automatically creates the SparkContext object named sc when the PySpark application starts. You have to use the line below in your code:
sc = SparkContext.getOrCreate()
getOrCreate() returns the existing singleton SparkContext if one is already active, or creates a new one otherwise, so the context created by Zeppelin is shared instead of a second one being constructed (which is what raises the "Cannot run multiple SparkContexts at once" error).
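A sketch of how this looks with the configuration from the question (note that if Zeppelin has already created a context, getOrCreate() hands back that existing context rather than building a new one against the remote master):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('train_etl').setMaster('spark://xxxx:7077')
# Reuses the existing SparkContext if there is one, otherwise creates a new
# context from this configuration.
sc = SparkContext.getOrCreate(conf=conf)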
Alternatively, open http://zeppelin_host:zeppelin_port/#/interpreter and set the master parameter of the Spark interpreter (which is used for PySpark) to spark://xxxx:7077.

Unittesting a file by calling it from another file in Python

I am new to the unittest module. I have a file that contains unit tests. The file is something like...
File1.py
import unittest

class ABC(unittest.TestCase):
    def setUp(self):
        # Do some work here

    def test_123(self, a, b, c):
        # Do some work here

if __name__ == "__main__":
    unittest.main()
Now I am calling this file from another file, passing values to the function test_123, but Python displays the following error. Could anybody please help?
Traceback (most recent call last):
  File "caller_file.py", line 20, in <module>
    r = file1.ABC()
  File "/usr/lib/python2.7/unittest/case.py", line 191, in __init__
    (self.__class__, methodName))
ValueError: no such test method in <class 'file1.ABC'>: runTest
You can run the file1.ABC test case like this:
import unittest
import file1
suite = unittest.TestLoader().loadTestsFromTestCase(file1.ABC)
unittest.TextTestRunner(verbosity=2).run(suite)
Also, the setUp and test_123 methods need the self argument, and self should be their sole argument (unittest will not pass a, b and c for you).
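If you specifically want to run a single test from the calling file, another sketch (assuming test_123 has been rewritten to take only self) is to pass the method name to the TestCase constructor, which avoids the default runTest lookup that raised the ValueError:
import unittest
import file1

# Naming the method explicitly avoids the
# "no such test method in <class 'file1.ABC'>: runTest" error.
case = file1.ABC('test_123')
unittest.TextTestRunner(verbosity=2).run(case)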
I ran into similar problems with my unit tests because of missing entries in the module search path.
I solved it by creating:
import os
import subprocess

my_env = os.environ.copy()
if 'PYTHONPATH' not in my_env:
    my_env['PYTHONPATH'] = ''
my_env['PYTHONPATH'] += ';' + ';'.join(
    [os.path.abspath('.'),
     os.path.abspath('..'),
     os.path.abspath('..\\..')])
and then calling the file with:
_ = subprocess.check_output(filepath, shell=True, env=my_env)
I just added the current path (and its parents) to the environment because the calling file is in the same directory tree. Maybe you have to adjust that.
