VS Code: "Open Folder" makes my PySpark stop working?

I am using Anaconda on Windows 10 x64. I'm using VS Code. Recently I successfully ran the following:
import pyspark
from pyspark.sql import SparkSession
Then I clicked File > Open Folder and opened a folder, so it now appears in a pane on the left-hand side of my screen. Question #1: What does this do? I used to think that it was just a quick way to see some frequently used files.
Now that my folder is in the left-hand pane, the above code errors out (see below) with an error that includes the phrase "Python worker failed to connect back." Question #2: Why does this happen?
Question #3: If I want to be able to avoid the above error while also having a folder open in VS Code, what should I do? Any ideas about what settings I should look at?
If I close the folder, my code works again.
The Error: Every time it gives me a slightly different error, but it always begins with something like this:
Py4JJavaError Traceback (most recent call last)
\\pathtomyfile\temp.py in
----> 240 spark.createDataFrame(pandas_df).toDF(*columns).show()
C:\Spark\spark-3.1.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in show(self, n, truncate, vertical)
482 """
483 if isinstance(truncate, bool) and truncate:
--> 484 print(self._jdf.showString(n, 20, vertical))
485 else:
486 print(self._jdf.showString(n, int(truncate), vertical))
C:\Spark\spark-3.1.1-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
C:\Spark\spark-3.1.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
C:\Spark\spark-3.1.1-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o77.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (blablabla executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
<a bunch more stuff after that>
(It's always "Task 0 in stage x failed y times", but x isn't always 3 and y isn't always 1.)
Other Results: I get a similar error if I run the following:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
pandas_df = pd.DataFrame(np.random.randint(0,100, size=(5,5)))
columns = ["col"+str(i) for i in range(5)]
I don't get an error if I run the following:
from pyspark.sql import SparkSession
rocks = spark.read.format("csv").option("header", "true").load("C:\\rocksamples.csv")
Also, none of the above three code blocks (I, II, III) gives an error if I run it from an .ipynb file inside the folder I opened.
Background: I have the following files and folders on my machine:
Anaconda directory: C:\users\me\Anaconda3
Spark directory: C:\Spark\spark-3.1.1-bin-hadoop2.7
Java directory: C:\Spark\java\jre1.8.0_231
Nothing called pyspark in C:\users\me\Anaconda3\Lib\site-packages
a .csv file at C:\rocksamples.csv
Winutils file in C:\Hadoop\bin\winutils.exe
My current environment variables (which I would like to clean up but now I'm afraid to) include the following:
Path=(other stuff);C:\Spark\spark-3.1.1-bin-hadoop2.7\bin;C:\Spark\java\jre1.8.0_231\bin
I think Python is finding the correct PySpark, because if I try from pyspark import this_does_not_exist I get ImportError: cannot import name 'this_does_not_exist' from 'pyspark' (C:\Spark\spark-3.1.1-bin-hadoop2.7\python\pyspark\__init__.py) .
The folder I opened using "open folder" in VS Code is on a UNC path containing a space (i.e. \\blablabla\bla\my folder).
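One way to narrow this down might be to capture the same facts in both situations (folder open vs. folder closed) and compare them. The sketch below is only a diagnostic aid, not a fix: the helper names environment_snapshot and diff_snapshots are made up for illustration, and the list of variables is a guess at what could be relevant rather than a known cause.

```python
import os
import sys


def environment_snapshot(env=None):
    """Collect settings worth comparing between the working and failing runs."""
    env = os.environ if env is None else env
    # Assumption: these are plausible suspects, not a confirmed list.
    names = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME", "PYSPARK_PYTHON", "PYTHONPATH")
    snapshot = {"cwd": os.getcwd(), "executable": sys.executable}
    for name in names:
        snapshot[name] = env.get(name)
    return snapshot


def diff_snapshots(before, after):
    """Return {key: (before_value, after_value)} for every key whose value changed."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}
```

Run environment_snapshot() once with the folder closed and once with it open, then pass both results to diff_snapshots() to see what actually changed (for example the working directory moving to the UNC path).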


Spark-nlp Pretrained-model not loading in windows

I am trying to install pretrained pipelines in spark-nlp in windows 10 with python.
The following is the code I have tried so far in the Jupyter notebook in the local system:
! java -version
# should be Java 8 (Oracle or OpenJDK)
! conda create -n sparknlp python=3.7 -y
! conda activate sparknlp
! pip install --user spark-nlp==2.6.4 pyspark==2.4.5
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
# Start Spark Session with Spark NLP
# The start() function has two parameters: gpu and spark23
# sparknlp.start(gpu=True) will start the session with GPU support
# sparknlp.start(spark23=True) is for when you have Apache Spark 2.3.x installed
spark = sparknlp.start()
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_ml', lang='en')
I am getting the following error:
explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
Py4JJavaError Traceback (most recent call last)
~\AppData\Roaming\Python\Python37\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~\Anaconda3\envs\py37\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline.
: java.lang.IllegalArgumentException: requirement failed: Was not found appropriate resource to download for request: ResourceRequest(explain_document_ml,Some(en),public/models,2.6.4,2.4.4) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@2570f26e
at scala.Predef$.require(Predef.scala:224)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadResource(ResourceDownloader.scala:345)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:376)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:371)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadPipeline(ResourceDownloader.scala:474)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-2-d18238e76d9f> in <module>
12 # Download a pre-trained pipeline
---> 13 pipeline = PretrainedPipeline('explain_document_ml', lang='en')
~\Anaconda3\envs\py37\lib\site-packages\sparknlp\pretrained.py in __init__(self, name, lang, remote_loc, parse_embeddings, disk_location)
89 def __init__(self, name, lang='en', remote_loc=None, parse_embeddings=False, disk_location=None):
90 if not disk_location:
---> 91 self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
92 else:
93 self.model = PipelineModel.load(disk_location)
~\Anaconda3\envs\py37\lib\site-packages\sparknlp\pretrained.py in downloadPipeline(name, language, remote_loc)
58 t1.start()
59 try:
---> 60 j_obj = _internal._DownloadPipeline(name, language, remote_loc).apply()
61 jmodel = PipelineModel._from_java(j_obj)
62 finally:
~\Anaconda3\envs\py37\lib\site-packages\sparknlp\internal.py in __init__(self, name, language, remote_loc)
179 class _DownloadPipeline(ExtendedJavaWrapper):
180 def __init__(self, name, language, remote_loc):
--> 181 super(_DownloadPipeline, self).__init__("com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline", name, language, remote_loc)
~\Anaconda3\envs\py37\lib\site-packages\sparknlp\internal.py in __init__(self, java_obj, *args)
127 super(ExtendedJavaWrapper, self).__init__(java_obj)
128 self.sc = SparkContext._active_spark_context
--> 129 self._java_obj = self.new_java_obj(java_obj, *args)
130 self.java_obj = self._java_obj
~\Anaconda3\envs\py37\lib\site-packages\sparknlp\internal.py in new_java_obj(self, java_class, *args)
138 def new_java_obj(self, java_class, *args):
--> 139 return self._new_java_obj(java_class, *args)
141 def new_java_array(self, pylist, java_class):
~\AppData\Roaming\Python\Python37\site-packages\pyspark\ml\wrapper.py in _new_java_obj(java_class, *args)
65 java_obj = getattr(java_obj, name)
66 java_args = [_py2java(sc, arg) for arg in args]
---> 67 return java_obj(*java_args)
69 @staticmethod
~\Anaconda3\envs\py37\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1259 for temp_arg in temp_args:
~\AppData\Roaming\Python\Python37\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Was not found appropriate resource to download for request: ResourceRequest(explain_document_ml,Some(en),public/models,2.6.4,2.4.4) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@2570f26e'
This is a common issue with Apache Spark and Spark NLP when Java, Spark, or Hadoop is not set up correctly on Windows.
Follow these steps carefully to avoid the common issues, including failed pretrained() downloads:
Download OpenJDK from here: https://adoptopenjdk.net/?variant=openjdk8&jvmVariant=hotspot
Make sure it is 64-bit
Make sure you install it in the root, e.g. C:\java. Windows doesn't like spaces in the path.
During installation, after changing the path, select the option that sets the Path
Download winutils and put it in C:\hadoop\bin https://github.com/cdarlint/winutils/blob/master/hadoop-2.7.3/bin/winutils.exe
Download Anaconda 3.6 from Archive, I didn't like the new 3.8 (Apache Spark 2.4.x only works with Python 3.6 and 3.7): https://repo.anaconda.com/archive/Anaconda3-2020.02-Windows-x86_64.exe
Download Apache Spark 2.4.6 and extract it in C:\spark\
Set the env for HADOOP_HOME to C:\hadoop and SPARK_HOME to C:\spark
Set Paths for %HADOOP_HOME%\bin and %SPARK_HOME%\bin
Install C++ (again the 64 bit) https://www.microsoft.com/en-us/download/confirmation.aspx?id=14632
Create C:\temp and C:\temp\hive
Fix permissions:
C:\Users\maz>%HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
C:\Users\maz>%HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/
Either create a conda env for python 3.6, install pyspark==2.4.6 spark-nlp numpy and use Jupyter/python console, or in the same conda env you can go to spark bin for pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.5.
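Before retrying the download, it may help to sanity-check the environment variables from the steps above. This is a minimal sketch, assuming HADOOP_HOME and SPARK_HOME are set as described; the function name check_spark_env is hypothetical, and it only looks for the obvious mistakes (unset variable, space in the path, missing winutils.exe).

```python
import os


def check_spark_env(env, path_exists=os.path.exists):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    for var in ("HADOOP_HOME", "SPARK_HOME"):
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
        elif " " in value:
            problems.append(f"{var} contains a space: {value!r}")
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        # The steps above place winutils.exe under %HADOOP_HOME%\bin.
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not path_exists(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env(os.environ):
        print(problem)
```

An empty result does not prove the setup is correct; it only rules out these particular mistakes.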

tensorflow: "trying to mutate a frozen object", bazel

MacOS high sierra, MBP 2016, in terminal.
I'm following the directions here:
All options for ./configure chosen as default (and all python directories double-checked.). All steps have completed cleanly until this:
bazel test ...
# On Mac, run the following:
bazel test --linkopt=-headerpad_max_install_names \
dragnn/... syntaxnet/... util/utf8/...
I assume I'm supposed to run the latter line ("bazel test --linkopt" etc.). But I get the same result either way, interestingly.
This throws about 10 errors, each of the same type "trying to mutate a frozen object", and concludes tests not run, error loading package dragnn/protos, and couldn't start build.
This is the general form of the errors:
syntaxnet>> bazel test --linkopt=-headerpad_max_install_names
dragnn/... syntaxnet/... util/utf8/...
Traceback (most recent call last): File
line 35 tf_proto_library_py(name = "data_py_pb2", srcs = ["dat..."])
line 53, in tf_proto_library_py py_proto_library(name = name, srcs =
srcs, srcs_versi...", <5 more arguments>) File
line 374, in py_proto_library py_libs += [default_runtime] trying to
mutate a frozen object ERROR: package contains errors: dragnn/protos
... [same error for various 'name = "...pb2"' files] ...
INFO: Elapsed time: 0.709s FAILED: Build did NOT complete successfully
(17 packages loaded) ERROR: Couldn't start the build. Unable to run
Any idea what could be doing this? Thanks.
This error indicates a bug in the py_proto_library rule implementation.
tf_proto_library_py is defined in syntaxnet.bzl. It is a wrapper around py_proto_library, which is defined by the tf_workspace macro's protobuf_archive rule.
"protobuf_archive" downloads Protobuf 3.3.0, which contains //:protobuf.bzl with the buggy py_proto_library rule implementation: in line #374 it tries to mutate an immutable object py_libs.
Make sure you use the latest Bazel version, currently that's 0.8.1.
If the problem still persists, then:
I suggest filing a bug with:
Protobuf, to fix the py_proto_library rule
TensorFlow, to update their Protobuf version in tf_workspace, and
Syntaxnet to update their TF submodule reference in //research/syntaxnet to the bugfixed version.
As a workaround, perhaps you can patch protobuf.bzl.
The patch is to change these lines:
373 if default_runtime and not default_runtime in py_libs + deps:
374 py_libs += [default_runtime]
376 native.py_library(
377 name=name,
378 srcs=outs+py_extra_srcs,
379 deps=py_libs+deps,
380 imports=includes,
381 **kargs)
to these:
373 if default_runtime and not default_runtime in py_libs + deps:
374 py_libs2 = py_libs + [default_runtime]
375 else:
376 py_libs2 = py_libs
378 native.py_library(
379 name=name,
380 srcs=outs+py_extra_srcs,
381 deps=py_libs2+deps,
382 imports=includes,
383 **kargs)
Disclaimer: this is a "blind" fix; I have not tried whether it works.
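To illustrate why the patch avoids the error, here is a plain-Python analogy (not Starlark, and not the real protobuf.bzl). FrozenList is a hypothetical stand-in for a frozen list that refuses in-place mutation, the label "//:protobuf_python" is made up, and the membership test is simplified to py_libs only.

```python
class FrozenList(list):
    """Hypothetical stand-in for a frozen list: in-place `+=` raises."""

    def __iadd__(self, other):
        raise TypeError("trying to mutate a frozen object")


def with_runtime_buggy(py_libs, default_runtime):
    # Mirrors the original pattern: `+=` tries to mutate py_libs in place.
    if default_runtime and default_runtime not in py_libs:
        py_libs += [default_runtime]
    return py_libs


def with_runtime_fixed(py_libs, default_runtime):
    # Mirrors the patch: build a new list and leave py_libs untouched.
    if default_runtime and default_runtime not in py_libs:
        py_libs2 = py_libs + [default_runtime]
    else:
        py_libs2 = py_libs
    return py_libs2
```

Calling with_runtime_buggy on a FrozenList raises, while with_runtime_fixed returns a new list and leaves the argument unchanged.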
I tried the same patch pattern for cc_libs:
if default_runtime and not default_runtime in cc_libs:
    cc_libs2 = cc_libs + [default_runtime]
else:
    cc_libs2 = cc_libs
if use_grpc_plugin:
    cc_libs += ["//external:grpc_lib"]
deps=cc_libs2 + deps,
This shows a new error but keeps compiling. (Ubuntu 16 on Windows Subsystem for Linux; don't ask. Native TensorFlow 1.4 on Windows x64 works, but not SyntaxNet.)
greg@FX11:/mnt/c/code/models/research/syntaxnet$ bazel test ...
ERROR: /home/greg/.cache/bazel/_bazel_greg/adb8eb0eab8b9680449366fbebe59ec2/external/org_tensorflow/tensorflow/core/kernels/BUILD:451:1: in _transitive_hdrs rule @org_tensorflow//tensorflow/core/kernels:bounds_check_lib_gather:
Traceback (most recent call last):
File "/home/greg/.cache/bazel/_bazel_greg/adb8eb0eab8b9680449366fbebe59ec2/external/org_tensorflow/tensorflow/core/kernels/BUILD", line 451
_transitive_hdrs(name = 'bounds_check_lib_gather')
File "/home/greg/.cache/bazel/_bazel_greg/adb8eb0eab8b9680449366fbebe59ec2/external/org_tensorflow/tensorflow/tensorflow.bzl", line 869, in _transitive_hdrs_impl
Just changed set() to depset() and that seems to have avoided the error.
To make a long story short, I was inspired by sstrasburg's comment.
First, uninstall the newer version of bazel.
brew uninstall bazel
Download bazel 0.5.4 from here.
chmod +x bazel-0.5.4-without-jdk-installer-darwin-x86_64.sh
After that, again run
bazel test --linkopt=-headerpad_max_install_names dragnn/... syntaxnet/... util/utf8/...
Finally, I got
Executed 57 out of 57 tests: 57 tests pass.

Unable to import matplotlib.animation

I'm new to both Python and OS X, so if I'm not understanding super basic stuff, please forgive me.
I'm using Python 2.7.12 on a fresh install from Homebrew. I also used Homebrew to install ipython, ffmpeg and libav (which installs avconv, which I believe is required for what I'm trying to do).
I've used pip to install Scipy, numpy (which I think comes with scipy anyway?) and matplotlib
I'm running El Capitan v10.11.6
Background (for some context):
I'm running some hydrodynamic simulations that output a bunch of binary files. I want to stitch them together to create a movie. Lucky for me, one of my colleagues has already written a tidy little Python script to do so (which he wrote in ipython).
When trying to run
import matplotlib.animation
The script just hangs, and matplotlib.animation never gets imported. I've tried changing the backend via
import matplotlib
import matplotlib.animation
I've tried various backends that I got by running the code from "List of all available matplotlib backends".
I have also tried the suggestions in "import matplotlib.pyplot hangs (updating fc-lists)".
Lastly, and I'm not sure if this is helpful, leaving ipython trying to import matplotlib.animation for about 10 minutes and then terminating it outputs the following:
In [3]: import matplotlib.animation
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-3-64e90e455a86> in <module>()
----> 1 import matplotlib.animation
/usr/local/lib/python2.7/site-packages/matplotlib/animation.py in <module>()
590 #writers.register('imagemagick')
--> 591 class ImageMagickWriter(MovieWriter, ImageMagickBase):
592 def _args(self):
593 return ([self.bin_path(),
/usr/local/lib/python2.7/site-packages/matplotlib/animation.py in wrapper(writerClass)
73 def register(self, name):
74 def wrapper(writerClass):
---> 75 if writerClass.isAvailable():
76 self.avail[name] = writerClass
77 return writerClass
/usr/local/lib/python2.7/site-packages/matplotlib/animation.py in isAvailable(cls)
284 stderr=subprocess.PIPE,
285 creationflags=subprocess_creation_flags)
--> 286 p.communicate()
287 return True
288 except OSError:
/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.pyc in communicate(self, input)
798 return (stdout, stderr)
--> 800 return self._communicate(input)
/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.pyc in _communicate(self, input)
1417 stdout, stderr = self._communicate_with_poll(input)
1418 else:
-> 1419 stdout, stderr = self._communicate_with_select(input)
1421 # All data exchanged. Translate lists into strings.
/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.pyc in _communicate_with_select(self, input)
1518 while read_set or write_set:
1519 try:
-> 1520 rlist, wlist, xlist = select.select(read_set, write_set, [])
1521 except select.error, e:
1522 if e.args[0] == errno.EINTR:
If you give any of this a second thought, even if you aren't able to help, thank you very much!
Do you have more than one version of Python installed? I would check your Python path and make sure matplotlib is installed for 2.7 in this case.
This might be relevant too: "import matplotlib.pyplot hangs"
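A quick way to check which interpreter is running and where matplotlib would be loaded from, without actually importing it (so the check itself cannot hang), might look like the sketch below. Note the assumptions: it uses importlib.util.find_spec, which is Python 3 only, so on the 2.7 install in the question something like imp.find_module would be needed instead; locate_module is a made-up helper name.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("matplotlib :", locate_module("matplotlib"))
```

If the reported path is not under the site-packages of the interpreter you expect, the two installs are probably getting mixed up.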

PySpark's addPyFile method makes SparkContext None

I have been trying to do this. In PySpark shell, I get SparkContext as sc. But when I use addPyFile method, it makes the resulting SparkContext None:
>>> sc2 = sc.addPyFile("/home/ec2-user/redis.zip")
>>> sc2 is None
What's wrong?
Below is the source code to pyspark's (v1.1.1) addPyFile. (The source links for 1.4.1 in the official pyspark docs are broken as I'm writing this)
It returns None because there is no return statement. See also: "In Python, if a function doesn't have a return statement, what does it return?"
So if you do sc2 = sc.addPyFile("mymodule.py"), of course sc2 will be None, because .addPyFile() does not return anything!
Instead, simply call sc.addPyFile("mymodule.py") and keep using sc as the SparkContext.
def addPyFile(self, path):
    """
    Add a .py or .zip dependency for all tasks to be executed on this
    SparkContext in the future. The C{path} passed can be either a local
    file, a file in HDFS (or other Hadoop-supported filesystems), or an
    HTTP, HTTPS or FTP URI.
    """
    self.addFile(path)
    (dirname, filename) = os.path.split(path)  # dirname may be directory or HDFS/S3 prefix

    if filename.endswith('.zip') or filename.endswith('.ZIP') or filename.endswith('.egg'):
        self._python_includes.append(filename)
        # for tests in local mode
        sys.path.append(os.path.join(SparkFiles.getRootDirectory(), filename))
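The same behaviour can be reproduced without Spark. In this minimal sketch, FakeContext and add_py_file are hypothetical stand-ins (not the PySpark API); the point is only that a method which mutates its object and has no return statement evaluates to None.

```python
class FakeContext:
    """Hypothetical stand-in for SparkContext; not the real PySpark class."""

    def __init__(self):
        self.py_files = []

    def add_py_file(self, path):
        # Mutates the context in place and has no return statement,
        # so the call expression evaluates to None.
        self.py_files.append(path)


sc = FakeContext()
sc2 = sc.add_py_file("mymodule.py")  # sc2 is None; sc itself now holds the file
```

So the object to keep using is sc, not the value of the call.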

import nltk error in xampp, or on wamp

I am using nltk for a project in Python (2.7.4). It is installed and runs perfectly fine in IDLE, but when I run the same code in xampp or wamp (cgi-bin), everything related to nltk fails. This is the error shown after adding these lines:
import cgitb
The lines with errors are marked by '=>' and details are enclosed within ** and **. I've tried printing sys.path after importing os, which shows the site-packages directory in which nltk resides. I even copied and pasted this folder into 'C:\Python27\', but I still get the same errors. Things other than nltk work as desired.
<type 'exceptions.ValueError'> Python 2.7.4: C:\python27\python.exe
Wed May 14 16:13:34 2014
A problem occurred in a Python script. Here is the sequence of function calls
leading up to the error, in the order they occurred.
C:\xampp\cgi-bin\Major\project.py in ()
=> 73 import nltk
**nltk undefined**
C:\python27\nltk\__init__.py in ()
159 import cluster; from cluster import *
=> 161 from downloader import download, download_shell
162 try:
163 import Tkinter
**downloader undefined, download undefined, download_shell undefined**
C:\python27\nltk\downloader.py in ()
2200 # Aliases
=> 2201 _downloader = Downloader()
2202 download = _downloader.download
2203 def download_shell(): DownloaderShell(_downloader).run()
**_downloader undefined, Downloader = None**
C:\python27\nltk\downloader.py in __init__(self=<nltk.downloader.Downloader object>, server_index_url=None, download_dir=None)
425 # decide where we're going to save things to.
426 if self._download_dir is None:
=> 427 self._download_dir = self.default_download_dir()
429 #/////////////////////////////////////////////////////////////////
**self = <nltk.downloader.Downloader object>, self._download_dir = None,
self.default_download_dir = <bound method Downloader.default_download_dir of
<nltk.downloader.Downloader object>>**
C:\python27\nltk\downloader.py in default_download_dir(self=<nltk.downloader.Downloader object>)
926 homedir = os.path.expanduser('~/')
927 if homedir == '~/':
=> 928 raise ValueError("Could not find a default download directory")
930 # append "nltk_data" to the home directory
**builtin ValueError = <type 'exceptions.ValueError'>**
<type 'exceptions.ValueError'>: Could not find a default download directory
args = ('Could not find a default download directory',)
message = 'Could not find a default download directory'
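The traceback shows os.path.expanduser('~/') returning '~/' unchanged, which suggests the CGI process may not see the environment variables normally used to find a home directory. The sketch below is a diagnostic only, under that assumption; the helper names and the exact list of variables are my guess at what matters for Python on Windows, not something taken from the nltk source shown above.

```python
import os

# Assumption: these are the variables that matter for locating a
# home or per-user data directory on Windows.
HOME_VARS = ("APPDATA", "HOME", "USERPROFILE", "HOMEDRIVE", "HOMEPATH")


def home_vars_report(env):
    """Map each home-related variable name to its value, or None if unset."""
    return {name: env.get(name) for name in HOME_VARS}


def has_usable_home(env):
    """Rough check: True if at least one variable that could locate a home dir is set."""
    report = home_vars_report(env)
    return any(report[name] for name in ("APPDATA", "HOME", "USERPROFILE", "HOMEPATH"))


if __name__ == "__main__":
    for name, value in home_vars_report(os.environ).items():
        print(name, "=", value)
```

Printing this report from inside the CGI script and again from IDLE would show whether the two environments really differ in this respect.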
