Using Python with Zeppelin under the Spark 2 Interpreter


I have deployed HDP 2.6.4 on a virtual machine.
I can see that spark2 is not pointing to the correct Python folder. My questions are:
1) How can I find where my python is located?
Solution: run whereis python and you will get a list of the locations where Python is installed.
2) How can I update the existing python libraries and add new libraries to that folder ? For example, the equivalent of 'pip install numpy' on CLI.
Nothing clear yet
3) How can I make Zeppelin Spark2 point at the specific directory that contains the Python installation I can update? In Zeppelin, there is a little 'edit' button that lets me change the path to the directory that contains Python.
Solution: go to the interpreter settings in Zeppelin, find spark2, and make zeppelin.pyspark.python point to the existing Python installation.
If you need Python 3.4+, there is a whole different set of steps to follow in order to first get Python 3.4+ into the HDP sandbox.
Thank you,

For a Sandbox environment like yours, a sandbox image is made on a Linux OS (CentOS). The Zeppelin Notebook points, in all probability, to the Python installation that comes along with every Linux OS.
If you want your own installation of Python and your own set of data analysis libraries, such as those in the SciPy stack, you need to install Anaconda on your virtual machine. Your VM needs to be connected to the internet so that you can download and install the Anaconda package.
You can then point Zeppelin to Anaconda's interpreter at the following path: /home/user/anaconda3/bin/python, where user is your username.
Zeppelin Configuration also confirms the fact that it uses the default python installation at /usr/bin/python. You can go through its documentation for more Information
UPDATE
Hi Joseph, Spark installations, by default, use the Python interpreter and the Python libraries that have been installed on your OS. The folder structure that you have shown only tells you the location of the PySpark module. This module is a library like Pandas or NumPy.
What you can do is install the SciPy stack (NumPy, Pandas, Matplotlib, etc.) with the command pip install <package name> and import those libraries directly into your Zeppelin notebook.
Use the command whereis python in the terminal of your sandbox; the result will look something like this:
/usr/bin/python /usr/bin/python2.7 ....
In your Zeppelin configuration, set the property zeppelin.pyspark.python to the first value from the output of the previous command, i.e. /usr/bin/python. All the libraries you installed via pip install for that interpreter will then be available to you in Zeppelin.
This process would only work for your Sandbox environment. In a real production cluster, your administrator needs to install all these libraries on all the nodes of your Spark cluster.
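To confirm which interpreter a notebook paragraph is really running, and whether a package installed with pip is visible to it, you can ask Python itself. The following is a minimal sketch, not part of Zeppelin; the helper name describe_interpreter is made up for illustration, and it assumes you run it in a paragraph bound to the PySpark interpreter.

```python
import importlib
import sys


def describe_interpreter(packages):
    """Return the running interpreter's path and whether each package imports."""
    report = {"executable": sys.executable, "packages": {}}
    for name in packages:
        try:
            importlib.import_module(name)
            report["packages"][name] = True
        except ImportError:
            # Not installed for this particular interpreter.
            report["packages"][name] = False
    return report


# Hypothetical check: adjust the package names to what you installed.
print(describe_interpreter(["json", "numpy"]))
```

If the reported executable differs from the one you ran pip with, the libraries were installed for a different Python than the one Zeppelin uses.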

Related

Matlab python configuration on linux

I am currently attempting to set up Matlab to work with Volttron on a Linux virtual machine. Python 3.8, Volttron, and Matlab are all installed on the virtual machine. When I run
pyversion python.exe
in the command window I get this error:
Error using pyversion
Path argument does not specify a valid executable.
Running pe = pyenv; and pe.Version returns blank, as does pyversion. This document describes a way to set the version used and I believe this ought to be my next step. However, the instructions say that for Linux I should run
pyenv('Version','executable')
but python is already installed and to my knowledge on linux does not have an executable file one can download for python. How can I remedy this?

If you have followed the recommended steps to setup VOLTTRON, and are running VOLTTRON within a virtual environment, the python version to use should be located within that virtual environment at env/bin/python.
As mentioned in this answer, if you want to verify the path, you can activate the environment using source env/bin/activate, and then run python. Once inside the python interpreter, you would just need to print the system executable.
import sys
print(sys.executable)
It is worth noting that this is an older method for connecting to MatLab with VOLTTRON. You may want to try using the newer MatLab agents. The documentation for this method is included with the example agents. https://volttron.readthedocs.io/en/latest/developing-volttron/developing-agents/example-agents/matlab-agent.html
The new method also assumes that MatLab is running in a separate Windows environment. In your case, you would install the standalone MatLab Agent within the linux virtual machine, and proceed accordingly.
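On Linux there is no python.exe; the "executable" that MATLAB asks for is the path of the interpreter binary. As a rough sketch (the helper name find_python is hypothetical), the standard library can report both the interpreter found on PATH and the one currently running, either of which can then be passed as the path argument:

```python
import shutil
import sys


def find_python(candidates=("python3", "python")):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print("First match on PATH:", find_python())
print("Currently running:  ", sys.executable)
```

Run it after activating the VOLTTRON environment if you want the path inside env/bin rather than the system interpreter.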

How to use a Virtual Environment?

I am using Python 3.7.9 Shell.
I created a virtual environment in this location
C:\Users\my_username\Desktop\Projects.venv
Inside of Python Shell, when I type: import numpy, which is in my .venv\lib folder, it says that the module does not exist.
Using Python Shell, how do I make use of the contents in .venv? In particular, the libraries located there?
Edit #1: Include Details
In my windows command line, it has (.venv) off to the left.
I have run the Activate file. I then started Python.
In my \lib\site-packages area, I have the requests library.
When I open up Python Shell and type "import requests", it says "no such library can be found"
I am using Windows 10
I installed the libraries while in the (.venv) environment.
Theory:
In the virtual environment, in Python Shell, it's searching a different location for libraries...now if I can just figure out where it's searching and how to change that...I might be able to make progress.
Edit #2: My Progress
My theory was correct. Despite using a virtual environment, it's not looking for the libraries installed in (.venv)\lib\site-packages, it's looking somewhere outside of that.
Now I just need to figure out how to make the Python code look for libraries inside of (.venv)\lib\site-packages when I'm in the virtual environment.
When I run the python.exe file inside of the (.venv)\Scripts location, it recognizes the virtual environment scripts.
If I click on my version of Python.Exe located in my C:...\Programs\Python 3.7 folder, it doesn't recognize them.
I was under the impression it didn't matter where I clicked on the Python.exe file if I did it after going to the virtual environment in the command line? Is this not true?
Edit #3: Important Links
Where Python Looks for Modules When Importing

Right from the official docs https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments
Once you’ve created a virtual environment, you may activate it.
On Windows, run:
tutorial-env\Scripts\activate.bat
On Unix or MacOS, run:
source tutorial-env/bin/activate
This is done in your shell before starting Python at its prompt. It lets you choose between different Python versions, among other benefits.
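To see why launching a different python.exe ignores the packages in .venv, it helps to print what the running interpreter thinks its environment is. A minimal sketch, assuming an environment created with the standard venv module (the function name in_virtualenv is illustrative):

```python
import sys


def in_virtualenv():
    """True when the interpreter runs from a venv-style virtual environment."""
    # In a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
for entry in sys.path:
    # These are the directories searched by import statements.
    print("  search path:", entry)
```

Run from .venv\Scripts\python.exe, the search path should include the environment's site-packages; run from the system-wide python.exe, it will not, regardless of what was activated in the command line window.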

How to set default interpreter and keep things in order?

I was required to install anaconda for a CS course and used spyder and Rstudio.
Then, for a different class I used pycharm.
When I type on the command line "python -V" I get:
Python 3.6.1 :: Anaconda 4.4.0 (x86_64)
and I have no idea why it relates the python version I have installed with Anaconda (and why not pycharm?). I understand that the OS runs python 2.7 (shouldn't I get that instead? and when I type python3 -V get which version of python 3 I have?) and when I use something like Pycharm or Spyder I can choose which version I want from the ones I have installed and use it within the program, not for the terminal.
I just want to have everything in order and under control. I don't think I understand what Anaconda really is (to me it is like a program that has more programs in it...). How do I keep Anaconda to itself?
Also, should the packages I installed through Terminal work on both pycharm and spyder/anaconda even though when I used pycharm I used python 3.5 and anaconda 3.6?
I think I need definitions and help to get everything in order in my head and the computer.

Pycharm is just an application to help you write code. Pycharm itself does not run python code. This is why in PyCharm, you need to set the interpreter for a project, which could be any python binary. In PyCharm, go to Preferences > Project > Project Interpreter to see where you would set the python environment being used for a given project. This could point to any python installation on your machine, whether that is the python 2.7 located at /usr/bin/python or a virtual environment in your project dir.
The industry standard way to "keep things in order" is to use what are called virtual environments. See here: https://docs.python.org/3/library/venv.html. A virtual environment is literally just a copy of a Python environment (binaries and everything) in whatever directory you specify. This allows you to configure the environment however your project needs without interfering with other projects you might have. For example, say project A requires django 1.9.2 but project B requires 1.5.3. By having a virtual environment for each project, the dependencies won't conflict.
Since you have Python 3.6, I would recommend going to your project directory in a terminal window and running python -m venv .venv to create a hidden directory that contains a local Python environment based on your 3.6 installation. You could then set your project interpreter to use that environment. To connect to it on the command line, run source .venv/bin/activate from the directory where you created the virtual environment. Run which python and see that python is now referencing your virtual environment :)
If you are using a Mac (which I believe you are from what you said about Python 2.7), what likely happened is that your Anaconda installer put its Python bin directory on your PATH environment variable. Type which python to see what the python command is referencing. You can undo this if you really want by editing your ~/.bash_profile file.
You are more or less correct about anaconda. It is itself another distribution of python and contains a load of common libraries/dependencies that tend to make life easier. For a lot of data analysis, you likely won't even need to install another dependency with pip after downloading anaconda.
I suspect this won't be all too helpful at first as it is a lot to learn, but hopefully this points you in the right direction.
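The same environment can also be created from Python through the standard library's venv module, which is what python -m venv calls. This is only a sketch: the helper name create_env and the temporary demo directory are made up for illustration, and with_pip=False is chosen here just to keep the example quick (the command-line form installs pip by default).

```python
import os
import tempfile
import venv


def create_env(project_dir, name=".venv"):
    """Create a virtual environment under project_dir and return its path."""
    env_dir = os.path.join(project_dir, name)
    # with_pip=False skips bootstrapping pip; set it to True for real use.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Hypothetical project directory, created only for this demonstration.
demo_project = tempfile.mkdtemp()
env_path = create_env(demo_project)
print("Created:", env_path)
print("Contents:", sorted(os.listdir(env_path)))
```

The pyvenv.cfg file in the new directory records which base installation the environment was made from, which is a handy thing to check when several Pythons are installed.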

Starting Matlab engine in anaconda virtual environment returns 'Segmentation fault (core dumped)'

I've installed the official MATLAB Engine by following the instructions from the answer to Anaconda install Matlab Engine on Linux to an Anaconda virtual environment running Python3.5. I can now import matlab and matlab.engine without receiving errors. However, when I try:
matlab.engine.start_matlab(), I get 'Segmentation fault (core dumped)'
I've tried setting LD_LIBRARY_PATH from within the conda environment (in case that is even relevant): export LD_LIBRARY_PATH=/System/Library/Frameworks/Python.framework/Versions/Current/lib:$LD_LIBRARY_PATH, but to no avail. The path doesn't exist either as far as I'm aware, so I've also tried export DYLD_LIBRARY_PATH=path_to_anaconda3/envs/myEnv/lib:$LD_LIBRARY_PATH
So how can I start the matlab engine/call Matlab scripts from Python from within a Anaconda virtual environment?
I'm on Ubuntu, by the way

Short answer: there were two problems that needed to be fixed
$LD_LIBRARY_PATH should not contain a path to the Anaconda installation. Adding such a path is discouraged according to the conda documentation: https://conda.io/docs/building/shared-libraries.html, but some packages may do so anyways, causing the segmentation error.
A symbolic link is needed from a libpythonXXX.dylib file of the right version to /usr/lib/, so that MATLAB can find the right Python
Long answer: complete installation instructions for using MATLAB Engine with Anaconda
Install a MATLAB version that supports the Python you want to use. Ensure that this particular MATLAB installation is activated
Open a terminal and go to the folder containing the Python engine of the MATLAB installation: cd "/usr/local/MATLAB/R2017a/extern/engines/python"
Run setup.py with the Python version you want to use, and prefix the Anaconda environment location: sudo python3.5 setup.py install --prefix="/your_path_to_anaconda3/envs/your_env". At this point you should be able to import matlab and matlab.engine from within the Python of your Anaconda environment, but, in my case, starting the engine gave the segmentation error.
Find the libpython file of the right version. Your Anaconda environment should contain it: find /your_path_to_anaconda3/envs/your_env/ -name libpython*. In my case this returned:
/.../lib/libpython3.so
/.../lib/python3.5/config-3.5m/libpython3.5m.a
/.../lib/libpython3.5m.so.1.0
/.../lib/libpython3.5m.so
As I wanted to use it with python 3.5, I went with libpython3.5m (I don't know why the 'm' is there). Make a symbolic link from the .dylib version of this file to /usr/lib: sudo ln -s /your_path_to_anaconda3/envs/your_env/lib/libpython3.5m.dylib /usr/lib. Note that you can only have one link called libpython3.5m.dylib in /usr/lib. So if you have multiple Anaconda environments using the same version of Python, you only need to set up this link once to whichever one. Remember not to delete this environment though, as that would break the link for all other environments relying on it.
Start a new terminal (!) and activate your Anaconda environment: source activate your_env. Check within your Anaconda environment whether LD_LIBRARY_PATH contains any references to the Anaconda environment echo $LD_LIBRARY_PATH. If so, ensure that it no longer does: export LD_LIBRARY_PATH=only_paths_you_do_want_to_keep_separated_by_a_colon. This export action needs to be repeated whenever you activate your Anaconda environment, so you may want to look into more permanent means of setting it. However, in my case (apart from the fact I had been adding it myself in the hope this would improve things) the path was actually added by pygpu, so I ended up resetting LD_LIBRARY_PATH from my python script (so far without noticing ill effects).
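The cleanup of LD_LIBRARY_PATH described in the last step can be sketched as a small pure function that drops every entry lying under the Anaconda installation. The function name strip_paths and the example paths are illustrative assumptions, not taken from MATLAB or conda.

```python
import posixpath


def strip_paths(path_value, unwanted_prefix):
    """Remove entries under unwanted_prefix from a colon-separated path list."""
    prefix = posixpath.normpath(unwanted_prefix)
    kept = []
    for entry in path_value.split(":"):
        if not entry:
            continue  # skip empty segments such as a trailing colon
        normalized = posixpath.normpath(entry)
        if normalized == prefix or normalized.startswith(prefix + "/"):
            continue  # entry belongs to the unwanted installation
        kept.append(entry)
    return ":".join(kept)


# Hypothetical value; substitute your own Anaconda location.
example = "/home/user/anaconda3/envs/your_env/lib:/usr/local/cuda/lib64"
print(strip_paths(example, "/home/user/anaconda3"))
```

In a script you could assign the result back to os.environ["LD_LIBRARY_PATH"] before starting the engine, as the answer describes doing; whether that takes effect in your setup is something to verify rather than assume.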

How can I manage multiple Pythons in Ubuntu 16.04?

On my Ubuntu 16.04 machine, Python 2 and Python 3 are installed by default. In addition, I have installed Anaconda too. I am stuck on the 'python' command. Every time I use pip or pip3 install, I don't know where the package gets installed: Python 2 or Python 3? I use conda install to install Anaconda packages, and I also use Anaconda environments to manage different virtual environments. But I think this gets mixed up with my local Python 2 and 3.
For example, in the directory /usr/bin, I found many soft links like this:
When I try the 'python' command, it just confuses me!
Why is python3m the local one? Shouldn't it be Anaconda's? Why is python3 Anaconda's? Shouldn't it be the local one? Then I found that if I run ./python2 or ./python3, the result is correct!
So I know it is caused by environment variables. I ran echo $PATH and found it looks like this: /home/kinny/.pyenv/shims:/home/kinny/.pyenv/bin:/home/kinny/anaconda3/bin:/home/kinny/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/ant/bin:/snap/bin:/opt/maven/bin:/usr/lib/jvm/java-8-oracle/bin
I have used update-alternatives --config python to configure the default python, but it doesn't work! The installations seem to be mixed up with each other.
Now I just want to install tensorflow 0.11 in the local python3, because in Anaconda it is version 0.10 by default. How can I change this? I want python, python3 and python3m to refer to Python 2.7, Python 3.5 and Anaconda's Python respectively, and pip and pip3 to serve the local python2 and python3 respectively. How can I do that?

I ran into a similar problem when setting up PyCharm Edu to work with Anaconda. I found that I had several versions of Python installed and it was very hard to keep track of which version the IDE was referencing. My CS professor gave me the advice of simply removing the versions of Python I didn't use often. I now just have Anaconda installed, and use the Anaconda Prompt as my Python console. I also rely on PyCharm's IPython for the developer console. However, if you still want differing versions of Python installed (say you're doing QA testing for older devices), there is a really helpful command: which python. When entered into the python console or Anaconda Prompt, which python will display the directory associated with the currently executing Python shell. This helps you keep track of which particular python.exe the current window refers to.

Follow up to the comments mentioning using virtualenv and virtualenvwrapper.
Here are the official docs and a good blog post for getting started with virtualenvs:
https://virtualenv.pypa.io/en/stable/installation/
http://virtualenvwrapper.readthedocs.io/en/latest/install.html
http://exponential.io/blog/2015/02/10/install-virtualenv-and-virtualenvwrapper-on-ubuntu/
Also, once you are setup you can create virtualenv's specifying which python installation you want to use.
which python3
returns
/usr/bin/python3
Then create a virtualenv with that Python path, where example_env is the name of the virtualenv.
mkvirtualenv -p /usr/bin/python3 example_env
Then activate the virtualenv using virtualenvwrapper.
workon example_env
Finally, install tensorflow and other dependencies with pip.
pip install tensorflow
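When several Pythons are installed, it is often unclear which one a bare pip or pip3 belongs to. One way around that is to invoke pip as a module of the exact interpreter you mean. The sketch below only builds the command rather than running it; the helper name pip_install_command is hypothetical.

```python
import sys


def pip_install_command(package, python=sys.executable):
    """Build a pip command that is tied to one specific interpreter."""
    # "python -m pip" installs into the environment of that interpreter,
    # no matter which pip executable comes first on PATH.
    return [python, "-m", "pip", "install", package]


# Default: the interpreter running this script.
print(" ".join(pip_install_command("tensorflow")))
# Explicit interpreter, e.g. the system Python 3 found with `which python3`.
print(" ".join(pip_install_command("tensorflow", "/usr/bin/python3")))
```

Passing the resulting list to subprocess.check_call would perform the installation.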

The which command is very useful for finding the path to the executable that comes first in your path. Zsh also has the where command, which will show you all instances of the given executable that show up in your path.
For managing different Python versions, you have a lot of options. The easiest for most people tends to be Anaconda, using conda environments. The installer will ask you to add some lines to your .bashrc file, which will then make Anaconda's binaries come first in your path. Anything you run after .bashrc has been sourced will then use those binaries first, including PyCharm. For graphical desktop apps to pick up the change, you may need to log out and back in again.
If you only need one version each of Python 2 and Python 3, you can just use the ones available via apt. Depending on your Ubuntu version, Python 2 is definitely installed by default, as it is used by many system utilities, including apt itself. Some newer versions may also install Python 3 by default, but I do not remember for sure.
Another option is to install the versions of Python you need in an alternate location, such as /opt/python/<version>, and then use environment-modules (installed via apt install environment-modules) or Lmod to control which versions are being used, but that may or may not be convenient to use with a desktop application such as PyCharm.
For TensorFlow, 1.11 is available in Anaconda, but I don't remember whether it is in the default channel or not.
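If your shell has no where command, the same listing can be approximated in Python by walking the PATH entries in order. This is a rough sketch with a made-up helper name, find_all_on_path; the first result corresponds to what which would report.

```python
import os


def find_all_on_path(name, path_value=None):
    """List every executable called name on PATH, in lookup order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    matches = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Keep regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in matches:
                matches.append(candidate)
    return matches


for hit in find_all_on_path("python3"):
    print(hit)
```

With a PATH like the one in the question, where the pyenv shims and anaconda3/bin come before /usr/bin, the earlier entries win, which explains why python did not resolve to the system interpreter.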
