Sagemaker lifecycle configuration for installing pandas not working - python

I am trying to upgrade pandas from a SageMaker lifecycle configuration. Following the AWS example, I have this script:
#!/bin/bash
set -e
# OVERVIEW
# This script installs a single pip package into a single SageMaker conda environment.
sudo -u ec2-user -i <<EOF
# PARAMETERS
PACKAGE=pandas
ENVIRONMENT=python3
source /home/ec2-user/anaconda3/bin/activate "$ENVIRONMENT"
pip install --upgrade "$PACKAGE"==0.25.3
source /home/ec2-user/anaconda3/bin/deactivate
EOF
Then I attach it to a notebook instance, but when I start the instance and open a notebook file, I see that pandas has not been updated. Running !pip show pandas, I get:
Name: pandas
Version: 0.24.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: http://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages
Requires: pytz, python-dateutil, numpy
Required-by: sparkmagic, seaborn, odo, hdijupyterutils, autovizwidget
So I am indeed in the python3 env, but the version is still 0.24.2.
However, the CloudWatch log shows that 0.25.3 was installed:
Collecting pandas==0.25.3 Downloading https://files.pythonhosted.org/packages/52/3f/f6a428599e0d4497e1595030965b5ba455fd8ade6e977e3c819973c4b41d/pandas-0.25.3-cp36-cp36m-manylinux1_x86_64.whl (10.4MB)
2020-02-03T12:33:09.065+01:00
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in ./anaconda3/lib/python3.6/site-packages (from pandas==0.25.3) (2018.4)
2020-02-03T12:33:09.065+01:00
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in ./anaconda3/lib/python3.6/site-packages (from pandas==0.25.3) (2.7.3)
2020-02-03T12:33:09.065+01:00
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in ./anaconda3/lib/python3.6/site-packages (from pandas==0.25.3) (1.16.4)
2020-02-03T12:33:09.065+01:00
Requirement already satisfied, skipping upgrade: six>=1.5 in ./anaconda3/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas==0.25.3) (1.13.0)
2020-02-03T12:33:09.065+01:00
Installing collected packages: pandas Found existing installation: pandas 0.24.2 Uninstalling pandas-0.24.2: Successfully uninstalled pandas-0.24.2
2020-02-03T12:33:12.066+01:00
Successfully installed pandas-0.25.3
What could be the problem?

If you want to install packages only in the python3 environment, use the following script when you create the SageMaker lifecycle configuration.
#!/bin/bash
sudo -u ec2-user -i <<'EOF'
# This will affect only the Jupyter kernel called "conda_python3".
source activate python3
# Replace pandas==0.25.3 with the package (and version) you want to install.
pip install pandas==0.25.3
# You can also perform "conda install" here as well.
source deactivate
EOF
Reference : "Lifecycle Configuration Best Practices"
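One visible difference between the two scripts is the heredoc delimiter: the question uses <<EOF, while this answer quotes it as <<'EOF'. With an unquoted delimiter, the outer shell expands $VARIABLES before the text reaches the inner shell, so variables assigned inside the heredoc expand to empty strings. The log in the question lists dependencies under ./anaconda3/lib rather than anaconda3/envs/python3, which would fit an empty environment name, though that is only a guess. A minimal sketch of the quoting behaviour, assuming bash is on the PATH (run_heredoc is a helper made up for this illustration):

```python
import os
import subprocess


def run_heredoc(delimiter: str) -> str:
    """Feed a two-line script to an inner bash through a heredoc and return its output."""
    script = (
        f"bash <<{delimiter}\n"
        "ENVIRONMENT=python3\n"
        'echo "env=[$ENVIRONMENT]"\n'
        "EOF\n"
    )
    # Make sure the outer shell has no ENVIRONMENT variable of its own.
    env = {k: v for k, v in os.environ.items() if k != "ENVIRONMENT"}
    result = subprocess.run(
        ["bash", "-c", script], env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Unquoted delimiter: the outer shell expands $ENVIRONMENT (unset) to nothing.
print("<<EOF   ->", run_heredoc("EOF"))
# Quoted delimiter: the text is passed through untouched and the inner shell expands it.
print("<<'EOF' ->", run_heredoc("'EOF'"))
```

The first call prints env=[] and the second prints env=[python3], which is why the quoted form is the safer choice when variables are defined inside the heredoc.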

I encountered exactly the same problem: the package was not available in the notebook even though the lifecycle log in CloudWatch indicated a successful installation for that kernel. The solution that worked for me was to make sure the installation has completed before opening the notebook.
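To confirm from inside the notebook which interpreter the kernel runs and which version of a package it sees, something like the following can be run in a cell. This is only a sketch: describe_package is a made-up helper, and importlib.metadata needs Python 3.8 or newer (on an older kernel, importing pandas and printing pandas.__version__ gives the same information).

```python
import sys
from importlib import metadata


def describe_package(name: str) -> dict:
    """Report the running interpreter and the installed version of a package, if any."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed for this interpreter
    return {"executable": sys.executable, "prefix": sys.prefix, "version": version}


print(describe_package("pandas"))
```

If the reported prefix is not the environment the lifecycle script activated, the install went somewhere else.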

Related

Why is the pip install process stuck on the "Installing collected packages" step?

I'm trying to pip install some Python libraries in a virtual environment created by conda create, but for some packages the installation gets stuck at the step "Installing collected packages:".
Take pandas as an example. My command and output are as follows:
pip install pandas --no-cache-dir
Collecting pandas
Downloading https://files.pythonhosted.org/packages/99/12/bf4c58eea94cea4f91ff931f284146337814fb8546e6eb0b52584446fd52/pandas-0.24.1-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (16.3MB)
100% |████████████████████████████████| 16.3MB 11.4MB/s
Requirement already satisfied: numpy>=1.12.0 in /anaconda/envs/testctds2/lib/python3.6/site-packages (from pandas) (1.16.1)
Requirement already satisfied: pytz>=2011k in /anaconda/envs/testctds2/lib/python3.6/site-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.5.0 in /anaconda/envs/testctds2/lib/python3.6/site-packages (from pandas) (2.8.0)
Requirement already satisfied: six>=1.5 in /anaconda/envs/testctds2/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas) (1.12.0)
Installing collected packages: pandas
The process just hung there (for at least 30 minutes) until I pressed Ctrl+Z to quit (Ctrl+C got no response).
What I have tried:
conda install pandas worked well, and it is also the recommended way to install pandas. I just don't understand why pip install didn't work, since it is supposed to; the same thing happened with other libraries such as numpy, scipy, and scikit-learn.
I also tried without --no-cache-dir, and with -vvv to see more details, but in either case there was no further information or error code after the line "Installing collected packages: pandas".
I tried the command in a new terminal window. Magically, numpy installed very quickly with "pip install numpy", but it still didn't work for pandas or scipy.
I see this may be a problem other users are having. Here is a GitHub link describing the same problem. There are a few others on the Conda GitHub page. Some of the suggestions from that thread are:
Make sure your root conda environment is up to date. Try: conda upgrade conda
Create a brand new virtual env.
Michael Grant, who is a Director for Technical Consulting at Anaconda, replied to that thread with this:
That said, when I look at the debug output, I'm finding that it's not able to prune back the package list very well. The more "old" packages it has to consider the higher the likelihood that this kind of solver stall happens. Thankfully it is a lot less likely than it used to be.

Can't upgrade python-dateutil in OSX

I am trying to install Sphinx on OSX, with the hopes of eventually making it into a website, following this guide :
I have been macporting, homebrewing, python wheelin', figuring out what a virtualenv is... figuring out if I should use python 2.7 or 3.X (which I couldn't figure out)
I added this to my bash profile :
export PYTHONPATH=$PYTHONPATH:/usr/local/lib/python2.7/site-packages
When I run this :
python -c "import sys; print('\n'.join(sys.path))"
I had expected it to return simply /usr/local/lib/python2.7/site-packages
but it says the following :
/Library/Python/2.7/site-packages/pip-10.0.0b2-py2.7.egg
/Library/Python/2.7/site-packages/virtualenv-15.2.0-py2.7.egg
/Users/nook
/usr/local/lib/python2.7/site-packages
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload
/Users/nook/Library/Python/2.7/lib/python/site-packages
/Library/Python/2.7/site-packages
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
Anyway, I finally got a virtualenv set-up, and ran the following command:
pip install sphinx
I sorted out a NumPy error, and another error with six. But I am unable to get rid of the following error :
jupyter-client 5.2.3 has requirement python-dateutil>=2.1,
but you'll have python-dateutil 1.5 which is incompatible.
When I run the following command :
pip install python-dateutil --upgrade --ignore-installed
I get
Collecting python-dateutil
Using cached python_dateutil-2.7.2-py2.py3-none-any.whl
Collecting six>=1.5 (from python-dateutil)
Using cached six-1.11.0-py2.py3-none-any.whl
Installing collected packages: six, python-dateutil
That seems good...
But when I try to install sphinx again I get the same
jupyter-client 5.2.3 has requirement python-dateutil>=2.1, but you'll have
python-dateutil 1.5 which is incompatible.
So I went in and tried the following command :
sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/python_dateutil-1.5-py2.7.egg-info/
I want to remove that directory, two NumPy directories, and another dateutil directory... but I can't! I don't have the permissions.
If anyone has a pro tip on which version of Python to use for my Sphinx site (the latest version?), and on how to set up a virtual environment with that version of Python and sort out NumPy and dateutil... I will be forever in your debt!
EDIT : Just tried the following :
sudo pip install python-dateutil
Password:
The directory '/Users/nick/Library/Caches/pip/http' or its parent directory is
not owned by the current user and the cache has been disabled. Please check
the permissions and owner of that directory. If executing pip with sudo, you
may want sudo's -H flag.
The directory '/Users/nick/Library/Caches/pip' or its parent directory is not
owned by the current user and caching wheels has been disabled. check the
permissions and owner of that directory. If executing pip with sudo, you may
want sudo's -H flag.
Requirement already satisfied: python-dateutil in
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
(1.5)
pandas 0.22.0 has requirement numpy>=1.9.0, but you'll have numpy 1.8.0rc1
which is incompatible.
matplotlib 1.3.1 has requirement numpy>=1.5, but you'll have numpy 1.8.0rc1
which is incompatible.
jupyter-client 5.2.3 has requirement python-dateutil>=2.1, but you'll have
python-dateutil 1.5 which is incompatible.
You can see my numpy is totally budget too, at 1.8.0rc1. Thanks.
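When several site-packages directories are on sys.path, as in the listing above, it can help to ask Python which copy of a module it would actually import. A small diagnostic sketch (module_origin is a made-up helper; the Optional annotation makes this Python 3 syntax, though importlib.util.find_spec itself is standard library):

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Shows which dateutil wins on the current sys.path (None if it is not installed).
print(module_origin("dateutil"))
```

Running this inside and outside the virtualenv shows whether the old copy under the system framework directory is still the one being picked up.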

How to install packages on EMR

I created a cluster on AWS with Jupyter and python3 installed. I can type code in the cells, and numpy is installed: after import numpy as np I can access the functions in that package. However, pandas is not there. So in the next cell I typed !pip install pandas, which displays
Requirement already satisfied: pandas in /mnt/usrmoved/local/lib64/python2.7/site-packages
Requirement already satisfied: pytz>=2011k in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /mnt/usrmoved/local/lib64/python2.7/site-packages (from pandas)
Requirement already satisfied: python-dateutil in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /mnt/usrmoved/local/lib/python2.7/site-packages (from python-dateutil->pandas)
I thought it was successfully installed, but when I type import pandas as pd in the next cell, it gives me an error
---------------------------------------------------------------------------
ImportError
Traceback (most recent call last)<ipython-input-8-af55e7023913> in <module>()----> 1 import pandas as pd
ImportError: No module named 'pandas'
In general, how should we install Python packages on EMR?
On my laptop, in Jupyter, I always run "! pip install package" and it works. Why does it not work in Jupyter on EMR?
I tried installing Python packages using pip install, but I got "pip: command not found". So I used pip3 instead of pip, and it worked.
Using EMR 5.30.1
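The output in the question shows !pip writing into python2.7 site-packages while the notebook runs Python 3, so the pip on the PATH belongs to a different interpreter than the kernel. One common way around that, sketched here under the assumption of a plain Python kernel (the helper names are made up), is to invoke pip through the kernel's own interpreter:

```python
import subprocess
import sys


def pip_install_command(package: str) -> list:
    """Build a pip command bound to the interpreter that is running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package: str) -> None:
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(package))


# Show the command that would be run; call pip_install("pandas") to execute it.
print(" ".join(pip_install_command("pandas")))
```

This only affects the machine the kernel runs on; it does not install anything on the other nodes of the cluster.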
The conventional method to install python packages on EMR is to specify the packages needed at cluster creation using a bootstrap-action.
This method ensures the packages are installed on all nodes and not just the driver.
aws emr create-cluster \
--name 'test python packages' \
--release-label emr-5.20.0 \
--region us-east-1 \
--use-default-roles \
--instance-type m4.large \
--instance-count 2 \
--bootstrap-actions \
Path="s3://your-bucket/python-modules.sh",Name='Install Python Modules'
The python-modules.sh would contain commands to install the python packages. For example:
#!/bin/sh
# Install needed packages
sudo pip install pandas
AWS documentation

Python jira is not upgraded by pip

I am using the Python jira package, installed with pip in a virtual environment. Recently my script started to complain about the jira package version:
$ ./my_script.sh
jira/client.py:282: UserWarning: You are running an outdated version of JIRA Python 1.0.3. Current version is 1.0.6.dev20160420173258. Do not file any bugs against older versions.
I tried upgrading with pip:
$ pip install --upgrade --no-cache-dir jira
Collecting jira
Downloading jira-1.0.3-py2.py3-none-any.whl (46kB)
100% |████████████████████████████████| 51kB 175kB/s
Requirement already up-to-date: requests>=2.6.0 in <...>/.virtualenvs/jira/lib/python3.4/site-packages (from jira)
Requirement already up-to-date: requests-oauthlib>=0.3.3 in <...>/.virtualenvs/jira/lib/python3.4/site-packages (from jira)
Requirement already up-to-date: six>=1.9.0 in <...>/.virtualenvs/jira/lib/python3.4/site-packages (from jira)
Requirement already up-to-date: requests-toolbelt in <...>/.virtualenvs/jira/lib/python3.4/site-packages (from jira)
Requirement already up-to-date: tlslite>=0.4.4 in <...>/.virtualenvs/jira/lib/python3.4/site-packages (from jira)
Requirement already up-to-date: oauthlib>=0.6.2 in <...>/.virtualenvs/jira/lib/python3.4/site-packages (from requests-oauthlib>=0.3.3->jira)
Installing collected packages: jira
Successfully installed jira-1.0.3
I tried removing the installed jira package and installing it fresh, with the same result. Pip always installs version 1.0.3, but the script complains that a newer version exists.
My assumption is that 1.0.6 is marked as released (the check is inside the package itself) but not published (I don't know if this is the right word) for pip to download.
Any clue?
Regards,
JrBenito
It appears there is a 1.0.6.dev20160420173258 version, but it isn't downloaded by pip install jira; pip skips pre-release and development versions unless you pass --pre. It can be installed using the workaround from issue #156 for this new version:
pip install https://pypi.python.org/packages/f6/ea/2535e412ff76d85da20d2be6d1eaf9aa5de49481da94f2fe7e8830eedd35/jira-1.0.6.dev20160420173258-py2.py3-none-any.whl
It appears you have already commented on that issue, so hopefully they resolve this permanently.
I had this same issue, even after specifically downloading the 1.0.6.dev20160420173258 version. When the client.py file gets the version information, it doesn't get the git changeset correctly and so returns 1.0.6 instead of 1.0.6.dev20160420173258.
For now I made a workaround by hardcoding the version number pulled from https://pypi.python.org/pypi/jira/json
In /usr/lib/python2.7/site-packages/jira/client.py:
released_version = "1.0.6" # data['info']['version']
This is admittedly not a fix, but let's hope it gets fixed.
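For context, the commented-out expression data['info']['version'] reads the version field from the JSON document at the URL above. A rough sketch of that lookup, using a trimmed sample payload written for this illustration instead of a live request (released_version here is a stand-in function, not the code in client.py):

```python
import json

# Trimmed, hand-written sample in the shape of the PyPI JSON document (illustrative only).
SAMPLE_PAYLOAD = '{"info": {"name": "jira", "version": "1.0.6.dev20160420173258"}}'


def released_version(payload: str) -> str:
    """Extract the version string the same way data['info']['version'] would."""
    data = json.loads(payload)
    return data["info"]["version"]


print(released_version(SAMPLE_PAYLOAD))
```

Hardcoding "1.0.6" in place of this lookup makes the installed and "released" versions compare equal, which is what silences the warning.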
It worked for me only after I made the same change in /usr/lib/python2.7/site-packages/jira/client.py:
released_version = "1.0.6" # data['info']['version']

Installing Fuel (Machine Learning) using pip on Ubuntu 14.04.02

When installing the Fuel machine learning library, I got stuck with some dependencies issues:
alvas#ubi:~$ pip install --upgrade git+git://github.com/mila-udem/fuel.git
You are using pip version 7.1.0, however version 7.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting git+git://github.com/mila-udem/fuel.git
Cloning git://github.com/mila-udem/fuel.git to /tmp/pip-xUlqCT-build
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Collecting numpy (from fuel==0.0.1)
Requirement already up-to-date: six in /usr/local/lib/python2.7/dist-packages (from fuel==0.0.1)
Collecting picklable-itertools (from fuel==0.0.1)
Downloading picklable-itertools-0.1.1.tar.gz
Collecting pyyaml (from fuel==0.0.1)
Downloading PyYAML-3.11.tar.gz (248kB)
100% |████████████████████████████████| 249kB 612kB/s
Collecting h5py (from fuel==0.0.1)
Downloading h5py-2.5.0.tar.gz (684kB)
100% |████████████████████████████████| 688kB 398kB/s
Collecting tables (from fuel==0.0.1)
Downloading tables-3.2.2.tar.gz (7.0MB)
100% |████████████████████████████████| 7.0MB 73kB/s
Complete output from command python setup.py egg_info:
/usr/bin/ld: cannot find -lhdf5
collect2: error: ld returned 1 exit status
* Using Python 2.7.6 (default, Jun 22 2015, 17:58:13)
* USE_PKGCONFIG: True
.. ERROR:: Could not find a local HDF5 installation.
You may need to explicitly state where your local HDF5 headers and
library can be found by setting the ``HDF5_DIR`` environment
variable or by using the ``--hdf5`` command-line option.
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-wMS1d3/tables
Then I ran the following (from "Installing h5py on an Ubuntu server"):
sudo apt-get install libhdf5-dev
sudo HDF5_DIR=/usr/lib/x86_64-linux-gnu/hdf5/serial/ pip install h5py
And then I had to also update my cython with:
sudo pip install cython
My question is not about how to fix the installation issues, but about what this command means:
sudo HDF5_DIR=/usr/lib/x86_64-linux-gnu/hdf5/serial/ pip install h5py
What does specifying the HDF5_DIR do?
Why didn't the fuel dependencies automatically install from the:
https://github.com/mila-udem/fuel/blob/master/requirements.txt
https://github.com/mila-udem/fuel/blob/master/setup.py
What should I do to update the setup.py from fuel so that it can automatically pip install the dependencies?
You do not need to make any modifications to fuel's setup.py. Just make sure HDF5_DIR is set correctly before updating the library.
Explanations:
If you look at your error log, you can see that it fails while building tables (PyTables), a dependency of fuel: the "Command python setup.py egg_info failed" line points at /tmp/pip-build-wMS1d3/tables. The end of the log also tells you why it failed. Like h5py, tables uses the C library HDF5, and the build could not find that library (/usr/bin/ld: cannot find -lhdf5) or its headers.
So the sudo apt-get install libhdf5-dev that you ran installs the development version of this C library (you can guess that from the -dev suffix). The dev packages install the headers of the library and not just the compiled library.
Then, the HDF5_DIR environment variable is needed to tell the h5py setup where to find those headers.
So if you want to update the fuel library next time, make sure that HDF5_DIR is set correctly, and it will then update its dependencies (including h5py).
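On the syntax itself: writing HDF5_DIR=... in front of a command sets that variable in the environment of that one command only, and the build script reads it from there. The same idea can be sketched in Python by passing a modified copy of the environment to a child process (run_with_env is a made-up helper; the path is the one from the question):

```python
import os
import subprocess
import sys


def run_with_env(extra: dict, code: str) -> str:
    """Run a child Python process with extra environment variables and return its output."""
    env = dict(os.environ)  # copy, so the parent's environment is left untouched
    env.update(extra)
    result = subprocess.run(
        [sys.executable, "-c", code], env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


HDF5_DIR = "/usr/lib/x86_64-linux-gnu/hdf5/serial/"
child_code = "import os; print(os.environ.get('HDF5_DIR'))"
# The child sees the variable, just as pip's build step would.
print(run_with_env({"HDF5_DIR": HDF5_DIR}, child_code))
```

A setup script that honours HDF5_DIR would read it in the same way the child process does here.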
