How to install packages on EMR - python

I created a cluster on AWS with Jupyter and python3 installed. I can type code in the cells, and I found that 'numpy' is installed: after import numpy as np, I can access the functions in this package. However, pandas is not there. So in the next cell I typed !pip install pandas, which displays
Requirement already satisfied: pandas in /mnt/usrmoved/local/lib64/python2.7/site-packages
Requirement already satisfied: pytz>=2011k in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /mnt/usrmoved/local/lib64/python2.7/site-packages (from pandas)
Requirement already satisfied: python-dateutil in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /mnt/usrmoved/local/lib/python2.7/site-packages (from python-dateutil->pandas)
I thought it was successfully installed, but when I type import pandas as pd in the next cell, it gives me an error:
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-af55e7023913> in <module>()
----> 1 import pandas as pd

ImportError: No module named 'pandas'
In general, how should we install Python packages on EMR?
On my laptop, in Jupyter, I always run "! pip install package" and it works. Why does it not work in Jupyter on EMR?

I tried installing Python packages using pip install, but I got "pip: command not found". So I used pip3 instead of pip, and it worked.
Using EMR 5.30.1
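One likely reading of the output in the question: !pip resolved to a pip that belongs to Python 2.7 (the paths say python2.7), while the notebook kernel runs Python 3, so the package landed where the kernel cannot see it. A minimal sketch, assuming the kernel's interpreter has pip available, is to invoke pip through the interpreter that is running the notebook. The helper names below are illustrative and not part of any EMR API.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    # sys.executable is the Python binary behind the current kernel/script,
    # so "-m pip" installs into that interpreter's site-packages.
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package))


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(pip_install_command("pandas")))
```

Calling install("pandas") from a cell affects only the machine where the kernel runs, so it does not by itself make the package available on other nodes of the cluster.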

The conventional method to install python packages on EMR is to specify the packages needed at cluster creation using a bootstrap-action.
This method ensures the packages are installed on all nodes and not just the driver.
aws emr create-cluster \
--name 'test python packages' \
--release-label emr-5.20.0 \
--region us-east-1 \
--use-default-roles \
--instance-type m4.large \
--instance-count 2 \
--bootstrap-actions \
Path="s3://your-bucket/python-modules.sh",Name='Install Python Modules'
The python-modules.sh would contain commands to install the python packages. For example:
#!/bin/sh
# Install needed packages
sudo pip install pandas
AWS documentation
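For readers who create clusters from Python rather than the CLI, the same bootstrap action can be sketched with boto3's run_job_flow call. This is an illustration under assumptions, not the answer's method: it assumes boto3 is installed, AWS credentials are configured, and the default EMR roles (EMR_DefaultRole, EMR_EC2_DefaultRole) already exist. The function names and the bucket path are placeholders.

```python
def build_cluster_request(bootstrap_script_s3_path):
    """Return keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": "test python packages",
        "ReleaseLabel": "emr-5.20.0",
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 2,
            # Assumption: keep the cluster running when there are no steps.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "Install Python Modules",
                "ScriptBootstrapAction": {"Path": bootstrap_script_s3_path},
            }
        ],
        # Assumption: the default EMR roles exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request; needs boto3 installed and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Usage would look like launch_cluster(build_cluster_request("s3://your-bucket/python-modules.sh")), which returns the new cluster's ID. The bootstrap script runs on every node, which is the point of this approach.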

Related

Unable to import pandas in Replit.com

import pandas
So I am working on a Python project and want to import the pandas library on https://replit.com/~
For some reason, it shows an attribute error when I run the project.
Does anyone know how I can fix this or manually install pandas on Replit?
A screenshot of the error is attached.
Packages often produce errors on Replit, but you can try this:
Pandas does actually work on repl.it - you have to install it from the package manager. To do so, click the cube on the side navigation bar and type pandas into the search box. Then click on the pandas search entry and hit the plus sign. Tell me if this works!
Or
Broken package installs can usually be fixed by updating pip and installing pandas from PyPI. By default, Repl.it comes with pip version 19.3.1, but the latest available version for Python 3.8 is pip 21.1.1.
~/repl$ pip -V
pip 19.3.1 from /opt/virtualenvs/python3/lib/python3.8/site-packages/pip (python 3.8)
~/repl$ pip install pandas
Requirement already satisfied: numpy>=1.16.5 in /opt/virtualenvs/python3/lib/python3.8/site-packages (from pandas) (1.20.2)
Collecting pytz>=2017.3
Using cached https://files.pythonhosted.org/packages/70/94/784178ca5dd892a98f113cdd923372024dc04b8d40abe77ca76b5fb90ca6/pytz-2021.1-py2.py3-none-any.whl
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/virtualenvs/python3/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: six>=1.5 in /opt/virtualenvs/python3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.2.4 pytz-2021.1
Pandas does not work on Replit at all, because many of the modules that pandas needs in order to work properly do not work on Replit. An online compiler is also not the best choice for data processing; it would be better to install a Python interpreter on your PC.

python cant find packages on ec2 instance

I am trying to run a python-script on an aws ec2-instance using jenkins.
I get the following error:
[ProdTest] $ /bin/sh -xe /tmp/jenkins14047325752732522807.sh
+ python3 prod.py
Traceback (most recent call last):
File "prod.py", line 2, in <module>
import base
File "/home/jenkins-slave-strat/workspace/ProdTest/base.py", line 3, in <module>
import boto3
ModuleNotFoundError: No module named 'boto3'
Build step 'Execute shell' marked build as failure
Finished: FAILURE
on the ec2 instance:
$ python3 --version
Python 3.7.9
$ pip3 install boto3 --user
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: boto3 in /usr/local/lib/python3.7/site-packages (1.16.25)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3) (0.10.0)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.7/site-packages (from boto3) (0.3.3)
Requirement already satisfied: botocore<1.20.0,>=1.19.25 in /usr/local/lib/python3.7/site-packages (from boto3) (1.19.25)
Requirement already satisfied: urllib3<1.27,>=1.25.4; python_version != "3.4" in /usr/local/lib/python3.7/site-packages (from botocore<1.20.0,>=1.19.25->boto3) (1.26.2)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.7/site-packages (from botocore<1.20.0,>=1.19.25->boto3) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.20.0,>=1.19.25->boto3) (1.15.0)
How do I get Python to access the packages in /usr/local/lib/python3.7/site-packages?
Thanks to everyone taking the time!
Before I give a short answer, a few cautions:
I don't know what version of Linux (I assume) the EC2 instance is running.
As a result, it's unclear if Python 2 is installed, but since you used pip3 and not just pip, it seems like it might be.
You could try altering your $PATH variable, but it's generally considered good practice to use virtual environments for Python. I'm personally a fan of Conda, but you can find guides for Pipenv as well. My advice: install Conda via Miniconda, then do the following:
$ conda create --name myAppEnv python=your.python.version
$ conda activate myAppEnv
$ conda install [your libraries]
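Before changing $PATH or moving to Conda, it can help to confirm which interpreter the Jenkins job actually runs and where it looks for modules. One possibility (an assumption, the question does not confirm it) is that the job runs as a different user or with a different python3 than the interactive shell. A small diagnostic sketch, with illustrative function names, that could be run from the same "Execute shell" step:

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


def report(module_names):
    """Print which interpreter is running and where each module resolves."""
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    for name in module_names:
        print(name, "->", locate_module(name) or "NOT FOUND")


if __name__ == "__main__":
    report(["boto3"])
```

If the interpreter path printed under Jenkins differs from the one printed in the interactive shell, the package was installed for a different Python than the one running the job.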

Sagemaker lifecycle configuration for installing pandas not working

I am trying to update pandas within a lifecycle configuration, and following the AWS example I have the following code:
#!/bin/bash
set -e
# OVERVIEW
# This script installs a single pip package in a single SageMaker conda environments.
sudo -u ec2-user -i <<EOF
# PARAMETERS
PACKAGE=pandas
ENVIRONMENT=python3
source /home/ec2-user/anaconda3/bin/activate "$ENVIRONMENT"
pip install --upgrade "$PACKAGE"==0.25.3
source /home/ec2-user/anaconda3/bin/deactivate
EOF
Then I attach it to a notebook instance, and when I open a notebook file, I see that pandas has not been updated. Using !pip show pandas I get:
Name: pandas
Version: 0.24.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: http://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages
Requires: pytz, python-dateutil, numpy
Required-by: sparkmagic, seaborn, odo, hdijupyterutils, autovizwidget
So we can see that I am indeed in the python3 env although the version is 0.24.
However, the log in cloudwatch shows that it has been installed:
Collecting pandas==0.25.3 Downloading https://files.pythonhosted.org/packages/52/3f/f6a428599e0d4497e1595030965b5ba455fd8ade6e977e3c819973c4b41d/pandas-0.25.3-cp36-cp36m-manylinux1_x86_64.whl (10.4MB)
2020-02-03T12:33:09.065+01:00
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in ./anaconda3/lib/python3.6/site-packages (from pandas==0.25.3) (2018.4)
2020-02-03T12:33:09.065+01:00
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in ./anaconda3/lib/python3.6/site-packages (from pandas==0.25.3) (2.7.3)
2020-02-03T12:33:09.065+01:00
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in ./anaconda3/lib/python3.6/site-packages (from pandas==0.25.3) (1.16.4)
2020-02-03T12:33:09.065+01:00
Requirement already satisfied, skipping upgrade: six>=1.5 in ./anaconda3/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas==0.25.3) (1.13.0)
2020-02-03T12:33:09.065+01:00
Installing collected packages: pandas Found existing installation: pandas 0.24.2 Uninstalling pandas-0.24.2: Successfully uninstalled pandas-0.24.2
2020-02-03T12:33:12.066+01:00
Successfully installed pandas-0.25.3
What could be the problem?
If you want to install the packages only for the python3 environment, use the following script in your SageMaker lifecycle configuration.
#!/bin/bash
sudo -u ec2-user -i <<'EOF'
# This will affect only the Jupyter kernel called "conda_python3".
source activate python3
# Replace pandas==0.25.3 with the package (and version) you want to install.
pip install pandas==0.25.3
# You can also perform "conda install" here as well.
source deactivate
EOF
Reference : "Lifecycle Configuration Best Practices"
I encountered the exact same problem: the package was not available in the notebook even though the lifecycle CloudWatch log indicated a successful installation for the specific kernel. The solution that worked for me was to make sure the installation completes before opening the notebook.
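To confirm from inside the notebook that the kernel really sees the upgraded package, a small check like the following can be run in a cell. It is a sketch with hypothetical helper names, and the version parsing is deliberately naive: it compares only the leading numeric components.

```python
import importlib


def version_tuple(version):
    """Turn '0.25.3' into (0, 25, 3); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def kernel_sees_at_least(module_name, minimum):
    """True if the module imports here and its __version__ is >= minimum."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    installed = getattr(module, "__version__", "0")
    return version_tuple(installed) >= version_tuple(minimum)
```

For the case above, kernel_sees_at_least("pandas", "0.25.3") should return True once the lifecycle script has finished. A kernel that already imported the old version keeps it in memory, so restart the kernel before checking.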

ModuleNotFoundError- Requests (despite installation attemps showing "requirement already satisfied")

When I run the code below to attempt to import a few of the usual Python libraries for API interaction... I get a ModuleNotFoundError on the import line of code.
I verified that it is indeed installed on my machine via pip3. I then tried uninstalling it and reinstalling it. When that didn't work I tried running the installation as a shell command in my Jupyter notebook. The same errors persisted.
Please note: what I am referring to as "it" is either the requests or json library for Python; I am encountering the same errors with each.
#right on the import line is where the error happens, the code is simple though...
import requests
import json
Here is the traceback...
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-021831bd5cc5> in <module>
1 # Dependencies
2 get_ipython().system(' pip3 install requests')
----> 3 import requests
4 import json
ModuleNotFoundError: No module named 'requests'
And here is the "requirement already satisfied" statement from Terminal...
(base) Computer:~ User$ pip3 install requests
Requirement already satisfied: requests in ./anaconda3/lib/python3.7/site-packages (2.22.0)
Requirement already satisfied: idna<2.9,>=2.5 in ./anaconda3/lib/python3.7/site-packages (from requests) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in ./anaconda3/lib/python3.7/site-packages (from requests) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in ./anaconda3/lib/python3.7/site-packages (from requests) (2019.6.16)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in ./anaconda3/lib/python3.7/site-packages (from requests) (1.24.2)
Sorry, I don't have enough reputation to comment.
Please look at the environment where you installed the package.
Your terminal is in the base environment, so maybe you're using Anaconda and then running Python outside that environment. In that case the package won't be seen by your editor or terminal.
Could you add more information on where you run Python?
I was facing the same problem on Mac OS X when I ran "pip install requests"; then I installed with "sudo" and it worked.
On OSX/Linux :
Use $ sudo pip install requests if you have pip installed.
Alternatively you can also use sudo easy_install -U requests if you have easy_install installed.
For centOS: yum install python-requests
Reference: ImportError: No module named requests
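A "Requirement already satisfied" message only says that some Python has the package. One way to see which Python a given pip command serves is to read the trailing "(python X.Y)" in the output of pip --version and compare it with the running interpreter. The sketch below assumes that output format and uses illustrative function names.

```python
import re
import subprocess
import sys


def python_version_from_pip(pip_version_line):
    """Extract (major, minor) from a `pip --version` output line."""
    match = re.search(r"\(python (\d+)\.(\d+)\)", pip_version_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pip_targets_this_interpreter(pip_executable="pip3"):
    """Compare the Python behind a pip command with the running interpreter."""
    line = subprocess.check_output(
        [pip_executable, "--version"], universal_newlines=True
    )
    return python_version_from_pip(line) == tuple(sys.version_info[:2])
```

This is a necessary check rather than a sufficient one: two separate installations of the same Python version (for example an Anaconda Python and a system Python) would pass it, so the site-packages path in the same output line is worth comparing with sys.executable as well.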

import pandas as pd ImportError: No module named pandas

I can't seem to import the pandas package. I use Visual Studio Code as my editor, on a Mac running macOS 10.14 Mojave.
The code that I am trying to run is:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
house_data = pd.read_csv('house.csv')
plt.plot(house_data['surface'], house_data['loyer'], 'ro', markersize=4)
plt.show()
When I try pip install pandas, I get this in my terminal:
(base) pip install pandas
Requirement already satisfied: pandas in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (0.24.0)
Requirement already satisfied: pytz>=2011k in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.5.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2.7.5)
Requirement already satisfied: numpy>=1.12.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (1.15.3)
Requirement already satisfied: six>=1.5 in /Users/Library/Python/3.7/lib/python/site-packages (from python-dateutil>=2.5.0->pandas) (1.12.0)
(base) Thibaults-MBP-5d47:ML_folder thibaultmonsel$
Then, when I execute my code, I get:
Traceback (most recent call last):
File "ML1.py", line 5, in <module>
import pandas as pd
ImportError: No module named pandas
Then, if I try sudo pip install pandas, I get:
(base) MBP-5d47:ML_folder $ sudo pip3 install pandas --upgrade
Password:
The directory '/Users/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory.If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting pandas
Downloading https://files.pythonhosted.org/packages/34/63/529fd1391044051514f2f22d61754245db2133cd37c4dad7150a1cbe2ece/pandas-0.24.1-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (15.9MB)
100% |████████████████████████████████| 15.9MB 901kB/s
Requirement already satisfied, skipping upgrade: python-dateutil>=2.5.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2.7.5)
Requirement already satisfied, skipping upgrade: numpy>=1.12.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (1.15.3)
Requirement already satisfied, skipping upgrade: pytz>=2011k in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2018.9)
Requirement already satisfied, skipping upgrade: six>=1.5 in /Users/Library/Python/3.7/lib/python/site-packages (from python-dateutil>=2.5.0->pandas) (1.12.0)
Installing collected packages: pandas
Found existing installation: pandas 0.24.0
Uninstalling pandas-0.24.0:
Successfully uninstalled pandas-0.24.0
Successfully installed pandas-0.24.1
However, I still get "No module named pandas".
Lastly, when I try pip3 install pandas, I get:
Requirement already satisfied: pandas in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (0.24.0)
Requirement already satisfied: pytz>=2011k in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2018.9)
Requirement already satisfied: numpy>=1.12.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (1.15.3)
Requirement already satisfied: python-dateutil>=2.5.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2.7.5)
Requirement already satisfied: six>=1.5 in /Users/Library/Python/3.7/lib/python/site-packages (from python-dateutil>=2.5.0->pandas) (1.12.0)
When I try to execute the program after pip3 install pandas, I get the same error as above.
I also printed sys.version, in case it helps:
base)-MBP-5d47:ML_folder $ python help1.py
2.7.10 (default, Aug 17 2018, 17:41:52)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Here is also my sys.path :
['/Users/Desktop/ML_folder', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
You need to install pandas with:
pip install pandas
If you run into issues with privileges, you may need to run:
sudo pip install pandas
It is also possible on Python 3 that you may need to run:
pip3 install pandas (although pip may be pointing to pip3 already). You can read about differences between pip versions on this SO post.
If you don't have pip installed, see here for installation.
Check pandas package path from your env with:
jupyter kernelspec list
If you see the path:
/Users/yourname/Library/Jupyter/kernels/yourenv
Delete that Jupyter folder from Library and run again.
If you see the error "no module named pandas" in your IDE when you run your code, it means that pandas is not installed in the interpreter your project uses, even though you ran "pip install pandas" or similar.
Go to File > Settings > Project Interpreter and check whether pandas is in the list of packages. If not, click + (plus), choose pandas, and install it in your project environment.
Then wait for your IDE to update the project skeletons, and the error disappears.
When entering the command to run your file, make sure you specify which version of python you're using. For example, instead of python filename.py, use python3 filename.py or python2 filename.py
your pandas is installed in python3 (3.7):
Requirement already satisfied: pandas in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (0.24.0)
but you are running python2.7 and pandas isn't in your path 2.7:
['/Users/thibaultmonsel/Desktop/ML_folder',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload',
'/Library/Python/2.7/site-packages',
'/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python',
'/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
try to simply run your script using python3:
python3 help1.py
or add python3 header, example:
#!/usr/bin/env python3
or
#!/usr/local/bin/python3
and if that doesn't work (like I had the same problem because I was importing pandas from jupyter notebook, macos), you can ultimately import from your --user path, example:
import sys
sys.path.append("/Users/<USER>/Library/Python/3.7/lib/python/site-packages")
but make sure you have pandas installed there (..python/site-packages/pandas) using
pip3 install pandas --user
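Since the root cause here is running a Python 3 project with the system's Python 2.7, a guard at the top of the script can turn the confusing ImportError into an explicit message. This is a minimal sketch with illustrative function names; it is written so that it also runs under Python 2.

```python
import sys


def is_python3(version_info=None):
    """True when the given (or current) interpreter is Python 3 or newer."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info)[0] >= 3


def require_python3():
    """Exit with a readable message instead of a confusing ImportError."""
    if not is_python3():
        sys.exit(
            "Run this script with python3; current interpreter: "
            + sys.version.split()[0]
        )


if __name__ == "__main__":
    require_python3()
    print("interpreter OK:", sys.executable)
```

Calling require_python3() before the pandas import means that python help1.py under 2.7 stops immediately and names the interpreter that was used.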
Check your virtual environment (you can see it at the left corner of VS code) and install the package (e.g. pandas) in your virtual environment like this:
conda install -n yourenvname [package]
Install pandas outside the project. I wanted to install it only in a virtual environment, but I got the same error, so I installed it globally instead.
Code > Preferences > Settings
In Search, type "interpreter"
You will see a bar : Python: Default Interpreter Path
Paste the correct path to your Python (something like "/usr/local/bin/python3" on a Mac); the setting is saved automatically
Then go back to your python file and try to run
For me, the command below works on macOS:
sudo -H pip3 install pandas --upgrade
