Azure dataset .to_pandas_dataframe() error - python

I am following an Azure ML course on Udemy and cannot get around the following error:
Execution failed in operation 'to_pandas_dataframe' for Dataset(id='id', name='Loan Applications Using SDK', version=1, error_code=None, exception_type=PandasImportError)
Here is the code for Submitting the Script:
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
ws = Workspace.from_config(path="./config")
new_experiment = Experiment(workspace=ws, name="Loan_Script")
script_config = ScriptRunConfig(source_directory=".", script="180 - Script to Run.py")
script_config.framework = "python"
script_config.environment = Environment("conda_env")
new_run = new_experiment.submit(config=script_config)
Here is the Script being run:
from azureml.core import Workspace, Datastore, Dataset, Experiment
from azureml.core import Run
ws = Workspace.from_config(path="./config")
az_store = Datastore.get(ws, "bencouser_sdk_blob01")
az_dataset = Dataset.get_by_name(ws, name='Loan Applications Using SDK')
az_default_store = ws.get_default_datastore()
#%%----------------------------------------------------
# Get context of the run
#------------------------------------------------------
new_run = Run.get_context()
#%%----------------------------------------------------
# Stuff that will be logged
#------------------------------------------------------
df = az_dataset.to_pandas_dataframe()
total_observations = len(df)
nulldf = df.isnull().sum()
#%%----------------------------------------------------
# Complete the Experiment
#------------------------------------------------------
new_run.log("Total Observations:", total_observations)
for columns in df.columns:
    new_run.log(columns, nulldf[columns])
new_run.complete()
I have run the .to_pandas_dataframe() part outside of an experiment and it worked without error. I have also tried the following (that was recommended in the driver log):
InnerException Could not import pandas. Ensure a compatible version is installed by running: pip install azureml-dataprep[pandas]
I have seen people come across this before, but I cannot find a solution; any help is appreciated.

When the experiment ran, a new Azure environment was created without pandas installed. To install pandas (if you are using Anaconda Navigator), go to Environments in the Navigator window, click the Azure environment, go to the uninstalled packages, search for pandas, and click Install. It worked once this was done.

For VS code users:
conda info --envs
You will have an environment with a name starting with azureml_f3f7e6c5xxxxxxx. Activate this environment:
conda activate azureml_f3f7e6c5xxxxxxx
Then install pandas in the environment
pip install pandas
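If you would rather fix this in the submission code itself, here is a minimal sketch (assuming the azureml-core v1 SDK used in the question; the exact package pins are assumptions) that declares pandas as a dependency of the run's environment so the managed run can build it with pandas available:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Sketch only: build the "conda_env" environment with pandas and the
# dataprep pandas extra, so to_pandas_dataframe() does not hit
# PandasImportError inside the experiment run.
env = Environment("conda_env")
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults", "azureml-dataprep[pandas]", "pandas"]
)

# Attach it to the run configuration before submitting, for example:
# script_config.run_config.environment = env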

Related

Installation of tensorflow terminated on RStudio Cloud

Similarly to posts here and here, I am having more trouble when I try to install TensorFlow in a new RStudio Cloud project. I know I need to set up both Miniconda and a virtual environment locally in /cloud/project/ so the Python dependencies stay with copies of the cloud project. Previous versions of the following setup script worked.
install.packages(c("keras", "rstudioapi", "tensorflow"))
lines <- c(
  paste0("RETICULATE_CONDA=", file.path(getwd(), "miniconda", "bin", "conda")),
  paste0("RETICULATE_PYTHON=", file.path(getwd(), "miniconda", "bin", "python")),
  paste0("WORKON_HOME=", file.path(getwd(), "virtualenvs"))
)
writeLines(lines, ".Renviron")
rstudioapi::restartSession()
reticulate::install_miniconda("miniconda")
reticulate::virtualenv_create(
  envname = "r-tensorflow",
  python = Sys.getenv("RETICULATE_PYTHON")
)
keras::install_keras(
  method = "virtualenv",
  conda = Sys.getenv("RETICULATE_CONDA"),
  envname = "r-tensorflow"
)
But I get an error on Cloud when I try to install Python's TensorFlow and Keras:
keras::install_keras(
+ method = "virtualenv",
+ conda = Sys.getenv("RETICULATE_CONDA"),
+ envname = "r-tensorflow"
+ )
Using virtual environment 'r-tensorflow' ...
Collecting tensorflow==2.2.0
Downloading tensorflow-2.2.0-cp38-cp38-manylinux2010_x86_64.whl (516.3 MB)
Killed
Error: Error installing package(s): 'tensorflow==2.2.0', 'keras', 'tensorflow-hub', 'h5py', 'pyyaml==3.12', 'requests', 'Pillow', 'scipy'
The same script on my local Ubuntu machine appears to succeed, but it ignores my local virtual environment even though I set WORKON_HOME.
> tensorflow::tf_config()
Installation of TensorFlow not found.
Python environments searched for 'tensorflow' package:
/home/landau/projects/targets-tutorial/miniconda/bin/python3.8
You can install TensorFlow using the install_tensorflow() function.
Example project that uses this general approach: https://github.com/wlandau/targets-keras.

Custom Docker file for Azure ML Environment that contains COPY statements errors with COPY failed: /path no such file or directory

I'm trying to submit an experiment to Azure ML using a Python script.
The Environment being initialised uses a custom Dockerfile.
env = Environment(name="test")
env.docker.base_image = None
env.docker.base_dockerfile = './Docker/Dockerfile'
env.docker.enabled = True
However, the Dockerfile needs a few COPY statements, but those fail as follows:
Step 9/23 : COPY requirements-azure.txt /tmp/requirements-azure.txt
COPY failed: stat /var/lib/docker/tmp/docker-builder701026190/requirements-azure.txt: no such file or directory
The Azure host environment responsible for building the image does not contain the files the Dockerfile requires; those exist on my local development machine from which I initiate the Python script.
I've been searching the whole day for a way to add these files to the environment, but without success.
Below are an excerpt from the Dockerfile and the Python script that submits the experiment.
FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04 as base
COPY ./Docker/requirements-azure.txt /tmp/requirements-azure.txt # <- breaks here
[...]
Here is how I'm submitting the experiment:
from azureml.core import Workspace, Experiment
from azureml.core.environment import Environment
from azureml.core.model import Model
from azureml.core.compute import ComputeTarget
from azureml.train.estimator import Estimator
import os
ws = Workspace.from_config(path='/mnt/azure/config/workspace-config.json')
env = Environment(name="test")
env.docker.base_image = None
env.docker.base_dockerfile = './Docker/Dockerfile'
env.docker.enabled = True
compute_target = ComputeTarget(workspace=ws, name='GRComputeInstance')
estimator = Estimator(
    source_directory='/workspace/',
    compute_target=compute_target,
    entry_script="./src/ml/train/main.py",
    environment_definition=env
)
experiment = Experiment(workspace=ws, name="estimator-test")
run = experiment.submit(estimator)
run.wait_for_completion(show_output=True, wait_post_processing=True)
Any idea?
I think the correct way to set up the requirements for your project is by using "Define an inference configuration", with a conda specification like:
name: project_environment
dependencies:
  - python=3.6.2
  - scikit-learn=0.20.0
  - pip:
      # You must list azureml-defaults as a pip dependency
      - azureml-defaults>=1.0.45
      - inference-schema[numpy-support]
See this
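If you go that route, the conda specification above can be attached to an Environment directly; a minimal sketch (saving the YAML as environment.yml is an assumption, not a file from the question):
from azureml.core import Environment

# Sketch: turn the conda specification above (saved as environment.yml,
# a hypothetical file name) into an Azure ML Environment.
env = Environment.from_conda_specification(
    name="project_environment",
    file_path="./environment.yml",
)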
I think you need to look for "using your own base image", e.g. in the Azure docs here. For building the actual Docker image you have two options:
Build on Azure build servers. Here you need to upload all required files together with your Dockerfile to the build environment. (Alternatively, you could consider making the requirements-azure.txt file available via HTTP, such that the build environment can fetch it from anywhere.)
Build locally with your own Docker installation and upload the final image to the correct Azure registry.
This is just the broad outline; at the moment I can't give more detailed recommendations. Hope it helps anyway.
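For the second option, here is a hedged sketch of how a locally built and pushed image could be referenced from the Environment (the image name, registry address and credentials below are placeholders, not values from the question):
from azureml.core import Environment
from azureml.core.container_registry import ContainerRegistry

# Assumes you already built and pushed the image yourself, e.g.
#   docker build -t myregistry.azurecr.io/my-train-image:1 .
# run from the directory that actually contains requirements-azure.txt.
env = Environment(name="test")
env.docker.enabled = True
env.docker.base_dockerfile = None
env.docker.base_image = "my-train-image:1"            # placeholder image name

registry = ContainerRegistry()
registry.address = "myregistry.azurecr.io"            # placeholder registry
registry.username = "<acr-username>"
registry.password = "<acr-password>"
env.docker.base_image_registry = registry

# The image already contains everything, so tell Azure ML not to build a
# conda environment on top of it.
env.python.user_managed_dependencies = True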

Dependency missing when running AzureML Estimator in docker environment

Scenario description
I'm trying to submit a training script to AzureML (I want to use AmlCompute, but I'm starting/testing locally first, for debugging purposes).
The train.py script I have uses a custom package (arcus.ml) and I believe I have specified the right settings and dependencies, but still I get the error:
User program failed with ModuleNotFoundError: No module named 'arcus.ml'
Code and reproduction
This the python code I have:
name = 'test'
script_params = {
    '--test-par': 0.2
}
est = Estimator(source_directory='./' + name,
                script_params=script_params,
                compute_target='local',
                entry_script='train.py',
                pip_requirements_file='requirements.txt',
                conda_packages=['scikit-learn', 'tensorflow', 'keras'])
run = exp.submit(est)
print(run.get_portal_url())
This is the (fully simplified) train.py script in the test directory:
from arcus.ml import dataframes as adf
from azureml.core import Workspace, Dataset, Datastore, Experiment, Run
# get hold of the current run
run = Run.get_context()
ws = run.get_environment()
print('training finished')
And this is my requirements.txt file
arcus-azureml
arcus-ml
numpy
pandas
azureml-core
tqdm
joblib
scikit-learn
matplotlib
tensorflow
keras
Logs
In the log files of the run, I can see this section, so it seems the external module is being installed anyway.
Collecting arcus-azureml
Downloading arcus_azureml-1.0.3-py3-none-any.whl (3.1 kB)
Collecting arcus-ml
Downloading arcus_ml-1.0.6-py3-none-any.whl (2.1 kB)
It could be that there's an issue with the arcus-ml 1.0.6 wheel; as Anders pointed out, it doesn't seem to have any code. Could you try the earlier version arcus-ml==1.0.5?
I think this error isn't necessarily about Azure ML. I think the error has to do with the difference between using a hyphen and a period in your package name. But I'm a Python packaging newb.
In a new conda environment on my laptop, I ran the following:
> conda create -n arcus python=3.6 -y
> conda activate arcus
> pip install arcus-ml
> python
>>> from arcus.ml import dataframes as adf
ModuleNotFoundError: No module named 'arcus'
When I looked in the env's site-packages folder, I didn't see the arcus/ml folder structure I was expecting. There's no arcus code there at all, only the .dist-info file:
~/opt/anaconda3/envs/arcus/lib/python3.6/site-packages
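If you want to confirm that on your side, a small check along these lines (nothing Azure-specific, just the standard library) should show whether the installed wheel ships any importable arcus code:
import importlib.util

# Probe both the namespace package and the submodule the script imports.
for name in ("arcus", "arcus.ml"):
    try:
        spec = importlib.util.find_spec(name)
        print(name, "->", spec.origin if spec else "not found")
    except ModuleNotFoundError as exc:
        print(name, "->", exc)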

Configure Ipython/Jupyter notebook with Pyspark on AWS EMR v4.0.0

I am trying to use IPython notebook with Apache Spark 1.4.0. I have followed the two tutorials below to set up my configuration:
Installing Ipython notebook with pyspark 1.4 on AWS
and
Configuring IPython notebook support for Pyspark
After finishing the configuration, here is some of the code in the related files:
1. ipython_notebook_config.py
c=get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser =False
c.NotebookApp.port = 8193
2. 00-pyspark-setup.py
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
I also added the following two lines to my .bash_profile:
export SPARK_HOME='home/hadoop/sparl'
source ~/.bash_profile
However, when I run
ipython notebook --profile=pyspark
it shows the message: unrecognized alias '--profile=pyspark' it will probably have no effect.
It seems that the notebook is not configured with PySpark successfully.
Does anyone know how to solve it? Thank you very much.
Following are the software versions:
ipython/Jupyter: 4.0.0
spark 1.4.0
AWS EMR: 4.0.0
python: 2.7.9
By the way, I have read the following, but it doesn't work:
IPython notebook won't read the configuration file
Jupyter notebooks don't have the concept of profiles (as IPython did). The recommended way of launching with a different configuration is e.g.:
JUPYTER_CONFIG_DIR=~/alternative_jupyter_config_dir jupyter notebook
See also issue jupyter/notebook#309, where you'll find a comment describing how to set up Jupyter notebook with PySpark without profiles or kernels.
This worked for me...
Update ~/.bashrc with:
export SPARK_HOME="<your location of spark>"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
(Lookup pyspark docs for those arguments)
Then create a new IPython profile, e.g. pyspark:
ipython profile create pyspark
Then create ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py and add the following lines:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))

spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.6" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
(update versions of py4j and spark to suit your case)
Then mkdir -p ~/.ipython/kernels/pyspark and create the file ~/.ipython/kernels/pyspark/kernel.json with the following contents:
{
    "display_name": "pySpark (Spark 1.6.1)",
    "language": "python",
    "argv": [
        "/usr/bin/python",
        "-m",
        "IPython.kernel",
        "--profile=pyspark",
        "-f",
        "{connection_file}"
    ]
}
Now you should see this kernel, pySpark (Spark 1.6.1), under Jupyter's new-notebook options. You can test it by executing sc; you should see your Spark context.
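As a quick smoke test, something like the following in a notebook cell started with that kernel should do (a sketch; the exact output depends on your Spark version):
# Run in a notebook using the pySpark kernel defined above; 'sc' is
# predefined by pyspark/shell.py, so it is not imported here.
import os

print(os.environ.get("PYSPARK_SUBMIT_ARGS"))  # should end with 'pyspark-shell'
print(sc)                                     # e.g. <pyspark.context.SparkContext at 0x...>
print(sc.parallelize(range(10)).sum())        # trivial job to confirm the context works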
I have tried so many ways to solve this 4.0 version problem, and finally I decided to install IPython version 3.2.3:
conda install 'ipython<4'
It's amazing! I hope it helps all of you!
ref: https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/ace9F4dWZTA
As people commented, in Jupyter you don't need profiles. All you need to do is export the variables for jupyter to find your spark install (I use zsh but it's the same for bash)
emacs ~/.zshrc
export PATH="/Users/hcorona/anaconda/bin:$PATH"
export SPARK_HOME="$HOME/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_SUBMIT_ARGS="--master local[*,8] pyspark-shell"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
It is important to include pyspark-shell in PYSPARK_SUBMIT_ARGS.
I found this guide useful but not fully accurate.
My config is for local mode, but it should work if you change PYSPARK_SUBMIT_ARGS to what you need.
I am having the same problem specifying the --profile argument. It seems to be a general problem with the new version, not related to Spark. If you downgrade to IPython 3.2.1 you will be able to specify the profile again.

Create PySpark Profile for IPython

I followed this link http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/ in order to create a PySpark profile for IPython.
00-pyspark-setup.py
# Configure the necessary Spark environment
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "\python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, '\python\lib\py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, '\python\pyspark\shell.py'))
My problem is that when I type sc in the IPython notebook, I get '' where I should see output similar to <pyspark.context.SparkContext at 0x1097e8e90>.
Any idea how to resolve it?
I was trying to do the same, but had problems. Now, I use findspark (https://github.com/minrk/findspark) instead. You can install it with pip (see https://pypi.python.org/pypi/findspark/):
$ pip install findspark
And then, inside a notebook:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
If you want to avoid this boilerplate, you can put the above 4 lines in 00-pyspark-setup.py.
(Right now I have Spark 1.4.1 and findspark 0.0.5.)
Please try to set a proper value for the SPARK_LOCAL_IP variable, e.g.:
export SPARK_LOCAL_IP="$(hostname -f)"
before you run ipython notebook --profile=pyspark.
If this doesn't help, try to debug your environment by executing the setup script:
python 00-pyspark-setup.py
Maybe you can find some error lines in that way and debug them.
Are you on Windows? I am dealing with the same thing, and a couple of things helped.
In 00-pyspark-setup.py, change this part (match the path to your Spark folder):
# Configure the environment
if 'SPARK_HOME' not in os.environ:
    print 'environment spark not set'
    os.environ['SPARK_HOME'] = 'C:/spark-1.4.1-bin-hadoop2.6'
I am sure you added a new environment variable; if not, this will set it manually.
The next thing I noticed is that if you use IPython 4 (the latest), the config files don't work the same way you see in all the tutorials. You can check whether your config files are being called by adding a print statement or just messing them up so an error gets thrown.
I am using a lower version of IPython (3) and I call it using:
ipython notebook --profile=pyspark
Change the 00-pyspark-setup.py to:
# Configure the necessary Spark environment
import os
import sys

# Spark home
spark_home = os.environ.get("SPARK_HOME")

######## CODE ADDED ########
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"
######## END OF ADDED CODE #########

sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Basically, the added code sets the PYSPARK_SUBMIT_ARGS environment variable to
--master local[2] pyspark-shell, which works for Spark 1.6 standalone.
Now run ipython notebook again. Run os.environ["PYSPARK_SUBMIT_ARGS"] to check whether its value is correctly set. If so, then typing sc should give you the expected output, like <pyspark.context.SparkContext at 0x1097e8e90>.
