Cannot load R packages on Azure Batch nodes - python

I am having difficulty loading packages into R on my compute pool nodes using the Azure Batch Python API. The code that I am using is similar to what is provided in the Azure Batch Python SDK Tutorial, except the task is more complicated -- I want each node in the job pool to execute an R script which requires certain package dependencies.
Hence, in my start task commands below, I have each node (Canonical UbuntuServer, SKU 16) install R via apt and install the R package dependencies. (I moved the package installation into the start task because, even after creating a lib directory ~/Rpkgs with universal permissions, running install.packages(list_of_packages, lib="~/Rpkgs/", repos="http://cran.r-project.org") in the task script leads to "not writable" errors.)
task_commands = [
    'cp -p {} $AZ_BATCH_NODE_SHARED_DIR'.format(_R_TASK_SCRIPT),
    # Install pip
    'curl -fSsL https://bootstrap.pypa.io/get-pip.py | python',
    # Install the azure-storage module so that the task script can access Azure Blob storage, pre-cryptography version
    'pip install azure-storage==0.32.0',
    # Install R
    'sudo apt -y install r-base-core',
    'mkdir ~/Rpkgs/',
    'sudo chown _azbatch:_azbatchgrp ~/Rpkgs/',
    'sudo chmod 777 ~/Rpkgs/',
    # Install R package dependencies
    # *NOTE*: the double escape below is necessary because Azure strips the forward slash
    'printf "install.packages( c(\\"foreach\\", \\"iterators\\", \\"optparse\\", \\"glmnet\\", \\"doMC\\"), lib=\\"~/Rpkgs/\\", repos=\\"https://cran.cnr.berkeley.edu\\")\n" > ~/startTask.txt',
    'R < startTask.txt --no-save'
]
Anyhow, I confirmed in the Azure portal that these packages installed as intended on the compute pool nodes (you can see them located at startup/wd/Rpkgs/, a.k.a. ~/Rpkgs/, in the node filesystem). However, while the _R_TASK_SCRIPT task was successfully added to the job pool, it terminated with a non-zero exit code because it wasn't able to load any of the packages (e.g. foreach, iterators, optparse, etc.) that had been installed in the start task.
More specifically, the _R_TASK_SCRIPT contained the following R code and returned the following output:
R code:
lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, character.only=TRUE, lib.loc="~/Rpkgs/")
...
R stderr, stderr.txt on Azure Batch node:
Loading required package: iterators
Loading required package: foreach
Loading required package: optparse
Loading required package: glmnet
Loading required package: doMC
R stdout, stdout.txt on Azure Batch node:
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] FALSE
[[4]]
[1] FALSE
[[5]]
[1] FALSE
FALSE above indicates that it was not able to load the R package. This is the issue I'm facing, and I'd like to figure out why.
It may be noteworthy that, when I spin up a comparable VM (Canonical UbuntuServer SKU: 16) and run the same installation manually, it successfully loads all packages.
myusername@rnode:~$ pwd
/home/myusername
myusername@rnode:~$ mkdir ~/Rpkgs/
myusername@rnode:~$ printf "install.packages( c(\"foreach\", \"iterators\", \"optparse\", \"glmnet\", \"doMC\"), lib=\"~/Rpkgs/\", repos=\"http://cran.r-project.org\")\n" > ~/startTask.txt
myusername@rnode:~$ R < startTask.txt --no-save
myusername@rnode:~$ R
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
...
> lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, character.only=TRUE, lib.loc="~/Rpkgs/")
Loading required package: iterators
Loading required package: foreach
...
Loading required package: optparse
Loading required package: glmnet
Loading required package: Matrix
Loaded glmnet 2.0-10
Loading required package: doMC
Loading required package: parallel
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] TRUE
[[4]]
[1] TRUE
[[5]]
[1] TRUE
Thanks in advance for your help and suggestions.

Each task runs in its own working directory, which is referenced by the environment variable $AZ_BATCH_TASK_WORKING_DIR. When the R session runs, the current R working directory [ getwd() ] is $AZ_BATCH_TASK_WORKING_DIR, not $AZ_BATCH_NODE_STARTUP_DIR, where the packages live.
To point at the exact package location (startup/wd/Rpkgs) in the R code:
lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require,
        character.only=TRUE,
        lib.loc=paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), "/wd/", "Rpkgs") )
or run this before the lapply:
setwd(paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), "/wd/"))
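A third option, from the Python side (a minimal sketch, not part of the original answer): build the task's command line so R_LIBS already points at the start-task library before the script runs. This assumes the task command is executed through a shell and that the R script was copied into $AZ_BATCH_NODE_SHARED_DIR as in the question; the script name below is a placeholder for _R_TASK_SCRIPT.
# Sketch only: export R_LIBS so library()/require() search the packages the
# start task installed under $AZ_BATCH_NODE_STARTUP_DIR/wd/Rpkgs.
r_task_script = "my_analysis.R"  # placeholder for _R_TASK_SCRIPT

task_command = (
    'export R_LIBS="$AZ_BATCH_NODE_STARTUP_DIR/wd/Rpkgs" && '
    'Rscript "$AZ_BATCH_NODE_SHARED_DIR/{}"'.format(r_task_script)
)
print(task_command)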
Added: You can also create a Batch pool of Azure Data Science Virtual Machines, which already have R installed, so you don't have to install it yourself.
Azure Batch also has the doAzureParallel R package, which supports package installation.
Here's a link: https://github.com/Azure/doAzureParallel (Disclaimer: I created the doAzureParallel R package)

It seems to be caused by the installed packages not being on R's default library paths. Try setting the library trees within which packages are looked for by adding .libPaths("~/Rpkgs") before loading the packages.
For reference, there is an SO thread, Changing R default library path using .libPaths in Rprofile.site fails to work, which you can refer to.
Meanwhile, an official blog post describes how to run R workloads on Azure Batch, though for a Windows environment. Hope it helps.

Related

Python requirements.txt: restrict a dependency to be installed only on Atom processors

I'm using TensorFlow inside an x86_64 environment, but the processor is an Intel Atom. This processor lacks the AVX extension, and since the pre-built wheels for TensorFlow are compiled with the AVX extension, TensorFlow does not work and exits. Hence I had to build my own wheel, which I host on GitHub as a release file.
The problem I have is downloading this pre-built wheel only on an Atom-based processor. I was able to achieve this previously using a setup.py file, where this can be easily detected, but I have migrated to pyproject.toml, which is very poor when it comes to customization and scripted installation support.
Is there anything, in addition to platform_machine=='x86_64', that checks for the processor type? Or has the migration to pyproject.toml killed my flexibility here?
The current requirements.txt is:
confluent-kafka @ https://github.com/HandsFreeGadgets/python-wheels/releases/download/v0.1/confluent_kafka-1.9.2-cp38-cp38-linux_aarch64.whl ; platform_machine=='aarch64'
tensorflow @ https://github.com/HandsFreeGadgets/python-wheels/releases/download/v0.1/tensorflow-2.8.4-cp38-cp38-linux_aarch64.whl ; platform_machine=='aarch64'
tensorflow-addons @ https://github.com/HandsFreeGadgets/python-wheels/releases/download/v0.1/tensorflow_addons-0.17.1-cp38-cp38-linux_aarch64.whl ; platform_machine=='aarch64'
tensorflow-text @ https://github.com/HandsFreeGadgets/python-wheels/releases/download/v0.1/tensorflow_text-2.8.2-cp38-cp38-linux_aarch64.whl ; platform_machine=='aarch64'
rasa==3.4.2
SQLAlchemy==1.4.45
phonetics==1.0.5
de-core-news-md @ https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.4.0/de_core_news_md-3.4.0-py3-none-any.whl
For platform_machine=='aarch64' I need something similar for x86_64 but only executed on Atom processor environments.
The old setup.py was:
import platform
import subprocess
import os
from setuptools import setup

def get_requirements():
    requirements = []
    if platform.machine() == 'x86_64':
        command = "cat /proc/cpuinfo"
        all_info = subprocess.check_output(command, shell=True).strip()
        # AVX extension is the missing important information
        if b'avx' not in all_info or ("NO_AVX" in os.environ and os.environ['NO_AVX']):
            requirements.append('tensorflow @ file://localhost/' + os.getcwd() + '/pip-wheels/amd64/tensorflow-2.3.2-cp38-cp38-linux_x86_64.whl')
    elif platform.machine() == 'aarch64':
        ...
    requirements.append('rasa==3.3.3')
    requirements.append('SQLAlchemy==1.4.45')
    requirements.append('phonetics==1.0.5')
    requirements.append('de-core-news-md @ https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.4.0/de_core_news_md-3.4.0-py3-none-any.whl')
    return requirements

setup(
    ...
    install_requires=get_requirements(),
    ...
)
The line if b'avx' not in all_info or ("NO_AVX" in os.environ and os.environ['NO_AVX']) does the necessary differentiation.
If a pyproject.toml approach is not for my needs, what is recommended for Python with more installation power which is not marked as legacy? Maybe there is something similar for Python what is Gradle for building projects in the Java world, which was introduced to overcome the XML limitations and providing a complete scripting language which I'm not aware of?
My recommendation would be to migrate to pyproject.toml as intended. I would declare dependencies such as tensorflow according to the standard specification for dependencies, but I would not use any direct references at all.
Then I would create some requirements.txt files in which I would list the dependencies that need special treatment (no need to list all dependencies), for example those that require a direct reference (and/or a pinned version). I would probably create one requirements file per platform, for example I would create a requirements-atom.txt.
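A side note on why a separate file per CPU type is needed at all: PEP 508 markers are evaluated against a small, fixed environment describing the platform and interpreter, and there is no field for CPU features such as AVX, so no marker can distinguish an Atom from any other x86_64 machine. A quick way to inspect the available fields, assuming the packaging library is installed:
from packaging.markers import default_environment

# Prints the only fields a requirement marker can test (platform_machine,
# sys_platform, python_version, ...); nothing here describes CPU flags like AVX.
print(default_environment())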
As far as I know it should be possible to instruct pip to install from a remote requirements file via its URL. Something like this:
python -m pip install --requirement 'https://server.tld/path/requirements-atom.txt'
If you need to create multiple requirements.txt files with common parts, then probably a tool like pip-tools can help.
Maybe something like the following (untested):
requirements-common.in
# Application (or main project)
MyApplication @ git+https://github.com/HandsFreeGadgets/MyApplication.git
# Common dependencies
CommonLibrary
AnotherCommonLibrary==1.2.3
requirements-atom.in:
--requirement requirements-common.in
# Atom CPU specific
tensorflow @ https://github.com/HandsFreeGadgets/tensorflow-atom/releases/download/v0.1/tensorflow-2.8.4-cp38-cp38-linux_aarch64.whl ; platform_machine=='aarch64'
pip-compile requirements-atom.in > requirements-atom.txt
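If you also need the AVX detection itself at install time, for example in a small bootstrap script that picks the requirements file before calling pip, here is a minimal sketch that reuses the /proc/cpuinfo check from the old setup.py; the two file names are placeholders:
import platform
import subprocess
import sys

def has_avx():
    # Mirrors the old setup.py check: Linux x86_64 only, looks for the "avx"
    # flag in /proc/cpuinfo.
    if platform.machine() != "x86_64":
        return False
    with open("/proc/cpuinfo", "rb") as f:
        return b"avx" in f.read()

# Placeholder file names: install from the requirements file that matches the CPU.
requirements_file = "requirements.txt" if has_avx() else "requirements-atom.txt"
subprocess.check_call([sys.executable, "-m", "pip", "install", "--requirement", requirements_file])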

How to load private python package when loading a MLFlow model?

I am trying to use a private Python package inside a model based on mlflow.pyfunc.PythonModel.
My conda.yaml looks like
channels:
  - defaults
dependencies:
  - python=3.10.4
  - pip
  - pip:
      - mlflow==2.1.1
      - pandas
      - --extra-index-url <private-pypa-repo-link>
      - <private-package>
name: model_env
python_env.yaml
python: 3.10.4
build_dependencies:
- pip==23.0
- setuptools==58.1.0
- wheel==0.38.4
dependencies:
- -r requirements.txt
requirements.txt
mlflow==2.1.1
pandas
--extra-index-url <private-pypa-repo-link>
<private-package>
When running the following
import mlflow
model_uri = '<run_id>'
# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(model_uri)
# Predict on a Pandas DataFrame.
import pandas as pd
t = loaded_model.predict(pd.read_json("test.json"))
print(t)
The result is
WARNING mlflow.pyfunc: Encountered an unexpected error (InvalidRequirement('Parse error at "\'--extra-\'": Expected W:(0-9A-Za-z)')) while detecting model dependency mismatches. Set logging level to DEBUG to see the full traceback.
Adding the following before loading the model makes it work:
dep = mlflow.pyfunc.get_model_dependencies(model_uri)
print(dep)
import subprocess
import sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", dep])
Is there a way to automatically install these dependencies rather than doing it explicitly? What are my options for getting mlflow to install the private package?
Answering my own question here. Turns out the issue is that I was trying to use the keyring library which needs to be pre-installed and is not supported when doing inference in a virtual environment.
There are ways to get around it though.
1. Add the authentication token to the extra-index-url itself. You can find it documented in this stackoverflow question.
2. MLflow allows you to log any dependencies with the model itself using the code_path argument (link). Using this method, you can skip adding your private package as a requirement. This question also touches on the same topic. The code would look a bit like this:
mlflow.pyfunc.save_model(
    path=dest_path,
    python_model=MyModel(),
    artifacts=_get_artifact_dict(t_dir),
    conda_env=conda_env,
    # Adding the current script file as a dependency; append any other scripts to the list
    code_path=[os.path.realpath(__file__)],
)
Opt for the first approach if saving the authentication token in requirements.txt is feasible; otherwise use the second approach. The downside of the code_path solution is that your package's code gets replicated with each model.
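A hedged variation on the first approach, in case you would rather not commit the token to requirements.txt: export it through pip's PIP_EXTRA_INDEX_URL environment variable and reuse the manual dependency install from the question. The index URL, user, and token below are placeholders.
import os
import subprocess
import sys

import mlflow

model_uri = "<run_id>"

# Placeholder credentials; pip reads PIP_EXTRA_INDEX_URL in addition to any index
# options in the requirements file, and the child pip process inherits it.
os.environ["PIP_EXTRA_INDEX_URL"] = "https://<user>:<token>@<private-pypa-repo-link>/simple"

# Same explicit install as in the question, now able to reach the private index.
dep = mlflow.pyfunc.get_model_dependencies(model_uri)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", dep])

loaded_model = mlflow.pyfunc.load_model(model_uri)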

Makefile target to check for installed dependencies

I have a makefile where I have targets that depend on having some external clients installed (python3, libxml2, etc).
Here is my makefile
.PHONY: test install-packages mac-setup checkenv target help

EXTERNALS = python3 pip3 xmllint pytest pipenv

P := $(foreach exec,$(EXTERNALS),$(if $(shell which $(exec)),missing,$(warning "===>>>WARNING: No required `$(exec)` in PATH, run `make mac-setup` + `make install-packages` <<<===")))

test: ## run all tests in test directory
	pipenv run pytest -v --ignore=path payload_files .

install-packages: ## install python packages listed in Pipfile
	pipenv install

mac-setup: ## setup mac for testing
	brew install libxml2
	brew install python3
	brew install pipenv
	# see https://github.mycompany.com/ea/ea_test_player_unified/blob/master/run-feature.sh

help:
	#grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'

.DEFAULT_GOAL := help
Notice the line
P := $(foreach exec,$(EXTERNALS),$(if $(shell which $(exec)),missing,$(warning "===>>>WARNING: No required `$(exec)` in PATH, run `make mac-setup` + `make install-packages` <<<===")))
This checks for the required binaries, and it works... however, I would rather have a checkenv target that performs this check and errors, so I can attach it to specific targets like test, instead of printing a WARNING that might be overlooked.
Want:
checkenv: # error if which ${binary} fails or *even better* if if binary --version doesn't return the right version: python3 pip3 xmllint pytest pipenv
I tried various techniques that I found around the web, including Stack Overflow, but most either use the technique I am using above, don't use a make target, or just check for one binary. I tried building a loop over an array of binaries but just couldn't get the syntax right, make being a PITA :)
Any suggestions?
Note I'm a python newbie, task is to rewrite some jmeter tests in python....so if you have any thoughts on the above approach feel free to share.
Thanks,
Phil
Don't see what the problem is. It looks very straightforward to me, as make allows using multiple targets on the same line:
EXTERNALS := python3 pip3 xmllint pytest pipenv

python3_version := Python 3.7.3
pip3_version := ...
...

.PHONY: checkenv $(EXTERNALS)

checkenv: $(EXTERNALS)

$(EXTERNALS):
	if [ "`$@ --version`" != "$($@_version)" ]; then echo "$@ check failed"; false; fi

In NixOS, how can I install an environment with the Python packages SpaCy, pandas, and jenks-natural-breaks?

I'm very new to NixOS, so please forgive my ignorance. I'm just trying to set up a Python environment---any kind of environment---for developing with SpaCy, the SpaCy data, pandas, and jenks-natural-breaks. Here's what I've tried so far:
1. pypi2nix -V "3.6" -E gcc -E libffi -e spacy -e pandas -e numpy --default-overrides, followed by nix-build -r requirements.nix -A packages. I've managed to get the first command to work, but the second fails with Could not find a version that satisfies the requirement python-dateutil>=2.5.0 (from pandas==0.23.4)
2. Writing a default.nix that looks like this: with import <nixpkgs> {}; python36.withPackages (ps: with ps; [ spacy pandas scikitlearn ]). This fails with collision between /nix/store/9szpqlby9kvgif3mfm7fsw4y119an2kb-python3.6-msgpack-0.5.6/lib/python3.6/site-packages/msgpack/_packer.cpython-36m-x86_64-linux-gnu.so and /nix/store/d08bgskfbrp6dh70h3agv16s212zdn6w-python3.6-msgpack-python-0.5.6/lib/python3.6/site-packages/msgpack/_packer.cpython-36m-x86_64-linux-gnu.so
3. Making a new virtualenv, and then running pip install on all these packages. Scikit-learn fails to install, with fish: Unknown command 'ar rc build/temp.linux-x86_64-3.6/liblibsvm-skl.a build/temp.linux-x86_64-3.6/sklearn/svm/src/libsvm/libsvm_template.o'
I guess ideally I'd like to install this environment with nix, so that I could enter it with nix-shell, and so other environments could reuse the same python packages. How would I go about doing that? Especially since some of these packages exist in nixpkgs, and others are only on Pypi.
Caveat
I had trouble with jenks-natural-breaks to the tune of
nix-shell ❯ poetry run python -c 'import jenks_natural_breaks'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/matt/2022/12/28-2/.venv/lib/python3.10/site-packages/jenks_natural_breaks/__init__.py", line 5, in <module>
from ._jenks_matrices import ffi as _ffi
ModuleNotFoundError: No module named 'jenks_natural_breaks._jenks_matrices'
So I'm going to use jenkspy, which appears to be a bit livelier. If that doesn't scratch your itch, I'd contact the maintainer of jenks-natural-breaks for guidance.
Flakes
you said:
so other environments could reuse the same python packages
Which makes me think that a flake.nix is what you need. What's cool about flakes is that you can define an environment that has spacy, pandas, and jenkspy with one flake. And then you (or somebody else) might say:
I want an env like Jonathan's, except I also want sympy
and rather than copying your env and making tweaks, they can declare your env as a build input and write a flake.nix with their modifications--which can be further modified by others.
One could imagine a sort of family-tree of environments, so you just need to pick the one that suits your task. The python community has not yet converged on this vision.
Poetry
Poetry will treat you like you're trying to publish a library when all you asked for is an environment, but a library's dependencies are pretty much an environment so there's nothing wrong with having an empty package and just using poetry as an environment factory.
Bonus: if you decide to publish a library after all, you're ready.
The Setup
Nix flakes think in terms of git repos, so we'll start with one:
$ git init
Then create a file called flake.nix. Usually I end up with poetry handling 90% of the python stuff, but both pandas and spacy are in that 10% that has dependencies which link to system libraries. So we ask nix to install them so that when poetry tries to install them in the nix develop shell, it has what it needs.
{
  description = "Jonathan's awesome env";

  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs";
    flake-utils.url = "github:numtide/flake-utils";  # needed for flake-utils.lib.eachSystem below
  };

  outputs = { self, nixpkgs, flake-utils }: (flake-utils.lib.eachSystem [
    "x86_64-linux"
    "x86_64-darwin"
    "aarch64-linux"
    "aarch64-darwin"
  ] (system:
    let
      pkgs = nixpkgs.legacyPackages.${system};
    in
    rec {
      packages.jonathansenv = pkgs.poetry2nix.mkPoetryApplication {
        projectDir = ./.;
      };
      defaultPackage = packages.jonathansenv;
      devShell = pkgs.mkShell {
        buildInputs = [
          pkgs.poetry
          pkgs.python310Packages.pandas
          pkgs.python310Packages.spacy
        ];
      };
    }));
}
Now we let git know about the flake and enter the environment:
❯ git add flake.nix
❯ nix develop
$
Then we initialize the poetry project. I've found that poetry, installed by nix, is kind of odd about which python it uses by default, so we'll set it explicitly
$ poetry init # follow prompts
$ poetry env use $(which python)
$ poetry run python --version
Python 3.10.9 # declared in the flake.nix
At this point, we should have a pyproject.toml:
[tool.poetry]
name = "jonathansenv"
version = "0.1.0"
description = ""
authors = ["Your Name <you#example.com>"]
readme = "README.md"
[tool.poetry.dependencies]
python = "^3.10"
jenkspy = "^0.3.2"
spacy = "^3.4.4"
pandas = "^1.5.2"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Usage
Now we create the venv that poetry will use, and run a command that depends on these.
$ poetry install
$ poetry run python -c 'import jenkspy, spacy, pandas'
You can also have poetry put you in a shell:
$ poetry shell
(venv)$ python -c 'import jenkspy, spacy, pandas'
It's kind of awkward to do so, though, because we're two subshells deep and any shell customizations that we have in the grandparent shell are not available. So I recommend using direnv to enter the dev shell whenever I navigate to that directory, and then just using poetry run ... to run commands in the environment.
Publishing the env
In addition to running nix develop with the flake.nix in your current dir, you can also do nix develop /local/path/to/repo or nix develop github:/githubuser/githubproject to achieve the same result.
To demonstrate the github example, I have pushed the files referenced above here. So you ought to be able to run this from any linux shell with nix installed:
❯ nix develop github:/MatrixManAtYrService/nix-flake-pandas-spacy
$ poetry install
$ poetry run python -c 'import jenkspy, spacy, pandas'
I say "ought" because if I run that command on a mac it complains about linux-headers-5.19.16 being unsupported on x86_64-darwin.
Presumably there's a way to write the flake (or fix a package) so that it doesn't insist on building linux stuff on a mac, but until I figure it out I'm afraid that this is only a partial answer.

How would you install a python module with chef?

We're using EngineYard which has Python installed by default. But when we enabled SSL we received the following error message from our logentries chef recipe.
"WARNING: The "ssl" module is not present. Using unreliable workaround, host identity cannot be verified. Please install "ssl" module or newer version of Python (2.6) if possible."
I'm looking for a way to install the SSL module with chef recipe but I simply don't have enough experience. Could someone point me in the right direction?
Resources:
Logentries chef recipe: https://github.com/logentries/le_chef
Logentries EY docs: https://logentries.com/doc/engineyard/
SSL Module: http://pypi.python.org/pypi/ssl/
There now appears to be a solution with better community support (based on the fact that it is documented on the opscode website).
You might try:
include_recipe 'python'
python_pip 'ssl'
As documented: here or here
I just wrote a recipe for this, and now am able to run the latest Logentries client on EngineYard. Here you go:
file_dir = "/mnt/src/python-ssl"
file_name = "ssl-1.15.tar.gz"
file_path = File.join(file_dir,file_name)
uncompressed_file_dir = File.join(file_dir, file_name.split(".tar.gz").first)
directory file_dir do
owner "deploy"
group "deploy"
mode "0755"
recursive true
action :create
end
remote_file file_path do
source "http://pypi.python.org/packages/source/s/ssl/ssl-1.15.tar.gz"
mode "0644"
not_if { File.exists?(file_path) }
end
execute "gunzip ssl" do
command "gunzip -c #{file_name} | tar xf -"
cwd file_dir
not_if { File.exists?(uncompressed_file_dir) }
end
installed_file_path = File.join(uncompressed_file_dir, "installed")
execute "install python ssl module" do
command "python setup.py install"
cwd uncompressed_file_dir
not_if { File.exists?(installed_file_path) }
end
execute "touch #{installed_file_path}" do
action :run
end
You could install a new Python using PythonBrew: https://github.com/utahta/pythonbrew. Just make sure you install libssl before you build, or it still won't be able to use SSL. However, based on the warning, it seems that SSL might work, but it won't be able to verify the host. Of course, that is one of the major purposes of SSL, so that is likely a non-starter.
HTH
