Snakemake - How to set conda environment path - python

In Snakemake, conda environments can easily be set up by declaring them in rules, e.g. conda: "envs/my_environment.yaml". This way, YAML files specify which packages to install before the pipeline runs.
Some software requires a path to third-party software in order to execute specific commands.
An example of this is generating a reference index with RSEM (example from the DeweyLab RSEM GitHub page):
rsem-prepare-reference --gtf mm9.gtf \
--star \
--star-path /sw/STAR \
-p 8 \
--prep-pRSEM \
--bowtie-path /sw/bowtie \
--mappability-bigwig-file /data/mm9.bigWig \
/data/mm9 \
/ref/mouse_0
Can I locate or predefine the directory (e.g. [workdir]/.snakemake/conda/STAR) for the STAR aligner software, which is installed via conda in a prior rule?
Currently, one option may be to create a shared environment folder using the command-line option --conda-prefix (see Snakemake docs - Command-line interface); however, as this is a single-case issue, I would prefer to define this information in the rules.

There are two ways that I've dealt with this.
1: Let Conda Handle PATH
That specific option (--star-path) only needs to be specified if STAR is not on PATH. However, if STAR is included in your YAML for this rule, then Conda will place it on PATH as part of environment activation, so that option won't be needed. The same goes for --bowtie-path. Hence, for such a rule the YAML might be something like:
name: rsem
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - rsem
  - star
  - bowtie
As per this thread, consider pinning the packages to at least a minor version (e.g., bowtie=1.3).
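For instance, a pinned variant of the dependencies section might look like the following (the version numbers here are purely illustrative, not a recommendation):
dependencies:
  - rsem=1.3
  - star=2.7
  - bowtie=1.3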
2: Use config.yaml for Pipeline Options
If for some reason you don't want a fully self-contained pipeline, e.g., your system already has lots of standard genomics software like STAR preinstalled, then you could include entries in your config.yaml through which users adjust the pipeline to their system. For example, here are the relevant parts:
config.yaml
star_path: /sw/STAR
bowtie_path: /sw/bowtie
Snakefile
configfile: "config.yaml"

## this is not a complete rule
rule rsem_prep_ref:
    # needs input, output...
    params:
        star=config['star_path'],
        bowtie=config['bowtie_path']
    threads: 8
    conda: "envs/myenv.yaml"
    shell:
        """
        rsem-prepare-reference --gtf mm9.gtf \
        --star \
        --star-path {params.star} \
        -p {threads} \
        --prep-pRSEM \
        --bowtie-path {params.bowtie} \
        --mappability-bigwig-file /data/mm9.bigWig \
        /data/mm9 \
        /ref/mouse_0
        """
Really, anything your pipeline assumes to already exist and that is not generated by the pipeline itself should go into your config.yaml (e.g., mm9.gtf or mm9.bigWig).
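Following that logic, the config.yaml above might grow to something like this (the extra keys are illustrative, not from the original answer):
star_path: /sw/STAR
bowtie_path: /sw/bowtie
gtf: /data/mm9.gtf
mappability_bigwig: /data/mm9.bigWig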
Note on Sharing Environments
Generally, I advise against trying to share environments. However, you can still conserve space by sharing a package cache across users and making sure environments are created on the same filesystem (this lets Conda use hardlinks instead of copying). You can use the Conda configuration option pkgs_dirs to set package cache locations. If the pipeline itself is already on the same file system as the Conda package cache, I would just let Snakemake use the default location (.snakemake/conda) and not mess with the --conda-prefix argument.
Otherwise, you can give Snakemake the --conda-prefix argument to point to a directory on the same file system in which to create Conda environments. This should be a rather generic directory in which all environments for the pipeline get located. What was proposed in the OP ([workdir]/.snakemake/conda/STAR) would not make sense, since Snakemake names environment directories by a hash of the environment file, not by package.

I would like to add a third option to @merv's answer. You could use which to dynamically figure out the path (assuming it is available on your system):
rsem-prepare-reference --star-path $(which star) ...
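One caveat (my reading of the RSEM docs, not part of the original answer): the Bioconda package installs the binary as STAR, and --star-path is documented as the directory containing the executable, so you may need something like:
rsem-prepare-reference --star-path $(dirname "$(which STAR)") ...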


Running multiple tox testenv matching a name fragment

Rationale
With a complex dependency matrix, tox testenv names end up being a list like
py37-pytest5-framework1
py37-pytest5-framework2
py37-pytest6-framework1
py37-pytest6-framework2
py38-pytest5-framework1
py38-pytest5-framework2
py38-pytest6-framework1
py38-pytest6-framework2
...
py310-pytest6-framework2
While the inner tox.ini syntax allows configuring a lot of things with name fragments, e.g.
[testenv]
basepython =
    py37: python3.7
    py38: python3.8
    py39: python3.9
    py310: python3.10
deps =
    pytest5: pytest ~= 5.0
    pytest6: pytest ~= 6.0
    framework1: framework ~= 1.0
    framework2: framework ~= 2.0
setenv =
    framework2: FOO=bar
I find there is no way to tell the tox CLI to run all testenvs matching a name fragment, as in tox -e py39 or tox -e framework2.
Issues
The main drawback is that CI testing jobs will usually end up segregated by Python version, so you end up writing instructions like
tox -e $PY-pytest5-framework1,$PY-pytest5-framework2,$PY-pytest6-framework1,$PY-pytest6-framework2
but then the CI jobs definition is coupled to the tox test matrix because it must be aware of:
testenvs being added or removed
matrix exclusions like pytest-5 is not compatible with python-3.10
And this is cumbersome to maintain.
Incomplete workaround
An easy workaround is simply running tox --skip-missing-interpreters, but the drawbacks are:
CI jobs can't be segregated by framework version instead of Python version, for example to reuse some special framework cache
CI VMs could feature system Python installations beyond the one targeted by each job, so you could end up with e.g. python-3.8 being run in all CI jobs.
Question
Am I missing some out-of-the-box mechanism to filter the testenvs to be run by a name fragment, which would let me write CI jobs agnostic to the tox dependency matrix? I mean something like tox -e '*-framework2'.
Or am I bound to filtering and aggregating the output of tox --listenvs with shell tricks?
You could negate a regex pattern via the TOX_SKIP_ENV environment variable, as in the following:
$ env TOX_SKIP_ENV='.*[^-framework2]$' tox
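Note that the character class in that pattern happens to work for this particular naming scheme rather than acting as a general negation. As for the --listenvs route mentioned in the question, a shell sketch might be:
tox -e "$(tox --listenvs | grep -- '-framework2$' | paste -sd, -)"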
tox 4, which will be introduced within the next couple of months, introduces labels. While this may not be an immediate help for your problem, maybe you will see a way to simplify your tox.ini.
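As a sketch of what labels could look like (syntax as documented for tox 4; the label name and env list are illustrative):
[tox]
labels =
    framework2 = py39-pytest6-framework2, py310-pytest6-framework2
which you would then run with tox run -m framework2.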

Creating a python package (deb/rpm) from cmake

I am trying to create a python package (deb & rpm) from cmake, ideally using cpack. I did read
https://cmake.org/cmake/help/latest/cpack_gen/rpm.html and,
https://cmake.org/cmake/help/latest/cpack_gen/deb.html
The installation works just fine (using component install) for my shared library. However, I cannot make sense of the documentation when it comes to installing the python binding (glue) code. Using the standard cmake install mechanism, I tried:
install(
  FILES __init__.py library.py
  DESTINATION ${ACME_PYTHON_PACKAGE_DIR}/project_name
  COMPONENT python)
And then, using a brute-force approach, I ended up with:
# debian based package (relative path)
set(ACME_PYTHON_PACKAGE_DIR lib/python3/dist-packages)
and
# rpm based package (full path required)
set(ACME_PYTHON_PACKAGE_DIR /var/lang/lib/python3.8/site-packages)
The above is derived from:
debian % python -c 'import site; print(site.getsitepackages())'
['/usr/local/lib/python3.9/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.9/dist-packages']
while:
rpm % python -c 'import site; print(site.getsitepackages())'
['/var/lang/lib/python3.8/site-packages']
It is pretty clear that the brute-force approach will not be portable and is doomed to fail on the next release of Python. The only possible solution I can think of is generating a temporary setup.py script (using setuptools) that will do the install. Typically cmake would call the following process:
% python setup.py install --root ${ACME_PYTHON_INSTALL_ROOT}
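(As an aside, one hedge against the hard-coded layouts above would be to ask the interpreter itself at configure time; a sketch, reusing the variable name from above, though note it reports the layout of whatever interpreter cmake found, which may not match the target distribution's packaging layout:)
find_package(Python COMPONENTS Interpreter REQUIRED)
execute_process(
  COMMAND ${Python_EXECUTABLE} -c "import sysconfig; print(sysconfig.get_path('purelib'))"
  OUTPUT_VARIABLE ACME_PYTHON_PACKAGE_DIR
  OUTPUT_STRIP_TRAILING_WHITESPACE)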
My questions are:
Did I understand the cmake/cpack documentation correctly for python package ? If so this means I need to generate an intermediate setup.py script.
I have been searching through the cmake/cpack codebase (git grep setuptools) but did not find helper functions to handle generation of setup.py and passing the resulting files back to cpack. Is there an existing cmake module I could re-use?
I did read some alternative solutions, such as:
How to build debian package with CPack to execute setup.py?
which seems overly complex and geared toward Debian-based systems only. I need to handle RPM in my case.
As mentioned in my other solution, the ugly part is dealing with absolute paths in cmake install() commands. I was able to refactor the code to avoid the use of absolute paths in install(). I simply changed the installation to:
install(
  # trailing slash is important:
  DIRECTORY ${SETUP_OUTPUT}/
  # "." syntax is a reliable mechanism, see:
  # https://gitlab.kitware.com/cmake/cmake/-/issues/22616
  DESTINATION "."
  COMPONENT python)
And then one simply needs to:
set(CMAKE_INSTALL_PREFIX "/")
set(CPACK_PACKAGING_INSTALL_PREFIX "/")
include(CPack)
At this point, all install paths need to include /usr explicitly, since we've cleared the value of CMAKE_INSTALL_PREFIX.
The above has been tested for deb and rpm packages, and CPACK_BINARY_TGZ also runs properly with this solution:
https://gitlab.kitware.com/cmake/cmake/-/issues/22925
I am going to post the temporary solution I am using at the moment, until someone provides something more robust.
So I eventually managed to stumble upon:
https://alioth-lists.debian.net/pipermail/libkdtree-devel/2012-October/000366.html and,
Using CMake with setup.py
Re-using the above to do an install step instead of a build step can be done as follows:
find_package(Python COMPONENTS Interpreter)
set(SETUP_PY_IN "${CMAKE_CURRENT_SOURCE_DIR}/setup.py.in")
set(SETUP_PY "${CMAKE_CURRENT_BINARY_DIR}/setup.py")
set(SETUP_DEPS "${CMAKE_CURRENT_SOURCE_DIR}/project_name/__init__.py")
set(SETUP_OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/build-python")
configure_file(${SETUP_PY_IN} ${SETUP_PY})
add_custom_command(
  OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/setup_timestamp
  COMMAND ${Python_EXECUTABLE} ARGS ${SETUP_PY} install --root ${SETUP_OUTPUT}
  COMMAND ${CMAKE_COMMAND} -E touch ${CMAKE_CURRENT_BINARY_DIR}/setup_timestamp
  DEPENDS ${SETUP_DEPS})
add_custom_target(target ALL DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/setup_timestamp)
And then the ugly part is:
install(
  # trailing slash is important:
  DIRECTORY ${SETUP_OUTPUT}/
  DESTINATION "/" # FIXME may cause issues with other cpack generators
  COMPONENT python)
It turns out that the documentation for install() is pretty clear about absolute paths:
https://cmake.org/cmake/help/latest/command/install.html#introduction
DESTINATION
[...]
As absolute paths are not supported by cpack installer generators,
it is preferable to use relative paths throughout.
For reference, here is my setup.py.in:
from setuptools import setup

if __name__ == '__main__':
    setup(name='project_name_python',
          version='${PROJECT_VERSION}',
          package_dir={'': '${CMAKE_CURRENT_SOURCE_DIR}'},
          packages=['project_name'])
You can be fancy and remove the __pycache__ folder using the -B flag:
COMMAND ${Python_EXECUTABLE} ARGS -B ${SETUP_PY} install --root ${SETUP_OUTPUT}
You can be extra fancy and add a Debian-specific option such as:
if(CPACK_BINARY_DEB)
  set(EXTRA_ARG "--install-layout" "deb")
endif()
and use it as:
COMMAND ${Python_EXECUTABLE} ARGS -B ${SETUP_PY} install --root ${SETUP_OUTPUT} ${EXTRA_ARG}

How to refer to executable inside anaconda environment in Snakemake

I'm using vcf2maf to annotate variants as part of a Snakemake pipeline:
rule vcf2maf:
    input:
        vcf="vcfs/{sample}.vcf",
        fasta=vep_fasta,
        vep_dir=vep_dir
    output:
        "mafs/{sample}.maf"
    conda:
        "../envs/annotation.yml"
    shell:
        """
        vcf2maf.pl --input-vcf {input.vcf} --output-maf {output} \
        --tumor-id {wildcards.sample}.tumor \
        --normal-id {wildcards.sample}.normal \
        --ref-fasta {input.fasta} --filter-vcf 0 \
        --vep-data {input.vep_dir} --vep-path [need path]
        """
The conda environment has two packages: vcf2maf and vep. vcf2maf requires a path to vep to run properly, but I'm not sure how to access vep's path, since it's stored inside the conda environment, which will have a user-specific absolute path. Is there an easy way to get vep's path so I can refer to it for --vep-path?
You could use the Unix which command, like:
veppath=`which vep`
vcf2maf.pl --vep-path $veppath ...
[vep path is] stored inside the conda environment which will have a user specific absolute path
The variable CONDA_PREFIX contains the path to the currently active conda environment, so you could also do something like:
vcf2maf.pl --vep-path $CONDA_PREFIX/bin/vep ...
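One caveat (my reading of the vcf2maf docs, not part of the original answers): --vep-path is described as the folder containing the vep script, so you may need to pass the bin directory rather than the binary itself, e.g.:
vcf2maf.pl --vep-path "$CONDA_PREFIX/bin" ...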

In NixOS, how can I install an environment with the Python packages SpaCy, pandas, and jenks-natural-breaks?

I'm very new to NixOS, so please forgive my ignorance. I'm just trying to set up a Python environment---any kind of environment---for developing with SpaCy, the SpaCy data, pandas, and jenks-natural-breaks. Here's what I've tried so far:
pypi2nix -V "3.6" -E gcc -E libffi -e spacy -e pandas -e numpy --default-overrides, followed by nix-build -r requirements.nix -A packages. I've managed to get the first command to work, but the second fails with Could not find a version that satisfies the requirement python-dateutil>=2.5.0 (from pandas==0.23.4)
Writing a default.nix that looks like this:
with import <nixpkgs> {};
python36.withPackages (ps: with ps; [ spacy pandas scikitlearn ])
This fails with collision between /nix/store/9szpqlby9kvgif3mfm7fsw4y119an2kb-python3.6-msgpack-0.5.6/lib/python3.6/site-packages/msgpack/_packer.cpython-36m-x86_64-linux-gnu.so and /nix/store/d08bgskfbrp6dh70h3agv16s212zdn6w-python3.6-msgpack-python-0.5.6/lib/python3.6/site-packages/msgpack/_packer.cpython-36m-x86_64-linux-gnu.so
Making a new virtualenv, and then running pip install on all these packages. Scikit-learn fails to install, with fish: Unknown command 'ar rc build/temp.linux-x86_64-3.6/liblibsvm-skl.a build/temp.linux-x86_64-3.6/sklearn/svm/src/libsvm/libsvm_template.o'
I guess ideally I'd like to install this environment with nix, so that I could enter it with nix-shell, and so other environments could reuse the same python packages. How would I go about doing that? Especially since some of these packages exist in nixpkgs, and others are only on Pypi.
Caveat
I had trouble with jenks-natural-breaks to the tune of
nix-shell ❯ poetry run python -c 'import jenks_natural_breaks'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/matt/2022/12/28-2/.venv/lib/python3.10/site-packages/jenks_natural_breaks/__init__.py", line 5, in <module>
from ._jenks_matrices import ffi as _ffi
ModuleNotFoundError: No module named 'jenks_natural_breaks._jenks_matrices'
So I'm going to use jenkspy, which appears to be a bit livelier. If that doesn't scratch your itch, I'd contact the maintainer of jenks-natural-breaks for guidance.
Flakes
you said:
so other environments could reuse the same python packages
Which makes me think that a flake.nix is what you need. What's cool about flakes is that you can define an environment that has spacy, pandas, and jenkspy with one flake. And then you (or somebody else) might say:
I want an env like Jonathan's, except I also want sympy
and rather than copying your env and making tweaks, they can declare your env as a build input and write a flake.nix with their modifications, which can be further modified by others.
One could imagine a sort of family-tree of environments, so you just need to pick the one that suits your task. The python community has not yet converged on this vision.
Poetry
Poetry will treat you like you're trying to publish a library when all you asked for is an environment, but a library's dependencies are pretty much an environment so there's nothing wrong with having an empty package and just using poetry as an environment factory.
Bonus: if you decide to publish a library after all, you're ready.
The Setup
Nix flakes think in terms of git repos, so we'll start with one:
$ git init
Then create a file called flake.nix. Usually I end up with poetry handling 90% of the Python stuff, but both pandas and spacy are in that 10% whose dependencies link to system libraries. So we ask nix to install them, so that when poetry tries to install them in the nix develop shell, it has what it needs.
{
  description = "Jonathan's awesome env";
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs";
    flake-utils.url = "github:numtide/flake-utils";
  };
  outputs = { self, nixpkgs, flake-utils }: (flake-utils.lib.eachSystem [
    "x86_64-linux"
    "x86_64-darwin"
    "aarch64-linux"
    "aarch64-darwin"
  ] (system:
    let
      pkgs = nixpkgs.legacyPackages.${system};
    in
    rec {
      packages.jonathansenv = pkgs.poetry2nix.mkPoetryApplication {
        projectDir = ./.;
      };
      defaultPackage = packages.jonathansenv;
      devShell = pkgs.mkShell {
        buildInputs = [
          pkgs.poetry
          pkgs.python310Packages.pandas
          pkgs.python310Packages.spacy
        ];
      };
    }));
}
Now we let git know about the flake and enter the environment:
❯ git add flake.nix
❯ nix develop
$
Then we initialize the poetry project. I've found that poetry, installed by nix, is kind of odd about which Python it uses by default, so we'll set it explicitly:
$ poetry init # follow prompts
$ poetry env use $(which python)
$ poetry run python --version
Python 3.10.9 # declared in the flake.nix
At this point, we should have a pyproject.toml:
[tool.poetry]
name = "jonathansenv"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
jenkspy = "^0.3.2"
spacy = "^3.4.4"
pandas = "^1.5.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Usage
Now we create the venv that poetry will use, and run a command that depends on these.
$ poetry install
$ poetry run python -c 'import jenkspy, spacy, pandas'
You can also have poetry put you in a shell:
$ poetry shell
(venv)$ python -c 'import jenkspy, spacy, pandas'
It's kind of awkward to do so, though, because we're two subshells deep and any shell customizations that we have in the grandparent shell are not available. So I recommend using direnv to enter the dev shell whenever you navigate to that directory, and then just using poetry run ... to run commands in the environment.
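For reference, the direnv setup can be a one-line .envrc in the repo (this assumes the nix-direnv integration, which provides the use flake function):
use flake
followed by a one-time direnv allow in that directory.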
Publishing the env
In addition to running nix develop with the flake.nix in your current dir, you can also do nix develop /local/path/to/repo or nix develop github:/githubuser/githubproject to achieve the same result.
To demonstrate the github example, I have pushed the files referenced above here. So you ought to be able to run this from any linux shell with nix installed:
❯ nix develop github:/MatrixManAtYrService/nix-flake-pandas-spacy
$ poetry install
$ poetry run python -c 'import jenkspy, spacy, pandas'
I say "ought" because if I run that command on a mac it complains about linux-headers-5.19.16 being unsupported on x86_64-darwin.
Presumably there's a way to write the flake (or fix a package) so that it doesn't insist on building linux stuff on a mac, but until I figure it out I'm afraid that this is only a partial answer.

How to run Tox with Travis-CI

How do you test different Python versions with Tox from within Travis-CI?
I have a tox.ini:
[tox]
envlist = py{27,33,34,35}
recreate = True

[testenv]
basepython =
    py27: python2.7
    py33: python3.3
    py34: python3.4
    py35: python3.5
deps =
    -r{toxinidir}/pip-requirements.txt
    -r{toxinidir}/pip-requirements-test.txt
commands = py.test
which runs my Python unittests in several Python versions and works perfectly.
I want to set up a build in Travis-CI to automatically run this when I push changes to GitHub, so I have a .travis.yml:
language: python
python:
  - "2.7"
  - "3.3"
  - "3.4"
  - "3.5"
install:
  - pip install tox
script:
  - tox
This technically seems to work, but it redundantly runs all my tests in each version of Python...from each version of Python. So a build that takes 5 minutes now takes 45 minutes.
I tried removing the python list from my yaml file, so Travis will only run a single Python instance, but that causes my Python3.5 tests to fail because the 3.5 interpreter can't be found. Apparently, that's a known limitation as Travis-CI won't install Python3.5 unless you specify that exact version in your config...but it doesn't do that for the other versions.
Is there a way I can workaround this?
For this I would consider using tox-travis, a plugin which combines Travis CI's multiple Python versions with tox's full configurability.
To use it, configure your .travis.yml like this:
sudo: false
language: python
python:
  - "2.7"
  - "3.4"
install: pip install tox-travis
script: tox
This will run the appropriate testenvs, which by default are any declared envs with py27 or py34 as factors of the name. If no environments match a given factor, py27 or py34 will be used as a fallback.
Further Reading
For more control and flexibility you can manually define your matrix so that the Python version and tox environment match up:
language: python
matrix:
  include:
    - python: 2.7
      env: TOXENV=py27
    - python: 3.3
      env: TOXENV=py33
    - python: 3.4
      env: TOXENV=py34
    - python: 3.5
      env: TOXENV=py35
    - python: pypy
      env: TOXENV=pypy
    - env: TOXENV=flake8
install:
  - pip install tox
script:
  - tox
In case it's not obvious, each entry in the matrix starts on a line which begins with a hyphen (-); any indented lines that follow belong to that same item.
For example, all entries except the last are two lines. The last entry is only one line and does not contain a python setting; therefore, it simply uses the default Python version (Python 2.7, according to the Travis documentation). Of course, a specific Python version is not as important for that test. If you wanted to run such a test against both Python 2 and 3 (once each), it is recommended to use the versions Travis installs by default (2.7 and 3.4) so that the tests complete more quickly, as they don't need to install a non-standard Python version first. For example:
    - python: 2.7
      env: TOXENV=flake8
    - python: 3.4
      env: TOXENV=flake8
The same works with pypy (second-to-last entry in the matrix) and pypy3 (not shown), in addition to Python versions 2.5-3.6.
While the various other answers provide shortcuts which give you this result in the end, sometimes it's helpful to define the matrix manually. Then you can define specific things for individual environments within the matrix. For example, you can define dependencies for only a single environment and avoid wasting time installing that dependency in every environment.
    - python: 3.5
      env: TOXENV=py35
    - env: TOXENV=checkspelling
      before_install: install_spellchecker.sh
    - env: TOXENV=flake8
In the above matrix, the install_spellchecker.sh script is only run for the relevant environment, but not the others. The before_install setting was used (rather than install), as using the install setting would have overridden the global install setting. However, if that's what you want (to override/replace a global setting), simply redefine it in the matrix entry. No doubt, various other settings could be defined for individual environments within the matrix as well.
Manually defining the matrix can provide a lot of flexibility. However, if you don't need the added flexibility, one of the various shortcuts in the other answers will keep your config file simpler and easier to read and edit later on.
Travis provides the Python version for each test as TRAVIS_PYTHON_VERSION, but in the form '3.4', while tox expects 'py34'.
If you don't want to rely on an external lib (tox-travis) to do the translation, you can do it manually:
language: python
python:
  - "2.7"
  - "3.3"
  - "3.4"
  - "3.5"
install:
  - pip install tox
script:
  - tox -e $(echo py$TRAVIS_PYTHON_VERSION | tr -d .)
Search for this pattern in a search engine and you'll find many projects using it.
This works for pypy as well:
tox -e $(echo py$TRAVIS_PYTHON_VERSION | tr -d . | sed -e 's/pypypy/pypy/')
Source: flask-mongoengine's .travis.yml.
The TOXENV environment variable can be used to select a subset of tests for each version of Python via a specified matrix:
language: python
python:
  - "2.7"
  - "3.4"
  - "3.5"
env:
  matrix:
    - TOXENV=py27-django-19
    - TOXENV=py27-django-110
    - TOXENV=py27-django-111
    - TOXENV=py34-django-19
    - TOXENV=py34-django-110
    - TOXENV=py34-django-111
    - TOXENV=py35-django-19
    - TOXENV=py35-django-110
    - TOXENV=py35-django-111
install:
  - pip install tox
script:
  - tox -e $TOXENV
In the tox config, specify that missing versions of Python should be skipped:
[tox]
skip_missing_interpreters=true
