I'm using vcf2maf to annotate variants as part of a Snakemake pipeline:
rule vcf2maf:
    input:
        vcf="vcfs/{sample}.vcf",
        fasta=vep_fasta,
        vep_dir=vep_dir
    output:
        "mafs/{sample}.maf"
    conda:
        "../envs/annotation.yml"
    shell:
        """
        vcf2maf.pl --input-vcf {input.vcf} --output-maf {output} \
            --tumor-id {wildcards.sample}.tumor \
            --normal-id {wildcards.sample}.normal \
            --ref-fasta {input.fasta} --filter-vcf 0 \
            --vep-data {input.vep_dir} --vep-path [need path]
        """
The conda environment has two packages: vcf2maf and vep. vcf2maf needs a path to VEP to run properly, but I'm not sure how to determine VEP's path, since it is stored inside the conda environment, which has a user-specific absolute path. Is there an easy way to get VEP's path so I can pass it to --vep-path?
You could use the Unix which command, e.g.:

veppath=$(dirname "$(which vep)")
vcf2maf.pl --vep-path $veppath ...

Note that --vep-path expects the folder containing the vep script, hence the dirname.
[vep path is] stored inside the conda environment which will have a user specific absolute path
The environment variable CONDA_PREFIX contains the path to the currently active conda environment, so you could also do something like:

vcf2maf.pl --vep-path $CONDA_PREFIX/bin ...

(--vep-path expects the folder containing the vep script, which in a conda environment is $CONDA_PREFIX/bin).
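More generally, any executable in the activated environment can be located this way. A minimal sketch, using sh as a stand-in for vep so it runs anywhere:

```shell
# Find the directory holding an executable on PATH (here: sh).
# In the vcf2maf rule you would do the same with vep, i.e.
#   --vep-path "$(dirname "$(command -v vep)")"
bindir="$(dirname "$(command -v sh)")"
echo "$bindir"
```

In an activated conda environment, this directory is typically the same as $CONDA_PREFIX/bin.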
In Snakemake, conda environments can easily be set up by adding a directive such as conda: "envs/my_environment.yaml" to a rule. This way, YAML files specify which packages to install prior to running the pipeline.
Some software requires a path to third-party software in order to execute specific commands.
An example of this is generating a reference index with RSEM (example from the DeweyLab RSEM GitHub page):
rsem-prepare-reference --gtf mm9.gtf \
    --star \
    --star-path /sw/STAR \
    -p 8 \
    --prep-pRSEM \
    --bowtie-path /sw/bowtie \
    --mappability-bigwig-file /data/mm9.bigWig \
    /data/mm9 \
    /ref/mouse_0
Can I locate or predefine the directory (e.g. [workdir]/.snakemake/conda/STAR) for the STAR aligner software, which is installed via conda in a prior rule?
Currently, one option may be to create a shared environment folder using the command-line option --conda-prefix (see the Snakemake docs on the command-line interface); however, as this is a single-case issue, I would prefer to define this information in the rules.
There are two ways that I've dealt with this.
1: Let Conda Handle PATH
That specific option (--star-path) only needs to be specified if STAR is not on PATH. However, if STAR is included in your YAML for this rule, then Conda will place it on PATH as part of the environment activation, and so that option won't be needed. Same goes for --bowtie-path. Hence, for such a rule the YAML might be something like:
name: rsem
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - rsem
  - star
  - bowtie
As per this thread, consider pinning the package versions down to at least the minor version (e.g., bowtie=1.3).
2: Use config.yaml for Pipeline Options
If for some reason you don't want a fully self-contained pipeline, e.g., your system already has lots of standard genomics software like STAR preinstalled, then you could include entries in your config.yaml that users adjust to their system. For example, here are the relevant parts:
config.yaml
star_path: /sw/STAR
bowtie_path: /sw/bowtie
Snakefile
configfile: "config.yaml"

## this is not a complete rule
rule rsem_prep_ref:
    # needs input, output...
    params:
        star=config['star_path'],
        bowtie=config['bowtie_path']
    threads: 8
    conda: "envs/myenv.yaml"
    shell:
        """
        rsem-prepare-reference --gtf mm9.gtf \
            --star \
            --star-path {params.star} \
            -p {threads} \
            --prep-pRSEM \
            --bowtie-path {params.bowtie} \
            --mappability-bigwig-file /data/mm9.bigWig \
            /data/mm9 \
            /ref/mouse_0
        """
Really, anything your pipeline assumes already exists and is not generated by the pipeline itself should go into your config.yaml (e.g., mm9.gtf or mm9.bigWig).
Note on Sharing Environments
Generally, I advise against trying to share environments. However, you can still conserve space by sharing a package cache across users and making sure environments are created on the same filesystem (this lets Conda use hardlinks instead of copying). You can use the Conda configuration option pkgs_dirs to set package cache locations. If the pipeline itself is already on the same file system as the Conda package cache, I would just let Snakemake use the default location (.snakemake/conda) and not mess with the --conda-prefix argument.
Otherwise, you can give Snakemake the --conda-prefix argument to point to a directory on the same filesystem in which to create conda environments. This should be a rather generic directory in which all environments for the pipeline get located. What was proposed in the OP ([workdir]/.snakemake/conda/STAR) would not work, because Snakemake names environment directories after a hash of the environment file, not after individual packages.
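For illustration, a shared environment directory would be supplied on the command line roughly like this (the path /shared/snakemake-envs is a placeholder):

```
snakemake --use-conda --conda-prefix /shared/snakemake-envs --cores 8
```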
I would like to add a third option to merv's answer. You could use which to dynamically figure out the path (assuming it is available on your system):
rsem-prepare-reference --star-path $(which star) ...
With my virtual environment activated, conda list shows that my pandas version is 0.24.0. When I check with pip list, the version is 0.22.0 (probably an older version that I installed before using conda). When I import pandas in Python (3.6), the pandas version is 0.22.0.
Why is this, and how can I force the conda package to be loaded?
EDIT: MacOS High Sierra 10.13.1
TL;DR is in Possible Fix at the bottom
A few notes, which may or may not answer the question, but I think this is better than dumping everything into comments. These assume that your environment is activated; in these examples, my environment is called new36. I am also on macOS High Sierra, 10.13.6.
Checking conda vs pip locations
First, let's check to make sure conda and pip are both looking in the same environment. To find information surrounding conda, check:
conda info
I get the following:
active environment : new36
active env location : /Users/mm92400/anaconda3/envs/new36
shell level : 1
user config file : /Users/mm92400/.condarc
populated config files : /Users/mm92400/.condarc
conda version : 4.6.8
conda-build version : 3.0.27
python version : 3.6.3.final.0
# extra info excluded
The active env location is what we're concerned with. This should be a directory that contains the directory of pip:
which pip | head -n 1
/Users/mm92400/anaconda3/envs/new36/bin/pip
If pip does not sit in a directory under where conda lives, this could be part of the issue.
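As a quick cross-check (a sketch, assuming python3 is the interpreter of the active environment), sys.prefix reports the environment root that both python and pip should live under:

```shell
# sys.prefix is the root of the running interpreter's environment;
# compare it with conda's "active env location" and `which pip`.
py_prefix="$(python3 -c 'import sys; print(sys.prefix)')"
echo "$py_prefix"
```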
Verifying the import path of python
You should be able to check where python is sourcing files from via sys.path:
import sys
sys.path
['', '/Users/mm92400/anaconda3/envs/new36/lib/python36.zip', '/Users/mm92400/anaconda3/envs/new36/lib/python3.6', '/Users/mm92400/anaconda3/envs/new36/lib/python3.6/lib-dynload', '/Users/mm92400/anaconda3/envs/new36/lib/python3.6/site-packages']
This is a list, and that's important to note. Note how my sys.path does not contain any directories from a base install of conda, nor any of the Framework installs of Python on my Mac. import searches these directories in order ('' is the current working directory) and uses the first instance of a package it finds. If your sys.path has an entry earlier than your conda env's site-packages that contains pandas, that is your problem.
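That ordering can be demonstrated directly. The module name shadowme below is made up purely for this example:

```python
import os
import sys
import tempfile

# "shadowme" is a made-up module name used only for this demo.
# Write a module into a temp directory, put that directory at the
# front of sys.path, and watch import resolve it there first.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "shadowme.py"), "w") as f:
    f.write("WHERE = 'tempdir'\n")

sys.path.insert(0, tmpdir)
import shadowme  # found in tmpdir because it is first on sys.path

print(shadowme.WHERE)  # -> tempdir
```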
Verbose python
You can also verify where the pandas package is being sourced from using the verbose mode of python, python -v:
# you have gotten here by running python -v in the terminal
# there's a whole bunch of comments that pop out that I'm going to omit here
# Now run
import pandas
~snip~
# code object from '/Users/mm92400/anaconda3/envs/new36/lib/python3.6/site-packages/pandas/__pycache__/_version.cpython-36.pyc'
import 'pandas._version' # <_frozen_importlib_external.SourceFileLoader object at 0x107952b00>
import 'pandas' # <_frozen_importlib_external.SourceFileLoader object at 0x104572b38>
Note how the code object path matches where I expect python to source that package from.
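A quicker spot-check than python -v is to inspect the imported module's __file__ attribute; shown here with a stdlib module standing in for pandas:

```python
import json  # stand-in for pandas; the same check works for any module

# __file__ is the path of the file that was actually imported, so you
# can compare it against your conda env's site-packages directory.
print(json.__file__)
```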
Possible Fix
You can hack on sys.path, though I'm not sure how recommended that is. You can prioritize directories in sys.path, without modifying it inside your script, like:
env PYTHONPATH=$(find $CONDA_PREFIX -type d -name "site-packages" | head -n 1) python
which will take you into an interpreter and sys.path will look like:
import sys
sys.path
['', '/Users/mm92400/anaconda3/envs/new36/lib/python3.6/site-packages', ...]
Now the first directory checked is the conda env's site-packages. Because sys.path is a list, it is traversed in order, so the way to prioritize a particular copy of a package is to put its directory first. If I were to write a script like:
import sys
print(f"I prioritized {sys.path[1]}")
And ran it using env PYTHONPATH=$(find $CONDA_PREFIX -type d -name "site-packages" | head -n 1) python somefile.py I would get:
env PYTHONPATH=$(find $CONDA_PREFIX -type d -name "site-packages" | head -n 1) python somefile.py
I prioritized /Users/mm92400/anaconda3/envs/new36/lib/python3.6/site-packages
Alternatively, you can insert into sys.path, but I can say definitively that this is not recommended and quite fragile:
import os, sys

try:
    conda_env = os.environ['CONDA_PREFIX']
except KeyError:
    raise KeyError("The env var $CONDA_PREFIX was not found. Please check that your conda environment was activated")

for root, dirs, files in os.walk(conda_env):
    if 'site-packages' in dirs:
        syspath_add = os.path.join(root, 'site-packages')
        break
else:
    raise FileNotFoundError("Couldn't find site-packages!")

sys.path.insert(0, syspath_add)
sys.path
# ['/Users/mm92400/anaconda3/envs/new36/lib/python3.6/site-packages', '', ...]
Currently using tox to test a python package, and using a python library (chromedriver-binary) to install chromedriver.
This library creates a script (chromedriver-path) which when called outputs the PATH where chromedriver is installed. The usual way to use this is to run:
export PATH=$PATH:`chromedriver-path`
I've tried the following in tox.ini, without success:

setenv =
    PATH = {env:PATH}{:}`chromedriver-path`
This errors as expected:
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'
Implying that the setenv command is never called/run.
commands =
    export PATH=$PATH:`chromedriver-path`
This fails with:
ERROR: InvocationError for command could not find executable export
How do I make this work?
Commands can't change their parent processes' environment variables, and thus can't change the environment variables of subsequent commands launched by forking that parent; they can only set environment variables for themselves or their own children.
If you were able to collect the output of chromedriver-path before starting tox, this would be moot. If it's only available in an environment tox itself creates, then things get a bit more interesting.
One approach you can follow is to wrap the commands that need this PATH entry in a shim that adds it. Consider changing:

commands =
    pytest ...

to:

commands =
    sh -c 'PATH=$PATH:$(chromedriver-path); exec "$@"' _ pytest ...
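The shim pattern itself can be sanity-checked with any command (no chromedriver needed); here the wrapped command just prints the first PATH entry:

```shell
# '_' fills $0; everything after it becomes "$@", i.e. the wrapped
# command, which is exec'd with the modified PATH in its environment.
sh -c 'PATH=/tmp/demo-bin:$PATH; exec "$@"' _ sh -c 'echo "${PATH%%:*}"'
# -> /tmp/demo-bin
```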
How can I get the path of the virtualenv in pipenv?
Can I configure it to use a custom path for a newly created virtualenv?
The following should give you the paths
$ pipenv --where
/home/wonder/workspace/myproj
$ pipenv --venv
/home/wonder/PyEnvs/myproj-BKbQCeJj
Adding to Sewagodimo Matlapeng's answer for the second part of the question:
can configure it to use a custom path for newly created virtualenv?
According to documentation, you can set the base location for the virtualenvs with the environment variable WORKON_HOME. If you want to place the virtualenv specifically in <project>/.venv, set the environment variable PIPENV_VENV_IN_PROJECT.
e.g., running:
export WORKON_HOME=/tmp
pipenv install
would place the virtualenv in /tmp/<projectname>-<hash>.
I created a command to handle this:
https://github.com/tw-yshuang/Ubuntu-Setup-Scripts/blob/8917578f9ad95be03f48608b7068337215d33f92/config/.customfunction#L12
(lines 12-105 of that file)

Usage: pipenv_correspond [OPTION]
OPTION:
  ls,  --list              list all the corresponding project roots & venvs
  uls, --useless           list all project roots that no longer exist but still have corresponding venvs
  npr, --no-project-root   hide project roots
  rm,  --remove            remove all the venvs found by "ls" or "uls"; the default is "uls"

# example
$ pipenv_correspond ls

To enable this command, I recommend creating a ~/.customfunction file, pasting the code into it, and sourcing it from your shell profile:

$ echo '# customfunction\nsource ~/.customfunction' >> <shell_profile>

where <shell_profile> is e.g. ~/.bash_profile or ~/.zshrc. Alternatively, copy lines 12-105 directly into your shell profile.
You can simply use the following command:
pipenv --where
It returns the error "OSError: No such file or directory". We were trying to activate our newly created virtualenv venvCI using the steps in the builder with ShellCommand. It seems like we can't activate the virtualenv venvCI. We're new to this environment, so please help us. Thanks.
from buildbot.steps.shell import ShellCommand

factory = util.BuildFactory()
# STEPS for example-slave:
factory.addStep(ShellCommand(command=['virtualenv', 'venvCI']))
factory.addStep(ShellCommand(command=['source', 'venvCI/bin/activate']))
factory.addStep(ShellCommand(command=['pip', 'install', '-r', 'development.pip']))
factory.addStep(ShellCommand(command=['pyflakes', 'calculator.py']))
factory.addStep(ShellCommand(command=['python', 'test.py']))

c['builders'] = []
c['builders'].append(
    util.BuilderConfig(name="runtests",
                       slavenames=["example-slave"],
                       factory=factory))
Since the build system creates a new shell for every ShellCommand, you can't source env/bin/activate, because that only modifies the active shell's environment. When the shell (and with it the ShellCommand) exits, the environment is gone.
Things you can do:

1. Pass the environment manually to every ShellCommand (read what activate does): env={...}
2. Create a bash script that runs all your commands in a single shell (what I've done in other systems), e.g.:
myscript.sh:
#!/bin/bash
source env/bin/activate
pip install x
python y.py
Buildbot:
factory.addStep(ShellCommand(command=['bash', 'myscript.sh']))
blog post about the issue
Another option is to call the Python executable inside your virtual environment directly, since many Python tools that provide command-line entry points can also be run as modules:
from buildbot.steps.shell import ShellCommand

factory = util.BuildFactory()
# STEPS for example-slave:
factory.addStep(ShellCommand(command=['virtualenv', 'venvCI']))
factory.addStep(ShellCommand(
    command=['./venvCI/bin/python', '-m', 'pip', 'install', '-r', 'development.pip']))
factory.addStep(ShellCommand(
    command=['./venvCI/bin/python', '-m', 'pyflakes', 'calculator.py']))
factory.addStep(ShellCommand(command=['./venvCI/bin/python', 'test.py']))
However, this does get tiresome after a while. You can use string.Template to make helpers:
import shlex
from string import Template

def addstep(cmdline, **kwargs):
    tmpl = Template(cmdline)
    factory.addStep(ShellCommand(
        command=shlex.split(tmpl.safe_substitute(**kwargs))
    ))
Then you can do things like this:
addstep('$python -m pip install pytest', python='./venvCI/bin/python')
These are some ideas to get started. Note that the neat thing about shlex is that it will respect spaces inside quoted strings when doing the split.
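For instance, a minimal illustration of that splitting behavior:

```python
import shlex

# Spaces inside quotes survive as part of a single token,
# just as a shell would parse them.
parts = shlex.split('pip install "some package" --quiet')
print(parts)  # -> ['pip', 'install', 'some package', '--quiet']
```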