snakemake cluster script ImportError snakemake.utils - python

I have a strange issue that comes and goes randomly and I really can't figure out when and why.
I am running a snakemake pipeline like this:
conda activate $myEnv
snakemake -s $snakefile --configfile test.conf.yml --cluster "python $qsub_script" --latency-wait 60 --use-conda -p -j 10 --jobscript "$job_script"
I installed snakemake 5.9.1 (also tried downgrading to 5.5.4) within a conda environment.
This works fine if I just run this command, but when I qsub this command to the PBS cluster I'm using, I get an error. My qsub script looks like this:
#PBS stuff...
source ~/.bashrc
hostname
conda activate PGC_de_novo
cd $workDir
snakefile="..."
qsub_script="pbs_qsub_snakemake_wrapper.py"
job_script="..."
snakemake -s $snakefile --configfile test.conf.yml --cluster "python $qsub_script" --latency-wait 60 --use-conda -p -j 10 --jobscript "$job_script" >out 2>err
And the error message I get is:
...
Traceback (most recent call last):
File "/path/to/pbs_qsub_snakemake_wrapper.py", line 6, in <module>
from snakemake.utils import read_job_properties
ImportError: No module named snakemake.utils
Error submitting jobscript (exit code 1):
...
So it looks like for some reason my cluster script doesn't find snakemake, although snakemake is clearly installed. As I said, this problem keeps coming and going: it stays for a few hours, then goes away for no apparent reason. I guess this indicates an environment problem, but I really can't figure out what, and I've run out of ideas. I've tried:
different conda versions
different snakemake versions
different nodes on the cluster
ssh to the node it just failed on and try to reproduce the error
but nothing. Any ideas where to look? Thanks!
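For context, pbs_qsub_snakemake_wrapper.py is a standard snakemake cluster submission wrapper. A minimal sketch of that kind of script (simplified, not the exact file, and with a Torque-style resource request that may need adapting to your scheduler):

#!/usr/bin/env python3
# Minimal sketch of a PBS submission wrapper used via --cluster "python <wrapper>".
# snakemake appends the generated jobscript path as the last argument.
import sys
import subprocess
from snakemake.utils import read_job_properties

jobscript = sys.argv[-1]
props = read_job_properties(jobscript)        # rule name, threads, resources, ...
threads = props.get("threads", 1)
cmd = ["qsub", "-l", f"nodes=1:ppn={threads}", jobscript]   # adapt the resource string to your scheduler
print(subprocess.check_output(cmd).decode().strip())        # snakemake reads the job id from stdout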

Following @Manavalan Gajapathy's advice, I added print(sys.version) commands both to the snakefile and to the cluster script, and in both cases got a Python version (2.7.5) different from the one indicated in the activated environment (3.7.5).
To cut a long story short - for some reason when I activate the environment within a PBS job, the environment path is added to the $PATH only after /usr/bin, which results in /usr/bin/python being used (which does not have the snakemake package). When the env is activated locally, the env path is added to the beginning of the $PATH, so the right python is used.
I still don't understand this behavior, but at least I could work around it by changing the $PATH. I guess this is not a very elegant solution, but it works for me.
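For anyone debugging something similar, a few lines like these at the top of the Snakefile or the cluster wrapper show which interpreter and which PATH ordering a job actually picked up (essentially the sys.version check above, plus the PATH order):

# Quick sanity check: which python is running, and where does the conda env sit in PATH?
import os
import sys

print(sys.executable)                              # e.g. /usr/bin/python instead of the env's python
print(sys.version)
print(os.environ["PATH"].split(os.pathsep)[:5])    # is the conda env listed before /usr/bin?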

A possibility could be that some cluster nodes don't find the path to the snakemake package, so when a job is submitted to those nodes you get the error.
I don't know if or how that could happen, but if that is the case you could find the offending nodes with something like:
for node in $(pbsnodes -a | grep -v '^ ' | grep -v '^$')
do
    echo "$node"
    ssh "$node" 'python -c "from snakemake.utils import read_job_properties"'
done
(the loop asks pbsnodes for the list of available nodes and tries the import on each one; the exact pbsnodes parsing may differ on your system, but hopefully you get the idea). This at least would narrow down the problem a bit...

Related

Snakemake doesn't activate conda environment correctly

I have a Python module modulename installed in a conda environment called myenvname.
My snakemake file consists of one simple rule:
rule checker2:
    output:
        "tata.txt"
    conda:
        "myenvname"
    script:
        "scripts/test2.py"
The contents of the test2.py are the following:
import modulename

with open("tata.txt", "w") as _f:
    _f.write(modulename.__version__)
When I run the above snakemake file with the command snakemake -j 1 --use-conda --conda-frontend conda I get a ModuleNotFoundError, which would imply that there is no modulename in my specified environment. However, when I do the following:
conda activate myenvname
python workflow/scripts/test2.py
... everything works perfectly. I have no idea what's going on.
The full error is pasted below, with some info omitted for privacy.
Traceback (most recent call last):
File "/OMITTED/.snakemake/scripts/tmpheaxuqjn.test2.py", line 13, in <module>
import cnvpytor as cnv
ModuleNotFoundError: No module named 'MODULENAME'
[Thu Nov 17 18:27:22 2022]
Error in rule checker2:
jobid: 0
output: tata.txt
conda-env: MYENVNAME
RuleException:
CalledProcessError in line 12 of /OMITTED/workflow/snakefile:
Command 'source /apps/qiime2/miniconda3/bin/activate 'MYENVNAME'; set -euo pipefail; /OMITTED/.conda/envs/snakemake/bin/python3.1 /OMITTED/.snakemake/scripts/tmpheaxuqjn.test2.py' returned non-zero exit status 1.
File "/OMITTED/workflow/snakefile", line 12, in __rule_checker2
File "/OMITTED/.conda/envs/snakemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /OMITTED/.snakemake/log/2022-11-17T182715.495739.snakemake.log
EDIT:
Typo in the script fixed; the typo isn't in the script I'm actually running, so it's not the issue here.
EDIT2:
I've tried a few different suggestions from the comments. All attempts are run with the same CLI command snakemake -j 1 --use-conda --conda-frontend conda.
Attempt 1
Rule in the snakemake:
rule checker3:
    output:
        "tata.txt"
    conda:
        "myenvname"
    shell:
        """
        conda env list >> {output}
        conda list >> {output}
        """
In the output file I had the following (I have lots of environments and packages, which I've cut out):
# conda environments:
#
...
myenvname * /OMITTED/.conda/envs/myenvname
...
# packages in environment at /OMITTED/.conda/envs/myenvname:
#
# Name Version Build Channel
...
modulename 1.2 dev_0 <develop>
...
This attempt proves that the conda environment is activated and that this environment has modulename.
Attempt 2
Same as running the script, but I've modified the script to include
import time; time.sleep(30); import modulename
So I can snag a runnable script before it's deleted. The script has the following inserted at the start:
######## snakemake preamble start (automatically inserted, do not edit) ########
import sys; sys.path.extend(['/OMITTED/.conda/envs/snakemake/lib/python3.10/site-packages', '/OMITTED/MYWORKINGDIRECTORY/workflow/scripts']); import pickle; snakemake = pickle.loads(####a lot of stuff here###); from snakemake.logging import logger; logger.printshellcmds = False; __real_file__ = __file__; __file__ = '/OMITTED/MYWORKINGDIRECTORY/workflow/scripts/test3.py';
######## snakemake preamble end #########
I have no idea what to do with this information.
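For what it's worth, a couple of prints added to the script make the effect of that preamble visible (the expected output in the comments mirrors the paths in the traceback above):

# Added to workflow/scripts/test2.py purely for debugging:
import sys
print(sys.executable)   # the snakemake env's python (as in the traceback), not myenvname's
print(sys.path[-2:])    # the preamble extends sys.path with the snakemake env's site-packages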
Attempt 3
Instead of using the script directive, I ran a shell command that runs the Python script.
rule checker4:
    output:
        "tata.txt"
    conda:
        "myenvname"
    shell:
        "python workflow/scripts/test3.py"
It worked (showed no errors), and when I open "tata.txt" I find "1.2", which is the version of my module.
Conclusions
Snakemake actually activates the proper environment, but the problem is in the script part. I have no idea why this is.
There is a similar question here, so this is a duplicate question.
The question is answered. Snakemake actually activates the correct environment, but running a Python script through the script directive conflicts with it. I don't know if this is a bug in snakemake (version is 6.14.0) or intentional. I've solved the problem by running the Python script via a shell command with python workflow/scripts/MyScript.py; it's a bit of a nuisance because I had to include a CLI wrapper for the values that would normally come from the snakemake object (see the sketch below).
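For reference, the CLI wrapper is something along these lines (a sketch with hypothetical argument names; the output path is taken from the command line instead of the snakemake object):

# workflow/scripts/MyScript.py (sketch), invoked from the rule as:
#   shell: "python workflow/scripts/MyScript.py --output {output}"
import argparse
import modulename   # the package installed in myenvname

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", required=True)   # replaces snakemake.output[0]
    args = parser.parse_args()
    with open(args.output, "w") as _f:
        _f.write(modulename.__version__)

if __name__ == "__main__":
    main()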

How to get the location of installed Python package into the shell

I want my users to be able to reference a file in my python package (specifically a docker-compose.yml file) directly from the shell.
I couldn't find a way to get only the location from pip show (and grep-ing "Location" out of its output feels ugly), so my current (somewhat verbose) solution is:
docker compose -f $(python3 -c "import locust_plugins; print(locust_plugins.__path__[0])")/timescale/docker-compose.yml up
Is there a better way?
Edit: I solved it by installing a wrapper command I call locust-compose as part of the package. Not perfect, but it gets the job done:
#!/bin/bash
module_location=$(python3 -c "import locust_plugins; print(locust_plugins.__path__[0])")
set -x
docker compose -f $module_location/timescale/docker-compose.yml "$@"
Most of the support you need for this is in the core setuptools suite.
First of all, you need to make sure the data file is included in your package. In a setup.cfg file you can write:
[options.package_data]
timescale = docker-compose.yml
Now if you pip install . or pip wheel, that will include the Compose file as part of the Python package.
Next, you can retrieve this in Python code using the ResourceManager API:
#!/usr/bin/env python3
# timescale/compose_path.py
import pkg_resources

def main():
    print(pkg_resources.resource_filename('timescale', 'docker-compose.yml'))

if __name__ == '__main__':
    main()
And finally, you can take that script and make it a setuptools entry point script (as distinct from the similarly-named Docker concept), so that you can just run it as a single command.
[options.entry_points]
console_scripts =
    timescale_compose_path = timescale.compose_path:main
Again, if you pip install . into a virtual environment, you should be able to run timescale_compose_path and get the path name out.
Having done all of those steps, you can finally run a simpler
docker-compose -f $(timescale_compose_path) up
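As a side note, on Python 3.9+ the standard library's importlib.resources can do the same lookup without pkg_resources; a minimal sketch assuming the same timescale package layout as above:

# timescale/compose_path.py, importlib.resources variant (Python 3.9+)
from importlib.resources import files

def main():
    print(files("timescale").joinpath("docker-compose.yml"))

if __name__ == "__main__":
    main()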

Setting $PATH via a command with tox

I'm currently using tox to test a Python package, and a Python library (chromedriver-binary) to install chromedriver.
This library creates a script (chromedriver-path) which, when called, outputs the directory where chromedriver is installed. The usual way to use it is to run:
export PATH=$PATH:`chromedriver-path`
I've tried the following without success in tox.ini
setenv=
    PATH = {env:PATH}{:}`chromedriver-path`
This errors as expected:
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'
Implying that the setenv command is never called/run.
commands=
    export PATH=$PATH:`chromedriver-path`
This fails with:
ERROR: InvocationError for command could not find executable export
How do I make this work?
Commands can't change their parent processes' environment variables, and thus can't change the environment variables of subsequent commands launched by forking that parent; they can only set environment variables for themselves or their own children.
If you were able to collect the output of chromedriver-path before starting tox, this would be moot. If it's only available in an environment tox itself creates, then things get a bit more interesting.
One approach you can follow is to wrap the commands that need this path entry in a shim that adds it. Consider changing:
commands=
    pytest ...
to:
commands=
    sh -c 'PATH=$PATH:$(chromedriver-path); exec "$@"' _ pytest ...
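If it's easier to stay in Python, the same idea (extend PATH only for the child process) can be written as a small helper; a sketch using the chromedriver-path script described in the question:

# Sketch: build PATH the way the sh shim above does, then launch the test runner.
import os
import subprocess

driver_dir = subprocess.check_output(["chromedriver-path"]).decode().strip()
env = dict(os.environ, PATH=os.environ["PATH"] + os.pathsep + driver_dir)
subprocess.run(["pytest"], env=env, check=True)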

Run python module from command line

I don't really know how to ask this question, but I can describe what I want to achieve. I will update the question with any suggested edits.
I have a Python module that makes use of some command line arguments. Using the module requires some initial setup outside of the Python interpreter. The Python file that does the setup runs fine, but the problem is that I have to dig through the Python installation to find where that file is located, i.e. I have to do python full-path-to-setup-script.py -a argA -b argB etc. I would like to call the setup script like this:
some-setup-command -a argA -b argB etc.
I want to achieve something like
workon environment_name as in the virtualenv module or
pipenv install as in the pipenv module.
I know both of the above commands call a script of some kind (whether bash or Python). I've tried digging through the source code of virtualenv and pipenv without any success.
I would really appreciate if someone could point me to any necessary resource for coding such programs.
If full-path-to-setup-script.py is executable and has a proper shebang line
#! /usr/bin/env python
then you can
ln -s full-path-to-setup-script.py ~/bin/some-command
provided ~/bin exists and is on your PATH,
and you'll be able to invoke
some-command -a argA -b argB
It's a bit difficult to understand what you're looking for, but python -m is my best guess.
For example, to make a new Jupyter kernel, we call
python -m ipykernel arg --option --option
Where arg is the CLI argument and option is a CLI option, and ipykernel is the module receiving the args and options.
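For this to work with your own setup code, package it (say as a hypothetical package mysetup) and give it a __main__.py that parses those arguments, so that python -m mysetup -a argA -b argB runs your setup logic:

# mysetup/__main__.py  (hypothetical package name)
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-a", dest="arg_a")
    parser.add_argument("-b", dest="arg_b")
    args = parser.parse_args()
    print(args.arg_a, args.arg_b)   # the real setup steps would go here

if __name__ == "__main__":
    main()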
Commands that are callable from the command prompt are located in one of the directories in your system's PATH variable. If you are on Windows, you can see the locations via:
echo %PATH%
Or if you want a nicer readout:
powershell -c "$env:path -split(';')"
One solution is to create a folder, add it to your system's PATH, and then create a callable file that you can run. In this example we will create a folder in your user profile, add it to the path, then create a callable file in that folder.
mkdir %USERPROFILE%\path
set PATH=%PATH%;%USERPROFILE%\path
setx PATH %PATH%
In the folder %USERPROFILE%\path, we create a batch file with the following content:
:: file name:
:: some-command.bat
::
python C:\full\path\to\setup-script.py %*
Now you should be able to call
some-command -a argA -b argB
And the batch file will call python with the script and pass along the arguments you added.
Looking at the above answers, I see no one has mentioned this:
You can of course make the Python file executable (provided it has a shebang line, as mentioned above) with
chmod +x filename.py
and then run it as
./filename.py -a argA -b argB ...
Moreover, you can also remove the extension .py (since it is an executable now) and then run it simply as
./filename -a argA -b argB ...

Skyfield.api loader behaves differently in docker container

I wish to specify to Skyfield a download directory, as documented here:
http://rhodesmill.org/skyfield/files.html
Here is my script:
from skyfield.api import Loader
load = Loader('~/data/skyfield')
# Next line downloads deltat.data, deltat.preds, Leap_Second.dat in ~/data/skyfield
ts = load.timescale()
t = ts.utc(2017,9,13,0,0,0)
stations_url = 'http://celestrak.com/NORAD/elements/stations.txt'
# Next line downloads stations.txt in ~/data/skyfield AND deltat.data, deltat.preds, Leap_Second.dat in $PWD !!!
satellites = load.tle(stations_url)
satellite = satellites['ISS (ZARYA)']
Expected behaviour (works fine outside docker)
The 3 deltat files (deltat.data, deltat.preds and Leap_Second.dat) are downloaded in ~/data/skyfield with load.timescale() and stations.txt is downloaded at the same place with load.tle(stations_url)
Behaviour when run in a container
The 3 deltat files get downloaded twice:
one time in the specified folder at the call load.timescale()
another time in the current directory at the call load.tle(stations_url)
This is frustrating because they already exist at this point and they pollute the current directory. Note that stations.txt ends up in the right place (~/data/skyfield).
If the container is run interactively, then calling exec(open("script.py").read()) in a Python shell gives the normal behaviour again. Can anyone reproduce this issue? It is hard to tell whether it comes from Python, Docker or Skyfield.
The dockerfile is just these 2 lines:
FROM continuumio/anaconda3:latest
RUN conda install -c astropy astroquery && conda install -c anaconda ephem=3.7.6.0 && pip install skyfield
Then (assuming the built image is tagged astro) I run it with:
docker run --rm -w /tmp/working -v $PWD:/tmp/working astro:latest python script.py
And here is the output (provided the folders are empty before the run):
[#################################] 100% deltat.data
[#################################] 100% deltat.preds
[#################################] 100% Leap_Second.dat
[#################################] 100% stations.txt
[#################################] 100% deltat.data
[#################################] 100% deltat.preds
[#################################] 100% Leap_Second.dat
EDIT
Adding -t to docker run did not solve the issue but helped illustrate it even better. I think it may come from Skyfield, because some recent issues on GitHub seem quite similar, although not exactly the same.
The simple solution here is to add -t to your docker run command to allocate a pseudo TTY:
docker run --rm -t -w /tmp/working -v $PWD:/tmp/working astro:latest python script.py
What you are seeing is caused by the way the progress lines are printed, combined with the buffering of non-TTY stdout. The percentage up to 100% is likely printed on a line without newlines, and then after 100% it is printed again with a newline. With buffering, this causes the line to appear twice.
When you run the same command with a TTY, there is no buffering and the lines are printed in real time, so the newlines actually work as desired.
The code path isn't actually running twice :)
See Docker run with pseudoTTY (-t) gives instant stdout, buffering happens without it for another explanation (possibly better than mine).
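A quick way to confirm which mode you are in is to check from inside the container whether stdout is attached to a TTY (a tiny check, not part of the original script):

# Run with and without `docker run -t` to see the difference described above.
import sys
print("stdout isatty:", sys.stdout.isatty())   # True with -t, False without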
