How to use singularity and conda wrappers in Snakemake - python

TL;DR: I'm getting the following error:
The 'conda' command is not available inside your singularity container image. Snakemake mounts your conda installation into singularity. Sometimes, this can fail because of shell restrictions. It has been tested to work with docker://ubuntu, but it e.g. fails with docker://bash
I created a Snakemake workflow and converted the shell: commands to rule-based package management via Snakemake wrappers.
However, I ran into issues running this on an HPC, and one of the HPC support staff strongly recommended against using conda on any HPC system:
"if the builder [of wrapper] is not super careful, dynamic libraries present in the conda environment that relies on the host libs (there are always a couple present because builder are most of the time carefree) will break. I think that relying on Singularity for your pipeline would make for a more robust system." - Anon
I did some reading over the weekend, and according to this document it's possible to combine containers with conda-based package management by defining a global conda Docker container plus per-rule YAML files.
Note: In contrast to the example in the link above (Figure 5.4), which uses a predefined YAML and a shell: command, here I use conda wrappers, which download these YAML files into the Singularity container (if I'm thinking about this correctly), so I thought it should work the same way - see the Note: at the end though.
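For reference, here is a minimal sketch of the two flavours side by side; the rule names, input/output paths and the envs/fastqc.yaml file are illustrative only, not part of my actual workflow:

# Figure 5.4 style: global conda container + hand-written per-rule YAML + shell command
singularity: "docker://continuumio/miniconda3:4.5.11"

rule fastqc_shell:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "qc/{sample}_fastqc.html"
    conda:
        "envs/fastqc.yaml"  # hand-written environment file
    shell:
        "fastqc {input} --outdir qc"

# Wrapper style: the wrapper ships its own environment.yaml, which Snakemake
# materialises inside the container when run with --use-singularity --use-conda
rule fastqc_wrapper:
    input:
        "reads/{sample}.fastq.gz"
    output:
        html="qc/{sample}_fastqc.html",
        zip="qc/{sample}_fastqc.zip"
    wrapper:
        "0.66.0/bio/fastqc"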
Snakefile, config.yaml and samples.txt
Snakefile
# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))

import os
import pandas as pd

# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)

# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))

# get number of samples
len_samples = len(sample_names)

# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"

# Rules -----------------------------------------------------------------------

rule all:
    input:
        "resources/vep/plugins",
        "resources/vep/cache"

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,
        time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    resources:
        mem=1000,
        time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between-workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"
config.yaml
# Files
REF_GENOME: "c_elegans.PRJNA13758.WS265.genomic.fa"
GENOME_ANNOTATION: "c_elegans.PRJNA13758.WS265.annotations.gff3"
# Tools
QC_TOOL: "fastQC"
TRIM_TOOL: "trimmomatic"
ALIGN_TOOL: "bwa"
MARKDUP_TOOL: "picard"
CALLING_TOOL: "varscan"
ANNOT_TOOL: "vep"
samples.txt
sample location
MTG324 /home/moldach/wrappers/SUBSET/MTG324_SUBSET
Submission
snakemake --profile slurm --use-singularity --use-conda --jobs 2
Logs
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 get_vep_cache
1
[Mon Sep 21 15:35:50 2020]
rule get_vep_cache:
output: resources/vep/cache
log: logs/vep/cache.log
jobid: 0
resources: mem=1000, time=30
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/VEP/.snakemake/conda/774ea575
[Mon Sep 21 15:36:38 2020]
Finished job 0.
1 of 1 steps (100%) done
Note: Leaving --use-conda out of the workflow submission causes an error for get_vep_cache: /bin/bash: vep_install: command not found
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 download_vep_plugins
1
[Mon Sep 21 15:35:50 2020]
rule download_vep_plugins:
output: resources/vep/plugins
jobid: 0
resources: mem=1000, time=30
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/VEP/.snakemake/conda/9f602d9a
[Mon Sep 21 15:35:56 2020]
Finished job 0.
1 of 1 steps (100%) done
The problem occurs when adding the FastQC rules (qc_before_trim_r1 and qc_before_trim_r2):
Updated Snakefile
# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))

import os
import pandas as pd

# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)

# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))

# get number of samples
len_samples = len(sample_names)

# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"

# Rules -----------------------------------------------------------------------

rule all:
    input:
        "resources/vep/plugins",
        "resources/vep/cache",
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,
        time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    resources:
        mem=1000,
        time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between-workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"

def getHome(sample):
    return(list(os.path.join(samples_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))

rule qc_before_trim_r1:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.zip"),
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R1.log")
    resources:
        mem=1000,
        time=30
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"

rule qc_before_trim_r2:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[1]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.zip"),
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R2.log")
    resources:
        mem=1000,
        time=30
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"
Error reported in nohup.out
Building DAG of jobs...
Pulling singularity image https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0.
CreateCondaEnvironmentException:
The 'conda' command is not available inside your singularity container image. Snakemake mounts your conda installation into singularity. Sometimes, this can fail because of shell restrictions. It has been tested to work with docker://ubuntu, but it e.g. fails with docker://bash
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 247, in create
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 381, in __new__
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 394, in __init__
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 417, in _check
Using shell: instead of wrapper:
I changed the wrapper: directive back into a shell: command, and this is the error I get when submitting:
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 qc_before_trim_r2
1
[Mon Sep 21 16:32:54 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/6740cb07e67eae01644839c9767bdca5.simg
WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_CA.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Skipping '/home/moldach/wrappers/SUBSET/MTG324_SUBSET/MTG324_R2.fastq.gz' which didn't exist, or couldn't be read
Waiting at most 60 seconds for missing files.
MissingOutputException in line 84 of /home/moldach/wrappers/SUBSET/VEP/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 60 seconds:
qc/fastQC/before_trim/MTG324_R2_fastqc.html
qc/fastQC/before_trim/MTG324_R2_fastqc.zip
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 231, in handle_job_success
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The error Skipping '/home/moldach/wrappers/SUBSET/MTG324_SUBSET/MTG324_R2.fastq.gz' which didn't exist, or couldn't be read is misleading, because the file does exist...
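One possible cause (an assumption on my part, not confirmed by the logs above) is that the directory holding the reads is not bind-mounted into the container, so the path exists on the host but is invisible inside Singularity. In that case the host directory can be bound explicitly through --singularity-args, for example:

snakemake --profile slurm --use-singularity --use-conda --jobs 2 \
    --singularity-args "--bind /home/moldach/wrappers/SUBSET/MTG324_SUBSET"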
Update 2
Following the advice of Manavalan Gajapathy, I've eliminated defining singularity at two different levels (global + per-rule).
Now I'm using a singularity container only at the global level and running the wrappers via --use-conda, which creates the conda environments inside the container:
# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))

import os
import pandas as pd

# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)

# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))

# get number of samples
len_samples = len(sample_names)

# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"

# Rules -----------------------------------------------------------------------

rule all:
    input:
        "resources/vep/plugins",
        "resources/vep/cache",
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,
        time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    resources:
        mem=1000,
        time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between-workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"

def getHome(sample):
    return(list(os.path.join(samples_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))

rule qc_before_trim_r1:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.zip"),
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R1.log")
    resources:
        mem=1000,
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"

rule qc_before_trim_r2:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[1]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.zip"),
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R2.log")
    resources:
        mem=1000,
        time=30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"
and submit the workflow the same way as before.
However, I'm still getting an error:
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 qc_before_trim_r2
1
[Tue Sep 22 12:44:03 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.
Activating singularity image /home/moldach/wrappers/SUBSET/OMG/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/OMG/.snakemake/conda/c591f288
Skipping '/work/mtgraovac_lab/MATTS_SCRATCH/rep1_R2.fastq.gz' which didn't exist, or couldn't be read
Skipping ' 2> logs/fastQC/rep1_R2.log' which didn't exist, or couldn't be read
Failed to process qc/fastQC/before_trim
java.io.FileNotFoundException: qc/fastQC/before_trim (Is a directory)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:73)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Traceback (most recent call last):
File "/home/moldach/wrappers/SUBSET/OMG/.snakemake/scripts/tmpiwwprg5m.wrapper.py", line 35, in <module>
shell(
File "/mnt/snakemake/snakemake/shell.py", line 205, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; fastqc qc/fastQC/before_trim --quiet -t 1 --outdir /tmp/tmps93snag8 /work/mtgraovac_lab/MATTS_SCRATCH/rep1_R2.fastq.gz ' 2> logs/fastQC/rep1_R$
[Tue Sep 22 12:44:16 2020]
Error in rule qc_before_trim_r2:
jobid: 0
output: qc/fastQC/before_trim/rep1_R2_fastqc.html, qc/fastQC/before_trim/rep1_R2_fastqc.zip
log: logs/fastQC/rep1_R2.log (check log file(s) for error message)
conda-env: /home/moldach/wrappers/SUBSET/OMG/.snakemake/conda/c591f288
RuleException:
CalledProcessError in line 97 of /home/moldach/wrappers/SUBSET/OMG/Snakefile:
Command ' singularity exec --home /home/moldach/wrappers/SUBSET/OMG --bind /home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages:/mnt/snakemake /home/moldach/wrappers/SUBSET/OMG/.snakemake$
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
File "/home/moldach/wrappers/SUBSET/OMG/Snakefile", line 97, in __rule_qc_before_trim_r2
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/concurrent/futures/thread.py", line 57, in run
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Reproducibility
To replicate this you can download this small dataset:
git clone https://github.com/CRG-CNAG/CalliNGS-NF.git
cp CalliNGS-NF/data/reads/rep1_*.fq.gz .
mv rep1_1.fq.gz rep1_R1.fastq.gz
mv rep1_2.fq.gz rep1_R2.fastq.gz
UPDATE 3: Bind Mounts
According to the link shared on mounting:
"By default Singularity bind mounts /home/$USER, /tmp, and $PWD into your container at runtime."
Thus, for simplicity (and also because I got errors using --singularity-args), I've moved the required files into /home/$USER and tried to run from there.
(snakemake) [~]$ pwd
/home/moldach
(snakemake) [~]$ ll
total 3656
drwx------ 26 moldach moldach 4096 Aug 27 17:36 anaconda3
drwx------ 2 moldach moldach 4096 Sep 22 10:11 bin
-rw------- 1 moldach moldach 265 Sep 22 14:29 config.yaml
-rw------- 1 moldach moldach 1817903 Sep 22 14:29 rep1_R1.fastq.gz
-rw------- 1 moldach moldach 1870497 Sep 22 14:29 rep1_R2.fastq.gz
-rw------- 1 moldach moldach 55 Sep 22 14:29 samples.txt
-rw------- 1 moldach moldach 3420 Sep 22 14:29 Snakefile
and ran with bash -c "nohup snakemake --profile slurm --use-singularity --use-conda --jobs 4 &"
However, I still get this odd error:
Activating conda environment: /home/moldach/.snakemake/conda/fdae4f0d
Skipping ' 2> logs/fastQC/rep1_R2.log' which didn't exist, or couldn't be read
Failed to process qc/fastQC/before_trim
java.io.FileNotFoundException: qc/fastQC/before_trim (Is a directory)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:73)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Traceback (most recent call last):
Why does it think it's being given a directory?
Note: If you submit with only --use-conda, e.g. bash -c "nohup snakemake --profile slurm --use-conda --jobs 4 &", there is no error from the fastqc rules. However, --use-conda alone is not 100% reproducible; case in point, it doesn't work on another HPC I tested it on.
The full log in nohup.out when using --printshellcmds can be found at this gist

TLDR:
The fastqc singularity container used in the qc rules most likely doesn't have conda available inside it, which doesn't satisfy what Snakemake's --use-conda expects.
Explanation:
You have singularity containers defined at two different levels: 1. a global-level container that is used for all rules unless it is overridden at the rule level; 2. per-rule containers that apply only to the rule that defines them.
# global singularity container to use
singularity: "docker://continuumio/miniconda3:4.5.11"

# singularity container defined at rule level
rule qc_before_trim_r1:
    ....
    ....
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
When you use --use-singularity and --use-conda together, jobs will be run in a conda environment inside the singularity container, so the conda command needs to be available inside that container. While this requirement is clearly satisfied for your global-level container, I am quite certain (though I haven't tested it) that this is not the case for your fastqc container.
The way snakemake works, if the --use-conda flag is supplied it will create the conda environment either locally or inside the container, depending on whether --use-singularity is supplied as well. Since you are using a snakemake-wrapper for the qc rules, and the wrapper comes with a pre-defined conda env recipe, the easiest solution here is to just use the globally defined miniconda container for all rules. That is, there is no need to use a fastqc-specific container for the qc rules.
If you really want to use the fastqc container, then you shouldn't be using the --use-conda flag, but of course this means that all necessary tools must be available in the container(s) defined globally or per rule.
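In concrete terms, the qc rules would then keep only the wrapper and rely on the global container. A trimmed sketch of what that could look like (untested, with the output/log paths written out literally for brevity):

singularity: "docker://continuumio/miniconda3:4.5.11"  # global container, has conda

rule qc_before_trim_r1:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html="qc/fastQC/before_trim/{sample}_R1_fastqc.html",
        zip="qc/fastQC/before_trim/{sample}_R1_fastqc.zip"
    log:
        "logs/fastQC/{sample}_R1.log"
    threads: 1
    # no per-rule singularity: directive; the wrapper's conda environment is
    # created inside the global miniconda container
    wrapper:
        "0.66.0/bio/fastqc"

The workflow is then submitted with both flags, as in the question:

snakemake --profile slurm --use-singularity --use-conda --jobs 2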

Related

mypy: how to ignore specified files (not talking about how to ignore errors)

I want to let mypy ignore a specified file. I'm not talking about just ignoring errors (which is already answered in many places); rather, I want mypy to completely ignore specified files.
When applying mypy to the jax library, mypy somehow hangs. For example, let's say there is a file named main.py containing only a single line:
import jax
Mypy somehow takes a really long time (over 5 minutes) when run on main.py, with the following log (abbreviated). Thus I want to let mypy ignore a specific file.
h-ishida@stone-jsk:/tmp/test$ mypy -v main.py
LOG: Could not load plugins snapshot: @plugins_snapshot.json
LOG: Mypy Version: 0.991
LOG: Config File: Default
LOG: Configured Executable: /usr/bin/python3
LOG: Current Executable: /usr/bin/python3
LOG: Cache Dir: .mypy_cache
LOG: Compiled: True
LOG: Exclude: []
LOG: Found source: BuildSource(path='main.py', module='main', has_text=False, base_dir='/tmp/test', followed=False)
LOG: Could not load cache for main: main.meta.json
LOG: Metadata not found for main
LOG: Parsing main.py (main)
LOG: Could not load cache for jax: jax/__init__.meta.json
LOG: Metadata not found for jax
LOG: Parsing /home/h-ishida/.local/lib/python3.8/site-packages/jax/__init__.py (jax)
LOG: Metadata fresh for builtins: file /home/h-ishida/.local/lib/python3.8/site-packages/mypy/typeshed/stdlib/builtins.pyi
LOG: Metadata fresh for jax._src.cloud_tpu_init: file /home/h-ishida/.local/lib/python3.8/site-packages/jax/_src/cloud_tpu_init.py
LOG: Could not load cache for jax._src.basearray: jax/_src/basearray.meta.json
LOG: Metadata not found for jax._src.basearray
...
LOG: Found 163 SCCs; largest has 196 nodes
LOG: Processing 161 queued fresh SCCs
LOG: Processing SCC of size 196 (jax.scipy jax._src.lib jax._src.config jax.experimental.x64_context jax._src.iree jax.config jax._src.pretty_printer jax._src.util jax.util jax._src.lax.stack jax._src.state.types jax._src.lax.svd jax._src.nn.initializers jax._src.lax.qdwh jax._src.debugger.web_debugger jax._src.debugger.colab_debugger jax._src.debugger.cli_debugger jax._src.debugger.core jax._src.clusters.cloud_tpu_cluster jax._src.clusters.slurm_cluster jax._src.clusters jax._src.lax.utils jax._src.profiler jax._src.ops.scatter jax._src.numpy.vectorize jax._src.lax.ann jax._src.lax.other jax._src.errors jax._src.abstract_arrays jax._src.traceback_util jax._src.source_info_util jax._src.callback jax._src.lib.xla_bridge jax._src.sharding jax.lib jax._src.tree_util jax._src.environment_info jax._src.lax.eigh jax._src.lib.mlir.dialects jax.lib.xla_bridge jax._src.stages jax.nn.initializers jax._src.debugger jax._src.distributed jax._src.api_util jax.tree_util jax.profiler jax.ops jax.errors jax._src.typing jax._src.array jax._src.basearray jax._src.state.primitives jax._src.numpy.ndarray jax._src.custom_transpose jax._src.ad_checkpoint jax._src.lax.convolution jax._src.lax.slicing jax._src.debugging jax.linear_util jax.interpreters.mlir jax._src.dtypes jax._src.dispatch jax._src.device_array jax.stages jax.distributed jax.api_util jax.core jax._src.state.discharge jax._src.lax.control_flow.common jax._src.numpy.util jax._src.ad_util jax._src.lax.parallel jax.custom_transpose jax.ad_checkpoint jax.interpreters.pxla jax.interpreters.xla jax.interpreters.partial_eval jax.dtypes jax.debug jax.abstract_arrays jax._src.state jax._src.third_party.numpy.linalg jax._src.numpy.fft jax._src.lax.control_flow.solves jax._src.lax.control_flow.conditionals jax._src.numpy.reductions jax._src.numpy.index_tricks jax.interpreters.batching jax.interpreters.ad jax.sharding jax._src.custom_derivatives jax._src.custom_batching jax.numpy.fft jax._src.lax.lax jax._src.lax.linalg jax.custom_derivatives jax.custom_batching jax.lax.linalg jax._src.api jax.experimental.global_device_array jax._src.prng jax._src.numpy.ufuncs jax._src.lax.fft jax jax._src.numpy.linalg jax._src.lax.control_flow.loops jax.experimental.maps jax._src.lax.windowed_reductions jax._src.image.scale jax._src.numpy.lax_numpy jax.experimental.pjit jax._src.ops.special jax.numpy.linalg jax._src.numpy.setops jax._src.numpy.polynomial jax._src.lax.control_flow jax.image jax._src.random jax._src.nn.functions jax.numpy jax.lax jax.random jax.nn jax._src.dlpack jax.experimental.compilation_cache.compilation_cache jax.experimental jax.interpreters jax._src.lax jax.dlpack jax._src.scipy.cluster.vq jax._src.scipy.stats.gennorm jax._src.scipy.stats.chi2 jax._src.scipy.stats.uniform jax._src.scipy.stats.t jax._src.scipy.stats.pareto jax._src.scipy.stats.multivariate_normal jax._src.scipy.stats.laplace jax._src.scipy.stats.expon jax._src.scipy.stats.cauchy jax._src.third_party.scipy.betaln jax._src.scipy.sparse.linalg jax._src.third_party.scipy.signal_helper jax._src.scipy.fft jax._src.scipy.stats._core jax._src.scipy.ndimage jax._src.scipy.linalg jax._src.third_party.scipy.interpolate jax.scipy.cluster.vq jax.scipy.stats.gennorm jax.scipy.stats.chi2 jax.scipy.stats.uniform jax.scipy.stats.t jax.scipy.stats.pareto jax.scipy.stats.multivariate_normal jax.scipy.stats.laplace jax.scipy.stats.expon jax.scipy.stats.cauchy jax._src.scipy.special jax.scipy.sparse.linalg jax._src.scipy.signal jax._src.third_party.scipy.linalg jax.scipy.fft jax.scipy.ndimage 
jax.scipy.interpolate jax._src.scipy.stats.betabinom jax._src.scipy.stats.nbinom jax._src.scipy.stats.multinomial jax.scipy.cluster jax.scipy.special jax.scipy.sparse jax.scipy.signal jax.scipy.linalg jax._src.scipy.stats.poisson jax._src.scipy.stats.norm jax._src.scipy.stats.logistic jax._src.scipy.stats.geom jax._src.scipy.stats.gamma jax._src.scipy.stats.dirichlet jax._src.scipy.stats.beta jax._src.scipy.stats.bernoulli jax.scipy.stats.betabinom jax.scipy.stats.nbinom jax.scipy.stats.multinomial jax._src.scipy.stats.kde jax._src.scipy.stats.truncnorm jax.scipy.stats.poisson jax.scipy.stats.norm jax.scipy.stats.logistic jax.scipy.stats.geom jax.scipy.stats.gamma jax.scipy.stats.dirichlet jax.scipy.stats.beta jax.scipy.stats.bernoulli jax.scipy.stats.truncnorm jax.scipy.stats) as inherently stale
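For reference, a common way to make mypy skip a package entirely is a per-module override in the mypy configuration; a sketch, assuming a mypy.ini next to main.py (follow_imports = skip stops mypy from analysing jax at all, and ignore_missing_imports silences the resulting stub warnings):

# mypy.ini (hypothetical location)
[mypy]

[mypy-jax.*]
follow_imports = skip
ignore_missing_imports = True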

Snakemake MissingOutputException when writing list to file

I'm having a MissingOutputException with what I think is a very basic rule. I'm trying to print a list given through the config file into a file using some Python commands, but Snakemake keeps throwing a MissingOutputException:
# --- Importing Configuration Files --- #
configfile: "config.yaml"

# -------------------------------------------------
scaffolds = config["Scaffolds"]

localrules: all, MakeScaffoldList

# -------------------------------------------------
rule all:
    input:
        LIST = "scaffolds.list"

# -------------------------------------------------
rule MakeScaffoldList:
    output:
        LIST = "scaffolds.list"
    params:
        SCAFFOLDS = scaffolds
    run:
        """
        with open(output.LIST, 'w') as f:
            for line in params.SCAFFOLDS:
                f.write(f"{line}\n")
        """
Error:
[Thu Nov 17 14:08:33 2022]
localrule MakeScaffoldList:
output: scaffolds.list
jobid: 1
resources: mem_mb=27200, disk_mb=1000, tmpdir=/scratch, account=snic2022-22-156, partition=core, time=12:00:00, threads=4
MissingOutputException in line 37 of test.smk:
Job Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
scaffolds.list completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
What am I doing wrong? Is it the Python code wrong?
If you want to include Python code directly in your Snakefile, you have to remove the quotation marks around your Python code in the run directive:
scaffolds = ["dummy", "entries"]

localrules: all, MakeScaffoldList

# -------------------------------------------------
rule all:
    input:
        LIST = "scaffolds.list"

# -------------------------------------------------
rule MakeScaffoldList:
    output:
        LIST = "scaffolds.list"
    params:
        SCAFFOLDS = scaffolds
    run:
        with open(output.LIST, 'w') as f:
            for line in params.SCAFFOLDS:
                f.write(f"{line}\n")
works.

Docker images builds with Bazel on Apple M1

I set up a new project that uses Bazel to build and package my Python applications.
The project uses rules_python to set up a py_binary rule with my source files and rules_docker to set up a py_image rule to build my image.
I am successfully able to bazel run the py_binary by itself.
But when trying to run the py_image rule, the image builds successfully, yet running the binary entry point fails with the following error:
INFO: Analyzed target //demo:demo_img (110 packages loaded, 12496 targets configured).
INFO: Found 1 target...
Target //demo:demo_img up-to-date:
bazel-out/darwin_arm64-fastbuild-ST-9e3a93240a9e/bin/demo/demo_img-layer.tar
INFO: Elapsed time: 8.722s, Critical Path: 3.94s
INFO: 31 processes: 13 internal, 18 darwin-sandbox.
INFO: Build completed successfully, 31 total actions
INFO: Build completed successfully, 31 total actions
f83e52040704: Loading layer [==================================================>] 147.1MB/147.1MB
Loaded image ID: sha256:078152695f1056177bd21cd96171245f42f7415f5a1ff802b20fbd973eecddfd
Tagging 078152695f1056177bd21cd96171245f42f7415f5a1ff802b20fbd973eecddfd as bazel/demo:demo_img
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Traceback (most recent call last):
File "/app/demo/demo_img.binary", line 392, in <module>
Main()
File "/app/demo/demo_img.binary", line 382, in Main
os.execv(args[0], args)
OSError: [Errno 8] Exec format error: '/app/demo/demo_img.binary.runfiles/python3_8_aarch64-apple-darwin/bin/python3'
Taking a look at the generated image
docker run -it --entrypoint sh bazel/demo:demo_img
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
/app/demo/demo_img.binary.runfiles/__main__ # uname -a
Linux c94f44a24832 5.10.104-linuxkit #1 SMP PREEMPT Thu Mar 17 17:05:54 UTC 2022 aarch64 Linux
My current setup also uses a hermetic Python interpreter following this blog post: https://thethoughtfulkoala.com/posts/2020/05/16/bazel-hermetic-python.html
I am assuming that this problem exists due to the mismatch in OS type: the Python binary is built with an interpreter targeting apple/darwin, whereas the image is Linux-based?
How do I configure py_image to build a binary for Linux when developing on an M1 MacBook?
Appendix:
The following files are part of my sample project:
__main__.py
from flask import Flask

app = Flask(__name__)

@app.route("/", methods=["GET"])
def root():
    return "OK"

if __name__ == "__main__":
    app.run()
BUILD.bazel
load("#rules_python//python:defs.bzl", "py_binary")
load("#io_bazel_rules_docker//python3:image.bzl", py_image = "py3_image")
py_binary(
name = "demo_bin",
srcs = ["__main__.py"],
imports = [".."],
main = "__main__.py",
visibility = ["//:__subpackages__"],
deps = [
"#python_deps_flask//:pkg",
],
)
container_image(
name = "py_alpine_base",
base = "#python-alpine//image",
symlinks = {
"/usr/bin/python": "/usr/local/bin/python", # To work as base for py3_image
"/usr/bin/python3": "/usr/local/bin/python3", # To work as base for py3_image
},
)
py_image(
name = "demo_img",
srcs = ["__main__.py"],
base = "//:py_alpine_base",
main = "__main__.py",
deps = [
"#python_deps_flask//:pkg",
],
)
Where python-alpine is defined in WORKSPACE. It references an arm64 image from dockerhub.
load("#io_bazel_rules_docker//container:container.bzl", "container_pull")
container_pull(
name = "python-alpine",
registry = "index.docker.io",
repository = "arm64v8/python",
tag = "3.8-alpine",
)
The error usually means that the instruction set of the container host doesn't match the instruction set of the container image you are trying to start.
Use --platform in the build command and specify linux/amd64.
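For instance, when running the loaded image with the Docker CLI, the platform can be requested explicitly (a sketch only; it assumes amd64 emulation is available on the Apple Silicon host):

docker run --platform linux/amd64 bazel/demo:demo_img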

"snakemake: error: unrecognized arguments:"

I am working to implement a snakemake pipeline on our university's HPC. I am doing so in an activated conda environment and with the following script submitted using sbatch:
snakemake --dryrun --summary --jobs 100 --use-conda -p \
--configfile config.yaml --cluster-config cluster.yaml \
--profile /path/to/conda/env --cluster "sbatch --parsable \
--qos=unlim --partition={cluster.queue} \
--job-name=username.{rule}.{wildcards} --mem={cluster.mem}gb \
--time={cluster.time} --ntasks={cluster.threads} \
--nodes={cluster.nodes}"
config.yaml
metaG_accession: PRJNA766694
metaG_ena_table: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/ENA_tables/PRJNA766694_metaG_wenv.txt
inputDIR: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/input
outputDIR: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/output
scratch: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/scratch
adapters: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/adapters/illumina-adapters.fa
metaG_sample_list: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/SampleList_ForAssembly_metaG.txt
megahit_other: --continue --k-list 29,39,59,79,99,119
megahit_cpu: 80
megahit_min_contig: 1000
megahit_mem: 0.95
restart-times: 0
max-jobs-per-second: 1
max-status-checks-per-secon: 10
local-cores: 1
rerun-incomplete: true
keep-going: true
Snakefile
configfile: "config.yaml"
import io
import os
import pandas as pd
import numpy as np
import pathlib
from snakemake.exceptions import print_exception, WorkflowError
#----SET VARIABLES----#
METAG_ACCESSION = config["metaG_accession"]
METAG_SAMPLES = pd.read_table(config["metaG_ena_table"])
INPUTDIR = config["inputDIR"]
ADAPTERS = config["adapters"]
SCRATCHDIR = config["scratch"]
OUTPUTDIR = config["outputDIR"]
METAG_SAMPLELIST = pd.read_table(config["metaG_sample_list"], index_col="Assembly_group")
METAG_ASSEMBLYGROUP = list(METAG_SAMPLELIST.index)
ASSEMBLYGROUP = METAG_ASSEMBLYGROUP
#----COMPUTE VAR----#
MEGAHIT_CPU = config["megahit_cpu"]
MEGAHIT_MIN_CONTIG = config["megahit_min_contig"]
MEGAHIT_MEM = config["megahit_mem"]
MEGAHIT_OTHER = config["megahit_other"]
and slurm error output
snakemake: error: unrecognized arguments: --metaG_accession=PRJNA766694
--metaG_ena_table=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/ENA_tables/PRJNA766694_metaG_wenv.txt
--inputDIR=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/input
--outputDIR=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/output
--scratch=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/scratch
--adapters=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/adapters/illumina-adapters.fa
--metaG_sample_list=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/SampleList_ForAssembly_metaG.txt
--megahit_cpu=80 --megahit_min_contig=1000 --megahit_mem=0.95
On execution it fails to recognize arguments in my config.yaml file (for ex.):
snakemake: error: unrecognized arguments: --inputDIR=[path\to\dir]
In my understanding the Snakefile should be able to take any arguments stated in the config.yaml using:
INPUTDIR = config["inputDIR"]
when:
configfile: "config.yaml"
is declared in my Snakefile.
Also, my config.yaml properly recognizes non-custom arguments such as:
max-jobs-per-second: 1
Is there some custom library setup that I need to initiate for this particular config.yaml? This is my first time using Snakemake and I am still learning how to properly work with config files.
Also, on swapping the paths directly into the Snakefile I was able to get the summary output for my dryrun without the unrecognized arguments error.
The issue was the way in which the workflow was executed with Slurm: I had been executing snakemake via sbatch as a bash script.
Instead, snakemake should be executed directly from the terminal (or via bash). While I'm not exactly sure why, submitting it through sbatch caused my jobs to run on the cluster's local/login resources, which have tight memory limits, rather than on the HPC partitions that have appropriate capacity. Executing it that way also caused snakemake to not recognize the paths set in my config.yaml.
The lesson learned is that snakemake has a built-in way of communicating with the Slurm manager, and executing snakemake through sbatch seems to cause conflicts.
("Conda Env") [user#log001 "Main Directory"]$ snakemake \
> --jobs 100 --use-conda -p -s Snakefile \
> --cluster-config cluster.yaml --cluster "sbatch \
> --parsable --qos=unlim --partition={cluster.queue} \
> --job-name=TARA.{rule}.{wildcards} --mem={cluster.mem}gb \
> --time={cluster.time} --ntasks={cluster.threads} --nodes={cluster.nodes}"

Snakemake partial expand, using one output file per execution in the following rule

I am having trouble getting the final rule in my Snakefile to execute once for each input I provide it. It currently uses a partial expand to fill one value, as seen in rule download.
However, when using the expand function, the rule sees the input as a single list of strings and is executed one time. I would like three executions of the rule, each with a single string as input, so the downloads can happen in parallel.
Here is the Snakefile I am using:
Snakefile
import csv
import os  # needed for os.path.join below

def get_tissue_name():
    tissue_data = []
    with open("master_init.csv", "r") as rfile:
        reader = csv.reader(rfile)
        for line in reader:
            id = line[1].split("_")[0]  # naiveB_S1R1 -> naiveB
            tissue_data.append(id)
    return tissue_data

def get_tag_data():
    tag_data = []
    with open("master_init.csv", "r") as rfile:
        reader = csv.reader(rfile)
        for line in reader:
            tag = line[1].split("_")[-1]
            tag_data.append(tag)  # example: S1R1
    return tag_data

rule all:
    input:
        # Execute distribute & download
        expand(os.path.join("output", "{tissue_name}"),
               tissue_name=get_tissue_name())

rule distribute:
    input: "master_init.csv"
    output: "init/{tissue_name}_{tag}.csv"
    params:
        id = "{tissue_name}_{tag}"
    run:
        lines = open(str(input), "r").readlines()
        wfile = open(str(output), "w")
        for line in lines:
            line = line.rstrip()  # remove trailing newline
            # Only write line if the output file has the current
            # tissue-name_tag (naiveB_S1R1) in the file name
            if params.id in line:
                wfile.write(line)
        wfile.close()

rule download:
    input:
        expand("init/{tissue_name}_{tag}.csv",
               tag=get_tag_data(), allow_missing=True)
    output: directory(os.path.join("output", "{tissue_name}"))
    shell:
        """
        while read srr name endtype; do
            fastq-dump --split-files --gzip $srr --outdir {output}
        done < {input}
        """
When I execute this Snakefile, I get the following output:
> snakemake --cores 1 --dry-run
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
---------- ------- ------------- -------------
all 1 1 1
distribute 3 1 1
download 1 1 1
total 5 1 1
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R1.csv
jobid: 2
wildcards: tissue_name=naiveB, tag=S1R1
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 2.
1 of 5 steps (20%) done
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R2.csv
jobid: 3
wildcards: tissue_name=naiveB, tag=S1R2
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 3.
2 of 5 steps (40%) done
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R3.csv
jobid: 4
wildcards: tissue_name=naiveB, tag=S1R3
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 4.
3 of 5 steps (60%) done
[Wed Sep 1 15:28:27 2021]
rule download:
input: init/naiveB_S1R1.csv, init/naiveB_S1R2.csv, init/naiveB_S1R3.csv
output: output/naiveB
jobid: 1
wildcards: tissue_name=naiveB
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
/bin/bash: -c: line 2: syntax error near unexpected token `init/naiveB_S1R2.csv'
/bin/bash: -c: line 2: ` done < init/naiveB_S1R1.csv init/naiveB_S1R2.csv init/naiveB_S1R3.csv'
[Wed Sep 1 15:28:27 2021]
Error in rule download:
jobid: 1
output: output/naiveB
shell:
while read srr name endtype; do
fastq-dump --split-files --gzip $srr --outdir output/naiveB
done < init/naiveB_S1R1.csv init/naiveB_S1R2.csv init/naiveB_S1R3.csv
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Users/joshl/PycharmProjects/FastqToGeneCounts/exampleSnakefile/.snakemake/log/2021-09-01T152827.234141.snakemake.log
I am getting an error in the execution of rule download because of the done < {input} portion: the entire input list is used for the redirect, as opposed to a single file. In an ideal execution, rule download would run three separate times, once for each input file.
A simple fix is to wrap the while ... done section in a for loop, but then I lose the ability to download multiple SRR files at the same time.
Does anyone know if this is possible?
You cannot execute a single rule multiple times for the same output. In your rule download, the output depends only on tissue_name and not on tag.
You have a choice: either provide a filename that depends on tag as an output (like the filename you are downloading), or create a loop inside the rule itself (a sketch of this option follows).
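A possible sketch of the loop approach, reusing the fastq-dump command from the question (untested; note that with this layout the downloads for a given tissue run sequentially within one job rather than in parallel):

rule download:
    input:
        expand("init/{tissue_name}_{tag}.csv",
               tag=get_tag_data(), allow_missing=True)
    output:
        directory(os.path.join("output", "{tissue_name}"))
    shell:
        """
        for f in {input}; do
            while read srr name endtype; do
                fastq-dump --split-files --gzip $srr --outdir {output}
            done < "$f"
        done
        """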
