Snakemake MissingOutputException when writing list to file - python

I'm getting a MissingOutputException with what I think is a very basic rule. I'm trying to print a list, given through the config file, into a file using some Python commands, but Snakemake keeps throwing a MissingOutputException:
# --- Importing Configuration Files --- #
configfile: "config.yaml"
# -------------------------------------------------
scaffolds = config["Scaffolds"]
localrules: all, MakeScaffoldList
# -------------------------------------------------
rule all:
    input:
        LIST = "scaffolds.list"
# -------------------------------------------------
rule MakeScaffoldList:
    output:
        LIST = "scaffolds.list"
    params:
        SCAFFOLDS = scaffolds
    run:
        """
        with open(output.LIST, 'w') as f:
            for line in params.SCAFFOLDS:
                f.write(f"{line}\n")
        """
Error:
[Thu Nov 17 14:08:33 2022]
localrule MakeScaffoldList:
output: scaffolds.list
jobid: 1
resources: mem_mb=27200, disk_mb=1000, tmpdir=/scratch, account=snic2022-22-156, partition=core, time=12:00:00, threads=4
MissingOutputException in line 37 of test.smk:
Job Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
scaffolds.list completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
What am I doing wrong? Is the Python code wrong?

If you want to include Python code directly in your Snakefile, you have to drop the quotation marks around your Python code in the run directive; as written, the run block contains only a string literal, so nothing is executed and no output file is created:
scaffolds = ["dummy", "entries"]
localrules: all, MakeScaffoldList
# -------------------------------------------------
rule all:
input:
LIST = "scaffolds.list"
# -------------------------------------------------
rule MakeScaffoldList:
output:
LIST = "scaffolds.list"
params:
SCAFFOLDS = scaffolds
run:
with open(output.LIST, 'w') as f:
for line in params.SCAFFOLDS:
f.write(f"{line}\n")
works.
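As an aside, the same run block can be collapsed into a single write call (a small sketch, assuming params.SCAFFOLDS is a plain list of strings coming from config["Scaffolds"]):
rule MakeScaffoldList:
    output:
        LIST = "scaffolds.list"
    params:
        SCAFFOLDS = scaffolds
    run:
        # join the scaffold names with newlines and write them in one call
        with open(output.LIST, "w") as f:
            f.write("\n".join(str(s) for s in params.SCAFFOLDS) + "\n")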

Related

Write multi-line secret to windows runner in GitHub workflow

Summary
What specific syntax must be changed in the code below in order for the multi-line contents of the $MY_SECRETS environment variable to be 1.) successfully written into the C:\\Users\\runneradmin\\somedir\\mykeys.yaml file on a Windows runner in the GitHub workflow whose code is given below, and 2.) read by the simple Python 3 main.py program given below?
PROBLEM DEFINITION:
The echo "$MY_SECRETS" > C:\\Users\\runneradmin\\somedir\\mykeys.yaml command is only printing the string literal MY_SECRETS into the C:\\Users\\runneradmin\\somedir\\mykeys.yaml file instead of printing the multi-line contents of the MY_SECRETS variable.
We confirmed that this same echo command does successfully print the same multi-line secret in an ubuntu-latest runner, and we manually validated the correct contents of the secrets.LIST_OF_SECRETS environment variable. ... This problem seems entirely isolated to either the windows command syntax, or perhaps to the windows configuration of the GitHub windows-latest runner, either of which should be fixable by changing the workflow code below.
EXPECTED RESULT:
The multi-line secret should be printed into the C:\\Users\\runneradmin\\somedir\\mykeys.yaml file and read by main.py.
The resulting printout of the contents of the C:\\Users\\runneradmin\\somedir\\mykeys.yaml file should look like:
***
***
***
***
LOGS THAT DEMONSTRATE THE FAILURE:
The result of running main.py in the GitHub Actions log is:
ccc item is: $MY_SECRETS
As you can see, the string literal $MY_SECRETS is being wrongly printed out instead of the 4 *** secret lines.
REPO FILE STRUCTURE:
Reproducing this error requires only 2 files in a repo file structure as follows:
.github/
workflows/
test.yml
main.py
WORKFLOW CODE:
The minimal code for the workflow to reproduce this problem is as follows:
name: write-secrets-to-file
on:
  push:
    branches:
      - dev
jobs:
  write-the-secrets-windows:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v3
      - shell: python
        name: Configure agent
        env:
          MY_SECRETS: ${{ secrets.LIST_OF_SECRETS }}
        run: |
          import os
          import subprocess
          import pathlib
          pathlib.Path("C:\\Users\\runneradmin\\somedir\\").mkdir(parents=True, exist_ok=True)
          print('About to: echo "$MY_SECRETS" > C:\\Users\\runneradmin\\somedir\\mykeys.yaml')
          output = subprocess.getoutput('echo "$MY_SECRETS" > C:\\Users\\runneradmin\\somedir\\mykeys.yaml')
          print(output)
          os.chdir('D:\\a\\myRepoName\\')
          mycmd = "python myRepoName\\main.py"
          p = subprocess.Popen(mycmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
          while(True):
              # returns None while subprocess is running
              retcode = p.poll()
              line = p.stdout.readline()
              print(line)
              if retcode is not None:
                  break
MINIMAL APP CODE:
Then the minimal main.py program that demonstrates what was actually written into the C:\\Users\\runneradmin\\somedir\\mykeys.yaml file is:
with open('C:\\Users\\runneradmin\\somedir\\mykeys.yaml') as file:
    for item in file:
        print('ccc item is: ', str(item))
        if "var1" in item:
            print("Found var1")
STRUCTURE OF MULTI-LINE SECRET:
The structure of the multi-line secret contained in the secrets.LIST_OF_SECRETS environment variable is:
var1:value1
var2:value2
var3:value3
var4:value4
These 4 lines should be what gets printed out when main.py is run by the workflow, though the print for each line should look like *** because each line is a secret.
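Worth noting: on the windows-latest runner, subprocess.getoutput hands the command string to cmd.exe, which does not expand POSIX-style $VARIABLES, so the literal text $MY_SECRETS ends up in the file. A minimal, shell-free sketch of the write (standard library only, paths as in the question):
import os
import pathlib

# create the target directory and write the secret straight from the environment,
# bypassing any shell quoting or variable-expansion rules
target = pathlib.Path(r"C:\Users\runneradmin\somedir\mykeys.yaml")
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text(os.environ["MY_SECRETS"], encoding="utf-8")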
The problem is - as it is so often - the quirks of Python with byte arrays and strings and en- and de-coding them in the right places...
Here is what I used:
test.yml:
name: write-secrets-to-file
on:
  push:
    branches:
      - dev
jobs:
  write-the-secrets-windows:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v3
      - shell: python
        name: Configure agent
        env:
          MY_SECRETS: ${{ secrets.LIST_OF_SECRETS }}
        run: |
          import subprocess
          import pathlib
          import os
          # using os.path.expanduser() instead of hard-coding the user's home directory
          pathlib.Path(os.path.expanduser("~/somedir")).mkdir(parents=True, exist_ok=True)
          secrets = os.getenv("MY_SECRETS")
          with open(os.path.expanduser("~/somedir/mykeys.yaml"), "w", encoding="UTF-8") as file:
              file.write(secrets)
          mycmd = ["python", "./main.py"]
          p = subprocess.Popen(mycmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
          while(True):
              # returns None while subprocess is running
              retcode = p.poll()
              line = p.stdout.readline()
              # If len(line)==0 we are at EOF and do not need to print this line.
              # An empty line from main.py would be '\n' with len('\n')==1!
              if len(line) > 0:
                  # We decode the byte array to a string and strip the
                  # new-line characters \r and \n from the end of the line,
                  # which were read from stdout of main.py
                  print(line.decode('UTF-8').rstrip('\r\n'))
              if retcode is not None:
                  break
main.py:
import os
# using os.path.expanduser instead of hard-coding user home directory
with open(os.path.expanduser('~/somedir/mykeys.yaml'), encoding='UTF-8') as file:
    for item in file:
        # strip the new-line characters \r and \n from the end of the line
        item = item.rstrip('\r\n')
        print('ccc item is: ', str(item))
        if "var1" in item:
            print("Found var1")
secrets.LIST_OF_SECRETS:
var1: secret1
var2: secret2
var3: secret3
var4: secret4
And my output in the log was
ccc item is: ***
Found var1
ccc item is: ***
ccc item is: ***
ccc item is: ***
Edit: updated with fixed main.py and how to run it.
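As a side note, the poll/readline/decode loop above could be replaced with subprocess.run and text=True (a minimal sketch, not what the answer uses; output is collected after the child process exits rather than streamed live):
import subprocess

# run main.py and capture its output as str instead of bytes
result = subprocess.run(["python", "./main.py"], capture_output=True, text=True)
print(result.stdout.rstrip())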
You can write the key file directly with Python:
- shell: python
  name: Configure agent
  env:
    MY_SECRETS: ${{ secrets.LIST_OF_SECRETS }}
  run: |
    import os
    import pathlib
    pathlib.Path('C:\\Users\\runneradmin\\somedir\\').mkdir(parents=True, exist_ok=True)
    with open('C:\\Users\\runneradmin\\somedir\\mykeys.yaml', 'w') as key_file:
        key_file.write(os.environ['MY_SECRETS'])
- uses: actions/checkout@v3
- name: Run main
  run: python main.py
To avoid newline characters in your output, you need a main.py that removes the newlines (here with .strip().splitlines()):
main.py
with open('C:\\Users\\runneradmin\\somedir\\mykeys.yaml') as file:
    for item in file.read().strip().splitlines():
        print('ccc item is: ', str(item))
        if "var1" in item:
            print("Found var1")
Here's the input:
LIST_OF_SECRETS = '
key:value
key2:value
key3:value
'
And the output:
ccc item is: ***
Found var1
ccc item is: ***
ccc item is: ***
ccc item is: ***
Here is my complete workflow:
name: write-secrets-to-file
on:
  push:
    branches:
      - master
jobs:
  write-the-secrets-windows:
    runs-on: windows-latest
    steps:
      - shell: python
        name: Configure agent
        env:
          MY_SECRETS: ${{ secrets.LIST_OF_SECRETS }}
        run: |
          import os
          import pathlib
          pathlib.Path('C:\\Users\\runneradmin\\somedir\\').mkdir(parents=True, exist_ok=True)
          with open('C:\\Users\\runneradmin\\somedir\\mykeys.yaml', 'w') as key_file:
              key_file.write(os.environ['MY_SECRETS'])
      - uses: actions/checkout@v3
      - name: Run main
        run: python main.py
Also, a simpler version using only Windows shell (Powershell):
- name: Create key file
  env:
    MY_SECRETS: ${{ secrets.LIST_OF_SECRETS }}
  run: |
    mkdir C:\\Users\\runneradmin\\somedir
    echo "$env:MY_SECRETS" > C:\\Users\\runneradmin\\somedir\\mykeys.yaml
- uses: actions/checkout@v3
- name: Run main
  run: python main.py
I tried the following code and it worked fine :
LIST_OF_SECRETS
key1:val1
key2:val2
Github action (test.yml)
name: write-secrets-to-file
on:
  push:
    branches:
      - main
jobs:
  write-the-secrets-windows:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v3
      - shell: python
        name: Configure agent
        env:
          MY_SECRETS: ${{ secrets.LIST_OF_SECRETS }}
        run: |
          import base64, subprocess, sys
          import os
          secrets = os.environ["MY_SECRETS"]
          def powershell(cmd, input=None):
              cmd64 = base64.encodebytes(cmd.encode('utf-16-le')).decode('ascii').strip()
              stdin = None if input is None else subprocess.PIPE
              process = subprocess.Popen(["powershell.exe", "-NonInteractive", "-EncodedCommand", cmd64], stdin=stdin, stdout=subprocess.PIPE)
              if input is not None:
                  input = input.encode(sys.stdout.encoding)
              output, stderr = process.communicate(input)
              output = output.decode(sys.stdout.encoding).replace('\r\n', '\n')
              return output
          command = r"""$secrets = @'
          {}
          '@
          $secrets | Out-File -FilePath .\mykeys.yaml""".format(secrets)
          command1 = r"""Get-Content -Path .\mykeys.yaml"""
          powershell(command)
          print(powershell(command1))
Output
***
***
As you also mention in the question, Github will obfuscate any printed value containing the secrets with ***
EDIT : Updated the code to work with multiple line secrets. This answer was highly influenced by this one
You need to use the yaml library:
import yaml

data = {'MY_SECRETS': '''
var1:value1
var2:value2
var3:value3
var4:value4
'''}  # add your secret
with open('file.yaml', 'w') as outfile:  # Your file
    yaml.dump(data, outfile, default_flow_style=False)
This is the result:
I used this.
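To check what was written, the file can be loaded back with the same library (a small sketch):
import yaml

# load the file written above and print the stored secret block
with open('file.yaml') as infile:
    data = yaml.safe_load(infile)
print(data['MY_SECRETS'])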

Snakemake partial expand, using one output file per execution in the following rule

I am having trouble trying to execute the final rule in my Snakefile once for each input I provide it. It currently uses a partial expand to fill one value, as seen in rule download.
However, when using the expand function, the rule sees the input as a single list of strings, and is executed one time. I would like to have three executions of the rule, with the input being one string for each, so the downloads are able to happen in parallel.
Here is the Snakefile I am using:
Snakefile
import csv
import os

def get_tissue_name():
    tissue_data = []
    with open("master_init.csv", "r") as rfile:
        reader = csv.reader(rfile)
        for line in reader:
            id = line[1].split("_")[0]  # naiveB_S1R1 -> naiveB
            tissue_data.append(id)
    return tissue_data

def get_tag_data():
    tag_data = []
    with open("master_init.csv", "r") as rfile:
        reader = csv.reader(rfile)
        for line in reader:
            tag = line[1].split("_")[-1]
            tag_data.append(tag)  # example: S1R1
    return tag_data

rule all:
    input:
        # Execute distribute & download
        expand(os.path.join("output", "{tissue_name}"),
               tissue_name=get_tissue_name())

rule distribute:
    input: "master_init.csv"
    output: "init/{tissue_name}_{tag}.csv"
    params:
        id = "{tissue_name}_{tag}"
    run:
        lines = open(str(input), "r").readlines()
        wfile = open(str(output), "w")
        for line in lines:
            line = line.rstrip()  # remove trailing newline
            # Only write line if the output file has the current
            # tissue-name_tag (naiveB_S1R1) in the file name
            if params.id in line:
                wfile.write(line)
        wfile.close()

rule download:
    input: expand("init/{tissue_name}_{tag}.csv",
                  tag=get_tag_data(), allow_missing=True)
    output: directory(os.path.join("output", "{tissue_name}"))
    shell:
        """
        while read srr name endtype; do
            fastq-dump --split-files --gzip $srr --outdir {output}
        done < {input}
        """
When I execute this Snakefile, I get the following output:
> snakemake --cores 1 --dry-run
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
---------- ------- ------------- -------------
all 1 1 1
distribute 3 1 1
download 1 1 1
total 5 1 1
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R1.csv
jobid: 2
wildcards: tissue_name=naiveB, tag=S1R1
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 2.
1 of 5 steps (20%) done
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R2.csv
jobid: 3
wildcards: tissue_name=naiveB, tag=S1R2
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 3.
2 of 5 steps (40%) done
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R3.csv
jobid: 4
wildcards: tissue_name=naiveB, tag=S1R3
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 4.
3 of 5 steps (60%) done
[Wed Sep 1 15:28:27 2021]
rule download:
input: init/naiveB_S1R1.csv, init/naiveB_S1R2.csv, init/naiveB_S1R3.csv
output: output/naiveB
jobid: 1
wildcards: tissue_name=naiveB
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
/bin/bash: -c: line 2: syntax error near unexpected token `init/naiveB_S1R2.csv'
/bin/bash: -c: line 2: ` done < init/naiveB_S1R1.csv init/naiveB_S1R2.csv init/naiveB_S1R3.csv'
[Wed Sep 1 15:28:27 2021]
Error in rule download:
jobid: 1
output: output/naiveB
shell:
while read srr name endtype; do
fastq-dump --split-files --gzip $srr --outdir output/naiveB
done < init/naiveB_S1R1.csv init/naiveB_S1R2.csv init/naiveB_S1R3.csv
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Users/joshl/PycharmProjects/FastqToGeneCounts/exampleSnakefile/.snakemake/log/2021-09-01T152827.234141.snakemake.log
I am getting an error in the execution of rule download because of the done < {input} portion. The entire input list is used for the redirect, as opposed to a single file. In an ideal execution, rule download would execute three separate times, once for each input file.
A simple fix is to wrap the while ... done section in a for loop, but then I lose the ability to download multiple SRR files at the same time.
Does anyone know if this is possible?
You cannot execute a single rule multiple times for the same output. In your rule download, the output depends only on tissue_name and doesn't depend on tag.
You have a choice: either provide a filename that depends on tag as an output (like the filename you are downloading), or loop over the input files inside the rule itself.
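A minimal sketch of the first option (hypothetical file names; the directory output is replaced by a per-tag marker file so each tag becomes its own download job, and zip keeps the tissue/tag pairs from master_init.csv aligned):
rule all:
    input:
        expand("output/{tissue_name}/{tag}.done", zip,
               tissue_name=get_tissue_name(), tag=get_tag_data())

rule download:
    input: "init/{tissue_name}_{tag}.csv"
    output: touch("output/{tissue_name}/{tag}.done")
    shell:
        """
        while read srr name endtype; do
            fastq-dump --split-files --gzip $srr --outdir output/{wildcards.tissue_name}
        done < {input}
        """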

How to use singularity and conda wrappers in Snakemake

TLDR I'm getting the following error:
The 'conda' command is not available inside your singularity container image. Snakemake mounts your conda installation into singularity. Sometimes, this can fail because of shell restrictions. It has been tested to work with docker://ubuntu, but it e.g. fails with docker://bash
I had created a Snakemake workflow and converted the shell: commands to rule-based package management via Snakemake wrappers.
However, I ran into issues running this on HPC and one of the HPC support staff strongly recommended against using conda on any HPC system as:
"if the builder [of wrapper] is not super careful, dynamic libraries present in the conda environment that relies on the host libs (there are always a couple present because builder are most of the time carefree) will break. I think that relying on Singularity for your pipeline would make for a more robust system." - Anon
I did some reading over the weekend and according to this document, it's possible to combine containers with conda-based package management; by defining a global conda docker container and per-rule yaml files.
Note: In contrast to the example in the link above (Figure 5.4), which uses a predefined yaml and shell: command, here I've used conda wrappers, which download these yaml files into the Singularity container (if I'm thinking correctly), so I thought it should function the same - see the Note: at the end though...
Snakefile, config.yaml and samples.txt
Snakefile
# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))

import os
import pandas as pd

# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)

# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))

# get number of samples
len_samples = len(sample_names)

# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"

# Rules -----------------------------------------------------------------------
rule all:
    input:
        "resources/vep/plugins",
        "resources/vep/cache"

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,
        time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    resources:
        mem=1000,
        time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"
config.yaml
# Files
REF_GENOME: "c_elegans.PRJNA13758.WS265.genomic.fa"
GENOME_ANNOTATION: "c_elegans.PRJNA13758.WS265.annotations.gff3"
# Tools
QC_TOOL: "fastQC"
TRIM_TOOL: "trimmomatic"
ALIGN_TOOL: "bwa"
MARKDUP_TOOL: "picard"
CALLING_TOOL: "varscan"
ANNOT_TOOL: "vep"
samples.txt
sample location
MTG324 /home/moldach/wrappers/SUBSET/MTG324_SUBSET
Submission
snakemake --profile slurm --use-singularity --use-conda --jobs 2
Logs
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 get_vep_cache
1
[Mon Sep 21 15:35:50 2020]
rule get_vep_cache:
output: resources/vep/cache
log: logs/vep/cache.log
jobid: 0
resources: mem=1000, time=30
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/VEP/.snakemake/conda/774ea575
[Mon Sep 21 15:36:38 2020]
Finished job 0.
1 of 1 steps (100%) done
Note: Leaving --use-conda out of the submission of the workflow will cause an error for get_vep_cache: - /bin/bash: vep_install: command not found
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 download_vep_plugins
1
[Mon Sep 21 15:35:50 2020]
rule download_vep_plugins:
output: resources/vep/plugins
jobid: 0
resources: mem=1000, time=30
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/VEP/.snakemake/conda/9f602d9a
[Mon Sep 21 15:35:56 2020]
Finished job 0.
1 of 1 steps (100%) done
The problem occurs when adding the third rule, fastq:
Updated Snakefile
# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))

import os
import pandas as pd

# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)

# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))

# get number of samples
len_samples = len(sample_names)

# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"

# Rules -----------------------------------------------------------------------
rule all:
    input:
        "resources/vep/plugins",
        "resources/vep/cache",
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,
        time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    resources:
        mem=1000,
        time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"

def getHome(sample):
    return(list(os.path.join(samples_dict[sample], "{0}_{1}.fastq.gz".format(sample, pair)) for pair in ['R1', 'R2']))

rule qc_before_trim_r1:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R1_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R1_fastqc.zip"),
    params:
        dir=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"], config["QC_TOOL"], "{sample}_R1.log")
    resources:
        mem=1000,
        time=30
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"

rule qc_before_trim_r2:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[1]
    output:
        html=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R2_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R2_fastqc.zip"),
    params:
        dir=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"], config["QC_TOOL"], "{sample}_R2.log")
    resources:
        mem=1000,
        time=30
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"
Error reported in nohup.out
Building DAG of jobs...
Pulling singularity image https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0.
CreateCondaEnvironmentException:
The 'conda' command is not available inside your singularity container image. Snakemake mounts your conda installation into singularity. Sometimes, this can fail because of shell restrictions. It has been tested to work with docker://ubuntu, but it e.g. fails with docker://bash
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 247, in create
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 381, in __new__
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 394, in __init__
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 417, in _check
Using shell: instead of wrapper:
I changed the wrapper back into the shell command, and this is the error I get when submitting with ``:
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 qc_before_trim_r2
1
[Mon Sep 21 16:32:54 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/6740cb07e67eae01644839c9767bdca5.simg
^[[33mWARNING:^[[0m Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_CA.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Skipping '/home/moldach/wrappers/SUBSET/MTG324_SUBSET/MTG324_R2.fastq.gz' which didn't exist, or couldn't be read
Waiting at most 60 seconds for missing files.
MissingOutputException in line 84 of /home/moldach/wrappers/SUBSET/VEP/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 60 seconds:
qc/fastQC/before_trim/MTG324_R2_fastqc.html
qc/fastQC/before_trim/MTG324_R2_fastqc.zip
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 231, in handle_job_success
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The error Skipping '/home/moldach/wrappers/SUBSET/MTG324_SUBSET/MTG324_R2.fastq.gz' which didn't exist, or couldn't be read is misleading, because the file does exist...
Update 2
Following the advice Manavalan Gajapathy I've eliminated defining singularity at two different levels (global + per-rule).
Now I'm using a singularity container at only the global level and using wrappers via --use-conda which creates the conda environment inside of the container:
# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))

import os
import pandas as pd

# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)

# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))

# get number of samples
len_samples = len(sample_names)

# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"

# Rules -----------------------------------------------------------------------
rule all:
    input:
        "resources/vep/plugins",
        "resources/vep/cache",
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,
        time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    resources:
        mem=1000,
        time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"

def getHome(sample):
    return(list(os.path.join(samples_dict[sample], "{0}_{1}.fastq.gz".format(sample, pair)) for pair in ['R1', 'R2']))

rule qc_before_trim_r1:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R1_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R1_fastqc.zip"),
    params:
        dir=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"], config["QC_TOOL"], "{sample}_R1.log")
    resources:
        mem=1000,
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"

rule qc_before_trim_r2:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[1]
    output:
        html=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R2_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R2_fastqc.zip"),
    params:
        dir=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"], config["QC_TOOL"], "{sample}_R2.log")
    resources:
        mem=1000,
        time=30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"
and submit via:
However, I'm still getting an error:
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 qc_before_trim_r2
1
[Tue Sep 22 12:44:03 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.
Activating singularity image /home/moldach/wrappers/SUBSET/OMG/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/OMG/.snakemake/conda/c591f288
Skipping '/work/mtgraovac_lab/MATTS_SCRATCH/rep1_R2.fastq.gz' which didn't exist, or couldn't be read
Skipping ' 2> logs/fastQC/rep1_R2.log' which didn't exist, or couldn't be read
Failed to process qc/fastQC/before_trim
java.io.FileNotFoundException: qc/fastQC/before_trim (Is a directory)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:73)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Traceback (most recent call last):
File "/home/moldach/wrappers/SUBSET/OMG/.snakemake/scripts/tmpiwwprg5m.wrapper.py", line 35, in <module>
shell(
File "/mnt/snakemake/snakemake/shell.py", line 205, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; fastqc qc/fastQC/before_trim --quiet -t 1 --outdir /tmp/tmps93snag8 /work/mtgraovac_lab/MATTS_SCRATCH/rep1_R2.fastq.gz ' 2> logs/fastQC/rep1_R$
[Tue Sep 22 12:44:16 2020]
Error in rule qc_before_trim_r2:
jobid: 0
output: qc/fastQC/before_trim/rep1_R2_fastqc.html, qc/fastQC/before_trim/rep1_R2_fastqc.zip
log: logs/fastQC/rep1_R2.log (check log file(s) for error message)
conda-env: /home/moldach/wrappers/SUBSET/OMG/.snakemake/conda/c591f288
RuleException:
CalledProcessError in line 97 of /home/moldach/wrappers/SUBSET/OMG/Snakefile:
Command ' singularity exec --home /home/moldach/wrappers/SUBSET/OMG --bind /home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages:/mnt/snakemake /home/moldach/wrappers/SUBSET/OMG/.snakemake$
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
File "/home/moldach/wrappers/SUBSET/OMG/Snakefile", line 97, in __rule_qc_before_trim_r2
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/concurrent/futures/thread.py", line 57, in run
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Reproducibility
To replicate this you can download this small dataset:
git clone https://github.com/CRG-CNAG/CalliNGS-NF.git
cp CalliNGS-NF/data/reads/rep1_*.fq.gz .
mv rep1_1.fq.gz rep1_R1.fastq.gz
mv rep1_2.fq.gz rep1_R2.fastq.gz
UPDATE 3: Bind Mounts
According to the link shared on mounting:
"By default Singularity bind mounts /home/$USER, /tmp, and $PWD into your container at runtime."
Thus, for simplicity (and also because I got errors using --singularity-args), I've moved the required files into /home/$USER and tried to run from there.
(snakemake) [~]$ pwd
/home/moldach
(snakemake) [~]$ ll
total 3656
drwx------ 26 moldach moldach 4096 Aug 27 17:36 anaconda3
drwx------ 2 moldach moldach 4096 Sep 22 10:11 bin
-rw------- 1 moldach moldach 265 Sep 22 14:29 config.yaml
-rw------- 1 moldach moldach 1817903 Sep 22 14:29 rep1_R1.fastq.gz
-rw------- 1 moldach moldach 1870497 Sep 22 14:29 rep1_R2.fastq.gz
-rw------- 1 moldach moldach 55 Sep 22 14:29 samples.txt
-rw------- 1 moldach moldach 3420 Sep 22 14:29 Snakefile
and ran with bash -c "nohup snakemake --profile slurm --use-singularity --use-conda --jobs 4 &"
However, I still get this odd error:
Activating conda environment: /home/moldach/.snakemake/conda/fdae4f0d
Skipping ' 2> logs/fastQC/rep1_R2.log' which didn't exist, or couldn't be read
Failed to process qc/fastQC/before_trim
java.io.FileNotFoundException: qc/fastQC/before_trim (Is a directory)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:73)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Traceback (most recent call last):
Why does it think it's being given a directory?
Note: If you submit only with --use-conda, e.g. bash -c "nohup snakemake --profile slurm --use-conda --jobs 4 &", there is no error from the fastqc rules. However, the --use-conda param alone is not 100% reproducible; case in point: it doesn't work on another HPC I tested it on.
The full log in nohup.out when using --printshellcmds can be found at this gist
TLDR:
The fastqc singularity container used in the qc rule likely doesn't have conda available in it, and this doesn't satisfy what snakemake's --use-conda expects.
Explanation:
You have singularity containers defined at two different levels - 1. global level that will be used for all rules, unless they are overridden at rule level; 2. per-rule level that will be used at the rule level.
# global singularity container to use
singularity: "docker://continuumio/miniconda3:4.5.11"

# singularity container defined at rule level
rule qc_before_trim_r1:
    ....
    ....
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
When you use --use-singularity and --use-conda together, jobs will be run in a conda environment inside the singularity container. So the conda command needs to be available inside the singularity container for this to be possible. While this requirement is clearly satisfied for your global-level container, I am quite certain (haven't tested though) this is not the case for your fastqc container.
The way snakemake works, if the --use-conda flag is supplied, it will create the conda environment locally or inside the container, depending on whether --use-singularity is also supplied. Since you are using a snakemake-wrapper for the qc rule and it comes with a conda env recipe pre-defined, the easiest solution here is to just use the globally defined miniconda container for all rules. That is, there is no need to use a fastqc-specific container for the qc rule.
If you really want to use the fastqc container, then you shouldn't be using the --use-conda flag, but of course this will mean that all necessary tools must be available from the container(s) defined globally or per rule.
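Concretely, the suggested fix is to delete the rule-level singularity: directive from the fastqc rules and let the globally defined miniconda container plus --use-conda provide the environment; a sketch of qc_before_trim_r1 with that change (all other directives as in the question):
rule qc_before_trim_r1:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R1_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"], config["QC_TOOL"], "before_trim", "{sample}_R1_fastqc.zip"),
    log:
        os.path.join(dirs_dict["LOG_DIR"], config["QC_TOOL"], "{sample}_R1.log")
    resources:
        mem=1000,
        time=30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    # no rule-level singularity: here; with --use-singularity --use-conda the
    # wrapper's conda environment is built inside the global miniconda container
    wrapper:
        "0.66.0/bio/fastqc"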

Python os.stat not expanding wildcard in filename

Probably something really stupid I am doing, but can someone please assist? All I am trying to do is stat a file. Python will not make this happen; when I print the variable for debugging, I can stat in the shell with its output. Please see below:
[root@logmaster output]# cat /usr/local/nagios/libexec/check_logrip_log_not_stale.py
import os
import sys
import datetime
import time
# Nagios return values
nagiosRetValOk = 0
nagiosRetValWarn = 1
nagiosRetValCritical = 2
# Below is the filename I am after
#logrip-out-2016-03-19-1458386101
dateFormat = datetime.datetime.now().strftime("%Y-%m-%d")
logFormat = "/home/famnet/logs/output/logrip-out-%s-*" % dateFormat
print os.stat(logFormat)
Here is what happens when I run the basic script:
[root@logmaster output]# python /usr/local/nagios/libexec/check_logrip_log_not_stale.py
Traceback (most recent call last):
File "/usr/local/nagios/libexec/check_logrip_log_not_stale.py", line 36, in <module>
print os.stat(logFormat)
OSError: [Errno 2] No such file or directory: '/home/famnet/logs/output/logrip-out-2016-03-19-*'
Please forgive me if this is an easy waste of time for some experts.
Thanks,
However, when I take the output of my print debug and run it in the shell, it works.
[root@logmaster output]# stat /home/famnet/logs/output/logrip-out-2016-03-19-*
File: `/home/famnet/logs/output/logrip-out-2016-03-19-1458386101'
Size: 42374797 Blocks: 82776 IO Block: 4096 regular file
Device: fd02h/64770d Inode: 36590817 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 504/ famnet) Gid: ( 1100/ staff)
Access: 2016-03-19 07:15:01.725794193 -0400
Modify: 2016-03-19 07:44:09.847793116 -0400
Change: 2016-03-19 07:44:09.847793116 -0400
Expansion of wildcards is a feature of many common shells, such as bash in this case. It is not a feature of the system call underlying os.stat.
If you want to call os.stat against more than one file, you'll have to list them (using something like glob.glob) first, then call os.stat once per path. Something like this:
import glob

for full_path in glob.glob(logFormat):
    print os.stat(full_path)
Observe as well that a path with a wildcard may expand to multiple concrete paths, which can work with the command-line STAT(1), but will certainly break os.stat which takes only a single path argument.
os.stat won't auto-expand the wildcard... try using glob
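Putting the pieces together for the staleness check (a rough sketch in the question's Python 2 style; the one-hour threshold and the reuse of logFormat and the nagiosRetVal* variables from the script above are assumptions):
import glob
import os
import sys
import time

matches = glob.glob(logFormat)               # expand the wildcard ourselves
if not matches:
    print "CRITICAL: no log matching %s" % logFormat
    sys.exit(nagiosRetValCritical)

newest = max(matches, key=os.path.getmtime)  # most recently modified match
age = time.time() - os.stat(newest).st_mtime
if age > 3600:                               # hypothetical one-hour limit
    print "WARNING: %s is %d seconds old" % (newest, age)
    sys.exit(nagiosRetValWarn)

print "OK: %s updated %d seconds ago" % (newest, age)
sys.exit(nagiosRetValOk)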

Unexpected python logger output when using several handlers with different log levels

I am trying to log data to stderr and into a file. The file should contain all log messages, and to stderr should go only the log level configured on the command line. This is described several times in the logging howto - but it does not seem to work for me. I have created a small test script which illustrates my problem:
#!/usr/bin/env python
import logging as l
l.basicConfig(level=100)
logger = l.getLogger("me")
# ... --- === SEE THIS LINE === --- ...
logger.setLevel(l.CRITICAL)
sh = l.StreamHandler()
sh.setLevel(l.ERROR)
sh.setFormatter(l.Formatter('%(levelname)-8s CONSOLE %(message)s'))
logger.addHandler(sh)
fh = l.FileHandler("test.dat", "w")
fh.setLevel(l.DEBUG)
fh.setFormatter(l.Formatter('%(levelname)-8s FILE %(message)s'))
logger.addHandler(fh)
logger.info("hi this is INFO")
logger.error("well this is ERROR")
In the 5th code line I can use either logger.setLevel(l.CRITICAL) or logger.setLevel(l.DEBUG). Both results are unsatisfying.
With logger.setLevel(l.CRITICAL) I get ...
$ python test.py
$ cat test.dat
$
Now with logger.setLevel(l.DEBUG) I get ...
$ python test.py
INFO:me:hi this is INFO
ERROR CONSOLE well this is ERROR
ERROR:me:well this is ERROR
$ cat test.dat
INFO FILE hi this is INFO
ERROR FILE well this is ERROR
$
In one case I see nothing anywhere; in the other I see everything everywhere, and one message is even displayed twice on the console.
Now I get where the ERROR CONSOLE and ERROR FILE outputs come from, those I expect. I don't get where the INFO:me... or ERROR:me... outputs are coming from, and I would like to get rid of them.
Things I already tried:
Creating a filter as described here: https://stackoverflow.com/a/7447596/902327 (does not work)
Emptying handlers from the logger with logger.handlers = [] (also does not work)
Can somebody help me out here? It seems like a straightforward requirement and I really don't seem to get it.
The extra INFO:me and ERROR:me lines come from the root logger: basicConfig() installs a handler on it, and records logged on "me" propagate up and are emitted there a second time in the default format. So set the logger's level to DEBUG (low enough for every handler), set propagate to False so records stop reaching the root handler, and then set the appropriate level on each handler.
import logging as l
l.basicConfig()
logger = l.getLogger("me")
# ... --- === SEE THIS LINE === --- ...
logger.setLevel(l.DEBUG)
logger.propagate = False
sh = l.StreamHandler()
sh.setLevel(l.ERROR)
sh.setFormatter(l.Formatter('%(levelname)-8s CONSOLE %(message)s'))
logger.addHandler(sh)
fh = l.FileHandler("test.dat", "w")
fh.setLevel(l.INFO)
fh.setFormatter(l.Formatter('%(levelname)-8s FILE %(message)s'))
logger.addHandler(fh)
logger.info("hi this is INFO")
logger.error("well this is ERROR")
Output:
~$ python test.py
ERROR CONSOLE well this is ERROR
~$ cat test.dat
INFO FILE hi this is INFO
ERROR FILE well this is ERROR
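An equivalent fix (a sketch, not from the answer above) is to drop basicConfig() entirely so the root logger never gets a handler of its own; the duplicate INFO:me/ERROR:me lines then disappear even with propagation left on:
import logging as l

logger = l.getLogger("me")
logger.setLevel(l.DEBUG)      # logger level must pass records to both handlers

sh = l.StreamHandler()
sh.setLevel(l.ERROR)          # console shows only ERROR and above
sh.setFormatter(l.Formatter('%(levelname)-8s CONSOLE %(message)s'))
logger.addHandler(sh)

fh = l.FileHandler("test.dat", "w")
fh.setLevel(l.DEBUG)          # file gets everything the logger lets through
fh.setFormatter(l.Formatter('%(levelname)-8s FILE %(message)s'))
logger.addHandler(fh)

logger.info("hi this is INFO")      # file only
logger.error("well this is ERROR")  # console and file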
