"snakemake: error: unrecognized arguments:"

"snakemake: error: unrecognized arguments:" - python

I am working to implement a snakemake pipeline on our university's HPC. I am doing so in an activated conda environment and with the following script submitted using sbatch:
snakemake --dryrun --summary --jobs 100 --use-conda -p \
--configfile config.yaml --cluster-config cluster.yaml \
--profile /path/to/conda/env --cluster "sbatch --parsable \
--qos=unlim --partition={cluster.queue} \
--job-name=username.{rule}.{wildcards} --mem={cluster.mem}gb \
--time={cluster.time} --ntasks={cluster.threads} \
--nodes={cluster.nodes}"
config.yaml
metaG_accession: PRJNA766694
metaG_ena_table: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/ENA_tables/PRJNA766694_metaG_wenv.txt
inputDIR: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/input
outputDIR: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/output
scratch: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/scratch
adapters: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/adapters/illumina-adapters.fa
metaG_sample_list: /home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/SampleList_ForAssembly_metaG.txt
megahit_other: --continue --k-list 29,39,59,79,99,119
megahit_cpu: 80
megahit_min_contig: 1000
megahit_mem: 0.95
restart-times: 0
max-jobs-per-second: 1
max-status-checks-per-secon: 10
local-cores: 1
rerun-incomplete: true
keep-going: true
Snakefile
configfile: "config.yaml"
import io
import os
import pandas as pd
import numpy as np
import pathlib
from snakemake.exceptions import print_exception, WorkflowError
#----SET VARIABLES----#
METAG_ACCESSION = config["metaG_accession"]
METAG_SAMPLES = pd.read_table(config["metaG_ena_table"])
INPUTDIR = config["inputDIR"]
ADAPTERS = config["adapters"]
SCRATCHDIR = config["scratch"]
OUTPUTDIR = config["outputDIR"]
METAG_SAMPLELIST = pd.read_table(config["metaG_sample_list"], index_col="Assembly_group")
METAG_ASSEMBLYGROUP = list(METAG_SAMPLELIST.index)
ASSEMBLYGROUP = METAG_ASSEMBLYGROUP
#----COMPUTE VAR----#
MEGAHIT_CPU = config["megahit_cpu"]
MEGAHIT_MIN_CONTIG = config["megahit_min_contig"]
MEGAHIT_MEM = config["megahit_mem"]
MEGAHIT_OTHER = config["megahit_other"]
and slurm error output
snakemake: error: unrecognized arguments: --metaG_accession=PRJNA766694
--metaG_ena_table=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/ENA_tables/PRJNA766694_metaG_wenv.txt
--inputDIR=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/input
--outputDIR=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/output
--scratch=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/scratch
--adapters=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/adapters/illumina-adapters.fa
--metaG_sample_list=/home/etucker5/miniconda3/envs/s-niv-MAGs/data/input/SampleList_ForAssembly_metaG.txt
--megahit_cpu=80 --megahit_min_contig=1000 --megahit_mem=0.95
On execution it fails to recognize arguments in my config.yaml file (for ex.):
snakemake: error: unrecognized arguments: --inputDIR=[path\to\dir]
In my understanding the Snakefile should be able to take any arguments stated in the config.yaml using:
INPUTDIR = config["inputDIR"]
when:
configfile: "config.yaml"
is input in my Snakefile.
Also, my config.yaml properly recognizes non-custom arguments such as:
max-jobs-per-second: 1
Is there some custom library setup that I need to initiate for this particular config.yaml? This is my first time using Snakemake and I am still learning how to properly work with config files.
Also, on swapping the paths directly into the Snakefile I was able to get the summary output for my dryrun without the unrecognized arguments error.

The issue was the way in which the workflow was executed using Slurm. I had been executing snakemake with sbatch as a bash script.
Instead, snakemake can be executed directly through the terminal or using bash. While I'm not exactly sure why, this caused my jobs to run on the cluster's local, which has tight memory limits, rather than on the hpc partitions that have appropriate capacity. Executing it this way also caused snakemake to not recognize the paths set in my config.yaml.
The lesson learned is that snakemake has a built in way of communicating with the slurm manager and it seems that executing snakemake through sbatch will cause conflicts.
("Conda Env") [user#log001 "Main Directory"]$ snakemake \
> --jobs 100 --use-conda -p -s Snakefile \
> --cluster-config cluster.yaml --cluster "sbatch \
> --parsable --qos=unlim --partition={cluster.queue} \
> --job-name=TARA.{rule}.{wildcards} --mem={cluster.mem}gb \
> --time={cluster.time} --ntasks={cluster.threads} --nodes={cluster.nodes}"

Related

What's the alternate of echo for windows cmd?

I want to override the args of a model in the "comp1" folder by passing the parameters to the main file in the "component" folder and hence need some mechanism to pass the override args.
I've run it before in wsl2 and it worked.I want it to work in windows cmd and hence need some workaround or an alternate of echo to be able to pass the override parameters to the main file.
Adding project folder structure for reference:
Folder Component1
Folder comp1
Adding the MLproject file(Used for wsl2) for reference:
name: KNN
conda_env: conda.yml
entry_points:
main:
parameters:
hydra_options:
description: Hydra parameters to override
type: str
default: ''
command: >-
python main.py $(echo {hydra_options})
I've tried the set command in windows to assign a variable to the override params(passed through cmd) and then use it to concatenate with the python main.py file to incorporate the hydra override parameters but it doesn't seem to work as well.
Adding for reference:
name: KNN_main
conda_env: conda.yml
entry_points:
main:
parameters:
hydra_options:
description: Hydra values to override
type: str
default: " "
command: >-
#echo off
set command = "python main.py" and %{hydra_options}%
echo %command%
Tech stack: MLflow==1.29.0 Hydra==1.2.0
OS: Windows 10

According to this answer, you shouldn't place space before and after = in the set command.
It would work if you rewrote the MLproject into this:
name: KNN_main
conda_env: conda.yml
entry_points:
main:
parameters:
hydra_options:
description: Hydra values to override
type: str
default: " "
command: >-
#echo off
set command="python main.py %{hydra_options}%"
echo %command%
Also, I'm not sure but I think you don't need echo and this command will work.
command: >-
python main.py %{hydra_options}%

Dataflow reading from PubSub works at GCP, can't run locally

I have a small test Dataflow job that just reads from a PubSub subscription and discards the message, that we're using to start some proof-of-concept work.
It works just fine running at GCP, but fails locally. My expectation is that the same code should work either way, just by switching the Dataflow runner, but perhaps that's not the case? Here's the code:
import os
from datetime import datetime
import logging
from apache_beam import Map, io, Pipeline
from apache_beam.options.pipeline_options import PipelineOptions
def noop(element):
pass
def run(input_subscription, pipeline_args=None):
pipeline_options = PipelineOptions(
pipeline_args, streaming=True, save_main_session=True
)
with Pipeline(options=pipeline_options) as pipeline:
(
pipeline
| "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
| "noop" >> Map(noop)
)
if __name__ == "__main__":
logging.getLogger().setLevel(logging.INFO)
run(
os.environ['INPUT_SUBSCRIPTION'],
[
'--runner', os.getenv('RUNNER', 'DirectRunner'),
'--project', os.getenv('PROJECT'),
'--region', os.getenv('REGION'),
'--temp_location', os.getenv('TEMP_LOCATION'),
'--service_account_email', os.getenv('SERVICE_ACCOUNT_EMAIL'),
'--network', os.getenv('NETWORK'),
'--subnetwork', os.getenv('SUBNETWORK'),
'--num_workers', os.getenv('NUM_WORKERS'),
]
)
If I run it with this command line, it creates and runs the job in the Google Cloud just fine:
INPUT_SUBSCRIPTION=subscriptionname \
RUNNER=DataflowRunner \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount#project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py
If I omit the RUNNER option, so it uses DirectRunner:
INPUT_SUBSCRIPTION=subscriptionname \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount#project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py
it fails with a whole flood of error messages, but I'll just include the first one (I think the rest are just cascading):
INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
/Users/denis/redacted/env/lib/python3.6/site-packages/google/auth/_default.py:70: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fed3e368448>, due to an exception.
Traceback (most recent call last):
File "/Users/denis/redacted/env/lib/python3.6/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 694, in _read_from_pubsub
self._sub_name, max_messages=10, return_immediately=True)
File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 40, in <lambda>
fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw) # noqa
File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1106, in pull
"If the `request` argument is set, then none of "
ValueError: If the `request` argument is set, then none of the individual field arguments should be set.
During handling of the above exception, another exception occurred:
...etc...
I suspect maybe this has to do with credentials? Or our project config? Perhaps I should try in a new blank project.

This turned out to be incompatible package versions. My requirements.txt had been:
apache_beam[gcp]
google_apitools
google-cloud-pubsub
but that was installing a version of the google-cloud-pubsub package that was breaking apache_beam. I changed my requirements.txt to:
apache_beam[gcp]
google_apitools
and it all works now!
And for what it's worth, running locally with DirectRunner I obviously did not need a lot of the options that I needed for DataflowRunner. This sufficed:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json \
RUNNER=DirectRunner \
INPUT_SUBSCRIPTION=projects/mytopic/subscriptions/mysubscription \
python read-pubsub-with-dataflow.py

How to use singularity and conda wrappers in Snakemake

TLDR I'm getting the following error:
The 'conda' command is not available inside your singularity container image. Snakemake mounts your conda installation into singularity. Sometimes, this can fail because of shell restrictions. It has been tested to work with docker://ubuntu, but it e.g. fails with docker://bash
I had created a Snakemake workflow and converted the shell: commands to rule-based package management via Snakemake wrappers: .
However, I ran into issues running this on HPC and one of the HPC support staff strongly recommended against using conda on any HPC system as:
"if the builder [of wrapper] is not super careful, dynamic libraries present in the conda environment that relies on the host libs (there are always a couple present because builder are most of the time carefree) will break. I think that relying on Singularity for your pipeline would make for a more robust system." - Anon
I did some reading over the weekend and according to this document, it's possible to combine containers with conda-based package management; by defining a global conda docker container and per-rule yaml files.
Note: In contrast to the example in the link above (Figure 5.4), which uses a predefined yaml and shell: command, here I've use
conda wrappers which download these yaml files into the
Singularity container (if I'm thinking correctly) so I thought should function the same - see the Note: at the end though...
Snakefile, config.yaml and samples.txt
Snakefile
# Directories------------------------------------------------------------------
configfile: "config.yaml"
# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))
import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))
# get number of samples
len_samples = len(sample_names)
# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"
# Rules -----------------------------------------------------------------------
rule all:
input:
"resources/vep/plugins",
"resources/vep/cache"
rule download_vep_plugins:
output:
directory("resources/vep/plugins")
params:
release=100
resources:
mem=1000,
time=30
wrapper:
"0.66.0/bio/vep/plugins"
rule get_vep_cache:
output:
directory("resources/vep/cache")
params:
species="caenorhabditis_elegans",
build="WBcel235",
release="100"
resources:
mem=1000,
time=30
log:
"logs/vep/cache.log"
cache: True # save space and time with between workflow caching (see docs)
wrapper:
"0.66.0/bio/vep/cache"
config.yaml
# Files
REF_GENOME: "c_elegans.PRJNA13758.WS265.genomic.fa"
GENOME_ANNOTATION: "c_elegans.PRJNA13758.WS265.annotations.gff3"
# Tools
QC_TOOL: "fastQC"
TRIM_TOOL: "trimmomatic"
ALIGN_TOOL: "bwa"
MARKDUP_TOOL: "picard"
CALLING_TOOL: "varscan"
ANNOT_TOOL: "vep"
samples.txt
sample location
MTG324 /home/moldach/wrappers/SUBSET/MTG324_SUBSET
Submission
snakemake --profile slurm --use-singularity --use-conda --jobs 2
Logs
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 get_vep_cache
1
[Mon Sep 21 15:35:50 2020]
rule get_vep_cache:
output: resources/vep/cache
log: logs/vep/cache.log
jobid: 0
resources: mem=1000, time=30
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/VEP/.snakemake/conda/774ea575
[Mon Sep 21 15:36:38 2020]
Finished job 0.
1 of 1 steps (100%) done
Note: Leaving --use-conda out of the submission of the workflow will cause an error for get_vep_cache: - /bin/bash: vep_install: command not found
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 download_vep_plugins
1
[Mon Sep 21 15:35:50 2020]
rule download_vep_plugins:
output: resources/vep/plugins
jobid: 0
resources: mem=1000, time=30
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/VEP/.snakemake/conda/9f602d9a
[Mon Sep 21 15:35:56 2020]
Finished job 0.
1 of 1 steps (100%) done
The problem occurs when adding the third rule, fastq:
Updated Snakefile
# Directories------------------------------------------------------------------
configfile: "config.yaml"
# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))
import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))
# get number of samples
len_samples = len(sample_names)
# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"
# Rules -----------------------------------------------------------------------
rule all:
input:
"resources/vep/plugins",
"resources/vep/cache",
expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])
rule download_vep_plugins:
output:
directory("resources/vep/plugins")
params:
release=100
resources:
mem=1000,
time=30
wrapper:
"0.66.0/bio/vep/plugins"
rule get_vep_cache:
output:
directory("resources/vep/cache")
params:
species="caenorhabditis_elegans",
build="WBcel235",
release="100"
resources:
mem=1000,
time=30
log:
"logs/vep/cache.log"
cache: True # save space and time with between workflow caching (see docs)
wrapper:
"0.66.0/bio/vep/cache"
def getHome(sample):
return(list(os.path.join(samples_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))
rule qc_before_trim_r1:
input:
r1=lambda wildcards: getHome(wildcards.sample)[0]
output:
html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.html"),
zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.zip"),
params:
dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
log:
os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R1.log")
resources:
mem=1000,
time=30
singularity:
"https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
threads: 1
message: """--- Quality check of raw data with FastQC before trimming."""
wrapper:
"0.66.0/bio/fastqc"
rule qc_before_trim_r2:
input:
r1=lambda wildcards: getHome(wildcards.sample)[1]
output:
html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.html"),
zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.zip"),
params:
dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
log:
os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R2.log")
resources:
mem=1000,
time=30
singularity:
"https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
threads: 1
message: """--- Quality check of raw data with FastQC before trimming."""
wrapper:
"0.66.0/bio/fastqc"
Error reported in nohup.out
Building DAG of jobs...
Pulling singularity image https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0.
CreateCondaEnvironmentException:
The 'conda' command is not available inside your singularity container image. Snakemake mounts your conda installation into singularity. Sometimes, this can fail because of shell restrictions. It has been tested to work with docker://ubuntu, but it e.g. fails with docker://bash
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 247, in create
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 381, in __new__
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 394, in __init__
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 417, in _check
using shell: instead of wrapper:
I changed the wrapper back into the shell command:
and this is the error I get when submitting with ``:
orkflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 qc_before_trim_r2
1
[Mon Sep 21 16:32:54 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.
Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/6740cb07e67eae01644839c9767bdca5.simg
^[[33mWARNING:^[[0m Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_CA.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Skipping '/home/moldach/wrappers/SUBSET/MTG324_SUBSET/MTG324_R2.fastq.gz' which didn't exist, or couldn't be read
Waiting at most 60 seconds for missing files.
MissingOutputException in line 84 of /home/moldach/wrappers/SUBSET/VEP/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 60 seconds:
qc/fastQC/before_trim/MTG324_R2_fastqc.html
qc/fastQC/before_trim/MTG324_R2_fastqc.zip
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 231, in handle_job_success
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The error Skipping '/home/moldach/wrappers/SUBSET/MTG324_SUBSET/MTG324_R2.fastq.gz' which didn't exist, or couldn't be read is misleading because the file is does exist...
Update 2
Following the advice Manavalan Gajapathy I've eliminated defining singularity at two different levels (global + per-rule).
Now I'm using a singularity container at only the global level and using wrappers via --use-conda which creates the conda environment inside of the container:
# Directories------------------------------------------------------------------
configfile: "config.yaml"
# Setting the names of all directories
dir_list = ["REF_DIR", "LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["refs", "logs", "benchmarks", "qc", "trimming", "alignment", "mark_duplicates", "variant_calling", "annotation"]
dirs_dict = dict(zip(dir_list, dir_names))
import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))
# get number of samples
len_samples = len(sample_names)
# Singularity with conda wrappers
singularity: "docker://continuumio/miniconda3:4.5.11"
# Rules -----------------------------------------------------------------------
rule all:
input:
"resources/vep/plugins",
"resources/vep/cache",
expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])
rule download_vep_plugins:
output:
directory("resources/vep/plugins")
params:
release=100
resources:
mem=1000,
time=30
wrapper:
"0.66.0/bio/vep/plugins"
rule get_vep_cache:
output:
directory("resources/vep/cache")
params:
species="caenorhabditis_elegans",
build="WBcel235",
release="100"
resources:
mem=1000,
time=30
log:
"logs/vep/cache.log"
cache: True # save space and time with between workflow caching (see docs)
wrapper:
"0.66.0/bio/vep/cache"
def getHome(sample):
return(list(os.path.join(samples_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))
rule qc_before_trim_r1:
input:
r1=lambda wildcards: getHome(wildcards.sample)[0]
output:
html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.html"),
zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.zip"),
params:
dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
log:
os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R1.log")
resources:
mem=1000,
threads: 1
message: """--- Quality check of raw data with FastQC before trimming."""
wrapper:
"0.66.0/bio/fastqc"
rule qc_before_trim_r2:
input:
r1=lambda wildcards: getHome(wildcards.sample)[1]
output:
html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.html"),
zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.zip"),
params:
dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
log:
os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R2.log")
resources:
mem=1000,
time=30
threads: 1
message: """--- Quality check of raw data with FastQC before trimming."""
wrapper:
"0.66.0/bio/fastqc"
and submit via:
However, I'm still getting an error:
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 qc_before_trim_r2
1
[Tue Sep 22 12:44:03 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.
Activating singularity image /home/moldach/wrappers/SUBSET/OMG/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/OMG/.snakemake/conda/c591f288
Skipping '/work/mtgraovac_lab/MATTS_SCRATCH/rep1_R2.fastq.gz' which didn't exist, or couldn't be read
Skipping ' 2> logs/fastQC/rep1_R2.log' which didn't exist, or couldn't be read
Failed to process qc/fastQC/before_trim
java.io.FileNotFoundException: qc/fastQC/before_trim (Is a directory)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:73)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Traceback (most recent call last):
File "/home/moldach/wrappers/SUBSET/OMG/.snakemake/scripts/tmpiwwprg5m.wrapper.py", line 35, in <module>
shell(
File "/mnt/snakemake/snakemake/shell.py", line 205, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; fastqc qc/fastQC/before_trim --quiet -t 1 --outdir /tmp/tmps93snag8 /work/mtgraovac_lab/MATTS_SCRATCH/rep1_R2.fastq.gz ' 2> logs/fastQC/rep1_R$
[Tue Sep 22 12:44:16 2020]
Error in rule qc_before_trim_r2:
jobid: 0
output: qc/fastQC/before_trim/rep1_R2_fastqc.html, qc/fastQC/before_trim/rep1_R2_fastqc.zip
log: logs/fastQC/rep1_R2.log (check log file(s) for error message)
conda-env: /home/moldach/wrappers/SUBSET/OMG/.snakemake/conda/c591f288
RuleException:
CalledProcessError in line 97 of /home/moldach/wrappers/SUBSET/OMG/Snakefile:
Command ' singularity exec --home /home/moldach/wrappers/SUBSET/OMG --bind /home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages:/mnt/snakemake /home/moldach/wrappers/SUBSET/OMG/.snakemake$
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
File "/home/moldach/wrappers/SUBSET/OMG/Snakefile", line 97, in __rule_qc_before_trim_r2
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/concurrent/futures/thread.py", line 57, in run
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Reproducibility
To replicate this you can download this small dataset:
git clone https://github.com/CRG-CNAG/CalliNGS-NF.git
cp CalliNGS-NF/data/reads/rep1_*.fq.gz .
mv rep1_1.fq.gz rep1_R1.fastq.gz
mv rep1_2.fq.gz rep1_R2.fastq.gz
UPDATE 3: Bind Mounts
According to the link shared on mounting:
"By default Singularity bind mounts /home/$USER, /tmp, and $PWD into your container at runtime."
Thus, for simplicity (and also because I got errors using --singularity-args), I've moved the required files into /home/$USER and tried to run from there.
(snakemake) [~]$ pwd
/home/moldach
(snakemake) [~]$ ll
total 3656
drwx------ 26 moldach moldach 4096 Aug 27 17:36 anaconda3
drwx------ 2 moldach moldach 4096 Sep 22 10:11 bin
-rw------- 1 moldach moldach 265 Sep 22 14:29 config.yaml
-rw------- 1 moldach moldach 1817903 Sep 22 14:29 rep1_R1.fastq.gz
-rw------- 1 moldach moldach 1870497 Sep 22 14:29 rep1_R2.fastq.gz
-rw------- 1 moldach moldach 55 Sep 22 14:29 samples.txt
-rw------- 1 moldach moldach 3420 Sep 22 14:29 Snakefile
and ran with bash -c "nohup snakemake --profile slurm --use-singularity --use-conda --jobs 4 &"
However, I still get this odd error:
Activating conda environment: /home/moldach/.snakemake/conda/fdae4f0d
Skipping ' 2> logs/fastQC/rep1_R2.log' which didn't exist, or couldn't be read
Failed to process qc/fastQC/before_trim
java.io.FileNotFoundException: qc/fastQC/before_trim (Is a directory)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:73)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Traceback (most recent call last):
Why does it think it's being given a directory?
Note: If you submit only with --use-conda, e.g. bash -c "nohup snakemake --profile slurm --use-conda --jobs 4 &" there is no error from the fastqc rules. However, the --use-conda param alone is not %100 reproducible, case-in-point doesn't work on another HPC I tested it on
The full log in nohup.out when using --printshellcmds can be found at this gist

TLDR:
fastqc singularity container used in qc rule likely doesn't have conda available in it, and this doesn't satisfy what snakemake's--use-conda expects.
Explanation:
You have singularity containers defined at two different levels - 1. global level that will be used for all rules, unless they are overridden at rule level; 2. per-rule level that will be used at the rule level.
# global singularity container to use
singularity: "docker://continuumio/miniconda3:4.5.11"
# singularity container defined at rule level
rule qc_before_trim_r1:
....
....
singularity:
"https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
When you use --use-singularity and --use-conda together, jobs will be run in conda environment inside the singularity container. So conda command needs to be available inside the singularity container for this to be possible. While this requirement is clearly satisfied for your global-level container, I am quite certain (haven't tested though) this is not the case for your fastqc container.
The way snakemake works if --use-conda flag is supplied, it will create conda environment locally or inside the container depending on the supply of --use-singularity flag. Since you are using snakemake-wrapper for qc rule and it comes with conda env recipe pre-defined, the easiest solution here is to just use the globally defined miniconda container for all rules. That is, there is no need to use fastqc specific container for qc rule.
If you really want to use the fastqc container, then you shouldn't be using --use-conda flag, but of course this will mean that all necessary tools are available from the container(s) defined globally or per rule.

Celery: The module was not found

I am using Open Semantic Search (OSS) and I would like to monitor its processes using the Flower tool. The workers that Celery needs should be given as OSS states on its website
The workers will do tasks like analysis and indexing of the queued files. The workers are implemented by etl/tasks.py and will be started automatically on boot by the service opensemanticsearch.
This tasks.py file looks as follows:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
#
# Queue tasks for batch processing and parallel processing
#
# Queue handler
from celery import Celery
# ETL connectors
from etl import ETL
from etl_delete import Delete
from etl_file import Connector_File
from etl_web import Connector_Web
from etl_rss import Connector_RSS
verbose = True
quiet = False
app = Celery('etl.tasks')
app.conf.CELERYD_MAX_TASKS_PER_CHILD = 1
etl_delete = Delete()
etl_web = Connector_Web()
etl_rss = Connector_RSS()
#
# Delete document with URI from index
#
#app.task(name='etl.delete')
def delete(uri):
etl_delete.delete(uri=uri)
#
# Index a file
#
#app.task(name='etl.index_file')
def index_file(filename, wait=0, config=None):
if wait:
time.sleep(wait)
etl_file = Connector_File()
if config:
etl_file.config = config
etl_file.index(filename=filename)
#
# Index file directory
#
#app.task(name='etl.index_filedirectory')
def index_filedirectory(filename):
from etl_filedirectory import Connector_Filedirectory
connector_filedirectory = Connector_Filedirectory()
result = connector_filedirectory.index(filename)
return result
#
# Index a webpage
#
#app.task(name='etl.index_web')
def index_web(uri, wait=0, downloaded_file=False, downloaded_headers=[]):
if wait:
time.sleep(wait)
result = etl_web.index(uri, downloaded_file=downloaded_file, downloaded_headers=downloaded_headers)
return result
#
# Index full website
#
#app.task(name='etl.index_web_crawl')
def index_web_crawl(uri, crawler_type="PATH"):
import etl_web_crawl
result = etl_web_crawl.index(uri, crawler_type)
return result
#
# Index webpages from sitemap
#
#app.task(name='etl.index_sitemap')
def index_sitemap(uri):
from etl_sitemap import Connector_Sitemap
connector_sitemap = Connector_Sitemap()
result = connector_sitemap.index(uri)
return result
#
# Index RSS Feed
#
#app.task(name='etl.index_rss')
def index_rss(uri):
result = etl_rss.index(uri)
return result
#
# Enrich with / run plugins
#
#app.task(name='etl.enrich')
def enrich(plugins, uri, wait=0):
if wait:
time.sleep(wait)
etl = ETL()
etl.read_configfile('/etc/opensemanticsearch/etl')
etl.read_configfile('/etc/opensemanticsearch/enhancer-rdf')
etl.config['plugins'] = plugins.split(',')
filename = uri
# if exist delete protocoll prefix file://
if filename.startswith("file://"):
filename = filename.replace("file://", '', 1)
parameters = etl.config.copy()
parameters['id'] = uri
parameters['filename'] = filename
parameters, data = etl.process (parameters=parameters, data={})
return data
#
# Read command line arguments and start
#
#if running (not imported to use its functions), run main function
if __name__ == "__main__":
from optparse import OptionParser
parser = OptionParser("etl-tasks [options]")
parser.add_option("-q", "--quiet", dest="quiet", action="store_true", default=False, help="Don\'t print status (filenames) while indexing")
parser.add_option("-v", "--verbose", dest="verbose", action="store_true", default=False, help="Print debug messages")
(options, args) = parser.parse_args()
if options.verbose == False or options.verbose==True:
verbose = options.verbose
etl_delete.verbose = options.verbose
etl_web.verbose = options.verbose
etl_rss.verbose = options.verbose
if options.quiet == False or options.quiet==True:
quiet = options.quiet
app.worker_main()
I read multiple tutorials about Celery and from my understanding, this line should do the job
celery -A etl.tasks flower
but it doesnt. The result is the statement
Error: Unable to load celery application. The module etl was not found.
Same for
celery -A etl.tasks worker --loglevel=debug
so Celery itself seems to be causing the trouble, not flower. I also tried e.g. celery -A etl.index_filedirectory worker --loglevel=debug but with the same result.
What am I missing? Do I have to somehow tell Celery where to find etl.tasks? Online research doesn't really show a similar case, most of the "Module not found" errors seem to occur while importing stuff. So possibly it's a silly question but I couldn't find a solution anywhere. I hope you guys can help me. Unfortunately, I won't be able to respond until Monday though, sorry in advance.

I got same issue, I installed and configured my queue as follows, and it works.
Install RabbitMQ
MacOS
brew install rabbitmq
sudo vim ~/.bash_profile
In bash_profile add the following line:
PATH=$PATH:/usr/local/sbin
Then update bash_profile:
sudo source ~/.bash_profile
Linux
sudo apt-get install rabbitmq-server
Configure RabbitMQ
Launch the queue:
sudo rabbitmq-server
In another Terminal, configure the queue:
sudo rabbitmqctl add_user myuser mypassword
sudo rabbitmqctl add_vhost myvhost
sudo rabbitmqctl set_user_tags myuser mytag
sudo rabbitmqctl set_permissions -p myvhost myuser ".*" ".*" ".*"
Launch Celery
I would suggest to go in the folder that contains task.py and use the following command:
celery -A task worker -l info -Q celery --concurrency 5

Beware that this error means two things:
The module is missing
The module exists but cannot be loaded. If it has errors in it, such as a SyntaxError for instance.
To check that it's not the latter, run:
python -c "import <myModuleContainingTasksDotPyFile>"
In the context of this question:
python -c "import etl"
If it crashes, fix this first (Unlike with celery, you'll get a detailed error message).

Solutions above did not work for me.
I had the same issue and my problem was that in main celery.py (that was in SmartCalend folder) I had:
app = Celery('proj')
but instead I must type there:
app = Celery('SmartCalend')
where SmartCalend is the actual app name where celery.py belongs (!). not any random word, but precisely app name. Thats nowhere mentioned, only in official docs here:

Try export PYTHONPATH=<parent directory> where parent directory is the folder where the etl is. Run the Celery worker, and see it if fixes your problem. This is probably one of the most common Celery "issues" (not really Celery, but Python in general). Alternatively, run the Celery worker from that folder.

Answer for MacOS Catalina:
When you install celery with pip (pip install celery), python can import celery, but you are not able to launch celery from the terminal because the terminal does not know of the celery executable.
Add celery to the path to fix:
nano ~/.bash_profile
In the file add: export PATH="/Users/gavinbelson/Library/Python/2.7/bin:$PATH"
To save the file in the nano editor: ctrl+o, then enter, then ctrl+x
To update the terminal with your change type: source ~/.bash_profile
Now you should be able to type celery in the terminal window
---- Note this is for the default python terminal command which runs version 2.7. If you are using python3 to run python, you would need to change alter the path variable accordingly

running python-daemon as non-priviliged user and keeping group-memberships

I'm writing a daemon in python, using the python-daemon package. the daemon is started at boot-time (init.d) and needs to access various devices.
the daemon is to run on an embedded system (beaglebone) running ubuntu.
now my problem is that I want to run the daemon as an unprivileged user rather (e.g. mydaemon) than root.
in order to allow the daemon to access the devices I added that user to the required groups.
in the python code I use daemon.DaemonContext(uid=uidofmydamon).
the process started by root daemonizes nicely and is owned by the correct user, but I get permission denied errors when trying to access the devices.
I wrote a small test application, and it seems that the process does not inherit the group-memberships of the user.
#!/usr/bin/python
import logging, daemon, os
if __name__ == '__main__':
lh=logging.StreamHandler()
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(lh)
uid=1001 ## UID of the daemon user
with daemon.DaemonContext(uid=uid,
files_preserve=[lh.stream],
stderr=lh.stream):
logger.warn("UID : %s" % str(os.getuid()))
logger.warn("groups: %s" % str(os.getgroups()))
when i run the above code as the user with uid=1001 i get something like
$ ./testdaemon.py
UID: 1001
groups: [29,107,1001]
whereas when I run the above code as root (or su), I get:
$ sudo ./testdaemon.py
UID: 1001
groups: [0]
How can I create a daemon-process started by root but with a different effective uid and intact group memberships?

my current solution involves dropping root priviliges before starting the actual daemon, using the chuid argument for start-stop-daemon:
start-stop-daemon \
--start \
--chuid daemonuser \
--name testdaemon \
--pidfile /var/run/testdaemon/test.pid \
--startas /tmp/testdaemon.py \
-- \
--pidfile /var/run/testdaemon/test.pid \
--logfile=/var/log/testdaemon/testdaemon.log
the drawback of this solution is, that i need to create all directories, where the daemon ought to write to (noteably /var/run/testdaemon and /var/log/testdaemon), before starting the actual daemon (with the proper file permissions).
i would have preferred to write that logic in python rather than bash.
for now that works, but me thinketh that this should be solveable in a more elegant fashion.

This can be fixed by monkey patching the daemon module, the code is as follows:
import os, grp, pwd
class DaemonError(Exception):
pass
class DaemonOSEnvironmentError(DaemonError, OSError):
pass
def change_process_owner(uid, gid):
try:
# This line adds all the groups the user is member of
# to keep the expected permissions
os.setgroups(
[g.gr_gid for g in grp.getgrall()
if pwd.getpwuid(uid).pw_name in g.gr_mem
]
)
os.setgid(gid)
os.setuid(uid)
except Exception, exc:
error = DaemonOSEnvironmentError(u"Unable to change process
owner (%(exc)s)" % vars())
raise error
And then the monkey patch:
import daemon
daemon.daemon.change_process_owner = change_process_owner

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.