Snakemake: use file content as shell command - python

I'm trying to automate a shell command that runs the chopchop software, using snakemake:
    ./chopchop.py -G hg38 -o temp -Target chr16:46390060-46390782
In this command the 'chr16:46390060-46390782' input will change. All the different inputs are in a file, which I'll have to parse to get the appropriate input format:
    cat test.bed
    chr16 46390060 46390782
    chr21 33931554 33931728
I have a simple snakemake rule that runs the shell command:
    rule run_chopchop:
        input:
            "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/enhancer_dataset/jurkat/test.bed"
        output:
            "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/chopchop_output/guide_chopchop.txt"
        shell:'''
            set +u; source /gpfs/home/user/Apps/anaconda3/bin/activate chopchop; set -u
            /gpfs/home/user/git/chopchop/chopchop.py -G hg38 -o temp -Target {input} > {output}
            '''
How can I use the content of the file as input in snakemake, and change the syntax of each line to get the appropriate format? I really have no idea of the syntax. If someone can help me.
Thanks

You can always replace the shell: section with a run: one. In this case you need to call the shell() function each time you need to run the script, once per line of the BED file:
    rule run_chopchop:
        input: "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/enhancer_dataset/jurkat/test.bed"
        output: "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/chopchop_output/guide_chopchop.txt"
        run:
            # for each line in the BED file, build the chr:start-end target string
            with open(input[0]) as bed:
                for line in bed:
                    chrom, start, end = line.split()[:3]
                    target = f"{chrom}:{start}-{end}"
                    # shell() also formats {target} and {output} from the enclosing scope;
                    # note >> so each run appends to the single declared output file
                    shell('''
                        set +u; source /gpfs/home/user/Apps/anaconda3/bin/activate chopchop; set -u
                        /gpfs/home/user/git/chopchop/chopchop.py -G hg38 -o temp -Target {target} >> {output}
                        ''')
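If it helps, the line-to-target conversion can be sanity-checked on its own, outside Snakemake (a minimal standalone sketch; the literal values are just taken from test.bed above):
    # minimal check of the BED-line -> chr:start-end conversion
    line = "chr16 46390060 46390782"
    chrom, start, end = line.split()[:3]
    assert f"{chrom}:{start}-{end}" == "chr16:46390060-46390782"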

Related

Assertion error or snakemake: error: argument --snakefile/-s: expected one argument when working with Snakemake pipeline

I am trying to run a Snakemake pipeline, but it either runs into an assertion error or snakemake is unable to find the snakefile for some reason. I am running the pipeline through a Snakefile, a config.yml file, and a bash script.
Code:
Config.yml file:
    REPO_DIR="/path/to/pipeline"
    REF_FASTA="$REPO_DIR/data/genome/sacCer3.fasta"
    FASTQ_DIR="$REPO_DIR/pipelinetest/fastq"
    OUTPUT_DIR="$REPO_DIR/pipelineoutput"
    ANC_DIR="$REPO_DIR/pipelineanc"
    LOG_FILE="$OUTPUT_DIR/00_logs/pipeline.log"
    SNAKE_FILE="$REPO_DIR/workflow/Snakefile.py"
    CONFIG_FILE="$REPO_DIR/config/config.yml"
    cd $REPO_DIR
Bash script:
    #!/bin/bash
    # activate conda env
    source activate pipeline_env
    # run the pipeline
    snakemake --cores --snakefile snakefile=$SNAKE_FILE --configfile snakefile config_file=$CONFIG_FILE \
        --config output_dir=$OUTPUT_DIR fastq_dir=$FASTQ_DIR anc_dir=$ANC_DIR ref_fasta=$REF_FASTA \
        --use-conda --conda-prefix="$HOME/.snakemake/conda"
    echo -e "\nDONE!\n"
Part of snakefile:
    import os
    import json
    from datetime import datetime

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Define Constants ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
    # discover input files using path from run config
    SAMPLES = list(set(glob_wildcards(f"{config['fastq_dir']}/{{sample}}_R1_001.fastq.gz").sample))
    # read output dir path from run config
    OUTPUT_DIR = config['output_dir']
    # Project name and date for bam header
    SEQID = 'pipeline_align'

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
    # https://snakemake.readthedocs.io/en/v7.14.0/tutorial/basics.html#step-7-adding-a-target-rule
    rule all:
        input:
            f'{OUTPUT_DIR}/DONE.txt'

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~ Set Up Reference Files ~~~~~~~~~~~~~~~~~~~~~~~~~ #
    #
    # export the current run configuration in JSON format
    #
    rule export_run_config:
        output:
            path=f"{OUTPUT_DIR}/00_logs/00_run_config.json"
        run:
            with open(output.path, 'w') as outfile:
                json.dump(dict(config), outfile, indent=4)

    #
    # make a list of discovered samples
    #
    rule list_samples:
        output:
            f"{OUTPUT_DIR}/00_logs/00_sample_list.txt"
        shell:
            "echo -e '{}' > {{output}}".format('\n'.join(SAMPLES))

    #
    # copy the supplied reference genome fasta to the pipeline output directory for reference
    #
    rule copy_fasta:
        input:
            config['ref_fasta']
        output:
            f"{OUTPUT_DIR}/01_ref_files/{os.path.basename(config['ref_fasta'])}"
        shell:
            "cp {input} {output}"

    rule index_fasta:
        input:
            rules.copy_fasta.output
        output:
            f"{rules.copy_fasta.output}.fai"
        conda:
            'envs/main.yml'
        shell:
            "samtools faidx {input}"

    rule create_ref_dict:
        input:
            rules.copy_fasta.output
        output:
            f"{rules.copy_fasta.output}".rstrip('fasta') + 'dict'
        conda:
            'envs/main.yml'
        shell:
            "picard CreateSequenceDictionary -R {input}"

    #
    # create a BWA index from the copied fasta reference genome
    #
    rule create_bwa_index:
        input:
            rules.copy_fasta.output
        output:
            f"{rules.copy_fasta.output}.amb",
            f"{rules.copy_fasta.output}.ann",
            f"{rules.copy_fasta.output}.bwt",
            f"{rules.copy_fasta.output}.pac",
            f"{rules.copy_fasta.output}.sa",
        conda:
            'envs/main.yml'
        shell:
            "bwa index {input}"
And then I start putting the ancestor and sample files through the pipeline. However, the problem occurs upstream, before the Snakefile even gets executed. I've tried reinstalling the git repo, but to no avail. I've also tried echoing the file path in the bash script, but the Snakefile still couldn't be found. I am submitting the bash script to a cluster, and it fails almost immediately.
How do I make sure the Snakefile is recognized through the bash script?
Error:
    assert v is not None
    AssertionError
or
    snakemake: error: argument --snakefile/-s: expected one argument
I am using mamba as a package manager.
The shell call should be changed to this:
    snakemake --cores --snakefile "$SNAKE_FILE" --configfile "$CONFIG_FILE" ...
Where ... is the rest of the command. The main problem is that --snakefile (or -s) expects a plain string path to the Snakefile, without any keyword=value syntax. Similarly, --configfile expects a plain path. Note also that the variable assignments live in the file shown as "Config.yml", which is a shell snippet the bash script never sources; if $SNAKE_FILE is unset when the job runs, the unquoted variable expands to nothing and --snakefile receives no argument at all, which is one way to get the "expected one argument" error.
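As an aside, the file shown above as "Config.yml" is a bash snippet, not YAML, so Snakemake could not read it via --configfile even once the path is fixed. A YAML version that Snakemake itself can parse might look like this (a sketch; the keys match the config['...'] lookups in the Snakefile, and the paths are assumed from the question):
    # config/config.yml -- loaded via --configfile into the `config` dict
    output_dir: "/path/to/pipeline/pipelineoutput"
    fastq_dir: "/path/to/pipeline/pipelinetest/fastq"
    anc_dir: "/path/to/pipeline/pipelineanc"
    ref_fasta: "/path/to/pipeline/data/genome/sacCer3.fasta"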

Snakemake: integrate multiple command lines in a rule

The output of my first command line, "bcftools query -l {input.invcf} | head -n 1", prints the name of the first individual in the VCF file (e.g. IND1). I want to use that output in GATK SelectVariants, in the -sn IND1 option. How is it possible to integrate the first command line in snakemake, in order to use its output in the next one?
    rule selectvar:
        input:
            invcf="{family}_my.vcf"
        params:
            ind= ???
            ref="ref.fasta"
        output:
            out="{family}.dn.vcf"
        shell:
            """
            bcftools query -l {input.invcf} | head -n 1 > {params.ind}
            gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn {params.ind} -O {output.out}
            """
There are several options, but the easiest one is to store the result in a temporary bash variable:
    rule selectvar:
        ...
        shell:
            """
            myparam=$(bcftools query -l {input.invcf} | head -n 1)
            gatk -sn "$myparam" ...
            """
As noted by @dariober, if one modifies the pipefail behaviour, there could be unexpected results; see the example in this answer.
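For context: recent Snakemake versions run shell commands under bash "strict mode" (set -euo pipefail), and head -n 1 closing the pipe early can make bcftools exit non-zero, which pipefail then reports as a failure. A sketch of the same rule with pipefail relaxed for just this command (the rule body is adapted from the answer above; gatk options are omitted for brevity):
    rule selectvar:
        input:
            invcf="{family}_my.vcf"
        output:
            out="{family}.dn.vcf"
        shell:
            """
            set +o pipefail  # tolerate bcftools being stopped by head's early exit
            myparam=$(bcftools query -l {input.invcf} | head -n 1)
            gatk SelectVariants -R ref.fasta -V {input.invcf} -sn "$myparam" -O {output.out}
            """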
When I have to do these things I prefer to use run instead of shell, and then shell out only at the end. The reason is that this makes it possible for snakemake to lint the run statement, and to exit early if something goes wrong instead of following through with a broken shell command.
    rule selectvar:
        input:
            invcf="{family}_my.vcf"
        params:
            ref="ref.fasta",
            gatk_opts='--java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants'
        output:
            out="{family}.dn.vcf"
        run:
            # read=True makes shell() return the command's stdout as a string
            sn_parameter = shell("bcftools query -l {input.invcf} | head -n 1", read=True).strip()
            # we could add a sanity check here if necessary, before shelling out
            shell("gatk {params.gatk_opts} -R {params.ref} -V {input.invcf} -sn {sn_parameter} -O {output.out}")
I think I found a solution:
    rule selectvar:
        input:
            invcf="{family}_my.vcf"
        params:
            ref="ref.fasta"
        output:
            out="{family}.dn.vcf"
        shell:
            """
            gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn `bcftools query -l {input.invcf} | head -n 1` -O {output.out}
            """

Snakemake ambiguity

I have an ambiguity error and I can't figure out why and how to solve it.
Defining the wildcards:
    rule all:
        input:
            xls = expand("reports/{sample}.xlsx", sample = config["samples"]),
            runfolder_xls = expand("{runfolder}.xlsx", runfolder = config["runfolder"])
Actual rules:
    rule sample_report:
        input:
            vcf = "vcfs/{sample}.annotated.vcf",
            cov = "stats/{sample}.coverage.gz",
            mod_bed = "tmp/mod_ref_{sample}.bed",
            nirvana_g2t = "/mnt/storage/data/NGS/nirvana_genes2transcripts"
        output:
            "reports/{sample}.xlsx"
        params:
            get_nb_samples()
        log:
            "logs/{sample}.log"
        shell: """
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_sample_report.py -v {input.vcf} -c {input.cov} -r {input.mod_bed} -n {input.nirvana_g2t} -r {rule} ;
            exitcode=$? ;
            if [[ {params} > 1 ]]
            then
                python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.sample}
            elif [[ {params} == 1 ]]
            then
                python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r sample_mode -n {wildcards.sample}
            else
                python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e 1 -r {rule} -n {wildcards.sample}
            fi
            """

    rule runfolder_report:
        input:
            sample_sheet = "SampleSheet.csv"
        output:
            "{runfolder}.xlsx"
        log:
            "logs/{runfolder}.log"
        shell: """
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {wildcards.runfolder} -s {input.sample_sheet} -r {rule} ;
            exitcode=$? ;
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.runfolder}
            """
Config file:
runfolder: "CP0340"
samples: ['C014044p', 'C130157', 'C014040p', 'C014054b-1', 'C051198-A', 'C014042p', 'C052007W-C', 'C051198-B', 'C014038p', 'C052004-B', 'C051198-C', 'C052004-C', 'C052003-B', 'C052003-A', 'C052004-A', 'C052002-C', 'C052005-C', 'C052002-A', 'C130157N', 'C052006-B', 'C014063pW', 'C014054b-2', 'C052002-B', 'C052006-C', 'C052007W-B', 'C052003-C', 'C014064bW', 'C052005-B', 'C052006-A', 'C052005-A']
Error:
    $ snakemake -n -s ../niles/Snakefile --configfile logs/CP0340_config.yaml
    Building DAG of jobs...
    AmbiguousRuleException:
    Rules runfolder_report and sample_report are ambiguous for the file reports/C014044p.xlsx.
    Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
    Wildcards:
        runfolder_report: runfolder=reports/C014044p
        sample_report: sample=C014044p
    Expected input files:
        runfolder_report: SampleSheet.csv
        sample_report: vcfs/C014044p.annotated.vcf stats/C014044p.coverage.gz tmp/mod_ref_C014044p.bed /mnt/storage/data/NGS/nirvana_genes2transcripts
    Expected output files:
        runfolder_report: reports/C014044p.xlsx
        sample_report: reports/C014044p.xlsx
If I understand Snakemake correctly, the wildcards in the rules are defined in my all rule, so I don't understand why the runfolder_report rule tries to produce reports/C014044p.xlsx as an output, nor how the output has a sample name instead of the runfolder name (as defined in the config file).
As the error message suggests, you could assign a distinct prefix to the output of each rule. So your original code will work if you replace {runfolder}.xlsx with, e.g., "runfolder/{runfolder}.xlsx" in rule all and in runfolder_report. Alternatively, constrain the wildcards (my preferred solution) by adding, before rule all, something like:
    import re

    wildcard_constraints:
        sample = '|'.join([re.escape(x) for x in config["samples"]]),
        runfolder = re.escape(config["runfolder"])
The reason for this is that snakemake matches input and output strings using regular expressions (the fine details of how it's done, I must admit, escape me...)
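To make the regular-expression matching concrete, this is roughly what happens to the runfolder_report output pattern (a simplified sketch, not Snakemake's actual implementation): each wildcard becomes a named group whose default pattern is .+, and . also matches directory separators:
    import re

    # "{runfolder}.xlsx" compiled with the default wildcard pattern .+
    pattern = re.compile(r"(?P<runfolder>.+)\.xlsx")
    m = pattern.fullmatch("reports/C014044p.xlsx")
    print(m.group("runfolder"))  # -> "reports/C014044p", exactly the odd wildcard in the error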
Ok here is my solution:
    rule runfolder_report:
        input:
            "SampleSheet.csv"
        output:
            expand("{runfolder}.xlsx", runfolder = config["runfolder"])
        params:
            config["runfolder"]
        log:
            expand("logs/{runfolder}.log", runfolder = config["runfolder"])
        shell: """
            set +e ;
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {params} -s {input} -r {rule} ;
            exitcode=$? ;
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {params}
            """
However, I still don't understand why it raised errors, since I know this worked previously.

Snakemake problems with Log

I'm trying to write a Log param in my Snakefile, but I don't know what I'm missing. This is my code:
    include:
        'config.py'

    rule all:
        input:
            expand(WORK_DIR + "/trimmed/TFB{sample}_R{read_no}.fastq.gz.good",
                   sample=SAMPLE_TFB, read_no=['1', '2'])

    rule fastp:
        input:
            R1 = SAMPLES_DIR + "/TFB{sample}_R1.fastq.gz",
            R2 = SAMPLES_DIR + "/TFB{sample}_R2.fastq.gz"
        output:
            R1out = WORK_DIR + "/trimmed/TFB{sample}_R1.fastq.gz.good",
            R2out = WORK_DIR + "/trimmed/TFB{sample}_R2.fastq.gz.good"
        log:
            log = WORK_DIR + "/logs/fastp/{sample}.html"
        shell:
            "fastp -i {input.R1} -I {input.R2} -o {output.R1out} -O {output.R2out} \
            -h {log.log}"
This is the error I get after executing snakemake:
    SyntaxError in line 16 of /work/users/leboralli/trofoZikaLincRNAs/scripts/Snakefile:
    Colon expected after keyword log. (Snakefile, line 16)
I tried many options and nothing worked. This software, fastp, has a parameter for logging, -h, which writes an .html report. Without the log section my code works fine.
Thanks in advance.
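One thing worth trying (an assumption, since the question doesn't state the Snakemake version): reduce the log section to a single unnamed file, which removes the named-entry syntax the parser is complaining about. A sketch of the same rule with an unnamed log:
    rule fastp:
        input:
            R1 = SAMPLES_DIR + "/TFB{sample}_R1.fastq.gz",
            R2 = SAMPLES_DIR + "/TFB{sample}_R2.fastq.gz"
        output:
            R1out = WORK_DIR + "/trimmed/TFB{sample}_R1.fastq.gz.good",
            R2out = WORK_DIR + "/trimmed/TFB{sample}_R2.fastq.gz.good"
        log:
            WORK_DIR + "/logs/fastp/{sample}.html"
        shell:
            "fastp -i {input.R1} -I {input.R2} -o {output.R1out} -O {output.R2out} -h {log}"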

rsync + ssh in python subprocess gets error with spaces in directory name

When I run my Python application (which synchronizes a remote directory locally), I have a problem if the directory that contains my app has one or more spaces in its name.
The directory name appears in ssh options like "-o UserKnownHostsFile=<path>" and "-i <path>".
I tried double-quoting the paths in my function that generates the command string, but nothing. I also tried replacing spaces like this: path.replace(' ', '\\ '), but it doesn't work.
Note that my code works with directory names without spaces.
The error returned by ssh is "garbage at the end of line" (code 12).
The generated command line seems OK:
    rsync -rztv --delete --stats --progress --timeout=900 --size-only --dry-run \
        -e 'ssh -o BatchMode=yes \
        -o UserKnownHostsFile="/cygdrive/C/Users/my.user/my\ app/.ssh/known_hosts" \
        -i "/cygdrive/C/Users/my.user/my\ app/.ssh/id_rsa"' \
        user@host:/home/user/folder/ "/cygdrive/C/Users/my.user/my\ app/folder/"
What am I doing wrong? Thank you!
Have you tried building your command as a list of arguments? I just had a similar problem passing a key file for the ssh connection:
    import subprocess

    command = [
        "rsync",
        "-rztv",
        "--delete",
        "--stats",
        "--progress",
        "--timeout=900",
        "--size-only",
        "--dry-run",
        "-e",
        'ssh -o BatchMode=yes -o UserKnownHostsFile="/cygdrive/C/Users/my.user/my app/.ssh/known_hosts" -i "/cygdrive/C/Users/my.user/my app/.ssh/id_rsa"',
        "user@host:/home/user/folder/",
        "/cygdrive/C/Users/my.user/my app/folder/"
    ]
    # with an argument list, drop shell=True: each element is passed to rsync verbatim,
    # so the spaces in the outer paths need no escaping; only the -e string, which
    # rsync itself re-splits, keeps its inner double quotes
    sp = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = sp.communicate()[0]
The returned data from the sub-command includes spaces, which the shell treats as delimiters. Try updating the Internal Field Separator (IFS) list like:
    # store a copy of the current IFS
    SAVEIFS=$IFS;
    # set IFS to be newline-only
    IFS=$(echo -en "\n\b");
    # execute your command(s)
    rsync -rztv --delete --stats --progress --timeout=900 --size-only --dry-run \
        -e 'ssh -o BatchMode=yes -o UserKnownHostsFile="/cygdrive/C/Users/my.user/my\ app/.ssh/known_hosts" -i "/cygdrive/C/Users/my.user/my\ app/.ssh/id_rsa"' \
        user@host:/home/user/folder/ "/cygdrive/C/Users/my.user/my\ app/folder/"
    # put the original IFS back
    IFS=$SAVEIFS;
I haven't tested using your command, though it has worked in all cases I've tried in the past.
To avoid escaping issues, use a raw string:
    raw_string = r'''rsync -rztv --delete --stats --progress --timeout=900 --size-only --dry-run -e 'ssh -o BatchMode=yes -o UserKnownHostsFile="/cygdrive/C/Users/my.user/my app/.ssh/known_hosts" -i "/cygdrive/C/Users/my.user/my app/.ssh/id_rsa"' user@host:/home/user/folder/ "/cygdrive/C/Users/my.user/my app/folder/"'''
    # a single command string requires shell=True so the shell does the splitting
    sp = subprocess.Popen(raw_string, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = sp.communicate()[0]
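For completeness, a variation not shown in the answers above: Python's shlex.quote can generate the inner quoting for the -e string instead of managing quotes by hand (a sketch; the paths are the ones from the question, and the rsync flags are abbreviated):
    import shlex
    import subprocess

    known_hosts = "/cygdrive/C/Users/my.user/my app/.ssh/known_hosts"
    identity = "/cygdrive/C/Users/my.user/my app/.ssh/id_rsa"

    # shlex.quote single-quotes each path, so rsync re-splits the -e string correctly
    ssh_command = (
        "ssh -o BatchMode=yes"
        f" -o UserKnownHostsFile={shlex.quote(known_hosts)}"
        f" -i {shlex.quote(identity)}"
    )
    command = ["rsync", "-rztv", "--dry-run", "-e", ssh_command,
               "user@host:/home/user/folder/",
               "/cygdrive/C/Users/my.user/my app/folder/"]
    subprocess.run(command, check=True)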
