'No values given for wildcard error' in snakemake - python

I am trying to make a simple pipeline using snakemake to download two files from the web and then merge them into a single output.
What I thought would work is the following code:
dwn_lnks = {
    '1': 'https://molb7621.github.io/workshop/_downloads/sample.fa',
    '2': 'https://molb7621.github.io/workshop/_downloads/sample.fa'
}
import os
# association between chromosomes and their links
def chromo2link(wildcards):
    return dwn_lnks[wildcards.chromo]
rule all:
    input:
        os.path.join('genome_dir', 'human_en37_sm.fa')
rule download:
    output:
        expand(os.path.join('chr_dir', '{chromo}')),
    params:
        link=chromo2link,
    shell:
        "wget {params.link} -O {output}"
rule merger:
    input:
        expand(os.path.join('chr_dir', "{chromo}"), chromo=dwn_lnks.keys())
    output:
        os.path.join('genome_dir', 'human_en37_sm.fa')
    run:
        txt = open({output}, 'a+')
        with open (os.path.join('chr_dir', "{chromo}") as file:
            line = file.readline()
            while line:
                txt.write(line)
                line = file.readline()
        txt.close()
This code returns the error:
No values given for wildcard 'chromo'. in line 20
Also, in the merger rule, the python code within the run does not work.
The tutorial in the snakemake package does not cover enough examples to learn the details for non-computer scientists. If anybody knows a good resource to learn how to work with snakemake, I would appreciate if they could share :).

The problem is that you have an expand function in the output of the rule download that does not define the value for the wildcard {chromo}. I guess what you really want here is
rule download:
    output:
        'chr_dir/{chromo}',
    params:
        link=chromo2link,
    shell:
        "wget {params.link} -O {output}"
without the expand. The expand function is only needed to aggregate over wildcards, as you do in the rule merger.
Also have a look at the official Snakemake tutorial, which explains this in detail.
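On the second point in the question (the run block of merger): inside a run block, input and output are ordinary Python objects, so {output} is parsed by Python as a set literal rather than a file name, and "{chromo}" is never filled in because merger has no chromo wildcard. A minimal sketch of the merging step as a plain Python function follows; the name merge_files is hypothetical, and it opens the output in write mode instead of 'a+' so that a rerun does not append to an old result.

```python
import shutil

def merge_files(input_paths, output_path):
    """Concatenate the files in input_paths, in order, into output_path."""
    with open(output_path, "w") as merged:
        for path in input_paths:
            with open(path) as part:
                # Stream the content instead of reading line by line.
                shutil.copyfileobj(part, merged)

# Inside the rule this could be called as (Snakemake syntax, not plain Python):
#     run:
#         merge_files(input, output[0])
```

Here input iterates over the files produced by expand in the rule's input section, and output[0] is the single merged file.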

Related

How do I use this python function within the params section of my Snakemake rule?

I'm trying to figure out how to extract the read-group lane information from a fastq file, and then use this string within my GATK AddOrReplaceReadGroups Snakemake rule (below). I've written a short Python function (at the top of the rule) to do this operation, but can't quite figure out how to actually use it in the script such that I can access the string output (e.g., "ABCHGSX35:2") of the function on a given fastq file in my GATK shell command. Basically, I want to feed {params.PathToReadR1} into the extract_lane_info function, but can't figure out how to integrate them correctly.
I'm open to any solution to the problem posed, or, if there's an entirely different and more efficient way to achieve the same result (getting the read-group lane info as a param value), that'd be great, too. Thanks for any help.
def extract_lane_info(file_path):
    # Run the bash command using subprocess.run()
    elements = subprocess.run(["zcat", file_path], stdout=subprocess.PIPE).stdout.split(b"\n")[0].split(b":")[2:4]
    # Extract the lane information from the output of the command using a regular expression
    read_group = elements[0].decode().strip("'")
    string_after = elements[1].decode().strip("'")
    elements = read_group + ":" + string_after
    return(elements)
rule AddOrReplaceReadGroups:
    input:
        "results/{species}/bwa/merged/{read}_pese.bam",
    output:
        "results/{species}/GATK/AddOrReplace/{read}.pese.bwa_mem.addRG.bam",
    threads: config["trimmmomatic"]["cpu"]
    log:
        "results/{species}/GATK/AddOrReplace/log/{read}_AddOrReplace.stderrs"
    message:
        "Running AddOrReplaceReadGroups on {wildcards.read}"
    conda:
        config["CondaEnvs"]
    params:
        ReadGroupID = lambda wildcards: final_Prefix_ReadGroup[wildcards.read],
        PathToReadR1 = lambda wildcards: final_Prefix_RawReadPathR1[wildcards.read],
        LIBRARY = lambda wildcards: final_Prefix_ReadGroupLibrary[wildcards.read],
    shell:
        "gatk AddOrReplaceReadGroups -I {input} -O {output} -ID {params.ReadGroupID}.1 -LB {params.LIBRARY} -PL illumina -PU {input.lane_info}:{params.ReadGroupID} -SM {params.ReadGroupID} --SORT_ORDER 'coordinate' --CREATE_INDEX true 2>> {log}"
Basically, {params.PathToReadR1} would be "path/to/file.fastq.gz", and I want this file to be passed into the extract_lane_info function, and then the output of this function to be used in the -PU section of the shell command (e.g., {params.lane_info}). I keep getting all types of errors as I've messed around with it, and am unsure how to move forward.
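One possible approach, sketched under assumptions: a params entry may be a function of the wildcards, so the lane string can be computed from the same dictionary lookup that PathToReadR1 uses. The version of extract_lane_info below is a variant, not the code from the question: it reads only the first header line with Python's gzip module instead of piping the whole file through zcat, and it assumes a colon-separated header in which fields 3 and 4 hold the values wanted (as the [2:4] slice in the question implies).

```python
import gzip

def extract_lane_info(fastq_gz_path):
    """Return fields 3 and 4 of the first FASTQ header line, joined by ':'."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        header = handle.readline().rstrip("\n")
    fields = header.split(":")
    return ":".join(fields[2:4])

# Possible wiring inside the rule (Snakemake syntax, not plain Python):
#     params:
#         lane_info = lambda wildcards: extract_lane_info(
#             final_Prefix_RawReadPathR1[wildcards.read]),
# The shell command would then refer to {params.lane_info}
# rather than {input.lane_info}.
```

Because the function opens the fastq file when the params are evaluated, the file must already exist at that point, which should hold for raw reads.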

MD task with `snakemake`

I want to create a very simple pipeline for molecular dynamics simulation. The program (Amber) just needs 3 files as input and produces a lot of files, some of which I will need later. So my pipeline is extremely simple:
- Check that *.in, *.prmtop and *.rst are in the folder (I guarantee there is only one file for each of these extensions) and warn if these files are not present
- Run a shell command (based on the names of all input files)
- Check that *.out, mden, mdinfo and *.nc were produced
That's all. It's the standard approach for the program I deal with. One folder, one task, short and simple file names based on file purpose, not on content.
I wrote a simple pipeline:
rule all:
    input: '{inp}.out'
rule amber:
    input:
        '{inp}.in',
        '{top}.prmtop',
        '{coord}.rst'
    output:
        '{inp}.out',
        'mden',
        'mdinfo',
        '{inp}.nc'
    shell:
        'pmemd.cuda'
        ' -O'
        ' -i {inp}.in'
        ' -o {inp}.out'
        ' -p {top}.prmtop'
        ' -c {coord}.rst'
        ' -r {inp}.rst'
        ' -x {inp}.nc'
        ' -ref {coord}.rst'
And it doesn't work.
- All inputs in all rules must be explicit. (Why? Why can't an input be a regex or wildcard expression? If I see *.out in the folder and the exit status of the shell command was 0, that's all, the work is done.)
- I must use all wildcards from the input in the output, but I want to use some of them only in the shell command or in other rules.
- I must not expect files like mden with potentially "non-unique" names, because they could be changed by another task; but I know there will be only one task, and this is simply how my MD program works (yes, I know about Amber's -e and -inf keys, but that over-complicates a simple task).
So, I would like to decide whether it is worth using snakemake for this or not. It's a very simple task, but I have already spent several hours on it; I see a lot of documentation and a lot of examples that I can't apply to my case. snakemake looks like exactly what I need, but I can't express this simple task in general terms with the framework. I don't want to specify explicit filenames, because I would lose flexibility: I want to run hundreds of simple tasks automatically, where only the input files differ. I'm sure I just haven't figured out how to handle this framework yet. Maybe you can show me how I should? Thank you!
Hopefully this will put you in the right direction.
If I understand correctly, the input to snakemake is a folder containing the input files to amber. You know that this folder contains one .in file, one .prmtop file, and one .rst file but you don't know the full names of these files.
If you want snakemake to run on a single input folder, then you don't need wildcards at all and the script below should do.
import glob
import os
input_folder = config['amber_folder']
# We don't know the name of input file. We only know it ends in '.in'
inp = glob.glob(os.path.join(input_folder, '*.in'))
assert len(inp) == 1
inp = inp[0]
name = os.path.splitext(os.path.basename(inp))[0]
output_folder = name + '_results'
out = os.path.join(output_folder, name + '.out')
rule all:
    input:
        out
rule amber:
    input:
        inp= inp,
        top= glob.glob(os.path.join(input_folder, '*.prmtop')),
        rst= glob.glob(os.path.join(input_folder, '*.rst')),
    output:
        out= out,
        nc= os.path.join(output_folder, name + '.nc'),
        mden= os.path.join(output_folder, 'mden'),
        mdinfo= os.path.join(output_folder, 'mdinfo'),
        # Restart file written by -r; kept separate so the input coordinates are not overwritten
        restart= os.path.join(output_folder, name + '.rst'),
    shell:
        r"""
        pmemd.cuda \
            -O \
            -i {input.inp} \
            -o {output.out} \
            -p {input.top} \
            -c {input.rst} \
            -r {output.restart} \
            -x {output.nc} \
            -e {output.mden} \
            -inf {output.mdinfo} \
            -ref {input.rst}
        """
Execute with:
snakemake -j 1 -C amber_folder='your-input-folder'
If you have many input folders you could write a for-loop to execute the command above but probably better is to pass the list of inputs to snakemake and let it handle them.
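As a rough sketch of that second option, the folder discovery can be done in plain Python at the top of the Snakefile. The helper names find_amber_jobs and all_targets are hypothetical, and the sketch assumes that every task lives in its own subfolder of one root directory and that the names of the .in files are unique across tasks.

```python
import glob
import os

def find_amber_jobs(root):
    """Map job name -> folder for each subfolder of root with exactly one *.in file."""
    jobs = {}
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        in_files = glob.glob(os.path.join(entry.path, "*.in"))
        if len(in_files) != 1:
            continue  # does not look like a single Amber task; skip it
        name = os.path.splitext(os.path.basename(in_files[0]))[0]
        jobs[name] = entry.path
    return jobs

def all_targets(jobs):
    """Final .out file of every job, following the '<name>_results' layout above."""
    return [os.path.join(name + "_results", name + ".out") for name in sorted(jobs)]
```

The list from all_targets could serve as the input of rule all, with rule amber rewritten around a {name} wildcard whose input functions look up jobs[wildcards.name].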

calling variables for rule individually and adding an independent environment for a specific rule

I need to run a snakemake rule on a cluster, so for some rules I need certain tools and libraries to be loaded, and these are specific to those rules and not needed by the others. In this case, how can I specify them in my snakemake rule? For example, for rule score I need to run module load r/3.5.1 and export R_lib=/user/tools/software. Currently, I am running these lines separately on the command line before running snakemake, but it would be great if there were a way to do it within the rule as an environment.
Question,
I have a rule as following,
rule score:
    input:
        count=os.path.join(config['general']['paths']['outdir'], 'count_expression', '{sample}.tsv'),
        libsize=os.path.join(config['general']['paths']['outdir'], 'count_expression', '{sample}.size_tsv')
    params:
        result_dir=os.path.join(config['general']['paths']['outdir'], 'score'),
        cancertype=config['general']['paths']['cancertype'],
        sample_id=expand('{sample}',sample=samples['sample'].unique())
    output:
        files=os.path.join(config['general']['paths']['outdir'], 'score', '{sample}_bg_scores.tsv', '{sample}_tp_scores.tsv')
    shell:
        'mkdir -p {params.result_dir};Rscript {config[general][paths][tool]} {params.result_dir} {params.cancertype} {params.sample_id} {input.count} {input.libsize}'
My actual behavior for the above code snippet is:
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC GNMS4 MRT5T /cluster/projects/test/results/exp/MRT5T.tsv /cluster/projects/test/results/Exp/MRT5T.size.tsv
Whereas, the expected behavior is:
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC MRT5T /cluster/projects/test/results/exp/MRT5T.tsv /cluster/projects/test/results/Exp/MRT5T.size.tsv
and for the second sample,
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC GNMS4 /cluster/projects/test/results/exp/GNMS4.tsv /cluster/projects/test/results/Exp/GNMS4.ize.tsv
I need the values of sample_id (['GNMS4', 'MRT5T']) to be used separately, one per shell command, not together in a single command line.
Regarding your first question: You can put whatever module load or export commands you like in the shell section of a rule.
Regarding your second question, you should probably not use expand in the params section of your rule. In expand('{sample}',sample=samples['sample'].unique()) you are actually not using the value of the sample wildcard, but generating a list of all unique values in sample['sample']. You probably just need to use wildcards.sample in the definition of your shell command instead of using a params element.
If you want to run several instances of the score rule based on possible values for sample, you need to "drive" this using another rule that wants the output of score as its input.
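To make the difference concrete, here is a small plain-Python sketch. expand_like is a hypothetical, single-wildcard stand-in for Snakemake's expand, and the shortened path score/{sample}_bg_scores.tsv is an assumption used only for illustration.

```python
samples = ["GNMS4", "MRT5T"]  # sample ids from the question

def expand_like(pattern, name, options):
    """Tiny stand-in for expand: fill one wildcard with every option."""
    return [pattern.replace("{" + name + "}", str(value)) for value in options]

# expand in params: the same full list is handed to every job, and a list
# is rendered space-separated in the shell command.
params_sample_id = expand_like("{sample}", "sample", samples)
rendered = " ".join(params_sample_id)

# expand in the input of a driving rule: one target file per sample, each
# of which triggers its own job with its own wildcards.sample.
targets = expand_like("score/{sample}_bg_scores.tsv", "sample", samples)
```

The rendered string is what showed up in the observed command line (both ids together), whereas each entry of targets corresponds to one separate run of the score rule.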
Note that to improve readability, you can use python's multi-line strings (triple-quoted).
To sum up, you might try something like this:
rule all:
    input:
        # Requesting one of the two score files per sample is enough to trigger rule score
        expand(
            os.path.join(
                config['general']['paths']['outdir'],
                'score',
                '{sample}_bg_scores.tsv'),
            sample=samples['sample'].unique())
rule score:
    input:
        count = os.path.join(
            config['general']['paths']['outdir'],
            'count_expression', '{sample}.tsv'),
        libsize = os.path.join(
            config['general']['paths']['outdir'],
            'count_expression', '{sample}.size_tsv')
    params:
        result_dir = os.path.join(config['general']['paths']['outdir'], 'score'),
        cancertype = config['general']['paths']['cancertype'],
    output:
        # Two separate files; joining both names into one path would describe a single nested file
        bg = os.path.join(
            config['general']['paths']['outdir'],
            'score', '{sample}_bg_scores.tsv'),
        tp = os.path.join(
            config['general']['paths']['outdir'],
            'score', '{sample}_tp_scores.tsv')
    shell:
        """
        module load r/3.5.1
        export R_lib=/user/tools/software
        mkdir -p {params.result_dir}
        Rscript {config[general][paths][tool]} {params.result_dir} {params.cancertype} {wildcards.sample} {input.count} {input.libsize}
        """
onstart would work I think. Note that dryruns don't trigger this handler, which is acceptable in your scenario.
onstart:
    shell("load tools")
A simple bash for loop should solve the problem. However, if you want each sample to be run as a separate job, you would have to use the sample name as part of the output filename.
shell:
    '''
    for sample in {params.sample_id}
    do
        your command $sample
    done
    '''

Snakemake - Wildcards in input files cannot be determined from output files

I am very new to snakemake and also not so fluent in python (so apologies this might be a very basic stupid question):
I am currently building a pipeline to analyze a set of bamfiles with atlas. These bamfiles are located in different folders and should not be moved to a common one. Therefore I decided to provide a sample list looking like this (this is just an example; in reality the samples might be on totally different drives):
Sample Path
Sample1 /some/path/to/my/sample/
Sample2 /some/different/path/
And load it in my config.yaml with:
sample_file: /path/to/samplelist/samplslist.txt
Now to my Snakefile:
import pandas as pd
#define configfile with paths etc.
configfile: "config.yaml"
#read-in dataframe and define Sample and Path
SAMPLES = pd.read_table(config["sample_file"])
BAMFILE = SAMPLES["Sample"]
PATH = SAMPLES["Path"]
rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=PATH, sample=BAMFILE)
#this works like a charm as long as I give the zip-function in the rules 'all' and 'summary':
rule indexBam:
    input:
        "{path}{sample}.bam"
    output:
        "{path}{sample}.bam.bai"
    shell:
        "samtools index {input}"
#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam="{path}{sample}.bam",
        bai=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE)
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "{config[atlas]} task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"
rule summary:
    input:
        index=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE),
        bamd=expand("analysis/BAMDiagnostics/{sample}_approximateDepth.txt", sample=BAMFILE)
    output:
        "{path}{sample}.summary.txt"
    shell:
        "echo -e '{input.index} {input.bamd}"
I get the error
WildcardError in line 28 of path/to/my/Snakefile:
Wildcards in input files cannot be determined from output files:
'path'
Can anyone help me?
- I tried to solve this problem with join, or creating input functions but I think I am just not skilled enough to see my error...
- I guess the problem is that my summary rule does not contain the tuple with the {path} for the bamdiagnostics output (since that output is somewhere else), so it cannot make the connection to the input file, or something like that...
- Expanding my input on bamdiagnostics-rule makes the code work, but of course takes every samples input to every samples output and creates a big mess:
In this case, both bamfiles are used for the creation of each outputfile. This is wrong as the samples AND the output are to be treated independently.
Based on the atlas doc, it seems like what you need is to run each rule separately for each sample, the complication here being that each sample is in separate path.
I modified your script to work for above case (see DAG). Variables in the beginning of script were modified to make better sense. config was removed for demo purposes, and pathlib library was used (instead of os.path.join). pathlib is not necessary, but it helps me keep sanity. A shell command was modified to avoid config.
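The central change is that the rules whose output carries no {path} wildcard look up the sample's directory from its name inside an input function. In isolation the idea can be sketched as follows; a plain dict stands in for the pandas Series BAM_PATH, SimpleNamespace mimics the wildcards object of one job, and the helper name bam_for is hypothetical.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace

# Stand-in for BAM_PATH (a pandas Series indexed by sample name in the script).
BAM_PATH = {
    "Sample1": "/some/path/to/my/sample/",
    "Sample2": "/some/different/path/",
}

def bam_for(wildcards):
    """Input function: build the BAM path from the sample name alone."""
    return str(PurePosixPath(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam")

wc = SimpleNamespace(sample="Sample1")  # what Snakemake would pass for one job
print(bam_for(wc))
```

Since the function needs only wildcards.sample, the rule no longer requires a {path} wildcard in its output to work out its input.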
import pandas as pd
from pathlib import Path
df = pd.read_csv('sample.tsv', sep='\t', index_col='Sample')
SAMPLES = df.index
BAM_PATH = df["Path"]
# print (BAM_PATH['sample1'])
rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=BAM_PATH, sample=SAMPLES)
rule indexBam:
    input:
        str( Path("{path}") / "{sample}.bam")
    output:
        str( Path("{path}") / "{sample}.bam.bai")
    shell:
        "samtools index {input}"
#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam"),
        bai = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        ".atlas task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"
rule summary:
    input:
        bamd = "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        index = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    output:
        str( Path("{path}") / "{sample}.summary.txt")
    shell:
        # Close the quote and write to the declared output so the rule actually creates it
        "echo -e '{input.index} {input.bamd}' > {output}"

Snakemake: How to save and access sample details in config.yml file?

Can anybody help me understand if it is possible to access sample details from a config.yml file when the sample names are not written in the snakemake workflow? This is so I can re-use the workflow for different projects and only adjust the config file. Let me give you an example:
I have four samples that belong together and should be analyzed together. They are called sample1-4. Every sample comes with some more information but to keep it simple here lets say its just a name tag such as S1, S2, etc.
My config.yml file could look like this:
samples: ["sample1","sample2","sample3","sample4"]
sample1:
  tag: "S1"
sample2:
  tag: "S2"
sample3:
  tag: "S3"
sample4:
  tag: "S4"
And here is an example of the snakefile that we use:
configfile: "config.yaml"
rule final:
    input: expand("{sample}.txt", sample=config["samples"])
rule rule1:
    output: "{sample}.txt"
    params: tag=config["{sample}"]["tag"]
    shell: """
        touch {output}
        echo {params.tag} > {output}
        """
What rule1 is trying to do is create a file named after each sample as saved in the samples variable in the config file. So far no problem. Then, I would like to print the sample tag into that file. As the code is written above, running snakemake will fail because config["{sample}"] will literally look for the {sample} variable in the config file which doesn't exist because instead I need it to be replaced with the current sample that the rule is run for, e.g. sample1.
Does anybody know if this is somehow possible to do, and if yes, how I could do it?
Ideally I'd like to compress the information even more (see below) but that's further down the road.
samples:
  sample1:
    tag: "S1"
  sample2:
    tag: "S2"
  sample3:
    tag: "S3"
  sample4:
    tag: "S4"
I would suggest using a tab-delimited file in order to store samples information.
sample.tab:
Sample Tag
1 S1
2 S2
You could store the path to this file in the config file, and read it in your Snakefile.
config.yaml:
sample_file: "sample.tab"
Snakefile:
import pandas as pd
configfile: "config.yaml"
sample_file = config["sample_file"]
# Read the table once and take both columns from the same data frame
sample_table = pd.read_table(sample_file)
samples = sample_table['Sample']
tags = sample_table['Tag']
This way you can re-use your workflow for any number of samples, with any number of columns.
Apart from that, in Snakemake usually you can escape curly brackets by doubling them, maybe you could try that.
Good luck!
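If you would rather keep everything in the config file, the more compact layout from the end of the question can be read with a function of the wildcards. A minimal plain-Python sketch follows; the config dict is written out by hand here (in a workflow it would come from configfile), SimpleNamespace mimics the wildcards object, and the helper name tag_for is hypothetical.

```python
from types import SimpleNamespace

# The compact layout from the question, as the dict Snakemake would load.
config = {
    "samples": {
        "sample1": {"tag": "S1"},
        "sample2": {"tag": "S2"},
        "sample3": {"tag": "S3"},
        "sample4": {"tag": "S4"},
    }
}

def tag_for(wildcards):
    """Params function: look up the tag of the sample this job is run for."""
    return config["samples"][wildcards.sample]["tag"]

# The sample names are simply the keys of the mapping.
sample_names = list(config["samples"])

# Possible wiring in the Snakefile (Snakemake syntax, not plain Python):
#     params: tag = tag_for
#     input: expand("{sample}.txt", sample=sample_names)
```

With this layout a new project only needs a new config file; nothing in the workflow mentions a sample by name.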
In the params section, you need to provide a function of wildcards. The following modification of your workflow seems to work:
configfile: "config.yaml"
rule final:
    input: expand("{sample}.txt", sample=config["samples"])
rule rule1:
    output:
        "{sample}.txt"
    params:
        tag = lambda wildcards: config[wildcards.sample]["tag"]
    shell:
        """
        touch {output}
        echo {params.tag} > {output}
        """
