Snakemake: inserting sample name before every input file in one rule

Snakemake: inserting sample name before every input file in one rule - python

I am trying to create a rules file for a bioinformatics tool FMAP. https://github.com/jiwoongbio/FMAP
I am stuck at creating a rule for the FMAP_table.pl script. This is my current rule:
rule fmap_table:
input:
expand(str(CLASSIFY_FP/"mapping"/"{sample}_abundance.txt"), sample=Samples.keys())
output:
str(CLASSIFY_FP/'mapping'/'abundance_table.txt')
shell:
"""
perl /media/data/FMAP/FMAP_table.pl {input} > {output}
"""
I would like my column names to contain only the sample names, not the whole path. This can be done in the script like this
perl FMAP_table.pl [options] [name1=]abundance1.txt [[name2=]abundance2.txt [...]] > abundance_table.txt
My issue is that how do I select the sample name for each sample file, the path of the sample and add the = in between.
My samples are named like this SAMPLE111_S1_abundance.txt This is the format I would like to achieve automatically:
perl /media/data/FMAP/FMAP_table.pl SAMPLE111_S1 = SAMPLE111_S1_abundance.txt SAMPLE112_S2 = SAMPLE112_S2.abundance.txt [etc.] > abundance.txt"
Thanks

I might add a parameter to build that, and maybe also build the file names in dict externally:
FMAP_INPUTS = {sample: str(CLASSIFY_FP/"mapping"/"{sample}_abundance.txt")
for sample in Samples.keys()}
rule fmap:
input: FMAP_INPUTS.values()
output:
str(CLASSIFY_FP/'mapping'/'abundance_table.txt')
params:
names=" ".join(f"{s}={f}" for s,f in FMAP_INPUTS.items())
shell:
"""
perl /media/data/FMAP/FMAP_table.pl {params.names} > {output}
"""

Related

Snakemake input fastq files from each sample directory issue for metagenomics analysis

I am working on a new snakemake metagenomics pipeline to trim fastq files, and run them through kraken. Each sample has a directory containing the forward and reverse reads.
Sample_1/r1_paired.fq.gz
Sample_1/r2_paired.fq.gz
Sample_2/r1_paired.fq.gz
Sample_2/r2_paired.fq.gz
I am providing a sample sheet that users can upload, that contains the sample names and the read names. I used pandas to parse the sample sheet and provide the names required for the snakefile. Here is my snakefile.
#Extract sample names from CSV
import pandas as pd
import os
df = pd.read_csv("sample_table_test.csv")
print(df)
samples = df.library.to_list()
print("Samples being processed:", samples)
R1 = df.r1_file.to_list()
R2 = df.r2_file.to_list()
print(R1,R2)
rule all:
input:
expand("{sample}.bracken", sample=samples),
#Trimmomatic to trim paired end reads
rule trim_reads:
input:
"{sample}/{R1}",
"{sample}/{R2}",
output:
"{sample}/{R1}_1_trim_paired.fq.gz",
"{sample}/{R2}_2_trim_paired.fq.gz",
conda:
"env.yaml",
shell:
"trimmomatic PE -threads 8 {input} {input} {output} {output} SLIDINGWINDOW:4:30 LEADING:2 TRAILING:2 MINLEN:50"
#Kraken2 to bin reads and assign taxonomy
rule kraken2:
input:
"{sample}/{R1}_1_trim_paired.fq.gz",
"{sample}/{R2}_2_trim_paired.fq.gz",
output:
"{sample}_report.txt",
"{sample}_kraken_cseqs#.fq",
conda:
"env.yaml",
shell:
"kraken2 --gzip-compressed --paired --classified-out {output} {input} {input} --db database/minikraken2_v1_8GB/ --report {sample}_report.txt --threads 1"
#Bracken estimates abundance of a species within a sample
rule bracken:
input:
"{sample}_report.txt",
output:
"{sample}.bracken",
conda:
"env.yaml",
shell:
"bracken -d database/minikraken2_v1_8GB/ -i {input} -o {output} -r 150"
I am receiving the below error and have been struggling to find a better way to write my snakefile to avoid this issue. Any assistance here would be greatly appreciated.
WildcardError in line 19 of /Metagenomics/Metagenomics/snakemake/Snakefile:
Wildcards in input files cannot be determined from output files:
'R1'
Thank you!

The problem is in your rule kraken2:
rule kraken2:
input:
"{sample}/{R1}_1_trim_paired.fq.gz",
"{sample}/{R2}_2_trim_paired.fq.gz",
output:
"{sample}_report.txt",
"{sample}_kraken_cseqs#.fq",
All wildcards in the rule shall be determined from the output section. The logic of each rule is that it offers certain files as a possible output. In your case the rule offers files "{sample}_report.txt" and "{sample}_kraken_cseqs#.fq", where {sample} becomes one level of freedom and is substituted with a certain value that resolves the pattern into a filename. Now Snakemake can determine the inputs for this rule, but only if it has all the information. Ok, the value of {sample} is defined from the output, but what are the values of {R1} and {R2}?
You have several options. The first is to define these values somewhere in the output:. Looks like that is not your case. The second option is to define these values globally (as you are probably trying to do):
R1 = df.r1_file.to_list()
R2 = df.r2_file.to_list()
In this case {R1} and {R2} shall not be wildcards but the parameters of the expand function:
rule kraken2:
input:
expand("{{sample}}/{R1}_1_trim_paired.fq.gz", R1=R1),
expand("{{sample}}/{R1}_1_trim_paired.fq.gz", R2=R2)
output:
"{sample}_report.txt",
"{sample}_kraken_cseqs#.fq",
Or even better:
expand("{{sample}}/{R}_1_trim_paired.fq.gz", R=R1+R2)
Note that the wildcard {sample} now has to be in double braces to be distinguished from the parameters of the expand function.
There are other options like resolving the value of {R1} from the values of other vildcards, like lambda wildcards: ..., but I guess that is not what you need.

calling variables for rule individually and adding an independent environment for a specific rule

I need to run a snakemake rule in the cluster, therefore for some rules, I need some tools and library needed to e loaded whereas, these tools are independent/ exclusive to other rules. I this case how can I specify these in my snakemake rule. For example, for rule score I need to module load r/3.5.1 and export R_lib =/user/tools/software currently, I am running these lines separately in the command line before running snakemake. But it would be great if there is a way to do it within the rule as env.
Question,
I have a rule as following,
rule score:
input:
count=os.path.join(config['general']['paths']['outdir'], 'count_expression', '{sample}.tsv'),
libsize=os.path.join(config['general']['paths']['outdir'], 'count_expression', '{sample}.size_tsv')
params:
result_dir=os.path.join(config['general']['paths']['outdir'], 'score'),
cancertype=config['general']['paths']['cancertype'],
sample_id=expand('{sample}',sample=samples['sample'].unique())
output:
files=os.path.join(config['general']['paths']['outdir'], 'score', '{sample}_bg_scores.tsv', '{sample}_tp_scores.tsv')
shell:
'mkdir -p {params.result_dir};Rscript {config[general][paths][tool]} {params.result_dir} {params.cancertype} {params.sample_id} {input.count} {input.libsize}'
My actual behavior for the above code snippet is:
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC GNMS4 MRT5T /cluster/projects/test/results/exp/MRT5T.tsv /cluster/projects/test/results/Exp/MRT5T.size.tsv
Whereas, the expected behavior is:
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC MRT5T /cluster/projects/test/results/exp/MRT5T.tsv /cluster/projects/test/results/Exp/MRT5T.size.tsv
and for the second sample,
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC GNMS4 /cluster/projects/test/results/exp/GNMS4.tsv /cluster/projects/test/results/Exp/GNMS4.ize.tsv
I need the variable sample_d ['GNMS4', 'MRT5T'] should be taken separately, not together in one shell command line.

Regarding your first question: You can put whatever module load or export commands you like in the shell section of a rule.
Regarding your second question, you should probably not use expand in the params section of your rule. In expand('{sample}',sample=samples['sample'].unique()) you are actually not using the value of the sample wildcard, but generating a list of all unique values in sample['sample']. You probably just need to use wildcards.sample in the definition of your shell command instead of using a params element.
If you want to run several instances of the score rule based on possible values for sample, you need to "drive" this using another rule that wants the output of score as its input.
Note that to improve readability, you can use python's multi-line strings (triple-quoted).
To sum up, you might try something like this:
rule all:
input:
expand(
os.path.join(
config['general']['paths']['outdir'],
'score',
'{sample}_bg_scores.tsv',
'{sample}_tp_scores.tsv'),
sample=samples['sample'].unique())
rule score:
input:
count = os.path.join(
config['general']['paths']['outdir'],
'count_expression', '{sample}.tsv'),
libsize = os.path.join(
config['general']['paths']['outdir'],
'count_expression', '{sample}.size_tsv')
params:
result_dir = os.path.join(config['general']['paths']['outdir'], 'score'),
cancertype = config['general']['paths']['cancertype'],
output:
files = os.path.join(
config['general']['paths']['outdir'],
'score', '{sample}_bg_scores.tsv', '{sample}_tp_scores.tsv')
shell:
"""
module load r/3.5.1
export R_lib =/user/tools/software
mkdir -p {params.result_dir}
Rscript {config[general][paths][tool]} {params.result_dir} {params.cancertype} {wildcards.sample} {input.count} {input.libsize}
"""

onstart would work I think. Note that dryruns don't trigger this handler, which is acceptable in your scenario.
onstart:
shell("load tools")
Simple bash for loop should solve the problem. However, if you want each sample to be run as a separate rule, you would have to use sample name as part of output filename.
shell:
'''
for sample in {param.sample_id}
do
your command $sample
done
'''

Snakemake wildcard in output only

I have a script which takes a large input file then breaks this down into a number of chunks from 1 to n using an unpredictable algorithm.
Then a following script will process each of these chunks iteratively.
How can I create a snakemake rule which essentially states that the output files will exist from 1 to n, and the following script should be run once for each of the 1 to n input files.
Thanks!

There is dynamic keyword. It can be used like this:
rule all:
input:
dynamic('{id}.png')
rule draw:
input:
'{id}.txt'
output:
'{id}.png'
shell:
'cp {input} {output}'
rule cluster:
input:
'input.csv'
output:
dynamic('{id}.txt')
shell:
'touch 1.txt 2.txt'

Have you tried setting a wildcard? For example, if you are iterating a rule over files 1 to 22, you can set a wildcard at the top of your snakemake file:
num=range(1,23)
Then use that wildcard in your snakemake file names or reference it as in {wildcard.num}

Snakemake - Wildcards in input files cannot be determined from output files

I am very new to snakemake and also not so fluent in python (so apologies this might be a very basic stupid question):
I am currently building a pipeline to analyze a set of bamfiles with atlas. These bamfiles are located in different folders and should not be moved to a common one. Therefore I decided to provide a samplelist looking like this (this is just an example, in reality samples might be on totaly different drives):
Sample Path
Sample1 /some/path/to/my/sample/
Sample2 /some/different/path/
And load it in my config.yaml with:
sample_file: /path/to/samplelist/samplslist.txt
Now to my Snakefile:
import pandas as pd
#define configfile with paths etc.
configfile: "config.yaml"
#read-in dataframe and define Sample and Path
SAMPLES = pd.read_table(config["sample_file"])
BAMFILE = SAMPLES["Sample"]
PATH = SAMPLES["Path"]
rule all:
input:
expand("{path}{sample}.summary.txt", zip, path=PATH, sample=BAMFILE)
#this works like a charm as long as I give the zip-function in the rules 'all' and 'summary':
rule indexBam:
input:
"{path}{sample}.bam"
output:
"{path}{sample}.bam.bai"
shell:
"samtools index {input}"
#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
input:
bam="{path}{sample}.bam",
bai=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE)
params:
prefix="analysis/BAMDiagnostics/{sample}"
output:
"analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
"analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
"analysis/BAMDiagnostics/{sample}_MQ.txt",
"analysis/BAMDiagnostics/{sample}_readLength.txt",
"analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
message:
"running BamDiagnostics...{wildcards.sample}"
shell:
"{config[atlas]} task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"
rule summary:
input:
index=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE),
bamd=expand("analysis/BAMDiagnostics/{sample}_approximateDepth.txt", sample=BAMFILE)
output:
"{path}{sample}.summary.txt"
shell:
"echo -e '{input.index} {input.bamd}"
I get the error
WildcardError in line 28 of path/to/my/Snakefile:
Wildcards in input files cannot be determined from output files:
'path'
Can anyone help me?
- I tried to solve this problem with join, or creating input functions but I think I am just not skilled enough to see my error...
- I guess the problem is, that my summary-rule does not contain the tuplet with the {path} for the bamdiagnostics-output (since the output is somewhere else) and cannot make the connection to the input file or so...
- Expanding my input on bamdiagnostics-rule makes the code work, but of course takes every samples input to every samples output and creates a big mess:
In this case, both bamfiles are used for the creation of each outputfile. This is wrong as the samples AND the output are to be treated independently.

Based on the atlas doc, it seems like what you need is to run each rule separately for each sample, the complication here being that each sample is in separate path.
I modified your script to work for above case (see DAG). Variables in the beginning of script were modified to make better sense. config was removed for demo purposes, and pathlib library was used (instead of os.path.join). pathlib is not necessary, but it helps me keep sanity. A shell command was modified to avoid config.
import pandas as pd
from pathlib import Path
df = pd.read_csv('sample.tsv', sep='\t', index_col='Sample')
SAMPLES = df.index
BAM_PATH = df["Path"]
# print (BAM_PATH['sample1'])
rule all:
input:
expand("{path}{sample}.summary.txt", zip, path=BAM_PATH, sample=SAMPLES)
rule indexBam:
input:
str( Path("{path}") / "{sample}.bam")
output:
str( Path("{path}") / "{sample}.bam.bai")
shell:
"samtools index {input}"
#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
input:
bam = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam"),
bai = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
params:
prefix="analysis/BAMDiagnostics/{sample}"
output:
"analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
"analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
"analysis/BAMDiagnostics/{sample}_MQ.txt",
"analysis/BAMDiagnostics/{sample}_readLength.txt",
"analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
message:
"running BamDiagnostics...{wildcards.sample}"
shell:
".atlas task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"
rule summary:
input:
bamd = "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
index = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
output:
str( Path("{path}") / "{sample}.summary.txt")
shell:
"echo -e '{input.index} {input.bamd}"

'No values given for wildcard error' in snakemake

I am trying to make a simple pipeline using snakemake to download two files from the web and then merge them into a single output.
What I thought would work is the following code:
dwn_lnks = {
'1': 'https://molb7621.github.io/workshop/_downloads/sample.fa',
'2': 'https://molb7621.github.io/workshop/_downloads/sample.fa'
}
import os
# association between chromosomes and their links
def chromo2link(wildcards):
return dwn_lnks[wildcards.chromo]
rule all:
input:
os.path.join('genome_dir', 'human_en37_sm.fa')
rule download:
output:
expand(os.path.join('chr_dir', '{chromo}')),
params:
link=chromo2link,
shell:
"wget {params.link} -O {output}"
rule merger:
input:
expand(os.path.join('chr_dir', "{chromo}"), chromo=dwn_lnks.keys())
output:
os.path.join('genome_dir', 'human_en37_sm.fa')
run:
txt = open({output}, 'a+')
with open (os.path.join('chr_dir', "{chromo}") as file:
line = file.readline()
while line:
txt.write(line)
line = file.readline()
txt.close()
This code returns the error:
No values given for wildcard 'chromo'. in line 20
Also, in the merger rule, the python code within the run does not work.
The tutorial in the snakemake package does not cover enough examples to learn the details for non-computer scientists. If anybody knows a good resource to learn how to work with snakemake, I would appreciate if they could share :).

The problem is that you have an expand function in the output of the rule download that does not define the value for the wildcard {chromo}. I guess what you really want here is
rule download:
output:
'chr_dir/{chromo}',
params:
link=chromo2link,
shell:
"wget {params.link} -o {output}"
without the expand. The expand function is only needed to aggregate over wildcards, like you do it in the rule merger.
Also have a look at the official Snakemake tutorial, which explains this in detail.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Snakemake: inserting sample name before every input file in one rule - python

Related

Snakemake input fastq files from each sample directory issue for metagenomics analysis

calling variables for rule individually and adding an independent environment for a specific rule

Snakemake wildcard in output only

Snakemake - Wildcards in input files cannot be determined from output files

'No values given for wildcard error' in snakemake

Categories

Resources