Snakemake: using regex in expand() - python

I'm struggling with regular expressions in the expand function. The wildcard values are always inserted as plain text instead of being evaluated as regular expressions. It makes no difference whether the regex is defined beforehand as a variable or passed directly to expand (see all_decompress vs. all_decompress2). The error is always:
Missing input files for rule DECOMPRESS:
Resources/raw/run1_lane1_read[1,2]_index\d+-\d+=[1-9], [11-32].fastq.gz
#!/usr/bin/env python3
import re
###### WILDCARDS #####
## General descriptive parameters
read = "[1,2]"
index_prefix = r"\d{3}-\d{3}"
index = r"[1-9], [11-32]"
##### RULES #####
### CONJUNCTION RULES ("all") ###
# PREANALYSIS #
rule all_decompress:
    input:
        expand("Resources/decompressed/read{read}_index{index_prefix}={index}.fastq", read=read, index_prefix=index_prefix, index=index)
rule all_decompress2:
    input:
        expand("Resources/decompressed/read{read}_index{index_prefix}={index}.fastq", read=[1,2], index_prefix=r"\d{3}-\d{3}", index=r"[1-9], [11-32]")
### TASK RULES ###
# PREANALYSIS #
# Decompress .gz zipped raw files
rule DECOMPRESS:
    input:
        "Resources/raw/run1_lane1_read{read}_index{index_prefix}={index}.fastq.gz"
    output:
        "Resources/decompressed/read{read}_index{index_prefix}={index}.fastq"
    shell:
        "gzip -d -c {input} > {output}"

If I'm right, the expand function in Snakemake builds a list of strings. It is used for file names, just as you used it.
I don't know whether expand can be combined with a regex to create the list.
But you can build this list in plain Python and pass it to rule all or to the expand function.
In your case, you can use the following code to collect your list of file names:
import re
import os
path = '.'
listoffiles = []
for file in os.listdir(path):
    # raw string so that \d reaches the regex engine unchanged
    if re.search(r'read[1-2]_index\d{3}-\d{3}=[1-9]', file):
        listoffiles.append(os.path.splitext(file)[0])
Then listoffiles holds all your file names, and you can use expand like this:
expand("{repertory}{filename}{extension}",
       repertory = "Resources/decompressed/",
       filename = listoffiles,
       extension = ".fastq")
Then everything should work.
Remember that all Python code in a Snakefile is executed at the start of the workflow, before the rules are evaluated and the DAG is built, which makes this approach powerful.
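As a hedged sketch of the same idea, the helper below turns raw file names into the decompressed target names expected by rule DECOMPRESS. The function name decompressed_targets, the three-digit index format, and the example file names are assumptions made for illustration; they are not taken from the question.

```python
import re

# Assumed layout of the raw file names, based on the input pattern of rule DECOMPRESS.
RAW_PATTERN = re.compile(
    r"^run1_lane1_read(?P<read>[12])_index(?P<prefix>\d{3}-\d{3})=(?P<index>\d+)\.fastq\.gz$"
)

def decompressed_targets(raw_names):
    """Return the decompressed target path for every raw file name that matches."""
    targets = []
    for name in sorted(raw_names):
        match = RAW_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        targets.append(
            "Resources/decompressed/read{read}_index{prefix}={index}.fastq".format(
                **match.groupdict()
            )
        )
    return targets

# Hypothetical directory listing used only to show the behaviour.
example = [
    "run1_lane1_read1_index001-002=3.fastq.gz",
    "run1_lane1_read2_index001-002=3.fastq.gz",
    "notes.txt",
]
print(decompressed_targets(example))
```

In a Snakefile, the list returned for os.listdir("Resources/raw") could be used directly as the input of the "all" rule, without calling expand at all.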

Did you look into bli's links?
Specifically, this part of the documentation.
Here's a simple example of how I use it to generate a png, a pdf, or both:
rule all:
    input: expand("{graph}.png", graph=["dag", "rulegraph"])
rule dot_to_image:
    input: "{graph}.dot"
    output: "{graph}.{ext,(pdf|png)}"
    shell: "dot -T{wildcards.ext} -o {output} {input}"
Hope this helps.

I think it is probably just not possible to evaluate regular expressions in expand. However, I found a workaround. For two of the wildcards ("read" and "index") I found different ways to describe them, and for the third one I wrote a function and used it as the input.
#!/usr/bin/env python3
import glob
###### WILDCARDS #####
## General descriptive parameters
read = (1,2)
# range() excludes its stop value, so this yields 1..8 and 11..31
index = list(range(1,9)) + list(range(11,32))
## Functions
def getDCinput(wildcards):
    read = wildcards.read
    index = wildcards.index
    # wd is a path prefix that is assumed to be defined elsewhere
    # glob syntax, not regex: "?" matches any single character
    path = wd + "Resources/raw/run1_lane1_read" + read + "_index[0-9]??-[0-9]??=" + index + ".fastq.gz"
    return glob.glob(path)
##### RULES #####
### CONJUNCTION RULES ("all") ###
# PREANALYSIS #
rule all_decompress:
    input:
        expand("Resources/decompressed/read{read}_index{index}.fastq", read=read, index=index)
### TASK RULES ###
# FILE PREPARATION AND SMOOTHING #
# Decompress .gz zipped raw files
rule DECOMPRESS:
    input:
        getDCinput
    output:
        "Resources/decompressed/read{read}_index{index}.fastq"
    shell:
        "gzip -d -c {input} > {output}"

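Why a regex passed to expand ends up verbatim in the file name can be seen from a rough stand-in for what expand does by default: it formats the pattern with every combination of the given values. The function expand_like below is a simplified illustration written for this note, not Snakemake's actual implementation.

```python
from itertools import product

def expand_like(pattern, **values):
    """Rough stand-in for expand(): format the pattern with every value combination."""
    keys = list(values)
    combos = product(*(values[key] for key in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]

# A regex string is just another value, so it is pasted into the name unchanged.
print(expand_like("read{read}_index{index}.fastq", read=[1, 2], index=[r"[1-9]"]))

# Concrete values, in contrast, produce one real file name per combination.
print(expand_like("read{read}_index{index}.fastq", read=[1, 2], index=[1, 2, 3]))
```

This is why the workaround above lists the concrete values of read and index instead of describing them with a pattern.
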
Related

Snakemake input fastq files from each sample directory issue for metagenomics analysis

I am working on a new snakemake metagenomics pipeline to trim fastq files, and run them through kraken. Each sample has a directory containing the forward and reverse reads.
Sample_1/r1_paired.fq.gz
Sample_1/r2_paired.fq.gz
Sample_2/r1_paired.fq.gz
Sample_2/r2_paired.fq.gz
I am providing a sample sheet that users can upload, that contains the sample names and the read names. I used pandas to parse the sample sheet and provide the names required for the snakefile. Here is my snakefile.
#Extract sample names from CSV
import pandas as pd
import os
df = pd.read_csv("sample_table_test.csv")
print(df)
samples = df.library.to_list()
print("Samples being processed:", samples)
R1 = df.r1_file.to_list()
R2 = df.r2_file.to_list()
print(R1,R2)
rule all:
    input:
        expand("{sample}.bracken", sample=samples),
#Trimmomatic to trim paired end reads
rule trim_reads:
    input:
        "{sample}/{R1}",
        "{sample}/{R2}",
    output:
        "{sample}/{R1}_1_trim_paired.fq.gz",
        "{sample}/{R2}_2_trim_paired.fq.gz",
    conda:
        "env.yaml",
    shell:
        "trimmomatic PE -threads 8 {input} {input} {output} {output} SLIDINGWINDOW:4:30 LEADING:2 TRAILING:2 MINLEN:50"
#Kraken2 to bin reads and assign taxonomy
rule kraken2:
    input:
        "{sample}/{R1}_1_trim_paired.fq.gz",
        "{sample}/{R2}_2_trim_paired.fq.gz",
    output:
        "{sample}_report.txt",
        "{sample}_kraken_cseqs#.fq",
    conda:
        "env.yaml",
    shell:
        "kraken2 --gzip-compressed --paired --classified-out {output} {input} {input} --db database/minikraken2_v1_8GB/ --report {sample}_report.txt --threads 1"
#Bracken estimates abundance of a species within a sample
rule bracken:
    input:
        "{sample}_report.txt",
    output:
        "{sample}.bracken",
    conda:
        "env.yaml",
    shell:
        "bracken -d database/minikraken2_v1_8GB/ -i {input} -o {output} -r 150"
I am receiving the below error and have been struggling to find a better way to write my snakefile to avoid this issue. Any assistance here would be greatly appreciated.
WildcardError in line 19 of /Metagenomics/Metagenomics/snakemake/Snakefile:
Wildcards in input files cannot be determined from output files:
'R1'
Thank you!
The problem is in your rule kraken2:
rule kraken2:
    input:
        "{sample}/{R1}_1_trim_paired.fq.gz",
        "{sample}/{R2}_2_trim_paired.fq.gz",
    output:
        "{sample}_report.txt",
        "{sample}_kraken_cseqs#.fq",
All wildcards in a rule must be determinable from its output section. The logic of each rule is that it offers certain files as possible outputs. In your case, the rule offers the files "{sample}_report.txt" and "{sample}_kraken_cseqs#.fq", where {sample} is one degree of freedom that is substituted with a concrete value to resolve the pattern into a file name. Snakemake can then determine the inputs for this rule, but only if it has all the information. The value of {sample} is defined by the output, but what are the values of {R1} and {R2}?
You have several options. The first is to define these values somewhere in the output: section. That does not seem to apply in your case. The second option is to define these values globally (as you are probably trying to do):
R1 = df.r1_file.to_list()
R2 = df.r2_file.to_list()
In this case, {R1} and {R2} must not be wildcards but parameters of the expand function:
rule kraken2:
    input:
        expand("{{sample}}/{R1}_1_trim_paired.fq.gz", R1=R1),
        expand("{{sample}}/{R2}_2_trim_paired.fq.gz", R2=R2)
    output:
        "{sample}_report.txt",
        "{sample}_kraken_cseqs#.fq",
Or even better:
expand("{{sample}}/{R}_1_trim_paired.fq.gz", R=R1+R2)
Note that the wildcard {sample} now has to be in double braces to be distinguished from the parameters of the expand function.
There are other options, such as resolving the value of {R1} from the values of other wildcards with an input function (lambda wildcards: ...), but I guess that is not what you need.
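For completeness, the input-function option could look roughly like the sketch below. It assumes that each row of the sample sheet pairs one sample with one R1 file and one R2 file; the lists stand in for the pandas columns of the question, and the names READS and raw_reads are made up for illustration.

```python
from types import SimpleNamespace

# Stand-ins for df.library, df.r1_file and df.r2_file (assumed: one row per sample).
samples = ["Sample_1", "Sample_2"]
R1 = ["r1_paired.fq.gz", "r1_paired.fq.gz"]
R2 = ["r2_paired.fq.gz", "r2_paired.fq.gz"]

# One lookup table: sample name -> (forward read file, reverse read file).
READS = {s: (r1, r2) for s, r1, r2 in zip(samples, R1, R2)}

def raw_reads(wildcards):
    """Input function: return the two read files that belong to one sample."""
    r1, r2 = READS[wildcards.sample]
    return [f"{wildcards.sample}/{r1}", f"{wildcards.sample}/{r2}"]

# Snakemake passes a wildcards object; SimpleNamespace imitates it here.
print(raw_reads(SimpleNamespace(sample="Sample_1")))
```

A rule would then use input: raw_reads, and only {sample} needs to appear in its output. Unlike expand over the full R1 and R2 lists, this returns only the files of the sample being processed.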

Snakemake: inserting sample name before every input file in one rule

I am trying to create a rules file for a bioinformatics tool FMAP. https://github.com/jiwoongbio/FMAP
I am stuck at creating a rule for the FMAP_table.pl script. This is my current rule:
rule fmap_table:
    input:
        expand(str(CLASSIFY_FP/"mapping"/"{sample}_abundance.txt"), sample=Samples.keys())
    output:
        str(CLASSIFY_FP/'mapping'/'abundance_table.txt')
    shell:
        """
        perl /media/data/FMAP/FMAP_table.pl {input} > {output}
        """
I would like my column names to contain only the sample names, not the whole path. This can be done in the script like this
perl FMAP_table.pl [options] [name1=]abundance1.txt [[name2=]abundance2.txt [...]] > abundance_table.txt
My issue is how to select the sample name for each sample file, take the path of the file, and add the = in between.
My samples are named like this SAMPLE111_S1_abundance.txt This is the format I would like to achieve automatically:
perl /media/data/FMAP/FMAP_table.pl SAMPLE111_S1 = SAMPLE111_S1_abundance.txt SAMPLE112_S2 = SAMPLE112_S2.abundance.txt [etc.] > abundance.txt"
Thanks
I might add a parameter to build that, and maybe also build the file names in a dict externally:
# f-string so that each sample name is filled in here, not left as a wildcard
FMAP_INPUTS = {sample: str(CLASSIFY_FP/"mapping"/f"{sample}_abundance.txt")
               for sample in Samples.keys()}
rule fmap:
    input: list(FMAP_INPUTS.values())
    output:
        str(CLASSIFY_FP/'mapping'/'abundance_table.txt')
    params:
        names=" ".join(f"{s}={f}" for s,f in FMAP_INPUTS.items())
    shell:
        """
        perl /media/data/FMAP/FMAP_table.pl {params.names} > {output}
        """

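To see what such a name=path argument string looks like outside of Snakemake, here is a small stand-alone sketch. The values of CLASSIFY_FP and Samples are placeholders invented for the example; in the real workflow they come from the pipeline itself.

```python
from pathlib import Path

# Placeholder values (assumptions): the real ones are defined by the workflow.
CLASSIFY_FP = Path("classify")
Samples = {"SAMPLE111_S1": None, "SAMPLE112_S2": None}

# Map every sample name to the path of its abundance file.
FMAP_INPUTS = {
    sample: str(CLASSIFY_FP / "mapping" / f"{sample}_abundance.txt")
    for sample in Samples
}

# Join the pairs into the "name=path name=path ..." form that FMAP_table.pl accepts.
names = " ".join(f"{sample}={path}" for sample, path in FMAP_INPUTS.items())
print(names)
```

Each column of the resulting table is then labelled with the sample name rather than with the full path.
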
Snakemake using input files in different folders summarizing by name

I'm trying to develop a pipeline that will take input files from different directories, specified in a yaml config file, and keep track of them by a name I specify in the yaml. For example, say my yaml looks like
input:
    name1: /some/path/to/file1
    name2: /a/totally/different/path/to/file2
    name3: /yet/another/path/to/file3
output: /path/to/outdir
I'd like to go through a series of steps and end up with an outdir that has the contents
/path/to/outdir/processed_name1.extension
/path/to/outdir/processed_name2.extension
/path/to/outdir/processed_name3.extension
I honestly can't get anything to work. The current state I've stalled at is trying to treat the names as wildcards, and using that to access the config dictionary. But this doesn't work, because the wildcards are never initialized, because the very first step is accessing the inputs. I can't be super specific with my code example due to company policy, but basically it looks like this:
rule all:
    input:
        processed_files = expand(config['output'] + "/processed_{name}.extension", name=config['input'])
rule step_1:
    input:
        input_file = lambda wc: config['input'][wc.name]
    output:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    run:
        <some command>
rule step_2:
    input:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    output:
        processed_file = config['output'] + "/processed_{name}.extension"
    run:
        <some command>
But this gives me wildcard errors, which makes sense I think---there's no way for it to figure out the wildcards, since they only exist in the config file. I feel like this is so similar to the example in the Advanced Workflow Example, but sufficiently different that I just can't get it to work...
EDIT 1: I replaced all the f-strings with string concatenation, just to make sure that's not an issue
EDIT 2: I eventually got it to work. I'm honestly not sure what changed, I must have had a typo or something... but I guess I can say that this overall structure worked.
I found no major errors in your shown code, though I removed fstrings and changed run: to shell: to make an easy test. The following works just fine with the appropriate configfile.
configfile: "config.yaml"
rule all:
    input:
        processed_files = expand(config['output'] + "/processed_{name}.extension", name=config['input'])
rule step_1:
    input:
        input_file = lambda wc: config['input'][wc.name]
    output:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    shell:
        "cat {input} > {output}"
rule step_2:
    input:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    output:
        processed_file = config['output'] + "/processed_{name}.extension"
    shell:
        "cat {input} > {output}"

Snakemake - Wildcards in input files cannot be determined from output files

I am very new to snakemake and also not so fluent in python (so apologies this might be a very basic stupid question):
I am currently building a pipeline to analyze a set of bamfiles with atlas. These bamfiles are located in different folders and should not be moved to a common one. Therefore I decided to provide a sample list looking like this (this is just an example; in reality the samples might be on totally different drives):
Sample Path
Sample1 /some/path/to/my/sample/
Sample2 /some/different/path/
And load it in my config.yaml with:
sample_file: /path/to/samplelist/samplslist.txt
Now to my Snakefile:
import pandas as pd
#define configfile with paths etc.
configfile: "config.yaml"
#read-in dataframe and define Sample and Path
SAMPLES = pd.read_table(config["sample_file"])
BAMFILE = SAMPLES["Sample"]
PATH = SAMPLES["Path"]
rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=PATH, sample=BAMFILE)
#this works like a charm as long as I give the zip-function in the rules 'all' and 'summary':
rule indexBam:
    input:
        "{path}{sample}.bam"
    output:
        "{path}{sample}.bam.bai"
    shell:
        "samtools index {input}"
#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam="{path}{sample}.bam",
        bai=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE)
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "{config[atlas]} task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"
rule summary:
    input:
        index=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE),
        bamd=expand("analysis/BAMDiagnostics/{sample}_approximateDepth.txt", sample=BAMFILE)
    output:
        "{path}{sample}.summary.txt"
    shell:
        "echo -e '{input.index} {input.bamd}"
I get the error
WildcardError in line 28 of path/to/my/Snakefile:
Wildcards in input files cannot be determined from output files:
'path'
Can anyone help me?
- I tried to solve this problem with join, or creating input functions but I think I am just not skilled enough to see my error...
- I guess the problem is, that my summary-rule does not contain the tuplet with the {path} for the bamdiagnostics-output (since the output is somewhere else) and cannot make the connection to the input file or so...
- Expanding my input on bamdiagnostics-rule makes the code work, but of course takes every samples input to every samples output and creates a big mess:
In this case, both bamfiles are used for the creation of each outputfile. This is wrong as the samples AND the output are to be treated independently.
Based on the atlas doc, it seems that what you need is to run each rule separately for each sample, the complication being that each sample is in a separate path.
I modified your script to work for the above case (see DAG). Variables at the beginning of the script were renamed to make better sense. config was removed for demo purposes, and the pathlib library was used (instead of os.path.join). pathlib is not necessary, but it helps me keep my sanity. The shell command was modified to avoid config.
import pandas as pd
from pathlib import Path
df = pd.read_csv('sample.tsv', sep='\t', index_col='Sample')
SAMPLES = df.index
BAM_PATH = df["Path"]
# print (BAM_PATH['sample1'])
rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=BAM_PATH, sample=SAMPLES)
rule indexBam:
    input:
        str( Path("{path}") / "{sample}.bam")
    output:
        str( Path("{path}") / "{sample}.bam.bai")
    shell:
        "samtools index {input}"
#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        # look up the folder of this one sample instead of expanding over all samples
        bam = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam"),
        bai = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        ".atlas task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"
rule summary:
    input:
        bamd = "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        index = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    output:
        str( Path("{path}") / "{sample}.summary.txt")
    shell:
        "echo -e '{input.index} {input.bamd}'"

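The difference between expanding over all samples and looking up a single sample can be sketched in plain Python. The sample names and paths below are the example values from the question; the variable names are chosen for illustration only.

```python
from itertools import product

samples = ["Sample1", "Sample2"]
paths = ["/some/path/to/my/sample/", "/some/different/path/"]

# Pairing used by expand(..., zip, ...): one path per sample, but still all samples.
paired = [f"{p}{s}.bam.bai" for p, s in zip(paths, samples)]

# Default pairing of expand(): every path combined with every sample.
crossed = [f"{p}{s}.bam.bai" for p, s in product(paths, samples)]

# Per-sample lookup, which is what the lambda input functions do.
path_of = dict(zip(samples, paths))
one = f"{path_of['Sample2']}Sample2.bam.bai"

print(paired)
print(crossed)
print(one)
```

Even the zipped expand returns the files of every sample, which is why using it as the input of a per-sample rule ties all samples to each output. The lookup returns exactly one file for the sample being processed.
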
'No values given for wildcard error' in snakemake

I am trying to make a simple pipeline using snakemake to download two files from the web and then merge them into a single output.
What I thought would work is the following code:
dwn_lnks = {
    '1': 'https://molb7621.github.io/workshop/_downloads/sample.fa',
    '2': 'https://molb7621.github.io/workshop/_downloads/sample.fa'
    }
import os
# association between chromosomes and their links
def chromo2link(wildcards):
    return dwn_lnks[wildcards.chromo]
rule all:
    input:
        os.path.join('genome_dir', 'human_en37_sm.fa')
rule download:
    output:
        expand(os.path.join('chr_dir', '{chromo}')),
    params:
        link=chromo2link,
    shell:
        "wget {params.link} -O {output}"
rule merger:
    input:
        expand(os.path.join('chr_dir', "{chromo}"), chromo=dwn_lnks.keys())
    output:
        os.path.join('genome_dir', 'human_en37_sm.fa')
    run:
        txt = open({output}, 'a+')
        with open (os.path.join('chr_dir', "{chromo}") as file:
            line = file.readline()
            while line:
                txt.write(line)
                line = file.readline()
        txt.close()
This code returns the error:
No values given for wildcard 'chromo'. in line 20
Also, in the merger rule, the python code within the run does not work.
The tutorial in the snakemake package does not cover enough examples to learn the details for non-computer scientists. If anybody knows a good resource to learn how to work with snakemake, I would appreciate if they could share :).
The problem is that you have an expand function in the output of the rule download that does not define the value for the wildcard {chromo}. I guess what you really want here is
rule download:
    output:
        'chr_dir/{chromo}',
    params:
        link=chromo2link,
    shell:
        # uppercase -O sets the output file; lowercase -o would write wget's log there
        "wget {params.link} -O {output}"
without the expand. The expand function is only needed to aggregate over wildcards, like you do it in the rule merger.
Also have a look at the official Snakemake tutorial, which explains this in detail.
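The run block of the merger rule is not addressed above. One possible way to concatenate the downloaded files is sketched here; the helper name merge_files and the file contents in the self-check are invented for illustration.

```python
import os
import shutil
import tempfile

def merge_files(input_paths, output_path):
    """Concatenate the input files, in the given order, into one output file."""
    with open(output_path, "wb") as merged:
        for path in input_paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)

# Small self-check with two throwaway files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for name, text in [("1", ">chr1\nACGT\n"), ("2", ">chr2\nTTGA\n")]:
        part_path = os.path.join(tmp, name)
        with open(part_path, "w") as handle:
            handle.write(text)
        parts.append(part_path)
    merged_path = os.path.join(tmp, "merged.fa")
    merge_files(parts, merged_path)
    with open(merged_path) as handle:
        merged_text = handle.read()
print(merged_text)
```

Inside a run block, the rule's input and output are available as Python objects, so the call would be merge_files(input, output[0]). No braces or wildcards are needed there, because the expand in the input section has already listed every file.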
