Snakemake using input files in different folders summarizing by name - python

I'm trying to develop a pipeline that will take input files from different directories, specified in a yaml config file, and keep track of them by a name I specify in the yaml. For example, say my yaml looks like:

input:
    name1: /some/path/to/file1
    name2: /a/totally/different/path/to/file2
    name3: /yet/another/path/to/file3
output: /path/to/outdir
I'd like to go through a series of steps and end up with an outdir that has the contents
/path/to/outdir/processed_name1.extension
/path/to/outdir/processed_name2.extension
/path/to/outdir/processed_name3.extension
I honestly can't get anything to work. Where I've stalled is treating the names as wildcards and using them to look up paths in the config dictionary. But this doesn't work, because the wildcards are never initialized: the very first step is the one that accesses the inputs. I can't be super specific with my code example due to company policy, but basically it looks like this:
rule all:
    input:
        processed_files = expand(config['output'] + "/processed_{name}.extension", name=config['input'])

rule step_1:
    input:
        input_file = lambda wc: config['input'][wc.name]
    output:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    run:
        <some command>

rule step_2:
    input:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    output:
        processed_file = config['output'] + "/processed_{name}.extension"
    run:
        <some command>
But this gives me wildcard errors, which makes sense, I think: there's no way for Snakemake to figure out the wildcard values, since they only exist in the config file. I feel like this is so similar to the example in the Advanced Workflow Example, but sufficiently different that I just can't get it to work...
EDIT 1: I replaced all the f-strings with string concatenation, just to make sure that wasn't the issue.
EDIT 2: I eventually got it to work. I'm honestly not sure what changed; I must have had a typo or something... but I can at least say that this overall structure works.

I found no major errors in the code you've shown, though I removed the f-strings and changed run: to shell: to make for an easy test. The following works just fine with an appropriate config file.
configfile: "config.yaml"

rule all:
    input:
        processed_files = expand(config['output'] + "/processed_{name}.extension", name=config['input'])

rule step_1:
    input:
        input_file = lambda wc: config['input'][wc.name]
    output:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    shell:
        "cat {input} > {output}"

rule step_2:
    input:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    output:
        processed_file = config['output'] + "/processed_{name}.extension"
    shell:
        "cat {input} > {output}"

Related

MD task with `snakemake`

I want to create a very simple pipeline for molecular dynamics simulation. The program (Amber) just wants 3 files as input, and produces a lot of files, some of which I will need in the future. So my pipeline is extremely simple:
Check that *.in, *.prmtop and *.rst are in the folder (I guarantee there is only one file for each of these extensions) and warn if they are not present
Run a shell command (based on the names of all input files)
Check that *.out, mden, mdinfo and *.nc were produced
That's all. It's the standard approach for the program I deal with: one folder, one task, short and simple file names based on file purpose, not on content.
I wrote a simple pipeline:
rule all:
    input: '{inp}.out'

rule amber:
    input:
        '{inp}.in',
        '{top}.prmtop',
        '{coord}.rst'
    output:
        '{inp}.out',
        'mden',
        'mdinfo',
        '{inp}.nc'
    shell:
        'pmemd.cuda'
        ' -O'
        ' -i {inp}.in'
        ' -o {inp}.out'
        ' -p {top}.prmtop'
        ' -c {coord}.rst'
        ' -r {inp}.rst'
        ' -x {inp}.nc'
        ' -ref {coord}.rst'
And it doesn't work.
All inputs in the all rule must be explicit. (Why? Why can't it be a regex or a wildcard expression? If I see *.out in the folder and the exit code of the shell script was 0, that's it, the work is done.)
I have to use every wildcard from the input in the output, but I want to use some of them only in shell or in other rules.
I'm not supposed to expect files with potentially "non-unique" names like mden, because they could be changed by another task; but I know there will be only one task, and this is simply how my MD program works (yeah, I know about Amber's -e and -inf keys, but that's over-complicating a simple task).
So I would like to decide whether it's worth using snakemake for this or not. It's a very simple task, but I've already spent several hours on it; I've seen a lot of documentation and a lot of examples that I can't apply to my case. snakemake looks like exactly what I need, but I can't express a simple task in general terms with this framework. I don't want to specify explicit filenames, because I'd lose flexibility: I want to run hundreds of simple tasks automatically, where only the input files differ. I'm sure I just haven't figured out how to handle this framework yet. Maybe you can show me how I should? Thank you!
Hopefully this will point you in the right direction.
If I understand correctly, the input to snakemake is a folder containing the input files to amber. You know that this folder contains one .in file, one .prmtop file, and one .rst file but you don't know the full names of these files.
If you want snakemake to run on a single input folder, then you don't need wildcards at all and the script below should do.
import glob
import os

input_folder = config['amber_folder']

# We don't know the name of the input file. We only know it ends in '.in'
inp = glob.glob(os.path.join(input_folder, '*.in'))
assert len(inp) == 1
inp = inp[0]

name = os.path.splitext(os.path.basename(inp))[0]
output_folder = name + '_results'
out = os.path.join(output_folder, name + '.out')

rule all:
    input:
        out

rule amber:
    input:
        inp= inp,
        top= glob.glob(os.path.join(input_folder, '*.prmtop')),
        rst= glob.glob(os.path.join(input_folder, '*.rst')),
    output:
        out= out,
        nc= os.path.join(output_folder, name + '.nc'),
        # restart goes to the results folder; writing -r to the input .rst
        # would overwrite an input file
        rst= os.path.join(output_folder, name + '.rst'),
        mden= os.path.join(output_folder, 'mden'),
        mdinfo= os.path.join(output_folder, 'mdinfo'),
    shell:
        r"""
        pmemd.cuda \
            -O \
            -i {input.inp} \
            -o {output.out} \
            -p {input.top} \
            -c {input.rst} \
            -r {output.rst} \
            -x {output.nc} \
            -ref {input.rst}
        """
Execute with:
snakemake -j 1 -C amber_folder='your-input-folder'
If you have many input folders you could write a for-loop to execute the command above, but it's probably better to pass the list of inputs to snakemake and let it handle them; a rough sketch of that follows.
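Here is one hedged sketch of the multi-folder variant. It assumes a hypothetical config key amber_folders holding a list of folders, that run names (the .in basenames) are unique across folders, and the same one-file-per-extension guarantee as above:

import glob
import os

# hypothetical config key: a list of amber input folders
folders = config['amber_folders']

def single(folder, ext):
    """Return the single file with the given extension in folder."""
    hits = glob.glob(os.path.join(folder, '*' + ext))
    assert len(hits) == 1, f"expected exactly one {ext} file in {folder}"
    return hits[0]

# map run name (basename of each folder's single .in file) to its folder
runs = {os.path.splitext(os.path.basename(single(f, '.in')))[0]: f
        for f in folders}

rule all:
    input:
        expand("{name}_results/{name}.out", name=runs)

rule amber:
    input:
        inp= lambda wc: single(runs[wc.name], '.in'),
        top= lambda wc: single(runs[wc.name], '.prmtop'),
        rst= lambda wc: single(runs[wc.name], '.rst'),
    output:
        out= "{name}_results/{name}.out",
        nc= "{name}_results/{name}.nc",
        rst= "{name}_results/{name}.rst",
    shell:
        "pmemd.cuda -O -i {input.inp} -o {output.out} -p {input.top}"
        " -c {input.rst} -r {output.rst} -x {output.nc} -ref {input.rst}"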

Snakemake: inserting sample name before every input file in one rule

I am trying to create a rules file for a bioinformatics tool FMAP. https://github.com/jiwoongbio/FMAP
I am stuck at creating a rule for the FMAP_table.pl script. This is my current rule:
rule fmap_table:
    input:
        expand(str(CLASSIFY_FP/"mapping"/"{sample}_abundance.txt"), sample=Samples.keys())
    output:
        str(CLASSIFY_FP/'mapping'/'abundance_table.txt')
    shell:
        """
        perl /media/data/FMAP/FMAP_table.pl {input} > {output}
        """
I would like my column names to contain only the sample names, not the whole path. The script supports this:
perl FMAP_table.pl [options] [name1=]abundance1.txt [[name2=]abundance2.txt [...]] > abundance_table.txt
My issue is how to select the sample name for each sample file, pair it with the sample's path, and add the = in between.
My samples are named like SAMPLE111_S1_abundance.txt. This is the format I would like to achieve automatically:
perl /media/data/FMAP/FMAP_table.pl SAMPLE111_S1=SAMPLE111_S1_abundance.txt SAMPLE112_S2=SAMPLE112_S2_abundance.txt [etc.] > abundance_table.txt
Thanks
I might add a parameter to build that, and maybe also build the file names in a dict externally (note the f-string: with a plain "{sample}" the dict would hold the literal pattern for every sample):

FMAP_INPUTS = {sample: str(CLASSIFY_FP/"mapping"/f"{sample}_abundance.txt")
               for sample in Samples.keys()}

rule fmap:
    input: FMAP_INPUTS.values()
    output:
        str(CLASSIFY_FP/'mapping'/'abundance_table.txt')
    params:
        names=" ".join(f"{s}={f}" for s, f in FMAP_INPUTS.items())
    shell:
        """
        perl /media/data/FMAP/FMAP_table.pl {params.names} > {output}
        """

Snakemake - Wildcards in input files cannot be determined from output files

I am very new to snakemake and also not so fluent in python (so apologies, this might be a very basic question):
I am currently building a pipeline to analyze a set of bamfiles with atlas. These bamfiles are located in different folders and should not be moved to a common one. Therefore I decided to provide a sample list looking like this (this is just an example; in reality, samples might be on totally different drives):
Sample Path
Sample1 /some/path/to/my/sample/
Sample2 /some/different/path/
And load it in my config.yaml with:
sample_file: /path/to/samplelist/samplslist.txt
Now to my Snakefile:
import pandas as pd

# define configfile with paths etc.
configfile: "config.yaml"

# read in dataframe and define Sample and Path
SAMPLES = pd.read_table(config["sample_file"])
BAMFILE = SAMPLES["Sample"]
PATH = SAMPLES["Path"]

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=PATH, sample=BAMFILE)

# this works like a charm as long as I give the zip-function in the rules 'all' and 'summary':
rule indexBam:
    input:
        "{path}{sample}.bam"
    output:
        "{path}{sample}.bam.bai"
    shell:
        "samtools index {input}"

# this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam="{path}{sample}.bam",
        bai=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE)
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "{config[atlas]} task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        index=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE),
        bamd=expand("analysis/BAMDiagnostics/{sample}_approximateDepth.txt", sample=BAMFILE)
    output:
        "{path}{sample}.summary.txt"
    shell:
        "echo -e '{input.index} {input.bamd}'"
I get the error
WildcardError in line 28 of path/to/my/Snakefile:
Wildcards in input files cannot be determined from output files:
'path'
Can anyone help me?
- I tried to solve this problem with join, or by creating input functions, but I think I'm just not skilled enough to see my error...
- I guess the problem is that my summary rule does not carry the {path} tuple for the bamdiagnostics output (since that output lives somewhere else) and so cannot make the connection to the input file.
- Expanding the input of the bamdiagnostics rule makes the code run, but of course feeds every sample's input into every sample's output and creates a big mess: both bamfiles are used for the creation of each output file. That is wrong, as the samples AND the outputs are to be treated independently.
Based on the atlas docs, it seems that what you need is to run each rule separately for each sample, the complication here being that each sample is in a separate path.
I modified your script to work for the above case. Variables at the beginning of the script were renamed to make better sense, config was removed for demo purposes, and the pathlib library was used instead of os.path.join. pathlib is not necessary, but it helps me keep my sanity. The shell command was modified to avoid config.
import pandas as pd
from pathlib import Path

df = pd.read_csv('sample.tsv', sep='\t', index_col='Sample')
SAMPLES = df.index
BAM_PATH = df["Path"]
# print(BAM_PATH['sample1'])

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=BAM_PATH, sample=SAMPLES)

rule indexBam:
    input:
        str(Path("{path}") / "{sample}.bam")
    output:
        str(Path("{path}") / "{sample}.bam.bai")
    shell:
        "samtools index {input}"

rule bamdiagnostics:
    input:
        bam = lambda wildcards: str(Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam"),
        bai = lambda wildcards: str(Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "./atlas task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        bamd = "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        index = lambda wildcards: str(Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    output:
        str(Path("{path}") / "{sample}.summary.txt")
    shell:
        "echo -e '{input.index} {input.bamd}'"

'No values given for wildcard error' in snakemake

I am trying to make a simple pipeline using snakemake to download two files from the web and then merge them into a single output.
What I thought would work is the following code:
dwn_lnks = {
    '1': 'https://molb7621.github.io/workshop/_downloads/sample.fa',
    '2': 'https://molb7621.github.io/workshop/_downloads/sample.fa'
}

import os

# association between chromosomes and their links
def chromo2link(wildcards):
    return dwn_lnks[wildcards.chromo]

rule all:
    input:
        os.path.join('genome_dir', 'human_en37_sm.fa')

rule download:
    output:
        expand(os.path.join('chr_dir', '{chromo}')),
    params:
        link=chromo2link,
    shell:
        "wget {params.link} -O {output}"

rule merger:
    input:
        expand(os.path.join('chr_dir', "{chromo}"), chromo=dwn_lnks.keys())
    output:
        os.path.join('genome_dir', 'human_en37_sm.fa')
    run:
        txt = open({output}, 'a+')
        with open (os.path.join('chr_dir', "{chromo}") as file:
            line = file.readline()
            while line:
                txt.write(line)
                line = file.readline()
        txt.close()
This code returns the error:
No values given for wildcard 'chromo'. in line 20
Also, the python code within the run block of the merger rule does not work.
The tutorial in the snakemake package does not cover enough examples for non-computer scientists to learn the details. If anybody knows a good resource for learning how to work with snakemake, I would appreciate it if they could share :).
The problem is that you have an expand function in the output of the rule download that does not define a value for the wildcard {chromo}. I guess what you really want here is:
rule download:
    output:
        'chr_dir/{chromo}',
    params:
        link=chromo2link,
    shell:
        "wget {params.link} -O {output}"
without the expand. The expand function is only needed to aggregate over wildcards, as you do in the rule merger.
Also have a look at the official Snakemake tutorial, which explains this in detail.
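As for the run block: Snakemake does not substitute {output} or {chromo} inside Python code the way it does in shell strings; input and output are available directly as file lists. A hedged sketch of a merger rule that should work:

rule merger:
    input:
        expand(os.path.join('chr_dir', '{chromo}'), chromo=dwn_lnks.keys())
    output:
        os.path.join('genome_dir', 'human_en37_sm.fa')
    run:
        # input is the list of downloaded files; output[0] is the target path
        with open(output[0], 'w') as out:
            for path in input:
                with open(path) as f:
                    for line in f:
                        out.write(line)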

Snakemake: using regex in expand()

I'm struggling with using regular expressions in the expand function. For some reason, the wildcards are always inserted as plain text instead of being interpreted as regular expressions. Whether the regex is introduced as a wildcard beforehand or directly in the expand call makes no difference (see all_decompress vs. all_decompress2). The error is always:
Missing input files for rule DECOMPRESS:
Resources/raw/run1_lane1_read[1,2]_index\d+-\d+=[1-9], [11-32].fastq.gz
#!/usr/bin/env python3
import re

###### WILDCARDS #####
## General descriptive parameters
read = "[1,2]"
index_prefix = r"\d{3}-\d{3}"
index = r"[1-9], [11-32]"

##### RULES #####
### CONJUNCTION RULES ("all") ###
# PREANALYSIS #
rule all_decompress:
    input:
        expand("Resources/decompressed/read{read}_index{index_prefix}={index}.fastq", read=read, index_prefix=index_prefix, index=index)

rule all_decompress2:
    input:
        expand("Resources/decompressed/read{read}_index{index_prefix}={index}.fastq", read=[1,2], index_prefix=r"\d{3}-\d{3}", index=r"[1-9], [11-32]")

### TASK RULES ###
# PREANALYSIS #
# Decompress .gz zipped raw files
rule DECOMPRESS:
    input:
        "Resources/raw/run1_lane1_read{read}_index{index_prefix}={index}.fastq.gz"
    output:
        "Resources/decompressed/read{read}_index{index_prefix}={index}.fastq"
    shell:
        "gzip -d -c {input} > {output}"
If I'm right, the expand function in Snakemake builds a list of strings; it is used for filenames just as you used it.
I don't know whether expand can be combined with regexes to create that list.
But you can produce the list in Python and give it to rule all or to the expand function.
In your case, you can use the following code to build your list of filenames:
import re
import os

path = '.'
listoffiles = []
for file in os.listdir(path):
    if re.search(r'read[1-2]_index\d{3}-\d{3}=[1-9]', file):
        listoffiles.append(os.path.splitext(file)[0])
Then listoffiles holds all your file names and you just have to use expand like this:
expand("{repertory}{filename}{extension}",
       repertory="Resources/decompressed/",
       filename=listoffiles,
       extension=".fastq")
Then everything should work perfectly.
Remember that all Python code in a Snakefile is executed at the beginning of the workflow, before any rules are evaluated and the DAG is created. So it can be powerful.
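Worth noting alongside this answer: Snakemake ships a glob_wildcards helper that does the file-to-wildcard matching for you. A sketch, assuming the raw files follow the pattern from the question:

# let Snakemake extract wildcard values from the files that already exist
READS, PREFIXES, INDICES = glob_wildcards(
    "Resources/raw/run1_lane1_read{read}_index{index_prefix}={index}.fastq.gz")

rule all_decompress:
    input:
        expand("Resources/decompressed/read{read}_index{index_prefix}={index}.fastq",
               zip, read=READS, index_prefix=PREFIXES, index=INDICES)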
Did you look into bli's links?
Specifically, this part of the documentation.
Here's a simple example of how I use it to generate a png, a pdf, or both:
rule all:
    input: expand("{graph}.png", graph=["dag", "rulegraph"])

rule dot_to_image:
    input: "{graph}.dot"
    output: "{graph}.{ext,(pdf|png)}"
    shell: "dot -T{wildcards.ext} -o {output} {input}"
Hope this helps.
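The same regex-constraint idea also exists as a standalone directive, which may read better when several rules share the constraint. A minimal sketch using the question's patterns:

# constrain wildcards globally instead of inline in one output pattern
wildcard_constraints:
    read="[12]",
    index_prefix=r"\d{3}-\d{3}"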
I think it is probably just not possible to use regular expressions in expand. However, I found a workaround: for two of the wildcards ("read" and "index") I found other ways to describe them, and for the third I prepared a function and used it as the input.
#!/usr/bin/env python3
import glob

###### WILDCARDS #####
## General descriptive parameters
read = (1, 2)
index = list(range(1, 9)) + list(range(11, 32))

## Functions
def getDCinput(wildcards):
    read = wildcards.read
    index = wildcards.index
    # wd is assumed to be defined elsewhere; the glob pattern matches any
    # three-character index prefix pair, e.g. 123-456
    path = wd + "Resources/raw/run1_lane1_read" + read + "_index[0-9]??-[0-9]??=" + index + ".fastq.gz"
    return glob.glob(path)

##### RULES #####
### CONJUNCTION RULES ("all") ###
# PREANALYSIS #
rule all_decompress:
    input:
        expand("Resources/decompressed/read{read}_index{index}.fastq", read=read, index=index)

### TASK RULES ###
# FILE PREPARATION AND SMOOTHING #
# Decompress .gz zipped raw files
rule DECOMPRESS:
    input:
        getDCinput
    output:
        "Resources/decompressed/read{read}_index{index}.fastq"
    shell:
        "gzip -d -c {input} > {output}"
