Is there a way to programmatically list the log files created per rule from within the Snakefile? Will I have to tap into the DAG, and if yes, how?
Background: I'd like to bundle up and remove all created log files (only cluster logs are in a separate folder; some output files have correspondingly named log files). For this I want to be specific and exclude log files that might have been created by the programs that were run and that coincidentally match a log glob.
Are there alternatives, e.g. would parsing shellcmd_tracking files be easier?
Thanks,
Andreas
With the upcoming release 3.9.0, you can see the corresponding log files for all output files when invoking snakemake --summary.
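If you want to collect those paths programmatically, one option is to parse that summary output. Here is a rough sketch of a helper (not part of Snakemake itself); it assumes the summary is a tab-separated table whose header contains a log column with comma-separated entries, so double-check the exact layout against your Snakemake version:

import csv
import subprocess

def logs_from_summary(snakefile="Snakefile"):
    """Collect the log files reported by `snakemake --summary`."""
    lines = subprocess.run(
        ["snakemake", "--snakefile", snakefile, "--summary"],
        check=True, capture_output=True, text=True,
    ).stdout.splitlines()
    reader = csv.DictReader(lines, delimiter="\t")
    # Pick whichever header column mentions "log" (the exact name may differ).
    log_cols = [c for c in (reader.fieldnames or []) if "log" in c.lower()]
    logs = set()
    for row in reader:
        for col in log_cols:
            value = row.get(col) or ""
            logs.update(f.strip() for f in value.split(",") if f.strip() and f.strip() != "-")
    return sorted(logs)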
You may try the following:
onsuccess:
    for rulename in dir(rules):
        the_rule = getattr(rules, rulename)
        if hasattr(the_rule, "log"):
            print(rulename, ":\t", getattr(the_rule, "log"))
And similarly in onerror.
It may be possible to put this in an expand, to generate the real log file names, if there are some wildcards in your rule's log files.
I just tested this:
LETTERS = ["A", "B"]
NUMS = ["1", "2"]

rule all:
    input:
        expand("combined_{letter}.txt", letter=LETTERS)

rule generate_text:
    output:
        "text_{letter}_{num}.txt"
    log:
        "text_{letter}_{num}.log"
    shell:
        """
        echo "test" > {output} 2> {log}
        """

rule combine_text:
    input:
        expand("text_{{letter}}_{num}.txt", num=NUMS)
    output:
        "combined_{letter}.txt"
    shell:
        """
        cat {input} > {output}
        """

onsuccess:
    for rulename in dir(rules):
        the_rule = getattr(rules, rulename)
        if hasattr(the_rule, "log"):
            print(rulename, ":\t", expand(getattr(the_rule, "log"), letter=LETTERS, num=NUMS))
And I obtain the following output at the end:
all : []
combine_text : []
generate_text : ['text_A_1.log', 'text_B_1.log', 'text_A_2.log', 'text_B_2.log']
Problem is that this displays all log files potentially generated by your snakefile, not those actually generated in a particular run (if, for instance some rules don't need to be executed this time).
Edit: another way to expand the log file names
The onsuccess (or onerror) things could be done differently, in order to adapt to the log files actually generated:
import glob

onsuccess:
    for rulename in dir(rules):
        the_rule = getattr(rules, rulename)
        if hasattr(the_rule, "log"):
            print(rulename, ":\t", *[glob.glob(pattern) for pattern in expand(getattr(the_rule, "log"), letter=['*'], num=['*'])])
With this modification, I almost obtain the same list of filenames. The only thing that differs is the order in which they appear.
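Coming back to the original goal of bundling up and removing the logs, the same glob results could be fed into a tar archive instead of just printed; a sketch along those lines for the toy Snakefile above (the archive name is arbitrary):

import glob
import os
import tarfile

onsuccess:
    collected_logs = []
    for rulename in dir(rules):
        the_rule = getattr(rules, rulename)
        if hasattr(the_rule, "log"):
            for pattern in expand(getattr(the_rule, "log"), letter=['*'], num=['*']):
                collected_logs.extend(glob.glob(pattern))
    if collected_logs:
        # Bundle the logs that actually exist, then delete the originals.
        with tarfile.open("rule_logs.tar.gz", "w:gz") as tar:
            for logfile in sorted(set(collected_logs)):
                tar.add(logfile)
                os.remove(logfile)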
Related
In a pipeline that I use to work on different projects, I have a rule that takes a file following the pattern tei/xxx_xx_xxxxx_xxxxx.xml as input. Depending on the project, there are two possible outputs: either one file called xhtml/xxx_xx_xxxxx_xxxxx.html, or many files following the pattern xhtml/xxx_xx_xxxxx_xxxxx_sec_n.html (where n is a counter for the different files).
The problem is that it is not predictable at the beginning whether the project is a case 1 or a case 2 project; that is decided in the script that is run as the action of the rule. Thus, I neither know how to define the input of the default rule that requests those file(s), nor how to define the output of the rule that creates them.
I think it is probably a case for using checkpoint(), but from the examples I found I was not able to see how.
This is a simplified/reduced version of the scenario:
rule all:
    input: # How to define the input when it is not clear whether it is one case-1 file or several case-2 files?

rule xhtml_manuscript:
    input:
        tei_manuscript = 'tei/xxx_xx_xxxxx_xxxxx.xml'
    output:
        xhtml_manuscript = # How to define the output when it is not clear whether it is one case-1 file or several case-2 files?
    run:
        shell(f'java -jar {SAXON} -o:xxx_xx_xxxxx_xxxxx.html {{input}} {TRANSFORMDIR}/other/opt_split_html_sections.xsl')
Possible output:
xxx_xx_xxxxx_xxxxx.html
or
xxx_xx_xxxxx_xxxxx_sec_1.html
xxx_xx_xxxxx_xxxxx_sec_2.html
xxx_xx_xxxxx_xxxxx_sec_3.html
xxx_xx_xxxxx_xxxxx_sec_4.html
xxx_xx_xxxxx_xxxxx_sec_5.html
...
This is just Sultan's answer made more explicit. OP asks in comment:
the rule still creates the html file(s) but I do not mention them in the output explicitly, in favour of the tmp file
Yes, that's the idea. In fact, I would call the tmp file a "flag" file and I wouldn't mark it as temporary. E.g.:
rule all:
    input:
        'tei/xxx_xx_xxxxx_xxxxx.done',

rule xhtml_manuscript:
    input:
        tei_manuscript = 'tei/xxx_xx_xxxxx_xxxxx.xml'
    output:
        # Note the touch function
        xhtml_manuscript = touch('tei/xxx_xx_xxxxx_xxxxx.done'),
    run:
        shell(f'java -jar {SAXON} -o:xxx_xx_xxxxx_xxxxx.html {{input}} {TRANSFORMDIR}/other/opt_split_html_sections.xsl')
it [the flag file] would probably make the xhtml_manuscript succeed
Not really, snakemake will touch the flag file tei/xxx_xx_xxxxx_xxxxx.done only if the run or shell directive succeeds. So if the flag file is present you can be sure the underlying rule has exited with 0 exit code. Besides, you don't need to use the touch function and you could explicitly check that some files have been created. You could do:
shell:
    """
    rm -rf <expected output html files>
    java -jar <create html file(s)>
    if [ -e <this or that html file> ]
    then
        touch {output.xhtml_manuscript}
    else
        exit 1
    fi
    """
Is that not a bit dirty and intransparent
I don't know... I got used to this way of handling such cases and it looks ok to me. Ultimately though, I would say the "dirt" may be more with the structure of the pipeline or the program causing the ambiguous output. I think snakemake is doing the right thing in making such cases somewhat clunky.
This might be counterintuitive, but I would define a third (temporary) file:
rule xhtml_manuscript:
    output:
        _tmp_file = temp('temp_file_{relevant_wildcards}.tmp'),
The idea here is that this temporary file is only used as a glue to link execution of the relevant rules.
Pros: a single rule to capture the two cases (single vs multiple outputs)
Cons: there is no explicit check on the outputs (the temporary file will be created if the rule succeeds, without checking the output).
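Since the question mentions checkpoint(), here is a minimal sketch of that route as well. It assumes the transform can be pointed at a per-manuscript output directory; saxon.jar and the .xsl path are placeholders, and the rule and function names are made up:

import os

checkpoint xhtml_manuscript:
    input:
        tei_manuscript = 'tei/xxx_xx_xxxxx_xxxxx.xml'
    output:
        directory('xhtml/xxx_xx_xxxxx_xxxxx')
    shell:
        'mkdir -p {output} && '
        'java -jar saxon.jar -o:{output}/xxx_xx_xxxxx_xxxxx.html '
        '{input.tei_manuscript} opt_split_html_sections.xsl'

def manuscript_pages(wildcards):
    # Only evaluated after the checkpoint has run, so the directory exists
    # and can be globbed for however many HTML files were actually produced.
    outdir = checkpoints.xhtml_manuscript.get().output[0]
    names, = glob_wildcards(os.path.join(outdir, '{name}.html'))
    return expand(os.path.join(outdir, '{name}.html'), name=names)

rule all:
    input:
        manuscript_pages

The trade-off is the same as with the flag file: the DAG only learns about the concrete HTML files after the checkpoint has finished.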
I would like to create a Snakemake rule that has input, log and shell sections but no output; I only want to capture the log as the result of the command.
Just tell Snakemake that the log file is the output:
rule myrule:
    input: "myfile.txt"
    output: "logfile.log"
    shell: "mycommand {input} > {output}"
You could skip output: and just use log: in the rule. These log files can be used as targets or as input to other rules. As per the doc:
Log files can be used as input for other rules, just like any other output file. However, unlike output files, log files are not deleted upon error. This is obviously necessary in order to discover causes of errors which might become visible in the log file.
So the code would look like:
rule some_rule:
    input: "a.txt"
    log: "a.log"
    shell: "mycommand {input} > {log}"
The advantage here is that, unlike an output file, the log file will be preserved in case of job failure. However, this advantage is also a disadvantage: if you rerun the pipeline, snakemake will not rerun the failed job, because the rule's output file (i.e. the log file here) is already present. So, unless log preservation on job failure is important to you, you might be better served by the solution suggested by Maarten-vd-Sande.
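To illustrate the earlier point that log files can be used as input for other rules, a hypothetical downstream rule could look like this (check_log and a.checked are made-up names):

rule check_log:
    input: "a.log"
    output: "a.checked"
    # grep -c exits non-zero when there is no match, hence the || true
    shell: "grep -ci 'error' {input} > {output} || true"

Requesting a.checked (or a.log itself) on the command line is then enough to trigger some_rule.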
I need to run a snakemake rule on the cluster, and for some rules I need certain tools and libraries to be loaded, while these tools are independent of / exclusive to other rules. In this case, how can I specify them in my snakemake rule? For example, for rule score I need to module load r/3.5.1 and export R_lib=/user/tools/software. Currently I am running these lines separately on the command line before running snakemake, but it would be great if there were a way to do it within the rule, like an environment setup.
Question: I have the following rule,
rule score:
    input:
        count=os.path.join(config['general']['paths']['outdir'], 'count_expression', '{sample}.tsv'),
        libsize=os.path.join(config['general']['paths']['outdir'], 'count_expression', '{sample}.size_tsv')
    params:
        result_dir=os.path.join(config['general']['paths']['outdir'], 'score'),
        cancertype=config['general']['paths']['cancertype'],
        sample_id=expand('{sample}', sample=samples['sample'].unique())
    output:
        files=os.path.join(config['general']['paths']['outdir'], 'score', '{sample}_bg_scores.tsv', '{sample}_tp_scores.tsv')
    shell:
        'mkdir -p {params.result_dir};Rscript {config[general][paths][tool]} {params.result_dir} {params.cancertype} {params.sample_id} {input.count} {input.libsize}'
The actual behavior of the above code snippet is:
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC GNMS4 MRT5T /cluster/projects/test/results/exp/MRT5T.tsv /cluster/projects/test/results/Exp/MRT5T.size.tsv
Whereas, the expected behavior is:
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC MRT5T /cluster/projects/test/results/exp/MRT5T.tsv /cluster/projects/test/results/Exp/MRT5T.size.tsv
and for the second sample,
shell:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC GNMS4 /cluster/projects/test/results/exp/GNMS4.tsv /cluster/projects/test/results/Exp/GNMS4.size.tsv
I need the variable sample_id ['GNMS4', 'MRT5T'] to be taken separately, not together in one shell command line.
Regarding your first question: You can put whatever module load or export commands you like in the shell section of a rule.
Regarding your second question, you should probably not use expand in the params section of your rule. In expand('{sample}', sample=samples['sample'].unique()) you are actually not using the value of the sample wildcard, but generating a list of all unique values in samples['sample']. You probably just need to use wildcards.sample in the definition of your shell command instead of a params element.
If you want to run several instances of the score rule based on possible values for sample, you need to "drive" this using another rule that wants the output of score as its input.
Note that to improve readability, you can use python's multi-line strings (triple-quoted).
To sum up, you might try something like this:
rule all:
    input:
        expand(
            os.path.join(
                config['general']['paths']['outdir'],
                'score', '{sample}_bg_scores.tsv'),
            sample=samples['sample'].unique()),
        expand(
            os.path.join(
                config['general']['paths']['outdir'],
                'score', '{sample}_tp_scores.tsv'),
            sample=samples['sample'].unique())

rule score:
    input:
        count = os.path.join(
            config['general']['paths']['outdir'],
            'count_expression', '{sample}.tsv'),
        libsize = os.path.join(
            config['general']['paths']['outdir'],
            'count_expression', '{sample}.size_tsv')
    params:
        result_dir = os.path.join(config['general']['paths']['outdir'], 'score'),
        cancertype = config['general']['paths']['cancertype'],
    output:
        bg_scores = os.path.join(
            config['general']['paths']['outdir'],
            'score', '{sample}_bg_scores.tsv'),
        tp_scores = os.path.join(
            config['general']['paths']['outdir'],
            'score', '{sample}_tp_scores.tsv')
    shell:
        """
        module load r/3.5.1
        export R_lib=/user/tools/software
        mkdir -p {params.result_dir}
        Rscript {config[general][paths][tool]} {params.result_dir} {params.cancertype} {wildcards.sample} {input.count} {input.libsize}
        """
onstart would work I think. Note that dryruns don't trigger this handler, which is acceptable in your scenario.
onstart:
    shell("load tools")
A simple bash for loop should solve the problem. However, if you want each sample to be run separately by the rule, you would have to use the sample name as part of the output filename.
shell:
    '''
    for sample in {params.sample_id}
    do
        your command $sample
    done
    '''
I am very new to snakemake and also not so fluent in python (so apologies, this might be a very basic question):
I am currently building a pipeline to analyze a set of bam files with atlas. These bam files are located in different folders and should not be moved to a common one. Therefore I decided to provide a sample list that looks like this (this is just an example; in reality the samples might be on totally different drives):
Sample Path
Sample1 /some/path/to/my/sample/
Sample2 /some/different/path/
And load it in my config.yaml with:
sample_file: /path/to/samplelist/samplslist.txt
Now to my Snakefile:
import pandas as pd

#define configfile with paths etc.
configfile: "config.yaml"

#read-in dataframe and define Sample and Path
SAMPLES = pd.read_table(config["sample_file"])
BAMFILE = SAMPLES["Sample"]
PATH = SAMPLES["Path"]

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=PATH, sample=BAMFILE)

#this works like a charm as long as I give the zip-function in the rules 'all' and 'summary':
rule indexBam:
    input:
        "{path}{sample}.bam"
    output:
        "{path}{sample}.bam.bai"
    shell:
        "samtools index {input}"

#this following command works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam="{path}{sample}.bam",
        bai=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE)
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "{config[atlas]} task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        index=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE),
        bamd=expand("analysis/BAMDiagnostics/{sample}_approximateDepth.txt", sample=BAMFILE)
    output:
        "{path}{sample}.summary.txt"
    shell:
        "echo -e '{input.index} {input.bamd}"
I get the error
WildcardError in line 28 of path/to/my/Snakefile:
Wildcards in input files cannot be determined from output files:
'path'
Can anyone help me?
- I tried to solve this problem with join, or by creating input functions, but I think I am just not skilled enough to see my error...
- I guess the problem is that my summary rule does not contain the tuple with the {path} for the bamdiagnostics output (since that output is somewhere else) and therefore cannot make the connection to the input file.
- Expanding the input of the bamdiagnostics rule makes the code work, but of course it feeds every sample's input into every sample's output and creates a big mess:
In this case, both bam files are used for the creation of each output file. This is wrong, as the samples AND the outputs are to be treated independently.
Based on the atlas doc, it seems that what you need is to run each rule separately for each sample, the complication here being that each sample lives in a separate path.
I modified your script to work for the above case (see DAG). The variables at the beginning of the script were renamed to make better sense, config was removed for demo purposes, and the pathlib library was used instead of os.path.join. pathlib is not necessary, but it helps me keep my sanity. The shell command was modified to avoid config.
import pandas as pd
from pathlib import Path

df = pd.read_csv('sample.tsv', sep='\t', index_col='Sample')

SAMPLES = df.index
BAM_PATH = df["Path"]

# print(BAM_PATH['sample1'])

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=BAM_PATH, sample=SAMPLES)

rule indexBam:
    input:
        str(Path("{path}") / "{sample}.bam")
    output:
        str(Path("{path}") / "{sample}.bam.bai")
    shell:
        "samtools index {input}"

rule bamdiagnostics:
    input:
        bam = lambda wildcards: str(Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam"),
        bai = lambda wildcards: str(Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    params:
        prefix="analysis/BAMDiagnostics/{sample}"
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "./atlas task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        bamd = "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        index = lambda wildcards: str(Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
    output:
        str(Path("{path}") / "{sample}.summary.txt")
    shell:
        "echo -e '{input.index} {input.bamd}' > {output}"
Can anybody help me understand if it is possible to access sample details from a config.yml file when the sample names are not written in the snakemake workflow? This is so I can re-use the workflow for different projects and only adjust the config file. Let me give you an example:
I have four samples that belong together and should be analyzed together. They are called sample1-4. Every sample comes with some more information, but to keep it simple here, let's say it's just a name tag such as S1, S2, etc.
My config.yml file could look like this:
samples: ["sample1","sample2","sample3","sample4"]

sample1:
    tag: "S1"
sample2:
    tag: "S2"
sample3:
    tag: "S3"
sample4:
    tag: "S4"
And here is an example of the snakefile that we use:
configfile: "config.yaml"

rule final:
    input: expand("{sample}.txt", sample=config["samples"])

rule rule1:
    output: "{sample}.txt"
    params: tag=config["{sample}"]["tag"]
    shell: """
        touch {output}
        echo {params.tag} > {output}
        """
What rule1 is trying to do is create a file named after each sample, as stored in the samples variable of the config file. So far no problem. Then, I would like to print the sample tag into that file. As the code is written above, running snakemake will fail, because config["{sample}"] will literally look for a key named "{sample}" in the config file, which doesn't exist; instead I need it to be replaced with the current sample the rule is run for, e.g. sample1.
Does anybody know if this is somehow possible to do, and if yes, how I could do it?
Ideally I'd like to compress the information even more (see below) but that's further down the road.
samples:
    sample1:
        tag: "S1"
    sample2:
        tag: "S2"
    sample3:
        tag: "S3"
    sample4:
        tag: "S4"
I would suggest using a tab-delimited file to store the sample information.
sample.tab:
Sample Tag
1 S1
2 S2
You could store the path to this file in the config file, and read it in your Snakefile.
config.yaml:
sample_file: "sample.tab"
Snakefile:
configfile: "config.yaml"

from pandas import read_table

sample_file = config["sample_file"]
samples = read_table(sample_file)['Sample']
tags = read_table(sample_file)['Tag']
This way you can re-use your workflow for any number of samples, with any number of columns.
Apart from that, in Snakemake you can usually escape curly brackets by doubling them; maybe you could try that.
Good luck!
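Building on the table approach, the per-sample tag can then be looked up inside the rule with a wildcards function. A sketch, assuming the Sample column holds the same names you use in your file paths:

from pandas import read_table

# sample_file as read from the config above; index by sample name so the
# tag can be looked up per wildcard.
sample_table = read_table(sample_file, dtype=str).set_index("Sample")

rule rule1:
    output:
        "{sample}.txt"
    params:
        tag = lambda wildcards: sample_table.loc[wildcards.sample, "Tag"]
    shell:
        "echo {params.tag} > {output}"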
In the params section, you need to provide a function of wildcards. The following modification of your workflow seems to work:
configfile: "config.yaml"

rule final:
    input: expand("{sample}.txt", sample=config["samples"])

rule rule1:
    output:
        "{sample}.txt"
    params:
        tag = lambda wildcards: config[wildcards.sample]["tag"]
    shell:
        """
        touch {output}
        echo {params.tag} > {output}
        """