I'm trying to run a small pipeline in Snakemake that uses a program to keep only the good reads in RNA-seq files.
This is my code:
SAMPLES = ['ZN21_S1', 'ZN22_S2', 'ZN27_S3', 'ZN28_S4', 'ZN29_S5', 'ZN30_S6']

rule all:
    input:
        expand("SVA-{sample}_L001_R{read_no}.fastq.gz", sample=SAMPLES, read_no=['1', '2'])

rule fastp:
    input:
        reads1="SVA-{sample}_L001_R1.fastq.gz",
        reads2="SVA-{sample}_L001_R2.fastq.gz"
    output:
        reads1out="out/SVA-{sample}_L001_R1.fastq.gz.good",
        reads2out="out/SVA-{sample}_L001_R2.fastq.gz.good"
    shell:
        "fastp -i {input.reads1} -I {input.reads2} -o {output.reads1out} -O {output.reads2out}"
All samples (as symbolic links) are in the same folder, and I only get the message "Nothing to be done".
What am I not seeing?
In your example, the target files in rule all are supposed to match rule fastp's output files, not its input files as in your current setup. As your code stands, the target files in rule all already exist, hence the message Nothing to be done when executing it.
rule all:
    input:
        expand("out/SVA-{sample}_L001_R{read_no}.fastq.gz.good", sample=SAMPLES, read_no=['1', '2'])
I'm having some trouble running Snakemake. I want to perform quality control of some bulk RNA-Seq samples using FastQC. I've written the code so that all files following the pattern {sample}_{replicate}.fastq.gz are used as input, where {sample} is the sample id (i.e. SRR6974023) and {replicate} is 1 or 2. My little script follows:
configfile: "config.yaml"
rule all:
input:
expand("raw_qc/{sample}_{replicate}_fastqc.{extension}", sample=config["samples"], replicate=[1, 2], extension=["zip", "html"])
rule fastqc:
input:
rawread=expand("raw_data/{sample}_{replicate}.fastq.gz", sample=config["samples"], replicate=[1, 2])
output:
compress=expand("raw_qc/{sample}_{replicate}_fastqc.zip", sample=config["samples"], replicate=[1, 2]),
net=expand("raw_qc/{sample}_{replicate}_fastqc.html", sample=config["samples"], replicate=[1, 2])
threads:
8
params:
path="raw_qc/"
shell:
"fastqc -t {threads} {input.rawread} -o {params.path}"
Just in case, the config.yaml is:
samples:
    SRR6974023
    SRR6974024
The raw_data directory with my files looks like this:
SRR6974023_1.fastq.gz SRR6974023_2.fastq.gz SRR6974024_1.fastq.gz SRR6974024_2.fastq.gz
Finally, when I run the script, I always see the same error:
Building DAG of jobs...
MissingInputException in line 8 of /home/user/path/Snakefile:
Missing input files for rule fastqc:
raw_data/SRR6974023 SRR6974024_2.fastq.gz
raw_data/SRR6974023 SRR6974024_1.fastq.gz
It correctly sees only the last sample's files, in this case SRR6974024_1.fastq.gz and SRR6974024_2.fastq.gz, whereas the other sample is only seen as SRR6974023. How can I solve this issue? I'd appreciate some help. Thank you all!
The yaml is not configured correctly. It needs a - before each entry to turn the rows into a list:
samples:
    - SRR6974023
    - SRR6974024
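With the dashes, config["samples"] is parsed as the Python list ['SRR6974023', 'SRR6974024']; without them, YAML folds the two indented lines into the single string 'SRR6974023 SRR6974024', which is exactly what ended up inside the expanded file names in your error message. A quick way to see the difference outside Snakemake, using PyYAML:

import yaml  # PyYAML

with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

print(config["samples"])
# without dashes: 'SRR6974023 SRR6974024'       (one folded string)
# with dashes:    ['SRR6974023', 'SRR6974024']  (a proper list)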
I want to create a very simple pipeline for molecular dynamics simulation. The program (Amber) wants just 3 input files and produces a lot of files, some of which I will need in the future. So my pipeline is extremely simple:
Check that *.in, *.prmtop and *.rst files are in the folder (I guarantee there is only one file for each of these extensions) and warn if they are not present
Run the shell command (based on the names of all input files)
Check that *.out, mden, mdinfo and *.nc were produced
That's all. It's the standard approach for the program I deal with: one folder, one task, short and simple file names based on file purpose, not on content.
I wrote a simple pipeline:
rule all:
    input: '{inp}.out'

rule amber:
    input:
        '{inp}.in',
        '{top}.prmtop',
        '{coord}.rst'
    output:
        '{inp}.out',
        'mden',
        'mdinfo',
        '{inp}.nc'
    shell:
        'pmemd.cuda'
        ' -O'
        ' -i {inp}.in'
        ' -o {inp}.out'
        ' -p {top}.prmtop'
        ' -c {coord}.rst'
        ' -r {inp}.rst'
        ' -x {inp}.nc'
        ' -ref {coord}.rst'
And it doesn't work.
All inputs in rule all must be explicit. (Why? Why can't it be a regex or wildcard expression? If I see *.out in the folder and the status code of the shell script was 0, that's all, the work is done.)
I have to use every wildcard from the input in the output, but I want to use some of them only in the shell command or in other rules.
I'm not supposed to expect files like mden with potentially "non-unique" names because they could be changed by another task; but I know there will be only one task, and that is simply how my MD program works (yeah, I know about Amber's -e and -inf keys, but that's over-complicating a simple task).
So, I would like to decide whether it's worth using snakemake for this or not. It's a very simple task, but I've already spent several hours on it; I've read a lot of documentation and a lot of examples that I can't apply to my case. snakemake looks like exactly what I need, but I can't express this simple task in the framework's general terms. I don't want to specify explicit filenames, because I'd lose flexibility: I want to run hundreds of simple tasks automatically, where only the input files differ. I'm sure I just haven't figured out how to handle this framework yet. Maybe you can show me how? Thank you!
Hopefully this will point you in the right direction.
If I understand correctly, the input to snakemake is a folder containing the input files to amber. You know that this folder contains one .in file, one .prmtop file, and one .rst file but you don't know the full names of these files.
If you want snakemake to run on a single input folder, then you don't need wildcards at all and the script below should do.
import glob
import os

input_folder = config['amber_folder']

# We don't know the name of the input file. We only know it ends in '.in'
inp = glob.glob(os.path.join(input_folder, '*.in'))
assert len(inp) == 1
inp = inp[0]

name = os.path.splitext(os.path.basename(inp))[0]
output_folder = name + '_results'
out = os.path.join(output_folder, name + '.out')

rule all:
    input:
        out

rule amber:
    input:
        inp=inp,
        top=glob.glob(os.path.join(input_folder, '*.prmtop')),
        rst=glob.glob(os.path.join(input_folder, '*.rst')),
    output:
        out=out,
        nc=os.path.join(output_folder, name + '.nc'),
        mden=os.path.join(output_folder, 'mden'),
        mdinfo=os.path.join(output_folder, 'mdinfo'),
    shell:
        r"""
        pmemd.cuda \
            -O \
            -i {input.inp} \
            -o {output.out} \
            -p {input.top} \
            -c {input.rst} \
            -r {input.rst} \
            -x {output.nc} \
            -ref {input.rst}
        """
Execute with:
snakemake -j 1 -C amber_folder='your-input-folder'
If you have many input folders you could write a for-loop to execute the command above, but it's probably better to pass the list of inputs to snakemake and let it handle them.
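A rough sketch of that idea, assuming the folders are passed as a space-separated string via -C amber_folders='run1 run2 run3' and each result set is written to a results/ subfolder inside its input folder (the mden/mdinfo outputs are omitted to keep the sketch short):

import glob
import os

folders = config['amber_folders'].split()

def stem(folder, ext):
    """Return the single file with the given extension inside a folder."""
    hits = glob.glob(os.path.join(folder, '*' + ext))
    assert len(hits) == 1, 'expected exactly one %s file in %s' % (ext, folder)
    return hits[0]

# Basename of each folder's .in file, used to name that folder's outputs
names = {f: os.path.splitext(os.path.basename(stem(f, '.in')))[0] for f in folders}

rule all:
    input:
        expand('{folder}/results/{name}.out', zip,
               folder=folders, name=[names[f] for f in folders])

rule amber:
    input:
        inp=lambda wc: stem(wc.folder, '.in'),
        top=lambda wc: stem(wc.folder, '.prmtop'),
        rst=lambda wc: stem(wc.folder, '.rst'),
    output:
        out='{folder}/results/{name}.out',
        nc='{folder}/results/{name}.nc',
    shell:
        r"""
        pmemd.cuda \
            -O \
            -i {input.inp} \
            -o {output.out} \
            -p {input.top} \
            -c {input.rst} \
            -r {wildcards.folder}/results/{wildcards.name}.rst \
            -x {output.nc} \
            -ref {input.rst}
        """

Execute with something like snakemake -j 4 -C amber_folders='run1 run2 run3', and Snakemake will schedule the folders independently.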
Snakemake deletes all output files that are marked temporary, but it does not do anything to the files if the output is a directory, as shown below:
rule all:
    input:
        'final.txt',

checkpoint split_big_file:
    input: 'bigfile.txt'
    output: temp(directory('split_files'))
    shell: 'mkdir -p {output} ; split -l 5000 -d -e bigfile.txt {output}/part_'

rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}'

def aggregate_input(wildcards):
    '''
    Aggregate the file names of the random number of files
    generated at the scatter step.
    '''
    checkpoint_output = checkpoints.split_big_file.get(**wildcards).output[0]
    print(checkpoint_output)
    agg_inp = expand('copy_files/part_{num}.txt', num=glob_wildcards('split_files/part_{num}').num)
    print(agg_inp)
    return agg_inp

rule merge_small_files:
    input: aggregate_input
    output: 'final.txt'
    shell: 'cat {input} > {output}'
When I run the code shown above with a bigfile.txt that has several thousand lines, everything runs fine but the split_files directory is not empty.
$ wc -l final.txt
61177 final.txt
$ wc -l bigfile.txt
61177 bigfile.txt
$ ls copy_files/
$ ls split_files/
part_00 part_01 part_02 part_03 part_04
part_05 part_06 part_07 part_08 part_09
part_10 part_11 part_12
What I would like to see:
1. The copy_files directory should also be deleted (apparently snakemake will not delete directories by default, since it cannot tell whether a directory contains files unrelated to snakemake).
2. The contents of the split_files directory (and preferably the directory itself; see point 1 above) should be deleted.
I cannot recreate it:
rule all:
    input:
        "a.txt"

rule first:
    output:
        temp(directory("dir1"))
    shell:
        "mkdir {output}; touch {output}/a.txt; sleep 5"

rule second:
    input:
        "dir1"
    output:
        "a.txt"
    shell:
        "touch {output}"
What version of snakemake are you using? Is output_dir maybe listed under rule all for you? Snakemake assumes that the output you want is the input of your first rule (probably rule all), so it won't delete those files; removing output_dir from under rule all will solve this issue.
However, I am just guessing, since you didn't provide a minimal reproducible example.
Edit:
Hmm... That should work! Here are two non-ideal solutions I could come up with:
We can fool snakemake into re-evaluating the DAG and then deleting the folder, like this; however, I'm not sure whether the files get deleted early enough for you (the files might be very large).
rule merge_small_files:
    input: aggregate=aggregate_input, dummy='split_files'
    output: 'final.txt'
    shell: 'cat {input.aggregate} > {output}'
Or just delete each file right after copying it; however, you will end up with an empty folder at the end:
rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}; rm {input}'
You can of course combine both solutions and have the best of both worlds, however it is not very pretty to look at, unfortunately :(
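Combined, the two workarounds would look roughly like this: the dummy input forces the DAG re-evaluation so the temporary split_files directory gets removed, and the rm deletes each chunk as soon as it has been copied:

rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}; rm {input}'

rule merge_small_files:
    input: aggregate=aggregate_input, dummy='split_files'
    output: 'final.txt'
    shell: 'cat {input.aggregate} > {output}'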
I'm trying to write a script for a pipeline, but I'm having trouble declaring the input of a rule from a directory. These are the relevant parts of my code:
rule taco:
    input:
        all_gtf = GTF_DIR + "path_samplesGTF.txt"
    output:
        taco_out = TACO_DIR
    shell:
        "taco_run -v -p 20 -o {output.taco_out} "
        "--filter-min-expr 1 --gtf-expr-attr RPKM {input.all_gtf}"

rule feelnc_filter:
    input:
        assembly = TACO_DIR + "assembly.gtf",
        annotation = GTF
    output:
        candidate_lncrna = FEELNC_FILTER + "candidate_lncrna.gtf"
    shell:
        "./FEELnc_filter.pl -i {input.assembly} -a {input.annotation} > {output.candidate_lncrna}"
This is my error:
MissingInputException in line 97 of /workdir/Snakefile:
Missing input files for rule feelnc_filter:
/workdir/pipeline-v01/TACO/assembly.gtf
Thank you!
Your Snakefile is definitely shorter than 97 lines, so the exception description is not very useful. Anyway, MissingInputException means that Snakemake failed while constructing the workflow DAG: the input of rule feelnc_filter, TACO_DIR + "assembly.gtf", neither exists on disk nor is declared as the output of any rule. Rule taco only declares the directory TACO_DIR as its output, not the assembly.gtf file inside it, so Snakemake cannot connect the two rules.
Now we have the second problem: your shell commands run your own Perl script and a taco_run executable I have no clue about. I guess that taco_run creates the directory you specify as -o {output.taco_out}, but Snakemake has no way of knowing which files appear inside it.
I advise you to run your Snakemake with the --printshellcmds flag. This shows you the exact commands being run, and you can try running those commands separately and check that they really create the expected outputs.
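One way to make the dependency explicit, assuming taco_run always writes a file called assembly.gtf into its output directory (I cannot verify that assumption here), is to declare that file rather than the bare directory as the output of rule taco:

rule taco:
    input:
        all_gtf = GTF_DIR + "path_samplesGTF.txt"
    output:
        # assumption: taco_run creates assembly.gtf inside its -o directory
        assembly = TACO_DIR + "assembly.gtf"
    params:
        outdir = TACO_DIR
    shell:
        "taco_run -v -p 20 -o {params.outdir} "
        "--filter-min-expr 1 --gtf-expr-attr RPKM {input.all_gtf}"

Snakemake then knows that rule taco produces exactly the file rule feelnc_filter asks for and can chain the two rules.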
I have a script which takes a large input file and breaks it down into a number of chunks, from 1 to n, using an unpredictable algorithm.
Then a following script processes each of these chunks iteratively.
How can I create a Snakemake rule which essentially states that output files 1 to n will exist, and that the following script should be run once for each of the 1 to n input files?
Thanks!
There is the dynamic keyword. It can be used like this:
rule all:
    input:
        dynamic('{id}.png')

rule draw:
    input:
        '{id}.txt'
    output:
        '{id}.png'
    shell:
        'cp {input} {output}'

rule cluster:
    input:
        'input.csv'
    output:
        dynamic('{id}.txt')
    shell:
        'touch 1.txt 2.txt'
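Note that dynamic() has since been deprecated in favor of checkpoints in recent Snakemake versions. A roughly equivalent sketch of the example above using a checkpoint (the clusters/ and plots/ folders and the merge rule are made up here to give the workflow a concrete final target):

import os

rule all:
    input:
        'merged.txt'

checkpoint cluster:
    input:
        'input.csv'
    output:
        directory('clusters')
    shell:
        'mkdir -p {output}; touch {output}/1.txt {output}/2.txt'

rule draw:
    input:
        'clusters/{id}.txt'
    output:
        'plots/{id}.png'
    shell:
        'cp {input} {output}'

def gather_plots(wildcards):
    # Re-evaluated only after the checkpoint has finished,
    # when the number of chunks is actually known
    outdir = checkpoints.cluster.get(**wildcards).output[0]
    ids = glob_wildcards(os.path.join(outdir, '{id}.txt')).id
    return expand('plots/{id}.png', id=ids)

rule merge:
    input:
        gather_plots
    output:
        'merged.txt'
    shell:
        'ls {input} > {output}'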
Have you tried setting a wildcard? For example, if you are iterating a rule over files 1 to 22, you can define the values at the top of your Snakefile:

num = range(1, 23)

Then use num in expand() for your file names, or reference the current value inside a rule as {wildcards.num}.
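Strictly speaking, num here is not a wildcard itself but a plain Python list; it becomes the set of values the {num} wildcard takes when used with expand. A minimal sketch with made-up file names:

num = range(1, 23)

rule all:
    input:
        expand('chunk_{num}.processed.txt', num=num)

rule process:
    input:
        'chunk_{num}.txt'
    output:
        'chunk_{num}.processed.txt'
    # inside the shell command, {wildcards.num} gives the current value
    shell:
        'cp {input} {output}'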