Snakemake: Is it possible to use directories as wildcards?

Snakemake: Is it possible to use directories as wildcards? - python

Hi I´m new in Snakemake and have a question. I want to run a tool to multiple data sets. One data set represents one tissue and for each tissue exists fastq files, which are stored in the respective tissue directory. The rough command for the tools is:
python TEcount.py -rosette rosettefile -TE te_references -count result/tissue/output.csv -RNA <LIST OF FASTQ FILE FOR THE RESPECTIVE SAMPLE>
The tissues shall be the wildcards. How can I do this? Below I have a first try that did not work.
import os
#collect data sets
SAMPLES=os.listdir("data/rnaseq/")
rule all:
input:
expand("results/{sample}/TEtools.{sample}.output.csv", sample=SAMPLES)
rule run_TEtools:
input:
TEcount='scripts/TEtools/TEcount.py',
rosette='data/prepared_data/rosette/rosette',
te_references='data/prepared_data/references/all_TE_instances.fa'
params:
#collect the fastq files in the tissue directory
fastq_files = os.listdir("data/rnaseq/{sample}")
output:
'results/{sample}/TEtools.{sample}.output.csv'
shell:
'python {input.TEcount} -rosette {input.rosette} -TE
{input.te_references} -count {output} -RNA {params.fastq_files}'
In the rule run_TEtools it does not know what the {sample} is.

A snakemake wildcard can be anything. It is basically just a string.
There are some issues with the way you are trying to achieve what you want.
Ok, here's how I would do it. Explanations follow:
import os
#collect data sets
# Beware no other directories or files (than those containing fastqs) should be in that folder
SAMPLES=os.listdir("data/rnaseq/")
def getFastqFilesForTissu(wildcards):
fastqs = list()
# Beware no other files than fastqs should be there
for s in os.listdir("data/rnaseq/"+wildcards.sample):
fastqs.append(os.path.join("data/rnaseq",wildcards.sample,s))
return fastqs
rule all:
input:
expand("results/{sample}/TEtools.{sample}.output.csv", sample=SAMPLES)
rule run_TEtools:
input:
TEcount='scripts/TEtools/TEcount.py',
rosette='data/prepared_data/rosette/rosette',
te_references='data/prepared_data/references/all_TE_instances.fa',
fastq_files = getFastqFilesForTissu
output:
'results/{sample}/TEtools.{sample}.output.csv'
shell:
'python {input.TEcount} -rosette {input.rosette} -TE {input.te_references} -count {output} -RNA {input.fastq_files}'
First of all, your fastq file should be defined as inputs in order for snakemake to know that they are files and that if they are changed, rules must be rerun. It is quite bad practice to define input files as params. params are made for parameters, usually not for files.
Second, your script file is defined as input. You have to be aware that everytime you modify it, rules will be rerun. Maybe that's what you want.
I would use a defined function to get the fastq file in each directory. If you want to use a function (like os.listdir()), you can't use your wildcards directly. You have to inject it in the function as a python object. You can either define a function that will take one argument, a wildcard object containing all your wildcards, or use the lambda keyword (ex: input = lamdba wildcards: myFuntion(wildcards.sample)).
Another problem you have is that os.listdir() returns a list of files without the relative path. Also beware that the order in which os.listdir() will return you fastq file is unknown. Maybe that doesn't matter for your command.

Related

Using snakemake to rename files according to defined mapping

I'm trying to use snakemake to download a list of files, and then rename them according to mapping given in the file. I first read a dictionary from a file that has the form of {ID_for_download : sample_name}, and I pass the list of its keys to first rule for download (because downloading is taxing, I'm just using a dummy script to generate empty files). For every file in the list, two files are downloaded in the form of {file_1.fastq} and {file_2.fastq} When those files are downloaded, I then rename them using the second rule - here I take advantage of being able to run python code in a rule using run key word. When I do a dry-run using -n flag, everything works. But when I do an actual run, I get an error of the form
Job Missing files after 5 seconds [list of files]
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
Removing output files of failed job rename_srafiles_to_samples since they might be corrupted: [list of all files]
What happens is that a directory to store my files is created, and then my files are "downloaded", and then are renamed. Then when it reaches the last file, I get this error and everything is deleted. The snakemake file is below:
import csv
import os
SRA_MAPPING = read_dictionary() #dictionary read from a file
SRAFILES = list(SRA_MAPPING.keys())[1:] #list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES] #list of sample names
rule all:
input:
expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
expand("raw_samples/{samples}_2.fastq",samples=SAMPLES),
rule download_srafiles:
output:
expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
shell:
"bash dummy_download.sh"
rule rename_srafiles_to_samples:
input:
expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
output:
expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
expand("raw_samples/{samples}_2.fastq",samples=SAMPLES)
run:
os.chdir(os.getcwd()+r"/raw_samples")
for file in os.listdir():
old_name=file[:file.find("_")]
sample_name=SRA_MAPPING[old_name]
new_name=file.replace(old_name,sample_name)
os.rename(file,new_name)
I've separately tried to run download_srafiles and it worked. I also separately tried to run rename_srafiles_to_samples and it worked. But when I run those files in conjunction, I get the error. For completeness, the script dummy_download.sh is below:
#!/bin/bash
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
for file in "${samples[#]}"
do
touch raw_samples/${file}_1.fastq
touch raw_samples/${file}_2.fastq
done
(linker.csv is a file in one column has ID_for_download and in other column has sample_name)
What am I doing wrong?
EDIT: Per user dariober, the change of directories via python's os in the rule rename_srafiles_to_samples "confused" snakemake. Snakemake's logic is sound - if I change the directory to enter raw_samples, it tries to find raw_samples in itself and fails. To that extend, I tested different versions.
Version 1
Exactly as dariober explained. Important bits of code:
for file in os.listdir('raw_samples'):
old_name= file[:file.find("_")]
sample_name=SRA_MAPPING[old_name]
new_name= file.replace(old_name,sample_name)
os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
It lists files in "raw_samples" directory, and then renames them. Crucial thing to do is to add prefix of directory (raw_samples/) to each rename.
Version 2
The same as my original post, but instead of leaving working directory, I exit it at the end of the loop. It works.
os.chdir(os.getcwd()+r"/raw_samples")
for file in os.listdir():
old_name= file[:file.find("_")]
sample_name=SRA_MAPPING[old_name]
new_name= file.replace(old_name,sample_name)
os.rename(file,new_name)
os.chdir("..")
Version 3
Same as my original post, but instead of modifying anything in the run segment, I modify the output to exclude file directory. This means that I have to modify my rule all too. It didn't work. Code is below:
rule all:
input:
expand("{samples}_1.fastq",samples=SAMPLES),
expand("{samples}_2.fastq",samples=SAMPLES),
rule download_srafiles:
output:
expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
shell:
"touch {output}"
rule rename_srafiles_to_samples:
input:
expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
output:
expand("{samples}_1.fastq",samples=SAMPLES),
expand("{samples}_2.fastq",samples=SAMPLES)
run:
os.chdir(os.getcwd()+r"/raw_samples")
for file in os.listdir():
old_name= file[:file.find("_")]
sample_name=SRA_MAPPING[old_name]
new_name= file.replace(old_name,sample_name)
os.rename(file,new_name)
The error it gives is:
MissingOutputException in line 24
...
Job files missing
The files are actually there. So I don't know if I made some error in the code or is this some bug.
Conclusion
I wouldn't say that this is a problem with snakemake. It's more of a problem with my poorly thought out process. In retrospect, it makes perfect sense that entering directory messes up output/input process of snakemake. If I want to use os module in snakemake to change directories, I have to be very careful. Enter wherever I need to, but ultimately go back to my original starting place. Many thanks to /u/dariober and /u/SultanOrazbayev

I think snakemake gets confused by os.chdir. Your rule rename_srafiles_to_samples creates the correct files and the input/output naming is fine. However, since you have changed directory snakemake cannot find the expected output. I'm not sure I'm correct in all this and if so if it is a bug... This version avoids os.chdir and seems to work:
import csv
import os
SRA_MAPPING = {'SRR1': 'A', 'SRR2': 'B'}
SRAFILES = list(SRA_MAPPING.keys()) #list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES] #list of sample names
rule all:
input:
expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
expand("raw_samples/{samples}_2.fastq",samples=SAMPLES),
rule download_srafiles:
output:
expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
shell:
"touch {output}"
rule rename_srafiles_to_samples:
input:
expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
output:
expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
expand("raw_samples/{samples}_2.fastq",samples=SAMPLES)
run:
# os.chdir(os.getcwd()+r"/raw_samples")
for file in os.listdir('raw_samples'):
old_name= file[:file.find("_")]
sample_name=SRA_MAPPING[old_name]
new_name= file.replace(old_name,sample_name)
os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
(However, a more snakemake-ish solution would be to have a wildcard for the SRR id and have each rule executed once for each SRR id, basically avoiding expand in download_srafiles and rename_srafiles_to_samples)

How to make rule "all" in Snakefile condition on completion of parallel wildcard rule

I have some TGZ files containing audio SPH samples, which I unpack in snakemake like this:
tgz_files = ["a.tgz", "b.tgz"]
tgz_dirs = ["a", "b"]
rule untar_tgz_files:
input:
tgz_files
output:
directory(tgz_dirs)
shell:
tar -xzvf {input}
I don't know the names of the SPH sample files until after the untar. I then have a rule which translates the SPH files to WAV files, like this:
rule sph_to_wav:
input:
"{root}/{filename}.sph"
output:
"{root}_wav/{filename}.wav"
shell:
sox -t sph {input} -b 16 -t wav {output}
I want my Snakefile to run both of these steps (untar and convert), not knowing in advance the exact names of the SPH files in the TGZ archives. I need something like this to mark the completion of the sph_to_wav rule:
rule sph_to_wav_finished:
input:
"{root}_wav/{filename}.wav"
output:
"sph_to_wav_finished.txt"
and then I want to condition rule all on both of these processes:
rule all:
input:
tgz_dirs, "sph_to_wav_finished.txt"
However, I get the error:
Building DAG of jobs...
MissingInputException in Snakefile:
Missing input files for rule all:
sph_to_wav_finished.txt
How do I write this so that
Snakemake doesn't complain and runs the unpack and sph to wav
Runs the sph to wav after the unpack
?

This sounds like a use-case for a checkpoint. Since rule untar_tgz_files generates files that are not known in advance, you can convert it into a checkpoint:
checkpoint untar_tgz_files:
... # everything defined as in a regular rule
This will tell snakemake that once this checkpoint has been completed, DAG needs to be re-evaluated to take into account of the new files that were created.
Downstream rules will need to find out about the new files, so typically you will do some sort of glob.glob to get the list of new files. This is a rough idea, but you might need to fine tune it:
def list_new_files(wildcards):
output_dir = checkpoints. untar_tgz_files.get(sample=wildcards.sample).output
# you will also want to parse "root" here, skipping it for simplicity
filenames, _ = glob_wildcards(output_dir+"/{filename}.sph")
new_files = expand("{filename}.wav", filename=filenames)
return new_files
Finally, collect all the translated files with:
rule sph_to_wav_finished:
input:
list_new_files,
output:
"sph_to_wav_finished.txt"

Snakemake unknown number of output files / defining wildcards based on intermediate files

to speed up a my workflow I would like to do something similar to snakemake-unknown-output-input-files-after-splitting-by-chromosome
SAMPLE = "sample_a"
rule all:
expand("results/{sample}-final-combined.bam", sample=SAMPLE)
rule exclude_blacklist:
input:
region_file = "data/{sample}.bed",
blacklist = "data/blacklists/universal_blacklist.bed"
output:
"results/{sample}_blacklist-excluded.bed"
shell:
"""
bedtools intersect -v -a {input.region_file} -b {input.blacklist} > {output}
"""
rule target_regions_by_chrom:
input:
region_file ="results/{sample}_blacklist-excluded.bed",
output:
temp("results/{sample}_{chrom}.bed")
script:
"scripts/split_region_by_chrom.py"
rule simulate_reads:
input:
"results/{sample}_{chrom}.bed"
output:
"results/{sample}_{chrom}.bam"
script:
"scripts/simulate_reads.py"
rule combine_regions_again:
input:
files = expand("results/{sample}_{chrom}.bam", sample=SAMPLE, chrom=get_chroms())
output:
"results/{sample}-final-combined.bam"
script:
"scripts/combine_bams.py"
The workflow should take a .bed file with unfiltered regions, exclude problematic regions from a blacklist, split the filtered .bed file by chromosome and then simulate reads for these regions (for simplicity I spare you the details), generating .bam files. Ultimately these .bam files should be combined again to an output file.
The main problem is how to solve the unknown number of chromosomes beforehand. If I take the input .bed file, I get some errors, because some files will be empty. I thought about using an input function (here get_chroms()) to recover the chromosome from the output .bed file ("results/{sample}_blacklist-excluded.bed") after excluding the blacklisted regions, but this leads to errors during the creation of the DAG because the file is not existent.
Does anybody have any suggestions how to solve these problems:
split a .bed file by chromosome, while not knowing the present chromosomes
defining a wildcard based on an intermediate file in the workflow
Any help would be greatly appreciated!

The Dynamic Files feature of Snakemake is probably what you need:
Dynamic files can be used whenever one has a rule for which the number
of output files is unknown before the rule was executed.
You should wrap the "results/{sample}_{chrom}.bed" with dynamic() modifier in both input and output counterparts:
rule target_regions_by_chrom:
output:
dynamic("results/{sample}_{chrom}.bed")
rule simulate_reads:
input:
dynamic("results/{sample}_{chrom}.bed")

Writing a bash or python for loop with paired input files and multiple output files

I'm working on a very common set of commands used to analyze RNA-seq data. However, since this question is not specific to bioinformatics, I've chosen to post here instead of BioStars, etc.
Specifically, I am trimming Illumina Truseq adapters from paired end sequencing data. To do so, I use Trimmomatic 0.36.
I have two input files:
S6D10MajUnt1-1217_S12_R1_001.fastq.gz
S6D10MajUnt1-1217_S12_R2_001.fastq.gz
And the command generates five output files:
S6D10MajUnt1-1217_S12_R1_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R1_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12.trimlog
I'm trying to write a python or bash script to recursively loop over all the contents of a folder and perform the trim command with appropriate files and outputs.
#!/bin/bash
for DR in *.fastq.gz
do
FL1=$(ls ~/home/path/to/files/${DR}*_R1_*.fastq.gz)
FL2=$(ls ~/home/path/to/files/${DR}*_R2_*.fastq.gz)
java -jar ~/data2/RJG_Seq/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 12 -phred33 -trimlog ~/data2/RJG_Seq/trimming/sample_folder/$FL1.trimlog ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL1 ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL2 ~/data2/RJG_Seq/trimming/sample_folder/$FL1.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL1.unpair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.unpair.fq.gz ILLUMINACLIP:/media/RJG_Seq/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
done
I believe there's something wrong with the way I am assigning and invoking FL1 and FL2, and ultimately I'm looking for help creating an excecutable command trim-my-reads.py or trim-my-reads.sh that could be modified to accept any arbitrarily named input R1.fastq.gz and R2.fastq.gz files.

You can write a simple python script to loop over all the files in a folder.
Note : I have assumed that the output files will be generated in a folder named "example"
import glob
for file in glob.glob("*.fastq.gz"):
#here you'll unzip the file to a folder assuming "example"
for files in glob.glob("/example/*"):
#here you can parse all the files inside the output folder

Each pair of samples has a matching string (SN=sample N) A solution to this question in bash could be:
#!/bin/bash
#apply loop function to samples 1-12
for SAMPLE in {1..12}
do
#set input file 1 to "FL1", input file 2 to "FL2"
FL1=$(ls ~path/to/input/files/_S${SAMPLE}_*_R1_*.gz)
FL2=$(ls ~path/to/input/files/_S${SAMPLE}_*_R2_*.gz)
#invoke java ,send FL1 and FL2 to appropriate output folders
java -jar ~/path/to/trimming/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE
-threads 12 -phred33 -trimlog ~/path/to/output/folders/${FL1}.trimlog
~/path/to/input/file1/${FL1} ~/path/to/input/file2/${FL2}
~/path/to/paired/output/folder/${FL1}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL1}.unpair.fq.gz
~/path/to/paired/output/folder/${FL2}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL2}.unpair.fq.gz
ILLUMINACLIP:/path/to/trimming/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
#add verbose option to track progress
echo "Sample ${SAMPLE} done"
done
This is an inelegant solution, because it requires the format I'm using. A better method would be to grep each filename and assign them to FL1, FL2 accordingly, because this would generalize the method. Still, this is what worked for me, and I can easily control which samples are subjected to the for loop, as long as I always have the _S * _ format in the filename strings.

vim complete() with result from lvim

I am trying to create a function that will pop up a list of file includes the word "Module"(case insensitive).
I tried :lvim /Module/gj *.f90 when all *.f90 is in current dir, but I failed to make a globpath() like expand so that I can include and subdirs.
So, I turned to python. From python, I am getting the list perfectly. I am inserting the python code, which will possibly show my goal:
#!/usr/bin/python
import os
import re
flsts = []
path = "/home/rudra/Devel/dream/"
print("All files==>")
for dirs, subdirs, files in os.walk(path):
for tfile in files:
if tfile.endswith('f90'):
print(os.path.splitext(tfile)[0])
text = open(dirs+'/'+tfile).read()
match = re.search("Module", text)
if match:
flsts.append(os.path.splitext(tfile)[0])
print("The list to be used ==>")
print(flsts)
after having the list, I want a
complete(col('.')), flsts)
The problem is, I am unable to include it inside vim function.
May I kindly have some help, so that I can get a list from vim and use it in the complete function?
I have checked this as a possible solution, but unfortunately it is not.
Kindly help.
edit: More explanation
So, say, in my work-dir, i have:
$tree */*.f90
OLD/dirac.f90
OLD/environment.f90
src/constants.f90
src/gencrystal.f90
src/geninp.f90
src/init.f90
among them, only two has word module in it:
$ grep Module */*.f90
OLD/dirac.f90: 10 :module mdirac
src/constants.f90: 2 :module constants
So, I want, with a inoremap, complete() to pop up only constants and dirac.
Hence, Module is the keyword I am searching in the subdirs of present working directory, and only those file matches (dirac and constants in this example) should pop up in complete()

I'm not sure what your exact problem is.
With split(globpath('./**/Module/**', '*.f90'), '\n') you will obtain the list of all files that match *.f90, and which are somewhere within a directory named Module.
Then, using complete() has a few restrictions. It has to be from a function that will be called from insert mode, and that returns an empty string.
By itself, complete() will insert the selected text, if we play with the {starcol} parameter, we can even remove what's before the cursor. This way, you can type Module, hit the key you want and use Module to filter.
function! s:Complete()
" From lh-vim-lib: word_tools.vim
let key = GetCurrentKeyword()
let files = split(glob('./**/*'.key.'*/**', '*.vim'), '\n')
call complete(col('.')-len(key), files )
return ''
endfunction
inoremap µ <c-R>=<sid>Complete()<cr>
However, if you want to trigger an action (instead of inserting text), it becomes much more complex. I did that in muTemplate. I've published the framework used to associate hooks to completion items in lh-vim-lib (See lh#icomplete#*() functions).
EDIT: OK, then, I'll work with let files=split(system("grep --include=*.f90 -Ril module *"), '\n') to obtain the list of files, then call complete(col('.'), files) with that list. That should be the more efficient solution. This is somehow quite similar to Ingo's solution. The difference is that we don't need Python if grep is available.
Regarding Python integration, well it's possible with :py vim.command(). See for instance jira-complete that integrates complete() with a Python script that builds the completion-list: https://github.com/mnpk/vim-jira-complete/blob/master/autoload/jira.vim#L116
Notes:
if "module:" can be pre-searched with ctags, it will to possible to extract your files from tags database with taglist().
It's also possible to fill dynamically the list of files with complete_add(), which is something that would make sense from a python script that tests each file one after the other.

There's an example at :help complete() that you can adapt. If you modify your Python script to output just the (newline-separated) files, you can invoke it via system():
inoremap <F5> <C-R>=FindFiles()<CR>
function! FindFiles()
call complete(col('.'), split(system('python path/to/script.py'), '\n'))
return ''
endfunction

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Snakemake: Is it possible to use directories as wildcards? - python

Related

Using snakemake to rename files according to defined mapping

How to make rule "all" in Snakefile condition on completion of parallel wildcard rule

Snakemake unknown number of output files / defining wildcards based on intermediate files

Writing a bash or python for loop with paired input files and multiple output files

vim complete() with result from lvim

Categories

Resources