Using snakemake to rename files according to defined mapping - python

I'm trying to use snakemake to download a list of files and then rename them according to a mapping given in a file. I first read a dictionary from a file of the form {ID_for_download : sample_name}, and I pass the list of its keys to the first rule for downloading (because downloading is taxing, I'm just using a dummy script to generate empty files). For every ID in the list, two files are downloaded, named {file}_1.fastq and {file}_2.fastq. Once those files are downloaded, I rename them using the second rule - here I take advantage of being able to run Python code in a rule using the run keyword. When I do a dry run using the -n flag, everything works. But when I do an actual run, I get an error of the form:
Job Missing files after 5 seconds [list of files]
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
Removing output files of failed job rename_srafiles_to_samples since they might be corrupted: [list of all files]
What happens is that the directory to store my files is created, my files are "downloaded", and they are renamed. Then, when it reaches the last file, I get this error and everything is deleted. The Snakefile is below:
import csv
import os

SRA_MAPPING = read_dictionary()  # dictionary read from a file
SRAFILES = list(SRA_MAPPING.keys())[1:]  # list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES]  # list of sample names

rule all:
    input:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    shell:
        "bash dummy_download.sh"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES)
    run:
        os.chdir(os.getcwd() + r"/raw_samples")
        for file in os.listdir():
            old_name = file[:file.find("_")]
            sample_name = SRA_MAPPING[old_name]
            new_name = file.replace(old_name, sample_name)
            os.rename(file, new_name)
I've tried running download_srafiles separately and it worked. I also tried running rename_srafiles_to_samples separately and it worked. But when I run those rules in conjunction, I get the error. For completeness, the script dummy_download.sh is below:
#!/bin/bash
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
for file in "${samples[@]}"
do
    touch raw_samples/${file}_1.fastq
    touch raw_samples/${file}_2.fastq
done
(linker.csv is a two-column file with ID_for_download in one column and sample_name in the other)
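For reference, read_dictionary() might look roughly like this (a minimal sketch; note it keeps the header row as a dictionary entry, which is why I drop the first key with [1:] above):

import csv

def read_dictionary(path="linker.csv"):
    # Build {ID_for_download: sample_name} from the two-column CSV.
    # The header row becomes an entry too, hence the [1:] in the Snakefile.
    with open(path, newline="") as fh:
        return {row[0]: row[1] for row in csv.reader(fh)}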
What am I doing wrong?
EDIT: Per user dariober, the change of directories via Python's os module in the rule rename_srafiles_to_samples "confused" snakemake. Snakemake's logic is sound - if I change into raw_samples, it tries to find raw_samples inside itself and fails. To that end, I tested different versions.
Version 1
Exactly as dariober explained. The important bits of code:
for file in os.listdir('raw_samples'):
    old_name = file[:file.find("_")]
    sample_name = SRA_MAPPING[old_name]
    new_name = file.replace(old_name, sample_name)
    os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
It lists the files in the raw_samples directory and then renames them. The crucial thing is to add the directory prefix (raw_samples/) to both arguments of each rename.
Version 2
The same as my original post, except that instead of staying in raw_samples, I return to the original working directory after the loop. It works.
os.chdir(os.getcwd() + r"/raw_samples")
for file in os.listdir():
    old_name = file[:file.find("_")]
    sample_name = SRA_MAPPING[old_name]
    new_name = file.replace(old_name, sample_name)
    os.rename(file, new_name)
os.chdir("..")
Version 3
Same as my original post, but instead of modifying anything in the run segment, I modify the output to exclude the directory prefix. This means I have to modify my rule all too. It didn't work. The code is below:
rule all:
    input:
        expand("{samples}_1.fastq", samples=SAMPLES),
        expand("{samples}_2.fastq", samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    output:
        expand("{samples}_1.fastq", samples=SAMPLES),
        expand("{samples}_2.fastq", samples=SAMPLES)
    run:
        os.chdir(os.getcwd() + r"/raw_samples")
        for file in os.listdir():
            old_name = file[:file.find("_")]
            sample_name = SRA_MAPPING[old_name]
            new_name = file.replace(old_name, sample_name)
            os.rename(file, new_name)
The error it gives is:
MissingOutputException in line 24
...
Job files missing
The files are actually there, so I don't know whether I made some error in the code or whether this is some bug. (In hindsight, since the run block never changes back out of raw_samples, the renamed files end up inside raw_samples rather than at the workflow root where the outputs are now declared.)
Conclusion
I wouldn't say this is a problem with snakemake. It's more a problem with my poorly thought-out process. In retrospect, it makes perfect sense that entering a directory messes up snakemake's input/output handling. If I want to use the os module in snakemake to change directories, I have to be very careful: enter wherever I need to, but ultimately go back to my original starting place. Many thanks to /u/dariober and /u/SultanOrazbayev.
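In that spirit, a context manager makes the enter-and-return pattern harder to get wrong. A minimal sketch (contextlib.chdir needs Python 3.11+; on older versions the same effect can be had with os.chdir in a try/finally):

import contextlib
import os

with contextlib.chdir("raw_samples"):  # Python 3.11+
    for file in os.listdir():
        old_name = file[:file.find("_")]
        new_name = file.replace(old_name, SRA_MAPPING[old_name])
        os.rename(file, new_name)
# back in the original working directory here, even if the loop raised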

I think snakemake gets confused by os.chdir. Your rule rename_srafiles_to_samples creates the correct files, and the input/output naming is fine. However, since you have changed directory, snakemake cannot find the expected output. I'm not sure I'm correct about all this and, if so, whether it is a bug... This version avoids os.chdir and seems to work:
import csv
import os

SRA_MAPPING = {'SRR1': 'A', 'SRR2': 'B'}
SRAFILES = list(SRA_MAPPING.keys())  # list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES]  # list of sample names

rule all:
    input:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES)
    run:
        # os.chdir(os.getcwd() + r"/raw_samples")
        for file in os.listdir('raw_samples'):
            old_name = file[:file.find("_")]
            sample_name = SRA_MAPPING[old_name]
            new_name = file.replace(old_name, sample_name)
            os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
(However, a more snakemake-ish solution would be to have a wildcard for the SRR id and have each rule executed once per SRR id, basically avoiding expand in download_srafiles and rename_srafiles_to_samples; a sketch follows.)
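A minimal sketch of that wildcard-based layout, assuming the same SRA_MAPPING dictionary (rule and directory names here are illustrative; the downloads go to their own downloads/ directory so the two rules' output patterns cannot collide):

# invert the mapping so the rename rule can go from sample name to SRR id
SAMPLE_TO_SRA = {v: k for k, v in SRA_MAPPING.items()}

rule all:
    input:
        expand("raw_samples/{sample}_{read}.fastq",
               sample=SAMPLES, read=[1, 2])

rule download_one_srafile:
    output:
        "downloads/{srafile}_{read}.fastq"
    shell:
        "touch {output}"  # stand-in for the real per-SRR download

rule rename_one_srafile:
    input:
        lambda wc: "downloads/{sra}_{read}.fastq".format(
            sra=SAMPLE_TO_SRA[wc.sample], read=wc.read)
    output:
        "raw_samples/{sample}_{read}.fastq"
    shell:
        "mv {input} {output}"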

Related

How to make rule "all" in Snakefile condition on completion of parallel wildcard rule

I have some TGZ files containing audio SPH samples, which I unpack in snakemake like this:
tgz_files = ["a.tgz", "b.tgz"]
tgz_dirs = ["a", "b"]

rule untar_tgz_files:
    input:
        tgz_files
    output:
        directory(tgz_dirs)
    shell:
        "tar -xzvf {input}"
I don't know the names of the SPH sample files until after the untar. I then have a rule which translates the SPH files to WAV files, like this:
rule sph_to_wav:
    input:
        "{root}/{filename}.sph"
    output:
        "{root}_wav/{filename}.wav"
    shell:
        "sox -t sph {input} -b 16 -t wav {output}"
I want my Snakefile to run both of these steps (untar and convert), not knowing in advance the exact names of the SPH files in the TGZ archives. I need something like this to mark the completion of the sph_to_wav rule:
rule sph_to_wav_finished:
    input:
        "{root}_wav/{filename}.wav"
    output:
        "sph_to_wav_finished.txt"
and then I want to condition rule all on both of these processes:
rule all:
    input:
        tgz_dirs, "sph_to_wav_finished.txt"
However, I get the error:
Building DAG of jobs...
MissingInputException in Snakefile:
Missing input files for rule all:
sph_to_wav_finished.txt
How do I write this so that:
1. Snakemake doesn't complain, and runs both the unpack and the sph-to-wav conversion, and
2. runs the sph-to-wav conversion after the unpack?
This sounds like a use case for a checkpoint. Since rule untar_tgz_files generates files that are not known in advance, you can convert it into a checkpoint:
checkpoint untar_tgz_files:
    ...  # everything defined as in a regular rule
This tells snakemake that once this checkpoint has completed, the DAG needs to be re-evaluated to take into account the new files that were created.
Downstream rules will need to find out about the new files, so typically you will do some sort of globbing (here glob_wildcards) to get the list of new files. This is a rough idea, and you might need to fine-tune it:
def list_new_files(wildcards):
    # .get() triggers DAG re-evaluation; pass wildcard values here if the
    # checkpoint itself has wildcards (this one does not)
    output_dir = checkpoints.untar_tgz_files.get().output[0]
    # you will also want to parse "root" here, skipping it for simplicity
    filenames = glob_wildcards(output_dir + "/{filename}.sph").filename
    new_files = expand("{filename}.wav", filename=filenames)
    return new_files
Finally, collect all the translated files with:
rule sph_to_wav_finished:
    input:
        list_new_files,
    output:
        "sph_to_wav_finished.txt"
    shell:
        "touch {output}"

Snakemake: Is it possible to use directories as wildcards?

Hi, I'm new to Snakemake and have a question. I want to run a tool on multiple data sets. One data set represents one tissue, and for each tissue there are fastq files stored in the respective tissue directory. The rough command for the tool is:
python TEcount.py -rosette rosettefile -TE te_references -count result/tissue/output.csv -RNA <LIST OF FASTQ FILE FOR THE RESPECTIVE SAMPLE>
The tissues shall be the wildcards. How can I do this? Below is a first try that did not work.
import os

#collect data sets
SAMPLES = os.listdir("data/rnaseq/")

rule all:
    input:
        expand("results/{sample}/TEtools.{sample}.output.csv", sample=SAMPLES)

rule run_TEtools:
    input:
        TEcount='scripts/TEtools/TEcount.py',
        rosette='data/prepared_data/rosette/rosette',
        te_references='data/prepared_data/references/all_TE_instances.fa'
    params:
        #collect the fastq files in the tissue directory
        fastq_files = os.listdir("data/rnaseq/{sample}")
    output:
        'results/{sample}/TEtools.{sample}.output.csv'
    shell:
        'python {input.TEcount} -rosette {input.rosette} -TE '
        '{input.te_references} -count {output} -RNA {params.fastq_files}'
In the rule run_TEtools it does not know what the {sample} is.
A snakemake wildcard can be anything. It is basically just a string.
There are some issues with the way you are trying to achieve what you want.
Ok, here's how I would do it. Explanations follow:
import os

# collect data sets
# Beware: no other directories or files (than those containing fastqs) should be in that folder
SAMPLES = os.listdir("data/rnaseq/")

def getFastqFilesForTissu(wildcards):
    fastqs = list()
    # Beware: no other files than fastqs should be there
    for s in os.listdir("data/rnaseq/" + wildcards.sample):
        fastqs.append(os.path.join("data/rnaseq", wildcards.sample, s))
    return fastqs

rule all:
    input:
        expand("results/{sample}/TEtools.{sample}.output.csv", sample=SAMPLES)

rule run_TEtools:
    input:
        TEcount='scripts/TEtools/TEcount.py',
        rosette='data/prepared_data/rosette/rosette',
        te_references='data/prepared_data/references/all_TE_instances.fa',
        fastq_files = getFastqFilesForTissu
    output:
        'results/{sample}/TEtools.{sample}.output.csv'
    shell:
        'python {input.TEcount} -rosette {input.rosette} -TE {input.te_references} -count {output} -RNA {input.fastq_files}'
First of all, your fastq files should be defined as inputs, so that snakemake knows they are files and that rules must be rerun if they change. It is quite bad practice to define input files as params; params are meant for parameters, usually not for files.
Second, your script file is defined as an input. Be aware that every time you modify it, the rule will be rerun. Maybe that's what you want.
I would use a function to get the fastq files in each directory. If you want to use a function (like os.listdir()), you can't use your wildcards directly; you have to inject them into the function as a Python object. You can either define a function that takes one argument, a wildcards object containing all your wildcards, or use the lambda keyword (e.g. input = lambda wildcards: myFunction(wildcards.sample)).
Another problem is that os.listdir() returns a list of file names without the relative path. Also beware that the order in which os.listdir() returns the fastq files is unspecified; maybe that doesn't matter for your command.
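If the order does matter, a sorted variant of the helper above is a small change (same assumed directory layout):

def getFastqFilesForTissu(wildcards):
    tissue_dir = os.path.join("data/rnaseq", wildcards.sample)
    # sorted() gives a deterministic order; paths include the directory
    return sorted(os.path.join(tissue_dir, s) for s in os.listdir(tissue_dir))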

Writing a bash or python for loop with paired input files and multiple output files

I'm working on a very common set of commands used to analyze RNA-seq data. However, since this question is not specific to bioinformatics, I've chosen to post it here instead of on BioStars, etc.
Specifically, I am trimming Illumina Truseq adapters from paired end sequencing data. To do so, I use Trimmomatic 0.36.
I have two input files:
S6D10MajUnt1-1217_S12_R1_001.fastq.gz
S6D10MajUnt1-1217_S12_R2_001.fastq.gz
And the command generates five output files:
S6D10MajUnt1-1217_S12_R1_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R1_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12.trimlog
I'm trying to write a python or bash script to recursively loop over the contents of a folder and run the trim command with the appropriate input files and outputs.
#!/bin/bash
for DR in *.fastq.gz
do
    FL1=$(ls ~/home/path/to/files/${DR}*_R1_*.fastq.gz)
    FL2=$(ls ~/home/path/to/files/${DR}*_R2_*.fastq.gz)
    java -jar ~/data2/RJG_Seq/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 12 -phred33 -trimlog ~/data2/RJG_Seq/trimming/sample_folder/$FL1.trimlog ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL1 ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL2 ~/data2/RJG_Seq/trimming/sample_folder/$FL1.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL1.unpair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.unpair.fq.gz ILLUMINACLIP:/media/RJG_Seq/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
done
I believe there's something wrong with the way I am assigning and invoking FL1 and FL2. Ultimately I'm looking for help creating an executable command trim-my-reads.py or trim-my-reads.sh that could be modified to accept any arbitrarily named input R1.fastq.gz and R2.fastq.gz files.
You can write a simple python script to loop over all the files in a folder.
Note: I have assumed that the output files will be generated in a folder named "example".
import glob

for file in glob.glob("*.fastq.gz"):
    # here you'll unzip the file to a folder, assuming "example"
    for files in glob.glob("/example/*"):
        # here you can parse all the files inside the output folder
        pass
Each pair of samples has a matching string (SN = sample N). A solution to this question in bash could be:
#!/bin/bash
#apply loop function to samples 1-12
for SAMPLE in {1..12}
do
    #set input file 1 to "FL1", input file 2 to "FL2"
    FL1=$(ls ~/path/to/input/files/*_S${SAMPLE}_*_R1_*.gz)
    FL2=$(ls ~/path/to/input/files/*_S${SAMPLE}_*_R2_*.gz)
    #invoke java, send FL1 and FL2 to appropriate output folders
    java -jar ~/path/to/trimming/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
        -threads 12 -phred33 -trimlog ~/path/to/output/folders/${FL1}.trimlog \
        ~/path/to/input/file1/${FL1} ~/path/to/input/file2/${FL2} \
        ~/path/to/paired/output/folder/${FL1}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL1}.unpair.fq.gz \
        ~/path/to/paired/output/folder/${FL2}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL2}.unpair.fq.gz \
        ILLUMINACLIP:/path/to/trimming/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
    #add verbose option to track progress
    echo "Sample ${SAMPLE} done"
done
This is an inelegant solution, because it depends on the format I'm using. A better method would be to grep each filename and assign them to FL1 and FL2 accordingly, which would generalize the method. Still, this is what worked for me, and I can easily control which samples go through the for loop, as long as I always have the _S*_ pattern in the filename strings.
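For the general case, a Python sketch of that filename-based pairing (a hypothetical trim-my-reads.py; paths and the jar location are placeholders, and it assumes each R2 name differs from its R1 name only by the _R1_/_R2_ tag):

#!/usr/bin/env python3
import glob
import subprocess

# pair each R1 file with its R2 counterpart by name substitution
for r1 in sorted(glob.glob("*_R1_*.fastq.gz")):
    r2 = r1.replace("_R1_", "_R2_")
    prefix = r1.split("_R1_")[0]
    subprocess.run(
        ["java", "-jar", "trimmomatic-0.36.jar", "PE",
         "-threads", "12", "-phred33",
         "-trimlog", prefix + ".trimlog",
         r1, r2,
         prefix + "_R1.paired.fq.gz", prefix + "_R1.unpaired.fq.gz",
         prefix + "_R2.paired.fq.gz", prefix + "_R2.unpaired.fq.gz",
         "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10", "LEADING:5", "TRAILING:5",
         "SLIDINGWINDOW:4:15", "MINLEN:28"],
        check=True,
    )
    print(f"{prefix} done")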

HTCondor output files: obtain created directory

I am using HTCondor to generate some data (txt, png). When my program runs, it creates a directory next to the .sub file, named Datasets, where the datasets are stored. Unfortunately, condor does not give me back this created data when finished. In other words, my goal is to get the created data back in a "Datasets" subfolder next to the .sub file.
I tried:
1) Not putting the data under the Datasets subfolder, and I obtained the files as expected. However, this is not a clean solution, since I generate around 100 files which then get mixed up with the .sub file and everything else.
2) Setting this up in the sub file, leading to this:
notification = Always
should_transfer_files = YES
RunAsOwner = True
When_To_Transfer_Output = ON_EXIT_OR_EVICT
getenv = True
transfer_input_files = main.py
transfer_output_files = Datasets
universe = vanilla
log = log/test-$(Cluster).log
error = log/test-$(Cluster)-$(Process).err
output = log/test-$(Cluster)-$(Process).log
executable = Simulation.bat
queue
This time I get the error that Datasets was not found. Spelling was checked already.
3) Another option would be to pack everything into a zip, but since I have to run hundreds of jobs, I do not want to unpack all those files afterwards.
I hope somebody comes up with a good idea on how to solve this.
Just for the record here: HTCondor does not transfer created directories or their contents at the end of the run. The best way to get the contents back is to write a wrapper script that runs your executable and then compresses the created directory at the root of the working directory. This file will be transferred with all the other files. For example, create run.exe:
#!/bin/bash
./Simulation.bat
tar zcf Datasets.tar.gz Datasets
and in your condor submission script put:
executable = run.exe
However, if you do not want to do this, and if HTCondor is using a common shared space like AFS, you can simply copy the whole directory out:
./Simulation.bat
cp -r Datasets <AFS location>
The other alternative is to define an initialdir, as described at the end of: https://research.cs.wisc.edu/htcondor/manual/quickstart.html
But one must create the directory structure by hand.
Also, look around page 65 of: https://indico.cern.ch/event/611296/contributions/2604376/attachments/1471164/2276521/TannenbaumT_UserTutorial.pdf
This document is, in general, a very useful one for beginners.

how to output multiple files from a set of different input files for a python script in bash

I already have my python script producing my desired output file by passing 5 different input files to it. Every input file is in a different folder, and in each folder there are more files, all of which start with "chr" and end with the extension ".vcf.gz".
So, the command I execute to produce one output is:
python myscript.py /folder1/chrX.vcf.gz /folder2/chrX.vcf.gz /folder3/chrX.vcf.gz /folder4/chrX.vcf.gz /folder5/chrX.vcf.gz > /myNewFolderForOutputs/chrXoutput.txt
Now I would like a single command that does the same for the other input files contained in the same folders, say "chrY.vcf.gz" and "chrZ.vcf.gz", while producing one output file for every set of input files, named "chrYoutput.txt" and "chrZoutput.txt".
Is that possible? Should I change my strategy maybe?
Thanks a lot for any suggestion or hint!
If your folder structure follows the pattern you described in your sample, then this will work:
for i in X Y Z; do
    python myscript.py /folder[1-5]/chr$i.vcf.gz > /myNewFolderForOutputs/chr${i}output.txt
done
Not 100% sure if this is what you asked.
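If the set of chromosomes is not known in advance, a small Python driver could discover them and build the same commands (a sketch; folder and output paths are taken from the question, and it assumes every folder holds the same chr*.vcf.gz names as folder1):

import glob
import os
import subprocess

# discover chromosome names from the first folder (e.g. chrX.vcf.gz -> X)
chroms = [
    os.path.basename(p)[3:-len(".vcf.gz")]
    for p in sorted(glob.glob("/folder1/chr*.vcf.gz"))
]

for c in chroms:
    inputs = [f"/folder{n}/chr{c}.vcf.gz" for n in range(1, 6)]
    with open(f"/myNewFolderForOutputs/chr{c}output.txt", "w") as out:
        subprocess.run(["python", "myscript.py", *inputs], stdout=out, check=True)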
