I have a pipeline written in shell that loops over three folders, with an inner loop over the files inside each folder.
For the next step, I have a snakemake file that takes an input folder and an output folder. For a trial run I hard-coded the folder paths inside the snakemake file.
So I was wondering: is there any way I can pass the input and output folder paths explicitly on the command line?
For e.g.
snakemake --cores 30 -s trial.snakemake /path/to/input /path/to/output
I want to change the input and output according to the main loop.
I tried import sys and using sys.argv[1] and sys.argv[2] inside the snakemake file, but it's not working.
Below is a snippet of my pipeline. It takes three folders for now: ABC_Samples, DEF_Samples, XYZ_Samples
for folder in /path/to/*_Samples
do
    folderName=$(basename "$folder" _Samples)
    mkdir -p "/path/to/output/$folderName"
    for files in "$folder"/*.gz
    do
        : # do something
    done
    snakemake --cores 30 -s trial.snakemake "/path/to/output/$folderName" /path/to/output2/
done
But the above doesn't work. Is there any way I can do this? I am really new to snakemake.
Thank you in advance.
An efficient way could be to incorporate the folder structure explicitly inside your Snakefile. For example, you could read parameters such as example_path_in and example_path_out from the config dictionary inside the Snakefile and then pass their values via --config:
snakemake --config example_path_in=/path/to/input example_path_out=/path/to/output
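As a rough sketch of the calling side, the outer loop could assemble the --config arguments for each folder. Inside the Snakefile the values would then be read from Snakemake's config dictionary, e.g. config["example_path_in"]. The helper name build_snakemake_cmd and the key names are illustrative assumptions; the Snakefile name and core count are taken from the question.

```python
import shlex


def build_snakemake_cmd(snakefile, input_dir, output_dir, cores=30):
    """Build a snakemake command that passes both folders via --config.

    The keys example_path_in / example_path_out are assumptions; they must
    match whatever keys the Snakefile reads from its `config` dictionary.
    """
    return [
        "snakemake",
        "--cores", str(cores),
        "-s", snakefile,
        "--config",
        f"example_path_in={input_dir}",
        f"example_path_out={output_dir}",
    ]


if __name__ == "__main__":
    cmd = build_snakemake_cmd(
        "trial.snakemake", "/path/to/output/ABC", "/path/to/output2"
    )
    # Print a shell-safe version of the command for inspection.
    # To actually run it: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Building the command as a list avoids quoting problems if a folder name ever contains spaces.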
I'm trying to use snakemake to download a list of files and then rename them according to a mapping given in a file. I first read a dictionary of the form {ID_for_download : sample_name} from a file, and I pass the list of its keys to the first rule for download (because downloading is taxing, I'm just using a dummy script to generate empty files). For every ID in the list, two files are downloaded, of the form {file_1.fastq} and {file_2.fastq}. When those files are downloaded, I rename them using the second rule; here I take advantage of being able to run Python code in a rule using the run keyword. When I do a dry run using the -n flag, everything works. But when I do an actual run, I get an error of the form
Job Missing files after 5 seconds [list of files]
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
Removing output files of failed job rename_srafiles_to_samples since they might be corrupted: [list of all files]
What happens is that a directory to store my files is created, and then my files are "downloaded", and then are renamed. Then when it reaches the last file, I get this error and everything is deleted. The snakemake file is below:
import csv
import os

SRA_MAPPING = read_dictionary()  # dictionary read from a file
SRAFILES = list(SRA_MAPPING.keys())[1:]  # list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES]  # list of sample names

rule all:
    input:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    shell:
        "bash dummy_download.sh"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES)
    run:
        os.chdir(os.getcwd() + r"/raw_samples")
        for file in os.listdir():
            old_name = file[:file.find("_")]
            sample_name = SRA_MAPPING[old_name]
            new_name = file.replace(old_name, sample_name)
            os.rename(file, new_name)
I've separately tried to run download_srafiles and it worked. I also separately tried to run rename_srafiles_to_samples and it worked. But when I run those files in conjunction, I get the error. For completeness, the script dummy_download.sh is below:
#!/bin/bash
# read the first column of linker.csv (skipping the header) into an array
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
for file in "${samples[@]}"
do
    touch "raw_samples/${file}_1.fastq"
    touch "raw_samples/${file}_2.fastq"
done
(linker.csv is a file in which one column has ID_for_download and the other has sample_name)
What am I doing wrong?
EDIT: Per user dariober, the change of directories via Python's os module in the rule rename_srafiles_to_samples "confused" snakemake. Snakemake's logic is sound: once I change directory into raw_samples, it looks for raw_samples inside raw_samples and fails. To that end, I tested different versions.
Version 1
Exactly as dariober explained. Important bits of code:
for file in os.listdir('raw_samples'):
    old_name = file[:file.find("_")]
    sample_name = SRA_MAPPING[old_name]
    new_name = file.replace(old_name, sample_name)
    os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
It lists the files in the raw_samples directory and then renames them. The crucial step is to add the directory prefix (raw_samples/) to each path passed to os.rename.
Version 2
The same as my original post, but instead of staying in raw_samples, I change back to the original working directory after the loop. It works.
os.chdir(os.getcwd() + r"/raw_samples")
for file in os.listdir():
    old_name = file[:file.find("_")]
    sample_name = SRA_MAPPING[old_name]
    new_name = file.replace(old_name, sample_name)
    os.rename(file, new_name)
os.chdir("..")
Version 3
Same as my original post, but instead of modifying anything in the run segment, I modify the output to exclude file directory. This means that I have to modify my rule all too. It didn't work. Code is below:
rule all:
    input:
        expand("{samples}_1.fastq", samples=SAMPLES),
        expand("{samples}_2.fastq", samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    output:
        expand("{samples}_1.fastq", samples=SAMPLES),
        expand("{samples}_2.fastq", samples=SAMPLES)
    run:
        os.chdir(os.getcwd() + r"/raw_samples")
        for file in os.listdir():
            old_name = file[:file.find("_")]
            sample_name = SRA_MAPPING[old_name]
            new_name = file.replace(old_name, sample_name)
            os.rename(file, new_name)
The error it gives is:
MissingOutputException in line 24
...
Job files missing
The files are actually there. So I don't know whether I made an error in the code or whether this is a bug.
Conclusion
I wouldn't say that this is a problem with snakemake. It's more of a problem with my poorly thought out process. In retrospect, it makes perfect sense that entering directory messes up output/input process of snakemake. If I want to use os module in snakemake to change directories, I have to be very careful. Enter wherever I need to, but ultimately go back to my original starting place. Many thanks to /u/dariober and /u/SultanOrazbayev
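The "enter, but always go back" idea can be sketched as a small context manager. This is only an illustration of that conclusion, not something from the thread; the name working_directory is made up here. (Python 3.11+ also ships contextlib.chdir, which does much the same thing.)

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Temporarily change into `path`, always returning to the start directory."""
    start = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the caller never stays in `path`.
        os.chdir(start)
```

Inside a run block this would be used as "with working_directory("raw_samples"): ...", so the working directory is restored before snakemake checks the output files.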
I think snakemake gets confused by os.chdir. Your rule rename_srafiles_to_samples creates the correct files and the input/output naming is fine. However, since you have changed directory snakemake cannot find the expected output. I'm not sure I'm correct in all this and if so if it is a bug... This version avoids os.chdir and seems to work:
import csv
import os

SRA_MAPPING = {'SRR1': 'A', 'SRR2': 'B'}
SRAFILES = list(SRA_MAPPING.keys())  # list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES]  # list of sample names

rule all:
    input:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES)
    run:
        # os.chdir(os.getcwd() + r"/raw_samples")
        for file in os.listdir('raw_samples'):
            old_name = file[:file.find("_")]
            sample_name = SRA_MAPPING[old_name]
            new_name = file.replace(old_name, sample_name)
            os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
(However, a more snakemake-ish solution would be to have a wildcard for the SRR id and have each rule executed once for each SRR id, basically avoiding expand in download_srafiles and rename_srafiles_to_samples)
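To give a feel for the per-ID idea in plain Python: the work done for a single SRR id could look roughly like the function below, which builds the old and new paths from the mapping instead of listing the directory. This is a sketch under assumptions, not the wildcard rule itself; the function name rename_sra_pair is hypothetical, and the raw_samples default and _1/_2 suffixes follow the question.

```python
import os


def rename_sra_pair(sra_id, sample_name, directory="raw_samples"):
    """Rename <sra_id>_1.fastq and <sra_id>_2.fastq to use the sample name.

    Paths are built explicitly from the ID, so no os.chdir or os.listdir is
    needed. Returns the list of new paths.
    """
    new_paths = []
    for mate in ("1", "2"):
        old = os.path.join(directory, f"{sra_id}_{mate}.fastq")
        new = os.path.join(directory, f"{sample_name}_{mate}.fastq")
        os.rename(old, new)
        new_paths.append(new)
    return new_paths
```

Because each call touches only the two files belonging to one ID, files that are already renamed (or unrelated files in the folder) are never looked up in the mapping.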
I'm working on a very common set of commands used to analyze RNA-seq data. However, since this question is not specific to bioinformatics, I've chosen to post here instead of BioStars, etc.
Specifically, I am trimming Illumina Truseq adapters from paired end sequencing data. To do so, I use Trimmomatic 0.36.
I have two input files:
S6D10MajUnt1-1217_S12_R1_001.fastq.gz
S6D10MajUnt1-1217_S12_R2_001.fastq.gz
And the command generates five output files:
S6D10MajUnt1-1217_S12_R1_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R1_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12.trimlog
I'm trying to write a python or bash script to recursively loop over all the contents of a folder and perform the trim command with appropriate files and outputs.
#!/bin/bash
for DR in *.fastq.gz
do
    FL1=$(ls ~/home/path/to/files/${DR}*_R1_*.fastq.gz)
    FL2=$(ls ~/home/path/to/files/${DR}*_R2_*.fastq.gz)
    java -jar ~/data2/RJG_Seq/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
        -threads 12 -phred33 -trimlog ~/data2/RJG_Seq/trimming/sample_folder/$FL1.trimlog \
        ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL1 ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL2 \
        ~/data2/RJG_Seq/trimming/sample_folder/$FL1.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL1.unpair.fq.gz \
        ~/data2/RJG_Seq/trimming/sample_folder/$FL2.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.unpair.fq.gz \
        ILLUMINACLIP:/media/RJG_Seq/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
done
I believe there's something wrong with the way I am assigning and invoking FL1 and FL2, and ultimately I'm looking for help creating an executable command trim-my-reads.py or trim-my-reads.sh that could be modified to accept any arbitrarily named input R1.fastq.gz and R2.fastq.gz files.
You can write a simple python script to loop over all the files in a folder.
Note : I have assumed that the output files will be generated in a folder named "example"
import glob

for file in glob.glob("*.fastq.gz"):
    # here you'll unzip the file to a folder, assumed to be "example"
    for files in glob.glob("/example/*"):
        # here you can parse all the files inside the output folder
        pass
Each pair of samples has a matching string (SN = sample N). A solution to this question in bash could be:
#!/bin/bash
# apply the loop to samples 1-12
for SAMPLE in {1..12}
do
    # set input file 1 to "FL1", input file 2 to "FL2"
    FL1=$(ls ~/path/to/input/files/_S${SAMPLE}_*_R1_*.gz)
    FL2=$(ls ~/path/to/input/files/_S${SAMPLE}_*_R2_*.gz)
    # invoke java, send FL1 and FL2 to appropriate output folders
    java -jar ~/path/to/trimming/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
        -threads 12 -phred33 -trimlog ~/path/to/output/folders/${FL1}.trimlog \
        ~/path/to/input/file1/${FL1} ~/path/to/input/file2/${FL2} \
        ~/path/to/paired/output/folder/${FL1}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL1}.unpair.fq.gz \
        ~/path/to/paired/output/folder/${FL2}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL2}.unpair.fq.gz \
        ILLUMINACLIP:/path/to/trimming/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
    # print a message to track progress
    echo "Sample ${SAMPLE} done"
done
This is an inelegant solution, because it requires the format I'm using. A better method would be to grep each filename and assign them to FL1, FL2 accordingly, because this would generalize the method. Still, this is what worked for me, and I can easily control which samples are subjected to the for loop, as long as I always have the _S * _ format in the filename strings.
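A more general pairing, along the lines suggested above, could be sketched in Python by deriving each R2 name from its R1 name rather than relying on the sample number. The function name pair_reads and its default tags are assumptions for illustration; the _R1_/_R2_ convention and the .fastq.gz extension come from the filenames in the question.

```python
import glob
import os


def pair_reads(input_dir, r1_tag="_R1_", r2_tag="_R2_", pattern="*.fastq.gz"):
    """Return sorted (R1, R2) path pairs found in input_dir.

    An R1 file is paired with the file whose name is identical except that
    the first r1_tag is replaced by r2_tag. R1 files without a mate are skipped.
    """
    pairs = []
    for r1 in sorted(glob.glob(os.path.join(input_dir, pattern))):
        name = os.path.basename(r1)
        if r1_tag not in name:
            continue  # not an R1 file (e.g. it is the R2 mate)
        r2 = os.path.join(input_dir, name.replace(r1_tag, r2_tag, 1))
        if os.path.exists(r2):
            pairs.append((r1, r2))
    return pairs
```

Each returned pair could then be handed to the Trimmomatic command, with output names built from os.path.basename of the R1 and R2 paths.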
I have a directory containing multiple subdirectories, all of which contain a file named sample.fas. I want to run a Python script (script.py) on each sample.fas file in the subdirectories, and export the output(s) under the name of the corresponding subdirectory.
However, the script needs the user to give the path/name of the input, and it does not create the outputs automatically (their path/name must be specified too). Like this:
script.py sample_1.fas output_1a.nex output_1b.fas
I tried using these lines, without success:
while find . -name '*.fas'; # find the *fas files
do python script.py $*.fas > /path/output_1a output_1b; # run the script and export the two outputs
done
So, I want to create a bash script that reads each sample.fas from all subdirectories (running the script recursively) and exports the outputs with the names of their subdirectories.
I would appreciate any help.
One quick way of doing this would be something like:
for x in $(find . -type f -name '*.fas'); do
    /usr/bin/python /my/full/path/to/script.py ${x} > /my/path/$(basename $(dirname ${x}))
done
This is running the script against all .fas files identified in the current directory (subdirectories included) and then redirects whatever the python script is outputting to a file named like the directory in which the currently processed .fas file was located. This file is created in /my/path/.
There is an assumption here (well, a few), and that is that all the directories which contain .fas files have unique names. Also, the paths are supposed not to have any spaces in them, this can be fixed with proper quoting. Another assumption is that the script is always outputting valid data (this just redirects all output from the script to that file). However, this should hopefully get you going in the right direction.
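If spaces in paths are a concern, one way around the quoting issue is to build the commands in Python, where each path stays a single argument. The sketch below assumes, as in the question, that script.py takes the input followed by two output paths; the function name build_jobs and the .nex/.fas output naming are illustrative assumptions.

```python
import sys
from pathlib import Path


def build_jobs(root, out_dir, script="script.py", pattern="*.fas"):
    """Return one command (as an argument list) per matching file under root.

    Both outputs are named after the directory that contains the input file,
    so directory names are assumed to be unique.
    """
    jobs = []
    for fas in sorted(Path(root).rglob(pattern)):
        name = fas.parent.name
        jobs.append([
            sys.executable, script, str(fas),
            str(Path(out_dir) / f"{name}.nex"),
            str(Path(out_dir) / f"{name}.fas"),
        ])
    return jobs
```

Each entry can be passed to subprocess.run(job, check=True). Keeping out_dir outside root avoids picking up earlier .fas outputs as inputs on a later run.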
But I get the feeling that I didn't properly understand your question. If this is the case, could you rephrase and maybe provide a tree showing how the directories and sub-directories are structured like?
I already have my python script producing my desired outputfile by passing 5 different inputfiles to it. Every inputfile is in a different folder, and in each folder there are more files which all of them start by "chr" and finish by the extension ".vcf.gz"
So, the command that I execute to produce one output is:
python myscript.py /folder1/chrX.vcf.gz /folder2/chrX.vcf.gz /folder3/chrX.vcf.gz /folder4/chrX.vcf.gz /folder5/chrX.vcf.gz > /myNewFolderForOutputs/chrXoutput.txt
Now what I would like to obtain is a single command to do the same for the other inputfiles contained in the same folders, let's say "chrY.vcf.gz" and "chrZ.vcf.gz", and at the same time, producing one outfile for every set of my inputfiles, named "chrYoutput.txt" and "chrZoutput.txt"
Is that possible? Should I change my strategy maybe?
Thanks a lot for any suggestion or hint!
If your folder structure follows the pattern you described in your sample, then this will work:
for i in X Y Z; do
    python myscript.py /folder[1-5]/chr$i.vcf.gz > /myNewFolderForOutputs/chr${i}output.txt
done
Not 100% sure if this is what you asked.
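For comparison, the same loop could be sketched in Python, with the folder order spelled out instead of left to the shell glob. The function name build_chr_command is made up for illustration; the folder names, script name and output naming are the ones from the question.

```python
def build_chr_command(chrom, folders, out_dir="/myNewFolderForOutputs",
                      script="myscript.py"):
    """Return (command, output_path) for one chromosome.

    The command lists chr<chrom>.vcf.gz from every folder, in the order given.
    """
    inputs = [f"{folder}/chr{chrom}.vcf.gz" for folder in folders]
    output = f"{out_dir}/chr{chrom}output.txt"
    return ["python", script, *inputs], output


if __name__ == "__main__":
    folders = [f"/folder{i}" for i in range(1, 6)]
    for chrom in ("X", "Y", "Z"):
        cmd, output = build_chr_command(chrom, folders)
        # To execute: open `output` for writing and pass it as stdout to
        # subprocess.run(cmd, stdout=..., check=True)
        print(" ".join(cmd), ">", output)
```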
I have a python script that runs on three files in the following way
align.py *.wav *.txt *.TextGrid
However, I have a directory full of files that I want to loop through. The original author suggests creating a shell script to loop through the files.
The tricky part about the loop is that I need to match three files at a time with three different extensions for the script to run correctly.
Can anyone help me figure out how to create a shell script to loop through a directory of files, match three of them according to name (with three different extensions) and run the python script on each triplet?
Thanks!
Assuming you're using bash, here is a one-liner:
for f in *.wav; do align.py "$f" "${f%.*}.txt" "${f%.*}.TextGrid"; done
You could use glob.glob to list only the wav files, then construct the subprocess.Popen call like so:
import glob
import os
import subprocess

for wav_name in glob.glob('*.wav'):
    basename, ext = os.path.splitext(wav_name)
    txt_name = basename + '.txt'
    grid_name = basename + '.TextGrid'
    proc = subprocess.Popen(['align.py', wav_name, txt_name, grid_name])
    proc.communicate()