I have a directory containing multiple subdirectories, all of which contain a file named sample.fas. I want to run a python script (script.py) on each sample.fas file in the subdirectories, and export the output(s) named after each subdirectory.
However, the script needs the user to indicate the path/name of the input, and it does not create the outputs automatically (it's necessary to specify their paths/names), like this:
script.py sample_1.fas output_1a.nex output_1b.fas
I tried using these lines, without success:
while find . -name '*.fas'; # find the *fas files
do python script.py $*.fas > /path/output_1a output_1b; # run the script and export the two outputs
done
So, I want to create a bash script that reads each sample.fas from all subdirectories (running the script recursively) and exports the outputs under the names of their subdirectories.
I would appreciate any help.
One quick way of doing this would be something like:
for x in $(find . -type f -name '*.fas'); do
    /usr/bin/python /my/full/path/to/script.py "${x}" > /my/path/$(basename "$(dirname "${x}")")
done
This runs the script against all .fas files found in the current directory (subdirectories included) and redirects whatever the python script outputs to a file named after the directory in which the currently processed .fas file is located. That file is created in /my/path/.
There are a few assumptions here. One is that all the directories which contain .fas files have unique names. Another is that the paths contain no spaces; the command substitution above splits on whitespace, so paths with spaces need proper quoting and a different loop (see the sketch below). A final assumption is that the script always writes valid data to stdout (this simply redirects all of the script's output to that file). However, this should hopefully get you going in the right direction.
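For reference, a space-safe variant of the same loop (just a sketch, keeping the same unique-directory-name assumption) pipes find into a while read loop instead of relying on word splitting:

find . -type f -name '*.fas' -print0 | while IFS= read -r -d '' x; do
    /usr/bin/python /my/full/path/to/script.py "$x" > "/my/path/$(basename "$(dirname "$x")")"
done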
But I get the feeling that I didn't properly understand your question. If that is the case, could you rephrase it and maybe provide a tree showing how the directories and subdirectories are structured?
Related
So I have a pipeline written in shell which loops over three folders, with an inner loop that iterates over the files inside each folder.
For the next step, I have a Snakemake file which takes an input folder and an output folder. For a trial run I hard-coded the folder paths inside the snakemake file.
So I was wondering, is there any way I can give the input and output folder paths explicitly?
For example:
snakemake --cores 30 -s trial.snakemake /path/to/input /path/to/output
Since I want to change the input and output according to the main loop.
I tried importing sys and using sys.argv[1] and sys.argv[2] inside the snakemake file, but it's not working.
Below is a snippet of my pipeline; for now it takes three folders: ABC_Samples, DEF_Samples, and XYZ_Samples.
for folder in /path/to/*_Samples
do
    folderName=$(basename $folder _Samples)
    mkdir -p /path/to/output/$folderName
    for files in $folder/*.gz
    do
        : # do something with $files
    done
    snakemake --cores 30 -s trial.snakemake /path/to/output/$folderName /path/to/output2/
done
But the above doesn't work. So is there any way I can do this? I am really new to snakemake.
Thank you in advance.
An efficient way could be to incorporate the folder structure explicitly inside your Snakefile. For example, you could use the content of a parameter, e.g. example_path, inside the Snakefile and then pass it via --config:
snakemake --config example_path_in=/path/to/input example_path_out=/path/to/output
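Inside the Snakefile, those values are then available through Snakemake's built-in config dictionary. A minimal sketch (the rule names and the sample.txt/done.txt file patterns are illustrative placeholders, not from the original pipeline):

# trial.snakemake: paths arrive via --config instead of sys.argv
IN_DIR = config["example_path_in"]
OUT_DIR = config["example_path_out"]

rule all:
    input:
        OUT_DIR + "/done.txt"

rule process:
    input:
        IN_DIR + "/sample.txt"
    output:
        OUT_DIR + "/done.txt"
    shell:
        "cp {input} {output}"

From the shell loop this would be invoked as, e.g., snakemake --cores 30 -s trial.snakemake --config example_path_in=/path/to/output/$folderName example_path_out=/path/to/output2/.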
I'm working on a very common set of commands used to analyze RNA-seq data. However, since this question is not specific to bioinformatics, I've chosen to post here instead of BioStars, etc.
Specifically, I am trimming Illumina Truseq adapters from paired end sequencing data. To do so, I use Trimmomatic 0.36.
I have two input files:
S6D10MajUnt1-1217_S12_R1_001.fastq.gz
S6D10MajUnt1-1217_S12_R2_001.fastq.gz
And the command generates five output files:
S6D10MajUnt1-1217_S12_R1_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R1_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12.trimlog
I'm trying to write a python or bash script to recursively loop over all the contents of a folder and perform the trim command with appropriate files and outputs.
#!/bin/bash
for DR in *.fastq.gz
do
    FL1=$(ls ~/home/path/to/files/${DR}*_R1_*.fastq.gz)
    FL2=$(ls ~/home/path/to/files/${DR}*_R2_*.fastq.gz)
    java -jar ~/data2/RJG_Seq/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 12 -phred33 \
        -trimlog ~/data2/RJG_Seq/trimming/sample_folder/$FL1.trimlog \
        ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL1 ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL2 \
        ~/data2/RJG_Seq/trimming/sample_folder/$FL1.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL1.unpair.fq.gz \
        ~/data2/RJG_Seq/trimming/sample_folder/$FL2.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.unpair.fq.gz \
        ILLUMINACLIP:/media/RJG_Seq/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
done
I believe there's something wrong with the way I am assigning and invoking FL1 and FL2, and ultimately I'm looking for help creating an executable command trim-my-reads.py or trim-my-reads.sh that could be modified to accept any arbitrarily named R1.fastq.gz and R2.fastq.gz input files.
You can write a simple python script to loop over all the files in a folder.
Note: I have assumed that the output files will be generated in a folder named "example".
import glob

for fastq in glob.glob("*.fastq.gz"):
    # here you'll unzip the file to a folder, assumed to be named "example"
    ...
    for output_file in glob.glob("example/*"):
        # here you can parse all the files inside the output folder
        ...
Each pair of samples has a matching string (SN = sample N). A solution to this question in bash could be:
#!/bin/bash
# apply the loop to samples 1-12
for SAMPLE in {1..12}
do
    # set input file 1 to "FL1", input file 2 to "FL2"
    FL1=$(ls ~/path/to/input/files/*_S${SAMPLE}_*_R1_*.gz)
    FL2=$(ls ~/path/to/input/files/*_S${SAMPLE}_*_R2_*.gz)
    # invoke java, sending FL1 and FL2 to the appropriate output folders
    java -jar ~/path/to/trimming/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
        -threads 12 -phred33 -trimlog ~/path/to/output/folders/${FL1}.trimlog \
        ~/path/to/input/file1/${FL1} ~/path/to/input/file2/${FL2} \
        ~/path/to/paired/output/folder/${FL1}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL1}.unpair.fq.gz \
        ~/path/to/paired/output/folder/${FL2}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL2}.unpair.fq.gz \
        ILLUMINACLIP:/path/to/trimming/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
    # echo progress to track each sample
    echo "Sample ${SAMPLE} done"
done
This is an inelegant solution, because it depends on the naming format I'm using. A better method would be to parse each filename and assign FL1 and FL2 accordingly, which would generalize the method. Still, this is what worked for me, and I can easily control which samples are subjected to the for loop, as long as I always have the _S*_ format in the filename strings.
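To illustrate that more general approach, here is a sketch that pairs each R1 file with its R2 mate by string substitution, with no dependence on the _S*_ numbering (the paths are placeholders and the echo stands in for the full Trimmomatic command):

#!/bin/bash
for FL1 in /path/to/input/files/*_R1_*.fastq.gz
do
    FL2=${FL1/_R1_/_R2_}                  # the matching R2 file
    BASE=$(basename "$FL1" .fastq.gz)
    # substitute the real trimming command here
    echo "would trim $FL1 and $FL2 into ${BASE}.pair.fq.gz etc."
done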
I already have my python script producing my desired output file when I pass 5 different input files to it. Every input file is in a different folder, and each folder contains more files, all of which start with "chr" and end with the extension ".vcf.gz".
So, the command that I execute to produce one output is:
python myscript.py /folder1/chrX.vcf.gz /folder2/chrX.vcf.gz /folder3/chrX.vcf.gz /folder4/chrX.vcf.gz /folder5/chrX.vcf.gz > /myNewFolderForOutputs/chrXoutput.txt
Now what I would like to obtain is a single command that does the same for the other input files contained in the same folders, say "chrY.vcf.gz" and "chrZ.vcf.gz", while producing one output file for every set of input files, named "chrYoutput.txt" and "chrZoutput.txt".
Is that possible? Should I change my strategy maybe?
Thanks a lot for any suggestion or hint!
If your folder structure follows the pattern you described in your sample, then this will work:
for i in X Y Z; do
    python myscript.py /folder[1-5]/chr$i.vcf.gz > /myNewFolderForOutputs/chr${i}output.txt
done
Not 100% sure if this is what you asked.
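If the set of chromosomes isn't known in advance, a variant of the same idea (a sketch, assuming folder1 is representative of the other folders) can derive the loop list from the files themselves:

for f in /folder1/chr*.vcf.gz; do
    i=$(basename "$f" .vcf.gz)   # e.g. "chrX"
    python myscript.py /folder[1-5]/${i}.vcf.gz > /myNewFolderForOutputs/${i}output.txt
done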
I'm not sure even where to start.
I have a list of output files from a program; let's call it foo. They are numbered outputs like foo_1.out.
I'd like to make a directory for each file, move the file into its directory, run a bash script within that directory, take the output from each script, and copy it to the root directory as a single concatenated file.
I understand that this is not a forum for "hey, do my work for me", I'm honestly trying to learn. Any suggestions on where to look are sincerely appreciated!
Thanks!
You should probably look up the documentation for the Python modules os (specifically os.path, plus a couple of others) and subprocess.
Without wanting to do it all for you, as you stated, you'll be wanting to do something like:
import os
import subprocess

for f in filelist:
    pth, ext = os.path.splitext(f)
    os.mkdir(pth)
    out = subprocess.Popen(SCRIPTNAME, stdout=...)  # fill in the stdout handling
    # and so on...
To get a list of all files in a directory or make folders, check out the os module. Specifically, try os.listdir and os.mkdir
To copy files, you could manually open each file, read its contents into a string, and write that out to a different file; alternatively, look at the shutil module.
To run bash scripts, use the subprocess library.
All three of those should be a part of python's standard library.
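Putting those pieces together, a sketch of the whole task might look like the following (foo_*.out, do_work.sh, and combined.out are hypothetical names standing in for your numbered outputs, your bash script, and the concatenated result):

import glob
import os
import shutil
import subprocess

script = os.path.abspath("do_work.sh")          # hypothetical bash script
with open("combined.out", "w") as combined:
    for f in sorted(glob.glob("foo_*.out")):
        d = os.path.splitext(f)[0]              # e.g. "foo_1"
        os.mkdir(d)                             # one directory per file
        shutil.move(f, os.path.join(d, f))      # move the file into it
        # run the script inside that directory and capture its stdout
        result = subprocess.run(["bash", script], cwd=d,
                                capture_output=True, text=True)
        combined.write(result.stdout)           # concatenate in the root dir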
Perl has a lovely little utility called find2perl that will translate (quite faithfully) a command line for the Unix find utility into a Perl script to do the same.
If you have a find command like this:
find /usr -xdev -type d -name '*share'
-name '*share'  => name with shell expansion of '*share'
-type d         => a directory (not a file)
-xdev           => do not descend into other file systems
/usr            => the /usr directory (could be multiple directories)
It finds all the directories ending in "share" below /usr.
Now run find2perl /usr -xdev -type d -name '*share' and it will emit a Perl script that does the same. You can then modify the script for your own use.
Python has os.walk(), which certainly has the needed functionality of recursive directory listing, but there are big differences.
Take the simple case of find . -type f -print to find and print all files under the current directory. A naïve implementation using os.walk() would be:
for path, dirs, files in os.walk(root):
    if files:
        for file in files:
            print(os.path.join(path, file))
However, this will produce different results than typing find . -type f -print in the shell.
I have also been testing various os.walk() loops against:
# create a pipe to 'find' with an argument of 'root'
import shlex
import subprocess

find_cmd = 'find %s -type f' % root
args = shlex.split(find_cmd)
p = subprocess.Popen(args, stdout=subprocess.PIPE, universal_newlines=True)
out, err = p.communicate()
out = out.rstrip()  # remove terminating \n
for line in out.splitlines():
    print(line)
The difference is that os.walk() counts links as files; find skips these.
So a correct implementation that matches find . -type f -print becomes:
for path, dirs, files in os.walk(root):
    if files:
        for file in files:
            p = os.path.join(path, file)
            if os.path.isfile(p) and not os.path.islink(p):
                print(p)
Since there are hundreds of permutations of find primaries and different side effects, it becomes time-consuming to test every variant. Since find is the gold standard in the POSIX world for counting files in a tree, doing it the same way in Python is important to me.
So is there an equivalent of find2perl that can be used for Python? So far I have just been using find2perl and then manually translating the Perl code. This is hard because the Perl file test operators differ at times from the Python file tests in os.path.
If you're trying to reimplement all of find, then yes, your code is going to get hairy. find is pretty hairy all by itself.
In most cases, though, you're not trying to replicate the complete behavior of find; you're performing a much simpler task (e.g., "find all files that end in .txt"). If you really need all of find, just run find and read the output. As you say, it's the gold standard; you might as well just use it.
I often write code that reads paths on stdin just so I can do this:
find ...a bunch of filters... | my_python_code.py
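For what it's worth, a minimal sketch of such a consumer (the processing step is a placeholder):

#!/usr/bin/env python
# my_python_code.py: read one path per line from stdin, as emitted by find
import sys

for line in sys.stdin:
    path = line.rstrip("\n")
    # ... process each path here ...
    print(path)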
There are a couple of observations and several pieces of code to help you on your way.
First, Python can execute code in this form just like Perl:
cat code.py | python | the rest of the pipe story...
find2perl is a clever code template that emits a Perl function based on a template of find. Therefore, replicate this template and you will not have the "hundreds of permutations" that you are perceiving.
Second, the results from find2perl are not perfect, just as there are potential differences between versions of find, such as GNU or BSD.
Third, os.walk can walk the tree top down (its default) or bottom up (topdown=False), while find is strictly top down. Walking in a different order makes for different results if your underlying directory tree is changing while you recurse it.
There are two projects in Python that may help you: twander and dupfinder. Each strives to be OS independent and each recurses the file system like find.
If you template a general find-like function in Python, set os.walk to recurse top down, use glob to replicate shell expansion, and borrow some of the code you find in those two projects, you can replicate find2perl without too much difficulty.
Sorry I could not point to something ready to go for your needs...
I think glob could help in your implementation of this.
I wrote a Python script to use os.walk() to search-and-replace; it might be a useful thing to look at before writing something like this.
Replace strings in files by Python
And any Python replacement for find(1) is going to rely heavily on os.stat() to check various properties of a file. For example, find(1) has flags that check the size of the file or the last-modified timestamp.
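To illustrate, a sketch of an os.walk()-plus-os.stat() filter roughly equivalent to find root -type f -size +1M -mtime -7 (the function name and the thresholds are illustrative):

import os
import time

def find_like(root, min_size=1024 * 1024, max_age_days=7):
    # yield regular files larger than min_size bytes and modified
    # within the last max_age_days days
    cutoff = time.time() - max_age_days * 86400
    for path, dirs, files in os.walk(root):
        for name in files:
            p = os.path.join(path, name)
            if os.path.islink(p):
                continue  # find -type f does not report symlinks
            st = os.stat(p)
            if st.st_size > min_size and st.st_mtime > cutoff:
                yield p

for p in find_like("."):
    print(p)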