How to match filenames in different directories? (bash or python)

I have a set of files in two directories,
~/Desktop/dir1 and ~/Desktop/dir2,
and I need to match files in dir1 to files in dir2 (or vice versa).
The filenames in dir1 are: 1.out, 2.out ... 21.out
The filenames in dir2 are: chr-1.out, chr-2.out ... chr-21.out
I wrote a plotting script in Python which accepts filenames as command line arguments and builds plots based on the contents of the files. So the question is: how do I match the files and provide them to the script? I tried to use bash, but I can't figure out how to do it. Maybe it is possible to do that from Python?
I could have done it by hand, but I would rather learn how to do it automatically.

In bash, use parameter expansion:
#!/bin/bash
for f in dir1/*.out; do
    echo "$f" "dir2/chr-${f#dir1/}"
done

Alternatively in Python (working from the desktop):
import os

for file1 in os.listdir('dir1'):
    for file2 in os.listdir('dir2'):
        if file1 in file2:
            print(file1)
There is probably a more efficient way of doing this, but that's a quick and dirty method and it should be relatively flexible.
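Since the dir2 names are just the dir1 names with a chr- prefix, the pairs can also be built directly instead of comparing every pair of names. A minimal sketch under that assumption:

import os

# derive each expected dir2 name from the dir1 name and keep it if it exists
for name in sorted(os.listdir('dir1')):
    partner = os.path.join('dir2', 'chr-' + name)
    if os.path.exists(partner):
        print(os.path.join('dir1', name), partner)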

Related

How to input multiple files from a directory

First and foremost, I am new to Unix and I have tried to find a solution to my question online, but I could not find one.
I am running Python through my Unix terminal, and I have a program that parses xml files and writes the results to a .dat file.
My program works, but I have to input every single xml file (there are over 50) individually.
For example:
clamshell: python3 my_parser2.py 'items-0.xml' 'items-1.xml' 'items-2.xml' 'items-3.xml' .....
So I was wondering if it is possible to read all the files in the directory into my program, rather than typing every xml file name individually and running the program that way.
Any help on this is greatly appreciated.
import glob
listOffiles = glob.glob('directory/*.xml')
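From there, the list can be handed straight to the existing parser. A short sketch (this assumes my_parser2.py accepts many filenames at once, as in the question; note that sorted() gives lexicographic, not numeric, order):

import glob
import subprocess

# collect all xml files and pass them to the parser in one invocation
xml_files = sorted(glob.glob('directory/*.xml'))
subprocess.run(['python3', 'my_parser2.py'] + xml_files)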
The shell itself can expand wildcards, so if you don't care about the order of the input files, just use:
python3 my_parser2.py items-*.xml
If the numeric order is important (you want 0..9, then 10..99, and so on, in that order), you may have to adjust the wildcard arguments slightly to guarantee this, such as with:
python3 my_parser2.py items-[0-9].xml items-[1-9][0-9].xml items-[1-9][0-9][0-9].xml
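If juggling wildcards gets awkward, the same ordering can be produced in Python by sorting the glob results on the numeric part of each name. A sketch, assuming names like items-12.xml:

import glob
import re

def numeric_key(name):
    # pull the number out of 'items-<number>.xml'
    match = re.search(r'\d+', name)
    return int(match.group()) if match else -1

files = sorted(glob.glob('items-*.xml'), key=numeric_key)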
python3 my_parser2.py *.xml should work.
Other than the command line option, you could just use glob from within your script and bypass the need for command arguments:
import glob
filenames = glob.glob("*.xml")
This will return all .xml files (as filenames) in the directory from which you are running the script.
Then, if needed you can simply iterate through all the files with a basic loop:
for file in filenames:
    with open(file, 'r') as f:
        pass  # do stuff with f

how can I cat two files with a matching string in the filename?

So I have a directory with ~162K files. Half of these files have names of the form "uniquenumber.fasta" and the other half have names of the form "uniquenumber.fasta letters". For example:
12345.fasta
12345.fasta Somebacterialtaxaname
67890.fasta
67890.fasta Someotherbacterialtaxaname
...for another many thousand "pairs"
I would like to cat together the two files that share the unique fasta number. The order of the concatenation does not matter (i.e. which file's contents come first in the newly created combined file). I have tried some renditions of grep on the command line and a few lousy python scripts, but I feel like this is a more trivial problem than I am making it. Suggestions?
Here's a solution in Python (it will work unchanged with both Python 2 and 3). This assumes that each file XXXXX.fasta has one and only one matching XXXXX.fasta stringofstuff file.
import glob

fastafiles = sorted(glob.glob("*.fasta"))
for fastafile in fastafiles:
    number = fastafile.split(".")[0]
    space_file = glob.glob(number + ".fasta *")
    with open(fastafile, "a+") as fasta:
        with open(space_file[0], "r") as fasta_space:
            fasta.write("\n")
            fasta.writelines(fasta_space.readlines())
Here's how it works: first, the names of all *.fasta files are put into a list (I sort the list, but it's not strictly necessary). Next, the filename is split on . and the first part (the number in the filename) is stored. Then, we search for the matching XXXXX.fasta something file and, assuming there's only one of them, we open the .fasta file in append mode and the .fasta something file in read mode. We write a newline to the end of the .fasta file, then read in the contents of the "space file" and write them to the end of the .fasta file. Since we use the with context manager, we don't need to specifically close the files when we're done.
There's probably many ways to achieve this, but the first that came to my head would be to use the unix command find.
http://en.wikipedia.org/wiki/Find#Execute_an_action
The find command prints the filenames that match the pattern you specify. Using the -name and -exec flags, you can specify what characters should be in the file name, or run an additional command on each match.
If I were solving this problem, I would cycle over all files in the directory and use a -name pattern (or -exec) to find each file's match, then pass the two file names to cat and redirect that output to a new file, concatenating the two. Hope that helps!
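The same find-then-cat pipeline can also be expressed in Python. A sketch (the combined- output name is made up, and it assumes exactly one "space file" per number):

import glob
import shutil

for fasta in glob.glob('*.fasta'):
    number = fasta.split('.')[0]
    matches = glob.glob(number + '.fasta *')  # the "find the partner" step
    if matches:
        # the "cat a b > new" step: stream both files into a new one
        with open('combined-' + fasta, 'wb') as out:
            for part in (fasta, matches[0]):
                with open(part, 'rb') as src:
                    shutil.copyfileobj(src, out)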

Running bash scripts within newly created folders based on file names

I'm not sure even where to start.
I have a list of output files from a program; let's call them foo. They are numbered outputs like foo_1.out.
I'd like to make a directory for each file, move the file into its directory, run a bash script within that directory, then take the output from each script and copy it to the root directory as a single concatenated file.
I understand that this is not a forum for "hey, do my work for me", I'm honestly trying to learn. Any suggestions on where to look are sincerely appreciated!
Thanks!
You should probably look up the documentation for the python modules os (specifically os.path and a couple of others) and subprocess, both of which are in the standard library documentation.
Without wanting to do it all for you, as you stated, you'll be wanting to do something like:
import os
import subprocess

for f in filelist:
    pth, ext = os.path.splitext(f)
    os.mkdir(pth)
    out = subprocess.Popen(SCRIPTNAME, stdout=...)
    # and so on...
To get a list of all files in a directory or make folders, check out the os module. Specifically, try os.listdir and os.mkdir
To copy files, you could manually open each file, read its contents into a string, and write it out to a different file. Alternatively, look at the shutil module.
To run bash scripts, use the subprocess library.
All three of those are part of Python's standard library.
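Putting those pieces together, a rough sketch of the whole pipeline (process.sh is a made-up name for your bash script, and it is assumed to print its result to stdout):

import os
import shutil
import subprocess

outputs = []
for f in sorted(os.listdir('.')):
    if f.startswith('foo_') and f.endswith('.out'):
        folder, _ = os.path.splitext(f)
        os.mkdir(folder)                       # a directory per file
        shutil.move(f, folder)                 # move the file into it
        result = subprocess.run(['bash', '../process.sh'], cwd=folder,
                                capture_output=True, text=True)
        outputs.append(result.stdout)          # collect each script's output

# concatenate all of the script outputs into one file in the root directory
with open('combined.out', 'w') as combined:
    combined.writelines(outputs)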

Python equivalent of find2perl

Perl has a lovely little utility called find2perl that will translate (quite faithfully) a command line for the Unix find utility into a Perl script to do the same.
If you have a find command like this:
find /usr -xdev -type d -name '*share'

/usr           => the directory to search (could be multiple directories)
-xdev          => do not descend into other file systems
-type d        => match directories (not files)
-name '*share' => match names against '*share' (shell-style expansion)
It finds all the directories ending in share below /usr
Now run find2perl /usr -xdev -type d -name '*share' and it will emit a Perl script to do the same. You can then modify the script to your use.
Python has os.walk() which certainly has the needed functionality, recursive directory listing, but there are big differences.
Take the simple case of find . -type f -print, which finds and prints all files under the current directory. A naïve implementation using os.walk() would be:
for path, dirs, files in os.walk(root):
    if files:
        for file in files:
            print(os.path.join(path, file))
However, this will produce different results than typing find . -type f -print in the shell.
I have also been testing various os.walk() loops against:
import shlex
import subprocess

# create a pipe to 'find' run with an argument of 'root'
find_cmd = 'find %s -type f' % root
args = shlex.split(find_cmd)
p = subprocess.Popen(args, stdout=subprocess.PIPE, universal_newlines=True)
out, err = p.communicate()
out = out.rstrip()  # remove terminating \n
for line in out.splitlines():
    print(line)
The difference is that os.walk() counts links as files; find skips these.
So a correct implementation that is the same as find . -type f -print becomes:
for path, dirs, files in os.walk(root):
    if files:
        for file in files:
            p = os.path.join(path, file)
            if os.path.isfile(p) and not os.path.islink(p):
                print(p)
Since there are hundreds of permutations of find primaries and different side effects, this becomes time consuming to test every variant. Since find is the gold standard in the POSIX world on how to count files in a tree, doing it the same way in Python is important to me.
So is there an equivalent of find2perl that can be used for Python? So far I have just been using find2perl and then manually translating the Perl code. This is hard because the Perl file test operators differ at times from the Python file tests in os.path.
If you're trying to reimplement all of find, then yes, your code is going to get hairy. find is pretty hairy all by itself.
In most cases, though, you're not trying to replicate the complete behavior of find; you're performing a much simpler task (e.g., "find all files that end in .txt"). If you really need all of find, just run find and read the output. As you say, it's the gold standard; you might as well just use it.
I often write code that reads paths on stdin just so I can do this:
find ...a bunch of filters... | my_python_code.py
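The stdin-reading side of that pattern is tiny; a sketch:

import sys

# consume newline-separated paths produced by find
for line in sys.stdin:
    path = line.rstrip('\n')
    print(path)  # replace with real processing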
There are a couple of observations and several pieces of code to help you on your way.
First, Python can execute code in this form just like Perl:
cat code.py | python | the rest of the pipe story...
find2perl is a clever code template that emits a Perl function based on a template of find. Therefore, replicate this template and you will not have the "hundreds of permutations" that you are perceiving.
Second, the results from find2perl are not perfect just as there are potentially differences between versions of find, such as GNU or BSD.
Third, os.walk() takes a topdown argument and, like find, recurses top down by default; even so, the two can produce different results if the underlying directory tree is changing while you recurse it.
There are two projects in Python that may help you: twander and dupfinder. Each strives to be OS independent and each recurses the file system like find.
If you template a general find-like function in Python, keep os.walk recursing top down, use glob to replicate shell expansion, and borrow some of the code you find in those two projects, you can replicate find2perl without too much difficulty.
Sorry I could not point to something ready to go for your needs...
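That said, the skeleton of such a find-like walker is short. A sketch (nowhere near a full find replacement; the name find_like is made up):

import fnmatch
import os

def find_like(root, pattern='*', kind=None):
    # walk top down, as find does, and yield paths whose basename matches
    # a shell-style pattern; kind is 'f' for files, 'd' for dirs, None for both
    for path, dirs, files in os.walk(root, topdown=True):
        names = files if kind == 'f' else dirs if kind == 'd' else dirs + files
        for name in names:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(path, name)

For instance, list(find_like('/usr', '*share', kind='d')) roughly mirrors the find example at the top of this question.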
I think glob could help in your implementation of this.
I wrote a Python script to use os.walk() to search-and-replace; it might be a useful thing to look at before writing something like this.
Replace strings in files by Python
And any Python replacement for find(1) is going to rely heavily on os.stat() to check various properties of each file; for example, find(1) has flags that test the size of the file or its last modified timestamp.
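Those tests map onto os.stat() fields fairly directly; a sketch (the size and age thresholds are illustrative):

import os
import stat
import time

def is_recent_big_file(path, min_bytes=1024 * 1024, max_age_days=7):
    # rough equivalent of find's -type f, -size, and -mtime tests
    st = os.stat(path)
    age_days = (time.time() - st.st_mtime) / 86400
    return (stat.S_ISREG(st.st_mode)
            and st.st_size > min_bytes
            and age_days < max_age_days)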

UNIX shell script to call python

I have a python script that runs on three files in the following way
align.py *.wav *.txt *.TextGrid
However, I have a directory full of files that I want to loop through. The original author suggests creating a shell script to loop through the files.
The tricky part about the loop is that I need to match three files at a time with three different extensions for the script to run correctly.
Can anyone help me figure out how to create a shell script to loop through a directory of files, match three of them according to name (with three different extensions) and run the python script on each triplet?
Thanks!
Assuming you're using bash, here is a one-liner:
for f in *.wav; do align.py "$f" "${f%.*}.txt" "${f%.*}.TextGrid"; done
You could use glob.glob to list only the wav files, then construct the subprocess.Popen call like so:
import glob
import os
import subprocess

for wav_name in glob.glob('*.wav'):
    basename, ext = os.path.splitext(wav_name)
    txt_name = basename + '.txt'
    grid_name = basename + '.TextGrid'
    proc = subprocess.Popen(['align.py', wav_name, txt_name, grid_name])
    proc.communicate()
