Perl has a lovely little utility called find2perl that will translate (quite faithfully) a command line for the Unix find utility into a Perl script to do the same.
If you have a find command like this:
find /usr -xdev -type d -name '*share'
    /usr            => the starting directory (could be multiple directories)
    -xdev           => do not descend into other file systems
    -type d         => match directories (not files)
    -name '*share'  => match names with shell-style expansion of '*share'
It finds all the directories ending in "share" below /usr.
Now run find2perl /usr -xdev -type d -name '*share' and it will emit a Perl script that does the same thing. You can then modify the script for your own use.
Python has os.walk(), which certainly provides the core functionality (a recursive directory listing), but there are big differences.
Take the simple case of find . -type f -print to find and print all files under the current directory. A naïve implementation using os.walk() would be:
import os

for path, dirs, files in os.walk(root):
    for file in files:
        print(os.path.join(path, file))
However, this will produce different results than typing find . -type f -print in the shell.
I have also been testing various os.walk() loops against:
import shlex
import subprocess

# create a pipe to 'find', rooted at 'root'
find_cmd = 'find %s -type f' % root
args = shlex.split(find_cmd)
# universal_newlines=True so 'out' is text rather than bytes
p = subprocess.Popen(args, stdout=subprocess.PIPE, universal_newlines=True)
out, err = p.communicate()
out = out.rstrip()  # remove the terminating \n
for line in out.splitlines():
    print(line)
The difference is that os.walk() lists symbolic links among the files; find -type f skips them.
So a correct implementation that matches find . -type f -print becomes:
import os

for path, dirs, files in os.walk(root):
    for file in files:
        p = os.path.join(path, file)
        if os.path.isfile(p) and not os.path.islink(p):
            print(p)
Since there are hundreds of permutations of find primaries, each with its own side effects, testing every variant becomes time-consuming. And since find is the gold standard in the POSIX world for how to count files in a tree, doing it the same way in Python is important to me.
So is there an equivalent of find2perl that can be used for Python? So far I have just been using find2perl and then manually translating the Perl code. This is hard because the Perl file-test operators sometimes differ from the Python file tests in os.path.
If you're trying to reimplement all of find, then yes, your code is going to get hairy. find is pretty hairy all by itself.
In most cases, though, you're not trying to replicate the complete behavior of find; you're performing a much simpler task (e.g., "find all files that end in .txt"). If you really need all of find, just run find and read the output. As you say, it's the gold standard; you might as well just use it.
I often write code that reads paths on stdin just so I can do this:
find ...a bunch of filters... | my_python_code.py
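For what it's worth, the filter side of that pipeline can be tiny. A minimal sketch (the processing step here is just a placeholder):

import sys

for line in sys.stdin:
    path = line.rstrip('\n')
    # replace this print with whatever per-path processing you need
    print(path)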
There are a couple of observations and several pieces of code to help you on your way.
First, Python can execute code in this form just like Perl:
cat code.py | python | the rest of the pipe story...
find2perl is essentially a clever code template that emits a Perl script from a template keyed to find's primaries. Therefore, replicate this template approach and you will not have the "hundreds of permutations" that you are perceiving.
Second, the results from find2perl are not perfect, just as there are differences between versions of find itself, such as GNU versus BSD.
Third, os.walk recurses top down by default (topdown=True), the same order as find; switching it to bottom up produces different results if your underlying directory tree is changing while you recurse it.
There are two projects in Python that may help you: twander and dupfinder. Each strives to be OS-independent, and each recurses the file system the way find does.
If you template a general find-like function in Python, keep os.walk recursing top down, use glob to replicate shell expansion, and reuse some of the code you find in those two projects, you can replicate find2perl without too much difficulty (a rough sketch follows below).
Sorry I could not point to something ready to go for your needs...
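Still, a minimal sketch of such a template might look like the following, covering only the -name and -type primaries; other primaries would be added as further predicates in the same style (the function name and defaults are mine, not find2perl's):

import fnmatch
import os

def pyfind(root, name='*', ftype='f'):
    # top-down walk, like find; -name is emulated with fnmatch
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        candidates = filenames if ftype == 'f' else dirnames
        for entry in fnmatch.filter(candidates, name):
            full = os.path.join(dirpath, entry)
            # match find's -type f semantics: skip symbolic links
            if ftype == 'f' and os.path.islink(full):
                continue
            yield full

# usage, equivalent to: find /usr -type d -name '*share'
# for p in pyfind('/usr', name='*share', ftype='d'): print(p)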
I think glob could help in your implementation of this.
I wrote a Python script to use os.walk() to search-and-replace; it might be a useful thing to look at before writing something like this.
Replace strings in files by Python
And any Python replacement for find(1) is going to rely heavily on os.stat() to check various properties of the file. For example, there are flags to find(1) that check the size of the file or the last modified timestamp.
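For instance, rough sketches of two such predicates might look like this (the function names are mine; they only approximate find's -size and -mtime tests):

import os
import time

def bigger_than(path, nbytes):
    # roughly find's -size test
    return os.stat(path).st_size > nbytes

def modified_within_days(path, days):
    # roughly find's -mtime -N test
    return os.stat(path).st_mtime > time.time() - days * 86400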
Related
I have a directory containing multiple subdirectories, each of which contains a file named sample.fas. I want to run a Python script (script.py) on each sample.fas in the subdirectories and export the output(s) named after each subdirectory.
However, the script requires the user to specify the path/name of the input and output files; it does not create the outputs automatically. Like this:
script.py sample_1.fas output_1a.nex output_1b.fas
I tried using these lines, without success:
while find . -name '*.fas'; # find the *fas files
do python script.py $*.fas > /path/output_1a output_1b; # run the script and export the two outputs
done
So, I want to create a bash script that reads each sample.fas from all the subdirectories (running the script recursively) and exports the outputs with the names of their subdirectories.
I would appreciate any help.
One quick way of doing this would be something like:
for x in $(find . -type f -name '*.fas'); do
    /usr/bin/python /my/full/path/to/script.py ${x} > /my/path/$(basename $(dirname ${x}))
done
This is running the script against all .fas files identified in the current directory (subdirectories included) and then redirects whatever the python script is outputting to a file named like the directory in which the currently processed .fas file was located. This file is created in /my/path/.
There are a few assumptions here: that all the directories containing .fas files have unique names; that the paths contain no spaces (this can be fixed with proper quoting; see also the sketch below); and that the script always outputs valid data (this simply redirects all of the script's output to that file). However, this should hopefully get you going in the right direction.
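If quoting becomes a headache, a sketch of the same loop in Python sidesteps it entirely, since no shell is involved; the script location and /my/path are the same placeholders used above:

import os
import subprocess

for dirpath, dirnames, filenames in os.walk('.'):
    for name in filenames:
        if name.endswith('.fas'):
            src = os.path.join(dirpath, name)
            # name the output after the containing directory, as above
            dest = os.path.join('/my/path', os.path.basename(dirpath))
            with open(dest, 'w') as out:
                subprocess.call(['python', '/my/full/path/to/script.py', src],
                                stdout=out)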
But I get the feeling that I didn't properly understand your question. If this is the case, could you rephrase it and maybe provide a tree showing how the directories and subdirectories are structured?
I am trying to create a function that will pop up a list of files that include the word "Module" (case-insensitive).
I tried :lvim /Module/gj *.f90, which works when all the *.f90 files are in the current directory, but I failed to build a globpath()-like expansion that also includes subdirectories.
So, I turned to Python. From Python, I am getting the list perfectly. Here is the Python code, which should show my goal:
#!/usr/bin/python
import os
import re

flsts = []
path = "/home/rudra/Devel/dream/"
print("All files ==>")
for dirs, subdirs, files in os.walk(path):
    for tfile in files:
        if tfile.endswith('f90'):
            print(os.path.splitext(tfile)[0])
            with open(os.path.join(dirs, tfile)) as fh:
                text = fh.read()
            # case-insensitive, as described above
            if re.search("module", text, re.IGNORECASE):
                flsts.append(os.path.splitext(tfile)[0])
print("The list to be used ==>")
print(flsts)
After having the list, I want a call like:
complete(col('.'), flsts)
The problem is, I am unable to include it inside a vim function.
May I kindly have some help, so that I can get a list from vim and use it in the complete function?
I have checked this as a possible solution, but unfortunately it is not.
Kindly help.
edit: More explanation
So, say, in my work-dir, I have:
$tree */*.f90
OLD/dirac.f90
OLD/environment.f90
src/constants.f90
src/gencrystal.f90
src/geninp.f90
src/init.f90
Among them, only two have the word module in them:
$ grep Module */*.f90
OLD/dirac.f90: 10 :module mdirac
src/constants.f90: 2 :module constants
So, via an inoremap, I want complete() to pop up only constants and dirac.
Hence, Module is the keyword I am searching for in the subdirectories of the present working directory, and only the files that match (dirac and constants in this example) should show up in complete().
I'm not sure what your exact problem is.
With split(globpath('./**/Module/**', '*.f90'), '\n') you will obtain the list of all files that match *.f90, and which are somewhere within a directory named Module.
Then, using complete() has a few restrictions. It has to be called from a function that is invoked from insert mode and that returns an empty string.
By itself, complete() will insert the selected text; if we play with the {startcol} parameter, we can even remove what's before the cursor. This way, you can type Module, hit the key you want, and use Module to filter.
function! s:Complete()
    " From lh-vim-lib: word_tools.vim
    let key = GetCurrentKeyword()
    let files = split(globpath('./**/*'.key.'*/**', '*.vim'), '\n')
    call complete(col('.')-len(key), files)
    return ''
endfunction
inoremap µ <c-R>=<sid>Complete()<cr>
However, if you want to trigger an action (instead of inserting text), it becomes much more complex. I did that in muTemplate. I've published the framework used to associate hooks to completion items in lh-vim-lib (See lh#icomplete#*() functions).
EDIT: OK then, I'll work with let files=split(system("grep --include=*.f90 -Ril module *"), '\n') to obtain the list of files, then call complete(col('.'), files) with that list. That should be the most efficient solution. It is quite similar to Ingo's solution; the difference is that we don't need Python if grep is available.
Regarding Python integration, well it's possible with :py vim.command(). See for instance jira-complete that integrates complete() with a Python script that builds the completion-list: https://github.com/mnpk/vim-jira-complete/blob/master/autoload/jira.vim#L116
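As a rough sketch of that route for this question (assuming Vim built with +python; the helper name is mine), the Python side could walk the tree and hand the list back to complete():

# run via :python from an insert-mode function, as with complete() above
import os
import re
import vim

def module_files(root):
    # basenames of *.f90 files whose text contains 'module' (any case)
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.f90'):
                with open(os.path.join(dirpath, name)) as fh:
                    if re.search('module', fh.read(), re.IGNORECASE):
                        hits.append(os.path.splitext(name)[0])
    return hits

# a Python list of plain strings prints as a valid Vim list literal here
vim.command("call complete(col('.'), %s)" % module_files(os.getcwd()))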
Notes:
if "module:" can be pre-searched with ctags, it will to possible to extract your files from tags database with taglist().
It's also possible to fill the list of files dynamically with complete_add(), which would make sense from a Python script that tests each file one after the other.
There's an example at :help complete() that you can adapt. If you modify your Python script to output just the (newline-separated) files, you can invoke it via system():
inoremap <F5> <C-R>=FindFiles()<CR>
function! FindFiles()
    call complete(col('.'), split(system('python path/to/script.py'), '\n'))
    return ''
endfunction
I have a set of files in two directories
~/Desktop/dir1 and ~/Desktop/dir2
I need to match files in dir1 to files in dir2 or vice versa
filenames in /dir1 are: 1.out, 2.out ... 21.out
filenames in /dir2 are: chr-1.out, chr-2.out ... chr-21.out
I wrote a plotting script in Python which accepts filenames as command-line arguments and builds plots based on the files' contents. So the question is: how do I match the files and provide them to the script? I tried to use bash, but I can't figure out how to do that. Maybe it is possible to do it from Python?
I could have done it by hand, but I would rather learn how to do it automatically.
In bash, use parameter expansion:
#!/bin/bash
for f in dir1/*.out; do
    echo "$f" "dir2/chr-${f#dir1/}"
done
Alternatively in Python (working from the desktop):
import os

for file1 in os.listdir('dir1'):
    for file2 in os.listdir('dir2'):
        if file1 in file2:
            print(file1)
There is probably a more efficient way of doing this but that's a quick and dirty method and should be relatively flexible.
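One such improvement: since each dir2 name is just the dir1 name with a chr- prefix, you can derive the expected partner directly instead of scanning both listings, which is O(n) rather than O(n²) and also avoids accidental substring matches (as written above, 1.out matches chr-21.out too). A sketch:

import os

for name in os.listdir('dir1'):
    partner = os.path.join('dir2', 'chr-' + name)
    if os.path.exists(partner):
        print(os.path.join('dir1', name), partner)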
I am writing a plug-in for RawTherapee in Python. I need to extract the version number from a file called 'AboutThisBuild.txt' that may exist anywhere in the directory tree. Although RawTherapee knows where it is installed, this data is baked into the binary file.
My plug-in is being designed to collect basic system data when run without any command-line parameters, for the purpose of short-circuiting troubleshooting. By having the version number, revision number, and changeset (AKA Mercurial), I can sort out why the script may not be working as expected. OK, that is the context.
I have tried a variety of methods, some suggested elsewhere on this site. The main one is using os.walk and fnmatch.
The problem is speed. Searching the entire directory tree is like watching paint dry!
To reduce load I have tried to predict likely hiding places and only traverse these. This is quicker but has the obvious disadvantage of missing some files.
This is what I have at the moment. Tested on Linux but not Windows as yet as I am still researching where the file might be placed.
import fnmatch
import os

rootPath = ('/usr/share/doc/rawtherapee',
            os.path.expanduser('~'),  # '~' is not expanded by os.walk itself
            '/media/CoreData/opt/',
            '/opt')
pattern = 'AboutThisBuild.txt'

# Print the first match found in each directory of the paths searched
for CheckPath in rootPath:
    print("\n")
    print(">>>>>>>>>>>>> " + CheckPath)
    print("\n")
    for root, dirs, files in os.walk(CheckPath, True, None, False):
        for filename in fnmatch.filter(files, pattern):
            print(os.path.join(root, filename))
            break
Usually 'AboutThisBuild.txt' is stored in a directory/subdirectory called 'rawtherapee', or has that string somewhere in the directory tree. I had naively thought I could get the 5000-odd directory names, search those for 'rawtherapee', and then use os.walk to traverse only the matching directories, but all the modules and functions I have looked at collate all files in the directory (again).
Anyone have a quicker method of searching the entire directory tree or am I stuck with this hybrid option?
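(One half-measure worth noting: os.walk lets you prune the tree in place by editing its dirs list, so subtrees you know are irrelevant are never entered at all. A hedged sketch; the skip list is a guess, tune it to your system:

import fnmatch
import os

SKIP = {'proc', 'sys', 'dev', 'run', '.git'}  # never descend into these

def find_pruned(top, pattern='AboutThisBuild.txt'):
    for root, dirs, files in os.walk(top):
        dirs[:] = [d for d in dirs if d not in SKIP]  # prune in place
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(root, filename)
)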
I am a beginner in Python, but I think I know the simplest way of finding a file in Windows.
import os

for dirpath, subdirs, filenames in os.walk('The directory you wanna search the file in'):
    if 'name of your file with extension' in filenames:
        print(dirpath)
This code will print out the directory of the file you are searching for in the console. All you have to do is get to the directory.
The thing about searching is that it doesn't matter too much how you get there (e.g. by cheating). Once you have a result, you can verify it is correct relatively quickly.
You may be able to identify candidate locations fairly efficiently by guessing. For example, on Linux, you could first try looking in these locations (obviously not all of them are directories, but it does no harm to call os.path.isfile('/;l$/AboutThisBuild.txt')):
$ strings /usr/bin/rawtherapee | grep '^/'
/lib/ld-linux.so.2
/H=!
/;l$
/9T$,
/.ba
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
If you have it installed, you can try the locate command.
If you still don't find it, move on to the brute-force method.
Here is a rough equivalent of strings using Python:
>>> from string import printable, whitespace
>>> from itertools import groupby
>>> pathchars = set(printable) - set(whitespace)
>>> with open("/usr/bin/rawtherapee") as fp:
... data = fp.read()
...
>>> for k, g in groupby(data, pathchars.__contains__):
... if not k: continue
... g = ''.join(g)
... if len(g) > 3 and g.startswith("/"):
... print g
...
/lib64/ld-linux-x86-64.so.2
/^W0Kq[
/pW$<
/3R8
/)wyX
/WUO
/w=H
/t_1
/.badpixH
/d$(
/\$P
/D$Pv
/D$#
/D$(
/l$#
/d$#v?H
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue.csv
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
It sounds like you need a pure python solution here. If not, other answers will suffice.
In this case, you should traverse the folders using a queue and threads. While some may say threads are never the solution, they are a great way of speeding things up when you are I/O bound, which you are in this case. Essentially, you os.listdir the current dir; if it contains your file, party like it's 1999, and if it doesn't, add each subfolder to the work queue.
If you're clever, you can play with depth first vs breadth first traversal to get the best results.
There is a great example I have used quite successfully at work at http://www.tutorialspoint.com/python/python_multithreading.htm. See the section titled Multithreaded Priority Queue. The example could probably be updated to use thread pools, but that's not necessary.
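A minimal sketch of the queue-plus-threads idea described above (breadth-first; the names, the starting root, and the worker count are mine):

import os
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def worker(q, target, found):
    while True:
        d = q.get()
        try:
            entries = os.listdir(d)
        except OSError:
            entries = []    # unreadable directory: skip it
        for name in entries:
            path = os.path.join(d, name)
            if name == target:
                found.append(path)
            elif os.path.isdir(path) and not os.path.islink(path):
                q.put(path)  # breadth-first: enqueue subfolders
        q.task_done()

q, found = queue.Queue(), []
q.put('/opt')  # starting point; use whatever roots make sense
for _ in range(8):  # worker count is a tunable guess
    t = threading.Thread(target=worker, args=(q, 'AboutThisBuild.txt', found))
    t.daemon = True
    t.start()
q.join()  # returns once every queued directory has been listed
print(found)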
Okay, I'm having trouble not only with the problem itself but even with trying to explain my question. I have a directory tree consisting of about 7 levels, so: rootdir/a/b/c/d/e/f/destinationdir
The thing is some may have 5 subdirectory levels and some may have as many as ten, such as:
rootdir/a/b/c/d/destinationdir
or:
rootdir/a/b/c/d/e/f/g/h/destinationdir
The only thing they have in common is that the destination directory is always named the same thing. The way I'm using the glob function is as follows:
import glob
import os

for path in glob.glob('/rootdir/*/*/*/*/*/*/destinationdir'):
    os.system('cd {0}; do whatever'.format(path))
However, this only works for directories with that precise number of intermediate subdirectories. Is there any way to avoid specifying the number of subdirectories (asterisks), in other words, to have the function arrive at destinationdir no matter how many intermediate subdirectories there are, and let me iterate through them? Thanks a lot!
I think this could be done more easily with os.walk:
import os

def find_files(root, filename):
    for directory, subdirs, files in os.walk(root):
        # 'directory' already includes root, so no extra join is needed
        if filename in files:
            yield os.path.join(directory, filename)
Of course, this doesn't allow you to have a glob expression in the filename portion, but you could check that stuff using regex or fnmatch.
EDIT
Or to find a directory:
def find_dirs(root, d):
    for directory, subdirs, files in os.walk(root):
        if d in subdirs:
            yield os.path.join(directory, d)
You can create a pattern for each level of nesting (increase 10 if needed):
import glob
import os

for i in range(10):
    pattern = '/rootdir/' + ('*/' * i) + 'destinationdir'
    for path in glob.glob(pattern):
        os.system('cd {0}; do whatever'.format(path))
This will iterate over:
'/rootdir/destinationdir'
'/rootdir/*/destinationdir'
'/rootdir/*/*/destinationdir'
'/rootdir/*/*/*/destinationdir'
'/rootdir/*/*/*/*/destinationdir'
'/rootdir/*/*/*/*/*/destinationdir'
'/rootdir/*/*/*/*/*/*/destinationdir'
'/rootdir/*/*/*/*/*/*/*/destinationdir'
'/rootdir/*/*/*/*/*/*/*/*/destinationdir'
'/rootdir/*/*/*/*/*/*/*/*/*/destinationdir'
If you have to iterate over directories with arbitrary depth then I suggest dividing the algorithm in two steps: one phase where you investigate where all 'destinationdir' directories are located and a second phase where you perform your operations.
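A sketch of that two-phase shape using os.walk (which visits every depth anyway):

import os

# phase one: collect every destinationdir, whatever its depth
targets = []
for root, dirs, files in os.walk('/rootdir'):
    if 'destinationdir' in dirs:
        targets.append(os.path.join(root, 'destinationdir'))

# phase two: perform your operations on the collected paths
for path in targets:
    os.system('cd {0}; do whatever'.format(path))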
In Python 3, glob.glob accepts a double wildcard to designate any number of intermediate directories, as long as you also pass recursive=True:
>>> import glob
>>> glob.glob('**/*.txt', recursive=True)
['1.txt', 'foo/2.txt', 'foo/bar/3.txt', 'foo/bar/baz/4.txt']
If you are looking for files, you can use the Formic package (disclosure: I wrote it) - this implements Apache Ant's FileSet Globs with the '**' wildcard:
import formic

fileset = formic.FileSet(include="rootdir/**/destinationdir/*")
for file_name in fileset:
    print(file_name)  # do something with each file_name
This looks much easier to accomplish with a more versatile tool, like the find command (your os.system call indicates you're on a unix-like system, so this will work).
os.system('find /rootdir -mindepth 5 -maxdepth 10 -type d -name destinationdir | while read d; do ( cd "$d" && do whatever; ); done')
Note that if you are going to put any user-supplied string into that command, it becomes drastically unsafe, and you should use subprocess.Popen instead, skipping the shell and splitting the arguments yourself. It's safe as shown, though.
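A sketch of that safer form, with no shell involved ('do_whatever' stands in for your real command):

import subprocess

paths = subprocess.check_output(
    ['find', '/rootdir', '-mindepth', '5', '-maxdepth', '10',
     '-type', 'd', '-name', 'destinationdir'])
for d in paths.splitlines():
    # arguments are passed as a list, so nothing is shell-interpreted
    subprocess.call(['do_whatever'], cwd=d)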