Hashing Multiple Files

Problem Specification:
Given a directory, I want to iterate through the directory and its non-hidden sub-directories, and add a Whirlpool hash to the names of the non-hidden files.
If the script is re-run, it should replace an old hash with a new one.
<filename>.<extension>   ==>  <filename>.<a-whirlpool-hash>.<extension>
<filename>.<old-hash>.<extension>   ==>  <filename>.<new-hash>.<extension>
Question:
a) How would you do this?
b) Out of all the methods available to you, what makes your method the most suitable?
Verdict:
Thanks all, I have chosen SeigeX's answer for its speed and portability.
It is empirically quicker than the other bash variants, and it worked without alteration on my Mac OS X machine.

Updated to fix:
1. File names with '[' or ']' in their name (really, any character now. See comment)
2. Handling of md5sum when hashing a file with a backslash or newline in its name
3. Functionized hash-checking algo for modularity
4. Refactored hash-checking logic to remove double-negatives
#!/bin/bash
if (($# != 1)) || ! [[ -d "$1" ]]; then
    echo "Usage: $0 /path/to/directory"
    exit 1
fi

is_hash() {
    md5=${1##*.} # strip prefix
    [[ "$md5" == *[^[:xdigit:]]* || ${#md5} -lt 32 ]] && echo "$1" || echo "${1%.*}"
}

while IFS= read -r -d $'\0' file; do
    read hash junk < <(md5sum "$file")
    basename="${file##*/}"
    dirname="${file%/*}"
    pre_ext="${basename%.*}"
    ext="${basename:${#pre_ext}}"
    # File already hashed?
    pre_ext=$(is_hash "$pre_ext")
    ext=$(is_hash "$ext")
    mv "$file" "${dirname}/${pre_ext}.${hash}${ext}" 2> /dev/null
done < <(find "$1" -path "*/.*" -prune -o \( -type f -print0 \))
This code has the following benefits over other entries thus far
It is fully compliant with Bash versions 2.0.2 and beyond
No superfluous calls to other binaries like sed or grep; uses builtin parameter expansion instead
Uses process substitution for 'find' instead of a pipe, no sub-shell is made this way
Takes the directory to work on as an argument and does a sanity check on it
Uses $() rather than `` notation for command substitution, the latter is deprecated
Works with files with spaces
Works with files with newlines
Works with files with multiple extensions
Works with files with no extension
Does not traverse hidden directories
Does NOT skip pre-hashed files, it will recalculate the hash as per the spec
Test Tree
$ tree -a a
a
|-- .hidden_dir
| `-- foo
|-- b
| `-- c.d
| |-- f
| |-- g.5236b1ab46088005ed3554940390c8a7.ext
| |-- h.d41d8cd98f00b204e9800998ecf8427e
| |-- i.ext1.5236b1ab46088005ed3554940390c8a7.ext2
| `-- j.ext1.ext2
|-- c.ext^Mnewline
| |-- f
| `-- g.with[or].ext
`-- f^Jnewline.ext
4 directories, 9 files
Result
$ tree -a a
a
|-- .hidden_dir
| `-- foo
|-- b
| `-- c.d
| |-- f.d41d8cd98f00b204e9800998ecf8427e
| |-- g.d41d8cd98f00b204e9800998ecf8427e.ext
| |-- h.d41d8cd98f00b204e9800998ecf8427e
| |-- i.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
| `-- j.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
|-- c.ext^Mnewline
| |-- f.d41d8cd98f00b204e9800998ecf8427e
| `-- g.with[or].d41d8cd98f00b204e9800998ecf8427e.ext
`-- f^Jnewline.d3b07384d113edec49eaa6238ad5ff00.ext
4 directories, 9 files

#!/bin/bash
find -type f -print0 | while read -d $'\0' file
do
    md5sum=`md5sum "${file}" | sed -r 's/ .*//'`
    filename=`echo "${file}" | sed -r 's/\.[^./]*$//'`
    extension="${file:${#filename}}"
    filename=`echo "${filename}" | sed -r 's/\.md5sum-[^.]+//'`
    if [[ "${file}" != "${filename}.md5sum-${md5sum}${extension}" ]]; then
        echo "Handling file: ${file}"
        mv "${file}" "${filename}.md5sum-${md5sum}${extension}"
    fi
done
Tested on files containing spaces like 'a b'
Tested on files containing multiple extensions like 'a.b.c'
Tested with directories containing spaces and/or dots.
Tested on files containing no extension inside directories containing dots, such as 'a.b/c'
Updated: Now updates hashes if the file changes.
Key points:
Use of print0 piped to while read -d $'\0', to correctly handle spaces in file names.
md5sum can be replaced with your favourite hash function. The sed removes the first space and everything after it from the output of md5sum.
The base filename is extracted using a regular expression that finds the last period that isn't followed by another slash (so that periods in directory names aren't counted as part of the extension).
The extension is found by using a substring with starting index as the length of the base filename.
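The base-name/extension split described in the key points can also be sketched in Python. This is only an illustration of the same idea, not part of the script: the helper name `split_ext` is made up here, and it assumes `/` as the path separator. Only a period in the final path component counts as the start of the extension.

```python
def split_ext(path):
    """Split `path` into (base, extension), ignoring periods in directory names.

    Hypothetical helper for illustration only.
    """
    slash = path.rfind("/")
    dot = path.rfind(".")
    if dot > slash:  # the last period belongs to the file name itself
        return path[:dot], path[dot:]
    return path, ""
```

As in the shell version, the extension is simply whatever follows the base name, so base + extension always reproduces the original path.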

The logic of the requirements is complex enough to justify the use of Python instead of bash. It should provide a more readable, extensible, and maintainable solution.
#!/usr/bin/env python
import hashlib, os

def ishash(h, size):
    """Whether `h` looks like hash's hex digest."""
    if len(h) == size:
        try:
            int(h, 16)  # whether h is a hex number
            return True
        except ValueError:
            return False
    return False

for root, dirs, files in os.walk("."):
    dirs[:] = [d for d in dirs if not d.startswith(".")]  # skip hidden dirs
    for path in (os.path.join(root, f) for f in files if not f.startswith(".")):
        with open(path, "rb") as fobj:  # binary mode, so the raw bytes are hashed
            suffix = hash_ = "." + hashlib.md5(fobj.read()).hexdigest()
        hashsize = len(hash_) - 1
        # extract old hash from the name; add/replace the hash if needed
        barepath, ext = os.path.splitext(path)  # ext may be empty
        if not ishash(ext[1:], hashsize):
            suffix += ext  # add original extension
            barepath, oldhash = os.path.splitext(barepath)
            if not ishash(oldhash[1:], hashsize):
                suffix = oldhash + suffix  # preserve 2nd (not a hash) extension
        else:  # ext looks like a hash
            oldhash = ext
        if hash_ != oldhash:  # replace old hash by new one
            os.rename(path, barepath + suffix)
Here's a test directory tree. It contains:
files without extension inside directories with a dot in their name
filename which already has a hash in it (test on idempotency)
filename with two extensions
newlines in names
$ tree a
a
|-- b
| `-- c.d
| |-- f
| |-- f.ext1.ext2
| `-- g.d41d8cd98f00b204e9800998ecf8427e
|-- c.ext^Mnewline
| `-- f
`-- f^Jnewline.ext1
7 directories, 5 files
Result
$ tree a
a
|-- b
| `-- c.d
| |-- f.0bee89b07a248e27c83fc3d5951213c1
| |-- f.ext1.614dd0e977becb4c6f7fa99e64549b12.ext2
| `-- g.d41d8cd98f00b204e9800998ecf8427e
|-- c.ext^Mnewline
| `-- f.0bee89b07a248e27c83fc3d5951213c1
`-- f^Jnewline.b6fe8bb902ca1b80aaa632b776d77f83.ext1
7 directories, 5 files
The solution works correctly for all cases.
Whirlpool hash is not in Python's stdlib, but there are both pure Python and C extensions that support it e.g., python-mhash.
To install it:
$ sudo apt-get install python-mhash
To use it:
import mhash
print mhash.MHASH(mhash.MHASH_WHIRLPOOL, "text to hash here").hexdigest()
Output:
cbdca4520cc5c131fc3a86109dd23fee2d7ff7be56636d398180178378944a4f41480b938608ae98da7eccbf39a4c79b83a8590c4cb1bace5bc638fc92b3e653
Invoking whirlpooldeep in Python
from subprocess import PIPE, STDOUT, Popen
def getoutput(cmd):
return Popen(cmd, stdout=PIPE, stderr=STDOUT).communicate()[0]
hash_ = getoutput(["whirlpooldeep", "-q", path]).rstrip()
git can provide leverage for problems that need to track a set of files based on their hashes.
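For large files it may be preferable to hash in chunks rather than read the whole file at once. A minimal sketch, assuming Python 3 and the standard `hashlib` module; the helper name `hexdigest_of_file` and the chunk size are illustrative choices. Whether `hashlib.new("whirlpool")` works depends on the OpenSSL build behind your Python, so the default here is md5.

```python
import hashlib

def hexdigest_of_file(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    h = hashlib.new(algorithm)  # raises ValueError for unsupported algorithms
    with open(path, "rb") as fobj:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

An empty file hashed this way with md5 gives d41d8cd98f00b204e9800998ecf8427e, the same digest that appears in the test trees above.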

I wasn't really happy with my first answer, since as I said there, this problem looks like it's best solved with perl. You already said in one edit of your question that you have perl on the OS X machine you want to run this on, so I gave it a shot.
It's hard to get it all right in bash, i.e. avoiding any quoting problems with odd filenames, and behaving nicely with corner-case filenames.
So here it is in perl, a complete solution to your problem. It runs over all the files/directories listed on its command line.
#!/usr/bin/perl -w
# whirlpool-rename.pl
# 2009 Peter Cordes <peter@cordes.ca>. Share and Enjoy!
use Fcntl; # for O_BINARY
use File::Find;
use Digest::Whirlpool;

# find callback, called once per directory entry
# $_ is the base name of the file, and we are chdired to that directory.
sub whirlpool_rename {
    print "find: $_\n";
    # my @components = split /\.(?:[[:xdigit:]]{128})?/; # remove .hash while we're at it
    my @components = split /\.(?!\.|$)/, $_, -1; # -1 to not leave out trailing dots
    if (!$components[0] && $_ ne ".") { # hidden file/directory
        $File::Find::prune = 1;
        return;
    }
    # don't follow symlinks or process non-regular-files
    return if (-l $_ || ! -f _);
    my $digest;
    eval {
        sysopen(my $fh, $_, O_RDONLY | O_BINARY) or die "$!";
        $digest = Digest->new( 'Whirlpool' )->addfile($fh);
    };
    if ($@) { # exception-catching structure from whirlpoolsum, distributed with Digest::Whirlpool.
        warn "whirlpool: couldn't hash $_: $!\n";
        return;
    }
    # strip old hashes from the name. not done during split only in the interests of readability
    @components = grep { !/^[[:xdigit:]]{128}$/ } @components;
    if ($#components == 0) {
        push @components, $digest->hexdigest;
    } else {
        my $ext = pop @components;
        push @components, $digest->hexdigest, $ext;
    }
    my $newname = join('.', @components);
    return if $_ eq $newname;
    print "rename $_ -> $newname\n";
    if (-e $newname) {
        warn "whirlpool: clobbering $newname\n";
        # maybe unlink $_ and return if $_ is older than $newname?
        # But you'd better check that $newname has the right contents then...
    }
    # This could be link instead of rename, but then you'd have to handle directories, and you can't make hardlinks across filesystems
    rename $_, $newname or warn "whirlpool: couldn't rename $_ -> $newname: $!\n";
}

#main
$ARGV[0] = "." if !@ARGV; # default to current directory
find({wanted => \&whirlpool_rename, no_chdir => 0}, @ARGV );
Advantages:
- actually uses whirlpool, so you can use this exact program directly. (after installing libperl-digest-whirlpool). Easy to change to any digest function you want, because instead of different programs with different output formats, you have the perl Digest common interface.
implements all other requirements: ignore hidden files (and files under hidden directories).
able to handle any possible filename without error or security problem. (Several people got this right in their shell scripts).
follows best practices for traversing a directory tree, by chdiring down into each directory (like my previous answer, with find -execdir). This avoids problems with PATH_MAX, and with directories being renamed while you're running.
clever handling of filenames that end with . foo..txt... -> foo..hash.txt...
Handles old filenames containing hashes already without renaming them and then renaming them back. (It strips any sequence of 128 hex digits that's surrounded by "." characters.) In the everything-correct case, no disk write activity happens, just reads of every file. Your current solution runs mv twice in the already-correctly-named case, causing directory metadata writes. And being slower, because that's two processes that have to be execced.
efficient. No programs are fork/execed, while most of the solutions that would actually work ended up having to sed something per-file.
Digest::Whirlpool is implemented with a natively-compiled shared lib, so it's not slow pure-perl. This should be faster than running a program on every file, esp. for small files.
Perl supports UTF-8 strings, so filenames with non-ascii characters shouldn't be a problem. (not sure if any multi-byte sequences in UTF-8 could include the byte that means ASCII '.' on its own. If that is possible, then you need UTF-8 aware string handling. sed doesn't know UTF-8. Bash's glob expressions may.)
easily extensible. When you go to put this into a real program, and you want to handle more corner cases, you can do so quite easily. e.g. decide what to do when you want to rename a file but the hash-named filename already exists.
good error reporting. Most shell scripts have this, though, by passing along errors from the progs they run.
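The name handling the Perl script performs (split on periods, drop any old 128-digit hash component, put the new digest before the last component) can be sketched in Python as well. This is an illustrative re-statement, not a translation of the whole script; the function name `name_with_digest` is hypothetical, and hidden names are assumed to have been filtered out beforehand.

```python
import re

def name_with_digest(name, digest):
    """Return `name` with old 128-hex-digit components removed and `digest`
    inserted before the last component (or appended if there is only one)."""
    # Split on a period unless it is followed by another period or the end
    # of the name, so runs of dots and trailing dots stay attached.
    components = re.split(r"\.(?!\.|$)", name)
    components = [c for c in components
                  if not re.fullmatch(r"[0-9a-fA-F]{128}", c)]
    if len(components) == 1:
        components.append(digest)
    else:
        components.insert(len(components) - 1, digest)
    return ".".join(components)
```

Because old hashes are removed before the new one is inserted, a file whose name already carries the correct digest maps to its own name, and the caller can skip the rename entirely.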

find . -type f -print | while IFS= read -r file
do
    hash=`$hashcommand "$file"`
    filename=${file%.*}
    extension=${file##*.}
    mv "$file" "$filename.$hash.$extension"
done

You might want to store the results in one file, like in
find . -type f -exec md5sum {} \; > MD5SUMS
If you really want one file per hash:
find . -type f | while read -r f; do md5sum "$f" > "$f.md5"; done
or even
find . -type f | while read -r f; do g=`md5sum "$f" | awk '{print $1}'`; echo "$g $f" > "$f-$g.md5"; done

Here's my take on it, in bash. Features: skips non-regular files; correctly deals with files with weird characters (i.e. spaces) in their names; deals with extensionless filenames; skips already-hashed files, so it can be run repeatedly (although if files are modified between runs, it adds the new hash rather than replacing the old one). I wrote it using md5 -q as the hash function; you should be able to replace this with anything else, as long as it only outputs the hash, not something like filename => hash.
find -x . -type f -print0 | while IFS="" read -r -d $'\000' file; do
    hash="$(md5 -q "$file")" # replace with your favorite hash function
    [[ "$file" == *."$hash" ]] && continue # skip files that already end in their hash
    dirname="$(dirname "$file")"
    basename="$(basename "$file")"
    base="${basename%.*}"
    [[ "$base" == *."$hash" ]] && continue # skip files that already end in hash + extension
    if [[ "$basename" == "$base" ]]; then
        extension=""
    else
        extension=".${basename##*.}"
    fi
    mv "$file" "$dirname/$base.$hash$extension"
done

In sh or bash, two versions. One limits itself to files with extensions...
hash () {
    #openssl md5 t.sh | sed -e 's/.* //'
    whirlpool "$1"
}
find . -type f -a -name '*.*' | while read f; do
    # remove the echo to run this for real
    echo mv "$f" "${f%.*}.whirlpool-`hash "$f"`.${f##*.}"
done
Testing...
...
mv ./bash-4.0/signames.h ./bash-4.0/signames.whirlpool-d71b117a822394a5b273ea6c0e3f4dc045b1098326d39864564f1046ab7bd9296d5533894626288265a1f70638ee3ecce1f6a22739b389ff7cb1fa48c76fa166.h
...
And this more complex version processes all plain files, with or without extensions, with or without spaces and odd characters, etc, etc...
hash () {
    #openssl md5 t.sh | sed -e 's/.* //'
    whirlpool "$1"
}
find . -type f | while read f; do
    name=${f##*/}
    case "$name" in
        *.*) extension=".${name##*.}" ;;
        *) extension= ;;
    esac
    # remove the echo to run this for real
    echo mv "$f" "${f%/*}/${name%.*}.whirlpool-`hash "$f"`$extension"
done

whirlpool isn't a very common hash. You'll probably have to install a program to compute it. e.g. Debian/Ubuntu include a "whirlpool" package. The program prints the hash of one file by itself. apt-cache search whirlpool shows that some other packages support it, including the interesting md5deep.
Some of the earlier answers will fail on filenames with spaces in them. If this is the case, but your files don't have any newlines in the filename, then you can safely use \n as a delimiter.
oldifs="$IFS"
IFS="
"
for i in $(find -type f); do echo "$i";done
#output
# ./base
# ./base2
# ./normal.ext
# ./trick.e "xt
# ./foo bar.dir ext/trick' (name "- }$foo.ext{}.ext2
IFS="$oldifs"
Try without setting IFS to see why it matters.
I was going to try something with IFS="."; find -print0 | while read -a array, to split on "." characters, but I normally never use array variables. There's no easy way that I see in the man page to insert the hash as the second-last array index, and push down the last element (the file extension, if it had one.) Any time bash array variables look interesting, I know it's time to do what I'm doing in perl instead! See the gotchas for using read:
http://tldp.org/LDP/abs/html/gotchas.html#BADREAD0
I decided to use another technique I like: find -exec sh -c. It's the safest, since you're not parsing filenames.
This should do the trick:
find -regextype posix-extended -type f -not -regex '.*\.[a-fA-F0-9]{128}.*' \
    -execdir bash -c 'for i in "${@#./}";do
        hash=$(whirlpool "$i");
        ext=".${i##*.}"; base="${i%.*}";
        [ "$base" = "$i" ] && ext="";
        newname="$base.$hash$ext";
        echo "ext:$ext $i -> $newname";
        false mv --no-clobber "$i" "$newname";done' \
    dummy {} +
# take out the "false" before the mv, and optionally take out the echo.
# false ignores its arguments, so it's there so you can
# run this to see what will happen without actually renaming your files.
-execdir bash -c 'cmd' dummy {} + has the dummy arg there because the first arg after the command becomes $0 in the shell's positional parameters, not part of "$@" that the for loop iterates over. I use execdir instead of exec so I don't have to deal with directory names (or the possibility of exceeding PATH_MAX for nested dirs with long names, when the actual filenames are all short enough.)
-not -regex prevents this from being applied twice to the same file. Whirlpool is an extremely long hash, though, and mv says File name too long if I run it twice without that check (on an XFS filesystem).
Files with no extension get basename.hash. I had to check specially to avoid appending a trailing ., or getting the basename as the extension. ${@#./} strips out the leading ./ that find puts in front of every filename, so there is no "." in the whole string for files with no extension.
mv --no-clobber may be a GNU extension. If you don't have GNU mv, do something else if you want to avoid deleting existing files (e.g. you run this once, some of the same file are added to the directory with their old names; you run it again.) OTOH, if you want that behaviour, just take it out.
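For systems without mv --no-clobber, the same guard can be approximated in a few lines of Python. This is a sketch, not a drop-in replacement: the function name `rename_no_clobber` is hypothetical, and the existence check followed by the rename is not atomic, so another process could still create the target in between.

```python
import os

def rename_no_clobber(src, dst):
    """Rename `src` to `dst` unless `dst` already exists.

    Returns True if the rename happened, False if it was skipped.
    """
    if os.path.lexists(dst):  # lexists also catches dangling symlinks
        return False
    os.rename(src, dst)
    return True
```

Returning a flag instead of raising leaves the decision about what to do with a name collision to the caller.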
My solution should work even when filenames contain a newline (they can, you know!), or any other possible character. It would be faster and easier in perl, but you asked for shell.
wallenborn's solution for making one file with all the checksums (instead of renaming the original) is pretty good, but inefficient. Don't run md5sum once per file, run it on as many files at once as will fit on its command line:
find dir -type f -print0 | xargs -0 md5sum > dir.md5
or with GNU find, xargs is built in (note the + instead of ';')
find dir -type f -exec md5sum {} + > dir.md5
if you just use find -print | xargs -d'\n', you will be screwed up by file names with quote marks in them, so be careful. If you don't know what files you might someday run this script on, always try to use print0 or -exec. This is esp. true if filenames are supplied by untrusted users (i.e. could be an attack vector on your server.)

In response to your updated question:
If anyone can comment on how I can avoid looking in hidden directories with my BASH Script, it would be much appreciated.
You can avoid hidden directories with find by using
find -name '.?*' -prune -o \( -type f -print0 \)
-name '.*' -prune will prune ".", and stop without doing anything. :/
I'd still recommend my Perl version, though. I updated it... You may still need to install Digest::Whirlpool from CPAN, though.

Hm, interesting problem.
Try the following (the mktest function is just for testing -- TDD for bash! :)
Edit:
Added support for whirlpool hashes.
code cleanup
better quoting of filenames
changed array-syntax for test part -- should now work with most korn-like shells. Note that pdksh does not support :-based parameter expansion (or rather it means something else)
Note also that when in md5-mode it fails for filenames with whirlpool-like hashes, and possibly vice-versa.
#!/usr/bin/env bash
#Tested with:
# GNU bash, version 4.0.28(1)-release (x86_64-pc-linux-gnu)
# ksh (AT&T Research) 93s+ 2008-01-31
# mksh @(#)MIRBSD KSH R39 2009/08/01 Debian 39.1-4
# Does not work with pdksh, dash
DEFAULT_SUM="md5"
#Takes a parameter, as root path
# as well as an optional parameter, the hash function to use (md5 or wp for whirlpool).
main()
{
case $2 in
"wp")
export SUM="wp"
;;
"md5")
export SUM="md5"
;;
*)
export SUM=$DEFAULT_SUM
;;
esac
# For all visible files in all visible subfolders, move the file
# to a name including the correct hash:
find $1 -type f -not -regex '.*/\..*' -exec $0 hashmove '{}' \;
}
# Given a file named in $1 with full path, calculate its hash.
# Output the filename, with the hash inserted before the extension
# (if any) -- or: replace an existing hash with the new one,
# if a hash already exists.
hashname_md5()
{
pathname="$1"
full_hash=`md5sum "$pathname"`
hash=${full_hash:0:32}
filename=`basename "$pathname"`
prefix=${filename%%.*}
suffix=${filename#$prefix}
#If the suffix starts with something that looks like an md5sum,
#remove it:
suffix=`echo $suffix|sed -r 's/\.[a-z0-9]{32}//'`
echo "$prefix.$hash$suffix"
}
# Same as hashname_md5 -- but uses whirlpool hash.
hashname_wp()
{
pathname="$1"
hash=`whirlpool "$pathname"`
filename=`basename "$pathname"`
prefix=${filename%%.*}
suffix=${filename#$prefix}
#If the suffix starts with something that looks like a whirlpool hash,
#remove it:
suffix=`echo $suffix|sed -r 's/\.[a-z0-9]{128}//'`
echo "$prefix.$hash$suffix"
}
#Given a filepath $1, move/rename it to a name including the filehash.
# Try to replace an existing hash, and do not move a file if no update is
# needed.
hashmove()
{
pathname="$1"
filename=`basename "$pathname"`
path="${pathname%%/$filename}"
case $SUM in
"wp")
hashname=`hashname_wp "$pathname"`
;;
"md5")
hashname=`hashname_md5 "$pathname"`
;;
*)
echo "Unknown hash requested"
exit 1
;;
esac
if [[ "$filename" != "$hashname" ]]
then
echo "renaming: $pathname => $path/$hashname"
mv "$pathname" "$path/$hashname"
else
echo "$pathname up to date"
fi
}
# Create some test data under /tmp
mktest()
{
root_dir=$(tempfile)
rm "$root_dir"
mkdir "$root_dir"
i=0
test_files[$((i++))]='test'
test_files[$((i++))]='testfile, no extention or spaces'
test_files[$((i++))]='.hidden'
test_files[$((i++))]='a hidden file'
test_files[$((i++))]='test space'
test_files[$((i++))]='testfile, no extention, spaces in name'
test_files[$((i++))]='test.txt'
test_files[$((i++))]='testfile, extention, no spaces in name'
test_files[$((i++))]='test.ab8e460eac3599549cfaa23a848635aa.txt'
test_files[$((i++))]='testfile, With (wrong) md5sum, no spaces in name'
test_files[$((i++))]='test spaced.ab8e460eac3599549cfaa23a848635aa.txt'
test_files[$((i++))]='testfile, With (wrong) md5sum, spaces in name'
test_files[$((i++))]='test.8072ec03e95a26bb07d6e163c93593283fee032db7265a29e2430004eefda22ce096be3fa189e8988c6ad77a3154af76f582d7e84e3f319b798d369352a63c3d.txt'
test_files[$((i++))]='testfile, With (wrong) whirlpoolhash, no spaces in name'
test_files[$((i++))]='test spaced.8072ec03e95a26bb07d6e163c93593283fee032db7265a29e2430004eefda22ce096be3fa189e8988c6ad77a3154af76f582d7e84e3f319b798d369352a63c3d.txt'
test_files[$((i++))]='testfile, With (wrong) whirlpoolhash, spaces in name'
test_files[$((i++))]='test space.txt'
test_files[$((i++))]='testfile, extention, spaces in name'
test_files[$((i++))]='test multi-space .txt'
test_files[$((i++))]='testfile, extention, multiple consequtive spaces in name'
test_files[$((i++))]='test space.h'
test_files[$((i++))]='testfile, short extention, spaces in name'
test_files[$((i++))]='test space.reallylong'
test_files[$((i++))]='testfile, long extention, spaces in name'
test_files[$((i++))]='test space.reallyreallyreallylong.tst'
test_files[$((i++))]='testfile, long extention, double extention, might look like hash, spaces in name'
test_files[$((i++))]='utf8test1 - æeiaæå.txt'
test_files[$((i++))]='testfile, extention, utf8 characters, spaces in name'
test_files[$((i++))]='utf8test1 - 漢字.txt'
test_files[$((i++))]='testfile, extention, Japanese utf8 characters, spaces in name'
for s in . sub1 sub2 sub1/sub3 .hidden_dir
do
#note -p not needed as we create dirs top-down
#fails for "." -- but the hack allows us to use a single loop
#for creating testdata in all dirs
mkdir $root_dir/$s
dir=$root_dir/$s
i=0
while [[ $i -lt ${#test_files[*]} ]]
do
filename=${test_files[$((i++))]}
echo ${test_files[$((i++))]} > "$dir/$filename"
done
done
echo "$root_dir"
}
# Run test, given a hash-type as first argument
runtest()
{
sum=$1
root_dir=$(mktest)
echo "created dir: $root_dir"
echo "Running first test with hashtype $sum:"
echo
main $root_dir $sum
echo
echo "Running second test:"
echo
main $root_dir $sum
echo "Updating all files:"
find $root_dir -type f | while read f
do
echo "more content" >> "$f"
done
echo
echo "Running final test:"
echo
main $root_dir $sum
#cleanup:
rm -r $root_dir
}
# Test md5 and whirlpool hashes on generated data.
runtests()
{
runtest md5
runtest wp
}
# In order to be able to call the script recursively, without splitting off
# functions to separate files:
case "$1" in
'test')
runtests
;;
'hashname')
hashname "$2"
;;
'hashmove')
hashmove "$2"
;;
'run')
main "$2" "$3"
;;
*)
echo "Use with: $0 test - or if you just want to try it on a folder:"
echo " $0 run path (implies md5)"
echo " $0 run path md5"
echo " $0 run path wp"
;;
esac

using zsh:
$ ls
a.txt
b.txt
c.txt
The magic:
$ FILES=**/*(.)
$ # */ stupid syntax coloring thinks this is a comment
$ for f in $FILES; do hash=`md5sum $f | cut -f1 -d" "`; mv $f "$f:r.$hash.$f:e"; done
$ ls
a.60b725f10c9c85c70d97880dfe8191b3.txt
b.3b5d5c3712955042212316173ccf37be.txt
c.2cd6ee2c70b0bde53fbe6cac3c8b8bb1.txt
Happy deconstruction!
Edit: added files in subdirectories and quotes around mv argument

Ruby:
#!/usr/bin/env ruby
require 'digest/md5'

Dir.glob('**/*') do |f|
  next unless File.file? f
  next if /\.md5sum-[0-9a-f]{32}/ =~ f
  md5sum = Digest::MD5.file f
  newname = "%s/%s.md5sum-%s%s" %
    [File.dirname(f), File.basename(f,'.*'), md5sum, File.extname(f)]
  File.rename f, newname
end
Handles filenames that have spaces, no extension, and that have already been hashed.
Ignores hidden files and directories — add File::FNM_DOTMATCH as the second argument of glob if that's desired.

Related

How to test in shell if a path is already inside environment $*PATH? [duplicate]

With /bin/bash, how would I detect if a user has a specific directory in their $PATH variable?
For example
if [ -p "$HOME/bin" ]; then
echo "Your path is missing ~/bin, you might want to add it."
else
echo "Your path is correctly set"
fi

Using grep is overkill, and can cause trouble if you're searching for anything that happens to include RE metacharacters. This problem can be solved perfectly well with bash's builtin [[ command:
if [[ ":$PATH:" == *":$HOME/bin:"* ]]; then
echo "Your path is correctly set"
else
echo "Your path is missing ~/bin, you might want to add it."
fi
Note that adding colons around both the expansion of $PATH and the path to search for solves the substring match issue; double-quoting the path avoids trouble with metacharacters.
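The colon-wrapping trick is not specific to bash. A small Python sketch of the same idea, for illustration only; the function name `dir_in_path` and its parameters are assumptions made for this example:

```python
import os

def dir_in_path(directory, path=None, sep=os.pathsep):
    """Return True if `directory` is an exact entry of the PATH-style string."""
    if path is None:
        path = os.environ.get("PATH", "")
    # Wrapping both sides in separators turns "is an entry" into a plain
    # substring test that also works for the first and last entries.
    return f"{sep}{directory}{sep}" in f"{sep}{path}{sep}"
```

With the separators in place, searching for "bin" no longer matches "/usr/bin", because ":bin:" does not occur in ":/usr/bin:/bin:".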

There is absolutely no need to use external utilities like grep for this. Here is what I have been using, which should be portable back to even legacy versions of the Bourne shell.
case :$PATH: # notice colons around the value
in *:$HOME/bin:*) ;; # do nothing, it's there
*) echo "$HOME/bin not in $PATH" >&2;;
esac

Here's how to do it without grep:
if [[ $PATH == ?(*:)$HOME/bin?(:*) ]]
The key here is to make the colons and wildcards optional using the ?() construct. There shouldn't be any problem with metacharacters in this form, but if you want to include quotes this is where they go:
if [[ "$PATH" == ?(*:)"$HOME/bin"?(:*) ]]
This is another way to do it using the match operator (=~) so the syntax is more like grep's:
if [[ "$PATH" =~ (^|:)"${HOME}/bin"(:|$) ]]

Something really simple and naive:
echo "$PATH"|grep -q whatever && echo "found it"
Where whatever is what you are searching for. Instead of && you can put $? into a variable or use a proper if statement.
Limitations include:
The above will match substrings of larger paths (try matching on "bin" and it will probably find it, despite the fact that "bin" isn't in your path, /bin and /usr/bin are)
The above won't automatically expand shortcuts like ~
Or using a perl one-liner:
perl -e 'exit(!(grep(m{^/usr/bin$},split(":", $ENV{PATH}))) > 0)' && echo "found it"
That still has the limitation that it won't do any shell expansions, but it doesn't fail if a substring matches. (The above matches "/usr/bin", in case that wasn't clear).

Here's a pure-bash implementation that will not pick up false-positives due to partial matching.
if [[ $PATH =~ ^/usr/sbin:|:/usr/sbin:|:/usr/sbin$ ]] ; then
do stuff
fi
What's going on here? The =~ operator uses regex pattern support present in bash starting with version 3.0. Three patterns are being checked, separated by regex's OR operator |.
All three sub-patterns are relatively similar, but their differences are important for avoiding partial-matches.
In regex, ^ matches the beginning of a line and $ matches the end. As written, the first pattern will only evaluate to true if the path it's looking for is the first value within $PATH. The third pattern will only evaluate to true if the path it's looking for is the last value within $PATH. The second pattern will evaluate to true when it finds the path it's looking for in between other values, since it looks for the delimiter that the $PATH variable uses, :, on either side of the path being searched for.
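The three alternatives can be folded into a single pattern of the form (start or colon), directory, (colon or end). A hedged Python sketch of that pattern follows; the function name `path_has_entry` is made up for this example. Unlike the bash test above, it also matches when the directory is the only entry in the variable, and it escapes the directory so that regex metacharacters in it are treated literally.

```python
import re

def path_has_entry(path, directory):
    """Return True if `directory` is a whole colon-separated entry of `path`."""
    # re.escape keeps characters such as '.' or '+' in the directory literal;
    # \Z anchors at the very end of the string.
    pattern = r"(?:^|:)" + re.escape(directory) + r"(?:\Z|:)"
    return re.search(pattern, path) is not None
```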

I wrote the following shell function to report if a directory is listed in the current PATH. This function is POSIX-compatible and will run in compatible shells such as Dash and Bash (without relying on Bash-specific features).
It includes functionality to convert a relative path to an absolute path. It uses the readlink or realpath utilities for this but these tools are not needed if the supplied directory does not have .. or other links as components of its path. Other than this, the function doesn’t require any programs external to the shell.
# Check that the specified directory exists – and is in the PATH.
is_dir_in_path()
{
if [ -z "${1:-}" ]; then
printf "The path to a directory must be provided as an argument.\n" >&2
return 1
fi
# Check that the specified path is a directory that exists.
if ! [ -d "$1" ]; then
printf "Error: ‘%s’ is not a directory.\n" "$1" >&2
return 1
fi
# Use absolute path for the directory if a relative path was specified.
if command -v readlink >/dev/null ; then
dir="$(readlink -f "$1")"
elif command -v realpath >/dev/null ; then
dir="$(realpath "$1")"
else
case "$1" in
/*)
# The path of the provided directory is already absolute.
dir="$1"
;;
*)
# Prepend the path of the current directory.
dir="$PWD/$1"
;;
esac
printf "Warning: neither ‘readlink’ nor ‘realpath’ are available.\n"
printf "Ensure that the specified directory does not contain ‘..’ in its path.\n"
fi
# Check that dir is in the user’s PATH.
case ":$PATH:" in
*:"$dir":*)
printf "‘%s’ is in the PATH.\n" "$dir"
return 0
;;
*)
printf "‘%s’ is not in the PATH.\n" "$dir"
return 1
;;
esac
}
The part using :$PATH: ensures that the pattern also matches if the desired path is the first or last entry in the PATH. This clever trick is based upon this answer by Glenn Jackman on Unix & Linux.

This is a brute force approach but it works in all cases except when a path entry contains a colon. And no programs other than the shell are used.
previous_IFS=$IFS
dir_in_path='no'
export IFS=":"
for p in $PATH
do
[ "$p" = "/path/to/check" ] && dir_in_path='yes'
done
[ "$dir_in_path" = "no" ] && export PATH="$PATH:/path/to/check"
export IFS=$previous_IFS

$PATH is a list of strings separated by : that describe a list of directories. A directory is a list of strings separated by /. Two different strings may point to the same directory (like $HOME and ~, or /usr/local/bin and /usr/local/bin/). So we must fix the rules for what we want to compare. I suggest comparing whole strings rather than physical directories, after removing duplicate and trailing /.
First remove duplicate and trailing / from $PATH:
echo $PATH | tr -s / | sed 's/\/:/:/g;s/:/\n/g'
Now suppose $d contains the directory you want to check. Then pipe the previous command to check $d in $PATH.
echo $PATH | tr -s / | sed 's/\/:/:/g;s/:/\n/g' | grep -q "^$d$" || echo "missing $d"
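The same normalize-then-compare approach, sketched in Python for illustration. The function names are hypothetical. One difference from the pipeline above is that this sketch also trims a trailing slash on the last entry, which the sed expression leaves in place because it only rewrites a slash that is followed by a colon.

```python
import re

def normalize_entries(path, sep=":"):
    """Split a PATH-style string, collapsing repeated and trailing slashes."""
    entries = []
    for entry in path.split(sep):
        entry = re.sub(r"/+", "/", entry)  # same effect as `tr -s /`
        if len(entry) > 1:
            entry = entry.rstrip("/")  # drop a trailing slash, but keep "/" itself
        entries.append(entry)
    return entries

def path_contains(path, directory):
    """Whole-string comparison against the normalized entries."""
    return directory in normalize_entries(path)
```

Comparing against whole entries avoids the anchoring and metacharacter concerns of the grep version, since no pattern is built from the directory name.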

A better and faster solution is this:
DIR=/usr/bin
[[ " ${PATH//:/ } " =~ " $DIR " ]] && echo Found it || echo Not found
I personally use this in my bash prompt to add icons when I go to directories that are in $PATH.

Renaming sequential image files with gaps

I'm looking for a way to rename a list of image files with gaps to be sequential. Also I want to give them a padding of 4. I'm currently using Python 2.7 and Linux bash to program this.
Example:
1.png
2.png
3.png
20.png
21.png
50.png
Should turn into:
0001.png
0002.png
0003.png
0004.png
0005.png
0006.png
I would also like the file names to be prefixed with the path of the directory that they are currently in.
Example:
c_users_johnny_desktop_images.0001.png
c_users_johnny_desktop_images.0002.png
c_users_johnny_desktop_images.0003.png
c_users_johnny_desktop_images.0004.png
c_users_johnny_desktop_images.0005.png
c_users_johnny_desktop_images.0006.png
Any help would be greatly appreciated! :)
Cheers

this is python
import os
#first collect all files that start with a number and end with .png
my_files = [f for f in os.listdir(some_directory) if f[0].isdigit() and f.endswith(".png")]
#sort them based on the number
sorted_files = sorted(my_files,key=lambda x:int(x.split(".")[0])) # sort the file names by starting number
#rename them sequentially
for i,fn in enumerate(sorted_files,1): #thanks wim
    # os.listdir returns bare names, so join them with the directory
    os.rename(os.path.join(some_directory,fn),os.path.join(some_directory,"{0:04d}.png".format(i)))
I could have used list.sort(key=...) to sort in place but I figured this would be marginally more verbose and readable ...
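A slightly fuller sketch that also adds the directory-based prefix from the question might look like the following. The function names plan_renames and rename_in are hypothetical, and the prefix rule (absolute path with separators turned into "_") is an assumption based on the example names in the question; it does not lowercase anything.

```python
import os

def plan_renames(names, prefix=""):
    """Map gapped numeric names like 20.png to sequential zero-padded names."""
    numbered = [n for n in names if n.endswith(".png") and n[:-4].isdecimal()]
    numbered.sort(key=lambda n: int(n[:-4]))  # numeric, not alphabetical, order
    return [(old, "{0}{1:04d}.png".format(prefix, i))
            for i, old in enumerate(numbered, 1)]

def rename_in(directory):
    """Apply the plan inside directory, using its path as the prefix."""
    # Assumed prefix rule: absolute path with separators replaced by "_".
    prefix = os.path.abspath(directory).strip(os.sep).replace(os.sep, "_") + "."
    for old, new in plan_renames(os.listdir(directory), prefix):
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

Separating the planning step from the renaming step makes it possible to print the plan first, much like a dry run, before any file is touched.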

Try doing this in a shell :
rename -n '
$s = substr(join("_", split("/", $ENV{PWD})), 1) . ".";
s/(\d+)\.png/$s . sprintf("%04d", ++$c) . ".png"/e
' *.png
Output :
1.png -> c_users_johnny_desktop_images.0001.png
2.png -> c_users_johnny_desktop_images.0002.png
3.png -> c_users_johnny_desktop_images.0003.png
20.png -> c_users_johnny_desktop_images.0004.png
21.png -> c_users_johnny_desktop_images.0005.png
50.png -> c_users_johnny_desktop_images.0006.png
rename is http://search.cpan.org/~pederst/rename/ and is the default rename command on many distros.
Once you have tested the command and the output looks right, remove the -n switch to do it for real.

Blah Blah Blah. CSH is bad. BASH is good. Python is better. Bah humbug. I still use TCSH...
% set i = 1
% foreach FILE ( `ls *[0-9].png | sort -n` )
echo mv $FILE `printf %04d $i`.png ; @ i ++
end
Output:
mv 1.png 0001.png
mv 2.png 0002.png
mv 3.png 0003.png
mv 20.png 0004.png
mv 21.png 0005.png
mv 50.png 0006.png
Responding to comments:
Still need c_users_johnny_desktop_images.
Ok, so use:
echo mv $FILE c_users_johnny_desktop_images.`printf %04d $i`.png ; @ i ++
It's not like my example was hard to read.
Correction: Perhaps you meant to automatically extract the current directory name and incorporate it. E.g.:
echo mv $FILE `echo $cwd | sed -e 's|^/||' -e 's|/|_|g'`.`printf %04d $i`.png ; @ i ++
-
are globs not present in tcsh ? Your parsing of ls seems scary
Of course globs are present. That's what we are passing into ls. But globbing gives us a list that is sorted alphabetically, as in 1,2,20,21,3,50. We want a numerical sort, as in 1,2,3,20,21,50. Standard problem when we don't have leading zeros in the numbers.
sort -n does a numeric sort. ls gives us a newline after each filename. We could just as easily write:
foreach FILE ( `echo *[0-9].png | tr ' ' '\012' | sort -n` )
But I'm lazy and ls does the newline for me. What's so scary about it?

shell script to find a string and print the current line and next line recursively

I have written a small script which searches for a string and prints the matching line, but I'm a little confused about how to print the next line as well. I am OK with bash/perl/python.
#!/bin/bash
CURRENT_DIR=`pwd`
cnt=0
for dir in $(find $CURRENT_DIR -type d)
do
for myFile in $dir/*
do
if [ -f "$myFile" ]; then
cat $myFile | while myLine=`line`
do
allFile="$myLine"
if echo "$myLine" | grep -q $1 ; then
echo "$myFile" "$allFile" ""
fi
#echo 'expr $count+1'
#echo "$allFile" ""
done #LINE
fi
done #FILE
done # DIRECTORY

If your grep is GNU:
grep -A1 pattern file

Here is an example in bash that searches a set of text files in a directory. You can adapt it as needed.
Within one directory:
grep "search string" *.txt
To also search sub-directories:
find /full/path/to/dir -name "*.txt" -exec grep "search string" {} \;
Hope this helps.

you can do it using awk:
awk '/Message/{print;getline;print}' your_file
The above works on a single file. It prints each line that matches the pattern together with the line that follows it.
To do this recursively for all the files in a directory structure:
find . -type f | xargs awk '/Message/{print;getline;print}'
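Since the question also allows Python, the same idea can be sketched there. The names match_with_next and search_tree are hypothetical, and matching is by plain substring rather than by regular expression. Unlike the awk getline version, a line that follows a match is still tested for a match itself.

```python
import os

def match_with_next(lines, needle):
    """Yield (line, next_line) for every line containing needle."""
    lines = list(lines)
    for i, line in enumerate(lines):
        if needle in line:
            nxt = lines[i + 1] if i + 1 < len(lines) else None
            yield line, nxt

def search_tree(top, needle):
    """Walk top recursively and print each match followed by the next line."""
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, errors="replace") as fh:
                    content = fh.read().splitlines()
            except OSError:
                continue  # unreadable file: skip it
            for line, nxt in match_with_next(content, needle):
                print("{0}: {1}".format(path, line))
                if nxt is not None:
                    print("{0}: {1}".format(path, nxt))
```

A call such as search_tree(".", "Message") would cover the current directory and everything below it.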

Add line on top of each Python file in current and sub directories

I'm on an Ubuntu platform and have a directory containing many .py files and subdirectories (also containing .py files). I would like to add a line of text to the top of each .py file. What's the easiest way to do that using Perl, Python, or shell script?

find . -name \*.py | xargs sed -i '1i Line of text here'
Edit: from tchrist's comment, handle filenames with spaces.
Assuming you have GNU find and xargs (as you specified the linux tag on the question)
find . -name \*.py -print0 | xargs -0 sed -i '1i Line of text here'
Without GNU tools, you'd do something like:
while IFS= read -r filename; do
{ echo "new line"; cat "$filename"; } > tmpfile && mv tmpfile "$filename"
done < <(find . -name \*.py -print)

for a in `find . -name '*.py'` ; do cp "$a" "$a.cp" ; echo "Added line" > "$a" ; cat "$a.cp" >> "$a" ; rm "$a.cp" ; done

#!/usr/bin/perl
use Tie::File;
for (@ARGV) {
tie my @array, 'Tie::File', $_ or die $!;
unshift @array, "A new line";
}
To process all .py files in a directory recursively run this command in your shell:
find . -name '*.py' | xargs perl script.pl

This will
recursively walk all directories starting with the current working directory
modify only those files whose filename ends with '.py'
preserve file permissions (unlike open(filename,'w'))
fileinput also gives you the option of backing up your original files before modifying them.
import fileinput
import os
import sys
for root, dirs, files in os.walk('.'):
    for line in fileinput.input(
            (os.path.join(root,name) for name in files if name.endswith('.py')),
            inplace=True,
            # backup='.bak' # uncomment this if you want backups
            ):
        if fileinput.isfirstline():
            sys.stdout.write('Add line\n{l}'.format(l=line))
        else:
            sys.stdout.write(line)

import os
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith('.py'):
            path = os.path.join(root, file)  # file is only the base name
            with open(path, 'r') as file_ptr:
                old_content = file_ptr.read()
            with open(path, 'w') as file_ptr:
                file_ptr.write(your_new_line)
                file_ptr.write(old_content)
As far as I know, you can't insert at the beginning of a file in Python. You can only rewrite the file or append to it.
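Because the file has to be rewritten anyway, one common variation is to write the new content to a temporary file in the same directory and then swap it into place. The sketch below assumes Python 3; prepend_line and prepend_to_tree are hypothetical names, and the added line is assumed to be encoded as UTF-8.

```python
import os
import shutil
import tempfile

def prepend_line(path, text):
    """Rewrite path so that text becomes its first line."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    # Binary mode avoids guessing the encoding of the existing content.
    with os.fdopen(fd, "wb") as out, open(path, "rb") as src:
        out.write(text.encode("utf-8") + b"\n")
        shutil.copyfileobj(src, out)
    shutil.copymode(path, tmp)  # keep the original permission bits
    os.replace(tmp, path)       # swap the new file into place

def prepend_to_tree(top, text, suffix=".py"):
    """Prepend text to every file under top whose name ends with suffix."""
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.endswith(suffix):
                prepend_line(os.path.join(root, name), text)
```

Writing to a temporary file first means an interrupted run leaves the original file untouched instead of half-written.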

What's the easiest way to do that using Perl, Python, or shell script?
I'd use Perl, but that's because I know Perl much better than I know Python. Heck, maybe I'd do this in Python just to learn it a bit better.
The easiest way is to use the language that you're familiar with and can work with. And, that's also probably the best way too.
If these are all Python scripts, I take it you know Python or have access to a bunch of people who know Python. So, you're probably better off doing the project in Python.
However, it's possible with shell scripts too, and if you know shell the best, be my guest. Here's a little, completely untested shell script right off the top of my head:
find . -type f -name "*.py" | while read file
do
sed '1i\
I want to insert this line
' "$file" > "$file.temp"
mv "$file.temp" "$file"
done

removing extensions in subdirectories

I need to remove the extension ".tex":
./1-aoeeu/1.tex
./2-thst/2.tex
./3-oeu/3.tex
./4-uoueou/4.tex
./5-aaa/5.tex
./6-oeua/6.tex
./7-oue/7.tex
Please, do it with some tools below:
Sed and find
Ruby
Python
My Poor Try:
$find . -maxdepth 2 -name "*.tex" -ok mv `sed 's#.tex##g' {}` {} +

A Python script to do the same:
import os, shutil
def remove_ext(top, ext):
    # os.path.walk exists only in Python 2; os.walk works in Python 2 and 3
    for dirname, dirnames, fnames in os.walk(top):
        for f in fnames:
            if f.endswith(ext):
                path = os.path.join(dirname, f)
                shutil.move(path, path[:-len(ext)])
remove_ext('/some/path', '.tex')

One way, not necessarily the fastest (but at least the quickest developed):
pax> for i in *.c */*.c */*/*.c ; do
...> j=$(echo "$i" | sed 's/\.c$//')
...> echo mv "$i" "$j"
...> done
It's equivalent since your maxdepth is 2. The script is just echoing the mv command at the moment (for test purposes) and working on C files (since I had no tex files to test with).
Or, you can use find with all its power thus:
pax> find . -maxdepth 2 -name '*.tex' | while read line ; do
...> j=$(echo "$line" | sed 's/\.tex$//')
...> mv "$line" "$j"
...> done

Using "for i in" may cause "too many parameters" errors.
A better approach is to pipe find into the next process.
Example:
find . -type f -name "*.tex" | while IFS= read -r file
do
mv "$file" "${file%.tex}"
done
(Note: this handles file names with spaces, but not ones containing newlines.)

Using bash, find and mv from your base directory.
for i in $(find . -maxdepth 2 -type f -name "*.tex");
do
mv "$i" "$(echo "$i" | sed 's|.tex$||')";
done
Variation 2 based on other answers here.
find . -type f -maxdepth 2 -name "*.tex" | while read line;
do
mv "$line" "${line%%.tex}";
done
PS: I did not get the part about escaping '.' by pax...
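On the point about escaping the '.': in a regular expression an unescaped dot matches any single character, so .tex$ would also match a name ending in "atex". The sketch below uses Python's re module rather than sed, on the assumption that the dot behaves the same way in both. The sample names are made up for illustration. With find -name "*.tex" every input already ends in ".tex", so the unescaped form happens to be harmless there.

```python
import re

def strip_ext_unescaped(name):
    # "." matches any character, so this also strips a suffix like "atex".
    return re.sub(r".tex$", "", name)

def strip_ext_escaped(name):
    # "\." matches only a literal dot.
    return re.sub(r"\.tex$", "", name)

if __name__ == "__main__":
    for sample in ("1-aoeeu/1.tex", "notes/latex"):
        print(sample, strip_ext_unescaped(sample), strip_ext_escaped(sample))
```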

There's an excellent Perl rename script that ships with some distributions, and otherwise you can find it on the web. (I'm not sure where it resides officially, but this is it). Check if your rename was written by Larry Wall (AUTHOR section of man rename). It will let you do something like the following, where [-maxdepth 2] marks an optional argument:
find . [-maxdepth 2] -name "*.tex" -exec rename 's/\.tex$//' '{}' \;
Using -exec is simplest here because there's only one action to perform, and it's not too expensive to invoke rename multiple times. If you need to do multiple things, use the "while read" form:
find . [-maxdepth 2] -name "*.tex" | while read texfile; do rename 's/\.tex$//' "$texfile"; done
If you have something you want to invoke only once:
find . [-maxdepth 2] -name "*.tex" | xargs rename 's/\.tex$//'
That last one makes clear how useful rename is: if everything's already in the same place, you've got a quick regexp renamer.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Hashing Multiple Files - python

find . -type f -print | while read file
do
hash=`$hashcommand "$file"`
filename=${file%.*}
extension=${file##*.}
mv "$file" "$filename.$hash.$extension"
done
