How to compute shasum of every file in a tar file - python

I'm looking for a way to compute a SHA-256 value for every file contained in a tar file. The problem is that the tar files are 300GB, with over 200,000 contained files.
It would be possible to do this in bash a couple of different ways.
Extract and then use find
tmp=`mktemp --directory extract_XXX`
cd "$tmp"
tar -xf "$tarfile"
find "$tmp" -type f -exec shasum -ba 256 {} +
cd ..
rm -rf "$tmp"
This method is bad because it requires 300GB of free space to work, and it is slow because it has to copy the data before computing the sums.
List the tar file and compute the individual sums
tar -tf "$tarfile" | awk '/\/$/ {next} {print $0}' | while read file ; do
    sum=`tar -xOf "$tarfile" "$file" | shasum -ba 256`
    echo "${sum%-}${file}"
done
This requires less disk space but is much slower, because tar has to scan the archive again for every file.
How can I do this in a single pass of the tar file without extracting it to a temp directory?
I've tagged this as bash and python. The current code is bash, but I'm flexible about the language.

The tar utility knows its way:
tar xvf "$tarfile" --to-command 'shasum -ba 256'
The -v flag is important because tar sends each file to the standard input of the command, so shasum never sees the file name; -v makes tar print it. The output has the file name on one line and the SHA sum on the next, but you can further process that very easily.
EDIT: here is the complete shell only code to output the SHA256s in one single tar file pass:
shopt -s extglob
tar xvf "$tarfile" --to-command 'shasum -ba 256' | \
while read L; do
    [[ $L == *" *-" ]] && echo $SHAFILE ${L:0:64} || SHAFILE=$L
done
For the glibc source archive, the output would look like:
glibc-2.24/.gitattributes c3f8f279e7e7b0020028d06de61274b00b6cb84cfd005a8f380c014ef89ddf48
glibc-2.24/.gitignore 35bcd2a1d99fbb76087dc077b3e754d657118f353c3d76058f6c35c8c7f7abae
glibc-2.24/BUGS 9b2d4b25c8600508e1d148feeaed5da04a13daf988d5854012aebcc37fd84ef6
glibc-2.24/CONFORMANCE 66b6e97c93a2381711f84f34134e8910ef4ee4a8dc55a049a355f3a7582807ec
Edit by OP:
As a one-liner this can be done as:
tar xf "$tarfile" --to-command 'bash -c "sum=`shasum -ba 256`; echo \"\${sum%-}$TAR_FILENAME\""'
or (on Ubuntu 20.04 and higher):
tar xf "$tarfile" --to-command 'bash -c "sum=`shasum -ba 256 | cut -d \" \" -f 1`; echo \"\${sum%-}$TAR_FILENAME\""'
Manual Page here: https://www.gnu.org/software/tar/manual/tar.html#SEC87

I don't know how fast it will be, but in Python it can be done the following way:
import tarfile
import hashlib

def sha256(flo):
    hash_sha256 = hashlib.sha256()
    # Read in small chunks so large members never have to fit in memory
    for chunk in iter(lambda: flo.read(4096), b''):
        hash_sha256.update(chunk)
    return hash_sha256.hexdigest()

with tarfile.open('/path/to/tar/file') as mytar:
    for member in mytar.getmembers():
        if not member.isfile():
            continue  # extractfile() returns None for directories
        with mytar.extractfile(member) as _file:
            print('{} {}'.format(sha256(_file), member.name))
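A variation on the Python answer above, as a minimal sketch: opening the archive in tarfile's stream mode ("r|*") reads it strictly front to back, which is closer to the single pass the question asks for. The function name sha256_of_members and the 1 MiB chunk size are assumptions made for this illustration, not part of the original answer.

```python
import hashlib
import tarfile


def sha256_of_members(tar_path, chunk_size=1024 * 1024):
    """Yield (name, hex digest) for each regular file, reading the archive once."""
    # "r|*" opens the archive as a forward-only stream and auto-detects compression.
    with tarfile.open(tar_path, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and devices
            digest = hashlib.sha256()
            with archive.extractfile(member) as stream:
                # Hash the member in chunks so it never has to fit in memory.
                for chunk in iter(lambda: stream.read(chunk_size), b""):
                    digest.update(chunk)
            yield member.name, digest.hexdigest()
```

Typical use would be a loop such as: for name, digest in sha256_of_members("/path/to/tar/file"): print(digest, name). In stream mode each member has to be read before moving on to the next one, which this loop does.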

Related

Python, unzip from stdin

I have a python script that does AES decryption of an encrypted zip archive 'myzip.enc'
I'm trying to use the output of that decryption and use it as stdin for "unzip" command.
Here is my code:
decrypt = subprocess.Popen(['openssl', 'enc', '-d', '-aes-256-cbc', '-md', 'sha256', '-in', '{}'.format(inputFile), '-pass', 'pass:{}'.format(passw_hash)], stdout=subprocess.PIPE)
decompress = subprocess.Popen(['unzip', '-j', '-d', path_dict], stdin=decrypt.stdout)
inputFile is my encrypted archive 'myzip.enc'
passw_hash is the AES password
path_dict is a folder path where to extract the decrypted zip
I'm getting this in my terminal:
Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
Default action is to extract files in list, except those in xlist, to exdir;
file[.zip] may be a wildcard. -Z => ZipInfo mode ("unzip -Z" for usage).
-p extract files to pipe, no messages -l list files (short format)
-f freshen existing files, create none -t test compressed archive data
-u update files, create if necessary -z display archive comment only
-v list verbosely/show version info -T timestamp archive to latest
-x exclude files that follow (in xlist) -d extract files into exdir
modifiers:
-n never overwrite existing files -q quiet mode (-qq => quieter)
-o overwrite files WITHOUT prompting -a auto-convert any text files
-j junk paths (do not make directories) -aa treat ALL files as text
-U use escapes for all non-ASCII Unicode -UU ignore any Unicode fields
-C match filenames case-insensitively -L make (some) names lowercase
-X restore UID/GID info -V retain VMS version numbers
-K keep setuid/setgid/tacky permissions -M pipe through "more" pager
See "unzip -hh" or unzip.txt for more help. Examples:
unzip data1 -x joe => extract all files except joe from zipfile data1.zip
unzip -p foo | more => send contents of foo.zip via pipe into program more
unzip -fo foo ReadMe => quietly replace existing ReadMe if archive file newer
Is there something wrong in my unzip command?
Thanks.
Edit: It seems from here that it is impossible to pipe a zip archive into the unzip command, because unzip needs to read some information from the physical file.
My workaround ended up being this code which works:
output = open('{}.zip'.format(inputFile), "wb")
decrypt = subprocess.Popen(['openssl', 'enc', '-d', '-aes-256-cbc', '-md', 'sha256', '-in', '{}'.format(inputFile), '-pass', 'pass:{}'.format(passw_hash)], stdout=output)
decompress = subprocess.Popen(['unzip', '{}.zip'.format(inputFile), '-d', path_dict[0]])
Is there a way to unzip and delete the zip archive at the same time, or to add an rm to the decompress line?
Thanks.
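One possible way to avoid the intermediate .zip file entirely, sketched under the assumption that the decrypted archive fits in memory: Python's zipfile module only needs a seekable file object, and io.BytesIO provides one. The function names extract_zip_bytes and decrypt_and_unzip are hypothetical; the openssl arguments are copied from the question.

```python
import io
import subprocess
import zipfile


def extract_zip_bytes(data, dest_dir):
    """Extract a zip archive held in memory and return its member names."""
    # BytesIO is seekable, which zipfile needs in order to read the
    # central directory stored at the end of the archive.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


def decrypt_and_unzip(input_file, password, dest_dir):
    """Decrypt with openssl (flags from the question), then unzip from memory."""
    result = subprocess.run(
        ["openssl", "enc", "-d", "-aes-256-cbc", "-md", "sha256",
         "-in", input_file, "-pass", "pass:{}".format(password)],
        stdout=subprocess.PIPE, check=True)
    return extract_zip_bytes(result.stdout, dest_dir)
```

Unlike unzip -j, extractall keeps the directory structure stored in the archive. Because no temporary zip is written to disk, there is nothing to delete afterwards.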

A bash script that reads python files recursively and stops after the output of each file exists

I have a bash script that finds *.py scripts recursively and stops once the output of each *.py file (a *.pkl file) exists in the directory. The main idea is that if an output does not exist, the Python script has to be run again until every *.py script has produced its output.
bash.sh
model1.py
model2.py
model3.py
model1.pkl # expected output
model2.pkl # expected output
model3.pkl # expected output
However, I have a problem: when the second or third output does NOT exist (from the second or third *.py script), the script does not run it again, whereas if the first output does NOT exist, the script runs it again, as it should.
My bash is the following:
#!/bin/bash
for x in $(find . -type f -name "*.py"); do
    if [[ ! -f `basename -s .py $x`.pkl ]]; then #output with the same name of *.py file
        python3 ${x}
    else
        exit 0
    fi
done
So, how can I force the bash script to run again if the output of any *.py script is missing? Or is it a problem with the names of the outputs?
I tried using while read and until, but I could not get the script to read all the *.py files.
Thanks in advance!
Try this. It is not the best way, but it should at least point you in the right direction.
keep_running(){
    for f in $(find . -type f -name "*.py");
    do
        file_name=$(dirname $f)/$(basename $f .py).pkl
        if [ ! -f "$file_name" ];
        then
            echo "$file_name doesn't exist" # here you can run your python script
        fi
    done
}
cnt_py=0
cnt_pkl=1
while [ $cnt_pkl -ne $cnt_py ] ; do
    keep_running
    cnt_py=`ls -1 *.py| wc -l`
    cnt_pkl=`ls -1 *.pkl| wc -l`
done
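The same idea can be sketched in Python, comparing each script with its expected output directly instead of counting files. This is only an illustration: the function names and the max_rounds safety limit are assumptions, and it presumes each script writes a .pkl file with the same base name next to itself.

```python
import subprocess
import sys
from pathlib import Path


def scripts_missing_output(root):
    """Return the .py files under root that have no sibling .pkl file."""
    return sorted(p for p in Path(root).rglob("*.py")
                  if not p.with_suffix(".pkl").exists())


def run_until_complete(root, max_rounds=5):
    """Re-run scripts whose output is missing; stop when none are left."""
    for _ in range(max_rounds):  # max_rounds is an assumed safety limit
        missing = scripts_missing_output(root)
        if not missing:
            return True
        for script in missing:
            # Run each script from its own directory so relative output paths work.
            subprocess.run([sys.executable, script.name], cwd=script.parent)
    return not scripts_missing_output(root)
```

Calling run_until_complete(".") checks every script on every round, so a missing second or third output is handled the same way as a missing first one.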

shell script to convert windows file to unix using dos2unix

I'm writing a simple shell script to make use of dos2unix command to convert Windows-format files to Unix format as and when it arrives in my folder.
I used to use iconv in the script and automate it to get one encoding converted to the other. But now I need to use dos2unix instead of iconv.
I don't want the original file to be overwritten (it must be archived in the archive folder). This was straightforward with iconv; how can I do the same with dos2unix?
This is my script:
cd /myfolder/storage
filearrival_dir=/myfolder/storage
filearchive_dir=/myfolder/storage/archive
cd $filearrival_dir
echo " $filearrival_dir"
for file in File_October*.txt
do
    iconv -f UTF16 -t UTF8 -o "$file.new" "$file" &&
    mv -f "$file.new" "$file".`date +"%C%y%m%d"`.txt_conv &&
    mv "$file" "$filearchive_dir/$file"
done
The above looks for files matching File_October*.txt, converts each one to the desired encoding, and renames the result with the date and _conv at the end. The script also moves the original file to the archive.
How can I replace iconv in the above script with dos2unix and have the files archived and do the rest just like I did here?
You can "emulate" dos2unix using tr.
tr -d '\015' < infile > outfile
If this is just about using dos2unix so it doesn't overwrite the original file, just use
dos2unix -n infile outfile
My recollection is that dos2unix writes UTF-8 by default, so you probably don't have to take any special action so far as encoding is concerned.
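For comparison, the convert-then-archive flow from the question could be sketched in Python as follows. This is an assumption-laden illustration rather than a dos2unix replacement: the function name convert_and_archive is made up, and the byte-level replacement only makes sense for single-byte or UTF-8 text, not for UTF-16 input.

```python
import shutil
from pathlib import Path


def convert_and_archive(src, converted, archive_dir):
    """Write an LF-only copy of src to converted, then move src into archive_dir."""
    src, converted, archive_dir = Path(src), Path(converted), Path(archive_dir)
    data = src.read_bytes()
    # Replace CRLF pairs only; lone CR bytes are left as they are
    # (tr -d '\015' would delete every CR instead).
    converted.write_bytes(data.replace(b"\r\n", b"\n"))
    archive_dir.mkdir(parents=True, exist_ok=True)
    # The original is moved untouched, so it is never overwritten.
    shutil.move(str(src), str(archive_dir / src.name))
    return converted
```

The original file is only moved after the converted copy has been written, mirroring the && chain in the shell script.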

Unzip a file and copy the contents to different folder based on condition

Folks,
I have a requirement to unzip a file and copy the contents of the subdirectories of the unzipped file to different locations.
For Example:
Filename: temp.zip
unzip temp.zip
We have a folder structure like this under temp:
temp/usr/data/keanu/*.pdf's
temp/usr/data/reaves/*.pdf's
My requirement is to go to the unzipped folders and copy
/keanu/*.pdf's to /desti1/
and
/reaves/*.pdf's to /dest2/
I have tried the below:
unzip.sh <filename>
filename=$1
unzip $filename
# I am stuck here: I need to go into the unzipped folder, find the path, and copy those files to different destinations
UPDATE: my script now unzips the file and recursively copies the required type of files to the destination folder, preserving the directory structure.
Filename: unzip.sh
#! /bin/bash
#shopt -s nullglob globstar
filename="$1"
var1=$(sed 's/.\{4\}$//' <<< "$filename")
echo $var1
unzip "$filename"
cd "$(dirname "$filename")"/"$var1"/**/includes
#pwd
#use -udm in cpio to overwrite
find . -name '*.pdf' | cpio -pdm /tmp/test/includes
cd -
cd "$(dirname "$filename")"/"$var1"/**/global
#pwd
find . -name '*.pdf' | cpio -pdm /tmp/test/global
In case the zip is always structured the same:
#! /bin/bash
shopt -s nullglob
filename="$1"
unzip "$filename"
cp "$(dirname "$filename")"/temp/usr/data/keanu/*.pdf /desti1/
cp "$(dirname "$filename")"/temp/usr/data/reaves/*.pdf /desti2/
In case the structure changes and you only know that there are directories keanu/ and reaves/ somewhere:
#! /bin/bash
shopt -s nullglob globstar
filename="$1"
unzip "$filename"
cp "$(dirname "$filename")"/**/keanu/*.pdf /desti1/
cp "$(dirname "$filename")"/**/reaves/*.pdf /desti2/
Both scripts do what you specified but not more than that. The files are copied, not moved, so the original unzipped files will still be lying around after the script terminates.
Python solution:
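import os
import shutil
import zipfile
with zipfile.ZipFile("temp.zip") as zf:
    for f in zf.namelist():
        if not f.endswith("/"):
            dest = "dest1" if os.path.basename(os.path.dirname(f)) == "keanu" else "dest2"
            os.makedirs(dest, exist_ok=True)
            # extract() would recreate the archive's directory tree, so copy the bytes instead
            with zf.open(f) as src, open(os.path.join(dest, os.path.basename(f)), "wb") as dst:
                shutil.copyfileobj(src, dst)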
iterate on the contents of the archive
filter out directories (end with "/")
if last dirname is "keanu", select destination 1 else the other
extract directly under the selected destination

Bash script to find files in folders

I have a couple of folders as
Main/
/a
/b
/c
..
I have to pass input file abc1.txt, abc2.txt from each of these folders respectively as an input file to my python program.
The script right now is,
for i in `cat file.list`
do
    echo $i
    cd $i
    #works on the assumption that there is only one .txt file
    inputfile=`ls | grep .txt`
    echo $inputfile
    python2.7 ../getDOC.py $inputfile
    sleep 10
    cd ..
done
echo "Script executed successfully"
So I want the script to work correctly regardless of number of .txt files.
Can anyone let me know if there is any inbuilt command in shell to fetch the correct .txt files in case for multiple .txt files?
The find command is well suited for this with -exec:
find /path/to/Main -type f -name "*.txt" -exec python2.7 ../getDOC.py {} \; -exec sleep 10 \;
Explanation:
find - invoke find
/path/to/Main - The directory to start your search at. By default find searches recursively.
-type f - Only consider files (as opposed to directories, etc)
-name "*.txt" - Only find the files with .txt extension. This is quoted so bash doesn't auto-expand the wildcard * via globbing.
-exec ... \; - For each such result found, run the following command on it:
python2.7 ../getDOC.py {}; - the {} part is where the search result from the find gets substituted into each time.
sleep 10 - sleep for 10 seconds after each time python script is run on the file. Remove this if you don't want it to sleep.
Better, use globs:
shopt -s globstar nullglob
for i in Main/**/*.txt; do
    python2.7 ../getDOC.py "$i"
    sleep 10
done
This example is recursive and requires bash 4.
find . -name '*.txt' | xargs python2.7 ../getDOC.py
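The recursive search in the answers above can also be sketched in Python with pathlib, which avoids shell quoting problems with unusual file names. The function names find_txt_files and run_on_each are hypothetical, and the command passed in is whatever should be run per file.

```python
import subprocess
from pathlib import Path


def find_txt_files(root):
    """Return every regular .txt file under root, recursively, in sorted order."""
    return sorted(p for p in Path(root).rglob("*.txt") if p.is_file())


def run_on_each(root, command):
    """Run command once per file, appending the file path as the last argument."""
    for path in find_txt_files(root):
        subprocess.run(list(command) + [str(path)], check=True)
```

For the layout in the question, a call might look like run_on_each("Main", ["python2.7", "../getDOC.py"]). Each file gets its own invocation, however many .txt files a folder contains.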
