How to find and exec based on python variable in google colab - python

I'm working in Google Colaboratory and I have to do some processing on files based on their extensions, for example:
!find ./ -type f -name "*.djvu" -exec file '{}' ';'
and I get the expected output:
./file.djvu: DjVu multiple page document
but when I try to mix bash and Python to loop over a list of extensions:
for f in file_types:
  !echo "*.{f}"
  !find ./ -type f -name "*.{f}" -exec file '{}' ';'
  !echo "*.$f"
  !find ./ -type f -name "*.$f" -exec file '{}' ';'
I only get the output of the two echo commands, but nothing from find:
*.djvu
*.djvu
*.jpg
*.jpg
*.pdf
*.pdf
If I remove the -exec part it actually finds the files, so I can't figure out why find combined with -exec fails.
If needed I can provide more info/examples.
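One possible explanation (an assumption, not something confirmed above) is that the same brace interpolation that substitutes {f} into the ! commands also tries to process find's {} placeholder. If that is the case, doubling the braces should pass a literal {} through to find:
for f in file_types:
  # {{}} is intended to survive the notebook's brace interpolation as a literal {}
  !find ./ -type f -name "*.{f}" -exec file '{{}}' ';'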

I found an ugly workaround that passes through a file: first I write the list to a file in Python:
with open('/content/file_types.txt', 'w') as f:
  for ft in file_types:
    f.write(ft + '\n')
and then I read and use it from bash in another cell:
%%bash
filename=/content/file_types.txt
while IFS= read -r f; do
  find ./ -name "*.$f" -exec file {} ';'
done < "$filename"
Doing so I don't mix bash and Python in the same cell, as suggested in a comment on another answer.
I hope to find a better solution, maybe using some trick that I'm not aware of.
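For completeness, a sketch of a pure-Python alternative that avoids mixing the two languages at all, assuming the goal is simply to run file on every match as in the examples above:
import subprocess
from pathlib import Path

file_types = ["djvu", "jpg", "pdf"]

for ft in file_types:
    for path in Path(".").rglob(f"*.{ft}"):  # recursive search, like find ./
        if path.is_file():
            # run the same `file` utility on each match
            print(subprocess.run(["file", str(path)],
                                 capture_output=True, text=True).stdout, end="")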

This works for me:
declare -a my_array=("pdf" "py")
for i in "${my_array[@]}"; do
  find ./ -type f -name "*.$i" -exec file '{}' ';'
done
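If the list lives in a Python variable rather than a hard-coded array, a hedged variation is to join the extensions in Python and hand them to a %%bash cell as positional parameters, assuming the notebook expands $exts on the %%bash line the same way it does for ! commands:
# in a Python cell
exts = " ".join(file_types)   # e.g. "djvu jpg pdf"

# in a separate cell
%%bash -s "$exts"
for ext in $1; do
  find ./ -type f -name "*.$ext" -exec file '{}' ';'
done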

Related

Bash script to automate a Python script

I have an issue with a bash script. It runs a Python script over all the CSV files in a directory, but I have noticed that the resulting files are incomplete: no file contains more than 100 rows, and curiously the writing stops exactly at row number 100.
Here is my bash code:
#!/usr/bin/env bash
find ./nodos_aleatorios100 -type f -name '*.csv' -print0 |
while IFS= read -r -d $'\0' line; do
  nname=$(echo "$line" | sed "s/ecn/encn/")
  python algoritmo_aleatorio.py "$line" "$nname"
done
My question is: how can I solve this problem?
Thanks in advance to everyone who can help.
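No fix is shown above, but one classic cause of a while read loop stopping early is a command inside the loop reading from the same stdin as the loop itself. If algoritmo_aleatorio.py reads standard input, redirecting it is a cheap thing to try (a sketch, not a confirmed diagnosis):
python algoritmo_aleatorio.py "$line" "$nname" < /dev/null  # keep the script's stdin away from the file list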

A bash script that runs Python files recursively and stops once the output of each file exists

I have a bash script that runs *.py scripts recursively and stops when the output of each *.py file (a *.pkl file) exists in the directory. The main idea is that if an output does not exist, the Python script has to run again until the output of every *.py script has been created.
bash.sh
model1.py
model2.py
model3.py
model1.pkl # expected output
model2.pkl # expected output
model3.pkl # expected output
However, I have a problem here: when the second/third output does NOT exist (from the second/third *.py script) the script does not run again, while if the first output does NOT exist it does run again, as it should.
My script is the following:
#!/bin/bash
for x in $(find . -type f -name "*.py"); do
  if [[ ! -f `basename -s .py $x`.pkl ]]; then  # output with the same name as the *.py file
    python3 ${x}
  else
    exit 0
  fi
done
So, how can I force the bash script to run again if the output of any *.py script is missing? Or is it a problem with the names of the outputs?
I tried using while read and until, but I failed to make the script read all the *.py files.
Thanks in advance!
Try this. It's not the best way, but at least it will point you in the right direction:
keep_running(){
  for f in $(find . -type f -name "*.py"); do
    file_name=$(dirname "$f")/$(basename "$f" .py).pkl
    if [ ! -f "$file_name" ]; then
      echo "$file_name doesn't exist"  # here you can run your python script
    fi
  done
}
cnt_py=0
cnt_pkl=1
while [ "$cnt_pkl" -ne "$cnt_py" ]; do
  keep_running
  cnt_py=$(ls -1 *.py | wc -l)
  cnt_pkl=$(ls -1 *.pkl | wc -l)
done
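A hedged alternative closer to the original script: drop the early exit 0 (which stops the whole loop as soon as one .pkl already exists) and keep rescanning until every .pkl is present. This assumes each model*.py writes its .pkl next to itself:
#!/bin/bash
# keep rescanning until every *.py has a matching *.pkl next to it
while true; do
  missing=0
  while IFS= read -r -d '' x; do
    pkl="${x%.py}.pkl"
    if [[ ! -f "$pkl" ]]; then
      missing=1
      python3 "$x"   # run the script again so it can create its output
    fi
  done < <(find . -type f -name "*.py" -print0)
  [[ $missing -eq 0 ]] && break
done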

Unzip a file and copy the contents to different folders based on a condition

Folks,
I have a requirement to unzip a file and copy the contents of the subdirectories of the unzipped file into different locations.
For Example:
Filename: temp.zip
unzip temp.zip
we have folder structure like this under temp
temp/usr/data/keanu/*.pdf's
temp/usr/data/reaves/*.pdf's
My requirement is to go into the unzipped folders and copy
/keanu/*.pdf's to /desti1/
and
/reaves/*.pdf's to /dest2/
I have tried the below:
unzip.sh <filename>
filename=$1
unzip "$filename"
# I'm stuck here: I need to go into the unzipped folder, find the paths, and copy those files to different destinations
UPDATE: my script below unzips the file and recursively copies the requested type of files to the destination folders while preserving the directory structure.
Filename: unzip.sh
#! /bin/bash
#shopt -s nullglob globstar
filename="$1"
var1=$(sed 's/.\{4\}$//' <<< "$filename")   # strip the trailing ".zip"
echo "$var1"
unzip "$filename"
cd "$(dirname "$filename")"/"$var1"/**/includes
#pwd
# use -udm in cpio to overwrite
find . -name '*.pdf' | cpio -pdm /tmp/test/includes
cd -
cd "$(dirname "$filename")"/"$var1"/**/global
#pwd
find . -name '*.pdf' | cpio -pdm /tmp/test/global
In case the zip is always structured the same:
#! /bin/bash
shopt -s nullglob
filename="$1"
unzip "$filename"
cp "$(dirname "$filename")"/temp/usr/data/keanu/*.pdf /desti1/
cp "$(dirname "$filename")"/temp/usr/data/reaves/*.pdf /desti2/
In case the structure changes and you only know that there are directories keanu/ and reaves/ somewhere:
#! /bin/bash
shopt -s nullglob globstar
filename="$1"
unzip "$filename"
cp "$(dirname "$filename")"/**/keanu/*.pdf /desti1/
cp "$(dirname "$filename")"/**/reaves/*.pdf /desti2/
Both scripts do what you specified but not more than that. The unzipped files are copied over, that is, the original unzipped files will still be lying around after the script terminates.
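If leaving the unzipped tree behind is unwanted, a hedged variation of the second script extracts into a throwaway directory and removes it afterwards (unzip -d picks the extraction directory):
#! /bin/bash
shopt -s nullglob globstar
filename="$1"
workdir=$(mktemp -d)              # throwaway extraction directory
unzip -d "$workdir" "$filename"
cp "$workdir"/**/keanu/*.pdf /desti1/
cp "$workdir"/**/reaves/*.pdf /desti2/
rm -rf "$workdir"                 # clean up the extracted files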
Python solution:
import os, shutil, zipfile

zf = zipfile.ZipFile("temp.zip")
for f in zf.namelist():
    if not f.endswith("/"):  # skip directory entries
        dest = "dest1" if os.path.basename(os.path.dirname(f)) == "keanu" else "dest2"
        os.makedirs(dest, exist_ok=True)
        # write the member's bytes directly under the destination, dropping the archive path
        with zf.open(f) as src, open(os.path.join(dest, os.path.basename(f)), "wb") as out:
            shutil.copyfileobj(src, out)
iterate on the contents of the archive
filter out directories (end with "/")
if last dirname is "keanu", select destination 1 else the other
extract directly under the selected destination

How to compute shasum of every file in a tar file

I'm looking for a way to compute an SHA-256 value for every file contained in a tar file. The problem is that the tar is 300 GB with over 200,000 contained files.
It would be possible to do this in bash in a couple of different ways.
Extract and then use find
tmp=`mktemp --directory extract_XXX`
cd "$tmp"
tar -xf "$tarfile"
find . -type f -exec shasum -ba 256 {} +
cd ..
rm -rf "$tmp"
This method is bad because it requires 300 GB of free space to work and is slow because it has to copy the data before computing the sums.
List the tar file and compute the individual sums
tar -tf "$tarfile" | awk '/\/$/ {next} {print $0}' | while read file ; do
  sum=`tar -xOf "$tarfile" "$file" | shasum -ba 256`
  echo "${sum%-}${file}"
done
This requires less disk space but is much slower
How can I do this in a single pass of the tar file without extracting it to a temp directory?
I've tagged this as bash and python... The current code is bash but I'm flexible about language.
The tar utility knows its way:
tar xvf "$tarfile" --to-command 'shasum -ba 256'
The -v flag is important because tar sends each file's contents to the standard input of the command while printing the file name. It will output the file name on one line and the SHA sum on the next, but you can process that further very easily.
EDIT: here is the complete shell-only code to output the SHA-256s in a single pass over the tar file:
shopt -s extglob
tar xvf "$tarfile" --to-command 'shasum -ba 256' | \
while read -r L; do
  # shasum lines end in " *-"; the other lines are the file names printed by -v
  [[ $L == *" *-" ]] && echo "$SHAFILE ${L:0:64}" || SHAFILE=$L
done
For the glibc source archive, the output would look like:
glibc-2.24/.gitattributes c3f8f279e7e7b0020028d06de61274b00b6cb84cfd005a8f380c014ef89ddf48
glibc-2.24/.gitignore 35bcd2a1d99fbb76087dc077b3e754d657118f353c3d76058f6c35c8c7f7abae
glibc-2.24/BUGS 9b2d4b25c8600508e1d148feeaed5da04a13daf988d5854012aebcc37fd84ef6
glibc-2.24/CONFORMANCE 66b6e97c93a2381711f84f34134e8910ef4ee4a8dc55a049a355f3a7582807ec
Edit by OP:
As a one-liner this can be done as:
tar xf "$tarfile" --to-command 'bash -c "sum=`shasum -ba 256`; echo \"\${sum%-}$TAR_FILENAME\""'
or (on Ubuntu 20.04 and higher):
tar xf "$tarfile" --to-command 'bash -c "sum=`shasum -ba 256 | cut -d \" \" -f 1`; echo \"\${sum%-}$TAR_FILENAME\""'
Manual Page here: https://www.gnu.org/software/tar/manual/tar.html#SEC87
I don't know how fast it will be, but in Python it can be done the following way:
import tarfile
import hashlib

def sha256(flo):
    hash_sha256 = hashlib.sha256()
    for chunk in iter(lambda: flo.read(4096), b''):
        hash_sha256.update(chunk)
    return hash_sha256.hexdigest()

with tarfile.open('/path/to/tar/file') as mytar:
    for member in mytar.getmembers():
        if not member.isfile():  # directories and special members have no file object
            continue
        with mytar.extractfile(member) as _file:
            print('{} {}'.format(sha256(_file), member.name))
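If a strictly single sequential pass over the 300 GB archive matters, a hedged variation is to open the tar in streaming mode ('r|*'), which iterates members in order instead of first building the full member list with getmembers() (a sketch, reusing the sha256 helper above):
import tarfile

with tarfile.open('/path/to/tar/file', 'r|*') as mytar:  # streaming read, no seeking
    for member in mytar:
        if member.isfile():
            _file = mytar.extractfile(member)
            print('{} {}'.format(sha256(_file), member.name))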

Bash script to find files in folders

I have a couple of folders like:
Main/
/a
/b
/c
..
I have to pass the input files abc1.txt, abc2.txt from each of these folders respectively as input to my Python program.
The script right now is:
for i in `cat file.list`
do
  echo $i
  cd $i
  # works on the assumption that there is only one .txt file
  inputfile=`ls | grep .txt`
  echo $inputfile
  python2.7 ../getDOC.py $inputfile
  sleep 10
  cd ..
done
echo "Script executed successfully"
So I want the script to work correctly regardless of the number of .txt files.
Can anyone let me know if there is a built-in shell command to fetch the correct .txt files when there are multiple .txt files?
The find command is well suited for this with -exec:
find /path/to/Main -type f -name "*.txt" -exec python2.7 ../getDOC.py {} \; -exec sleep 10 \;
Explanation:
find - invoke find
/path/to/Main - The directory to start your search at. By default find searches recursively.
-type f - Only consider files (as opposed to directories, etc)
-name "*.txt" - Only find the files with .txt extension. This is quoted so bash doesn't auto-expand the wildcard * via globbing.
-exec ... \; - For each such result found, run the following command on it:
python2.7 ../getDOC.py {} - the {} part is where each search result from find gets substituted.
sleep 10 - sleep for 10 seconds after each time python script is run on the file. Remove this if you don't want it to sleep.
Better, using globs:
shopt -s globstar nullglob
for i in Main/**/*txt; do
  python2.7 ../getDOC.py "$i"
  sleep 10
done
This example is recursive and requires bash 4.
find . -name "*.txt" | xargs python2.7 ../getDOC.py
