"gsutil rm" command using STDIN - python

I use gsutil in a Linux environment for managing files in GCS. I like being able to run
gsutil -m cp -I gs://...
with another command piped in front of it so that gsutil reads the file list from STDIN; that way I can maintain a local list of files that have been uploaded, or generate specific patterns to upload and hand them off.
I would like to be able to do a similar command like
gsutil -m rm -I gs://...
to scrub files similarly. Presently, I build a big list of files to remove and run it with the following code:
while read line
do
  gsutil rm "gs://.../$line"
done < "$myfile.txt"
This is extraordinarily slow compared to the multithreaded "gsutil -m rm..." command, and enabling the -m flag has no effect when you have to process files one at a time from a list. I also experimented with just running
gsutil -m rm gs://.../* # remove everything
<my command> | gsutil -m cp -I gs://.../ # put back the pieces that I want
but this involves recopying a lot of data and wastes a lot of time; the data is already there and just needs some of it removed. Any thoughts would be appreciated. Also, I don't have a lot of flexibility on either end with renaming files; otherwise, a quick rename before uploading would handle all of this.

As an interim solution, since we don't have a -I option for rm right now, how about just building a string of all the objects you want to delete in your loop and then using a single gsutil -m rm to delete them? You could also do this with a simple python script that invokes the gsutil command from within python as a separate process.
Expanding on your earlier example, maybe something like the following (disclaimer: my bash-fu isn't the greatest, and I haven't tested this):
objects=''
while read line
do
  objects="$objects gs://$line"
done < "$myfile.txt"
gsutil -m rm $objects
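A sketch of the Python variant mentioned above, invoking gsutil as a separate process (the list file name and bucket prefix are placeholders, not anything from the original setup):
#!/usr/bin/env python
# Minimal sketch: read object names from a list file and delete them all
# with one multithreaded "gsutil -m rm" call. "to_delete.txt" and the
# bucket prefix are placeholders -- adjust to your own layout.
import subprocess
import sys

BUCKET_PREFIX = "gs://my-bucket/"  # placeholder prefix

def main(list_file):
    with open(list_file) as f:
        objects = [BUCKET_PREFIX + line.strip() for line in f if line.strip()]
    if not objects:
        return
    # One gsutil invocation deletes everything in parallel.
    subprocess.check_call(["gsutil", "-m", "rm"] + objects)

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "to_delete.txt")
Note that a single exec call is still subject to the OS argument-length limit, so a very large list may need to be deleted in batches.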

For anyone wondering, I wound up doing what Zach Wilt indicated above. For reference, I was removing a couple thousand files from each of 5 directories, so roughly 10,000 files in all. Without the "-m" switch this was taking upwards of 30 minutes; with the "-m" switch, it takes less than 30 seconds. Zoom!
For a robust example: I am using this to update Google Cloud Storage files to match local files. Each day, a program dumps lots of incremental files, plus a handful that are "rolled up". After a week, the incremental files get scrubbed locally automatically, and the same should happen in GCS to save space. Here's how to do this:
#!/bin/bash
# get the full date strings for touch
start=`date --date='-9 days' +%x`
end=`date --date='-8 days' +%x`
# other vars
mon=`date --date='-9 days' +%b | tr '[A-Z]' '[a-z]'`
day=`date --date='-9 days' +%d`
# show which day is being cleaned
echo "Cleaning files from $start"
# create reference timestamp files for the range
touch --date="$start" /tmp/start1
touch --date="$end" /tmp/end1
# repeat for all directories
for dr in "dir1" "dir2" "dir3" ...
do
  # list local files in the range to build the retention list
  find /local/path/$dr/ -newer /tmp/start1 ! -newer /tmp/end1 > "$dr-local.txt"
  # get the list of all files from the matching folder on GCS
  gsutil ls gs://gcs_path/$mon/$dr/$day/ > "$dr-gcs.txt"
  # rewrite the GCS paths as local paths so the two lists are comparable
  sed -i "s|gs://gcs_path/$mon/$dr/$day/|/local/path/$dr/|" "$dr-gcs.txt"
  # build a sed command file that deletes lines matching files to keep
  while read line
  do
    echo "\|$line|d" >> "$dr-del.txt"
  done < "$dr-local.txt"
  # run the command file to strip lines for files that need to remain
  sed -f "$dr-del.txt" <"$dr-gcs.txt" >"$dr-out.txt"
  # convert local names back to GCS names
  sed -i "s|/local/path/$dr/|gs://gcs_path/$mon/$dr/$day/|" "$dr-out.txt"
  # new variable to hold the delete string
  del=""
  # convert the newline-separated file to one long string
  while read line
  do
    del="$del$line "
  done < "$dr-out.txt"
  # remove all files in the final output
  gsutil -m rm $del
  # cleanup temp files
  rm "$dr-local.txt" "$dr-gcs.txt" "$dr-del.txt" "$dr-out.txt"
done
You'll need to modify this to fit your needs, but it is a concrete, working method for deleting files locally and then synchronizing the change to Google Cloud Storage. Thanks again to @Zach Wilt.

Related

Getting all pods for a container, storing them in text files and then using those files as args in a single command

I have a list of Kubernetes pods whose logs I need to save to a text file (or multiple text files).
I need a command which:
1. stores multiple pod logs into text files (or one single text file). So far I have this command, which stores one pod's logs in one text file, but that is not enough since I would have to spell out each pod name individually:
$ kubectl logs ipt-prodcat-db-kp-kkng2 -n ho-it-sst4-i-ie-enf > latest.txt
2. sends these files into a python script that checks for various strings. So far this works, but it would be extremely useful if it could be combined with the command above:
python CheckLogs.py latest.txt latest2.txt
Is it possible to do either (1) or both (1) and (2) in a single command?
The simplest solution is to create a shell script that does exactly what you are looking for:
#!/bin/sh
FILE="text1.txt"
for p in $(kubectl get pods -o jsonpath="{.items[*].metadata.name}"); do
  kubectl logs "$p" >> "$FILE"
done
With this script you will get the logs of all the pods in your namespace appended to $FILE.
You can even add python CheckLogs.py "$FILE" at the end of the script to run the check in the same step.
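If you would rather do both steps from a single Python entry point, here is a rough sketch (assuming kubectl is on the PATH, the namespace is passed as the first argument, and CheckLogs.py accepts log file names as arguments):
#!/usr/bin/env python3
# Rough sketch: dump the logs of every pod in a namespace to one file each,
# then hand all of the files to CheckLogs.py. The namespace argument and the
# location of CheckLogs.py are assumptions -- adjust as needed.
import subprocess
import sys

namespace = sys.argv[1] if len(sys.argv) > 1 else "default"

pods = subprocess.check_output(
    ["kubectl", "get", "pods", "-n", namespace,
     "-o", "jsonpath={.items[*].metadata.name}"]
).decode().split()

log_files = []
for pod in pods:
    filename = pod + ".txt"
    with open(filename, "wb") as f:
        f.write(subprocess.check_output(["kubectl", "logs", pod, "-n", namespace]))
    log_files.append(filename)

# Hand every log file to the existing checker script in one go.
subprocess.check_call([sys.executable, "CheckLogs.py"] + log_files)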
There are various tools that could help here. Some of these are commonly available, and some of these are shortcuts that I create my own scripts for.
xargs: This is used to run multiple command lines in various combinations, based on the input. For instance, if you piped in text output containing three lines, you could execute three commands using the content of those three lines. There are many possible variations.
arg1: This is a shortcut that I wrote that simply takes stdin and produces the first argument. The simplest form of this would just be "awk '{print $1}'", but I designed mine to take optional parameters, for instance, to override the argument number, separator, and to take a filename instead. I often use "-i{}" to specify a substitution marker for the value.
skipfirstline: Another shortcut I wrote, that simply takes some multiline text input and omits the first line. It is just "sed -n '1!p'".
head/tail: These print some of the first or last lines of stdin. Interesting forms of this take negative numbers. Read the man page and experiment.
sed: Often a part of my pipelines, for making inline replacements of text.

for fi in sys.argv[1:]: argument list too long

I am trying to execute a python script on all text files in a folder:
for fi in sys.argv[1:]:
And I get the following error
-bash: /usr/bin/python: Argument list too long
The way I call this Python function is the following:
python functionName.py *.txt
The folder has around 9000 files. Is there some way to run this function without having to split my data into more folders etc.? Splitting the files would not be very practical because I will have to execute the function on even more files in the future... Thanks
EDIT: Based on the selected correct reply and the comments of the replier (Charles Duffy), what worked for me is the following:
printf '%s\0' *.txt | xargs -0 python ./functionName.py
because I don't have a valid shebang.
This is an OS-level problem (limit on command line length), and is conventionally solved with an OS-level (or, at least, outside-your-Python-process) solution:
find . -maxdepth 1 -type f -name '*.txt' -exec ./your-python-program '{}' +
...or...
printf '%s\0' *.txt | xargs -0 ./your-python-program
Note that this runs your-python-program once per batch of files found, where the batch size is dependent on the number of names that can fit in ARG_MAX; see the excellent answer by Marcus Müller if this is unsuitable.
No. That is a kernel limitation for the length (in bytes) of a command line.
Typically, you can determine that limit by doing
getconf ARG_MAX
which, at least for me, yields 2097152 (bytes), which means about 2MB.
I recommend using python to work through a folder yourself, i.e. giving your python program the ability to work with directories instead of individual files, or to read file names from a file.
The former can easily be done using os.walk(...), whereas the second option is (in my opinion) the more flexible one. Use the argparse module to give your python program an easy-to-use command line syntax, then add an argument of a file type (see the reference documentation), and python will automatically be able to understand special filenames like -, meaning that instead of
for fi in sys.argv[1:]:
you could do
for fi in opts.file_to_read_filenames_from.read().split(chr(0)):
which would even allow you to do something like
find -iname '*.txt' -type f -print0 | my_python_program.py --file-to-read-filenames-from -
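A minimal, untested sketch of that approach (the option name follows the snippet above; everything else is illustrative):
#!/usr/bin/env python3
# Illustrative sketch: read NUL-separated filenames from a file, or from
# stdin when "-" is given, instead of passing thousands of arguments.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--file-to-read-filenames-from",
                    type=argparse.FileType("r"), default="-")
opts = parser.parse_args()

for fi in opts.file_to_read_filenames_from.read().split(chr(0)):
    if not fi:
        continue  # skip the empty entry after the trailing NUL
    print("processing", fi)  # ...do the real work here...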
Don't do it this way. Pass the mask to your python script (e.g. call it as python functionName.py "*.txt") and expand it using glob (https://docs.python.org/2/library/glob.html).
Consider using the glob module. With it, you invoke your program like:
python functionName.py "*.txt"
Then the shell will not expand *.txt into file names. Your Python program will receive "*.txt" in its argument list, and you can pass it to glob.glob():
for fi in glob.glob(sys.argv[1]):
    ...
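Putting that together, a minimal sketch of functionName.py using glob (illustrative, not your actual script):
#!/usr/bin/env python
# Sketch: expand the quoted pattern inside Python rather than in the shell,
# which sidesteps the ARG_MAX limit entirely.
import glob
import sys

pattern = sys.argv[1] if len(sys.argv) > 1 else "*.txt"
for fi in glob.glob(pattern):
    with open(fi) as f:
        pass  # ...process each file here...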

Rewrite config file based on standard error output

I'm new to Linux and have a Fedora 20 build for a class. We installed Tripwire using the default, out-of-the-box configs, and I want to use the standard-error output from the install to fix the config file.
To collect the errors:
tripwire -m i -c tw.cfg 2> errors
To clean the error file up for processing:
cat errors | grep "/" | cut -d " " -f 3 > fixederrors
This gives me a nice clean file with one path per line i.e.:
/bin/ash
/bin/ash.static
/root/.Xresources
I would like to automate this process by comparing the data in fixederrors to the config file and prepend matching strings with a '#'.
I tried sed, but it commented out the whole config file!
sed 's/^/#/g' fixederrors > commentederrors
Alternatively, I thought about comparing the config file and the fixederrors and creating a third file. Is there a way to take two files, compare them, and remove duplicated data?
Any help is appreciated. I tried bash and python, but I don't know enough and went down the rabbit hole on this one. Again, this is for my growth and not in a production environment.
Suppose you have this cleaned-up input file named fixederrors:
/bin/ash
/bin/ash.static
/root/.Xresources
and this configuration file named config.original:
use /bin/ash
use /bin/bash
do stuff with /bin/ash.static and friends
/root/.Xresources
do other stuff...
With this bash script:
#!/bin/bash
Input_Conf_File="config.original"   # original configuration file
Output_Conf_File="config.new"       # new configuration file
Input_Error_File="fixederrors"      # file of cleaned errors
rm -f $Output_Conf_File             # start with a fresh output file
while read -r line ; do                        # for each line of the config file
  AddingChar=""                                # no prefix character by default
  while IFS= read -r line2 ; do                # for each line of the error file
    # here you can add specific rules for your match, e.g.
    # line2=${line2}" "   # if it always has a space after...
    [[ $line == *$line2* ]] && AddingChar="#"  # if found --> prefix with "#"
  done < $Input_Error_File
  echo "${AddingChar}${line}" >> $Output_Conf_File   # print to the new file
done < $Input_Conf_File
cat $Output_Conf_File   # only to check the results; you can remove this
exit 0
You will get this output:
#use /bin/ash
use /bin/bash
#do stuff with /bin/ash.static and friends
#/root/.Xresources
do other stuff...
Note:
IFS= keeps read from stripping leading and trailing whitespace.
Use this wisely: a match on /bin/ash will also catch lines containing /bin/ash.ANYTHING, so if your configuration file contains a line such as /bin/ash.dynamic it will be commented out too. Without prior knowledge of your configuration file it's not possible to set a more specific rule, but you can start from here.
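For completeness, the same matching logic as a short Python sketch (untested; it uses the file names from the example above and has the same substring-match caveat):
#!/usr/bin/env python3
# Sketch: comment out every line of config.original that contains one of
# the paths listed in fixederrors, writing the result to config.new.
with open("fixederrors") as f:
    paths = [line.strip() for line in f if line.strip()]

with open("config.original") as src, open("config.new", "w") as dst:
    for line in src:
        dst.write("#" + line if any(p in line for p in paths) else line)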

Asynchronous tasks in Plone to query Python Package Index

I want to periodically (every hour?) query the Python Package Index API from Plone. Something equivalent to:
$ for i in `yolk -L 24 | awk '{print $1}'`   # get releases made in the last 24 hours
do
  # search for the Plone classifier
  results=`yolk -M $i -f classifiers | grep -i plone`
  if [ -n "$results" ]; then
    echo $i
  fi
done
Results:
collective.sendaspdf
gocept.selenium
Products.EnhancedNewsItemImage
adi.workingcopyflag
Products.SimpleCalendarPortlet
Products.SimpleCalendar
Then I want to display this information in a template. I would love to, at least initially, avoid having to persist the results.
How do I display the results in a template without having to wait for the query to finish? I know there are some async packages available e.g.:
plone.app.async
But I'm not sure what the general approach should be. (Assuming I can schedule an async task, I may need to store the results somewhere; if so, I'd prefer something lightweight, e.g. annotations.)
How about the low, low tech version?
Use a cron-job to run the query, put this in a temp file, then move the file into a known location, with a timestamp in the filename.
Then, when someone requests the page in question (showing new packages), simply read the newest file in that location:
import os
# assumes the timestamped filenames sort lexicographically (e.g. YYYYMMDD-HHMMSS)
filename = sorted(os.listdir(location))[-1]
data = open(os.path.join(location, filename)).read()
By using a move, you guarantee that the newest file in the designated location is always a complete file, avoiding a partial result being read.
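The cron-job side can be equally small; a sketch of the write-then-move step (the directory and the query function are placeholders, not a real API):
#!/usr/bin/env python3
# Sketch: write the query results to a temp file in the same directory,
# then rename it into place so readers never see a partial file.
# "location" and fetch_new_plone_packages() are placeholders.
import os
import tempfile
import time

location = "/var/cache/plone-pypi"  # placeholder directory

def fetch_new_plone_packages():
    # placeholder for the yolk/PyPI query shown in the question
    return ["collective.sendaspdf", "gocept.selenium"]

results = "\n".join(fetch_new_plone_packages())

fd, tmp_path = tempfile.mkstemp(dir=location)
with os.fdopen(fd, "w") as f:
    f.write(results)

# A timestamped name sorts lexicographically, so the listdir/sorted trick works.
final_path = os.path.join(location, time.strftime("%Y%m%d-%H%M%S") + ".txt")
os.rename(tmp_path, final_path)  # atomic on the same filesystem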

Rejecting files with Windows line endings using Perforce triggers

Using Perforce, I'd like to be able to reject submits which contain files with Windows line endings (\r\n IIRC, maybe just \r anywhere as really we only want files with Unix line endings).
Rather than dos2unix incoming files or similar, to help track down instances where users attempt to submit files with Windows line endings, I'd like to add a trigger to reject text submissions which contain files with non-Unix line endings.
Could someone demonstrate how the trigger itself could be written, perhaps with bash or python?
Thanks
Here's the minimal edit I can think of for the shell example found in the p4 docs:
#!/bin/sh
# Set target string, files to search, location of p4 executable...
TARGET=`printf '\r'`    # a literal carriage-return character
DEPOT_PATH="//depot/src/..."
CHANGE=$1
P4CMD="/usr/local/bin/p4 -p 1666 -c copychecker"
XIT=0
echo ""
# For each file, strip off #version and other non-filename info
# Use sed to swap spaces w/"%" to obtain single arguments for "for"
for FILE in `$P4CMD files $DEPOT_PATH#=$CHANGE | \
sed -e 's/\(.*\)\#[0-9]* - .*$/\1/' -e 's/ /%/g'`
do
  # Undo the replacement to obtain filename...
  FILE="`echo $FILE | sed -e 's/%/ /g'`"
  # ...and use #= specifier to access file contents:
  # p4 print -q //depot/src/file.c#=12345
  if $P4CMD print -q "$FILE#=$CHANGE" | fgrep "$TARGET" > /dev/null
  then
    echo "Submit fails: carriage return found in $FILE"
    XIT=1
  else
    echo ""
  fi
done
exit $XIT
The original example fails if the target is missing; this one fails if it's present, switching the then and else branches of the if (and using a literal carriage-return character as the target). You could edit it further, of course, e.g. giving grep, or fgrep, the -q flag to suppress output, if your grep supports it as e.g. GNU's does.
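Since the question allows Python as well, here is a rough, untested Python equivalent of the same idea (same placeholder port, client, and depot path; it rejects the change if any submitted file contains a carriage-return byte):
#!/usr/bin/env python3
# Rough sketch of the same change-submit trigger in Python: reject the
# changelist if any submitted file contains a carriage-return byte.
# The port, client and depot path are placeholders.
import subprocess
import sys

P4 = ["p4", "-p", "1666", "-c", "copychecker"]
DEPOT_PATH = "//depot/src/..."
change = sys.argv[1]

exit_code = 0
listing = subprocess.check_output(P4 + ["files", "%s#=%s" % (DEPOT_PATH, change)])
for line in listing.decode().splitlines():
    # Lines look like "//depot/src/file.c#3 - edit change 12345 (text)"
    depot_file = line.rsplit("#", 1)[0]
    content = subprocess.check_output(P4 + ["print", "-q", "%s#=%s" % (depot_file, change)])
    if b"\r" in content:
        print("Submit fails: carriage return found in %s" % depot_file)
        exit_code = 1

sys.exit(exit_code)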
