Zip function for shell scripts - python

I'm trying to write a shell script that will make several targets into several different paths. I'll pass in a space-separated list of paths and a space-separated list of targets, and the script will make DESTDIR=$path $target for each pair of paths and targets. In Python, my script would look something like this:
for path, target in zip(paths, targets):
exec_shell_command('make DESTDIR=' + path + ' ' + target)
However, this is my current shell script:
#! /bin/bash
packages=$1
targets=$2
target=
set_target_number () {
number=$1
counter=0
for temp_target in $targets; do
if [[ $counter -eq $number ]]; then
target=$temp_target
fi
counter=`expr $counter + 1`
done
}
package_num=0
for package in $packages; do
package_fs="debian/tmp/$package"
set_target_number $package_num
echo "mkdir -p $package_fs"
echo "make DESTDIR=$package_fs $target"
package_num=`expr $package_num + 1`
done
Is there a Unix tool equivalent to Python's zip function or an easier way to retrieve an element from a space-separated list by its index? Thanks.

Use an array:
#!/bin/bash
packages=($1)
targets=($2)
if (("${#packages[#]}" != "${#targets[#]}"))
then
echo 'Number of packages and number of targets differ' >&2
exit 1
fi
for index in "${!packages[#]}"
do
package="${packages[$index]}"
target="${targets[$index]}"
package_fs="debian/tmp/$package"
mkdir -p "$package_fs"
make "DESTDIR=$package_fs" "$target"
done

Here is the solution
paste -d ' ' paths targets | sed 's/^/make DESTDIR=/' | sh
paste is equivalent of zip in shell. sed is used to prepend the make command (using regex) and result is passed to sh to execute

There's no way to do that in bash. You'll need to create two arrays from the input and then iterate through a counter using the values from each.

Related

How to define parameters for a snakemake rule with expand input

I have input files in this format:
dataset1/file1.bam
dataset1/file1_rmd.bam
dataset1/file2.bam
dataset1/file2_rmd.bam
I would like to run a command with each and merge the results into a csv file.
The command returns an integer given filename.
$ samtools view -c -F1 dataset1/file1.bam
200
I would like to run the command and merge the output for each file into the following csv file:
file1,200,100
file2,400,300
I can do this without using the expand in input and using an append operator >>, but in order to avoid possible file corruptions it can lead to I would like to use >.
I tried something like this which did not work due to wildcards.param2 part:
rule collect_rc_results:
input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
output: "{param1}_merged.csv"
shell:
"""
RCT=$(samtools view -c -F1 {input.in1})
RCD=$(samtools view -c -F1 {input.in2})
printf "{wildcards.param2},${{RCT}},${{RCD}}\n" > {output}
"""
I am aware that the input is no longer a single file but a list of files created by expand. Therefore I defined a function to work on a list input, but it is still not quite right:
def get_read_count:
return [ os.popen("samtools view -c -F1 "+infile).read() for infile in infiles ]
rule collect_rc_results:
input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
output: "{param1}_merged.csv"
params: rc1=get_read_count("{param1}/{param2}.bam"), rc2=get_read_count("{param1}/{param2}_rmd.bam")
shell:
"""
printf "{wildcards.param2},{params.rc1},{params.rc2}\n" > {output}
"""
What is the best practice to use the wildcards inside the input file IDs when the input file list is defined with expand?
Edit:
I can get the expected result with expand if I use an external bash script, such as script.sh
for INF in "${#}";do
IN1=${INF}
IN2=${IN1%.bam}_rmd.bam
LIB=$(basename ${IN1%.*}|cut -d_ -f1)
RCT=$(samtools view -c -F1 ${IN1} )
RCD=$(samtools view -c -F1 ${IN2} )
printf "${LIB},${RCT},${RCD}\n"
done
with
params: script="script.sh"
shell:
"""
bash {params.script} {input} > {output}
"""
but I am interested in learning if there is an easier way to get the same output only using snakemake.
Edit2:
I can also do it in the shell instead of a separate script,
rule collect_rc_results:
input:
in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
output: "{param1}_merged.csv"
shell:
"""
for INF in {input};do
IN1=${{INF}}
IN2=${{IN1%.bam}}_rmd.bam
LIB=$(basename ${{IN1%.*}}|cut -d_ -f1)
RCT=$(samtools view -c -F1 ${{IN1}} )
RCD=$(samtools view -c -F1 ${{IN2}} )
printf ${{LIB}},${{RCT}},${{RCD}}\n"
done > {output}
"""
thus get the expected files. However, I would be interested to hear about it if anyone has a more elegant or a "best practice" solution.
I don't think there is anything really wrong with your current solution, but I would be more inclined to use a run directive with shell functions to perform your loop.
#bli's suggestion to use temp files is also good, especially if the intermediate step (samtools in this case) is long running; you can get tremendous wall-clock gains from parallelizing those computations. The downside is you will be creating lots of tiny files.
I noticed that your inputs are fully qualified through expand, but based on your example I think you want to leave param1 as a wildcard. Assuming PARS2 is a list, it should be safe to zip in1, in2 and PARS2 together. Here's my take (written but not tested).
rule collect_rc_results:
input:
in1=expand("{param1}/{param2}.bam", param2=PARS2, allow_missing=True),
in2=expand("{param1}/{param2}_rmd.bam", param2=PARS2, allow_missing=True)
output: "{param1}_merged.csv"
run:
with open(output[0], 'w') as outfile:
for infile1, infile2, parameter in zip(in1, in2, PARS2):
# I don't usually use shell, may have to strip newlines from this output?
RCT = shell(f'samtools view -c -F1 {infile1}')
RCD = shell(f'samtools view -c -F1 {infile2}')
outfile.write(f'{parameter},{RCT},{RCD}\n')

Ignoring a specific file in a shell script for loop

My shell script loops through a given directory to execute a python script. I want to be able to ignore a specific file which is export_config.yaml. This is currently used in the for loop and prints out error on the terminal I want it to totally ignore this file.
yamldir=$1
for yaml in ${yamldir}/*.yaml; do
if [ "$yaml" != "export_config.yaml" ]; then
echo "Running export for $yaml file...";
python jsonvalid.py -p ${yamldir}/export_config.yaml -e $yaml -d ${endDate}
wait
fi
done
Your string comparison in the if statement is not matching export_config.yaml because for each iteration of your for loop you are assigning the entire relative file path (ie "$yamldir/export_config.yaml", not just "export_config.yaml") to $yaml.
First Option: Changing your if statement to reflect that should correct your issue:
if [ "$yaml" != "${yamldir}/export_config.yaml" ]; then
#etc...
Another option: Use basename to grab only the terminal path string (ie the filename).
for yaml in ${yamldir}/*.yaml; do
yaml=$(basename $yaml)
if [ "$yaml" != "export_config.yaml" ]; then
#etc...
Third option: You can do away with the if statement entirely by doing your for loop like this instead:
for yaml in $(ls ${yamldir}/*.yaml | grep -v "export_config.yaml"); do
By piping the output of ls to grep -v, you can exclude any line including export_config.yaml from the directory listing in the first place.

Inserting python code in a bash script

I've got the following bash script:
#!/bin/bash
while read line
do
ORD=`echo $line | cut -c 7-21`
if [[ -r ../FASTA_SEC/${ORD}.fa ]]
then
WCR=`fgrep -o N ../FASTA_SEC/$ORD.fa | wc -l`
WCT=`wc -m < ../FASTA_SEC/$ORD.fa`
PER1=`echo print $WCR/$WCT.*100 | python`
WCTRIN=`fgrep -o N ../FASTA_SEC_EDITED/$ORD"_Trimmed.fa" | wc -l`
WCTRI=`wc -m < ../FASTA_SEC_EDITED/$ORD"_Trimmed.fa"`
PER2=`echo print $WCTRIN/$WCTRI.*100 | python`
PER3=`echo print $PER1-$PER2 | python`
echo $ORD $PER1 $PER2 $PER3 >> Log.txt
if [ $PER2 -ge 30 -a $PER3 -lt 10 ]
then
mv ../FASTA_SEC/$ORD.fa ./TRASH/$ORD.fa
mv ../FASTA_SEC_EDITED/$ORD"_Trimmed.fa" ./TRASH/$ORD"_Trimmed.fa"
fi
fi
done < ../READ/Data.txt
$PER variables are floating numbers as u might have noticed so I cannot use them normaly in the nested if conditional. I'd like to do this conditional iteration in python but I have no clue how do it whithin a bash script also I dont know how to import the value of the variables $PER2 and $PER3 into python. Could I write directly python code in the same bash script invvoking python somehow?
Thank you for your help, first time facing this.
You can use python -c CMD to execute a piece of python code from the command line. If you want bash to interpolate your environment variables, you should use double quotes around CMD.
You can return a value by calling sys.exit, but keep in mind that true and false in Python have the reverse meaning in bash.
So your code would be:
if python -c "import sys; sys.exit(not($PER2 > 30 and $PER3 < 10 ))"
It is possible to feed Python code to the standard input of python executable with the help of here document syntax:
variable=$(date)
python2.7 <<SCRIPT
print "The current date: %s" % "${variable}"
SCRIPT
In order to avoid parameter substitution (interpretation within the block), quote the first limit string: <<'SCRIPT'.
If you want to assign the output to a variable, use command substitution:
output=$(python2.7 <<SCRIPT
print "The current date: %s" % "${variable}"
SCRIPT
)
Note, it is not recommended to use back quotes for command substitution, as it is impossible to nest them, and the form $(...) is more readable.
maybe this helps?
$ X=4; Y=7; Z=$(python -c "print($X * $Y)")
$ echo $Z
28
python -c "str" takes "str" as input and runs it.
but then why not rewrite all in python? bash commands can nicely be executed with subprocess which is included in python or (need to install that) sh.

feed a command a comma separated list of file names in a directory, extract a variable motif from file names for labels

I have a directory containing files that look like this:
1_reads.fastq
2_reads.fastq
89_reads.fastq
42_reads.fastq
I would like to feed a comma separated list of these file names to a command from a python program, so the input to the python command would like this:
program.py -i 1_reads.fastq,2_reads.fastq,89_reads.fastq,42_reads.fastq
Furthermore, I'd like to use the numbers in the file names for a labeling function within the python command such that the input would look like this:
program.py -i 1_reads.fastq,2_reads.fastq,89_reads.fastq,42_reads.fastq -t s1,s2,s89,s42
Its important that the file names and the label IDs are in the same order.
First: This is a very poorly-thought-out calling convention. Don't use it.
However, if you're using software someone else wrote that already has that convention baked in...
#!/bin/bash
IFS=, # use comma as separator
files=( [[:digit:]]*_* )
[[ -e $files || -L $files ]] || { echo "ERROR: No files matching glob exist" >&2; exit 1; }
prefixes=( )
for file in "${files[#]}"; do
prefixes+=( "s${file%%_*}" )
done
# "exec" only if this is the last command in the script; remove otherwise
exec program.py -i "${files[*]}" -t "${prefixes[*]}"
How this works:
IFS=, causes ${array[*]} to put a comma between each expanded element. Thus, expanding ${files[*]} and ${prefixes[*]} creates comma-separated strings with the contents of each array.
${file%%_*} removes everything after the first _ in a filename, allowing the numbers alone to be extracted.
[[ -e $files || -L $files ]] actually only tests whether the first element in that array exists (as a symlink or otherwise); however, this will always be true if the glob being expanded to form the array matched any files (unless files have been deleted between the two lines' invocation).
Try this:
program.py $(cd DIR && var=$(ls) && echo $var | tr ' ' ',')
That will pass to program.py the string returned by te command line inside the $(..).
That command line will: Enter in your directory, run ls storing the output in a variable, that will remove the newline characters replacing with spaces, and it doesn't add a trailing space. Then echo that variable to 'tr' which will translate spaces to commas.
It can be done easily in pure Bash. Make sure you run from within the directory that contains the files.
#!/bin/bash
shopt -s extglob nullglob
# Create an array of files
f=( +([[:digit:]])_reads.fastq )
# Check that there are some files...
if ((${#f[#]}==0)); then
echo "No files found. Exiting."
exit
fi
# Create an array of labels, directly from the array f:
# Remove trailing _reads.fastq
l=( "${f[#]%_reads.fastq}" )
# And prepend the letter s
l=( "${l[#]/#/s}" )
# Now the arrays f and l are good: check them:
declare -p f l
# To join the arrays, we'll use eval. Safe because the code is single-quoted!
IFS=, eval 'program.py -i "${f[*]}" -t "${l[*]}"'
Note. The use of eval here is perfectly safe as we're passing a constant string (and it's actually an idiomatic way to join an array without using a subshell or a loop). Don't modify the command, in particular the single quotes.
Thanks to Charles Duffy who convinced me to add healthy comments about the use of eval

How to test in shell if a path is already inside environment $*PATH? [duplicate]

With /bin/bash, how would I detect if a user has a specific directory in their $PATH variable?
For example
if [ -p "$HOME/bin" ]; then
echo "Your path is missing ~/bin, you might want to add it."
else
echo "Your path is correctly set"
fi
Using grep is overkill, and can cause trouble if you're searching for anything that happens to include RE metacharacters. This problem can be solved perfectly well with bash's builtin [[ command:
if [[ ":$PATH:" == *":$HOME/bin:"* ]]; then
echo "Your path is correctly set"
else
echo "Your path is missing ~/bin, you might want to add it."
fi
Note that adding colons before both the expansion of $PATH and the path to search for solves the substring match issue; double-quoting the path avoids trouble with metacharacters.
There is absolutely no need to use external utilities like grep for this. Here is what I have been using, which should be portable back to even legacy versions of the Bourne shell.
case :$PATH: # notice colons around the value
in *:$HOME/bin:*) ;; # do nothing, it's there
*) echo "$HOME/bin not in $PATH" >&2;;
esac
Here's how to do it without grep:
if [[ $PATH == ?(*:)$HOME/bin?(:*) ]]
The key here is to make the colons and wildcards optional using the ?() construct. There shouldn't be any problem with metacharacters in this form, but if you want to include quotes this is where they go:
if [[ "$PATH" == ?(*:)"$HOME/bin"?(:*) ]]
This is another way to do it using the match operator (=~) so the syntax is more like grep's:
if [[ "$PATH" =~ (^|:)"${HOME}/bin"(:|$) ]]
Something really simple and naive:
echo "$PATH"|grep -q whatever && echo "found it"
Where whatever is what you are searching for. Instead of && you can put $? into a variable or use a proper if statement.
Limitations include:
The above will match substrings of larger paths (try matching on "bin" and it will probably find it, despite the fact that "bin" isn't in your path, /bin and /usr/bin are)
The above won't automatically expand shortcuts like ~
Or using a perl one-liner:
perl -e 'exit(!(grep(m{^/usr/bin$},split(":", $ENV{PATH}))) > 0)' && echo "found it"
That still has the limitation that it won't do any shell expansions, but it doesn't fail if a substring matches. (The above matches "/usr/bin", in case that wasn't clear).
Here's a pure-bash implementation that will not pick up false-positives due to partial matching.
if [[ $PATH =~ ^/usr/sbin:|:/usr/sbin:|:/usr/sbin$ ]] ; then
do stuff
fi
What's going on here? The =~ operator uses regex pattern support present in bash starting with version 3.0. Three patterns are being checked, separated by regex's OR operator |.
All three sub-patterns are relatively similar, but their differences are important for avoiding partial-matches.
In regex, ^ matches to the beginning of a line and $ matches to the end. As written, the first pattern will only evaluate to true if the path it's looking for is the first value within $PATH. The third pattern will only evaluate to true if if the path it's looking for is the last value within $PATH. The second pattern will evaluate to true when it finds the path it's looking for in-between others values, since it's looking for the delimiter that the $PATH variable uses, :, to either side of the path being searched for.
I wrote the following shell function to report if a directory is listed in the current PATH. This function is POSIX-compatible and will run in compatible shells such as Dash and Bash (without relying on Bash-specific features).
It includes functionality to convert a relative path to an absolute path. It uses the readlink or realpath utilities for this but these tools are not needed if the supplied directory does not have .. or other links as components of its path. Other than this, the function doesn’t require any programs external to the shell.
# Check that the specified directory exists – and is in the PATH.
is_dir_in_path()
{
if [ -z "${1:-}" ]; then
printf "The path to a directory must be provided as an argument.\n" >&2
return 1
fi
# Check that the specified path is a directory that exists.
if ! [ -d "$1" ]; then
printf "Error: ‘%s’ is not a directory.\n" "$1" >&2
return 1
fi
# Use absolute path for the directory if a relative path was specified.
if command -v readlink >/dev/null ; then
dir="$(readlink -f "$1")"
elif command -v realpath >/dev/null ; then
dir="$(realpath "$1")"
else
case "$1" in
/*)
# The path of the provided directory is already absolute.
dir="$1"
;;
*)
# Prepend the path of the current directory.
dir="$PWD/$1"
;;
esac
printf "Warning: neither ‘readlink’ nor ‘realpath’ are available.\n"
printf "Ensure that the specified directory does not contain ‘..’ in its path.\n"
fi
# Check that dir is in the user’s PATH.
case ":$PATH:" in
*:"$dir":*)
printf "‘%s’ is in the PATH.\n" "$dir"
return 0
;;
*)
printf "‘%s’ is not in the PATH.\n" "$dir"
return 1
;;
esac
}
The part using :$PATH: ensures that the pattern also matches if the desired path is the first or last entry in the PATH. This clever trick is based upon this answer by Glenn Jackman on Unix & Linux.
This is a brute force approach but it works in all cases except when a path entry contains a colon. And no programs other than the shell are used.
previous_IFS=$IFS
dir_in_path='no'
export IFS=":"
for p in $PATH
do
[ "$p" = "/path/to/check" ] && dir_in_path='yes'
done
[ "$dir_in_path" = "no" ] && export PATH="$PATH:/path/to/check"
export IFS=$previous_IFS
$PATH is a list of strings separated by : that describe a list of directories. A directory is a list of strings separated by /. Two different strings may point to the same directory (like $HOME and ~, or /usr/local/bin and /usr/local/bin/). So we must fix the rules of what we want to compare/check. I suggest to compare/check the whole strings, and not physical directories, but remove duplicate and trailing /.
First remove duplicate and trailing / from $PATH:
echo $PATH | tr -s / | sed 's/\/:/:/g;s/:/\n/g'
Now suppose $d contains the directory you want to check. Then pipe the previous command to check $d in $PATH.
echo $PATH | tr -s / | sed 's/\/:/:/g;s/:/\n/g' | grep -q "^$d$" || echo "missing $d"
A better and fast solution is this:
DIR=/usr/bin
[[ " ${PATH//:/ } " =~ " $DIR " ]] && echo Found it || echo Not found
I personally use this in my bash prompt to add icons when i go to directories that are in $PATH.

Categories