Hi, I'm trying to call the following command from Python:
comm -3 <(awk '{print $1}' File1.txt | sort | uniq) <(awk '{print $1}' File2.txt | sort | uniq) | grep -v "#" | sed "s/\t//g"
How can I make this call when the inputs to the comm command are themselves piped?
Is there an easy and straightforward way to do it?
I tried the subprocess module:
subprocess.call("comm -3 <(awk '{print $1}' File1.txt | sort | uniq) <(awk '{print $1}' File2.txt | sort | uniq) | grep -v '#' | sed 's/\t//g'")
Without success, it says:
OSError: [Errno 2] No such file or directory
Or do I have to create the individual calls separately and then connect them using PIPE, as described in the subprocess documentation:
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
Process substitution (<(...)) is bash functionality; POSIX /bin/sh doesn't have it. Thus, you need a shell, but it can't be just any shell (like /bin/sh, as used by shell=True on non-Windows platforms) -- it needs to be bash. (Incidentally, the OSError above comes from passing the command as a single string without shell=True: the whole string is then treated as the name of one executable to run.)
subprocess.call(['bash', '-c', "comm -3 <(awk '{print $1}' File1.txt | sort | uniq) <(awk '{print $1}' File2.txt | sort | uniq) | grep -v '#' | sed 's/\t//g'"])
By the way, if you're going to go this route with arbitrary filenames, pass them out-of-band (as below: passing _ as $0, File1.txt as $1, and File2.txt as $2):
subprocess.call(['bash', '-c',
    '''comm -3 <(awk '{print $1}' "$1" | sort | uniq) '''
    '''        <(awk '{print $1}' "$2" | sort | uniq) '''
    ''' | grep -v '#' | tr -d "\t"''',
    '_', "File1.txt", "File2.txt"])
That said, the best-practices approach is indeed to set up the chain yourself. The below is tested with Python 3.6 (note the need for the pass_fds argument to subprocess.Popen to make the file descriptors referred to via /dev/fd/## links available):
awk_filter = '''! /#/ && !seen[$1]++ { print $1 }'''

p1 = subprocess.Popen(['awk', awk_filter],
                      stdin=open('File1.txt', 'r'),
                      stdout=subprocess.PIPE)
p2 = subprocess.Popen(['sort', '-u'],
                      stdin=p1.stdout,
                      stdout=subprocess.PIPE)
p3 = subprocess.Popen(['awk', awk_filter],
                      stdin=open('File2.txt', 'r'),
                      stdout=subprocess.PIPE)
p4 = subprocess.Popen(['sort', '-u'],
                      stdin=p3.stdout,
                      stdout=subprocess.PIPE)
p5 = subprocess.Popen(['comm', '-3',
                       '/dev/fd/%d' % p2.stdout.fileno(),
                       '/dev/fd/%d' % p4.stdout.fileno()],
                      pass_fds=(p2.stdout.fileno(), p4.stdout.fileno()),
                      stdout=subprocess.PIPE)
p6 = subprocess.Popen(['tr', '-d', '\t'],
                      stdin=p5.stdout,
                      stdout=subprocess.PIPE)
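# Mirroring the p1.stdout.close() pattern from the docs snippet quoted
# above: closing the parent's copies of the intermediate pipe ends lets
# each stage see EOF/SIGPIPE if a later stage exits early.
p1.stdout.close()
p2.stdout.close()
p3.stdout.close()
p4.stdout.close()
p5.stdout.close()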
result = p6.communicate()
This is a lot more code, but (assuming the filenames are parameterized in the real world) it's also safer code: you aren't vulnerable to bugs like Shellshock that are triggered by the simple act of starting a shell, and you don't need to worry about passing variables out-of-band to avoid injection attacks (except in arguments to commands -- like awk -- that are scripting-language interpreters in their own right).
That said, another thing to think about is just implementing the whole thing in native Python.
lines_1 = set(line.split()[0] for line in open('File1.txt', 'r') if line.strip() and '#' not in line)
lines_2 = set(line.split()[0] for line in open('File2.txt', 'r') if line.strip() and '#' not in line)
not_common = (lines_1 - lines_2) | (lines_2 - lines_1)
for line in sorted(not_common):
    print(line)
Also check out plumbum -- it makes life easier. See the Pipelining section of the docs:
http://plumbum.readthedocs.io/en/latest/
This may be wrong, but you can try this:
from plumbum.cmd import grep, comm, awk, sort, uniq, sed
_c1 = awk['{print $1}', 'File1.txt'] | sort | uniq
_c2 = awk['{print $1}', 'File2.txt'] | sort | uniq
chain = comm['-3', _c1(), _c2() ] | grep['-v', '#'] | sed['s/\t//g']
chain()
Let me know if this goes wrong; I'll try to fix it.
Edit: As pointed out, I missed the process substitution; I think it has to be done explicitly, by redirecting the intermediate output to a temporary file and then using that file as the argument to comm.
So the above would now actually become:
from plumbum.cmd import grep, comm, awk, sort, uniq, sed
_c1 = awk['{print $1}', 'File1.txt'] | sort | uniq
_c2 = awk['{print $1}', 'File2.txt'] | sort | uniq
(_c1 > "/tmp/File1.txt")(), (_c2 > "/tmp/File2.txt")()
chain = comm['-3', "/tmp/File1.txt", "/tmp/File2.txt" ] | grep['-v', '#'] | sed['s/\t//g']
chain()
Also, alternatively, you can use the method described by @Charles, making use of mkfifo.
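For reference, here is a rough sketch of that mkfifo route in plain Python (the FIFO path names are illustrative, not from the original answers):

import os
import subprocess
import tempfile

# Create two named pipes to stand in for bash's <(...) substitutions.
tmpdir = tempfile.mkdtemp()
fifo1 = os.path.join(tmpdir, 'fifo1')
fifo2 = os.path.join(tmpdir, 'fifo2')
os.mkfifo(fifo1)
os.mkfifo(fifo2)

# Start the producers first; opening a FIFO for writing blocks until
# the reader (comm, below) opens the other end.
w1 = subprocess.Popen("awk '{print $1}' File1.txt | sort -u > " + fifo1, shell=True)
w2 = subprocess.Popen("awk '{print $1}' File2.txt | sort -u > " + fifo2, shell=True)

out = subprocess.check_output(['comm', '-3', fifo1, fifo2])
w1.wait()
w2.wait()
print(out.decode())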
I've got a CSV file with a column I want to sift through. I want to use a pattern file to find all entries where a pattern occurs, even as just part of the column's value, and replace the whole cell value with that pattern.
I made a list of keywords that I want to use as my "pattern" bank.
So, if a cell in this column (in this case only the second) contains one of these "patterns" as part of its string, I want to replace the whole cell with that "pattern".
So, for example:
my target file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis & Private Hire,moreinfo2
id3,Tax Services,moreinfo3
id4,Tools & Hardware,moreinfo4
id5,Tool Sharpening,moreinfo5
id6,Tool Shops,moreinfo6
id7,Video Conferencing,moreinfo7
id8,Video & DVD Shops,moreinfo8
id9,Woodworking Equipment & Supplies,moreinfo9
my "pattern" file:
Taxidermy Equipment & Supplies
Taxis
Tax Services
Tool
Video
Wood
output file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
I came up with the usual "find and replace" sed:
sed -i 's/PATTERN/REPLACE/g' file.csv
but I want it to run on a specific column, so I came up with:
awk 'BEGIN{OFS=FS="|"}$2==PATTERN{$2=REPLACE}{print}' file.csv
but it doesn't match on just part of a string (e.g. the pattern Video should turn "Video & DVD Shops" into "Video"), and I can't figure out how to make awk read the patterns from a file for the PATTERN block.
Is there an awk script for this? Or do I have to write something (in Python with the built-in csv module, for example)?
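For what it's worth, a rough sketch of that Python/csv route might look like this (the file names patterns.txt, target.csv, and output.csv are placeholders, not from the post):

import csv

# Read the pattern bank, skipping blank lines.
with open('patterns.txt') as pf:
    patterns = [line.strip() for line in pf if line.strip()]

with open('target.csv', newline='') as inf, \
        open('output.csv', 'w', newline='') as outf:
    writer = csv.writer(outf)
    for row in csv.reader(inf):
        if len(row) > 1:
            for p in patterns:
                if p in row[1]:   # substring match on the second column
                    row[1] = p    # replace the whole cell with the pattern
                    break
        writer.writerow(row)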
In awk, using index(). It only prints a record if a replacement is made, but it's easy to modify so it prints even when there is no match (for example, replace print $1, i, $3 with $0 = $1 OFS i OFS $3 and add a bare 1 after the closing brace):
$ awk -F, -v OFS=, '
NR==FNR { a[$1]; next }       # store "patterns" into array a
{
    for (i in a)              # go through all of a for each record
        if (index($2, i))     # if a "pattern" is a substring of $2
            print $1, i, $3   # print with the replacement made
}' pattern_file target_file
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
Perl solution, using Text::CSV_XS:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my ($input_file, $pattern_file) = @ARGV;

open my $pfh, '<', $pattern_file or die $!;
chomp( my @patterns = <$pfh> );

my $aoa = csv(in => $input_file);
for my $line (@$aoa) {
    for my $pattern (@patterns) {
        if (-1 != index $line->[1], $pattern) {
            $line->[1] = $pattern;
            last
        }
    }
}
csv(in => $aoa, quote_space => 0, eol => "\n", out => \*STDOUT);
Here's a (mostly) awk solution:
#!/bin/bash
patterns_regex=`cat patterns_file | tr '\n' '|'`

cat target_file | awk -F"," -v patterns="$patterns_regex" '
BEGIN {
    OFS = ",";
    split(patterns, patterns_split, "|");
}
{
    for (pattern_num in patterns_split) {
        pattern = patterns_split[pattern_num];
        if (pattern != "" && $2 ~ pattern) {
            print $1, pattern, $3
        }
    }
}'
When you want to solve this with sed, you will need a few steps.
For each pattern you will need a command like
sed 's/^\([^,]*\),\(.*Tool.*\),/\1,Tool,/' inputfile
You will need each pattern twice. In a sed replacement, & stands for the whole match, so you can duplicate each pattern with
sed 's/.*/"&" "&"/' patternfile
# Change the / delimiters into #, that's easier for the final command
sed 's#.*#"&" "&"#' patternfile
When you instruct sed to read a command file with -f, you do not need to start each line with sed; the file just contains bare sed commands. Such a command file can be generated from the pattern file with
sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#' patternfile
You can store this in a file and use that file, but with process substitution you can do things like
cat <(echo "Now this line from echo is handled as a file")
Nice. Let's test the solution:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#' patternfile) inputfile
Almost there! Only the first output line is strange. What's happening?
The first pattern contains a &, and & has a special meaning in a sed replacement (it stands for the whole match).
We can patch the generator so that each & in the generated commands is escaped with a backslash:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile) inputfile
I am trying to read the total number of files to be synced using rsync, reading the value with the following Python code. I get the output below; what should I modify to get the desired output?
Output
b'10'
Desired Output
10
cmd
rsync -nvaz --delete --stats user@host:/www/ . | ./awk.sh
awk.sh
awk '\
BEGIN {count = 0}
/deleting/ {if ( length($1) > 0 ) ++count} \
/Number of regular files transferred: / {count += $6} \
END \
{
printf "%d",count
}'
Python
subprocess.check_output(cmd, shell=True)
Your awk script is just looking for a line that includes a particular string and printing a count based on it. Since your Python script needs to read stdout to get that value anyway, you may as well ditch the awk script and stick with Python. With the Popen object you can read stdout line by line:
import subprocess

# for test...
source_dir = 'test1/'
target_dir = 'test2/'

count = 0
proc = subprocess.Popen(['rsync', '-nvaz', '--delete', '--stats',
                         source_dir, target_dir],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in proc.stdout:
    if line.startswith(b'Number of regular files transferred:'):
        count = int(line.split(b':')[1])
proc.wait()
print(count)
Decoded the output to UTF-8, then parsed it with a regex:
import re
o = subprocess.check_output(cmd, shell=True)
g = re.search(r'count=(\d+)', o.decode("utf-8"), re.M | re.I)
count = int(g.group(1)) if g else 0
We wrote an awk one-liner to split an input CSV file (Assay_51003_target_pairs.csv) into multiple files: rows that agree on certain columns (column 1, column 2, etc.) are grouped into the same new file, and each new file is named using those column values.
Here is the one-liner:
awk -F "," 'NF>1 && NR>1 && $1==$1 && $2==$2 && $9==$9 && $10==$10{print $0 >> ("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv");close("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv")}' Assay_51003_target_pairs.csv
This will generate the following example output (Assay_$1_target_$3_assay_$9_bcassay_$10_bcalt_assay.csv):
Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,8888,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,8888,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1688,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1688,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Later on we would like to do, for example,
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
#################################################
for b in 1645 1688
do
for c in 8888 9999
do
awk -F, -f max_min.awk Assay_51003_target_$b_assay_7777_bcassay_$c_bcalt_assay.csv
done
done
However, we don't know whether there is a way to write a loop for the follow-up work, because the output file names are "random". Is there any way for Linux/bash to parse parts of the file name into loop variables (such as parsing 1645 and 1688 into b, and 8888 and 9999 into c)?
With Bash it should be pretty easy, granted the values are always numbers:
shopt -s nullglob
FILES=(Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv)  ## No need for +([[:digit:]]); the difference is unlikely to matter.
for FILE in "${FILES[@]}"; do
    IFS=_ read -a A <<< "$FILE"
    # Do something with ${A[1]} ${A[3]} ${A[5]} and ${A[7]}
    ...
    # Or
    IFS=_ read __ A __ B __ C __ D __ <<< "$FILE"
    # Do something with $A $B $C and $D
    ...
done
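If the follow-up processing ever moves to Python, the same file-name parsing is only a few lines (a sketch, not part of the original answer; the max/min step is left as a placeholder):

import glob
import re

# Recover the numeric fields from each generated file name; the regex
# mirrors the awk-generated naming template from the question.
name_re = re.compile(
    r'Assay_(\d+)_target_(\d+)_assay_(\d+)_bcassay_(\d+)_bcalt_assay\.csv')

for path in glob.glob('Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv'):
    m = name_re.match(path)
    if m:
        # groups 2 and 4 are the "b" (target) and "c" (bcassay) values
        aid, b, assay, c = m.groups()
        print(path, b, c)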
Asking if $1 == $1, etc., is pointless, since it will always be true. The following code is equivalent:
awk -F, '
NF > 1 && NR > 1 {
    f = "Assay_" $1 "_target_" $3 "_assay_" $9 \
        "_bcassay_" $10 "_bcalt_assay.csv"
    print >> f;
    close(f)
}' Assay_51003_target_pairs.csv
The reason this works is that the same file is appended to if the fields used in the construction of the filename match. But I wonder if it's an error on your part to be using $3 instead of $2 since you mention $2 in your description.
At any rate, what you are doing seems very odd. If you can give a straightforward description of what you are actually trying to accomplish, there may be a completely different way to do it.
I have a file with multiple KV pairs.
Input:
$ cat input.txt
k1:v1 k2:v2 k3:v3
...
I am only interested in the values. The keys (names) are just there to remind me what each value means. Essentially, I am looking to cut the keys out so that I can plot the value columns.
Output:
$ ...
v1 v2 v3
Is there a bash one-liner that can help me achieve this?
UPDATE
This is how I am currently doing it (it looks ugly):
>> cat input.txt | python -c "import sys; \
lines = sys.stdin.readlines(); \
values = [[i.split(':')[1] for i in item] for item in \
[line.split() for line in lines]]; \
import os; [os.system('echo %s'%v) for v in \
['\t'.join(value) for value in values]]" > output.txt
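For comparison, the same transformation without the echo round-trip could be written as (a sketch, using the same input.txt/output.txt names):

# Split each "k:v" token on the first colon and keep only the value.
with open('input.txt') as inf, open('output.txt', 'w') as outf:
    for line in inf:
        values = [kv.split(':', 1)[1] for kv in line.split()]
        outf.write('\t'.join(values) + '\n')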
Is this OK for you?
sed -r 's/\w+://g' yourfile
test:
kent$ echo "k1:v1 k2:v2 k3:v3"|sed -r 's/\w+://g'
v1 v2 v3
update
Well, if your key contains "-" etc., see below:
kent$ echo "k1#-$%-^=:v1 k2:v2 k3:v3"|sed -r 's/[^ ]+://g'
v1 v2 v3
awk -v FS=':' -v RS=' ' -v ORS=' ' '{print $2}' foo.txt
http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
I see sed, awk and python, so here's plain bash:
while IFS=' ' read -a kv ; do printf '%s ' "${kv[@]#*:}" ; done < input.txt
Just for good measure, here's a perl version:
perl -n -e 'print(join(" ",values %{{@{[split(/[:\s]/,$_)]}}})," ")' < input.txt
The order of the values changes, though, so it's probably not going to be what you want.
Solution with awk:
awk '{split($0,p," "); for(kv in p) {split(p[kv],a,":"); printf "%s ",a[2];} print ""}' foo.txt
Try this
Input.txt
k1:v1 k2:v2 k3:v3
Code
awk -F " " '{for( i =1 ; i<=NF ;i+=1) print $i}' Input.txt | cut -d ":" -f 2 | tr '\n' ' '
Output
v1 v2 v3