Linux/bash/awk read parts of the files name as variables - python

We wrote an awk one-liner to split an input CSV file (Assay_51003_target_pairs.csv) into multiple files. Rows whose column 1 values match, whose column 2 values match, and so on, are grouped into the same new file. Each new file is named using those column values.
Here is the one liner
awk -F "," 'NF>1 && NR>1 && $1==$1 && $2==$2 && $9==$9 && $10==$10{print $0 >> ("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv");close("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv")}' Assay_51003_target_pairs.csv
This will generate the following example output (Assay_$1_target_$3_assay_$9_bcassay_$10_bcalt_assay.csv):
Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,8888,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,8888,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1688,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1688,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Later on we would like to do, for example,
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
#################################################
for b in 1645 1688
do
  for c in 8888 9999
  do
    awk -F, -f max_min.awk "Assay_51003_target_${b}_assay_7777_bcassay_${c}_bcalt_assay.csv"
  done
done
However, we don't know if there is any way to write a loop for the followup work because the outfile names are "random". May we know if there is any way for linux/bash to parse part of the file name into loop variables (such as parse 1645 and 1688 into b and 8888 & 9999 into c)?

With Bash this should be fairly easy, provided the values are always numbers:
shopt -s nullglob
FILES=(Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv) ## No need to do +([[:digit:]]). The difference is unlikely.
for FILE in "${FILES[@]}"; do
    IFS=_ read -a A <<< "$FILE"
    # Do something with ${A[1]} ${A[3]} ${A[5]} and ${A[7]}
    ...
    # Or
    IFS=_ read __ A __ B __ C __ D __ <<< "$FILE"
    # Do something with $A $B $C and $D
    ...
done
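Since the question is also tagged python, here is a minimal sketch of the same idea there, assuming the file names always follow the pattern produced by the one-liner and the variable parts are digits only. The names NAME_RE, parse_name and matching_files are made up for this illustration.

```python
import re
from pathlib import Path

# Mirrors the names written by the awk one-liner; digits-only parts are an assumption.
NAME_RE = re.compile(
    r"Assay_(\d+)_target_(\d+)_assay_(\d+)_bcassay_(\d+)_bcalt_assay\.csv"
)

def parse_name(filename):
    """Return the four numeric parts of the name as strings, or None if it does not fit."""
    match = NAME_RE.fullmatch(Path(filename).name)
    return match.groups() if match else None

def matching_files(directory="."):
    """Yield (path, parts) for every file in directory whose name fits the pattern."""
    for path in sorted(Path(directory).iterdir()):
        parts = parse_name(path.name)
        if parts is not None:
            yield path, parts
```

Looping over matching_files() gives the parts as a tuple, so something like `for path, (a, b, c, d) in matching_files():` plays the role of the nested b/c loops without hard-coding the values.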

Asking if $1 == $1, etc., is pointless, since it will always be true. The following code is equivalent:
awk -F, '
NF > 1 && NR > 1 {
f = "Assay_" $1 "_target_" $3 "_assay_" $9 \
"_bcassay_" $10 "_bcalt_assay.csv"
print >> f;
close(f)
}' Assay_51003_target_pairs.csv
The reason this works is that the same file is appended to if the fields used in the construction of the filename match. But I wonder if it's an error on your part to be using $3 instead of $2 since you mention $2 in your description.
At any rate, what you are doing seems very odd. If you can give a straightforward description of what you are actually trying to accomplish, there may be a completely different way to do it.
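For comparison, the grouping can be sketched in Python as well. This is only an illustration under a few assumptions: the input has a header line, rows with fewer than 10 fields are skipped (the awk test is just NF > 1), and each output file is rewritten rather than appended to. The function names are invented for the example.

```python
import csv
from collections import defaultdict

def output_name(row):
    # Same pieces as the awk file name: fields $1, $3, $9 and $10 (0-based 0, 2, 8, 9).
    return (f"Assay_{row[0]}_target_{row[2]}_assay_{row[8]}"
            f"_bcassay_{row[9]}_bcalt_assay.csv")

def group_rows(rows):
    """Map each output file name to the list of rows that belong in it."""
    groups = defaultdict(list)
    for row in rows:
        if len(row) >= 10:  # assumption: skip rows too short to build a name
            groups[output_name(row)].append(row)
    return dict(groups)

def split_csv(in_path):
    """Read in_path, skip its header line, and write one CSV per group."""
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        next(reader, None)  # skip the header, like NR > 1
        groups = group_rows(reader)
    for name, rows in groups.items():
        with open(name, "w", newline="") as dst:
            csv.writer(dst, lineterminator="\n").writerows(rows)
    return sorted(groups)
```

The return value of split_csv is the list of files it wrote, which also answers the follow-up problem of knowing which names to loop over.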

Related

Calling a python script from awk

I have a file with about 150k rows and two columns. I need to run a Python script on the first field and save its output as a third column, so that the change looks like this:
Original File:
Col1 Col2
d 2
e 4
f 6
New file:
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
I'm not able to run the script from inside awk.
cat original.list | awk -F" " ' {`/homes/script.py $1`}'
If I were able to, I would then want to save it as a variable, and print the new variable, plus $1 and $2 to the new file.
thanks in advance (related question here)
the answer to the "related question" you linked (and the one posted in the comments) actually solves your problem, it just needs to be adapted to your specific case.
cat original.list | awk -F" " ' {`/homes/script.py $1`}'
cat is useless here because awk can open and read the file by itself
you don't need -F" " because awk splits fields on whitespace by default
backticks `` won't run your script; that's a (discouraged) shell feature and doesn't work in awk
we can use command | getline var to execute a command and store its
(first line of) output in a variable. from man awk:
command | getline var
pipes a record from command into var.
using your example file:
$ cat original
Col1 Col2
d 2
e 4
f 6
$
and a dummy script.py:
$ cat script.py
#!/bin/python
print("output")
$
we can do something like this:
$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
$
the first action runs on the first line of input: it adds Col3 to the header and avoids passing Col1 to the python script.
in the other action, we first build the command by concatenating $1 to the script's path, then we run it and store its first line of output in the out variable (I'm assuming your python script's output is just one line). close(cmd) is important because after getline the pipe reading from cmd's output would remain open, and doing this for many records could lead to errors like too many open files. at the end we print $0 and cmd's output.
the third column's formatting looks a bit off; you can improve it either from awk using printf or with an external program like column, e.g.:
$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original | column -t
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
$
lastly, doing all this on a 150k-row file means calling the python script 150k times, so it will probably be a slow task. I think performance could be improved by doing everything directly in the python script, as already suggested in the comments, but whether or not that is applicable to your specific case goes beyond the scope of this question/answer.
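just to illustrate what "doing everything in python" could look like, here is a rough sketch. compute is a hypothetical stand-in for whatever /homes/script.py does with the first field, and add_column is a name made up for this example; the first line is assumed to be the header.

```python
def compute(value):
    # Hypothetical placeholder: put the logic of script.py here.
    return "output"

def add_column(lines, func=compute):
    """Yield each input line with one extra column appended."""
    for number, line in enumerate(lines):
        line = line.rstrip("\n")
        if number == 0:
            # Header line: name the new column instead of calling func.
            yield f"{line} Col3"
            continue
        fields = line.split()
        if not fields:
            # Leave blank lines untouched.
            yield line
            continue
        yield f"{line} {func(fields[0])}"
```

with this, the file is read once and the function is called in-process for every row, e.g. `with open("original") as fh: print("\n".join(add_column(fh)))`, instead of starting a new interpreter 150k times.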

Split csv file thousands of times based on groupby

(An adaptation of David Erickson's question here)
Given a CSV file with columns A, B, and C and some values:
echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk 'BEGIN{OFS = ","}{print $2, $3, $4}' | head -n 10000 >> file.csv
We would like to sort by columns a and b:
sort -t ',' -k1,1n -k2,2n file.csv > file_.csv
head -n 3 file_.csv
>a,b,c
3,50240,18792
7,54871,39438
And then for every unique pair (a, b) create a new CSV titled '{a}_Invoice_{b}.csv'.
The main challenge seems to be the I/O overhead of writing thousands of files - I started trying with awk but ran into awk: 17 makes too many open files.
Is there a quicker way to do this, in awk, Python, or some other scripting language?
Additional info:
I know I can do this in Pandas - I'm looking for a faster way using text processing
Though I used urandom to generate the sample data, the real data has runs of recurring values: for example a few rows where a=3, b=7. If so these should be saved as one file. (The idea is to replicate Pandas' groupby -> to_csv)
In python:
import pandas as pd
df = pd.read_csv("file.csv")
for (a, b), gb in df.groupby(['a', 'b']):
    gb.to_csv(f"{a}_Invoice_{b}.csv", header=True, index=False)
In awk you can split like so, you will need to put the header back on each resultant file:
awk -F',' '{ out=$1"_Invoice_"$2".csv"; print >> out; close(out) }' file.csv
With adding the header line back:
awk -F',' 'NR==1 { hdr=$0; next } { out=$1"_Invoice_"$2".csv"; if (!seen[out]++) {print hdr > out} print >> out; close(out); }' file.csv
The benefit of this last example is that the input file.csv doesn't need to be sorted and is processed in a single pass.
Since your input is to be sorted on the key fields all you need is:
sort -t ',' -k1,1n -k2,2n file.csv |
awk -F ',' '
NR==1 { hdr=$0; next }
{ out = $1 "_Invoice_" $2 ".csv" }
out != prev {
close(prev)
print hdr > out
prev = out
}
{ print > out }
'
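The same "sorted input, one open file at a time" idea can be sketched in plain Python without Pandas. This assumes the file is already sorted on a and b with the header as its first line; invoice_groups and split_sorted are names invented for this sketch.

```python
import csv
from itertools import groupby
from operator import itemgetter

def invoice_groups(rows):
    """Yield (file name, rows) for each run of consecutive rows sharing columns a and b."""
    for (a, b), run in groupby(rows, key=itemgetter(0, 1)):
        yield f"{a}_Invoice_{b}.csv", list(run)

def split_sorted(in_path):
    """Write one CSV per (a, b) run; in_path must already be sorted on a and b."""
    written = []
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input: nothing to write
            return written
        for name, rows in invoice_groups(reader):
            # Only one output file is open at any time.
            with open(name, "w", newline="") as dst:
                writer = csv.writer(dst, lineterminator="\n")
                writer.writerow(header)
                writer.writerows(rows)
            written.append(name)
    return written
```

Because itertools.groupby only merges adjacent rows, unsorted input would overwrite a file when the same (a, b) pair shows up again later, which is why the sort step matters here.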

Find part of a string in CSV and replace whole cell with new entry?

I've got a CSV file with a column which I want to sift through. I want to use a pattern file to find all entries where the pattern exists even in part of the column's value, and replace the whole cell value with this "pattern".
I made a list of keywords that I want to use as my "pattern" bank;
So, if a cell in this column (this case only second) has this "pattern" as part of its string, then I want to replace the whole cell with this "pattern".
so for example:
my target file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis & Private Hire,moreinfo2
id3,Tax Services,moreinfo3
id4,Tools & Hardware,moreinfo4
id5,Tool Sharpening,moreinfo5
id6,Tool Shops,moreinfo6
id7,Video Conferencing,moreinfo7
id8,Video & DVD Shops,moreinfo8
id9,Woodworking Equipment & Supplies,moreinfo9
my "pattern" file:
Taxidermy Equipment & Supplies
Taxis
Tax Services
Tool
Video
Wood
output file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
I came up with the usual "find and replace" sed:
sed -i 's/PATTERN/REPLACE/g' file.csv
but I want it to run on a specific column, so I came up with:
awk 'BEGIN{OFS=FS="|"}$2==PATTERN{$2=REPLACE}{print}' file.csv
but it doesn't work on "part of string" ([Video]:"Video & DVD Shops" -> "Video") and I can't seem to get it how awk takes input as a file for the "Pattern" block.
Is there an awk script for this? Or do I have to write something (in python with the built in csv suit for example?)
In awk, using index. It only prints a record if a replacement is made, but it's easy to modify it to print even when there is no match (for example, replace the print $1,i,$3} with $0=$1 OFS i OFS $3} 1):
$ awk -F, -v OFS=, '
NR==FNR { a[$1]; next } # store "patterns" to a arr
{ for(i in a) # go thru whole a for each record
if(index($2,i)) # if "pattern" matches $2
print $1,i,$3 # print with replacement
}
' pattern_file target_file
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
Perl solution, using Text::CSV_XS:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my ($input_file, $pattern_file) = @ARGV;
open my $pfh, '<', $pattern_file or die $!;
chomp( my @patterns = <$pfh> );
my $aoa = csv(in => $input_file);
for my $line (@$aoa) {
    for my $pattern (@patterns) {
        if (-1 != index $line->[1], $pattern) {
            $line->[1] = $pattern;
            last
        }
    }
}
csv(in => $aoa, quote_space => 0, eol => "\n", out => \*STDOUT);
Here's a (mostly) awk solution:
#!/bin/bash
patterns_regex=`cat patterns_file | tr '\n' '|'`
cat target_file | awk -F"," -v patterns="$patterns_regex" '
BEGIN {
OFS=",";
split(patterns, patterns_split, "|");
}
{
for (pattern_num in patterns_split) {
pattern=patterns_split[pattern_num];
if (pattern != "" && $2 ~ pattern) {
print $1,pattern,$3
}
}
}'
When you want to solve this with sed, you will need some steps.
For each pattern you will need a command like
sed 's/^\([^,]*\),\(.*Tool.*\),/\1,Tool,/' inputfile
You will need each pattern twice, you can translate the patternfile with
sed 's/.*/"&" "&"/' patternfile
# Change the / into #; that's easier for the final command
sed 's#.*#"&" "&"#' patternfile
When you instruct sed to read a commandfile, you do not need to start each line with sed. The commandfile will look like
sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile
You can store this in a file and use the file, but with process substitution you can do things like
cat <(echo "Now this line from echo is handled as a file")
Nice. Let's test the solution
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#' patternfile) inputfile
Almost there! Only the first output line is strange. What's happening?
The first pattern has a &, and that has a special meaning.
We can patch our command by adding a backslash in the pattern:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile) inputfile
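Since the question also asks about Python's built-in csv module, here is a minimal sketch of that route. It assumes the patterns are plain substrings (not regular expressions), that the first matching pattern in file order wins, and that rows without a match are kept unchanged; load_patterns and replace_cell are names chosen for this example.

```python
import csv

def load_patterns(path):
    """Read one pattern per line, ignoring blank lines."""
    with open(path) as fh:
        return [line.rstrip("\n") for line in fh if line.strip()]

def replace_cell(rows, patterns, column=1):
    """Yield rows whose `column` cell is replaced by the first pattern it contains."""
    for row in rows:
        row = list(row)
        if len(row) > column:
            for pattern in patterns:
                if pattern in row[column]:  # plain substring test
                    row[column] = pattern
                    break
        yield row

def rewrite(target_path, pattern_path, out_path):
    """Apply replace_cell to target_path and write the result to out_path."""
    patterns = load_patterns(pattern_path)
    with open(target_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerows(replace_cell(csv.reader(src), patterns))
```

Going through csv.reader rather than splitting on commas by hand means quoted cells that contain commas are still treated as one cell.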

Comparing field#1 in two adjacent lines and printing fields#2 from all the lines where field#1 is identical

Can be considered a variant of this bash question that I looked up before asking this question here.
Following is a sample file:
C=b933cda8ce0/0 p=880080
C=b933cdd6580/0 p=880080
C=b933d02a240/0 p=880080
C=b933d059610/0 p=880080
C=b933d1c8690/0 p=880080
C=b933d2c1b60/0 p=1560315
C=b933d2c1b60/0 p=880080
C=b933d32f240/0 p=1229793
C=b933d32f240/0 p=123412
Output here should be:
C=b933d2c1b60/0 p1=1560315 p2=880080
C=b933d32f240/0 p1=1229793 p2=123412
I need to print out all the values of field#2 against field#1 from all the lines where field#1 matches.
Although I got the job done using the following long one-liner, it doesn't really seem elegant or efficient to me:
d=0; q=0; cat file |while read -r c p; do if [[ $c = $d ]]; then printf "$c\t$p\t$q\n"; fi; d=$c; q=$p; done
Code could be in any of the langs/tools tagged.
awk to the rescue
awk '{
c[$1]++;
sub("p","p"c[$1],$2);
sep=(c[$1]>1)?FS:"";
a[$1]=a[$1] sep $2
}
END {
for(i in a) print i, a[i]
}' file
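This concatenates the second fields based on the key (the first field) and suffixes p with the index of the value (counted in c). The sep variable is a formatting trick to avoid a double space between the first and second fields. Note that every key is printed, including those that occur on only one line.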
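If only the keys that occur more than once should be printed, as in the desired output, a Python sketch could look like the following. It groups all lines with the same first field, not just adjacent ones, which is an assumption about what is wanted; repeated_keys is a name made up for the example.

```python
from collections import defaultdict

def repeated_keys(lines):
    """Return 'key p1=... p2=...' strings for keys seen on more than one line."""
    values = defaultdict(list)  # dicts keep insertion order, so keys stay in input order
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        key, value = fields
        # Keep only what follows "p=".
        values[key].append(value.partition("=")[2])
    out = []
    for key, vals in values.items():
        if len(vals) > 1:
            numbered = " ".join(f"p{i}={v}" for i, v in enumerate(vals, 1))
            out.append(f"{key} {numbered}")
    return out
```

Reading the sample with `with open("file") as fh: print("\n".join(repeated_keys(fh)))` should give the two lines shown in the question.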

intersperse the lines of two different files

I have to do a simple task, but I don't know how to do it and I'm stuck. I need to intersperse the lines of two different files, four lines at a time:
File 1:
1
2
3
4
5
6
7
8
9
10
11
12
FILE 2:
A
B
C
D
E
F
G
H
I
J
K
L
Desired result:
1
2
3
4
A
B
C
D
5
6
7
8
E
F
G
H
9
10
11
12
I
J
K
L
I'm looking for a sed, awk or python script, or any other bash command.
Thanks for your time!!
I tried to do it using specific Python libraries that recognize the 4-line modules of each file, but it doesn't work. Now I'm trying to do it without these libraries, but I don't know how.
import sys
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def main(forward, reverse):
    # Walk both FASTQ files in parallel, one record from each per iteration
    for F, R in zip(SeqIO.parse(forward, "fastq"), SeqIO.parse(reverse, "fastq")):
        fastq_out_F = SeqRecord(F.seq, id=F.id, description="")
        fastq_out_F.letter_annotations["phred_quality"] = F.letter_annotations["phred_quality"]
        fastq_out_R = SeqRecord(R.seq, id=R.id, description="")
        fastq_out_R.letter_annotations["phred_quality"] = R.letter_annotations["phred_quality"]
        # format() already ends with a newline, so suppress print's own
        print(fastq_out_F.format("fastq"), end="")
        print(fastq_out_R.format("fastq"), end="")

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
This might work for you (using GNU sed):
sed -e 'n;n;n;R file2' -e 'R file2' -e 'R file2' -e 'R file2' file1
or using paste/bash (the delimiter list is three spaces followed by \n, so every four lines are joined into one):
paste -d' ' <(paste -sd'   \n' file1) <(paste -sd'   \n' file2) | tr ' ' '\n'
or:
parallel -N4 --xapply 'printf "%s\n%s\n" {1} {2}' :::: file1 :::: file2
It can be done in pure bash:
f1=""; f2=""
while test -z "$f1" -o -z "$f2"; do
{ read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE"; } || f1=end;
{ read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE"; } || f2=end;
done < f1 3< f2
The idea is to use a new file descriptor (3 in this case) and read from stdin and this file descriptor at the same time.
A mix of paste and sed can also be used if you do not have GNU sed:
paste -d '\n' f1 f2 | sed -e 'x;N;x;N;x;N;x;N;x;N;x;N;x;N;s/^\n//;H;s/.*//;x'
If you are not familiar with sed, there is a 2nd buffer called the hold space where you can save data. The x command exchanges the current buffer with the hold space, the N command appends one line to the current buffer, and the H command appends the current buffer to the hold space.
So the first x;N saves the current line (from f1 because of paste) in the hold space and reads the next line (from f2 because of paste). Then each x;N;x;N reads a new line from f1 and f2, and the script finishes by removing the leading newline from the 4 lines of f2, putting the lines from f2 at the end of the lines of f1, cleaning the hold space for the next run, and printing the 8 lines.
Try this, changing the appropriate filename values for f1 and f2.
awk 'BEGIN{
  sectionSize=4; maxSectionCnt=sectionSize; maxSectionCnt++
  notEof1=notEof2=1
  f1="file1" ; f2="file2"
  while (notEof1 && notEof2) {
    if (notEof1) {
      for (i=1;i<maxSectionCnt;i++) {
        if ((getline < f1) > 0) { print "F1:" i":" $0 } else {notEof1=0}
      }
    }
    if (notEof2) {
      for (i=1;i<maxSectionCnt;i++) {
        if ((getline < f2) > 0) { print "F2:" i":" $0 } else {notEof2=0}
      }
    }
  }
}'
You can also remove the "F1:" i ":" record prefix; I added it to help debug the code.
As Pastafarianist rightly points out, you may need to modify this if you have expectations about what will happen if the files are not the same size, etc.
I hope this helps.
The code you posted looks extremely complicated. There is a rule of thumb with programming: there is always a simpler solution. In your case, way simpler.
First thing you should do is determine the limitations of the input. Are you going to process really big files? Or are they going to be only of one-or-two-kilobyte size? It matters.
Second thing: look at the tools you have. With Python, you've got file objects, lists, generators and so on. Try to combine these tools to produce the desired result.
In your particular case, there are some unclear points. What should the script do if the input files have different sizes? Or one of them is empty? Or the number of lines is not a multiple of four? You should decide how to handle corner cases like these.
Take a look at the file object, xrange, list slicing and list comprehensions. If you prefer doing it the cool way, you can also take a look at the itertools module.
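To make the itertools hint concrete, here is one possible sketch. It takes one decision on the corner cases that you may want to change: if the files differ in length, the leftover lines of the longer one are still emitted in blocks. The function name interleave is just for illustration.

```python
from itertools import islice

def interleave(first, second, size=4):
    """Yield `size` items from first, then `size` from second, until both run out."""
    first, second = iter(first), iter(second)
    while True:
        chunk_a = list(islice(first, size))
        chunk_b = list(islice(second, size))
        if not chunk_a and not chunk_b:
            return
        yield from chunk_a
        yield from chunk_b
```

File objects are iterables of lines, so it can be used as `with open("file1") as f1, open("file2") as f2: sys.stdout.writelines(interleave(f1, f2))` (after `import sys`). Only one block of each file is held in memory at a time, so the size of the input should not matter much.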
