I have a file with about 150k rows and two columns. I need to run a Python script on the first field and save its output as a third column, so that the change looks like this:
Original File:
Col1 Col2
d 2
e 4
f 6
New file:
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
I'm not able to run the script from inside awk.
cat original.list | awk -F" " ' {`/homes/script.py $1`}'
If I were able to, I would then want to save its output as a variable, and print the new variable, plus $1 and $2, to the new file.
thanks in advance (related question here)
the answer to the "related question" you linked (and the one posted in the comments) actually solves your problem; it just needs to be adapted to your specific case.
cat original.list | awk -F" " ' {`/homes/script.py $1`}'
cat is useless here because awk can open and read the file by itself
you don't need -F" " because awk splits fields on whitespace by default
backticks (``) won't run your script: command substitution is a shell feature (and backticks are its discouraged form), and it doesn't work in awk
we can use command | getline var to execute a command and store its
(first line of) output in a variable. from man awk:
command | getline var
pipes a record from command into var.
using your example file:
$ cat original
Col1 Col2
d 2
e 4
f 6
$
and a dummy script.py:
$ cat script.py
#!/bin/python
print("output")
$
we can do something like this:
$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
$
the first action runs on the first line of input, adds Col3 to the header and
avoids passing Col1 to the python script.
in the other action, we first build the command by concatenating $1 to the script's path, then we run it and store its first line of output in the out variable (I'm assuming your python script's output is just one line). close(cmd) is important because after getline, the pipe reading from cmd's output would remain open, and doing this for many records could lead to errors like "too many open files". at the end we print $0 and cmd's output.
the third column's formatting looks a bit off; you can improve it either from awk using printf or with an external program like column, e.g.:
$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original | column -t
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
$
lastly, doing all this on a 150k-row file means spawning the python script 150k times, so it will probably be slow. performance could be improved by doing everything directly in the python script, as already suggested in the comments, but whether that is applicable to your specific case goes beyond the scope of this question/answer.
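for reference, that all-in-python approach might look like this minimal sketch (compute() is a hypothetical stand-in for whatever /homes/script.py actually does with its argument):

# minimal sketch: compute() is a hypothetical stand-in for the real logic
def compute(value):
    return "output"

with open("original.list") as src, open("new.list", "w") as dst:
    header = next(src).rstrip("\n")          # "Col1 Col2"
    dst.write(header + " Col3\n")
    for line in src:
        col1, col2 = line.split()
        dst.write(f"{col1} {col2} {compute(col1)}\n")

this processes all 150k rows in a single python process instead of spawning one per row.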
Related
(An adaptation of David Erickson's question here)
Given a CSV file with columns A, B, and C and some values:
echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk 'BEGIN{OFS = ","}{print $2, $3, $4}' | head -n 10000 >> file.csv
We would like to sort by columns a and b:
sort -t ',' -k1,1n -k2,2n file.csv > file_.csv
head -n 3 file_.csv
a,b,c
3,50240,18792
7,54871,39438
And then for every unique pair (a, b) create a new CSV titled '{a}_Invoice_{b}.csv'.
The main challenge seems to be the I/O overhead of writing thousands of files - I started trying with awk but ran into awk: 17 makes too many open files.
Is there a quicker way to do this, in awk, Python, or some other scripting language?
Additional info:
I know I can do this in Pandas - I'm looking for a faster way using text processing
Though I used urandom to generate the sample data, the real data has runs of recurring values: for example, several rows where a=3, b=7. These should be saved as one file. (The idea is to replicate Pandas' groupby -> to_csv.)
In python:
import pandas as pd
df = pd.read_csv("file.csv")
for (a, b), gb in df.groupby(['a', 'b']):
    gb.to_csv(f"{a}_Invoice_{b}.csv", header=True, index=False)
In awk you can split like so; you will need to put the header back on each resultant file:
awk -F',' '{ out=$1"_Invoice_"$2".csv"; print >> out; close(out) }' file.csv
With the header line added back:
awk -F',' 'NR==1 { hdr=$0; next } { out=$1"_Invoice_"$2".csv"; if (!seen[out]++) {print hdr > out} print >> out; close(out); }' file.csv
The benefit of this last example is that the input file.csv doesn't need to be sorted and is processed in a single pass.
Since your input is to be sorted on the key fields, all you need is:
sort -t ',' -k1,1n -k2,2n file.csv |
awk -F ',' '
NR==1 { hdr=$0; next }
{ out = $1 "_Invoice_" $2 ".csv" }
out != prev {
close(prev)
print hdr > out
prev = out
}
{ print > out }
'
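If you'd rather stay in Python without Pandas, a sketch of the same single-pass idea over the sorted input using itertools.groupby (filenames follow the {a}_Invoice_{b}.csv pattern from the question, reading the sorted file_.csv):

import csv
from itertools import groupby

with open("file_.csv", newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    # the rows are already sorted on (a, b), so groupby sees each key exactly once
    for (a, b), rows in groupby(reader, key=lambda r: (r[0], r[1])):
        with open(f"{a}_Invoice_{b}.csv", "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)

Like the sorted awk version, this keeps only one output file open at a time, so it avoids the "too many open files" error.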
I have an input file which looks like this:
input.txt
THISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I have another file with the positions of the letters I want to change and the letters I want to change them to, such as this:
textpos.txt
Position Text_Change
1 A
2 B
3 X
(Actually there will be about 10,000 alphabet changes)
And I would like one separate output file for each text change, which should look like this:
output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I would like to learn how to do this both with an awk command and in a pythonic way, and was wondering what would be the best and quickest way to do it.
Thanks in advance.
Could you please try the following (assuming your actual Input_files have the same kind of data in them). This solution should take care of the "Too many open files" error while running the awk command, since I am closing the output files in the awk code.
awk '
FNR==NR{
a[++count]=$0
next
}
FNR>1{
close(file)
file="output"(FNR-1)".txt"
for(i=1;i<=count;i++){
if($1==1){
print $2 substr(a[i],2) > file
}
else{
print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file
}
}
}' input.txt textpos.txt
Three output files named output1.txt, output2.txt and output3.txt will be created; their contents will be as follows.
cat output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Explanation: Adding an explanation for the above code here.
awk '
FNR==NR{ ##Condition FNR==NR will be TRUE while the first file, input.txt, is being read.
a[++count]=$0 ##Creating an array named a whose index is the increasing value of count and whose value is the current line.
next ##next skips all further statements from here.
}
FNR>1{ ##Executed while the 2nd Input_file, textpos.txt, is being read (excluding its header).
close(file) ##Closing the file named in variable file, whose value is the output file name created below.
file="output"(FNR-1)".txt" ##Building the output file name from "output", FNR-1 (line number minus 1) and ".txt".
for(i=1;i<=count;i++){ ##Starting a for loop from 1 till the value of count.
if($1==1){ ##Checking whether the value of the 1st field is 1; if so, do the following.
print $2 substr(a[i],2) > file ##Printing $2 followed by the substring of a[i] from the 2nd position till the end of line to the output file.
}
else{
print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file ##Printing the substring from 1 till $1-1, then $2, then the substring from $1+1 till the end of line.
}
}
}' input.txt textpos.txt ##Mentioning the Input_file names here.
Using gawk:
$ awk 'NR > 1 && FNR == NR { r[$1] = $2; next } {
for (i in r) {
print substr($0, 1, i - 1) r[i] substr($0, i + 1) > "output" i ".txt"
}
}' textpos.txt input.txt
Using awk, abusing FS="" for the second file making each letter a column of its own:
$ awk '
NR==FNR {
a[$1]=$2; next } # hash positions and letters to a
{
for(i in a) # for all positions
$i=a[i] # replace the letters in them
}1' textpos FS="" OFS="" file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Another using for and substr to build a variable char by char from a[] and $0:
$ awk '
NR==FNR {
a[$1]=$2; next } # hash textpos to a
{
for(i=1;i<=length($1);i++) # for each position in $0
b=b ((i in a)?a[i]:substr($0,i,1)) # get char from a[] or $0, in that order
print b; b="" # output and reset b for next round
}' textpos file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
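Since the question also asked for a pythonic way, here is a sketch that, like the first awk answer, writes one output file per change:

# sketch: one output file per (position, letter) change, like the first awk answer
with open("input.txt") as f:
    text = f.read().rstrip("\n")

with open("textpos.txt") as f:
    next(f)  # skip the "Position Text_Change" header
    for n, line in enumerate(f, start=1):
        pos, char = line.split()
        i = int(pos) - 1  # positions in textpos.txt are 1-based
        with open(f"output{n}.txt", "w") as out:
            out.write(text[:i] + char + text[i + 1:] + "\n")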
This can be considered a variant of this bash question, which I looked up before asking my question here.
Following is a sample file:
C=b933cda8ce0/0 p=880080
C=b933cdd6580/0 p=880080
C=b933d02a240/0 p=880080
C=b933d059610/0 p=880080
C=b933d1c8690/0 p=880080
C=b933d2c1b60/0 p=1560315
C=b933d2c1b60/0 p=880080
C=b933d32f240/0 p=1229793
C=b933d32f240/0 p=123412
Output here should be:
C=b933d2c1b60/0 p1=1560315 p2=880080
C=b933d32f240/0 p1=1229793 p2=123412
I need to print out all the values of field#2 against field#1 from all the lines where field#1 matches.
I got the job done using the following long one-liner, but it doesn't really seem elegant/efficient to me:
d=0; q=0; cat file |while read -r c p; do if [[ $c = $d ]]; then printf "$c\t$p\t$q\n"; fi; d=$c; q=$p; done
Code could be in any of the langs/tools tagged.
awk to the rescue
awk '{
c[$1]++;
sub("p","p"c[$1],$2);
sep=(c[$1]>1)?FS:"";
a[$1]=a[$1] sep $2
}
END {
for(i in a) print i, a[i]
}' file
concatenate the second fields based on the key (first field). Suffix p with the index of the parameter (counted in c). There is a formatting trick to avoid double-spacing between the first and second fields.
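A Python sketch of the same grouping (here only keys seen more than once are printed, matching the desired output in the question):

groups = {}
with open("file") as f:
    for line in f:
        c, p = line.split()
        # keep just the value after "p=", in input order
        groups.setdefault(c, []).append(p.split("=", 1)[1])

for c, values in groups.items():
    if len(values) > 1:  # only keys that occur on more than one line
        print(c, " ".join(f"p{i}={v}" for i, v in enumerate(values, 1)))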
We wrote an awk one-liner to split an input csv file (Assay_51003_target_pairs.csv) into multiple files. Rows whose column 1 equals another row's column 1, whose column 2 equals that row's column 2, etc., are categorized into the same new file. The new file is named using the column values.
Here is the one-liner:
awk -F "," 'NF>1 && NR>1 && $1==$1 && $2==$2 && $9==$9 && $10==$10{print $0 >> ("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv");close("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv")}' Assay_51003_target_pairs.csv
This will generate the following example output (Assay_$1_target_$3_assay_$9_bcassay_$10_bcalt_assay.csv):
Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,8888,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,8888,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1688,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1688,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Later on we would like to do, for example,
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
#################################################
for b in 1645 1688
do
for c in 8888 9999
do
awk -F, -f max_min.awk Assay_51003_target_${b}_assay_7777_bcassay_${c}_bcalt_assay.csv
done
done
However, we don't know how to write a loop for the follow-up work because the output file names are "random". Is there any way for linux/bash to parse parts of the file name into loop variables (such as parsing 1645 and 1688 into b, and 8888 and 9999 into c)?
With Bash it should be pretty easy, granted the values are always numbers:
shopt -s nullglob
FILES=(Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv) ## No need to do +([[:digit:]]). The difference is unlikely.
for FILE in "${FILES[@]}"; do
IFS=_ read -a A <<< "$FILE"
# Do something with ${A[1]} ${A[3]} ${A[5]} and ${A[7]}
...
# Or
IFS=_ read __ A __ B __ C __ D __ <<< "$FILE"
# Do something with $A $B $C and $D
...
done
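If the follow-up processing ends up in Python anyway, the same filename parsing can be sketched with glob and a regular expression (the pattern mirrors the naming scheme from the question; the variable names are just placeholders):

import glob
import re

# pattern mirrors Assay_<n>_target_<n>_assay_<n>_bcassay_<n>_bcalt_assay.csv
pattern = re.compile(r"Assay_(\d+)_target_(\d+)_assay_(\d+)_bcassay_(\d+)_bcalt_assay\.csv")

for path in glob.glob("Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv"):
    m = pattern.fullmatch(path)
    if m:
        a, b, c, d = m.groups()
        # do something with a, b, c, d, e.g. run max_min.awk on path
        print(a, b, c, d, path)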
Asking if $1 == $1, etc., is pointless, since it will always be true. The following code is equivalent:
awk -F, '
NF > 1 && NR > 1 {
f = "Assay_" $1 "_target_" $3 "_assay_" $9 \
"_bcassay_" $10 "_bcalt_assay.csv"
print >> f;
close(f)
}' Assay_51003_target_pairs.csv
The reason this works is that the same file is appended to if the fields used in the construction of the filename match. But I wonder if it's an error on your part to be using $3 instead of $2 since you mention $2 in your description.
At any rate, what you are doing seems very odd. If you can give a straightforward description of what you are actually trying to accomplish, there may be a completely different way to do it.
I have to do a simple task, but I don't know how to do it and I'm stuck. I need to interleave the lines of two different files every 4 lines:
File 1:
1
2
3
4
5
6
7
8
9
10
11
12
FILE 2:
A
B
C
D
E
F
G
H
I
J
K
L
Desired result:
1
2
3
4
A
B
C
D
5
6
7
8
E
F
G
H
9
10
11
12
I
J
K
L
I'm looking for a sed, awk or python script, or any other bash command.
Thanks for your time!!
I tried to do it using specific Python libraries that recognize the 4-line records of each file, but it doesn't work, and now I'm trying to do it without these libraries, but I don't know how.
import sys
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
def main(forward, reverse):
    for F, R in zip(SeqIO.parse(forward, "fastq"), SeqIO.parse(reverse, "fastq")):
        fastq_out_F = SeqRecord(F.seq, id=F.id, description="")
        fastq_out_F.letter_annotations["phred_quality"] = F.letter_annotations["phred_quality"]
        fastq_out_R = SeqRecord(R.seq, id=R.id, description="")
        fastq_out_R.letter_annotations["phred_quality"] = R.letter_annotations["phred_quality"]
        print fastq_out_F.format("fastq"),
        print fastq_out_R.format("fastq"),

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
This might work for you (using GNU sed):
sed -e 'n;n;n;R file2' -e 'R file2' -e 'R file2' -e 'R file2' file1
or using paste/bash:
paste -d' ' <(paste -sd'   \n' file1) <(paste -sd'   \n' file2) | tr ' ' '\n'
or:
parallel -N4 --xapply 'printf "%s\n%s\n" {1} {2}' :::: file1 :::: file2
It can be done in pure bash:
f1=""; f2=""
while test -z "$f1" -o -z "$f2"; do
{ read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE"; } || f1=end;
{ read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE"; } || f2=end;
done < f1 3< f2
The idea is to use a new file descriptor (3 in this case) and read from stdin and this file descriptor at the same time.
A mix of paste and sed can also be used if you do not have GNU sed:
paste -d '\n' f1 f2 | sed -e 'x;N;x;N;x;N;x;N;x;N;x;N;x;N;s/^\n//;H;s/.*//;x'
If you are not familiar with sed, there is a 2nd buffer called the hold space where you can save data. The x command exchanges the current buffer with the hold space, the N command appends one line to the current buffer, and the H command appends the current buffer to the hold space.
So the first x;N saves the current line (from f1, because of paste) in the hold space and reads the next line (from f2, because of paste); then each x;N;x;N reads a new line from f1 and f2, and the script finishes by removing the newline from the 4 lines of f2, putting the lines from f2 at the end of the lines of f1, cleaning the hold space for the next run, and printing the 8 lines.
Try this, changing the appropriate filename values for f1 and f2.
awk 'BEGIN{
sectionSize=4; maxSectionCnt=sectionSize; maxSectionCnt++
notEof1=notEof2=1
f1="file1" ; f2="file2"
while (notEof1 && notEof2) {
if (notEof1) {
for (i=1;i<maxSectionCnt;i++) {
if (getline < f1 >0 ) { print "F1:" i":" $0 } else {notEof1=0}
}
}
if (notEof2) {
for (i=1;i<maxSectionCnt;i++) {
if (getline < f2 >0 ) { print "F2:" i":" $0 } else {notEof2=0}
}
}
}
}'
You can also remove the "F1:" i ":" etc. record header; I added that to help debug the code.
As Pastafarianist rightly points out, you may need to modify this if you have expectations about what will happen if the files are not the same size, etc.
I hope this helps.
The code you posted looks extremely complicated. There is a rule of thumb with programming: there is always a simpler solution. In your case, way simpler.
First thing you should do is determine the limitations of the input. Are you going to process really big files? Or are they going to be only of one-or-two-kilobyte size? It matters.
Second thing: look at the tools you have. With Python, you've got file objects, lists, generators and so on. Try to combine these tools to produce the desired result.
In your particular case, there are some unclear points. What should the script do if the input files have different sizes? Or if one of them is empty? Or if the number of lines is not a multiple of four? You should decide how to handle corner cases like these.
Take a look at the file object, xrange, list slicing and list comprehensions. If you prefer doing it the cool way, you can also take a look at the itertools module.
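For instance, a Python 3 sketch of the interleaving with itertools: grouper() collects four lines at a time from each file, and the corner cases above are handled one particular way here (a partial last group is padded with None, and output stops when either file runs out of groups):

import sys
from itertools import zip_longest

def grouper(iterable, n):
    # collect lines into fixed-size chunks; a partial last chunk is padded with None
    args = [iter(iterable)] * n
    return zip_longest(*args)

with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    for chunk1, chunk2 in zip(grouper(f1, 4), grouper(f2, 4)):
        for line in chunk1 + chunk2:
            if line is not None:
                sys.stdout.write(line)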