File 1:
1075908|2178412|brown_eyeshorty#att.net|Claude|Desmangles
175908|2178412|naim.kazi#webtv.net|Naim|Kazi
175972|212946872418|gil_maynard#hotmail.com|Munster|Herman
175972|212946872418|meghanj4#lycos.com|Meghan|Judge
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
176086|2480881|lourdsneil#gmail.com|Lourds|Herman
File 2:
89129090|Sadiq|Islam
212946872418|Anna|Balint
255875|Charlene|Johnson
89234902|Bob|Brown
09123789|Fabio|Vanetti
I would like to extract lines where ALL values match on the following basis:
Column 2 in File 1 matches with Column 1 in File 2.
Column 4 in File 1 matches with Column 2 in File 2.
Column 5 in File 1 matches with Column 3 in File 2.
The expected output for the example is:
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
The two inputs I'm working with are both very large (11 GB and 3 GB respectively).
The only potential (messy) workaround I can think of is to combine the values to be joined into a single additional column and then use join (I'm very new to this).
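For what it's worth, that composite-key idea is essentially how join-based solutions handle multi-column matches, and sort/join work on disk rather than in memory, which suits files this size. A rough sketch (bash syntax; file names assumed; note the result comes out sorted by key rather than in file1's original order, and it assumes the joined fields never contain a tab):
# Prepend a composite key "col2:col4:col5" to each file1 line, keep keys only for file2,
# sort both on the key, join on it, then strip the key again.
LC_ALL=C awk -F'|' '{print $2 ":" $4 ":" $5 "\t" $0}' file1 | LC_ALL=C sort -t $'\t' -k1,1 > file1.keyed
LC_ALL=C awk -F'|' '{print $1 ":" $2 ":" $3}' file2 | LC_ALL=C sort > file2.keys
LC_ALL=C join -t $'\t' file2.keys file1.keyed | cut -f2-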
grep -f <(sed 's,|,|[^|]*|,' file2) file1
Returns
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
Explanation:
First command:
sed 's,|,|[^|]*|,' file2
This transforms file2 into a list of patterns to search for in file1 and returns:
89129090|[^|]*|Sadiq|Islam
212946872418|[^|]*|Anna|Balint
255875|[^|]*|Charlene|Johnson
89234902|[^|]*|Bob|Brown
09123789|[^|]*|Fabio|Vanetti
Second command:
grep -f <(command1) file1
This searches for those patterns in file1.
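One caveat: the generated patterns are unanchored regular expressions, so a short ID such as 255875 could in principle match inside a longer field elsewhere on a line. If that matters, anchoring each pattern is a small change (a sketch of the same idea, untested at this scale):
# Same as above, but each pattern is anchored as ^[^|]*|...$
grep -f <(sed 's,|,|[^|]*|,; s,.*,^[^|]*|&$,' file2) file1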
Could you please try the following.
awk -F'|' '
FNR==NR{
  a[$2,$4,$5]=(a[$2,$4,$5]?a[$2,$4,$5] ORS:"")$0
  next
}
(($1,$2,$3) in a){
  print a[$1,$2,$3]
}' Input_file1 Input_file2
Output will be as follows.
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
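One caveat for the file sizes mentioned in the question: this loads all of Input_file1 (the 11 GB file) into memory. Hashing the smaller file instead and streaming the big one should give the same matches while keeping file1's line order; a sketch along the same lines:
awk -F'|' '
FNR==NR{              # first pass: the smaller Input_file2
  seen[$1,$2,$3]=1    # remember each ID/first/last triple
  next
}
(($2,$4,$5) in seen)  # second pass: print file1 lines whose triple was seen
' Input_file2 Input_file1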
I have an input file which looks like this:
input.txt
THISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I have another file with the positions of the letters I want to change and the letters I want to change them to, such as this:
textpos.txt
Position Text_Change
1 A
2 B
3 X
(Actually, there will be about 10,000 letter changes.)
And I would like one separate output file for each text change, which should look like this:
output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I would like to learn how to do this with an awk command and in a pythonic way as well, and was wondering what would be the best and quickest approach.
Thanks in advance.
Could you please try the following (assuming that your actual input files have the same kind of data in them). This solution should take care of the "too many open files" error when running the awk command, since I am closing the output files in the awk code.
awk '
FNR==NR{
  a[++count]=$0
  next
}
FNR>1{
  close(file)
  file="output"(FNR-1)".txt"
  for(i=1;i<=count;i++){
    if($1==1){
      print $2 substr(a[i],2) > file
    }
    else{
      print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file
    }
  }
}' input.txt textpos.txt
Three output files named output1.txt, output2.txt and output3.txt will be created; their contents will be as follows.
cat output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Explanation: here is a commented version of the above code.
awk '
FNR==NR{                                   ##Condition FNR==NR is TRUE while the first file, input.txt, is being read.
  a[++count]=$0                            ##Create an array a, indexed by an increasing count, whose value is the current line.
  next                                     ##Skip all further statements and read the next line.
}
FNR>1{                                     ##Executed while the 2nd file, textpos.txt, is being read (excluding its header line).
  close(file)                              ##Close the previous output file; this avoids a "too many open files" error.
  file="output"(FNR-1)".txt"               ##Build the output file name from FNR-1 (the line number minus the header).
  for(i=1;i<=count;i++){                   ##Loop over the stored input lines, from 1 to count.
    if($1==1){                             ##If the change position (1st field) is 1:
      print $2 substr(a[i],2) > file       ##Print $2, then the substring of a[i] from position 2 to the end, to the output file.
    }
    else{
      print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file  ##Print a[i] up to position $1-1, then $2, then the rest from $1+1.
    }
  }
}' input.txt textpos.txt                   ##Mention the input file names here.
Using gawk:
$ awk 'NR > 1 && FNR == NR { r[$1] = $2; next } {
    for (i in r) {
      print substr($0, 1, i - 1) r[i] substr($0, i + 1) > "output" i ".txt"
    }
  }' textpos.txt input.txt
Using awk, abusing FS="" for the second file so that each letter becomes a column of its own:
$ awk '
NR==FNR {
    a[$1]=$2; next }      # hash positions and letters to a
{
    for(i in a)           # for all positions
        $i=a[i]           # replace the letters in them
}1' textpos FS="" OFS="" file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Another version, using for and substr to build a variable character by character from a[] and $0:
$ awk '
NR==FNR {
    a[$1]=$2; next }                        # hash textpos to a
{
    for(i=1;i<=length($0);i++)              # for each position in $0
        b=b ((i in a)?a[i]:substr($0,i,1))  # get char from a[] or $0, in that order
    print b; b=""                           # output and reset b for next round
}' textpos file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
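Since the question also asked for a pythonic way, here is a minimal sketch in Python (assuming the single-line input.txt and one output file per change, as in the first answers above):
with open("input.txt") as f:
    text = f.read().rstrip("\n")

with open("textpos.txt") as f:
    next(f)                      # skip the "Position Text_Change" header
    for n, line in enumerate(f, start=1):
        pos, ch = line.split()
        i = int(pos) - 1         # positions in textpos.txt are 1-based
        with open("output%d.txt" % n, "w") as out:
            out.write(text[:i] + ch + text[i+1:] + "\n")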
In a file that has a particular column of information, I want to remove exactly 5 fields (i.e. :PG:PB:PI:PW:PC, where the separator is ':') from the end of the lines, not from the beginning.
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
Assuming that the above data is from column #3 of the file, I wrote the following code:
awk 'BEGIN{FS=OFS="\t"} { split($3, a,":")} {print ($1, $2, a[1]":"a[2]":"a[3]":"a[4]":"a[5])}' awk_test.vcf
This code splits the column and selects the first 5 fields, but I want to remove the last 5 fields. Selecting from the front won't work, since fields like PGT and PID are inserted in certain lines; only removing from the end works.
Expected output:
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PGT:PID:PL
GT:AD:DP:GQ:PGT:PID:PL
Thanks for helping me with the code for the first part of my question.
But the script isn't working for another file of mine, which has the following data. Here I want to update the 9th column for the same purpose. The columns are tab-separated, but what I want to do remains basically the same.
2 1463 . T TG 433.67 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=-4.310e-01;ClippingRankSum=0.00;DP=247;ExcessHet=2.9800;FS=0.000;MQ=21.25;MQRankSum=0.00;QD=33.36;ReadPosRankSum=-6.740e-01;SOR=0.784;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:76,0:76:0,0,0:./.:.:.:./.:. ./.:55,0:55:0,0,0:.:.:.:.:. ./.:68,0:68:0,0,0:.:.:.:.:. ./.:48,0:48:0,0,0:.:.:.:.:.
2 1466 . TG T 395.82 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=1.01;ClippingRankSum=0.00;DP=287;ExcessHet=5.1188;FS=7.707;MQ=18.00;MQRankSum=0.00;QD=17.21;ReadPosRankSum=1.28;SOR=0.074;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2 1467 . G T 1334.42 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=0.674;ClippingRankSum=0.00;DP=287;ExcessHet=4.8226;FS=1.328;MQ=23.36;MQRankSum=0.00;QD=28.65;ReadPosRankSum=-4.310e-01;SOR=0.566;set=SNPs GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2 1516 . C T 5902.93 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=0.287;ClippingRankSum=0.00;DP=411;ExcessHet=0.5065;FS=1.489;InbreedingCoeff=0.3492;MQ=59.77;MQRankSum=0.00;QD=28.38;ReadPosRankSum=-7.100e-02;SOR=0.553;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:. 1/1:1,108:109:99:3935,286,0:.:.:.:.:. 0/0:102,0:102:99:0,120,1800:.:.:.:.:. 0/0:78,0:78:99:0,120,1800:.:.:.:.:.
2 1584 . CT C 164.08 PASS AC=0;AF=0.00;AN=8;DP=717;ExcessHet=0.0812;FS=0.000;InbreedingCoeff=0.9386;MQ=60.00;QD=32.82;SOR=3.611;set=InDels GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:. 0/0:172,0:172:99:0,120,1800:.:.:.:.:. 0/0:102,0:102:99:0,120,1800:.:.:.:.:. 0/0:321,0:321:99:0,120,1800:.:.:.:.:.
2 1609 . C A 604.68 PASS AC=0;AF=0.00;AN=0;DP=386;ExcessHet=0.1158;FS=0.000;InbreedingCoeff=0.8938;MQ=12.32;QD=31.09;SOR=1.061;set=SNPs GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:0,0:0:0,0,0:./.:.:.:./.:. ./.:0,0:0:0,0,0:.:.:.:.:. ./.:0,0:0:0,0,0:.:.:.:.:. ./.:386,0:386:0,0,0:.:.:.:.:.
2 1612 . TGTGAGCTATTTCTTTTACATTTTTCTTTAGATTCTAGGTTAAATTGTGAAGCTGATTATCTTTTTTGTTTACAG T 1298.76 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=0.1047;FS=0.000;InbreedingCoeff=0.8896;MQ=60.02;QD=29.54;SOR=1.179;set=InDels GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC ./.:0,0:0:.:0,0,0:./.:.:.:./.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. 1/1:0,3:3:99:1355,582,0:.:.:.:.:.
2 1657 . T A,* 3118.91 PASS AC=0,2;AF=0.00,1.00;AN=2;BaseQRankSum=0.578;ClippingRankSum=0.00;DP=4;ExcessHet=1.9114;FS=3.474;InbreedingCoeff=0.0821;MQ=26.68;MQRankSum=0.841;QD=28.10;ReadPosRankSum=-5.960e-01;SOR=0.821;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC ./.:0,0,0:0:.:0,0,0,0,0,0:./.:.:.:./.:. ./.:1,0,0:1:.:0,0,0,0,0,0:.:.:.:.:. ./.:0,0,0:0:.:0,0,0,0,0,0:.:.:.:.:. 2/2:0,0,3:3:99:1355,1360,1393,582,615,0:.:.:.:.:.
2 1738 . A G 4693.24 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=1595;ExcessHet=0.0577;FS=0.621;InbreedingCoeff=0.6496;MQ=60.00;MQRankSum=0.00;QD=5.46;ReadPosRankSum=0.307;SOR=0.773;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:389,92:481:99:1748,0,12243:0|1:.,.,.,.,.:935:|:0.5 0/0:318,0:318:99:0,120,1800:.:.:.:.:. 0/1:270,53:323:99:990,0,9096:.:.:.:.:. 0/0:473,0:473:99:0,120,1800:.:.:.:.:.
2 2781 . T G 435.07 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.624;ClippingRankSum=0.00;DP=2146;ExcessHet=3.4523;FS=8.450;InbreedingCoeff=-0.0856;MQ=60.06;MQRankSum=-4.630e-01;QD=1.27;ReadPosRankSum=-5.980e+00;SOR=1.436;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:309,34:343:99:0|1:2781_T_G:469,0,12941:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2 2786 . C G 39.69 PASS AC=0;AF=0.00;AN=8;BaseQRankSum=0.881;ClippingRankSum=0.00;DP=2145;ExcessHet=4.3933;FS=0.000;InbreedingCoeff=-0.1367;MQ=52.41;MQRankSum=-1.356e+00;QD=1.13;ReadPosRankSum=0.577;SOR=0.527;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:0,120,1800:0/0:.:.:0/0:. 0/0:342,0:342:99:0,120,1800:.:.:.:.:. 0/0:492,0:492:99:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:0,120,1800:.:.:.:.:.
2 2787 . T C 993.78 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=-2.967e+00;ClippingRankSum=0.00;DP=2153;ExcessHet=3.8663;FS=4.941;InbreedingCoeff=-0.1076;MQ=60.06;MQRankSum=-5.100e-01;QD=2.84;ReadPosRankSum=-3.689e+00;SOR=0.875;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:309,41:350:99:0|1:2781_T_G:1027,0,13619:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2 2792 . A G 745.21 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.271;ClippingRankSum=0.00;DP=2176;ExcessHet=5.9256;FS=5.964;InbreedingCoeff=-0.2087;MQ=59.48;MQRankSum=-4.920e-01;QD=1.83;ReadPosRankSum=-3.100e-02;SOR=1.389;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:332,41:373:99:0|1:2781_T_G:705,0,13295:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
I also tried adding FS/OFS parameters but it isn't working.
After some clarification about what the file looks like, here is my updated answer:
You can simply use
awk 'BEGIN{FS=OFS="\t"} {$9 = gensub(/(:[^:]+){5}$/,"","1",$9)} 1' yourfile
Here we use standard awk field splitting, since your file is tab-separated.
We then do a regular-expression replacement scoped to $9, which is the colon-separated string you want to change.
The regular expression works the same way as in the old answer, where I was under the impression that each line consisted only of the colon-separated string.
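One note: gensub is gawk-specific. Since only one replacement happens here anyway, plain sub should behave the same in POSIX-compliant awks (though some awks may not support the {5} interval in the regular expression); a sketch:
awk 'BEGIN{FS=OFS="\t"} {sub(/(:[^:]+){5}$/,"",$9)} 1' yourfile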
Old Answer
Since you wrote "pipe to python" in your comment, maybe you are open to a sed solution?
sed -r "s/(:[^:]+){5}$//" yourfile
Here we replace (s/...// replaces the ... with nothing), where the ... means:
five ({5})
occurrences of a colon (:)
followed by one or more (+)
characters that are not a colon ([^:])
anchored at the end of the line ($)
And this can again be "translated" to awk:
awk '{$0 = gensub(/(:[^:]+){5}$/,"","1")} 1' yourfile
Maybe not the best awk solution, but it works:
awk -F: '{printf("%s",$1); for (i=2;i<=NF-5;i++) printf(":%s",$i); printf("\n")}' file.txt
split the fields on the colon separator
print the first field, then the remaining fields except the last 5 (using the built-in variable NF, the number of fields), each with a leading colon
print a line feed to end the line
EDIT: I knew there was a better way to do it with awk. As Lars commented, this is way simpler and cleaner:
awk -F: '{s= $1; for(i = 2; i<= NF-5;i++) s= s FS $i; print s}'
use the field-separator variable FS instead of a hardcoded colon
build up a string instead of printing each field
print the string at the end
If you want to use it within a Python script, I'd suggest you write that part in Python instead; it's simpler and faster:
import csv

with open("file.txt") as fr, open("out.txt","w",newline="") as fw:
    cr = csv.reader(fr, delimiter=":")
    cw = csv.writer(fw, delimiter=":")
    for row in cr:
        cw.writerow(row[:-5])  # write the row without the 5 last fields
You can omit the with part if you already have open file handles.
EDIT: since you heavily edited your question after my answer, you now want to remove the 5 last "fields" from one particular tab-separated field. Lars has answered properly awk-wise; let me propose my Python solution:
import csv

with open("file.txt") as fr, open("out.txt","w",newline="") as fw:
    cr = csv.reader(fr, delimiter="\t")
    cw = csv.writer(fw, delimiter="\t")
    for row in cr:
        row[8] = ":".join(row[8].split(":")[:-5])  # remove the 5 last "fields" from the 9th column
        cw.writerow(row)  # write the modified row
I have two sorted files and want to merge them to make a third, but I need the output to be sorted. One column in the second file is a subset of the first, and any place the second file doesn't match the first should be filled in with NA. The files are large (~20,000,000 records each), so loading things into memory is tough and speed is an issue.
File 1 looks like this:
1 a
2 b
3 c
4 d
5 e
File 2 looks like this:
1 aa
2 bb
4 dd
5 ee
And the output should look like this:
1 a aa
2 b bb
3 c NA
4 d dd
5 e ee
join is your friend here.
join -a 1 file1 file2
should do the trick. The only difference from your example output is that the unpairable lines are printed directly from file1, i.e. without the NA.
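On the sample files this should produce:
1 a aa
2 b bb
3 c
4 d dd
5 e ee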
Edit: Here is a version that also handles the NAs:
join -a 1 -e NA -o 1.1,1.2,2.2 file1 file2
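which, on the sample files, should print:
1 a aa
2 b bb
3 c NA
4 d dd
5 e ee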
If I understand you correctly:
File #1 and file #2 will have the same lines
However, some lines will be missing from file #2 that are in file #1.
AND, most importantly, the lines will be sorted in each file.
That means if I get a line from file #2 and then keep reading through file #1, I'll find a matching line sooner or later. Therefore, we want to read a line from file #2, keep looking through file #1 until we find the matching line, and when we do find one, print out both values.
I would imagine some sort of algorithm like this:
read first line from file #2
while read line from file #1
    if line from file #2 > line from file #1
        write line from file #1 and "NA"
    else
        write line from file #1 and line from file #2
        read another line from file #2
    fi
done
There should be some form of error checking (what if you find the line from file #1 to be greater than the line from file #2? That means file #1 is missing that line). And there should be some boundary checking (what if you run out of lines from file #2 before you finish file #1?).
This sounds like a school assignment, so I really don't want to give an actual answer. However, the algorithm is there. All you need to do is implement it in your favorite language.
If it isn't a school assignment, and you need more help, just post a comment on this answer, and I'll do what I can.
To the DNA Biologist
#! /usr/bin/env perl
use warnings;
use strict;
use feature qw(say);
use constant {
    TEXT1 => "foo1.txt",
    TEXT2 => "foo2.txt",
};
open (FILE1, "<", TEXT1) or die qq(Can't open file ) . TEXT1 . qq( for reading\n);
open (FILE2, "<", TEXT2) or die qq(Can't open file ) . TEXT2 . qq( for reading\n);
my $line2 = <FILE2>;
chomp $line2;
my ($lineNum2, $value2) = split(/\s+/, $line2, 2);
while (my $line1 = <FILE1>) {
    chomp $line1;
    my ($lineNum1, $value1) = split(/\s+/, $line1, 2);
    if (not defined $line2) {
        say "$lineNum1 - $value1 - NA";
    }
    elsif ($lineNum1 lt $lineNum2) {    # Use "<" for a numeric match rather than a string match
        say "$lineNum1 - $value1 - NA";
    }
    elsif ($lineNum1 eq $lineNum2) {
        say "$lineNum1 - $value1 - $value2";
        $line2 = <FILE2>;
        if (defined $line2) {
            chomp $line2;
            ($lineNum2, $value2) = split(/\s+/, $line2, 2);
        }
    }
    else {
        die qq(Something went wrong: Line 1 = "$line1" Line 2 = "$line2"\n);
    }
}
It wasn't thoroughly tested, but it worked on some short sample files.
You can do it all in shell:
sort file.1 > file.1.sorted
sort file.2 > file.2.sorted
join -a 1 -e NA -o 1.1,1.2,2.2 file.1.sorted file.2.sorted > file.joined
Here's a Python solution:
"""merge two files based on matching first columns"""
def merge_files(file1, file2, merge_file):
    with open(file1) as f1, open(file2) as f2, open(merge_file, 'w') as merge:
        for line2 in f2:
            index2, value2 = line2.split(' ', 1)
            for line1 in f1:
                index1, value1 = line1.split(' ', 1)
                if index1 != index2:
                    merge.write("%s %s NA\n" % (index1, value1[:-1]))  # no match in file2: pad with NA
                    continue
                merge.write("%s %s %s" % (index1, value1[:-1], value2))
                break
        for line1 in f1:  # grab any remaining lines in file1, padding with NA
            merge.write("%s NA\n" % line1[:-1])

if __name__ == '__main__':
    merge_files('test1.txt', 'test2.txt', 'test3.txt')