Match Multiple Columns In Two Files - Output Only Those That Match Fully - python

File 1:
1075908|2178412|brown_eyeshorty#att.net|Claude|Desmangles
175908|2178412|naim.kazi#webtv.net|Naim|Kazi
175972|212946872418|gil_maynard#hotmail.com|Munster|Herman
175972|212946872418|meghanj4#lycos.com|Meghan|Judge
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
176086|2480881|lourdsneil#gmail.com|Lourds|Herman
File 2:
89129090|Sadiq|Islam
212946872418|Anna|Balint
255875|Charlene|Johnson
89234902|Bob|Brown
09123789|Fabio|Vanetti
I would like to extract lines where ALL values match on the following basis:
Column 2 in File 1 matches with Column 1 in File 2.
Column 4 in File 1 matches with Column 2 in File 2.
Column 5 in File 1 matches with Column 3 in File 2.
The expected output for the example is:
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
The two inputs I'm working with are both very large (11 GB and 3 GB respectively).
The only potential (messy) workaround I can think of is to combine the values to be joined into a single additional column in each file and then use join (I'm very new to this).
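That composite-key workaround can be sketched in Python; a minimal version (the file names and the "\x01" key separator are illustrative assumptions) prepends the combined key as an extra first column, after which both outputs can be sorted and handed to join:

import csv  # not needed here; plain splitting is enough for pipe-delimited lines

# Hypothetical helper: prefix each line with a composite join key so that
# sort(1) and join(1) can afterwards operate on a single column.
def add_composite_key(in_path, out_path, key_cols):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("|")
            key = "\x01".join(fields[i] for i in key_cols)  # \x01: unlikely to occur in the data
            fout.write(key + "|" + line)

add_composite_key("file1", "file1.keyed", [1, 3, 4])  # columns 2, 4, 5
add_composite_key("file2", "file2.keyed", [0, 1, 2])  # columns 1, 2, 3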

grep -f <(sed 's,|,|[^|]*|,' file2) file1
Returns
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
Explanation:
First command:
sed 's,|,|[^|]*|,' file2
Transforms file2 into a list of patterns to search for in file1 and returns:
89129090|[^|]*|Sadiq|Islam
212946872418|[^|]*|Anna|Balint
255875|[^|]*|Charlene|Johnson
89234902|[^|]*|Bob|Brown
09123789|[^|]*|Fabio|Vanetti
Second command:
grep -f <(command1) file1
Searches for those patterns in file1.

Could you please try the following.
awk -F'|' '
FNR==NR{
  # First file: index each row by (column 2, column 4, column 5),
  # concatenating rows that share a key with a newline.
  a[$2,$4,$5]=(a[$2,$4,$5]?a[$2,$4,$5] ORS:"")$0
  next
}
(($1,$2,$3) in a){
  # Second file: print the stored rows whenever all three columns match.
  print a[$1,$2,$3]
}' Input_file1 Input_file2
Output will be as follows.
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
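Since the question is tagged python, the same join can also be done there without loading the 11 GB file: hold only the smaller file's key columns in memory and stream the larger file. A minimal sketch, assuming the layouts shown above (the file names are illustrative):

# Collect (col1, col2, col3) keys from the smaller file, then stream the
# larger file and keep rows whose (col2, col4, col5) appear in that set.
keys = set()
with open("file2") as f2:
    for line in f2:
        fields = line.rstrip("\n").split("|")
        keys.add((fields[0], fields[1], fields[2]))

with open("file1") as f1, open("matches.txt", "w") as out:
    for line in f1:
        fields = line.rstrip("\n").split("|")
        if (fields[1], fields[3], fields[4]) in keys:
            out.write(line)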

Related

Compare 2 csv or excel files for string matches

I want to compare sub-string between columns of 2 different files:
Below are the sample input and expected output
Input File1.csv:
Amar,18
Akbar,20
Anthony,21
Input File2.csv:
Mr. Amar Khanna, Tennis
Mr. Anthony Rao, Cricket
Federer, Badminton
Expected Output File3.csv:
Amar,18,Tennis
Anthony,21,Cricket
I am trying to do it using shell scripting. These are the solutions that I have tried so far to find the matches between the two files:
diff file1 file2
This did not work as it compares files for entire column match.
grep -f file1 file2
Even this did not work because of above issue.
awk 'FNR==NR{a[substr($1,5,8)];next} substr($1,5,8) in a' excel1.csv excel2.csv
This is not giving any results.
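One possible Python sketch for this (not from the original thread; the file names and the handling of the "Mr." title are assumptions based on the samples) is to extract the first name from each row of File2.csv and look it up in File1.csv:

import csv

# Map first name -> age from File1.csv (e.g. "Amar" -> "18").
with open("File1.csv") as f1:
    ages = {name.strip(): age.strip() for name, age in csv.reader(f1)}

# For each "Mr. <First> <Last>, <Sport>" row in File2.csv, emit name,age,sport
# when the first name after the title is present in File1.csv.
with open("File2.csv") as f2, open("File3.csv", "w", newline="") as f3:
    writer = csv.writer(f3)
    for person, sport in csv.reader(f2):
        parts = person.split()
        first = parts[1] if person.startswith("Mr.") else parts[0]
        if first in ages:
            writer.writerow([first, ages[first], sport.strip()])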

How to use awk to select/remove fields from the end of a column after splitting?

In a file that has particular column information, I want to remove exactly 5 fields (i.e. :PG:PB:PI:PW:PC; the separator is ':') from the end of the lines, not from the beginning.
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
Assuming that the above data is from column #3 of the file, I wrote the following code:
awk 'BEGIN{FS=OFS="\t"} { split($3, a,":")} {print ($1, $2, a[1]":"a[2]":"a[3]":"a[4]":"a[5])}' awk_test.vcf
This code splits and selects the first 5 fields, but I want to remove the last 5 fields. Selecting from the first fields won't work, since fields like PGT and PID are inserted in certain lines; only removing from the end works.
Expected output:
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PGT:PID:PL
GT:AD:DP:GQ:PGT:PID:PL
Thanks for helping me with the code for the first part of my question.
But the script isn't working for another file of mine, which has the following data. Here I want to update the 9th column for the same purpose. The columns are tab-separated, but what I want to do remains basically the same.
2 1463 . T TG 433.67 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=-4.310e-01;ClippingRankSum=0.00;DP=247;ExcessHet=2.9800;FS=0.000;MQ=21.25;MQRankSum=0.00;QD=33.36;ReadPosRankSum=-6.740e-01;SOR=0.784;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:76,0:76:0,0,0:./.:.:.:./.:. ./.:55,0:55:0,0,0:.:.:.:.:. ./.:68,0:68:0,0,0:.:.:.:.:. ./.:48,0:48:0,0,0:.:.:.:.:.
2 1466 . TG T 395.82 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=1.01;ClippingRankSum=0.00;DP=287;ExcessHet=5.1188;FS=7.707;MQ=18.00;MQRankSum=0.00;QD=17.21;ReadPosRankSum=1.28;SOR=0.074;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2 1467 . G T 1334.42 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=0.674;ClippingRankSum=0.00;DP=287;ExcessHet=4.8226;FS=1.328;MQ=23.36;MQRankSum=0.00;QD=28.65;ReadPosRankSum=-4.310e-01;SOR=0.566;set=SNPs GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2 1516 . C T 5902.93 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=0.287;ClippingRankSum=0.00;DP=411;ExcessHet=0.5065;FS=1.489;InbreedingCoeff=0.3492;MQ=59.77;MQRankSum=0.00;QD=28.38;ReadPosRankSum=-7.100e-02;SOR=0.553;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:. 1/1:1,108:109:99:3935,286,0:.:.:.:.:. 0/0:102,0:102:99:0,120,1800:.:.:.:.:. 0/0:78,0:78:99:0,120,1800:.:.:.:.:.
2 1584 . CT C 164.08 PASS AC=0;AF=0.00;AN=8;DP=717;ExcessHet=0.0812;FS=0.000;InbreedingCoeff=0.9386;MQ=60.00;QD=32.82;SOR=3.611;set=InDels GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:. 0/0:172,0:172:99:0,120,1800:.:.:.:.:. 0/0:102,0:102:99:0,120,1800:.:.:.:.:. 0/0:321,0:321:99:0,120,1800:.:.:.:.:.
2 1609 . C A 604.68 PASS AC=0;AF=0.00;AN=0;DP=386;ExcessHet=0.1158;FS=0.000;InbreedingCoeff=0.8938;MQ=12.32;QD=31.09;SOR=1.061;set=SNPs GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:0,0:0:0,0,0:./.:.:.:./.:. ./.:0,0:0:0,0,0:.:.:.:.:. ./.:0,0:0:0,0,0:.:.:.:.:. ./.:386,0:386:0,0,0:.:.:.:.:.
2 1612 . TGTGAGCTATTTCTTTTACATTTTTCTTTAGATTCTAGGTTAAATTGTGAAGCTGATTATCTTTTTTGTTTACAG T 1298.76 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=0.1047;FS=0.000;InbreedingCoeff=0.8896;MQ=60.02;QD=29.54;SOR=1.179;set=InDels GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC ./.:0,0:0:.:0,0,0:./.:.:.:./.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. 1/1:0,3:3:99:1355,582,0:.:.:.:.:.
2 1657 . T A,* 3118.91 PASS AC=0,2;AF=0.00,1.00;AN=2;BaseQRankSum=0.578;ClippingRankSum=0.00;DP=4;ExcessHet=1.9114;FS=3.474;InbreedingCoeff=0.0821;MQ=26.68;MQRankSum=0.841;QD=28.10;ReadPosRankSum=-5.960e-01;SOR=0.821;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC ./.:0,0,0:0:.:0,0,0,0,0,0:./.:.:.:./.:. ./.:1,0,0:1:.:0,0,0,0,0,0:.:.:.:.:. ./.:0,0,0:0:.:0,0,0,0,0,0:.:.:.:.:. 2/2:0,0,3:3:99:1355,1360,1393,582,615,0:.:.:.:.:.
2 1738 . A G 4693.24 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=1595;ExcessHet=0.0577;FS=0.621;InbreedingCoeff=0.6496;MQ=60.00;MQRankSum=0.00;QD=5.46;ReadPosRankSum=0.307;SOR=0.773;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:389,92:481:99:1748,0,12243:0|1:.,.,.,.,.:935:|:0.5 0/0:318,0:318:99:0,120,1800:.:.:.:.:. 0/1:270,53:323:99:990,0,9096:.:.:.:.:. 0/0:473,0:473:99:0,120,1800:.:.:.:.:.
2 2781 . T G 435.07 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.624;ClippingRankSum=0.00;DP=2146;ExcessHet=3.4523;FS=8.450;InbreedingCoeff=-0.0856;MQ=60.06;MQRankSum=-4.630e-01;QD=1.27;ReadPosRankSum=-5.980e+00;SOR=1.436;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:309,34:343:99:0|1:2781_T_G:469,0,12941:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2 2786 . C G 39.69 PASS AC=0;AF=0.00;AN=8;BaseQRankSum=0.881;ClippingRankSum=0.00;DP=2145;ExcessHet=4.3933;FS=0.000;InbreedingCoeff=-0.1367;MQ=52.41;MQRankSum=-1.356e+00;QD=1.13;ReadPosRankSum=0.577;SOR=0.527;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:0,120,1800:0/0:.:.:0/0:. 0/0:342,0:342:99:0,120,1800:.:.:.:.:. 0/0:492,0:492:99:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:0,120,1800:.:.:.:.:.
2 2787 . T C 993.78 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=-2.967e+00;ClippingRankSum=0.00;DP=2153;ExcessHet=3.8663;FS=4.941;InbreedingCoeff=-0.1076;MQ=60.06;MQRankSum=-5.100e-01;QD=2.84;ReadPosRankSum=-3.689e+00;SOR=0.875;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:309,41:350:99:0|1:2781_T_G:1027,0,13619:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2 2792 . A G 745.21 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.271;ClippingRankSum=0.00;DP=2176;ExcessHet=5.9256;FS=5.964;InbreedingCoeff=-0.2087;MQ=59.48;MQRankSum=-4.920e-01;QD=1.83;ReadPosRankSum=-3.100e-02;SOR=1.389;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:332,41:373:99:0|1:2781_T_G:705,0,13295:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
I also tried adding the FS/OFS parameters, but it isn't working.
After some clarification about what the file looks like, here is my updated answer:
You can simply use
awk 'BEGIN{FS=OFS="\t"} {$9 = gensub(/(:[^:]+){5}$/,"","1",$9)} 1' yourfile
Here we use the standard awk field splitting, since your file is tab-separated.
We further do a regular expression replacement scoped to $9, which is the colon-separated string you want to change.
The regular expression works the same as in the old answer, where I had the impression that each line consisted only of the colon-separated string.
Old Answer
Since you wrote "pipe to python" in your comment, maybe you are open to a sed solution?
sed -r "s/(:[^:]+){5}$//" yourfile
Here we replace (s/...// replaces the ... with nothing), where the ... means:
anchored at the end of the line ($)
five ({5})
occurrences of a colon (:)
followed by one or more (+)
characters that are not a colon ([^:])
And this can again be "translated" to awk:
awk '{$0 = gensub(/(:[^:]+){5}$/, "", 1)} 1' yourfile
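The same replacement is easy to sanity-check from Python with the re module:

import re

s = "GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC"
# Strip five trailing ":field" groups, as the sed/awk commands above do.
print(re.sub(r"(:[^:]+){5}$", "", s))  # GT:AD:DP:GQ:PGT:PID:PL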
Maybe not the best awk solution but works:
awk -F: '{printf($1); for (i=2;i<=NF-5;i++) printf(":%s",$i); printf("\n"); }' file.txt
split the fields naturally on the colon
print the first field, then every further field up to the 5th-from-last (using the preset variable NF, the number of fields), each with a leading colon
print a linefeed to end the line
EDIT: I knew there was a better way to do this in awk. As Lars commented, this is way simpler and cleaner:
awk -F: '{s= $1; for(i = 2; i<= NF-5;i++) s= s FS $i; print s}'
uses the separator variable instead of a hardcoded colon
composes a string instead of printing the fields one by one
prints the string at the end
If you want to use it within a Python script, I'd suggest that you write it in Python directly; it's simpler and faster:
import csv

with open("file.txt") as fr, open("out.txt", "w", newline="") as fw:
    cr = csv.reader(fr, delimiter=":")
    cw = csv.writer(fw, delimiter=":")
    for row in cr:
        cw.writerow(row[:-5])  # write the row minus the 5 last fields
you can omit the with part if you already have open handles.
EDIT: since you heavily edited your question after my answer, you now want to remove the 5 last ":"-separated components from one particular column of a tab-separated file. Lars has answered properly awk-wise; let me propose my Python solution:
import csv

with open("file.txt") as fr, open("out.txt", "w", newline="") as fw:
    cr = csv.reader(fr, delimiter="\t")
    cw = csv.writer(fw, delimiter="\t")
    for row in cr:
        row[8] = ":".join(row[8].split(":")[:-5])  # drop the 5 last ":"-fields from the 9th column (index 8)
        cw.writerow(row)  # write the modified row

Extracting data from file with differing amounts of columns

I have a tab separated file that appears as so:
NM_000014 chr12 - 36 9220303 9220778 9221335 9222340 9223083 9224954 9225248 9227155 9229351 9229941 9230296 9231839 9232234 9232689 9241795 9242497 9242951 9243796 9246060 9247568 9248134 9251202 9251976 9253739 9254042 9256834 9258831 9259086 9260119 9261916 9262462 9262909 9264754 9264972 9265955 9268359 9220435 9220820 9221438 9222409 9223174 9225082 9225467 9227379 9229532 9230016 9230453 9231927 9232411 9232773 9241847 9242619 9243078 9244025 9246175 9247680 9248296 9251352 9252119 9253803 9254270 9256996 9258941 9259201 9260240 9262001 9262631 9262930 9264807 9265132 9266139 9268558 A2M 1
NM_000016 chr1 + 12 76190031 76194085 76198328 76198537 76199212 76200475 76205664 76211490 76215103 76216135 76226806 76228376 76190502 76194173 76198426 76198607 76199313 76200556 76205795 76211599 76215244 76216231 76227055 76229363 ACADM 1
As you can tell if you scroll to the end of the lines, there are differing numbers of columns corresponding to the numbers listed. What I want to do is output the very last number before the gene name (A2M and ACADM in this case) to a file. Is there any way to do this? I've been trying to figure out a way using Unix's awk, but I don't believe this will work due to the differing numbers of columns.
Any help is appreciated
Use $(NF-1) like so, where NF is the number of fields for that line:
awk '{print $(NF-1)}' /tmp/genes.txt
A2M
ACADM
Your posted example has spaces for delimiters. You may need to change the field separator to tabs if your file is truly tab-delimited. Then it would be:
awk -F $'\t' '{print $(NF-1)}' file_name
If you want the number before that name:
$ awk '{print $(NF-2)}' /tmp/genes.txt
9268558
76229363
Try:
awk '{ print $(NF-1) }' FILE
NF always holds the number of fields, so you can use it to dynamically select a field relative to the end of the line, regardless of how many columns the line has.
Are all of your lines structured in the same manner? If so, it is pretty straightforward:
for line in myLines:
    data = line.split()[-3]
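As a runnable sketch (the file name is illustrative):

# Print the third-from-last whitespace-separated field of each line,
# i.e. the last number before the gene name.
with open("genes.txt") as f:
    for line in f:
        print(line.split()[-3])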

Text processing with two files

I have two text files in the following format:
The first is this on every line:
Key1:Value1
The second is this:
Key2:Value2
Is there a way I can replace Value1 in file1 by the Value2 obtained from using it as a key in file2?
For example:
file1:
foo:hello
bar:world
file2:
hello:adam
bar:eve
I would like to get:
foo:adam
bar:eve
There isn't necessarily a match between the two files on every line. Can this be done neatly in awk or something, or should I do it naively in Python?
Create two dictionaries, one for each file. For example:
file1 = {}
for line in open('file1', 'r'):
    k, v = line.strip().split(':')
    file1[k] = v
Or if you prefer a one-liner:
file1 = dict(l.strip().split(':') for l in open('file1', 'r'))
Then you could do something like (assuming file2 was built the same way):
result = {}
for key, value in file1.iteritems():
    if value in file2:
        result[key] = file2[value]
Another way is to generate the key-value pairs in reverse for file1 and use sets. For example, if your file1 contains foo:bar, your file1 dict would be {'bar': 'foo'}.
for key in set(file1) & set(file2):
    result[file1[key]] = file2[key]
Basically, you can quickly find common elements using set intersection, so those elements are guaranteed to be in file2 and you don't waste time checking for their existence.
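Put together, the set-based variant looks like this (a sketch using the toy data from the question; note it only covers the matches where file1's value is a key in file2):

# file1 stored reversed (value -> key), file2 as-is (key -> value).
file1_rev = {'hello': 'foo', 'world': 'bar'}
file2 = {'hello': 'adam', 'bar': 'eve'}

result = {}
for key in set(file1_rev) & set(file2):
    result[file1_rev[key]] = file2[key]

print(result)  # {'foo': 'adam'}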
Edit: As pointed out by @pepr, you can use collections.OrderedDict for the first method if order is important to you.
The awk solution:
awk '
BEGIN {FS = OFS = ":"}
NR==FNR {val[$1] = $2; next}   # first file (file2): key -> value
$2 in val {$2 = val[$2]}       # file1 value used as a key in file2
$1 in val {$2 = val[$1]}       # file1 key used as a key in file2
{print}
' file2 file1
join -t : -1 2 -2 1 -o 0,2.2 -a 2 <(sort -k 2 -t : file1) <(sort file2)
The input files must be sorted on the field they are joined on.
The options:
-t : - Use a colon as the delimiter
-1 2 - Join on field 2 of file 1
-2 1 - Join on field 1 of file 2
-o 0,2.2 - Output the join field followed by field 2 from file2 (separated by the delimiter character)
-a 2 - Output unjoined lines from file2
Once you have:
file1 = {'foo':'hello', 'bar':'world'}
file2 = {'hello':'adam', 'bar':'eve'}
You can do an ugly one liner:
print dict([(i,file2[i]) if i in file2 else (i,file2[j]) if j in file2 else (i,j) for i,j in file1.items()])
{'foo': 'adam', 'bar': 'eve'}
As in your example you are using both the keys and values of file1 as keys in file2.
This might work for you (probably GNU sed):
sed 's#\([^:]*\):\(.*\)#/\\(^\1:\\|:\1$\\)/s/:.*/:\2/#' file2 | sed -f - file1
If you do not consider using basic Unix/Linux commands cheating, then here is a solution using paste and awk.
paste file1.txt file2.txt | awk -F ":" '{ print $1":"$3 }'
TXR:
#(next "file2")
#(collect)
#key:#value1
# (cases)
# (next "file1")
# (skip)
#value2:#key
# (or)
# (bind value2 key)
# (end)
# (output)
#value2:#value1
# (end)
#(end)
Run:
$ txr subst.txr
foo:adam
bar:eve

How do I quickly match the fields of two files that are sorted but one is a subset of the other

I have two sorted files and want to merge them to make a third, but I need the output to be sorted. One column in the second file is a subset of the first, and anywhere the second file doesn't match the first should be filled in with NA. The files are large (~20,000,000 records each), so loading things into memory is tough and speed is an issue.
File 1 looks like this:
1 a
2 b
3 c
4 d
5 e
File 2 looks like this:
1 aa
2 bb
4 dd
5 ee
And the output should be like this:
1 a aa
2 b bb
3 c NA
4 d dd
5 e ee
join is your friend here.
join -a 1 file1 file2
should do the trick. The only difference to your example output is that the unpairable lines are printed directly from file1, i.e. without the NA.
Edit: Here is a version that also handles the NAs:
join -a 1 -e NA -o 1.1,1.2,2.2 file1 file2
If I understand you correctly:
File #1 and file #2 will have the same lines
However, some lines will be missing from file #2 that are in file #1.
AND, most importantly, the lines will be sorted in each file.
That means if I get a line from file #2 and then keep reading through file #1, I'll find a matching line sooner or later. Therefore, we want to read a line from file #2, keep looking through file #1 until we find the matching line, and when we do find one, print out both values.
I would imagine some sort of algorithm like this:
Read first line from file #2
While read line from file #1
    if line from file #2 > line from file #1
        write line from file #1 and "NA"
    else
        write line from file #1 and file #2
        Read another line from file #2
    fi
done
There should be some form of error checking (what if the line from file #1 is greater than the line from file #2? That means file #2 contains a line that file #1 is missing.) And there should be some boundary checking (what if you run out of lines from file #2 before you finish file #1?)
This sounds like a school assignment, so I really don't want to give an actual answer. However, the algorithm is there. All you need to do is implement it in your favorite language.
If it isn't a school assignment, and you need more help, just post a comment on this answer, and I'll do what I can.
To the DNA Biologist
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw(say);

use constant {
    TEXT1 => "foo1.txt",
    TEXT2 => "foo2.txt",
};

open (FILE1, "<", TEXT1) or die qq(Can't open file ) . TEXT1 . qq( for reading\n);
open (FILE2, "<", TEXT2) or die qq(Can't open file ) . TEXT2 . qq( for reading\n);

my $line2 = <FILE2>;
chomp $line2;
my ($lineNum2, $value2) = split(/\s+/, $line2, 2);

while (my $line1 = <FILE1>) {
    chomp $line1;
    my ($lineNum1, $value1) = split(/\s+/, $line1, 2);
    if (not defined $line2) {
        say "$lineNum1 - $value1 - NA";
    }
    elsif ($lineNum1 lt $lineNum2) {    # use "<" for a numeric match rather than a string match
        say "$lineNum1 - $value1 - NA";
    }
    elsif ($lineNum1 eq $lineNum2) {
        say "$lineNum1 - $value1 - $value2";
        $line2 = <FILE2>;
        if (defined $line2) {
            chomp $line2;
            ($lineNum2, $value2) = split(/\s+/, $line2, 2);
        }
    }
    else {
        die qq(Something went wrong: Line 1 = "$line1" Line 2 = "$line2"\n);
    }
}
It wasn't thoroughly tested, but it worked on some short sample files.
You can do it all in shell:
sort file.1 > file.1.sorted
sort file.2 > file.2.sorted
join -a 1 -e NA -o 0,1.2,2.2 file.1.sorted file.2.sorted > file.joined
Here's a Python solution:
"""merge two files based on matching first columns"""
def merge_files(file1, file2, merge_file):
    with open(file1) as file1, open(file2) as file2, open(merge_file, 'w') as merge:
        for line2 in file2:
            index2, value2 = line2.split(' ', 1)
            for line1 in file1:
                index1, value1 = line1.split(' ', 1)
                if index1 != index2:
                    merge.write("%s NA\n" % line1.rstrip())  # no match in file2: pad with NA
                    continue
                merge.write("%s %s %s" % (index1, value1[:-1], value2))
                break
        for line1 in file1:  # grab any remaining lines in file1, padding with NA
            merge.write("%s NA\n" % line1.rstrip())

if __name__ == '__main__':
    merge_files('test1.txt', 'test2.txt', 'test3.txt')
