How to comparing two big files on unique strings using fgrep/comm?

How to comparing two big files on unique strings using fgrep/comm? - python

I have two files. A file disk.txt contains 57665977 rows and database.txt 39035203 rows;
To test my script I made two example files:
$ cat database.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ cat disk.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
What I try to accomplish is to create files for the differences.
A file with the uniques in disk.txt (so I can delete them from disk)
A file with the uniques in database.txt (So I can retrieve them from backup and restore)
Using comm to retrieve differences
I used comm to see the differences between the two files. Sadly comm also returns the duplicates after some uniques.
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
using comm on one of these large files takes 28,38s. This is really fast but is solely not a solution.
using fgrep to strip duplicates from comm result
I can use fgrep to remove the duplicates from the comm result and this works on the example.
$ fgrep -vf duplicate-plus-uniq-disk.txt duplicate-plus-uniq-database.txt
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ fgrep -vf duplicate-plus-uniq-database.txt duplicate-plus-uniq-disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
On the large files this script just crashed after a while. So it is not a viable option to solve my problem.
Using python difflib to get uniques
I tried using this python script I got from BigSpicyPotato's answer on a different post
import difflib
with open(r'disk.txt','r') as masterdata:
with open(r'database.txt','r') as useddata:
with open(r'uniq-disk.txt','w+') as Newdata:
usedfile = [ x.strip('\n') for x in list(useddata) ]
masterfile = [ x.strip('\n') for x in list(masterdata) ]
for line in masterfile:
if line not in usedfile:
Newdata.write(line + '\n')
this also works on the example. Currently this is still running and takes up alot my CPU power.. Looking at the uniq-disk file it is really slow aswell..
Question
Any faster / better option I can try in bash / python? I was aswell looking into awk / sed to maybe parse the the results form comm.

From man comm, * added by me:
Compare **sorted** files FILE1 and FILE2 line by line.
You have to sort the files for comm.
sort database.txt > database_sorted.txt
sort disk.txt > disk_sorted.txt
comm -13 database_sorted.txt disk_sorted.txt
See man sort for various speed and memory enhancing options, like --batch-size, --temporary-directory --buffer-size --parallel.
A file with the uniques in disk.txt
A file with the uniques in database.txt
After sorting, you can implement your python program that compares line-by-line the files and write to mentioned files, just like comm with custom output. Do not store whole files in memory.
You can also do something along this with join or comm --output-delimiter=' ':
join -v1 -v2 -o 1.1,2.1 disk_sorted.txt database_sorted.txt | tee >(
cut -d' ' -f1 | grep -v '^$' > unique_in_disk.txt) |
cut -d' ' -f2 | grep -v '^$' > unique_in_database.txt

comm does exactly what I needed. I had a white space behind line 10 of my disk.txt file. therefor comm returned it as a unique string. Please check #KamilCuk answer for more context about sorting your files and using comm.

# WHINY_USERS=1 isn't trying to insult anyone -
# it's a special shell variable recognized by
# mawk-1 to presort the results
WHINY_USERS=1 {m,g}awk '
function phr(_) {
print \
"\n Uniques for file : { "\
(_)" } \n\n -----------------\n"
}
BEGIN {
split(_,____)
split(_,______)
PROCINFO["sorted_in"] = "#val_num_asc"
FS = "^$"
} FNR==NF {
______[++_____]=FILENAME
} {
if($_ in ____) {
delete ____[$_]
} else {
____[$_]=_____ ":" NR
}
} END {
for(__ in ______) {
phr(______[__])
_____=_<_
for(_ in ____) {
if(+(___=____[_])==+__) {
print " ",++_____,_,
"(line "(substr(___,
index(___,":")+!!__))")"
} } }
printf("\n\n") } ' testfile_disk.txt testfile_database.txt
|
Uniques for file : { testfile_disk.txt }
-----------------
1 07fffeed-5a0b-41f8-86cd-e6d99834c187 (line 7)
2 08ffff24-fb12-488c-87eb-1a07072fc706 (line 8)
3 09ffff29-ba3d-4582-8ce2-80b47ed927d1 (line 9)
Uniques for file : { testfile_database.txt }
-----------------
1 11ffffaf-fd54-49f3-9719-4a63690430d9 (line 18)
2 12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29 (line 19)

Related

Split csv file thousands of times based on groupby

(An adaptation of David Erickson's question here)
Given a CSV file with columns A, B, and C and some values:
echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk 'BEGIN{OFS = ","}{print $2, $3, $4}' | head -n 10000 >> file.csv
We would like to sort by columns a and b:
sort -t ',' -k1,1n -k2,2n file.csv > file_.csv
head -n 3 file_.csv
>a,b,c
3,50240,18792
7,54871,39438
And then for every unique pair (a, b) create a new CSV titled '{a}_Invoice_{b}.csv'.
The main challenge seems to be the I/O overhead of writing thousands of files - I started trying with awk but ran into awk: 17 makes too many open files.
Is there a quicker way to do this, in awk, Python, or some other scripting language?
Additional info:
I know I can do this in Pandas - I'm looking for a faster way using text processing
Though I used urandom to generate the sample data, the real data has runs of recurring values: for example a few rows where a=3, b=7. If so these should be saved as one file. (The idea is to replicate Pandas' groupby -> to_csv)

In python:
import pandas as pd
df = pd.read_csv("file.csv")
for (a, b), gb in df.groupby(['a', 'b']):
gb.to_csv(f"{a}_Invoice_{b}.csv", header=True, index=False)
In awk you can split like so, you will need to put the header back on each resultant file:
awk -F',' '{ out=$1"_Invoice_"$2".csv"; print >> out; close(out) }' file.csv
With adding the header line back:
awk -F',' 'NR==1 { hdr=$0; next } { out=$1"_Invoice_"$2".csv"; if (!seen[out]++) {print hdr > out} print >> out; close(out); }' file.csv
The benefit of this last example is that the input file.csv doesn't need to be sorted and is processed in a single pass.

Since your input is to be sorted on the key fields all you need is:
sort -t ',' -k1,1n -k2,2n file.csv |
awk -F ',' '
NR==1 { hdr=$0; next }
{ out = $1 "_Invoice_" $2 ".csv" }
out != prev {
close(prev)
print hdr > out
prev = out
}
{ print > out }
'

Find part of a string in CSV and replace whole cell with new entry?

I've got a CSV file with a column which I want to sift through. I want to use a pattern file to find all entries where the pattern exists even in part of the column's value, and replace the whole cell value with this "pattern".
I made a list of keywords that I want to use as my "pattern" bank;
So, if a cell in this column (this case only second) has this "pattern" as part of its string, then I want to replace the whole cell with this "pattern".
so for example:
my target file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis & Private Hire,moreinfo2
id3,Tax Services,moreinfo3
id4,Tools & Hardware,moreinfo4
id5,Tool Sharpening,moreinfo5
id6,Tool Shops,moreinfo6
id7,Video Conferencing,moreinfo7
id8,Video & DVD Shops,moreinfo8
id9,Woodworking Equipment & Supplies,moreinfo9
my "pattern" file:
Taxidermy Equipment & Supplies
Taxis
Tax Services
Tool
Video
Wood
output file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
I came up with the usual "find and replace" sed:
sed -i 's/PATTERN/REPLACE/g' file.csv
but I want it to run on a specific column, so I came up with:
awk 'BEGIN{OFS=FS="|"}$2==PATTERN{$2=REPLACE}{print}' file.csv
but it doesn't work on "part of string" ([Video]:"Video & DVD Shops" -> "Video") and I can't seem to get it how awk takes input as a file for the "Pattern" block.
Is there an awk script for this? Or do I have to write something (in python with the built in csv suit for example?)

In awk, using index. It only prints record if a replacement is made but it's easy to modify to printing even if there is no match (for example replace the print $1,i,$3} with $0=$1 OFS i OFS $3} 1):
$ awk -F, -v OFS=, '
NR==FNR { a[$1]; next } # store "patterns" to a arr
{ for(i in a) # go thru whole a for each record
if(index($2,i)) # if "pattern" matches $2
print $1,i,$3 # print with replacement
}
' pattern_file target_file
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9

Perl solution, using Text::CSV_XS:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my ($input_file, $pattern_file) = #ARGV;
open my $pfh, '<', $pattern_file or die $!;
chomp( my #patterns = <$pfh> );
my $aoa = csv(in => $input_file);
for my $line (#$aoa) {
for my $pattern (#patterns) {
if (-1 != index $line->[1], $pattern) {
$line->[1] = $pattern;
last
}
}
}
csv(in => $aoa, quote_space => 0, eol => "\n", out => \*STDOUT);

Here's a (mostly) awk solution:
#/bin/bash
patterns_regex=`cat patterns_file | tr '\n' '|'`
cat target_file | awk -F"," -v patterns="$patterns_regex" '
BEGIN {
OFS=",";
split(patterns, patterns_split, "|");
}
{
for (pattern_num in patterns_split) {
pattern=patterns_split[pattern_num];
if (pattern != "" && $2 ~ pattern) {
print $1,pattern,$3
}
}
}'

When you want to solve this with sed, you will need some steps.
For each pattern you will need a command like
sed 's/^\([^,]*\),\(.*Tool.*\),/\1,Tool,/' inputfile
You will need each pattern twice, you can translate the patternfile with
sed 's/.*/"&" "&"/' patternfile
# Change the / into #, thats easier for the final command
sed 's#.*#"&" "&"#' patternfile
When you instruct sed to read a commandfile, you do need to start each line with sed. The commandfile will look like
sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile
You can store this is a file and use the file, but with process substitution you can do things like
cat <(echo "Now this line from echo is handled as a file")
Nice. Lets test the solution
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#' patternfile) inputfile
Almost there! Only the first output line is strange. Whats happening?
The first pattern has a &, and that has a special meaning.
We can patch our command by adding a backslash in the pattern:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile) inputfile

Linux/bash/awk read parts of the files name as variables

We wrote an awk one liner to split an input csv file (Assay_51003_target_pairs.csv) into multiple files. For any row if their column 1 is equal to another column column 1, the column 2 is equal to another column 2, etc., these rows will be categorized into a new file. The new file will be named using the column values.
Here is the one liner
awk -F "," 'NF>1 && NR>1 && $1==$1 && $2==$2 && $9==$9 && $10==$10{print $0 >> ("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv");close("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv")}' Assay_51003_target_pairs.csv
This will generate the following example output (Assay_$1_target_$3_assay_$9_bcassay_$10_bcalt_assay.csv):
Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,8888,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,8888,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1688,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1688,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Later on we would like to do, for example,
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
#################################################
for b in 1645 1688
do
for c in 8888 9999
do
awk -F, -f max_min.awk Assay_51003_target_$b_assay_7777_bcassay_$c_bcalt_assay.csv
done
done
However, we don't know if there is any way to write a loop for the followup work because the outfile names are "random". May we know if there is any way for linux/bash to parse part of the file name into loop variables (such as parse 1645 and 1688 into b and 8888 & 9999 into c)?

With Bash it should be pretty much easy granting the values are always numbers:
shopt -s nullglob
FILES=(Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv) ## No need to do +([[:digit:]]). The difference is unlikely.
for FILE in "${FILES[#]}"; do
IFS=_ read -a A <<< "$FILE"
# Do something with ${A[1]} ${A[3]} ${A[5]} and ${A[7]}
...
# Or
IFS=_ read __ A __ B __ C __ D __ <<< "$FILE"
# Do something with $A $B $C and $D
...
done

Asking if $1 == $1, etc., is pointless, since it will always be true. The following code is equivalent:
awk -F, '
NF > 1 && NR > 1 {
f = "Assay_" $1 "_target_" $3 "_assay_" $9 \
"_bcassay_" $10 "_bcalt_assay.csv"
print >> f;
close(f)
}' Assay_51003_target_pairs.csv
The reason this works is that the same file is appended to if the fields used in the construction of the filename match. But I wonder if it's an error on your part to be using $3 instead of $2 since you mention $2 in your description.
At any rate, what you are doing seems very odd. If you can give a straightforward description of what you are actually trying to accomplish, there may be a completely different way to do it.

Append values into key pair from two lines

My text file out put looks like this on two lines:
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
My desired output is too look like this:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
I'm having no luck trying this:
sed '$!N;s/|/\n/' foo
Any advice would be welcomed, thank you.

As you have just two lines, this can be a way:
$ paste -d' ' <(head -1 file | sed 's/|/\n/g') <(tail -1 file | sed 's/|/\n/g')
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
By pieces. Let's get the first line and replace every pipe | with a new line:
$ head -1 file | sed 's/|/\n/g'
DelayTimeThreshold
MaxDelayPerMinute
Name
And do the same with the last line:
$ tail -1 file | sed 's/|/\n/g'
10000
5
rca
Then it is just a matter of pasting both results with a space as delimiter:
paste -d' ' output1 output2

this awk one-liner would work for your requirement:
awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}' file
output:
kent$ echo "DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca"|awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}'
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca

Using the Array::Transpose module:
perl -MArray::Transpose -F'\|' -lane '
push #a, [#F]
} END {print for map {join " ", #$_} transpose(\#a)
' <<END
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
END
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca

As a perl script:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $file1 = ('DelayTimeThreshold|MaxDelayPerMinute|Name');
my $file2 = ('10000|5|rca');
my #file1 = split('\|', $file1);
my #file2 = split('\|', $file2);
my %hash;
#hash{#file1} = #file2;
print Dumper \%hash;
Output:
$VAR1 = {
'Name' => 'rca',
'DelayTimeThreshold' => '10000',
'MaxDelayPerMinute' => '5'
};
OR:
for (my $i = 0; $i < $#file1; $i++) {
print "$file1[$i] $file2[$i]\n";
}
Output:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca

Suppose you have a file that contains a single header row with column names, followed by multiple detail rows with column values, for example,
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|abc
20001|6|def
30002|7|ghk
40003|8|jkl
50004|9|mnp
The following code would print that file, using the names from the first row, paired with values from each subsequent (detail) row,
#!/bin/perl -w
use strict;
my ($fn,$fh)=("header.csv"); #whatever the file is named...
open($fh,"< $fn") || error "cannot open $fn";
my ($count,$line,#names,#vals)=(0);
while(<$fh>)
{
chomp $_;
#vals=split(/\|/,$_);
if($count++<1) { #names=#vals; next; } #first line is names
for (my $ndx=0; $ndx<=$#names; $ndx++) { #print each
print "$names[$ndx] $vals[$ndx]\n";
}
}
Suppose you want to keep around each row, annotated with names, in an array,
my %row;
my #records;
while(<$fh>)
{
chomp $_;
#vals=split(/\|/,$_);
if($count++<1) { #names=#vals; next; }
#row{#names} = #vals;
push(#records,\%row);
}
Maybe you want to refer to the rows by some key column,
my %row;
my %records;
while(<$fh>)
{
chomp $_;
#vals=split(/\|/,$_);
if($count++<1) { #names=#vals; next; }
#row{#names} = #vals;
$records{$vals[0]}=\%row;
}

intersperse the lines of two different files

I have to do a simple task, but I don't know how to do it and I'm staked. I need to intersperse the lines of two different files each 4 lines:
File 1:
1
2
3
4
5
6
7
8
9
10
11
12
FILE 2:
A
B
C
D
E
F
G
H
I
J
K
L
Desired result:
1
2
3
4
A
B
C
D
5
6
7
8
E
F
G
H
9
10
11
12
I
J
K
L
I'm looking for a sed, awk or python script, or any other bash command.
Thanks for your time!!
I tried to do it using specific python libraries that recognize the 4 lines modules of each files. But It doesn't work and now I trying to do it without this libraries, but don't know how.
import sys
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
def main(forward,reverse):
for F, R in zip ( SeqIO.parse(forward, "fastq"), SeqIO.parse(reverse, "fastq") ):
fastq_out_F = SeqRecord( F.seq, id = F.id, description = "" )
fastq_out_F.letter_annotations["phred_quality"] = F.letter_annotations["phred_quality"]
fastq_out_R = SeqRecord( R.seq, id = R.id, description = "" )
fastq_out_R.letter_annotations["phred_quality"] = R.letter_annotations["phred_quality"]
print fastq_out_F.format("fastq"),
print fastq_out_R.format("fastq"),
if __name__ == '__main__':
main(sys.argv[1], sys.argv[2])

This might work for you:(using GNU sed)
sed -e 'n;n;n;R file2' -e 'R file2' -e 'R file2' -e 'R file2' file1
or using paste/bash:
paste -d' ' <(paste -sd' \n' file1) <(paste -sd' \n' file2) | tr ' ' '\n'
or:
parallel -N4 --xapply 'printf "%s\n%s\n" {1} {2}' :::: file1 :::: file2

It can be done in pure bash:
f1=""; f2=""
while test -z "$f1" -o -z "$f2"; do
{ read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE"; } || f1=end;
{ read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE"; } || f2=end;
done < f1 3< f2
The idea is to use a new file descriptor (3 in this case) and read from stdin and this file descriptor at the same time.

A mix of paste and sed can also be used if you do not have GNU sed:
paste -d '\n' f1 f2 | sed -e 'x;N;x;N;x;N;x;N;x;N;x;N;x;N;s/^\n//;H;s/.*//;x'
If you are not familiar with sed, there is a 2nd buffer called the hold space where you can save data. The x command exchanges the current buffer with the hold space, the N command appends one line to the current buffer, and the H command appends the current buffer to the hold space.
So the first x;N save the current line (from f1 because of paste) in the hold space and read the next line (from f2 because of paste), then each x;N;x;N read a new line from f1 and f2, and the script finishes by removing the new line from the 4 lines of f2, puts the lines from f2 at the end of the lines of f1, clean the hold space for the next run and print the 8 lines.

Try this, changing the appropriate filename values for f1 and f2.
awk 'BEGIN{
sectionSize=4; maxSectionCnt=sectionSize; maxSectionCnt++
notEof1=notEof2=1
f1="file1" ; f2="file2"
while (notEof1 && notEof2) {
if (notEof1) {
for (i=1;i<maxSectionCnt;i++) {
if (getline < f1 >0 ) { print "F1:" i":" $0 } else {notEof1=0}
}
}
if (notEof2) {
for (i=1;i<maxSectionCnt;i++) {
if (getline < f2 >0 ) { print "F2:" i":" $0 } else {notEof2=0}
}
}
}
}'
You can also remove the "F1: i":" etc record header. I added that help debug code.
As Pastafarianist rightly points out, you may need to modify this if you have expectations about what will happen if the files are not the same size, etc.
I hope this helps.

The code you posted looks extremely complicated. There is a rule of thumb with programming: there is always a simpler solution. In your case, way simpler.
First thing you should do is determine the limitations of the input. Are you going to process really big files? Or are they going to be only of one-or-two-kilobyte size? It matters.
Second thing: look at the tools you have. With Python, you've got file objects, lists, generators and so on. Try to combine these tools to produce the desired result.
In your particular case, there are some unclear points. What should the script do if the input files have different size? Or one of them is empty? Or the number of lines is not a factor of four? You should decide how to handle corner cases like these.
Take a look at the file object, xrange, list slicing and list comprehensions. If you prefer doing it the cool way, you can also take a look at the itertools module.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to comparing two big files on unique strings using fgrep/comm? - python

comm does exactly what I needed. I had a white space behind line 10 of my disk.txt file. therefor comm returned it as a unique string. Please check #KamilCuk answer for more context about sorting your files and using comm.

Related

Split csv file thousands of times based on groupby

Find part of a string in CSV and replace whole cell with new entry?

Linux/bash/awk read parts of the files name as variables

Append values into key pair from two lines

intersperse the lines of two different files

Categories

Resources