I need to identify duplicates between column A of CSV1 and column A of CSV2. If a duplicate first name is found, the entire row from CSV2 needs to be copied to a new CSV3. Can somebody help in Python?
CSV1
Adam
Eve
John
George
CSV2
Steve
Mark
Adam Smith
John Smith
CSV3
Adam Smith
John Smith
Here is a quick answer. It's O(n^2), where n is the number of lines in your CSVs, and it assumes the two CSVs are of equal length. If you need an O(n) solution (clearly optimal), let me know. The trick there would be building a set of the elements of column A of csv1.
lines1 = open('csv1.txt').read().split('\n')
delim = ', '
fields1 = [line.split(delim) for line in lines1]
lines2 = open('csv2.txt').read().split('\n')
fields2 = [line.split(delim) for line in lines2]
duplicates = []
for line1 in fields1:
    for line2 in fields2:
        if line1[0] == line2[0]:
            duplicates.append(line2)
print duplicates
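For reference, a minimal sketch of the O(n) set-based variant mentioned above (same file names and comma-space delimiter assumed):

delim = ', '

# Build a set of the column-A values from csv1, then scan csv2 once.
with open('csv1.txt') as f1:
    col_a = set(line.split(delim)[0].strip() for line in f1 if line.strip())

with open('csv2.txt') as f2:
    duplicates = [line.split(delim) for line in f2
                  if line.split(delim)[0].strip() in col_a]

print(duplicates)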
Using any of the 3 one-liners:
Option 1: Parse file1 in the BEGIN Block
perl -lane 'BEGIN {$csv2 = pop; $seen{(split)[0]}++ while <>; @ARGV = $csv2 } print if $seen{$F[0]}' csv1 csv2
Option 2: Using a ternary
perl -lane 'BEGIN {($csv1) = @ARGV } $ARGV eq $csv1 ? $seen{$F[0]}++ : ($seen{$F[0]} && print)' csv1 csv2
Option 3: Using a single if
perl -lane 'BEGIN {($csv1) = @ARGV } print if $seen{$F[0]} += $ARGV eq $csv1 and $ARGV ne $csv1' csv1 csv2
Explanation:
Switches:
-l: Enable line ending processing
-a: Splits the line on spaces and loads the fields into the array @F
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
A clean and Pythonic way to solve your problem:
words_a = set([])
words_b = set([])
with open('csv1') as a:
    words_a = set([w.strip()
                   for l in a.readlines()
                   for w in l.split(" ")
                   if w.strip()])

with open('csv2') as b:
    words_b = set([w.strip()
                   for l in b.readlines()
                   for w in l.split(" ")
                   if w.strip()])

with open('csv3', 'w') as wf:
    for w in words_a.intersection(words_b):
        wf.write(w)
        wf.write('\n')
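Note that this writes only the matching names to csv3. If the entire CSV2 row is needed, as the question asks, here is a hedged sketch using the csv module (the file names and the assumption that column A is the first field are mine):

import csv

# Sketch only: assumes plain comma-separated files and that column A
# is the first field of each row.
with open('csv1') as f1:
    names_a = set(row[0].strip() for row in csv.reader(f1) if row)

with open('csv2') as f2, open('csv3', 'w') as f3:
    writer = csv.writer(f3)
    for row in csv.reader(f2):
        if row and row[0].strip() in names_a:
            writer.writerow(row)  # copy the whole CSV2 row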
I don't know how to combine the word 'set', a space, a name, another space, and the last word ('username' or 'admin') from 'file2', for each name listed in 'file1'.
file1 = [/home/smith/file1.txt]
file2 = [/home/smith/file2.txt]
file3 = file1 + file2
Example:
[file1 - Names]
smith
jerry
summer
aaron
[file2 - Sentences]
set username
set admin
[file3 - Output]
set smith username
set smith admin
set jerry username
set jerry admin
set summer username
set summer admin
set aaron username
set aaron admin
Can you be more specific about your problem? And have you already tried something? If that is the case, please share it.
The way I see it, you can open file2, read every line, and split the two words on the space (adding them to a list, for example). Then you can create a new string for every set of words in that list. Loop over every line in file1. For every line in file1: take the first word from file2, add a space, add the actual line from file1, and finally add another space and the second word from file2.
You now have a new string which you can append to a new file, for example. You should probably append that string to the new file in the same loop where you created it.
But then again, I'm not sure if this is an answer to your question.
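If it helps, here is a minimal sketch of those steps in Python (the file paths are taken from the question, everything else is an assumption):

# Read the two-word commands from file2, then write
# "<first word> <name> <second word>" for every name in file1.
with open('/home/smith/file2.txt') as f2:
    commands = [line.split() for line in f2 if line.strip()]

with open('/home/smith/file1.txt') as f1, open('/home/smith/file3.txt', 'w') as f3:
    for name in (line.strip() for line in f1 if line.strip()):
        for first, second in commands:
            f3.write("%s %s %s\n" % (first, name, second))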
Try this one in Bash; it answers your question:
#!/bin/bash
file1=".../f1.txt"
file2=".../f2.txt"
file3=".../f3.txt"
while read p1; do
    while read p2; do
        word1=$(echo $p2 | cut -f1 -d' ')
        word2=$(echo $p2 | cut -f2 -d' ')
        echo "$word1 $p1 $word2" >> $file3
    done < $file2
done < $file1
Something like this, perhaps..
names = file("/home/smith/file1.txt").readlines()
commands = file("/home/smith/file2.txt").readlines()
res = []
for name in names:
    for command in commands:
        command = command.strip().split(" ")
        res.append(" ".join([command[0], name.strip(), command[1]]))
file("/home/smith/file3.txt","w").write("\n".join(res))
I'm sure this is not the prettiest way, but should work. But why do you want to do something like this...?
Yet another solution using utilities only:
join -1 2 -2 3 file1 file2 | awk '{printf "%s %s %s\n", $2, $1, $3}' > file3
I have 2 files, and I want to get all lines from file2 (fsearch) that contain any given line from file1 (forig).
I wrote a simple python script that looks like this:
import re

matches = []

def search_string(w, file):
    global matches
    reg = re.compile((r'(^|^.*\|)' + w.strip("\r\n") + r'(\t|\|).*$'), re.M)
    match = reg.findall(file)
    matches.extend(match)

fsearch_text = fsearch.read()
for fword in forig:
    search_string(fword, fsearch_text)
There are about 100,000 lines in file1, and about 200,000 lines in file2, so my script takes about 6 hours to complete.
Is there a better algorithm to achieve the same goal in less time?
Edit:
I should have provided example for why I need regexp:
I am searching for a list of words from file1 and trying to match them to translations in file2. If I do not use a regexp to limit possible matches, I also match translations for words that merely contain the word I'm searching for as a substring, for example:
Word I search: 浸し
Matched word: お浸し|御浸し|御したし &n boiled greens in bonito-flavoured soy sauce (vegetable side dish)
So I have to anchor the start of the match with either ^ or |, and the end of the match with \t or |, but capture the whole line.
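For comparison, a dictionary-based sketch that avoids re-scanning the whole of file2 for every word ('file1'/'file2' are placeholder names, and it only approximates the boundary rules above, so treat it as a starting point rather than a drop-in replacement):

import re
from collections import defaultdict

# Index every |- or tab-delimited token of each file2 line once,
# then look up each file1 word in the index instead of running a regex per word.
index = defaultdict(list)
with open('file2') as fsearch:
    for line in fsearch:
        line = line.rstrip('\n')
        for token in re.split(r'[|\t]', line):
            if token:
                index[token].append(line)

matches = []
with open('file1') as forig:
    for fword in forig:
        matches.extend(index.get(fword.strip(), []))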
Assuming you can fit both files in memory, you can read and sort them. After that, you can compare the lines linearly.
f1 = open('e:\\temp\\file1.txt')
lines1 = sorted([line for line in f1])
f2 = open('e:\\temp\\file2.txt')
lines2 = sorted([line for line in f2])
i1 = 0
i2 = 0
matchCount = 0
while (i1 < len(lines1) and i2 < len(lines2)):
    line1 = lines1[i1]
    line2 = lines2[i2]
    if line1 < line2:
        i1 += 1
    elif line1 > line2:
        i2 += 1
    else:
        matchCount += 1
        i2 += 1

print('matchCount')
print(matchCount)
If it is possible for you to use UNIX/GNU/Linux commands you could do this:
# fill example files
for i in {1..100000}; do echo $RANDOM; done > file1.txt
for i in {1..200000}; do echo $RANDOM; done > file2.txt
# get every line of file2.txt which is also in file1.txt
# for static string matching:
grep -F -x -f file1.txt file2.txt
# for regex matching use (regular expressions in file1.txt):
grep -f file1.txt file2.txt
grep is optimized for such operations, so the above call takes less than a second.
I have 4 files and would like to know which elements are non-overlapping (per file) compared to the elements in the other files.
File A
Vincy
ruby
rome
File B
Vincy
rome
Peter
File C
Vincy
Paul
alex
File D
Vincy
rocky
Willy
Any suggestions for a one-liner in Perl, Python, shell, or bash? The expected output is:
File A: ruby; File B: Peter; File C: Paul, alex; File D: rocky, Willy.
Edit after the question was clarified: unique elements across all files, and the file in which each occurs:
cat File_A File_B File_C File_D |sort | uniq -u | while read line ; do file=`grep -l $line File*` ; echo "$file $line" ; done
Edit:
A Perl way of doing it, which will be faster if the files are large:
#!/usr/bin/perl
use strict;
use autodie;
my $wordHash;
foreach my $arg (@ARGV) {
    open(my $fh, "<", $arg);
    while (<$fh>) {
        chomp;
        $wordHash->{$_}->[0]++;
        push(@{$wordHash->{$_}->[1]}, $arg);
    }
}

for my $word (keys %$wordHash) {
    if ($wordHash->{$word}->[0] == 1) {
        print $wordHash->{$word}->[1]->[0] . ": $word\n";
    }
}
execute as:
myscript.pl filea fileb filec ... filezz
stuff from before clarification:
Easy enough with shell commands. Non-repeating elements across all files:
cat File_A File_B File_C File_D |sort | uniq -u
Unique elements across all files
cat File_A File_B File_C File_D |sort | uniq
Unique elements per file
(edit thanks to @Dennis Williamson)
for line in File* ; do echo "working on $line" ; sort $line | uniq ; done
Here is a quick python script that will do what you ask over an arbitrary number of files:
from sys import argv
from collections import defaultdict
filenames = argv[1:]
X = defaultdict(list)
for f in filenames:
    with open(f, 'r') as FIN:
        for word in FIN:
            X[word.strip()].append(f)

for word in X:
    if len(X[word]) == 1:
        print "Filename: %s word: %s" % (X[word][0], word)
This gives:
Filename: D word: Willy
Filename: C word: alex
Filename: D word: rocky
Filename: C word: Paul
Filename: B word: Peter
Filename: A word: ruby
Hot needle:
import sys
inputs = {}
for inputFileName in sys.argv[1:]:
    with open(inputFileName, 'r') as inputFile:
        inputs[inputFileName] = set([line.strip() for line in inputFile])

for inputFileName, inputSet in inputs.iteritems():
    print inputFileName
    result = set(inputSet)  # copy, so the stored set is not mutated below
    for otherInputFileName, otherInputSet in inputs.iteritems():
        if otherInputFileName != inputFileName:
            result -= otherInputSet
    print result
Didn't try it though ;-)
Perl one-liner, readable version with comments:
perl -nlwe '
    $a{$_}++;                                 # count identical lines with hash
    push @a, $_;                              # save lines in array
    if (eof) { push @b, [$ARGV, @a]; @a=(); } # at eof save file name and lines
}{                                            # eskimo operator, executes rest of code at end of input files
    for (@b) {
        print shift @$_;                      # print file name
        for (@$_) { print if $a{$_} == 1 };   # print unique lines
    }
' file{A,B,C,D}.txt
Note: eof is for each individual input file.
Copy/paste version:
perl -nlwe '$a{$_}++; push @a, $_; if (eof) { push @b,[$ARGV,@a]; @a=(); } }{ for (@b) { print shift @$_; for (@$_) { print if $a{$_} == 1 } }' file{A,B,C,D}.txt
Output:
filea.txt
ruby
fileb.txt
Peter
filec.txt
Paul
alex
filed.txt
rocky
Willy
Notes: This was trickier than expected, and I'm sure there's a way to make it prettier, but I'll post this for now and see if I can clean it up.
I have two text files in the following format:
The first is this on every line:
Key1:Value1
The second is this:
Key2:Value2
Is there a way I can replace Value1 in file1 by the Value2 obtained from using it as a key in file2?
For example:
file1:
foo:hello
bar:world
file2:
hello:adam
bar:eve
I would like to get:
foo:adam
bar:eve
There isn't necessarily a match between the two files on every line. Can this be done neatly in awk or something, or should I do it naively in Python?
Create two dictionaries, one for each file. For example:
file1 = {}
for line in open('file1', 'r'):
    k, v = line.strip().split(':')
    file1[k] = v
Or if you prefer a one-liner:
file1 = dict(l.strip().split(':') for l in open('file1', 'r'))
Then you could do something like:
result = {}
for key, value in file1.iteritems():
    if value in file2:
        result[key] = file2[value]
Another way: you could generate the key-value pairs in reverse for file1 and use sets. For example, if your file1 contains foo:bar, your file1 dict would be {'bar': 'foo'}.
for key in set(file1) & set(file2):
    result[file1[key]] = file2[key]
Basically, you can quickly find common elements using set intersection, so those elements are guaranteed to be in file2 and you don't waste time checking for their existence.
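For completeness, a small sketch of that reversed-dict variant end to end (file names as in the question; keys with no match are simply skipped, as in the loop above):

# file1 stored as value -> key (reversed), file2 as key -> value.
file1 = dict(l.strip().split(':')[::-1] for l in open('file1'))
file2 = dict(l.strip().split(':') for l in open('file2'))

result = {}
for key in set(file1) & set(file2):
    result[file1[key]] = file2[key]
# With the question's sample data this yields {'foo': 'adam'}.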
Edit: As pointed out by @pepr, you can use collections.OrderedDict for the first method if order is important to you.
The awk solution:
awk '
BEGIN {FS = OFS = ":"}
NR==FNR {val[$1] = $2; next}
$1 in val {$2 = val[$1]}
{print}
' file2 file1
join -t : -1 2 -2 1 -o 0 2.2 -a 2 <(sort -k 2 -t : file1) <(sort file2)
The input files must be sorted on the field they are joined on.
The options:
-t : - Use a colon as the delimiter
-1 2 - Join on field 2 of file 1
-2 1 - Join on field 1 of file 2
-o 0 2.2 - Output the join field followed by field 2 from file2 (separated by the delimiter character)
-a 2 - Output unjoined lines from file2
Once you have:
file1 = {'foo':'hello', 'bar':'world'}
file2 = {'hello':'adam', 'bar':'eve'}
You can do an ugly one liner:
print dict([(i,file2[i]) if i in file2 else (i,file2[j]) if j in file2 else (i,j) for i,j in file1.items()])
{'foo': 'adam', 'bar': 'eve'}
As in your example, you are using both the keys and the values of file1 as keys into file2.
This might work for you (probably GNU sed):
sed 's#\([^:]*\):\(.*\)#/\\(^\1:\\|:\1$\\)/s/:.*/:\2/#' file2 | sed -f - file1
If you do not consider using basic Unix/Linux commands cheating, then here is a solution using paste and awk.
paste file1.txt file2.txt | awk -F ":" '{ print $1":"$3 }'
TXR:
#(next "file2")
#(collect)
#key:#value1
# (cases)
# (next "file1")
# (skip)
#value2:#key
# (or)
# (bind value2 key)
# (end)
# (output)
#value2:#value1
# (end)
#(end)
Run:
$ txr subst.txr
foo:adam
bar:eve
I have two sorted files and want to merge them to make a third, but I need the output to be sorted. One column in the second file is a subset of the first, and any place the second file doesn't match the first should be filled in with NA. The files are large (~20,000,000 records each), so loading everything into memory is tough and speed is an issue.
File 1 looks like this:
1 a
2 b
3 c
4 d
5 e
File 2 looks like this:
1 aa
2 bb
4 dd
5 ee
And the output should be like this:
1 a aa
2 b bb
3 c NA
4 d dd
5 e ee
join is your friend here.
join -a 1 file1 file2
should do the trick. The only difference to your example output is that the unpairable lines are printed directly from file1, i.e. without the NA.
Edit: Here is a version that also handles the NAs:
join -a 1 -e NA -o 1.1 1.2 2.2 file1 file2
If I understand you correctly:
File #1 and file #2 will have the same lines
However, some lines will be missing from file #2 that are in file #1.
AND, most importantly, the lines will be sorted in each file.
That means if I get a line from file #2 and then keep reading through file #1, I'll find a matching line sooner or later. Therefore, we want to read a line from file #2, keep looking through file #1 until we find the matching line, and when we do find one, we want to print out both values.
I would imagine some sort of algorithm like this:
Read first line from file #2
While read line from file #1
    if line from file #2 > line from file #1
        write line from file #1 and "NA"
    else
        write line from file #1 and file #2
        Read another line from file #2
    fi
done
There should be some form of error checking (what if the line from file #1 is greater than the line from file #2? That means file #1 is missing that line.) And there should be some boundary checking (what if you run out of lines from file #2 before you finish file #1?)
This sounds like a school assignment, so I really don't want to give an actual answer. However, the algorithm is there. All you need to do is implement it in your favorite language.
If it isn't a school assignment, and you need more help, just post a comment on this answer, and I'll do what I can.
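If it isn't an assignment, a bare-bones Python sketch of that algorithm might look like this (file names are placeholders, keys compared as strings):

# Walk file1 once; advance file2 only on a key match, writing "NA"
# for keys that are missing from file2.
with open('file1.txt') as f1, open('file2.txt') as f2:
    line2 = f2.readline()
    for line1 in f1:
        key1, val1 = line1.split(None, 1)
        if line2 and line2.split(None, 1)[0] == key1:
            print("%s %s %s" % (key1, val1.strip(), line2.split(None, 1)[1].strip()))
            line2 = f2.readline()
        else:
            print("%s %s NA" % (key1, val1.strip()))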
To the DNA Biologist
#! /usr/bin/env perl
use warnings;
use strict;
use feature qw(say);
use constant {
    TEXT1 => "foo1.txt",
    TEXT2 => "foo2.txt",
};

open (FILE1, "<", TEXT1) or die qq(Can't open file ) . TEXT1 . qq( for reading\n);
open (FILE2, "<", TEXT2) or die qq(Can't open file ) . TEXT2 . qq( for reading\n);

my $line2 = <FILE2>;
chomp $line2;
my ($lineNum2, $value2) = split(/\s+/, $line2, 2);

while (my $line1 = <FILE1>) {
    chomp $line1;
    my ($lineNum1, $value1) = split(/\s+/, $line1, 2);
    if (not defined $line2) {
        say "$lineNum1 - $value1 - NA";
    }
    elsif ($lineNum1 lt $lineNum2) {    # use "<" for a numeric match rather than a string match
        say "$lineNum1 - $value1 - NA";
    }
    elsif ($lineNum1 eq $lineNum2) {
        say "$lineNum1 - $value1 - $value2";
        $line2 = <FILE2>;
        if (defined $line2) {
            chomp $line2;
            ($lineNum2, $value2) = split(/\s+/, $line2, 2);
        }
    }
    else {
        die qq(Something went wrong: Line 1 = "$line1" Line 2 = "$line2"\n);
    }
}
It wasn't thoroughly tested, but it worked on some short sample files.
You can do it all in shell:
sort file.1 > file.1.sorted
sort file.2 > file.2.sorted
join -e NA file.1.sorted file.2.sorted > file.joined
Here's a Python solution:
"""merge two files based on matching first columns"""
def merge_files(file1, file2, merge_file):
with (open(file1) as file1,
open(file2) as file2,
open(merge_file, 'w')) as merge:
for line2 in file2:
index2, value2 = line2.split(' ', 1)
for line1 in file1:
index1, value1 = line1.split(' ', 1)
if index1 != index2:
merge.write(line1)
continue
merge.write("%s %s %s" % (index1, value1[:-1], value2))
break
for line1 in file1: # grab any remaining lines in file1
merge.write(line1)
if __name__ == '__main__':
merge_files('test1.txt','test2.txt','test3.txt')