I have 4 files and would like to find, for each file, the elements that do not appear in any of the other files.
File A
Vincy
ruby
rome
File B
Vincy
rome
Peter
File C
Vincy
Paul
alex
File D
Vincy
rocky
Willy
Any suggestions for a one-liner in Perl, Python, shell, or Bash? The expected output is:
File A: ruby; File B: Peter; File C: Paul, alex; File D: rocky, Willy.
Edit after the question was clarified: unique elements across all files, and the file in which each occurs:
cat File_A File_B File_C File_D | sort | uniq -u | while read -r line; do file=$(grep -l "$line" File*); echo "$file $line"; done
Edit:
A Perl way of doing it, which will be faster if the files are large:
#!/usr/bin/perl
use strict;
use autodie;

my $wordHash;
foreach my $arg (@ARGV) {
    open(my $fh, "<", $arg);
    while (<$fh>) {
        chomp;
        $wordHash->{$_}->[0]++;              # count occurrences of each word
        push(@{$wordHash->{$_}->[1]}, $arg); # remember which file(s) it came from
    }
}
for my $word (keys %$wordHash) {
    if ($wordHash->{$word}->[0] == 1) {
        print $wordHash->{$word}->[1]->[0] . ": $word\n";
    }
}
execute as:
myscript.pl filea fileb filec ... filezz
stuff from before clarification:
Easy enough with shell commands. Non-repeating elements across all files:
cat File_A File_B File_C File_D | sort | uniq -u
Unique elements across all files:
cat File_A File_B File_C File_D | sort | uniq
Unique elements per file:
(edit thanks to @Dennis Williamson)
for file in File*; do echo "working on $file"; sort "$file" | uniq; done
Here is a quick python script that will do what you ask over an arbitrary number of files:
from sys import argv
from collections import defaultdict
filenames = argv[1:]
X = defaultdict(list)
for f in filenames:
with open(f,'r') as FIN:
for word in FIN:
X[word.strip()].append(f)
for word in X:
if len(X[word])==1:
print "Filename: %s word: %s" % (X[word][0], word)
This gives:
Filename: D word: Willy
Filename: C word: alex
Filename: D word: rocky
Filename: C word: Paul
Filename: B word: Peter
Filename: A word: ruby
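A more compact sketch of the same idea, using collections.Counter over per-file word sets (this assumes one entry per line with no embedded spaces; the script name and the way it is invoked, e.g. python uniq_per_file.py File_A File_B File_C File_D, are hypothetical):
import sys
from collections import Counter

# One set of entries per file; an entry unique to one file appears in exactly one set.
words = {f: set(open(f).read().split()) for f in sys.argv[1:]}
counts = Counter(w for s in words.values() for w in s)
for f, s in sorted(words.items()):
    unique = sorted(w for w in s if counts[w] == 1)
    print("%s: %s" % (f, ", ".join(unique)))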
Quick and dirty:
import sys

inputs = {}
for inputFileName in sys.argv[1:]:
    with open(inputFileName, 'r') as inputFile:
        inputs[inputFileName] = set(line.strip() for line in inputFile)

for inputFileName, inputSet in inputs.iteritems():
    print inputFileName
    result = set(inputSet)  # copy, so the stored set is not modified in place
    for otherInputFileName, otherInputSet in inputs.iteritems():
        if otherInputFileName != inputFileName:
            result -= otherInputSet
    print result
Didn't try it though ;-)
Perl one-liner, readable version with comments:
perl -nlwe '
$a{$_}++; # count identical lines with hash
push @a, $_; # save lines in array
if (eof) { push @b,[$ARGV,@a]; @a=(); } # at eof save file name and lines
}{ # eskimo operator, executes rest of code at end of input files
for (@b) {
print shift @$_; # print file name
for (@$_) { print if $a{$_} == 1 }; # print unique lines
}
' file{A,B,C,D}.txt
Note: eof is for each individual input file.
Copy/paste version:
perl -nlwe '$a{$_}++; push @a, $_; if (eof) { push @b,[$ARGV,@a]; @a=(); } }{ for (@b) { print shift @$_; for (@$_) { print if $a{$_} == 1 } }' file{A,B,C,D}.txt
Output:
filea.txt
ruby
fileb.txt
Peter
filec.txt
Paul
alex
filed.txt
rocky
Willy
Notes: This was trickier than expected, and I'm sure there's a way to make it prettier, but I'll post this for now and see if I can clean it up.
Related
I want to search a txt file for duplicate lines, excluding [p] and the extension from the comparison. Once the equal lines are identified, show only the line that does not contain [p], with its extension. I have these lines in test.txt:
Peliculas/Desperados (2020)[p].mp4
Peliculas/La Duquesa (2008)[p].mp4
Peliculas/Nueva York Año 2012 (1975).mkv
Peliculas/Acoso en la noche (1980) .mkv
Peliculas/Angustia a Flor de Piel (1982).mkv
Peliculas/Desperados (2020).mkv
Peliculas/Angustia (1947).mkv
Peliculas/Días de radio (1987) BR1080[p].mp4
Peliculas/Mona Lisa (1986) BR1080[p].mp4
Peliculas/La decente (1970) FlixOle WEB-DL 1080p [Buzz][p].mp4
Peliculas/Mona Lisa (1986) BR1080.mkv
In this file, lines 1 & 6 and lines 9 & 11 are the same (without the extension and [p]). Output needed:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
I tried this, but it only shows the matching lines with the extension and the [p] pattern removed; I don't know which original line each came from, and I need the complete line:
sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d
Error output (missing extension):
Peliculas/Desperados (2020)
Peliculas/Mona Lisa (1986) BR1080
because you mentioned bash...
Remove any line with a p:
cat test.txt | grep -v p
home/folder/house from earth.mkv
home/folder3/window 1.avi
Remove any line with [p]:
cat test.txt | grep -v '\[p\]'
home/folder/house from earth.mkv
home/folder3/window 1.avi
home/folder4/little mouse.mpg
Not likely your need, but just because: Remove [p] from every line, then dedupe:
cat test.txt | sed 's/\[p\]//g' | sort | uniq
home/folder/house from earth.mkv
home/folder/house from earth.mp4
home/folder2/test.mp4
home/folder3/window 1.avi
home/folder3/window 1.mp4
home/folder4/little mouse.mpg
If a 2-pass solution (which reads the test.txt file twice) is acceptable, would you please try:
declare -A ary # associate the filename with the base
while IFS= read -r file; do
if [[ $file != *\[p\]* ]]; then # the filename does not include "[p]"
base="${file%.*}" # remove the extension
ary[$base]="$file" # create a map
fi
done < test.txt
while IFS= read -r base; do
echo "${ary[$base]}"
done < <(sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d)
Output:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
In the 1st pass, it reads the file line by line to create a map which associates the filename (with an extension) with the base (w/o the extension).
In the 2nd pass, it replaces the output (base) with the filename.
If you prefer a 1-pass solution (which will be faster), please try:
declare -A ary # associate the filename with the base
declare -A count # count the occurrences of the base
while IFS= read -r file; do
base="${file%.*}" # remove the extension
if [[ $base =~ (.*)\[p\](.*) ]]; then
# "$base" contains the substring "[p]"
(( count[${BASH_REMATCH[1]}${BASH_REMATCH[2]}]++ ))
# increment the counter
else
(( count[$base]++ )) # increment the counter
ary[$base]="$file" # map the filename
fi
done < test.txt
for base in "${!ary[#]}"; do # loop over the keys of ${ary[#]}
if (( count[$base] > 1 )); then
# it duplicates
echo "${ary[$base]}"
fi
done
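If Python is an option, the same one-pass counting idea can be sketched as follows (it reads the same test.txt; the regex that strips the extension mirrors the sed expression used above):
import re

counts = {}   # base name (no extension, no [p]) -> number of occurrences
plain = {}    # base name -> the full line that does not contain [p]
with open('test.txt') as f:
    for line in f:
        name = line.rstrip('\n')
        base = re.sub(r'\.[^.]*$', '', name.replace('[p]', ''))
        counts[base] = counts.get(base, 0) + 1
        if '[p]' not in name:
            plain[base] = name

for base, n in counts.items():
    if n > 1 and base in plain:
        print(plain[base])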
In Python, you can use itertools.groupby with a function that makes a key consisting of the filename without any [p] and with the extension removed (the lines must be sorted by that key first, since groupby only groups consecutive items).
For any groups of size 2 or more, any filenames not containing '[p]' are printed.
import itertools
import re

def make_key(line):
    return re.sub(r'\.[^.]*$', '', line.replace('[p]', ''))

with open('test.txt') as f:
    lines = [line.strip() for line in f]

lines.sort(key=make_key)  # groupby only groups consecutive items, so sort by the key first

for key, group in itertools.groupby(lines, make_key):
    files = [file for file in group]
    if len(files) > 1:
        for file in files:
            if '[p]' not in file:
                print(file)
This gives:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
I need to identify duplicates between column A of CSV1 and column A of CSV2. If a duplicate first name is identified, the entire row from CSV2 needs to be copied to a new CSV3. Can somebody help in Python?
CSV1
Adam
Eve
John
George
CSV2
Steve
Mark
Adam Smith
John Smith
CSV3
Adam Smith
John Smith
Here is a quick answer. It's O(n^2), with n the number of lines in your CSV, and it assumes two equal-length CSVs. If you need an O(n) solution (clearly optimal), then let me know. The trick there would be building a set of the elements of column A of csv1 (a sketch of that variant follows the code below).
lines1 = open('csv1.txt').read().split('\n')
delim = ', '
fields1 = [line.split(delim) for line in lines1]
lines2 = open('csv2.txt').read().split('\n')
fields2 = [line.split(delim) for line in lines2]
duplicates = []
for line1 in fields1:
for line2 in fields2:
if line1[0] == line2[0]:
duplicates.append(line2)
print duplicates
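For reference, here is a minimal sketch of the O(n) set-based variant mentioned above, under the same assumptions as the snippet (hypothetical csv1.txt/csv2.txt names and the ', ' delimiter):
# Build a set of the column-A values from csv1, then stream csv2 and keep the
# rows whose column A is in that set (one pass over each file).
delim = ', '
with open('csv1.txt') as f1:
    names = set(line.split(delim)[0].strip() for line in f1 if line.strip())

duplicates = []
with open('csv2.txt') as f2:
    for line in f2:
        if line.strip() and line.split(delim)[0].strip() in names:
            duplicates.append(line.rstrip('\n').split(delim))
print(duplicates)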
Using any of the 3 one-liners:
Option 1: Parse file1 in the BEGIN Block
perl -lane 'BEGIN {$csv2 = pop; $seen{(split)[0]}++ while <>; @ARGV = $csv2 } print if $seen{$F[0]}' csv1 csv2
Option 2: Using a ternary
perl -lane 'BEGIN {($csv1) = @ARGV } $ARGV eq $csv1 ? $seen{$F[0]}++ : ($seen{$F[0]} && print)' csv1 csv2
Option 3: Using a single if
perl -lane 'BEGIN {($csv1) = @ARGV } print if $seen{$F[0]} += $ARGV eq $csv1 and $ARGV ne $csv1' csv1 csv2
Explanation:
Switches:
-l: Enable line ending processing
-a: Splits the line on space and loads them in an array @F
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
A clean and Pythonic way to solve your problem:
# Collect the column-A names from csv1 into a set,
# then copy every row of csv2 whose first name is in that set.
words_a = set()
with open('csv1') as a:
    for l in a:
        w = l.split(' ')[0].strip()
        if w:
            words_a.add(w)

with open('csv2') as b, open('csv3', 'w') as wf:
    for l in b:
        w = l.split(' ')[0].strip()
        if w and w in words_a:
            wf.write(l)
My text file output looks like this, on two lines:
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
My desired output is to look like this:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
I'm having no luck trying this:
sed '$!N;s/|/\n/' foo
Any advice would be welcomed, thank you.
As you have just two lines, this can be a way:
$ paste -d' ' <(head -1 file | sed 's/|/\n/g') <(tail -1 file | sed 's/|/\n/g')
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
By pieces. Let's get the first line and replace every pipe | with a new line:
$ head -1 file | sed 's/|/\n/g'
DelayTimeThreshold
MaxDelayPerMinute
Name
And do the same with the last line:
$ tail -1 file | sed 's/|/\n/g'
10000
5
rca
Then it is just a matter of pasting both results with a space as delimiter:
paste -d' ' output1 output2
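If Python is also acceptable, the same transpose can be sketched with zip (reading the whole file is fine here since it only has two lines; file is the same input file used above):
# Split each pipe-delimited line into fields, then pair up the columns.
with open('file') as f:
    rows = [line.rstrip('\n').split('|') for line in f]
for column in zip(*rows):
    print(' '.join(column))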
this awk one-liner would work for your requirement:
awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}' file
output:
kent$ echo "DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca"|awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}'
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Using the Array::Transpose module:
perl -MArray::Transpose -F'\|' -lane '
push @a, [@F]
} END {print for map {join " ", @$_} transpose(\@a)
' <<END
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
END
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
As a perl script:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $file1 = ('DelayTimeThreshold|MaxDelayPerMinute|Name');
my $file2 = ('10000|5|rca');
my @file1 = split('\|', $file1);
my @file2 = split('\|', $file2);
my %hash;
@hash{@file1} = @file2;
print Dumper \%hash;
Output:
$VAR1 = {
'Name' => 'rca',
'DelayTimeThreshold' => '10000',
'MaxDelayPerMinute' => '5'
};
OR:
for (my $i = 0; $i <= $#file1; $i++) {
print "$file1[$i] $file2[$i]\n";
}
Output:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Suppose you have a file that contains a single header row with column names, followed by multiple detail rows with column values, for example,
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|abc
20001|6|def
30002|7|ghk
40003|8|jkl
50004|9|mnp
The following code would print that file, using the names from the first row, paired with values from each subsequent (detail) row,
#!/bin/perl -w
use strict;
my ($fn,$fh)=("header.csv"); #whatever the file is named...
open($fh,"< $fn") || error "cannot open $fn";
my ($count,$line,#names,#vals)=(0);
while(<$fh>)
{
chomp $_;
#vals=split(/\|/,$_);
if($count++<1) { #names=#vals; next; } #first line is names
for (my $ndx=0; $ndx<=$#names; $ndx++) { #print each
print "$names[$ndx] $vals[$ndx]\n";
}
}
Suppose you want to keep around each row, annotated with names, in an array,
my @records;
while(<$fh>)
{
chomp $_;
@vals=split(/\|/,$_);
if($count++<1) { @names=@vals; next; }
my %row; # declare inside the loop so each record gets its own hash
@row{@names} = @vals;
push(@records,\%row);
}
Maybe you want to refer to the rows by some key column,
my %records;
while(<$fh>)
{
chomp $_;
@vals=split(/\|/,$_);
if($count++<1) { @names=@vals; next; }
my %row; # a fresh hash per record
@row{@names} = @vals;
$records{$vals[0]}=\%row;
}
I have been using this script for years at work to summarize log files.
#!/usr/bin/perl
$logf = '/var/log/messages.log';
@logf=( `cat $logf` );
foreach $line ( @logf ) {
$line=~s/\d+/#/g;
$count{$line}++;
}
@alpha=sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
foreach $line (@uniq) {
print "$count{$line}: ";
print "$line";
}
I have wanted to rewrite it in Python but I do not fully understand certain portions of it, such as:
@alpha=sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
Does anyone know of a Python module that would negate the need to rewrite this? I haven't had any luck finding something similar. Thanks in advance!
As the name of the var implies,
@alpha=sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
is finding unique elements (i.e. removing duplicate lines), ignoring numbers in the line since they were previously replaced with #. Those three lines could have been written
@uniq = sort keys(%count);
or maybe even
@uniq = keys(%count);
Another way of writing the program in Perl:
my $log_qfn = '/var/log/messages.log';
open(my $fh, '<', $log_qfn)
or die("Can't open $log_qfn: $!\n");
my %counts;
while (<$fh>) {
s/\d+/#/g;
++$counts{$_};
}
#for (sort keys(%counts)) {
for (keys(%counts)) {
print "$counts{$_}: $_";
}
This should be easier to translate into Python.
@alpha=sort @logf;
$prev = 'null';
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
would be equivalent to
uniq = sorted(set(logf))
if logf were a list of lines.
However, since you are counting the frequency of lines,
you could use a collections.Counter to both count the lines and collect the unique lines (as keys) (thus removing the need to compute uniq at all):
count = collections.Counter()
for line in f:
count[line] += 1
import sys
import re
import collections
logf = '/var/log/messages.log'
count = collections.Counter()
write = sys.stdout.write
with open(logf, 'r') as f:
for line in f:
line = re.sub(r'\d+','#',line)
count[line] += 1
for line in sorted(count):
write("{c}: {l}".format(c = count[line], l = line))
I have to say I have often encountered people trying to do things in Python or Perl that can be done in one line in the shell:
I don't care about downvotes, since people should know there is no reason to do something in 20 lines of Python if it can be done in the shell:
sort < my_file.txt | uniq > uniq_my_file.txt
I have two sorted files and want to merge them to make a third, but I need the output to be sorted. One column in the second file is a subset of the first, and any place the second file doesn't match the first should be filled in with NA. The files are large (~20,000,000 records each), so loading things into memory is tough and speed is an issue.
File 1 looks like this:
1 a
2 b
3 c
4 d
5 e
File 2 looks like this:
1 aa
2 bb
4 dd
5 ee
And the output should be like this:
1 a aa
2 b bb
3 c NA
4 d dd
5 e ee
join is your friend here.
join -a 1 file1 file2
should do the trick. The only difference to your example output is that the unpairable lines are printed directly from file1, i.e. without the NA.
Edit: Here is a version that also handles the NAs:
join -a 1 -e NA -o 1.1,1.2,2.2 file1 file2
If I understand you correctly:
File #1 and file #2 will have the same lines
However, some lines will be missing from file #2 that are in file #1.
AND, most importantly, the lines will be sorted in each file.
That means if I get a line from file #2, and then keep reading through file #1, I'll find a matching line sooner or later. Therefore, we want to read a line from file #2, keep looking through file #1 until we find the matching line, and when we do find one, we want to print out both values.
I would imagine some sort of algorithm like this:
Read first line from file #2
While read line from file #1
if line from file #2 > line from file #1
write line from file #1 and "NA"
else
write line from file #1 and file #2
Read another line from file #2
fi
done
There should be some form of error checking (what if you find the line from file #1 to be greater than the line from file #2? That means file #1 is missing the line.) And, there should be some boundary checking (what if you run out of lines from file #2 before you finish file #1?)
This sounds like a school assignment, so I really don't want to give an actual answer. However, the algorithm is there. All you need to do is implement it in your favorite language.
If it isn't a school assignment, and you need more help, just post a comment on this answer, and I'll do what I can.
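For illustration only, here is a minimal Python sketch of that algorithm (hypothetical file names; it assumes both files are sorted on the first column, have two columns per line, and that the keys in file #2 are a subset of the keys in file #1, so a non-matching key simply means the value is missing):
# Merge two sorted two-column files on their first column, writing NA when
# file2 has no row for a given key. Only one line of each file is held in memory.
with open('file1.txt') as f1, open('file2.txt') as f2, open('merged.txt', 'w') as out:
    line2 = f2.readline()
    for line1 in f1:
        key1, val1 = line1.split(None, 1)
        val1 = val1.rstrip('\n')
        if line2 and line2.split(None, 1)[0] == key1:
            val2 = line2.split(None, 1)[1].rstrip('\n')
            out.write('%s %s %s\n' % (key1, val1, val2))
            line2 = f2.readline()   # advance file #2 only after a match
        else:
            out.write('%s %s NA\n' % (key1, val1))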
To the DNA Biologist
#! /usr/bin/env perl
use warnings;
use strict;
use feature qw(say);
use constant {
TEXT1 => "foo1.txt",
TEXT2 => "foo2.txt",
};
open (FILE1, "<", TEXT1) or die qq(Can't open file ) . TEXT1 . qq(for reading\n);
open (FILE2, "<", TEXT2) or die qq(Can't open file ) . TEXT2 . qq(for reading\n);
my $line2 = <FILE2>;
chomp $line2;
my ($lineNum2, $value2) = split(/\s+/, $line2, 2);
while (my $line1 = <FILE1>) {
chomp $line1;
my ($lineNum1, $value1) = split(/\s+/, $line1, 2);
if (not defined $line2) {
say "$lineNum1 - $value1 - NA";
}
elsif ($lineNum1 lt $lineNum2) { #Use "<" if numeric match and not string match
say "$lineNum1 - $value1 - NA";
}
elsif ($lineNum1 eq $lineNum2) {
say "$lineNum1 - $value1 - $value2";
$line2 = <FILE2>;
if (defined $line2) {
chomp $line2;
($lineNum2, $value2) = split(/\s+/, $line2, 2);
}
}
else {
die qq(Something went wrong: Line 1 = "$line1" Line 2 = "$line2"\n);
}
}
It wasn't thoroughly tested, but it worked on some short sample files.
You can do it all in shell:
sort file.1 > file.1.sorted
sort file.2 > file.2.sorted
join -a 1 -e NA -o 1.1,1.2,2.2 file.1.sorted file.2.sorted > file.joined
Here's a Python solution:
"""merge two files based on matching first columns"""
def merge_files(file1, file2, merge_file):
    with open(file1) as file1, open(file2) as file2, open(merge_file, 'w') as merge:
        for line2 in file2:
            index2, value2 = line2.split(' ', 1)
            for line1 in file1:
                index1, value1 = line1.split(' ', 1)
                if index1 != index2:
                    merge.write(line1)
                    continue
                merge.write("%s %s %s" % (index1, value1[:-1], value2))
                break
        for line1 in file1:  # grab any remaining lines in file1
            merge.write(line1)

if __name__ == '__main__':
    merge_files('test1.txt', 'test2.txt', 'test3.txt')