This question is not the same as How to print only the unique lines in BASH?, because that one suggests removing all copies of the duplicated lines, while this one is about eliminating only their duplicates, i.e., changing 1, 2, 3, 3 into 1, 2, 3 instead of just 1, 2.
This question is really hard to phrase because I cannot find the right words for it, but the example makes it clear. If I have a file like this:
1
2
2
3
4
After parsing the file and erasing the duplicated lines, it should become like this:
1
3
4
I know some Python, and this is a Python script I wrote to do it. Create a file called clean_duplicates.py and run it as:
import sys
#
# To run it use:
# python clean_duplicates.py < input.txt > clean.txt
#
def main():
    lines = sys.stdin.readlines()
    # print( lines )
    clean_duplicates( lines )

#
# It only removes adjacent duplicated lines, so you need to sort the input
# (case-sensitively) before running it.
#
def clean_duplicates( lines ):
    lastLine = lines[ 0 ]
    nextLine = None
    currentLine = None
    linesCount = len( lines )

    # If it is a one-line file, print it and stop the algorithm
    if linesCount == 1:
        sys.stdout.write( lines[ linesCount - 1 ] )
        sys.exit()

    # Print the first line
    if linesCount > 1 and lines[ 0 ] != lines[ 1 ]:
        sys.stdout.write( lines[ 0 ] )

    # Print the middle lines; range() excludes its end value, so the last line
    # is handled separately below
    for index in range( 1, linesCount - 1 ):
        currentLine = lines[ index ]
        nextLine = lines[ index + 1 ]

        if currentLine == lastLine:
            continue

        lastLine = lines[ index ]

        if currentLine == nextLine:
            continue

        sys.stdout.write( currentLine )

    # Print the last line
    if linesCount > 2 and lines[ linesCount - 2 ] != lines[ linesCount - 1 ]:
        sys.stdout.write( lines[ linesCount - 1 ] )

if __name__ == "__main__":
    main()
However, when searching about removing duplicated lines, it seems easier to use tools such as grep, sort, sed, and uniq:
How to remove duplicate lines inside a text file?
removing line from list using sort, grep LINUX
Find duplicate lines in a file and count how many time each line was duplicated?
Remove duplicate entries in a Bash script
How to delete duplicate lines in a file without sorting it in Unix?
How to delete duplicate lines in a file...AWK, SED, UNIQ not working on my file
You may use uniq with the -u/--unique option. As per the uniq man page:
-u / --unique
Don't output lines that are repeated in the input.
Print only lines that are unique in the INPUT.
For example:
cat /tmp/uniques.txt | uniq -u
Or, as mentioned in UUOC: Useless use of cat, a better way is:
uniq -u /tmp/uniques.txt
Both of these commands will return:
1
3
4
where /tmp/uniques.txt holds the numbers mentioned in the question, i.e.
1
2
2
3
4
Note: uniq requires the content of the file to be sorted. As mentioned in the docs:
By default, uniq prints the unique lines in a sorted file; it discards all but one of identical successive input lines, so that the OUTPUT contains unique lines.
In case the file is not sorted, you need to sort the content first and then use uniq over the sorted content:
sort /tmp/uniques.txt | uniq -u
No sorting required and output order will be the same as input order:
$ awk 'NR==FNR{c[$0]++;next} c[$0]==1' file file
1
3
4
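For comparison, here is a minimal Python sketch of the same counting idea (count every line first, then keep only the lines that occur exactly once, preserving input order); the file name 'file' is just a placeholder:
import sys
from collections import Counter

# First pass: count how often each line occurs (the awk c[$0]++ step).
# 'file' is a placeholder for the input file name.
with open('file') as f:
    lines = f.readlines()
counts = Counter(lines)

# Second pass: keep only the lines whose count is exactly 1, in input order.
sys.stdout.writelines(line for line in lines if counts[line] == 1)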
Europe Finland Office Supplies Online H 5/21/2015 193508565 7/3/2015 2339 651.21 524.96 1523180.19 1227881.44 295298.75
Europe Greece Household Online L 9/11/2015 895509612 9/26/2015 49 668.27 502.54 32745.23 24624.46 8120.77
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
If you have this kind of lines, you can use this command:
[isuru@192 ~]$ sort duplines.txt | sed 's/\ /\-/g' | uniq | sed 's/\-/\ /g'
But keep special characters in mind: if there are dashes in your lines, make sure to use a different placeholder symbol. Here I keep a space between the backslash and the forward slash.
Before applying the code:
After applying the code:
Kindly use the sort command with the -u argument to list the unique values of any command's output.
cat file_name | sort -u
1
2
3
4
Related
I have 2 files, and I want to get all lines from file2 (fsearch) that contain any given line from file1 (forig).
I wrote a simple python script that looks like this:
import re

matches = []  # collected regex matches

def search_string(w, file):
    global matches
    reg = re.compile((r'(^|^.*\|)' + w.strip("\r\n") + r'(\t|\|).*$'), re.M)
    match = reg.findall(file)
    matches.extend(match)

# forig and fsearch are the already-opened file1 and file2 handles
fsearch_text = fsearch.read()
for fword in forig:
    search_string(fword, fsearch_text)
There are about 100,000 lines in file1, and about 200,000 lines in file2, so my script takes about 6 hours to complete.
Is there a better algorithm to achieve the same goal in less time?
Edit:
I should have provided an example of why I need the regexp:
I am searching for a list of words from file1 and trying to match them against translations in file2. If I do not use a regexp to limit possible matches, I also match translations of words that merely contain the word I am searching for as a substring, for example:
Word I search: 浸し
Matched word: お浸し|御浸し|御したし &n boiled greens in bonito-flavoured soy sauce (vegetable side dish)
So I have to limit the start of the match to either ^ or |, and the end of the match to \t or |, but capture the whole line.
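For illustration, here is a small sketch of what the anchoring changes; the first dictionary line is the one from the example above, the second is an invented entry where 浸し is a real headword:
import re

word = "浸し"
# The second line below is hypothetical, added only to show a true headword match.
text = (
    "お浸し|御浸し|御したし\tboiled greens in bonito-flavoured soy sauce (vegetable side dish)\n"
    "浸し|ひたし\tsoaking\n"
)

# Plain substring search: also hits the unwanted お浸し / 御浸し entry.
print(re.findall(word, text))

# Anchored pattern from the script above: the word must start the line or
# follow a '|', and must be followed by a tab or another '|'.
anchored = re.compile(r'(^|^.*\|)' + word + r'(\t|\|).*$', re.M)
print([m.group(0) for m in anchored.finditer(text)])  # only the headword line matches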
Assuming you can keep both files in memory, you can read them and sort them.
After that, you can compare the lines linearly.
f1 = open('e:\\temp\\file1.txt')
lines1 = sorted([line for line in f1])

f2 = open('e:\\temp\\file2.txt')
lines2 = sorted([line for line in f2])

i1 = 0
i2 = 0
matchCount = 0
while (i1 < len(lines1) and i2 < len(lines2)):
    line1 = lines1[i1]
    line2 = lines2[i2]
    if line1 < line2:
        i1 += 1
    elif line1 > line2:
        i2 += 1
    else:
        matchCount += 1
        i2 += 1

print('matchCount')
print(matchCount)
If it is possible for you to use UNIX/GNU/Linux commands you could do this:
# fill example files
for i in {1..100000}; do echo $RANDOM; done > file1.txt
for i in {1..200000}; do echo $RANDOM; done > file2.txt
# get every line of file2.txt which is also in file1.txt
# for static string matching:
grep -F -x -f file1.txt file2.txt
# for regex matching use (regular expressions in file1.txt):
grep -f file1.txt file2.txt
grep is optimized for such operations so the above call takes less than a second (have a look at this).
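If you prefer to stay in Python, a rough equivalent of the static whole-line grep -F -x -f call (same file1.txt / file2.txt names as in the example above) could look like this sketch:
# Whole-line, fixed-string matching via a set lookup, one pass over each file.
with open('file1.txt') as f1:
    wanted = set(line.rstrip('\n') for line in f1)

with open('file2.txt') as f2:
    for line in f2:
        if line.rstrip('\n') in wanted:
            print(line.rstrip('\n'))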
I'm new to Python, so I'm really struggling to write a script.
What I need is to make a comparison between two files. One file contains all the proteins of some database; the other contains only some of the proteins present in the first file, because it belongs to one organism. So I need to know which proteins of the database are present in my organism. For that I want to build an output like a matrix, with 0 and 1 indicating, for every protein in the database, whether or not it is present in my organism.
Does anybody have any idea of how I could do that?
I'm trying to use something like this
$ cat sorted.a
A
B
C
D
$ cat sorted.b
A
D
$ join sorted.a sorted.b | sed 's/^/1 /' && join -v 1 sorted.a sorted.b | sed 's/^/0 /'
1 A
1 D
0 B
0 C
But I'm not able to use it, because sometimes a protein is present but not on the same line.
Here is an example:
1-cysPrx_C
120_Rick_ant
14-03-2003
2-Hacid_dh
2-Hacid_dh_C
2-oxoacid_dh
2-ph_phosp
2CSK_N
2C_adapt
2Fe-2S_Ferredox
2H-phosphodiest
2HCT
2OG-FeII_Oxy
Comparing with
1-cysPrx_C
14-3-3
2-Hacid_dh
2-Hacid_dh_C
2-oxoacid_dh
2H-phosphodiest
2OG-FeII_Oxy
2OG-FeII_Oxy_3
2OG-FeII_Oxy_4
2OG-FeII_Oxy_5
2OG-Fe_Oxy_2
2TM
2_5_RNA_ligase2
Does anyone have an idea of how I could do that?
Thanks so far.
The fastest way in Python would be to read your organism file and save each protein name to a set. Then open and iterate through your all-proteins file; for each name, check whether it is present in your organism set and print a 0 or 1 accordingly, along with the name.
Example code if your organism list is called 'prot_list':
with open(all_proteins_file) as f:
    for line in f:
        prot = line.strip()
        if prot in prot_list: num = 1
        else: num = 0
        print '%i %s' % (num, prot)
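The snippet assumes prot_list already exists; a minimal sketch for building it as a set from the organism file (organism_file is a placeholder) would be:
# organism_file is a placeholder path for your organism protein list.
# Building a set means each membership test in the loop above is O(1).
with open(organism_file) as f:
    prot_list = set(line.strip() for line in f)
Using a set rather than a list keeps the whole comparison roughly linear in the size of the two files.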
I need to identify duplicates in column A of CSV1 against column A of CSV2. If a duplicate first name is identified, the entire row from CSV2 needs to be copied to a new CSV3. Can somebody help with this in Python?
CSV1
Adam
Eve
John
George
CSV2
Steve
Mark
Adam Smith
John Smith
CSV3
Adam Smith
John Smith
Here is a quick answer. It's O(n²), with n the number of lines in your CSVs, and it assumes two equal-length CSVs. If you need an O(n) solution (clearly optimal), then let me know. The trick there would be building a set of the elements of column A of csv1.
lines1 = open('csv1.txt').read().split('\n')
delim = ', '
fields1 = [line.split(delim) for line in lines1]

lines2 = open('csv2.txt').read().split('\n')
fields2 = [line.split(delim) for line in lines2]

duplicates = []
for line1 in fields1:
    for line2 in fields2:
        if line1[0] == line2[0]:
            duplicates.append(line2)

print duplicates
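For reference, a sketch of the O(n) variant mentioned above, which builds a set of the column-A values of csv1 first (same placeholder file names and delimiter as the snippet above):
# Index column A of csv1 once, then scan csv2 a single time.
lines1 = open('csv1.txt').read().split('\n')
lines2 = open('csv2.txt').read().split('\n')
delim = ', '

names1 = set(line.split(delim)[0] for line in lines1 if line)
duplicates = [line.split(delim) for line in lines2
              if line and line.split(delim)[0] in names1]
print(duplicates)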
Using any of the 3 one-liners:
Option 1: Parse file1 in the BEGIN Block
perl -lane 'BEGIN {$csv2 = pop; $seen{(split)[0]}++ while <>; @ARGV = $csv2 } print if $seen{$F[0]}' csv1 csv2
Option 2: Using a ternary
perl -lane 'BEGIN {($csv1) = @ARGV } $ARGV eq $csv1 ? $seen{$F[0]}++ : ($seen{$F[0]} && print)' csv1 csv2
Option 3: Using a single if
perl -lane 'BEGIN {($csv1) = @ARGV } print if $seen{$F[0]} += $ARGV eq $csv1 and $ARGV ne $csv1' csv1 csv2
Explanation:
Switches:
-l: Enable line ending processing
-a: Splits the line on whitespace and loads the fields into the array @F
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
A clean and Pythonic way to solve your problem:
words_a = set([])
words_b = set([])
with open('csv1') as a:
    words_a = set([w.strip()
                   for l in a.readlines()
                   for w in l.split(" ")
                   if w.strip()])

with open('csv2') as b:
    words_b = set([w.strip()
                   for l in b.readlines()
                   for w in l.split(" ")
                   if w.strip()])

with open('csv3', 'w') as wf:
    for w in words_a.intersection(words_b):
        wf.write(w)
        wf.write('\n')
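Note that this writes only the matching names themselves; if, as in the question, the entire row from CSV2 should be copied to CSV3, a variant keyed on the first field of each CSV2 row (a sketch, using the same csv1/csv2/csv3 names as above) could look like:
# Copy a whole CSV2 row to CSV3 whenever its first field appears in CSV1.
with open('csv1') as a:
    first_names = set(l.split()[0] for l in a if l.strip())

with open('csv2') as b, open('csv3', 'w') as wf:
    for line in b:
        if line.strip() and line.split()[0] in first_names:
            wf.write(line)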
I have an output that looks like this, where the first number corresponds to the count of the type below (e.g. 72 for Type 4, etc)
72
Type
4
51
Type
5
66
Type
6
78
Type
7
..etc
Is there a way to organize this data to look something like this:
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
etc..
Essentially, the question is how to take a single column of data and sort/organize it into something more readable using bash, awk, Python, etc. (Ideally in bash, but I am also interested in how to do it in Python.)
Thank you.
Use paste to join 3 consecutive lines from stdin, then just rearrange the fields.
paste - - - < file | awk '{print $2, $3, "=", $1, "times"}'
It's simple enough with Python to read three lines of data at a time:
def perthree(iterable):
    # zip three references to the same iterator, yielding 3 consecutive lines at a time
    return zip(*[iter(iterable)] * 3)

with open(inputfile) as infile:
    for count, type_, type_num in perthree(infile):
        print('{} {} = {} times'.format(type_.strip(), type_num.strip(), count.strip()))
The .strip() calls remove any extra whitespace, including the newline at the end of each line of input text.
Demo:
>>> with open(inputfile) as infile:
...     for count, type_, type_num in perthree(infile):
...         print('{} {} = {} times'.format(type_.strip(), type_num.strip(), count.strip()))
...
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
In Bash:
#!/bin/bash
A=() I=0
while read -r LINE; do
    if (( (M = ++I % 3) )); then
        A[M]=$LINE
    else
        printf "%s %s = %s times\n" "${A[2]}" "$LINE" "${A[1]}"
    fi
done
Running bash script.sh < file creates:
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
Note: With the default IFS ($' \t\n'), read strips leading and trailing whitespace from each line.
Try this awk one-liner:
$ awk 'NR%3==1{n=$1}NR%3==2{t=$1}NR%3==0{print t,$1,"=",n,"times"}' file
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
How does it work?
awk '
NR%3==1{                        # if we are on lines 1, 4, 7, etc. (NR is the record number, i.e. the line number)
    n=$1                        # set the variable n to the first (and only) word
}
NR%3==2{                        # if we are on lines 2, 5, 8, etc.
    t=$1                        # set the variable t to the first (and only) word
}
NR%3==0{                        # if we are on lines 3, 6, 9, etc.
    print t,$1,"=",n,"times"    # print the desired output
}' file
I am now facing a file trimming problem. I would like to trim rows in a tab-delimited file.
The rule is: for rows with the same value in the first two columns, preserve only the row with the largest value in the third column. There may be different numbers of such redundant rows defined by the two columns. If there is a tie for the largest value in the third column, preserve the first one (after ordering the file).
(1) My file looks like this (tab-delimited, with several million rows):
1 100 25 T
1 101 26 A
1 101 27 G
1 101 30 A
1 102 40 A
1 102 40 T
(2) The output I want:
1 100 25 T
1 101 30 A
1 102 40 T
This problem comes from my real study, not homework. I hope for your help, because my programming skills are limited. I would prefer a computationally efficient way, because there are so many rows in my data file. Your help will be very valuable to me.
Here's a solution that relies on the input file already being sorted appropriately. It scans line by line for lines with a similar start (e.g. the first two columns identical), checks the third column value, and preserves the line with the highest value, or the line that came first in the file. When a new start is found, it prints the old line and begins checking again.
At the end of the input file, the max line in memory is printed out.
use warnings;
use strict;

my ($max_line, $start, $max) = parse_line(scalar <DATA>);

while (<DATA>) {
    my ($line, $nl_start, $nl_max) = parse_line($_);
    if ($nl_start eq $start) {
        if ($nl_max > $max) {
            $max_line = $line;
            $max = $nl_max;
        }
    } else {
        print $max_line;
        $start = $nl_start;
        $max = $nl_max;
        $max_line = $line;
    }
}
print $max_line;

sub parse_line {
    my $line = shift;
    my ($start, $max) = $line =~ /^([^\t]+\t[^\t]+\t)(\d+)/;
    return ($line, $start, $max);
}
__DATA__
1 100 25 T
1 101 26 A
1 101 27 G
1 101 30 A
1 102 40 A
1 102 40 T
The output is:
1 100 25 T
1 101 30 A
1 102 40 A
You stated:
If there is a tie for the largest value in the third column, preserve the first one (after ordering the file).
which is rather cryptic. Then you asked for output that seemed to contradict this, where the last value was printed instead of the first.
I am assuming that what you meant is "preserve the first value". If you indeed meant "preserve the last value", then simply change the > sign in if ($nl_max > $max) to >=. This will effectively preserve the last equal value instead of the first.
If you however implied some kind of sort, which "after ordering the file" seems to imply, then I do not have enough information to know what you meant.
In Python you can try the following code:
res = {}
for line in (line.split() for line in open('c:\\inpt.txt','r') if line):
    line = tuple(line)
    if line[:2] not in res:
        res[line[:2]] = line[2:]
        continue
    elif int(res[line[:2]][0]) <= int(line[2]):  # compare the stored and current third-column values numerically
        res[line[:2]] = line[2:]

f = open('c:\\tst.txt','w')
[f.write(line) for line in ('\t'.join(k+v)+'\n' for k,v in res.iteritems())]
f.close()
Here's one approach
use strict;
use warnings;

use constant
    { LINENO => 0
    , LINE   => 1
    , SCORE  => 2
    };

use English qw<$INPUT_LINE_NUMBER>;

my %hash;
while ( <> ) {
    # split the line to get the fields
    my @fields = split /\t/;

    # Assemble a key for everything except the "score"
    my $key = join( '-', @fields[0,1] );

    # locally cache the score
    my $score = $fields[SCORE];

    # if we already have a score and the current one is not greater, then next
    next if $hash{ $key } and $score <= $hash{ $key }[SCORE];

    # store the line number, line text, and score
    $hash{ $key } = [ $INPUT_LINE_NUMBER, $_, $score ];
}

# sort by line number and print out the text of the line stored.
foreach my $struct ( sort { $a->[LINENO] <=> $b->[LINENO] } values %hash ) {
    print $struct->[LINE];
}
In Python too, but cleaner imo
import csv

spamReader = csv.reader(open('eggs'), delimiter='\t')
select = {}
for row in spamReader:
    first_two, three = (row[0], row[1]), row[2]
    if first_two in select:
        if int(select[first_two][2]) > int(three):  # compare the third column numerically
            continue
    select[first_two] = row

spamWriter = csv.writer(open('ham', 'w'), delimiter='\t')
for line in select:
    spamWriter.writerow(select[line])