Extracting same lines from two files while disregarding lower/uppercase - python

The aim is to extract the lines common to two files while disregarding case and also disregarding punctuation.
I have two files
source.txt
Foo bar
blah blah black sheep
Hello World
Kick the, bucket
processed.txt
foo bar
blah sheep black
Hello world
kick the bucket ,
Desired output (from source.txt):
Foo bar
Hello World
Kick the, bucket
I have been doing it like this:
from string import punctuation

with open('source.txt', 'r') as f1, open('processed.txt', 'r') as f2:
    for i, j in zip(f1, f2):
        # Normalize both lines: lowercase, strip punctuation, collapse whitespace.
        lower_depunct_f1 = " ".join("".join(ch.lower() for ch in i if ch not in punctuation).split())
        lower_depunct_f2 = " ".join("".join(ch.lower() for ch in j if ch not in punctuation).split())
        if lower_depunct_f1 == lower_depunct_f2:
            print(i, end='')
        else:
            print()
Is there a way to do this with bash tools? perl, shell, awk, sed?

Easier to do this using awk:
awk 'FNR==NR {s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); a[s]++;next}
{s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); print (s in a)?$0:""}' file2 file1
Foo bar
Hello World
Kick the, bucket
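The same normalize-and-hash idea behind this awk command can be sketched in Python for comparison (`normalize` and `matching_lines` are helper names chosen here, not part of the awk answer):

```python
import string

def normalize(line):
    # Lowercase, drop punctuation, and collapse runs of whitespace.
    stripped = "".join(ch for ch in line.lower() if ch not in string.punctuation)
    return " ".join(stripped.split())

def matching_lines(source_lines, processed_lines):
    # Keep a source line when its normalized form was seen in processed.
    seen = {normalize(line) for line in processed_lines}
    return [line.rstrip("\n") for line in source_lines if normalize(line) in seen]

print("\n".join(matching_lines(
    ["Foo bar\n", "blah blah black sheep\n", "Hello World\n", "Kick the, bucket\n"],
    ["foo bar\n", "blah sheep black\n", "Hello world\n", "kick the bucket ,\n"],
)))
```

Unlike the awk command, this drops non-matching lines instead of printing empty ones.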

The Perl solution is quite similar to the Python one:
open my $S1, '<', 'source.txt' or die $!;
open my $S2, '<', 'processed.txt' or die $!;
while (defined(my $s1 = <$S1>) and defined (my $s2 = <$S2>)) {
s/[[:punct:]]//g for $s1, $s2;
$_ = lc for $s1, $s2;
print $s1 eq $s2 ? $s1 : "\n";
}
Note that the result is different from yours, as the space after kick the bucket was not removed.

Bash solution, quite similar to the Perl one, with the same difference in the result (the space after kick the bucket is not removed):
#!/bin/bash
shopt -s nocasematch
exec 3<> source.txt    # Open source.txt and assign fd 3 to it.
exec 4<> processed.txt
while read -r varline <&3 && read -r varpro <&4
do
    varline_noPunct=$(echo "$varline" | tr -d '[:punct:]')
    varpro_noPunct=$(echo "$varpro" | tr -d '[:punct:]')
    [[ $varline_noPunct == $varpro_noPunct ]] && echo "$varline" || echo
done
exec 3>&-    # Close fd 3.
exec 4>&-

Check if this solution helps you:
use strict;
use warnings;
my $f1 = $ARGV[0];
open FILE1, "<", $f1 or die $!;
my $f2 = $ARGV[1];
open FILE2, "<", $f2 or die $!;
open OUTFILE, ">", "cmp.txt" or die $!;
my %seen;
while (<FILE1>) {
    s/[[:punct:]]//g;
    $seen{lc($_)} = 1;
}
while (<FILE2>) {
    s/[[:punct:]]//g;
    if ($seen{lc($_)}) {
        print OUTFILE $_;
    }
}
close OUTFILE;


awk + adding a column based on values of another column + adding a field name in one command

I want to add a new column at the end, based on the text of another column (with an if statement), and then I want to add a new column/field name.
I am close, but I am struggling with the syntax. I am using awk, but apologies, it's been a while since I used it; I wondered whether I should use python/anaconda (jupyter notebook), but I'm going with the easiest environment available to me at the minute: awk.
This is my file:
$ cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
Here, based on the text in column 4, I want to create a new column at the end. I am winging this a bit, but I got it to work:
$ awk -F, '{if (substr($4,1,1)=="A")
print $0 (NR>1 ? FS substr($4,1,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,2) : "")
}' file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
But here I want to add a field/column name at the end, and I believe I am close.
$ awk -F, -v OFS=, 'NR==1{ print $0, "test"}
NR>1
{
if (substr($4,1,1)=="A")
print $0 (NR>1 ? FS substr($4,1,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,2) : "")
}
' file1
f1,f2,f3,f4,f5,test
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
What I want is this:
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
EDIT1
For my reference: this is the awk I want:
awk -F, '{if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}' file1
outputting it to file2:
awk -F, '{if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}' file1 > file2
Two files; file2 has the extra column added:
$ls
file1 file2
$cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5
$cat file2
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
EDIT2 -- Correction
file 2 is what I want:
cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5
cat file2
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
awk -F, -v OFS=, 'NR==1{ print $0, "test"}
NR>1 {
if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
Newlines matter. Change:
NR>1
{
to
NR>1 {
As written you have 2 independent statements equivalent to:
NR>1 { print }
<true condition> {
if (whatever) print foo; else print bar
}
instead of what you intended:
NR>1 {
if (whatever) print foo; else print bar
}
Having said that, try this instead of what you have:
awk '
BEGIN { FS=OFS="," }
NR == 1 { x = "test" }
NR > 1 { x = substr( $4, 1, ($4 ~ /^A/ ? 4 : 2) ) }
{ print $0, x }
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
Same functionality, just more concise.
Rename x to some mnemonic of whatever you really intend that last column to represent.
You may use this awk command, which removes any carriage return (if present) from each line before computing the value of the last column:
awk 'BEGIN {FS=OFS=","}
{
sub(/\r$/, "")
print $0, (NR==1 ? "test" : (substr($4,1,1)=="A" ? substr($4,1,4) : substr($4,1,2)))
}' file
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
An awk one-liner:
gawk '$++NF =(_^=!_)==NR ? "test" : substr(__=$4,_++,_^_^(__~"^A"))' FS=, OFS=,
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
To deal with Windows/DOS files, do this instead:
mawk 'BEGIN { RS ="\r?\n"; ___ = "test"; OFS = FS = ","
_++ } $++NF = _==NR ? ___ : substr(__=$4,_,++_^_--^(__~"^A"))'
This works because the regex selects whether to take 2^2^1 or 2^2^0, which work out to 4 and 2 respectively.
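The length selection in that one-liner can be checked in a few lines of Python (an illustration of the arithmetic only; `prefix_len` is a name invented here):

```python
def prefix_len(field):
    # The regex test (__ ~ "^A") yields 1 or 0 in awk; the exponent tower
    # 2 ** 2 ** 1 == 4 and 2 ** 2 ** 0 == 2 turns that flag into the
    # number of characters to keep.
    starts_with_a = int(field.startswith("A"))
    return 2 ** 2 ** starts_with_a

for f4 in ["SBCDE", "AWERF", "ASDFG"]:
    print(f4, "->", f4[:prefix_len(f4)])  # SBCDE -> SB, AWERF -> AWER, ASDFG -> ASDF
```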
A Python solution using only the standard library. Let file.csv content be:
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
then
import csv
with open('file.csv', newline='') as infile:
with open('fileout.csv', 'w', newline='') as outfile:
reader = csv.DictReader(infile)
outfields = reader.fieldnames + ['test']
writer = csv.DictWriter(outfile, outfields)
writer.writeheader()
for row in reader:
row['test'] = row['f4'][:4] if row['f4'][0] == 'A' else row['f4'][:2]
writer.writerow(row)
creates (or overwrites, if it already exists) the file fileout.csv with the following content:
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
Explanation: I am using csv.DictReader and csv.DictWriter from csv. First I create two context managers (with open...) with newline set suitably for reader and writer (see the linked docs for further explanation); infile is opened for reading (the default), whilst outfile is opened for writing (w). I use the reader to parse infile, and concatenate its fieldnames (column names) with a list holding the single element test to get the fieldnames for the output file. I then write the header (column names), and for each data row in the input file I compute the value for test using a conditional expression (observe it is value_if_true if condition else value_if_false, which is a different order from GNU AWK's condition ? value_if_true : value_if_false) and string slicing ([:n] means take the first n characters), insert that into the row dict, and write it out.
(tested in Python 3.8.10)
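The ordering difference between Python's conditional expression and awk's ternary, noted above, in a minimal sketch (`last_column` is a name chosen here):

```python
def last_column(f4):
    # Python: value_if_true if condition else value_if_false
    # awk:    condition ? value_if_true : value_if_false
    return f4[:4] if f4[0] == "A" else f4[:2]

print(last_column("AWERF"))  # AWER
print(last_column("SBCDE"))  # SB
```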

Get a string in Shell/Python with subprocess

Following the topic Get a string in Shell/Python using sys.argv, I need to change my code: I need to use a subprocess in main.py with this function:
def download_several_apps(self):
subproc_two = subprocess.Popen(["./readtext.sh", self.inputFileName_download], stdout=subprocess.PIPE)
Here is my file readtext.sh
#!/bin/bash
filename="$1"
counter=1
while IFS=: true; do
    line=''
    read -r line
    if [ -z "$line" ]; then
        break
    fi
    python3 ./download.py \
        -c ./credentials.json \
        --blobs \
        "$line"
done < "$filename"
And my download.py file
if (len(sys.argv) == 2):
    downloaded_apk_default_location = 'Downloads/'
else:
    readtextarg = os.popen("ps " + str(os.getppid()) + " | awk ' { out = \"\"; for(i = 6; i <= NF; i++) out = out$i\" \" } END { print out } ' ").read()
    textarg = readtextarg.split(" ")[1 : -1][0]
    downloaded_apk_default_location = 'Downloads/' + textarg[1:]
How can I get and print self.inputFileName_download in my download.py file?
I used sys.argv as answered by @tripleee in my previous post, but it doesn't work as I need.
OK, I changed the last line to:
downloaded_apk_default_location = 'Downloads/' + textarg.split("/")[-1]
to get the text file name.
The shell indirection seems completely superfluous here.
import download

with open(self.inputFileName_download) as apks:
    for line in apks:
        if line == '\n':
            break
        blob = line.rstrip('\n')
        download.something(blob=blob, credentials='./credentials.json')
... where obviously I had to speculate about what the relevant function from download.py might be called.

Comparing two files and removing all whitespaces

Is there a more elegant way of comparing these two files?
Right now I am getting the following error message: syntax error near unexpected token (... diff <( tr -d ' '.
result = Popen("diff <( tr -d ' \n' <" + file1 + ") <( tr -d ' \n' <"
               + file2 + ") | wc -l", shell=True, stdout=PIPE).stdout.read()
Python seems to read "\n" as a literal character.
The constructs you are using are interpreted by bash and do not form a standalone statement that you can pass to system() or exec().
<( ${CMD} )
< ${FILE}
${CMD1} | ${CMD2}
As such, you will need to wire up the redirection and pipelines yourself, or call on bash to interpret the line for you (as @wizzwizz4 suggests).
A better solution would be to use something like difflib that will perform this internally to your process rather than calling on system() / fork() / exec().
Using difflib.unified_diff will give you a similar result:
import difflib

def read_file_no_blanks(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()
    for line in lines:
        if line == '\n':
            continue
        yield line

def count_differences(diff_lines):
    diff_count = 0
    for line in diff_lines:
        if line[0] not in [ '-', '+' ]:
            continue
        if line[0:3] in [ '---', '+++' ]:
            continue
        diff_count += 1
    return diff_count

a_lines = list(read_file_no_blanks('a'))
b_lines = list(read_file_no_blanks('b'))
diff_lines = difflib.unified_diff(a_lines, b_lines)
diff_count = count_differences(diff_lines)
print('differences: %d' % ( diff_count ))
This will fail when you fix the syntax error because you are attempting to use bash syntax in what is implemented as a C system call.
If you wish to do this in this way, either write a shell script or use the following:
result = Popen(['bash', '-c',
                "diff <( tr -d ' \n' <" + file1 + ") <( tr -d ' \n' <"
                + file2 + ") | wc -l"], stdout=PIPE).stdout.read()
This is not an elegant solution, however, since it is relying on the GNU coreutils and bash. A more elegant solution would be pure Python. You could do this with the difflib module and the re module.
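If all that's needed from the pipeline is whether the two files match once whitespace is stripped, the comparison can be done in pure Python (a sketch; the function name is mine):

```python
def files_identical_ignoring_whitespace(path_a, path_b):
    """Compare two files after deleting spaces and newlines,
    mirroring: diff <(tr -d ' \\n' <a) <(tr -d ' \\n' <b)."""
    def squashed(path):
        # Read the whole file and drop every space and newline.
        with open(path) as f:
            return f.read().replace(" ", "").replace("\n", "")
    return squashed(path_a) == squashed(path_b)
```

To also count the differing lines, as `wc -l` over the diff does, difflib.unified_diff is the closer match.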

Compare a regex match from two separate files and replace with values from one of them

I'm not really sure what the best way to do this is... I was thinking I might need to do it in Python?
filea.html contains data-tx-text="9817db21ccc2d9acc021c4536690b90a_se"
fileb.html contains data-tx-text="0850235fcb0e503150c224dad3156312_se"
There are exactly the same number of occurrences of data-tx-text values in filea.html and fileb.html (171).
I want to be able to use a regex pattern or a simple Python program to
Find data-tx-text="(.*?)" in filea.html
Find data-tx-text="(.*?)" in fileb.html
Replace the value from filea.html with the value found in fileb.html
Move to the next occurrence.
Continue until the end of the file, or until all values in filea.html match those in fileb.html
I have the basics. For instance, I know the regex pattern that I need, and I am guessing I need to loop this in Python or something similar?
Maybe I can do it with sed, but I'm not that good with that, so any help is greatly appreciated.
In awk, you could use something like this:
NR == FNR {
    match($0, /data-tx-text="[^"]+"/);
    if (RSTART > 0) {
        data[++a] = substr($0, RSTART + 14, RLENGTH - 15);
    }
    next;
}
/data-tx-text/ {
    sub(/data-tx-text="[^"]+"/, "data-tx-text=\"" data[++b] "\"");
    print;
}
With GNU awk for the 3rd arg to match():
$ cat tst.awk
match($0,/(.*)(data-tx-text="[^"]+")(.*)/,a) {
    if (NR==FNR) {
        fileb[++bcnt] = a[2]
    }
    else {
        $0 = a[1] fileb[++acnt] a[3]
    }
}
NR>FNR
$ awk -f tst.awk fileb filea
data-tx-text="0850235fcb0e503150c224dad3156312_se"
with other awks you'd use 3 calls to substr() after the match():
$ cat tst.awk
match($0,/data-tx-text="[^"]+"/) {
    if (NR==FNR) {
        fileb[++bcnt] = substr($0,RSTART,RLENGTH)
    }
    else {
        $0 = substr($0,1,RSTART-1) fileb[++acnt] substr($0,RSTART+RLENGTH)
    }
}
NR>FNR
$ awk -f tst.awk fileb filea
data-tx-text="0850235fcb0e503150c224dad3156312_se"
Open filea, find stringa.
Open fileb, find stringb.
Replace stringa with stringb.
Replace stringb with stringa.
Write the files back.
In code:
import re

pattern = 'data-tx-text="(.*?)"'
with open('filea.html', 'r') as f:
    filea = f.read()
with open('fileb.html', 'r') as f:
    fileb = f.read()
# re.search scans the whole string; re.match would only match at the start.
stringa = re.search(pattern, filea).group()
stringb = re.search(pattern, fileb).group()
filea = filea.replace(stringa, stringb)
fileb = fileb.replace(stringb, stringa)
with open('filea.html', 'w') as f:
    f.write(filea)
with open('fileb.html', 'w') as f:
    f.write(fileb)
So this is how I have solved it using Python. It's a bit manual in that I have to change the names of filea and fileb each time, but it works.
I think I can improve the regex with escapes?
import re

with open('filea.html') as originalFile:
    originalFileContents = originalFile.read()

pattern = re.compile(r'[0-9a-f]{32}_se')
originalMatches = pattern.findall(originalFileContents)

counter = 0

def replaceId(match):
    global counter
    value = match.group()
    newValue = originalMatches[counter]
    print(counter, '=> replacing', value, 'with', newValue)
    counter = counter + 1
    return newValue

with open('fileb.html') as targetFile:
    targetFileContents = targetFile.read()

changedTargetFileContents = pattern.sub(replaceId, targetFileContents)
print(changedTargetFileContents)

new_file = open("Output.html", "w")
new_file.write(changedTargetFileContents)
new_file.close()
Available on Github: https://github.com/timm088/rehjex-py
Here's how I'd do it using Beautiful Soup:
from bs4 import BeautifulSoup as bs

replacements, replaced_html = [], ''

with open('fileb.html') as fileb:
    # Extract replacements
    soup = bs(fileb, 'html.parser')
    tags = soup.find_all(lambda tag: tag.get('data-tx-text'))
    replacements = [tag.get('data-tx-text') for tag in tags]

with open('filea.html') as filea:
    # Replace values
    soup = bs(filea, 'html.parser')
    tags = soup.find_all(lambda tag: tag.get('data-tx-text'))
    for tag in tags:
        tag['data-tx-text'] = replacements.pop(0)
    replaced_html = str(soup)

with open('filea.html', 'w') as new_filea:
    # Update file
    new_filea.write(replaced_html)

merge multiple lines into single line by value of column

I have a tab-delimited text file that is very large. Many lines in the file have the same value for one of the columns. I want to put them on the same line. For example:
a foo
a bar
a foo2
b bar
c bar2
After running the script it should become:
a foo;bar;foo2
b bar
c bar2
How can I do this in either a shell script or in Python?
Thanks.
With awk you can try this
{ a[$1] = a[$1] ";" $2 }
END { for (item in a ) print item, a[item] }
So if you save this awk script in a file called awkf.awk and if your input file is ifile.txt, run the script
awk -f awkf.awk ifile.txt | sed 's/ ;/ /'
The sed script removes the leading ;.
Hope this helps
from collections import defaultdict

items = defaultdict(list)
for line in open('sourcefile'):
    key, val = line.rstrip('\n').split('\t')
    items[key].append(val)

result = open('result', 'w')
for k in sorted(items):
    result.write('%s\t%s\n' % (k, ';'.join(items[k])))
result.close()
not tested
Tested with Python 2.7:
import csv

data = {}
reader = csv.DictReader(open('infile','r'), fieldnames=['key','value'], delimiter='\t')
for row in reader:
    if row['key'] in data:
        data[row['key']].append(row['value'])
    else:
        data[row['key']] = [row['value']]

writer = open('outfile','w')
for key in data:
    writer.write(key + '\t' + ';'.join(data[key]) + '\n')
writer.close()
A Perl way to do it:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

open my $fh, '<', 'path/to/file' or die "unable to open file:$!";
my %res;
while (<$fh>) {
    my ($k, $v) = split;
    push @{$res{$k}}, $v;
}
print Dumper \%res;
output:
$VAR1 = {
          'c' => [
                   'bar2'
                 ],
          'a' => [
                   'foo',
                   'bar',
                   'foo2'
                 ],
          'b' => [
                   'bar'
                 ]
        };
#! /usr/bin/env perl
use strict;
use warnings;

# for demo only
*ARGV = *DATA;

my %record;
my @order;
while (<>) {
    chomp;
    my ($key, $combine) = split;
    push @order, $key unless exists $record{$key};
    push @{ $record{$key} }, $combine;
}
print $_, "\t", join(";", @{ $record{$_} }), "\n" for @order;
__DATA__
a foo
a bar
a foo2
b bar
c bar2
Output (with tabs converted to spaces because Stack Overflow breaks the output):
a foo;bar;foo2
b bar
c bar2
def compress(infilepath, outfilepath):
    input = open(infilepath, 'r')
    output = open(outfilepath, 'w')
    prev_index = None
    for line in input:
        index, val = line.rstrip('\n').split('\t')
        if index == prev_index:
            output.write(";%s" % val)
        else:
            if prev_index is not None:
                output.write("\n")
            output.write("%s\t%s" % (index, val))
        prev_index = index
    output.write("\n")
    input.close()
    output.close()
Untested, but should work. Please leave a comment if there are any concerns
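For reference, the same single-pass idea can be written with itertools.groupby (like the function above, this assumes lines sharing a key are adjacent in the input; `merge_by_key` is a name chosen here):

```python
from itertools import groupby

def merge_by_key(lines):
    # Split each "key<TAB>value" line, group consecutive equal keys,
    # and join the values of each group with ';'.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    return ["%s\t%s" % (key, ";".join(v for _, v in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

print("\n".join(merge_by_key(
    ["a\tfoo\n", "a\tbar\n", "a\tfoo2\n", "b\tbar\n", "c\tbar2\n"])))
```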
