I have two text files in the following format:
The first is this on every line:
Key1:Value1
The second is this:
Key2:Value2
Is there a way I can replace Value1 in file1 with the Value2 obtained by using Value1 as a key in file2?
For example:
file1:
foo:hello
bar:world
file2:
hello:adam
bar:eve
I would like to get:
foo:adam
bar:eve
There isn't necessarily a match between the two files on every line. Can this be done neatly in awk or something, or should I do it naively in Python?
Create two dictionaries, one for each file. For example:
file1 = {}
for line in open('file1', 'r'):
    k, v = line.strip().split(':')
    file1[k] = v
Or if you prefer a one-liner:
file1 = dict(l.strip().split(':') for l in open('file1', 'r'))
Then you could do something like:
result = {}
for key, value in file1.iteritems():
    if value in file2:
        result[key] = file2[value]
Another approach is to generate the key-value pairs for file1 in reverse and use sets. For example, if your file1 contains foo:bar, your file1 dict is {bar: foo}.
for key in set(file1) & set(file2):
    result[file1[key]] = file2[key]
Basically, you can quickly find common elements using set intersection, so those elements are guaranteed to be in file2 and you don't waste time checking for their existence.
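Building that reversed dict could look like this (a small sketch, same file1 format as above):
# Reversed mapping: value -> key, e.g. {'hello': 'foo', 'world': 'bar'}
file1 = {}
for line in open('file1', 'r'):
    k, v = line.strip().split(':')
    file1[v] = k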
Edit: As pointed out by @pepr, you can use collections.OrderedDict for the first method if order is important to you.
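Putting the pieces together, a minimal end-to-end sketch (assuming the sample file names above, a hypothetical output file called result, and the fall-back-to-key behaviour that the example output implies):
# Build the lookup table from file2: {'hello': 'adam', 'bar': 'eve'}
with open('file2') as f:
    file2 = dict(line.strip().split(':') for line in f)

# Rewrite file1: try the value as a key in file2, then the key itself,
# and keep the original value if neither matches
with open('file1') as f, open('result', 'w') as out:
    for line in f:
        k, v = line.strip().split(':')
        out.write('%s:%s\n' % (k, file2.get(v, file2.get(k, v))))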
The awk solution:
awk '
BEGIN {FS = OFS = ":"}
NR==FNR {val[$1] = $2; next}
$2 in val {$2 = val[$2]}
$1 in val {$2 = val[$1]}
{print}
' file2 file1
join -t : -1 2 -2 1 -o 0 2.2 -a 2 <(sort -k 2 -t : file1) <(sort file2)
The input files must be sorted on the field they are joined on.
The options:
-t : - Use a colon as the delimiter
-1 2 - Join on field 2 of file 1
-2 1 - Join on field 1 of file 2
-o 0 2.2 - Output the join field followed by field 2 from file2 (separated by the delimiter character)
-a 2 - Output unjoined lines from file2
Once you have:
file1 = {'foo':'hello', 'bar':'world'}
file2 = {'hello':'adam', 'bar':'eve'}
You can do an ugly one-liner:
print dict([(i,file2[i]) if i in file2 else (i,file2[j]) if j in file2 else (i,j) for i,j in file1.items()])
{'foo': 'adam', 'bar': 'eve'}
As in your example, you are using both the keys and the values of file1 as keys into file2.
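Unrolled into a more readable form, the same logic is roughly (a sketch, assuming the two dicts shown above):
result = {}
for k, v in file1.items():
    if k in file2:       # the file1 key is itself a key in file2
        result[k] = file2[k]
    elif v in file2:     # otherwise try the file1 value as a key
        result[k] = file2[v]
    else:                # no match: keep the original value
        result[k] = v
print(result)            # {'foo': 'adam', 'bar': 'eve'}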
This might work for you (probably GNU sed):
sed 's#\([^:]*\):\(.*\)#/\\(^\1:\\|:\1$\\)/s/:.*/:\2/#' file2 | sed -f - file1
If you do not consider using basic Unix/Linux commands cheating, then here is a solution using paste and awk.
paste file1.txt file2.txt | awk -F ":" '{ print $1":"$3 }'
TXR:
#(next "file2")
#(collect)
#key:#value1
# (cases)
# (next "file1")
# (skip)
#value2:#key
# (or)
# (bind value2 key)
# (end)
# (output)
#value2:#value1
# (end)
#(end)
Run:
$ txr subst.txr
foo:adam
bar:eve
I want to search a txt file for duplicate lines, excluding [p] and the extension from the comparison. Once the equal lines are identified, show only the line that does not contain [p], with its extension. I have these lines in test.txt:
Peliculas/Desperados (2020)[p].mp4
Peliculas/La Duquesa (2008)[p].mp4
Peliculas/Nueva York Año 2012 (1975).mkv
Peliculas/Acoso en la noche (1980) .mkv
Peliculas/Angustia a Flor de Piel (1982).mkv
Peliculas/Desperados (2020).mkv
Peliculas/Angustia (1947).mkv
Peliculas/Días de radio (1987) BR1080[p].mp4
Peliculas/Mona Lisa (1986) BR1080[p].mp4
Peliculas/La decente (1970) FlixOle WEB-DL 1080p [Buzz][p].mp4
Peliculas/Mona Lisa (1986) BR1080.mkv
In this file, lines 1 and 6 and lines 9 and 11 are the same (without the extension and [p]). Output needed:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
I tried this, but it only shows the matching lines with the extension and the [p] pattern removed; I don't know which of the original lines is the right one, and I need the complete line:
sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d
Error output (missing extension):
Peliculas/Desperados (2020)
Peliculas/Mona Lisa (1986) BR1080
Because you mentioned bash...
Remove any line with a p:
cat test.txt | grep -v p
home/folder/house from earth.mkv
home/folder3/window 1.avi
Remove any line with [p]:
cat test.txt | grep -v '\[p\]'
home/folder/house from earth.mkv
home/folder3/window 1.avi
home/folder4/little mouse.mpg
Not likely your need, but just because: Remove [p] from every line, then dedupe:
cat test.txt | sed 's/\[p\]//g' | sort | uniq
home/folder/house from earth.mkv
home/folder/house from earth.mp4
home/folder2/test.mp4
home/folder3/window 1.avi
home/folder3/window 1.mp4
home/folder4/little mouse.mpg
If a 2-pass solution (which reads the test.txt file twice) is acceptable, would you please try:
declare -A ary # associate the filename with the base
while IFS= read -r file; do
    if [[ $file != *\[p\]* ]]; then     # the filename does not include "[p]"
        base="${file%.*}"               # remove the extension
        ary[$base]="$file"              # create a map
    fi
done < test.txt

while IFS= read -r base; do
    echo "${ary[$base]}"
done < <(sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d)
Output:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
In the 1st pass, it reads the file line by line and builds a map from the base (without the extension) to the filename (with the extension).
In the 2nd pass, it replaces each base printed by the sed pipeline with the mapped filename.
If you prefer a 1-pass solution (which will be faster), please try:
declare -A ary # associate the filename with the base
declare -A count # count the occurrences of the base
while IFS= read -r file; do
    base="${file%.*}"                   # remove the extension
    if [[ $base =~ (.*)\[p\](.*) ]]; then
        # "$base" contains the substring "[p]"
        (( count[${BASH_REMATCH[1]}${BASH_REMATCH[2]}]++ ))
        # increment the counter
    else
        (( count[$base]++ ))            # increment the counter
        ary[$base]="$file"              # map the filename
    fi
done < test.txt

for base in "${!ary[@]}"; do            # loop over the keys of ${ary[@]}
    if (( count[$base] > 1 )); then
        # it duplicates
        echo "${ary[$base]}"
    fi
done
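The same one-pass counting idea carries over to Python; a rough sketch (assuming the test.txt used above):
from collections import Counter

count = Counter()   # occurrences of each base, with "[p]" stripped
keep = {}           # base -> full filename, for the lines without "[p]"

with open('test.txt') as f:
    for line in f:
        file = line.rstrip('\n')
        base = file.rsplit('.', 1)[0]          # drop the extension
        count[base.replace('[p]', '')] += 1
        if '[p]' not in file:
            keep[base] = file

for base, file in keep.items():
    if count[base] > 1:                        # the base also exists with "[p]"
        print(file)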
In Python, you can sort the lines by a key made from the filename with any [p] removed and the extension stripped, then group them with itertools.groupby using the same key function.
For any groups of size 2 or more, any filenames not containing '[p]' are printed.
import itertools
import re

def make_key(line):
    return re.sub(r'\.[^.]*$', '', line.replace('[p]', ''))

with open('test.txt') as f:
    lines = [line.strip() for line in f]

# groupby only groups consecutive items, so sort by the same key first
for key, group in itertools.groupby(sorted(lines, key=make_key), make_key):
    files = list(group)
    if len(files) > 1:
        for file in files:
            if '[p]' not in file:
                print(file)
This gives:
home/folder/house from earth.mkv
home/folder3/window 1.avi
File 1:
1075908|2178412|brown_eyeshorty#att.net|Claude|Desmangles
175908|2178412|naim.kazi#webtv.net|Naim|Kazi
175972|212946872418|gil_maynard#hotmail.com|Munster|Herman
175972|212946872418|meghanj4#lycos.com|Meghan|Judge
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
176086|2480881|lourdsneil#gmail.com|Lourds|Herman
File 2:
89129090|Sadiq|Islam
212946872418|Anna|Balint
255875|Charlene|Johnson
89234902|Bob|Brown
09123789|Fabio|Vanetti
I would like to extract lines where ALL values match on the following basis:
Column 2 in File 1 matches with Column 1 in File 2.
Column 4 in File 1 matches with Column 2 in File 2.
Column 5 in File 1 matches with Column 3 in File 2.
The expected output for the example is:
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
The two inputs I'm working with are both very large (11 GB and 3 GB respectively).
The only potential (messy) workaround I can think of is to combine the values to be joined into a single additional column and then use Join (I'm very new to this).
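That composite-key idea can also be done directly in a short script: build a lookup keyed on the three columns of the smaller file and stream the larger one. A rough sketch (hypothetical file names; it assumes the key tuples from File 2 fit in memory, otherwise sort both files and use join as you suggest):
# Build the set of (id, first, last) keys from the smaller file (File 2)
keys = set()
with open('file2.txt') as f2:
    for line in f2:
        cols = line.rstrip('\n').split('|')
        if len(cols) >= 3:
            keys.add((cols[0], cols[1], cols[2]))

# Stream the larger file (File 1) and keep rows whose columns 2, 4 and 5 match
with open('file1.txt') as f1, open('matched.txt', 'w') as out:
    for line in f1:
        cols = line.rstrip('\n').split('|')
        if len(cols) >= 5 and (cols[1], cols[3], cols[4]) in keys:
            out.write(line)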
grep -f <(sed 's,|,|[^|]*|,' file2) file1
Returns
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
Explanations:
First command:
sed 's,|,|[^|]*|,' file2
Transforms file2 into a list of patterns to search for in file1 and returns:
89129090|[^|]*|Sadiq|Islam
212946872418|[^|]*|Anna|Balint
255875|[^|]*|Charlene|Johnson
89234902|[^|]*|Bob|Brown
09123789|[^|]*|Fabio|Vanetti
Second command:
grep -f <(command1) file1
Searches the patterns in file1.
Could you please try the following.
awk -F'|' '
FNR==NR{
a[$2,$4,$5]=(a[$2,$4,$5]?a[$2,$4,$5] ORS:"")$0
next
}
(($1,$2,$3) in a){
print a[$1,$2,$3]
}' Input_file1 Input_file2
Output will be as follows.
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
I have 2 files, and I want to get all lines from file2 (fsearch) that contain any given line from file1 (forig).
I wrote a simple python script that looks like this:
import re

matches = []

def search_string(w, file):
    global matches
    reg = re.compile((r'(^|^.*\|)' + w.strip("\r\n") + r'(\t|\|).*$'), re.M)
    match = reg.findall(file)
    matches.extend(match)

fsearch_text = fsearch.read()
for fword in forig:
    search_string(fword, fsearch_text)
There are about 100,000 lines in file1, and about 200,000 lines in file2, so my script takes about 6 hours to complete.
Is there a better algorithm to achieve the same goal in less time?
Edit:
I should have provided an example of why I need a regexp:
I am searching for a list of words from file1 and trying to match them to translations in file2. If I do not use a regexp to limit possible matches, I also match translations for words that merely contain the search word as a substring, for example:
Word I search: 浸し
Matched word: お浸し|御浸し|御したし &n boiled greens in bonito-flavoured soy sauce (vegetable side dish)
So I have to anchor the start of the match at either ^ or |, and the end at \t or |, but still capture the whole line.
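Given that structure, one way to avoid 100,000 separate regex scans is to index each file2 line by its candidate headwords and test them against a set of the search words. A rough sketch (hypothetical file names; it assumes the headwords of an entry sit before the first tab, separated by |):
# Load the search words from file1 into a set (hypothetical file names)
with open('forig.txt', encoding='utf-8') as f:
    words = {line.strip() for line in f}

# Scan file2 once; a line matches if any of its headwords is a search word
matches = []
with open('fsearch.txt', encoding='utf-8') as f:
    for line in f:
        headwords = line.split('\t', 1)[0].split('|')
        if any(w in words for w in headwords):
            matches.append(line.rstrip('\n'))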
Assuming both files fit in memory, you can read and sort them.
After that, you can compare the lines linearly.
f1 = open('e:\\temp\\file1.txt')
lines1 = sorted([line for line in f1])
f2 = open('e:\\temp\\file2.txt')
lines2 = sorted([line for line in f2])
i1 = 0
i2 = 0
matchCount = 0
while (i1 < len(lines1) and i2 < len(lines2)):
    line1 = lines1[i1]
    line2 = lines2[i2]
    if line1 < line2:
        i1 += 1
    elif line1 > line2:
        i2 += 1
    else:
        matchCount += 1
        i2 += 1
print('matchCount')
print(matchCount)
If it is possible for you to use UNIX/GNU/Linux commands you could do this:
# fill example files
for i in {1..100000}; do echo $RANDOM; done > file1.txt
for i in {1..200000}; do echo $RANDOM; done > file2.txt
# get every line of file2.txt which is also in file1.txt
# for static string matching:
grep -F -x -f file1.txt file2.txt
# for regex matching use (regular expressions in file1.txt):
grep -f file1.txt file2.txt
grep is optimized for such operations, so the above call takes less than a second.
I need to identify duplicates in column A of CSV1 against column A of CSV2. If a duplicate first name is identified, the entire row from CSV2 needs to be copied to a new CSV3. Can somebody help in Python?
CSV1
Adam
Eve
John
George
CSV2
Steve
Mark
Adam Smith
John Smith
CSV3
Adam Smith
John Smith
Here is a quick answer. It's O(n^2) with n the number of lines in your CSV, and it assumes two CSVs of equal length. If you need an O(n) solution (clearly optimal), then let me know; the trick there would be building a set of the elements of column A of csv1 (see the sketch after the code below).
lines1 = open('csv1.txt').read().split('\n')
delim = ', '
fields1 = [line.split(delim) for line in lines1]
lines2 = open('csv2.txt').read().split('\n')
fields2 = [line.split(delim) for line in lines2]
duplicates = []
for line1 in fields1:
    for line2 in fields2:
        if line1[0] == line2[0]:
            duplicates.append(line2)
print duplicates
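The O(n) variant mentioned above could look roughly like this (a sketch, reusing the delim, lines1 and fields2 defined above, under the same assumptions about the file layout):
# Collect the column-A values of csv1 into a set for O(1) membership tests
names = set(line.split(delim)[0] for line in lines1 if line)

# Keep every csv2 row whose column A appears in that set
duplicates = [fields for fields in fields2 if fields and fields[0] in names]
print(duplicates)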
Using any of the 3 one-liners:
Option 1: Parse file1 in the BEGIN Block
perl -lane 'BEGIN {$csv2 = pop; $seen{(split)[0]}++ while <>; @ARGV = $csv2 } print if $seen{$F[0]}' csv1 csv2
Option 2: Using a ternary
perl -lane 'BEGIN {($csv1) = @ARGV } $ARGV eq $csv1 ? $seen{$F[0]}++ : ($seen{$F[0]} && print)' csv1 csv2
Option 3: Using a single if
perl -lane 'BEGIN {($csv1) = @ARGV } print if $seen{$F[0]} += $ARGV eq $csv1 and $ARGV ne $csv1' csv1 csv2
Explanation:
Switches:
-l: Enable line ending processing
-a: Splits the line on whitespace and loads the fields into the array @F
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
A clean and Pythonic way to solve your problem:
words_a = set([])
words_b = set([])
with open('csv1') as a:
    words_a = set([w.strip()
                   for l in a.readlines()
                   for w in l.split(" ")
                   if w.strip()])
with open('csv2') as b:
    words_b = set([w.strip()
                   for l in b.readlines()
                   for w in l.split(" ")
                   if w.strip()])
with open('csv3','w') as wf:
    for w in words_a.intersection(words_b):
        wf.write(w)
        wf.write('\n')
I am new to Python and I have searched a few articles but cannot find the correct syntax to read a file and do awk-like line processing in Python. I need your help in solving this problem.
This is how my bash script for build and deploy looks; I read a configuration file in bash which looks like this:
backup /apps/backup
oracle /opt/qosmon/qostool/oracle oracle-client-12.1.0.1.0
and the bash section that reads it looks like this:
while read line
do
    case "$line" in */package*) continue ;; esac
    host_file_array+=("$line")
done < ${HOST_FILE}

for ((i=0 ; i < ${#host_file_array[*]}; i++))
do
    # echo "${host_file_array[i]}"
    host_file_line="${host_file_array[i]}"
    if [[ "$host_file_line" != "#"* ]];
    then
        COMPONENT_NAME=$(echo $host_file_line | awk '{print $1;}' )
        DIRECTORY=$(echo $host_file_line | awk '{print $2;}' )
        VERSION=$(echo $host_file_line | awk '{print $3;}' )
        if [[ ("${COMPONENT_NAME}" == *"oracle"*) ]];
        then
            print_parameters "Status ${DIRECTORY}/${COMPONENT_NAME}"
            /bin/bash ${DIRECTORY}/${COMPONENT_NAME}/current/script/manage-oracle.sh ${FORMAT_STRING} start
        fi
etc .........
How can the same be converted to Python? This is what I have prepared so far in Python:
f = open(host_file, "r")
array = []
line = f.readline()
index = 0
while line:
    line = line.strip("\n ' '")
    line = line.split()
    array.append([])
    for item in line:
        array[index].append(item)
    line = f.readline()
    index += 1
f.close()
I tried splitting in Python, but since the config file does not have an equal number of columns in every row, I get an index out of range error. What is the best way to process it?
I think dictionaries might be a good fit here, you can generate them as follows:
>>> result = []
>>> keys = ["COMPONENT_NAME", "DIRECTORY", "VERSION"]
>>> with open(hosts_file) as f:
...     for line in f:
...         result.append(dict(zip(keys, line.strip().split())))
...
>>> result
[{'DIRECTORY': '/apps/backup', 'COMPONENT_NAME': 'backup'},
{'DIRECTORY': '/opt/qosmon/qostool/oracle', 'VERSION': 'oracle-client-12.1.0.1.0', 'COMPONENT_NAME': 'oracle'}]
As you see this creates a list of dictionaries. Now when you're accessing the dictionaries, you know that some of them might not contain a 'VERSION' key. There are multiple ways of handling this. Either you try/except KeyError or get the value using dict.get().
Example:
>>> for r in result:
...     print r.get('VERSION', "No version")
...
...
No version
oracle-client-12.1.0.1.0
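The try/except route mentioned above would look roughly like this (a sketch, using the same result list):
for r in result:
    try:
        version = r['VERSION']
    except KeyError:
        version = "No version"
    print(version)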
result = [line.strip().split() for line in open(host_file)]