Compare 2 csv or excel files for string matches - python

I want to compare sub-strings between columns of 2 different files.
Below are the sample input and the expected output.
Input File1.csv:
Amar,18
Akbar,20
Anthony,21
Input File2.csv:
Mr. Amar Khanna, Tennis
Mr. Anthony Rao, Cricket
Federer, Badminton
Expected Output File3.csv:
Amar,18,Tennis
Anthony,21,Cricket
I am trying to do it using shell scripting. These are the solutions I have tried so far to find the matches between the two files:
diff file1 file2
This did not work, as it compares the files for whole-line matches rather than sub-strings.
grep -f file1 file2
This did not work either, for the same reason.
awk 'FNR==NR{a[substr($1,5,8)];next} substr($1,5,8) in a' excel1.csv excel2.csv
This does not produce any output.
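For reference, a minimal Python sketch of one approach: build a first-name-to-age lookup from File1.csv, then scan the name field of each File2.csv row word by word for a known first name. It assumes the files look exactly like the samples above (two fields per row, no headers, names in File2 of the form "Mr. First Last"):
import csv

# Build a lookup of first name -> age from File1.csv
# (assumes two columns per row, e.g. "Amar,18").
ages = {}
with open("File1.csv", newline="") as f1:
    for name, age in csv.reader(f1):
        ages[name.strip()] = age.strip()

# For each File2.csv row ("Mr. Amar Khanna, Tennis"), check whether
# any word of the name field is a known first name and join on it.
with open("File2.csv", newline="") as f2, \
        open("File3.csv", "w", newline="") as f3:
    writer = csv.writer(f3)
    for name_field, sport in csv.reader(f2):
        for word in name_field.split():
            if word in ages:
                writer.writerow([word, ages[word], sport.strip()])
                break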

Python or Bash - How do you add words listed in a file in the middle of two sentences and put the output into another file?

I don't know how to take the first word ('set'), add a space, insert a name, then add another space before appending the last word ('username' or 'admin') from 'file2', for each name listed in 'file1'.
file1 = [/home/smith/file1.txt]
file2 = [/home/smith/file2.txt]
file3 = file1 + file2
Example:
[file1 - Names]
smith
jerry
summer
aaron
[file2 - Sentences]
set username
set admin
[file3 - Output]
set smith username
set smith admin
set jerry username
set jerry admin
set summer username
set summer admin
set aaron username
set aaron admin
Can you be more specific about your problem? And have you already tried something? If that is the case, please share it.
The way I see it, you can open file2, read every line, and split the two words on the space (adding them to a list, for example). Then you can create a new string for every pair of words in that list: loop over every line in file1, and for each line take the first word from file2, add a space, add the actual line from file1, then add another space and the second word from file2.
You now have a new string which you can append to a new file. You should probably append that string to the new file in the same loop where you created it.
But then again, I'm not sure if this is an answer to your question.
Try this one in Bash; it answers your question:
#!/bin/bash
file1=".../f1.txt"
file2=".../f2.txt"
file3=".../f3.txt"

while read -r p1; do            # each name in file1
    while read -r p2; do        # each "set ..." line in file2
        word1=$(echo "$p2" | cut -f1 -d' ')
        word2=$(echo "$p2" | cut -f2 -d' ')
        echo "$word1 $p1 $word2" >> "$file3"
    done < "$file2"
done < "$file1"
Something like this, perhaps:
names = open("/home/smith/file1.txt").read().splitlines()
commands = open("/home/smith/file2.txt").read().splitlines()
res = []
for name in names:
    for command in commands:
        parts = command.split(" ")  # e.g. ["set", "username"]
        res.append(" ".join([parts[0], name, parts[1]]))
open("/home/smith/file3.txt", "w").write("\n".join(res))
I'm sure this is not the prettiest way, but it should work. But why do you want to do something like this...?
Yet another solution, using standard utilities only:
join -1 2 -2 3 file1 file2 | awk '{printf "%s %s %s\n", $2, $1, $3}' > file3
Joining on a field that neither file actually has makes every join key empty, so every line of file1 is paired with every line of file2 (a cross product); awk then reorders the fields into the desired "set name username" form.

Match Multiple Columns In Two Files - Output Only Those That Match Fully

File 1:
1075908|2178412|brown_eyeshorty#att.net|Claude|Desmangles
175908|2178412|naim.kazi#webtv.net|Naim|Kazi
175972|212946872418|gil_maynard#hotmail.com|Munster|Herman
175972|212946872418|meghanj4#lycos.com|Meghan|Judge
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
176086|2480881|lourdsneil#gmail.com|Lourds|Herman
File 2:
89129090|Sadiq|Islam
212946872418|Anna|Balint
255875|Charlene|Johnson
89234902|Bob|Brown
09123789|Fabio|Vanetti
I would like to extract lines where ALL values match on the following basis:
Column 2 in File 1 matches with Column 1 in File 2.
Column 4 in File 1 matches with Column 2 in File 2.
Column 5 in File 1 matches with Column 3 in File 2.
The expected output for the example is:
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
The two inputs I'm working with are both very large (11 GB and 3 GB respectively).
The only potential (messy) workaround I can think of is to combine the values to be joined into a single additional column and then use join (I'm very new to this).
grep -f <(sed 's,|,|[^|]*|,' file2) file1
Returns
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
Explanation:
The first command,
sed 's,|,|[^|]*|,' file2
transforms file2 into a list of patterns to search for in file1, and returns:
89129090|[^|]*|Sadiq|Islam
212946872418|[^|]*|Anna|Balint
255875|[^|]*|Charlene|Johnson
89234902|[^|]*|Bob|Brown
09123789|[^|]*|Fabio|Vanetti
The second command,
grep -f <(command1) file1
searches for those patterns in file1.
Could you please try the following.
awk -F'|' '
FNR==NR{
  # store each line of Input_file1, keyed by columns 2, 4 and 5;
  # append with ORS when the same key occurs more than once
  a[$2,$4,$5]=(a[$2,$4,$5]?a[$2,$4,$5] ORS:"")$0
  next
}
(($1,$2,$3) in a){
  # for each line of Input_file2, print all stored matches
  print a[$1,$2,$3]
}' Input_file1 Input_file2
Output will be as follows.
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
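Since both answers hold one complete file in memory, and the inputs here are 11 GB and 3 GB, here is a hedged Python sketch of the same join that keeps only the smaller File 2's keys in memory and streams File 1 (the filenames are placeholders):
# Build a set of (col1, col2, col3) keys from the smaller file,
# then stream the larger file and keep rows whose (col2, col4, col5) match.
keys = set()
with open("file2") as f2:
    for line in f2:
        cols = line.rstrip("\n").split("|")
        keys.add((cols[0], cols[1], cols[2]))

with open("file1") as f1, open("file3", "w") as out:
    for line in f1:
        cols = line.rstrip("\n").split("|")
        # File 1 columns 2, 4, 5 must match File 2 columns 1, 2, 3.
        if (cols[1], cols[3], cols[4]) in keys:
            out.write(line)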

python - Issue in processing files with big size

Basically, I wanted to create a Python script for my daily tasks to compare two files of any size and generate two new files: one with the matching records and one with the non-matching records from both files.
I have written the Python script below and found it works properly for files with a few records.
But when I execute the same script on files with 200,000 and 500,000 records, the resulting files do not contain valid output.
So, can you check the script below and help identify the issue causing the wrong output?
Thanks in advance.
from sys import argv

script, filePathName1, filePathName2 = argv

def FileDifference(filePathName1, filePathName2):
    fileObject1 = open(filePathName1, 'r')
    fileObject2 = open(filePathName2, 'r')
    newFilePathName1 = filePathName1 + ' - NonMatchingRecords.txt'
    newFilePathName2 = filePathName1 + ' - MatchingRecords.txt'
    newFileObject1 = open(newFilePathName1, 'a')
    newFileObject2 = open(newFilePathName2, 'a')
    file1 = fileObject1.readlines()
    file2 = fileObject2.readlines()
    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0, len(Differece)):
        newFileObject1.write(Differece[i])
    Matching = [ match for match in file1 if match in file2 ]
    for j in range(0, len(Matching)):
        newFileObject2.write(Matching[j])
    fileObject1.close()
    fileObject2.close()
    newFileObject1.close()
    newFileObject2.close()

FileDifference(filePathName1, filePathName2)
Edit-1: Please note that the above program executes without any error. It's just that the output is incorrect and the program takes much longer to finish on large files.
I'll take a wild guess and assume that "no valid output" means: "runs forever and does nothing useful".
Which would be logical because of your list comprehensions:
Differece = [ diff for diff in file1 if diff not in file2 ]
for i in range(0, len(Differece)):
    newFileObject1.write(Differece[i])

Matching = [ match for match in file1 if match in file2 ]
for i in range(0, len(Matching)):
    newFileObject2.write(Matching[i])
Each `in` / `not in` test performs an O(n) scan of file2, which is okay on a small number of lines but never ends if, say, len(file1) == 100000 and file2 is the same size. You then perform 100000*100000 iterations => 10**10 => forever.
The fix is simple: create sets and use intersection and difference, which are much faster.
file1 = set(fileObject1.readlines())
file2 = set(fileObject2.readlines())

difference = file1 - file2
for i in difference:
    newFileObject1.write(i)

matching = file1 & file2
for i in matching:
    newFileObject2.write(i)
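Putting it together, a minimal end-to-end sketch of the set-based fix (one caveat worth noting: sets drop duplicate lines and do not preserve the input order):
import sys

def file_difference(path1, path2):
    # Load both files into sets for O(1) average membership tests.
    with open(path1) as f1, open(path2) as f2:
        lines1 = set(f1)
        lines2 = set(f2)
    with open(path1 + ' - NonMatchingRecords.txt', 'w') as out:
        out.writelines(lines1 - lines2)
    with open(path1 + ' - MatchingRecords.txt', 'w') as out:
        out.writelines(lines1 & lines2)

if __name__ == '__main__':
    file_difference(sys.argv[1], sys.argv[2])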

Comparing two files and display a matrix with 0 and 1 for not present or present

I'm new to Python, so I'm really struggling to write a script.
What I need is to make a comparison between two files. One file contains all the proteins of some database; the other contains only some of those proteins, the ones present in a particular organism. So I need to know which proteins of the database are present in my organism. For that I want to build an output like a matrix, with 0 and 1 indicating, for every protein in the database, whether it is present in my organism.
Does anybody have any idea of how I could do that?
I'm trying to use something like this
$ cat sorted.a
A
B
C
D
$ cat sorted.b
A
D
$ join sorted.a sorted.b | sed 's/^/1 /' && join -v 1 sorted.a sorted.b | sed 's/^/0 /'
1 A
1 D
0 B
0 C
But I'm not able to use it, because sometimes a protein is present but not on the same line.
Here is an example:
1-cysPrx_C
120_Rick_ant
14-03-2003
2-Hacid_dh
2-Hacid_dh_C
2-oxoacid_dh
2-ph_phosp
2CSK_N
2C_adapt
2Fe-2S_Ferredox
2H-phosphodiest
2HCT
2OG-FeII_Oxy
Comparing with
1-cysPrx_C
14-3-3
2-Hacid_dh
2-Hacid_dh_C
2-oxoacid_dh
2H-phosphodiest
2OG-FeII_Oxy
2OG-FeII_Oxy_3
2OG-FeII_Oxy_4
2OG-FeII_Oxy_5
2OG-Fe_Oxy_2
2TM
2_5_RNA_ligase2
Does anyone have an idea of how I could do that?
Thanks so far.
The fastest way in Python would be to read your organism file and save each protein name to a set. Then open and iterate through your all-proteins file; for each name, check whether it is present in your organism set and print a 0 or 1 accordingly.
Example code, if your organism set is called prot_list:
with open(all_proteins_file) as f:
    for line in f:
        prot = line.strip()
        if prot in prot_list:
            num = 1
        else:
            num = 0
        print('%i %s' % (num, prot))
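For completeness, a minimal sketch of the step the answer assumes has already happened: building prot_list as a set from the organism file (the filename here is a placeholder). A set makes each membership test O(1) on average, which matters when the database file is large:
# Hypothetical filename: one protein name per line.
with open("organism_proteins.txt") as f:
    prot_list = {line.strip() for line in f if line.strip()}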

Python: Use value from file a to search for lines in another file

Newbie question
I have 2 files
File A: File with a list of items (apple, pears, oranges)
File B: File with all fruit in the world (1,000,000 lines)
In Unix I would grep each fruit from file B and append all results:
1. grep apple from file B >> fruitfound.txt
2. grep pears from file B >> fruitfound.txt
3. grep oranges from file B >> fruitfound.txt
I want a Python script that uses the values from file A to search file B and then writes out the output. NOTE: file B would have green apple, red apple and yellow apple, and I would want to write all 3 results to fruitfound.txt.
Kindest Regards
Kornity
grep -f $patterns $filename does exactly that; there is no need for a Python script.
To find lines that contain any of the given keywords in Python, you could use a regex:
import re

def fgrep(words, lines):
    # note: allows a partial match, e.g. 'b c' matches 'ab cd'
    # (in Python 3 the built-in filter is already lazy)
    return filter(re.compile("|".join(map(re.escape, words))).search, lines)
To turn it into a command-line script:
import sys

def main():
    with open(sys.argv[1]) as kwfile:  # read keywords from given file
        # one keyword per line
        keywords = [line.strip() for line in kwfile if line.strip()]
    if not keywords:
        sys.exit("no keywords are given")
    if len(sys.argv) > 2:  # read lines to match from given file
        with open(sys.argv[2]) as file:
            sys.stdout.writelines(fgrep(keywords, file))
    else:  # read lines from stdin
        sys.stdout.writelines(fgrep(keywords, sys.stdin))

main()
Example:
$ python fgrep.py a b > fruitfound.txt
There are more efficient algorithms, e.g. the Aho-Corasick algorithm, but it takes less than a second to filter millions of lines on my machine and it might be good enough (grep is several times faster). Surprisingly, acora, which is based on the Aho-Corasick algorithm, is slower for the data I've tried.
