Merge 3 text files with Python

I'm really new to programming and couldn't find a satisfying answer so far. I'm using Python and I want to merge three text files to get all possible word combinations. I have 3 files:
First file:
line1
line2
line3
Second file(prefix):
pretext1
pretext2
pretext3
Third file(suffix):
suftext1
suftext2
suftext3
I already used .read() and have my variables containing the list for each text file. Now I want to write a function to merge these 3 files into 1, and it should look like this:
outputfile:
pretext1 line1 suftext1 #this is ONE line(str)
pretext2 line1 suftext1
pretext3 line1 suftext1
pretext1 line1 suftext2
pretext1 line1 suftext3
and so on, you get the idea
I want all possible combinations in 1 textfile as output. I guess I have to use a loop within a loop?!

Here it is, if I got your question right.
First you have to change into the correct folder with the os package.
import os
os.chdir("The_path_of_the_folder_containing_the_files")
Then you open your three files, and put the words into lists:
file_1 = open("file_1.txt")
file_1 = file_1.read()
file_1 = file_1.split("\n")
file_2 = open("file_2.txt")
file_2 = file_2.read()
file_2 = file_2.split("\n")
file_3 = open("file_3.txt")
file_3 = file_3.read()
file_3 = file_3.split("\n")
You create the text you want in your output file with loops:
text_output = ""
for i in range(len(file_2)):
    for j in range(len(file_1)):
        for k in range(len(file_3)):
            text_output += file_2[i] + " " + file_1[j] + " " + file_3[k] + "\n"
And you enter that text into your output file (if that file does not exist, it will be created).
file_output = open("file_output.txt","w")
file_output.write(text_output)
file_output.close()

While the existing answer may be correct, I think this is a case where bringing in a library function is definitely the way to go.
import itertools
with open('lines.txt') as line_file, open('pretext.txt') as prefix_file, open('suftext.txt') as suffix_file:
    lines = [l.strip() for l in line_file.readlines()]
    prefixes = [p.strip() for p in prefix_file.readlines()]
    suffixes = [s.strip() for s in suffix_file.readlines()]
# x is (line, prefix, suffix), so x[1], x[0], x[2] gives "prefix line suffix"
combos = [('%s %s %s' % (x[1], x[0], x[2]))
          for x in itertools.product(lines, prefixes, suffixes)]
for c in combos:
    print(c)
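If the combinations should end up in a file rather than on the screen, the same idea could be sketched roughly as follows. This is only an illustration: the helper names `combine` and `write_combinations` are made up for this example and are not part of either answer, and the sample words are taken from the question.

```python
import itertools


def combine(prefixes, lines, suffixes):
    """Return every 'prefix line suffix' string (hypothetical helper)."""
    return [
        "%s %s %s" % (p, l, s)
        for p, l, s in itertools.product(prefixes, lines, suffixes)
    ]


def write_combinations(path, prefixes, lines, suffixes):
    """Write one combination per line to path (hypothetical helper)."""
    with open(path, "w") as out:
        for combo in combine(prefixes, lines, suffixes):
            out.write(combo + "\n")


# Small in-memory demonstration using words from the question.
combos = combine(["pretext1", "pretext2"], ["line1"], ["suftext1", "suftext2"])
print(len(combos))  # 2 prefixes * 1 line * 2 suffixes = 4
```

itertools.product varies its last argument fastest, so reordering the arguments changes the order of the output lines but not the set of combinations.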

Related

Read lines in one file and find all strings starting with 4-letter strings listed in another txt file

I have 2 txt files (a and b).
file_a.txt contains a long list of 4-letter combinations (one combination per line):
aaaa
bcsg
aacd
gdee
aadw
hwer
etc.
file_b.txt contains a list of letter combinations of various length (some with spaces):
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
etc.
I am looking for a python script that would allow me to do the following:
read file_a.txt line by line
take each 4-letter combination (e.g. aaai)
read file_b.txt and find all the various-length letter combinations starting with the 4-letter combination (eg. aaaibjkes, aaailoiersaaageehikjaaa, aaailoiuwegoiglkjaaaike etc.)
print the results of each search in a separate txt file named with the 4-letter combination.
File aaai.txt:
aaaibjkes
aaailoiersaaageehikjaaa
aaailoiuwegoiglkjaaake
etc.
File bcsi.txt:
bcspwiopiejowih
bcsiweyoieotpwe
etc.
I'm sorry I'm a newbie. Can someone point me in the right direction, please. So far I've got only:
#I presume I will have to use regex at some point
import re
file1 = open('file_a.txt', 'r').readlines()
file2 = open('file_b.txt', 'r').readlines()
#Should I look into findall()?
I hope this helps you:
file1 = open('file_a.txt', 'r')
file2 = open('file_b.txt', 'r')
# get every item in your second file into a list
mylist = file2.readlines()
file2.close()
# read each line in the first file (iterate directly, so no line is skipped)
for line in file1:
    searchStr = line.strip()
    if not searchStr:
        continue
    # find the lines in your second file that start with this string
    exists = [s for s in mylist if s.startswith(searchStr)]
    if exists:
        # if there are matches, create a file named after the search string
        fileNew = open(searchStr + '.txt', 'w')
        for match in exists:
            fileNew.write(match)
        fileNew.close()
file1.close()
What you can do is to open both files and run both files down line by line using for loops.
You can have two for loops, the first one reading file_a.txt as you will be reading through it only once. The second will read through file_b.txt and look for the string at the start.
To do so, you will have to use .find() to search for the string. Since it is at the start, the value should be 0.
file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")
for a_line in file_a:
    # This result value will be written into your new file
    result = ""
    # This is what we will search with
    search_val = a_line.strip("\n")
    print("---- Using " + search_val + " from file_a to search. ----")
    for b_line in file_b:
        print("Searching file_b using " + b_line.strip("\n"))
        if b_line.strip("\n").find(search_val) == 0:
            result += (b_line)
    print("---- Search ended ----")
    # Set the read pointer to the start of the file again
    file_b.seek(0, 0)
    if result:
        # Write the contents of "result" into a file with the name of "search_val"
        with open(search_val + ".txt", "a") as f:
            f.write(result)
file_a.close()
file_b.close()
Test Cases:
I am using the test cases in your question:
file_a.txt
aaaa
bcsg
aacd
gdee
aadw
hwer
file_b.txt
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
The program produces an output file bcsg.txt as it is supposed to with bcsgiweyoieotpwe inside.
Try this:
f1 = open("a.txt","r").readlines()
f2 = open("b.txt","r").readlines()
file1 = [word.replace("\n","") for word in f1]
file2 = [word.replace("\n","") for word in f2]
data = []
data_dict = {}
for short_word in file1:
    data += ([[short_word,w] for w in file2 if w.startswith(short_word)])
for single_data in data:
    if single_data[0] in data_dict:
        data_dict[single_data[0]].append(single_data[1])
    else:
        data_dict[single_data[0]]=[single_data[1]]
# .items() works on both Python 2 and Python 3
for key,val in data_dict.items():
    with open(key+".txt","w") as out:
        out.write("\n".join(val))
    print(key + ".txt created")
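The grouping step that all three answers share can be separated from the file handling, which makes it easier to check against the sample data. The following is a rough sketch under that assumption; `group_by_prefix` is a name invented for this example, and the lists are copied from the question.

```python
from collections import defaultdict


def group_by_prefix(prefixes, words):
    """Map each prefix to the words that start with it (hypothetical helper).

    Prefixes without any matching word are left out of the result.
    """
    groups = defaultdict(list)
    for word in words:
        for prefix in prefixes:
            # Skip empty prefixes: every string starts with "".
            if prefix and word.startswith(prefix):
                groups[prefix].append(word)
    return dict(groups)


prefixes = ["aaaa", "bcsg", "aacd", "gdee", "aadw", "hwer"]
words = [
    "aaaibjkes", "aaleoslk", "abaaaalkjel", "bcsgiweyoieotpwe",
    "csseiolskj", "gaelsi asdas", "aaaloiersaaageehikjaaa",
    "hwesdaaadf wiibhuehu", "bcspwiopiejowih", "gdeaes",
    "aaailoiuwegoiglkjaaake",
]
groups = group_by_prefix(prefixes, words)
print(groups)
```

With the sample data only bcsg has a match, which agrees with the single bcsg.txt file reported in the test case above. Writing each group to a file named after its key is then one loop over groups.items().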

Python find matches and count hits

I have a code to go through text files in a folder and look for specific word matches and count those. For example in file 1.txt I have word 'one' mentioned two times. So, my output should be:
1.txt | 2
print >> out, paper + "|" + str(hit_count)
Does not return me anything. Maybe str(hit_count) is not the right variable to print?
Any advice? Thanks.
for word in text:
    if re.match("(.*)(one|two)(.*)", word):
        hit_count = hit_count + 1
print >> out, paper + "|" + str(hit_count)
If I understand what you are trying to do, you don't really need a regex.
import glob
# glob.glob the directory to get a list of files - you didn't specify
for fname in file_list:
    with open(fname,'r') as f:
        # if files are very long consider line by line
        # for line in f:
        file_content = f.read()
    count = file_content.count('one')
    print('{0} | {1}'.format(fname, count))
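One caveat about str.count is that it counts substrings, so 'one' is also found inside 'none' or 'money'. If only whole words should count, a regex with word boundaries is one option. The sketch below is an illustration only; `count_hits` and the sample text are made up for this example.

```python
import re


def count_hits(text, words=("one", "two")):
    """Count whole-word occurrences of any of the given words (hypothetical helper)."""
    pattern = r"\b(?:%s)\b" % "|".join(re.escape(w) for w in words)
    return len(re.findall(pattern, text))


sample = "one two\nnone of these\none more"
# 'none' is skipped because \b requires a word boundary before 'one'.
print("1.txt | %d" % count_hits(sample, ("one",)))
```

For comparison, sample.count('one') on the same text also counts the 'one' inside 'none'.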

How can I compare files quicker in Python?

Is there any way to make this script faster? I'm comparing one file against another and printing the lines whose rs identifier column (index 2 in the code below, $3 in awk) matches.
import csv
output =[]
a = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf', 'r')
list1 = a.readlines()
reader1 = a.read()
b = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf', 'r')
list2 = b.readlines()
reader2 = b.read()
f3 = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf', 'w')
for line1 in list1:
    separar = line1.split("\t")
    gene = separar[2]
    for line2 in list2:
        separar2 = line2.split("\t")
        gene2 = separar2[2]
        if gene == gene2:
            print line1
            f3.write(line1)
Input example (for both files):
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
The command line below works equally for same purpose in bash:
awk 'FNR==NR {a[$3]; next} $3 in a' Neandertais.vcf Phase1_missing.vcf > teste.vcf
How can I improve this Python script?
If you store your lines in dictionaries that are keyed by the column that you are interested in, you can easily use Python's built-in set functions (which run at C speed) to find the matching lines. I tested a slightly modified version of this (filenames changed, and changed split('\t') to split() because of stackoverflow formatting) and it seems to work fine:
import collections
# Use 'rb' to open files
infn1 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf'
infn2 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf'
outfn = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf'
def readfile(fname):
    '''
    Read in a file and return a dictionary of lines, keyed by the item in the third column (index 2)
    '''
    results = collections.defaultdict(list)
    # Read in binary mode -- it's quicker
    # (written for Python 2; on Python 3 binary mode yields bytes, so split on b"\t")
    with open(fname, 'rb') as f:
        for line in f:
            parts = line.split("\t")
            # Skip blank or short lines that have no third column
            if len(parts) < 3:
                continue
            gene = parts[2]
            results[gene].append(line)
    return results
dict1 = readfile(infn1)
dict2 = readfile(infn2)
with open(outfn, 'wb') as outf:
    # Find keys that appear in both files
    for key in set(dict1) & set(dict2):
        # For these keys, print all the matching
        # lines in the first file
        for line in dict1[key]:
            print(line.rstrip())
            outf.write(line)
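A variant that stays closer to the awk one-liner is to collect only the keys of the second file in a set and then filter the first file in its original order. The following is a sketch of that idea, not the answerer's code: `filter_by_key` is an invented name, the sample lines are shortened, the second identifier is made up, and it splits on any whitespace rather than strictly on tabs.

```python
def filter_by_key(lines, reference_lines, column=2):
    """Keep the lines whose field at `column` also occurs in reference_lines.

    Hypothetical helper; fields are split on whitespace (an assumption).
    """
    keys = set()
    for line in reference_lines:
        parts = line.split()
        if len(parts) > column:
            keys.add(parts[column])
    kept = []
    for line in lines:
        parts = line.split()
        # Set membership is a hash lookup, so this is one pass per file.
        if len(parts) > column and parts[column] in keys:
            kept.append(line)
    return kept


# Shortened sample lines; rs0000001 is an invented identifier.
phase1 = ["1 14107321 rs187821037 C T\n", "1 14107400 rs0000001 G A\n"]
neandertal = ["1 14107321 rs187821037 C T\n"]
matches = filter_by_key(phase1, neandertal)
print(matches)
```

Unlike iterating over a set intersection, this keeps the matching lines in the order they appear in the first file.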

Intersecting regions of two files and printing combined result

I have two big files. I want to find common names in column 1 and column 2 of file1 and file2, respectively. The script below does it. Problem: I want to print also corresponding data from file1 in output, but it does not work. How to fix it?
file1.txt
GRMZM5G888627_P01 GO:0003674 molecular_function
GRMZM5G888620_P01 GO:0008150 biological_process
GRMZM5G888625_P03 GO:0008152 metabolic process
file2.txt
contig1 GRMZM5G888627_P01
contig2 AT2G41790.1
contig3 GRMZM5G888625_P03
Desired output,
contig1 GRMZM5G888627_P01 GO:0003674 molecular_function
contig3 GRMZM5G888625_P03 GO:0008152 metabolic process
Script,
f1=open('file1.txt','r')
f2=open('file2.txt','r')
output = open('result.txt','w')
dictA= dict()
for line1 in f1:
    listA = line1.rstrip('\n').split('\t')
    dictA[listA[0]] = listA
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    new_list.append(query)
    new_list.append(subject)
    if subject in dictA:
        output.writelines(query+'\t'+subject+'\t'+str(listA[1])+str(listA[2])+'\n')
output.close()
Inside the
for line1 in f2:
loop, listA isn't going to be mapped to the associated f2 line; it still holds the last line read from f1. You stored the lines in dictA.
Once you test if the subject is in dictA, you need to retrieve the proper listA:
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    new_list.append(query)
    new_list.append(subject)
    if subject in dictA:
        listA = dictA[subject]
        output.writelines(query+'\t'+subject+'\t'+str(listA[1])+str(listA[2])+'\n')
output.close()
I don't understand why you are appending to new_list in here:
query=new_list[0]
subject=new_list[1]
new_list.append(query)
new_list.append(subject)
When processing the first line, you read in:
contig1 GRMZM5G888627_P01
Into new_list, giving you essentially:
new_list == ['contig1', 'GRMZM5G888627_P01']
Then you set query and subject to the two items in the list. Then append them back onto it, giving you:
new_list == ['contig1', 'GRMZM5G888627_P01', 'contig1', 'GRMZM5G888627_P01']
Which you never use. You should be able to just have:
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    subject=new_list[1]
    if subject in dictA:
        listA = dictA[subject]
        output.writelines(new_list[0] + '\t' + subject + '\t' + str(listA[1]) + str(listA[2]) + '\n')
output.close()
Also, you are only writing one line at a time, so output.write is fine. Building strings by repeated addition is usually discouraged, so I replaced it with format. Your listA already stores strings, so I eliminated the str() calls.
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    subject=new_list[1]
    if subject in dictA:
        listA = dictA[subject]
        output.write("{}\t{}\t{}{}\n".format(new_list[0], subject, listA[1], listA[2]))
output.close()
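Putting the pieces of this answer together, the whole lookup can be sketched as one small function that works on lists of lines. This is an illustration rather than the answerer's code: `join_annotations` is a made-up name, the sample rows come from the question, and the sketch assumes the columns are tab-separated and that the annotation fields should be joined with tabs in the output.

```python
def join_annotations(annotation_lines, contig_lines):
    """Attach the file1 annotation fields to each matching file2 row.

    Hypothetical helper; assumes tab-separated columns in both inputs.
    """
    annotations = {}
    for line in annotation_lines:
        fields = line.rstrip("\n").split("\t")
        annotations[fields[0]] = fields[1:]
    joined = []
    for line in contig_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        query, subject = fields[0], fields[1]
        if subject in annotations:
            joined.append("\t".join([query, subject] + annotations[subject]))
    return joined


file1 = [
    "GRMZM5G888627_P01\tGO:0003674\tmolecular_function\n",
    "GRMZM5G888620_P01\tGO:0008150\tbiological_process\n",
    "GRMZM5G888625_P03\tGO:0008152\tmetabolic process\n",
]
file2 = [
    "contig1\tGRMZM5G888627_P01\n",
    "contig2\tAT2G41790.1\n",
    "contig3\tGRMZM5G888625_P03\n",
]
rows = join_annotations(file1, file2)
for row in rows:
    print(row)
```

Keeping the lookup in a dictionary keyed by the first column of file1 means each row of file2 needs only one hash lookup, which matters when both files are big.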
try this,
ins = open('file1.txt', "r" )
lookup = {}
for line in ins:
    arrayline = line.split()
    if arrayline:
        lookup[arrayline[0]] = '\t'.join(arrayline)
ins.close()
file2 = open('file2.txt', "r" )
output = open('result.txt','w')
for line in file2:
    array2 = line.split()
    try:
        v = lookup[array2[1]]
        output.write(array2[0]+'\t'+v+'\n')
    except (KeyError, IndexError):
        # no match in file1, or a line with fewer than two columns
        pass
file2.close()
output.close()
use sets
In [1]: list1=[1,2,3,4,5,6,7,8,9]
In [2]: list2=[1,2,3,10,11,12,13]
In [3]: list1=set(list1)
In [4]: list1.intersection(list2)
Out[4]: {1, 2, 3}

Compare two different files line by line in python

I have two different files and I want to compare their contents line by line, and write their common contents to a different file. Note that both of them contain some blank lines.
Here is my pseudo code:
file1 = open('some_file_1.txt', 'r')
file2 = open('some_file_2.txt', 'r')
FO = open('some_output_file.txt', 'w')
for line1 in file1:
    for line2 in file2:
        if line1 == line2:
            FO.write("%s\n" %(line1))
FO.close()
file1.close()
file2.close()
However, by doing this, I got lots of blank spaces in my FO file. Seems like common blank spaces are also written. I want to write only the text part. Can somebody please help me.
For example: my first file (file1) contains data:
Config:
Hostname = TUVALU
BT:
TS_Ball_Update_Threshold = 0.2
BT:
TS_Player_Search_Radius = 4
BT:
Ball_Template_Update = 0
while second file (file2) contains data:
Pole_ID = 2
Width = 1280
Height = 1024
Color_Mode = 0
Sensor_Scale = 1
Tracking_ROI_Size = 4
Ball_Template_Update = 0
If you notice, the last line of each file is the same; hence, I want to write this line to my FO file. But the problem with my approach is that it also writes the common blank lines. Should I use regex for this problem? I do not have experience with regex.
This solution reads both files in one pass, excludes blank lines, and prints common lines regardless of their position in the file:
with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)
same.discard('\n')
with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)
Yet another example...
from __future__ import print_function #Only for Python2
with open('file1.txt') as f1, open('file2.txt') as f2, open('outfile.txt', 'w') as outfile:
    for line1, line2 in zip(f1, f2):
        if line1 == line2:
            print(line1, end='', file=outfile)
And if you want to eliminate common blank lines, just change the if statement to:
if line1.strip() and line1 == line2:
.strip() removes all leading and trailing whitespace, so if that's all that's on a line, it will become an empty string "", which is considered false.
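The set-based answer loses the original line order, and the zip-based one only compares lines at the same position. If common lines should be found regardless of position while keeping the order of the first file, the two ideas can be combined. The following is a rough sketch under those assumptions; `common_lines` is an invented name and the sample lines are taken from the question.

```python
def common_lines(lines1, lines2):
    """Return non-blank lines of lines1 that also occur in lines2.

    Hypothetical helper: compares stripped text, keeps the order of
    lines1, and reports each common line only once.
    """
    wanted = {line.strip() for line in lines2 if line.strip()}
    seen = set()
    result = []
    for line in lines1:
        text = line.strip()
        # An empty string is falsy, so blank lines never pass this test.
        if text and text in wanted and text not in seen:
            seen.add(text)
            result.append(text)
    return result


file1 = ["Config:\n", "Hostname = TUVALU\n", "\n", "BT:\n",
         "Ball_Template_Update = 0\n"]
file2 = ["Pole_ID = 2\n", "\n", "Width = 1280\n",
         "Ball_Template_Update = 0\n"]
print(common_lines(file1, file2))
```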
If you are specifically looking for getting the difference between two files, then this might help:
with open('first_file', 'r') as file1:
    with open('second_file', 'r') as file2:
        difference = set(file1).difference(file2)
difference.discard('\n')
with open('diff.txt', 'w') as file_out:
    for line in difference:
        file_out.write(line)
If order is preserved between files you might also prefer difflib. Although Robᵩ's result is the bona-fide standard for intersections you might actually be looking for a rough diff-like:
from difflib import Differ
with open('cfg1.txt') as f1, open('cfg2.txt') as f2:
    differ = Differ()
    for line in differ.compare(f1.readlines(), f2.readlines()):
        if line.startswith(" "):
            print(line[2:], end="")
That said, this has a different behaviour to what you asked for (order is important) even though in this instance the same output is produced.
Once the file object has been iterated, it is exhausted.
>>> f = open('1.txt', 'w')
>>> f.write('1\n2\n3\n')
>>> f.close()
>>> f = open('1.txt', 'r')
>>> for line in f: print line
...
1
2
3
# exhausted, another iteration does not produce anything.
>>> for line in f: print line
...
>>>
Use file.seek (or close/open the file) to rewind the file:
>>> f.seek(0)
>>> for line in f: print line
...
1
2
3
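The same behaviour can be reproduced without touching the disk, which also shows why the nested loops in the question only ever compare the first line of file1 against file2. This is a small sketch using io.StringIO as a stand-in for a real file object; the variable names are made up for this example.

```python
import io

# io.StringIO behaves like an open text file for iteration and seek().
handle = io.StringIO("1\n2\n3\n")

first_pass = [line.strip() for line in handle]
# The read position is now at the end, so this yields nothing.
second_pass = [line.strip() for line in handle]

handle.seek(0)  # rewind to the start
third_pass = [line.strip() for line in handle]

print(first_pass, second_pass, third_pass)
```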
Try this:
from __future__ import with_statement
filename1 = "G:\\test1.TXT"
filename2 = "G:\\test2.TXT"
with open(filename1) as f1:
    with open(filename2) as f2:
        file1list = f1.read().splitlines()
        file2list = f2.read().splitlines()
        list1length = len(file1list)
        list2length = len(file2list)
        if list1length == list2length:
            for index in range(len(file1list)):
                if file1list[index] == file2list[index]:
                    print(file1list[index] + "==" + file2list[index])
                else:
                    print(file1list[index] + "!=" + file2list[index] + " Not-Equal")
        else:
            print("difference in the size of the file and number of lines")
I have just been faced with the same challenge, but I thought "Why program this in Python if you can solve it with a simple grep?", which led to the following Python code:
import subprocess
from subprocess import PIPE
try:
    output1, errors1 = subprocess.Popen(["c:\\cygwin\\bin\\grep", "-Fvf", "c:\\file1.txt", "c:\\file2.txt"], shell=True, stdout=PIPE, stderr=PIPE).communicate()
    output2, errors2 = subprocess.Popen(["c:\\cygwin\\bin\\grep", "-Fvf", "c:\\file2.txt", "c:\\file1.txt"], shell=True, stdout=PIPE, stderr=PIPE).communicate()
    if (len(output1) + len(output2) + len(errors1) + len(errors2) > 0):
        print("Compare result : There are differences:")
        if (len(output1) + len(output2) > 0):
            print(" Output differences : ")
            print(output1)
            print(output2)
        if (len(errors1) + len(errors2) > 0):
            print(" Errors : ")
            print(errors1)
            print(errors2)
    else:
        print("Compare result : Both files are equal")
except Exception as ex:
    print("Compare result : Exception during comparison")
    print(ex)
    raise
The trick behind this is the following:
grep -Fvf file1.txt file2.txt prints the lines of file2.txt that match no entry in file1.txt, so empty output means that all entries in file2.txt are present in file1.txt. By doing this in both directions we can see if the contents of both files are "equal". I put "equal" between quotes because duplicate lines are disregarded in this way of working.
Obviously, this is just an example: you can replace grep by any commandline file comparison tool.
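For readers without grep at hand, a comparable two-way check can be sketched in pure Python with sets. This is not an exact equivalent: the sketch compares whole lines, whereas grep -F without -x treats each pattern as a substring. The name `compare_line_sets` and the sample lines are made up for this example.

```python
def compare_line_sets(lines1, lines2):
    """Return (only_in_first, only_in_second) as sorted lists.

    Hypothetical helper; like the grep approach, duplicates are ignored.
    """
    set1 = {line.rstrip("\n") for line in lines1}
    set2 = {line.rstrip("\n") for line in lines2}
    return sorted(set1 - set2), sorted(set2 - set1)


only_first, only_second = compare_line_sets(
    ["a\n", "b\n", "b\n"],
    ["b\n", "c\n"],
)
if only_first or only_second:
    print("Compare result : There are differences:")
    print(only_first)
    print(only_second)
else:
    print("Compare result : Both files are equal")
```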
difflib is well worth the effort, with nice condensed output.
from pathlib import Path
import difflib
mypath = '/Users/x/lib/python3'
file17c = Path(mypath, 'oop17c.py')
file18c = Path(mypath, 'oop18c.py')
with open(file17c) as file_1:
    file1 = file_1.readlines()
with open(file18c) as file_2:
    file2 = file_2.readlines()
for line in difflib.unified_diff(
        file1, file2, fromfile=str(file17c), tofile=str(file18c), lineterm=''):
    print(line)
output
+ ... unique stuff present in file18c
- ... stuff absent in file18c but present in file17c
