Comparing multiple text files and grouping similar files into one group - Python

I have 100-200 text files with different names in a folder, and I want to compare the text in each file with the others and keep similar files together in a group.
Note:
1. The files are not identical. They are similar in the sense that 2-3 lines of a paragraph are the same as in another file.
2. One file may belong to more than one group.
Can anyone help me with this? I am a beginner in Python.
I have tried the code below, but it doesn't work for me.
file1=open("F1.txt","r")
file2=open("F2.txt","r")
file3=open("F3.txt","r")
file4=open("F4.txt","r")
file5=open("F5.txt","r")
list1=file1.readlines()
list2=file2.readlines()
list3=file3.readlines()
list4=file4.readlines()
list5=file5.readlines()
for line1 in list1:
for line2 in list2:
for line3 in list3:
for line3 in list4:
for line4 in list5:
if line1.strip() in line2.strip() in line3.strip() in line4.strip() in line5.strip():
print line1
file3.write(line1)

You can use this code to check for similar lines between files:
import glob
_contents = dict()
for filename in glob.glob('*.csv'):
    file = open(filename, 'r')
    frd = file.readlines()
    _contents[filename] = frd
for key in _contents:
    for other_key in _contents:
        if key == other_key:
            pass
        else:
            print("Comparing in between files {0} and {1}".format(key, other_key))
            non_identical_contents = set(_contents[key]) - set(_contents[other_key])
            print(list(set(_contents[key]) - non_identical_contents))

If I understood your purpose correctly, you should iterate over all of the text files in the directory and compare each one with every other one (all possible pairs). The code should look something like this:
import glob, os

nl = [] # Name list (containing the names of all files in the directory)
fl = [] # File list (the content of each file; each element is the list of lines of one file)

os.chdir("/libwithtextfiles")
for filename in glob.glob("*.txt"): # Using glob to get all the files ending with '.txt'
    nl.append(filename) # Appending all the filenames in the directory to 'nl'
    f = open(filename, 'r')
    fl.append(f.readlines()) # Appending each file's list of lines to 'fl'
    f.close()

for fname1 in nl:
    l1 = fl[nl.index(fname1)]
    if nl.index(fname1) == len(nl) - 1: # We reached the last file
        break
    for fname2 in nl[nl.index(fname1) + 1:]:
        l2 = fl[nl.index(fname2)]
        # Here compare the number of identical lines, using a counter,
        # then print it, output it to a file, or do whatever you want with it.
        # e.g. (according to what I understood from your code):
        for f1line in l1:
            for f2line in l2:
                if f1line == f2line: # Why 'in' and not '=='?
                    """
                    Have some counter increase right here. A suggestion is keeping
                    a list of lists, where each element is a list of integers:
                    the first integer is the number of lines found identical between
                    the file (its index in list_of_lists corresponds to the name at
                    that index in 'nl') and the one following it; the next integer is
                    the number of lines identical between the same file and the one
                    after that (+2 this time), etc.
                    Long story short: list_of_lists[i][j] is the number of lines identical
                    between the 'i'th file and the 'i+j'th one.
                    """
                    pass
Note that your code doesn't use loops where it should; you could have had a single list instead of the separate variables line1 - line5.
Aside from that, your code as posted is quite unclear. I assume the missing indentation (everything from for line2 in list2: onward should be indented) and the duplicated loop variable (for line3 in list3: followed by for line3 in list4:) are accidental and happened while copying the code to this site. Are you really comparing every line with every line in the other files?
You should, as the comments in my code suggest, keep a counter of how many files each line repeats in. Do that with a for loop and one nested loop, comparing just two files at a time rather than all five at once: even with only 5 files of 10 lines each, your five nested loops would run 100,000 times (10**5), whereas the pairwise method needs only about 1,000 iterations in the same case, roughly 100 times fewer.
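If you then want to turn those pairwise counts into actual groups, a rough sketch could look like the one below. The shared-line threshold of 2 and the *.txt pattern are assumptions you should tune to your data; note that a file can end up in several groups, which matches your second requirement.
import glob

SHARED_LINES_THRESHOLD = 2  # assumption: "similar" means at least 2 identical lines

# Read every file once and keep its non-empty lines as a set for fast comparison
contents = {}
for filename in glob.glob("*.txt"):
    with open(filename) as f:
        contents[filename] = set(line.strip() for line in f if line.strip())

# For every file, collect the files that share enough lines with it
groups = []
for name in sorted(contents):
    group = sorted(other for other in contents
                   if len(contents[name] & contents[other]) >= SHARED_LINES_THRESHOLD)
    if group not in groups:  # don't report the exact same group twice
        groups.append(group)

for group in groups:
    print(group)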

Related

Removing lines of files from a list matching patterns from another list

I have a list of files and a list of patterns like this:
fileList = glob.glob("*undex*fna")
barList = list(barcodes.values())
for i, j in zip(sorted(fileList), barList):
    print(i, j)
The original list : <type 'list'>
('bc1001_5p_test_undex.fna', 'CACTCGACTCTCGCGT')
('bc1002_5p_test_undex.fna', 'ACACTAGATCGCGTGT')
('bc1003_5p_test_undex.fna', 'ACACATCTCGTGAGAG')
('bc1004_5p_test_undex.fna', 'CACATATCAGAGTGCG')
('bc1005_5p_test_undex.fna', 'CATATATATCAGCTGT')
('bc1006_5p_test_undex.fna', 'ACACACAGACTGTGAG')
('bc1008_5p_test_undex.fna', 'ACAGTCGAGCGCTGCG')
('bc1012_5p_test_undex.fna', 'CACGCACACACGCGCG')
All the *fna files have this format (thousands of lines in each file):
head -n 2 bc1001_5p_test_undex.fna
>m64071_201130_104452/590189/ccs CACGCACACACGCGCGTGGATTGATATGTAATACGACTCACTATAGAGAGCTAATCTAAGCGAAAAAAATAGACATTTGAAAGCAAAAGCGTA
>m64071_201130_104452/590191/ccs AACACATCTCGTGAGAGTGGATTGATATGTAATACGACTCACTATAGGCAAAACCAATAAGCATATATACAACTATATATCGAGAGAGATAATATCATATAATATGG
and so on ..
I need to remove the whole lines of the *fna files where the patterns are found. But here is the trick: for example, take the first pattern CACTCGACTCTCGCGT. I have to remove the lines where that pattern is found from all of the *fna files except the first file, bc1001_5p_test_undex (the first pattern is "associated" with the first file, the second pattern with the second file, and so on). Same trick for the second pattern: I have to remove the lines where ACACTAGATCGCGTGT is found from all of the files except the second one.
You can store the new codes in a list and display the value only if the code is not already in the list:
fileList = glob.glob("*undex*fna")
codes = []
barList = list(barcodes.values())
for i, j in zip(sorted(fileList), barList):
    if j not in codes:
        codes.append(j)
        print(i, j)
Did not test this, but I hope this will get you somewhere:
file_list = glob.glob("*undex*fna")
bar_list = list(barcodes.values())
for file_name, bar_code in zip(sorted(file_list), bar_list):
    codes_to_remove = [code for code in bar_list if code != bar_code]
    new_file_content = list()
    with open(file_name, 'r') as file_handle:
        for line in file_handle.readlines():
            if any((code for code in codes_to_remove if code in line)):
                # found a code in this line, do not copy it
                pass
            else:
                new_file_content.append(line)
    with open(file_name, 'w') as file_handle:
        file_handle.writelines(new_file_content)
[code for code in bar_list if code != bar_code] builds a list of all the barcodes except the one associated with the current file name.
The simpler version [x for x in bar_list] would create a list of all the barcodes; if you are not familiar with this syntax, read about list comprehensions.
There are many ways to edit an existing file, but overwriting it with new content is the easiest, so I chose to open the file, collect all the lines without matching barcodes, and then open the file once again to overwrite it with the new content.

Correcting a wrong split in a txt file, Python

I have several txt files which consist of different values, e.g.:
TFF,BAP,VAP,DNAAF5,CDKN2B,PDE2D,SLC22A19,RBPJ,STAT1,TAP2,HLA-
I have probably done a wrong split somewhere in the middle of the code, and it split on '-', so when I double-click one value, it selects the whole line up to the '-'. This mistake does not affect the function until this step. Now I need to count the occurrences of each value with Counter, and the count is wrong.
My code:
gene_calc = r'C:\Users\MrD\Top'
new_dir = r'C:\\Users\\MrD\\Br_Count\\Frequency\\'
for files in gene_calc:
    if not os.path.exists(new_dir):
        os.mkdir(new_dir)
    else:
        break
os.chdir(gene_calc)
for files in glob.glob(os.path.join('*.txt*')):
    #print(files) # iterating over files to check if it prints
    with open(files) as f:
        content = (line for line in f.read().splitlines())
        list = Counter(Vol for Vol in content).most_common()
    with open(new_dir + files, "w") as output:
        output.write(str(list))
The gene_calc folder contains files with values as shown in the example above.
I couldn't re-split it (I tried "if ',' in gene_list:" and reversing with .reverse(), but it's already a list of tuples).
At the moment you are counting lines:
content = (line for line in f.read().splitlines())
To count items you need a second split on ',':
content = (item for line in f for item in line.strip().split(','))
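Put together with Counter, a minimal sketch could look like this (it reuses the gene_calc folder from your question; the '*.txt' pattern is an assumption):
from collections import Counter
import glob, os

gene_calc = r'C:\Users\MrD\Top'  # source folder taken from the question
os.chdir(gene_calc)
for name in glob.glob('*.txt'):
    with open(name) as f:
        # split every line on ',' and drop empty items left by trailing commas
        items = (item for line in f for item in line.strip().split(',') if item)
        counts = Counter(items).most_common()
    print(name, counts[:5])  # the five most frequent values in each file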

How to search for strings from one file in another file

I need to scan two files in Python and report which words in file1 are also in file2. I made a list with all the words from file2 and then check whether each line from file1 is in that list.
This works perfectly, but for large files (around 500k) it can take more than an hour, and I was wondering if there is a faster way.
Thanks in advance
(defined var etc and files)
a = []
for line in var:
    a += [line]
teller = 0
for line1 in new_file:
    if line1 not in a:
        print(line1, file=filter, end='')
    else:
        teller += 1
        print(line1, file=bad, end='')
print('There were', teller, 'lines that were in the old file.')
A faster alternative is using sets (as long as you can keep the content of both files in memory):
with open('a.txt', 'r') as a, open('b.txt', 'r') as b:
    a_content = set(a)
    b_content = set(b)
    result = a_content.intersection(b_content)
If you're worried about speed, then you should be using your OS facilities, not Python loops. Typically, the fastest way to look for individual lines is to sort both files and then do a simple file diff. If you insist on using Python, that approach would also be much quicker than nested loops.
Your method will work, but it's very inefficient, because you traverse file2 for every single word/line in file1. Try turning both file1 and file2 into sets and then compare the sets; Python sets have an .intersection method for exactly this kind of comparison.
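Applied to the loop in the question, the only change that really matters is building a as a set, so each not in test is a hash lookup instead of a scan of the whole list. A sketch reusing the var, new_file, filter and bad objects from the question:
a = set(var)  # set membership tests are O(1) instead of O(n) for a list

teller = 0
for line1 in new_file:
    if line1 not in a:
        print(line1, file=filter, end='')
    else:
        teller += 1
        print(line1, file=bad, end='')
print('There were', teller, 'lines that were in the old file.')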

How to find whether an integer is between the first two columns of a file without using a for loop

I have a file which has integers in its first two columns.
File Name : file.txt
col_a,col_b
1001021,1010045
2001021,2010045
3001021,3010045
4001021,4010045 and so on
Now, using Python, I get a variable var_a = 2002000.
How do I find the range in "file.txt" within which var_a lies?
Expected Output : 2001021,2010045
I have tried the below:
with open("file.txt","r") as a:
    a_line = a.readlines()
    for line in a_line:
        line_sp = line.split(',')
        if int(line_sp[0]) < var_a < int(line_sp[1]):
            print('%r, %r' % (line_sp[0], line_sp[1]))
Since the file has more than a million records, this is time consuming. Is there any better way to do the same without a for loop?
Since the file has more than a million records, this is time consuming. Is there any better way to do the same without a for loop?
Unfortunately you have to iterate over all records in the file, and the only way to achieve that is some kind of loop. So the complexity of this task will always be at least O(n).
It is better to read your file line by line (not all into memory) and store its content as range objects that you can then look up for multiple numbers. Ranges are stored quite efficiently, and you only have to read your file once to check more than one number.
Since Python 3.7, dictionaries are insertion ordered, so if your file is sorted you will only iterate the dictionary until the first range that contains the number; for numbers not in any range you iterate the whole dictionary.
Create file:
fn = "n.txt"
with open(fn, "w") as f:
f.write("""1001021,1010045
2001021,2010045
3001021,3010045
garbage
4001021,4010045""")
Process file:
fn = "n.txt"
# read in
data = {}
with open(fn) as f:
for nr,line in enumerate(f):
line = line.strip()
if line:
try:
start,stop = map(int, line.split(","))
data[nr] = range(start,stop+1)
except ValueError as e:
pass # print(f"Bad data ({e}) in line {nr}")
look_for_nums = [800, 1001021, 3001039, 4010043, 9999999]
for look_for in look_for_nums:
items_checked = 0
for nr,rng in data.items():
items_checked += 1
if look_for in rng:
print(f"Found {look_for} it in line {nr} in range: {rng.start},{rng.stop-1}", end=" ")
break
else:
print(f"{look_for} not found")
print(f"after {items_checked } checks")
Output:
800 not found after 4 checks
Found 1001021 in line 0 in range: 1001021,1010045 after 1 checks
Found 3001039 in line 2 in range: 3001021,3010045 after 3 checks
Found 4010043 in line 5 in range: 4001021,4010045 after 4 checks
9999999 not found after 4 checks
There are better ways to store such a ranges file, e.g. in a tree-like data structure; research k-d trees to get even faster results if you need them. They partition the ranges in a smarter way, so you do not need a linear search to find the right bucket.
This answer to "Data Structure to store Integer Range, Query the ranges and modify the ranges" provides more things to research.
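If the ranges in the file happen to be sorted and non-overlapping, as in the sample data above, you do not even need a tree: a binary search over the start column finds the candidate range in O(log n). A rough sketch under that assumption:
import bisect

# Assumption: the (start, stop) pairs were already parsed from file.txt and sorted by start
ranges = [(1001021, 1010045), (2001021, 2010045),
          (3001021, 3010045), (4001021, 4010045)]
starts = [start for start, stop in ranges]

def find_range(value):
    i = bisect.bisect_right(starts, value) - 1  # last range whose start <= value
    if i >= 0 and value <= ranges[i][1]:
        return ranges[i]
    return None

print(find_range(2002000))  # (2001021, 2010045)
print(find_range(800))      # None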
Assuming each line in the file has the correct format, you can do something like the following.
var_a = 2002000
with open("file.txt") as file:
    for l in file:
        a, b = map(int, l.split(',', 1))  # each line must have only two comma separated numbers
        if a < var_a < b:
            print(l)  # use the line as you want
            break  # if you need only the first occurrence, break the loop now
Note that you'll have to do additional verifications/workarounds if the file format is not guaranteed.
Obviously you have to iterate through all the lines (in the worst case). But we don't load all the lines into memory at once, and as soon as the answer is found, the rest of the file is ignored without being read (assuming you are looking only for the first match).

Python: Appending file outputs from different directories into one overall list

I have n directories (labeled 0 to n), each of which has a file (all the files have the same name), and I want to grab certain lines from each file. I then want to append these grabbed lines together, in order (from 0 to n), in a list.
This is my set-up:
for i in range(0, nfolders):
    folder = "%02d" % i
    os.system("cd " + folder)
    myFile = open("myOutputFile", "r")
    lines = myFile.readlines()
    firstLine = float(lines[0])
    # I then write a loop to store the next 5 lines in a list using append and call this list nextLines
My question is: is there an easy way to append firstLine from all the directories into one list (that my function returns), and likewise to append nextLines from all the directories into one list (again, that my function returns)?
I know there is the extend function; would I use it in a loop here (because, say, with nfolders = 300 it would be impractical to add things together manually)?
Thanks!
You've got a couple of problems to deal with. os.system changes the working directory of the subshell it invokes (which then immediately exits), but not the directory of the running script. Use os.chdir for that. Or, far better, just prepend the path to the file name and use that.
You don't need to read the entire file to get its first line; .readline() or the next() function does that for you. Finally, just append to a list.
my_list = []
for i in range(0, nfolders):
    filename = "%02d/MyOutputFile" % i
    with open(filename) as myFile:
        firstLine = float(next(myFile))
    my_list.append(firstLine)
UPDATE
Suppose you want 4 + i lines from each file. You could tighten this up with
my_list = []
for i in range(0, nfolders):
    filename = "%02d/MyOutputFile" % i
    with open(filename) as myFile:
        my_list += (next(myFile) for _ in range(4 + i))
Note that we only use range to count iterations and don't care about its value, so we use the variable _ as a quick visual cue that the value is not needed.
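To get the two lists the question asks for (all the first lines in one list and the following five lines of every file in another), the same idea extends to something like the sketch below. The file name myOutputFile and the two-digit folder names are taken from the question; islice simply guards against files shorter than six lines.
from itertools import islice

def collect_outputs(nfolders):
    first_lines = []  # the first line of every file, as floats
    next_lines = []   # the next five lines of every file, appended in folder order
    for i in range(nfolders):
        filename = "%02d/myOutputFile" % i
        with open(filename) as my_file:
            first_lines.append(float(next(my_file)))
            next_lines.extend(islice(my_file, 5))
    return first_lines, next_lines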
