Python - Compare values in lists (not 1:1 match)

I've got 2 txt files that are structured like this:
File 1
LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3
File 2
FILENAME1
FILENAME2
FILENAME3
And I use this code to print the "unique" lines contained in both files:
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    a = f1.readlines()
    b = f2.readlines()
    non_duplicates = [line for line in a if line not in b]
    non_duplicates += [line for line in b if line not in a]
    for i in range(1, len(non_duplicates)):
        print non_duplicates[i]
The problem is that this way it prints all the lines of both files; what I want to do is check whether FILENAME1 appears in some line of file 1 (the one with both links and filenames) and delete that line.

You need to first load all the lines in 2.txt and then filter the lines in 1.txt that contain a line from the former. Use a set or frozenset to organize the "blacklist", so that each not in check runs in O(1) on average. Also note that f1 and f2 are already iterable:
with open('2.txt', 'r') as f2:
    blacklist = frozenset(line.strip() for line in f2)

with open('1.txt', 'r') as f1:
    non_duplicates = [x.strip() for x in f1
                      if x.strip().split(";")[1] not in blacklist]
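If the goal is to actually delete those lines from 1.txt rather than just collect the survivors, a minimal follow-up sketch (reusing non_duplicates from above) is to write the kept lines back out:

with open('1.txt', 'w') as f1:
    # overwrite 1.txt with only the lines whose filename part
    # was not found in the blacklist
    f1.write('\n'.join(non_duplicates) + '\n')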

If file2 is not too big, create a set of all its lines, then split each file1 line and check whether the second element is in that set:
import fileinput
import sys

with open("file2.txt") as f:
    lines = set(map(str.rstrip, f))  # itertools.imap in Python 2

for line in fileinput.input("file1.txt", inplace=True):
    # if FILENAME1 etc. is not in the line, write the line
    if line.rstrip().split(";")[1] not in lines:
        sys.stdout.write(line)
file1:
LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3
LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6
file2:
FILENAME1
FILENAME2
FILENAME3
file1 after:
LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6
fileinput.input with inplace=True changes the original file. You don't need to store the lines in a list.
You can also write to a tempfile, writing the unique lines to it and using shutil.move to replace the original file:
from tempfile import NamedTemporaryFile
from shutil import move

# open the tempfile in text mode ("w"); the default is binary,
# which would reject str lines in Python 3
with open("file2.txt") as f, open("file1.txt") as f2, \
        NamedTemporaryFile("w", dir=".", delete=False) as out:
    lines = set(map(str.rstrip, f))
    for line in f2:
        if line.rstrip().split(";")[1] not in lines:
            out.write(line)

move(out.name, "file1.txt")
If your code errors out, you won't lose any data in the original file when using a tempfile.
Using a set to store the lines means we have O(1) lookups on average; storing all the lines in a list would give you a quadratic rather than a linear solution, which for larger files is significantly less efficient. There is also no need to store all the lines of the other file in a list with readlines, as you can write as you iterate over the file object and do your lookups.
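To make the complexity claim concrete, here is a small illustrative micro-benchmark (the sizes and numbers are arbitrary):

from timeit import timeit

# membership in a list scans every element (O(n) per lookup);
# membership in a set hashes the key (O(1) on average)
data_list = [str(i) for i in range(100000)]
data_set = set(data_list)

print(timeit("'99999' in data_list", globals=globals(), number=100))
print(timeit("'99999' in data_set", globals=globals(), number=100))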

Unless the files are too large, you may print the lines in file1.txt (which I call entries) whose filename part is not listed in file2.txt with something like this:
with open('file1.txt') as f1:
    entries = f1.read().splitlines()

with open('file2.txt') as f2:
    filenames_to_delete = f2.read().splitlines()

print([entry for entry in entries
       if entry.split(';')[1] not in filenames_to_delete])
If file1.txt is large and file2.txt is small, then you may load the filenames in file2.txt entirely in memory, and then open file1.txt and go through it, checking against the in-memory list.
If file1.txt is small and file2.txt is large, you may do it the other way round.
If file1.txt and file2.txt are both excessively large, then if it is known that both files’ lines are sorted by filename, one could write some elaborate code to take advantage of that sorting to get the task done without loading the entire files in memory, as in this SO question. But if this is not an issue, you’ll be better off loading everything in memory and keeping things simple.
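For the curious, here is a rough sketch of that sorted-merge idea; it assumes both files are sorted ascending by filename, that file2.txt holds one filename per line, and the function name is merely illustrative:

def unique_entries_sorted(entries_path, filenames_path):
    # walk both sorted files in lockstep; advance the filename
    # pointer until it reaches or passes the current entry's key
    with open(entries_path) as entries, open(filenames_path) as names:
        name = next(names, None)
        for entry in entries:
            key = entry.rstrip('\n').split(';')[1]
            while name is not None and name.rstrip('\n') < key:
                name = next(names, None)
            if name is None or name.rstrip('\n') != key:
                yield entry.rstrip('\n')

for entry in unique_entries_sorted('file1.txt', 'file2.txt'):
    print(entry)

This touches each line of both files exactly once, so memory use stays constant no matter how large the files are.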
P.S. Since it is not necessary to open the two files simultaneously, we avoid it; we open a file, read it, close it, and then repeat for the next. That way the code is simpler to follow.

Related

How to merge two files by line names using python

I think this should be easy, yet I have not been able to solve it. I have two files as below, and I want to merge them so that the lines starting with > in file1 become the headers of the lines in file2.
file1:
>seq12
ACGCTCGCA
>seq34
GCATCGCGT
>seq56
GCGATGCGC
file2:
ATCGCGCATGATCTCAG
AGCGCGCATGCGCATCG
AGCAAATTTAGCAACTC
so the desired output should be:
>seq12
ATCGCGCATGATCTCAG
>seq34
AGCGCGCATGCGCATCG
>seq56
AGCAAATTTAGCAACTC
I have tried this code so far, but in the output all the lines coming from file2 are the same:
from Bio import SeqIO

with open(file1) as fw:
    with open(file2, 'r') as rv:
        for line in rv:
            items = line
        for record in SeqIO.parse(fw, 'fasta'):
            print('>' + record.id)
            print(line)
If you cannot store your files in memory, you need a solution that reads line by line from each file, and writes accordingly to the output file. The following program does that. The comments try to clarify, though I believe it is clear from the code.
with open("file1.txt") as first, open("file2.txt") as second, open("output.txt", "w+") as output:
while 1:
line_first = first.readline() # line from file1 (header)
line_second = second.readline() # line from file2 (body)
if not (line_first and line_second):
# if any file has ended
break
# write to output file
output.writelines([line_first, line_second])
# jump one line from file1
first.readline()
Note that this will only work if file1.txt has the specific format you presented (odd lines are headers, even lines are useless).
In order to allow a bit more customization, you can wrap it up in a function as:
def merge_files(header_file_path, body_file_path,
                output_file="output.txt", every_n_lines=2):
    with open(header_file_path) as first, open(body_file_path) as second, \
            open(output_file, "w+") as output:
        while True:
            line_first = first.readline()    # line from header file
            line_second = second.readline()  # line from body file
            if not (line_first and line_second):
                # if either file has ended
                break
            # write to output file
            output.writelines([line_first, line_second])
            # jump n-1 extra lines from the header file
            for _ in range(every_n_lines - 1):
                first.readline()
And then calling merge_files("file1.txt", "file2.txt") should do the trick.
If both files are small enough to fit in memory at once, you can simply read them both and interleave them.
# Open two file handles.
with open("f1", mode="r") as f1, open("f2", mode="r") as f2:
    lines_first = f1.readlines()   # Read all lines in f1.
    lines_second = f2.readlines()  # Read all lines in f2.
    lines_out = []
    # For each line in the file without headers...
    for idx in range(len(lines_second)):
        # Take the matching header from the first file (headers sit
        # at even 0-based indices) and follow it with the line from
        # the second file.
        lines_out.append(lines_first[2 * idx].rstrip())
        lines_out.append(lines_second[idx].rstrip())
Writing lines_out back to an output file is left as an exercise to the reader.
If either or both files are too large to fit in memory, you can repeat the above process line-by-line over both handles (using one variable to store information from the file with headers).
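As a compact sketch of that streaming variant (assuming the same two-line layout in file1.txt, a header then a sequence line to discard):

with open("file1.txt") as first, open("file2.txt") as second, \
        open("output.txt", "w") as output:
    # zip() pulls one header line from file1 and one body line from
    # file2 per iteration; the extra next() then discards file1's own
    # sequence line before the next header is read
    for header, body in zip(first, second):
        output.write(header)
        output.write(body)
        next(first, None)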

Insert text from one file into the middle of another in a specific place without losing the contents of the second file

I have a file Coor.txt with 4 entries in it. I would like to take these 4 and place them between lines 40 and 44 of Otherfile.py.
But it cannot delete anything below line 44 or above line 40; it needs to nestle in between.
This is where I am stuck: all the examples I find either overwrite everything or parts of Otherfile.py.
I have tried the import method, but this just ignores the contents of Coor.txt.
I would post an example but all attempts have failed.
You mean something like this?
insert_row = 10

with open('file1.txt', 'r') as file1, open('file2.txt', 'r') as file2:
    file1_lines = file1.readlines()
    file2_lines = file2.readlines()
    new_lines = file2_lines[:insert_row] + file1_lines + file2_lines[insert_row:]

with open('file2.txt', 'w') as file2:
    file2.writelines(new_lines)
Probably not really optimal for large files, but for small ones this should work fine.
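If the files ever do get large, a hedged alternative is to stream file2 line by line and splice file1 in at the insertion point, writing to a temporary file first (insert_row and the file names follow the snippet above):

from tempfile import NamedTemporaryFile
from shutil import move

insert_row = 10

# stream file2 so neither file has to fit in memory, then replace
# file2 with the merged temporary file
with open('file2.txt') as file2, \
        NamedTemporaryFile('w', dir='.', delete=False) as out:
    for idx, line in enumerate(file2):
        if idx == insert_row:
            with open('file1.txt') as file1:
                for insert_line in file1:
                    out.write(insert_line)
        out.write(line)
move(out.name, 'file2.txt')

Note this sketch silently skips the insert if file2 has fewer than insert_row lines.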

How to skip lines until finding the 'keyword' in a txt file and save the rest as CSV?

I have a txt file in CSV form, but with a few unnecessary lines on top.
I need to skip the first 8-10 lines (it depends on the file), up to and including the "[APP]" line.
The file looks like this:
1, trash
2, trash
3, [APP]
4
.
.
.
100
and I need to save lines 4-100 as CSV, where line 4 will be the headers and the rest are rows.
What is the best way? I tried a "with open" approach:
with open('som.txt', 'r') as fin:
    data = fin.read().splitlines(True)
with open('som.txt', 'w') as fout:
    fout.writelines(data[7:])
print(data)
So, I now have the data list and it's OK, but that code skips a fixed number of lines, not everything up to a specific word. Also, I can't save this list as a proper CSV file. Can you help? :)
Use readlines, then use seek, and writelines:
with open('some.txt', 'r+') as f:
    text = f.readlines()
    # index of the first line containing '[APP]'
    start = next(i for i, s in enumerate(text) if '[APP]' in s)
    f.seek(0)
    f.writelines(text[start:])
    f.truncate()  # drop leftover bytes from the original, longer content
Your file will now be as expected, and as another plus, it finds the start by the '[APP]' keyword rather than by a fixed index.
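If you also want a proper CSV as the question asks, here is a sketch with the csv module; the output name out.csv and the marker handling are assumptions:

import csv

# skip everything up to and including the row containing "[APP]",
# then write the remaining rows (header row first) to a new CSV file
with open('som.txt', newline='') as fin, \
        open('out.csv', 'w', newline='') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        if any('[APP]' in field for field in row):
            break  # found the marker; real data starts next
    writer.writerows(reader)  # the reader resumes after the marker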

Find unique entries in files

I guess you have a solution concerning the following issue:
I want to compare two lists for common entries (on the basis of column 10) and write the common entries to one file and the entries unique to the first list into another file. The code I wrote is:
INFILE1 = open("c:\\python\\test\\58962.filtered.csv", "r")
INFILE2 = open("c:\\python\\test\\83887.filtered.csv", "r")
OUTFILE1 = open("c:\\python\\test\\58962_vs_83887.common.csv", "w")
OUTFILE2 = open("c:\\python\\test\\58962_vs_83887.unique.csv", "w")

for line in INFILE1:
    line = line.rstrip().split(",")
    if line[11] in INFILE2:
        OUTFILE1.write(line)
    else:
        OUTFILE2.write(line)

INFILE1.close()
INFILE2.close()
OUTFILE1.close()
OUTFILE2.close()
The following error appears:
      8         OUTFILE1.write(line)
      9     else:
---> 10         OUTFILE2.write(line)
     11 INFILE1.close()

TypeError: write() argument must be str, not list
Does somebody know how to fix this?
Best
This line
line = line.rstrip().split(",")
replaces the line you read from the file with its split list. You then try to write that list to your file - that's not how the write method works, and the error tells you exactly that.
Change it to :
for line in INFILE1:
    lineList = line.rstrip().split(",")  # don't overwrite line, use lineList
    if lineList[11] in INFILE2:          # use lineList
        OUTFILE1.write(line)             # corrected indentation
    else:
        OUTFILE2.write(line)
You could have easily found this error yourself, just by printing out the line before and after splitting, or just before writing.
Please read How to debug small programs (#1) and follow it - it's easier to find and fix bugs yourself than to post questions here.
You have some other problem at hand, though:
Files are stream based; they start at position 0 in the file. The position advances as you read parts of the file. When it is at the end, you won't get anything more from INFILE2.read() or other methods.
So if you want to repeatedly check whether some line's column from file1 is somewhere in file2, you need to read file2 into a list (or another data structure) so your repeated checks work. In other words, this:
if lineList[11] in INFILE2:
might work once; after that the file is consumed and it will return False every time.
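A quick demonstration of that consumption behaviour (the file name is illustrative):

with open('data.txt') as f:
    first = f.read()   # reads the whole file; position is now at EOF
    second = f.read()  # returns '' because nothing is left to read
    f.seek(0)          # rewind the position to the start...
    third = f.read()   # ...and the content is readable again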
You also might want to change from:
f = open(...., ...)
# do something with f
f.close()
to
with open(name,"r") as f:
# do something with f, no close needed, closed when leaving block
as it is safer, will close the file even if exceptions happen.
To solve that try this (untested) code:
with open ("c:\\python\\test\\83887.filtered.csv", "r") as file2:
infile2 = file2.readlines() # read in all lines as list
with open ("c:\\python\\test\\58962.filtered.csv", "r") as INFILE1:
# next 2 lines are 1 line, \ at end signifies line continues
with open ("c:\\python\\test\\58962_vs_83887.common.csv", "w") as OUTFILE1, \
with open ("c:\\python\\test\\58962_vs_83887.unique.csv", "w") as OUTFILE2:
for line in INFILE1:
lineList = line.rstrip().split(",")
if any(lineList[11] in x for x in infile2): # check the list of lines if
# any contains line[11]
OUTFILE1.write(line)
else:
OUTFILE2.write(line)
# all files are autoclosed here
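As a design note, the any(...) test above still scans every line of infile2 for each input line. If the files are big, you can build a set of the column values once so each lookup is O(1) on average; a hedged variant, assuming both files share the same column layout:

with open("c:\\python\\test\\83887.filtered.csv", "r") as file2:
    # collect the 12th comma-separated field of every line once
    keys = {line.rstrip().split(",")[11] for line in file2}

# inside the loop the test then becomes a constant-time lookup:
# if lineList[11] in keys: ...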
Links to read:
the-with-statement
any() and other built-ins

How to get the same part of file content?

I have two files, and I want to display the content that exists in both file 1 and file 2 on the screen. But it seems nothing is displayed (it should display オレンジ). What is the problem?
Thanks
File 1
リンゴ
バナナ
オレンジ
File 2
オレンジ
Here is my code
import sys

File1 = open(sys.argv[1], "r", encoding="UTF-8")
F1_Content = File1.readlines()
F1_Content = map(lambda e: e.rstrip("\n"), F1_Content)

File2 = open(sys.argv[2], "r", encoding="UTF-8")
F2_Content = File2.readlines()
F2_Content = map(lambda e: e.rstrip("\n"), F2_Content)

for line in F1_Content:
    print(repr(line))
    if line in F2_Content:
        print(line)

File1.close()
File2.close()
Output
'\ufeffリンゴ'
'バナナ'
'オレンジ'
You probably have more in one of the files than just text and a newline (here, the \ufeff in your output is a UTF-8 byte order mark). You could loop over either F1_Content or F2_Content, printing the representation of each line with print(repr(line)) or print(line.encode('unicode_escape')) to make it easier to spot how the lines differ.
I'd strip the lines entirely. Also, use a set for the lines of one file; testing will be much more efficient:
with open(sys.argv[1], "r", encoding="UTF-8") as file1:
    f1_content = {line.strip() for line in file1}

with open(sys.argv[2], "r", encoding="UTF-8") as file2:
    for line in file2:
        if line.strip() in f1_content:
            print(line)
Looping directly over the file itself reads the file line by line, letting you handle file lines without having to read the whole file into memory.
Note also the use of with statements here; file objects are context managers, and when the context closes (the with block ends) the file object is automatically closed for you.
With Katakana, there is also the possibility that one of your files uses decomposition for the ZI character while the other does not; you can express it either as \u30B8 or as \u30B7\u3099 (SI + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK):
>>> print('\u30B8 != \u30B7\u3099:', '\u30B8' != '\u30B7\u3099')
ジ != ジ: True
You can use unicodedata.normalize() to normalize all your lines to either composed or decomposed forms. Here I force all data to use composed forms:
from unicodedata import normalize

with open(sys.argv[1], "r", encoding="UTF-8") as file1:
    # normalize() takes the form first, then the string
    f1_content = {normalize('NFKC', line.strip()) for line in file1}

with open(sys.argv[2], "r", encoding="UTF-8") as file2:
    for line in file2:
        if normalize('NFKC', line.strip()) in f1_content:
            print(line)
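To confirm that normalization makes the two spellings compare equal:

>>> from unicodedata import normalize
>>> normalize('NFKC', '\u30B7\u3099') == normalize('NFKC', '\u30B8')
True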
