Python iterating through two files by line at the same time

I am trying to compare columns in two files to see if the values match, and if there is a match I want to merge/concatenate the data for that row together. My issue is that when reading line by line from the two files separately, I can't get python to iterate through the files together and look for a match. Instead it will iterate properly through one file and iterate over the same line in the second file multiple times...
I have had this issue in the past and still not really found a way around it. I know that indentation is one problem since I mess with the loop by using "for line in a, for line in b" so I thought that what I tried below would work but it hasn't. I have looked around for solutions but nobody seems to be using the same method so I wonder if I am completely off track as to how to do this? Can anyone explain what is a better way to do this, and whether my method would work at all and if not, why not? Thanks, it is much appreciated!
These are the formats of my two files, basically I want to compare the columns filename in both files and if they match I want to merge the rows together.
file1:
cluster_id hypothesis_id filename M1_name_offset Orientation
1 71133076 unique_name_1.png esc_sox2_Sox1_80_4 forward
1 50099120 unique_name_4.png hb_cebpb_ETS1_139_7 forward
1 91895576 unique_name_11.png he_tal1_at_AC_acptr_258_11 forward
file2:
Name Cluster_No Pattern filename
esc_sox2_Sox1_80 Cluster1 AP1(1N)ETS unique_name_4.png
hb_cebpb_ETS1_139 Cluster1 CREB(1N)ETS unique_name_11.png
he_tal1_at_AC_acptr_258 Cluster2 ETS(-1N)ZIC unique_name_3.png
What I have tried:
for aline in file1:
motif1 = aline.split()[2]
for bline in file2:
motif2 = bline.split()[-1]
if motif1 = motif2:
print "match", aline, bline
I have also tried:
for aline in file1:
motif1 = aline.split()[2]
for bline in file2:
motif2 = bline.split()[-1]
if motif1 = motif2:
print "match", aline, bline
I have also tried using string formatting but that didn't make a difference. The first way iterates through file2 incorrectly and the second way doesn't give me any output. I have played around with it a lot and tried various indentations and extra bits but I am stumped as to how to even try and fix it! Please help me :(

Use the zip builtin function.
with open(file1) as f1, open(file2) as f2:
    for line1, line2 in zip(f1, f2):
        motif1 = line1.split()[0]
        motif2 = line2.split()[0]
        ...
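Note that zip behaves differently in Python 2 and Python 3: in Python 2 it builds the complete list of pairs in memory, while in Python 3 it returns a lazy iterator. In Python 2, it is therefore more efficient to use itertools.izip instead.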
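One caveat: zip pairs lines by position, so it only finds a match when the matching rows sit on the same line number in both files. In the sample data that is not the case (unique_name_4.png is the second data row of file1 but the first of file2). A minimal sketch of a keyed lookup instead, assuming whitespace-separated columns, the filename in the third column of file1 and the last column of file2, and a file2 small enough to index in memory. The name merge_on_filename is hypothetical, not from the question.

```python
def merge_on_filename(lines1, lines2):
    """Yield merged rows for lines whose filename columns match.

    Assumes whitespace-separated columns: the filename is column 3 of
    lines1 and the last column of lines2 (as in the question's samples).
    """
    # Index the second input once, keyed by its filename column.
    by_filename = {}
    for line in lines2:
        fields = line.split()
        if fields:  # skip blank lines
            by_filename[fields[-1]] = fields

    # Stream the first input and look each filename up in the index.
    for line in lines1:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in by_filename:
            # Drop the duplicate filename column from the second row.
            yield fields + by_filename[fields[2]][:-1]


file1 = [
    "cluster_id hypothesis_id filename M1_name_offset Orientation",
    "1 71133076 unique_name_1.png esc_sox2_Sox1_80_4 forward",
    "1 50099120 unique_name_4.png hb_cebpb_ETS1_139_7 forward",
    "1 91895576 unique_name_11.png he_tal1_at_AC_acptr_258_11 forward",
]
file2 = [
    "Name Cluster_No Pattern filename",
    "esc_sox2_Sox1_80 Cluster1 AP1(1N)ETS unique_name_4.png",
    "hb_cebpb_ETS1_139 Cluster1 CREB(1N)ETS unique_name_11.png",
    "he_tal1_at_AC_acptr_258 Cluster2 ETS(-1N)ZIC unique_name_3.png",
]

merged = list(merge_on_filename(file1, file2))
for row in merged:
    print(" ".join(row))
```

The function accepts any iterable of lines, so two open file objects can be passed in place of the lists. The two header rows also pair up, because both contain the literal word filename in the key column.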

I'm assuming you're using Python 3. Here's a nice abstraction, iterlines. It hides the complexity of opening, reading, pairing, and closing n files. Note the use of zip_longest: it prevents the ends of longer files from being silently discarded.
def iterlines(*paths, fillvalue=None, **open_kwargs):
    files = []
    try:
        for path in paths:
            files.append(open(path, **open_kwargs))
        for lines in zip_longest(*files, fillvalue=fillvalue):
            yield lines
    finally:
        for file_ in files:
            # suppress() with no arguments suppresses nothing, so name the
            # exception type to ignore when closing
            with suppress(Exception):
                file_.close()
Usage
for line_a, line_b in iterlines('a.txt', 'b.txt'):
    print(line_a, line_b)
Complete code
from contextlib import suppress
from itertools import zip_longest

def iterlines(*paths, fillvalue=None, **open_kwargs):
    files = []
    try:
        for path in paths:
            files.append(open(path, **open_kwargs))
        for lines in zip_longest(*files, fillvalue=fillvalue):
            yield lines
    finally:
        for file_ in files:
            # suppress() with no arguments suppresses nothing, so name the
            # exception type to ignore when closing
            with suppress(Exception):
                file_.close()

for lines in iterlines('a.txt', 'b.txt', 'd.txt'):
    print(lines)

Related

For each line in a file, replace multiple-whitespace substring of variable length with line break

Using Python 2.7.1, I read in a file:
input = open(file, "rU")
tmp = input.readlines()
which looks like this:
>name -----meoidoad
>longname -lksowkdkfg
>nm --kdmknskoeoe---
>nmee dowdbnufignwwwwcds--
That is, each line has a short substring of whitespaces, but the length of this substring varies by line.
I would like to write script that edits my tmp object such that when I write tmp to file, the result is
>name
-----meoidoad
>longname
-lksowkdkfg
>nm
--kdmknskoeoe---
>nmee
dowdbnufignwwwwcds--
I.e. I would like to break each line into two lines, at that substring of whitespaces (and get rid of the spaces in the process).
The starting position of the string after the whitespaces is always the same within a file, but may vary among a large batch of files I am working with. So, I need a solution that does not rely on positions.
I've seen many similar questions on here, with many well-liked answers that use short regex scripts to do so, so it is possible I am duplicating a previous question. However, none of what I've seen so far has worked for me.
import re
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        # raw string so that \s reaches the regex engine unescaped
        outfile.write(re.sub(r'\s\s+', '\n', line))
If the file isn't huge (i.e. hundreds of MB), you can do this concisely with split() and join():
with open(file, 'rU') as f, open(outfilename, 'w') as o:
    o.write('\n'.join(f.read().split()))
I would also recommend against naming anything input, as that will mask the built-in.
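For files too large to read in one go, the same split idea can be applied one line at a time. A minimal sketch, assuming each line should be broken only at its first run of whitespace; the helper name break_at_whitespace is made up for illustration.

```python
def break_at_whitespace(line):
    """Split a line into two lines at its first run of whitespace."""
    # sep=None splits on any whitespace run; maxsplit=1 keeps the rest intact.
    parts = line.split(None, 1)
    # rstrip() drops the trailing newline left on the second part.
    return "\n".join(part.rstrip() for part in parts)


sample_lines = [
    ">name       -----meoidoad\n",
    ">longname   -lksowkdkfg\n",
]
converted = [break_at_whitespace(line) for line in sample_lines]
print("\n".join(converted))
```

Because the split does not depend on column positions, it works regardless of where the second field starts in a given file.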

Differences between enumerate(fileinput.input(file)) and enumerate(file)

I'm looking for some help with my code, which is right below:
for file in file_name :
    if os.path.isfile(file):
        for line_number, line in enumerate(fileinput.input(file, inplace=1)):
            print file
            os.system("pause")
            if line_number ==1:
                line = line.replace('Object','#Object')
            sys.stdout.write(line)
I wanted to modify some previously extracted files in order to plot them with matplotlib. To do so, I remove some lines and comment out some others.
My problem is the following:
Using for line_number, line in enumerate(fileinput.input(file, inplace=1)): processes only 4 of the 5 previously extracted files (even though the file_name list contains 5 entries!)
Using for line_number, line in enumerate(file): processes all 5 previously extracted files, BUT I don't know how to modify a file in place without creating another one...
Do you have any idea what causes this? Is this normal behaviour?
There are a number of things that might help you.
Firstly file_name appears to be a list of file names. It might be better named file_names and then you could use file_name for each one. You have verified that this does hold 5 entries.
The enumerate() function is used to help when enumerating a list of items to provide both an index and the item for each loop. This saves you having to use a separate counter variable, e.g.
for index, item in enumerate(["item1", "item2", "item3"]):
    print index, item
would print:
0 item1
1 item2
2 item3
This is not really required here, as you have chosen to use the fileinput library, which is designed to take a list of files and iterate over all of the lines in all of the files in one single loop. As such you need to tweak your approach a bit. Assuming your list of files is called file_names, you could write something like the following:
# Keep only files in the file list
file_names = [file_name for file_name in file_names if os.path.isfile(file_name)]
# Iterate all lines in all files
for line in fileinput.input(file_names, inplace=1):
    # filelineno() is 1-based, so this tests the first line of each file;
    # use == 2 to match index 1 from the enumerate() version in the question
    if fileinput.filelineno() == 1:
        line = line.replace('Object','#Object')
    sys.stdout.write(line)
The main point here being that it is better to pre filter any non-filenames before passing the list to fileinput. I will leave it up to you to fix the output.
fileinput provides a number of functions to help you figure out which file or line number is currently being processed.
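As a rough illustration of those helper functions, the sketch below creates two throwaway files (the names a.txt and b.txt are placeholders) and records what fileinput.filename(), fileinput.filelineno(), fileinput.lineno() and fileinput.isfirstline() report for each line. It reads only and does not use inplace.

```python
import fileinput
import os
import tempfile

# Create two small throwaway files so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
file_names = []
for name, text in [("a.txt", "Object 1\nrow\n"), ("b.txt", "Object 2\nrow\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write(text)
    file_names.append(path)

seen = []
for line in fileinput.input(file_names):
    # filename() is the file currently being read, filelineno() restarts
    # at 1 for every file, lineno() keeps counting across all files.
    seen.append((os.path.basename(fileinput.filename()),
                 fileinput.filelineno(),
                 fileinput.lineno(),
                 fileinput.isfirstline()))
fileinput.close()

for entry in seen:
    print(entry)
```

The per-file counter is what makes a single loop over many files workable: a test such as fileinput.isfirstline() fires once per file, not once overall.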
Assuming you're still having trouble, my typical approach is to open a file read-only, read its contents into a variable, close the file, make an edited variable, open the file to write (wiping out original file), and finally write the edited contents.
I like this approach since I can simply change the file_name that gets written out if I want to test my edits without wiping out the original file.
Also, I recommend naming containers using plural nouns, like @Martin Evans suggests.
import os
file_names = ['file_1.txt', 'file_2.txt', 'file_3.txt', 'file_4.txt', 'file_5.txt']
file_names = [x for x in file_names if os.path.isfile(x)]  # see @Martin's answer again
for file_name in file_names:
    # Open read-only and put contents into a list of line strings
    with open(file_name, 'r') as f_in:
        lines = f_in.read().splitlines()
    # Put the lines you want to write out in out_lines
    out_lines = []
    for index_no, line in enumerate(lines):
        if index_no == 1:
            out_lines.append(line.replace('Object', '#Object'))
        # elif ...: add further conditions here as needed
        else:
            out_lines.append(line)
    # Uncomment to write to different file name for edits testing
    # with open(file_name + '.out', 'w') as f_out:
    #     f_out.write('\n'.join(out_lines))
    # Write out the file, clobbering the original
    with open(file_name, 'w') as f_out:
        f_out.write('\n'.join(out_lines))
Downside with this approach is that each file needs to be small enough to fit into memory twice (lines + out_lines).
Best of luck!
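If the memory cost mentioned above is a concern, one possible variant streams each file line by line into a temporary file and then swaps it into place. This is only a sketch under assumptions: Python 3.3+ (for os.replace), and a made-up helper name comment_second_line. The temporary file is created in the same directory so the final rename stays on one filesystem.

```python
import os
import tempfile


def comment_second_line(file_name):
    """Rewrite file_name line by line, replacing 'Object' on line index 1."""
    dir_name = os.path.dirname(os.path.abspath(file_name))
    # Write to a temporary file next to the original, then swap it in.
    with open(file_name) as f_in, tempfile.NamedTemporaryFile(
            "w", dir=dir_name, delete=False) as f_out:
        for index_no, line in enumerate(f_in):
            if index_no == 1:
                line = line.replace("Object", "#Object")
            f_out.write(line)
    # Both files are closed here; replace the original with the edited copy.
    os.replace(f_out.name, file_name)


# Demonstration on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as f:
    f.write("header\nObject data\nObject more\n")

comment_second_line(demo_path)

with open(demo_path) as f:
    result = f.read()
print(result)
```

Only one line is held in memory at a time, at the price of a second file on disk while the rewrite is in progress.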

Python removing duplicates and saving the result

I am trying to remove duplicates from a 3-column tab-delimited txt file. A row should count as a duplicate, and be removed, whenever its first two columns match those of an earlier row, even if the third column differs.
from operator import itemgetter
import sys
input = sys.argv[1]
output = sys.argv[2]
#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
for line in input.splitlines():
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
    file = open(output, "w")
    file.write(data)
    file.close()
First, I get error
key = ig(line.split())
IndexError: list index out of range
Also, I can't see how to save the result to output.txt
People say saving to output.txt is a really basic matter. But no tutorial helped.
I tried methods that use codec, those that use with, those that use file.write(data) and all didn't help.
I could learn MatLab quite easily. The online tutorial was fantastic and a series of Googling always helped a lot.
But I can't find a helpful tutorial of Python yet. This is obviously because I am a complete novice. For complete novices like me, what would be the best tutorial with 1) comprehensiveness, 2) lots of examples, and 3) line-by-line explanation that doesn't leave any line without explanation?
And why is the above code causing error and not saving result?
I'm assuming since you assign input to the first command line argument with input = sys.argv[1] and output to the second, you intend those to be your input and output file names. But you're never opening any file for the input data, so you're calling .splitlines() on a file name, not on file contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line-by-line, simply use for line in f, where f is an open file. Those lines will include the newline at the end of the line, so it needs to be stripped if it's not supposed to be part of the third column's data.
Then you're opening and closing the file inside your loop, which means you'll try to write the entire contents of data to the file every iteration, effectively overwriting any data written to the file before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile will open the file named out_fn and assign the open file to outfile, and close it for you as soon as you exit that indented block.
input is a builtin function in Python. I therefore renamed your variables so no builtin names get shadowed.
You're trying to directly write data to the output file. This won't work since data is a list of lines. You need to join those lines first in order to turn them into a single string again before writing it to a file.
So here's your code with all those issues addressed:
from operator import itemgetter
import sys
in_fn = sys.argv[1]
out_fn = sys.argv[2]
getkey = itemgetter(0, 1)
seen = set()
data = []
with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)
with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
Why is the above code causing error?
Because you haven't opened the file, you are trying to work with the string input.txt rather than with the file. Then when you try to access your item, you get a list index out of range because line.split() returns ['input.txt'].
How to fix that: open the file and then work with it, not with its name.
For example, you can do (I tried to stay as close to your code as possible)
input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)
Why is this not saving result?
Because you are opening/closing the file inside the loop. What you need to do is write the data once you're out of the loop. Also, you cannot write directly a list to a file. Hence, you need to do something like (outside of your loop):
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
All together
There are other ways of reading/writing files, and it is pretty well documented on the internet but I tried to stay close to your code so that you would understand better what was wrong with it
from operator import itemgetter
import sys
input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]
#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
lines = infile.readlines()
infile.close()
for line in lines:
    print line
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
print data
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
PS: it seems to produce the result that you needed in "Python to remove duplicates using only some, not all, columns".

How to iterate over arbitrary number of files in parallel in python?

I have a list of file objects in a list called paths
I'd like to be able to go through and read the first line of each file, do something with this n-tuple of data, then move on the second line of each file. The number of file objects in path is arbitrary.
Is this possible?
import itertools
for line_tuple in itertools.izip(*files):
    whatever()
I'd use zip, but that would read the entire contents of the files into memory. Note that files should be a list of file objects; I'm not sure what you mean by "list of file handlers".
This depends on how "arbitrary" it actually is. As long as the number is less than the limit of your OS, then itertools.izip should work just fine (or itertools.izip_longest as appropriate).
files = [open(f) for f in filenames]
for lines in itertools.izip(*files):
    pass  # do something with the tuple of lines
for f in files:
    f.close()
If you can have more files than your OS will allow you to open, then you're out of luck (at least as far as an easy solution is concerned).
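On Python 3, where zip is already lazy, the open/iterate/close pattern above can be wrapped so that every file is closed even if an error occurs midway. A minimal sketch using contextlib.ExitStack; the function name iter_parallel and the demo file names are assumptions for illustration only.

```python
import os
import tempfile
from contextlib import ExitStack


def iter_parallel(paths):
    """Yield one tuple of lines per row, stopping at the shortest file."""
    with ExitStack() as stack:
        # Each file is registered with the stack and closed when it exits.
        files = [stack.enter_context(open(path)) for path in paths]
        for lines in zip(*files):
            yield lines


# Build three throwaway files so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
contents = {"a.txt": "a1\na2\n", "b.txt": "b1\nb2\n", "c.txt": "c1\nc2\nc3\n"}
paths = []
for name in sorted(contents):
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write(contents[name])
    paths.append(path)

rows = [tuple(line.rstrip("\n") for line in lines)
        for lines in iter_parallel(paths)]
print(rows)
```

The OS limit on simultaneously open files still applies; ExitStack only takes care of the cleanup.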
The first idea that popped into my mind is the following code; it seems almost too straightforward:
fp_list = []
for file in path_array:
    fp = open(file)
    fp_list.append(fp)
# Read one line from each file; repeat this block to move on to the next row
line_list = []
for fp in fp_list:
    line = fp.readline()
    line_list.append(line)
## your code here: process the line_list
for fp in fp_list:
    fp.close()

How to skip 2 lines in a file with Python?

I have a series of files and I want to extract a specific number from each of them.
In each of the files I have this line:
name, registration num
and exactly two lines after that there is the registration number. I would like to extract this number from each file and store it as a value in a dictionary. Does anyone have any idea how this is possible?
My current code, which does not actually work, is below:
matches=[]
for root, dirnames, filenames in os.walk('D:/Dataset2'):
    for filename in fnmatch.filter(filenames, '*.txt'):
        matches.append([root, filename])
filenames_list={}
for root,filename in matches:
    filename_key = (os.path.join(filename).strip()).split('.',1)[0]
    fullfilename = os.path.join(root, filename)
    f= open(fullfilename, 'r')
    for line in f:
        if "<name, registration num'" in line:
            key=filename_key
            line+=2
            val=line
I usually use next() when I want to skip a single line, usually a header for a file.
with open(file_path) as f:
    next(f) # skip 1 line
    next(f) # skip another one.
    for line in f:
        pass # now you can keep reading as if there was no first or second line.
Note: The next() builtin was added in Python 2.6; in earlier versions you must use f.next()
One way would be to load the whole file into a list, and then read the line(s) you want from it. Example:
A file called testfile contains the following:
A1
B2
C3
D4
E5
A program test.py:
#!/usr/bin/env python
file = open('testfile')
lines = file.readlines()[2:]
file.close()
for line in lines:
    print(line.strip())
Output:
$./test.py
C3
D4
E5
EDIT: I read the question again, and noticed you just want a single line. Then you could just remove the : and use f.readlines()[2] to get the third line in a file.
Or you could use f.readline() three times, and just ignore the first two.
Or you could use a for line in f type loop, and just ignore the first two lines (with an incrementing counter).
I suppose something like that would work...
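f = open(fullfilename, 'r')
for line in f:
    if "name, registration num" in line:
        key = filename_key
        break
# next() instead of readline(): Python 2 refuses to mix iteration with readline()
next(f, '')  # skip the line in between
res = next(f, '').rstrip('\n')  # registration number without the trailing newline
f.close()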
from itertools import islice
with open('data.txt') as f:
    for line in islice(f, 2, None):
        print line
Generally speaking, if you want to do something to a Python iterator in-loop, like look two ahead, I find a good first place to look is the itertools module documentation. In your case, you might benefit from the consume recipe given there.
Worth having a look to see if this issue hasn't been covered on SO before.
Edit: Indeed- look here, which includes a good discussion of python iterators.
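For reference, a simplified form of that consume recipe, applied to the question's layout, might look like the sketch below. The function value_after_header, its default header text and the sample lines are assumptions made for illustration; only consume follows the pattern from the itertools documentation.

```python
from itertools import islice


def consume(iterator, n):
    """Advance the iterator n steps ahead (simplified itertools recipe)."""
    next(islice(iterator, n, n), None)


def value_after_header(lines, header="name, registration num", offset=2):
    """Return the line `offset` lines after the first line containing header."""
    iterator = iter(lines)
    for line in iterator:
        if header in line:
            consume(iterator, offset - 1)  # skip the lines in between
            return next(iterator, "").strip()
    return None  # header not found


sample = ["intro\n", "name, registration num\n", "-----\n", "12345\n", "tail\n"]
print(value_after_header(sample))
```

An open file object can be passed in place of the list, since the function only relies on iteration.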
