How to iterate over an arbitrary number of files in parallel in Python?

I have a list of file objects in a list called paths.
I'd like to be able to go through and read the first line of each file, do something with this n-tuple of data, then move on to the second line of each file. The number of file objects in paths is arbitrary.
Is this possible?

import itertools

for line_tuple in itertools.izip(*files):
    whatever()
I'd use zip, but that would read the entire contents of the files into memory. Note that files should be a list of file objects; I'm not sure what you mean by "list of file handlers".
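(On Python 3 the built-in zip is itself lazy, so the memory concern above only applies to Python 2. A sketch under that assumption, with files again a list of open file objects:)

for line_tuple in zip(*files):  # Python 3: zip yields tuples lazily
    whatever()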

This depends on how "arbitrary" it actually is. As long as the number of files is below your OS's limit on open files, then itertools.izip should work just fine (or itertools.izip_longest as appropriate).
files = [open(f) for f in filenames]
for lines in itertools.izip(*files):
    pass  # do something with the tuple of lines
for f in files:
    f.close()
If you can have more files than your OS will allow you to open, then you're out of luck (at least as far as an easy solution is concerned).
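(If the files differ in length, a sketch with izip_longest, which pads exhausted files with a fill value; filenames is assumed to be the list of paths:)

import itertools

files = [open(f) for f in filenames]
for lines in itertools.izip_longest(*files, fillvalue=None):
    # lines from already-exhausted files show up here as None
    pass
for f in files:
    f.close()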

The first idea that popped into my mind is the following code; it seems quite straightforward:
fp_list = []
for file in path_array:
    fp = open(file)
    fp_list.append(fp)

line_list = []
for fp in fp_list:
    line = fp.readline()
    line_list.append(line)

## your code here to process line_list

for fp in fp_list:
    fp.close()
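(As written this reads only the first line of each file; to keep stepping through the files line by line, the readline() pass could be repeated until every file is exhausted. A sketch:)

while True:
    line_list = [fp.readline() for fp in fp_list]
    if all(line == '' for line in line_list):
        break  # readline() returns '' only at end of file
    ## your code here to process line_list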

Related

Find, Replace inline file from multiple lists in Python

I have three python lists:
filePaths
textToFind
textToReplace
The lists are always equal lengths and in the correct order.
I need to open each file in filePaths, find the line in textToFind, and replace the line with textToReplace. I have all the code that populates the lists. I am stuck on making the replacements. I have tried:
for line in fileinput.input(filePath[i], inplace=1):
    sys.stdout.write(line.replace(find[i], replace[i]))
How do I iterate over each file to make the text replacements on each line that matches find?
When you need to use the indices of the items in a sequence while iterating over that sequence, use enumerate.
for i, path in enumerate(filePath):
    for line in fileinput.input(path, inplace=1):
        sys.stdout.write(line.replace(find[i], replace[i]))
Another option would be to use zip, which will give you one item from each sequence in order.
for path, find_text, replace_text in zip(filePath, textToFind, textToReplace):
    for line in fileinput.input(path, inplace=1):
        sys.stdout.write(line.replace(find_text, replace_text))
Note that in Python 2.x zip produces a new list that can be iterated, so if the sequences you are zipping are huge it will consume memory. Python 3.x zip produces an iterator, so it doesn't have that problem.
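(On Python 2, a lazy equivalent of that loop would use itertools.izip; a sketch, reusing the asker's three lists:)

import fileinput
import sys
from itertools import izip  # Python 2 only

for path, find_text, replace_text in izip(filePath, textToFind, textToReplace):
    for line in fileinput.input(path, inplace=1):
        sys.stdout.write(line.replace(find_text, replace_text))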
With a normal file object you could read the entire file into a variable and perform the string replacement on the whole file at once.
Without more information, I might do something like this:
for my_file in file_paths:
    with open(my_file, 'r') as cin:
        lines = cin.readlines()  # store the file in memory so we can overwrite it
    with open(my_file, 'w') as cout:  # 'w' truncates, so open for writing only after reading
        for line in lines:
            line = line.replace(find, replace)  # change as needed
            cout.write(line)
Iterate over all the file paths, read the whole file first, then reopen it for writing. Store the file's lines in a variable, since this code overwrites the original file (opening in 'w' mode truncates it, which is why the read happens before the write). Do your replace; remember that if there is nothing to replace, Python just leaves the line alone. Then write each line back to the file.
You can read the file into a temporary variable, make changes, and then write it back:
with open('file', 'r') as f:
    text = f.read()
with open('file', 'w') as f:
    f.write(text.replace('aaa', 'bbb'))

Using Python to search multiple text files for matches to a list of strings

So I am starting from scratch on a program that I haven't really seen replicated anywhere else. I'll describe exactly what I want it to do:
I have a list of strings that looks like this:
12482-2958
02274+2482
23381-3857
..........
I want to take each of these strings and search through a few dozen files (all named wds000.dat, wds005.dat, wds010.dat, etc) for matches. If one of them finds a match, I want to write that string to a new file, so in the end I have a list of strings that had matches.
If I need to be more clear about something, please let me know. Any help on where to start with this would be much appreciated. Thanks guys and gals!
Something like this should work
import os

#### your array ####
myarray = {"12482-2958", "02274+2482", "23381-3857"}

path = os.path.expanduser("path/to/myfile")
newpath = os.path.expanduser("path/to/myResultsFile")
filename = 'matches.data'
newf = open(os.path.join(newpath, filename), "w+")

#### loop through every element in the above array ####
for element in myarray:
    #### list the directory where all of your .dat files are ####
    files = os.listdir(path)
    for f in files:
        if f.strip().endswith(".dat"):
            openfile = open(os.path.join(path, f), 'r')  # text mode, so the substring test works on str
            #### loop through every line in the file comparing the strings ####
            for line in openfile:
                if element in line:
                    newf.write(line)
            openfile.close()
newf.close()
Define a function that takes a path and a string and checks for a match.
You can use open(), find(), and close().
Then just build all the paths in a for loop; for every path, check all the strings with the function and print to a file if needed.
Not explained much... need more explanation?
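(A minimal sketch of that approach; the names strings, paths, and matches.txt are illustrative, not from the answer:)

def file_contains(path, needle):
    # return True if any line of the file at path contains needle
    with open(path) as f:
        for line in f:
            if line.find(needle) != -1:  # or simply: needle in line
                return True
    return False

with open('matches.txt', 'w') as out:
    for needle in strings:
        if any(file_contains(path, needle) for path in paths):
            out.write(needle + '\n')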
Not very Pythonic, and it probably has something to straighten out, but this is pretty much the logic to follow:
from glob import glob

strings = ['12482-2958', ...]  # your strings
output = []
for file in glob('wds*.dat'):  # the files are named wds000.dat, wds005.dat, etc.
    with open(file) as f:
        for line in f:
            for subs in strings:
                if subs in line:
                    output.append(line)
print(output)

Differences between enumerate(fileinput.input(file)) and enumerate(file)

I'm looking for some help with my code, which is right below:
for file in file_name:
    if os.path.isfile(file):
        for line_number, line in enumerate(fileinput.input(file, inplace=1)):
            print file
            os.system("pause")
            if line_number == 1:
                line = line.replace('Object', '#Object')
            sys.stdout.write(line)
I wanted to modify some previously extracted files in order to plot them with matplotlib. To do so, I remove some lines and comment out some others.
My problem is the following:
Using for line_number, line in enumerate(fileinput.input(file, inplace=1)): gives me only 4 out of my 5 previously extracted files (even though the file_name list contains 5 references!)
Using for line_number, line in enumerate(file): gives me all 5 previously extracted files, BUT I don't know how to make modifications to the same file without creating another one...
Do you have any ideas about this issue? Is this normal behaviour?
There are a number of things that might help you.
Firstly, file_name appears to be a list of file names. It might be better named file_names, and then you could use file_name for each individual one. You have verified that this does hold 5 entries.
The enumerate() function is used when iterating over a list of items to provide both an index and the item for each loop. This saves you having to use a separate counter variable, e.g.
for index, item in enumerate(["item1", "item2", "item3"]):
    print index, item
would print:
0 item1
1 item2
2 item3
This is not really required here, as you have chosen to use the fileinput library. It is designed to take a list of files and iterate over all of the lines in all of the files in one single loop. As such, you need to tweak your approach a bit. Assuming your list of files is called file_names, you would write something as follows:
# Keep only the entries that are actual files
file_names = [file_name for file_name in file_names if os.path.isfile(file_name)]

# Iterate over all lines in all files
for line in fileinput.input(file_names, inplace=1):
    if fileinput.filelineno() == 1:
        line = line.replace('Object', '#Object')
    sys.stdout.write(line)
The main point here is that it is better to pre-filter any non-filenames before passing the list to fileinput. I will leave it up to you to fix the output.
fileinput provides a number of functions to help you figure out which file or line number is currently being processed.
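(A quick sketch of those helpers; each call reports on the line most recently read:)

import fileinput

for line in fileinput.input(file_names):
    # filename(): the current file; lineno(): the cumulative line number;
    # filelineno(): the line number within the current file
    print fileinput.filename(), fileinput.lineno(), fileinput.filelineno()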
Assuming you're still having trouble, my typical approach is to open a file read-only, read its contents into a variable, close the file, make an edited variable, open the file to write (wiping out original file), and finally write the edited contents.
I like this approach since I can simply change the file_name that gets written out if I want to test my edits without wiping out the original file.
Also, I recommend naming containers using plural nouns, like @Martin Evans suggests.
import os

file_names = ['file_1.txt', 'file_2.txt', 'file_3.txt', 'file_4.txt', 'file_5.txt']
file_names = [x for x in file_names if os.path.isfile(x)]  # see @Martin's answer again

for file_name in file_names:
    # Open read-only and put contents into a list of line strings
    with open(file_name, 'r') as f_in:
        lines = f_in.read().splitlines()

    # Put the lines you want to write out in out_lines
    out_lines = []
    for index_no, line in enumerate(lines):
        if index_no == 1:
            out_lines.append(line.replace('Object', '#Object'))
        # elif ...:  # other cases as needed
        else:
            out_lines.append(line)

    # Uncomment to write to a different file name while testing edits
    # with open(file_name + '.out', 'w') as f_out:
    #     f_out.write('\n'.join(out_lines))

    # Write out the file, clobbering the original
    with open(file_name, 'w') as f_out:
        f_out.write('\n'.join(out_lines))

Downside with this approach is that each file needs to be small enough to fit into memory twice (lines + out_lines).
Best of luck!

Python iterating through two files by line at the same time

I am trying to compare columns in two files to see if the values match, and if there is a match I want to merge/concatenate the data for that row together. My issue is that when reading line by line from the two files separately, I can't get Python to iterate through the files together and look for a match. Instead it iterates properly through one file while going over the same line in the second file multiple times...
I have had this issue in the past and still haven't really found a way around it. I know that indentation is one problem, since I mess with the loop by using "for line in a, for line in b", so I thought that what I tried below would work, but it hasn't. I have looked around for solutions but nobody seems to be using the same method, so I wonder if I am completely off track as to how to do this. Can anyone explain a better way to do this, and whether my method would work at all and if not, why not? Thanks, it is much appreciated!
These are the formats of my two files; basically I want to compare the filename columns in both files and, if they match, merge the rows together.
file1:
cluster_id hypothesis_id filename M1_name_offset Orientation
1 71133076 unique_name_1.png esc_sox2_Sox1_80_4 forward
1 50099120 unique_name_4.png hb_cebpb_ETS1_139_7 forward
1 91895576 unique_name_11.png he_tal1_at_AC_acptr_258_11 forward
file2:
Name Cluster_No Pattern filename
esc_sox2_Sox1_80 Cluster1 AP1(1N)ETS unique_name_4.png
hb_cebpb_ETS1_139 Cluster1 CREB(1N)ETS unique_name_11.png
he_tal1_at_AC_acptr_258 Cluster2 ETS(-1N)ZIC unique_name_3.png
What I have tried:
for aline in file1:
    motif1 = aline.split()[2]
    for bline in file2:
        motif2 = bline.split()[-1]
        if motif1 == motif2:
            print "match", aline, bline
I have also tried:
for aline in file1:
    motif1 = aline.split()[2]
    for bline in file2:
        motif2 = bline.split()[-1]
    if motif1 == motif2:
        print "match", aline, bline
I have also tried using string formatting but that didn't make a difference. The first way iterates through file2 incorrectly and the second way doesn't give me any output. I have played around with it a lot and tried various indentations and extra bits but I am stumped as to how to even try and fix it! Please help me :(
Use the zip builtin function.
with open(file1) as f1, open(file2) as f2:
    for line1, line2 in zip(f1, f2):
        motif1 = line1.split()[0]
        motif2 = line2.split()[0]
        ...
Note that zip behaves differently in Python 2 and Python 3. In Python 2 it would be more efficient to use itertools.izip instead.
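(A Python 2 sketch of the same pairing with izip:)

from itertools import izip  # Python 2 only

with open(file1) as f1, open(file2) as f2:
    for line1, line2 in izip(f1, f2):
        # process the paired lines here
        pass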
I'm assuming you're using Python 3. Here's a nice abstraction, iterlines. It hides the complexity of opening, reading, pairing, and closing n files. Note the use of zip_longest; this prevents the ends of longer files from being silently discarded.
def iterlines(*paths, fillvalue=None, **open_kwargs):
    files = []
    try:
        for path in paths:
            files.append(open(path, **open_kwargs))
        for lines in zip_longest(*files, fillvalue=fillvalue):
            yield lines
    finally:
        for file_ in files:
            with suppress(OSError):  # suppress() needs the exception type to actually swallow it
                file_.close()
Usage
for line_a, line_b in iterlines('a.txt', 'b.txt'):
    print(line_a, line_b)
Complete code
from contextlib import suppress
from itertools import zip_longest

def iterlines(*paths, fillvalue=None, **open_kwargs):
    files = []
    try:
        for path in paths:
            files.append(open(path, **open_kwargs))
        for lines in zip_longest(*files, fillvalue=fillvalue):
            yield lines
    finally:
        for file_ in files:
            with suppress(OSError):  # suppress() needs the exception type to actually swallow it
                file_.close()

for lines in iterlines('a.txt', 'b.txt', 'd.txt'):
    print(lines)

How do I modify the last line of a file?

The last line of my file is:
29-dez,40,
How can I modify that line so that it reads:
29-Dez,40,90,100,50
Note: I don't want to write a new line. I want to take the same line and put new values after 29-Dez,40,
I'm new at Python. I'm having a lot of trouble manipulating files, and every example I look at seems difficult to me.
Unless the file is huge, you'll probably find it easier to read the entire file into a data structure (which might just be a list of lines), and then modify the data structure in memory, and finally write it back to the file.
On the other hand, maybe your file is really huge, multiple GBs at least. In that case: the last line is probably terminated with a newline character, so if you seek to that position you can overwrite it with the new text at the end of the last line.
So perhaps:
f = open("foo.file", "wb")
f.seek(-len(os.linesep), os.SEEK_END)
f.write("new text at end of last line" + os.linesep)
f.close()
(Modulo line endings on different platforms)
To expand on what Doug said, in order to read the file contents into a data structure you can use the readlines() method of the file object.
The below code sample reads the file into a list of "lines", edits the last line, then writes it back out to the file:
#!/usr/bin/python
MYFILE="file.txt"
# read the file into a list of lines
lines = open(MYFILE, 'r').readlines()
# now edit the last line of the list of lines
new_last_line = (lines[-1].rstrip() + ",90,100,50")
lines[-1] = new_last_line
# now write the modified list back out to the file
open(MYFILE, 'w').writelines(lines)
If the file is very large then this approach will not work well, because it reads all of the file's lines into memory and then writes them all back out, which is very inefficient. For a small file, however, this will work fine.
Don't work with files directly; make a data structure that fits your needs in the form of a class, and give it methods to read from and write to the file.
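(A minimal sketch of that idea; the class name, method names, and file name are all illustrative:)

class LineFile:
    # holds a file's lines in memory; edit them, then save back out
    def __init__(self, path):
        self.path = path
        with open(path) as f:
            self.lines = f.read().splitlines()

    def append_to_last_line(self, suffix):
        self.lines[-1] += suffix

    def save(self):
        with open(self.path, 'w') as f:
            f.write('\n'.join(self.lines) + '\n')

doc = LineFile('file.txt')
doc.append_to_last_line('90,100,50')
doc.save()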
I recently wrote a script to do something very similar to this. It would traverse a project, find all module dependencies and add any missing import statements. I won't clutter this post up with the entire script, but I'll show how I went about modifying my files.
import os
from mmap import mmap

def insert_import(filename, text):
    if len(text) < 1:
        return
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()
    m.resize(origSize + len(text))
    pos = 0
    while True:
        l = m.readline()
        if l.startswith(('import', 'from')):
            continue
        else:
            pos = m.tell() - len(l)
            break
    m[pos + len(text):] = m[pos:origSize]
    m[pos:pos + len(text)] = text
    m.close()
    f.close()
Summary: This snippet takes a filename and a blob of text to insert. It finds the last import statement already present, and sticks the text in at that location.
The part I suggest paying most attention to is the use of mmap. It lets you work with files in the same manner you may work with a string. Very handy.
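(A tiny illustration of that string-like behaviour, under the same Python 2 assumptions as the snippet above; test.py is a placeholder file name:)

import os
from mmap import mmap

f = open('test.py', 'r+')
m = mmap(f.fileno(), os.path.getsize('test.py'))
print m[0:10]           # slice it like a string
print m.find('import')  # search it like a string
m.close()
f.close()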
