I have a bunch of filenames. I need to read one line from each of these files, do some processing, then read the next line from each of these files, do some more processing, and so on.
I'm looking for suggestions on how to do this in a more Pythonic way. I know the number of lines in each file, so I'm hard-coding it for now, but I'd like not to have to do that.
UPDATE:
The files all have the same number of lines.
UPDATE2:
There are at least 30 different files.
filenames = []
line_count = 400
fileobjs = [open(i, 'r') for i in filenames]
for i in xrange(line_count):
    lines = []
    for each_fo in fileobjs:
        for each_line in each_fo:
            lines.append(each_line)
            break
    process(lines)
What about this?
from itertools import izip_longest

for file_lines in izip_longest(*map(open, filenames)):
    for line in file_lines:
        if line:
            # process line here; shorter files are padded with None
            pass
lines = [next(fo) for fo in fileobjs]
process(lines)
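A related sketch, assuming the same filenames list and process function from the question: contextlib.ExitStack (Python 3.3+) closes every handle when the block ends, and zip stops at the shortest file, so the line count never has to be hard-coded.

from contextlib import ExitStack

with ExitStack() as stack:
    # Open every file and register it so it is closed when the block exits.
    files = [stack.enter_context(open(name)) for name in filenames]
    # zip yields one tuple per "row": the i-th line of every file.
    for lines in zip(*files):
        process(list(lines))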
This will read both files one line at a time, in lockstep:
with open('File1', 'r') as FileA, open('File2', 'r') as FileB:
    for lineA, lineB in zip(FileA, FileB):
        print lineA, lineB
filenames = []
files = [open(f, mode='r') for f in filenames]
# Drive the loop with the first file and read the matching line from the others.
for first_line in files[0]:
    lines = [first_line] + [f.readline() for f in files[1:]]
    process(lines)
I would like to know if it's possible to read the second line of each of the files contained in a zip file.
import zipfile

zf = zipfile.ZipFile('myzip.zip')
for f in zf.namelist():
    csv_f = zf.read(f)
    first_line = csv_f.split('\n', 2)[0]  # ...but how do I get the second line?
Thanks for any help.
Yes. Like this:
with zipfile.ZipFile("myzip.zip") as z:
    for n in z.namelist():
        with z.open(n) as f:
            for i in range(2):
                second_line = next(f)
This reads only the first two lines, without reading the whole file, following the recommendation by @S3DEV. One could be fancier and avoid assigning the first line to second_line at all, but since it is overwritten on the second pass anyway, that doesn't seem worth the trouble.
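A small variant, as a sketch: itertools.islice can jump straight to the second line of each member without the explicit loop (same myzip.zip as above; the lines come back as bytes, so decode them if you need text).

import zipfile
from itertools import islice

with zipfile.ZipFile("myzip.zip") as z:
    for n in z.namelist():
        with z.open(n) as f:
            # islice(f, 1, 2) skips line 0 and yields only line 1.
            second_line = next(islice(f, 1, 2), None)
            if second_line is not None:
                print(second_line.decode().rstrip())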
I think this should be easy, but I haven't been able to solve it yet. I have two files, shown below, and I want to merge them so that the lines starting with > in file1 become the headers of the corresponding lines in file2.
file1:
>seq12
ACGCTCGCA
>seq34
GCATCGCGT
>seq56
GCGATGCGC
file2:
ATCGCGCATGATCTCAG
AGCGCGCATGCGCATCG
AGCAAATTTAGCAACTC
so the desired output should be:
>seq12
ATCGCGCATGATCTCAG
>seq34
AGCGCGCATGCGCATCG
>seq56
AGCAAATTTAGCAACTC
I have tried the code below, but in the output all the lines coming from file2 are the same:
from Bio import SeqIO

with open(file1) as fw:
    with open(file2, 'r') as rv:
        for line in rv:
            items = line
            for record in SeqIO.parse(fw, 'fasta'):
                print('>' + record.id)
                print(line)
If you cannot hold your files in memory, you need a solution that reads one line at a time from each file and writes to the output file accordingly. The following program does that; the comments try to clarify, though I believe it is clear from the code.
with open("file1.txt") as first, open("file2.txt") as second, open("output.txt", "w+") as output:
while 1:
line_first = first.readline() # line from file1 (header)
line_second = second.readline() # line from file2 (body)
if not (line_first and line_second):
# if any file has ended
break
# write to output file
output.writelines([line_first, line_second])
# jump one line from file1
first.readline()
Note that this will only work if file1.txt has the specific format you presented (odd lines are headers, even lines are useless).
In order to allow a bit more customization, you can wrap it up in a function as:
def merge_files(header_file_path, body_file_path, output_file="output.txt", every_n_lines=2):
    with open(header_file_path) as first, open(body_file_path) as second, open(output_file, "w+") as output:
        while 1:
            line_first = first.readline()    # line from the header file
            line_second = second.readline()  # line from the body file
            if not (line_first and line_second):
                # if either file has ended
                break
            # write to output file
            output.writelines([line_first, line_second])
            # skip the remaining n - 1 lines of this header block
            for _ in range(every_n_lines - 1):
                first.readline()
And then calling merge_files("file1.txt", "file2.txt") should do the trick.
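For instance (merged.txt is just a hypothetical output name, and the second call assumes a header file where every header is followed by two lines to discard):

merge_files("file1.txt", "file2.txt")                   # defaults: write output.txt, skip 1 line per header
merge_files("file1.txt", "file2.txt", "merged.txt", 3)  # hypothetical: skip 2 lines after each header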
If both files are small enough to fit in memory at the same time, you can simply read them in full and interleave the lines.
# Open two file handles.
with open("f1", mode="r") as f1, open("f2", mode="r") as f2:
    lines_first = f1.readlines()   # Read all lines in f1.
    lines_second = f2.readlines()  # Read all lines in f2.

lines_out = []
# For each line in the file without headers...
for idx in range(len(lines_second)):
    # Take every even line (1-based) from the first file and prepend it to
    # the line from the second.
    lines_out.append(lines_first[2 * idx + 1].rstrip() + lines_second[idx].rstrip())
You can generate the seq headers very easily given idx: I leave this as an exercise to the reader.
If either or both files are too large to fit in memory, you can repeat the above process line-by-line over both handles (using one variable to store information from the file with headers).
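Since the question already uses Biopython, here is a sketch along those lines: SeqIO.parse streams the records from file1 one at a time, and zip pairs each record with the corresponding line of file2, so neither file has to be held in memory (file1, file2, and output.txt as in the question).

from Bio import SeqIO

with open(file1) as headers, open(file2) as bodies, open("output.txt", "w") as out:
    # zip pairs record 1 with line 1, record 2 with line 2, and so on.
    for record, body in zip(SeqIO.parse(headers, "fasta"), bodies):
        out.write(">" + record.id + "\n")
        out.write(body if body.endswith("\n") else body + "\n")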
I have a large 11 GB .txt file with email addresses. I would like to keep only the part of each address before the # symbol, each on its own line. At the moment my output only generates the first line. I reused this code from an earlier project. I would like to save the output to a different .txt file. I hope someone can help me out.
My code:
import re

def get_html_string(file, start_string, end_string):
    answer = "nothing"
    with open(file, 'rb') as open_file:
        for line in open_file:
            line = line.rstrip()
            if re.search(start_string, line):
                answer = line
                break
    start = answer.find(start_string) + len(start_string)
    end = answer.find(end_string)
    # print(start, end, answer)
    return answer[start:end]

beginstr = ''
end = '#'
file = 'test.txt'
readstring = str(get_html_string(file, beginstr, end))
print readstring
Your file is quite big (11 GB), so you shouldn't keep all those strings in memory. Instead, process the file line by line and write each result before reading the next line.
This should be efficient:
with open('test.txt', 'r') as input_file:
    with open('result.txt', 'w') as output_file:
        for line in input_file:
            prefix = line.split('#')[0]
            output_file.write(prefix + '\n')
If your file looks like this example:
user#google.com
user2#jshds.com
Useruser#jsnl.com
You can use this:
def get_email_name(file_name):
    with open(file_name) as file:
        lines = file.readlines()
        result = list()
        for line in lines:
            result.append(line.split('#')[0])
        return result

get_email_name('emails.txt')
Out:
['user', 'user2', 'Useruser']
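Note that readlines() pulls the whole file into memory, which will not work well for an 11 GB input. A streaming variant of the same idea, as a sketch (emails.txt and result.txt are just example names):

def iter_email_names(file_name):
    # Yield the part before '#' for each line, one line at a time.
    with open(file_name) as fh:
        for line in fh:
            yield line.rstrip('\n').split('#')[0]

with open('result.txt', 'w') as out:
    for name in iter_email_names('emails.txt'):
        out.write(name + '\n')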
I have two files: fileA and fileB. I'd like to get the line numbers of all the lines in fileB that exist in fileA, but I only count a line as "exists in fileA" if the next line is also in it. So I've written the following code:
def compare_two(fileA, fileB):
    with open(fileA, 'r') as fa:
        fa_content = fa.read()
    with open(fileB, 'r') as fb:
        keep_line_num = []  # the line numbers that are not in fileA
        i = 1
        while True:
            line = fb.readline()
            if line == '':  # There are no blank lines in either file
                break
            last_pos = fb.tell()
            theFollowing = line
            new_line = fb.readline()  # get the next line
            theFollowing += new_line
            fb.seek(last_pos)
            if theFollowing not in fa_content:
                keep_line_num.append(i)
            i += 1
    return keep_line_num

compare_two(fileA, fileB)
This works fine for small files, but I want to use it for files as large as 2 GB and this method is too slow. Is there another way to do this in Python 2.7?
Take a look at difflib; it comes with Python.
It can tell you where your files differ or are identical. See also: python difflib comparing files.
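If difflib turns out to be too heavy for 2 GB inputs, here is a rough sketch of a faster approach (note it matches whole pairs of adjacent lines, rather than arbitrary substrings the way theFollowing not in fa_content does): build a set of consecutive-line pairs from fileA once, so every check for fileB becomes a constant-time set lookup instead of a scan over the whole fileA string.

def compare_two_fast(fileA, fileB):
    # Build a set of every pair of consecutive lines in fileA (done once).
    with open(fileA, 'r') as fa:
        a_lines = fa.readlines()
    pairs_in_a = set(a_lines[i] + a_lines[i + 1] for i in range(len(a_lines) - 1))

    keep_line_num = []
    with open(fileB, 'r') as fb:
        prev = fb.readline()
        i = 1
        for line in fb:
            # Constant-time set lookup instead of scanning the whole fa_content string.
            if prev + line not in pairs_in_a:
                keep_line_num.append(i)
            prev = line
            i += 1
        # The very last line of fileB has no following line, so it is not checked here.
    return keep_line_num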
I want to join 100 different files into one.
Example of the data in each file:
example1.txt is in this format:
something
something
somehting
example2.txt is in this format:
something
something
somehting
All 100 files have the same data format and share a common naming pattern, example1 ... example100: the name stays the same and only the number changes.
from itertools import chain

infiles = [open('{}_example.txt'.format(i+1), 'r') for i in xrange(113)]
with open('example.txt', 'w') as fout:
    for lines in chain(*infiles):
        fout.write(lines)
I used this, but the problem is that the first line of the next file gets joined to the last line of the previous file.
If you have 100 files, it is better to just use a list of file objects:
from itertools import izip_longest

separator = ','  # placeholder: use whatever delimiter you want between corresponding lines

infiles = [open('example{}.txt'.format(i+1), 'r') for i in xrange(100)]
with open('Join.txt', 'w') as fout:
    for lines in izip_longest(*infiles, fillvalue=''):
        lines = [line.rstrip('\n') for line in lines]
        print >> fout, separator.join(lines)
I would open a new file as writable, Join.txt, and then loop through the files you want with range(1, 101):
join = open('Join.txt', 'w')
for num in range(1, 101):
    f = open('example' + str(num) + '.txt', 'r')
    for line in f.readlines():
        join.write(line)
    f.close()
join.close()
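If the real problem is that some files do not end with a newline (which is what makes the last line of one file run into the first line of the next), here is a sketch that guards against that, assuming the example1.txt ... example100.txt names:

with open('Join.txt', 'w') as fout:
    for i in range(1, 101):
        with open('example{}.txt'.format(i), 'r') as fin:
            data = fin.read()
            fout.write(data)
            # Make sure the next file starts on a fresh line.
            if data and not data.endswith('\n'):
                fout.write('\n')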