I am having a bit of trouble figuring the following out:
I have a file with 100 lines for example, let's call it file A
I also have another file with 100 lines for example, let's call it file B
Now I need the first loop to read 10 lines from file A and do its thing, then go to the second loop that reads 10 lines from file B and does its thing, then go back to the first loop to process lines 11-20 from file A, and then back to the second loop for lines 11-20 from file B.
I need both loops to remember from which line to read.
How should I approach this?
Thanks!
EDIT:
Could something like this work?
a = 0
b = 10
x = 0
y = 10
for 1000 times:
    read rows a-b:
        do its thing
    a += 10
    b += 10
    read rows x-y:
        do its thing
    x += 10
    y += 10
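A runnable version of that sketch might look like this (the file names are placeholders, and both files are assumed to fit in memory since they are read into lists first):
with open("fileA.txt") as fa, open("fileB.txt") as fb:
    rows_a = fa.readlines()
    rows_b = fb.readlines()

a, b = 0, 10   # current window into file A
x, y = 0, 10   # current window into file B
while a < len(rows_a) or x < len(rows_b):
    for row in rows_a[a:b]:
        pass  # do its thing with file A's rows
    a += 10
    b += 10
    for row in rows_b[x:y]:
        pass  # do its thing with file B's rows
    x += 10
    y += 10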
You can iterate over 10 lines at a time using this approach.
class File:
    def __init__(self, filename):
        self.f = open(filename, 'r')

    def line(self):
        yield self.f.readline()

    def next(self, limit):
        # yield up to `limit` lines (empty strings once EOF is reached)
        for each in range(limit):
            yield self.f.readline()

    def lines(self, limit=10):
        return [x for x in self.next(limit=limit)]

file1 = File('C:\\Temp\\test.csv')
file2 = File('C:\\Temp\\test2.csv')

print(file1.lines(10))
print(file2.lines(10))
print(file1.lines(10))
print(file2.lines(10))
Now you can jump back and forth between files iterating over the next 10 lines.
Here is another solution using a generator and a context manager:
class SwitchFileReader():
    def __init__(self, file_paths, lines=10):
        self.file_paths = file_paths
        self.file_objects = []
        self.lines = 1 if lines < 1 else lines

    def __enter__(self):
        for file in self.file_paths:
            file_object = open(file, "r")
            self.file_objects.append(file_object)
        return self

    def __exit__(self, type, value, traceback):
        for file in self.file_objects:
            file.close()

    def __iter__(self):
        while True:
            next_lines = [
                [file.readline() for _ in range(self.lines)]
                for file in self.file_objects
            ]
            if any(not all(lines) for lines in next_lines):
                break
            for lines in next_lines:
                yield lines

file_a = r"D:\projects\playground\python\stackgis\data\TestA.txt"
file_b = r"D:\projects\playground\python\stackgis\data\TestB.txt"

with SwitchFileReader([file_a, file_b], 10) as file_changer:
    for next_lines in file_changer:
        print(next_lines, end="")  # do your thing
The iteration will stop as soon as any of the files has fewer remaining lines than requested.
Assuming file_a has 12 lines and file_b has 13 lines, lines 11 and 12 from file_a and lines 11 to 13 from file_b would be ignored.
For simplicity I'm going to work with lists; you can read each file into a list.
Let's split the problem. We need to:
group each list into chunks of some size, 10 in your case;
loop over the chunks of both lists at the same time.
Grouping
Here is an answer: https://stackoverflow.com/a/4998460/2681662
def group_by_each(lst, N):
    return [lst[n:n+N] for n in range(0, len(lst), N)]
Looping over two lists at the same time:
You can use zip for this.
lst1 = list(range(100))       # <- Your data
lst2 = list(range(100, 200))  # <- Your second data

def group_by_each(lst, N):
    return [lst[n:n+N] for n in range(0, len(lst), N)]

for ten1, ten2 in zip(group_by_each(lst1, 10), group_by_each(lst2, 10)):
    print(ten1)
    print(ten2)
When you iterate over a file object, it yields lines in the associated file.
You just need a single loop that grabs the next ten lines from both files each iteration. In this example, the loop will end as soon as either file is exhausted:
from itertools import islice
lines_per_iter = 10
file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")
while (a := list(islice(file_a, lines_per_iter))) and (b := list(islice(file_b, lines_per_iter))):
print(f"Next {lines_per_iter} lines from A: {a}")
print(f"Next {lines_per_iter} lines from B: {b}")
file_a.close()
file_b.close()
Ok, thank you for all the answers. I found a working solution for my project like this:
a = 0
b = 10
x = 0
y = 10

while True:
    for list1 in range(a, b):
        ...  # read the lines from file A
    a += 10
    b += 10
    for list2 in range(x, y):
        ...  # read the lines from file B
    if y == 100:
        break
    x += 10
    y += 10
I know it's been a long time since this question was asked, but I still feel like answering it in my own way for future viewers and future reference. I'm not exactly sure if this is the best way to do it, but it can read multiple files simultaneously, which is pretty cool.
from itertools import islice, chain
from pprint import pprint
def simread(files, nlines_segments, nlines_contents):
    lines = [[] for i in range(len(files))]
    total_lines = sum(nlines_contents)
    current_index = 0
    while len(tuple(chain(*lines))) < total_lines:
        if len(lines[current_index]) < nlines_contents[current_index]:
            lines[current_index].extend(islice(
                files[current_index],
                nlines_segments[current_index],
            ))
        current_index += 1
        if current_index == len(files):
            current_index = 0
    return lines
with open('A.txt') as A, open('B.txt') as B:
    lines = simread(
        [A, B],      # files
        [10, 10],    # lines to read at a time from each file
        [100, 100],  # number of lines in each file
    )  # returns two lists containing the lines in files A and B
    pprint(lines)
You can even add another file C (with any number of lines, even a thousand) like so:
with open('A.txt') as A, open('B.txt') as B, open('C.txt') as C:
    lines = simread(
        [A, B, C],         # files
        [10, 10, 100],     # lines to read at a time from each file
        [100, 100, 1000],  # number of lines in each file
    )  # returns three lists containing the lines in files A, B, and C
    pprint(lines)
The values in nlines_segments can also be changed, like so:
with open('A.txt') as A, open('B.txt') as B, open('C.txt') as C:
    lines = simread(
        [A, B, C],         # files
        [5, 20, 125],      # lines to read at a time from each file
        [100, 100, 1000],  # number of lines in each file
    )  # returns three lists containing the lines in files A, B, and C
    pprint(lines)
This would read file A five lines at a time, file B twenty lines at a time, and file C 125 lines at a time.
NOTE: The values provided in nlines_segments all have to be factors of their corresponding values in nlines_contents, which should all be the exact number of lines in the files they correspond to.
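If you want to guard against mismatched arguments, a small sanity check along these lines could be run before calling simread (this helper is hypothetical, not part of the answer above):
def check_simread_args(nlines_segments, nlines_contents):
    # Each per-call segment size must evenly divide the corresponding file length.
    for segment, total in zip(nlines_segments, nlines_contents):
        if total % segment != 0:
            raise ValueError(f"{segment} does not evenly divide {total}")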
I hope this helps!
There are already a billion answers, but I just felt like answering this in a simple way.
with open('fileA.txt', 'r') as a:
    a_lines = a.readlines()
a_prog = 0

with open('fileB.txt', 'r') as b:
    b_lines = b.readlines()
b_prog = 0

for i in range(10):
    temp = []
    for line in range(a_prog, a_prog + 10):
        temp.append(a_lines[line].strip())
    a_prog += 10
    # temp is the full 10-line block from file A.
    # Do something...

    temp = []
    for line in range(b_prog, b_prog + 10):
        temp.append(b_lines[line].strip())
    b_prog += 10
    # temp is the full 10-line block from file B.
    # Do something...
Related
I am writing a code to take an enormous textfile (several GB) N lines at a time, process that batch, and move onto the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size).
I have been reading about using itertools islice for this operation. I think I am halfway there:
from itertools import islice

N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)

for lines in lines_gen:
    ...process my lines...
The trouble is that I would like to go on and process the next batch of 16 lines, but I am missing something.
islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines. At the end of the file, the list might be shorter, and finally the call will return an empty list.
from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines
An alternative is to use the grouper pattern:
from itertools import zip_longest

with open(...) as f:
    for next_n_lines in zip_longest(*[f] * n):
        # process next_n_lines
The question appears to presume that there is efficiency to be gained by reading an "enormous textfile" in blocks of N lines at a time. This adds an application layer of buffering over the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing.
Thus:
with open('my_very_large_text_file') as f:
    for line in f:
        process(line)
is probably superior to any alternative in time, space, complexity and readability.
See also Rob Pike's first two rules, Jackson's Two Rules, and PEP-20 The Zen of Python. If you really just wanted to play with islice you should have left out the large file stuff.
Here is another way using groupby:
from itertools import count, groupby

N = 16
with open('test') as f:
    for g, group in groupby(f, key=lambda _, c=count(): c.next()/N):
        print list(group)
How it works:
Basically, groupby() groups the lines by the return value of the key function. The key here is the lambda lambda _, c=count(): c.next()/N, which relies on the fact that the c argument is bound to count() when the function is defined, so each time groupby() calls the lambda it evaluates the return value to determine the group the line belongs to:
# 1 iteration.
c.next() => 0
0 / 16 => 0
# 2 iteration.
c.next() => 1
1 / 16 => 0
...
# Start of the second grouper.
c.next() => 16
16/16 => 1
...
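For anyone on Python 3, a rough port of the same idea might look like this (the file name 'test' is kept from the snippet above; next(c) replaces c.next(), and // keeps the integer division the grouping relies on):
from itertools import count, groupby

N = 16
with open('test') as f:
    counter = count()
    # every run of N consecutive lines shares the same key value, so groupby
    # yields the file in chunks of N lines
    for _, group in groupby(f, key=lambda line: next(counter) // N):
        print(list(group))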
Since the requirement was added that there be statistically uniform distribution of the lines selected from the file, I offer this simple approach.
"""randsamp - extract a random subset of n lines from a large file"""
import random
def scan_linepos(path):
"""return a list of seek offsets of the beginning of each line"""
linepos = []
offset = 0
with open(path) as inf:
# WARNING: CPython 2.7 file.tell() is not accurate on file.next()
for line in inf:
linepos.append(offset)
offset += len(line)
return linepos
def sample_lines(path, linepos, nsamp):
"""return nsamp lines from path where line offsets are in linepos"""
offsets = random.sample(linepos, nsamp)
offsets.sort() # this may make file reads more efficient
lines = []
with open(path) as inf:
for offset in offsets:
inf.seek(offset)
lines.append(inf.readline())
return lines
dataset = 'big_data.txt'
nsamp = 5
linepos = scan_linepos(dataset) # the scan only need be done once
lines = sample_lines(dataset, linepos, nsamp)
print 'selecting %d lines from a file of %d' % (nsamp, len(linepos))
print ''.join(lines)
I tested it on a mock data file of 3 million lines comprising 1.7GB on disk. The scan_linepos dominated the runtime taking about 20 seconds on my not-so-hot desktop.
Just to check the performance of sample_lines I used the timeit module as so
import timeit

t = timeit.Timer('sample_lines(dataset, linepos, nsamp)',
        'from __main__ import sample_lines, dataset, linepos, nsamp')
trials = 10 ** 4
elapsed = t.timeit(number=trials)
print u'%dk trials in %.2f seconds, %.2fµs per trial' % (trials/1000,
        elapsed, (elapsed/trials) * (10 ** 6))
I ran this for various values of nsamp; when nsamp was 100, a single sample_lines call completed in 460µs and scaled linearly up to 10k samples at 47ms per call.
The natural next question is "is random barely random at all?", and the answer is "sub-cryptographic but certainly fine for bioinformatics".
I used the grouper function from What is the most “pythonic” way to iterate over a list in chunks?:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)

with open(filename) as f:
    for lines in grouper(f, chunk_size, ""):  # for every chunk_sized chunk
        """process lines like
        lines[0], lines[1], ..., lines[chunk_size-1]"""
Assuming "batch" means to want to process all 16 recs at one time instead of individually, read the file one record at a time and update a counter; when the counter hits 16, process that group. interim_list = []
infile = open("my_very_large_text_file", "r")
ctr = 0
for rec in infile:
interim_list.append(rec)
ctr += 1
if ctr > 15:
process_list(interim_list)
interim_list = []
ctr = 0
the final group
process_list(interim_list)
Another solution might be to create an iterator that yields lists of n elements:
def n_elements(n, it):
    try:
        while True:
            yield [next(it) for j in range(0, n)]
    except StopIteration:
        return

with open(filename, 'rt') as f:
    for n_lines in n_elements(n, f):
        do_stuff(n_lines)
Python 3.7 question.
I do have a file looking like this:
1
10 10 10
3
25 29 10
52 55 30
70 70 20
0
where 1 means one line of data follows, 3 means three lines follow, and 0 marks the end of the file. How can I parse this?
I've tried
def read_each_course(filename):
    with open(filename, 'r') as f:
        lines = []
        content = f.readlines()
        lines += [x.rstrip() for x in content]
        for i in lines:
            while True:
                if str(i).count(" ") == 0:
                    lines_to_read = int(i)
                    break
            return lines_to_read, next(i)
but that won't work; I get
TypeError: 'str' object is not an iterator
for the next(i).
My idea was to get a list of lists as the items like:
[[1, [10, 10, 10]], [3, [25, 29, 10], [52, 55, 30], [70, 70, 20]]]
BUT, I am unsure if that design is a good idea in general. Or should it rather be a singly linked list? The ultimate goal is that, as the 3 numbers are coordinates, I'll only ever need the next item, e.g. x2-x1 and y2-y1, plus a penalty if a point is left out (additional cost), where the total cost is the hypotenuse of the xy triangle; that part is fine, I can calculate it.
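As an aside, a minimal sketch of that hypotenuse cost, under the assumption that each parsed data line is an (x, y, penalty) triple; the function name and the exact cost rule are placeholders, not part of the question:
import math

def step_cost(p1, p2):
    # p1 and p2 are (x, y, penalty) triples; the cost of moving from p1 to p2
    # is the hypotenuse built from the x and y differences.
    x1, y1, _ = p1
    x2, y2, _ = p2
    return math.hypot(x2 - x1, y2 - y1)

print(step_cost((25, 29, 10), (52, 55, 30)))  # ~37.48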
This revision of the answer by RomainL simplifies the logic.
It makes use of iterators to parse the file.
def read_each_course(filename):
    result = []
    with open(filename) as f:
        it = iter(f)
        while True:
            count = int(next(it))
            if count == 0:  # Found the stop marker
                break
            current = [count]
            for _ in range(count):
                current.append([int(v) for v in next(it).strip().split()])
            result.append(current)
    return result

print(read_each_course("file2.txt"))
Output as required.
This code should do the trick. As for your design question, I have no idea; it seems to me to be opinion-based, so I will focus on the code.
In your code, lines is a list and i is an element of that list; you are calling next on one of the list elements, not on an iterator over the list. I have to admit that I do not understand the logic of your code, so I cannot really help beyond that.
def read_each_course(filename):
    result = []
    current = []
    with open(filename) as f_in:
        for line in f_in:  # loop over the file line by line
            spt = line.strip().split()  # split
            if len(spt) == 1:  # one element on the line
                if current:  # not the first one
                    result.append(current)
                    current = []
                if spt[0] == "0":  # end-of-file marker (compare as a string)
                    break
                current.append(int(spt[0]))
            else:
                current.append(list(map(int, spt)))
    return result
I'm trying to read the lines of a file into a list so that every N lines end up in the same tuple. Assuming the file is valid, so the number of lines is a multiple of N, how can I achieve this?
The way I read the lines into the list:
def readFileIntoAList(file, N):
    lines = list()
    with open(file) as f:
        lines = [line.rstrip('\n') for line in f]
    return lines
What change do I have to make so that the result is a list of tuples, each of length N? For example, I have the following file content:
ABC
abc xyz
123
XYZ
xyz abc
321
The output will be:
[("ABC","abc xyz","123"),("XYZ,"xyz abc",321")]
You could try using a chunking function:
def readFileIntoAList(file, n):
    with open(file) as f:
        lines = f.readlines()
    return [lines[i:i + n] for i in range(0, len(lines), n)]
This will split the list of lines in the file into evenly sized chunks.
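If you specifically need tuples of stripped lines, as in the expected output above, a small variant of the same idea could look like this (the file name is a placeholder):
def read_file_into_tuples(file, n):
    with open(file) as f:
        lines = [line.rstrip('\n') for line in f]
    # slice the stripped lines into chunks of n and turn each chunk into a tuple
    return [tuple(lines[i:i + n]) for i in range(0, len(lines), n)]

print(read_file_into_tuples('data.txt', 3))
# e.g. [('ABC', 'abc xyz', '123'), ('XYZ', 'xyz abc', '321')]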
One way would be:
>>> data = []
>>> N = 3
>>> with open('/tmp/data') as f:
...     while True:
...         chunk = []
...         for i in range(N):
...             chunk.append(f.readline().strip('\n'))
...         if any(True for c in chunk if not c):
...             break
...         data.append(tuple(chunk))
...
>>> print(data)
[('ABC', 'abc xyz', '123'), ('XYZ', 'xyz abc', '321')]
Note that this assumes the file has the right number of lines. Having the wrong number of lines in the above code can lead to an infinite loop. A solution without that risk is:
data = []
N = 3
with open('/tmp/data') as f:
    i = 0
    chunk = []
    for line in f:
        chunk.append(line.strip('\n'))
        i += 1
        if i % N == 0 and i != 0:
            data.append(tuple(chunk))
            chunk = []
Both of these approaches avoid reading the whole file into memory at once, which should be more efficient when you process large datasets.
You can use itertools.islice():
from itertools import islice
N = 3 # chunk size
with open("filename") as f:
lines = []
chunk = tuple(s.strip() for s in islice(f, N))
while chunk:
lines.append(chunk)
chunk = tuple(s.strip() for s in islice(f, N))
Also you can use map() if you prefer functional style:
chunk = tuple(map(str.strip, islice(f, N)))
import math

def readFileIntoAList(file, N):
    lines = list()
    lines1 = list()
    with open(file) as f:
        lines1 = [lineNew.rstrip("\n") for lineNew in f]
    for a in range(math.ceil(len(lines1)/N)):
        lines.append((*lines1[a*N:(a+1)*N],))
    return lines
I used a loop and tried to keep it simple.
There are two files, say FileA and FileB, and we need to find all the numbers that are in FileA but not in FileB. All the numbers in FileA are sorted and all the numbers in FileB are sorted. For example,
Input:
FileA = [1, 2, 3, 4, 5, ...]
FileB = [1, 3, 4, 6, ...]
Output:
[2, 5, ...]
The memory is very limited and even one entire file cannot be loaded into memory at a time. Also linear or lesser time complexity is needed.
So if the files were small enough to fit in memory, we could load them, initialize their contents as two sets, and then take a set difference, solving the problem in linear time.
set(contentsofFileA)-set(contentsofFileB)
But since the files are so big, they won't be able to load entirely into the memory and so this is not possible.
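For reference, that in-memory baseline might look like this when both files do fit (the file names are placeholders):
# Only workable when both files are small enough to load entirely.
with open("fileA.txt") as fa, open("fileB.txt") as fb:
    a_nums = {int(line) for line in fa}
    b_nums = {int(line) for line in fb}
print(sorted(a_nums - b_nums))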
Another approach would be a brute force method with batch processing: load a chunk or batch of data from FileA, then a batch from FileB, compare them, then the next chunk from FileB, and so on. After the FileA chunk has been checked against all the elements in FileB, load the next batch from FileA and continue. But this has O(n^2), or quadratic, time complexity and is not efficient for very large files with many entries.
The problem is required to be solved in linear or lesser time complexity and without loading the entire files into memory. Any help?
If you want to read the files line by line, since you don't have much memory, and you need a linear solution, you can do this with iter if your files are line-based; otherwise see this:
First in your terminal you can do this to generate some test files:
seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt
Then you run this code:
i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
try:
if a < b:
aNotB += [a]
a = int(next(i1, None))
elif a > b:
# bNotA += [a]
b = int(next(i2, None))
elif a == b:
a = int(next(i1, None))
b = int(next(i2, None))
except TypeError:
if not b:
aNotB += list(i1)
break
else:
# bNotA += list(i1)
break
print(aNotB)
Output:
[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]
If you want both the result for aNotB and bNotA you can uncomment those two lines.
Timing comparison with Andrej Kesely's answer:
$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py
python manual_iter.py 0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py
python heapq_groupby.py 1.11s user 0.00s system 99% cpu 1.116 total
As the files are sorted, you can just iterate through them one line at a time. If the current line of file A is less than the line of file B, you know A's line is not in B, so you increment file A only and check again. If the line in A is greater than the line in B, you know B's line is not in A, so you increment file B only. If A and B are equal, the line is in both, so you increment both files. While in your original question you stated you were interested in entries which are in A but not B, this answer extends that and also gives the entries in B not A. This adds flexibility but still allows you to print just those in A not B.
def strip_read(file):
    return file.readline().rstrip()

in_a_not_b = []
in_b_not_a = []
with open("fileA") as A:
    with open("fileB") as B:
        Aline = strip_read(A)
        Bline = strip_read(B)
        while Aline or Bline:
            if Aline < Bline and Aline:
                in_a_not_b.append(Aline)
                Aline = strip_read(A)
            elif Aline > Bline and Bline:
                in_b_not_a.append(Bline)
                Bline = strip_read(B)
            else:
                Aline = strip_read(A)
                Bline = strip_read(B)

print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)
OUTPUT for my sample Files
in A not in B ['2', '5', '7']
in B not in A ['6']
You can combine itertools.groupby (doc) and heapq.merge (doc) to iterate through FileA and FileB lazily (it works as long the files are sorted!)
FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]

from itertools import groupby
from heapq import merge

gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
    if any(v[1] == 'FileB' for v in g):
        continue
    print(v)
Prints:
2
5
EDIT (Reading from files):
from itertools import groupby
from heapq import merge

gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
    if any(v[1] == 2 for v in g):
        continue
    print(v)
Benchmark:
Generating files with 10_000_000 items:
seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt
The script takes ~10sec to complete:
real 0m10,656s
user 0m10,557s
sys 0m0,076s
A simple solution based on file reading (assuming that each line holds a number):
results = []
with open('file1.csv') as file1, open('file2.csv') as file2:
    var1 = file1.readline()
    var2 = file2.readline()
    while var1:
        while var1 and var2:
            if int(var1) < int(var2):
                results.append(int(var1))
                var1 = file1.readline()
            elif int(var1) > int(var2):
                var2 = file2.readline()
            elif int(var1) == int(var2):
                var1 = file1.readline()
                var2 = file2.readline()
        if var1:
            results.append(int(var1))
            var1 = file1.readline()

print(results)
output = [2, 5, 7, 9]
This is similar to the classic Knuth Sorting and Searching.
You may wish to consider reading this Stack Overflow question, these online lecture notes (PDF), and Wikipedia.
The Stack Overflow question mentions something I agree with, which is using the unix sort command. Always, always test with your own data to ensure the chosen method is the most efficient for your data, because some of these algorithms are data-dependent.
I'm new to Python and trying to do a nested loop. I have a very large file (1.1 million rows), and I'd like to use it to create a file that has each line along with the next N lines, for example with the next 3 lines:
1 2
1 3
1 4
2 3
2 4
2 5
Right now I'm just trying to get the loops working with rownumbers instead of the strings since it's easier to visualize. I came up with this code, but it's not behaving how I want it to:
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    for i, line in enumerate(f):
        line_a = i
        lower_bound = i + 1
        upper_bound = i + 4
        with open('C:/working_file.txt', mode='r', encoding='utf8') as g:
            for j, line in enumerate(g):
                while j >= lower_bound and j <= upper_bound:
                    line_b = j
                    j = j + 1
                    print(line_a, line_b)
Instead of the output I want like above, it's giving me this:
990 991
990 992
990 993
990 994
990 992
990 993
990 994
990 993
990 994
990 994
As you can see the inner loop is iterating multiple times for each line in the outer loop. It seems like there should only be one iteration per line in the outer loop. What am I missing?
EDIT: My question was answered below, here is the exact code I ended up using:
from collections import deque
from itertools import cycle

log = open('C:/example.txt', mode='w', encoding='utf8')

try:
    xrange
except NameError:  # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in range(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)

for l in window(open('c:/working_file.txt', mode='r', encoding='utf8'), 100):
    for a, b in l:
        print(a.strip() + '\t' + b.strip(), file=log)
Based on window example from old docs you can use something like:
from collections import deque
from itertools import cycle

try:
    xrange
except NameError:  # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in xrange(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)
Demo:
>>> for l in window([1,2,3,4,5], 4):
...     for l1, l2 in l:
...         print l1, l2
...
1 2
1 3
1 4
2 3
2 4
2 5
So, basically you can pass your file to window to get desired result:
window(open('C:/working_file.txt', mode='r', encoding='utf8'), 4)
You can do this with slices. This is easiest if you read the whole file into a list first:
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    data = f.readlines()

for i, line_a in enumerate(data):
    for j, line_b in enumerate(data[i+1:i+5], start=i+1):
        print(i, j)
When you change it to printing the lines instead of the line numbers, you can drop the second enumerate and just do for line_b in data[i+1:i+5]. Note that the slice includes the item at the start index, but not the item at the end index, so that needs to be one higher than your current upper bound.
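Spelled out, that simplified version might look like this (same placeholder path as above, tab-separating the pair on each output line):
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    data = f.readlines()

for i, line_a in enumerate(data):
    # data[i+1:i+5] is the window of following lines (the slice end is exclusive)
    for line_b in data[i+1:i+5]:
        print(line_a.strip() + '\t' + line_b.strip())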
Based on alko's answer, I would suggest using the window recipe unmodified
from itertools import islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

for l in window([1,2,3,4,5], 4):
    for item in l[1:]:
        print l[0], item
I think the easiest way to solve this problem would be to read your file into a dictionary...
my_data = {}
for i, line in enumerate(f):
    my_data[i] = line
After that is done you can do
for x in my_data:
    for y in range(1, 4):
        print my_data[x], my_data[x + y]
As written, your code re-reads the million-line file a million times, once for each line of the outer loop...
Since this is quite a big file, you might not want to load it all into memory at once. So, to avoid reading any line more than once, this is what you do (a sketch follows below):
Make a list with N elements, where N is the number of next lines to read.
When you read the first line, add it to the first item in the list.
Add the next line to the first and second items, and so on for each line.
When an item in that list reaches a length of N, take it out and append it to the output file, and add an empty item at the end so you still have a list of N items.
This way you only need to read each line once, and you won't have to load the whole file into memory. You only need to hold, at most, N partial blocks of lines in memory.
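Here is a minimal sketch of that bookkeeping, printing the pairs instead of writing them to a file and assuming N means the number of following lines to pair with each line:
from collections import deque

N = 3  # number of following lines to pair with each line

def pair_with_next(lines, n=N):
    open_blocks = deque()              # each block: [anchor, follower, follower, ...]
    for line in lines:
        line = line.rstrip('\n')
        for block in open_blocks:      # the new line follows every open anchor
            block.append(line)
        open_blocks.append([line])     # the line also starts its own block
        if len(open_blocks[0]) == n + 1:   # oldest anchor has collected n followers
            block = open_blocks.popleft()
            for follower in block[1:]:
                print(block[0], follower)
    for block in open_blocks:          # flush the partial blocks left at EOF
        for follower in block[1:]:
            print(block[0], follower)

pair_with_next(['1', '2', '3', '4', '5'])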