Skip the last row of CSV file when iterating in Python

I am working on a data analysis using a CSV file that I got from a data warehouse (Cognos). The last row of the CSV file sums up all the rows above, and I do not need it for my analysis, so I would like to skip it.
I was thinking about adding an "if" statement that checks a column name within my "for" loop, like below:
import csv
with open('COGNOS.csv', "rb") as f, open('New_COGNOS.csv', "wb") as w:
    # Open 2 CSV files: one to read and the other to save to.
    CSV_raw = csv.reader(f)
    CSV_new = csv.writer(w)
    for row in CSV_raw:
        item_num = row[3].split(" ")[0]
        row.append(item_num)
        if row[0] == "All Materials (By Collection)": break
        CSV_new.writerow(row)
However, this looks like it wastes a lot of resources. Is there a Pythonic way to skip the last row when iterating through a CSV file?

You can write a generator that'll return everything but the last entry in an input iterator:
def skip_last(iterator):
    prev = next(iterator)
    for item in iterator:
        yield prev
        prev = item
then wrap your CSV_raw reader object in that:
for row in skip_last(CSV_raw):
The generator basically takes the first entry, then starts looping and on each iteration yields the previous entry. When the input iterator is done, there is still one entry left over, which is never returned.
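A quick way to see the behaviour (note the iter() call: skip_last calls next() on its argument directly, so it needs an iterator rather than a plain list):

rows = [1, 2, 3, 4]
print(list(skip_last(iter(rows))))  # prints [1, 2, 3]: the final element is dropped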
A generic version, letting you skip the last n elements, would be:
from collections import deque
from itertools import islice

def skip_last_n(iterator, n=1):
    it = iter(iterator)
    prev = deque(islice(it, n), n)
    for item in it:
        yield prev.popleft()
        prev.append(item)
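A small sketch of how it behaves, and how it could be wrapped around the question's reader (CSV_raw and CSV_new as defined there):

print(list(skip_last_n(range(6), 2)))  # prints [0, 1, 2, 3]: the last two elements are dropped

for row in skip_last_n(CSV_raw, 1):    # equivalent to skip_last(CSV_raw)
    CSV_new.writerow(row)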

A generalized "skip-n" generator
from __future__ import print_function
from StringIO import StringIO
from itertools import tee
s = '''\
1
2
3
4
5
6
7
8
'''
def skip_last_n(iterator, n=1):
    a, b = tee(iterator)
    for x in xrange(n):
        next(a)
    for line in a:
        yield next(b)
i = StringIO(s)
for x in skip_last_n(i, 1):
    print(x, end='')
1
2
3
4
5
6
7
i = StringIO(s)
for x in skip_last_n(i, 3):
    print(x, end='')
1
2
3
4
5

Related

Read a file in chunks of multiple lines

I am writing code to take an enormous text file (several GB) N lines at a time, process that batch, and move on to the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size.)
I have been reading about using itertools' islice for this operation. I think I am halfway there:
from itertools import islice

N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)

for lines in lines_gen:
    ...process my lines...
The trouble is that I would like to process the next batch of 16 lines, but I am missing something
islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines. At the end of the file, the list might be shorter, and eventually the call will return an empty list.
from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines
An alternative is to use the grouper pattern:
from itertools import zip_longest

with open(...) as f:
    for next_n_lines in zip_longest(*[f] * n):
        # process next_n_lines
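Note that zip_longest pads the final group with the fillvalue (None by default) when the number of lines is not a multiple of n; a minimal sketch of stripping that padding, assuming the same file name and n as in the question:

from itertools import zip_longest

with open('my_very_large_text_file') as f:   # file name taken from the question
    n = 16
    for group in zip_longest(*[f] * n):
        next_n_lines = [line for line in group if line is not None]
        # process next_n_lines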
The question appears to presume that there is efficiency to be gained by reading an "enormous text file" in blocks of N lines at a time. This adds an application layer of buffering over the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing.
Thus:
with open('my_very_large_text_file') as f:
    for line in f:
        process(line)
is probably superior to any alternative in time, space, complexity and readability.
See also Rob Pike's first two rules, Jackson's Two Rules, and PEP 20, The Zen of Python. If you really just wanted to play with islice, you should have left out the large-file stuff.
Here is another way using groupby (Python 2 syntax):
from itertools import count, groupby

N = 16
with open('test') as f:
    for g, group in groupby(f, key=lambda _, c=count(): c.next()/N):
        print list(group)
How it works:
Basically, groupby() groups the lines by the return value of the key function. Here the key function is lambda _, c=count(): c.next()/N, which uses the fact that the default argument c is bound to count() when the lambda is defined. Each time groupby() calls the lambda, it evaluates the return value to determine which group the line belongs to:
# 1st iteration.
c.next() => 0
0 / 16 => 0
# 2nd iteration.
c.next() => 1
1 / 16 => 0
...
# Start of the second group.
c.next() => 16
16 / 16 => 1
...
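The snippet above is written for Python 2 (c.next() and integer /). A rough Python 3 equivalent, under the same assumptions about the input file, would be:

from itertools import count, groupby

N = 16
with open('test') as f:
    # next(c) // N is 0 for the first N lines, 1 for the next N, and so on
    for g, group in groupby(f, key=lambda _, c=count(): next(c) // N):
        print(list(group))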
Since the requirement was added that there be a statistically uniform distribution of the lines selected from the file, I offer this simple approach.
"""randsamp - extract a random subset of n lines from a large file"""
import random
def scan_linepos(path):
"""return a list of seek offsets of the beginning of each line"""
linepos = []
offset = 0
with open(path) as inf:
# WARNING: CPython 2.7 file.tell() is not accurate on file.next()
for line in inf:
linepos.append(offset)
offset += len(line)
return linepos
def sample_lines(path, linepos, nsamp):
"""return nsamp lines from path where line offsets are in linepos"""
offsets = random.sample(linepos, nsamp)
offsets.sort() # this may make file reads more efficient
lines = []
with open(path) as inf:
for offset in offsets:
inf.seek(offset)
lines.append(inf.readline())
return lines
dataset = 'big_data.txt'
nsamp = 5
linepos = scan_linepos(dataset) # the scan only need be done once
lines = sample_lines(dataset, linepos, nsamp)
print 'selecting %d lines from a file of %d' % (nsamp, len(linepos))
print ''.join(lines)
I tested it on a mock data file of 3 million lines comprising 1.7 GB on disk. scan_linepos dominated the runtime, taking about 20 seconds on my not-so-hot desktop.
Just to check the performance of sample_lines I used the timeit module like so:
import timeit

t = timeit.Timer('sample_lines(dataset, linepos, nsamp)',
        'from __main__ import sample_lines, dataset, linepos, nsamp')
trials = 10 ** 4
elapsed = t.timeit(number=trials)
print u'%dk trials in %.2f seconds, %.2fµs per trial' % (trials/1000,
        elapsed, (elapsed/trials) * (10 ** 6))
I tested various values of nsamp; when nsamp was 100, a single sample_lines call completed in 460 µs and scaled linearly up to 10k samples at 47 ms per call.
The natural next question is "Random is barely random at all?", and the answer is "sub-cryptographic but certainly fine for bioinformatics".
I used the chunker function from What is the most “pythonic” way to iterate over a list in chunks?:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)

with open(filename) as f:
    for lines in grouper(f, chunk_size, ""):  # for every chunk_sized chunk
        """process lines like
        lines[0], lines[1], ..., lines[chunk_size-1]"""
Assuming "batch" means to want to process all 16 recs at one time instead of individually, read the file one record at a time and update a counter; when the counter hits 16, process that group. interim_list = []
infile = open("my_very_large_text_file", "r")
ctr = 0
for rec in infile:
interim_list.append(rec)
ctr += 1
if ctr > 15:
process_list(interim_list)
interim_list = []
ctr = 0
the final group
process_list(interim_list)
Another solution might be to create an iterator that yields lists of n elements:
def n_elements(n, it):
    try:
        while True:
            # note: if the iterator runs out mid-list, the partial final chunk is discarded
            yield [next(it) for j in range(0, n)]
    except StopIteration:
        return

with open(filename, 'rt') as f:
    for n_lines in n_elements(n, f):
        do_stuff(n_lines)

Iterating through dictionary, 5 rows at a time

I am trying to open a CSV file with csv.DictReader, read in just the first 5 rows of data, perform the primary processing of my script, then read in the next 5 rows and do the same for them. Rinse and repeat.
I believe I have a method that works; however, I am having issues with the last lines of the data not being processed. I know I need to modify my if statement so that it also checks whether I'm at the end of the file, but I am having trouble finding a way to do that. I've found methods online, but they involve reading in the whole file to get a row count, and doing so would defeat the purpose of this script, as I'm dealing with memory issues.
Here is what I have so far:
import csv

count = 0
data = []
with open('test.csv') as file:
    reader = csv.DictReader(file)
    for row in reader:
        count += 1
        data.append(row)
        if count % 5 == 0 or #something to check for the end of the file:
            #do stuff
            data = []
Thank you for the help!
You can use the chunksize argument when reading in the CSV with pandas. This will read in that number of lines at a time:
import pandas as pd

reader = pd.read_csv('test.csv', chunksize=5)
for df in reader:
    # do stuff
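Each df yielded by the reader is an ordinary DataFrame holding at most 5 rows, so the "do stuff" part can use normal pandas operations; a small sketch, assuming the same test.csv:

import pandas as pd

for df in pd.read_csv('test.csv', chunksize=5):
    print(len(df))                  # 5, 5, ... and then a shorter final chunk
    for row in df.itertuples(index=False):
        pass                        # or process the rows one at a time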
You can handle the remaining lines after the for loop body. You can also use the more Pythonic enumerate.
import csv

data = []
with open('test.csv') as file:
    reader = csv.DictReader(file)
    for count, row in enumerate(reader, 1):
        data.append(row)
        if count % 5 == 0:
            # do stuff
            data = []

print('handling remaining lines at end of file')
print(data)
Considering the file
a,b
1,1
2,2
3,3
4,4
5,5
6,6
7,7
outputs
handling remaining lines at end of file
[OrderedDict([('a', '6'), ('b', '6')]), OrderedDict([('a', '7'), ('b', '7')])]
This is one approach, using the iterator directly:
import csv

with open('test.csv') as file:
    reader = csv.DictReader(file)
    value = True
    while value:
        data = []
        for _ in range(5):  # Get 5 rows
            value = next(reader, False)
            if value:
                data.append(value)
        print(data)  # List of 5 elements
Staying along the lines of what you wrote and not including any other imports:
import csv

data = []
with open('test.csv') as file:
    reader = csv.DictReader(file)
    for row in reader:
        data.append(row)
        if len(data) > 5:
            del data[0]
        if len(data) == 5:
            # Do something with the 5 elements
            print(data)
The if statements allow the list to fill with 5 elements before processing begins; after that it behaves as a sliding window over the last 5 rows.
class ZeroItterNumberException(Exception):
    pass

class ItterN:
    def __init__(self, itterator, n):
        if n < 1:
            raise ZeroItterNumberException("{} is not a valid number of rows.".format(n))
        self.itterator = itterator
        self.n = n
        self.cache = []

    def __iter__(self):
        return self

    def __next__(self):
        self.cache.append(next(self.itterator))
        if len(self.cache) < self.n:
            return self.__next__()
        if len(self.cache) > self.n:
            del self.cache[0]
        if len(self.cache) == self.n:  # was hard-coded to 5; compare against n instead
            return self.cache
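A possible usage sketch for the class above (note that it yields a sliding window of the last n rows rather than disjoint chunks, and the same list object is reused on every iteration); test.csv is the file from the question:

import csv

with open('test.csv') as file:
    reader = csv.DictReader(file)
    for window in ItterN(reader, 5):
        # 'window' is the list of the 5 most recently read rows;
        # copy it with list(window) if you need to keep it around.
        print(window[0], window[-1])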

Create Matrix from a csv file - Python

I am trying to read some numbers from a .csv file and store them into a matrix using Python. The input file looks like this
Input File
B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2
The input is to be transformed into a matrix like this:
Output File
1 2 3
A 2 1 1
B 3 2 0
Here, the first column of the input file becomes the row label, the second column becomes the column label, and each cell is the count of occurrences. How should I implement this? My input file is huge (1,000,000 rows), so there can be a large number of rows (anywhere between 50 and 10,000) and columns (from 1 to 50).
With pandas it becomes easy, in almost just 3 lines:
import pandas as pd

df = pd.read_csv('example.csv', names=['label', 'value'])
# >>> df
#   label  value
# 0     B      1
# 1     A      1
# 2     A      1
# 3     B      1
# 4     A      3
# 5     A      2
# 6     B      1
# 7     B      2
# 8     B      2

s = df.groupby(['label', 'value']).size()
# >>> s
# label  value
# A      1        2
#        2        1
#        3        1
# B      1        3
#        2        2
# dtype: int64

# ref1: http://stackoverflow.com/questions/15751283/converting-a-pandas-multiindex-dataframe-from-rows-wise-to-column-wise
# ref2: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html
m = s.unstack()
# >>> m
# value  1  2    3
# label
# A      2  1    1
# B      3  2  NaN

# Below are optional: just to make it look more like what you want
m.columns.name = None
m.index.name = None
m = m.fillna(0)
print m
#    1  2  3
# A  2  1  1
# B  3  2  0
My solution does not seem to be very efficient for a huge amount of input data, since I am doing a lot of stuff manually that could probably be done by some of the pandas DataFrame methods.
However, this does the job:
#!/usr/bin/env python3
# coding: utf-8

import pandas as pd
from collections import Counter

with open('foo.txt') as f:
    l = f.read().splitlines()

numbers_list = []
letters_list = []
for element in l:
    letter = element.split(',')[0]
    number = element.split(',')[1]
    if number not in numbers_list:
        numbers_list.append(number)
    if letter not in letters_list:
        letters_list.append(letter)

c = Counter(l)
d = dict(c)

output = pd.DataFrame(columns=sorted(numbers_list), index=sorted(letters_list))

for col in numbers_list:
    for row in letters_list:
        key = '{},{}'.format(row, col)
        if key in d:
            output[col][row] = d[key]
        else:
            output[col][row] = 0
The output is as desired:
1 2 3
A 2 1 1
B 3 2 0
The following solution uses just standard Python modules:
import csv, collections, itertools

with open('my.csv', 'r') as f_input:
    counts = collections.Counter()
    for cols in csv.reader(f_input):
        counts[(cols[0], cols[1])] += 1

keys = set(key[0] for key in counts.keys())
values = set(int(key[1]) for key in counts.keys())   # the distinct values in the second column

d = {}
for k in itertools.product(keys, values):
    d[(k[0], str(k[1]))] = 0
d.update(dict(counts))

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    # Write the header, 'X' is whatever you want the first column called
    csv_output.writerow(['X'] + sorted(values))
    # Write the rows
    for k, g in itertools.groupby(sorted(d.items()), key=lambda x: x[0][0]):
        csv_output.writerow([k] + [col[1] for col in g])
This gives you an output CSV file looking like:
X,1,2,3
A,2,1,1
B,3,2,0
Here is another variation using standard modules:
import csv
import re
from collections import defaultdict
from itertools import chain

d = defaultdict(list)
with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        d[row[0]].append(row[1])

k = sorted(d.keys())
v = sorted(map(int, set(chain.from_iterable(d.values()))))

e = []
for i in k:  # iterate in sorted key order so rows line up with the labels printed below
    e.append([0]*len(v))
    for j in d[i]:
        e[-1][int(j)-1] += 1

print ' ', re.sub(r'[\[\],]', '', str(v))
for i, j in enumerate(k):
    print j, re.sub(r'[\[\],]', '', str(e[i]))
Given data.csv has the contents of the input file shown in the question, this script prints the following as output:
1 2 3
A 2 1 1
B 3 2 0
Thanks to @zyxue for a pure pandas solution. It takes a lot less code up front; the trade-off is selecting the right pandas methods. However, the extra coding is not necessarily in vain where run-time performance is concerned. Using timeit in IPython to measure the run-time difference between my code and @zyxue's pure pandas code, I found that my method ran 36 times faster excluding imports and input I/O, and 121 times faster when also excluding output I/O (print statements). These tests were done with functions encapsulating the code blocks. Here are the functions that were tested, using Python 2.7.10 and pandas 0.16.2:
def p():  # 1st pandas function
    s = df.groupby(['label', 'value']).size()
    m = s.unstack()
    m.columns.name = None
    m.index.name = None
    m = m.fillna(0)
    print m

def p1():  # 2nd pandas function - omitting print statement
    s = df.groupby(['label', 'value']).size()
    m = s.unstack()
    m.columns.name = None
    m.index.name = None
    m = m.fillna(0)

def q():  # first std mods function
    k = sorted(d.keys())
    v = sorted(map(int, set(chain.from_iterable(d.values()))))
    e = []
    for i in k:
        e.append([0]*len(v))
        for j in d[i]:
            e[-1][int(j)-1] += 1
    print ' ', re.sub(r'[\[\],]', '', str(v))
    for i, j in enumerate(k):
        print j, re.sub(r'[\[\],]', '', str(e[i]))

def q1():  # 2nd std mods function - omitting print statements
    k = sorted(d.keys())
    v = sorted(map(int, set(chain.from_iterable(d.values()))))
    e = []
    for i in k:
        e.append([0]*len(v))
        for j in d[i]:
            e[-1][int(j)-1] += 1
Prior to testing, the following code was run to import modules, read the input, and initialize variables for all functions:
import pandas as pd
df = pd.read_csv('data.csv', names=['label', 'value'])

import csv
from collections import defaultdict
from itertools import chain
import re

d = defaultdict(list)
with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        d[row[0]].append(row[1])
The contents of the data.csv input file was:
B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2
The test command line for each function was of the form:
%timeit fun()
Here are the test results:
p(): 100 loops, best of 3: 4.47 ms per loop
p1(): 1000 loops, best of 3: 1.88 ms per loop
q(): 10000 loops, best of 3: 123 µs per loop
q1(): 100000 loops, best of 3: 15.5 µs per loop
These results are only suggestive and for one small dataset. In particular I would expect pandas to perform comparatively better for larger datasets up to a point.
Here is a way to do it with MapReduce using Hadoop Streaming, where the mapper and reducer scripts both read stdin.
The mapper script is mostly an input mechanism: it filters the input to remove improper data. Its advantages are that the input can be split over multiple mapper processes, with the combined output automatically sorted and forwarded to a reducer, and that combiners can be run locally on the mapper nodes. Combiners are essentially intermediate reducers, useful for speeding up reduction through parallelism over a cluster.
# mapper script
import sys
import re

# mapper
for line in sys.stdin:
    line = line.strip()
    word = line.split()[0]
    if word and re.match(r'\A[a-zA-Z]+,[0-9]+', word):
        print word
The reducer script gets the sorted output from all mappers, builds an intermediate dict for each input key (such as A or B, called 'prefix' in the code), and outputs the results to a file in CSV format.
# reducer script
from collections import defaultdict
import sys

def output(s, d):
    """
    this function takes a string s and a dictionary d with int keys and values
    and sorts the keys, then creates a string of comma-separated values ordered
    by the keys, with appropriate insertion of comma-separated zeros equal in
    number to the difference between successive keys minus one
    """
    v = sorted(d.keys())
    o = str(s) + ','
    lastk = 0
    for k in v:
        o += '0,'*(k-lastk-1) + str(d[k]) + ','
        lastk = k
    return o

prefix = ''
current_prefix = ''
d = defaultdict(int)
maxkey = 0

for line in sys.stdin:
    line = line.strip()
    prefix, value = line.split(',')
    try:
        value = int(value)
    except ValueError:
        continue
    if current_prefix == prefix:
        d[value] += 1
    else:
        if current_prefix:
            if len(d) > 0:
                print output(current_prefix, d)
                t = max(d.keys())
                if t > maxkey:
                    maxkey = t
        d = defaultdict(int)
        current_prefix = prefix
        d[value] += 1

# output info for last prefix if needed
if current_prefix == prefix:
    print output(prefix, d)
    t = max(d.keys())
    if t > maxkey:
        maxkey = t

# output csv list of keys from 1 through maxkey
h = ' ,'
for i in range(1, maxkey+1):
    h += str(i) + ','
print h
To run it through the data streaming process, given that the mapper gets:
B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2
It directly outputs the same content, which then gets sorted (shuffled) and sent to the reducer. In this example, what the reducer gets is:
A,1
A,1
A,2
A,3
B,1
B,1
B,1
B,2
B,2
Finally the output of the reducer is:
A,2,1,1,
B,3,2,
,1,2,3,
For larger data sets, the input file would be split with portions containing all data for some sets of keys going to separate mappers. Using a combiner on each mapper node would save overall sorting time. There would still be a need for a single reducer so that the output is totally sorted by key. If that's not a requirement, multiple reducers could be used.
For practical reasons I made a couple of choices. First, each line of output only goes up to the highest integer seen for a key, and trailing zeros are not printed, because there is no way to know how many to write until all the input has been processed; for large input, that would mean storing a large amount of intermediate data in memory, or slowing down processing by writing it out to disk and reading it back in to complete the job. Second, and for the same reason, the header line cannot be written until just before the end of the reduce job, so that's when it is written. It may be possible to prepend it to the output file (or to the first one, if the output has been split), and that can be investigated in due course. However, given the great speedup from parallel processing on massive input, these are minor issues.
This method will work, with relatively minor but crucial modifications, on a Spark cluster, and can be converted to Java or Scala to improve performance if necessary.
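For illustration only, here is a rough PySpark sketch of the same aggregation (count the (label, value) pairs, then pivot the values into columns); the file name and two-column layout are the ones from the question, and this is a sketch rather than a tuned replacement for the streaming scripts above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-matrix").getOrCreate()

# two unnamed columns: a label ('A'/'B') and a value
df = spark.read.csv('data.csv').toDF('label', 'value')

# count occurrences of each (label, value) pair, then pivot values into columns
matrix = (df.groupBy('label')
            .pivot('value')
            .count()
            .na.fill(0)
            .orderBy('label'))
matrix.show()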

Python scan file line by line and remove last line in the same loop

I am trying to scan a csv file and make adjustments line by line. In the end, I would like to remove the last line. How can I remove the last line within the same scanning loop?
My code below reads from the original file, makes adjustments and finally writes to a new file.
import csv

raw_data = csv.reader(open("original_data.csv", "r"), delimiter=",")
output_data = csv.writer(open("final_data.csv", "w"), delimiter=",")

lastline = # integer index of last line

for i, row in enumerate(raw_data):
    if i == 10:
        # some operations
        output_data.writerow(row)
    elif i > 10 and i < lastline:
        # some operations
        output_data.writerow(row)
    elif i == lastline:
        output_data.writerow([])
    else:
        continue
You can make a generator to yield all elements except the last one:
def remove_last_element(iterable):
    iterator = iter(iterable)
    try:
        prev = next(iterator)
        while True:
            cur = next(iterator)
            yield prev
            prev = cur
    except StopIteration:
        return
Then you just wrap raw_data in it:
for i, row in enumerate(remove_last_element(raw_data)):
    # your code
The last line will be ignored automatically.
This approach has the benefit of only reading the file once.
A variation of @Kolmar's idea:
def all_but_last(it):
    buf = next(it)
    for item in it:
        yield buf
        buf = item

for line in all_but_last(...):
Here's more generic code that extends islice (the two-argument version) to negative stop values:
import itertools, collections

def islice2(it, stop):
    if stop >= 0:
        for x in itertools.islice(it, stop):
            yield x
    else:
        d = collections.deque(itertools.islice(it, -stop))
        for item in it:
            yield d.popleft()
            d.append(item)

for x in islice2(xrange(20), -5):
    print x,
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
You can iterate with a window of size 2 and print only the first value in the window. This will cause the last element to be skipped:
from itertools import izip, tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

for row, _ in pairwise(raw_data):
    output_data.writerow(row)
output_data.writerow([])
An idea is to add up the length of each line as you iterate, and then, when you come to the last line, truncate the file to that size, thus "shortening the file". Not sure if this is good practice though...
e.g.
Python: truncate a file to 100 lines or less
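A minimal sketch of that idea, assuming the adjustments are made in place and only the last line needs to go: add up the line lengths to find where the last line starts, then truncate the file to that size (file name as in the question):

# first pass: find the byte offset at which the last line starts
offset = 0
last_line_start = 0
with open('original_data.csv', 'rb') as f:
    for line in f:
        last_line_start = offset
        offset += len(line)

# second pass: chop the file off just before the last line
with open('original_data.csv', 'rb+') as f:
    f.truncate(last_line_start)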
Instead of writing the current line each loop iteration, try writing the previously read line:
import csv

raw_data = csv.reader(open("original_data.csv", "r"), delimiter=",")
output_data = csv.writer(open("final_data.csv", "w"), delimiter=",")

last_iter = (None, None)
try:
    last_iter = (0, raw_data.next())
except StopIteration:
    # The file is empty
    pass
else:
    for new_row in raw_data:
        i, row = last_iter
        last_iter = (i + 1, new_row)
        if i == 10:
            # some operations
            output_data.writerow(row)
        elif i > 10:
            # some operations
            output_data.writerow(row)

# Here, the last row of the file is in the `last_iter` variable.
# It won't get written into the output file.
output_data.writerow([])

Python nested loop - get next N lines

I'm new to Python and trying to do a nested loop. I have a very large file (1.1 million rows), and I'd like to use it to create a file that has each line along with the next N lines, for example with the next 3 lines:
1 2
1 3
1 4
2 3
2 4
2 5
Right now I'm just trying to get the loops working with row numbers instead of the strings, since it's easier to visualize. I came up with this code, but it's not behaving how I want it to:
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    for i, line in enumerate(f):
        line_a = i
        lower_bound = i + 1
        upper_bound = i + 4
        with open('C:/working_file.txt', mode='r', encoding='utf8') as g:
            for j, line in enumerate(g):
                while j >= lower_bound and j <= upper_bound:
                    line_b = j
                    j = j + 1
                    print(line_a, line_b)
Instead of the output I want like above, it's giving me this:
990 991
990 992
990 993
990 994
990 992
990 993
990 994
990 993
990 994
990 994
As you can see the inner loop is iterating multiple times for each line in the outer loop. It seems like there should only be one iteration per line in the outer loop. What am I missing?
EDIT: My question was answered below, here is the exact code I ended up using:
from collections import deque
from itertools import cycle

log = open('C:/example.txt', mode='w', encoding='utf8')

try:
    xrange
except NameError:  # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in range(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)

for l in window(open('c:/working_file.txt', mode='r', encoding='utf8'), 100):
    for a, b in l:
        print(a.strip() + '\t' + b.strip(), file=log)
Based on the window example from the old docs, you can use something like:
from collections import deque
from itertools import cycle

try:
    xrange
except NameError:  # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in xrange(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)
Demo:
>>> for l in window([1,2,3,4,5], 4):
...     for l1, l2 in l:
...         print l1, l2
...
1 2
1 3
1 4
2 3
2 4
2 5
So, basically, you can pass your file to window to get the desired result:
window(open('C:/working_file.txt', mode='r', encoding='utf8'), 4)
You can do this with slices. This is easiest if you read the whole file into a list first:
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    data = f.readlines()

for i, line_a in enumerate(data):
    for j, line_b in enumerate(data[i+1:i+5], start=i+1):
        print(i, j)
When you change it to printing the lines instead of the line numbers, you can drop the second enumerate and just do for line_b in data[i+1:i+5]. Note that the slice includes the item at the start index, but not the item at the end index, so that needs to be one higher than your current upper bound.
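Put together, the simplified version that prints the actual line contents might look like this (same path as in the question):

with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    data = f.readlines()

for i, line_a in enumerate(data):
    for line_b in data[i+1:i+5]:
        print(line_a.strip(), line_b.strip())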
Based on alko's answer, I would suggest using the window recipe unmodified:
from itertools import islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

for l in window([1,2,3,4,5], 4):
    for item in l[1:]:
        print l[0], item
I think the easiest way to solve this problem would be to read your file into a dictionary...
my_data = {}
for i, line in enumerate(f):
    my_data[i] = line
After that is done you can do
for x in my_data:
    for y in range(1, 4):
        print my_data[x], my_data[x + y]
As written, your original code re-reads the million-line file once for every line, so about a million times in total...
Since this is quite a big file, you might not want to load it all into memory at once. So, to avoid reading any line more than once, this is what you do.
Make a list with N elements, where N is the number of following lines to read.
When you read the first line, add it to the first item in the list.
Add the next line to the first and second items, and so on for each line.
When an item in that list reaches length N, take it out and append it to the output file, and add an empty item at the end so you still have a list of N items.
This way you only need to read each line once, and you won't have to load the whole file into memory. You only need to hold, at most, N partially filled groups in memory.
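A rough sketch of that scheme, under the assumptions of the question (each line is paired with its next N lines, tab-separated); the output file name here is made up:

N = 3                         # number of following lines to pair with each line
group_size = N + 1            # a group holds one line plus its next N lines

groups = []                   # partially filled groups, oldest (fullest) first
with open('C:/working_file.txt', encoding='utf8') as src, \
     open('C:/pairs_out.txt', 'w', encoding='utf8') as out:
    for line in src:
        groups.append([])                 # a new group starts at every line
        for g in groups:
            g.append(line)
        if len(groups[0]) == group_size:  # the oldest group is now complete
            full = groups.pop(0)
            first = full[0].strip()
            for other in full[1:]:
                out.write(first + '\t' + other.strip() + '\n')
# any groups left over at EOF never reached full size and are simply dropped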
