I'm new to Python and trying to do a nested loop. I have a very large file (1.1 million rows), and I'd like to use it to create a file that has each line along with the next N lines, for example with the next 3 lines:
1 2
1 3
1 4
2 3
2 4
2 5
Right now I'm just trying to get the loops working with row numbers instead of the strings since it's easier to visualize. I came up with this code, but it's not behaving how I want it to:
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    for i, line in enumerate(f):
        line_a = i
        lower_bound = i + 1
        upper_bound = i + 4
        with open('C:/working_file.txt', mode='r', encoding='utf8') as g:
            for j, line in enumerate(g):
                while j >= lower_bound and j <= upper_bound:
                    line_b = j
                    j = j + 1
                    print(line_a, line_b)
Instead of the output I want like above, it's giving me this:
990 991
990 992
990 993
990 994
990 992
990 993
990 994
990 993
990 994
990 994
As you can see the inner loop is iterating multiple times for each line in the outer loop. It seems like there should only be one iteration per line in the outer loop. What am I missing?
EDIT: My question was answered below, here is the exact code I ended up using:
from collections import deque
from itertools import cycle

log = open('C:/example.txt', mode='w', encoding='utf8')

try:
    xrange
except NameError:  # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in range(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)

for l in window(open('c:/working_file.txt', mode='r', encoding='utf8'), 100):
    for a, b in l:
        print(a.strip() + '\t' + b.strip(), file=log)
Based on the window example from the old docs, you can use something like:
from collections import deque
from itertools import cycle

try:
    xrange
except NameError:  # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in xrange(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)
Demo:
>>> for l in window([1,2,3,4,5], 4):
...     for l1, l2 in l:
...         print l1, l2
...
1 2
1 3
1 4
2 3
2 4
2 5
So, basically, you can pass your file to window to get the desired result:
window(open('C:/working_file.txt', mode='r', encoding='utf8'), 4)
You can do this with slices. This is easiest if you read the whole file into a list first:
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    data = f.readlines()

for i, line_a in enumerate(data):
    for j, line_b in enumerate(data[i+1:i+5], start=i+1):
        print(i, j)
When you change it to printing the lines instead of the line numbers, you can drop the second enumerate and just do for line_b in data[i+1:i+5]. Note that the slice includes the item at the start index, but not the item at the end index, so that needs to be one higher than your current upper bound.
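For illustration, here is a minimal sketch of that simplified version (my adaptation, not the answer's own code), printing the stripped text of each pair instead of the indices:
from itertools import islice  # not needed here; plain slicing on the list is enough

with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    data = f.readlines()

for i, line_a in enumerate(data):
    # data[i+1:i+5] is the next 4 lines; the slice simply shrinks near the end of the list
    for line_b in data[i+1:i+5]:
        print(line_a.strip(), line_b.strip())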
Based on alko's answer, I would suggest using the window recipe unmodified
from itertools import islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

for l in window([1,2,3,4,5], 4):
    for item in l[1:]:
        print l[0], item
I think the easiest way to solve this problem would be to read your file into a dictionary...
my_data = {}
for i, line in enumerate(f):
    my_data[i] = line
After that is done you can do
for x in my_data:
    for y in range(1, 4):
        print my_data[x], my_data[x + y]
As written, you are re-reading your million-line file from the start once for every line, i.e., a million times in total...
Since this is quite a big file, you might not want to load it all into memory at once. To avoid reading any line more than once, this is what you can do:
Make a list with N elements, where N is the number of next lines to read.
When you read the first line, add it to the first item in the list.
Add the next line to the first and second items,
and so on for each line.
When an item in that list reaches length N, take it out and write it to the output file, and add an empty item at the end so you still have a list of N items.
This way you only need to read each line once, and you won't have to load the whole file into memory. You only need to hold roughly N lines in memory at a time. A rough sketch of this bookkeeping follows.
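This is a minimal sketch of that idea, under my own assumptions (hypothetical file names, tab-separated output). Each open window collects an anchor line plus its next N lines and is written out once it is full:
N = 3  # number of following lines to pair with each line

with open('working_file.txt', encoding='utf8') as src, \
     open('output.txt', 'w', encoding='utf8') as out:
    open_windows = []  # each entry: [anchor, next1, next2, ...]
    for line in src:
        line = line.strip()
        for window in open_windows:
            window.append(line)
        open_windows.append([line])        # start a new window anchored at this line
        if len(open_windows[0]) == N + 1:  # oldest window is full: anchor + N lines
            anchor, *rest = open_windows.pop(0)
            for other in rest:
                out.write(f"{anchor}\t{other}\n")
    # windows left at EOF have fewer than N following lines; pair whatever is there
    for anchor, *rest in open_windows:
        for other in rest:
            out.write(f"{anchor}\t{other}\n")
At any moment only about N + 1 distinct lines are held in memory, regardless of the size of the input file.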
I am having a bit of trouble figuring the following out:
I have a file with 100 lines for example, let's call it file A
I also have another file with 100 lines for example, let's call it file B
Now I need the first loop to read 10 lines from file A and do its thing, then go to the other loop that reads 10 lines from file B and does its thing, then go back to the first loop to do lines 11-20 from file A, and then back to the second loop that does lines 11-20 from file B.
I need both loops to remember from which line to read.
How should I approach this?
Thanks!
EDIT:
Could something like this work?
a = 0
b = 10
x = 0
y = 10
for 1000 times:
    read a-b rows:
        do its thing
    a += 10
    b += 10
    read x-y rows:
        do its thing
    x += 10
    y += 10
You can iterate over 10 lines at a time using this approach.
class File:
    def __init__(self, filename):
        self.f = open(filename, 'r')

    def line(self):
        yield self.f.readline()

    def next(self, limit):
        for each in range(limit):
            yield self.f.readline()

    def lines(self, limit=10):
        return [x for x in self.next(limit=limit)]

file1 = File('C:\\Temp\\test.csv')
file2 = File('C:\\Temp\\test2.csv')

print(file1.lines(10))
print(file2.lines(10))
print(file1.lines(10))
print(file2.lines(10))
Now you can jump back and forth between files iterating over the next 10 lines.
Here is another solution using a generator and a context manager:
class SwitchFileReader():
    def __init__(self, file_paths, lines=10):
        self.file_paths = file_paths
        self.file_objects = []
        self.lines = 1 if lines < 1 else lines

    def __enter__(self):
        for file in self.file_paths:
            file_object = open(file, "r")
            self.file_objects.append(file_object)
        return self

    def __exit__(self, type, value, traceback):
        for file in self.file_objects:
            file.close()

    def __iter__(self):
        while True:
            next_lines = [
                [file.readline() for _ in range(self.lines)]
                for file in self.file_objects
            ]
            if any(not all(lines) for lines in next_lines):
                break
            for lines in next_lines:
                yield lines

file_a = r"D:\projects\playground\python\stackgis\data\TestA.txt"
file_b = r"D:\projects\playground\python\stackgis\data\TestB.txt"

with SwitchFileReader([file_a, file_b], 10) as file_changer:
    for next_lines in file_changer:
        print(next_lines, end="")  # do your thing
The iteration will stop as soon as there are fewer remaining lines in any of the files.
Assuming file_a has 12 lines and file_b has 13 lines, lines 11 and 12 from file_a and lines 11 to 13 from file_b would be ignored.
For simplicity I'm going to work with lists; you can read each file into a list.
Let's split the problem. We need to:
group each list into chunks of a given size (10 in your case), and
loop over those chunks of both lists at the same time.
Grouping
Here is an answer: https://stackoverflow.com/a/4998460/2681662
def group_by_each(lst, N):
    return [lst[n:n+N] for n in range(0, len(lst), N)]
Looping over two lists at the same time:
You can use zip for this.
lst1 = list(range(100))       # <- Your data
lst2 = list(range(100, 200))  # <- Your second data

def group_by_each(lst, N):
    return [lst[n:n+N] for n in range(0, len(lst), N)]

for ten1, ten2 in zip(group_by_each(lst1, 10), group_by_each(lst2, 10)):
    print(ten1)
    print(ten2)
When you iterate over a file object, it yields lines in the associated file.
You just need a single loop that grabs the next ten lines from both files each iteration. In this example, the loop will end as soon as either file is exhausted:
from itertools import islice

lines_per_iter = 10

file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")

while (a := list(islice(file_a, lines_per_iter))) and (b := list(islice(file_b, lines_per_iter))):
    print(f"Next {lines_per_iter} lines from A: {a}")
    print(f"Next {lines_per_iter} lines from B: {b}")

file_a.close()
file_b.close()
Ok, thank you for all the answers, I found a working solution to my project like this:
a = 0
b = 10
x = 0
y = 10
while True:
    for list1 in range(a, b):
        # read the lines from file A
    a += 10
    b += 10
    for list2 in range(x, y):
        # read the lines from file B
    if y == 100:
        break
    x += 10
    y += 10
I know it's been a long time since this question was asked, but I still feel like answering it my own way for future viewers and future reference. I'm not exactly sure if this is the best way to do it, but it can read multiple files simultaneously which is pretty cool.
from itertools import islice, chain
from pprint import pprint

def simread(files, nlines_segments, nlines_contents):
    lines = [[] for i in range(len(files))]
    total_lines = sum(nlines_contents)
    current_index = 0
    while len(tuple(chain(*lines))) < total_lines:
        if len(lines[current_index]) < nlines_contents[current_index]:
            lines[current_index].extend(islice(
                files[current_index],
                nlines_segments[current_index],
            ))
        current_index += 1
        if current_index == len(files):
            current_index = 0
    return lines

with open('A.txt') as A, open('B.txt') as B:
    lines = simread(
        [A, B],      # files
        [10, 10],    # lines to read at a time from each file
        [100, 100],  # number of lines in each file
    )  # returns two lists containing the lines in files A and B
    pprint(lines)
You can even add another file C (with any number of lines, even a thousand) like so:
with open('A.txt') as A, open('B.txt') as B, open('C.txt') as C:
    lines = simread(
        [A, B, C],         # files
        [10, 10, 100],     # lines to read at a time from each file
        [100, 100, 1000],  # number of lines in each file
    )  # returns three lists containing the lines in files A, B, and C
    pprint(lines)
The values in nlines_segments can also be changed, like so:
with open('A.txt') as A, open('B.txt') as B, open('C.txt') as C:
    lines = simread(
        [A, B, C],         # files
        [5, 20, 125],      # lines to read at a time from each file
        [100, 100, 1000],  # number of lines in each file
    )  # returns three lists containing the lines in files A, B, and C
    pprint(lines)
This would read file A five lines at a time, file B twenty lines at a time, and file C 125 lines at a time.
NOTE: The values provided in nlines_segments all have to be factors of their corresponding values in nlines_contents, which should all be the exact number of lines in the files they correspond to.
I hope this helps!
There are already a billion answers, but I just felt like answering this in a simple way.
with open('fileA.txt', 'r') as a:
    a_lines = a.readlines()
    a_prog = 0

with open('fileB.txt', 'r') as b:
    b_lines = b.readlines()
    b_prog = 0

for i in range(10):
    temp = []
    for line in range(a_prog, a_prog + 10):
        temp.append(a_lines[line].strip())
    a_prog += 10
    # Temp is the full 10-line block.
    # Do something...

    temp = []
    for line in range(b_prog, b_prog + 10):
        temp.append(b_lines[line].strip())
    b_prog += 10
    # Temp is the full 10-line block.
    # Do something...
I am writing code to take an enormous text file (several GB) N lines at a time, process that batch, and move on to the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size.)
I have been reading about using itertools islice for this operation. I think I am halfway there:
from itertools import islice

N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)

for lines in lines_gen:
    ...process my lines...
The trouble is that I would like to process the next batch of 16 lines, but I am missing something.
islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines. At the end of the file, the list might be shorter, and finally the call will return an empty list.
from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines
An alternative is to use the grouper pattern:
from itertools import zip_longest

with open(...) as f:
    for next_n_lines in zip_longest(*[f] * n):
        # process next_n_lines
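One caveat with the grouper pattern: the final group is padded with the fill value (None by default). A minimal sketch that drops that padding, under the assumption that the padded entries are unwanted:
from itertools import zip_longest

n = 16
with open("my_very_large_text_file") as f:
    for group in zip_longest(*[f] * n):
        # only the last group can contain None padding; keep the real lines
        next_n_lines = [line for line in group if line is not None]
        # process next_n_lines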
The question appears to presume that there is efficiency to be gained by reading an "enormous textfile" in blocks of N lines at a time. This adds an application layer of buffering over the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing.
Thus:
with open('my_very_large_text_file') as f:
    for line in f:
        process(line)
is probably superior to any alternative in time, space, complexity and readability.
See also Rob Pike's first two rules, Jackson's Two Rules, and PEP-20 The Zen of Python. If you really just wanted to play with islice you should have left out the large file stuff.
Here is another way using groupby:
from itertools import count, groupby

N = 16
with open('test') as f:
    for g, group in groupby(f, key=lambda _, c=count(): c.next()/N):
        print list(group)
How it works:
Basically, groupby() groups the lines by the return value of the key function, which here is the lambda lambda _, c=count(): c.next()/N. Because the c argument is bound to count() when the function is defined, each time groupby() calls the lambda it evaluates the return value to determine which grouper the line belongs to:
# 1st iteration.
c.next() => 0
0 / 16 => 0
# 2nd iteration.
c.next() => 1
1 / 16 => 0
...
# Start of the second grouper.
c.next() => 16
16 / 16 => 1
...
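That snippet is Python 2 (c.next(), integer division, print statement). For reference, a hedged Python 3 translation of the same idea, assuming the same file name, would look like this:
from itertools import count, groupby

N = 16
with open('test') as f:
    # next(c) replaces c.next(), and // keeps the integer division behaviour
    for _, group in groupby(f, key=lambda _, c=count(): next(c) // N):
        print(list(group))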
Since the requirement was added that there be statistically uniform distribution of the lines selected from the file, I offer this simple approach.
"""randsamp - extract a random subset of n lines from a large file"""
import random
def scan_linepos(path):
"""return a list of seek offsets of the beginning of each line"""
linepos = []
offset = 0
with open(path) as inf:
# WARNING: CPython 2.7 file.tell() is not accurate on file.next()
for line in inf:
linepos.append(offset)
offset += len(line)
return linepos
def sample_lines(path, linepos, nsamp):
"""return nsamp lines from path where line offsets are in linepos"""
offsets = random.sample(linepos, nsamp)
offsets.sort() # this may make file reads more efficient
lines = []
with open(path) as inf:
for offset in offsets:
inf.seek(offset)
lines.append(inf.readline())
return lines
dataset = 'big_data.txt'
nsamp = 5
linepos = scan_linepos(dataset) # the scan only need be done once
lines = sample_lines(dataset, linepos, nsamp)
print 'selecting %d lines from a file of %d' % (nsamp, len(linepos))
print ''.join(lines)
I tested it on a mock data file of 3 million lines comprising 1.7GB on disk. scan_linepos dominated the runtime, taking about 20 seconds on my not-so-hot desktop.
Just to check the performance of sample_lines, I used the timeit module like so:
import timeit

t = timeit.Timer('sample_lines(dataset, linepos, nsamp)',
        'from __main__ import sample_lines, dataset, linepos, nsamp')
trials = 10 ** 4
elapsed = t.timeit(number=trials)
print u'%dk trials in %.2f seconds, %.2fµs per trial' % (trials/1000,
        elapsed, (elapsed/trials) * (10 ** 6))
For various values of nsamp: when nsamp was 100, a single sample_lines call completed in 460µs, and it scaled linearly up to 10k samples at 47ms per call.
The natural next question is Random is barely random at all?, and the answer is "sub-cryptographic but certainly fine for bioinformatics".
I used the chunker function from What is the most “pythonic” way to iterate over a list in chunks?:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)

with open(filename) as f:
    for lines in grouper(f, chunk_size, ""):  # for every chunk_sized chunk
        """process lines like
        lines[0], lines[1], ..., lines[chunk_size-1]"""
Assuming "batch" means to want to process all 16 recs at one time instead of individually, read the file one record at a time and update a counter; when the counter hits 16, process that group. interim_list = []
infile = open("my_very_large_text_file", "r")
ctr = 0
for rec in infile:
interim_list.append(rec)
ctr += 1
if ctr > 15:
process_list(interim_list)
interim_list = []
ctr = 0
the final group
process_list(interim_list)
Another solution might be to create an iterator that yields lists of n elements:
def n_elements(n, it):
    try:
        while True:
            yield [next(it) for j in range(0, n)]
    except StopIteration:
        return

with open(filename, 'rt') as f:
    for n_lines in n_elements(n, f):
        do_stuff(n_lines)
I'm trying to read the lines of a file into a list so that every N lines will be in the same tuple. Assuming the file is valid, so that the number of lines is a multiple of N, how can I achieve it?
The way I read the lines into the list:
def readFileIntoAList(file, N):
    lines = list()
    with open(file) as f:
        lines = [line.rstrip('\n') for line in f]
    return lines
What change do I have to make so it will be a list of tuples, each of length N? For example, I have the following file content:
ABC
abc xyz
123
XYZ
xyz abc
321
The output will be:
[("ABC","abc xyz","123"),("XYZ,"xyz abc",321")]
You could try using a chunking function:
def readFileIntoAList(file, n):
    with open(file) as f:
        lines = f.readlines()
    return [lines[i:i + n] for i in range(0, len(lines), n)]
This will split the list of lines in the file into evenly sized chunks.
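If you need the exact output shape from the question (tuples of newline-stripped lines), one possible tweak, shown here as my own assumption rather than the answer's code, is:
def readFileIntoAList(file, n):
    with open(file) as f:
        lines = [line.rstrip('\n') for line in f]
    # group into tuples of length n (the last tuple may be shorter if the line count is not a multiple of n)
    return [tuple(lines[i:i + n]) for i in range(0, len(lines), n)]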
One way would be:
>>> data = []
>>> N = 3
>>> with open('/tmp/data') as f:
...     while True:
...         chunk = []
...         for i in range(N):
...             chunk.append(f.readline().strip('\n'))
...         if any(True for c in chunk if not c):
...             break
...         data.append(tuple(chunk))
...
>>> print(data)
[('ABC', 'abc xyz', '123'), ('XYZ', 'xyz abc', '321')]
Note that this assumes the file has the right number of lines. Having the wrong number of lines in the above code can lead to an infinite loop. A solution without that risk is:
data = []
N = 3
with open('/tmp/data') as f:
    i = 0
    chunk = []
    for line in f:
        chunk.append(line.strip('\n'))
        i += 1
        if i % N == 0 and i != 0:
            data.append(tuple(chunk))
            chunk = []
Neither of these approaches reads the whole file into memory, which should be more efficient when you process large datasets.
You can use itertools.islice():
from itertools import islice

N = 3  # chunk size
with open("filename") as f:
    lines = []
    chunk = tuple(s.strip() for s in islice(f, N))
    while chunk:
        lines.append(chunk)
        chunk = tuple(s.strip() for s in islice(f, N))
Also you can use map() if you prefer functional style:
chunk = tuple(map(str.strip, islice(f, N)))
import math

def readFileIntoAList(file, N):
    lines = list()
    lines1 = list()
    with open(file) as f:
        lines1 = [lineNew.rstrip("\n") for lineNew in f]
    for a in range(math.ceil(len(lines1)/N)):
        lines.append((*lines1[a*N:(a+1)*N],))
    return lines
I used a loop and tried to keep it simple.
I am trying to scan a csv file and make adjustments line by line. In the end, I would like to remove the last line. How can I remove the last line within the same scanning loop?
My code below reads from the original file, makes adjustments and finally writes to a new file.
import csv

raw_data = csv.reader(open("original_data.csv", "r"), delimiter=",")
output_data = csv.writer(open("final_data.csv", "w"), delimiter=",")

lastline = # integer index of last line
for i, row in enumerate(raw_data):
    if i == 10:
        # some operations
        output_data.writerow(row)
    elif i > 10 and i < lastline:
        # some operations
        output_data.writerow(row)
    elif i == lastline:
        output_data.writerow([])
    else:
        continue
You can make a generator to yield all elements except the last one:
def remove_last_element(iterable):
    iterator = iter(iterable)
    try:
        prev = next(iterator)
        while True:
            cur = next(iterator)
            yield prev
            prev = cur
    except StopIteration:
        return
Then you just wrap raw_data in it:
for i, row in enumerate(remove_last_element(raw_data)):
    # your code
The last line will be ignored automatically.
This approach has the benefit of only reading the file once.
A variation of #Kolmar's idea:
def all_but_last(it):
    buf = next(it)
    for item in it:
        yield buf
        buf = item

for line in all_but_last(...):
Here's more generic code that extends islice (two-args version) for negative indexes:
import itertools, collections

def islice2(it, stop):
    if stop >= 0:
        for x in itertools.islice(it, stop):
            yield x
    else:
        d = collections.deque(itertools.islice(it, -stop))
        for item in it:
            yield d.popleft()
            d.append(item)

for x in islice2(xrange(20), -5):
    print x,

# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
You can iterate with window of size 2 and print only the first value in the window. This will lead to the last element being skipped:
from itertools import izip, tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

for row, _ in pairwise(raw_data):
    output_data.writerow(row)
output_data.writerow([])
An idea is to track the byte length of each line as you iterate and then, when you reach the last line, truncate the file, thus "shortening" it. Not sure if this is good practice though...
e.g.
Python: truncate a file to 100 lines or less
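A rough sketch of that idea, under my own assumptions (a hypothetical file name, opened for in-place editing in binary mode so the byte offsets line up):
# find where the last line starts, then cut the file off there
with open('final_data.csv', 'rb+') as f:
    offset = 0
    last_line_start = 0
    for line in f:               # iterate the lines while tracking byte offsets
        last_line_start = offset
        offset += len(line)
    f.truncate(last_line_start)  # drop everything from the start of the last line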
Instead of writing the current line each loop iteration, try writing the previously read line:
import csv

raw_data = csv.reader(open("original_data.csv", "r"), delimiter=",")
output_data = csv.writer(open("final_data.csv", "w"), delimiter=",")

last_iter = (None, None)
try:
    last_iter = (0, raw_data.next())
except StopIteration:
    # The file is empty
    pass
else:
    for new_row in raw_data:
        i, row = last_iter
        last_iter = (i + 1, new_row)
        if i == 10:
            # some operations
            output_data.writerow(row)
        elif i > 10:
            # some operations
            output_data.writerow(row)

    # Here, the last row of the file is in the `last_iter` variable.
    # It won't get written into the output file.
    output_data.writerow([])
I am working on a data analysis using a CSV file that I got from a data warehouse (Cognos). The CSV file has a last row that sums up all the rows above, but I do not need this line for my analysis, so I would like to skip the last row.
I was thinking about adding an "if" statement that checks a column name within my "for" loop, like below.
import csv

with open('COGNOS.csv', "rb") as f, open('New_COGNOS.csv', "wb") as w:
    # Open 2 CSV files. One to read and the other to save.
    CSV_raw = csv.reader(f)
    CSV_new = csv.writer(w)
    for row in CSV_raw:
        item_num = row[3].split(" ")[0]
        row.append(item_num)
        if row[0] == "All Materials (By Collection)": break
        CSV_new.writerow(row)
However, this looks like it wastes a lot of resources. Is there a more Pythonic way to skip the last row when iterating through a CSV file?
You can write a generator that'll return everything but the last entry in an input iterator:
def skip_last(iterator):
    prev = next(iterator)
    for item in iterator:
        yield prev
        prev = item
then wrap your CSV_raw reader object in that:
for row in skip_last(CSV_raw):
The generator basically takes the first entry, then starts looping, and on each iteration yields the previous entry. When the input iterator is done, there is still one line left that is never returned.
A generic version, letting you skip the last n elements, would be:
from collections import deque
from itertools import islice

def skip_last_n(iterator, n=1):
    it = iter(iterator)
    prev = deque(islice(it, n), n)
    for item in it:
        yield prev.popleft()
        prev.append(item)
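Hypothetical usage against the reader from the question, skipping just the final summary row (my adaptation of the question's loop body, not part of the original answer):
for row in skip_last_n(CSV_raw, 1):
    item_num = row[3].split(" ")[0]
    row.append(item_num)
    CSV_new.writerow(row)  # the summary row at the end is never yielded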
A generalized "skip-n" generator
from __future__ import print_function
from StringIO import StringIO
from itertools import tee
s = '''\
1
2
3
4
5
6
7
8
'''
def skip_last_n(iterator, n=1):
    a, b = tee(iterator)
    for x in xrange(n):
        next(a)
    for line in a:
        yield next(b)

i = StringIO(s)
for x in skip_last_n(i, 1):
    print(x, end='')
1
2
3
4
5
6
7
i = StringIO(s)
for x in skip_last_n(i, 3):
    print(x, end='')
1
2
3
4
5