I have a medium-size file (25MB, 1000000 rows), and I want to read every row except every third row.
FIRST QUESTION: Is it faster to load the whole file into memory and then read the rows (method .read()), or load and read one row at the time (method .readline())?
Since I'm not an experienced coder I tried the second option with islice method from itertools module.
import intertools
with open(input_file) as inp:
inp_atomtype = itertools.islice(inp, 0, 40, 3)
inp_atomdata = itertools.islice(inp, 1, 40, 3)
for atomtype, atomdata in itertools.zip_longest(inp_atomtype, inp_atomdata):
print(atomtype + atomdata)
Although looping through single generator (inp_atomtype or inp_atomdata) prints correct data, looping through both of them simultaneously (as in this code) prints wrong data.
SECOND QUESTION: How can I reach desired rows using generators?
You don't need to slice the iterator, a simple line counter should be enough:
with open(input_file) as f:
current_line = 0
for line in f:
current_line += 1
if current_line % 3: # ignore every third line
print(line) # NOTE: print() will add an additional new line by default
As for turning it into a generator, just yield the line instead of printing.
When it comes to speed, given that you'll be reading your lines anyway the I/O part will probably take the same but you might benefit a bit (in total processing time) from fast list slicing instead of counting lines if you have enough working memory to keep the file contents and if loading the whole file upfront instead of streaming is acceptable.
yield is perfect for this.
This functions yields pairs from an iterable and skip every third item:
def two_thirds(seq):
_iter = iter(seq)
while True:
yield (next(_iter), next(_iter))
next(_iter)
You will lose half pairs, which means that two_thirds(range(2)) will stop iterating immediately.
https://repl.it/repls/DullNecessaryCron
You can also use the grouper recipe from itertools doc and ignore the third item in each tuple generated:
for atomtype, atomdata, _ in grouper(lines, 3):
pass
FIRST QUESTION: I am pretty sure that .readline() is faster than .read(). Plus, the fastest way based my test is to do lopping like:
with open(file, 'r') as f:
for line in f:
...
SECOND QUESTION: I am not quite sure abut this. you may consider to use yield.
There is a code snippet you may refer:
def myreadlines(f, newline):
buf = ""
while True:
while newline in buf:
pos = buf.index(newline)
yield buf[:pos]
buf = buf[pos + len(newline):]
chunk = f.read(4096)
if not chunk:
# the end of file
yield buf
break
buf += chunk
with open("input.txt") as f:
for line in myreadlines(f, "{|}"):
print (line)
q2: here's my generator:
def yield_from_file(input_file):
with open(input_file) as file:
yield from file
def read_two_skip_one(gen):
while True:
try:
val1 = next(gen)
val2 = next(gen)
yield val1, val2
_ = next(gen)
except StopIteration:
break
if __name__ == '__main__':
for atomtype, atomdata in read_two_skip_one(yield_from_file('sample.txt')):
print(atomtype + atomdata)
sample.txt was generated with a bash shell (it's just lines counting to 100)
for i in {001..100}; do echo $i; done > sample.txt
regarding q1: if you're reading the file multiple times, you'd be better off to have it in memory. otherwise you're fine reading it line by line.
Regarding the problem you're having with the wrong results:
both itertools.islice(inp, 0, 40, 3) statements will use inp as generator. Both will call next(inp), to provide you with a value.
Each time you call next() on an iterator, it will change its state, so that's where your problems come from.
You can use a generator expression:
with open(input_file, 'r') as f:
generator = (line for e, line in enumerate(f, start=1) if e % 3)
enumerate adds line numbers to each line, and the if clause ignores line numbers divisible by 3 (default numbering starts at 0, so you have to specify start=1 to get the desired pattern).
Keep in mind that you can only use the generator while the file is still open.
Related
This question already has answers here:
How to read the last line of a file in Python?
(10 answers)
Closed 1 year ago.
I have a csv file that grows until it reaches approximately 48M of lines.
Before adding new lines to it, I need to read the last line.
I tried the code below, but it got too slow and I need a faster alternative:
def return_last_line(filepath):
with open(filepath,'r') as file:
for x in file:
pass
return x
return_last_line('lala.csv')
Here is my take, in python:
I created a function that lets you choose how many last lines, because the last lines may be empty.
def get_last_line(file, how_many_last_lines = 1):
# open your file using with: safety first, kids!
with open(file, 'r') as file:
# find the position of the end of the file: end of the file stream
end_of_file = file.seek(0,2)
# set your stream at the end: seek the final position of the file
file.seek(end_of_file)
# trace back each character of your file in a loop
n = 0
for num in range(end_of_file+1):
file.seek(end_of_file - num)
# save the last characters of your file as a string: last_line
last_line = file.read()
# count how many '\n' you have in your string:
# if you have 1, you are in the last line; if you have 2, you have the two last lines
if last_line.count('\n') == how_many_last_lines:
return last_line
get_last_line('lala.csv', 2)
This lala.csv has 48 million lines, such as in your example. It took me 0 seconds to get the last line.
Here is code for finding the last line of a file mmap, and it should work on Unixen and derivatives and Windows alike (I've tested this on Linux only, please tell me if it works on Windows too ;), i.e. pretty much everywhere where it matters. Since it uses memory mapped I/O it could be expected to be quite performant.
It expects that you can map the entire file into the address space of a processor - should be OK for 50M file everywhere but for 5G file you'd need a 64-bit processor or some extra slicing.
import mmap
def iterate_lines_backwards(filename):
with open(filename, "rb") as f:
# memory-map the file, size 0 means whole file
with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
start = len(mm)
while start > 0:
start, prev = mm.rfind(b"\n", 0, start), start
slice = mm[start + 1:prev + 1]
# if the last character in the file was a '\n',
# technically the empty string after that is not a line.
if slice:
yield slice.decode()
def get_last_nonempty_line(filename):
for line in iterate_lines_backwards(filename):
if stripped := line.rstrip("\r\n"):
return stripped
print(get_last_nonempty_line("datafile.csv"))
As a bonus there is a generator iterate_lines_backwards that would efficiently iterate over the lines of a file in reverse for any number of lines:
print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
print(l, end="")
This is generally a rather tricky thing to do. A very efficient way of getting a chunk that includes the last lines is the following:
import os
def get_last_lines(path, offset=500):
""" An efficient way to get the last lines of a file.
IMPORTANT:
1. Choose offset to be greater than
max_line_length * number of lines that you want to recover.
2. This will throw an os.OSError if the file is shorter than
the offset.
"""
with path.open("rb") as f:
f.seek(-offset, os.SEEK_END)
while f.read(1) != b"\n":
f.seek(-2, os.SEEK_CUR)
return f.readlines()
You need to know the maximum line length though and ensure that the file is at least one offset long!
To use it, do the following:
from pathlib import Path
n_last_lines = 10
last_bit_of_file = get_last_lines(Path("/path/to/my/file"))
real_last_n_lines = last_bit_of_file[-10:]
Now finally you need to decode the binary to strings:
real_last_n_lines_non_binary = [x.decode() for x in real_last_n_lines]
Probably all of this could be wrapped in one more convenient function.
If you are running your code in a Unix based environment, you can execute tail shell command from Python to read the last line:
import subprocess
subprocess.run(['tail', '-n', '1', '/path/to/lala.csv'])
You could additionally store the last line in a separate file, which you update whenever you add new lines to the main file.
This works well for me:
https://pypi.org/project/file-read-backwards/
from file_read_backwards import FileReadBackwards
with FileReadBackwards("/tmp/file", encoding="utf-8") as frb:
# getting lines by lines starting from the last line up
for l in frb:
if l:
print(l)
break
An easy way to do this is with deque:
from collections import deque
def return_last_line(filepath):
with open(filepath,'r') as f:
q = deque(f, 1)
return q[0]
since seek() returns the position that it moved to, you can use it to move backward and position the cursor to the beginning of the last line.
with open("test.txt") as f:
p = f.seek(0,2)-1 # ignore trailing end of line
while p>0 and f.read(1)!="\n": # detect end of line (or start of file)
p = f.seek(p-1,0) # search backward
lastLine = f.read().strip() # read from start of last line
print(lastLine)
To get the last non-empty line, you can add a while loop around the search:
with open("test.txt") as f:
p,lastLine = f.seek(0,2),"" # start from end of file
while p and not lastLine: # want last non-empty line
while p>0 and f.read(1)!="\n": # detect end of line (or start of file)
p = f.seek(p-1,0) # search backward
lastLine = f.read().strip() # read from start of last line
Based on #kuropan
Faster and shorter:
# 60.lastlinefromlargefile.py
# juanfc 2021-03-17
import os
def get_last_lines(fileName, offset=500):
""" An efficient way to get the last lines of a file.
IMPORTANT:
1. Choose offset to be greater than
max_line_length * number of lines that you want to recover.
2. This will throw an os.OSError if the file is shorter than
the offset.
"""
with open(fileName, "rb") as f:
f.seek(-offset, os.SEEK_END)
return f.read().decode('utf-8').rstrip().split('\n')[-1]
print(get_last_lines('60.lastlinefromlargefile.py'))
I'm trying to start reading some file from line 3, but I can't.
I've tried to use readlines() + the index number of the line, as seen bellow:
x = 2
f = open('urls.txt', "r+").readlines( )[x]
line = next(f)
print(line)
but I get this result:
Traceback (most recent call last):
File "test.py", line 441, in <module>
line = next(f)
TypeError: 'str' object is not an iterator
I would like to be able to set any line, as a variable, and from there, all the time that I use next() it goes to the next line.
IMPORTANT: as this is a new feature and all my code already uses next(f), the solution needs to be able to work with it.
Try this (uses itertools.islice):
from itertools import islice
f = open('urls.txt', 'r+')
start_at = 3
file_iterator = islice(f, start_at - 1, None)
# to demonstrate
while True:
try:
print(next(file_iterator), end='')
except StopIteration:
print('End of file!')
break
f.close()
urls.txt:
1
2
3
4
5
Output:
3
4
5
End of file!
This solution is better than readlines because it doesn't load the entire file into memory and only loads parts of it when needed. It also doesn't waste time iterating previous lines when islice can do that, making it much faster than #MadPhysicist's answer.
Also, consider using the with syntax to guarantee the file gets closed:
with open('urls.txt', 'r+') as f:
# do whatever
The readlines method returns a list of strings for the lines. So when you take readlines()[2] you're getting the third line, as a string. Calling next on that string then makes no sense, so you get an error.
The easiest way to do this is to slice the list: readlines()[x:] gives a list of everything from line x onwards. Then you can use that list however you like.
If you have your heart set on an iterator, you can turn a list (or pretty much anything) into an iterator with the iter builtin function. Then you can next it to your heart's content.
The following code will allow you to use an iterator to print the first line:
In [1]: path = '<path to text file>'
In [2]: f = open(path, "r+")
In [3]: line = next(f)
In [4]: print(line)
This code will allow you to print the lines starting from the xth line:
In [1]: path = '<path to text file>'
In [2]: x = 2
In [3]: f = iter(open(path, "r+").readlines()[x:])
In [4]: f = iter(f)
In [5]: line = next(f)
In [6]: print(line)
Edit: Edited the solution based on #Tomothy32's observation.
The line you printed returns a string:
open('urls.txt', "r+").readlines()[x]
open returns a file object. Its readlines method returns a list of strings. Indexing with [x] returns the third line in the file as a single string.
The first problem is that you open the file without closing it. The second is that your index doesn't specify a range of lines until the end. Here's an incremental improvement:
with open('urls.txt', 'r+') as f:
lines = f.readlines()[x:]
Now lines is a list of all the lines you want. But you first read the whole file into memory, then discarded the first two lines. Also, a list is an iterable, not an iterator, so to use next on it effectively, you'd need to take an extra step:
lines = iter(lines)
If you want to harness the fact that the file is already a rather efficient iterator, apply next to it as many times as you need to discard unwanted lines:
with open('urls.txt', 'r+') as f:
for _ in range(x):
next(f)
# now use the file
print(next(f))
After the for loop, any read operation you do on the file will start from the third line, whether it be next(f), f.readline(), etc.
There are a few other ways to strip the first lines. In all cases, including the example above, next(f) can be replaced with f.readline():
for n, _ in enumerate(f):
if n == x:
break
or
for _ in zip(f, range(x)): pass
After you run either of these loops, next(f) will return the xth line.
Just call next(f) as many times as you need to. (There's no need to overcomplicate this with itertools, nor to slurp the entire file with readlines.)
lines_to_skip = 3
with open('urls.txt') as f:
for _ in range(lines_to_skip):
next(f)
for line in f:
print(line.strip())
Output:
% cat urls.txt
url1
url2
url3
url4
url5
% python3 test.py
url4
url5
I usually read files like this in Python:
f = open('filename.txt', 'r')
for x in f:
doStuff(x)
f.close()
However, this splits the file by newlines. I now have a file which has all of its info in one line (45,000 strings separated by commas). While a file of this size is trivial to read in using something like
f = open('filename.txt', 'r')
doStuff(f.read())
f.close()
I am curious if for a much larger file which is all in one line it would be possible to achieve a similar iteration effect as in the first code snippet but with splitting by comma instead of newline, or by any other character?
The following function is a fairly straightforward way to do what you want:
def file_split(f, delim=',', bufsize=1024):
prev = ''
while True:
s = f.read(bufsize)
if not s:
break
split = s.split(delim)
if len(split) > 1:
yield prev + split[0]
prev = split[-1]
for x in split[1:-1]:
yield x
else:
prev += s
if prev:
yield prev
You would use it like this:
for item in file_split(open('filename.txt')):
doStuff(item)
This should be faster than the solution that EMS linked, and will save a lot of memory over reading the entire file at once for large files.
Open the file using open(), then use the file.read(x) method to read (approximately) the next x bytes from the file. You could keep requesting blocks of 4096 characters until you hit end-of-file.
You will have to implement the splitting yourself - you can take inspiration from the csv module, but I don't believe you can use it directly because it wasn't designed to deal with extremely long lines.
Trying to write a code that will find all of a certain type of character in a text file
For vowels it'll find all of the number of a's but won't reloop through text to read e's. help?
def finder_character(file_name,character):
in_file = open(file_name, "r")
if character=='vowel':
brain_rat='aeiou'
elif character=='consonant':
brain_rat='bcdfghjklmnpqrstvwxyz'
elif character=='space':
brain_rat=''
else:
brain_rat='!##$%^&*()_+=-123456789{}|":?><,./;[]\''
found=0
for line in in_file:
for i in range (len(brain_rat)):
found += finder(file_name,brain_rat[i+1,i+2])
in_file.close()
return found
def finder(file_name,character):
in_file = open(file_name, "r")
line_number = 1
found=0
for line in in_file:
line=line.lower()
found +=line.count(character)
return found
If you want to use your original code, you have to pass the filename to the finder() function, and open the file there, for each char you are testing for.
The reason for this is that the file object (in_file) is a generator, not a list. The way a generator works, is that it returns the next item each time you call their next() method. When you say
for line in in_file:
The for ... in statement calls in_file.next() as long as the next() method "returns" (it actually use the keyword yield, but don't think about that for now) a value. When the generator doesn't return any values any longer, we say that the generator is exhausted. You can't re-use an exhausted generator. If you want to start over again, you have to make a new generator.
I allowed myself to rewrite your code. This should give you the desired result. If anything is unclear, please ask!
def finder_character(file_name,character):
with open(file_name, "r") as ifile:
if character=='vowel':
brain_rat='aeiou'
elif character=='consonant':
brain_rat='bcdfghjklmnpqrstvwxyz'
elif character=='space':
brain_rat=' '
else:
brain_rat='!##$%^&*()_+=-123456789{}|":?><,./;[]\''
return sum(1 if c.lower() in brain_rat else 0 for c in ifile.read())
test.txt:
eeehhh
iii!#
kk ="k
oo o
Output:
>>>print(finder_character('test.txt', 'vowel'))
9
>>>print(finder_character('test.txt', 'consonant'))
6
>>>print(finder_character('test.txt', 'space'))
2
>>>print(finder_character('test.txt', ''))
4
If you are having problems understanding the return line, it should be read backwards, like this:
Sum this generator:
Make a generator with values as v in:
for row in ifile.read():
if c.lower() in brain_rat:
v = 1
else:
v = 0
If you want to know more about generators, I recommend the Python Wiki page concerning it.
This seems to be what you are trying to do in finder_character. I'm not sure why you need finder at all.
In python you can loop over iterables (like strings), so you don't need to do range(len(string)).
for line in in_file:
for i in brain_rat:
if i in line: found += 1
There appear to be a few other oddities in your code too:
You open (and iterate through) the file twice, but only closed once.
line_number is never used
You get the total of a character in a file for each line in the file, so the total will be vastly inflated.
This is probably a much safer version, with open... is generally better than open()... file.close() as you don't need to worry as much about error handling and closing. I've added some comments to help explain what you are trying to do.
def finder_character(file_name,character):
found=0 # Initialise the counter
with open(file_name, "r") as in_file:
# Open the file
in_file = file_name.split('\n')
opts = { 'vowel':'aeiou',
'consonant':'bcdfghjklmnpqrstvwxyz',
'space':'' }
default= '!##$%^&*()_+=-123456789{}|":?><,./;[]\''
for line in in_file:
# Iterate through each line in the file
for c in opts.get(character,default):
With each line, also iterate through the set of chars to check.
if c in line.lower():
# If the current character is in the line
found += 1 # iterate the counter.
return found # return the counter
I am looking for a method in Python which can read multiple lines from a file(10 lines at a time). I have already looked into readlines(sizehint), I tried to pass value 10 but doesn't read only 10 lines. It actually reads till end of the file(I have tried on the small file). Each line is 11 bytes long and each read should fetch me 10 lines each time. If less than 10 lines are found then return only those lines. My actual file contains more than 150K lines.
Any idea how I can achieve this?
You're looking for itertools.islice():
with open('data.txt') as f:
lines = []
while True:
line = list(islice(f, 10)) #islice returns an iterator ,so you convert it to list here.
if line:
#do something with current set of <=10 lines here
lines.append(line) # may be store it
else:
break
print lines
This should do it
def read10Lines(fp):
answer = []
for i in range(10):
answer.append(fp.readline())
return answer
Or, the list comprehension:
ten_lines = [fp.readline() for _ in range(10)]
In both cases, fp = open('path/to/file')
Another solution which can get rid of the silly infinite loop in favor of a more familiar for loop relies on itertools.izip_longest and a small trick with iterators. The trick is that zip(*[iter(iterator)]*n) breaks iterator up into chunks of size n. Since a file is already generator-like iterator (as opposed to being sequence like), we can write:
from itertools import izip_longest
with open('data.txt') as f:
for ten_lines in izip_longest(*[f]*10,fillvalue=None):
if ten_lines[-1] is None:
ten_lines = filter(ten_lines) #filter removes the `None` values at the end
process(ten_lines)
from itertools import groupby, count
with open("data.txt") as f:
groups = groupby(f, key=lambda x,c=count():next(c)//10)
for k, v in groups:
bunch_of_lines = list(v)
print bunch_of_lines