Python- lazily read file which does not have new lines [duplicate] - python

I usually read files like this in Python:
f = open('filename.txt', 'r')
for x in f:
doStuff(x)
f.close()
However, this splits the file by newlines. I now have a file which has all of its info in one line (45,000 strings separated by commas). While a file of this size is trivial to read in using something like
f = open('filename.txt', 'r')
doStuff(f.read())
f.close()
I am curious if for a much larger file which is all in one line it would be possible to achieve a similar iteration effect as in the first code snippet but with splitting by comma instead of newline, or by any other character?

The following function is a fairly straightforward way to do what you want:
def file_split(f, delim=',', bufsize=1024):
prev = ''
while True:
s = f.read(bufsize)
if not s:
break
split = s.split(delim)
if len(split) > 1:
yield prev + split[0]
prev = split[-1]
for x in split[1:-1]:
yield x
else:
prev += s
if prev:
yield prev
You would use it like this:
for item in file_split(open('filename.txt')):
doStuff(item)
This should be faster than the solution that EMS linked, and will save a lot of memory over reading the entire file at once for large files.

Open the file using open(), then use the file.read(x) method to read (approximately) the next x bytes from the file. You could keep requesting blocks of 4096 characters until you hit end-of-file.
You will have to implement the splitting yourself - you can take inspiration from the csv module, but I don't believe you can use it directly because it wasn't designed to deal with extremely long lines.

Related

How to quickly get the last line of a huge csv file (48M lines)? [duplicate]

This question already has answers here:
How to read the last line of a file in Python?
(10 answers)
Closed 1 year ago.
I have a csv file that grows until it reaches approximately 48M of lines.
Before adding new lines to it, I need to read the last line.
I tried the code below, but it got too slow and I need a faster alternative:
def return_last_line(filepath):
with open(filepath,'r') as file:
for x in file:
pass
return x
return_last_line('lala.csv')
Here is my take, in python:
I created a function that lets you choose how many last lines, because the last lines may be empty.
def get_last_line(file, how_many_last_lines = 1):
# open your file using with: safety first, kids!
with open(file, 'r') as file:
# find the position of the end of the file: end of the file stream
end_of_file = file.seek(0,2)
# set your stream at the end: seek the final position of the file
file.seek(end_of_file)
# trace back each character of your file in a loop
n = 0
for num in range(end_of_file+1):
file.seek(end_of_file - num)
# save the last characters of your file as a string: last_line
last_line = file.read()
# count how many '\n' you have in your string:
# if you have 1, you are in the last line; if you have 2, you have the two last lines
if last_line.count('\n') == how_many_last_lines:
return last_line
get_last_line('lala.csv', 2)
This lala.csv has 48 million lines, such as in your example. It took me 0 seconds to get the last line.
Here is code for finding the last line of a file mmap, and it should work on Unixen and derivatives and Windows alike (I've tested this on Linux only, please tell me if it works on Windows too ;), i.e. pretty much everywhere where it matters. Since it uses memory mapped I/O it could be expected to be quite performant.
It expects that you can map the entire file into the address space of a processor - should be OK for 50M file everywhere but for 5G file you'd need a 64-bit processor or some extra slicing.
import mmap
def iterate_lines_backwards(filename):
with open(filename, "rb") as f:
# memory-map the file, size 0 means whole file
with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
start = len(mm)
while start > 0:
start, prev = mm.rfind(b"\n", 0, start), start
slice = mm[start + 1:prev + 1]
# if the last character in the file was a '\n',
# technically the empty string after that is not a line.
if slice:
yield slice.decode()
def get_last_nonempty_line(filename):
for line in iterate_lines_backwards(filename):
if stripped := line.rstrip("\r\n"):
return stripped
print(get_last_nonempty_line("datafile.csv"))
As a bonus there is a generator iterate_lines_backwards that would efficiently iterate over the lines of a file in reverse for any number of lines:
print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
print(l, end="")
This is generally a rather tricky thing to do. A very efficient way of getting a chunk that includes the last lines is the following:
import os
def get_last_lines(path, offset=500):
""" An efficient way to get the last lines of a file.
IMPORTANT:
1. Choose offset to be greater than
max_line_length * number of lines that you want to recover.
2. This will throw an os.OSError if the file is shorter than
the offset.
"""
with path.open("rb") as f:
f.seek(-offset, os.SEEK_END)
while f.read(1) != b"\n":
f.seek(-2, os.SEEK_CUR)
return f.readlines()
You need to know the maximum line length though and ensure that the file is at least one offset long!
To use it, do the following:
from pathlib import Path
n_last_lines = 10
last_bit_of_file = get_last_lines(Path("/path/to/my/file"))
real_last_n_lines = last_bit_of_file[-10:]
Now finally you need to decode the binary to strings:
real_last_n_lines_non_binary = [x.decode() for x in real_last_n_lines]
Probably all of this could be wrapped in one more convenient function.
If you are running your code in a Unix based environment, you can execute tail shell command from Python to read the last line:
import subprocess
subprocess.run(['tail', '-n', '1', '/path/to/lala.csv'])
You could additionally store the last line in a separate file, which you update whenever you add new lines to the main file.
This works well for me:
https://pypi.org/project/file-read-backwards/
from file_read_backwards import FileReadBackwards
with FileReadBackwards("/tmp/file", encoding="utf-8") as frb:
# getting lines by lines starting from the last line up
for l in frb:
if l:
print(l)
break
An easy way to do this is with deque:
from collections import deque
def return_last_line(filepath):
with open(filepath,'r') as f:
q = deque(f, 1)
return q[0]
since seek() returns the position that it moved to, you can use it to move backward and position the cursor to the beginning of the last line.
with open("test.txt") as f:
p = f.seek(0,2)-1 # ignore trailing end of line
while p>0 and f.read(1)!="\n": # detect end of line (or start of file)
p = f.seek(p-1,0) # search backward
lastLine = f.read().strip() # read from start of last line
print(lastLine)
To get the last non-empty line, you can add a while loop around the search:
with open("test.txt") as f:
p,lastLine = f.seek(0,2),"" # start from end of file
while p and not lastLine: # want last non-empty line
while p>0 and f.read(1)!="\n": # detect end of line (or start of file)
p = f.seek(p-1,0) # search backward
lastLine = f.read().strip() # read from start of last line
Based on #kuropan
Faster and shorter:
# 60.lastlinefromlargefile.py
# juanfc 2021-03-17
import os
def get_last_lines(fileName, offset=500):
""" An efficient way to get the last lines of a file.
IMPORTANT:
1. Choose offset to be greater than
max_line_length * number of lines that you want to recover.
2. This will throw an os.OSError if the file is shorter than
the offset.
"""
with open(fileName, "rb") as f:
f.seek(-offset, os.SEEK_END)
return f.read().decode('utf-8').rstrip().split('\n')[-1]
print(get_last_lines('60.lastlinefromlargefile.py'))

filtering a weird text file in python

I have a text file in which each ID line starts with > and the next line(s) are the a sequence of characters. And the next line after the sequence of characters would be an other ID line starting with >. but in some of them, instead of sequence I have “Sequence unavailable”. The sequence after the ID line can be one or more lines.
like this example:
>ENSG00000173153|ENST00000000442|64073050;64074640|64073208;64074651
AAGCAGCCGGCGGCGCCGCCGAGTGAGGGGACGCGGCGCGGTGGGGCGGCGCGGCCCGAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCGGGCCGCTC
>ENSG00000004139|ENST00000003834
Sequence unavailable
I want to filter out those IDs with “Sequence unavailable”. The output should look like this:
output:
>ENSG00000173153|ENST00000000442|64073050;64074640|64073208;64074651
AAGCAGCCGGCGGCGCCGCCGAGTGAGGGGACGCGGCGCGGTGGGGCGGCGCGGCCCGAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCGGGCCGCTC
do you know how to do that in python?
Unlike the other answers, I’d strongly recommand against parsing the FASTA format manually. It’s not too hard but there are pitfalls, and it’s completely unnecessary since efficient, well-tested implementations exist:
Use Bio.SeqIO from BioPython; for example:
from Bio import SeqIO
for record in SeqIO.parse(filename, 'fasta'):
if record.seq != 'Sequenceunavailable':
SeqIO.write(record, outfile, 'fasta')
Note the missing space in 'Sequenceunavailable': reading the sequences in FASTA format will omit spaces.
How about this:
with open(filename, 'r+') as f:
data = f.read()
data = data.split('>')
result = ['>{}'.format(item) for item in data if item and 'Sequence unavailable' not in item]
f.seek(0)
for line in result:
f.write(line)
def main():
filename = open('text.txt', 'rU').readlines()
filterFile(filename)
def filterFile(SequenceFile):
outfile = open('outfile', 'w')
for line in SequenceFile:
if line.startswith('>'):
sequence = line.next()
if sequence.startswith('Sequence unavailable'):
//nothing should happen I suppose?
else:
outfile.write(line + "\n" + sequence + "\n")
main()
I unfortunately can't test this code right now but I made this out of the top of my head! Please test it and let me know what the outcome is so I can adjust the code :-)
So I don't exactly know how large these files will get, just in case, I'm doing it without mapping the file in memory:
with open(filename) as fh:
with open(filename+'.new', 'w+') as fh_new:
for idline, geneseq in zip(*[iter(fh)] * 2):
if geneseq.strip() != 'Sequence unavailable':
fh_new.write(idline)
fh_new.write(geneseq)
It works by creating a new file, then the zip thing is some magic to read the 2 lines of the file, the idline will be the first part and the geneseq the second part.
This solution should be relatively cheap in computer power but will create an extra output file.

Split a file based on number of occurrences of 1 in position 1 of a line

I routinely use PowerShell to split larger text or csv files in to smaller files for quicker processing. However, I have a few files that come over that are an usual format. These are basically print files to a text file. Each record starts with a single line that starts with a 1 and there is nothing else on the line.
What I need to be able to do is to split a file based on the number of statements. So, basically if I want to split the file in to chunks of 3000 statements, I would go down until I see the 3001 occurrence of 1 in position 1 and copy everything before that to the new file. I can run this from windows, linux or OS X so pretty much anything is open for the split.
Any ideas would be greatly appreciated.
Maybe try recognizing it by the fact that there is a '1' plus a new line?
with open(input_file, 'r') as f:
my_string = f.read()
my_list = my_string.split('\n1\n')
Separates each record to a list assuming it has the following format:
1
....
....
1
....
....
....
You can then output each element in the list to a separate file.
for x in range(len(my_list)):
print >> str(x)+'.txt', my_list[x]
To avoid loading the file in memory, you could define a function that generates records incrementally and then use itertool's grouper recipe to write each 3000 records to a new file:
#!/usr/bin/env python3
from itertools import zip_longest
with open('input.txt') as input_file:
files = zip_longest(*[generate_records(input_file)]*3000, filevalue=())
for n, records in enumerate(files):
open('output{n}.txt'.format(n=n), 'w') as output_file:
output_file.writelines(''.join(lines)
for r in records for lines in r)
where generate_records() yields one record at a time where a record is also an iterator over lines in the input file:
from itertools import chain
def generate_records(input_file, start='1\n', eof=[]):
def record(yield_start=True):
if yield_start:
yield start
for line in input_file:
if line == start: # start new record
break
yield line
else: # EOF
eof.append(True)
# the first record may include lines before the first 1\n
yield chain(record(yield_start=False),
record())
while not eof:
yield record()
generate_records() is a generator that yield generators like itertools.groupby() does.
For performance reasons, you could read/write chunks of multiple lines at once.

Change delimiter on "for each" loops on Strings in python

I need to read an input text file in python, by streaming line by line. That means load the text file line by line instead of all at once into memory. But my line delimiters are not whitespaces, they are arbitrary characters.
Here is a method on Stack Overflow for loading files line by line:
with open("log.txt") as infile:
for line in infile:
do_something_with(line)
The above is perfect, however I need to change the delimiter from whitespaces to a different character.
How can this be done? Thank you.
import re
def open_delimited(filename, delimiter, chunksize=1024, *args, **kwargs):
with open(filename, *args, **kwargs) as infile:
remainder = ''
for chunk in iter(lambda: infile.read(chunksize), ''):
pieces = re.split(delimiter, remainder+chunk)
for piece in pieces[:-1]:
yield piece
remainder = pieces[-1]
if remainder:
yield remainder
for line in open_delimited("log.txt", delimiter='/'):
print(repr(line))
Python doesn't have a native construct for this. You can write a generator that reads the characters one at a time and accumulates them until you have a whole delimited item.
def items(infile, delim):
item = []
c = infile.read(1)
while c:
if c == delim:
yield "".join(item)
item = []
else:
c = infile.read(1)
item.append(c)
yield "".join(item)
with open("log.txt") as infile:
for item in items(infile, ","): # comma delimited
do_something_with(item)
You will get better performance if you read the file in chunks (say, 64K or so) and split these. However, the logic for this is more complicated since an item may be split across chunks, so I won't go into it here as I'm not 100% sure I'd get it right. :-)

Two simple questions about python

I have 2 simple questions about python:
1.How to get number of lines of a file in python?
2.How to locate the position in a file object to the
last line easily?
lines are just data delimited by the newline char '\n'.
1) Since lines are variable length, you have to read the entire file to know where the newline chars are, so you can count how many lines:
count = 0
for line in open('myfile'):
count += 1
print count, line # it will be the last line
2) reading a chunk from the end of the file is the fastest method to find the last newline char.
def seek_newline_backwards(file_obj, eol_char='\n', buffer_size=200):
if not file_obj.tell(): return # already in beginning of file
# All lines end with \n, including the last one, so assuming we are just
# after one end of line char
file_obj.seek(-1, os.SEEK_CUR)
while file_obj.tell():
ammount = min(buffer_size, file_obj.tell())
file_obj.seek(-ammount, os.SEEK_CUR)
data = file_obj.read(ammount)
eol_pos = data.rfind(eol_char)
if eol_pos != -1:
file_obj.seek(eol_pos - len(data) + 1, os.SEEK_CUR)
break
file_obj.seek(-len(data), os.SEEK_CUR)
You can use that like this:
f = open('some_file.txt')
f.seek(0, os.SEEK_END)
seek_newline_backwards(f)
print f.tell(), repr(f.readline())
Let's not forget
f = open("myfile.txt")
lines = f.readlines()
numlines = len(lines)
lastline = lines[-1]
NOTE: this reads the whole file in memory as a list. Keep that in mind in the case that the file is very large.
The easiest way is simply to read the file into memory. eg:
f = open('filename.txt')
lines = f.readlines()
num_lines = len(lines)
last_line = lines[-1]
However for big files, this may use up a lot of memory, as the whole file is loaded into RAM. An alternative is to iterate through the file line by line. eg:
f = open('filename.txt')
num_lines = sum(1 for line in f)
This is more efficient, since it won't load the entire file into memory, but only look at a line at a time. If you want the last line as well, you can keep track of the lines as you iterate and get both answers by:
f = open('filename.txt')
count=0
last_line = None
for line in f:
num_lines += 1
last_line = line
print "There were %d lines. The last was: %s" % (num_lines, last_line)
One final possible improvement if you need only the last line, is to start at the end of the file, and seek backwards until you find a newline character. Here's a question which has some code doing this. If you need both the linecount as well though, theres no alternative except to iterate through all lines in the file however.
For small files that fit memory,
how about using str.count() for getting the number of lines of a file:
line_count = open("myfile.txt").read().count('\n')
I'd like too add to the other solutions that some of them (those who look for \n) will not work with files with OS 9-style line endings (\r only), and that they may contain an extra blank line at the end because lots of text editors append it for some curious reasons, so you might or might not want to add a check for it.
The only way to count lines [that I know of] is to read all lines, like this:
count = 0
for line in open("file.txt"): count = count + 1
After the loop, count will have the number of lines read.
For the first question there're already a few good ones, I'll suggest #Brian's one as the best (most pythonic, line ending character proof and memory efficient):
f = open('filename.txt')
num_lines = sum(1 for line in f)
For the second one, I like #nosklo's one, but modified to be more general should be:
import os
f = open('myfile')
to = f.seek(0, os.SEEK_END)
found = -1
while found == -1 and to > 0:
fro = max(0, to-1024)
f.seek(fro)
chunk = f.read(to-fro)
found = chunk.rfind("\n")
to -= 1024
if found != -1:
found += fro
It seachs in chunks of 1Kb from the end of the file, until it finds a newline character or the file ends. At the end of the code, found is the index of the last newline character.
Answer to the first question (beware of poor performance on large files when using this method):
f = open("myfile.txt").readlines()
print len(f) - 1
Answer to the second question:
f = open("myfile.txt").read()
print f.rfind("\n")
P.S. Yes I do understand that this only suits for small files and simple programs. I think I will not delete this answer however useless for real use-cases it may seem.
Answer1:
x = open("file.txt")
opens the file or we have x associated with file.txt
y = x.readlines()
returns all lines in list
length = len(y)
returns length of list to Length
Or in one line
length = len(open("file.txt").readlines())
Answer2 :
last = y[-1]
returns the last element of list
Approach:
Open the file in read-mode and assign a file object named “file”.
Assign 0 to the counter variable.
Read the content of the file using the read function and assign it to a
variable named “Content”.
Create a list of the content where the elements are split wherever they encounter an “\n”.
Traverse the list using a for loop and iterate the counter variable respectively.
Further the value now present in the variable Counter is displayed
which is the required action in this program.
Python program to count the number of lines in a text file
# Opening a file
file = open("filename","file mode")#file mode like r,w,a...
Counter = 0
# Reading from file
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
if i:
Counter += 1
print("This is the number of lines in the file")
print(Counter)
The above code will print the number of lines present in a file. Replace filename with the file with extension and file mode with read - 'r'.

Categories