Find number of lines in csv without reading it [duplicate] - python

This question already has answers here:
How to get line count of a large file cheaply in Python?
(44 answers)
Closed 9 years ago.
Does there exist a way of finding the number of lines in a csv file without actually loading the whole file in memory (in Python)?
I'd expect there to be some special optimized function for it. All I can imagine right now is reading the file line by line and counting the lines, but that seems to defeat the purpose, since I only need the number of lines, not the actual content.

You don't need to load the whole file into memory, since a file object is iterable over its lines:
with open(path) as fp:
    count = 0
    for _ in fp:
        count += 1
Or, slightly more idiomatic:
with open(path) as fp:
    count = 0  # so count is defined even if the file is empty
    for count, _ in enumerate(fp, 1):
        pass

Yes, you need to read through the whole file before you can know how many lines are in it (though you don't have to hold it all in memory at once).
Just think of the file as one long string: Aaaaabbbbbbbcccccccc\ndddddd\neeeeee\n
To know how many 'lines' are in that string, you need to find how many \n characters it contains.
If an approximate number is enough, you can read a few lines (~20), compute the average number of characters per line, and then derive an estimate from the file's size (which you can get from the filesystem without reading the file).
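A minimal sketch of that estimation idea, assuming reasonably uniform line lengths (the function name and sample size are just illustrative):
import os
from itertools import islice

def estimate_line_count(path, sample_size=20):
    # Read a small sample of lines and extrapolate from the total file size.
    with open(path, 'rb') as f:
        sample = list(islice(f, sample_size))
    if not sample:
        return 0
    avg_line_length = sum(len(line) for line in sample) / float(len(sample))
    return int(os.path.getsize(path) / avg_line_length)
This only gives an estimate; an exact count still requires scanning the whole file.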

Related

How to cleverly read big file in chunks? [duplicate]

This question already has answers here:
How should I read a file line-by-line in Python?
(3 answers)
Closed 3 years ago.
I have a very big file (~10GB) and I want to read it in its entirety. In order to achieve this, I cut it into chunks. However, I have trouble cutting the big file into exploitable pieces: I want thousands of lines together, without any line being split in the middle. I have found a function here on SO that I have adapted a bit:
def readPieces(file):
    while True:
        data = file.read(4096).strip()
        if not data:
            break
        yield data

with open('bigfile.txt', 'r') as f:
    for chunk in readPieces(f):
        print(chunk)
I can specify the number of bytes I want to read (here 4096), but when I do so my lines get cut in the middle, and if I remove the size argument it reads the whole big file at once, which brings the process to a halt. How can I do this?
Also, the lines in my file aren't all the same size.
The following code reads the file line by line; each line can be garbage collected once the loop moves on to the next one.
with open('bigfile.txt') as file:
    for line in file:
        print(line)
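If the goal is chunks of whole lines rather than one line at a time, a possible sketch (not from the answer above; the chunk size is an arbitrary example) is to batch the line iterator with itertools.islice:
from itertools import islice

def read_in_line_chunks(file_obj, lines_per_chunk=10000):
    # Yield lists of complete lines so no line is ever split in the middle.
    while True:
        chunk = list(islice(file_obj, lines_per_chunk))
        if not chunk:
            break
        yield chunk

with open('bigfile.txt') as f:
    for chunk in read_in_line_chunks(f):
        # each chunk is a list of up to 10000 whole lines
        print(len(chunk))
Because the batching happens on the line iterator, only one chunk is held in memory at a time.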

Counting number of lines in a JSON file using Python [duplicate]

This question already has answers here:
How to get line count of a large file cheaply in Python?
(44 answers)
Closed 8 years ago.
I have a .json file which contains a JSON item on every line. I want to count how many JSON items I have within the file using Python. I am currently just counting the lines using the following code:
count = 0
with open(file) as f:
    for line in f:
        count += 1
This seems like an inefficient way of doing it, though. Is there a more efficient way of either calculating the number of JSON items in a .json file or counting the number of lines within a file?
Edit to fix my mistake:
This is a one-liner for counting lines in a file:
num = sum(1 for line in open(filename))
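If the per-line loop overhead matters, one common alternative (just a sketch; it assumes every line, including the last, ends with a newline) is to read the file in large binary blocks and count the newline bytes:
def count_lines(filename, block_size=1024 * 1024):
    # Count newline bytes block by block instead of iterating line by line.
    count = 0
    with open(filename, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count(b'\n')
    return count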

How to effectively remove a line in the middle of a large file? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Fastest Way to Delete a Line from Large File in Python
How to edit a line in middle of txt file without overwriting everything?
I know I can read every line into a list, remove a line, then write the list back.
But the file is large, is there a way to remove a part in the middle of the file, and needn't rewrite the whole file?
I don't know of a way to change the file in place, even using low-level file system calls, but you don't need to load it into a list, so you can do this without a large memory footprint:
with open('input_file', 'r') as input_file:
    with open('output_file', 'w') as output_file:
        for line in input_file:
            if should_delete(line):
                pass
            else:
                output_file.write(line)
This assumes that the section you want to delete is a line in a text file, and that should_delete is a function which determines whether the line should be kept or deleted. It is easy to change this slightly to work with a binary file instead, or to use a counter instead of a function.
Edit: If you're dealing with a binary file, you know the exact position you want to remove, and it's not too near the start of the file, you may be able to optimise it slightly with io.IOBase.truncate (see http://docs.python.org/2/library/io.html#io.IOBase). However, I would only suggest pursuing this if a profiler indicates that you really need to optimise to this extent.
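For illustration, a rough sketch of that in-place idea, assuming you already know the byte offsets of the line to remove (the function and its arguments are hypothetical): shift the tail of the file back over the removed range, then truncate.
def remove_byte_range(path, start, end):
    # start and end are assumed, already-known byte offsets of the line to drop
    with open(path, 'r+b') as f:
        f.seek(end)
        remainder = f.read()  # everything after the removed line
        f.seek(start)
        f.write(remainder)    # shift the tail back over the removed range
        f.truncate()          # cut off the now-duplicated bytes at the end
For a very large tail you would copy it across in fixed-size blocks instead of one read(), but the seek/write/truncate pattern stays the same.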

Reading only the end of huge text file [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Get last n lines of a file with Python, similar to tail
Read a file in reverse order using python
I have a file that's about 15GB in size; it's a log file whose output I'm supposed to analyze. I have already done a basic parse of a similar but GREATLY smaller file, with just a few lines of logging. Parsing strings is not the issue. The issue is the huge file and the amount of redundant data it contains.
Basically I'm attempting to make a Python script that I can tell, for example, "give me the last 5000 lines of the file". Handling the arguments and all that is basic, nothing special there, I can do that.
But how do I define or tell the file reader to ONLY read the number of lines I specified from the end of the file? I'm trying to skip the huge number of lines at the beginning of the file since I'm not interested in those and, to be honest, reading about 15GB of lines from a txt file takes too long. Is there a way to, err.. start reading from the end of the file? Does that even make sense?
It all boils down to this: reading a 15GB file line by line takes too long. So I want to skip the data at the beginning (redundant to me, at least) and only read the number of lines I want from the end of the file.
The obvious answer is to manually copy the last N lines of the file into another file, but is there a way to do this semi-automagically, just reading the last N lines from the end of the file with Python?
Farm this out to unix:
import os
os.popen('tail -n 1000 filepath').read()
use subprocess.Popen instead of os.popen if you need to be able to access stderr (and some other features)
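A sketch of that subprocess variant (the file path and line count are placeholders):
import subprocess

proc = subprocess.Popen(['tail', '-n', '1000', 'filepath'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
lines = out.splitlines()  # the last 1000 lines, as byte strings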
You need to seek to the end of the file, then read blocks backwards from the end, counting newlines, until you've found enough to cover your n lines.
Basically, you are re-implementing a simple form of tail.
Here's some lightly tested code that does just that:
import os, errno

def lastlines(hugefile, n, bsize=2048):
    # get newlines type, open in universal mode to find it
    with open(hugefile, 'rU') as hfile:
        if not hfile.readline():
            return  # empty, no point
        sep = hfile.newlines  # After reading a line, python gives us this
        assert isinstance(sep, str), 'multiple newline types found, aborting'

    # find a suitable seek position in binary mode
    with open(hugefile, 'rb') as hfile:
        hfile.seek(0, os.SEEK_END)
        linecount = 0
        pos = 0

        while linecount <= n + 1:
            # read at least n lines + 1 more; we need to skip a partial line later on
            try:
                hfile.seek(-bsize, os.SEEK_CUR)            # go backwards
                linecount += hfile.read(bsize).count(sep)  # count newlines
                hfile.seek(-bsize, os.SEEK_CUR)            # go back again
            except IOError as e:
                if e.errno == errno.EINVAL:
                    # Attempted to seek past the start, can't go further
                    bsize = hfile.tell()
                    hfile.seek(0, os.SEEK_SET)
                    pos = 0
                    linecount += hfile.read(bsize).count(sep)
                    break
                raise  # Some other I/O exception, re-raise
            pos = hfile.tell()

    # Re-open in text mode
    with open(hugefile, 'r') as hfile:
        hfile.seek(pos, os.SEEK_SET)  # our file position from above
        for line in hfile:
            # We've located n lines *or more*, so skip if needed
            if linecount > n:
                linecount -= 1
                continue
            # The rest we yield
            yield line
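A small usage example for the generator above (the file name and line count are placeholders):
for line in lastlines('hugefile.txt', 5000):
    print(line.rstrip())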
Even though I would prefer the 'tail' solution: if you know the maximum number of characters per line, you can implement another possible solution by getting the size of the file, opening a file handle, and using the 'seek' method with an estimate of the number of characters you are looking for.
The final code should look something like this (which also explains why I prefer the tail solution :) good luck!
import os

MAX_CHARS_PER_LINE = 80
# number_of_requested_lines is assumed to have been set by the caller
size_of_file = os.path.getsize('15gbfile.txt')
file_handler = open('15gbfile.txt', 'rb')
seek_index = size_of_file - (number_of_requested_lines * MAX_CHARS_PER_LINE)
file_handler.seek(seek_index)
buffer = file_handler.read()
You can improve this code by analyzing the newlines in the buffer you read (see the sketch below).
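For instance, a short follow-up sketch reusing the buffer and number_of_requested_lines from above: split the buffer on newlines and keep only the last N pieces, dropping the first piece since it is probably a partial line.
lines = buffer.split(b'\n')[1:]  # the first piece is likely a partial line
last_n_lines = lines[-number_of_requested_lines:]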
Good luck (and you should use the tail solution ;-) I am quite sure you can get tail for every OS).
The method I ended up preferring was just to use unix's tail for the job and modify the Python to accept input through standard input:
tail hugefile.txt -n1000 | python magic.py
It's nothing sexy but at least it takes care of the job. The big file is too big of a burden to handle, I found out, at least for my Python skills. So it was a lot easier just to add a pinch of *nix magic to cut down the file size. Tail was a new one for me, so I learned something and figured out another way of using the terminal to my advantage again. Thank you everyone.

it's possible to determine how many lines exist in file without per line iteration? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to get line count cheaply in Python?
Good day. I have some code below, which implements per-line file reading and counter incrementing.
def __set_quantity_filled_lines_in_file(self):
    count = 0
    with open(self.filename, 'r') as f:
        for line in f:
            count += 1
    return count
My question is: are there methods to determine how many lines of text data are in the current file without per-line iteration?
Thanks!
In general it's not possible to do better than reading every character in the file and counting newline characters.
It may be possible if you know details about the internal structure of the file. For example, if the file is 1024kB long, and every line is 1kB in length, then you can deduce there are 1024 lines in the file.
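A tiny sketch of that special case (the file name and record length are made-up assumptions):
import os

RECORD_LENGTH = 1024  # assumed fixed size of every line, in bytes
line_count = os.path.getsize('records.dat') // RECORD_LENGTH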
I'm not sure whether Python has such a function; I highly doubt it, and it would essentially require reading the whole file anyway. A newline is signified by the \n character (actually system dependent), so there is no way to know how many of those exist in a file without going through the whole file.
You could use the readlines() file method and this is probably the easiest.
If you want to be different, you could use the read() member function to get the entire file and count the line-terminator combinations (CR, LF, CRLF). However, you will have to deal with the various ways of terminating lines; note that collections.Counter only counts single characters, so the two-character CRLF terminator has to be counted with count() instead.
Something like:
with open("myfile", "rb") as f:
    d = f.read()

crlf = d.count(b'\r\n')     # Windows-style endings
cr = d.count(b'\r') - crlf  # bare \r (old Mac-style)
lf = d.count(b'\n') - crlf  # bare \n (Unix-style)
nlines = crlf + cr + lf
No, such information can only be retrieved by iterating over the whole file's content (or by reading the whole file into memory, but unless you know for sure that the files will always be small, don't even think about doing that).
Even if you do not loop over the file contents, the functions you call do. For example, len(f.readlines()) will read the whole file into a list just to count the number of elements. That's horribly inefficient since you don't need to store the file contents at all.
This gives the answer, but reads the whole file and stores the lines in a list
len(f.readlines())
