How to cleverly read big file in chunks? [duplicate] - python

This question already has answers here:
How should I read a file line-by-line in Python?
(3 answers)
Closed 3 years ago.
I have a very big file (~10GB) and I want to read it in its entirety. In order to achieve this, I cut it into chunks. However, I have trouble cutting the big file into exploitable pieces: I want thousands of lines together without having them split in the middle. I have found a function here on SO that I have adapted a bit:
def readPieces(file):
    while True:
        data = file.read(4096).strip()
        if not data:
            break
        yield data

with open('bigfile.txt', 'r') as f:
    for chunk in readPieces(f):
        print(chunk)
I can specify how many bytes I want to read at a time (here 4096, i.e. 4KB), but when I do, my lines get cut in the middle, and if I remove the size argument the whole file is read at once, which kills the process. How can I do this?
Also, the lines in my file are not all the same length.

The following code reads the file line by line; each previous line is garbage collected as the loop moves on.
with open('bigfile.txt') as file:
    for line in file:
        print(line)
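If you want chunks of thousands of complete lines rather than single lines, here is a minimal sketch along the same idea (the helper name read_line_chunks and the 10000-line chunk size are illustrative assumptions, not from the original answer):

from itertools import islice

def read_line_chunks(file_obj, lines_per_chunk=10000):
    # Yield lists of whole lines; no line is ever split in the middle.
    while True:
        chunk = list(islice(file_obj, lines_per_chunk))
        if not chunk:          # end of file reached
            break
        yield chunk

with open('bigfile.txt') as f:
    for chunk in read_line_chunks(f):
        print(len(chunk))      # process the chunk of complete lines here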

Related

How do you read the whole of a file in python as a single line instead of line by line? [duplicate]

This question already has answers here:
read the whole file at once
(2 answers)
Closed 7 months ago.
Let us say I wanted to read a whole file at once instead of going through it line by line (for example, to speed up counting how many times 'i' occurs in the file). How would I go about reading it as a whole instead of line by line as it is written?
with open("file.ext", 'r') as f:
data = f.read()
As others have mentioned, the official Python tutorial on Reading and Writing Files explains this. See in particular the Methods of File Objects section, which explains the use of f.read(size) and f.readline() and the difference between them.
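For the example mentioned in the question, a minimal sketch of counting how many times 'i' occurs after reading the whole file at once ("file.ext" is just the placeholder name from the question):

with open("file.ext", 'r') as f:
    data = f.read()            # the entire file as one string
print(data.count('i'))         # number of occurrences of 'i' in the whole file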

Python streaming text file [duplicate]

This question already has answers here:
Python seek to read constantly growing files
(1 answer)
How can I tail a log file in Python?
(14 answers)
Closed 2 years ago.
GOAL
I have a text file which receives at least 20 new lines every second.
Every time a new line appears in my txt file, I want that line to go through a process: for example, uppercase the line and print it out, then wait for the next line to appear.
WHAT I TRIED
My idea was to read the file line by line, the way you normally would in Python, because by the time it was reading the existing lines, more lines would have been added. However, Python reads the existing lines far too fast for that...
This was my code:
import time

file = open('/file/path/text.txt')
for line in file:
    line = line.upper()   # upper() must be called; without () it is just a method reference
    print(line)
    time.sleep(1)
I know there is PySpark, but is there an easier solution? I only have one text file, and it goes through a simple process (e.g. uppercase each line, print it, then wait for the next line to appear).
I am a newbie in Python, so I am sorry if the answer is obvious to some of you. I would appreciate any help. Thank you.
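One common pattern for this (a minimal sketch in the spirit of the linked "tail a log file" answers; the file path and the 0.1-second poll interval are assumptions, not from the question) is to keep reading from the end of the file and sleep briefly whenever no new line has arrived yet:

import time

with open('/file/path/text.txt') as f:
    f.seek(0, 2)                        # start at the current end of the file
    while True:
        line = f.readline()
        if not line:                    # nothing new yet
            time.sleep(0.1)             # wait briefly before checking again
            continue
        print(line.upper(), end='')     # process the freshly appended line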

Reading in a file, one chunk at a time [duplicate]

This question already has answers here:
Read multiple block of file between start and stop flags
(4 answers)
Closed 6 years ago.
I have a VERY large file formatted like this:
(mydelimiter)
line
line
(mydelimiter)
line
line
(mydelimiter)
Since the file is so large I can't read it all into memory at once. So I would like to read one chunk between "(mydelimiter)" markers at a time, perform some operations on it, then read in the next chunk.
This is the code I have so far:
with open(infile, 'r') as f:
    chunk = []
    for line in f:
        chunk.append(line)
Now, I'm not sure how to tell Python "keep appending lines UNTIL you hit another line with '(mydelimiter)' in it", and then save the line where it stopped and start there in the next iteration of the for loop.
Note: it's also not possible to read in a certain number of lines at a time since each chunk is variable length.
Aren't you perhaps overthinking this? Something as simple as the following code can do the trick for you:
with open(infile, 'r') as f:
    chunk = []
    for line in f:
        if line.strip() == '(mydelimiter)':   # strip the newline before comparing
            if chunk:
                call_something(chunk)
            chunk = []
        else:
            chunk.append(line)
    if chunk:                                  # don't forget the final chunk
        call_something(chunk)
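If you would rather not hard-code the call inside the loop, the same idea can be written as a generator (a sketch; the name read_delimited_chunks is an assumption) so that only one chunk is held in memory at a time:

def read_delimited_chunks(file_obj, delimiter='(mydelimiter)'):
    # Yield lists of lines found between delimiter lines.
    chunk = []
    for line in file_obj:
        if line.strip() == delimiter:
            if chunk:
                yield chunk
            chunk = []
        else:
            chunk.append(line)
    if chunk:                  # whatever follows the last delimiter
        yield chunk

with open(infile, 'r') as f:
    for chunk in read_delimited_chunks(f):
        call_something(chunk)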

Find number of lines in csv without reading it [duplicate]

This question already has answers here:
How to get line count of a large file cheaply in Python?
(44 answers)
Closed 9 years ago.
Is there a way to find the number of lines in a csv file without actually loading the whole file into memory (in Python)?
I'd expect there to be some special optimized function for it. All I can think of is reading it line by line and counting, but that seems to defeat the purpose, since I only need the number of lines, not the actual content.
You don't need to load the whole file into memory, since file objects are iterable over their lines:
with open(path) as fp:
    count = 0
    for _ in fp:
        count += 1
Or, slightly more idiomatic:
with open(path) as fp:
    for (count, _) in enumerate(fp, 1):
        pass
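If iterating over the lines turns out to be too slow, a common alternative (a sketch, not taken from the original answers) is to count newline bytes in fixed-size binary chunks:

def count_lines(path, chunk_size=1024 * 1024):
    count = 0
    with open(path, 'rb') as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:                  # end of file
                break
            count += chunk.count(b'\n')    # a file with no trailing newline is off by one
    return count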
Yes, you do need to read through the whole file before knowing how many lines are in it.
Just think of the file as one long string, e.g. Aaaaabbbbbbbcccccccc\ndddddd\neeeeee\n
To know how many 'lines' are in the string you need to count how many \n characters are in it.
If you want an approximate number, you can read a few lines (~20), see how many characters each line has on average, and then make an estimate from the file's size (which you can get without reading the file).
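A minimal sketch of that estimate (the 20-line sample and the helper name are assumptions; it also treats one character as roughly one byte, so it is only a rough guess):

import os
from itertools import islice

def estimate_line_count(path, sample_lines=20):
    with open(path) as fp:
        sample = list(islice(fp, sample_lines))
    if not sample:
        return 0
    avg_chars_per_line = sum(len(line) for line in sample) / len(sample)
    return int(os.path.getsize(path) / avg_chars_per_line)   # rough estimate only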

How to effectively remove a line in the middle of a large file? [duplicate]

This question already has answers here:
Possible Duplicate:
Fastest Way to Delete a Line from Large File in Python
How to edit a line in middle of txt file without overwriting everything?
Closed 10 years ago.
I know I can read every line into a list, remove a line, then write the list back.
But the file is large. Is there a way to remove a part in the middle of the file without rewriting the whole thing?
I don't know of a way to change the file in place, even using low-level file system commands, but you don't need to load it into a list, so you can do this without a large memory footprint:
with open('input_file', 'r') as input_file:
    with open('output_file', 'w') as output_file:
        for line in input_file:
            if should_delete(line):
                pass
            else:
                output_file.write(line)
This assumes that the section you want to delete is a line in a text file, and that should_delete is a function which determines whether the line should be kept or deleted. It is easy to change this slightly to work with a binary file instead, or to use a counter instead of a function.
Edit: If you're dealing with a binary file, you know the exact position you want to remove, and it's not too near the start of the file, you may be able to optimise it slightly with io.IOBase.truncate (see http://docs.python.org/2/library/io.html#io.IOBase). However, I would only suggest pursuing this if a profiler indicates that you really need to optimise to this extent.
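A minimal sketch of the "counter instead of a function" variant mentioned above (the function name and the 1-based line_to_delete parameter are illustrative assumptions):

def copy_without_line(src_path, dst_path, line_to_delete):
    # Copy src to dst, skipping one line, without loading the file into memory.
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for number, line in enumerate(src, 1):
            if number != line_to_delete:
                dst.write(line)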
