MemoryError when trying to load 5GB text file - python

I want to read data stored in text format in a 5 GB file. When I try to read the contents of the file using this code:
file = open('../data/entries_en.txt', 'r')
data = file.readlines()
this error occurs:
data = file.readlines()
MemoryError
My laptop has 8 GB of memory and at least 4 GB of it is free when I run the program, but when I monitor system performance, the error happens once Python is using only about 1.5 GB of memory.
I'm using Python 2.7, but if it matters, please give solutions for both 2.x and 3.x.
What should I do to read this file?

The best way to handle large files like this is to iterate over them line by line:
with open('../file.txt', 'r') as f:
    for line in f:
        pass  # do stuff with the line here
readlines() fails with a MemoryError because it tries to load the whole file into memory at once. The code above also closes the file automatically once you are done processing it.
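For example, if the goal were just to count the entries, the loop body could look like this (a minimal sketch; the path is the one from the question and the per-line work is a placeholder):
line_count = 0
with open('../data/entries_en.txt', 'r') as f:
    for line in f:        # only one line is held in memory at a time
        line_count += 1   # replace with your real per-line processing
print(line_count)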

If you want to process the lines in the file, you should instead use:
for line in file:
    pass  # do something with the line
It reads the file line by line instead of reading it all into RAM at once.
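The pattern is the same in Python 2.7 and 3.x. Under Python 3 you may also want to name the encoding explicitly when streaming a large text file (a sketch, assuming the data is UTF-8; errors='replace' is an assumption that substitutes undecodable bytes instead of raising):
with open('../data/entries_en.txt', 'r', encoding='utf-8', errors='replace') as f:
    for line in f:
        pass  # do something with the line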

Related

Reading a binary file from memory in chunks of 10 bytes with python

I have a very big .BIN file and I am loading it into the available RAM (128 GB) using:
ice.Load_data_to_memory("global.bin", True)
(see: https://github.com/iceland2k14/secp256k1)
Now I need to read the content of the file in chunks of 10 bytes, and for that I am using:
with open('global.bin', 'rb') as bf:
    while True:
        data = bf.read(10)
        if not data:
            break
        if data == y:  # y is some 10-byte value defined elsewhere
            pass       # do this!
This works well with the rest of the code if the .BIN file is small, but not if the file is big. My suspicion is that, by writing the code this way, I either open the .BIN file twice or get no result, because with open('global.bin', 'rb') as bf is not "synchronized" with ice.Load_data_to_memory("global.bin", True). So I would like to find a way to read the 10-byte chunks directly from memory, without having to open the file again with "with open('global.bin', 'rb') as bf".
I found a working approach here: LOAD FILE INTO MEMORY
This works well with a small .BIN file containing 3 strings of 10 bytes each:
import mmap

with open('0x4.bin', 'rb') as f:
    # Size 0 will map the ENTIRE file into memory!
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # file is mapped read-only
    # Proceed with your code here -- note the file is already in memory,
    # so read(10) here will be as fast as could be
    data = m.read(10)  # using read(10) instead of readline()
    while data:
        # do something with the 10-byte chunk here
        data = m.read(10)  # advance to the next chunk
Now the point: with a much bigger .BIN file, loading the whole file into memory takes much longer, and the while data: part starts working immediately, so I would need a delay here so that the script only starts working AFTER the file has been completely loaded into memory...
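For reference, here is a minimal self-contained sketch of the chunked mmap read described above, assuming Python 3 on a POSIX system (on Windows, access=mmap.ACCESS_READ replaces the prot argument); it slices the map directly in 10-byte steps and does not involve ice.Load_data_to_memory at all:
import mmap

with open('global.bin', 'rb') as bf:
    with mmap.mmap(bf.fileno(), 0, prot=mmap.PROT_READ) as m:
        # The OS pages the file in on demand; slicing the map returns bytes.
        for offset in range(0, len(m), 10):
            chunk = m[offset:offset + 10]  # up to 10 bytes
            # compare or process the chunk here, e.g. if chunk == y: ...
            pass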

Limiting Python file space in memory

When you open a file in Python (e.g., open(filename, 'r')), does it load the entire file into memory? More importantly, is there a way to partially load a file into memory to save memory space (for larger systems), or am I overthinking this? In particular, I'm trying to optimize this in a cloud environment where I only need ~1-2 lines of a large file and would prefer not to load all of it into memory, since we pay for computation time.
This is a general question, nothing was tested; I'm looking for opinions and such.
open() itself does not read anything into memory, and there is no argument you can pass to it to limit memory use; what you can change is how you read the lines from the file object. For example:
# open the sample file used
file = open('test.txt')
# read the content of the file opened
content = file.readlines()
# read 10th line from the file
print("tenth line")
print(content[9])
# print first 3 lines of file
print("first three lines")
print(content[0:3])
You could also use the file.readline() method to read individual lines from a file.
Note that content = file.readlines() still reads the entire file into memory as a list of lines (it is not a compressed form of the file), so if you only need a line or two of a large file, read just those lines instead, for example with readline() or itertools.islice.
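If you really only need a line or two, a more memory-friendly variant is to slice the file iterator itself, so the file is only read as far as the requested lines (a sketch, using the same hypothetical test.txt):
from itertools import islice

with open('test.txt') as f:
    first_three = list(islice(f, 3))      # lines 1-3; nothing more is read

with open('test.txt') as f:
    tenth = next(islice(f, 9, 10), None)  # 10th line, or None if the file is shorter

print(first_three)
print(tenth)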

Why won't a single line print from a file?

As part of a bigger project, I would simply like to make sure that a file can be opened and Python can read and use it. So after I opened up the txt file, I said:
data = txtfile.read()
first_line = data.split('\n',1)[2]
print(first_line)
I also tried
print(f1.readline())
where f1 is the txt file. This, again, did nothing.
I am using the Spyder IDE, and it just says running file and doesn't print anything. Is it because my file is too large? It is 4.6 gigs.
Does anyone have any idea what's going on?
and it just says running file, and doesn't print anything. Is it because my file is too large? It is 4.6 gigs.
Yes.
data = txtfile.read()
This call is going to read the entire file. Since you stated that the file is 4.6 GB, it is going to take time to load the whole file and then split it by the newline character.
See this: Read large text files in Python
I don't know your use case, but if you can process the file line by line, it would be simpler. Even reading in chunks would be simpler than reading the entire file.
first_line = open('myfile.txt', 'r').readline()
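That one-liner leaves the file to be closed by the garbage collector; the same thing with a context manager (a short sketch) would be:
with open('myfile.txt', 'r') as f:
    first_line = f.readline()  # reads only up to the first newline
print(first_line)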

Read huge .txt file with python

I have a problem reading a huge txt file with Python. I need to read all ~500M lines of a 33 GB .txt file, one by one, but for some obscure reason my script stops at the 7446633rd line and gives no error.
The script is the following simple one:
file = open ("file.txt","r")
i = 0
for line in file:
i = i + 1
print i
file.close()
I tried the script on more than one machine, and with both 32-bit and 64-bit versions of Python, but no luck.
Does anyone know what the problem could be?
Try using the "with" statement.
with open("file.txt") as input_file:
for line in input_file:
process_line(line)
You could also think about processing the lines in parallel using Celery or something similar.
Later edit: if that doesn't work, try opening the file and then reading the lines in ranges (i.e., in batches).
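The batching idea from that edit could look something like this (a sketch; read_in_batches, the batch size, and process_batch are all hypothetical names, not from the question):
from itertools import islice

def read_in_batches(path, batch_size=100000):
    """Yield lists of at most batch_size lines from the file at path."""
    with open(path) as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                break
            yield batch

for batch in read_in_batches("file.txt"):
    process_batch(batch)  # hypothetical per-batch processing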

Download a file part by part in Python 3

I'm using Python 3 to download a file:
local_file = open(file_name, "w" + file_mode)
local_file.write(f.read())
local_file.close()
This code works, but it copies the whole file into memory first. This is a problem with very big files because my program becomes memory hungry (going from 17 MB to 240 MB of memory for a 200 MB file).
I would like to know if there is a way in Python to download a small part of a file (packet), write it to file, erase it from memory, and keep repeating the process until the file is completely downloaded.
Try using the method described here:
Lazy Method for Reading Big File in Python?
I am specifically referring to the accepted answer. Let me also copy it here for clarity.
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
This should be adaptable to your needs: it reads the file in smaller chunks, allowing you to process it without holding the whole file in memory. Come back if you have any further questions.
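Applied to the download itself, the same generator can be pointed at the network response so each chunk is written to disk and then discarded (a sketch assuming Python 3 and urllib.request; the URL and output name are placeholders for your f and file_name):
import urllib.request

def read_in_chunks(file_object, chunk_size=1024):
    """Lazily yield successive chunk_size pieces from a file-like object."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with urllib.request.urlopen('http://example.com/big_file.dat') as f, \
        open('downloaded.dat', 'wb') as local_file:
    for piece in read_in_chunks(f, chunk_size=64 * 1024):
        local_file.write(piece)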
