Read huge .txt file with python - python

I have a problem reading a huge .txt file with Python. I need to read all ~500M lines of a 33 GB .txt file, one by one, but for some obscure reason my script stops at line 7446633 and gives no error.
The script is the following simple one:
file = open("file.txt", "r")
i = 0
for line in file:
    i = i + 1
    print i
file.close()
I tried the script on more than one machine, and with both 32-bit and 64-bit versions of Python, but no luck.
Does anyone know what the problem could be?

Try using the "with" statement.
with open("file.txt") as input_file:
    for line in input_file:
        process_line(line)
You could also consider processing the lines in parallel using Celery or something similar.
Later edit: if that doesn't work, try opening the file and reading the lines in batches, one range of lines at a time.
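A minimal sketch of the batch idea, written for Python 3. The helper name read_in_batches, the batch size and the small temporary demo file are assumptions made for illustration; they are not from the original post.

```python
import os
import tempfile
from itertools import islice


def read_in_batches(path, batch_size):
    """Yield lists of at most batch_size lines from the file at path."""
    with open(path, "r") as input_file:
        while True:
            # islice pulls the next batch_size lines without loading the rest.
            batch = list(islice(input_file, batch_size))
            if not batch:
                break
            yield batch


# Small stand-in for the real 33 GB file (hypothetical demo data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join("line %d\n" % n for n in range(10)))
    demo_path = tmp.name

batch_sizes = [len(batch) for batch in read_in_batches(demo_path, 4)]
print(batch_sizes)  # [4, 4, 2]

os.remove(demo_path)
```

Only one batch is held in memory at a time, so memory use depends on batch_size rather than on the size of the file.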

Related

Python writing data to file only works when run from console

If I run
file = open("BAL.txt","w")
I = '200'
file.write(I)
file.close
from a script, it outputs nothing in the file. (It literally overwrites the file with nothing)
Furthermore, running cat BAL.txt just goes to the next line like nothing is in the file.
But if I run it line by line in a python console it works perfectly fine.
Why does this happen? (I am a beginner learning Python, so the mistake may be super obvious. I have spent about 2 hours trying to figure this out.)
Thanks in advance
You aren't closing your file properly. The () is missing at the end of file.close, so the method is never actually called. It should look like this:
file = open("BAL.txt", "w")
file.write("This has been written to a file")
file.close()
This site has the same example and may be of some use to you.
Another way, especially useful when you are writing multiple values into a single file, is to use something like with open("BAL.txt","w") as file:. Here is your script rewritten to include this example:
I = '200'
with open("BAL.txt","w") as file:
    file.write(I)
This opens the file under the name file and lets us write values to it. Note that file.close() is not needed here, because the with statement closes the file when the block ends. To append text instead of overwriting it, open the file in mode "a" rather than "w".
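As a rough illustration of why the parentheses matter and what the with statement does, here is a small Python 3 sketch. The temporary directory and the variable names are assumptions made for the demo.

```python
import os
import tempfile

# Hypothetical scratch location so the demo does not touch a real BAL.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "BAL.txt")

with open(demo_path, "w") as out_file:
    out_file.write("200")

# Leaving the with block has closed (and therefore flushed) the file.
closed_after_with = out_file.closed

handle = open(demo_path, "a")
handle.close  # no parentheses: this only looks up the method, it does not call it
closed_without_call = handle.closed
handle.close()  # with parentheses the file is really closed

with open(demo_path) as in_file:
    contents = in_file.read()

print(closed_after_with, closed_without_call, contents)
```

Here closed_after_with is True while closed_without_call is False, which matches the explanation above: without the () the file is never explicitly closed.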
To write to a file you do this:
file = open("file.txt","w")
file.write("something")
file.close()
Opening a file in "w" mode deletes all of its existing contents. If you want to write to the end of the file instead, open it in append mode:
file = open("file.txt","a")
file.write("something")
file.close()
There are other ways to do this (for example, reading the old contents first and writing them back together with the new text), but append mode is the simplest and most reliable.
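A short Python 3 sketch of how the open modes behave: "w" truncates the file every time it is opened, while "a" keeps the existing contents and writes at the end. The temporary path is an assumption for the demo.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "file.txt")

with open(demo_path, "w") as f:
    f.write("first\n")

with open(demo_path, "w") as f:  # "w" truncates: "first" is gone
    f.write("second\n")

with open(demo_path, "a") as f:  # "a" appends after the existing contents
    f.write("third\n")

with open(demo_path) as f:
    result = f.read()

print(repr(result))  # 'second\nthird\n'
```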
Firstly, you're missing the parentheses when you're closing the file. Secondly, writing to a file should be done like this:
file = open("BAL.txt", "w")
file.write("This has been written to a file")
file.close()
Let me know if you have any questions.

Why won't a single line print from a file?

As part of a bigger project, I would simply like to make sure that a file can be opened and Python can read and use it. So after I opened up the txt file, I said:
data = txtfile.read()
first_line = data.split('\n',1)[0]
print(first_line)
I also tried
print(f1.readline())
where f1 is the txt file. This, again, did nothing.
I am using the spyder IDE, and it just says running file, and doesn't print anything. Is it because my file is too large? It is 4.6 gigs.
Does anyone have any idea what's going on?
"and it just says running file, and doesn't print anything. Is it because my file is too large? It is 4.6 gigs."
Yes.
data = txtfile.read()
This call reads the entire file into memory. Since the file is 4.6GB, it will take a long time to load the whole thing and then split it on the newline character.
See this: Read large text files in Python
I don't know your use case, but if you can process the file line by line, that would be simpler. Even reading it in chunks would be simpler than reading the entire file. To get just the first line:
first_line = open('myfile.txt', 'r').readline()
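A hedged Python 3 sketch of both suggestions: reading only the first line, and reading the file in fixed-size chunks. The helper name read_in_chunks, the chunk size and the tiny demo file are illustrative assumptions.

```python
import os
import tempfile


def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield successive pieces of the file, each at most chunk_size characters."""
    with open(path, "r") as txtfile:
        while True:
            chunk = txtfile.read(chunk_size)
            if not chunk:  # an empty string means end of file
                break
            yield chunk


# Tiny stand-in for the 4.6 GB file (hypothetical demo data).
demo_path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(demo_path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# readline() stops at the first newline, so the rest of the file is not loaded.
with open(demo_path, "r") as txtfile:
    first_line = txtfile.readline()

total_chars = sum(len(chunk) for chunk in read_in_chunks(demo_path, chunk_size=8))
print(repr(first_line), total_chars)
```

Using with here also closes the file afterwards, which the one-line open(...).readline() form leaves to the interpreter.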

How does one read a .dif file with Python

I am working on a project that requires me to read a file with a .dif extension. DIF stands for Data Interchange Format. The file opens nicely in OpenOffice Calc, and from there you can easily save it as a CSV file. However, when I open it in Python, all I get are random characters that don't make sense. Here is the last code that I tried, just to see if I could read it.
txt = open('C:\myfile.dif', 'rb').read()
print txt
I would even be open to programmatically converting the file to CSV before opening it, if someone knows how to do that. As always, any help is much appreciated. Below is a partial screenshot of what I get when I run the code.
Hadn't heard of this file format. Went and got a sample here.
I tested your method and it works fine:
>>> content = open(r"E:\sample.dif", 'rb').read()
>>> print (content)
b'TABLE\r\n0,1\r\n"EXCEL"\r\nVECTORS\r\n0,8\r\n""\r\nTUPLES\r\n0,3\r\n""\r\nDATA\r\n0,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n"Welcome to File Extension FYI Center!"\r\n1,0\r\n""\r\n1,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n""\r\n1,0\r\n""\r\n1,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n"ID"\r\n1,0\r\n"Type"\r\n1,0\r\n"Description"\r\n-1,0\r\nBOT\r\n0,1\r\nV\r\n1,0\r\n"ASP"\r\n1,0\r\n"Active Server Pages"\r\n-1,0\r\nBOT\r\n0,2\r\nV\r\n1,0\r\n"JSP"\r\n1,0\r\n"JavaServer Pages"\r\n-1,0\r\nBOT\r\n0,3\r\nV\r\n1,0\r\n"PNG"\r\n1,0\r\n"Portable Network Graphics"\r\n-1,0\r\nBOT\r\n0,4\r\nV\r\n1,0\r\n"GIF"\r\n1,0\r\n"Graphics Interchange Format"\r\n-1,0\r\nBOT\r\n0,5\r\nV\r\n1,0\r\n"WMV"\r\n1,0\r\n"Windows Media Video"\r\n-1,0\r\nEOD\r\n'
>>>
The question is what is in the file and how do you want to handle it. Personally I liked:
with open(r"E:\sample.dif", 'rb') as f:
    for line in f:
        print (line)
In the first code block, the long line with b'' in front of it (for bytes!) can be iterated over line by line, splitting on \r\n:
b'TABLE\r\n'
b'0,1\r\n'
b'"EXCEL"\r\n'
b'VECTORS\r\n'
b'0,8\r\n'
b'""\r\n'
b'TUPLES\r\n'
b'0,3\r\n'
b'""\r\n'
b'DATA\r\n'
b'0,0\r\n'
.
.
.
b'"Windows Media Video"\r\n'
b'-1,0\r\n'
b'EOD\r\n'
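If text strings are more convenient than bytes, the content can be decoded before splitting it into lines. This Python 3 sketch uses the first few lines of the sample output above and assumes the file is plain ASCII, which may not hold for every .dif file.

```python
# First few lines of the sample shown above, as raw bytes.
sample = b'TABLE\r\n0,1\r\n"EXCEL"\r\nVECTORS\r\n0,8\r\n""\r\n'

# Assumption: the file is ASCII-encoded; another encoding may be needed.
lines = sample.decode("ascii").splitlines()

print(lines)  # ['TABLE', '0,1', '"EXCEL"', 'VECTORS', '0,8', '""']
```

splitlines() treats \r\n as a single line ending, so the trailing \r does not end up in the strings.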

MemoryError when trying to load 5GB text file

I want to read data stored in text format in a 5GB file. When I try to read the content of the file using this code:
file = open('../data/entries_en.txt', 'r')
data = file.readlines()
an error occurred:
data = file.readlines()
MemoryError
My laptop has 8GB of memory, and at least 4GB is free when I run the program. But when I monitor system performance, the error happens once Python is using about 1.5GB of memory.
I'm using Python 2.7, but if it matters, please give me a solution for both 2.x and 3.x.
What should I do to read this file?
The best way for you to handle large files would be:
with open('../file.txt', 'r') as f:
    for line in f:
        # do stuff
        pass
readlines() fails because it tries to load the whole file into memory at once, which is too much for a file this large. The code above reads one line at a time and automatically closes the file once you are done processing it.
If you want to process lines in the file, you should rather use:
for line in file:
    # do something with the line
    pass
It will read the file line by line, instead of reading it all into RAM at once.
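As a small Python 3 illustration of the line-by-line approach, the sketch below counts lines without ever building a list of them. The function name count_lines and the demo file are assumptions made for the example.

```python
import os
import tempfile


def count_lines(path):
    """Count lines lazily; only one line is held in memory at a time."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


# Small stand-in for the 5 GB file (hypothetical demo data).
demo_path = os.path.join(tempfile.mkdtemp(), "entries_en.txt")
with open(demo_path, "w") as f:
    for n in range(1000):
        f.write("entry %d\n" % n)

line_count = count_lines(demo_path)
print(line_count)  # 1000
```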

Find&Replace using Python - Binary file

I'm attempting to do a "find and replace" in a file on a Mac OS X computer. Although it appears to work correctly, the file seems to be altered somehow. The text editor that I use (Text Wrangler) is unable to even open the file once this is completed.
Here is the code as I have it:
import fileinput
for line in fileinput.FileInput("testfile.txt",inplace=1):
    line = line.replace("newhost",host)
    print line,
When I view the file from the terminal, it says: "testfile" may be a binary file. See it anyway? Is there a chance that this replace is corrupting the file? Is there another way to make this work? I really appreciate the help.
Thank you,
Aaron
UPDATE: the actual file is NOT a .txt file; it is a .plist file, which is a preference file in Mac OS X, if that makes any difference.
LINK to plist file:
http://www.queencitytech.com/plist.zip
Your code worked for me fine. However, I would suggest a different approach: don't try overwriting the file directly. I never like changing the file directly because if you have a bug or something like that the file is lost. Generate a new file then copy it over manually (or within python, if you really want to).
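PATH = 'testfile.txt'
# Write the modified lines to a new file instead of editing in place.
# host is the replacement value defined in the question's script.
with open(PATH) as in_file, open('out_' + PATH, 'w') as out_file:
    for line in in_file:
        # Each line keeps its trailing newline, so write it unchanged.
        out_file.write(line.replace('newhost', host))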
Try using sys.stdout.write instead of print. Lines read from the file keep their trailing newline characters, and the print statement adds another newline unless it ends with a trailing comma, so a plain print would double-space the file.
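One way to combine the advice above is to write the changed text to a temporary file and swap it in only after the write has succeeded. This is a Python 3 sketch that assumes the file is plain text (a binary .plist would need different handling); the function name replace_in_file and the demo values are hypothetical.

```python
import os
import tempfile


def replace_in_file(path, old, new):
    """Replace old with new in a text file, swapping in the result at the end."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as out_file, open(path, "r") as in_file:
            for line in in_file:
                # Lines keep their own newline, so write() adds nothing extra.
                out_file.write(line.replace(old, new))
        os.replace(tmp_path, path)
    except Exception:
        # On failure the original file is left untouched.
        os.remove(tmp_path)
        raise


# Hypothetical demo file and replacement value.
demo_path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
with open(demo_path, "w") as f:
    f.write("server=newhost\nport=80\n")

replace_in_file(demo_path, "newhost", "example.local")

with open(demo_path) as f:
    updated = f.read()

print(repr(updated))  # 'server=example.local\nport=80\n'
```

Because the original is only replaced at the very end, a bug or crash in the middle of the loop leaves it intact.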
