I have a huge DNA sequence saved in a 140 GB .txt file that I would like to open in a text editor. Notepad, Python, and R all fail to open a file this large. Is there a dedicated text editor for opening very large files?
I am currently using this Python code to open the 140 GB .txt file:
path = open(r"my_file_path\my_140GB_file.txt", "r")
file = path.read()
print(file)
The error is a MemoryError raised at file = path.read().
There are multiple ways to read large text files in Python. If the file is delimited, you might want to use the pandas library.
You can use a context manager and read the file in chunks, as follows.
Python 3.8+
with open("my_file_path\my_140GB_file.txt", "r") as f:
while chunk := f.read(1024 * 10): # you can use any chunk size you want
do_something(chunk)
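Here, do_something stands in for whatever per-chunk processing you need. As an illustration only (the tally is hypothetical, not part of the question), a sketch that counts nucleotide frequencies chunk by chunk:

from collections import Counter

counts = Counter()

def do_something(chunk):
    # keep a running tally of every character (A, C, G, T, newlines, ...)
    counts.update(chunk)

with open(r"my_file_path\my_140GB_file.txt", "r") as f:
    while chunk := f.read(1024 * 10):
        do_something(chunk)

print(counts.most_common())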
Before Python 3.8
You can use the two-argument form of iter() with a lambda:
with open("my_file_path\my_140GB_file.txt", "rb") as f:
for chunk in iter(lambda:f.read(1024*10), ""):
do_something(chunk)
Or, if the file is line-based, you can read it line by line.
with open("my_file_path\my_140GB_file.txt", "r") as f:
for line in f:
do_something(line)
Pandas DataFrame for delimited files
If your file is delimited (like a csv), then you might consider using pandas.
import pandas as pd

# chunksize is the number of rows per chunk; use something large for a 140GB file
for chunk in pd.read_csv(r"my_file_path\my_140GB_file.csv", chunksize=100_000):
    do_something(chunk)
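As a usage example, here is a sketch that streams the file and appends only matching rows to a new CSV; the "sequence" column name and the "ATG" pattern are assumptions for illustration:

import pandas as pd

first = True
for chunk in pd.read_csv(r"my_file_path\my_140GB_file.csv", chunksize=100_000):
    # each chunk is an ordinary DataFrame, so normal filtering works
    matches = chunk[chunk["sequence"].str.contains("ATG", na=False)]
    matches.to_csv("filtered.csv", mode="a", index=False, header=first)
    first = False  # write the header only once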
Related
So I'm currently trying to use Python to create a neat and tidy .csv file from a .txt file. The first stage is to get some 8-digit numbers into one column called 'Number'. I've created the header and just need to put each number from each line into the column. What I want to know is, how do I tell Python to read the first eight characters of each line in the .txt file (which correspond to the number I'm looking for) and then write them to the .csv file? This is probably very simple but I'm only new to Python!
So far, I have something which looks like this:
import csv

with open(r'C:/Users/test1.txt') as rf:
    with open(r'C:/Users/test2.csv', 'w', newline='') as wf:
        outputDictWriter = csv.DictWriter(wf, ['Number'])
        outputDictWriter.writeheader()
        writeLine = rf.read(8)
        for line in rf:
            wf.write(writeLine)
You can use pandas:
import pandas as pd
df = pd.read_csv(r'C:/Users/test1.txt')
df.to_csv(r'C:/Users/test2.csv')
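Note that read_csv on its own won't extract just the first eight characters of each line. If that is the goal, a fixed-width read is a closer fit; a minimal sketch, assuming the same paths as the question:

import pandas as pd

# treat each line as fixed-width and keep only characters 0-7 as 'Number'
df = pd.read_fwf(r'C:/Users/test1.txt', colspecs=[(0, 8)], names=['Number'],
                 header=None, dtype=str)
df.to_csv(r'C:/Users/test2.csv', index=False)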
Here is how to read the first 8 characters of each line in a file and store them in a list:
with open('file.txt', 'r') as f:
    lines = [line[:8] for line in f]
You can use a regex to pick out the 8-digit number on each line:
import re

match = re.search(r'\d{8}', line)  # first run of 8 digits in the line
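A minimal sketch of applying it line by line, assuming the test1.txt path from the question:

import re

numbers = []
with open(r'C:/Users/test1.txt') as f:
    for line in f:
        m = re.search(r'\d{8}', line)
        if m:
            numbers.append(m.group())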
Just go one step back and read again what you need:
read the first eight characters of each line in the .txt file (which correspond to the number I'm looking for) and then write them to the .csv file
Now forget Python and describe what needs to be done in pseudocode:
open txt file for reading
open csv file for writing (beware end of line is expected to be \r\n for a CSV file)
write the header to the csv file
loop reading the txt file 1 line at a time until end of file
extract 8 first characters from the line
write them to the csv file, ended with a \r\n
close both files
OK, time to convert the pseudocode above into Python:
with open('C:/Users/test1.txt') as rf, open('C:/Users/test2.csv', 'w', newline='\r\n') as wf:
    print('Number', file=wf)
    for line in rf:
        print(line.rstrip()[:8], file=wf)
I'm writing the contents of a .txt file into an .xlsx file, but when I try to open the resulting Excel file it gives an error: Excel is not able to open the file because the file format or file extension is not valid.
import openpyxl

with open('fruits.txt') as myfile:
    content = myfile.read()

with open("output2.xlsx", "w") as myfile:
    myfile.write(content)
You should rather use csv for this. Renaming a text file to .xlsx does not make it an Excel workbook (.xlsx is a zipped XML format), which is why Excel rejects it. Open the .txt file with Python's normal file-reading methods, then use the csv module to write it out as a .csv file, which Excel opens directly. Do not use openpyxl to open a file with a .txt extension.
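A minimal sketch, assuming fruits.txt holds comma-separated rows (the layout is an assumption):

import csv

# read the plain-text rows, then write them out as a real CSV that Excel opens directly
with open('fruits.txt') as infile:
    rows = [line.rstrip('\n').split(',') for line in infile]

with open('output.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(rows)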
I use the following code to read a LARGE CSV file (6-10 GB), insert a header row, and then export it to CSV again.
import csv
import pandas as pd

df = pd.read_csv('read file')
df.columns = ['list of headers']
df.to_csv('outfile', index=False, quoting=csv.QUOTE_NONNUMERIC)
But this methodology is extremely slow and I run out of memory. Any suggestions?
Rather than reading the whole 6 GB file into memory, could you not just write the headers to a new file and then stream in the rest? Something like this:
import csv
from fileinput import FileInput

with open('outfile.csv', 'w', newline='') as outfile:
    # write the header row first
    csv.writer(outfile, quoting=csv.QUOTE_NONNUMERIC).writerow(['list of headers'])
    # then copy the original file through unchanged, line by line
    with FileInput(files=('infile.csv',)) as f:
        for line in f:
            outfile.write(line)
I have another question, about reading a .dat file.
The file has a .dat extension.
The content inside the file is bytes.
When I run the open-file code, the program builds and runs successfully. However, the Python shell shows no output (I can't see the file's contents).
Since the content is bytes, should I modify the code? What code should I use for bytes?
Thank you.
There is no "DAT" file format and, as you say, the file contains bytes - as do all files.
It's possible that the file contains binary data, in which case it's best to open it in binary mode. You do that by passing b as part of the mode parameter to open(), like this:
f = open('file.dat', 'rb')
data = f.read() # read the entire file into data
print(data)
f.close()
Note that the full mode parameter is set to rb which means open the file in binary mode for reading.
A better way is to use with:
with open('file.dat', 'rb') as f:
    data = f.read()
    print(data)
No need to explicitly close the file.
If you know that the file contains text, possibly in a specific encoding such as UTF-8, you can pass the encoding when you open the file (Python 3):
with open('file.dat', encoding='UTF8') as f:
    for line in f:
        print(line)
In Python 2 you can use io.open().
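A minimal sketch of the Python 2 equivalent:

import io

# io.open on Python 2 accepts an encoding, just like Python 3's built-in open()
with io.open('file.dat', encoding='UTF8') as f:
    for line in f:
        print(line)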
I am working with a .gz file from which I need to delete lines matching a specific pattern, with as little processing time as possible and without altering the rest of the content.
Have you tried using gzip.GzipFile? Its arguments are similar to open's.
Here is an example that reads lines from one file and writes them to another whenever a condition does not match:
import gzip

# GzipFile iterates in binary mode, so each line is a bytes object
with gzip.GzipFile('output.gz', 'w') as fout:
    with gzip.GzipFile('input.gz', 'r') as fin:
        for line in fin:
            if not your_remove_condition(line):
                fout.write(line)
Note that the input and output file must be different.
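If the result must end up under the original name, one option is to write to a temporary file and replace the original afterwards; a sketch, with your_remove_condition as a hypothetical stand-in for your pattern check:

import gzip
import os

def your_remove_condition(line):
    # hypothetical: drop lines containing the unwanted pattern (bytes, since gzip is binary)
    return b'unwanted pattern' in line

with gzip.GzipFile('input.gz.tmp', 'w') as fout:
    with gzip.GzipFile('input.gz', 'r') as fin:
        for line in fin:
            if not your_remove_condition(line):
                fout.write(line)

os.replace('input.gz.tmp', 'input.gz')  # atomic rename on the same filesystem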