I have a huge DNA sequence saved in a 140 GB .txt file that I would like to open in a text editor. Notepad, Python, and R can't open a file this large. Is there a dedicated text editor for opening large files?
I am currently using this Python code to open the 140 GB .txt file:
path = open("my_file_path\my_140GB_file.txt", "r")
file = path.read()
print(file)
The error is a MemoryError raised at file = path.read().
There are multiple ways to read large text files in Python. If it is a delimited file, you might want to use the pandas library.
You can use a context manager and read chunks as follows.
Python 3.8+
with open("my_file_path\my_140GB_file.txt", "r") as f:
while chunk := f.read(1024 * 10): # you can use any chunk size you want
do_something(chunk)
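For example, here is a minimal sketch of what do_something could be for this use case, assuming you only want per-character base counts from the DNA text (the Counter approach is just one option):

from collections import Counter

base_counts = Counter()

def do_something(chunk):
    # Tally every character in this chunk (A, C, G, T, newlines, ...).
    base_counts.update(chunk)

# After the loop above finishes, base_counts holds the totals for the whole file.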
Before Python 3.8
You can iterate with a lambda:
with open("my_file_path\my_140GB_file.txt", "rb") as f:
for chunk in iter(lambda:f.read(1024*10), ""):
do_something(chunk)
Or, if the file is line based, you can read each line.
with open("my_file_path\my_140GB_file.txt", "r") as f:
for line in f:
do_something(line)
Pandas DataFrame for delimited files
If your file is delimited (like a csv), then you might consider using pandas.
import pandas as pd

for chunk in pd.read_csv("my_file_path\my_140GB_file.csv", chunksize=2):
    do_something(chunk)
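Each chunk is itself a DataFrame, so a common pattern is to aggregate as you go instead of concatenating everything; a small sketch, where some_column is a hypothetical column name and the larger chunksize is only for illustration:

import pandas as pd

total_rows = 0
running_sum = 0.0
for chunk in pd.read_csv("my_file_path\my_140GB_file.csv", chunksize=100_000):
    total_rows += len(chunk)                   # rows seen so far
    running_sum += chunk["some_column"].sum()  # aggregate one chunk at a time

print(total_rows, running_sum)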
I have csv & excel files that were not correctly saved as UTF-8 so i cannot simply load them into pandas. Manually, I can open it and save as excel or csv and select utf-8 and then it works fine in pandas but I have too many files to do this manually and I don't want to replace the raw file (so overwriting it is out of the question). How can I accomplish this programmatically?
One solution I thought of is to do something like this:
import pandas as pd

with open('path/to/bad_file.csv', 'rb') as f:
    text = f.read()

with open('fixed-temp.csv', 'w', encoding='utf8') as f:
    f.write(text.decode(encoding="latin-1"))

df = pd.read_csv('fixed-temp.csv')
But this leaves behind a temporary file that I don't want. I could write more code to delete it afterwards, but that seems unclean; I'd rather encapsulate all of this in one convenience function.
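One way to avoid the temporary file entirely is to decode in memory, or to pass the source encoding straight to pandas; a minimal sketch, assuming latin-1 really is the encoding of the bad files and read_badly_encoded_csv is just a hypothetical helper name:

import io
import pandas as pd

def read_badly_encoded_csv(path, source_encoding="latin-1"):
    # Decode the raw bytes in memory and hand the text to pandas; no temp file needed.
    with open(path, 'rb') as f:
        text = f.read().decode(source_encoding)
    return pd.read_csv(io.StringIO(text))

# Or, simpler still, let pandas decode the file itself:
df = pd.read_csv('path/to/bad_file.csv', encoding='latin-1')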
I have a few really big .zip files. Each contains 1 huge .csv.
When I try to read it in, I either get a memory error or everything freezes/crashes.
I've tried this:
import zipfile

zf = zipfile.ZipFile('Eve.zip', 'r')
df1 = zf.read('Eve.csv')
but this gives a MemoryError.
I've done some research and tried this:
import zipfile
import pandas as pd

with zipfile.ZipFile('Events_WE20200308.zip', 'r') as z:
    with z.open('Events_WE20200308.csv') as f:
        for line in f:
            df = pd.DataFrame(f)
            print(line)
but I can't get it into a dataframe.
Any ideas please?
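One sketch, assuming the file names from the question: pandas can read straight from the open zip member, and chunksize keeps memory bounded (process is a hypothetical per-chunk handler).

import zipfile
import pandas as pd

with zipfile.ZipFile('Events_WE20200308.zip', 'r') as z:
    with z.open('Events_WE20200308.csv') as f:
        # Stream the CSV in pieces instead of building one huge DataFrame.
        for chunk in pd.read_csv(f, chunksize=100_000):
            process(chunk)  # hypothetical per-chunk handler

# If the CSV does fit in memory, pandas can also read the zip directly:
# df = pd.read_csv('Events_WE20200308.zip', compression='zip')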
Suppose I have a very large csv file.
file = open("foo.csv")
seems to put the whole csv in RAM. If I just need the first row of the csv but don't want Python to load the entire file, is there a way to do it?
If you just need the first row then you can use the csv module like so.
import csv

with open("foo.csv", "r") as my_csv:
    reader = csv.reader(my_csv)
    first_row = next(reader)
    # do stuff with first_row
The csv reader is lazy: rows are parsed only as you request them, so the whole file is never loaded into RAM.
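If you are already using pandas, a hedged alternative is its nrows parameter, which parses only that many data rows:

import pandas as pd

# Reads the header plus the first data row; the rest of the file is not parsed.
first = pd.read_csv("foo.csv", nrows=1)
print(first)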
I am writing a program in Python which should import *.dat files, subtract a specific value from certain columns and subsequently save the file in *.dat format in a different directory.
My current tactic is to load the data files into a numpy array, perform the calculation, and then save the result. I am stuck on the saving part: I do not know how to save a file in Python in the *.dat format. Can anyone help me? Or is there an alternative that avoids importing the *.dat file as a numpy array? Many thanks!
You can use struct to pack the integers into bytes and write them to a .dat file.
import struct

data = [...]  # your data, a list of integers
Open:
with open('your_data.dat', 'rb') as your_data_file:
    values = struct.unpack('i' * len(data), your_data_file.read())
Save data:
with open('your_data.dat', 'wb') as your_dat_file:
    your_dat_file.write(struct.pack('i' * len(data), *data))
Reference: the struct module documentation.
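If you do not know len(data) when reading the file back, a variation is to derive the count from the file size; a sketch, assuming the file was written with the same 'i' format:

import struct

with open('your_data.dat', 'rb') as f:
    raw = f.read()

count = len(raw) // struct.calcsize('i')  # number of packed integers in the file
values = struct.unpack(f'{count}i', raw)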
You can read and export a .dat file using pandas:
import pandas as pd
input_df = pd.read_table('input_file_name.dat')
...
output_df = pd.DataFrame({'column_name': column_values})
output_df.to_csv('output_file_name.dat')
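The ... above is where the arithmetic from the question would go; a minimal sketch, where the column name 'col1' and the offset 5 are purely hypothetical:

import pandas as pd

input_df = pd.read_table('input_file_name.dat')

# Subtract a specific value from certain columns (names and value are placeholders).
output_df = input_df.copy()
output_df['col1'] = output_df['col1'] - 5

output_df.to_csv('output_file_name.dat', index=False)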
Assuming you opened your file like this:
file = open(filename, "r")
All you need to do is open another file with "w" as the second parameter:
file = open(new_file_path, "w")
file.write(data)
file.close()
If your data is not a string, either make it a string, or use
file = open(filename, "rb")
file = open(filename, "wb")
when reading and writing, since these modes read and write raw bytes.
The .dat file can be read using the pandas library:
import pandas as pd

df = pd.read_csv('xxxx.dat', sep=r'\s+', header=None, skiprows=1)
skiprows=1 will ignore the first row, which is the header.
r'\s+' matches runs of whitespace, which is the usual separator in .dat files.
Correct me if I'm wrong, but opening, writing to, and subsequently closing a file should count as "saving" it. You can test this yourself by running your import script and comparing the last modified dates.