Read and write large CSV file in Python

I use the following code to read a large CSV file (6-10 GB), insert a header row, and then export it to CSV again.
import csv
import pandas as pd

df = pd.read_csv('infile.csv', header=None)  # the file has no header row yet
df.columns = ['list of headers']
df.to_csv('outfile.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)
But this methodology is extremely slow and I run out of memory. Any suggestions?

Rather than reading in the whole 6GB file, could you not just add the headers to a new file, and then cat in the rest? Something like this:
import fileinput

columns = ['list of headers']
with open('outfile.csv', 'w') as outfile:
    # write the header line first
    outfile.write(','.join(columns) + '\n')
    # then stream the input file line by line, never holding it all in memory
    with fileinput.FileInput(files=('infile.csv',)) as f:
        for line in f:
            outfile.write(line)
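If you'd rather stay in pandas (for example, to keep the QUOTE_NONNUMERIC quoting), a chunked read keeps memory bounded. A minimal sketch, assuming the input has no header row; the chunk size is an assumption you'd tune to your RAM:
import csv
import pandas as pd

# Stream the file in fixed-size chunks; only one chunk is in memory at a time.
reader = pd.read_csv('infile.csv', header=None, chunksize=1_000_000)
for i, chunk in enumerate(reader):
    chunk.columns = ['list of headers']
    # Write the header and truncate on the first chunk, then append.
    chunk.to_csv('outfile.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False,
                 quoting=csv.QUOTE_NONNUMERIC)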

Related

Open 140GB .txt file in Windows?

I have a huge DNA sequence saved in a 140 GB .txt file that I would like to open in a text editor. Notepad, Python, and R can't open a file this large. Is there a dedicated text editor for opening large files?
I am currently using this Python code to open the 140 GB .txt file:
path = open(r"my_file_path\my_140GB_file.txt", "r")
file = path.read()
print(file)
The error message is MemoryError referring to file = path.read()
There are multiple ways to read large text files in Python. If it is a delimited file, you might want to use the pandas library.
You can use a context manager and read the file in chunks as follows.
Python 3.8+
with open("my_file_path\my_140GB_file.txt", "r") as f:
while chunk := f.read(1024 * 10): # you can use any chunk size you want
do_something(chunk)
Before Python 3.8
You can use iter() with a sentinel value:
with open("my_file_path\my_140GB_file.txt", "rb") as f:
for chunk in iter(lambda:f.read(1024*10), ""):
do_something(chunk)
Or, if the file is line-based, you can read it line by line.
with open("my_file_path\my_140GB_file.txt", "r") as f:
for line in f:
do_something(line)
Pandas DataFrame for delimited files
If your file is delimited (like a CSV), then you might consider using pandas.
import pandas as pd
for chunk in pd.read_csv(r"my_file_path\my_140GB_file.csv", chunksize=100_000):
    do_something(chunk)  # chunksize is the number of rows per chunk

Python - how to re-export a file as utf-8

I have csv & excel files that were not correctly saved as UTF-8 so i cannot simply load them into pandas. Manually, I can open it and save as excel or csv and select utf-8 and then it works fine in pandas but I have too many files to do this manually and I don't want to replace the raw file (so overwriting it is out of the question). How can I accomplish this programmatically?
I thought one solution could be to do something like this:
import pandas as pd
with open('path/to/bad_file.csv', 'rb') as f:
    text = f.read()
with open('fixed-temp.csv', 'w', encoding='utf8') as f:
    f.write(text.decode(encoding="latin-1"))
df = pd.read_csv('fixed-temp.csv')
But this leaves behind a temporary file or a new file that I don't want. I guess I could write more code to delete this temporary file afterwards, but that seems unclean, and I'd rather encapsulate all of this into one convenience function.
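You can usually skip the temporary file entirely, since pandas will decode the legacy encoding for you. A minimal sketch, assuming the files really are Latin-1 (swap in the actual source encoding if you know it); read_bad_csv is a hypothetical helper name:
import pandas as pd

def read_bad_csv(path, source_encoding='latin-1'):
    # Decode the legacy encoding directly; no intermediate file needed.
    return pd.read_csv(path, encoding=source_encoding)

df = read_bad_csv('path/to/bad_file.csv')
# If you also want a clean UTF-8 copy on disk (without touching the raw file):
df.to_csv('path/to/fixed_file.csv', index=False, encoding='utf-8')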

How to read in a huge csv zipped file in python?

I have a few really big .zip files. Each contains one huge .csv file.
When I try to read one in, I either get a MemoryError or everything freezes/crashes.
I've tried this:
import zipfile

zf = zipfile.ZipFile('Eve.zip', 'r')
df1 = zf.read('Eve.csv')  # reads the entire member into memory as bytes
but this gives a MemoryError.
I've done some research and tried this:
import zipfile
with zipfile.ZipFile('Events_WE20200308.zip', 'r') as z:
    with z.open('Events_WE20200308.csv') as f:
        for line in f:
            df = pd.DataFrame(f)
            print(line)
but I can't get it into a dataframe.
Any ideas please?
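One approach, sketched below: pandas can read a CSV straight out of a zip archive (this works when the archive holds a single file, which matches the setup here), and chunksize keeps memory bounded. The chunk size and the do_something placeholder are assumptions to adapt:
import pandas as pd

# Read the zipped CSV in chunks; only one chunk lives in memory at a time.
reader = pd.read_csv('Events_WE20200308.zip', compression='zip',
                     chunksize=100_000)
for chunk in reader:
    do_something(chunk)  # hypothetical per-chunk processing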

Is it possible to open a large CSV without loading it into RAM entirely

Suppose I have a very large csv file.
file = open("foo.csv")
seems to put the whole CSV in RAM. If I just need the first row of the CSV but don't want Python to load the entire file, is there a way to do it?
If you just need the first row, then you can use the csv module, like so:
import csv
with open("foo.csv", "r") as my_csv:
reader = csv.reader(my_csv)
first_row = next(reader)
# do stuff with first_row
The csv reader is lazy: rows are read from the file only as they are requested, so the whole file is never loaded into RAM.
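If you prefer pandas, the same idea works with nrows, which stops parsing after the requested number of data rows; a minimal sketch:
import pandas as pd

# Only the header and the first data row are parsed; pandas stops
# reading shortly after, so the bulk of the file is never loaded.
first_row = pd.read_csv("foo.csv", nrows=1)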

Save data as a *.dat file?

I am writing a program in Python which should import *.dat files, subtract a specific value from certain columns and subsequently save the file in *.dat format in a different directory.
My current tactic is to load the data files into a numpy array, perform the calculation, and then save the result. I am stuck on the saving part: I do not know how to save a file in *.dat format from Python. Can anyone help me? Or is there an alternative way that avoids importing the *.dat file as a numpy array? Many thanks!
You can use struct to pack the integers into bytes and write them to a .dat file.
import struct

data = [1, 2, 3]  # your data: a list of integers
Save the data:
with open('your_data.dat', 'wb') as your_dat_file:
    your_dat_file.write(struct.pack('i' * len(data), *data))
Read it back:
with open('your_data.dat', 'rb') as your_data_file:
    values = struct.unpack('i' * len(data), your_data_file.read())
You can read and export a .dat file using pandas:
import pandas as pd
input_df = pd.read_table('input_file_name.dat')
...
output_df = pd.DataFrame({'column_name': column_values})
output_df.to_csv('output_file_name.dat', index=False)
Assuming you currently read your file with
file = open(filename, "r")
all you need to do is open another file with "w" as the second parameter:
file = open(new_file_path, "w")
file.write(data)
file.close()
If your data is not a string, either convert it to a string, or open the files in binary mode:
file = open(filename, "rb")
file = open(filename, "wb")
These modes read and write raw bytes.
The .dat file can be read using the pandas library:
import pandas as pd

df = pd.read_csv('xxxx.dat', sep=r'\s+', header=None, skiprows=1)
skiprows=1 ignores the first row, which is the header.
sep=r'\s+' splits on runs of whitespace, a common delimiter in .dat files.
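Putting it together for the original task (read a .dat file, subtract a value from certain columns, and save it to a different directory), a minimal sketch; the column names and the offset are made-up placeholders:
import pandas as pd

df = pd.read_csv('input_dir/data.dat', sep=r'\s+')
for col in ['col_a', 'col_b']:    # hypothetical columns to adjust
    df[col] = df[col] - 42.0      # hypothetical value to subtract
df.to_csv('output_dir/data.dat', sep=' ', index=False)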
Correct me if I'm wrong, but opening, writing to, and subsequently closing a file should count as "saving" it. You can test this yourself by running your import script and comparing the last modified dates.
