I am dealing with a huge csv file of approximately 13 GB and around 130,000,000 lines. I am using Python and tried to work on it with the pandas library, which I have used before for this kind of work. However, I was previously always dealing with csv files of fewer than 2,000,000 lines or 500 MB. For this huge file, pandas no longer seems appropriate, as my computer dies when I run my code (a MacBook Pro from 2011 with 8 GB RAM). Could somebody advise me on a way to deal with this kind of file in Python? Would the csv library be more appropriate?
Thank you in advance!
In Python I have found that for opening big files it is better to use generators as in:
with open("ludicrously_humongous.csv", "r") as f:
for line in f:
#Any process of that line goes here
Programming this way makes your program read only one line at a time into memory, allowing you to work with large files in an agile manner.
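To answer your last question: the standard csv module streams the file the same way when you iterate over the reader, so you get parsed fields without loading everything. A minimal sketch, where the filename and the per-row processing are just placeholders:

import csv

with open("ludicrously_humongous.csv", "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:   # row is a list of column strings
        # process one row at a time; nothing else is kept in memory
        pass

If you still want pandas for the processing itself, read_csv also accepts a chunksize argument, which yields the file in bounded-size pieces (DataFrames) rather than one 13 GB frame.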
I currently use scipy.io.readsav() to import IDL .sav files to Python, which is working well, eg:
data = scipy.io.readsav('data.sav', python_dict=True, verbose=True)
However, if the .sav file is large (say > 1 GB), I get a MemoryError when trying to import into Python.
Usually, iterating through the data would of course solve this (if it were a .txt or .csv file), rather than loading it all in at once, but I don't see how I can do this with .sav files, considering the only method I know of to import them is readsav.
Any ideas how I can avoid this memory error?
This was resolved by using 64-bit Python.
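For reference, a quick way to check which build you are running (sys.maxsize is small on 32-bit interpreters, which are limited to a few GB of address space):

import sys

# prints "64-bit" on a 64-bit interpreter, "32-bit" otherwise
print("64-bit" if sys.maxsize > 2**32 else "32-bit")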
In one of my recent projects I need to perform this simple task but I'm not sure what is the most efficient way to do so.
I have several large text files (>5 GB) from which I need to continuously extract random lines. The requirements are: I can't load the files into memory, I need to do this very efficiently (>>1000 lines a second), and preferably I need to do as little pre-processing as possible.
The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a short pre-processing step I can make all lines the same length (though the perfect solution would not require pre-processing).
I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).
The next solution I thought about is creating some kind of index. I found this solution, but it's very outdated, so it needs some work to get running, and even then I'm not sure whether the overhead created while processing the index file wouldn't slow the process down to the time-scale of the solutions above.
Another solution is converting the file into a binary file and then getting instant access to lines that way. For this solution I couldn't find any Python package that supports working with binary text, and I feel that creating a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations/mistakes.
The final solution I thought about is using some kind of database (sqlite in my case), which would require transferring the lines into a database and loading them that way.
Note: I will also load thousands of (random) lines each time, so solutions that work better for groups of lines will have an advantage.
Thanks in advance,
Art.
As said in the comments, I believe using HDF5 would be a good option.
This answer shows how to read that kind of file.
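For illustration, a rough sketch of the idea with h5py, assuming the lines are plain ASCII and have been padded to a fixed length as you describe (the file names, the dataset name and LINE_LEN are made up for the example):

import numpy as np
import h5py

LINE_LEN = 80   # assumed fixed line length after your pre-processing

# first pass: count lines so the dataset can be pre-allocated
with open("big.txt", "r") as src:
    n_lines = sum(1 for _ in src)

# second pass: copy the text into an HDF5 dataset of fixed-length byte strings
with open("big.txt", "r") as src, h5py.File("lines.h5", "w") as h5:
    dset = h5.create_dataset("lines", shape=(n_lines,), dtype="S%d" % LINE_LEN)
    buf, start = [], 0
    for line in src:
        buf.append(line.rstrip("\n").ljust(LINE_LEN))
        if len(buf) == 100000:   # write in chunks, never the whole file
            dset[start:start + len(buf)] = np.array(buf, dtype="S%d" % LINE_LEN)
            start += len(buf)
            buf = []
    if buf:
        dset[start:start + len(buf)] = np.array(buf, dtype="S%d" % LINE_LEN)

# later: pull thousands of random lines without loading the file
with h5py.File("lines.h5", "r") as h5:
    dset = h5["lines"]
    idx = np.sort(np.random.choice(dset.shape[0], size=5000, replace=False))  # h5py wants sorted, unique indices
    sample = dset[idx]   # reads only the requested rows from disk

A memory-mapped flat file of fixed-length records (numpy.memmap) would behave similarly if you prefer to avoid the HDF5 dependency.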
I'm wondering whether the sas7bdat module in Python creates an iterator-type object or loads the entire file into memory as a list. I'm interested in doing something line-by-line to a .sas7bdat file that is on the order of 750 GB, and I really don't want Python to attempt to load the whole thing into RAM.
Example script:
from sas7bdat import SAS7BDAT
count = 0
with SAS7BDAT('big_sas_file.sas7bdat') as f:
    for row in f:
        count += 1
I can also use
it = f.__iter__()
but I'm not sure if that will still go through a memory-intensive data load. Any knowledge of how sas7bdat works OR another way to deal with this issue would be greatly appreciated!
You can see the relevant code on bitbucket. The docstring describes iteration as a "generator", and looking at the code, it appears to be reading small pieces of the file rather than reading the whole thing at once. However, I don't know enough about the file format to know if there are situations that could cause it to read a lot of data at once.
If you really want to get a sense of its performance before trying it on a giant 750 GB file, you should test it by creating a few sample files of increasing size and seeing how its performance scales with the file size.
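For example, something along these lines, assuming you have already cut a few sample .sas7bdat files of increasing size (the file names are placeholders):

import time
from sas7bdat import SAS7BDAT

# hypothetical sample files of increasing size, e.g. 100 MB, 1 GB, 10 GB
for path in ["sample_100mb.sas7bdat", "sample_1gb.sas7bdat", "sample_10gb.sas7bdat"]:
    start = time.time()
    count = 0
    with SAS7BDAT(path) as f:
        for row in f:   # same row-by-row iteration as in your script
            count += 1
    print("%s: %d rows in %.1f s" % (path, count, time.time() - start))

If the time grows roughly linearly with file size and memory stays flat, the 750 GB file should at least be feasible, if slow.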
I am writing a program with a while loop that writes a giant amount of data into a csv file. There may be more than 1 million rows.
Considering running time, memory usage, debugging and so on, which of the two is the better option:
Open the CSV file once, keep it open and write line by line until all 1 million rows are written
Open the file, write about 100 lines, close() it, open it again, write about 100 lines, ......
I guess I just want to know whether it would take more memory to keep the file open the whole time, and which approach will take longer.
I can't run the code to compare because I'm using a VPN for it, and testing would cost too much $$ for me. So just some rules of thumb would be enough for this.
I believe the write will go to the disk right away, so there isn't any benefit that I can see from closing and reopening the file. The file isn't held in memory when it's open; you essentially just get a pointer to the file and then read or write a portion of it at a time.
Edit
To be more explicit: no, opening a large file will not use a large amount of memory. Similarly, writing a large amount of data will not use a large amount of memory, as long as you don't hold the data in memory after it has been written to the file.
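So the simpler option is to open the file once and write row by row. A minimal sketch with the csv module, where generate_rows is just a stand-in for whatever your while loop computes:

import csv

def generate_rows(n):
    # placeholder for the real data source
    for i in range(n):
        yield [i, i * i]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in generate_rows(1000000):
        writer.writerow(row)   # each row goes to the write buffer and on to disk; nothing accumulates in memory
# the file is closed exactly once, when the with-block ends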
I have a big problem here with Python, openpyxl and Excel files. My objective is to write some calculated data to a preconfigured template in Excel. I load this template and write the data to it. There are two problems:
I'm talking about writing Excel workbooks with more than 2 million cells, divided into several sheets.
I do this successfully, but the waiting time is unbearably long.
I don't know any other way to solve this problem. Maybe openpyxl is not the solution. I have tried writing to xlsb, but I think openpyxl does not support that format. I have also tried the optimized writer and reader, but the problem comes when I save, because of the amount of data. However, the output file size is 10 MB at most. I'm very stuck with this. Do you know if there is another way to do this?
Thanks in advance.
The file size isn't really the issue when it comes to memory use; what matters is the number of cells held in memory. Your use case really will push openpyxl to its limits: at the moment it is designed to support either optimised reading or optimised writing, but not both at the same time. One thing you might try is reading the template in openpyxl with use_iterators=True; this will give you a generator that you can feed to xlsxwriter, which should be able to write a new file for you. xlsxwriter is currently significantly faster than openpyxl when creating files. The solution isn't perfect, but it might work for you.
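Something along these lines, assuming a single-sheet template and openpyxl's old use_iterators flag (newer versions call it read_only=True); the file names are placeholders and the inner loop is where your calculated data would go:

from openpyxl import load_workbook
import xlsxwriter

# stream the template with openpyxl's optimised reader
wb_in = load_workbook("template.xlsx", use_iterators=True)   # read_only=True in newer openpyxl
ws_in = wb_in.worksheets[0]

# write the output with xlsxwriter; constant_memory streams rows straight to disk
wb_out = xlsxwriter.Workbook("output.xlsx", {"constant_memory": True})
ws_out = wb_out.add_worksheet(ws_in.title)

for r, row in enumerate(ws_in.iter_rows()):
    for c, cell in enumerate(row):
        if cell.value is not None:
            ws_out.write(r, c, cell.value)   # or your calculated value for this cell

wb_out.close()

Note that with constant_memory xlsxwriter expects rows to be written in ascending order, which the row-by-row copy above already guarantees.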