Optimising RAM when reading a large file in Python

I have a neural network that takes in arrays of 1024 doubles and returns a single value. I've trained it, and now I want to use it on a set of test data.
The file that I want to test it on is very big (8.2G), and whenever I try to import it into Google Colab it crashes the RAM.
I'm reading it in using read_csv as follows:
testsignals = pd.read_csv("/content/drive/My Drive/MyFile.txt", delimiter="\n", header=None)
What I'd like to know is if there's a more efficient way of reading in the file, or whether I will just have to work with smaller data sets.
Edit: following Prune's comment, I looked at the advice from the people who had commented, and I tried this:
import csv

testsignals = []
with open('/content/drive/MyDrive/MyFile') as csvfile:
    reader = csv.reader(csvfile)
    for line in reader:
        testsignals.append(line)
But it still exceeds the RAM.
If anyone can give me some help I'd be very grateful!
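A minimal sketch of chunked reading that avoids holding the whole 8.2 GB in memory (the chunk size is an example, and model stands for whatever trained network you already have):

import numpy as np
import pandas as pd

path = "/content/drive/My Drive/MyFile.txt"

# Stream the file in pieces; each chunk is a single-column DataFrame.
for chunk in pd.read_csv(path, header=None, chunksize=1024 * 1000):
    values = chunk[0].to_numpy(dtype=np.float32)   # float32 halves the memory
    batch = values.reshape(-1, 1024)               # one 1024-sample signal per row
    # predictions = model.predict(batch)           # run the trained network per batch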

Related

Memory problems with pandas (understanding memory issues)

The issue is that I have 299 .csv files (150-200 MB each, with millions of rows and 12 columns per file) that together make up one year of data (roughly 52 GB per year). I have 6 years of data and want to concatenate all of it into a single .csv file with Python. As you may expect, I ran into memory errors when trying the following code (my machine has 16 GB of RAM):
import os, gzip, pandas as pd, time

rootdir = "/home/eriz/Desktop/2012_try/01"
dataframe_total_list = []
counter = 0
start = time.time()

for subdir, dirs, files in os.walk(rootdir):
    dirs.sort()
    for files_gz in files:
        with gzip.open(os.path.join(subdir, files_gz)) as f:
            df = pd.read_csv(f)
            dataframe_total_list.append(df)
            counter += 1
            print(counter)

total_result = pd.concat(dataframe_total_list)
total_result.to_csv("/home/eriz/Desktop/2012_try/01/01.csv", encoding="utf-8", index=False)
My aim: get a single .csv file to then use to train DL models and etc.
My constraint: I'm very new to working with such huge amounts of data, but I have done "part" of the work:
I know that multiprocessing will not help much here; it is a sequential job in which each task must complete before the next one can start. The main problem is running out of memory.
I know that pandas works well with this, even with chunksize added. However, the memory problem is still there because the amount of data is huge.
I have tried to split the work into small tasks so that I don't run out of memory, but I run out anyway later when concatenating.
My questions:
Is it still possible to do this task with python/pandas in some other way I'm not aware of, or do I need to switch to a database approach no matter what? If so, which one would you advise?
If the database approach is the only path, will I have problems when I need to perform Python-based operations to train the DL model? I mean, if I need to use pandas/numpy functions to transform the data, is that possible, or will I have problems because of the file size?
Thanks a lot in advance; I would appreciate a more in-depth explanation of the topic.
UPDATE 10/7/2018
After trying the code snippets that @mdurant pointed out below, I have learned a lot and corrected my perspective about dask and memory issues.
Lessons:
Dask is there to be used AFTER the first pre-processing step (if you do end up with huge files that pandas struggles to load/process). Once you have your "desired" mammoth file, you can load it into a dask.dataframe object without any issue and process it.
Memory related:
First lesson: come up with a procedure so that you don't need to concatenate all the files and run out of memory; just process them in a loop, reducing their content by changing dtypes, dropping columns, resampling... (see the sketch below).
Second lesson: only put into memory what you need, so that you don't run out.
Third lesson: if none of the other lessons apply, look for an EC2 instance or big-data tools like Spark, SQL, etc.
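A rough sketch of the first lesson, with the per-file reduction happening inside the loop (the column names, dtypes and aggregation are made-up placeholders):

import glob
import pandas as pd

# Placeholder schema; replace with the real column names and dtypes.
dtypes = {"station_id": "int32", "value": "float32"}

parts = []
for path in sorted(glob.glob("/home/eriz/Desktop/2012_try/01/**/*.gz", recursive=True)):
    df = pd.read_csv(path, compression="gzip", usecols=list(dtypes), dtype=dtypes)
    # Reduce here (aggregate, resample, drop rows...) so that only the small
    # result of each file stays in memory.
    parts.append(df.groupby("station_id", as_index=False)["value"].mean())

reduced = pd.concat(parts, ignore_index=True)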
Thanks @mdurant and @gyx-hh for your time and guidance.
First thing: taking the contents of each CSV and concatenating them into one huge CSV is simple enough; you do not need pandas or anything else for that (or even Python):
outfile = open('outpath.csv', 'wb')   # binary mode, since gzip.open yields bytes
for files_gz in files:
    with gzip.open(os.path.join(subdir, files_gz)) as f:
        for line in f:
            outfile.write(line)
outfile.close()
(you may want to ignore the first line of each CSV, if it has a header with column names).
Doing processing on the data is harder. The reason is that, although Dask can read all the files and work on the set as a single data-frame, processing will fail if any single file needs more memory than your system can handle: gzip compression does not allow random access, so each compressed file has to be loaded as one piece.
The output file, however, is (presumably) uncompressed, so you can do:
import dask.dataframe as dd
df = dd.read_csv('outpath.csv') # automatically chunks input
df[filter].groupby(fields).mean().compute()
Here, only the reference to dd and the .compute() are specific to dask.

Reading a file with a huge number of columns in Python

I have a huge csv file with around 4 million columns and around 300 rows. The file size is about 4.3G. I want to read this file and run some machine learning algorithms on the data.
I tried reading the file via pandas read_csv in python, but it takes a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options like numpy fromfile, but nothing seems to be working.
Can someone please suggest a way to load a file with this many columns in python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split. Then pass that to Pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. hdf5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez, or my favorite, the speedy bloscpack.(un)pack_ndarray_file.
csv is very inefficient for storing large datasets. You should convert your csv file into a better suited format. Try hdf5 (h5py.org or pytables.org), it is very fast and allows you to read parts of the dataset without fully loading it into memory.
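A rough sketch of a one-time conversion to HDF5 with h5py, assuming ~300 rows by 4 million numeric columns as described in the question (file names and chunk sizes are examples):

import h5py
import numpy as np
import pandas as pd

n_cols = 4_000_000   # assumed from the question

# One-time conversion: stream the CSV in small row chunks and append to an
# extendable HDF5 dataset.
with h5py.File("data.h5", "w") as h5:
    dset = h5.create_dataset("data", shape=(0, n_cols), maxshape=(None, n_cols),
                             dtype="float32", chunks=(1, n_cols))
    for chunk in pd.read_csv("test.csv", header=None, chunksize=10):
        arr = chunk.to_numpy(dtype=np.float32)
        dset.resize(dset.shape[0] + arr.shape[0], axis=0)
        dset[-arr.shape[0]:] = arr

# Later: read only the rows/columns you need, without loading the whole file.
with h5py.File("data.h5", "r") as h5:
    first_block = h5["data"][:10, :1000]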
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.

Python: how to close mat file?

I'm reading data from 20k mat files into an array.
After reading around 13k files, the process ends with a "Killed" message.
It looks like the problem is that too many files are open.
I've tried to find out how to explicitly "close" mat files in Python, but found nothing except savemat, which is not what I need in this case.
How can I explicitly close mat files in python?
import scipy.io

x = []
with open('mat_list.txt', 'r') as f:
    for l in f:
        l = l.replace('\n', '')
        mat = scipy.io.loadmat(l)
        x.append(mat['data'])
You don't need to. loadmat does not keep the file open. If given a file name, it loads the contents of the file into memory, then immediately closes it. You can use a file object as @nils-werner suggested, but you will get no benefit from doing so. You can see this by looking at the source code.
You are most likely running out of memory due to simply having too much data at a time. The first thing I would try is to load all the data into one big numpy array. You know the size of each file, and you know how many files there are, so you can pre-allocate an array of the right size and write the data to slices of that array. This will also tell you right away if this is a problem with your array size.
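A minimal sketch of that pre-allocation approach (the per-file sample count is an assumption; use the real shape of your data):

import numpy as np
import scipy.io

with open('mat_list.txt') as f:
    paths = [line.strip() for line in f]

samples_per_file = 1024   # assumed size of mat['data'] in each file

# Write each file's data into a slice of one pre-allocated array instead of
# growing a Python list of arrays.
x = np.empty((len(paths), samples_per_file), dtype=np.float32)
for i, path in enumerate(paths):
    mat = scipy.io.loadmat(path)
    x[i, :] = mat['data'].ravel()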
If you are still running out of memory, you will need another solution. A simple solution would be to use dask. This allows you to create something that looks and acts like a numpy array, but lives in a file rather than in memory. This allows you to work with data sets too large to fit into memory. bcolz and blaze offer similar capabilities, although not as seamlessly.
If these are not an option, h5py and pytables allow you to store data sets to files incrementally rather than having to keep the whole thing in memory at once.
Overall, I think this question is a classic example of the XY Problem. It is generally much better to state your symptoms, and ask for help on those symptoms, rather than guessing what the solution is and asking for someone to help you implement the solution.
You can pass an open file handle to scipy.io.loadmat:
import scipy.io

x = []
with open('mat_list.txt', 'r') as f:
    for l in f:
        l = l.replace('\n', '')
        with open(l, 'rb') as matfile:   # .mat files should be opened in binary mode
            mat = scipy.io.loadmat(matfile)
        x.append(mat['data'])
Leaving the with open() context will then automatically close the file.

Analyzing huge dataset from 80GB file with only 32GB of RAM

I have a very huge text file (around 80G). The file contains only numbers (integers and floats) and has 20 columns. Now I have to analyze each column. By "analyze" I mean I have to do some basic calculations on each column, like finding the mean, plotting histograms, checking if a condition is satisfied or not, etc. I am reading the file like this:
with open(filename) as original_file:
    all_rows = [[float(digit) for digit in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)
After this I do all the analysis on specific columns. I use a "good" server/workstation (with 32GB RAM) to execute my program. The problem is that I am not able to finish the job: I waited almost a day, the program was still running, and I had to kill it manually. I know my script is correct, because I have tried the same script on smaller files (around 1G) and it worked nicely.
My initial guess is that it is some kind of memory problem. Is there any way I can run such a job? Some different method or some other way?
I tried splitting the file into smaller files and then analyzing them individually in a loop, like this:
pre_name = "split_file"
for k in range(11):   # There are 10 files with almost 8G each
    filename = pre_name + str(k).zfill(3)   # My files are in form "split_file000, split_file001 ..."
    with open(filename) as original_file:
        all_rows = [[float(digit) for digit in line.split()] for line in original_file]
    all_rows = np.asarray(all_rows)
    # Some analysis here
    plt.hist(all_rows[:, 8], 100)   # Plotting histogram for 9th column
    all_rows = None
I have tested the above code on a bunch of smaller files and it works fine. However, it was the same problem again when I used it on the big files. Any suggestions? Is there a cleaner way to do it?
For such lengthy operations (when the data doesn't fit in memory), it can be useful to use a library like dask (http://dask.pydata.org/en/latest/), particularly dask.dataframe.read_csv to read the data and then perform your operations as you would in the pandas library (another useful package to mention).
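For example, a minimal sketch along those lines (the file name, separator handling and the condition are assumptions):

import dask.dataframe as dd

# Lazily read the whitespace-separated file; nothing is pulled into memory
# until .compute() is called.
df = dd.read_csv("bigfile.txt", delim_whitespace=True, header=None)

mean_col9 = df[8].mean().compute()       # mean of the 9th column
n_over = (df[8] > 1.0).sum().compute()   # example condition check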
Two alternatives come to my mind:
You should consider performing your computation with online algorithms:
In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.
It is possible to compute the mean, the variance, and a histogram with pre-specified bins this way, with constant memory use (see the sketch at the end of this answer).
You should throw your data into a proper database and make use of that database system's statistical and data handling capabilities, e.g., aggregate functions and indexes.
Random links:
http://www.postgresql.org/
https://www.mongodb.org/
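A rough sketch of the online-algorithm idea from the first alternative (file name, block size and histogram bins are assumptions):

import itertools
import numpy as np

n_cols = 20
count = 0
col_sum = np.zeros(n_cols)
col_sumsq = np.zeros(n_cols)
bins = np.linspace(0.0, 1.0, 101)   # assumed bin edges for the 9th column
hist = np.zeros(len(bins) - 1)

with open("bigfile.txt") as f:
    while True:
        block = list(itertools.islice(f, 100_000))   # read ~100k rows at a time
        if not block:
            break
        rows = np.atleast_2d(np.loadtxt(block))
        count += rows.shape[0]
        col_sum += rows.sum(axis=0)
        col_sumsq += (rows ** 2).sum(axis=0)
        hist += np.histogram(rows[:, 8], bins=bins)[0]

mean = col_sum / count                   # per-column mean
var = col_sumsq / count - mean ** 2      # per-column variance, constant memory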

How to rapidly load data into memory with Python?

I have a large csv file (5 GB) that I can read with pandas.read_csv(). This operation takes a lot of time: 10-20 minutes.
How can I speed it up?
Would it be useful to transform the data into an SQLite format? If so, what should I do?
EDIT: More information:
The data contains 1852 columns and 350000 rows. Most of the columns are float64 and contain numbers. Some others contain strings or dates (which I suppose are treated as strings).
I am using a laptop with 16 GB of RAM and an SSD. The data should fit fine in memory (but I know that Python tends to increase the data size).
EDIT 2 :
During loading I receive this message:
/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1841,1842,1844) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
EDIT: SOLUTION
Read the csv file once and save it as:
data.to_hdf('data.h5', 'table')
This format is incredibly efficient
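Reading it back on later runs is then a single call that reuses the same path and key as the to_hdf line above:

import pandas as pd

# Load the HDF5 store instead of re-parsing the 5 GB CSV.
data = pd.read_hdf('data.h5', 'table')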
This actually depends on which part of reading it is taking 10 minutes.
If it's actually reading from disk, then obviously any more compact form of the data will be better.
If it's processing the CSV format (you can tell this because your CPU is at near 100% on one core while reading; it'll be very low in the other two cases), then you want a form that's already preprocessed.
If it's swapping memory, e.g., because you only have 2GB of physical RAM, then nothing is going to help except splitting the data.
It's important to know which one you have. For example, stream-compressing the data (e.g., with gzip) will make the first problem a lot better, but the second one even worse.
It sounds like you probably have the second problem, which is good to know. (However, there are things you can do that will probably be better no matter what the problem.)
Your idea of storing it in a sqlite database is nice because it can at least potentially solve all three at once; you only read the data in from disk as-needed, and it's stored in a reasonably compact and easy-to-process form. But it's not the best possible solution for the first two, just a "pretty good" one.
In particular, if you actually do need to do array-wide work across all 350000 rows, and can't translate that work into SQL queries, you're not going to get much benefit out of sqlite. Ultimately, you're going to be doing a giant SELECT to pull in all the data and then process it all into one big frame.
Another option is to write out the shape and structure information, then write the underlying arrays in NumPy binary form; for reading, you reverse that. NumPy's binary form just stores the raw data as compactly as possible, and it's a format that can be written blindingly quickly (it's basically just dumping the raw in-memory storage to disk). That will improve both the first and second problems.
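A minimal sketch of that idea with np.save / np.load, assuming a purely numeric frame (string or date columns would need separate handling; file names are examples):

import numpy as np
import pandas as pd

# One-time slow parse, then dump the numeric block as raw NumPy binary.
df = pd.read_csv("data.csv")
np.save("data_values.npy", df.to_numpy(dtype=np.float64))
np.save("data_columns.npy", np.array(df.columns, dtype=str))

# Later runs: loading the binary file is close to a raw disk read.
values = np.load("data_values.npy")
columns = np.load("data_columns.npy")
data = pd.DataFrame(values, columns=columns)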
Similarly, storing the data in HDF5 (either using Pandas IO or an external library like PyTables or h5py) will improve both the first and second problems. HDF5 is designed to be a reasonably compact and simple format for storing the same kind of data you usually store in Pandas. (And it includes optional compression as a built-in feature, so if you know which of the two you have, you can tune it.) It won't solve the second problem quite as well as the last option, but probably well enough, and it's much simpler (once you get past setting up your HDF5 libraries).
Finally, pickling the data may sometimes be faster. pickle is Python's native serialization format, and it's hookable by third-party modules; NumPy and Pandas have both hooked it to do a reasonably good job of pickling their data.
(Although this doesn't apply to the question, it may help someone searching later: If you're using Python 2.x, make sure to explicitly use pickle format 2; IIRC, NumPy is very bad at the default pickle format 0. In Python 3.0+, this isn't relevant, because the default format is at least 3.)
Python has two built-in libraries called pickle and cPickle that can store any Python data structure.
cPickle is identical to pickle except that cPickle has trouble with Unicode stuff and is 1000x faster.
Both are really convenient for saving stuff that's going to be re-loaded into Python in some form, since you don't have to worry about some kind of error popping up in your file I/O.
Having worked with a number of XML files, I've found some performance gains from loading pickles instead of raw XML. I'm not entirely sure how the performance compares with CSVs, but it's worth a shot, especially if you don't have to worry about Unicode stuff and can use cPickle. It's also simple, so if it's not a good enough boost, you can move on to other methods with minimal time lost.
A simple example of usage:
>>> import pickle
>>> stuff = ["Here's", "a", "list", "of", "tokens"]
>>> fstream = open("test.pkl", "wb")
>>> pickle.dump(stuff,fstream)
>>> fstream.close()
>>>
>>> fstream2 = open("test.pkl", "rb")
>>> old_stuff = pickle.load(fstream2)
>>> fstream2.close()
>>> old_stuff
["Here's", 'a', 'list', 'of', 'tokens']
>>>
Notice the "b" in the file stream openers. This is important--it preserves cross-platform compatibility of the pickles. I've failed to do this before and had it come back to haunt me.
For your stuff, I recommend writing a first script that parses the CSV and saves it as a pickle; when you do your analysis, the script associated with that loads the pickle like in the second block of code up there.
I've tried this with XML; I'm curious how much of a boost you will get with CSVs.
If the problem is processing overhead, you can divide the file into smaller files and handle them on different CPU cores or threads. Also, for some algorithms the Python run time increases non-linearly, and splitting the input will help in those cases.
