Importing large IDL files into Python with SciPy

I currently use scipy.io.readsav() to import IDL .sav files into Python, which works well, e.g.:
data = scipy.io.readsav('data.sav', python_dict=True, verbose=True)
However, if the .sav file is large (say > 1 GB), I get a MemoryError when trying to import it into Python.
Usually, iterating through the data would solve this (if it were a .txt or .csv file) rather than loading it all in at once, but I don't see how to do that with a .sav file, since readsav is the only method I know of for importing one.
Any ideas how I can avoid this memory error?

This was resolved by using 64-bit Python.
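If you are not sure which interpreter you are running, a quick check using only the standard library (a minimal sketch) reports the pointer size of the current Python process:
import struct
print(struct.calcsize("P") * 8)   # 64 on a 64-bit interpreter, 32 on a 32-bit one
A 32-bit process is limited to a few GB of address space no matter how much RAM is installed, which is why a > 1 GB .sav file, which expands further once decoded into NumPy arrays, can fail to load there.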

Related

scipy.io.loadmat returns MemoryError for big matlab structures

I want to open and process some big .mat files in Python. The scipy.io.loadmat function is perfect for that purpose. However, the function raises a MemoryError when the .mat files are big. The problem might be due to the Python version I use (Python 2.7.10, 32-bit, run through Spyder). This problem has already been raised, but I can't find any decent solution. Ideally, I would be able to open these files without changing my Python installation. Is there a way to make scipy.io.loadmat load just some of the variables contained in the .mat file?
See the documentation here: http://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html
You can pass a list of variable names to read from the file:
scipy.io.loadmat("myfile.mat", variable_names=["myvar1", "myvar2"])

open and read a Sigmaplot .JNB file in Python

My lab has a very large directory of Sigmaplot files, saved as .JNB . I would like to process the data in these files using Python. However, I have thus far been unable to read the files into anything interpretable.
I've already tried pretty much every NumPy read function and most of the pandas read functions, and I am getting nothing but gibberish.
Does anyone have any advice on reading these files, short of exporting them all to Excel one by one?

Python sas7bdat module - iterator or memory intensive?

I'm wondering whether the sas7bdat module in Python creates an iterator-type object or loads the entire file into memory as a list. I'm interested in doing something line by line to a .sas7bdat file that is on the order of 750 GB, and I really don't want Python to attempt to load the whole thing into RAM.
Example script:
from sas7bdat import SAS7BDAT

count = 0
with SAS7BDAT('big_sas_file.sas7bdat') as f:
    for row in f:
        count += 1
I can also use
it = f.__iter__()
but I'm not sure if that will still go through a memory-intensive data load. Any knowledge of how sas7bdat works OR another way to deal with this issue would be greatly appreciated!
You can see the relevant code on Bitbucket. The docstring describes iteration as a "generator", and the code appears to read small pieces of the file rather than the whole thing at once. However, I don't know enough about the file format to say whether there are situations that could force it to read a lot of data at once.
If you really want a sense of its behaviour before pointing it at a 750 GB file, test it on a few sample files of increasing size and see how its run time and memory use scale with file size, as in the sketch below.
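A minimal sketch of such a check, assuming the sas7bdat package is installed and that 'sample.sas7bdat' is one of your smaller test files (the standard-library resource module is Unix-only):
import resource
from sas7bdat import SAS7BDAT

count = 0
with SAS7BDAT('sample.sas7bdat') as f:
    for row in f:          # rows are yielded one at a time
        count += 1

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(count, "rows; peak RSS:", peak, "(kilobytes on Linux, bytes on macOS)")
Repeating this for progressively larger samples shows whether peak memory stays roughly flat (iterator-like behaviour) or grows with file size.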

How to manipulate a huge csv file (> 12GB)?

I am dealing with a huge csv file of approximately 13 GB and around 130,000,000 lines. I am using Python and tried to work on it with the pandas library, which I have used before for this kind of work. However, I was previously always dealing with csv files of fewer than 2,000,000 lines or 500 MB. For this huge file, pandas no longer seems appropriate, as my computer dies when I run my code (a 2011 MacBook Pro with 8 GB of RAM). Could somebody advise me on a way to deal with this kind of file in Python? Would the csv library be more appropriate?
Thank you in advance!
In Python I have found that for opening big files it is better to use generators as in:
with open("ludicrously_humongous.csv", "r") as f:
for line in f:
#Any process of that line goes here
Programming this way makes your program read only one line at a time into memory, which lets you work with very large files without exhausting RAM. A parsed-row version using the standard csv module is sketched below.
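A minimal sketch using the standard-library csv module, so each line is split into fields as it is read (the filename and column index are just placeholders):
import csv

total = 0.0
with open("ludicrously_humongous.csv", "r", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)       # skip the header row, if the file has one
    for row in reader:
        total += float(row[2])  # e.g. accumulate the third column
print(total)
pandas can also stream the file in pieces rather than all at once via read_csv(..., chunksize=...), which returns an iterator of DataFrames, so familiar pandas operations can be applied chunk by chunk.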

Read .daq (Data Acquisition Toolbox for Matlab) file format in python

Does anyone know of a way to read .daq files generated with the MATLAB Data Acquisition Toolbox in Python? Alternatively, a simple way (using only open-source software) to convert the files to .csv or .mat files that can be read by Python would be OK.
I am not sure if you have solved this by now, but you can generate .mat files from the .daq files easily in MATLAB using daqread. In the MATLAB command window:
Data = daqread('mydata.daq');
cd(saveDir)
save('Data','Data')
This saves the Data variable to a file called Data.mat, and the snippet can easily be wrapped in a function that scans a whole directory of .daq files. Reading the result into Python is then straightforward with the scipy module. From the interpreter, for example:
from scipy.io import loadmat
data = loadmat('Data.mat')
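loadmat returns a dictionary keyed by the MATLAB variable names (plus a few metadata entries), so the converted data can be pulled out of that dictionary; a brief sketch, continuing from the load above:
print(data.keys())     # 'Data' plus '__header__', '__version__', '__globals__'
Data = data['Data']    # the array saved from MATLAB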
There may also be a .daq loading submodule somewhere in scipy, but you would have to look.
Hope this helps!
