Python sas7bdat module - iterator or memory intensive?

I'm wondering if the sas7bdat module in Python creates an iterator-type object or loads the entire file into memory as a list? I'm interested in doing something line-by-line to a .sas7bdat file that is on the order of 750GB, and I really don't want Python to attempt to load the whole thing into RAM.
Example script:
from sas7bdat import SAS7BDAT

count = 0
with SAS7BDAT('big_sas_file.sas7bdat') as f:
    for row in f:
        count += 1
I can also use
it = f.__iter__()
but I'm not sure if that will still go through a memory-intensive data load. Any knowledge of how sas7bdat works OR another way to deal with this issue would be greatly appreciated!

You can see the relevant code on Bitbucket. The docstring describes iteration as a "generator", and looking at the code, it appears to read small pieces of the file rather than the whole thing at once. However, I don't know enough about the file format to say whether there are situations that could cause it to read a lot of data in one go.
If you really want to get a sense of its performance before trying it on a giant 750 GB file, test it by creating a few sample files of increasing size and seeing how its performance scales with the file size.
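For example, a rough way to measure that scaling (a sketch only; the resource module is Unix-specific, and ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

import time
import resource
from sas7bdat import SAS7BDAT

def profile_iteration(path):
    # Stream a sas7bdat file row by row, reporting elapsed time and peak memory.
    start = time.time()
    count = 0
    with SAS7BDAT(path) as f:
        for row in f:
            count += 1
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux, bytes on macOS
    print("%s: %d rows in %.1fs, peak RSS %d" % (path, count, time.time() - start, peak))

# Run against sample files of increasing size, e.g.:
# profile_iteration('sample_1gb.sas7bdat')
# profile_iteration('sample_10gb.sas7bdat')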

Related

Better to read the whole file, close it, and then loop over it, or loop over it while it's open?

I was wondering, which of these is the better and safer way to process a file's contents line by line. The assumption here is that the file's contents are very critical, but the file is not very large, so memory consumption is not an issue.
Is it better to close the file as soon as possible using this:
with open('somefile.txt') as f:
    lines = f.readlines()

for line in lines:
    do_something(line)
Or to just loop over it in one go:
with open('somefile.txt') as f:
    for line in f:
        do_something(line)
Which of these practices is generally the better and more accepted way of doing it?
There is no "better" solution, simply because the two are far from equivalent.
The first one loads the entire file into memory and then processes the in-memory data. This has the potential advantage of being faster, depending on what the processing is. Note that if the file is bigger than the amount of RAM you have, then this is not an option at all.
The second one loads only a piece of the file into memory, processes it, then loads another piece, and so on. This is generally slower (although you often won't notice the difference, because the processing time, especially in Python, tends to dominate the reading time), but it drastically reduces memory consumption (assuming your file has more than one line). It can also be harder to work with in some cases. For example, say you are looking for a specific pattern xy\nz in the file. With line-by-line loading you have to remember the previous line in order to do a correct check, which is slightly more difficult to implement (sketched below). So again: it depends on what you are doing.
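For instance, a minimal sketch of that cross-line check, using the literal "xy"/"z" example from above:

def contains_pattern(path):
    # Stream the file line by line, remembering only the previous line,
    # so the pattern "xy\nz" spanning a line boundary can still be found.
    prev = None
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if prev is not None and prev.endswith("xy") and line.startswith("z"):
                return True
            prev = line
    return False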
As you can see, there are tradeoffs, and what is better depends on your context. My usual rule: if the file is relatively small (say, up to a few hundred megabytes), load it into memory.
Now, you've mentioned that the content is "critical". I don't know exactly what that means, but if, for example, you are trying to make updates to the file atomic or reads consistent between processes, then that is a very different problem from the one you've posted, and generally a hard one, so I advise using a proper database. SQLite is an easy option (again, depending on your scenario) that still feels much like working with a single file.
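For what it's worth, a minimal sketch of that SQLite route using the standard-library sqlite3 module (the file name, table name, and per-line processing are placeholders, not part of the original answer):

import sqlite3

conn = sqlite3.connect("somefile.db")
conn.execute("CREATE TABLE IF NOT EXISTS lines (content TEXT)")

with conn:  # the transaction commits on success and rolls back on error
    conn.execute("INSERT INTO lines (content) VALUES (?)", ("first line",))

for (content,) in conn.execute("SELECT content FROM lines"):
    print(content)  # replace with your per-line processing

conn.close()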

Instant access to line from a large file without loading the file

In one of my recent projects I need to perform a simple task, but I'm not sure of the most efficient way to do it.
I have several large text files (>5 GB) from which I need to continuously extract random lines. The requirements are: I can't load the files into memory, I need to do this very efficiently (>>1000 lines a second), and preferably with as little pre-processing as possible.
The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a short pre-processing step I can make all lines the same length (though the perfect solution would not require pre-processing).
I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).
The next solution I thought about is to create some kind of index. I found this solution, but it's very outdated, so it needs some work to get going, and even then I'm not sure whether the overhead created while processing the index file won't slow things down to the time scale of the solution above.
Another solution is converting the file into a binary file and then getting instant access to lines that way. For this solution I couldn't find any Python package that supports binary-text work, and I feel like creating a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations/mistakes.
The final solution I thought about is using some kind of database (SQLite in my case), which would require transferring the lines into a database and loading them that way.
Note: I will also load thousands of (random) lines each time, so solutions that work better for groups of lines will have an advantage.
Thanks in advance,
Art.
As said in the comments, I believe using HDF5 would be a good option.
This answer shows how to read that kind of file.
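A minimal sketch of that idea with h5py, assuming the lines were already converted once into a fixed-length string dataset named "lines" inside lines.h5 (both names are placeholders):

import numpy as np
import h5py

with h5py.File("lines.h5", "r") as h5:
    ds = h5["lines"]
    n = ds.shape[0]
    # h5py fancy indexing wants sorted, unique indices
    idx = np.sort(np.random.choice(n, size=1000, replace=False))
    batch = ds[idx]  # reads only the selected rows from disk
    lines = [row.decode("utf-8") for row in batch]

Reading the random lines as one sorted batch should also suit the "groups of lines" note in the question, since sorted indices let the library read the file roughly in order rather than seeking back and forth.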

How to write large JSON data?

I have been trying to write a large amount (>800 MB) of data to a JSON file; it took a fair amount of trial and error to get this code:
import json

def write_to_cube(data):
    # read the existing file, merge in the new data, then rewrite it
    with open('test.json') as file1:
        temp_data = json.load(file1)
    temp_data.update(data)
    with open('test.json', 'w') as f:
        json.dump(temp_data, f)
To run it, just call the function: write_to_cube({"some_data": data})
Now, the problem with this code is that it's fast for small amounts of data, but the trouble starts when the test.json file holds more than 800 MB. When I try to update or add data to it, it takes ages.
I know there are external libraries such as simplejson or jsonpickle, but I am not quite sure how to use them.
Is there any other way to approach this problem?
Update:
I am not sure how this can be a duplicate; the other questions say nothing about writing or updating a large JSON file, only about parsing one.
Is there a memory efficient and fast way to load big json files in python?
Reading rather large json files in Python
Neither of the above makes this question a duplicate. They don't say anything about writing or updating.
So the problem is that you have a long-running operation. Here are a few approaches I usually take:
Optimize the operation: this rarely works; I can't recommend some superb library that would parse the JSON only a few seconds faster.
Change your logic: if the purpose is to save and load data, you would probably like to try something else, such as storing your objects in a binary file or key-value store instead (see the sketch after this list).
Threads and callbacks, or deferred objects in some web frameworks: in web applications the operation sometimes takes longer than a request can wait, so we can do the work in the background (for example, zipping files and then emailing the archive to the user, or sending SMS by calling a third party's API...).
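For the second point, a minimal sketch using the standard-library shelve module (the store name and keys are invented for illustration); unlike rewriting one monolithic JSON document, an update here only touches the affected entries:

import shelve

def write_to_cube(data, path="test_cube.db"):
    # open (or create) the on-disk store and merge in the new keys
    with shelve.open(path) as db:
        db.update(data)

write_to_cube({"some_data": [1, 2, 3]})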

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but only see the error message minutes later, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
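For example (a sketch only; mymodule and its functions are stand-ins for your own code):

import importlib
import mymodule                        # hypothetical: the module you keep editing

big_data = mymodule.read_big_files()   # hypothetical slow step, run once in the live interpreter

# ...after editing mymodule.py, reload the module without restarting the
# interpreter; big_data stays in memory, so the files are not read again:
importlib.reload(mymodule)
mymodule.analyse(big_data)             # hypothetical follow-up step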
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any large amount of data? If it shows only on particular files, then most probably it is related to some feature of those files and will also show up on a smaller file with the same feature. If the main reason is just the sheer amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the file? Is it just a hard-drive performance issue, or do you do some heavy processing of the data in your script before actually getting to the part that causes problems? In the latter case, you can do that processing once, write the results to a new file, and modify your script to load the processed data instead of redoing the processing every run (see the sketch at the end of this answer).
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
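Regarding the third point, a sketch of the "process once, reuse later" idea using pickle (the cache path and expensive_parse are placeholders for your own slow step):

import os
import pickle

CACHE = "parsed_input.pkl"  # hypothetical cache file

def expensive_parse(path):
    # stand-in for the slow reading/processing step
    with open(path) as f:
        return f.read()

def load_data(raw_path):
    # Parse the raw file once; later runs reuse the pickled result.
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    data = expensive_parse(raw_path)
    with open(CACHE, "wb") as f:
        pickle.dump(data, f)
    return data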

How to manipulate a huge csv file (> 12GB)?

I am dealing with a huge csv file of approximately 13 GB and around 130,000,000 lines. I am using Python and tried to work on it with the pandas library, which I have used before for this kind of work. However, I had always dealt with csv files of fewer than 2,000,000 lines or 500 MB before. For this huge file, pandas doesn't seem appropriate anymore, as my computer is dying when I try my code (a MacBook Pro from 2011 with 8 GB of RAM). Could somebody advise me on a way to deal with this kind of file in Python? Would the csv library be more appropriate?
Thank you in advance!
In Python I have found that for opening big files it is better to use generators as in:
with open("ludicrously_humongous.csv", "r") as f:
for line in f:
#Any process of that line goes here
Programming this way makes your program read only one line at a time into memory, allowing you to work with large files without exhausting your RAM.
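If you need the fields split per row, the standard-library csv module streams the same way, and pandas.read_csv also accepts a chunksize parameter that yields manageable DataFrame pieces. A sketch with the csv module (the file name and process() are placeholders):

import csv

def process(row):
    pass  # stand-in for your per-row logic

with open("huge.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # read the header row once
    for row in reader:     # only one parsed row is held in memory at a time
        process(row)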
