In one of my recent projects I need to perform a simple task, but I'm not sure of the most efficient way to do it.
I have several large text files (>5GB) and I need to continuously extract random lines from them. The requirements are: I can't load the files into memory, I need to extract lines very efficiently (>>1000 lines a second), and preferably with as little pre-processing as possible.
The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a short pre-processing step I can make all lines the same length (though the perfect solution would not require pre-processing).
I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).
The next solution I thought about is to create some kind of index. I found this solution, but it's very outdated, so it needs some work to get it running, and even then I'm not sure whether the overhead created while processing the index file won't slow things down to the time scale of the solutions above.
Another solution is converting the file into a binary file and then getting instant access to lines that way. For this solution I couldn't find any Python package that supports binary-text work, and I feel that creating a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations/mistakes.
The final solution I thought about is using some kind of database (sqlite in my case) which will require transferring the lines into a database and loading them this way.
Note: I will also load thousands of (random) lines each time, therefore solutions which work better for groups of lines will have an advantage.
Thanks in advance,
Art.
As said in the comments, I believe using hdf5 would be a good option.
This answer shows how to read that kind of file.
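As an alternative, if you do the fixed-length pre-processing you describe, you don't need hdf5 or an index at all: line i starts exactly at byte i * line_length, so each random line is one seek plus one read. A minimal sketch (the line length of 21 bytes is an assumption; use whatever your pre-processing produces):

```python
import random

LINE_LEN = 21  # bytes per line, including the trailing "\n" (assumed)

def random_lines(path, count, total_lines):
    # With fixed-length lines, line i starts at byte i * LINE_LEN,
    # so each lookup is a single seek + read: no index, no full scan.
    result = []
    with open(path, "rb") as f:
        for i in random.sample(range(total_lines), count):
            f.seek(i * LINE_LEN)
            result.append(f.read(LINE_LEN).decode().rstrip("\n"))
    return result
```

Since you load thousands of lines per batch, sorting the sampled indices before seeking keeps the reads in disk order, which is noticeably faster on spinning disks.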
Related
I was wondering, which of these is the better and safer way to process a file's contents line by line. The assumption here is that the file's contents are very critical, but the file is not very large, so memory consumption is not an issue.
Is it better to close the file as soon as possible using this:
with open('somefile.txt') as f:
    lines = f.readlines()

for line in lines:
    do_something(line)
Or to just loop over it in one go:
with open('somefile.txt') as f:
    for line in f:
        do_something(line)
Which of these practices is generally the better and more accepted way of doing it?
There is no "better" solution, simply because these two are far from equivalent.
The first one loads the entire file into memory and then processes the in-memory data. This has a potential advantage of being faster, depending on what the processing is. Note that if the file is bigger than the amount of RAM you have, this is not an option at all.
The second one loads only a piece of the file into memory, processes it, and then loads another piece, and so on. This is generally slower (although you likely won't see the difference, because the processing time, especially in Python, often dominates the reading time), but it drastically reduces memory consumption (assuming your file has more than one line). In some cases it may also be more difficult to work with. For example, say you are looking for a specific pattern xy\nz in the file. With "line by line" loading you have to remember the previous line in order to do a correct check, which is (slightly) more difficult to implement. So again: it depends on what you are doing.
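The "remember the previous line" approach for a cross-boundary pattern like xy\nz can be sketched as follows (function and parameter names are illustrative):

```python
def spans_lines(path, end_pat="xy", start_pat="z"):
    # Detect a pattern that crosses a line boundary: end_pat at the end
    # of one line followed by start_pat at the start of the next line.
    prev = None
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if prev is not None and prev.endswith(end_pat) and line.startswith(start_pat):
                return lineno  # pattern spans lines lineno-1 and lineno
            prev = line.rstrip("\n")
    return None
```

Only one previous line is kept in memory, so this preserves the low memory footprint of line-by-line reading.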
As you can see, there are tradeoffs, and what is better depends on your context. I often do this: if the file is relatively small (say, up to a few hundred megabytes), I load it into memory.
Now, you've mentioned that the content is "critical". I don't know what that means, but if, for example, you are trying to make updates to the file atomic or reads consistent between processes, then this is a very different problem from the one you've posted, and generally a hard one. So I advise using a proper database. SQLite is an easy option (again, depending on your scenario), similar to having a file.
I am reading many (say 1k) CERN ROOT files using a loop and storing some data into a nested NumPy array. The use of loops makes it a serial task, and each file takes quite some time to process. Since I am working on a deep learning model, I must create a large enough dataset, but the reading itself is taking very long (reading 835 events takes about 21 minutes). Can anyone please suggest whether it is possible to use multiple GPUs to read the data, so that less time is required for the reading? If so, how?
Adding some more details: I pushed the program to GitHub so that it can be seen (please let me know if posting a GitHub link is not allowed; in that case, I will post the relevant portion here):
https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py
I run the program as:
python read_n_train.py <input-file-list>
where the argument is a text file containing the list of files with addresses. I was opening the ROOT files in a loop in the read_data_into_list() function. But as I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed gets worse as we read more and more data.
Meanwhile, I tried to use the slurmpy package: https://github.com/brentp/slurmpy
With this, I can distribute the job to, say, N worker nodes. In that case, each individual reading program will read the file assigned to it and return a corresponding list. It is just that in the end I need to add the lists together, and I couldn't figure out a way to do this.
Any help is highly appreciated.
Regards,
Kolahal
You're looping over all the events sequentially from Python; that's probably the bottleneck.
You can look into root_numpy to load the data you need from the root file into numpy arrays:
root_numpy is a Python extension module that provides an efficient interface between ROOT and NumPy. root_numpy’s internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations.
I'm also currently looking at root_pandas which seems similar.
While this solution does not precisely answer the request for parallelization, it may make the parallelization unnecessary. And if it is still too slow, it can still be run in parallel using Slurm or something else.
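On the "adding the lists" point from the question: if you parallelize per file on a single machine, multiprocessing.Pool returns one list per file, and flattening them is a one-liner. A sketch, where read_one_file is a placeholder you would replace with your actual ROOT-reading code:

```python
from multiprocessing import Pool
from itertools import chain

def read_one_file(path):
    # Placeholder: replace the body with your ROOT-reading code
    # (e.g. a root_numpy call) that returns a list of events.
    with open(path) as f:
        return [line.strip() for line in f]

def read_all(paths, workers=4):
    with Pool(workers) as pool:
        per_file = pool.map(read_one_file, paths)  # one list per input file
    # "Adding the lists": flatten into a single list, preserving file order.
    return list(chain.from_iterable(per_file))
```

The same flattening step works if the per-file lists come back from Slurm workers instead of a local pool.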
I have a Python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I will create smaller test files.)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any large amount of data? If it shows up only on particular files, then most probably it is related to some feature of those files and will also show up on a smaller file with the same feature. If the main reason is just the large amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck of reading the file? Is it just a hard drive performance issue, or do you do some heavy processing of the read data in your script before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load this processed data instead of redoing the processing each time.
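The "process once, reuse" idea can be sketched with a pickle cache that is invalidated whenever the source file changes (the function names and cache filename here are illustrative, not from the original script):

```python
import os
import pickle

def expensive_parse(path):
    # Placeholder for the slow reading/processing step.
    with open(path) as f:
        return [line.strip() for line in f]

def load_cached(path, cache="parsed.pkl"):
    # Reuse the processed result if the cache is newer than the source file;
    # otherwise parse once and write the cache for the next run.
    if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(path):
        with open(cache, "rb") as f:
            return pickle.load(f)
    data = expensive_parse(path)
    with open(cache, "wb") as f:
        pickle.dump(data, f)
    return data
```

After the first run, every buggy re-run of the script starts from the cached data instead of re-reading the large files.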
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
I'm trying to read some files in a directory, which currently has 10 text files. Over time, the number of files increases, and the total size is now around 400 MB.
File contents are in the format:
student_name:student_ID:date_of_join:anotherfield1:anotherfield2
In case of a match, I have to print out the whole line. Here's what I've tried.
import os

findvalue = "student_id"  # this is the user's input (alphanumeric)
directory = "./RecordFolder"
for filename in os.listdir(directory):
    with open(os.path.join(directory, filename)) as f:
        for line in f:
            if findvalue in line:
                print(line)
This works, but it takes a lot of time. How can I reduce the run time?
When text files become too slow, you need to start looking at databases. One of the main purposes of databases is to intelligently handle IO from persistent data storage.
Depending on the needs of your application, SQLite may be a good fit. I suspect this is what you want, given that you don't seem to have a gargantuan data set. From there, it's just a matter of making database API calls and allowing SQLite to handle the lookups -- it does so much better than you!
If (for some strange reason) you really don't want to use a database, then consider further breaking up your data into a tree, if at all possible. For example, you could have a file for each letter of the alphabet in which you put student data. This should cut down on looping time since you're reducing the number of students per file. This is a quick hack, but I think you'll lose less hair if you go with a database.
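A minimal SQLite sketch for this record format, assuming the five colon-separated fields from the question (column names here are my guesses for the two "anotherfield" columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
conn.execute("""CREATE TABLE students
    (name TEXT, student_id TEXT, date_of_join TEXT, field1 TEXT, field2 TEXT)""")
conn.execute("CREATE INDEX idx_sid ON students (student_id)")

def load_file(path):
    # Each line: student_name:student_ID:date_of_join:anotherfield1:anotherfield2
    with open(path) as f:
        rows = [line.rstrip("\n").split(":") for line in f]
    conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

def lookup(student_id):
    # The index makes this a fast lookup instead of a scan of every file.
    cur = conn.execute("SELECT * FROM students WHERE student_id = ?", (student_id,))
    return cur.fetchall()
```

You load each file once as it arrives; every subsequent search is an indexed query rather than a pass over 400 MB of text.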
IO is notoriously slow compared to computation, and given that you are dealing with large files, it's probably best to deal with them line by line. I don't see an obvious, easy way to speed this up in Python.
Depending on how frequent your "hits" (i.e., findvalue in line) are, you may decide to write to a file so as not to be slowed down by console output; but if relatively few items are found, it won't make much of a difference.
I think for Python there's nothing obvious and major you can do. You could always explore other tools (such as grep or databases ...) as alternative approaches.
PS: No need for the else:pass ..
In my program I have a method which requires about 4 files to be opened each time it is called, as I need to take some data from them. All of this data from the files I have been storing in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there any better way to handle these files? And is storing all the data in lists time-consuming? What are better alternatives to lists?
I can give some code, but my previous question was closed because the code alone only confused everyone; it is part of a big program and would need to be explained completely to be understood. So I am not giving any code here; please suggest approaches, treating this as a general question.
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static, and relatively small. Then, the 10k calls will read an in-memory cache rather than a file. Much faster.
If you are modifying the data, the alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading/manipulation/storage of this data, and they usually have their own internal in-memory cache; all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
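If the files are static, the in-memory cache can be as simple as memoizing the loader, so the file is read from disk only once no matter how many times the method runs (the function name here is illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_lines(path):
    # The file is read from disk only on the first call for a given path;
    # the remaining ~10,000 calls return the cached result instantly.
    with open(path) as f:
        return tuple(line.rstrip("\n") for line in f)
```

Returning a tuple rather than a list keeps the cached value immutable, so one caller can't accidentally corrupt the data seen by the next 9,999 calls.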
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on the database. Then it will be very fast to make simple queries against your data. You don't need a lot of work to set up a database. You can create an SQLite database without requiring a separate process and it doesn't have a complicated installation process.
Open the files in the method that calls the one you want to run, and pass the data as parameters to that method.
If the files are structured, like configuration files, it might be good to use the ConfigParser library. Otherwise, if you have another structured format, I think it would be better to store all this data in JSON or XML and perform any necessary operations on it.