Python - Checking concordance between two huge text files

So, this one has been giving me a hard time!
I am working with HUGE text files, and by huge I mean 100Gb+. Specifically, they are in the fastq format. This format is used for DNA sequencing data, and consists of records of four lines, something like this:
#REC1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))*55CCF>>>>>>CCCCCCC65
#REC2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
.
.
.
For the sake of this question, just focus on the header lines, starting with a '#'.
So, for QA purposes, I need to compare two such files. These files should have matching headers, so the first record in the other file should also have the header '#REC1', the next should be '#REC2' and so on. I want to make sure that this is the case, before I proceed to heavy downstream analyses.
Since the files are so large, a naive iteration and string comparison would take very long, but this QA step will be run numerous times, and I can't afford to wait that long. So I thought a better way would be to sample records from a few points in the files, for example every 10% of the records. If the order of the records is messed up, I'd be very likely to detect it.
So far, I have been able to handle such files by estimating the file size and then using Python's file.seek() to access a record in the middle of the file. For example, to access a line approximately in the middle, I'd do:
file_size = os.stat(fastq_file).st_size
start_point = int(file_size/2)
with open(fastq_file) as f:
    f.seek(start_point)
    # look for the next beginning of record, never mind how
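(Roughly, the "look for the next beginning of record" step can be done like this - a simplified sketch, opening the file in binary mode and assuming '#' only ever appears at the start of a header line:)
def next_header_offset(path, approx_pos):
    """Return the byte offset of the first header line at or after approx_pos."""
    with open(path, 'rb') as f:
        f.seek(approx_pos)
        f.readline()                      # discard the (probably partial) current line
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:                  # ran off the end of the file
                return None
            if line.startswith(b'#'):     # header lines start with '#'
                return pos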
But now the problem is more complex, since I don't know how to coordinate between the two files: a byte location is not an indicator of the line index in the file. In other words, how can I access the 10,567,311th line in both files to make sure they are the same, without going over the whole files?
Would appreciate any ideas/hints. Maybe iterating in parallel? But how exactly?
Thanks!

Sampling is one approach, but you're relying on luck. Also, Python is the wrong tool for this job. You can do things differently and calculate an exact answer in a still reasonably efficient way, using standard Unix command-line tools:
Linearize your FASTQ records: replace the newlines in the first three lines with tabs.
Run diff on a pair of linearized files. If there is a difference, diff will report it.
To linearize, you can run your FASTQ file through awk:
$ awk '\
    BEGIN { \
        n = 0; \
    } \
    { \
        a[n % 4] = $0; \
        if ((n+1) % 4 == 0) { \
            print a[0]"\t"a[1]"\t"a[2]"\t"a[3]; \
        } \
        n++; \
    }' example.fq > example.fq.linear
To compare a pair of files:
$ diff example_1.fq.linear example_2.fq.linear
If there's any difference, diff will find it and tell you which FASTQ record is different.
You could just run diff on the two files directly, without doing the extra work of linearizing, but it is easier to see which read is problematic if you first linearize.
So these are large files. Writing new files is expensive in time and disk space. There's a way to improve on this, using streams.
If you put the awk script into a file (e.g., linearize_fq.awk), you can run it like so:
$ awk -f linearize_fq.awk example.fq > example.fq.linear
This could be useful with your 100+ Gb files, in that you can now set up two Unix file streams via bash process substitutions, and run diff on those streams directly:
$ diff <(awk -f linearize_fq.awk example_1.fq) <(awk -f linearize_fq.awk example_2.fq)
Or you can use named pipes:
$ mkfifo example_1.fq.linear
$ mkfifo example_2.fq.linear
$ awk -f linearize_fq.awk example_1.fq > example_1.fq.linear &
$ awk -f linearize_fq.awk example_2.fq > example_2.fq.linear &
$ diff example_1.fq.linear example_2.fq.linear
$ rm example_1.fq.linear example_2.fq.linear
Both named pipes and process substitutions avoid the step of creating extra (regular) files, which could be an issue for your kind of input. Writing linearized copies of 100+ Gb files to disk could take a while to do, and those copies could also use disk space you may not have much of.
Using streams gets around those two problems, which makes them very useful for handling bioinformatics datasets in an efficient way.
You could reproduce these approaches with Python, but it will almost certainly run much slower, as Python is very slow at I/O-heavy tasks like these.
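(If you did want to reproduce the linearized, record-wise comparison in Python anyway, a rough sketch could look like this; the function names are made up for illustration and this is untested against real FASTQ data:)
from itertools import zip_longest

def records(path):
    """Yield each 4-line FASTQ record as one tab-joined string."""
    with open(path) as f:
        while True:
            rec = [f.readline() for _ in range(4)]
            if not rec[0]:                       # end of file
                return
            yield '\t'.join(line.rstrip('\n') for line in rec)

def first_difference(path1, path2):
    """Return the index of the first differing record, or None if none differ."""
    for i, (r1, r2) in enumerate(zip_longest(records(path1), records(path2))):
        if r1 != r2:
            return i
    return None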

Iterating in parallel might be the best way to do this in Python. I have no idea how fast this will run (a fast SSD will probably be the best way to speed this up), but since you'll have to count newlines in both files anyway, I don't see a way around this:
with open(file1) as f1, open(file2) as f2:
    for l1, l2 in zip(f1, f2):
        if l1.startswith("#REC"):
            if l1 != l2:
                print("Difference at record", l1)
                break
    else:
        # for/else: this branch runs only if the loop finished without a break
        print("No differences")
This is written for Python 3 where zip returns an iterator; in Python 2, you need to use itertools.izip() instead.

Have you looked into using the rdiff command?
The upsides of rdiff are:
in a test with 4.5GB files, rdiff only ate about 66MB of RAM and scaled very well. It has never crashed to date.
it is also MUCH faster than diff.
rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program.
The downsides of rdiff are:
it's not part of the standard Linux/UNIX distribution – you have to install the librsync package.
delta files rdiff produces have a slightly different format than diff's.
delta files are slightly larger (but not significantly enough to care).
a slightly different approach is used when generating a delta with rdiff, which is both good and bad – 2 steps are required. The first one produces a special signature file. In the second step, a delta is created using another rdiff call (see the sketch below). While the 2-step process may seem annoying, it has the benefit of providing faster deltas than when using diff.
See: http://beerpla.net/2008/05/12/a-better-diff-or-what-to-do-when-gnu-diff-runs-out-of-memory-diff-memory-exhausted/
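(A minimal sketch of that two-step flow, driven from Python with subprocess; the rdiff subcommands are the standard librsync ones, and the file names here are just placeholders:)
import subprocess

# Step 1: build a signature of the "old" file.
subprocess.run(["rdiff", "signature", "old.fastq", "old.sig"], check=True)

# Step 2: compute a delta of the "new" file against that signature.
subprocess.run(["rdiff", "delta", "old.sig", "new.fastq", "changes.delta"], check=True)

# The delta describes how to turn old.fastq into new.fastq; it can be applied
# later with: rdiff patch old.fastq changes.delta reconstructed.fastq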

import sys
import re

""" To find the differing records in two HUGE files. This is expected to
use minimal memory. """

def get_rec_num(fd):
    """ Look for the record number. If not found return -1"""
    while True:
        line = fd.readline()
        if len(line) == 0: break
        match = re.search(r'^#REC(\d+)', line)
        if match:
            num = int(match.group(1))
            return(num)
    return(-1)

f1 = open('hugefile1', 'r')
f2 = open('hugefile2', 'r')

hf1 = dict()
hf2 = dict()

while f1 or f2:
    if f1:
        r = get_rec_num(f1)
        if r < 0:
            f1.close()
            f1 = None
        else:
            # if r is found in f2 hash, no need to store in f1 hash
            if not r in hf2:
                hf1[r] = 1
            else:
                del(hf2[r])
    if f2:
        r = get_rec_num(f2)
        if r < 0:
            f2.close()
            f2 = None
        else:
            # if r is found in f1 hash, no need to store in f2 hash
            if not r in hf1:
                hf2[r] = 1
            else:
                del(hf1[r])

print('Records found only in f1:')
for r in hf1:
    print('{}, '.format(r))
print('Records found only in f2:')
for r in hf2:
    print('{}, '.format(r))

Both answers from @AlexReynolds and @TimPietzcker are excellent from my point of view, but I would like to put my two cents in. You also might want to speed up your hardware:
Replace your HDD with an SSD.
Take n SSDs and create a RAID 0. In a perfect world you will get an n-times speedup for your disk I/O.
Adjust the size of the chunks you read from the SSD/HDD. I would expect, for instance, one 16 MB read to execute faster than sixteen 1 MB reads (this applies to a single SSD; for RAID 0 optimization one has to look at the RAID controller options and capabilities). See the sketch after this list.
The last option is especially relevant to NOR SSDs. Don't pursue minimal RAM utilization; read as much as is needed to keep your disk reads fast. For instance, parallel reads of single rows from two files can actually slow reading down - imagine an HDD where the two rows of the two files are always on the same side of the same magnetic disk(s).
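(To illustrate the chunk-size point, a rough sketch that compares headers while letting Python buffer each file in large blocks; the 16 MB figure is only an example:)
BUF = 16 * 1024 * 1024  # example: 16 MB read buffer per file

def first_header_mismatch(path1, path2):
    """Return the index of the first record whose headers differ, else None."""
    with open(path1, buffering=BUF) as f1, open(path2, buffering=BUF) as f2:
        for i, (l1, l2) in enumerate(zip(f1, f2)):
            if i % 4 == 0 and l1 != l2:     # every 4th line is a header line
                return i // 4
    return None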

Related

Writing a big file in Python faster in a memory-efficient way

I am trying to create a big file with the same text, but my system hangs after executing the script for some time.
the_text = "This is the text I want to copy 100's of time"
count = 0
while True:
    the_text += the_text
    count += 1
    if count > int(1e10):
        break
NOTE: Above is an oversimplified version of my code. I want to create a file containing the same text many times and the size of the file is around 27GB.
I know it's because RAM is being overloaded. That's why I want to know how I can do this in a fast and effective way in Python.
Don't accumulate the string in memory; instead, write it directly to the file:
the_text = "This is the text I want to copy 100's of time"
with open("largefile.txt", "wt") as output_file:
    for n in range(10000000):
        output_file.write(the_text)
This took ~14s on my laptop using SSD to create a file of ~440MiB.
The above code writes one string at a time - I'm sure it could be sped up by batching the lines together, but there doesn't seem much point in speculating on that without any info about what your application can do.
Ultimately this will be limited by the disk speed; if your disk can manage 50MiB/s sustained writes then writing 450MiB will take about 9s - this sounds like what my laptop is doing with the line-by-line writes.
If I write 100 copies at once with write(the_text*100), looping 1/100th as many times, i.e. range(100000), this takes ~6s - a speedup of 2.5x, writing at ~70MiB/s.
If I write 1000 strings at once using range(10000) this takes ~4s - my laptop is starting to top out at ~100MiB/s.
I get ~125MiB/s with write(the_text*100000).
Increasing further to write(the_text*1000000) slows things down, presumably Python memory handling for the string starts to take appreciable time.
Doing text I/O will be slowing things down a bit - I know that with Python I can do about 300MiB/s combined read+write of binary files.
SUMMARY: for a 27GiB file, my laptop running Python 3.9.5 on Windows 10 maxes out at about 125MiB/s or 8s/GiB, so would take ~202s to create the file, when writing strings in chunks of about 4.5MiB (45 chars*100,000). YMMV
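(For reference, a sketch of the batched variant measured above; the chunk size and output file name are illustrative:)
the_text = "This is the text I want to copy 100's of time"
chunk = the_text * 100_000                 # roughly 4.5 MiB per write call
with open("largefile.txt", "wt") as output_file:
    for _ in range(100):                   # 100 chunks = the same 10,000,000 copies
        output_file.write(chunk)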

Python or Bash - Iterate all words in a text file over itself

I have a text file that contains thousands of words, e.g:
laban
labrador
labradors
lacey
lachesis
lacy
ladoga
ladonna
lafayette
lafitte
lagos
lagrange
lagrangian
lahore
laius
lajos
lakeisha
lakewood
I want to iterate every word over itself so I get:
labanlaban
labanlabrador
labanlabradors
labanlacey
labanlachesis
etc...
In bash I can do the following, but it is extremely slow:
#!/bin/bash
( cat words.txt | while read word1; do
    cat words.txt | while read word2; do
        echo "$word1$word2" >> doublewords.txt
    done; done )
Is there a faster and more efficient way to do this?
Also, how would I iterate two different text files in this manner?
If you can fit the list into memory:
import itertools

with open(words_filename, 'r') as words_file:
    words = [word.strip() for word in words_file]

for words in itertools.product(words, repeat=2):
    print(''.join(words))
(You can also do a double-for loop, but I was feeling itertools tonight.)
I suspect the win here is that we can avoid re-reading the file over and over again; the inner loop in your bash example will cat the file once for each iteration of the outer loop. Also, I think Python just tends to execute faster than bash, IIRC.
You could certainly pull this trick with bash (read the file into an array, write a double-for loop), it's just more painful.
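(For the two-file case also asked about, the same itertools idea applies; a small sketch with assumed file names:)
import itertools

with open('words1.txt') as f1, open('words2.txt') as f2:
    words1 = [w.strip() for w in f1]
    words2 = [w.strip() for w in f2]

for a, b in itertools.product(words1, words2):
    print(a + b)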
It looks like sed is pretty efficient to append a text to each line.
I propose:
#!/bin/bash
for word in $(< words.txt)
do
    sed "s/$/$word/" words.txt;
done > doublewords.txt
(Don't confuse $, which means end of line for sed, with $word, which is a bash variable.)
For a 2000-line file, this runs in about 20 s on my computer, compared to ~2 min for your solution.
Remark: it also looks like you are slightly better off redirecting the standard output of the whole program instead of forcing writes at each loop.
(Warning, this is a bit off topic and personal opinion!)
If you are really going for speed, you should consider using a compiled language such as C++. For example:
vector<string> words;
ifstream infile("words.dat");
for(string line ; std::getline(infile,line) ; )
    words.push_back(line);
infile.close();

ofstream outfile("doublewords.dat");
for(auto word1 : words)
    for(auto word2 : words)
        outfile << word1 << word2 << "\n";
outfile.close();
You need to understand that both bash and python are bad at double for loops: that's why you use tricks (@Thanatos) or predefined commands (sed). Recently, I came across a double for loop problem (given a set of 10000 points in 3D, compute all the distances between pairs) and I successfully solved it using C++ instead of Python or Matlab.
If you have GHC available, Cartesian products are a cinch!
Q1: One file
-- words.hs
import Control.Applicative

main = interact f
  where f = unlines . g . words
        g x = map (++) x <*> x
This splits the file into a list of words, and then appends each word to each other word with the applicative <*>.
Compile with GHC,
ghc words.hs
and then run with IO redirection:
./words <words.txt >out
Q2: Two files
-- words2.hs
import Control.Applicative
import Control.Monad
import System.Environment

main = do
    ws <- mapM ((liftM words) . readFile) =<< getArgs
    putStrLn $ unlines $ g ws
  where g (x:y:_) = map (++) x <*> y
Compile as before and run with the two files as arguments:
./words2 words1.txt words2.txt > out
Bleh, compiling?
Want the convenience of a shell script and the performance of a compiled executable? Why not do both?
Simply wrap the Haskell program you want in a wrapper script which compiles it in /var/tmp, and then replaces itself with the resulting executable:
#!/bin/bash
# wrapper.sh
cd /var/tmp
cat > c.hs <<CODE
# replace this comment with haskell code
CODE
ghc c.hs >/dev/null
cd - >/dev/null
exec /var/tmp/c "$@"
This handles arguments and IO redirection as though the wrapper didn't exist.
Results
Timing against some of the other answers with two 2000 word files:
$ time ./words2 words1.txt words2.txt >out
3.75s user 0.20s system 98% cpu 4.026 total
$ time ./wrapper.sh words1.txt words2.txt > words2
4.12s user 0.26s system 97% cpu 4.485 total
$ time ./thanatos.py > out
4.93s user 0.11s system 98% cpu 5.124 total
$ time ./styko.sh
7.91s user 0.96s system 74% cpu 11.883 total
$ time ./user3552978.sh
57.16s user 29.17s system 93% cpu 1:31.97 total
You can do this in a Pythonic way by creating a tempfile and writing data to it while reading the existing file, then removing the original file and moving the new file to the original path.
import sys
from os import remove
from shutil import move
from tempfile import mkstemp

def data_redundent(source_file_path):
    fh, target_file_path = mkstemp()
    with open(target_file_path, 'w') as target_file:
        with open(source_file_path, 'r') as source_file:
            for line in source_file:
                target_file.write(line.replace('\n', '')+line)
    remove(source_file_path)
    move(target_file_path, source_file_path)

data_redundent('test_data.txt')
I'm not sure how efficient this is, but a very simple way, using the Unix tool specifically designed for this sort of thing, would be
paste -d'\0' <file> <file>
The -d option specifies the delimiter to be used between the concatenated parts, and \0 indicates a NULL character (i.e. no delimiter at all).

Embedding binary data in a script efficiently

I have seen some installation files (huge ones, install.sh for Matlab or Mathematica, for example) for Unix-like systems; they must have embedded quite a lot of binary data, such as icons, sound, graphics, etc., into the script. I am wondering how that can be done, since this could potentially be useful in simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string: x = b'\x23\xa3\xef' ..., terribly inefficient, takes half a MB for a 100KB wav file.
base64, better than option 1, but it enlarges the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?
You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say your data consist of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2
>>> import base64
>>> base64.b64encode(bz2.compress(b'\x00' * 100 + b'\x01' * 200)).decode('ascii')
'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2
import base64

data = 'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'wb') as fdesc:
    fdesc.write(bz2.decompress(base64.b64decode(data)))
Here's a quick and dirty way. Create the following script called myInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> myInstaller
chmod +x myInstaller
When you run the script, it will copy the binary portion to a new file specified in the path of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc) after the dd command. Just adjust the number in "skip" to reflect the total length of the script before the binary data starts.

Python os.system call slow ~1/2 second

I'm writing a script to find all duplicate files in two different file trees. The script works fine except it's too slow to be practical on large numbers of files (>1000). Profiling my script with cProfile revealed that a single line in my code is responsible for almost all of the execution time.
The line is a call to os.system():
cmpout = os.system("cmp -s -n 10MiB %s %s" % (callA, callB));
This call is inside a for loop that gets called about N times if I have N identical files. The average execution time is 0.53 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
563 301.540 0.536 301.540 0.536 {built-in method system}
This of course quickly adds up for more than a thousand files. I have tried speeding it up by replacing it with call from the subprocess module:
cmpout = call("cmp -s -n 10MiB %s %s" % (callA, callB), shell=True);
But this has a near identical execution time. I have also tried reducing the byte limit on the cmp command itself, but this only saves a very small amount of time.
Is there anyway I can speed this up?
Full function I'm using:
def dirintersect(dirA, dirB):
    intersectionAB = []
    filesA = listfiles(dirA);
    filesB = listfiles(dirB);
    for (pathB, filenameB) in filesB:
        for (pathA, filenameA) in filesA:
            if filenameA == filenameB:
                callA = shlex.quote(os.path.join(pathA, filenameA));
                callB = shlex.quote(os.path.join(pathB, filenameB));
                cmpout = os.system("cmp -s -n 10MiB %s %s" % (callA, callB));
                #cmpout = call("cmp -s -n 10MiB %s %s" % (callA, callB), shell=True);
                if cmpout is 0:
                    intersectionAB.append((filenameB, pathB, pathA))
    return intersectionAB
Update: Thanks for all the feedback! I will try to address most of your comments and give some more information.
@larsmans: You're absolutely right that my nested for loop scales with n²; I had already figured out myself that I could do the same by using a dictionary or set and doing set operations. But even the overhead of this 'bad' algorithm is insignificant next to the time it takes to run os.system. The actual if clause triggers approximately once for each filename (that is, I expect there to be only one duplicate for each filename). So os.system only gets run N times and not N² times, but even for this linear time it isn't fast enough.
@larsmans and @Alex Reynolds: The reason I didn't choose a hashing solution like you suggest is that in the use case I envision I compare a smaller directory tree with a larger one, and hashing all of the files in the larger tree would take a very long time (it could be all the files in an entire partition), while I would only need to do the actual comparison on a small fraction of the files.
@abarnert: The reason I use shell=True in the call command is simply that I started with os.system and then read that it was better to use subprocess.call, and this was the way to convert between the two. If there is a better way to run the cmp command, I'd like to know. The reason I quote the arguments is that I had issues with spaces in filenames when I just passed the os.path.join result into the command.
Thanks for your suggestion, I will change it to if cmpout == 0.
@Gabe: I don't know how to time a bash command, but I believe it runs much faster than half a second when I just run the command.
I said the byte limit didn't matter much because when I changed it to only 10 KiB the total execution time of my test run changed to 290 seconds instead of around 300 seconds. The reason I kept the limit is to prevent it from comparing really large files (such as 1 GiB video files).
Update 2:
I have followed @abarnert's suggestion and changed the call to:
cmpout = call(["cmp", '-s', '-n', '10MiB', callA, callB])
The execution time for my test scenario has now dropped from 300 seconds to 270 seconds. Not sufficient yet, but it's a start.
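(One further option, sketched here as an untested alternative rather than anything from the thread, is to skip the external cmp process entirely and use the standard filecmp module; note it has no equivalent of cmp's -n size cap:)
import filecmp

def same_file(path_a, path_b):
    # shallow=False forces a chunked byte-by-byte comparison instead of only
    # comparing os.stat() signatures; no external process is spawned per pair.
    return filecmp.cmp(path_a, path_b, shallow=False)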
You're using the wrong algorithm to do this. Comparing all pairs of files takes Θ(n²) time for n files, while you can get the intersection of two directories in linear time by hashing the files:
from hashlib import sha512
import os
import os.path

def hash_file(fname):
    with open(fname, 'rb') as f:
        return sha512(f.read()).hexdigest()

def listdir(d):
    return [os.path.join(d, fname) for fname in os.listdir(d)]

def dirintersect(d1, d2):
    files1 = {hash_file(fname): fname for fname in listdir(d1)}
    return [(files1[hash_file(fname)], fname) for fname in listdir(d2)
            if hash_file(fname) in files1]
This function loops over the first directory, storing filenames indexed by their SHA-512 hash, then filters the files in the second directory by the presence of files with the same hash in the index built from the first directory. A few obvious optimizations are left as an exercise for the reader :)
The function assumes the directories contain only regular files or symlinks to those, and it reads the files into memory in one go (but that's not too hard to fix).
(SHA-512 doesn't actually guarantee equality of files, so a full comparison can be installed as a backup measure, though you'll be hard-pressed to find two files with the same SHA-512.)
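(Since the answer notes it reads each file into memory in one go, a chunked variant of hash_file is easy; the block size here is arbitrary:)
from hashlib import sha512

def hash_file(fname, blocksize=1 << 20):
    """Hash a file in 1 MiB blocks so large files need not fit in memory."""
    h = sha512()
    with open(fname, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()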

Moving to an arbitrary position in a file in Python

Let's say that I routinely have to work with files with an unknown, but large, number of lines. Each line contains a set of integers (space, comma, semicolon, or some non-numeric character is the delimiter) in the closed interval [0, R], where R can be arbitrarily large. The number of integers on each line can be variable. Often times I get the same number of integers on each line, but occasionally I have lines with unequal sets of numbers.
Suppose I want to go to Nth line in the file and retrieve the Kth number on that line (and assume that the inputs N and K are valid --- that is, I am not worried about bad inputs). How do I go about doing this efficiently in Python 3.1.2 for Windows?
I do not want to traverse the file line by line.
I tried using mmap, but while poking around here on SO, I learned that that's probably not the best solution on a 32-bit build because of the 4GB limit. And in truth, I couldn't really figure out how to simply move N lines away from my current position. If I can at least just "jump" to the Nth line then I can use .split() and grab the Kth integer that way.
The nuance here is that I don't just need to grab one line from the file. I will need to grab several lines: they are not necessarily all near each other, the order in which I get them matters, and the order is not always based on some deterministic function.
Any ideas? I hope this is enough information.
Thanks!
Python's seek goes to a byte offset in a file, not to a line offset, simply because that's the way modern operating systems and their filesystems work -- the OS/FS just don't record or remember "line offsets" in any way whatsoever, and there's no way for Python (or any other language) to just magically guess them. Any operation purporting to "go to a line" will inevitably need to "walk through the file" (under the covers) to make the association between line numbers and byte offsets.
If you're OK with that and just want it hidden from your sight, then the solution is the standard library module linecache -- but performance won't be any better than that of code you could write yourself.
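(For what it's worth, linecache usage is a one-liner; N and K here are example 1-based line and number indices, and the split pattern simply treats any non-digit run as a delimiter:)
import linecache
import re

N, K = 10567311, 3                                   # example indices
line = linecache.getline('bigfile.txt', N)           # 1-based line number; '' if out of range
numbers = [tok for tok in re.split(r'[^0-9]+', line) if tok]
print(numbers[K - 1])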
If you need to read from the same large file multiple times, a large optimization would be to run once on that large file a script that builds and saves to disk the line number - to - byte offset correspondence (technically an "index" auxiliary file); then, all your successive runs (until the large file changes) could very speedily use the index file to navigate with very high performance through the large file. Is this your use case...?
Edit: since apparently this may apply -- here's the general idea (net of careful testing, error checking, or optimization;-). To make the index, use makeindex.py, as follows:
import array
import sys

BLOCKSIZE = 1024 * 1024

def reader(f):
    blockstart = 0
    while True:
        block = f.read(BLOCKSIZE)
        if not block: break
        inblock = 0
        while True:
            nextnl = block.find(b'\n', inblock)
            if nextnl < 0:
                blockstart += len(block)
                break
            yield nextnl + blockstart
            inblock = nextnl + 1

def doindex(fn):
    with open(fn, 'rb') as f:
        # result format: x[0] is tot # of lines,
        # x[N] is byte offset of END of line N (1+)
        result = array.array('L', [0])
        result.extend(reader(f))
        result[0] = len(result) - 1
        return result

def main():
    for fn in sys.argv[1:]:
        index = doindex(fn)
        with open(fn + '.indx', 'wb') as p:
            print('File', fn, 'has', index[0], 'lines')
            index.tofile(p)

main()
and then to use it, for example, the following useindex.py:
import array
import sys

def readline(n, f, findex):
    f.seek(findex[n] + 1)
    bytes = f.read(findex[n+1] - findex[n])
    return bytes.decode('utf8')

def main():
    fn = sys.argv[1]
    with open(fn + '.indx', 'rb') as f:
        findex = array.array('l')
        findex.fromfile(f, 1)
        findex.fromfile(f, findex[0])
        findex[0] = -1
    with open(fn, 'rb') as f:
        for n in sys.argv[2:]:
            print(n, repr(readline(int(n), f, findex)))

main()
Here's an example (on my slow laptop):
$ time py3 makeindex.py kjv10.txt
File kjv10.txt has 100117 lines
real 0m0.235s
user 0m0.184s
sys 0m0.035s
$ time py3 useindex.py kjv10.txt 12345 98765 33448
12345 '\r\n'
98765 '2:6 But this thou hast, that thou hatest the deeds of the\r\n'
33448 'the priest appointed officers over the house of the LORD.\r\n'
real 0m0.049s
user 0m0.028s
sys 0m0.020s
$
The sample file is a plain text file of King James' Bible:
$ wc kjv10.txt
100117 823156 4445260 kjv10.txt
100K lines, 4.4 MB, as you can see; this takes about a quarter second to index and 50 milliseconds to read and print out three random-y lines (no doubt this can be vastly accelerated with more careful optimization and a better machine). The index in memory (and on disk too) takes 4 bytes per line of the textfile being indexed, and performance should scale in a perfectly linear way, so if you had about 100 million lines, 4.4 GB, I would expect about 4-5 minutes to build the index, a minute to extract and print out three arbitrary lines (and the 400 MB of RAM taken for the index should not inconvenience even a small machine -- even my tiny slow laptop has 2GB after all;-).
You can also see that (for speed and convenience) I treat the file as binary (and assume utf8 encoding -- works with any subset like ASCII too of course, eg that KJ text file is ASCII) and don't bother collapsing \r\n into a single character if that's what the file has as line terminator (it's pretty trivial to do that after reading each line if you want).
The problem is that since your lines are not of fixed length, you have to pay attention to line end markers to do your seeking, and that effectively becomes "traversing the file line by line". Thus, any viable approach is still going to be traversing the file, it's merely a matter of what can traverse it fastest.
Another solution, if the file is potentially going to change a lot, is to go all the way to a proper database. The database engine will create and maintain the indexes for you, so you can do very fast searches/queries.
This may be overkill, though.
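(A rough sqlite3 sketch of that idea, with made-up file and table names and whitespace-delimited numbers assumed:)
import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE IF NOT EXISTS lines (n INTEGER PRIMARY KEY, text TEXT)')

# Build the index once (or whenever the flat file changes).
with open('bigfile.txt') as f:
    conn.executemany('INSERT OR REPLACE INTO lines VALUES (?, ?)',
                     ((i, line.rstrip('\n')) for i, line in enumerate(f, 1)))
conn.commit()

# Later: fetch the Kth number on the Nth line without touching the flat file.
n, k = 12345, 3
(text,) = conn.execute('SELECT text FROM lines WHERE n = ?', (n,)).fetchone()
print(text.split()[k - 1])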
