Reading A Big File With Python

Reading A Big File With Python - python

I'm trying to read some files in a directory, which has 10 text files. With time, the number of files increases, and the total size as of now goes around 400MB.
File contents are in the format:
student_name:student_ID:date_of_join:anotherfield1:anotherfield2
In case of a match, I have to print out the whole line. Here's what I've tried.
findvalue = "student_id" #this is users input alphanumeric
directory = "./RecordFolder"
for filename in os.listdir(directory):
with open(os.path.join(directory, filename)) as f:
for line in f:
if findvalue in line:
print line
This works, but it takes a lot of time. How can I reduce the run time?

When textfiles become too slow, you need to start looking at databases. One of the main purposes of databases is to intelligently handle IO from persistent data storage.
Depending on the needs of your application, SQLite may be a good fit. I suspect this is what you want, given that you don't seem to have a gargantuan data set. From there, it's just a matter of making database API calls and allowing SQLite to handle the lookups -- it does so much better than you!
If (for some strange reason) you really don't want to use a database, then consider further breaking up your data into a tree, if at all possible. For example, you could have a file for each letter of the alphabet in which you put student data. This should cut down on looping time since you're reducing the number of students per file. This is a quick hack, but I think you'll lose less hair if you go with a database.

IO is notoriously slow compared to computation, and given that you are dealing with large files it's probably best deal with the files line by line. I don't see an obvious easy way to speed this up in Python.
Depending on how frequent your "hits" (i.e., findvalue in line) will be you may decide to write to a file so not to be possibly slowed down by console output, but if there will be relatively few items found, it wouldn't make much of a difference.
I think for Python there's nothing obvious and major you can do. You could always explore other tools (such as grep or databases ...) as alternative approaches.
PS: No need for the else:pass ..

Related

Better to read the whole file, close it, and then loop over it, or loop over while it's open?

I was wondering, which of these is the better and safer way to process a file's contents line by line. The assumption here is that the file's contents are very critical, but the file is not very large, so memory consumption is not an issue.
Is it better to close the file as soon as possible using this:
with open('somefile.txt') as f:
lines = f.readlines()
for line in lines:
do_something(line)
Or to just loop over it in one go:
with open('somefile.txt') as f:
for line in f:
do_something(line)
Which of these practises is generally the better and the more accepted way of doing it?

There is no "better" solution. Simply because these two are far from being equivalent.
The first one loads entire file into memory and then processes the in-memory data. This has a potential advantage of being faster depending on what the processing is. Note that if the file is bigger than the amount of RAM you have then this is not an option at all.
The second one loads only a piece of the file into memory, processes it and then loads another piece and so on. This is generally slower (although it is likely you won't see the difference because often the processing time, especially in Python, dominates the reading time) but drastically reduces memory consumption (assuming that your file has more than 1 line). Also in some cases it may be more difficult to work with. For example say that you are looking for a specific pattern xy\nz in the file. Now with "line by line" loading you have to remember previous line in order to do a correct check. Which is more difficult to implement (but only a bit). So again: it depends on what you are doing.
As you can see there are tradeoffs and what is better depends on your context. I often do this: if file is relatively small (say up to few hundred megabytes) then load it into memory.
Now you've mentioned that the content is "critical". I don't know what that means but for example if you are trying to make updates to the file atomic or reads consistent between processes then this is a very different problem from the one you've posted. And generally hard so I advice using a proper database. SQLite is an easy option (again: depending on your scenario) similar to having a file.

Instant access to line from a large file without loading the file

In one of my recent projects I need to perform this simple task but I'm not sure what is the most efficient way to do so.
I have several large text files (>5GB) and I need to continuously extract random lines from those files. The requirements are: I can't load the files into memory, I need to perform this very efficiently ( >>1000 lines a second), and preferably I need to do as less pre-processing as possible.
The files consists of many short lines ~(20 mil lines). The "raw" files has varying line length, but with a short pre-processing I can make all lines have the same length (though, the perfect solution would not require pre-processing)
I already tried the default python solutions mentioned here but they were too slow (and the linecache solution loads the file into memory, therefore is not usable here)
The next solution I thought about is to create some kind of index. I found this solution but it's very outdated so it needs some work to get working, and even then I'm not sure if the overhead created during the processing of the index file won't slow down the process to time-scale of the solution above.
Another solution is converting the file into a binary file and then getting instant access to lines this way. For this solution I couldn't find any python package that supports binary-text work, and I feel like creating a robust parser this way could take very long time and could create many hard-to-diagnose errors down the line because of small miscalculations/mistakes.
The final solution I thought about is using some kind of database (sqlite in my case) which will require transferring the lines into a database and loading them this way.
Note: I will also load thousands of (random) lines each time, therefore solutions which work better for groups of lines will have an advantage.
Thanks in advance,
Art.

As said in the comments, I believe using hdf5 would we a good option.
This answer shows how to read that kind of file

Downsides of keeping file handles open?

I am parsing some XML and write data to different files depending on the XML element that is currently being processed. Processing an element is really fast, and writing the data is, too. Therefore, files would need to open and close very often. For example, given a huge file:
for _, node in lxml.etree.iterparse(file):
with open(f"{node.tag}.txt", 'a') as fout:
fout.write(node.attrib['someattr']+'\n'])
This would work, but relatively speaking it would take a lot of time opening and closing the files. (Note: this is a toy program. In reality the actual contents that I write to the files as well as the filenames are different. See the last paragraph for data details.)
An alternative could be:
fhs = {}
for _, node in lxml.etree.iterparse(file):
if node.tag not in fhs:
fhs[node.tag] = open(f"{node.tag}.txt", 'w')
fhs[node.tag].write(node.attrib['someattr']+'\n'])
for _, fh in fhs.items(): fh.close()
This will keep the files open until the parsing of XML is completed. There is a bit of lookup overhead, but that should be minimal compared to iteratively opening and closing the file.
My question is, what is the downside of this approach, performance wise? I know that this will make the open files inaccessible by other processes, and that you may run into a limit of open files. However, I am more interested in performance issues. Does keeping all file handles open create some sort of memory issues or processing issues? Perhaps too much file buffering is going on in such scenarios? I am not certain, hence this question.
The input XML files can be up to around 70GB. The number of files generated is limited to around 35, which is far from the limits I read about in the aforementioned post.

The obvious downsides you have already mentioned, is that there will be a lot of memory required to keep all the file handles open, depending of course on how many files. This is a calculation you have to do on your own. And don't forget the write locks.
Otherwise there isn't very much wrong with it per say, but it would be good with some precaution:
fhs = {}
try:
for _, node in lxml.etree.iterparse(file):
if node.tag not in fhs:
fhs[node.tag] = open(f"{node.tag}.txt", 'w')
fhs[node.tag].write(node.attrib['someattr']+'\n'])
finally:
for fh in fhs.values(): fh.close()
Note:
When looping over a dict in python, the items you get are really only the keys. I'd recommend doing for key, item in d.items(): or for item in d.values():

You don't didn't say just how many files the process would end up holding open. If it's not so many that it creates a problem, then this could be a good approach. I doubt you can really know without trying it out with your data and in your execution environment.
In my experience, open() is relatively slow, so avoiding unnecessary calls is definitely worth thinking about-- you also avoid setting up all the associated buffers, populating them, flushing them every time you close the file, and garbage-collecting. Since you ask, file pointers do come with large buffers. On OS X, the default buffer size is 8192 bytes (8KB) and there is additional overhead for the object, as with all Python object. So if you have hundreds or thousands of files and little RAM, it can add up. You can specify less buffering or no buffering at all, but that could defeat any efficiency gained from avoiding repeated opens.
Edit: For just 35 distinct files (or any two-digit number), you have nothing to worry about: The space that 35 output buffers will need (at 8 KB per buffer for the actual buffering) will not even be the biggest part of your memory footprint. So just go ahead and do it they way you proposed. You'll see a dramatic speed improvement over opening and closing the file for each xml node.
PS. The default buffer size is given by io.DEFAULT_BUFFER_SIZE.

As a good rule,try to close a file as soon as possible.
Note that also your operating system has limits - you can open only certain number of files. So you might soon hit this limit and you will start getting "Failed to open file" exceptions.
Memory and file handles leaking are obvious problem ( if you fail to close the files for some reason ).

If you are generating thousands of files the way you might considder writing
them to a directory structure to get them separately stored in different
directories to have easier access afterwards. For example: a/a/aanode.txt , a/c/acnode.txt, etc.
In case the XML contains consecutive nodes you can write while that
condition is True. You only close the moment another node for another file appears.
What you gain from it largely depends on the structure of your XML file.

Examining large log files in python

A little hesitant about posting this - as far as I'm concerned it's genuine question, but I guess I'll understand if it's criticised or closed as being an invite for discussion...
Anyway, I need to use Python to search some quite large web logs for specific events. RegEx would be good but I'm not tied to any particular approach - I just want lines that contain two strings that could appear anywhere in a GET request.
As a typical file is over 400mb and contains around a million lines, performance both in terms of time to complete and load on the server (ubuntu/nginx VM - reasonably well spec'd and rarely overworked) are likely to be issues.
I'm a fairly recent convert to Python (note quite a newbie but still plenty to learn) and I'd like a bit of guidance on the best way to achieve this
Do I open and iterate through?
Grep to a new file and then open?
Some combination of the two?
Something else?

As long as you don't read whole file at once but iterate trough it continuously you should be fine. I think it doesn't really matter whether you read whole file with python or with grep, you still have to load whole file :). And if you take advantage of generators you can do this really programmer friendly:
# Generator; fetch specific rows from log file
def parse_log(filename):
reg = re.prepare( '...')
with open(filename,'r') as f:
for row in f:
match = reg.match(row)
if match:
yield match.group(1)
for i in parse_log('web.log'):
pass # Do whatever you need with matched row

What is the best way to loop through and process a large (10GB+) text file?

I need to loop through a very large text file, several gigabytes in size (a zone file to be exact). I need to run a few queries for each entry in the zone file, and then store the results in a searchable database.
My weapons of choice at the moment, mainly because I know them, are Python and MySQL. I'm not sure how well either will deal with files of this size, however.
Does anyone with experience in this area have any suggestions on the best way to open and loop through the file without overloading my system? How about the most efficient way to process the file once I can open it (threading?) and store the processed data?

You shouldn't have any real trouble storing that amount of data in MySQL, although you will probably not be able to store the entire database in memory, so expect some IO performance issues. As always, make sure you have the appropriate indices before running your queries.
The most important thing is to not try to load the entire file into memory. Loop through the file, don't try to use a method like readlines which will load the whole file at once.
Make sure to batch the requests. Load up a few thousand lines at a time and send them all in one big SQL request.
This approach should work:
def push_batch(batch):
# Send a big INSERT request to MySQL
current_batch = []
with open('filename') as f:
for line in f:
batch.append(line)
if len(current_batch) > 1000:
push_batch(current_batch)
current_batch = []
push_batch(current_batch)
Zone files are pretty normally formatted, consider if you can get away with just using LOAD DATA INFILE. You might also consider creating a named pipe, pushing partially formatted data in to it from python, and using LOAD DATA INFILE to read it in with MySQL.
MySQL has some great tips on optimizing inserts, some highlights:
Use multiple value lists in each insert statement.
Use INSERT DELAYED, particularly if you are pushing from multiple clients at once (e.g. using threading).
Lock your tables before inserting.
Tweak the key_buffer_size and bulk_insert_buffer_size.
The fastest processing will be done in MySQL, so consider if you can get away with doing the queries you need after the data is in the db, not before. If you do need to do operations in Python, threading is not going to help you. Only one thread of Python code can execute at a time (GIL), so unless you're doing something which spends a considerable amount of time in C, or interfaces with external resources, you're only going to ever be running in one thread anyway.
The most important optimization question is what is bounding the speed, there's no point spinning up a bunch of threads to read the file, if the database is the bounding factor. The only way to really know is to try it and make tweaks until it is fast enough for your purpose.

#Zack Bloom's answer is excellent and I upvoted it. Just a couple of thoughts:
As he showed, just using with open(filename) as f: / for line in f is all you need to do. open() returns an iterator that gives you one line at a time from the file.
If you want to slurp every line into your database, do it in the loop. If you only want certain lines that match a certain regular expression, that's easy.
import re
pat = re.compile(some_pattern)
with open(filename) as f:
for line in f:
if not pat.search(line):
continue
# do the work to insert the line here
With a file that is multiple gigabytes, you are likely to be I/O bound. So there is likely no reason to worry about multithreading or whatever. Even running several regular expressions is likely to crunch through the data faster than the file can be read or the database updated.
Personally, I'm not much of a database guy and I like using an ORM. The last project I did database work on, I was using Autumn with SQLite. I found that the default for the ORM was to do one commit per insert, and it took forever to insert a bunch of records, so I extended Autumn to let you explicitly bracket a bunch of inserts with a single commit; it was much faster that way. (Hmm, I should extend Autumn to work with a Python with statement, so that you could wrap a bunch of inserts into a with block and Autumn would automatically commit.)
http://autumn-orm.org/
Anyway, my point was just that with database stuff, doing things the wrong way can be very slow. If you are finding that the database inserting is your bottleneck, there might be something you can do to fix it, and Zack Bloom's answer contains several ideas to start you out.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.