I have a long list of strings to look for in a very large file. I know I can achieve this with two for loops:
import sys

dns = sys.argv[2]
file = open(dns)
search_words = var  # [list of 100+ strings]
for line in file:
    for word in search_words:
        if word in line:
            print(line)
However, I'm looking for a more efficient way to do this so that I don't have to wait half an hour for it to run. Can anyone help?
The problem here is that you read the file line by line instead of loading the entire text file into RAM at once, which would save you a lot of time in this case. That is what takes most of the time, but the text search itself can also be improved in ways that aren't as straightforward.
That said, there are multiple packages genuinely designed for efficient text search in Python. I suggest you have a look at ahocorapy, which is based on the Aho-Corasick algorithm, the same algorithm behind grep's fixed-string (fgrep) multi-pattern matching. The package's GitHub page explains how to achieve your task efficiently, so I won't go into further detail here.
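If you'd rather stay within the standard library, combining all the search words into one compiled regular expression already removes the inner Python loop, since each line is then scanned once rather than once per word. A sketch with placeholder words and an in-memory stand-in for the file:

```python
import io
import re

search_words = ["example.com", "test.org", "demo.net"]  # placeholder list
# Escape each word and OR them into a single compiled pattern.
pattern = re.compile("|".join(re.escape(w) for w in search_words))

# Stand-in for the real file; in practice: open(sys.argv[2])
log = io.StringIO("a test.org lookup\nnothing here\nquery example.com now\n")

matches = [line for line in log if pattern.search(line)]
print(matches)
```

For a hundred-odd fixed strings this is usually a large speedup over the nested loops, though a true Aho-Corasick implementation scales better still as the word list grows.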
Related
I'm trying to use a word list I found for a somewhat ambitious hangman game for a Discord bot, and it recommended using the .json version of the file as a dictionary if I were using Python, which I am. The only problem is that it takes forever to go through (interpret?), presumably because it has 370,102 lines, and considering this is going to be run on a Raspberry Pi, it probably isn't going to work out very well.
What would be the best way to go about this? I'm new to Python and programming in general, so I'm not quite sure how. Would it be faster if I did it in C? Maybe I could use an array somehow?
It doesn't have to be in a dictionary, it's just that the file was provided like that.
For something as basic as dictionary lookup, put the words into a dbm file, then read from that. It effectively works as an on-disk dictionary, letting you look up keys and their values quickly without loading the whole thing into memory.
import dbm

with dbm.open('cache', 'r') as db:
    ...  # use db as a dictionary
For anything more complicated use SQLite, a stand-alone SQL database.
However, as RandomDavis points out, you don't need to load the whole word list. You just need to pick one word at random per game. This can be done in a single read of the file.
The file can be compressed and still read line by line.
See if it's fast enough for your purposes. If it isn't, perhaps you could run a thread that loads the next word in the background while the player is working on the current one.
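Picking that one random word in a single pass can be done with reservoir sampling; a sketch, assuming the list has one word per line (the filename is whatever your list is called):

```python
import random

def random_line(path):
    """Return one uniformly random line from the file in a single pass."""
    choice = None
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            # Replace the current choice with probability 1/n; after the
            # full pass, every line has had an equal chance of surviving.
            if random.randrange(n) == 0:
                choice = line.rstrip("\n")
    return choice
```

This keeps only a single line in memory at any time, at the cost of reading the whole file once per game, which is fine for a Raspberry Pi and a 370k-line list.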
It's not necessary to use a dictionary in this case. You're just randomly picking a word from a list. If you use words_alpha.txt from that GitHub repo and import it as a list, it's super fast:
import random

with open('words_alpha.txt') as words_file:
    words = words_file.read().splitlines()

print(random.choice(words))
The above takes less than half a second on my machine, and the random.choice() part is blazing fast; any slowness comes from reading the file, not the choice itself.
In one of my recent projects I need to perform this simple task but I'm not sure what is the most efficient way to do so.
I have several large text files (>5 GB each) and I need to continuously extract random lines from them. The requirements are: I can't load the files into memory, I need this to be very fast (well over 1000 lines a second), and I'd prefer to do as little pre-processing as possible.
The files consist of many short lines (~20 million per file). The "raw" files have varying line lengths, but with a short pre-processing step I can make all lines the same length (though the perfect solution would not require pre-processing).
I already tried the default python solutions mentioned here but they were too slow (and the linecache solution loads the file into memory, therefore is not usable here)
The next solution I thought about is to create some kind of index. I found this solution but it's very outdated so it needs some work to get working, and even then I'm not sure if the overhead created during the processing of the index file won't slow down the process to time-scale of the solution above.
Another solution is converting the file into a binary format and then getting instant access to lines that way. I couldn't find any Python package that supports this kind of binary text access, and I feel like creating a robust parser myself could take a very long time and could introduce many hard-to-diagnose errors down the line from small miscalculations/mistakes.
The final solution I thought about is using some kind of database (sqlite in my case) which will require transferring the lines into a database and loading them this way.
Note: I will also load thousands of (random) lines each time, therefore solutions which work better for groups of lines will have an advantage.
Thanks in advance,
Art.
As said in the comments, I believe using hdf5 would be a good option.
This answer shows how to read that kind of file.
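If hdf5 is more than you need, the index idea from the question can be done with the standard library alone: scan the file once to record each line's starting byte offset, then seek straight to random offsets afterwards. A sketch (paths are placeholders):

```python
import random

def build_index(path):
    """Single pass over the file, recording each line's starting byte offset."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def random_lines(path, offsets, k):
    """Yield k random lines, seeking per line instead of loading the file."""
    with open(path, "rb") as f:
        # Sorting the picked offsets makes the seeks sequential, which is
        # kinder to spinning disks when fetching thousands of lines at once.
        for off in sorted(random.choices(offsets, k=k)):
            f.seek(off)
            yield f.readline().decode()
```

The in-memory index for ~20 million lines is the main cost; storing the offsets in an `array('q', ...)` or a numpy array instead of a list would shrink it considerably.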
I hope the question is not too unspecific: I have a huge database-like list (~900,000 entries) which I want to use for processing text files. (More details below.) Since this list will also be edited and used with other programs, I would prefer to keep it in a separate file and include it in the Python code, either directly or by dumping it to some format that Python can use. I was wondering if you can advise on what would be the quickest and most efficient way. I have looked at several options, but may not have seen what is best:
Include the list as a python dictionary in the form
my_list = { "key": "value" }
directly into my python code.
Dump the list to an sqlite database and use the sqlite3 module.
Have the list as a yml file and use the yaml module.
Any ideas how these approaches would scale if I process a text file and want to do replacements on something like 30,000 lines?
For those interested: this is for linguistic processing, in particular ancient Greek. The list is an exhaustive list of Greek forms and the head words that they are derived from. For every word form in a text file, I want to add the dictionary head word.
Point 1 is much faster than using either YAML or SQL, as @b4hand and @DeepSpace indicated. What you should do, though, is not include the list in the rest of the Python code you are developing, but make a separate .py file containing just that dictionary definition.
That way the list in that file is easier to write from a program (or to extend by a program). And on first import, a .pyc will be created, which speeds up re-reading on subsequent runs of your program. This is actually very similar in performance
to using the pickle module to dump the dictionary to a file and read it back from there, while keeping the dictionary in an easily human-readable and editable form.
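A minimal sketch of that workflow (the entries and filenames here are made-up examples): a small generator script writes the dictionary out as a .py module, and the processing script simply imports it.

```python
# generate_wordlist.py -- rerun whenever the master list changes.
# Sample entries standing in for the real ~900,000-entry Greek list.
forms = {"logou": "logos", "logoi": "logos", "logon": "logos"}

with open("greek_forms.py", "w", encoding="utf-8") as out:
    out.write("my_list = {\n")
    for form, headword in sorted(forms.items()):
        out.write(f"    {form!r}: {headword!r},\n")
    out.write("}\n")

# The processing script then just does:
#     from greek_forms import my_list
#     headword = my_list.get(word_form)
```

Because `repr` is used for both keys and values, any quoting or non-ASCII characters in the Greek forms survive the round trip, and the generated file stays editable by hand.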
Less than one million entries is not huge and should fit in memory easily. Thus, your best bet is option 1.
If you are looking for speed, option 1 should be the fastest because the other 2 will need to repeatedly access the HD which will be the bottleneck.
I would use a caching mechanism to hold this data or maybe a data structure storage like redis. Loading all of this in memory might become too expensive.
A little hesitant about posting this: as far as I'm concerned it's a genuine question, but I guess I'll understand if it's criticised or closed as an invitation for discussion...
Anyway, I need to use Python to search some quite large web logs for specific events. RegEx would be good but I'm not tied to any particular approach - I just want lines that contain two strings that could appear anywhere in a GET request.
As a typical file is over 400 MB and contains around a million lines, performance, both in time to complete and in load on the server (an Ubuntu/nginx VM, reasonably well spec'd and rarely overworked), is likely to be an issue.
I'm a fairly recent convert to Python (not quite a newbie, but still with plenty to learn) and I'd like a bit of guidance on the best way to achieve this:
Do I open and iterate through?
Grep to a new file and then open?
Some combination of the two?
Something else?
As long as you don't read the whole file at once but iterate through it continuously, you should be fine. I think it doesn't really matter whether you read the whole file with Python or with grep; you still have to load the whole file :). And if you take advantage of generators you can make this really programmer-friendly:
import re

# Generator; fetch specific rows from the log file
def parse_log(filename):
    reg = re.compile(r'...')  # your pattern here
    with open(filename, 'r') as f:
        for row in f:
            match = reg.match(row)
            if match:
                yield match.group(1)

for i in parse_log('web.log'):
    pass  # Do whatever you need with the matched row
I'm trying to read some files in a directory, which currently has 10 text files. Over time the number of files increases, and the total size as of now is around 400 MB.
File contents are in the format:
student_name:student_ID:date_of_join:anotherfield1:anotherfield2
In case of a match, I have to print out the whole line. Here's what I've tried.
import os

findvalue = "student_id"  # this is the user's input (alphanumeric)
directory = "./RecordFolder"

for filename in os.listdir(directory):
    with open(os.path.join(directory, filename)) as f:
        for line in f:
            if findvalue in line:
                print(line)
This works, but it takes a lot of time. How can I reduce the run time?
When textfiles become too slow, you need to start looking at databases. One of the main purposes of databases is to intelligently handle IO from persistent data storage.
Depending on the needs of your application, SQLite may be a good fit. I suspect this is what you want, given that you don't seem to have a gargantuan data set. From there, it's just a matter of making database API calls and allowing SQLite to handle the lookups -- it does so much better than you!
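A hedged sketch of what that looks like with the stdlib sqlite3 module, using the colon-separated record format from the question (the column names for the two "anotherfield" columns are guesses):

```python
import sqlite3

records = [  # in practice, parsed from the text files in ./RecordFolder
    "alice:A123:2020-01-05:x:y",
    "bob:B456:2021-03-17:x:y",
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent db
conn.execute(
    "CREATE TABLE students (name TEXT, student_id TEXT, date_of_join TEXT,"
    " field1 TEXT, field2 TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    (line.split(":") for line in records),
)
# The index is what turns each lookup from a full scan into a fast search.
conn.execute("CREATE INDEX idx_id ON students (student_id)")

row = conn.execute(
    "SELECT * FROM students WHERE student_id = ?", ("B456",)
).fetchone()
print(row)
```

You pay the insertion cost once, then every lookup is an indexed query instead of re-reading 400 MB of text files.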
If (for some strange reason) you really don't want to use a database, then consider further breaking up your data into a tree, if at all possible. For example, you could have a file for each letter of the alphabet in which you put student data. This should cut down on looping time since you're reducing the number of students per file. This is a quick hack, but I think you'll lose less hair if you go with a database.
IO is notoriously slow compared to computation, and given that you are dealing with large files, it's probably best to deal with them line by line. I don't see an obvious, easy way to speed this up in Python.
Depending on how frequent your "hits" (i.e., findvalue in line) are, you may decide to write to a file so as not to be slowed down by console output; but if relatively few items are found, it won't make much of a difference.
I think for Python there's nothing obvious and major you can do. You could always explore other tools (such as grep or databases ...) as alternative approaches.
PS: No need for the else: pass.