How can I add tolerance to an input search engine - python

I have this code to search text in a big text file:
y = input("Apellido/s:").upper()
for line in fread:
    if y in line:
        print(line)
How can I implement a way for it to search for similar text (a kind of autocorrect) if nothing is found? Not as extensive as Google does it; just match text that differs by maybe one extra letter or an accent in a word. I can't come up with the algorithm myself.
I know that I have to add an if, but I'm asking about the logic.

You can do it using find_near_matches from fuzzysearch:
from fuzzysearch import find_near_matches

y = input("Apellido/s:").upper()
for line in fread:
    if find_near_matches(y, line, max_l_dist=2):
        print(line)
find_near_matches returns a list of matches if any are found; if no match is found it returns an empty list, which evaluates to false.
The max_l_dist option sets the maximum total number of substitutions, insertions and deletions allowed (a.k.a. the Levenshtein distance).
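If you want to see the logic rather than rely on a library, the distance that max_l_dist bounds can be computed with the classic dynamic-programming routine below. This is a minimal sketch of the idea, not fuzzysearch's actual implementation:

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein("GOMEZ", "GOMES"))  # 1
```

With this you could compare the query against each word of a line, e.g. `any(levenshtein(y, w) <= 2 for w in line.split())`; fuzzysearch does the substring version of this far more efficiently.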

Related

Find information in text file given setpoints are known

So I know the setpoints <start point> and <end point> in the text file, and I need to use them to find certain information between them, which will then be used and printed. I currently have .readlines() within a different function, which is used within the new function to find the information.
You can try something like this:
flag = False
info = []  # the desired information will be appended to this list as strings
with open(your_file, 'r') as file:
    for line in file:
        if '<start point>' in line:  # reached the start point
            flag = True
            continue                 # skip the marker line itself
        if '<end point>' in line:    # reached the end point
            flag = False
        if flag:  # this line is between the start point and the end point
            info.append(line)

>>> info
['Number=12', 'Word=Hello']
This seems like a job for regular expressions. If you have not yet encountered regular expressions, they are an extremely powerful tool that can basically be used to search for a specific pattern in a text string.
For example, the regular expression (or regex for short) Number=\d+ would find any line in the text document that has Number= followed by one or more digits. The regex Word=\w+ would match any string starting with Word= followed by one or more word characters (letters, digits or underscores).
In Python you can use regular expressions through the re module. For a great introduction to using regular expressions in Python check out this chapter from the book Automate the Boring Stuff with Python. To test out regular expressions this site is great.
In this particular instance you would do something like:
import re

your_file = "test.txt"
with open(your_file, 'r') as file:
    file_contents = file.read()

number_regex = re.compile(r'Number=\d+')
number_matches = re.findall(number_regex, file_contents)
print(number_matches)

>>> ['Number=12']
This would return a list with all matches to the number regex. You could then do the same thing for the word match.
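For completeness, the word pattern works the same way; here is a sketch with a sample string standing in for the file contents (the real code would read test.txt as above):

```python
import re

# sample contents standing in for file.read()
file_contents = "Number=12\nWord=Hello\n"

word_regex = re.compile(r'Word=\w+')
word_matches = word_regex.findall(file_contents)
print(word_matches)  # ['Word=Hello']
```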

Searching for specific phrase pattern within lines. python

I have made certain rules that I need to search for in a file. These rules are essentially phrases with an unknown number of words within. For example,
mutant...causes(...)GS
Here, this is a phrase which I want to search for in my file. The ... means a few words should be here (i.e. in this gap) and (...) means there may or may not be words in this gap. GS here is a fixed string variable that I know.
Basically I made these rules by going through many such files and they tell me that a particular file does what I am looking for.
The problem is that the gap can have any(small) number of words. There can even be a new line that begins in one of the gaps. Hence, I cannot go for identical string matching.
Some example texts -
!Series_summary "To better understand how the expression of a *mutant gene that causes ALS* can perturb the normal phenotype of astrocytes, and to identify genes that may
Here the GS is ALS (defined) and the starred text should be found as a positive match for the rule mutant...causes(...)GS
!Series_overall_design "The analysis includes 9 samples of genomic DNA from
isolated splenic CD11c+ dendritic cells (>95% pure) per group. The two groups are neonates born to mothers with *induced allergy to ovalbumin*, and normal control neonates. All neonates are genetically and environmentally identical, and allergen-naive."
Here the GS is ovalbumin (defined) and the starred text should be found as a positive match for the rule
induced...to GS
I am a beginner in programming in python, so any help will be great!!
The following should get you started. It reads in your file and displays all possible matching lines using a Python regular expression, which will help you determine that it is matching all of the correct lines:
import re

with open('input.txt', 'r') as f_input:
    data = f_input.read()

print(re.findall(r'(mutant\s.*?\scauses.*?GS)', data, re.S))
To search for just the presence of one match, change findall to search:
import re

with open('input.txt', 'r') as f_input:
    data = f_input.read()

if re.search(r'(mutant\s.*?\scauses.*?GS)', data, re.S):
    print('found')
To carry this out on many such files, you could adapt it as follows:
import re
import glob

for filename in glob.glob('*.*'):
    with open(filename, 'r') as f_input:
        data = f_input.read()
    if re.search(r'mutant\s.*?\scauses.*?GS', data, re.S):
        print("'{}' matches".format(filename))

Multiple patterns search in a line python

I'm trying to find a way to search for multiple patterns in a line. In this case, I have already pre-defined a pattern and compiled it as the variable p, and I want to add this pattern to an array of strings that I search for in each line.
Basically, my routine is to check whether all 3 patterns can be found in the line and, if so, increase a counter. But if I include the pattern p in the search group, it returns the error
"'in <string>' requires string as left operand, not _sre.SRE_Pattern".
Does anyone know how to also add pattern p to the search group in a neat way?
Thanks,
p = re.compile("|".join(pattern))
for line in myfile:
    if all(x in line for x in ['DOG', 'CAT', p]):
        count += 1
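The error arises because `x in line` requires a string on the left, while p is a compiled pattern object, which must be applied with its own search method. One way around it (a sketch of my own, with made-up sample data, not from the original thread) is to test the plain substrings with `in` and the compiled pattern with `p.search`:

```python
import re

# hypothetical pre-compiled pattern and input lines, for illustration only
pattern = ["FISH", "BIRD"]
p = re.compile("|".join(pattern))

count = 0
myfile = ["DOG CAT FISH here", "DOG only"]
for line in myfile:
    # plain substrings use `in`; the compiled pattern uses .search()
    if all(x in line for x in ['DOG', 'CAT']) and p.search(line):
        count += 1
print(count)  # 1
```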

Python: Regex a dictionary using user input wildcards

I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()

filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so I have spaces and it is not a list. I can match things such as a single letter: if I search for just 't' then I get a bunch of t's. So this tells me that I am matching single letters; however, I want to match whole words, and moreover to preserve the wildcard structure.
What I would like to happen is that a user enters text (including what will be a wildcard) that I can substitute into the place where 'th*' is, with the wildcard still behaving as it should. That leads to the question: can I just stick a variable holding the search text in place of 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*', and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
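To answer the user-input part directly: fnmatch.translate turns a user-supplied wildcard into a regex that you can then apply per word. A minimal sketch (the sample text is made up; the exact translated regex varies slightly between Python versions):

```python
import fnmatch
import re

text = "this is just a test thing"
user_pattern = "th*"  # whatever the user typed

# translate the wildcard into a regex once, then match it against whole words
regex = re.compile(fnmatch.translate(user_pattern))
matches = [w for w in text.split() if regex.match(w)]
print(matches)  # ['this', 'thing']
```

fnmatch.translate anchors the pattern at the end (\Z) and regex.match anchors it at the start, so each word must match the whole wildcard, preserving the wildcard structure.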
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas", and perhaps an idea of what I have wrong in the code, as it does not operate on the incoming text in the second snippet the way it does (correctly) in the first?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split()  # split on whitespace to make a list of words

filtered = fnmatch.filter(file_contents, 'th*')

How to search for string in Python by removing line breaks but return the exact line where the string was found?

I have a bunch of PDF files that I have to search against a set of keywords. I have to extract the exact line where the keyword was found. I first used xpdf's pdftotext to convert the PDFs to text. (I tried Solr but had a tough time tailoring the output/schema to suit my requirement.)
import sys

file_name = sys.argv[1]
searched_string = sys.argv[2]

result = [(line_number + 1, line)
          for line_number, line in enumerate(open(file_name))
          if searched_string.lower() in line.lower()]

for each in result:
    print(each[0], each[1])
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is the case where the search string is broken across the end of a line:
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this occurrence with the code presented above. What is the most efficient way of achieving this without making two copies of the text file (one for searching for the keyword to extract the line number, and another with line breaks removed to catch the case where the keyword spans two lines)?
Much appreciated, guys!
Note: Some considerations without any code, but I think they belong to an answer rather than to a comment.
My idea would be to search only for the first keyword; if a match is found, search for the second. If the first match falls at the end of a line, you then take the next line into consideration, doing line concatenation only when a match was found in the first place*.
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
import re

def iterwords(fh):
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word
It iterates over the file handler and produces a (line_number, word) tuple for each word in the file.
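Here is a sketch of how a matcher can consume that stream (my own naive illustration, not the linked gist): walk the words, count how many consecutive query words have matched, and remember the line where the run started.

```python
import io
import re

def iterwords(fh):
    # (line_number, word) for every word in the file handle
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word

def find_phrase(fh, phrase):
    words = phrase.lower().split()
    i = 0          # how many query words have matched so far
    start = None   # line number where the current run started
    for number, word in iterwords(fh):
        if word.lower().strip('.,') == words[i]:
            if i == 0:
                start = number
            i += 1
            if i == len(words):
                return start  # 0-based line of the first matched word
        else:
            i = 0
    return None

text = "remember to change the\nsize limits. String\nExtraction is a common problem\n"
print(find_phrase(io.StringIO(text), "String Extraction"))  # 1
```

Note that this naive version inherits the concern mentioned below about a repeated first word; it is only meant to show the word-stream idea.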
The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main concern with the linked code that I didn't work around, for both performance and complexity reasons. Can you figure it out? (Spoiler: try searching for a sentence whose first word appears twice in a row in the file.)
* I didn't perform any testing on my own, but this article and the Python wiki suggest that string concatenation is not that efficient in Python (I don't know how current that information is).
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then search that resultant line.
Then you'd assign line2 to line1, get a new line2, and repeat the process.
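A minimal sketch of that sliding two-line window (the names and sample lines are mine, not the answerer's):

```python
def search_pairs(lines, needle):
    # Concatenate each line with the next so a phrase broken across
    # a line break is still found; report 1-based line numbers.
    hits = []
    needle = needle.lower()
    for i in range(len(lines)):
        line1 = lines[i].rstrip("\n")
        line2 = lines[i + 1].rstrip("\n") if i + 1 < len(lines) else ""
        line3 = line1 + " " + line2  # the concatenated window
        # skip windows where the match lies entirely in line2,
        # so it is only reported once (at its own window)
        if needle in line3.lower() and needle not in line2.lower():
            hits.append(i + 1)
    return hits

lines = ["size limits. String", "Extraction is a common problem"]
print(search_pairs(lines, "String Extraction"))  # [1]
```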
Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE
Then use \s to represent all white space (including new lines).
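Replacing the spaces in the search phrase with \s+ makes the pattern span line breaks, since \s matches the newline too. A small sketch (sample text is made up):

```python
import re

text = "remember to change the\nsize limits. String\nExtraction is a common problem\n"

phrase = "String Extraction"
# \s+ matches any run of whitespace, including newlines
pattern = re.compile(r"\s+".join(map(re.escape, phrase.split())), re.IGNORECASE)

match = pattern.search(text)
print(match.group())  # matches "String\nExtraction" across the line break
```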
