Multiple patterns search in a line - python

I'm trying to find a way to search for multiple patterns in a line. In this case, I have already pre-defined a pattern, compiled it into the variable "p", and want to add this pattern to an array of strings that I search for in each line.
Basically, my routine checks whether all 3 patterns can be found in the line and, if so, increments a counter. But if I include pattern "p" in the search group, it raises the error
"'in <string>' requires string as left operand, not _sre.SRE_Pattern".
Does anyone know a neat way to also include pattern "p" in the search group?
Thanks,
p = re.compile("|".join(pattern))
for line in myfile:
    if all(x in line for x in ['DOG', 'CAT', p]):
        count += 1
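One way out of the TypeError is to keep `in` for the plain substrings and give the compiled pattern its own .search() call. A minimal sketch (the sample lines and the 'FOX'/'BIRD' alternatives are made up for illustration):

```python
import re

# Plain substrings are tested with `in`; the compiled pattern p is
# tested with p.search(), so it never appears as a left operand of `in`.
pattern = ['FOX', 'BIRD']              # hypothetical alternatives for p
p = re.compile('|'.join(pattern))

count = 0
for line in ['the DOG chased the CAT and a FOX', 'only a DOG here']:
    if all(x in line for x in ['DOG', 'CAT']) and p.search(line):
        count += 1
```

Only the first sample line contains DOG, CAT, and a match for p, so the counter ends at 1.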

Related

How can I add tolerance to an input search engine

I have this code to search text in a big text file:
y = input("Apellido/s:").upper()
for line in fread:
    if y in line:
        print(line)
How can I implement a way for it to search for similar text / autocorrect if nothing is found? Not as extensive as Google does it, but just match text that differs by maybe one extra letter or an accent in a word. I can't come up with the algorithm by myself.
I know that I have to add an if, but I'm asking about the logic.
You can do it using find_near_matches from fuzzysearch:
from fuzzysearch import find_near_matches

y = input("Apellido/s:").upper()
for line in fread:
    if find_near_matches(y, line, max_l_dist=2):
        print(line)
find_near_matches returns a list of matches if any are found; if no match is found it returns an empty list, which evaluates to false.
The max_l_dist option caps the total number of substitutions, insertions and deletions (i.e. the Levenshtein distance).
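If you would rather stay in the standard library, a rough sliding-window check with difflib gives a similar tolerance. This is only a sketch of the idea, not the fuzzysearch API; the function name and cutoff are mine:

```python
import difflib

def fuzzy_contains(needle, line, cutoff=0.8):
    """Return True if any needle-sized window of `line` is similar enough."""
    n = len(needle)
    if n == 0:
        return True
    for i in range(max(len(line) - n + 1, 1)):
        window = line[i:i + n]
        # ratio() is 1.0 for identical strings, lower as they diverge.
        if difflib.SequenceMatcher(None, needle, window).ratio() >= cutoff:
            return True
    return False
```

With cutoff=0.8 a single accented letter still passes, e.g. searching "GARCIA" in a line containing "GARCÍA", while unrelated text is rejected.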

How to split a string on multiple patterns in a pythonic way (one-liner)?

I am trying to extract the file name, without the extension, from a file pointer. My file names are as follows:
this site:time.list, this.list, this site:time_sec.list, that site:time_sec.list, and so on. Here the required file name always precedes either a whitespace or a dot.
Currently I am doing this to get the name preceding a whitespace or a dot:
search_term = os.path.basename(f.name).split(" ")[0]
and
search_term = os.path.basename(f.name).split(".")[0]
Expected output: this, this, this, that.
How can I combine the two above into a one-liner, in a pythonic way?
Thanks in advance.
Use a regex, as below; [ .] will split on either a space or a dot character:
re.split('[ .]', os.path.basename(f.name))[0]
If you split on one separator and splitting the result on the other returns something smaller, that is the piece you want; if not, you simply get what the first split gave you. You don't need a regex for this:
search_term = os.path.basename(f.name).split(" ")[0].split(".")[0]
Use regex to get the first word at the beginning of the string:
import re
re.match(r"\w+", "this site:time_sec.list").group()
# 'this'
re.match(r"\w+", "this site:time.list").group()
# 'this'
re.match(r"\w+", "that site:time_sec.list").group()
# 'that'
re.match(r"\w+", "this.list").group()
# 'this'
try this:
pattern = re.compile(r"\w+")
pattern.match(os.path.basename(f.name)).group()
Make sure your filenames don't contain whitespace when you rely on whitespace to separate what you want to extract from the rest. If you rely on implicit rules like that, instead of actually looking at the strings you want to extract and tailoring explicit expressions to the content, you are much more likely to get unexpected results you didn't anticipate.

Can anyone see why my python regex search is only outputting "0"s?

I'm working on a python program to extract all the <coordinates> tags within a kml file.
import re

KML = open('NYC_Tri-State_Area.kml', 'r')
NYC_Coords = open('NYC_Coords.txt', 'w')

coords = re.findall(r'<coordinates>+(.)+<\/coordinates>', KML.read())

for coord in coords:
    NYC_Coords.write(str(coord) + "\n")

KML.close()
NYC_Coords.close()
I tested the regex on the file within RegExr and it worked properly.
Here is a small sample of the kml file I'm reading: http://puu.sh/bhayn/2e233a1033.png
The output file contains lines with a single 0 on every line except the last one which is empty.
It seems you have placed the + operators outside of your groups.
With <coordinates>+, the > is matched literally "one or more" times, and when the dot is used in a repeated capturing group, (.)+, only the last repetition is captured, which here is a 0 for each match result.
Remove the first + operator and move the one outside the group to the inside:
coords = re.findall(r'<coordinates>(.+?)</coordinates>', KML.read())
Note: use +? to prevent greediness; you also probably want to use the s (dotall) modifier here.
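The corrected call, with the dotall modifier mentioned above, can be checked on a tiny inline sample (the coordinate values are made up):

```python
import re

# re.DOTALL makes '.' also match newlines, so multi-line
# <coordinates> blocks are captured in full; the non-greedy +?
# keeps adjacent blocks from being merged into one match.
sample = ("<coordinates>-74.0,40.7,0\n-73.9,40.8,0</coordinates>"
          "<coordinates>-73.5,41.0,0</coordinates>")
coords = re.findall(r'<coordinates>(.+?)</coordinates>', sample, re.DOTALL)
```

Each element of coords is the full text between one pair of tags, newlines included.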

Pyparsing finds first occurrence in file

I'm parsing a file via
output = wildcard.parseFile(myfile)
print output
and I only get the first match of the string.
I have a big config file to parse, with "entries" that are surrounded by braces.
I expect to see all the matches in the file, or an exception if nothing matches.
How do I achieve that?
By default, pyparsing will find the longest match, starting at the first character. So, if your parser is given by num = Word('0123456789'), parsing either "462" or "462-780" will return the same value. However, if the parseAll=True option is passed, the parser will attempt to parse the entire string. In this case, "462" would be matched, but parsing "462-780" would raise a ParseException, because the parser doesn't know how to deal with the dash.
I would recommend constructing something that will match the entirety of the file, then using the parseAll=True flag in parseFile(). If I understand your description of each entry being surrounded by braces correctly, one could do the following.
entire_file = OneOrMore('{' + wildcard + '}')
output = entire_file.parseFile(myfile, parseAll=True)
print output
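If the entries are simple enough, the "match every brace-delimited entry, not just the first" idea can also be sketched with the standard library's re module (the config text below is hypothetical):

```python
import re

# [^}]* matches everything up to the next closing brace, so findall
# returns one string per brace-delimited entry in order of appearance.
config = "{entry one} noise {entry two} {entry three}"
entries = re.findall(r'\{([^}]*)\}', config)
```

pyparsing is still the better fit when the entries themselves have internal structure to validate; this only shows the repetition aspect.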

How to search for string in Python by removing line breaks but return the exact line where the string was found?

I have a bunch of PDF files that I have to search for a set of keywords against. I have to extract the exact line where the keyword was found. I first used xpdf's pdftotext to convert the files to text. (Tried solr but had a tough time tailoring the output/schema to suit my requirement.)
import sys

file_name = sys.argv[1]
searched_string = sys.argv[2]

result = [(line_number + 1, line) for line_number, line in enumerate(open(file_name)) if searched_string.lower() in line.lower()]

# print result
for each in result:
    print each[0], each[1]
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is the case where the search string is broken across the end of a line:
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', the code presented above will miss this keyword. What is the most efficient way of achieving this without making 2 copies of the text file (one for searching for the keyword to extract the line number, and the other with line breaks removed, to catch the case where the keyword spans 2 lines)?
Much appreciated guys!
Note: some considerations without any code, but I think they belong in an answer rather than a comment.
My idea would be to search only for the first keyword; if a match is found, search for the second. This lets you, when the match is found at the end of the line, take the next line into consideration, and do line concatenation only when a match was found in the first place.*
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
def iterwords(fh):
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word
It iterates over the file handler and produces a (line_number, word) tuple for each word in the file.
The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main concern with the linked code that I didn't work around, for both performance and complexity reasons. Can you figure it out? (Spoiler: try searching for a sentence whose first word appears twice in a row in the file.)
* I didn't perform any testing on my own, but this article and the Python wiki suggest that string concatenation is not that efficient in Python (I don't know how current that information is).
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then searching that resultant line.
Then you'd assign line2 to line1, read a new line2, and repeat the process.
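The sliding two-line window described above might look like this. A sketch only; the function name and the way it reports the starting line number are my choices:

```python
def search_wrapped(lines, needle):
    """Return 1-based line numbers where the needle starts,
    even when it wraps onto the next line."""
    needle = needle.lower()
    hits = []
    prev = ''
    for number, line in enumerate(lines, 1):
        line = line.rstrip('\n')
        if needle in line.lower():
            start = number                     # match sits inside one line
        elif prev and needle in (prev + ' ' + line).lower():
            start = number - 1                 # match begins on the previous line
        else:
            start = None
        # Avoid reporting the same start twice when a one-line match
        # also shows up in the two-line window.
        if start is not None and (not hits or hits[-1] != start):
            hits.append(start)
        prev = line                            # slide the window forward
    return hits
```

On the question's sample text, 'String Extraction' wraps from line 2 onto line 3, and the function reports line 2 as the starting line.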
Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE
Then use \s to represent all whitespace (including newlines).
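A minimal sketch of the \s idea, assuming the whole file has been read into one string: turning the literal space in the phrase into \s+ lets the same pattern match whether the two words sit on one line or across a break.

```python
import re

# \s+ matches runs of any whitespace, newlines included, so the
# pattern finds the phrase even when it wraps across two lines.
text = "size limits. String\nExtraction is a common problem"
pattern = re.compile(r'String\s+Extraction')
match = pattern.search(text)
```

Here match.group() is "String\nExtraction", with the line break captured inside the match.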
