Searching for a specific phrase pattern within lines (Python)

I have made certain rules that I need to search for in a file. These rules are essentially phrases with an unknown number of words within. For example,
mutant...causes(...)GS
Here, this is a phrase that I want to search for in my file. The ... means a few words should be here (i.e. in this gap), and (...) means there may or may not be words in this gap. GS is a fixed string that I know.
Basically I made these rules by going through many such files and they tell me that a particular file does what I am looking for.
The problem is that the gap can contain any (small) number of words. A new line can even begin inside one of the gaps. Hence, I cannot use identical string matching.
Some example texts -
!Series_summary "To better understand how the expression of a *mutant gene that causes ALS* can perturb the normal phenotype of astrocytes, and to identify genes that may
Here the GS is ALS (defined) and the starred text should be found as a positive match for the rule mutant...causes(...)GS
!Series_overall_design "The analysis includes 9 samples of genomic DNA from
isolated splenic CD11c+ dendritic cells (>95% pure) per group. The two groups are neonates born to mothers with *induced allergy to ovalbumin*, and normal control neonates. All neonates are genetically and environmentally identical, and allergen-naive."
Here the GS is ovalbumin (defined) and the starred text should be found as a positive match for the rule
induced...to GS
I am a beginner in programming in python, so any help will be great!!

The following should get you started. It reads in your file and displays all possible matches using a Python regular expression, which will help you check that it finds all of the correct text (replace GS in the pattern with your known string, e.g. ALS):
import re

with open('input.txt', 'r') as f_input:
    data = f_input.read()

# re.S lets '.' match newlines, so a gap may span a line break
print(re.findall(r'(mutant\s.*?\scauses.*?GS)', data, re.S))
To check only for the presence of a match, change findall to search:
import re

with open('input.txt', 'r') as f_input:
    data = f_input.read()

if re.search(r'(mutant\s.*?\scauses.*?GS)', data, re.S):
    print('found')
To carry this out on many such files, you could adapt it as follows:
import re
import glob

for filename in glob.glob('*.*'):
    with open(filename, 'r') as f_input:
        data = f_input.read()

    if re.search(r'mutant\s.*?\scauses.*?GS', data, re.S):
        print("'{}' matches".format(filename))
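The rule notation from the question can also be translated into a pattern mechanically. The following is a minimal sketch and not part of the original answer: `rule_to_regex` and `MAX_GAP` are hypothetical names, and it assumes that `...` means one to five words, that `(...)` means zero to five words, and that the token GS in the rule is replaced by the known string. Because `\s` also matches a newline, a gap may span a line break.

```python
import re

MAX_GAP = 5  # assumed upper bound on words per gap; tune for your data


def rule_to_regex(rule, gs, max_gap=MAX_GAP):
    """Compile a rule such as 'mutant...causes(...)GS' into a regex."""
    rule = rule.replace("GS", gs)
    # Split on the gap markers and keep them; '(...)' is tried before '...'.
    tokens = re.split(r"(\(\.\.\.\)|\.\.\.)", rule)
    parts = []
    for token in tokens:
        if token == "...":
            # required gap: between 1 and max_gap words
            parts.append(r"\s+(?:\S+\s+){1,%d}?" % max_gap)
        elif token == "(...)":
            # optional gap: between 0 and max_gap words
            parts.append(r"\s+(?:\S+\s+){0,%d}?" % max_gap)
        elif token.strip():
            # literal text: its words may be separated by any whitespace
            parts.append(r"\s+".join(re.escape(w) for w in token.split()))
    return re.compile("".join(parts))


text = "expression of a mutant gene that\ncauses ALS can perturb"
match = rule_to_regex("mutant...causes(...)GS", "ALS").search(text)
print(repr(match.group()) if match else "no match")
```

Bounding the gap keeps a rule from matching two unrelated words that happen to be far apart in the same file, which the unbounded `.*?` version cannot prevent.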

Related

How can I add tolerance to an input search engine

I have this code to search text in a big text file:
y = input("Apellido/s:").upper()
for line in fread:
    if y in line:
        print(line)
How can I make it search for similar text (a kind of autocorrect) if nothing is found? Not as extensive as what Google does, but just a search for text that has maybe one more letter or an accent in a word. I can't work out the algorithm by myself.
I know that I have to add an if, but I'm asking for the logic.
You can do it using find_near_matches from fuzzysearch:
from fuzzysearch import find_near_matches

y = input("Apellido/s:").upper()
for line in fread:
    if find_near_matches(y, line, max_l_dist=2):
        print(line)
find_near_matches returns a list of matches if any are found; if no match is found, it returns an empty list, which evaluates to False.
The max_l_dist option sets the maximum total number of substitutions, insertions and deletions allowed (a.k.a. the Levenshtein distance).
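If installing a third-party package is not an option, a similar tolerance can be sketched with the standard library alone. This is an illustration of the idea and not how fuzzysearch works internally: `strip_accents`, `levenshtein` and `line_matches` are hypothetical helper names, and the sketch compares whole words only, so it is less flexible than substring matching.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks so that an accented letter compares equal to its base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def line_matches(query, line, max_dist=1):
    """True if any word in the line is within max_dist edits of the query."""
    query = strip_accents(query.upper())
    words = strip_accents(line.upper()).split()
    return any(levenshtein(query, word) <= max_dist for word in words)


print(line_matches("GARZIA", "GARCIA PEREZ"))
```

In the original loop, `if y in line` could fall back to `line_matches(y, line)` when the exact search finds nothing.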

Find regular expression or list of regular expressions across multiple text files and extract matching lines

The issue
Caveat: I am good at regular expressions, but I am a Python novice. I have tried to read as extensively as possible and could not find a solution that matched my scenario, so I am asking this question.
I wish to accomplish the following:
Loop through all the text files in a folder (I might use .docx / xml files at some point, but I will figure out the details). I suspect this is a matter of iterating, but I do not understand how to do it here;
Search for regular expressions OR a list of regular expressions contained in a file (as with a gazetteer), ideally stored in an external .txt or .csv file;
Print (or, better yet, write to CSV or Pandas) the name of the file, the match as found, and the line of text containing the latter. Ideally, these would go in different columns of a spreadsheet, so they could be comma separated values, but a dictionary would work just as well.
I had some success with code of this kind, which has allowed me to successfully print matching lines. With about six hours of Python experience in total, I felt pretty happy.
import re

def main():
    regex = re.compile("regex")
    with open("text_file.txt") as f:
        for line in f:
            result = regex.findall(line)
            if result == None:
                continue
            elif result == []:
                continue
            else:
                print(f, result, line)

main()
Problems and goals:
It returns all capture groups for the regular expression (I have multiple capture groups) before the matching line. This is not a problem, but I would like to be able to manipulate this in some way in the future;
I would like to be able to reuse the objects (filename, match, line) for further manipulation and analysis, ideally importing it all into pandas object, but I have no idea how to do it. Any suggestion would be massively appreciated;
When a regex matches multiple patterns in the same line, it only returns one line containing the matches. However, I would like for one such instance to be handled differently. Specifically, I would like for it to return as many lines as there are matches. Consider the example string:
We used to call Bob "Little Bobby"
My regular expression "Bob(by)?" will match "Bob" and "Bobby". But my code will print something like this (if I am not mistaken).
<_io.TextIOWrapper name='text_file.txt' mode='r' encoding='UTF-8'> [('Bob', ''), ('Bobby', ('by')) We used to call Bob "Little Bobby"
Instead, I want it to print two lines (one for the "Bob" match and one for the "Bobby" match. This can be done relatively easily in grep, if I recall correctly, but I can't find anything helpful in the re module documentation.
Loop through all the text files in a folder (I might use .docx / xml files at some point, but I will figure out the details). I suspect this is a matter of iterating, but I do not understand how to do it here;
Yes, you need to iterate. I recommend using os.listdir or glob.glob depending on your needs.
Example:
import glob

# glob.glob takes a single pathname pattern
for filename in glob.glob('/path/to/my/dir/*.txt'):
    print(filename)
    # do other stuff with filename
Search for regular expressions OR a list of regular expressions contained in a file (as with a gazetteer), ideally stored in an external .txt or .csv file;
I recommend using re.findall or re.finditer.
Example:
import re

my_re = re.compile('whatever your regex is')
with open(filename) as f:
    file_contents = f.read()

for match in my_re.findall(file_contents):
    print(match)
    # do whatever you want with the match here
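To extract groups from a match object (as returned by re.finditer or re.search), use its .groups() method. Note that re.findall returns the group strings directly when the pattern contains capture groups.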
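For the last goal in the question (one output row per match instead of one per line), `re.finditer` is convenient because it yields one match object per occurrence. A minimal sketch, assuming the hypothetical helper name `find_rows` and an in-memory list of lines standing in for an open file:

```python
import re


def find_rows(filename, lines, pattern):
    """Return one dict per regex match: file name, matched text, source line."""
    regex = re.compile(pattern)
    rows = []
    for line in lines:
        for match in regex.finditer(line):
            rows.append({
                "file": filename,
                "match": match.group(0),   # whole match, regardless of capture groups
                "groups": match.groups(),  # capture groups, kept for later use
                "line": line.rstrip("\n"),
            })
    return rows


rows = find_rows("text_file.txt",
                 ['We used to call Bob "Little Bobby"\n'],
                 r"Bob(by)?")
for row in rows:
    print(row["file"], row["match"], row["line"], sep=" | ")
```

This prints two rows for the example sentence, one for "Bob" and one for "Bobby". A list of dicts in this shape can be written out with csv.DictWriter or loaded with pandas.DataFrame(rows).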
Print (or, better yet, write to CSV or Pandas) the name of the file, the match as found, and the line of text containing the latter. Ideally, these would go in different columns of a spreadsheet, so they could be comma separated values, but a dictionary would work just as well.
You can load all of the data into a Python list of dicts and then use the csv library for outputting it to a CSV.
Example:
import csv

list_of_data = [{ ... }, { ... }]
with open(output_filename, 'w+') as f:
    # this specifies what the headers of your CSV will be.
    # you can also just specify a list of strings here
    fieldnames = list_of_data[0].keys()
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for item in list_of_data:
        writer.writerow(item)

How to compare two HTML files in python and print only the differences?

I have two html reports generated from sonar showing the issues in my code.
Problem Statement: I need to compare two sonar reports and find out the differences i.e. new issues that got introduced. Basically need to find differences in html and print those differences only.
I tried a few things:
import difflib
file1 = open('sonarlint-report.html', 'r').readlines()
file2 = open('sonarlint-report_latest.html', 'r').readlines()
htmlDiffer = difflib.HtmlDiff()
htmldiffs = htmlDiffer.make_file(file1, file2)
with open('comparison.html', 'w') as outfile:
    outfile.write(htmldiffs)
Now this gives me a comparison.html, which is just a full diff of the two HTML files. It doesn't print only the lines that differ.
Should I try HTML parsing and then somehow get the differences only to be printed? Please suggest.
If you use difflib.Differ, you can keep only the difference lines by filtering on the two-letter codes that are written at the start of every line. From the docs:
class difflib.Differ
This is a class for comparing sequences of lines of text, and producing human-readable differences or deltas. Differ uses SequenceMatcher both to compare sequences of lines, and to compare sequences of characters within similar (near-matching) lines.
Each line of a Differ delta begins with a two-letter code:
Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence
Lines beginning with '?' attempt to guide the eye to intraline differences, and were not present in either input sequence. These lines can be confusing if the sequences contain tab characters.
By keeping only the lines that start with '- ' or '+ ', you get just the differences.
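The filtering step described above can be sketched as follows. `changed_lines` is a hypothetical helper name, and the two short lists stand in for the results of calling `readlines()` on the two reports.

```python
import difflib


def changed_lines(old_lines, new_lines):
    """Keep only the Differ output lines that are unique to one side."""
    delta = difflib.Differ().compare(old_lines, new_lines)
    return [line for line in delta if line.startswith(("- ", "+ "))]


old = ["a\n", "b\n", "c\n"]
new = ["a\n", "c\n", "d\n"]
print("".join(changed_lines(old, new)), end="")
```

To list only newly introduced issues, keep just the lines that start with '+ '.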
I would start by trying to iterate through each html file line by line and checking to see if the lines are the same.
with open('file1.html') as file1, open('file2.html') as file2:
    for file1Line, file2Line in zip(file1, file2):
        if file1Line != file2Line:
            print(file1Line.strip('\n'))
            print(file2Line.strip('\n'))
You'll have to deal with newline characters and multiple line differences in a row, but this is probably a good start :)

Find information in text file given setpoints are known

So I know the setpoints <start point> and <end point> in the text file and I need to use these to find certain information between them which will be used and printed. I currently have .readlines() within a different function which is used within the new function to find the information.
You can try something like this:
flag = False
info = []  # your desired information will be appended as strings in this list
with open(your_file, 'r') as file:
    for line in file:
        if '<start point>' in line:  # reached the start point
            flag = True
        elif '<end point>' in line:  # reached the end point
            flag = False
        elif flag:  # this line is between the start point and the end point
            info.append(line.strip())
>>> info
['Number=12', 'Word=Hello']
This seems like a job for regular expressions. If you have not yet encountered regular expressions, they are an extremely powerful tool that can basically be used to search for a specific pattern in a text string.
For example, the regular expression (or regex for short) Number=\d+ would find every occurrence of Number= followed by one or more digits. The regex Word=\w+ would match any string starting with Word= followed by one or more word characters (letters, digits or underscores).
In python you can use regular expression through the re module. For a great introduction to using regular expressions in python check out this chapter from the book Automate the Boring Stuff with Python. To test out regular expressions this site is great.
In this particular instance you would do something like:
import re

your_file = "test.txt"
with open(your_file, 'r') as file:
    file_contents = file.read()

number_regex = re.compile(r'Number=\d+')
number_matches = re.findall(number_regex, file_contents)
print(number_matches)
>>> ['Number=12']
This would return a list with all matches to the number regex. You could then do the same thing for the word match.
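The two approaches can be combined: first cut out the block between the setpoints, then search only inside it. A minimal sketch, assuming the markers are the literal strings `<start point>` and `<end point>` and that the values have the key=value shape shown above; `extract_between` is a hypothetical helper name.

```python
import re


def extract_between(text, start="<start point>", end="<end point>"):
    """Return (key, value) pairs found between the first start/end marker pair."""
    # re.DOTALL lets '.' cross line breaks; '.*?' stops at the first end marker.
    block = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
    if block is None:
        return []
    return re.findall(r"(\w+)=(\S+)", block.group(1))


sample = ("Header=ignored\n<start point>\nNumber=12\nWord=Hello\n"
          "<end point>\nNumber=99\n")
print(extract_between(sample))
```

Values outside the markers, such as Number=99 in the sample, are ignored because the second search only sees the captured block.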

Stdin Stdout python

For my work I'm used to working with MATLAB. Now I am trying to learn the basic skills of Python as well. Currently I'm working on the following exercise:
You are interested in extracting all of the occurrences that look like this:
<Aug22-2008> <15:37:37> Bond Energy LDA -17.23014168 eV
In particular, you want to gather the numerical values (e.g., -17.23014168) and print them out. Write a script that reads the output file from standard input, and uses regular expressions to locate the values you want to extract. Have your script print out all the values to standard output.
This is the code I use:
import os,re
from string import rjust
dataEx=re.compile(r'''
^\s*
<Aug22-2008>
\s+
<\d{2}:\d{2}:\d{2}>
\s+
Bond
\s
Energy
\s
LDA
\s+
((\+|-)?(\d*)\.?\d*)
''',re.VERBOSE)
f=open('Datafile_Q2.txt','r')
line = f.readline()
while line != '':
    line = f.readline() # Get next line
    m = dataEx.match(line)
    if m:
        # print line
        print m.group(1)
With this code I'm able to find all the values in the data file that they ask for. However, I do have a few questions. Firstly, they ask specifically about stdin and stdout. Now I'm wondering: am I using the right code to read the output file from standard input, and do I really print all the values to standard output this way? Furthermore, I'm wondering whether there is a better or easier way to find the required values.
To find the numbers you're looking for, I would use a positive lookbehind and a positive lookahead in your regular expression:
(?<=Bond Energy LDA ).*(?= eV)
This checks whether the text you are looking at is preceded by 'Bond Energy LDA ' and followed by ' eV', but does not include them in the string you extract. So, assuming that the numbers you are looking for are always preceded and followed by those two things, you can find them like that.
A nice way to read from stdin is to use the sys python module.
import sys
Then you can read lines straight from stdin:
import sys
import re

for line in sys.stdin:
    matchObj = re.search(r'(?<=Bond Energy LDA ).*(?= eV)', line, re.I)
    if matchObj:
        print(matchObj.group())
If the regular expression is not found on the line, matchObj will be None and the if statement is skipped. If it is found, search returns a match object, and matchObj.group() gives the matched text. print writes to stdout by default if no file is given.
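A slightly stricter variant can be sketched by capturing only a signed decimal number and by separating the extraction from the I/O, which makes it easy to try without a terminal. The names `ENERGY_RE`, `extract_energies` and `main` are hypothetical, and the pattern assumes the label and unit always appear exactly as in the sample line.

```python
import re
import sys

# Signed decimal number between the fixed label and the unit.
ENERGY_RE = re.compile(r"Bond Energy LDA\s+([-+]?\d+(?:\.\d+)?)\s+eV")


def extract_energies(lines):
    """Yield the energy value (as a string) from every matching line."""
    for line in lines:
        match = ENERGY_RE.search(line)
        if match:
            yield match.group(1)


def main(stream=None, out=None):
    """Read lines from stream (default: stdin) and print values to out (default: stdout)."""
    stream = sys.stdin if stream is None else stream
    out = sys.stdout if out is None else out
    for value in extract_energies(stream):
        print(value, file=out)
```

Calling `main()` at the bottom of a script and running it as `python extract.py < Datafile_Q2.txt` (the script name is an assumption) feeds the data file through standard input; redirecting with `>` sends the printed values to a file.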
Why use a regular expression? Split the input:
>>> s = """<Aug22-2008> <15:37:37> Bond Energy LDA -17.23014168 eV"""
>>> s.split()[5]
'-17.23014168'
Of course, if you can provide more sample input that does not put the number on the 5th position, this perhaps is not enough.
Ask your teacher for more sample input.
STDIN and STDOUT are documented.
If you want to use regex you may use:
(?:<.*>\W+)[a-zA-Z ]+([-+]?[0-9]*\.?[0-9]+)
