cs50 PSET6/DNA Regular Expressions - python

I'm attempting to work through finding the number of consecutive STRs (a repeated substring pattern, e.g. "AGAT") in a sequence file.
String Patterns: AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Sequence file (one of many other sequence files): AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
In the above sequence, TATC is the maximum, with a run of 5 consecutive TATC repeats. With my regular expression, it is returning matches whether they are consecutive or not.
I believe using regular expressions is my best bet. This is my first time working in Python, so don't expect too much. I've used the regex tool at regex101.com and it has given me some good insight into regex formulations. I'm passing a variable into the regex with {head}, which is the string pattern, but I want to find the matched string {head} two or more times. My regex below returns a match to head one or more times, so I know why it is returning the way it does.
groups = re.findall(rf'(?:{head})+', text)
If I use r"(AGAT){2,}" in regex101.com, this works the way I expect. It finds the matched string of characters 2 or more times. If I pass it into my code as groups = re.findall(rf'(?:{head}){2,}), it doesn't return anything.
My code is below:
import csv
import re
import string
import sys

if len(sys.argv) != 3:
    print("missing command-line argument")
    exit(1)
if re.search(r"(\.csv)", sys.argv[1]) == None:
    print("CSV file not found!")
    print("Usage: 'python.py *.csv *.txt'")
    exit(1)
if re.search(r"(\.txt)", sys.argv[2]) == None:
    print("TXT file not found!")
    print("Usage: 'python.py *.csv *.txt'")
    exit(1)

# use reader or DictReader from the CSV module
# use sys.argv for command-line arguments
# use open(filename) and f.read() to read its contents.
# open CSV and DNA sequence and read into memory
with open(sys.argv[1], newline='') as database, open(sys.argv[2], newline='') as sequence:
    reader = csv.DictReader(database)
    headers = reader.fieldnames
    text = sequence.read()
    for head in headers:
        groups = re.findall(rf'(?:{head})+', text)
        print(head, groups)
If I use the above groups = re.findall(rf'(?:{head})+', text), I get the output below:
AGATC ['AGATCAGATCAGATCAGATC']
TTTTTTCT []
AATG ['AATG']
TCTAG []
GATA ['GATA', 'GATA']
TATC ['TATCTATCTATCTATCTATC']
GAAA ['GAAA', 'GAAA', 'GAAA']
TCTG []
If I use groups = re.findall(rf'(?:{head}){2,}', text), I get nothing:
AGATC []
TTTTTTCT []
AATG []
TCTAG []
GATA []
TATC []
GAAA []
TCTG []
So, I suppose I'm asking: how can I use regex to find a string of characters (passed as a variable) two or more times?

You can use the pattern ((your pattern)\2*) in your regular expression to find the longest consecutive run of a pattern (see the regex101 demo for the pattern TATC):
import re
seq = 'AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG'
patterns = ['AGATC','TTTTTTCT','AATG','TCTAG','GATA','TATC','GAAA','TCTG']
m = max([x for p in patterns for x in re.findall(r'(({})\2*)'.format(p), seq)], key=lambda k: len(k[0]) // len(k[1]))
print('Most repeated pattern: {}, number of repetitions {}'.format(m[1], len(m[0]) // len(m[1])))
Prints:
Most repeated pattern: TATC, number of repetitions 5
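To see why this works (my illustration, not part of the answer above): group 2 captures one copy of the pattern, \2* greedily extends the run with further copies, and group 1 captures the whole consecutive block, so findall returns (run, unit) tuples:
import re

# group 1 = the full consecutive run, group 2 = one copy of the pattern
print(re.findall(r'((TATC)\2*)', 'xxTATCTATCTATCyyTATCzz'))
# [('TATCTATCTATC', 'TATC'), ('TATC', 'TATC')]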

This answer was given by a user, yeahIProgram, on Reddit's cs50 subreddit.
"That's what I was referring to, but I had to look it up and you escape the braces inside the formatted string by doubling them."
So, the regular expression I was looking for was groups = re.findall(rf'(({head}){{2,}})', text), which returned the output below, as I expected:
AGATC [('AGATCAGATCAGATCAGATC', 'AGATC')]
TTTTTTCT []
AATG []
TCTAG []
GATA []
TATC [('TATCTATCTATCTATCTATC', 'TATC')]
GAAA []
TCTG []
Now, I just need to get the total number of times the string occurs and I should be well on my way.
Thank you @Andrej Kesely for your input.
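For anyone following along, a minimal sketch of that counting step (assuming the head and text variables from my code above; the braces are doubled to {{2,}} so the f-string doesn't eat them):
groups = re.findall(rf'(({head}){{2,}})', text)
if groups:
    longest = max(groups, key=lambda g: len(g[0]))
    count = len(longest[0]) // len(longest[1])  # e.g. 20 // 4 == 5 for TATC
else:
    count = 0
print(head, count)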

Related

Finding an exact word in list

I am learning Python and am struggling with finding an exact word in each string in a list of strings.
Apologies if this is an already asked question for this situation.
This is what my code looks like so far:
with open('text.txt') as f:
    lines = f.readlines()
lines = [line.rstrip('\n') for line in open('text.txt')]
keyword = input("Enter a keyword: ")
matching = [x for x in lines if keyword.lower() in x.lower()]
match_count = len(matching)
print('\nNumber of matches: ', match_count, '\n')
print(*matching, sep='\n')
Right now, matching will return all strings containing the word, not strings containing the exact word. For example, if I enter 'local' as the keyword, strings with 'locally' and 'localized' in addition to 'local' will be returned, when I only want instances of 'local' returned.
I have tried:
match_test = re.compile(r"\b" + keyword+ r"\b")
match_test = ('\b' + keyword + '\b')
match_test = re.compile('(?:^|\s|$){0}'.format(keyword))
matching = [x for x in lines if keyword.lower() == x.lower()]
matching = [x for x in lines if keyword.lower() == x.lower().strip()]
And none of them have worked, so I'm a bit stuck.
How do I take the keyword entered from the user, and then return all strings in a list that contain that exact keyword?
Thanks
in means contained in: 'abc' in 'abcd' is True. For an exact match use ==
matching = [x for x in lines if keyword.lower() == x.lower()]
You might need to remove spaces/newlines as well
matching = [x for x in lines if keyword.lower().strip() == x.lower().strip()]
Edit:
To find a line containing the keyword you can use loops
matches = []
for line in lines:
    for string in line.split(' '):
        if string.lower().strip() == keyword.lower().strip():
            matches.append(line)
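A regex alternative to the split loop (a sketch; note re.escape, which guards against regex metacharacters in the user-supplied keyword):
import re

lines = ['local student', 'i live locally', 'end with local']  # sample data
keyword = 'local'
pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
matching = [x for x in lines if pattern.search(x)]
print(matching)  # ['local student', 'end with local']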
This method avoids having to read the whole file into memory. It also deals with cases like "LocaL" or "LOCAL", assuming you want to capture all such variants. There is a bit of performance overhead from building the temp string each time a line is read, however:
import re

def reader(filename, target):
    # this regexp matches a word at the front, end or in the middle of a line stripped
    # of all punctuation and other non-alpha, non-whitespace characters:
    regexp = re.compile(r'(^| )' + target.lower() + r'($| )')
    with open(filename) as fin:
        matching = []
        # read lines one at a time:
        for line in fin:
            line = line.rstrip('\n')
            # generates a line of lowercase and whitespace to test against
            temp = ''.join([x.lower() for x in line if x.isalpha() or x == ' '])
            print(temp)
            if regexp.search(temp):
                matching.append(line)  # store unaltered line
    return matching
Given the following tests:
locally local! localized
locally locale nonlocal localized
the magic word is Local.
Localized or nonlocal or LOCAL
This is returned:
['locally local! localized',
'the magic word is Local.',
'Localized or nonlocal or LOCAL']
Please find my solution below, which should match only local among the lines in the text file. I used re.search to find instances whose string contains only 'local'; other strings containing local will not be matched.
Lines provided in the text file:
local
localized
locally
local
local diwakar
local
local##!
Code to find only the instances of 'local' in the text file:
import os
import sys
import time
import re

with open('C:/path_to_file.txt') as f:
    for line in f:
        a = re.search(r'local\W$', line)
        if a:
            print(line)
Output
local
local
local
Let me know if this is what you were looking for
Your first test seems to be on the right track
Using input:
import re
lines = [
    'local student',
    'i live locally',
    'keyboard localization',
    'what if local was in middle',
    'end with local',
]
keyword = 'local'
Try this:
pattern = re.compile(r'.*\b{}\b'.format(keyword.lower()))
matching = [x for x in lines if pattern.match(x.lower())]
print(matching)
Output:
['local student', 'what if local was in middle', 'end with local']
pattern.match will return a match object if the regex matches, or None; note that match is anchored at the beginning of the string. Using this as your if condition filters for strings that contain the whole keyword somewhere. This works because \b matches the beginning/end of words. The .* captures any characters that may occur at the start of the line before your keyword shows up.
For more info about using Python's re, checkout the docs here: https://docs.python.org/3.8/library/re.html
You can try
pattern = re.compile(r"\b{}\b".format(keyword))
match_test = pattern.search(line)
as shown in
Python - Concat two raw strings with an user name

In Python, how to match a string to a dictionary item (like 'Bra*')

I'm a complete novice at Python so please excuse me for asking something stupid.
From a textfile a dictionary is made to be used as a pass/block filter.
The textfile contains addresses and either a block or allow like "002029568,allow" or "0011*,allow" (without the quotes).
The search-input is a string with a complete code like "001180000".
How can I evaluate if the search-item is in the dictionary and make it match the "0011*,allow" line?
Thank you very much for your effort!
The filter-dictionary is made with:
def loadFilterDict(filename):
    global filterDict
    try:
        with open(filename, "r") as text_file:
            lines = text_file.readlines()
            for s in lines:
                fields = s.split(',')
                if len(fields) == 2:
                    filterDict[fields[0]] = fields[1].strip()
            text_file.close()
    except:
        pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
    if filterDict[ccode] in ['block']:
        continue
else:
    if filterstat in ['block']:
        continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow
If you can use re, you don't have to worry about the wildcard; let re.match do the hard work for you:
# Rules input (this could also be read from file)
lines = """002029568,allow
0011*,allow
001180001,block
"""

# Parse rules from string
rules = []
for line in lines.split("\n"):
    line = line.strip()
    if not line:
        continue
    identifier, ruling = line.split(",")
    rules += [(identifier, ruling)]

# Get rulings for specific number
def rule(number):
    from re import match
    rulings = []
    for identifier, ruling in rules:
        # Replace wildcard with regex .*
        identifier = identifier.replace("*", ".*")
        if match(identifier, number):
            rulings += [ruling]
    return rulings

print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings. Their order is the same order as they appear in your config lines. So you could easily just pick the last or first ruling whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
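As a side note (my addition, not part of the answer above): the standard library's fnmatch module understands shell-style * wildcards directly, so the manual replace("*", ".*") step can be avoided:
from fnmatch import fnmatch

print(fnmatch('001180000', '0011*'))  # True
print(fnmatch('002029568', '0011*'))  # False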
If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]

if prefix in d:
    # there is a "match", then you can deal with all the entries that match,
    # in this case the items in the inner dictionary
    # {'001180000': 'value', '001180001': 'value'}
    print('match')
else:
    print('no match')
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (and therefore defeating the point of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]

for k, v in d.items():
    if k.startswith(prefix):
        # found matching key-value pair
        print(k, v)

how to get the number of occurrence of an expression in a file using python

I have code that reads files, finds the expression matching the user input, and highlights it, using the findall function from the re module.
I am also trying to save several pieces of information based on this matching in a JSON file, such as:
file name
matching expression
number of occurrences
The problem is that the program reads the file and displays the text with the highlighted expression, but in the JSON file it saves the number of occurrences as the number of lines.
In this example the word 'this' is the searched word; it exists in the text file twice,
but the result in the JSON file is 12, which is the number of text lines.
(screenshot: result of the JSON file and the highlighted text)
code:
def MatchFunc(self):
    self.textEdit_PDFpreview.clear()
    x = self.lineEditSearch.text()
    TextString = self.ReadingFileContent(self.FileListSelected())
    d = defaultdict(list)
    filename = os.path.basename(self.FileListSelected())
    RepX = '<u><b style="color:#FF0000">' + x + '</b></u>'
    for counter, myLine in enumerate(filename):
        self.textEdit_PDFpreview.clear()
        thematch = re.sub(x, RepX, TextString)
        thematchFilt = re.findall(x, TextString, re.M | re.I)
        if thematchFilt:
            d[thematchFilt[0]].append(counter + 1)
            self.textEdit_PDFpreview.insertHtml(str(thematch))
        else:
            self.textEdit_PDFpreview.insertHtml('No Match Found')
    OutPutListMetaData = []
    for match, positions in d.items():
        print("this is match {}".format(match))
        print("this is position {}".format(positions))
        listMetaData = {"File Name": filename, "Searched Word": match, "Number Of Occurence": len(positions)}
        OutPutListMetaData.append(listMetaData)
        for p in positions:
            print("on line {}".format(p))
    jsondata = json.dumps(OutPutListMetaData, indent=4)
    print(jsondata)
    folderToCreate = "search_result"
    today = time.strftime("%Y%m%d__%H-%M")
    jsonFileName = "{}_searchResult.json".format(today)
    if not os.path.exists(os.getcwd() + os.sep + folderToCreate):
        os.mkdir("./search_result")
    fpJ = os.path.join(os.getcwd() + os.sep + folderToCreate, jsonFileName)
    print(fpJ)
    with open(fpJ, "a") as jsf:
        jsf.write(jsondata)
        print("finish writing")
It's straightforward using Counter: pass it an iterable and it maps each element to its number of occurrences (its .most_common() method returns them as (element, count) tuples).
And as the re.findall function returns a list, for a single expression you can just do len(result).
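A minimal sketch of both (with a hypothetical text and search word, not the asker's actual data):
import re
from collections import Counter

text = 'this is a test: this works'
word = 'this'
print(len(re.findall(word, text, re.M | re.I)))  # 2

# or count every word at once:
counts = Counter(re.findall(r'\w+', text.lower()))
print(counts[word])  # 2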

Faster operation reading file

I have to process a 15MB txt file (nucleic acid sequence) and find all the different substrings (size 5). For instance:
ABCDEF
would return 2, as we have both ABCDE and BCDEF, but
AAAAAA
would return 1. My code:
control_var = 0
f = open("input.txt", "r")
list_of_substrings = []
while f.read(5) != "":
    f.seek(control_var)
    aux = f.read(5)
    if aux not in list_of_substrings:
        list_of_substrings.append(aux)
    control_var += 1
f.close()
print len(list_of_substrings)
Would another approach be faster (instead of comparing the strings direct from the file)?
Depending on what your definition of a legal substring is, here is a possible solution:
import re

regex = re.compile(r'(?=(\w{5}))')
with open('input.txt', 'r') as fh:
    input = fh.read()
print len(set(re.findall(regex, input)))
Of course, you may replace \w with whatever you see fit to qualify as a legal character in your substring. [A-Za-z0-9], for example, will match all alphanumeric characters.
Here is an execution example:
>>> import re
>>> input = "ABCDEF GABCDEF"
>>> set(re.findall(regex, input))
set(['GABCD', 'ABCDE', 'BCDEF'])
EDIT: Following your comment above that all characters in the file are valid except the last one (which is \n), it seems that there is no real need for regular expressions here and the iteration approach is much faster. You can benchmark it yourself with this code (note that I slightly modified the functions to reflect your update regarding the definition of a valid substring):
import timeit
import re

FILE_NAME = r'input.txt'

def re_approach():
    return len(set(re.findall(r'(?=(.{5}))', input[:-1])))

def iter_approach():
    return len(set([input[i:i+5] for i in xrange(len(input) - 5)]))

with open(FILE_NAME, 'r') as fh:
    input = fh.read()

# verify that the output of both approaches is identical
assert set(re.findall(r'(?=(.{5}))', input[:-1])) == set([input[i:i+5] for i in xrange(len(input) - 5)])

print timeit.repeat(stmt = re_approach, number = 500)
print timeit.repeat(stmt = iter_approach, number = 500)
15MB doesn't sound like a lot. Something like this probably would work fine:
from collections import Counter
import re

contents = open('input.txt', 'r').read()
# lookahead regex so that overlapping substrings are counted as well
counter = Counter(re.findall(r'(?=(.{5}))', contents))
print len(counter)
Update
I think user590028 gave a great solution, but here is another option:
contents = open('input.txt', 'r').read()
print set(contents[start:start+5] for start in range(0, len(contents) - 4))
# Or using a dictionary
# dict([(contents[start:start+5],True) for start in range(0, len(contents) - 4)]).keys()
You could use a dictionary, where each key is a substring. It will take care of duplicates, and you can just count the keys at the end.
So: read through the file once, storing each substring in the dictionary, which will handle finding duplicate substrings & counting the distinct ones.
Reading all at once is more i/o efficient, and using a dict() is going to be faster than testing for existence in a list. Something like:
fives = {}
buf = open('input.txt').read()
for x in xrange(len(buf) - 4):
    key = buf[x:x+5]
    fives[key] = 1

for keys in fives.keys():
    print keys

KeyError in Python Script

I've tried debugging this script but I'm not sure what's causing the error.
list1 = ['<p>Text ([0-9]):(.*)</p>', '<p>Text2 ([0-9]):(.*)</p>', '<p>Text ([0-9]):(.*)</p>']
list2 = ["<p class='text'>Text \1:\2</p>", "<p class='text'>Text \1:\2</p>", "<p class='text'>TEXT ([0-9]):(.*)</p>"]
translation = dict(zip(list1, list2))
pattern = re.compile('(%s)' % '|'.join(dicts.list1))

file.close()
file = open(args.file, 'r+')

def translate(match):
    return dicts.translation[match.group(0)]

with open(args.file, 'r+') as output:
    with open(args.file, 'r+') as book:
        for line in book:
            output.write(pattern.sub(translate, line))
Error:
return dicts.translation5[match.group(0)]
KeyError: '<p>Text 1:1-1</p>'
I believe you are trying to match a read line and see which regexp it matches, so that you can apply the appropriate change to it (also in regexp form). This approach might work, but using a dictionary is redundant in this case.
The broad approach is:
1. Match the line against the compiled pattern to find a match.
2. Compare each pattern in list1 to the matched string to see if it matches.
3. If it does, convert the matched string to the form in list2.
Something like
import re

list1 = ['<p>Text ([0-9]):(.*)</p>', '<p>Text2 ([0-9]):(.*)</p>', '<p>Text3 ([0-9]):(.*)</p>']
# raw strings so \1 and \2 survive as backreferences for re.sub
list2 = [r"<p class='text'>Text \1:\2</p>", r"<p class='text'>Text \1:\2</p>", r"<p class='text'>TEXT ([0-9]):(.*)</p>"]
translation = dict(zip(list1, list2))
pattern = re.compile('(%s)' % '|'.join(list1))

def translate(m):
    for x, v in translation.items():
        if re.search(x, m.group()):
            return re.sub(x, v, m.group())

# book and output are the file objects from your code
for line in book:
    m = pattern.search(line)
    ret = translate(m) if m else None
    if ret is not None:
        output.write(ret)
    else:
        # No match. Echo back original line
        output.write(line)
Input
<p>Text 1:1-1</p>
Output
<p class='text'>Text 1:1-1</p>
There are probably other better ways to do it
The issue is that the text '<p>Text 1:1-1</p>' is not a key in your dict. As dicts is a free variable in your code, there is nothing more we can tell you.
Try one of the numbered groups instead of group(0). In regex results, group(0) is the entire matched string and groups 1 and up are the parenthesized groups in the regex itself. Note that compile('(%s)' % '|'.join(dicts.list1)) wraps the whole alternation in an extra group, so group(1) is also the entire match; the first inner group is group(2). In your case group(0) == group(1) == "<p>Text 1:1-1</p>" and group(2) == "1".
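A quick illustration of that numbering (my example, using just the first pattern):
import re

pattern = re.compile('(%s)' % '<p>Text ([0-9]):(.*)</p>')
m = pattern.search('<p>Text 1:1-1</p>')
print(m.group(0))  # <p>Text 1:1-1</p>  (entire match)
print(m.group(1))  # <p>Text 1:1-1</p>  (the extra wrapping group)
print(m.group(2))  # 1
print(m.group(3))  # 1-1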
