Finding patterns in HEX data with regex but getting duplicates - python

I have a Python regex script that scans hex data for repeating patterns. The pattern looks like this:
r"(.{6,}?)\1{2,}"
It looks for a hex string of at least 6 characters that is immediately followed by at least two more copies of itself. My issue is that it also finds substrings inside larger strings it has already found. For example:
Given "a00b00a00b00a00b00a00b00a00b00a00b00" it would find 2 instances of "a00b00a00b00a00b00" and 6 instances of "a00b00". How could I keep only the longest patterns found, and skip looking for shorter ones, without adding more hardcoded parameters?

#!/usr/bin/python
import fnmatch

pattern_string = "abcdefabcdef"

def print_pattern(pattern, num):
    n = num
    # split the pattern into chunks of n characters (6 in this case)
    new_pat = [pattern[i:i+n] for i in range(0, len(pattern), n)]
    # hit counter for matches
    match = 0
    # stores the most recent chunk
    new_match = ""
    # loop through the chunks to see if the first one occurs more than once
    for new in new_pat:
        new_match = new
        print(new)
        # if the chunk matches the first chunk, count it
        if fnmatch.fnmatch(new, new_pat[0]):
            match += 1
    if match:
        print("Count: %d\nPattern:%s" % (match, new_match))
    # returns the last chunk examined
    return new_match

print_pattern(pattern_string, 6)

Regex is better, but this was more fun to write.
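One way to read "keep only the longest" is to drop every candidate whose span lies inside another candidate's span. The following is a minimal sketch of that idea, not a definitive answer: the function names find_repeats and keep_outermost are made up for this illustration, and the lookahead wrapper around the original pattern is an added assumption so that overlapping candidates get reported at all.

```python
import re

def find_repeats(data, min_len=6, extra_copies=2):
    """Return (start, end, unit) for every position where a repeated unit begins.

    The original pattern is wrapped in a lookahead so that overlapping
    candidates are reported instead of being skipped by the scanner.
    """
    pattern = re.compile(r"(?=((.{%d,}?)\2{%d,}))" % (min_len, extra_copies))
    return [(m.start(1), m.end(1), m.group(2)) for m in pattern.finditer(data)]

def keep_outermost(candidates):
    """Drop every candidate whose span is contained in an already kept span."""
    kept = []
    # earliest start first; for equal starts, the longest span first
    for start, end, unit in sorted(candidates, key=lambda c: (c[0], c[0] - c[1])):
        if any(ks <= start and end <= ke for ks, ke, _ in kept):
            continue
        kept.append((start, end, unit))
    return kept

data = "a00b00" * 6
print(keep_outermost(find_repeats(data)))  # [(0, 36, 'a00b00')]
```

The filter itself has no extra tuning parameters: a run is kept only if no other reported run covers it.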

Related

Given some string and index, find longest repeated string

I apologize if this question has been answered elsewhere on this site, but I have searched for a while and have not found a similar question. For some slight context, I am working with RNA sequences.
Without diving into the Bio aspect, my question boils down to this:
Given a string and an index/position, I want to find the largest matching substring based on that position.
For example:
Input
string = "fsalstackoverflowwqiovmnrflofmnastackovsnv"
position = 13 # the f in the substring 'stackoverflow'
Desired Output
rflo
So basically, despite 'stackov' being the longest repeated substring within the string, I only want the largest repeated substring based on the index given.
Any help is appreciated. Thanks!
Edit
I appreciate the answers provided thus far. However, I intentionally made position equal to 13 in order to show that I want to search and expand on either side of the starting position, not just to the right.
We iteratively check longer and longer substrings starting at the given position, using the in keyword to test whether each one occurs in the remaining string. j is the length of the substring currently being tested, which is string[index:index+j], and longest keeps track of the longest matching length seen so far. We can break as soon as the substring starting at that position no longer occurs with the current length j.
string = "fsalstackoverflowwqiovmnrflofmnastackovsnv"
position = 13
index = position - 1
longest = 0
for j in range(1, (len(string) - index) // 2):
    if string[index:index+j] in string[index+j:]:
        longest = j
    else:
        break
print(longest)
print(string[index:index+longest])
Output:
4
rflo
Use the in keyword to check for presence in the remainder of the string, like this:
string = "fsalstackoverflowwqiovmnrflofmnastackovsnv"
# Python string indices start at 0
position = 12
for sub_len in range(1, len(string) - position):
    # Simply check if the string is present in the remainder of the string
    sub_string = string[position:position + sub_len]
    if sub_string in string[position + sub_len:] or sub_string in string[0:position]:
        continue
    break
# The last iteration of the loop did not return any occurrences, print the longest match
print(string[position:position + sub_len - 1])
Output:
rflo
If you set position = 32, this returns stackov, showing how it searches from the beginning as well.
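Both answers above only grow the substring to the right of the start index, while the edit to the question asks for expansion on either side. A brute-force sketch of that reading follows. The function name longest_repeat_through is hypothetical, position is treated as a 0-based index, and "repeated" is assumed to mean that a second, non-overlapping copy exists somewhere else in the string.

```python
def longest_repeat_through(string, position):
    """Longest substring that covers index `position` and occurs elsewhere."""
    best = ""
    for start in range(position + 1):                    # left edge at or before position
        for end in range(position + 1, len(string) + 1):  # right edge after position
            candidate = string[start:end]
            if len(candidate) <= len(best):
                continue
            # look for another copy strictly before or strictly after this slice
            if candidate in string[:start] or candidate in string[end:]:
                best = candidate
    return best

string = "fsalstackoverflowwqiovmnrflofmnastackovsnv"
print(longest_repeat_through(string, 13))  # rflo
print(longest_repeat_through(string, 32))  # stackov
```

This tries every window that contains the index, so it is cubic in the worst case; for long sequences a suffix-based structure would be the usual next step.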

Why does the order of expressions matter in re.match?

I'm making a function that takes a string like "three()" or "{1 + 2}" and splits it into a list of tokens (e.g. "three()" becomes ["three", "(", ")"]). I am using re.match to help separate the string.
import re

def lex(s):
    # scan input string and return a list of its tokens
    seq = []
    patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/|)(\t|\n|\r| )*")
    m = re.match(patterns, s)
    while m != None:
        if s == '':
            break
        seq.append(m.group(2))
        s = s[len(m.group(0)):]
        m = re.match(patterns, s)
    return seq
This one works if the string is just "three". But if the string contains "()" or any symbol, it never leaves the while loop.
But a funny thing happens when I move ([a-z])* to the end of the alternation in the pattern string: it works. Why is that happening?
works: patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
Does not work: patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
This one is a bit tricky, but the problem is with this part: ([a-z])*. It matches any run of zero or more lowercase letters, including an empty one.
If you put this sequence at the end, like here:
patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
The regex engine tries the other alternatives first and stops at the first one that matches. Only if none of them match does it try ([a-z])*, and since * is greedy, it matches all of three. On the following calls, the remaining ( and then ) are matched by their own alternatives.
Read an explanation of how the full expression is tested in the documentation (thanks to @kaya3).
However, if you put that sequence at the start, like here:
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
It will now try to match it first. It's still greedy, so three still gets matched. But on the next call it tries ([a-z])* against the remaining '()', and that succeeds with an empty match, since the string starts with zero letters.
Because the empty match consumes nothing, s never gets shorter and the loop is stuck. You can fix it by changing the * to a +, which only matches if there are 1 or more letters:
patterns = (r"^(\t|\n|\r| )*(([a-z])+|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
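To make the zero-length match visible, here is a small self-contained sketch. It is a simplified variant of the lexer rather than the asker's exact code: the compiled TOKEN pattern, the position-based loop and the ValueError are assumptions added for the illustration, while the token alternatives are the ones from the fixed pattern.

```python
import re

# Same token alternatives as the fixed pattern: + instead of *, so the
# letter branch can never succeed on an empty string.
TOKEN = re.compile(r"[\t\n\r ]*([a-z]+|[0-9]|\(|\)|\*|/)[\t\n\r ]*")

def lex(s):
    """Return the list of tokens in s, raising on anything unrecognised."""
    seq = []
    pos = 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if m is None:
            raise ValueError("unexpected character at index %d: %r" % (pos, s[pos]))
        seq.append(m.group(1))
        pos = m.end()  # every successful match consumes at least one character
    return seq

# The zero-length match that kept the original loop spinning:
empty = re.match(r"([a-z])*", "()")
print(repr(empty.group(0)))  # ''
print(lex("three()"))        # ['three', '(', ')']
```

Scanning with an explicit position avoids re-slicing the string, and a failed match is reported instead of silently ending the loop.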

Printing a String in Reverse After Extracting [duplicate]

This question already has answers here:
How do I reverse a string in Python?
(19 answers)
Closed 2 years ago.
I am trying to create a program in which the user inputs a statement containing two '!' surrounding a string. (example: hello all! this is a test! bye.) I am to grab the string within the two exclamation points, and print it in reverse letter by letter. I have been able to find the start and endpoints that contain the statement, however I am having difficulties creating an index that would cycle through my variable userstring in reverse and print.
test = input('Enter a string with two "!" surrounding portion of the string:')
expoint = test.find('!')
#print (expoint)
twoexpoint = test.find('!', expoint+1)
#print (twoexpoint)
userstring = test[expoint+1 : twoexpoint]
#print(userstring)
number = 0
while number < len(userstring) :
    letter = [twoexpoint - 1]
    print (letter)
    number += 1
twoexpoint - 1 is the last index of the string you need relative to the input string. So what you need is to start from that index and reduce. In your while loop:
letter = test[twoexpoint- number - 1]
Each iteration you increase number which will reduce the index and reverse the string.
But this way you don't actually use the userstring you already found (except for the length...). Instead of caring for indexes, just reverse the userstring:
for letter in userstring[::-1]:
    print(letter)
Explanation: we use a regex to find the pattern, then loop over every occurrence and print it reversed. A string can be reversed in Python with mystring[::-1] (this works for lists too).
The Python re documentation is very useful, and you will need it all the time down the coder road :). Happy coding!
import re  # I recommend using regex

def reverse_string(a):
    matches = re.findall(r'\!(.*?)\!', a)
    for match in matches:
        print("Match found", match)
        print("Match reversed", match[::-1])
        for i in match[::-1]:
            print(i)
In [3]: reverse_string('test test !test! !123asd!')
Match found test
Match reversed tset
t
s
e
t
Match found 123asd
Match reversed dsa321
d
s
a
3
2
1
You're overcomplicating it. Don't bother with an index, simply use reversed() on userstring to cycle through the characters themselves:
userstring = test[expoint+1:twoexpoint]
for letter in reversed(userstring):
    print(letter)
Or use a reversed slice:
userstring = test[twoexpoint-1:expoint:-1]
for letter in userstring:
    print(letter)
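The approaches in these answers can be compared side by side without calling input(). This is only a sketch: the helper name between_marks is invented for the example, and it assumes the text contains at least two marks.

```python
def between_marks(text, mark="!"):
    """Return the text between the first two marks, or None if there are fewer than two."""
    first = text.find(mark)
    second = text.find(mark, first + 1)
    if first == -1 or second == -1:
        return None
    return text[first + 1:second]

inner = between_marks("hello all! this is a test! bye.")

by_slice = inner[::-1]                    # reversed slice
by_reversed = "".join(reversed(inner))    # reversed() iterator
# index arithmetic, counting down from the last character
by_index = "".join(inner[len(inner) - 1 - n] for n in range(len(inner)))

print(repr(by_slice))  # 'tset a si siht '
```

All three produce the same string; the slice is the shortest to write, while the index version mirrors the while loop from the question.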

Python: How to move the position of an output variable using the split() method

This is my first SO post, so go easy! I have a script that counts how many matches occur in a string named postIdent for the substring ff. Based on this it then iterates over postIdent and extracts all of the data following it, like so:
substring = 'ff'
global occurences
occurences = postIdent.count(substring)
x = 0
while x <= occurences:
    for i in postIdent.split("ff"):
        rawData = i
        required_Id = rawData[-8:]
    x += 1
To explain further, if we take the string "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff", it is clear there are 3 instances of ff. I need to get the 8 preceding characters at every instance of the substring ff, so for the first instance this would be 909a9090.
With the rawData, I essentially need to offset the variable required_Id by -1 when I get the data out of the split() method, as I am currently getting the last 8 characters of the current string, not the string I have just split. Another way of doing it could be to pass the current required_Id to the next iteration, but I've not been able to do this.
The split method gets everything after the matching string ff.
Using the partition method can get me the data I need, but does not allow me to iterate over the string in the same way.
Get the last 8 characters of each split using a slice operation in a list comprehension:
s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print([x[-8:] for x in s.split('ff') if x])
# ['909a9090', '90434390', 'sdfs9000']
Not a difficult problem, but tricky for a beginner.
If you split the string on 'ff' then you appear to want the eight characters at the end of every substring but the last. The last eight characters of string s can be obtained using s[-8:]. All but the last element of a sequence x can similarly be obtained with the expression x[:-1].
Putting both those together, we get
subject = '090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff'
for x in subject.split('ff')[:-1]:
    print(x[-8:])
This should print
909a9090
90434390
sdfs9000
I wouldn't do this with split myself, I'd use str.find. This code isn't fancy but it's pretty easy to understand:
fullstr = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
search = "ff"
found = None  # offset of the next occurrence of search
last = 0      # where the next search starts
l = 8         # number of preceding characters to keep
print(fullstr)
while True:
    found = fullstr.find(search, last)
    if found == -1:
        break
    # clamp at 0 so a match near the start cannot produce a negative slice index
    preceding = fullstr[max(0, found - l):found]
    print("At position {} found preceding characters '{}' ".format(found, preceding))
    last = found + len(search)
Overall I like Austin's answer more; it's a lot more elegant.
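The question mentions that partition gives the right data but is awkward to iterate with. One possible way around that is to keep partitioning the remainder in a loop, sketched below. The function name preceding_chunks and its width parameter are made up for this example, and like the split-based answers it only looks at the text since the previous marker.

```python
def preceding_chunks(text, marker="ff", width=8):
    """Collect the last `width` characters before each occurrence of marker."""
    chunks = []
    rest = text
    while True:
        before, sep, rest = rest.partition(marker)
        if not sep:        # marker not found: nothing left to extract
            break
        chunks.append(before[-width:])
    return chunks

s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print(preceding_chunks(s))  # ['909a9090', '90434390', 'sdfs9000']
```

Because the loop stops as soon as partition reports no separator, any text after the final marker is ignored without needing a [:-1] slice or an emptiness check.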

Use Python to extract Branch Lengths from Newick Format

I have a list in python consisting of one item which is a tree written in Newick Format, as below:
['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']
In tree format this appears as below:
I am trying to write some code that will look through the list item and return the IDs (BMNHxxxxxx) which are joined by branch length of 0 (or <0.001 for example) (highlighted in red). I thought about using regex such as:
JustTree = []
with JustTree as f:
    for match in re.finditer(r"(?<=Item\sA)(?:(?!Item\sB).){50,}", subject, re.I):
        f.extend(match.group()+"\n")
This is taken from another StackOverflow answer, where item A would be a ':' (branch lengths always appear after a ':') and item B would be a ',', a ')' or a ';', as these are the three characters that delimit a branch length. But I'm not experienced enough in regex to do this.
With a branch length of 0 in this case, I want the code to output ['BMNH703458a', 'BMNH703458b']. If I could alter this to also include IDs joined by a branch length below a user-defined value, say 0.01, that would be very useful.
If anyone has any input, or can point me to a useful answer I would highly appreciate it.
Okay, here's a regex to extract only numbers (with potential decimals):
\b[0-9]+(?:\.[0-9]+)?\b
The \b on each side is a word boundary: it makes sure that no other digit, letter or underscore sits directly next to the number.
[0-9]+ matches multiple digits.
(?:\.[0-9]+)? is an optional group, meaning that it may or may not match. If there is a dot and digits after the first [0-9]+, then it will match those. Otherwise, it won't. The group itself matches a dot, and at least 1 digit.
You can use it with re.findall to put all the matches in a list:
import re
NewickTree = ['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']
pattern = re.compile(r"\b[0-9]+(?:\.[0-9]+)?\b")
for tree in NewickTree:
    branch_lengths = pattern.findall(tree)
    # Do stuff to the list branch_lengths
    print(branch_lengths)
For this list, you get this printed:
['0.16529463651919140688', '0.22945757727367316336', '0.18028180766761139897',
'0.21469677818346077913', '0.54350916483644962085', '0.00654573856803835914',
'0.04530853441176059537', '0.02416511342888815264', '0.21236619242575086042',
'0.13421900772403019819', '0.14957653992840658219', '0.02592135486124686958',
'0.02477670174791116522', '0.22983459269245612444', '0.00000328449424529074',
'0.29776257618061197086', '0.09881729077887969892', '0.02257522897558370684',
'0.21599133163597591945', '0.02365043128986757739', '0.16069861523756587274',
'0.0']
I know your question has been answered, but if you ever want your data as a nested list instead of a flat string:
import re
import pprint
a="(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;"
def tokenize(text):
    for m in re.finditer(r"\(|\)|[\w.:]+", text):
        yield m.group()

def make_nested_list(tok, L=None):
    if L is None:
        L = []
    while True:
        try:
            t = next(tok)  # next(tok) works on Python 2.6+ and 3; tok.next() is Python 2 only
        except StopIteration:
            break
        if t == "(":
            L.append(make_nested_list(tok))
        elif t == ")":
            break
        else:
            i = t.find(":")
            assert i != -1
            if i == 0:
                L.append(float(t[1:]))
            else:
                L.append([t[:i], float(t[i+1:])])
    return L

L = make_nested_list(tokenize(a))
pprint.pprint(L)
There are several Python libraries that support the Newick format. The ETE toolkit lets you read Newick strings and work with trees as Python objects:
from ete2 import Tree
tree = Tree(newickFile)
print(tree)
Several Newick subformats can be chosen, and branch distances are parsed even if they are expressed in scientific notation.
from ete2 import Tree
tree = Tree("(A:3.4, (B:0.15E-10,C:0.0001):1.5E-234);")
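None of the answers goes as far as returning the IDs themselves, so here is a dependency-free sketch of one possible interpretation. It assumes that two IDs count as "joined" by a short branch when one is a leaf directly under an internal branch shorter than the threshold and the other is a leaf sibling of that branch. Both function names, parse_newick and short_branch_groups, are hypothetical, and the parser only handles the plain name:length form used in the question.

```python
import re

def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length and children."""
    stack = []          # internal nodes that are still open
    root = None
    last_closed = None  # node closed by the previous ")" token, if any
    for tok in re.findall(r"[(),;]|[^(),;\s]+", text):
        if tok == "(":
            node = {"name": "", "length": None, "children": []}
            if stack:
                stack[-1]["children"].append(node)
            else:
                root = node
            stack.append(node)
            last_closed = None
        elif tok == ")":
            last_closed = stack.pop()
        elif tok in (",", ";"):
            last_closed = None
        else:
            # "name:length" labels either the node just closed or a new leaf
            name, _, length = tok.partition(":")
            node = last_closed
            if node is None:
                node = {"name": "", "length": None, "children": []}
                stack[-1]["children"].append(node)
            node["name"] = name
            node["length"] = float(length) if length else None
            last_closed = None
    return root

def short_branch_groups(root, threshold):
    """For each non-root internal branch below threshold, list the adjacent leaf names."""
    groups = []
    def visit(node, parent):
        is_short = node["length"] is not None and node["length"] < threshold
        if node["children"] and parent is not None and is_short:
            nearby = node["children"] + parent["children"]
            groups.append([n["name"] for n in nearby if not n["children"]])
        for child in node["children"]:
            visit(child, node)
    visit(root, None)
    return groups

tree = parse_newick("((A:0.2,(B:0.1,C:0.3):0.00001):0.05,D:0.4):0.0;")
print(short_branch_groups(tree, 0.001))  # [['B', 'C', 'A']]
```

The root's own branch is skipped on purpose, since it has no sibling side to join. A library such as ETE would replace parse_newick entirely; the grouping rule is the part that depends on how "joined" is defined.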
