How to use a regex to search for contiguous incrementing sequences

I would like to use regex to increase the speed of my searches for specific records within a large binary image. It seems like regex searches always outperform my own search methods, so that's why I'm looking into this. I have already implemented the following, which works, but is not very fast.
My binary image is loaded into a NumPy memmap as 32-bit words:
I_FILE = np.memmap(opts.image_file, dtype='uint32', mode='r')
And here is the start of my current search loop (which works):
for i in range(0, FILESIZE - 19):
    if (((I_FILE[i] + 1 == I_FILE[i + 19]) or (I_FILE[i - 19] + 1 == I_FILE[i])) and I_FILE[i] < 60):
        ...do stuff...
This seeks out records that are 19 words (76 bytes) long and start with a decimal sequence number between 0 and 59. To validate a record, it looks for an incrementing sequence number in the record either before or after the current search location.
I've seen a few examples where folks have built variables into a pattern string using re.escape (like this: How to use a variable inside a regular expression?), but I can't figure out how to search for a changing sequence of values.

I managed to make it work with regex, but it was a bit more complicated than I expected. The patterns look for a value between 0 and 59 that is followed, 72 bytes (18 words) later, by a value between 1 and 59. I used two regex searches to ensure that I wouldn't miss records at the end of a sequence:
# First search uses the lookahead assertion to not consume large amounts of data.
SearchPattern1 = re.compile(b'[\0-\x3B]\0\0\0(?=.{72}[\1-\x3B]\0\0\0)', re.DOTALL)
# Again using the positive lookbehind assertion (?<= ... ) to grab the ending entries.
SearchPattern2 = re.compile(b'(?<=[\0-\x3B]\0\0\0.{72})[\1-\x3B]\0\0\0', re.DOTALL)
Next, perform both searches and combine the results.
HitList1 = [m.start(0) for m in SearchPattern1.finditer(I_FILE)]
HitList2 = [m.start(0) for m in SearchPattern2.finditer(I_FILE)]
AllHitList = list(set(HitList1 + HitList2))
SortedHitList = sorted(AllHitList)
Now I run a search that has the same conditions as my original solution, but it runs on a much smaller set of data!
for TestLoc in SortedHitList:
    if (I_FILE[TestLoc] + 1 == I_FILE[TestLoc + 19]) or (I_FILE[TestLoc - 19] + 1 == I_FILE[TestLoc]):
        ... do stuff ...
The result was very successful! The original solution took 58 seconds to run on a 300 MB binary file, while the new regex solution took only 2 seconds!!
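A regex run over a raw buffer reports byte offsets, while the validation loop indexes 32-bit words, so it may help to convert between the two explicitly. The following is a minimal sketch of that idea, not the author's exact code. It assumes little-endian words, works on a plain bytes object rather than the memmap, and replaces the two patterns with a single zero-width lookahead so that overlapping candidates are not skipped. The names CANDIDATE, find_record_starts, and the test image are hypothetical.

```python
import re
import struct

RECORD_WORDS = 19                    # record length in 32-bit words, per the question
GAP_BYTES = (RECORD_WORDS - 1) * 4   # 72 bytes between one sequence word and the next

# Zero-width lookahead: a little-endian word in 0..59, then, one record later,
# a word in 1..59. Nothing is consumed, so overlapping candidates are all reported.
CANDIDATE = re.compile(
    rb'(?=[\x00-\x3b]\x00\x00\x00.{%d}[\x01-\x3b]\x00\x00\x00)' % GAP_BYTES,
    re.DOTALL,
)

def find_record_starts(data):
    """Return word indices i where word[i] + 1 == word[i + RECORD_WORDS]."""
    starts = []
    for m in CANDIDATE.finditer(data):
        offset = m.start()           # byte offset into the buffer
        if offset % 4:               # ignore matches that straddle two words
            continue
        cur, = struct.unpack_from('<I', data, offset)
        nxt, = struct.unpack_from('<I', data, offset + RECORD_WORDS * 4)
        if cur + 1 == nxt:
            starts.append(offset // 4)
    return starts

# Hypothetical test image: 3 filler words, then 5 records numbered 0..4.
FILLER = 0xFFFFFFFF
words = [FILLER] * 3
for seq in range(5):
    words += [seq] + [FILLER] * (RECORD_WORDS - 1)
image = struct.pack('<%dI' % len(words), *words)

print(find_record_starts(image))     # [3, 22, 41, 60]
```

The last record of a run (word index 79 in this test image) has no successor, so this forward-only check does not report it. Adding RECORD_WORDS to the final hit, or keeping the backward comparison from the original loop, would recover it.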

Related

How to count cells containing numbers in specific range with cells that contain both text and numbers

I thought I could easily sort this issue out but it took me ages to solve just half of it.
I have a table that contains 100 data cells in a row. Data in each cell are either text-only or text and numbers (see layout at bottom).
I need a function that COUNTs how many cells are present in the table that report the value of N2 OR E to be >=37.
Negative
Positive (N2: 23, E: 23)
Negative
Positive (N2: 37, E: 26)
Positive (N2: 31, E: 38)

Function answer: 2
So far I have only managed to extract each N2 number with a formula that starts at the 15th character, [=MID(A2,15,FIND(",",A2)-15)], and then to count how many of the extracted numbers (placed in column B) are >=37 with a second formula, [=COUNTIF(B2:B100, ">=37")]. I have no clue how to take the E value into account.
In addition, I would like the function to count cells in which either the N2 value OR the E value is >=37.
Is there a chance of having one big function that does that, without relying on KUTools for Excel?

If you have the newest version of Excel, you can use FILTERXML after making some minor changes. First concatenate the whole range with CONCAT, then eliminate all ","s and replace ")"s with spaces in the concatenated string.
For example, the below gets you all the instances over 36 (if you only want the number of times, wrap it in a COUNT):
=FILTERXML("<t><s>"&SUBSTITUTE(
SUBSTITUTE(SUBSTITUTE(CONCAT($F$2:$F$7), ")", " "), ",", ""), " ",
"</s><s>")&"</s></t>", "//s[number()>=37]")
For more info on dealing with strings, see here.
EDIT: Thanks @MarkBalhoff for catching a missing space in the formula and @JvdV for giving another way with:
=IFERROR(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(TEXTJOIN(" ",,F2:F6)," ","</s><s>")&"</s></t>","//s[translate(.,',','')*1>=37 or translate(following::*[2],')','')*1>=37]")),0)

Since you include the python tag and also reference KU-Tools, I assume you have some familiarity with VBA.
You could easily, and flexibly, implement the logic in Excel VBA using regular expressions.
For this function, I allowed three arguments:
The range to search
The threshold for the values
A list of values to look for
In the regex, the pattern looks for the digits that follow either of the strings in "searchFor". Note that, as written, you need to include the colons in the searchFor strings, and that the strings are case-sensitive (easily changed).
Option Explicit
Function CountVals(r As Range, Threshold As Long, ParamArray searchFor() As Variant) As Long
    Dim RE As Object, MC As Object, M As Object
    Dim counter As Long
    Dim vSrc As Variant, v As Variant
    Dim sPat As String
    'read range into variant array for fastest processing
    vSrc = r
    'create Pattern
    sPat = "(?:" & Join(searchFor, "|") & ")\s*(\d+)"
    'initialize regex
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .ignorecase = False 'or change to true if capitalization not important
        .Pattern = sPat
        counter = 0
        'check each string for the values
        For Each v In vSrc
            Set MC = .Execute(v)
            For Each M In MC
                If CLng(M.submatches(0)) >= Threshold Then counter = counter + 1
            Next M
        Next v
        CountVals = counter
    End With
End Function
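Since the question also carries the python tag, the same regex idea can be sketched in Python. As written, the VBA function increments its counter once per matching number, so a cell in which both N2 and E reach the threshold would be counted twice. The sketch below counts each cell at most once, which is one reading of what the question asks for. The function name count_cells and the sample list are hypothetical.

```python
import re

def count_cells(cells, threshold, *labels):
    """Count cells in which any labelled number is >= threshold."""
    # Same pattern shape as the VBA version: a label, optional spaces, then digits.
    pattern = re.compile(r'(?:%s)\s*(\d+)' % '|'.join(re.escape(label) for label in labels))
    return sum(
        any(int(m.group(1)) >= threshold for m in pattern.finditer(cell))
        for cell in cells
    )

# Sample data taken from the layout in the question.
cells = [
    "Negative",
    "Positive (N2: 23, E: 23)",
    "Negative",
    "Positive (N2: 37, E: 26)",
    "Positive (N2: 31, E: 38)",
]
print(count_cells(cells, 37, "N2:", "E:"))  # 2
```

As in the VBA version, the labels must include the colon and are matched case-sensitively.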

Why my code consumes too much memory even after clearing list?

So I'm trying to solve this problem, and the question goes like this:
You probably all know the famous Japanese cartoon characters Nobita and Shizuka. Nobita and Shizuka are very good friends. However, Shizuka loves a special kind of string called Tokushuna.
A string T is called Tokushuna if
The length of the string is greater than or equal to 3 (|T| ≥ 3)
It starts and ends with the character ‘1’ (one)
It contains (|T|-2) ‘0’ (zero) characters
Here |T| is the length of string T. For example, 10001, 101 and 10001 are Tokushuna strings, but 1100, 1111 and 0000 are not.
One day Shizuka gives Nobita a problem and promises to go on a date with him if he is able to solve it. Shizuka gives him a string S and asks him to count the Tokushuna strings that can be found among all possible substrings of S. Nobita wants to go on the date with Shizuka, but, as you know, he is very weak at math and counting and always gets the lowest marks in math. This time Doraemon is not present to help him, so he needs your help to solve the problem.
Input
The first line of the input contains an integer T, the number of test cases. In each test case, you are given a binary string S consisting only of 0 and 1.
Subtasks
Subtask #1 (50 points)
1 ≤ T ≤ 100
1 ≤ |S| ≤ 100
Subtask #2 (50 points)
1 ≤ T ≤ 100
1 ≤ |S| ≤ 105
Output
For each test case, output a line Case X: Y, where X is the case number and Y is the number of Tokushuna strings that can be found among all possible substrings of S.
Sample
Input
3
10001
10101
1001001001
Output
Case 1: 1
Case 2: 2
Case 3: 3
In the first case, 10001 is itself a Tokushuna string.
In the second case, two substrings, S[1-3] = 101 and S[3-5] = 101, are Tokushuna strings.
What I've done so far
I've already solved the problem, but my code exceeds the memory limit (512 MB). I'm guessing this is because of the large input size. To work around it, I tried clearing the list that holds all the substrings of one string after each string is processed, but that isn't helping.
My code
num = int(input())
num_list = []
for i in range(num):
    i = input()
    num_list.append(i)

def condition(a_list):
    case = 0
    case_no = 1
    sub = []
    for st in a_list:
        sub.append([st[i:j] for i in range(len(st)) for j in range(i + 1, len(st) + 1)])
        for i in sub:
            for item in i:
                if len(item) >= 3 and (item[0] == '1' and item[-1] == '1') and (len(item) - 2 == item.count('0')):
                    case += 1
        print("Case {}: {}".format(case_no, case))
        case = 0
        case_no += 1
        sub.clear()

condition(num_list)
Is there any better approach to solve the memory consumption problem?

Have you tried taking java heap dump and java thread dump? These will tell the memory leak and also the thread that is consuming memory.

Your method of creating all possible substrings won't scale very well to larger problems. If the input string is length N, the number of substrings is N * (N + 1) / 2 -- in other words, the memory needed will grow roughly like N ** 2. That said, it is a bit puzzling to me why your code would exceed 512MB if the length of the input string is always less than 105.
In any case, there is no need to store all of those substrings in memory, because a Tokushuna string cannot contain other Tokushuna strings nested within it:
1      # Leading one.
0...   # Some zeros. Cannot be or contain a Tokushuna.
1      # Trailing one. Could also be the start of the next Tokushuna.
That means a single scan over the string should be sufficient to find them all.
You could write your own algorithmic code to scan the characters and keep track
of whether it finds a Tokushuna string. But that requires some tedious
bookkeeping.
A better option is regex, which is very good at character-by-character analysis:
import sys
import re

# Usage: python foo.py 3 10001 10101 1001001001
cases = sys.argv[2:]

# Match a Tokushuna string without consuming the last '1', using a lookahead.
rgx = re.compile(r'10+(?=1)')

# Check the cases.
for i, c in enumerate(cases):
    matches = list(rgx.finditer(c))
    msg = 'Case {}: {}'.format(i + 1, len(matches))
    print(msg)
If you do not want to use regex, my first instinct would be to start the algorithm by finding the indexes of all of the ones: indexes = [j for j, c in enumerate(case) if c == '1']. Then pair those indexes up: zip(indexes, indexes[1:]). Then iterate over the pairs, checking whether the part in the middle is all zeros.
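The index-pairing idea can be sketched as follows. Because the input contains only 0 and 1, everything between two consecutive ones is necessarily zeros, so the check reduces to requiring at least one character between them. The function name count_tokushuna is hypothetical, and the sketch takes the strings directly rather than reading them from standard input.

```python
def count_tokushuna(s):
    """Count substrings of the binary string s of the form 1, one or more 0s, 1."""
    ones = [j for j, c in enumerate(s) if c == '1']
    # Consecutive ones with a gap of at least one zero form one Tokushuna string.
    return sum(1 for a, b in zip(ones, ones[1:]) if b - a >= 2)

# Sample cases from the problem statement.
for case_no, s in enumerate(['10001', '10101', '1001001001'], start=1):
    print('Case {}: {}'.format(case_no, count_tokushuna(s)))
# Case 1: 1
# Case 2: 2
# Case 3: 3
```

This stores only the positions of the ones, so memory grows linearly with the input length rather than with the number of substrings.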
A small note regarding your current code:
# Rather than this,
sub = []
for st in a_list:
    sub.append([...])  # Incurs memory cost of the temporary list
                       # and a need to drill down to the inner list.
    ...
    sub.clear()        # Also requires a step that's easy to forget.

# just do this.
for st in a_list:
    sub = [...]
    ...

Using interpolation search to find beginning of list in large text file - Python

I need to find the last timestamp in a very large log file with an unknown number of lines before I reach a line with a timestamp. I read the file backwards one line at a time, which is usually very quick except for one case. Sometimes, I will run into a very large block (thousands of lines) with a known repeating pattern (one entry shown below) and no timestamps:
goal_tolerance[0]:
name: joint_b
position: -1
velocity: -1
acceleration: -1
Since this is the only case where I have this kind of problem, I can just throw a piece of code into the loop that checks for it before searching the log line by line.
The number after goal_tolerance is a counter, going up 1 each time the pattern repeats, so what I would like to do is use that number to calculate the beginning of the pattern. What I have now looks something like this:
if ' goal_tolerance' in line:
    gtolnum = line[17:-3]
    print gtolnum
    startFrom = currentPosition - ((long(gtolnum) + 1) * 95)
    break
However, this does not take into account the number of characters in the counter, so I end up running through the search loop several more times than necessary. Is there a fast way to include those characters in the calculation?
EDIT: I do not read the entire file to get to that point, since it is large and I have several hundred timestamps to search for in several hundred log files. My search function seeks to a position in the text file, then finds the beginning of a line near that point and reads it. The calculation is determining a file position I can use with .seek() based on the number of bytes or characters in the pattern.

I did some maths in the meantime and came up with a mathematical solution:
...
n = long(gtolnum)
q = len(gtolnum)         # I'll refer to this as the number's "level"
x = n + 1 - 10**(q - 1)  # Number of entries in the current level
c = x * (q - 1)          # Additional digits in the current level
i = 2
p = 0
while i < q:
    p += 9 * (q - i) * (10**(q - i))  # Additional digits in i levels previous
    i += 1
startFrom = currentPosition - ((n + 1) * 95 + p + c)
...
Seems like there should be a much simpler solution, but I'm not seeing it. Perhaps a log function could help?
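One way to restate the same arithmetic is to loop over digit widths instead of splitting the current level from the previous ones. This is only a sketch in Python 3 under the same assumptions as the code above: each entry is 95 characters long when its counter has one digit, and every extra counter digit adds one character. The names extra_digits and pattern_start are hypothetical.

```python
def extra_digits(n):
    """Total number of digits beyond the first, summed over the counters 0..n."""
    total = 0
    width = 2                                  # one-digit counters add nothing
    while 10 ** (width - 1) <= n:
        low = 10 ** (width - 1)                # smallest counter with this many digits
        high = min(n, 10 ** width - 1)         # largest one that is still <= n
        total += (high - low + 1) * (width - 1)
        width += 1
    return total

def pattern_start(current_position, n, entry_len=95):
    """Mirror of the startFrom formula: step back over entries 0..n."""
    return current_position - ((n + 1) * entry_len + extra_digits(n))

print(extra_digits(1234))  # 2595
```

The loop runs once per digit width, so its cost is proportional to the number of digits in the counter, not to the counter itself. A brute-force sum of len(str(k)) - 1 over range(n + 1) gives the same totals and is a convenient cross-check on small inputs.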

parsing scientific publication page ranges in python

I need to parse a set of strings that contain page ranges as they appear in metadata of scientific and other publications. I don't have a complete spec of the pagination format, and I am not even sure if one exists, but examples of strings I need to process are:
6-10, 19-22
xlvii-xlviii
111S-2S
326
A078-132
XC-CIII
Ideally, I'd like to return the number of pages for each string, e.g. 9 for 6-10, 19-22. If that's too hard, at least whether it's a single page or more. The latter is pretty easy actually since commas and dashes seem to be the only delimiters in the examples I've seen so far. But I do very much prefer to get the right count.
I can write my own parser but I am curious whether there are any existing packages that can already do this out of the box or with minimal mods.

Here's a solution that supports parsing "normal" numbers as well as Roman numerals. For parsing Roman numerals, install the roman package (pip install roman). You can enhance the parse_num function to support additional formats.
import roman

def parse_num(p):
    p = p.strip()
    try:
        return roman.fromRoman(p.upper())
    except:
        return int(p)

def parse_pages(s):
    count = 0
    for part in s.split(','):
        rng = part.split('-', 1)
        a, b = parse_num(rng[0]), parse_num(rng[-1])
        count += b - a + 1
    return count

>>> parse_pages('17')
1
>>> parse_pages('6-10, 19-22')
9
>>> parse_pages('xlvii-xlviii')
2
>>> parse_pages('XC-CIII')
14
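As one possible enhancement, here is a dependency-free sketch that also accepts the other formats listed in the question. It rests on two guesses that the question does not confirm: that letters around a number (as in 111S or A078) can be ignored, and that a shorter end page such as 111S-2S abbreviates 111-112. The names roman_to_int and count_pages are hypothetical, and the Roman numeral conversion does not validate malformed input.

```python
import re

ROMAN = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

def roman_to_int(text):
    """Convert a Roman numeral to an int (raises KeyError on other letters)."""
    values = [ROMAN[ch] for ch in text.upper()]
    total = 0
    for cur, nxt in zip(values, values[1:] + [0]):
        # A smaller value before a larger one is subtracted (IV, XC, ...).
        total += -cur if cur < nxt else cur
    return total

def parse_num(token):
    token = token.strip()
    digits = re.search(r'\d+', token)
    if digits:                       # "326", "111S", "A078": keep only the digits
        return int(digits.group())
    return roman_to_int(token)       # "xlvii", "XC"

def count_pages(spec):
    count = 0
    for part in spec.split(','):
        first, _, last = part.partition('-')
        a = parse_num(first)
        b = parse_num(last) if last else a
        if b < a:
            # Assumed abbreviation: "111-2" stands for 111-112.
            b = a - a % 10 ** len(str(b)) + b
        count += b - a + 1
    return count

for spec in ['6-10, 19-22', 'xlvii-xlviii', '111S-2S', '326', 'A078-132', 'XC-CIII']:
    print(spec, count_pages(spec))
# 6-10, 19-22 9
# xlvii-xlviii 2
# 111S-2S 2
# 326 1
# A078-132 55
# XC-CIII 14
```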

How to refine a python script for a bioinformatics query

I am quite new to python and I would be grateful for some assistance if possible. I am comparing the genomes of two closely related organisms [E_C & E_F] and trying to identify some basic insertions and deletions. I have run a FASTA pairwise alignment (glsearch36) using sequences from both organisms.
The below is a section of my python script where I have been able to identify a 7 nucleotide (heptamer) in one sequence (database) that corresponds to a gap in the other sequence (query). This is an example of what I have:
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG
Assume the gap is at position 9. I am trying to refine the script to select gaps that are 20 nucleotides or more apart on both sequences and only if the surrounding nucleotides also match
ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
                 ||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC
This is the section of my script; the top half deals with opening different files. It also prints a dictionary with the count of each sequence at the end.
list_of_positions = []
for match in re.finditer(r'(?=(%s))' % re.escape("-"), dict_seqs[E_C]):
    list_of_positions.append(match.start())
set_of_positions = set(list_of_positions)
for position in list_of_positions:
    list_no_indels = []
    for number in range(position-20, position):
        list_no_indels.append(number)
    for number in range(position+1, position+21):
        list_no_indels.append(number)
    set_no_indels = set(list_no_indels)
    if len(set_no_indels.intersection(set_of_positions)) > 0: continue
    if len(set_no_indels.intersection(set_of_positions_EF)) > 0: continue
    print position
    #print match.start()
    print dict_seqs[E_F][position -3:position +3]
    key = dict_seqs[E_F][position -3: position +3]
    if nt_dict.has_key(key):
        nt_dict[key] += 1
    else:
        nt_dict[key] = 1
print nt_dict
Essentially, I was trying to edit the results of pairwise alignments to try and identify the nucleotides opposite the gaps in both the query and database sequences in order to conduct some basic Insertion/Deletion analysis.
I was able to solve one of my earlier issues by increasing the distance between gaps "-" to 20 nt in an attempt to reduce noise, and this has improved my results. The script above has been edited accordingly.
This is an example of my results; at the end I have a dictionary which counts the occurrences of each sequence.
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)
ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)
However, I am still trying to fix the script where I get the nucleotides around the gap to match exactly such as this, where the | is just to show matching nt's on each sequence:
GGTTACCG-ACCTGTATGTGAACTCAACA # query
     ||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database
9
ACCGAC
Any help with this would be gratefully appreciated!

I think I understand what you are trying to do, but as @alko has said, comments in your code will definitely help a lot.
As to finding an exact match around the gap you could run a string comparison:
Something along the lines of:
if query[position -3: position] == database[position -3: position] and query[position +1: position +3] == database[position +1: position +3]:
    # Do something
You will need to replace "query" and "database" with what you have called your strings that you want to compare.
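The comparison can also be wrapped in a small helper so that the flank width is a parameter. This is a sketch only: flanks_match is a hypothetical name, position is taken to be the 0-based index of the gap character, and gaps too close to either end of the alignment are simply rejected.

```python
def flanks_match(query, database, position, flank=3):
    """True if the `flank` characters on each side of `position` are identical
    in both aligned sequences."""
    if position < flank or position + flank >= len(query):
        return False                 # not enough sequence on one side of the gap
    left = slice(position - flank, position)
    right = slice(position + 1, position + 1 + flank)
    return query[left] == database[left] and query[right] == database[right]

# Alignment taken from the question; the gap is at 0-based index 20.
query = "ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA"
database = "GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG"
gap = query.index("-")
print(gap, flanks_match(query, database, gap))  # 20 True
```

Using the same flank width on both sides keeps the check symmetric. The one-line comparison above it checks three nucleotides before the gap but only two after it.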
