Switch statement using numeric patterns - python

I'm attempting to create a switch statement for a given value. The value is a 16-bit unsigned number, and I want to jump to the appropriate pattern. Each pattern is a hexadecimal string, but an underscore denotes a wildcard. For example, 0x1234 matches '1234' and '12_4' but not '56_8'. While I'm only posting a subset of these patterns, assume they cover the entire range of 0x0000-0xFFFF.
patterns = {
    '15__': foo,
    '2__0': bar,
    '8__0': baz,
    ...
}
...
def run(self, x: int) -> None:
    # x to string (0x567f -> "567F")
    x_str = str(hex(x))[2:].zfill(4).upper()
    # Search for the matching pattern and execute the associated method
    for pattern, instruction in patterns.items():
        if all([x_str[i] == pattern[i] for i in range(len(pattern)) if pattern[i] != '_']):
            instruction(x)
            break
Now, this works. However, it is incredibly slow, and defeats the purpose of using a dictionary since it just iterates through it. Also, since it has to convert x to a string (with formatting) then check that string against the pattern string, the whole thing is a giant bottleneck. I'm looking for a way to, preferably, get it closer to an actual lookup table, bonus points if we don't need to convert x to a string.

A switch statement of this type is an iterative process, not a jump table. In the general case you present, the way to avoid iteration is to generate the graph of partial indexing decisions, based on the specific arrangement of common digits (hexits) and wild cards in your table.
Instead, try simply speeding up your matching. I suggest that you take a mask-and-match approach to your table keys. Code the "don't-care" (wild-card) positions separately, and keep the key as a tuple of match and mask values. Your examples would be
patterns = {
    (0x1500, 0xFF00): foo,
    (0x2000, 0xF00F): bar,
    (0x8000, 0xF00F): baz,
    ...
}
To check a particular key against your candidate cand, you look for bit equality, but mask off any mismatches in the wild-card positions:
diff = cand ^ match   # bit inequality; each mismatched bit is 1
diff & mask           # force the don't-care bits to 0
So that you can check
if (cand ^ match) & mask:
    continue   # Something doesn't match
else:
    return value   # the dict value stored under (match, mask)
Your dict format is
(match, mask): value
Can you handle the logic for iteration and return value?
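One way the mask-and-match idea could be carried through to a real lookup table is sketched below. This is only a sketch under assumptions: the helper names compile_pattern and build_lookup are hypothetical, foo, bar and baz are stand-in handlers, and precomputing all 65536 entries is an addition that the answer above does not propose.

```python
from typing import Callable, Dict, List, Optional, Tuple


def compile_pattern(pattern: str) -> Tuple[int, int]:
    """Turn a pattern such as '2__0' into a (match, mask) pair."""
    match = 0
    mask = 0
    for ch in pattern:
        match <<= 4
        mask <<= 4
        if ch != "_":            # a literal hex digit: compare all 4 bits
            match |= int(ch, 16)
            mask |= 0xF
    return match, mask


def build_lookup(patterns: Dict[str, Callable]) -> List[Optional[Callable]]:
    """Precompute the handler for every 16-bit value (first match wins)."""
    compiled = [(compile_pattern(p), fn) for p, fn in patterns.items()]
    table: List[Optional[Callable]] = []
    for x in range(0x10000):
        handler = None
        for (match, mask), fn in compiled:
            if ((x ^ match) & mask) == 0:   # no mismatch outside the wildcards
                handler = fn
                break
        table.append(handler)
    return table


# Hypothetical stand-in handlers, for illustration only.
def foo(x): return "foo"
def bar(x): return "bar"
def baz(x): return "baz"


patterns = {"15__": foo, "2__0": bar, "8__0": baz}
LOOKUP = build_lookup(patterns)

handler = LOOKUP[0x15AB]
if handler is not None:
    print(handler(0x15AB))
```

The table is built once at start-up; after that, dispatch is a single list index with no string conversion. Entries stay None for values that no pattern covers.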

Related

Trying to print keys based on their value type (int or str) from a dictionary of lists

I'm learning to access dictionary keys-values and work with list comprehensions. My assignment asks me to:
"Use a while loop that prints only variant names located in chromosomes that do not have numbers (e.g., X)."
And I'm working with this dictionary of lists, where the keys are variant names. The zeroth element of each list value ([0]) is a string of the form chromosome:location, so the characters to the left of the colon are the chromosome name and the characters to the right of the colon are the chromosome location. The [1] values are gene names.
cancer_variations={"rs13283416": ["9:116539328-116539328+","ASTN2"],\
"rs17610181":["17:61590592-61590592+","NACA2"],\
"rs1569113445":["X:12906527-12906527+","TLR8TLR8-AS1"],\
"rs143083812":["7:129203569-129203569+","SMO"],\
"rs5009270":["7:112519123-112519123+","IFRD1"],\
"rs12901372":["15:67078168-67078168+","SMAD3"],\
"rs4765540":["12:124315096-124315096+","FAM101A"],\
"rs3815148":["CHR_HG2266_PATCH:107297975-107297975+","COG5"],\
"rs12982744":["19:2177194-2177194+","DOT1L"],\
"rs11842874":["13:113040195-113040195+","MCF2L"]}
I have found how to print the variant names based on the length of the zeroth element in the lists (the chromosome names):
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if (len(tmp_info[0])>3):
        print(rs)
But I'm having trouble printing the key values, the variant names, based on the TYPE of the chromosome name, the zeroth element in the list values. To that end, I've devised this code, but I'm not sure how to phrase the Boolean values to print only if the chromosome name is one particular type, (Str) or (int).
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if tmp_info[0] = type.str
        print(rs)
I am not sure exactly what I'm not seeing here with my syntax.
Any help will be greatly appreciated.
If I understand you right, you want to check if the first part before : contains a number or not.
You can iterate the string character-by-character and use str.isnumeric() to check if the character is number or not. If any character is a number, continue to next item:
cancer_variations = {
    "rs13283416": ["9:116539328-116539328+", "ASTN2"],
    "rs17610181": ["17:61590592-61590592+", "NACA2"],
    "rs1569113445": ["X:12906527-12906527+", "TLR8TLR8-AS1"],
    "rs143083812": ["7:129203569-129203569+", "SMO"],
    "rs5009270": ["7:112519123-112519123+", "IFRD1"],
    "rs12901372": ["15:67078168-67078168+", "SMAD3"],
    "rs4765540": ["12:124315096-124315096+", "FAM101A"],
    "rs3815148": ["CHR_HG2266_PATCH:107297975-107297975+", "COG5"],
    "rs12982744": ["19:2177194-2177194+", "DOT1L"],
    "rs11842874": ["13:113040195-113040195+", "MCF2L"],
}
for k, (v, *_) in cancer_variations.items():
    if not any(ch.isnumeric() for ch in v.split(":")[0]):
        print(k)
Prints:
rs1569113445
You need to look up how to determine your desired classification of the data. In this case, all you need is to differentiate alphabetic data from numeric:
if tmp_info[0].isalpha():
    print(rs)
Should get you on your way.
First you need to make sure what you want to do.
If what you want is to distinguish a numeric string from a normal string, then you may want to know that a numeric string consists strictly of numeric characters; if it contains any other character, Python does not consider it numeric. You can check this with a small experiment:
print('23123'.isnumeric())
print('2312ds3'.isnumeric())
Results in:
True
False
Numeric strings are what you are looking to exclude; any other chromosome name, which stays an ordinary str, will fit, if I'm understanding correctly.
So, in that manner, we are going to iterate over the dict, using the loop you've made:
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if not tmp_info[0].isnumeric():
        print(rs)
Which results in:
rs1569113445
rs3815148
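Since the assignment asks for a while loop specifically, the same check can be driven by an index instead of a for loop. A minimal sketch, assuming a shortened copy of the dictionary and a hypothetical helper name non_numeric_variants. It uses the "no digit anywhere in the chromosome name" reading of the task, so CHR_HG2266_PATCH is excluded:

```python
# Shortened copy of the dictionary from the question (assumption: three
# entries are enough to illustrate the loop).
cancer_variations = {
    "rs13283416": ["9:116539328-116539328+", "ASTN2"],
    "rs1569113445": ["X:12906527-12906527+", "TLR8TLR8-AS1"],
    "rs3815148": ["CHR_HG2266_PATCH:107297975-107297975+", "COG5"],
}


def non_numeric_variants(variations):
    """Return variant names whose chromosome name contains no digit."""
    names = list(variations)      # dict keys in insertion order
    found = []
    i = 0
    while i < len(names):
        chromosome = variations[names[i]][0].split(":")[0]
        if not any(ch.isdigit() for ch in chromosome):
            found.append(names[i])
        i += 1                    # advance, or the while loop never ends
    return found


for name in non_numeric_variants(cancer_variations):
    print(name)
```

Replacing the condition with `not chromosome.isnumeric()` would give the other reading, in which only purely numeric chromosome names are excluded.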

python basic regex function

I am trying to write a function that implements a simple regex matching algorithm. The special characters "*" and "?" should stand for 1 and n>=0 degrees of freedom respectively; that is, "*" matches exactly one character and "?" matches any run of zero or more characters. For example the strings
y="abc" and x="a*c",
y="abc" and x="a?c",
y="abddddzfjc" and x="a?" or x="a?c"
should return True, whereas the strings
y="abcd" and x="a*d",
y="abcdef" and x="a?d*"
should return False.
My method is to run in a loop and shorten the strings as each subsequent match is identified, which works fine for identical matches or single * with alphabet character matches, but I am stumped about how to do it for edge cases like the last example. To handle the case where "?" has n degrees of freedom, I loop forward in the right string to find the next alphabet character, then try to find that character in the left string, looking from right to left. I am sure there is a more elegant way (maybe with a generator?!).
def match_func(x,y):
    x, y = list(x), list(y)
    if len(x)==len(y)==1:
        if x[0] == y[0] or bool((set(x)|set(y)) & {"?","*"}):
            return True
    elif len(x)>0 and len(y)==0:
        return False
    else:
        for ix, char in enumerate(x):
            if char==y[ix] or char=="*":
                return match_func(x[ix+1:],y[ix+1:])
            else:
                if char=="?":
                    if ix==len(x)-1: return True
                    ##check if the next letter in x has an eventual match in y
                    peek = ix+1
                    next_char = x[peek]
                    while peek<len(x)-1:
                        next_char = x[peek]
                        if next_char.isalpha():
                            break
                        else: peek+=1
                    if peek == len(x)-1:
                        return True
                    ys = ''.join(y)
                    next_char_ix = ys[ix].rfind(next_char)
                    ##search y for next possible match?
                    if next_char_ix!=-1:
                        return match_func(x[peek:], y[next_char_ix:])
                    else:
                        return False
                else:
                    return False
    return True
First decide whether to make your match algorithm a minimal or maximal search. Meaning, if your pattern is a, and your subject string is aa, does the match occur at the first or second position? As you state the problem, either choice seems to be acceptable.
Having made that choice, it will become clear how you should traverse the string - either as far to the right as possible and then working backward until you either match or fail; or starting at the left and backtracking after each attempt.
I recommend a recursive implementation either way. At each position, evaluate whether you have a possible match. If so, make your recursive call advancing the appropriate amount down both the pattern and subject string. If not, give up. If there is no match for the first character of the pattern, advance only the subject string (according to your minimal/maximal choice) and try again.
The tricky part is, you have to consider variable-length tokens in your pattern as possible matches even if the same character also matches a literal character following that wildcard. That puts you in the realm of depth-first search. Evaluating patterns like a?a?a?a on subject strings like aaaabaaaa will be lots of fun, and if you push it too far, may take years to complete.
Your professor chose well the regex operators to give you to make the assignment of meaningful depth, without the tedium of writing a full-on parser and lexer just to make the thing work.
Good luck!
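The recursive approach described above could look roughly like the sketch below. It assumes the question's semantics ("*" matches exactly one character, "?" matches zero or more characters, and the whole string must match). The names wildcard_match and go are hypothetical, and the memoization with functools.lru_cache is an addition meant to keep patterns like a?a?a?a from blowing up, not something the answer prescribes.

```python
from functools import lru_cache


def wildcard_match(pattern: str, text: str) -> bool:
    """Full match: '*' is exactly one character, '?' is zero or more."""

    @lru_cache(maxsize=None)
    def go(p: int, t: int) -> bool:
        # p and t are the current positions in pattern and text.
        if p == len(pattern):
            return t == len(text)          # both must be used up
        token = pattern[p]
        if token == "?":
            # Either '?' matches nothing more, or it swallows one more
            # character and stays in play for the next one.
            return go(p + 1, t) or (t < len(text) and go(p, t + 1))
        if t < len(text) and (token == "*" or token == text[t]):
            return go(p + 1, t + 1)        # consume one character of each
        return False

    return go(0, 0)


for x, y in [("a*c", "abc"), ("a?c", "abddddzfjc"),
             ("a*d", "abcd"), ("a?d*", "abcdef")]:
    print(x, y, wildcard_match(x, y))
```

Because each (p, t) pair is evaluated at most once, the work is bounded by the product of the two lengths rather than growing exponentially with the number of "?" tokens. Very long inputs could still hit Python's recursion limit.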

In Python, does a set count as a buffer?

I am working through Cracking the Coding Interview (4th ed), and one of the questions is as follows:
Design an algorithm and write code to remove the duplicate characters in a string
without using any additional buffer. NOTE: One or two additional variables are fine.
An extra copy of the array is not.
I have written the following solution, which satisfies all of the test cases specified by the author:
def remove_duplicate(s):
    return ''.join(sorted(set(s)))
print(remove_duplicate("abcd")) # output "abcd"
print(remove_duplicate("aaaa")) # output "a"
print(remove_duplicate("")) # output ""
print(remove_duplicate("aabb")) # output "ab"
Does my use of a set in my solution count as the use of an additional buffer, or is my solution adequate? If my solution is not adequate, what would be a better way to go about this?
Thank you very much!
Only the person administering the question or evaluating the answer could say for sure, but I would say that a set does count as a buffer.
If there are no repeated characters in the string, the length of the set would equal that of the string. In fact, since a set is built on a hash table and carries significant overhead, the set would probably take more memory than the string. If the string holds Unicode, the number of unique characters could be very large.
If you do not know how many unique characters are in the string, you will not be able to predict the length of the set. The possibly long and probably unpredictable length of the set makes it count as a buffer, or worse, given that it may take more space than the string itself.
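The memory claim can be checked roughly with sys.getsizeof. A small sketch, assuming CPython (exact sizes are implementation details and vary between versions). Note that getsizeof counts only the set's own hash table, not the character objects it refers to:

```python
import string
import sys

text = string.ascii_letters        # 52 distinct characters, no repeats
unique = set(text)

# Same number of elements, but the set's hash table is much larger
# than the compact string storage.
print(len(text), len(unique))
print(sys.getsizeof(text), sys.getsizeof(unique))
```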
To follow up on v.coder's comment, I rewrote the code he (or she) was referring to in Python, and added some comments to try to explain what is going on.
def removeduplicates(s):
    """Original java implementation by
    Druv Gairola (http://stackoverflow.com/users/495545/dhruv-gairola)
    in his/her answer
    http://stackoverflow.com/questions/2598129/function-to-remove-duplicate-characters-in-a-string/10473835#10473835
    """
    # python strings are immutable, so first converting the string to a list of integers,
    # each integer representing the ascii value of the letter
    # (hint: look up "ascii table" on the web)
    L = [ord(char) for char in s]
    # easiest solution is to use a set, but to use Druv Gairola's method...
    # (hint, look up "bitmaps" on the web to learn more!)
    bitmap = 0
    #seen = set()
    for index, char in enumerate(L):
        # first check for duplicates:
        # number of bits to shift left (the space is the "lowest"
        # character on the ascii table, and 'char' here is the position
        # of the current character in the ascii table. so if 'char' is
        # a space, the shift length will be 0, if 'char' is '!', shift
        # length will be 1, and so on. This naturally requires the
        # integer to actually have as many "bit positions" as there are
        # characters in the ascii table from the space to the ~,
        # but python uses "very big integers" (BigNums? I am not really
        # sure here..) - so that's probably going to be fine..
        shift_length = char - ord(' ')
        # make a new integer where only one bit is set;
        # the bit position the character corresponds to
        bit_position = 1 << shift_length
        # if the same bit is already set [to 1] in the bitmap,
        # the result of AND'ing the two integers together
        # will be an integer where only that exact bit is
        # set - but that still means that the integer will be greater
        # than zero. (assuming that the so-called "sign bit" of the
        # integer doesn't get set. Again, I am not entirely sure about
        # how python handles integers this big internally.. but it
        # seems to work fine...)
        bit_position_already_occupied = bitmap & bit_position > 0
        if bit_position_already_occupied:
            #if char in seen:
            L[index] = 0
        else:
            # update the bitmap to indicate that this character
            # is now seen.
            # so, same procedure as above. first find the bit position
            # this character represents...
            bit_position = char - ord(' ')
            # make an integer that has a single bit set:
            # the bit that corresponds to the position of the character
            integer = 1 << bit_position
            # "add" the bit to the bitmap. The way we do this is that
            # we OR the current bitmap with the integer that has the
            # required bit set to 1. The result of OR'ing two integers
            # is that all bits that are set to 1 in *either* of the two
            # will be set to 1 in the result.
            bitmap = bitmap | integer
            #seen.add(char)
    # finally, turn the list back to a string to be able to return it
    # (again, just kind of a way to "get around" immutable python strings)
    return ''.join(chr(i) for i in L if i != 0)

if __name__ == "__main__":
    print(removeduplicates('aaaa'))
    print(removeduplicates('aabcdee'))
    print(removeduplicates('aabbccddeeefffff'))
    print(removeduplicates('&%!%)(FNAFNZEFafaei515151iaaogh6161626)([][][ ao8faeo~~~````%!)"%fakfzzqqfaklnz'))

The number of differences between characters in a string in Python 3

Given a string, let's say "TATA__", I need to find the total number of differences between adjacent characters in that string, i.e. there is a difference between T and A, but not a difference between A and A, or _ and _.
My code more or less tells me this. But when a string such as "TTAA__" is given, it doesn't work as planned.
I need to take a character in that string, and check if the character next to it is not equal to the first character. If it is indeed not equal, I need to add 1 to a running count. If it is equal, nothing is added to the count.
This is what I have so far:
def num_diffs(state):
    count = 0
    for char in state:
        if char != state[char2]:
            count += 1
        char2 += 1
    return count
When I run it using num_diffs("TATA__") I get 4 as the response. When I run it with num_diffs("TTAA__") I also get 4. Whereas the answer should be 2.
If any of that makes sense at all, could anyone help in fixing it/pointing out where my error lies? I have a feeling it has to do with state[char2]. Sorry if this seems like a trivial problem, it's just that I'm totally new to the Python language.
import operator
def num_diffs(state):
    return sum(map(operator.ne, state, state[1:]))
To open this up a bit, it maps !=, operator.ne, over state and state beginning at the 2nd character. The map function accepts multiple iterables as arguments and passes elements from those one by one as positional arguments to the given function, until one of the iterables is exhausted (state[1:] in this case will stop first).
The map results in an iterable of boolean values, but since bool in python inherits from int you can treat it as such in some contexts. Here we are interested in the True values, because they represent the points where the adjacent characters differed. Calling sum over that mapping is an obvious next step.
Apart from the string slicing the whole thing runs using iterators in python3. It is possible to use iterators over the string state too, if one wants to avoid slicing huge strings:
import operator
from itertools import islice
def num_diffs(state):
    return sum(map(operator.ne,
                   state,
                   islice(state, 1, len(state))))
There are a couple of ways you might do this.
First, you could iterate through the string using an index, and compare each character with the character at the previous index.
Second, you could keep track of the previous character in a separate variable. The second seems closer to your attempt.
def num_diffs(s):
    count = 0
    prev = None
    for ch in s:
        if prev is not None and prev!=ch:
            count += 1
        prev = ch
    return count
prev is the character from the previous loop iteration. You assign it to ch (the current character) at the end of each iteration so it will be available in the next.
You might want to investigate Python's groupby function which helps with this kind of analysis.
from itertools import groupby
def num_diffs(seq):
    return len(list(groupby(seq))) - 1

for test in ["TATA__", "TTAA__"]:
    print(test, num_diffs(test))
This would display:
TATA__ 4
TTAA__ 2
The groupby() function works by grouping consecutive identical entries together. It yields a key and a group for each run, the key being the repeated entry and the group being an iterator over the matching entries. Each new group after the first therefore marks a point where adjacent characters differ, which is why the count is the number of groups minus one (for an empty string this gives -1, so guard that case if it can occur).
Trying to make as few modifications to your original code as possible:
def num_diffs(state):
    count = 0
    for char2 in range(1, len(state)):
        if state[char2 - 1] != state[char2]:
            count += 1
    return count
One of the problems with your original code was that the char2 variable was never initialized before being used as an index, so as posted the function cannot run: Python raises an UnboundLocalError on the first comparison.
However, working with indices is not the most Pythonic way and it is error prone (see comments for a mistake that I made). You may want to rewrite the function in such a way that it does one loop over a pair of strings, a pair of characters at a time:
def num_diffs(state):
    count = 0
    for char1, char2 in zip(state[:-1], state[1:]):
        if char1 != char2:
            count += 1
    return count
Finally, that very logic can be written much more succinctly; see @Ilja's answer.

Python memory error when searching substring

I am trying to find the substrings of a very large string and I am getting a memory error.
The code:
from collections import Counter

def substr(string):
    le = []
    st = list(string)
    for s in xrange(len(string)+1):
        for s1 in xrange(len(string)+1):
            le.append(''.join(st[s:s1]))
    cou = Counter(le)
    cou_val = cou.keys()
    cou_val.remove('')
    return le, cou_val
I am getting this error:
  File "solution.py", line 31, in substr
    le.append(''.join(st[s:s1]))
MemoryError
How to tackle this problem?
Answer
I noticed that your code builds all the possible substrings of string in a certain order. I suggest that instead of storing all of them in a list, you use code to return just the substring that you want. I tested the subroutine below with 'a very long string' and it always returns the same value as if you were to get an indexed value from the list.
string = 'a very long string'
def substr2(string,i):
    return string[i//(len(string)+1):i%(len(string)+1)]
print(substr2(string,10))
Solution
The way you order the variables of your for loops (s,s1) works similarly to a number system. s1 increments by 1 until it gets to a given value, then it resets to 0 and s increments by 1, repeating the cycle. This is seen in a decimal system (e.g. 01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16 etc.)
The i//n div operator returns the integer value of i/n. (e.g. 14//10=1).
The i%n mod operator returns the remainder value of i/n. (e.g. 14%10 is 4).
So if we were to, for example, increment i by 1 and define (s,s1) as [i//10,i%10], we would get:
[0,0],[0,1],[0,2],[0,3],[0,4],[0,5],[0,6],[0,7],[0,8],[0,9],[1,0],[1,1],[1,2] etc.
We can utilize these to produce the same answer as in your array.
PS. My first answer. Hope this helps!
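The claim that the index arithmetic reproduces the nested loops can be checked on a short string. A minimal sketch in Python 3 (range instead of the question's xrange); the helper all_substrings is hypothetical and exists only for the comparison, and divmod is used as an equivalent of the separate // and % operations:

```python
def substr2(string, i):
    """Return the i-th slice in the nested-loop order, without a list."""
    n = len(string) + 1
    start, end = divmod(i, n)      # same as (i // n, i % n)
    return string[start:end]


def all_substrings(string):
    """The question's nested loops, kept only to compare against."""
    n = len(string) + 1
    return [string[s:s1] for s in range(n) for s1 in range(n)]


sample = "abcd"
expected = all_substrings(sample)
matches = all(substr2(sample, i) == sub for i, sub in enumerate(expected))
print(matches)
```

Only one slice exists at a time here, so memory use stays proportional to the length of a single substring.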
It seems that you are running out of memory. When the string is too large the code you posted seems to be copying it over and over again into the le list. As @Rikka's link suggests, buffer/memoryview may be of use for you but I have never used it.
As a workaround to your solution/code I would suggest that instead of storing each substring in le, store the indexes as a tuple. Additionally, I don't think that the st list is required (not sure though if your way speeds it up) so the result would be (code not tested):
from collections import Counter

def substr(string):
    le = []
    for s in xrange(len(string)+1):
        for s1 in xrange(len(string)+1):
            # Skip empty strings (any slice with s >= s1 is empty)
            if s < s1:
                le.append((s, s1))
    cou = Counter(le)
    # No '' key to remove any more: only non-empty (s, s1) pairs are stored
    cou_val = cou.keys()
    return le, cou_val
Now, an example of how you can use the substr is (code not tested):
myString = "very long string here"
matchString = "here"
matchPos = False
indexes, count = substr(myString)
# Get all the substrings without storing them simultaneously in memory
for i in indexes:
    # construct substring and compare
    if myString[i[0]:i[1]]==matchString:
        matchPos = i
        break
After the above you have start and end positions of the 1st occurrence of "here" into your large string. I am not sure what you try to achieve but this can easily be modified to find all occurrences, count matches, etc - I just post it as example. I am also not sure why the Counter is there...
This approach should not present the memory error, however, it is a trade-off between memory and CPU and I expect it to be bit slower on runtime since every time you use indexes you have to re-construct every substring.
Hope it helps
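The index list itself still grows quadratically with the string length, so a further step could be to generate the pairs lazily. A sketch in Python 3, assuming hypothetical helper names iter_spans and first_match; it is not the answerer's code:

```python
def iter_spans(length):
    """Yield (start, end) for every non-empty slice, one pair at a time."""
    for start in range(length):
        for end in range(start + 1, length + 1):
            yield start, end


def first_match(text, target):
    """Return the span of the first substring equal to target, or None."""
    for start, end in iter_spans(len(text)):
        if text[start:end] == target:
            return start, end
    return None


print(first_match("very long string here", "here"))  # (17, 21)
```

For a plain "where does this substring occur" search, the built-in str.find does the same job far faster; the generator is mainly useful when every substring really has to be visited.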
The solution:
A MemoryError is not caused by an index being out of range; Python slices never raise for out-of-range or reversed bounds. The slice technique does have some rules, though.
When the step is positive, such as 1, a slice is non-empty only when the first index is smaller than the second. Conversely, when the step is negative, such as -1, the first index has to be the greater one.
So in your program, whenever the index s is greater than or equal to s1, the slice is simply an empty string. The MemoryError comes from storing every one of the (n+1)^2 slices at once: for a string of length n, the non-empty ones alone hold n(n+1)(n+2)/6 characters in total.
