Parse nested expression to retrieve each inner function - Python

Suppose I have an expression as shown below:
expression = "LEN(Replace(Lower(UPPER([ProductName]+[ProductName])), 'chaichai', 'chai'))"
Required output:
['UPPER([ProductName]+[ProductName])','Lower(UPPER([ProductName]+[ProductName]))','Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai')','LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))']
I have tried the code below, but it does not give the required result:
result = []
exp_splits = expression.strip(')').split('(')
for i_enm, i in enumerate(range(len(exp_splits)-2, -1, -1), start=1):
    result.append(f"{'('.join(exp_splits[i:])}{')'*i_enm}")
print(result)
My code's output:
["UPPER([ProductName]+[ProductName])),'chaichai','chai')", "Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))", "Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai')))", "LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))))"]

import re
e = "LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))"
print ([e[i:j+1] for i in range(len(e)) for j in range(len(e)) if e[i:j+1].count('(') == e[i:j+1].count(')') != 0 and (e[i-1] == '(' or i == 0) and e[j] == ')'])
Output:
["LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))", "Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai')", 'Lower(UPPER([ProductName]+[ProductName]))', 'UPPER([ProductName]+[ProductName])']
Unfolded version:
for i in range(len(e)):
    for j in range(len(e)):
        # Check for the same (non-zero) number of opening and closing parentheses
        if e[i:j+1].count('(') == e[i:j+1].count(')') != 0:
            # Check that (the first char is preceded by an opening parenthesis OR is the beginning of e) AND the last char is a closing parenthesis
            if (e[i-1] == '(' or i == 0) and e[j] == ')':
                print (e[i:j+1])
Output:
LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))
Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai')
Lower(UPPER([ProductName]+[ProductName]))
UPPER([ProductName]+[ProductName])

Approach using split and a for loop
Here is another approach: split the string on the left and right parentheses, then concatenate the pieces step by step to rebuild each expression.
Assumption: the expression has an equal number of left and right parentheses, and the calls form a single nested chain (no sibling calls such as F(G(x), H(y))).
Step 1: Count the number of left parentheses in the string.
Step 2: Split the expression by left parenthesis.
Step 3: Pop the last item from that list. It holds the innermost argument plus everything to its right, including all the right parentheses.
Step 4: Split the popped string by right parenthesis.
Step 5: Now that you have both sides, stitch them together.
Note: While concatenating, the left side is consumed from right to left (index -1 down to 0) and the right side from left to right (index 0 up to n-1).
Note: Each iteration wraps the previous result with the next left and right pieces.
The code is shown below:
expression = "LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))"
n = expression.count('(')
exp_left = expression.split('(')
exp_right = exp_left.pop().split(')')
exp_list = []
exp_string = ''
for i in range(n):
    exp_string = exp_left[-i-1] + '(' + exp_string + exp_right[i] + ')'
    exp_list.append(exp_string)
for exp in exp_list: print (exp)
The output of this will be:
UPPER([ProductName]+[ProductName])
Lower(UPPER([ProductName]+[ProductName]))
Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai')
LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))
The code below is the same as above, with a comment on each line explaining what is being done.
expression = "LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))"
#find the number of function calls in the string. Count of '(' will give you the number
n = expression.count('(')
#split the string by left parenthesis. You get all the functions + middle part + right hand side
exp_left = expression.split('(')
#middle + right hand part is at position index -1. Use pop to remove the last value
#Use the popped string to split by right parenthesis
#result will be middle part + all right hand side parts.
#empty string if there was no text between two right parenthesis
exp_right = exp_left.pop().split(')')
#define a list to store all the expressions
exp_list = []
#Now put it all together looping thru n times
#store the expression in a string so you can concat left and right to it each time
exp_string = ''
for i in range(n):
    #for each iteration, concat left side + ( + middle string + right side + )
    #left hand side: concat from right to left (-1 to 0)
    #right hand side: concat from left to right (0 to n-1)
    exp_string = exp_left[-i-1] + '(' + exp_string + exp_right[i] + ')'
    #append the expression to the expression list
    exp_list.append(exp_string)
#print each string separately
for exp in exp_list: print (exp)
Approach using While Statement
Here's how to do the search and extract version.
e = "LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))"
x = e.count('(')
for i in range(x-1): e = e[e.find('(')+1:]
expression = e[:e.find(')')+1]
print (expression)
The result of this will be:
UPPER([ProductName]+[ProductName])
If you want all of them, then you can do this until you reach the innermost brackets.
e = "LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))"
#x = e.count('(')
#for i in range(x-1): e = e[e.find('(')+1:]
#expression = e[:e.find(')')+1]
exp_list = [e]
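while e.count('(') > 1:
    # strip the outermost NAME( ... ) wrapper
    e = e[e.find('(')+1:e.rfind(')')]
    # drop trailing arguments so the string ends at the inner call's ')'
    while e[-1] != ')': e = e[:e.rfind(')')+1]
    exp_list.append(e)
for exp in exp_list:
    print (exp)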
The output of this will be:
LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))
Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai')
Lower(UPPER([ProductName]+[ProductName]))
UPPER([ProductName]+[ProductName])
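The approaches above all rely on the calls forming one nested chain. As a hedged alternative, here is a minimal sketch that keeps a stack of the positions where each still-open call starts, so sibling calls such as A(B(x),C(y)) are handled too. The function name inner_calls is made up for this example, and the sketch assumes that function names consist of letters, digits and underscores and that no parentheses occur inside quoted string arguments.

```python
def inner_calls(expression):
    """Return every balanced NAME(...) call in the expression, innermost first."""
    starts = []  # start index (beginning of the name) of each call still open
    calls = []
    for pos, ch in enumerate(expression):
        if ch == '(':
            # Walk left over the identifier that precedes this '('.
            start = pos
            while start > 0 and (expression[start - 1].isalnum()
                                 or expression[start - 1] == '_'):
                start -= 1
            starts.append(start)
        elif ch == ')':
            if not starts:
                raise ValueError("unbalanced ')' at index %d" % pos)
            # The most recently opened call is the one being closed.
            calls.append(expression[starts.pop():pos + 1])
    if starts:
        raise ValueError("unbalanced '(' in expression")
    return calls


e = "LEN(Replace(Lower(UPPER([ProductName]+[ProductName])),'chaichai','chai'))"
for call in inner_calls(e):
    print(call)
```

Each call is emitted when its closing parenthesis is reached, so the list comes out innermost first, which is the order asked for in the question.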

Related

Ignoring Changed Index Check (Python)

I have made a script:
our_word = "Success"
def duplicate_encode(word):
    char_list = []
    final_str = ""
    changed_index = []
    base_wrd = word.lower()
    for k in base_wrd:
        char_list.append(k)
    for i in range(0, len(char_list)):
        count = 0
        for j in range(i + 1, len(char_list)):
            if j not in changed_index:
                if char_list[j] == char_list[i]:
                    char_list[j] = ")"
                    changed_index.append(j)
                    count += 1
                else:
                    continue
        if count > 0:
            char_list[i] = ")"
        else:
            char_list[i] = "("
    print(changed_index)
    print(char_list)
    final_str = "".join(char_list)
    return final_str

print(duplicate_encode(our_word))
Essentially, the purpose of this script is to convert a string to a new string where each character is "(" if that character appears only once in the original string, or ")" if it appears more than once. I have made a rather layered script (I am relatively new to Python, so I didn't want to use any helpful built-in functions) that attempts to do this. My issue is that the check for whether the current index has been previously edited (in order to prevent it from changing) seems to be ignored. So instead of the intended )())()) I get )()((((. I'd really appreciate an insightful answer explaining why I am getting this issue and how to work around it, since I'm trying to build an intuitive understanding of Python. Thanks!
word = "Success"
print(''.join([')' if word.lower().count(c) > 1 else '(' for c in word.lower()]))
The issue here has nothing to do with your understanding of Python. It's purely algorithmic. If you retain this 'layered' algorithm, it is essential that you add one more check in the "i" loop.
our_word = "Success"
def duplicate_encode(word):
    char_list = list(word.lower())
    changed_index = []
    for i in range(len(word)):
        count = 0
        for j in range(i + 1, len(word)):
            if j not in changed_index:
                if char_list[j] == char_list[i]:
                    char_list[j] = ")"
                    changed_index.append(j)
                    count += 1
        if i not in changed_index:  # the new important check: do not flip an already assigned ')' back to '('
            char_list[i] = ")" if count > 0 else "("
    return "".join(char_list)

print(duplicate_encode(our_word))
Your algorithm can be greatly simplified if you avoid using char_list as both the input and output. Instead, you can create an output list of the same length filled with ( by default, and then only change an element when a duplicate is found. The loops will simply walk along the entire input list once for each character looking for any matches (other than self-matches). If one is found, the output list can be updated and the inner loop will break and move on to the next character.
The final code should look like this:
def duplicate_encode(word):
    char_list = list(word.lower())
    output = list('(' * len(word))
    for i in range(len(char_list)):
        for j in range(len(char_list)):
            if i != j and char_list[i] == char_list[j]:
                output[i] = ')'
                break
    return ''.join(output)

for our_word in (
    'Success',
    'ChJsTk(u cIUzI htBp#qX)OTIHpVtHHhQ',
):
    result = duplicate_encode(our_word)
    print(our_word)
    print(result)
Output:
Success
)())())
ChJsTk(u cIUzI htBp#qX)OTIHpVtHHhQ
))(()(()))))())))()()((())))()))))
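Once you are comfortable with built-ins, the same idea (keep the input and the output separate) can be written with a single counting pass. This is only a sketch using collections.Counter from the standard library. It is not what the asker wanted to practise, but it shows the logic without the nested loops.

```python
from collections import Counter


def duplicate_encode(word):
    lowered = word.lower()
    # Count every character once up front instead of rescanning per character.
    counts = Counter(lowered)
    # Build the output from the counts; the input string is never modified.
    return ''.join(')' if counts[c] > 1 else '(' for c in lowered)


print(duplicate_encode("Success"))
```

Because the output is derived from the counts rather than from a list that is being overwritten, there is no changed_index bookkeeping to get wrong.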

Recursive Decompression of Strings

I'm trying to decompress strings using recursion. For example, the input:
3[b3[a]]
should output:
baaabaaabaaa
but I get:
baaaabaaaabaaaabbaaaabaaaabaaaaa
I have the following code but it is clearly off. The first find_end function works as intended. I am absolutely new to using recursion and any help understanding / tracking where the extra letters come from or any general tips to help me understand this really cool methodology would be greatly appreciated.
def find_end(original, start, level):
    if original[start] != "[":
        message = "ERROR in find_error, must start with [:", original[start:]
        raise ValueError(message)
    indent = level * " "
    index = start + 1
    count = 1
    while count != 0 and index < len(original):
        if original[index] == "[":
            count += 1
        elif original[index] == "]":
            count -= 1
        index += 1
    if count != 0:
        message = "ERROR in find_error, mismatched brackets:", original[start:]
        raise ValueError(message)
    return index - 1

def decompress(original, level):
    # set the result to an empty string
    result = ""
    # for any character in the string we have not looked at yet
    for i in range(len(original)):
        # if the character at the current index is a digit
        if original[i].isnumeric():
            # the character of the current index is the number of repetitions needed
            repititions = int(original[i])
            # start = the next index containing the '[' character
            x = 0
            while x < (len(original)):
                if original[x].isnumeric():
                    start = x + 1
                    x = len(original)
                else:
                    x += 1
            # last = the index of the matching ']'
            last = find_end(original, start, level)
            # calculate a substring using `original[start + 1:last]`
            sub_original = original[start + 1 : last]
            # RECURSIVELY call decompress with the substring
            # sub = decompress(original, level + 1)
            # concatenate the result of the recursive call times the number of repetitions needed to the result
            result += decompress(sub_original, level + 1) * repititions
            # set the current index to the index of the matching ']'
            i = last
        # else
        else:
            # concatenate the letter at the current index to the result
            if original[i] != "[" and original[i] != "]":
                result += original[i]
    # return the result
    return result

def main():
    passed = True
    ORIGINAL = 0
    EXPECTED = 1
    # The test cases
    provided = [
        ("3[b]", "bbb"),
        ("3[b3[a]]", "baaabaaabaaa"),
        ("3[b2[ca]]", "bcacabcacabcaca"),
        ("5[a3[b]1[ab]]", "abbbababbbababbbababbbababbbab"),
    ]
    # Run the provided tests cases
    for t in provided:
        actual = decompress(t[ORIGINAL], 0)
        if actual != t[EXPECTED]:
            print("Error decompressing:", t[ORIGINAL])
            print(" Expected:", t[EXPECTED])
            print(" Actual: ", actual)
            print()
            passed = False
    # print that all the tests passed
    if passed:
        print("All tests passed")

if __name__ == '__main__':
    main()
From what I gathered from your code, it probably gives the wrong result because of the approach you've taken to find the last matching closing brace at a given level (I'm not 100% sure, the code was a lot). However, I can suggest a cleaner approach using stacks (almost similar to DFS, without the complications):
def decomp(s):
    stack = []
    for i in s:
        if i.isalnum():
            stack.append(i)
        elif i == "]":
            # combine the two topmost items: either count * text or text + text
            temp = stack.pop()
            count = stack.pop()
            if count.isnumeric():
                stack.append(int(count)*temp)
            else:
                stack.append(count+temp)
    # fold whatever is left on the stack from right to left
    for i in range(len(stack)-2, -1, -1):
        if stack[i].isnumeric():
            stack[i] = int(stack[i])*stack[i+1]
        else:
            stack[i] += stack[i+1]
    return stack[0]

print(decomp("3[b]")) # bbb
print(decomp("3[b3[a]]")) # baaabaaabaaa
print(decomp("3[b2[ca]]")) # bcacabcacabcaca
print(decomp("5[a3[b]1[ab]]")) # abbbababbbababbbababbbababbbab
This works on a simple observation: rather than evaluating a substring as soon as you read a [, evaluate it when you encounter the matching ]. That lets you build the result after the inner pieces have been evaluated individually. (This is similar to evaluating a postfix expression with a stack.)
(You can add error checking to this as well, if you wish. It would be easier to check if the string is semantically correct in one pass and evaluate it in another pass, rather than doing both in one go)
Here is a solution built on a similar idea:
we walk through the string pushing everything onto a stack until we find ']', then pop back to the matching '[', collecting what comes off, read the number in front of it, multiply, and push the result back onto the stack.
It avoids repeated string concatenation by collecting the pieces in lists and joining them.
Note: the repeat count cannot be more than 9, because it is parsed as a single-character string.
def decompress(string):
    stack = []
    letters = []
    for i in string:
        if i != ']':
            stack.append(i)
        elif i == ']':
            # pop back to the matching '[', collecting the pieces in reverse order
            letter = stack.pop()
            while letter != '[':
                letters.append(letter)
                letter = stack.pop()
            word = ''.join(letters[::-1])
            letters = []
            # the item now on top of the stack is the single-digit repeat count
            stack.append(''.join([word for j in range(int(stack.pop()))]))
    return ''.join(stack)
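For completeness, here is a sketch that stays recursive, as the question asked. Tracing the question's code on 3[b3[a]] shows two things going wrong: assigning i = last inside for i in range(...) does not change the next value of i, so the characters inside the brackets are processed a second time, and the search for '[' always restarts from index 0 instead of from the current digit. The version below swaps the for loop for a while loop so the index can really jump past the matching ']'. The helper name find_matching is invented for this example, and multi-digit counts are accepted as an extra assumption.

```python
def find_matching(text, start):
    """Return the index of the ']' that closes the '[' at text[start]."""
    depth = 0
    for k in range(start, len(text)):
        if text[k] == '[':
            depth += 1
        elif text[k] == ']':
            depth -= 1
            if depth == 0:
                return k
    raise ValueError("mismatched brackets: " + text[start:])


def decompress(text):
    pieces = []
    i = 0
    while i < len(text):
        if text[i].isdigit():
            # Read the whole repeat count (possibly several digits).
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            if j == len(text) or text[j] != '[':
                raise ValueError("expected '[' after count: " + text[i:])
            end = find_matching(text, j)
            # Recurse on the bracket contents only, then repeat the result.
            pieces.append(decompress(text[j + 1:end]) * int(text[i:j]))
            i = end + 1  # a while loop lets us skip what was just handled
        else:
            pieces.append(text[i])
            i += 1
    return ''.join(pieces)


print(decompress("3[b3[a]]"))
```

The key design point is that every character is consumed exactly once: either copied directly, or handed to the recursive call as part of a bracketed block.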

list index not updating in for loop (python)

In the method separate I am trying to get the index of a ')' and then loop backwards until I reach '('. The reversed slice works, but it always starts from the index of the first ')'. Why is the index not updating?
class elements:
    periodic_table = ['']

    def __init__(self, equation):
        self.equation = equation

    def poly(self):
        polyatomic = 'C2H3O2', 'HCO3', 'HSO4', 'ClO', 'ClO3', 'ClO2', 'OCN', 'CN', 'H2PO4', 'OH', 'NO3', 'NO2', 'ClO4', 'MnO4', 'SCN',
        return polyatomic

    def separate(self):
        element = elements.equation
        list1 = []
        for first, second in zip(element, element[1:]):
            if first == ')' and second.isdigit():
                multiply = int(second)
                print(first, second)
                print(element.index(first))
                for multiplcation in element[element.index(first)::-1]:
                    if multiplcation == '(':
                        break
                    elif multiplcation != ')':
                        final = multiplcation * multiply
                        print(final)
            if first == '=':
                list1.append(first)
            elif first.isupper() and second.islower():
                list1.append(first + second)
            elif first.isupper() and second.isdigit():
                amount = first * int(second)
                list1.append(amount)
            elif first.isupper():
                list1.append(first)

elements = elements(
    'K4Fe(CN)6 + KMnO4 + H2SO4 = KHSO4 + Fe2(SO4)3 + MnSO4 + HNO3 + CO2 + H2O')
print(elements.separate())
The crux of the problem is here:
for multiplcation in element[element.index(first)::-1]:
You're backing up from the first occurrence of RPAREN in your string, not from the one you just found. In the given example, you will always return to walk through "CN" for any later parentheses.
I recommend that you redesign your code: split the equation into molecules, write a function that returns the expansion of each individual molecule, and join the results if you need the entire equation reconstituted.
That should get you past the current problem.
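To make the index problem concrete: str.index(')') always searches from the start of the string, so it returns the position of the first ')' no matter which one the loop is currently looking at. A minimal sketch of one way around it is to carry the position along with enumerate and search backwards from there with str.rfind. The function name parenthesized_groups is made up for this example, and it assumes single-digit multipliers and no nested parentheses.

```python
def parenthesized_groups(formula):
    """Return (group, multiplier) for every '(...)' followed by one digit."""
    groups = []
    for idx, ch in enumerate(formula):
        following = formula[idx + 1:idx + 2]  # '' at the end of the string
        if ch == ')' and following.isdigit():
            # Search backwards from THIS ')' rather than from the first one.
            start = formula.rfind('(', 0, idx)
            if start != -1:
                groups.append((formula[start + 1:idx], int(following)))
    return groups


equation = 'K4Fe(CN)6 + KMnO4 + H2SO4 = KHSO4 + Fe2(SO4)3 + MnSO4 + HNO3 + CO2 + H2O'
print(equation.index(')'))             # position of the first ')' only
print(parenthesized_groups(equation))
```

With the position tracked explicitly, the second group is read from around the second ')' instead of walking back through "CN" again.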

Python: Removing certain characters from a word using while loop

Given a set of characters, char = 'abcdef',
I would like to 1) remove the 3rd char wherever 3 of these chars appear in a row in a word, and 2) from the changed word, remove the 2nd char wherever 2 of these chars appear in a row.
E.g. for the word 'bacifeaghab' it should
1) first remove 'c' and 'a', i.e. ba(c)ife(a)hab, changing the word into 'baifehab'
2) then remove 'a', 'e', and 'b', i.e. b(a)if(e)ha(b), changing the word into 'bifha'
This is what I have done so far, but when I run it and pass in a word, it doesn't print anything. No error, not even a blank (' '); it just goes to the next line without the '>>>' prompt.
def removal(w):
    x,y = 0,0
    while y < len(w)-2:
        if (w[y] and w[y+1] and w[y+2]) in 'abcdef':
            w = w[:y+2] + w[y+3:]
    while x < len(w):
        if w[x] in 'abcedf':
            w = w[:x+1] + w[x+2:]
            x = x+1
        else :
            x = x+1
    return(w)
Could anyone find out what's wrong?
Since this was my first time using a while loop, I thought that having two while loops might be the problem, so I also tried:
def removal(w):
    x,y,z = 0,0,0
    while y < len(w)-2:
        if (w[y] and w[y+1] and w[y+2]) in 'abcdef':
            w = w[:y+2] + w[y+3:]
    return(w)
But I got the same result. I also tried print(w) at the end of the function, with the same result.
There were two kinds of mistakes in your code, which I have corrected.
First, you weren't incrementing x properly (you did not need the else branch), and you had forgotten to increment y at all, so the first loop never ends.
Second, the syntax of your if conditions was incorrect. The part in the brackets, (w[y] and w[y+1] and w[y+2]), evaluates to its last element, and only that one was checked against the string 'abcdef'. I have changed it to check each element in turn.
So now the function is:
def removal(w):
    chars = 'abcdef'
    x,y = 0,0
    while y < len(w)-2:
        if w[y] in chars and w[y+1] in chars and w[y+2] in chars:
            w = w[:y+2] + w[y+3:]
        y += 3
    # stop one short of the end so that w[x+1] is always a valid index
    while x < len(w)-1:
        if w[x] in chars and w[x+1] in chars:
            w = w[:x+1] + w[x+2:]
        x += 2
    return w
and calling it with 'bacifeahab' (without the 'g' which I think was a typo in your e.g.):
removal("bacifeahab")
returns what you wanted:
'bifha'
Hope this helps!

extract substring pattern

I have a long file with about 1200 sequences like these:
>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL
I want to extract every possible pattern that has a cysteine (C) in the middle, with five characters before it and five characters after it, i.e. xxxxxCxxxxx.
The output should look like this:
QDIQLCGMGIL
ILPEHCIIDIT
TISDNCVVIFS
FSKTSCSYCTM
This is my program. It only gives the position of C, so it does not work the way I want:
pos=[]
def find(ch,string1):
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
            return pos
z=find('C','AWERQRTCWERTYCTAAAACTTCTTT')
print(z)
You need to return outside the loop. You are returning on the first match, so you only ever get a single index in your list:
def find(ch,string1):
    pos = []
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
    return pos # outside
You can also use enumerate with a list comp in place of your range logic:
def indexes(ch, s1):
    # keep only indexes with at least five characters on each side
    return [index for index, char in enumerate(s1) if char == ch and 5 <= index <= len(s1) - 6]
Each index in the list comp is the character index and each char is the actual character so we keep each index where char is equal to ch.
If you want the five chars that are both sides:
In [24]: s="CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"
In [25]: inds = indexes("C",s)
In [26]: [s[i-5:i+6] for i in inds]
Out[26]: ['QDIQLCGMGIL', 'ILPEHCIIDIT']
I added checking the index as we obviously cannot get five chars before C if the index is < 5 and the same from the end.
You can do it all in a single function, yielding a slice when you find a match:
def find(ch, s):
    ln = len(s)
    for i, char in enumerate(s):
        if ch == char and 5 <= i <= ln - 6:
            yield s[i - 5:i + 6]
Presuming the data in your question is actually two lines from your file, like:
s=""">3fm8|A|A0JLQ2CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDALYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCY"""
Running:
for line in s.splitlines():
    print(list(find("C" ,line)))
would output:
['0JLQ2CFLVNL', 'QDIQLCGMGIL', 'ILPEHCIIDIT']
['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
That gives six matches, not four as your expected output suggests, so I presume you did not include all possible matches.
You can also speed up the code using str.find, starting at the last match index + 1 for each subsequent match:
def find(ch, s):
    ln, i = len(s) - 6, s.find(ch)
    # str.find returns -1 when there are no more matches, which ends the loop
    while 0 <= i <= ln:
        # a match closer than five characters to the start is skipped, not a stop signal
        if i >= 5:
            yield s[i - 5:i + 6]
        i = s.find(ch, i + 1)
Which will give the same output. Of course if the strings cannot overlap you can start looking for the next match much further in the string each time.
My solution is based on regex, and shows all possible solutions using regex and while loop. Thanks to #Smac89 for improving it by transforming it into a generator:
import re
string = """CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"""
# Generator
def find_cysteine2(string):
    # Create a loop that will utilize regex multiple times
    # in order to capture matches within groups
    while True:
        # Find a match
        data = re.search(r'(\w{5}C\w{5})',string)
        # If match exists, let's collect the data
        if data:
            # Collect the string
            yield data.group(1)
            # Shrink the string to not include
            # the previous result
            location = data.start() + 1
            string = string[location:]
        # If there are no matches, stop the loop
        else:
            break

print(list(find_cysteine2(string)))
# ['QDIQLCGMGIL', 'ILPEHCIIDIT', 'TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
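A lookahead gives the same overlapping matches in a single pass, without shrinking the string after every hit. The sketch below also joins the wrapped sequence lines of each record first, so a window can span a line break inside one sequence. It assumes the input is FASTA-like text where header lines start with '>'. The names read_records and cysteine_windows are invented for this example.

```python
import re


def read_records(text):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def cysteine_windows(sequence):
    # The lookahead (?=...) consumes nothing, so windows may overlap.
    return re.findall(r'(?=(\w{5}C\w{5}))', sequence)


data = """>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
"""
for header, sequence in read_records(data):
    print(header, cysteine_windows(sequence))
```

Keeping the header out of the sequence also avoids windows such as '0JLQ2CFLVNL' that mix header characters with residues.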
