Analyze a chemical equation, multiply sub-indexes outside parentheses in Python

So I'm kind of new to Python. Right now I'm making a chemical equation balancer and I've got stuck: when I receive a compound in parentheses with a subindex outside, like (NaCl)2, I want to expand it to Na2Cl2 and get rid of the parentheses. So far I've only managed to strip the parentheses, with this code:
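import string
import re

linealEquation = ""

def linealEq(equation):
    # missing code
    # Keep letters, digits, '+', '>' and '-'. The '-' goes last so the
    # character class does not read '+->' as a range from '+' to '>'.
    allow = string.ascii_letters + string.digits + '+>-'
    linealEquation = re.sub('[^%s]' % allow, '', equation)
    print(linealEquation)

linealEq("(CrNa)2 -> Cr+Na")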
But how can I walk through the string and multiply the indexes by the number outside the parentheses?
I know how to iterate over a string, but I cannot think of how to multiply the subindex specifically.
Thanks for the help.

Probably not the shortest solution and won't work in all cases, but works for your example:
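import re
equation = "(CrNa)2 -> Cr+Na"
left, right = equation.split('->')
exp = left.strip()[-1]                   # the subindex after ')', here '2'
inside = left.strip()[1:-2]              # the text between the parentheses
c2 = re.findall('[A-Z][^A-Z]*', inside)  # one entry per element: ['Cr', 'Na']
l = [s + exp for s in c2]
res = ''.join(l)                         # 'Cr2Na2'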
N.B. you can add print statements to better understand each step...
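For inputs beyond the single example, a slightly more general sketch is possible. It assumes single-digit or multi-digit subindexes written directly after the closing parenthesis, element symbols of the form capital letter plus optional lowercase letters, and it multiplies any subindex already present inside the group. The names GROUP, ELEMENT and expand_groups are made up for this illustration.

```python
import re

# '(group)n' where the group holds only letters and digits (no nested parentheses)
GROUP = re.compile(r'\(([A-Za-z0-9]+)\)(\d+)')
# An element symbol followed by an optional subindex, e.g. 'Na' or 'O4'
ELEMENT = re.compile(r'([A-Z][a-z]*)(\d*)')

def expand_groups(formula):
    """Expand every '(group)n', multiplying the inner subindexes by n."""
    def expand(match):
        inner, factor = match.group(1), int(match.group(2))

        def scale(element):
            count = int(element.group(2) or 1) * factor
            return element.group(1) + (str(count) if count != 1 else '')

        return ELEMENT.sub(scale, inner)

    # Repeat until nothing changes so nested groups expand from the inside out
    previous = None
    while previous != formula:
        previous = formula
        formula = GROUP.sub(expand, formula)
    return formula

print(expand_groups("(CrNa)2 -> Cr+Na"))  # Cr2Na2 -> Cr+Na
print(expand_groups("Mg3(PO4)2"))         # Mg3P2O8
```

Because the innermost group is always expanded first, something like ((OH)2)3 is handled in two passes. Anything that does not look like an element symbol is left untouched.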

Related

How to extract a substring from a string in Python 3

I am trying to pull a substring out of a function result, but I'm having trouble figuring out the best way to strip the necessary string out using Python.
Output Example:
[<THIS STRING-STRING-STRING THAT THESE THOSE>]
In this example, I would like to grab "STRING-STRING-STRING" and throw away all the rest of the output. In this example, "[<THIS " &" THAT THESE THOSE>]" are static.
Many many ways to solve this. Here are two examples:
First one is a simple replacement of your unwanted characters.
targetstring = '[<THIS STRING-STRING-STRING THAT THESE THOSE>]'
#ALTERNATIVE 1
newstring = targetstring.replace(r" THAT THESE THOSE>]", '').replace(r"[<THIS ", '')
print(newstring)
and this drops everything except your target pattern:
#ALTERNATIVE 2
match = "STRING-STRING-STRING"
start = targetstring.find(match)
stop = len(match)
targetstring[start:start+stop]
These can be shortened but thought it might be useful for OP to have them written out.
I found this extremely useful, might be of help to you as well: https://www.computerhope.com/issues/ch001721.htm
If by '"[<THIS " &" THAT THESE THOSE>]" are static' you mean that they are always the exact same string, then:
s = "[<THIS STRING-STRING-STRING THAT THESE THOSE>]"
before = len("[<THIS ")
after = len(" THAT THESE THOSE>]")
s[before:-after]
# 'STRING-STRING-STRING'
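If the wrapper really is static, the slicing above can be wrapped in a small helper that also checks the assumption. This is only a sketch; strip_static is a hypothetical name, and raising ValueError on a mismatch is one possible choice.

```python
def strip_static(text, prefix, suffix):
    """Return text without the given prefix and suffix, or raise ValueError."""
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError("text is not wrapped in the expected prefix/suffix")
    # Slice with an explicit end index so an empty suffix also works
    return text[len(prefix):len(text) - len(suffix)]

s = "[<THIS STRING-STRING-STRING THAT THESE THOSE>]"
print(strip_static(s, "[<THIS ", " THAT THESE THOSE>]"))  # STRING-STRING-STRING
```

The explicit end index avoids the s[before:-0] pitfall, which would return an empty string if the suffix were ever empty.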
Like so (as long as the position of the characters in the string doesn't change):
myString = "[<THIS STRING-STRING-STRING THAT THESE THOSE>]"
myString = myString[7:27]
Another alternative method:
import re
my_str = "[<THIS STRING-STRING-STRING THAT THESE THOSE>]"
string_pos = [(s.start(), s.end()) for s in re.finditer('STRING-STRING-STRING', my_str)]
start, end = string_pos[0]
print(my_str[start:end])  # end is already exclusive, so no "+ 1" is needed
STRING-STRING-STRING
If STRING-STRING-STRING occurs multiple times in the string, the start and end indexes of each occurrence are given as tuples in string_pos.
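To illustrate the multiple-occurrence case, here is a minimal sketch with a made-up input in which the target appears twice. It uses the span() method of match objects, which returns the (start, end) pair directly.

```python
import re

# Hypothetical input with two occurrences of the target
text = "[<THIS STRING-STRING-STRING THAT STRING-STRING-STRING THOSE>]"
spans = [m.span() for m in re.finditer("STRING-STRING-STRING", text)]
print(spans)  # [(7, 27), (33, 53)]

# Each pair can be used directly as slice bounds
for start, end in spans:
    print(text[start:end])
```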

How to find a String that contains Specific other String?

I have a problem with my homework that is confusing me. This is the problem:
The input is a string that is a linear equation like "A + B = C",
but for some reason one of A, B or C is not clear to us and we can't read it properly.
for example:
"1# + 24 = 34" or "5131 + #251 = 76382"
Note that this can happen to only one part of the equation (A, B or C), and '#' can stand for more than one digit!
(If the input is "10# + 50 = 10052", the output should be "10002 + 50 = 10052".)
So here is the question: how can I highlight or select the part of this string that contains '#'?
I tried searching in RegExr and couldn't find a pattern that matches my situation!
This retrieves the part of string that contains #:
import re
textExample = "5131 + #251 = 76382"
re.findall(r'[^ ]*#[^ ]*',textExample)
In case the expression does not always separate operators and numbers with spaces, you should search for a preceding or subsequent digit around the pound sign:
import re
equation = "5131 + #251 = 76382"
r = re.findall(r"((?<=\d)#|#(?=\d))",equation)
If you only intend to replace the pound sign with some digits, you don't need to find or highlight it. Simply use the built-in string replace method:
equality = equation.replace("#","71") #==> '5131 + 71251 = 76382'
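The missing digits do not have to be guessed: the damaged term can be computed from the two readable ones and then checked against the visible digits. The sketch below assumes exactly one term contains '#', only '+' is used, and all values are non-negative integers. The name restore_equation is made up for this example.

```python
import re

def restore_equation(equation):
    """Fill in the '#' of an 'A + B = C' equation; return None if nothing fits."""
    a, b, c = [term.strip() for term in re.split(r'[+=]', equation)]
    # Compute the damaged term from the two readable ones
    if '#' in a:
        value = int(c) - int(b)
    elif '#' in b:
        value = int(c) - int(a)
    else:
        value = int(a) + int(b)
    damaged = next(term for term in (a, b, c) if '#' in term)
    # '#' hides one or more digits, so it becomes \d+ when checking the fit
    pattern = damaged.replace('#', r'\d+')
    if value < 0 or not re.fullmatch(pattern, str(value)):
        return None
    return equation.replace(damaged, str(value))

print(restore_equation("10# + 50 = 10052"))     # 10002 + 50 = 10052
print(restore_equation("5131 + #251 = 76382"))  # 5131 + 71251 = 76382
```

Returning None when the computed value does not match the visible digits keeps inconsistent inputs from producing a wrong equation silently.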

After re.split, how to put separators back?

The following code takes the equation apart and reassembles it after sorting.
def simplify(poly):
    import re
    p = re.split(r'\+|\-', poly)
    return '+'.join(sorted(''.join(sorted(x)) for x in p))
print(simplify('a+ca-ab'))
Problem: it's not hard to sort the terms, but it's difficult to put the operators (+, -) back. The code above can only put the '+' back into the equation, not the '-'.
How should I put the operators back?
Note that your function name doesn't suit the code. You're only trying to sort the unknowns in lexicographic order.
If you separate the unknowns and their sign, the processing will become a mess. Just process them together:
import re
pattern = re.compile('[+-]?[a-z]+', re.I)

def ignore_sign(s):
    return re.sub('[+-]', '', s)

def simplify(poly):
    if poly[0] not in '+-':
        poly = '+' + poly
    parts = [''.join(sorted(part)) for part in re.findall(pattern, poly)]
    sorted_parts = sorted(parts, key=ignore_sign)
    return re.sub(r'^\+', '', ''.join(sorted_parts))

print(simplify('a-ac+ba'))
# a+ab-ac
A + is prepended to the string in order to avoid mixing unknowns together (thanks @rici):
print(simplify('z-ac+ba'))
# ab-ac+z
When sorting the parts, you just need to ignore any sign so that -a appears before +z:
>>> sorted(['-a', '+z'])
['+z', '-a']
>>> sorted(['-a', '+z'], key=ignore_sign)
['-a', '+z']
Try this logic. Splitting on a capturing group does not discard the separators; they stay in the resulting list as separate elements.
import re
def simplify(poly):
    # The parentheses around the pattern keep '+' and '-' in the result
    original_p = re.split(r'(\+|\-)', poly)
    without_operand = [x for x in original_p if x not in ["+", "-"]]
    return "".join(original_p)
print(simplify('a+ca-ab'))
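Building on the capturing split, one way to actually sort the terms while keeping their signs is to pair each term with the separator in front of it. This is a sketch rather than a drop-in replacement; sort_terms is a hypothetical name, and the input is assumed to contain only letters, '+' and '-'.

```python
import re

def sort_terms(poly):
    """Sort the terms of poly alphabetically, keeping each term's sign."""
    tokens = re.split(r'([+-])', poly)
    if tokens[0] == '':
        tokens = tokens[1:]      # poly started with a sign
    else:
        tokens = ['+'] + tokens  # give the first term an explicit '+'
    # tokens now alternates sign, term, sign, term, ...
    pairs = [(''.join(sorted(term)), sign)
             for sign, term in zip(tokens[0::2], tokens[1::2])]
    result = ''.join(sign + term for term, sign in sorted(pairs))
    # Drop a leading '+' so the output looks like the input style
    return result[1:] if result.startswith('+') else result

print(sort_terms('a+ca-ab'))  # a-ab+ac
print(sort_terms('z-ac+ba'))  # ab-ac+z
```

Putting the term first in each pair makes the default tuple ordering sort by the letters, with the sign only acting as a tie-breaker.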

regular expression replace (if pattern found replace symbol for symbol)

I have several lines of text (RNA sequences) and I want to build a matrix of character conservation, because the sequences are aligned by similarity.
But I have several gaps (-), and a long run of them (e.g. more than 100 '-') actually means a whole structure is missing. When this happens I want to replace that run with the same number of dots (a different symbol, to tell the two cases apart).
I thought I could do this with a regular expression, but I am not able to replace only the pattern: when I do so, I either replace everything or end up with the wrong number of dots.
My code looks like this:
with alnfile as f_in:
    if re.search('-{100,}', elem):
        elem = re.sub('-{100,}', '.', elem)  # failed alternative*len(m.groups(x)), elem)
    print(len(elem))    # check if I am keeping the length of my sequence
    print(elem[0:100])  # check the start
    f1.write(elem)
if my file is:
ONE ----(*100)atgtgca----(*20)
I am getting:
ONE ..(*100)atgtgca----(*20)
My other change was only dots then I get:
ONE ....(*100)atgtgca....(*20)
WHAT I NEED:
ONE ....(*100)atgtgca----(*20)
I know that I am missing something, but I cannot figure it out. Is there a flag or something else that would allow this exact change?
You could try the following:
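import re
data = "ONE " + "-" * 100 + "atgtgca" + "-" * 20
# The replacement is a function: it gets each match and returns as many dots as the match is long
print(re.sub(r'-{100,}', lambda x: '.' * len(x.group(0)), data))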
This would display:
ONE ....................................................................................................atgtgca--------------------
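The same idea can be packaged as a small function with the threshold as a parameter, which also makes the length check from the question easy to run. This is a sketch; mask_long_gaps and its parameter names are made up for the example.

```python
import re

def mask_long_gaps(sequence, min_length=100, mask='.'):
    """Replace each run of at least min_length '-' with equally many mask characters."""
    pattern = re.compile('-{%d,}' % min_length)
    # re.sub calls the function once per match and inserts its return value
    return pattern.sub(lambda m: mask * len(m.group(0)), sequence)

row = "ONE " + "-" * 100 + "atgtgca" + "-" * 20
masked = mask_long_gaps(row)
print(len(masked) == len(row))               # True
print(masked.count('.'), masked.count('-'))  # 100 20
```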

python string manipulation and processing

I have a number of codes which I need to process, and these come through in a number of different formats which I need to manipulate first to get them in the right format:
Examples of codes:
ABC1.12 - correct format
ABC 1.22 - space between letters and numbers
ABC1.12/13 - 2 codes joined together and leading 1. missing from 13, should be ABC1.12 and ABC1.13
ABC 1.12 / 1.13 - codes joined together and spaces
I know how to remove the spaces but am not sure how to handle the codes which have been split. I know I can use the split function to create 2 codes but not sure how I can then append the letters (and first number part) to the second code. This is the 3rd and 4th example in the list above.
WHAT I HAVE SO FAR
val = # code
retList = [val]
if "/" in val:
    (code1, code2) = session_codes = val.split("/", 1)
    (initial_letters, numbers) = code1.split(".", 1)
    if initial_letters not in code2:
        code2 = initial_letters + '.' + code2
    # reset list so that it returns both values
    retList = [code1, code2]
This won't really handle the split for example 4, as code2 becomes ABC1.1.13
You can use regex for this purpose.
A possible implementation would be as follows:
>>> import re
>>> def foo(st):
...     parts=st.replace(' ','').split("/")
...     parts=list(re.findall("^([A-Za-z]+)(.*)$",parts[0])[0])+parts[1:]
...     parts=parts[0:1]+[x.split('.') for x in parts[1:]]
...     parts=parts[0:1]+['.'.join(x) if len(x) > 1 else '.'.join([parts[1][0],x[0]]) for x in parts[1:]]
...     return [parts[0]+p for p in parts[1:]]
...
>>> foo('ABC1.12')
['ABC1.12']
>>> foo('ABC 1.22')
['ABC1.22']
>>> foo('ABC1.12/13')
['ABC1.12', 'ABC1.13']
>>> foo('ABC 1.12 / 1.13')
['ABC1.12', 'ABC1.13']
>>>
Are you familiar with regex? That would be an angle worth exploring here. Also, consider splitting on the space character, not just the slash and decimal.
I suggest you write a regular expression for each code pattern and then form a larger regular expression which is the union of the individual ones.
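As a rough illustration of the regex angle, the four example formats can be covered by one combined pattern with optional parts (rather than a literal union of four separate patterns). The sketch assumes every code is letters followed by a number of the form N.NN, and that a part after "/" without a dot inherits the number before the dot. CODE and split_codes are hypothetical names.

```python
import re

CODE = re.compile(r'''
    \s*([A-Za-z]+)\s*              # the letters, e.g. ABC
    (\d+)\.(\d+)                   # the first number, e.g. 1.12
    ((?:\s*/\s*(?:\d+\.)?\d+)*)    # any further parts such as /13 or / 1.13
    \s*$''', re.VERBOSE)

def split_codes(raw):
    """Return the list of normalised codes contained in raw."""
    match = CODE.match(raw)
    if match is None:
        raise ValueError("unrecognised code: %r" % raw)
    letters, major, minor, rest = match.groups()
    codes = [letters + major + '.' + minor]
    for extra in re.findall(r'/\s*((?:\d+\.)?\d+)', rest):
        if '.' not in extra:
            extra = major + '.' + extra  # re-attach the missing leading number
        codes.append(letters + extra)
    return codes

for raw in ['ABC1.12', 'ABC 1.22', 'ABC1.12/13', 'ABC 1.12 / 1.13']:
    print(split_codes(raw))
```

Inputs that match none of the expected shapes raise ValueError instead of being passed through, which makes unexpected formats visible early.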
Using PyParsing
The answer by @Abhijit is a good one, and for this simple problem regex may be the way to go. However, when dealing with parsing problems, you'll often need a more extensible solution that can grow with your problem. I've found that pyparsing is great for that: you write the grammar, it does the parsing:
from pyparsing import *
index = Combine(Word(alphas))
# Define what a number is and convert it to a float
number = Combine(Word(nums)+Optional('.'+Optional(Word(nums))))
number.setParseAction(lambda x: float(x[0]))
# What do extra numbers look like?
marker = Word('/').suppress()
extra_numbers = marker + number
# Define what a possible line could be
line_code = Group(index + number + ZeroOrMore(extra_numbers))
grammar = OneOrMore(line_code)
From this definition we can parse the string:
S = '''ABC1.12
ABC 1.22
XXX1.12/13/77/32.
XYZ 1.12 / 1.13
'''
print(grammar.parseString(S))
Giving:
[['ABC', 1.12], ['ABC', 1.22], ['XXX', 1.12, 13.0, 77.0, 32.0], ['XYZ', 1.12, 1.13]]
Advantages:
The numbers are now in the correct format, as we've cast them to floats during the parsing. Many more "numbers" are handled: look at the index "XXX", where all numbers of the forms 1.12, 13 and 32. are parsed, regardless of the decimal point.
Take a look at this method. It might be the simplest and best way to do it.
val = input()
# Find the position of the first digit; everything before it is the letter part
for aChar in val:
    if aChar.isnumeric():
        lastIndex = val.index(aChar)
        break
part1 = val[:lastIndex].strip()
part2 = val[lastIndex:]
if "/" not in part2:
    print(part1 + part2)
else:
    if " " not in part2:
        # e.g. 1.12/13: reuse the number before the dot for every part
        codes = []
        divPart2 = part2.split(".")
        partCodes = divPart2[1].split("/")
        for aPart in partCodes:
            codes.append(part1 + divPart2[0] + "." + aPart)
        print(codes)
    else:
        # e.g. 1.12 / 1.13: every part is already a full number
        codes = []
        divPart2 = part2.split("/")
        for aPart in divPart2:
            aPart = aPart.strip()
            codes.append(part1 + aPart)
        print(codes)
