Change string for defined pattern (Python) - python

Learning Python, I came across a demanding beginner's exercise.
Let's say you have a string constituted by "blocks" of characters separated by ';'. An example would be:
cdk;2(c)3(i)s;c
And you have to return a new string based on the old one but in accordance with a certain pattern (which is also a string), for example:
c?*
This pattern means that each block must start with a 'c', the '?' character must be replaced by some other letter, and finally '*' by an arbitrary number of letters.
So when the pattern is applied you return something like:
cdk;cciiis
Another example:
string: 2(a)bxaxb;ab
pattern: a?*b
result: aabxaxb
My very crude attempt resulted in this:
def switch(string,pattern):
    d = []
    for v in range(0,string):
        r = float("inf")
        for m in range (0,pattern):
            if pattern[m] == string[v]:
                d.append(pattern[m])
            elif string[m]==';':
                d.append(pattern[m])
            elif (pattern[m]=='?' & Character.isLetter(string.charAt(v))):
                d.append(pattern[m])
    return d
Tips?

To split a string you can use split() function.
For pattern detection in strings you can use regular expressions (regex) with the re library.
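The two hints above can be combined into a minimal sketch. It rests on assumptions inferred only from the two examples in the question: that N(x) expands to x repeated N times, that '?' stands for exactly one letter, that '*' stands for zero or more letters, and that blocks which do not fit the pattern are dropped. The helper names expand and pattern_to_regex are hypothetical.

```python
import re

def expand(block):
    # Assumed meaning of "3(i)": the text in parentheses repeated 3 times.
    return re.sub(r"(\d+)\(([^)]*)\)",
                  lambda m: m.group(2) * int(m.group(1)),
                  block)

def pattern_to_regex(pattern):
    # Translate the exercise's pattern into a regular expression.
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z]")    # exactly one letter
        elif ch == "*":
            parts.append("[A-Za-z]*")   # zero or more letters
        else:
            parts.append(re.escape(ch)) # literal character
    return re.compile("".join(parts))

def switch(string, pattern):
    regex = pattern_to_regex(pattern)
    blocks = [expand(block) for block in string.split(";")]
    # Keep only the blocks that match the whole pattern.
    return ";".join(block for block in blocks if regex.fullmatch(block))

print(switch("cdk;2(c)3(i)s;c", "c?*"))
print(switch("2(a)bxaxb;ab", "a?*b"))
```

Under these assumptions the two calls reproduce the results given in the question, cdk;cciiis and aabxaxb.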

Related

Return a string of country codes from an argument that is a string of prices

So here's the question:
Write a function that will return a string of country codes from an argument that is a string of prices (containing dollar amounts following the country codes). Your function will take as an argument a string of prices like the following: "US$40, AU$89, JP$200". In this example, the function would return the string "US, AU, JP".
Hint: You may want to break the original string into a list, manipulate the individual elements, then make it into a string again.
Example:
> testEqual(get_country_codes("NZ$300, KR$1200, DK$5")
> "NZ, KR, DK"
As of now, I'm clueless as to how to separate the $ and the numbers. I'm very lost.
I would advise looking up and using regular expressions:
https://docs.python.org/2/library/re.html
re.findall returns a list of all matching strings, so with a pattern like ([A-Z]{2})\$ you get every two-letter capital code that is followed by a dollar sign. The $ has to be escaped, because a bare $ anchors the end of the string.
After that you can just create a string from the resulting list.
Let me know if that is not clear
def test(string):
    return ", ".join([item.split("$")[0] for item in string.split(", ")])
string = "NZ$300, KR$1200, DK$5"
print(test(string))
Use a regular expression pattern and join the matches into a string. (\w{2})\$ matches exactly 2 word characters followed by a $.
import re
def get_country_codes(string):
    matches = re.findall(r"(\w{2})\$", string)
    return ", ".join(matches)

Regex to match strings within braces

I am trying to write a regex to a string that has the following format
12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123
Currently, I have something like this
\((?P<expression>\S+)\)
But with this, I can capture only the strings within square brackets.
Is there any way I can capture the integers before the square brackets and also the id at the end, along with the strings within the square brackets?
The number of strings enclosed within small brackets will not be the same. I could also have a string that looks like this
10(3,2) [abc (a1b2c3)] myId1
I know that I can write a simple regex for the above expression using brute force. But could anyone please help me write one when the number of strings within the square bracket keeps changing.
Thanks in advance
You can capture the information by using ^ and $, which mean start and end respectively:
((?P<front>^\d+)|\((?P<expression>\S+)\)|(?P<id>[a-zA-Z0-9]+)$)
Regex101:
https://regex101.com/r/PoA5k4/1
To make the result more usable, I'd turn it into a dictionary:
import re
myStr = "12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123"
di = {}
# Raw string so that \d and \S reach the regex engine unchanged.
# Each result tuple is (whole match, front, expression, id).
for find in re.findall(r"((?P<front>^\d+)|\((?P<expression>\S+)\)|(?P<id>[a-zA-Z0-9]+)$)", myStr):
    if find[1] != "":
        di["starter"] = find[1]
    elif find[3] != "":
        di["id"] = find[3]
    else:
        di.setdefault("expression",[]).append(find[2])
print(di)
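An alternative sketch, not taken from the answer above, matches the overall line shape first and then collects the parenthesised strings inside the square brackets with a second findall, so their number can vary. The names LINE_RE and parse_line and the group names args and body are hypothetical, and the layout "digits(args) [ ... ] id" is assumed from the two sample lines only.

```python
import re

# Assumed layout: <digits>(<args>) [<body>] <id>
LINE_RE = re.compile(
    r"^(?P<front>\d+)\((?P<args>[^)]*)\)\s*\[(?P<body>.*)\]\s*(?P<id>\w+)$"
)

def parse_line(line):
    m = LINE_RE.match(line)
    if m is None:
        return None  # line does not follow the assumed layout
    return {
        "front": m.group("front"),
        "args": m.group("args").split(","),
        # Any number of "(...)" groups inside the square brackets.
        "expressions": re.findall(r"\(([^()]*)\)", m.group("body")),
        "id": m.group("id"),
    }

print(parse_line("12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123"))
print(parse_line("10(3,2) [abc (a1b2c3)] myId1"))
```

Unlike the single alternation, this keeps the leading (34,12) separate from the bracketed expressions.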

Replacing all numeric value to formatted string

What I am trying to do is:
Find out all the numeric values in a string.
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.finditer(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?',input_string)
for number in numbers:
    print ("{} start > {}, end > {}".format(number.group(), number.start(0), number.end(0)))
'''Output'''
>>100 start > 12, end > 15
>>79.80 start > 18, end > 23
And then I want to replace all the integer and float values with a certain format:
INT_(number of digits) and FLT_(number of decimal places)
eg. 100 -> INT_3 // 79.80 -> FLT_2
Thus, the expected output string is like this:
"高露潔光感白輕悅薄荷牙膏INT_3 FLT_2"
But the string replace method in Python is kind of weird, and can't achieve what I want to do.
So I am trying to use the substring append substring methods
string[:number.start(0)] + "INT_%s"%len(number.group()) +.....
which looks stupid and most importantly I still can't make it work.
Can anyone give me some advice on this problem?
Use re.sub and a callback method inside where you can perform various manipulations on the match:
import re
def repl(match):
    chunks = match.group(1).split(".")
    if len(chunks) == 2:
        return "FLT_{}".format(len(chunks[1]))
    else:
        return "INT_{}".format(len(chunks[0]))
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
result = re.sub(r'[-+]?([0-9]*\.?[0-9]+)(?:[eE][-+]?[0-9]+)?',repl,input_string)
print(result)
See the Python demo
Details:
The regex now has a capturing group over the number part (([0-9]*\.?[0-9]+)), this will be analyzed inside the repl method
Inside the repl method, Group 1 contents is split with . to see if we have a float/double, and if yes, we return the length of the fractional part, else, the length of the integer number.
You need to group the parts of your regex possibly like this
import re
def repl(m):
    if m.group(1) is None: #int
        return ("INT_%i"%len(m.group(2)))
    else: #float
        return ("FLT_%i"%(len(m.group(2))))
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.sub(r'[-+]?([0-9]*\.)?([0-9]+)([eE][-+]?[0-9]+)?',repl,input_string)
print(numbers)
group 0 is the whole string that was matched (can be used for putting into float or int)
group 1 is any digits before the . and the . itself if exists else it is None
group 2 is all digits after the . if it exists, else it is just all digits
group 3 is the exponential part if existing else None
You can get a python-number from it with
def parse(m):
    s=m.group(0)
    if m.group(1) is not None or m.group(3) is not None: # if there is a dot or an exponential part it must be a float
        return float(s)
    else:
        return int(s)
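As a usage sketch, parse can be fed the match objects from re.finditer with the same grouped regex. The names NUMBER_RE and values are hypothetical, and parse is repeated here only so the snippet runs on its own.

```python
import re

# Same grouped regex as in the re.sub call above.
NUMBER_RE = re.compile(r'[-+]?([0-9]*\.)?([0-9]+)([eE][-+]?[0-9]+)?')

def parse(m):
    s = m.group(0)
    # A dot (group 1) or an exponent (group 3) means the text is a float.
    if m.group(1) is not None or m.group(3) is not None:
        return float(s)
    return int(s)

input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
values = [parse(m) for m in NUMBER_RE.finditer(input_string)]
print(values)  # [100, 79.8]
```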
You are probably looking for something like the code below (of course there are other ways to do it). This one just starts with what you were doing and shows how it can be done.
import re
input_string = u"高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.finditer(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?',input_string)
s = input_string
# Walk the matches from last to first so earlier offsets stay valid.
for m in list(numbers)[::-1]:
    num = m.group(0)
    if '.' in num:
        s = "%sFLT_%s%s" % (s[:m.start(0)],str(len(num)-num.index('.')-1),s[m.end(0):])
    else:
        s = "%sINT_%s%s" % (s[:m.start(0)],str(len(num)), s[m.end(0):])
print(s)
This may look a bit complicated because there are really several simple problems to solve.
For instance your initial regex finds both ints and floats, but you wish to apply totally different replacements afterward. This would be much more straightforward if you were doing only one thing at a time. But as parts of floats may look like an int, doing everything at once may not be such a bad idea; you just have to understand that this will lead to a secondary check to discriminate both cases.
Another more fundamental issue is that you really can't replace anything in a Python string. Python strings are immutable objects, hence you have to make a copy. This is fine anyway because the format change may need insertion or removal of characters and an in-place replacement wouldn't be efficient.
The last trouble to take into account is that replacement must be made backward, because if you change the beginning of the string the match position would also change and the next replacement wouldn't be at the right place. If we do it backward, all is fine.
Of course I agree that using re.sub() is much simpler.

Regular Expression Testing

I have been working on this project to understand regular expressions. There are 6 lines of input. The first line will contain 10 character strings. The last 5 lines will each contain a valid regular expression string.
For the output, each regular expression should print all the character strings from line 1 that it matches; if none match, then print none. # is used to denote an empty string. I have gotten everything working except the empty string part, so here is my code
and an example first input would be
1)#,aac,acc,abc,ac,abbc,abbbc,abbbbc,aabc,accb
and I would like the second input to be
2)b*
The output I'm trying to get is #
and so far it outputs nothing.
import re
inp = input("Search String:").upper().split(',')
for runs in range(50):
    temp = []
    query = input("Search Query:").replace("?", "[A-Z_0-9]+?+$").upper()
    for item in inp:
        search = re.match(query, item)
        if search:
            if search.group() not in temp:
                temp.append(search.group())
    if len(temp) > 0:
        print(" ".join(temp))
    else:
        print("NONE")
b matches only the literal character 'b', so your search string will only match a sequence of zero or more b's, such as
b
or
bbbb
or
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb (and so on)
Your match string will not match anything else.
I don't know why you are using a specific letter, but I assume you intended an escape sequence, like "\b*", although that only matches transitions between types of characters, so it won't match # in this context. If you use \W*, it will match # (not sure whether it will match the other stuff you want).
If you haven't already, check out the following resources on regular expressions, including all of the escape characters and metacharacters:
Wikipedia
Python.org 2.7
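A different reading of the question is that # stands for the empty string and that a pattern should match a whole candidate string. Under that assumption, which is not confirmed by the answer above, a minimal sketch could translate # to "" before matching and use re.fullmatch instead of re.match. The function name matches_for is hypothetical.

```python
import re

def matches_for(pattern, candidates):
    # Return the candidates that the pattern matches in full.
    regex = re.compile(pattern)
    found = []
    for item in candidates:
        text = "" if item == "#" else item  # assumed: "#" denotes the empty string
        if regex.fullmatch(text) and item not in found:
            found.append(item)              # report the original spelling, e.g. "#"
    return found

candidates = "#,aac,acc,abc,ac,abbc,abbbc,abbbbc,aabc,accb".split(",")
for pattern in ["b*", "ab*c"]:
    result = matches_for(pattern, candidates)
    print(" ".join(result) if result else "NONE")
```

With b*, only the empty string matches in full, so the first line printed is #.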

Use Python to extract Branch Lengths from Newick Format

I have a list in python consisting of one item which is a tree written in Newick Format, as below:
['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']
In tree format this appears as below:
I am trying to write some code that will look through the list item and return the IDs (BMNHxxxxxx) which are joined by branch length of 0 (or <0.001 for example) (highlighted in red). I thought about using regex such as:
JustTree = []
with JustTree as f:
    for match in re.finditer(r"(?<=Item\sA)(?:(?!Item\sB).){50,}", subject, re.I):
        f.extend(match.group()+"\n")
This was taken from another StackOverflow answer, where item A would be a ':' (the branch lengths always appear after a ':') and item B would be either a ',', a ')' or a ';', as these are the three characters that delimit it, but I'm not experienced enough in regex to do this.
By using a branch length of 0 in this case I want the code to output ['BMNH703458a', 'BMNH703458b']. If I could alter this to also include IDs joined by a branch length of a user-defined value of, say, 0.01, this would be highly useful.
If anyone has any input, or can point me to a useful answer I would highly appreciate it.
Okay, here's a regex to extract only numbers (with potential decimals):
\b[0-9]+(?:\.[0-9]+)?\b
The \bs make sure that there is no other number, letter or underscore around the number right next to it. It's called a word boundary.
[0-9]+ matches multiple digits.
(?:\.[0-9]+)? is an optional group, meaning that it may or may not match. If there is a dot and digits after the first [0-9]+, then it will match those. Otherwise, it won't. The group itself matches a dot, and at least 1 digit.
You can use it with re.findall to put all the matches in a list:
import re
NewickTree = ['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']
pattern = re.compile(r"\b[0-9]+(?:\.[0-9]+)?\b")
for tree in NewickTree:
    branch_lengths = pattern.findall(tree)
    # Do stuff to the list branch_lengths
    print(branch_lengths)
For this list, you get this printed:
['0.16529463651919140688', '0.22945757727367316336', '0.18028180766761139897',
'0.21469677818346077913', '0.54350916483644962085', '0.00654573856803835914',
'0.04530853441176059537', '0.02416511342888815264', '0.21236619242575086042',
'0.13421900772403019819', '0.14957653992840658219', '0.02592135486124686958',
'0.02477670174791116522', '0.22983459269245612444', '0.00000328449424529074',
'0.29776257618061197086', '0.09881729077887969892', '0.02257522897558370684',
'0.21599133163597591945', '0.02365043128986757739', '0.16069861523756587274',
'0.0']
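To act on the user-defined threshold from the question, the matched strings can be converted to floats and filtered. The sketch below is only a starting point: it reports where short branches occur in the string, not which IDs they join, and it uses a small made-up tree rather than the one above. The names BRANCH_RE and short_branches are hypothetical, and anchoring on the ':' before each length is a variation on the regex above.

```python
import re

# A branch length is assumed to be the number that follows a ':'.
BRANCH_RE = re.compile(r":([0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)")

def short_branches(newick, threshold):
    """Return (offset, length) for every branch shorter than threshold."""
    hits = []
    for m in BRANCH_RE.finditer(newick):
        length = float(m.group(1))
        if length < threshold:
            hits.append((m.start(1), length))
    return hits

toy_tree = "((A:0.2,B:0.3):0.000003,C:0.1):0.0;"  # made-up example tree
for offset, length in short_branches(toy_tree, 0.001):
    print(offset, length)
```

The offsets can then be used to inspect the surrounding text and decide which IDs the short branch separates.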
I know your question has been answered, but if you ever want your data as a nested list instead of a flat string:
import re
import pprint
a="(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;"
def tokenize(text):
    # Yield "(", ")" or a "name:length" / ":length" token.
    for m in re.finditer(r"\(|\)|[\w.:]+", text):
        yield m.group()
def make_nested_list(tok, L=None):
    if L is None: L = []
    while True:
        try: t = next(tok)  # next(tok) works in both Python 2.6+ and Python 3
        except StopIteration: break
        if t == "(": L.append(make_nested_list(tok))
        elif t == ")": break
        else:
            i = t.find(":"); assert i != -1
            if i == 0: L.append(float(t[1:]))
            else: L.append([t[:i], float(t[i+1:])])
    return L
L = make_nested_list(tokenize(a))
pprint.pprint(L)
There are several Python libraries that support the Newick format. The ETE toolkit lets you read Newick strings and operate on trees as Python objects:
from ete2 import Tree
tree = Tree(newickFile)
print tree
Several Newick subformats can be chosen, and branch distances are parsed even if they are expressed in scientific notation.
from ete2 import Tree
tree = Tree("(A:3.4, (B:0.15E-10,C:0.0001):1.5E-234);")
