I need to write a function, tag_count, that takes a list of strings as its argument. It should return a count of how many of those strings are XML tags. A string counts as an XML tag if it begins with a left angle bracket "<" and ends with a right angle bracket ">".
def tag_count(input_list):
    found = 0
    counts = input_list.count('<')
    for key in input_list:
        if key == counts:
            found += 1
    return found
Test for the tag_count function:
list1 = ['<greeting>', 'Hello World!', '</greeting>']
count = tag_count(list1)
print("Expected result: 2, Actual result: {}".format(count))
Can someone tell me why this does not work, and come up with
something that does, using a function defined with def?
At the moment, it is returning: Expected result: 2, Actual result: 0
The main problem is that you are counting the strings in your list that are exactly a single '<'. You need to iterate over your list and count the strings that begin and end with angle brackets:
>>> def tag_count(lst):
...     return sum(s[0] == '<' and s[-1] == '>' for s in lst)
>>>
>>> list1 = ['<greeting>', 'Hello World!', '</greeting>']
>>> count = tag_count(list1)
>>> count
2
>>>
If there may be empty strings in your data, use str.startswith and str.endswith rather than indexing to avoid an IndexError:
return sum(s.startswith('<') and s.endswith('>') for s in lst)
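Putting the startswith/endswith variant into a complete function, a minimal sketch might look like the following. The sample list (including the empty string, the lone '<' and the bare '<>') is made up for illustration, and counting '<>' as a tag is an assumption that follows from the "begins with < and ends with >" rule in the question.

```python
def tag_count(strings):
    """Count the strings that start with '<' and end with '>'."""
    # startswith/endswith return False for an empty string instead of raising.
    return sum(s.startswith('<') and s.endswith('>') for s in strings)


# Illustrative data (not from the original question), including edge cases.
sample = ['<greeting>', 'Hello World!', '</greeting>', '', '<', '<>']
print(tag_count(sample))  # 3
```

Because the boolean results are summed directly, no explicit counter or loop variable is needed.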
Taking Cuber's answer into account, a safe and readable way to count XML tags could be:
def is_key_XML(key):
    try:
        return (key[0] == '<') and (key[-1] == '>')
    except IndexError:
        return False
def tag_count(input_list):
    return sum(is_key_XML(k) for k in input_list)
And the test could be:
list1 = ['<greeting>', 'Hello World!', '</greeting>', '< Graou', 'L', '<>', '']
count = tag_count(list1)
print("Expected result: 3, Actual result: {}".format(count))
def tag_count(input_list):
    found = 0
    for key in input_list:
        if (len(key) > 1) and (key[0] == '<') and (key[-1] == '>'):
            found += 1
    return found
You need to check whether the first and last characters of each key are '<' and '>'.
Also, len(key) > 1 checks whether the string has at least 2 characters.
list1 = ['<greeting>', 'Hello World!', '</greeting>', '']
import re
len( [ s for s in list1 if re.match(r'<.*>', s) ] )
Output:
2
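One caveat with the regex above: re.match only anchors at the start of the string, so a string such as '<b> bold' also matches even though it does not end with '>'. A hedged sketch using re.fullmatch, which requires the whole string to match, is shown below. The name TAG_RE and the extra sample string are invented for illustration.

```python
import re

TAG_RE = re.compile(r'<.*>')  # hypothetical name, not from the answer above


def tag_count(strings):
    # fullmatch succeeds only if the pattern spans the entire string.
    return sum(1 for s in strings if TAG_RE.fullmatch(s))


list1 = ['<greeting>', 'Hello World!', '</greeting>', '', '<b> bold']
print(tag_count(list1))                                  # 2
print(len([s for s in list1 if re.match(r'<.*>', s)]))   # 3: '<b> bold' slips through
```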
You can write it in a list comprehension notation:
requested_strs = len([s for s in input_list if s and s.startswith('<') and s.endswith('>')])
Even though it is a simple solution, I don't recommend using regexes in this case. Compiling a regex and matching it against every string takes too much time for a check as simple as this one.
As the title suggests, I'm trying to parse a piece of code into a tree or a list.
First off I would like to thank for any contribution and time spent on this.
So far my code is doing what I expect, yet I am not sure that this is the optimal / most generic way to do this.
Problem
1. I want to have a more generic solution, since in the future I am going to need further analysis of this syntax.
2. Right now I am unable to separate operators like '=' or '>=', as you can see in the output I share below.
In the future I might change the content of the list / tree from strings to tuples so I can identify the kind of operator (parameter, comparison like = or >= ....). But this is not a real need right now.
Research
My first attempt was parsing the text character by character, but my code was getting too messy and barely readable, so I assumed that I was doing something wrong there (I don't have that code to share here anymore).
So I started looking at how other people were doing it and found some approaches that didn't necessarily fulfil the requirements of simplicity and generality.
I would share the links to the sites but I didn't keep track of them.
The Syntax of the code
The syntax is pretty simple; after all, I'm not interested in types or any further detail, just the functions and parameters.
Strings are defined as 'my string', variables as !variable, and numbers as in any other language.
Here is a sample of code:
db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)
My Output
Here my output is partially correct, since I'm still unable to separate the "= '3'" part (I have to separate it because in this case it's a comparison operator and not part of a string).
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]
Desired Output
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]
My code so far
The parseRecursive method is the entry point.
import re
class FileParser:
#order is important to avoid miss splits
COMPARATOR_SIGN = {
'#='
,'#<>'
,'<>'
,'>='
,'<='
,'='
,'>'
,'<'
}
def __init__(self):
pass
def __charExistsInOccurences(self,current_needle, needles, text):
"""
check if other needles are present in text
current_needle : string -> the current needle being evaluated
needles : list -> list of needles
text : string/list<string> -> a string or a list of string to evaluate
"""
#if text is a string convert it to list of strings
text = text if isinstance(text, list) else [text]
exists = False
for t in text:
#check if needle is inside text value
for needle in needles:
#dont check the same key
if needle != current_needle:
regex_search_needle = split_regex = '\s*'+'\s*'.join(needle) + '\s*'
#list of 1's and 0's . 1 if another character is found in the string.
found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
if sum(found) > 0:
exists = True
break
return exists
def findOperator(self, needles, haystack):
"""
split parameters from operators
needles : list -> list of operators
haystack : string
"""
string_open = haystack.find("'")
#if no string has been found set the index to 0
if string_open < 0:
string_open = 0
occurences = []
string_closure = haystack.rfind("'")
operator = ''
for needle in needles:
#regex to ignore the possible spaces between characters of the needle
split_regex = '\s*'+'\s*'.join(needle) + '\s*'
#parse parameters before and after the string
before_string = re.split(split_regex, haystack[0:string_open])
after_string = re.split(split_regex, haystack[string_closure+1:])
#check if any other needle exists in the results found
before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)
#if the operator has been found merge the results with the occurences and assign the operator
if not before_string_exists and not after_string_exists:
occurences.extend(before_string)
occurences.extend([haystack[string_open:string_closure+1]])
occurences.extend(after_string)
operator = needle
#filter blank spaces generated
occurences = list(filter(lambda x: len(x.strip())>0,occurences))
result_check = [1 if x==haystack else 0 for x in occurences]
#if the haystack was originaly a simple string like '1' the occurences list is going to be filled with the same character over and over due to the before string an after string part
if len(result_check) == sum(result_check):
occurences= [haystack]
operator = ''
return operator, occurences
def parseRecursive(self,text):
"""
parse a block of text
text : string
"""
assert(len(text) < 1, "text is empty")
function_open = text.find('(')
accumulated_params = []
if function_open > -1:
#there is another function nested
text_prev_function = text[0:function_open]
#find last space coma or equal to retrieve the function name
last_space = -1
for j in range(len(text_prev_function)-1, 0 , -1):
if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
last_space = j
break
func_name = ''
if last_space > -1:
#there is something else behind the function name
func_name = text_prev_function[last_space+1:]
#no parentesis before so previous characters from function name are parameters
text_prev_func_params = list(filter(lambda x: len(x.strip())>0,text_prev_function[:last_space+1].split(',')))
text_prev_func_params = [x.strip() for x in text_prev_func_params]
#debug here
#accumulated_params.extend(text_prev_func_params)
for itext_prev in text_prev_func_params:
operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN,itext_prev)
if operator == '':
accumulated_params.extend(text_prev_operator)
else:
text_prev_operator.append(operator)
accumulated_params.extend(text_prev_operator)
#accumulated_params.extend(text_prev_operator)
else:
#function name is the start of the string
func_name = text_prev_function[0:].strip()
#find the closure of parentesis
function_close = text.rfind(')')
#parse the next function and extend the current list of parameters
next_func = text[function_open+1:function_close]
func_params = {func_name : self.parseRecursive(next_func)}
accumulated_params.append(func_params)
#
# parameters after the function
#
new_text = text[function_close+1:]
accumulated_params.extend(self.parseRecursive(new_text))
else:
#there is no other function nested
split_text = text.split(',')
current_func_params = list(filter(lambda x: len(x.strip())>0,split_text))
current_func_params = [x.strip() for x in current_func_params]
accumulated_params.extend(current_func_params)
#accumulated_params = list(filter(lambda x: len(x.strip())>0,accumulated_params))
return accumulated_params
text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
print(obj.parseRecursive(text))
You can use pyparsing to deal with such a case.
* pyparsing can be installed by pip install pyparsing
Code:
import pyparsing as pp
# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)
# A recursive function to transform a pyparsing result into your desired format
def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack
# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)
# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
# Show the result
print(result)
Output:
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
Note:
If there is an unbalanced parenthesis inside () (for example a(b(c), a(b)c), etc.), you will get an unexpected result or an IndexError. So be careful in such cases.
At the moment, only a single sample is available to build the parsing pattern from. If you encounter a parsing error, please provide more examples in your question.
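Since unbalanced parentheses are the main failure mode mentioned in the note, one option is to check the input before handing it to pyparsing. The helper below is a rough sketch and not part of pyparsing: the name parens_balanced is made up, and it assumes that strings are delimited by single quotes (as in the sample) and that parentheses inside quotes should be ignored.

```python
def parens_balanced(text):
    """Return True if every '(' outside single quotes has a matching ')'."""
    depth = 0
    in_string = False
    for ch in text:
        if ch == "'":
            in_string = not in_string      # toggle on each quote character
        elif in_string:
            continue                       # ignore everything inside quotes
        elif ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:                  # closing parenthesis with no opener
                return False
    return depth == 0 and not in_string


sample = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
print(parens_balanced(sample))    # True
print(parens_balanced("a(b(c)"))  # False
print(parens_balanced("a(b)c)"))  # False
```

Running such a check first turns a confusing parse result into an explicit, early failure.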
I have a very long string of text with () and [] in it. I'm trying to remove the characters between the parentheses and brackets but I cannot figure out how.
The string is similar to this:
x = "This is a sentence. (once a day) [twice a day]"
This string isn't the one I'm working with, but it is very similar and a lot shorter.
You can use re.sub function.
>>> import re
>>> x = "This is a sentence. (once a day) [twice a day]"
>>> re.sub(r"([\(\[]).*?([\)\]])", r"\g<1>\g<2>", x)
'This is a sentence. () []'
If you want to remove the [] and the () you can use this code:
>>> import re
>>> x = "This is a sentence. (once a day) [twice a day]"
>>> re.sub(r"[\(\[].*?[\)\]]", "", x)
'This is a sentence. '
Important: This code will not work with nested symbols
Explanation
The first regex groups ( or [ into group 1 (by surrounding it with parentheses) and ) or ] into group 2, matching these groups and all characters that come in between them. After matching, the matched portion is substituted with groups 1 and 2, leaving the final string with nothing inside the brackets. The second regex is self explanatory from this -> match everything and substitute with the empty string.
-- modified from comment by Ajay Thomas
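To make the warning about nested symbols concrete, here is a small sketch that wraps the second pattern in a function and feeds it a nested input. The function name strip_brackets and the sample strings are made up for illustration.

```python
import re


def strip_brackets(text):
    # Same pattern as the second example above, written as a raw string.
    return re.sub(r"[\(\[].*?[\)\]]", "", text)


# Flat brackets are removed cleanly.
print(repr(strip_brackets("keep(drop)keep[drop]")))  # 'keepkeep'

# Nested brackets: the lazy match stops at the first closing bracket,
# so the tail of the outer pair is left behind.
print(repr(strip_brackets("a (b (c) d) e")))         # 'a  d) e'
```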
Run this script; it works even with nested brackets.
Uses basic logical tests.
def a(test_str):
    ret = ''
    skip1c = 0
    skip2c = 0
    for i in test_str:
        if i == '[':
            skip1c += 1
        elif i == '(':
            skip2c += 1
        elif i == ']' and skip1c > 0:
            skip1c -= 1
        elif i == ')' and skip2c > 0:
            skip2c -= 1
        elif skip1c == 0 and skip2c == 0:
            ret += i
    return ret

x = "ewq[a [(b] ([c))]] This is a sentence. (once a day) [twice a day]"
x = a(x)
print(x)
print(repr(x))
Just in case you don't run it,
Here's the output:
>>>
ewq This is a sentence.
'ewq This is a sentence. '
Here's a solution similar to @pradyunsg's answer (it works with arbitrarily nested brackets):
def remove_text_inside_brackets(text, brackets="()[]"):
    count = [0] * (len(brackets) // 2) # count open/close brackets
    saved_chars = []
    for character in text:
        for i, b in enumerate(brackets):
            if character == b: # found bracket
                kind, is_close = divmod(i, 2)
                count[kind] += (-1)**is_close # `+1`: open, `-1`: close
                if count[kind] < 0: # unbalanced bracket
                    count[kind] = 0  # keep it
                else:  # found bracket to remove
                    break
        else: # character is not a [balanced] bracket
            if not any(count): # outside brackets
                saved_chars.append(character)
    return ''.join(saved_chars)
print(repr(remove_text_inside_brackets(
    "This is a sentence. (once a day) [twice a day]")))
# -> 'This is a sentence. '
This should work for parentheses. A regular expression "consumes" the text it has matched, so it won't work for nested parentheses.
import re
regex = re.compile(r".*?\((.*?)\)")
result = re.findall(regex, mystring)
Or this would find one set of parentheses; simply loop to find more:
start = mystring.find("(")
end = mystring.find(")")
if start != -1 and end != -1:
    result = mystring[start+1:end]
You can split, filter, and join the string again. If your brackets are well defined the following code should do.
import re
x = "".join(re.split(r"\(|\)|\[|\]", x)[::2])
You can try this. It removes the brackets together with the content inside them.
import re
x = "This is a sentence. (once a day) [twice a day]"
x = re.sub("\(.*?\)|\[.*?\]","",x)
print(x)
Expected output:
This is a sentence.
For anyone who appreciates the simplicity of the accepted answer by jvallver, and is looking for more readability from their code:
>>> import re
>>> x = 'This is a sentence. (once a day) [twice a day]'
>>> opening_braces = '\(\['
>>> closing_braces = '\)\]'
>>> non_greedy_wildcard = '.*?'
>>> re.sub(f'[{opening_braces}]{non_greedy_wildcard}[{closing_braces}]', '', x)
'This is a sentence. '
Most of the explanation for why this regex works is included in the code. Your future self will thank you for the 3 additional lines.
(Replace the f-string with the equivalent string concatenation for Python2 compatibility)
The regex \(.*?\)|\[.*?\] removes bracketed content by finding pairs: at each position it first tries to match parentheses and then square brackets. It also works for square brackets nested inside parentheses, as in the example below, because the parenthesized span is removed as a whole. Of course, it breaks on parentheses nested inside parentheses and on badly formed brackets.
import re
_brackets = re.compile(r"\(.*?\)|\[.*?\]")
_spaces = re.compile(r"\s+")
_b = _brackets.sub(" ", "microRNAs (miR) play a role in cancer ([1], [2])")
_s = _spaces.sub(" ", _b.strip())
print(_s)
# OUTPUT: microRNAs play a role in cancer
I have written the following code in Python.
The function receives two arguments, "Genome" and "Pattern", as strings, and whenever the pattern matches the genome, the starting index of the match is saved in a list. However, I should return the result not as a list but as a string in which the indices are separated by spaces.
Example:
Sample Input: ATAT, GATATATGCATATACTT
Sample Output: 1 3 9
Any suggestions?
def PatternMatching(Genome, Pattern):
    index=[]
    for i in range(len(Genome)-len(Pattern)+1):
        if Genome[i:i+len(Pattern)]==Pattern:
            index.append(i)
    return index
Genome="GATATATGCATATACTT"
Pattern="ATAT"
print(PatternMatching(Genome, Pattern))
You can use a join to print the list the way you want:
def PatternMatching(Genome, Pattern):
    index=[]
    for i in range(len(Genome)-len(Pattern)+1):
        if Genome[i:i+len(Pattern)]==Pattern:
            index.append(i)
    return ' '.join([str(_) for _ in index]) # This is new
Genome="GATATATGCATATACTT"
Pattern="ATAT"
print(PatternMatching(Genome, Pattern))
It's as simple as
return ' '.join(map(str, index))
Will work like a charm.
Try this: convert each index to a string, and then use the join function to join the list into a single string.
def PatternMatching(Genome, Pattern):
    index = []
    for i in range(len(Genome)-len(Pattern)+1):
        if Genome[i:i+len(Pattern)]==Pattern:
            index.append(str(i))
    return " ".join(index)
Genome = "GATATATGCATATACTT"
Pattern = "ATAT"
print(PatternMatching(Genome, Pattern))
You can iterate over the string, extracting len(ptrn) characters at a time, and compare each slice with the required pattern:
>>> s = "GATATATGCATATACTT"
>>> ptrn = "ATAT"
>>> res = [i for i in range(len(s) - len(ptrn) + 1) if s[i:i+len(ptrn)] == ptrn]
>>> out = ' '.join(map(str, res))
>>> out
'1 3 9'
With your current code, you can just use the iterable unpacking operator * when printing:
print(*PatternMatching(Genome, Pattern))
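As an alternative to slicing at every position, the same overlapping search can be sketched with a regular-expression lookahead. This is a different technique from the answers above, and the function name pattern_matching is made up for illustration.

```python
import re


def pattern_matching(genome, pattern):
    # A zero-width lookahead matches at every start position without
    # consuming text, so overlapping occurrences are all reported.
    lookahead = '(?=' + re.escape(pattern) + ')'
    starts = [m.start() for m in re.finditer(lookahead, genome)]
    return ' '.join(map(str, starts))


print(pattern_matching("GATATATGCATATACTT", "ATAT"))  # 1 3 9
```

re.escape is used so that a pattern containing regex metacharacters is still treated as literal text.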
def get_middle_character(odd_string):
    variable = len(odd_string)
    x = str((variable/2))
    middle_character = odd_string.find(x)
    middle_character2 = odd_string[middle_character]
    return middle_character2
def main():
    print('Enter a odd length string: ')
    odd_string = input()
    print('The middle character is', get_middle_character(odd_string))
main()
I need to figure out how to print the middle character in a given odd length string. But when I run this code, I only get the last character. What is the problem?
You need to think more carefully about what your code is actually doing. Let's do this with an example:
def get_middle_character(odd_string):
Let's say that we call get_middle_character('hello'), so odd_string is 'hello':
variable = len(odd_string) # variable = 5
Everything is OK so far.
x = str((variable/2)) # x = '2' in Python 2 ('2.5' in Python 3)
This is the first thing that is obviously odd - why do you want the string '2'? That's the index of the middle character, don't you just want an integer? Also you only need one pair of parentheses there, the other set is redundant.
middle_character = odd_string.find(x) # middle_character = -1
Obviously you can't str.find the substring '2' in odd_string, because it was never there. str.find returns -1 if it cannot find the substring; you should use str.index instead, which gives you a nice clear ValueError when it can't find the substring.
Note that even if you were searching for the middle character, rather than the stringified index of the middle character, you would get into trouble as str.find gives the first index at which the substring appears, which may not be the one you're after (consider 'lolly'.find('l')...).
middle_character2 = odd_string[middle_character] # middle_character2 = 'o'
As Python allows negative indexing from the end of a sequence, -1 is the index of the last character.
return middle_character2 # return 'o'
You could actually have simplified to return odd_string[middle_character], and removed the superfluous assignment; you'd have still had the wrong answer, but from neater code (and without middle_character2, which is a terrible name).
Hopefully you can now see where you went wrong, and it's trivially obvious what you should do to fix it. Next time use e.g. Python Tutor to debug your code before asking a question here.
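For reference, the fix the walkthrough points towards can be sketched as below: compute the middle index with integer division and index the string directly. The ValueError for even-length input is an added assumption, not something the question asked for.

```python
def get_middle_character(odd_string):
    if len(odd_string) % 2 == 0:
        # Assumption: even-length (including empty) input is treated as an error.
        raise ValueError('expected a string of odd length')
    # // is integer division, so the result is always a valid index.
    return odd_string[len(odd_string) // 2]


print(get_middle_character('hello'))  # l
print(get_middle_character('lolly'))  # l
```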
You simply need to access the character by its index, using string indexing and slicing. For example:
>>> s = '1234567'
>>> middle_index = len(s) // 2
>>> first_half, middle, second_half = s[:middle_index], s[middle_index], s[middle_index+1:]
>>> first_half, middle, second_half
('123', '4', '567')
Explanation:
str[:n]: returns the string from index 0 to index n-1
str[n]: returns the character at index n
str[n:]: returns the string from index n to the end
It should look like this:
def get_middle_character(odd_string):
    variable = len(odd_string) // 2
    middle_character = odd_string[variable]
    return middle_character
I know it's too late, but I'm posting my solution anyway.
I hope it will be useful ;)
def get_middle_char(string):
    if len(string) % 2 == 0:
        return None
    elif len(string) <= 1:
        return None
    str_len = len(string) // 2
    return string[str_len]
reversedString = ''
print('What is your name')
str = input()
idx = len(str)
print(idx)
str_to_iterate = str
for char in str_to_iterate[::-1]:
    print(char)
evenodd = len(str) % 2
if evenodd == 0:
    print('even')
else:
    print('odd')
l = str
if len(l) % 2 == 0:
    x = len(l) // 2
    y = len(l) // 2 - 1
    print(l[x], l[y])
else:
    n = len(l) // 2
    print(l[n])
I have a string in which every marked substring within < and >
has to be reversed (the brackets don't nest). For example,
"hello <wolfrevokcats>, how <t uoy era>oday?"
should become
"hello stackoverflow, how are you today?"
My current idea is to loop over the string and find pairs of indices
where < and > are. Then simply slice the string and put the slices
together again with everything that was in between the markers reversed.
Is this a correct approach? Is there an obvious/better solution?
It's pretty simple with regular expressions. re.sub takes a function as an argument to which the match object is passed.
>>> import re
>>> s = 'hello <wolfrevokcats>, how <t uoy era>oday?'
>>> re.sub('<(.*?)>', lambda m: m.group(1)[::-1], s)
'hello stackoverflow, how are you today?'
Explanation of the regex:
<(.*?)> will match everything between < and > in matching group 1. To ensure that the regex engine will stop at the first > symbol occurrence, the lazy quantifier *? is used.
The function lambda m: m.group(1)[::-1] that is passed to re.sub takes the match object, extracts group 1, and reverses the string. Finally re.sub inserts this return value.
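Wrapped in a function, the same idea might look like this. The name reverse_marked_re is hypothetical, and the second call is an extra illustration showing that an unmatched '<' is simply left in place, because the pattern needs a closing '>' to match.

```python
import re


def reverse_marked_re(text):
    # Replace each <...> span with its contents reversed.
    return re.sub(r'<(.*?)>', lambda m: m.group(1)[::-1], text)


print(reverse_marked_re('hello <wolfrevokcats>, how <t uoy era>oday?'))
# hello stackoverflow, how are you today?
print(reverse_marked_re('no closing <bracket here'))
# no closing <bracket here
```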
Or, use re.sub() and a replacing function:
>>> import re
>>> s = 'hello <wolfrevokcats>, how <t uoy era>oday?'
>>> re.sub(r"<(.*?)>", lambda match: match.group(1)[::-1], s)
'hello stackoverflow, how are you today?'
where .*? matches any characters any number of times in a non-greedy fashion. The parentheses around it capture the match in a group, which we then refer to in the replacing function as match.group(1). The [::-1] slice notation reverses a string.
I'm going to assume this is a coursework assignment and the use of regular expressions isn't allowed. So I'm going to offer a solution that doesn't use it.
content = "hello <wolfrevokcats>, how <t uoy era>oday?"
insert_pos = -1
result = []
placeholder_count = 0
for pos, ch in enumerate(content):
    if ch == '<':
        insert_pos = pos
    elif ch == '>':
        insert_pos = -1
        placeholder_count += 1
    elif insert_pos >= 0:
        result.insert(insert_pos - (placeholder_count * 2), ch)
    else:
        result.append(ch)
print("".join(result))
The gist of the code is to have just a single pass at the string one character at a time. When outside the brackets, simply append the character at the end of the result string. When inside the brackets, insert the character at the position of the opening bracket (i.e. pre-pend the character).
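A variation on the same single-pass idea, sketched here as an alternative rather than as the answer's own code, collects the characters between the markers in a buffer and reverses the buffer at the closing '>'. That avoids the index arithmetic of list.insert. The function name and the handling of an unmatched '<' are assumptions, and it relies on the question's statement that the brackets don't nest.

```python
def reverse_marked_single_pass(text):
    result = []
    buffer = None  # None while outside <...>, a list while inside
    for ch in text:
        if ch == '<':
            buffer = []                       # start collecting a marked span
        elif ch == '>' and buffer is not None:
            result.extend(reversed(buffer))   # emit the span reversed
            buffer = None
        elif buffer is not None:
            buffer.append(ch)
        else:
            result.append(ch)
    if buffer is not None:
        # Assumption: an unmatched '<' and what follows are kept unchanged.
        result.append('<')
        result.extend(buffer)
    return ''.join(result)


print(reverse_marked_single_pass('hello <wolfrevokcats>, how <t uoy era>oday?'))
# hello stackoverflow, how are you today?
```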
I agree that regular expressions are the proper tool to solve this problem, and I like the gist of Dmitry B.'s answer. However, I used this question to practice generators and functional programming, and I am posting my solution just to share it.
msg = "<,woN> hello <wolfrevokcats>, how <t uoy era>oday?"
def traverse(s, d=">"):
    for c in s:
        if c in "<>": d = c
        else: yield c, d
def group(tt, dc=None):
    for c, d in tt:
        if d != dc:
            if dc is not None:
                yield dc, l
            l = [c]
            dc = d
        else:
            l.append(c)
    else: yield dc, l
def direct(groups):
    func = lambda d: list if d == ">" else reversed
    fst = lambda t: t[0]
    snd = lambda t: t[1]
    for gr in groups:
        yield func(fst(gr))(snd(gr))
def concat(groups):
    return "".join("".join(gr) for gr in groups)
print(concat(direct(group(traverse(msg)))))
#Now, hello stackoverflow, how are you today?
Here's another one without using regular expressions:
def reverse_marked(str0):
    separators = ['<', '>']
    reverse = 0
    str1 = ['', str0]
    res = ''
    while len(str1) == 2:
        str1 = str1[1].split(separators[reverse], maxsplit=1)
        res = ''.join((res, str1[0][::-1] if reverse else str1[0]))
        reverse = 1 - reverse  # toggle 0 - 1 - 0 ...
    return res
print(reverse_marked('hello <wolfrevokcats>, how <t uoy era>oday?'))
Output:
hello stackoverflow, how are you today?