I'm trying to remove the characters between the parentheses and brackets based on the length of characters inside the parentheses and brackets.
Using this:
def remove_text_inside_brackets(text, brackets="()[]"):
    count = [0] * (len(brackets) // 2)  # count open/close brackets
    saved_chars = []
    for character in text:
        for i, b in enumerate(brackets):
            if character == b:  # found bracket
                kind, is_close = divmod(i, 2)
                count[kind] += (-1)**is_close  # `+1`: open, `-1`: close
                if count[kind] < 0:  # unbalanced bracket
                    count[kind] = 0  # keep it
                else:  # found bracket to remove
                    break
        else:  # character is not a [balanced] bracket
            if not any(count):  # outside brackets
                saved_chars.append(character)
    return ''.join(saved_chars)
I'm able to remove the characters between the parentheses and brackets, but I cannot figure out how to make the removal depend on the length of the text inside.
If the text inside the parentheses or brackets is 4 characters or shorter, I want to remove it together with the parentheses or brackets; if it is longer than 4 characters, I want to remove only the parentheses or brackets and keep the text.
Sample Text:
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
Output:
print(remove_text_inside_brackets(text))
This is a sentence.
Desired Output:
This is a sentence. Once a day twice a day
You can use a simple regex with re.sub and a function as replacement to check the length of the match:
import re
out = re.sub(r'\(.*?\)|\[.*?\]',
             lambda m: '' if len(m.group()) <= (4+2) else m.group()[1:-1],
             text)
Output:
'This is a sentence. Once a day twice a day '
This gives you the logic for more complex checks, in which case you might want to define a named function rather than a lambda.
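As a rough sketch of what that named-function variant might look like (the names strip_or_unwrap and MAX_LEN are illustrative, not from the answer above):

```python
import re

MAX_LEN = 4  # hypothetical threshold: inner text this short is dropped entirely

def strip_or_unwrap(match):
    """Return '' for short bracketed text, otherwise the inner text without delimiters."""
    inner = match.group()[1:-1]  # drop the surrounding ( ) or [ ]
    return '' if len(inner) <= MAX_LEN else inner

text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
out = re.sub(r'\(.*?\)|\[.*?\]', strip_or_unwrap, text)
print(repr(out))
```

Measuring the inner text directly avoids the 4+2 arithmetic, and the function body is the natural place for any extra checks.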
How about splitting on [, looking for ] in each piece, and measuring the length of the piece? Each piece that contains ] is at least one character longer than the text inside the brackets, so the threshold of 4 becomes 5:
def remove_text_inside_brackets(string):
    my_str = string.replace('(','[').replace(')',']')
    out = []
    for s in my_str.split('['):
        if ']' in s and len(s) > 5:
            s1 = s.rstrip().rstrip(']') + ' '
        elif ']' in s and len(s) <= 5:
            s1 = ['']
        else:
            s1 = s
        out.extend(s1)
    return ''.join(out).strip()
remove_text_inside_brackets(text)
Output:
'This is a sentence. RMVE Once a day twice a day'
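Note that this output still contains RMVE: the piece produced by the split is 'RMVE] ' (including the trailing space), so its length is 6 and it passes the len(s) > 5 test. One possible variant, sketched here on the assumption that only the text before the closing bracket should be measured, splits each piece at ] first. The function name and the max_len parameter are made up for this example:

```python
def remove_short_bracketed(string, max_len=4):
    # Treat parentheses like square brackets so one split handles both
    my_str = string.replace('(', '[').replace(')', ']')
    out = []
    for s in my_str.split('['):
        if ']' in s:
            inner, rest = s.split(']', 1)  # text inside the brackets, text after them
            out.append((inner if len(inner) > max_len else '') + rest)
        else:
            out.append(s)
    return ''.join(out).strip()

text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
result = remove_short_bracketed(text)
print(repr(result))
```

The spaces that surrounded the removed short groups are left in place, so the result may contain doubled spaces.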
Someone will hopefully improve on this, but as an alternative, this nested regular expression can work:
re.sub(r'\[([^\]]{5,})\]', r'\g<1>',
       re.sub(r'\(([^)]{5,})\)', r'\g<1>',
              re.sub(r'\[[^\]]{,4}\]', '',
                     re.sub(r'\([^)]{,4}\)', '', text))))
Note the extra spaces after the period and at the end of the line.
The output of this is slightly different than your given expected output:
'This is a sentence. Once a day twice a day '
It completely removes the text and its surrounding brackets when the length is 4 or shorter, while it replaces the match with just the inner text when the length is 5 or longer.
Note that nested brackets, e.g., ((some text) more text) or [(four)] may fail.
I would just use string.find, rather than go character by character. Too much state to track. Note that this will explode if there is an unmatched open paren or open bracket. That's not hard to catch.
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
def remove_text_inside_brackets(text):
    i = 0
    while i >= 0:
        # Try for parens.
        i = text.find('(')
        j = text.find(')')
        if i < 0:
            # No parens, try for brackets.
            i = text.find('[')
            j = text.find(']')
        if i >= 0:
            if j-i > 5:
                text = text[:i] + text[i+1:j] + text[j+1:]
            else:
                text = text[:i] + text[j+1:]
    return text
print(remove_text_inside_brackets(text))
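The unmatched-delimiter case mentioned above could be caught along these lines. This is only a sketch: it searches for the closing delimiter starting from the opening one and stops processing that kind of delimiter when none is found. The function name and the pairs parameter are illustrative, not part of the answer above.

```python
def remove_text_inside_brackets_safe(text, pairs=("()", "[]")):
    for open_ch, close_ch in pairs:
        start = 0
        while True:
            i = text.find(open_ch, start)
            if i < 0:
                break                      # no more openers of this kind
            j = text.find(close_ch, i + 1)
            if j < 0:
                break                      # unmatched opener: leave the rest untouched
            inner = text[i + 1:j]
            if len(inner) > 4:
                text = text[:i] + inner + text[j + 1:]   # keep the inner text only
                start = i + len(inner)
            else:
                text = text[:i] + text[j + 1:]           # drop delimiters and text
                start = i
    return text

text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
print(repr(remove_text_inside_brackets_safe(text)))
print(repr(remove_text_inside_brackets_safe("abc (unclosed")))
```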
Regular expressions can help solve this:
import re
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
text = re.sub(r'(\(|\[)[a-zA-Z]{1,4}(\)|\])', '', text)
print(re.sub(r'\[|\]|\(|\)', '', text))
output: "This is a sentence. Once a day twice a day"
Here the regular expression matches 1 to 4 letters inside parentheses or brackets, together with the delimiters themselves; the character class can be extended to match digits and other special characters too.
I'm trying to get the split() method to split at a list or string of one character.
Here's the program I was trying out before I came here:
def strcontains(a, str):
    a_match = [True for match in a if match in str]
    return True in a_match
def splitall(chars, text):
    full = []
    for char in chars:
        if char in text:
            x = text.split(char)
            if strcontains([i for i in chars], x):
                x = splitall(chars, ''.join(x))
            full.extend(x)
    return full
print(splitall('dfs','hello i like dogs cuz they so fluffy'))
What I expect:
['hello I like ', 'og', ' cuz they ', 'o ', 'lu', '', 'y']
What I get:
['hello i like ', 'ogs cuz they so fluffy', 'hello i like dogs cuz they so ', 'lu', '', 'y', 'hello i like dog', ' cuz they ', 'o fluffy']
How would I combine those list items to get what I expected?
Personally, I much prefer a pure pythonic way of solving a question like this, without having to import a big module (such as re). Below, I made a function to do this:
def splits(string, chars):
    indexes = []
    for index, char in enumerate(string):
        if char in chars:
            indexes.append(index)
    indexes.append(len(string))
    splits = []
    pindex = 0
    for index in indexes:
        newsect = string[pindex:index]
        for char in chars:
            newsect = newsect.replace(char, '')
        splits.append(newsect)
        pindex = index
    return splits
Breaking it down, there are 2 main parts of the function. In the first, it goes through and identifies where all the various target characters are, and marks their positions in a list, for chopping up in part 2.
In part 2, we start by creating a list, where all the substrings will go. The main loop works by adding the string in between the previous index (pindex), and the current index (indexes being the positions of the target characters determined in part 1).
For example, if you had a string of: "Bob and I went to the park," and the target was "n," then pindex starts as 0, and the first index of 'n' is 5, so the function takes string[0:5] ('Bob a'). Then, pindex is now 5, and the next index of 'n' is 12, so string[5:12] ('nd I we') is taken next.
A couple of extra lines, and why they exist:
indexes.append(len(string)): this adds the end of the string as an index. Otherwise, in part 2, the loop would stop after the last target character, and the part from that character to the end of the string would be ignored.
for char in chars: newsect = newsect.replace(char, ''): As the example shows, every slice after the first still begins with the target character it was cut at ('nd I we' rather than 'd I we'), because all that was done was slicing. This line gets rid of any target characters left over after slicing.
Note: If the last character of the string is a target, or if target characters sit next to each other, blank strings ('') will be added to the list. You can remove these with a line such as: if newsect=='': continue, before the splits.append(newsect)
Use re.split as explained in this article
https://www.geeksforgeeks.org/python-split-multiple-characters-from-string/
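For reference, a minimal sketch of the re.split approach applied to the question's input. The helper name split_on_any is made up for this example; re.escape is used so that characters with a special meaning inside a character class are treated literally.

```python
import re

def split_on_any(chars, text):
    # Build a character class such as [dfs] that matches any single listed character
    pattern = '[' + re.escape(chars) + ']'
    return re.split(pattern, text)

parts = split_on_any('dfs', 'hello i like dogs cuz they so fluffy')
print(parts)
# ['hello i like ', 'og', ' cuz they ', 'o ', 'lu', '', 'y']
```

The empty string in the result comes from the two adjacent f characters in "fluffy".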
Riddle:
Return a version of the given string, where for every star (*) in the string the star and the chars immediately to its left and right are gone. So "ab*cd" yields "ad" and "ab**cd" also yields "ad".
I'm wondering if there's a pythonish way to improve this algorithm:
def starKill(string):
    result = ''
    for idx in range(len(string)):
        if(idx == 0 and string[idx] != '*'):
            result += string[idx]
        elif (idx > 0 and string[idx] != '*' and (string[idx-1]) != '*'):
            result += string[idx]
        elif (idx > 0 and string[idx] == '*' and (string[idx-1]) != '*'):
            result = result[0:len(result) - 1]
    return result
starKill("wacy*xko") yields wacko
Here's a numpy solution just for fun:
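import numpy as np
def star_kill(string, target='*'):
    arr = np.array(list(string))
    mask = arr != target          # True where the character is not a star
    mask[1:] &= mask[:-1]         # also drop characters right after a star
    mask[:-1] &= mask[1:]         # and characters right before a star
    return ''.join(arr[mask])     # apply the mask once and rebuild the string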
Regular expression?
>>> import re
>>> for s in "ab*cd", "ab**cd", "wacy*xko", "*Mad*Physicist*":
...     print(re.sub(r'\w?\*\w?', '', s))
...
ad
ad
wacko
ahysicis
You can do this by iterating over the string three times in parallel. Each iteration will be shifted relative to the next by one character. The middle one is the one that will provide the valid letters, the other two let us check if adjacent characters are stars. The two flanking iterators require dummy values to represent "before the start" and "after the end" of the string. There are a variety of ways to set that up, I'm using itertools.chain (and .islice) to fill in None for the dummy values. But you could use plain string and iterator manipulation if you prefer (i.e. iter('x' + string) and iter(string[1:] + 'x')):
import itertools
def star_kill(string):
    main_iterator = iter(string)
    look_behind = itertools.chain([None], string)
    look_ahead = itertools.chain(itertools.islice(string, 1, None), [None])
    return "".join(a for a, b, c in zip(main_iterator, look_behind, look_ahead)
                   if a != '*' and b != '*' and c != '*')
Not sure whether or not it's "Pythonic," but the problem can be solved with regular expressions.
import re
def starkill(s):
    s = re.sub(".{0,1}\\*{1,}.{0,1}", "", s)
    return s
For those not familiar with regex, I'll break that long string down:
Prefix
".{0,1}"
This specifies we want the replaced section to begin with either 0 or 1 of any character. If there is a character before the star, we want to replace it; otherwise, we still want the expression to hit if the star is at the very beginning of the input string.
Star
"\\*{1,}"
This specifies that the middle of the expression must contain an asterisk character, but it can also contain more than one. For instance, "a****b" will still hit, even though there are four stars. We need a backslash before the asterisk because regex has asterisk as a reserved character, and we need a second backslash before that because Python strings reserve the backslash character.
Suffix
".{0,1}"
Same as the prefix. The expression can either end with one or zero of any character.
Hope that helps!
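The same pattern can probably be written more compactly with a raw string, which removes the need for the doubled backslash, and with the shorthand quantifiers ? and + in place of {0,1} and {1,}. A small sketch (the function name is only for this example):

```python
import re

def star_kill_raw(s):
    # r"..." keeps the backslash literal, so \* reaches the regex engine unchanged
    return re.sub(r".?\*+.?", "", s)

for sample in ("ab*cd", "ab**cd", "wacy*xko"):
    print(sample, "->", star_kill_raw(sample))
```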
I have a seemingly simple problem, which I cannot seem to solve. Given a string containing a DOI, I need to remove the last character if it is a punctuation mark until the last character is letter or number.
For example, if the string was:
sampleDoi = "10.1097/JHM-D-18-00044.',"
I want the following output:
"10.1097/JHM-D-18-00044"
ie. remove .',
I wrote the following script to do this:
import string
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i - 1
    else:
        print (a)
        break
However, this produces 10.1097/JHM-D-18-00 but I would like it to produce 10.1097/JHM-D-18-00044. Why is the 44 removed from the end?
The string function rstrip() is designed to do exactly this:
>>> sampleDoi = "10.1097/JHM-D-18-00044.',"
>>> sampleDoi.rstrip(",.'")
'10.1097/JHM-D-18-00044'
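If the set of trailing characters is not known in advance, rstrip can be given the same character set the question builds. A minimal sketch, assuming (as in the question) that every punctuation mark except the underscore counts as invalid; the variable names are illustrative:

```python
import string

# Mirror the question: every punctuation mark except the underscore is "invalid"
invalid_chars = string.punctuation.replace("_", "")

sample_doi = "10.1097/JHM-D-18-00044.',"
cleaned = sample_doi.rstrip(invalid_chars)  # strips any trailing run of these characters
print(cleaned)
```

rstrip treats its argument as a set of individual characters, not as a suffix, so the order of the characters does not matter.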
Corrected code:
import string
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i  # Well, really this line can just be removed altogether.
    else:
        print (a)
        break
This gives the output you want, while keeping the original code mostly the same.
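As for why the original version loses the trailing 44: it appears to be because a gets shorter on every pass while i keeps decreasing, so the first pass drops one character, the second drops two, and the third drops three. A small trace of the original logic, with a trace list added purely for illustration:

```python
import string

invalid_chars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
trace = []                      # hypothetical helper list recording each intermediate value
i = -1
for ch in reversed(a):          # iterates over the original, unshortened string
    if ch in invalid_chars:
        a = a[:i]               # i = -1, -2, -3: drops 1, then 2, then 3 characters
        trace.append(a)
        i -= 1
    else:
        break

for step in trace:
    print(step)
```

The second step already yields the desired value; the third pass then cuts three more characters from it.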
This is one way using next and str.isalnum with a generator expression utilizing enumerate / reversed.
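sampleDoi = "10.1097/JHM-D-18-00044.',"
# idx is the number of trailing non-alphanumeric characters
idx = next((i for i, j in enumerate(reversed(sampleDoi)) if j.isalnum()), len(sampleDoi))
res = sampleDoi[:len(sampleDoi) - idx]
print(res)
'10.1097/JHM-D-18-00044'
The default parameter len(sampleDoi) is used so that, if no alphanumeric character is found, an empty string is returned. Slicing with len(sampleDoi) - idx rather than -idx keeps the string intact when it already ends in an alphanumeric character (idx == 0), since sampleDoi[:-0] would be empty.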
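If you don't want to use regex:
import string
the_str = "10.1097/JHM-D-18-00044.',"
# The first condition guards against an IndexError if the string becomes empty
while the_str and the_str[-1] in string.punctuation:
    the_str = the_str[:-1]
This removes the last character until it is no longer a punctuation character.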
How would you count the number of spaces or newline characters in a text in such a way that consecutive spaces are counted only as one?
For example, this is very close to what I want:
string = "This is an example text.\n But would be good if it worked."
counter = 0
for i in string:
    if i == ' ' or i == '\n':
        counter += 1
print(counter)
However, instead of returning with 15, the result should be only 11.
The default str.split() function will treat consecutive runs of spaces as one. So simply split the string, get the size of the resulting list, and subtract one.
len(string.split())-1
Assuming you are permitted to use Python regex;
import re
print(len(re.findall(r"[ \n]+", string)))
Quick and easy!
UPDATE: Additionally, use [\s] instead of [ \n] to match any whitespace character.
You can do this:
string = "This is an example text.\n But would be good if it worked."
counter = 0
# A boolean flag indicating whether the previous character was a space
previous = False
for i in string:
    if i == ' ' or i == '\n':
        # The current character is a space
        previous = True  # Setup for the next iteration
    else:
        # The current character is not a space, check if the previous one was
        if previous:
            counter += 1
        previous = False
print(counter)
re to the rescue.
>>> import re
>>> string = "This is an example text.\n But would be good if it worked."
>>> spaces = sum(1 for match in re.finditer(r'\s+', string))
>>> spaces
11
This consumes minimal memory. An alternative solution that builds a temporary list would be
>>> len(re.findall(r'\s+', string))
11
If you only want to consider space characters and newline characters (as opposed to tabs, for example), use the regex r'(\n| )+' instead of r'\s+'.
Just store a character that was the last character found. Set it to i each time you loop. Then within your inner if, do not increase the counter if the last character found was also a whitespace character.
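A rough sketch of that idea, assuming only spaces and newlines should count as in the question. The function name and the WHITESPACE tuple are made up for this example:

```python
WHITESPACE = (' ', '\n')   # the two characters the question counts

def count_whitespace_runs(text):
    counter = 0
    last = None                      # no previous character yet
    for ch in text:
        # Count only the first character of each run
        if ch in WHITESPACE and last not in WHITESPACE:
            counter += 1
        last = ch                    # remember this character for the next iteration
    return counter

print(count_whitespace_runs("a  b \n c"))
```

A tuple is used for the membership test rather than the string ' \n', because an empty or missing previous character would otherwise need special handling.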
You can iterate through numbers to use them as indexes.
counter = 0
for i in range(1, len(string)):
    if string[i] in ' \n' and string[i-1] not in ' \n':
        counter += 1
if string[0] in ' \n':
    counter += 1
print(counter)
Pay attention to the first character: this construction starts from the second character so that string[i-1] never wraps around to the end of the string, and the first character is checked separately.
You can use enumerate, checking the next char is not also whitespace so consecutive whitespace will only count as 1:
string = "This is an example text.\n But would be good if it worked."
print(sum(ch.isspace() and not string[i:i+1].isspace() for i, ch in enumerate(string, 1)))
You can also use iter with a generator function, keeping track of the last character and comparing:
def con(s):
    it = iter(s)
    prev = next(it)
    for ele in it:
        yield prev.isspace() and not ele.isspace()
        prev = ele
    yield ele.isspace()
print(sum(con(string)))
An itertools version:
string = "This is an example text.\n But would be good if it worked. "
from itertools import tee, zip_longest
a, b = tee(string)
next(b)
print(sum(a.isspace() and not b.isspace() for a,b in zip_longest(a,b, fillvalue="") ))
Try:
def word_count(my_string):
    word_count = 1
    for i in range(1, len(my_string)):
        if my_string[i] == " ":
            if not my_string[i - 1] == " ":
                word_count += 1
    return word_count
You can use the function groupby() to find groups of consecutive spaces:
from collections import Counter
from itertools import groupby
s = 'This is an example text.\n But would be good if it worked.'
c = Counter(k for k, _ in groupby(s, key=lambda x: ' ' if x == '\n' else x))
print(c[' '])
# 11
So I need the output of my program to look like:
ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end
The largest run of consecutive whitespace characters was 47.
But what I am getting is:
ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end
The longest run of consecutive whitespace characters was 47.
When looking further into the code I wrote, I found with the print(c) statement that this happens:
['ababa', '', 'ab ba ', '', ' xxxxxxxxxxxxxxxxxxx', 'that is it followed by a lot of spaces .', ' no dot at the end']
Between some of the lines there is an empty string (''), which is probably why my print statement won't work.
How would I remove them? I've tried using different list functions, but I keep getting syntax errors.
This is the code I made:
a = '''ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end'''
c = a.splitlines()
print(c)
#d = c.remove(" ") #this part doesnt work
#print(d)
for row in c:
    print(' '.join(row.split()))
last_char = ""
current_seq_len = 0
max_seq_len = 0
for d in a:
    if d == last_char:
        current_seq_len += 1
        if current_seq_len > max_seq_len:
            max_seq_len = current_seq_len
    else:
        current_seq_len = 1
        last_char = d
#this part just needs to count the whitespace
print("The longest run of consecutive whitespace characters was",str(max_seq_len)+".")
Regex time:
import re
print(re.sub(r"([\n ])\1*", r"\1", a))
#>>> ababa
#>>> ab ba
#>>> xxxxxxxxxxxxxxxxxxx
#>>> that is it followed by a lot of spaces .
#>>> no dot at the end
re.sub(matcher, replacement, target_string)
Matcher is r"([\n ])\1*", which means:
([\n ]) → match either "\n" or " " and put it in a group (#1)
\1* → match whatever group #1 matched, 0 or more times
And the replacement is just
\1 → group #1
You can get the longest whitespace sequence with
max(len(match.group()) for match in re.finditer(r"([\n ])\1*", a))
This uses the same matcher, but instead collects the length of each match and takes the maximum.
From what I can tell, your easiest solution would be using list comprehension:
c= [item for item in a.splitlines() if item != '']
If you wish to make it slightly more robust by also removing strings that only contain whitespace such as ' ', then you can alter it as follows:
c= [item for item in a.splitlines() if item.strip() != '']
You can then also join the list back together as follows:
output = '\n'.join(c)
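Putting these pieces together, one possible sketch drops the blank lines and collapses the runs of spaces inside each remaining line in a single pass. The input string here is a shortened, made-up stand-in for the text in the question:

```python
# Shortened, hypothetical stand-in for the multi-line string in the question
a = "ababa\n\nab   ba \n\n xxx\nthat is it   .\n   no dot"

# Keep only lines with visible content, and squeeze inner whitespace to single spaces
rows = [' '.join(line.split()) for line in a.splitlines() if line.strip() != '']
output = '\n'.join(rows)
print(output)
```

Because line.split() with no arguments discards leading and trailing whitespace as well, the rejoined rows come out trimmed.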
This can be easily solved with the built-in filter function:
c = list(filter(None, a.splitlines()))
# or, more explicit
c = list(filter(lambda x: x != "", a.splitlines()))
The first variant will create a list with all elements from the list returned by a.splitlines() that do not evaluate to False, like the empty string. In Python 3, filter returns an iterator, hence the list() call.
The second variant creates a small anonymous function (using lambda) that checks if a given element is the empty string and returns False if that is the case. This is more explicit than the first variant.
Another option would be to use a list comprehension that achieves the same thing:
c = [string for string in a.splitlines() if string]
# or, more explicit
c = [string for string in a.splitlines() if string != ""]