Python: Removing whitespace from multiple lines of a string - python

So I need the output of my program to look like:
ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end
The largest run of consecutive whitespace characters was 47.
But what I am getting is:
ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end
The longest run of consecutive whitespace characters was 47.
When looking further into the code I wrote, I found with the print(c) statement that this happens:
['ababa', '', 'ab ba ', '', ' xxxxxxxxxxxxxxxxxxx', 'that is it followed by a lot of spaces .', ' no dot at the end']
Between some of the lines there is an empty string, '', which is probably why my print statement won't work.
How would I remove them? I've tried using different list functions but I keep getting syntax errors.
This is the code I made:
a = '''ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end'''
c = a.splitlines()
print(c)
#d = c.remove(" ") #this part doesnt work
#print(d)
for row in c:
    print(' '.join(row.split()))
last_char = ""
current_seq_len = 0
max_seq_len = 0
for d in a:
    if d == last_char:
        current_seq_len += 1
        if current_seq_len > max_seq_len:
            max_seq_len = current_seq_len
    else:
        current_seq_len = 1
        last_char = d
#this part just needs to count the whitespace
print("The longest run of consecutive whitespace characters was",str(max_seq_len)+".")

Regex time:
import re
print(re.sub(r"([\n ])\1*", r"\1", a))
#>>> ababa
#>>> ab ba
#>>> xxxxxxxxxxxxxxxxxxx
#>>> that is it followed by a lot of spaces .
#>>> no dot at the end
re.sub(matcher, replacement, target_string)
Matcher is r"([\n ])\1*", which means:
([\n ]) → match either "\n" or " " and put it in a group (#1)
\1* → match whatever group #1 matched, 0 or more times
And the replacement is just
\1 → group #1
You can get the longest whitespace sequence with
max(len(match.group()) for match in re.finditer(r"([\n ])\1*", a))
This uses the same matcher, but instead of replacing the matches it collects their lengths and takes the maximum.
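Putting the two regex steps together, here is a minimal sketch. The sample text and the names `run`, `collapsed` and `longest` are made up for illustration and are not the asker's exact input.

```python
import re

# Made-up sample; not the asker's exact input.
text = "ababa\n\nab   ba\n\n\nno dot    at the end"

# One newline or space, followed by repeats of that same character.
run = re.compile(r"([\n ])\1*")

# Replace every run with a single copy of the repeated character.
collapsed = run.sub(r"\1", text)

# Length of the longest run; default=0 covers text with no spaces or newlines.
longest = max((len(m.group()) for m in run.finditer(text)), default=0)

print(collapsed)
print(longest)
```

Note that a space followed by a newline counts as two separate runs here, because the backreference only repeats the character that was matched first.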

From what I can tell, your easiest solution would be a list comprehension:
c = [item for item in a.splitlines() if item != '']
If you wish to make it slightly more robust by also removing strings that contain only whitespace, such as ' ', you can alter it as follows:
c = [item for item in a.splitlines() if item.strip() != '']
You can then join the list back together as follows:
output = '\n'.join(c)
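As a rough sketch, the filtering can be combined with the `' '.join(row.split())` idiom from the question. The function name `clean_lines` and the sample string are assumptions for illustration only.

```python
def clean_lines(text):
    """Drop blank lines and collapse runs of whitespace inside the remaining lines."""
    kept = [line for line in text.splitlines() if line.strip() != ""]
    return "\n".join(" ".join(line.split()) for line in kept)


# Made-up sample with empty lines, a whitespace-only line and padded words.
sample = "ababa\n\nab     ba \n   \n  no dot at the   end"
print(clean_lines(sample))
```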

This can be easily solved with the built-in filter function:
c = filter(None, a.splitlines())
# or, more explicit
c = filter(lambda x: x != "", a.splitlines())
The first variant keeps all elements from the list returned by a.splitlines() that do not evaluate to False, so falsy values like the empty string are dropped. In Python 3, filter returns a lazy iterator rather than a list, so wrap it in list(...) if you need an actual list.
The second variant creates a small anonymous function (using lambda) that checks whether a given element is the empty string and returns False if that is the case. This is more explicit than the first variant.
Another option would be to use a list comprehension that achieves the same thing:
c = [string for string in a.splitlines() if string]
# or, more explicit
c = [string for string in a.splitlines() if string != ""]
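One thing that may be worth checking if you are on Python 3: filter gives back an iterator that can only be consumed once. A small sketch, with made-up sample data and variable names:

```python
lines = "ababa\n\nab ba\n\nno dot".splitlines()

lazy = filter(None, lines)   # a lazy iterator in Python 3, not a list
non_empty = list(lazy)       # materialize it once
leftover = list(lazy)        # the iterator is now exhausted

print(non_empty)
print(leftover)
```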

Related

Remove multiple spaces from python list elements

I have a list with spaces within the string. How can I remove these spaces.
['ENTRY', ' 102725023 CDS T01001']
I would like to have the final list as:
['ENTRY', '102725023 CDS T01001']
I tried the strip() function, but it does not work on a list. Any help is highly appreciated.
Suppose this is your string:
string = "A b c "
and you want the runs of whitespace collapsed and the ends trimmed:
A b c
What you can do is:
string2 = " ".join(string.split())
print(string2)
The easiest way is to build a new list of the values with the extra spaces removed. For this, you can use a list comprehension and the idiom proposed by @CodeWithYash:
old_list = ['ENTRY', ' 102725023 CDS T01001']
new_list = [" ".join(s.split()) for s in old_list]
Note that this works because the default behavior of split is:
split on any run of whitespace, and discard empty strings from the result.
If you wanted to remove something other than whitespace, you would have to implement your own function, maybe using a regular expression.
Note also that in Python strings are immutable: you cannot edit a string in place, but you can rebind each position of the list to a new string. If you do not want to create a new list (for example, because a reference to the list is kept elsewhere in the program), you can replace every item:
l = ['ENTRY', ' 102725023 CDS T01001']
for i, s in enumerate(l):
    l[i] = " ".join(s.split())
print(l)
Output:
['ENTRY', '102725023 CDS T01001']
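If other parts of the program hold a reference to the same list, slice assignment is another way to update it in place. This is only a sketch; the name `alias` and the padded sample values are assumptions for illustration.

```python
old_list = ["ENTRY", "   102725023   CDS   T01001"]
alias = old_list  # a second reference to the same list object

# Slice assignment replaces the contents without creating a new list object.
old_list[:] = [" ".join(s.split()) for s in old_list]

print(alias)
```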
I wrote this function:
s = " abc def xyz "
def proc(s):
    l = len(s)
    s = s.replace('  ', ' ')  # two spaces -> one space
    while len(s) != l:
        l = len(s)
        s = s.replace('  ', ' ')
    if s[0] == ' ':
        s = s[1:]
    if s[-1] == ' ':
        s = s[:-1]
    return s
print(proc(s))
The idea is to keep replacing every two spaces with one space until the length stops changing, then check whether the first and last characters are also spaces.
I don't know if there exists an easier way with regular expressions or something else.
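For what it's worth, a regular expression version could look like the sketch below. The function name `squeeze_spaces` is made up, and it only treats the plain space character as whitespace, matching the function above.

```python
import re


def squeeze_spaces(s):
    # Replace each run of two or more spaces with one space, then trim both ends.
    return re.sub(r" {2,}", " ", s).strip(" ")


print(squeeze_spaces("  abc   def  xyz  "))
```

Unlike indexing with s[0] and s[-1], this also accepts an empty string.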

Remove text between () and [] based on condition in Python?

I'm trying to remove the characters between the parentheses and brackets based on the length of characters inside the parentheses and brackets.
Using this:
def remove_text_inside_brackets(text, brackets="()[]"):
    count = [0] * (len(brackets) // 2) # count open/close brackets
    saved_chars = []
    for character in text:
        for i, b in enumerate(brackets):
            if character == b: # found bracket
                kind, is_close = divmod(i, 2)
                count[kind] += (-1)**is_close # `+1`: open, `-1`: close
                if count[kind] < 0: # unbalanced bracket
                    count[kind] = 0 # keep it
                else: # found bracket to remove
                    break
        else: # character is not a [balanced] bracket
            if not any(count): # outside brackets
                saved_chars.append(character)
    return ''.join(saved_chars)
I'm able to remove the characters between the parentheses and brackets, but I cannot figure out how to remove the characters based on the length of characters inside.
I want to remove the text together with its parentheses or brackets if the text inside is 4 characters or shorter. If it is longer than 4 characters, I want to remove only the parentheses or brackets and keep the text.
Sample Text:
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
Output:
print(remove_text_inside_brackets(text))
This is a sentence.
Desired Output:
This is a sentence. Once a day twice a day
You can use a simple regex with re.sub and a function as replacement to check the length of the match:
import re
out = re.sub(r'\(.*?\)|\[.*?\]',
             lambda m: '' if len(m.group()) <= (4 + 2) else m.group()[1:-1],
             text)
Output:
'This is a sentence.  Once a day twice a day '
This gives you the logic for more complex checks, in which case you might want to define a named function rather than a lambda.
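A sketch of the named-function variant, with an optional clean-up of the leftover spaces. The names `BRACKETED`, `unwrap_or_drop` and `max_len` are made up for illustration.

```python
import re

BRACKETED = re.compile(r"\(.*?\)|\[.*?\]")


def unwrap_or_drop(match, max_len=4):
    """Drop short bracketed text entirely; keep longer text without its brackets."""
    inner = match.group()[1:-1]
    return "" if len(inner) <= max_len else inner


text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
out = BRACKETED.sub(unwrap_or_drop, text)

# Optional: collapse the double space and trailing space left by the removals.
tidy = " ".join(out.split())
print(tidy)
```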
How about splitting on [ and look for ] and measure length (since each split with ] will be necessarily longer than normal split, 4 becomes 5):
def remove_text_inside_brackets(string):
    my_str = string.replace('(','[').replace(')',']')
    out = []
    for s in my_str.split('['):
        if ']' in s and len(s) > 5:
            s1 = s.rstrip().rstrip(']') + ' '
        elif ']' in s and len(s) <= 5:
            s1 = ['']
        else:
            s1 = s
        out.extend(s1)
    return ''.join(out).strip()
remove_text_inside_brackets(text)
Output:
'This is a sentence. RMVE Once a day twice a day'
Someone will hopefully improve on this, but as an alternative, this nested regular expression can work:
re.sub(r'\[([^\]]{5,})\]', r'\g<1>',
    re.sub(r'\(([^)]{5,})\)', r'\g<1>',
        re.sub(r'\[[^\]]{,4}\]', '',
            re.sub(r'\([^)]{,4}\)', '', text))))
The output of this is slightly different from your expected output:
'This is a sentence.  Once a day twice a day '
Note the extra spaces after the period and at the end of the line.
It completely removes the text and its surrounding brackets when the length is 4 or shorter, while it replaces the match with just the inner text when the length is 5 or longer.
Note that nested brackets, e.g., ((some text) more text) or [(four)] may fail.
I would just use string.find, rather than go character by character. Too much state to track. Note that this will explode if there is an unmatched open paren or open bracket. That's not hard to catch.
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
def remove_text_inside_brackets(text):
    i = 0
    while i >= 0:
        # Try for parens.
        i = text.find('(')
        j = text.find(')')
        if i < 0:
            # No parens, try for brackets.
            i = text.find('[')
            j = text.find(']')
        if i >= 0:
            if j-i > 5:
                text = text[:i] + text[i+1:j] + text[j+1:]
            else:
                text = text[:i] + text[j+1:]
    return text
print(remove_text_inside_brackets(text))
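One possible way to catch the unmatched case is sketched below. It is a variation on the find-based idea, not the answer's exact code: the name `strip_short_groups` and the `pairs` and `max_len` parameters are made up, and it handles all parentheses first and then all brackets.

```python
def strip_short_groups(text, pairs=("()", "[]"), max_len=4):
    """Remove short bracketed groups; unwrap longer ones; stop at an unmatched opener."""
    for opener, closer in pairs:
        start = 0
        while True:
            i = text.find(opener, start)
            if i < 0:
                break                      # no more openers of this kind
            j = text.find(closer, i + 1)
            if j < 0:
                break                      # unmatched opener: leave the rest untouched
            inner = text[i + 1:j]
            if len(inner) > max_len:
                text = text[:i] + inner + text[j + 1:]   # keep text, drop the brackets
                start = i + len(inner)
            else:
                text = text[:i] + text[j + 1:]           # drop text and brackets
                start = i
    return text


sample = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
print(strip_short_groups(sample))
```

Searching for the closer only after the opener also avoids pairing an opener with a stray closer that appears earlier in the text.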
We can use regular expressions to solve this:
import re
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
text = re.sub(r'(\(|\[)[a-zA-Z]{1,4}(\)|\])', '', text)
print(re.sub(r'\[|\]|\(|\)', '', text))
output: "This is a sentence.  Once a day twice a day " (note the extra spaces left behind by the removed groups)
In the first regular expression I match 1 to 4 letters inside the braces, together with the braces themselves. You can extend the character class to match digits and other special characters too.

How would I use the python split() method, but in a way that it splits at multiple characters and not just that specific string that is inputed?

I'm trying to get the split() method to split at a list or string of one character.
Here's the program I was trying out before I came here:
def strcontains(a, str):
    a_match = [True for match in a if match in str]
    return True in a_match
def splitall(chars, text):
    full = []
    for char in chars:
        if char in text:
            x = text.split(char)
            if strcontains([i for i in chars], x):
                x = splitall(chars, ''.join(x))
            full.extend(x)
    return full
print(splitall('dfs','hello i like dogs cuz they so fluffy'))
What I expect:
['hello I like ', 'og', ' cuz they ', 'o ', 'lu', '', 'y']
What I get:
['hello i like ', 'ogs cuz they so fluffy', 'hello i like dogs cuz they so ', 'lu', '', 'y', 'hello i like dog', ' cuz they ', 'o fluffy']
How would I combine those list items to get what I expected?
Personally, I much prefer a pure pythonic way of solving a question like this, without having to import a big module (such as re). Below, I made a function to do this:
def splits(string, chars):
    indexes = []
    for index, char in enumerate(string):
        if char in chars:
            indexes.append(index)
    indexes.append(len(string))
    splits = []
    pindex = 0
    for index in indexes:
        newsect = string[pindex:index]
        for char in chars:
            newsect = newsect.replace(char, '')
        splits.append(newsect)
        pindex = index
    return splits
Breaking it down, there are 2 main parts of the function. In the first, it goes through and identifies where all the various target characters are, and marks their positions in a list, for chopping up in part 2.
In part 2, we start by creating a list, where all the substrings will go. The main loop works by adding the string in between the previous index (pindex), and the current index (indexes being the positions of the target characters determined in part 1).
For example, if you had the string "Bob and I went to the park" and the target was "n", then pindex starts as 0 and the first index of 'n' is 5, so the function slices string[0:5] ('Bob a') and adds it to the final list. Then pindex is 5 and the next index of 'n' is 12, so string[5:12] ('nd I we') is sliced next.
A couple of extra lines, and why they exist:
indexes.append(len(string)): this adds the end of the string as an index. Otherwise, in part 2, after it reaches the last index of the target characters, it would quit, and the part from the last target character to the end would be ignored.
for char in chars: newsect = newsect.replace(char, ''): as the example shows, every slice after the first still starts with the target character it was cut at ('nd I we' rather than 'd I we'), because all that was done was slicing. This line gets rid of any target characters left over after slicing.
Note: wherever two target characters are adjacent, or the string starts or ends with a target, an empty string ('') is added to the list. You can skip these with a line such as: if newsect=='': continue, before the splits.append(newsect)
Use re.split as explained in this article
https://www.geeksforgeeks.org/python-split-multiple-characters-from-string/
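A minimal sketch of the re.split approach, using the input from the question. The function name `split_on_any` is made up, and it assumes `chars` is a non-empty string of single characters.

```python
import re


def split_on_any(chars, text):
    # Build a character class such as [dfs]; re.escape keeps characters
    # like '.', '-' or ']' literal inside the class.
    pattern = "[" + re.escape(chars) + "]"
    return re.split(pattern, text)


print(split_on_any("dfs", "hello i like dogs cuz they so fluffy"))
```

Like str.split with an explicit separator, this keeps the empty string between two adjacent separators (the "ff" in "fluffy").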

Remove punctuation items from end of string

I have a seemingly simple problem, which I cannot seem to solve. Given a string containing a DOI, I need to remove the last character if it is a punctuation mark until the last character is letter or number.
For example, if the string was:
sampleDoi = "10.1097/JHM-D-18-00044.',"
I want the following output:
"10.1097/JHM-D-18-00044"
ie. remove .',
I wrote the following script to do this:
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i - 1
    else:
        print (a)
        break
However, this produces 10.1097/JHM-D-18-00 but I would like it to produce 10.1097/JHM-D-18-00044. Why is the 44 removed from the end?
The string function rstrip() is designed to do exactly this:
>>> sampleDoi = "10.1097/JHM-D-18-00044.',"
>>> sampleDoi.rstrip(",.'")
'10.1097/JHM-D-18-00044'
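If you don't know in advance which punctuation marks can appear, rstrip also accepts the whole punctuation set. A small sketch; the function name is made up, and the underscore is excluded only because the question's invalidChars set excludes it.

```python
import string

# Same character set as invalidChars in the question: punctuation minus "_".
TRAILING = string.punctuation.replace("_", "")


def strip_trailing_punctuation(doi):
    # rstrip removes any trailing characters found in the given set.
    return doi.rstrip(TRAILING)


print(strip_trailing_punctuation("10.1097/JHM-D-18-00044.',"))
```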
Corrected code:
import string
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i # Well, really this line can just be removed altogether.
    else:
        print (a)
        break
This gives the output you want, while keeping the original code mostly the same.
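As for why the 44 disappeared in the original: the string gets shorter on every pass while i keeps decreasing, so the three passes cut 1, then 2, then 3 characters. A quick trace, with the made-up name `steps`, shows it:

```python
a = "10.1097/JHM-D-18-00044.',"
i = -1
steps = []
for _ in range(3):      # one pass per trailing punctuation character
    a = a[:i]           # cuts 1, then 2, then 3 characters from the shrinking string
    steps.append(a)
    i = i - 1

for step in steps:
    print(step)
```

The last line printed is 10.1097/JHM-D-18-00, the result from the question. Keeping i fixed at -1 removes exactly one character per pass.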
This is one way using next and str.isalnum with a generator expression utilizing enumerate / reversed.
sampleDoi = "10.1097/JHM-D-18-00044.',"
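idx = next((i for i, j in enumerate(reversed(sampleDoi)) if j.isalnum()), len(sampleDoi))
res = sampleDoi[:len(sampleDoi) - idx]
print(res)
'10.1097/JHM-D-18-00044'
The default len(sampleDoi) is used so that, if no alphanumeric character is found, an empty string is returned. Slicing with len(sampleDoi) - idx rather than -idx also keeps the string intact when it already ends in a letter or digit (idx == 0), a case where sampleDoi[:-0] would be empty.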
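If you don't want to use regex:
import string
the_str = "10.1097/JHM-D-18-00044.',"
# Checking the_str first stops the loop if the string becomes empty.
while the_str and the_str[-1] in string.punctuation:
    the_str = the_str[:-1]
This removes the last character until it is no longer a punctuation character.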

python regex letter must be followed by another letter

A string consists of letters and numbers but if it contains a 'c' the following letter after the 'c' must be either 'h' or 'k', does anyone know how to write such a regex for Python?
I would suggest the following:
^(?!.*c(?![hk]))[^\W_]+$
Explanation:
^             # Start of string
(?!           # Assert that it's not possible to match...
 .*           # Any string, followed by
 c            # the letter c
 (?!          # unless that is followed by
  [hk]        # h or k
 )            # (End of inner negative lookahead)
)             # (End of outer negative lookahead).
[^\W_]+       # Match one or more letters or digits.
$             # End of string
[^\W_] means "Match any character that's matched by \w, excluding the _".
>>> import re
>>> strings = ["test", "check", "tick", "pic", "cow"]
>>> for item in strings:
...     print("{0} is {1}".format(item,
...         "valid" if re.match(r"^(?!.*c(?![hk]))[^\W_]+$", item)
...         else "invalid"))
...
test is valid
check is valid
tick is valid
pic is invalid
cow is invalid
The expression ^([^\Wc]*(c[hk])*)*$ also works. It says the whole string (from ^ to $) must consist of repetitions of blocks where each block has any number of non-c characters, [^\Wc]*, and any number of ch or ck pairs, (c[hk])* .
For example:
re.search(r'^([^\Wc]*(c[hk])*)*$', 'checkchek').group()
gives
'checkchek'
If you don't want to match the empty string, replace the last * with a +. Ordinarily, to avoid the error mentioned in a comment when the input string doesn't match, assign the search result to a variable and test that it is not None:
In [88]: y = re.search(r'^([^\Wc]*(c[hk])*)*$', 'ca')
In [89]: if y:
   ....:     print(y.group())
   ....: else:
   ....:     print('No match')
   ....:
No match
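The two patterns suggested so far are not quite equivalent on edge cases, as far as I can tell. The sketch below compares them; the names `LOOKAHEAD`, `BLOCKS` and the extra sample strings are made up for illustration.

```python
import re

LOOKAHEAD = re.compile(r"^(?!.*c(?![hk]))[^\W_]+$")
BLOCKS = re.compile(r"^([^\Wc]*(c[hk])*)*$")

samples = ["test", "check", "tick", "pic", "cow", "", "a_b"]

# Map each sample to (matches LOOKAHEAD, matches BLOCKS).
results = {s: (bool(LOOKAHEAD.search(s)), bool(BLOCKS.search(s))) for s in samples}

for s, (first, second) in results.items():
    print(repr(s), first, second)
```

They agree on the five words from the earlier example, but the second pattern also accepts the empty string and strings containing an underscore, since [^\Wc] still allows "_".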
The following code detects the presence of "c not followed by h or k" in myinputstring, and if so it prints "problem":
import re
if re.search(r'c(?![hk])', myinputstring):
    print("problem")
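If you also want to know where the offending letters are, re.finditer can report their positions. A small sketch; the function name `bad_c_positions` and the sample string are made up.

```python
import re


def bad_c_positions(s):
    """Return the index of every 'c' that is not followed by 'h' or 'k'."""
    return [m.start() for m in re.finditer(r"c(?![hk])", s)]


print(bad_c_positions("pic cow check"))
```

A 'c' at the very end of the string counts as a problem too, because nothing follows it.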
