How to remove selected characters from a string? - python

I have been trying to learn how to remove special characters from arbitrary strings. Such a string could look like this:
uh\n haha - yes 'nope' \t tuben\xa01337
and I have tried both a regex and str.translate to see what works:
import re
random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
print(re.sub(r"/[' \n \t\r]|(\xa0)/g", '', random_string))
print("-------")
print(random_string.translate(str.maketrans({c: "" for c in "\n \xa0\t\r"})))
The output of that returns:
uh
haha - yes 'nope' tuben 1337
-------
uhhaha-yes'nope'tuben1337
The problem is that neither attempt works as I intended; the output I want is:
uh haha - yes nope tuben 1337
How can I do that?
\n, \t, \xa0 and similar characters should be replaced with a single space
' and " should simply be removed, without leaving a space behind
Two or more consecutive whitespace characters should be replaced with a single space
Any other special characters should be removed as well

You can use
import re
random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
random_string = re.sub(r"\s+", " ", random_string).strip().replace('"', '').replace("'", '')
print(random_string)
See the Python demo.
Notes:
re.sub(r"\s+", " ", random_string) - shrinks any chunks of one or more whitespace chars into a single regular space char
.strip() - removes leading/trailing whitespace
.replace('"', '').replace("'", '') - removes " and ' chars.
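A minimal sketch of how these steps could be wrapped into one helper. It assumes the quotes should be deleted before whitespace is collapsed, so that a stand-alone quote does not leave a double space behind. The name clean_text is hypothetical and not part of the question.

```python
import re

def clean_text(text):
    """Remove quote characters, then collapse whitespace runs to one space."""
    # Assumption: only ' and " need deleting; other punctuation is kept.
    without_quotes = text.replace('"', "").replace("'", "")
    # For str patterns in Python 3, \s also matches \n, \t, \r and \xa0.
    return re.sub(r"\s+", " ", without_quotes).strip()

random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
print(clean_text(random_string))  # uh haha - yes nope tuben 1337
```

Doing the quote removal first means an input such as "a ' b" comes out as "a b" rather than with two spaces.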

/[' \n \t\r]|(\xa0)/g
This is syntax that is used by tools like sed or Vim, not Python's re module.
The equivalent would be
print(re.sub(r"[' \n \t\r]|(\xa0)", '', random_string))
which prints
uhhaha-yesnopetuben1337
which is not far off, but you also removed all spaces.
If you don't remove the spaces,
print(re.sub(r"['\n\t\r]|(\xa0)", '', random_string))
you get
uh haha - yes nope  tuben1337
which still contains a run of two spaces (before tuben) and has lost the separator between tuben and 1337.
A solution is to use the inverse regular expression (which matches runs of characters you want to keep) with re.findall to get a list of words, which you can then re-join:
result = re.findall(r"[^' \n\t\r\xa0]+", random_string)
print(' '.join(result))
which prints
uh haha - yes nope tuben 1337
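To illustrate the point about the /.../g delimiters, here is a small check. The sample strings are made up for illustration. It suggests that Python's re module treats the slashes and the trailing g as ordinary characters to match:

```python
import re

# In Python the "/" and "g" are literal characters, so this pattern
# only matches "\xa0" when it is directly followed by "/g".
js_style = r"(\xa0)/g"
print(re.findall(js_style, "tuben\xa01337"))     # []
print(re.findall(js_style, "tuben\xa0/g1337"))   # ['\xa0']

# Without the delimiters the non-breaking space is found.
plain = r"\xa0"
print(re.findall(plain, "tuben\xa01337"))        # ['\xa0']
```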

This regular expression will do the trick:
>>> print(re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", random_string)))
uh haha yes nope tuben 1337
The outer re.sub matches each run of whitespace (including \n, \t and \xa0) and replaces it with a single space.
The inner re.sub removes every character that is neither a word character nor whitespace, which covers the quotes but also drops the hyphen. The /.../g delimiters from your attempt are not Python syntax; the re module treats them as literal characters, so they have to be left out.

Related

How do I ignore the spaces in a string inputted by the user? [duplicate]

I want to eliminate all the whitespace from a string, on both ends, and in between words.
I have this Python code:
def my_handle(self):
    sentence = ' hello apple '
    sentence.strip()
But that only eliminates the whitespace on both sides of the string. How do I remove all whitespace?
If you want to remove leading and ending spaces, use str.strip():
>>> " hello apple ".strip()
'hello apple'
If you want to remove all space characters, use str.replace() (NB this only removes the “normal” ASCII space character ' ' U+0020 but not any other whitespace):
>>> " hello apple ".replace(" ", "")
'helloapple'
If you want to remove duplicated spaces, use str.split() followed by str.join():
>>> " ".join(" hello apple ".split())
'hello apple'
To remove only spaces use str.replace:
sentence = sentence.replace(' ', '')
To remove all whitespace characters (space, tab, newline, and so on) you can use split then join:
sentence = ''.join(sentence.split())
or a regular expression:
import re
pattern = re.compile(r'\s+')
sentence = re.sub(pattern, '', sentence)
If you want to only remove whitespace from the beginning and end you can use strip:
sentence = sentence.strip()
You can also use lstrip to remove whitespace only from the beginning of the string, and rstrip to remove whitespace from the end of the string.
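As a rough comparison of the options above, the following sketch runs them on one made-up string that mixes a tab, a non-breaking space and a newline. The sample value is an assumption chosen for illustration.

```python
sentence = " hello\tapple\xa0pie \n"

# replace(" ", "") removes only U+0020; tab, NBSP and newline stay.
only_spaces = sentence.replace(" ", "")
print(repr(only_spaces))      # 'hello\tapple\xa0pie\n'

# split() with no argument splits on any Unicode whitespace.
all_whitespace = "".join(sentence.split())
print(repr(all_whitespace))   # 'helloapplepie'

# strip() only touches the two ends of the string.
ends_only = sentence.strip()
print(repr(ends_only))        # 'hello\tapple\xa0pie'
```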
An alternative is to use regular expressions and match these strange white-space characters too. Here are some examples:
Remove ALL spaces in a string, even between words:
import re
sentence = re.sub(r"\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the BEGINNING of a string:
import re
sentence = re.sub(r"^\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the END of a string:
import re
sentence = re.sub(r"\s+$", "", sentence, flags=re.UNICODE)
Remove spaces both in the BEGINNING and in the END of a string:
import re
sentence = re.sub(r"^\s+|\s+$", "", sentence, flags=re.UNICODE)
Remove ONLY DUPLICATE spaces:
import re
sentence = " ".join(re.split(r"\s+", sentence, flags=re.UNICODE))
(All examples work in both Python 2 and Python 3)
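A short sketch of what the flag does in Python 3, where str patterns already match Unicode whitespace by default, so passing re.UNICODE is redundant there. The sample text is made up for illustration.

```python
import re

text = "a\xa0b\u2003c d"  # NBSP, EM SPACE and a regular space

# Default for str patterns: \s matches Unicode whitespace.
default_result = re.sub(r"\s+", "", text)
print(default_result)                      # abcd

# Passing re.UNICODE explicitly gives the same result.
unicode_result = re.sub(r"\s+", "", text, flags=re.UNICODE)
print(unicode_result)                      # abcd

# re.ASCII restricts \s to the ASCII whitespace characters.
ascii_result = re.sub(r"\s+", "", text, flags=re.ASCII)
print(repr(ascii_result))                  # 'a\xa0b\u2003cd'
```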
"Whitespace" includes space, tabs, and CRLF. So an elegant and one-liner string function we can use is str.translate:
Python 3
' hello apple '.translate(str.maketrans('', '', ' \n\t\r'))
OR if you want to be thorough:
import string
' hello apple'.translate(str.maketrans('', '', string.whitespace))
Python 2
' hello apple'.translate(None, ' \n\t\r')
OR if you want to be thorough:
import string
' hello apple'.translate(None, string.whitespace)
For removing whitespace from beginning and end, use strip.
>>> " foo bar ".strip()
'foo bar'
' hello \n\tapple'.translate({ord(c):None for c in ' \n\t\r'})
MaK already pointed out the "translate" method above. And this variation works with Python 3 (see this Q&A).
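One caveat worth checking with the translate approach: string.whitespace contains only the six ASCII whitespace characters, so a non-breaking space survives. The sketch below assumes NBSP is the only extra character you care about; the sample string is made up.

```python
import string

s = " hello\tapple\xa0pie\n"

# string.whitespace is ' \t\n\r\x0b\x0c', i.e. ASCII whitespace only.
ascii_table = str.maketrans("", "", string.whitespace)
ascii_only = s.translate(ascii_table)
print(repr(ascii_only))    # 'helloapple\xa0pie'

# Assumption: NBSP is the only additional character to delete.
extended_table = str.maketrans("", "", string.whitespace + "\xa0")
with_nbsp = s.translate(extended_table)
print(repr(with_nbsp))     # 'helloapplepie'
```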
In addition, strip has some variations:
Remove spaces in the BEGINNING and END of a string:
sentence = sentence.strip()
Remove spaces in the BEGINNING of a string:
sentence = sentence.lstrip()
Remove spaces in the END of a string:
sentence = sentence.rstrip()
All three string methods, strip, lstrip and rstrip, can take a string of characters to strip, with the default being all whitespace. This can be helpful when you are working with something particular; for example, you could remove only spaces but not newlines:
" 1. Step 1\n".strip(" ")
Or you could remove extra commas when reading in a string list:
"1,2,3,".strip(",")
Be careful:
strip does a rstrip and lstrip (removes leading and trailing spaces, tabs, returns and form feeds, but it does not remove them in the middle of the string).
If you only replace spaces and tabs you can end up with hidden CRLFs that appear to match what you are looking for, but are not the same.
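A few quick checks of the behaviour described above. Note that the argument to strip is treated as a set of characters rather than as a prefix or suffix. The sample values are made up for illustration.

```python
# The argument is a set of characters, not a substring.
set_strip = "xyxhelloyx".strip("xy")
print(set_strip)                 # hello

# Stripping only spaces keeps the trailing newline.
spaces_only = " 1. Step 1\n".strip(" ")
print(repr(spaces_only))         # '1. Step 1\n'

# Dropping a trailing comma before splitting avoids an empty last item.
items = "1,2,3,".strip(",").split(",")
print(items)                     # ['1', '2', '3']

# The default strips carriage returns, newlines, tabs and spaces at the ends.
default_strip = "\r\nvalue\t ".strip()
print(repr(default_strip))       # 'value'
```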
eliminate all the whitespace from a string, on both ends, and in between words.
>>> import re
>>> re.sub(r"\s+",  # one or more whitespace characters
...        "",      # replace with the empty string (i.e. remove)
...        ''' hello
...    apple
... ''')
'helloapple'
https://en.wikipedia.org/wiki/Whitespace_character
Python docs:
https://docs.python.org/library/stdtypes.html#textseq
https://docs.python.org/library/stdtypes.html#str.replace
https://docs.python.org/library/string.html#string.replace
https://docs.python.org/library/re.html#re.sub
https://docs.python.org/library/re.html#regular-expression-syntax
I use split() to drop all whitespace and join() to concatenate the remaining pieces.
sentence = ''.join(' hello apple '.split())
print(sentence)  # helloapple
I prefer this approach because it is a single expression (not a statement).
It is easy to use, and the result does not have to be bound to a variable.
print(''.join(' hello apple '.split()))  # no need to bind the result to a variable
import re
sentence = ' hello  apple'
re.sub(' ', '', sentence)       # 'helloapple' (removes all spaces)
re.sub(' {2,}', ' ', sentence)  # ' hello apple' (collapses runs of spaces into one)
In the following script we import the regular expression module which we use to substitute one space or more with a single space. This ensures that the inner extra spaces are removed. Then we use strip() function to remove leading and trailing spaces.
# Import regular expression module
import re
# Initialize string
a = " foo bar "
# First replace any number of spaces with a single space
a = re.sub(' +', ' ', a)
# Then strip any leading and trailing spaces.
a = a.strip()
# Show results
print(a)
I found that this works the best for me:
test_string = ' test a s test '
string_list = [s.strip() for s in str(test_string).split()]
final_string = ' '.join(string_list)
# final_string: 'test a s test'
It removes any whitespaces, tabs, etc.
Try this. Instead of using re, I think split() combined with strip() is much simpler:
def my_handle(self):
    sentence = ' hello apple '
    ' '.join(x.strip() for x in sentence.split())
    # hello apple
    ''.join(x.strip() for x in sentence.split())
    # helloapple

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
if there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two approaches produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s))  # here is some stuff now there are quotes now there's not
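The lookaround pattern can be combined with the lowercasing and punctuation removal from the question. The following is a minimal sketch under the assumption that everything other than a-z, apostrophes and spaces counts as noise. The names OUTER_QUOTE and normalize_line are hypothetical.

```python
import re

# A quote that is not sitting between two word characters.
OUTER_QUOTE = re.compile(r"(?<!\w)'|'(?!\w)")

def normalize_line(line):
    """Lowercase, drop outer quotes and punctuation, collapse spaces."""
    line = OUTER_QUOTE.sub("", line.lower())
    # Assumption: anything other than a-z, apostrophe or space is noise.
    line = re.sub(r"[^a-z' ]+", " ", line)
    # split/join collapses repeated spaces and trims both ends.
    return " ".join(line.split())

print(normalize_line("Here is some stuff. 'Now there are quotes.' Now there's not."))
# here is some stuff now there are quotes now there's not
```

Removing the outer quotes before the punctuation matters here: once the full stop in quotes.' is gone, the quote would directly follow a letter only if the stop had been deleted rather than replaced by a space.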
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched unless it is both preceded and followed by a letter, digit or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is matched unless it sits directly between two letters (ASCII letters in the first pattern, any Unicode letters in the second).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    return any(x.isalnum() for x in text)
def stripit(text, ofwhat):
    # Strip each unwanted character in turn from both ends of the word
    for x in ofwhat:
        text = text.strip(x)
    return text
def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    lines = text.splitlines()
    lines = [" ".join(stripit(word, notwanted) for word in line.split() if istext(word)) for line in lines]
    return "\n".join(lines)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also normalizes whitespace: each run of whitespace between words becomes a single space, and leading or trailing whitespace on a line is removed completely.

Python re.sub strip leading/trailing whitespace within quotes

I want to use re.sub to remove leading and trailing whitespace from single-quoted strings embedded in a larger string. If I have, say,
textin = " foo ' bar nox ': glop ,' frox ' "
I want to produce
desired = " foo 'bar nox': glop ,'frox' "
Removing the leading whitespace is relatively straightforward.
>>> lstripped = re.sub(r"'\s*([^']*')", r"'\1", textin)
>>> lstripped
" foo 'bar nox ': glop ,'frox ' "
The problem is removing the trailing whitespace. I tried, for example,
>>> rstripped = re.sub(r"('[^']*)(\s*')", r"\1'", lstripped)
>>> rstripped
" foo 'bar nox ': glop ,'frox ' "
but that fails because the [^']* matches the trailing whitespace.
I thought about using lookbehind patterns, but the re documentation says they can only contain fixed-length patterns.
I'm sure this is a previously solved problem but I'm stumped.
Thanks!
EDIT: The solution needs to handle strings containing a single non-whitespace character and empty strings, i.e. ' p ' --> 'p' and ' ' --> ''.
[^\']* is greedy, i.e. it also consumes the trailing spaces and/or tabs, so let's use the non-greedy version: [^\']*?
In [66]: re.sub(r'\'\s*([^\']*?)\s*\'','\'\\1\'', textin)
Out[66]: " foo 'bar nox': glop ,'frox' "
Less escaped version:
re.sub(r"'\s*([^']*?)\s*'", r"'\1'", textin)
The way to catch the whitespace is to make the preceding * non-greedy: instead of r"('[^']*)(\s*')" use r"('[^']*?)(\s*')".
You can also catch both sides with a single regex:
stripped = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", textin)
This seems to work:
'(\s*)(.*?)(\s*)'
' # an apostrophe
(\s*) # 0 or more white-space characters (leading white-space)
(.*?) # 0 or more any character, lazily matched (keep)
(\s*) # 0 or more white-space characters (trailing white-space)
' # an apostrophe
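An alternative that sidesteps the greedy versus lazy question is to pass a function as the replacement and let str.strip do the trimming. This is only a sketch, assuming the quotes are balanced and never nested or escaped. The name strip_inside_quotes is hypothetical.

```python
import re

def strip_inside_quotes(text):
    """Strip whitespace just inside each pair of single quotes."""
    # Assumption: quotes are balanced and never nested or escaped.
    return re.sub(r"'([^']*)'", lambda m: "'" + m.group(1).strip() + "'", text)

textin = " foo ' bar nox ': glop ,' frox ' "
print(repr(strip_inside_quotes(textin)))
# " foo 'bar nox': glop ,'frox' "
```

Because strip returns an empty string for whitespace-only input, the edge cases from the question (' p ' and ' ') fall out without extra handling.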

Split leading whitespace from rest of string

I'm not sure how to exactly convey what I'm trying to do, but I'm trying to create a function to split off a part of my string (the leading whitespace) so that I can edit it with different parts of my script, then add it again to my string after it has been altered.
So let's say I have the string:
"    That's four spaces"
I want to split it so I end up with:
"    " and "That's four spaces"
You can use re.match:
>>> import re
>>> re.match(r'(\s*)(.*)', "    That's four spaces").groups()
('    ', "That's four spaces")
>>>
(\s*) captures zero or more whitespace characters at the start of the string and (.*) gets everything else.
Remember though that strings are immutable in Python. Technically, you cannot edit their contents; you can only create new string objects.
For a non-Regex solution, you could try something like this:
>>> mystr = "    That's four spaces"
>>> n = next(i for i, c in enumerate(mystr) if c != ' ')  # Count spaces at start
>>> (' ' * n, mystr[n:])
('    ', "That's four spaces")
>>>
The main tools here are next, enumerate, and a generator expression. This solution is probably faster than the Regex one, but I personally think that the first is more elegant.
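One edge case of the next-based approach is a string that is empty or consists only of spaces, where a bare next() raises StopIteration. A minimal sketch that passes a default to next instead (the function name split_leading_spaces is hypothetical):

```python
def split_leading_spaces(s):
    """Return (leading spaces, rest of the string)."""
    # The default len(s) covers empty and all-space strings, where a
    # bare next() would raise StopIteration.
    n = next((i for i, c in enumerate(s) if c != " "), len(s))
    return s[:n], s[n:]

print(split_leading_spaces("    That's four spaces"))  # ('    ', "That's four spaces")
print(split_leading_spaces("   "))                     # ('   ', '')
```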
Why don't you try matching instead of splitting?
>>> import re
>>> s = "    That's four spaces"
>>> re.findall(r'^\s+|.+', s)
['    ', "That's four spaces"]
Explanation:
^\s+ Matches one or more whitespace characters at the start of the string.
| OR
.+ Matches all the remaining characters.
One solution is to lstrip the string, then figure out how many characters you've removed. You can then 'modify' the string as desired and finish by adding the whitespace back to your string. I don't think this would work properly with tab characters, but for spaces only it seems to get the job done:
my_string = "    That's four spaces"
no_left_whitespace = my_string.lstrip()
modified_string = no_left_whitespace + '!'
index = my_string.index(no_left_whitespace)
final_string = (' ' * index) + modified_string
print(final_string)  #     That's four spaces!
And a simple test to ensure that we've done it right, which passes:
assert final_string == my_string + '!'
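Regarding the concern about tab characters: one way around it is to slice the original prefix out of the string instead of rebuilding it from spaces. This is a small sketch with a hypothetical helper name, split_indent, and a made-up sample string.

```python
def split_indent(s):
    """Split s into (leading whitespace, remainder)."""
    rest = s.lstrip()
    # Slicing keeps the original characters, so tabs survive unchanged.
    return s[:len(s) - len(rest)], rest

indent, body = split_indent("\t  That's mixed indentation")
print(repr(indent))           # '\t  '
final_string = indent + body + "!"
print(repr(final_string))     # "\t  That's mixed indentation!"
```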
One thing you can do is make a list out of the string, that is:
x = "    That's four spaces"
y = list(x)
z = "".join(y[0:4])  # if the count varies, loop here to detect the spaces at the start
k = "".join(y[4:])
s = []
s.append(z)
s.append(k)
print(s)
This is a non regex solution which will not require any imports

how to not remove apostrophe only for some words in text file in python

In a sentence, how can I remove apostrophes, double quotes, commas and so on from all words except words like it's and what's? Also, at the end of the sentence there must be a space between the last word and the full stop.
For example
Input Sentence :
"'This has punctuation, and it's hard to remove. ?"
Desired Output Sentence :
This has punctuation and it's hard to remove .
Use a negative look-behind
(?<!\w)["'?]|,(?= )
Remove the matched ", ' and ? characters (and any comma that is followed by a space) with re.sub.
And your code would be,
>>> s = '\"\'This has punctuation, and it\'s hard to remove. ?\" '
>>> m = re.sub(r'(?<!\w)[\"\'\?]|,(?= )', r'', s)
>>> m
"This has punctuation and it's hard to remove. "
I propose this code:
import re
sentences = [""""'This has punctuation, and it's hard to remove. ?" """,
"Did you see Cress' haircut?.",
"This 'thing' hasn't a really bad habit, you know?.",
"'I bought this for $30 from Best Buy it's. What a waste of money! The ear gels are 'comfortable at first, but what's after an hour."]
for s in sentences:
    # Remove the specified characters
    new_s = re.sub(r"""["?,$!]|'(?!(?<! ')[ts])""", "", s)
    # Deal with the final dot
    new_s = re.sub(r"\.", " .", new_s)
    print(new_s)
Output:
This has punctuation and it's hard to remove .
Did you see Cress haircut .
This thing hasn't a really bad habit you know .
I bought this for 30 from Best Buy it's . What a waste of money The ear gels are comfortable at first but what's after an hour .
The regex:
["?,$!] # Match " ? , $ or !
| # OR
' # A ' if it does not have...
(?!
(?<! ')
[ts] # t or s after it, provided it has no ` '` before the t or s
)
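For comparison, here is a sketch of a variant that keeps any apostrophe sitting between two letters instead of special-casing t and s. It assumes the listed marks are the only other punctuation that needs deleting. The name tidy_sentence is hypothetical.

```python
import re

def tidy_sentence(sentence):
    """Keep in-word apostrophes, drop other marks, space out the full stop."""
    # Remove apostrophes that are not between two letters.
    s = re.sub(r"(?<![A-Za-z])'|'(?![A-Za-z])", "", sentence)
    # Assumption: these are the only other marks that need deleting.
    s = re.sub(r'[",?!;:]', "", s)
    # Put exactly one space before every full stop.
    s = re.sub(r"\s*\.", " .", s)
    # Collapse any leftover runs of whitespace and trim the ends.
    return " ".join(s.split())

print(tidy_sentence("\"'This has punctuation, and it's hard to remove. ?\""))
# This has punctuation and it's hard to remove .
```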
Use this:
(?<![tT](?=.[sS]))["'?:;,.]
If you also want to leave the period at the end of a line (as long as it is preceded by a space):
(?<![tT](?=.[sS]))(?<! (?=.$))["'?:;,.]
My approach is to remove all punctuation at either end of a word: split the sentence into words (separated by whitespace) and strip any leading or trailing punctuation marks from each word. This needs import re and import string, with st holding the sentence:
>>> ''.join(e.strip(string.punctuation) for e in re.split(r"(\s)", st))
"This has punctuation and it's hard to remove "
Use the str.strip(chars) method for the outside quotes, like this:
output = chaine.strip("\"")
Be careful: inside a string literal you have to escape characters such as ', " and \ with a backslash, or you can write the quotes as "'" and '"'.
Edit: I didn't think about the apostrophes. If they are the only problem, you can strip the rest first and then handle the apostrophes manually in a for loop: record the index of each apostrophe found and leave it in place if it is followed by an 's'. Either way, you have to settle the lexical/semantic rules before coding it.
Edit 2:
If the string is only a sentence, and always has a dot at the end, and always needs the space, then use this at the end:
chaine[:-2]+" "+chaine[-2:]
