Removing Punctuation and Replacing it with Whitespace using Replace in Python

I'm trying to remove the following punctuation characters in Python, and I need to use the replace method to replace each of them with whitespace: ,.:;'"-?!/
Here is my code:
text_punct_removed = raw_text.replace(".", "")
text_punct_removed = raw_text.replace("!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
This only removes the last character I try to replace, so I tried combining them:
text_punct_removed = raw_text.replace(".", "" , "!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
But I get an error message. How do I remove multiple punctuation characters? Also, there will be an issue if I put the " in quotes like this: """, which will make a comment. Is there a way around that? Thanks.

If you don't need to explicitly use replace:
exclude = set(",.:;'\"-?!/")
text = "".join([(ch if ch not in exclude else " ") for ch in text])
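The same character-by-character substitution can also be sketched with str.maketrans and str.translate, which build the mapping once and apply it in a single pass. The sample input and the names punct, table, and cleaned are made up for illustration.

```python
# Characters to replace; the double quote is escaped because the literal is double-quoted.
punct = ",.:;'\"-?!/"

# Map every punctuation character to a single space.
table = str.maketrans(punct, " " * len(punct))

text = "Wait... what?! It's a 'test', isn't it?"  # hypothetical input
cleaned = text.translate(table)
print(repr(cleaned))
```

Because each character is swapped for exactly one space, the result has the same length as the input.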

Here's a naive but working solution:
for sp in '.,"':
    raw_text = raw_text.replace(sp, '')
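Since the question asks for whitespace rather than deletion, and for all ten characters, the loop can be extended as in this minimal sketch. The function name replace_punctuation and the sample sentence are hypothetical. The quoting problem from the question is avoided by escaping the double quote with a backslash inside a double-quoted literal.

```python
# All characters from the question; \" is an escaped double quote.
PUNCTUATION = ",.:;'\"-?!/"


def replace_punctuation(raw_text: str, chars: str = PUNCTUATION) -> str:
    """Replace each character in chars with one space, using str.replace only."""
    for ch in chars:
        # str.replace returns a new string, so the result must be reassigned.
        raw_text = raw_text.replace(ch, " ")
    return raw_text


sample = 'Hello, world! "Quoted" - done.'  # hypothetical input
cleaned = replace_punctuation(sample)
print(repr(cleaned))
```

Reassigning the result on every iteration is what the original attempt was missing: each call there started again from the untouched raw_text.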

If you need to replace all punctuation with spaces, you can use the built-in string.punctuation constant to build the pattern:
Python 3
import string
import re
my_string = "(I hope...this works!)"
translator = re.compile('[%s]' % re.escape(string.punctuation))
my_string = translator.sub(' ', my_string)  # sub() returns a new string, so assign it
print(repr(my_string))
# Result:
# ' I hope   this works  '
Afterwards, if you want to collapse the repeated spaces and trim the ends, you can do:
my_string = re.sub(' +', ' ', my_string).strip()
print(my_string)
# Result:
# I hope this works

This works in Python 3.5.3:
from string import punctuation
raw_text_with_punctuations = "text, with: punctuation; characters? all over ,.:;'\"-?!/"
print(raw_text_with_punctuations)
for char in punctuation:
    raw_text_with_punctuations = raw_text_with_punctuations.replace(char, '')
print(raw_text_with_punctuations)

Either remove one character at a time:
raw_text.replace(".", "").replace("!", "")
Or, better, use regular expressions (re.sub()):
re.sub(r"\.|!", "", raw_text)

Related

How do I remove multiple words from a string in Python

I know how to remove a single word, but I can't remove multiple words. Can you help me?
This is my string, and I want to remove "Color:", "Ring size:" and "Personalization:".
string = "Color:Silver,Ring size:6 3/4 US,Personalization:J"

Seems like a good job for a regex.
Specific case:
import re
out = re.sub(r'(Color|Ring size|Personalization):', '', string)
Generic case (any word before :):
import re
out = re.sub(r'[^:,]+:', '', string)
Output: 'Silver,6 3/4 US,J'
Regex:
[^:,]+ # any character but , or :
: # followed by :
replace with empty string (= delete)
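To make the difference between the two patterns concrete, here is a small sketch that runs both on the question's string and on a second, hypothetical input (other) that contains a label not in the explicit list.

```python
import re

string = "Color:Silver,Ring size:6 3/4 US,Personalization:J"

# Specific case: only the three named labels are removed.
specific = re.sub(r"(Color|Ring size|Personalization):", "", string)
# Generic case: any run of non-colon, non-comma characters followed by ":".
generic = re.sub(r"[^:,]+:", "", string)

other = "Color:Silver,Engraving:J"  # hypothetical input with an unlisted label
specific_other = re.sub(r"(Color|Ring size|Personalization):", "", other)
generic_other = re.sub(r"[^:,]+:", "", other)

print(specific, generic, specific_other, generic_other, sep="\n")
```

On the original string both patterns agree. On the second input the specific pattern leaves "Engraving:" in place, while the generic one removes it.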

string = "Color:Silver,Ring size:6 3/4 US,Personalization:J"
def remove_words(string: str, words: list[str]) -> str:
    for word in words:
        string = string.replace(word, "")
    return string
new_string = remove_words(string, ["Color:", "Ring size:", "Personalization:"])
Alternative:
string = "Color:Silver,Ring size:6 3/4 US,Personalization:J"
new_string = string.replace("Color:", "").replace("Ring size:", "").replace("Personalization:", "")

How to insert quotes around a string in the middle of another string

I need to change this string:
input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'
To json format:
output_str = '{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}'
Changing the equal sign "=" to colon ":" is quite easy by using replace function:
input_str.replace("=", ":")
But adding quotes before and after each value / word is something that I can't find the solution for

I suggest to surround with double quotes any sequence of characters that are not reserved in your markup. I also made a provision for escaped double quotes, and you can add more escaped symbols to it:
import re
input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'
output_str = re.sub (r'(([^=([\]{},\s]|\")+)', r'"\1"', input_str).replace('=', ':')
print (output_str)
Output:
{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}
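One way to check that the converted string really is valid JSON is to parse it with json.loads. This sketch reuses the regex from the answer above unchanged; the variable name parsed is an addition for illustration.

```python
import json
import re

input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'

# Quote every run of non-reserved characters, then turn "=" into ":".
output_str = re.sub(r'(([^=([\]{},\s]|\")+)', r'"\1"', input_str).replace('=', ':')

# json.loads raises json.JSONDecodeError if the result is not valid JSON.
parsed = json.loads(output_str)
print(parsed["category"][0]["coding"][0]["system"])
```

Note that this approach assumes keys and values contain no spaces, commas, or equals signs, so it is tied to inputs shaped like the one in the question.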

You can use this function for the conversion.
def to_json(in_str):
    return in_str.replace('{', '{"').replace('=', '":"').replace(',', '", "').replace('[', '[').replace('}', '"}').replace(']', ']').replace('" ', '"').replace(':"[', ':[').replace(']"', ']')
This works correctly for the input you have mentioned.
print(to_json(input_str))
#output = {"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}

Regex is certainly more concise and efficient but, just for fun, it's also possible using replace:
input_str = input_str.replace("=", "\":\"")
input_str = input_str.replace("=[", "\":[")  # no effect here: every "=" was already replaced on the previous line
input_str = input_str.replace(", ", "\", \"")
input_str = input_str.replace("{", "{\"")
input_str = input_str.replace("}", "\"}")
input_str = input_str.replace("]\"}", "]}")
input_str = input_str.replace("\"[", "[")
print(input_str) #=> '{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}'

How to remove text before a particular character or string in multi-line text?

I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, it did not perform the operation I wanted and the whole string is printing.

I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
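A variation on the same non-regex idea is str.partition, which does not raise an exception when the marker is missing (string.index raises ValueError in that case). The helper name text_after is hypothetical; this is only a sketch.

```python
string = ''' something
other things
etc. */ extra text.
'''


def text_after(text: str, marker: str = "*/") -> str:
    """Return everything after the first marker, or text unchanged if it is absent."""
    _, found, tail = text.partition(marker)
    return tail if found else text


result = text_after(string).strip()
print(result)
```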

The problem with your first regex is that . does not match newlines as you noticed. With your second one, you were closer but forgot the * that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)

Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in python regex matches everything except newlines. For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub(r"[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.

To keep the text up to that symbol instead, you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.

string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(repr(string))
gives the output
' extra text.\n'

Remove variable parts of a string that start and end the same

I have a string as the following:
'1:CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD\n&:TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG\n&:TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE\n2:ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG\n&:AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS\n&:CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y\n&:9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L\n&:9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99\n'
As you can see, I would like to get the '\n(number or & here):' replaced by ','
Since they all start with '\n' and end with ':' I believe that there should be a way to replace them all at once.
The output would look like this:
'CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD,TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG,TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE,ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG,AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS,CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y,9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L,9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99'
What could work is a separate replace for & plus a for loop over the numbers:
string = string.replace('\n&:', ',')
for i in range(1, 20):
    string = string.replace(f'\n{i}:', ',')
But I believe there must be a better way.

You can use regex to get the job done:
Input:
import re
text = '1:CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD\n&:TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG\n&:TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE\n2:ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG\n&:AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS\n&:CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y\n&:9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L\n&:9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99\n'
text = re.sub(r'\n&*(\d*:)*',',', text[2:]).rstrip(',')
Output:
'CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD,TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG,TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE,ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG,AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS,CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y,9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L,9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99'

You can use a regular expression replacement:
import re
s = '1:CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD\n&:TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG\n&:TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE\n2:ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG\n&:AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS\n&:CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y\n&:9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L\n&:9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99\n'
s = re.sub(r"(\n\d*?:)|(\n&:)", ",", s).strip() # replaces the middle bits with commas and strips trailing \n
s = re.sub(r"^(\d*?:)|(&:)", "", s) # removes the initial 1: or similar
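For comparison, the same result can be sketched without regular expressions by treating the data line by line. This assumes every line contains a ":" prefix, as in the question; the short string below is a cut-down sample in the same format, not the original data.

```python
s = "1:CH,AG\n&:TA,EB\n2:ZE,FD\n&:9C,9B,99\n"  # shortened sample, same layout

# Drop everything up to and including the first ":" on each line, then rejoin with commas.
joined = ",".join(line.partition(":")[2] for line in s.splitlines())
print(joined)
```

splitlines() also discards the trailing newline, so no extra strip is needed.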

Remove all newlines from inside a string

I'm trying to remove all newline characters from a string. I've read up on how to do it, but it seems that I for some reason am unable to do so. Here is step by step what I am doing:
string1 = "Hello \n World"
string2 = string1.strip('\n')
print(string2)
And I'm still seeing the newline character in the output. I've tried with rstrip as well, but I'm still seeing the newline. Could anyone shed some light on why I'm doing this wrong? Thanks.

strip only removes characters from the beginning and end of a string. You want to use replace:
string2 = string1.replace("\n", "")
string2 = re.sub(r'\s{2,}', ' ', string2)  # collapse runs of two or more whitespace characters (requires import re)
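The difference between the three approaches is easiest to see side by side. A minimal sketch using the string from the question; the names stripped, replaced, and normalized are illustrative.

```python
string1 = "Hello \n World"

# strip only looks at the two ends, so the newline in the middle survives.
stripped = string1.strip("\n")
# replace removes the newline but keeps the spaces on either side of it.
replaced = string1.replace("\n", "")
# split() with no arguments splits on any run of whitespace, newlines included.
normalized = " ".join(string1.split())

print(repr(stripped), repr(replaced), repr(normalized), sep="\n")
```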

As mentioned by @john, the most robust answer is:
string = "a\nb\rv"
new_string = " ".join(string.splitlines())
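The reason splitlines() is more robust is that it recognises carriage returns and Windows-style line endings as well as "\n". A short sketch with a hypothetical mixed-ending sample:

```python
mixed = "a\r\nb\rc\nd"  # hypothetical sample mixing \r\n, \r and \n

# Removing only "\n" leaves the carriage returns behind.
only_lf_removed = mixed.replace("\n", "")
# splitlines() treats \r\n, \r and \n all as line boundaries.
robust = " ".join(mixed.splitlines())

print(repr(only_lf_removed))
print(repr(robust))
```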

Answering late, since I recently had the same question when reading text from a file. I tried several options, such as the two below.
The first option produces a list called alist with the '\n' characters stripped, then joins it back into a single string (the join is optional if you would rather keep the list of lines):
with open('verdict.txt') as f:
    alist = f.read().splitlines()
jalist = " ".join(alist)
The second option is simpler and produces a string called atext, with each '\n' replaced by a space:
with open('verdict.txt') as f:
    atext = f.read().replace('\n', ' ')
It works; I have done it. The second option is cleaner and simpler.
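The two file-based options differ slightly when the file ends with a newline. This self-contained sketch writes a throwaway temporary file (standing in for verdict.txt, whose contents are not shown here) and applies both.

```python
import os
import tempfile

# Create a small temporary file so the sketch does not depend on verdict.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

try:
    with open(path) as f:
        joined = " ".join(f.read().splitlines())
    with open(path) as f:
        replaced = f.read().replace("\n", " ")
finally:
    os.remove(path)  # clean up the temporary file

print(repr(joined))
print(repr(replaced))
```

The replace version turns the final newline into a trailing space, while the splitlines version does not, so a .strip() may be wanted after replace.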

strip() returns the string after removing leading and trailing whitespace (see the documentation).
In your case, you may want to try replace():
string2 = string1.replace('\n', '')

Or you can try this:
string1 = 'Hello \n World'
tmp = string1.split()
string2 = ' '.join(tmp)

This should work in many cases:
text = ' '.join([line.strip() for line in text.strip().splitlines() if line.strip()])
text = re.sub('[\r\n]+', ' ', text)

strip() returns the string with leading and trailing whitespace (by default) removed.
So it would turn " Hello World " into "Hello World", but it won't remove the \n character, as it sits in the middle of the string.
Try replace().
string1 = "Hello \n World"
string2 = string1.replace('\n', '')
print(string2)

If the text includes a line break in the middle, neither strip() nor rstrip() will solve the problem;
the strip family only trims characters from the beginning and the end of the string.
replace() is the way to solve your problem:
>>> my_name = "Landon\nWO"
>>> print(my_name)
Landon
WO
>>> my_name = my_name.replace('\n', '')
>>> print(my_name)
LandonWO
