Removing Punctuation and Replacing it with Whitespace using Replace in Python - python
I am trying to remove the following punctuation characters in Python. I need to use the replace method to replace each of them with whitespace: ,.:;'"-?!/
Here is my code:
text_punct_removed = raw_text.replace(".", "")
text_punct_removed = raw_text.replace("!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
It only removes the last one I try to replace, so I tried combining them:
text_punct_removed = raw_text.replace(".", "" , "!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
but I get an error message. How do I remove multiple punctuation characters? Also, there will be an issue if I put the " in quotes like this """ which will make a comment. Is there a way around that? Thanks.
If you don't need to explicitly use replace:
exclude = set(",.:;'\"-?!/")
text = "".join([(ch if ch not in exclude else " ") for ch in text])
Here's a naive but working solution:
for sp in '.,"':
    raw_text = raw_text.replace(sp, '')
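Since the question asks for the punctuation to be replaced with whitespace rather than deleted, the same loop can pass a space as the replacement. A minimal sketch; the function name replace_punct_with_space and the sample text are made up for illustration:

```python
PUNCTUATION = ",.:;'\"-?!/"  # the characters listed in the question


def replace_punct_with_space(text, chars=PUNCTUATION):
    """Return text with every character in chars replaced by one space."""
    for ch in chars:
        # str.replace returns a new string, so the result must be reassigned
        text = text.replace(ch, " ")
    return text


sample = "Hello, world! Is this well-formed?"  # hypothetical input
print(replace_punct_with_space(sample))
```

Note that the double quote is written as \" inside the double-quoted literal, which sidesteps the triple-quote problem mentioned in the question.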
If you need to replace all punctuation with spaces, you can build a regular expression from the built-in string.punctuation constant:
Python 3
import string
import re
my_string = "(I hope...this works!)"
translator = re.compile('[%s]' % re.escape(string.punctuation))
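my_string = translator.sub(' ', my_string)  # sub() returns a new string, so assign it
print(repr(my_string))
# Result (each punctuation character became a space):
# ' I hope   this works  '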
Afterwards, if you want to collapse repeated spaces inside the string, you can do:
my_string = re.sub(' +',' ', my_string).strip()
print(my_string)
# Result:
# I hope this works
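The two steps above can be combined into one helper. This is only a sketch; the name normalize is an assumption, not something from the answer:

```python
import re
import string

# Character class matching any single ASCII punctuation character
_PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))


def normalize(text):
    """Replace punctuation with spaces, then collapse repeated spaces."""
    spaced = _PUNCT_RE.sub(" ", text)  # sub() returns a new string
    return re.sub(" +", " ", spaced).strip()


print(normalize("(I hope...this works!)"))
```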
This works in Python3.5.3:
from string import punctuation
raw_text_with_punctuations = "text, with: punctuation; characters? all over ,.:;'\"-?!/"
print(raw_text_with_punctuations)
for char in punctuation:
    raw_text_with_punctuations = raw_text_with_punctuations.replace(char, '')
print(raw_text_with_punctuations)
Either remove one character at a time:
raw_text.replace(".", "").replace("!", "")
Or, better, use regular expressions (re.sub()):
re.sub(r"\.|!", "", raw_text)
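Another option that none of the answers above mention is str.translate, which handles every character in a single pass. A small sketch, assuming the goal is one space per punctuation character; the sample text is hypothetical:

```python
# Map each punctuation character from the question to a space
table = str.maketrans({ch: " " for ch in ",.:;'\"-?!/"})

raw_text = "Wait... what?! It's done: yes/no"  # hypothetical input
text_punct_removed = raw_text.translate(table)
print(text_punct_removed)
```

Because each character maps to exactly one space, the result has the same length as the input.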
Related
How do I remove multiple words from a string in python
I know how to remove a single word, but I can't remove multiple words. Can you help me? This is my string, and I want to remove "Color:", "Ring size:" and "Personalization:":
string = "Color:Silver,Ring size:6 3/4 US,Personalization:J"
Seems like a good job for a regex.
Specific case:
import re
out = re.sub(r'(Color|Ring size|Personalization):', '', string)
Generic case (any word before :):
import re
out = re.sub(r'[^:,]+:', '', string)
Output:
'Silver,6 3/4 US,J'
Regex:
[^:,]+  # any character but , or :
:       # followed by :
replace with empty string (= delete)
string = "Color:Silver,Ring size:6 3/4 US,Personalization:J"
def remove_words(string: str, words: list[str]) -> str:
    for word in words:
        string = string.replace(word, "")
    return string
new_string = remove_words(string, ["Color:", "Ring size:", "Personalization:"])
Alternative:
string = "Color:Silver,Ring size:6 3/4 US,Personalization:J"
new_string = string.replace("Color:", "").replace("Ring size:", "").replace("Personalization:", "")
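The regex and replace approaches can also be combined: build one pattern from an arbitrary list of literal words. A rough sketch; this remove_words is a variant written for illustration, and re.escape is used so that characters in the words are never read as regex syntax:

```python
import re


def remove_words(text, words):
    """Delete every occurrence of each literal entry in words from text."""
    # Join the escaped words into a single alternation: word1|word2|...
    pattern = "|".join(re.escape(word) for word in words)
    return re.sub(pattern, "", text)


labels = ["Color:", "Ring size:", "Personalization:"]
print(remove_words("Color:Silver,Ring size:6 3/4 US,Personalization:J", labels))
```

If one word is a prefix of another, the order of the list matters, since the alternation tries the entries from left to right.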
How to insert quotes around a string in the middle of another string
I need to change this string:
input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'
to JSON format:
output_str = '{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}'
Changing the equals sign "=" to a colon ":" is quite easy using the replace function:
input_str.replace("=", ":")
But adding quotes before and after each value / word is something I can't find a solution for.
I suggest surrounding with double quotes any sequence of characters that are not reserved in your markup. I also made a provision for escaped double quotes, and you can add more escaped symbols to it:
import re
input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'
output_str = re.sub(r'(([^=([\]{},\s]|\")+)', r'"\1"', input_str).replace('=', ':')
print(output_str)
Output:
{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}
You can use this function for the conversion:
def to_json(in_str):
    return in_str.replace('{', '{"').replace('=', '":"').replace(',', '", "').replace('[', '[').replace('}', '"}').replace(']', ']').replace('" ', '"').replace(':"[', ':[').replace(']"', ']')
This works correctly for the input you have mentioned:
print(to_json(input_str))
#output = {"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}
Regex is certainly more concise and efficient but, just for fun, it's also possible using replace:
input_str = input_str.replace("=", "\":\"")
input_str = input_str.replace("=[", "\":[")
input_str = input_str.replace(", ", "\", \"")
input_str = input_str.replace("{", "{\"")
input_str = input_str.replace("}", "\"}")
input_str = input_str.replace("]\"}", "]}")
input_str = input_str.replace("\"[", "[")
print(input_str)
#=> '{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}'
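Whichever conversion is used, one way to check that the result really is JSON is to feed it to json.loads. A short sketch using the regex from the first answer; the variable name data is just for illustration:

```python
import json
import re

input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'

# Quote every run of characters that is not markup, then turn '=' into ':'
output_str = re.sub(r'(([^=([\]{},\s]|\")+)', r'"\1"', input_str).replace('=', ':')

# json.loads raises json.JSONDecodeError if the string is not valid JSON
data = json.loads(output_str)
print(data["category"][0]["coding"][0]["system"])
```

Keep in mind that all of these conversions assume values never contain "=", "," or brackets themselves; such input would probably need a real parser.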
How to remove text before a particular character or string in multi-line text?
I want to remove all the text before and including */ in a string. For example, consider:
string = ''' something other things etc. */ extra text. '''
Here I want extra text. as the output. I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, the operation has not been performed and the whole string is printed.
I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
The problem with your first regex is that . does not match newlines, as you noticed. With your second one, you were closer but forgot the * that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in Python regex matches everything except newlines. For a regex solution, you can do the following:
import re
strng = ''' something other things etc. */ extra text. '''
print(re.sub(r"[\s\S]+\*/", "", strng))  # raw string, so \s and \* reach the regex engine unchanged
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.
To keep the text up to that symbol, you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something other things etc.
string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
gives output as ' extra text. \n'
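str.partition is another regex-free way to drop everything up to and including the first "*/". A minimal sketch; the line breaks in the sample string are an assumption, since the original layout of the question's string is not preserved here:

```python
string = "something\nother things\netc. */ extra text.\n"  # hypothetical multi-line input

# partition splits at the first occurrence of the separator and returns
# (text_before, separator, text_after); separator is '' when not found
_before, sep, after = string.partition("*/")
result = after.strip() if sep else string
print(result)
```

Unlike string.index, partition does not raise an exception when the marker is missing, so the fallback to the unchanged string is explicit.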
Remove variable parts of a string that start and end the same
I have a string like the following:
'1:CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD\n&:TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG\n&:TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE\n2:ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG\n&:AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS\n&:CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y\n&:9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L\n&:9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99\n'
As you can see, I would like to get the '\n(number or & here):' replaced by ','. Since they all start with '\n' and end with ':', I believe there should be a way to replace them all at once. The output would look like this:
'CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD,TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG,TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE,ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG,AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS,CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y,9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L,9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99'
What could work is making a for loop over the numbers and &:
string.replace('\n&:',',')
for i in range(1,20):
    string.replace('\ni:',',')
But I believe there must be a better way.
You can use regex to get the job done:
Input:
import re
text = '1:CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD\n&:TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG\n&:TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE\n2:ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG\n&:AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS\n&:CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y\n&:9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L\n&:9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99\n'
text = re.sub(r'\n&*(\d*:)*',',', text[2:]).rstrip(',')
Output:
'CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD,TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG,TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE,ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG,AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS,CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y,9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L,9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99'
You can use a regular expression replace:
import re
s = '1:CH,AG,ME,GS,AP,CH,HE,AC,AC,AG,CA,HE,AT,AT,AC,AT,OG,NE,AG,AC,CS,OD\n&:TA,EB,PA,AC,BR,TH,PO,AC,2I,AC,TH,PE,TH,AZ,AZ,ZE,CS,OD,CH,EO,ZE,OG\n&:TH,ZE,ZE,HE,HE,HP,HP,OG,HP,ZE\n2:ZE,FD,FD,AG,EO,OG,AG,NE,RU,GS,HP,ZE,ZE,HM,HM,PC,PC,AS,AS,TY,TY,AG\n&:AG,GS,NO,EU,ZF,HE,AT,AT,OD,OD,EB,OD,GS,TR,OD,AC,TR,GS,OD,TR,OD,AT,GS\n&:CA,GS,NE,GS,AG,PS,HL,AG,NE,ID,AJ,AX,DI,OD,ME,AT,GS,MU,HO,PB,LT,9Z,PT,9Y\n&:9W,9X,AR,9V,9U,9T,AX,9S,9R,AT,AJ,DI,ST,EA,AG,ME,NE,MU,9Q,9P,9O,9N,9M,9L\n&:9K,ID,MG,OD,FY,AU,AU,HR,HR,9J,TL,9I,9H,9G,9F,AC,BR,AC,9E,9D,9C,9B,99\n'
s = re.sub(r"(\n\d*?:)|(\n&:)", ",", s).strip()  # replaces the middle bits with commas and strips trailing \n
s = re.sub(r"^(\d*?:)|(&:)", "", s)  # removes the initial 1: or similar
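For comparison, the same result can be reached without regex by treating each line as "label:data". A rough sketch; join_records is a hypothetical name and the sample is a shortened, made-up version of the data above:

```python
def join_records(text):
    """Drop the 'label:' prefix of each line and join the rest with commas."""
    parts = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        # partition splits at the first ':' only
        _label, sep, rest = line.partition(":")
        parts.append(rest if sep else line)
    return ",".join(parts)


sample = "1:CH,AG,ME\n&:TA,EB\n2:ZE,FD\n"  # shortened, hypothetical data
print(join_records(sample))
```

This assumes every label sits at the start of a line and that the data itself contains no ':' before the first comma.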
Remove all newlines from inside a string
I'm trying to remove all newline characters from a string. I've read up on how to do it, but for some reason I am unable to do so. Here is what I am doing, step by step:
string1 = "Hello \n World"
string2 = string1.strip('\n')
print(string2)
And I'm still seeing the newline character in the output. I've tried with rstrip as well, but I'm still seeing the newline. Could anyone shed some light on what I'm doing wrong? Thanks.
strip only removes characters from the beginning and end of a string. You want to use replace:
string2 = string1.replace("\n", "")
string2 = re.sub(r"\s{2,}", " ", string2)  # to collapse runs of two or more whitespace characters (needs import re)
As mentioned by @john, the most robust answer is:
string = "a\nb\rv"
new_string = " ".join(string.splitlines())
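The reason splitlines() is more robust is that it recognizes several kinds of line boundary, not just "\n". A small illustration with made-up sample strings:

```python
samples = ["a\nb", "a\r\nb", "a\rb"]  # Unix, Windows and old Mac line endings

for s in samples:
    # splitlines() splits on all of these boundaries and drops them
    print(repr(" ".join(s.splitlines())))

# replace("\n", " ") alone leaves the carriage return in place
print(repr("a\r\nb".replace("\n", " ")))
```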
Answering late, since I recently had the same question when reading text from a file. I tried two options.
The first option produces a list called alist with the '\n' characters stripped, then joins it back into one full text (the join is optional if you only want the list):
with open('verdict.txt') as f:
    alist = f.read().splitlines()
jalist = " ".join(alist)
The second option is simpler and produces a string called atext, replacing each '\n' with a space:
with open('verdict.txt') as f:
    atext = f.read().replace('\n', ' ')
It works; I have tried it. The second option is the cleaner and simpler of the two.
strip() returns the string after removing leading and trailing whitespace (see the documentation). In your case, you may want to try replace():
string2 = string1.replace('\n', '')
Or you can try this:
string1 = 'Hello \n World'
tmp = string1.split()
string2 = ' '.join(tmp)
This should work in many cases:
text = ' '.join([line.strip() for line in text.strip().splitlines() if line.strip()])
text = re.sub('[\r\n]+', ' ', text)
strip() returns the string with leading and trailing whitespace (by default) removed. So it would turn " Hello World " into "Hello World", but it won't remove the \n character, as it sits in the middle of the string. Try replace():
string1 = "Hello \n World"
string2 = string1.replace('\n', '')
print(string2)
If the text includes a line break in the middle, neither strip() nor rstrip() will solve the problem: the strip family only trims characters from the beginning and the end of the string. replace() is the way to solve your problem:
>>> my_name = "Landon\nWO"
>>> print(my_name)
Landon
WO
>>> my_name = my_name.replace('\n','')
>>> print(my_name)
LandonWO
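To see the difference between the approaches in this thread side by side, here is a short comparison on the string from the question. The variable names stripped, replaced and rejoined are only for illustration:

```python
string1 = "Hello \n World"

stripped = string1.strip("\n")        # newline is not at either end: unchanged
replaced = string1.replace("\n", "")  # deletes the newline, keeps both spaces
rejoined = " ".join(string1.split())  # splits on any whitespace, rejoins with one space

print(repr(stripped))
print(repr(replaced))
print(repr(rejoined))
```

Which one is right depends on whether the surrounding spaces should survive: replace keeps them, while split and join normalizes all whitespace.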