I have a string that contains words or phrases enclosed in double quotes, and I need to remove them from the quotes in Python. Example:
The text has "single quotes" and "commas".
The text has "double quotes".
Removing the words from the quotes should give this:
The text has " " and " ".
The text has " ".
I used re.finditer, which lists all the quoted parts it finds, but I don't know how to remove the words between the quotes in the string. Does anybody know?
>>> from re import sub
>>> s
'The text has "single quotes" and "commas".'
>>> sub('".*?"', '" "', s)
'The text has " " and " ".'
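The ? after .* matters in that pattern. A minimal sketch of the difference, reusing the same sample string (the names lazy and greedy are only for illustration):

```python
import re

s = 'The text has "single quotes" and "commas".'

# Lazy: .*? stops at the nearest closing quote, so each pair is handled separately.
lazy = re.sub(r'".*?"', '" "', s)

# Greedy: .* runs to the last quote on the line and swallows everything in between.
greedy = re.sub(r'".*"', '" "', s)

print(lazy)    # The text has " " and " ".
print(greedy)  # The text has " ".
```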
A bit complicated, but maybe,
(?<=")[^\s".][^"\r\n]*|[^"\r\n]*[^\s".](?=")
might be OK to look into.
This pattern would probably fail on some edge cases, which you'd likely want to look into:
[^\s".]
Test
import re
string = '''
The text has "single quotes" and "commas".
The text has "double quotes"
"single quotes" and "commas"
"double quotes"
"d"
"d""d""d""d"
'''
expression = r'(?<=")[^\s".][^"\r\n]*|[^"\r\n]*[^\s".](?=")'
print(re.sub(expression, '', string))
Output
The text has "" and "".
The text has ""
"" and ""
""
""
""""""""
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
Take a look at this simple regex:
"[\w\s]+"
This matches any run of word characters and whitespace between two double quotes, and then replaces the whole match with "":
expression = r'"[\w\s]+"'
print(re.sub(expression, '""', string))
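One caveat worth checking: [\w\s] does not cover punctuation, so a quoted phrase that contains a comma or an apostrophe is left alone. A small sketch comparing it with a negated character class (the sample sentence is made up for illustration):

```python
import re

sample = 'She said "hello, world" today.'

# [\w\s]+ cannot cross the comma, so the quoted phrase is not matched at all.
word_only = re.sub(r'"[\w\s]+"', '""', sample)

# [^"]* accepts anything except a double quote, so the phrase is emptied.
any_chars = re.sub(r'"[^"]*"', '""', sample)

print(word_only)  # She said "hello, world" today.
print(any_chars)  # She said "" today.
```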
You can use this code. Hope it helps.
import re
text = 'The text has "single quotes" and "commas".'
# [^"]* matches everything up to the next double quote
text = re.sub(r'"[^"]*"', '""', text)
print(text)  # The text has "" and "".
I'm trying to remove specific double quotes from text using regular expression in python. I would like to leave only those double quotes which indicate an inch. So this would mean leave any double quote following a number.
txt = 'measurement 1/2" and 3" "remove" end" a " multiple"""'
Expected output:
measurement 1/2" and 3" remove end a multiple
This is the closest I've got.
re.sub(r'[^(?!\d+/\d+")]"+', '', txt)
Simply use
(?<!\d)"+
See a demo on regex101.com.
Your original expression
[^(?!\d+/\d+")]
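is a negated character class: it matches any single character other than (, ?, !, a digit, +, /, " or ), so the (?!...) inside the brackets is not treated as a lookahead.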
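A quick way to try the lookbehind pattern in plain Python, as a sketch on the sample string from the question (the whitespace cleanup at the end is an extra step, not part of the pattern):

```python
import re

txt = 'measurement 1/2" and 3" "remove" end" a " multiple"""'

# (?<!\d) only lets a run of quotes match when the character before it is not a digit.
cleaned = re.sub(r'(?<!\d)"+', '', txt)

# Collapse the double space left where a lone quote was removed.
cleaned = ' '.join(cleaned.split())

print(cleaned)  # measurement 1/2" and 3" remove end a multiple
```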
Alternatively, you could use the newer third-party regex module (pip install regex) with (*SKIP)(*FAIL):
import regex as re
junk = '''measurement 1/2" and 3" "remove" end" a " multiple"""
ABC2DEF3"'''
rx = re.compile(r'\b\d(?:/\d+)?"(*SKIP)(*FAIL)|"+')
cleaned = rx.sub('', junk)
print(cleaned)
Which would yield
measurement 1/2" and 3" remove end a multiple
ABC2DEF3
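If installing regex is not an option, a similar effect can be approximated with the standard re module by capturing what should survive and returning it from a replacement function. This is a swapped-in technique rather than what (*SKIP)(*FAIL) does internally, and keep_inches is a hypothetical helper name:

```python
import re

junk = '''measurement 1/2" and 3" "remove" end" a " multiple"""
ABC2DEF3"'''

# Group 1 captures an inch measurement (same shape as in the pattern above);
# the second alternative matches any other run of double quotes.
rx = re.compile(r'(\b\d(?:/\d+)?")|"+')

def keep_inches(match):
    # Put the measurement back untouched; drop every other quote run.
    return match.group(1) or ''

cleaned = rx.sub(keep_inches, junk)
print(cleaned)
```

Because the measurement is consumed by the first alternative, its trailing quote is never offered to the second one.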
Basically, I have a string that has multiple double-whitespaces like this:
"Some text\s\sWhy is there no punctuation\s\s"
I also have a list of punctuation marks that should replace the double-whitespaces, so that the output would be this:
puncts = ['.', '?']
# applying some function
# output:
>>> "Some text. Why is there no punctuation?"
I have tried re.sub(' +', puncts[i], text) but my problem here is that I don't know how to properly iterate through the list and replace the 1st double-whitespace with the 1st element in puncts, the 2nd double-whitespace with the 2nd element in puncts and so on.
If we're still using re.sub(), here's one possible solution that follows this basic pattern:
Get the next punctuation character.
Replace only the first remaining double whitespace in text with that character.
import re
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s(?=\s)', i, text, count=1)
The call to re.sub() returns a string, and basically says "find every pair of adjacent whitespace characters, but replace only the first character of the pair with a punctuation character." The count=1 argument makes it replace only the first such pair instead of all of them (the default behavior).
If the positive lookahead (the part of the regex that we want to match but not replace) confuses you, you can also do without it:
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s\s', i + " ", text, count=1)
This yields the same output.
There will be a leftover whitespace at the end of the sentence, but if you're picky about that, a simple text.rstrip() should take care of it.
Further explanation
Your first attempt with the regex ' +' doesn't work because it matches every run of one or more spaces, including the single spaces between words, and replaces each of them with a punctuation character. The solutions above match exactly two whitespace characters instead.
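The loop can also be avoided, since re.sub() accepts a function as the replacement. A minimal sketch under that approach (marks and next_mark are illustrative names, not from the answers above):

```python
import re

puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "

marks = iter(puncts)

def next_mark(match):
    # Each double whitespace gets the next mark plus one space;
    # once the marks run out, only the single space is used.
    return next(marks, "") + " "

result = re.sub(r"\s\s", next_mark, text).rstrip()
print(result)  # Some text. Why is there no punctuation?
```

This makes a single pass over the string, which may matter if the text is long or the list of marks is large.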
You can do it simply using the replace method!
text = "Some text  Why is there no punctuation  "
puncts = ['.', '?']
for i in puncts:
    text = text.replace("  ", i, 1)  # notice the 1 here
print(text)
Output : Some text.Why is there no punctuation?
You can use re.split() to break the string into substrings between the double spaces and intersperse the punctuation marks using join:
import re
string = "Some text  Why is there no punctuation  "
iPunct = iter([". ","? "])
result = "".join(x+next(iPunct,"") for x in re.split(r"\s\s",string))
print(result)
# Some text. Why is there no punctuation?
I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to remove every chunk of this sentence that sits between two spaces and contains &#.
result = "this is some text and some more text and some other stuff"
I've been trying:
re.compile(r'([\s]&#.*?([\s])).sub(" ", text)
I can't seem to get the first part though.
You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff
You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, \s+ matches one or more whitespace characters, then \S* matches zero or more non-whitespace characters, followed by &#, then \S* again matches zero or more non-whitespace characters, and finally \s+ matches the trailing whitespace. The whole match is replaced by a single space, giving you your intended string.
Also, if this noise string can be either at the very start or very end of string, feel free to change \s+ to \s*
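A sketch of that \s* variant, on a made-up string with noise at both ends; the strip() call is an added assumption to drop the replacement space left at the edges:

```python
import re

s = 'ab&#cd this is some text sldkf&#lsakjd and more x&#y'

# \s* instead of \s+ lets the noise token sit at the very start or end;
# strip() then removes the replacement space left at either edge.
cleaned = re.sub(r'\s*\S*&#\S*\s*', ' ', s).strip()

print(cleaned)  # this is some text and more
```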
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff
Try this:
import re
result = re.findall(r"[a-zA-Z]+\&\#[a-zA-Z]+", text)
print(result)
['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
Now remove the words in the result list from the list of all words.
Edit 1, suggested by @Jan:
re.sub(r"[a-zA-Z]+\&\#[a-zA-Z]+", '', text)
output: 'this is some text  and some more text  and some other stuff'
Edit 2, suggested by @Pushpesh Kumar Rajwanshi:
re.sub(r" [a-zA-Z]+\&\#[a-zA-Z]+ ", " ", text)
output: 'this is some text and some more text and some other stuff'
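One pitfall to watch for with hand-written letter ranges: A-z is not the same as A-Z, because the ASCII range from A to z also includes six punctuation characters. A short sketch (the sample token foo_bar&#baz is made up):

```python
import re

# The ASCII range A-z also covers the six characters between 'Z' and 'a'.
extras = [ch for ch in '[\\]^_`' if re.fullmatch(r'[A-z]', ch)]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# With A-Z the underscore is no longer treated as a letter.
loose = re.findall(r'[a-zA-z]+&#[a-zA-z]+', 'foo_bar&#baz')
strict = re.findall(r'[a-zA-Z]+&#[a-zA-Z]+', 'foo_bar&#baz')
print(loose)   # ['foo_bar&#baz']
print(strict)  # ['bar&#baz']
```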
I have a huge string which contains a lot of numbers in square brackets. For instance:
[1] this is an example
...
[123] another example
How can I remove the numbers and the brackets from my text string?
My current code to extract the text from a file:
text = txtFile.read()
text = str(text)
text = text.replace("\\n", " ")
text = " ".join(text.split())
Try using re.sub:
import re
text = txtFile.read()
text = str(text)
text = re.sub(r'\[\d+\]', '', text)
The regex pattern \[\d+\] matches any bracketed term that contains one or more digits.
Note that re.sub by default replaces every match in the input string, not just the first one.
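A self-contained sketch on the sample lines from the question. Consuming the spaces after the bracket is an extra assumption about the desired output, so that no leading space is left behind:

```python
import re

text = "[1] this is an example\n...\n[123] another example"

# Also consume spaces or tabs after the closing bracket so no leading space remains.
cleaned = re.sub(r'\[\d+\][ \t]*', '', text)

print(cleaned)
```

Brackets that hold anything other than digits, such as [a], are left untouched by this pattern.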
Let's say I have:
a = r''' Example
This is a very annoying string
that takes up multiple lines
and h#s a// kind{s} of stupid symbols in it
ok String'''
I need a way to replace (or just delete) any text in between "This" and "ok" so that when I call it, a now equals:
a = "Example String"
I can't find any wildcards that seem to work. Any help is much appreciated.
You need a regular expression:
>>> import re
>>> re.sub('\nThis.*?ok','',a, flags=re.DOTALL)
' Example String'
Another method is to use string splits:
def replaceTextBetween(originalText, delimiterA, delimiterB, replacementText):
    # Text before the first delimiterA and text after the first delimiterB
    leadingText = originalText.split(delimiterA)[0]
    trailingText = originalText.split(delimiterB)[1]
    return leadingText + delimiterA + replacementText + delimiterB + trailingText
Limitations:
Does not check if the delimiters exist
Assumes that there are no duplicate delimiters
Assumes that delimiters are in correct order
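Some of these limitations can be softened with str.partition(). The following is only a sketch, with the hypothetical name replace_text_between, and it assumes that returning the text unchanged is acceptable when a delimiter is missing or out of order:

```python
def replace_text_between(text, start, end, replacement):
    """Replace whatever sits between the first `start` and the next `end`."""
    head, found_start, rest = text.partition(start)
    if not found_start:
        return text  # start delimiter missing: leave the text untouched
    _, found_end, tail = rest.partition(end)
    if not found_end:
        return text  # no end delimiter after the start delimiter
    return head + start + replacement + end + tail

a = " Example\nThis is a very annoying string\nok String"
result = replace_text_between(a, "This", "ok", "")
print(repr(result))  # ' Example\nThisok String'
```

Like the split-based version, this keeps both delimiters in the result and only looks at the first occurrence of each.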
The DOTALL flag is the key. Ordinarily, the '.' character doesn't match newlines, so you don't match across lines in a string. If you set the DOTALL flag, re will match '.*' across as many lines as it needs to.
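The effect of the flag is easy to check side by side. A minimal sketch on a shortened version of the string from the question:

```python
import re

a = " Example\nThis is a very annoying string\nthat takes up multiple lines\nok String"

# Without DOTALL, '.' stops at the first newline, so "ok" is never reached.
without_flag = re.sub(r"\nThis.*?ok", "", a)

# With DOTALL, '.' also matches newlines and the whole block is removed.
with_flag = re.sub(r"\nThis.*?ok", "", a, flags=re.DOTALL)

print(without_flag == a)  # True
print(repr(with_flag))    # ' Example String'
```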
Use re.sub: it replaces everything from one delimiter to another, delimiters included, with the desired character, symbol or string.
format: re.sub('A(.*?)B', P, Q, flags=re.DOTALL)
where
A : character, symbol or string that starts the span (escape it with re.escape if it contains regex metacharacters)
B : character, symbol or string that ends the span (same caveat)
P : character, symbol or string that replaces the matched text, A and B included
Q : input string
re.DOTALL : lets . match newlines, so the span can cross lines
import re
re.sub('\nThis(.*?)ok', '', a, flags=re.DOTALL)
output : ' Example String'
Let's see an example with HTML code as input
input_string = '''<body> <h1>Heading</h1> <p>Paragraph</p><b>bold text</b></body>'''
Target : remove <p> tag
re.sub('<p>(.*?)</p>', '', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> <b>bold text</b></body>'
Target : replace <p> tag with word : test
re.sub('<p>(.*?)</p>', 'test', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> test<b>bold text</b></body>'
a=re.sub('This.*ok','',a,flags=re.DOTALL)
If you want first and last words:
re.sub(r'^\s*(\w+).*?(\w+)$', r'\1 \2', a, flags=re.DOTALL)