get full string before and after a specific pattern - python

I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to be able to remove everything in this sentence where after a space, and before a space contains &#.
result = "this is some text and some more text and some other stuff"
been trying:
re.compile(r'([\s]&#.*?([\s])).sub(" ", text)
I can't seem to get the first part though.

You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff

You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, \s+ matches any whitespace(s) then \S* matches zero or more non-whitespace characters while sandwiching &# within it and again \S* matches zero or more whitespace(s) and finally followed by \s+ one or more whitespace which gets removed by a space, giving you your intended string.
Also, if this noise string can be either at the very start or very end of string, feel free to change \s+ to \s*
Regex Demo
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff

Try This:
import re
result = re.findall(r"[a-zA-z]+\&\#[a-zA-z]+", text)
print(result)
['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
now remove the result list from the list of all words.
Edit1 Suggest by #Jan
re.sub(r"[a-zA-z]+\&\#[a-zA-z]+", '', text)
output: 'this is some text and some more text and some other stuff'
Edit2 Suggested by #Pushpesh Kumar Rajwanshi
re.sub(r" [a-zA-z]+\&\#[a-zA-z]+ ", " ", text)
output:'this is some text and some more text and some other stuff'

Related

Python extract substring via regex with marker as delimiter

In a textfile
1. Notice
Some text
End Notice
2. Blabla
Some other text
Even more text
3. Notice
Some more text
End Notice
I would like to extract the text from "2. Blabla" and the following text(lines) with regex.
A section as "2. Blabla" might be in the textile several time (as with "1. Notice" etc.).
I tried
pattern = r"(\d+\. Blabla[\S\n\t\v ]*?\d+\. )"
re.compile(pattern)
result = re.findall(pattern, text)
print(result)
but it gives me
['2. BlaBla\nSome other text\nEven more text\n3. ']
How can I get rid of the "3. "?
You can use
(?ms)^\d+\. Blabla.*?(?=^\d+\. |\Z)
It will match start of a line, one or more digits, a dot, a space, Blabla, and then zero or more chars, as few as possible, till the first occurrence of one or more digits + . + space at the start of a line, or end of the whole string.
However, there is a faster expression:
(?m)^\d+\. Blabla.*(?:\n(?!\d+\.).*)*
See the regex demo. Details:
^ - start of a line (due to re.M option in the Python code)
\d+ - one or more digits
\. - a dot
Blabla - a fixed string
.* - the rest of the line
(?:\n(?!\d+\.).*)* - any zero or more lines that do not start with one or more digits and then a . char.
See the Python demo:
import re
text = "1. Notice \nSome text \nEnd Notice\n2. Blabla \nSome other text \nEven more text\n3. Notice \nSome more text\nEnd Notice"
pattern = r"^\d+\. Blabla.*(?:\n(?!\d+\.).*)*"
result = re.findall(pattern, text, re.M)
print(result)
# => ['2. Blabla \nSome other text \nEven more text']

Regex in Python to remove all uppercase characters before a colon

I have a text where I would like to remove all uppercase consecutive characters up to a colon. I have only figured out how to remove all characters up to the colon itself; which results in the current output shown below.
Input Text
text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'
Desired output:
'This is a text. This is a second text. This is a third text'
Current code & output:
re.sub(r'^.+[:]', '', text)
#current output
'This is a third text'
Can this be done with a one-liner regex or do I need to iterate through every character.isupper() and then implement regex ?
You can use
\b[A-Z]+:\s*
\b A word boundary to prevent a partial match
[A-Z]+: Match 1+ uppercase chars A-Z and a :
\s* Match optional whitespace chars
Regex demo
import re
text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'
print(re.sub(r'\b[A-Z]+:\s*', '', text))
Output
This is a text. This is a second text. This is a third text

regex to match string before colon till whitespace

i have a sample string from a text file. i want to find all the words before colon till whitespace.
i have written code like this:
import re
text = 'From: mathew <mathew#mantis.co.uk>\nSubject: Alt.Atheism FAQ: Atheist Resources\n\nArchive-
name: atheism/resources\nAlt-atheism-archive-name:'
email_data = re.findall("[^\s].*(?=:)", text)
print(email_data)
Output:
['From', 'Subject: Alt.Atheism FAQ', 'Archive-name', 'Alt-atheism-archive-name']
Desired Output:
['From', 'Subject', 'FAQ', 'Archive-name', 'Alt-atheism-archive-name']
Code is picking up data till newline charater because of (.*) used. i want to restrict it till whitespace so i put [^\s] but its not working. What could i do instead?
You may use
email_data = re.findall(r"\S[^:\s]+(?=:)", text)
See the Python demo and the regex demo.
Details
\S - a non-whitespace char
[^:\s]+ - 1+ chars other than : and whitespace
(?=:) - immediately to the right, there must be a : char (it is not consumed, not added to the match value).
Use re.IGNORECASE flag with the regex pattern
\b[a-z-]+(?=:(?:\s|$))
https://regex101.com/r/0UHsbo/1
https://ideone.com/oz91bP

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
with open('test.txt','r') as f:
for line in f:
line = line.lower()
line = re.sub('[^a-z\ \']+', " ", line)
print line
if there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line))
# Here is some stuff. Now there are quotes. Now there's not.
Probably more efficient to do it #TigerhawkT3's way. Though they produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word
# | or
# \'(?!\w) match any quote not followed by a word
s = "'here is some stuff 'now there are quotes' now there's not'"
print rgx.sub('', s) # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched when it is not preceded nor followed with letters/digits/_.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is only matched if it is not preceded nor followed with a letter (ASCII or any).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is complete solution to remove whatever you don't want in a string:
def istext (text):
ok = 0
for x in text: ok += x.isalnum()
return ok>0
def stripit (text, ofwhat):
for x in ofwhat: text = text.strip(x)
return text
def purge (text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
text = text.splitlines()
text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
return "\n".join(text)
>>> print purge("'Nice, .to, see! you. Isn't it?'")
Nice to see you Isn't it
Note: this will kill all whitespaces too and transform them to space or remove them completely.

Replace substrings with items from list

Basically, I have a string that has multiple double-whitespaces like this:
"Some text\s\sWhy is there no punctuation\s\s"
I also have a list of punctuation marks that should replace the double-whitespaces, so that the output would be this:
puncts = ['.', '?']
# applying some function
# output:
>>> "Some text. Why is there no punctuation?"
I have tried re.sub(' +', puncts[i], text) but my problem here is that I don't know how to properly iterate through the list and replace the 1st double-whitespace with the 1st element in puncts, the 2nd double-whitespace with the 2nd element in puncts and so on.
If we're still using re.sub(), here's one possible solution that follows this basic pattern:
Get the next punctuation character.
Replace only the first occurrence of that character in text.
puncts = ['.', '?']
text = "Some text Why is there no punctuation "
for i in puncts:
text = re.sub('\s(?=\s)', i, text, 1)
The call to re.sub() returns a string, and basically says "find all series of two whitespace characters, but only replace the first whitespace character with a punctuation character." The final argument "1" makes it so that we only replace the first instance of the double whitespace, and not all of them (default behavior).
If the positive lookahead (the part of the regex that we want to match but not replace) confuses you, you can also do without it:
puncts = ['.', '?']
text = "Some text Why is there no punctuation "
for i in puncts:
text = re.sub('\s\s', i + " ", text, 1)
This yields the same output.
There will be a leftover whitespace at the end of the sentence, but if you're stingy about that, a simple text.rstrip() should take care of that one.
Further explanation
Your first try of using regex ' +' doesn't work because that regex matches all instances where there is at least one whitespace — that is, it will match everything, and then also replace all of it with a punctuation character. The above solutions account for the double-whitespace in their respective regexes.
You can do it simply using the replace method!
text = "Some text Why is there no punctuation "
puncts = ['.', '?']
for i in puncts:
text = text.replace(" ", i, 1) #notice the 1 here
print(text)
Output : Some text.Why is there no punctuation?
You can use re.split() to break the string into substrings between the double spaces and intersperse the punctuation marks using join:
import re
string = "Some text Why is there no punctuation "
iPunct = iter([". ","? "])
result = "".join(x+next(iPunct,"") for x in re.split(r"\s\s",string))
print(result)
# Some text. Why is there no punctuation?

Categories