Regex in Python to remove all uppercase characters before a colon - python

I have a text from which I would like to remove every run of consecutive uppercase characters up to a colon. So far I have only figured out how to remove all characters up to the colon itself, which results in the current output shown below.
Input Text
text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'
Desired output:
'This is a text. This is a second text. This is a third text'
Current code & output:
re.sub(r'^.+[:]', '', text)
#current output
'This is a third text'
Can this be done with a one-liner regex, or do I need to iterate over every character, check character.isupper(), and then apply a regex?

You can use
\b[A-Z]+:\s*
\b A word boundary to prevent a partial match
[A-Z]+: Match 1+ uppercase chars A-Z and a :
\s* Match optional whitespace chars
Regex demo
import re
text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'
print(re.sub(r'\b[A-Z]+:\s*', '', text))
Output
This is a text. This is a second text. This is a third text
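To see why the original attempt returns only the last sentence, and what \b contributes, here is a small sketch. The 'fooBAR: keep me' string is a made-up input used only for illustration.

```python
import re

text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'

# Greedy: ^.+ first runs to the end of the string, then backtracks to the LAST colon,
# so everything before the final label is removed along with it.
greedy = re.sub(r'^.+:', '', text)

# Targeted: only a run of uppercase letters directly before a colon is removed.
targeted = re.sub(r'\b[A-Z]+:\s*', '', text)

# Hypothetical input: without \b, the uppercase tail of a mixed-case word is stripped too.
sample = 'fooBAR: keep me'
without_boundary = re.sub(r'[A-Z]+:\s*', '', sample)
with_boundary = re.sub(r'\b[A-Z]+:\s*', '', sample)

print(repr(greedy))
print(repr(targeted))
print(repr(without_boundary))
print(repr(with_boundary))
```

Note that the greedy result keeps the space that followed the last colon, because the original pattern has no trailing \s*.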

Related

Wrap each word with a tag inside a sentence with Python regex and `re.sub`

I want to split the sentence into words, wrap words in tags and join the string back.
Example: Test, abc; text. should become <span>Test</span>, <span>abc</span>; <span>text</span>.
I've tried to use regex and \b but I don't understand how \b works.
You can use
import re
text = "Test, abc; text."
print( re.sub(r'\w+', r'<span>\g<0></span>', text) )
# => <span>Test</span>, <span>abc</span>; <span>text</span>.
See the Python demo.
\w+ matches any chunk of one or more word characters (letters, digits and the underscore), and the <span>\g<0></span> replacement pattern wraps each match with span tags (\g<0> is a backreference to the whole match).
Note that, in Python 3, \w matches any Unicode letters and digits. In Python 2.x, you'd need to add flags=re.U:
re.sub(r'\w+', r'<span>\g<0></span>', text, flags=re.U)
Or use an inline modifier:
re.sub(r'(?u)\w+', r'<span>\g<0></span>', text)
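Since the question mentions confusion about \b, the following sketch may help: \b matches a position between a word character and a non-word character, not a character itself. The wrap function is a hypothetical helper that does the same job as the \g<0> replacement string.

```python
import re

text = "Test, abc; text."

# \b is zero-width: substituting it inserts a marker at every word edge
# without removing any character.
marked = re.sub(r'\b', '|', text)

# Hypothetical helper: a callable replacement receives each match object,
# so match.group(0) plays the role of \g<0>.
def wrap(match):
    return '<span>' + match.group(0) + '</span>'

wrapped = re.sub(r'\w+', wrap, text)

print(marked)
print(wrapped)
```

A callable is mainly useful when the replacement depends on the matched text; for plain wrapping, the \g<0> string form is shorter.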

Split python string if newline starts with digit

I am trying to split a text wherever a new line starts with a digit.
a="""1.pharagraph1
text1
2.pharagraph2
text2
3.pharagraph3
text3
"""
The expected result would be:
['1.pharagraph1 text1' , '2.pharagraph2 text2', '3.pharagraph3 text3']
I tried: re.split('\n\d{1}',a) and it doesn't work for this task.
You can use a lookahead to only split when the newline and spaces are followed by a digit:
import re
result = re.split(r'\n\s+(?=\d)', a)  # raw string; \s+ needs at least one whitespace char after the newline
If you really have leading spaces and you did not make a typo when creating a sample string, you can use
[re.sub(r'[^\S\n]*\n[^\S\n]*', ' ', x).strip() for x in re.split(r'\n[^\S\n]*(?=\d)', a)]
# => ['1.pharagraph1 text1', '2.pharagraph2 text2', '3.pharagraph3 text3']
See the Python demo.
The \n[^\S\n]*(?=\d) pattern matches a newline and then zero or more horizontal whitespace chars ([^\S\n]*) that are followed by a digit. Then, inside each resulting item, every sequence of 0+ horizontal whitespace chars, a newline and 0+ horizontal whitespace chars is replaced with a single space.
If the string has no leading whitespace, you can use a simpler approach:
import re
a="""1.pharagraph1
text1
2.pharagraph2
text2
3.pharagraph3
text3"""
print( [x.replace("\n"," ") for x in re.split(r'\n(?=\d)', a)] )
# => ['1.pharagraph1 text1', '2.pharagraph2 text2', '3.pharagraph3 text3']
See the online Python demo. Here, the string is simply split at a newline that is followed with a digit (\n(?=\d)) and then all newlines are replaced with a space.
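The difference between the attempt in the question (\n\d) and the lookahead version can be seen side by side in this short sketch, which uses the sample string without leading whitespace:

```python
import re

a = "1.pharagraph1\ntext1\n2.pharagraph2\ntext2\n3.pharagraph3\ntext3"

# \n\d consumes the digit, so it disappears from the following item.
consumed = re.split(r'\n\d', a)

# \n(?=\d) only checks that a digit follows; the digit stays in the item.
kept = re.split(r'\n(?=\d)', a)

print(consumed)
print(kept)
```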

how can I perform conditional splitting with exceptions in python

I want to split a string into sentences.
But there are some exceptions that I did not expect:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
Desired split:
split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
How can I do this using regex in Python?
My efforts so far:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split('(?<=[.|?|!|...])\s', str)
print(split)
I got:
['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
Expected:
['UPPERCASE.UPPERCASE. Name.']
The \s in [A-Z]+\. Name should not cause a split.
You can use
(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+
See the regex demo. Details:
(?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
(?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an uppercase letter and a . that are followed by 1+ whitespace chars and Name. The variable-width \s+ sits inside a lookahead, which is zero-width, so the lookbehind itself stays fixed-width and works with Python re. The \s+ in the lookahead is needed to look past the whitespace, which the final \s+ pattern below will match and consume, and check that Name follows.
\s+ - one or more whitespace chars.
See the Python demo:
import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
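If more than one word should block the split, the same pattern can be built from a list. This is only a sketch: split_sentences and its protected parameter are hypothetical names, protected is assumed to be non-empty, and the trailing \b is an addition not present in the pattern above (it stops "Name" from also matching "Names").

```python
import re

def split_sentences(text, protected=("Name",)):
    # Hypothetical helper: words in `protected` (assumed non-empty) must not be
    # split off when they follow an uppercase letter and a dot.
    alternatives = "|".join(re.escape(word) for word in protected)
    pattern = r'(?<=[.?!])(?<![A-Z]\.(?=\s+(?:%s)\b))\s+' % alternatives
    return re.split(pattern, text)

text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
default_split = split_sentences(text)

# "Title" is a made-up second exception word.
custom_split = split_sentences("A.B. Title. Next.", protected=("Name", "Title"))
plain_split = split_sentences("A.B. Title. Next.")

print(default_split)
print(custom_split)
print(plain_split)
```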

regex to match string before colon till whitespace

I have a sample string from a text file. I want to find every word that comes right before a colon, going back only as far as the preceding whitespace.
I have written code like this:
import re
text = 'From: mathew <mathew#mantis.co.uk>\nSubject: Alt.Atheism FAQ: Atheist Resources\n\nArchive-name: atheism/resources\nAlt-atheism-archive-name:'
email_data = re.findall("[^\s].*(?=:)", text)
print(email_data)
Output:
['From', 'Subject: Alt.Atheism FAQ', 'Archive-name', 'Alt-atheism-archive-name']
Desired Output:
['From', 'Subject', 'FAQ', 'Archive-name', 'Alt-atheism-archive-name']
The code picks up data up to the newline character because of the .* in the pattern. I want to restrict the match so it stops at whitespace, so I added [^\s], but it's not working. What could I do instead?
You may use
email_data = re.findall(r"\S[^:\s]+(?=:)", text)
See the Python demo and the regex demo.
Details
\S - a non-whitespace char
[^:\s]+ - 1+ chars other than : and whitespace
(?=:) - immediately to the right, there must be a : char (it is not consumed, not added to the match value).
Alternatively, use the re.IGNORECASE flag with the regex pattern
\b[a-z-]+(?=:(?:\s|$))
https://regex101.com/r/0UHsbo/1
https://ideone.com/oz91bP
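The two patterns above agree on the sample from the question but can differ on other input. A minimal sketch, using a made-up string that contains a time value:

```python
import re

# Hypothetical input: "10:30" has a colon that is not a header separator.
sample = 'Subject: meeting at 10:30\nX-Mailer:'

# Accepts any run of 2+ non-whitespace, non-colon chars that is followed by a colon.
loose = re.findall(r'\S[^:\s]+(?=:)', sample)

# Accepts only letters/hyphens, and the colon must be followed by whitespace or the end.
strict = re.findall(r'\b[a-z-]+(?=:(?:\s|$))', sample, flags=re.IGNORECASE)

print(loose)
print(strict)
```

Which one fits depends on whether values such as times or URLs can appear in the text.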

get full string before and after a specific pattern

I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to remove every whitespace-delimited chunk in this sentence that contains &#.
result = "this is some text and some more text and some other stuff"
I have been trying:
re.compile(r'([\s]&#.*?([\s]))').sub(" ", text)
I can't seem to get the first part though.
You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff
You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, \s+ matches one or more whitespace chars, then \S* matches zero or more non-whitespace characters, followed by the literal &#, then another \S* matches zero or more non-whitespace characters, and finally \s+ matches one or more whitespace chars. The whole match is replaced with a single space, giving you your intended string.
Also, if this noise string can be either at the very start or very end of string, feel free to change \s+ to \s*
Regex Demo
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff
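The note about noise at the very start or end of the string can be sketched as follows. drop_noise is a hypothetical helper name, and the final strip() is an added assumption so that a replacement space at either edge does not remain.

```python
import re

def drop_noise(s):
    # \s* instead of \s+ lets the pattern match a noise token at the string edges;
    # strip() then removes the replacement space left at the start or end.
    return re.sub(r'\s*\S*&#\S*\s*', ' ', s).strip()

# Hypothetical input with noise at the start, in the middle and at the end.
edge_case = 'ab&#cd this is text ef&#gh more xy&#z'

# The \s+ version needs whitespace on both sides, so a leading token is left alone.
untouched = re.sub(r'\s+\S*&#\S*\s+', ' ', 'ab&#cd rest')

cleaned = drop_noise(edge_case)
original = drop_noise('this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff')

print(untouched)
print(cleaned)
print(original)
```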
Try this:
import re
# [a-zA-Z] instead of [a-zA-z]: the A-z range also matches [ \ ] ^ _ and `
result = re.findall(r"[a-zA-Z]+\&\#[a-zA-Z]+", text)
print(result)
# => ['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
Now remove the items in result from the list of all words.
Edit 1, suggested by @Jan:
re.sub(r"[a-zA-Z]+\&\#[a-zA-Z]+", '', text)
output: 'this is some text  and some more text  and some other stuff' (a double space is left where each token was removed)
Edit 2, suggested by @Pushpesh Kumar Rajwanshi:
re.sub(r" [a-zA-Z]+\&\#[a-zA-Z]+ ", " ", text)
output: 'this is some text and some more text and some other stuff'
