Python extract substring via regex with marker as delimiter - python

In a textfile
1. Notice
Some text
End Notice
2. Blabla
Some other text
Even more text
3. Notice
Some more text
End Notice
I would like to extract the text from "2. Blabla" and the following text(lines) with regex.
A section such as "2. Blabla" might appear in the text file several times (as with "1. Notice" etc.).
I tried
pattern = r"(\d+\. Blabla[\S\n\t\v ]*?\d+\. )"
re.compile(pattern)
result = re.findall(pattern, text)
print(result)
but it gives me
['2. Blabla\nSome other text\nEven more text\n3. ']
How can I get rid of the "3. "?

You can use
(?ms)^\d+\. Blabla.*?(?=^\d+\. |\Z)
It will match start of a line, one or more digits, a dot, a space, Blabla, and then zero or more chars, as few as possible, till the first occurrence of one or more digits + . + space at the start of a line, or end of the whole string.
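A minimal sketch of this lookahead pattern in Python. The sample text is made up for illustration (it adds a second Blabla block at the very end so that the \Z alternative is exercised), and the variable names are assumptions:

```python
import re

# Hypothetical sample: two "Blabla" sections, the second one ending the string.
text = (
    "1. Notice\nSome text\nEnd Notice\n"
    "2. Blabla\nSome other text\nEven more text\n"
    "3. Notice\nSome more text\nEnd Notice\n"
    "4. Blabla\nLast block"
)

# (?m) lets ^ match at every line start; (?s) lets . match newlines too.
pattern = r"(?ms)^\d+\. Blabla.*?(?=^\d+\. |\Z)"

# The lazy .*? stops right before the next numbered heading, so each match
# still carries the newline that precedes that heading; rstrip() trims it.
sections = [m.rstrip() for m in re.findall(pattern, text)]
print(sections)
```

Because the next heading is only looked at, not consumed, the "3. " no longer ends up in the result.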
However, there is a faster expression:
(?m)^\d+\. Blabla.*(?:\n(?!\d+\.).*)*
See the regex demo. Details:
^ - start of a line (due to re.M option in the Python code)
\d+ - one or more digits
\. - a dot
Blabla - a fixed string
.* - the rest of the line
(?:\n(?!\d+\.).*)* - any zero or more lines that do not start with one or more digits and then a . char.
See the Python demo:
import re
text = "1. Notice \nSome text \nEnd Notice\n2. Blabla \nSome other text \nEven more text\n3. Notice \nSome more text\nEnd Notice"
pattern = r"^\d+\. Blabla.*(?:\n(?!\d+\.).*)*"
result = re.findall(pattern, text, re.M)
print(result)
# => ['2. Blabla \nSome other text \nEven more text']

Related

Regex in Python to remove all uppercase characters before a colon

I have a text where I would like to remove all uppercase consecutive characters up to a colon. I have only figured out how to remove all characters up to the colon itself; which results in the current output shown below.
Input Text
text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'
Desired output:
'This is a text. This is a second text. This is a third text'
Current code & output:
re.sub(r'^.+[:]', '', text)
#current output
'This is a third text'
Can this be done with a one-liner regex or do I need to iterate through every character.isupper() and then implement regex ?
You can use
\b[A-Z]+:\s*
\b A word boundary to prevent a partial match
[A-Z]+: Match 1+ uppercase chars A-Z and a :
\s* Match optional whitespace chars
import re
text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'
print(re.sub(r'\b[A-Z]+:\s*', '', text))
Output
This is a text. This is a second text. This is a third text
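To see why the original attempt removed too much, here is a small comparison sketch (the variable names are only illustrative). The greedy .+ in ^.+[:] runs to the last colon in the string, while \b[A-Z]+:\s* only consumes each uppercase label:

```python
import re

text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'

# Greedy: .+ grabs as much as possible, so everything up to the LAST colon goes.
greedy = re.sub(r'^.+[:]', '', text)

# Targeted: each run of uppercase letters plus its colon and trailing whitespace.
targeted = re.sub(r'\b[A-Z]+:\s*', '', text)

print(repr(greedy))
print(repr(targeted))
```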

How to create regular expression to substitute strings surrounded by parentheses?

I'm trying to remove every pair of parentheses together with its contents, but there is a problem: the output is left with whitespace at the start and end of lines.
Code:
import re
regex = r"\(.+?\)"
test_str = ("(a) method in/to one's madness\n"
"(all) by one's lonesome\n"
"(as) tough as (old boot's)\n"
" (at) any moment (now) \n"
"factors (in or into or out) \n"
" right-to-life\n"
"all mouth (and no trousers/action)\n"
"(it's a) small world\n"
" throw (someone) a bone ")
subst = ""
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
Result:
method in/to one's madness
by one's lonesome
tough as
any moment
factors
right-to-life
all mouth
small world
throw a bone
I tried different patterns (other than this one) to remove the \s from the start, but then, when a pattern finds the space at the end of a line, it joins the following line onto the preceding one.
You can use
regex = r"[^\S\r\n]*\([^()]*\)"
result = "\n".join([x.strip() for x in re.sub(regex, "", test_str).splitlines()])
See the Python demo
The [^\S\r\n]*\([^()]*\) regex will remove all instances of
[^\S\r\n]* - zero or more horizontal whitespaces and then
\([^()]*\) - (, any zero or more chars other than ( and ) and then )
The "\n".join([x.strip() for x in re.sub(regex, "", test_str).splitlines()]) part splits all text into lines, strips them from leading/trailing whitespace and joins them back with a line feed.
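The two steps above (remove the parenthesized parts, then strip each line) can be sketched as a runnable example. The shortened sample below is an assumption chosen for illustration, not the full test string from the question:

```python
import re

# Shortened, hypothetical sample with leading/trailing spaces on some lines.
test_str = (
    "(a) method in/to one's madness\n"
    " (at) any moment (now) \n"
    "factors (in or into or out) \n"
    " throw (someone) a bone "
)

# [^\S\r\n]* matches horizontal whitespace only, so line breaks are never eaten.
regex = r"[^\S\r\n]*\([^()]*\)"

without_parens = re.sub(regex, "", test_str)

# Strip each line separately, then put the line breaks back.
result = "\n".join(line.strip() for line in without_parens.splitlines())
print(result)
```

Stripping per line, rather than matching \s* around the parentheses, is what keeps neighbouring lines from being merged.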
You can go with \(([^)]+)\)
So basically:
\(: matches the opening parenthesis
([^)]+): captures one or more characters other than )
\): matches the closing parenthesis
Use re.sub(...) to replace all the regex matches; it replaces every match by default, so no /g flag is needed in Python.

regex capture info in text file after multiple blank lines

I open a complex text file in python, match everything else I need with regex but am stuck with one search.
I want to capture the numbers after the 'start after here' line. The space between the two rows is important, and I plan to split on it later.
start after here: test
5.7,-9.0,6.2
1.6,3.79,3.3
Code:
text = open(r"file.txt","r")
for line in text:
    find = re.findall(r"start after here:[\s]\D+.+", line)
I tried this here https://regexr.com/ and it seems to work but it is for Java.
It doesn't find anything. I assume this is because I need to handle multiple lines, but I'm unsure how to read the file differently or how to incorporate that. I have tried many adjustments to the regex without success.
import re
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
m = re.search(r'start after here:([^\n])+\n+(.*)', test_str)
new_str = m[2]
m = re.search(r'(-?\d*\.\d*,?\s*)+', new_str)
print(m[0])
The pattern start after here:[\s]\D+.+ matches the literal words and then a whitespace char using [\s] (you can omit the brackets).
Then 1+ non-digit characters are matched, which runs up to just before 5.7. After that, 1+ characters other than a newline are matched, which covers 5.7,-9.0,6.2. It will not match the following empty line or the line after it.
One option is to match your string and then capture, in a group, all the following lines that start with a decimal.
\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*
The values including the empty line are in the first capturing group.
For example
import re
regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
matches = re.findall(regex, test_str)
print(matches)
Result
['5.7,-9.0,6.2\n\n1.6,3.79,3.3']
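The loop in the question applies the regex to one line at a time, so a pattern that spans several lines can never match there. A sketch of applying the same pattern to the whole file content instead, assuming the file is small enough to read at once; the temporary file merely stands in for file.txt, and the names are illustrative:

```python
import re
import tempfile
from pathlib import Path

regex = re.compile(
    r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*)"
)

sample = "start after here: test\n\n\n5.7,-9.0,6.2\n\n1.6,3.79,3.3\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "file.txt"   # stand-in for the real file
    path.write_text(sample)
    content = path.read_text()      # whole file as one string, not line by line

match = regex.search(content)
block = match.group(1) if match else ""

# Split later on the line breaks, dropping the empty line between the rows.
rows = [row for row in block.splitlines() if row]
print(rows)
```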
If you want to match the decimals (or just one or more digits) before the comma you might split on 1 or more newlines and use:
[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)
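A short sketch of that last step, assuming the captured block from the first group is already available as a string (the variable names are made up for illustration):

```python
import re

# Hypothetical input: the text captured by group 1 of the previous pattern.
block = "5.7,-9.0,6.2\n\n1.6,3.79,3.3"

# A signed integer or decimal that is followed by a comma or the end of the line.
number = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)")

# Split on one or more newlines, then collect the numbers of each row.
rows = [number.findall(line) for line in re.split(r"\n+", block)]
print(rows)
```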

get full string before and after a specific pattern

I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to remove every whitespace-delimited token in this sentence that contains &#.
result = "this is some text and some more text and some other stuff"
been trying:
re.compile(r'([\s]&#.*?([\s])).sub(" ", text)
I can't seem to get the first part though.
You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff
You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, \s+ matches one or more whitespace characters, then \S* matches zero or more non-whitespace characters, with &# sandwiched between that and a second \S* (again zero or more non-whitespace characters), followed finally by \s+, one or more whitespace characters. The whole match is replaced by a single space, giving you your intended string.
Also, if this noise string can be either at the very start or very end of string, feel free to change \s+ to \s*
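A sketch of that edge case, using a made-up string where the noise tokens sit at the very start and very end. With \s+ nothing matches there; with \s* both tokens are replaced, and a final strip() removes the spaces the replacement leaves at the edges (the strip() call is an addition for illustration):

```python
import re

# Hypothetical input: noise tokens at the start and at the end of the string.
s = 'ab&#cd this is text ef&#gh'

# \s+ demands whitespace on both sides, so neither token matches here.
strict = re.sub(r'\s+\S*&#\S*\s+', ' ', s)

# \s* makes the surrounding whitespace optional.
relaxed = re.sub(r'\s*\S*&#\S*\s*', ' ', s).strip()

print(repr(strict))
print(repr(relaxed))
```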
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff
Try this:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
result = re.findall(r"[a-zA-Z]+\&\#[a-zA-Z]+", text)
print(result)
['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
Now remove the words in the result list from the list of all words.
Edit 1, suggested by @Jan:
re.sub(r"[a-zA-Z]+\&\#[a-zA-Z]+", '', text)
output: 'this is some text  and some more text  and some other stuff' (note the doubled spaces left where each token was removed)
Edit 2, suggested by @Pushpesh Kumar Rajwanshi:
re.sub(r" [a-zA-Z]+\&\#[a-zA-Z]+ ", " ", text)
output: 'this is some text and some more text and some other stuff'

Python Regex for End of Line

I am trying to write a regex which adds a space before and after a dot.
However I only want this if there is a space or end of line after the dot.
However I am unable to do so for end of line cases.
Eg.
I want a hotel. >> I want a hotel .
my email is zob#gmail.com >> my email is zob#gmail.com
I have to play. bye! >> I have to play . bye!
Following is my code:
# If "Dot and space" after word or number put space before and after
utterance = re.sub(r'(?<=[a-z0-9])[.][ $]',' . ',utterance)
How do I correct my regex so that my 1st example above also works? I tried putting a $ sign in the square brackets, but that doesn't work.
The main issue is that $ inside a character class denotes a literal $ symbol; you just need a grouping construct here.
I suggest using the following code:
import re
regex = r"([^\W_])\.(?:\s+|$)"
ss = ["I want a hotel.","my email is zob#gmail.com", "I have to play. bye!"]
for s in ss:
    result = re.sub(regex, r"\1 . ", s).rstrip()
    print(result)
See the Python demo.
If you need to apply this on lines only without affecting line breaks, you can use
import re
regex = r"([^\W_])\.(?:[^\S\n\r]+|$)"
text = "I want a hotel.\nmy email is zob#gmail.com\nI have to play. bye!"
print( re.sub(regex, r"\1 . ", text, flags=re.M).rstrip() )
See this Python demo.
Output:
I want a hotel .
my email is zob#gmail.com
I have to play . bye!
Details:
([^\W_]) - Group 1 matching any letter or digit
\. - a literal dot
(?:\s+|$) - a grouping matching either 1+ whitespaces or end of string anchor (here, $ matches the end of string.)
The rstrip will remove the trailing space added during replacement.
If you are using Python 3, the [^\W_] will match all Unicode letters and digits by default. In Python 2, re.U flag will enable this behavior.
Note that \s+ in the last (?:\s+|$) will "shrink" multiple whitespaces into 1 space.
Use a lookahead assertion (?=...) to find a . followed by a space or the end of a line:
utterance = re.sub(r'\.(?= |$)', ' .', utterance, flags=re.M)
[ $] defines a class of characters consisting of a space and a dollar sign, so it matches a space or a literal dollar. To match a space or the end of a line, use ( |$) (in this case, $ keeps its special meaning).
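The difference between the two forms can be checked directly. A small sketch, with sample strings invented for illustration:

```python
import re

# Inside brackets, $ is just a character: this needs a space or a literal "$".
literal = re.compile(r"\.[ $]")

# Inside a group, $ is the end-of-string anchor again.
anchored = re.compile(r"\.(?: |$)")

at_end = "I want a hotel."

class_misses_end = literal.search(at_end) is None          # no char after the dot
class_hits_dollar = literal.search("price.$") is not None  # literal $ matches
group_hits_end = anchored.search(at_end) is not None       # anchor matches

print(class_misses_end, class_hits_dollar, group_hits_end)
```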
