REGEX match for portions of a document between two headers - python

I'm trying, and failing, to write a Python-compatible regex that captures multiple parts of a document. My code will ultimately be in Python, but so far I've only tried to get the expression right on regex101.com (unsuccessfully, obviously :) ).
My text, which is file-based, looks something like this:
<#
.SYNOPSIS
This is the synopsis text, that is a multiline
synopsis - I want to match all of this text
as a capture group.
.PARAMETER
This a another block of
multiline text that I want to capture
.SOMEOTHER HEADER
And some other multiline text
#>
I'd like to capture two groups (the header and the body text) globally, i.e. for each section.
My ultimate aim is a Python list of dictionaries like:
[
{'header':'SYNOPSIS', 'text': }
{'header':'PARAMETER', 'text': }
]
The header is always anchored to the beginning of the line, starting with '.' and followed by uppercase text. The body of the section includes any word and non-word characters, including CR/LF (Windows line endings).
The header names are not guaranteed to be fixed literals or to appear in a specific order, nor do I know how many headers might exist.
Right now my expression looks like this:
(^\.[A-Z]+)([\n\W\w]+)
With this I can match the header followed by a body, but I'm having a hard time telling the regex to essentially 'stop looking when you hit the next .HEADERTEXT'.
I've created a Regex101 demo at https://regex101.com/r/YqibeH/4 in case it's of use.
My pseudocode says something like:
Find all lines beginning with ^\.[A-Z] as a capture group, then continue to match all text (multiline) after the header as a second capture group. Stop capturing just before the next header that begins with ^\.[A-Z].
Any help greatly appreciated.

I believe what you're looking for is a lookahead. Additionally, the quantifier you were using is greedy and should be swapped for a lazy one. With both changes, this should work (with the multiline flag enabled, so that ^ matches at the start of each line):
^\.\w+[\n\W\w]+?(?=^\.\w+|^#>)
https://regex101.com/r/YqibeH/7
^\.\w+ greedily matches your header text.
[\n\W\w]+? lazily matches your body text
(?=^\.\w+|^#>) until the lookahead finds either a line beginning with another header or a line beginning with the closing tag #>.
Note that if the greedy quantifier + were used rather than +?, the match would continue to the last possible position it could reach.
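As a rough sketch of how this pattern might be used from Python to build the list of dictionaries the question asks for: the two capture groups, the sample text, and the variable names below are illustrative additions, not part of the answer above, and re.MULTILINE is assumed so that ^ matches at each line start.

```python
import re

# Hypothetical sample text modelled on the question (LF line endings for brevity).
text = (
    "<#\n"
    ".SYNOPSIS\n"
    "This is the synopsis text,\n"
    "spread over two lines.\n"
    ".PARAMETER\n"
    "Another block of text.\n"
    "#>\n"
)

# Same pattern as above, with capture groups added around the header name and the body.
pattern = re.compile(r"^\.(\w+)([\n\W\w]+?)(?=^\.\w+|^#>)", re.MULTILINE)

# One dictionary per section; strip() trims the newlines around each body.
sections = [
    {"header": m.group(1), "text": m.group(2).strip()}
    for m in pattern.finditer(text)
]
print(sections)
```

With CRLF input, strip() also trims a carriage return at either end of a body, but any \r inside a multiline body is left in place.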

import re
text = '<#\n.SYNOPSIS\nThis is the block of code that I would like to have matched along with the .SYNOPSIS header, ' \
'as this block belongs to SYNOPSIS\n .NOTES\n This block needs to belong with\nNOTES ' \
'header\n.SOMEOTHERHEADER\nAnd resulting text\n\n#> '
# raw string, so that \. reaches the regex engine as an escaped dot
pattern = r"(\.[A-Z]+\n)+"
print(re.split(pattern, text))
If I understood your problem correctly, the code above solves it. re.split gives you a list with all the elements you need; after cleaning up the strings, you can add them to your dictionaries.
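One possible way to do that clean-up is sketched below. It assumes the list returned by re.split alternates header and body after the first element (which holds the text before the first header); the shortened sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical, shortened sample text.
text = "<#\n.SYNOPSIS\nSynopsis body\n.NOTES\nNotes body\n#>"

# With one capture group, re.split returns
# [text before first header, header, body, header, body, ...].
parts = re.split(r"(\.[A-Z]+\n)+", text)

sections = []
for header, body in zip(parts[1::2], parts[2::2]):
    sections.append({
        # Remove the trailing newline and the leading dot from the header.
        "header": header.strip().lstrip("."),
        # Drop the closing "#>" tag if it trails the last body.
        "text": re.sub(r"#>\s*$", "", body).strip(),
    })
print(sections)
```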

Related

How do I remove texts in brackets only from the beginning of lines with regex in Python?

I would like to remove all the line codes in brackets placed at the beginning of lines but want to preserve the other words in brackets.
NOTE:
In the application that I use I cannot import any Python library but can use Python regexes. The regex and the replacement value in the substitution have to be separated by a comma. For example, I use ([^\s\d])(-\s+),\1 to merge hyphenated words at the end of lines. So I would need something similar.
\([^()]*\) finds every text in brackets.
^\h*\([^()]*\) finds only the first one but not the rest.
How should I modify it?
The sample text is the following:
(#p0340r#) This is a sentence. This is another one but I need more sentences to fill the space to start a new line.
(#p0350q#) Why? (this text should be left unchanged)
(#p0360r#) Because I need to remove these codes from interview texts.
The expected outcome should be:
This is a sentence. This is another one but I need more sentences
to fill the space to start a new line.
Why? (this text should be left unchanged)
Because I need to remove these codes from interview texts.
Thank you!
To remove a pattern at the start of any line with Python re.sub (or any re.sub powered search and replace), you need to use the ^ before the pattern (that is what you already have) and pass the multiline (?m) flag (if you have access to code you could use flags=re.M).
Also, \h is not Python re compliant, you need to use a construct like [ \t] or [^\S\n] (in some rare cases, also [^\S\r\n], usually when you read a file in binary mode) to match any horizontal whitespace.
So you can use
(?m)^[^\S\n]*\([^()]*\)[^\S\n]*
and replace with an empty string.
Note: if you ever want to remove one or more parenthesized substrings at the start of a line, group the pattern and apply the + quantifier to it:
(?m)^(?:[^\S\n]*\([^()]*\))+[^\S\n]*
#    ^^^                   ^^
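For anyone who does have access to Python code, a minimal sketch of the single-group substitution with re.sub follows. The sample text is a shortened version of the question's, with some indentation added on the last line for illustration.

```python
import re

# Shortened sample; the third line is indented to exercise [^\S\n]*.
text = (
    "(#p0340r#) This is a sentence.\n"
    "(#p0350q#) Why? (this text should be left unchanged)\n"
    "  (#p0360r#) Because I need to remove these codes."
)

# (?m) makes ^ match at every line start; [^\S\n] is whitespace except newline.
cleaned = re.sub(r"(?m)^[^\S\n]*\([^()]*\)[^\S\n]*", "", text)
print(cleaned)
```

Only the parenthesized code at the start of each line is removed; the parentheses later in the second line are left alone because ^ cannot match there.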

regex for matching german characters in python

Could someone help me with a regex to match German words/sentences in Python? It does not work in a Jupyter notebook. I tried the same in JSFiddle and it works fine. I tried the script below, but it does not work:
import re
pattern = re.compile(r'\[^a-zA-Z0-9äöüÄÖÜß]\\', re.UNICODE)
print(pattern.search(text))
Your expression will always fail:
\[^a-zA-Z0-9äöüÄÖÜß]\\
Broken down, you require
\[ # a literal [
^ # the start of the line / text
a-z # the literal characters a-z, etc.
The problem is that you require a literal [ right before the start of a line, which can never be true (before the start of a line there is either nothing or a newline). So remove the backslash to get a proper character class, as in:
[^a-zA-Z0-9äöüÄÖÜß]+
But this will surely not match the words you're looking for (quite the opposite, since the ^ negates the class). So either use something as simple as \w+ or the solution proposed by @Wiktor in the comments section.
The square brackets define a set of characters you want to look for; however, a '^' as the first character inside the class negates that set.
If you want to anchor to the beginning of the line, you need to put the '^' before the brackets.
You also need to add a quantifier after the class to match more than just one character, in this case:
r'^[a-zA-Z0-9äöüÄÖÜß]+'
One or more of the characters listed in the brackets are matched, as long as they are not separated by any character not listed between '[]'.
See the official Python re documentation for details.
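To make the difference concrete, here is a small sketch that uses the character class without the leading backslash and without the negation to pull words out of a string. The sample sentence is invented, and re.findall is used here instead of the search call from the question.

```python
import re

# Invented sample sentence containing umlauts and ß.
text = "Die Straße ist schön, über 100 Häuser!"

# Explicit character class: ASCII letters, digits and the German special letters.
words = re.findall(r"[a-zA-Z0-9äöüÄÖÜß]+", text)

# In Python 3, \w on str patterns is Unicode-aware, so it also covers umlauts and ß
# (plus the underscore and letters from other alphabets).
words_w = re.findall(r"\w+", text)

print(words)
print(words_w)
```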

Removing markup links in text

I'm cleaning some text from Reddit. When you include a link in a Reddit self-text, you do so like this:
[the text you read](https://website.com/to/go/to). I'd like to use regex to remove the hyperlink (e.g. https://website.com/to/go/to) but keep the text you read.
Here is another example:
[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)
I'd like to keep: the podcast list.
How can I do this with Python's re library? What is the appropriate regex?
Here is an initial attempt at the regex you asked for:
(?<=\[.+\])\(.+\)
The first part, (?<=...), is a lookbehind, which means it checks for the bracketed text but does not include it in the match. The idea is to pass this regex to a substitution function such as re.sub. Note, however, that Python's built-in re module only accepts fixed-width lookbehinds, so it rejects this pattern as written; the capturing-group version in the edit below avoids the lookbehind entirely.
You can extend the above regex to match only parentheses that contain web links, like so:
(?<=\[.+\])\(https?:\/\/.+\)
The problem with this is that it fails if the link does not start with http or https.
After this you will still need to remove the square brackets; simply removing all square brackets may work fine for you.
Edit 1:
Valentino pointed out that substitute accepts capturing groups, which lets you capture the text and substitute the text back in using the following regex:
\[(.+)\]\(.+\)
You can then substitute the first captured group (in the square brackets) back in using:
re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)
If you want to look at the regex in more detail (if you're new to regex or want to learn what they mean) I would recommend an online regex interpreter, they explain what each symbol does and it makes it much easier to read (especially when there are lots of escaped symbols like there are here).
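One caveat worth illustrating: with the greedy .+ the capturing-group pattern can swallow too much when a line contains more than one link. The sketch below compares it with a variant that uses negated character classes; that variant is my own suggestion rather than part of the answer, and the sample sentence is made up from the question's two links.

```python
import re

text = (
    "See [the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts) "
    "and [the text you read](https://website.com/to/go/to)."
)

# Greedy version from the answer: .+ runs to the last ] and the last ) on the line.
greedy = re.sub(r"\[(.+)\]\(.+\)", r"\1", text)

# Variant with negated classes: the link text stops at the first ], the URL at the first ).
cleaned = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)

print(greedy)
print(cleaned)
```

The negated classes assume that neither the link text nor the URL contains a closing bracket or parenthesis of its own.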

Preserve key:value values in text while regex replacing non-word characters in keys (Notepad++)

I'm trying, without luck, in Notepad++ to replace every non-word character \W with an underscore _ in a block of multi-line text, except for a colon : and everything to the right of it. (A colon doesn't occur on every line; the text is a sort of space-delimited hierarchy terminating in key-value pairs.) A Python solution would be useful as well, as I'm trying to do other things with the text once it is reformatted. Example:
This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
This_100_is_what_I_d_like: See?
Indentation_isn_t_necessary
_to_maintain_but_would_be_nice: :)<-preserved!
I_m_Mr_Conformist_over_here: |Whereas, I'm like whatever's clever.|
If_you_can_help: Thanks 100.1%!
I admit that I'm answering an off-topic question; I just liked the problem. Press CTRL+H, enable Regular Expressions in N++, then search for:
(:[^\r\n]*|^\s+)|\W(?<![\r\n])
And replace with:
(?1\1:_)
The regex has two main parts. The first side of the outer alternation matches either the leading whitespace of a line (the indentation) or everything from the first colon to the end of the line. The second side matches any non-word character except a carriage return \r or newline \n (excluded by the negative lookbehind) in order to preserve line breaks. The replacement string is a conditional: if the first capturing group matched, it is replaced with itself; otherwise the match is replaced with _.
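Python's re module has no conditional replacement syntax, but the same idea can be approximated with a replacement function. The following is only a sketch under a few assumptions: the function name sanitize is hypothetical, the lookbehind is replaced by an equivalent negated class, and the indentation branch is restricted to horizontal whitespace.

```python
import re

# Group 1: everything from the first colon to the end of the line,
#          or the leading horizontal whitespace of a line (kept as is).
# Otherwise: a single non-word character that is not a line break.
PATTERN = re.compile(r"(:[^\r\n]*|^[^\S\r\n]+)|[^\w\r\n]", re.MULTILINE)

def sanitize(text):
    # Keep group 1 unchanged when it took part in the match; otherwise emit "_".
    return PATTERN.sub(
        lambda m: m.group(1) if m.group(1) is not None else "_", text
    )

print(sanitize("a b: c d\n  e-f"))
```

Like the Notepad++ version, this replaces each non-word character individually, so a run such as "% " becomes two underscores.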
Seeing a better description of what you're trying to do, I don't think you'll be able to do it from inside notepad++ using a single regular expression. However, you could write a python script that scrolls through your document, one line at time, and sanitizes anything to the left of a colon (if one exists)
Here's a quick and dirty example (untested). This assumes doc is an open file pointer to the file you want to sanitize
import re
sanitized_lines = []
for line in doc:
    # re.DOTALL lets the final (.*) keep the trailing newline of each line
    line_match = re.match(r"^(\s*)([^:\n]*)(.*)", line, flags=re.DOTALL)
    indentation = line_match.group(1)
    left_of_colon = line_match.group(2)
    remainder = line_match.group(3)
    # replace every non-word character left of the colon with an underscore
    left_of_colon = re.sub(r"\W", "_", left_of_colon)
    sanitized_lines.append("".join((indentation, left_of_colon, remainder)))
sanitized_doc = "".join(sanitized_lines)
print(sanitized_doc)
You may try this python script,
ss="""This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
If you can help: Thanks 100.1%!"""
import re
splitcapture=re.compile(r'(?m)^([^:\n]+)(:[^\n]*|)$')
subregx=re.compile(r'\W+')
print(splitcapture.sub(lambda m: subregx.sub('_', m.group(1))+m.group(2), ss))
Here I first match each line and capture two parts separately: the part that does not contain a ':' character is captured in group 1, and the optional part that starts with ':' and runs to the end of the line is captured in group 2. The replacement is then applied only to the group 1 string, and finally the two parts are joined back together (replaced group 1 + group 2).
And output is
This_100_isn_t_what_I_want_
_Yet_it_s_what_I_ve_got_currently: D#rnit :(
If_you_can_help: Thanks 100.1%!

regex python - using lookbehinds to find my specific text

UPDATED
I want to find a string within a big text:
..."img good img two_apple.txt"
I want to extract two_apple.txt from the text, but it can change to one_apple, three_apple, and so on.
When I try to use lookbehinds, the match runs all the way from the beginning.
You are misusing lookarounds. It looks like you don't even need a lookaround:
pattern = r'src="images/(.+?\.png)"'
should work for you. As my comment suggests, though, using regex is not recommended for parsing HTML/XML-style documents, but you do you.
EDIT - accommodate your edit:
Now that I understand your problem better, I can see why you would want to use a lookaround. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capture group does not include spaces:
pattern = r'src="img (\w+?\.png)"'
                    ^ ensure there is a space HERE because of how your text is laid out
\w is roughly equivalent to [a-zA-Z0-9_] (any letter, digit, or underscore)
This stops the match from starting at the first 'img ' string that pops up and ensures your capture group doesn't contain any spaces.
By using \w, I am assuming you only expect underscores, letters, and digits. To include anything else, build your own character class with [any characters you want to capture in here].
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
Try it here: https://regex101.com/r/wO7lG3/2
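A quick way to check this pattern from Python is sketched below. The surrounding text is invented around the fragment given in the question.

```python
import re

# Invented surrounding text; only the quoted fragment comes from the question.
text = 'some text "img good img two_apple.txt" more text'

# A space, then a run of non-space characters ending in _apple.txt (captured).
match = re.search(r" ([^ ]+_apple\.txt)", text)
filename = match.group(1) if match else None
print(filename)
```

Because [^ ] cannot cross a space, the match cannot start at "good" or at the second "img" and stretch forward to the file name.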
