replace some part of a word with regex - python

How do you delete the text inside <ref> *some text*</ref> together with the ref tags themselves?
For example, in '...and so on<ref>Oxford University Press</ref>.'
re.sub(r'<ref>.+</ref>', '', string) only removes the <ref> if
<ref> is followed by whitespace.
EDIT: I guess it has something to do with word boundaries... or does it?
EDIT2: What I need is for it to match the last (closing) </ref> even if it is on a new line.

I don't really see your problem, because the code you pasted will remove the <ref>...</ref> part of the string. But if you mean that an empty ref tag is not removed:
re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')
then you need to replace .+ with .*
A + means one or more, while * means zero or more.
From http://docs.python.org/library/re.html:
'.' (Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character including
a newline.
'*' Causes the resulting RE to match 0 or more repetitions of the preceding
RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
followed by any number of ‘b’s.
'+' Causes the resulting RE to match 1 or more repetitions of the preceding
RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
not match just ‘a’.
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
ab? will match either ‘a’ or ‘ab’.
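The quoted rules can be checked with a small sketch. The two sample strings are made up for illustration: one has a reference that spans a line break, the other an empty reference. The second pattern combines the non-greedy .*? with re.DOTALL, which is one possible fix and not necessarily the only one.

```python
import re

# Hypothetical samples: a reference spanning a newline, and an empty reference.
multiline = "...and so on<ref>Oxford\nUniversity Press</ref>."
empty = "...and so on<ref></ref>."

# .+ needs at least one character and, by default, the dot stops at a newline,
# so neither sample is changed by the pattern from the question.
plus_multiline = re.sub(r"<ref>.+</ref>", "", multiline)
plus_empty = re.sub(r"<ref>.+</ref>", "", empty)

# .*? allows empty content and stops at the first </ref>;
# re.DOTALL lets the dot match the newline as well.
pattern = re.compile(r"<ref>.*?</ref>", re.DOTALL)
fixed_multiline = pattern.sub("", multiline)
fixed_empty = pattern.sub("", empty)

print(repr(plus_multiline), repr(plus_empty))
print(repr(fixed_multiline), repr(fixed_empty))
```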

You could make a fancy regex to do just what you intend, but you need to use DOTALL and non-greedy search, and you need to understand how regexes work in general, which you don't.
Your best option is to use string methods rather than regexes, which is more pythonic anyway:
# Repeatedly cut out everything from '<ref>' through the next '</ref>'.
while '<ref>' in string:
    begin, end = string.split('<ref>', 1)
    trash, end = end.split('</ref>', 1)
    string = begin + end
If you want to be very generic, allowing strange capitalization of the tags, or whitespace and attributes inside the tags, you shouldn't do this either; instead, invest in learning an HTML/XML parsing library. lxml currently seems to be widely recommended and well-supported.

You might want to be careful not to remove a whole lot of text just because there is more than one closing </ref>. The regex below would be more accurate in my opinion:
r'<ref>[^<]*</ref>'
This prevents a 'greedy' match from running across several tags, because [^<]* cannot cross the < of the next tag.
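A minimal sketch of the difference, using a made-up string with two references. Note one limitation of the negated class: a reference whose content itself contains a < character would not be matched.

```python
import re

# Hypothetical sample with two separate references.
s = "one<ref>SDD</ref> two<ref>XX</ref> three"

# Greedy .+ runs from the first <ref> to the LAST </ref> and eats " two" as well.
greedy = re.sub(r"<ref>.+</ref>", "", s)

# [^<]* cannot cross a '<', so each reference is matched on its own.
negated = re.sub(r"<ref>[^<]*</ref>", "", s)

print(greedy)   # one three
print(negated)  # one two three
```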
BTW: There is a great tool called The Regex Coach to analyze and test your regexes. You can find it at: http://www.weitz.de/regex-coach/

If you try to do this with regular expressions you're in for a world of trouble. You're effectively trying to parse something but your parser isn't up to the task.
Matching greedily across strings probably eats up too much, as in this example:
<ref>SDD</ref>...<ref>XX</ref>
You'd end up cleaning out everything in between as well.
You really want a parser, something like Beautiful Soup.
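from bs4 import BeautifulSoup  # Beautiful Soup 4; the old "BeautifulSoup" module is version 3
s = "<a>sfsdf</a> <ref>XX</ref> || <ref>YY</ref>"
soup = BeautifulSoup(s, "html.parser")
for tag in soup.find_all("ref"):
    # Replace each <ref> element (tags and contents) with a marker.
    tag.replace_with('!')
print(soup)  # <a>sfsdf</a> ! || !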


Regex to replace filepaths in a string when there's more than one in Python

I'm having trouble finding a way to match multiple filepaths in a string while maintaining the rest of the string.
EDIT: forgot to add that the filepath might contain a dot, so I edited "username" to "user.name"
# filepath always starts with "file:///" and ends with file extension
text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""
I've found many answers that work for a single filepath, but they fail when there's more than one: they also replace the other characters in between.
import re
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4}"
print(re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.MULTILINE))
>>>"this is an example text extracted from *IGOTREPLACED* running out of text to write."
I've also tried a "stop when you find whitespace after the pattern" approach, but I couldn't get it to work:
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4} ([^\s]+)"
>>> 0 matches
Note that {1,255} is a greedy quantifier and will match as many chars as possible; to make it lazy, add ? after it.
However, just using a lazy {1,255}? quantifier won't solve the problem. You need to define where the match should end. It seems you only want to match these URLs when the extension is immediately followed by whitespace or the end of the string.
Hence, use
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"
The (?!\S) negative lookahead will fail any match if, immediately to the right of the current location, there is a non-whitespace char. .{1,255}? will match any 1 to 255 chars, as few as possible.
Use in Python as
re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
The re.MULTILINE (re.M) flag only redefines ^ and $ anchor behavior making them match start/end of lines rather than the whole string. The re.S flag allows . to match any chars, including line break chars.
Please never use (\w|\W){1,255}? to match any char; use .{1,255}? with the re.S flag instead, otherwise performance will suffer.
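The difference between the two flags can be seen in a small sketch on a made-up two-line string:

```python
import re

s = "a\nb"  # hypothetical two-line sample

# re.M only changes what ^ and $ mean: they now match at each line boundary.
anchors_default = re.findall(r"^\w$", s)
anchors_multiline = re.findall(r"^\w$", s, flags=re.M)

# re.S only changes what the dot means: it now matches newlines too.
dot_default = re.findall(r"a.b", s)
dot_dotall = re.findall(r"a.b", s, flags=re.S)

print(anchors_default, anchors_multiline)  # [] ['a', 'b']
print(dot_default, dot_dotall)             # [] ['a\nb']
```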
You can use re.findall to check how many times the regex matches in the string:
import re
# pattern and string_to_search stand for your own regex and text
len(re.findall(pattern, string_to_search))
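Putting the pieces of this thread together, here is a sketch that runs the lazy pattern with the (?!\S) lookahead against the sample text from the question, first listing the matches and then replacing them. The variable names matches and replaced are only for illustration.

```python
import re

text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""

# Lazy body, then a 3-4 character extension that must be followed
# by whitespace or the end of the string.
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"

# The pattern has no capturing groups, so findall returns the full matches.
matches = re.findall(fp_pattern, text, flags=re.S)
replaced = re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)

print(len(matches))
print(replaced)
```

The dot in user.name does not end the first match, because ".name" is followed by a non-whitespace character and the lookahead rejects it.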

Trying to find the regex for this particular case? Also can I parse this without creating groups?

text to capture looks like this..
Policy Number ABCD000012345 other text follows in same line....
My regex looks like this
regex value='(?i)(?:[P|p]olicy\s[N|n]o[|:|;|,][\n\r\s\t]*[\na-z\sA-Z:,;\r\d\t]*[S|s]e\s*[H|h]abla\s*[^\n]*[\n\s\r\t]*|(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)(?P<policy_number>[^\n]*)'
This particular case matches the second alternative of the "or". However, it also captures everything after the policy number. What stopping condition would make it grab just the number? I know something is wrong but can't find a way out.
(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)
current output
ABCD000012345othertextfollowsinsameline....
expected output
ABCD000012345
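You may use a simpler regex that anchors on the "Policy Number" label and captures the ID between word boundaries (\b), for example "[Pp]olicy\s*[Nn]umber\s*\b([A-Z]{4}\d+)\b.*". Note that [P|p] also matches a literal |; inside a character class, write [Pp].
import re
pattern = re.compile(r"[Pp]olicy\s*[Nn]umber\s*\b([A-Z0-9]+)\b.*")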
line = "Policy Number ABCD000012345 other text follows in same line...."
matches = pattern.match(line)
id_res = matches.group(1)
print(id_res) # ABCD000012345
And if there are always two words before the number, you can use (?:\w+\s+){2}\b([A-Z0-9]+)\b.*
Also, \s already stands for [ \t\n\r\f\v], so there is no need to repeat them; your [\n\r\s\t] is just \s.
You don't need to specify both upper- and lower-case p and n, since (?i) already makes the match case-insensitive.
Also, \s already covers \n, \t and \r.
(?i)policy\s+number\s+([A-Z]{4}\d+)\b
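A minimal sketch of this pattern in use, assuming the policy number always has the form of four letters followed by digits. The named group policy_number is taken from the question; re.IGNORECASE is used in place of the inline (?i).

```python
import re

# Assumption: the ID is always four letters followed by one or more digits.
pattern = re.compile(
    r"policy\s+number\s+(?P<policy_number>[A-Z]{4}\d+)\b",
    re.IGNORECASE,
)

line = "Policy Number ABCD000012345 other text follows in same line...."

# search() returns None when the label is absent, so guard before reading the group.
m = pattern.search(line)
policy_number = m.group("policy_number") if m else None

print(policy_number)  # ABCD000012345
```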
Another Solution:
^[\s\w]+\b([A-Z]{4}\d+)\b
I like this one better, in case the label text changes from "policy number".

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed by a non-whitespace char. Because the lookahead does not consume that char, it stays at the start of the next field.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
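A small sketch comparing the original split with the lookahead version. The sample text is invented and much shorter than a real trial record; it only assumes that fields start at the left margin and that continuation paragraphs are indented.

```python
import re

# Hypothetical miniature record: the Summary field has an indented continuation.
text = (
    "Date: 2020-01-01\n\n"
    "URL: http://example.org\n\n"
    "Summary: first paragraph\n\n"
    "  indented continuation\n\n"
    "Keywords: a, b"
)

# The character class consumes the first character of each following field.
lossy = re.split(r"\n\n[^\s]", text)

# The lookahead only checks for a non-whitespace character without consuming it.
fields = re.split(r"\n{2,}(?=\S)", text)

print(lossy)
print(fields)
```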
You want a lookahead. You also might want it to be more flexible as far as how many newlines / what newline characters. You might try this:
import re
r = re.compile(r"""(?:\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
Note the non-capturing group (?:...): with a capturing group, re.split also inserts the captured newline characters into the resulting list.
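The effect of a capturing group on re.split can be checked with a short sketch on a made-up string:

```python
import re

sample = "a\r\n\r\nb"  # hypothetical text with Windows line endings

# With a capturing group, re.split also returns the text the group matched
# (here the last repetition of the group).
with_group = re.split(r"(\r\n|\r|\n)+(?=\S)", sample)

# A non-capturing group (?:...) keeps the separators out of the result.
without_group = re.split(r"(?:\r\n|\r|\n)+(?=\S)", sample)

print(with_group)     # ['a', '\r\n', 'b']
print(without_group)  # ['a', 'b']
```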

search a repeated structure with regex

I have a string of the structure:
A_1: text
a lot more text
A_2: some text
a lot more other text
Now I want to extract the descriptive title (A_1) and the following text. Something like
[("A_1", "text\na lot more text"),("A_2", "some text\na lot more other text")]
My expression I use is
(A_\d+):([.\s]+)
But I get only [('A_1', ' '), ('A_2', ' ')].
Has someone an idea for me?
Thanks in advance,
Martin
You can use a lookahead to stop the match at the next occurrence of the start indicator.
(?s)A_\d+:.*?(?=\s*A_\d+:|$)
(?s) dotall flag to make dot also match newlines
A_\d+: your start indicator
.*? match as few as possible (lazy dot)
(?=\s*A_\d+:|$) until start pattern with optional spaces ahead or $ end
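The pieces listed above can be sketched as follows. Two capturing groups are added here (an assumption, they are not in the pattern as posted) so that findall returns the (title, text) pairs asked for in the question, and re.DOTALL replaces the inline (?s).

```python
import re

test_str = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"

# Group 1: the title. Group 2: lazily matched text, ending where the lookahead
# sees the next title (after optional whitespace) or the end of the string.
pattern = re.compile(r"(A_\d+):\s*(.*?)(?=\s*A_\d+:|$)", re.DOTALL)

pairs = pattern.findall(test_str)
print(pairs)
```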
Your [.\s]+ matches one or more literal dots (since . inside a character class loses its special meaning) and whitespaces. I think you meant to use . with a re.DOTALL flag. However, you can use something different, a tempered greedy token (there are other ways, too).
You can use
(?s)(A_\d+):\s*((?:(?!A_\d).)+)
IDEONE demo:
import re
p = re.compile(r'(A_\d+):\s*((?:(?!A_\d).)+)', re.DOTALL)
test_str = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"
print(p.findall(test_str))
The (?:(?!A_\d).)+ tempered greedy token will match any text up to the first A_+digit pattern.
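Two details are worth checking in a sketch: what [.\s]+ really matches, and the fact that the tempered token also consumes the blank lines before the next title, so the captured text may need stripping. The strip() call is an addition for illustration, not part of the answer above.

```python
import re

# Inside a character class the dot is literal: only dots and whitespace match.
dots_and_spaces = re.findall(r"[.\s]+", "a.b c")

test_str = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"
p = re.compile(r"(A_\d+):\s*((?:(?!A_\d).)+)", re.DOTALL)

raw = p.findall(test_str)
# Trim the trailing newlines that the token consumed before the next title.
pairs = [(title, body.strip()) for title, body in raw]

print(dots_and_spaces)  # ['.', ' ']
print(raw)
print(pairs)
```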

Python, regex and html: match final tag on line

I'm confused about python greedy/not-greedy characters.
"Given multi-line html, return the final tag on each line."
I would think this would be correct:
re.findall('<.*?>$', html, re.MULTILINE)
I'm irked because I expected a list of single tags like:
"</html>", "<ul>", "</td>".
My O'Reilly Pocket Reference says that *? will "match 0 or more times, but as few times as possible."
So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?
Your problem stems from the fact that you have an end-of-line anchor ('$'). The way non-greedy matching works is that the engine first searches for the first unconstrained pattern on the line ('<' in your case). It then looks for the first '>' character (which you have constrained, with the $ anchor, to be at the end of the line). So a non-greedy * is not any different from a greedy * in this situation.
Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack; see @Mark's answer. '<[^><]*>$' will work, because [^><]* cannot run past the end of one tag into the next.
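A short sketch of both patterns on a made-up three-line HTML snippet shows the difference:

```python
import re

# Hypothetical multi-line HTML.
html = "<html><body>\n<ul><li>item</li></ul>\n<td>x</td>"

# The lazy version starts at the FIRST '<' on each line and is forced by '$'
# to extend to the end of that line.
lazy = re.findall(r"<.*?>$", html, re.MULTILINE)

# The negated class cannot cross '<' or '>', so only the last tag can reach '$'.
final_tags = re.findall(r"<[^><]*>$", html, re.MULTILINE)

print(lazy)        # ['<html><body>', '<ul><li>item</li></ul>', '<td>x</td>']
print(final_tags)  # ['<body>', '</ul>', '</td>']
```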
