I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed with a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
See the Python demo and a regex demo.
You want a lookahead. You also might want it to be more flexible as far as how many newlines / what newline characters. You might try this:
import re
r = re.compile(r"""(\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
though this does seem to insert \r\n characters into the list... Hmm.
Related
text to capture looks like this..
Policy Number ABCD000012345 other text follows in same line....
My regex looks like this
regex value='(?i)(?:[P|p]olicy\s[N|n]o[|:|;|,][\n\r\s\t]*[\na-z\sA-Z:,;\r\d\t]*[S|s]e\s*[H|h]abla\s*[^\n]*[\n\s\r\t]*|(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)(?P<policy_number>[^\n]*)'
this particular case matches with the second or case.. however it is also capturing everything after the policy number. What can be the stopping condition for it to just grab the number. I know something is wrong but can't find a way out.
(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)
current output
ABCD000012345othertextfollowsinsameline....
expected output
ABCD000012345
You may use a more simple regex, just finding from the beginning "[P|p]olicy\s*[N|n]umber\s*\b([A-Z]{4}\d+)\b.*" and use the word boundary \b
pattern = re.compile(r"[P|p]olicy\s*[N|n]umber\s*\b([A-Z0-9]+)\b.*")
line = "Policy Number ABCD000012345 other text follows in same line...."
matches = pattern.match(line)
id_res = matches.group(1)
print(id_res) # ABCD000012345
And if there's always 2 words before you can use (?:\w+\s+){2}\b([A-Z0-9]+)\b.*
Also \s is for [\r\n\t\f\v ] so no need to repeat them, your [\n\r\s\t] is just \s
you don't need the upper and lower case p and n specified since you're already specifying case insensitive.
Also \s already covers \n, \t and \r.
(?i)policy\s+number\s+([A-Z]{4}\d+)\b
for verification purpose: Regex
Another Solution:
^[\s\w]+\b([A-Z]{4}\d+)\b
for verification purpose: Regex
I like this better, in case your text changes from policy number
I want a regex that doesn't match a string if contains the word page, and match if it's not contain.
^https?.+/(event|news)/.+(?!page).+$ this is the regex I'm currently using, so I want it to not match with, e.g. https://www.foosite.com/news/foopath/page/10, but it does. Where did I made a mistake?
The double .+ expressions should imply that there should be some string around the page string, and (?!page) should imply there must not be a string like page between them. What's wrong with this expression? Thanks, and sorry for poor grammar.
Your problem is that .+(?!page).+ will match foopath/page/10 because the first .+ match can end at the 1 in 10, and the second can match from there until $. Instead, just assert there is no combination of characters plus the word page after (event|news)/:
^https?.+/(event|news)/(?!.*page)
Demo on regex101
If you want more than just a match/nomatch decision, you can capture the entire matching string with this regex:
^https?.+/(event|news)/(?!.*page).*$
Demo on regex101
You might be looking for
^https?.+/(event|news)/(?:(?!page).)+$
See a demo on regex101.com.
Matching is usually way easier in regex than excluding.
I would rather match your excluded words and invert the logic on the if-clause.
if(!re.match(...
UPDATED
I want to find a string within a big text
..."img good img two_apple.txt"
Want to extract the two_apples.txt from a text, but it can change to one_apple, three_apple..so on...
When I try to use lookbehinds, it matches text all the way from the beginning.
You are mis-using lookarounds. Looks like you dont even NEED a lookaround:
pattern = r'src="images/(.+?.png")'
should work for you. As my comment suggests though, using regex is not recommended for parsing HTML/XML style documents but you do you.
EDIT - accommodate your edit:
Now that I understand your problem more, I can see why you would want to use a look-around. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capturing token does not include spaces:
pattern = r'src="img (\w+?.png")'
^ ensure there is a space HERE because of how your text is
\w - \w is equivalent to [a-zA-Z0-9_] (any letters, numbers or underscore)
This removes the greediness of capture the first 'img ' string that pops up and ensures your capture group doesnt have any spaces.
by using \w, I am assuming you are only expecting _ and letter characters. to include anything else, make your own character group with [any characters you want to capture in here]
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
Try it here: https://regex101.com/r/wO7lG3/2
So I have this code:
(r'\[quote\](.+?)\[/quote\]')
What I want to do is to change the regex so it only matches if the text within [quote] [/quote] is between 1-50 words.
Is there any easy way to do this?
Edit: Removed confusing html code in the regex example. I am NOT trying to match HTML.
Sure there is, depending on how you define a "word."
I would do so separately from regex, but if you want to use regex, you could probably do:
r"\[quote\](.+?\s){1,49}[/quote\]"
That will match between 2 and 50 words (since it demands a trailing \s, it can't match ONE)
Crud, that also won't match the LAST word, so let's do this instead:
r"\[quote\](.+?(?:\s.+?){1,49})\[/quote\]"
This is a definite misuse of regexes for a lot of reasons, not the least of which is the problem matching [X]HTML as #Hyperboreus noted, but if you really insist you could do something along the lines of ([a-zA-Z0-9]\s){1}{49}.
For the record, I don't recommend this.
how do you delete text inside <ref> *some text*</ref> together with ref itself?
in '...and so on<ref>Oxford University Press</ref>.'
re.sub(r'<ref>.+</ref>', '', string) only removes <ref> if
<ref> is followed by a whitespace
EDIT: it has smth to do with word boundaries I guess...or?
EDIT2 What I need is that it will math the last (closing) </ref> even if it is on a newline.
I don't really see you problem, because the code pasted will remove the <ref>...</ref> part of the string. But if what you mean is that and empty ref tag is not removed:
re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')
Then what you need to do is change the .+ with .*
A + means one or more, while * means zero or more.
From http://docs.python.org/library/re.html:
'.' (Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character including
a newline.
'*' Causes the resulting RE to match 0 or more repetitions of the preceding
RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
followed by any number of ‘b’s.
'+' Causes the resulting RE to match 1 or more repetitions of the preceding
RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
not match just ‘a’.
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
ab? will match either ‘a’ or ‘ab’.
You could make a fancy regex to do just what you intend, but you need to use DOTALL and non-greedy search, and you need to understand how regexes work in general, which you don't.
Your best option is to use string methods rather than regexes, which is more pythonic anyway:
while '<reg>' in string:
begin, end = string.split('<reg>', 1)
trash, end = end.split('</reg>', 1)
string = begin + end
If you want to be very generic, allowing strange capitalization of the tags or whitespaces and properties in the tags, you shouldn't do this either, but invest in learning a html/xml parsing library. lxml currently seems to be widely recommended and well-supported.
You might want to be cautious not to remove a whole lot of text just because there are more than one closing </ref>s. Below regex would be more accurate in my opinion:
r'<ref>[^<]*</ref>'
This would prevent the 'greedy' matching.
BTW: There is a great tool called The Regex Coach to analyze and test your regexes. You can find it at: http://www.weitz.de/regex-coach/
edit: forgot to add code tag in the first paragraph.
If you try to do this with regular expressions you're in for a world of trouble. You're effectively trying to parse something but your parser isn't up to the task.
Matching greedily across strings probably eats up too much, as in this example:
<ref>SDD</ref>...<ref>XX</ref>
You'd end up cleraning up the entire middle.
You really want a parser, something like Beautiful Soup.
from BeautifulSoup import BeautifulSoup, Tag
s = "<a>sfsdf</a> <ref>XX</ref> || <ref>YY</ref>"
soup = BeautifulSoup(s)
x = soup.findAll("ref")
for z in x:
soup.ref.replaceWith('!')
soup # <a>sfsdf</a> ! || !