search a repeated structure with regex - python

I have a string of the structure:
A_1: text
a lot more text
A_2: some text
a lot more other text
Now I want to extract the descriptive title (A_1) and the following text. Something like
[("A_1", "text\na lot more text"),("A_2", "some text\na lot more other text")]
My expression I use is
(A_\d+):([.\s]+)
But I get only [('A_1', ' '), ('A_2', ' ')].
Has someone an idea for me?
Thanks in advance,
Martin

You can use a lookahead to stop the match at the next occurrence of the start indicator.
(?s)A_\d+:.*?(?=\s*A_\d+:|$)
(?s) dotall flag to make dot also match newlines
A_\d+: your start indicator
.*? match as few as possible (lazy dot)
(?=\s*A_\d+:|$) until start pattern with optional spaces ahead or $ end
See demo at regex101.com (Python code generator)
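As a rough sketch of how the lookahead pattern could produce the tuples asked for in the question, the variant below adds two capturing groups and a `\s*` after the colon. Both additions are assumptions of this example, not part of the pattern above.

```python
import re

text = "A_1: text\na lot more text\nA_2: some text\na lot more other text"

# (?s) lets the dot match newlines; the lazy .*? stops as soon as the
# lookahead sees the next "A_<digits>:" header or the end of the string.
pattern = re.compile(r"(?s)(A_\d+):\s*(.*?)(?=\s*A_\d+:|$)")

pairs = pattern.findall(text)
print(pairs)
```

With two groups in the pattern, `findall` returns one `(title, body)` tuple per match.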

Your [.\s]+ matches one or more literal dots (since . inside a character class loses its special meaning) and whitespaces. I think you meant to use . with a re.DOTALL flag. However, you can use something different, a tempered greedy token (there are other ways, too).
You can use
(?s)(A_\d+):\s*((?:(?!A_\d).)+)
See regex demo
IDEONE demo:
import re
p = re.compile(r'(A_\d+):\s*((?:(?!A_\d).)+)', re.DOTALL)
test_str = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"
print(p.findall(test_str))
The (?:(?!A_\d).)+ tempered greedy token will match any text up to the first A_+digit pattern.
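To make the difference concrete, here is a small sketch that runs the question's `[.\s]+` pattern next to the tempered greedy token. The `strip()` call is an addition of this example; it only trims the blank lines that the token captures before the next header.

```python
import re

text = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"

# Inside a character class the dot is literal, so this only picks up
# the single space after each colon.
literal_dot = re.findall(r"(A_\d+):([.\s]+)", text)

# Tempered greedy token: consume one character at a time as long as
# the next header "A_<digit>" does not start at that position.
tempered = re.compile(r"(A_\d+):\s*((?:(?!A_\d).)+)", re.DOTALL)
pairs = [(title, body.strip()) for title, body in tempered.findall(text)]

print(literal_dot)
print(pairs)
```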


Regex to match following pattern in SQL query

I am trying to extract parts of a MySQL query to get the information I want.
I used this code / regex in Python:
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall('\`.*?`[.]\`.*?`',query)
My expected output:
['`asd`.`ssss`', '`ss`.`wwwwwww`']
My real output:
['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']
Can anybody help me and explain to me where I went wrong?
The regex should only find the ones that have two strings like asd and a dot in the middle.
PS: I know that this is not a valid query.
The dot . can also match a backtick, so after the opening backtick the lazy .*? is free to run across backtick boundaries until it finds a backtick followed by the literal dot in [.]
There is no need to use non-greedy quantifiers; a negated character class alone prevents crossing the backtick boundary.
`[^`]*`\.`[^`]*`
Regex demo
The asterisk * matches 0 or more times. If there has to be at least a single char, and newlines and spaces are unwanted, you could add \s to the negated class to prevent matching whitespace chars and use + to match 1 or more times.
`[^`\s]+`\.`[^`\s]+`
Regex demo | Python demo
For example
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall(r'`[^`\s]+`\.`[^`\s]+`', query)
print(table_and_columns)
Output
['`asd`.`ssss`', '`ss`.`wwwwwww`']
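If the table and column names are needed separately, one possible extension (an assumption of this sketch, not something the question asked for) is to wrap each negated class in a capturing group, so that `findall` returns tuples without the backticks:

```python
import re

query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"

# Same negated character classes as above, with a group around each name.
name_pairs = re.findall(r"`([^`\s]+)`\.`([^`\s]+)`", query)
print(name_pairs)
```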
Please try the regex below. The .*? in your pattern can also match backticks, which is what caused the issue.
Instead you should search for [^`]*
`[^`]*?`\.`[^`]*?`
Demo
The thing is that
.*? matches any character (except for line terminators), including whitespace and backticks.
Also, since * already means zero or more occurrences, the trailing ? only makes the quantifier lazy; it is not needed once the pattern restricts which characters may be matched.
So this seems to work:
\`\S+\`[.]\`\S+\`
where \S is any non-whitespace character.
You can always check your regexes using https://regex101.com
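A quick way to compare the suggestions in this thread is to run them side by side on the query from the question. This is only a sketch; the variable names are made up for the example.

```python
import re

query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"

# Pattern from the question: the dot may cross backtick boundaries.
original = re.findall(r"`.*?`[.]`.*?`", query)

# The three patterns suggested in the answers.
negated = re.findall(r"`[^`\s]+`\.`[^`\s]+`", query)
negated_lazy = re.findall(r"`[^`]*?`\.`[^`]*?`", query)
non_space = re.findall(r"`\S+`[.]`\S+`", query)

print(original)
print(negated, negated_lazy, non_space)
```

On this input the three suggested patterns agree, while the original one swallows `column1` into its second match.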

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed by a non-whitespace char, and that char is not consumed by the split.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
See the Python demo and a regex demo.
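The effect can be seen on a tiny made-up document. The field names and values below are hypothetical and only mimic the layout described in the question: top-level fields separated by blank lines, with indented paragraphs inside one field.

```python
import re

# Hypothetical sample text, not taken from the linked document.
text = (
    "Date: 2020-01-01\n\n"
    "URL: http://example.com\n\n"
    "Summary:\n\n  indented paragraph one\n\n  indented paragraph two\n\n"
    "Keywords: a, b"
)

# The negated class consumes the first character of each following field.
lossy = re.split(r"\n\n[^\s]", text)

# The lookahead only checks that character, so it stays in the field.
fields = re.split(r"\n{2,}(?=\S)", text)

print(lossy[1])
print(fields)
```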
You want a lookahead. You also might want it to be more flexible as far as how many newlines / what newline characters. You might try this:
import re
r = re.compile(r"""(\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
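though this does insert \r\n strings into the list, because re.split also returns the text matched by any capturing group in the pattern. A non-capturing group, (?:\r\n|\r|\n)+, avoids that.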
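A short sketch of the difference between a capturing and a non-capturing group in `re.split`, using a made-up three-part string. Note that this pattern uses `+`, so unlike the `\n{2,}` version it also splits on single line breaks.

```python
import re

text = "a\r\n\r\nb\n\nc"

# Capturing group: re.split adds the last captured separator to the result.
with_group = re.split(r"(\r\n|\r|\n)+(?=\S)", text)

# Non-capturing group: only the pieces between separators are returned.
parts = re.split(r"(?:\r\n|\r|\n)+(?=\S)", text)

print(with_group)
print(parts)
```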

how to replace markdown tags into html by python?

I want to replace some "markdown" tags into html tags.
for example:
#Title1#
##title2##
Welcome to **My Home Page**
will be turned into
<h1>Title1</h1>
<h2>title2</h2>
Welcome to <b>My Home Page</b>
I just don't know how to do that. For Title1, I tried this:
#!/usr/bin/env python3
import re
text = '''
#Title1#
##title2##
'''
p = re.compile('^#\w*#\n$')
print(p.sub('<h1>\w*</h1>',text))
but nothing happens..
#Title1#
##title2##
How can such bbcode/markdown markup be converted into HTML tags?
Check this regex: demo
Here you can see how I substituted the #...# into <h1>...</h1>.
I believe you can get this to work with double # and so on to get other markdown features considered, but still you should listen to @Thomas's and @nhahtdh's comments and use a markdown parser. Using regexes in such cases is unreliable, slow and unsafe.
As for inline text like **...** to <b>...</b> you can try this regex with substitution: demo. Hope you can tweak this for other features like underlining and so on.
Your regular expression does not work because in the default mode, ^ and $ (respectively) match the beginning and the end of the whole string.
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline (my emph.)
'$'
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
(7.2.1. Regular Expression Syntax)
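The examples from the quoted documentation can be checked directly. This is just a small sketch of those three cases:

```python
import re

sample = "foo1\nfoo2\n"

# Default mode: $ only matches at the very end (or before a final newline).
default_match = re.search(r"foo.$", sample).group()

# MULTILINE mode: $ also matches before every newline.
multiline_match = re.search(r"foo.$", sample, re.MULTILINE).group()

# A lone $ finds two empty matches in 'foo\n'.
empty_matches = re.findall(r"$", "foo\n")

print(default_match, multiline_match, empty_matches)
```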
Add the flag re.MULTILINE in your compile line, and drop the \n before $, because in MULTILINE mode $ already matches just before each newline:
p = re.compile(r'^#(\w*)#$', re.MULTILINE)
and it should work, at least for single words, such as your example. A better check would be
p = re.compile(r'^#([^#\n]*)#$', re.MULTILINE)
which accepts any sequence that does not contain a # or a newline.
In both expressions, you need to add parentheses around the part you want to copy so you can use that text in your replacement code, for example as \1 in p.sub(r'<h1>\1</h1>', text). See the official documentation on Grouping for that.
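Putting the pieces together, a minimal sketch for the three constructs in the question might look like the following. The exact patterns are assumptions of this example; they cover only `#...#`, `##...##` and `**...**`, and a real markdown parser remains the safer choice.

```python
import re

text = "#Title1#\n##title2##\nWelcome to **My Home Page**\n"

# Handle the double-# heading first, then the single-# heading.
html = re.sub(r"^##([^#\n]+)##$", r"<h2>\1</h2>", text, flags=re.MULTILINE)
html = re.sub(r"^#([^#\n]+)#$", r"<h1>\1</h1>", html, flags=re.MULTILINE)

# Inline bold: lazy match between two pairs of asterisks.
html = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", html)

print(html)
```

The group in each pattern is what `\1` refers to in the replacement string.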

Regex, not statement

Heyho,
I have the regex
([ ;(\{\}),\[\'\"]?)(_[a-zA-Z_\-0-9]*)([ =;\/*\-+\]\"\'\}\{,]?)
to match every occurrence of
_var
Problem is that it also matches strings like
test_var
I tried to add a new matching group negating any word character but it didn't worked properly.
Can someone figure out what I have to do to not match strings like var_var?
Thanks for help!
You can use the following "fix":
([[ ;(){},'"]?)(\b_[a-zA-Z_0-9-]*\b)([] =;/*+"'{},-]?)
See regex demo
The word boundary \b is an anchor that asserts the position between a word character and a non-word character (or the start or end of the string). That means your _var will never match if preceded by a letter, a digit, or an underscore. Also, I removed overescaping inside the character classes in the optional capturing groups. Note the so-called "smart placement" of hyphens and square brackets that for a Python regex might be not that important, but is still a best practice in writing regexes. Also, in Python regex you don't need to escape / since there are no regex delimiters there.
And one more hint: in Python 2 without the re.UNICODE flag (or in Python 3 with re.ASCII), \w matches [a-zA-Z0-9_], so you can write the regex as
([[ ;(){},'"]?)(\b_[\w-]*\b)([] =;/*+"'{},-]?)
See regex demo 2.
And an IDEONE demo (note the use of r'...'):
import re
p = re.compile(r'([[ ;(){},\'"]?)(\b_[\w-]*\b)([] =;/*+"\'{},-]?)')
test_str = "Some text _var and test_var"
print (re.findall(p, test_str))
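To isolate what the word boundary contributes, here is a reduced sketch without the optional surrounding groups. The test string and the negative-lookbehind variant are additions of this example; the lookbehind is a different technique that happens to give the same result on this input.

```python
import re

code = "x = _var; test_var(_other-name)"

# \b before "_" requires that the previous character is not a word character.
with_boundary = re.findall(r"\b_[\w-]*\b", code)

# Alternative (not from the answer): a negative lookbehind for a word character.
with_lookbehind = re.findall(r"(?<!\w)_[\w-]*", code)

print(with_boundary)
print(with_lookbehind)
```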

replace some part of a word with regex

how do you delete text inside <ref> *some text*</ref> together with ref itself?
in '...and so on<ref>Oxford University Press</ref>.'
re.sub(r'<ref>.+</ref>', '', string) only removes <ref> if
<ref> is followed by a whitespace
EDIT: it has something to do with word boundaries, I guess... or?
EDIT2: What I need is for it to match the last (closing) </ref> even if it is on a new line.
I don't really see your problem, because the code pasted will remove the <ref>...</ref> part of the string. But if what you mean is that an empty ref tag is not removed:
re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')
Then what you need to do is change the .+ with .*
A + means one or more, while * means zero or more.
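A two-line check of that difference on the empty tag (a sketch; the variable names are made up):

```python
import re

s = "...and so on<ref></ref>."

# .+ needs at least one character between the tags, so nothing is removed.
with_plus = re.sub(r"<ref>.+</ref>", "", s)

# .* also accepts an empty tag body.
with_star = re.sub(r"<ref>.*</ref>", "", s)

print(with_plus)
print(with_star)
```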
From http://docs.python.org/library/re.html:
'.' (Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character including
a newline.
'*' Causes the resulting RE to match 0 or more repetitions of the preceding
RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
followed by any number of ‘b’s.
'+' Causes the resulting RE to match 1 or more repetitions of the preceding
RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
not match just ‘a’.
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
ab? will match either ‘a’ or ‘ab’.
You could make a fancy regex to do just what you intend, but you need to use DOTALL and non-greedy search, and you need to understand how regexes work in general, which you don't.
Your best option is to use string methods rather than regexes, which is more pythonic anyway:
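# Repeatedly cut out everything from an opening <ref> to the next </ref>.
while '<ref>' in string:
    begin, end = string.split('<ref>', 1)
    # Discard the tag body; keep what follows the closing tag.
    trash, end = end.split('</ref>', 1)
    string = begin + end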
If you want to be very generic, allowing strange capitalization of the tags or whitespaces and properties in the tags, you shouldn't do this either, but invest in learning a html/xml parsing library. lxml currently seems to be widely recommended and well-supported.
You might want to be cautious not to remove a whole lot of text just because there is more than one closing </ref>. The regex below would be more accurate in my opinion:
r'<ref>[^<]*</ref>'
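This prevents the 'greedy' match from running past the first closing tag, since [^<]* cannot cross a <. It also matches newlines inside the tag.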
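The following sketch contrasts the greedy pattern from the question with two alternatives on made-up input: the negated class above, and a lazy `.*?` with `re.DOTALL` (the latter is an assumption of this example, one common way to let the dot cross newlines).

```python
import re

one_line = "a<ref>SDD</ref>b<ref>XX</ref>c"
multi_line = "a<ref>SDD</ref>b<ref>X\nX</ref>c"

# Greedy .+ runs to the last </ref> and removes the "b" in between.
greedy = re.sub(r"<ref>.+</ref>", "", one_line)

# Negated class: cannot cross "<", but does match newlines.
negated = re.sub(r"<ref>[^<]*</ref>", "", multi_line)

# Lazy match with DOTALL: stops at the first </ref>, even across lines.
lazy_dotall = re.sub(r"<ref>.*?</ref>", "", multi_line, flags=re.DOTALL)

print(greedy, negated, lazy_dotall)
```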
BTW: There is a great tool called The Regex Coach to analyze and test your regexes. You can find it at: http://www.weitz.de/regex-coach/
edit: forgot to add code tag in the first paragraph.
If you try to do this with regular expressions you're in for a world of trouble. You're effectively trying to parse something but your parser isn't up to the task.
Matching greedily across strings probably eats up too much, as in this example:
<ref>SDD</ref>...<ref>XX</ref>
You'd end up cleaning up the entire middle.
You really want a parser, something like Beautiful Soup.
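# Beautiful Soup 4 (third-party package: beautifulsoup4)
from bs4 import BeautifulSoup
s = "<a>sfsdf</a> <ref>XX</ref> || <ref>YY</ref>"
soup = BeautifulSoup(s, "html.parser")
# Replace every <ref> element, including its contents, with a placeholder.
for ref in soup.find_all("ref"):
    ref.replace_with('!')
print(soup)  # <a>sfsdf</a> ! || !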
