regex regular expression python - python

I'm having problems with this method in python called findall. I'm accessing a web pages HTML and trying to return the name of a product in this case 'bread' and print it out to the console.

Don't use regex for HTML parsing.
There are a few solutions. I suggest BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/)
Having said so, however, in this particular case, RE will suffice. Just relax it a notch. There might be more or less spaces or maybe those are tabs. So instead of literal spaces use the space class \s:
product = re.findall(r'Item:\s*is\s*in\s*lane\s*12\s*(\w*)', content)
print product[0]
Since The '*', '+', and '?' qualifiers are all greedy (they match as much text as possible) you don't need to restrict it with [^<]*<br>

In case you still want to use regexps, here's a working one for your case:
product = re.findall(r'<br>\s*Item:\s+is\s+in\s+lane 12\s+(\w*)[^<]*<br>', content)
It takes into account DSM's space flexibility suggestion and non-letters after (\w*) that might appear before <br>.

Related

How do I extract definitions from a html file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I do have so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fname = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy which means they want to match as many characters as possible. You can use ? after quantifier to make it lazy which matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed with a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
See the Python demo and a regex demo.
You want a lookahead. You also might want it to be more flexible as far as how many newlines / what newline characters. You might try this:
import re
r = re.compile(r"""(\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
though this does seem to insert \r\n characters into the list... Hmm.

findall() behaviour (python 2.7)

Suppose I have the following string:
"<p>Hello</p>NOT<p>World</p>"
and i want to extract the words Hello and World
I created the following script for the job
#!/usr/bin/env python
import re
string = "<p>Hello</p>NOT<p>World</p>"
match = re.findall(r"(<p>[\w\W]+</p>)", string)
print match
I'm not particularly interested in stripping < p> and < /p> so I never bothered doing it within the script.
The interpreter prints
['<p>Hello</p>NOT<p>World</p>']
so it obviously sees the first < p> and the last < /p> while disregarding the in between tags. Shouldn't findall() return all three sets of matching strings though? (the string it prints, and the two words).
And if it shouldn't, how can i alter the code to do so?
PS: This is for a project and I found an alternative way to do what i needed to, so this is for educational reasons I guess.
The reason that you get the entire contents in a single match is because [\w\W]+ will match as many things as it can (including all of your <p> and </p> tags). To prevent this, you want to use the non-greedy version by appending a ?.
match = re.findall(r"(<p>[\w\W]+?</p>)", string)
# ['<p>Hello</p>', '<p>World</p>']
From the documentation:
*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.
If you don't want the <p> and </p> tags in the result, you will want to use look-ahead and look behind assertions to not include them in the result.
match = re.findall(r"((?<=<p>)\w+?(?=</p>))", string)
# ['Hello', 'World']
As a side note though, if you are trying to parse HTML or XML with regular expressions, it is preferable to use a library such as BeautifulSoup which is intended for parsing HTML.

Comments in string and strings in comments

I am trying to count characters in comments included in C code using Python and Regex, but no success. I can erase strings first to get rid of comments in strings, but this will erase string in comments too and result will be bad ofc. Is there any chance to ask by using regex to not match strings in comments or vice versa?
No, not really.
Regex is not the correct tool to parse nested structures like you describe; instead you will need to parse the C syntax (or the "dumb subset" of it you're interested in, anyway), and you might find regex helpful in that. A relatively simple state machine with three states (CODE, STRING, COMMENT) would do it.
Regular expressions are not always a replacement for a real parser.
You can strip out all strings that aren't in comments by searching for the regular expression:
'[^'\r\n]+'|(//.*|/\*(?s:.*?)\*/)
and replacing with:
$1
Essentially, this searches for the regex string|(comment) which matches a string or a comment, capturing the comment. The replacement is either nothing if a string was matched or the comment if a comment was matched.
Though regular expressions are not a replacement for a real parser you can quickly build a rudimentary parser by creating a giant regex that alternates all of the tokens you're interested in (comments and strings in this case). If you're writing a bit of code to handle comments, but not those in strings, iterate over all the matches of the above regex, and count the characters in the first capturing group if it participated in the match.

Regex matching very slow

I am trying to parse a PDF to extract the text from it (please don't suggest any libraries to do this, as this is part of learning the format).
I have already handled deflating it to put it in the alphanumeric format. I now need to extract the text from the text blocks.
So, my current pattern is BT.*?\((.*?)\).*?ET (with DOTMATCHALL set) to match something like:
BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET
The only bit I want is the text ABC in the brackets.
The above is only formatted like that to make it clear to see. In the deflated text it may be all in one line, it may not be. There is no gurantee that the BT/ET will be at the start of a line. There may be spaces and text before/after the bracketed section, there may not be. There will however, be only one bracketed section per BT/ET block.
The above pattern works, but is really slow, I assume it is because the regex library is failing to match the pattern that matches the text between BT and the (ABC) many times.
The regex is pre-compiled in an attempt to speed it up, but it seems negligible.
How may I speed this up?
How many of these blocks might appear in a document?
Often slow Regex execution is the result of catastrophic backtracking, as described here: http://www.regular-expressions.info/catastrophic.html
I don't know what regex technology you're using, but you could try to use lookaround assertions, as described here:
http://www.regular-expressions.info/lookaround.html
These allow you to first just match what you want, ABC inside parentheses, and then validate that it is preceded by some value and followed by some other value.
Are you sure the regex is correct and pulls out ABC as a match? What language's regex engine is this? Using my regular expression debugger shows that:
"BT.*?((.*?)).*?ET" doesn't pull out ABC and in fact must find the string 'ET' then backtrack back to find everything else.
"BT.*?\\((.*?)\\).*?ET" works as expected with a single pass left to right.
here's one without regex. simple string parsing using Python internals.
>>> xtract="""
... BT
... /F13 12 Tf
... 288 720 Td
... (ABC) Tj
... ET
...
... """
>>> for chunk in xtract.split("ET"):
... if "BT" in chunk:
... for brace in chunk.split(")"):
... if "(" in brace:
... print brace[brace.find("(")+1:]
...
ABC
You can't just parse the PDF with a regex to extract the text. In most cases the text in inside compressed binary blobs or encoded. A PDF with the text shown like this is very much the exception.
There's not really enough info for a definite answer--or maybe you're assuming we know more about PDF than you do. Are there always parenthesized chunks inside these BT...ET sections? Is there always only one of them? Is the BT or ET always at the beginning of a line? If so, I would suggest
(?m)^BT[^()]*\((.*?)\)[^()]*?^ET
If I knew how PDF represented literal parentheses, I could probably come up with something more efficient.
EDIT: According to the PDF spec, literal parentheses have to be escaped with a backslash, and there are a bunch of other backslash-escape sequences. So try this:
(?s)\bBT\b[^()]*\(((?:[^()\\]*(?:\\.[^()\\]*)*))\)
This part--[^()\\]*(?:\\.[^()\\]*)*--matches a block of text which may contain escaped characters (including parens), but not unescaped parens. I know it looks ugly, but it's the most efficient way, since Python doesn't support atomic groups or possessive quantifiers.
(?s) allows . to match newlines, and \bBT\b makes sure the BT isn't part of a longer "word". I'm reasonably confident that this is all I need to match all of the actual text content, so I don't bother matching the stuff after the closing paren.
Since there will be only one bracketed expression between a BT and an ET, you could try the following regular expression for speed:
r"(?s)\bBT\b[^(]*\(([^)]*)\).*?\bET\b"

Categories