How to grab html using regex - python

#<link rel='canonical' href='http://www.samplewebsite.com/image/5434553/' />
#I am trying to grab the text in href
image = str(Soup)
image_re = re.compile('\<link rel=\'cononical\' href=')
image_pat = re.findall(image_re, image)
print image_pat
#>> []
#Thanks!

Edit: This uses the BeautifulSoup package, which I thought I saw in the previous version of this question.
Edit: More straightforward is this:
soup = BeautifulSoup(document)
links = soup.findAll('link', rel='canonical')
for link in links:
    print link['href']
Instead of all that, you can use:
soup = BeautifulSoup(document)
links = soup("link")
for link in links:
    if link.get("rel") == "canonical":
        print link["href"]

Use two regular expressions:
import re
link_tag_re = re.compile(r'<link[^>]*>')
# capture all link tags in your text with it. Then for each of those, use:
href_capture = re.compile(r'href\s*=\s*(\'[^\']*\'|"[^"]*")')
The first regex will capture the entire <link> tag; the second one will look for href="something" or href='something'.
In general, though, you should use a real HTML parser for this, even though this particular problem is a regular language that regexes can handle. Parsers are far simpler to use for this sort of thing and less likely to cause you problems.
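Put together, the two patterns might be used like this (a minimal sketch; the sample tag is taken from the question):
import re

link_tag_re = re.compile(r'<link[^>]*>')
href_capture = re.compile(r'href\s*=\s*(\'[^\']*\'|"[^"]*")')

document = "<link rel='canonical' href='http://www.samplewebsite.com/image/5434553/' />"
for tag in link_tag_re.findall(document):
    for quoted in href_capture.findall(tag):
        print(quoted.strip('\'"'))  # drop the surrounding quotes
# http://www.samplewebsite.com/image/5434553/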

You're better off using a proper HTML parser on the data, but if you really want to go down this route then the following will do it:
>>> data = "... <link rel='canonical' href='http://www.samplewebsite.com/image/5434553/' /> ..."
>>>
>>> re.search("<link[^>]+?rel='canonical'[^>]+?href='([^']+)", data).group(1)
'http://www.samplewebsite.com/image/5434553/'
>>>
I also notice that your HTML uses single quotes rather than double quotes.

You should use an HTML parser such as lxml.html or BeautifulSoup. But if you only want to grab the href of a single link, you could use a simple regex too:
re.findall(r"href=(['\"])([^\1]*)\1", url)

This would be the regex to match the example html you've given:
<link rel='canonical' href='(\S+)'
But I'm not sure regex is the right tool: this regex will fail if the attribute values use double quotes (or no quotes), or if rel and href are swapped.
I'd recommend using something like BeautifulSoup to find and collect all rel canonical href values.
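For example, a minimal sketch with modern bs4 (note that bs4 treats rel as a multi-valued attribute, but matching it against a plain string still works):
from bs4 import BeautifulSoup

html = "<link rel='canonical' href='http://www.samplewebsite.com/image/5434553/' />"
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('link', rel='canonical'):
    print(link['href'])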

Related

BeautifulSoup find partial string in section

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, so the string argument can't find it? An example of a section on the page that I am able to find has html like the one below, and I'm easily able to find it using the string argument and re.compile with "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking this. Just remove the regular expression part and take the text, and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want. Like so:
soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression matches any text that has "sentinel" in it. Be careful: you may also have to match surrounding characters like spaces, which is why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
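Alternatively, the same partial match can be expressed with a lambda instead of a regex (a sketch; it assumes the filename text sits anywhere inside the section):
result = soup.find(
    lambda tag: tag.name == 'section'
    and 'habonecol' in tag.get('class', [])
    and 'CIcyano' in tag.get_text()
)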
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class':'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break

links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.

Change ALL html tags to symbol using python

What I want to do is change every tag (whether it's <a href=> or <title> or </title> or </div>... etc.) to a symbol.
I tried using beautiful soup but it only finds tags that I define...
I found some code in HTMLParser.py:
tagfind = re.compile('([a-zA-Z][^\t\n\r\f />\x00]*)(?:\s|/(?!>))*')
I believe this is what I'm looking for, I just don't know how to use it properly.
Also I figured I could use the:
handle_starttag(self, tag, attrs):
But I don't want to define the tag, I just want the script to find every single tag and change it to something...
Is this possible?
Thank you for all of your help!!
A much more reliable way is to recursively visit each tag. I just change the name in the example below, but you can do whatever you want once you have the tag:
from bs4 import BeautifulSoup, element

def visit(s):
    if isinstance(s, element.Tag):
        has_children = s.find_all()
        if has_children:
            s.name = "foobar"
            for child in s:
                visit(child)
        else:
            s.name = "foobar"
To use it:
soup = BeautifulSoup(...)
visit(soup)
Then any changes will be reflected in the soup.
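For example, with a small made-up fragment:
soup = BeautifulSoup("<div><p>stuff</p></div>", "html.parser")
visit(soup)
print(soup)  # <foobar><foobar>stuff</foobar></foobar>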
BeautifulSoup isn't a good idea here - that's designed for parsing HTML, not editing it.
Also, that regex doesn't seem like a very good one (it only matches the content inside a tag rather than the whole tag itself), so I found a different one better suited to your purposes:
</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
This regex will match anything like the following:
<h1>
</h1>
<img src="foo.com/image.png">
We can use this for replacing all tags by using re.sub. This finds all matches for a certain regex and replaces them with something else. Here's how you'd use it for what you want to do:
import re
html_regex = r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
html = "<h1>Foo</h1>"
print(re.sub(html_regex, "#", html))
This would print:
#Foo#

replacing html tags with BeautifulSoup

I'm currently reformatting some HTML pages with BeautifulSoup, and I ran into bit of a problem.
My problem is that the original HTML has things like this:
<li><p>stff</p></li>
and
<li><div><p>Stuff</p></div></li>
as well as
<li><div><p><strong>stff</strong></p></div><li>
With BeautifulSoup I hope to eliminate the div and the p tags, if they exist, but keep the strong tag.
I've looked through the Beautiful Soup documentation and couldn't find anything.
Ideas?
Thanks.
This question probably referred to an older version of BeautifulSoup, because with bs4 you can simply use the unwrap function:
s = BeautifulSoup('<li><div><p><strong>stff</strong></p></div><li>')
s.div.unwrap()
>> <div></div>
s.p.unwrap()
>> <p></p>
s
>> <html><body><li><strong>stff</strong></li><li></li></body></html>
What you want to do can be done using replaceWith. You have to duplicate the element you want to use as the replacement, and then feed that as the argument to replaceWith. The documentation for replaceWith is pretty clear on how to do this.
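A sketch of that approach with today's bs4, where replaceWith is spelled replace_with (the input comes from the question, with the closing </li> fixed):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li><div><p><strong>stff</strong></p></div></li>', 'html.parser')
strong = soup.find('strong').extract()  # detach the element we want to keep
soup.li.div.replace_with(strong)        # swap the wrapper for it
print(soup)  # <li><strong>stff</strong></li>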
I saw many answers for this simple question, and I also came here to find something useful, but unfortunately I didn't get what I was looking for. After a few tries I found a simple solution, and here it is:
soup = BeautifulSoup(htmlData, "html.parser")
h2_headers = soup.find_all("h2")
for header in h2_headers:
    header.name = "h1"  # replaces h2 tag with h1
All h2 tags converted to h1. You can convert any tag by just changing the name.
You can write your own function to strip tags:
import re

def strip_tags(string):
    return re.sub(r'<.*?>', '', string)

strip_tags("<li><div><p><strong>stff</strong></p></div><li>")
# 'stff'
A simple solution: get your whole node (meaning the div), then:
Convert it to a string.
Replace the opening <tag> with the required tag/string.
Replace the corresponding closing tag with an empty string.
Convert the resulting string back into a parsable string by passing it to BeautifulSoup.
Here is what I have done for mine.
Example:
<div class="col-md-12 option" itemprop="text">
<span class="label label-info">A</span>
**-2<sup>31</sup> to 2<sup>31</sup>-1**
sup = opt.sup
if sup: //opt has sup tag then
//opts converted to string.
opt = str(opts).replace("<sup>","^").replace("</sup>","") //replacing
//again converted from string to beautiful string.
s = BeautifulSoup(opt, 'lxml')
//resign to required variable after manipulation
opts = s.find("div", class_="col-md-12 option")
Output:
-2^31 to 2^31-1
Without the manipulation it would look like this: -231 to 231-1.

Getting the value of href attributes in all <a> tags on a html file with Python

I'm building an app in python, and I need to get the URL of all links in one webpage. I already have a function that uses urllib to download the html file from the web, and transform it to a list of strings with readlines().
Currently I have this code that uses regex (I'm not very good at it) to search for links in every line:
for line in lines:
    result = re.match('/href="(.*)"/iU', line)
    print result
This is not working, as it only prints "None" for every line in the file, but I'm sure that at least there are 3 links on the file I'm opening.
Can someone give me a hint on this?
Thanks in advance
Beautiful Soup can do this almost trivially:
from BeautifulSoup import BeautifulSoup as soup
html = soup('<body><a href="123">qwe</a><a href="456">asd</a></body>')
print [tag.attrMap['href'] for tag in html.findAll('a', {'href': True})]
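The same thing in modern bs4 would read (a sketch):
from bs4 import BeautifulSoup

html = BeautifulSoup('<body><a href="123">qwe</a><a href="456">asd</a></body>', 'html.parser')
print([tag['href'] for tag in html.find_all('a', href=True)])
# ['123', '456']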
Another alternative to BeautifulSoup is lxml (http://lxml.de/):
import lxml.html

links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print link
There's an HTML parser that comes standard in Python. Check out htmllib.
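htmllib is the Python 2-era module; with today's stdlib the same idea looks like this (a sketch using html.parser):
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)

LinkParser().feed('<a href="http://example.com/">example</a>')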
As previously mentioned: regex does not have the power to parse HTML. Do not use regex for parsing HTML. Do not pass Go. Do not collect £200.
Use an HTML parser.
But for completeness, the primary problem is:
re.match ('/href="(.*)"/iU', line)
You don't use the “/.../flags” syntax for decorating regexes in Python. Instead put the flags in a separate argument:
re.match('href="(.*)"', line, re.I|re.U)
Another problem is the greedy ‘.*’ pattern. If you have two hrefs in a line, it'll happily suck up all the content between the opening " of the first match and the closing " of the second match. You can use the non-greedy ‘.*?’ or, more simply, ‘[^"]*’ to only match up to the first closing quote.
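A quick demonstration of the difference (made-up one-line input):
import re

line = '<a href="first.html">one</a> <a href="second.html">two</a>'
print(re.findall(r'href="(.*)"', line))     # ['first.html">one</a> <a href="second.html']
print(re.findall(r'href="([^"]*)"', line))  # ['first.html', 'second.html']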
But don't use regexes for parsing HTML. Really.
What others haven't told you is that using regular expressions for this is not a reliable solution.
Using regular expressions will give you wrong results in many situations: if there are <A> tags that are commented out, if there is text in the page that includes the string "href=", or if there are <textarea> elements with HTML code in them, among many others. Plus, the href attribute may exist on tags other than the anchor tag.
What you need for this is XPath, which is a query language for DOM trees, i.e. it lets you retrieve any set of nodes satisfying the conditions you specify (HTML attributes are nodes in the DOM).
XPath is a well-standardized language these days (W3C), and is well supported by all major languages. I strongly suggest you use XPath and not regexps for this.
adw's answer shows one example of using XPath for your particular case.
Don't divide the HTML content into lines, as there may be multiple matches in a single line. Also don't assume there are always quotes around the URL.
Do something like this:
links = re.finditer(' href="?([^\s^"]+)', content)
for link in links:
    print link.group(1)
Well, just for completeness I will add here what I found to be the best answer, and I found it in the book Dive Into Python, by Mark Pilgrim.
Here follows the code to list all URL's from a webpage:
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

import urllib
usock = urllib.urlopen("http://diveintopython.net/")
parser = URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls:
    print url
Thanks for all the replies.

Python and web-tags regex

I have a webpage's content and need to get some data from it. It looks like:
<div class="deg">DATA</div>
As I understand it, I have to use a regex, but I can't work out which one. I tried the code below but got no results. Please correct me:
regexHandler = re.compile('(<div class="deg">(?P<div class="deg">.*?)</div>)')
result = regexHandler.search( pageData )
I suggest using a good HTML parser (such as BeautifulSoup -- but for your purposes, i.e. with well-formed HTML as input, the ones that come with the Python standard library, such as HTMLParser, should also work well) rather than raw REs to parse HTML.
If you want to persist with the raw RE approach, the pattern:
r'<div class="deg">([^<]*)</div>'
looks like the simplest way to get the string 'DATA' out of the string '<div class="deg">DATA</div>' -- assuming that's what you're after. You may need to add one or more \s* in spots where you need to tolerate optional whitespace.
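For instance, on the question's sample (a sketch with optional whitespace tolerated):
import re

pageData = '<div class="deg"> DATA </div>'
m = re.search(r'<div class="deg">\s*([^<]*?)\s*</div>', pageData)
if m:
    print(m.group(1))  # DATA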
If you want the div tags included in the matched item:
regexpHandler = re.compile('(<div class="deg">.*?</div>)')
If you don't want the div tags included, only the DATA portion:
regexpHandler = re.compile('<div class="deg">(.*?)</div>')
Then to run the match and get the result:
result = regexpHandler.search(pageData)
matchedText = result.groups()[0]
You can use simple string functions in Python; no need for regex:
mystr = """<div class="deg">DATA</div>"""
if "div" in mystr and "class" in mystr and "deg" in mystr:
    s = mystr.split(">")
    for n, item in enumerate(s):
        if "deg" in item:
            print s[n+1][:s[n+1].index("<")]
My approach: get something to split on. E.g. in the above, I split on ">". Then go through the split items, check for "deg", and take the item after it, since "deg" appears just before the data you want to get. Of course, this is not the only approach.
While it is OK to use regex for quick and dirty HTML processing, a much better and cleaner way is to use an HTML parser like lxml.html and to query the parsed tree with XPath or CSS selectors.
html = """<html><body><div class="deg">DATA1</div><div class="deg">DATA2</div></body></html>"""
import lxml.html
page = lxml.html.fromstring(html)
#page = lxml.html.parse(url)
for element in page.findall('.//div[#class="deg"]'):
print element.text
#using css selectors
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.deg")
for element in sel(page):
print element.text
