I have tried using regex, but read around and got directed to Beautiful Soup...
I've kind of figured out how to get URLs in HTML tags with soup, but how would I grab URLs from both HTML tags (href=*) and the body text of the page?
Also, for grabbing the ones in tags, how do I specify that I only want URLs starting with http://, https://...?
Thanks in advance!
First, look at parsing-html-in-python-lxml-or-beautifulsoup. I read it and never looked back at the soup, I guess because I find lxml so easy. I am sure there are different ways to do what you asked, and perhaps there are easier ones, but I'll show what I use.
In lxml you can use XPath; it's like using regex for XML/HTML. The code below will find all "a" tags that have an "href" attribute and print every link that starts with "http". This should help you get started on your parsing.
from lxml import etree

# parse the file with the lenient HTML parser
tree = etree.parse("my.html", etree.HTMLParser())
root = tree.getroot()
# find every <a> tag anywhere in the tree that has an href attribute
links = root.findall('.//a[@href]')
for link in links:
    if link.get("href").startswith("http"):
        print(link.get("href"))
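If you prefer full XPath, lxml can also pull the attribute values out directly. A small variation on the same idea, not the only way to do it:

# //a[@href]/@href returns the attribute values themselves as strings
for href in root.xpath('//a[@href]/@href'):
    if href.startswith("http"):
        print(href)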
HTML has a concept of empty elements, as listed on MDN. However, Beautiful Soup doesn't seem to handle them properly:
import bs4

soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'html.parser'
)
print(soup.contents)
I get:
[<div><input name="the-input"><label for="the-input">My label</label></input></div>]
I.e. the input has wrapped the label.
Question: Is there any way to get Beautiful Soup to parse this properly? Or is there an official explanation of this behaviour somewhere I haven't found yet?
At the very least I'd expect something like:
[<div><input name="the-input"></input><label for="the-input">My label</label></div>]
I.e. the input automatically closed before the label.
As stated in their documentation, html5lib parses the document the same way a web browser does (like lxml does in this case). It'll try to fix your document tree by adding or closing tags when needed.
In your example I've used lxml as the parser and it gave the following result:
soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'lxml'
)
print(soup.body.contents)
[<div><input name="the-input"/><label for="the-input">My label</label></div>]
Note that lxml added html & body tags because they weren't present in the source; that is why I've printed the body contents.
I would say soup is doing what it can to fix this HTML structure, and that is actually helpful on some occasions.
Anyway, for your case I would say to use lxml, which will parse the HTML structure the way you want, or maybe give parsel a try.
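For completeness, here's a sketch of the same snippet with the html5lib parser (this assumes the html5lib package is installed; I'd expect browser-like handling of the void <input> element, plus the usual html/head/body wrappers):

import bs4

soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'html5lib'
)
# <input> is a void element, so the label should come out as a sibling
# of the input rather than a child
print(soup.body.contents)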
I'm starting to make progress on a website scraper, but I've run into two snags. Here is the code first:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.nytimes.com")
soup = BeautifulSoup(r.text)
headlines = soup.find_all(class_="story-heading")
for headline in headlines:
    print(headline)
Questions
Why do you have to use find_all(class_="blahblahblah") instead of just find_all("blahblahblah")? I realize that story-heading is a class of its own, but can't I just search all the HTML using find_all and get the same results? The notes for BeautifulSoup show find_all("a") returning all the anchor tags in an HTML document, so why won't find_all("story-heading") do the same?
Is it because, if I try to do that, it will just find all the instances of "story-heading" within the HTML and return those? I am trying to get Python to return everything in that tag. That's my best guess.
Why do I get all this extra junk code? Shouldn't my find_all request just show me everything within the story-heading tag? I'm getting a lot more text than what I am trying to specify.
Beautiful Soup allows you to use CSS selectors; look in the docs for "CSS selector". Note that CSS selectors go through the select() method, not find_all().
You can find all elements with class "story-heading" like so:
soup.select(".story-heading")
If instead you're looking for an id, just do:
soup.select("#id-name")
I'm using cygwin and do not have BeautifulSoup installed.
If you don't care much about performance you can use regular expressions:
import re
linkre = re.compile(r"""href=["']([^"']+)["']""")
links = linkre.findall(your_html)
If you only want links that start with http:// or https://, change the expression to:
linkre = re.compile(r"""href=["'](https?://[^"']+)["']""")
Or you can make the quotes optional if, by some chance, you have HTML without quotes around the links.
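A sketch of that variant (the example markup is made up for illustration):

import re

# quotes optional; an unquoted value ends at whitespace, a quote, or '>'
linkre = re.compile(r'''href=["']?([^"'\s>]+)''')
print(linkre.findall('<a href=http://example.com/a> <a href="http://example.com/b">'))
# ['http://example.com/a', 'http://example.com/b']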
I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.
To be clear:
Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    # crude regex approach: strip anything that looks like a tag
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # parser approach: collect every text node Beautiful Soup finds
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
Related
Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
RegEx match open tags except XHTML self-contained tags (the famous "don't use regex to parse HTML" rant)
Parse the HTML with Beautiful Soup.
To get all the text, without the tags, try:
''.join(soup.findAll(text=True))
Personally, I use lxml because it's a Swiss Army knife...
from lxml import html

print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())
This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.
I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
You want to look at Extracting data from HTML documents - Dive Into Python, because it does (almost) exactly what you want.
The best modules for this task are lxml or html5lib; Beautiful Soup is, imho, not worth using anymore. And for recursive structures like HTML, regular expressions are definitely the wrong method.
If I am understanding your question correctly, this can simply be done using the urlopen function of urllib. Just have a look at that function: it opens a URL and reads the response, which will be the HTML code of that page.
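A minimal sketch, in the Python 2 style the rest of this thread uses (in Python 3 the equivalent is urllib.request.urlopen):

import urllib

usock = urllib.urlopen("http://example.com")
page = usock.read()   # the raw HTML of the page, as a string
usock.close()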
The quickest way to get a usable sample of what a browser would display is to remove all tags from the HTML and print the rest. This can, for example, be done with Python's re.
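For instance, a crude sketch (good enough for a quick sample, not for robust HTML handling; "my.html" stands in for the downloaded file):

import re

page = open('my.html').read()
text = re.sub(r'<[^>]+>', ' ', page)   # drop anything that looks like a tag
print(' '.join(text.split()))          # collapse the leftover whitespace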
I'm building an app in Python, and I need to get the URLs of all links in one webpage. I already have a function that uses urllib to download the HTML file from the web and transform it into a list of strings with readlines().
Currently I have this code that uses regex (I'm not very good at it) to search for links in every line:
for line in lines:
    result = re.match ('/href="(.*)"/iU', line)
    print(result)
This is not working, as it only prints "None" for every line in the file, but I'm sure that at least there are 3 links on the file I'm opening.
Can someone give me a hint on this?
Thanks in advance
Beautiful Soup can do this almost trivially:
from BeautifulSoup import BeautifulSoup

# a tiny document with one link, just so there is something to find
html = BeautifulSoup('<body><a href="http://example.com">a link</a></body>')
print([tag['href'] for tag in html.findAll('a', href=True)])
Another alternative to BeautifulSoup is lxml (http://lxml.de/):
import lxml.html

links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print(link)
There's an HTML parser that comes standard in Python. Check out htmllib.
As previously mentioned: regex does not have the power to parse HTML. Do not use regex for parsing HTML. Do not pass Go. Do not collect £200.
Use an HTML parser.
But for completeness, the primary problem is:
re.match ('/href="(.*)"/iU', line)
You don't use the “/.../flags” syntax for decorating regexes in Python. Instead put the flags in a separate argument:
re.match('href="(.*)"', line, re.I|re.U)
Another problem is the greedy ‘.*’ pattern. If you have two hrefs in a line, it'll happily suck up all the content between the opening " of the first match and the closing " of the second match. You can use the non-greedy ‘.*?’ or, more simply, ‘[^"]*’ to only match up to the first closing quote.
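A quick sketch of the difference, on a made-up sample line:

import re

line = '<a href="first.html">one</a> <a href="second.html">two</a>'
print(re.findall(r'href="(.*)"', line))    # greedy: ['first.html">one</a> <a href="second.html']
print(re.findall(r'href="(.*?)"', line))   # non-greedy: ['first.html', 'second.html']
print(re.findall(r'href="([^"]*)"', line)) # stop at first quote: ['first.html', 'second.html']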
But don't use regexes for parsing HTML. Really.
What others haven't told you is that using regular expressions for this is not a reliable solution.
Using regular expressions will give you wrong results in many situations: if there are <A> tags that are commented out, if there is text in the page which includes the string "href=", or if there are <textarea> elements with HTML code inside them, among many others. Plus, the href attribute may exist on tags other than the anchor tag.
What you need for this is XPath, which is a query language for DOM trees, i.e. it lets you retrieve any set of nodes satisfying the conditions you specify (HTML attributes are nodes in the DOM).
XPath is a well-standardized language nowadays (a W3C recommendation), and is well supported by all major languages. I strongly suggest you use XPath and not regexps for this.
adw's answer shows one example of using XPath for your particular case.
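To make that concrete, a minimal sketch with lxml (the markup is made up for illustration): the commented-out link is ignored, and href is matched on any element, not just <a>.

import lxml.html

doc = lxml.html.fromstring(
    '<body><!-- <a href="commented.html">x</a> -->'
    '<a href="real.html">link</a><area href="map.html"/></body>'
)
# //@href selects href attribute nodes anywhere in the tree;
# this should print ['real.html', 'map.html']
print(doc.xpath('//@href'))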
Don't divide the HTML content into lines, as there may be multiple matches in a single line. Also don't assume there are always quotes around the URL.
Do something like this:
import re

# 'content' is the whole HTML document as one string; the quote is
# optional, and a value ends at whitespace, a quote, or '>'
links = re.finditer(r' href="?([^\s">]+)', content)
for link in links:
    print(link.group(1))
Well, just for completeness I will add here what I found to be the best answer, and I found it in the book Dive Into Python, by Mark Pilgrim.
Here is the code to list all URLs from a webpage:
from sgmllib import SGMLParser
import urllib

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # called for every <a> tag; collect its href value, if any
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

# (in the book the class lives in its own urllister module;
# here it is defined inline, so we can use it directly)
usock = urllib.urlopen("http://diveintopython.net/")
parser = URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls:
    print(url)
Thanks for all the replies.