So let's say I have the following base URL: http://example.com/Stuff/preview/v/{id}/fl/1/t/. The page being parsed contains a number of URLs with different {id}s, and I want to find all the links matching this template in the HTML page.
I can use XPath to match just part of the template, //a[contains(@href, "preview/v")], or just use regexes, but I was wondering if anyone knew a more elegant way to match the entire template using XPath and regexes, so that it's fast and the matches are definitely correct.
Thanks.
Edit: I timed it on a sample page. With my internet connection and 100 trials, the iterlinks() iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.
Also, if you have Scrapy, you can use Selectors:
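from requests import get  # assumption: get() here is requests.get
from scrapy.selector import Selector
data = get(url).text
sel = Selector(text=data, type="html")
a = sel.xpath(r'//a[re:test(@href, "/Stuff/preview/v/\d+/fl/1/t/")]/@href').extract()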
The average time for this is also 0.467 seconds.
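Scrapy's selectors are built on top of lxml, and the same re:test function can be reached from plain lxml by passing the EXSLT regular-expressions namespace to xpath(). XPath 1.0 has no regex functions of its own, so this relies on an lxml extension. A minimal sketch, where the sample HTML is made up for illustration and the pattern is anchored with ^ and $ so that the whole href has to fit the template:

```python
import lxml.html

# Made-up sample page; only the first link fits the template exactly.
html = """
<html><body>
<a href="http://example.com/Stuff/preview/v/123/fl/1/t/">match</a>
<a href="http://example.com/Stuff/preview/v/abc/fl/1/t/">non-numeric id</a>
<a href="http://example.com/Other/">unrelated</a>
</body></html>
"""

# EXSLT regular-expressions namespace, exposed by lxml as an XPath extension.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

tree = lxml.html.fromstring(html)
hrefs = tree.xpath(
    r'//a[re:test(@href, "^http://example\.com/Stuff/preview/v/\d+/fl/1/t/$")]/@href',
    namespaces=EXSLT_RE,
)
print(hrefs)
```

Whether this is faster than iterating in Python would need to be timed on the actual page.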
XPath 1.0, which is what lxml implements, has no regular expression functions of its own. lxml does expose the EXSLT re:test extension if you register its namespace, but a simpler route is to find all the links on the page with iterlinks(), iterate over them, and check each one against a compiled pattern:
import re
import lxml.html
tree = lxml.html.fromstring(data)
# Raw string so \d reaches the regex engine; dots escaped so they match literally
pattern = re.compile(r"http://example\.com/Stuff/preview/v/\d+/fl/1/t/")
for element, attribute, link, pos in tree.iterlinks():
    if not pattern.match(link):
        continue
    print(link)
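One caveat with the loop above: pattern.match only anchors at the start of the string, so a longer URL that merely begins with the template would pass too. If the whole link has to fit the template, a small sketch like the following builds the regex from the template string and uses fullmatch. The helper name template_to_regex and the sample links are made up for illustration, and the {id} placeholder is assumed to be numeric, as the \d+ in the answer suggests.

```python
import re

TEMPLATE = "http://example.com/Stuff/preview/v/{id}/fl/1/t/"

def template_to_regex(template, placeholder="{id}", group=r"(\d+)"):
    """Hypothetical helper: escape the literal parts and capture the id."""
    head, tail = template.split(placeholder)
    return re.compile(re.escape(head) + group + re.escape(tail))

pattern = template_to_regex(TEMPLATE)

# Made-up links standing in for the values yielded by iterlinks().
links = [
    "http://example.com/Stuff/preview/v/42/fl/1/t/",
    "http://example.com/Stuff/preview/v/42/fl/1/t/extra",  # only a prefix match
    "http://example.com/Stuff/preview/v/x/fl/1/t/",        # non-numeric id
    "http://example.com/Stuff/preview/v/7/fl/1/t/",
]

ids = []
for link in links:
    m = pattern.fullmatch(link)  # the entire link must fit the template
    if m:
        ids.append(m.group(1))

print(ids)
```

Using re.escape on the literal parts also keeps the dots in the host name from matching arbitrary characters.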
An alternative would be to use the BeautifulSoup HTML parser:
import re
from bs4 import BeautifulSoup
data = "your html"
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"http://example\.com/Stuff/preview/v/\d+/fl/1/t/")
print(soup.find_all('a', {'href': pattern}))
To make BeautifulSoup parsing faster you can let it use lxml:
soup = BeautifulSoup(data, "lxml")
Also, you can make use of the SoupStrainer class, which lets you parse only specific parts of a page instead of the whole document.
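A minimal sketch of the SoupStrainer idea, assuming the built-in html.parser backend and a made-up sample page. The strainer is passed through parse_only, so only the <a> tags whose href fits the pattern end up in the tree:

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample page for illustration.
html = """
<html><body>
<p>Intro text that never gets parsed into the tree.</p>
<a href="http://example.com/Stuff/preview/v/123/fl/1/t/">first</a>
<a href="http://example.com/Other/">unrelated</a>
<a href="http://example.com/Stuff/preview/v/456/fl/1/t/">second</a>
</body></html>
"""

pattern = re.compile(r"^http://example\.com/Stuff/preview/v/\d+/fl/1/t/$")

# Keep only <a> tags whose href matches the pattern.
only_matching_links = SoupStrainer("a", href=pattern)
soup = BeautifulSoup(html, "html.parser", parse_only=only_matching_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```

How much time this saves depends on the page; it is worth timing against the plain find_all version.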
Hope that helps.
Related
I have beautiful soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
    pagelink.append(item.get_text())
The problem is that the HTML I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above would get everything, including the strong tags inside it.
So how can I get only:
https://www.websiteurl.com/id=6
Note that <cite> appears multiple times throughout the page, and I want to extract and print every one.
Thank you.
Extracting only the text portion is as easy as calling .text on the object.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
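Since the question mentions that <cite> appears several times on the page, the same idea extends to find_all. A short sketch, where the second <cite> element is made up for illustration:

```python
from bs4 import BeautifulSoup

# The second <cite> is a made-up example to show multiple matches.
html = """
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
<cite>https://www.<strong>websiteurl.com/id=7</strong></cite>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() joins the text of all descendants and drops the <strong> tags.
pagelink = [cite.get_text() for cite in soup.find_all("cite")]
for url in pagelink:
    print(url)
```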
It's been a while since I've used regex, and I feel like this should be simple to figure out.
I have a web page full of links that looks like the string_to_match in the below code. I want to grab just the numbers in the links, like number "58" in the string_to_match. For the life of me I can't figure it out.
import re
string_to_match = 'Roster'
re.findall('Roster',string_to_match)
Rather than running a regular expression over the whole page, you can combine HTML parsing (using the BeautifulSoup parser) to locate the desired link and read its href attribute with URL parsing of that value, for which we'll use a regular expression in this case:
import re
from bs4 import BeautifulSoup
data = """
<body>
Roster
</body>
"""
soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", text="Roster")["href"]
print(re.search(r"teamId=(\d+)", link).group(1))
Prints 58.
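The query-string step can also be done without a regex. A minimal sketch using the standard library's urllib.parse; the href value here is hypothetical, since the original link is not shown above, and only the teamId=58 part comes from the question:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href; only the teamId=58 parameter is taken from the question.
href = "/team/roster.php?teamId=58&season=2014"

# parse_qs maps each parameter name to a list of its values.
query = parse_qs(urlparse(href).query)
team_id = query["teamId"][0]
print(team_id)
```

This avoids false matches if another parameter name happens to end in teamId.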
I would recommend using BeautifulSoup or lxml; it's worth the learning curve.
...But if you still want to use a regexp:
re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
I am parsing a poorly designed web page using beautiful soup.
At the moment, what I need is to select the comment section of the web page, but each comment is a DIV with an ID like "IAMCOMMENT_00001" and nothing else. No class (which would have helped a lot).
So I am forced to search for all DIVs that start with "IAMCOMMENT" but I can't figure out how to do this. The closest I could find is SoupStrainer but couldn't understand how to even use it.
How would I be able to achieve this?
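I would use BeautifulSoup's built-in find_all function:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(yourhtml, "html.parser")
# The keyword is id, not id_; ^ anchors the match at the start of the value
soup.find_all('div', id=re.compile('^IAMCOMMENT_'))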
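A CSS attribute-prefix selector is another way to express "id starts with IAMCOMMENT_". A minimal sketch, assuming a Beautiful Soup version whose select() supports the [attr^=value] syntax; the sample HTML is made up:

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
html = """
<div id="IAMCOMMENT_00001">first</div>
<div id="sidebar">not a comment</div>
<div id="IAMCOMMENT_00002">second</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [id^="..."] selects elements whose id attribute starts with the given prefix.
comment_divs = soup.select('div[id^="IAMCOMMENT_"]')
ids = [div["id"] for div in comment_divs]
print(ids)
```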
If the divs live inside HTML comments, you first need to find the comments in your HTML. One way to do this is:
import re
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(myhtml, "html.parser")
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
Then, to find the divs inside each comment:
for comment in comments:
    cmnt_soup = BeautifulSoup(comment, "html.parser")
    divs = cmnt_soup.find_all('div', attrs={"id": re.compile(r'IAMCOMMENT_\d+')})
    # do things with the divs
I have tried using regex but read around and got directed to beautiful soup...
I've kinda figured out how to get urls in html tags with soup, but how would I grab urls from both html tags (href=*) and the body text of the page?
Also for grabbing the ones in tags, how do I specify that I only want urls starting with http://, https://... ?
Thanks in advance!
First look at parsing-html-in-python-lxml-or-beautifulsoup. I read it and never looked at the soup. I guess because I find lxml so easy. I am sure there are different ways to do what you asked, perhaps there are easier ones. But I'll show what I use.
In lxml you can use XPath, which is like using regex for XML/HTML. The code below finds all "a" tags that have an "href" attribute and prints all links that start with http. This should help you get started on your parsing.
from lxml import etree
tree = etree.parse("my.html", etree.HTMLParser())
root = tree.getroot()
# Attribute tests use @, so a[@href] means "a elements that have an href"
links = root.findall('.//a[@href]')
for link in links:
    if link.get("href").startswith("http"):
        print(link.get("href"))
I'm using cygwin and do not have BeautifulSoup installed.
Getting the value of href attributes in all <a> tags on a html file with Python
python, regex to find anchor link html
Regular expression to extract URL from an HTML link
If you don't care much about performance you can use regular expressions:
import re
linkre = re.compile(r"""href=["']([^"']+)["']""")
links = linkre.findall(your_html)
If you only want links that start with http:// or https://, change the expression to:
linkre = re.compile(r"""href=["'](https?://[^"']+)["']""")
You can also make the quotes optional, in case your HTML has unquoted href values.
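If neither BeautifulSoup nor lxml is available, the standard library's html.parser can cover both halves of the question: URLs in href attributes and URLs in the body text. A rough sketch; the UrlCollector class, the URL regex, and the sample HTML are all made up for illustration, and the regex is a deliberately simple approximation rather than a full URL grammar:

```python
import re
from html.parser import HTMLParser

# Simple approximation: http:// or https:// followed by non-space,
# non-quote, non-angle-bracket characters.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

class UrlCollector(HTMLParser):
    """Hypothetical helper: collects http(s) URLs from hrefs and from text."""

    def __init__(self):
        super().__init__()
        self.href_urls = []
        self.text_urls = []

    def handle_starttag(self, tag, attrs):
        # Checks the href of any tag, keeping only http:// and https:// values.
        for name, value in attrs:
            if name == "href" and value and URL_RE.match(value):
                self.href_urls.append(value)

    def handle_data(self, data):
        # Text between tags; picks up URLs written out in the body.
        self.text_urls.extend(URL_RE.findall(data))

html = """
<p>See <a href="https://example.com/a">this</a> and <a href="/relative">that</a>.</p>
<p>Plain text link: http://example.org/b and mailto:nobody@example.com</p>
"""

collector = UrlCollector()
collector.feed(html)
collector.close()
print(collector.href_urls)
print(collector.text_urls)
```

Unlike a regex over the raw source, this keeps attribute URLs and body-text URLs separate, which makes it easier to treat them differently later.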