For a class project, I'm working on extracting all links on a webpage. This is what I have so far.
from bs4 import BeautifulSoup

with open("input.htm") as inputFile:
    soup = BeautifulSoup(inputFile, "html.parser")

with open("output.txt", "w") as outputFile:
    for link in soup.find_all("a", href=True):
        outputFile.write(str(link) + "\n")
This works very well.
Here's the complication: for every <a> element, my project requires me to know the entire "tree structure" down to the current link. In other words, I'd like to know all the ancestor elements, starting with the <body> element, along with their class and id attributes along the way.
Like the navigation pane in Windows Explorer, or the navigation panel in many browsers' element-inspection tools.
For example, if you look at the Bible page on Wikipedia and follow a link to the Wikipedia page for the Talmud, the following "path" is what I'm looking for:
<body class="mediawiki ...>
  <div id="content" class="mw-body" role="main">
    <div id="bodyContent" class="mw-body-content">
      <div id="mw-content-text" ...>
        <div class="mw-parser-output">
          <div role="navigation" ...>
            <table class="nowraplinks ...>
              <tbody>
                <td class="navbox-list ...>
                  <div style="padding:0em 0.25em">
                    <ul>
                      <li>
                        <a href="/wiki/Talmud"
Thanks a bunch.
-Maureen
Try specifying the parser explicitly:
soup = BeautifulSoup(inputFile, 'html.parser')
Or use lxml:
soup = BeautifulSoup(inputFile, 'lxml')
If it is not installed:
pip install lxml
Here is a solution I just wrote. It works by finding the element, then navigating up the tree through the element's parents. I parse out just the opening tag of each ancestor and add it to a list, then reverse the list at the end, so we end up with a list that resembles the tree you requested.
I have written it for one element; you can modify it to work with your find_all (a sketch of that modification follows the code below).
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')

tree = []
hrefElement = soup.find('a', href=True)
# keep only the opening tag of the element
hrefString = str(hrefElement).split(">")[0] + ">"
tree.append(hrefString)

# walk up through the ancestors until we hit <html>
hrefParent = hrefElement.find_parent()
while hrefParent.name != "html":
    hrefString = str(hrefParent).split(">")[0] + ">"
    tree.append(hrefString)
    hrefParent = hrefParent.find_parent()

tree.reverse()
print(tree)
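And here is a possible extension to all links, an untested sketch using find_parents(), which yields ancestors from the innermost outward (the "[document]" root that BeautifulSoup adds is filtered out along with html):

def opening_tag(element):
    # str() renders the whole subtree; keep only the opening tag
    return str(element).split(">")[0] + ">"

for link in soup.find_all('a', href=True):
    ancestors = [p for p in link.find_parents()
                 if p.name not in ("html", "[document]")]
    # reversed() puts the outermost ancestor first, matching the requested path
    path = [opening_tag(p) for p in reversed(ancestors)]
    path.append(opening_tag(link))
    print(path)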
This is an example of the type of HTML source block I'm targeting with BeautifulSoup:
<div class="fighter_list left">
<meta itemprop="image" content="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG">
<img class="lazy" src="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG" data-original="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG" alt="Jason DeLucia" title="Jason DeLucia" />
<div class="fighter_result_data">
<a itemprop="url" href="/fighter/Jason-DeLucia-22"><span itemprop="name">Jason<br />DeLucia</span></a><br>
This is one of multiple such blocks, one for each "fighter_list left" element on the page.
I want to get all of the itemprop="url" href links that are inside the "fighter_list left" class (e.g. /fighter/Jason-DeLucia-22).
When I try the code below, I get nothing:
for link in html.find_all('a', class_="fighter_List left", itemprop="url"):
    print(link.get('href'))
The closest I can get is getting every itemprop=url link on the page when I omit the class_= part.
But I only want the ones under the fighter_list left class.
This is the website https://www.sherdog.com/events/UFC-1-The-Beginning-7
You can use a CSS selector for this task:
import requests
from bs4 import BeautifulSoup
url = "https://www.sherdog.com/events/UFC-1-The-Beginning-7"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for link in soup.select('.fighter_list.left [itemprop="url"]'):
    print(link["href"])
Prints:
/fighter/Jason-DeLucia-22
/fighter/Royce-Gracie-19
/fighter/Gerard-Gordeau-15
/fighter/Ken-Shamrock-4
/fighter/Royce-Gracie-19
/fighter/Kevin-Rosier-17
/fighter/Gerard-Gordeau-15
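If you prefer find_all() over CSS selectors, here is an equivalent sketch. Note that class_ matches any element whose class list contains the given value, so the "left" column is filtered separately:

for block in soup.find_all("div", class_="fighter_list"):
    # class is multi-valued in bs4, so check for "left" in the list
    if "left" in block.get("class", []):
        for link in block.find_all("a", itemprop="url"):
            print(link["href"])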
I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
  <div class="flex-item _memberItem-module_name__BSx8i">
    <a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
      <h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
    </a>
  </div>
</div>
Your selector isn't working because you are looking for an attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector searches for an href inside elements with the class text--ellipsisOneLine. In your HTML snippet, that class matches only this element:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, there is no href attribute. If you want the text inside this h4 element, you need the ::text pseudo-element:
response.css('.text--ellipsisOneLine::text').getall()
Read more in the Scrapy selectors documentation.
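To get the href itself, one option (a sketch) is to select the h4 and then step up to its nearest <a> ancestor with a chained XPath:

# the href lives on the parent <a>, so step up one level with XPath
response.css('h4.text--ellipsisOneLine').xpath('ancestor::a[1]/@href').getall()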
I realize that this isn't Scrapy, but for web scraping I personally use the requests module and BeautifulSoup4; the following snippet will get you a list of users with those modules:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
    html_doc = response.text
    html_source = BeautifulSoup(html_doc, 'html.parser')
    users = html_source.find_all('h4')
    for user in users:
        print(user.text)
CSS:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()
I am trying to scrape the tables from the following page:
https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml
When I reach the HTML for the batting tables, I encounter a very long comment which contains the HTML for the table:
<div id="all_WashingtonSenatorsbatting" class="table_wrapper table_controls">
<div class="section_heading">
<div class="section_heading_text">
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
.....
-->
<div class="table_outer_container mobile_table">
<div class="footer no_hide_long">
Here the last two divs are what I am interested in scraping; everything between the <!-- and the --> is a comment, which happens to contain a copy of the table in the table_outer_container class below.
The problem is that when I read the page source into Beautiful Soup, it will not read anything after the comment within the table_wrapper div that contains everything. The following code illustrates the problem:
batting = page_source.find('div', {'id':'all_WashingtonSenatorsbatting'})
divs = batting.find_all('div')
len(divs)
gives me
Out[1]: 3
when there are obviously 5 div children under the <div id="all_WashingtonSenatorsbatting"> element.
Even when I extract the comment using
from bs4 import Comment
for comments in soup.find_all(text=lambda text: isinstance(text, Comment)):
    comments.extract()
the resulting soup still doesn't contain the last two div elements I want to scrape. I have been trying to attack this with regular expressions, but so far no luck. Any suggestions?
I found a workable solution. Using the code below, I extract the comment (which brings with it the last two div elements I wanted to scrape), process it again with BeautifulSoup, and scrape the table:
import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml"
s = requests.get(url).content
soup = BeautifulSoup(s, "html.parser")
wrapper = soup.find_all('div', {'class': 'table_wrapper'})[0]
# pull the comment node out of the wrapper and re-parse its contents
comment = wrapper.find_all(text=lambda x: isinstance(x, Comment))[0]
newsoup = BeautifulSoup(comment, 'html.parser')
table = newsoup.find('table')
It took me a while to get to this. I would be interested to see whether anyone comes up with another solution, or can offer an explanation of how this problem came to be.
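For what it's worth, a hedged generalization of the same idea: walk every comment on the page, re-parse each one, and collect any tables found inside.

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

tables = []
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    # each comment body is re-parsed as standalone HTML
    inner = BeautifulSoup(comment, "html.parser")
    tables.extend(inner.find_all("table"))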
I am completely new to web parsing with Python/BeautifulSoup. I have an HTML that has (part of) the code as follows:
<div id="pages">
  <ul>
    <li class="active">Example</li>
    <li>Example</li>
    <li>Example 1</li>
    <li>Example 2</li>
  </ul>
</div>
I have to visit each link (basically each <li> element) until there are no more <li> tags present. Each time a link is clicked, its corresponding <li> element gets the class 'active'. My code is:
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
This code gives me the first <li> item in the list. My logic is to keep checking whether the next_sibling is not None; if it isn't, I make an HTTP request to the href attribute of the <a> tag in that sibling <li>. That gets me to the next page, and so on, until there are no more pages.
But I can't figure out how to get the next sibling of the page variable given above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation but couldn't find it. Can someone help, please?
Use find_next_sibling() and be explicit about which sibling element you want to find:
next_li_element = page.find_next_sibling("li")
next_li_element will be None if the active li is the last one in the list:
if next_li_element is None:
    # no more pages to go
Have you looked at dir(page) or the documentation? If so, how did you miss .find_next_sibling()?
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
sibling = page.find_next_sibling()
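Putting it together, a minimal sketch of the loop described in the question, assuming each <li> wraps an <a href="..."> (the URLs themselves are hypothetical):

page = pageList.find("li", {"class": "active"})
nextPage = page.find_next_sibling("li")
while nextPage is not None:
    href = nextPage.find("a")["href"]  # assumes each <li> wraps an <a>
    landingPage = urllib2.urlopen(href).read()
    # ... process landingPage here ...
    nextPage = nextPage.find_next_sibling("li")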
I'm working on a python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:
<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
...
</div>
So I just need everything from the first href except the trailing number ('webpage-category/page/'), and I have the following working code:
pages = [l['href'] for link in soup.find_all('div', class_='pagination')
for l in link.find_all('a') if not re.search('pageSub', l['href'])]
s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])
The problem is that generating this whole list is wasteful, since I only need the first href. I think a generator would be the answer, but I couldn't pull it off. Could you help me make this code more concise?
What about this:
from bs4 import BeautifulSoup
html = """ <div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html)
link = soup.find('div', {'class': 'pagination'}).find('a')['href']
print '/'.join(link.split('/')[:-1])
prints:
webpage-category/page
Just FYI, speaking about the code you've provided: you can use next() instead of a list comprehension:
s = next(l['href'] for link in soup.find_all('div', class_='pagination')
for l in link.find_all('a') if not re.search('pageSub', l['href']))
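One caveat worth noting: next() raises StopIteration if nothing matches. Passing a default (here None, an illustrative choice) avoids that; the generator expression then needs its own parentheses:

# with a default, next() returns None instead of raising StopIteration
s = next((l['href'] for link in soup.find_all('div', class_='pagination')
          for l in link.find_all('a') if not re.search('pageSub', l['href'])),
         None)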
UPD (using the website link provided):
import urllib2
from bs4 import BeautifulSoup
url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urllib2.urlopen(url))
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print next('/'.join(link['href'].split('/')[:-1]) for link in links
if link.text.isdigit() and link.text != "1")