I am completely new to web parsing with Python/BeautifulSoup. I have an HTML page that contains (in part) the following markup:
<div id="pages">
<ul>
<li class="active">Example</li>
<li>Example</li>
<li>Example 1</li>
<li>Example 2</li>
</ul>
</div>
I have to visit each link (basically each <li> element) until there are no more <li> tags present. Each time a link is visited, its corresponding <li> element is given the class 'active'. My code is:
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
This code gives me the first <li> item in the list (the one with class 'active'). My plan is to keep checking whether next_sibling is None. If it is not None, I make an HTTP request to the href attribute of the <a> tag inside that sibling <li>. That gets me to the next page, and so on, until there are no more pages.
But I can't figure out how to get the next_sibling of the page variable given above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation, but somehow couldn't find it. Can someone help please?
Use find_next_sibling() and be explicit about which sibling element you want to find:
next_li_element = page.find_next_sibling("li")
next_li_element will be None if page corresponds to the last <li> in the list:
if next_li_element is None:
# no more pages to go
Have you looked at dir(page) or the documentation? If so, how did you miss .find_next_sibling()?
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
sibling = page.find_next_sibling()
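For completeness, here is a minimal sketch of the full pagination loop described in the question. An assumption on my part: each sibling <li> wraps an <a> whose href is the next page's URL, and 'somepage.com' is just the question's placeholder:
from bs4 import BeautifulSoup
import urllib2

url = 'http://somepage.com'  # placeholder URL from the question
while url:
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    pageList = soup.find("div", {"id": "pages"})
    page = pageList.find("li", {"class": "active"})
    # ... scrape whatever you need from the current page here ...
    next_li = page.find_next_sibling("li")
    # assumption: each sibling <li> contains an <a> with the next page's href
    url = next_li.find("a").get("href") if next_li else None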
I am trying to extract the next-page href string using lxml.
For example, I am trying to extract "/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" from the HTML in the following example:
<nav rel="nav" class="pagination-container AjaxPager">
<a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" data-page-number="next-page" class="button button--primary next-page" rel="next" data-track-link="{'target': 'Company profile', 'name': 'navigation', 'navigationType': 'next'}">
Next page
</a>
</nav>
I have tried the following, but it returns a list, not the string I am looking for:
import requests
import lxml.html as html
URL = 'https://uk.trustpilot.com/review/bulb.co.uk'
page = requests.get(URL)
tree = html.fromstring(page.content)
href = tree.xpath('//a/@href')
Any idea what I am doing wrong?
Making this change to your code
href = tree.xpath('//a[@class="button button--primary next-page"]/@href')
href[0]
Gives me this output:
'/review/bulb.co.uk?b=MTYxOTk1ODMxMzAwMHw2MDhlOWEyOWY5ZjQ4NzA4ZTA4MjMxNTE'
which is close to the output in your question (its value may change dynamically).
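Since the extracted href is relative, you will probably want to join it to the base URL before requesting the next page. A small sketch using urljoin from the standard library:
from urllib.parse import urljoin

base = 'https://uk.trustpilot.com/review/bulb.co.uk'
next_url = urljoin(base, href[0])  # resolves the relative path against the base URL
next_page = requests.get(next_url)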
I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
<div class="flex-item _memberItem-module_name__BSx8i">
<a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
</a>
</div>
</div>
Your selector isn't working because you are looking for an attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector searches for href inside elements of class text--ellipsisOneLine. In your HTML snippet, that class matches only this element:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, there is no href attribute there. If you want the text inside this h4 element, you need the ::text pseudo-element:
response.css('.text--ellipsisOneLine::text').getall()
I realize that this isn't scrapy, but personally for web scraping I use the requests module and BeautifulSoup4, and the following code snippet will get you a list of users with the aforementioned modules:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
    html_doc = response.text
    html_source = BeautifulSoup(html_doc, 'html.parser')
    users = html_source.find_all('h4')
    for user in users:
        print(user.text)
css:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()
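Note that these href values are relative. Inside a Scrapy callback you can absolutize them with response.urljoin; a short sketch using the selector from this answer:
# inside the spider's parse() callback
for href in response.css('.member-item .flex--alignCenter a::attr(href)').getall():
    yield {'profile_url': response.urljoin(href)}  # resolve relative link against the page URL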
For a class project, I'm working on extracting all links on a webpage. This is what I have so far.
from bs4 import BeautifulSoup, SoupStrainer
with open("input.htm") as inputFile:
    soup = BeautifulSoup(inputFile)

outputFile = open('output.txt', 'w')
for link in soup.find_all('a', href=True):
    outputFile.write(str(link) + '\n')
outputFile.close()
This works very well.
Here's the complication: for every <a> element, my project requires me to know the entire "tree structure" leading to the current link. In other words, I'd like to know all the ancestor elements, starting with the <body> element, along with their class and id attributes.
Like the navigation pane in Windows Explorer, or the element inspector panel in many browsers.
For example, if you look at the Bible page on Wikipedia and a link to the Wikipedia page for the Talmud, the following "path" is what I'm looking for.
<body class="mediawiki ...>
<div id="content" class="mw-body" role="main">
<div id="bodyContent" class="mw-body-content">
<div id="mw-content-text" ...>
<div class="mw-parser-output">
<div role="navigation" ...>
<table class="nowraplinks ...>
<tbody>
<td class="navbox-list ...>
<div style="padding:0em 0.25em">
<ul>
<li>
<a href="/wiki/Talmud"
Thanks a bunch.
-Maureen
Try this code:
soup = BeautifulSoup(inputFile, 'html.parser')
Or use lxml:
soup = BeautifulSoup(inputFile, 'lxml')
If it is not installed:
pip install lxml
Here is a solution I just wrote. It works by finding the element, then navigating up the tree via the element's parent. I keep just the opening tag of each ancestor and add it to a list, then reverse the list at the end. We end up with a list that resembles the tree you requested.
I have written it for one element; you can modify it to work with your find_all.
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')

tree = []
hrefElement = soup.find('a', href=True)
hrefString = str(hrefElement).split(">")[0] + ">"  # keep just the opening tag
tree.append(hrefString)

hrefParent = hrefElement.find_parent()
while hrefParent.name != "html":
    hrefString = str(hrefParent).split(">")[0] + ">"
    tree.append(hrefString)
    hrefParent = hrefParent.find_parent()

tree.reverse()
print(tree)
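Following the note above about adapting this to find_all, here is one possible generalization: a sketch that wraps the same upward walk in a helper function and applies it to every link (opening_tag_path is my own hypothetical name):
from bs4 import BeautifulSoup
import requests

def opening_tag_path(element):
    # collect the opening tag of the element and each ancestor up to (not including) <html>
    path = [str(element).split(">")[0] + ">"]
    parent = element.find_parent()
    while parent is not None and parent.name != "html":
        path.append(str(parent).split(">")[0] + ">")
        parent = parent.find_parent()
    path.reverse()  # root-first order, like the tree in the question
    return path

page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')
for link in soup.find_all('a', href=True):
    print(opening_tag_path(link))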
Trying to teach myself some web scraping, just for fun. Decided to use it to look at a list of jobs posted on a website. I've gotten stuck. I want to be able to pull all the jobs listed on this page, but can't seem to get it to recognize anything deeper in the container I've made. Any suggestions are more than appreciated.
Current Code:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myURL = 'https://jobs.collinsaerospace.com/search-jobs/'
uClient = uReq(myURL)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container = page_soup.findAll("section", {"id":"search-results-list"})
container
Sample of the container:
<section id="search-results-list">
<ul>
<li>
<a data-job-id="12394447" href="/job/melbourne/test-technician/1738/12394447">
<h2>Test Technician</h2>
<span class="job-location">Melbourne, Florida</span>
<span class="job-date-posted">06/27/2019</span>
</a>
</li>
<li>
<a data-job-id="12394445" href="/job/cedar-rapids/associate-systems-engineer/1738/12394445">
<h2>Associate Systems Engineer</h2>
<span class="job-location">Cedar Rapids, Iowa</span>
<span class="job-date-posted">06/27/2019</span>
</a>
</li>
<li>
I'm trying to understand how to actually extract the h2 level information (or really any information within the container I currently created)
I have tried to replicate the same using lxml:
import requests
from lxml import html

resp = requests.get('https://jobs.collinsaerospace.com/search-jobs/')
data_root = html.fromstring(resp.content)

data = []
for node in data_root.xpath('//section[@id="search-results-list"]/ul/li'):
    data.append({
        "url": node.xpath('a/@href')[0],
        "name": node.xpath('a/h2/text()')[0],
        "location": node.xpath('a/span[@class="job-location"]/text()')[0],
        "posted": node.xpath('a/span[@class="job-date-posted"]/text()')[0],
    })
print(data)
If I understand correctly, you're looking to extract the headings from your container. Here's the snippet to do that:
for child in container:
    for heading in child.find_all('h2'):
        print(heading.text)
Note that child and heading are just dummy variables I'm using to iterate through the ResultSet (which is what container is) and the list (which is what all the headings are). For each child, I'm searching for all the h2 tags, and for each one I'm printing its text.
If you wanted to extract something else from your container, just tweak find_all.
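For instance, here is a sketch that pulls the title, location, and posting date for each job, using the class names from the sample container above:
for child in container:
    for job in child.find_all('a', href=True):
        title = job.find('h2')
        location = job.find('span', class_='job-location')
        posted = job.find('span', class_='job-date-posted')
        if title and location and posted:
            print(title.text, '|', location.text, '|', posted.text)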
I'm trying to scrape an HTML page for links under a specific class called "category-list".
Each link resides under an h4 tag (I'm ignoring the h3 tags):
<ul class="category-list">
<li class="category-item">
<h3>
<a href="/derdubor/c/alarm_og_sikkerhet/">
Alarm og sikkerhet
</a>
</h3>
<ul>
<li>
<h4>
<a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/">
<span class="category-has-customers">
Brannsikring
</span>
(1)
</a>
</h4>
</li>
</ul>
</li>
...
My code for scraping the html is the following:
r = request.urlopen(str_top_url)
soup = BeautifulSoup(r.read(), 'html.parser')
tag_category_list = soup.find('ul', class_='category-list')
tag_items = tag_category_list.find_all('h4')
for tag_item in tag_items.find_all('a'):
    print(tag_item.get('href'))
I get the error:
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item..."
Reading the Beautiful Soup documentation on crummy.com, it looks like you can use the same methods that belong to the BeautifulSoup class on a Tag object?
I can't seem to figure out what I'm doing wrong...
I've tried numerous answers here on Stack Overflow, but to no avail...
Regards MH
The problem is in this line: for tag_item in tag_items.find_all('a'):. You should first iterate through tag_items and then through the find_all('a') items. Here is the edited code:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul class="category-list"><li class="category-item"><h3><a href="/derdubor/c/alarm_og_sikkerhet/">Alarm og sikkerhet</a></h3><ul><li><h4><a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/"><span class="category-has-customers">Brannsikring</span>(1)</a></h4></li></ul></li></ul>', 'html.parser')
tag_category_list = soup.find('ul', class_='category-list')
tag_items = tag_category_list.find_all('h4')
for elm in tag_items:
    for tag_item in elm.find_all('a'):
        print(tag_item.get('href'))
And here is the result:
/derdubor/c/alarm_og_sikkerhet/brannsikring/
The problem is that tag_items is a ResultSet, not a Tag.
From the Beautiful Soup documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings, a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
So this nested loop should work:
for tag_item in tag_items:
    for link in tag_item.find_all('a'):
        print(link.get('href'))
Or, if you were only expecting one h4, change find_all('h4') to find('h4').
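As an aside, the same links can also be grabbed in one pass with a CSS selector via Beautiful Soup's select() method, given the markup shown in the question:
for link in soup.select('ul.category-list h4 a[href]'):
    print(link.get('href'))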