Trying to teach myself some web scraping, just for fun. Decided to use it to look at a list of jobs posted on a website. I've gotten stuck. I want to be able to pull all the jobs listed on this page, but can't seem to get it to recognize anything deeper in the container I've made. Any suggestions are more than appreciated.
Current Code:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myURL = 'https://jobs.collinsaerospace.com/search-jobs/'
uClient = uReq(myURL)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container = page_soup.findAll("section", {"id":"search-results-list"})
container
Sample of the container:
<section id="search-results-list">
<ul>
<li>
<a data-job-id="12394447" href="/job/melbourne/test-technician/1738/12394447">
<h2>Test Technician</h2>
<span class="job-location">Melbourne, Florida</span>
<span class="job-date-posted">06/27/2019</span>
</a>
</li>
<li>
<a data-job-id="12394445" href="/job/cedar-rapids/associate-systems-engineer/1738/12394445">
<h2>Associate Systems Engineer</h2>
<span class="job-location">Cedar Rapids, Iowa</span>
<span class="job-date-posted">06/27/2019</span>
</a>
</li>
<li>
I'm trying to understand how to actually extract the h2-level information (or really any information within the container I've created).
I have tried to replicate the same using lxml.
import requests
from lxml import html
resp = requests.get('https://jobs.collinsaerospace.com/search-jobs/')
data_root = html.fromstring(resp.content)
data = []
for node in data_root.xpath('//section[@id="search-results-list"]/ul/li'):
    data.append({"url": node.xpath('a/@href')[0],
                 "name": node.xpath('a/h2/text()')[0],
                 "location": node.xpath('a/span[@class="job-location"]/text()')[0],
                 "posted": node.xpath('a/span[@class="job-date-posted"]/text()')[0]})
print(data)
If I understand correctly, you're looking to extract the headings from your container. Here's the snippet to do that:
for child in container:
    for heading in child.find_all('h2'):
        print(heading.text)
Note that child and heading are just dummy variables I'm using to iterate through the ResultSet (which is what container is) and the list of headings that find_all returns. For each child, I'm searching for all the <h2> tags, and for each one I'm printing its text.
If you wanted to extract something else from your container, just tweak find_all.
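If you want all the fields rather than just the headings, the same idea extends naturally. Here's a self-contained sketch using the sample HTML from the question (trimmed to one job for brevity):

```python
from bs4 import BeautifulSoup

# Sample HTML taken from the question, trimmed to a single job listing.
html = """
<section id="search-results-list">
<ul>
<li>
<a data-job-id="12394447" href="/job/melbourne/test-technician/1738/12394447">
<h2>Test Technician</h2>
<span class="job-location">Melbourne, Florida</span>
<span class="job-date-posted">06/27/2019</span>
</a>
</li>
</ul>
</section>
"""

page_soup = BeautifulSoup(html, "html.parser")
container = page_soup.find_all("section", {"id": "search-results-list"})

jobs = []
for section in container:
    # Each job listing is an <a> with an href; pull the pieces out of it.
    for a in section.find_all("a", href=True):
        jobs.append({
            "name": a.h2.get_text(strip=True),
            "location": a.find("span", class_="job-location").get_text(strip=True),
            "posted": a.find("span", class_="job-date-posted").get_text(strip=True),
            "url": a["href"],
        })
print(jobs)
```

On the live page you would build `page_soup` from the downloaded HTML exactly as in the question, and the loop stays the same.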
Related
I am trying to scrape elements from a website.
<h2 class="a b" data-test-search-result-header-title> Heading </h2>
How can I extract the value Heading from the website using BeautifulSoup?
I have tried the following codes :
Code 1 :
soup.find_all(h2,{'class':['a','b']})
Code 2:
soup.find_all(h2,class_='a b'})
Both snippets return an empty list.
How to resolve this?
Fix Code 2 to soup.find_all('h2', class_='a b') — note the quotes around h2 and the removed stray }.
Example:
Given four h2 tags with various class combinations, soup.find_all('h2', class_='a b') gets only the first of them, because searching a multi-valued class with a space-separated string does an exact match against the complete class attribute value.
To get the text of an h2 element use .text; I have done it with
[heading.text for heading in soup.find_all('h2',class_='a b')]
because we have to loop over the find_all() result.
from bs4 import BeautifulSoup
html = """
<h2 class="a b"> Heading a and b </h2>
<h2 class="b a"> Heading b and a </h2>
<h2 class="a"> Heading a </h2>
<h2 class="b"> Heading b </h2>
"""
soup=BeautifulSoup(html,'html.parser')
[heading.text for heading in soup.find_all('h2',class_='a b')]
Output
[' Heading a and b ']
Further thoughts
You say that it would not work for you; without further code or information it is hard to help, so some of this is guessing. Let me show you what else could be the reason:
Let's say you are scraping Google results. There are a lot of options to do that; I just want to show two approaches, requests and selenium.
Requests Example
The classes inspected for the h3 elements in the browser are LC20lb DKV0Md.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/search?q=stackoverflow')
soup = BeautifulSoup(r.content, 'lxml')
headingsH3Class = soup.find_all('h3', class_='LC20lb DKV0Md')
headingsH3Only = soup.find_all('h3')
print(headingsH3Class[:2])
print(headingsH3Only[:2],'\n')
Requests Example Output
An empty list:
[]
And a list showing us that the classes inspected in the browser are not in the page content we get back from requests:
[<h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Stack Overflow</div></h3>, <h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Stack Overflow (Website) – Wikipedia</div></h3>]
Selenium Example
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?q=stackoverflow'
browser = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'lxml')
headingsH3Class = soup.find_all('h3', class_='LC20lb DKV0Md')
headingsH3Only = soup.find_all('h3')
print(headingsH3Class[:2])
print(headingsH3Only[:2])
browser.close()
Selenium Example Output
A list with exactly the h3 elements with both of the classes we searched for:
[<h3 class="LC20lb DKV0Md"><span>Stack Overflow - Where Developers Learn, Share, & Build ...</span></h3>, <h3 class="LC20lb DKV0Md"><span>Stack Overflow (Website) – Wikipedia</span></h3>]
A list with all h3 elements:
[<h3 class="LC20lb DKV0Md"><span>Stack Overflow - Where Developers Learn, Share, & Build ...</span></h3>, <h3 class="r"><a class="l" data-ved="2ahUKEwj426uv9u3tAhUPohQKHYymBMAQjBAwAXoECAcQAQ" href="https://stackoverflow.com/questions" ping="/url?sa=t&source=web&rct=j&url=https://stackoverflow.com/questions&ved=2ahUKEwj426uv9u3tAhUPohQKHYymBMAQjBAwAXoECAcQAQ">Questions</a></h3>]
Conclusion
Always check the data you are actually scraping; the response can differ from what you inspect in the browser.
For a class project, I'm working on extracting all links on a webpage. This is what I have so far.
from bs4 import BeautifulSoup, SoupStrainer
with open("input.htm") as inputFile:
    soup = BeautifulSoup(inputFile)

outputFile = open('output.txt', 'w')
for link in soup.find_all('a', href=True):
    outputFile.write(str(link) + '\n')
outputFile.close()
This works very well.
Here's the complication: for every <a> element, my project requires me to know the entire "tree structure" down to the current link. In other words, I'd like to know all the preceding elements, starting with the <body> element, along with the class and id of each along the way.
Like the navigation pane in Windows Explorer, or the navigation panel in many browsers' element-inspection tools.
For example, if you look at the Bible page on Wikipedia and a link to the Wikipedia page for the Talmud, the following "path" is what I'm looking for.
<body class="mediawiki ...>
<div id="content" class="mw-body" role="main">
<div id="bodyContent" class="mw-body-content">
<div id="mw-content-text" ...>
<div class="mw-parser-output">
<div role="navigation" ...>
<table class="nowraplinks ...>
<tbody>
<td class="navbox-list ...>
<div style="padding:0em 0.25em">
<ul>
<li>
<a href="/wiki/Talmud"
Thanks a bunch.
-Maureen
Try this code:
soup = BeautifulSoup(inputFile, 'html.parser')
Or use lxml:
soup = BeautifulSoup(inputFile, 'lxml')
If it is not installed:
pip install lxml
Here is a solution I just wrote. It works by finding the element, then navigating up the tree through the element's parents. I parse just the opening tag of each and add it to a list, reversing the list at the end, so we end up with a list that resembles the tree you requested.
I have written it for one element; you can modify it to work with your find_all.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')
tree = []
hrefElement = soup.find('a', href=True)
hrefString = str(hrefElement).split(">")[0] + ">"
tree.append(hrefString)
hrefParent = hrefElement.find_parent()
while hrefParent.name != "html":
    hrefString = str(hrefParent).split(">")[0] + ">"
    tree.append(hrefString)
    hrefParent = hrefParent.find_parent()
tree.reverse()
print(tree)
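To apply the same idea to every link from find_all, the parent walk can be wrapped in a function. A sketch (a small inline HTML snippet stands in for the Wikipedia page so the example runs without a network request):

```python
from bs4 import BeautifulSoup

# Tiny stand-in document so the example is self-contained.
html = ('<html><body><div id="content"><ul><li>'
        '<a href="/wiki/Talmud">Talmud</a>'
        '</li></ul></div></body></html>')
soup = BeautifulSoup(html, "html.parser")

def ancestor_path(element):
    """Return the opening tags from <body> down to the element itself."""
    path = [str(element).split(">")[0] + ">"]
    parent = element.find_parent()
    while parent is not None and parent.name != "html":
        path.append(str(parent).split(">")[0] + ">")
        parent = parent.find_parent()
    path.reverse()
    return path

# Same shape of output as the single-element version, once per link.
for link in soup.find_all("a", href=True):
    print(ancestor_path(link))
```

Against the real page you would build `soup` from `requests.get(...).text` as in the answer above; the function itself is unchanged.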
I'm trying to parse this webpage and take some of information:
http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513
import requests
page = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
All_Information = soup.find(id="MainContent")
print(All_Information)
It seems all the information between the tags is hidden. When I run the code, this is what is returned:
<div class="tabcontent content" id="MainContent">
<div id="TopBox"></div>
<div id="ThemePlace" style="text-align:center">
<div class="box1 olive tbl z2_4 h250" id="Section_relco" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_history" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_tcsconfirmedorders" style="display:none"></div>
</div>
</div>
Why is the information not there, and how can I find and/or access it?
The information that I assume you are looking for is not loaded by your request; the webpage makes additional requests after it has initially loaded. There are a few ways you can get that information.
You can try selenium, a Python package that drives a real web browser. This allows the page to load all of its information before you try to scrape.
Another way is to reverse engineer the website and find out where it is getting the information you need.
Have a look at this link.
http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+
It is called by your page every few seconds, and it appears to contain all the pricing information you are looking for. It may be easier to call that webpage to get your information.
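As a rough sketch of that approach — the payload format of this endpoint is undocumented, so the semicolon/comma delimiters below are an assumption from eyeballing the response, and `split_payload` is a helper name I made up:

```python
def split_payload(raw):
    # Assumption: the endpoint appears to return semicolon-delimited
    # sections of comma-delimited fields. Verify against a live response
    # in your browser's network tab before relying on any field position.
    return [section.split(",") for section in raw.split(";")]

# Live call (needs the requests package; uncomment to hit the endpoint):
# import requests
# resp = requests.get("http://www.tsetmc.com/tsev2/data/instinfofast.aspx"
#                     "?i=778253364357513&c=57+")
# print(split_payload(resp.text))

# Offline demo on a made-up payload so the splitting logic is visible:
sample = "12:29:40,A,7600;1,2,3"
print(split_payload(sample))
```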
Unsure how to properly word the issue.
I am trying to parse through an HTML document with a tree similar to that of
div(unique-class)
|-a
|-h4
|-div(class-a)
|-div(class-b)
|-div(class-c)
|-p
Etc, it continues. I only listed the few items I need. It is a lot of sibling hierarchy, all existing within one div.
I've been working quite a bit with BeautifulSoup for the past few hours, and I finally have a working version (Beta) of what I'm trying to parse, in this example.
from bs4 import BeautifulSoup
import urllib2
import csv
file = "C:\\Python27\\demo.html"
soup = BeautifulSoup (open(file), 'html.parser')
#(page, 'html.parser')
#Let's pull prices
names = []
pricing = []
discounts = []
for name in soup.find_all('div', attrs={'class': 'unique_class'}):
    names.append(name.h4.text)
for price in soup.find_all('div', attrs={'class': 'class-b'}):
    pricing.append(price.text)
for discount in soup.find_all('div', attrs={'class': 'class-a'}):
    discounts.append(discount.text)
ofile = open('output2.csv','wb')
fieldname = ['name', 'discountPrice', 'originalPrice']
writer = csv.DictWriter(ofile, fieldnames = fieldname)
writer.writeheader()
for i in range(len(names)):
    print(names[i], pricing[i], discounts[i])
    writer.writerow({'name': names[i], 'discountPrice': pricing[i], 'originalPrice': discounts[i]})
ofile.close()
As you can tell, this iterates from top to bottom and appends to a separate list for each field. The issue is: if I'm iterating over, say, 30,000 items and the website can modify itself (say, a scoreboard app on a JS framework), by the time I get to the second loop the order may have changed. (As I type this I realize this scenario would actually need more variables, since BS would 'catch' the website at load time, but I think the point still stands.)
I believe I need to leverage the next_sibling function in BS4, but when I did that I started capturing items I wasn't specifying, because I couldn't apply a 'class' filter to the sibling.
Update
An additional issue I encountered when trying a loop within a loop to find the three children I need under each unique_class: I would end up with the first price being listed for all names.
Update - Adding sample HTML
<div class="unique_class">
<h4>World</h4>
<div class="class_b">$1.99</div>
<div class="class_a">$1.99</div>
</div>
<div class="unique_class">
<h4>World2</h4>
<div class="class_b">$2.99</div>
<div class="class_a">$2.99</div>
</div>
<div class="unique_class">
<h4>World3</h4>
<div class="class_b">$3.99</div>
<div class="class_a">$3.99</div>
</div>
<div class="unique_class">
<h4>World4</h4>
<div class="class_b">$4.99</div>
<div class="class_a">$3.99</div>
</div>
I have also found a fix and submitted it for optimization, located at Code Review.
If the site you are looking to scrape uses JS, you may want to use selenium and its page_source attribute to take snapshots of the page with the JS loaded, which you can then feed into BS.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(<url>)
page = driver.page_source
Then you can use BS to parse the JS-loaded page.
If you need to wait for particular JS events before taking the snapshot, selenium lets you specify explicit waits.
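For the pairing problem described in the question, another option (independent of the JS issue) is to iterate one unique_class container at a time, so each name is read together with its own prices instead of being collected into three parallel lists. A sketch using the question's sample HTML, trimmed to two entries:

```python
from bs4 import BeautifulSoup

# Sample HTML from the question, trimmed to two containers.
html = """
<div class="unique_class">
<h4>World</h4>
<div class="class_b">$1.99</div>
<div class="class_a">$1.99</div>
</div>
<div class="unique_class">
<h4>World2</h4>
<div class="class_b">$2.99</div>
<div class="class_a">$2.99</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
# One container per iteration: the h4 and both price divs are read from
# the same <div class="unique_class">, so they can never fall out of sync.
for item in soup.find_all("div", class_="unique_class"):
    rows.append({
        "name": item.h4.text,
        "discountPrice": item.find("div", class_="class_b").text,
        "originalPrice": item.find("div", class_="class_a").text,
    })
print(rows)
```

Each `rows` entry can then be passed straight to `writer.writerow`, replacing the three separate loops.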
I am completely new to web parsing with Python/BeautifulSoup. I have an HTML that has (part of) the code as follows:
<div id="pages">
<ul>
<li class="active">Example</li>
<li>Example</li>
<li>Example 1</li>
<li>Example 2</li>
</ul>
</div>
I have to visit each link (basically each <li> element) until there are no more <li> tags present. Each time a link is clicked, its corresponding <li> element gets the class 'active'. My code is:
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
This code gives me the first <li> item in the list. My logic is to keep checking whether next_sibling is None; if it is not, I create an HTTP request to the href attribute of the <a> tag in that sibling <li>. That gets me to the next page, and so on, until there are no more pages.
But I can't figure out how to get the next_sibling of the page variable above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation but couldn't find it. Can someone help, please?
Use find_next_sibling() and be explicit about which sibling element you want to find:
next_li_element = page.find_next_sibling("li")
next_li_element will be None once page is the last <li>:
if next_li_element is None:
    # no more pages to go
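Put together, the whole walk is a loop that keeps calling find_next_sibling("li") until it returns None. A sketch using HTML shaped like the question's snippet — the hrefs here are hypothetical placeholders (the snippet doesn't show the <a> tags), and in the real code each one would be fetched with urllib2 before moving on:

```python
from bs4 import BeautifulSoup

# HTML shaped like the question's snippet; the hrefs are made up.
html = """
<div id="pages">
<ul>
<li class="active"><a href="page1.html">Example</a></li>
<li><a href="page2.html">Example</a></li>
<li><a href="page3.html">Example 1</a></li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
page = soup.find("div", {"id": "pages"}).find("li", {"class": "active"})

visited = []
while page is not None:
    visited.append(page.a["href"])  # fetch this URL here in the real code
    page = page.find_next_sibling("li")
print(visited)
```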
Have you looked at dir(page) or the documentation? If so, how did you miss .find_next_sibling()?
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
sibling = page.find_next_sibling()