How to extract hidden table data from job postings using BeautifulSoup? - python

Hi, I'm doing a Python course, and for one of our assignments today we're supposed to extract the job listings on: https://remoteok.com/remote-python-jobs
Here is a screenshot of the html in question:
[screenshot: browser DevTools (F12) view of the python jobs table]
And here is what I've written so far:
import requests
from bs4 import BeautifulSoup

def extract(term):
    url = f"https://remoteok.com/remote-{term}-jobs"
    request = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    if request.status_code == 200:
        soup = BeautifulSoup(request.text, 'html.parser')
        table = soup.find_all('table', id="jobsboard")
        print(len(table))
        for tbody in table:
            tbody.find_all('tbody')
            print(len(tbody))
    else:
        print("can't request website")

extract("python")
print(len(table)) gives me 1 and
print(len(tbody)) gives me 131.
So it's pretty clear that I've made a mistake somewhere, but I'm having trouble identifying the cause.
One suspicion I have is that when I request the HTML and parse it with BeautifulSoup, I am not getting the full webpage. But otherwise, I'm really not sure what I'm doing wrong here.
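One way I can think of to test that suspicion (I'm not sure it's the right check; the tr.job class name is just what I see in DevTools) is to search the raw response text for the job rows:

import requests

# is the markup I see in DevTools actually in the static HTML that requests receives?
html = requests.get("https://remoteok.com/remote-python-jobs",
                    headers={'User-Agent': 'Mozilla/5.0'}).text
print(len(html))                      # size of the raw document
print(html.count('tr class="job'))   # rough count of job rows in the static source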

requests does not render a website the way a browser does; it only provides the static HTML. The site's content is generated dynamically by JavaScript that converts JSON data into markup.
Use this to extract your data:
[json.loads(e.text.strip()) for e in soup.select('table tr.job [type="application/ld+json"]')]
Result:
[{'@context': 'http://schema.org', '@type': 'JobPosting', 'datePosted': '2022-09-04T05:21:13+00:00', 'description': 'About the Team\n\nThe Design Infrastructure team designs, builds, and ships the Design System foundations and UI components used in all of DoorDash’s products, on all platforms. Specifically, the iOS team works closely with designers and product engineering teams across the company to help shape the Design System, and owns the shared UI library for iOS – developed for both SwiftUI and UIKit.\nAbout the Role\n\nWe are looking for a lead iOS engineer who has a strong passion for UI components and working very closely with design. As part of the role you will be leading the iOS initiative for our Design System, which will include working closely with designers and iOS engineers on product teams to align, develop, maintain, and evolve the library of foundations and UI components; which is adopted in all our products.\n\nYou will report into the Lead Design Technologist for Mobile on our Design Infrastructure team in our Product Design organization. This role is 100% flexible, and can b\n Apply now and work remotely at DoorDash', 'baseSalary': {'@type': 'MonetaryAmount', 'currency': 'USD', 'value': {'@type': 'QuantitativeValue', 'minValue': 70000, 'maxValue': 120000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'jobLocation': [{'address': {'@type': 'PostalAddress', 'addressCountry': 'United States', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}], 'applicantLocationRequirements': [{'@type': 'Country', 'name': 'United States'}], 'title': 'Lead Design Technologist iOS', 'image': 'https://remoteok.com/assets/img/jobs/f2f1ab68227768717536a0ab7e2578ab1662268873.png', 'occupationalCategory': 'Lead Design Technologist iOS', 'workHours': 'Flexible', 'validThrough': '2022-12-03T05:21:13+00:00', 'hiringOrganization': {'@type': 'Organization', 'name': 'DoorDash', 'url': 'https://remoteok.com/doordash', 'sameAs': 'https://remoteok.com/doordash', 'logo': {'@type': 'ImageObject', 'url': 'https://remoteok.com/assets/img/jobs/f2f1ab68227768717536a0ab7e2578ab1662268873.png'}}}, {'@context': 'http://schema.org', '@type': 'JobPosting', 'datePosted': '2022-09-03T00:00:09+00:00', 'description': "We’re seeking a senior core, distributed systems engineers to build dev tools. At [Iterative](https://iterative.ai) we build [DVC](https://dvc.org) (9000+ ⭐on GitHub) and [CML](https://cml.dev) (2000+ ⭐ on GitHub) and a few other projects that are not released yet. It's a great opportunity if you love open source, dev tools, systems programming, and remote work. 
Join our well-funded remote-first team to build developer tools to see how your code is used by thousands of developers every day!\n\nABOUT YOU\n\n- Excellent communication skills and a positive mindset 🤗\n- No prior deep knowledge of ML is required\n- At least one year of experience with file systems, concurrency, multithreading, and server architectures\n- Passionate about building highly reliable system software\n- Python knowledge and excellent coding culture (standards, unit test, docs, etc) are required.\n- Initiative to help shape the engineering practices, products, and culture of a young startup\n- R\n Apply now and work remotely at Iterative", 'baseSalary': {'@type': 'MonetaryAmount', 'currency': 'USD', 'value': {'@type': 'QuantitativeValue', 'minValue': 50000, 'maxValue': 180000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'applicantLocationRequirements': {'@type': 'Country', 'name': 'Anywhere'}, 'jobLocation': {'address': {'@type': 'PostalAddress', 'addressCountry': 'Anywhere', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}, 'title': 'Senior Software Engineer', 'image': 'https://remoteOK.com/assets/img/jobs/cb9a279f231a5312283e6d935bba3be91636086324.png', 'occupationalCategory': 'Senior Software Engineer', 'workHours': 'Flexible', 'validThrough': '2022-12-02T00:00:09+00:00', 'hiringOrganization': {'@type': 'Organization', 'name': 'Iterative', 'url': 'https://remoteok.com/iterative', 'sameAs': 'https://remoteok.com/iterative', 'logo': {'@type': 'ImageObject', 'url': 'https://remoteOK.com/assets/img/jobs/cb9a279f231a5312283e6d935bba3be91636086324.png'}}}, {'@context': 'http://schema.org', '@type': 'JobPosting', 'datePosted': '2022-09-06T09:10:04+00:00', 'description': '<p dir="ltr">Get a remote job that you will love with better compensation and career growth.<strong></strong></p><p dir="ltr">We’re Lemon.io — a marketplace where we match you with hand-picked startups from the US and Europe.\xa0<strong></strong></p><p dir="ltr"><br /></p><p dir="ltr"><strong>Why work with us:</strong></p><ul><li dir="ltr"><p dir="ltr">We’ll find you a team that respects you. No time-trackers or any micromanagement stuff</p></li><li dir="ltr"><p dir="ltr">Our engineers earn $5k - $9k / month. We’ve already paid out over $10M.</p></li><li dir="ltr"><p dir="ltr">Choose your schedule. 
We have both full- and part-time projects.</p></li><li dir="ltr"><p dir="ltr">No project managers in the middle — only direct communications with clients, most of whom have a technical background</p></li><li dir="ltr"><p dir="ltr">Our customer success team provides life support to help you resolve anything.</p></li><li dir="ltr"><p dir="ltr">You don’\n Apply now and work remotely at lemon.io', 'baseSalary': {'@type': 'MonetaryAmount', 'currency': 'USD', 'value': {'@type': 'QuantitativeValue', 'minValue': 60000, 'maxValue': 110000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'applicantLocationRequirements': {'@type': 'Country', 'name': 'Anywhere'}, 'jobLocation': {'address': {'@type': 'PostalAddress', 'addressCountry': 'Anywhere', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}, 'title': 'DevOps Engineer', 'image': 'https://remoteOK.com/assets/img/jobs/b31a9584a903e655bd2f67a2d7f584781662455404.png', 'occupationalCategory': 'DevOps Engineer', 'workHours': 'Flexible', 'validThrough': '2022-12-05T09:10:04+00:00', 'hiringOrganization': {'@type': 'Organization', 'name': 'lemon.io', 'url': 'https://remoteok.com/lemon-io', 'sameAs': 'https://remoteok.com/lemon-io', 'logo': {'@type': 'ImageObject', 'url': 'https://remoteOK.com/assets/img/jobs/b31a9584a903e655bd2f67a2d7f584781662455404.png'}}},...]
Example
import requests, json
from bs4 import BeautifulSoup

def extract(term):
    url = f"https://remoteok.com/remote-{term}-jobs"
    request = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    if request.status_code == 200:
        soup = BeautifulSoup(request.text, 'html.parser')
        # each tr.job row embeds its structured data in a <script type="application/ld+json">
        data = [json.loads(e.text.strip()) for e in soup.select('table tr.job [type="application/ld+json"]')]
        return data
    else:
        print("can't request website")

for post in extract("python"):
    print(post['hiringOrganization']['name'])
Output
DoorDash
Iterative
lemon.io
Angaza
Angaza
Kandji
Kandji
Kandji
Kandji
Kandji
Great Minds
Jobber
Udacity
...
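Since each post is the full schema.org JobPosting object shown in the result above, the other fields are available the same way; for example (a small sketch, using keys taken from that dump):

for post in extract("python"):
    # baseSalary -> value -> minValue/maxValue, per the JSON-LD shown above
    salary = post.get('baseSalary', {}).get('value', {})
    print(post['title'], salary.get('minValue'), salary.get('maxValue'))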

The tbody does appear on the web page but isn't pulled into the table variable by BeautifulSoup.
I've encountered this before. One solution is to get your tags directly from Selenium.
But there is only one jobsboard table and one tbody on the web page, so you could just skip tbody and look for a more useful tag.
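For example (a rough sketch that reuses the tr.job selector from the answer above), you can skip the table and tbody wrappers entirely and grab the job rows directly:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://remoteok.com/remote-python-jobs",
                    headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, 'html.parser')
rows = soup.select("table#jobsboard tr.job")  # one row per job posting
print(len(rows))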
I use Google Chrome. It has the free extension ChroPath, which makes it super easy to identify selectors. I just right click on text in a browser and select Inspect, sometimes twice, and the correct HTML tag is highlighted.
PyCharm allows you to view the contents of each variable with ease.
This code will allow you to view the web page HTML source code in a text file:
outputFile = r"C:\Users\user\Documents\HP Laptop\Documents\Documents\Jobs\DIT\IDMB\OutputZ.txt"

def update_output_file(pageSource: str):
    # the with statement closes the file automatically; no explicit close() is needed
    with open(outputFile, 'w', encoding='utf-8') as f:
        f.write(pageSource)
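For example, assuming the request object from the question's extract function is in scope (hypothetical usage):

update_output_file(request.text)  # then open OutputZ.txt and search for the tags you expected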

Related

Unable to locate div.class element in html using BeautifulSoup

I am trying to run the following BeautifulSoup code on https://apps.npr.org/best-books/#view=list&year=2022 to locate the titles of the books listed on this page. I am using the code below, which I have confirmed generally works as a basic web scraper:
import requests
from bs4 import BeautifulSoup
url = 'https://apps.npr.org/best-books/#view=list&year=2022'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('div',{'class':'title'})
I would expect this to yield a list of all the book titles. Instead, I am getting an empty list, which essentially means it is not finding the HTML I'm searching for.
For reference, here is an example of an HTML string that has the information I want (which can equivalently be found by inspecting the source of the page I linked above):
<div class="title">(Serious) New Cook: Recipes, Tips, and Techniques</div>
Any tips on how to troubleshoot this?
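A quick first check (a minimal sketch): see whether the markup visible in the browser's inspector exists in the raw HTML that requests receives at all:

import requests

page = requests.get('https://apps.npr.org/best-books/#view=list&year=2022')
# a count of 0 would mean the titles are injected later by JavaScript
print(page.text.count('class="title"'))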
Simply fetch the data from the API; you can find it in the dev tools of your browser by checking the XHR tab:
import requests
requests.get('https://apps.npr.org/best-books/2022.json').json()
Output
[{'title': 'The School for Good Mothers: A Novel',
'author': 'Jessamine Chan',
'dimensions': {'width': 329, 'height': 500},
'cover': '1982156120',
'tags': ['sci fi, fantasy & speculative fiction',
'book club ideas',
'eye-opening reads',
'family matters',
'identity & culture',
'the states we’re in',
'staff picks',
'the dark side'],
'id': 1},
{'title': 'Young Mungo',
'author': 'Douglas Stuart',
'dimensions': {'width': 336, 'height': 500},
'cover': '0802159559',
'tags': ['realistic fiction',
'book club ideas',
'family matters',
'identity & culture',
'love stories',
'seriously great writing',
'tales from around the world',
'staff picks'],
'id': 2},...]
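From there, recovering the original goal (a list of all the book titles) is straightforward; a minimal sketch, relying on the 'title' key shown in the output above:

import requests

books = requests.get('https://apps.npr.org/best-books/2022.json').json()
titles = [book['title'] for book in books]  # every entry carries a 'title' key
print(titles[:5])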
This is most likely because the content is loaded using JavaScript. You could either fetch from the API as HedgeHog suggested, or you could load the JavaScript in the page using an automated browser like Selenium.

Web Scraping - The data changes on every scroll and only 20 entries are shown

I am trying to web scrape this web site, but the page's content changes when I scroll, and only 20 entries are shown.
As shown in my code below, it only gives 20 entries, and if I scroll down before running the code, the 20 entries change.
I want to get all 896 entries at once.
import requests
from bs4 import BeautifulSoup

main_titles, main_links = [], []
main = requests.get("https://www.sarbanes-oxley-forum.com/category/20/general-sarbanes-oxley-discussion/21")
soup = BeautifulSoup(main.content, "lxml")
main_dump = soup.find_all("h2", {"class": "title", "component": {"topic/header"}})
for k in range(len(main_dump)):
    main_titles.append(main_dump[k].find("a").text)
    main_links.append("https://www.sarbanes-oxley-forum.com/" + main_dump[k].find("a").attrs["href"])
print(len(main_links))
Output: 20
You do not need BeautifulSoup or Selenium in this case, because you can get all the data, already structured, from their API. Simply request the first page, check the number of topics, and iterate over the pages:
https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/1
Example
Note: Replace 894 with 1 to start from the first page; I just limited the number of requests for the demo here by starting from page 894:
import requests

api_url = 'https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/'
data = []
for i in range(894, requests.get(api_url + '1').json()['totalTopicCount'] + 1):
    for e in requests.get(api_url + str(i)).json()['topics']:
        data.append({
            'title': e['title'],
            'url': 'https://www.sarbanes-oxley-forum.com/topic/' + e['slug']
        })
data
Output
[{'title': 'We have a team of expert',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8550/we-have-a-team-of-expert'},
{'title': 'What is the privacy in Google Nest Wifi?',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8552/what-is-the-privacy-in-google-nest-wifi'},
{'title': 'Reporting Requirements _and_amp; Financial Results Release Timin 382',
'url': 'https://www.sarbanes-oxley-forum.com/topic/6214/reporting-requirements-_and_amp-financial-results-release-timin-382'},
{'title': 'Use of digital signatures instead of wet ink signatures on control documentation',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8476/use-of-digital-signatures-instead-of-wet-ink-signatures-on-control-documentation'},...]

How to customize userAgentData (Sec-Ch-Ua) in Selenium

https://chromedevtools.github.io/devtools-protocol/tot/Emulation/#method-canEmulate
If you look at the Emulation.setUserAgentOverride section of the developer protocol site here, there is the ability to enter a userAgentMetadata parameter, but Python Selenium doesn't recognize the parameter.
I want to customize Sec-Ch-Ua.
When I run return navigator.userAgentData, I want it to come out like this:
{'brands': [{'brand': '.Not/A)Brand', 'version': '99'}, {'brand': 'Google Chrome', 'version': '103'}, {'brand': 'Chromium', 'version': '103'}], 'mobile': False, 'platform': 'Windows'}
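A possible workaround (an untested sketch, not an official wrapper): Chromium-based drivers in Selenium 4 expose execute_cdp_cmd, which lets you send the raw Emulation.setUserAgentOverride command, including a userAgentMetadata payload, yourself; the version strings below are placeholders:

from selenium import webdriver

driver = webdriver.Chrome()
# send the CDP command directly, since the Python bindings have no dedicated helper for it
driver.execute_cdp_cmd('Emulation.setUserAgentOverride', {
    'userAgent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                 '(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'userAgentMetadata': {
        'brands': [{'brand': '.Not/A)Brand', 'version': '99'},
                   {'brand': 'Google Chrome', 'version': '103'},
                   {'brand': 'Chromium', 'version': '103'}],
        'platform': 'Windows',
        'platformVersion': '10.0.0',
        'architecture': 'x86',
        'model': '',
        'mobile': False,
    },
})
driver.get('https://example.com')
print(driver.execute_script('return navigator.userAgentData'))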

Python Requests-HTML - Can't find specific data

I am trying to scrape a web page using the Python requests-html library.
The link to that web page is https://www.koyfin.com/charts/g/USADebt2GDP?view=table.
[screenshot: the page's table view, with the data I want circled in red]
My code is like this,
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.koyfin.com/charts/g/USADebt2GDP?view=table')
r.html.render(timeout=60)
print(r.text)
The web page HTML looks like this:
[screenshot of the page source]
The problem is that when I scrape the web page, I can't find the data I want, even though in the HTML code I can see the data inside the first div tags in the body section.
Any specific suggestions for how to solve this?
Thanks.
The problem is that the data is loaded by JavaScript code after the initial page load. One solution is to use Selenium to drive a web browser to scrape the page. But using a regular browser, I looked at the network requests being made, and it appears that the data you seek is loaded with the following AJAX call:
https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly
So:
import requests

response = requests.get('https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly')
results = response.json()
print(results)
for t in results['graph']['data']:
    print(t)
Prints:
{'ticker': 'USADebt2GDP', 'companyName': 'United States Gross Federal Debt to GDP', 'startDate': '1940-12-31T00:00:00.000Z', 'endDate': '2019-12-31T00:00:00.000Z', 'unit': 'percent', 'graph': {'column_names': ['Date', 'Volume'], 'data': [['2010-12-31', 91.4], ['2011-12-31', 96], ['2012-12-31', 100.1], ['2013-12-31', 101.2], ['2014-12-31', 103.2], ['2015-12-31', 100.8], ['2016-12-31', 105.8], ['2017-12-31', 105.4], ['2018-12-31', 106.1], ['2019-12-31', 106.9]]}, 'withoutLiveData': True}
['2010-12-31', 91.4]
['2011-12-31', 96]
['2012-12-31', 100.1]
['2013-12-31', 101.2]
['2014-12-31', 103.2]
['2015-12-31', 100.8]
['2016-12-31', 105.8]
['2017-12-31', 105.4]
['2018-12-31', 106.1]
['2019-12-31', 106.9]
How I Came Up with the URL
[screenshots: the browser's developer tools Network tab showing the XHR requests, and the JSON response shown when you click on the last message]

Scraperwiki Python Loop Issue

I'm creating a scraper through ScraperWiki using Python, but I'm having an issue with the results I get. I'm basing my code on the basic example in ScraperWiki's docs, and everything seems very similar, so I'm not sure where my issue is. In my results, I get the first document's title/URL on the page, but there seems to be a problem with the loop, as it does not return the remaining documents after that one. Any advice is appreciated!
import scraperwiki
import requests
import lxml.html

html = requests.get("http://www.store.com/us/a/productDetail/a/910271.htm").content
dom = lxml.html.fromstring(html)
for entry in dom.cssselect('.downloads'):
    document = {
        'title': entry.cssselect('a')[0].text_content(),
        'url': entry.cssselect('a')[0].get('href')
    }
    print(document)
You need to iterate over the a tags inside the div with class downloads:
for entry in dom.cssselect('.downloads a'):
    document = {
        'title': entry.text_content(),
        'url': entry.get('href')
    }
    print(document)
Prints:
{'url': '/webassets/kpna/catalog/pdf/en/1012741_4.pdf', 'title': 'Rough In/Spec Sheet'}
{'url': '/webassets/kpna/catalog/pdf/en/1012741_2.pdf', 'title': 'Installation and Care Guide with Service Parts'}
{'url': '/webassets/kpna/catalog/pdf/en/1204921_2.pdf', 'title': 'Installation and Care Guide without Service Parts'}
{'url': '/webassets/kpna/catalog/pdf/en/1011610_2.pdf', 'title': 'Installation Guide without Service Parts'}
