I am trying to scrape this website, but the page's content changes when I scroll, and only 20 entries are shown at a time.
As shown in my code below, it only gives 20 entries, and if I scroll down before running the code, the 20 entries change.
I want to get all 896 entries at once.
main_titles = []
main_links = []
main = requests.get("https://www.sarbanes-oxley-forum.com/category/20/general-sarbanes-oxley-discussion/21")
soup = BeautifulSoup(main.content, "lxml")
main_dump = soup.find_all("h2", {"class": "title", "component": "topic/header"})
for k in range(len(main_dump)):
    main_titles.append(main_dump[k].find("a").text)
    main_links.append("https://www.sarbanes-oxley-forum.com/" + main_dump[k].find("a").attrs["href"])
print(len(main_links))
Output: 20
You do not need BeautifulSoup or Selenium in this case, because you can get all the data, already structured, from their API. Simply request the first page, check the number of topics, and iterate over the pages:
https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/1
Example
Note: Replace 894 with 1 to start from the first page; I limited the number of requests for this demo by starting from page 894:
import requests

api_url = 'https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/'
data = []

for i in range(894, requests.get(api_url + '1').json()['totalTopicCount'] + 1):
    for e in requests.get(api_url + str(i)).json()['topics']:
        data.append({
            'title': e['title'],
            'url': 'https://www.sarbanes-oxley-forum.com/topic/' + e['slug']
        })
data
Output
[{'title': 'We have a team of expert',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8550/we-have-a-team-of-expert'},
{'title': 'What is the privacy in Google Nest Wifi?',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8552/what-is-the-privacy-in-google-nest-wifi'},
{'title': 'Reporting Requirements _and_amp; Financial Results Release Timin 382',
'url': 'https://www.sarbanes-oxley-forum.com/topic/6214/reporting-requirements-_and_amp-financial-results-release-timin-382'},
{'title': 'Use of digital signatures instead of wet ink signatures on control documentation',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8476/use-of-digital-signatures-instead-of-wet-ink-signatures-on-control-documentation'},...]
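A side note: the loop above walks page numbers all the way up to totalTopicCount, which over-requests, since each API page carries roughly 20 topics. If you'd rather compute the actual page count first, a minimal sketch (assuming 20 topics per page, which matches the question's observation):

```python
import math

def page_count(total_topics: int, per_page: int = 20) -> int:
    """Number of API pages needed to cover all topics."""
    return math.ceil(total_topics / per_page)

# 896 topics at 20 per page -> 45 pages to request
print(page_count(896))  # 45
```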
Hi, I'm doing a Python course, and for one of our assignments today we're supposed to extract the job listings on: https://remoteok.com/remote-python-jobs
Here is a screenshot of the HTML in question (dev tools view, omitted here).
And here is what I've written so far:
import requests
from bs4 import BeautifulSoup

def extract(term):
    url = f"https://remoteok.com/remote-{term}-jobs"
    request = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    if request.status_code == 200:
        soup = BeautifulSoup(request.text, 'html.parser')
        table = soup.find_all('table', id="jobsboard")
        print(len(table))
        for tbody in table:
            tbody.find_all('tbody')
            print(len(tbody))
    else:
        print("can't request website")

extract("python")
print(len(table)) gives me 1 and
print(len(tbody)) gives me 131.
So it's pretty clear that I've made a mistake somewhere, but I'm having trouble identifying the cause.
One suspicion I have is that when I request the HTML and parse it with BeautifulSoup, I am not getting the full webpage. But otherwise, I'm really not sure what I'm doing wrong here.
requests does not render a website the way a browser does; it only provides the static HTML. The site's visible content is generated dynamically by JavaScript, which converts embedded JSON data into the page structure.
Use this to extract your data:
[json.loads(e.text.strip()) for e in soup.select('table tr.job [type="application/ld+json"]')]
Result:
[{'#context': 'http://schema.org', '#type': 'JobPosting', 'datePosted': '2022-09-04T05:21:13+00:00', 'description': 'About the Team\n\nThe Design Infrastructure team designs, builds, and ships the Design System foundations and UI components used in all of DoorDash’s products, on all platforms. Specifically, the iOS team works closely with designers and product engineering teams across the company to help shape the Design System, and owns the shared UI library for iOS – developed for both SwiftUI and UIKit.\nAbout the Role\n\nWe are looking for a lead iOS engineer who has a strong passion for UI components and working very closely with design. As part of the role you will be leading the iOS initiative for our Design System, which will include working closely with designers and iOS engineers on product teams to align, develop, maintain, and evolve the library of foundations and UI components; which is adopted in all our products.\n\nYou will report into the Lead Design Technologist for Mobile on our Design Infrastructure team in our Product Design organization. 
This role is 100% flexible, and can b\n Apply now and work remotely at DoorDash', 'baseSalary': {'#type': 'MonetaryAmount', 'currency': 'USD', 'value': {'#type': 'QuantitativeValue', 'minValue': 70000, 'maxValue': 120000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'jobLocation': [{'address': {'#type': 'PostalAddress', 'addressCountry': 'United States', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}], 'applicantLocationRequirements': [{'#type': 'Country', 'name': 'United States'}], 'title': 'Lead Design Technologist iOS', 'image': 'https://remoteok.com/assets/img/jobs/f2f1ab68227768717536a0ab7e2578ab1662268873.png', 'occupationalCategory': 'Lead Design Technologist iOS', 'workHours': 'Flexible', 'validThrough': '2022-12-03T05:21:13+00:00', 'hiringOrganization': {'#type': 'Organization', 'name': 'DoorDash', 'url': 'https://remoteok.com/doordash', 'sameAs': 'https://remoteok.com/doordash', 'logo': {'#type': 'ImageObject', 'url': 'https://remoteok.com/assets/img/jobs/f2f1ab68227768717536a0ab7e2578ab1662268873.png'}}}, {'#context': 'http://schema.org', '#type': 'JobPosting', 'datePosted': '2022-09-03T00:00:09+00:00', 'description': "We’re seeking a senior core, distributed systems engineers to build dev tools. At [Iterative](https://iterative.ai) we build [DVC](https://dvc.org) (9000+ ⭐on GitHub) and [CML](https://cml.dev) (2000+ ⭐ on GitHub) and a few other projects that are not released yet. It's a great opportunity if you love open source, dev tools, systems programming, and remote work. 
Join our well-funded remote-first team to build developer tools to see how your code is used by thousands of developers every day!\n\nABOUT YOU\n\n- Excellent communication skills and a positive mindset 🤗\n- No prior deep knowledge of ML is required\n- At least one year of experience with file systems, concurrency, multithreading, and server architectures\n- Passionate about building highly reliable system software\n- Python knowledge and excellent coding culture (standards, unit test, docs, etc) are required.\n- Initiative to help shape the engineering practices, products, and culture of a young startup\n- R\n Apply now and work remotely at Iterative", 'baseSalary': {'#type': 'MonetaryAmount', 'currency': 'USD', 'value': {'#type': 'QuantitativeValue', 'minValue': 50000, 'maxValue': 180000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'applicantLocationRequirements': {'#type': 'Country', 'name': 'Anywhere'}, 'jobLocation': {'address': {'#type': 'PostalAddress', 'addressCountry': 'Anywhere', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}, 'title': 'Senior Software Engineer', 'image': 'https://remoteOK.com/assets/img/jobs/cb9a279f231a5312283e6d935bba3be91636086324.png', 'occupationalCategory': 'Senior Software Engineer', 'workHours': 'Flexible', 'validThrough': '2022-12-02T00:00:09+00:00', 'hiringOrganization': {'#type': 'Organization', 'name': 'Iterative', 'url': 'https://remoteok.com/iterative', 'sameAs': 'https://remoteok.com/iterative', 'logo': {'#type': 'ImageObject', 'url': 'https://remoteOK.com/assets/img/jobs/cb9a279f231a5312283e6d935bba3be91636086324.png'}}}, {'#context': 'http://schema.org', '#type': 'JobPosting', 'datePosted': '2022-09-06T09:10:04+00:00', 'description': '<p dir="ltr">Get a remote job that you will love with better compensation and career 
growth.<strong></strong></p><p dir="ltr">We’re Lemon.io — a marketplace where we match you with hand-picked startups from the US and Europe.\xa0<strong></strong></p><p dir="ltr"><br /></p><p dir="ltr"><strong>Why work with us:</strong></p><ul><li dir="ltr"><p dir="ltr">We’ll find you a team that respects you. No time-trackers or any micromanagement stuff</p></li><li dir="ltr"><p dir="ltr">Our engineers earn $5k - $9k / month. We’ve already paid out over $10M.</p></li><li dir="ltr"><p dir="ltr">Choose your schedule. We have both full- and part-time projects.</p></li><li dir="ltr"><p dir="ltr">No project managers in the middle — only direct communications with clients, most of whom have a technical background</p></li><li dir="ltr"><p dir="ltr">Our customer success team provides life support to help you resolve anything.</p></li><li dir="ltr"><p dir="ltr">You don’\n Apply now and work remotely at lemon.io', 'baseSalary': {'#type': 'MonetaryAmount', 'currency': 'USD', 'value': {'#type': 'QuantitativeValue', 'minValue': 60000, 'maxValue': 110000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'applicantLocationRequirements': {'#type': 'Country', 'name': 'Anywhere'}, 'jobLocation': {'address': {'#type': 'PostalAddress', 'addressCountry': 'Anywhere', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}, 'title': 'DevOps Engineer', 'image': 'https://remoteOK.com/assets/img/jobs/b31a9584a903e655bd2f67a2d7f584781662455404.png', 'occupationalCategory': 'DevOps Engineer', 'workHours': 'Flexible', 'validThrough': '2022-12-05T09:10:04+00:00', 'hiringOrganization': {'#type': 'Organization', 'name': 'lemon.io', 'url': 'https://remoteok.com/lemon-io', 'sameAs': 'https://remoteok.com/lemon-io', 'logo': {'#type': 'ImageObject', 'url': 
'https://remoteOK.com/assets/img/jobs/b31a9584a903e655bd2f67a2d7f584781662455404.png'}}},...]
Example
import requests, json
from bs4 import BeautifulSoup

def extract(term):
    url = f"https://remoteok.com/remote-{term}-jobs"
    request = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    if request.status_code == 200:
        soup = BeautifulSoup(request.text, 'html.parser')
        data = [json.loads(e.text.strip()) for e in soup.select('table tr.job [type="application/ld+json"]')]
        return data
    else:
        print("can't request website")
        return []

for post in extract("python"):
    print(post['hiringOrganization']['name'])
Output
DoorDash
Iterative
lemon.io
Angaza
Angaza
Kandji
Kandji
Kandji
Kandji
Kandji
Great Minds
Jobber
Udacity
...
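Each item in that list is a plain dict in the schema.org JobPosting shape, so pulling out fields is ordinary dictionary access. A minimal sketch with a trimmed, hypothetical posting (the keys mirror the output above):

```python
# A trimmed, hypothetical JobPosting dict, shaped like the scraped JSON-LD above.
post = {
    'title': 'Lead Design Technologist iOS',
    'hiringOrganization': {'name': 'DoorDash'},
    'baseSalary': {'currency': 'USD',
                   'value': {'minValue': 70000, 'maxValue': 120000, 'unitText': 'YEAR'}},
}

org = post['hiringOrganization']['name']
salary = post['baseSalary']['value']
print(f"{post['title']} @ {org}: "
      f"{salary['minValue']}-{salary['maxValue']} {post['baseSalary']['currency']}/{salary['unitText']}")
# Lead Design Technologist iOS @ DoorDash: 70000-120000 USD/YEAR
```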
tbody does appear on the web page but isn't pulled into the table variable by BeautifulSoup.
I've encountered this before. One solution is to get your tags directly from Selenium.
But there is only one jobsboard table and one tbody on the page, so you could just skip tbody and look for a more useful tag.
I use Google Chrome. It has a free extension, ChroPath, which makes it easy to identify selectors. I just right-click on text in the browser and select Inspect, sometimes twice, and the correct HTML tag is highlighted.
PyCharm allows you to view the contents of each variable with ease.
This code will allow you to view the web page HTML source code in a text file:
outputFile = r"C:\Users\user\Documents\HP Laptop\Documents\Documents\Jobs\DIT\IDMB\OutputZ.txt"

def update_output_file(pageSource: str):
    # The with-statement closes the file automatically.
    with open(outputFile, 'w', encoding='utf-8') as f:
        f.write(pageSource)
I'm just trying to scrape the titles from the page, but the HTML loaded with page.inner_html('body') does not include all of the HTML. I think it may be loaded from JS, but when I look in the Network tab in dev tools I cannot find a JSON response or where it's being loaded from. I have tried this with Selenium as well, so there must be something I'm doing fundamentally wrong.
So no items from the list appear, but the regular HTML shows up fine. No amount of waiting for the content to load will load the information.
from playwright.sync_api import sync_playwright

url = 'https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en'

# open url
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # load the page and wait for network activity to settle
    page.goto(url)
    page.wait_for_load_state("networkidle")
    # get the html content
    html = page.inner_html("body")
    print(html)
    # close browser
    browser.close()
No, the webpage's content isn't loaded dynamically by JavaScript; it's entirely static HTML.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en')
soup = BeautifulSoup(page.content, 'lxml')

data = []
for e in soup.select('div.title'):
    d = {
        'title': e.a.get_text(strip=True),
    }
    data.append(d)
print(data)
Output:
[{'title': 'NARUTO THE ANIMATION CHRONICLE\u3000genga made for sale'}, {'title': 'Plex DPCF Haruno Sakura Reboru ring of the eyes'}, {'title': 'Naruto: Shippuden\u3000(replica) ナルト'}, {'title': 'Naruto: Shippuden\u3000(replica) ナルト'}, {'title': 'Naruto: Shippuden\u3000(replica) NARUTO -ナルト-'}, {'title': 'Naruto: Shippuden ナルト\u3000(replica)'}, {'title': 'Naruto Shippuuden\u3000(replica) NARUTO -ナルト-'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'MegaHouse ちみ メガ Petit Chara Land NARUTO SHIPPUDEN ナルト blast-of-wind intermediary Even [swirl ナルト special is a volume on ばよ. All 6 types set] inner bag not opened/box damaged'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'NARUTO -ナルト- 疾風伝'}, {'title': 'NARUTO -ナルト- 疾風伝'}, {'title': 'NARUTO -ナルト-'}]
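The titles mix in ideographic spaces (\u3000); if you want to normalize the whitespace before storing them, a small sketch (the helper name is mine):

```python
def clean_title(title: str) -> str:
    # split() with no arguments treats \u3000 and other Unicode
    # whitespace as separators; re-join with single ASCII spaces.
    return ' '.join(title.split())

print(clean_title('Naruto: Shippuden\u3000(replica) ナルト'))
# Naruto: Shippuden (replica) ナルト
```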
I am trying to scrape a web page using the Python requests-html library.
The link to that web page is https://www.koyfin.com/charts/g/USADebt2GDP?view=table,
and the image below shows (circled in red) the data I want to get.
My code is like this:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.koyfin.com/charts/g/USADebt2GDP?view=table')
r.html.render(timeout=60)
print(r.text)
The web page's HTML looks like this (screenshot omitted).
The problem is that when I scrape the web page I can't find the data I want, although in the HTML I can see the data inside the first div tags in the body section.
Any specific suggestions for how to solve this?
Thanks.
The problem is that the data is loaded by JavaScript after the initial page load. One solution is to use Selenium to drive a web browser to scrape the page. But using a regular browser, I looked at the network requests being made, and it appears that the data you seek is loaded with the following AJAX call:
https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly
So:
import requests

response = requests.get('https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly')
results = response.json()
print(results)
for t in results['graph']['data']:
    print(t)
Prints:
{'ticker': 'USADebt2GDP', 'companyName': 'United States Gross Federal Debt to GDP', 'startDate': '1940-12-31T00:00:00.000Z', 'endDate': '2019-12-31T00:00:00.000Z', 'unit': 'percent', 'graph': {'column_names': ['Date', 'Volume'], 'data': [['2010-12-31', 91.4], ['2011-12-31', 96], ['2012-12-31', 100.1], ['2013-12-31', 101.2], ['2014-12-31', 103.2], ['2015-12-31', 100.8], ['2016-12-31', 105.8], ['2017-12-31', 105.4], ['2018-12-31', 106.1], ['2019-12-31', 106.9]]}, 'withoutLiveData': True}
['2010-12-31', 91.4]
['2011-12-31', 96]
['2012-12-31', 100.1]
['2013-12-31', 101.2]
['2014-12-31', 103.2]
['2015-12-31', 100.8]
['2016-12-31', 105.8]
['2017-12-31', 105.4]
['2018-12-31', 106.1]
['2019-12-31', 106.9]
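Once you have the JSON, reshaping the graph data is plain list/dict work. A minimal sketch using the first few values printed above:

```python
# Rows as returned in results['graph']['data'] above (first three shown).
rows = [['2010-12-31', 91.4], ['2011-12-31', 96], ['2012-12-31', 100.1]]

# Map year -> debt-to-GDP percentage.
by_year = {date[:4]: value for date, value in rows}
print(by_year)  # {'2010': 91.4, '2011': 96, '2012': 100.1}
```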
How I Came Up with the URL
Using the browser's dev tools, I watched the network requests the page makes; clicking the last request reveals the API URL above and its JSON response. (Screenshots omitted.)
I'm creating a scraper through ScraperWiki using Python, but I'm having an issue with the results I get. I'm basing my code on the basic example in ScraperWiki's docs and everything seems very similar, so I'm not sure where my issue is. For my results, I get the first document's title/URL on the page, but there seems to be a problem with the loop, as it does not return the remaining documents after that one. Any advice is appreciated!
import scraperwiki
import requests
import lxml.html

html = requests.get("http://www.store.com/us/a/productDetail/a/910271.htm").content
dom = lxml.html.fromstring(html)

for entry in dom.cssselect('.downloads'):
    document = {
        'title': entry.cssselect('a')[0].text_content(),
        'url': entry.cssselect('a')[0].get('href')
    }
    print(document)
You need to iterate over the a tags inside the div with class downloads:
for entry in dom.cssselect('.downloads a'):
    document = {
        'title': entry.text_content(),
        'url': entry.get('href')
    }
    print(document)
Prints:
{'url': '/webassets/kpna/catalog/pdf/en/1012741_4.pdf', 'title': 'Rough In/Spec Sheet'}
{'url': '/webassets/kpna/catalog/pdf/en/1012741_2.pdf', 'title': 'Installation and Care Guide with Service Parts'}
{'url': '/webassets/kpna/catalog/pdf/en/1204921_2.pdf', 'title': 'Installation and Care Guide without Service Parts'}
{'url': '/webassets/kpna/catalog/pdf/en/1011610_2.pdf', 'title': 'Installation Guide without Service Parts'}
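The href values printed above are relative paths; if you need absolute links, you can join them against the page URL with urljoin from the standard library (the page URL below is the one from the question):

```python
from urllib.parse import urljoin

page_url = "http://www.store.com/us/a/productDetail/a/910271.htm"
href = "/webassets/kpna/catalog/pdf/en/1012741_4.pdf"

# A root-relative href replaces the base URL's path entirely.
print(urljoin(page_url, href))
# http://www.store.com/webassets/kpna/catalog/pdf/en/1012741_4.pdf
```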