I'm just trying to scrape the titles from the page, but the HTML loaded with page.inner_html('body') does not include all of the HTML. I think the missing part may be loaded by JS, but when I look in the Network tab in DevTools I cannot find a JSON response or any other source it could be coming from. I have tried this with Selenium as well, so there must be something I'm doing fundamentally wrong.
None of the items from the list appear, but the regular HTML shows up fine. No amount of waiting for the content to load will load the information.
from playwright.sync_api import sync_playwright

url = 'https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en'

# open the url
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto(url)
    # wait for the page to finish loading
    page.wait_for_load_state("networkidle")
    # get the html content
    html = page.inner_html("body")
    print(html)
    # close the browser
    browser.close()
No, the webpage's content isn't loaded dynamically by JavaScript; it's entirely static HTML:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en')
soup = BeautifulSoup(page.content, 'lxml')

data = []
for e in soup.select('div.title'):
    d = {
        'title': e.a.get_text(strip=True),
    }
    data.append(d)
print(data)
Output:
[{'title': 'NARUTO THE ANIMATION CHRONICLE\u3000genga made for sale'}, {'title': 'Plex DPCF Haruno Sakura Reboru ring of the eyes'}, {'title': 'Naruto: Shippuden\u3000(replica) ナルト'}, {'title': 'Naruto: Shippuden\u3000(replica) ナルト'}, {'title': 'Naruto: Shippuden\u3000(replica) NARUTO -ナルト-'}, {'title': 'Naruto: Shippuden ナルト\u3000(replica)'}, {'title': 'Naruto Shippuuden\u3000(replica) NARUTO -ナルト-'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'MegaHouse ちみ メガ Petit Chara Land NARUTO SHIPPUDEN ナルト blast-of-wind intermediary Even [swirl ナルト special is a volume on ばよ. All 6 types set] inner bag not opened/box damaged'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'NARUTO -ナルト- 疾風伝'}, {'title': 'NARUTO -ナルト- 疾風伝'}, {'title': 'NARUTO -ナルト-'}]
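For completeness, the same titles can also be pulled from the original Playwright session; a minimal sketch, assuming the div.title > a markup used above:
from playwright.sync_api import sync_playwright

url = 'https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url)
    # collect the text of every title link in one call
    titles = page.locator('div.title a').all_inner_texts()
    browser.close()

print(titles)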
I am trying to run the following BeautifulSoup code on https://apps.npr.org/best-books/#view=list&year=2022 to locate the titles of the books listed on that page. I am using the code below, which I have confirmed generally works as a basic web scraper:
import requests
from bs4 import BeautifulSoup
url = 'https://apps.npr.org/best-books/#view=list&year=2022'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('div',{'class':'title'})
I would expect this to yield a list of all the book titles. Instead, I am getting an empty list, which essentially means it is not finding the HTML I'm searching for.
For reference, here is an example of an HTML string that has the information I want (it can equivalently be found by inspecting the source of the page linked above):
<div class="title">(Serious) New Cook: Recipes, Tips, and Techniques</div>
Any tips on how to troubleshoot this?
Simply fetch the data from the API; you can find it in your browser's dev tools, in the XHR tab of the Network panel:
import requests
requests.get('https://apps.npr.org/best-books/2022.json').json()
Output
[{'title': 'The School for Good Mothers: A Novel',
'author': 'Jessamine Chan',
'dimensions': {'width': 329, 'height': 500},
'cover': '1982156120',
'tags': ['sci fi, fantasy & speculative fiction',
'book club ideas',
'eye-opening reads',
'family matters',
'identity & culture',
'the states we’re in',
'staff picks',
'the dark side'],
'id': 1},
{'title': 'Young Mungo',
'author': 'Douglas Stuart',
'dimensions': {'width': 336, 'height': 500},
'cover': '0802159559',
'tags': ['realistic fiction',
'book club ideas',
'family matters',
'identity & culture',
'love stories',
'seriously great writing',
'tales from around the world',
'staff picks'],
'id': 2},...]
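If you only need the titles, a list comprehension over that JSON yields exactly what the div.title lookup was after; a minimal sketch:
import requests

# each book record in the JSON carries its title directly
books = requests.get('https://apps.npr.org/best-books/2022.json').json()
titles = [book['title'] for book in books]
print(titles[:5])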
This is most likely because the content is loaded using JavaScript. You could either fetch it from the API as HedgeHog suggested, or you could let the JavaScript run by loading the page in an automated browser like Selenium.
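If you do go the Selenium route, here is a minimal sketch; it assumes the rendered page really does expose the div.title elements from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://apps.npr.org/best-books/#view=list&year=2022')
# wait for the JavaScript-rendered titles to appear before reading them
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.title'))
)
titles = [e.text for e in driver.find_elements(By.CSS_SELECTOR, 'div.title')]
driver.quit()
print(titles)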
I am trying to web scrape this web site, but the page's content changes as you scroll, and only 20 entries are shown at a time.
As my code below shows, it only gives 20 entries, and if I scroll down before running the code, those 20 entries change.
I want to get all 896 entries at once.
import requests
from bs4 import BeautifulSoup

main = requests.get("https://www.sarbanes-oxley-forum.com/category/20/general-sarbanes-oxley-discussion/21")
soup = BeautifulSoup(main.content, "lxml")
main_dump = soup.find_all("h2", {"class": "title", "component": {"topic/header"}})
main_titles = []
main_links = []
for k in range(len(main_dump)):
    main_titles.append(main_dump[k].find("a").text)
    main_links.append("https://www.sarbanes-oxley-forum.com/" + main_dump[k].find("a").attrs["href"])
print(len(main_links))
Output: 20
You do not need BeautifulSoup or Selenium in this case, because you can get all the data, already structured, from their API. Simply request the first page, check the number of topics, and iterate:
https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/1
Example
Note: Replace 894 with 1 to start over from the first page; I limited the number of requests for this demo by starting from page 894:
import requests

api_url = 'https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/'
data = []

for i in range(894, requests.get(api_url + '1').json()['totalTopicCount'] + 1):
    for e in requests.get(api_url + str(i)).json()['topics']:
        data.append({
            'title': e['title'],
            'url': 'https://www.sarbanes-oxley-forum.com/topic/' + e['slug']
        })
data
Output
[{'title': 'We have a team of expert',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8550/we-have-a-team-of-expert'},
{'title': 'What is the privacy in Google Nest Wifi?',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8552/what-is-the-privacy-in-google-nest-wifi'},
{'title': 'Reporting Requirements _and_amp; Financial Results Release Timin 382',
'url': 'https://www.sarbanes-oxley-forum.com/topic/6214/reporting-requirements-_and_amp-financial-results-release-timin-382'},
{'title': 'Use of digital signatures instead of wet ink signatures on control documentation',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8476/use-of-digital-signatures-instead-of-wet-ink-signatures-on-control-documentation'},...]
I am trying to download multiple CSV files from the URL below, hoping to use Selenium or any other method. The URL requires filling out a form that includes selecting options from multiple dropdowns. Then, an 'image' button needs to be clicked for the download link to appear.
If I run the Selenium Chrome driver from Python and click the button, nothing appears. I am also unable to figure out the URLs of the CSV files, which would otherwise allow downloading them with requests or urllib.
Here's the url I need to download from:
https://www1.nseindia.com/products/content/derivatives/equities/historical_fo.htm
Here's my code so far:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'https://www1.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
driver.get(url)
instr_type = Select(driver.find_element_by_id('instrumentType'))
symbol = Select(driver.find_element_by_id('symbol'))
opt_type = Select(driver.find_element_by_id('optionType'))
date_range = Select(driver.find_element_by_id('dateRange'))
button = driver.find_element_by_xpath("//input[@src='/common/images/btn-get-data.gif' and @type='image']")
instr_type.select_by_visible_text('Index Options')
symbol.select_by_visible_text('NIFTY 50')
opt_type.select_by_visible_text('CE')
date_range.select_by_visible_text('90 Days')
button.click()
And this is what happens in the Selenium driver (screenshot omitted) -
Any thoughts on how to download the csv files from above link? Doesn't necessarily have to be using selenium.
I don't know how to resolve the problem with Selenium, but I know how to get the data with requests and BeautifulSoup.
The page sends your form options as values directly in the URL of this page (a GET request, not a POST):
https://www1.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp
and the server sends back HTML with the table, plus all of the data in <div id="csvContentDiv">.
That tag holds all the data as text already formatted for CSV; it only needs : replaced with \n.
EDIT:
Sometimes the server gives me Status 405 Method Not Allowed, so I added requests.Session() to keep cookies; maybe it will work better.
import requests
from bs4 import BeautifulSoup
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
# --- main page ---
url = 'https://www1.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
r = session.get(url)
#print(r.status_code)
# --- table ---
url = 'https://www1.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp'
payload = {
    'instrumentType': 'OPTIDX',
    'symbol': 'NIFTY',
    'expiryDate': 'select',
    'optionType': 'CE',
    'strikePrice': '',
    'dateRange': '3month',
    'fromDate': '',
    'toDate': '',
    'segmentLink': '9',
    'symbolCount': '',
}
r = session.get(url, params=payload)
#print(r.text)
soup = BeautifulSoup(r.text, 'html.parser')
data = soup.find('div', {'id': 'csvContentDiv'})
#print(data.text)
data = data.text.replace(':', '\n')
with open('output.csv', 'w') as fh:
    fh.write(data)
print(data)
The server doesn't send the data if you don't use a real User-Agent header; at least the short 'Mozilla/5.0' is required.
I found this URL using DevTools in Firefox/Chrome, in the Network tab. I then fetched the response from this URL, checked it manually, and found the CSV data. I had expected that I would have to scrape the data from the HTML table in the response instead.
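If you would rather work with rows than raw text, Python's csv module can parse the file written above. A sketch, assuming the first line of csvContentDiv is a header row:
import csv

# read the file written by the script above; DictReader keys each row on the header line
with open('output.csv', newline='') as fh:
    rows = list(csv.DictReader(fh))

print(len(rows))
if rows:
    print(rows[0])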
I am trying to scrape a web page using the Python requests-html library.
The link to that web page is https://www.koyfin.com/charts/g/USADebt2GDP?view=table;
the data I want to get was circled in red in an image (not reproduced here).
My code is like this:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.koyfin.com/charts/g/USADebt2GDP?view=table')
r.html.render(timeout=60)
print(r.text)
The problem is that when I scrape the web page, I can't find the data I want, even though in the page's HTML (snippet not reproduced here) I can see the data inside the first div tags in the body section.
Any specific suggestions for how to solve this.
Thanks.
The problem is that the data is being loaded by JavaScript code after the initial page load. One solution is to use Selenium to drive a web browser to scrape the page. But using a regular browser I looked at the network requests that were being made and it appears that the data you seek is being loaded with the following AJAX call:
https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly
So:
import requests

response = requests.get('https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly')
results = response.json()
print(results)

for t in results['graph']['data']:
    print(t)
Prints:
{'ticker': 'USADebt2GDP', 'companyName': 'United States Gross Federal Debt to GDP', 'startDate': '1940-12-31T00:00:00.000Z', 'endDate': '2019-12-31T00:00:00.000Z', 'unit': 'percent', 'graph': {'column_names': ['Date', 'Volume'], 'data': [['2010-12-31', 91.4], ['2011-12-31', 96], ['2012-12-31', 100.1], ['2013-12-31', 101.2], ['2014-12-31', 103.2], ['2015-12-31', 100.8], ['2016-12-31', 105.8], ['2017-12-31', 105.4], ['2018-12-31', 106.1], ['2019-12-31', 106.9]]}, 'withoutLiveData': True}
['2010-12-31', 91.4]
['2011-12-31', 96]
['2012-12-31', 100.1]
['2013-12-31', 101.2]
['2014-12-31', 103.2]
['2015-12-31', 100.8]
['2016-12-31', 105.8]
['2017-12-31', 105.4]
['2018-12-31', 106.1]
['2019-12-31', 106.9]
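If you want those rows in tabular form, a sketch using pandas; the column_names field comes straight from the response shown above:
import pandas as pd
import requests

response = requests.get(
    'https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP'
    '?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly'
)
graph = response.json()['graph']
# build a DataFrame from the [Date, Volume] pairs in the payload
df = pd.DataFrame(graph['data'], columns=graph['column_names'])
print(df)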
How I Came Up with the URL
I watched the network requests in the browser's developer tools while the page loaded; clicking on the last request in the list shows the JSON response above (screenshots omitted).
I'm creating a scraper through ScraperWiki using Python, but I'm having an issue with the results I get. I'm basing my code off the basic example in ScraperWiki's docs, and everything seems very similar, so I'm not sure where my issue is. For my results, I get the first document's title/URL on the page, but there seems to be a problem with the loop, as it does not return the remaining documents after that one. Any advice is appreciated!
import scraperwiki
import requests
import lxml.html

html = requests.get("http://www.store.com/us/a/productDetail/a/910271.htm").content
dom = lxml.html.fromstring(html)

for entry in dom.cssselect('.downloads'):
    document = {
        'title': entry.cssselect('a')[0].text_content(),
        'url': entry.cssselect('a')[0].get('href')
    }
    print document
You need to iterate over the a tags inside the div with class downloads:
for entry in dom.cssselect('.downloads a'):
    document = {
        'title': entry.text_content(),
        'url': entry.get('href')
    }
    print document
Prints:
{'url': '/webassets/kpna/catalog/pdf/en/1012741_4.pdf', 'title': 'Rough In/Spec Sheet'}
{'url': '/webassets/kpna/catalog/pdf/en/1012741_2.pdf', 'title': 'Installation and Care Guide with Service Parts'}
{'url': '/webassets/kpna/catalog/pdf/en/1204921_2.pdf', 'title': 'Installation and Care Guide without Service Parts'}
{'url': '/webassets/kpna/catalog/pdf/en/1011610_2.pdf', 'title': 'Installation Guide without Service Parts'}
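Since the code follows ScraperWiki's basic example, you would normally save each document to the datastore rather than just print it. A sketch following the docs' pattern; keying on url is an assumption:
import scraperwiki
import requests
import lxml.html

html = requests.get("http://www.store.com/us/a/productDetail/a/910271.htm").content
dom = lxml.html.fromstring(html)

for entry in dom.cssselect('.downloads a'):
    document = {
        'title': entry.text_content(),
        'url': entry.get('href')
    }
    # persist each row in ScraperWiki's datastore, keyed on the url (assumed unique)
    scraperwiki.sqlite.save(unique_keys=['url'], data=document)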