I am trying to extract book names from the O'Reilly Media website using Python and Beautiful Soup.
However, I see that the book names are not in the page source HTML.
I am using this link to see the books:
https://www.oreilly.com/search/?query=*&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&formats=book&formats=article&formats=journal&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=true&include_practice_exams=true
Attached is a screenshot that shows the webpage with the first two books, alongside the Chrome developer tools with arrows pointing to the elements I'd like to extract.
I looked at the page source but could not find the book names; maybe they are hidden inside some other links within the main HTML.
I tried opening some of the links inside the HTML and searched for the book names, but could not find anything.
Is it possible to extract the first or second book name from the website using Beautiful Soup?
If not, is there any other Python package that can do that? Maybe Selenium?
Or, as a last resort, any other tool...
If you look at the Network tab in the developer tools while the page loads, you can see the browser sending a request to an API, which returns JSON containing the books.
After some investigation, I found you can get the titles via:
import requests

response_json = requests.get(
    "https://www.oreilly.com/api/v2/search/?query=*&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&formats=book&formats=article&formats=journal&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=true&include_practice_exams=true&orm-service=search-frontend"
).json()

for book in response_json['results']:
    print(book['highlights']['title'][0])
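Note the page=0 parameter near the end of that URL: incrementing it should let you page through the rest of the results if you need more than the first batch.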
To solve this issue you need to know that Beautiful Soup can only deal with websites that serve plain HTML. For websites that build their pages with JavaScript, Beautiful Soup can't get all the page data you are looking for, because you need something browser-like to load the JavaScript-generated data.
That is where Selenium comes in: it opens a browser page and loads all of the page's data, and you can use the two together like this:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# These options make Selenium run in the background (headless, no visible window)
chrome_options = Options()
chrome_options.add_argument("--headless")

# You need to install the driver that matches your browser
driver = webdriver.Chrome('#Dir of the driver', options=chrome_options)
driver.get('#url')

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')  # the lxml package must be installed, but needs no import
With this you can get all the data you need. Don't forget to call the following at the end, to quit the Selenium browser running in the background:
driver.quit()
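As a minimal end-to-end sketch applied to the O'Reilly search page from the question (the h3 selector is an assumption; inspect the rendered page for the real element that wraps each title):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
# Selenium 4+ can locate the driver itself; pass the driver path here on older versions
driver = webdriver.Chrome(options=chrome_options)

driver.get("https://www.oreilly.com/search/?query=*")
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

# Hypothetical selector: adjust after inspecting the rendered page
for heading in soup.find_all("h3"):
    print(heading.get_text(strip=True))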
Simple question: why is it that when I inspect an element I see the data I want embedded within the JS tags, but when I go directly to the page source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay is using JavaScript to load content into the page. A workaround for this problem would be using something like Playwright or Selenium. I personally prefer the first option: it uses a Chromium browser to actually render the page contents, and so loads the JavaScript in the process.
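For example, a minimal sketch with Playwright's sync API (assuming you have run pip install playwright and playwright install chromium):

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.ebay.com/itm/272037717929")
    html = page.content()  # the DOM after JavaScript has run
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())  # the description text should now show up in this printout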
I was writing a scraping program. I first used Selenium to get an element's source (an mp4 file), but then I read that Selenium is mainly used for automation and testing, not scraping, so I thought using other scraper modules would be more reasonable. But when I use requests + BeautifulSoup or urllib2/3 + BeautifulSoup, I can't manage to get the inspected elements. They get the page source, but on the web page I'm working with, the page source is not the same as the HTML that pops up when I inspect it. (I don't know much about the difference between inspect and page source, but I guess it has something to do with JS.) Any ideas how I can solve this issue?
Here is my code:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://animefrenzy.org/stream/one-piece-episode-974")
soup = BeautifulSoup(response.text,"lxml")
print(soup)
Here is the HTML I want, as a string: (screenshot of the browser's Inspect view)
Here is the result I get when I execute the above code: (screenshot of the terminal output)
If you just want the HTML source, then here's the code to get it:
from selenium import webdriver
import time

# Use webdriver.Firefox(...) instead if you prefer Firefox
driver = webdriver.Chrome(executable_path=r'/path/to/webdriver')
driver.get('https://animefrenzy.org/stream/one-piece-episode-974')
time.sleep(10)  # give the page time to run its JavaScript

html = driver.page_source
print(html)
This should give you the HTML that you want. We used time.sleep(10) because the page has to load JavaScript and change the content of the page. If you are not getting the desired HTML, try increasing the sleep time a little so that the page loads completely.
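If you want something sturdier than a fixed sleep, Selenium's explicit waits can block until a specific element shows up. A sketch (the video tag used as the readiness signal is an assumption; adjust the locator to whatever element you actually need):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path=r'/path/to/webdriver')
driver.get('https://animefrenzy.org/stream/one-piece-episode-974')

# Wait up to 20 seconds for a <video> element to appear, then grab the source
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.TAG_NAME, 'video'))
)
html = driver.page_source
driver.quit()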
I am trying to parse the webpage below to get the names of stocks hitting all-time highs or lows on the exchange.
https://www.bseindia.com/markets/equity/EQReports/HighLow.html?Flag=H#
However, when I download the webpage using Beautiful Soup and check the data, I do not find the stock names or prices mentioned in the webpage.
I wish to write a function to download the stocks that hit a new all-time high each day. Please help; what am I missing?
Part of the HTML on the page is generated dynamically by JavaScript. You are most likely using the requests library, which cannot handle HTML generated in this way.
What you can do, instead, is use the Selenium library, which allows you to launch an instance of a web browser controlled by Python, and get the page source from there.
from selenium import webdriver

path = '...'  # path to driver here
url = 'https://www.bseindia.com/markets/equity/EQReports/HighLow.html?Flag=H#'

driver = webdriver.Chrome(path)
driver.get(url)  # get() returns None, so read page_source from the driver afterwards
page_source = driver.page_source
By parsing page_source with BeautifulSoup, you can get what you want.
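For example, a minimal sketch (the table selector is an assumption; inspect the rendered page for the real structure):

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'html.parser')

# Hypothetical selector: the highs/lows appear to be rendered as table rows
for row in soup.select('table tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
    if cells:
        print(cells)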
I have a flash-card-making program for Spanish that pulls information from here: http://www.spanishdict.com/examples/zorro (this is just an example). I've set it up so it gets the translations fine, but now I want to add examples. I noticed, however, that the examples on that page are dynamically generated, so I installed Beautiful Soup and the html5lib parser. The tag I'm specifically interested in is:
<span class="megaexamples-pair-part">Los perros siguieron el rastro del <span class="megaexamples-highlight">zorro</span>. </span>
The code I'm using to try and retrieve it is:
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("http://www.spanishdict.com/examples/zorro").read(), 'html5lib')
example = soup.findAll("span", {"class": "megaexamples-pair-part"})
However, no matter which way I swing it, I can't seem to get it to pull down the dynamically generated code. I have confirmed that I am getting the page, by searching for megaexamples-container, which shows up fine (and you can see it by right-clicking in Google Chrome and hitting View Page Source).
Any ideas?
What you're doing just pulls the static HTML page; the site is likely loading more data from the server via JavaScript calls.
You have two options:
Use a webdriver such as Selenium to control a web browser that correctly loads the entire page (you can then parse it with BeautifulSoup, or find elements with Selenium's own tools). This incurs some overhead due to the browser usage.
Use the Network tab of your browser's developer tools (usually opened with F12) to analyze the incoming and outgoing requests behind the dynamic loading, and use the requests module to replicate them. This is more efficient, but can also be trickier.
Remember to do this only if you have permission from the site's owner, though. In many cases it's against the ToS.
I used Pedro's answer to get me moving in the right direction. Here is what I did to get it to work:
Download selenium with pip install selenium
Download the driver for the browser you want to emulate. You can download them from this page. The driver must be in the PATH variable or you will need to specify the path in the constructor for the webdriver.
Import selenium with from selenium import webdriver
Now use the following code:
browser = webdriver.Chrome()
browser.get(input("Enter URL: "))  # raw_input on Python 2
html_source = browser.page_source
Note: If you did not put your driver in path, you have to call the constructor with browser = webdriver.Chrome(<PATH_TO_DRIVER_HERE>)
Note 2: You can use something like webdriver.Firefox() if you want a different browser.
Now you can parse it with something like: soup = BeautifulSoup(html_source, 'html5lib')
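From there, pulling the examples from the original question should just be a matter of repeating the earlier findAll against the fully rendered source:

for example in soup.findAll("span", {"class": "megaexamples-pair-part"}):
    print(example.get_text(strip=True))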
I am trying to scrape this page on Flipkart:
http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto+x+play&otracker=from-search
I am trying to find the div with the class "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco", but it returns an empty result.
from bs4 import BeautifulSoup
import requests
url = "http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto%20x%20play&otracker=from-search"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
divs = soup.find_all("div", {"class": "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco"})
print(divs)
divs is empty. I copied the class name using inspect element.
I found the answer in this question: http://stackoverflow.com/questions/22028775/tried-python-beautifulsoup-and-phantom-js-still-cant-scrape-websites
When you use requests.get(url) you load the HTML content of the url without JavaScript enabled. Without JavaScript enabled, the section of the page called 'customers who viewed this product also viewed' is never even rendered.
You can explore this behaviour by turning off JavaScript in your browser. If you scrape regularly, you might also want to download a JavaScript switcher plugin.
An alternative that you might want to look into is using a browser automation tool such as selenium.
requests.get(..) returns the content of a plain HTTP GET on that URL. None of the JavaScript resources the page references will be downloaded, and any inline JavaScript will not be executed either.
If Flipkart uses JS to modify the DOM after the page is loaded in the browser, those changes will not be reflected in the page.content or page.text values.
You could try a different parser instead of the default parser in Beautiful Soup. I tried html5lib and it worked for a different website; maybe it will for you too. It is slower than the default parser, but could be faster than Selenium or other full-fledged headless browsers.
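The parser swap is a one-line change. Note this only helps when the data is actually present in the raw HTML but being mis-parsed; html5lib does not execute JavaScript either:

from bs4 import BeautifulSoup
import requests

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html5lib')  # pip install html5lib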