Clicking a button while web scraping - Python

I am trying to scrape this web page:
https://www.camara.cl/pley/pley_detalle.aspx?prmID=13505&prmBL=712960-07
I want to obtain the information in the table contained in the Autores tab.
I have tried using this code:
button = browser.find_element_by_link_text('Autores')
button.click()
soup_level2 = BeautifulSoup(browser.page_source, 'lxml')
But the click is not working.

This should do it:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
url = "https://www.camara.cl/pley/pley_detalle.aspx?prmID=13505&prmBL=712960-07"
browser.get(url)  # navigate to the page
browser.find_element_by_id("ctl00_mainPlaceHolder_btnAutores").click()  # click the Autores button
innerHTML = browser.execute_script("return document.body.innerHTML")
soup_level2 = BeautifulSoup(innerHTML, 'html.parser')
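If the click still seems to have no effect, the button may not be clickable yet when the script reaches it, or the postback may not have finished rendering before the page source is read. A minimal sketch with explicit waits, assuming the click triggers a full ASP.NET postback that replaces the old DOM (so the old button goes stale); the 30-second timeout is an arbitrary choice:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 30)  # 30-second timeout, an arbitrary choice
button = wait.until(EC.element_to_be_clickable((By.ID, "ctl00_mainPlaceHolder_btnAutores")))
button.click()
wait.until(EC.staleness_of(button))  # wait until the postback swaps in the new DOM
soup_level2 = BeautifulSoup(browser.page_source, 'html.parser')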
PS: Your sister Vale is one of my thesis professors, small world!

Related

Selenium Web Scraping With Beautiful Soup on Dynamic Content and Hidden Data Table

Really need help from this community!
I am scraping dynamic content in Python using Selenium and Beautiful Soup.
The problem is that the pricing data table cannot be parsed into Python, even when using the following code:
html = browser.execute_script('return document.body.innerHTML')
sel_soup = BeautifulSoup(html, 'html.parser')
However, what I found later is that if I click the 'View All Prices' button on the web page before running the code above, I can parse that data table into Python.
My question is: how can I parse and access the hidden dynamic td tag info in Python without using Selenium to click all the 'View All Prices' buttons, because there are so many of them?
The URL of the website I am scraping is https://www.cruisecritic.com/cruiseto/cruiseitineraries.cfm?port=122,
and the attached picture shows the HTML of the dynamic data table that I need.
Really appreciate the help from this community!
You should target the element after it has loaded and take arguments[0], not the entire page via document:
html_of_interest = driver.execute_script('return arguments[0].innerHTML', element)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
This has two practical cases:
1. The element is not yet loaded in the DOM and you need to wait for it:
browser.get("url")
sleep(experimental) # usually get will finish only after the page is loaded but sometimes there is some JS woo running after on load time
try:
element= WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
print "element is ready do the thing!"
html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
print "Somethings wrong!"
2. The element is in a shadow root and you need to expand the shadow root first. This is probably not your situation, but I will mention it here since it is relevant for future reference. Example:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

driver.get("chrome://settings")
root1 = driver.find_element_by_tag_name('settings-ui')
html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # empty, the shadow root is not expanded yet
shadow_root1 = expand_shadow_element(root1)
html_of_interest = driver.execute_script('return arguments[0].innerHTML', shadow_root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # now contains the content behind the shadow root

Web scraping multiple pages when the URL remains the same (but given an Ajax response)

I am trying to webscrape all of the reviews for a specific book on Goodreads.com.
URL: https://www.goodreads.com/book/show/320.One_Hundred_Years_of_Solitude?ac=1&from_search=true
This worked out pretty successfully for the first page using Python and Beautiful Soup, but my problem is scraping the subsequent pages of reviews. I am having issues because each new page that is generated has the same URL (so I only get the reviews on page 1). When I inspect the HTML, it seems that the new pages are generated via an Ajax request.
<a class="previous_page" href="#" onclick="new Ajax.Request('/book/reviews/320.One_Hundred_Years_of_Solitude?authenticity_token=sZXyhbZUmjF0yvXFy3p2w3PllReMI02adUUeA5yOHzvY1ypaIv1z9e70UMgH1mDpx5FHr%2FakQ4rG7Ge5ZoD6zQ%3D%3D&amp;hide_last_page=true&amp;page=1', {asynchronous:true, evalScripts:true, method:'get', parameters:'authenticity_token=' + encodeURIComponent('4sfXlAmAjNZyCOAnywx+OVJZ1rHkR3E065/m/pbsTC6LhQ9LnSllEug2RSoHoGgT5i0ECZ7AfyRYNp9EbOKp2A==')}); return false;">« previous</a>
I am very new to web scraping in general and have no idea how to go about getting the information I need from this. Any pointers in the right direction would be awesome.
Thanks
If you're going to be "driving" the web page, then I would suggest using a WebDriver: https://www.seleniumhq.org/projects/webdriver/
A webdriver can open a "headless" browser that you can manipulate using Selenium's API. For example, in this case you would open the browser and navigate to your page by:
from selenium import webdriver
browser = webdriver.Firefox() # open a browser
browser.get("https://www.goodreads.com/book/show/320.One_Hundred_Years_of_Solitude?ac=1&from_search=true") # open your webpage
Now your browser object is on the page you are beautiful souping. You can use browser.page_source to get the HTML, and then soup it:
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
Then you can do whatever you want with your soup. When you're ready to get the next page of reviews, you can tell your browser to click the button, wait a second for it to load, then get the soup again:
import time

element = browser.find_element_by_id("your_element_id")
element.click()
time.sleep(3)  # sleep three seconds so the page can load
html = browser.page_source  # now this has the new reviews in it
soup = BeautifulSoup(html, 'html.parser')  # now you have soup again, but with the new reviews
You can throw this process in a loop until there are no more "next page" elements showing up.
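For example, a minimal sketch of that loop, assuming the "next page" link carries the class next_page, by analogy with the previous_page class visible in the snippet above (an assumption to verify against the live page):
from selenium.common.exceptions import NoSuchElementException

all_pages_html = []
while True:
    all_pages_html.append(browser.page_source)  # keep each page's HTML for souping
    try:
        next_btn = browser.find_element_by_class_name("next_page")  # assumed class
    except NoSuchElementException:
        break  # no "next page" link left, so this was the last page
    next_btn.click()
    time.sleep(3)  # give the Ajax request time to swap in the new reviews

soups = [BeautifulSoup(html, 'html.parser') for html in all_pages_html]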

Extract URL from a website including archived links

I'm crawling a news website to extract all links, including the archived ones, which is typical of a news website. The site here has a button, View More Stories, that loads more articles. Now, this code below
import requests
from bs4 import BeautifulSoup

def find_urls():
    start_url = "e.vnexpress.net/news/business"
    r = requests.get("http://" + start_url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    links = soup.findAll('a')
    url_list = []
    for url in links:
        all_link = url.get('href')
        if all_link and all_link.startswith('http://e.vnexpress.net/news/business'):  # skip anchors without an href
            url_list.append(all_link)
    return set(url_list)
successfully loads quite a few URLs, but how do I load more? Here is a snippet of the button:
<a href="javascript:void(0)" id="vnexpress_folder_load_more" data-page="2"
data-cate="1003895">
View more stories
</a>
Can someone help me out? Thanks.
You can use a browser automation tool like Selenium to click the button until it disappears or is disabled (see the loop sketch after the snippet below). Finally, you can scrape the entire fully loaded page with BeautifulSoup in one go.
from selenium import webdriver

# initializing browser
driver = webdriver.Firefox()
driver.set_window_size(1120, 550)
driver.get("http://e.vnexpress.net/news/news")
# run this till the button is no longer present
elem = driver.find_element_by_id('vnexpress_folder_load_more')
elem.click()
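A minimal sketch of that full loop, assuming the button is removed from the DOM (or stops being clickable) once everything is loaded; the 2-second pause is an arbitrary choice:
import time
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException

while True:
    try:
        driver.find_element_by_id('vnexpress_folder_load_more').click()
        time.sleep(2)  # give the Ajax call time to append the new stories
    except (NoSuchElementException, ElementNotInteractableException):
        break  # button gone or no longer clickable: all stories are loaded

# now scrape the fully expanded page in one go
soup = BeautifulSoup(driver.page_source, "html.parser")
urls = {a.get('href') for a in soup.findAll('a')
        if a.get('href') and a.get('href').startswith('http://e.vnexpress.net/news/business')}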

How to scrape text that only unlocks after clicking "more" button

I'm trying to scrape reviews from the TripAdvisor website. I succeeded in scraping the reviews, but some reviews are long and only partially shown until you click the "more" button.
This is the link of the website :
https://www.tripadvisor.ca/Hotel_Review-g190479-d3587956-Reviews-The_Thief-Oslo_Eastern_Norway.html#REVIEWS
This is how I grab the reviews from the page:
for item in soup.findAll(attrs={"class": "entry"}):
    review = item.text.replace(',', '').replace('\n', ' ').encode('utf-8').strip()
How do I manage to scrape all the reviews after the "more" button is clicked?
Try loading the page in Selenium. This would allow you to interact with the JavaScript. I haven't tried it with BeautifulSoup, but I think it would look something like this:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()  # or any other driver you want
browser.get('https://www.tripadvisor.ca/Hotel_Review-g190479-d3587956-Reviews-The_Thief-Oslo_Eastern_Norway.html#REVIEWS')
next_btn = browser.find_element_by_xpath('PATH_FOR_NEXT_LINK_ELEMENT')
next_btn.click()
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, 'html.parser')
review = soup("YOUR_SCRAPING_LOGIC")
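Since every collapsed review has its own "more" link, you will probably want to click all of them before grabbing the page source (i.e., before browser.quit()), rather than clicking a single button. A minimal sketch, assuming the expanders are span elements containing the text 'More'; the exact markup on TripAdvisor may differ, so treat the XPath as a placeholder:
import time
from selenium.common.exceptions import WebDriverException

# click every "More" expander so all reviews are fully expanded
for link in browser.find_elements_by_xpath("//span[contains(text(), 'More')]"):
    try:
        link.click()
        time.sleep(1)  # brief pause so the expanded text can render
    except WebDriverException:
        pass  # link may be stale or hidden; skip it

html_source = browser.page_source
soup = BeautifulSoup(html_source, 'html.parser')
for item in soup.findAll(attrs={"class": "entry"}):
    review = item.text.replace(',', '').replace('\n', ' ').strip()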
When you click the More link, JavaScript runs in the browser to fetch data or navigate to another link. requests only returns the raw HTML; it cannot execute the JavaScript.

Web scraping using Python - how to interact with an object on the web page

How do I expand the list (+) on the web page and get the title and timings? I'm new to web scraping, so kindly guide me.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://www.simplilearn.com/big-data-and-analytics/big-data-hadoop-architect-masters-program-training")
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
Using Selenium, it's very easy. You just have to find the XPath of the expand button first. Below is an example that expands the 'Big-Data and Hadoop Developer' section on the given page.
elem = driver.find_element_by_xpath('//*[@id="body_content"]/div[1]/div[7]/div[1]/div[1]/ul/li[1]/div[1]/span')
elem.click()
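After the click, the hidden rows are present in the DOM, so you can re-parse the page and pull the text out. A minimal sketch; the site's real class names are not shown in the question, so the selectors below are hypothetical placeholders to replace after inspecting the page:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "lxml")
# hypothetical selectors: inspect the expanded section for the real class names
for row in soup.select("ul li"):
    title = row.find("span", class_="lesson-title")       # hypothetical class
    timing = row.find("span", class_="lesson-duration")   # hypothetical class
    if title and timing:
        print(title.get_text(strip=True), timing.get_text(strip=True))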
