How to scrape text that only unlocks after clicking a "more" button - Python

I'm trying to scrape reviews from the TripAdvisor website. I succeeded in scraping the reviews, but some reviews are long and are only partially shown until you click the "more" button.
This is the link to the website:
https://www.tripadvisor.ca/Hotel_Review-g190479-d3587956-Reviews-The_Thief-Oslo_Eastern_Norway.html#REVIEWS
This is the source code of the "more" button:
<span class=
This is how I grab the reviews from the page:
for item in soup.findAll(attrs={"class": "entry"}):
    review = item.text.replace(',', '').replace('\n', ' ').encode('utf-8').strip()
How do I manage to scrape all the reviews after the "more" button is clicked?

Try loading the page in Selenium. This would allow you to interact with JavaScript. I haven't tried it together with BeautifulSoup, but I think it would look something like this:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()  # or any other driver you want
browser.get('https://www.tripadvisor.ca/Hotel_Review-g190479-d3587956-Reviews-The_Thief-Oslo_Eastern_Norway.html#REVIEWS')

# click the "more" link (fill in the XPath for it)
next_btn = browser.find_element_by_xpath('PATH_FOR_NEXT_LINK_ELEMENT')
next_btn.click()

html_source = browser.page_source  # HTML after the click has run
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
review = soup("YOUR_SCRAPING_LOGIC")
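Note that page_source is read immediately after the click here; if the expanded text is loaded asynchronously, you may need a short wait between the click and reading page_source (for example time.sleep or an explicit WebDriverWait).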

When you click the More link, JavaScript runs in the browser to fetch the data or jump to another link. requests only returns the raw HTML; it cannot execute that JavaScript.
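Building on the answer above, a minimal sketch for expanding every truncated review before parsing might look like this (the XPath used to find the "More" links is an assumption; inspect the actual markup in your browser and adjust it):

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('https://www.tripadvisor.ca/Hotel_Review-g190479-d3587956-Reviews-The_Thief-Oslo_Eastern_Norway.html#REVIEWS')

# hypothetical locator: any span whose text is exactly "More"
for more_link in browser.find_elements_by_xpath('//span[text()="More"]'):
    try:
        more_link.click()  # expand this review's full text
    except Exception:
        pass  # the element may be stale or hidden; skip it

soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()

for item in soup.findAll(attrs={"class": "entry"}):
    review = item.text.replace(',', '').replace('\n', ' ').strip()
    print(review)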

Related

Clicking a button while web scraping

I am trying to do web scraping on this web page:
https://www.camara.cl/pley/pley_detalle.aspx?prmID=13505&prmBL=712960-07
I am trying to obtain the information in the table contained under Autores.
I have tried using this code:
button=browser.find_element_by_link_text('Autores')
button.click()
soup_level2=BeautifulSoup(browser.page_source, 'lxml')
But the click is not working.
This should do:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
url = "https://www.camara.cl/pley/pley_detalle.aspx?prmID=13505&prmBL=712960-07"
browser.get(url)  # navigate to the page
browser.find_element_by_id("ctl00_mainPlaceHolder_btnAutores").click()  # click the button by its id instead of the link text
innerHTML = browser.execute_script("return document.body.innerHTML")
soup_level2 = BeautifulSoup(innerHTML, 'html.parser')
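Since clicking that button triggers an ASP.NET postback, the table may not be in the DOM the instant click() returns. A minimal sketch of an explicit wait, assuming you look up the real id of the Autores table in the page source (the id below is a placeholder, not the actual one):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 10 seconds for the (hypothetical) table element to appear
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "ID_OF_AUTORES_TABLE")))
innerHTML = browser.execute_script("return document.body.innerHTML")
soup_level2 = BeautifulSoup(innerHTML, 'html.parser')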
PS: Your sister Vale is one of my thesis professors, small world!

Selenium Web Scraping With Beautiful Soup on Dynamic Content and Hidden Data Table

I really need help from this community!
I am doing web scraping on dynamic content in Python using Selenium and Beautiful Soup.
The thing is that the pricing data table cannot be parsed into Python, even when using the following code:
html = browser.execute_script('return document.body.innerHTML')
sel_soup = BeautifulSoup(html, 'html.parser')
However, what I found later is that if I click on the 'View All Prices' button on the web page before running the above code, I can parse that data table into Python.
My question is: how can I parse and get access to those hidden dynamic td tag contents in Python without using Selenium to click on all the 'View All Prices' buttons, because there are so many?
The URL of the website I am scraping is https://www.cruisecritic.com/cruiseto/cruiseitineraries.cfm?port=122,
and the attached picture shows the HTML of the dynamic data table that I need.
Really appreciate the help from this community!
You should target the element after it has loaded and take arguments[0], not the entire page via document:
html_of_interest = driver.execute_script('return arguments[0].innerHTML', element)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
This has 2 practical cases:
1. The element is not yet loaded in the DOM and you need to wait for it:
browser.get("url")
sleep(experimental) # usually get will finish only after the page is loaded but sometimes there is some JS woo running after on load time
try:
element= WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
print "element is ready do the thing!"
html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
print "Somethings wrong!"
2. The element is in a shadow root and you need to expand the shadow root first. This is probably not your situation, but I will mention it here since it is relevant for future reference. For example:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

def expand_shadow_element(element):
    # ask the browser for the element's shadow root
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

driver.get("chrome://settings")
root1 = driver.find_element_by_tag_name('settings-ui')
html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # empty: the shadow root is not expanded yet

shadow_root1 = expand_shadow_element(root1)
html_of_interest = driver.execute_script('return arguments[0].innerHTML', shadow_root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup
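Note that pages like chrome://settings nest shadow roots several levels deep, so you may need to find an element inside one expanded root and call expand_shadow_element on it again before you reach the node you want.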

Web scraping using Selenium and BeautifulSoup: trouble parsing and selecting a button

I am trying to web scrape the following website: url='https://angel.co/life-sciences'. The website contains more than 8000 entries. From this page I need information like the company name and link, joined date, and followers. Before that, I need to sort the followers column by clicking the button, then load more information by clicking the "more hidden" button. The "more hidden" content can be clicked at most 20 times; after that it doesn't load more information. But by sorting first I can take at least the top followers' information. Here I have implemented the click() event, but it's showing an error:
Unable to locate element: {"method":"xpath","selector":"//div[@class="column followers sortable sortable"]"} #before the edit this was my problem: I was using the wrong class name
So do I need to give more sleep time here? (I tried that but got the same error.)
I need to parse all the above information, then visit the individual links of those companies and scrape only the content div of each HTML page.
Please suggest me a way to do it.
Here is my current code; I have not added the HTML parsing part using BeautifulSoup.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
#import urllib2

driver = webdriver.Chrome()
url = 'https://angel.co/life-sciences'
driver.get(url)
sleep(10)

driver.find_element_by_xpath('//div[@class="column followers sortable"]').click()  # edited
sleep(5)
for i in range(2):
    driver.find_element_by_xpath('//div[@class="more hidden"]').click()
    sleep(8)
sleep(8)
element = driver.find_element_by_id("root").get_attribute('innerHTML')
#driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, 'more hidden')))
'''
results = html.find_elements_by_xpath('//div[@class="name"]')
# wait for the page to load
for result in results:
    startup = result.find_elements_by_xpath('.//a')
    link = startup.get_attribute('href')
    print(link)
'''
page_source = driver.page_source
html = BeautifulSoup(element, 'html.parser')
#for link in html.findAll('a', {'class': 'startup-link'}):
#    print(link)
divs = html.find_all("div", class_=" dts27 frw44 _a _jm")
The above code was working and was giving me the HTML source before I added the followers click event.
My final goal is to export all five pieces of information, the name of the company, its link, the joined date, the number of followers, and the company description (to be obtained after visiting their individual links), into a CSV or XLS file.
Help and comments are appreciated.
This is my first Python and Selenium work, so I'm a little confused and need guidance.
Thanks :-)
The click method is intended to emulate a mouse click; it's for use on elements that can be clicked, such as buttons, drop-down lists, check boxes, etc. You have applied this method to a div element, which is not clickable. Elements like div, span, frame and so on are used to organise HTML and provide decoration, fonts, etc.
To make this code work you will need to identify the elements in the page that are actually clickable.
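One common pattern for finding out when an element is actually ready to receive a click is Selenium's element_to_be_clickable wait. A minimal sketch, reusing the asker's own locator:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 10 seconds until the followers column is reported clickable, then click it
sort_btn = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//div[@class="column followers sortable"]')))
sort_btn.click()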
Oops, my typing mistake, or some silly mistake here: I was using the wrong div class name. It is "column followers sortable"; instead I was using "column followers sortable selected". :-(
Now the above works pretty well, but can anyone guide me with the BeautifulSoup HTML parsing part?
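As a rough sketch of the parsing step: once the fully expanded page source is in BeautifulSoup, walk each result row, pull out the fields, then write them out with the csv module. The row-container class and "startup-link" come from the question; the "joined" and "followers" class names are assumptions for illustration, so inspect the live markup for the real ones:

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')

rows = []
for div in soup.find_all("div", class_=" dts27 frw44 _a _jm"):  # row container from the question
    name_tag = div.find("a", class_="startup-link")  # class from the question's commented code
    joined = div.find("div", class_="joined")        # hypothetical class name
    followers = div.find("div", class_="followers")  # hypothetical class name
    if name_tag:
        rows.append({
            "name": name_tag.get_text(strip=True),
            "link": name_tag.get("href"),
            "joined": joined.get_text(strip=True) if joined else None,
            "followers": followers.get_text(strip=True) if followers else None,
        })

with open("startups.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "link", "joined", "followers"])
    writer.writeheader()
    writer.writerows(rows)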

Web scraping multiple pages when the URL remains the same (but given an AJAX response)

I am trying to web scrape all of the reviews for a specific book on Goodreads.com.
url = https://www.goodreads.com/book/show/320.One_Hundred_Years_of_Solitude?ac=1&from_search=true
This worked out pretty successfully for the first page using Python and Beautiful Soup, but my problem is trying to scrape the subsequent pages of reviews. I am having issues because each new page that is generated has the same URL (so I only get the reviews on page 1). When I inspect the HTML, it seems that the new pages are generated via an AJAX request:
<a class="previous_page" href="#" onclick="new Ajax.Request('/book/reviews/320.One_Hundred_Years_of_Solitude?authenticity_token=sZXyhbZUmjF0yvXFy3p2w3PllReMI02adUUeA5yOHzvY1ypaIv1z9e70UMgH1mDpx5FHr%2FakQ4rG7Ge5ZoD6zQ%3D%3D&amp;hide_last_page=true&amp;page=1', {asynchronous:true, evalScripts:true, method:'get', parameters:'authenticity_token=' + encodeURIComponent('4sfXlAmAjNZyCOAnywx+OVJZ1rHkR3E065/m/pbsTC6LhQ9LnSllEug2RSoHoGgT5i0ECZ7AfyRYNp9EbOKp2A==')}); return false;">« previous</a>
I am very new to web scraping in general and have no idea how to go about getting the information I need from this. Any pointers in the right direction would be awesome.
Thanks
If you're going to be "driving" the web page, then I would suggest using a webdriver: https://www.seleniumhq.org/projects/webdriver/
A webdriver can open a "headless" browser that you can manipulate using Selenium's API. For example, in this case you would open the browser and navigate to your page like this:
from selenium import webdriver
browser = webdriver.Firefox() # open a browser
browser.get("https://www.goodreads.com/book/show/320.One_Hundred_Years_of_Solitude?ac=1&from_search=true") # open your webpage
Now your browser object is on the page you are beautiful souping. You can use browser.page_source to get the HTML, and then soup it:
from bs4 import BeautifulSoup

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
Then you can do whatever you want with your soup. When you're ready to get the next page of reviews, you can tell your browser to click the button, wait a second for it to load, then get the soup again:
import time

element = browser.find_element_by_id("your_element_id")
element.click()
time.sleep(3)  # sleep three seconds so the page can load
html = browser.page_source  # now this has the new reviews in it
soup = BeautifulSoup(html, 'html.parser')  # now you have soup again, but with new reviews
You can throw this process in a loop until there are no more "next page" elements showing up.
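A minimal sketch of that loop, assuming the next-page link carries a next_page class (by analogy with the previous_page class visible in the snippet above; confirm it on the live page):

import time
from bs4 import BeautifulSoup

all_review_divs = []
while True:
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    all_review_divs.extend(soup.find_all("div", class_="review"))  # hypothetical review container class
    next_links = browser.find_elements_by_class_name("next_page")  # assumed class name
    if not next_links:
        break  # no more pages of reviews
    next_links[0].click()
    time.sleep(3)  # give the AJAX request time to finish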

Extract URLs from a website, including archived links

I'm crawling a news website to extract all links, including the archived ones, which is typical of a news website. The site here has a button, View More Stories, that loads more articles. Now the code below
import requests
from bs4 import BeautifulSoup

def find_urls():
    start_url = "e.vnexpress.net/news/business"
    r = requests.get("http://" + start_url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    links = soup.findAll('a')
    url_list = []
    for url in links:
        all_link = url.get('href')
        if all_link and all_link.startswith('http://e.vnexpress.net/news/business'):
            url_list.append(all_link)
    return set(url_list)
successfully loads quite a few URLs, but how do I load more? Here is a snippet of the button:
<a href="javascript:void(0)" id="vnexpress_folder_load_more" data-page="2"
data-cate="1003895">
View more stories
</a>
Can someone help me out? Thanks.
You can use a browser driven by Selenium to click the button until it disappears or is disabled. Finally, you can scrape the entire page using BeautifulSoup in one go.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# initializing browser
driver = webdriver.Firefox()
driver.set_window_size(1120, 550)
driver.get("http://e.vnexpress.net/news/news")

# run this while the button is still present
elem = driver.find_element_by_id('vnexpress_folder_load_more')
elem.click()
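A sketch of that click-until-gone loop, using find_elements (plural), which returns an empty list instead of raising an exception once the button is no longer there:

from time import sleep

while True:
    buttons = driver.find_elements_by_id('vnexpress_folder_load_more')
    if not buttons or not buttons[0].is_displayed():
        break  # button gone or hidden: all stories are loaded
    buttons[0].click()
    sleep(2)  # give the new stories time to load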
