Python requests: get the entire HTML page, instead of only the initially loaded content

I am trying to get data from reviews publicly available on the Play Store, and since the official API only allows fetching reviews for one's own apps, I am trying to scrape them from the web.
I am using the requests package to get the HTML page of a given app on the Play Store, and will use BeautifulSoup to parse it and save it to a file, to then extract the relevant content (each user's rating and comment).
My issue is that requests.get(URL) does not retrieve the entire content of the page.
Navigating to "Read All Reviews" for an app on the Play Store brings you to a page with all of that app's reviews. Unfortunately, only a limited set of reviews loads when the page is first opened; the rest load only upon scrolling down to the bottom. Calling requests.get(URL) therefore retrieves only that limited initial set, instead of all the reviews.
Try navigating to https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true and notice that older reviews load only when you scroll to the bottom of the page.
Is there a way to access the entire page, trigger the loading of more reviews, or simulate the scrolling?
Below is my code:
import requests
from bs4 import BeautifulSoup

# get reviews for the Thirty Days of Fitness app
URL = "https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true"
# make the request
request = requests.get(URL)
# extract the HTML text
raw_text = request.text
# parse the HTML and prettify it
soup = BeautifulSoup(raw_text, 'html.parser')
text = soup.prettify()
# write to file
save_path = './thirtydayfitness_html.txt'
with open(save_path, 'w+', encoding=request.encoding) as f:
    f.write(text)

I would consider using a web driver (e.g. Selenium) to scroll down, like so:
import time

# `driver` is assumed to be a Selenium WebDriver instance pointed at the reviews page

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
Reference: How can I scroll a web page using selenium webdriver in python?
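Putting the question's parsing code and the scrolling together, a minimal sketch (assuming ChromeDriver is installed; the fully loaded page is then read from driver.page_source instead of the initial requests response):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true"
SCROLL_PAUSE_TIME = 0.5

driver = webdriver.Chrome()  # assumes ChromeDriver is on the PATH
driver.get(URL)

# scroll until the page height stops growing, i.e. no more reviews load
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# parse the fully loaded page rather than the initial response
soup = BeautifulSoup(driver.page_source, 'html.parser')
with open('./thirtydayfitness_html.txt', 'w+', encoding='utf-8') as f:
    f.write(soup.prettify())
driver.quit()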

Related

Cannot scroll down and scrape a dynamic web page

I am trying to scrape news from the Sky News website, which has a dynamically scrolling page (the link to the site I am trying to scrape: https://www.skynewsarabia.com/live-story/1345832-%D9%85%D8%B3%D8%AA%D8%AC%D8%AF%D8%A7%D8%AA-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7-%D8%A7%D9%94%D8%AD%D8%AF%D8%AB-%D8%A7%D9%94%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85).
I am using Selenium to scroll down the page and BeautifulSoup to scrape the information.
The problem is that my code only returns the first five news items and does not perform the scrolling. However, the same code works well on other websites, scrolling down the page as expected.
Following is the code I am using to scroll the site and scrape the info:
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.skynewsarabia.com/live-story/1345832-%D9%85%D8%B3%D8%AA%D8%AC%D8%AF%D8%A7%D8%AA-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7-%D8%A7%D9%94%D8%AD%D8%AF%D8%AB-%D8%A7%D9%94%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85"
driver = webdriver.Chrome(executable_path=r"C:\Users\Acoe\Documents\yaman\Scraping\chromedriver.exe")
driver.get(url)
time.sleep(3)

fake_news = []
previous_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # scroll down the web page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(3)
    new_height = driver.execute_script('return document.body.scrollHeight')

    ## Start the scraping
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    headlines = soup.find('body').find_all('h2')
    for x in headlines:
        if len(x.text.strip()) > 20:
            fake_news.append(x.text.strip())
            print(x.text.strip())

    if new_height == previous_height:
        break
    previous_height = new_height
Does anyone have any clue why the code cannot scroll down this website specifically, since it works well on any other website with the same properties? (e.g. https://www.reddit.com/search/?q=covid19)
The problem with this page is that it will not load new articles when the scroll position is already 100% at the bottom. You can solve this by incrementing the scroll coordinate a little at a time, instead of jumping straight to the bottom of the page.
scrolled = 500
while True:
    driver.execute_script(f'window.scrollTo(0, {scrolled});')
    scrolled += 500
    time.sleep(1)
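As written, that loop never terminates; one way to add a stop condition (assuming, as in the question's code, that the page height stops growing once everything has loaded) is:

import time

scrolled = 500
previous_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script(f'window.scrollTo(0, {scrolled});')
    scrolled += 500
    time.sleep(1)
    # stop once we have scrolled past a page height that is no longer growing
    new_height = driver.execute_script('return document.body.scrollHeight')
    if scrolled > new_height and new_height == previous_height:
        break
    previous_height = new_height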

How do I scrape all Twitter followers using Selenium?

I'm trying to scrape all the Twitter followers from my profile and then save them to a text file, but I don't know where to look to do this effectively, as I'm a novice at Python.
What am I doing wrong? :(
See here. This is what I want to scrape
This is the code:
from time import sleep

# `driver` is assumed to be a logged-in Selenium WebDriver on the followers page

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    sleep(5)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

for twusernames in driver.find_elements_by_xpath('//div[@aria-label="Timeline: Followers"]//a[@role="link"]'):
    print(twusernames.get_property('href'))
    file = open("usernames.txt", "a")
    file.write(twusernames.get_property('href'))
    file.write("\n")
    file.close()
So what it does is scroll down so that new followers load, and then scrape them, but it's messed up: it picks up other unnecessary links too, and it misses some followers.
Thanks for the help.
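One way to cut down on the noise is to filter the collected hrefs; a sketch, assuming (hypothetically) that follower profile links are bare https://twitter.com/<username> URLs with no further path segments, so verify this against the actual markup:

import re

# keep only links that look like a bare profile URL, e.g. https://twitter.com/someuser
profile_pattern = re.compile(r'^https://twitter\.com/[A-Za-z0-9_]+$')

links = driver.find_elements_by_xpath(
    '//div[@aria-label="Timeline: Followers"]//a[@role="link"]')
# a set also deduplicates profiles picked up twice across scrolls
usernames = {link.get_property('href') for link in links
             if profile_pattern.match(link.get_property('href') or '')}

with open("usernames.txt", "a", encoding="utf-8") as f:
    for href in sorted(usernames):
        f.write(href + "\n")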

After using Selenium to load the whole page, why does my code stop scraping after the first 100 items?

I am trying to get a list of all the movies/series on my personal IMDb watchlist. I am using Selenium to click the load-more button so that everything shows up in the HTML code. However, when I try to scrape that data, only the first 100 movies show up.
Nothing past 'page3' shows up.
(A screenshot in the original question showed the part of the HTML that marks page 3; image omitted here.)
After clicking the load-more button with Selenium, all the movies are shown in the Chrome window. However, only the first 100 of 138 are printed to my console.
Here is the URL: https://www.imdb.com/user/ur130279232/watchlist
This is my current code:
URL = "https://www.imdb.com/user/ur130279232/watchlist"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,20)
driver.get(URL)
while True:
try:
watchlist = driver.find_element_by_xpath("//div[#class='lister-list mode-detail']")
watchlistHTML = watchlist.get_attribute('innerHTML')
loadMoreButton = driver.find_element_by_xpath("//button[#class='load-more']")
soup = BeautifulSoup(watchlistHTML, 'html.parser')
content = soup.find_all('h3', class_ ='lister-item-header')
#pdb.set_trace()
print('length: ',len(content))
for elem in content:
print(elem.find('a').contents[0])
time.sleep(2)
loadMoreButton.click()
time.sleep(5)
except Exception as e:
print(e)
break
Even though, after clicking the load-more button, "lister-list mode-detail" includes everything up until Sound of Music?
The rest of the data is returned by an HTTP GET to the following URL (scroll down and hit Load more to see it fire in the browser's network tab):
https://www.imdb.com/title/data?ids=tt0144117,tt0116996,tt0106179,tt0118589,tt11252090,tt13132030,tt6083778,tt0106611,tt0115685,tt1959563,tt8385148,tt0118971,tt0340855,tt8629748,tt13932270,tt11185940,tt5580390,tt4975722,tt2024544,tt1024648,tt1504320,tt1010048,tt0169547,tt0138097,tt0112573,tt0109830,tt0108052,tt0097239,tt0079417,tt0071562,tt0068646,tt0070735,tt0067116,tt0059742,tt0107207,tt0097937&tracking_tag=&pageId=ls089853956&pageType=list&subpageType=watchlist
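As a quick check, that endpoint can be fetched directly with requests; a sketch (the ids list is truncated here, and the response format is undocumented, so inspect the payload before building a parser on it):

import requests

# the full ids list is the one captured from the network tab above; for other
# watchlists you would need to capture your own request URL the same way
data_url = ("https://www.imdb.com/title/data?ids=tt0144117,tt0116996"
            "&tracking_tag=&pageId=ls089853956&pageType=list&subpageType=watchlist")

resp = requests.get(data_url, headers={"User-Agent": "Mozilla/5.0"})
print(resp.status_code, resp.headers.get("content-type"))
print(resp.text[:500])  # inspect the payload before deciding how to parse it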
What @balderman mentioned works if you can access the HTTP GET endpoint.
The main thing is that the titles load lazily: the later ones don't load until the earlier ones have. I don't know whether they only load in certain regions, but a janky way around it is to programmatically scroll through the page and let everything load.
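A sketch of that programmatic scroll, assuming the lazy loader fires as the viewport moves down the list:

import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/user/ur130279232/watchlist")
time.sleep(3)

# step down the page a screenful at a time so every lazy-loaded row gets rendered
position = 0
page_height = driver.execute_script("return document.body.scrollHeight")
while position < page_height:
    position += 700
    driver.execute_script(f"window.scrollTo(0, {position});")
    time.sleep(1)
    page_height = driver.execute_script("return document.body.scrollHeight")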

Scrape dynamic HTML (YouTube comments)

With the Beautiful Soup and requests libraries I am able to scrape static HTML content, but not content that is loaded by JavaScript or AJAX calls.
How do I mimic this in my Python script? YouTube comments only load when you scroll the page. I found two methods, one using Selenium and another using lxml and requests, neither of which I could understand.
Example (this is the video):
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.youtube.com/watch?v=iFPMz36std4'
response = requests.get(url)
page_html = response.content
# print(page_html)
page_soup = soup(page_html, "html.parser")
print(page_soup)
You need to use Selenium.
Here is a trick: YouTube only loads comments when you scroll to just below the video; if you jump straight to the bottom or anywhere else, the comments will not load. So first scroll to that spot, wait for the comments to load, and after that scroll to the bottom or wherever you want:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
driver.execute_script('window.scrollTo(1, 500);')
# now wait and let the comments load
time.sleep(5)
driver.execute_script('window.scrollTo(1, 3000);')
comment_div = driver.find_element_by_xpath('//*[@id="contents"]')
comments = comment_div.find_elements_by_xpath('//*[@id="content-text"]')
for comment in comments:
    print(comment.text)
Part of the output:
# can't post the full output, it's too long
I love Kygo's Stranger Things and Netflix's Stranger Things <3
Stranger Things, Kygo and OneRepublic, could it be better?
Amazing Vibe!!!!!!!!!🔥🔥🔥🔥
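A possible refinement (not part of the original answer): replace the fixed time.sleep with one of Selenium's explicit waits, so the script continues as soon as the comments actually appear:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
driver.execute_script('window.scrollTo(1, 500);')

# wait until at least one comment body is present instead of sleeping blindly
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#content-text')))

for comment in driver.find_elements_by_css_selector('#content-text'):
    print(comment.text)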
Using Selenium would do the trick, though I have a different way of scrolling down. This function scrolls down by repeatedly executing JavaScript and checking whether the page height changed between the current and the previous scroll.
import time

from bs4 import BeautifulSoup
from selenium import webdriver

def scrollDown(pause, driver):
    """
    Function to scroll down till the end of the page.
    """
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            break
        lastHeight = newHeight

# Main Code
# Instantiate browser and navigate to page
driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
scrollDown(6, driver)

# Page soup
soup = BeautifulSoup(driver.page_source, "html.parser")
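From there the comments can be pulled out of the soup; a sketch assuming, as in the previous answer, that comment bodies carry the id content-text (YouTube's markup changes often, so verify the selector):

# yt-formatted-string elements with id="content-text" hold the comment bodies
for node in soup.select('#content-text'):
    text = node.get_text(strip=True)
    if text:
        print(text)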

Scroll over a website using PhantomJS and Selenium

I need to scroll down a web page (for example Twitter) and scrape the new elements that appear as one advances down the site. I am trying to do this using Python 3.x, Selenium, and PhantomJS. This is my code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

user = 'ciroylospersas'

# Start web browser
#browser = webdriver.Firefox()
browser = webdriver.PhantomJS()
browser.set_window_size(1024, 768)
browser.get("https://twitter.com/")

# Fill username in login
element = browser.find_element_by_id("signin-email")
element.clear()
element.send_keys('your twitter user')

# Fill password in login
element = browser.find_element_by_id("signin-password")
element.clear()
element.send_keys('your twitter pass')

browser.save_screenshot('screen.png')  # save a screenshot to disk

# Submit the login
element.submit()
time.sleep(5)
browser.save_screenshot('screen1.png')  # save a screenshot to disk

# Move to the following url
browser.get("https://twitter.com/" + user + "/following")
browser.save_screenshot('screen2.png')  # save a screenshot to disk

scroll_script = "var h = document.body.scrollHeight; window.scrollTo(0, h); return h;"
newHeight = browser.execute_script(scroll_script)
print(newHeight)
browser.save_screenshot('screen3.png')  # save a screenshot to disk
The problem is that I can't scroll to the bottom: screen2.png and screen3.png are identical. But if I change the webdriver from PhantomJS to Firefox, the same code works fine. Why?
I was able to get this to work in PhantomJS when trying to solve a similar problem:
check_height = browser.execute_script("return document.body.scrollHeight;")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
It will scroll to the current "bottom", wait, see if the page loaded more, and bail if it did not (assuming everything got loaded if the heights match.)
In my original code I had a "max" value I checked alongside the matching heights because I was only interested in the first 10 or so "pages". If there were more I wanted it to stop loading and skip them.
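A sketch of that variant, with a hypothetical MAX_SCROLLS cap layered on top of the height check:

import time

MAX_SCROLLS = 10  # hypothetical cap: only interested in the first ~10 "pages"

check_height = browser.execute_script("return document.body.scrollHeight;")
for _ in range(MAX_SCROLLS):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break  # nothing new loaded; assume we reached the real bottom
    check_height = height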
Also, this is the answer I used as an example
