I am trying to scrape news from the Sky News website, which has a dynamically scrolling page (the link for the website I am trying to scrape: https://www.skynewsarabia.com/live-story/1345832-%D9%85%D8%B3%D8%AA%D8%AC%D8%AF%D8%A7%D8%AA-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7-%D8%A7%D9%94%D8%AD%D8%AF%D8%AB-%D8%A7%D9%94%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85).
I am using Selenium to scroll down the page and BeautifulSoup to scrape information.
The problem is that my code only returns the first five news items and does not perform the scrolling. However, the same code works well on other websites, scrolling down the page as expected.
The following is the code I am using to scroll the site and scrape the info:
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import time

url = "https://www.skynewsarabia.com/live-story/1345832-%D9%85%D8%B3%D8%AA%D8%AC%D8%AF%D8%A7%D8%AA-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7-%D8%A7%D9%94%D8%AD%D8%AF%D8%AB-%D8%A7%D9%94%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85"
driver = webdriver.Chrome(executable_path=r"C:\Users\Acoe\Documents\yaman\Scraping\chromedriver.exe")
driver.get(url)
time.sleep(3)

fake_news = []
previous_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # scroll down the web page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(3)
    new_height = driver.execute_script('return document.body.scrollHeight')
    ## start the scraping
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    headlines = soup.find('body').find_all('h2')
    for x in headlines:
        if len(x.text.strip()) > 20:
            fake_news.append(x.text.strip())
            print(x.text.strip())
    if new_height == previous_height:
        break
    previous_height = new_height
Does anyone have any clue why the code cannot scroll down this website specifically, given that it works well on other websites with the same properties (e.g. https://www.reddit.com/search/?q=covid19)?
The problem with this page is that it will not load new articles when the scroll position is already 100% at the bottom. You can solve this by incrementing the scroll position by a small amount each time, instead of jumping straight to the bottom of the page.
scrolled = 500
while True:
    driver.execute_script(f'window.scrollTo(0, {scrolled});')
    scrolled += 500
    time.sleep(1)
    # stop once we have scrolled past the full height of the page
    if scrolled >= driver.execute_script('return document.body.scrollHeight'):
        break
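Separately, note that the scraping step in the question calls requests.get(url), which downloads a fresh, unscrolled copy of the page, so whatever Selenium has scrolled into view never reaches BeautifulSoup. A minimal sketch of the incremental scroll combined with parsing driver.page_source instead, reusing the url, driver, and h2 filter from the question:

import time
from bs4 import BeautifulSoup

scrolled = 500
while scrolled < driver.execute_script('return document.body.scrollHeight'):
    driver.execute_script(f'window.scrollTo(0, {scrolled});')
    scrolled += 500
    time.sleep(1)

# parse the rendered DOM from the driver, not a fresh download
soup = BeautifulSoup(driver.page_source, 'html.parser')
headlines = [h.text.strip() for h in soup.find('body').find_all('h2')
             if len(h.text.strip()) > 20]
print(headlines)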
The webpage is: https://www.vpgame.com/market/gold?order_type=pro_price&order=desc&offset=0
As you can see, there are 25 items in the selling part of this page; when you click one of them, it opens a new tab and shows you that specific item's details.
Now I want to write a program to get those 25 item URLs and save them in a list. My problem is that, as you can see in the page inspector, their tags are <div> elements where you would expect <a> tags, and I can't find any 'href' attributes related to them.
# using selenium and driver = webdriver.Chrome()
links = driver.find_elements_by_tag_name('a')
urls = [l.get_attribute('href') for l in links]
I thought I could do it with the above code, but the problem is what I said above. Any suggestions?
It looks like you are trying to scrape a page that is powered by React. There are no href attributes because JavaScript is handling all of the linking. Your best bet is to use Selenium to execute a click on each of the div elements, switch to the newly opened tab, and use something like this code to get the URL of the page it has taken you to:
import time

links = driver.find_elements_by_class_name('card-header')
urls = []
for link in links:
    link.click()  # opens the item details in a new tab
    driver.switch_to.window(driver.window_handles[1])
    urls.append(driver.current_url)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
    time.sleep(1)
Note that the code closes the new tab each time and goes back to the main tab. I added time.sleep() so it doesn't go too fast.
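One caveat: there can be a brief delay between the click and the new tab being registered, so window_handles[1] may not exist yet when you switch. A sketch that waits for the second handle before switching (WebDriverWait accepts any callable as a condition):

from selenium.webdriver.support.ui import WebDriverWait

link.click()
# wait up to 10 seconds for the new tab to appear in window_handles
WebDriverWait(driver, 10).until(lambda d: len(d.window_handles) > 1)
driver.switch_to.window(driver.window_handles[1])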
I am trying to get some data from reviews publicly available on the Play Store, and since the provided API only allows getting reviews for one's own apps, I am trying to scrape it from the web.
I am using the requests package to get the HTML page of a given app on the Play Store and will use BeautifulSoup to parse it and save it to file, to then extract the relevant content (the rating and comment of each user).
My issue is that not the entire content of the page is retrieved with requests.get(URL).
Navigating to "Read All Reviews" for an app on the Play Store, one gets a page with all reviews for that app. Unfortunately, though, only a limited set of reviews loads when the page first loads, while the rest of the reviews load only upon scrolling to the bottom. By calling requests.get(URL), only that limited set of reviews is retrieved, instead of all reviews.
Try navigating to https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true and see older reviews load only when scrolling to the bottom of the page.
Is there a way to access the entire page, trigger the loading of more reviews, or simulate the scrolling?
Below is my code:
# get reviews for Thirty Days of Fitness app
URL = "https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true"
# make request
request = requests.get(URL)
# extract HTML text
raw_text = request.text
# parse HTML and prettify
soup = BeautifulSoup(raw_text, 'html.parser')
text = soup.prettify()
# write to file
save_path = './thirtydayfitness_html.txt'
with open(save_path, 'w+', encoding=request.encoding) as f:
f.write(text)
I would consider using a web driver to scroll down, like so:
SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
Reference: How can I scroll a web page using selenium webdriver in python?
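To save the fully loaded page, the rendered DOM can then be taken from the driver rather than from requests. A minimal sketch, assuming the scroll loop above has already run and reusing the URL and file path from the question:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get(URL)  # the showAllReviews URL from the question

# ... run the scroll loop above ...

# driver.page_source now contains the reviews loaded by scrolling
soup = BeautifulSoup(driver.page_source, 'html.parser')
with open('./thirtydayfitness_html.txt', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())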
With the Beautiful Soup and Requests libraries I am able to scrape HTML content, but not what is loaded by JavaScript or AJAX calls.
How do I mimic this in my Python script? YouTube comments only load when we scroll the page. I found two methods, one using Selenium and another using lxml requests, which I couldn't quite understand.
Example (this is the video):
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.youtube.com/watch?v=iFPMz36std4'
response = requests.get(url)
page_html = response.content
# print(page_html)
page_soup = soup(page_html, "html.parser")
print(page_soup)
You need to use Selenium.
Here is a trick: YouTube only loads comments when you scroll to just below the video. If you scroll straight to the bottom or somewhere else, the comments will not load. So first scroll to that part just below the video, wait for the comments to load, and after that scroll to the bottom or wherever you want:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
driver.execute_script('window.scrollTo(1, 500);')

# now wait and let the comments load
time.sleep(5)
driver.execute_script('window.scrollTo(1, 3000);')

comment_div = driver.find_element_by_xpath('//*[@id="contents"]')
comments = comment_div.find_elements_by_xpath('//*[@id="content-text"]')
for comment in comments:
    print(comment.text)
Part of the output (I can't post the full output; it's too long):
I love Kygo's Stranger Things and Netflix's Stranger Things <3
Stranger Things, Kygo and OneRepublic, could it be better?
Amazing Vibe!!!!!!!!!🔥🔥🔥🔥
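If the fixed time.sleep(5) ever proves flaky, an explicit wait is a common alternative. A sketch using Selenium's WebDriverWait to block until the comment bodies (the same id="content-text" elements targeted above) are present:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.execute_script('window.scrollTo(1, 500);')
# wait up to 10 seconds for at least one comment body to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="content-text"]'))
)
comments = driver.find_elements_by_xpath('//*[@id="content-text"]')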
Using Selenium would do the trick.
I have a different way of scrolling down, though. This function scrolls down by repeatedly executing JavaScript and checking whether the page height changed between the current and the previous scroll.
def scrollDown(pause, driver):
    """
    Function to scroll down till end of page.
    """
    import time
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            break
        lastHeight = newHeight
# Main code
from selenium import webdriver
from bs4 import BeautifulSoup

# Instantiate the browser and navigate to the page
driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
scrollDown(6, driver)

# Page soup
soup = BeautifulSoup(driver.page_source, "html.parser")
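From there, the comment text can be pulled out of the soup. A small sketch, reusing the id="content-text" elements from the previous answer:

# every YouTube comment body carries id="content-text"
for comment in soup.select('#content-text'):
    print(comment.get_text(strip=True))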
I need to scroll through a web page (for example Twitter) and scrape the new elements that appear as one advances down the site. I am trying to do this using Python 3.x, Selenium, and PhantomJS. This is my code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
user = 'ciroylospersas'
# Start web browser
#browser = webdriver.Firefox()
browser = webdriver.PhantomJS()
browser.set_window_size(1024, 768)
browser.get("https://twitter.com/")
# Fill username in login
element = browser.find_element_by_id("signin-email")
element.clear()
element.send_keys('your twitter user')
# Fill password in login
element = browser.find_element_by_id("signin-password")
element.clear()
element.send_keys('your twitter pass')
browser.save_screenshot('screen.png') # save a screenshot to disk
# Submit the login
element.submit()
time.sleep(5)
browser.save_screenshot('screen1.png') # save a screenshot to disk
# Move to the following url
browser.get("https://twitter.com/" + user + "/following")
browser.save_screenshot('screen2.png') # save a screenshot to disk
scroll_script = "var h = document.body.scrollHeight; window.scrollTo(0, h); return h;"
newHeight = browser.execute_script(scroll_script)
print(newHeight)
browser.save_screenshot('screen3.png') # save a screenshot to disk
The problem is that I can't scroll to the bottom: screen2.png and screen3.png are the same. But if I change the webdriver from PhantomJS to Firefox, the same code works fine. Why?
I was able to get this to work in PhantomJS when trying to solve a similar problem:
check_height = browser.execute_script("return document.body.scrollHeight;")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
It will scroll to the current "bottom", wait, see if the page loaded more, and bail if it did not (assuming everything got loaded if the heights match).
In my original code I had a "max" value I checked alongside the matching heights, because I was only interested in the first 10 or so "pages". If there were more, I wanted it to stop loading and skip them.
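A sketch of that capped variant, with max_scrolls as an illustrative name:

max_scrolls = 10  # only the first ~10 "pages" are of interest
check_height = browser.execute_script("return document.body.scrollHeight;")
for _ in range(max_scrolls):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break  # nothing new loaded, so this is the real bottom
    check_height = height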
Also, this is the answer I used as an example.
I'm trying to get the number of plays for the top songs from a number of artists on Spotify using Python and splinter.
If you fill in the username and password below with yours, you should be able to run the code.
from splinter import Browser
import time
from bs4 import BeautifulSoup
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(10)
So far, so good. If you open up Firefox, you can see Miley Cyrus's artist page, including the number of plays for the top tracks.
If you open up the Firefox Developer Tools Inspector and hover, you can see the name of each song in .tl-highlight elements, and the number of plays in .tl-listen-count elements. However, I've found it impossible (at least on my machine) to access these elements using splinter. Moreover, when I try to get the source for the entire page, the elements that I can see by hovering my mouse over them in Firefox don't show up in what is ostensibly the page source.
html = browser.html
soup = BeautifulSoup(html, 'html.parser')
output = soup.prettify()

with open('miley_cyrus_artist_page.html', 'w') as output_f:
    output_f.write(output)

browser.quit()
I don't think I know enough about web programming to know what the issue is here: Firefox sees all the DOM elements clearly, but splinter, which is driving Firefox, does not.
The key problem is that there is an iframe containing the artist's page with the list of tracks. You need to switch into its context before searching for elements:
frame = browser.driver.find_element_by_css_selector("iframe[id^=browse-app-spotify]")
browser.driver.switch_to.frame(frame)
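Once inside the frame, the elements from the question should become reachable; a hedged sketch using the .tl-highlight and .tl-listen-count selectors mentioned above:

# query the selectors the question points at, now inside the iframe
tracks = browser.driver.find_elements_by_css_selector('.tl-highlight')
plays = browser.driver.find_elements_by_css_selector('.tl-listen-count')
for track, count in zip(tracks, plays):
    print(track.text, count.text)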
Many thanks to @alecxe; the following code works to pull the information on the artist.
from splinter import Browser
import time
from bs4 import BeautifulSoup
import codecs
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(30)
CORRECT_FRAME_INDEX = 6
with browser.get_iframe(CORRECT_FRAME_INDEX) as iframe:
    html = iframe.html
    soup = BeautifulSoup(html, 'html.parser')
    output = soup.prettify()
    with codecs.open('test.html', 'w', 'utf-8') as output_f:
        output_f.write(output)
browser.quit()