Scrape dynamic HTML (YouTube comments) - python

With the Beautiful Soup and Requests libraries I am able to scrape static HTML content, but not content that is loaded by JavaScript or AJAX calls.
How do I mimic this in my Python script? YouTube comments only load when we scroll down the page. I found two methods: one using Selenium and another using lxml with requests, which I couldn't understand at all.
Example (this is the video):
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.youtube.com/watch?v=iFPMz36std4'
response = requests.get(url)
page_html = response.content
# print(page_html)
page_soup = soup(page_html, "html.parser")
print(page_soup)

You need to use Selenium.
Here is the trick: YouTube only loads comments when you scroll to just below the video. If you jump straight to the bottom (or anywhere else), the comments will not load. So first scroll to the area just below the video, wait for the comments to load, and only then scroll to the bottom or wherever you want:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')

# Scroll to just below the video so the comment section starts loading
driver.execute_script('window.scrollTo(1, 500);')

# Now wait for the comments to load
time.sleep(5)
driver.execute_script('window.scrollTo(1, 3000);')

comment_div = driver.find_element_by_xpath('//*[@id="contents"]')
comments = comment_div.find_elements_by_xpath('//*[@id="content-text"]')
for comment in comments:
    print(comment.text)
Some part of the output:
# can't post the full output, it's too long
I love Kygo's Stranger Things and Netflix's Stranger Things <3
Stranger Things, Kygo and OneRepublic, could it be better?
Amazing Vibe!!!!!!!!!🔥🔥🔥🔥
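If the fixed time.sleep feels fragile, you can swap it for an explicit wait; here is a minimal sketch of that variation, assuming the comments still use the #content-text id from the answer above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
driver.execute_script('window.scrollTo(1, 500);')

# Wait until at least one comment body is present instead of sleeping blindly
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#content-text')))

for comment in driver.find_elements_by_css_selector('#content-text'):
    print(comment.text)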

Using Selenium would do the trick.
I have a different way of scrolling down, though. This function scrolls to the end of the page by repeatedly calling JavaScript and checking whether the page height changed between the current and the previous scroll:
import time

def scrollDown(pause, driver):
    """Scroll down until the end of the page is reached."""
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            # Height stopped growing, so no more content is loading
            break
        lastHeight = newHeight
# Main code
from selenium import webdriver
from bs4 import BeautifulSoup

# Instantiate the browser and navigate to the page
driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
scrollDown(6, driver)

# Page soup
soup = BeautifulSoup(driver.page_source, "html.parser")
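From there, the comments can be pulled out of the soup; a minimal sketch, assuming the comment text still lives in elements with the #content-text id used in the answer above:

# Select every element whose id is "content-text" (YouTube repeats the id per comment)
for comment in soup.select('#content-text'):
    print(comment.get_text(strip=True))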

Related

Cannot scroll down and scrape a dynamic web page

I am trying to scrape news from the Skynews website, which has a dynamically scrolling page (this is the link to the site I am trying to scrape: https://www.skynewsarabia.com/live-story/1345832-%D9%85%D8%B3%D8%AA%D8%AC%D8%AF%D8%A7%D8%AA-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7-%D8%A7%D9%94%D8%AD%D8%AF%D8%AB-%D8%A7%D9%94%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85).
I am using Selenium to scroll down the page and BeautifulSoup to scrape information.
The problem is that my code only returns the first five news items and does not perform the scrolling. However, the same code works well on other websites, scrolling down the page as expected.
Here is the code I am using to scroll the site and scrape the info:
import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.skynewsarabia.com/live-story/1345832-%D9%85%D8%B3%D8%AA%D8%AC%D8%AF%D8%A7%D8%AA-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7-%D8%A7%D9%94%D8%AD%D8%AF%D8%AB-%D8%A7%D9%94%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85"
driver = webdriver.Chrome(executable_path=r"C:\Users\Acoe\Documents\yaman\Scraping\chromedriver.exe")
driver.get(url)
time.sleep(3)

fake_news = []
previous_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # Scroll down the web page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(3)
    new_height = driver.execute_script('return document.body.scrollHeight')

    # Start the scraping
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    headlines = soup.find('body').find_all('h2')
    for x in headlines:
        if len(x.text.strip()) > 20:
            fake_news.append(x.text.strip())
            print(x.text.strip())

    if new_height == previous_height:
        break
    previous_height = new_height
Does anyone have a clue why the code cannot scroll down this website specifically, when it works well on other websites with the same properties (e.g. https://www.reddit.com/search/?q=covid19)?
The problem with the page is that it will not load new articles when the scroll position is already 100% at the bottom. You can solve this by incrementing the scroll position by a small amount each time, instead of jumping straight to the bottom of the page:
scrolled = 500
while True:
    driver.execute_script(f'window.scrollTo(0, {scrolled});')
    scrolled += 500
    time.sleep(1)
    # Stop once we have scrolled past the full page height
    if scrolled > driver.execute_script('return document.body.scrollHeight'):
        break

Python Request entire HTML page, instead of initially loaded content

I am trying to get some data from reviews publicly available on the Play Store, and since the provided API only allows fetching reviews for one's own apps, I am trying to scrape them from the web.
I am using the requests package to get the HTML page of a given app on the Play Store, and will use BeautifulSoup to parse it and save it to a file, to then extract the relevant content (each user's rating and comment).
My issue is that requests.get(URL) does not retrieve the entire content of the page.
Navigating to "Read All Reviews" on an app in the Play Store takes you to a page with all reviews for that app. Unfortunately, only a limited set of reviews loads when the page first opens, while the rest of the reviews load only upon scrolling down to the bottom. Calling requests.get(URL) retrieves only that limited set instead of all reviews.
Try navigating to https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true and see how older reviews load only when you scroll to the bottom of the page.
Is there a way to access the entire page, trigger the loading of more reviews, or simulate the scrolling?
Below is my code:
import requests
from bs4 import BeautifulSoup

# get reviews for Thirty Days of Fitness app
URL = "https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true"

# make the request
request = requests.get(URL)

# extract the HTML text
raw_text = request.text

# parse the HTML and prettify it
soup = BeautifulSoup(raw_text, 'html.parser')
text = soup.prettify()

# write to file
save_path = './thirtydayfitness_html.txt'
with open(save_path, 'w+', encoding=request.encoding) as f:
    f.write(text)
I would consider using a web driver to scroll down, like so (this assumes a Selenium driver has already been created and navigated to the page):
import time

SCROLL_PAUSE_TIME = 0.5

# Get the initial scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate the new scroll height and compare it with the last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
Reference: How can I scroll a web page using selenium webdriver in python?

Can't get all titles from a list with Python web scraping

I'm practicing web scraping with Python at the moment, and I've hit a problem: I wanted to scrape a website that has a list of anime I watched before, but when I try to scrape it (via requests or Selenium) it only gets around 30 of the 110 anime names on the page.
Here is my code with Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("https://anilist.co/user/Agusmaris/animelist/Completed")
data = BeautifulSoup(browser.page_source, 'lxml')
for title in data.find_all(class_="title"):
    print(title.getText())
When I run it, the page source only shows entries up to an anime called 'Golden Time', when there are 70 or more left on the page.
Thanks
Edit: Code that works now, thanks to supputuri:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)

footer = driver.find_element_by_css_selector("div.footer")
preY = 0
print(str(footer))
while footer.rect['y'] != preY:
    # Keep scrolling the footer into view until its position stops changing
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    print('loading')

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for title in soup.find_all(class_="title"):
    print(title.getText())

driver.close()
driver.quit()
ret = input()
Here is the solution.
Make sure to add import time
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer =driver.find_element_by_css_selector("div.footer")
preY =0
while footer.rect['y']!=preY:
preY = footer.rect['y']
footer.location_once_scrolled_into_view
time.sleep(1)
print(str(driver.page_source))
This will iterate until all the anime are loaded and then get the page source.
Let us know if this was helpful.
So, this is the gist of what I get when I load the page source:
AniListwindow.al_token = 'E1lPa1kzYco5hbdwT3GAMg3OG0rj47Gy5kF0PUmH';Sorry, AniList requires Javascript.Please enable Javascript or http://outdatedbrowser.com>upgrade to a modern web browser.Sorry, AniList requires a modern browser.Please http://outdatedbrowser.com>upgrade to a newer web browser.
Since I know damn well that JavaScript is enabled and my Chrome version is fully up to date, and the URL listed takes you to an insecure website to "download" a new version of your browser, I think this is a spam site. I'm not sure if you were aware of that when posting, so I won't flag it as such, but I wanted you and others who come across this to be aware.

Web scraping when scrolling down is needed

I want to scrape, e.g., the titles of the first 200 questions under the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. I tried the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)
It gave me the text at https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. However, the problem is that their number is not enough; on the actual web page, a user needs to scroll down manually to trigger the extra load.
Does anyone know how I can mimic "scrolling down" programmatically to load more of the page's content?
Infinite scrolling on a web page is driven by JavaScript. Therefore, to find out what URL we need to access and what parameters to use, we need to either thoroughly study the JS code working inside the page or, preferably, examine the requests the browser makes when you scroll down the page. We can study those requests using the browser's developer tools.
See this example for Quora:
The more you scroll down, the more requests are generated. Your requests should then go to that URL instead of the normal page URL, but keep in mind that you have to send the correct headers and payload, as sketched below.
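A minimal sketch of that approach; the endpoint, headers, and payload below are placeholders you would replace with the values copied from the Network tab, not Quora's real API:

import requests

# All of the values below are placeholders copied by hand from the
# browser's Network tab; this is NOT a real Quora endpoint.
url = 'https://www.quora.com/example/ajax/endpoint'
headers = {
    'User-Agent': 'Mozilla/5.0',   # copy the real header values from the browser
    'Content-Type': 'application/json',
}
payload = {'page': 2}              # placeholder pagination parameter

# Replay the scroll request directly; such endpoints usually return JSON
response = requests.post(url, headers=headers, json=payload)
data = response.json()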
Another, easier solution is to use Selenium.
I couldn't find a way to do this with requests, but you can use Selenium. First print out the number of questions at the initial load, then send the End key to mimic scrolling down; you can see the number of questions go from 20 to 40 after sending the End key.
I used a 5-second wait before reading the DOM again, in case the script ran faster than the DOM could load. You can improve this by using explicit expected conditions (EC) with Selenium.
The page loads 20 questions per scroll, so if you are looking to scrape 100 questions, you need to send the End key 5 times.
To use the code below you need to install chromedriver.
http://chromedriver.chromium.org/downloads
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

CHROMEDRIVER_PATH = ""
CHROME_PATH = ""
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
prefs = {'profile.managed_default_content_settings.images': 2}
chrome_options.add_experimental_option("prefs", prefs)

url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"

def scrape(url, times):
    if not url.startswith('http'):
        raise Exception('URLs need to start with "http"')
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_PATH,
        chrome_options=chrome_options
    )
    driver.get(url)
    counter = 1
    while counter <= times:
        # Count the questions currently loaded in the list
        q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
        questions = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        q_len = len(questions)
        print(q_len)

        # Send the End key to the page to trigger the next batch of questions
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        wait = WebDriverWait(driver, 5)  # could be combined with EC for an explicit wait
        time.sleep(5)

        questions2 = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        print(len(questions2))
        counter += 1
    driver.close()

if __name__ == '__main__':
    scrape(url, 5)
I recommend using Selenium rather than bs4 alone.
Selenium can control the browser as well as parse the page: scroll down, click buttons, and so on.
This example scrolls down to collect all the users who liked a post on Instagram:
https://stackoverflow.com/a/54882356/5611675
If the content only loads on "scrolling down", the page is probably using JavaScript to load the content dynamically.
You can try using a web client such as PhantomJS to load the page and execute the JavaScript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (Simulate scroll event using Javascript).
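A minimal sketch of that idea, using Selenium to inject the scroll (PhantomJS itself is no longer maintained, so Chrome stands in here; the step size and range are arbitrary assumptions):

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.quora.com/topic/Stack-Overflow-4/all_questions')

# Inject the scroll position directly, as suggested above
for sY in range(500, 5000, 500):
    driver.execute_script(f'document.body.scrollTop = {sY};')
    time.sleep(1)

print(len(driver.page_source))  # the source grows as more questions load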

Scrape Spotify web interface

I'm trying to get the number of plays for the top songs from a number of artists on Spotify using python and splinter.
If you fill in the username and password below with yours, you should be able to run the code.
from splinter import Browser
import time
from bs4 import BeautifulSoup
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(10)
So far, so good. If you open Firefox, you can see Miley Cyrus's artist page, including the number of plays for top tracks.
If you open the Firefox Developer Tools Inspector and hover, you can see the name of each song in .tl-highlight elements and the number of plays in .tl-listen-count elements. However, I've found it impossible (at least on my machine) to access these elements using splinter. Moreover, when I try to get the source for the entire page, the elements that I can see by hovering my mouse over them in Firefox don't show up in what is ostensibly the page source.
html = browser.html
soup = BeautifulSoup(html)
output = soup.prettify()
with open('miley_cyrus_artist_page.html', 'w') as output_f:
    output_f.write(output)
browser.quit()
I don't think I know enough about web programming to know what the issue is here: Firefox sees all the DOM elements clearly, but splinter, which is driving Firefox, does not.
The key problem is that there is an iframe containing the artist's page with the list of tracks. You need to switch into its context before searching for elements:
frame = browser.driver.find_element_by_css_selector("iframe[id^=browse-app-spotify]")
browser.driver.switch_to.frame(frame)
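Once switched, the track elements should be reachable; a minimal sketch, assuming the .tl-highlight and .tl-listen-count classes mentioned in the question:

# After switching into the iframe, query the elements from the question
tracks = browser.driver.find_elements_by_css_selector('.tl-highlight')
plays = browser.driver.find_elements_by_css_selector('.tl-listen-count')
for track, count in zip(tracks, plays):
    print(track.text, count.text)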
Many thanks to @alecxe; the following code works to pull the information on the artist.
from splinter import Browser
import time
from bs4 import BeautifulSoup
import codecs
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(30)
CORRECT_FRAME_INDEX = 6
with browser.get_iframe(CORRECT_FRAME_INDEX) as iframe:
    html = iframe.html
    soup = BeautifulSoup(html)
    output = soup.prettify()
    with codecs.open('test.html', 'w', 'utf-8') as output_f:
        output_f.write(output)

browser.quit()
