How do I scrape all Twitter followers using Selenium? - python

I'm trying to scrape all Twitter followers from my profile and save them to a text file, but I don't know where to look to do this effectively, as I'm a novice at Python.
What am I doing wrong? :(
This is the code:
from time import sleep

last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    sleep(5)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

for twusernames in driver.find_elements_by_xpath('//div[@aria-label="Timeline: Followers"]//a[@role="link"]'):
    print(twusernames.get_property('href'))
    with open("usernames.txt", "a") as file:
        file.write(twusernames.get_property('href') + "\n")
So what it does is scroll down so that new followers are loaded, and then scrape them, but it's messed up: it picks up other unnecessary links too, and it misses some followers.
Thanks for the help.
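One way to cut down on the unnecessary links (a sketch, assuming follower profile URLs take the form https://twitter.com/<handle>, i.e. a single path segment) is to filter each href before writing it:

from urllib.parse import urlparse

# Sketch: keep only single-segment profile paths before saving
# (assumes follower links look like https://twitter.com/<handle>).
seen = set()
with open("usernames.txt", "a") as f:
    for link in driver.find_elements_by_xpath(
            '//div[@aria-label="Timeline: Followers"]//a[@role="link"]'):
        href = link.get_property('href')
        path = urlparse(href).path.strip('/')
        # Skip photo/status/hashtag links (extra path segments) and duplicates
        if path and '/' not in path and href not in seen:
            seen.add(href)
            f.write(href + "\n")

The missed followers are likely because Twitter virtualizes the list and removes off-screen rows as you scroll; collecting hrefs inside the scroll loop with a seen set, rather than once after scrolling finishes, should catch them.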

Related

Cannot scroll down and scrape a dynamic web page

I am trying to scrape news from the Skynews website, which has a dynamically scrolling page (the link for the website I am trying to scrape: https://www.skynewsarabia.com/live-story/1345832-%D9%85%D8%B3%D8%AA%D8%AC%D8%AF%D8%A7%D8%AA-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7-%D8%A7%D9%94%D8%AD%D8%AF%D8%AB-%D8%A7%D9%94%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85).
I am using Selenium to scroll down the page and BeautifulSoup to scrape information.
The problem is that my code only returns the first five news items and does not perform the scrolling. However, the same code works well on other websites, scrolling down the page as expected.
Following is the code I am using to scroll the site and scrape the info:
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import time

url = "https://www.skynewsarabia.com/live-story/1345832-%D9%85%D8%B3%D8%AA%D8%AC%D8%AF%D8%A7%D8%AA-%D9%83%D9%88%D8%B1%D9%88%D9%86%D8%A7-%D8%A7%D9%94%D8%AD%D8%AF%D8%AB-%D8%A7%D9%94%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%81%D9%8A%D8%B1%D9%88%D8%B3-%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85"
driver = webdriver.Chrome(executable_path=r"C:\Users\Acoe\Documents\yaman\Scraping\chromedriver.exe")
driver.get(url)
time.sleep(3)

fake_news = []  # holds the scraped headlines
previous_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # Scroll down the web page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(3)
    new_height = driver.execute_script('return document.body.scrollHeight')
    # Start the scraping
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    headlines = soup.find('body').find_all('h2')
    for x in headlines:
        if len(x.text.strip()) > 20:
            fake_news.append(x.text.strip())
            print(x.text.strip())
    if new_height == previous_height:
        break
    previous_height = new_height
Does anyone have any clue why the code cannot scroll down this website specifically, since it works well on other websites with the same properties? (e.g. https://www.reddit.com/search/?q=covid19)
The problem with the page is that it will not load new articles when the scroll is already 100% at the bottom. You can solve this by incrementing the scroll position by a little each time, instead of jumping straight to the bottom of the page.
scrolled = 500
while True:
    driver.execute_script(f'window.scrollTo(0, {scrolled});')
    scrolled += 500
    time.sleep(1)
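As written, that loop never terminates; a sketch of one way to stop it (my addition, not part of the original answer) is to break once the scroll position has passed the page height, i.e. no new articles are being loaded:

import time

# Sketch: stop the incremental scroll once we've scrolled past the page
# height and it has stopped growing (no new content loaded).
scrolled = 500
while True:
    driver.execute_script(f'window.scrollTo(0, {scrolled});')
    scrolled += 500
    time.sleep(1)
    height = driver.execute_script('return document.body.scrollHeight')
    if scrolled > height:
        break

Note also that requests.get(url) in the question fetches a fresh, unscrolled copy of the page; parsing driver.page_source after scrolling is what picks up the newly loaded headlines.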

Does YouTube have a 30-video limit when scraping using Selenium and Python?

I understand that the key concept of scraping YouTube is to carry out the following automated actions using Selenium, WebDriver and Python:
Launch Chrome browser (or firefox)
Authenticate YouTube "I Agree" button
Supply a YouTube URL
Determine the scroll height of the page
Scrape video content
I have created the script below that successfully launches the Chrome browser, authenticates the "I Agree" button, launches the YouTube URL and scrolls down the page whilst carrying out the Title, Views and Posted Date scraping in the background.
Issue
In my test whilst scraping a YouTube VIDEOS Page with 108 videos, it only captures the first 30 videos.
Behind the scenes, it cycles through the script identifying 4 different scroll heights: 2264, 3812, 5576 and 6380 respectively.
However, each scroll pass duplicates the original capture (the first 30 videos).
Thus, I get 30 unique video captures (titles, views and posted dates) presented 4 times, for a total video count of 120. If all were working well I should get 108 unique video captures.
I have read many articles on Stackoverflow, but cannot find any reference material to address this issue. Any help would be much appreciated.
Python Script
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import time

# Row & Column Viewing Options
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

# Removes SSL Issues With Chrome
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('log-level=3')
options.add_argument('--disable-notifications')

# Get URL
url = 'https://www.youtube.com/c/JohnWatsonRooney/videos?view=0&sort=dd&flow=grid'
driver = webdriver.Chrome(options=options)
driver.get(url)

# Auto Consent YouTube
consent_button_xpath = '//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span'
consent = WebDriverWait(driver, 40).until(EC.element_to_be_clickable((By.XPATH, consent_button_xpath)))
consent = driver.find_element_by_xpath(consent_button_xpath)
consent.click()

# Get scroll height
current_scroll_height = driver.execute_script("return document.documentElement.scrollHeight")
print('> Current Scroll Height (before loop):', current_scroll_height)  # starts at 2264

# Locate individual YouTube videos
videos = driver.find_elements_by_class_name('style-scope ytd-grid-video-renderer')
video_list = []  # Create array for video list (vid_item > video > videos)

# Loop: YouTube page load to end of scroll
while True:
    # Scroll to bottom of page
    driver.execute_script("window.scrollTo(0, arguments[0]);", current_scroll_height)
    # Wait to load page
    time.sleep(1)
    print(' - Current Scroll Height (while loop):', current_scroll_height)
    # Extract title, views & date posted from individual YouTube video
    for video in videos:
        title = video.find_element_by_xpath('.//*[@id="video-title"]').text
        views = video.find_element_by_xpath('.//*[@id="metadata-line"]/span[1]').text
        post_date = video.find_element_by_xpath('.//*[@id="metadata-line"]/span[2]').text
        # Store individual YouTube video information within dictionary
        vid_item = {
            'title': title,
            'views': views,
            'posted': post_date
        }
        # Store individual videos in main array video_list
        video_list.append(vid_item)
    # Use pandas to present main array containing individual videos + number of videos in array
    df = pd.DataFrame(video_list)
    # Calculate new scroll height and compare with current scroll height
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    print(' - New Scroll Height (after for loop):', new_height)
    if new_height == current_scroll_height:
        print(df)
        print('\nNumber of Videos (end of for loop):', len(video_list))
        break
    current_scroll_height = new_height

driver.quit()
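A likely cause (a sketch, not a confirmed fix from this thread): videos is collected once, before any scrolling, so every pass of the loop re-reads the same initial 30 elements. Re-querying the elements after each scroll and de-duplicating should pick up the newly rendered videos:

# Sketch: re-find the video elements after each scroll and dedupe by title,
# instead of reusing the pre-scroll `videos` list. Assumes `driver`, `time`
# and `current_scroll_height` are set up as in the question.
seen_titles = set()
video_list = []
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)
    for video in driver.find_elements_by_class_name('style-scope ytd-grid-video-renderer'):
        title = video.find_element_by_xpath('.//*[@id="video-title"]').text
        if title and title not in seen_titles:
            seen_titles.add(title)
            video_list.append({'title': title})
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == current_scroll_height:
        break
    current_scroll_height = new_height
print('Unique videos found:', len(video_list))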

Python Request entire HTML page, instead of initially loaded content

I am trying to get some data from reviews publicly available on the PlayStore, and since the provided API only allows getting reviews for one's own apps, I am trying to scrape it from the web.
I am using requests package to get the HTML page of a given app on the PlayStore and will use BeautifulSoup to parse it and save it to file, to then extract the relevant content (rating and comment of each user).
My issue is that not the entire content of the page is retrieved with requests.get(URL).
Navigating to "Read All Reviews" for an app on the PlayStore, one gets to a page with all reviews for that app. Unfortunately, though, only a limited set of reviews loads when the page first loads, while the rest load only upon scrolling to the bottom. Calling requests.get(URL) retrieves only that limited initial set, instead of all reviews.
Try navigating to https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true and see older reviews load only when scrolling to the bottom of the page.
Is there a way to access the entire page/trigger the loading of more reviews/simulating the scrolling?
Below is my code:
import requests
from bs4 import BeautifulSoup

# get reviews for Thirty Days of Fitness app
URL = "https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true"

# make request
request = requests.get(URL)

# extract HTML text
raw_text = request.text

# parse HTML and prettify
soup = BeautifulSoup(raw_text, 'html.parser')
text = soup.prettify()

# write to file
save_path = './thirtydayfitness_html.txt'
with open(save_path, 'w+', encoding=request.encoding) as f:
    f.write(text)
I would consider using a web driver to scroll down, like so:
import time

SCROLL_PAUSE_TIME = 0.5

# Get scroll height (assumes a Selenium `driver` has already been created)
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
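Once the loop finishes, the fully loaded page can be handed to BeautifulSoup via driver.page_source (a sketch, my addition reusing the question's file-writing step):

from bs4 import BeautifulSoup

# Sketch: parse the scrolled page from Selenium rather than re-fetching
# an unscrolled copy with requests.
soup = BeautifulSoup(driver.page_source, 'html.parser')
with open('./thirtydayfitness_html.txt', 'w+', encoding='utf-8') as f:
    f.write(soup.prettify())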
Reference: How can I scroll a web page using selenium webdriver in python?

Scrape dynamic HTML (YouTube comments)

With the Beautiful Soup and Requests libraries I am able to scrape static HTML content, but not content loaded by JavaScript or AJAX calls.
How do I mimic this in my Python script? YouTube comments only load when you scroll the page. I found two methods, one using Selenium and another using lxml requests, but I couldn't understand them at all.
Example:
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.youtube.com/watch?v=iFPMz36std4'
response = requests.get(url)
page_html = response.content
# print(page_html)
page_soup = soup(page_html, "html.parser")
print(page_soup)
You need to use Selenium.
Here is a trick: YouTube only loads comments when you scroll to just below the video. If you scroll straight to the bottom, or anywhere else, the comments will not load. So first scroll to that part, wait for the comments to load, and after that scroll to the bottom or wherever you want:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
driver.execute_script('window.scrollTo(1, 500);')

# now wait and let the comments load
time.sleep(5)
driver.execute_script('window.scrollTo(1, 3000);')

comment_div = driver.find_element_by_xpath('//*[@id="contents"]')
comments = comment_div.find_elements_by_xpath('.//*[@id="content-text"]')
for comment in comments:
    print(comment.text)
Some part of the output (can't post the full output, it's too long):
I love Kygo's Stranger Things and Netflix's Stranger Things <3
Stranger Things, Kygo and OneRepublic, could it be better?
Amazing Vibe!!!!!!!!!🔥🔥🔥🔥
Using Selenium would do the trick, though I have a different way of scrolling down. This function scrolls down by repeatedly calling JavaScript and checking whether the page height changed between the current and previous scrolls.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

def scrollDown(pause, driver):
    """
    Function to scroll down till the end of the page.
    """
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            break
        lastHeight = newHeight

# Main Code
# Instantiate browser and navigate to page
driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
scrollDown(6, driver)

# Page soup
soup = BeautifulSoup(driver.page_source, "html.parser")
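From there, the comment text can be pulled out of the soup. A sketch (my addition, assuming the comment nodes carry id="content-text", as the other answer's XPath suggests):

# Sketch: extract comment text, assuming comment nodes have id="content-text"
for node in soup.select('#content-text'):
    text = node.get_text(strip=True)
    if text:
        print(text)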

Scroll over website using PhantomJS and Selenium

I need to scroll over a web page (Twitter, for example) and scrape the new elements that appear as you advance through the site. I am trying to do this using Python 3.x, Selenium and PhantomJS. This is my code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

user = 'ciroylospersas'

# Start web browser
#browser = webdriver.Firefox()
browser = webdriver.PhantomJS()
browser.set_window_size(1024, 768)
browser.get("https://twitter.com/")

# Fill username in login
element = browser.find_element_by_id("signin-email")
element.clear()
element.send_keys('your twitter user')

# Fill password in login
element = browser.find_element_by_id("signin-password")
element.clear()
element.send_keys('your twitter pass')

browser.save_screenshot('screen.png')  # save a screenshot to disk

# Submit the login
element.submit()
time.sleep(5)
browser.save_screenshot('screen1.png')  # save a screenshot to disk

# Move to the following url
browser.get("https://twitter.com/" + user + "/following")
browser.save_screenshot('screen2.png')  # save a screenshot to disk

scroll_script = "var h = document.body.scrollHeight; window.scrollTo(0, h); return h;"
newHeight = browser.execute_script(scroll_script)
print(newHeight)
browser.save_screenshot('screen3.png')  # save a screenshot to disk
The problem is I can't scroll to the bottom: screen2.png and screen3.png are the same. But if I change the webdriver from PhantomJS to Firefox, the same code works fine. Why?
I was able to get this to work in phantomJS when trying to solve a similar problem:
check_height = browser.execute_script("return document.body.scrollHeight;")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
It will scroll to the current "bottom", wait, see if the page loaded more, and bail if it did not (assuming everything got loaded if the heights match.)
In my original code I had a "max" value I checked alongside the matching heights because I was only interested in the first 10 or so "pages". If there were more I wanted it to stop loading and skip them.
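That "max" cut-off might look like this (a sketch of the idea, not the answerer's original code):

import time

# Sketch: cap the number of scroll passes while still bailing early
# if the page height stops growing.
max_scrolls = 10
check_height = browser.execute_script("return document.body.scrollHeight;")
for _ in range(max_scrolls):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break  # nothing new loaded; reached the real bottom
    check_height = height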
Also, this is the answer I used as an example
