I'm trying to load the videos page of a YouTube channel and parse it to extract recent video information. I want to avoid using the API since it has a daily usage quota.
The problem I'm having is that Selenium does not seem to load the full HTML of the page when I print driver.page_source:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
driver = Chrome(executable_path='chromedriver')
driver.get('https://www.youtube.com/c/Oxylabs/videos')
# Agree to youtube cookie popup
try:
    consent = driver.find_element_by_xpath(
        "//*[contains(text(), 'I agree')]")
    consent.click()
except:
    pass
# Parse html
WebDriverWait(driver, 100).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="show-more-button"]')))
print(driver.page_source)
I have tried WebDriverWait as seen above, which results in a TimeoutException. However, waiting for the following XPath (/html, i.e. the end of the page) does not time out:
WebDriverWait(driver,100).until(EC.visibility_of_element_located((By.XPATH, '/html')))
but this does not yield the full HTML either.
I have also tried time.sleep(100) instead of WebDriverWait, but this too results in incomplete HTML. Any help would be greatly appreciated.
The element you are looking for is not on the page; that is the reason for the timeout:
//*[@id="show-more-button"]
Have you tried scrolling to the bottom of the page, or waiting for some other element?
driver.execute_script("arguments[0].scrollIntoView();", element)
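For example, here is a minimal sketch of the scroll-until-stable approach, continuing from your driver setup. The a#video-title selector is an assumption about YouTube's current markup and may need checking in dev tools:
import time
from bs4 import BeautifulSoup

# Scroll until the page height stops growing, i.e. no more videos are loaded
last_height = 0
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)  # give YouTube time to fetch the next batch of videos
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Now the full list of videos should be in the DOM
soup = BeautifulSoup(driver.page_source, "html.parser")
for a in soup.select("a#video-title"):
    print(a.get("title"), a.get("href"))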
I am trying to scrape data from the link below, but I cannot get the HTML elements. I am using Selenium with Python. When I do print(driver.page_source), it prints just a bunch of JS, like when you try to scrape a JavaScript-driven website with BeautifulSoup. I waited longer for the whole page to render, but the Selenium driver still cannot get the rendered HTML elements. So how do I scrape it?
https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1JH20151&vw_cd=MT_ETITLE&list_id=J1_10&scrId=&language=en&seqNo=&lang_mode=en&obj_var_id=&itm_id=&conn_path=MT_ETITLE&path=%252Feng%252FstatisticsList%252FstatisticsListIndex.do
I am trying to scrape kosis.kr, but driver.page_source gives nothing.
The data you are interested in is located in nested iframes on that page. Try this to get the tabular content:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1JH20151&vw_cd=MT_ETITLE&list_id=J1_10&scrId=&language=en&seqNo=&lang_mode=en&obj_var_id=&itm_id=&conn_path=MT_ETITLE&path=%252Feng%252FstatisticsList%252FstatisticsListIndex.do"
with webdriver.Chrome() as driver:
    driver.get(link)
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe#iframe_rightMenu")))
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe#iframe_centerMenu1")))
    for item in WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table[id='mainTable'] tr"))):
        data = [i.text for i in item.find_elements(By.CSS_SELECTOR, 'th,td')]
        print(data)
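Note that after the two frame_to_be_available_and_switch_to_it calls, the driver's context stays inside the innermost iframe. If you need to interact with the rest of the page afterwards, switch back to the top-level document first:
driver.switch_to.default_content()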
Purpose: get the entire page source using Selenium.
Problem: the loaded page does not contain the content, only JavaScript and CSS files.
Target site: https://www.warcraftlogs.com
Test code (requires pip install selenium):
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.warcraftlogs.com/zone/rankings/29#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline")
pageSource = driver.page_source
fileToWrite = open("page_source.html", "w",encoding='utf-8')
fileToWrite.write(pageSource)
fileToWrite.close()
Things I have tried:
I tried the Python requests library; same result, no content, only JS and CSS.
It is my personal opinion that this site deliberately hides its content data.
I want to scrape this site's data. How can I do it?
Here is a way of getting the page source after all elements have loaded:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
[...]
wait = WebDriverWait(driver, 5)
url='https://www.warcraftlogs.com/zone/rankings/29#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline'
driver.get(url)
stuffs = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="top-100-details-number kill"]')))
t.sleep(5)
print(driver.page_source)
You can then write the page source to a file, etc. Selenium documentation: https://www.selenium.dev/documentation/
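For example, a minimal way to dump the rendered source to disk:
# Save the rendered page for offline inspection
with open("page_source.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)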
Selenium and XPath: I have a Python script that uses Selenium to scrape the video source from a movie website. I can get the script to play the video using Selenium, but I want to scrape the src link to the MP4 video file. I think my XPath syntax is incorrect.
Code:
# Load selenium components
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time
browser = webdriver.Chrome(executable_path=r"C:\temp\chromedriver.exe")
## Link to the movie as an example
url = "https://vw.ffmovies.sc/film/fatman-2020/watching/?server_id=3"
browser.get(url)
element = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='player']")))
clickable = browser.find_element_by_id("player")
clickable.find_element_by_xpath('.//*').click()
browser.switch_to.frame("iframe-embed")
time.sleep(5)
# This is where I am stuck: it cannot find the XPath element.
# I am getting the XPath wrong. I want the video link to be stored in the link variable.
link = browser.switch_to.frame(browser.find_element_by_xpath('//*[@id="player"]/iframe').get_attribute('src'))
# The line above raises an error.
browser.close()
Any advice will be much appreciated. Thanks.
id='player' is outside the iframe, so you shouldn't use it in your XPath.
You should consider the iframe as the root of your new context.
Instead of browser.find_element_by_xpath('//*[@id="player"]/iframe').get_attribute('src'), try:
browser.find_element_by_xpath('.//video/source').get_attribute('src')
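Putting it together, the tail of the script could look like this sketch. Whether the src sits on a nested <source> tag or on the <video> tag itself depends on the player markup, so the fallback below is an assumption:
from selenium.common.exceptions import NoSuchElementException

# After browser.switch_to.frame("iframe-embed") the iframe is the root,
# so search relative to it rather than through the outer #player div
try:
    link = browser.find_element_by_xpath('.//video/source').get_attribute('src')
except NoSuchElementException:
    # some players put the src directly on the <video> tag (assumption)
    link = browser.find_element_by_xpath('.//video').get_attribute('src')
print(link)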
I was making a script to download images from Comic Naver, and I'm mostly done with it; however, I can't seem to save the images.
I successfully grabbed the images via urllib and BeautifulSoup. Now it seems they've introduced hotlink blocking, and I can't save the images on my system via urllib or Selenium.
Update: I tried changing the user agent to see if that was causing problems. Still the same.
Any fix or solution?
My code right now:
import requests
from bs4 import BeautifulSoup
import re
import urllib
import urllib2
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
"Chrome/15.0.87"
)
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver = webdriver.PhantomJS(desired_capabilities=dcap)
soup = BeautifulSoup(urllib.urlopen(url).read())
scripts = soup.findAll('img', alt='comic content')
for links in scripts:
    Imagelinks = links['src']
    filename = Imagelinks.split('_')[-1]
    print 'Downloading Image : ' + filename
    driver.get(Imagelinks)
    driver.save_screenshot(filename)
driver.close()
Following MAI's reply, I tried what I could with Selenium and got what I wanted. It's solved now. My code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
driver = webdriver.Chrome()
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver.get(url)
elem = driver.find_elements_by_xpath("//div[@class='wt_viewer']//img[@alt='comic content']")
for links in elem:
    print links.get_attribute('src')
driver.quit()
But when I try to take screenshots of these, it says the "element is not attached to the page". Now how am I supposed to solve that?
(Note: Apologies, I'm not able to comment, so I have to make this an answer.)
To answer your original question, I've just been able to download an image with cURL from Naver Webtoons (the English site) by adding a Referer: http://www.webtoons.com header, like so:
curl -H "Referer: http://www.webtoons.com" [link to image] > img.jpg
I haven't tried, but you'll probably want to use http://comic.naver.com instead. To do this with urllib, create a Request object with the header required:
req = urllib.request.Request(url, headers={"Referer": "http://comic.naver.com"})
with urllib.request.urlopen(req) as response, open("image.jpg", "wb") as outfile:
    shutil.copyfileobj(response, outfile)  # needs: import shutil, urllib.request
Here shutil.copyfileobj(src, dest) streams the response straight to the file. So instead of taking screenshots, you can simply get a list of all the images to download, then make a request for each one using the Referer header.
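A sketch of that loop with Python 3's urllib, reusing the img/alt selector and filename scheme from the question's code (both are assumptions about the page's current markup):
import shutil
import urllib.request
from bs4 import BeautifulSoup

page_url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
headers = {"Referer": "http://comic.naver.com"}

# Fetch the page with the same Referer header and collect the image URLs
req = urllib.request.Request(page_url, headers=headers)
with urllib.request.urlopen(req) as response:
    soup = BeautifulSoup(response.read(), "html.parser")

for img in soup.find_all("img", alt="comic content"):
    image_url = img["src"]
    filename = image_url.split("_")[-1]
    img_req = urllib.request.Request(image_url, headers=headers)
    with urllib.request.urlopen(img_req) as src, open(filename, "wb") as dest:
        shutil.copyfileobj(src, dest)  # stream the image straight to disk
    print("Downloaded", filename)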
Edit: I have a working script on GitHub which only requires urllib and BeautifulSoup.
I took a short look at the website with Chrome dev tools.
I would suggest you download the images directly instead of taking screenshots. The Selenium webdriver actually runs the JavaScript in the PhantomJS headless browser, so you should get the images loaded by JavaScript at the following path.
The path I get by eyeballing the HTML is
html body #wrap #container #content div #comic_view_area div img
The image tags at the last level have IDs like content_image_N, with N counting from 0. So you can also get a specific picture by using img#content_image_0, for example.
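For example, a short Selenium sketch that grabs the image sources by that ID prefix instead of the alt text (assuming the content_image_N naming still holds):
from selenium.webdriver.common.by import By

# IDs follow the pattern content_image_0, content_image_1, ...
for img in driver.find_elements(By.CSS_SELECTOR, "img[id^='content_image_']"):
    print(img.get_attribute("src"))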
I want to download the video from the following page:
https://www.indiegogo.com/projects/protest-the-hero-new-album--3#/
Using Firebug, I can see the url of the video,
<video src="https://09-lvl3-pdl.vimeocdn.com/01/1449/2/57246728/138634997.mp4?expires=1457996449&token=089e435c20e7781d36fce" preload="metadata">
</video>
However, when I tried to scrape the page with Python, this element was missing and I could not get the URL. I also tried Selenium, but the same problem remained. How can I access the video URL with my scraper?
Also, it seems that the video URL does not work. How can I get a URL from which I can download the video?
You can solve it with Selenium.
The trick is that the desired video tag is inside an iframe: you need to switch into its context and then search for the video element. Then use get_attribute() to get the src attribute value. Complete working code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome() # or webdriver.Firefox(), or webdriver.PhantomJS() or etc.
wait = WebDriverWait(browser, 10)
browser.get('https://www.indiegogo.com/projects/protest-the-hero-new-album--3#/')
# waiting for the frame to become present
frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#vimeoPlayer")))
browser.switch_to.frame(frame)
# get video url
url = browser.find_element_by_tag_name("video").get_attribute("src")
print(url)
browser.close()
Prints:
https://09-lvl3-pdl.vimeocdn.com/01/1449/2/57246728/138634997.mp4?expires=1457998452&token=0c54810bc365a94ea8486
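Note the expires and token query parameters in that src: the link is time-limited, which is likely why a previously copied URL stops working. Download it right away in the same run, for example with requests (a sketch; the CDN may additionally check headers or cookies from the browser session):
import requests

# Stream the time-limited URL to disk before the token expires
response = requests.get(url, stream=True)
response.raise_for_status()
with open("video.mp4", "wb") as f:
    for chunk in response.iter_content(chunk_size=1 << 20):
        f.write(chunk)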