I am very new to web scraping and am trying to scrape GIF URLs from a website. For example, on gifer.com, I want to search for "smile" GIFs and then collect the URLs of all GIFs listed.
Below is an example of the source from which I want to extract the src attribute of the video (https://i.gifer.com/ON0.mp4 in this case).
<div class="page-media-swipe desktop">
<div class="container">
<div class="swipe-left">
<span class="icon-arrow-left-2 icon" style="color: rgb(255, 255, 255); font-size: 44px;"></span>
</div>
<div class="media desktop" style="width: 367.462px;">
<div style="padding-top: 122.462%;">
<div class="media-container1">
<div class="media-container2" style="width: 367.462px;">
<div>
<video poster="https://i.gifer.com/fetch/w300-preview/d0/d0e6e89a42c43d31b5913e232d87af7b.gif" class="full-media" loop="" autoplay="" playsinline="">
<source src="https://i.gifer.com/ON0.mp4" type="video/mp4">
</video>
</div>
</div>
</div>
</div>
</div>
<div class="swipe-right">
<span class="icon-arrow-right-2 icon" style="color: rgb(255, 255, 255); font-size: 44px;">
</span>
</div>
</div>
</div>
There are thousands of such results, and I was advised to use Python and Selenium. However, my knowledge of Selenium and Python is limited.
I tried the code below, but I am not able to make much headway.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://gifer.com/en/gifs/smile")
imgResults = driver.find_elements(By.CLASS_NAME, "media-container2")
print(len(imgResults))
#print(driver.page_source)
for i in range(0,len(imgResults)):
    print(imgResults[i])
driver.quit()
The above returns 4 elements:
<selenium.webdriver.remote.webelement.WebElement (session="fac424650675a90b2a8dee91efdc01f4", element="16e771ca-37d8-45a0-8200-0f03da0b7d14")>
<selenium.webdriver.remote.webelement.WebElement (session="fac424650675a90b2a8dee91efdc01f4", element="8c9abdcb-bc9d-47da-9958-109e722b3ae9")>
<selenium.webdriver.remote.webelement.WebElement (session="fac424650675a90b2a8dee91efdc01f4", element="d9640144-4ba1-414b-aa4f-5141387335ef")>
<selenium.webdriver.remote.webelement.WebElement (session="fac424650675a90b2a8dee91efdc01f4", element="9626db84-1da9-42ad-b314-56222a5e933b")>
What I can't figure out is how to get the src link of the source tag for each video element.
I was wrong, no need to load a new page to get the mp4 link:
for img in driver.find_elements(By.CSS_SELECTOR, "figure a"):
    code = img.get_attribute('href').split('/')[-1]
    link = f'https://i.gifer.com/{code}.mp4'
    print(link)
output
https://i.gifer.com/fzvh.mp4
https://i.gifer.com/7F5y.mp4
https://i.gifer.com/6qOR.mp4
https://i.gifer.com/3JT.mp4
...
You can obtain the list of links in one line
links = [f"https://i.gifer.com/{img.get_attribute('href').split('/')[-1]}.mp4" for img in driver.find_elements(By.CSS_SELECTOR, "figure a")]
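The href-to-mp4 mapping used above can be pulled into a small helper that is easy to check without a browser. This is a minimal sketch that assumes, as the answer does, that the last path segment of each href is the GIF code; the function name and the sample href are illustrative only.

```python
from urllib.parse import urlparse


def mp4_url_from_href(href: str) -> str:
    """Build the i.gifer.com mp4 URL from a gif page href.

    Assumption (taken from the answer above): the last path segment
    of the href is the gif code, e.g. ".../ON0" -> "ON0".
    """
    # Use only the path so query strings and fragments are ignored,
    # and drop a trailing slash so the last segment is never empty.
    path = urlparse(href).path.rstrip("/")
    code = path.split("/")[-1]
    return f"https://i.gifer.com/{code}.mp4"


# Hypothetical href, shaped like the ones returned by "figure a"
print(mp4_url_from_href("https://gifer.com/en/ON0"))
```

With Selenium, the same helper could be applied to each `img.get_attribute('href')` value from the loop above.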
I am trying to navigate to a search box and use send_keys with Selenium in Python, but I am completely stuck.
Here is the source code snippet:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<div id="LeftTreeFrame" class="leftNavBackground" >
<div class="ui-widget searchPanelContainer">
<div id="searchPanel" class="search-field-container search-field-container-margin">
<input type="text" doesntDirty id="Search" name="Search" class="search-text-field-left-tree-frame" NoHighlight="nohighlight"/>
<div class="search-field-icon-container">
<a id="searchlbl" href="#"><img src="../images/normal_search_u39.svg" title="Go To Page" /></a>
</div>
</div>
</div>
<div id='pageNavigation'>
<div id='ootbNavigationPage'></div>
<div id='favoriteNavigationPage'></div>
<div id='adminNavigationPage'></div>
<div id='navigationEmptyState' class="treeEmptyState">
<div class="message"></div>
</div>
</div>
<div class="navigation-view-mode-container">
<div class="box" onclick="renderModel(0)">
<button type="button">
<span class="svg-load ootb-icon" data-src="~/images/Reskin/ootb-icon.svg"></span>
</button>
</div>
<div class="star" onclick="renderModel(1)">
<button type="button">
<span class="svg-load star-icon" data-src="~/images/Reskin/star.svg"></span>
</button>
</div>
<div class="person" onclick="renderModel(2)">
<button type="button">
<span class="svg-load person-icon" data-src="~/images/Reskin/person-nav.svg"></span>
</button>
</div>
</div>
</div>
When I try to do
element = driver.find_element(By.XPATH, '//input[@name="Search"]')
element.send_keys('test')
I get error "selenium.common.exceptions.ElementNotInteractableException: Message: element not interactable"
I have tried everything I can imagine, but cannot click the element or send keys.
Also, this page is a new page that opens after the last successful click. I first tried switching to this page by
#printing handles
handles = driver.window_handles
i=0
for handle in handles:
    print(f"Handle {i}: {handle}\n")
    i +=1
#after confirming new page is second handle via:
driver.switch_to.window(handles[1])
print(f" Title: {driver.title}")
print(f" Current url: {driver.current_url}")
print('\n')
#I can even find the tag I am looking for after switching to new window:
all_div_tags = driver.find_elements(By.TAG_NAME, "input")
for tag in all_div_tags:
    print(f"Attribute name: {tag.get_attribute('name')}\n")
#but i cannot get to the search box. Thank you in advance!
Looking at the HTML, notice that //input[@name="Search"] is contained in an <iframe>. To select an element inside an iframe with find_element(), you first have to switch to the iframe, as shown in the code:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "frmCode")))
element = driver.find_element(By.XPATH, '//input[@name="Search"]')
...
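Once the driver has switched into the iframe, later lookups stay scoped to that frame until you switch back. One way to keep that explicit is a small context manager; this is a sketch, and `inside_frame` is a hypothetical helper name, not part of Selenium. It relies only on `driver.switch_to.frame(...)` and `driver.switch_to.default_content()`.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of a with-block.

    frame_reference is whatever driver.switch_to.frame() accepts:
    a frame name or id string, an index, or a frame WebElement.
    """
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Always return to the top-level document, even on errors,
        # so later find_element() calls are not stuck inside the frame.
        driver.switch_to.default_content()


# Usage sketch (the "frmCode" id comes from the answer above):
# with inside_frame(driver, "frmCode"):
#     driver.find_element(By.XPATH, '//input[@name="Search"]').send_keys("test")
```

Unlike the explicit wait in the answer, this helper does not wait for the frame to become available, so it suits frames that are already loaded.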
Hello, I would like to change the value of "50 Profiles / Page" to "500 Profiles / Page", but the problem is that there is no "Select" tag in the HTML.
I tried this, but it didn't work:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.personality-database.com/profile?pid=1&sort=hot'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.implicitly_wait(30)
driver.get(url)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/section/main/div[1]/div[2]/div/div[5]/ul/li[10]/div/div[1]/span[2][text()="500 Profiles / Page"]'))).click()
Here is the HTML code:
<li class="rc-pagination-options">
<div class="rc-select rc-pagination-options-size-changer rc-select-single rc-select-show-arrow">
<span class="rc-select-arrow" unselectable="on" aria-hidden="true">
<span class="rc-select-arrow-icon"></span></span>
<div class="rc-select-dropdown rc-select-dropdown-placement-topLeft rc-select-dropdown-hidden">
<div role="listbox" id="rc_select_0_list">
<div aria-label="20 Profiles / Page" role="option" id="rc_select_0_list_0"
aria-selected="false">20</div>
</div>
<div class="rc-virtual-list" style="position: relative;">
<div class="rc-virtual-list-holder">
<div class="rc-virtual-list-holder-inner"
style="display: flex; flex-direction: column;">
<div aria-selected="false" class="rc-select-item rc-select-item-option"
title="20 Profiles / Page">
<div class="rc-select-item-option-content">20 Profiles / Page</div><span
class="rc-select-item-option-state" unselectable="on" aria-hidden="true"
style="user-select: none;"><span
class="rc-select-item-option-state-icon"></span></span>
</div>
<div aria-selected="false" class="rc-select-item rc-select-item-option"
title="500 Profiles / Page">
<div class="rc-select-item-option-content">500 Profiles / Page</div><span
class="rc-select-item-option-state" unselectable="on" aria-hidden="true"
style="user-select: none;"><span
class="rc-select-item-option-state-icon"></span></span>
</div>
...
</li>
First we need to close the pop-ups, and then click on the pagination options.
Also, mixing implicit and explicit waits is not recommended.
Try the following solution:
driver.get("https://www.personality-database.com/profile?pid=1&sort=hot")
wait = WebDriverWait(driver,30)
try:
    # Close the footer ad
    wait.until(EC.element_to_be_clickable((By.XPATH,"//span[@id='ezmob-wrapper']/div/center/span/div/div/span"))).click()
    # Scroll a little so that the cookie pop-up appears, then close it
    driver.execute_script("window.scrollBy(0,50);")
    wait.until(EC.element_to_be_clickable((By.XPATH,"//button[@id='rcc-confirm-button']"))).click()
except Exception:
    print("no ads")
# Click on the drop-down
pagination = wait.until(EC.element_to_be_clickable((By.XPATH,"//li[@class='rc-pagination-options']")))
pagination.click()
# Click on the 500 profiles option
option = wait.until(EC.element_to_be_clickable((By.XPATH,"//div[@class='rc-virtual-list-holder-inner']//div[text()='500 Profiles / Page']")))
option.click()
First XPath, to click the dropdown:
//div[@class='rc-select rc-pagination-options-size-changer rc-select-single rc-select-show-arrow']
Second XPath, to click the option for 500 profiles per page:
//div[@class='rc-select-item-option-content']/self::div[text()='500 Profiles / Page']
Here is a cheatsheet for relative XPaths: https://devhints.io/xpath
Please be aware that browsers use XPath 1.0 and Selenium also supports only 1.0,
so some functions such as 'ends-with' won't work.
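Since 'ends-with' is an XPath 2.0 function, the usual XPath 1.0 workaround combines substring() and string-length(). The helper below is a hypothetical sketch that only builds the expression string; the `title` attribute and the 'Page' suffix are examples borrowed from the HTML in the question.

```python
def xpath_ends_with(attribute: str, suffix: str) -> str:
    """Return an XPath 1.0 predicate equivalent to ends-with(@attribute, suffix).

    XPath 1.0 has no ends-with(), so compare the tail of the attribute
    value, taken with substring() and string-length(), to the suffix.
    """
    if "'" in suffix:
        # Keep the sketch simple: no escaping of single quotes.
        raise ValueError("suffix must not contain a single quote")
    # substring() is 1-indexed, so the tail starts at length - len(suffix) + 1.
    offset = len(suffix) - 1
    return (
        f"substring(@{attribute}, string-length(@{attribute}) - {offset})"
        f" = '{suffix}'"
    )


predicate = xpath_ends_with("title", "Page")
print(f"//div[{predicate}]")
```

The resulting string can be passed to `find_element(By.XPATH, ...)` like any other XPath.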
I have only just started learning Python, and I have a problem with my code. I use Python with Selenium, and I don't understand how to click on the first products__item element, go to its product page, return to the catalog, click on the second element, then on the third, and so on.
This is code from a typical online shop. I understand that I need a "for" loop for this, but I don't know how to write it.
<div class="products__list" style="position: relative">
<div class=" products__item ">
<a href="/catalog/dveri-mezhkomnatnyye/dveri-ekoshpon/bravo/bravo-21-snow-art" class="card">
<div class="card__title"> Product-1 </div>
<div class="card__color"> Snow Art </div>
<div class="card__img"> <div class="card__img-wrapper"> <img src="/storage/products/small/60fa84e2b06096.72907873.jpg"> </div>
<div class="card__img-wrapper"> </div>
<div class=" products__item ">
<a href="/catalog/dveri-mezhkomnatnyye/dveri-ekoshpon/bravo/bravo-21-snow" class="card">
<div class="card__title"> Product-2 </div>
<div class="card__color"> Snow </div>
<div class="card__img"> <div class="card__img-wrapper"> <img src="/storage/products/small/60fa84e2b06096.72907874.jpg"> </div>
<div class="card__img-wrapper"> </div>
<div class=" products__item ">
<a href="/catalog/dveri-mezhkomnatnyye/dveri-ekoshpon/bravo/bravo-21-snow-art-classic" class="card">
<div class="card__title"> Product-3 </div>
<div class="card__color"> Snow Art Classic </div>
<div class="card__img"> <div class="card__img-wrapper"> <img src="/storage/products/small/60fa84e2b06096.72907875.jpg"> </div>
<div class="card__img-wrapper"> </div>
Hope it helps!
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.action_chains import ActionChains
driver = webdriver.Chrome('path/to/chromedriver')
driver.get("https://dveri.com/catalog/dveri-mezhkomnatnyye?page=1")
wait = WebDriverWait(driver, 10)
elements_xpath = '//div[@class=" products__item "]/a[@class="card"]'
# Wait for elements to load
wait.until(EC.element_to_be_clickable((By.XPATH, elements_xpath)))
num_elements = len(driver.find_elements(By.XPATH, elements_xpath))
ac = ActionChains(driver)
for i in range(num_elements):
    # Wait until elements are clickable
    wait.until(EC.element_to_be_clickable((By.XPATH, elements_xpath)))
    # Get all elements and select only the i-th one
    element = driver.find_elements(By.XPATH, elements_xpath)[i]
    # Click the element with the offset from center to actually go to the other page
    ac.move_to_element(element).move_by_offset(0, 100).click().perform()
    # Here do whatever has to be done on a specific webpage
    time.sleep(1)
    # Go back to the previous page
    driver.execute_script("window.history.go(-1)")
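An alternative to clicking and navigating back is to collect every product URL first and then visit each one with driver.get(), which avoids re-finding elements after every page change. The sketch below works on the page HTML (for example driver.page_source) with the standard library only. It assumes product links are `<a class="card">` elements as in the question's snippet; `CardLinkParser` and `product_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CardLinkParser(HTMLParser):
    """Collect absolute URLs of <a class="card" href="..."> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if tag == "a" and "card" in classes and href:
            # Product hrefs in the snippet are relative, so resolve them.
            self.links.append(urljoin(self.base_url, href))


def product_links(html, base_url):
    parser = CardLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Usage sketch with Selenium:
# for url in product_links(driver.page_source, driver.current_url):
#     driver.get(url)
#     # ... do whatever has to be done on the product page ...
```

This trades the click simulation for plain navigation, so it only fits pages where the link target is present in the href.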
I am trying to scrape links from a youtube playlist with the help of following code:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pyperclip
import time
url = input('Please enter youtube playlist url: ')
driver = webdriver.Firefox()
driver.get(url)
elem = driver.find_element_by_tag_name('html')
elem.send_keys(Keys.END)
time.sleep(3)
elem.send_keys(Keys.END)
innerHTML = driver.execute_script("return document.body.innerHTML")
soup = bs(innerHTML, 'html.parser')
res = soup.select('div#content.style-scope.ytd-playlist-video-renderer a.yt-simple-endpoint.style-scope.ytd-playlist-video-renderer')
whole_list = ''
for i in res:
    print(i.get('href'))
    print(i['href'])
    print(i.attrs['href'])
    # whole_list = whole_list + " '" + i.get('href') + "', \n"
print(whole_list)
pyperclip.copy(whole_list)
driver.close()
YouTube's playlist video components are shown as follows in Chrome developer tools:
<a class="yt-simple-endpoint style-scope ytd-playlist-video-renderer" href="/watch?v=QXeEoD0pB3E&list=PLsyeobzWxl7poL9JTVyndKe62ieoN-MZ3&index=2&t=0s">
<ytd-thumbnail id="thumbnail" height="68" width="120" class="style-scope ytd-playlist-video-renderer">
<a id="thumbnail" class="yt-simple-endpoint inline-block style-scope ytd-thumbnail" aria-hidden="true" tabindex="-1" rel="null" href="/watch?v=QXeEoD0pB3E&list=PLsyeobzWxl7poL9JTVyndKe62ieoN-MZ3&index=2&t=0s">
<yt-img-shadow class="style-scope ytd-thumbnail no-transition" style="background-color: transparent;" loaded=""><img id="img" class="style-scope yt-img-shadow" alt="" width="120" src="https://i.ytimg.com/vi/QXeEoD0pB3E/hqdefault.jpg?sqp=-oaymwEZCPYBEIoBSFXyq4qpAwsIARUAAIhCGAFwAQ==&rs=AOn4CLCsnnE_5VNrXFHejH29sP0T7NSSmw"></yt-img-shadow>
<div id="overlays" class="style-scope ytd-thumbnail"><ytd-thumbnail-overlay-resume-playback-renderer class="style-scope ytd-thumbnail"><div id="progress" class="style-scope ytd-thumbnail-overlay-resume-playback-renderer" style="width: 100%;"></div></ytd-thumbnail-overlay-resume-playback-renderer><ytd-thumbnail-overlay-time-status-renderer class="style-scope ytd-thumbnail" overlay-style="DEFAULT"><span class="style-scope ytd-thumbnail-overlay-time-status-renderer" aria-label="66 seconds">
1:06
</span></ytd-thumbnail-overlay-time-status-renderer><ytd-thumbnail-overlay-now-playing-renderer class="style-scope ytd-thumbnail">
<span class="style-scope ytd-thumbnail-overlay-now-playing-renderer">Now playing</span>
</ytd-thumbnail-overlay-now-playing-renderer></div>
<div id="mouseover-overlay" class="style-scope ytd-thumbnail"></div>
<div id="hover-overlays" class="style-scope ytd-thumbnail"></div>
</a>
</ytd-thumbnail>
<div id="meta" class="style-scope ytd-playlist-video-renderer">
<h3 class="style-scope ytd-playlist-video-renderer">
<ytd-badge-supported-renderer class="style-scope ytd-playlist-video-renderer">
<dom-repeat id="repeat" as="badge" class="style-scope ytd-badge-supported-renderer"><template is="dom-repeat"></template></dom-repeat>
</ytd-badge-supported-renderer>
<span id="video-title" class="style-scope ytd-playlist-video-renderer" aria-label="#0 Python Tutorial | Python Programming Tutorial for Beginners | Course Introduction by Telusko 1 year ago 66 seconds 1,108,432 views" title="#0 Python Tutorial | Python Programming Tutorial for Beginners | Course Introduction">
#0 Python Tutorial | Python Programming Tutorial for Beginners | Course Introduction
</span>
</h3>
<ytd-video-meta-block class="playlist style-scope ytd-playlist-video-renderer">
<div id="metadata" class="style-scope ytd-video-meta-block">
<div id="byline-container" class="style-scope ytd-video-meta-block">
<ytd-channel-name id="channel-name" class="style-scope ytd-video-meta-block">
<div id="container" class="style-scope ytd-channel-name">
<div id="text-container" class="style-scope ytd-channel-name">
<yt-formatted-string id="text" class="style-scope ytd-channel-name complex-string" ellipsis-truncate="" title="Telusko" has-link-only_=""><a class="yt-simple-endpoint style-scope yt-formatted-string" spellcheck="false" href="/user/javaboynavin">Telusko</a></yt-formatted-string>
</div>
</div>
<ytd-badge-supported-renderer class="style-scope ytd-channel-name" disable-upgrade="" hidden="">
</ytd-badge-supported-renderer>
</ytd-channel-name>
<div id="separator" class="style-scope ytd-video-meta-block">•</div>
</div>
<div id="metadata-line" class="style-scope ytd-video-meta-block">
<dom-repeat strip-whitespace="" class="style-scope ytd-video-meta-block"><template is="dom-repeat"></template></dom-repeat>
</div>
</div>
<div id="additional-metadata-line" class="style-scope ytd-video-meta-block">
<dom-repeat class="style-scope ytd-video-meta-block"><template is="dom-repeat"></template></dom-repeat>
</div>
</ytd-video-meta-block>
</div>
<ytd-badge-supported-renderer id="badges" class="style-scope ytd-playlist-video-renderer" disable-upgrade="" hidden="">
</ytd-badge-supported-renderer>
<yt-formatted-string id="contributor" class="style-scope ytd-playlist-video-renderer" hidden=""></yt-formatted-string>
</a>
As you can see, I am trying all three suggestions I found online: i.get('href') gives me None, while the other two options raise an error. I have been stuck on this since yesterday and can't find what I am doing wrong.
Sometimes an <a> tag may not have an href, so I would use an if to skip it.
for i in res:
    href = i.get('href')
    if href:
        whole_list = whole_list + " '" + href + "', \n"
The code below gives me all hrefs for a sample playlist. You can see that it also gets None for the first i, but I skip this value.
from bs4 import BeautifulSoup as BS
from selenium import webdriver
import pyperclip
import time
#url = input('Please enter youtube playlist url: ')
url = 'https://www.youtube.com/playlist?list=PLmNPvQr9Tf-a4MrEG5thq3qzlkrF5NFbC'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(3)
html = driver.page_source
soup = BS(html, 'html.parser')
res = soup.select('a.yt-simple-endpoint.style-scope.ytd-playlist-video-renderer')
all_hrefs = []
for i in res:
    href = i.get('href')
    print(href)
    if href:
        all_hrefs.append(href)
text = ',\n'.join([" '{}'".format(x) for x in all_hrefs])
print(text)
pyperclip.copy(text)
driver.close()
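The hrefs collected this way are relative (for example "/watch?v=...&list=..."), so a follow-up step could turn them into full URLs and pull out the video id. This is a sketch using only the standard library; the helper names are hypothetical, and the base URL is assumed to be https://www.youtube.com.

```python
from urllib.parse import parse_qs, urljoin, urlparse

YOUTUBE_BASE = "https://www.youtube.com"  # assumed base for relative hrefs


def absolute_video_urls(hrefs):
    """Skip missing hrefs and resolve the rest against the YouTube base URL."""
    urls = []
    for href in hrefs:
        if not href:
            # Some <a> tags have no href; i.get('href') returns None for them.
            continue
        urls.append(urljoin(YOUTUBE_BASE, href))
    return urls


def video_id(url):
    """Return the value of the 'v' query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


# Sample input: one missing href and the href from the question's HTML
hrefs = [None, "/watch?v=QXeEoD0pB3E&list=PLsyeobzWxl7poL9JTVyndKe62ieoN-MZ3&index=2&t=0s"]
for url in absolute_video_urls(hrefs):
    print(video_id(url), url)
```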
This is the image where, when clicked, user is redirected to another page.
<div class="lis_el " id="cel_lisimg_18755" onclick="lis_mostrarficha(0);">
<div class="lis_elc ">
<div class="lis_eloverflow">
<div class="lis_elc_img">
<div class="lis_elc_imgc"><img class="lis_elc_img_img" id="lisimg_18755" src="https://sgfm.elcorteingles.es/SGFM/dctm/MEDIA03/201705/29/0280282401564764342_1_.jpg">
</div>
</div>
</div>
<div class="lis_info ">
<div class="clear"></div>
<div class="lis_info_precio">
5<span class="lis_info_preciop">,99€</span>
</div>
<h2>Camiseta flame</h2>
<div class="lis_mascol displaynone" id="lis_mascol18755" style="display: block;">+ Colores</div>
</div>
</div>
</div>
I'm trying to obtain that link using Selenium in Python, but I don't know where to get it from. I did notice this, however, which I suppose is the function that does the redirection:
onclick="lis_mostrarficha(0);"
I don't have much experience in web development, so I'm not sure how to obtain that link without clicking, as clicking would take too long.
Thanks,
You will have to perform the click event in this case because the HTML does not contain the URL linked to the image -- it calls a script. What can be done is to use Selenium to click the element that contains the onclick event.
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium import webdriver
options = Options()
options.add_argument('--disable-infobars')
driver = webdriver.Chrome(options=options)
# Load the listing page with driver.get(...) before looking up the element
div = driver.find_element(By.ID, 'cel_lisimg_18755')
div.click()
# Then wait for the page to load
# Get the URL
url = driver.current_url
print(url) # Assumes v3 python
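The "wait for the page to load" step can be made concrete by polling until the URL differs from the one seen before the click. The function below is a hypothetical, standard-library sketch that only reads `driver.current_url`; Selenium's own `expected_conditions.url_changes` used with `WebDriverWait` covers the same need.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, poll=0.2):
    """Poll driver.current_url until it differs from old_url.

    Returns the new URL, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = driver.current_url
        if current != old_url:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"URL still {old_url!r} after {timeout} s")
        time.sleep(poll)


# Usage sketch:
# old_url = driver.current_url
# div.click()
# url = wait_for_url_change(driver, old_url)
```

Remembering the URL before the click matters: reading it only afterwards gives nothing to compare against.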