I'm trying to scrape my Deezer music, but when I scroll the page, Selenium skips a lot of tracks: it skips the first 30 tracks, prints 10, then skips another 30, and so on until the end of the page.
Here is the code:
import selenium
from selenium import webdriver

path = "./chromedriver"
driver = webdriver.Chrome(executable_path=path)

url = 'https://www.deezer.com/fr/playlist/2560242784'
driver.get(url)

for i in range(0, 20):
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        musics = driver.find_elements_by_class_name('BT3T6')
        for music in musics:
            print(music.text)
    except Exception as e:
        print(e)
I tried scraping the page based on your code and it worked.
I decided to scroll the page by 500px per step and then remove all duplicates and empty strings.
import selenium
import time
from selenium import webdriver

path = "./chromedriver"
driver = webdriver.Chrome(executable_path=path)

url = 'https://www.deezer.com/fr/playlist/2560242784'
driver.get(url)

all_music = []
last_scroll_y = driver.execute_script("return window.scrollY")

for i in range(0, 100):
    try:
        # first scrape the currently rendered tracks
        musics = driver.find_elements_by_class_name('BT3T6')
        for music in musics:
            all_music.append(music.text)

        # then scroll down another 500px
        driver.execute_script("window.scrollTo(0, window.scrollY+500);")
        time.sleep(0.2)  # small wait for the new content (200 ms)

        current_scroll_y = driver.execute_script("return window.scrollY")
        # exit the loop if the page cannot be scrolled any further
        if current_scroll_y == last_scroll_y:
            break
        last_scroll_y = current_scroll_y
    except Exception as e:
        print(e)

# remove all empty strings
all_music = list(filter(None, all_music))

# remove duplicates while keeping the order
# based on https://stackoverflow.com/a/17016257/5226491
# Python 3.7+ required
all_music = list(dict.fromkeys(all_music))

# this also removes duplicates, but the order is not preserved
#all_music = list(set(all_music))

for m in all_music:
    print(m)

print('Total music found: ' + str(len(all_music)))
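A side note on the order-preserving de-duplication: list(dict.fromkeys(...)) relies on plain dicts keeping insertion order, which is only guaranteed from Python 3.7. On older versions, collections.OrderedDict gives the same result; a minimal sketch:

from collections import OrderedDict

# same idea as dict.fromkeys(), but insertion order is guaranteed on older Pythons too
all_music = list(OrderedDict.fromkeys(all_music))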
This runs for roughly 60-90 seconds and scrapes 1000+ items.
Note: it works fine with an active window and also in headless mode, but it stops scraping when I minimize the browser window. So either run it with the headless Chrome option
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
driver = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)
or do not minimize the window.
I'm currently having quite the issue with Selenium.
I am trying to get all the links on a page, click each one, collect the data from the resulting page, and go back. Even when handling StaleElementReferenceException, the loop breaks completely, despite using driver.back() as advised.
The code is as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from datetime import datetime
from pymongo import MongoClient
from selenium.common.exceptions import StaleElementReferenceException

options = Options()
options.page_load_strategy = 'none'
# options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

url = "https://www.depop.com/purevintage_clothing/"
# driver = webdriver.Chrome()
driver.get(url)

# links is collected earlier, roughly like the line below (the selector is omitted in the question)
# links = driver.find_elements_by_xpath(path_to_elements)
for link in links:
    linkClass = link.get_attribute("class")
    try:
        if str(linkClass[:19]) == "styles__ProductCard":
            action = ActionChains(driver)
            action.move_to_element(link)
            action.click().perform()
            product = doSomethingFunction()
            if product != None:
                insertIntoDatabase(product)
            driver.back()
    except StaleElementReferenceException as e:
        print(e)
        driver.back()
I am aware the indentation is a bit dodgy here; I wrote this out manually, since I'm not sure the rest of the processing code, such as insertIntoDatabase, is relevant (please let me know if you need all of it).
Whenever I run this I end up with the exception being raised on every iteration of the loop despite the driver.back(). I'm sure the answer is staring me in the face and I'm a bit too dense to see it, but any help is appreciated.
Every time you go back to the main page you need to get the links again, because they are no longer present in the DOM after you change pages; so you should do the following:
links = driver.find_elements_by_xpath(path_to_elements)
for i in range(len(links)):
    # re-locate the elements on every iteration to avoid stale references
    link = driver.find_elements_by_xpath(path_to_elements)[i]
    linkClass = link.get_attribute("class")
    if str(linkClass[:19]) == "styles__ProductCard":
        action = ActionChains(driver)
        action.move_to_element(link)
        action.click().perform()
        product = doSomethingFunction()
        if product != None:
            insertIntoDatabase(product)
        driver.back()
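As an alternative that avoids stale elements altogether, you could collect the href of every matching card up front and then navigate to each URL directly, instead of clicking and calling driver.back(). A rough sketch, assuming the product cards are anchor elements with an href and reusing the doSomethingFunction/insertIntoDatabase helpers from the question:

links = driver.find_elements_by_xpath(path_to_elements)

# plain strings cannot go stale, so collect the URLs first
urls = [link.get_attribute("href")
        for link in links
        if str(link.get_attribute("class"))[:19] == "styles__ProductCard"]

for url in urls:
    driver.get(url)  # navigate directly instead of click + back
    product = doSomethingFunction()
    if product is not None:
        insertIntoDatabase(product)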
I am trying to iterate over multiple pages of a website; however, the code below only returns the results from the first page, even though I am using Selenium to click through to the next page. I am at a loss for what could be causing this. Any explanation would be much appreciated!
The website in question:
https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}
from selenium import webdriver
import time
import xlsxwriter
from lxml import html

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'

driver = webdriver.Chrome()
driver.get(u)
driver.maximize_window()
time.sleep(.3)

driver.find_element_by_id('restoreSettingsYesEncl').click()  # select 'yes' on the webpage to restore settings
time.sleep(7)  # wait until the website downloads data so we get a return value

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")
t = html.fromstring(source_code)

for i in range(5):
    for cell in t.xpath('.//td[@class="dc-table-column _0"]/text()'):
        print(cell.strip())
    driver.find_element_by_xpath('//*[@id="listings-table-split"]/div[5]/div/span[4]').click()  # click to next page
    time.sleep(.05)

driver.quit()
In the code above, t gets its value outside the loop:

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")
t = html.fromstring(source_code)

for i in range(5):
    ...

so it is loaded only the first time and the same elements keep being repeated. To fix this, you need to move it inside the loop, as in the code below:
from selenium import webdriver
import time
import xlsxwriter
from lxml import html

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'

driver = webdriver.Chrome()
driver.get(u)
driver.maximize_window()
time.sleep(.3)

driver.find_element_by_id('restoreSettingsYesEncl').click()  # select 'yes' on the webpage to restore settings
time.sleep(7)  # wait until the website downloads data so we get a return value

for i in range(5):
    # re-read the page source on every iteration so t reflects the current page
    elem = driver.find_element_by_xpath("//*")
    source_code = elem.get_attribute("innerHTML")
    t = html.fromstring(source_code)
    for cell in t.xpath('.//td[@class="dc-table-column _0"]/text()'):
        print(cell.strip())
    driver.find_element_by_xpath('//*[@id="listings-table-split"]/div[5]/div/span[4]').click()  # click to next page
    time.sleep(.05)

driver.quit()
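If you want to avoid the fixed time.sleep(7), the wait can be made explicit so scraping starts as soon as the table has data. A minimal sketch, assuming the data cells keep the dc-table-column _0 classes used above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 20 seconds for at least one data cell instead of sleeping blindly
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'td.dc-table-column._0'))
)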
I can't drag and drop in Selenium with the latest ChromeDriver.
selenium = '3.141.0'
Python 3.7
Chrome = 74.0.3729.169
ChromeDriver = latest
The code below executes successfully, but the items are not dragged from the source to the destination, and I am not getting any error at all. I tried all of the solutions below, one by one, but none of them work.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

cd = webdriver.Chrome('Chromedriver.exe')
cd.get('https://www.seleniumeasy.com/test/drag-and-drop-demo.html')
cd.maximize_window()

elements = cd.find_element_by_id('todrag')
drag_item = elements.find_elements_by_tag_name('span')
drag_to = cd.find_element_by_id('mydropzone')

# Solution 1 (not working)
for i in drag_item:
    action = ActionChains(cd)
    action.drag_and_drop(i, drag_to).perform()  # this is not working

    # Solution 2 (not working)
    ActionChains(cd).click_and_hold(i).move_to_element(drag_to).release(drag_to).perform()

    # Solution 3 (not working; the js helper files need to be downloaded first)
    jquery_url = "http://code.jquery.com/jquery-1.11.2.min.js"
    with open("jquery_load_helper.js") as f:
        load_jquery_js = f.read()
    with open("drag_and_drop_helper.js") as f:
        js = f.read()
    cd.execute_async_script(load_jquery_js, jquery_url)
    cd.execute_script(js + "$('arguments[0]').simulateDragDrop({ dropTarget: \"arguments[1]\"});", i, drag_to)
I think there is something wrong with the site because this example I found on the web seems to work:
import time
from selenium import webdriver
from selenium.webdriver import ActionChains
# Create chrome driver.
driver = webdriver.Chrome()
# Open the webpage.
driver.get("https://openwritings.net/sites/default/files/selenium-test-pages/drag-drop.html")
# Pause for 5 seconds for you to see the initial state.
time.sleep(5)
# Drag and drop to target item.
##################################
drag_item = driver.find_element_by_id("draggable")
target_item = driver.find_element_by_id("droppable")
action_chains = ActionChains(driver)
action_chains.drag_and_drop(drag_item, target_item).perform()
##################################
# Pause for 10 seconds so that you can see the results.
time.sleep(10)
# Close.
driver.quit()
Hopefully, that example helped you!
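For the original seleniumeasy page, where drag_and_drop() finishes without an error but nothing actually moves, one workaround that sometimes helps is to perform the drag in several small mouse moves with short pauses, since some drag-and-drop implementations ignore a single instantaneous jump. A rough sketch (not guaranteed to work on every site), reusing the cd, drag_item and drag_to names from the question:

from selenium.webdriver.common.action_chains import ActionChains

def slow_drag(driver, source, target, steps=10):
    # move the held element towards the target in small increments
    dx = int((target.location['x'] - source.location['x']) / steps)
    dy = int((target.location['y'] - source.location['y']) / steps)
    chain = ActionChains(driver).click_and_hold(source)
    for _ in range(steps):
        chain = chain.move_by_offset(dx, dy).pause(0.1)
    chain.release().perform()

slow_drag(cd, drag_item[0], drag_to)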
My code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# open the url
url = "https://www.youtube.com/user/xuanvinh1612/community"
driver_path = ('F:/chromedriver.exe')
browser = webdriver.Chrome(executable_path=driver_path)
browser.get(url)

# Auto scroll and auto click on the 'Read more' links
read_mores2 = browser.find_elements_by_link_text('Read more')
for read_more in read_mores2:
    browser.execute_script("arguments[0].scrollIntoView();", read_more)
    browser.execute_script("$(arguments[0]).click();", read_more)

# Scroll down; stop when all posts have been shown
read_mores2 = browser.find_elements_by_link_text('Read more')
The same code works on a few other websites (2-3 of them), but when I reuse it to auto-scroll and auto-click on the YouTube community page, it does not work. I don't know why it fails. I need help, please.
Try this code:
It will first load all the pages, then click on all Read More.
import time
from selenium import webdriver

# open the url
url = "https://www.youtube.com/user/xuanvinh1612/community"
browser = webdriver.Chrome()
browser.get(url)

# Auto scroll and auto click with text: 'Read more'
previous_count = 0
page_sections = browser.find_elements_by_css_selector('.style-scope.ytd-item-section-renderer')
current_count = len(page_sections)

print("Scrolling to enable all the pages")
while previous_count != current_count:
    try:
        previous_count = current_count
        browser.execute_script("arguments[0].scrollIntoView();", page_sections[-1])
        print("Number of total Elements found: {}".format(len(page_sections)))
    finally:
        # As the page loads the newer elements, you need to implement logic here to wait until
        # the loading spinner at the bottom becomes invisible (not attached to the DOM)
        time.sleep(2)  # workaround until the above logic is implemented
        page_sections = browser.find_elements_by_css_selector('.style-scope.ytd-item-section-renderer')
        current_count = len(page_sections)

print("Clicking on all Read More")
for read_more in browser.find_elements_by_css_selector('.more-button'):
    browser.execute_script("arguments[0].scrollIntoView();", read_more)
    browser.execute_script("arguments[0].click();", read_more)
I'm trying to scrape this website: http://data.eastmoney.com/xg/xg/
So far I've used Selenium to execute the JavaScript and scrape the table. However, my code currently only gets me the first page. I was wondering if there's a way to access the other 17 pages, because when I click on the next page the URL does not change, so I cannot just iterate over a different URL each time.
Below is my code so far:
from selenium import webdriver
import lxml
from bs4 import BeautifulSoup
import time

def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d = {}
    f = open('east.txt', 'a')
    driver = webdriver.PhantomJS()
    driver.get(url)
    lst = [x for x in range(0, 25)]
    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource)
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
        for i in lst:
            if i != 2:
                hlist.append(head[i].get_text().strip())
    h = '|'.join(hlist)
    print h
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()] = [y.get_text() for y in cells]
    for key in d:
        ret = []
        for i in lst:
            if i != 2:
                ret.append(d.get(key)[i])
        s = '|'.join(ret)
        print s

if __name__ == "__main__":
    scrape()
Or is it possible for me to click "next" through the browser if I use webdriver.Chrome() instead of PhantomJS, and then have the Python code run on the new page after each click?
This is not a trivial page to interact with and would require the use of Explicit Waits to wait for invisibility of "loading" indicators.
Here is the complete and working implementation that you may use as a starting point:
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"

driver = webdriver.PhantomJS()
driver.get(url)

def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print [cell.text for cell in row.find_elements_by_tag_name("td")]

# initial wait for results
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))

while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print "Page #" + page_number

    get_table_results(driver)

    next_link = driver.find_element_by_link_text("下一页")
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: fix?

    # wait for results to load
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))

    print "------"
The idea is to have an endless loop that we exit only once the "Next Page" link becomes disabled (no more pages available). On every iteration, get the table results (printed to the console for the sake of the example), click the next link, and wait for the "loading" spinning circle that appears on top of the grid to become invisible.
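Regarding the follow-up about webdriver.Chrome(): the same loop works unchanged with Chrome, visible or headless, so PhantomJS is not required. A minimal sketch of the swap, assuming chromedriver is available on your PATH:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True  # drop this line if you want to watch the clicks happen
driver = webdriver.Chrome(options=options)
driver.get(url)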
I found another way to do this in C# using ChromeDriver and Selenium. All you have to do is add the Selenium references to your project and reference chromedriver.exe.
In your code you can navigate to the URL using
using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl(pathofurl);
    // find your element using FindElementByXPath
    // var element = driver.FindElementByXPath(--Xpath--).Text;
}
Finding the XPath is easy: just install a scraper or XPath extension from the Chrome Web Store. Once you get the hang of XPaths for elements, you can find the XPath of the "next" button and use it in your code to navigate through the pages in a loop very easily. Hope this helps.