I'm trying to get information from the last link in the chain on this website. The problem is that my list of elements comes back empty, even though find_element (singular) works when I try it.
Here is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
options = Options()
# Creating our results DataFrame
all_services = pd.DataFrame(columns=['Profil', 'Motif', 'Questions', 'Reponses'])
path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
driver = webdriver.Chrome(options=options, executable_path=path)
# we are going to visit all profils procedures
# for profil in ['particuliers','professionnels','associations']:
# driver.get("https://www.demarches.interieur.gouv.fr/{profil}/accueil-{profil}")
driver.get("https://www.demarches.interieur.gouv.fr/associations/accueil-associations")
# Get all first elements in bodyFiche id which contains all procedures for associations profile
list_of_services = driver.find_elements_by_class_name("liste-sous-menu")
for service in list_of_services:
    # In each element, select the a tags
    # atags = service.find_elements_by_css_selector('a')
    atags = service.find_elements_by_xpath("//li[starts-with(@id,'summary')]")
    for atag in atags:
        # In each atag, select the href
        href = atag.get_attribute('href')
        print(href)
        # Open a new window
        driver.execute_script("window.open('');")
        # Switch to the new window and open the URL
        driver.switch_to.window(driver.window_handles[1])
        driver.get(href)
        # We are now on the second link
        # Get all links in the iterated element
        list_of_services2 = driver.find_elements_by_class_name("content")
        for service2 in list_of_services2:
            atags2 = service2.find_elements_by_css_selector('a')
            for atag2 in atags2:
                href = atag2.get_attribute('href')
                driver.execute_script("window.open('');")
                driver.switch_to.window(driver.window_handles[1])
                driver.get(href)
                # We are now on the third link
                # Get all links in the iterated element
                list_of_services3 = driver.find_elements_by_class_name("content")
                for service3 in list_of_services3:  # was list_of_services2, which re-walked the previous level
                    atags3 = service3.find_elements_by_css_selector('a')
                    for atag3 in atags3:
                        href = atag3.get_attribute('href')
                        driver.execute_script("window.open('');")
                        driver.switch_to.window(driver.window_handles[1])
                        driver.get(href)
                        # Get the Q/A section
                        list_of_services4 = driver.find_elements_by_class_name("QuestionReponse")
                        for service4 in list_of_services4:
                            atags4 = service4.find_elements_by_css_selector('a')  # was find.elements_by_css_selector
                            for atag4 in atags4:
                                href = atag4.get_attribute('href')  # was atag3
                                # We store our questions (the link text, not the href string)
                                questions = atag4.text
                                driver.execute_script("window.open('');")
                                driver.switch_to.window(driver.window_handles[1])
                                driver.get(href)
                                # Get data
                                reponses = driver.find_elements_by_class_name("texte")
                                all_services = all_services.append({'Questions': questions,
                                                                    'Reponses': reponses}, ignore_index=True)
                                driver.close()
                                driver.switch_to.window(driver.window_handles[0])
                        driver.close()
                        driver.switch_to.window(driver.window_handles[0])
                driver.close()
                driver.switch_to.window(driver.window_handles[0])
        # Close the tab with URL B
        driver.close()
        # Switch back to the first tab with URL A
        driver.switch_to.window(driver.window_handles[0])
driver.close()
all_services.to_excel('Limit_Testing.xlsx', index=False)
I'm not sure whether my method works. The idea is to walk the links like a tree, and once I reach a leaf I collect the information I want. Correct me if I'm wrong.
I don't know why list_of_services is an empty list, even though the class name I use is correct.
What has worked for me in previous experiences: add waiting time. The logic is that when you make the GET request, you immediately check whether there is a WebElement with class='liste-sous-menu', without waiting for the driver to finish loading the website; this causes the list to be empty because there is nothing to return yet. Therefore, my suggestion is the following:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
## Import sleep
from time import sleep
options = Options()
path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
driver = webdriver.Chrome(options=options, executable_path=path)
driver.get("https://www.demarches.interieur.gouv.fr/associations/accueil-associations")
################### HERE YOU ADD SOME WAITING TIME; it will depend on the speed of your computer/driver
sleep(0.5)
list_of_services = driver.find_elements_by_class_name("liste-sous-menu")
I have applied it in your code and it now seems to be returning a list with content. However, it does not return the links; it just returns the UL (unordered list) that contains them, so you will need to dig deeper once you have the UL element. This means adding the following:
list_of_services = driver.find_elements_by_class_name("liste-sous-menu")
### find_elements returns a list, so grab the UL element itself first
menu = list_of_services[0]
### Now you get the li elements (each row)
services = menu.find_elements_by_tag_name('li')
## Now you iterate over the services object (list of 'li' elements)
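As a side note, a fixed sleep can be flaky on slow connections; an explicit wait is usually more reliable. Here is a minimal sketch of the same idea with WebDriverWait, assuming the 'liste-sous-menu' class is indeed the element you need to wait for:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds for the menu to be present instead of sleeping a fixed amount
wait = WebDriverWait(driver, 10)
menu = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "liste-sous-menu")))
services = menu.find_elements_by_tag_name('li')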
Hope this solves your question.
Related
I'm new to Python and Selenium. I'm trying to do something which I'm sure I'm going about in a very round-about way; any help is greatly appreciated.
The page I'm trying to parse has different cards that need to be clicked on; I need to go to each card and, from there, grab the name (h1) and the URL. I haven't gotten very far, and this is what I have so far.
I go through the first page, grab all the URLs, and add them to a list. Then I want to go through the list, go to each URL (opening a new tab), and from there grab the h1 and URL. It doesn't seem like I'm even able to grab the h1; it opens a new tab, then hangs, then opens the same tab again.
Thank you in advance!
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise//') #main URL
title_links = driver.find_elements_by_css_selector('ul.n4 a')
urls = [] #list of URLs
# main = driver.find_elements_by_id('enterprise-list')
for item in title_links:
    urls.append(item.get_attribute('href'))
# print(urls)
for url in urls:
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(url)
    print(driver.find_element_by_css_selector('div.info h1'))
Well, there are a few issues here:
You should be much more specific with your selector for grabbing URLs. The current one returns multiple copies of the same URL, which is why it keeps opening the same pages.
You should give the site enough time to load before trying to grab objects; that may be why it's timing out, and it's always good to be on the safe side before grabbing objects.
You have to shift focus back to the original page to continue iterating over the list.
You don't need to inject JS with an empty window.open and then a separate Python get() call; passing the URL straight to window.open is cleaner.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/') # main URL
# Be much more specific or you'll get multiple returns of the same link
urls = driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')  # this is a CSS selector, so use By.CSS_SELECTOR
for url in urls:
    # get href to print
    print(url.get_attribute('href'))
    # Inject JS to open new tab
    driver.execute_script("window.open(arguments[0])", url)
    # Switch focus to new tab
    driver.switch_to.window(driver.window_handles[1])
    # Make sure what we want has time to load and exists before trying to grab it
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    # Grab it and print its contents
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)
    # Uncomment the next line to do one tab at a time. Will reduce speed but not use so much ram.
    # driver.close()
    # Focus back on first window
    driver.switch_to.window(driver.window_handles[0])
# Close window
driver.quit()
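If you would rather avoid tab management altogether, another option is to collect the hrefs as plain strings first and then visit them one by one in the same tab. A rough sketch along those lines (the selectors are borrowed from the snippet above and should be treated as assumptions about the page structure):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/')
# Grab the href strings up front so navigating away cannot invalidate the elements
links = [a.get_attribute('href') for a in driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')]
for link in links:
    driver.get(link)
    # Wait for the company name to be visible before reading it
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)
driver.quit()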
I am trying to iterate over multiple pages of a website; however, the code I am using below only returns the results from the first page, even though I am using Selenium to click to the next page. I am at a loss for what could be causing this. Any explanation would be much appreciated!
The website in question:
https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}
from selenium import webdriver
import time
import xlsxwriter
from lxml import html
u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
driver = webdriver.Chrome()
driver.get(u)
driver.maximize_window()
time.sleep(.3)
driver.find_element_by_id('restoreSettingsYesEncl').click() # select 'yes' on the webpage to restore settings
time.sleep(7) # wait until the website downloads data so we get a return value
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")
t = html.fromstring(source_code)
for i in range(5):
    for i in t.xpath('.//td[@class="dc-table-column _0"]/text()'):
        print(i.strip())
    driver.find_element_by_xpath('//*[@id="listings-table-split"]/div[5]/div/span[4]').click() # click to next page
    time.sleep(.05)
driver.quit()
In the code above, t gets its value outside the loop:
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")
t = html.fromstring(source_code)
for i in range(5):
so it is loaded only the first time and keeps repeating the same elements. To fix this, move it inside the loop, as in the code below:
from selenium import webdriver
import time
import xlsxwriter
from lxml import html
u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
driver = webdriver.Chrome()
driver.get(u)
driver.maximize_window()
time.sleep(.3)
driver.find_element_by_id('restoreSettingsYesEncl').click() # select 'yes' on the webpage to restore settings
time.sleep(7) # wait until the website downloads data so we get a return value
for i in range(5):
    elem = driver.find_element_by_xpath("//*")
    source_code = elem.get_attribute("innerHTML")
    t = html.fromstring(source_code)
    for i in t.xpath('.//td[@class="dc-table-column _0"]/text()'):
        print(i.strip())
    driver.find_element_by_xpath('//*[@id="listings-table-split"]/div[5]/div/span[4]').click() # click to next page
    time.sleep(.05)
driver.quit()
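One further refinement worth considering: the fixed time.sleep(.05) may be too short for the next page to render, in which case the same table gets re-read. Below is a rough sketch of waiting for the old table contents to go stale instead; it assumes the old cells are removed from the DOM when the page advances, which may not hold for every table widget:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Remember a cell from the current page, click "next", then wait for that cell to be detached
old_cell = driver.find_element_by_xpath('//td[@class="dc-table-column _0"]')
driver.find_element_by_xpath('//*[@id="listings-table-split"]/div[5]/div/span[4]').click()  # click to next page
WebDriverWait(driver, 10).until(EC.staleness_of(old_cell))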
I am extracting board members from a list of URLs. For each URL in URL_lst, I click the first XPath (the 'View More' button, to expand the list), then extract values from the second XPath (the board members' info).
Below are the three companies I want to extract info from: https://www.bloomberg.com/quote/FB:US, https://www.bloomberg.com/quote/AAPL:US, https://www.bloomberg.com/quote/MSFT:US
My code is shown below, but it doesn't work: the Outputs list does not aggregate anything. I know something is wrong with the loop but I don't know how to fix it. Can anyone tell me how to correct the code? Thanks!
URL_lst = ['https://www.bloomberg.com/quote/FB:US','https://www.bloomberg.com/quote/AAPL:US','https://www.bloomberg.com/quote/MSFT:US']
Outputs = []
driver = webdriver.Chrome(r'xxx\chromedriver.exe')
for url in URL_lst:
    driver.get(url)
    for c in driver.find_elements_by_xpath("//*[@id='root']/div/div/section[3]/div[10]/div[2]/div/span[1]"):
        c.click()
        for e in c.find_elements_by_xpath('//*[@id="root"]/div/div/section[3]/div[10]/div[1]/div[2]/div/div[2]')[0].text.split('\n'):
            Outputs.append(e)
print(Outputs)
Based on the URLs you provided, I did some refactoring for you. I added a wait on each item you are trying to click and a scrollIntoView JavaScript call to scroll down to the View More button. You were originally clicking View More buttons in a loop, but your XPath only returned one element, so the loop was redundant.
I also refactored your selector for board members to query directly on the div element containing their names. Your original query was finding a div several levels above the actual name text, which is why your Outputs list was coming back empty.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from time import sleep

URL_lst = ['https://www.bloomberg.com/quote/FB:US','https://www.bloomberg.com/quote/AAPL:US','https://www.bloomberg.com/quote/MSFT:US']
Outputs = []
driver = webdriver.Chrome(r'xxx\chromedriver.exe')
wait = WebDriverWait(driver, 30)
for url in URL_lst:
    driver.get(url)
    # get "Board Members" header
    board_members_header = wait.until(EC.presence_of_element_located((By.XPATH, "//h2[span[text()='Board Members']]")))
    # scroll down to board members
    driver.execute_script("arguments[0].scrollIntoView();", board_members_header)
    # get view more button
    view_more_button = wait.until(EC.presence_of_element_located((By.XPATH, "//section[contains(@class, 'PageMainContent')]/div/div[2]/div/span[span[text()='View More']]")))
    # click view more button
    view_more_button.click()
    # wait on 'View Less' to exist, meaning the list is expanded now
    wait.until(EC.presence_of_element_located((By.XPATH, "//section[contains(@class, 'PageMainContent')]/div/div[2]/div/span[span[text()='View Less']]")))
    # wait on visibility of board member names
    wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@class, 'boardWrap')]//div[contains(@class, 'name')]")))
    # get list of board member names
    board_member_names = driver.find_elements_by_xpath("//div[contains(@class, 'boardWrap')]//div[contains(@class, 'name')]")
    for board_member in board_member_names:
        Outputs.append(board_member.text)
    # explicit sleep to avoid being flagged as bot
    sleep(5)
print(Outputs)
I also added an explicit sleep between URL grabs, so that Bloomberg does not flag you as a bot.
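If you want that delay to look a little less mechanical, one option is to randomize it; this is purely a sketch, and the 4 to 9 second range is arbitrary:
import random
from time import sleep
# Sleep a random 4 to 9 seconds between companies instead of a fixed 5
sleep(random.uniform(4, 9))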
I'm trying to automate the scraping of links from here:
https://thegoodpubguide.co.uk/pubs/?paged=1&order_by=category&search=pubs&pub_name=&postal_code=&region=london
Once I have the first page, I want to click the right chevron at the bottom in order to move to the second page, the third, and so on, scraping the links in between.
Unfortunately, nothing I try will allow me to send Chrome to the next page.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from datetime import datetime
import csv
from selenium.webdriver.common.action_chains import ActionChains

#User login info
pagenum = 1

#Creates link to Chrome Driver and shortens this to 'browser'
path_to_chromedriver = '/Users/abc/Downloads/chromedriver 2' # change path as needed
driver = webdriver.Chrome(executable_path = path_to_chromedriver)

#Navigates Chrome to the specified page
url = 'https://thegoodpubguide.co.uk/pubs/?paged=1&order_by=category&search=pubs&pub_name=&postal_code=&region=london'

#Clicks Login
def findlinks(address):
    global pagenum
    list = []
    driver.get(address)
    #wait
    while pagenum <= 2:
        for i in range(20): # Scrapes available links
            xref = '//*[@id="search-results"]/div[1]/div[' + str(i+1) + ']/div/div/div[2]/div[1]/p/a'
            link = driver.find_element_by_xpath(xref).get_attribute('href')
            print(link)
            list.append(link)
        with open("links.csv", "a") as fp: # Saves list to file
            wr = csv.writer(fp, dialect='excel')
            wr.writerow(list)
        print(pagenum)
        pagenum = pagenum + 1
        element = driver.find_element_by_xpath('//*[@id="search-results"]/div[2]/div/div/ul/li[8]/a')
        element.click()

findlinks(url)
Is something blocking the button that I'm not seeing?
The error printed in my terminal:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="search-results"]/div[2]/div/div/ul/li[8]/a"}
Try this:
element = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[class='next-page btn']")))
element.click()
EDIT :
The XPath you're specifying for the chevron varies between pages, and is not exactly correct. Note the li[6], li[8], and li[9].
On page 1: the xpath is //*[@id="search-results"]/div[2]/div/div/ul/li[6]/a/i
On page 2: the xpath is //*[@id="search-results"]/div[2]/div/div/ul/li[8]/a/i
On page 3: the xpath is //*[@id="search-results"]/div[2]/div/div/ul/li[9]/a/i
You'll have to come up with some way of determining which XPath to use. Here's a hint: it seems that the last li under //*[@id="search-results"]/div[2]/div/div/ul/ designates the chevron, as in the sketch below.
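Following that hint, a minimal sketch of one way to always target whatever the last li happens to be; the XPath is untested against the live page, so treat it as an assumption:
# Click the anchor inside the last <li> of the pager, wherever it ends up
next_chevron = driver.find_element_by_xpath('//*[@id="search-results"]/div[2]/div/div/ul/li[last()]/a')
next_chevron.click()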
ORIGINAL POST :
You may want to try waiting for the page to load before you try to find and click the chevron. I usually just do a time.sleep(...) when I'm testing my automation script, but for (possibly) more sophisticated functions, try Waits. See the documentation here.
I'm trying to scrape this website: http://data.eastmoney.com/xg/xg/
So far I've used Selenium to execute the JavaScript and get the table scraped. However, my code right now only gets me the first page. I was wondering if there's a way to access the other 17 pages, because when I click on next page the URL does not change, so I cannot just iterate over a different URL each time.
Below is my code so far:
from selenium import webdriver
import lxml
from bs4 import BeautifulSoup
import time
def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d = {}
    f = open('east.txt','a')
    driver = webdriver.PhantomJS()
    driver.get(url)
    lst = [x for x in range(0,25)]
    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource)
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
        for i in lst:
            if i != 2:
                hlist.append(head[i].get_text().strip())
        h = '|'.join(hlist)
        print h
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()] = [y.get_text() for y in cells]
    for key in d:
        ret = []
        for i in lst:
            if i != 2:
                ret.append(d.get(key)[i])
        s = '|'.join(ret)
        print s

if __name__ == "__main__":
    scrape()
Or is it possible for me to click next through the browser if I use webdriver.Chrome() instead of PhantomJS, and then have the Python run on the new page after each click?
This is not a trivial page to interact with and would require the use of Explicit Waits to wait for invisibility of "loading" indicators.
Here is the complete and working implementation that you may use as a starting point:
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"

driver = webdriver.PhantomJS()
driver.get(url)

def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print [cell.text for cell in row.find_elements_by_tag_name("td")]

# initial wait for results
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))

while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print "Page #" + page_number

    get_table_results(driver)

    next_link = driver.find_element_by_link_text("下一页")
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: fix?

    # wait for results to load
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))

    print "------"
The idea is to have an endless loop which we would exit only if the "Next Page" link becomes disabled (no more pages available). On every iteration, get the table results (printing on the console for the sake of an example), click the next link and wait for invisibility of the "loading" spinning circle appearing on top of the grid.
I found another way to do this in C#, using ChromeDriver and Selenium. All you have to do is add the Selenium references to your project and reference chromedriver.exe.
In your code you can navigate to the URL using:
using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl(pathofurl);
    // Find your element with FindElementByXPath
    // var element = driver.FindElementByXPath(--XPath--).Text;
}
Finding an XPath is easy: just install a scraper or XPath extension from the Chrome Web Store. Once you get the hang of XPaths for elements, you can find the XPath for the next button and use it in your code to navigate through the pages in a loop. Hope this helps.