I'm trying to scrape this website: http://data.eastmoney.com/xg/xg/
So far I've used Selenium to execute the JavaScript and scrape the table. However, my code currently only gets me the first page. I was wondering if there's a way to access the other 17 pages, because when I click "next page" the URL does not change, so I can't just iterate over a different URL each time.
Below is my code so far:
from selenium import webdriver
import lxml
from bs4 import BeautifulSoup
import time
def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d = {}
    f = open('east.txt', 'a')
    driver = webdriver.PhantomJS()
    driver.get(url)
    lst = [x for x in range(0, 25)]
    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource)
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
        for i in lst:
            if i != 2:
                hlist.append(head[i].get_text().strip())
    h = '|'.join(hlist)
    print h
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()] = [y.get_text() for y in cells]
    for key in d:
        ret = []
        for i in lst:
            if i != 2:
                ret.append(d.get(key)[i])
        s = '|'.join(ret)
        print s

if __name__ == "__main__":
    scrape()
Or is it possible for me to click "next" through the browser if I use webdriver.Chrome() instead of PhantomJS, and then have the Python run on the new page after each click?
This is not a trivial page to interact with and would require the use of Explicit Waits to wait for invisibility of "loading" indicators.
Here is the complete and working implementation that you may use as a starting point:
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"
driver = webdriver.PhantomJS()
driver.get(url)

def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print [cell.text for cell in row.find_elements_by_tag_name("td")]

# initial wait for results
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))

while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print "Page #" + page_number

    get_table_results(driver)

    next_link = driver.find_element_by_link_text("下一页")
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: fix?

    # wait for results to load
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))
    print "------"
The idea is to have an endless loop that we exit only when the "Next Page" link becomes disabled (no more pages available). On every iteration, get the table results (printed to the console for the sake of the example), click the next link, and wait for the invisibility of the "loading" spinning circle that appears on top of the grid.
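PhantomJS has since been deprecated in Selenium, so if you prefer the webdriver.Chrome() route mentioned in the question, here is a minimal sketch of the same loop with headless Chrome. It reuses the selectors from the answer above (the #dt_1 table, the gopage input, the 下一页 link and the loading markers), which may have changed on the site since this was written:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("http://data.eastmoney.com/xg/xg/")

wait = WebDriverWait(driver, 10)
# initial wait for the grid to finish loading
wait.until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))

while True:
    page_number = driver.find_element(By.ID, "gopage").get_attribute("value")
    print("Page #" + page_number)

    # print every data row of the current page
    for row in driver.find_elements(By.CSS_SELECTOR, "table#dt_1 tr[class]"):
        print([cell.text for cell in row.find_elements(By.TAG_NAME, "td")])

    next_link = driver.find_element(By.LINK_TEXT, u"下一页")
    if "nolink" in next_link.get_attribute("class"):
        break  # "Next Page" is disabled: no more pages
    next_link.click()

    # wait for the loading spinner to disappear before scraping the next page
    wait.until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))

driver.quit()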
I found another way to do this in C# using ChromeDriver and Selenium. All you have to do is add the Selenium references to your project and point it at chromedriver.exe.
In your code you can navigate to the URL using:
using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl(pathofurl);
    // find your element by using FindElementByXPath
    // var element = driver.FindElementByXPath(--XPath--).Text;
}
Finding the XPath is easy: just install a scraper or XPath extension from the Chrome Web Store. Once you get the hang of XPaths for elements, you can find the XPath of the next button and use it in your code to navigate through pages in a loop very easily. Hope this helps.
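Translated to Python, the same idea of copying the next button's XPath and clicking it in a loop is only a few lines. Note that NEXT_BUTTON_XPATH below is a hypothetical placeholder you would replace with the XPath copied from your browser:
# Sketch only: NEXT_BUTTON_XPATH is a placeholder; paste the XPath copied
# from the browser's inspector for your page's next button.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

NEXT_BUTTON_XPATH = '//a[@class="next"]'  # hypothetical example value

driver = webdriver.Chrome()
driver.get("http://data.eastmoney.com/xg/xg/")
while True:
    # ... scrape the current page here ...
    try:
        driver.find_element(By.XPATH, NEXT_BUTTON_XPATH).click()
    except NoSuchElementException:
        break  # no next button found: last page reached
driver.quit()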
I'm trying to get information from the last link in a chain of pages on this website (the one in the code below).
The problem is that my list of elements comes back empty, even though find_element (singular) works when I try it.
Here is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd

options = Options()

# Creating our dictionary
all_services = pd.DataFrame(columns=['Profil', 'Motif', 'Questions', 'Reponses'])

path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
driver = webdriver.Chrome(options=options, executable_path=path)

# we are going to visit all profils procedures
# for profil in ['particuliers','professionnels','associations']:
#     driver.get("https://www.demarches.interieur.gouv.fr/{profil}/accueil-{profil}")
driver.get("https://www.demarches.interieur.gouv.fr/associations/accueil-associations")

# Get all first elements in the bodyFiche id, which contains all procedures for the associations profile
list_of_services = driver.find_elements_by_class_name("liste-sous-menu")

for service in list_of_services:
    # In each element, select the a tags
    # atags = service.find_elements_by_css_selector('a')
    atags = service.find_elements_by_xpath("//li[starts-with(@id,'summary')]")
    for atag in atags:
        # In each atag, select the href
        href = atag.get_attribute('href')
        print(href)
        # Open a new window
        driver.execute_script("window.open('');")
        # Switch to the new window and open URL
        driver.switch_to.window(driver.window_handles[1])
        driver.get(href)
        # we are now on the second link
        # Get all links in the iterated element
        list_of_services2 = driver.find_elements_by_class_name("content")
        for service2 in list_of_services2:
            atags2 = service2.find_elements_by_css_selector('a')
            for atag2 in atags2:
                href = atag2.get_attribute('href')
                driver.execute_script("window.open('');")
                driver.switch_to.window(driver.window_handles[1])
                driver.get(href)
                # we are now on the third link
                # Get all links in the iterated element
                list_of_services3 = driver.find_elements_by_class_name("content")
                for service3 in list_of_services3:
                    atags3 = service3.find_elements_by_css_selector('a')
                    for atag3 in atags3:
                        href = atag3.get_attribute('href')
                        driver.execute_script("window.open('');")
                        driver.switch_to.window(driver.window_handles[1])
                        driver.get(href)
                        # Get Q/A section
                        list_of_services4 = driver.find_elements_by_class_name("QuestionReponse")
                        for service4 in list_of_services4:
                            atags4 = service4.find_elements_by_css_selector('a')
                            for atag4 in atags4:
                                href = atag4.get_attribute('href')
                                # We store our questions
                                questions = atag4.text
                                driver.execute_script("window.open('');")
                                driver.switch_to.window(driver.window_handles[1])
                                driver.get(href)
                                # Get data
                                reponses = driver.find_elements_by_class_name("texte")
                                all_services = all_services.append({'Questions': questions,
                                                                    'Reponses': reponses}, ignore_index=True)
                                driver.close()
                                driver.switch_to.window(driver.window_handles[0])
                        driver.close()
                        driver.switch_to.window(driver.window_handles[0])
                driver.close()
                driver.switch_to.window(driver.window_handles[0])
        # Close the tab with URL B
        driver.close()
        # Switch back to the first tab with URL A
        driver.switch_to.window(driver.window_handles[0])

driver.close()
all_services.to_excel('Limit_Testing.xlsx', index=False)
I'm not sure whether my method works or not; the idea is to traverse the links like a tree, and once I reach a leaf I extract the information I want. Correct me if I'm wrong.
I don't know why my list_of_services is an empty list, even though the class name is correct.
What's worked for me in previous experience: add waiting time. The logic is that when you make the GET request you immediately look for a WebElement with class='liste-sous-menu', without waiting for the driver to finish loading the website. This causes the list to be empty because there is nothing to return yet. Therefore, my suggestion is the following:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
## Import sleep
from time import sleep
options = Options()
path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
driver = webdriver.Chrome(options=options, executable_path=path)
driver.get("https://www.demarches.interieur.gouv.fr/associations/accueil-associations")
################### HERE YOU ADD SOME WAITING TIME; it will depend on the speed of your computer/driver
sleep(0.5)
list_of_services = driver.find_elements_by_class_name("liste-sous-menu")
I applied it to your code and it now seems to return a list with content. However, it does not return the links themselves; it just returns the UL (unordered list) that contains them, so you will need to dig deeper once you have the UL element. This means adding the following:
list_of_services = driver.find_element_by_class_name("liste-sous-menu")  # the UL element
### Now you get the li elements (each row)
services = list_of_services.find_elements_by_tag_name('li')
## Now you iterate over the services object (list of 'li' elements)
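If the fixed sleep turns out to be flaky on slower connections, an explicit wait is a common alternative. A minimal sketch, reusing the class name from the question and assuming a 10-second timeout:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds (assumed value) for the menu to appear in the DOM
menu = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "liste-sous-menu"))
)
# Then get the li elements (each row) from the UL
services = menu.find_elements_by_tag_name('li')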
Hope this answers your question.
I'm currently trying to figure out how to loop through a set of studios on a fitness class website.
On the search results page of this website, it lists 50 studios on each page and there are about 26 pages. https://classpass.com/search if you want to take a look.
My code parses the search results page, and Selenium gets the link for each studio on the page (in my full code, Selenium then opens the link and scrapes the data on that page).
After looping through all the results on page 1, I want to click the next-page button and repeat on results page 2. I get the error Message: no such element: Unable to locate element, but I know the element is definitely on the results page and can be clicked; I tested this with a simplified script to confirm.
What could I be doing wrong? I've tried many suggestions but none have worked so far.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import time
import re
import csv
# initialize the chrome browser
browser = webdriver.Chrome(executable_path=r'./chromedriver')
# URL
class_pass_url = 'https://www.classpass.com'
# Create file and writes the first row, added encoding type as write was giving errors
#f = open('ClassPass.csv', 'w', encoding='utf-8')
#headers = 'URL, Studio, Class Name, Description, Image, Address, Phone, Website, instagram, facebook, twitter\n'
#f.write(headers)
# classpass results page
page = "https://classpass.com/search"
browser.get(page)
# Browser waits
browser_wait(browser, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "line")))
# Scrolls to bottom of page to reveal all classes
# browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Extract page source and parse
search_source = browser.page_source
search_soup = soup(search_source, "html.parser")
pageCounter = 0
maxpagecount = 27
# Looks through results and gets link to class page
studios = search_soup.findAll('li', {'class': '_3vk1F9nlSJQIGcIG420bsK'})
while (pageCounter < maxpagecount):
    search_source = browser.page_source
    search_soup = soup(search_source, "html.parser")
    studios = search_soup.findAll('li', {'class': '_3vk1F9nlSJQIGcIG420bsK'})
    for studio in studios:
        studio_link = class_pass_url + studio.a['href']
        browser.get(studio_link)
        browser_wait(browser, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "line")))
    element = browser.find_element_by_xpath('//*[@id="Search_Results"]/div[1]/div/div/nav/button[2]')
    browser.execute_script("arguments[0].click();", element)
You have to return to the main results page before finding the next-page button. You could solve the problem by replacing the following code; the replacement first collects the studio URLs from every page.
studios = search_soup.findAll('li', {'class': '_3vk1F9nlSJQIGcIG420bsK'})
to
studios = []
for page in range(num_pages):
    studios.append(search_soup.findAll('li', {'class': '_3vk1F9nlSJQIGcIG420bsK'}))
    element = browser.find_element_by_xpath('//*[@id="Search_Results"]/div[1]/div/div/nav/button[2]')
    browser.execute_script("arguments[0].click();", element)
and remove the later code that clicks the next-page button element.
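Putting it together, a rough sketch of the collect-first approach could look like the following. It reuses the imports and variables from the question; the class hash and the next-button XPath are copied from the question and will likely change whenever the site redeploys, and num_pages is an assumed value:
# Sketch: gather every studio link across all result pages first, then visit them
num_pages = 26  # assumed from the question's description

studio_links = []
for page in range(num_pages):
    search_soup = soup(browser.page_source, "html.parser")
    for studio in search_soup.findAll('li', {'class': '_3vk1F9nlSJQIGcIG420bsK'}):
        studio_links.append(class_pass_url + studio.a['href'])
    # click the next-page button while still on the results page
    element = browser.find_element_by_xpath('//*[@id="Search_Results"]/div[1]/div/div/nav/button[2]')
    browser.execute_script("arguments[0].click();", element)

# now open each studio page without losing our place in the results
for studio_link in studio_links:
    browser.get(studio_link)
    browser_wait(browser, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "line")))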
I am trying to iterate over multiple pages of a website; however, the code below only returns the results from the first page, even though I am using Selenium to click to the next page. I am at a loss for what could be causing this. Any explanation would be much appreciated!
The website in question:
https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}
from selenium import webdriver
import time
import xlsxwriter
from lxml import html
u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
driver = webdriver.Chrome()
driver.get(u)
driver.maximize_window()
time.sleep(.3)
driver.find_element_by_id('restoreSettingsYesEncl').click() # select 'yes' on the webpage to restore settings
time.sleep(7) # wait until the website downloads data so we get a return value
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")
t = html.fromstring(source_code)
for i in range(5):
    for i in t.xpath('.//td[@class="dc-table-column _0"]/text()'):
        print(i.strip())
    driver.find_element_by_xpath('//*[@id="listings-table-split"]/div[5]/div/span[4]').click()  # click to next page
    time.sleep(.05)
driver.quit()
In the code above, t is assigned outside the loop:
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")
t = html.fromstring(source_code)
for i in range(5):
so it is loaded only the first time and keeps repeating the same elements. To fix this, you need to move it inside the loop, as in the code below:
from selenium import webdriver
import time
import xlsxwriter
from lxml import html
u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:100,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
driver = webdriver.Chrome()
driver.get(u)
driver.maximize_window()
time.sleep(.3)
driver.find_element_by_id('restoreSettingsYesEncl').click() # select 'yes' on the webpage to restore settings
time.sleep(7) # wait until the website downloads data so we get a return value
for i in range(5):
    elem = driver.find_element_by_xpath("//*")
    source_code = elem.get_attribute("innerHTML")
    t = html.fromstring(source_code)
    for i in t.xpath('.//td[@class="dc-table-column _0"]/text()'):
        print(i.strip())
    driver.find_element_by_xpath('//*[@id="listings-table-split"]/div[5]/div/span[4]').click()  # click to next page
    time.sleep(.05)
driver.quit()
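As a side note, the fixed sleep(7) before the first parse is fragile if the page load time varies; an explicit wait on the table cells is a possible alternative. A small sketch, with the CSS selector assumed from the XPath used above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 30 seconds (assumed timeout) until the first-column cells are present
WebDriverWait(driver, 30).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "td.dc-table-column._0"))
)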
I'm trying to get a value that is given by the website after a click on a button.
Here is the website: https://www.4devs.com.br/gerador_de_cpf
You can see that there is a button called "Gerar CPF"; this button generates a number that appears after the click.
My current script opens the browser and gets the value, but I'm getting the value from the page before the click, so the value is empty. I would like to know if it is possible to get the value after clicking the button.
from selenium import webdriver
from bs4 import BeautifulSoup
from requests import get
url = "https://www.4devs.com.br/gerador_de_cpf"
def open_browser():
    driver = webdriver.Chrome("/home/felipe/Downloads/chromedriver")
    driver.get(url)
    driver.find_element_by_id('bt_gerar_cpf').click()

def get_cpf():
    response = get(url)
    page_with_cpf = BeautifulSoup(response.text, 'html.parser')
    cpf = page_with_cpf.find("div", {"id": "texto_cpf"}).text
    print("The value is: " + cpf)

open_browser()
get_cpf()
open_browser and get_cpf are absolutely not related to each other...
Actually you don't need get_cpf at all. Just wait for text after clicking the button:
from selenium.webdriver.support.ui import WebDriverWait as wait
def open_browser():
    driver = webdriver.Chrome("/home/felipe/Downloads/chromedriver")
    driver.get(url)
    driver.find_element_by_id('bt_gerar_cpf').click()
    text_field = driver.find_element_by_id('texto_cpf')
    text = wait(driver, 10).until(lambda driver: not text_field.text == 'Gerando...' and text_field.text)
    return text

print(open_browser())
Update
The same with requests:
import requests
url = 'https://www.4devs.com.br/ferramentas_online.php'
data = {'acao': 'gerar_cpf', 'pontuacao': 'S'}
response = requests.post(url, data=data)
print(response.text)
You don't need to use requests and BeautifulSoup.
from selenium import webdriver
from time import sleep
url = "https://www.4devs.com.br/gerador_de_cpf"
def get_cpf():
    driver = webdriver.Chrome("/home/felipe/Downloads/chromedriver")
    driver.get(url)
    driver.find_element_by_id('bt_gerar_cpf').click()
    sleep(10)
    text = driver.find_element_by_id('texto_cpf').text
    print(text)

get_cpf()
Can you use a While loop until text changes?
from selenium import webdriver
url = "https://www.4devs.com.br/gerador_de_cpf"
def get_value():
    driver = webdriver.Chrome()
    driver.get(url)
    driver.find_element_by_id('bt_gerar_cpf').click()
    while driver.find_element_by_id('texto_cpf').text == 'Gerando...':
        continue
    val = driver.find_element_by_id('texto_cpf').text
    driver.quit()
    return val

print(get_value())
I recommend this website, which does exactly the same thing:
https://4devs.net.br/gerador-cpf
But to trigger the "gerar cpf" action with Selenium, you can inspect the HTML source code in a browser and use "Copy XPath" on the element.
It is much simpler than manually searching for the elements in the page.
I'm trying to automate the scraping of links from here:
https://thegoodpubguide.co.uk/pubs/?paged=1&order_by=category&search=pubs&pub_name=&postal_code=&region=london
Once I have the first page, I want to click the right chevron at the bottom in order to move to the second page, the third, and so on, scraping the links in between.
Unfortunately nothing I try will get Chrome to move to the next page.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from datetime import datetime
import csv
from selenium.webdriver.common.action_chains import ActionChains
#User login info
pagenum = 1
#Creates link to Chrome Driver and shortens this to 'browser'
path_to_chromedriver = '/Users/abc/Downloads/chromedriver 2' # change path as needed
driver = webdriver.Chrome(executable_path = path_to_chromedriver)
#Navigates Chrome to the specified page
url = 'https://thegoodpubguide.co.uk/pubs/?paged=1&order_by=category&search=pubs&pub_name=&postal_code=&region=london'

#Clicks Login
def findlinks(address):
    global pagenum
    list = []
    driver.get(address)
    #wait
    while pagenum <= 2:
        for i in range(20): # Scrapes available links
            xref = '//*[@id="search-results"]/div[1]/div[' + str(i+1) + ']/div/div/div[2]/div[1]/p/a'
            link = driver.find_element_by_xpath(xref).get_attribute('href')
            print(link)
            list.append(link)
        with open("links.csv", "a") as fp: # Saves list to file
            wr = csv.writer(fp, dialect='excel')
            wr.writerow(list)
        print(pagenum)
        pagenum = pagenum + 1
        element = driver.find_element_by_xpath('//*[@id="search-results"]/div[2]/div/div/ul/li[8]/a')
        element.click()

findlinks(url)
Is something blocking the button that I'm not seeing?
The error printed in my terminal:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="search-results"]/div[2]/div/div/ul/li[8]/a"}
Try this (it requires the WebDriverWait, expected_conditions, and By imports from selenium.webdriver.support and selenium.webdriver.common.by):
element = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[class='next-page btn']")))
element.click()
EDIT :
The XPath you're specifying for the chevron varies between pages and is not exactly correct. Note the li[6], li[8], and li[9]:
On page 1, the XPath is //*[@id="search-results"]/div[2]/div/div/ul/li[6]/a/i
On page 2, the XPath is //*[@id="search-results"]/div[2]/div/div/ul/li[8]/a/i
On page 3, the XPath is //*[@id="search-results"]/div[2]/div/div/ul/li[9]/a/i
You'll have to come up with some way of determining which XPath to use. Here's a hint: it seems that the last li under //*[@id="search-results"]/div[2]/div/div/ul/ designates the chevron.
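One hedged way to act on that hint is to select the last li of the pagination list and click its link instead of hard-coding an index; a sketch based on the XPaths above, untested against the live site:
# Select the <a> inside the last <li> of the pagination list, whatever its index is
chevron = driver.find_element_by_xpath(
    '(//*[@id="search-results"]/div[2]/div/div/ul/li)[last()]/a')
chevron.click()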
ORIGINAL POST :
You may want to try waiting for the page to load before you try to find and click the chevron. I usually just use a time.sleep(...) while testing my automation scripts, but for (possibly) more sophisticated approaches, try explicit Waits. See the documentation here.