I want to get the address, but I get an empty result. What am I doing wrong in the XPath? This is the page link: https://www.findtruckservice.com/page/cummins-sales-and-service-farmington-nm-430653
Snapshot of the address:
Code trials:
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from scrapy.http import Request

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.findtruckservice.com/search/?city=Florida%2C+CO&mainCat=1&subCat=Truck+Repair&lat=37.0731&lon=-106.247&cat_field=Mobile+Repair+-+Truck+Repair",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        books = response.xpath("//h3//a//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        address = response.xpath("//div[1][@class='threecol align_left card']//div//text()").get()
        yield {
            'address': address
        }
Try the following:
[...]
address = ' '.join([x.strip() for x in response.xpath("//div[@class='threecol align_left card'][1]/div[@class='container']/text()").extract()])
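For context, here is a sketch of how that line could sit inside the question's parse_book callback (same spider as above):

def parse_book(self, response):
    # Join the stripped text fragments of the first contact card into one string.
    address = ' '.join(
        x.strip()
        for x in response.xpath(
            "//div[@class='threecol align_left card'][1]"
            "/div[@class='container']/text()"
        ).extract()
    )
    yield {'address': address}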
To print the desired text from the website you need to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following locator strategies:
Using XPATH and text attribute:
driver.get("https://www.findtruckservice.com/page/cummins-sales-and-service-farmington-nm-430653")
print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//h4[@class='sec-title' and text()='CONTACT']//following::div[@class='container']"))).text)
Using XPATH and get_attribute("textContent"):
driver.get("https://www.findtruckservice.com/page/cummins-sales-and-service-farmington-nm-430653")
print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//h4[@class='sec-title' and text()='CONTACT']//following::div[@class='container']"))).get_attribute("textContent"))
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
Cummins Sales and Service
1101 N Troy King Rd
Farmington, NM
505-327-7331 (primary)
505-326-2948 (fax)
References
Links to useful documentation:
get_attribute() method — gets the given attribute or property of the element.
text attribute — returns the text of the element.
Difference between text and innerHTML using Selenium
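As a quick illustration of the difference (a minimal sketch; it assumes driver is already on the page, and the locator is just an example):

from selenium.webdriver.common.by import By

element = driver.find_element(By.XPATH, "//div[@class='container']")
# .text returns only the rendered, visible text of the element.
print(element.text)
# get_attribute("textContent") returns all text nodes, including text hidden by CSS.
print(element.get_attribute("textContent"))
# get_attribute("innerHTML") returns the inner markup as well as the text.
print(element.get_attribute("innerHTML"))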
The code below sometimes returns one (1) element, sometimes all, and sometimes none. For my application I need it to return all the matching elements on the page.
Code trials:
from selenium import webdriver
from selenium.webdriver.common.by import By

def villanovan():
    driver = webdriver.Chrome()
    driver.implicitly_wait(10)
    url = 'http://webcache.googleusercontent.com/search?q=cache:https://villanovan.com/&strip=0&vwsrc=0'
    url_2 = 'https://villanovan.com/'
    driver.get(url_2)
    a = driver.find_elements(By.CLASS_NAME, "homeheadline")
    titles = [i.text for i in a if len(i.text) != 0]
    links = [i.get_attribute('href') for i in a if len(i.text) != 0]
    return [titles, links]

if __name__ == "__main__":
    print(villanovan())
I was expecting a list with multiple links and article titles, but received a list with only the first element found, not all elements found.
To extract the values of the href attributes you can use a list comprehension with either of the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://villanovan.com/")
time.sleep(3)
print([my_elem.get_attribute("href") for my_elem in driver.find_elements(By.CSS_SELECTOR, "a.homeheadline[href]")])
Using XPATH:
driver.get("https://villanovan.com/")
time.sleep(3)
print([my_elem.get_attribute("href") for my_elem in driver.find_elements(By.XPATH, "//a[@class='homeheadline' and @href]")])
Note: you have to add the following imports:
import time
from selenium.webdriver.common.by import By
Console Output:
['https://villanovan.com/22105/sports/villanova-goes-cold-in-clutch-against-no-14-marquette/', 'https://villanovan.com/22102/sports/villanova-bests-marquette-in-blowout-win-73-54/', 'https://villanovan.com/22098/news/decarbonizing-villanova-a-town-hall-on-fossil-fuel-divestment/', 'https://villanovan.com/22096/news/biology-professors-granted-1-million-for-wetlands-research/', 'https://villanovan.com/22093/news/students-create-the-space-supporting-sex-education/', 'https://villanovan.com/22098/news/decarbonizing-villanova-a-town-hall-on-fossil-fuel-divestment/', 'https://villanovan.com/22096/news/biology-professors-granted-1-million-for-wetlands-research/', 'https://villanovan.com/22044/culture/julia-staniscis-leaning-on-letters/', 'https://villanovan.com/22032/culture/villanova-sorority-recruitment-recap/', 'https://villanovan.com/22105/sports/villanova-goes-cold-in-clutch-against-no-14-marquette/', 'https://villanovan.com/22102/sports/villanova-bests-marquette-in-blowout-win-73-54/', 'https://villanovan.com/21932/opinion/villanova-should-be-free-for-families-earning-less-than-100000/', 'https://villanovan.com/21897/opinion/grasshoppergate-the-state-of-villanova-dining/', 'https://villanovan.com/22105/sports/villanova-goes-cold-in-clutch-against-no-14-marquette/', 'https://villanovan.com/22102/sports/villanova-bests-marquette-in-blowout-win-73-54/', 'https://villanovan.com/22093/news/students-create-the-space-supporting-sex-education/', 'https://villanovan.com/22090/news/mlk-day-of-service/', 'https://villanovan.com/22087/news/university-updates-covid-procedures/']
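If you would rather not rely on a fixed time.sleep(3), an explicit wait is a more robust variant (a sketch; the 10-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://villanovan.com/")
# Block until every matching headline anchor is visible, up to 10 seconds.
headlines = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.homeheadline[href]"))
)
print([elem.get_attribute("href") for elem in headlines])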
I am trying to extract all URLs, pressing the next button to iterate until there isn't a next button. I would then like to open each URL, if that is possible. Could I be pointed in the right direction for this, please?
The website where you need to press the search button is here
Link to the table of URLs that need to be extracted
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path=r"C:\Users\matt_\Documents\Python Scripts\Selenium\chromedriver.exe")
driver.get("https://publicaccess.aberdeencity.gov.uk/online-applications/search.do?action=monthlyList")
driver.find_element_by_xpath("/html/body/div/div/div[3]/div[3]/div/form/fieldset/div[5]/input[2]").click()
test = driver.find_elements(By.TAG_NAME,"a")
print(test)
Here is an example of what you are looking for:
from bs4 import BeautifulSoup as Soup
from selenium import webdriver
import pandas as pd
import time

driver = webdriver.Chrome()
driver.get("https://monerobenchmarks.info/")

final_list = []

def parse_table():
    # Re-parse the current page source so each page of results is captured.
    page = Soup(driver.page_source, features='html.parser')
    table = page.find('table')
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        final_list.extend(row)

def next_bu():
    next_button = driver.find_element_by_xpath('//*[@id="cpu_next"]')
    next_button.click()

# put range of pages
for _ in range(1, 2):
    parse_table()
    time.sleep(2)
    next_bu()

print(final_list)
You can check the element exists or not with simple logic like this:
if len(driver.find_elements_by_css_selector('.next')) > 0:
Try the below code:
driver.get('https://publicaccess.aberdeencity.gov.uk/online-applications/search.do?action=monthlyList')
search_btn = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.button.primary')))
search_btn.click()

condition = True
while condition:
    links = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'li.searchresult a')))
    for link in links:
        print(link.get_attribute('href'))
    if len(driver.find_elements_by_css_selector('.next')) > 0:
        driver.find_element_by_css_selector('.next').click()
    else:
        condition = False
driver.quit()
Add the following imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Here you go
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"C:\Users\matt_\Documents\Python Scripts\Selenium\chromedriver.exe")
driver.get("https://publicaccess.aberdeencity.gov.uk/online-applications/search.do?action=monthlyList")
driver.find_element_by_css_selector("input[value='Search']").click()

def parse():
    links = driver.find_elements_by_xpath('//*[@id="searchresults"]/li/a')
    for link in links:
        print(link.text, link.get_attribute("href"))
    try:
        driver.find_element_by_class_name('next').click()
        parse()
    except:
        print('complete')

parse()
Hi, I've tried to find the right Selenium code to click the main parent class if the following requirements exist in the class:
Parent class:
<div class="col-xs-2-4 shopee-search-item-result__item" data-sqe="item">
Child class (its href is one of the URLs printed in Python):
<a data-sqe="link" href="...">
The child class contains these elements:
<div class="_1gkBDw _2O43P5">
<div class="_1HvBLA">
<div class="_3ao649" data-sqe="ad">Ad</div>
Here is the code below:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import csv
import time

url = 'https://shopee.com.my/search?keyword=mattress'

driver = webdriver.Chrome(executable_path=r'E:/users/Francabicon/Desktop/Bots/others/chromedriver.exe')
driver.get(url)
time.sleep(0.8)

# select language
driver.find_element_by_xpath('//div[@class="language-selection__list"]/button').click()
time.sleep(3)

# scroll few times to load all items
for x in range(10):
    driver.execute_script("window.scrollBy(0,300)")
    time.sleep(0.1)

# get all links (without clicking)
all_items = driver.find_elements_by_xpath('//a[@data-sqe="link"]')
print('len:', len(all_items))

all_urls = []
for item in all_items:
    url = item.get_attribute('href')
    all_urls.append(url)
print(all_urls)
a = len(all_urls)

# now use links
i = 0
while i <= 4:
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='col-xs-2-4 shopee-search-item-result__item' and @data-sqe='item']//a[@class='link' and @href= all_urls[i]]"))).click()
    i += 1
I've tried to locate:
- the whole div class
- the classes and the href individually
- clicking the first five columns
but it always fails.
Traceback (most recent call last):
  File "E:/Users/Asashin/Desktop/Bots/click test 7.py", line 52, in <module>
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='col-xs-2-4 shopee-search-item-result__item' and @data-sqe='item']//a[@class='link' and @href= all_urls[i]]"))).click()
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Can this be solved?
I have made a couple of changes.
When you fetch the href values you get the complete URL, not the relative path you see in the DOM, so you need to strip the leading part in order to match the href later.
In the last while loop, all_urls[i] is a variable, so you need to pass it as a variable, not as part of the string literal.
Once you click each link you need to come back to the parent page again using driver.back().
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time

url = 'https://shopee.com.my/search?keyword=mattress'

driver = webdriver.Chrome(executable_path=r'E:/users/Francabicon/Desktop/Bots/others/chromedriver.exe')
driver.get(url)

# select language
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, '//div[@class="language-selection__list"]/button'))).click()
time.sleep(3)

# scroll few times to load all items
for x in range(10):
    driver.execute_script("window.scrollBy(0,300)")
    time.sleep(0.1)

# get all links (without clicking)
all_items = driver.find_elements_by_xpath('//a[@data-sqe="link"]')
print('len:', len(all_items))

all_urls = []
for item in all_items:
    # This gives you the whole url of the anchor tag
    url = item.get_attribute('href')
    # Remove the preceding part so the href can be matched later for clicking
    urlfinal = url.split('https://shopee.com.my')[1]
    all_urls.append(urlfinal)
print(all_urls)

# now use links
i = 0
while i <= 4:
    # Identify the parent tag by its child tag using the following XPath.
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='col-xs-2-4 shopee-search-item-result__item' and @data-sqe='item'][.//a[@data-sqe='link' and @href='" + all_urls[i] + "']]"))).click()
    driver.back()
    i += 1
How to get parent element:
child_element = driver.find_element_by_xpath('//a[@data-sqe="link"]')
parent_element = child_element.find_element_by_xpath('./ancestor::div[contains(@class, "shopee-search-item-result__item")][1]')
How to get element with specific child:
element = driver.find_element_by_xpath('//div[contains(@class, "shopee-search-item-result__item") and .//a[@data-sqe="link"]]')
I'm looking to grab the #text portion of a CSS selector when inspecting an element. I seem to be grabbing all the numbers under my selector instead of just the text portion.
The link I'm scraping is https://www.virginmobile.ca/en/phones/phone-details.html#!/gs9/Grey/64/TR20.
I would like to grab the prices under 'pick your phone price', but without the '$' and the '99' cents at the end of the string.
Currently I'm only familiar with grabbing the entire string.
driver.get(link)
time.sleep(3)
print('--------------------------- begining ------------------')
planTypeUpfrontCostListRaw = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#phonePricesList .ultra')))
for element in planTypeUpfrontCostListRaw:
    upfrontCost = element.text
    print(upfrontCost)
print('--------------------------- END ------------------------')
Solution 1
Instead of using text, use innerHTML. This returns the HTML code of that element, including the text!
For example, it will return:
"<sup>$</sup>199<sup>99</sup>"
Then you can use the regex library re to get only the value in the middle.
print(re.search(r'\d+', upfrontCost).group(0))
Output: 199
Here's the code to do so:
from selenium.webdriver import Chrome
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import re

link = "https://www.virginmobile.ca/en/phones/phone-details.html#!/gs9/Grey/64/TR20"
driver = Chrome()
wait = WebDriverWait(driver, 15)
driver.get(link)

print('--------------------------- begining ------------------')
planTypeUpfrontCostListRaw = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.price.ultra.ng-binding.ng-scope')))
for element in planTypeUpfrontCostListRaw:
    upfrontCost = element.get_attribute('innerHTML')
    upfrontCost = re.search(r'\d+', upfrontCost).group(0)
    print(upfrontCost)
print('--------------------------- END ------------------------')
Output:
--------------------------- begining ------------------
0
0
199
349
739
1019
--------------------------- END ------------------------
Solution 2
You can still use text, stripping the '$' and slicing off the last two digits (the cents).
driver = Chrome()
wait = WebDriverWait(driver, 15)
driver.get(link)

print('--------------------------- begining ------------------')
planTypeUpfrontCostListRaw = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.price.ultra.ng-binding.ng-scope')))
for element in planTypeUpfrontCostListRaw:
    upfrontCost = element.text.strip('$')
    if upfrontCost != '0':
        upfrontCost = upfrontCost[:-2]
    print(upfrontCost)
print('--------------------------- END ------------------------')
You could dump the page source into bs4 and use stripped_strings:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://www.virginmobile.ca/en/phones/phone-details.html?province=ON&geoResult=failed#!/gs9/Grey/64/TR20')
WebDriverWait(d, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "planlevels .price")))
soup = bs(d.page_source, 'lxml')
plans = soup.select('planlevels .price')
for plan in plans:
    # The second stripped string is the dollar figure between the <sup> tags.
    price = list(plan.stripped_strings)[1]
    print(price)
Uglier, IMO, would be to use split and skip bs4:
plans = WebDriverWait(d, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "planlevels .price")))
for plan in plans:
    print(plan.get_attribute('innerHTML').split('</sup>')[1].split('<sup>')[0])
I have a script that loads a page and saves a bunch of data-ids from multiple containers. I then want to open new urls, appending those data-ids onto the end of the urls. For each url I want to locate all the hrefs and compare them to a list of specific links, and if any of them match I want to save that link and a few other details to a table.
I have managed to get it to open the url with the appended data-id, but when I try to search for elements on the new page, it either pulls them from the first url that was parsed (if I try to findAll from soup again), or I constantly get this error when I try to run another html.parser:
ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Is it not possible to run another parser or am I just doing something wrong?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.action_chains import ActionChains
import time

url = "http://csgo.exchange/id/76561197999004010#x"
driver = webdriver.Firefox()
driver.get(url)
time.sleep(15)

html = driver.page_source
soup = soup(html, "html.parser")
containers = soup.findAll("div", {"class": "vItem"})
print(len(containers))

data_ids = []  # Make a list to hold the data-id's
for container in containers:
    test = container.attrs["data-id"]
    data_ids.append(test)  # add data-id's to the list
    print(str(test))

for id in data_ids:
    url2 = "http://csgo.exchange/item/" + id
    driver.get(url2)
    time.sleep(2)
    soup2 = soup(html, "html.parser")
    containers2 = soup2.findAll("div", {"class": "bar"})
    print(str(containers2))

with open('scraped.txt', 'w', encoding="utf-8") as file:
    for id in data_ids:
        file.write(str(id) + '\n')  # write every data-id to a new line
Not sure exactly what you want from each page. You should add waits. I added waits looking for hrefs in the flow history section of each page (if present). It should illustrate the idea.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'http://csgo.exchange/id/76561197999004010'
driver = webdriver.Chrome()
driver.get(url)
ids = [item.get_attribute('data-id') for item in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-id]")))]
results = []
baseURL = 'http://csgo.exchange/item/'
for id in ids:
    url = baseURL + id
    driver.get(url)
    try:
        flowHistory = [item.get_attribute('href') for item in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-history-flow [href]")))]
        results.append([id, flowHistory])
    except:
        print(url)
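Incidentally, the ResultSet error in the question's code has a concrete cause: soup = soup(html, "html.parser") rebinds the name soup from the BeautifulSoup class to the first parsed document, and calling a BeautifulSoup object is a shortcut for find_all(), so the second soup(...) call returns a ResultSet rather than a new document. On top of that, html is never re-read after driver.get(url2). A minimal sketch of that part of the fix (it assumes the driver from the question's code), keeping the class under its own name and re-parsing page_source on every page:

from bs4 import BeautifulSoup  # keep the class name distinct from parsed documents
import time

first_soup = BeautifulSoup(driver.page_source, "html.parser")
data_ids = [c.attrs["data-id"] for c in first_soup.find_all("div", {"class": "vItem"})]

for data_id in data_ids:
    driver.get("http://csgo.exchange/item/" + data_id)
    time.sleep(2)
    # Re-parse the *current* page source on each iteration.
    item_soup = BeautifulSoup(driver.page_source, "html.parser")
    containers2 = item_soup.find_all("div", {"class": "bar"})
    print(containers2)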
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'http://csgo.exchange/id/76561197999004010'
profile = webdriver.FirefoxProfile()
profile.set_preference("permissions.default.image", 2)  # Block all images to load websites faster.
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
ids = [item.get_attribute('data-id') for item in WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-id]")))]
results = []
baseURL = 'http://csgo.exchange/item/'
for id in ids:
    url = baseURL + id
    driver.get(url)
    try:
        pros = ['http://csgo.exchange/profiles/76561198149324950']
        flowHistory = [item.get_attribute('href') for item in WebDriverWait(driver, 3).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-history-flow [href]")))]
        if flowHistory in pros:
            results.append([url, flowHistory])
            print(results)
    except:
        print()
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

urls = ['http://csgo.exchange/id/76561197999004010']
profile = webdriver.FirefoxProfile()
profile.set_preference("permissions.default.image", 2)  # Block all images to load websites faster.
driver = webdriver.Firefox(firefox_profile=profile)
for url in urls:
    driver.get(url)
    ids = [item.get_attribute('data-id') for item in WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-id]")))]
    results = []
    pros = ['http://csgo.exchange/profiles/76561198149324950', 'http://csgo.exchange/profiles/76561198152970370']
    baseURL = 'http://csgo.exchange/item/'
    for id in ids:
        url = baseURL + id
        driver.get(url)
        try:
            flowHistory = [item.get_attribute('href') for item in WebDriverWait(driver, 2).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-history-flow [href]")))]
            match = []
            for string in pros:
                if string in flowHistory:
                    match = string
                    break
            if match:
                results.append([url, match])
                print(results)
        except:
            print()