My code goes to a website and clicks each row of a table, which opens a new window. I want to scrape one piece of information (Faculty) from each new window, but I am having difficulty writing a CSS selector that reaches this field.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
driver.get('https://aaaai.planion.com/Web.User/SearchSessions?ACCOUNT=AAAAI&CONF=AM2021&USERPID=PUBLIC&ssoOverride=OFF')
time.sleep(3)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
productlist = driver.find_elements_by_class_name('clickdiv')
for item in productlist:
    item.click()  # opens the new window for each row
    time.sleep(2)
    faculty = driver.find_elements_by_xpath('//*[@id="W1"]/div/div/div/div[2]/div[2]/table/tbody/tr[7]/td/table/tbody/tr/td[2]/b')
    print(faculty)
    driver.find_element_by_class_name('XX').click()  # closes the window
    time.sleep(1)
You have:
faculty=driver.find_elements_by_xpath('//*[@id="W1"]/div/div/div/div[2]/div[2]/table/tbody/tr[7]/td/table/tbody/tr/td[2]/b')
So you have a list of elements; when you print it, you will see a list of Selenium element objects rather than their text. If you only expect one element, you can use find_element_by_xpath instead of find_elements_by_xpath.
If you want to obtain the faculty values and you have a list, you need to loop over it and extract the text (or get an attribute) of each element.
Faculty names reside within a table with class sortable. You want the first of these tables, so you can use nth-of-type to restrict matches. The actual names are within b tags, so use a descendant combinator (a space) to move to the b tags within the table. Use a list comprehension to return a faculty list:
faculty = [i.text for i in driver.find_elements_by_css_selector('.sortable:nth-of-type(1) b')]
I'm trying to create a Discord embed that contains info from a website. I'm trying to store the driver.find_element(...).text from Selenium,
then put that Python variable into the JSON that builds the Discord embed.
The problem is that each product on this page gives me three different texts. How can I save each one in a different variable? I put my code here:
from selenium import webdriver
from selenium.webdriver.common.by import By
DRIVER_PATH = r'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')
print(product_1.text)
The result in the terminal is:
119,95€
adidas Originals
Forum MID TM
28 de octubre de 2022, 8:00
Recordármelo
Thanks for the help; I really don't know how to save the .text info into different Python variables.
Store the text in a variable, then split it:
element_text = product_1.text
element_text_split = element_text.split()  # split on whitespace (spaces and newlines)
If you wanted the price of that item, element_text_split[0] would get the first word.
The second word, element_text_split[1], is the company.
You could also slice up the string using string slicing. Keep in mind not all data you get is going to look exactly the same.
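As a concrete sketch, you can split the sample output above on newlines instead of spaces, which keeps multi-word fields such as the brand together (the sample text below is copied from the question's terminal output):

```python
# Sample text copied from the question's terminal output.
text = ("119,95€\n"
        "adidas Originals\n"
        "Forum MID TM\n"
        "28 de octubre de 2022, 8:00\n"
        "Recordármelo")

# Splitting on newlines keeps multi-word fields intact.
price, brand, model, date, button = text.split("\n")
print(price)  # → 119,95€
print(brand)  # → adidas Originals
```

This assumes the site always renders exactly five lines per product; if a field is ever missing, the unpacking will raise a ValueError.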
So, you are trying to get the text of each product on the page, right?
If so, you can use the find_elements() method to put all the products on the page into a list. For that you need an XPath (or any other locator) that matches not just one element but all of them.
Here is the code
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
# This finds all products on the page and puts them into a list of elements:
products = driver.find_elements(By.XPATH, '//div[@class="auWjdQ rj7dfC Hjm2Cs _8NW8Ug xJbu_q"]')
# This will put text of each product into a list
texts = []
for product in products:
    texts.append(product.text)
# Here we can see what is in the list of texts
print('All texts:', texts)
# If a list doesn't suit your needs, you can unpack the list into three variables
# (only if you are sure that there are only three items on the page):
prod1, prod2, prod3 = texts
print('Product 1:', prod1)
print('Product 2:', prod2)
print('Product 3:', prod3)
driver.quit()
This is what I've used in the past, but it assumes there is a specific element attribute you're trying to extract.
For example:
element_text = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div').get_attribute('title')
You'll need to replace 'title' with the specific attribute from your element which contains the text you want.
I am trying to come up with a way to scrape information on houses on Zillow, and I am currently using XPath to look at data such as rent price, principal and mortgage costs, and insurance costs.
I was able to find the information using XPath, but when I tried to automate it in a for loop I realized that not all the data for each listing has the same XPath: for some it is off by one in a list or div index. See the code below for what I mean. How do I make it more robust? Is there a way to look up a string like "principal and interest" and select the next value, which would be the numerical value I am looking for?
This works for one listing:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[1]/ul/li[1]/article/div[1]/div[2]/div")
A different listing would contain this:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[2]/ul/li[1]/article/div[1]/div[2]/div")
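On the "look up a string and select the next value" part: in XPath that is //span[contains(., "Principal and interest")]/following-sibling::*[1]. The same pattern can be sketched without a browser using the standard-library XML parser; the markup below is an invented stand-in, not Zillow's real structure:

```python
import xml.etree.ElementTree as ET

# Invented label/value markup standing in for the real listing page.
snippet = """
<div>
  <span>Principal and interest</span><span>$1,234</span>
  <span>Home insurance</span><span>$98</span>
</div>
"""

root = ET.fromstring(snippet)
spans = root.findall(".//span")

# Each label span is immediately followed by its value span,
# so pair them up and look values up by label.
costs = {spans[i].text: spans[i + 1].text for i in range(0, len(spans), 2)}
print(costs["Principal and interest"])  # → $1,234
```

With Selenium you would instead pass the following-sibling XPath to driver.find_element_by_xpath, which survives the off-by-one div indices that break absolute paths.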
The XPaths that you are using are specific to the elements of the first listing. To be able to access elements for every listing, you need to use XPaths that are relative to each listing's container:
import pandas as pd
from selenium import webdriver
I searched for listing for sale in Manhattan and got the below URL
url = "https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/"
Asking selenium to open the above link in Chrome
driver = webdriver.Chrome()
driver.get(url)
I hovered my mouse on one of the house listings and clicked "inspect". This opened the HTML code and highlighted the item I am inspecting. I noticed that the elements having class "list-card-info" contain all the info of the house that we need. So, our strategy would be for each house access the element that has class "list-card-info". So, using the following code, I saved all such HTML blocks in house_cards variable
house_cards = driver.find_elements_by_class_name("list-card-info")
There are 40 elements in house_cards i.e. one for each house (each page has 40 houses listed)
I loop over each of these 40 houses and extract the information I need. Notice that I am now using XPaths that are relative to the "list-card-info" element. I save this info in a pandas DataFrame.
address = []
price = []
bedrooms = []
baths = []
sq_ft = []
for house in house_cards:
    address.append(house.find_element_by_class_name("list-card-addr").text)
    price.append(house.find_element_by_class_name("list-card-price").text)
    bedrooms.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[1]').text)
    baths.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[2]').text)
    sq_ft.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[3]').text)
driver.quit()
# print(address, price,bedrooms,baths, sq_ft)
manhattan_listings = pd.DataFrame({"address": address,
                                   "bedrooms": bedrooms,
                                   "baths": baths,
                                   "sq_ft": sq_ft,
                                   "price": price})
pandas dataframe output
Now, to extract info from more pages (page 2, page 3, etc.), you can loop over the website's pages, i.e. keep modifying your URL and keep extracting info.
Happy Scraping!
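A minimal sketch of that page loop, assuming (unverified) that Zillow appends a "<n>_p/" segment for page n; the helper name is made up:

```python
# Build one search-results URL per page; the "<n>_p/" suffix is an
# assumed pattern for Zillow's paginated URLs, not a verified one.
def page_urls(base_url, n_pages):
    return [f"{base_url}{page}_p/" for page in range(1, n_pages + 1)]

urls = page_urls("https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/", 3)
print(urls[1])  # → https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/2_p/
```

Each URL would then go through the same driver.get / find_elements_by_class_name steps shown above.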
Selecting multiple elements with long absolute XPaths is not a good idea. You can look into CSS selectors instead; using them you can get all similar elements at once.
How do I click an element using Selenium and BeautifulSoup in Python? I have these lines of code and I am finding it difficult to achieve what I want. I want to click every element in each iteration. There is no pagination or next page; there are only about 10 elements, and after clicking the last element it should stop. Does anyone know what I should do? Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import urllib
import urllib.request
from bs4 import BeautifulSoup
chrome_path = r"C:\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
url = 'https://www.99.co/singapore/condos-apartments/a-treasure-trove'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
details = soup.select('.FloorPlans__container__rwH_w')  # whole container of the results
for d in details:
    picture = d.find('span', {'class': 'Tappable-inactive'}).click()  # the single element
    print(d)
driver.close()
Here is the site: https://www.99.co/singapore/condos-apartments/a-treasure-trove . I want to scrape the details and the image in every floor-plans section, but it is difficult because the image only appears after you click the specific element. I can get the details but not the image itself. Try it yourself so you see what I mean.
EDIT:
I tried this method
for d in driver.find_elements_by_xpath('//*[@id="floorPlans"]/div/div/div/div/span'):
    d.click()
The problem is that it clicks too fast for the image to load. Also, I am using Selenium here; is there any way to click via a BeautifulSoup-style selection such as picture = d.find('span',{'class':'Tappable-inactive'}).click() ?
You cannot interact with website widgets using BeautifulSoup; you need to work with Selenium. There are two ways to handle this problem.
The first is to get the main wrapper (class) of the 10 elements and then iterate over each child element of that class.
The second is to get the element by XPath and increment the last number in the XPath by one in each iteration to move to the next element.
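A sketch of the second approach, reusing the XPath from the question's edit; which div actually carries the index is an assumption about the page's structure:

```python
# Build one XPath per floor-plan tile by incrementing an index.
# The indexed div position is assumed, not verified against the page.
xpaths = [f'//*[@id="floorPlans"]/div/div/div/div[{i}]/span' for i in range(1, 11)]
print(xpaths[0])  # → //*[@id="floorPlans"]/div/div/div/div[1]/span
```

Each XPath can then be passed to driver.find_element_by_xpath and clicked, with a wait in between so the image has time to load.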
I printed some results to check your code:
"details" only has one item.
And "picture" is not an element (so it's not clickable).
details = soup.select('.FloorPlans__container__rwH_w')
print(details)
print(len(details))
for d in details:
    print(d)
    picture = d.find('span',{'class':'Tappable-inactive'})
    print(picture)
Output:
For your edited version, you can wait until the images are visible before you call click().
Use visibility_of_element_located to do this.
Reference: https://selenium-python.readthedocs.io/waits.html
I am trying to count the number of items in a list box on a webpage and then select multiple items from it. I can select the items fine; I am just struggling to find out how to count the items in the list box.
see code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
... ...
accountListBox = Select(driver.find_element_by_id("ctl00_MainContent_accountItemsListBox"))
accountListBox.select_by_index(0)
print(len(accountListBox))
I have tried using len(), which results in the error "TypeError: object of type 'Select' has no len()".
I have also tried accountListBox.size(), and removing Select from line 3 doesn't work either.
Pretty new to this so would appreciate your feedback.
Thanks!
According to the docs, a list of a Select element's options can be obtained via select.options. In your particular case this would be accountListBox.options, and you need to call len() on that, not on the Select instance itself:
print(len(accountListBox.options))
Or, if you only want to print a list of currently selected options:
print(len(accountListBox.all_selected_options))
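For intuition, the same count can be reproduced without a browser by parsing a markup snippet with the standard-library parser; the option labels here are invented:

```python
from html.parser import HTMLParser

# Count <option> tags in a snippet -- the same number that
# len(accountListBox.options) reports for the live element.
class OptionCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.count += 1

counter = OptionCounter()
counter.feed('<select><option>Cash</option><option>Margin</option><option>IRA</option></select>')
print(counter.count)  # → 3
```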
You should use find_elements with a selector common to all of the list box's items to find all of them, store the returned elements in a variable, and count them with Python's built-in len().
I usually use Selenium along with Beautiful Soup. Beautiful Soup is a Python package for parsing HTML and XML documents.
With Beautiful Soup you can get the count of items in a list box in the following way:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS() # or webdriver.Firefox()
driver.get('http://some-website.com/some-page/')
html = driver.page_source.encode('utf-8')
b = BeautifulSoup(html, 'lxml')
items = b.find_all('p', attrs={'id':'ctl00_MainContent_accountItemsListBox'})
print(len(items))
I assumed that the DOM element you want to find is a paragraph (p tag), but you can replace this with whatever element you need to find.
I want to extract the text of a particular span, which is shown in the snapshot. I am unable to find the span by its class attribute. I have attached the HTML source (snapshot) of the data to be extracted as well.
Any suggestions?
import bs4 as bs
import urllib
sourceUrl='https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2'
source=urllib.request.urlopen(sourceUrl).read()
soup=bs.BeautifulSoup(source, 'html.parser')
count=soup.find('span',{'class':'number'})
print(len(count))
See the image:
If you disable JavaScript in your browser, you can easily see that the span element you want disappears.
To get that element, one possible solution is to use a Selenium-driven browser.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2')
span = driver.find_element_by_xpath('//li[3]/span')
print(span.text)
driver.close()
Output:
Another solution: find the desired value deep in the web page source (in Chrome, press Ctrl+U) and extract the span value using a regular expression.
import re
import requests
r = requests.get(
    'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2')
span = re.search(r'"posts_count":(\d+)', r.text)
print(span.group(1))
Output:
If you know how to use CSS selectors, you can use:
mySpan = soup.select("span.number")
It returns a list of all nodes that match this selector, so mySpan[0] could contain what you need. Then use a method such as get_text() to extract the text.
First of all, you need to decode the response:
source=urllib.request.urlopen(sourceUrl).read().decode()
Maybe your issue will disappear after this fix.
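A small illustration of why: read() returns bytes, and decode() (UTF-8 by default) turns them into str, which is what BeautifulSoup and string methods expect. The sample text is invented:

```python
# urlopen(...).read() returns bytes; decode() yields str.
raw = "3 respuestas".encode("utf-8")  # stand-in for the HTTP response body
text = raw.decode()                   # UTF-8 by default
print(type(raw).__name__, type(text).__name__)  # → bytes str
```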