I'm a beginner in Python web scraping using BeautifulSoup. I was trying to scrape a real estate website, but one row has different information in each column, and every column shares the same class name. So when I try to scrape each column's information, I get the same result because of the identical class names.
Link of the website I was trying to scrape.
Code From The HTML
<div class="lst-middle-section resale">
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Built Up Area</div>
<div class="lst-sub-value stub text-ellipsis">2294 sq.ft.</div>
</div>
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Avg. Price</div>
<div class="lst-sub-value stub text-ellipsis"><i class="icon-rupee"></i> 6.5k / sq.ft.</div>
</div>
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Possession Date</div>
<div class="lst-sub-value stub text-ellipsis">31st Dec, 2020</div>
</div>
</div>
Code I Tried!
for item in all:
    try:
        print(item.find('span', {'class': 'lst-price'}).getText())
        print(item.find('div', {'class': 'lst-heading'}).getText())
        print(item.find('div', {'class': 'item-datapoint va-middle'}).getText())
        print('')
    except AttributeError:
        pass
If I use the class 'item-datapoint va-middle' again, it just shows the built-up area again, not Avg. Price or Possession Date, because find() only returns the first match.
Solution? TIA!
Use find_elements_by_class_name instead of find_element_by_class_name.
find_elements_by_class_name("item-datapoint.va-middle")
You will get a list of elements.
Selenium docs: Locating Elements
Edit:
from selenium import webdriver
url = 'https://housing.com/in/buy/search?f=eyJiYXNlIjpbeyJ0eXBlIjoiUE9MWSIsInV1aWQiOiJhMWE1MjFmYjUzNDdjYT' \
'AxNWZlNyIsImxhYmVsIjoiQWhtZWRhYmFkIn1dLCJub25CYXNlQ291bnQiOjAsImV4cGVjdGVkUXVlcnkiOiIlMjBBaG1lZGFiYWQiL' \
'CJxdWVyeSI6IiBBaG1lZGFiYWQiLCJ2IjoyLCJzIjoiZCJ9'
driver = webdriver.Chrome()
driver.get(url)
fields = driver.find_elements_by_class_name("item-datapoint.va-middle")
for i, field in enumerate(fields):
    print(i, field.text)
driver.quit()
Now you can see the index in the list (fields) for every element.
Then print the elements you want, for example:
poss_date = fields[2].text
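Since the question started from BeautifulSoup, here is a minimal bs4 sketch of the same idea; it assumes the page HTML is already in a string html (for example from driver.page_source) and uses find_all(), which returns every match, unlike find(), which stops at the first:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html: page source fetched elsewhere
for datapoint in soup.find_all('div', {'class': 'item-datapoint'}):
    # each datapoint pairs a label div with a value div
    title = datapoint.find('div', {'class': 'lst-sub-title'}).get_text(strip=True)
    value = datapoint.find('div', {'class': 'lst-sub-value'}).get_text(strip=True)
    print(title, ':', value)
This would print, e.g., "Built Up Area : 2294 sq.ft." for each datapoint in turn.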
Related
I have two divs that look like this:
<div id="RightColumn">
<div class="profile-info">
<div class="info">
</div>
<div class="title">
</div>
</div>
</div>
How do I target the internal div labelled "title"? It appears multiple times on the page but the one that I need to target is within "RightColumn".
Here is the code I tried:
mainDIV = driver.find_element_by_id("RightColumn")
targetDIV = mainDIV.find_element_by_xpath('//*[@class="title"]').text
Unfortunately, the above code still pulls the title div from the whole page instead of the one I need within mainDIV.
//div[@id='RightColumn']//child::div[@class='title']
This should get the job done: first use the id RightColumn to target the outer div, then select the title-class div as a descendant.
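In Python, that XPath could be used like so (a minimal sketch in Selenium 4 syntax, assuming driver is already on the page):
from selenium.webdriver.common.by import By

# one XPath does both steps: anchor on the id, then descend to the title div
title_div = driver.find_element(By.XPATH, "//div[@id='RightColumn']//div[@class='title']")
print(title_div.text)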
This will select the first title div under this element:
mainDIV.find_element_by_xpath('.//div[@class="title"]')
However, this will select the first title on the page:
mainDIV.find_element_by_xpath('//div[@class="title"]')
Try:
targetDIV = mainDIV.find_element_by_xpath('.//div[@class="title"]').text
Note as of Selenium 4.0.0, the find_element_by_* functions are deprecated and should be replaced with find_element().
from selenium.webdriver.common.by import By

targetDIV = mainDIV.find_element(By.XPATH, './/div[@class="title"]').text
Reference:
WebDriver API - find_element_by_xpath
I am trying to select the titles of posts that are loaded on a webpage by combining multiple CSS selectors. See my process below:
Load relevant libraries
import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Then load the content I wish to analyse
options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)
browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[#type='search']").send_keys("international development",Keys.ENTER)
time.sleep(5)
scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break
Then, to get the content for each selector separately, I call find_elements_by_css_selector:
titles = browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    TitlesList.append(names.text)
times = browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    Times.append(names.text)
It all works so far... Now I'm trying to bring them together, with the aim of identifying only the choices from 2016:
choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")
browser.quit()
On this last snippet, I always get an empty list.
So I wonder: 1) how I can select multiple elements by combining different CSS selectors as simultaneous conditions; 2) whether the syntax for finding under multiple conditions is the same when mixing approaches such as CSS selectors and XPaths; and 3) whether there is a way to get the text of elements identified by multiple CSS selectors, along the lines of what's below:
[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]
Thanks
Firstly, I think what you're trying to do is get any title whose posting time is in 2016, right?
You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']", but this will not work because the syntax is not valid (and is not a valid CSS operator). Plus, these are two different elements: a single CSS selector match is one element, so to add a condition from another element you need to go through a common element, such as a shared parent.
I've checked the site; here's the HTML you need to look at (if you're trying to get the titles published in 2016). This is the minimal HTML part that can help you identify what you need to get:
<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
data-source="search_post---------2">
<div class="u-clearfix u-marginBottom15 u-paddingTop5">
<div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
<div class="u-flexCenter">
<div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
<div
class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
<a class="link link--darken"
href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action="open-post"
data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action-source="preview-listing">
<time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="postArticle-content">
<a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action="open-post" data-action-source="search_post---------2"
data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action-index="2" data-post-id="d17220aecaa8">
<section class="section section--body section--first section--last">
<div class="section-divider">
<hr class="section-divider">
</div>
<div class="section-content">
<div class="section-inner sectionLayout--insetColumn">
<h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
International Development for the 21st Century.</h3>
</div>
</div>
</section>
</a>
</div>
</div>
Both time and h3 are in a big div with a class of postArticle. The article contains the time published and the title, so it makes sense to get the whole article div published in 2016, right?
Using XPATH is much more powerful & easier to write:
This will get every article div whose class contains postArticle--short: article_xpath = '//div[contains(@class, "postArticle--short")]'
This will get every time tag whose datetime attribute contains 2016: //time[contains(@datetime, "2016")]
Let's combine both of them. I want to get the article divs that contain a time tag whose datetime contains 2016:
article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
article_element_list = driver.find_elements_by_xpath(article_2016_xpath)
# now let's get the title
for article in article_element_list:
    title = article.find_element_by_tag_name("h3").text
I haven't tested the code yet, only the xpath. You might need to adapt the code to work on your side.
By the way, relying on find_element... right after page load is not a good idea; try using explicit waits: https://selenium-python.readthedocs.io/waits.html
This helps you avoid arbitrary time.sleep() pauses, improves your app's performance, and lets you handle errors cleanly.
Only use find_element... when you have already located an element and need to find a child element inside it. For example, in this case I would find the articles with an explicit wait, and then, once each article element is located, use find_element... to find its child h3.
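As a rough sketch of that approach (untested; it assumes the browser variable and the 2016 XPath from above):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'

# wait up to 10 seconds for at least one matching article instead of sleeping blindly
articles = WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, article_2016_xpath))
)
for article in articles:
    # the title is a child of the already located article element
    print(article.find_element_by_tag_name("h3").text)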
I want to build a program that automatically gets the live price of the German index (DAX). For that I use a website with the price provider FXCM.
In my code I use beautifulsoup and requests as packages. The div box where the current value is stored looks like this:
<div class="left" data-item="quoteContainer" data-bg_quotepush="133962:74:bid">
<div class="wrapper cf">
<div class="left">
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="quote" data-bg_quotepush_c="40">13.599,24</span>
<span class="label" data-bg_quotepush="time" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="time" data-bg_quotepush_c="41">25.12.2020</span>
<span class="label"> • </span>
<span class="label" data-item="currency"></span>
</div>
<div class="right">
<span class="percent up" data-bg_quotepush="percent" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="percent" data-bg_quotepush_c="42">+0,00<span>%</span></span>
<span class="label up" data-bg_quotepush="change" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="change" data-bg_quotepush_c="43">0,00</span>
</div>
</div>
</div>
The value I want is the one after data-bg_quotepush_c="40", with a value of 13.599,24.
My Python code looks like this:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
response = rq.get(url)
soup = bs(response.text, "lxml")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price["data-bg_quotepush_c"])
It returns the following error:
File "C:\Users\Felix\anaconda3\lib\site-packages\bs4\element.py", line 1406, in __getitem__
return self.attrs[key]
KeyError: 'data-bg_quotepush_c'
Use Selenium instead of requests if working with dynamically generated content
What is going on?
Requesting the website with requests provides only the initial content, which does not contain the dynamically generated information, so you cannot find what you're looking for.
To wait until the website has loaded completely, use Selenium with sleep() as the simple method, or Selenium waits as the more advanced one.
Avoiding the error
Use price.text to get the text of the element that looks like this:
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_c="40" data-bg_quotepush_f="quote" data-bg_quotepush_i="133962:74:bid">13.599,24</span>
Example
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
driver.implicitly_wait(3)
soup = BeautifulSoup(driver.page_source,"html5lib")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price.text)
driver.close()
Output
13.599,24
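If you prefer not to rely on a fixed implicit wait, an explicit wait is more robust. A minimal sketch (the CSS selector is based on the data-bg_quotepush="quote" attribute visible in the HTML above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://news.guidants.com/#Ticker/Profil/?i=133962&e=74")

# wait up to 10 seconds for the quote span instead of a fixed implicit wait
price = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span[data-bg_quotepush='quote']"))
)
print(price.text)
driver.quit()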
If you're scraping the value of a div class, try this example:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(YourPATH to driver)  # put your ChromeDriver path here
# create a variable to store the url string
url = 'https://news.guidants.com/#Ticker/Profil/?i=133962&e=74'
driver.get(url)
# scraping process
soup = BeautifulSoup(driver.page_source, "html5lib")
# parse all divs with class "left" and print the first span inside each
prices = soup.find_all("div", attrs={"class": "left"})
for price in prices:
    total_price = price.find('span')
    if total_price:
        print(total_price.text)
# close the driver
driver.close()
If you're using the requests module, try a different parser; for example, you can install html5lib with pip:
pip install html5lib
thanks
I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
<div class="flex-item _memberItem-module_name__BSx8i">
<a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
</a>
</div>
</div>
Your selector isn't working because you are looking for an attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector searches for href inside elements with the class text--ellipsisOneLine. In your HTML snippet, that class matches only this:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, there is no href attribute. Now, if you want the text inside this h4 element, you need to use the ::text pseudo-element.
response.css('.text--ellipsisOneLine::text').getall()
Read more here.
I realize that this isn't Scrapy, but personally for web scraping I use the requests module and BeautifulSoup4; the following snippet will get you a list of users with those modules:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
    html_doc = response.text
    html_source = BeautifulSoup(html_doc, 'html.parser')
    users = html_source.findAll('h4')
    for user in users:
        print(user.text)
css:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()
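Combining both answers, a small sketch that pairs each member's name with the profile link, based on the HTML snippet shown in the question:
# iterate over the member rows and read name + href from each
for member in response.css('div.flex--alignCenter'):
    name = member.css('h4.text--ellipsisOneLine::text').get()
    href = member.css('a::attr(href)').get()
    if name and href:
        print(name.strip(), href)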
Trying to reply to Facebook comments using Selenium and Python.
I've been able to select the field using
find_elements_by_css_selector(".UFIAddCommentInput")
But I can't post text using the send_keys method. Here's a simplified structure of the comment HTML on Facebook:
<div><input tabindex="-1" name="add_comment_text">
<div class="UFIAddCommentInput _1osb _5yk1"><div class="_5yk2">
<div class="_5yw9"><div class="_5ywb">
<div class="_3br6">Write a comment...</div></div>
<div class="_5ywa">
<div title="Write a comment..." role="combobox"
class="_54z"contenteditable="true">
<div data-contents="true">
<div class="_209g _2vxa">
It works perfectly fine. The only catch: clear the div every time before you begin typing a new comment.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
f = webdriver.Firefox()
f.get("https://facebook.com")
# Skipping the logging in and going to a particular post part.
# The selector I used correspond to id of the post and
# class of the div that contains editable text part
e = f.find_element_by_css_selector("div#u_jsonp_8_q div._54-z")
e.send_keys("bob")
e.send_keys(Keys.ENTER)
e.clear()
e.send_keys("kevin")
e.send_keys(Keys.ENTER)
e.clear()