I am trying to select the titles of posts loaded on a webpage by combining multiple CSS selectors. Here is my process:
Load relevant libraries
import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Then load the content I wish to analyse
options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)
browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development", Keys.ENTER)
time.sleep(5)
scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break
Then, to get the content for each selector separately, I call find_elements_by_css_selector:
titles = browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    TitlesList.append(names.text)
times = browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    Times.append(names.text)
It all works so far... Now I am trying to bring them together, with the aim of identifying only posts from 2016:
choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")
browser.quit()
On this last snippet, I always get an empty list.
So I wonder: 1) how can I select elements that satisfy several different CSS selectors as simultaneous conditions, 2) whether the syntax for combining conditions would be the same when linking elements via different approaches, such as CSS selectors and XPaths, and 3) whether there is a way to get the text of elements identified by multiple CSS selectors, along the lines of the following:
[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]
Thanks
Firstly, I think what you're trying to do is to get any title whose post time is in 2016, right?
You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']", but this will not work because the syntax is invalid (and is not a CSS operator). Also, these are two different elements, and a CSS selector can only match one element at a time. In your case, to add a condition from another element, you need to go through something they share, like a common parent element.
I've checked the site; here's the HTML you need to look at (if you're trying to get the titles published in 2016). This is the minimal HTML part that helps you identify what you need to get.
<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
data-source="search_post---------2">
<div class="u-clearfix u-marginBottom15 u-paddingTop5">
<div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
<div class="u-flexCenter">
<div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
<div
class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
<a class="link link--darken"
href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action="open-post"
data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action-source="preview-listing">
<time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="postArticle-content">
<a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action="open-post" data-action-source="search_post---------2"
data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
data-action-index="2" data-post-id="d17220aecaa8">
<section class="section section--body section--first section--last">
<div class="section-divider">
<hr class="section-divider">
</div>
<div class="section-content">
<div class="section-inner sectionLayout--insetColumn">
<h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
International Development for the 21st Century.</h3>
</div>
</div>
</section>
</a>
</div>
</div>
Both time and h3 live inside a big div with a class of postArticle. That article div contains both the time published and the title, so it makes sense to grab the whole article div published in 2016, right?
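If you want to stay with CSS selectors, a workaround (a rough sketch, reusing your original method names and the postArticle--short class from the HTML above) is to grab each article container first and filter in Python, since plain CSS can't filter a parent by its children:
# grab each article container, keep only those with a 2016 timestamp inside
articles = browser.find_elements_by_css_selector("div.postArticle--short")
titles_2016 = []
for article in articles:
    # find_elements (plural) returns an empty list instead of raising
    if article.find_elements_by_css_selector("time[datetime^='2016']"):
        titles_2016.append(article.find_element_by_css_selector("h3[class^='graf']").text)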
Using XPath is much more powerful and easier to write:
This will get all article divs whose class contains postArticle--short: article_xpath = '//div[contains(@class, "postArticle--short")]'
This will get all time tags whose datetime attribute contains 2016: //time[contains(@datetime, "2016")]
Let's combine both of them. I want to get the article divs that contain a time tag with a 2016 datetime:
article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
article_element_list = driver.find_elements_by_xpath(article_2016_xpath)
# now let's get the title
for article in article_element_list:
    title = article.find_element_by_tag_name("h3").text
I haven't tested the code yet, only the XPath. You might need to adapt it to work on your side.
By the way, calling find_element... right away is not a good idea; try using an explicit wait instead: https://selenium-python.readthedocs.io/waits.html
This will help you avoid brittle time.sleep waits, improve your app's performance, and let you handle errors properly.
Only use find_element... when you have already located an element and need to find a child element inside it. For example, in this case, if I want to find the articles, I will find them with an explicit wait; then, once an article element is located, I will use find_element... to find its child h3.
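For example, a quick sketch of that pattern reusing article_2016_xpath from above (untested; the 10-second timeout is an arbitrary choice):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(driver, 10)
# wait until at least one 2016 article is present instead of sleeping blindly
article_element_list = wait.until(
    EC.presence_of_all_elements_located((By.XPATH, article_2016_xpath))
)
# once an article is located, it's safe to look up its child h3 directly
titles_2016 = [article.find_element_by_tag_name("h3").text for article in article_element_list]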
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver import ActionChains
import selenium.webdriver.common.keys
from bs4 import BeautifulSoup
import requests
import time
driver = webdriver.Chrome(executable_path="../drivers/chromedriver.exe")
driver.get("https://www.Here the address of the relevant website ends with aspx.com.aspx")
element=driver.find_element_by_id("ctl00_ContentPlaceHolder1_LB_SEKTOR")
drp=Select(element)
drp.select_by_index(0)
element1=driver.find_element_by_id("ctl00_ContentPlaceHolder1_Lb_Oran")
drp=Select(element1)
drp.select_by_index(41)
element2=driver.find_element_by_id("ctl00_ContentPlaceHolder1_LB_DONEM")
drp=Select(element2)
drp.select_by_index(1)
driver.find_element_by_id("ctl00_ContentPlaceHolder1_ImageButton1").click()
time.sleep(1)
print(driver.page_source)
With the last part of this code, I can print the source code of the page.
But from the page source I only need the following table part, which is built by JavaScript. How can I extract this section and output it as a CSV table? (How can I get the table that is inside the JavaScript section?)
Note: in the Selenium test, I thought of pressing the CTRL+U keys while in Chrome, but I was not successful with this. The web page is a user-interactive page; some interactions are required to get the data I want, which is why I used Selenium.
<span id="ctl00_ContentPlaceHolder1_Label2" class="Georgia_10pt_Red"></span>
<div id="ctl00_ContentPlaceHolder1_Divtable">
<div id="table">
<layer name="table" top="0"><IMG height="2" src="../images/spacer.gif" width="2"><br>
<font face="arial" color="#000000" size="2"><b>Tablo Yükleniyor. Lütfen Bekleyiniz...</b></font><br>
</layer>
</div>
</div>
<script language=JavaScript> var theHlp='/yardim/matris.asp';var theTitle = 'Piya Deg';var theCaption='OtomoT (TL)';var lastmod = '';var h='<a class=hislink href=../Hisse/Hisealiz.aspx?HNO=';var e='<a class=hislink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('cksart',4,50);theCols[1] = new Array('2018.12',1,60);theCols[2] = new Array('2019.03',1,60);theCols[3] = new Array('2019.06',1,60);theCols[4] = new Array('2019.09',1,60);theCols[5] = new Array('2019.12',1,60);theCols[6] = new Array('2020.03',1,60);var theRows = new Array();theRows[0] = new Array ('<b>'+h+'42>AHRT</B></a>','519,120,000.00','590,520,000.00','597,240,000.00','789,600,000.00','1,022,280,000.00','710,640,000.00');
theRows[1] = new Array ('<b>'+h+'427>SEEL</B></a>','954,800,000.00','983,400,000.00','1,201,200,000.00','1,716,000,000.00','2,094,400,000.00','-');
theRows[2] = new Array ('<b>'+h+'140>TOFO</B></a>','17,545,500,000.00','17,117,389,800.00','21,931,875,000.00','20,844,054,000.00','24,861,973,500.00','17,292,844,800.00');
theRows[3] = new Array ('<b>'+h+'183>MSO</B></a>','768,000,000.00','900,000,000.00','732,000,000.00','696,000,000.00','1,422,000,000.00','1,134,000,000.00');
theRows[4] = new Array ('<b>'+h+'237>KURT</B></a>','2,118,000,000.00','2,517,600,000.00','2,736,000,000.00','3,240,000,000.00','3,816,000,000.00','2,488,800,000.00');
theRows[5] = new Array ('<b>'+h+'668>GRTY</B></a>','517,500,000.00','500,250,000.00','445,050,000.00','552,000,000.00','737,150,000.00','-');
theRows[6] = new Array ('<b>'+h+'291>MEME</B></a>','8,450,000,000.00','8,555,000,000.00','9,650,000,000.00','10,140,000,000.00','13,430,000,000.00','8,225,000,000.00');
theRows[7] = new Array ('<b>'+h+'292>AMMI</B></a>','-','-','-','-','-','-');
theRows[8] = new Array ('<b>'+h+'426>GOTE</B></a>','1,862,578,100.00','1,638,428,300.00','1,689,662,540.00','2,307,675,560.00','2,956,642,600.00','2,121,951,440.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script></form>
<div style="clear: both; margin-top: 10px;">
<div style="background-color: Red; border: 2px solid Green; display: none">
TABLO-ALT</div>
<div id="Bannerctl00_SiteBannerControl2">
<div id="_bannerctl00_SiteBannerControl2">
<div id="Sayfabannerctl00_SiteBannerControl2" class="banner_Codex">
</div>
Please note that I've only used Selenium in Java, so I'll give you the most generic and language-agnostic answer I can. Keep in mind that Python Selenium MAY provide a method to do this directly.
Steps:
Perform all Selenium interactions so the WebDriver actually holds a VALID version of the page with all your contents loaded
Extract the current contents of the whole page from Selenium
Load them with an HTML parsing library. I use JSoup in Java; I don't know if there's a Python equivalent. From now on, Selenium does not matter.
Use CSS selectors on your parser object to get the section you want
Convert that section to a String to print it
If performance is a requirement, this approach may be a bit too expensive, as the contents are parsed twice: Selenium does it first, and your HTML parser will do it again later with the String extracted from Selenium.
ALTERNATIVE: If your "target page" uses AJAX, you may directly interact with the REST API that the JavaScript is accessing to get the data. I tend to follow this approach when doing serious web scraping, but sometimes this is not an option, so I use the approach above.
EDIT
Some more details based on questions in the comments:
You can use BeautifulSoup as the HTML parsing library.
To load a page into BeautifulSoup, use:
html = "<html><head></head><body><div id=\"events-horizontal\">Hello world</div></body></html>"
soup = BeautifulSoup(html, "html.parser")
Then look at this answer to see how to extract the specific contents from your soup:
your_div = soup.select_one('div#events-horizontal')
That would give you the first div with events-horizontal id:
<div id="events-horizontal">Hello world</div>
BeautifulSoup code based on:
How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?
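For your specific page, the values you want live inside the <script> block that defines theRows, so after extracting the page source you still need a regular expression to pull the rows out; CSS selectors alone can't reach inside the script text. A rough, untested sketch (the first cell of each row will still contain leftover JavaScript markup you may want to clean further):
import csv
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")
# pick the <script> whose text defines the theRows arrays
script = next(s.string for s in soup.find_all("script")
              if s.string and "theRows" in s.string)

# each row looks like: theRows[0] = new Array ('cell','cell',...);
rows = []
for row in re.findall(r"theRows\[\d+\] = new Array \((.*?)\);", script):
    # naive split between quoted cells; good enough for a first pass
    rows.append([cell.strip("'") for cell in row.split("','")])

with open("table.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)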
How do I locate an input field via its label using webdriver?
I'd like to test a certain web form which unfortunately uses dynamically generated IDs, so they're unsuitable as identifiers.
Yet, the labels associated with each web element strike me as suitable.
Unfortunately, I was not able to do it with the few suggestions offered on the web. There is one thread here at SO, but it did not yield an accepted answer:
Selenium WebDriver Java - Clicking on element by label not working on certain labels
To solve this problem in Java, it is commonly suggested to locate the label as an anchor via its text content and then specify the XPath to the input element:
//label[contains(text(), 'TEXT_TO_FIND')]
I am not sure how to do this in python though.
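I assume the direct translation would be something like the following, though I'm not sure it matches my HTML:
label = driver.find_element_by_xpath("//label[contains(text(), 'TEXT_TO_FIND')]")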
My web element:
<div class="InputText">
<label for="INPUT">
<span>
LABEL TEXT
</span>
<span id="idd" class="Required" title="required">
*
</span>
</label>
<span class="Text">
<input id="INPUT" class="Text ColouredFocus" type="text" onchange="var wcall=wicketAjaxPost(';jsessionid= ... ;" maxlength="30" name="z1013400259" value=""></input>
</span>
<div class="RequiredLabel"> … </div>
<span> … </span>
</div>
Unfortunately, I was not able to use CSS or XPath expressions on the site, since the IDs and names always changed.
The only solution I found to my problem was a dirty one: parsing the source code of the page and extracting the IDs via string operations.
Certainly this is not the way WebDriver was intended to be used, but it works robustly.
Code:
lines = []
for line in driver.page_source.splitlines():
    lines.append(line)
    # when the label text appears, grab the first quoted attribute value
    # from the line before it (where the id sits in this page's source)
    if 'LABEL TEXT 1' in line:
        id_l1 = lines[-2].split('"', 2)[1]
You should start with a div and check that there is a label with an appropriate span inside, then get the input element from the span tag with class Text:
//div[@class='InputText' and contains(label/span, 'TEXT_TO_FIND')]/span[@class='Text']/input
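Untested, but from Python that would look something like this (label text taken from your HTML):
xpath = ("//div[@class='InputText' and contains(label/span, 'LABEL TEXT')]"
         "/span[@class='Text']/input")
input_field = driver.find_element_by_xpath(xpath)
input_field.send_keys("some value")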