Selenium how to extract href from attributes

<div class="turbolink_scroller" id="container">
<article><div class="inner- article">
<a style="height:81px;" href="LINK TO EXTRACT">
<img width="81" height="81" src="//image.jpg" alt="code" />
Hello! I'm pretty new to Selenium and I've been experimenting with how to pull data out of pages through my webdriver. Right now I'm trying to extract the href link of an anchor, given the alt text of the image inside it (as in the HTML above), and I'm not sure whether the documentation covers a way to do this. I suspect the answer is find_by_xpath, but I'm not entirely sure. Thank you for any tips!

The way is as follows (Selenium 4 locator style):
from selenium.webdriver.common.by import By
href = driver.find_element(By.TAG_NAME, 'a').get_attribute('href')
Of course, a page may contain many 'a' tags, so you can narrow the search by locating a parent element first, e.g.
div = driver.find_element(By.ID, 'container')
a = div.find_element(By.TAG_NAME, 'a')
href = a.get_attribute('href')
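
Since the question asks for the link that belongs to a given alt text, here is a minimal sketch of an XPath that starts at the image and steps up to its parent anchor. It runs against a closed-up copy of the question's HTML using the standard library's ElementTree instead of a browser, so that swap is an assumption of the sketch. In Selenium the same idea should work by passing "//img[@alt='code']/.." to find_element with By.XPATH and then calling get_attribute('href').

```python
import xml.etree.ElementTree as ET

# The question's snippet with the missing closing tags added so it parses as XML.
HTML = """
<div class="turbolink_scroller" id="container">
  <article><div class="inner-article">
    <a style="height:81px;" href="LINK TO EXTRACT">
      <img width="81" height="81" src="//image.jpg" alt="code" />
    </a>
  </div></article>
</div>
""".strip()

root = ET.fromstring(HTML)

# Find the image by its alt text, then step up to the enclosing anchor.
anchor = root.find(".//img[@alt='code']/..")
href = anchor.get("href")
print(href)
```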

Related

Selenium starts-with searches the entire page, not the given WebElement

I want to search for a class name with starts-with inside a specific WebElement, but it searches the entire page. I do not know what is wrong.
This returns a list:
muidatagrid_rows = driver.find_elements(by=By.CLASS_NAME, value='MuiDataGrid-row')
one_row = muidatagrid_rows[0]
This is the HTML inside the WebElement (one_row):
<div class="market-watcher-title_os_button_container__4-yG+">
<div class="market-watcher-title_tags_container__F37og"></div>
<div>
<a href="#" target="blank" rel="noreferrer" data-testid="ios download button for 1628080370">
<img class="apple-badge-icon-image"></a>
</div>
<div></div>
</div>
If I search with the full class name like this:
tags_and_marketplace_section = one_row.find_element(by=By.CLASS_NAME, value="market-watcher-title_os_button_container__4-yG+")
It gives an error:
selenium.common.exceptions.InvalidSelectorException: Message: Given css selector expression ".market-watcher-title_os_button_container__4-yG+" is invalid: InvalidSelectorError: Element.querySelector: '.market-watcher-title_os_button_container__4-yG+' is not a valid selector: ".market-watcher-title_os_button_container__4-yG+"
So I want to search with the starts-with method, but I cannot get what I want.
This should return only two WebElements, but it returns 20:
tags_and_marketplace_section = one_row.find_elements(by=By.XPATH, value='//div[starts-with(@class, "market-watcher-")]')
print(len(tags_and_marketplace_section))
>>> 20

Without seeing the page you are scraping it is difficult to help fully. However, I have found that chaining selectors helps to narrow down the returned results. Also, the By.CSS_SELECTOR method works best for me.
For example, if what you want is inside a row div and a p, then you would do something like this:
driver.find_elements(by=By.CSS_SELECTOR, value="div.MuiDataGrid-row p")
Then you can work with the elements that are returned, as you described. You may be able to use other methods or selectors, but this is my favourite route so far.
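
A likely reason for the 20 results in the question is that an XPath starting with "//" is evaluated from the document root even when it is called on a WebElement, while ".//" keeps the search inside that element. The sketch below only shows the scoped locator. RecordingRow is a hypothetical stand-in for one_row so that the snippet runs without a browser, and the "xpath" string is the value of Selenium's By.XPATH.

```python
# By.XPATH in Selenium is the plain string "xpath"; it is spelled out here so
# the sketch does not need Selenium installed.
XPATH = "xpath"

# The leading "." anchors the search at the element the method is called on.
SCOPED_PREFIX_XPATH = './/div[starts-with(@class, "market-watcher-")]'


def market_watcher_divs(row):
    """Return the divs below `row` whose class attribute starts with the prefix."""
    return row.find_elements(XPATH, SCOPED_PREFIX_XPATH)


class RecordingRow:
    """Hypothetical stand-in for a WebElement; it records the locator it receives."""

    def __init__(self):
        self.calls = []

    def find_elements(self, by, value):
        self.calls.append((by, value))
        return []


row = RecordingRow()
market_watcher_divs(row)
print(row.calls)
```

With a real driver you would pass one_row instead of the stand-in. Note that starts-with(@class, ...) tests the whole class attribute string, so it only matches when the prefixed class comes first in the attribute.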

Creating a css selector to locate multiple ids in a single-shot

I've defined CSS selectors within the script to get the text within span elements, and I'm getting them accordingly. However, the way I tried is definitely messy. I just separated the different CSS selectors with a comma to let the script understand I'm after this or that.
If I opt for xpath I could have used 'div//span[.="Featured" or .="Sponsored"]' but in case of css selector I could not find anything similar to serve the same purpose. I know using 'span:contains("Featured"),span:contains("Sponsored")' I can get the text but there is the comma in between as usual.
What is the ideal way to locate the elements (within different ids) using css selectors except for comma?
My try so far with:
from lxml.html import fromstring
html = """
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">
Pizza Hut
</a>
<div id="featured other-dynamic-ids">
<span>Sponsored</span>
</div>
</div>
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/salads-up">
Salads UP
</a>
<div id="other-dynamic-ids border">
<span>Featured</span>
</div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("[id~='featured'] span,[id~='border'] span"):
    print(item.text)

You can do:
.rest-list-information div span
But I think it's a bad idea to consider the comma messy. You won't find many stylesheets that don't have commas.

If you are just looking to get all 'span' text from the HTML, then the following should suffice:
root_spans = root.xpath('//span')
for span in root_spans:
    span_text = span.xpath('.//text()')[0]
    print(span_text)
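
As a rough check of the structural selector suggested above (.rest-list-information div span), here is a small sketch that walks the same path over the question's HTML. It uses the standard library's ElementTree rather than lxml, which is an assumption made only so the snippet has no third-party dependency. The body wrapper is also added here because ElementTree needs a single root element.

```python
import xml.etree.ElementTree as ET

HTML = """
<body>
<div class="rest-list-information">
  <a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">Pizza Hut</a>
  <div id="featured other-dynamic-ids">
    <span>Sponsored</span>
  </div>
</div>
<div class="rest-list-information">
  <a class="restaurant-header" href="/madison-wi/restaurants/salads-up">Salads UP</a>
  <div id="other-dynamic-ids border">
    <span>Featured</span>
  </div>
</div>
</body>
""".strip()

root = ET.fromstring(HTML)

# Path counterpart of ".rest-list-information div span" for this markup:
# it relies on the document structure instead of the dynamic ids.
labels = [
    span.text
    for span in root.iterfind(".//div[@class='rest-list-information']/div/span")
]
print(labels)
```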

HTML source while webscraping seems inconsistent for website

I checked out say:
https://www.calix.com/search-results.html?searchKeyword=C7
And if I inspect element on the first link I get this:
<a class="title viewDoc" href="https://www.calix.com/content/dam/calix/mycalix-misc/ed-svcs/learning_paths/C7_lp.pdf" data-preview="/session/4e14b237-f19b-47dd-9bb5-d34cc4c4ce01/" data-preview-count="1" target="_blank"><i class="fa fa-file-pdf-o grn"></i><b>C7</b> Learning Path</a>
I coded:
import requests, bs4
res = requests.get('https://www.calix.com/search-results.html?searchKeyword=C7', headers={'User-Agent': 'test'})
print(res)
#res.raise_for_status()
bs_obj = bs4.BeautifulSoup(res.text, "html.parser")
elems = bs_obj.find_all('a', attrs={"class": "title viewDoc"})
print(elems)
And there was [] as output (empty list).
So, I thought about actually looking through the "view-source" for the page.
view-source:https://www.calix.com/search-results.html?searchKeyword=C7
If you search through the "view-source" you will not find the code for the "inspect element" I mentioned earlier.
There is no "a class="title viewDoc"" in the view-source of the page.
That is probably why my code isn't returning anything.
Then I went to www.nba.com and inspected a link:
<a class="content_list--item clearfix" href="/article/2018/07/07/demarcus-cousins-discusses-stacked-golden-state-warriors-roster"><h5 class="content_list--title">Cousins on Warriors' potential: 'Scary'</h5><time class="content_list--time">in 5 hours</time></a>
The content of "inspect" for this link was in the "view-source" of the page.
And, obviously my code was working for this page.
I have seen a few examples of issue #1.
Just curious why the difference in html formats, or am I missing something?

Python 3.5 + Selenium Scrape. Is there any way to select <a></a> tags?

So I'm very new to Python and Selenium. I'm writing a scraper to take some balances and download a txt file. So far I've managed to grab the account balances, but downloading the txt files has proven to be a difficult task.
This is a sample of the HTML:
<td>
<div id="expoDato_msdd" class="dd noImprimible" style="width: 135px">
<div id="expoDato_title123" class="ddTitle">
<span id="expoDato_arrow" class="arrow" style="background-position: 0pt 0pt"></span>
<span id="expoDato_titletext" class="textTitle">Exportar Datos</span>
</div>
<div id="expoDato_child" class="ddChild" style="width: 133px; z-index: 50">
<a class="enabled" href="/CCOLEmpresasCartolaHistoricaWEB/exportarDatos.do;jsessionid=9817239879882871987129837882222R?tipoExportacion=txt">txt</a>
<a class="enabled" href="/CCOLEmpresasCartolaHistoricaWEB/exportarDatos.do;jsessionid=9817239879882871987129837882222R?tipoExportacion=pdf">PDF</a>
<a class="enabled" href="/CCOLEmpresasCartolaHistoricaWEB/exportarDatos.do;jsessionid=9817239879882871987129837882222R?tipoExportacion=excel">Excel</a>
<a class="modal" href="#info_formatos">Información Formatos</a>
</div>
</div>
I need to click on the first "a" with class=enabled, but I just can't manage to get there by XPath, class, or anything else really. Here is the last thing I tried.
#Descarga de Archivos
ddmenu2 = driver.find_element_by_id("expoDato_child")
ddmenu2.find_element_by_css_selector("txt").click()
This is more of the stuff I've already tried:
#TXT = driver.select
#TXT.send_keys(Keys.RETURN)
#ddmenu2 = driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/form/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td[4]/div/div[2]")
#Descarga = ddmenu2.find_element_by_visible_text("txt")
#Descarga.send_keys(Keys.RETURN)
I would appreciate your help.
PS: English is not my native language, so I'm sorry for any confusion.
EDIT:
This was the approach that worked. I'll try your other suggestions to make the code neater. Also, it only works if the mouse pointer is over the browser window; it doesn't matter where.
ddmenu2a = driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/form/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td[4]/div/div[1]").click()
ddmenu2b = driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/form/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td[4]/div/div[2]")
ddmenu2c = driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/form/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td[4]/div/div[2]/a[1]").click()
Pretty much brute force, but I'm getting to like Python scripting.

Or simply use CSS to match on the href (Selenium 4 locator style):
from selenium.webdriver.common.by import By
driver.find_element(By.CSS_SELECTOR, "div#expoDato_child a.enabled[href*='txt']")

You can get all anchor elements like this:
from selenium.webdriver.common.by import By
a_list = driver.find_elements(By.TAG_NAME, 'a')
This returns a list of elements. You can click on each element:
for a in a_list:
    a.click()
    driver.back()
Or try XPath for each anchor element:
a1 = driver.find_element(By.XPATH, '//a[@class="enabled"][1]')
a2 = driver.find_element(By.XPATH, '//a[@class="enabled"][2]')
a3 = driver.find_element(By.XPATH, '//a[@class="enabled"][3]')
Please let me know if this was helpful.

You can directly reach the elements by XPath via their text:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, "//*[@id='expoDato_child' and contains(., 'txt')]").click()
driver.find_element(By.XPATH, "//*[@id='expoDato_child' and contains(., 'PDF')]").click()
...
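
The expressions above match the element that carries the id, which in the sample HTML is the container div. As a hedged variant, here is a sketch of a path that selects the link itself by its exact text. It is checked against a trimmed copy of the question's HTML with the standard library's ElementTree, which is an assumption of this sketch. With Selenium, the same expression written with a leading "//" should be usable through By.XPATH.

```python
import xml.etree.ElementTree as ET

BASE = ("/CCOLEmpresasCartolaHistoricaWEB/exportarDatos.do"
        ";jsessionid=9817239879882871987129837882222R?tipoExportacion=")

# Trimmed copy of the question's markup, closed so that it parses as XML.
HTML = f"""
<td>
  <div id="expoDato_msdd" class="dd noImprimible">
    <div id="expoDato_child" class="ddChild">
      <a class="enabled" href="{BASE}txt">txt</a>
      <a class="enabled" href="{BASE}pdf">PDF</a>
      <a class="enabled" href="{BASE}excel">Excel</a>
      <a class="modal" href="#info_formatos">Información Formatos</a>
    </div>
  </div>
</td>
""".strip()

root = ET.fromstring(HTML)

# Select the anchor whose full text is exactly "txt" inside the menu container.
LINK_PATH = ".//div[@id='expoDato_child']/a[.='txt']"
txt_link = root.find(LINK_PATH)
txt_href = txt_link.get("href")
print(txt_href)
```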

If there is a public link for the page in question, that would be helpful.
However, generally, I can think of two methods for this:
If you can discover the direct link, you can extract the link URL and use Python's urllib to download the file directly.
or
Use Selenium's click function and have it click on the link in the page.
A quick search turned up this:
downloading-file-using-selenium
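
For the first method, a minimal sketch of turning the relative href from the sample HTML into a full URL with urllib. PAGE_URL is a hypothetical placeholder because the real site is not given in the question. The download helper is defined but not called here, since it needs network access and, on a site that requires login, most likely the cookies of the browser session as well.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Hypothetical address of the page that shows the export menu.
PAGE_URL = "https://bank.example/CCOLEmpresasCartolaHistoricaWEB/cartola.do"

# href of the "txt" link, copied from the sample HTML in the question.
HREF = ("/CCOLEmpresasCartolaHistoricaWEB/exportarDatos.do"
        ";jsessionid=9817239879882871987129837882222R?tipoExportacion=txt")


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


def download(url, filename):
    """Save the resource at url to filename (plain urllib, no session cookies)."""
    urlretrieve(url, filename)


download_url = absolute_url(PAGE_URL, HREF)
print(download_url)
```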

I am not able to parse using Beautiful Soup

<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want to get the description by scraping it:
My Name is Alis I am a python learner...
I tried a lot of things but could not figure out the best way. Can you give me a general solution for this?

from bs4 import BeautifulSoup
soup = BeautifulSoup("Your sample html here", "html.parser")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (including any surrounding whitespace, it should be noted).
This works by parsing the HTML, grabbing the first td tag and its contents, grabbing all div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
In BeautifulSoup there are a lot of ways to do this, so this answer probably hasn't taught you much, and I genuinely recommend you read the tutorial that David suggested.
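
To make the navigation concrete, here is a sketch of the same walk over the question's HTML using the standard library's ElementTree instead of BeautifulSoup (an assumption made so the snippet has no third-party dependency). One difference to keep in mind: findall("div") here returns direct children only, so the description div sits at index 1 rather than 2, and the trailing text is exposed as the tail of the last child element.

```python
import xml.etree.ElementTree as ET

HTML = """
<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
""".strip()

td = ET.fromstring(HTML)

outer_div = td.find("div")                    # first div inside the td
description_div = outer_div.findall("div")[1]  # second direct child div
last_child = list(description_div)[-1]         # the <br /> element
description = last_child.tail.strip()          # text that follows the <br />
print(description)
```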

Have you tried reading the examples provided in the documentation? The quick start is located here http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find the div with class "class-a", you would load your HTML via
from bs4 import BeautifulSoup
soup = BeautifulSoup("My html here", "html.parser")
myDiv = soup.find("div", { "class" : "class-a" })
Also remember you can do most of this in the Python console, using dir() along with help() to walk through what you're trying to do. It might make life easier to try out IPython or perhaps Python's IDLE, which have very friendly consoles for beginners.
