I'm trying to extract a list of links in a web page using python selenium. All the links on the page have the following format in the source code:
Using the following line gives me all the elements on the page with tag name a:
driver.find_elements_by_tag_name("a")
The issue is that I need only a specific set of links, and all these links are within a table. The above code gives me all the links on the page, even those outside the table. Outline of the page source looks like this:
<html>
...
...
<frame name = "frame">
<a href = "unwantedLink">
<form name = "form">
<table name = "table">
<a href = "link1">
<a href = "link2">
<a href = "link3">
</table>
</form>
</frame>
...
</html>
I need link1, link2 and link3, but not unwantedLink. Both the required links and the unwanted link are in the same frame, so switching frames won't work. Is there a way to look for a tags within the table only, rather than within the whole parent frame?
Thanks
This should give you what you want:
driver.find_elements_by_css_selector("table[name='table'] a")
The table[name='table'] part selects only the table whose name attribute is set to "table". The selector then matches all a elements that are descendants of that table, so it does not matter whether the a elements are direct children of the table element or appear inside td elements.
Note that if you have more than one table that has a name attribute set to the value "table", you'll get more elements than you are actually looking for. (There are no guarantees of uniqueness for the name attribute.)
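To see the descendant rule and the uniqueness caveat without starting a browser, here is a minimal offline sketch. It uses the standard library's xml.etree.ElementTree instead of Selenium, and the PAGE markup (with td cells and a second table that reuses the same name) is made up for illustration, so treat it as an analogy for what the CSS selector matches rather than as Selenium code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page in the question.
PAGE = """
<html>
  <body>
    <a href="unwantedLink">outside</a>
    <form name="form">
      <table name="table">
        <tr><td><a href="link1">one</a></td></tr>
        <tr><td><a href="link2">two</a></td></tr>
        <tr><td><a href="link3">three</a></td></tr>
      </table>
      <table name="table">
        <tr><td><a href="otherTableLink">other</a></td></tr>
      </table>
    </form>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Every table whose name attribute equals "table" (there are two here).
tables = root.findall(".//table[@name='table']")

# ".//a" means "any a element below this node, at any depth", which is
# what the space in "table[name='table'] a" expresses in CSS.
links_all_tables = [a.get("href") for t in tables for a in t.findall(".//a")]
links_first_table = [a.get("href") for a in tables[0].findall(".//a")]

print(links_all_tables)
print(links_first_table)
```

The link outside the tables never shows up, but the second table's link does as soon as both tables share the name, which is the situation the note above warns about.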
I want to search for a class name with starts-with inside a specific WebElement, but the search covers the entire page. I do not know what is wrong.
This returns a list:
muidatagrid_rows = driver.find_elements(by=By.CLASS_NAME, value='MuiDataGrid-row')
one_row = muidatagrid_rows[0]
This is the piece of HTML inside the WebElement (one_row):
<div class="market-watcher-title_os_button_container__4-yG+">
<div class="market-watcher-title_tags_container__F37og"></div>
<div>
<a href="#" target="blank" rel="noreferrer" data-testid="ios download button for 1628080370">
<img class="apple-badge-icon-image"></a>
</div>
<div></div>
</div>
If I search with the full class name like this:
tags_and_marketplace_section = one_row.find_element(by=By.CLASS_NAME, value="market-watcher-title_os_button_container__4-yG+")
it gives this error:
selenium.common.exceptions.InvalidSelectorException: Message: Given css selector expression ".market-watcher-title_os_button_container__4-yG+" is invalid: InvalidSelectorError: Element.querySelector: '.market-watcher-title_os_button_container__4-yG+' is not a valid selector: ".market-watcher-title_os_button_container__4-yG+"
So I want to search with the starts-with method instead, but I cannot get what I want.
This should return only two WebElements, but it returns 20:
tags_and_marketplace_section = one_row.find_elements(by=By.XPATH, value='//div[starts-with(@class, "market-watcher-")]')
print(len(tags_and_marketplace_section))
>>> 20
Without seeing the page you are scraping it's difficult to help fully, but I've found that chaining selectors helps to narrow down the returned results. Also, the By.CSS_SELECTOR method works best for me.
For example, if what you want is a p inside a div with the MuiDataGrid-row class, you would do something like this:
driver.find_elements(by=By.CSS_SELECTOR, value="div.MuiDataGrid-row p")
Then you can work with the elements that are returned as you described. You may be able to use other methods/selectors, but this is my favourite route so far.
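One likely cause of the 20 matches in the question is the XPath itself: in Selenium an expression that starts with // is evaluated from the document root even when it is called on a WebElement, while .// starts at that element. The sketch below only builds the locator strings, so it runs without a browser. The helper names are hypothetical, and the commented usage lines assume the one_row element from the question.

```python
def scoped_class_prefix_xpath(prefix, tag="div"):
    """XPath for descendants of the context element whose class starts with prefix.

    The leading "." anchors the search at the element that find_elements
    is called on; without it, "//" searches the whole document.
    """
    return f'.//{tag}[starts-with(@class, "{prefix}")]'


def class_prefix_css(prefix, tag="div"):
    """CSS attribute selector with the same prefix test.

    It avoids writing out the full class name, whose trailing "+" is what
    makes ".market-watcher-title_os_button_container__4-yG+" invalid.
    """
    return f'{tag}[class^="{prefix}"]'


xpath = scoped_class_prefix_xpath("market-watcher-")
css = class_prefix_css("market-watcher-")
print(xpath)
print(css)

# Assumed usage with the question's objects (needs Selenium and a live page):
# sections = one_row.find_elements(By.XPATH, xpath)
# sections = one_row.find_elements(By.CSS_SELECTOR, css)
```

Both locators test the start of the whole class attribute string, so they would miss an element whose class list merely contains a matching name later on.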
I would like to select the second a element by specifying that it has a title attribute (I don't want to just select the second element in the list).
sample = """<h5 class="card__coins">
<a class="link-detail" href="/en/coin/smartcash">SmartCash (SMART)</a>
</h5>
<a class="link-detail" href="/en/event/smartrewards-812" title="SmartRewards">"""
How could I do it?
My code (does not work):
from bs4 import BeautifulSoup
soup = BeautifulSoup(sample.content, "html.parser")
second = soup.find("a", {"title"})
for I in soup.find_all('a', title=True):
    print(I)
find_all('a', title=True) returns only the a tags that have a title attribute, so the loop prints just those.
Another way to do this is with a CSS selector:
soup.select_one('a[title]')
This selects the first a element that has a title attribute.
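Both forms above filter on the presence of the attribute, not on its value. As a dependency-free illustration of that rule, here is a small sketch with the standard library's html.parser rather than BeautifulSoup. The TitledLinkCollector class is a made-up name, and the sample is the markup from the question.

```python
from html.parser import HTMLParser

sample = """<h5 class="card__coins">
<a class="link-detail" href="/en/coin/smartcash">SmartCash (SMART)</a>
</h5>
<a class="link-detail" href="/en/event/smartrewards-812" title="SmartRewards">"""


class TitledLinkCollector(HTMLParser):
    """Collects the attributes of every a tag that has a title attribute."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Presence test only: the value of title is not inspected.
        if tag == "a" and "title" in attributes:
            self.matches.append(attributes)


collector = TitledLinkCollector()
collector.feed(sample)
collector.close()
for link in collector.matches:
    print(link["href"], link["title"])
```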
I have such HTML code:
<li class="IDENTIFIER"><h5 class="hidden">IDENTIFIER</h5><p>
<span class="tooltip-iws" data-toggle="popover" data-content="SOME TEXT">
other text</span></p></li>
And I'd like to obtain the SOME TEXT from the data-content.
I wrote
target = soup.find('span', {'class' : 'tooltip-iws'})['data-content']
to get the span, and I wrote
identifier_elt= soup.find("li", {'class': 'IDENTIFIER'})
to get the class, but I'm not sure how to combine the two.
But the class tooltip-iws is not unique, and I would get extraneous results if I just used that (there are other spans, before the code snippet, with the same class)
That's why I want to specify my search within the class IDENTIFIER. How can I do that in BeautifulSoup?
Try using a CSS selector:
soup.select_one("li[class='IDENTIFIER'] > p > span")['data-content']
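The general idea is to find the parent first and then search only inside it. In BeautifulSoup that is usually written by calling find on the element returned by the first find. The sketch below shows the same two-step lookup with the standard library's xml.etree.ElementTree so that it runs without extra packages; the surrounding ul and the first li are invented to show the non-unique class.

```python
import xml.etree.ElementTree as ET

# The second li is the snippet from the question; the first one is a
# hypothetical earlier span that reuses the same class.
DOC = """
<ul>
  <li class="OTHER"><p>
    <span class="tooltip-iws" data-content="UNWANTED">earlier span</span></p></li>
  <li class="IDENTIFIER"><h5 class="hidden">IDENTIFIER</h5><p>
    <span class="tooltip-iws" data-toggle="popover" data-content="SOME TEXT">
    other text</span></p></li>
</ul>
"""

root = ET.fromstring(DOC)

# Step 1: locate the parent element that is unique.
identifier_elt = root.find(".//li[@class='IDENTIFIER']")

# Step 2: search only below that parent, so the earlier span is ignored.
span = identifier_elt.find(".//span[@class='tooltip-iws']")
target = span.get("data-content")
print(target)
```

Searching from the document root for the span class alone would return the earlier span first, which is the extraneous result the question describes.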
Try using selectorlib; it should solve your issue. Comment if you need further assistance.
https://selectorlib.com/
I had luck getting a list of telephone numbers using this code:
from lxml import html
import requests
lnk='https://docs.legis.wisconsin.gov/2019/legislators/assembly'
page=requests.get(lnk)
tree=html.fromstring(page.content)
ph_nums=tree.xpath('//span[@class="info telephone"]/text()')
print(ph_nums)
which is scraping info from an HTML element that looks like this:
<span class="info telephone">
<span class="title"><strong>Telephone</strong>:<br></span>
(608) 266-8580<br>(888) 534-0097
</span>
However, I can't do the same for this element when I change info telephone to info...
<span class="info" style="width:16em;">
<span>
<a id="A">
<strong></strong></a><strong>Jenkins, Leroy t</strong> <small>(R - Madison)</small>
</span>
<br>
<span style="width:8em;"><small>District 69</small></span>
<br>
<span style="width:8em;">Details</span>
<br>
<span style="width:8em;">
Website
</span>
<br>
<br>
</span>
since there are multiple titles in this element, whereas "info telephone" only had one. How would I return separate lists, each with a different piece of info (i.e. a list of names and a list of districts, in this scenario)?
FYI - I am not educated in HTML (and hardly experienced in Python) so I would appreciate a simplified explanation.
For this task I would recommend the BeautifulSoup Package for Python.
You don't have to deeply understand HTML to use it (I don't!), and it offers a very friendly approach to find certain items from a web page.
Your first example could be rewritten as follows:
from bs4 import BeautifulSoup
# soup holds the parsed HTML document
soup = BeautifulSoup(page.content, 'lxml')
# the find_all method finds all nodes in page.content whose tag is 'span'
# and whose class is 'info telephone'
info_tels = soup.find_all('span', {"class": "info telephone"})
The info_tels element contains all instances of <span class="info telephone"> on your document. We can then parse it to find what's relevant:
list_tels = []
for tel in info_tels:
    tel_text = tel.text  # extracts the text from the info telephone node
    tel_text = tel_text.replace("\nTelephone:\n", "").replace('\n', "")  # removes the "Telephone:" part and line breaks
    tel_text = tel_text.strip()  # removes leading and trailing whitespace
    list_tels.append(tel_text)
You can do something similar for the 'info' class:
info_class = soup.find_all('span', {"class": "info"})
And then find the elements you want to put into lists:
info_class[0].find_all('a')[1].text #returns you the first name
The challenge here is to identify which tags/classes these names, districts, etc. have. In your first example it is relatively clear: ('span', {"class": "info telephone"}). The "info" class, however, has various data points inside it with no specific, identifiable type.
For instance, the <span style="width:8em;"> tag appears multiple times in your file, each time with a different data point (District, Details, etc.).
I came up with a small solution for the District problem - you might get inspired to tackle the other information too!!
list_districts = []
for info in info_class:
    try:
        district_contenders = info.find_all('span', {'style': "width:8em;"})
        for element in district_contenders:
            if 'District' in element.text:
                list_districts.append(element.text)
    except:
        pass
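The same "look at each piece and keep the ones that fit" idea can fill both lists in one pass. The sketch below is a standard-library alternative (html.parser, not BeautifulSoup or lxml) and rests on assumptions taken only from the snippets in the question: names are the non-empty text of strong tags, districts are small tags whose text starts with "District", and only spans whose class is exactly "info" are considered. The class name InfoCollector and the trimmed SAMPLE are illustrative.

```python
from html.parser import HTMLParser

SAMPLE = """
<span class="info telephone">
<span class="title"><strong>Telephone</strong>:<br></span>
(608) 266-8580<br>(888) 534-0097
</span>
<span class="info" style="width:16em;">
<span>
<a id="A">
<strong></strong></a><strong>Jenkins, Leroy t</strong> <small>(R - Madison)</small>
</span>
<br>
<span style="width:8em;"><small>District 69</small></span>
<br>
</span>
"""


class InfoCollector(HTMLParser):
    """Collects names and districts from spans whose class is exactly "info"."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.districts = []
        self._span_depth = 0   # > 0 while inside a class="info" span
        self._current = None   # "strong" or "small" while inside one

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            if self._span_depth:
                self._span_depth += 1          # nested span inside "info"
            elif dict(attrs).get("class") == "info":
                self._span_depth = 1           # entering an "info" span
        elif self._span_depth and tag in ("strong", "small"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == "span" and self._span_depth:
            self._span_depth -= 1
        elif tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "strong":
            self.names.append(text)
        elif text.startswith("District"):
            self.districts.append(text)


collector = InfoCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.names)
print(collector.districts)
```

The "Telephone" heading is skipped because its outer span has the class "info telephone" rather than "info"; on the real page the assumptions about strong and small would need checking against the full markup.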
I am scraping items from a webpage (there are multiple of these):
<a class="iusc" style="height:160px;width:233px" m="{"cid":"T0QMbGSZ","purl":"http://www.tti.library.tcu.edu.tw/DERMATOLOGY/mm/mmsa04.htm","murl":"http://www.tti.lcu.edu.tw/mm/img0035.jpg","turl":"https://tse2.mm.bing.net/th?id=OIP.T0QMbGSZbOpkyXU4ms5SFwEsDI&pid=15.1","md5":"4f440c6c64996cea64c975389ace5217"}" mad="{"turl":"https://tse3.mm.bing.net/th?id=OIP.T0QMbGSZbOpkyXU4ms5EsDI&w=300&h=200&pid=1.1","maw":"300","mah":"200","mid":"C303D7F4BB661CA67E2CED4DB11E9154A0DD330B"}" href="/images/search?view=detailV2&ccid=T0QMbGSZ&id=C303D7F4BB661E2CED4DB11E9154A0DD330B&thid=OIP.T0QMbGSZbOpkyXU4ms5SFwEsDI&q=searchtearm;amp;simid=6080204499593&selectedIndex=162" h="ID=images.5978_5,5125.1" data-focevt="1"><div class="img_cont hoff"><img class="mimg" style="color: rgb(169, 88, 34);" height="160" width="233" src="https://tse3.mm.bing.net/th?id=OIP.T0QMbGSZ4ms5SFwEsDI&w=233&h=160&c=7&qlt=90&o=4&dpr=2&pid=1.7" alt="Image result fsdata-bm="169" /></div></a>
What I want to do is download the image and information associated with it in the m attribute.
To accomplish that, I tried something like this to get the attributes:
links = soup.find_all("a", class_="iusc")
And then, to get the m attribute, I tried something like this:
for a in soup.find_all("m"):
    test = a.text.replace("&quot;", '"')
    metadata = json.loads(test)["murl"]
    print(str(metadata))
However, that doesn't quite work as expected, and nothing is printed out (with no errors either).
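You are not iterating through the links list. soup.find_all("m") looks for <m> tags, and there are none, so the loop body never runs; m is an attribute of the a tags. Try this:
links = soup.find_all("a", class_="iusc")
for link in links:
    print(link.get('m'))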
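Once link.get('m') returns the attribute text, the murl value still has to be pulled out of the JSON. A minimal sketch, assuming the attribute holds a JSON object like the one in the question; parse_m_attribute is a hypothetical helper, the example.com values are made up, and html.unescape is only a fallback for the case where the text still contains &quot; entities.

```python
import html
import json


def parse_m_attribute(m_value):
    """Turn the text of an m attribute into a dict, or None if it is unusable."""
    if not m_value:
        return None
    # Try the text as-is first and decode HTML entities only if that fails,
    # because unescaping an already-decoded URL could alter it.
    for candidate in (m_value, html.unescape(m_value)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


# Made-up attribute values in the two forms they might arrive in.
decoded = '{"cid": "T0QMbGSZ", "murl": "http://example.com/img0035.jpg"}'
raw = ('{&quot;cid&quot;: &quot;T0QMbGSZ&quot;, '
       '&quot;murl&quot;: &quot;http://example.com/img0035.jpg&quot;}')

for value in (decoded, raw, None):
    metadata = parse_m_attribute(value)
    print(metadata["murl"] if metadata else "no metadata")

# Assumed usage with the loop above:
# for link in links:
#     metadata = parse_m_attribute(link.get("m"))
```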