Python Beautiful Soup: find only exact class matches

Hello, so I have this little script:
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
parse = BeautifulSoup(html, 'html.parser')
selector = Selector(text=html)
divs = selector.css('.panel .panel-heading a::attr(href)').getall()
and it works fine, but if a div has
<div class="panel grey">
I don't want this to match; I want only exact matches, where the div has a single class.
match only this
<div class="panel">
I tried using the decompose() function, but it didn't work in my case. What is the best solution? My script is done; this is the only issue.
So in short: find the children of a div only if the div has exactly one class.

To strictly match a div whose class attribute equals exactly panel, rather than any element whose class list merely contains panel, you can write that explicitly.
Instead of
divs = selector.css('.panel .panel-heading a::attr(href)').getall()
try using
divs = selector.css('div[class="panel"] .panel-heading a::attr(href)').getall()
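To see the difference between the two selectors, here is a small sketch using BeautifulSoup's select (which supports the same attribute selector) on a made-up two-panel snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one exact "panel" div and one "panel grey" div
html = '''
<div class="panel"><div class="panel-heading"><a href="/keep">a</a></div></div>
<div class="panel grey"><div class="panel-heading"><a href="/skip">b</a></div></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# .panel matches any element whose class list contains "panel"
loose = [a['href'] for a in soup.select('.panel .panel-heading a')]

# [class="panel"] compares the whole attribute string, so "panel grey" is excluded
exact = [a['href'] for a in soup.select('div[class="panel"] .panel-heading a')]

print(loose)  # ['/keep', '/skip']
print(exact)  # ['/keep']
```

Note that the attribute selector is a plain string comparison, so it would also miss `class="panel "` with trailing whitespace, or the classes in a different order.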

Related

How to get links within a subelement in selenium?

I have the following html code:
<div id="category"> <!-- parent div -->
<div class="username"> <!-- n-number of elements of class username which all exist within parent div -->
<a rel="" href="link" title="smth">click</a>
</div>
</div>
I want to get all the links within the class username, BUT only those within the parent div where id="category". When I execute the code below it doesn't work: I can only access the title attribute by default, but can't extract the link. Does anyone have a solution?
a = driver.find_element_by_id('category').find_elements_by_class_name("username")
links = [x.get_attribute("href") for x in a]
Use the following CSS selector, which will return all the anchor tags:
links = [x.get_attribute("href") for x in driver.find_elements(By.CSS_SELECTOR, "#category > .username > a")]
Or:
links = [x.get_attribute("href") for x in driver.find_elements_by_css_selector("#category > .username > a")]
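The child-combinator selector itself can be sanity-checked outside Selenium, e.g. with BeautifulSoup on a small mock of the page (the structure below is assumed from the question):

```python
from bs4 import BeautifulSoup

# Mock of the structure described in the question
html = '''
<div id="category">
  <div class="username"><a rel="" href="link1" title="smth">click</a></div>
  <div class="username"><a rel="" href="link2" title="smth">click</a></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# "#category > .username > a": only anchors that are direct children of a
# .username div that is itself a direct child of #category
links = [a['href'] for a in soup.select('#category > .username > a')]
print(links)  # ['link1', 'link2']
```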

How to get the desired value in BeautifulSoup?

Suppose we have the html code as follows:
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')
I want to get the name xyz. Then, I write
soup.find('div',{'class':'name'})
However, it returns abc.
How to solve this problem?
The thing is that Beautiful Soup returns the first element that is a div and has the class name; the first div has both class name and class dt, so it selects that div.
So restricting to div helps, but it still narrows down to 2 divs. Also, find_all returns a list, so to get the second div use print(soup('div')[1].text). If you want to print all the divs, use this code:
for i in range(len(soup('div'))):
    print(soup('div')[i].text)
And as pointed out in Ankur Sinha's answer, if you want to select all the divs whose only class is name, then you have to use select, like this:
soup.select('div[class=name]')[0].get_text()
But if there are multiple divs that satisfy this property, use this:
for i in range(len(soup.select('div[class=name]'))):
    print(soup.select('div[class=name]')[i].get_text())
Just to continue Ankur Sinha's answer: when you use select (or even just soup()), it returns a list, because there can be multiple matches; that's why I used len() to get the length of the list, then ran a for loop over it, printing the result of select at each index starting from 0.
Indexing like that gives a specific div rather than a list; calling get_text() on the list itself would produce an error, because a list is NOT a tag.
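Putting the pieces above together on the question's HTML, both lookups behave as described:

```python
from bs4 import BeautifulSoup

html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the *first* div whose class list contains "name",
# which here is the "dt name" div
first = soup.find('div', {'class': 'name'}).text

# soup('div') is shorthand for soup.find_all('div'); index 1 is the second div
second = soup('div')[1].text

print(first)   # abc
print(second)  # xyz
```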
This blog was helpful in doing what you would like, which is to explicitly find a tag with a specific class attribute:
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'html.parser')
soup.find(lambda tag: tag.name == 'div' and tag.get('class') == ['name'])
Output:
<div class="name">xyz</div>
You can also do it without a lambda, using select to find the exact class name, like this:
soup.select("div[class=name]")
Will give:
[<div class="name">xyz</div>]
And if you want the value between tags:
soup.select("div[class=name]")[0].get_text()
Will give:
xyz
In case you have multiple divs with class = 'name', then you can do:
for i in range(len(soup.select("div[class=name]"))):
    print(soup.select("div[class=name]")[i].get_text())
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
This might work for you; note that it is contingent on the target div being the second div in the html.
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, features='lxml')
print(soup('div')[1].text)

How to get a div with a particular class name

I wrote the following script to try to extract the content from the div tag with the class makers:
phone_category_data = requests.get(phone_category_url)
base_category_soup = soup(phone_category_data.content, "html.parser")
div_list = base_category_soup.find_all("div")
for div in div_list:
    if div.get("class") and div["class"][0] == "makers":
        print(div.text)
A common way to check for class names when locating elements would be to use:
base_category_soup.find_all("div", class_="makers")
Or, using a CSS selector:
base_category_soup.select("div.makers")
Note that since class is a multi-valued attribute and BeautifulSoup has a special handling for it, both of the approaches would check for any of the class values to be makers, e.g. all of the following would match:
<div class="makers"></div>
<div class="test makers"></div>
<div class="makers test"></div>
<div class="test1 makers test2"></div>
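A small sketch of that multi-valued behaviour, and how an exact attribute selector differs:

```python
from bs4 import BeautifulSoup

html = ('<div class="makers">a</div>'
        '<div class="test makers">b</div>'
        '<div class="makers test">c</div>'
        '<div class="test1 makers test2">d</div>'
        '<div class="other">e</div>')
soup = BeautifulSoup(html, 'html.parser')

# class_="makers" matches whenever "makers" is one of the class values
any_match = [d.text for d in soup.find_all('div', class_='makers')]

# [class="makers"] compares the full attribute string, so only the first div matches
exact_match = [d.text for d in soup.select('div[class="makers"]')]

print(any_match)    # ['a', 'b', 'c', 'd']
print(exact_match)  # ['a']
```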

find the CSS path (ancestor tags) in HTML using python

I want to get all the ancestor div tags where I match a text. So for example, if the html looks like HTML snippet
and I'm searching for "Earl E. Byrd", I want to get a list which contains {"buyer-info", "buyer-name"}.
This is what i did
r=requests.get(self.url,verify='/path/to/certfile')
soup = BeautifulSoup(r.text,"lxml")
divTags = soup.find_all('div')
How should I proceed ?
If you want to search for the div by text and get all the previous divs that have title attributes, first find the div using the text, then use find_all_previous, setting title=True:
soup = BeautifulSoup(r.text,"lxml")
div = soup.find('div', text="Earl E. Byrd")
print([div["title"]] + [d["title"] for d in div.find_all_previous("div", title=True)])
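Here is a runnable sketch of that approach. The question's HTML is not shown, so the nesting below is assumed purely for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the names mentioned in the question
html = '''
<div title="buyer-info">
  <div title="buyer-name">
    <div>Earl E. Byrd</div>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', string="Earl E. Byrd")

# find_all_previous walks backwards through the document, nearest element first
titles = [d["title"] for d in div.find_all_previous("div", title=True)]
print(titles)  # ['buyer-name', 'buyer-info']
```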
A solution using an XPath expression:
//div[@title="buyer-info"]/div[text() = "Carlson Busses"]/ancestor::div

Python 3, beautiful soup, get next tag

I have the following html part, which repeats itself several times with other href links:
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item".
Pretty new to beautifulsoup and nothing that I came up with worked.
Thanks for your ideas.
EDIT: It does not really have to be beautifulsoup; if it can be done with regex and the python html parser, that is also ok.
EDIT2: What I tried (I'm pretty new to python, so what I did might be totally stupid from an advanced viewpoint):
soup = bs4.BeautifulSoup(htmlsource, "html.parser")
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))
This will give me a list of all the "product-list-item" but then I tried something like
print(x[i].get("class").next_element)
Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:
print(x[i][0].get("class").next_element)
Which led to this error: return self.attrs[key] KeyError: 0.
Also tried with .find_all("href") and .get("href") but this all leads to the same errors.
EDIT3: Ok, it seems I found out how to solve it. Now I did:
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))
This can also be shortened by using another attribute to the find_all function:
x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))
greetings
I want to get all the href links in this document that are directly after the div tag with the class "product-list-item"
To find the first <a href> element in the <div>:
links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True)  # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])
It assumes that the link is inside <div>. Any elements in <div> before the first <a href> are ignored.
If you'd like, you can be stricter about it, e.g., taking the link only if it is the first child of the <div>:
a = div.contents[0]  # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
    links.append(a['href'])
Or if <a> is not inside <div>:
a = div.find_next('a', href=True)  # find <a> that appears after <div>
if a is not None:
    links.append(a['href'])
There are many ways to search and navigate in BeautifulSoup.
If you parse with lxml.html, you can also use XPath and CSS expressions, if you are familiar with them.
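As a sketch of the lxml.html route, here is the same extraction done with an XPath expression on the question's markup (the class check mimics CSS class matching by testing whitespace-delimited values):

```python
from lxml import html as lhtml

doc = lhtml.fromstring(
    '<div class="product-list-item margin-bottom">'
    '<a title="titleexample" href="http://www.urlexample.com/example_1">x</a>'
    '</div>')

# XPath: divs whose class *list* contains "product-list-item", then their <a> hrefs
hrefs = doc.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "),'
    ' " product-list-item ")]/a/@href')
print(hrefs)  # ['http://www.urlexample.com/example_1']
```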
