How to locate td class in html code using python? - python

I have a class in my html code. I need to locate td class "Currentlocation" using python.
CODE :
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
Below are the codes I tried.
First attempt:
My_result = page_soup.find_element_by_class_name('CURRENTLOCATION')
Getting "TypeError: 'NoneType' object is not callable" error.
Second attempt:
My_result = page_soup.find(‘td’, attrs={‘class’: ‘CURRENTLOCATION’})
Getting "invalid character in identifier" error.
Can anyone please help me locate a class in html code using python?

from bs4 import BeautifulSoup
sdata = '<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(sdata, 'lxml')
mytds = soup.findAll("td", {"class": "CURRENTLOCATION"})
for td in mytds:
    print(td)

I tried your code. In the second example, the problem is the quotation marks you use. They are typographic (curly) quotes (‘ and ’, Unicode code points U+2018 and U+2019), while the Python interpreter requires straight single (') or double (") quotation marks.
After changing them, I can find the tag:
>>> bs.find('td', attrs={'class': 'CURRENTLOCATION'})
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
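The quote problem can be reproduced without BeautifulSoup. The following is only a sketch for illustration, not part of the original answer: it compiles a line containing curly quotes to trigger the SyntaxError, then repairs the text with str.translate. The names bad_source, QUOTE_FIX and fixed_source are made up for this example.

```python
# A line of code pasted with typographic quotes (U+2018 / U+2019).
bad_source = "tag_name = \u2018td\u2019"

try:
    compile(bad_source, "<pasted>", "exec")
    compiled_ok = True
except SyntaxError:
    # Python rejects the curly quotes as invalid characters.
    compiled_ok = False

# Map each curly quote to its straight equivalent.
QUOTE_FIX = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})
fixed_source = bad_source.translate(QUOTE_FIX)

# The repaired line now compiles and runs.
namespace = {}
exec(compile(fixed_source, "<pasted>", "exec"), namespace)
print(compiled_ok, namespace["tag_name"])
```

Code copied from word processors or web pages often picks up these characters, so retyping the quotes in a plain-text editor is usually the quickest fix.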
About your first example: I do not know where you found a reference to the method find_element_by_class_name (it looks like a Selenium WebDriver method), but it is not implemented by the BeautifulSoup class. The class instead implements __getattr__, a special method that is invoked whenever you try to access a non-existent attribute. Here is an excerpt of the method:
def __getattr__(self, tag):
    #print "Getattr %s.%s" % (self.__class__, tag)
    if len(tag) > 3 and tag.endswith('Tag'):
        #
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
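When you try to access the attribute find_element_by_class_name, you are actually looking for a tag with that name. No such tag exists, so the lookup returns None, and calling None raises the "TypeError: 'NoneType' object is not callable" that you saw.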
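The attribute fallback described above can be checked in a few lines. This is a minimal sketch, assuming a current bs4 install and the built-in html.parser; the variable names first_td, missing and error_message are made up for the illustration.

```python
from bs4 import BeautifulSoup

html = '<td class="CURRENTLOCATION">Metrics</td>'
soup = BeautifulSoup(html, "html.parser")

# Attribute access falls back to find(): soup.td behaves like soup.find("td").
first_td = soup.td

# No <find_element_by_class_name> tag exists, so the lookup gives None.
missing = soup.find_element_by_class_name

try:
    missing("CURRENTLOCATION")  # effectively None("CURRENTLOCATION")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(first_td)
print(missing)
print(error_message)
```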

There is a function in BeautifulSoup for this.
You can pass the tag name and the attributes you are looking for to the find_all function. It returns a list of all the elements that fulfill the criteria.
from bs4 import BeautifulSoup
text = '<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(text, 'lxml')
# Find all the td tags whose class attribute is set to CURRENTLOCATION
output_list = soup.find_all('td', {"class": "CURRENTLOCATION"})
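For comparison, here is a small sketch of three equivalent ways to run the same search and then read the cell text. It assumes the built-in html.parser instead of lxml so that no extra package is needed; the names by_dict, by_keyword, by_css and labels are illustrative only.

```python
from bs4 import BeautifulSoup

# Raw string so the backslashes in the src path are kept literally.
text = r'<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(text, "html.parser")

by_dict = soup.find_all("td", {"class": "CURRENTLOCATION"})
by_keyword = soup.find_all("td", class_="CURRENTLOCATION")
by_css = soup.select("td.CURRENTLOCATION")

# strip=True removes the leading space before "Metrics".
labels = [td.get_text(strip=True) for td in by_dict]
print(labels)
```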

Related

How do I include a "-" in my code when it is part of a HTML class attribute?

I am brand new to Python/Web scraping
I am currently trying to web-scrape data from a website that has an attribute named "data-row". However, whenever I attempt to use this attribute as a keyword argument, the editor splits data and row apart at the "-" and reports a problem ("Expected parameter name", from Pylance). Is there any way to include this "-" in the code?
Example of code that works
exampleVariable = exampleDocument.find("tr", id="0")
Example of code I want to fix
exampleVariable = exampleDocument.find("tr", data-row="0")
Use the attrs= parameter of the .find function:
from bs4 import BeautifulSoup
html_code = """\
<tr data-row="0">Some data</tr>"""
soup = BeautifulSoup(html_code, "html.parser")
tr = soup.find("tr", attrs={"data-row": "0"})
print(tr)
Prints:
<tr data-row="0">Some data</tr>
Or use a CSS selector with the .select_one method:
tr = soup.select_one('tr[data-row="0"]')
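The reason the keyword form fails is that Python reads data-row as the expression "data minus row", which is not a valid parameter name. The sketch below, with illustrative names via_attrs and via_css, checks this with str.isidentifier and confirms that both workarounds find the same row.

```python
from bs4 import BeautifulSoup

html_code = '<table><tr data-row="0"><td>Some data</td></tr></table>'
soup = BeautifulSoup(html_code, "html.parser")

# Keyword arguments must be valid Python identifiers.
print("id".isidentifier())        # True
print("data-row".isidentifier())  # False

# Two spellings that accept any attribute name.
via_attrs = soup.find("tr", attrs={"data-row": "0"})
via_css = soup.select_one('tr[data-row="0"]')
print(via_attrs.get_text(), via_attrs == via_css)
```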

Using findAll two times BeautifulSoup

I'm scraping this webpage, which has some tables. I want to build two lists, but the site uses the class 'txt' for two data types. I need to extract those data types separately, so I'm trying to filter the first type, extract it, and then do the same for the other type.
I made this code:
import requests
from bs4 import BeautifulSoup
r = requests.get(url, headers=header)
soup = BeautifulSoup(r.content, 'html.parser')
page = soup.find('div', class_='content')
labels = page.findAll('td', class_='label')
Output:
[<td class="label w15"><span class="help tips" title="Code">?</span><span class="txt">Paper</span></td>,
<td class="label destaque w2"><span class="help tips" title="Last value">?</span><span class="txt">Value</span></td>]
I need what is inside those <span class="txt">Paper</span>
When I try this:
myfilter = labels.findAll('span', class_='txt')
I get this error:
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Why? How can I do this?
As the error message says, you can't use a list of results as a result by itself. You need to loop over them.
myfilter = []
for label in labels:
    myfilter.extend(label.findAll('span', class_='txt'))
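As a self-contained variation on the loop above, the sketch below rebuilds a small page from the output shown in the question (the surrounding div and table markup is an assumption) and collects the span text two ways: with a nested comprehension and with a single CSS selector.

```python
from bs4 import BeautifulSoup

html = """
<div class="content"><table><tr>
<td class="label w15"><span class="help tips" title="Code">?</span><span class="txt">Paper</span></td>
<td class="label destaque w2"><span class="help tips" title="Last value">?</span><span class="txt">Value</span></td>
</tr></table></div>
"""
soup = BeautifulSoup(html, "html.parser")
page = soup.find("div", class_="content")
labels = page.find_all("td", class_="label")

# Call find_all on each Tag in the ResultSet, not on the ResultSet itself.
spans = [span for label in labels for span in label.find_all("span", class_="txt")]
texts = [span.get_text() for span in spans]

# One CSS selector expresses the same nested search.
css_texts = [span.get_text() for span in page.select("td.label span.txt")]
print(texts, css_texts)
```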

How to get the desired value in BeautifulSoup?

Suppose we have the html code as follows:
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')
I want to get the name xyz. Then, I write
soup.find('div',{'class':'name'})
However, it returns abc.
How to solve this problem?
Beautiful Soup returns the first element that matches. Searching for the class name matches any div whose class list contains name, and the first div has both the class dt and the class name, so that is the div find returns.
Searching for div alone still leaves two candidates. soup('div') returns a list, so you can pick the second div with print(soup('div')[1].text). If you want to print all the divs, use this code:
for i in range(len(soup('div'))):
    print(soup('div')[i].text)
And as pointed out in Ankur Sinha's answer, if you want to select all the divs that have only class name, then you have to use select, like this:
soup.select('div[class=name]')[0].get_text()
But if there are multiple divs that satisfy this property, use this:
for i in range(len(soup.select('div[class=name]'))):
    print(soup.select('div[class=name]')[i].get_text())
To expand on Ankur Sinha's answer: select, and also a bare soup() call, return a list, because several items can match. That is why I used len() to get the length of the list, ran a for loop over it, and indexed the select result at i, starting from 0.
Indexing gives a single div rather than the whole list. Calling get_text() on the list itself would raise an error, because a list is not a tag.
This blog was helpful in doing what you would like, and that is to explicitly find a tag with specific class attribute:
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'html.parser')
soup.find(lambda tag: tag.name == 'div' and tag.get('class') == ['name'])
Output:
<div class="name">xyz</div>
You can do it without lambda also using select to find exact class name like this:
soup.select("div[class = name]")
Will give:
[<div class="name">xyz</div>]
And if you want the value between tags:
soup.select("div[class=name]")[0].get_text()
Will give:
xyz
In case you have multiple div with class = 'name', then you can do:
for i in range(len(soup.select("div[class=name]"))):
    print(soup.select("div[class=name]")[i].get_text())
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
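To see the difference between the two kinds of matching side by side, here is a short sketch using the same HTML. It also iterates over the results directly rather than by index. The names loose and exact are chosen for the illustration.

```python
from bs4 import BeautifulSoup

html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, "html.parser")

# class_ matches any div whose class list contains "name".
loose = [div.get_text() for div in soup.find_all("div", class_="name")]

# The attribute selector compares the whole class string instead.
exact = [div.get_text() for div in soup.select('div[class="name"]')]

print(loose)
print(exact)
```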
This might work for you. Note that it depends on the target div being the second div in the HTML.
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, features='lxml')
print(soup('div')[1].text)

"has:()" and "not:()" pseudo-classes behaves otherwise when used within BeautifulSoup

I'm trying to figure out how CSS pseudo-classes like :not() and :has() work in the following cases.
The following selector is not supposed to print 27A-TAX DISTRICT 27A but it does print it:
from bs4 import BeautifulSoup
htmlelement = """
<tbody>
<tr style="">
<td><a>27A-TAX DISTRICT</a> 27A</td>
</tr>
<tr style="">
<td><strong>Parcel Number</strong> 720</td>
</tr>
</tbody>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.select_one("tr:not(a)").text
print(item)
On the other hand, the following selector is supposed to print "I should be printed", but it raises an AttributeError.
from bs4 import BeautifulSoup
htmlelement = """
<p class="vital">I should be printed</p>
<p>I should not be printed</p>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.select_one("p:has(.vital)").text
print(item)
Where am I going wrong, and how can I make them work?
Unfortunately, your understanding of what :not() and :has() do is most likely not correct.
In your first example, you use:
soup.select_one("tr:not(a)").text
The way you are using it will select every tr. This is because it is saying "I want a tr tag that is not an a tag". tr tags are never a tags, so your code always grabs the text of the first tr tag, including the one that contains 27A-TAX DISTRICT.
If you want tr tags that don't have a tags, then you could use:
soup.select_one("tr:not(:has(a))").text
What this says is "I want a tr tag that does not have a descendant a tag".
For more info read:
https://developer.mozilla.org/en-US/docs/Web/CSS/:not
This leads us to your second issue. :has() is a relational selector. In your second example, you used:
soup.select_one("p:has(.vital)").text
:has() looks ahead at children, descendants, or siblings (depending on the syntax you use) to determine whether the tag is the one you want.
So what you were saying was "I want a p tag that has a descendant tag with the class vital". None of your p tags have any descendant tags, so none can have a descendant with the vital class. What you want is actually simpler:
soup.select_one("p.vital").text
What this says is "I want a p tag that also has a class vital."
For more info read:
https://developer.mozilla.org/en-US/docs/Web/CSS/:has
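The four selectors discussed in this answer can be compared in one run. This is a sketch using compact versions of the question's HTML (the table wrapper and the html.parser choice are assumptions made to keep it self-contained); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

rows = BeautifulSoup(
    "<table><tr><td><a>27A-TAX DISTRICT</a> 27A</td></tr>"
    "<tr><td><strong>Parcel Number</strong> 720</td></tr></table>",
    "html.parser",
)
# tr:not(a) only excludes elements that are themselves <a>, so the first row matches.
first_match = rows.select_one("tr:not(a)").get_text()
# tr:not(:has(a)) skips rows that contain an <a> descendant.
without_link = rows.select_one("tr:not(:has(a))").get_text()

paragraphs = BeautifulSoup(
    '<p class="vital">I should be printed</p><p>I should not be printed</p>',
    "html.parser",
)
# No <p> contains a .vital descendant, so select_one returns None.
has_result = paragraphs.select_one("p:has(.vital)")
vital_text = paragraphs.select_one("p.vital").get_text()
print(first_match, without_link, has_result, vital_text, sep="\n")
```

Because select_one returns None when nothing matches, reading .text on the result of "p:has(.vital)" is what produced the AttributeError in the question.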

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Do you know why the first example in the BeautifulSoup tutorial http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, space characters in the HTML cause the problem. I tried with the sources of a few pages: one worked and the others gave the same error (I removed the spaces). Can you explain what "name" refers to and why this error happens? Thanks.
Just ignore NavigableString objects while iterating through the tree:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for body_child in soup.body.children:
    # Skip text nodes such as the whitespace between tags
    if isinstance(body_child, NavigableString):
        continue
    if isinstance(body_child, Tag):
        print(body_child.name)
name refers to the name of the tag if the object is a Tag object (e.g. for <html>, name is "html").
If you have whitespace in your markup between nodes, BeautifulSoup turns it into NavigableString objects. So if you use an index into contents to grab nodes, you might grab a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for: Searching the Parse Tree
Or, if you know the name of the next tag you want, you can use that name as a property. It returns the first Tag with that name, or None if no children with that name exist: Using Tag Names as Members
If you want to use contents, you have to check the type of the objects you are working with. The error you are getting just means that the code reads the name property of a NavigableString because it assumes every object is a Tag.
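A tiny document makes the whitespace nodes visible. The sketch below is an illustration under the assumption that bs4 with html.parser is used; it counts the NavigableString children, filters them out with isinstance, and shows find_all(True, recursive=False) as another way to get only the direct child tags.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = "<body>\n<p>one</p>\n<div>two</div>\n</body>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.body.children)
# The newlines between the tags show up as NavigableString children.
string_count = sum(isinstance(child, NavigableString) for child in children)

# Keep only Tag objects before reading .name.
tag_names = [child.name for child in children if isinstance(child, Tag)]

# find_all(True, recursive=False) returns just the direct child tags.
direct_tags = [tag.name for tag in soup.body.find_all(True, recursive=False)]

print(string_count, tag_names, direct_tags)
```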
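You can use try/except to skip the cases where a NavigableString is encountered in the loop. NavigableString is not an exception class, so catch the AttributeError raised by the failed attribute access, like this:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        pass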
This is the latest working code to obtain the names of the tags in the soup.
import requests
from bs4 import BeautifulSoup, Tag
res = requests.get(url).content
soup = BeautifulSoup(res, 'lxml')
for child in soup.body.children:
    # Only Tag objects have a tag name
    if isinstance(child, Tag):
        print(child.name)
