"has:()" and "not:()" pseudo-classes behaves otherwise when used within BeautifulSoup - python

I'm trying to figure out how CSS pseudo-classes like :not() and :has() work in the following cases.
The following selector is not supposed to print 27A-TAX DISTRICT 27A but it does print it:
from bs4 import BeautifulSoup
htmlelement = """
<tbody>
<tr style="">
<td><a>27A-TAX DISTRICT</a> 27A</td>
</tr>
<tr style="">
<td><strong>Parcel Number</strong> 720</td>
</tr>
</tbody>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.select_one("tr:not(a)").text
print(item)
On the other hand, the following selector is supposed to print I should be printed, but it throws an AttributeError instead.
from bs4 import BeautifulSoup
htmlelement = """
<p class="vital">I should be printed</p>
<p>I should not be printed</p>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.select_one("p:has(.vital)").text
print(item)
Where am I going wrong, and how can I make them work?

Unfortunately, your understanding of what :not() and :has() do is most likely not correct.
In your first example, you use:
soup.select_one("tr:not(a)").text
The way you are using it will match every tr. This is because it is saying "I want a tr tag that is not an a tag". tr tags are never a tags, so select_one simply returns the first tr, which is the one that contains 27A-TAX DISTRICT.
If you want tr tags that don't have a tags, then you could use:
soup.select_one("tr:not(:has(a))").text
What this says is "I want a tr tag that does not have a descendant a tag".
For more info read:
https://developer.mozilla.org/en-US/docs/Web/CSS/:not
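To see the difference between the two selectors, here is a minimal sketch based on the rows from the question. It assumes html.parser instead of lxml and wraps the rows in a table element; both are illustrative choices, not requirements.

```python
from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr><td><a>27A-TAX DISTRICT</a> 27A</td></tr>
<tr><td><strong>Parcel Number</strong> 720</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "tr:not(a)": every tr qualifies (a tr is never an a),
# so select_one returns the first row.
first = soup.select_one("tr:not(a)").get_text(" ", strip=True)

# "tr:not(:has(a))": only rows without a descendant <a> qualify.
no_link = soup.select_one("tr:not(:has(a))").get_text(" ", strip=True)

print(first)
print(no_link)
```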
This leads us to your second issue. :has() is a relational selector. In your second example, you used:
soup.select_one("p:has(.vital)").text
:has() looks ahead at either children, descendants, or siblings (depending on the syntax you use) to determine if the tag is the one you want.
So what you were saying was "I want a p tag that has a descendant tag with the class vital". None of your p tags have any descendant tags, so nothing matches, select_one returns None, and accessing .text on None raises the AttributeError. What you want is actually simpler:
soup.select_one("p.vital").text
What this says is "I want a p tag that also has a class vital."
For more info read:
https://developer.mozilla.org/en-US/docs/Web/CSS/:has
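A short sketch of the same point, assuming html.parser. The extra div with a nested span is a hypothetical addition, included only to show a case where :has() does match.

```python
from bs4 import BeautifulSoup

html = """
<p class="vital">I should be printed</p>
<p>I should not be printed</p>
<div><span class="vital">nested</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# No p contains a descendant with class "vital", so nothing matches.
no_match = soup.select_one("p:has(.vital)")

# A p that itself carries the class: this is the selector the question needs.
direct = soup.select_one("p.vital")

# :has() matches the div because the .vital element is inside it.
parent = soup.select_one("div:has(.vital)")

print(no_match)
print(direct.text)
print(parent.name)
```

Checking the result of select_one for None before reading .text avoids the AttributeError when nothing matches.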

Related

How to select HTML element by id that is a number in Beautiful Soup

I am trying to select an HTML element with an id that is a number, i.e. <div id=27047243>
When I try to use the select method like this, soup.select("#27047243"), I get an error that says Malformed id selector
I figured I need to escape the number somehow, so I tried soup.select(r"#\3{number}"). I no longer got the error, but I still could not get the element
I know I could use the find method, soup.find(id="27047243"), and that works. The problem is that I need to go deeper into nested elements, so I want to know if there is a way to do this using select so I can use CSS selectors
You can use div[id="27047243"]:
from bs4 import BeautifulSoup
html_doc = """
<div id=27047243>
I want this.
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select_one('div[id="27047243"]'))
Prints:
<div id="27047243">
I want this.
</div>
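Note that soup.select_one('div#27047243') fails with the same error, because a CSS identifier cannot start with an unescaped digit. If you want to keep the # syntax, escape the leading digit as a hex code point followed by a space (\32 is the digit 2):
soup.select_one(r'#\32 7047243')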
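Since the goal is to reach nested elements, the attribute selector can be combined with ordinary descendant selectors. A minimal sketch, where the list markup and the second div are hypothetical and only there to show that other elements are not matched:

```python
from bs4 import BeautifulSoup

html = """
<div id=27047243>
  <ul><li class="item">first</li><li class="item">second</li></ul>
</div>
<div id=999><ul><li class="item">other</li></ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

# [id="..."] compares the attribute value as a string, so a numeric id
# needs no escaping; the rest of the selector works as usual.
items = [li.get_text() for li in soup.select('div[id="27047243"] li.item')]
print(items)
```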

How to locate td class in html code using python?

I have a class in my html code. I need to locate td class "Currentlocation" using python.
CODE :
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
Below are the codes I tried.
First attempt:
My_result = page_soup.find_element_by_class_name('CURRENTLOCATION')
Getting "TypeError: 'NoneType' object is not callable" error.
Second attempt:
My_result = page_soup.find(‘td’, attrs={‘class’: ‘CURRENTLOCATION’})
Getting "invalid character in identifier" error.
Can anyone please help me locate a class in html code using python?
from bs4 import BeautifulSoup
# Raw string so the backslashes in the Windows-style path are kept literally
sdata = r'<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(sdata, 'lxml')
mytds = soup.find_all("td", {"class": "CURRENTLOCATION"})
for td in mytds:
    print(td)
I tried your second example, and the problem is the quotation marks you use. They are typographic (curly) quotes (‘ and ’, Unicode code points U+2018 and U+2019), while the Python interpreter requires straight single (') or double (") quotation marks.
After changing them, I can find the tag:
>>> bs.find('td', attrs={'class': 'CURRENTLOCATION'})
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
As for your first example: I do not know where you found a reference to the method find_element_by_class_name (it looks like a Selenium WebDriver method), but it is not implemented by the BeautifulSoup class. The class instead implements __getattr__, a special method that is invoked any time you try to access a non-existent attribute. Here is an excerpt of the method:
def __getattr__(self, tag):
    #print "Getattr %s.%s" % (self.__class__, tag)
    if len(tag) > 3 and tag.endswith('Tag'):
        ...  # body omitted from this excerpt
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
When you try to access the attribute find_element_by_class_name, you are actually searching for a tag with that name. No such tag exists, so find returns None, and calling None is what raises the TypeError.
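The behaviour described above can be reproduced in a few lines. This is a sketch with a shortened version of the question's markup and html.parser assumed as the parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<td class="CURRENTLOCATION"> Metrics</td>', "html.parser")

# Unknown attribute names fall through to __getattr__, which runs
# soup.find("find_element_by_class_name") and finds no such tag.
lookup = soup.find_element_by_class_name
print(lookup)

# Calling the result therefore means calling None.
try:
    lookup("CURRENTLOCATION")
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```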
There is a function in BeautifulSoup for this.
You can get all the desired tags by specifying the attributes you are looking for in the find_all function. It returns a list of all the elements that fulfill the criteria.
from bs4 import BeautifulSoup
# Raw string so the backslashes in the Windows-style path are kept literally
text = r'<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(text, 'lxml')
# Find all the td tags whose class attribute is set to CURRENTLOCATION
output_list = soup.find_all('td', {"class": "CURRENTLOCATION"})

Python - Extracting data from this Html tag using BS4, instead of getting None

This is my code:
html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('td').string)
It returns None. I think it has to do with that span tag which is empty. I think it goes into that span tag, and returns those contents? So I either want to delete that span tag, or stop as soon as it finds the 'Data I want to extract', or tell it to ignore empty tags
If there are no empty tags inside 'td' it actually works.
Is there a way to ignore empty tags in general and go one step back? Instead of ignoring this specific span tag?
Sorry if this is too elementary, but I spent a fair amount of time searching.
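Use the .text property, not .string. .string is None whenever a tag has more than one child (here, a text node plus the empty span), whereas .text concatenates all the text beneath the tag: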
html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('td').text)
Output:
Data I want to extract
Use .text:
>>> soup.find('td').text
u'Data I want to extract'
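The difference between .string and .text can be checked side by side. A minimal sketch, where the ids a and b and the second cell are hypothetical additions for illustration:

```python
from bs4 import BeautifulSoup

html = (
    '<table><tr>'
    '<td id="a">Data I want to extract<span></span></td>'
    '<td id="b">Only text</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
mixed = soup.select_one("#a")   # two children: a text node and an empty span
single = soup.select_one("#b")  # exactly one child: a text node

print(mixed.string)                # None, because the choice is ambiguous
print(mixed.get_text(strip=True))  # all text beneath the tag
print(single.string)               # works when there is a single child
```

get_text(strip=True) also trims surrounding whitespace, which is often convenient with real-world table cells.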

Python and BeautifulSoup4: finding certain text from the tables and parsing the very next table

I'm facing quite a tricky problem while trying to fetch some data with BeautifulSoup.
I'd like to find all the tables that have certain text in them (in my example code 'Name:', 'City:' and 'Address:') and parse the text that is located in the very next table in the source code.
Page source code:
...
...
<td>Name:</td>
<td>John</td>
...
<td>City:</td>
<td>London</td>
...
<td>Address:</td>
<td>Bowling Alley 123</td>
...
...
I'd like to parse: "John", "London", "Bowling Alley 123"
Sorry I don't have any python code here to show my past effort, but it's because I've no idea where to start. Thanks!
This is clunky, but depending on how your TDs are wrapped and how consistent your TD targets are, you should be able to find them, iterate through them, and use find_next_sibling() (findNextSibling() in BeautifulSoup 3) to get your data:
from bs4 import BeautifulSoup
html = """\
<table>
<tr>
<td>Name:</td>
<td>John</td>
</tr>
<tr>
<td>City:</td>
<td>London</td>
</tr>
<tr>
<td>Address:</td>
<td>Bowling Alley 123</td>
</tr>
</table>
"""
targets = ["City:", "Address:", "Name:"]
soup = BeautifulSoup(html, "html.parser")
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        # A matching label cell is followed by the cell holding its value
        if td.text in targets:
            print(td.find_next_sibling("td").text)
Bottom line, as long as you've got some sane/normal elements containing your TD's, using the NextSibling functions should get you where you're going.
Whether this works properly depends on whether the HTML is well formed, but it will likely work even if there are extraneous newlines or other text.
import bs4
def parseCAN(html):
    b = bs4.BeautifulSoup(html, 'html.parser')
    matches = ('City:', 'Address:', 'Name:')
    found = []
    elements = b.find_all('td')
    for n, e in enumerate(elements):
        if e.text not in matches:
            continue
        # The value is the td that follows the label in document order
        if n < len(elements) - 1:
            found.append(elements[n+1].text)
    return found
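If the label and its value really sit in separate tables, as the question's title suggests, the two cells are not siblings. A possible sketch for that case uses find_next(), which walks forward in document order regardless of nesting. The markup below is a guess at such a layout, not the asker's actual page:

```python
from bs4 import BeautifulSoup

# Hypothetical layout: each label and each value lives in its own table.
html = """
<table><tr><td>Name:</td></tr></table>
<table><tr><td>John</td></tr></table>
<table><tr><td>City:</td></tr></table>
<table><tr><td>London</td></tr></table>
"""
labels = {"Name:", "City:"}
soup = BeautifulSoup(html, "html.parser")

values = {}
for td in soup.find_all("td"):
    label = td.get_text(strip=True)
    if label in labels:
        # Next td anywhere later in the document, not just a sibling
        nxt = td.find_next("td")
        if nxt is not None:
            values[label] = nxt.get_text(strip=True)

print(values)
```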

Find 2 attributes in BeautifulSoup

Here is the part of the HTML:
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
So I would like to return all the td tags with the class name "team-name", but only those that contain the text "Today".
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
starting_url = urllib2.urlopen('http://www.mysite.com.au/').read()
soup = BeautifulSoup(''.join(starting_url))
soup2 = soup.findAll("td", {'class':'team-name'})
for entry in soup2:
    if "Today" in soup2:
        print entry
If I run this, nothing is returned.
If I take out that last if statement and just put
print soup2
I get back all the td tags, but some have "Today" and some have "Tomorrow" etc.
So any pointers? Is there a way to add 2 attributes to the soup.findAll function?
I also tried running a findAll on a findAll, but that did not work.
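Your check "Today" in soup2 tests whether the result list itself contains that string, which it never does, so nothing prints. Using the structure of the code you've got currently, try looking for "Today" inside each entry with an embedded findAll:
for entry in soup2:
    if entry.findAll(text=re.compile("Today")):
        print entry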
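For readers on Python 3 and bs4, the same filter can be sketched as follows. The markup is the fragment from the question wrapped in a table row, html.parser is an assumed parser choice, and the text check uses get_text() rather than a regular expression:

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td class="team-name"><div class="goat_australia"></div>Melbourne<br />Today</td>
<td class="team-name"><div class="goat_australia"></div>Sydney<br />Tomorrow</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# First filter by class, then keep only the cells whose text mentions "Today".
today = [
    td for td in soup.find_all("td", class_="team-name")
    if "Today" in td.get_text()
]
names = [td.get_text(" ", strip=True) for td in today]
print(names)
```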
