Finding specific tag using BeautifulSoup - python

Here is the website that I'm parsing: http://uniapple.net/usaddress/address.php?address1=501+10th+ave&address2=&city=nyc&state=ny&zipcode=10036&country=US
I would like to find the word that appears on line 39 of the page source, between the td tags. That line tells me whether the address is residential or commercial, which is what I need for my script.
Here's what I have, but I'm getting this error:
AttributeError: 'NoneType' object has no attribute 'find_next'
The code I'm using is:
from bs4 import BeautifulSoup
import urllib
page = "http://uniapple.net/usaddress/address.php?address1=501+10th+ave&address2=&city=nyc&state=ny&zipcode=10036&country=US"
z = urllib.urlopen(page).read()
thesoup = BeautifulSoup(z, "html.parser")
comres = (thesoup.find("th",text=" Residential or ").find_next("td").text)
print(str(comres))

The text argument does not work in this particular case; this is related to how the .string property of an element is calculated. Instead, I would use a search function so you can call get_text() and check the complete text of an element, including its child nodes:
label = thesoup.find(lambda tag: tag and tag.name == "th" and
                     "Residential" in tag.get_text())
comres = label.find_next("td").get_text()
print(str(comres))
Prints Commercial.
We can go a little bit further and make a reusable function to get a value by label:
soup = BeautifulSoup(z, "html.parser")

def get_value_by_label(soup, label):
    th = soup.find(lambda tag: tag and tag.name == "th" and label in tag.get_text())
    return th.find_next("td").get_text(strip=True)

print(get_value_by_label(soup, "Residential"))
print(get_value_by_label(soup, "City"))
Prints:
Commercial
NYC

All you are missing is a bit of housekeeping:
ths = thesoup.find_all("th")
for th in ths:
    if 'Residential or' in th.text:
        comres = th.find_next("td").text
        print(str(comres))
>> Commercial

You'll need to use a regular expression as your text field, like re.compile('Residential or'), rather than a string.
This was working for me. I had to iterate over the results provided, though if you only expect a single result per page you could swap find_all for find:
import re

for r in thesoup.find_all(text=re.compile('Residential or')):
    print(r.find_next('td').text)

Related

How do I have nested find_all statements in BeautifulSoup (Python)?

I started off by pulling the page with Selenium and I believe I passed the part of the page I needed to BeautifulSoup correctly using this code:
soup = BeautifulSoup(driver.find_element("xpath", '//*[@id="version_table"]/tbody').get_attribute('outerHTML'))
Now I can parse using BeautifulSoup
query = soup.find_all("tr", class_=lambda x: x != "hidden*")
print (query)
My problem is that I need to dig deeper than just this one query. For example, I would like to nest this one inside of the first (so the first needs to be true, and then this one):
query2 = soup.find_all("tr", id = "version_new_*")
print (query2)
Logically speaking, this is what I'm trying to do (but I get SyntaxError: invalid syntax):
query = soup.find_all(("tr", class_=lambda x: x != "hidden*") and ("tr", id = "version_new_*"))
print (query)
How do I accomplish this?
As mentioned, without any example HTML it is hard to give a precise answer. However, you could use a CSS selector:
soup.select('tr[id^="version_new_"]:not(.hidden)')
Example
from bs4 import BeautifulSoup

html = '''
<tr id="version_new_1" class="hidden"></tr>
<tr id="version_new_2"></tr>
<tr id="version_new_3" class="hidden"></tr>
<tr id="version_new_4"></tr>
'''
soup = BeautifulSoup(html, "html.parser")
print(soup.select('tr[id^="version_new_"]:not(.hidden)'))
Output
Will be a ResultSet you could iterate to scrape what you need.
[<tr id="version_new_2"></tr>, <tr id="version_new_4"></tr>]
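For instance, a minimal sketch of iterating that ResultSet from the example above to pull each row's id:
for row in soup.select('tr[id^="version_new_"]:not(.hidden)'):
    print(row.get('id'))  # e.g. version_new_2, version_new_4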
Regarding: query = soup.find_all(...) and print (query)
find_all returns an iterable ResultSet, which you can loop over directly:
for query in soup.find_all(...):
    print(query)
You can use a lambda function (along with a regex), evaluated for every element, to do more advanced conditioning:
import re
query = soup.find_all(
    lambda tag:
        tag.name == 'tr' and
        'id' in tag.attrs and re.search('^version_new_', str(tag.attrs['id'])) and
        not ('class' in tag.attrs and re.search('hidden', str(tag.attrs['class'])))
)
print(list(query))
For every element in the html, we are checking...
If the tag is a table row (tr)
If the tag has an id and if that id matches the pattern
If the tag either has no class or its class does not match the hidden pattern

How do you get a text from a span tag using BeautifulSoup when there's no clear identification?

I am trying to extract the value from this span tag for Year Built using BeautifulSoup and the following code below, but I'm not getting the actual Year. Please help. Thanks :)
results = []
for url in All_product[:2]:
    link = url
    html = getAndParseURL(url)
    YearBuilt = html.findAll("span", {"class": "header font-color-gray-light inline-block"})[4]
    results.append([YearBuilt])
The output shows
[[<span class="header font-color-gray-light inline-block">Year Built</span>],
[<span class="header font-color-gray-light inline-block">Community</span>]]
Try using the .next_sibling:
result = []
year_built = html.find_all(
    "span", {"class": "header font-color-gray-light inline-block"}
)
for elem in year_built:
    if elem.text.strip() == 'Year Built':
        result.append(elem.next_sibling)
I'm not sure how the whole HTML looks, but something along these lines might help.
Note: There is surely a more specific solution to extract all the attributes you may need for your results, but for that you should improve your question and add more details.
Using CSS selectors you can chain / combine your selection to be more strict. In this case you select the <span> that contains your string and use the adjacent sibling combinator to get the next sibling <span>.
YearBuilt = e.text if (e := html.select_one('span.header:-soup-contains("Year Built") + span')) else None
It also avoids AttributeError: 'NoneType' object has no attribute 'text'; if the element is not available, the check confirms it exists before .text is called.
soup = BeautifulSoup(html_doc, "html.parser")

results = []
for url in All_product[:2]:
    link = url
    html = getAndParseURL(url)
    YearBuilt = e.text if (e := html.select_one('span.header:-soup-contains("Year Built") + span')) else None
    results.append([YearBuilt])

In BeautifulSoup, how do I search for an element within another element?

I'm using Django 2, Python 3.7, and BeautifulSoup 4. I have the below code, which is supposed to find an element within an element ...
req = urllib2.Request(fullurl, headers=settings.HDR)
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
bs = BeautifulSoup(html, features="lxml")
pattern = re.compile(r'^submitted ')
posted_elt = bs.find(text=pattern)
author_elt = posted_elt.find("span", class_="author") if posted_elt is not None else None
However the line
author_elt = posted_elt.find("span", class_="author") if posted_elt is not None else None
is throwing the error "TypeError: find() takes no keyword arguments". What's the correct way to search for an element within another element?
When you search for text in BeautifulSoup, you get a bs4.element.NavigableString object that is much like a regular Python str. Luckily, it has that "Navigable" part in it: navigableString.parent references the parent element, which can be used in the next find. You aren't trying to find a <span> child of the text node, because text nodes don't have child elements; you are trying to find the element that contains this text node, and you continue the search from there.
req = urllib2.Request(fullurl, headers=settings.HDR)
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
bs = BeautifulSoup(html, features="lxml")
pattern = re.compile(r'^submitted ')
posted_elt = bs.find(text=pattern)
author_elt = posted_elt.parent.find("span", class_="author") if posted_elt is not None else None
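As a self-contained sketch of the same .parent trick (the HTML snippet here is made up for illustration):
from bs4 import BeautifulSoup
import re

html = '''<div class="entry">submitted 3 hours ago by
<span class="author">some_user</span></div>'''

bs = BeautifulSoup(html, "html.parser")
posted_elt = bs.find(text=re.compile(r'^submitted '))  # a NavigableString
# Climb to the containing element, then search inside it.
author_elt = posted_elt.parent.find("span", class_="author") if posted_elt is not None else None
print(author_elt.text if author_elt is not None else None)  # -> some_user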
The .find() method searches for HTML tags. When the tag is found, convert the result to a string using the .text attribute of the result, then run your regex search on that string.
Here is example usage:
from bs4 import BeautifulSoup
import requests
import re

res = requests.get("https://en.wikipedia.org/wiki/Dog")
soup = BeautifulSoup(res.content, "html.parser")
reg = re.compile("-")
s = soup.find("title").text
print(re.search(reg, s).group(0))

# If you want to find all html tags and search each of them, use find_all()
all_res = soup.find_all("p")
reg = re.compile("dog")
for i in all_res:
    s = i.text
    match = re.search(reg, s)
    if match:
        print(match.group(0))
The latter example finds all <p> tags, converts each to a string, and searches for "dog" in them.

How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?

I'm using Django and Python 3.7. I want to have more efficient parsing so I was reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need ...
def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
        if elem == 'div' and 'class' in attr and attrs['class'] == "score":
            return True
        elif elem == "span" and elem.text == re.compile("my text"):
            return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the
elem == "span" and elem.text == re.compile("my text")
clause. However, this results in an
AttributeError: 'str' object has no attribute 'text'
error when I try and run the above. What's the proper way to write my strainer?
TLDR; No, this is currently not easily possible in BeautifulSoup (modification of BeautifulSoup and SoupStrainer objects would be needed).
Explanation:
The problem is that the function passed to the Strainer gets called from the handle_starttag() method. As you can guess, at that point you only have the values from the opening tag (e.g. element name and attrs).
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L524
if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None
And as you can see, if your Strainer function returns False, the element gets discarded immediately, without ever getting a chance to take its inner text into consideration (unfortunately).
On the other hand, if you add "text" to the search:
SoupStrainer(text="my text")
it will start to search inside tags for text, but this doesn't have the context of the element or its attributes - you can see the irony :/
Combining the two will just find nothing. And you can't even access the parent, as is done in the find function shown here:
https://gist.github.com/RichardBronosky/4060082
So currently Strainers are only good for filtering on elements/attrs. You would need to change a lot of BeautifulSoup code to get text-based straining working.
If you really need this, I suggest inheriting from the BeautifulSoup and SoupStrainer classes and modifying their behavior.
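By way of contrast, a minimal sketch of what Strainers are good at today - filtering on element name and attributes only (the HTML here is invented for illustration):
from bs4 import BeautifulSoup, SoupStrainer

html = '<div class="score">42</div><div class="other">x</div><span>my text</span>'

# The strainer only ever sees the tag name and its attrs, never the inner text.
only_scores = SoupStrainer("div", attrs={"class": "score"})
soup = BeautifulSoup(html, "html.parser", parse_only=only_scores)
print(soup.find_all("div"))  # -> [<div class="score">42</div>]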
It seems you are trying to loop over soup elements in the my_custom_strainer method.
To do so, you could proceed as follows:
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
my_custom_strainer(soup, attrs)
Then slightly modify my_custom_strainer to meet something like:
def my_custom_strainer(soup, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    for d in soup.findAll(['div', 'span']):
        if d.name == 'span' and 'class' in attr and attrs['class'] == "score":
            return d.text  # meet your needs here
        elif d.name == 'span' and re.search("my text", d.text):
            return d.text  # meet your needs here
This way you can access the soup objects iteratively.
I recently created a lxml / BeautifulSoup parser for html files, which also searches between specific tags.
The function I wrote opens your operating system's file manager and allows you to select the specific html file to parse.
def openFile(self):
    options = QFileDialog.Options()
    options |= QFileDialog.DontUseNativeDialog
    fileName, _ = QFileDialog.getOpenFileName(self, "QFileDialog.getOpenFileName()", "",
                                              "All Files (*);;Python Files (*.py)", options=options)
    if fileName:
        file = open(fileName)
        data = file.read()
        soup = BeautifulSoup(data, "lxml")
        results = []
        for item in soup.find_all('strong'):
            results.append(float(item.text))
        print('Score =', results[1])
        print('Fps =', results[0])
You can see that the tag I specified was 'strong', and I was trying to find the text within that tag.
Hope I could help in some way.

Extending selection with BeautifulSoup

I am trying to get BeautifulSoup to do the following.
I have HTML files which I wish to modify. I am interested in two tags in particular, one which I will call TagA is
<div class ="A">...</div>
and one which I will call TagB
<p class = "B">...</p>
Both tags occur independently throughout the HTML and may themselves contain other tags and be nested inside other tags.
I want to place a marker tag around every TagA whenever it is not immediately followed by TagB so that
<p class="A"">...</p> becomes <marker><p class="A">...</p></marker>
But when TagA is followed immediately by TagB, I want the marker Tag to surround them both
so that
<p class="A">...</p><div class="B">...</div>
becomes
<marker><p class="A">...</p><div class="B">...</div></marker>
I can see how to select TagA and enclose it with the marker tag, but when it is followed by TagB I do not know if or how the BeautifulSoup 'selection' can be extended to include the next sibling.
Any help appreciated.
BeautifulSoup does have a "next sibling" feature. Find all tags of class A and use a.next_sibling to check whether it is B.
look at the docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
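A minimal sketch of that next_sibling check, assuming the class names from the question (TagA = <div class="A">, TagB = <p class="B">) and using wrap()/append() to build the marker:
from bs4 import BeautifulSoup, Tag

html = '<div class="A">one</div><p class="B">two</p><div class="A">three</div>'
soup = BeautifulSoup(html, "html.parser")

for tag_a in soup.find_all("div", class_="A"):
    sibling = tag_a.next_sibling
    marker = soup.new_tag("marker")
    tag_a.wrap(marker)  # <marker><div class="A">...</div></marker>
    if isinstance(sibling, Tag) and sibling.name == "p" and "B" in sibling.get("class", []):
        marker.append(sibling.extract())  # pull TagB inside the marker too

print(soup)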
I think I was going about this the wrong way by trying to extend the 'selection' from one tag to the following one. Instead, I found that the following code, which inserts the outer 'Marker' tag and then inserts the A and B tags into it, does the trick.
I am pretty new to Python so would appreciate advice regarding improvements or snags with the following.
from bs4 import BeautifulSoup

def isTagB(tag):
    # If tag is <p class="B"> return True
    # if not - or tag is just a string - return False
    try:
        return tag.name == 'p'  # has_key('p') and tag.has_key('B')
    except:
        return False

soup = BeautifulSoup("""<div class = "A"><p><i>more content</i></p></div><div class = "A"><p><i>hello content</i></p></div><p class="B">da <i>de</i> da </p><div class = "fred">not content</div>""")

for TagA in soup.find_all("div", "A"):
    Marker = soup.new_tag('Marker')
    nexttag = TagA.next_sibling
    # skip over whitespace
    while str(nexttag).isspace():
        nexttag = nexttag.next_sibling
    if isTagB(nexttag):
        TagA.replaceWith(Marker)  # Put it where the A element is
        Marker.insert(1, TagA)
        Marker.insert(2, nexttag)
    else:
        # print("FALSE", nexttag)
        TagA.replaceWith(Marker)  # Put it where the A element is
        Marker.insert(1, TagA)

print(soup)
import urllib
from BeautifulSoup import BeautifulSoup

html = urllib.urlopen("http://ursite.com")  # gives html response
soup = BeautifulSoup(html)
all_div = soup.findAll("div", attrs={})  # use attrs as a dict for attribute parsing
# e.g. attrs={'class': "class", "id": "1234"}
single_div = all_div[0]
# to find a p tag inside single_div
p_tag_obj = single_div.find("p")
You can use obj.findNext(), obj.findAllNext(), obj.findAllPrevious(), and obj.findPrevious() to navigate from there.
To get an attribute you can use obj.get("href"), obj.get("title"), etc.
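For example, a small illustrative sketch of attribute access (the markup and names here are invented, and the bs4 import is used for convenience):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p><a href="/about" title="About">About</a></p></div>', "html.parser")
p_tag_obj = soup.find("div").find("p")
link = p_tag_obj.find("a")
print(link.get("href"), link.get("title"))  # -> /about About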
