I want to extract pieces of data such as the company name and address from a website using BeautifulSoup. However, I am getting the following error:
Calgary's Notary Public
Traceback (most recent call last):
File "test.py", line 16, in <module>
print item.find_all(class_='jsMapBubbleAddress').text
AttributeError: 'ResultSet' object has no attribute 'text'
The HTML snippet is below. I want to extract all the text information and write it to a CSV file. Can anyone please help me?
<div class="listing__right article hasIcon">
<h3 class="listing__name jsMapBubbleName" itemprop="name"><a data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1","lk_relevancy":"1","lk_name":"busname","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/bus/Alberta/Calgary/Calgary-s-Notary-Public/100971374.html?what=Notary&where=Calgary%2C+AB&useContext=true" title="See detailed information for Calgary's Notary Public">Calgary's Notary Public</a> </h3>
<div class="listing__address address mainLocal">
<em class="itemCounter">1</em>
<span class="listing__address--full" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span class="jsMapBubbleAddress" itemprop="streetAddress">340-600 Crowfoot Cres NW</span>, <span class="jsMapBubbleAddress" itemprop="addressLocality">Calgary</span>, <span class="jsMapBubbleAddress" itemprop="addressRegion">AB</span> <span class="jsMapBubbleAddress" itemprop="postalCode">T3G 0B4</span></span>
<a class="listing__direction" data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1a","lk_relevancy":"1","lk_name":"directions-step1","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/merchant/directions/100971374?what=Notary&where=Calgary%2C+AB&useContext=true" rel="nofollow" title="Get direction to Calgary's Notary Public">Get directions »</a>
</div>
<div class="listing__details">
<p class="listing__details__teaser" itemprop="description">We offer you a convenient, quick and affordable solution for your Notary Public or Commissioner for Oaths in Calgary needs.</p>
</div>
<div class="listing__ratings--root">
<div class="listing__ratings ratingWarp" itemprop="aggregateRating" itemscope="" itemtype="http://schema.org/AggregateRating">
<meta content="5" itemprop="ratingValue"/>
<meta content="1" itemprop="ratingCount"/>
<span class="ypStars" data-analytics-group="stars" data-clicksent="false" data-rating="rating5" title="Ratings: 5 out of 5 stars">
<span class="star1" data-analytics-name="stars" data-label="Optional : Why did you hate it?" title="I hated it"></span>
<span class="star2" data-analytics-name="stars" data-label="Optional : Why didn't you like it?" title="I didn't like it"></span>
<span class="star3" data-analytics-name="stars" data-label="Optional : Why did you like it?" title="I liked it"></span>
<span class="star4" data-analytics-name="stars" data-label="Optional : Why did you really like it?" title="I really liked it"></span>
<span class="star5" data-analytics-name="stars" data-label="Optional : Why did you love it?" title="I loved it"></span>
</span><a class="listing__ratings__count" data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1","lk_relevancy":"1","lk_name":"read_yp_reviews","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/bus/Alberta/Calgary/Calgary-s-Notary-Public/100971374.html?what=Notary&where=Calgary%2C+AB&useContext=true#ypgReviewsHeader" rel="nofollow" title="1 of Review for Calgary's Notary Public">1<span class="hidden-phone"> YP review</span></a>
</div>
</div>
<div class="listing__details detailsWrap">
<ul>
<li>Notaries
,
</li>
<li>Notaries Public</li>
</ul>
</div>
</div>
There are many divs with the classes listing__right article hasIcon. I am using a for loop to extract the information.
The Python code I have written so far is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.yellowpages.ca/search/si-rat/1/Notary/Calgary%2C+AB'
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content)
g_data=soup.find_all('div', attrs={'class': 'listing__right article hasIcon'})
for item in g_data:
    print item.find('h3').text
    #print item.contents[2].find_all('em', attrs={'class': 'itemCounter'})[1].text
    print item.find_all(class_='jsMapBubbleAddress').text
find_all returns a ResultSet (a list of elements), which has no 'text' attribute, so you get an error. I'm not sure what output you are looking for, but this code seems to work:
import requests
from bs4 import BeautifulSoup
url = 'http://www.yellowpages.ca/search/si-rat/1/Notary/Calgary%2C+AB'
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content,"lxml")
g_data=soup.find_all('div', attrs={'class': 'listing__right article hasIcon'})
for item in g_data:
    print(item.find('h3').text)
    #print item.contents[2].find_all('em', attrs={'class': 'itemCounter'})[1].text
    # find_all returns a ResultSet, so read .text from each element in turn
    address_parts = item.find_all(class_='jsMapBubbleAddress')
    for part in address_parts:
        print(part.text)
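Since the question also asks for CSV output, here is a minimal sketch of that step. It assumes one row per listing with a name and a comma-joined address, works on a trimmed copy of the question's HTML rather than the live page, and uses a hypothetical helper name (listing_rows) and hypothetical column headers.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of one listing from the question; the real page has many.
html_doc = """
<div class="listing__right article hasIcon">
  <h3 class="listing__name jsMapBubbleName" itemprop="name">
    <a href="#">Calgary's Notary Public</a>
  </h3>
  <span class="listing__address--full" itemprop="address">
    <span class="jsMapBubbleAddress" itemprop="streetAddress">340-600 Crowfoot Cres NW</span>,
    <span class="jsMapBubbleAddress" itemprop="addressLocality">Calgary</span>,
    <span class="jsMapBubbleAddress" itemprop="addressRegion">AB</span>
    <span class="jsMapBubbleAddress" itemprop="postalCode">T3G 0B4</span>
  </span>
</div>
"""


def listing_rows(html):
    """Yield one (name, address) tuple per listing block (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for listing in soup.find_all("div", class_="listing__right"):
        name = listing.find("h3").get_text(strip=True)
        # find_all returns a ResultSet, so collect the text of each element.
        parts = [span.get_text(strip=True)
                 for span in listing.find_all(class_="jsMapBubbleAddress")]
        yield name, ", ".join(parts)


rows = list(listing_rows(html_doc))

# Written to an in-memory buffer here; a real file opened with
# open("listings.csv", "w", newline="") (hypothetical name) works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "address"])  # assumed column headers
writer.writerows(rows)
print(buffer.getvalue())
```

The csv module quotes the address field automatically because it contains commas, so the file stays parseable.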
I'm trying to get the links in all li tags under the ul tag
HTML code:
<div id="chapter-list" class="sbox" style="">
<ul>
<li>
<a href="https://example.com/manga/name/2">
<div class="chpbox">
<span class="chapternum">
Chapter 2 </span>
</div>
</a>
</li>
<li>
<a href="https://example.com/manga/name/1">
<div class="chpbox">
<span class="chapternum">
Chapter 1 </span>
</div>
</a>
</li>
</ul>
</div>
The code I wrote:
from bs4 import BeautifulSoup
import requests
html_page = requests.get('https://example.com/manga/name/')
soup = BeautifulSoup(html_page.content, 'html.parser')
chapters = soup.find('div', {"id": "chapter-list"})
children = chapters.findChildren("ul" , recursive=False) # when printed, it gives the whole ul content
for litag in children.find('li'):
    print(litag.find("a")["href"])
When I try to print the li tags links, it gives the following error:
Traceback (most recent call last):
File "C:\0.py", line 12, in <module>
for litag in children.find('li'):
File "C:\Users\hs\AppData\Local\Programs\Python\Python310\lib\site-packages\bs4\element.py", line 2289, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
You can use find to locate the ul in the chapter list, then find_all to get the list items in that ul. Finally, use find_all again to get the links in each list item and print each URL. Details of these two methods can be found in the find and find_all method documentation on bs4. To get a link's text, such as Chapter 1, search each link for the class chapternum and call get_text() on the result. Searching by class is covered in the bs4 documentation for searching element by class.
(Updated) Code:
from bs4 import BeautifulSoup
html_doc = """
<div id="chapter-list" class="sbox" style="">
<ul>
<li>
<a href="https://example.com/manga/name/2">
<div class="chpbox">
<span class="chapternum">
Chapter 2 </span>
</div>
</a>
</li>
<li>
<a href="https://example.com/manga/name/1">
<div class="chpbox">
<span class="chapternum">
Chapter 1 </span>
</div>
</a>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
chapters = soup.find('div', {"id": "chapter-list"})
list_items = chapters.find('ul').find_all('li')
for list_item in list_items:
    for link in list_item.find_all('a'):
        title = link.find('span', class_='chapternum').get_text().strip()
        href = link.get("href")
        print(f"{title}: {href}")
Output:
Chapter 2: https://example.com/manga/name/2
Chapter 1: https://example.com/manga/name/1
References:
find and find_all method documentation on bs4
bs4 documentation for searching element by class
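As an alternative to chaining find and find_all, the same links can probably be collected with a single CSS selector through select. This is only a sketch on a compacted copy of the question's HTML; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Compacted copy of the chapter list from the question.
html_doc = """
<div id="chapter-list" class="sbox">
  <ul>
    <li><a href="https://example.com/manga/name/2">
      <div class="chpbox"><span class="chapternum"> Chapter 2 </span></div></a></li>
    <li><a href="https://example.com/manga/name/1">
      <div class="chpbox"><span class="chapternum"> Chapter 1 </span></div></a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select always returns a list, so it is safe to iterate over directly.
links = soup.select("#chapter-list > ul > li > a[href]")
chapters = [(a.get_text(strip=True), a["href"]) for a in links]

for title, href in chapters:
    print(f"{title}: {href}")
```

Because select returns a plain list even when nothing matches, this form avoids calling find on a ResultSet by mistake.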
I'm trying to get some image URLs from HTML content using Python and BeautifulSoup.
My HTML content:
<div id="photos" class="tab rel-photos multiple-photos">
<span id="watch-this" class="classified-detail-buttons">
<span id="c_id_10832265:c_type_202:watch_this">
<a href="/watchlist/classified/baby-items/10832265/1/" id="watch_this_logged" data-require-auth="favoriteAd" data-tr-event-name="dpv-add-to-favourites">
<i class="fa fa-fw fa-star-o"></i></a></span>
</span>
<span id="thumb1" class=" image">
<a href="https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main"
id="a-photo-modal-view:263986810"
rel="photos-modal"
target="_new"
onClick="return dbzglobal_event_adapter(this);">
<div style="background-image:url(https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main);"></div>
</a>
</span>
<ul id="thumbs-list">
<li>
<span id="thumb2" class="image2">
<a href="https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=main" id="a-photo-modal-view:263986811" rel="photos-modal" target="_new" onClick="return dbzglobal_event_adapter(this);" >
<div style="background-image:url(https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=thumb_retina);"></div>
</a>
</span>
</li>
<li id="thumbnails-info">
4 Photos
</li>
</ul>
<div id="photo-count">
4 Photos - Click to enlarge
</div>
</div>
My Python code:
images = soup.find("div", {"id": ["photos"]}).find_all("a")
for image in images:
    sk = image.get("href").replace("p=main","p=thumb_retina",1)
    print(sk)
But I'm getting this error:
Traceback (most recent call last):
File "/Users/evilslab/Documents/Websites/www.futurepoint.dev.cc/dobuyme/SCRAPE/boats.py", line 47, in <module>
images = soup.find("div", {"id": ["photos"]}).find_all("a")
AttributeError: 'NoneType' object has no attribute 'find_all'
How can I get only the URL from the a tag's href attribute?
Your code works for me. More completely (given your HTML as html_doc):
from bs4 import BeautifulSoup
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html_doc, "html.parser")
images = soup.find("div", {"id": ["photos"]}).find_all("a")
for image in images:
    print(image['href'].replace("p=main","p=thumb_retina",1))
However, your problem is that the text returned by requests from the URL is not the same as the HTML sample you gave. Despite your attempt to supply a random user agent, the server returns:
<li>You\'re a power user moving through this website with super-human speed.</li>\n <li>You\'ve disabled JavaScript in your web browser.</li>\n <li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this <a title=\'Third party browser plugins that block javascript\' href=\'http://ds.tl/help-third-party-plugins\' target=\'_blank\'>support article</a>.</li>\n </ul>\n </div>\n <p class="we-could-be-wrong" >\n We could be wrong, and sorry about that! Please complete the CAPTCHA below and we’ll get you back on dubizzle right away.
Since the CAPTCHA is intended to prevent scraping, I suggest respecting the admin's wishes and not scraping it. Maybe there's an API?
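The AttributeError on 'NoneType' arises because find returns None when the div is missing from the page that was actually served. A minimal sketch of guarding against that, using a hypothetical helper (photo_links) and two made-up pages standing in for the expected HTML and the CAPTCHA response:

```python
from bs4 import BeautifulSoup


def photo_links(html):
    """Return hrefs of links inside div#photos, or [] if the div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="photos")
    if container is None:
        # The server sent a different page than expected, e.g. a CAPTCHA.
        return []
    return [a["href"] for a in container.find_all("a", href=True)]


# Made-up stand-ins for the two kinds of response described above.
expected_page = '<div id="photos"><a href="https://example.com/image;p=main">x</a></div>'
blocked_page = "<p class='we-could-be-wrong'>Please complete the CAPTCHA below</p>"

print(photo_links(expected_page))
print(photo_links(blocked_page))
```

Checking for None does not get past the CAPTCHA; it only turns the crash into an empty result that the caller can detect.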
Try this:
for item in soup.find_all('span'):
    try:
        link = item.find_all('a', href=True)[0].attrs.get('href', None)
    except IndexError:
        continue
    else:
        print(link)
Output:
/watchlist/classified/baby-items/10832265/1/
/watchlist/classified/baby-items/10832265/1/
https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main
https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=main
I am new to BeautifulSoup and have an issue I do not understand. The question may already have been answered, but none of the answers I have found help in this case.
I need to access the inside of a div to retrieve the glossary entries of a website, but the contents of that div do not seem to show up at all with BeautifulSoup. Could you help me?
This is the HTML on the website:
<!DOCTYPE html>
<html lang="en-US" style="margin-top: 0px !important;">
<head>...</head>
<body>
<header>...</header>
<section id="glossary" class="search-off">
<dl class="title">
<dt>Glossary</dt>
</dl>
<div class="content">
<aside id="glossary-aside">
<div></div>
<ul></ul>
</aside>
<div id="glossary-list" class="list">
<dl data-id="2103">...</dl>
<dl data-id="1105">
<dt>ABV (Alcohol by volume)</dt>
<dd>
<p style="margin-bottom: 0cm; text-align: justify;"><span style="font-family: Arial Cyr,sans-serif;"><span style="font-size: x-small;"><span style="font-size: small;"><span style="font-size: medium;">Alcohol by volume (ABV) is the measure of an alcoholic beverage’s alcohol content. Wines may have alcohol content from 4% ABV to 18% ABV; however, wines’ typical alcohol content ranges from 12.5% to 14.5% ABV. You can find a particular wine’s alcohol content by checking the label.</span></span></span></span><span style="font-size: medium;"> </span></p>
</dd>
</dl>
<dl data-id="1106">...</dl>
<dl data-id="1213">...</dl>
<dl data-id="2490">...</dl>
<dl data-id="11705">...</dl>
<dl data-id="1782">...</dl>
</div>
<div id="glossary-single" class="list">...</div>
</div>
<div class="s_content">
<div id="glossary-s_list" class="list"></div>
</div>
</section>
<footer></footer>
</body>
</html>
And I need to access the different <dl> tags in the <div id="glossary-list" class="list">.
My code is currently as follows:
import requests
from bs4 import BeautifulSoup
response = requests.get("http://winevibe.com/glossary")
soup = BeautifulSoup(response.text, "lxml")
ct = soup.find("div", {"id":"glossary-list"}).findAll("dl")
I have tried various things, including getting the descendants and children, but all I get is an empty list.
If I try ct = soup.find("div", {"id":"glossary-list"}) and print it, I get: <div class="list" id="glossary-list"></div>. It seems to me that the inside of the div is somehow blocked, but I am not sure.
Does anyone have an idea of how to access it?
First solution: the URL below is based on my research into where the data is loaded from. The glossary is rendered by JavaScript, and the entries are fetched via XHR from a different URL:
import requests
import json
r = requests.get('http://winevibe.com/wp-json/glossary/key/?l=en').json()
hoks = json.loads(r)
for item in hoks:
    print(item['key'])
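The first solution calls .json() and then json.loads on the result. That pattern suggests the endpoint returns a JSON string that itself contains JSON, although this is an inference from the code rather than something stated in the answer. A small offline sketch of such double encoding, using a hypothetical payload shaped like the fields the answer reads:

```python
import json

# Hypothetical payload with the 'key' and 'id' fields used in the answer.
entries = [{"key": "ABV (Alcohol by volume)", "id": 1105}]

# Simulate a server that JSON-encodes the data twice.
body = json.dumps(json.dumps(entries))

first_pass = json.loads(body)         # like response.json(): still a str
second_pass = json.loads(first_pass)  # the actual list of dicts

print(type(first_pass).__name__)
print(second_pass[0]["key"])
```

If an endpoint returns ordinary JSON instead, the second json.loads call would raise a TypeError, so it is worth checking the type of the first result before decoding again.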
Second Solution:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Firefox()
url = 'http://winevibe.com/glossary/'
browser.get(url)
time.sleep(20) # wait 20 seconds for the site to load.
html = browser.page_source
soup = BeautifulSoup(html, features='html.parser')
for item in soup.findAll('div', attrs={'id': 'glossary-list'}):
    for dt in item.findAll('dt'):
        print(dt.text)
You can use browser.close() to close the browser when you are done.
Here is the final code, which covers everything the asker requested in chat:
import requests
import json
r = requests.get('http://winevibe.com/wp-json/glossary/key/?l=en').json()
data = json.loads(r)
result = ([(item['key'], item['id']) for item in data])
text = []
for item in result:
    try:
        r = requests.get(
            f"http://winevibe.com/wp-json/glossary/text/?id={item[1]}").json()
        data = json.loads(r)
        print(f"Getting Text For: {item[0]}")
        text.append(data[0]['text'])
    except KeyboardInterrupt:
        print('Good Bye')
        break
with open('result.txt', 'w+') as f:
    for a, b in zip(result, text):
        lines = ', '.join([a[0], b.replace('\n', '')]) + '\n'
        f.write(lines)
I have a python script using beautifulsoup to scrape a property sales website.
I am trying to get the number of beds from the HTML.
The data-reactid changes for each listing in the search results; the number $11606747 is unique to the listing.
I am trying to do a wildcard search for "*$beds.0.0" to return the number of beds (3 in the example).
There is no error message, and the code runs but doesn't return the number.
What am I doing wrong?
The HTML:
<div class="property-features is-regular listing-result__features" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2"><span class="property-feature__feature" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds"><span class="property-feature__feature-text-container" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0"><span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.0">3</span><span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.1"> </span><span class="property-features__feature-text" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.2">Beds</span></span>
The Python code
beds = listing.findAll('span',{"data-reactid":re.compile('*$beds.0.0')})
You can try something like this to get the number of beds:
content='''
<html>
<body>
<div class="property-features is-regular listing-result__features" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2">
<span class="property-feature__feature" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds">
<span class="property-feature__feature-text-container" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0">
<span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.0">
3
</span>
<span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.1">
</span>
<span class="property-features__feature-text" data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.2">
Beds
</span>
</span>
</span>
</div>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
item = soup.select("div span[data-reactid*='$11606747']")[0].text
print(' '.join(item.split()))
Result:
3 Beds
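You need to escape the symbols $ and ., because they are special in regex. The leading * should be dropped rather than escaped: regex has no glob-style wildcard, and BeautifulSoup tests a compiled pattern with search(), so it can match anywhere in the attribute value:
re.compile(r'\$beds\.0\.0$')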
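A small sketch of the regex approach on a trimmed copy of the question's markup. It assumes the goal is the span whose data-reactid ends in $beds.0.0; the pattern escapes $ and . and anchors at the end of the value, and no leading wildcard is needed because BeautifulSoup applies compiled patterns with search().

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html_doc = (
    '<div data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2">'
    '<span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.0">3</span>'
    '<span data-reactid=".1e881obdfqe.3.1.3.1:$11606747.0.1.0.2.$beds.0.2">Beds</span>'
    '</div>'
)

soup = BeautifulSoup(html_doc, "html.parser")

# Literal "$beds.0.0" at the end of the attribute value.
pattern = re.compile(r"\$beds\.0\.0$")
beds = [span.get_text(strip=True)
        for span in soup.find_all("span", {"data-reactid": pattern})]
print(beds)
```

Unlike the select example, this does not depend on the listing number, so it should apply to every listing on a results page.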
Sorry, I have asked a similar question before:
(How can i crawl web data that not in tags)
I still have a problem with data that is not inside a tag.
This case is slightly different from the one in that question.
<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>
I only need
I am here
The string is a child of the main div of type NavigableString, so you can loop through div.children and filter based on the type of the node:
from bs4 import BeautifulSoup, NavigableString
[x.strip() for x in soup.find("div", {'id': 'main-content'}).children if isinstance(x, NavigableString) and x.strip()]
# [u'I am here']
Data:
soup = BeautifulSoup("""<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>""", "html.parser")
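soup = BeautifulSoup(that_html, "html.parser")
div_tag = soup.div
# div_tag.string would be None here because the div has several children,
# so take the last direct text node instead
required_string = div_tag.find_all(string=True, recursive=False)[-1].strip()
Go through the BeautifulSoup documentation for details.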