Python BeautifulSoup AttributeError

I'm trying to get some image URLs from HTML content using Python and BeautifulSoup.
My HTML content:
<div id="photos" class="tab rel-photos multiple-photos">
<span id="watch-this" class="classified-detail-buttons">
<span id="c_id_10832265:c_type_202:watch_this">
<a href="/watchlist/classified/baby-items/10832265/1/" id="watch_this_logged" data-require-auth="favoriteAd" data-tr-event-name="dpv-add-to-favourites">
<i class="fa fa-fw fa-star-o"></i></a></span>
</span>
<span id="thumb1" class=" image">
<a href="https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main"
id="a-photo-modal-view:263986810"
rel="photos-modal"
target="_new"
onClick="return dbzglobal_event_adapter(this);">
<div style="background-image:url(https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main);"></div>
</a>
</span>
<ul id="thumbs-list">
<li>
<span id="thumb2" class="image2">
<a href="https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=main" id="a-photo-modal-view:263986811" rel="photos-modal" target="_new" onClick="return dbzglobal_event_adapter(this);" >
<div style="background-image:url(https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=thumb_retina);"></div>
</a>
</span>
</li>
<li id="thumbnails-info">
4 Photos
</li>
</ul>
<div id="photo-count">
4 Photos - Click to enlarge
</div>
</div>
My Python code:
images = soup.find("div", {"id": ["photos"]}).find_all("a")
for image in images:
    sk = image.get("href").replace("p=main", "p=thumb_retina", 1)
    print(sk)
But I'm getting this error:
Traceback (most recent call last):
File "/Users/evilslab/Documents/Websites/www.futurepoint.dev.cc/dobuyme/SCRAPE/boats.py", line 47, in <module>
images = soup.find("div", {"id": ["photos"]}).find_all("a")
AttributeError: 'NoneType' object has no attribute 'find_all'
How can I get only the URL from the a href tag?

Your code works for me. Here it is more completely (with your HTML as html_doc):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
images = soup.find("div", {"id": ["photos"]}).find_all("a")
for image in images:
    print(image['href'].replace("p=main", "p=thumb_retina", 1))
However, your problem is that the text requests returns for that URL is not the same as the HTML sample you give. Despite your attempt to supply a random user agent, the server returns:
<li>You\'re a power user moving through this website with super-human speed.</li>\n <li>You\'ve disabled JavaScript in your web browser.</li>\n <li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this <a title=\'Third party browser plugins that block javascript\' href=\'http://ds.tl/help-third-party-plugins\' target=\'_blank\'>support article</a>.</li>\n </ul>\n </div>\n <p class="we-could-be-wrong" >\n We could be wrong, and sorry about that! Please complete the CAPTCHA below and we’ll get you back on dubizzle right away.
Since the CAPTCHA is intended to prevent scraping, I suggest respecting the admin's wishes and not scraping it. Maybe there's an API?
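Since the question title is about the AttributeError itself, here is a small defensive sketch, assuming soup is the parsed response, that avoids calling find_all on None when the expected div isn't in the page that actually came back:
photos = soup.find("div", id="photos")
if photos is None:
    # the server returned something other than the expected markup (e.g. a CAPTCHA page)
    print("photo container not found - got a different page than expected")
else:
    for a in photos.find_all("a"):
        print(a["href"].replace("p=main", "p=thumb_retina", 1))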

Try this:
for item in soup.find_all('span'):
    try:
        link = item.find_all('a', href=True)[0].attrs.get('href', None)
    except IndexError:
        continue
    else:
        print(link)
Output:
/watchlist/classified/baby-items/10832265/1/
/watchlist/classified/baby-items/10832265/1/
https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6ImYzYWdrZm8xcDBlai1EVUJJWlpMRSIsInciOlt7ImZuIjoiNWpldWk3cWZ6aWU2MS1EVUJJWlpMRSIsInMiOjUwLCJwIjoiY2VudGVyLGNlbnRlciIsImEiOjgwfV19.s1GmifnZr0_Bx4HG8RTR4puYcxN0asqAmnBvSpIExEI/image;p=main
https://images.dubizzle.com/v1/files/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmbiI6Imtmc3cxMWgzNTB2cTMtRFVCSVpaTEUiLCJ3IjpbeyJmbiI6IjVqZXVpN3FmemllNjEtRFVCSVpaTEUiLCJzIjo1MCwicCI6ImNlbnRlcixjZW50ZXIiLCJhIjo4MH1dfQ.Wo2YqPdWav8shtmyVO2AdisHmLX-ZLDAiskLPAmTSPU/image;p=main

Related

Parsing website with BeautifulSoup and Requests returns None

I'm a beginner in programming altogether and am working on a project of mine. For that I'm trying to parse data from a website to make a tool that uses the data. I found that BeautifulSoup and Requests are common tools to do it, but unfortunately I cannot seem to make it work. It always returns the value None or an error that says:
"TypeError: 'NoneType' object is not callable"
Did I do anything wrong? Is it maybe not possible to parse some websites' data, and am I being restricted from accessing it or something?
If there are other ways to access the data, I'm happy to hear them as well.
Here is my code:
from bs4 import BeautifulSoup
import requests
pickrates = {} # dict to store winrate of champions for each position
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
soup = BeautifulSoup(source, "lxml")
value = soup.find("div", class_="content-section champion-ranking-stats")
print(value.prettify())
Remember that when you request a webpage with the requests module, you only get the HTML of that page; the module is not capable of rendering JavaScript.
Try this code:
import requests
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
print(source)
Then search by hand (Ctrl+F) for the class names you provided: there are no such elements at all. That means they are generated by other requests, such as AJAX calls, after the initial HTML page is loaded. So before Beautiful Soup even comes into play, you can't get them, not even in the .text attribute of the response object.
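As a quick sanity check, here is a minimal sketch of the same verification done programmatically (same URL as above; per the explanation, the class name should not appear in the un-rendered HTML):
import requests

# fetch the raw HTML only; no JavaScript is executed
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text

# the class from the question is expected to be absent from the raw page
print("champion-ranking-stats" in source)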
One way of handling this is to use Selenium or any other library that can execute the JavaScript.
It seems that, as in this question (can't find html tag when I scrape web using beautifulsoup), the problem is caused by a JavaScript event listener. I would suggest you use Selenium to handle this issue: apply Selenium to send the request and get back the page source, and then use BeautifulSoup to parse it.
Don't forget to download a browser driver from https://www.selenium.dev/documentation/getting_started/installing_browser_drivers/ and place it in the same directory as your code.
The example code below uses Selenium with Firefox:
import time

from selenium import webdriver
from bs4 import BeautifulSoup

URL = 'http://u.gg/lol/champions/aatrox/build?role=top'
browser = webdriver.Firefox()
browser.get(URL)
time.sleep(1)  # give the page a moment to render before grabbing its source
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()
value = soup.find("div", class_="content-section champion-ranking-stats")
print(value.prettify())
Your expected output would look like this:
>>> print(value.prettify())
<div class="content-section champion-ranking-stats">
<div class="win-rate meh-tier">
<div class="value">
48.4%
</div>
<div class="label">
Win Rate
</div>
</div>
<div class="overall-rank">
<div class="value">
49 / 58
</div>
<div class="label">
Rank
</div>
</div>
<div class="pick-rate">
<div class="value">
3.6%
</div>
<div class="label">
Pick Rate
</div>
</div>
<div class="ban-rate">
<div class="value">
2.3%
</div>
<div class="label">
Ban Rate
</div>
</div>
<div class="matches">
<div class="value">
55,432
</div>
<div class="label">
Matches
</div>
</div>
</div>

How do I link a title with a url in Python and BeautifulSoup?

So I'm working on a script to download videos for me automatically; however, it would seem I have stumbled upon a problem (insufficient experience).
How do I link a category title with a bunch of URLs?
Expected output:
->CATEGORY 1
->https://example.com/part/of/link/x
->https://example.com/part/of/link/x
->https://example.com/part/of/link/x
.....
->CATEGORY 2
->https://example.com/part/of/link/x
for category_title in soup.findAll(class_='section-title'):
    title = category_title.get_text().lstrip()
    print(title)

for onelink in soup.findAll('a'):
    link = onelink.get('href')
    print(f'https://example.com{link}')
What this does is:
Lists all the titles (10 titles)
Lists all the links (100 links)
<div class="section-title">
CATEGORY 1
<li class="section-item next-random">
<a class="item" href="/part/of/link/x">
<div class="title-container">
<div class="btn-primary btn-sm pull-right">
BUTTON
</div>
<span class="random-name">
URL TITLE
</span>
</div>
</a>
</li>
</ul>
Depending on how the HTML structure of your page is made, you could check the parent object of each "section-title" and then list all the links of that particular section, giving you all the links per category.
Here is some help (see the sketch below).
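A minimal sketch of that idea, assuming the anchors sit inside each "section-title" block as in the snippet above; if the real page puts them in a sibling element instead, swap the inner find_all for a find_next_sibling traversal:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc: the page HTML

for category in soup.find_all(class_='section-title'):
    # the first text node of the block holds the category name
    title = next(category.stripped_strings)
    print(f'->{title}')
    # only the anchors that live inside this category block
    for onelink in category.find_all('a', href=True):
        print(f"->https://example.com{onelink['href']}")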

How to scrape tags that appear within a script

My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
# Remember to import BeautifulSoup, requests and pprint
url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
#pp.pprint(soup) Verify that the page has been found
all_items = soup.find_all('li',class_= 'top10-item')
pp.pprint(all_items)
# []
However this returns an empty list, indicating that soup.find_all() did not find any tags fitting those criteria.
Inspect Element in Chrome displays the list items as expected.
However, in the page source the ul with class "top10-items" contains a script, which seems to iterate through each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the li tags.
My question is: how can I extract the item names from the script using Beautiful Soup?
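For what it's worth, here is a minimal sketch of what BeautifulSoup can and cannot reach here: the script element itself is easy to select, but it only contains the mustache placeholders ({{productName}} and so on), not the rendered names, which are filled in later by JavaScript from a separate request:
import requests as req
from bs4 import BeautifulSoup as bs

url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')

# the template is reachable, but it holds placeholders rather than product data
template = soup.find('script', class_='X-template-top10')
if template is not None:
    print(template.string)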

Scraping the links from a specific url

This is my first question; if I have explained anything wrong, please forgive me.
I am trying to scrape URLs from a specific website in Python and parse the links to a CSV. The thing is, when I parse the website with BeautifulSoup I can't extract the URLs, because all I get is <div id="dvScores" style="min-height: 400px;">\n</div> and nothing under that branch. But when I open the console, copy the table where the links are, and paste it into a text editor, it pastes 600 pages of HTML. What I want to do is write a for loop that shows the links. The structure of the HTML is below:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
#shadow-root (open)
<head>...</head>
<body>
<div id="body">
<div id="wrapper">
#multiple divs but i don't need them
<div id="live-master"> #what I need is under this div
<span id="contextual">
#multiple divs but i don't need them
<div id="live-score-master"> #what I need is under this div
<div ng-app="live-menu" id="live-score-rightcoll">
#multiple divs but i don't need them
<div id="left-score-lefttemp" style="padding-top: 35px;">
<div id="dvScores">
<table cellspacing=0 ...>
<colgroup>...</colgroup>
<tbody>
<tr class="row line-bg1"> #this changes to bg2 or bg3
<td class="row">
<span class="row">
<a href="www.example.com" target="_blank" class="td_row">
#I need to extract this link
</span>
</td>
#Multiple td's
</tr>
#multiple tr class="row line-bg1" or "row line-bg2"
.
.
.
</tbody>
</table>
</div>
</div>
</div>
</div>
</span>
</div>
</div>
</body>
</html>
What am I doing wrong? I need to automate this in Python rather than pasting the HTML into a text editor and extracting the links with a regex.
My Python code is below as well:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://example.com/example")
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("span", id="contextual")
span = all[0].find_all("tbody")
If you are trying to scrape URLs then you should get the hrefs:
urls = soup.find_all('a', href=True)
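A short usage sketch: find_all with href=True returns Tag objects, so pull the attribute from each one to get the URL strings:
for a in soup.find_all('a', href=True):
    print(a['href'])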
This site uses JavaScript to populate its content; therefore, you can't get the URLs via BeautifulSoup alone. If you inspect the network tab in your browser you can spot the underlying request, http://goapi.mackolik.com/livedata?group=0. It contains all the data you need, so you can simply parse it and extract the desired values.
import requests

req = requests.get('http://goapi.mackolik.com/livedata?group=0').json()
for el in req['m'][4:100]:
    index = el[0]
    team_1 = el[2].replace(' ', '-')
    team_2 = el[4].replace(' ', '-')
    print('http://www.mackolik.com/Mac/{}/{}-{}'.format(index, team_1, team_2))
It seems like the HTML is being dynamically generated by JS. You would need to crawl it in a way that mimics a browser. Since you are using requests, it already offers a session you can reuse:
session = requests.session()
data = session.get("http://website.com").content  # usage example
After this you can do the parsing, additional scraping, etc.
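A minimal sketch wiring that session into BeautifulSoup (the placeholder URL from the answer is kept; note that, as the previous answer points out, the table is filled in by JavaScript, so the div may still come back empty):
import requests
from bs4 import BeautifulSoup

session = requests.session()
html = session.get("http://website.com").content  # placeholder URL from the answer
soup = BeautifulSoup(html, "html.parser")

# pull every link under the scores div, if the server returned it populated
scores = soup.find("div", id="dvScores")
if scores is not None:
    for a in scores.find_all("a", href=True):
        print(a["href"])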

Scraping with BeautifulSoup: object has no attribute

I want to extract excerpts of data like company name and address from a website using BeautifulSoup. I am getting, however, the following failure:
Calgary's Notary Public
Traceback (most recent call last):
File "test.py", line 16, in <module>
print item.find_all(class_='jsMapBubbleAddress').text
AttributeError: 'ResultSet' object has no attribute 'text'
The HTML code snippet is below. I want to extract all the text information and convert it into a CSV file. Please, can anyone help me?
<div class="listing__right article hasIcon">
<h3 class="listing__name jsMapBubbleName" itemprop="name"><a data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1","lk_relevancy":"1","lk_name":"busname","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/bus/Alberta/Calgary/Calgary-s-Notary-Public/100971374.html?what=Notary&where=Calgary%2C+AB&useContext=true" title="See detailed information for Calgary's Notary Public">Calgary's Notary Public</a> </h3>
<div class="listing__address address mainLocal">
<em class="itemCounter">1</em>
<span class="listing__address--full" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span class="jsMapBubbleAddress" itemprop="streetAddress">340-600 Crowfoot Cres NW</span>, <span class="jsMapBubbleAddress" itemprop="addressLocality">Calgary</span>, <span class="jsMapBubbleAddress" itemprop="addressRegion">AB</span> <span class="jsMapBubbleAddress" itemprop="postalCode">T3G 0B4</span></span>
<a class="listing__direction" data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1a","lk_relevancy":"1","lk_name":"directions-step1","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/merchant/directions/100971374?what=Notary&where=Calgary%2C+AB&useContext=true" rel="nofollow" title="Get direction to Calgary's Notary Public">Get directions »</a>
</div>
<div class="listing__details">
<p class="listing__details__teaser" itemprop="description">We offer you a convenient, quick and affordable solution for your Notary Public or Commissioner for Oaths in Calgary needs.</p>
</div>
<div class="listing__ratings--root">
<div class="listing__ratings ratingWarp" itemprop="aggregateRating" itemscope="" itemtype="http://schema.org/AggregateRating">
<meta content="5" itemprop="ratingValue"/>
<meta content="1" itemprop="ratingCount"/>
<span class="ypStars" data-analytics-group="stars" data-clicksent="false" data-rating="rating5" title="Ratings: 5 out of 5 stars">
<span class="star1" data-analytics-name="stars" data-label="Optional : Why did you hate it?" title="I hated it"></span>
<span class="star2" data-analytics-name="stars" data-label="Optional : Why didn't you like it?" title="I didn't like it"></span>
<span class="star3" data-analytics-name="stars" data-label="Optional : Why did you like it?" title="I liked it"></span>
<span class="star4" data-analytics-name="stars" data-label="Optional : Why did you really like it?" title="I really liked it"></span>
<span class="star5" data-analytics-name="stars" data-label="Optional : Why did you love it?" title="I loved it"></span>
</span><a class="listing__ratings__count" data-analytics='{"lk_listing_id":"100971374","lk_non-ad-rollup":"0","lk_page_num":"1","lk_pos":"in_listing","lk_proximity":"14.5","lk_directory_heading":[{"085100":[{"00910600":"1"},{"00911000":"1"}]}],"lk_geo_tier":"in","lk_area":"left_1","lk_relevancy":"1","lk_name":"read_yp_reviews","lk_pos_num":"1","lk_se_id":"e292d1d2-f130-463d-8f0c-7dd66800dead_Tm90YXJ5_Q2FsZ2FyeSwgQUI_56","lk_ev":"link","lk_product":"l2"}' href="/bus/Alberta/Calgary/Calgary-s-Notary-Public/100971374.html?what=Notary&where=Calgary%2C+AB&useContext=true#ypgReviewsHeader" rel="nofollow" title="1 of Review for Calgary's Notary Public">1<span class="hidden-phone"> YP review</span></a>
</div>
</div>
<div class="listing__details detailsWrap">
<ul>
<li>Notaries
,
</li>
<li>Notaries Public</li>
</ul>
</div>
</div>
There are many divs with the class listing__right article hasIcon. I am using a for loop to extract the information.
The Python code I have written so far is:
import requests
from bs4 import BeautifulSoup

url = 'http://www.yellowpages.ca/search/si-rat/1/Notary/Calgary%2C+AB'
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content)
g_data = soup.find_all('div', attrs={'class': 'listing__right article hasIcon'})
for item in g_data:
    print item.find('h3').text
    #print item.contents[2].find_all('em', attrs={'class': 'itemCounter'})[1].text
    print item.find_all(class_='jsMapBubbleAddress').text
find_all returns a list, which has no text attribute, so you are getting an error. I'm not sure what output you are looking for, but this code seems to work OK:
import requests
from bs4 import BeautifulSoup

url = 'http://www.yellowpages.ca/search/si-rat/1/Notary/Calgary%2C+AB'
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, "lxml")
g_data = soup.find_all('div', attrs={'class': 'listing__right article hasIcon'})
for item in g_data:
    print item.find('h3').text
    #print item.contents[2].find_all('em', attrs={'class': 'itemCounter'})[1].text
    items = item.find_all(class_='jsMapBubbleAddress')
    for item in items:
        print item.text
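Since the question mentions converting the results into a CSV file, here is a hedged follow-up sketch in the same Python 2 style as the snippets above; the filename notaries.csv and the two-column layout are assumptions, and the address parts are simply joined with commas:
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.yellowpages.ca/search/si-rat/1/Notary/Calgary%2C+AB'
soup = BeautifulSoup(requests.get(url).content, "lxml")

with open('notaries.csv', 'wb') as f:  # 'wb' because the snippets above are Python 2
    writer = csv.writer(f)
    writer.writerow(['name', 'address'])
    for listing in soup.find_all('div', attrs={'class': 'listing__right article hasIcon'}):
        name = listing.find('h3').text.strip()
        address = ', '.join(span.text for span in listing.find_all(class_='jsMapBubbleAddress'))
        writer.writerow([name.encode('utf-8'), address.encode('utf-8')])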
