I'm still learning the BeautifulSoup module, and I'm replicating the "get Amazon price" script from the book Automate the Boring Stuff with Python. I get a traceback on the .select() method:
TypeError: 'NoneType' object is not callable
I'm getting frustrated with this error, as I couldn't find much about it.
import bs4
import requests
header = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
def site(url):
    x = requests.get(url, headers=header)
    x.raise_for_status()
    soup = bs4.BeautifulSoup(x.text, "html.parser")
    p = soup.Select('#buyNewSection > a > h5 > div > div.a-column.a-span8.a-text-right.a-span-last > div > span.a-size-medium.a-color-price.offer-price.a-text-normal')
    abc = p[0].text.strip()
    return abc
price = site('https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994')
print('price is ' + str(price))
It should return a list containing the price, but I'm stuck with this error.
If you use soup.select as opposed to soup.Select, your code does work; it just returns an empty list. The reason becomes clear if we inspect the attribute you are actually calling:
help(soup.Select)
Out[1]:
Help on NoneType object:
class NoneType(object)
| Methods defined here:
|
| __bool__(self, /)
| self != 0
|
| __repr__(self, /)
| Return repr(self).
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __new__(*args, **kwargs) from builtins.type
| Create and return a new object. See help(type) for accurate signature.
Compared to:
help(soup.select)
Out[2]:
Help on method select in module bs4.element:
select(selector, namespaces=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
Perform a CSS selection operation on the current element.
This uses the SoupSieve library.
:param selector: A string containing a CSS selector.
:param namespaces: A dictionary mapping namespace prefixes
used in the CSS selector to namespace URIs. By default,
Beautiful Soup will use the prefixes it encountered while
parsing the document.
:param limit: After finding this number of results, stop looking.
:param kwargs: Any extra arguments you'd like to pass in to
soupsieve.select().
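In other words, soup.Select is not a method at all: BeautifulSoup's attribute lookup for an unknown name returns None, and calling None produces exactly the reported error. A minimal, bs4-free sketch of the failure:

```python
# Calling None reproduces the error from the question: soup.Select
# evaluates to None (there is no such method or child tag), and
# None is not callable.
missing = None
try:
    missing('#buyNewSection > a > h5')
except TypeError as err:
    print(err)  # 'NoneType' object is not callable
```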
That said, it seems that the page structure is actually different from the one you are trying to match: the <a> tag in your selector does not exist.
<div id="buyNewSection" class="rbbHeader dp-accordion-row">
<h5>
<div class="a-row">
<div class="a-column a-span4 a-text-left a-nowrap">
<span class="a-text-bold">Buy New</span>
</div>
<div class="a-column a-span8 a-text-right a-span-last">
<div class="inlineBlock-display">
<span class="a-letter-space"></span>
<span class="a-size-medium a-color-price offer-price a-text-normal">$16.83</span>
</div>
</div>
</div>
</h5>
</div>
So this should work:
p = soup.select('#buyNewSection > h5 > div > div.a-column.a-span8.a-text-right.a-span-last > div.inlineBlock-display > span.a-size-medium.a-color-price.offer-price.a-text-normal')
abc = p[0].text.strip()
abc
Out[2]:
'$16.83'
Additionally, you could consider a more granular approach that lets you debug your code more easily. For instance:
buySection = soup.find('div', attrs={'id':'buyNewSection'})
buySpan = buySection.find('span', attrs={'class': 'a-size-medium a-color-price offer-price a-text-normal'})
print(buySpan.text)
Out[1]:
'$16.83'
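Since Amazon changes its markup frequently, it may also help to check each lookup for None before going deeper. A minimal sketch of that defensive pattern, with the buy-box HTML quoted above inlined so it runs standalone:

```python
import bs4

# Inlined copy of the buy-box snippet shown above (hypothetical page state).
html = '''
<div id="buyNewSection" class="rbbHeader dp-accordion-row">
  <h5><div class="a-row">
    <div class="a-column a-span8 a-text-right a-span-last">
      <div class="inlineBlock-display">
        <span class="a-size-medium a-color-price offer-price a-text-normal">$16.83</span>
      </div>
    </div>
  </div></h5>
</div>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
buy_section = soup.find('div', attrs={'id': 'buyNewSection'})
if buy_section is None:
    raise ValueError('buyNewSection not found - the page layout may have changed')
buy_span = buy_section.find('span', attrs={'class': 'offer-price'})
if buy_span is None:
    raise ValueError('price span not found')
print(buy_span.text.strip())  # $16.83
```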
Related
Using Python/BeautifulSoup, I'm trying to get the post title and URL from every result returned on Reddit.
Below is part of my code that retrieves all Reddit search results.
url = 'https://www.reddit.com/search/?q=test'
r = s.get(url, headers=headers_Get)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all('a', attrs={'data-click-id':'body'})
for result in results:
    print(result.prettify())
    title_post = result.find('h3').text
    url_post = result.find('a')['href']
soup.find_all('a', attrs={'data-click-id':'body'}) appears to return a list of all search results. This works as I expect.
By doing print(result), I can validate that it returns what I need. Below is the result of print(result.prettify()):
<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
<div class="_2SdHzo12ISmrC8H86TgSCp _1zpZYP8cFNLfLDexPY65Y7" style="--posttitletextcolor:#222222">
<h3 class="_eYtD2XCVieq6emjKBH3m">
<span style="font-weight:normal">Match Thread: 3rd
<em style="font-weight:700">Test
</em>- Australia v India, Day 5
</span>
</h3>
</div>
</a>
title_post = result.find('h3').text extracts the title associated with the comment or post. It is working as expected / hoped.
The problem that I have is with retrieving the address of the post (see href=):
<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
The line url_post = result.find('a')['href'] returns an error TypeError: 'NoneType' object is not subscriptable.
If I could use the "result" as a string, then I could just look for href within it. Something like:
loc = result.text.find('href=')
print(result.text[loc:])
Obviously, this won't work:
result.text does not return the HTML code, but just the string "Match Thread: 3rd Test - Australia v India, Day 5"
Question 1:
Is there a way to return only the href="" component?
Question 2:
Is there a way to convert the soup object "result" into plain text while keeping the HTML components? If it was possible, then I'd have an easy workaround.
The href is already in the .attrs of result:
>>> for result in results:
...     print(result.attrs)
...
{'data-click-id': 'body', 'class': ['SQnoC3ObvgnGjWt90zD9Z', '_2INHSNB8V5eaWp4P0rY_mE'], 'href': '/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/'}
...
so don't call the .find() method; instead, access the href value using [key] notation (like a dictionary).
In your example:
for result in results:
    url_post = result["href"]
    print(url_post)
Output:
/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/
/r/Cricket/comments/ku008u/match_thread_3rd_test_australia_v_india_day_4/
/r/Cricket/comments/ktcg7n/match_thread_3rd_test_australia_v_india_day_3/
...
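If some results might lack the attribute, tag.get('href') is the safer variant: it returns None (or a supplied default) instead of raising. A small sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# Two hypothetical results: one with an href, one without.
html = ('<a data-click-id="body" href="/r/Cricket/comments/kunmyt/">post</a>'
        '<a data-click-id="body">no link</a>')
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a', attrs={'data-click-id': 'body'}):
    # .get() never raises for a missing attribute
    print(a.get('href'))  # /r/Cricket/comments/kunmyt/  then  None
```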
Alternatively, you can use PRAW, the Python Reddit API Wrapper, for their API; it is much easier than parsing the web pages. Also note that Reddit's class names are randomly generated, so you cannot rely on them.
https://praw.readthedocs.io/en/latest/
I'm scraping a page and found that, with my XPath and regex methods, I can't seem to get to a set of values that are within a div class.
I have tried the method stated in
How to get all the li tag within div tag
and the current logic shown below is what is in my file.
#PRODUCT ATTRIBUTES (STYLE, SKU, BRAND) need to figure out how to loop thru a class and pull out the 2 list tags
prodattr = re.compile(r'<div class=\"pdp-desc-attr spec-prod-attr\">([^<]+)</div>', re.IGNORECASE)
prodattrmatches = re.findall(prodattr, html)
for m in prodattrmatches:
    m = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
    stymatches = re.findall(m, html)
#STYLE
sty = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
stymatches = re.findall(sty, html)
#BRAND
brd = re.compile(r'<li class=\"first first-item\">([^<]+)</li>', re.IGNORECASE)
brdmatches = re.findall(brd, html)
The above code is NOT working; everything comes back empty. For the purpose of my testing, I'm merely writing the data, if any, to the print command so I can see it on the console.
itmDetails2 = dets['sku'] +","+ dets['description']+","+ dets['price']+","+ dets['brand']
Within the console I get the following, which is what I expect; the generic messages are just placeholders until I get this logic figured out.
SKUE GOES HERE,adidas Women's Essentials Tricot Track Jacket,34.97, BRAND GOES HERE
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
Do not use regex to parse HTML.
There are better and safer ways to do this.
Take a look at this code, which uses Parsel and BeautifulSoup to extract the li tags from your sample HTML:
from parsel import Selector
from bs4 import BeautifulSoup
html = ('<div class="pdp-desc-attr spec-prod-attr">'
'<ul class="prod-attr-list">'
'<li class="first first-item">Brand: adidas</li>'
'<li>Country of Origin: Imported</li>'
'<li class="last last-item">Style: F18AAW400D</li>'
'</ul>'
'</div>')
# Using parsel
sel = Selector(text=html)
for li in sel.xpath('//li'):
    print(li.xpath('./text()').get())
# Using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for li in soup.find_all('li'):
    print(li.text)
Output:
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
I would use an HTML parser and look for the class of the ul. Using bs4 4.7.1, which supports the :has() CSS selector:
from bs4 import BeautifulSoup as bs
html = '''
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
'''
soup = bs(html, 'lxml')
for item in soup.select('.prod-attr-list:has(> li)'):
    print([sub_item.text for sub_item in item.select('li')])
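Going one step further, each li here follows a "Key: value" pattern, so it is natural to split the items into a dict (a sketch assuming that pattern holds for every item):

```python
from bs4 import BeautifulSoup

html = ('<div class="pdp-desc-attr spec-prod-attr"><ul class="prod-attr-list">'
        '<li class="first first-item">Brand: adidas</li>'
        '<li>Country of Origin: Imported</li>'
        '<li class="last last-item">Style: F18AAW400D</li>'
        '</ul></div>')

soup = BeautifulSoup(html, 'html.parser')
attrs = {}
for li in soup.select('.prod-attr-list li'):
    key, _, value = li.text.partition(':')  # split only on the first colon
    attrs[key.strip()] = value.strip()
print(attrs)
# {'Brand': 'adidas', 'Country of Origin': 'Imported', 'Style': 'F18AAW400D'}
```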
I am trying to find an ID in a div class which has multiple values, using BS4. The HTML is:
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk. My current code is:
soup = bs(size.text, "html.parser")
sizes = soup.find_all("div", {"class": "size"})
size = sizes[0]["data-test5-uk"]
size.text is from a GET request to the site with the HTML; however, it returns:
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
First an explanation, then the solution.
.find_all('tag') is used to find all instances of that tag, which we can loop through later.
.find('tag') is used to find ONLY the first instance.
We can extract the value of an attribute either with ['attr'] or with .get('attr'); they are the SAME.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>'''
soup = BeautifulSoup(html, 'lxml')
one_div = soup.find('div', class_='size ')
print(one_div.find('a')['data-test5-uk'])
# your code didn't work because you weren't in the a tag
# we have found the tag that contains the tag .find('a')['data-test5-uk']
# for multiple divs
for each in soup.find_all('div', class_='size '):
    # we loop through each instance and do the same
    datauk = each.find('a')['data-test5-uk']
    print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why didn't your ['arg'] work? You tried to extract ["data-test5-uk"] from the div itself, but <div class="size "> has no such attribute; its only attribute is class="size ". The attribute you want lives on the <a> tag inside it.
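The difference between the two access styles is worth a quick, self-contained demo: bracket access raises KeyError for a missing attribute, while .get() returns None (or a default):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="selectSize" data-test5-uk="7">size</a>', 'html.parser')
a = soup.find('a')

print(a['data-test5-uk'])     # 7 - bracket access works when the attribute exists
print(a.get('data-missing'))  # None - .get() is safe for missing attributes
try:
    a['data-missing']         # bracket access raises for missing attributes
except KeyError as err:
    print('KeyError:', err)
```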
I'm scraping data using lxml.
This is the inspect element of single post
<article id="post-4855" class="post-4855 post type-post status-publish format-standard hentry category-uncategorized">
<header class="entry-header">
<h1 class="entry-title">Cybage..</h1>
<div class="entry-meta">
<span class="byline"> Posted by <span class="author vcard"><a class="url fn n" href="http://aitplacements.com/author/tpoait/">TPO</a></span></span><span class="posted-on"> on <time class="entry-date published updated" datetime="2017-09-13T11:02:32+00:00">September 13, 2017</time></span><span class="comments-link"> with 0 Comment</span> </div><!-- .entry-meta -->
</header><!-- .entry-header -->
<div class="entry-content">
<p>cybage placement details shared READ MORE</p>
</div><!-- .entry-content -->
For every such post, I want to extract title, content of post, and post timing.
For example in above, the details will be
{title : "Cybage..",
post : "cybage placement details shared"
datetime="2017-09-13T11:02:32+00:00"
}
Till now what I'm able to achieve:
the website requires login, I'm successfull in doing that.
For extracting information:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Chrome/42.0.2311.90'}
url = 'http://aitplacements.com/news/'
page = requests.get(url, headers=headers)
doc = html.fromstring(page.content)
#print doc  # it prints <Element html at 0x7f59c38d2260>
raw_title = doc.xpath('//h1[@class="entry-title"]/a/@href/text()')
print raw_title
The raw_title gives empty value [] ?
What I'm doing wrong ?
@href refers to the value of the href attribute:
In [14]: doc.xpath('//h1[@class="entry-title"]/a/@href')
Out[14]: ['http://aitplacements.com/uncategorized/cybage/']
You want the text of the <a> element instead:
In [16]: doc.xpath('//h1[@class="entry-title"]/a/text()')
Out[16]: ['Cybage..']
Therefore, use
raw_title = doc.xpath('//h1[@class="entry-title"]/a/text()')
if len(raw_title) > 0:
    raw_title = raw_title[0]
else:
    # handle the case of a missing title
    raise ValueError('Missing title')
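The @href-vs-text() distinction can be checked in isolation; here is a small sketch using a stripped-down copy of the post header from the question (requires lxml):

```python
from lxml import html

# Minimal copy of the post header markup quoted in the question.
doc = html.fromstring(
    '<div><h1 class="entry-title">'
    '<a href="http://aitplacements.com/uncategorized/cybage/">Cybage..</a>'
    '</h1></div>')

# @href selects the attribute's value; text() selects the element's text node
hrefs = doc.xpath('//h1[@class="entry-title"]/a/@href')
texts = doc.xpath('//h1[@class="entry-title"]/a/text()')
print(hrefs)  # ['http://aitplacements.com/uncategorized/cybage/']
print(texts)  # ['Cybage..']
```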
This is a follow-up to my post Using Python to Scrape Nested Divs and Spans in Twitter?.
I'm not using the Twitter API because it doesn't look at tweets by hashtag this far back. Complete code and output are below, after the examples.
I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.
As an example:
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
Retrieves this:
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
For url, I only need the href value from the first line.
Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?
I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.
Complete Code:
from bs4 import BeautifulSoup
import requests
import sys
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
username = name[0].contents[0]
handle = soup('span', {'class': 'username js-action-profile-name'})
userhandle = handle[0].contents[1].contents[0]
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
messagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
message = messagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcount = retweets[0]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcount = favorites[0]
print (username, "\n", "#", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount) #extra linebreaks for ease of reading
Complete Output:
Michael Peel
#Mikepeeljourno
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>
It was suggested that BeautifulSoup - extracting attribute values may have an answer to this question. However, I think that question and its answers lack sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup documentation is helpful, though: http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
Use the dictionary-like access to the Tag's attributes.
For example, to get the href attribute value:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = links[0]["href"]
Or, if you need to get the href values for every link found:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:
soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
You may use:
soup('a', {'class': 'tweet-timestamp'})
Or, a CSS selector:
soup.select("a.tweet-timestamp")
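A quick sanity check that all three spellings locate the same element (a sketch using a stripped-down copy of the timestamp link):

```python
from bs4 import BeautifulSoup

html = ('<a class="tweet-timestamp js-permalink js-nav js-tooltip" '
        'href="/Mikepeeljourno/status/648787700980408320">29 Sep 2015</a>')
soup = BeautifulSoup(html, 'html.parser')

full = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
single = soup('a', {'class': 'tweet-timestamp'})
css = soup.select('a.tweet-timestamp')
# All three find the same tag, hence the same href
print(full[0]['href'] == single[0]['href'] == css[0]['href'])  # True
```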
Alecxe already explained to use the 'href' key to get the value.
So I'm going to answer the other part of your questions:
Similarly, the retweets and favorites commands return large chunks of
html, when all I really need is the numerical value that is displayed
for each one.
.contents returns a list of all the children. Since you're finding 'button' elements that have several children you're interested in, you can get the count from the parsed content list:
retweetcount = retweets[0].contents[3].contents[1].contents[1].string
This will return the value 4.
If you want a rather more readable approach, try this:
retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
favcount = favorites[0].find_all('span', {'class': 'ProfileTweet-actionCountForPresentation'})[0].string
This returns 4 and 2 respectively.
This works because we take the ResultSet returned by soup(...)/find_all, get the Tag element from it (using [0]), and then search recursively across all its descendants again using find_all().
Now you can loop across each tweet and extract this information rather easily.
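Put together, the retweet-count extraction can be sketched standalone with a trimmed copy of the button HTML from the output above:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the retweet button from the question's output.
html = '''<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" type="button">
  <div class="IconTextContainer">
    <span class="ProfileTweet-actionCount">
      <span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
    </span>
  </div>
</button>'''

soup = BeautifulSoup(html, 'html.parser')
retweets = soup('button', {'class': 'js-actionRetweet'})
# Descend straight to the presentation count instead of chaining .contents indices
count = retweets[0].find('span', class_='ProfileTweet-actionCountForPresentation').string
print(count)  # 4
```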