How to scrape two nested elements with Python - python

Hi, I would like to get the information inside the <del> and <ins> tags shown below, but I could not find a solution. Does anyone have an idea how to scrape these values?
This is my Python code:
import requests
import json
from bs4 import BeautifulSoup
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
base_url = "https://www.n11.com/super-firsatlar"
r = requests.get(base_url, headers=header)
if r.status_code == 200:
    soup = BeautifulSoup(r.text, 'html.parser')
    books = soup.find_all('li', attrs={"class": "column"})
    result = []
    for book in books:
        title = book.find('h3').text
        link = base_url + book.find('a')['href']
        picture = base_url + book.find('img')['src']
        price = book.find('p', {'class': 'del'})
        single = {'title': title, 'link': link, 'picture': picture, 'price': price}
        result.append(single)
    with open('book.json', 'w', encoding='utf-8') as f:
        json.dump(result, f, indent=4, ensure_ascii=False)
else:
    print(r.status_code)
And this is my HTML page:
<li class="column">
<script type="text/javascript">
var customTextOptionMap = {};
</script>
<div id="p-457010862" class="columnContent ">
<div class="pro">
<a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
title="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="plink" data-id="457010862">
<img data-original="https://n11scdn1.akamaized.net/a1/215/elektronik/cep-telefonu/oppo-a73-128-gb-oppo-turkiye-garantili__1298508275589871.jpg"
width="215" height="215"
src="https://n11scdn1.akamaized.net/a1/215/elektronik/cep-telefonu/oppo-a73-128-gb-oppo-turkiye-garantili__1298508275589871.jpg"
alt="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="lazy" style="">
<h3 class="productName ">
Oppo A73 128 GB (Oppo Türkiye Garantili)</h3>
<span class="loading"></span>
</a>
</div>
<div class="proDetail">
<a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
class="oldPrice" title="Oppo A73 128 GB (Oppo Türkiye Garantili)">
<del>2.999, 00 TL</del>
</a> <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
class="newPrice" title="Oppo A73 128 GB (Oppo Türkiye Garantili)">
<ins>2.899, 00<span content="TRY">TL</span></ins>
</a>
<div class="discount discountS">
<div>
<span class="percent">%</span>
<span class="ratio">3</span>
</div>
</div>
<span class="textImg freeShipping"></span>
<p class="catalogView-hover-separate"></p>
<div class="moreOpt">
<a title="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="textImg moreOptBtn"
href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"></a>
</div>
</div>
</div>
</li>

Unless I am misunderstanding your question, it should be as simple as:
del_data = soup.find_all("del")
ins_data = soup.find_all("ins")
Is this not what you're trying to achieve? If not, please clarify your question.

del and ins are not class names but tag names. You can simply find them with soup.find_all('del'):
price = soup.find_all('del')
for p in price:
    print(p.text)
gives
2.999,00 TL
189,90 TL
8.308,44 TL
499,90 TL
6.999,00 TL
99,00 TL
18,00 TL
499,00 TL
169,99 TL
1.499,90 TL
3.010,00 TL
2.099,90 TL
......
which is what you want, I guess. Note that you have to access the .text attribute of each tag. So the element is located; how you want to serialize it is a different question.
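To collect both prices per product rather than across the whole page, one option is to look up the `del` and `ins` tags inside each `li.column`. The following is a minimal sketch run against a trimmed copy of the HTML from the question, not the live site; the `parse_products` helper and the `old_price`/`new_price` keys are illustrative names that do not appear in the original code.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the live page
# still follows this structure).
SAMPLE_HTML = """
<li class="column">
  <div class="pro">
    <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde" class="plink">
      <h3 class="productName ">
        Oppo A73 128 GB (Oppo Türkiye Garantili)</h3>
    </a>
  </div>
  <div class="proDetail">
    <a class="oldPrice"><del>2.999, 00 TL</del></a>
    <a class="newPrice"><ins>2.899, 00<span content="TRY">TL</span></ins></a>
  </div>
</li>
"""


def parse_products(html):
    """Return one dict per <li class="column"> with both price strings."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.find_all("li", class_="column"):
        old = item.find("del")  # <del> is a tag name, not a class
        new = item.find("ins")
        products.append({
            "title": item.find("h3").get_text(strip=True),
            # Either tag can be missing on some items, so fall back to None.
            "old_price": old.get_text(" ", strip=True) if old else None,
            "new_price": new.get_text(" ", strip=True) if new else None,
        })
    return products


products = parse_products(SAMPLE_HTML)
print(products)
```

Because the values are plain strings (or `None`) rather than `Tag` objects, the resulting list can be passed to `json.dump` as in the question's code.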

Related

Outputting tag from BeautifulSoup gives html and None

I am having trouble with getting the tag from a BeautifulSoup .find() method.
Here is my code:
url = evaluations['href']
page = requests.get(url, headers = HEADERS)
soup = BeautifulSoup(page.content, 'lxml')
evaluators = soup.find("section", class_="main-content list-content")
evaluators_list = evaluators.find("ul", class_='evaluation-list').find_all("li")
evaluators_dict = defaultdict(dict)
for evaluator in evaluators_list:
    eval_list = evaluator.find('ul', class_='highlights-list')
    print(eval_list.prettify())
This then gives the output:
<ul class="highlights-list">
<li class="eval-meta evaluator">
<b class="uppercase heading">
Evaluated By
</b>
<img alt="Andrew Ivins" height="50" src="https://s3media.247sports.com/Uploads/Assets/680/358/9358680.jpeg?fit=bounds&crop=50:50,offset-y0.50&width=50&height=50&fit=crop" title="Andrew Ivins" width="50"/>
<div class="evaluator">
<b class="text">
Andrew Ivins
</b>
<span class="uppercase">
Southeast Recruiting Analyst
</span>
</div>
</li>
<li class="eval-meta projection">
<b class="uppercase heading">
Projection
</b>
<b class="text">
First Round
</b>
</li>
<li class="eval-meta">
<b class="uppercase heading">
Comparison
</b>
<a href="https://247sports.com/Player/Charles-Woodson-76747/" target="_blank">
Charles Woodson
</a>
<span class="uppercase">
Oakland Raiders
</span>
</li>
</ul>
and the error
Traceback (most recent call last):
File "XXX", line 2, in <module>
player = Player("Travis-Hunter-46084728").player
File "XXX", line 218, in __init__
self.player = self._parse_player()
File "XXX", line 253, in _parse_player
evaluators, background, skills = self._find_scouting_report(soup)
File "XXX", line 468, in _find_scouting_report
print(eval_list.prettify())
AttributeError: 'NoneType' object has no attribute 'prettify'
As you can see, it does find the tag and pretty-prints it, but it then also hits a None. What would be a way around this? Thank you in advance. The link I am using is: https://247sports.com/PlayerInstitution/Travis-Hunter-at-Collins-Hill-236028/PlayerInstitutionEvaluations/
EDIT: I tried Selenium, thinking it might be a JS problem, but that did not resolve it either.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'
}

def get_soup(content):
    return BeautifulSoup(content, 'lxml')

def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        soup = get_soup(r.content)
        # The ">" child combinators select only the highlights-list of each
        # top-level evaluation item, so nested <li> elements are skipped.
        goal = [list(x.stripped_strings) for x in soup.select(
            '.main-content.list-content > .evaluation-list > li > .highlights-list')]
        for i in goal:
            # Keep the evaluator name and role plus the last two strings (comparison).
            print(i[1:3] + i[-2:])

if __name__ == "__main__":
    main('https://247sports.com/PlayerInstitution/Travis-Hunter-at-Collins-Hill-236028/PlayerInstitutionEvaluations/')
Output:
['Andrew Ivins', 'Southeast Recruiting Analyst', 'Charles Woodson', 'Oakland Raiders']
['Andrew Ivins', 'Southeast Recruiting Analyst', 'Xavier Rhodes', 'Minnesota Vikings']
['Charles Power', 'National writer', 'Marcus Peters', 'Baltimore Ravens']
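A likely cause of the traceback in the question is that `find_all("li")` searches all descendants, so it also returns the `li` elements nested inside each `highlights-list`; for those, `find('ul', class_='highlights-list')` returns `None`. The sketch below illustrates this on made-up HTML (the structure is simplified and the second top-level `li` is a hypothetical item without a nested list), using `recursive=False` plus an explicit `None` check.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mirrors the nesting in the question.
HTML = """
<ul class="evaluation-list">
  <li>
    <ul class="highlights-list">
      <li class="eval-meta projection"><b class="text">First Round</b></li>
    </ul>
  </li>
  <li><p>No highlights here</p></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("ul", class_="evaluation-list")

# find_all() is recursive by default, so nested <li> elements are included.
all_items = outer.find_all("li")
# recursive=False restricts the search to direct children of the outer list.
direct_items = outer.find_all("li", recursive=False)

projections = []
for item in direct_items:
    highlights = item.find("ul", class_="highlights-list")
    if highlights is None:
        # find() returns None when nothing matches; skip instead of crashing.
        continue
    projections.append(highlights.find("b", class_="text").get_text(strip=True))

print(len(all_items), len(direct_items), projections)
```

The CSS selector with `>` combinators in the answer above achieves the same restriction to direct children in a single expression.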

Web parsing using beautiful soup

I need to parse data from a website: https://finance.yahoo.com/quote/MSFT/community.
I am trying to scrape the time of the post. For example "21 hours ago".
Here is the HTML of the site. I am trying to extract the time from
<span class = "F....>21 hours ago<...
<li class="comment Pend(2px) Mt(5px) Mb(11px) P(12px) " data-reactid="24">
<div class="Pos(r) Pstart(52px) " data-reactid="25">
<div class="Fz(12px) Mend(20px) Mb(5px)" data-reactid="26">
<div class="avatar D(ib) Bdrs(46.5%) Pos(a) Start(0) Cur(p)" data-reactid="27">...</div>
<button aria-label="See reaction history for 💰Chef💰" class="D(ib) Fw(b) P(0) Bd(0) M(0) Mend(10px)
Fz(16px) Ta(start) C($c-fuji-blue-1-a)" data-reactid="31">💰Chef💰</button>
<span class="Fz(12px) C(#828c93)"><span>21 hours ago</span></span>
<div class="Wow(bw)" data-reactid="33">...</div>
<div class="Py(4px)" data-reactid="39">...</div>
<div>...</div>
<div class="Pos(r) Pt(5px)" data-reactid="45">...</div>
</div>
</li>
The issue is that I am not able to find the time after reading the data with BeautifulSoup. Here is the code I have written:
import requests
from bs4 import BeautifulSoup

url = "https://finance.yahoo.com/quote/MSFT/community"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
content = soup.find('ul' , class_= "comments-list List(n) Ovs(touch) Pos(r) Mt(10px) Mstart(-12px) Pt(5px)")
li = content.find('li' , class_ = "comment Pend(2px) Mt(5px) Mb(11px) P(12px)")
print(li.prettify())
Output
<li class="comment Pend(2px) Mt(5px) Mb(11px) P(12px)" data-reactid="24">
<div class="Pos(r) Pstart(52px)" data-reactid="25">
<div class="Fz(12px) Mend(20px) Mb(5px)" data-reactid="26">
<div class="avatar D(ib) Bdrs(46.5%) Pos(a) Start(0) Cur(p)" data-reactid="27">
<div class="Pos(r)" data-reactid="28">
<div class="avatar-text Ta(c) Bdrs(48%)" data-reactid="29" style="background-color:#ff333a;color:#fff;font-size:24px;line-height:40px;width:40px;height:40px;" title="See reaction history for 💰Chef💰">
�
</div>
<img alt="💰Chef💰" class="avatar-img Bdrs(48%) Pos(a) StretchedBox Bgc(#400090.03)" data-reactid="30" height="40" src="https://s.yimg.com/it/api/res/1.2/H4mqpmacU.CnalUl7leuZA--~A/YXBwaWQ9eW5ld3M7dz04MDtoPTgwO3E9ODA-/https://s.yimg.com/gq/1735/40147141066_fc6972_o.jpg" title="See reaction history for 💰Chef💰" width="40"/>
</div>
</div>
<button aria-label="See reaction history for 💰Chef💰" class="D(ib) Fw(b) P(0) Bd(0) M(0) Mend(10px) Fz(16px) Ta(start) C($c-fuji-blue-1-a)" data-reactid="31">
💰Chef💰
</button>
<!-- react-empty: 32 -->
</div>
<div class="Wow(bw)" data-reactid="33">
<div class="C($c-fuji-grey-l) Mb(2px) Fz(14px) Lh(20px) Pend(8px)" data-reactid="34">
<!-- react-text: 35 -->
I managed to pick up an Xbox series X and I have to say the tech is very impressive (bought it not knowing). This only makes up around 10% of Microsoft’s business and slated to out perform PS5. I personally believe this stock is still undervalued. Longs stay strong!
<!-- /react-text -->
</div>
<div class="Pos(r) D(ib) Lh(12px)" data-reactid="36">
<video autoplay="" class="W(a) Mah(320px) My(3px) H(a) Maw(100%)" data-reactid="37" height="236" loop="" muted="" playsinline="" poster="https://media.tenor.com/images/e239c83d92cd0e896dc8b9ac05be0bf4/tenor.png" width="258">
<source data-reactid="38" src="https://media.tenor.com/videos/1eff610b01f69f793fe1f267c88bbbd4/mp4" type="video/mp4"/>
</video>
</div>
</div>
<div class="Py(4px)" data-reactid="39">
<div class="D(ib) Pos(r)" data-reactid="40">
<div class="Fz(12px) Px(8px) Mend(4px) Va(m) Bdrs(3px) C($c-fuji-grey-g) Bgc($c-fuji-grey-b) Cur(d) Py(3px)" data-reactid="41">
<div class="D(ib) Mend(6px) Cur(d) Cur(p)" data-icon="traffic" data-reactid="42" style="vertical-align:middle;fill:#333;stroke:#333;stroke-width:0;width:18px;height:18px;">
</div>
<!-- react-text: 43 -->
Bullish
<!-- /react-text -->
</div>
</div>
</div>
<!-- react-empty: 44 -->
<div class="Pos(r) Pt(5px)" data-reactid="45">
<div class="D(ib)" data-reactid="46">
<button class="reply-button O(n):h O(n):a P(0) Bd(n) Cur(p) Fz(12px) Fw(n) C(#828c93) D(ib) Mend(20px)" data-reactid="47">
<svg class="Mend(6px) Cur(p)" data-icon="reply" data-reactid="48" height="15" style="fill:#828c93;stroke:#828c93;stroke-width:0;vertical-align:bottom;" viewbox="0 0 25 20" width="15">
<path d="M12.8 5.8V2.4c0-.4-.2-.7-.6-.8-.3-.2-.8-.1-1.1.1L.9 8.8c-.2.2-.4.4-.4.7s.2.6.4.8l10.2 7.1c.3.2.7.3 1.1.1.3-.2.6-.5.6-.8v-3.3h1.3c3 0 6.1.6 8.5 5.2.2.3.5.5.9.5h.3c.5-.1.8-.6.7-1C23 7.3 18.5 5.9 12.8 5.8zm-.1 5.6h-1c-.6 0-1 .4-1 .9v2.4L3.2 9.5l7.5-5.2v2.4c0 .5.5.9 1 .9 4.2 0 7.9 0 9.8 6.3-2.9-2.6-6.3-2.6-8.8-2.5z" data-reactid="49">
</path>
</svg>
<!-- react-text: 50 -->
Reply
<!-- /react-text -->
</button>
</div>
<div class="D(ib) Pos(r)" data-reactid="51">
<button aria-label="8 Thumbs Up" class="O(n):h O(n):a Bgc(t) Bdc(t) M(0) P(0) Bd(n) Mend(20px)" data-reactid="52">
<svg class="Cur(p)" data-icon="thumbsup-outline" data-reactid="53" height="12" style="vertical-align:middle;fill:#828c93;stroke:#828c93;stroke-width:0;" viewbox="0 0 24 24" width="12">
<path d="M2.4 21.7V11.6h2.1c1.1 0 1.7-.7 2.1-1.1.7-.7 1.5-1.4 2.2-2.1.3-.2.5-.4.7-.6 1.6-1.8 2.3-3.4 2.2-5.5 0 0 .1-.1.3 0 .5.1 1 .6 1.1 1.6.3 2.4-.2 4.3-.7 4.7-.9.7-.4 2 .8 2h7.1c.7 0 1.3.4 1.3.8 0 .3-.5.8-.7.8-1.6 0-1.6 2.1-.1 2.2 0 .1.1.2.1.4 0 .1 0 .2-.3.3-.3-.1-.5-.1-.7-.1-1.3 0-1.7 1.7-.5 2.2 0 0 .2.1.3.2.2.2.3.3.3.6 0 .1-.1.1-.3.3-.2.1-.5.2-.7.2-1.3.2-1.4 1.9-.1 2.2l.1.1s.2.2.2.5c0 .4-.2.6-.6.6l-16.2-.2zM20.3 8.3h-5.1c.4-1.3.6-2.9.3-4.7-.2-2-1.5-3.3-3.1-3.6-1.7-.3-3.2.7-3.1 2.4.1 1.5-.4 2.6-1.7 4l-.5.5c-.6.7-1.3 1.4-2 2-.3.2-.5.4-.7.5H1.2c-.7 0-1.2.5-1.2 1.1v12.3c0 .7.6 1.2 1.2 1.2h17.3c1.6 0 3-1 3-2.8 0-.5-.1-.9-.3-1.2.7-.5 1.1-1.1 1.1-2 0-.5-.1-.9-.3-1.3.7-.4 1.2-1.1 1.2-2 0-.4-.1-.9-.3-1.3.6-.5 1-1.3 1-2.1.1-1.8-1.7-3-3.6-3z" data-reactid="54">
</path>
</svg>
<span class="D(ib) Va(m) Fz(12px) Mstart(6px) C(#828c93)" data-reactid="55">
8
</span>
</button>
<button aria-label="0 Thumbs Down" class="O(n):h O(n):a Bgc(t) Bdc(t) M(0) P(0) Bd(n)" data-reactid="56">
<svg class="Cur(p)" data-icon="thumbsdown-outline" data-reactid="57" height="12" style="vertical-align:middle;fill:#828c93;stroke:#828c93;stroke-width:0;" viewbox="0 0 24 24" width="12">
<path d="M21.6 2.3v10.1h-2.1c-1.1 0-1.7.7-2.1 1.1-.7.7-1.5 1.4-2.2 2.1-.2.2-.4.4-.5.6-1.6 1.8-2.3 3.4-2.2 5.5 0 0-.1.1-.3 0-.5-.1-1-.6-1.1-1.6-.3-2.4.2-4.3.7-4.7.9-.7.4-2-.8-2H3.9c-.7 0-1.3-.4-1.3-.8-.2-.2.4-.6.5-.6 1.6 0 1.6-2.1.1-2.2 0-.1-.1-.2-.1-.4 0-.1 0-.2.3-.3s.5-.1.7-.1c1.3 0 1.7-1.7.5-2.2 0 0-.2-.1-.3-.2-.2-.2-.3-.3-.3-.6 0 0 .1-.1.3-.2.3-.1.6-.2.7-.2 1.3-.2 1.4-1.9.1-2.2L5 3.3s-.1-.2-.1-.5c0-.4.2-.6.6-.6l16.1.1zM0 12.7c0 1.8 1.8 3 3.7 3h5.1c-.4 1.3-.6 2.9-.3 4.7.3 1.9 1.5 3.3 3.1 3.5 1.7.3 3.2-.7 3.1-2.4-.1-1.5.4-2.6 1.7-4 .2-.2.3-.4.5-.5.7-.7 1.4-1.3 2.1-1.9.3-.2.5-.4.6-.5h3.2c.7 0 1.2-.5 1.2-1.1V1.2c0-.7-.6-1.2-1.2-1.2H5.5c-1.6 0-3 1-3 2.8 0 .5.1.9.3 1.2-.8.5-1.2 1.2-1.2 2 0 .5.1.9.3 1.3-.7.4-1.2 1.1-1.2 2 0 .4.1.9.3 1.3-.6.6-1 1.3-1 2.1" data-reactid="58">
</path>
</svg>
</button>
</div>
</div>
</div>
</li>
If I print this line of code the output says "None".
date = content.find('span' , class_="Fz(12px) C(#828c93)")
print(date)
Output
None
Can someone explain to me why this is happening, or help with code that reads the time of the post (21 hours ago)?
Thank you.
You get None because BeautifulSoup only sees the HTML source from the initial request, and that source does not contain the posted-time values. Those values are rendered after the initial request. To capture such values you can use a library like Selenium: wait until the page has loaded, then pass the rendered HTML source to BeautifulSoup. Below is sample code to give a basic idea:
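from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/quote/MSFT/community')
try:
    # By.CLASS_NAME takes a single class name, so match the comment items with a CSS selector.
    myElem = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'li.comment')))
    # Read the page source only after the comments have been rendered.
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # Rest of the code
except TimeoutException:
    print("Load failed")
finally:
    driver.quit()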
import requests
from bs4 import BeautifulSoup
response = requests.get("https://finance.yahoo.com/quote/MSFT/community")
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)
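Once the rendered HTML is available (for example from `driver.page_source`), the timestamp can be read with an ordinary BeautifulSoup lookup. The sketch below runs against a shortened copy of the snippet quoted in the question, not the live page, and matches on the single class `C(#828c93)`; whether that class is stable on the real site is an assumption.

```python
from bs4 import BeautifulSoup

# Shortened copy of the rendered markup quoted in the question.
RENDERED = """
<li class="comment Pend(2px) Mt(5px) Mb(11px) P(12px) " data-reactid="24">
  <div class="Fz(12px) Mend(20px) Mb(5px)" data-reactid="26">
    <span class="Fz(12px) C(#828c93)"><span>21 hours ago</span></span>
  </div>
</li>
"""

soup = BeautifulSoup(RENDERED, "html.parser")

# class_ with a single name matches any tag that carries that class,
# regardless of the other classes or their order.
stamp = soup.find("span", class_="C(#828c93)")

# Guard against None in case the element is absent from the HTML.
posted = stamp.get_text(strip=True) if stamp is not None else None
print(posted)
```

Running the same lookup on the HTML returned by plain `requests.get` would yield `None` if the span is not present in that source.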

scraping a div with confirmation popup

I am trying to scrape a file from this site:
https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011
I want to download the Excel sheet with the complete directory of towns of TRIPURA, the first one in the grid list.
My code is:
import requests
import selenium
from bs4 import BeautifulSoup

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = session.get(URL)
    soup = BeautifulSoup(response.content, 'html.parser')
soup
The element corresponding to the file is given below. How do I actually download that particular Excel file? The link leads to another window where a purpose and an email address have to be given. It would be great if you could provide a solution.
<div class="view-content">
<div class="views-row views-row-1 views-row-odd views-row-first ogpl-grid-list">
<div class="views-field views-field-title"> <span class="field-content"><span class="title-content">Complete Town Directory by India/State/District/Sub-District Level, Census 2011 - TRIPURA</span></span> </div>
<div class="views-field views-field-field-short-name confirmation-popup-177303 download-confirmation-box file-container excel"> <div class="field-content"><a class="177303 data-extension excel" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura" target="_blank" title="excel (Open in new window)">excel</a></div> </div>
<div class="views-field views-field-dms-allowed-operations-3 visual-access"> <span class="field-content">Visual Access: NA</span> </div>
<div class="views-field views-field-field-granularity"> <span class="views-label views-label-field-granularity">Granularity: </span> <div class="field-content">Decadal</div> </div>
<div class="views-field views-field-nothing-1 download-file"> <span class="field-content"><span class="download-filesize">File Size: 44.5 KB</span></span> </div>
<div class="views-field views-field-field-file-download-count"> <span class="field-content download-counts"> Download: 529</span> </div>
<div class="views-field views-field-field-reference-url"> <span class="views-label views-label-field-reference-url">Reference URL: </span> <div class="field-content">http://www.censusindia.gov.in/2011census...</div> </div>
<div class="views-field views-field-dms-allowed-operations-1 vote_request_data_api"> <span class="field-content"><a class="api-link" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura/api" title="View API">Data API</a></span> </div>
<div class="views-field views-field-field-note"> <span class="views-label views-label-field-note">Note: </span> <div class="field-content ogpl-more">NA</div> </div>
<div class="views-field views-field-dms-allowed-operations confirmationpopup-177303 data-export-cont"> <span class="views-label views-label-dms-allowed-operations">EXPORT IN: </span> <span class="field-content"><ul></ul></span> </div> </div>
When you click on the excel link, it opens the following page:
https://data.gov.in/node/ID/download
The ID appears to be the first class name of the link, e.g. t.find('a')['class'][0]. There may be a more concise way to get the ID, but using the class name works.
The page https://data.gov.in/node/ID/download then redirects to the final URL of the file.
The following gathers all the URLs in a list:
import requests
from bs4 import BeautifulSoup

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'
src = requests.get(URL)
soup = BeautifulSoup(src.content, 'html.parser')
node_list = [
    t.find('a')['class'][0]
    for t in soup.find_all("div", { "class" : "excel" })
]
url_list = []
for url in node_list:
    node = requests.get("https://data.gov.in/node/{0}/download".format(url))
    soup = BeautifulSoup(node.content, 'html.parser')
    content = soup.find_all("meta")[1]["content"].split("=")[1]
    url_list.append(content)
print(url_list)
Complete code that downloads the files using default filename (using this post) :
import os
import shutil
import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup

# Ported to Python 3: urllib2 and urlparse became urllib.request and urllib.parse.
def download(url, fileName=None):
    def getFileName(url, openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            cd = dict(map(
                lambda x: x.strip().split('=', 1) if '=' in x else (x.strip(), ''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename:
                    return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urllib.parse.urlsplit(openUrl.url)[2])

    r = urllib.request.urlopen(urllib.request.Request(url))
    try:
        fileName = fileName or getFileName(url, r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r, f)
    finally:
        r.close()

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'
src = requests.get(URL)
soup = BeautifulSoup(src.content, 'html.parser')
node_list = [
    t.find('a')['class'][0]
    for t in soup.find_all("div", { "class" : "excel" })
]
url_list = []
for url in node_list:
    node = requests.get("https://data.gov.in/node/{0}/download".format(url))
    soup = BeautifulSoup(node.content, 'html.parser')
    content = soup.find_all("meta")[1]["content"].split("=")[1]
    url_list.append(content)
    print("download : " + content)
    download(content)
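The two parsing steps in this answer (node ID from the link's first class, final URL from the redirect page) can be tried offline. The sketch below assumes the download page redirects through a `<meta http-equiv="refresh">` tag, which is what the `content.split("=")[1]` line suggests; the redirect HTML and the file URL inside it are made up for illustration, and `node_ids` and `refresh_target` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Shortened copy of the listing markup from the question.
LISTING = """
<div class="views-field confirmation-popup-177303 download-confirmation-box file-container excel">
  <div class="field-content">
    <a class="177303 data-extension excel" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura">excel</a>
  </div>
</div>
"""

# Hypothetical redirect page; the target URL is invented for this example.
REDIRECT = """
<html><head><meta charset="utf-8">
<meta http-equiv="refresh" content="0; url=https://example.org/files/towns.xls?id=1">
</head></html>
"""


def node_ids(html):
    """Return the first class name of each link inside a div.excel."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.find("a")["class"][0] for div in soup.find_all("div", class_="excel")]


def refresh_target(html):
    """Return the URL of a meta refresh tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"http-equiv": "refresh"})
    if meta is None:
        return None
    # Split only once so "=" characters in the URL's query string survive.
    return meta["content"].split("=", 1)[1].strip()


ids = node_ids(LISTING)
target = refresh_target(REDIRECT)
print(ids, target)
```

Looking the tag up by its `http-equiv` attribute avoids depending on it being the second `meta` element, and splitting once keeps URLs with query parameters intact.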

Beautifulsoup return none value for the href how to select it?

How can I get the hrefs from the "Remixed From" part on the left-hand side of the following page?
webpage = 'http://www.thingiverse.com/thing:1492411'
I tried something like:
response = requests.get('http://www.thingiverse.com/thing:1492411')
soup = BeautifulSoup(response.text, "lxml")
remm = soup.find_all("a", class_="thing-stamp")
for rems in remm:
    adress = rems.find("href")
    print(adress)
Of course, this didn't work.
EDIT: HTML tag looks like this:
<a class="thing-stamp" href="/thing:1430125">
<div class="stamp-content">
<img alt="" class="render" src="https://cdn.thingiverse.com/renders/57/ea/1b/cc/c2/1105df2596f30c4df6dcf12ca5800547_thumb_small.JPG"/> <div>Anet A8 button guide by Simhopp</div>
</div>
</a>
<a class="thing-stamp" href="/thing:1430727">
<div class="stamp-content">
<img alt="" class="render" src="https://cdn.thingiverse.com/renders/93/a9/c8/45/da/05fe99b5215b04ec33d22ea70266ac72_thumb_small.JPG"/> <div>Frame brace for Anet A8 by Simhopp</div>
</div>
</a>
Try:
remm = soup.find_all("a", class_="thing-stamp")
for rems in remm:
    adress = rems.get('href')
    print(adress)
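The reason this works is that `href` is an attribute of the `a` tag, not a child element: `find("href")` looks for a nested `<href>` tag and returns `None`. A short sketch using shortened markup from the question (the third link without an `href` is added here as a hypothetical edge case); joining with `urljoin` to get absolute URLs is an optional extra that the question did not ask for.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://www.thingiverse.com/thing:1492411"

HTML = """
<a class="thing-stamp" href="/thing:1430125"><div class="stamp-content"><div>Anet A8 button guide by Simhopp</div></div></a>
<a class="thing-stamp" href="/thing:1430727"><div class="stamp-content"><div>Frame brace for Anet A8 by Simhopp</div></div></a>
<a class="thing-stamp">link without href (hypothetical)</a>
"""

soup = BeautifulSoup(HTML, "html.parser")

links = []
for a in soup.find_all("a", class_="thing-stamp"):
    # .get() returns None for a missing attribute; a["href"] would raise KeyError.
    href = a.get("href")
    if href:
        # Resolve the site-relative path against the page URL.
        links.append(urljoin(BASE, href))

print(links)
```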

python parse html elements while scraping

Well, I have a website:
http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1
and I want to get the names of all the ads and the price of each item in an array. What I have right now is:
import urllib2
from BeautifulSoup import BeautifulSoup
import re
listofads = []
page = urllib2.urlopen("http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1").read()
soup = BeautifulSoup(page)
for a in soup.findAll("div", {"class":re.compile("lista")}):
    for i in a:
        c = soup.findAll('h2')
        y = soup.findAll("span", {"class":re.compile("right")})
        listofads.append(c)
        listofads.append(y)
print listofads
What I get is something like this:
</h2>, <h2>
Procura: Macbook Pro i7, 15'
</h2>], [<span class="right">50 €</span>
which looks very bad. I want to get:
Macbook bla bla . price = 500
Macbook B . price = 600
and so on.
The HTML of the site looks like this:
<div class="listofads">
<div class="lista " style="cursor: pointer;">
<div class="lista " style="cursor: pointer;">
<div class="li_image">
<div class="li_desc">
<a href="http://www.custojusto.pt/Lisboa/Laptops/Macbook+pro+15-11018054.htm?xtcr=2&" name="11018054">
<h2> Macbook pro 15 </h2>
</a>
<div class="clear"></div>
<span class="li_date largedate listline"> Informática & Acessórios - Loures </span>
<span class="li_date largedate listline">
</div>
<div class="li_categoria">
<span class="li_price">
<ul>
<li>
<span class="right">1 199 €</span>
<div class="clear"></div>
</li>
<li class="excep"> </li>
</ul>
</span>
</div>
<div class="clear"></div>
</div>
As you can see, I only want the H2 text in the div with the class "li_desc" and the price from the span with the class "right".
I don't know how to do it with BeautifulSoup as it doesn't support xpath, but here's how you could do it nicely with lxml:
import urllib.request
from lxml import etree
from lxml.cssselect import CSSSelector

# Ported to Python 3: urllib2 became urllib.request and print is a function.
url = "http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1"
response = urllib.request.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
my_products = []
# Here, we harvest all the results into a list of dictionaries, containing the items we want.
for product_result in CSSSelector(u'div.lista')(tree):
    # Now, we can select the children element of each div.lista.
    this_product = {
        u'name': product_result.xpath('div[2]/a/h2'),  # first h2 of the second child div
        u'category': product_result.xpath('div[2]/span[1]'),  # first span of the second child div
        u'price': product_result.xpath('div[3]/span/ul/li[1]/span'),  # Third div, span, ul, first li, span tag.
    }
    print(this_product.get(u'name')[0].text)
    my_products.append(this_product)
# Let's inspect a product result now:
for product in my_products:
    print(u'Product Name: "{0}", costs: "{1}"'.format(
        product.get(u'name')[0].text.replace(u'Procura:', u'').strip() if product.get(u'name') else 'NONAME!',
        product.get(u'price')[0].text.strip() if product.get(u'price') else u'NO PRICE!',
    ))
And, here's some output:
Product Name: "Macbook Pro", costs: "890 €"
Product Name: "Memoria para Macbook Pro", costs: "50 €"
Product Name: "Macbook pro 15", costs: "1 199 €"
Product Name: "Macbook Air 13", costs: "1 450 €"
Some items do not contain a price, so results need to be checked before outputting each one.
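For comparison, the same extraction can be sketched with BeautifulSoup 4, whose `select()` and `select_one()` methods accept CSS selectors. This runs on a cleaned-up copy of the HTML from the question (closing tags were added, which is an assumption about the real page, and the second ad without a price is made up to show the missing-price case); `parse_ads` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Cleaned-up sample based on the markup in the question.
HTML = """
<div class="listofads">
  <div class="lista " style="cursor: pointer;">
    <div class="li_image"></div>
    <div class="li_desc">
      <a href="http://www.custojusto.pt/Lisboa/Laptops/Macbook+pro+15-11018054.htm?xtcr=2&amp;" name="11018054">
        <h2> Macbook pro 15 </h2>
      </a>
    </div>
    <div class="li_categoria">
      <span class="li_price"><ul><li><span class="right">1 199 €</span></li></ul></span>
    </div>
  </div>
  <div class="lista " style="cursor: pointer;">
    <div class="li_desc"><a><h2> Procura: Macbook Pro i7, 15' </h2></a></div>
  </div>
</div>
"""


def parse_ads(html):
    """Return (name, price) tuples, with None for a missing field."""
    soup = BeautifulSoup(html, "html.parser")
    ads = []
    # "div.lista" matches the class token "lista" only, not "listofads".
    for ad in soup.select("div.lista"):
        name = ad.select_one("div.li_desc h2")
        price = ad.select_one("span.right")
        ads.append((
            name.get_text(strip=True) if name else None,
            price.get_text(strip=True) if price else None,
        ))
    return ads


ads = parse_ads(HTML)
for name, price in ads:
    print(name, ". price =", price)
```

Searching inside each `div.lista` rather than on the whole `soup` is what keeps every name paired with its own price.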
