python parse html elements while scraping

I have this website:
http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1
I want to get the names of all the ads and the price of each item into an array. What I have right now is:
import urllib2
from BeautifulSoup import BeautifulSoup
import re
listofads = []
page = urllib2.urlopen("http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1").read()
soup = BeautifulSoup(page)
for a in soup.findAll("div", {"class":re.compile("lista")}):
    for i in a:
        c = soup.findAll('h2')
        y = soup.findAll("span", {"class":re.compile("right")})
        listofads.append(c)
        listofads.append(y)
print listofads
What I get is something like this:
</h2>, <h2>
Procura: Macbook Pro i7, 15'
</h2>], [<span class="right">50 €</span>
which looks very bad. I want to get:
Macbook bla bla . price = 500
Macbook B . price = 600
and so on
The HTML of the site looks like this:
<div class="listofads">
<div class="lista " style="cursor: pointer;">
<div class="lista " style="cursor: pointer;">
<div class="li_image">
<div class="li_desc">
<a href="http://www.custojusto.pt/Lisboa/Laptops/Macbook+pro+15-11018054.htm?xtcr=2&" name="11018054">
<h2> Macbook pro 15 </h2>
</a>
<div class="clear"></div>
<span class="li_date largedate listline"> Informática & Acessórios - Loures </span>
<span class="li_date largedate listline">
</div>
<div class="li_categoria">
<span class="li_price">
<ul>
<li>
<span class="right">1 199 €</span>
<div class="clear"></div>
</li>
<li class="excep"> </li>
</ul>
</span>
</div>
<div class="clear"></div>
</div>
As you can see, I only want the h2 text inside the div with the class "li_desc" and the price from the span with the class "right".

I don't know how to do it with BeautifulSoup as it doesn't support xpath, but here's how you could do it nicely with lxml:
import urllib2
from lxml import etree
from lxml.cssselect import CSSSelector
url = "http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
my_products = []
# Here, we harvest all the results into a list of dictionaries, containing the items we want.
for product_result in CSSSelector(u'div.lista')(tree):
    # Now, we can select the child elements of each div.lista.
    this_product = {
        u'name': product_result.xpath('div[2]/a/h2'), # first h2 of the second child div
        u'category': product_result.xpath('div[2]/span[1]'), # first span of the second child div
        u'price': product_result.xpath('div[3]/span/ul/li[1]/span'), # Third div, span, ul, first li, span tag.
    }
    print this_product.get(u'name')[0].text
    my_products.append(this_product)
# Let's inspect a product result now:
for product in my_products:
    print u'Product Name: "{0}", costs: "{1}"'.format(
        product.get(u'name')[0].text.replace(u'Procura:', u'').strip() if product.get(u'name') else 'NONAME!',
        product.get(u'price')[0].text.strip() if product.get(u'price') else u'NO PRICE!',
    )
And, here's some output:
Product Name: "Macbook Pro", costs: "890 €"
Product Name: "Memoria para Macbook Pro", costs: "50 €"
Product Name: "Macbook pro 15", costs: "1 199 €"
Product Name: "Macbook Air 13", costs: "1 450 €"
Some items do not contain a price, so results need to be checked before outputting each one.
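For completeness, the same extraction can be sketched with BeautifulSoup itself. This is only a minimal sketch, assuming the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import used in the question. The inline HTML is a trimmed copy of the question's markup plus one hypothetical ad without a price, and the helper name extract_ads is made up for illustration. The main change from the question's attempt is that the h2 and the price are searched for inside each div.lista instead of on the whole soup.

```python
from bs4 import BeautifulSoup

# Trimmed sample modelled on the question's markup.
# The second ad (no price) is a hypothetical addition for illustration.
SAMPLE_HTML = """
<div class="listofads">
  <div class="lista " style="cursor: pointer;">
    <div class="li_desc">
      <a href="#"><h2> Macbook pro 15 </h2></a>
    </div>
    <div class="li_categoria">
      <span class="li_price"><ul><li><span class="right">1 199 €</span></li></ul></span>
    </div>
  </div>
  <div class="lista " style="cursor: pointer;">
    <div class="li_desc">
      <a href="#"><h2> Procura: Macbook Pro i7 </h2></a>
    </div>
    <div class="li_categoria"></div>
  </div>
</div>
"""


def extract_ads(html):
    """Return a list of (title, price) tuples; price is None when an ad has none."""
    soup = BeautifulSoup(html, "html.parser")
    ads = []
    for ad in soup.select("div.lista"):
        # Search inside this ad only, not the whole document.
        title_tag = ad.select_one("div.li_desc h2")
        price_tag = ad.select_one("span.right")
        if title_tag is None:
            continue  # skip blocks that carry no title
        title = title_tag.get_text(strip=True)
        price = price_tag.get_text(strip=True) if price_tag else None
        ads.append((title, price))
    return ads


for title, price in extract_ads(SAMPLE_HTML):
    print("{0} . price = {1}".format(title, price or "n/a"))
```

Returning None for a missing price keeps the check mentioned above in one place, so callers can decide how to display such ads.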

Related

How print li strong without class or id in beautifulsoup

I have this HTML and want to scrape the version number, '1.6.3':
<div class="product-short-description">
<ul>
<li>Very cheap price & Original product !</li>
<li><strong>Product Version :</strong> 1.6.3</li>
<li><strong>Product Last Updated :</strong> 08.12.2021</li>
</ul>
</div>
There is no id or class on the li or strong tags. This is my code:
version_soup = soup_linke.find(class_='product-short-description')
strong_items = version_soup.find_all('strong')
li_items = version_soup.find_all('li')
for i,z in zip(strong_items, li_items):
    if i.get_text() == 'Product Version :':
        print(z.text)
    else:
        continue

To print the <li> text without the <strong> text a generic approach would be to use:
.find(text=True, recursive=False)
Example
from bs4 import BeautifulSoup
html='''
<div class="product-short-description">
<ul>
<li>Very cheap price & Original product !</li>
<li><strong>Product Version </strong> 1.6.3</li>
<li><strong>Product Last Updated </strong> 08.12.2021</li>
</ul>
</div>
'''
soup = BeautifulSoup(html)
for e in soup.select('.product-short-description li'):
    print(e.find(text=True, recursive=False).strip())
Output
Very cheap price & Original product !
1.6.3
08.12.2021
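If only one value such as the version is needed, another option is to locate the label and read the text node that follows it. The following is a minimal sketch, assuming bs4 on Python 3 and the markup from the question; the helper name value_after_label is hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<div class="product-short-description">
<ul>
<li>Very cheap price &amp; Original product !</li>
<li><strong>Product Version :</strong> 1.6.3</li>
<li><strong>Product Last Updated :</strong> 08.12.2021</li>
</ul>
</div>
"""


def value_after_label(soup, label):
    """Return the text following the <strong> whose text starts with `label`."""
    for strong in soup.select(".product-short-description li > strong"):
        if strong.get_text(strip=True).startswith(label):
            sibling = strong.next_sibling  # the bare text node after </strong>
            return str(sibling).strip() if sibling is not None else None
    return None  # label not found


soup = BeautifulSoup(html, "html.parser")
print(value_after_label(soup, "Product Version"))
```

Matching on the label text avoids pairing the strong and li lists with zip, which drifts out of step as soon as one li has no strong tag.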

Extracting text from multiple spans with different classes using BeautifulSoup

I am trying to extract some data from a webpage that I've parsed through BeautifulSoup.
<div class="product-data-list data-points-en_GB">
<div class="float-left in-left col-totalNetAssets" style="height: 36px;">
<span class="caption">
Net Assets of Share Class
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 36,636,694,134
</span>
</div>
<div class="float-left in-right col-totalNetAssetsFundLevel">
<span class="caption">
Net Assets of Fund
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 37,992,258,237
</span>
</div>
<div class="float-left in-left col-baseCurrencyCode" style="height: 16px;">
<span class="caption">
Fund Base Currency
<span class="as-of-date">
</span>
</span>
<span class="data">
USD
</span>
</div>
I want to capture the information from the 'caption', 'as-of-date' and 'data' spans to create something like:
[('Net Assets of Share Class','20-Jul-20','USD 36,636,694,134'),
('Net Assets of Fund','20-Jul-20','USD 37,992,258,237'),
('Fund Base Currency','','USD')]
This is my code:
data=[]
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for span in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        a = span.find("span", {"class": "caption"}).text
        b = span.find("span", {"class": "as-of-date"}).text
        c = span.find("span", {"class": "data"}).text
        data.append((a,b,c))
However, I only get one result when I look at the list 'data':
[('\nNet Assets of Share Class\n\nas of 20-Jul-20\n\n', '\nas of 20-Jul-20\n', '\nUSD 36,636,694,134\n')]
Aside from needing to strip out the new lines, I know I am missing something to get the script to go through all the other spans but have been staring at the screen for so long, it isn't getting any clearer.
Can anyone help put me out of my misery?!

One solution is to loop over all the div elements nested under your main div with the class "product-data-list data-points-en_GB". That way you get the spans you want from each inner div.
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        for divEle in element.findAll('div'):
            a = divEle.find("span", {"class": "caption"}).text
            b = divEle.find("span", {"class": "as-of-date"}).text
            c = divEle.find("span", {"class": "data"}).text
            data.append((a,b,c))
This makes for a lot of nested loops, so I don't recommend it. I suggest finding a more precise selector. If you have a URL with the HTML, I could take a look.

I have stumbled upon a solution which seems to do the trick:
data=[]
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        for thing in element.findChildren('div'):
            a = thing.findNext("span", {"class": "caption"}).text
            b = thing.findNext("span", {"class": "as-of-date"}).text
            c = thing.findNext("span", {"class": "data"}).text
            data.append((a,b,c))
It's not perfect but hopefully functional.
thanks all
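A flatter variant of the same idea uses a CSS child selector and strips the whitespace along the way. This is a minimal sketch, assuming bs4 on Python 3 and a trimmed copy of the question's markup; the wrapping div with id keyFundFacts is taken from the question's code, the helper name fund_facts is hypothetical, and dropping the "as of " prefix is an assumption based on the desired output.

```python
from bs4 import BeautifulSoup

html = """
<div id="keyFundFacts">
<div class="product-data-list data-points-en_GB">
<div class="float-left in-left col-totalNetAssets">
<span class="caption">
Net Assets of Share Class
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 36,636,694,134
</span>
</div>
<div class="float-left in-left col-baseCurrencyCode">
<span class="caption">
Fund Base Currency
<span class="as-of-date">
</span>
</span>
<span class="data">
USD
</span>
</div>
</div>
</div>
"""


def fund_facts(soup):
    """Return one (caption, as_of_date, data) tuple per inner div."""
    rows = []
    for block in soup.select("div.product-data-list > div"):
        caption = block.select_one("span.caption")
        # First direct text node of the caption, ignoring the nested as-of-date span.
        label = caption.find(string=True, recursive=False).strip()
        as_of = caption.select_one("span.as-of-date").get_text(strip=True)
        if as_of.startswith("as of "):
            as_of = as_of[len("as of "):]
        value = block.select_one("span.data").get_text(strip=True)
        rows.append((label, as_of, value))
    return rows


rows = fund_facts(BeautifulSoup(html, "html.parser"))
print(rows)
```

The sketch assumes every inner div contains all three spans; a page where one is missing would need a None check before calling get_text.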

How to select second div tag with same classname?

I'm trying to select the second div tag with the info class name, but have had no success using bs4's find_next. How do you select the text inside the second of several div tags that share a class name?
[<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>
<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>]
Here is what I have tried
from bs4 import BeautifulSoup
import requests
players_url =['http://www.premierleague.com//players/13559/Axel-Tuanzebe/stats']
# this is dict where we store all information:
players = {}
for url in players_url:
    player_page = requests.get(url)
    cont = BeautifulSoup(player_page.content, 'lxml')
    data = dict((k.contents[0].strip(), v.get_text(strip=True)) for k, v in zip(cont.select('.topStat span.stat, .normalStat span.stat'), cont.select('.topStat span.stat > span, .normalStat span.stat > span')))
    club = {"Club" : cont.find('div', attrs={'class' : 'info'}).get_text(strip=True)}
    position = {"Position": cont.find_next('div', attrs={'class' : 'info'})}
    players[cont.select_one('.playerDetails .name').get_text(strip=True)] = data
    print(position)

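You can try the following, which calls find_next on the first matching div rather than on the whole document:
club_ele = cont.find('div', attrs={'class' : 'info'})
club = {"Club" : club_ele.get_text(strip=True)}
position = {"Position": club_ele.find_next('div', attrs={'class' : 'info'}).get_text(strip=True)}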
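Another way to reach the second match is to collect every div.info and index into the result. The sketch below assumes bs4 on Python 3 and uses a static copy of the markup from the question in place of the live page.

```python
from bs4 import BeautifulSoup

html = """
<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>
"""

soup = BeautifulSoup(html, "html.parser")
infos = soup.select("div.info")  # all matches, in document order

club = infos[0].get_text(strip=True)
# Guard the index in case the page has fewer matches than expected.
position = infos[1].get_text(strip=True) if len(infos) > 1 else None

print({"Club": club, "Position": position})
```

Indexing depends on the order of the divs staying stable, so it is worth checking the list length before relying on a fixed position.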

Parsing IMDB with BeautifulSoup

I've stripped the following code from IMDB's mobile site using BeautifulSoup, with Python 2.7.
I want to create a separate object for the episode number '1', title 'Winter is Coming', and IMDB score '8.9'. Can't seem to figure out how to split apart the episode number and the title.
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>

You can use find to locate the span with the class text-large.
Once you have that span, next_element gives you the text node right after it, which contains the episode number, and find locates the strong tag containing the title:
html = """
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
span = soup.find('span', class_='text-large')
ep = str(span.next_element).strip()
title = str(span.find('strong').text).strip()
print(ep)
print(title)
> 1.
> Winter Is Coming

Once you have each a.btn-full element, you can use the span classes to get the tags you want. The strong tag is a child of the span with the text-large class, so you just need to call .strong.text on that Tag. For the span with the CSS classes mobile-sprite tiny-star, you need to find the next strong tag, as it is a sibling of the span, not a child:
h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h)
title = soup.select_one("span.text-large").strong.text.strip()
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(title, score)
Which gives you:
(u'Winter Is Coming', u'8.9')
If you really want to get the episode the simplest way is to split the text once:
soup = BeautifulSoup(h)
ep, title = soup.select_one("span.text-large").text.split(None, 1)
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(ep, title.strip(), score)
Which will give you:
(u'1.', u'Winter Is Coming', u'8.9')
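When all the text fragments of the link are wanted at once, stripped_strings is another option: it yields every non-empty text node with surrounding whitespace removed. This is a minimal sketch, assuming bs4 on Python 3 and assuming that every a.btn-full carries exactly four fragments in this order (episode number, title, score, air date), which holds for the sample but has not been checked against the rest of the page.

```python
from bs4 import BeautifulSoup

h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""

soup = BeautifulSoup(h, "html.parser")
link = soup.select_one("a.btn-full")

# Every non-empty text node inside the link, already stripped.
parts = list(link.stripped_strings)
episode_text, title, score_text, air_date = parts

episode = int(episode_text.rstrip("."))  # "1." -> 1
score = float(score_text)                # "8.9" -> 8.9
print(episode, title, score, air_date)
```

The tuple unpacking raises a ValueError if a link has a different number of fragments, which makes a layout change visible instead of silently shifting the fields.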

Using requests to fetch the page HTML and a regular expression search:
import os, sys, requests
frame = ('http://www.imdb.com/title/tt1480055?ref_=m_ttep_ep_ep1')
f = requests.get(frame)
helpme = f.text
import re
result = re.findall('itemprop="name" class="">(.*?) ', helpme)
result2 = re.findall('"ratingCount">(.*?)</span>', helpme)
result3 = re.findall('"ratingValue">(.*?)</span>', helpme)
print result[0].encode('utf-8')
print result2[0]
print result3[0]
output:
Winter Is Coming
24,474
9.0

How to print the text inside of a child tag and the href of a grandchild element with a single BeautifulSoup Object?

I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
    #prints 123
    #prints http://linktoitem
Also - what is the difference between the select function and find* functions?

You can find both items using find() relying on the class attributes:
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print number, link
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
<a href="http://linktoitem">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
<a href="http://linktoitem2">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
<a href="http://linktoitem3">$1.23</a>
</span>
</div>
</body>
"""
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print number, link
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select(): this method lets you use CSS selectors to search the page. Also note the class_ argument: the underscore is important, since class is a reserved keyword in Python.
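On the difference between the two families: select() and select_one() take a CSS selector string, while find() and find_all() take a tag name plus attribute filters. For simple lookups they reach the same elements. The sketch below illustrates this, assuming bs4 on Python 3. It also assumes that each span.cost wraps an a tag, as the expected output in the question implies; the second item deliberately has no link to show a None check.

```python
from bs4 import BeautifulSoup

# The <a> inside span.cost is an assumption based on the question's expected output.
data = """
<div class="inventory">
<span class="item-number">123</span>
<span class="cost"><a href="http://linktoitem">$1.23</a></span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">$4.56</span>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
results = []
for inventory in soup.find_all("div", class_="inventory"):
    # find() with a class_ filter and select_one() with a CSS selector
    # both return the first matching Tag inside this div.
    number = inventory.find("span", class_="item-number").get_text(strip=True)
    link = inventory.select_one("span.cost > a")
    href = link.get("href") if link is not None else None
    results.append((number, href))

for number, href in results:
    print(number, href)
```

Checking the link for None before reading href avoids the AttributeError that .a.get('href') raises when an item has no anchor.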
