Using SPLIT to create a list of HTML

I have a return value from a search I'm doing which returns a lot of HTML.
for i in deal_list:
    regex2 = '(?s)'+'<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="'+ i +'"'+'.+?</figure>'
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2,htmltext)
    html_captured = info2.group(0).split('</figure>')
    print html_captured
Here is an example of what is being returned:
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
<p class="merchant-name truncation ">1742 Wine Bar</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Upper East Side</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$178.90</s>
<s class="discount-price">$49</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
<p class="merchant-name truncation ">Statler Grill</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Midtown</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$213</s>
<s class="discount-price">$89</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
I want to use html_captured = info2.group(0).split('</figure>') so that the HTML between each pair of <figure> tags becomes one element of a single list, in this case html_captured.
It kind of works, except that each match becomes its own list with a '' at the end. For example: ['<figure .... </figure>','']['<figure .... </figure>','']
But what I want is ['<figure .... </figure>','<figure .... </figure>','<figure .... </figure>'...etc]

There are special tools for parsing HTML: HTML parsers.
Example using BeautifulSoup:
from bs4 import BeautifulSoup
data = """
your html here
"""
soup = BeautifulSoup(data, "html.parser")
# One string per <figure> element, all in a single list
print([str(figure) for figure in soup.find_all('figure')])
Also see why you should not use regex for parsing HTML:
RegEx match open tags except XHTML self-contained tags
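The stray '' in the question comes from str.split itself: when the string ends with the delimiter, split returns an empty string after it, and it also drops the delimiter from every piece. Since each loop iteration matches exactly one figure, a minimal sketch that skips split and appends each match to one shared list could look like this. It uses only the standard library; the shortened class attribute and the names html_text and deal_ids are assumptions for illustration, and a parser such as BeautifulSoup remains the more robust choice for real pages.

```python
import re

# Hypothetical, shortened stand-in for the page HTML from the question.
html_text = (
    '<figure class="deal-card" data-bhc="deal:a"><p>Deal A</p></figure>\n'
    '<figure class="deal-card" data-bhc="deal:b"><p>Deal B</p></figure>\n'
)
deal_ids = ["deal:a", "deal:b"]

# split() drops the delimiter and leaves an empty string after a trailing one.
split_demo = "<figure>x</figure>".split("</figure>")

html_captured = []  # one shared list, created before the loop
for deal_id in deal_ids:
    pattern = re.compile(
        r'(?s)<figure class="deal-card" data-bhc="'
        + re.escape(deal_id)
        + r'".+?</figure>'
    )
    match = pattern.search(html_text)
    if match:  # skip ids that do not appear in the page
        html_captured.append(match.group(0))

print(split_demo)
print(html_captured)
```

Creating the list before the loop is what turns the separate per-match lists into a single list of complete figure blocks.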

Related

Python BeautifulSoup Same name DIV, ignore first

I am playing around with Python's bs4 and trying to work out how to skip the first of several DIVs with the same class name, so that I collect the data from the second one only.
Below is an example of the markup; I am trying to extract the part marked ##Wanted data##.
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html_text>" , "html.parser")
div = soup.find("div")
PowerDetails = div.find_all("div", {"class":"PowerDetails"})
PowerDetails[1].find_all("p", {"class":"RunningCost"})
PowerDetails[1].find_all("p", {"class":"Time"})

find_all() returns a list. Use slicing or an index to access just the elements you want.

You can slice the resulting list to get the elements from index 1 onwards. But first, note that your code is not finding the right tags.
from bs4 import BeautifulSoup
html_doc = """
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc , "html.parser")
# You can get the divs with one line of code
powerDetails = soup.find_all(class_="PowerDetails")
print(len(powerDetails)) # Outputs 2
Now, you can slice the list to ignore the first div
powerDetails = powerDetails[1:] # Get elements from 2nd element onwards (ignoring the first one)
print(len(powerDetails)) # Outputs 1
Now, you will have a list with one element only
print(powerDetails)
Output:
[<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>]
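To go one step further and pull the RunningCost and Time values out of the remaining div, here is a minimal sketch. It deliberately swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without extra packages; that only works because this trimmed snippet happens to be well-formed XML once wrapped in a root element (an assumption, real pages are rarely that clean). The variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question, wrapped in a <root> element
# so that it parses as XML.
snippet = """
<root>
  <div class="PowerDetails">
    <div class="Company">
      <p class="RunningCost">$4.44</p><p class="Time">peek</p>
    </div>
  </div>
  <div class="PowerDetails">
    <div class="Company">
      <p class="RunningCost">$8.88</p><p class="Time">peek</p>
      <p class="RunningCost"> $9.99</p><p class="Time">Off-peek</p>
    </div>
    <div class="Company">
      <p class="RunningCost">$8.88</p><p class="Time">peek</p>
      <p class="RunningCost"> $7.77</p><p class="Time">Off-peek</p>
    </div>
  </div>
</root>
"""

root = ET.fromstring(snippet)
power_details = root.findall("./div[@class='PowerDetails']")

companies = []
for block in power_details[1:]:  # skip the first PowerDetails div
    for company in block.findall("./div[@class='Company']"):
        costs = [p.text.strip() for p in company.findall("./p[@class='RunningCost']")]
        times = [p.text.strip() for p in company.findall("./p[@class='Time']")]
        # Pair each time label with the cost that precedes it.
        companies.append(list(zip(times, costs)))

print(companies)
```

The same slice-then-search pattern carries over to BeautifulSoup: slice the find_all result first, then call find_all again on each remaining element.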

Another method.
from simplified_scrapy import SimplifiedDoc
html = '''
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
'''
doc = SimplifiedDoc(html)
# First method, get all, use index.
PowerDetails = doc.selects('div.PowerDetails')[1].selects(
    'div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])
# Second method, skip the first with parameter start
PowerDetails = doc.getElement(
    'div', value='PowerDetails',
    start='class="PowerDetails"').selects('div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])
Result:
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]

Iterate div tags using lxml and retrieve text for a dictionary in python

I have just discovered lxml in Python and need some help, as I have no experience with XPath.
I want to get text data from a webpage into a dictionary.
I'm referring to the HTML snippet posted below. The original HTML page contains a div element of the class general-info, which I retrieve using the following line:
general_info = document_tree.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]")
From here on I want to iterate over the nested divs and get the two <p> tags as key and value, with the text inside the <strong> being the key.
There can also be empty div tags, and there is a special case where the key and the value for the dictionary are within the same div (see the last element).
EDIT:
The number of elements can change, so it would be best to use the <strong> tags as starting point and then search for the next <p> tag.
This is code that I was able to write using BeautifulSoup:
generalinfo = documentSoup.findAll("div", {"class": "general-info"})
if generalinfo:
    strongs = generalinfo[0].find_all('strong')
    for descr in strongs:
        p = descr.find_next_sibling("p")
        if p:
            key = descr.text.strip().rstrip(':')
            details_dict[key] = p.text.strip()
        else:
            nextdiv = descr.parent.parent.find_next_sibling("div")
            if nextdiv:
                child = nextdiv.findChild()
                if child:
                    key = descr.text.strip()[:-1]
                    details_dict[key] = child.text.strip()
I am going for the following output:
{'Title:' : 'This is a title',
'Owner:' : 'This is an owner',
'Category:' : 'This is a category',
'Type:' : 'This is a type',
'Special case:' : 'This is a special case'}
If anyone can help me out here I would appreciate it!
html code:
<body>
<main>
<div>
...
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
...

I believe this is about as generalized as I can get given the html provided:
general_info = doc.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]//p[@class='margin-0']")
for i in general_info:
    # A <p> holding a <strong> supplies the topic for the next text <p>
    if len(i.xpath('./strong/text()')) > 0:
        topic = i.xpath('./strong/text()')[0]
    if len(i.text.strip()) > 0:
        print(topic + ' ' + i.text.replace('\n', '').strip())
# Special case: <strong> and <p> are siblings inside the same div
special = general_info[0].xpath('./ancestor::div[@class="general-info margin-bottom-20 margin-top-20"]//div/div/strong')[0]
print(special.text + " ", special.xpath('./following-sibling::p/text()')[0])
Output:
('Title: This is a title',
'Owner: This is an owner',
'Category: This is a category',
'Type: This is a type',
'Special case: This is a special case')
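Since the question asks for a dictionary and suggests using the <strong> tags as the starting point, here is a minimal sketch of that idea. It is not the XPath approach above: it walks the elements in document order with the standard library's xml.etree.ElementTree, remembers the most recent <strong> text as the key, and takes the next <p> with non-empty text as the value. The function name pairs_from_strong and the trimmed snippet are assumptions for illustration, and the snippet must be well-formed XML for this parser.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the markup from the question (one regular row,
# one empty row, and the special case).
snippet = """
<div class="general-info margin-bottom-20 margin-top-20">
  <div class="row padding-x-20">
    <div class="col-sm-4"><p class="margin-0"><strong>Title:</strong></p></div>
    <div class="col-sm-8"><p class="margin-0">This is a title</p></div>
  </div>
  <div class="row padding-x-40"></div>
  <div class="row padding-x-40">
    <div>
      <strong>Special case:</strong>
      <p>This is a special case</p>
    </div>
  </div>
</div>
"""

def pairs_from_strong(container):
    """Map each <strong> text to the text of the next non-empty <p>."""
    details = {}
    pending_key = None
    for element in container.iter():  # document order
        text = (element.text or "").strip()
        if element.tag == "strong":
            pending_key = text
        elif element.tag == "p" and pending_key is not None and text:
            details[pending_key] = text
            pending_key = None
    return details

details = pairs_from_strong(ET.fromstring(snippet))
print(details)
```

Because the walk only depends on document order, empty rows and a changing number of elements do not need special handling.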

I recommend another solution, which is very suitable for extracting data from XML.
from simplified_scrapy.spider import SimplifiedDoc
html='''
<body>
<main>
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
'''
data={}
doc = SimplifiedDoc(html) # create doc
divs = doc.selects('div.general-info')
# First way
for div in divs:
    strongs = div.strongs
    for strong in strongs:
        p = strong.next
        if not p:
            p=strong.parent.next
        data[strong.text]=p.text
print(data)
data={}
# Second way
for div in divs:
    ds = div.selects('strong|p>text()')
    for i in range(0,len(ds),2):
        data[ds[i]]=ds[i+1]
print(data)
Result:
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples/

BeautifulSoup: scraping for a span gives me a result, for another span it gives "None"

I am coding a scraper for Etsy and when I scrape the span for reviews I get the right output. However when I scrape for the span with the price it gives me only None values and I don't understand why. If someone could help, it would be great!
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs each listing card
divs = page_soup.find_all("div", {"class": "v2-listing-card__shop"})
for i in divs:
    shop = i.p.text
    reviews = i.find("span", {"class" : "text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1"})
    prices = i.find("span", {"class" : "currency-value"})
    print shop
    print reviews.text
    print prices
Here are the two span elements as on the website:
<div class="v2-listing-card__info">
<p class="text-gray text-truncate mb-xs-0 text-body">
Blush Watercolor Flowers & Leaves with Different Shades Clipart Separate Elements Hand Painted Commercial Use | S15 Fairy Tale
</p>
<div class="v2-listing-card__shop">
<p class="text-gray-lighter text-body-smaller display-inline-block mr-xs-1">PatishopArt</p>
<div class="v2-listing-card__rating icon-t-2">
<div class="stars-svg stars-smaller ">
<input name="initial-rating" type="hidden" value="5"/>
<input name="rating" type="hidden" value="5"/>
<span class="screen-reader-only">5 out of 5 stars</span>
<div aria-hidden="true" class="rating lit rating-first icon-b-2" data-rating="1">
<span class="etsy-icon stars-svg-star" title="Disappointed"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="2">
<span class="etsy-icon stars-svg-star" title="Not a fan"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="3">
<span class="etsy-icon stars-svg-star" title="It's okay"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="4">
<span class="etsy-icon stars-svg-star" title="Like it"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="5">
<span class="etsy-icon stars-svg-star" title="Love it"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
</div>
</div>
</div>
</div>
</div>
</div>
<span class="text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1">(110)</span>
</div>
</div>
<p class="n-listing-card__price text-gray strong mt-xs-0">
<span class="currency-symbol">$</span><span class="currency-value">6.60</span>
</p>
<!-- This shows Free shipping on its own line , we only show it if it wasn't shown above -->
</div>

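You are only searching inside the divs with class v2-listing-card__shop, but the span with the price is outside those divs: it sits in the sibling <p class="n-listing-card__price"> under the parent v2-listing-card__info div. Search from that parent div instead.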
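The effect of the search scope can be shown with a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, and a heavily shortened card in which the reviews class name is a made-up stand-in for the long class string in the question.

```python
import xml.etree.ElementTree as ET

# Shortened listing card: the price <p> is a sibling of the shop <div>.
card = ET.fromstring("""
<div class="v2-listing-card__info">
  <div class="v2-listing-card__shop">
    <p>PatishopArt</p>
    <span class="reviews">(110)</span>
  </div>
  <p class="n-listing-card__price">
    <span class="currency-symbol">$</span><span class="currency-value">6.60</span>
  </p>
</div>
""")

shop = card.find("./div[@class='v2-listing-card__shop']")

# Searching inside the shop div cannot see the price span ...
price_in_shop = shop.find(".//span[@class='currency-value']")
# ... but searching from the enclosing card finds it.
price_in_card = card.find(".//span[@class='currency-value']")

print(price_in_shop)
print(price_in_card.text)
```

In the BeautifulSoup code from the question, the equivalent change would be to loop over the v2-listing-card__info divs and call find for shop, reviews and price on each of those.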

How can one replace an element in lxml?

I get text (data entered by CRM users) from a web service, which returns it in a "terrifying format". I am filtering it with Python before using the data, but when I remove the line breaks (br), the text that follows them is removed as well. The code is as follows:
description = '''
<div id="highlight" class="section">
<p>
text...............
</p>
<br>
<h1>TITLE</h1>
<p>Multiple text
<br>
</p>
<ul>
<li>bad layer....</li>
</ul>
<p>
<br>subTitle
</p>
<p> </p>
<p style="text-align: center;">
<br>Text1
<br>Text2
<br>Text3
<br>Text4
<br>Text5
<br>Text6
</p>
<p style="text-align: center;">
<strong>small title</strong>
<br>Text small</p>
<p style="text-align: center;">
<strong>highlighted text</strong>
<br>
<br><strong>Text1</strong>
<br>Text2
<br>Text3
<br>Text4
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>Text1
<br>Text2
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>description
</p>
<p style="text-align: center;">
<br> </p>
<p><strong>description two</strong></p>
<p>
<br> </p>
</div>
'''
from lxml import html
tree = html.fragment_fromstring( description )
for element in tree.xpath('//br'):
    #element.getparent().remove(element)
    print element.text
    print element.getparent().getchildren()
    #print element
    #print element.getparent()
    #print element.getchildren()
    #print element.getnext()
    #print '--------------------------------'
I have tried to remove the br with element.getparent().remove(element), but that also deletes the text. I ran tests to see whether the texts belong to any node, but they do not seem to.
I've thought about replacing each br with an li and turning each styled p into a ul, but I can't work out how to do it. Something like this (the preceding text unchanged):
..........
..........
<ul>
<li>Text1</li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
<li>Text5</li>
<li>Text6</li>
</ul>
<ul>
<li><strong>small title</strong></li>
<li>Text small</li></ul>
<ul>
<li><strong>highlighted text</strong></li>
<li><strong>Text1</strong></li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>Text1</li>
<li>Text2</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>description</li>
</ul>
<ul>
<li> </li></ul>
........
I can't work out how to get at the texts. My idea was to select the styled p nodes with XPath, take their values, create li child nodes under a new parent ul, and remove the p.
Is this possible? Thanks
Regards

You can use lxml.etree.strip_elements, like so:
import lxml.etree
import lxml.html
tree = lxml.html.fragment_fromstring(description)
lxml.etree.strip_elements(tree, 'br', with_tail=False)
print(lxml.etree.tostring(tree, pretty_print=True))
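Some background on why the text disappears: in lxml, text that follows an element is stored in that element's tail attribute, so removing a <br> takes its tail with it unless you keep it, which is what with_tail=False does. The same text/tail model can be used for the br-to-li conversion the question asks about. The following is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree (which shares the text/tail model) rather than lxml, the input must be well-formed XML (<br/> instead of <br>), and paragraph_to_list is a made-up helper name.

```python
import xml.etree.ElementTree as ET

def paragraph_to_list(p):
    """Build a <ul> with one <li> per <br>-separated segment of p."""
    ul = ET.Element("ul")
    li = ET.SubElement(ul, "li")
    li.text = (p.text or "").strip()
    for child in list(p):
        if child.tag == "br":
            # The text after a <br> lives in its tail; start a new item with it.
            li = ET.SubElement(ul, "li")
            li.text = (child.tail or "").strip()
        else:
            # Keep inline elements such as <strong> inside the current item.
            child.tail = (child.tail or "").strip()
            li.append(child)
    # Drop items that ended up with neither text nor children.
    for item in list(ul):
        if not (item.text or len(item)):
            ul.remove(item)
    return ul

p = ET.fromstring(
    "<p><strong>highlighted text</strong><br/><br/><strong>Text1</strong>"
    "<br/>Text2<br/>Text3</p>"
)
result = ET.tostring(paragraph_to_list(p), encoding="unicode")
print(result)
```

To apply it to a whole fragment, each styled p would be replaced by the ul returned for it; that replacement step is left out here.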

Issue with scraping data using beautiful soup

I am using following code to scrape data from a website.
# -*- coding: cp1252 -*-
import urllib2
import sys
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/plans-new.html#fbid=U-XD_DHOGEp').read()
soup = BeautifulSoup(page)
plans = soup.findAll('div', {"class": "planTitle"})
for plan in plans:
    planname = u' '.join(plan.stripped_strings)
    plantypes = soup.findAll('div', {"class":"top"})
    prices = soup.findAll('div', {"class":"bottom"})
    for plantype, price in zip(plantypes, prices):
        plantype1 = u' '.join(plantype.stripped_strings)
        price1 = u' '.join(price.stripped_strings)
        print planname, plantype1, price1
Problem: The webpage mentioned in this code lists 4-5 types of plans, and each plan has 3 voice options and 2-3 data options. I want to scrape the data so that each plan is paired only with its own voice options and the monthly prices for those options.
The code I am running right now returns all possible combinations of plan name + voice option. For each plan name I get some 20-30 entries, because it creates an entry even for wrong plan name + voice option combinations. E.g. Individual Plan - 550 minutes - $59.99; in this combination, 550 minutes and $59.99 belong to the Family Plan.
I want the loop to run so that only the correct plan + voice option combinations are extracted.
Snippet of Webpage:
For each plan there is one box on the webpage, which contains the voice options and the prices corresponding to those options. I want the loop to run once per box, but the element + class combination for the voice options and their prices is not unique. That's why the plan name gets paired with values from other boxes as well.
<div class="innerContainer">
<div class="planTitle">
<h2>AT&T Individual Plans</h2>
</div>
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-clock.jpg" alt="">
<p>Voice plan options:</p>
</div>
<!-- Begin three white boxes -->
<!-- Note, extra boxes can be added to the row with the following method -->
<!-- 1. Add more div containers inside .whiteBox -->
<!-- 2. Modify class names to boxes_one, boxes_two, boxes_three etc... (max six) -->
<div class="whiteBox">
<div class="boxes_three">
<a class="lnk-help tooltips fullBoxLink" href="#smartphone_individual_voice_450" onclick="window.location.href = 'http://www.att.com/shop/wireless/plans/voice/sku3830290.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoice450';return false;" aria-describedby="smartphone_individual_voice_450" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010" title=""></a>
<span id="smartphone_individual_voice_450" class="tips" role="tooltip">$0.45/min. for additional minutes</span>
<div class="top">
<p class="stat">450</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$39.99/mo.</p>
</div>
</div>
<div class="boxes_three">
<a class="lnk-help tooltips fullBoxLink" href="#smartphone_individual_voice_900" onclick="window.location.href = 'http://www.att.com/shop/wireless/plans/voice/sku3830292.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoice900';return false;" aria-describedby="smartphone_individual_voice_900" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010" title=""></a>
<span id="smartphone_individual_voice_900" class="tips" role="tooltip">$0.40/min. for additional minutes</span>
<div class="top">
<p class="stat">900</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$59.99/mo.</p>
</div>
</div>
<div class="boxes_three borderNone">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/plans/voice/sku3830293.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoiceunlim" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<div class="top">
<p class="stat">Unlimited</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$69.99/mo.</p>
</div>
</div>
</div>
<!-- End three white boxes -->
<!-- Begin left gray container -->
<div class="containerTwoThirds">
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-globe.jpg" alt="">
<p>Data plan options:</p>
</div>
<div class="grayTwoThirds">
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/dataplus300mb-smartphone4glte-sku5380269.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata300mb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>300MB</strong></p>
<p class="statText">$20.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/datapro3gb-smartphone4glte-sku5470232.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata3gb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>3GB</strong></p>
<p class="statText">$30.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/datapro5gb-smartphone4glte-sku5480228.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata5gb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>5GB</strong></p>
<p class="statText">$50.00/mo.</p>
</div>
</div>
</div>
<!-- End left gray container -->
<!-- Begin right gray container -->
<div class="containerThird">
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-phone.jpg" alt="">
<p>Messaging plan options: <span class="fix"></span></p>
</div>
<div class="grayThird">
<div class="grayBox">
<a data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2012325" href="http://www.att.com/shop/wireless/services/messagingunlimited-sku1160055.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindmessunlim" class="fullBoxLink"></a>
<p class="stat"><strong>ULTD</strong> MSGS</p>
<p class="statText">$20.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox last">
<p class="stat"><strong>PAY PER USE</strong></p>
<p class="statText">20¢/text <span class="lightGray">|</span> 30¢/pic/video</p>
</div>
</div>
</div>
<!-- End right gray container -->
<!-- Begin sub footer -->
<div class="bottomLinks">
<div class="links">
Nation with Canada Plans | Unlimited Home Phone | Senior Plans
</div>
<a class="shop_button" href="http://www.att.com/shop/wireless/devices/smartphones.html?source=IC95ATPLP00PSP00L&wtExtndSource=indshopsp" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"><img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/buttons/shop_smartphones.png" alt="Shop Smartphones" width="158" height="29"></a>
</div>
<!-- End sub footer -->
</div>
Please help me in solving this issue as I am new to programming.

Rewrote the thing from scratch. There are no comments, but it's pretty self-explanatory. The lambda in the dictionary is for matching class values that start with a certain string. I referenced this answer for that: https://stackoverflow.com/a/2830550/541208
I had thought that you were using findAll on soup, when you should have been using plan.findAll instead, but then it didn't help anything, so I just rewrote the whole thing.
import urllib2
import sys
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/plans-new.html#fbid=U-XD_DHOGEp').read()
soup = BeautifulSoup(page)
#find the container for all the plans
tabcontent = soup.find('div', {"id": "smartphonePlans", "class": "tabcontent"})
containers = tabcontent.findAll('div', {"class": "innerContainer"})
for plan in containers:
    planTitle = plan.find("div", {"class": "planTitle"})
    if planTitle:
        title = planTitle.find("a").text
        print title
    voiceBoxes = plan.find("div", {"class": "whiteBox"})
    if voiceBoxes:
        box3 = voiceBoxes.findAll("div", {"class": lambda x: x and x.startswith("boxes_")})
        if box3:
            for box in box3:
                top = box.findAll("p")
                minutes = u" ".join([tag.text for tag in top])
                print "\t", minutes
Which outputs:
AT&T Individual Plans
450 Minutes $39.99/mo.
900 Minutes $59.99/mo.
Unlimited Minutes $69.99/mo.
AT&T Family Plans
550 Minutes $59.99/mo.
700 Minutes $69.99/mo.
1,400 Minutes $89.99/mo.
2,100 Minutes $109.99/mo.
Unlimited Minutes $119.99/mo.
AT&T Mobile Share Plans
1GB $40/mo. + $45/smartphone
4GB $70/mo. + $40/smartphone
6GB $90/mo. + $35/smartphone
10GB $120/mo.
15GB $160/mo. + $30/smartphone
20GB $200/mo.
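The key idea in the rewrite is scoping: every lookup for voice options starts from the plan's own container, not from the whole page, so options can never be paired with the wrong plan. A minimal sketch of that idea follows. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup and urllib2 so that it runs offline; the page string is a made-up, shortened stand-in for the real markup, and the plan names and the class boxes_five are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the plans page.
page = ET.fromstring("""
<div id="smartphonePlans">
  <div class="innerContainer">
    <div class="planTitle"><h2>Individual Plans</h2></div>
    <div class="whiteBox">
      <div class="boxes_three">
        <div class="top"><p class="stat">450</p><p class="statText">Minutes</p></div>
        <div class="bottom"><p>$39.99/mo.</p></div>
      </div>
      <div class="boxes_three borderNone">
        <div class="top"><p class="stat">Unlimited</p><p class="statText">Minutes</p></div>
        <div class="bottom"><p>$69.99/mo.</p></div>
      </div>
    </div>
  </div>
  <div class="innerContainer">
    <div class="planTitle"><h2>Family Plans</h2></div>
    <div class="whiteBox">
      <div class="boxes_five">
        <div class="top"><p class="stat">550</p><p class="statText">Minutes</p></div>
        <div class="bottom"><p>$59.99/mo.</p></div>
      </div>
    </div>
  </div>
</div>
""")

plans = {}
for container in page.findall("./div[@class='innerContainer']"):
    title = container.findtext("./div[@class='planTitle']/h2")
    options = []
    # Only look at boxes inside this plan's own container.
    for box in container.findall("./div[@class='whiteBox']/div"):
        if box.get("class", "").startswith("boxes_"):
            options.append(" ".join(p.text.strip() for p in box.iter("p")))
    plans[title] = options

print(plans)
```

The startswith check plays the same role as the lambda in the answer: it accepts boxes_three, boxes_five and so on without listing each class name.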
