Issue with scraping data using Beautiful Soup - Python

I am using the following code to scrape data from a website:
# -*- coding: cp1252 -*-
import urllib2
import sys
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.att.com/shop/wireless/plans-new.html#fbid=U-XD_DHOGEp').read()
soup = BeautifulSoup(page)
plans = soup.findAll('div', {"class": "planTitle"})
for plan in plans:
    planname = u' '.join(plan.stripped_strings)
    plantypes = soup.findAll('div', {"class": "top"})
    prices = soup.findAll('div', {"class": "bottom"})
    for plantype, price in zip(plantypes, prices):
        plantype1 = u' '.join(plantype.stripped_strings)
        price1 = u' '.join(price.stripped_strings)
        print planname, plantype1, price1
Problem: If you go through the webpage mentioned in this code, there are 4-5 types of plans, and each plan has 3 voice options and 2-3 data options. I want to scrape the data so that for each plan I get its respective voice options and the monthly prices for those options.
The code I am running right now returns all possible combinations of plan name + voice option. For each plan name I get some 20-30 entries, because it creates an entry even for a wrong plan name + voice option combination, e.g. Individual Plan - 550 minutes - $59.99, where 550 minutes and $59.99 are actually part of the Family Plan.
I want the loop to run so that only correct plan + voice option combinations are extracted.
Snippet of Webpage:
Each plan has one box on the webpage containing its voice options and the prices corresponding to those options. I want the loop to run for each box, but the element + class combination for voice options and their prices is not unique. That's why a plan name picks up values from other boxes as well.
<div class="innerContainer">
<div class="planTitle">
<h2>AT&T Individual Plans</h2>
</div>
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-clock.jpg" alt="">
<p>Voice plan options:</p>
</div>
<!-- Begin three white boxes -->
<!-- Note, extra boxes can be added to the row with the following method -->
<!-- 1. Add more div containers inside .whiteBox -->
<!-- 2. Modify class names to boxes_one, boxes_two, boxes_three etc... (max six) -->
<div class="whiteBox">
<div class="boxes_three">
<a class="lnk-help tooltips fullBoxLink" href="#smartphone_individual_voice_450" onclick="window.location.href = 'http://www.att.com/shop/wireless/plans/voice/sku3830290.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoice450';return false;" aria-describedby="smartphone_individual_voice_450" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010" title=""></a>
<span id="smartphone_individual_voice_450" class="tips" role="tooltip">$0.45/min. for additional minutes</span>
<div class="top">
<p class="stat">450</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$39.99/mo.</p>
</div>
</div>
<div class="boxes_three">
<a class="lnk-help tooltips fullBoxLink" href="#smartphone_individual_voice_900" onclick="window.location.href = 'http://www.att.com/shop/wireless/plans/voice/sku3830292.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoice900';return false;" aria-describedby="smartphone_individual_voice_900" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010" title=""></a>
<span id="smartphone_individual_voice_900" class="tips" role="tooltip">$0.40/min. for additional minutes</span>
<div class="top">
<p class="stat">900</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$59.99/mo.</p>
</div>
</div>
<div class="boxes_three borderNone">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/plans/voice/sku3830293.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoiceunlim" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<div class="top">
<p class="stat">Unlimited</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$69.99/mo.</p>
</div>
</div>
</div>
<!-- End three white boxes -->
<!-- Begin left gray container -->
<div class="containerTwoThirds">
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-globe.jpg" alt="">
<p>Data plan options:</p>
</div>
<div class="grayTwoThirds">
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/dataplus300mb-smartphone4glte-sku5380269.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata300mb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>300MB</strong></p>
<p class="statText">$20.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/datapro3gb-smartphone4glte-sku5470232.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata3gb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>3GB</strong></p>
<p class="statText">$30.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/datapro5gb-smartphone4glte-sku5480228.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata5gb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>5GB</strong></p>
<p class="statText">$50.00/mo.</p>
</div>
</div>
</div>
<!-- End left gray container -->
<!-- Begin right gray container -->
<div class="containerThird">
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-phone.jpg" alt="">
<p>Messaging plan options: <span class="fix"></span></p>
</div>
<div class="grayThird">
<div class="grayBox">
<a data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2012325" href="http://www.att.com/shop/wireless/services/messagingunlimited-sku1160055.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindmessunlim" class="fullBoxLink"></a>
<p class="stat"><strong>ULTD</strong> MSGS</p>
<p class="statText">$20.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox last">
<p class="stat"><strong>PAY PER USE</strong></p>
<p class="statText">20¢/text <span class="lightGray">|</span> 30¢/pic/video</p>
</div>
</div>
</div>
<!-- End right gray container -->
<!-- Begin sub footer -->
<div class="bottomLinks">
<div class="links">
Nation with Canada Plans | Unlimited Home Phone | Senior Plans
</div>
<a class="shop_button" href="http://www.att.com/shop/wireless/devices/smartphones.html?source=IC95ATPLP00PSP00L&wtExtndSource=indshopsp" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"><img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/buttons/shop_smartphones.png" alt="Shop Smartphones" width="158" height="29"></a>
</div>
<!-- End sub footer -->
</div>
Please help me solve this issue, as I am new to programming.

Rewrote the thing from scratch. There are no comments, but it's pretty self-explanatory. The lambda in the dictionary finds attributes that start with a certain string; I referenced this answer for that: https://stackoverflow.com/a/2830550/541208
I had thought the problem was that you were using findAll on soup where you should have been using plan.findAll, but that alone didn't help, so I just rewrote the whole thing.
import urllib2
import sys
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.att.com/shop/wireless/plans-new.html#fbid=U-XD_DHOGEp').read()
soup = BeautifulSoup(page)

# find the container for all the plans
tabcontent = soup.find('div', {"id": "smartphonePlans", "class": "tabcontent"})
containers = tabcontent.findAll('div', {"class": "innerContainer"})
for plan in containers:
    planTitle = plan.find("div", {"class": "planTitle"})
    if planTitle:
        title = planTitle.find("a").text
        print title
    voiceBoxes = plan.find("div", {"class": "whiteBox"})
    if voiceBoxes:
        box3 = voiceBoxes.findAll("div", {"class": lambda x: x and x.startswith("boxes_")})
        if box3:
            for box in box3:
                top = box.findAll("p")
                minutes = u" ".join([tag.text for tag in top])
                print "\t", minutes
Which outputs:
AT&T Individual Plans
    450 Minutes $39.99/mo.
    900 Minutes $59.99/mo.
    Unlimited Minutes $69.99/mo.
AT&T Family Plans
    550 Minutes $59.99/mo.
    700 Minutes $69.99/mo.
    1,400 Minutes $89.99/mo.
    2,100 Minutes $109.99/mo.
    Unlimited Minutes $119.99/mo.
AT&T Mobile Share Plans
    1GB $40/mo. + $45/smartphone
    4GB $70/mo. + $40/smartphone
    6GB $90/mo. + $35/smartphone
    10GB $120/mo.
    15GB $160/mo. + $30/smartphone
    20GB $200/mo.
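The same per-container approach extends to the data and messaging options the question also asks about: loop over each plan's grayBox divs so that options never bleed across plans. A sketch based on the HTML snippet above (class names come from that snippet; untested against the live page):
for plan in containers:
    planTitle = plan.find("div", {"class": "planTitle"})
    if planTitle:
        print u" ".join(planTitle.stripped_strings)
    # each grayBox holds one data or messaging option together with its price
    for gray in plan.findAll("div", {"class": "grayBox"}):
        print "\t", u" ".join(gray.stripped_strings)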

Related

I need to get text from 2 different span tags with BeautifulSoup

I am trying to get these values from the following code: the text in the span under Developer and the spans under Genres.
<div _ngcontent-sc18="" class="col-xl-6">
<!-- -->
<p _ngcontent-sc18="" class="mb-2">
<strong _ngcontent-sc18="">
Developer:
</strong>
<!-- -->
<span _ngcontent-sc18="">
Capy Games
</span>
</p>
<!-- -->
<!-- -->
<p _ngcontent-sc18="" class="mb-2">
<strong _ngcontent-sc18="">
Genres:
</strong>
<!-- -->
<span _ngcontent-sc18="">
Action,
</span>
<span _ngcontent-sc18="">
Platformer
</span>
</p>
</div>
I'm able to pull the values fine if there is only one genre, but if there are two, I can only figure out how to pull the first one. This is what I have so far:
for y in gamesoup.find_all('p', class_='mb-2'):
    if y.strong.text == "Developer: ":
        devs.append(y.span.text)
    if "Genre" in y.strong.text:
        genres.append(y.span.text)
What can I do to pull both of the values under Genre?
Thank you!
Use find_all('span') inside the Genre if statement:
for y in gamesoup.find_all('p', class_='mb-2'):
    if y.strong.text == "Developer: ":
        devs.append(y.span.text)
    if "Genre" in y.strong.text:
        for genre in y.find_all('span'):
            genres.append(genre.text)
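For reference, a self-contained run against a trimmed-down version of the snippet from the question (the matching on the label is made whitespace-insensitive here, since the pasted HTML pads the labels with newlines):
from bs4 import BeautifulSoup

html = '<p class="mb-2"><strong>Genres:</strong> <span>Action,</span> <span>Platformer</span></p>'
gamesoup = BeautifulSoup(html, 'html.parser')
genres = []
for y in gamesoup.find_all('p', class_='mb-2'):
    if "Genre" in y.strong.text:
        for genre in y.find_all('span'):
            genres.append(genre.text.strip())
print genres  # [u'Action,', u'Platformer']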

Extracting the right elements by text and span / Beautiful Soup / Python

I'm trying to scrape the following data:
Cuisine: 4.5
Service: 4.0
Quality: 4.5
But I'm having issues scraping the right data. I tried the following two snippets:
for bewertungen in soup.find_all('div', {'class': 'histogramCommon bubbleHistogram wrap'}):
    if bewertungen.find(text='Cuisine'):
        cuisine = bewertungen.find(text='Cuisine')
        cuisine = cuisine.next_element
        print("test " + str(cuisine))
    if bewertungen.find_all(text='Service'):
        for s_bewertung in bewertungen.find_all('span', {'class': 'ui_bubble_rating'}):
            s_speicher = s_bewertung['alt']
In the first if I get no result. In the second if I get the right elements, but I get all 3 results and cannot tell which one belongs to which label (Cuisine, Service, Quality).
Can someone give me advice on how to get the right data?
I've put the HTML code at the bottom.
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">\nGesamtwertung\n</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Cuisine</span>
</div>
<div class="wrap row part ">
<span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span alt="4.0 of five" class="ui_bubble_rating bubble_40"></span>
</div>
</div>
</li>
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Quality</span>
</div>
<div class="wrap row part "><span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span></div>
</div>
</li>
</ul>
</div>
Try this. According to the snippet you have pasted above, the following code should work:
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "lxml")
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    rating = item.select_one(".row span")['alt'].split(" ")[0]
    print("{} : {}".format(category, rating))
Another way would be:
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    rating = item.select_one(".text").find_parent().find_next_sibling().select_one("span")['alt'].split(" ")[0]
    print("{} : {}".format(category, rating))
Output:
Cuisine : 4.5
Service : 4.0
Quality : 4.5
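If you need to look a rating up by its label rather than print it, the same loop can fill a dict instead (a small variation on the first snippet, same assumptions):
ratings = {}
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    ratings[category] = item.select_one(".row span")['alt'].split(" ")[0]
print(ratings["Service"])  # 4.0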

Scrapy conditional crawling

My HTML code contains a number of divs with mostly similar structure; below is an excerpt containing two such divs.
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
So here is what I want Scrapy to do ...
If the div with class="outer-container" contains another div with title="verified", like in the 1st div above, it should follow the URL in that div (i.e. www.xxxxxx.com) and fetch some other fields on that page.
If there is no div containing title="verified", like in the 2nd div above, it should fetch all the data under div class="mody", i.e. company name (Fat Dude, LLC), address, city, state, etc., and NOT follow the URL (i.e. www.yyyyy.com).
So how do I apply this condition/logic in the Scrapy crawler? I was thinking of using BeautifulSoup, but I'm not sure ....
What I have tried so far:
class MySpider(CrawlSpider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        cName = soup.find_all("a", class_="mheading primary h4")
        addrs = soup.find_all("span", itemprop_="Address")
        loclity = soup.find_all("span", itemprop_="Locality")
        region = soup.find_all("span", itemprop_="Region")
        post = soup.find_all("span", itemprop_="postalCode")
        nf['companyName'] = cName[0]['content']
        nf['address'] = addrs[0]['content']
        nf['locality'] = loclity[0]['content']
        nf['state'] = region[0]['content']
        nf['zipcode'] = post[0]['content']
        yield nf
        for url in hxs.xpath('//div[@class="inner-container"]/a/@href').extract():
            yield Request(url, callback=self.parse)
Of course, the above code returns and crawls all the URLs under the div class="inner-container", as there is no conditional crawling specified in this code, because I don't know where/how to set it.
If anyone has done something similar before, please share. Thanks
No need to use BeautifulSoup; Scrapy comes with its own selector capabilities (also released separately as parsel). Let's use your HTML to make an example:
html = u"""
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
"""
from parsel import Selector

sel = Selector(text=html)
for div in sel.css('.outer-container'):
    if div.css('div[title="verified"]'):
        url = div.css('a::attr(href)').extract_first()
        print 'verified, follow this URL:', url
    else:
        nf = {}
        nf['companyName'] = div.xpath('string(.//h2)').extract_first()
        nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
        nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
        nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
        nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
        print 'not verified, extracted item is:', nf
The result for the previous snippet is:
verified, follow this URL: /c/my-llc
not verified, extracted item is: {'zipcode': u'987655', 'state': u'MA', 'address': u'7890 Business St', 'locality': u'Tokyo', 'companyName': u'Fat Dude, LLC'}
But in Scrapy you don't even need to instantiate the Selector class; there is a shortcut available in the response object passed to callbacks. Also, you shouldn't be subclassing CrawlSpider here; the regular Spider class is enough. Putting it all together:
from scrapy import Spider, Request
from myproject.items import NewsFields

class MySpider(Spider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        for div in response.selector.css('.outer-container'):
            if div.css('div[title="verified"]'):
                url = div.css('a::attr(href)').extract_first()
                yield Request(url)
            else:
                nf = NewsFields()
                nf['companyName'] = div.xpath('string(.//h2)').extract_first()
                nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
                nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
                nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
                nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
                yield nf
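One thing the question asks for that this snippet leaves open is fetching "some other fields" from the verified listing's page; that needs its own callback. A sketch (parse_verified and the selector inside it are assumptions, since that page isn't shown in the question); response.urljoin is used because hrefs like /c/my-llc are relative:
        # inside parse(), follow verified listings with a dedicated callback
        yield Request(response.urljoin(url), callback=self.parse_verified)

    def parse_verified(self, response):
        # hypothetical selector; adjust to whatever the detail page really contains
        nf = NewsFields()
        nf['companyName'] = response.css('h1::text').extract_first()
        yield nf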
I would suggest you get familiar with Parsel's API: https://parsel.readthedocs.io/en/latest/usage.html
Happy scraping!

Using SPLIT to create a list of HTML

I have a return value from a search I'm doing which contains a lot of HTML.
for i in deal_list:
    regex2 = '(?s)' + '<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="' + i + '"' + '.+?</figure>'
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2, htmltext)
    html_captured = info2.group(0).split('</figure>')
    print html_captured
Here is an example of what is being returned:
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
<p class="merchant-name truncation ">1742 Wine Bar</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Upper East Side</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$178.90</s>
<s class="discount-price">$49</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
<p class="merchant-name truncation ">Statler Grill</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Midtown</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$213</s>
<s class="discount-price">$89</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
I want to use html_captured = info2.group(0).split('</figure>') so that all the HTML between each pair of figure tags becomes an element of a list, in this case html_captured.
It kind of works, except that each match becomes its own list with a '' at the end. For example: ['<figure .... </figure>', '']['<figure .... </figure>', '']
But what I want is ['<figure .... </figure>', '<figure .... </figure>', '<figure .... </figure>', ...etc]
There are special tools for parsing HTML - HTML parsers.
Example using BeautifulSoup:
from bs4 import BeautifulSoup
data = """
your html here
"""
soup = BeautifulSoup(data)
print [figure for figure in soup.find_all('figure')]
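To go one step further and pull the useful fields out of each figure, the class names visible in the HTML above are enough (a sketch, assuming htmltext is the raw HTML from the question):
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmltext)
deals = []
for figure in soup.find_all('figure', class_='deal-tile'):
    deals.append({
        'title': figure.find('p', class_='deal-title').get_text(strip=True),
        'merchant': figure.find('p', class_='merchant-name').get_text(strip=True),
        'original_price': figure.find('s', class_='original-price').get_text(strip=True),
        'discount_price': figure.find('s', class_='discount-price').get_text(strip=True),
    })
print deals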
Also see why you should not use regex for parsing HTML:
RegEx match open tags except XHTML self-contained tags

Using BeautifulSoup to pull multiple posts from a single blog archive page, without scripts

I'm trying to scrape the authors, titles, dates, and post content out of a series of WordPress and Blogger blog archive pages. I've saved the pages locally so I'm not pinging the server repeatedly. I've got the other parts working, but I can't seem to both get all the posts from each file and avoid the "add-to-any", "sociable", and other messy script content at the bottom. Here's where I am:
import urllib2
from bs4 import BeautifulSoup
import re

file_list = open("hafiles.txt", "r")
posts_file = open("haposts.txt", "w")
for indurl in file_list:
    indurl = indurl.rstrip("\n")
    with open(indurl, "r") as ha_file:
        soup_ha = BeautifulSoup(ha_file)
        # works; the second find gets rid of the sociable crap
        # this is the way it looks on the page: <div class='post-body'>
        posts = soup_ha.find("div", class_="post-body").find_all("p")
        # tried a trick i saw on http://stackoverflow.com/questions/24458353/cleaning-text-string-after-getting-body-text-using-beautifulsoup
        # no joy
        # posts = soup_ha.find("div", class_="post-body")
        # text = [''.join(s.findAll(text=True)) for s in posts.findAll('p')]
        text = str(posts) + "\n" + "\n"
        posts_file.write(text)
print("All done!")
file_list.close()
posts_file.close()
So if I do a find_all and get all the posts (I'm not even sure I'm actually getting them all), then I get the script content too. I can get clean posts with no script in at least two ways if I just use find, but find only returns the first match. I have a list of files, and each file has several posts to extract.
I've searched here at Stack Overflow and on the web.
ETA: the input is a really messy webpage with a ton of scripts at the top, all the CSS definitions on the page, then:
<div id='main-wrapper'>
<div class='main section' id='main'><div class='widget Blog' id='Blog1'>
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>
<h3 class='post-title'>
<a href='http:// edited for anon.html'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit this is post text - what i want</p>
<script type='text/javascript'>
var permlink='edit';
var title='edit';
var spans = document.getElementsByTagName('span');
var number = 0;
for(i=0; i <spans.length; i++){
var c = " " + spans[i].className + " ";
if (c.indexOf("fullpost") != -1) {
number++;
}
}
if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
memory = number;
</script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
this is the author name, also want, have way to get
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='http://edit' title='permanent link'>2:53 pm</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>1 comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edi</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>26 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='5518681505930320089'></a>
<h3 class='post-title'>
<a href='edit'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit post text, what I want.</p>
<script type='text/javascript'>
var permlink='http://edit';
var title='edit';
var spans = document.getElementsByTagName('span');
var number = 0;
for(i=0; i <spans.length; i++){
var c = " " + spans[i].className + " ";
if (c.indexOf("fullpost") != -1) {
number++;
}
}
if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
memory = number;
</script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
edit author name
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='edit' title='permanent link'>9:00 am</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>5
comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edit</a>,
<a href='edit' rel='tag'>edit</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>22 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>
Yuck! So I have maybe 20 or so files, each with 1 to maybe 10 posts in them (this one has 2). What would be lovely is a CSV or Excel file like
date author title postcontent
in columns, one row per post.
I'll take a file with just the post content and some space between each post. I am fine with some links in the text and some bolds and lists and whatnot, but I don't want all the messy scripts.
Thanks
Here's an example for a single page with multiple posts:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('test.html'))
posts = []
for post in soup.find_all('div', class_='post'):
    title = post.find('h3', class_='post-title').text.strip()
    author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
    content = post.find('div', class_='post-body').p.text.strip()
    date = post.find_previous_sibling('h2', class_='date-header').text.strip()
    posts.append({'title': title,
                  'author': author,
                  'content': content,
                  'date': date})
print posts
For the html you've posted, it prints:
[{'content': u'edit this is post text - what i want',
  'date': u'27 February, 2007',
  'author': u'this is the author name, also want, have way to get',
  'title': u'edit'},
 {'content': u'edit post text, what I want.',
  'date': u'26 February, 2007',
  'author': u'edit author name',
  'title': u'edit'}]
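From there, the date/author/title/content spreadsheet the asker describes is one csv.DictWriter away (a sketch; posts is the list built above and the filename is arbitrary):
import csv

with open('posts.csv', 'wb') as f:  # 'wb' because this is the Python 2 csv module
    writer = csv.DictWriter(f, fieldnames=['date', 'author', 'title', 'content'])
    writer.writeheader()
    for post in posts:
        # the scraped values are unicode, so encode them for the py2 csv writer
        writer.writerow(dict((k, v.encode('utf-8')) for k, v in post.items()))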
