Unable to Scrape Content that Comes after a Comment with Python BeautifulSoup

I am trying to scrape the tables from the following page:
https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml
When I reach the html for the batting tables I encounter a very long comment which contains the html for the table
<div id="all_WashingtonSenatorsbatting" class="table_wrapper table_controls">
<div class="section_heading">
<div class="section_heading_text">
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
.....
-->
<div class="table_outer_container mobile_table">
<div class="footer no_hide_long">
The last two divs are what I am interested in scraping; everything between the <!-- and the --> is a comment which happens to contain a copy of the table in the table_outer_container div below.
The problem is that when I read the page source into BeautifulSoup, it will not read anything after the comment within the table_wrapper div that contains everything. The following code illustrates the problem:
batting = page_source.find('div', {'id':'all_WashingtonSenatorsbatting'})
divs = batting.find_all('div')
len(divs)
gives me
Out[1]: 3
When there are obviously 5 div children under the div id="all_WashingtonSenatorsbatting" element.
Even when I extract the comment using
from bs4 import Comment
for comments in soup.findAll(text=lambda text: isinstance(text, Comment)):
    comments.extract()
The resulting soup still doesn't contain the last two div elements I want to scrape. I have been playing with regular expressions, but so far no luck. Any suggestions?

I found a workable solution. Using the following code I extract the comment (which brings with it the last two div elements I wanted to scrape), process it again in BeautifulSoup, and scrape the table:
import requests
from bs4 import BeautifulSoup, Comment

s = requests.get(url).content
soup = BeautifulSoup(s, "html.parser")
wrapper = soup.find_all('div', {'class': 'table_wrapper'})[0]
comment = wrapper(text=lambda x: isinstance(x, Comment))[0]
newsoup = BeautifulSoup(comment, 'html.parser')
table = newsoup.find('table')
It took me a while to get to this, and I would be interested to see if anyone comes up with other solutions, or can offer an explanation of how this problem came about.
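For reference, these sports-reference pages reportedly ship some tables inside HTML comments (the site's own JavaScript un-comments them client-side), so re-parsing the comment text is the usual approach. A minimal, self-contained sketch of that pattern; the inline HTML below is a made-up stand-in for the real page:

```python
from bs4 import BeautifulSoup, Comment

html = """
<div id="all_WashingtonSenatorsbatting" class="table_wrapper">
  <!--
  <div class="table_outer_container">
    <table id="batting"><tr><td>hit</td></tr></table>
  </div>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every comment node; re-parse any that contain a table
tables = []
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    if "<table" in comment:
        tables.append(BeautifulSoup(comment, "html.parser").find("table"))

print(tables[0].td.text)  # -> hit
```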


Python / Beautifulsoup: HTML Path to the current element

For a class project, I'm working on extracting all links on a webpage. This is what I have so far.
from bs4 import BeautifulSoup, SoupStrainer
with open("input.htm") as inputFile:
    soup = BeautifulSoup(inputFile)

outputFile = open('output.txt', 'w')
for link in soup.find_all('a', href=True):
    outputFile.write(str(link) + '\n')
outputFile.close()
This works very well.
Here's the complication: for every <a> element, my project requires me to know the entire "tree structure" down to the current link. In other words, I'd like to know all the ancestor elements, starting with the <body> element, along with the class and id of each along the way.
Like the navigation page on Windows explorer. Or the navigation panel on many browsers' element inspection tool.
For example, if you look at the Bible page on Wikipedia and a link to the Wikipedia page for the Talmud, the following "path" is what I'm looking for.
<body class="mediawiki ...>
<div id="content" class="mw-body" role="main">
<div id="bodyContent" class="mw-body-content">
<div id="mw-content-text" ...>
<div class="mw-parser-output">
<div role="navigation" ...>
<table class="nowraplinks ...>
<tbody>
<td class="navbox-list ...>
<div style="padding:0em 0.25em">
<ul>
<li>
<a href="/wiki/Talmud"
Thanks a bunch.
-Maureen
Try this code:
soup = BeautifulSoup(inputFile, 'html.parser')
Or use lxml:
soup = BeautifulSoup(inputFile, 'lxml')
If it is not installed:
pip install lxml
Here is a solution I just wrote. It works by finding the element, then navigating up the tree via the element's parent. I parse just the opening tag and add it to a list, then reverse the list at the end. Finally we end up with a list that resembles the tree you requested.
I have written it for one element; you can modify it to work with your find_all.
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')

tree = []
hrefElement = soup.find('a', href=True)
hrefString = str(hrefElement).split(">")[0] + ">"
tree.append(hrefString)

hrefParent = hrefElement.find_parent()
while hrefParent.name != "html":
    hrefString = str(hrefParent).split(">")[0] + ">"
    tree.append(hrefString)
    hrefParent = hrefParent.find_parent()

tree.reverse()
print(tree)
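An alternative sketch of the same idea uses the .parents generator, which walks from an element up to the document root; shown here on a tiny made-up fragment rather than the live Wikipedia page:

```python
from bs4 import BeautifulSoup

html = '<body class="x"><div id="content"><ul><li><a href="/wiki/Talmud">T</a></li></ul></div></body>'
soup = BeautifulSoup(html, 'html.parser')

a = soup.find('a', href=True)
# .parents yields li, ul, div, body, then the document itself;
# keep everything except the synthetic root nodes
path = [a] + [p for p in a.parents if p.name not in ('[document]', 'html')]
path.reverse()
print([p.name for p in path])  # -> ['body', 'div', 'ul', 'li', 'a']
```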

Python HTML Parsing with BS4

I'm trying to parse HTML using Python & Beautiful Soup, and I'm running into the problem of wanting to extract a very specific piece of data. This is the kind of code I'm encountering:
<div class="big_div">
<div class="smaller div">
<div class="other div">
<div class="this">A</div>
<div class="that">2213</div>
<div class="other div">
<div class="this">B</div>
<div class="that">215</div>
<div class="other div">
<div class="this">C</div>
<div class="that">253</div>
As you can see, there is a series of repeated HTML with only the values being different; my problem is locating a specific value. I want to locate the 253 in the last div. I would appreciate any help, as this is a recurring problem when parsing HTML.
Thank you in advance!
So far I've tried to parse for it, but because the class names are the same I have no idea how to navigate to it. I've tried using a for loop too, but made little to no progress.
You can use the string attribute as an argument in find; see the BS docs for the string attribute.
# Suppose html is a string holding the HTML of the web page you want to
# scrape, and req_text is some text that you want to find.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
req_div = soup.find('div', string=req_text)
req_div will contain the div element which you want.
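Applied to the markup in the question, a sketch: locate the label div by its text, then step to the sibling div that holds the number (class names and values taken from the question's snippet):

```python
from bs4 import BeautifulSoup

html = """
<div class="big_div">
  <div class="other div">
    <div class="this">C</div>
    <div class="that">253</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

label = soup.find('div', string='C')      # locate the label div by its text
value = label.find_next_sibling('div')    # the adjacent div holds the number
print(value.text)  # -> 253
```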

Python - beautifulSoup unable to iterate repetitive blocks

Unsure how to properly word the issue.
I am trying to parse through an HTML document with a tree similar to that of
div(unique-class)
|-a
|-h4
|-div(class-a)
|-div(class-b)
|-div(class-c)
|-p
Etc, it continues. I only listed the few items I need. It is a lot of sibling hierarchy, all existing within one div.
I've been working quite a bit with BeautifulSoup for the past few hours, and I finally have a working version (Beta) of what I'm trying to parse, in this example.
from bs4 import BeautifulSoup
import csv

file = "C:\\Python27\\demo.html"
soup = BeautifulSoup(open(file), 'html.parser')

# Let's pull prices
names = []
pricing = []
discounts = []
for name in soup.find_all('div', attrs={'class': 'unique_class'}):
    names.append(name.h4.text)
for price in soup.find_all('div', attrs={'class': 'class-b'}):
    pricing.append(price.text)
for discount in soup.find_all('div', attrs={'class': 'class-a'}):
    discounts.append(discount.text)

ofile = open('output2.csv', 'wb')
fieldnames = ['name', 'discountPrice', 'originalPrice']
writer = csv.DictWriter(ofile, fieldnames=fieldnames)
writer.writeheader()
for i in range(len(names)):
    print(names[i], pricing[i], discounts[i])
    writer.writerow({'name': names[i], 'discountPrice': pricing[i], 'originalPrice': discounts[i]})
ofile.close()
As you can tell, this iterates from top to bottom and appends to a distinct array for each field. The issue is: if I'm iterating over, let's say, 30,000 items and the website can modify itself (say, a scoreboard app on a JS framework), by the time I get to the 2nd iteration the order may have changed. (As I type this I realize this scenario would actually need more variables, since BS would 'catch' the website at time of load, but I think the point still stands.)
I believe I need to leverage the next_sibling function within BS4, but when I did that I started capturing items I wasn't specifying, because I couldn't apply a 'class' to the sibling.
Update
An additional issue I encountered, when trying to do a loop within a loop to find the 3 children I need under the unique_class, was that I would end up with the first price being listed for all names.
Update - Adding sample HTML
<div class="unique_class">
<h4>World</h4>
<div class="class_b">$1.99</div>
<div class="class_a">$1.99</div>
</div>
<div class="unique_class">
<h4>World2</h4>
<div class="class_b">$2.99</div>
<div class="class_a">$2.99</div>
</div>
<div class="unique_class">
<h4>World3</h4>
<div class="class_b">$3.99</div>
<div class="class_a">$3.99</div>
</div>
<div class="unique_class">
<h4>World4</h4>
<div class="class_b">$4.99</div>
<div class="class_a">$3.99</div>
</div>
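For what it's worth, with markup like the sample above, iterating one unique_class block at a time keeps the three fields aligned, since each tuple is read from a single block rather than from three separate find_all passes. A sketch using the class names from the sample:

```python
from bs4 import BeautifulSoup

html = """
<div class="unique_class">
  <h4>World</h4>
  <div class="class_b">$1.99</div>
  <div class="class_a">$1.99</div>
</div>
<div class="unique_class">
  <h4>World2</h4>
  <div class="class_b">$2.99</div>
  <div class="class_a">$2.99</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for block in soup.find_all('div', class_='unique_class'):
    # All three fields come from the same block, so they cannot
    # get out of step with one another
    rows.append((block.h4.text,
                 block.find('div', class_='class_b').text,
                 block.find('div', class_='class_a').text))
print(rows)
```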
I have also found a fix and submitted the answer to be optimized, located at Code Review.
If the site you are looking to scrape uses JS, you may want to use Selenium and its page_source property to extract snapshots of the page with the JS loaded, which you can then feed into BS.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(<url>)
page = driver.page_source
Then you can use BS to parse the JS loaded 'page'
If you want to wait for other JS events to load up you are able to specify events to wait for in selenium.

beautifulsoup CSS Select - find a tag in which a particular attribute (style for ex) is not present

My first question here on SO. Thanks for helping us noobs for so long. Coming straight to the point:
Scenario:
I am working on an existing program that reads the CSS selector as a string from a configuration file, so the program stays dynamic and can scrape any site by just changing the configuration value of the CSS selector.
Problem:
I am trying to scrape a site which is rendering items as one of the 2 options below:
Option1:
.........
<div class="price">
<span class="price" style="color:red;margin-right:0.1in">
<del>$299</del>
</span>
<span class="price">
$195
</span>
</div>
soup = soup.select("span.price") - this doesn't work as I need second span tag or last span tag :(
Option2:
.........
<div class="price">
<span class="price">
$199
</span>
</div>
soup = soup.select("span.price") - this works great!
Question:
In both the above options I want to be able to get the last span tag ($195 or $199) and don't care about the $299. Basically I just want to extract the final sale price and not the original price.
So the 2 ways I know as of now are:
1) Always get the last span tag
2) Always get the span tag which doesn't have style attribute
Now, I know the not operator, last-of-type are not present in bs4 (only nth-of-type is available) so I am stuck here. Any suggestions are helpful.
Edit: Since this is an existing program, I can't use soup.find_all() or any other method apart from soup.select(). Sorry :(
Thanks!
You can search for the span tag without the style attribute:
prices = soup.select('span.price')
no_style = [price for price in prices if 'style' not in price.attrs]
>> [<span class="price">$199</span>]
This might be a good time to use a function. In this case BeautifulSoup gives the function each tag, and the function tests whether the tag's name is span and it does not have the attribute style. If this is true then BeautifulSoup appends the tag to its list of results.
HTML = '''\
<div class='price'>
<span class='price' style='color: red; margin-right: 0.1in'>
<del>$299</del>
</span>
<span class='price'>
$195
</span>
</div>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML, 'lxml')
for item in soup.find_all(lambda tag: tag.name == 'span' and not tag.has_attr('style')):
    print(item)
The code inside the select function needs to change to:
def select(soup, the_variable_you_pass):
    return soup.find('div', attrs={'class': 'price'}).find_all(the_variable_you_pass)[-1]
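One more note, hedged: the claim that :not() and last-of-type are unavailable was true of older bs4, but BeautifulSoup 4.7+ delegates select() to the soupsieve library, which does support :not() and :last-of-type. A sketch, assuming a recent bs4, that keeps the requirement of using only soup.select():

```python
from bs4 import BeautifulSoup

html = """
<div class="price">
  <span class="price" style="color:red;margin-right:0.1in">
    <del>$299</del>
  </span>
  <span class="price">
    $195
  </span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Requires bs4 >= 4.7, whose select() is backed by soupsieve
sale = soup.select('span.price:not([style])')
print(sale[0].get_text(strip=True))  # -> $195
```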

Parse JavaScript href with Python

Been having a lot of trouble with this... new to Python, so sorry if I just don't know the proper search terms to find the info myself. I'm not even positive it's because of the JS, but that's the best idea I've got.
Here's the section of HTML I'm parsing:
...
<div class="promotion">
<div class="address">
5203 Alhama Drive
</div>
</div>
...
...and the Python I'm using to do it (this version is the closest I've gotten to success):
from bs4 import BeautifulSoup

homeFinderSoup = BeautifulSoup(open("homeFinderHTML.html"), "html5lib")
addressClass = homeFinderSoup.find_all('div', 'address')
for row in addressClass:
    print row.get('href')
...which returns
None
None
None
# Create soup from the html. (Here I am assuming that you have already
# read the file into the variable "html" as a string.)
soup = BeautifulSoup(html)

# Find all divs with class="address"
address_class = soup.find_all('div', {"class": "address"})

# Loop over the results
for row in address_class:
    # Each result has one <a> tag, and we need to get the href property from it.
    print row.find('a').get('href')
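A self-contained version of the idea above. Note the <a> tag and its javascript: href are assumptions: the question's snippet elides the anchor, so the sample markup and href value here are hypothetical. The guard also avoids the None errors that produced the question's output:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's anchor and href are not shown
# in the question.
html = """
<div class="promotion">
  <div class="address">
    <a href="javascript:showHome(5203)">5203 Alhama Drive</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for row in soup.find_all('div', {'class': 'address'}):
    link = row.find('a')          # the anchor inside each address div
    if link is not None:          # guard against rows with no anchor
        hrefs.append(link.get('href'))
print(hrefs)
```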
