Python, BeautifulSoup finding HTML segment - python

I am a newbie following the web scraping examples from Automate the Boring Stuff. I'm trying to automate downloading images from phdcomics with a single Python script that will:
1. find the link of the image in the HTML and download it, then
2. find the link for the previous page in the HTML and go there, repeating step 1 until the very first page.
For downloading the current page's image, the relevant segment of the HTML printed by soup.prettify() looks like this -
<meta content="Link to Piled Higher and Deeper" name="description">
<meta content="PHD Comic: Remind me" name="title">
<link
href="http://www.phdcomics.com/comics/archive/phd041218s.gif" rel="image_src">
<div class="jumbotron" style="background-color:#52697d;padding: 0em 0em 0em; margin-top:0px; margin-bottom: 0px; background-image: url('http://phdcomics.com/images/bkg_bottom_stuff3.png'); background-repeat: repeat-x;">
<div align="center" class="container-fluid" style="max-width: 1800px;padding-left: 0px; padding-right:0px;">
and then when I write
newurl=soup.find('link', {'rel': "image_src"}).get('href')
it gives me what I need, which is
"http://www.phdcomics.com/comics/archive/phd041218s.gif"
In the next step I want to find the previous page link, which I believe is in the following part of the HTML code -
<!-- Comic Table --!>
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td align="right" valign="top">
<a href=http://phdcomics.com/comics/archive.php?comicid=2004><img height=52 width=49 src=http://phdcomics.com/comics/images/prev_button.gif border=0 align=middle><br></a><font
face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>previous </b></i></font><br><br><a href=http://phdcomics.com/comics/archive.php?comicid=1995><img src=http://phdcomics.com/comics/images/jump_bck10.gif border=0></a><br><a href=http://phdcomics.com/comics/archive.php?comicid=2000><img src=http://phdcomics.com/comics/images/jump_bck5.gif border=0></a><br><font face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>jump</b></i></font><br><br><a href=http://phdcomics.com/comics/archive.php?comicid=1><img src=http://phdcomics.com/comics/images/first_button.gif border=0 align=middle><br></a><font face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>first</b></i></font><br><br> </td>
<td align="center" valign="top"><font color="black">
From this part of the code I want to find
http://phdcomics.com/comics/archive.php?comicid=2004
as my previous link.
When I try something like this -
Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
print(Prevlink)
it gives me an error like this-
Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'
Even when I try to do this-
Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)
I get a similar error -
Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'
What should be the right way to get the right 'href'?
TIA

The problem is in the way comments are written in the HTML of PhD Comics.
If you look closely at the output of soup.prettify(), you will find comments like this
<!-- Comic Table --!>
when it should be
<!-- Comic Table -->
This causes BeautifulSoup to miss certain tags. There are several ways to parse and remove comments, such as a regex or bs4's Comment class, but they might be difficult to get working in this case. The easiest way is to fix the comment terminators after collecting the HTML.
from bs4 import BeautifulSoup
import requests
url = "https://phdcomics.com/"
r = requests.get(url)
data = r.text
data = data.replace("--!>","-->") # fix comments
soup = BeautifulSoup(data, "html.parser") # name a parser explicitly
Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)
http://phdcomics.com/comics/archive.php?comicid=2004
Update:
To find the requested link automatically, find the parent element of the image "http://phdcomics.com/comics/images/prev_button.gif" and extract its link
img_tag = soup.find('img',{'src':'http://phdcomics.com/comics/images/prev_button.gif'})
print(img_tag.find_parent().get('href'))
http://phdcomics.com/comics/archive.php?comicid=2005
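The two steps above (repairing the comment terminators, then walking from the button image up to its enclosing link) can be tried offline on a small snippet. This is only a sketch under assumptions: the HTML string is a simplified excerpt of the page from the question with the attributes quoted by hand, html.parser is an assumed parser choice, and find_prev_link is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Simplified excerpt of the page from the question (attributes quoted by hand).
RAW_HTML = """
<!-- Comic Table --!>
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td align="right" valign="top">
<a href="http://phdcomics.com/comics/archive.php?comicid=2004"><img src="http://phdcomics.com/comics/images/prev_button.gif" border="0"><br></a>
<a href="http://phdcomics.com/comics/archive.php?comicid=1"><img src="http://phdcomics.com/comics/images/first_button.gif" border="0"><br></a>
</td>
</tr>
</table>
"""


def find_prev_link(html):
    """Return the href of the link wrapping prev_button.gif, or None."""
    fixed = html.replace("--!>", "-->")  # repair the malformed comment terminators
    soup = BeautifulSoup(fixed, "html.parser")
    # Match on the end of src so the scheme and host do not matter.
    img = soup.find("img", src=lambda s: s is not None and s.endswith("prev_button.gif"))
    if img is None:
        return None  # no "previous" button found; the caller can stop here
    link = img.find_parent("a")
    return link.get("href") if link is not None else None


prev_link = find_prev_link(RAW_HTML)
print(prev_link)
```

Returning None instead of chaining .get('href') directly lets a download loop stop cleanly rather than fail with the AttributeError from the question.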

Related

find a very specific tag in soup

I am using BS4 and I have some "soup":
<TABLE CLASS=MAINBODY WIDTH=100% CELLSPACING=0 CELLPADDING=4 BORDER=1 BORDERCOLOR=#000000><TR><TD>
<TABLE CLASS=OBJECTNAME WIDTH=100% CELLSPACING=0 CELLPADDING=1><TR><TD WIDTH=44><IMG SRC="foobar.img"></TD><TD>Foobar text</TD></TR></TABLE>
<!--========== SECTION: FOOBAR DETAILS ==========-->
<TABLE CLASS=OBJECTNAME HEIGHT=25><TR><TD>Foobar text</TD></TD></TABLE>
<!--foobar text-->
and I want to find the tag:
<TABLE CLASS=OBJECTNAME WIDTH=100% CELLSPACING=0 CELLPADDING=1><TR><TD WIDTH=44><IMG SRC="foobar.img"></TD><TD>Foobar text</TD></TR></TABLE>
I have a list with the string:
<TD>Foobar text</TD>
in it that I am using to search.
How do I find the specific tag without getting the second tag with the same value or get the comment with the same text?
After sleeping on it and getting pointed in a better direction by the comments here, I found the answer.
Assuming my input is a list of tag strings named TagList, such as ['<td>foobart text</td>', '<td>foo text</td>', '<td>bar text</td>'], here is what I came up with:
TagList = ['<td>foobart text</td>', '<td>foo text</td>', '<td>bar text</td>']
for i in TagList:
    # parse the search string to get its text, then climb to the enclosing table
    parentTable = soup.find('td', string=BeautifulSoup(i, 'html.parser').text).find_parent('table')
    print(parentTable)
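As a rough illustration of why this avoids the comment and how the two tables can be told apart, here is a sketch on a cleaned-up copy of the markup. The lowercase tags, the html.parser choice, and the "table that also contains an img" criterion are assumptions made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Cleaned-up version of the question's markup (lowercase tags, quoted attributes).
HTML = """
<table class="MAINBODY"><tr><td>
<table class="OBJECTNAME" width="100%"><tr><td width="44"><img src="foobar.img"></td><td>Foobar text</td></tr></table>
<!--========== SECTION: FOOBAR DETAILS ==========-->
<table class="OBJECTNAME" height="25"><tr><td>Foobar text</td></tr></table>
<!--Foobar text-->
</td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Restricting the search to <td> tags skips the comment with the same text.
cells = soup.find_all("td", string="Foobar text")

# Both inner tables hold a matching cell; keep the one whose table also has an <img>.
tables = [cell.find_parent("table") for cell in cells]
wanted = [table for table in tables if table.find("img") is not None]

print(len(cells), wanted[0].get("width"))
```

soup.find returns only the first match in document order, which happens to be the right table here; filtering on a distinguishing child is one way to stay correct if the order changes.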

Python - XPath issue while scraping the IMDb Website

I am trying to scrape the movies on IMDb using Python and I can get data about all the important aspects but the actors names.
Here is a sample URL that I am working on:
https://www.imdb.com/title/tt0106464/
Using the "Inspect" browser functionality I found the XPath that relates to all actors' names, but when I run the code in Python, the XPath seems to be invalid (it does not return anything).
Here is a simple version of the code I am using:
import requests
from lxml import html
movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5
IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)
I tried to change the XPath many times trying to make it more generic and then more specific, but it still does not return anything
Don't blindly accept the markup structure you see using inspect element.
Browsers are very lenient and will try to fix any markup issue in the source.
That said, if you check the source using view source, you can see that the table you're trying to scrape has no <tbody>; the browser inserts it.
So if you remove it from the XPath
//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text() -> //table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()
your query should work.
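The difference can be checked without hitting the site. The snippet below is a sketch on a stand-in for the page source: the first cast row is taken from the markup quoted on this page, while the second cell with the actor name link is an assumption about what follows.

```python
from lxml import html

# Stand-in for the raw page source; note there is no <tbody> in the markup.
# The second <td> (the actor name link) is assumed for this sketch.
PAGE = """
<html><body>
<table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo"><a href="/name/nm0000418/"><img alt="Danny Glover"></a></td>
<td><a href="/name/nm0000418/"> Danny Glover
</a></td>
</tr>
</table>
</body></html>
"""

doc = html.fromstring(PAGE)

# The XPath copied from the browser inspector expects a <tbody> that is not there.
with_tbody = doc.xpath(
    '//table[@class="cast_list"]//tbody//tr'
    '//td[not(contains(@class,"primary_photo"))]//a/text()'
)

# Dropping //tbody matches the markup that was actually served.
without_tbody = doc.xpath(
    '//table[@class="cast_list"]//tr'
    '//td[not(contains(@class,"primary_photo"))]//a/text()'
)

# Link text usually carries surrounding whitespace, so strip it.
actors = [name.strip() for name in without_tbody if name.strip()]
print(with_tbody, actors)
```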
Looking at the HTML, start with a simple XPath like //td[@class="primary_photo"]
<table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"
><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BMTI4ODM2MzQwN15BMl5BanBnXkFtZTcwMjY2OTI5MQ@@._V1_UY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td>
PYTHON:
for photo in doc.xpath('//td[@class="primary_photo"]'):
    print(photo)

Python HTML Parsing with BS4

I'm parsing HTML with Python and Beautiful Soup, and I'm having trouble extracting one very specific piece of data. This is the kind of markup I'm encountering:
<div class="big_div">
<div class="smaller div">
<div class="other div">
<div class="this">A</div>
<div class="that">2213</div>
<div class="other div">
<div class="this">B</div>
<div class="that">215</div>
<div class="other div">
<div class="this">C</div>
<div class="that">253</div>
As you can see, the HTML repeats with only the values being different. My problem is locating a specific value: I want to locate the 253 in the last div. I would appreciate any help, as this is a recurring problem when parsing HTML.
Thank you in advance!
So far I've tried to parse for it but because the names are the same I have no idea how to navigate through it. I've tried using the for loop too but made little to no progress at all.
You can pass the string argument to find. See the BS docs for the string argument.
"""Suppose html is the object holding html code of your web page that you want to scrape
and req_text is some text that you want to find"""
soup = BeautifulSoup(html, 'lxml')
req_div = soup.find('div', string=req_text)
req_div will contain the div element which you want.
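Applied to the markup in the question, that could look like the sketch below. The closing tags are filled in by assumption, html.parser is used instead of lxml only to avoid an extra dependency, and the second and third lookups are alternatives added here for the case where the value is not known in advance.

```python
from bs4 import BeautifulSoup

# The question's markup with the missing closing tags filled in.
html = """
<div class="big_div">
  <div class="smaller div">
    <div class="other div"><div class="this">A</div><div class="that">2213</div></div>
    <div class="other div"><div class="this">B</div><div class="that">215</div></div>
    <div class="other div"><div class="this">C</div><div class="that">253</div></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Match on the text itself (only useful when the value is known in advance).
by_text = soup.find("div", string="253").get_text()

# 2. Take the last div with class "that".
last_value = soup.find_all("div", class_="that")[-1].get_text()

# 3. Find the label "C", then step to its sibling with class "that".
label = soup.find("div", class_="this", string="C")
by_label = label.find_next_sibling("div", class_="that").get_text()

print(by_text, last_value, by_label)
```

The third form is usually the most robust of the three, since it depends neither on the value nor on the position of the block.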

Beautifulsoup parse through poor tags

My understanding is that regex is the poor man's approach to dealing with beautifulsoup, but I was wondering if it's my only option if there aren't well defined tags in the html I'm trying to parse?
I'm ultimately just trying to get some simple data from the html...but it's just in a series of tables that look like this:
<table width="733" border="0" cellpadding="2">
<tr>
<td align="right" valign="top" nowrap="nowrap" bgcolor="#29ff36">
<font size="-1" face="Verdana, Arial, Helvetica, sans-serif">
<strong>
PART CODE:
</strong>
</font>
</td>
<td align="left" valign="top" nowrap="nowrap">
<font size="-1" color="#7b1010" face="Verdana, Arial, Helvetica, sans-serif">
PART# (//THIS IS WHAT I WANT)
</font>
</td>
<td>
</td>
Is there a good way to approach this without regex?
Thanks for the help guys. This site is incredible
OK:
There's about 15 of those tables, each has a label (such as Cost, Vendor, On-Hand) which sits in the first cell, and then the data that I actually want is always in the next cell over.
label = 'Price:'
rows = soup.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        if td.find(text=True) == label:
            print(td.find(text=True))
That works well enough to find the correct cell with the label in it... I basically just need to find the next cell over now I guess. The "next" command per the beautifulsoup documentation is not really accomplishing this though.
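One way to get "the next cell over" is find_next_sibling('td'), which moves to the next td at the same level, whereas next steps to the next parsed node (often a whitespace string or a child tag). The following is a sketch with bs4 on a trimmed copy of the table; ABC-123 is a placeholder value and value_for is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question; ABC-123 is a placeholder value.
html = """
<table width="733" border="0" cellpadding="2">
<tr>
<td align="right" valign="top" nowrap="nowrap" bgcolor="#29ff36">
<font size="-1" face="Verdana, Arial, Helvetica, sans-serif"><strong>
PART CODE:
</strong></font>
</td>
<td align="left" valign="top" nowrap="nowrap">
<font size="-1" color="#7b1010" face="Verdana, Arial, Helvetica, sans-serif">
ABC-123
</font>
</td>
<td></td>
</tr>
</table>
"""


def value_for(soup, label):
    """Return the text of the cell to the right of the cell holding `label`."""
    for td in soup.find_all("td"):
        # get_text(strip=True) ignores the whitespace around the nested tags.
        if td.get_text(strip=True) == label:
            neighbour = td.find_next_sibling("td")
            return neighbour.get_text(strip=True) if neighbour is not None else None
    return None  # label not present in any cell


soup = BeautifulSoup(html, "html.parser")
part_code = value_for(soup, "PART CODE:")
print(part_code)
```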
Any thoughts?
You can also do this with lxml instead of beautifulsoup. I switched over to using lxml.html instead of beautifulsoup because of the cssselect() method. It takes css rules just like you would use in a css file or jQuery.
from lxml.html import fromstring
raw_html_data = """ ... your html data here ... """
doc = fromstring(raw_html_data)
part_number = doc.cssselect('td[align=left] font')[0].text
# part_number.strip() # optionally strip leading and trailing whitespace
You can use pip to install lxml. Recent lxml versions also need the separate cssselect package for cssselect() to work.
$ pip install lxml cssselect
Silver platter solution:
# ... starting with doc from above
info = []
target_trs = doc.cssselect('table tr') # tweak based on actual html
for tr in target_trs:
    target_cells = tr.cssselect('td font')
    label = target_cells[0].text.strip()
    data = target_cells[1].text.strip()
    info.append((label, data))
# now you have a list of (label, data) pairs in info
The example you provided isn't exactly clear, but here's a snippet that will retrieve the Part# from your example HTML source:
columns = soup.findAll('td')
for col in columns:
    try:
        part = col.find("font", {"color": "#7b1010"}).contents[0]
        print(part)
    except (AttributeError, IndexError):
        # no matching <font> in this cell, or the tag is empty
        pass
The lxml developers claim that it works well with malformed HTML.

I am not able to parse using Beautiful Soup

<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want the description after scraping it:
My Name is Alis I am a python learner...
I tried a lot of things but could not figure out the best way. Can you give a general solution for this?
from bs4 import BeautifulSoup  # BeautifulSoup 3 used "from BeautifulSoup import BeautifulSoup"
soup = BeautifulSoup("Your sample html here", "html.parser")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (the unicode string, with any applicable whitespace, it should be noted).
This works by parsing the html, grabbing the first td tag and its contents, grabbing any div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
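For reference, here is the same lookup run with bs4 on the sample from the question, next to a variant that anchors on the "My Description" label instead of on list positions. Treat it as a sketch: html.parser is an assumed parser choice and the label-based variant is an addition, not part of the answer above.

```python
from bs4 import BeautifulSoup, NavigableString

# The sample markup from the question.
html = """<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>"""

soup = BeautifulSoup(html, "html.parser")

# Positional lookup: third nested div, last child node, whitespace stripped.
positional = soup.td.div("div")[2].contents[-1].strip()

# Label-based lookup: first non-empty text node after <b>My Description</b>.
label = soup.find("b", string="My Description")
anchored = next(
    node.strip()
    for node in label.next_siblings
    if isinstance(node, NavigableString) and node.strip()
)

print(positional)
print(anchored)
```

The label-based form keeps working if more divs are added around the description, at the cost of depending on the label text.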
In BeautifulSoup, there are A LOT of ways to do this, so this answer probably hasn't taught you much and I genuinely recommend you read the tutorial that David suggested.
Have you tried reading the examples provided in the documentation? The quick start is located here http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find the div with class "class-a",
you would load your html up via
from bs4 import BeautifulSoup  # BeautifulSoup 3 used "from BeautifulSoup import BeautifulSoup"
soup = BeautifulSoup("My html here", "html.parser")
myDiv = soup.find("div", { "class" : "class-a" })
Also remember you can do most of this via the python console and then using dir() along with help() walk through what you're trying to do. It might make life easier on you to try out ipython or perhaps python IDLE which have very friendly consoles for beginners.
