Getting value from tag with BeautifulSoup - python

I'm trying to scrape movie information from the info box on Wikipedia using BeautifulSoup. I'm having trouble scraping movie budgets, as below.
For example, I want to scrape the '$25 million' budget value from the info box. How can I get the budget value, given that neither the th nor the td tags are unique? (See the example HTML below.)
Say I have tag = soup.find('th') with the value
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th> - How can I get the value of '$25 million' from tag?
I thought I could do something like tag.td or tag.text, but neither of these is working for me.
Do I have to loop over all tags and check if their text is equal to 'Budget', and if so get the following cell?
Example HTML Code:
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference">[2]</sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference">[3]</sup></td>
</tr>

You can first find the th node whose text is Budget, then find its next sibling td and get the text from that node:
soup.find("th", text="Budget").find_next_sibling("td").get_text()
# u'$25 million[2]'
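Note that on recent BeautifulSoup releases the text= keyword is called string=. A minimal, self-contained sketch against the example HTML above, with the trailing citation marker split off (this assumes the footnote reference always comes after the amount):
from bs4 import BeautifulSoup

html = """
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference">[2]</sup></td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")

# string= is the newer name for the text= keyword (BeautifulSoup 4.4+)
budget_cell = soup.find("th", string="Budget").find_next_sibling("td")

# Split off the footnote marker such as "[2]"
print(budget_cell.get_text().split("[")[0])  # $25 million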

To get every amount in the <td> tags, you should use
tags = soup.findAll('td')
and then
for tag in tags:
    print tag.get_text()  # prints the text, e.g. '$25 million'

What you need is the find_all() method in BeautifulSoup.
For example:
tdTags = soup.find_all('td', {'class': 'reference'})
This finds all 'td' tags whose class is 'reference'.
You can find whatever td tags you want, as long as there is an attribute that is unique to the td tags you expect.
Then you can loop over them to get the content, as @Bijoy said.
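If no attribute in the info box is truly unique, you can also do exactly what the question suggests: loop over the th tags and keep the one labelled 'Budget'. A small sketch, assuming soup already holds the parsed info box:
for th in soup.find_all('th'):
    if th.get_text(strip=True) == 'Budget':
        # The sibling td holds the amount, e.g. '$25 million[2]'
        print(th.find_next_sibling('td').get_text())
        break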

Another possible way:
split_text = soup.get_text().split('\n')
# The entry right after 'Budget' is the amount
split_text[split_text.index('Budget') + 1]

Related

Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this with python in both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object.
Currently, parse_thead does the following when the <thead> tag is present:
In BeautifulSoup, I get table objects with doc.find_all('table') and I can use table.find_all('thead')
In lxml, I get table objects with doc.xpath() on an xpath_expr on //table, and I can use table.xpath('.//thead')
and parse_tbody and parse_tfoot work in the same way. (I did not write this code and I am not experienced with either BS or lxml.) However, without a <thead>, parse_thead returns nothing and parse_tbody returns the header and the body together.
I append a wikitable instance below as an example. It lacks <thead> and <tbody>. Instead all rows, header or not, are enclosed in <tr>...</tr>, but header rows have <th> elements and body rows have <td> elements. Without <thead>, it seems like the right criterion for identifying the header is "from the start, put rows into the header until you find a row that has an element that's not <th>".
I'd appreciate suggestions on how I could write parse_thead and parse_tbody. Without much experience here, I would think I could either
Dive into the table object and manually insert thead and tbody tags before parsing it (this seems nice because then I wouldn't have to change any other code that recognizes tables with <thead>), or alternately
Change parse_thead and parse_tbody to recognize the table rows that have only <th> elements. (With either alternative, it seems like I really need to detect the head-body boundary in this way.)
I don't know how to do either of those things and I'd appreciate advice on both which alternative is more sensible and how I might go about it.
(Edit: Examples with no header rows and multiple header rows. I can't assume it has only one header row.)
<table class="wikitable">
<tr>
<th>Rank</th>
<th>Score</th>
<th>Overs</th>
<th><b>Ext</b></th>
<th>b</th>
<th>lb</th>
<th>w</th>
<th>nb</th>
<th>Opposition</th>
<th>Ground</th>
<th>Match Date</th>
</tr>
<tr>
<td>1</td>
<td>437</td>
<td>136.0</td>
<td><b>64</b></td>
<td>18</td>
<td>11</td>
<td>1</td>
<td>34</td>
<td>v West Indies</td>
<td>Manchester</td>
<td>27 Jul 1995</td>
</tr>
</table>
We can use <th> tags to detect headers in case the table doesn't contain <thead> tags. If all cells of a row are <th> tags, then we can assume it is a header row. Based on that, I created a function that identifies the header and body rows.
Code for BeautifulSoup:
def parse_table(table):
    head_body = {'head': [], 'body': []}
    for tr in table.select('tr'):
        if all(t.name == 'th' for t in tr.find_all(recursive=False)):
            head_body['head'] += [tr]
        else:
            head_body['body'] += [tr]
    return head_body
Code for lxml:
def parse_table(table):
    head_body = {'head': [], 'body': []}
    for tr in table.cssselect('tr'):
        if all(t.tag == 'th' for t in tr.getchildren()):
            head_body['head'] += [tr]
        else:
            head_body['body'] += [tr]
    return head_body
The table parameter is either a Beautiful Soup Tag object or an lxml Element object. head_body is a dictionary containing two lists of <tr> tags: the header rows and the body rows.
Usage example:
html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
table_rows = parse_table(table)
print(table_rows)
#{'head': [<tr><th>header</th></tr>], 'body': [<tr><td>body</td></tr>]}
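If you would rather go with the asker's first alternative and actually insert <thead> and <tbody> tags so the existing parse_thead / parse_tbody code keeps working, the rows collected by parse_table can be moved into newly created tags. A rough sketch for the BeautifulSoup case (insert_thead_tbody is a made-up helper name):
def insert_thead_tbody(soup, table):
    rows = parse_table(table)          # reuse the head/body split above
    thead = soup.new_tag('thead')
    tbody = soup.new_tag('tbody')
    for tr in rows['head']:
        thead.append(tr.extract())     # extract() detaches the row first
    for tr in rows['body']:
        tbody.append(tr.extract())
    table.append(thead)
    table.append(tbody)
    return table
After this transformation, table.find_all('thead') should behave the same way it does for tables that already ship a <thead>.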
You should check whether a tr tag contains the th child you want; candidate.th returns None if there is no th inside candidate:
possibleHeaders = soup.find("table").findAll("tr")
Headers = []
for candidate in possibleHeaders:
    if candidate.th:
        Headers.append(candidate)
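If you also need the body rows under this approach, everything that was not collected into Headers can be treated as the body, for example:
Body = [candidate for candidate in possibleHeaders if not candidate.th]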

How to extract the text of a value attribute inside a tag using selenium

So I want to extract, let's say, value="THE TEXT IWANNA EXTRACT ;0" from the HTML code below. I want to extract all the strings inside the value attribute of the checkbox inside each td class="regu", but I can't seem to find a way to do it. I have extracted the names of people, but I can't extract the string inside the value attribute. Any help is much appreciated, thank you; I've been stuck for about 24 hours already. I'm open to using other libraries as long as I can extract it.
<table class="dbtable" border="0" width="100%">
<tbody><tr>
<td class="tableheader" align="center" width="1%"><b>#</b></td>
<td class="tableheader" align="center" width="60%"><b>User Name</b></td>
<td class="tableheader" align="center"><b>User Type</b></td>
</tr><tr bgcolor="#ffffff">
<td class="regu"><input name="chkStud" value="THE TEXT IWANNA EXTRACT ;0" type="checkbox"></td>
<td class="regu">NAME OF STUDENT HERE </td>
<td class="regu"> Student</td>
</tr><tr bgcolor="#ffffff">
<td class="regu"><input name="chkStud" value="PLEASE EXTRACT ME HERE, IM DYING TO GET OUT;0" type="checkbox"></td>
<td class="regu">FOO BAR FOO BAR</td>
<td class="regu"> Student</td>
</tbody></table>
Here is the python code
#!/usr/bin/python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import logging
driver = webdriver.Firefox()
driver.get("http://somewebsite/iwannascrape/login.php") #page requires a login T_T
assert "Student" in driver.title
elem = driver.find_element_by_name("txtUser")
elem.clear()
elem.send_keys("YOU_SIR_NAME") #login creds. please dont mind :D
elem2 = driver.find_element_by_name("txtPwd")
elem2.clear()
elem2.send_keys("PASSSWORDHERE")
elem2.send_keys(Keys.RETURN)
driver.get("http://somewebsite/iwannascrape/afterlogin/illhere")
# using this to extract only the table with class='dbtable' so its easier to manipulate :)
table_clas = driver.find_element_by_xpath("//*[@class='dbtable']")
source_code = table_clas.get_attribute("outerHTML") #this prints out the table and its children.
print source_code
for i in range(10):  # spacing for readability
    print "\n"
print table_clas.text #this prints out the names.
Once you locate the desired element, use the get_attribute() method:
elm = driver.find_element_by_css_selector(".dbtable input[name=chkStud]")
print(elm.get_attribute("value"))
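If you want every checkbox value rather than only the first match, the plural find_elements_by_css_selector returns a list you can loop over; a quick sketch along the same lines:
for elm in driver.find_elements_by_css_selector(".dbtable input[name=chkStud]"):
    print(elm.get_attribute("value"))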
table_clas = driver.find_element_by_xpath("//*[@class='dbtable']")
# select the desired element to thin down the html
td = table_clas.find_elements_by_xpath(".//*[@name='chkStud']")
# finally hunt down the elements you want specifically.
# find_elements (plural) returns a list you can iterate over;
# find_element returns just the first match
for things in td:
    print things.get_attribute("value")
this prints:
THE TEXT IWANNA EXTRACT ;0
PLEASE EXTRACT ME HERE, IM DYING TO GET OUT;0
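Since the script already imports BeautifulSoup and the asker is open to other libraries, another option is to hand the rendered page source to BeautifulSoup and read the value attributes from there; a sketch, assuming the table is present in the page source after login:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table", class_="dbtable")
# every checkbox named chkStud carries the value we want
for box in table.find_all("input", attrs={"name": "chkStud"}):
    print(box["value"])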

Storing the unknown Id of an html tag

So I am trying to scrape an HTML page using BeautifulSoup, but I am having problems finding a tag's id using Python 3.4. I know what the tag ("tr") is, but the id is constantly changing and I would like to save the id when it changes. For example:
<div class = "thisclass"
<table id = "thistable">
<tbody>
<tr id="what i want">
<td class = "someinfo">
<tbody>
<table>
<div>
I can find the div tag and the table, and I know the tr tag is there, but I want to extract the value of the id attribute without knowing in advance what it will say.
So far I have this code:
soup = BeautifulSoup(url.read())
divTag = soup.find_all("table", id="thistable")
i = 0
for i in divTag:
    trtag = soup.find("tr", id)
    print(trtag)
    i = i + 1
If anyone could help me solve this problem, I would appreciate it.
You can use a CSS selector:
print([element.get('id') for element in soup.select('table#thistable tr[id]')])
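For example, run against the snippet from the question:
from bs4 import BeautifulSoup

html = '''
<div class="thisclass">
<table id="thistable">
<tbody>
<tr id="what i want"><td class="someinfo"></td></tr>
</tbody>
</table>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# tr[id] keeps only rows that actually carry an id attribute
print([element.get('id') for element in soup.select('table#thistable tr[id]')])
# ['what i want']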

Using BeautifulSoup To Extract Specific TD Table Elements Text?

I'm trying to extract IP addresses from an autogenerated HTML table using the BeautifulSoup library, and I'm having a little trouble.
The HTML is structured like so:
<html>
<body>
<table class="mainTable">
<thead>
<tr>
<th>IP</th>
<th>Country</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" />uk</td>
</tr>
<tr>
<td><a href="hello.html">192.168.0.1<a></td>
<td><img src="uk.gif" />us</td>
</tr>
<tr>
<td><a href="hello.html">255.255.255.0<a></td>
<td><img src="uk.gif" />br</td>
</tr>
</tbody>
</table>
The small code below extracts the text from both td cells, but I only need the IP data, not the IP and country data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))
table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
    print(row.text)
this outputs:
127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br
What I need is the text of the IP elements (table.tbody.tr.td.a), not the country elements (table.tbody.tr.td.img).
Are there any experienced users of BeautifulSoup who have an inkling of how to do this selection and extraction?
Thanks.
This gives you the right list:
>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]
Just apply .text to the elements of this list.
There are multiple empty <a></a> tags in the above list because the <a> tags in the HTML are not closed properly. To get rid of them, you may use
pred = lambda tag: tag.parent.find('img') is None and tag.text
and ultimately:
>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']
You can use a small regular expression to extract the IP addresses; BeautifulSoup combined with regular expressions is a nice combination for scraping.
import re

ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
Search just the first <td> of each row in tbody:
# html should contain the page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]
or, maybe more readably:
rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
First, note that the HTML is not well-formed: it does not close the <a> tags. There are two <a> tags started here:
<a href="hello.html">127.0.0.1<a>
If you print table you'll see BeautifulSoup is parsing the HTML as:
<td>
<a href="hello.html">127.0.0.1<a></a></a>
</td>
...
Each a is followed by an empty a.
Given the presence of those extra <a> tags, if you want every third <a> tag, then
for row in table.findAll("a")[::3]:
print(row.get_text())
suffices:
127.0.0.1
192.168.0.1
255.255.255.0
On the other hand, if the occurrence of <a> tags is not so regular and you only want the <a> tags with no preceding sibling (such as, but not limited to, an <img>), then
for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())
would work.
If you have lxml, the criteria can be expressed more succinctly using XPath:
import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)
The XPath used above has the following meaning:
//table select all <table> tags
[#class="mainTable"] that have a class="mainTable" attribute
// from these tags select descendants
td/a which are td tags with a child <a> tag
[not(preceding-sibling::img)] such that it does not have a preceding sibling <img> tag
/text() return the text of the <a> tag
It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.

Scrapy: getting only td elements with ALIGN=RIGHT

I'm using scrapy to scrape data from this website: http://www.nuforc.org/webreports/ndxevent.html
I need to separate dates from counts of UFO sightings, yes, exciting!
Here is an example of what I'm scraping
<TR VALIGN=TOP>
<TD><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><A HREF= ndxe201303.html>03/2013</A></TD>
<TD ALIGN=RIGHT><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>108</TD>
So in this example date = 03/2013, count = 108
Now the dates are not a problem, since I can just do
hxs.select('//tbody//td//font//a//text()').extract()
to get the text within the "a" tags.
But is there a way to get the text from the td elements that have the attribute ALIGN=RIGHT?
I have looked at the docs and selectors but I'm confused
hxs.select('//tbody[contains(td, "ALIGN")]').extract()
?
This selects the text from all <td> elements with the attribute ALIGN="RIGHT":
hxs.select('//tbody//td[@ALIGN="RIGHT"]//text()').extract()
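To keep each date attached to its count, you can also walk the table row by row; a sketch in the same old-style selector API as above, keying off column position instead of the ALIGN attribute (the //tr row XPath is an assumption about the page structure):
for row in hxs.select('//tr'):
    date = row.select('td[1]//a/text()').extract()
    count = row.select('td[2]//text()').extract()
    if date and count:
        # e.g. "03/2013 108"
        print(date[0] + " " + count[0].strip())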
