I am learning Beautiful Soup and Python, and in that context I am doing the "Baby Names" exercise from the Google Python tutorial on regular expressions, using the set of HTML files that contain popular baby names for different years (e.g. baby1990.html). You can find the dataset here if you are interested: https://developers.google.com/edu/python/exercises/baby-names
The HTML files contain a particular table which stores the popular baby names and whose HTML code is the following:
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
<tr valign="top"><td width="25%" class="greycell">
Background information
<p><br />
Select another <label for="yob">year of birth</label>?<br />
<form method="post" action="/cgi-bin/popularnames.cgi">
<input type="text" name="year" id="yob" size="4" value="1990">
<input type="hidden" name="top" value="1000">
<input type="hidden" name="number" value="">
<input type="submit" value=" Go "></form>
</td><td>
<h3 align="center">Popularity in 1990</h3>
<p align="center">
<table width="48%" border="1" bordercolor="#aaabbb"
cellpadding="2" cellspacing="0" summary="Popularity for top 1000">
<tr align="center" valign="bottom">
<th scope="col" width="12%" bgcolor="#efefef">Rank</th>
<th scope="col" width="41%" bgcolor="#99ccff">Male name</th>
<th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td> # Targeted row
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td> # Targeted row
etc...
There is also another table in the HTML file that I do not want to capture; it has the following HTML code:
<table width="100%" border="0" cellspacing="0" cellpadding="4">
<tbody>
<tr><td class="sstop" valign="bottom" align="left" width="25%">
Social Security Online
</td><td valign="bottom" class="titletext">
<!-- sitetitle -->Popular Baby Names
</td>
</tr>
<tr bgcolor="#333366"><td colspan="2" height="2"></td></tr>
<tr><td class="graystars" width="25%" valign="top">
Popular Baby Names</td><td valign="top">
<a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
width="52" height="47" align="left"
alt="SSA logo: link to Social Security home page" border="0"></a><a name="content"></a>
<h1>Popular Names by Birth Year</h1>September 12, 2007</td>
</tr>
<tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
</tbody></table>
Comparing the <table> tags of the two tables, I concluded that the unique characteristic of the targeted table (the one I am trying to capture) is the 'summary' attribute, which appears to have the value 'formatting'. Therefore I tried the following command:
right_table = soup.find("table", summary = "formatting")
However, this command failed to select the targeted table.
In contrast, the following command succeeded:
table = soup.find(summary="Popularity for top 1000")
Could you explain by looking at the html code why the first command failed and the second succeeded?
Your advice will be appreciated.
I answered your question earlier; the code works.
One more thing: html.parser is broken on Python 2, so do not use it, use lxml instead.
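A minimal sketch of that suggestion (assuming baby1990.html from the exercise is in the working directory and lxml is installed):

from bs4 import BeautifulSoup

# parse with lxml instead of the Python 2 html.parser
with open("baby1990.html") as f:
    soup = BeautifulSoup(f, "lxml")

# per the answer above, with a parser that builds the tree correctly this finds the outer table
right_table = soup.find("table", summary="formatting")
print(right_table is not None)

# the nested table that actually holds the rank/name rows
names_table = soup.find("table", summary="Popularity for top 1000")
print(names_table is not None)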
Related
I want to get a table and save it to Excel with Python scripts. Here is the response:
<body>
<table id="need">
<tr height="30" align="center">
<td>need</td>
<td id="td1">need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">need</td>
...
</tr>
<tr height="30" align="center" cid="2" class="txt">
<td>not need</td>
<td id="td1">not need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">not need</td>
...
</tr>
...
</table>
<table>
...
</table>
</body>
I need to get the contents of each <tr> except those with class="txt", and each <td> except those with type="wholeLast". In short, I need to get all the "need" values in the above response.
I tried this:
trs = soup.find_all("tr", attrs={"height": "30", "align": "center"})
But I don't know how to exclude the <td> elements with type="wholeLast". Maybe I need to use another approach.
Any suggestion is appreciated.
With CSS selectors and the :not() pseudo-class you could do this:
tds = soup.select('tr:not(.txt) td:not([type="wholeLast"])')
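A minimal runnable sketch of that selector (assuming html holds the response text shown above, and BeautifulSoup 4.7 or newer so that select() understands the full :not() syntax):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
# every td that is not type="wholeLast", inside every tr that does not carry class="txt"
tds = soup.select('tr:not(.txt) td:not([type="wholeLast"])')
for td in tds:
    print(td.get_text(strip=True))  # prints only the "need" cells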
Here is an example of the code I used to scrape the table:
with open('text.txt', 'w') as algroo:
    for row in RoOtbody.find_all('tr'):
        for cell in row.find_all('td'):
            algroo.write(cell.text)
            algroo.write('\n')
I already used Selenium and requests to extract the outer html from the webpage. I also tried to use html.parser and lxml.
The html looks like this:
<tr class="table">
<td class="table" valign="top">
<p class="tbl-hdr">HS heading</p>
</td>
<td class="table" valign="top">
<p class="tbl-hdr">Desccription of product</p>
</td>
<td class="table" colspan="2" valign="top">
<p class="tbl-hdr">Working or processing, carried out on non-originating
materials, which confers originating status</p>
</td>
</tr>
The problem is that when I open the txt file, all the cell elements are in a single column like the one below, literally:
HS heading
Desccription of product
Working or processing, carried out on non-originating materials, which confers originating status
In all the tutorials I have watched and read, the cells end up on the same row, like this:
HS headingDesccription of productWorking or processing, carried out on non-originating materials, which confers originating status
Can anyone help me, please?
I don't know if this will help you:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<tr class="table">
<td class="table" valign="top">
<p class="tbl-hdr">HS heading</p>
</td>
<td class="table" valign="top">
<p class="tbl-hdr">Desccription of product</p>
</td>
<td class="table" colspan="2" valign="top">
<p class="tbl-hdr">Working or processing, carried out on non-originating
materials, which confers originating status</p>
</td>
</tr>'''
doc = SimplifiedDoc(html)
tr = doc.tr # get first tr
print (tr.text)
print (tr.getText(' '))
tds = tr.tds # get all td
for td in tds:
    print(td.text)
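If you prefer to stay with BeautifulSoup, the root cause is that the original loop writes a newline after every cell. A minimal sketch (assuming RoOtbody is the <tbody> Tag already located in the question's code) that joins the cells of each row before writing the newline:

with open('text.txt', 'w') as out:
    for row in RoOtbody.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
        out.write('\t'.join(cells))  # all cells of the row stay on one line
        out.write('\n')              # newline only once per row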
I'm using Scrapy to scrape information from two tables on the website.
I first scrape the tables, but it turns out that staffs and students are empty while the response is not. I can also find the table tag in the page source. Can anyone figure out what the problem is?
import scrapy
from universities.items import UniversitiesItem


class UniversityOfSouthCarolinaColumbia(scrapy.Spider):
    name = 'uscc'
    allowed_domains = ['sc.edu']
    start_urls = ['http://www.sc.edu/about/directory/?name=']

    def parse(self, response):
        for ln in ['Zhao']:
            query = response.url + ln
            yield scrapy.Request(query, callback=self.parse_item)

    @staticmethod
    def parse_item(response):
        staffs = response.xpath('//table[@id="directorystaff"]/tbody/tr[@role="row"]')
        students = response.xpath('//table[@id="directorystudent"]/tbody/tr[@role="row"]')
        print('--------------------------')
        print('staffs', staffs)
        print('==========================')
        print('students', students)
It's a really good question, and I investigated it. I concluded that the raw response does not contain those tag attributes; I think the browser modifies the page with a script that adds the attributes to the tags after it loads.
In the response, the <tr> tags do not have the 'role' attribute.
Please see:
<table class="display" id="directorystaff" width="100%">
<thead>
<tr>
<th style="text-align: left">Name</th>
<th style="text-align: left">Email</th>
<th style="text-align: left">Phone</th>
<th style="text-align: left">Department</th>
<th style="text-align: left">Office Address</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Zhao, Xia </td>
<td style="text-align: left"> </td>
<td style="text-align: left">(803) 777-8436 </td>
<td style="text-align: left">Chemistry </td>
<td style="text-align: left">537 </td>
</tr>
<tr>
<td style="text-align: left">Zhao, Xing </td>
<td style="text-align: left"> </td>
<td style="text-align: left"> </td>
<td style="text-align: left">Mechanical Engineering </td>
<td style="text-align: left"> </td>
</tr>
The first screenshot shows the raw response page, and the second shows the page as rendered in the browser:
So, if you want to get the list of staff, I recommend this XPath:
//table[@id="directorystaff"]/tbody/tr/td
And for students, I recommend this XPath:
//table[@id="directorystudent"]/tbody/tr/td
If you want something else, you can modify these XPath queries.
Here is an example for you:
import requests
from lxml import html

x = requests.get("https://www.sc.edu/about/directory/?name=Zhao")
ht = html.fromstring(x.text)
element = ht.xpath('//table[@id="directorystaff"]/tbody/tr/td')
for el in element:
    print(el.text)
And output:
>>Zhao, Xia
>>
>>(803) 777-8436
>>Chemistry
>>537
>>Zhao, Xing
>>
>>
>>Mechanical Engineering
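Back in the Scrapy spider, the same idea might look like the sketch below (dropping the role="row" filter that only exists in the browser-rendered DOM; .getall() assumes a reasonably recent Scrapy, otherwise use .extract()):

@staticmethod
def parse_item(response):
    # the raw response has no role="row" attributes, so select the rows without that filter
    staffs = response.xpath('//table[@id="directorystaff"]/tbody/tr/td/text()').getall()
    students = response.xpath('//table[@id="directorystudent"]/tbody/tr/td/text()').getall()
    print('staffs', [s.strip() for s in staffs])
    print('students', [s.strip() for s in students])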
I am trying to write a small Python script that scrapes tracking details from an internal system. The details are presented in the HTML table below. I am looking to turn them into Python tuples, for example:
(processed, unit b door 3, 30-MAY-16 12:19)
(created, unit b door 2, 30-MAY-16 06:17)
I am using Splinter.
<table class="resultView" cellspacing="0" rules="all" border="1" style="width:540px;border-collapse:collapse;">
<tr class="clearHeader">
<th align="left" scope="col">Activity</th><th scope="col"> </th><th align="center" scope="col">Date</th>
</tr>
<tr class="statusRow">
<td style="width:30%;">Processed</td>
<td align="center"> Unit B<br /> Door 3 </td>
<td align="center" style="width:20%;">30-May-16<br/>12:19</td>
</tr>
<tr class="statusAlternate">
<td style="width:30%;">Created</td>
<td align="center"> Unit B <br /> Door 2</td>
<td align="center" style="width:20%;">30-May-16<br/>06:17</td>
</tr>
</table>
If I run:
for update in browser.find_by_css('tr'):
    print update.find_by_css('td')
it displays:
[<splinter.driver.webdriver.WebDriverElement object at 0x103085e90>,
<splinter.driver.webdriver.WebDriverElement object at 0x103085ed0>,
<splinter.driver.webdriver.WebDriverElement object at 0x1030b4050>]
This is what I would have expected. However, I cannot access the value from it. Changing the line to:
print update.find_by_css('td').value
gives the error:
AttributeError: 'ElementList' object has no attribute 'value'
This is a list, so I try to access the first element of the list with:
print update.find_by_css('td').first.value
I then get this error:
splinter.exceptions.ElementDoesNotExist: no elements could be found with css "td"
I cannot work out what I am doing wrong.
I think the problem is that you are looking for "tr" or "td" in your table with the CSS selectors 'tr' and 'td', and none of the <tr> and/or <td> elements in your table have these as a class.
For this case, I suggest using XPath to look for the elements you want to find.
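A minimal sketch of that XPath route (assuming browser is the Splinter browser already on the tracking page, and that the data rows keep the statusRow/statusAlternate classes shown in the question):

rows = browser.find_by_xpath('//table[@class="resultView"]//tr[starts-with(@class, "status")]')
for row in rows:
    cells = row.find_by_xpath('.//td')  # './/td' keeps the lookup scoped to this row
    print(tuple(cell.text.replace('\n', ' ').lower() for cell in cells))

The starts-with(@class, "status") filter keeps only the data rows, so the header row of <th> cells is never touched.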
I'm working in Python for the first time. I've used Mechanize to search a website along with BeautifulSoup to select a particular div, and now I'm trying to grab a specific sentence with a regular expression. This is the soup object's contents:
<div id="results">
<table cellspacing="0" width="100%">
<tr>
<th align="left" valign="middle" width="32%">Physician Name, (CPSO#)</th>
<th align="left" valign="middle" width="36%">Primary Practice Location</th>
<!-- <th width="16%" align="center" valign="middle">Accepting New Patients?</th> -->
<th align="center" valign="middle" width="32%">Disciplinary Info & Restrictions</th>
</tr>
<tr>
<td>
<a class="doctor" href="details.aspx?view=1&id= 85956">Hull, Christopher Merritt </a> (#85956)
</td>
<td>Four Counties Medical Clinic<br/>1824 Concessions Dr<br/>Newbury ON N0L 1Z0<br/>Phone: (519) 693-0350<br/>Fax: (519) 693-0083</td>
<!-- <td></td> -->
<td align="center"></td>
</tr>
</table>
</div>
My regular expression to get the text "Hull, Christopher Merritt" is:
patFinderName = re.compile('<a class="doctor" href="details.aspx?view=1&id= 85956">(.*) </a>')
It keeps returning empty and I can't figure out why. Does anybody have any ideas?
Thank you for the answers. I've changed it to:
patFinderName = re.compile('<a class="doctor" href=".*">(.*) </a>')
Now it works beautifully.
? is a metacharacter in regular expressions, meaning zero or one of the previous atom. Since you want a literal question mark, you need to escape it.
You should escape the ? in your regex:
In [8]: re.findall('<a class="doctor" href="details.aspx\?view=1&id= 85956">(.*)</a>', text)
Out[8]: ['Hull, Christopher Merritt ']