Finding sibling items between headers - python

I'm trying to scrape some documentation for data files composed in XML. I had been writing the XSD manually by reading the pages and typing it out, but it has occurred to me that this is a prime case for page scraping. The general format (as far as I can tell based on a random sample) is something like the following:
<h2>
<span class="mw-headline" id="The_.22show.22_Child_Element">
The "show" Child Element
</span>
</h2>
<dl>
<dd>
<table class="infotable">
<tr>
<td class="leftnormal">
allowstack
</td>
<td>
(Optional) Boolean – Indicates whether the user is allowed to stack items within the table, subject to the restrictions imposed for each item. Default: "yes".
</td>
</tr>
<tr>
<td>
agentlist
</td>
<td>
(Optional) Id – If set to a tag group id, only picks with the agent pick's identity tag from that group are shown in the table. Default: None.
</td>
</tr>
<tr>
<td>
allowmove
</td>
<td>
(Optional) Boolean – Indicates whether picks in this table can be moved out of this table, if the user drags them around. Default: "yes".
</td>
</tr>
<tr>
<td>
listpick
</td>
<td>
(Optional) Id – Unique id of the pick to take the table's list expression from (see listfield, below). Note that this does not work when used with portals. Default: None.
</td>
</tr>
<tr>
<td>
listfield
</td>
<td>
(Optional) Id – Unique id of the field to take the table's list expression from (see listpick, above). Note that this does not work when used with portals. Default: None.
</td>
</tr>
</table>
</dd>
</dl>
<p>
The "show" element also possesses child elements that define additional behaviors of the table. The list of these child elements is below and must appear in the order shown. Click on the link to access the details for each element.
</p>
<dl>
<dd>
<table class="infotable">
<tr>
<td class="leftnormal">
<a href="index.php5#title=TableDef_Element_(Data).html#list">
list
</a>
</td>
<td>
An optional "list" element may appear as defined by the given link. This element defines a
<a href="index.php5#title=List_Tag_Expression.html" title="List Tag Expression">
List Tag Expression
</a>
for the table.
</td>
</tr>
</table>
</dd>
</dl>
There's a pretty clear pattern: each file has a number of elements, each defined by a header, followed by text, followed by a table (generally the attributes), and possibly another set of text and a table (for the child elements). I think I can reach a reasonable solution by simply using next or next-sibling to step through items and scanning the text to determine whether the following table holds attributes or child elements, but it feels a bit odd that I can't just grab everything between two header tags and then scan that.

You can search for multiple elements at the same time, for example <h2> and <table>. You can then make a note of each <h2>'s contents before processing each <table>.
For example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for el in soup.find_all(['h2', 'table']):
    if el.name == 'h2':
        h2 = el.get_text(strip=True)
        h2_id = el.span['id']
    else:
        for tr in el.find_all('tr'):
            row = [td.get_text(strip=True) for td in tr.find_all('td')]
            print([h2, h2_id, *row])
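If you do want to grab everything between two headers in one chunk first (as described in the question), a minimal sketch along these lines should work, reusing the soup object above and assuming the headers and their tables sit at the same sibling level, as in the sample:

sections = {}
for h2 in soup.find_all('h2'):
    between = []
    for sib in h2.find_next_siblings():
        if sib.name == 'h2':
            break  # stop at the next header
        between.append(sib)
    sections[h2.get_text(strip=True)] = between

Each value in sections is then the list of tags between that header and the next, which you can scan to decide whether a given table holds attributes or child elements.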

Related

Use beautifulSoup to find a table after a header?

I am trying to scrape some data off a website. The data that I want is listed in a table, but there are multiple tables and no IDs. I then had the idea that I would find the header just above the table I am searching for and then use that as an indicator.
This has really troubled me, so as a last resort I wanted to ask if someone knows how to use BeautifulSoup to find the table.
A snippet of the HTML code is provided below, thanks in advance :)
The table I am interested in is the table right beneath <h2>Mine neaste vagter</h2>
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and index [-1] to grab the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by text, and then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
You can use :-soup-contains (or just :contains) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
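Whichever approach you use to locate the table, a small follow-up sketch (reusing the soup object above) can turn its rows into lists for further processing:

table = soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
rows = []
for tr in table.find_all('tr'):
    # header cells (th) and data cells (td) both become list entries
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])])
for row in rows:
    print(row)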

How to extract data with beautifulsoup with similar attributes

I'm trying to scrape a saved HTML page of results, copying the entries for each record as I iterate through the document. However, I can't figure out how to narrow down the element to start from. The data I want to grab is in the "td" tags below each of the following "tr" tags:
<tr bgcolor="#d7d7d7">
<td valign="top" nowrap="">
Submittal<br>20190919-5000
<!-- ParentAccession= -->
<br>
</td>
<td valign="top">
09/18/2019<br>
09/19/2019
</td>
<td valign="top" nowrap="">
ER19-2760-000<br>ER19-2762-000<br>ER19-2763-000<br>ER19-2764-000<br>ER19-2765-000<br>ER19-2766-000<br>ER19-2768-000<br><br>
</td>
<td valign="top">
(doc-less) Motion to Intervene of Snohomish County Public Utility District No. 1 under ER19-2760, et. al..<br>Availability: Public<br>
</td>
<td valign="top">
<classtype>Intervention /<br> Motion/Notice of Intervention</classtype>
</td>
<td valign="top">
<table valign="top">
<input type="HIDDEN" name="ext" value="TXT"><tbody><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904817:15359058:TXT"></td><td> Text</td><td> & nbsp; 0K</td></tr><input type="HIDDEN" name="ext" value="PDF"><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904822:15359063:PDF"></td><td> FERC Generated PDF</td><td> 11K</td></tr>
</tbody></table>
</td>
The next tag is <tr bgcolor="White">, with the same structure as the one above. These alternate so the rows appear in different colors on the results page.
I need to go through all of the subsequent td tags and grab the data, but they aren't differentiated by a class or anything else I can zero in on. The code I wrote grabs the entire text contents of the td tags and appends it, but I need to treat each td tag as a separate item and then do the same for the next entry, and so on.
By setting the td[0] value I start at the first td tag but I don't think this is the correct approach.
from bs4 import BeautifulSoup
import urllib
import re

soup = BeautifulSoup(open("/Users/Desktop/FERC/uploads/ferris_9-19-2019-9-19-2019.electric.submittal.html"), "html.parser")
data = []
for td in soup.findAll(bgcolor=["#d7d7d7", "White"]):
    values = [td[0].text.strip() for td in td.findAll('td')]
    data.append(values)
print(data)
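One possible sketch, kept close to the code above but with the comprehension variable renamed so it no longer shadows the row, iterates the matching rows and keeps each td as its own item:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/Users/Desktop/FERC/uploads/ferris_9-19-2019-9-19-2019.electric.submittal.html"), "html.parser")
data = []
for row in soup.find_all("tr", bgcolor=["#d7d7d7", "White"]):
    # each cell becomes its own item for this record
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    data.append(cells)
print(data)

Note that the last cell of each row contains a nested table of download checkboxes, so you may want to handle or skip that cell separately.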

How to extract the innerText of a <td> element with respect to the innerText of another <td> element

I am using Selenium in Python. I have come across this table WebElement. I need to check if a string is present in the WebElement and return a corresponding string in case it's present.
<table width="700px" class="tableListGrid">
<thead>
<tr class="tableInfoTrBox">
<th>Date</th>
<th>Task Code</th>
<!-- th>Phone Number</th -->
<th>Fota Job</th>
<th colspan="2" class="thLineEnd">Task Description</th>
</tr>
</thead>
<tbody>
<tr class="tableTr_r">
<td>2018-04-06 05:48:29</td>
<td>FU</td>
<!-- td></td -->
<td>
57220180406-JSA69596727
</td>
<td style="text-align:left;">
updated from [A730FXXU1ARAB/A730FOJM1ARAB/A730FXXU1ARAB] to [A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9]
</td>
<td>
<table class="btnTypeE">
<tr>
<td>
View
</td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
I need to search for "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9" in this element and return "57220180406-JSA69596727", which is present in the same row at a different place on the web page. Is it possible to do this in Selenium?
EDIT: Cleaned the code to only contain useful data.
It can be achieved by finding the element using the following XPath:
//td[contains(., 'A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9')]/preceding-sibling::td[1]/a
The XPath can be read as:
find the td which contains "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9", then take the td
immediately preceding it and move to its a tag.
After this you can get the text with Selenium:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, "//td[contains(., 'A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9')]/preceding-sibling::td[1]/a").text
To look up a text, e.g. A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9, and find the associated text, e.g. 57220180406-JSA69596727, you can write a function as follows:
def test_me(myString):
    myText = driver.find_element_by_xpath("//table[@class='tableListGrid']//tbody/tr[@class='tableTr_r']//td[.='" + myString + "']//preceding::td[1]/a").get_attribute("innerHTML")
    return myText
Now, from your main()/@Test you can call the function with the desired text as follows:
test_me("A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9")

Extracting string from an HTML table from a given tab using Python

I need to extract a string value from the HTML table below. I want to loop through the table for a particular tab and copy the results horizontally to the command line or to a file.
I am pasting only one row of information here.
This table gets updated based on the changes happening on the Gerrits.
The result that I want is all of the Gerrit numbers under a given tab.
For example, if I want to get the Gerrit list from the approval queue, the values should be displayed like this:
7897423, 2423343, 34242342, 34234, 57575675
<ul>
<li><span>Review Queue</span></li>
<li><span>Approval Queue</span></li>
<li><span>Verification Queue</span></li>
<li><span>Merge Queue</span></li>
<li><span>Open Queue</span></li>
<li><span>Failed verification</span></li>
</ul>
<div id="tab1">
<h1>Review Queue</h1>
<table class="tablesorter" id="dashboardTable">
<thead>
<tr>
<th></th>
<th>Gerrit</th>
<th>Owner</th>
<th>CR(s)</th>
<th>Project</th>
<th>Dev Branch/PL</th>
<th>Subject</th>
<th>Status</th>
<th>Days in Queue</th>
</tr>
</thead>
<tbody>
<tr>
<td><input type="checkbox" /></td>
<td> 1696771 </td>
<td> ponga </td>
<td> 1055680 </td>
<td>platform/hardware/kiosk/</td>
<td> hidden-userspace.aix.2.0.dev </td>
<td>display: information regarding display</td>
<td> some info here </td>
<td> 2 </td>
</tr>
What stops you from leveraging BeautifulSoup for this?
Let's say you have already read the HTML (using sgmllib or any other library) into a string variable named html_contents.
Since you do not mention which column you want to get data from, I am extracting the Gerrit number column.
You can simply do:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_contents, 'html.parser')
for tr in soup.tbody:
    if tr.name == 'tr':
        print(tr.contents[3].get_text(strip=True))
This loops over all the tr tags inside the tbody and, assuming every tr has the same structure, extracts the text of the Gerrit column (tr.contents[3]).
Read the BeautifulSoup quick start; it will make parsing HTML documents much easier.
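If you want the comma-separated list of Gerrit numbers for one particular tab (as in the example output above), a sketch along these lines should work, assuming each tab's div wraps its own dashboardTable as in the snippet (the id "tab1" is the Review Queue; swap in the id of the tab you want):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_contents, 'html.parser')
tab = soup.find('div', id='tab1')
gerrits = []
for tr in tab.find('tbody').find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) > 1:
        gerrits.append(tds[1].get_text(strip=True))  # second column is "Gerrit"
print(', '.join(gerrits))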

Pull out information from between tags without unique classes or IDs

I'm trying to scrape the content of a particular website and render the output so that it can be further manipulated / used in other mediums. The biggest challenge I'm facing is that few of the tags have unique IDs or classes, and some of the content is simply displayed in between tags, e.g., <br></br>TEXT<br></br> (see, for example, "Ranking" in the sample HTML below).
Somehow, I've created working code - even if commensurate with the skill of someone in the fourth grade - but this is the furthest I've gotten, and I was hoping to get some help on how to continue to pull out the relevant information. Ultimately, I'm looking to pull any plain text within tags, and plain text in between tags. The only exception is that, whenever there's an img of word_icon_b.gif or word_icon_R.gif, then text "active" or "inactive" gets substituted.
Below is the code I've managed to cobble together:
from bs4 import BeautifulSoup
import urllib2
pageFile = urllib2.urlopen("http://www.url.com")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
table = soup.findAll("td",{"class": ["row_col_1", "row_col_2"]})
print table
There are several tables in the page, but none have a unique ID or class, so I didn't reference them, and instead just pulled out the TDs - since these are the only tags with unique classes which sandwich the information I'm looking for. The HTML that the above pulls from is as follows:
<tr>
<td class="row_col_1" valign="top" nowrap="" scope="row">
5/22/2014
</td>
<td class="row_col_1" valign="top" scope="row">
100%
</td>
<td class="row_col_1" valign="top" nowrap="" scope="row">
<a target="_top" href="/NY/2014/N252726.DOC">
<img width="18" height="14" border="0" alt="Click here to download word document n252726.doc" src="images/word_icon_b.gif"></img>
</a>
<a target="_top" href="index.asp?ru=n252726&qu=ranking&vw=detail">
NY N252726
</a>
<br></br>
Ranking
<br></br>
<a href="javascript:disclaimer('EU Regulatory Body','h…cripts/listing_current.asp?Phase=List_items&lookfor=847720');">
8477.20.mnop
</a>
<br></br>
<a target="_new" href="http://dataweb.url.com/scripts/ranking_current.asp?Phase=List_items&lookfor=847759">
8477.59.abcd
</a>
<br></br>
</td>
<td class="row_col_1" valign="top" scope="row">
The ranking of a long-fanged monkey sock puppet that is coding-ly challenged
</td>
<td class="row_col_1" valign="top" nowrap="" scope="row">
</td>
</tr>
The reason why I have ["row_col_1", "row_col_2"] is because the data served up is presented as <td class="row_col_1" valign="top" nowrap="" scope="row"> for the odd rows, and <td class="row_col_2" valign="top" nowrap="" scope="row"> for the even rows. I have no control over the HTML I'm attempting to pull from.
Also, the base links, such as javascript:disclaimer('EU Regulatory Body','h…cripts/listing_current.asp? and http://dataweb.url.com/scripts/ranking_current.asp?Phase=List_items& will always remain the same (though the specific links will change, e.g., current.asp?Phase=List_items&lookfor=847759 may appear on the next page as current.asp?Phase=List_items&lookfor=101010).
EDIT: @Martijn: I'm hoping to have returned to me the following items from the HTML: 1) 5/22/2014, 2) 100%, 3) the image name, word_icon_b.gif (to substitute text for it), 4) NY N252726 (and the preceding link), 5) Ranking, 6) 8477.20.mnop (and the preceding link), 7) 8477.59.abcd (and the preceding link), and 8) 'The ranking of a long-fanged monkey sock puppet that is coding-ly challenged.'
I'd like the output to be wrapped in XML tags, but this is not excessively important; I imagine these tags can just be inserted into the bs4 code.
If you would like to try the lxml library and XPath, this is a hint of how your code might look. You should probably make a better selection of the desired <tr>s than I did without seeing the full HTML. Also, you should handle any potential IndexErrors, etc.
from lxml import html
import urllib2

pageFile = urllib2.urlopen("http://www.url.com")
pageHtml = pageFile.read()
pageFile.close()

x = html.fromstring(pageHtml)
all_rows = x.xpath(".//tr")
results = []
for row in all_rows:
    date = row.xpath(".//td[contains(@class, 'row_col')]/text()")[0]
    location = row.xpath(".//a[contains(@href, 'index.asp')]/text()")[0]
    rank = row.xpath(".//a[contains(@href, 'javascript:disclaimer(')]/text()")[0]
    results.append({'date': date, 'location': location, 'rank': rank})
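For the word_icon substitution mentioned in the question, a hedged BeautifulSoup sketch (reusing pageHtml and the row_col_1/row_col_2 selection from the question's own code; the active/inactive mapping is taken from the question text) might look like:

from bs4 import BeautifulSoup

soup = BeautifulSoup(pageHtml, "html.parser")
icon_text = {"word_icon_b.gif": "active", "word_icon_R.gif": "inactive"}
for td in soup.find_all("td", {"class": ["row_col_1", "row_col_2"]}):
    parts = []
    for img in td.find_all("img"):
        name = img.get("src", "").rsplit("/", 1)[-1]
        if name in icon_text:
            parts.append(icon_text[name])  # substitute text for the icon
    parts.extend(td.stripped_strings)      # plain text inside and between child tags
    print(parts)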
