I am dealing with HTML table data consisting of two fields: First, a field that holds a hyperlinked text string, and second, one that holds a date string. I need the two to be extracted and remain associated.
I am catching the rows in the following fashion (found from another SO question):
pg = s.get(url).text  # s = requests Session object
soup = BeautifulSoup(pg, 'html.parser')
files = [[
    [td for td in tr.find_all('td')]
    for tr in table.find_all('tr')]
    for table in soup.find_all('table')]
Iterating over files[0] yields rows whose classes are dynamic, because the HTML was published from Microsoft Excel, so I can't depend on class names. The positions of the elements, however, are stable. The rows look like this:
[<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none"><a href="subfolder/north_america-latest.shp.zip"><span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>north_america-latest.shp.zip</span></a></td>, <td class="another auto tag" style="border-top:none;border-left:none">2023-01-01</td>]
Broken up, for easier reading:
[
<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none">
<a href="subfolder/north_america-latest.shp.zip">
<span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>
north_america-latest.shp.zip
</span>
</a>
</td>,
<td class="another auto tag" style="border-top:none;border-left:none">
2023-01-01
</td>
]
Using the .get_text() method on a td I can get the text of the link, as well as the date, in one go; but once I have the td object, how do I go about obtaining the following three elements?
"subfolder/north_america-latest.shp.zip" # the link
"north_america-latest.shp.zip" # the name
"2023-01-01" # the date
Assuming that what you call 'row' is actually a string, here is how you would get those bits of information:
from bs4 import BeautifulSoup as bs
my_list = '''[<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none"><a href="subfolder/north_america-latest.shp.zip"><span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>north_america-latest.shp.zip</span></a></td>, <td class="another auto tag" style="border-top:none;border-left:none">2023-01-01</td>]'''
soup = bs(my_list, 'html.parser')
link = soup.select_one('td a').get('href')
text = soup.select_one('td a').get_text()
date = soup.select('td')[1].get_text()
print('link:', link)
print('text:', text)
print('date:', date)
Result in terminal:
link: subfolder/north_america-latest.shp.zip
text: north_america-latest.shp.zip
date: 2023-01-01
I'm not particularly convinced this is the actual key to your conundrum: surely there is a better way of getting the information you're after besides that list comprehension you're using. As stated in the comments, without the actual page HTML, truly debugging this is next to impossible.
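To keep each link/date pair associated while walking the real table, one approach is to collect a record per row. This is a sketch only: the HTML below is a stand-in for the Excel-published page (the real markup may carry extra header rows and attributes), and it assumes the anchor sits in the first cell and the date in the second.

```python
from bs4 import BeautifulSoup

# Stand-in for the Excel-published table; the real page's markup may differ.
html = '''
<table>
  <tr>
    <td><a href="subfolder/north_america-latest.shp.zip">
      <span>north_america-latest.shp.zip</span></a></td>
    <td>2023-01-01</td>
  </tr>
  <tr>
    <td><a href="subfolder/south_america-latest.shp.zip">
      <span>south_america-latest.shp.zip</span></a></td>
    <td>2023-02-01</td>
  </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
records = []
for tr in soup.find_all('tr'):
    a = tr.find('a')            # the anchor lives in the first td
    tds = tr.find_all('td')
    if a is None or len(tds) < 2:
        continue                # skip header or malformed rows
    records.append({
        'link': a.get('href'),
        'name': a.get_text(strip=True),
        'date': tds[1].get_text(strip=True),
    })

print(records)
```

Because each dict is built inside one row's loop iteration, the link, name, and date can never get shuffled relative to each other, regardless of class names.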
I'm currently learning how to scrape webpages.
THE PROBLEM:
I can't use a CSS selector, because on other sites the position (order) of this tag (the one holding the estimated start time) changes.
MY GOAL: retrieve the information "January 2022".
HTML-SNIPPET:
<tr>
<td headers="studyInfoColTitle"> Estimated <span style="display:inline;" class="term" data-term="Study Start Date" title="Show definition">Study Start Date <i class="fa fa-info-circle term" aria-hidden="true" data-term="Study Start Date" style="border-bottom-style:none;"></i></span> :
</td>
<td headers="studyInfoColData" style="padding-left:1em">January 2022</td>
</tr>
WHAT I HAVE TRIED:
1.) I tried to declare a func to filter out (combined with find_all) this tag:
def searchMethod(tag):
    return re.compile("Estimated") and (str(tag.string).find("Estimated") > -1)

# calling the above func
foundTag_s = soup.find_all(searchMethod)
This helped me in other similar cases, but here it didn't work; I think it has to do with how the text is divided between the tags...
2.) I tried to use the string search:
starttime_elem = soup.find("td", string="Estimated")
but it doesn't work for some reason.
After many hours of searching I decided to ask here.
Ref: https://clinicaltrials.gov/ct2/show/NCT05169372?draw=2&rank=1
So, you are actually looking at different pages within the same domain. The HTML is basically consistent in terms of elements and attributes.
CSS selector lists are a lot more versatile than just positional matching; there are numerous ways to solve your current problem.
One is simply to use an attribute = value CSS selector to target the start-date node, then move to the next td:
import requests
from bs4 import BeautifulSoup as bs
links = ['https://clinicaltrials.gov/ct2/show/NCT05169372?draw=2&rank=1', 'https://clinicaltrials.gov/ct2/show/NCT05169359?draw=2&rank=2']
with requests.Session() as s:
    for link in links:
        r = s.get(link, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        start = soup.select_one('[data-term="Study Start Date"]')
        if start is not None:
            print(start.text)
            print(start.find_next('td').text)
This is a robust and consistent attribute.
You could also use :-soup-contains:
start = soup.select_one('.term:-soup-contains("Study Start Date")')
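Both selectors can be tried without hitting the live site. The snippet below is a minimal reproduction of the row quoted in the question, not the actual page (`:-soup-contains` needs the soupsieve backend that ships with modern bs4):

```python
from bs4 import BeautifulSoup

# Minimal reproduction of the quoted row, not the live clinicaltrials.gov page.
html = '''
<table><tr>
  <td headers="studyInfoColTitle"> Estimated
    <span class="term" data-term="Study Start Date">Study Start Date</span> :
  </td>
  <td headers="studyInfoColData">January 2022</td>
</tr></table>
'''

soup = BeautifulSoup(html, 'html.parser')

# Attribute = value selector: match on the data-term attribute.
start = soup.select_one('[data-term="Study Start Date"]')
date = start.find_next('td').get_text(strip=True)

# Same node targeted by its text content instead of its attribute.
start2 = soup.select_one('.term:-soup-contains("Study Start Date")')

print(date)
```

The attribute match is usually preferable: it survives label rewording better than text matching, and it does not depend on how the string is split across child tags (the very problem the question ran into with find_all).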
I had luck getting a list of telephone numbers using this code:
from lxml import html
import requests
lnk='https://docs.legis.wisconsin.gov/2019/legislators/assembly'
page=requests.get(lnk)
tree=html.fromstring(page.content)
ph_nums=tree.xpath('//span[@class="info telephone"]/text()')
print(ph_nums)
which is scraping info from an HTML element that looks like this:
<span class="info telephone">
<span class="title"><strong>Telephone</strong>:<br></span>
(608) 266-8580<br>(888) 534-0097
</span>
However, I can't do the same for this element when I change info telephone to info...
<span class="info" style="width:16em;">
<span>
<a id="A">
<strong></strong></a><strong>Jenkins, Leroy t</strong> <small>(R - Madison)</small>
</span>
<br>
<span style="width:8em;"><small>District 69</small></span>
<br>
<span style="width:8em;">Details</span>
<br>
<span style="width:8em;">
Website
</span>
<br>
<br>
</span>
since there's multiple titles in this element, whereas "info telephone" only had one. How would I return separate lists, each with a different piece of info (i.e. a list of names, and a list of Districts, in this scenario)?
FYI - I am not educated in HTML (and hardly experienced in Python) so I would appreciate a simplified explanation.
For this task I would recommend the BeautifulSoup Package for Python.
You don't have to deeply understand HTML to use it (I don't!), and it offers a very friendly approach to find certain items from a web page.
Your first example could be rewritten as follows:
from bs4 import BeautifulSoup
#soup element contains the xml data
soup = BeautifulSoup(page.content, 'lxml')
# the find_all method finds all nodes in page.content whose type is 'span'
# and whose class is 'info telephone'
info_tels = soup.find_all('span', {"class": "info telephone"})
The info_tels element contains all instances of <span class="info telephone"> on your document. We can then parse it to find what's relevant:
list_tels = []
for tel in info_tels:
    tel_text = tel.text  # extracts text from the info telephone node
    tel_text = tel_text.replace("\nTelephone:\n", "").replace('\n', "")  # removes the "Telephone:" part and line breaks
    tel_text = tel_text.strip()  # removes trailing spaces
    list_tels.append(tel_text)
You can do something similar for the 'info' class:
info_class = soup.find_all('span', {"class": "info"})
And then find the elements you want to put into lists:
info_class[0].find_all('a')[1].text #returns you the first name
The challenge here is to identify which types/classes these names/districts/etc. have. In your first example it is relatively clear (('span', {"class": "info telephone"})), but the "info" class has various data points inside it with no specific, identifiable type.
For instance, the <span> tag appears multiple times in your snippet, wrapping distinct data points (District, Details, etc.).
I came up with a small solution for the District problem - you might get inspired to tackle the other information too!
list_districts = []
for info in info_class:
    try:
        district_contenders = info.find_all('span', {'style': "width:8em;"})
        for element in district_contenders:
            if 'District' in element.text:
                list_districts.append(element.text)
    except:
        pass
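Putting those pieces together, here is a sketch that builds one list of names and one of districts from the snippet quoted in the question. Two assumptions are baked in and may not hold on the live page: the name is the first non-empty <strong> inside each "info" block, and districts are the inner spans whose text mentions "District".

```python
from bs4 import BeautifulSoup

# The legislator block quoted in the question (one entry shown).
html = '''
<span class="info" style="width:16em;">
  <span>
    <a id="A"><strong></strong></a><strong>Jenkins, Leroy t</strong>
    <small>(R - Madison)</small>
  </span>
  <br>
  <span style="width:8em;"><small>District 69</small></span>
</span>
'''

soup = BeautifulSoup(html, 'html.parser')

names, districts = [], []
for info in soup.find_all('span', {'class': 'info'}):
    # Name: first non-empty <strong> in the block (assumption from the snippet;
    # the real page may place it elsewhere).
    strongs = [s.get_text(strip=True) for s in info.find_all('strong')]
    name = next((s for s in strongs if s), None)
    if name:
        names.append(name)
    # District: any inner span whose text mentions "District".
    for span in info.find_all('span'):
        if 'District' in span.get_text():
            districts.append(span.get_text(strip=True))

print(names, districts)
```

Because both lists are filled inside the same per-block loop, names[i] and districts[i] stay paired as long as every block yields one of each.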
So here is my code:
import requests
from bs4 import BeautifulSoup
import lxml
r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")
tables = soup.find_all('table')
print(tables)
I had to do a POST request because it's an ASP page, so that I could request the correct data: all tables for a specific semester in the College of Business. The problem is the output:
<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA 4721 </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>
I expected BeautifulSoup to be able to parse the text and return it, nice and neat, in a dataframe with each column separated. I would like to put it into a dataframe afterwards, or perhaps save it to a CSV file... but I have no idea how to get rid of all of these tags and attributes. I tried using this code to do so, and it removed the ones specified, but td and tr didn't work:
for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        del tag[attribute]
Then I tried a package called bleach, but when I passed 'tables' into it, it complained that the input must be text, so apparently I can't feed my table to it. This is ideally what I would like to see in my output.
So I'm truly at a loss here of how to format this in a proper way. Any help is much appreciated.
Give this a try. I suppose this is what you expected. Btw, if there is more than one table on that page and you want a different one, tweak the index, as in soup.select('table')[n]. Thanks.
import requests
from bs4 import BeautifulSoup
res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")
tables = soup.select('table')[0]
list_items = [[items.text.replace("\xa0", "") for items in list_item.select("td")]
              for list_item in tables.select("tr")]
for data in list_items:
    print(' '.join(data))
Partial results:
Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree Department: SCHACCOUNT
Course: ACG 2021 Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1 Completed Forms: 36
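From there, getting the rows into a CSV file (which pandas can then read straight into a DataFrame) is a short step. A stdlib-only sketch, assuming list_items is shaped like the output above; the sample rows are illustrative, not real scraped data:

```python
import csv
import io

# Example rows shaped like the scraped output above.
list_items = [
    ['Term: 1175 - Summer 2017'],
    ['Course: ACG 2021', 'Section: RVCC-1', 'Title: ACC Decisions'],
]

# Write to an in-memory buffer; swap it for
# open('evals.csv', 'w', newline='') to produce a real file.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(list_items)

print(buf.getvalue())
```

csv.writer handles quoting automatically, so cell text containing commas will not break the column layout, which is exactly the problem hand-joining with ' ' leaves open.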
I'm using scrapy to scrape data from this website: http://www.nuforc.org/webreports/ndxevent.html
I need to separate dates from counts of UFO sightings - yes, exciting!
Here is an example of what I'm scraping
<TR VALIGN=TOP>
<TD><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><A HREF= ndxe201303.html>03/2013</A></TD>
<TD ALIGN=RIGHT><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>108</TD>
So in this example date = 03/2013, count = 108
Now the dates are not a problem since I can just do
hxs.select('//tbody//td//font//a//text()').extract()
To get the text within "a" tag.
But is there a way to get the text from the td element that has the attribute ALIGN=RIGHT?
I have looked at the docs and selectors, but I'm confused:
hxs.select('//tbody[contains(td, "ALIGN")]').extract()
?
This selects text from all <td> with the attribute ALIGN="RIGHT":
hxs.select('//tbody//td[@ALIGN="RIGHT"]//text()').extract()
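Outside Scrapy, the same attribute predicate can be sanity-checked with the standard library's ElementTree, which understands the `td[@ALIGN="RIGHT"]` form. The snippet below is a tidied, well-formed rewrite of the quoted row (the real page's FONT and TD tags are unclosed, which ElementTree would reject; Scrapy's HTML parser is more forgiving):

```python
import xml.etree.ElementTree as ET

# Well-formed rewrite of the quoted nuforc.org row.
xml = '''
<tbody>
  <tr>
    <td><a href="ndxe201303.html">03/2013</a></td>
    <td ALIGN="RIGHT">108</td>
  </tr>
</tbody>
'''

root = ET.fromstring(xml)
# Select the count cells via the attribute predicate...
counts = [td.text for td in root.findall('.//td[@ALIGN="RIGHT"]')]
# ...and the dates via the anchor inside the first cell.
dates = [a.text for a in root.findall('.//td/a')]
print(dates, counts)
```

Note that XPath attribute names are case-sensitive, so `@ALIGN` must match the uppercase attribute used in this page's markup.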
Here is the part of the HTML:
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
So I would like to return all these td tags with the class name "team-name", but only those that contain the text "Today".
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
starting_url = urllib2.urlopen('http://www.mysite.com.au/').read()
soup = BeautifulSoup(''.join(starting_url))
soup2 = soup.findAll("td", {'class':'team-name'})
for entry in soup2:
    if "Today" in soup2:
        print entry
If I run this, nothing returns.
If I take out that last if statement and just put
print soup2
I get back all the td tags, but some have "Today" and some have "Tomorrow", etc.
So, any pointers? Is there a way to add 2 attributes to the soup.findAll function?
I also tried running a findAll on a findAll; that did not work.
Using the structure of the code you've got currently, try looking for "Today" with an embedded findAll:
for entry in soup2:
    if entry.findAll(text=re.compile("Today")):
        print entry
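The same idea can be written as a single filtering pass. This sketch uses the modern bs4 package and Python 3 rather than the old Python 2 BeautifulSoup module from the question; the HTML is the snippet quoted above, not the live site:

```python
import re
from bs4 import BeautifulSoup

# The two cells quoted in the question.
html = '''
<td class="team-name"><div class="goat_australia"></div>Melbourne<br />Today</td>
<td class="team-name"><div class="goat_australia"></div>Sydney<br />Tomorrow</td>
'''

soup = BeautifulSoup(html, 'html.parser')

# Keep only td.team-name cells whose text mentions "Today".
today_cells = [
    td for td in soup.find_all('td', {'class': 'team-name'})
    if td.find(string=re.compile('Today'))
]

print([td.get_text(' ', strip=True) for td in today_cells])
```

The class lookup and the text test stay separate because "Today" is a text node, not an attribute, so it cannot go into the find_all attribute dict; the list comprehension combines both conditions in one pass.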