Here is the part of the HTML:
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
So I would like to return all these td tags with the class name "team-name", but only if the tag contains the text "Today".
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
starting_url = urllib2.urlopen('http://www.mysite.com.au/').read()
soup = BeautifulSoup(''.join(starting_url))
soup2 = soup.findAll("td", {'class':'team-name'})
for entry in soup2:
if "Today" in soup2:
print entry
If I run this, nothing is returned.
If I take out that last if statement and just put
print soup2
I get back all the td tags, but some have "Today" and some have "Tomorrow" etc.
Any pointers? Is there a way to pass two attributes to the soup.findAll function?
I also tried running a findAll on a findAll, but that did not work.
Using the structure of the code you've got currently, try looking for "Today" with an embedded findAll:
for entry in soup2:
if entry.findAll(text=re.compile("Today")):
print entry
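For reference, a self-contained sketch of the same idea, using the modern bs4 import and an inline copy of the HTML from the question instead of the live URL (treat it as an illustration, not a drop-in replacement):
import re
from bs4 import BeautifulSoup  # modern package; the question uses the old BeautifulSoup 3 import

html = '''
<td class="team-name"><div class="goat_australia"></div>Melbourne<br />Today</td>
<td class="team-name"><div class="goat_australia"></div>Sydney<br />Tomorrow</td>
'''

soup = BeautifulSoup(html, 'html.parser')
for entry in soup.find_all('td', {'class': 'team-name'}):
    # keep only cells whose text contains "Today"
    if entry.find_all(string=re.compile('Today')):
        print(entry)
This prints only the Melbourne cell.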
I am dealing with HTML table data consisting of two fields: first, a field that holds a hyperlinked text string, and second, one that holds a date string. I need the two to be extracted and to remain associated.
I am catching the rows in the following fashion (found in another SO question):
pg = s.get(url).text # s = requests Session object
soup = BeautifulSoup(pg, 'html.parser')
files = [[
[td for td in tr.find_all('td')]
for tr in table.find_all('tr')]
for table in soup.find_all('table')]
Iterating over files[0] yields rows whose classes are dynamic because the HTML was published from Microsoft Excel, so I can't depend on class names. The positions of the elements are stable, though. The rows look like this:
[<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none"><span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>north_america-latest.shp.zip</span></td>, <td class="another auto tag" style="border-top:none;border-left:none">2023-01-01</td>]
Broken up, for easier reading:
[
<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none">
<a href="subfolder/north_america-latest.shp.zip">
<span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>
north_america-latest.shp.zip
</span>
</a>
</td>,
<td class="another auto tag" style="border-top:none;border-left:none">
2023-01-01
</td>
]
Using the .get_text() method with td I can get the string literal of the link, as well as the date in one go, but once I have the td object, how do I go about obtaining the following three elements?
"subfolder/north_america-latest.shp.zip" # the link
"north_america-latest.shp.zip" # the name
"2023-01-01" # the date
Assuming that what you call 'row' is actually a string, here is how you would get those bits of information:
from bs4 import BeautifulSoup as bs
my_list = '''[<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none"><a href="subfolder/north_america-latest.shp.zip"><span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>north_america-latest.shp.zip</span></a></td>, <td class="another auto tag" style="border-top:none;border-left:none">2023-01-01</td>]'''
soup = bs(my_list, 'html.parser')
link = soup.select_one('td a').get('href')
text = soup.select_one('td a').get_text()
date = soup.select('td')[1].get_text()
print('link:', link)
print('text:', text)
print('date:', date)
Result in terminal:
link: subfolder/north_america-latest.shp.zip
text: north_america-latest.shp.zip
date: 2023-01-01
I'm not particularly convinced this is the actual key to your conundrum: surely there is a better way of getting the information you're after, besides the list comprehension you're using. As stated in the comments, without the actual page HTML, truly debugging this is next to impossible.
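If the rows are the bs4 Tag objects produced by that list comprehension rather than strings, a minimal sketch along these lines may be closer to what you need (it assumes files was built exactly as in the question):
for row in files[0]:
    if len(row) < 2:  # skip header or otherwise incomplete rows
        continue
    name_td, date_td = row[0], row[1]
    a = name_td.find('a')  # the hyperlinked cell
    if a is None:
        continue
    link = a.get('href')                 # "subfolder/north_america-latest.shp.zip"
    name = a.get_text(strip=True)        # "north_america-latest.shp.zip"
    date = date_td.get_text(strip=True)  # "2023-01-01"
    print(link, name, date)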
I'm trying to figure out how CSS pseudo-classes like :not() and :has() work in the following cases.
The following selector is not supposed to print 27A-TAX DISTRICT 27A but it does print it:
from bs4 import BeautifulSoup
htmlelement = """
<tbody>
<tr style="">
<td><a>27A-TAX DISTRICT</a> 27A</td>
</tr>
<tr style="">
<td><strong>Parcel Number</strong> 720</td>
</tr>
</tbody>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.select_one("tr:not(a)").text
print(item)
On the other hand, the following selector is supposed to print I should be printed, but it throws an AttributeError instead.
from bs4 import BeautifulSoup
htmlelement = """
<p class="vital">I should be printed</p>
<p>I should not be printed</p>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.select_one("p:has(.vital)").text
print(item)
Where am I going wrong, and how can I make them work?
Unfortunately, your understanding of what :not() and :has() do is most likely not correct.
In your first example, you use:
soup.select_one("tr:not(a)").text
The way you are using it will select every tr. This is because it is saying "I want a tr tag that is not an a tag." tr tags are never a tags, so the selector matches every tr, and select_one returns the first one, which happens to be the one that contains 27A-TAX DISTRICT.
If you want tr tags that don't have a tags, then you could use:
soup.select_one("tr:not(:has(a))").text
What this says is "I want a tr tag that does not have a descendant a tag".
For more info read:
https://developer.mozilla.org/en-US/docs/Web/CSS/:not
This leads us to your second issue. :has() is a relational selector. In your second example, you used:
soup.select_one("p:has(.vital)").text
:has() looks ahead at children, descendants, or siblings (depending on the syntax you use) to determine whether the tag is the one you want.
So what you were saying was "I want a p tag that has a descendant tag with the class vital". None of your p tags even have element descendants, so there is no way one could have a vital class. What you want is actually simpler:
soup.select_one("p.vital").text
What this says is "I want a p tag that also has a class vital."
For more info read:
https://developer.mozilla.org/en-US/docs/Web/CSS/:has
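To see both corrections side by side, here is a small sketch reusing the HTML from the question (it assumes a reasonably recent soupsieve, the library bs4 delegates CSS selectors to, since :has() support is comparatively new):
from bs4 import BeautifulSoup

tbody = """
<tbody>
<tr><td><a>27A-TAX DISTRICT</a> 27A</td></tr>
<tr><td><strong>Parcel Number</strong> 720</td></tr>
</tbody>
"""
paragraphs = """
<p class="vital">I should be printed</p>
<p>I should not be printed</p>
"""

soup = BeautifulSoup(tbody, "lxml")
print(soup.select_one("tr:not(:has(a))").text)  # Parcel Number 720

soup = BeautifulSoup(paragraphs, "lxml")
print(soup.select_one("p.vital").text)          # I should be printed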
Here's html snippet 1:
<td class="firstleft lineupopt-name" style="">Trump, Donald <span style="color:#666;font-size:10px;">B</span> <span style="color:#cc1100;font-size:10px;font-weight:bold;">TTT</span></td>
Here's html snippet 2:
<td class="firstleft lineupopt-name" style="">Clinton, Hillary <span style="color:#cc1100;font-size:10px;font-weight:bold;">TTT</span></td>
Here's my relevant code:
all = cols[1].find_all('span')
for ele in all:
if (ele is not None):
ttt = cols[1].span.text
else:
ttt = 'none'
Issue: my code runs in both instances, but for html snippet 1 it grabs content from the first span tag rather than the last. In both instances, if a span exists, I'd like to grab content from only the last span tag. How can this be done?
BS4 now supports last-child so a possible approach could be:
soup.select('td span:last-child')
To get the texts out, just iterate over the ResultSet.
Example
from bs4 import BeautifulSoup
html='''
<td class="firstleft lineupopt-name" style="">Trump, Donald <span style="color:#666;font-size:10px;">B</span> <span style="color:#cc1100;font-size:10px;font-weight:bold;">TTT</span></td>
<td class="firstleft lineupopt-name" style="">Clinton, Hillary <span style="color:#cc1100;font-size:10px;font-weight:bold;">TTT</span></td>
'''
soup = BeautifulSoup(html, 'html.parser')
[t.text for t in soup.select('td span:last-child')]
Output
['TTT', 'TTT']
A straightforward approach would be to get the last element by -1 index:
ttt = all[-1].text if all else 'none'
I also tried to approach it with a CSS selector, but at the time BeautifulSoup did not support last-child, last-of-type or nth-last-of-type, only the nth-of-type pseudo-class. (Tested again in a conda env with bs4 v4.9.1: nth-last-of-type(1) now works.)
If you are trying to select only direct children from an already selected element, use the :scope pseudo-class to refer to the element itself in the selector; you can then use the >, ~, and + combinators. Docs
selected_element.select_one(":scope > :last-child")
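For illustration, a hedged sketch combining :scope with last-child on the two snippets from the question; the td elements here stand in for whatever cols[1] holds in the original code, and the span attributes are trimmed for brevity:
from bs4 import BeautifulSoup

html = '''
<td class="firstleft lineupopt-name">Trump, Donald <span>B</span> <span>TTT</span></td>
<td class="firstleft lineupopt-name">Clinton, Hillary <span>TTT</span></td>
'''

soup = BeautifulSoup(html, 'html.parser')
for td in soup.find_all('td'):
    last_span = td.select_one(':scope > span:last-child')  # last direct span child of this td
    print(last_span.text if last_span else 'none')
# TTT
# TTT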
I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.
<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>
I can get the first value by using
match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)
But the above is on one line. I also need to get the second value, which is on the line following the first one, but I cannot get it to work. I have tried the following, but I don't get a match:
match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
'<td width="65.+?value="(.+?)"></td>').findall(html_source_det)
Perhaps I am unable to get it to work because the text spans multiple lines, but I added "\n" at the end of the first line, so I thought that would resolve it; it did not.
What am I doing wrong?
The html_source is retrieved by downloading it (it is not a static HTML file like the one outlined above; I only put it here so you could see the text). Maybe this is not the best way of getting the source.
I am obtaining the html_source like this:
new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()
Please do not try to parse HTML with regex, as it is not regular. Instead use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:
from bs4 import BeautifulSoup
html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''
soup = BeautifulSoup(html)
print soup.find('td', attrs={'width': '65%'}).findNext('input')['value']
Or more simply:
print soup.find('input', attrs={'name': 'T1'})['value']
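Since the question mentions that this block repeats, a rough sketch that pairs each label with its value might look like the following; it assumes every such row is a tr with a label td followed by a td containing the input:
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) == 2 and tds[1].find('input') is not None:
        label = tds[0].get_text(strip=True)    # e.g. "Demand No"
        value = tds[1].find('input')['value']  # e.g. "876716001"
        print(label + ': ' + value)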
I'm currently learning Python and I'm trying to make a small scraper, but I'm running into problems with Beautiful Soup and regex.
I am trying to match all links in a site that has the following html:
<td>
Place Number 1
</td>
<td width="100">
California </td>
<td>
Place Number 2
</td>
<td width="100">
Florida </td>
I want to get all the following links: "/lxxxx/Place+Number+x"
I am using Python and BeautifulSoup for this:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
address = 'http://www.example.com'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
for tag in soup.findAll('a', id = re.compile('l[0-9]*')):
print tag['href']
I found the regex part in the soup.findAll in some example code, because I can't seem to get the example from the BeautifulSoup documentation to work. Without the regex part, I get all the links on the page, but I only want the "lxxxx" ones.
What am I doing wrong with my regex? Maybe there's a way to do this without regexes, but I can't seem to find one.
Shouldn't you be trying to do the regex match on href and not id?
for tag in soup.findAll('a', href = re.compile('l[0-9]*')):
print tag['href']
I would suggest
for tag in soup.findAll('a', href = re.compile('^/l[0-9]+/.*$')):
print tag['href']
to avoid matching tags that look similar to, but are not exactly, what you are looking for.
Apart from checking href rather than id, I would use:
re.compile(r'^\/l[0-9]{4}/Place\+Number\+[0-9]+')
match only matches at the beginning of the string, as if your regex started with "^".
>>> m = re.compile(r"abc")
>>> m.match("eabc")
>>> m.match("abcd")
<_sre.SRE_Match object at 0x7f23192318b8>
So adding the \/ allows the leading slash to be matched. Also, I'm using {4} to match exactly four digits rather than *, which matches zero or more.
>>> m = re.compile(r'\/l[0-9]*')
>>> m.match("/longurl/somewhere")
<_sre.SRE_Match object at 0x7f2319231850>
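Putting it together, a rough sketch against reconstructed markup: the pasted HTML in the question appears to have lost its <a> tags, so the anchors and their /lxxxx/Place+Number+x hrefs below are assumed, not taken from the real page:
import re
from bs4 import BeautifulSoup  # modern bs4; the question uses the old BeautifulSoup 3 import

html = '''
<td><a href="/l1234/Place+Number+1">Place Number 1</a></td>
<td width="100">California</td>
<td><a href="/l5678/Place+Number+2">Place Number 2</a></td>
<td width="100">Florida</td>
'''

soup = BeautifulSoup(html, 'html.parser')
# match on href, not id, and anchor the pattern so only the /lxxxx/... links come back
for tag in soup.find_all('a', href=re.compile(r'^/l[0-9]+/')):
    print(tag['href'])
# /l1234/Place+Number+1
# /l5678/Place+Number+2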