How to get a nested element in beautiful soup

I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
<tr>
<td> <a>...
Thanks

As per the docs, you first build a parse tree:
# BeautifulSoup 3 import; with bs4, use "from bs4 import BeautifulSoup" instead
import BeautifulSoup
html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup.BeautifulSoup(html)
and then you search it, for example for <a> tags whose immediate parent is a <td>:
for ana in soup.findAll('a'):
    if ana.parent.name == 'td':
        print(ana["href"])

Something like this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [td.find('a') for td in soup.findAll('td')]
That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific, or use findAll if you have several links inside each td.
UPDATE: re Daniele's comment, if you want to make sure there are no None values in the list, you could modify the list comprehension like this:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a]
This just adds a check that td.find('a') actually returned an element.
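
For anyone on bs4, here is a minimal sketch of the same two ideas using CSS selectors. The sample markup is made up for illustration, and it assumes the html.parser backend. "td > a" only matches anchors that are direct children of a td, while "td a" matches anchors anywhere inside one, and neither can return None.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, only for illustration
html = """
<table>
  <tr><td><a href="foo">first</a></td><td>no link here</td></tr>
  <tr><td><span><a href="nested">deep</a></span></td><td><a href="bar">second</a></td></tr>
</table>
<p><a href="outside">not in a cell</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Anchors whose immediate parent is a <td>
direct = [a["href"] for a in soup.select("td > a[href]")]

# Anchors at any depth inside a <td>
anywhere = [a["href"] for a in soup.select("td a[href]")]

print(direct)
print(anywhere)
```

Cells without a link simply contribute nothing, so no extra None check is needed.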

Related

BeautifulSoup4: replace a tag with 2 others

I want to replace <h1>title</h1> with <b><u>title</u></b>
I know I can replace h1 with b using soup.h1.name = "b"
But is there a way to replace a single tag by several others?
(Special edit for Daniel Roseman: the tags don't really matter...)

Thanks to RobertB I could figure out the rest of the answer.
You need to:
wrap the h1's string in a new b tag
wrap that string in a new u tag
remove the h1 tag (using unwrap())
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h1>title</h1>", "html.parser")
soup.h1.string.wrap(soup.new_tag("b"))
print(soup) # >> <h1><b>title</b></h1>
soup.h1.string.wrap(soup.new_tag("u"))
print(soup) # >> <h1><b><u>title</u></b></h1>
soup.h1.unwrap()
print(soup) #>> <b><u>title</u></b>

Use wrap()
From the documentation:
soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
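
If the tag has more than one child, .string is None and the string-wrapping approach breaks. A rough sketch of a more general helper is below; it wraps the tag itself and then unwraps it. The name replace_with_nested is my own, not part of bs4, and it assumes the html.parser backend.

```python
from bs4 import BeautifulSoup

def replace_with_nested(soup, tag, new_names):
    """Replace `tag` with nested new tags (outermost first), keeping its children.

    Hypothetical helper, not a bs4 API.
    """
    node = tag
    # Wrap from the innermost new tag outwards; wrap() returns the new wrapper
    for name in reversed(new_names):
        node = node.wrap(soup.new_tag(name))
    # Drop the original tag but keep its contents in place
    tag.unwrap()
    return node

soup = BeautifulSoup("<h1>title</h1>", "html.parser")
replace_with_nested(soup, soup.h1, ["b", "u"])
simple = str(soup)

soup2 = BeautifulSoup("<h1>a <i>b</i></h1>", "html.parser")
replace_with_nested(soup2, soup2.h1, ["b", "u"])
mixed = str(soup2)

print(simple)
print(mixed)
```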

Finding the number of divs with a certain id in BeautifulSoup?

I am trying to find a way to count the number of divs with the id "blue". Is this possible in BeautifulSoup? Here is my code:
import BeautifulSoup
scanning = True
soup = BeautifulSoup.BeautifulSoup("<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>")
blues = []
blues.append(soup.find("div", {"id": "blue"}))
print len(blues)

The find method only fetches the first occurrence, hence the output of 1. If you use find_all, it finds all of the occurrences and returns them as a list. In this case divs becomes a list of every div with id="blue", and you can check its length.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>", 'html.parser')
divs = soup.find_all("div", {"id": "blue"})
print(len(divs))

If divs are the only elements with the id blue, then you could just use a CSS selector (find_all does not take selectors, select does):
divs = soup.select("#blue")
blues = len(divs)
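
Putting both variants side by side, as a small self-contained sketch with bs4 and the html.parser backend (the variable names are mine):

```python
from bs4 import BeautifulSoup

html = "<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of every match, so len() gives the count directly
blue_divs = soup.find_all("div", id="blue")

# CSS selector variant; matches any element with that id, not only divs
blue_any = soup.select("#blue")

count = len(blue_divs)
print(count)
```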

Python BeautifulSoup webcrawling getting text tag inside link

I need to get the information within the "< b >" tags for each website.
response = requests.get(href)
soup = BeautifulSoup(response.content, "lxml") # or BeautifulSoup(response.content, "html5lib")
tempWeekend = []
print soup.findAll('b')
The soup.findAll('b') line prints all the b tags in the site, how can I limit it to just the dates that I want?
The website is http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm, under the weekend tab.

It is often easiest to search using CSS selectors, e.g.
soup.select('table.chart-wide > tr > td > nobr > font > a > b')

Sadly, if the tags are not further identified, there is no way to select specific ones. How should BeautifulSoup be able to distinguish between them? If you roughly know what to expect in the tags you need, you could iterate over all of them and check whether they match:
def find_b(soup, whatever):  # whatever: the text you expect
    for b in soup.findAll('b'):
        if b.get_text() == whatever:
            return b
or something like that...
Or you could get the surrounding tags, i.e. 'a' in your example, check whether that matches, and then get the next occurrence of 'b'.

Why not search for all the b tags, and choose the ones which contain a month?
import requests
from bs4 import BeautifulSoup
s = requests.get('http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm').content
soup = BeautifulSoup(s, "lxml") # or BeautifulSoup(s, "html5lib")
months = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()
dates = []
for i in soup.find_all('b'):
    words = i.text.split()
    # keep only <b> tags whose first word is a month abbreviation
    if words and words[0].upper() in months:
        dates.append(i.text)
print(dates)
(Note: I did not check the exact abbreviations that the website uses. Please check these first and accordingly modify the code)

Looking at that page, it doesn't have any divs or class or id attributes, which makes it tough. The only pattern I could see was that the <b> tag directly before the dates was <b>Date:</b>. I would iterate over the <b> tags and then collect the tags after I hit the one with Date in it.
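
A rough sketch of that idea is below. The markup is invented for illustration (I have not checked the real page's structure), and it assumes every <b> after the Date: marker is a date.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the real page
html = "<p><b>Title</b> <b>Date:</b> <b>Nov 22-24</b> <b>Nov 29-Dec 1</b></p>"
soup = BeautifulSoup(html, "html.parser")

dates = []
seen_marker = False
for b in soup.find_all("b"):
    text = b.get_text(strip=True)
    if seen_marker:
        # Everything after the marker is collected
        dates.append(text)
    elif text.startswith("Date"):
        # The marker itself is skipped; collection starts with the next <b>
        seen_marker = True

print(dates)
```

On the real page you would probably need a stop condition too, since other <b> tags may follow the dates.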

I would try something like
all_a = soup.find_all('a')  # soup is the parsed page
dates = []
for a in all_a:
    # a.get avoids a KeyError on anchors without an href
    if '?yr=?' in a.get('href', ''):
        dates.append(a.get_text())

Python 2.7 : Can't figure out how to parse a tree with BeautifulSoup4

I am trying to parse this site to create 5 lists, one for each day and filled with one string for each announcement. For example
[in] custom_function(page)
[out] [[<MONDAYS ANNOUNCEMENTS>],
[<TUESDAYS ANNOUNCEMENTS>],
[<WEDNESDAYS ANNOUNCEMENTS>],
[<THURSDAYS ANNOUNCEMENTS>],
[<FRIDAYS ANNOUNCEMENTS>]]
But I can't figure out the correct way to do this.
This is what I have so far
from bs4 import BeautifulSoup
import requests
import datetime
url = "http://mam.econoday.com/byweek.asp?day=7&month=4&year=2014&cust=mam&lid=0"
# Get the text of the webpage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
full_table_1 = soup.find('table', 'eventstable')
I figured out that what I want is in the highlighted tag, but I'm not sure how to get to that exact tag and then parse out the times/announcements into a list. I've tried multiple methods but it just keeps getting messier.
What do I do?

The idea is to find all td elements with events class, then read div elements inside:
data = []
for day in soup.find_all('td', class_='events'):
    data.append([div.text for div in day.find_all('div', class_='econoevents')])
print(data)
prints:
[[u'Gallup US Consumer Spending Measure8:30 AM\xa0ET',
u'4-Week Bill Announcement11:00 AM\xa0ET',
u'3-Month Bill Auction11:30 AM\xa0ET',
...
],
...
]
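
To try this without hitting the site, here is a small sketch against invented markup. Only the class names events and econoevents come from the answer above; the table content is made up.

```python
from bs4 import BeautifulSoup

# Invented markup; only the class names are taken from the answer
html = """
<table class="eventstable"><tr>
  <td class="events">
    <div class="econoevents">Event A 8:30 AM ET</div>
    <div class="econoevents">Event B 11:00 AM ET</div>
  </td>
  <td class="events">
    <div class="econoevents">Event C 10:00 AM ET</div>
  </td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# One inner list per day cell, one string per announcement
data = [
    [div.get_text(strip=True) for div in day.find_all("div", class_="econoevents")]
    for day in soup.find_all("td", class_="events")
]
print(data)
```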

Python webscraping and getting contents of first div tag of its class

I'm working with Python 3.3 and this website:
http://www.nasdaq.com/markets/ipos/
My goal is to read only the companies that are in the Upcoming IPO. It is in the div tag with div class="genTable thin floatL" There are two with this class, and the target data is in the first one.
Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'}) [0]: # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))
I'd like it to return only
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
But it prints every div that matches the re.match pattern, and prints them multiple times as well. I tried inserting [0] on the for divparent loop to retrieve only the first one, but this causes the repeating problem instead.
EDIT: Here is the updated code according to warunsl solution. This works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table= divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))

You mentioned that there are two elements that fit the 'class':'genTable thin floatL' criteria. So running a for loop over its first element does not make sense.
So replace your outer for loop with
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
Now you need not do a soup.find_all again. Doing so will search the entire document. You need to restrict the search to the divparent. So, you do:
table = divparent.find('table')
The remainder of the code to extract the dates and the company name would be the same, except that they will be with reference to the table variable.
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)
Hope it helps.
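
For reference, a self-contained sketch of the scoped search against invented markup. Only the class names come from the question; the rows are made up, and select_one is used here in place of find_all(...)[0] to pick the first matching div.

```python
import re
from bs4 import BeautifulSoup

# Invented markup with two blocks sharing the same classes
html = """
<div class="genTable thin floatL"><table><tr>
  <td><div class="ipo-cell-height">3/7/2014</div></td>
  <td><div class="ipo-cell-height">RECRO PHARMA, INC.</div></td>
</tr></table></div>
<div class="genTable thin floatL"><table><tr>
  <td><div class="ipo-cell-height">1/2/2014</div></td>
  <td><div class="ipo-cell-height">OTHER CO</div></td>
</tr></table></div>
"""
soup = BeautifulSoup(html, "html.parser")

# First block only; select_one returns the first match in document order
first = soup.select_one("div.genTable.thin.floatL")

date_re = re.compile(r"\d{1,2}/\d{1,2}/\d{4}$")
results = []
# Search inside `first`, not the whole soup
for div in first.find_all("div", class_="ipo-cell-height"):
    s = div.get_text(strip=True)
    if date_re.match(s):
        name = div.find_next("div").get_text(strip=True)
        results.append("{} - {}".format(s, name))

print(results)
```

Calling find_all on first rather than on soup is what keeps the second block out of the results.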
