Can I search multiple HTML elements within the soup.find_all() function? - python

I'm trying to scrape a website for the most viewed headlines. The class selector of the text I want shares common words with other items on the page. For example, I want the text inside <a> tags with the class "black_color". Other items also use the <a> tag but have the class "color_black hover_color_gray_90", and I don't want those included. I was thinking I could use more HTML elements to be more specific, but I'm not sure how to incorporate them.
import requests
from bs4 import BeautifulSoup

def getHeadlines():
    url = "https://www.bostonglobe.com/"
    source_code = requests.get(url)
    plainText = source_code.text
    soup = BeautifulSoup(plainText, "html.parser")
    #results = soup.find_all("h2", {"class": "headline"})
    results = soup.find_all("a", {"class": "black_color"})
    with open("headlines.txt", "w", encoding="utf-8") as f:
        for i in results:
            f.write(i.text + ' \n' + '\n')

getHeadlines()

I think looking at the <a> tag may actually be harder than using the matching <h2>, which has a 'headline' class.
Try this:
soup = BeautifulSoup(source_code.text, "html.parser")
for headline in soup.find_all("h2", class_="headline"):
    print(headline.text)
Output:
Boston College outbreak worries epidemiologists, students, community
Nantucket finds ‘community spread’ of COVID-19 among tradespeople The town’s Select Board and Board of Health will hold an emergency meeting at 10 a.m. Monday to consider placing restrictions on some of those trades, officials said Friday.
Weddings in a pandemic: Welcome to the anxiety vortexNewlyweds are spending their honeymoons praying they don’t hear from COVID‐19 contact tracers. Relatives are agonizing over “damned if we RSVP yes, damned if we RSVP no” decisions. Wedding planners are adding contract clauses specifying they’ll walk off the job if social distancing rules are violated.
Fauci says US should plan to ‘hunker down’ for fall and winter
Of struggling area, Walsh says, ‘We have to get it better under control’
...
After looking at it for a while, I think the <a> tag may actually have some classes added dynamically, which BS will not pick up. Just searching for the color_black class and excluding the color_black hover_color_gray_90 class yields headline classes that you don't want (e.g. 'Sports'), even though when I look at the actual web page source code, I see it's differentiated in the way you've indicated.
That's usually a good sign that there are post-load CSS changes being made to a page. (I may be wrong about that, but in either case I hope the <h2> approach gets you what you need.)
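For completeness: BeautifulSoup matches a class filter against each individual class token, so elements that merely contain the token also match. If the classes do turn out to be present in the served HTML, you can demand an exact class attribute instead. A minimal sketch, assuming the static markup really carries black_color:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.bostonglobe.com/").text, "html.parser")

# [class="black_color"] compares the whole attribute value, so an element
# with class="color_black hover_color_gray_90" is not matched, unlike the
# usual class_="black_color" form, which matches any single class token.
exact = soup.select('a[class="black_color"]')

# The same idea without CSS: post-filter on the full token list.
exact_manual = [a for a in soup.find_all("a", class_="black_color")
                if a.get("class") == ["black_color"]]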

Related

How can I get the contents between two tags in a html page using Beautiful Soup?

I am trying to extract the text from the Risk Factors section of this 10-K report from the SEC's EDGAR database: https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm
As you can see, I have managed to identify the headings for the Risk Factors section (the section I want to grab all the text from) and the Unresolved Staff Comments section (the section immediately after Risk Factors), but I have been unable to identify/grab all the text between these two headings (the text in the Risk Factors section).
As you can see here I have tried the "next_sibling" method and some other suggestions on SO but I am still doing it incorrectly.
Code:
import requests
import bs4 as bs
file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
risk_factors_header = soup.find_all("a", text="Risk Factors")[0]
staff_comments_header = soup.find_all("a", text="Unresolved Staff Comments")[0]
risk_factors_text = risk_factors_header.next_sibling
print(risk_factors_text.contents)
Extract of Desired Output (looking for all text in the Risk Factors section):
In addition to the other information contained in this Annual Report on Form 10-K, the following risk factors should be considered carefully in evaluating us. Our business, financial condition, liquidity or results of operations could be materially adversely affected by any of these risks.
Risks Relating to the Merger Transactions
The closing of the Merger Transactions is subject to many conditions, including the receipt of approvals from various governmental entities, which may not approve the Merger Transactions, may delay the approvals for, or may impose conditions or restrictions on, jeopardize or delay completion of, or reduce the anticipated benefits of, the Merger Transactions, and if these conditions are not satisfied or waived, the Merger Transactions will not be completed.
The completion of the Merger Transactions is subject to a number of conditions, including, among others, obtaining certain governmental authorizations, consents, orders or other approvals and the absence of any injunction prohibiting the Merger Transactions or any legal requ........
A couple of issues:
You are selecting the link from the table of contents instead of the header: the header is not an a tag, but just a font tag (you can always inspect these details in a browser). However, if you try to do soup.find_all("font", text="Risk Factors") you will get 2 results because the link from the table of contents also has a font tag, so you would need to select the second one: soup.find_all("font", text="Risk Factors")[1].
Similar issue for the second header, but this time something funny happens: the header has an "invisible" space just before the closing tag, although the link from the TOC doesn't, so you would need to select it like this soup.find_all("font", text="Unresolved Staff Comments ")[0].
Another issue: the "text in between" is not a sibling (or siblings) of the elements we've selected, but a sibling of an ancestor of those elements. If you inspect the page source code, you will see that the headings sit inside a div, inside a table cell (td), inside a table row (tr), inside a table, so we need to go 4 parent levels up: risk_factors_header.parent.parent.parent.parent.
Also, there are several siblings that you are interested in, better to use next_siblings and iterate through all of them.
Once you've got all of that, you can use the second heading to break the iteration once you reach it.
Since you want to get the text only (ignoring all the HTML tags), you can use get_text() instead of .contents.
Ok, all together:
import requests
import bs4 as bs
file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
risk_factors_header = soup.find_all("font", text="Risk Factors")[1]
staff_comments_header = soup.find_all("font", text="Unresolved Staff Comments ")[0]
for paragraph in risk_factors_header.parent.parent.parent.parent.next_siblings:
    if paragraph == staff_comments_header.parent.parent.parent.parent:
        break
    print(paragraph.get_text())
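If you need the same extraction for other sections, the steps above (pick the right heading occurrence, climb four parent levels, iterate siblings until the stop table) generalize into a small helper. A sketch under the same structural assumptions; section_text and its parameters are names introduced here, not part of the original answer:
import requests
import bs4 as bs

def section_text(soup, start_text, stop_text, start_index=0, stop_index=0):
    # Assumes each heading is a <font> tag nested four levels
    # (div > td > tr > table) below the siblings we want to collect.
    start = soup.find_all("font", text=start_text)[start_index]
    stop_table = soup.find_all("font", text=stop_text)[stop_index].parent.parent.parent.parent
    parts = []
    for sibling in start.parent.parent.parent.parent.next_siblings:
        if sibling == stop_table:
            break
        if isinstance(sibling, bs.Tag):  # skip bare whitespace strings
            parts.append(sibling.get_text())
    return "\n".join(parts)

file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
# Note the trailing space in the second heading, as explained above.
print(section_text(soup, "Risk Factors", "Unresolved Staff Comments ", start_index=1))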
I would take a totally different approach from the other answers here - because you are dealing with EDGAR filings which are terrible as a general matter and especially terrible when it comes to html (and, if you are unlucky enough to have to deal with it, xbrl).
So in order to extract the Risk Factors section I resort to the method below. It relies on the fact that Risk Factors is always Item 1A and is always (at least in my experience so far) followed by Item 1B, even if, as in this case, Item 1B is "none".
filing = ''
for f in soup.select('font'):
    if f.text is not None and f.text != "Table of Contents":
        filing += f.text + " \n"
print(filing.split('Item 1B')[0].split('Item 1A')[-1])
You lose most of the formatting and, as always, there will be some clean up to do anyway, but it's close enough - in most cases.
Note that, this being EDGAR, sooner or later you will run into another filing where the text is not in <font> but in some other tag, so you'll have to adapt...
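One way to prepare for that: make the tag name a parameter, so that when a filing uses <div> or <span> instead of <font>, only the argument changes. A sketch of the same method; extract_item is a name introduced here and the fallback tag names are guesses, not something this filing needs:
def extract_item(soup, tag='font', start='Item 1A', stop='Item 1B'):
    # Same approach as above: concatenate the text of every candidate tag,
    # skipping table-of-contents links, then slice between the item markers.
    filing = ''
    for f in soup.find_all(tag):
        if f.text and f.text != "Table of Contents":
            filing += f.text + " \n"
    return filing.split(stop)[0].split(start)[-1]

print(extract_item(soup))            # this filing
# print(extract_item(soup, 'div'))   # a filing that uses <div> instead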
Another solution. You can use .find_previous_sibling() to check if you're inside section you want:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm#s8925A97DDFA55204808914F6529AC721'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
out = []
for tag in soup.find('text').find_all(recursive=False):
    prev = tag.find_previous_sibling(lambda t: t.name == 'table' and t.text.startswith('Item'))
    if prev and prev.text.startswith('Item 1A.') and not tag.text.startswith('Item 1B'):
        out.append(tag.text)

# print the section:
print('\n'.join(out))
Prints:
In addition to the other information contained in this Annual Report on Form 10-K, the following risk factors should be considered carefully in evaluating us. Our business, financial condition, liquidity or results of operations could be materially adversely affected by any of these risks.
Risks Relating to the Merger Transactions
...
agreed to implement certain measures to protect national security, certain of which may materially and adversely affect our operating results due to increasing the cost of compliance with security measures, and limiting our control over certain U.S. facilities, contracts, personnel, vendor selection, and operations. If we fail to comply with our obligations under the NSA or other agreements, our ability to operate our business may be adversely affected.
Rather ugly, but you could first remove the page numbers and table-of-contents links, then use filtering to exclude the stop-point header and its subsequent siblings from the target header and its subsequent siblings. Requires bs4 4.7.1+.
for unwanted in soup.select('a:contains("Table of Contents"), div:has(+[style="page-break-after:always"])'):
    unwanted.decompose()  # remove table of contents hyperlinks and page numbers

selector = ','.join([
    'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))',
    'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))'
    + ' ~ *:not(table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments"))), '
    + 'table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments"))) ~ *)'
])
text = '\n'.join([i.text for i in soup.select(selector)])
print(text)
Using variables might make code easier to follow:
for unwanted in soup.select('a:contains("Table of Contents"), div:has(+[style="page-break-after:always"])'):
    unwanted.decompose()  # remove table of contents hyperlinks and page numbers

start_header = 'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))'
stop_header = 'table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments")))'
selector = ','.join([start_header, start_header + f' ~ *:not({stop_header}, {stop_header} ~ *)'])
text = '\n'.join([i.text for i in soup.select(selector)])
print(text)
You could of course loop over siblings from the target header until the stop header is found, as sketched below.
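That sibling loop might look like the following sketch, reusing soup, start_header, and stop_header from the block above:
from bs4 import Tag

start = soup.select_one(start_header)
stop = soup.select_one(stop_header)
parts = []
for sibling in start.next_siblings:
    if sibling == stop:
        break
    if isinstance(sibling, Tag):  # skip bare whitespace strings between tags
        parts.append(sibling.get_text())
print('\n'.join(parts))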

how to exclude nested tags from a parent tag to just get the output as text, skipping the link (a) tags

I want to exclude the nested tags, in this case ignoring the <a> tags (links) wrapped around the words:
import requests
from bs4 import BeautifulSoup

base_url = "https://www.usatoday.com/story/tech/2020/09/17/qanon-conspiracy-theories-debunked-social-media/5791711002/"
response = requests.get(base_url)
html = response.content
bs = BeautifulSoup(html, "lxml")
article = bs.find_all("article", {"class": "gnt_pr"})
body = article[0].find_all("p", {"class": "gnt_ar_b_p"})
The output is:
[<p class="gnt_ar_b_p">An emboldened community of believers known as QAnon is spreading a baseless patchwork of conspiracy theories that are fooling Americans who are looking for simple answers in a time of intense political polarization, social isolation and economic turmoil.</p>,
<p class="gnt_ar_b_p">Experts call QAnon <a class="gnt_ar_b_a" data-t-l="|inline|intext|n/a" href="https://www.usatoday.com/in-depth/tech/2020/08/31/qanon-conspiracy-theories-trump-election-covid-19-pandemic-extremist-groups/5662374002/" rel="noopener" target="_blank">a "digital cult"</a> because of its pseudo-religious qualities and an extreme belief system that enthrones President Donald Trump as a savior figure crusading against evil.</p>,
<p class="gnt_ar_b_p">The core of QAnon is the false theory that Trump was elected to root out a secret child-sex trafficking ring run by Satanic, cannibalistic Democratic politicians and celebrities. Although it may sound absurd, it has nonetheless attracted devoted followers who have begun to perpetuate other theories that they suggest, imply or argue are somehow related to the main premise.</p>,
I want to exclude these <a> tags.
To get only the text from the paragraphs, you can use the .get_text() method. For example:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.usatoday.com/story/tech/2020/09/17/qanon-conspiracy-theories-debunked-social-media/5791711002/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")
body = soup.select("article p")
for paragraph in body:
    print(paragraph.get_text(strip=True, separator=' '))
Prints:
An emboldened community of believers known as QAnon is spreading a baseless patchwork of conspiracy theories that are fooling Americans who are looking for simple answers in a time of intense political polarization, social isolation and economic turmoil.
...etc.
Or you can .unwrap() all the elements inside each paragraph and then get the text:
for paragraph in body:
    for tag in paragraph.find_all():
        tag.unwrap()
    print(paragraph.text)
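If the goal were instead to drop the link text entirely rather than keep it inline, .decompose() removes a tag together with its contents; a minimal sketch:
for paragraph in body:
    for a in paragraph.find_all('a'):
        a.decompose()  # deletes the link and its text, not just the markup
    print(paragraph.text)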

scraping part of website in python

I want to scrape part of this website http://warframe.wikia.com/wiki/ , especially the description; the command should reply with
"Equipped with a diverse selection of arrows, deft hands, shroud of stealth and an exalted bow, the crafty Ivara infiltrates hostile territory with deception and diversion, and eliminates threats with a shot from beyond. Ivara emerged in Update 18.0."
and nothing else; maybe it would be useful to pick a specific <p> to print.
For now I have this, but it doesn't return what I want.
import requests
from bs4 import BeautifulSoup
req = requests.get('http://warframe.wikia.com/wiki/Ivara')
soup = BeautifulSoup(req.text, "lxml")
for sub_heading in soup.find_all('p'):
    print(sub_heading.text)
You can use the index of the target paragraph and get the required text:
print(soup.select('p')[4].text.strip())
or fetch it via the previous paragraph with the text "Release Date:":
print(soup.findAll('b', text="Release Date:")[0].parent.next_sibling.text.strip())
Using the solution @Andersson provided (it won't work for all heroes, as there is no release date for every one of them) and @SIM's comment, here is a generalized solution for any hero/champion (or whatever you call them in that game).
import requests
from bs4 import BeautifulSoup

name = 'Ivara'
url = 'http://warframe.wikia.com/wiki/' + name
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
main_text = soup.find('div', class_='tabbertab')
print(main_text.find('b', text=name).parent.text.strip())
Output:
Equipped with a diverse selection of arrows, deft hands, shroud of stealth and an exalted bow, the crafty Ivara infiltrates hostile territory with deception and diversion, and eliminates threats with a shot from beyond. Ivara emerged in Update 18.0.
For other heroes, just change the name variable.
One more example with name = 'Volt', Output:
Volt has the power to wield and bend electricity. He is highly versatile, armed with powerful abilities that can damage enemies, provide cover and complement the ranged and melee combat of his cell. The electrical nature of his abilities make him highly effective against the Corpus, with their robots in particular. He is one of three starter options for new players.
Explanation:
If you inspect the page, you can see that there is only 1 <b>hero-name</b> inside the <div class="tabbertab" ... > tag. So you can use the <b>...</b> to find the text you need.
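To reuse this for other heroes without editing the variable each time, the same logic can be wrapped in a small function; get_description is a name introduced here, not part of the original answer:
import requests
from bs4 import BeautifulSoup

def get_description(name):
    # The description is the paragraph containing <b>name</b> inside
    # the first <div class="tabbertab"> on the hero's page.
    url = 'http://warframe.wikia.com/wiki/' + name
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    main_text = soup.find('div', class_='tabbertab')
    return main_text.find('b', text=name).parent.text.strip()

print(get_description('Ivara'))
print(get_description('Volt'))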

How to select a specific tag within this html?

How would I select all of the titles on this page:
http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext
For example, I'm trying to get all the lines similar to this:
AFAS C1001 Introduction to African-American Studies. 3 points.
main_page iterates through all of the school's class pages from here so I can grab all of the titles like the one above:
http://bulletin.columbia.edu/columbia-college/departments-instruction/
for page in main_page:
    sub_abbrev = page.find("div", {"class": "courseblock"})
I have this code, but I can't figure out exactly how to select all of the <strong> tags of the first child.
Using latest python and beautiful soup 4 to web-scrape.
Lmk if there is anything else that is needed.
Thanks
Iterate over the elements with the courseblock class; then, for every course, get the element with the courseblocktitle class. A working example using the select() and select_one() methods:
import requests
from bs4 import BeautifulSoup
url = "http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for course in soup.select(".courseblock"):
    title = course.select_one("p.courseblocktitle").get_text(strip=True)
    print(title)
Prints:
AFAS C1001 Introduction to African-American Studies.3 points.
AFAS W3030 African-American Music.3 points.
AFAS C3930 (Section 3) Topics in the Black Experience: Concepts of Race and Racism.4 points.
AFAS C3936 Black Intellectuals Seminar.4 points.
AFAS W4031 Protest Music and Popular Culture.3 points.
AFAS W4032 Image and Identity in Contemporary Advertising.4 points.
AFAS W4035 Criminal Justice and the Carceral State in the 20th Century United States.4 points.
AFAS W4037 (Section 1) Third World Studies.4 points.
AFAS W4039 Afro-Latin America.4 points.
A good follow-up question from @double_j:
In the OP's example, he has a space between the points. How would you keep that? That's how the data shows on the site, even though it's not really in the source code.
I thought about using the separator argument of the get_text() method, but that would also add an extra space before the last dot. Instead, I would join the strong element texts via str.join():
for course in soup.select(".courseblock"):
    title = " ".join(strong.get_text() for strong in course.select("p.courseblocktitle > strong"))
    print(title)
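With that, the printed titles keep the space shown on the site, e.g. AFAS C1001 Introduction to African-American Studies. 3 points.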

Printing all occurence of certain document elements of a webpage

So I was scraping this particular webpage, https://www.zomato.com/srijata , for all the "restaurant reviews" (not the self-comments on her own reviews) posted by the user "Sri".
import urllib2
from bs4 import BeautifulSoup

zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)
soup = BeautifulSoup(open('zomato_info.html'))
soup.find('div', 'mtop0 rev-text').text
This prints her first restaurant review, i.e. "Sri reviewed Big Straw - Chew On This", as:
u'Rated This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..'
I also tried another selector. I have this question:
How can I print the next restaurant review? I tried findNextSiblings etc. but none of them seem to work.
First of all, you don't need to write the output to a file; pass the result of the urlopen() call directly to the BeautifulSoup constructor.
To get the reviews, you need to iterate over all div tags with the class rev-text, and get the .next_sibling of the div element inside:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('https://www.zomato.com/srijata'))
for div in soup.find_all('div', class_='rev-text'):
    print div.div.next_sibling
Prints:
This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..
The ambience is good. The food quality is good. I Didn't find anything to complain. I wanted to visit the place fir a very long time and had dinner today. The meals are very good and if u want the better quality compared to other Andhra restaurants then this is the place. It's far better than nandhana. The staffs are very polite too.
...
You should make a for loop and use find_all instead of find:
zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)
soup = BeautifulSoup(open('zomato_info.html'))
for div in soup.find_all('div', 'rev-text'):
    print div.text
Also, one question: why are you saving the HTML to a file and then reading the file back into a soup object?
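For what it's worth, urllib2 is Python 2 only; on Python 3 with requests, the same scrape, parsed directly from the response, might look like this sketch (combining the two answers above):
import requests
from bs4 import BeautifulSoup

# Parse the response directly; no intermediate file needed.
soup = BeautifulSoup(requests.get('https://www.zomato.com/srijata').text, 'html.parser')
for div in soup.find_all('div', class_='rev-text'):
    print(div.div.next_sibling)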
