scraping part of website in python

I want to scrape part of this website, http://warframe.wikia.com/wiki/ , specifically the description. The command should print
"Equipped with a diverse selection of arrows, deft hands, shroud of stealth and an exalted bow, the crafty Ivara infiltrates hostile territory with deception and diversion, and eliminates threats with a shot from beyond. Ivara emerged in Update 18.0."
and nothing else. It might help to select the specific <p> I want to print.
For now I have this, but it doesn't return what I want:
import requests
from bs4 import BeautifulSoup
req = requests.get('http://warframe.wikia.com/wiki/Ivara')
soup = BeautifulSoup(req.text, "lxml")
for sub_heading in soup.find_all('p'):
    print(sub_heading.text)

You can select the target paragraph by its index and get the required text:
print(soup.select('p')[4].text.strip())
or locate it from the preceding paragraph, which contains the text "Release Date:":
print(soup.findAll('b', text="Release Date:")[0].parent.next_sibling.text.strip())

Building on the solution @Andersson provided (which won't work for every hero, because not all of them have a release date) and @SIM's comment, here is a generalized solution for any hero/champion (or whatever you call them in that game).
import requests
from bs4 import BeautifulSoup
name = 'Ivara'
url = 'http://warframe.wikia.com/wiki/' + name
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
main_text = soup.find('div', class_='tabbertab')
print(main_text.find('b', text=name).parent.text.strip())
Output:
Equipped with a diverse selection of arrows, deft hands, shroud of stealth and an exalted bow, the crafty Ivara infiltrates hostile territory with deception and diversion, and eliminates threats with a shot from beyond. Ivara emerged in Update 18.0.
For other heroes, just change the name variable.
One more example with name = 'Volt', Output:
Volt has the power to wield and bend electricity. He is highly versatile, armed with powerful abilities that can damage enemies, provide cover and complement the ranged and melee combat of his cell. The electrical nature of his abilities make him highly effective against the Corpus, with their robots in particular. He is one of three starter options for new players.
Explanation:
If you inspect the page, you can see that there is only one <b>hero-name</b> element inside the <div class="tabbertab" ... > tag, so you can use that <b>...</b> to find the text you need.
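The idea in the explanation above can be tried offline. This is a minimal sketch, assuming a simplified, hypothetical piece of markup (SAMPLE_HTML) in place of the real wiki page, which is more complex and may have changed. The helper name find_description is made up for this example, and it uses string= where the answer uses the older text= keyword.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page.
SAMPLE_HTML = """
<div class="tabbertab">
  <p><b>Release Date:</b> (placeholder)</p>
  <p><b>Ivara</b> infiltrates hostile territory with deception and diversion.</p>
</div>
"""

def find_description(html, name):
    """Return the text of the paragraph whose <b> tag contains exactly `name`."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="tabbertab")
    if container is None:
        return None
    # string= matches a tag whose only child is this exact string.
    bold = container.find("b", string=name)
    if bold is None:
        return None
    return bold.parent.get_text().strip()

print(find_description(SAMPLE_HTML, "Ivara"))
```

Returning None when the container or the <b> tag is missing avoids the AttributeError that the chained one-liner would raise for a name that is not on the page.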

Related

beautifulsoup creates an incomplete result?

I want BeautifulSoup to parse a page from the Lexico dictionary.
What I want to parse is the content under this ul tag:
(screenshot of the tag)
This is the code I am using:
url = 'https://www.lexico.com/definition/iron'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
semb = soup.find('ul', attrs={'class':'semb'})
print(semb)
I think this should print the complete ul element.
However, the output I actually get is cut off.
It seems that BeautifulSoup stops parsing, for some reason, in the middle of the second li tag. It doesn't look related to JavaScript to me; am I wrong? Thanks, everyone.
Beautifulsoup version: 4.11.1
Python: 3.9.12

Since you want just the definitions, try the following, which will get all of the definitions under each grammar type (in the example page linked, it will get the noun and verb definitions)
import requests
from bs4 import BeautifulSoup
url = 'https://www.lexico.com/definition/iron'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')  # I tested with html.parser because I didn't have lxml installed, but either should work
definitions = soup.find_all("span", class_='ind one-click-content')  # note the keyword is class_ because class is reserved
# This gives us a list of all the <span> tags containing the definitions.
for n, d in enumerate(definitions, start=1):
    print(f"{n}. {d.text}")
OUTPUT:
1. A strong, hard magnetic silvery-grey metal, the chemical element of atomic number 26, much used as a material for construction and manufacturing, especially in the form of steel.
2. Used figuratively as a symbol or type of firmness, strength, or resistance.
3. A tool or implement now or originally made of iron.
4. Metal supports for a malformed leg.
5. Fetters or handcuffs.
6. Stirrups.
7. A handheld implement, typically an electrical one, with a heated flat steel base, used to smooth clothes, sheets, etc.
8. A golf club with a metal head (typically with a numeral indicating the degree to which the head is angled in order to loft the ball)
9. A shot made with an iron.
10. A meteorite containing a high proportion of iron.
11. Smooth (clothes, sheets, etc.) with an iron.
12. Firmness or ruthlessness cloaked in outward gentleness.
13. Have a range of options or courses of action available, or be involved in many activities or commitments at the same time.
14. Have other options or courses of action available, or be involved in other activities or commitments at the same time.
15. Having the feet or hands fettered.
16. (of a sailing vessel) stalled head to wind and unable to come about or tack either way.
17. Solve or settle difficulties or problems.
If you want to get the definitions for the different parts of grammar separately, that would be possible too. Use find_all to get all with class "semb" and then use find_all on each of those to get the spans as above, and also extract the labels for whichever section it is.
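A rough sketch of that per-section approach is below. It runs on a small, hypothetical snippet (SAMPLE_HTML) rather than the live page. The "semb" and "ind one-click-content" classes come from the code above, but the <h3 class="pos"> label element and the function name definitions_by_section are assumptions made for illustration, so check the real markup before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the label element (<h3 class="pos">) is an assumption.
SAMPLE_HTML = (
    '<section><h3 class="pos">noun</h3>'
    '<ul class="semb">'
    '<li><span class="ind one-click-content">A strong, hard metal.</span></li>'
    '<li><span class="ind one-click-content">A golf club with a metal head.</span></li>'
    '</ul></section>'
    '<section><h3 class="pos">verb</h3>'
    '<ul class="semb">'
    '<li><span class="ind one-click-content">Smooth clothes with an iron.</span></li>'
    '</ul></section>'
)

def definitions_by_section(html):
    """Group definition texts under the label that precedes each 'semb' list."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for semb in soup.find_all("ul", class_="semb"):
        # Look backwards in the document for the nearest label element.
        label_tag = semb.find_previous("h3", class_="pos")
        label = label_tag.get_text(strip=True) if label_tag else "unknown"
        spans = semb.find_all("span", class_="ind")
        grouped.setdefault(label, []).extend(s.get_text(strip=True) for s in spans)
    return grouped

for label, texts in definitions_by_section(SAMPLE_HTML).items():
    print(label, texts)
```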

How to exclude nested tags from a parent tag to get just the output as text, skipping the link (a) tags

I want to exclude the nested tags and get just the text, in this case ignoring the a tags (links) embedded in the paragraphs:
base_url="https://www.usatoday.com/story/tech/2020/09/17/qanon-conspiracy-theories-debunked-social-media/5791711002/"
response=requests.get(base_url)
html=response.content
bs=BeautifulSoup(html,"lxml")
article=bs.find_all("article",{"class":"gnt_pr"})
body=article[0].find_all("p",{"class":"gnt_ar_b_p"})
The output is:
[<p class="gnt_ar_b_p">An emboldened community of believers known as QAnon is spreading a baseless patchwork of conspiracy theories that are fooling Americans who are looking for simple answers in a time of intense political polarization, social isolation and economic turmoil.</p>,
<p class="gnt_ar_b_p">Experts call QAnon <a class="gnt_ar_b_a" data-t-l="|inline|intext|n/a" href="https://www.usatoday.com/in-depth/tech/2020/08/31/qanon-conspiracy-theories-trump-election-covid-19-pandemic-extremist-groups/5662374002/" rel="noopener" target="_blank">a "digital cult"</a> because of its pseudo-religious qualities and an extreme belief system that enthrones President Donald Trump as a savior figure crusading against evil.</p>,
<p class="gnt_ar_b_p">The core of QAnon is the false theory that Trump was elected to root out a secret child-sex trafficking ring run by Satanic, cannibalistic Democratic politicians and celebrities. Although it may sound absurd, it has nonetheless attracted devoted followers who have begun to perpetuate other theories that they suggest, imply or argue are somehow related to the main premise.</p>,
I want to exclude these a tags.

To get only the text from the paragraphs, you can use the .get_text() method. For example:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.usatoday.com/story/tech/2020/09/17/qanon-conspiracy-theories-debunked-social-media/5791711002/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")
body = soup.select("article p")
for paragraph in body:
    print(paragraph.get_text(strip=True, separator=' '))
Prints:
An emboldened community of believers known as QAnon is spreading a baseless patchwork of conspiracy theories that are fooling Americans who are looking for simple answers in a time of intense political polarization, social isolation and economic turmoil.
...etc.
Or you can .unwrap() all elements inside each paragraph and then get the text:
for paragraph in body:
    for tag in paragraph.find_all():
        tag.unwrap()
    print(paragraph.text)
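To compare the two approaches without hitting the network, here is a small sketch that runs both on a shortened copy of one paragraph from the question. SAMPLE_HTML and the two helper names are made up for this example. Both approaches keep the link's text and drop only the <a> markup.

```python
from bs4 import BeautifulSoup

# Shortened copy of one paragraph from the question's output.
SAMPLE_HTML = (
    '<p class="gnt_ar_b_p">Experts call QAnon '
    '<a class="gnt_ar_b_a" href="#">a "digital cult"</a> '
    'because of its pseudo-religious qualities.</p>'
)

def text_via_get_text(html):
    """Read the text without modifying the parsed tree."""
    paragraph = BeautifulSoup(html, "html.parser").p
    return paragraph.get_text()

def text_via_unwrap(html):
    """Remove nested tags from the tree (keeping their contents), then read the text."""
    paragraph = BeautifulSoup(html, "html.parser").p
    for tag in paragraph.find_all():
        tag.unwrap()
    return paragraph, paragraph.text

print(text_via_get_text(SAMPLE_HTML))
paragraph, text = text_via_unwrap(SAMPLE_HTML)
print(text)
print(paragraph.find("a"))  # the <a> tag is gone after unwrap()
```

The practical difference is that .get_text() leaves the tree untouched, while .unwrap() changes it, which matters if you still need the links later.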

Can I search multiple HTML elements within the soup.find_all() function?

I'm trying to scrape a website for its most viewed headlines. The class of the elements I want shares common words with other items on the page. For example, I want the text inside the <a> tags with the class "black_color". Other items also use the <a> tag but have the class "color_black hover_color_gray_90", and I don't want these included. I was thinking I could match on more HTML elements to be more specific, but I'm not sure how to incorporate them.
import requests
from bs4 import BeautifulSoup
def getHeadlines():
    url = "https://www.bostonglobe.com/"
    source_code = requests.get(url)
    plainText = source_code.text
    soup = BeautifulSoup(plainText, "html.parser")
    #results = soup.find_all("h2",{"class":"headline"})
    results = soup.find_all("a",{"class":"black_color"})
    with open("headlines.txt", "w", encoding="utf-8") as f:
        for i in results:
            f.write(str(i.text + ' \n' + '\n'))
getHeadlines()

I think looking at the <a> tag may actually be harder than using the matching <h2>, which has a 'headline' class.
Try this:
soup = BeautifulSoup(source_code.text, "html.parser")
for headline in soup.find_all("h2", class_="headline"):
    print(headline.text)
Output:
Boston College outbreak worries epidemiologists, students, community
Nantucket finds ‘community spread’ of COVID-19 among tradespeople The town’s Select Board and Board of Health will hold an emergency meeting at 10 a.m. Monday to consider placing restrictions on some of those trades, officials said Friday.
Weddings in a pandemic: Welcome to the anxiety vortexNewlyweds are spending their honeymoons praying they don’t hear from COVID‐19 contact tracers. Relatives are agonizing over “damned if we RSVP yes, damned if we RSVP no” decisions. Wedding planners are adding contract clauses specifying they’ll walk off the job if social distancing rules are violated.
Fauci says US should plan to ‘hunker down’ for fall and winter
Of struggling area, Walsh says, ‘We have to get it better under control’
...
After looking at it for a while, I think the <a> tag may actually have some classes added dynamically, which BeautifulSoup will not pick up. Just searching for the color_black class and excluding the color_black hover_color_gray_90 class still yields the headline classes that you don't want (e.g. 'Sports'), even though when I look at the actual web page source code, I see it's differentiated in the way you've indicated.
That's usually a good sign that there are post-load CSS changes being made to a page. (I may be wrong about that, but in either case I hope the <h2> approach gets you what you need.)
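For reference, this is roughly what "match one class but exclude another" looks like on static markup, using a CSS :not() selector with select(). It is only a sketch: SAMPLE_HTML and headline_links are hypothetical, and as noted above this kind of filter did not separate the links on the live page, possibly because the classes there are assigned after load.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup using the class names mentioned in the question.
SAMPLE_HTML = (
    '<a class="color_black" href="#">Headline one</a>'
    '<a class="color_black hover_color_gray_90" href="#">Sports</a>'
    '<a class="color_black" href="#">Headline two</a>'
)

def headline_links(html):
    """Return the text of <a> tags that have color_black but not hover_color_gray_90."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("a.color_black:not(.hover_color_gray_90)")
    return [a.get_text(strip=True) for a in links]

print(headline_links(SAMPLE_HTML))
```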

How to select a specific tag within this html?

How would I select all of the titles on this page?
http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext
For example, I'm trying to get all the lines similar to this:
AFAS C1001 Introduction to African-American Studies. 3 points.
main_page iterates through all of the department pages listed here, so that I can grab all of the titles like the one above:
http://bulletin.columbia.edu/columbia-college/departments-instruction/
for page in main_page:
    sub_abbrev = page.find("div", {"class": "courseblock"})
I have this code, but I can't figure out how to select all of the strong tags of the first child.
I'm using the latest Python and Beautiful Soup 4 to web-scrape.
Let me know if anything else is needed.
Thanks

Iterate over elements with courseblock class, then, for every course, get the element with courseblocktitle class. Working example using select() and select_one() methods:
import requests
from bs4 import BeautifulSoup
url = "http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for course in soup.select(".courseblock"):
    title = course.select_one("p.courseblocktitle").get_text(strip=True)
    print(title)
Prints:
AFAS C1001 Introduction to African-American Studies.3 points.
AFAS W3030 African-American Music.3 points.
AFAS C3930 (Section 3) Topics in the Black Experience: Concepts of Race and Racism.4 points.
AFAS C3936 Black Intellectuals Seminar.4 points.
AFAS W4031 Protest Music and Popular Culture.3 points.
AFAS W4032 Image and Identity in Contemporary Advertising.4 points.
AFAS W4035 Criminal Justice and the Carceral State in the 20th Century United States.4 points.
AFAS W4037 (Section 1) Third World Studies.4 points.
AFAS W4039 Afro-Latin America.4 points.
A good follow-up question from @double_j:
In the OP's example, there is a space before the points. How would you keep that? That's how the data shows on the site, even though it's not really in the source code.
I thought about using the separator argument of the get_text() method, but that would also add an extra space before the last dot. Instead, I would join the strong element texts via str.join():
for course in soup.select(".courseblock"):
    title = " ".join(strong.get_text() for strong in course.select("p.courseblocktitle > strong"))
    print(title)
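The difference between the two options is easier to see on a small example. The markup below (SAMPLE_HTML) is hypothetical and only mimics the situation described, with the final dot sitting in its own nested element. The real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the trailing dot lives in its own nested element.
SAMPLE_HTML = (
    '<div class="courseblock"><p class="courseblocktitle">'
    '<strong>AFAS C1001 Introduction to African-American Studies.</strong>'
    '<strong>3 points<span>.</span></strong>'
    '</p></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_tag = soup.select_one("p.courseblocktitle")

# separator=" " puts a space between every text node, including before the dot.
with_separator = title_tag.get_text(" ", strip=True)

# Joining per <strong> element only adds a space between the two parts.
with_join = " ".join(
    strong.get_text() for strong in soup.select("p.courseblocktitle > strong")
)

print(with_separator)
print(with_join)
```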

How can I extract all text between tags?

I would like to extract a random poem from this book.
Using BeautifulSoup, I have been able to find the title and prose.
print(soup.find('div', class_="pre_poem").text)
print(soup.find('table', class_="poem").text)
But I would like to find all the poems and pick one.
Should I use a regex and match all between
<h3> and </span></p> ?

Use an HTML parser instead. It's safer in terms of unintended consequences.
The reason programmers discourage parsing HTML with regex is that the markup of a page is not static, especially if your source HTML is a live webpage. Regex is better suited to plain strings with a predictable format.
Use regex at your own risk.

Assuming you already have a suitable soup object to work with, the following might help you get started:
import random
poem_ids = []
# Collect the link targets from the table of contents.
for section in soup.find_all('ol', class_="TOC"):
    poem_ids.extend(li.find('a').get('href') for li in section.find_all('li'))
# Strip the leading '#' from each href and skip the last entry.
poem_ids = [href[1:] for href in poem_ids[:-1] if href]
poem_id = random.choice(poem_ids)
poem_start = soup.find('a', id=poem_id)
poem = poem_start.find_next()
poem_text = []
while True:
    poem = poem.next_element
    if poem.name == 'h3':
        break
    # Text nodes have no tag name.
    if poem.name is None:
        poem_text.append(poem.string)
print('\n'.join(poem_text).replace('\n\n\n', '\n'))
This first extracts a list of the poems from the table of contents at the top of the page. These contain unique IDs to each of the poems. Next a random ID is chosen and the matching poem is then extracted based on that ID.
For example, if the first poem was selected, you would see the following output:
"The Arrow and the Song," by Longfellow (1807-82), is placed first in
this volume out of respect to a little girl of six years who used to
love to recite it to me. She knew many poems, but this was her
favourite.
I shot an arrow into the air,
It fell to earth, I knew not where;
For, so swiftly it flew, the sight
Could not follow it in its flight.
I breathed a song into the air,
It fell to earth, I knew not where;
For who has sight so keen and strong
That it can follow the flight of song?
Long, long afterward, in an oak
I found the arrow, still unbroke;
And the song, from beginning to end,
I found again in the heart of a friend.
Henry W. Longfellow.
This is done by using BeautifulSoup to extract all of the text from each element until the next <h3> tag is found, and then removing any extra line breaks.
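The same walk can be tried on a tiny document. This is a sketch under assumptions: SAMPLE_HTML is a hypothetical stand-in for the book's markup (an anchor with an id, followed by an <h3> title and the poem's paragraphs), and text_until_next_heading is a name invented here. It iterates over .next_elements instead of using a while loop, so it also stops cleanly at the end of the document.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the book's structure.
SAMPLE_HTML = (
    '<a id="p1"></a><h3>First Poem</h3><p>Line one</p><p>Line two</p>'
    '<a id="p2"></a><h3>Second Poem</h3><p>Other line</p>'
)

def text_until_next_heading(soup, anchor_id, stop_tag="h3"):
    """Collect text nodes from the tag after the anchor up to the next stop_tag."""
    anchor = soup.find("a", id=anchor_id)
    heading = anchor.find_next()  # first tag after the anchor
    pieces = []
    for element in heading.next_elements:
        if isinstance(element, NavigableString):
            pieces.append(str(element))
        elif element.name == stop_tag:
            break
    return pieces

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(text_until_next_heading(soup, "p1"))
print(text_until_next_heading(soup, "p2"))
```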
