I want BeautifulSoup to parse a page from the Lexico dictionary. What I want to parse is the content under this ul tag:
[screenshot of the tag]
This (sorry for the wall of text) is the result that
import requests
from bs4 import BeautifulSoup

url = 'https://www.lexico.com/definition/iron'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
semb = soup.find('ul', attrs={'class': 'semb'})
print(semb)
should give (I think).
However, the code here gives this (sorry again):
It seems that BeautifulSoup stops parsing for some reason in the middle of the second li tag. It doesn't seem to be anything related to JavaScript to me, but am I wrong? Thanks, anyone.
Beautifulsoup version: 4.11.1
Python: 3.9.12
Since you want just the definitions, try the following, which will get all of the definitions under each grammar type (on the example page linked, it will get the noun and verb definitions):
import requests
from bs4 import BeautifulSoup
url = 'https://www.lexico.com/definition/iron'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')  # I used html.parser as I didn't have lxml installed, but either should work
definitions = soup.find_all("span", class_='ind one-click-content')  # note the keyword is class_ because class is reserved
# This gives us a list of all the <span>s containing the definitions.
for n, d in enumerate(definitions, start=1):
    print(f"{n}. {d.text}")
OUTPUT:
1. A strong, hard magnetic silvery-grey metal, the chemical element of atomic number 26, much used as a material for construction and manufacturing, especially in the form of steel.
2. Used figuratively as a symbol or type of firmness, strength, or resistance.
3. A tool or implement now or originally made of iron.
4. Metal supports for a malformed leg.
5. Fetters or handcuffs.
6. Stirrups.
7. A handheld implement, typically an electrical one, with a heated flat steel base, used to smooth clothes, sheets, etc.
8. A golf club with a metal head (typically with a numeral indicating the degree to which the head is angled in order to loft the ball)
9. A shot made with an iron.
10. A meteorite containing a high proportion of iron.
11. Smooth (clothes, sheets, etc.) with an iron.
12. Firmness or ruthlessness cloaked in outward gentleness.
13. Have a range of options or courses of action available, or be involved in many activities or commitments at the same time.
14. Have other options or courses of action available, or be involved in other activities or commitments at the same time.
15. Having the feet or hands fettered.
16. (of a sailing vessel) stalled head to wind and unable to come about or tack either way.
17. Solve or settle difficulties or problems.
If you want to get the definitions for the different parts of grammar separately, that is possible too: use find_all to get every element with class "semb", then use find_all on each of those to get the spans as above, and also extract the label for whichever section it is.
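To sketch that grouping, here is a minimal, self-contained example. The HTML below is a stand-in I wrote to mimic the structure described above (a heading per part of speech, each followed by a "semb" list); the tag names around the lists are assumptions, not the site's exact markup:

```python
from bs4 import BeautifulSoup

# Stand-in markup: one heading per grammar section, each followed by a "semb" list.
html = """
<h3 class="ps">noun</h3>
<ul class="semb"><li><span class="ind one-click-content">A strong, hard metal.</span></li></ul>
<h3 class="ps">verb</h3>
<ul class="semb"><li><span class="ind one-click-content">Smooth with an iron.</span></li></ul>
"""

soup = BeautifulSoup(html, "html.parser")
grouped = {}
for semb in soup.find_all("ul", class_="semb"):
    # The nearest preceding heading is the section label (noun, verb, ...)
    label = semb.find_previous("h3").get_text()
    grouped[label] = [s.get_text() for s in semb.find_all("span", class_="ind one-click-content")]
print(grouped)
```

On the real page you would build soup from req.text as above; only the grouping loop is the point here.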
Related
I'm trying to scrape a website for its most-viewed headlines. The class selector of the text I want shares common words with other items on the page. For example, I want the text in the tag with the class "black_color". Other items use the same tag but have the class "color_black hover_color_gray_90", and I don't want these included. I was thinking I could use more HTML elements to be more specific, but I'm not sure how to incorporate them.
import requests
from bs4 import BeautifulSoup

def getHeadlines():
    url = "https://www.bostonglobe.com/"
    source_code = requests.get(url)
    plainText = source_code.text
    soup = BeautifulSoup(plainText, "html.parser")
    #results = soup.find_all("h2", {"class": "headline"})
    results = soup.find_all("a", {"class": "black_color"})
    with open("headlines.txt", "w", encoding="utf-8") as f:
        for i in results:
            f.write(i.text + ' \n' + '\n')

getHeadlines()
I think looking at the <a> tag may actually be harder than using the matching <h2>, which has a 'headline' class.
Try this:
soup = BeautifulSoup(source_code.text, "html.parser")
for headline in soup.find_all("h2", class_="headline"):
    print(headline.text)
Output:
Boston College outbreak worries epidemiologists, students, community
Nantucket finds ‘community spread’ of COVID-19 among tradespeople The town’s Select Board and Board of Health will hold an emergency meeting at 10 a.m. Monday to consider placing restrictions on some of those trades, officials said Friday.
Weddings in a pandemic: Welcome to the anxiety vortexNewlyweds are spending their honeymoons praying they don’t hear from COVID‐19 contact tracers. Relatives are agonizing over “damned if we RSVP yes, damned if we RSVP no” decisions. Wedding planners are adding contract clauses specifying they’ll walk off the job if social distancing rules are violated.
Fauci says US should plan to ‘hunker down’ for fall and winter
Of struggling area, Walsh says, ‘We have to get it better under control’
...
After looking at it for a while, I think the <a> tag may actually have some classes added dynamically, which BeautifulSoup will not pick up. Just searching for the color_black class and excluding the color_black hover_color_gray_90 class yields headline classes that you don't want (e.g. 'Sports'), even though when I look at the actual web page source code, I see it's differentiated in the way you've indicated.
That's usually a good sign that there are post-load CSS changes being made to a page. (I may be wrong about that, but in either case I hope the <h2> approach gets you what you need.)
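If you do still want to try the <a> route, one option is an exact-match attribute selector: [class="black_color"] matches only when the class attribute is exactly that string, so variants with extra classes are excluded. A minimal sketch on stand-in markup (the toy HTML below is mine, not the Globe's):

```python
from bs4 import BeautifulSoup

# Toy markup illustrating the two class variants mentioned above.
html = """
<a class="black_color" href="/a">Wanted headline</a>
<a class="color_black hover_color_gray_90" href="/b">Unwanted link</a>
"""

soup = BeautifulSoup(html, "html.parser")
# [class="black_color"] matches the attribute string exactly, so the
# element carrying extra classes is excluded.
wanted = [a.get_text() for a in soup.select('a[class="black_color"]')]
print(wanted)
```

This still won't help if the classes are added after page load, of course.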
Currently I am trying to read the text between two tags on a webpage.
This is my code so far:
soup = BeautifulSoup(r.text, 'lxml')
text = soup.text
tag_one = soup.select_one('div.first-header')
tage_two = soup.select_one('div.second-header')
text = text.split(tag_one)[1]
text = text.split(tage_two)[0]
print(text)
Basically I am trying to get the text between the first and second header by identifying their tags. I was planning on to do this by splitting by the first tag and second tag.
Is this even possible? Is there a smarter way to do this?
Example:
If you look at: https://en.wikipedia.org/wiki/Python_(programming_language)
I would like to find a way to extract the text under "History" by identifying the tags of "History" and "Features and philosophy" and splitting by these tags.
With BeautifulSoup 4.7+, the CSS selector support is much improved. This task can be done utilizing the CSS Level 4 :has() selector that is now supported in BeautifulSoup:
import codecs
import requests
from bs4 import BeautifulSoup

website_url = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)").text
soup = BeautifulSoup(website_url, "lxml")
els = soup.select('h2:has(span#History) ~ *:has(~ h2:has(span#Features_and_philosophy))')
with codecs.open('text.txt', 'w', 'utf-8') as f:
    for el in els:
        text = el.get_text()
        f.write(text + '\n')
        print(text)
The output:
Guido van Rossum at OSCON 2006.Main article: History of PythonPython was conceived in the late 1980s[31] by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by SETL)[32], capable of exception handling and interfacing with the Amoeba operating system.[7] Its implementation began in December 1989.[33] Van Rossum's long influence on Python is reflected in the title given to him by the Python community: Benevolent Dictator For Life (BDFL) – a post from which he gave himself permanent vacation on July 12, 2018.[34]
Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-detecting garbage collector and support for Unicode.[35]
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not completely backward-compatible.[36] Many of its major features were backported to Python 2.6.x[37] and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3.[38]
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.[39][40] In January 2017, Google announced work on a Python 2.7 to Go transcompiler to improve performance under concurrent workloads.[41]
You can't do it the way you're hoping, because BS4 works on the DOM, a tree structure, rather than on something linear.
Using your wiki example, what you're really looking for is
find id="History" (it's a span)
Navigate up to the H2 element -- remember that as the starting point.
find id="Features_and_philosophy" (it's another span)
Navigate up to the nearest H2 element -- remember that as the ending point.
Now, notice that the two H2 elements are siblings (they have the same parent). So what you're looking to do is get each sibling between the starting H2 and the ending H2 and, for each one, take its full text.
That's not hard, but it's a loop in which you compare each sibling until you reach the ending one. Nothing as simple as you'd hoped.
In a more general case, it's much harder (or tedious, really), in that you may have to go up and down the DOM tree looking for the matching element.
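That sibling loop can be sketched as follows; the HTML here is a minimal stand-in for the wiki layout (two sibling h2 headings with content elements between them), not the actual page:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the wiki layout described above.
html = """
<h2><span id="History">History</span></h2>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h2><span id="Features_and_philosophy">Features and philosophy</span></h2>
<p>Out of range.</p>
"""

soup = BeautifulSoup(html, "html.parser")
# Find each id'd span and navigate up to its H2 (start and end points).
start = soup.find("span", id="History").find_parent("h2")
end = soup.find("span", id="Features_and_philosophy").find_parent("h2")

texts = []
for sib in start.find_next_siblings():
    if sib is end:          # stop once we reach the ending H2
        break
    texts.append(sib.get_text())
print(texts)
```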
I want to scrape part of this website http://warframe.wikia.com/wiki/ , especially the description; the command should reply with
"Equipped with a diverse selection of arrows, deft hands, shroud of stealth and an exalted bow, the crafty Ivara infiltrates hostile territory with deception and diversion, and eliminates threats with a shot from beyond. Ivara emerged in Update 18.0."
and nothing else, so maybe it would be useful to specify which <p> I want to print.
For now I have this, but it doesn't return what I want.
import requests
from bs4 import BeautifulSoup
req = requests.get('http://warframe.wikia.com/wiki/Ivara')
soup = BeautifulSoup(req.text, "lxml")
for sub_heading in soup.find_all('p'):
print(sub_heading.text)
You can use the index of the target paragraph to get the required text:
print(soup.select('p')[4].text.strip())
or fetch it via the preceding paragraph containing the text "Release Date:":
print(soup.find_all('b', string="Release Date:")[0].parent.next_sibling.text.strip())
Using the solution @Andersson provided (it won't work for all heroes, as there isn't a release date for every one of them) and @SIM's comment, here is a generalized solution for any hero/champion (or whatever you call them in that game).
import requests
from bs4 import BeautifulSoup

name = 'Ivara'
url = 'http://warframe.wikia.com/wiki/' + name
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
main_text = soup.find('div', class_='tabbertab')
print(main_text.find('b', string=name).parent.text.strip())
Output:
Equipped with a diverse selection of arrows, deft hands, shroud of stealth and an exalted bow, the crafty Ivara infiltrates hostile territory with deception and diversion, and eliminates threats with a shot from beyond. Ivara emerged in Update 18.0.
For other heroes, just change the name variable.
One more example with name = 'Volt', Output:
Volt has the power to wield and bend electricity. He is highly versatile, armed with powerful abilities that can damage enemies, provide cover and complement the ranged and melee combat of his cell. The electrical nature of his abilities make him highly effective against the Corpus, with their robots in particular. He is one of three starter options for new players.
Explanation:
If you inspect the page, you can see that there is only one <b>hero-name</b> inside the <div class="tabbertab" ...> tag, so you can use that <b>...</b> element to find the text you need.
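Here is the same lookup on a toy version of that structure, so you can see why the <b> element works as an anchor. The markup below is a stand-in I wrote, and string= is used in place of the older text= keyword:

```python
from bs4 import BeautifulSoup

# Stand-in for the described structure: the hero name in <b> inside the
# description paragraph, within the "tabbertab" div.
html = """
<div class="tabbertab">
  <p><b>Ivara</b> infiltrates hostile territory with deception.</p>
</div>
"""

name = "Ivara"
soup = BeautifulSoup(html, "html.parser")
main_text = soup.find("div", class_="tabbertab")
# Find the unique <b> holding the name, then take its parent paragraph's text.
description = main_text.find("b", string=name).parent.text.strip()
print(description)
```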
I am trying to scrape impact factors of journals from a particular website, or from the web at large. I have been searching for something close, but no luck so far.
This is the first time I am trying to web-scrape with Python, and I am trying to find the simplest way.
I have a list of ISSN numbers belonging to journals, and I want to retrieve their impact factor values from the web or from a particular site. The list has more than 50K values, so manually searching for the values is practically impossible.
Input type
Index,JOURNALNAME,ISSN,Impact Factor 2015,URL,ABBV,SUBJECT
1,4OR-A Quarterly Journal of Operations Research,1619-4500,,,4OR Q J OPER RES,Management Science
2,Aaohn Journal,0891-0162,,,AAOHN J,
3,Aapg Bulletin,0149-1423,,,AAPG BULL,Engineering
4,AAPS Journal,1550-7416,,,AAPS J,Medicine
5,Aaps Pharmscitech,1530-9932,,,AAPS PHARMSCITECH,
6,Aatcc Review,1532-8813,,,AATCC REV,
7,Abdominal Imaging,0942-8925,,,ABDOM IMAGING,
8,Abhandlungen Aus Dem Mathematischen Seminar Der Universitat Hamburg,0025-5858,,,ABH MATH SEM HAMBURG,
9,Abstract and Applied Analysis,1085-3375,,,ABSTR APPL ANAL,Math
10,Academic Emergency Medicine,1069-6563,,,ACAD EMERG MED,Medicine
What is needed ?
The input above has a column of ISSN numbers. Read the ISSN numbers and search for each one on researchgate.net or on the web. When the individual web page is found, search it for "Impact Factor 2015", retrieve the value, put it in the empty place beside the ISSN number, and place the retrieved URL next to it,
so that the web search can be limited to one site and one keyword search per value. Any empty result can be kept as "NAN".
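The bookkeeping half of this (reading the CSV and filling the empty columns with either the found value or "NAN") could be sketched like so; fetch_impact_factor is a hypothetical stub standing in for whatever web lookup ends up being used:

```python
import csv
import io

def fetch_impact_factor(issn):
    # Hypothetical stub: a real version would query a site for this ISSN
    # and return (value, url), or (None, None) when nothing is found.
    return None, None

# First rows of the input, inlined here for a self-contained sketch.
raw = """Index,JOURNALNAME,ISSN,Impact Factor 2015,URL,ABBV,SUBJECT
1,4OR-A Quarterly Journal of Operations Research,1619-4500,,,4OR Q J OPER RES,Management Science
2,Aaohn Journal,0891-0162,,,AAOHN J,
"""

rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    value, url = fetch_impact_factor(row["ISSN"])
    row["Impact Factor 2015"] = value if value else "NAN"
    row["URL"] = url if url else "NAN"
print(rows[0]["Impact Factor 2015"])
```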
Thanks in advance for the suggestions and help
Try this code, using Beautiful Soup and urllib2 (so Python 2). I am using the h2 tag and searching for 'Journal Impact:', but I will let you decide on the algorithm for extracting the data. The HTML content is present in soup, and soup provides an API to extract it. What I provide is an example, and it may work for you.
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup

issn = '0219-5305'
url = 'https://www.researchgate.net/journal/%s_Analysis_and_Applications' % (issn)
htmlDoc = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlDoc, 'html.parser')
for tag in soup.find_all('h2'):
    if 'Journal Impact:' in tag.text:
        value = tag.text
        value = value.replace('Journal Impact:', '')
        value = value.strip(' *')
        print value
Output:
1.13
I think the official documentation for Beautiful Soup is pretty good. I suggest spending an hour on the documentation if you are new to this, before even trying to write any code. That hour spent reading the documentation will save you many more hours later.
https://www.crummy.com/software/BeautifulSoup/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
How would I select all of the titles on this page?
http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext
for example: I'm trying to get all the lines similar to this:
AFAS C1001 Introduction to African-American Studies. 3 points.
main_page is iterating through all of the school's classes from here, so I can grab all of the titles like the one above:
http://bulletin.columbia.edu/columbia-college/departments-instruction/
for page in main_page:
    sub_abbrev = page.find("div", {"class": "courseblock"})
I have this code, but I can't figure out exactly how to select all of the <strong> tags of the first child.
I'm using the latest Python and Beautiful Soup 4 to web-scrape.
Let me know if anything else is needed.
Thanks
Iterate over the elements with the courseblock class; then, for every course, get the element with the courseblocktitle class. Working example using the select() and select_one() methods:
import requests
from bs4 import BeautifulSoup
url = "http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for course in soup.select(".courseblock"):
    title = course.select_one("p.courseblocktitle").get_text(strip=True)
    print(title)
Prints:
AFAS C1001 Introduction to African-American Studies.3 points.
AFAS W3030 African-American Music.3 points.
AFAS C3930 (Section 3) Topics in the Black Experience: Concepts of Race and Racism.4 points.
AFAS C3936 Black Intellectuals Seminar.4 points.
AFAS W4031 Protest Music and Popular Culture.3 points.
AFAS W4032 Image and Identity in Contemporary Advertising.4 points.
AFAS W4035 Criminal Justice and the Carceral State in the 20th Century United States.4 points.
AFAS W4037 (Section 1) Third World Studies.4 points.
AFAS W4039 Afro-Latin America.4 points.
A good follow-up question from @double_j:
In the OP's example, there is a space between the points. How would you keep that? That's how the data shows on the site, even though it's not really in the source code.
I thought about using the separator argument of the get_text() method, but that would also add an extra space before the last dot. Instead, I would join the strong element texts via str.join():
for course in soup.select(".courseblock"):
    title = " ".join(strong.get_text() for strong in course.select("p.courseblocktitle > strong"))
    print(title)