Webscraping merriam-webster using beautifulsoup - python

I am using BeautifulSoup and trying to scrape only the first definition ("very cold") of a word from Merriam-Webster, but it scrapes the second line (an example sentence) as well. This is my code.
P.S.: I want only the "very cold" part; "put on your jacket...." should not be included in the output. Can someone please help?
import requests
from bs4 import BeautifulSoup
url = "https://www.merriam-webster.com/dictionary/freezing"
r = requests.get(url)
soup = BeautifulSoup(r.content,"lxml")
definition = soup.find("span", {"class" : "dt"})
tag = definition.findChild()
print(tag.text)

Selecting by class is the second-fastest method of CSS selector matching. select_one returns only the first match, and next_sibling takes you to the node you want:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.merriam-webster.com/dictionary/freezing')
soup = bs(r.content, 'lxml')
print(soup.select_one('.mw_t_bc').next_sibling.strip())

The way that Merriam-Webster structures their page is a little strange, but you can find the <strong> tag that precedes the definition, grab its next sibling, and strip the surrounding whitespace like this:
>>> tag.find('strong').next_sibling.strip()
u'very cold'
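A minimal offline sketch of the next_sibling idea from the answers above. The markup below is a hypothetical imitation of the structure they describe (a marker tag followed by the definition text and then an example sentence); the real page may differ, and the dtText and ex-sent class names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described above;
# the real Merriam-Webster page may be organised differently.
html = """
<span class="dt">
  <span class="dtText"><strong class="mw_t_bc">: </strong>very cold
    <span class="ex-sent">put on your jacket ...</span>
  </span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
marker = soup.select_one(".mw_t_bc")

# next_sibling is the text node directly after the marker tag;
# strip() trims the surrounding whitespace and newlines.
first_definition = marker.next_sibling.strip()
print(first_definition)
```

Because next_sibling returns only the text node next to the marker, the example sentence in the following span is never touched.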

Related

Find a word using BeautifulSoup

I want to extract ads that contain either of two specific Persian words, "توافق" or "توافقی", from a website. I am using BeautifulSoup and splitting the content in the soup to find the ads that contain these words, but my code does not work. Could you please help me?
Here is my simple code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__body"})
for content in results:
    words = content.split()
    if words == "توافقی" or words == "توافق":
        print(content)
Since توافقی appears in the div tags with the kt-post-card__description class, I will search those. You can then get the ads by using the tag's properties, such as .previous_sibling or .parent:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    text = content.text
    if "توافقی" in text or "توافق" in text:
        print(content.previous_sibling)  # It's the h2 title.
Basically, you are trying to split a bs4 Tag object, which is why it raises an error. Before splitting it, you need to convert it to a text string.
import re
from bs4 import BeautifulSoup
import requests
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    words = content.text.split()
    if "توافقی" in words or "توافق" in words:
        print(content)
There are different issues. First, as also mentioned by @Tim Roberts, you have to check for the words in the list with in:
if 'توافقی' in words or 'توافق' in words:
Second, you have to separate the texts of the child elements, so use get_text() with a separator:
words=content.get_text(' ', strip=True)
Note: requests does not render dynamic content; it only retrieves the static HTML.
Example
import requests
from bs4 import BeautifulSoup
r=requests.get('https://divar.ir/s/tehran')
soup=BeautifulSoup(r.text,'html.parser')
results=soup.find_all('div',attrs={'class':"kt-post-card__body"})
for content in results:
    words=content.get_text(' ', strip=True)
    if 'توافقی' in words or 'توافق' in words:
        print(content.text)
An alternative in this specific case could be to use CSS selectors, so you could select the whole <article> and pick the elements you need:
results = soup.select('article:-soup-contains("توافقی"),article:-soup-contains("توافق")')
for item in results:
    print(item.h2)
    print(item.span)
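The two points made in the answers above (test membership with in, and pass a separator to get_text()) can be tried offline. This is only a sketch: the class names come from the question, but the ad titles and the second ad's price text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical ad markup: class names are from the question,
# the titles and the "1000" price are invented placeholders.
html = """
<div class="kt-post-card__body"><h2>Ad one</h2><div class="kt-post-card__description">توافقی</div></div>
<div class="kt-post-card__body"><h2>Ad two</h2><div class="kt-post-card__description">1000</div></div>
"""
KEYWORDS = ("توافقی", "توافق")

soup = BeautifulSoup(html, "html.parser")
matching_titles = []
for card in soup.find_all("div", class_="kt-post-card__body"):
    # The separator keeps the text of child elements apart,
    # so the title and the description do not run together.
    words = card.get_text(" ", strip=True).split()
    # Membership test against the list of words, not equality with the list.
    if any(keyword in words for keyword in KEYWORDS):
        matching_titles.append(card.h2.get_text(strip=True))

print(matching_titles)
```

Splitting into words makes the match exact per word; testing against the unsplit string instead would also match longer words that merely contain a keyword.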

How do I extract just the blog content and exclude other elements using Beautiful Soup

I am trying to get the blog content from this blog post and by content, I just mean the first six paragraphs. This is what I've come up with so far:
soup = BeautifulSoup(url, 'lxml')
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.
Try this:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())
Use this:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
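If only the first six paragraphs are wanted, one possible sketch is to combine the id-prefix selector above with find_all and its limit argument. The markup here is hypothetical, and it assumes the post body uses <p> tags, which a real Blogger page may not do.

```python
from bs4 import BeautifulSoup

# Hypothetical post markup; a real page may use <br> or <div> instead of <p>.
html = """
<div class="post-body" id="post-body-123">
  <p>One</p><p>Two</p><p>Three</p><p>Four</p>
  <p>Five</p><p>Six</p><p>Seven</p>
  <div class="share-buttons">Share</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match any div whose id starts with "post-body-", whatever the numeric suffix.
body = soup.select_one("div[id^='post-body-']")

# limit=6 stops the search after the first six <p> tags.
first_six = [p.get_text(strip=True) for p in body.find_all("p", limit=6)]
print(first_six)
```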
I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python
However, I haven't found any query string parameters to work with, so maybe you can start from this approach instead.
The most obvious thing to do right now is something like this:
1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. http://www.fashionpulis.com/2017/03/ and so on).
2. Build the URLs using the titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html).
3. Scrape the text as described by Shahin in a previous answer.
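The URL-building part of the steps above can be sketched with two small helpers. The function names are hypothetical, and the sketch assumes that $TITLE in the URL pattern is the hyphenated slug seen in the example URLs and that months are zero-padded to two digits.

```python
BASE_URL = "http://www.fashionpulis.com"


def archive_url(year, month):
    """Build the monthly archive URL, e.g. .../2017/03/ (hypothetical helper)."""
    return f"{BASE_URL}/{year}/{month:02d}/"


def post_url(year, month, slug):
    """Build a post URL from its year, month and title slug (hypothetical helper)."""
    return f"{BASE_URL}/{year}/{month:02d}/{slug}.html"


print(archive_url(2017, 3))
print(post_url(2017, 8, "being-proud-too-soon"))
```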

Extracting 'a' tags containing specific substring with Python's BeautifulSoup

Using BeautifulSoup, I would like to return only "a" tags containing "Company" and not "Sector" in their href string. Is there a way to use regex inside of re.compile() to return only Companies and not Sectors?
Code:
soup = soup.findAll('tr')[5].findAll('a')
print(soup)
Output
[<a class="example" href="../ref/index.htm">Example</a>,
Facebook,
Exxon,
Technology,
Oil & Gas]
Using this method:
import re
soup.findAll('a', re.compile("Company"))
Returns:
AttributeError: 'ResultSet' object has no attribute 'findAll'
But I would like it to return (without the Sectors):
[Facebook,
Exxon]
Using:
Urllib.request version: 3.5
BeautifulSoup version: 4.4.1
Pandas version: 0.17.1
Python 3
Assigning soup = soup.findAll('tr')[5].findAll('a') and then calling soup.findAll('a', re.compile("Company")) fails because the first statement overwrites the original soup variable. findAll returns a ResultSet, which is basically a list of Tag objects and has no findAll method of its own. Try the following to get all of the "Company" links instead:
links = soup.findAll('tr')[5].findAll('a', href=re.compile("Company"))
To get the text contained in these tags:
companies = [link.text for link in links]
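A self-contained sketch of the href filter, using made-up markup. The href values (?Company=..., ?Sector=...) are assumptions based on the selectors discussed in this thread, not the real page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical row: the href values are invented for illustration.
html = """
<table><tr><td>
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=Tech">Technology</a>
<a href="?Sector=Oil">Oil &amp; Gas</a>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")  # keep the soup variable intact; use a new name for the row

# href=re.compile(...) keeps only the links whose href matches the pattern.
links = row.find_all("a", href=re.compile("Company"))
companies = [link.text for link in links]
print(companies)
```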
Another approach is xpath, which supports AND/NOT operations for querying by attributes in an XML document. Unfortunately, BeautifulSoup doesn't handle xpath itself, but lxml can:
from lxml.html import fromstring
import requests
r = requests.get("YourUrl")
tree = fromstring(r.text)
# Get elements with "?Company" in the href, excluding those that contain "Sector"
a_tags = tree.xpath("//a[contains(@href,'?Company') and not(contains(@href, 'Sector'))]")
You can use a CSS selector to get all the a tags whose href starts with ?Company (the attribute value is quoted because it contains a ?):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
a = soup.select('a[href^="?Company"]')
If you want them just from the sixth tr, you can use nth-of-type:
a = soup.select('tr:nth-of-type(6) a[href^="?Company"]')
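A small offline sketch of the attribute-prefix selector combined with nth-of-type. The table is hypothetical and has only two rows, so the example targets the second row rather than the sixth.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; the href values are invented for illustration.
html = """
<table>
  <tr><td><a href="?Company=AAA">Ignored</a></td></tr>
  <tr><td><a href="?Company=FB">Facebook</a><a href="?Sector=Tech">Technology</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Quote the attribute value: it contains "?", which is not valid unquoted.
second_row_links = soup.select('tr:nth-of-type(2) a[href^="?Company"]')
names = [a.get_text() for a in second_row_links]
print(names)
```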
Thanks for the above answers, @Padriac Cunningham and @Wyatt I! This is a less elegant solution I came up with:
import re
for i in range(1, len(soup)):
    if re.search("Company", str(soup[i])):
        print(soup[i])

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML code has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container for the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the webpage; instead, the nbaLiveStatTxSm tag is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: clarification: the problem isn't retrieving the text within the tag, the problem is that when I take the html code from the website and print it out in python, something that I saw when I inspected the element on the web is not there in the print statement in Python.
You can use this logic to extract any text.
This code allows you to extract any data between any tags.
Output - FINAL
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
    p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
    print(p_text.getText())
    break
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
Which is no surprise considering that Beautiful Soup was able to find the div itself. However when we look a little deeper into what urllib is actually collecting we can see that the <p class="nbaFnlStatTxSm"> is empty by running
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
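The check described above can be packaged as a small helper that tells you whether a tag already contains text in the raw HTML, before blaming the parser. The helper name is hypothetical, and the sample string is an abbreviated copy of the output shown above. One likely explanation, which is an assumption here, is that the browser fills in the tag with JavaScript after the page loads.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the markup from the output shown above.
static_html = (
    '<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p>'
    '<p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p>'
    '<p class="nbaFnlStatTxSm"></p></div>'
)


def has_static_text(html, tag_name, class_name):
    """Return True if the first matching tag exists and holds text in the raw HTML."""
    tag = BeautifulSoup(html, "html.parser").find(tag_name, class_=class_name)
    return tag is not None and bool(tag.get_text(strip=True))


print(has_static_text(static_html, "p", "nbaFnlStatTx"))    # tag with text
print(has_static_text(static_html, "p", "nbaFnlStatTxSm"))  # tag present but empty
```

If the helper reports an empty tag, the text seen in the browser's inspector was not part of the downloaded HTML, so no parser setting will recover it.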
Changed the import of BeautifulSoup to the proper syntax for the current version of BeautifulSoup.
Corrected the way you were constructing the BeautifulSoup object.
Fixed your find statement, then used the .text attribute to get the string representation of the text in the HTML you're after.
With some minor modifications to your code, as listed above, your code runs for me.
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text

Removing span tags in python

I'm a newbie having trouble removing span tags after using BeautifulSoup to grab the HTML from a page. I tried using del links['span'], but it returned the same results. A few attempts at using getText() failed as well. Clearly I'm doing something wrong that should be very easy. Help?
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
for links in soup.find_all("span", text=re.compile(".com")):
    del links['class']
print(links.)
Use the .unwrap() method to remove tags, preserving their contents:
for links in soup.find_all("span", text=re.compile(".com")):
    links.unwrap()
print(soup)
Depending on what you are trying to do, you could either use unwrap to remove the tag (in effect, replacing the element with its contents) or decompose to remove the element together with its contents.
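The difference between unwrap and decompose can be seen on a one-line document. This is a sketch with made-up markup; it uses the string= argument (the newer name for text=) and escapes the dot in the pattern so that it matches a literal ".com".

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = '<p>Visit <span class="site">example.com</span> today</p>'
pattern = re.compile(r"\.com")  # escaped dot: match a literal ".com"

# unwrap(): the <span> tag is removed, its text stays in place.
unwrapped = BeautifulSoup(html, "html.parser")
for span in unwrapped.find_all("span", string=pattern):
    span.unwrap()

# decompose(): the <span> tag is removed together with its text.
removed = BeautifulSoup(html, "html.parser")
for span in removed.find_all("span", string=pattern):
    span.decompose()

print(unwrapped)
print(removed)
```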
