How to get rid of whitespace above text, using bs4 - python

Ok, so I'm using bs4 (BeautifulSoup) to parse through a website and find the specific titles I am looking for. My code looks like this:
import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)
for i in soup.find_all(class_='article-short'):
    if i.a:
        print(i.a.text.replace('\n', '').strip())
    else:
        print(i.contents[0].strip())
This code works, but the output starts with about 20 blank lines before the requested titles from the website are printed. Is there something wrong with my code, or is there something I can do to get rid of the whitespace?

Because you have elements like this:
<article class="article-short">
<div class="thumb"><img alt="FILE: Boys who have undergone a circumcision ceremony walk near Qunu in the Eastern Cape in 2013. Picture: AFP." height="147" src="http://ewn.co.za/cdn/-%2fmedia%2f3C37CB28056746CD95FC913757AAD41C.ashx%3fas%3d1%26h%3d147%26w%3d234%26crop%3d1;waeb9b8157b3e310df" width="234"/></div>
<h6 class="h6-mega">Contralesa against scrapping initiation due to cold weather</h6>
</article>
where the first link contains an image and no text.
You should probably look for h6 tags instead. So, something like this works:
import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)
for i in soup.find_all(class_='article-short'):
    title = (i.h6.text.replace('\n', '') if i.h6 else i.contents[0]).strip()
    if title:
        print(title)
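To see where the blank lines come from, here is a minimal sketch that parses a small inline document instead of the live site. The markup is a hypothetical stand-in modelled on the element quoted above, and the second article (with a link) is invented for illustration. When an article has no link, its first child is just the newline after the opening tag, so stripping it prints an empty line.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the element quoted above; the second
# article is an invented example of an entry whose title sits in a link.
html = """
<article class="article-short">
<div class="thumb"><img alt="thumbnail" src="thumb.jpg"/></div>
<h6 class="h6-mega">Contralesa against scrapping initiation due to cold weather</h6>
</article>
<article class="article-short">
<a href="/story">
Second headline
</a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# The first child of the first article is only whitespace.
first_child = soup.find(class_="article-short").contents[0]
print(repr(first_child))

titles = []
for article in soup.find_all(class_="article-short"):
    # Prefer the <h6> headline and fall back to the first link.
    node = article.h6 or article.a
    if node is None:
        continue
    title = node.get_text(strip=True)
    if title:  # skip entries that contain only whitespace
        titles.append(title)

print(titles)
```

Using get_text(strip=True) removes the surrounding newlines in one step, so the separate replace('\n', '') call is not needed in this variant.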

Related

How can I make BeautifulSoup find more than one tag?

I am testing a program that prints the content of a Wikipedia page to the prompt. I already get some output, but it's a bit messy, so I thought I would only take the content of the <p> and <b> tags, which are the two that Wikipedia uses to show the content. Here is my code:
import urllib.request
from bs4 import BeautifulSoup
URL = input("Enter the url (only wikipedia supported, default url https://?.wikipedia.org/wiki) : ")
page = urllib.request.urlopen(URL)
html_doc = page.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for x in soup.find_all('p').find_all('b'):
    print(x.string)
The question mark is there because Wikipedia puts the language code in that position, so it varies. As you can see, I chained one more .find_all because I didn't know how else to add the second tag. Sorry for my bad English and my bad code; I am not very familiar with this field. Thanks
BeautifulSoup.find_all returns a ResultSet which is essentially a list of elements. You need to iterate through that list for the second operation yourself.
import urllib.request
from bs4 import BeautifulSoup
URL = input("Enter the url (only wikipedia supported, default url https://?.wikipedia.org/wiki) : ")
page = urllib.request.urlopen(URL)
html_doc = page.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for elem in soup.find_all('p'):
    for x in elem.find_all('b'):
        print(x.string)
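As an alternative to the nested loop, a CSS descendant selector can express "every <b> inside a <p>" in one call, and passing a list to find_all matches either tag name. This is a small sketch on an invented inline document rather than a real Wikipedia page, so the markup is an assumption.

```python
from bs4 import BeautifulSoup

# Invented sample markup; a real Wikipedia page is far larger.
html = (
    "<p>The <b>first</b> term</p>"
    "<div><b>outside</b></div>"
    "<p><b>second</b> and <b>third</b></p>"
)
soup = BeautifulSoup(html, "html.parser")

# CSS descendant selector: every <b> nested anywhere inside a <p>.
bold_in_paragraphs = [b.get_text() for b in soup.select("p b")]
print(bold_in_paragraphs)

# A list of names matches either tag, in document order.
names = [tag.name for tag in soup.find_all(["p", "b"])]
print(names)
```

The first form skips the <b> that sits outside any paragraph; the second returns both kinds of tag, including that one.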

Python beautifulsoup search issue

I'm having issues having bs find this text. I think it's because the text on the page has extra quotes around it. I was told it's because the class is actually blank. If that's the case, then any suggestions on how I can build my search?
Actual text on website: <span class="" data-product-price="">
My code (I've tried several variations): soup.find_all('span',{'class' : '" data-product-price="'})
I've also tried just doing a regular search, but I'm not doing that correctly. Any suggestions or should I use something other than bs?
Edited to include full code:
import bs4
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.gouletpens.com/products/twsbi-diamond-580-fountain-pen-clear?variant=11884892028971')
soup = bs4.BeautifulSoup(r.text, features="html.parser")
print(soup)
#soup.find_all('span',{'class' : '" data-product-price="'})
#soup.find_all('span',{'class' : 'data-product-price'})[0].text
After looking at the URL, you can select the price with a CSS selector:
import requests
from bs4 import BeautifulSoup
url = 'https://www.gouletpens.com/products/twsbi-diamond-580-fountain-pen-clear?variant=11884892028971'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.select_one('span[data-product-price]').get_text(strip=True))
Prints:
$50.00
Or, with the bs4 API (pass {'data-product-price': True} to match tags that have this attribute, regardless of its value):
print(soup.find('span', {'data-product-price':True}).get_text(strip=True))
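The same idea can be checked offline. The sketch below uses an invented fragment built around the span from the question; the surrounding div and the second span are assumptions. It shows that data-product-price is an attribute in its own right, not part of the class value, which is why the class-based search finds nothing.

```python
from bs4 import BeautifulSoup

# Invented fragment built around the <span> quoted in the question.
html = (
    '<div><span class="" data-product-price="">$50.00</span>'
    '<span class="label">Price</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

# The search from the question: no class has this value, so nothing matches.
by_class = soup.find_all("span", {"class": '" data-product-price="'})
print(len(by_class))

# Match on the presence of the attribute, whatever its value is.
span = soup.find("span", attrs={"data-product-price": True})
price = span.get_text(strip=True)
print(price)

# Equivalent CSS attribute selector.
by_css = soup.select_one("span[data-product-price]")
print(by_css.get_text(strip=True))
```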

Can't get item with python beautifulsoup

I'm trying to learn how to webscrape with beautifulsoup + python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/ but I can't figure out how to isolate the text. The html for what I want is written as below, what I want to output is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
Within my code I've done soup.find(text="Cinematography"), and a mixture of different things like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS attribute substring selector (note that it needs select_one, because find does not accept CSS selectors):
soup.select_one('a[href*="cinematography"]').text
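If the link's href is not a reliable hook, another option is to start from the heading and walk to the block that follows it. This is a different technique from the answers above, sketched against the exact HTML fragment quoted in the question rather than the live page, so it assumes the heading and the div really are siblings there.

```python
from bs4 import BeautifulSoup

# The fragment quoted in the question.
html = """
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the label, climb to its <h3>, then take the next sibling <div>.
label = soup.find("span", string="Cinematography")
block = label.find_parent("h3").find_next_sibling("div", class_="text-sluglist")
cinematographer = block.get_text(strip=True)
print(cinematographer)
```

find_next_sibling skips the whitespace text node between the heading and the div, so no manual filtering is needed.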

How do I extract just the blog content and exclude other elements using Beautiful Soup

I am trying to get the blog content from this blog post and by content, I just mean the first six paragraphs. This is what I've come up with so far:
soup = BeautifulSoup(url, 'lxml')
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.
Try this:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())
Use this:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
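To keep only the first few paragraphs, as the question asks, the id-prefix selector can be combined with find_all and its limit argument. The sketch below runs on invented markup and assumes the post text is wrapped in <p> tags, which may not hold for the real blog; first_paragraphs is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup; the real post may not wrap its text in <p> tags.
html = """
<div class="post-body entry-content" id="post-body-604825342214355274">
<p>First paragraph.</p>
<p>Second paragraph.</p>
<div class="share-buttons">Share this</div>
<p>Third paragraph.</p>
</div>
<div id="sidebar"><p>Archive</p></div>
"""


def first_paragraphs(markup, count):
    """Return the text of the first `count` <p> tags inside the post body."""
    soup = BeautifulSoup(markup, "html.parser")
    # id^= matches any id that starts with the given prefix.
    body = soup.select_one("div[id^='post-body-']")
    if body is None:
        return []
    return [p.get_text(strip=True) for p in body.find_all("p", limit=count)]


paragraphs = first_paragraphs(html, 2)
print(paragraphs)
```

For the blog post in the question, count would be 6.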
I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python
However, I haven't found any query string parameters to work with, so maybe you can build something from this approach.
The most obvious thing to do right now is something like this:
1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
2. Build the URLs using the titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
3. Scrape the text as described by Shahin in a previous answer
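The URL-building step can be sketched as follows. build_post_url is a hypothetical helper, and the archive list is hard-coded from the two URLs that appear in this thread; in practice it would come from scraping the Blog Archive section, which is not shown here.

```python
def build_post_url(year, month, slug, base="http://www.fashionpulis.com"):
    """Fill in the $YEAR/$MONTH/$TITLE.html pattern for one post."""
    return f"{base}/{year:04d}/{month:02d}/{slug}.html"


# Hypothetical archive entries as (year, month, slug) tuples.
archive = [
    (2017, 8, "being-proud-too-soon"),
    (2017, 8, "acceptance-is-must"),
]

urls = [build_post_url(year, month, slug) for year, month, slug in archive]
for url in urls:
    print(url)
```

The month is zero-padded to two digits, matching the /2017/08/ form seen in the URLs above.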

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML code has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container on the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the webpage, and instead the nbaLiveStatTxSm class is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: clarification: the problem isn't retrieving the text within the tag, the problem is that when I take the html code from the website and print it out in python, something that I saw when I inspected the element on the web is not there in the print statement in Python.
You can use this logic to extract any text.
This code allows you to extract any data between any tags.
Output - FINAL
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
    p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
    print(p_text.getText())
    break
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands:
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
That is no surprise, considering that Beautiful Soup was able to find the div itself. However, when we look a little deeper into what urllib is actually collecting, we can see that the <p class="nbaFnlStatTxSm"> is empty by running:
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
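The same check can be done with Beautiful Soup on the downloaded markup. The sketch below parses the div exactly as it appears in the question's output and reports which <p> tags arrive empty. A common reason for a tag being empty in the raw HTML but filled in the browser inspector is that a script populates it after the page loads; that is an assumption about this page, not something confirmed here.

```python
from bs4 import BeautifulSoup

# Markup copied from the output shown in the question.
raw = (
    '<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p>'
    '<p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p>'
    '<p class="nbaFnlStatTxSm"></p></div>'
)
soup = BeautifulSoup(raw, "html.parser")
status = soup.find("div", class_="nbaModTopStatus")

# Map each paragraph's class name to the text it actually contains.
report = {p["class"][0]: p.get_text(strip=True) for p in status.find_all("p")}
empty = [name for name, text in report.items() if not text]

print(report)
print(empty)
```

If a tag you can see in the browser shows up in the empty list, the text was never in the response that urllib received, so no change to the parsing code will recover it.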
changed the import of beautifulsoup to the proper syntax for the current version of BeautifulSoup
corrected the way you were constructing the BeautifulSoup object
fixed your find statement, then used the .text attribute to get the text of the HTML element you're after.
With some minor modifications to your code as listed above, your code runs for me.
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text
