Python beautifulsoup search issue - python

I'm having trouble getting BeautifulSoup to find this text. I thought it was because the text on the page has extra quotes around it, but I was told it's because the class is actually blank. If that's the case, any suggestions on how I can build my search?
Actual text on website: <span class="" data-product-price="">
My code (I've tried several variations): soup.find_all('span',{'class' : '" data-product-price="'})
I've also tried just doing a regular search, but I'm not doing that correctly. Any suggestions or should I use something other than bs?
Edited to include full code:
import bs4
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.gouletpens.com/products/twsbi-diamond-580-fountain-pen-clear?variant=11884892028971')
soup = bs4.BeautifulSoup(r.text, features="html.parser")
print(soup)
#soup.find_all('span',{'class' : '" data-product-price="'})
#soup.find_all('span',{'class' : 'data-product-price'})[0].text

After looking at the page, you can select the price with a CSS selector:
import requests
from bs4 import BeautifulSoup
url = 'https://www.gouletpens.com/products/twsbi-diamond-580-fountain-pen-clear?variant=11884892028971'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.select_one('span[data-product-price]').get_text(strip=True))
Prints:
$50.00
Or, with the bs4 API, pass {'data-product-price': True} to match tags that have this attribute regardless of its value:
print(soup.find('span', {'data-product-price':True}).get_text(strip=True))
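To see why matching on the attribute works where matching on the class does not, here is a minimal offline sketch. The HTML string is a made-up fragment modeled on the span from the question (the real page markup may differ), and no network request is made.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the span from the question.
html = (
    '<div>'
    '<span class="" data-product-price="">$50.00</span>'
    '<span class="note">In stock</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# class="" carries no class names, so a class-based search cannot target the span.
by_class = soup.find_all("span", {"class": "data-product-price"})

# data-product-price is a separate attribute; match on its presence instead.
by_css = soup.select_one("span[data-product-price]")
by_api = soup.find("span", attrs={"data-product-price": True})

print(by_class)                     # []
print(by_css.get_text(strip=True))  # $50.00
print(by_api.get_text(strip=True))  # $50.00
```

Both lookups ignore the attribute's value, which matters here because the value is an empty string.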

Related

Browser code and beautifulsoup collection different

I'm trying to parse the soccer matches on the soccerstand front page, but it fails because the elements I get with BeautifulSoup are very different from what I see in the browser.
My code is simple at the moment:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('https://soccerstand.com/') as response:
    url_data = response.read()
soup = BeautifulSoup(url_data, 'html.parser')
print(soup.find_all('div.event__match'))
This failed. When I checked the soup variable, it turned out not to contain such divs at all, so what I get with BS differs from what I see when inspecting the page in the browser.
What's the reason for that? Is there any workaround?
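One thing worth checking independently of what the server returns: find_all treats its first argument as a tag name, not a CSS selector, so 'div.event__match' never matches even when such divs exist. A minimal offline sketch, using a made-up HTML fragment (not the real soccerstand markup):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page markup is not reproduced here.
html = '<div class="event__match">A - B</div><div class="other">x</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all looks for tags literally named "div.event__match": no results.
as_tag_name = soup.find_all("div.event__match")

# select() understands CSS selectors; find_all needs the class passed separately.
as_css = soup.select("div.event__match")
as_kwargs = soup.find_all("div", class_="event__match")

print(len(as_tag_name), len(as_css), len(as_kwargs))  # 0 1 1
```

If the divs are also missing from the downloaded HTML itself, as described above, the page may be building them with JavaScript after load. That is an assumption about the site, not something verified here, and this sketch does not address it.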

getting Video URL using Python Scripting

I am working with Beautiful Soup to extract URLs. I get every href value on the page, but I want only specific URLs.
Here is my code:
import requests
from bs4 import BeautifulSoup
page=requests.get("https://www.youtube.com/results?search_query=cooking")
soup = BeautifulSoup(page.content ,'html.parser')
for a_tag in soup.findAll("a"):
    if a_tag.has_attr("href"):
        print(a_tag['href'])
But I want only links like this:
/watch?v=nTe_44ao82w
A more minimal version of the first answer:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.youtube.com/results?search_query=cooking")
soup = BeautifulSoup(page.content, 'html.parser')
for a_tag in soup.findAll("a"):
    # .get() returns '' instead of raising KeyError when a tag has no href
    if 'watch' in a_tag.get('href', ''):
        print(a_tag['href'])
This checks whether the href attribute contains the string watch.
Hope this helps!
There doesn't really seem to be a good way to differentiate those a tags other than by the URL itself (they don't have any unique classes or anything) so I would probably just check if the URL contains "watch":
...
for a_tag in soup.findAll("a"):
    if a_tag.has_attr("href") and "watch" in a_tag["href"]:
        print(a_tag['href'])
Outputs
/watch?v=cbxe1ANrfDo
/watch?v=nTe_44ao82w
/watch?v=v1wIThmCams
/watch?v=FTociictyyE
/watch?v=dw2QHkAtB_Y
/watch?v=ej9UHVwlQqk
/watch?v=KGAj8IhnR3c
/watch?v=G8A73R_gZdM
/watch?v=XPQW_2YOmjY
/watch?v=J0pS2lhH0Vc
/watch?v=5aU5qrbCsF4
/watch?v=kvAJ_mc9NXs
/watch?v=kKiYVLIk_9s
/watch?v=G2jYIGdmC6I
/watch?v=jMW5ZDQviOA
/watch?v=iTmcGy9CWhE
/watch?v=66Ck_5SePZg
/watch?v=lyD9t3uhHio
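The same filter can also be pushed into the search itself, so that tags without an href are skipped automatically. A minimal offline sketch, assuming a made-up HTML fragment (the video IDs are taken from the output above, the rest is hypothetical):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: two video links, one other link, one anchor without href.
html = """
<a href="/watch?v=nTe_44ao82w">Video</a>
<a href="/channel/abc">Channel</a>
<a name="no-href">Anchor</a>
<a href="/watch?v=cbxe1ANrfDo">Another video</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex as the href filter only matches tags that have a matching href.
via_regex = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/watch\?v="))]

# Equivalent CSS attribute selector: href starts with "/watch?v=".
via_css = [a["href"] for a in soup.select('a[href^="/watch?v="]')]

print(via_regex)  # ['/watch?v=nTe_44ao82w', '/watch?v=cbxe1ANrfDo']
print(via_css)    # ['/watch?v=nTe_44ao82w', '/watch?v=cbxe1ANrfDo']
```

Anchoring the pattern at the start of the href is slightly stricter than a plain "watch" substring check, which would also match unrelated URLs that happen to contain that word.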

Can't get item with python beautifulsoup

I'm trying to learn web scraping with BeautifulSoup and Python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/ but I can't figure out how to isolate the text. The HTML for what I want is shown below; the output I want is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
Within my code I've done soup.find(text="Cinematography"), and a mixture of different things like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS attribute substring selector (find() does not accept CSS selectors, so use select_one()):
soup.select_one('a[href*="cinematography"]').text
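If you would rather start from the "Cinematography" label, as in the question's soup.find(text="Cinematography") attempt, you can locate the label and then step forward to the block that follows it. A minimal offline sketch, assuming a made-up fragment shaped like the HTML in the question (the link paths and the "Editor" section are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the structure quoted in the question.
html = """
<h3><span>Editor</span></h3>
<div class="text-sluglist"><p><a href="/editor/example/">Example Editor</a></p></div>
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
<a href="/cinematography/steven-poster/">Steven Poster</a>
</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the heading label, then the first sluglist div that comes after it.
label = soup.find("span", string="Cinematography")
block = label.find_next("div", class_="text-sluglist")
name = block.get_text(strip=True)

print(name)  # Steven Poster
```

This does not depend on the link URL containing "cinematography", only on the heading text and the order of the elements.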

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container for the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the webpage; instead the nbaLiveStatTxSm element is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: clarification: the problem isn't retrieving the text within the tag, the problem is that when I take the html code from the website and print it out in python, something that I saw when I inspected the element on the web is not there in the print statement in Python.
You can use this logic to extract the text of any tag. The element with class nbaFnlStatTx is a p, not a div, so find the containing div first and then search inside it.
Output: Final
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
# nbaModTopStatus is the div that wraps the status paragraphs
status_div = soup.find("div", {"class": "nbaModTopStatus"})
# nbaFnlStatTx is the <p> that holds the status text in the raw HTML
p_text = status_div.find("p", {"class": "nbaFnlStatTx"})
print(p_text.getText())
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
Which is no surprise considering that Beautiful Soup was able to find the div itself. However when we look a little deeper into what urllib is actually collecting we can see that the <p class="nbaFnlStatTxSm"> is empty by running
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
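The same check can be done with Beautiful Soup instead of string slicing. This is a minimal offline sketch in Python 3 that parses the status div copied from the raw response above and lists which paragraphs arrived empty; it makes no network request, so it only shows the technique.

```python
from bs4 import BeautifulSoup

# Status div copied from the raw response shown above.
raw = (
    '<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p>'
    '<p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p>'
    '<p class="nbaFnlStatTxSm"></p></div>'
)
soup = BeautifulSoup(raw, "html.parser")

# Map each paragraph's first class name to the text it actually contains.
status = {
    p["class"][0]: p.get_text(strip=True)
    for p in soup.select("div.nbaModTopStatus > p")
}
empty = [name for name, text in status.items() if not text]

print(status)
print(empty)  # ['nbaLiveStatTxSm', 'nbaFnlStatTxSm']
```

Any class that shows up in the empty list was already empty in the downloaded HTML, so whatever the browser displays there was filled in after the page loaded.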
I made three changes to your code:
Changed the import of BeautifulSoup to the proper syntax for the current version (bs4).
Corrected the way you were constructing the BeautifulSoup object.
Fixed your find call, then used the .text attribute to get the text of the HTML you're after.
With these minor modifications, your code runs for me.
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text

python with beautifulsoup - remove tags

I am writing a Python program to extract lyrics.
The code I use:
import urllib
from bs4 import BeautifulSoup
url = urllib.urlopen("http://www.lyricsnmusic.com/david-bowie/slip-away-lyrics/22143075")
soup = BeautifulSoup(url.read())
print soup.find('pre', itemprop='description')
The result gives me what I need, but with the surrounding tag included,
for example: <pre itemprop="description"> followed by the lyrics.
Does anyone know how to get only the lyrics?
The page puts the lyrics inside the pre tag.
Thanks in advance.
Use the text attribute of the node that you've found:
import urllib
from bs4 import BeautifulSoup
url = urllib.urlopen("http://www.lyricsnmusic.com/david-bowie/slip-away-lyrics/22143075")
soup = BeautifulSoup(url.read())
desc=soup.find('pre', itemprop='description')
print desc.text
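To make the difference concrete, here is a minimal offline sketch in Python 3 with bs4. The pre element and its placeholder text are made up for illustration and stand in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lyrics block; the text is a placeholder.
html = '<pre itemprop="description">First line\nSecond line</pre>'
soup = BeautifulSoup(html, "html.parser")

node = soup.find("pre", itemprop="description")

with_tag = str(node)         # what printing the node shows: tag plus contents
only_text = node.get_text()  # same result as node.text: contents only

print(with_tag)
print(only_text)
```

Printing the node serializes the whole element, while .text (or get_text()) returns only the strings inside it, with line breaks preserved.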
