Python web scraping - Why do I get this error?

I want to get text from a website using bs4 but I keep getting this error and I don't know why. This is the error: TypeError: slice indices must be integers or None or have an __index__ method.
This is my code:
from urllib.request import urlopen
import bs4
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
text = html.find("div", {"class":"gc-score__title"})  # the error is on this line
print(text)

On this line:
text = html.find("div", {"class":"gc-score__title"})
you are calling the str.find method, not the bs4.BeautifulSoup.find method. Because html is a plain string, the dict you pass as the second argument is treated as a slice start index, which is what triggers the TypeError.
So if you do
soup = bs4.BeautifulSoup(html, 'html.parser')
text = soup.find("div", {"class":"gc-score__title"})
print(text)
you will get rid of the error.
That said, the site renders its content with JavaScript, so even the corrected call will not find the element in the static HTML. You will need a tool like Selenium, which drives a real browser, to scrape this site.
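As a rough sketch of the Selenium route (assuming Selenium 4 with Chrome installed; you may also need an explicit wait for the scores to finish rendering):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371")
# page_source holds the DOM after JavaScript has run, unlike urlopen's raw response
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find("div", {"class": "gc-score__title"}))
driver.quit()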

First, if you want BeautifulSoup to parse the data, you need to ask it to do that.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
soup = BeautifulSoup(html_bytes, "html.parser")
Then you can use soup.find to find <div> tags:
text = soup.find("div", {"class":"gc-score__title"})
That will eliminate the error. You were calling str.find because html is a string, and to pick tags out you need to call the find method of a bs4.BeautifulSoup object.
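You can reproduce the original TypeError with plain strings, because str.find treats its second argument as a slice start index:
>>> "abc".find("b", {"class": "x"})
TypeError: slice indices must be integers or None or have an __index__ method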
But even with the error gone, that line won't do what you want: it will return None, because the data at that url does not contain the tag <div class="gc-score__title">.
Copy the contents of html_bytes to a text editor to confirm this.
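If you prefer to confirm it programmatically instead, a quick membership check on the decoded html string will print False for this page:
print("gc-score__title" in html)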

Related

Web crawling extracts same sentence twice

When I write the below code, as you can see in the picture, the result shows me the same sentence twice.
How can I solve this problem?
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.consumeraffairs.com/education/online-courses/coursera.html?page=2#scroll_to_reviews=true")
soup = BeautifulSoup(res.content, 'html.parser')
data = soup.find_all('p')
for item in data:
    print(item.string)
Changing your parser to lxml will fix this.
If you inspect the website, you will see there are empty <p> tags wrapped around the review <p> tags, and they confuse html.parser.
The snippet below switches the parser to lxml on the BeautifulSoup line, and then filters the empty <p> tags out of the data list by checking whether their string attribute has any content.
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.consumeraffairs.com/education/online-courses/coursera.html?page=2#scroll_to_reviews=true")
soup = BeautifulSoup(res.content, "lxml")
data = soup.find_all('p')
for item in data:
    if item.string:
        print(item.string)
This will print each review only once!
If you get an error about not having a builder, run
pip install lxml

How can I make BeautifulSoup find more than one tag?

I am testing a program that prints the content of a Wikipedia page to the prompt. I already got some output, but it's a bit messy, so I thought I would only get the content of the <p> and <b> tags, which are the two Wikipedia uses to show the content. Here is my code:
import urllib.request
from bs4 import BeautifulSoup
URL = input("Enter the url (only wikipedia supported, default url https://?.wikipedia.org/wiki) : ")
page = urllib.request.urlopen(URL)
html_doc = page.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for x in soup.find_all('p').find_all('b'):
    print(x.string)
The question mark is there because Wikipedia puts the language code in that part of the URL, so it depends. As you can see, I chained one more .find_all because I didn't know how else to add it. Sorry for my bad English and my bad code; I am not very familiar with this field. Thanks
BeautifulSoup.find_all returns a ResultSet, which is essentially a list of elements, so it has no find_all method of its own. You need to iterate through that list and run the second search on each element yourself.
import urllib.request
from bs4 import BeautifulSoup
URL = input("Enter the url (only wikipedia supported, default url https://?.wikipedia.org/wiki) : ")
page = urllib.request.urlopen(URL)
html_doc = page.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for elem in soup.find_all('p'):
    for x in elem.find_all('b'):
        print(x.string)
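For what it's worth, the nested loops can also be collapsed into a single CSS selector query, which bs4 supports via select:
for x in soup.select('p b'):
    print(x.string)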

Function from list of parameters

Can you please help me with my Python code? I want to parse several homepages, given in the list html, with Beautiful Soup using the function stars.
html=["https://www.onvista.de/aktien/fundamental/Adidas-Aktie-DE000A1EWWW0", "https://www.onvista.de/aktien/fundamental/Allianz-Aktie-DE0008404005", "https://www.onvista.de/aktien/fundamental/BASF-Aktie-DE000BASF111"]
def stars(html):
    bsObj = BeautifulSoup(html.read())
    starbewertung = bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16]
    str_cells = str(starbewertung)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    print(cleantext)
    lst = []
    lst.append(cleantext)

stars(html)
Instead I am getting an error "AttributeError: 'list' object has no attribute 'read'"
As some of the comments mentioned, you need to use the requests library to actually fetch the content of each link in your list.
import requests
from bs4 import BeautifulSoup
html=["https://www.onvista.de/aktien/fundamental/Adidas-Aktie-DE000A1EWWW0", "https://www.onvista.de/aktien/fundamental/Allianz-Aktie-DE0008404005", "https://www.onvista.de/aktien/fundamental/BASF-Aktie-DE000BASF111"]
def stars(html):
    for url in html:
        resp = requests.get(url)
        bsObj = BeautifulSoup(resp.content, 'html.parser')
        print(bsObj)  # Should print the entire html document.
        # Do other stuff with bsObj here.

stars(html)
The IndexError from bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16] is something you'll need to figure out yourself.
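If you want a starting point, one simple option is to catch the failure per page inside the loop above (a sketch reusing the imports from the previous snippet; the chained indices come from your original code and may simply not exist in the current markup):
def stars(html):
    for url in html:
        resp = requests.get(url)
        bsObj = BeautifulSoup(resp.content, 'html.parser')
        try:
            # The original chained lookup, guarded against missing elements
            span = bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16]
            print(span.get_text())
        except IndexError:
            print("Expected markup not found on", url)

stars(html)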
You have a few errors here.
You are trying to load the whole list of pages into BeautifulSoup; you should process the pages one by one.
You should fetch the source code of each page before processing it.
There is no "section" element on the page you are loading, so you will get an exception when you try to take the element at index 8. You might need to check whether you found anything first.
import requests
from bs4 import BeautifulSoup

def stars(html):
    request = requests.get(html)
    if request.status_code != 200:
        return
    page_content = request.content
    bsObj = BeautifulSoup(page_content, "html.parser")
    starbewertung = bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16]
    str_cells = str(starbewertung)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    print(cleantext)

for page in html:
    stars(page)

BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class="ff_line" id="gi_344258949_1"> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find_all('div', attrs={'class': 'seq gbff'})
for each in div.children:
    print(each)
soup.find_all('span', attrs={'class': 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
Using DevTools in Chrome/Firefox, I found the following url, and it contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this url in the HTML, because different pages will use different arguments in the url. Alternatively, you can compare a few urls, work out the schema, and generate the url manually.
EDIT: if in the url you change retmode=html to retmode=xml, then you get the data as XML. If you use retmode=text, then you get it as plain text without HTML tags. retmode=json doesn't work.
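For example, a minimal fetch of the text variant of the url above (identical parameters, only retmode swapped to text) could look like this:
import requests

url = ("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
       "?id=344258949&db=protein&report=fasta&extrafeat=0"
       "&fmt_mask=0&retmode=text&withmarkup=on&tool=portal"
       "&log$=seqview&maxdownloadsize=1000000")
r = requests.get(url)
print(r.text)  # the plain sequence text, no HTML tags to parse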

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML code has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container for the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the webpage; instead, the nbaLiveStatTxSm class is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: to clarify, the problem isn't retrieving the text within the tag; the problem is that when I fetch the HTML from the website and print it out in Python, something I saw when I inspected the element in the browser is not there in the printed output.
You can use this logic to extract the text; the code below extracts the data between the tags you're after.
Output: FINAL
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
    p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
    print(p_text.getText())
    break
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
That is no surprise, considering that Beautiful Soup was able to find the div itself. However, when we look a little deeper into what urllib is actually collecting, we can see that <p class="nbaFnlStatTxSm"> is empty by running
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
I changed the import of BeautifulSoup to the proper syntax for the current version of the library, corrected the way you were constructing the BeautifulSoup object, and fixed your find statement, then used the .text attribute to get the string representation of the text in the HTML you're after.
With those minor modifications, your code runs for me.
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text  # prints "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text
