BeautifulSoup webscraper issue: can't find certain divs/tables - python

I'm having issues with scraping pro-football-reference.com. I'm trying to access the "Team Offense" table but can't seem to target the div/table.
The best I can do is:
soup.find('div', {'id':'all_team_stats'})
which returns neither the table nor its immediate div wrapper. The following attempts return None:
soup.find('div', {'id':'div_team_stats'})
soup.find('table', {'id':'team_stats'})
I've already scraped different pages simply by:
soup.find('table', {'id':'table_id'})
but I can't figure out why it's not working on this page. Below is the code I've been working with. Any help is much appreciated!
from bs4 import BeautifulSoup
import urllib2

def make_soup(url):
    page = urllib2.urlopen(url)
    soupdata = BeautifulSoup(page, 'lxml')
    return soupdata

def get_player_totals():
    soup = make_soup("http://www.pro-football-reference.com/years/2015/")
    tableStats = soup.find('table', {'id':'team_stats'})
    return tableStats

print get_player_totals()
EDIT:
Thanks for all the help everyone. Both of the provided solutions below have been successful. Much appreciated!

Just remove the comments with re.sub before you pass to bs4:
from bs4 import BeautifulSoup
import urllib2
import re

# strip the HTML comment markers so the hidden table becomes part of the parsed DOM
comm = re.compile("<!--|-->")

def make_soup(url):
    page = urllib2.urlopen(url)
    soupdata = BeautifulSoup(comm.sub("", page.read()), 'lxml')
    return soupdata

def get_player_totals():
    soup = make_soup("http://www.pro-football-reference.com/years/2015/")
    tableStats = soup.find('table', {'id':'team_stats'})
    return tableStats

print get_player_totals()
You will see the table when you run the code.

A couple of thoughts here: first, as mentioned in the comments, the actual table is commented out and is not, per se, part of the DOM (so it is not accessible to a parser out of the box).
In this situation, you can loop over the comments found and try to get the information via regular expressions (though this is heavily discussed and mostly disliked on Stack Overflow; see here for more information). Last, but not least, I would recommend requests rather than urllib2.
That being said, here is a working code example:
from bs4 import BeautifulSoup, Comment
import requests, re

def make_soup(url):
    r = requests.get(url)
    soupdata = BeautifulSoup(r.text, 'lxml')
    return soupdata

soup = make_soup("http://www.pro-football-reference.com/years/2015/")

# get the comments
comments = soup.findAll(text=lambda text: isinstance(text, Comment))

# look for a table with the id "team_stats"
rx = re.compile(r'<table.+?id="team_stats".+?>[\s\S]+?</table>')

for comment in comments:
    try:
        table = rx.search(comment.string).group(0)
        print(table)
        # break the loop if found
        break
    except:
        pass
Always favour a parser solution over this one; the regex part in particular is very error-prone.
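A parser-based alternative, for instance, is to feed each comment's contents back into BeautifulSoup and use find there instead of a regex; a minimal sketch of that idea, reusing the make_soup helper from above:
from bs4 import BeautifulSoup, Comment

soup = make_soup("http://www.pro-football-reference.com/years/2015/")
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    # parse the comment's contents as HTML and look for the table in there
    comment_soup = BeautifulSoup(comment, 'lxml')
    table = comment_soup.find('table', {'id': 'team_stats'})
    if table is not None:
        print(table)
        break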

Related

Why does the BeautifulSoup select() method return empty list?

import requests
from bs4 import BeautifulSoup
response = requests.get('https://stackoverflow.com/questions')
soup = BeautifulSoup(response.text, 'html.parser')
questions = soup.select('.question-summary')
print(questions)
This returns:
[]
According to the information in the Python course at https://codewithmosh.com/courses/ that I paid for, this should not have happened.
Why does this code return []?
Your code returns [] because there is no element with the class .question-summary on the page.
You should always inspect the website first. As FLAK-ZOSO said, I also couldn't find any question-summary class in the HTML.
You can get the question titles using
question_titles = soup.select('.s-post-summary--content-title')
and question summaries by
question_summaries = soup.select('.s-post-summary--content-excerpt')
Example:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://stackoverflow.com/questions')
soup = BeautifulSoup(response.text, 'html.parser')
question_summaries = soup.select('.s-post-summary--content-excerpt')
print(question_summaries[0].text.strip())
I inspected the source code of https://stackoverflow.com/questions and noticed there's no question-summary class used.
But there are s-post-summary and s-post-summary--stats-item classes; have you tried them?
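For example, a minimal sketch using those classes (the class names are whatever Stack Overflow currently ships, so they may change again):
import requests
from bs4 import BeautifulSoup

response = requests.get('https://stackoverflow.com/questions')
soup = BeautifulSoup(response.text, 'html.parser')

# each question is wrapped in an element with the s-post-summary class
for post in soup.select('.s-post-summary')[:5]:
    title = post.select_one('.s-post-summary--content-title')
    if title is not None:
        print(title.text.strip())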

How to select all links of apps from the App Store and extract their hrefs?

from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
url = f'https://www.apple.com/kr/search/youtube?src=globalnav'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = soup.select(".rf-serp-productname-list")
print(links)
I want to crawl through the links of all the apps shown. When I searched for a keyword, I thought links = soup.select(".rf-serp-productname-list") would work, but the links list is empty.
What should I do?
Just check this code; I think it's what you want:
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
Source:
https://gist.github.com/AO8/f721b6736c8a4805e99e377e72d3edbf
To check for a keyword, I think you can change this part (see the sketch below):
for link in soup.find_all("a", href=pattern):
    # do something
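For example, a rough sketch of that loop with a keyword filter (the keyword here is a placeholder you would set yourself):
keyword = "youtube"  # placeholder: the keyword you want to filter on
for link in soup.find_all("a", href=pattern):
    href = link.attrs.get("href", "")
    if keyword in href and href not in pages:
        print(href)
        pages.add(href)
        get_links(href)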
You are cooking a soup, so first of all taste it and check whether everything you expect is actually contained in it.
The ResultSet of your selection is empty because the structure in the response differs a bit from the one you expected from the developer tools.
To get the list of links, select more specifically:
links = [a.get('href') for a in soup.select('a.icon')]
Output:
['https://apps.apple.com/kr/app/youtube/id544007664', 'https://apps.apple.com/kr/app/%EC%BF%A0%ED%8C%A1%ED%94%8C%EB%A0%88%EC%9D%B4/id1536885649', 'https://apps.apple.com/kr/app/youtube-music/id1017492454', 'https://apps.apple.com/kr/app/instagram/id389801252', 'https://apps.apple.com/kr/app/youtube-kids/id936971630', 'https://apps.apple.com/kr/app/youtube-studio/id888530356', 'https://apps.apple.com/kr/app/google-chrome/id535886823', 'https://apps.apple.com/kr/app/tiktok-%ED%8B%B1%ED%86%A1/id1235601864', 'https://apps.apple.com/kr/app/google/id284815942']
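For completeness, a minimal sketch of the selector dropped into the original script (assuming Apple keeps the a.icon markup):
import requests
from bs4 import BeautifulSoup

url = 'https://www.apple.com/kr/search/youtube?src=globalnav'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# each app result is an <a> element with the icon class
links = [a.get('href') for a in soup.select('a.icon')]
print(links)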

beautifulsoup returns none when using find() for an element

I'm trying to scrape this site to retrieve the years of each paper that's been published. I've managed to get the titles to work, but when it comes to scraping the years it returns None.
I've broken it down and the results of None occur when it goes into the for loop, but I can't figure out why this happens when it worked with the titles.
import requests
from bs4 import BeautifulSoup

URL = "https://dblp.org/search?q=web%20scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")

for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)
When it gets to paperResults it does give the breakdown of the section I've selected within the results from the line above.
Any suggestions on how to retrieve the years would be greatly appreciated
Change this
for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)
To this
for singlepaper in paperResults:
    paperyear = singlepaper.find('span', itemprop="datePublished")
    print(paperyear.string)
You were looking for a class when you needed to match an attribute: if you print paperResults you will see that datePublished is an itemprop attribute on a span element, not a class.
Try this:
import requests
from bs4 import BeautifulSoup

URL = "https://dblp.org/search?q=web%20scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")

for singlepaper in paperResults:
    # match on the itemprop attribute instead of a class
    paperyear = singlepaper.find(attrs={"itemprop": "datePublished"})
    print(paperyear)
It worked for me.

Simple if statement not accurately checking if two values are the same

I am trying to web scrape some article titles from a website. I do not want to include "Notes from the Editors" when I run my program, but for some reason the super simple if statement on the last two lines isn't working, and it still prints out "Notes from the Editors". What's wrong?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.cambridge.org/core/journals/american-political-science-review/issue/4061249B1054342207CEF9C50AEC68C5")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.findAll('a', class_='part-link')
for result in results:
    if result.text != 'Notes from the Editors':
        print(result.text)
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.cambridge.org/core/journals/american-political-science-review/issue/4061249B1054342207CEF9C50AEC68C5")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.findAll('a', class_='part-link')
for result in results:
    if result.text != '\nNotes from the Editors\n':
        print(result.text)
It's because the string in your if statement doesn't exactly match the element's text: you are not accounting for the whitespace and newlines around it. The version above compares against the exact text, newlines included.
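A more robust alternative is to normalise the whitespace before comparing, so the check no longer depends on the exact newlines in the markup; a small sketch:
for result in results:
    # strip() removes the surrounding newlines/spaces before comparing
    text = result.text.strip()
    if text != 'Notes from the Editors':
        print(text)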

Follow link in forum to scrape thread (comments) using BS4

I have a forum with 3 threads. I am trying to scrape the data in all three posts, so I need to follow the href link to each post and scrape the data. This is giving me an error and I'm not sure what I am doing wrong...
import csv
import time
from bs4 import BeautifulSoup
import requests
source = requests.get('https://mainforum.com').text
soup = BeautifulSoup(source, 'lxml')
#get the thread href (thread_link)
for threads in soup.find_all('p', class_= 'small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')  # there are three threads and this gets all 3 links
    print(thread_link)
The rest of the code is where I am having the issue:
# request the individual thread links
for follow_link in thread_link:
    response = requests.get(follow_link)

    # parse thread link
    soup = BeautifulSoup(response, 'lxml')

    # print data
    for p in soup.find_all('p'):
        print(p)
As to your schema error...
You're getting the schema error because you are overwriting a single link over and over. Then you attempt to call that link as if it were a list of links. At that point it is a string, and you just iterate through its characters (starting with 'h'), hence the error.
See here: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied
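You can reproduce the problem in isolation: iterating over a string yields one character at a time, and requests then tries to use 'h' as a URL. A tiny sketch (the URL is made up):
thread_link = 'https://mainforum.com/thread-3'  # a single string, not a list of links
for follow_link in thread_link:
    print(follow_link)  # prints 'h', 't', 't', 'p', 's', ... one character per iteration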
As to the general query and how to solve something like this...
If I were to do this, the flow would go as follows:
Get the three hrefs (similar to what you've already done)
Use a function that scrapes the thread hrefs individually and returns whatever you want them to return
Save/append that returned information wherever you want.
Repeat
Something like this perhaps
import csv
import time
from bs4 import BeautifulSoup
import requests

source = requests.get('https://mainforum.com')
soup = BeautifulSoup(source.content, 'lxml')

all_thread_info = []

def scrape_thread_link(href):
    response = requests.get(href)

    # parse thread link
    soup = BeautifulSoup(response.content, 'lxml')

    # return data
    return [p.text for p in soup.find_all('p')]

# get the thread href (thread_link)
for threads in soup.find_all('p', class_='small'):
    this_thread_info = {}
    this_thread_info["thread_name"] = threads.text
    this_thread_info["thread_link"] = threads.a.get('href')
    this_thread_info["thread_data"] = scrape_thread_link(this_thread_info["thread_link"])
    all_thread_info.append(this_thread_info)

print(all_thread_info)
There's quite a lot left unspecified in the original question, so I made some assumptions. Hopefully you can see the gist, though.
Also note I prefer to use the .content of the response instead of .text.
@Darien Schettler I made some changes/adjustments to the code; I would love to hear if I messed up somewhere?
all_thread_info = []

def scrape_thread_link(href):
    response = requests.get(href)
    soup = BeautifulSoup(response.content, 'lxml')
    for Thread in soup.find_all(id='discussionReplies'):
        Thread_Name = Thread.find_all('div', class_='xg_user_generated')
        for Posts in Thread_Name:
            print(Posts.text)

for threads in soup.find_all('p', class_='small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')
    thread_data = scrape_thread_link(thread_link)
    all_thread_info.append(thread_data)
