Scraping PFR Football Data with Python for a Beginner - python

Background: I'm trying to scrape some tables from this Pro-Football-Reference page. I'm a complete newbie to Python, so a lot of the technical jargon ends up lost on me, and even after trying to understand the problem, I can't figure it out.
Specific issue: because there are multiple tables on the page, I can't figure out how to get Python to target the one I want. I'm trying to get the Defense & Fumbles table. The code below is what I've got so far; it's from this tutorial, which uses a page from the same site, but one that only has a single table.
sample code:
# imports used by the tutorial snippet (not shown in the original post)
from urllib.request import urlopen
from bs4 import BeautifulSoup

#url we are scraping
url = "https://www.pro-football-reference.com/teams/nwe/2017.htm"
#html from the given url
html = urlopen(url)
# make soup object of html
soup = BeautifulSoup(html)
# we see that soup is a BeautifulSoup object
type(soup)

column_headers = [th.getText() for th in
                  soup.findAll('table', {"id": "defense").findAll('th')]
column_headers #our column headers
Attempts made: I realized that the tutorial's method would not work for me, so I attempted to change the soup.findAll portion to target the specific table. But I repeatedly get an error saying:
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
When changing it to find, the error becomes:
AttributeError: 'NoneType' object has no attribute 'find'
I'll be absolutely honest that I have no idea what I'm doing or what these errors mean. I'd appreciate any help in figuring out how to target that data and then scrape it.
Thank you,

You're missing a "}" in the dict after the word "defense". Try the below and see if it works.
column_headers = [th.getText() for th in
                  soup.findAll('table', {"id": "defense"}).findAll('th')]

First off, you want to use soup.find('table', {"id": "defense"}).findAll('th') - find one table, then find all of its 'th' tags.
The other problem is that the table with id "defense" is commented out in the html on that page:
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
<table class="sortable stats_table" id="defense" data-cols-to-freeze=2><caption>Defense & Fumbles Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
etc. I assume that javascript is un-hiding it. BeautifulSoup doesn't parse the text of comments, so you'll need to find the text of all the comments on the page as in this answer, look for one with id="defense" in it, and then feed the text of that comment into BeautifulSoup.
Like this:
from bs4 import Comment
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
defenseComment = next(c for c in comments if 'id="defense"' in c)
defenseSoup = BeautifulSoup(str(defenseComment), 'html.parser')
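Putting the two fixes together (find one table, and pull it out of the comment), here is a minimal self-contained sketch; the stand-in HTML is simplified, so the real page will have many more headers:

```python
from bs4 import BeautifulSoup, Comment

# Simplified stand-in for the page: the defense table sits inside an HTML comment.
html = """
<div class="placeholder"></div>
<!--
<table id="defense"><thead><tr><th>No.</th><th>Player</th></tr></thead></table>
-->
"""
soup = BeautifulSoup(html, "html.parser")

# Comments are text nodes; keep the one that contains the defense table.
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
defenseComment = next(c for c in comments if 'id="defense"' in c)

# Re-parse the comment's text so BeautifulSoup sees the table markup.
defenseSoup = BeautifulSoup(str(defenseComment), "html.parser")
column_headers = [th.getText() for th in
                  defenseSoup.find("table", {"id": "defense"}).findAll("th")]
print(column_headers)  # ['No.', 'Player']
```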

Related

Extract List Values using Beautiful Soup in Python

Please help me extract the information from the list output below. I want to extract the count, i.e. "4", and the text "bds" from the abbr tag in the output:
[<ul class="list-card-details">
<li>4<abbr class="list-card-label"> <!-- -->bds</abbr></li>
<li>4<abbr class="list-card-label"> <!-- -->ba</abbr></li>
<li>2,482<abbr class="list-card-label"> <!-- -->sqft</abbr></li>
</ul>]
I got the above output by running the below code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
list_info = soup.find_all('div', class_='list-card-info')
house_one = list_info[0]
house_one_price = house_one.find('div', class_='list-card-price').text
house_one_bds_count = house_one.find_all('ul', class_='list-card-details')
print(house_one_bds_count)
#house_one_bds = house_one.li('abbr', class_='list-card-label').text -- working fine, so I commented it to incorporate it later into the code
#print(house_one_price) -- working fine, so I commented it to incorporate it later into the code
Also, can you please guide me on why I get the error "AttributeError: ResultSet object has no attribute 'content'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" when I add .text in print(house_one_bds_count.text)?
Will be very thankful. I am new to Stack Overflow, so apologies if the formatting is not correct. Thanks in advance.
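A minimal sketch of one way to pull the count and label apart, using a simplified stand-in for the ul above (the <!-- --> comments are dropped): since find_all returns a ResultSet (a list), iterate over it rather than calling .text on the whole thing.

```python
from bs4 import BeautifulSoup

# Stand-in for the ResultSet output shown in the question.
html = """
<ul class="list-card-details">
<li>4<abbr class="list-card-label"> bds</abbr></li>
<li>4<abbr class="list-card-label"> ba</abbr></li>
<li>2,482<abbr class="list-card-label"> sqft</abbr></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

details = []
for li in soup.find_all("li"):
    count = li.contents[0].strip()                # text node before the <abbr>, e.g. "4"
    label = li.find("abbr").get_text(strip=True)  # e.g. "bds"
    details.append((count, label))
print(details)  # [('4', 'bds'), ('4', 'ba'), ('2,482', 'sqft')]
```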

Extracting book title and author from website - Python 3

I am trying to create two Python lists - one of book titles, and one of the books' authors, from a publisher's "coming soon" website.
I have tried a similar approach on other publishers' sites with success, but on this site it does not seem to be working. I am new to parsing HTML, so I am obviously missing something; I just can't figure out what. The find_all function just returns an empty list, so my titles and authors lists are empty too.
For reference, this is what the html shows when I click "inspect" in my browser for the first title and author, respectively. I've looked through the BS4 documentation and still can't figure out what I'm doing wrong here.
<h3 class="sp__the-title">Flame</h3>
<p class="sp__the-author">Donna Grant</p>
Thanks for your help!
import requests
from bs4 import BeautifulSoup
page = 'https://us.macmillan.com/search?collection=coming-soon'
page_response = requests.get(page)
soup = BeautifulSoup(page_response.content, "html.parser")
titles = []
for tag in soup.find_all("h3", {"class": "sp__the-title"}):
    print(tag.text)
    titles.append(tag.text)
authors = []
for tag in soup.find_all("p", {"class": "sp__the-author"}):
    print(tag.text)
    authors.append(tag.text)

Getting "None" when parsing out data in Python, BS4

For a while I have been trying to make a Python program which can pull data from websites. I came across the bs4 library for Python and decided to use it for the job.
The problem is that I always get None as a result, which is something I cannot understand.
I want to get only one word, which is in an href located in a div class, and for that I wrote a function like this:
def run(self):
    response = requests.get(self.url)
    soup = BeautifulSoup(response.text, 'html.parser')
    finalW = soup.find('a', attrs={'class': 'target'})
    print(finalW)
With this code, I expect to get a word, but it just returns None.
It is highly possible, too, that I have made a mistake with the path to this element, so I am posting an image of the part I want to extract from the HTML:
When bs4 is not able to find the query, it returns None.
In your case the HTML is more or less like this:
...
<div class='target'>
<a href='neededlink'>...</a>
<a href='notneededlink'>...</a>
...
</div>
...
soup.find('a', attrs={'class': 'target'}) thus will not be able to match your query, as the class attribute is on the div, not on the a tags.
If you are certain that your link is the first one in the div, the query below will work:
soup.find('div', {'class': 'target'}).find('a')['href']
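On a stand-in for the markup sketched above (the link names are made up for illustration), that chain looks like:

```python
from bs4 import BeautifulSoup

# Stand-in markup: the class is on the div, not on the <a> tags.
html = """
<div class="target">
<a href="neededlink">needed</a>
<a href="notneededlink">not needed</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the div by class first, then take its first link's href.
first_href = soup.find("div", {"class": "target"}).find("a")["href"]
print(first_href)  # neededlink
```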

Python: BeautifulSoup - Get an attribute value from the name of a class

I am scraping items from a webpage (there are multiple of these):
<a class="iusc" style="height:160px;width:233px" m="{"cid":"T0QMbGSZ","purl":"http://www.tti.library.tcu.edu.tw/DERMATOLOGY/mm/mmsa04.htm","murl":"http://www.tti.lcu.edu.tw/mm/img0035.jpg","turl":"https://tse2.mm.bing.net/th?id=OIP.T0QMbGSZbOpkyXU4ms5SFwEsDI&pid=15.1","md5":"4f440c6c64996cea64c975389ace5217"}" mad="{"turl":"https://tse3.mm.bing.net/th?id=OIP.T0QMbGSZbOpkyXU4ms5EsDI&w=300&h=200&pid=1.1","maw":"300","mah":"200","mid":"C303D7F4BB661CA67E2CED4DB11E9154A0DD330B"}" href="/images/search?view=detailV2&ccid=T0QMbGSZ&id=C303D7F4BB661E2CED4DB11E9154A0DD330B&thid=OIP.T0QMbGSZbOpkyXU4ms5SFwEsDI&q=searchtearm;amp;simid=6080204499593&selectedIndex=162" h="ID=images.5978_5,5125.1" data-focevt="1"><div class="img_cont hoff"><img class="mimg" style="color: rgb(169, 88, 34);" height="160" width="233" src="https://tse3.mm.bing.net/th?id=OIP.T0QMbGSZ4ms5SFwEsDI&w=233&h=160&c=7&qlt=90&o=4&dpr=2&pid=1.7" alt="Image result fsdata-bm="169" /></div></a>
What I want to do is download the image and information associated with it in the m attribute.
To accomplish that, I tried something like this to get the attributes:
links = soup.find_all("a", class_="iusc")
And then, to get the m attribute, I tried something like this:
for a in soup.find_all("m"):
    test = a.text.replace('&quot;', '"')
    metadata = json.loads(test)["murl"]
    print(str(metadata))
However, that doesn't quite work as expected, and nothing is printed out (with no errors either).
You are not iterating through the links list. Try this.
links = soup.find_all("a", class_="iusc")
for link in links:
    print(link.get('m'))
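Combining that with json.loads to get the murl, sketched on a simplified stand-in anchor (the real m attribute carries many more keys):

```python
import json
from bs4 import BeautifulSoup

# Simplified stand-in for one result anchor with a JSON-valued m attribute.
html = '<a class="iusc" m=\'{"cid": "T0QMbGSZ", "murl": "http://example.com/img.jpg"}\'></a>'
soup = BeautifulSoup(html, "html.parser")

murls = []
for link in soup.find_all("a", class_="iusc"):
    # m is an attribute, not a tag, so read it with .get() and parse it as JSON.
    metadata = json.loads(link.get("m"))
    murls.append(metadata["murl"])
print(murls)  # ['http://example.com/img.jpg']
```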

problems scraping web page using python

Hi, I'm quite new to Python and my boss has asked me to scrape this data; however, it is not my strong point, so I was wondering how I would go about it.
The text that I'm after also changes inside the quote marks every few minutes, so I'm also not sure how to locate it.
I am using Beautiful Soup at the moment, and lxml; however, if there are better alternatives I'm happy to try them.
This is the inspected element of the webpage:
<div class="sometext">
<h3> somemoretext </h3>
<p>
<span class="title" title="text i want">text i want</span>
<br>
</p>
</div>
I have tried using:
from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('the url')
soup = BeautifulSoup(page.text)
r = soup.findAll('//span[@class="title"]/text()')
print(r)
Thank you in advance,any help would be appreciated!
First do this to get what you are looking at in the soup:
soup = BeautifulSoup(page.text)
print(soup)
That way you can double check that you are actually dealing will what you think you are dealing with.
Then do this:
r = soup.findAll('span', attrs={"class": "title"})
for span in r:
    print(span.text)
This will get all the span tags with a class=title, and then text will print out all the text in between the tags.
Edited to Add
Note that esecules' answer will get you the title within the tag (<span class = "title" title="text i want">) whereas mine will get the title from the text (<span class = "title" >text i want</span>)
perhaps find is the method you really need since you're only ever looking for one element. docs
r = soup.find('div', 'sometext').find('span','title')['title']
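Run against a stand-in for the inspected markup from the question, that find chain returns the title attribute:

```python
from bs4 import BeautifulSoup

# Stand-in for the inspected element shown in the question.
html = """
<div class="sometext">
<h3> somemoretext </h3>
<p><span class="title" title="text i want">text i want</span><br></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Scope to the enclosing div, then read the span's title attribute.
wanted = soup.find("div", "sometext").find("span", "title")["title"]
print(wanted)  # text i want
```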
if you're familiar with XPath and you don't need feature that specific to BeautifulSoup, then using lxml only is enough (or maybe even better since lxml is known to be faster) :
from lxml import html
import requests
page = requests.get('the url')
root = html.fromstring(page.text)
r = root.xpath('//span[@class="title"]/text()')
print(r)
