Web scrape not pulling back title correctly - python

I am trying to pull back only the Title from a source code online. My code is able to currently pull all the correct lines, but I cannot figure out how to make it only pull back the title.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests

URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('td')
    print(name.get_text('title'))
I expect it to only say
Nexus
Pylon
Gateway
Assimilator
etc.
but I get the error:
Traceback (most recent call last):
File "main.py", line 11, in <module>
print(name.get_text().strip())
AttributeError: 'NoneType' object has no attribute 'get_text'
I don't understand what I am doing wrong, since from what I have read this should pull back only the desired results.

Try the code below. Your first row has table headers (th) instead of table data, so link.find('td') returns None for it.
Add a check so a row is only processed when the span element is actually found, then read its title attribute, as below.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests

URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('span')
    if name is not None:
        # Process only if the element is available
        print(name['title'])

I think you should use something like:
for link in tb.find_all('tr'):
    for name in link.select('td[title]'):
        print(name['title'])
From what I can see, the string comes back empty because there is no <title> tag inside the cell; title is an attribute of the td tag, so you have to read it as an attribute rather than trying to get text from it.

bkyada's answer is perfect, but if you want another solution: in your for loop, instead of finding td, find the span inside each row and read its title attribute.
for link in tb.find_all('tr'):
    containers = link.find('span')
    if containers is not None:
        print(containers['title'])

It is more efficient to simply use the class name to identify the elements with a title attribute, as they all share one in the first column.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests
URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
titles = [i['title'] for i in tb.select('.blizzard_icons_single')]
print(titles)
titles = {i['title'] for i in tb.select('.blizzard_icons_single')} # set of unique titles
print(titles)
As the title attribute is limited to that column, you could also have used the (slightly less quick) attribute selector:
titles = {i['title'] for i in tb.select('[title]')} # set of unique titles

Related

How to print the first google search result link using bs4?

I'm a beginner in Python. I'm trying to get the first search result link from Google, which is stored inside a div with class='yuRUbf', using BeautifulSoup. When I run the script the output is 'None'; what is the error here?
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result=requests.get( url )
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
productDivs = soup.find("div", {"class": "yuRUbf"})
print(productDivs)
Let's see:
from bs4 import BeautifulSoup
import requests, json

headers = {
    'User-agent': "useragent"  # replace with a real browser user-agent string
}

html = requests.get('https://www.google.com/search?q=hello', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# locate the div element with the tF2Cxc class,
# then grab its <a> tag and read the 'href' attribute
link = soup.find('div', class_='tF2Cxc').a['href']
print(link)
output:
https://www.youtube.com/watch?v=YQHsXMglC9A
As you want the first Google search result, the class name you are looking for might differ from what you see in your browser, so first find that link manually in the fetched HTML so it is easy to identify.
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result=requests.get( url )
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
Using the select method:
I have used the CSS selector method, which identifies all matching divs; from the resulting list I slice from index position 1 onwards.
Then I use select_one to get the a tag and read its href accordingly.
main_data=soup.select("div.ZINbbc.xpd.O9g5cc.uUPGi")[1:]
main_data[0].select_one("a")['href'].replace("/url?q=","")
Using the find method:
main_data=soup.find_all("div",class_="ZINbbc xpd O9g5cc uUPGi")[1:]
main_data[0].find("a")['href'].replace("/url?q=","")
Output (same for both cases):
'https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup&sa=U&ved=2ahUKEwjGxv2wytXyAhUprZUCHR8mBNsQFnoECAkQAQ&usg=AOvVaw280R9Wlz2mUKHFYQUOFVv8'

How do you webscrape data from a "span" tag with a "data-reactid" using beautifulsoup in python?

I am trying to extract real-time price data of stocks from Yahoo Finance. This information is contained in a "span" tag with a "class" and "data-reactid". I am unable to extract the information out of this span tag.
When I enter my code, I don't get any output nor do I get any errors.
I have tried almost all the other answers to this question, but none have worked for me.
<!-- HTML Code -->
<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="34">197.00</span>
#Python Script
# (imports inferred from the usage of u_req and soup; the question did not show them)
from urllib.request import urlopen as u_req
from bs4 import BeautifulSoup as soup

my_url = "https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch"
u_client = u_req(my_url)
page_html = u_client.read()
u_client.close()
page_soup = soup(page_html, "html.parser")
container = page_soup.find('span', {"data-reactid": '34'})
I would like to get the output of "197.00" (real time price of the stock) as the output.
You can fetch that in a number of ways. Here is one of them:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://finance.yahoo.com/quote/AAPL')
soup = BeautifulSoup(res.text, 'lxml')
price = soup.select_one('#quote-market-notice').find_all_previous()[2].text
print(price)
Another way:
price = soup.select_one("[class*='smartphone_Mt'] span").text
print(price)
Somehow the data-reactid changes to 14 when the URL is fetched this way.
page_soup = soup(page_html, "html.parser")
container = page_soup.find('span', {"data-reactid": '14'})
if container:
    print(container.text)
Given that data-reactid can change, I would use a unique class to select on. Selecting by class is also faster.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://finance.yahoo.com/quote/AAPL/')
soup = bs(r.content, 'lxml')
print(soup.select_one('.Mb\(-4px\)').text)
I opened the URL in Chrome and pressed F12. Clicking on the Network tab revealed this query from the page: https://query1.finance.yahoo.com/v8/finance/chart/AAPL?region=US&lang=en-US&includePrePost=false&interval=2m&range=1d&corsDomain=finance.yahoo.com&.tsrc=finance
I would suggest exploring the underlying AJAX calls, which appear to return a nicely formatted JSON result, and the URL itself, which has a number of params you can modify.
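For example, here is a minimal sketch of querying that endpoint directly with requests; the User-agent header and the regularMarketPrice key are assumptions based on what this chart API typically returns, so inspect the actual JSON to confirm:
import requests

url = 'https://query1.finance.yahoo.com/v8/finance/chart/AAPL'
params = {'region': 'US', 'lang': 'en-US', 'interval': '2m', 'range': '1d'}
headers = {'User-agent': 'Mozilla/5.0'}  # Yahoo may reject the default requests user agent

data = requests.get(url, params=params, headers=headers).json()
# regularMarketPrice is an assumed key; check the JSON structure yourself
price = data['chart']['result'][0]['meta']['regularMarketPrice']
print(price)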

Python3: How to identify which 'text' element to scrape, and how to scrape URL via use of class

I'm currently scraping a website, getting the text of some details in the webpage which I have tried to identify via Google Chrome's 'inspect' button, and I was actually able to get the text I wanted from the normal text elements.
However, I have two questions:
1. I need to get the text that is associated with the proper div class. As you can see in the code below, I just entered 'h3', 'p', and 'abbr' and was able to get the text; however, these are not tied to a particular 'class'. I suppose find is just getting the first one it encounters, which is why on some webpages I run into the error below, because it is pointing at the wrong element.
Traceback (most recent call last):
File "C:\Users\admin\Desktop\FolderName\FileName.py", line 18, in <module>
name1 = info2_text.text
AttributeError: 'NoneType' object has no attribute 'text'
So I guess my real question #1 is: to avoid the above error caused by a misidentified 'p' paragraph, as in the example below, how can I write the code to identify the element by its 'class'? I already tried info2_text = soup.find('p', attrs={'class': '_5rgt _5nk5 _5msi'}), but I still get the above error.
<div class="_5rgt _5nk5 _5msi" style data-gt="{"tn":"*s"}" data-ft="{"tn":"*s"}"> == $0
<span>
<p>
"Sample paragraph"
</p>
2. How do I get the actual URL from the a href element? In the below example:
<div class="_52jc _5qc4 _78cz _24u0 _36xo" data-sigil="m-feed-voice-subtitle">
I have tried info4_url = soup.find('a', attrs={'class': '_4g34._5i2i._52we'}), however this line only prints 'None'. Or am I looking at the wrong div class?
Below is the actual code I am trying to use, and I want to make it as simple as possible. Thanks so much for your help!
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import re
import requests
# specify the url
url = 'https://sampleurl.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
info1_header = soup.find('h3')
info2_text = soup.find('p')
info3_text = soup.find('abbr')
info4_url = soup.find('a')
# Get the data by getting its text
name = info1_header.text
name1 = info2_text.text
name2 = info3_text.text
#print text
print(name)
print(name1)
print(name2)
print(info4_url)
Find the paragraph/anchor in the relevant divs only.
For the first question:
from bs4 import BeautifulSoup

html = '''<div class="_5rgt _5nk5 _5msi" style data-gt="{"tn":"*s"}" data-ft="{"tn":"*s"}"> == $0
<span>
<p>
"Sample paragraph"
</p>'''

soup = BeautifulSoup(html, 'html.parser')
parentDiv = soup.find_all("div", class_="_5rgt _5nk5 _5msi")
for elem in parentDiv:
    para = elem.find("p").text
    print(para.strip())
OUTPUT:
"Sample paragraph"
For the second question:
# (an example <a> tag is included in the sample so the snippet has something to find)
html = '''<div class="_52jc _5qc4 _78cz _24u0 _36xo" data-sigil="m-feed-voice-subtitle">
<a href="sampleurl.com">Link</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
for anc in soup.find_all('div', class_="_52jc _5qc4 _78cz _24u0 _36xo"):
    anchor = anc.find("a")
    print("Found the URL:", anchor['href'])
OUTPUT:
Found the URL: sampleurl.com

Using BeautifulSoup to find an attribute called data-stat

I'm currently working on a web scraper that will allow me to pull stats from a football player. Usually this would be an easy task if I could just grab the divs; however, this website uses an attribute called data-stat and uses it like a class. This is an example of that.
<th scope="row" class="left " data-stat="year_id">2000</th>
If you would like to check the site for yourself here is the link.
https://www.pro-football-reference.com/players/B/BradTo00.htm
I've tried a few different methods. Either it won't work at all, or I am able to start a for loop and begin putting things into arrays; however, you will notice that not everything in the table is the same type.
Sorry for the formatting and the grammar.
Here is what I have so far. I'm sure it's not the best-looking code; it's mainly code I've tried on my own with a few things mixed in from searching on Google. Ignore the random imports; I was trying different things.
# import libraries
import csv
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
# specify url
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'lxml')
# find searches the given tag (div) with given class attribute and returns the first match it finds
headers = [c.get_text() for c in soup.find(class_ = 'table_container').find_all('td')[0:31]]
data = [[cell.get_text(strip=True) for cell in row.find_all('td')[0:32]]
        for row in soup.find_all("tr", class_=True)]
tags = soup.find(data ='pos')
#stats = tags.find_all('td')
print(tags)
You need to use the get method from BeautifulSoup to get the attributes by name
See: BeautifulSoup Get Attribute
Here is a snippet to get all the data you want from the table:
from bs4 import BeautifulSoup
import requests

url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Get table
table = soup.find(class_="table_outer_container")

# Get head
thead = table.find('thead')
th_head = thead.find_all('th')
for thh in th_head:
    # Get case value
    print(thh.get_text())
    # Get data-stat value
    print(thh.get('data-stat'))

# Get body
tbody = table.find('tbody')
tr_body = tbody.find_all('tr')
for trb in tr_body:
    # Get id
    print(trb.get('id'))
    # Get th data
    th = trb.find('th')
    print(th.get_text())
    print(th.get('data-stat'))
    for td in trb.find_all('td'):
        # Get case value
        print(td.get_text())
        # Get data-stat value
        print(td.get('data-stat'))

# Get footer
tfoot = table.find('tfoot')
thf = tfoot.find('th')
# Get case value
print(thf.get_text())
# Get data-stat value
print(thf.get('data-stat'))
for tdf in tfoot.find_all('td'):
    # Get case value
    print(tdf.get_text())
    # Get data-stat value
    print(tdf.get('data-stat'))
You can of course save the data in a CSV or even a JSON file instead of printing it.
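For instance, here is a minimal sketch of collecting the body rows into dicts keyed by data-stat and writing them out with csv.DictWriter; the filename stats.csv and the row structure are illustrative assumptions:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
tbody = soup.find(class_="table_outer_container").find('tbody')

# Build one dict per row, keyed by each cell's data-stat attribute
rows = []
for tr in tbody.find_all('tr'):
    row = {cell.get('data-stat'): cell.get_text(strip=True)
           for cell in tr.find_all(['th', 'td'])}
    rows.append(row)

# Write to CSV, using the first row's keys as the header;
# extrasaction='ignore' skips any keys not present in the header
with open('stats.csv', 'w', newline='') as f:  # 'stats.csv' is an arbitrary example name
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)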
It's not very clear what exactly you're trying to extract, but this might help you a little bit:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
page = requests.get(url)
soup = bs(page.text, "html.parser")

# Extract tables
table = soup.find_all('table')

# Let's extract data from each row in each table
for row in table:
    col = row.find_all('td')
    for c in col:
        print(c.text)
Hope this helps!

Web scraping with Python and Beautiful Soup

I am practicing building web scrapers. The one I am working on now involves going to a site, scraping links for the various cities on that site, then taking all of the links for each of the cities and scraping all the links for the properties in said cities.
I'm using the following code:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title") # Bottom page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")] # Links to cities
If I print out city_tags, I get the HTML I want. However, when I print cities_links I get AttributeError: 'ResultSet' object has no attribute 'find_all'.
I gather from other questions on here that this error occurs because city_tags is None, but that can't be the case if it is printing out the desired HTML? I have noticed that said HTML is wrapped in [] - does this make a difference?
Well, city_tags is a bs4.element.ResultSet (essentially a list) of tags, and you are calling find_all on it. You probably want to call find_all on every element of the ResultSet, or in this specific case just retrieve their href attribute:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title") # Bottom page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags] # Links to cities
As the error says, city_tags is a ResultSet, which is a list of nodes, and it doesn't have the find_all method. You either have to loop through the set and apply find_all to each individual node, or in your case, I think you can simply extract the href attribute from each node:
[tag['href'] for tag in city_tags]
#['https://www.chapter-living.com/blog/',
# 'https://www.chapter-living.com/testimonials/',
# 'https://www.chapter-living.com/events/']
