Dynamically extract text from a webpage using Python BeautifulSoup

I'm trying to extract player position from many players' webpages (here's an example for Malcolm Brogdon). I'm able to extract Malcolm Brogdon's position using the following code:
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np

player_id = 'malcolm-brogdon-1'
url = "https://www.sports-reference.com/cbb/players/{}.html".format(player_id)
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

# Take the first <p>, find its <strong> label, and keep the text after it
pos = page_soup.p.find("strong").next_sibling.strip()
print(pos)
However, I want to do this in a more dynamic way: locate "Position:" and then take whatever comes after it. Some players' pages are structured slightly differently, and my current code doesn't return the position for them (e.g. Cat Barber).
I've tried doing something like page_soup.find("strong", text="Position:") but that doesn't seem to work.
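A likely reason that exact match fails: text= compares against the tag's entire string, whitespace included, so any padding around "Position:" breaks it. A minimal demo on hypothetical markup:
from bs4 import BeautifulSoup

html = "<p><strong>\n  Position:\n  </strong>Guard</p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find("strong", text="Position:"))                      # None: exact match fails
print(soup.find("strong", text=lambda t: t and "Position" in t))  # matches the <strong>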

You can select the element that contains the text "Position:" and then take the next text sibling:
import requests
from bs4 import BeautifulSoup

url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Match the <strong> whose text contains "Position", then take the text node after it
pos = soup.select_one('strong:contains("Position")').find_next_sibling(text=True).strip()
print(pos)
Prints:
Guard
EDIT: Another version:
import requests
from bs4 import BeautifulSoup
url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# (the "t and" guard avoids a TypeError on <strong> tags with no string)
pos = (
    soup.find("strong", text=lambda t: t and "Position" in t)
    .find_next_sibling(text=True)
    .strip()
)
print(pos)
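If you are running this over many player IDs, it may help to wrap the lookup in a small helper that returns None instead of raising when a page lacks the label. A sketch building on the answer above (the function name and structure are my own):
import requests
from bs4 import BeautifulSoup

def get_position(player_id):
    # Fetch the player page, using the same User-Agent header as in the question
    url = f"https://www.sports-reference.com/cbb/players/{player_id}.html"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.content, "html.parser")

    # Locate the "Position" label; bail out gracefully if the page has none
    label = soup.find("strong", text=lambda t: t and "Position" in t)
    if label is None:
        return None
    value = label.find_next_sibling(text=True)
    return value.strip() if value else None

print(get_position("malcolm-brogdon-1"))     # e.g. Guard
print(get_position("anthony-cat-barber-1"))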

Related

Beautiful soup returns random elements when called for a certain tag or class name

From the following site, I need to extract the current price of each listed product, which sits in a tag with the class name current-price. I wrote the following code to get the result, but what I get is the price along with a bunch of other stuff. How do I filter out the price alone from the HTML?
https://www.tendercuts.in/chicken
Here is the code I used:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://www.tendercuts.in/'
r = requests.get('https://www.tendercuts.in/chicken')
soup = BeautifulSoup(r.content, 'lxml')
productweight = soup.find_all('p', class_='currentprice')
print(productweight)
You need to use the correct class as follows:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.tendercuts.in/chicken')
soup = BeautifulSoup(r.content, 'lxml')
for p in soup.find_all('p', class_='current-price'):
    print(p.text)
Here you go... there are two changes to make in your code: the class name is current-price (not currentprice), and this version switches the parser to html.parser. Check the following code:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://www.tendercuts.in/'
r = requests.get('https://www.tendercuts.in/chicken')
soup = BeautifulSoup(r.content, 'html.parser')
productweight = soup.find_all('p', class_='current-price')
print(productweight)
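If you only want the price text rather than the whole <p> elements, .get_text(strip=True) drops the tags and surrounding whitespace. A small sketch reusing the current-price class from the answers above:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.tendercuts.in/chicken')
soup = BeautifulSoup(r.content, 'lxml')

# get_text(strip=True) returns only the text inside each <p>, trimmed
for p in soup.find_all('p', class_='current-price'):
    print(p.get_text(strip=True))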

Web scraping several a href

I would like to scrape this page with Python: https://statusinvest.com.br/acoes/proventos/ibovespa.
With this code:
import requests
from bs4 import BeautifulSoup as bs
URL = "https://statusinvest.com.br/acoes/proventos/ibovespa"
page = 1
req = requests.get(URL+str(page))
soup = bs(req.text, 'html.parser')
container = soup.find('div', attrs={'class','list'})
dividends = container.find('a')
for dividend in dividends:
    links = dividend.find_all('a')
    print(links)
But it doesn't return anything.
Can someone help me please?
Edited: the updated code below shows how to access any of the data you mentioned in the comments. You can modify it to your needs, since all the data on that page ends up inside the data variable.
Updated Code:
import json
import requests
from bs4 import BeautifulSoup as bs

url = "https://statusinvest.com.br"
req = requests.get(f"{url}/acoes/proventos/ibovespa")
soup = bs(req.content, 'html.parser')

# All of the page's data sits as JSON inside <input id="result">
data = json.loads(soup.find('input', attrs={'id': 'result'})["value"])

fields = ["code", "companyName", "companyNameClean", "companyId",
          "resultAbsoluteValue", "dateCom", "paymentDividend",
          "earningType", "dy", "recentEvents", "uRLClear"]

print("Date Com Data")
for datecom in data["dateCom"]:
    print("\t".join(str(datecom[f]) for f in fields))

print("\nDate Payment Data")
for datePayment in data["datePayment"]:
    print("\t".join(str(datePayment[f]) for f in fields))

print("\nProvisioned Data")
for provisioned in data["provisioned"]:
    print("\t".join(str(provisioned[f]) for f in fields))
Looking at the website's source code, you can fetch the JSON directly and build the links you want. Follow the code below.
Code:
import json
import requests
from bs4 import BeautifulSoup as bs
url = "https://statusinvest.com.br"
links = []
req = requests.get(f"{url}/acoes/proventos/ibovespa")
soup = bs(req.content, 'html.parser')
data = json.loads(soup.find('input', attrs={'id': 'result'})["value"])
for datecom in data["dateCom"]:
    links.append(f"{url}{datecom['uRLClear']}")
for datePayment in data["datePayment"]:
    links.append(f"{url}{datePayment['uRLClear']}")
for provisioned in data["provisioned"]:
    links.append(f"{url}{provisioned['uRLClear']}")
print(links)
Let me know if you have any questions :)

Unable to make find_all(string='television') work with BeautifulSoup (Python 3.x)

I'm building a webpage scraper (my first) with the intention of finding a specific word in a page.
I'm able to get the page and parse it, but when I try to use find_all() or even find() to search for string='television', I get 0 results. The word is there. Also, find_all('td') finds all 2000+ tags, but searching by string gives me 0.
Here is the code:
import urllib
import requests
from bs4 import BeautifulSoup
#get site
page_link = 'https://www.txdot.gov/insdtdot/orgchart/cmd/cserve/bidtab/12033001.htm'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
match = page_content.find_all(string="television")
print(len(match))
You are searching for the text TELEVISION, which is a partial string inside a td tag, and string= only matches a tag's complete string. To achieve this you can use a regular expression:
import requests
from bs4 import BeautifulSoup
import re
page_link ='https://www.txdot.gov/insdtdot/orgchart/cmd/cserve/bidtab/12033001.htm'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
match = page_content.find_all(text=re.compile("TELEVISION"))
print(len(match))
Or, if you have BeautifulSoup 4.7.1 or above, you can use a CSS selector with :contains:
import requests
from bs4 import BeautifulSoup
page_link ='https://www.txdot.gov/insdtdot/orgchart/cmd/cserve/bidtab/12033001.htm'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
match = page_content.select('td:contains(TELEVISION)')
print(len(match))
Please note you have to use the exact text as it appears on the webpage (here it is uppercase: TELEVISION).
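Alternatively, if you would rather not track the page's capitalization at all, re.IGNORECASE makes one pattern cover television, Television, and TELEVISION. A sketch based on the regex answer above:
import re
import requests
from bs4 import BeautifulSoup

page_link = 'https://www.txdot.gov/insdtdot/orgchart/cmd/cserve/bidtab/12033001.htm'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

# re.IGNORECASE matches the word however the page capitalizes it
match = page_content.find_all(text=re.compile("television", re.IGNORECASE))
print(len(match))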

How to make beautiful soup grab only what is between a set of "[:" ":]" in a web page?

Good afternoon! How do I make BeautifulSoup grab only what is between multiple sets of "[:" and ":]"? So far I have the entire page in my soup, but it has no tags, sadly.
I have tried a couple of things so far:
soup.findAll(text="[")
keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
import bs4 as bs
import urllib.request
source = urllib.request.urlopen("https://login.microsoftonline.com/common/discovery/keys").read()
soup = bs.BeautifulSoup(source,'lxml')
# ---------------------------------------------
# prior script that I was playing with trying to tackle this issue
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# Set URL to scrape new certs from
newcerts = "https://login.microsoftonline.com/common/discovery/keys"
# Connect to the URL
response = requests.get(newcerts)
# Parse HTML and save to BeautifulSoup Object
soup = BeautifulSoup(response.text, "html.parser")
keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
End goal is to retrieve the public PKI keys from Azure's website at https://login.microsoftonline.com/common/discovery/keys
Not sure if this is what you meant to grab. Try the script below:
import json
import requests
url = 'https://login.microsoftonline.com/common/discovery/keys'
res = requests.get(url)
jsonobject = json.loads(res.content)
for item in jsonobject['keys']:
    print(item['x5c'])
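If the end goal is certificates you can hand to other tooling: each x5c entry is a base64-encoded DER certificate, so wrapping it in PEM markers at 64 characters per line yields a standard .pem. A sketch:
import json
import textwrap
import requests

url = 'https://login.microsoftonline.com/common/discovery/keys'
keys = json.loads(requests.get(url).content)['keys']

for item in keys:
    der_b64 = item['x5c'][0]  # first certificate in the chain, base64 DER
    pem = ("-----BEGIN CERTIFICATE-----\n"
           + "\n".join(textwrap.wrap(der_b64, 64))
           + "\n-----END CERTIFICATE-----\n")
    print(pem)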

How to reach deeper divs inside a <span> tag using a Python crawler?

The body tag has a <span> tag, and there are many other divs inside that span. I want to go deeper, but when I try this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.body.span
print (result)
the result was just this:
<span id="react-root"></span>
How can I reach to divs inside the span tag?
Can we parse the <span> tag? Is it possible? If yes, why am I not able to parse the span?
By using this:
result = soup.body.span.contents
The output was:
[]
As discussed in the comments, urlopen(url) returns a file-like object, which means you need to read() from it to get what's inside.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data.read(), 'html.parser')
result = soup.body.span
print (result)
The code I used for my python 2.7 setup:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.instagram.com/artfido/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data.read(), 'lxml')
result = soup.body.span
print result
EDIT
For future reference, if you want something simpler for handling the URL, there is a package called requests. In this case it is similar, but I find it easier to read.
from bs4 import BeautifulSoup
import requests
url = 'https://www.instagram.com/artfido/'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
result = soup.body.span
print result
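Worth noting: the <span> comes back empty because the divs inside it are rendered client-side by JavaScript, so no amount of parsing the raw HTML will reveal them. At the time, Instagram embedded the page data as JSON in a script tag, which you could pull out directly. A sketch that assumes the old window._sharedData layout (Instagram changes its markup often, so treat this as illustrative only):
import json
import re
import requests

url = 'https://www.instagram.com/artfido/'
html = requests.get(url).text

# Look for the JSON blob the page's JavaScript uses to render the divs
m = re.search(r'window\._sharedData\s*=\s*(\{.+?\});', html, re.DOTALL)
if m:
    data = json.loads(m.group(1))
    print(list(data.keys()))
else:
    print('sharedData blob not found; the markup may have changed')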
