Python Beautifulsoup (bs4) findAll not finding all elements

Using the URL in the code below, I am trying to gather all of the players' names from the page. However, .findAll does not return any of the list elements. Please advise.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
players_url = 'https://stats.nba.com/players/list/?Historic=Y'
# Opening up the Connection and grabbing the page
uClient = uReq(players_url)
page_html = uClient.read()
players_soup = soup(page_html, "html.parser")
# Collect every list item from the unordered lists that contain the players.
list_elements = players_soup.findAll('li', {'class': 'players-list__name'})

As @Oluwafemi Sule suggested, it is better to use Selenium together with BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://stats.nba.com/players/list/?Historic=Y')
soup = BeautifulSoup(driver.page_source, 'lxml')
for div in soup.findAll('li', {'class': 'players-list__name'}):
    print(div.find('a').contents[0])
Output:
Abdelnaby, Alaa
Abdul-Aziz, Zaid
Abdul-Jabbar, Kareem
Abdul-Rauf, Mahmoud
Abdul-Wahad, Tariq
etc.

You can do this with requests alone by pulling the names directly from the JavaScript file that provides them.
import requests
import json
r = requests.get('https://stats.nba.com/js/data/ptsd/stats_ptsd.js')
s = r.text.replace('var stats_ptsd = ','').replace('};','}')
data = json.loads(s)['data']['players']
players = [item[1] for item in data]
print(players)
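The two chained replace calls above depend on the exact spacing of the assignment. A slightly more defensive sketch is shown below; the sample string is a hypothetical, shortened stand-in for the body of stats_ptsd.js, and parse_js_assignment is a helper name introduced here for illustration only.

```python
import json

# Hypothetical, shortened stand-in for the text returned by stats_ptsd.js.
sample_js = (
    'var stats_ptsd = {"data": {"players": '
    '[[76001, "Abdelnaby, Alaa"], [76002, "Abdul-Aziz, Zaid"]]}};'
)


def parse_js_assignment(js_text, var_name):
    """Return the JSON object assigned to `var_name` in a one-statement script."""
    prefix = "var " + var_name + " = "
    start = js_text.index(prefix) + len(prefix)
    # Cut at the last closing brace so the trailing semicolon is dropped.
    end = js_text.rindex("}") + 1
    return json.loads(js_text[start:end])


data = parse_js_assignment(sample_js, "stats_ptsd")
players = [row[1] for row in data["data"]["players"]]
print(players)
```

This only works while the assigned value is valid JSON; a script that uses single quotes or unquoted keys would need a different approach.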

As @Oluwafemi Sule mentioned in the comments:
The list of players on the page is generated with JavaScript.
Instead of using Selenium, I recommend the requests-html package, created by the author of the very popular requests library. It uses Chromium under the hood to render JavaScript content.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://stats.nba.com/players/list/?Historic=Y')
r.html.render()
for anchor in r.html.find('.players-list__name > a'):
    print(anchor.text)
Output:
Abdelnaby, Alaa
Abdul-Aziz, Zaid
Abdul-Jabbar, Kareem
Abdul-Rauf, Mahmoud
Abdul-Wahad, Tariq
...
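Before reaching for a browser-based tool, it can help to confirm that the elements really are missing from the HTML the server sends. The sketch below uses only the standard library; both HTML strings are hypothetical stand-ins (one for the raw response, one for the DOM after JavaScript has run), and count_elements is a helper introduced here for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags with a given tag name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html, tag, css_class):
    counter = ClassCounter(tag, css_class)
    counter.feed(html)
    return counter.count


# Hypothetical raw response: an empty container that a script fills in later.
raw_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
# Hypothetical DOM after rendering.
rendered_html = '<ul><li class="players-list__name"><a href="/player/1/">Abdelnaby, Alaa</a></li></ul>'

print(count_elements(raw_html, "li", "players-list__name"))
print(count_elements(rendered_html, "li", "players-list__name"))
```

If the count is zero for the raw response but the element shows up in the browser's inspector, the content is being added by JavaScript and a plain HTTP fetch will never contain it.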

Related

How to extract url/links that are contents of a webpage with BeautifulSoup

The website I am using is https://keithgalli.github.io/web-scraping/webpage.html, and I want to extract all the social media links on the page.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content)
links = soup.find_all('a', {'class':'socials'})
actual_links = [link['href'] for link in links]
I get an error, specifically:
KeyError: 'href'
For a different example and webpage, I was able to use the same code to extract the webpage link but for some reason this time it is not working and I don't know why.
I also tried to see what the problem was, and it appears that links is a nested array: links[0] outputs the entire content of the ul tag that has class=socials. So it is not iterable, so to speak, since the first element contains all the links rather than each social li tag being a separate element inside links.

Here is the solution using css selectors:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, 'lxml')
links = soup.select('ul.socials li a')
actual_links = [link['href'] for link in links]
print(actual_links)
Output:
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@keithgalli']

Why not try something like:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content)
links = soup.find_all('a', {'class':'socials'})
actual_links = [link['href'] for link in links if link.has_attr('href')]
After getting some new information from you and visiting the webpage, I realized that you made the following mistake:
The socials class is never used on any a element, so your script will not find one. Instead, you should look for the li elements with the class "social".
Your code should therefore look like this:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, "lxml")
link_list_items = soup.find_all('li', {'class':'social'})
links = [item.find('a').get('href') for item in link_list_items]
print(links)
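The difference between the selectors can be checked offline. The sketch below assumes a simplified, hypothetical copy of the page's markup (a ul with class socials whose li children carry class social) and uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified copy of the relevant markup.
html = """
<ul class="socials">
  <li class="social instagram"><a href="https://www.instagram.com/keithgalli/">Instagram</a></li>
  <li class="social twitter"><a href="https://twitter.com/keithgalli">Twitter</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# No <a> element carries the class itself, so this finds nothing.
on_anchor = soup.find_all("a", class_="socials")

# Walk down from the list instead: CSS selector version.
via_select = [a["href"] for a in soup.select("ul.socials li a")]

# Same idea with find_all: match the list items, then take each item's link.
via_items = [li.a.get("href") for li in soup.find_all("li", class_="social")]

print(on_anchor)
print(via_select)
print(via_items)
```

Because class is a multi-valued attribute, class_="social" matches an li whose class is "social instagram", but it does not match the ul whose class is "socials".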

Not able to find a link in a product page

I am trying to make a list of the links that are inside a product page.
I have multiple search URLs, and from each one I want to get the link to the product page.
I am posting the code for a single URL only.
r = requests.get("https://funskoolindia.com/products.php?search=9723100")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='product-bg-panel', href=True):
    print('href: ', a_tag['href'])
This is what it should print: https://funskoolindia.com/product_inner_page.php?product_id=1113

The site is dynamic, so you can use Selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://funskoolindia.com/products.php?search=9723100')
results = [*{i.a['href'] for i in soup(d.page_source, 'html.parser').find_all('div', {'class':'product-media light-bg'})}]
Output:
['product_inner_page.php?product_id=1113']
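The href returned here is relative to the page it was found on. A minimal sketch of turning it into the absolute URL the question asks for, using urljoin from the standard library:

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from.
page_url = "https://funskoolindia.com/products.php?search=9723100"
# Relative hrefs as returned by the scraper above.
relative_hrefs = ["product_inner_page.php?product_id=1113"]

# urljoin resolves each href against the page URL, the way a browser would.
absolute_links = [urljoin(page_url, href) for href in relative_hrefs]
print(absolute_links)
```

urljoin also handles hrefs that are already absolute or that start with a slash, which plain string concatenation does not.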

The data is loaded dynamically through JavaScript from a different URL. One solution is to use Selenium, which executes the JavaScript and loads the links that way.
Another solution is to use the re module and request the data URL manually:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://funskoolindia.com/products.php?search=9723100'
data_url = 'https://funskoolindia.com/admin/load_data.php'
data = {'page': '1',
        'sort_val': 'new',
        'product_view_val': 'grid',
        'show_list': '12',
        'brand_id': '',
        'checkboxKey': re.findall(r'var checkboxKey = "(.*?)";', requests.get(url).text)[0]}
soup = BeautifulSoup(requests.post(data_url, data=data).text, 'lxml')
for a in soup.select('#list-view .product-bg-panel > a[href]'):
    print('https://funskoolindia.com/' + a['href'])
Prints:
https://funskoolindia.com/product_inner_page.php?product_id=1113
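Indexing the result of re.findall with [0] raises a bare IndexError if the page stops embedding the key. The sketch below separates the extraction step and fails with a clearer message; the page source is a hypothetical excerpt, and extract_checkbox_key and build_payload are names introduced here for illustration.

```python
import re

# Hypothetical excerpt of the product page source; the real key differs.
page_source = '<script>var checkboxKey = "abc123";</script>'


def extract_checkbox_key(html):
    """Pull the value of the checkboxKey JavaScript variable out of the page."""
    match = re.search(r'var checkboxKey = "(.*?)";', html)
    if match is None:
        raise ValueError("checkboxKey not found; the page layout may have changed")
    return match.group(1)


def build_payload(html, page=1):
    """Assemble the form fields used in the POST request above."""
    return {
        "page": str(page),
        "sort_val": "new",
        "product_view_val": "grid",
        "show_list": "12",
        "brand_id": "",
        "checkboxKey": extract_checkbox_key(html),
    }


payload = build_payload(page_source)
print(payload)
```

The resulting dictionary can be passed as the data argument of requests.post in place of the inline dictionary.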

Try this: print('href: ', a_tag.get("href"))
Also add features="lxml" to the BeautifulSoup constructor.

BeautifulSoup find_all() returns nothing []

I'm trying to scrape all the offers from this page and want to iterate over <p class="white-strip">, but page_soup.find_all("p", "white-strip") returns an empty list [].
My code so far:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.sbicard.com/en/personal/offers.page#all-offers'
# Opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "lxml")
Edit: I got it working using Selenium; the code I used is below. However, I cannot figure out another way to do the same thing.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(r"C:\chromedriver_win32\chromedriver.exe")
driver.get('https://www.sbicard.com/en/personal/offers.page#all-offers')
# html parsing
page_soup = BeautifulSoup(driver.page_source, 'lxml')
# grabs each offer
containers = page_soup.find_all("p", {'class':"white-strip"})
filename = "offers.csv"
f = open(filename, "w")
header = "offer-list\n"
f.write(header)
for container in containers:
    offer = container.span.text
    f.write(offer + "\n")
f.close()
driver.close()

If you look for either of the items, you can find them within a script tag containing var offerData. To get the desired content out of that script, you can try the following.
import re
import json
import requests
url = "https://www.sbicard.com/en/personal/offers.page#all-offers"
res = requests.get(url)
p = re.compile(r"var offerData=(.*?);",re.DOTALL)
script = p.findall(res.text)[0].strip()
items = json.loads(script)
for item in items['offers']['offer']:
    print(item['text'])
The output looks like:
Upto Rs 8000 off on flights at Yatra
Electricity Bill payment – Phonepe Offer
25% off on online food ordering
Get 5% cashback at Best Price stores
Get 5% cashback
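One caveat with the pattern var offerData=(.*?); is that the non-greedy match stops at the first semicolon, so it would truncate the JSON if any offer text contained one. A possible alternative is to let the JSON decoder find the end of the object itself. In the sketch below the page text is a hypothetical sample (including an invented offer text with a semicolon), and extract_js_object is a helper introduced for illustration.

```python
import json

# Hypothetical page excerpt; the first offer text deliberately contains ';'.
page_text = (
    'var offerData={"offers": {"offer": ['
    '{"text": "Get 5% cashback; limited period"}, '
    '{"text": "25% off on online food ordering"}]}};\n'
    'var unrelated = 1;'
)


def extract_js_object(text, marker):
    """Decode the JSON value that starts right after `marker`."""
    start = text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so do it here.
    while text[start].isspace():
        start += 1
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


offers = extract_js_object(page_text, "var offerData=")
texts = [item["text"] for item in offers["offers"]["offer"]]
print(texts)
```

raw_decode parses one complete JSON value and ignores whatever follows it, so the rest of the script does not need to be stripped first.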

The website renders its data dynamically.
You should try the Selenium automation library. It lets you scrape pages whose data is loaded dynamically (by JavaScript or AJAX requests).
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get('https://www.sbicard.com/en/personal/offers.page#all-offers')
page_soup = BeautifulSoup(driver.page_source, 'lxml')
p_list = page_soup.find_all("p", {'class':"white-strip"})
print(p_list)
Here '/usr/bin/chromedriver' is the path to the Selenium web driver.
Download selenium web driver for chrome browser:
http://chromedriver.chromium.org/downloads
Install web driver for chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/

Problem with scraping data from website with BeautifulSoup

I am trying to take a movie rating from the website Letterboxd. I have used code like this on other websites and it has worked, but it is not getting the info I want off of this website.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://letterboxd.com/film/avengers-endgame/")
soup = BeautifulSoup(page.content, 'html.parser')
final = soup.find("section", attrs={"class":"section ratings-histogram-chart"})
print(final)
This prints nothing, but there is a tag in the website for this class and the info I want is under it.

The reason is that the website loads most of its content asynchronously: the page layout loads first, and the content is then fetched with further HTTP requests to the server. You can see those requests in the "Network" section of the browser's developer tools (F12 key).
For instance, one of the endpoints used to load the rating is this one:
https://letterboxd.com/csi/film/avengers-endgame/rating-histogram/
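Judging only from this single example, the fragment URL appears to be the film URL with /csi prepended to the path and rating-histogram/ appended. That pattern is an assumption, not documented behaviour, and histogram_url below is a hypothetical helper that merely encodes it.

```python
from urllib.parse import urlsplit, urlunsplit


def histogram_url(film_url):
    """Build the rating-histogram URL for a film page (pattern assumed from one example)."""
    parts = urlsplit(film_url)
    path = "/csi" + parts.path.rstrip("/") + "/rating-histogram/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(histogram_url("https://letterboxd.com/film/avengers-endgame/"))
```

If the site changes how it loads the histogram, the Network tab remains the reliable way to find the current request.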

You can get the weighted average from another tag:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://letterboxd.com/film/avengers-endgame/')
soup = bs(r.content, 'lxml')
print(soup.select_one('[name="twitter:data2"]')['content'])
Text of all the histogram bars:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://letterboxd.com/csi/film/avengers-endgame/rating-histogram/')
soup = bs(r.content, 'lxml')
ratings = [item['title'].replace('\xa0',' ') for item in soup.select('.tooltip')]
print(ratings)

Missing information in scraped web data, Google translate, Using Python

I want to scrape the Google translate website and get the translated text from it using Python 3.
Here is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import Request as uReq
from urllib.request import urlopen as open
my_url = "https://translate.google.com/#en/es/I%20am%20Animikh%20Aich"
req = uReq(my_url, headers={'User-Agent':'Mozilla/5.0'})
uClient = open(req)
page_html = uClient.read()
uClient.close()
html = soup(page_html, 'html5lib')
print(html)
Unfortunately, I am unable to find the required information in the parsed webpage.
In Chrome's "Inspect" view, the translated text is shown inside:
<span id="result_box" class="short_text" lang="es"><span class="">Yo soy Animikh Aich</span></span>
However, when I search for the information in the parsed HTML, this is what I find:
<span class="short_text" id="result_box"></span>
I have tried parsing with each of html5lib, lxml, and html.parser, and have not been able to find a solution.
Please help me with this issue.

You could use a dedicated Python API:
import goslate
gs = goslate.Goslate()
print(gs.translate('I am Animikh Aich', 'es'))
Yo soy Animikh Aich

Try like below to get the desired content:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://translate.google.com/#en/es/I%20am%20Animikh%20Aich")
soup = BeautifulSoup(driver.page_source, 'html5lib')
item = soup.select_one("#result_box span").text
print(item)
driver.quit()
Output:
Yo soy Animikh Aich

JavaScript modifies the HTML after the page loads. urllib cannot execute JavaScript, so you will have to use Selenium to get the data that you want.
For installation and a demo, refer to this link.
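A related detail: everything after the # in a URL is the fragment, which HTTP clients do not send to the server. In the question's URL, the language pair and the text to translate both live in the fragment, so they are only ever read by JavaScript running in the browser. A short sketch with the standard library makes the split visible:

```python
from urllib.parse import unquote, urldefrag, urlsplit

url = "https://translate.google.com/#en/es/I%20am%20Animikh%20Aich"

# The part of the URL that is actually requested from the server.
requested_url = urldefrag(url).url

# The fragment stays on the client; decode it to see what it carries.
fragment = unquote(urlsplit(url).fragment)
source_lang, target_lang, text = fragment.split("/", 2)

print(requested_url)
print(source_lang, target_lang, text)
```

This is why the response fetched with urllib contains an empty result_box: the server only ever received a request for the bare page.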
