Web Scraping problem, getting empty list back with BeautifulSoup Library - python

I'm very new to web scraping. I am trying to scrape the genres of a list of movies from the source code of their pages on the Mubi website, using their URLs. I found the genre in an element with the class name "css-1wuve65 eyplj6810" in the source code, as shown in the picture below:
With the following code, I am trying to get this genre using 'select':
for i in range(len(movie_url.movie_url)):
    url = movie_url.movie_url[i]
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    genre_tags = soup.select('div.css-1wuve65 eyplj6810')
    print(genre_tags)
However, I keep getting an empty list back for genre_tags on every single URL. I have checked the URL, and it is retrieved correctly. Can someone help or give me some tips on how to do this? A sample URL is
https://mubi.com/films/elementary-particles

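You're missing a dot (.) between div.css-1wuve65 and eyplj6810. With a space, the selector looks for an <eyplj6810> tag nested inside the div; with a dot, it matches a div that carries both classes: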
import requests
from bs4 import BeautifulSoup
url = 'https://mubi.com/films/elementary-particles'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.select_one('div.css-1wuve65.eyplj6810').text)
Prints:
Comedy, Drama, Romance, Cult
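To see the difference without hitting the network, here is a minimal sketch on a hypothetical one-line HTML snippet (an assumption for illustration, not the real Mubi page). It compares the selector with a space against the selector with a dot:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div that carries two classes.
html = '<div class="css-1wuve65 eyplj6810">Comedy, Drama</div>'
soup = BeautifulSoup(html, "html.parser")

# Space = descendant combinator: an <eyplj6810> tag inside div.css-1wuve65.
with_space = soup.select("div.css-1wuve65 eyplj6810")

# Dot = a second class required on the same div.
with_dot = soup.select("div.css-1wuve65.eyplj6810")

print(with_space)
print([tag.text for tag in with_dot])
```

The first selector finds nothing because no tag named eyplj6810 exists, which matches the empty list in the question.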

Related

Trying to get the string inside of <div> tags using bs4 (python3)

Please be patient with me. I'm brand new to Python and Stack Overflow.
I am trying to pull crypto price data into a program in order to find out exactly how much I have in USD. I am currently stuck trying to extract the string from the tag that I get back. What I have so far:
It won't allow me to add a picture of my post yet so here is a link:
[1]: https://i.stack.imgur.com/DVlxe.png
I will also put the code here. Please forgive the formatting.
from bs4 import BeautifulSoup
import requests
url = ('https://coinmarketcap.com/currencies/shiba-inu/')
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
price = soup.find_all("div", {"class": "imn55z-0 hCqbVS price"})
for i in price:
    prices = (i.find("div"))
    print(prices)
I want to pull the string out and turn it into a number so I can do some math with it later in the program.
Any and all help will be much appreciated.
You don't need to match the whole class string (the generated parts of it might change); matching on the price class alone should work. Try the following:
from bs4 import BeautifulSoup
import requests
url = 'https://coinmarketcap.com/currencies/shiba-inu/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
div_price = soup.find('div', class_="price")
price = div_price.div.text
print(price)
This displayed:
$0.00001019
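The question also asks about doing math on the result. As a rough sketch, assuming the text always looks like the output above (a dollar sign followed by digits), the string can be converted with Decimal, since a price this small is not an integer. The parse_price helper and the holdings figure are hypothetical names and values for illustration:

```python
from decimal import Decimal

def parse_price(text: str) -> Decimal:
    """Convert a displayed price such as '$0.00001019' to a Decimal."""
    # Drop surrounding whitespace, the currency symbol and thousands separators.
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)

price = parse_price("$0.00001019")
holdings = Decimal("1000000")  # hypothetical number of tokens held
value = price * holdings       # value of the holdings in USD
print(value)
```

Decimal avoids the rounding surprises of float for currency amounts; float(cleaned) would also work if exact decimal arithmetic is not needed.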

How can I scrape for certain html classes in python

I'm trying to scrape a random site and get all the text with a certain class from a page.
from bs4 import BeautifulSoup
import requests
sources = ['https://cnn.com']
for source in sources:
    page = requests.get(source)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find_all("div", class_='cd_content')
    for result in results:
        title = result.find('span', class_="cd__headline-text vid-left-enabled")
        print(title)
From what I found online, this should work, but for some reason it can't find anything and results is empty. Any help is greatly appreciated.
If you inspect the network calls, you can see that the page content is loaded dynamically via a GET request to:
https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl
The response is JSON, and the HTML is available under its html key:
import requests
from bs4 import BeautifulSoup
URL = "https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl"
response = requests.get(URL).json()["html"]
soup = BeautifulSoup(response, "html.parser")
for tag in soup.find_all(class_="cd__headline-text vid-left-enabled"):
    print(tag.text)
Output (truncated):
This is the first Covid-19 vaccine in the US authorized for use in younger teens and adolescents
When the US could see Covid cases and deaths plummet
'Truly, madly, deeply false': Keilar fact-checks Ron Johnson's vaccine claim
These are the states with the highest and lowest vaccination rates
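The same pattern can be tried offline. The sketch below assumes a hypothetical JSON payload shaped like the response described above (a single "html" key holding markup); the headlines in it are made up for illustration:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical payload: JSON whose "html" key holds an HTML fragment.
payload = json.dumps({
    "html": (
        '<h3><span class="cd__headline-text vid-left-enabled">First headline</span></h3>'
        '<h3><span class="cd__headline-text vid-left-enabled">Second headline</span></h3>'
    )
})

# Decode the JSON first, then hand only the HTML fragment to BeautifulSoup.
fragment = json.loads(payload)["html"]
soup = BeautifulSoup(fragment, "html.parser")

# Matching on one class is less brittle than matching the full class string.
headlines = [tag.get_text(strip=True) for tag in soup.select(".cd__headline-text")]
print(headlines)
```

With a live request, requests.get(URL).json() performs the decoding step that json.loads does here.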

Python BeautifulSoup trouble extracting titles from a page with JS

I'm having some serious issues trying to extract the titles from a webpage. I've done this before on some other sites, but this one seems to be an issue because of the JavaScript.
The test link is "https://www.thomasnet.com/products/adhesives-393009-1.html"
The first title I want extracted is "Toagosei America, Inc."
Here is my code:
import requests
from bs4 import BeautifulSoup
url = ("https://www.thomasnet.com/products/adhesives-393009-1.html")
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
print(soup.get_text())
If I run it like this, with get_text, I can find the titles in the result. However, as soon as I change it to find_all or find, the titles are lost. I can't find them using the web browser's inspect tool, because it's all JS-generated.
Any advice would be greatly appreciated.
You have to specify what to find. In this case, search for <h2> to get the first title:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
first_title = soup.find('h2')
print(first_title.text)
Prints:
Toagosei America, Inc.
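To collect every title rather than just the first, find_all can be used the same way. This is a minimal sketch on hypothetical markup: the div wrapper and the second company name are assumptions, and the real page structure may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one <h2> per supplier.
html = """
<div class="supplier"><h2>Toagosei America, Inc.</h2></div>
<div class="supplier"><h2>Example Adhesives Co.</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match (or None); find_all returns a list of all matches.
first_title = soup.find("h2").get_text(strip=True)
all_titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

print(first_title)
print(all_titles)
```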

How to extract data from table using Beautifulsoup, with no text

I'm trying to extract the leaderboard on https://fortnitetracker.com/events/epicgames_S11_PlatformCup_PC_NAE.
However, the data inside the table is not extractable as 'text' elements, and I do not know how to extract the player names.
from bs4 import BeautifulSoup
import requests
url = "https://fortnitetracker.com/events/epicgames_S11_PlatformCup_PC_NAE"
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, "html.parser")
table_needed = soup.find_all('table',attrs={'class':'trn-table'})
table_needed=table_needed[0]
for i in table_needed:
    print(i.find('div'))
This gives me the following output. What can I do from here?
'{{ getPlayerNameList(entry.teamAccountIds, 4) }}'
I see there's a getPlayerNameList...
The player names are injected into the table dynamically with JavaScript (Vue), so I don't think you can scrape them with BeautifulSoup alone. I would suggest you try a tool like Selenium.
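One quick way to recognise this situation before reaching for a browser-driven tool is to check whether the downloaded HTML still contains unrendered {{ ... }} template placeholders. The sketch below is only a heuristic under that assumption, run against a hypothetical fragment modelled on the output in the question:

```python
import re
from bs4 import BeautifulSoup

# Matches unrendered template expressions such as {{ something }}.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")

# Hypothetical fragment modelled on the output shown in the question.
html = (
    '<table class="trn-table"><tr><td>'
    "<div>{{ getPlayerNameList(entry.teamAccountIds, 4) }}</div>"
    "</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="trn-table")

# Any placeholder left in the text means the data is filled in client-side.
unrendered = PLACEHOLDER.findall(table.get_text())
needs_browser = bool(unrendered)
print(needs_browser, unrendered)
```

If needs_browser is true, the values are not in the static HTML at all, so no BeautifulSoup selector will find them.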

Python BeautifulSoup4 - Scrape Section/Table Header and Values from Multiple Sections/Tables

I'm trying to scrape links with contextual information from the following page: https://www.reddit.com/r/anime/wiki/discussion_archive/2018. I'm able to get the links just fine using BS4 in Python, but ideally I would also have the year, season, titles, and episodes associated with each link. The desired output would look like this:
I've started with the code below, but don't know how to loop through the code to capture things in sections for each season/title:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')
data_table = soup.find('div', class_='md wiki')
Is this something that's doable with BS4? Thanks for your help!
EDIT
criteria = {'class':'md wiki'} # so it can reuse later
data_soup = soup.find('div', criteria)
titles = data_soup.find_all('strong')
tables = data_soup.find_all('table')
Try the following:
titles = soup.find('div', {'class':'md wiki'}).find_all('strong')
data_tables = soup.find('div', {'class':'md wiki'}).find_all('table')
It is better to pass the second argument of find as a dict; find_all then returns every element that matches your search.
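To associate each title with the links in its table, one option is to walk the <strong> tags and take the next table after each with find_next. This is a sketch on hypothetical markup; it assumes every title is followed by its own table, which may not hold on the real wiki page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each bold title is followed by a table of episode links.
html = """
<div class="md wiki">
  <p><strong>Title A</strong></p>
  <table><tr><td><a href="https://example.com/a1">Episode 1</a></td></tr></table>
  <p><strong>Title B</strong></p>
  <table><tr>
    <td><a href="https://example.com/b1">Episode 1</a></td>
    <td><a href="https://example.com/b2">Episode 2</a></td>
  </tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
wiki = soup.find("div", class_="md wiki")

rows = []
for strong in wiki.find_all("strong"):
    # find_next returns the first <table> after this title in document order.
    table = strong.find_next("table")
    if table is None:
        continue
    for link in table.find_all("a"):
        rows.append((strong.get_text(strip=True), link.get_text(strip=True), link["href"]))

for row in rows:
    print(row)
```

If a title has no table of its own, find_next would pick up the following title's table, so the pairing should be checked against the actual page.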
