getting a specific image from a website link with beautifulSoup - python

I'm trying to get a specific image from a website using Beautiful Soup:
from bs4 import BeautifulSoup
import urllib.request
import random
url=urllib.request.urlopen("https://www.arandomlink.com")
content = url.read() #We read it
soup = BeautifulSoup(content) #We create a BeautifulSoup object
#WE OBTAIN THE URL OF THE IMAGES RELATED TO THE ITEM
images = soup.findAll("img", {'class': "arandonclass"})
print(images['src'])
The problem is that it doesn't work: it can't find any image, even though I verified that there are images with that class.
What did I do wrong?
I'm using Python 3 and BS4.

Beautiful Soup Documentation
Searching by CSS class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")
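Applied to the question above, there is a second issue besides the class lookup: find_all returns a list of tags, so you must iterate and read 'src' from each tag rather than indexing the list itself. A minimal sketch on a made-up snippet, reusing the class name from the question:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page, with the class from the question
html = '<img class="arandonclass" src="a.png"><img class="arandonclass" src="b.png">'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of tags; ask each tag for 'src', not the list
for img in soup.find_all("img", class_="arandonclass"):
    print(img["src"])
```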

Related

How can I retrieve the value of a non-standard keyword attribute, using a regex to match the attribute's value, with BeautifulSoup?

In a private project (learning Python scripting), I needed to retrieve only the rpm package from the scraped page. I spotted that all package links (.msi, .deb, .rpm) have an attribute called data-link inside the 'a' tag.
I also tailored my own regex (https://regexr.com/6rqd2) to match only the package I need.
According to the documentation, this kind of attribute (data-*) is a non-standard attribute in HTML 5.
So I tried the attrs argument and passed it into find_all(), but with no success.
Unsuccessful code below:
#!/usr/bin/env python3
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
pattern = re.compile("(?<=data-link=\")[^ ]+rpm")
package = soup.find_all(attrs={"data-link": pattern})
print(package)
Thank you in advance for your help
Another solution, using CSS selectors:
import requests
from bs4 import BeautifulSoup
url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select('a[data-link$=".rpm"]'):
    print(a["data-link"])
Prints:
https://download.splunk.com/products/splunk/releases/9.0.0.1/linux/splunk-9.0.0.1-9e907cedecb1-linux-2.6-x86_64.rpm
Do you need all the features Beautiful Soup provides? The snippet below should find the links as required.
re.findall(pattern, str(page.content))
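A sketch of that regex-only approach on a static snippet. The data-link value here is made up, and the pattern is slightly tightened from the question's version (excluding quotes and requiring a literal dot) so it cannot run past the attribute's closing quote:

```python
import re

# Hypothetical snippet standing in for the downloaded page
html = '<a class="btn" data-link="https://example.com/pkg/splunk-9.0.0.1-linux-x86_64.rpm">Download</a>'

# Lookbehind anchors on the raw attribute; [^ "]+ stops at the closing quote
pattern = re.compile(r'(?<=data-link=")[^ "]+\.rpm')
print(pattern.findall(html))
```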
You don't need to include data-link in your expression: you're searching by the attribute's value, so the pattern should match the value only, not the full element:
soup.find_all(
"a",
{"data-link": re.compile(r"^(.+?)\.rpm$")},
)

Web scraping youtube page

I'm trying to get the title of YouTube videos given a link,
but I'm unable to access the element that holds the title. I'm using bs4 to parse the HTML.
I noticed I'm unable to access any element within the 'ytd-app' tag on the YouTube page.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
a = soup.find_all(attrs={"class": "style-scope ytd-video-primary-info-renderer"})
print(a)
So how can I get the video title? Is there something I'm doing wrong, or did YouTube intentionally create a tag like this to prevent web scraping?
The class you are targeting is rendered through JavaScript and all of its contents are dynamic, so it is very difficult to find that data using bs4.
What you can do instead is inspect the soup manually and find a tag that is present in the static HTML.
You can also try pytube.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
print(soup.find("title").get_text())
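If you only want the video title, the &lt;title&gt; text can be post-processed: YouTube pages typically append " - YouTube" to it. A small helper for that (the suffix is an assumption about the page, not a bs4 feature):

```python
def clean_title(page_title: str) -> str:
    # YouTube page titles usually end with " - YouTube"; strip that suffix if present
    suffix = " - YouTube"
    if page_title.endswith(suffix):
        return page_title[: -len(suffix)]
    return page_title

print(clean_title("Python Tutorial - YouTube"))  # Python Tutorial
```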

How to find all Elements of a specific Type with the new Requests-HTML library

I want to find all specific fields in an HTML document. In Beautiful Soup everything works with this code:
soup = BeautifulSoup(html_text, 'html.parser')
urls_previous = soup.find_all('h2', {'class': 'b_algo'})
but how can I make the same search with the requests-html library, or can requests-html only find a single element in an HTML document? I couldn't find how to do it in the docs or examples:
https://html.python-requests.org/
Example:
<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>
How can I find all elements of a specific type with the requests-html library?
With requests-html:
from requests_html import HTML
doc = """<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>"""
#load html from string
html = HTML(html=doc)
x = html.find('h2')
print(x)

BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class='ff_line' id='gi_344258949_1'> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, Beautiful Soup returns None. The same problem occurs when I look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
divs = soup.find_all('div', attrs={'class': 'seq gbff'})
for each in divs:
    print(each)
soup.find_all('span', attrs={'class': 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
With DevTools in Chrome/Firefox I found this url, and it contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this url in the HTML, because different pages will use different arguments in the url. Or you have to compare a few urls and find the schema, so you can generate the url manually.
EDIT: if in the url you change retmode=html to retmode=xml, you get the data as XML. If you use retmode=text, you get it as plain text without HTML tags. retmode=json doesn't work.
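One way to generate such a url manually, assuming the id/db/report/retmode parameters observed above are the only required ones (an assumption; other pages may need more arguments):

```python
from urllib.parse import urlencode

def build_viewer_url(seq_id: str, retmode: str = "text") -> str:
    # Parameter names copied from the url discovered via DevTools above
    base = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
    params = {"id": seq_id, "db": "protein", "report": "fasta", "retmode": retmode}
    return base + "?" + urlencode(params)

print(build_viewer_url("344258949"))
```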

soup.find_all works but soup.select doesn't work

I'm playing around with parsing an html page using css selectors
import requests
import webbrowser
from bs4 import BeautifulSoup
page = requests.get('http://www.marketwatch.com', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')
I'm having trouble selecting a list tag with a class when using the select method. However, I have no trouble when using the find_all method
soup.find_all('ul', class_= "latestNews j-scrollElement")
This returns the output I desire, but for some reason I can't do the same using CSS selectors. I want to know what I'm doing wrong.
Here is my attempt:
soup.select("ul .latestNews j-scrollElement")
which returns an empty list. I can't figure out what I'm doing wrong with the select method.
Thank you.
From the documentation:
If you want to search for tags that match two or more CSS classes, you
should use a CSS selector:
css_soup.select("p.strikeout.body")
In your case, you'd call it like this:
In [1588]: soup.select("ul.latestNews.j-scrollElement")
Out[1588]:
[<ul class="latestNews j-scrollElement" data-track-code="MW_Header_Latest News|MW_Header_Latest News_Facebook|MW_Header_Latest News_Twitter" data-track-query=".latestNews__headline a|a.icon--facebook|a.icon--twitter">
...
]
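The difference can be demonstrated on a minimal snippet (the markup below is a stand-in for the real page): in a CSS selector, dots chain classes on one element, while spaces mean "descendant", so the original attempt was looking for a tag literally named j-scrollElement:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the real page
html = '<ul class="latestNews j-scrollElement"><li>item</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Dots chain classes on the same element -> matches the <ul>
print(soup.select("ul.latestNews.j-scrollElement"))
# Spaces are descendant combinators -> looks for a <j-scrollElement> tag, finds nothing
print(soup.select("ul .latestNews j-scrollElement"))
```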
