Use Beautiful Soup to extract the src of an img inside an a tag - Python

I've been trying to get beautiful soup to extract the image files (pokemon card images) from this page:
https://www.pokellector.com/sets/EVO-Evolutions
The code below only prints the src of a few buttons, and I can't manage to extract all the image sources.
for a in soupimages.find_all('a'):
    if a.img:
        if a.img.has_attr('src'):
            print(a.img['src'])

Looks like all the card image thumbnails are formatted like this:
<img class="card lazyload" data-src="https://.../Caterpie.EVO.5.thumb.png">
Find the <img> elements with class='card' and you should get the card image URLs.
from bs4 import BeautifulSoup
import requests
url = "https://www.pokellector.com/sets/EVO-Evolutions"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for img in soup.find_all('img', class_="card"):
    print(img.get('data-src'))
Output:
https://den-cards.pokellector.com/197/Venusaur-EX.EVO.1.thumb.png
https://den-cards.pokellector.com/197/M-Venusaur-EX.EVO.2.thumb.png
...
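
If some thumbnails are lazy-loaded (URL in data-src) and others are not (URL in src), a fallback can cover both cases. This is only a minimal sketch on a made-up HTML snippet: the function name card_image_urls and the example.com URLs are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup


def card_image_urls(html):
    """Return image URLs of <img class="card"> tags, preferring data-src over src."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img", class_="card"):
        # Lazy-loaded images keep the real URL in data-src; fall back to src.
        url = img.get("data-src") or img.get("src")
        if url:
            urls.append(url)
    return urls


# Hypothetical markup, shaped like the thumbnail shown above.
sample = """
<a href="/card/1"><img class="card lazyload" data-src="https://example.com/cards/one.thumb.png"></a>
<a href="/card/2"><img class="card" src="https://example.com/cards/two.thumb.png"></a>
<a href="/"><img class="button" src="https://example.com/button.png"></a>
"""

print(card_image_urls(sample))
```

The button image is skipped because it does not have the card class.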

Related

How to find image current source in html using Python

when extracting image from a website I use below command to get the URL:
image = soup.findAll('img')
image_link = image["src"]
But as the picture shows, there is no complete link for saving the image. My question is: what is the 'current source' shown in the picture, and how can I extract the link from it?
soup.findAll() returns a list of elements, so indexing it with "src" fails. Iterate over the "image" variable and read the "src" attribute of each element.
If the src values are relative URLs, resolve them by calling requests.compat.urljoin(url, src) on each one.
Try something like this:
import requests
from bs4 import BeautifulSoup
# sample base url for testing
url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for img in soup.findAll('img'):
    src = img.get("src")
    if src:
        # resolve any relative urls to absolute urls using base URL
        src = requests.compat.urljoin(url, src)
        print(">>", src)
Output:
...
>> https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
>> https://en.wikipedia.org/static/images/footer/wikimedia-button.png
Without resolving relative URLs in the example above, the last URL would instead be "/static/images/footer/wikimedia-button.png".
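
To see what the resolution step does for different kinds of src values, here is a small sketch using urllib.parse.urljoin from the standard library, which is the function that requests.compat.urljoin re-exports. The protocol-relative, page-relative and example.org inputs are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Main_Page"

# Root-relative path: replaces the whole path of the base URL.
print(urljoin(base, "/static/images/footer/wikimedia-button.png"))
# https://en.wikipedia.org/static/images/footer/wikimedia-button.png

# Protocol-relative URL: only the scheme is taken from the base.
print(urljoin(base, "//upload.wikimedia.org/logo.png"))
# https://upload.wikimedia.org/logo.png

# Page-relative path: resolved against the directory of the base URL.
print(urljoin(base, "img/a.png"))
# https://en.wikipedia.org/wiki/img/a.png

# Already absolute: returned unchanged.
print(urljoin(base, "https://example.org/a.png"))
# https://example.org/a.png
```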
In your case you can scrape the images like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sfmta.com/getting-around/drive-park/how-avoid-parking-tickets'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for image in soup.find_all('img'):
    image_link = requests.compat.urljoin(url, image.get('src'))
    print(image_link)
OUTPUT:
https://www.sfmta.com/sites/all/themes/clients-theme/logo.png
https://www.sfmta.com/sites/default/files/imce-images/repair-40cd8d7db439deac706e161cd89ea3cc.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-a30c0bf7f9e9f1fcf5a4c6b69548c46b.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-cca5688579bf809ecb49daed5fab030a.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-8c6467ecb4673775240576524e4c5bc6.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-02709e2cecd6edde21a728562995764f.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-86826c5eeae51535f527f5a1a56a80fb.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-db285d8e0abc5e28e53f75a1a99d4a0b.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-73faffef0e5f0f36e0295e573dea1381.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-4eb40baaa405c6cb3e8379d5693c2941.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-f8fdee3388b83ec5eac01ff7c93a923e.jpg
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
But I recommend restricting the search to the body container (a class like field-body) in the soup, or checking whether the link contains imce-images.
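
Both filters could look roughly like this. It is a sketch on a made-up snippet, not the real page: the function name content_images, its parameters, and the file name repair-1.jpg are assumptions, and the class name field-body is only the guess mentioned above.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def content_images(html, base_url, container_class=None, must_contain=None):
    """Collect absolute image URLs, optionally limited to a container class
    and/or to URLs containing a given substring."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup
    if container_class is not None:
        found = soup.find(class_=container_class)
        if found is None:
            return []  # the container is not on this page
        container = found
    urls = []
    # src=True keeps only <img> tags that actually have a src attribute.
    for img in container.find_all("img", src=True):
        full = urljoin(base_url, img["src"])
        if must_contain is None or must_contain in full:
            urls.append(full)
    return urls


url = "https://www.sfmta.com/getting-around/drive-park/how-avoid-parking-tickets"
# Hypothetical markup for illustration only.
sample = """
<img src="/sites/all/themes/clients-theme/logo.png">
<div class="field-body">
  <img src="/sites/default/files/imce-images/repair-1.jpg">
  <img src="/sites/all/themes/clients-theme/images/icons/application-pdf.png">
</div>
"""

print(content_images(sample, url, container_class="field-body"))
print(content_images(sample, url, must_contain="imce-images"))
```

The container filter drops the logo but keeps the PDF icon, while the substring filter keeps only the uploaded image, so which one fits depends on what the page actually contains.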

How can I get the url of "src=..."?

I am trying to loop through all the img classes here but I am not sure how I can get the src= link
import requests
from bs4 import BeautifulSoup
url = 'https://giphy.com/search/anxiety'
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
gifs = soup.findAll("img", attrs={"class": "giphy-gif-img"})
for gif in gifs:
    print(gif.get('image-src'))
In your last line, you can use gif.get('src').
However, gifs is empty since there are no images with class=giphy-gif-img on the page.
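
The two problems are independent, which a small offline sketch can show. The HTML string and the class name thumb here are made up for illustration and are not what giphy.com serves.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one image, but not with the class being searched for.
html = '<div><img class="thumb" src="https://example.com/a.gif"></div>'
soup = BeautifulSoup(html, "html.parser")

# Problem 1: no tag has this class, so the result is an empty list
# and a loop over it never runs.
gifs = soup.find_all("img", attrs={"class": "giphy-gif-img"})
print(len(gifs))  # 0

# Problem 2: 'image-src' is not an attribute of the tag, so get() returns None;
# the attribute that holds the URL is 'src'.
img = soup.find("img")
print(img.get("image-src"))  # None
print(img.get("src"))        # https://example.com/a.gif
```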

BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class='ff_line' id='gi_344258949_1'> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
# attrs must be a dict ({'class': ...}), not a set ({'class', ...})
div = soup.find('div', attrs={'class': 'seq gbff'})
if div is not None:
    for each in div.children:
        print(each)
print(soup.find_all('span', attrs={'class': 'ff_line'}))
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data, so the spans are not in the HTML that requests receives.
Using DevTools in Chrome/Firefox, I found this URL, whose response contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this URL in the page's HTML, because different pages use different arguments in the URL. Alternatively, compare a few URLs and work out the pattern so you can generate the URL yourself.
EDIT: if you change retmode=html to retmode=xml in the URL, you get the data as XML. With retmode=text you get plain text without HTML tags. retmode=json doesn't work.
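
Swapping the retmode value can be done without editing the string by hand. The sketch below only rewrites the query string of the URL quoted above; the helper name with_retmode is an assumption, and nothing here is fetched, so how the server responds is not checked.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# URL found with DevTools (copied from above).
viewer_url = (
    "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein"
    "&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on"
    "&tool=portal&log$=seqview&maxdownloadsize=1000000"
)


def with_retmode(url, retmode):
    """Return the same URL with its retmode query parameter replaced."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Keep every parameter as it is, except retmode.
    pairs = [(key, retmode if key == "retmode" else value) for key, value in pairs]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


text_url = with_retmode(viewer_url, "text")
print(text_url)
```

The result could then be passed to requests.get(text_url).text. Note that urlencode percent-encodes the $ in log$, which should be equivalent for the server but is an untested assumption.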

Python: get CAPTCHA file from HTML

I'm trying to decode a CAPTCHA in Python, but I don't know how to get it from the HTML.
I use
html = session.get(page, headers=headers).text
soup = BeautifulSoup(html, "html.parser")
And html looks like
<img src="/captcha.gif" style="width:1px;height:1px"/>
How can I extract it? Is saving the image the only way to do it?
You can download the image to your PC like this:
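import os
import urllib.request
from bs4 import BeautifulSoup as BS
tag = '<img src="/captcha.gif" style="width:1px;height:1px"/>'
# name the parser explicitly to avoid the "no parser specified" warning
soup = BS(tag, 'html.parser')
img_tag = soup.find('img')
# 'https://absolute/path/to' is a placeholder for the site's base URL
urllib.request.urlretrieve('https://absolute/path/to' + img_tag['src'], os.path.join(os.getcwd(), 'temp_img'))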
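
If you would rather not write a file, the image bytes can also be kept in memory. This is a rough sketch that assumes session is a requests.Session, as the question's session.get(...) call suggests; the function names and the example.com page URL are made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def captcha_url(html, page_url):
    """Return the absolute URL of the first <img> in html, or None."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img")
    if img is None or not img.get("src"):
        return None
    # Resolve a relative src such as /captcha.gif against the page URL.
    return urljoin(page_url, img["src"])


def fetch_captcha_bytes(session, html, page_url, headers=None):
    """Download the captcha and return its raw bytes without saving a file.

    Assumes `session` behaves like requests.Session.
    """
    url = captcha_url(html, page_url)
    if url is None:
        return None
    response = session.get(url, headers=headers)
    response.raise_for_status()
    return response.content


html = '<img src="/captcha.gif" style="width:1px;height:1px"/>'
print(captcha_url(html, "https://example.com/login"))
```

Using the same session for the page and for the image matters when the captcha is tied to a cookie. The returned bytes can be wrapped in io.BytesIO if a file-like object is needed.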

How to get website images src using BeautifulSoup

I'm trying to get the src/hyperlink of every image from a webpage
import requests
from bs4 import BeautifulSoup
image_list = []
r = requests.get('https://example.com')
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all('img'):
    image_list.append(link)
You can read the attributes of an HTML tag with the get function. Pass it the name of the attribute you want to extract from the tag:
for link in soup.find_all('img'):
    image_list.append(link.get('src'))
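
Since get returns None for an img tag without a src, the list can end up containing None entries. One way to avoid that, sketched here on a made-up HTML string, is to let find_all match only tags that have the attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second image has no src attribute.
html = '<img src="a.png"><img alt="no source"><img src="b.png">'
soup = BeautifulSoup(html, "html.parser")

# src=True matches only <img> tags that carry a src attribute.
image_list = [img["src"] for img in soup.find_all("img", src=True)]
print(image_list)  # ['a.png', 'b.png']
```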
