How can I get the url of "src=..."?


I am trying to loop through all the img tags with a given class here, but I am not sure how to get the src= link.
import requests
from bs4 import BeautifulSoup
url = 'https://giphy.com/search/anxiety'
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
gifs = soup.findAll("img", attrs={"class": "giphy-gif-img"})
for gif in gifs:
    print(gif.get('image-src'))

In your last line, you can use gif.get('src').
However, gifs is empty since there are no images with class=giphy-gif-img on the page.

Related

use beautiful soup to extract src inside image inside a

I've been trying to get beautiful soup to extract the image files (pokemon card images) from this page:
https://www.pokellector.com/sets/EVO-Evolutions
The code below only prints the src of a few buttons; I can't manage to extract all the image sources.
for a in soupimages.find_all('a'):
    if a.img:
        if a.img.has_attr('src'):
            print(a.img['src'])

Looks like all the card image thumbnails are formatted like this:
<img class="card lazyload" data-src="https://.../Caterpie.EVO.5.thumb.png">
Find the <img> elements with class='card' and you should get the card image URLs.
from bs4 import BeautifulSoup
import requests
url = "https://www.pokellector.com/sets/EVO-Evolutions"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for img in soup.find_all('img', class_="card"):
    print(img.get('data-src'))
Output:
https://den-cards.pokellector.com/197/Venusaur-EX.EVO.1.thumb.png
https://den-cards.pokellector.com/197/M-Venusaur-EX.EVO.2.thumb.png
...
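To illustrate why looking only at src misses lazy-loaded images, here is a minimal sketch that prefers data-src and falls back to src. It uses the standard library's html.parser instead of Beautiful Soup so it runs without extra packages; the class name ImageSourceCollector and the sample markup are assumptions made up for the example, not taken from the real page.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect one URL per <img>, preferring data-src over src."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Lazy-loaded images keep the real URL in data-src; plain ones use src.
        url = attributes.get("data-src") or attributes.get("src")
        if url:
            self.sources.append(url)


# Hypothetical sample markup, not copied from the real page.
sample_html = """
<img class="card lazyload" data-src="https://example.com/cards/1.thumb.png">
<img class="button" src="/images/button.png">
<img class="spacer">
"""

collector = ImageSourceCollector()
collector.feed(sample_html)
print(collector.sources)
```

With Beautiful Soup the same idea would be img.get('data-src') or img.get('src') inside the loop.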

Beautiful Soup can't find most of the tags

I am trying to scrape this page https://ntrs.nasa.gov/search .
I am using the code below and Beautiful Soup is finding only 3 tags when there are many more. I have tried the html5lib, lxml and html.parser parsers, but none of them worked.
Can you advise what might be the problem please?
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# Set the URL
url = 'https://ntrs.nasa.gov/search'
# Connect to the URL
response = requests.get(url)
# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.content, "html5lib")
# soup = BeautifulSoup(response.text, "html5lib")
# soup = BeautifulSoup(response.content, "html.parser")
# soup = BeautifulSoup(response.content, "lxml")
# loop through all a-tags
for a_tag in soup.findAll('a'):
    if 'title' in a_tag:
        if a_tag['title'] == 'Download Document':
            link = a_tag['href']
            download_url = 'https://ntrs.nasa.gov' + link
            urllib.request.urlretrieve(download_url,'./'+link[link.find('/citations/')+1:11])

The data is pulled in dynamically from a script tag. You can regex out the JavaScript object that contains the download URL, replace the escaped HTML entities, parse the result as JSON, then extract the desired URL:
import requests, re, json
r = requests.get('https://ntrs.nasa.gov/search')
data = json.loads(re.search(r'(\{.*/api.*\})', r.text).group(1).replace('&q;','"'))
print('https://ntrs.nasa.gov' + data['http://ntrs-proxy-auto-deploy:3001/citations/search']['results'][0]['downloads'][0]['links']['pdf'])
You could append ?attachment=true, but I don't think that is required.
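The same regex-then-JSON approach can be tried offline on a small string. This is only a sketch: the sample_page markup, the '/api/search' key and the PDF path are made-up stand-ins, and the real page's structure may differ or change.

```python
import json
import re

# Hypothetical stand-in for the page source: a script tag holding JSON whose
# double quotes were escaped as "&q;". The key and path below are made up.
sample_page = (
    '<script id="state" type="application/json">'
    '{&q;/api/search&q;: {&q;results&q;: [{&q;pdf&q;: '
    '&q;/api/citations/1/downloads/a.pdf&q;}]}}'
    '</script>'
)

# Grab everything from the first "{" to the last "}" that mentions /api.
match = re.search(r'(\{.*/api.*\})', sample_page)
if match is None:
    raise ValueError("embedded JSON object not found")

# Undo the quote escaping, then parse the text as JSON.
state = json.loads(match.group(1).replace('&q;', '"'))
pdf_path = state['/api/search']['results'][0]['pdf']
print(pdf_path)
```

Checking for a failed match first gives a clearer error than the AttributeError you would otherwise get from calling .group() on None.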

Your problem stems from the fact that the page is rendered using JavaScript, and the actual page source contains only a few script and style tags.
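One way to check this for yourself is to count the tags in the raw HTML before blaming the parser. The sketch below does that with the standard library; the TagCounter class and the raw_html sample are assumptions for illustration, standing in for response.text from the real request.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag appears in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


# Hypothetical shell of a JavaScript-rendered page (stand-in for response.text).
raw_html = """
<html><head><style>body{}</style><script src="/main.js"></script></head>
<body><app-root></app-root><script>window.state = {};</script></body></html>
"""

counter = TagCounter()
counter.feed(raw_html)
print(counter.counts["a"], counter.counts["script"])
```

If the count for a is zero or tiny while script is present, the links are most likely built in the browser and are not in the HTML that requests receives.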

How to separate links of images on the basis of content inside them in beautifulsoup4

I am new to BeautifulSoup4 and I am trying to fetch all image links from a site, e.g. Unsplash, but I only want URLs that contain the word "photo", e.g.
https://images.unsplash.com/photo-1541892079-2475b9253785?ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=60
I don't want URLs that contain the word "profile", e.g.
https://images.unsplash.com/profile-1508728808608-d3781b017e73?dpr=1&auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff
How can I do this? I am using Python 3.6 and urllib3.

You can use this script as an example of how to filter the links:
import requests
from bs4 import BeautifulSoup
url = 'https://unsplash.com'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for img in soup.find_all('img'):
    # .get() avoids a KeyError for <img> tags without a src attribute
    if 'photo' in img.get('src', ''): # print only links with `photo` inside them
        print(img['src'])
Prints:
https://images.unsplash.com/photo-1597649260558-e2bd7d35f043?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format%2Ccompress&fit=crop&w=1000&h=1000
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
With urllib:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://unsplash.com'
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')
for img in soup.find_all('img'):
    # .get() avoids a KeyError for <img> tags without a src attribute
    if 'photo' in img.get('src', ''):
        print(img['src'])
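A plain substring test also matches URLs where "photo" only appears in the query string. If that matters, a stricter check on the path is possible. The sketch below is one option; the helper name is_photo_url and the third URL are hypothetical.

```python
from urllib.parse import urlparse


def is_photo_url(url):
    """Return True when the last path segment of the URL starts with 'photo'."""
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    return last_segment.startswith("photo")


urls = [
    "https://images.unsplash.com/photo-1541892079-2475b9253785"
    "?ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=60",
    "https://images.unsplash.com/profile-1508728808608-d3781b017e73"
    "?dpr=1&auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff",
    # Hypothetical URL: "photo" appears only in the query string.
    "https://images.unsplash.com/profile-1?ref=photo",
]

photo_urls = [u for u in urls if is_photo_url(u)]
for u in photo_urls:
    print(u)
```

Only the first URL passes, whereas 'photo' in url would also accept the third one.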

Extracting image links using BeautifulSoup

I'm trying to extract image links from the GoT wiki page.
The first two links work fine, but the second two give me a 404 error code.
I'm trying to find out what I'm doing wrong.
I've tried different combinations to come up with the right link.
import requests
from bs4 import BeautifulSoup
import urllib
import urllib.request as request
import re
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
soup = BeautifulSoup(html_contents, 'html.parser')
# Find all a tags in the soup
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img:
        print('http:/'+a.img['src'])
# And here are the images on the page
http:///upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
http:///upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Game_of_Thrones_2011_logo.svg/300px-Game_of_Thrones_2011_logo.svg.png
http://static/images/wikimedia-button.png
http://static/images/poweredby_mediawiki_88x31.png
The first two links work, but I want to get the second two links to work as well.

Thanks for the help. I kept it simple. Here is what worked for me:
# Find all a tags in the soup
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img:
        if a.img['src'][:2] == '//':
            print('https:'+a.img['src'])
        else:
            print('https://en.wikipedia.org/'+a.img['src'])
# And here are the images on the page

These URLs start with /, so they have no domain, and you have to prepend https://en.wikipedia.org to get full URLs like https://en.wikipedia.org/static/images/wikimedia-button.png
More or less:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        if src.startswith('http'):
            print(src)
        elif src.startswith('//'):
            print('https:' + src)
        elif src.startswith('/'):
            print('https://en.wikipedia.org' + src)
        else:
            print('https://en.wikipedia.org/w/' + src)
EDIT: you can also use urllib.parse.urljoin()
import requests
from bs4 import BeautifulSoup
import urllib.parse
url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        print(urllib.parse.urljoin('https://en.wikipedia.org', src))
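To see what urljoin() does with each kind of src, here is a small offline sketch. It joins against the full page URL rather than the bare domain, so relative paths resolve the way a browser would; the last two entries in sources are hypothetical examples, not values from the page.

```python
from urllib.parse import urljoin

page_url = ("https://en.wikipedia.org/w/index.php"
            "?title=List_of_Game_of_Thrones_episodes&oldid=802553687")

sources = [
    # Protocol-relative: inherits the scheme of the page.
    "//upload.wikimedia.org/wikipedia/en/thumb/e/e7/"
    "Cscr-featured.svg/20px-Cscr-featured.svg.png",
    # Root-relative: inherits scheme and host.
    "/static/images/wikimedia-button.png",
    # Already absolute (hypothetical): returned unchanged.
    "https://example.org/logo.png",
    # Relative to the page's directory (hypothetical).
    "images/local.png",
]

resolved = [urljoin(page_url, src) for src in sources]
for full_url in resolved:
    print(full_url)
```

This covers the same four cases as the if/elif chain without any manual string handling.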

Python: get CAPTCHA file from html

I am trying to decode a CAPTCHA in Python, but I don't know how to get it from the HTML.
I use:
html = session.get(page, headers=headers).text
soup = BeautifulSoup(html, "html.parser")
And the HTML looks like this:
<img src="/captcha.gif" style="width:1px;height:1px"/>
How can I extract it? Can I only do it by saving the image?

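You can save the image to your PC like this:
import os
import urllib.request
from bs4 import BeautifulSoup as BS
tag = '<img src="/captcha.gif" style="width:1px;height:1px"/>'
# Name the parser explicitly to avoid a "no parser specified" warning
soup = BS(tag, "html.parser")
img_tag = soup.find('img')
# Replace the placeholder prefix with the site's real base URL
urllib.request.urlretrieve('https://absolute/path/to'+img_tag['src'], os.getcwd() + '/temp_img')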
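Saving to disk is not the only option: the image can also be read straight into memory as bytes. The sketch below shows one way, under some assumptions: fetch_image_bytes and fake_opener are hypothetical helpers, https://example.com/login is a placeholder page URL, and the fake opener stands in for the network so the example runs offline.

```python
import io
import urllib.request
from urllib.parse import urljoin


def fetch_image_bytes(page_url, src, opener=urllib.request.urlopen):
    """Resolve src against page_url and return the image content as bytes."""
    image_url = urljoin(page_url, src)
    with opener(image_url) as response:
        return response.read()


# Offline demonstration with a stand-in opener (hypothetical, no network).
requested = []


def fake_opener(url):
    requested.append(url)
    return io.BytesIO(b"GIF89a")


data = fetch_image_bytes("https://example.com/login", "/captcha.gif",
                         opener=fake_opener)
print(requested[0], len(data))
```

If the CAPTCHA is tied to your login session, you would probably fetch it through the same requests session instead, e.g. session.get(image_url, headers=headers).content, which also returns bytes.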
