First of all I'll say that, as my code comments are in Spanish, I'll try to explain them in English even though the code is pretty obvious and easy to understand. Don't feel insulted if I'm explaining things that are too obvious :)
So I'm trying to get all img from a website but it seems it just doesn't want to. I've read some similar articles but none seem to work.
import requests
from bs4 import BeautifulSoup as bs
import os
You can visit the web and see the html yourself.
# url de las imgs
url = 'https://dbz.space/cards/'
Here request the web page for it to be parsed
# descargamos la pagina para scrapear
page = requests.get(url)
soup = bs(page.text, 'html.parser')
Here I search for all the images with that class
# localizamos todas las imgs con esa clase
image_tags = soup.findAll("img", {"class": "thumb on"})
Here i just look if the folder imgs exist and if it doesn't then create one to then go inside it
# si no existe imgs lo creamos
if not os.path.exists('imgs'):
os.makedirs('imgs')
# cambiamos de directorio
os.chdir('imgs')
A variable for then naming all images
# para el nombre de la imagen
x = 0
And finally the saving process
# guardando imagenes
for image in image_tags:
try:
url = image['src']
response = requests.get(url)
if response.status_code == 200:
with open('img-' + str(x) + '.jpg', 'wb') as f:
f.write(requests.get(url).content)
f.close()
x += 1
print('Nueva imagen en carpeta')
except:
pass
So, the imgs on the web are inside a div tag and they have the class "thumb on" and they also contain the src (obviously) link which is the one I want to get to my folder called "imgs"
If all you want is the URL of the image file itself...
> <img class="thumb on"
> src="https://dbz.s3.amazonaws.com/v2/global/character/thumb/card_1011720_thumb.png">
Then simply...
yourBSobj.find("img", {"class": "thumb on"}).attrs['src']
I would use find_all() actually so you can iterate through a loop of images, do your processing/saving etc and then see your results afterwards.
First of all, as #cricket_007 said, the img tags are indeed loaded asynchronously by JavaScript. But, there is no need of using Selenium.
Upon inspection, you can see that each img tag is located inside this tag:
<div class="..." res="..." base="..." aim="" quantity="" release="1" imgur="x">
This tag is available in the source code (i.e. Not loaded by JavaScript). Here, we can get the x value which is a part of the imgur URL. One example:
<div class="..." res="1010160" base="1010161" aim="" quantity="" release="1" imgur="yK0wNs3">
After getting the imgur value, you can make the URL like this:
'https://i.imgur.com/{}.png'.format(imgur)
As the URL is https://i.imgur.com/yK0wNs3.png.
Complete code:
r = requests.get('https://dbz.space/cards/')
#soup = BeautifulSoup(r.text, 'lxml')
soup = bs(r.text, 'html.parser')
if not os.path.exists('imgs'):
os.makedirs('imgs')
os.chdir('imgs')
i = 0
for item in soup.find_all('div', imgur=True):
imgur = item['imgur']
if imgur:
r = requests.get('https://i.imgur.com/{}.png'.format(imgur))
with open('img-{}.jpg'.format(i), 'wb') as f:
f.write(r.content)
i += 1
Partial Output:
Note: I'm using f.write(r.content) and not f.write(requests.get(url).content). There's no need to send one more request.
So the error that poped saying File "pilla.py", line 6, in <module> soup = BeautifulSoup(r.text, 'lxml') NameError: name 'BeautifulSoup' is not defined
Is solved by changing on the variable soup BeautifulSoup for bs and lxlm for html.parser Complete code is right here:
import requests
from bs4 import BeautifulSoup as bs
import os
r = requests.get('https://dbz.space/cards/')
soup = bs(r.text, 'html.parser')
if not os.path.exists('imgs'):
os.makedirs('imgs')
os.chdir('imgs')
i = 0
for item in soup.find_all('div', imgur=True):
imgur = item['imgur']
if imgur:
r = requests.get('https://i.imgur.com/{}.png'.format(imgur))
with open('img-{}.jpg'.format(i), 'wb') as f:
f.write(r.content)
i += 1
Thank you all very much for the help. Really apreciate it :)
Related
I'm new to Python and need some help. I am trying to scrape the image urls from this site but can't seems to do so. I pull up all the html. Here is my code.
import requests
import pandas as pd
import urllib.parse
from bs4 import BeautifulSoup
import csv
baseurl = ('https://www.thewhiskyexchange.com/')
productlinks = []
for x in range(1,4):
r = requests.get(f'https://www.thewhiskyexchange.com/c/316/campbeltown-single-malt-scotch-whisky?pg={x}')
soup = BeautifulSoup(r.content, 'html.parser')
tag = soup.find_all('ul',{'class':'product-grid__list'})
for items in tag:
for link in items.find_all('a', href=True):
productlinks.append(baseurl + link['href'])
#print(len(productlinks))
for items in productlinks:
r = requests.get(items)
soup = BeautifulSoup(r.content, 'html.parser')
name = soup.find('h1', class_='product-main__name').text.strip()
price = soup.find('p', class_='product-action__price').text.strip()
imgurl = soup.find('div', class_='product-main__image-container')
print(imgurl)
And here is the piece of HTML I am trying to scrape from.
<div class="product-card__image-container"><img src="https://img.thewhiskyexchange.com/480/gstob.non1.jpg" alt="Glen Scotia Double Cask Sherry Finish" class="product-card__image" loading="lazy" width="3" height="4">
I would appreicate any help. Thanks
You need to first select the image then get the src attribute.
Try this:
imgurl = soup.find('div', class_='product-main__image-container').find('img')['src']
I'm not sure if I fully understand what output you are looking for. But if you just want the img source URLs, this might work:
# imgurl = soup.find('div', class_='product-main__image-container')
imgurl = soup.find('img', class_='product-main__image')
imgurl_attribute = imgurl['src']
print(imgurl_attribute[:5])
#https://img.thewhiskyexchange.com/900/gstob.non1.jpg
#https://img.thewhiskyexchange.com/900/gstob.15yov1.jpg
#https://img.thewhiskyexchange.com/900/gstob.18yov1.jpg
#https://img.thewhiskyexchange.com/900/gstob.25yo.jpg
#https://img.thewhiskyexchange.com/900/sets_gst1.jpg
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os
url = 'https://fr.indeed.com/jobs?q=data%20anlayst&l=france'
#grabbing page content and parsing it into html
def data_grabber(url):
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'html.parser')
job_soup = soup.find_all('div', {"class":"job_seen_beacon"})
return job_soup
def job_title(url):
titles = data_grabber(url)
for title in titles:
t = title.find_all('tbody')
return t
this is my source code, and im testing it out in jupyter notebook to make sure my functions work correctly but I've hit a small road block. My html soup from my first function works perfectly. It grabs all the info from indeed, especially the job_seen_beacon class.
Mr job_title function is wrong because it only outputs the first 'tbody' class it finds. refer to image here, I don't have enough points on stack
while for my data_grabber it returns every single job_seen_beacon. If you were able to scroll, you would easily see the multiple job_seen_beacon's.
I'm clearly missing something but I can't see it, any ideas?
What happens?
In moment you are return something from a function you leave it and that happens in first iteration.
Not sure where you will end up with your code, but you can do something like that:
def job_title(item):
title = item.select_one('h2')
return title.get_text('|',strip=True).split('|')[-1] if title else 'No Title'
Example
from bs4 import BeautifulSoup
import requests
url = 'https://fr.indeed.com/jobs?q=data%20anlayst&l=france'
#grabbing page content and parsing it into html
def data_grabber(url):
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'html.parser')
job_soup = soup.find_all('div', {"class":"job_seen_beacon"})
return job_soup
def job_title(item):
title = item.select_one('h2')
return title.get_text('|',strip=True).split('|')[-1] if title else 'No Title'
def job_location(item):
location = item.select_one('div.companyLocation')
return location.get_text(strip=True) if location else 'No Location'
data = []
for item in data_grabber(url):
data.append({
'title':job_title(item),
'companyLocation':job_location(item)
})
data
Output
[{'title': 'Chef de Projet Big Data H/F', 'companyLocation': 'Lyon (69)'},{'title': 'Chef de Projet Big Data F/H', 'companyLocation': 'Lyon 9e (69)'}]
I have a next link which represent an exact graph I want to scrape: https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1
I'm simply can't understand is it a xml or svg graph and how to scrape data. I think I need to use bs4, requests but don't know the way to do that.
Anyone could help?
You will load HTML like this:
import requests
url = "https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
resp = requests.get(url)
data = resp.text
Then you will create a BeatifulSoup object with this HTML.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, features="html.parser")
After this, it is usually very subjective how to parse out what you want. The candidate codes may vary a lot. This is how I did it:
Using BeautifulSoup, I parsed all "rect"s and check if "onmouseover" exists in that rect.
rects = soup.svg.find_all("rect")
yx_points = []
for rect in rects:
if rect.has_attr("onmouseover"):
text = rect["onmouseover"]
x_start_index = text.index("'") + 1
y_finish_index = text[x_start_index:].index("'") + x_start_index
yx = text[x_start_index:y_finish_index].split()
print(text[x_start_index:y_finish_index])
yx_points.append(yx)
As you can see from the image below, I scraped onmouseover= part and get those 02.2015 155,1 parts.
Here, this is how yx_points looks like now:
[['12.2009', '100,0'], ['01.2010', '101,8'], ['02.2010', '103,7'], ...]
from bs4 import BeautifulSoup
import requests
import re
#First get all the text from the url.
url="https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
response = requests.get(url)
html = response.text
#Find all the tags in which the data is stored.
soup = BeautifulSoup(html, 'lxml')
texts = soup.findAll("rect")
final = []
for each in texts:
names = each.get('onmouseover')
try:
q = re.findall(r"'(.*?)'", names)
final.append(q[0])
except Exception as e:
print(e)
#The details are appended to the final variable
Please bear with me. I am quite new at Python - but having a lot of fun. I am trying to code a web crawler that crawls through election results from the last referendum in Denmark. I have managed to extract all the relevant links from the main page. And now I want Python to follow each of the 92 links and gather 9 pieces of information from each of those pages. But I am so stuck. Hope you can give me a hint.
Here is my code:
import requests
import urllib2
from bs4 import BeautifulSoup
# This is the original url http://www.kmdvalg.dk/
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())
my_list = []
all_links = soup.find_all("a")
for link in all_links:
link2 = link["href"]
my_list.append(link2)
for i in my_list[1:93]:
print i
# The output shows all the links that I would like to follow and gather information from. How do I do that?
Here is my solution using lxml. It's similar to BeautifulSoup
import lxml
from lxml import html
import requests
page = requests.get('http://www.kmdvalg.dk/main')
tree = html.fromstring(page.content)
my_list = tree.xpath('//div[#class="LetterGroup"]//a/#href') # grab all link
print 'Length of all links = ', len(my_list)
my_list is a list consist of all links. And now you can use for loop to scrape information inside each page.
We can for loop through each links. Inside each page, you can extract information as example. This is only for the top table.
table_information = []
for t in my_list:
page_detail = requests.get(t)
tree = html.fromstring(page_detail.content)
table_key = tree.xpath('//td[#class="statusHeader"]/text()')
table_value = tree.xpath('//td[#class="statusText"]/text()') + tree.xpath('//td[#class="statusText"]/a/text()')
table_information.append(zip([t]*len(table_key), table_key, table_value))
For table below the page,
table_information_below = []
for t in my_list:
page_detail = requests.get(t)
tree = html.fromstring(page_detail.content)
l1 = tree.xpath('//tr[#class="tableRowPrimary"]/td[#class="StemmerNu"]/text()')
l2 = tree.xpath('//tr[#class="tableRowSecondary"]/td[#class="StemmerNu"]/text()')
table_information_below.append([t]+l1+l2)
Hope this help!
A simple approach would be to iterate through your list of urls and parse them each individually:
for url in my_list:
soup = BeautifulSoup(urllib2.urlopen(url).read())
# then parse each page individually here
Alternatively, you could speed things up significantly using Futures.
from requests_futures.sessions import FuturesSession
def my_parse_function(html):
"""Use this function to parse each page"""
soup = BeautifulSoup(html)
all_paragraphs = soup.find_all('p')
return all_paragraphs
session = FuturesSession(max_workers=5)
futures = [session.get(url) for url in my_list]
page_results = [my_parse_function(future.result()) for future in results]
This would be my solution for your problem
import requests
from bs4 import BeautifulSoup
def spider():
url = "http://www.kmdvalg.dk/main"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.findAll('div', {'class': 'LetterGroup'}):
anc = link.find('a')
href = anc.get('href')
print(anc.getText())
print(href)
# spider2(href) call a second function from here that is similar to this one(making url = to herf)
spider2(href)
print("\n")
def spider2(linktofollow):
url = linktofollow
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.findAll('tr', {'class': 'tableRowPrimary'}):
anc = link.find('td')
print(anc.getText())
print("\n")
spider()
its not done... i only get a simple element from the table but you get the idea and how its supposed to work.
Here is my final code that works smooth. Please let me know if I could have done it smarter!
import urllib2
from bs4 import BeautifulSoup
import codecs
f = codecs.open("eu2015valg.txt", "w", encoding="iso-8859-1")
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())
liste = []
alle_links = soup.find_all("a")
for link in alle_links:
link2 = link["href"]
liste.append(link2)
for url in liste[1:93]:
soup = BeautifulSoup(urllib2.urlopen(url).read().decode('iso-8859-1'))
tds = soup.findAll('td')
stemmernu = soup.findAll('td', class_='StemmerNu')
print >> f, tds[5].string,";",tds[12].string,";",tds[14].string,";",tds[16].string,";", stemmernu[0].string,";",stemmernu[1].string,";",stemmernu[2].string,";",stemmernu[3].string,";",stemmernu[6].string,";",stemmernu[8].string,";",'\r\n'
f.close()
I'm using BeautifulSoup to get a HTML page from IMDb, and I would like to extract the poster image from the page. I've got the image based on one of the attributes, but I don't know how to extract the data inside it.
Here's my code:
url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"):
print("inside FOR")
print(link.get('src'))
You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have got the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:
film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
link = soup.find(itemprop="image")
print(link["src"])
# output:
http://ia.media-imdb.com/images/M/MV5BMTg2ODMwNTY3NV5BMl5BanBnXkFtZTcwMzczNjEzMQ##._V1_SY317_CR0,0,214,317_.jpg
I've changed id to film_id, because id() is a built-in function, and it's bad practice to mask those.
I believe your example is very close. You need to use findAll() instead of find() and when you iterate, you switch from src to link. In the below example I switched it to tag
This code is working for me with BeautifulSoup4:
url = 'http://www.imdb.com/title/tt%s/' % (id,)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print "before FOR"
for tag in soup.findAll(itemprop="image"):
print "inside FOR"
print(tag['src'])
If I understand correctly you are looking for the src of the image, for the extraction of it after that.
In the first place you need to find (using the inspector) in which position in the HTML is the image. For example, in my particle case that I was scrapping soccer team shields, I needed:
m_url = 'http://www.marca.com/futbol/primera/equipos.html'
client = uOpen(m_url)
page = client.read()
client.close()
page_soup = BS(page, 'html.parser')
teams = page_soup.findAll('li', {'id': 'nombreEquipo'})
for team in teams:
name = team.h2.text
shield_url = team.img['src']
Then, you need to process the image. You have to options.
1st: using numpy:
def url_to_image(url):
'''
FunciĆ³n para extraer una imagen de una URL
'''
resp = uOpen(url)
image = np.asarray(bytearray(resp.read()), dtype='uint8')
image = cv2.imdecode(image, cv2.IMREAD_COLOR)
return image
shield = url_to_image(shield_url)
2nd Using scikit-image library (that you will probably need to install):
shield = io.imread('http:' + shield_url)
Note: Just in this particular example I needed to add http: at the beggining.
Hope it helps!
Here's a full working example with gazpacho:
Step 1 - import everything and download the html:
from pathlib import Path
from urllib.request import urlretrieve as download
from gazpacho import Soup
id = 'tt5057054'
url = f"https://www.imdb.com/title/{id}"
soup = Soup.get(url)
Step 2 - find the src url for the image asset:
image = (soup
.find("div", {"id": "title-overview"})
.find("div", {"class": "poster"})
.find("img")
.attrs['src']
)
Step 3 - save it to your machine:
directory = "images"
Path(directory).mkdir(exist_ok=True)
extension = image.split('.')[-1]
download(image, f"{directory}/{id}.{extension}")