Please pardon my ignorance, but I can't get my head around this. I had to create a new question because I realized I don't really know how to do this. How do I scrape the images from a webpage like this: https://www.jooraccess.com/r/products?token=feba69103f6c9789270a1412954cf250 ? I have experience with BeautifulSoup, but as far as I understand I need some other package here? soup.find("div", class_="PhotoBreadcrumb_...6uHZm") doesn't work. A sample of the markup I'm after:
<div class="PhotoBreadcrumb_PhotoBreadcrumb__14D_N ProductCard_photoBreadCrumb__6uHZm">
<img src="https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Iris_floral02.jpg" alt="Breadcrumb">
<div class="PhotoBreadcrumb_breadcrumbContainer__2cALf" data-testid="breadcrumbContainer">
<div data-position="0" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="1" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="2" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="3" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="4" class="PhotoBreadcrumb_active__2T6z2 PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="5" class="PhotoBreadcrumb_dot__2PbsQ"></div>
</div>
</div>
BeautifulSoup parses the HTML you get back from an HTTP request; on this page the products are loaded from an API, so you should:
1. Send an HTTP request to the target API with the requests module (with appropriate headers).
2. Read the JSON data of the response.
3. Iterate over the list of products.
4. For each product, get the imageUrls.
5. Send a new request for each image URL in your list.
6. Save the image.
Code:
Note: you should update the cookie in the headers to get a response.
import requests
from os.path import basename
from urllib.parse import urlparse

URL = 'https://atlas-main.kube.jooraccess.com/graphql'

headers = {
    "accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7",
    "Connection": "keep-alive",
    "content-type": "application/json",
    "Cookie": "_hjSessionUser_686103=eyJpZCI6ImY4MTZjN2YxLWJlYmQtNTg2ZC1iYmRkLTllYjdhNGQzNmVjYiIsImNyZWF0ZWQiOjE2NDYxMTkwMDUyODcsImV4aXN0aW5nIjp0cnVlfQ==; _hjSession_686103=eyJpZCI6ImM5MWJmOGRhLTcwZDEtNGQ2ZS04MzA1LTQ4NWNlYTYzZGMwNSIsImNyZWF0ZWQiOjE2NDYxMjc3MDQ5MjgsImluU2FtcGxlIjp0cnVlfQ==; _hjAbsoluteSessionInProgress=0; mp_2e072c90929b30e1ea5d9fd56399f106_mixpanel=%7B%22distinct_id%22%3A%20%2217f4456c057375-062236d0c47071-a3e3164-144000-17f4456c05857f%22%2C%22%24device_id%22%3A%20%2217f4456c057375-062236d0c47071-a3e3164-144000-17f4456c05857f%22%2C%22%24initial_referrer%22%3A%20%22%24direct%22%2C%22%24initial_referring_domain%22%3A%20%22%24direct%22%2C%22accountId%22%3A%20null%2C%22canShop%22%3A%20false%2C%22canTransact%22%3A%20false%2C%22canViewAssortments%22%3A%20false%2C%22canViewDataPortal%22%3A%20false%2C%22userId%22%3A%20null%2C%22accountUserId%22%3A%20null%2C%22isAdmin%22%3A%20false%2C%22loggedAsAdmin%22%3A%20false%2C%22retailerSettings%22%3A%20false%2C%22assortmentPlanning%22%3A%20false%2C%22accountType%22%3A%201%7D",
    "Host": "atlas-main.kube.jooraccess.com",
    "Origin": "https://www.jooraccess.com",
    "Referer": "https://www.jooraccess.com/",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-site",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
}

# Content-Length is omitted on purpose: requests computes it automatically.
result = requests.get(URL, headers=headers).json()  # headers must be passed as a keyword argument
data = result["data"]["public"]["collectionProductsByShareToken"]["edges"]

for d in data:
    img_urls = d["product"]["imageUrls"]
    for img_url in img_urls:
        img_name = basename(urlparse(img_url).path)
        response = requests.get(img_url, stream=True)
        if not response.ok:
            print(response)
            continue
        with open(img_name, 'wb') as handle:
            for block in response.iter_content(1024):
                handle.write(block)
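As an aside on the original soup.find attempt: the hashed suffixes in class names like PhotoBreadcrumb_PhotoBreadcrumb__14D_N are generated by the front-end build and can change, so when you do have the markup in hand it is safer to match on a stable substring of the class. A sketch using the snippet from the question (this only helps once the HTML actually contains the element; on this site the product grid is rendered client-side, which is why the JSON approach is needed):

```python
from bs4 import BeautifulSoup

# The breadcrumb markup quoted in the question
html = '''<div class="PhotoBreadcrumb_PhotoBreadcrumb__14D_N ProductCard_photoBreadCrumb__6uHZm">
<img src="https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Iris_floral02.jpg" alt="Breadcrumb">
</div>'''

soup = BeautifulSoup(html, "html.parser")
# A substring match survives changes to the hashed suffix (__14D_N)
card = soup.select_one('div[class*="PhotoBreadcrumb_PhotoBreadcrumb"]')
print(card.img["src"])
```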
I want to scrape the information from this page:
https://databases.usatoday.com/nfl-arrests/
Each of the arrests is listed in a table on the page under the CSS selector #csp-data. I can see this in the page's source as well: <div id="csp-data" class="csp-data"></div>, but there is nothing in between those tags for me to parse.
When I run the following code, I get no results.
import requests
from bs4 import BeautifulSoup
url = "https://databases.usatoday.com/nfl-arrests/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
test = soup.select('#csp-data > div > div:nth-child(3) > div > div.table-responsive > table > tbody')
print(test)
If I use test = soup.select('#csp-data'), I get back <div class="csp-data" id="csp-data"></div>. If I move to the next step, #csp-data > div, I get no results.
I'm assuming that the data isn't loaded yet when requests fetches the page, but I'm not sure. When I open the page in my browser and use inspect element, I can see that the table has loaded.
Does anyone have an idea on how I could move forward here?
The table is filled in by an AJAX call; here is working code that replays it:
import requests

body = 'action=cspFetchTable&security=3193d24eb0&pageID=10&blogID=&sortBy=Date&sortOrder=desc&page=1&searches={}&heads=true'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded'}
url = 'https://databases.usatoday.com/wp-admin/admin-ajax.php'

r = requests.post(url, data=body, headers=headers)
tables = r.json()['data']['Result']

for table in tables:
    print(table['First_name'])
Output:
Bradley
Deonte
Barkevious
Darius
Jarron
Tamorrion
Zaven
Frank
Justin
Aldon
Jeff
Marshon
Broderick
Frank
Jaydon
Kevin
Kemah
Chad
Isaiah
Rashard
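The request above only fetches page=1. If you need every page of arrests, the same form body can be rebuilt per page and POSTed in a loop; a sketch of just the body construction (the field names are copied from the body string above, but note the security token is issued by the page and may expire):

```python
from urllib.parse import urlencode

def build_body(page, security="3193d24eb0"):
    # Same fields the browser sends; only the page number changes per request
    fields = {
        "action": "cspFetchTable",
        "security": security,
        "pageID": "10",
        "blogID": "",
        "sortBy": "Date",
        "sortOrder": "desc",
        "page": str(page),
        "searches": "{}",
        "heads": "true",
    }
    return urlencode(fields)

body = build_body(2)
print(body)
```

Each body would then go into requests.post(url, data=body, headers=headers) exactly as above, stopping once ['data']['Result'] comes back empty (an assumption about how the endpoint signals the last page).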
I'm having some trouble scraping the names from a <div> that is nested inside another <div>. (My code works on a completely different part of the page, even though I tried to search for the specific card-body.)
https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500
I need this part:
<div class="card-body p-0">
<div class="row no-gutters py-1 px-3">
<div class="col col-lg order-lg-1 text-nowrap text-ellipsis">
example
Even though I find names, they are not from the list I want. Does anybody know how to locate them?
I'm using BeautifulSoup and lxml. Part of my code:
from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500').text
soup = BeautifulSoup(html_text, 'lxml')
itemlocator = soup.find('div', class_='card-body p-0')

for items in itemlocator:
    print(items)
The following script should produce the available names that you see on that page. However, it seems you are only after the container in which Commander is available. In that case, you can try the approach below to get the desired portion, which is concise and efficient compared with your current attempt.
import requests
from bs4 import BeautifulSoup
link = 'https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
html_text = requests.get(link, headers=headers)
soup = BeautifulSoup(html_text.text, 'lxml')

item = soup.select_one(".card-body > .no-gutters a[href^='/name/Commander']")
item_text = item.get_text(strip=True)
datetime = item.find_parent().find_parent().select_one("time").get("datetime")
print(item_text, datetime)
Output:
Commander 2021-03-19T13:10:40.000Z
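If the goal is every name in that container rather than just Commander's row, the same card-body can be walked with one CSS selector. A sketch against the row structure posted in the question (a self-contained sample with two rows; the live page may nest things differently):

```python
from bs4 import BeautifulSoup

# Two rows shaped like the markup quoted in the question
html = '''<div class="card-body p-0">
  <div class="row no-gutters py-1 px-3">
    <div class="col col-lg order-lg-1 text-nowrap text-ellipsis">example</div>
  </div>
  <div class="row no-gutters py-1 px-3">
    <div class="col col-lg order-lg-1 text-nowrap text-ellipsis">abc</div>
  </div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
# One name cell per row: the first .col with the order-lg-1 class
names = [col.get_text(strip=True)
         for col in soup.select(".card-body .row > .col.order-lg-1")]
print(names)  # ['example', 'abc']
```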
I tried to scrape a certain piece of information from a webpage but failed miserably. The text I wish to grab is available in the page source, but I still can't fetch it. This is the site address. I'm after the portion visible in the image as Not Rated.
Relevant html:
<div class="subtext">
Not Rated
<span class="ghost">|</span> <time datetime="PT188M">
3h 8min
</time>
<span class="ghost">|</span>
Drama,
Musical,
Romance
<span class="ghost">|</span>
<a href="/title/tt0150992/releaseinfo?ref_=tt_ov_inf" title="See more release dates">18 June 1999 (India)
</a> </div>
I've tried with:
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)
I get None using the script above.
Expected output:
Not Rated
How can I get the rating from that webpage?
To get the correct version of the HTML page, also specify the Accept-Language HTTP header:
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    s.headers['Accept-Language'] = 'en-US,en;q=0.5'  # <-- specify this too!
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)
Prints:
Not Rated
There is a better way to get info from the page. If you dump the HTML content returned by the request:
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    with open("response.html", "w", encoding=r.encoding) as file:
        file.write(r.text)
you will find an element <script type="application/ld+json"> that contains all the information about the movie.
Then you simply get the element's text, parse it as JSON, and use the JSON to extract the info you want.
Here is a working example:
import json
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    movie_data = soup.find("script", attrs={"type": "application/ld+json"}).next  # the content of <script type="application/ld+json">
    movie_data = json.loads(movie_data)  # parse the data as json
    content_rating = movie_data["contentRating"]  # get the rating
    print(content_rating)
IMDb is one of those webpages that makes it incredibly easy to do web scraping, and I love it. What they do to make it easy for web scrapers is put a script at the top of the HTML that contains the whole movie object in JSON format.
So to get all the relevant information and organize it, you simply need to get the content of that single script tag and convert it to JSON; then you can ask for specific information as you would with a dictionary.
import requests
import json
from bs4 import BeautifulSoup
#This part is basically the same as yours
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
r = requests.get(link)
soup = BeautifulSoup(r.content,"lxml")
#Why not get the whole json element of the movie?
script = soup.find('script', {"type" : "application/ld+json"})
element = json.loads(script.text)
print(element['contentRating'])
#Outputs "Not Rated"
# You can also inspect the rest of the json; it has all the relevant information inside
# Just -> print(json.dumps(element, indent=2))
Note:
Headers and session are not necessary in this example.
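The ld+json pattern in miniature, runnable offline (the JSON below is a minimal stand-in for IMDb's real movie object, not its actual contents):

```python
import json
from bs4 import BeautifulSoup

# Stub page with a ld+json block shaped like the one IMDb embeds
html = '''<html><head>
<script type="application/ld+json">{"@type": "Movie", "contentRating": "Not Rated"}</script>
</head></html>'''

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", {"type": "application/ld+json"})
element = json.loads(script.string)  # the tag's text is plain JSON
print(element["contentRating"])  # Not Rated
```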
So I'm trying to retrieve Bitcoin prices from CoinMarketCap.com.
I'm using Python along with requests and bs4.
import requests
from bs4 import BeautifulSoup

link = "https://coinmarketcap.com/currencies/bitcoin/"
header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}

data = requests.get(link, headers=header)
soup = BeautifulSoup(data.content, 'html.parser')
bitcoinPrice = soup.find(id="quote_price")
print(bitcoinPrice)
When I run the script, I get the following result, with some additional markup that I don't want. I just want the Bitcoin price.
<span data-currency-price="" data-usd="9806.68980398" id="quote_price">
<span class="h2 text-semi-bold details-panel-item--price__value" data-currency-value="">9806.69</span>
<span class="text-large" data-currency-code="">USD</span>
</span>
How do I extract the Bitcoin price from that chunk of data?
I believe this should give you what you want:
bitcoinPrice.span.contents[0]
which contains
'9808.16'
bitcoinPrice = soup.find("span", class_="details-panel-item--price__value").text
This is another way using css selector.
print(soup.select_one('.details-panel-item--price__value').text)
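All three suggestions applied to the exact fragment posted in the question; note the outer span's data-usd attribute also carries the price at higher precision:

```python
from bs4 import BeautifulSoup

# The <span id="quote_price"> fragment from the question
html = '''<span data-currency-price="" data-usd="9806.68980398" id="quote_price">
<span class="h2 text-semi-bold details-panel-item--price__value" data-currency-value="">9806.69</span>
<span class="text-large" data-currency-code="">USD</span>
</span>'''

soup = BeautifulSoup(html, "html.parser")
quote = soup.find(id="quote_price")
print(quote.find("span").get_text())  # displayed price: 9806.69
print(quote["data-usd"])              # full-precision value from the attribute
```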
You can use the official API under the basic (free) plan and simply add your API key into the code below. Code example updated from here.
from requests import Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest'
parameters = {
    'id': '1'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': 'api_key',
}

session = Session()
session.headers.update(headers)

try:
    response = session.get(url, params=parameters)
    data = json.loads(response.text)
    #print(data)
    print(data['data']['1']['quote']['USD']['price'])
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
I'm trying to scrape a site that contains the following html code:
<div class="content-sidebar-wrap"><main class="content"><article class="post-773 post type-post status-publish format-standard has-post-thumbnail category-money entry" itemscope itemtype="http://schema.org/CreativeWork">
This contains the data I'm interested in... I've tried using BeautifulSoup to parse it, but it returns the following:
<div class="content-sidebar-wrap"><main class="content"><article class="entry">
<h1 class="entry-title">Not found, error 404</h1><div class="entry-content"><p>The page you are looking for no longer exists. Perhaps you can return back to the site's homepage and see if you can find what you are looking for. Or, you can try finding it by using the search form below.</p><form action="http://www.totalsportek.com/" class="search-form" itemprop="potentialAction" itemscope="" itemtype="http://schema.org/SearchAction" method="get" role="search">
# I've made small modifications to make it readable
The BeautifulSoup element doesn't contain my desired code. I'm not too familiar with HTML, but I'm assuming this makes a call to some external service that returns the data? I've read this has something to do with Schema.
Is there any way I can access this data?
You need to specify the User-Agent header when making a request. Working example that prints the article header and the content as well:
import requests
from bs4 import BeautifulSoup
url = "http://www.totalsportek.com/money/barcelona-player-salaries/"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36"})
soup = BeautifulSoup(response.content, "html.parser")
article = soup.select_one(".content article.post.entry.status-publish")
header = article.header.get_text(strip=True)
content = article.select_one(".entry-content").get_text(strip=True)
print(header)
print(content)