Can't scrape category titles from a webpage - python

I've written a scraper in Python to get the category names from a webpage, but it doesn't fetch anything from that page. I can't figure out where I'm going wrong. Any help would be greatly appreciated.
Here is the link to the webpage: URL
Here is what I've tried so far:
from bs4 import BeautifulSoup
import requests
res = requests.get("replace_with_above_url",headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select('.slide_container .h3.standardTitle'):
    print(items.text)
Here is the element containing one of the category names I'm after:
<div class="slide_container">
<a href="/offers/furniture/" tabindex="0">
<picture style="float: left; width: 100%;"><img style="width:100%" src="/_m4/9/8/1513184943_4413.jpg" data-w="270"></picture>
<div class="floated-details inverted" style="height: 69px;">
<div class="h3 margin-top-sm margin-bottom-sm standardTitle">
Furniture Offers #This is the name I'm after
</div>
<p class="carouselDesc">
</p>
</div>
</a>
</div>

from bs4 import BeautifulSoup
import requests
headers = {
'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding':'gzip, deflate, br',
'accept-language':'en-US,en;q=0.9',
'cache-control':'max-age=0',
'referer':'https://www.therange.co.uk/',
'upgrade-insecure-requests':'1',
'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
}
res = requests.get("https://www.therange.co.uk/",headers=headers)
soup = BeautifulSoup(res.text,'html.parser')
for items in soup.select('.slide_container .h3.standardTitle'):
    print(items.text)
Try this.
A User-Agent alone is not always enough, because some servers inspect the other request headers too. If headers that a browser normally sends are missing, the server may treat you as a bot.
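To see which headers would actually go out, you can prepare a request without sending it. This is only a sketch: the URL is a placeholder and the browser-like header values are illustrative, not values any particular site is known to require.

```python
import requests

# Placeholder URL, used only to build the request; nothing is sent.
url = "https://example.com/"

# Illustrative browser-like headers (assumed values).
browser_like = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

with requests.Session() as session:
    # prepare_request() merges the session defaults with the per-request headers
    default_prepared = session.prepare_request(requests.Request("GET", url))
    custom_prepared = session.prepare_request(
        requests.Request("GET", url, headers=browser_like)
    )

print(dict(default_prepared.headers))  # what requests sends on its own
print(dict(custom_prepared.headers))   # what it sends with the extra headers
```

By default the User-Agent identifies the client as python-requests and there is no Accept-Language at all, which is an easy thing for a server to filter on.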

Use "html.parser" instead of "lxml"
soup = BeautifulSoup(res.text,"html.parser")
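To rule out the selector itself, it can be checked offline against the markup posted in the question. A minimal sketch, assuming the live page serves the same structure as the trimmed snippet below:

```python
from bs4 import BeautifulSoup

# Markup copied from the question, trimmed to the relevant elements.
html = """
<div class="slide_container">
  <a href="/offers/furniture/" tabindex="0">
    <div class="floated-details inverted">
      <div class="h3 margin-top-sm margin-bottom-sm standardTitle">
        Furniture Offers
      </div>
    </div>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Same selector as in the question
titles = [
    item.get_text(strip=True)
    for item in soup.select(".slide_container .h3.standardTitle")
]
print(titles)
```

Since the selector matches here, an empty result on the live page points at the response rather than the selector. Printing res.status_code and the first few hundred characters of res.text would show what the server actually sent.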

Related

How to scrape images from webpage using BeautifulSoup?

Please pardon my ignorance, but I can't get my head around this. I had to create a new question because I realized that I don't really know how to do this. How can I scrape the images from a webpage like this one: https://www.jooraccess.com/r/products?token=feba69103f6c9789270a1412954cf250 ? I have some experience with BeautifulSoup, but as far as I understand, I need to use some other package here. soup.find("div", class_="PhotoBreadcrumb_...6uHZm") doesn't work.
<div class="PhotoBreadcrumb_PhotoBreadcrumb__14D_N ProductCard_photoBreadCrumb__6uHZm">
<img src="https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Iris_floral02.jpg" alt="Breadcrumb">
<div class="PhotoBreadcrumb_breadcrumbContainer__2cALf" data-testid="breadcrumbContainer">
<div data-position="0" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="1" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="2" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="3" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="4" class="PhotoBreadcrumb_active__2T6z2 PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="5" class="PhotoBreadcrumb_dot__2PbsQ"></div>
</div>
BeautifulSoup only parses the HTML you get back from an HTTP request. In your case the product data comes back as JSON, so you should:
1. Send an HTTP request to the target website with the requests module (with appropriate headers).
2. Select the JSON data of the response.
3. Iterate over the list of products.
4. For each product, get its img_urls.
5. Send a new request for each image in your list of URLs.
6. Save the image.
Code:
Note: you should update the cookie in the headers to get a response.
import requests
from os.path import basename
from urllib.parse import urlparse
URL = 'https://atlas-main.kube.jooraccess.com/graphql'
headers = {"accept": "*/*","Accept-Encoding": "gzip, deflate, br","Accept-Language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7","Connection": "keep-alive","Content-Length": "2249","content-type": "application/json","Cookie":"_hjSessionUser_686103=eyJpZCI6ImY4MTZjN2YxLWJlYmQtNTg2ZC1iYmRkLTllYjdhNGQzNmVjYiIsImNyZWF0ZWQiOjE2NDYxMTkwMDUyODcsImV4aXN0aW5nIjp0cnVlfQ==; _hjSession_686103=eyJpZCI6ImM5MWJmOGRhLTcwZDEtNGQ2ZS04MzA1LTQ4NWNlYTYzZGMwNSIsImNyZWF0ZWQiOjE2NDYxMjc3MDQ5MjgsImluU2FtcGxlIjp0cnVlfQ==; _hjAbsoluteSessionInProgress=0; mp_2e072c90929b30e1ea5d9fd56399f106_mixpanel=%7B%22distinct_id%22%3A%20%2217f4456c057375-062236d0c47071-a3e3164-144000-17f4456c05857f%22%2C%22%24device_id%22%3A%20%2217f4456c057375-062236d0c47071-a3e3164-144000-17f4456c05857f%22%2C%22%24initial_referrer%22%3A%20%22%24direct%22%2C%22%24initial_referring_domain%22%3A%20%22%24direct%22%2C%22accountId%22%3A%20null%2C%22canShop%22%3A%20false%2C%22canTransact%22%3A%20false%2C%22canViewAssortments%22%3A%20false%2C%22canViewDataPortal%22%3A%20false%2C%22userId%22%3A%20null%2C%22accountUserId%22%3A%20null%2C%22isAdmin%22%3A%20false%2C%22loggedAsAdmin%22%3A%20false%2C%22retailerSettings%22%3A%20false%2C%22assortmentPlanning%22%3A%20false%2C%22accountType%22%3A%201%7D","Host": "atlas-main.kube.jooraccess.com","Origin": "https://www.jooraccess.com","Referer": "https://www.jooraccess.com/","sec-ch-ua-mobile": "?0","sec-ch-ua-platform": "Windows","Sec-Fetch-Dest": "empty","Sec-Fetch-Mode": "cors","Sec-Fetch-Site": "same-site","User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",}
# Pass headers by keyword; the second positional argument of requests.get() is params
result = requests.get(URL, headers=headers).json()
data = result["data"]["public"]["collectionProductsByShareToken"]["edges"]
for d in data:
    img_urls = d["product"]["imageUrls"]
    for img_url in img_urls:
        img_name = basename(urlparse(img_url).path)
        # Stream the image so it is written to disk in chunks
        response = requests.get(img_url, stream=True)
        if not response.ok:
            print(response)
            continue  # skip images that failed to download
        with open(img_name, 'wb') as handle:
            for block in response.iter_content(1024):
                if not block:
                    break
                handle.write(block)

Scraping <div> inside a <div>

I'm having trouble scraping the names from a <div> that is nested inside another <div>. (My code returns a completely different part of the page, even though I searched for a specific card-body.)
https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500
I need this part:
<div class="card-body p-0">
<div class="row no-gutters py-1 px-3">
<div class="col col-lg order-lg-1 text-nowrap text-ellipsis">
example
Even though I find names, they are not from the list I want. Does anybody know how to locate them?
I'm using BeautifulSoup and lxml. Part of my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500').text
soup = BeautifulSoup(html_text, 'lxml')
itemlocator = soup.find('div', class_='card-body p-0')
for items in itemlocator:
    print(items)
The script below targets the list of available names that you see on that page. Since it seems you are only after the container in which Commander appears, it selects just that entry, which is more concise and efficient than your current attempt.
import requests
from bs4 import BeautifulSoup
link = 'https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
html_text = requests.get(link,headers=headers)
soup = BeautifulSoup(html_text.text,'lxml')
item = soup.select_one(".card-body > .no-gutters a[href^='/name/Commander']")
item_text = item.get_text(strip=True)
datetime = item.find_parent().find_parent().select_one("time").get("datetime")
print(item_text,datetime)
Output:
Commander 2021-03-19T13:10:40.000Z
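The reason the original attempt picked up the wrong names is probably that find() returns only the first element whose class matches. The sketch below reproduces this offline. The markup is made up for illustration, assuming the real page has more than one "card-body p-0" container, as the question suggests.

```python
from bs4 import BeautifulSoup

# Made-up markup: two containers share the same classes.
html = """
<div class="card-body p-0">
  <div class="row no-gutters py-1 px-3"><a href="/name/Trending">Trending</a></div>
</div>
<div class="card-body p-0">
  <div class="row no-gutters py-1 px-3">
    <div class="col"><a href="/name/Commander">Commander</a></div>
    <div class="col"><time datetime="2021-03-19T13:10:40.000Z">soon</time></div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, which here is the wrong container
first = soup.find("div", class_="card-body p-0")
first_names = [a.get_text(strip=True) for a in first.select("a")]

# An attribute selector on the link pins down the wanted row directly
link = soup.select_one(".card-body > .no-gutters a[href^='/name/Commander']")
row = link.find_parent("div", class_="row")
when = row.select_one("time").get("datetime")

print(first_names)
print(link.get_text(strip=True), when)
```

Here find_parent("div", class_="row") is used instead of chaining two find_parent() calls, so the lookup does not depend on how deeply the link is nested.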

Unable to scrape the real-time price of bitcoin using beautifulsoup

I'm trying to scrape the real-time price of bitcoin. The price changes almost every 5 seconds on the website, but in my code it never updates and stays the same as the first price scraped. Can you help me understand why this is happening?
import time
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
for i in range(100):
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.text, 'html.parser')
    price = soup.find('span', attrs={"class": "cmc-details-panel-price__price"})
    print(price)
    time.sleep(20)
My output:
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
The site uses live updates, I guess via some JavaScript. Every time you load the page you get the same initial value, and then the page fires a trigger to update it. Since requests can't wait for or interact with the JavaScript on the site, it always gets the value that was in the HTML at load time.
My advice is to use an API; it's more efficient than scraping websites.
The first Google search gives: https://www.coindesk.com/coindesk-api as a free Bitcoin API.
See if their API endpoint: https://api.coindesk.com/v1/bpi/currentprice.json
gives what you need, and then just parse the JSON.
Edit: Read the terms on their page.
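Parsing the JSON could look roughly like the sketch below. The payload is a hand-written sample with made-up values, and its layout (a "bpi" object keyed by currency code, each with a "rate_float") is an assumption about the response that should be checked against a real one.

```python
import json

# Sample payload with an assumed layout and made-up values.
sample = """
{
  "time": {"updatedISO": "2021-03-19T13:10:00+00:00"},
  "chartName": "Bitcoin",
  "bpi": {
    "USD": {"code": "USD", "rate": "58,654.4000", "rate_float": 58654.4},
    "EUR": {"code": "EUR", "rate": "49,230.1000", "rate_float": 49230.1}
  }
}
"""

def price(payload, currency="USD"):
    """Return the numeric rate for one currency from a parsed payload."""
    return payload["bpi"][currency]["rate_float"]

payload = json.loads(sample)
usd = price(payload)
print(f"{usd:,.2f}")
```

Against the live endpoint, payload would come from requests.get(url).json() instead of json.loads(sample), and the call could go inside the same polling loop as in the question.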
An API is better, but if you still want to scrape it, here's how you can do it using Google search results:
import time, requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
# check 100 times (or use a while loop instead)
for _ in range(100):
    html = requests.get('https://www.google.com/search?q=bitcoin+usd', headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    print(soup.select_one('.SwHCTb').text)
    time.sleep(20)  # sleep for price to change
Output:
58,654.40
58,654.40
58,654.40
58,654.40
58,594.20
58,594.20
58,594.20
58,586.30
58,586.30
...
Alternatively, you can get this information by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference, in this case, is that you don't have to figure out how to bypass blocks from Google.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "bitcoin usd",
"gl": "us",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
print(results['answer_box']['result'])
# 60,571.40 United States Dollar
Disclaimer, I work for SerpApi.

Unable to parse a rating information from a webpage using requests

I tried to scrape a certain piece of information from a webpage but failed miserably. The text I wish to grab is available in the page source, but I still can't fetch it. This is the site address. I'm after the portion visible in the image as Not Rated.
Relevant html:
<div class="subtext">
Not Rated
<span class="ghost">|</span> <time datetime="PT188M">
3h 8min
</time>
<span class="ghost">|</span>
Drama,
Musical,
Romance
<span class="ghost">|</span>
<a href="/title/tt0150992/releaseinfo?ref_=tt_ov_inf" title="See more release dates">18 June 1999 (India)
</a> </div>
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)
I get None using the script above.
Expected output:
Not Rated
How can I get the rating from that webpage?
If you want to get the correct version of the HTML page, specify the Accept-Language HTTP header:
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    s.headers['Accept-Language'] = 'en-US,en;q=0.5' # <-- specify also this!
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)
Prints:
Not Rated
There is a better way to get information from the page. First, dump the HTML content returned by the request:
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    with open("response.html", "w", encoding=r.encoding) as file:
        file.write(r.text)
In the dump you will find an element <script type="application/ld+json"> which contains all the information about the movie.
Then you simply get the element's text, parse it as JSON, and use the result to extract the info you want.
Here is a working example:
import json
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    # Find the <script type="application/ld+json"> element and get its content
    movie_data = soup.find("script", attrs={"type": "application/ld+json"}).string
    movie_data = json.loads(movie_data)  # parse the JSON text into a dict
    content_rating = movie_data["contentRating"]  # get the rating
    print(content_rating)
IMDb is one of those sites that makes web scraping incredibly easy, and I love it. What they do to make it easy for scrapers is put a script near the top of the HTML that contains the whole movie object as JSON.
So to get all the relevant information in an organized form, you simply take the content of that single script tag and parse it as JSON. Then you can ask for specific fields just as you would with a dictionary.
import requests
import json
from bs4 import BeautifulSoup
#This part is basically the same as yours
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
r = requests.get(link)
soup = BeautifulSoup(r.content,"lxml")
#Why not get the whole json element of the movie?
script = soup.find('script', {"type" : "application/ld+json"})
element = json.loads(script.text)
print(element['contentRating'])
# Outputs "Not Rated"
# You can also inspect the rest of the JSON; it has all the relevant information inside
# Just -> print(json.dumps(element, indent=2))
Note:
Headers and session are not necessary in this example.

Scraping Schema with Beautiful Soup?

I'm trying to scrape a site that contains the following html code:
<div class="content-sidebar-wrap"><main class="content"><article
class="post-773 post type-post status-publish format-standard has-post-
thumbnail category-money entry" itemscope
itemtype="http://schema.org/CreativeWork">
This contains data I'm interested in... I've tried using BeautifulSoup to parse it, but it returns the following instead:
<div class="content-sidebar-wrap"><main class="content"><article
class="entry">
<h1 class="entry-title">Not found, error 404</h1><div class="entry-content
"><p>"The page you are looking for no longer exists. Perhaps you can return
back to the site's "homepage and
see if you can find what you are looking for. Or, you can try finding it
by using the search form below.</p><form
action="http://www.totalsportek.com/" class="search-form"
itemprop="potentialAction" itemscope=""
itemtype="http://schema.org/SearchAction" method="get" role="search">
# I've made small modifications to make it readable
The BeautifulSoup element doesn't contain my desired code. I'm not too familiar with HTML, but I'm assuming this makes a call to some external service that returns the data..? I've read this has something to do with Schema.
Is there any way I can access this data?
You need to specify the User-Agent header when making a request. Working example that prints the article header and the content as well:
import requests
from bs4 import BeautifulSoup
url = "http://www.totalsportek.com/money/barcelona-player-salaries/"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36"})
soup = BeautifulSoup(response.content, "html.parser")
article = soup.select_one(".content article.post.entry.status-publish")
header = article.header.get_text(strip=True)
content = article.select_one(".entry-content").get_text(strip=True)
print(header)
print(content)
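The original attempt parsed an error page without noticing, so it can help to check the response status before handing the body to BeautifulSoup. This is a sketch under assumptions: it presumes the server also sets a 4xx status on the "Not found" page, which the question does not confirm. The hand-built Response objects, including setting the private _content attribute, are only a stand-in so the example runs offline.

```python
import requests

def fetch_html(response):
    """Return the body text, raising if the server reported an error status."""
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text

# Hand-built responses stand in for real ones (illustration only).
ok = requests.Response()
ok.status_code = 200
ok.encoding = "utf-8"
ok._content = b"<article>real page</article>"

missing = requests.Response()
missing.status_code = 404
missing.encoding = "utf-8"
missing._content = b"<h1>Not found, error 404</h1>"

print(fetch_html(ok))
try:
    fetch_html(missing)
except requests.HTTPError as exc:
    print("request failed:", exc)
```

In the script above, the same check would be a single response.raise_for_status() call right after requests.get().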
