I'm trying to scrape a site that contains the following html code:
<div class="content-sidebar-wrap"><main class="content"><article
class="post-773 post type-post status-publish format-standard has-post-
thumbnail category-money entry" itemscope
itemtype="http://schema.org/CreativeWork">
This contains the data I'm interested in. I've tried using BeautifulSoup to parse it, but it returns the following instead:
<div class="content-sidebar-wrap"><main class="content"><article
class="entry">
<h1 class="entry-title">Not found, error 404</h1><div class="entry-content
"><p>"The page you are looking for no longer exists. Perhaps you can return
back to the site's "homepage and
see if you can find what you are looking for. Or, you can try finding it
by using the search form below.</p><form
action="http://www.totalsportek.com/" class="search-form"
itemprop="potentialAction" itemscope=""
itemtype="http://schema.org/SearchAction" method="get" role="search">
# I've made small modifications to make it readable
The BeautifulSoup element doesn't contain my desired code. I'm not too familiar with HTML, but I'm assuming this makes a call to some external service that returns the data..? I've read this has something to do with Schema.
Is there any way I can access this data?
You need to specify the User-Agent header when making a request. Working example that prints the article header and the content as well:
import requests
from bs4 import BeautifulSoup
url = "http://www.totalsportek.com/money/barcelona-player-salaries/"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36"})
soup = BeautifulSoup(response.content, "html.parser")
article = soup.select_one(".content article.post.entry.status-publish")
header = article.header.get_text(strip=True)
content = article.select_one(".entry-content").get_text(strip=True)
print(header)
print(content)
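To see why the header matters: by default, requests announces itself as python-requests, which many sites (this one included, apparently) answer with an error or 404 page. A quick way to check what your script sends when you don't set the header yourself:

```python
import requests

# requests identifies itself as "python-requests/<version>" unless you
# override it; bot-blocking sites key off this User-Agent string
default_ua = requests.utils.default_headers()["User-Agent"]
print(default_ua)  # e.g. python-requests/2.31.0
```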
Please pardon my ignorance, but I can't get my head around this. I had to create a new question because I've realized I don't really know how to do this. How do I scrape the images from a webpage like this: https://www.jooraccess.com/r/products?token=feba69103f6c9789270a1412954cf250 ? I have experience with BeautifulSoup, but as far as I understand, I need to use some other package here? soup.find("div", class_="PhotoBreadcrumb_...6uHZm") doesn't work:
<div class="PhotoBreadcrumb_PhotoBreadcrumb__14D_N ProductCard_photoBreadCrumb__6uHZm">
<img src="https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Iris_floral02.jpg" alt="Breadcrumb">
<div class="PhotoBreadcrumb_breadcrumbContainer__2cALf" data-testid="breadcrumbContainer">
<div data-position="0" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="1" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="2" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="3" class="PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="4" class="PhotoBreadcrumb_active__2T6z2 PhotoBreadcrumb_dot__2PbsQ"></div>
<div data-position="5" class="PhotoBreadcrumb_dot__2PbsQ"></div>
</div>
BeautifulSoup is for parsing the HTML you get back after sending an HTTP request. In your case you should:
1. Send an HTTP request to your target website with the requests module (with appropriate headers).
2. Select the JSON data of the response.
3. Iterate over the list of products.
4. For each product, get the img_urls.
5. Send a new request for each image in your list of URLs.
6. Save the image.
Code:
Note: you should update the cookie in the headers to get a response.
import requests
from os.path import basename
from urllib.parse import urlparse
URL = 'https://atlas-main.kube.jooraccess.com/graphql'
headers = {"accept": "*/*","Accept-Encoding": "gzip, deflate, br","Accept-Language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7","Connection": "keep-alive","Content-Length": "2249","content-type": "application/json","Cookie":"_hjSessionUser_686103=eyJpZCI6ImY4MTZjN2YxLWJlYmQtNTg2ZC1iYmRkLTllYjdhNGQzNmVjYiIsImNyZWF0ZWQiOjE2NDYxMTkwMDUyODcsImV4aXN0aW5nIjp0cnVlfQ==; _hjSession_686103=eyJpZCI6ImM5MWJmOGRhLTcwZDEtNGQ2ZS04MzA1LTQ4NWNlYTYzZGMwNSIsImNyZWF0ZWQiOjE2NDYxMjc3MDQ5MjgsImluU2FtcGxlIjp0cnVlfQ==; _hjAbsoluteSessionInProgress=0; mp_2e072c90929b30e1ea5d9fd56399f106_mixpanel=%7B%22distinct_id%22%3A%20%2217f4456c057375-062236d0c47071-a3e3164-144000-17f4456c05857f%22%2C%22%24device_id%22%3A%20%2217f4456c057375-062236d0c47071-a3e3164-144000-17f4456c05857f%22%2C%22%24initial_referrer%22%3A%20%22%24direct%22%2C%22%24initial_referring_domain%22%3A%20%22%24direct%22%2C%22accountId%22%3A%20null%2C%22canShop%22%3A%20false%2C%22canTransact%22%3A%20false%2C%22canViewAssortments%22%3A%20false%2C%22canViewDataPortal%22%3A%20false%2C%22userId%22%3A%20null%2C%22accountUserId%22%3A%20null%2C%22isAdmin%22%3A%20false%2C%22loggedAsAdmin%22%3A%20false%2C%22retailerSettings%22%3A%20false%2C%22assortmentPlanning%22%3A%20false%2C%22accountType%22%3A%201%7D","Host": "atlas-main.kube.jooraccess.com","Origin": "https://www.jooraccess.com","Referer": "https://www.jooraccess.com/","sec-ch-ua-mobile": "?0","sec-ch-ua-platform": "Windows","Sec-Fetch-Dest": "empty","Sec-Fetch-Mode": "cors","Sec-Fetch-Site": "same-site","User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",}
result = requests.get(URL, headers=headers).json()  # the headers (notably the cookie) are required to get a response
data = result["data"]["public"]["collectionProductsByShareToken"]["edges"]
for d in data:
    img_urls = d["product"]["imageUrls"]
    for img_url in img_urls:
        # derive a file name from the URL path
        img_name = basename(urlparse(img_url).path)
        # stream the image and write it to disk in chunks
        response = requests.get(img_url, stream=True)
        if not response.ok:
            print(response)
            continue
        with open(img_name, 'wb') as handle:
            for block in response.iter_content(1024):
                handle.write(block)
I'm having some trouble scraping the names in a <div> that is nested inside another <div>. (It works with a completely different part of the page, even though I tried to search for a specific card-body.)
https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500
I need this part:
<div class="card-body p-0">
<div class="row no-gutters py-1 px-3">
<div class="col col-lg order-lg-1 text-nowrap text-ellipsis">
example
Even though I find names, they are not from the list I want. Does anybody know how to locate them?
I'm using BeautifulSoup and lxml. Part of my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500').text
soup = BeautifulSoup(html_text, 'lxml')
itemlocator = soup.find('div', class_='card-body p-0')
for items in itemlocator:
    print(items)
The following script should produce the available names that you see on that page. However, it seems you are only after the container in which Commander is available. In that case, you can try the approach below to get the desired portion, which is concise and efficient compared to your current attempt.
import requests
from bs4 import BeautifulSoup
link = 'https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
html_text = requests.get(link,headers=headers)
soup = BeautifulSoup(html_text.text,'lxml')
item = soup.select_one(".card-body > .no-gutters a[href^='/name/Commander']")
item_text = item.get_text(strip=True)
datetime = item.find_parent().find_parent().select_one("time").get("datetime")
print(item_text,datetime)
Output:
Commander 2021-03-19T13:10:40.000Z
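The `[href^='...']` attribute selector used above matches elements whose attribute value starts with the given prefix. A self-contained illustration on a minimal snippet mirroring that page's structure (markup simplified, class names taken from the answer):

```python
from bs4 import BeautifulSoup

# minimal stand-in for the page structure targeted above (markup simplified)
html = """
<div class="card-body p-0">
  <div class="row no-gutters">
    <a href="/name/Commander">Commander</a>
    <time datetime="2021-03-19T13:10:40.000Z"></time>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# [href^='/name/'] matches anchors whose href starts with that prefix;
# the > combinator requires .no-gutters to be a direct child of .card-body
item = soup.select_one(".card-body > .no-gutters a[href^='/name/']")
print(item.get_text(strip=True))  # Commander
```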
I would like to scrape the Amazon top 10 bestsellers in baby products.
I want just the title text, but it seems I have a problem:
I'm getting None when I run this code.
After getting "result" I want to iterate over it using "content" and print the titles.
Thanks!
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
url = "https://www.amazon.com/gp/bestsellers/baby-products"
r=requests.get(url, headers=headers)
print("status: ", r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
print("url: ", r.url)
result = soup.find("ol", {"id": "zg-ordered-list"})
content = result.findAll("div", {"class": "a-section a-spacing-none aok-relative"})
print(result)
print(content)
You won't be able to scrape the Amazon website this way. You are using requests.get to fetch the HTTP response body of the URL provided. Pay attention to what that response actually is (e.g. by print(r.content)). What you see in your web browser is different from the raw HTTP response because of the client-side rendering technologies Amazon uses (typically JavaScript, among others).
I advise you to use Selenium, which sort of "emulates" a typical browser inside the Python runtime, renders the site the way a normal browser would, and lets you access the properties of the same website you see in your web browser.
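To confirm the gap yourself, compare r.content against what the browser's inspector shows. A minimal sketch of what a JavaScript-rendered page typically returns as raw HTML (the markup here is hypothetical, for illustration only):

```python
from bs4 import BeautifulSoup

# hypothetical raw body of a client-side-rendered page: an empty mount
# point plus a script tag; the product list only exists after JS runs
raw = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
soup = BeautifulSoup(raw, "html.parser")

# the element you can see in the browser's inspector is absent
# from the raw HTML, so find() returns None
print(soup.find("ol", {"id": "zg-ordered-list"}))  # None
```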
I am trying to scrape the results data from this website (https://www.ufc.com/matchup/908/7717/post) and I am completely at a loss for why my proposed solution isn't working.
The outer html that I am trying to scrape is <h4 class="e-t5 winner">Jon Jones</h4>. I don't have a lot of experience with web scraping or HTML but all of the relevant information is contained in the h4 tag.
I have been successful in extracting the data from the h2 tag but I am confused as to why the same approach doesn't work for h4. For example, to extract the relevant data from <h2 class="field--name-name name_given red">Jon Jones <span class="field--field-rank rank"></span></h2> the following code works.
from requests import get
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
}
raw_html = get('https://www.ufc.com/matchup/908/7717/post', headers=headers)
html = BeautifulSoup(raw_html.content, 'html.parser')
# this works
html.find_all('h2', attrs={'class': 'field--name-name name_given red'})[0].get_text().strip()
# this does not work?
html.find_all('h4', attrs={'class': 'e-t5 winner red'})
# this code gets me to the headers but not the actual listed data inside
html.find('div', attrs={'class': 'l-flex--4col-2to4'})
I am mostly confused as to why the above doesn't work and why the text I can see when inspecting the element in my browser, doesn't appear in the scraped HTML.
It is added dynamically. You can find the source in the network tab. Assuming there is always one winner you can use something like
import requests
r = requests.get('https://dvk92099qvr17.cloudfront.net/V1/908/Fnt.json').json()
winner = [fighter['FullName'] for fighter in r['FMLiveFeed']['Fights'][0]['Fighters'] if fighter['Outcome'] == 'Win'][0]
print(winner)
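The list comprehension above just filters the fight's Fighters array for the entry whose Outcome is 'Win'. Here it is on a minimal sample of that feed shape (field names taken from the answer, values invented):

```python
# minimal sample mimicking the Fnt.json feed shape (values invented)
fights = [{"Fighters": [
    {"FullName": "Fighter A", "Outcome": "Loss"},
    {"FullName": "Fighter B", "Outcome": "Win"},
]}]

# keep only fighters whose Outcome is "Win", then take the first
winner = [f["FullName"] for f in fights[0]["Fighters"] if f["Outcome"] == "Win"][0]
print(winner)  # Fighter B
```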
I am trying to fetch the stock price of a company specified by the user as input. I am using requests to get the source code and BeautifulSoup to scrape. I am fetching the data from google.com and trying to fetch only the last stock price (806.93 in the picture). When I run my script, it prints None. None of the data is being fetched. What am I missing?
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
company = raw_input("Enter the company name:")
URL = "https://www.google.co.in/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q="+company+"+stock"
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
print code.contents[0]
The source code of the page looks like this :
Looks like that source is from inspecting the element, not the actual source. A couple of suggestions. Use google finance to get rid of some noise - https://www.google.com/finance?q=googl would be the URL. On that page there is a section that looks like this:
<div class=g-unit>
<div id=market-data-div class="id-market-data-div nwp g-floatfix">
<div id=price-panel class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span id="ref_694653_l">806.93</span>
</span>
<div class="id-price-change nwp">
<span class="ch bld"><span class="chg" id="ref_694653_c">+9.68</span>
<span class="chg" id="ref_694653_cp">(1.21%)</span>
</span>
</div>
</div>
You should be able to pull the number out of that.
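For instance, a quick sketch of pulling the price out with BeautifulSoup, using the relevant part of the snippet above inlined so it runs standalone:

```python
from bs4 import BeautifulSoup

# the price markup from the excerpt above, inlined for a self-contained demo
html = '<span class="pr"><span id="ref_694653_l">806.93</span></span>'
soup = BeautifulSoup(html, "html.parser")

# the price lives in the span nested inside <span class="pr">
price = soup.select_one("span.pr span").get_text()
print(price)  # 806.93
```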
I went to
https://www.google.com/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q=+google+stock
, did a right click and "View Page Source" but did not see the code that you screenshotted.
Then I typed out a section of your code screenshot and created a BeautifulSoup object with it and then ran your find on it:
test_screenshot = BeautifulSoup('<div class="_F0c" data-tmid="/m/07zln7n"><span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span> = $0<span class ="_hgj">USD</span>', 'html.parser')
test_screenshot.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
Which will output what you want:
<span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span>
This means that the code you are getting is not the code you expect to get.
I suggest using the google finance page:
https://www.google.com/finance?q=google (replace 'google' with what you want to search), which will give you what you are looking for:
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find("span",{'class':'pr'})
print code.contents
Will give you
[u'\n', <span id="ref_694653_l">806.93</span>, u'\n'].
In general, scraping Google search results can get really nasty, so try to avoid it if you can.
You might also want to look into Yahoo Finance Python API.
You're looking for this:
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
It might be because there's no user-agent specified in your request headers.
The default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit, and you receive a different HTML page with different selectors and elements, or some sort of error. Setting a user-agent fakes a real browser visit by adding this information to the HTTP request headers.
Pass user-agent into request headers:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
response = requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0'
}
params = {
    'q': 'alphabet inc class a stock',
    'gl': 'us'
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
print(current_price)
# 2,816.00
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get the data you want quickly, rather than figuring out why certain things don't work as expected and then maintaining the scraper over time.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "alphabet inc class a stock",
"gl": "us",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
current_price = results['answer_box']['price']
print(current_price)
# 2,816.00
P.S - I wrote an in-depth blog post about how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.