Scraping: cannot access information from web

I am scraping some information from this URL: https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon#description-tab
Everything was fine until I tried to scrape the description.
I have tried repeatedly, but so far without success.
It seems I can't reach that information. Here is my code:
html = urllib.urlopen("https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon")
tree=BeautifulSoup(html, "lxml")
description=tree.find('div',{'id':'description_section','class':'description-section'})
Does anyone have a suggestion?

You would need to make an additional request to get the description. Here is a complete working example using requests + BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/"
with requests.Session() as session:
    session.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    }
    # get the token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("meta", {"name": "csrf-token"})["content"]
    # get the description
    description_url = url + "description"
    response = session.get(description_url, headers={"X-CSRF-Token": token, "X-Requested-With": "XMLHttpRequest"})
    soup = BeautifulSoup(response.content, "html.parser")
    description = soup.find('div', {'id':'description_section', 'class': 'description-section'})
    print(description.get_text(strip=True))
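The token step above assumes the meta tag is always present; if it is missing, soup.find(...) returns None and the ["content"] lookup fails. Here is a minimal sketch of a more defensive token lookup. It uses only the standard library so it runs without BeautifulSoup, and the names MetaTokenParser and extract_csrf_token as well as the sample HTML are made up for illustration, not taken from the site.

```python
from html.parser import HTMLParser


class MetaTokenParser(HTMLParser):
    """Remembers the content of the first <meta name="csrf-token" ...> tag."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "csrf-token":
            # Keep only the first token that appears in the document.
            if self.token is None:
                self.token = attributes.get("content")


def extract_csrf_token(html):
    """Return the CSRF token found in html, or None if the tag is absent."""
    parser = MetaTokenParser()
    parser.feed(html)
    return parser.token


# Hypothetical sample page, standing in for response.text.
sample = '<html><head><meta name="csrf-token" content="abc123"></head></html>'
print(extract_csrf_token(sample))
print(extract_csrf_token("<html><head></head></html>"))
```

With a helper like this you can check for None before making the second request, instead of getting a TypeError in the middle of the script.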

I use the XML package in R for web scraping, and I couldn't get the description section either, just as you described with BeautifulSoup.
However, if you only want to scrape this one page, you can download the page and then parse the local file:
page = htmlTreeParse("Lunar Lion - the first ever university-led mission to the Moon _ RocketHub.html",
                     useInternal = TRUE,encoding="utf8")
unlist(xpathApply(page, '//div[@id="description_section"]', xmlValue))
I also tried downloading the page from R, and the description_section was not in the result either:
url="https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon"
download.file(url,"page.html",mode="w")
Maybe download.file needs some extra options. I hope an HTML expert can help.

I found out how to scrape it with R:
library("rvest")
url="https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/description"
url %>%
  read_html() %>%
  html_nodes(xpath='//div[@id="description_section"]') %>%
  html_text()

Related

Scraping HREF Links contained within a Table

I've been bouncing around a ton of similar questions, but nothing that seems to fix the issue... I've set this up (with help) to scrape the HREF tags from a different URL.
I'm now trying to take the HREF links in the "Result" column from this URL.
The script doesn't seem to be working like it did for other sites.
The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.
Could someone explain to me why this is the case? I'm watching many YouTube videos trying to understand, but this just doesn't make sense to me.
import requests
from bs4 import BeautifulSoup
profiles = []
urls = [
    'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)
print(profiles)

The following code works:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
    print(x.get('href'))

The main issue in this case is that you are not sending a User-Agent header. Some sites, regardless of whether it is a good idea, use it to decide that you are a bot and then serve no content or only limited content.
So the minimum is to provide that information when making your request:
req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
Also take a closer look at your selection. Assuming you want only the team links, you should narrow it down; I used CSS selectors:
for profile in soup.select('table a[href^="/team/"]'):
You also need to concatenate the base URL with the extracted values:
profile = 'https://stats.ncaa.org'+profile.get('href')
Example
from bs4 import BeautifulSoup
import requests
profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']
for url in urls:
    req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.select('table a[href^="/team/"]'):
        profile = 'https://stats.ncaa.org'+profile.get('href')
        profiles.append(profile)
print(profiles)
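As an alternative to concatenating the base URL by hand, urllib.parse.urljoin from the standard library resolves relative links against the page URL. This is only a sketch: the helper name absolutize and the example href values are invented for illustration and are not real links from the NCAA page.

```python
from urllib.parse import urljoin

PAGE_URL = "https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100"


def absolutize(page_url, hrefs):
    """Resolve each href against page_url, skipping anchors that had no href."""
    return [urljoin(page_url, href) for href in hrefs if href]


# Hypothetical href values of the kind profile.get('href') might return.
links = absolutize(PAGE_URL, ["/team/6/15881", None, "https://example.com/x"])
print(links)
```

urljoin leaves links that are already absolute untouched, so mixed lists of relative and absolute hrefs need no special handling.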

scraper returning empty when trying to scrape in beautiful soup

Hi, I want to scrape domain names and their prices, but it returns an empty result and I don't know why.
import requests
from bs4 import BeautifulSoup
url = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
names = soup.findAll('div', {'class': "domainCardDetail"})
print(names)

Try the following approach to get the domain names and their prices from that site. The script currently parses content from the first page only. If you wish to get content from other pages, change the page number in page=1, which is part of link.
import requests
from bs4 import BeautifulSoup
link = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
url = 'https://www.brandbucket.com/amp/ga'
payload = {
    '__amp_source_origin': 'https://www.brandbucket.com'
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,'html.parser')
    payload['dp'] = soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0]
    resp = s.get(url,params=payload)
    for key,val in resp.json()['triggers'].items():
        if not key.startswith('domain'):continue
        container = val['extraUrlParams']
        print(container['pr1nm'],container['pr1pr'])
The output looks like this (truncated):
estita.com 2035
rendro.com 1675
rocamo.com 3115
wzrdry.com 4315
prutti.com 2395
bymodo.com 3495
ethlax.com 2035
intezi.com 2035
podoxa.com 2430
rorror.com 3190
zemoxa.com 2195
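The split("list=")[1].split("&")[0] step in the script can also be written with urllib.parse, which fails more gracefully when the parameter is missing. A minimal sketch, assuming the value sits in the regular query string of the iframe src (if the site puts it in the fragment instead, urlparse(url).fragment would have to be parsed); the helper name query_value and the example URL are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def query_value(url, name):
    """Return the first value of query parameter name, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get(name)
    return values[0] if values else None


# Hypothetical src attribute, not the real one from the site.
src = "https://example.com/frame?list=abc123&other=1"
print(query_value(src, "list"))
print(query_value(src, "missing"))
```

Note that parse_qs also percent-decodes the value, which the plain split does not.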

Check the status code of the response. When I tested it, the web server returned 403, and because of that the response contains no "domainCardDetail" div at all.
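A minimal sketch of that check: refuse to parse the body unless the status is 200. The SimpleNamespace objects are stand-ins for requests.Response so the example runs offline, and body_if_ok is a hypothetical helper name. With a real response you could also call response.raise_for_status().

```python
from types import SimpleNamespace


def body_if_ok(response):
    """Return response.text, or raise if the server did not answer with 200."""
    if response.status_code != 200:
        raise RuntimeError(
            f"Unexpected status {response.status_code}; not parsing the body"
        )
    return response.text


# Stand-ins for requests.Response objects (only the two attributes used above).
ok = SimpleNamespace(status_code=200, text="<div class='domainCardDetail'>x</div>")
blocked = SimpleNamespace(status_code=403, text="Forbidden")

print(body_if_ok(ok))
try:
    body_if_ok(blocked)
except RuntimeError as error:
    message = str(error)
    print(message)
```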

The reason for this is that the website is protected by Cloudflare.
There are some advanced ways to bypass this.
The following solution is very simple if you do not need to scrape in bulk. Otherwise, you may want to use "cloudscraper", "Selenium", or another method that can run the site's JavaScript.
1. Open the developer console.
2. Go to "Network". Make sure the checkboxes are ticked as in the picture below:
https://i.stack.imgur.com/v0KTv.png
3. Refresh the page.
4. Copy the JSON result and parse it in Python:
https://i.stack.imgur.com/odX5S.png

"AttributeError: 'NoneType' object has no attribute 'get_text'"

Whenever I tried to run this code:
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()
converted_price = price[0:7]
if (converted_price < '₹ 1,200'):
    send_mail()
print(converted_price)
print(title.strip())
if(converted_price > '₹ 1,400'):
    send_mail()
It gives me the error AttributeError: 'NoneType' object has no attribute 'get_text'. Earlier, this code was working fine.

import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Camera-24-2MP-18-135mm-Essential-Including/dp/B081PMPPM1/ref=sr_1_1_sspa?dchild=1&keywords=Canon+EOS+80D&qid=1593325243&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEyU1M0M1JVTkY3WTBVJmVuY3J5cHRlZElkPUEwNDQzMjI5Uk9DM08zQkM1RU9RJmVuY3J5cHRlZEFkSWQ9QTAyNjI0NjkzT0ZLUExSRkdJMDYmd2lkZ2V0TmFtZT1zcF9hdGYmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
headers = { "user-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
page = requests.get(url,headers= headers)
soup = BeautifulSoup(page.content,"lxml")
title = soup.find(id = "productTitle").get_text()
print(title)
I tried this and it worked.

Either the productTitle id or the priceblock_ourprice id does not exist in the page you are querying. I would suggest the following two steps:
- Check the URL in your browser and look for those ids
- Check what you get in page.content, because it may not be the same as what you see in the browser
Hope it helps
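The error itself comes from find() returning None when nothing matches, so a small guard makes the failure visible instead of crashing. This is a sketch only: text_or_default is a hypothetical helper, and FakeTag is a stand-in for a BeautifulSoup Tag so that the example runs without any network access or third-party package.

```python
def text_or_default(element, default=None):
    """Return the stripped text of element, or default if find() returned None."""
    if element is None:
        return default
    return element.get_text(strip=True)


class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup Tag, so the sketch runs offline."""

    def __init__(self, text):
        self._text = text

    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text


print(text_or_default(FakeTag("  Canon EOS 80D  ")))
print(text_or_default(None, default="<missing>"))
```

In the real script you would pass soup.find(id="productTitle") as the element and treat the default value as a signal to inspect page.content.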

I assume you are trying to analyze Amazon products.
The elements productTitle and priceblock_ourprice exist (I have checked).
You should check page.content.
Maybe the website does not accept your headers.
Try:
import requests
from bs4 import BeautifulSoup
URL = "https://www.amazon.de/COMIFORT-PC-Tisch-Studie-Schreibtisch-Mehrfarbig/dp/B075R95B1S"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "lxml")
title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()
print(title)
print(price)
Result:
COMIFORT, Computerschreibtisch, Schreibtisch für das Arbeitszimmer, Schreibtisch, Maße: 90 x 50 x 77 cm
50,53 €

Please check whether the product is out of stock. In that case the price is not on the page, which is why you get a NoneType. Try selecting another product with a visible price.

I know this is 2.2 years late, but I'm going through this DevEd tutorial now,
and 'ourprice' is now id="priceblock_dealprice". Even so, it only runs successfully about once every 15 attempts.

It works once and then stops working. I think Amazon is blocking the request.
The ids are correct, and it makes no difference whether you use lxml, html.parser, or html5lib. If you print(soup) and look in the body, you will see a captcha prompt from Amazon, basically saying you have to prove you are not a robot. I don't know a way around that.
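You can at least detect that situation in code before calling find(). A rough sketch, assuming the block page contains the word "captcha" or the phrase "not a robot"; those marker strings and both sample pages are assumptions for illustration, not verified against what Amazon currently serves.

```python
def looks_like_captcha(html, markers=("captcha", "not a robot")):
    """Heuristically report whether html looks like a bot-check page."""
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


# Hypothetical pages standing in for page.text.
blocked_page = (
    "<html><body><p>Sorry, we just need to make sure you're not a robot.</p>"
    "<form action='/errors/validateCaptcha'></form></body></html>"
)
product_page = "<html><body><span id='productTitle'>Desk</span></body></html>"

print(looks_like_captcha(blocked_page))
print(looks_like_captcha(product_page))
```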

If you try to run it on consecutive days, you will run into this error. Amazon is blocking the request. One trick to get it working again is to simply print the HTML after you get it, before trying to parse it.
response = requests.get(url=AMAZON_URI, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
})
response.raise_for_status()
data = response.text
# Adding this print will fix the issue for consecutive days.
print(data)
soup = BeautifulSoup(data, "html.parser")
price_dollar = soup.find(name="span", class_="a-price-whole").getText()
price_cents = soup.find(name="span", class_="a-price-fraction").getText()
total_price = (float(f"{price_dollar}{price_cents}"))
print(total_price)
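Building the float from the scraped strings breaks as soon as the price contains a thousands separator or a currency symbol, and comparing prices as strings (as in the question) compares characters rather than amounts. Here is a minimal sketch of a price parser, assuming a comma is a thousands separator and a dot is the decimal point; a price like "50,53 €" would be misread under that assumption. The name parse_price is made up for illustration.

```python
import re


def parse_price(text):
    """Convert price text such as '₹ 1,200' or '$1,299.99' to a float.

    Returns None if no number can be recovered from the text.
    """
    # Drop thousands separators, then everything that is not a digit or a dot.
    cleaned = re.sub(r"[^\d.]", "", text.replace(",", ""))
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_price("₹ 1,200"))
print(parse_price("$1,299.99"))
print(parse_price("Currently unavailable"))
```

With numbers in hand, a check like parse_price(price) < 1200 compares amounts instead of strings.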

I tried to print the total_price without printing the data and it worked.

requests-html not finding page element

So I'm trying to navigate to this url: https://www.instacart.com/store/wegmans/search_v3/horizon%201%25
and scrape data from the div with the class item-name item-row. There are two main problems, though: the first is that instacart.com requires a login before you can get to that URL, and the second is that most of the page is generated with JavaScript.
I believe I've solved the first problem because my session.post(...) gets a 200 response code. I'm also pretty sure that r.html.render() is supposed to solve the second problem by rendering the JavaScript-generated HTML before I scrape it. Unfortunately, the last line in my code returns only an empty list, despite the fact that Selenium had no problem getting this element. Does anyone know why this isn't working?
from requests_html import HTMLSession
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
session = HTMLSession()
res1 = session.get('http://www.instacart.com', headers=headers)
soup = BeautifulSoup(res1.content, 'html.parser')
token = soup.find('meta', {'name': 'csrf-token'}).get('content')
data = {"user": {"email": "alexanderjbusch@gmail.com", "password": "password"},
        "authenticity_token": token}
response = session.post('https://www.instacart.com/accounts/login', headers=headers, data=data)
print(response)
r = session.get("https://www.instacart.com/store/wegmans/search_v3/horizon%201%25", headers=headers)
r.html.render()
print(r.html.xpath("//div[@class='item-name item-row']"))

After logging in using the requests module and BeautifulSoup, you can use the link I already suggested in the comments to parse the required data, which is available as JSON. The following script should get you the name, quantity, price, and a link to the product in question. You can only get 21 products using the script below. There is an option for pagination within this JSON content, and you can get all of the products by playing around with that pagination.
import json
import requests
from bs4 import BeautifulSoup
baseurl = 'https://www.instacart.com/store/'
data_url = "https://www.instacart.com/v3/retailers/159/module_data/dynamic_item_lists/cart_starters/storefront_canonical?origin_source_type=store_root_department&tracking.page_view_id=b974d56d-eaa4-4ce2-9474-ada4723fc7dc&source=web&cache_key=df535d-6863-f-1cd&per=30"
data = {"user": {"email": "alexanderjbusch@gmail.com", "password": "password"},
        "authenticity_token": ""}
headers = {
    'user-agent':'Mozilla/5.0',
    'x-requested-with': 'XMLHttpRequest'
}
with requests.Session() as s:
    res = s.get('https://www.instacart.com/',headers={'user-agent':'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, 'lxml')
    token = soup.select_one("[name='csrf-token']").get('content')
    data["authenticity_token"] = token
    s.post("https://www.instacart.com/accounts/login",json=data,headers=headers)
    resp = s.get(data_url, headers=headers)
    for item in resp.json()['module_data']['items']:
        name = item['name']
        quantity = item['size']
        price = item['pricing']['price']
        product_page = baseurl + item['click_action']['data']['container']['path']
        print(f'{name}\n{quantity}\n{price}\n{product_page}\n')
Partial output:
SB Whole Milk
1 gal
$3.90
https://www.instacart.com/store/items/item_147511418
Banana
At $0.69/lb
$0.26
https://www.instacart.com/store/items/item_147559922
Yellow Onion
At $1.14/lb
$0.82
https://www.instacart.com/store/items/item_147560764

How can I parse long web pages with beautiful soup?

I have been using the following code to parse the web page at https://www.blogforacure.com/members.php. The code is expected to return the links of all the members on that page.
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('https://www.blogforacure.com/members.php').read()
soup = BeautifulSoup(r,'lxml')
headers = soup.find_all('h3')
print(len(headers))
for header in headers:
    a = header.find('a')
    print(a.attrs['href'])
But I get only the first 10 links from the above page. Even when printing the prettify() output, I see only the first 10 links.

The results are dynamically loaded by making AJAX requests to the https://www.blogforacure.com/site/ajax/scrollergetentries.php endpoint.
Simulate them in your code with requests, maintaining a web-scraping session:
from bs4 import BeautifulSoup
import requests
url = "https://www.blogforacure.com/site/ajax/scrollergetentries.php"
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    session.get("https://www.blogforacure.com/members.php")
    page = 0
    members = []
    while True:
        # get page
        response = session.post(url, data={
            "p": str(page),
            "id": "#scrollbox1"
        })
        html = response.json()['html']
        # parse html
        soup = BeautifulSoup(html, "html.parser")
        page_members = [member.get_text() for member in soup.select(".memberentry h3 a")]
        print(page, page_members)
        members.extend(page_members)
        page += 1
It prints the current page number and the list of members per page, accumulating the member names into a members list. I'm not posting what it prints since it contains names.
Note that I've intentionally left the loop endless; please figure out the exit condition. It may be when response.json() throws an error.
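One possible exit condition, sketched under assumptions: stop when a page yields no members or when decoding the JSON fails (the JSON decode errors raised by requests are normally subclasses of ValueError). How the real endpoint behaves after the last page is not confirmed here, so the loop also has a hard page limit. collect_members and fake_fetch are hypothetical names, and fake_fetch stands in for the session.post(...) plus parsing step so the example runs offline.

```python
def collect_members(fetch_page, max_pages=1000):
    """Call fetch_page(0), fetch_page(1), ... until a page is empty or unparsable."""
    members = []
    for page in range(max_pages):  # hard upper bound as a safety net
        try:
            page_members = fetch_page(page)
        except ValueError:
            # Assumption: the body is no longer valid JSON once the pages run out.
            break
        if not page_members:
            # Assumption: an empty page also means there is nothing left.
            break
        members.extend(page_members)
    return members


def fake_fetch(page):
    """Hypothetical stand-in for the request + parsing of one page."""
    pages = {0: ["member-a", "member-b"], 1: ["member-c"]}
    if page > 2:
        raise ValueError("no JSON in response")
    return pages.get(page, [])


print(collect_members(fake_fetch))
```

In the real script, fetch_page would wrap the session.post call, response.json()['html'], and the soup.select(".memberentry h3 a") extraction.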
