User-agent error with web scraping in Python 3

This is my first time doing web scraping. When I use page = requests.get(URL) it works perfectly fine, but when I add
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'}
page = requests.get(URL, headers=headers)
I get an error:
title = soup.find(id="productTitle").get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
What's wrong here? Should I give up on the headers?

I think the page contains invalid HTML, and therefore BeautifulSoup is not able to find your element.
Try prettifying the HTML first:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/dp/B07JP9QJ15/ref=dp_cerb_1'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'}

page = requests.get(URL, headers=headers)

# Parse once, re-serialize with prettify(), then parse the cleaned-up markup
pretty = BeautifulSoup(page.text, 'html.parser').prettify()
soup = BeautifulSoup(pretty, 'html.parser')
print(soup.find(id='productTitle').get_text())
Which returns:
Dell UltraSharp U2719D - LED Monitor - 27"
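If find(id='productTitle') still comes back as None (Amazon frequently answers scripted requests with a bot-check page rather than the product page), a defensive check avoids the AttributeError entirely. A minimal sketch, assuming the same URL and headers as above:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/dp/B07JP9QJ15/ref=dp_cerb_1'
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15'}

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

title = soup.find(id='productTitle')
if title is not None:
    print(title.get_text(strip=True))
else:
    # Likely a CAPTCHA/robot page: check the status and raw HTML to see what came back
    print(f"productTitle not found (HTTP {page.status_code})")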

Related

Trying to get BeautifulSoup to print a list of first names from a list of Yahoo Finance URLs

The goal is to get Python / BeautifulSoup to scrape Yahoo Finance for the first/last name of each public company's owner:
from bs4 import BeautifulSoup
import requests

url = 'https://finance.yahoo.com/quote/GTVI/profile?p=GTVI'
page = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
})
soup = BeautifulSoup(page.text, 'html.parser')
price = soup.find_all("tr", {"class": "C($primaryColor) BdB Bdc($seperatorColor) H(36px)"})
print(soup.select_one("td > span").text)
^ The above single call works perfectly, but I can't get it to loop and print multiple times while keeping the browser user agent masked. Here is my attempt (I'm new to Python, keep that in mind). Haaalp :)
from bs4 import BeautifulSoup
import requests

url = ['https://finance.yahoo.com/quote/GTVI/profile?p=GTVI',
       'https://finance.yahoo.com/quote/RAFA/profile?p=RAFA',
       'https://finance.yahoo.com/quote/CYDX/profile?p=CYDX',
       'https://finance.yahoo.com/quote/TTHG/profile?p=TTHG']
names = []
for link in url:
    w=1
    reqs2 = requests.get(link)
    page = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    })
    soup = BeautifulSoup(page.text, 'html.parser')
    for x in soup.find_all("tr", {"class": "C($primaryColor) BdB Bdc($seperatorColor) H(36px)"})
        names.append(x.text)
print(names)(soup.select_one("td > span").text)
Check your indentation to get your code running, and also check your requests calls: you fetch link once but then request the whole url list. Since the expected output in your question is not entirely clear, this is just a hint on how to fix it and get a result.
Example
from bs4 import BeautifulSoup
import requests

url = ['https://finance.yahoo.com/quote/GTVI/profile?p=GTVI',
       'https://finance.yahoo.com/quote/RAFA/profile?p=RAFA',
       'https://finance.yahoo.com/quote/CYDX/profile?p=CYDX',
       'https://finance.yahoo.com/quote/TTHG/profile?p=TTHG']
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}

names = []
for link in url:
    page = requests.get(link, headers=headers)  # request each link, not the whole list
    soup = BeautifulSoup(page.text, 'html.parser')
    for x in soup.find_all("tr", {"class": "C($primaryColor) BdB Bdc($seperatorColor) H(36px)"}):
        names.append(x.text)
    print(soup.select_one("td > span").text)
print(names)
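Since all four pages are fetched with the same headers, it may also be worth reusing one connection. A minimal sketch of the same loop with requests.Session, so the User-Agent is set once and connections are reused (two of the tickers shown for brevity):
from bs4 import BeautifulSoup
import requests

urls = ['https://finance.yahoo.com/quote/GTVI/profile?p=GTVI',
        'https://finance.yahoo.com/quote/RAFA/profile?p=RAFA']
names = []
with requests.Session() as s:
    # Session-level headers apply to every request made through s
    s.headers.update({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"})
    for link in urls:
        soup = BeautifulSoup(s.get(link).text, 'html.parser')
        for row in soup.find_all("tr", {"class": "C($primaryColor) BdB Bdc($seperatorColor) H(36px)"}):
            names.append(row.text)
print(names)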

Extracting the redirected link from a URL

I am trying to extract the redirect target of this link: when I click on it I am redirected to another page, and I want to store that page's link. I tried fetching it with the requests module, but it didn't give a usable response.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
response = requests.get(url, headers=headers)
print(response)  # Output: <Response [503]>
So, how can I extract this link?
You can use cloudscraper to process the Cloudflare redirect:
import cloudscraper

scraper = cloudscraper.create_scraper()  # behaves like a requests session
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
r = scraper.get(url)
print(r.url)
You can use the requests library:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
response = requests.get(url, headers=headers)
print(response.url)  # the final URL after redirects
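Both answers rely on the client following redirects automatically. If you also want to see each hop along the way, response.history holds the intermediate responses; this works with the cloudscraper scraper above too, since it behaves like a requests session. A minimal sketch (note plain requests may still be answered with a 503 here, as in the question):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
response = requests.get(url, headers=headers)
for hop in response.history:   # each intermediate 3xx response
    print(hop.status_code, hop.url)
print('final:', response.url)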

Requests Module gets different content in different python versions

import requests
from bs4 import BeautifulSoup

def req(url) -> BeautifulSoup:
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup

def get_novels(page, sort, order, status):
    books = []
    novel_list = req(f"https://www.novelupdates.com/novelslisting/?sort={sort}&order={order}&status={status}&pg={page}")
    novels = novel_list.find_all(class_="search_main_box_nu")...
The above code gets the actual content of the page on Python 3.10.2, but on 3.9.12 it gets a bot-verification page instead.
Why is that, and how do I fix it? Please help.
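One way to narrow this down is to compare exactly what each interpreter sends: the installed requests and urllib3 versions, and the final request headers, can differ between environments even when your code is identical. A minimal diagnostic sketch, assuming the same novelupdates URL and User-Agent as above, to run under both Python versions:
import requests, urllib3

print('requests', requests.__version__, '/ urllib3', urllib3.__version__)

# Build the request without sending it, to inspect the exact headers on the wire
s = requests.Session()
prepped = s.prepare_request(requests.Request(
    'GET', 'https://www.novelupdates.com/',
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}))
print(prepped.headers)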

You don't have permission to access "http://www.carrefour.pk/" on this server.<p> Reference #18.451d2017.1615456534.6b4445

I'm trying to scrape the Carrefour website's data with Python. I've tried Scrapy, Beautiful Soup, and Selenium, but nothing seems to work: I get an error saying I don't have permission to access the site. Is there any way to scrape this website? The code is attached below. NEED HELP!
from requests_html import HTMLSession

session = HTMLSession()
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
resp = session.get("https://www.carrefour.pk/", headers=headers)
resp.html.render()  # executes the page's JavaScript
a = resp.html.html
print(a)
I think you are using the wrong headers. These headers work fine for me:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
Or in full:
import requests
from bs4 import BeautifulSoup as bs

# Block all cookies
from http import cookiejar  # Python 2: import cookielib as cookiejar

class BlockAll(cookiejar.CookiePolicy):
    return_ok = set_ok = domain_return_ok = path_return_ok = lambda self, *args, **kwargs: False
    netscape = True
    rfc2965 = hide_cookie2 = False

s = requests.Session()
s.cookies.set_policy(BlockAll())

# Get URL
url = "https://www.carrefour.pk"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
r = s.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
print(soup)
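A quick sanity check on the approach above (the BlockAll policy simply rejects every cookie the site tries to set): before parsing further, confirm the block is gone by checking the status code and the page title instead of printing the whole document. Continuing from the s, url, and headers defined above:
r = s.get(url, headers=headers)
print(r.status_code)  # expect 200 rather than the 403 "permission" response
soup = bs(r.text, 'html.parser')
print(soup.title)     # a real <title> suggests actual page content came back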

How to change user agent urllib2

I'm trying to access a page using the following:
page = urllib2.urlopen(full_url)
soup = BeautifulSoup(page, 'html.parser')
li_post_id = "post-" + str(post_id)
li_soup = soup.find('li', attrs={'id':li_post_id})
This works fine on my Ubuntu machine, but when I run it on my Windows server I get a 403 Forbidden error, so I assume the issue is with the user agent.
How do I change it, say, to Firefox's? I have only seen tutorials that change the user agent using requests, but I don't want to rewrite all of my code for that.
You could try this (note that it does use requests):
import random
import requests
from bs4 import BeautifulSoup

agents = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko)',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)',
    'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)']

headers = {"User-Agent": random.choice(agents)}  # pick a random agent per run
response = requests.get(full_url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
Changing the header has nothing to do with BeautifulSoup, which is meant for HTML parsing only. You need to change it in your urllib request, like so:
Python 3
import urllib.request
req = urllib.request.build_opener()
req.addheaders = [('User-Agent', 'Some user agent')]
response = req.open('http://www.stackoverflow.com')
Python 2.7
import urllib2
req = urllib2.build_opener()
req.addheaders = [('User-Agent', 'Some user agent')]
response = req.open('http://www.stackoverflow.com')
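If you'd rather not replace the opener everywhere, Python 3's urllib.request also lets you attach the header to a single Request object. A minimal sketch reusing full_url from the question (the Firefox User-Agent string here is just an illustrative value):
import urllib.request
from bs4 import BeautifulSoup

req = urllib.request.Request(
    full_url,  # from the original code above
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'},
)
page = urllib.request.urlopen(req)
soup = BeautifulSoup(page, 'html.parser')  # the rest of the original code is unchanged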
