How to avoid a 403 error using BeautifulSoup and headers? - python

I am using a combination of requests and BeautifulSoup to develop a web-scraping program in Python.
Unfortunately, I get a 403 error (even when using headers).
Here is my code:
from bs4 import BeautifulSoup
from requests import get
headers_m = ({'User-Agent':
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
sapo_m = "https://www.idealista.it/vendita-case/milano-milano/"
response_m = get(sapo_m, headers=headers_m)

This is not a general Python question. The site blocks such straightforward scraping attempts; you need to find a set of headers (specific to this site) that will pass its validation.
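For what it's worth, a rough sketch of that idea: copy a fuller, browser-like header set from a real session (browser DevTools > Network) and experiment with it. The values below are only placeholders; the exact combination this particular site accepts is not known here.
import requests

# Placeholder, browser-like headers copied from a desktop Chrome session.
# Adjust them to whatever your own browser actually sends to the site.
headers_m = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'it-IT,it;q=0.9,en;q=0.8',
    'Referer': 'https://www.idealista.it/',
}
sapo_m = "https://www.idealista.it/vendita-case/milano-milano/"
response_m = requests.get(sapo_m, headers=headers_m)
print(response_m.status_code)  # a persistent 403 means the block goes deeper than headers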

Simply use Chrome as the User-Agent.
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://...", headers={"User-Agent": "Chrome"}).content, 'html.parser')


Beautiful Soup \u003c appears and messes up find_all?

I've been working on a web scraper for top news sites. Beautiful Soup in Python has been a great tool, letting me get full articles with very simple code:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup

article_url='https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'

# session with a small retry policy
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

user_agent='Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header={'User-Agent': user_agent}
source=session.get(article_url, headers=request_header).text
soup = BeautifulSoup(source,'lxml')

#get all <p> paragraphs from article
paragraphs=soup.find_all('p')
#print each paragraph as a line
for paragraph in paragraphs:
    print(paragraph)
This works great on most news sites I've tried, but for some reason the AP site gives me no output at all, which is strange because the exact same code works on maybe 10 other sites like the NYT, WaPo, and The Hill, and I can't figure out why.
Where every other site prints out all the paragraphs, this one prints nothing. But when I look at the soup variable, here is the kind of thing I see:
address the pandemic.\u003c/p>\u003cdiv class=\"ad-placeholder\">\u003c/div>\u003cp>Instead, public schools
Clearly what's happening is that the < HTML symbol is turning up as the escaped sequence \u003c, and because of that find_all('p') can't properly find the HTML tags. But for some reason only the AP site is doing it; when I inspect the AP website, its HTML has the same symbols as all the other sites.
Does anyone have any idea why this is happening? Or what I can do to fix it? Because I'm seriously confused.
For me, at least, I had to extract the JavaScript object containing the data with a regex, parse it with json into a JSON object, grab the value holding the page HTML as you see it in the browser, soup that, and then extract the paragraphs. I removed the retries stuff; you can easily re-insert it.
import requests
#from requests.adapters import HTTPAdapter
#from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import re, json

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}
source = requests.get(article_url, headers=request_header).text

# the article body lives in a JavaScript state object; pull it out with a regex
data = json.loads(re.search(r"window\['titanium-state'\] = (.*)", source, re.M).group(1))
content = data['content']['data']
content = content[list(content.keys())[0]]
# parse only the story HTML, which contains the real <p> tags
soup = BeautifulSoup(content['storyHTML'], 'lxml')
for p in soup.select('p'):
    print(p.text.strip())

BeautifulSoup and MechanicalSoup won't read website

I am working with BeautifulSoup and am also trying MechanicalSoup, and I have got them to load other websites, but when I request this website it takes a long time and never really gets it. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the Mechanicalsoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the URL will change to 2-bedrooms and then 3-bedrooms, then it will move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.
You see the same thing if you use curl or wget to fetch the page. My guess is they are using browser detection to try to prevent people from scraping their copyrighted information, which is what you are attempting to do. You can look up how to set the User-Agent header to pretend to be another browser.
import requests
from bs4 import BeautifulSoup as soup

# start from requests' default headers and override the User-Agent
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})

url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code so that the request wasn't read as coming from a bot.

my python app doesn't work and gives None as an answer

Hi, I would like to know why my app is giving me this result. I've already tried everything I found on Google and still have no idea why this is happening.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.co.uk/XLTOK-Charging-Transfer-Charger-Nintendo/dp/B0828RYQ7W/ref=sr_1_1_sspa?dchild=1&keywords=type+c&qid=1598485860&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE3TDNSNUlITUNKTUMmZW5jcnlwdGVkSWQ9QTAwNDg4MTMyUlFQN0Y4RllGQzE2JmVuY3J5cHRlZEFkSWQ9QTAxNDk0NzMyMFNLSUdPU0taVUpRJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
When called:
C:\Proyecto nuevo>Python main.py
None
So if anyone would like to help me, that would be amazing!!
If you look at the code of the webpage you are trying to scrape, you will find that it is pretty much all JavaScript that populates the page when it loads. The requests library fetches this code but doesn't run it, so your find(id="productTitle") returns None because the fetched HTML doesn't contain that element.
To scrape this page you will have to run the JavaScript on it. You can check out the Selenium WebDriver in Python to do this.
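A minimal sketch of that approach with Selenium, assuming Chrome and a matching chromedriver are installed; the productTitle id and the (trimmed) product URL come from the question above.
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = 'https://www.amazon.co.uk/XLTOK-Charging-Transfer-Charger-Nintendo/dp/B0828RYQ7W/'

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without opening a browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)  # the browser executes the page's JavaScript
    title = driver.find_element(By.ID, 'productTitle')
    print(title.text.strip())
finally:
    driver.quit()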

How to get commented section of the html code using requests?

I am trying to get the number of followers of a Facebook page, i.e. https://web.facebook.com/marlenaband, using the Python requests library. When I view the page source in the browser, the text "142 people follow this" appears to be in a commented section of the page, but I am not seeing it in the response text using requests and BeautifulSoup. Would someone please help me with how to get this? Thanks
Here is the code I am using:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://web.facebook.com/marlenaband'
headers = {
'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36',
}
res = requests.get(url, headers=headers)
print(res.content)
I actually got it using requests by modifying the headers to this:
headers = {
'accept-language':'en-US,en;q=0.8',
}
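If the count still comes back inside an HTML comment, one way to reach it is to collect the comment nodes and re-parse them. A rough sketch, assuming the "people follow this" text really does sit inside a commented-out block:
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://web.facebook.com/marlenaband'
headers = {'accept-language': 'en-US,en;q=0.8'}
res = requests.get(url, headers=headers)

soup = BeautifulSoup(res.content, 'html.parser')
# collect every HTML comment node in the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    if 'people follow this' in c:
        # re-parse the commented-out markup and print the matching text
        inner = BeautifulSoup(c, 'html.parser')
        print(inner.get_text(' ', strip=True))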

Extracting href using bs4/python3? (again)

Sorry to repost this question; someone migrated it to a different site, and without the cookies I could not comment or edit.
I'm new to Python and bs4, so please go easy on me.
#!/usr/bin/python3
import bs4 as bs
import urllib.request
import time, datetime, os, requests, lxml.html
import re
from fake_useragent import UserAgent
url = "https://www.cvedetails.com/vulnerability-list.php"
ua = UserAgent()
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
snkr = requests.get(url,headers=header)
soup = bs.BeautifulSoup(snkr.content,'lxml')
for item in soup.find_all('tr', class_="srrowns"):
    print(item.td.next_sibling.next_sibling.a)
prints:
CVE-2017-6712
CVE-2017-6708
CVE-2017-6707
CVE-2017-1269
CVE-2017-0711
CVE-2017-0706
Using the recommended expression:
print(item.td.next_sibling.next_sibling.a.href)
prints:
None
None
None
None
None
None
I can't figure out how to extract the /cve/CVE-2017-XXXX/ parts. Perhaps I've gone about it the wrong way. I don't need the titles or HTML, just the URIs.
I think you should try something like:
for item in soup.find_all('tr', class_="srrowns"):
    print(item.td.next_sibling.next_sibling.a['href'])
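For context: BeautifulSoup exposes tag attributes through dictionary-style indexing, whereas .href is looked up as a (non-existent) child tag, which is why it printed None. A slightly more defensive sketch of the same loop, using .get('href') so rows without a link don't raise:
for item in soup.find_all('tr', class_="srrowns"):
    link = item.td.next_sibling.next_sibling.a
    if link is not None:
        # attributes are read like dict keys; .get avoids a KeyError if href is missing
        print(link.get('href'))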
