I am trying to web scrape Crunchbase to find the total funding amount for certain companies. Here is a link to an example.
At first I tried just using Beautiful Soup, but I keep getting an error saying:
Access to this page has been denied because we believe you are using automation tools to browse the website.
So then I looked up how to fake a browser visit and changed my code, but I still get the same error. What am I doing wrong?
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
All in all, your code looks fine. It appears that the website you are trying to scrape requires a more complete set of headers than the one you are sending. The following code should solve your issue:
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
response = requests.get(url, headers=headers)
print(response.content)
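Once you actually get a 200 response, you can hand the body to BeautifulSoup and pull out the element you need. The class name below is purely a placeholder, since Crunchbase's real markup (and its anti-bot protection) changes often; treat this as a parsing sketch only:

```python
from bs4 import BeautifulSoup as BS

# Placeholder HTML standing in for a successful response body;
# the real element and class name on Crunchbase will differ.
sample_html = '<span class="funding-total">$25M</span>'
soup = BS(sample_html, 'html.parser')
total = soup.find('span', class_='funding-total')
print(total.text)  # → $25M
```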
Related
I am very new to web scraping with Python. I am trying to build a web scraper that scrapes certain clothing websites for the newest releases and their prices. However, I am unable to get through requests.get(url) for certain websites (e.g. https://www.pacsun.com/).
I already tried the following, based on another post (requests.get(url) not returning for this specific url):
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
"Upgrade-Insecure-Requests": "1",
"DNT": "1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate"}
r = requests.get(url, headers=headers)
Thanks for any help!
I'm currently trying to scrape a website (URL in the code below); however, when I pull out the section of HTML I want to work with, all I get is the variable name of the information I'm looking for. The actual value of the variable is present when I manually inspect the page's HTML, so I assume that when I scrape the page, the site is filling in the variable from elsewhere.
I'm hoping someone can help me access this information. I have tried scraping the website's HTML using Selenium, but I seem to get back the same HTML that I get with requests (maybe I'm doing it incorrectly and somebody can show me the correct approach).
This is a refined version of my code:
import scrapy #For scraping
from scrapy import Selector
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
'Accept-Language': 'en-GB,en;q=0.5',
'Referer': 'https://google.com',
'DNT': '1'}
url = 'https://groceries.aldi.ie/en-GB/drinks/beer-ciders'
html = requests.get(url, headers=headers).text  # Selector expects str, not bytes
sel = Selector(text=html)
x = sel.xpath('//*[@id="vueSearchResults"]/div//span')  # /div[2]/div/div[4]/div/div/span/span
print(x.extract()[8])
This then returns the following:
<span>{{Product.ListPrice}}</span>
From this I want to get the actual value of 'Product.ListPrice'. I'd appreciate it if someone could point me in the right direction for accessing this variable's information, or for a way to scrape the website's HTML as a user traversing the webpage would see it.
It was recommended to me to send a POST request through this API: 'https://groceries.aldi.ie/api/product/calculatePrices', along with the request headers and payload, but I'm not entirely sure how to pull this off (I'm new to this aspect of scraping). If someone could provide an example of how to carry this out, I'd greatly appreciate it!
Thanks!
This is how you can replicate the POST request through Scrapy.
Code
import scrapy
import json
class Test(scrapy.Spider):
    name = 'test'
    headers = {
        "authority": "groceries.aldi.ie",
        "pragma": "no-cache",
        "cache-control": "no-cache",
        "sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
        "accept-language": "en-GB",
        "sec-ch-ua-mobile": "?0",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36",
        "websiteid": "a763fb4a-0224-4ca8-bdaa-a33a4b47a026",
        "content-type": "application/json",
        "accept": "application/json, text/javascript, */*; q=0.01",
        "x-requested-with": "XMLHttpRequest",
        "origin": "https://groceries.aldi.ie",
        "sec-fetch-site": "same-origin",
        "sec-fetch-mode": "cors",
        "sec-fetch-dest": "empty",
        "referer": "https://groceries.aldi.ie/en-GB/drinks/beer-ciders?sortDirection=asc&page=1"
    }
    body = '{"products":["4088600284026","5391528370382","5391528372836","5391528372850","5391528372874","4088600298696","4088600103709","4088600388700","5035766046028","5000213021934","4088600012551","4088600325934","4088600300153","25389111","4072700001171","4088600012537","4088600012544","4088600013138","4088600013145","4088600103525","4088600103532","4088600103570","4088600103600","4088600135182","4088600141848","4088600142050","4088600158105","4088600217024","4088600217208","4088600241302","4088600249292","4088600249308","4088600280615","4088600281445","4088600283043","4088600284088","4088600295688","4088600295800","4088600295817","4088600303925"]}'

    def start_requests(self):
        url = 'https://groceries.aldi.ie/api/product/calculatePrices'
        yield scrapy.Request(url=url, method='POST', headers=self.headers, body=self.body)

    def parse(self, response):
        data = json.loads(response.body)
        for i in data.get('ProductPrices'):
            print('Listing price is', i.get('ListPrice'), '\t', 'Unit Price is', i.get('UnitPrice'))
Output (truncated)
Listing price is €4.19 Unit Price is €4.23
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
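Since the original code used requests rather than Scrapy, here is a sketch of the same POST with plain requests, reusing a subset of the headers and two of the product codes from the spider above. The endpoint and required headers may have changed since this was written, so the live call is left commented out:

```python
import json

url = 'https://groceries.aldi.ie/api/product/calculatePrices'
headers = {
    'content-type': 'application/json',
    'websiteid': 'a763fb4a-0224-4ca8-bdaa-a33a4b47a026',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36',
    'referer': 'https://groceries.aldi.ie/en-GB/drinks/beer-ciders?sortDirection=asc&page=1',
}
body = json.dumps({'products': ['4088600284026', '5391528370382']})

def extract_prices(data):
    """Pull (ListPrice, UnitPrice) pairs out of the decoded JSON payload."""
    return [(p.get('ListPrice'), p.get('UnitPrice'))
            for p in data.get('ProductPrices', [])]

# Uncomment to hit the live endpoint:
# import requests
# response = requests.post(url, headers=headers, data=body)
# for list_price, unit_price in extract_prices(response.json()):
#     print('Listing price is', list_price, '\t', 'Unit Price is', unit_price)
```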
Currently I have this code to send the request:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15','Accept-Encoding': 'gzip, deflate, br','Accept-Language': 'en-US'}
url = "https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY"
res = requests.get(url, headers = headers)
But when I execute the last line, it either hangs for a long time or just gives a 401 error.
I tried other variations of headers, but still don't get the correct response.
Just pasting the above URL into the browser gives the JSON object painlessly.
Also, if I try the above code from an online Python tool, I sometimes get the needed output. Does this mean I'll only be able to get this working when I host it on a trusted domain or something?
This should work:
import requests
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'AKA_A2=A; ak_bmsc=2C30C88FD1C6BEED087CCD02E7643772~000000000000000000000000000000~YAAQPvASAr+cI2F6AQAAk6+tdQxncocfEFby+qeRnNu3MgRblj1MWVtVy+W1Stx/CNaRaf9PhfVoT568zV8qztByVrxV+WfdrCN2nXU0nToPdEoaZFeZ7irUu8aSUXcln/sou0taKkr1gjmS3f6faZs+Rv8LA32eUAtlTD+GgYL0OKTJ44qVVinDxeeaVOiLxzQaiv0YjRCLcovFhO7jVBCJhNeXzgOeUYCLjkOg+2DEnRaF1Cd85f83pkjjieOFpjvywz20ImVWy1fr+S2nEDqmcgKZdhjHPfJ76+Z3bvVB/Kyv2dH7J8BMjlVf7kxyGbmot54yxchJNEMs0A/QTkeow2Xa54IcGZo/RUxGRu90SFu6VpfcxLaVOdN9EbvhcNs//OPA1jhDm9Nf4A==; bm_sv=BB4B29FC4D88791AABD65B43FACB0AF7~ObLG1UzBN4vOInl5m0vWqjOpZUXtLDHJDxr92uXdHHp5bjKjrEMMJcJRzS5VY5lkIs3N7JH+gZtoTnYIWKFqPZFhFC8Oo+sjmZLrin4taKkPfpvp7RdbqySQh6BLQwbWg3UgQJUQN29H0q9MJN6FuaW2b2i13zn5CmZUSDSpJVo=',
    'Host': 'www.nseindia.com',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'TE': 'trailers',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0',
}
r = requests.get('https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY', headers=headers)
print(r.json())
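One caveat: the hard-coded Cookie header above will expire. A more durable variant (a sketch, untested against the live site) lets a requests Session collect fresh cookies by visiting the NSE homepage first, then reuses the same session for the API call:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
}

def fetch_option_chain(symbol):
    with requests.Session() as session:
        session.headers.update(headers)
        # The homepage visit sets the anti-bot cookies on the session.
        session.get('https://www.nseindia.com')
        r = session.get(
            'https://www.nseindia.com/api/option-chain-indices',
            params={'symbol': symbol},
        )
        r.raise_for_status()
        return r.json()

# data = fetch_option_chain('NIFTY')
```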
I'm not sure if there's an API for this, but I'm trying to scrape the prices of certain products from Wayfair.
I wrote some Python code using Beautiful Soup and requests, but I'm getting back some HTML that says "Our systems have detected unusual traffic from your computer network."
Is there anything I can do to make this work?
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
}

def fetch_wayfair_price(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup)
fetch_wayfair_price('https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
The Wayfair site loads a lot of data in after the initial page, so just getting the HTML from the URL you provided probably won't give you everything you need. That said, I was able to get the request to work.
Start by using a Session from the requests library; this will track cookies across requests. I also added 'upgrade-insecure-requests': '1' to the headers so that you can avoid some of the issues that HTTPS introduces to scraping. The request now returns the same response as the browser request does, but the information you want is probably loaded by subsequent requests that the browser/JS makes.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}

def fetch_wayfair_price(url):
    with requests.Session() as session:
        post = session.get(url, headers=headers)
        soup = BeautifulSoup(post.text, 'html.parser')
        print(soup)

fetch_wayfair_price(
    'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
Note: the Session that the requests library creates really should persist outside of the fetch_wayfair_price function. I have contained it in the function to match your example.
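To make that last point concrete, here is a sketch where the caller owns the session and passes it in, so cookies persist across multiple product pages:

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}

def fetch_wayfair_price(session, url):
    # The caller owns the session, so cookies survive between calls.
    response = session.get(url, headers=headers)
    return BeautifulSoup(response.text, 'html.parser')

# with requests.Session() as session:
#     soup = fetch_wayfair_price(
#         session,
#         'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
```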
When I run the below code it returns an empty string
url = 'http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=5&columns[0][data]=box_art&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=title&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=true&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=year&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=true&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=genre&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=rating&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][regex]=false&columns[5][data]=available&columns[5][name]=&columns[5][searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][regex]=false&columns[6][data]=director&columns[6][name]=&columns[6][searchable]=true&columns[6][orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][data]=cast&columns[7][name]=&columns[7][searchable]=true&columns[7][orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&order[0][column]=5&order[0][dir]=desc&start=0&length=25&search[value]=sherlock&search[regex]=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1451768717982'
print(requests.get(url).text)
but if I put the URL in my browser, it shows me the JSON information. While debugging, I noticed the browser must have the Tamper Data plugin installed to view the JSON; without the plugin, a blank web page appears. So my theory is that it has something to do with how the HTTP request is being handled, but I'm not sure where to go from here.
Any help would be great.
You need to open up a session, visit the main page to get the cookies set and then make the XHR request to "processing_us.php":
url = "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=5&columns[0][data]=box_art&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=title&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=true&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=year&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=true&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=genre&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=rating&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][regex]=false&columns[5][data]=available&columns[5][name]=&columns[5][searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][regex]=false&columns[6][data]=director&columns[6][name]=&columns[6][searchable]=true&columns[6][orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][data]=cast&columns[7][name]=&columns[7][searchable]=true&columns[7][orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&order[0][column]=5&order[0][dir]=desc&start=0&length=25&search[value]=sherlock&search[regex]=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1451768717982"
with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
    session.get("http://www.allflicks.net/")
    response = session.get(url, headers={"Accept": "application/json, text/javascript, */*; q=0.01",
                                         "X-Requested-With": "XMLHttpRequest",
                                         "Referer": "http://www.allflicks.net/",
                                         "Host": "www.allflicks.net"})
    print(response.json())
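For completeness: this endpoint serves a DataTables-style payload, so the row fields mirror the column names in the query string (title, year, genre, and so on). A sketch of pulling titles out, assuming the standard DataTables "data" key (the sample payload below is hypothetical, for illustration only):

```python
def list_titles(payload):
    """Extract (title, year) pairs from a DataTables-style JSON payload."""
    return [(row.get('title'), row.get('year'))
            for row in payload.get('data', [])]

# Hypothetical payload shape, for illustration only:
sample = {'data': [{'title': 'Sherlock', 'year': '2010'}]}
print(list_titles(sample))  # → [('Sherlock', '2010')]
```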