I've done a few small successful projects, but I've been struggling to get requests from this website working for ages - any tips?
UPDATE - I would like to get a full BeautifulSoup response so I can start scraping the information from the tables.
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2")
soup = BeautifulSoup(r.content,"html.parser")
print(soup)
which returns:
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
You need to pretend to be a real user with a browser and provide a User-Agent header:
r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2", headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2", headers={
... "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
... })
>>> soup = BeautifulSoup(r.content,"html.parser")
>>> print(soup.title.get_text())
Top market values 15/16 - Championship - Transfermarkt
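From there you can start pulling rows out of the market-value table. A minimal sketch, assuming the main table uses the class name "items" (that class is an assumption about Transfermarkt's markup, so inspect the page and adjust the selector if it differs):

# assumption: Transfermarkt's main market-value table carries the class "items"
table = soup.find("table", class_="items")
if table is not None:
    for row in table.find_all("tr"):
        # collect the visible text of every cell in the row
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        if cells:
            print(cells)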
There are some sites where plain requests calls fail to get a response, because they check whether the requesting party is a browser or a bot.
So, let us look like a browser.
That can be done by setting the headers as follows:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
}
Then simply add these headers to your GET request as follows:
response = requests.get("https://example.com", headers=headers)
In total you will get:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
}
response = requests.get("https://example.com", headers=headers)
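If you are going to make several requests, you can also attach the same headers to a requests.Session once instead of repeating them on every call. A small sketch (example.com stands in for whatever site you are scraping):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}

with requests.Session() as session:
    # every request made through this session now sends the browser-like headers
    session.headers.update(headers)
    response = session.get("https://example.com")
    print(response.status_code)

The session also keeps any cookies the site sets, which some sites expect on later requests.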
Related
I have looked at other questions on Stack Overflow about the HTTP 403 error; however, I have not found a solution there.
I would like to turn the 403 error into a 200.
I'm trying to scrape this URL: https://angel.co/startups.
import requests
import random
my_session = requests.session()
for_cookies = my_session.get('https://angel.co/startups')
cookies = for_cookies.cookies
user_agents_list = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
]
response = my_session.get('https://angel.co/startups', cookies=cookies,
                          headers={'User-Agent': random.choice(user_agents_list)})
print(response.text)
response.status_code #403
When I run this code I get a 403 error instead of the whole HTML page.
Apart from that, I did manage to scrape the first page using cloudscraper; however, I have no idea how to scrape the other pages.
The pages are numbered 1, 2, 3 ... 2500.
It is most likely due to Cloudflare protection or some similar bot protection.
So, use cloudscraper to bypass it:
import cloudscraper
url = "https://angel.co/startups"
scraper = cloudscraper.create_scraper()
response = scraper.get(url)
text = response.text
print(response.status_code)
Output
200
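For the other pages you can reuse the same scraper in a loop. This is only a sketch: it assumes the listing accepts a page query parameter (check the URL of page 2 in your browser and adjust), and it adds a small delay so you don't hammer the site:

import time
import cloudscraper

scraper = cloudscraper.create_scraper()
for page in range(1, 6):  # extend towards 2500 once the parameter is confirmed
    # assumption: pagination works via a "page" query parameter
    response = scraper.get("https://angel.co/startups", params={"page": page})
    print(page, response.status_code, len(response.text))
    time.sleep(1)  # be polite between requests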
Hey everyone, I am trying to scrape this website but for some reason it's not scraping. I'd really appreciate it if someone could give me a hand with this problem. I have tried using different user agents but it's not working for some reason. For the page content it prints b'', and the soup is empty.
Thanks in advance. Here's my code:
import requests
from bs4 import BeautifulSoup
url = "https://www.carrefourjordan.com/mafjor/en/c/deals?currentPage=1&filter=&nextPageOffset=0&pageSize=60&sortBy=relevance"
headers = {'User-Agent':'test'}
page = requests.get(url,headers=headers)
print(page.content)
soup = BeautifulSoup(page.content, "html.parser")
print(soup)
These are the 3 different headers I used:
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36', "Upgrade-Insecure-Requests": "1","DNT": "1","Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language": "en-US,en;q=0.5","Accept-Encoding": "gzip, deflate"}
You need to get the right cookies first, so you'll need to use a session:
import requests
from bs4 import BeautifulSoup
headers={'User-Agent': 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.2; Trident/5.1)'}
url = "https://www.carrefourjordan.com/mafjor/en/c/deals?currentPage=1&filter=&nextPageOffset=0&pageSize=60&sortBy=relevance"
with requests.session() as s:
    s.headers.update(headers)
    # get the cookies first
    s.get("https://www.carrefourjordan.com")
    page = s.get(url)
soup = BeautifulSoup(page.text, "html.parser")
print(soup)
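Since the URL already exposes currentPage, pageSize and nextPageOffset parameters, you can page through the deals with the same session. A hedged sketch, assuming nextPageOffset simply advances by pageSize for each page (verify that against what the site actually sends):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.2; Trident/5.1)'}
base_url = "https://www.carrefourjordan.com/mafjor/en/c/deals"

with requests.session() as s:
    s.headers.update(headers)
    s.get("https://www.carrefourjordan.com")  # pick up the cookies first
    for page_number in range(1, 4):
        # assumption: nextPageOffset advances by pageSize (60) per page
        params = {"currentPage": page_number, "pageSize": 60,
                  "nextPageOffset": (page_number - 1) * 60, "sortBy": "relevance"}
        page = s.get(base_url, params=params)
        soup = BeautifulSoup(page.text, "html.parser")
        print(page_number, page.status_code, len(page.text))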
Hi, can anyone get this to work? I am trying to scrape sizes from an interactive dropdown selector, but I keep getting a timeout error.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
soup = BeautifulSoup(requests.get("https://www.asos.com/nike/nike-air-max-95-logo-leather-trainers-in-dark-navy-orange/prd/20750072?colourwayid=60085113", timeout=60.0).content)
print([size.text.strip() for size in soup.find(class_="colour-size select")])
It's because you forgot the headers parameter.
Try again:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
soup = BeautifulSoup(requests.get("https://www.asos.com/nike/nike-air-max-95-logo-leather-trainers-in-dark-navy-orange/prd/20750072?colourwayid=60085113",
                                  timeout=60.0,
                                  headers=headers).content)
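If the request now comes back, the sizes should be inside the option elements of that dropdown. A small sketch under the assumption that ASOS renders the select server-side rather than filling it in with JavaScript (which is not guaranteed):

# assumption: the size dropdown with class "colour-size select" is present in the server-rendered HTML
select = soup.find(class_="colour-size select")
if select is not None:
    sizes = [option.get_text(strip=True) for option in select.find_all("option")]
    print(sizes)
else:
    print("size selector not found - the dropdown is probably built by JavaScript")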
I want to scrape a website that requires a login first, using Beautiful Soup and Python requests. I'm able to log in by sending my username and password via a POST request; however, making a GET request within the same session after login yields a 403 (Forbidden) error. The last line in my code is producing the 'forbidden' message. Is there a workaround?
import requests
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
}
payload = {
'login' : '#my_username' , 'password': '#my_password', 'remember_me': 'false', 'fallback': 'false'
}
with requests.Session() as s:
    url = 'https://www.hackerrank.com/auth/login'
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    r = s.post(url, data=payload, headers=headers)
    print(r.content)
    s.get('Webpage_that_can_be_accessed_only_after_login', headers=headers)
I did almost the same thing; the only difference was that I passed the exact headers I saw Chrome sending, and I also passed the csrf_token.
import requests
import json
import sys
from bs4 import BeautifulSoup
#header string picked from chrome
headerString='''
{
"accept": "text/html,application/xhtml+xml,application/xml;q':0.9,image/avif,image/webp,image/apng,*/*;q':0.8,application/signed-exchange;v':b3;q':0.9',text/html,application/xhtml+xml,application/xml;q':0.9,image/avif,image/webp,image/apng,*/*;q':0.8,application/signed-exchange;v':b3;q':0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "en-US,en;q':0.9",
"cache-control": "max-age=0",
"cookie": "hackerrank_mixpanel_token':7283187c-1f24-4134-a377-af6c994db2a0; hrc_l_i':F; _hrank_session':653fb605c88c81624c6d8f577c9094e4f8657136ca3487f07a3068c25080706db7178cc4deda978006ce9d0937c138b52271e3cd199fda638e8a0b8650e24bb7; _ga':GA1.2.397113208.1599678708; _gid':GA1.2.933726361.1599678708; user_type':hacker; session_id':h3xb3ljp-1599678763378; __utma':74197771.397113208.1599678708.1599678764.1599678764.1; __utmc':74197771; __utmz':74197771.1599678764.1.1.utmcsr':(direct)|utmccn':(direct)|utmcmd':(none); __utmt':1; __utmb':74197771.3.10.1599678764; _biz_uid':5969ac22487d4b0ff8d000621de4a30c; _biz_sid:79bd07; _biz_nA':1; _biz_pendingA':%5B%5D; _biz_flagsA':%7B%22Version%22%3A1%2C%22ViewThrough%22%3A%221%22%2C%22XDomain%22%3A%221%22%7D; _gat_UA-45092266-28':1; _gat_UA-45092266-26':1; session_referrer':https%3A%2F%2Fwww.google.com%2F; session_referring_domain':www.google.com; session_landing_url':https%3A%2F%2Fwww.hackerrank.com%2Fprefetch_data%3Fcontest_slug%3Dmaster%26get_feature_feedback_list%3Dtrue",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "none",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}
'''
d=json.loads(headerString)
#creating session
s = requests.Session()
url='https://www.hackerrank.com/auth/login'
r=s.get(url, headers=d)
#getting the csrf_token
soup = BeautifulSoup(r.text, 'html.parser')
csrf_token=soup.find('meta', id='csrf-token')['content']
#using it in login post call
request_header={
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
"x-csrf-token": csrf_token
}
payload={"login":"<user-name>","password":"<password>","remember_me":False,"fallback":True}
r=s.post(url, headers=request_header, data=payload)
#then I tested if login is successful by going into dashboard page
d=json.loads(r.text)
csrf_token=d['csrf_token']
url='https://www.hackerrank.com/dashboard'
request_header={
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
"x-csrf-token": csrf_token
}
r=s.get(url, headers=request_header)
print(r.text)
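In short, the two things that matter here are reusing the same requests.Session for every call and sending the x-csrf-token value that the login page exposes; the long copied cookie string is specific to one browser session and will expire, so don't rely on it being reusable.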
I'm trying to scrape this website:
https://www.footpatrol.com/
However, it seems that the website denies my scraping attempt.
Using headers did not help.
from bs4 import BeautifulSoup
import requests
url = "https://www.footpatrol.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers = headers)
data = r.text
soup = BeautifulSoup(data, 'lxml')
for a in soup.find_all():
    print(a)
This leads to me getting a ConnectionError. How can I fix my code so I can scrape the site?
I'm able to get a response by changing the User Agent to:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
and the following User Agent also works:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
It seems that the Chrome version is the culprit in your User Agent.
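If you want to guard against this in the future, one option is to try a couple of modern user agents and keep the first one that gets a 200. A small sketch, not a guarantee that footpatrol.com won't block anything else:

import requests

candidate_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
]

for agent in candidate_agents:
    try:
        r = requests.get("https://www.footpatrol.com/", headers={'User-Agent': agent}, timeout=30)
    except requests.exceptions.ConnectionError:
        continue  # this agent was refused outright, try the next one
    if r.status_code == 200:
        print("working agent:", agent)
        break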