I am scraping web data with Python using requests and Beautiful Soup. I have found that two of the websites I am scraping only respond if I do not specify the page number.
The following code works and allows me to extract the data needed:
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer', headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs={'class': 'col-xs-12 job-results clearfix'})
If however I change the link to specify a page number, such as:
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers = headers)
then the request never responds. There is no error code; the console just waits indefinitely. What is causing this and how do I resolve it?
EDIT: I opened the site manually in Incognito. It seems that when I open it with the page number I get an "access denied" response, but if I refresh the page it lets me in?
That's because, as you can see, you are not able to access the paginated pages on the website from outside a normal browsing session. So if you are logged in and have some sort of cookie, add it to your headers.
Also, from what I just checked on the website, you may be trying to access the wrong URI; there are no page numbers there. Did you add ?page= yourself?
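A minimal sketch of attaching a cookie copied from your browser to the request (the cookie name and value here are placeholders, not the site's real ones):
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)',
    # Hypothetical cookie copied from the browser's developer tools; replace with the real one
    'cookie': 'session_id=PASTE_VALUE_FROM_BROWSER',
}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers=headers)
print(r.status_code)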
The problem you're tackling is web scraping. In your case, the web page blocks you because your header declaration lacks a proper user-agent definition.
To get it to work you need to include a user-agent declaration like this:
headers={'user-agent':'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3',}
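For completeness, a minimal sketch of passing that header along with the request (using the milkround URL from the question above):
import requests

headers = {'user-agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'}
# Send the browser-like user-agent with every request
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer', headers=headers)
print(r.status_code)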
You can dive more deeply into the problem of writing good web scrapers here:
https://towardsdatascience.com/5-strategies-to-write-unblock-able-web-scrapers-in-python-5e40c147bdaf
A list of proper user-agents can be found here:
https://webscraping.com/blog/User-agents/
Hope this gets you going with your problem.
Related
The code below extracts data from Zillow sale listings.
My first question is: where do people get the headers information?
My second question is: how do I know when I need headers? For some other pages, like Cars.com, I don't need to pass headers=headers and I can still get the data correctly.
Thank you for your help.
HHC
import requests
from bs4 import BeautifulSoup
import re
url ='https://www.zillow.com/baltimore-md-21201/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%2221201%22%2C%22mapBounds%22%3A%7B%22west%22%3A-76.67377295275878%2C%22east%22%3A-76.5733510472412%2C%22south%22%3A39.26716345016057%2C%22north%22%3A39.32309233550334%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A66811%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A14%7D'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'referer': 'https://www.zillow.com/new-york-ny/rentals/2_p/?searchQueryState=%7B%22pagination'
}
raw_page = requests.get(url, headers=headers)
status = raw_page.status_code
print(status)
# Load the page content into BeautifulSoup
page = raw_page.content
page_soup = BeautifulSoup(page, 'html.parser')
print(page_soup)
You can get the headers by going to the site in your browser and using the network tab of the developer tools: select a request and you can see the headers sent with it.
Some websites don't serve bots, so to make them think you're not a bot you set the user-agent header to one a browser uses. Some sites may require more headers for you to pass the not-a-bot test. You can see all the headers being sent in the developer tools and test different headers until your request succeeds.
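A minimal sketch of that workflow; the header values below are examples, so replace them with the ones copied from the request shown in your own browser's network tab:
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://www.zillow.com/',
}
response = requests.get('https://www.zillow.com/baltimore-md-21201/', headers=headers)
print(response.status_code)  # keep adding headers from the browser until this stops being 403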
From your browser, go to this website: http://myhttpheader.com/
You will find your headers info there.
Secondly, you only need to provide headers when a website like Zillow blocks you from scraping data with the default request.
I use the code below to scrape results from Bing, and when I look at the scraped web page it says "There are no results for python".
But when I search in the browser there is no problem.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = f'https://www.bing.com/search?q={term}&setlang=en-us'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
I searched but didn't find any similar problem.
You need to pass a user-agent while making the request to get the results.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = 'https://www.bing.com/search?q={}&setlang=en-us'.format(term)
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Since Bing is a dynamic website, meaning JavaScript generates the markup, you won't be able to scrape it using only BeautifulSoup. Instead, I recommend Selenium, which opens a browser that you can control, and then you can parse the page with BeautifulSoup.
The same will happen for any other dynamically coded website, including Google and many others.
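A minimal sketch of that approach, assuming Selenium and a Chrome driver are installed (Selenium 4+ can manage the driver itself):
from selenium import webdriver
from bs4 import BeautifulSoup

term = 'python'
driver = webdriver.Chrome()  # opens a real browser, so JavaScript-rendered content is available
driver.get(f'https://www.bing.com/search?q={term}&setlang=en-us')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())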
It's probably because there's no user-agent being passed in the request headers (as KunduK already mentioned). When no user-agent is specified while using the requests library, it defaults to python-requests, so Bing and other search engines understand that it's a bot/script and block the request. Check what your user-agent is.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines; instead, you can focus on the data that needs to be extracted from the structured JSON. Check out the playground.
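A rough sketch of what that looks like, assuming the google-search-results package is installed and you have an API key (the parameter names follow the package's usual pattern, so double-check them against the docs):
from serpapi import GoogleSearch

params = {
    'engine': 'bing',           # Bing Organic Results API on SerpApi
    'q': 'python',
    'api_key': 'YOUR_API_KEY',  # placeholder
}
search = GoogleSearch(params)
results = search.get_dict()     # structured JSON as a Python dict
for result in results.get('organic_results', []):
    print(result.get('title'), result.get('link'))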
Disclaimer, I work for SerpApi.
Thanks for checking out this question!
I'm teaching myself how to collect web data.
The objective is to collect reviews of 'booking(dot)com' listings within a city.
I'm using requests library in order to collect the source code and find useful data.
Not all of a hotel's reviews are in the listing's source code; however, I have figured out a way to access the review list of a given hotel, and the link recipe works for all listings. It leads to a simplified (no CSS) version of the 'View Reviews' tab.
The problem is that the function I use for collecting source code returns an empty list when given the review-list links, but works great with other addresses.
The review-list links work when I open them 'manually' in a browser. How do I solve this?
In: page = 'https://www.booking.com/reviewlist.html?aid=679422&cc1=lt&pagename=gradiali&rows=10&'
download = requests.get(page)
decoded_content = download.content.decode('utf-8')
page_content = decoded_content.split('\n')
page_content
Out: ['']
Thanks, K.
Solved!
I discovered that requests can send a User-Agent line to the server and make it 'think' the web page is being opened by a browser.
page = hotels.iloc[0,1]
header = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
download = requests.get(page, headers=header)
decoded_content = download.content.decode('utf-8')
page_content = decoded_content.split('\n')
page_content
If anyone has this issue, make sure you try different User-Agents; some work and some don't :)
First, thanks for taking the time to read this and maybe trying to help me.
I am trying to make a script to easily log in to a site. I wanted to get the login cookies too, so maybe I could reuse them later. I made the script and it logs me in correctly, but I cannot get the cookies. When I try to print them, I see just this:
<RequestsCookieJar[]>
Obviously this can't help me, I think. So now I would be interested in knowing how to get the real cookie data. Thanks a lot to whoever can help me reach that.
My code:
import requests
import cfscrape
from bs4 import BeautifulSoup as bs
header = {"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}
s = requests.session()
scraper = cfscrape.create_scraper(sess=s)  # I use cfscrape because the site uses Cloudflare's anti-DDoS page
scraper.get("https://www.bstn.com/einloggen", headers=header)
myacc = {"login[email]": "my#mail.com",  # obviously change these
         "login[password]": "password123"}
entry=scraper.post("https://www.bstn.com/einloggen", headers=header, data=myacc)
soup=bs(entry.text, 'lxml')
accnm=soup.find('meta',{'property':'og:title'})['content']
print("Logged in as: " + accnm)
aaaa=scraper.get("https://www.bstn.com/kundenkonto", headers=header)
print(aaaa.cookies)
If I print the cookies, I just get the <RequestsCookieJar[]> described earlier... It would be really nice if anyone could help me get the "real" cookies.
If you want to get your login cookie, you ought to use the response of the POST, because that is the request that performs the login action. The server will send back session cookies if you submit the correct email and password. The reason you got empty cookies from aaaa is that the website didn't want to set or change your cookies on that later request.
entry = scraper.post("https://www.bstn.com/einloggen", allow_redirects=False, headers=header, data=myacc)
print(entry.cookies)
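A minimal sketch of reading the actual cookie values out of that jar (get_dict() is part of the requests cookie-jar API, and the underlying session jar also accumulates cookies across requests):
# Cookies returned by the login POST
print(entry.cookies.get_dict())

# Cookies collected on the underlying requests session (cfscrape's scraper wraps it)
for cookie in s.cookies:
    print(cookie.name, cookie.value, cookie.domain)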
I'm reading a web site's content using the following three-liner. I used an example domain that is for sale and doesn't have much content.
url = "http://localbusiness.com/"
response = requests.get(url)
html = response.text
It returns the following HTML content, while the website contains more HTML when you check through view-source. Am I doing something wrong here?
Python version 2.7
<html><head></head><body><!-- vbe --></body></html>
Try setting a User-Agent:
import requests
url = "http://localbusiness.com/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
'Content-Type': 'text/html',
}
response = requests.get(url, headers=headers)
html = response.text
The default User-Agent set by requests is 'User-Agent': 'python-requests/2.8.1'. Try to make it look like the request is coming from a browser and not a script.
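You can check what requests sends by default (a quick sketch; the version number will match your installed requests):
import requests

# default_headers() shows what goes out when you don't override anything
print(requests.utils.default_headers())
# e.g. {'User-Agent': 'python-requests/2.8.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}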
@jason answered it correctly, so I am extending his answer with the reason.
Why it happens
Some DOM elements are changed through Ajax calls and JavaScript code, so they will not be seen in the response of your call (although that's not the case here, since you are already comparing against view-source (Ctrl+U) rather than the element inspector).
Some sites use the user-agent to determine the nature of the user (desktop or mobile, for example) and tailor the response accordingly (the probable case here).
Other alternatives
You can use the mechanize module of Python to mimic a browser and fool a web site (it comes in handy when the site is using some sort of authentication cookies). A small tutorial is available; see the sketch after this list.
Use Selenium to drive an actual browser.
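A minimal sketch of the mechanize option, assuming the mechanize package is installed (the URL is the one from the question):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # don't refuse pages because of robots.txt
# Present a browser-like user-agent instead of mechanize's default
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36')]
response = br.open('http://localbusiness.com/')
html = response.read()
print(html)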