The code below extracts data from Zillow Sale.
My first question is where people get the headers information.
My second question is how do I know when I need headers? For some other pages, like Cars.com, I don't need to pass headers=headers and I still get the data correctly.
Thank you for your help.
HHC
import requests
from bs4 import BeautifulSoup
import re
url ='https://www.zillow.com/baltimore-md-21201/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%2221201%22%2C%22mapBounds%22%3A%7B%22west%22%3A-76.67377295275878%2C%22east%22%3A-76.5733510472412%2C%22south%22%3A39.26716345016057%2C%22north%22%3A39.32309233550334%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A66811%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A14%7D'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'referer': 'https://www.zillow.com/new-york-ny/rentals/2_p/?searchQueryState=%7B%22pagination'
}
raw_page = requests.get(url, headers=headers)
status = raw_page.status_code
print(status)
# Loading the page content into the beautiful soup
page = raw_page.content
page_soup = BeautifulSoup(page, 'html.parser')
print(page_soup)
You can get the headers by visiting the site in your browser and opening the network tab of the developer tools: select a request and you can see the headers that were sent with it.
Some websites don't serve bots, so to make them think you're not a bot you set the user-agent header to one a browser uses. Some sites may require more headers before you pass the not-a-bot test. You can see all the headers being sent in the developer tools, and you can test different headers until your request succeeds.
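The "test different headers until your request succeeds" idea can be sketched as a small loop. This is only an illustration under assumptions: `first_working_headers`, `CANDIDATES`, and the `fetch` callable are hypothetical names, the referer value is a placeholder, and `fetch` stands in for whatever actually performs the request.

```python
from typing import Callable, Dict, List, Optional


def first_working_headers(
    fetch: Callable[[Dict[str, str]], int],
    candidates: List[Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Return the first header set for which fetch() reports HTTP 200.

    `fetch` is a hypothetical callable that takes a headers dict and
    returns the status code of the response.
    """
    for headers in candidates:
        if fetch(headers) == 200:
            return headers
    return None


# User-agent copied from the question; in practice, copy the one shown in
# the network tab of your own browser.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
)

# Candidate header sets, from "no headers" to "user-agent plus referer".
# The referer below is a placeholder assumption, not a required value.
CANDIDATES = [
    {},
    {"user-agent": USER_AGENT},
    {"user-agent": USER_AGENT, "referer": "https://www.zillow.com/"},
]

# Possible wiring with requests (not executed here):
#   import requests
#   fetch = lambda h: requests.get(url, headers=h, timeout=10).status_code
#   print(first_working_headers(fetch, CANDIDATES))
```

Starting from the empty header set also answers the "do I need headers at all?" question: if the first candidate already returns 200 with the expected content, no extra headers are needed for that site.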
From your browser, go to this website: http://myhttpheader.com/
You will find your headers info there.
Secondly, you only need to provide headers when a website like Zillow blocks you from scraping its data.
I get no response from a URL when using requests.get, but if I paste the URL into Firefox it responds. The URL points to a JSON file. I don't know what's happening. Here is my code:
from urllib.request import urlopen,Request
import requests
import pprint
import json
import pandas as pd
url = "https://www.nseindia.com/api/option-chain-equities?symbol=ACC"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
response = requests.get(url, headers=headers)
print(response.status_code)
# pd.read_json expects JSON text, a path, or a file-like object, not a Response; parse the body directly
data = response.json()
pprint.pprint(data['records'])
This website protects itself from bots. There are many ways to detect bots, including:
request rate
disabled JavaScript
empty cookies
not using the mouse to click buttons
etc.
To enable JavaScript and cookies, you can use Selenium.
The website you want to scrape has powerful bot-detection methods. I couldn't access the link you shared directly, but when I first opened the website's main page and then your link, it showed the JSON file.
It is not easy to build a bot for this, though. I tried Selenium and clicked the website's button by moving the mouse, but it still detected that I was a bot. So we can conclude that the website relies on cookies. You need to generate fake cookies to access the page.
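The "main page first, then the link" observation can be sketched as a two-step fetch on a single cookie-keeping session. This is a rough sketch, not a confirmed fix: `fetch_after_warmup` is a hypothetical helper, `session` is assumed to be any object with a requests-style `get(url, headers=..., timeout=...)` method (for example a `requests.Session()`), and nothing here guarantees the site will accept the request.

```python
from typing import Any, Dict


def fetch_after_warmup(
    session: Any,
    home_url: str,
    api_url: str,
    headers: Dict[str, str],
    timeout: float = 10.0,
) -> Any:
    """Visit home_url first so the session can collect cookies, then request api_url.

    Returns whatever the second session.get() call returns.
    """
    # First request: its only purpose is to let the session store any cookies.
    session.get(home_url, headers=headers, timeout=timeout)
    # Second request: the same session sends those cookies along automatically.
    return session.get(api_url, headers=headers, timeout=timeout)


# Possible wiring with requests (not executed here):
#   import requests
#   with requests.Session() as s:
#       resp = fetch_after_warmup(
#           s,
#           "https://www.nseindia.com/",
#           "https://www.nseindia.com/api/option-chain-equities?symbol=ACC",
#           {"User-Agent": "Mozilla/5.0 ..."},
#       )
#       print(resp.status_code)
```

The timeout matters here as well: without it, a blocked request may simply hang instead of failing.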
I am scraping web data with python using requests and beautiful soup. I have found that 2 of the websites I am scraping from only respond if I do not specify the page number.
The following code works and allows me to extract the data needed:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer', headers = headers)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'class':'col-xs-12 job-results clearfix'})
If however I change the link to specify a page number, such as:
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers = headers)
Then the request never responds. There is no error code; the console just waits indefinitely. What is causing this and how do I resolve it?
EDIT: I opened the site in Incognito manually. It seems that when opening with the page number I get an "access denied" response, but if I refresh the page it lets me in?
That's because, as you can see, the page numbers on the website cannot be accessed from outside. If you are logged in and have some sort of cookie, add it to your headers.
From what I just checked on the website, you are trying to access the wrong URI. There are no page numbers. Did you add ?page= yourself?
The problem you're facing is a common one in web scraping. In your case, the web page blocks you because your headers lack a proper user-agent definition.
To get it to work, you need to include a user-agent declaration like this:
headers={'user-agent':'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3',}
You can dive more deeply into the problem of writing good web scrapers here:
https://towardsdatascience.com/5-strategies-to-write-unblock-able-web-scrapers-in-python-5e40c147bdaf
A list of proper user-agents can be found here:
https://webscraping.com/blog/User-agents/
Hope it gets you working on your problem.
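The edit in the question mentions that a refresh gets past the initial "access denied". A hedged sketch of automating that refresh is shown below. `fetch_with_retry` and `BLOCKED_STATUSES` are hypothetical names, and treating "access denied" as HTTP 403 (or 429) is an assumption; the question does not report a status code.

```python
import time
from typing import Any, Callable

# Assumption: a blocked request comes back as 403 or 429.
BLOCKED_STATUSES = {403, 429}


def fetch_with_retry(
    fetch: Callable[[], Any],
    attempts: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch() up to `attempts` times, pausing `delay` seconds between tries.

    Returns the first response whose status_code is not in BLOCKED_STATUSES,
    or the last response if every attempt was blocked.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    response = None
    for attempt in range(attempts):
        response = fetch()
        if response.status_code not in BLOCKED_STATUSES:
            return response
        # Pause before the next try, but not after the final one.
        if attempt < attempts - 1:
            sleep(delay)
    return response


# Possible wiring with requests (not executed here). The timeout makes
# requests raise an exception instead of waiting indefinitely:
#   import requests
#   resp = fetch_with_retry(lambda: requests.get(url, headers=headers, timeout=10))
```

Passing `timeout=` to `requests.get` is what addresses the "console just waits indefinitely" symptom; the retry loop only mimics the manual refresh.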
I am working with BeautifulSoup and also trying MechanicalSoup. I have got both to load other websites, but when I request this website it takes a long time and the page never actually loads. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the Mechanicalsoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the URL will change to 2-bedrooms and then 3-bedrooms, then move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.
You see the same thing if you use curl or wget to fetch the page. My guess is they are using browser detection to try to prevent people from stealing their copyrighted information, as you are attempting to do. You can search for the User-Agent header to see how to pretend to be another browser.
import urllib3
import requests
from bs4 import BeautifulSoup as soup
headers = requests.utils.default_headers()
headers.update({
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code to ensure that it was being read as not from a bot.
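For the follow-up goal in the question (looping over bedroom counts and cities), the URLs could be generated along these lines. This is a sketch under an assumption: it extrapolates the URL pattern from the single URL in the question, so other city slugs or bedroom counts may not follow the same pattern. `listing_urls` is a hypothetical helper.

```python
from itertools import product
from typing import Iterable, List

BASE = "https://www.apartments.com/apartments"


def listing_urls(cities: Iterable[str], bedrooms: Iterable[int]) -> List[str]:
    """Build one URL per (city, bedroom count) pair, city by city."""
    return [f"{BASE}/{city}/{n}-bedrooms/" for city, n in product(cities, bedrooms)]


if __name__ == "__main__":
    # Only the first slug comes from the question; add further slugs as needed.
    for url in listing_urls(["saratoga-springs-ut"], [1, 2, 3]):
        print(url)
```

Each generated URL can then be passed to the same requests.get(url, headers=headers) call shown above.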
I use the code below to scrape results from Bing, and when I look at the scraped web page it says "There are no results for python".
But when I search in the browser there is no problem.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = f'https://www.bing.com/search?q={term}&setlang=en-us'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
I searched and I didn't find any similar problem
You need to pass a user-agent with the request to get the results.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = 'https://www.bing.com/search?q={}&setlang=en-us'.format(term)
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Since Bing is a dynamic website, meaning JavaScript generates the page content, you won't be able to scrape it using only BeautifulSoup. Instead, I recommend Selenium, which opens a browser that you can control, and then parsing the resulting code with BeautifulSoup.
The same will happen for any other dynamically generated website, including Google and many others.
It's probably because no user-agent is being passed in the request headers (as already mentioned by KunduK). When no user-agent is specified, the requests library defaults to python-requests, so Bing and other search engines recognize that the request comes from a bot/script and block it. Check what your user-agent is.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
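To see what a script identifies itself as without sending anything over the network, one option is to inspect the request object before it is sent. The sketch below deliberately uses the standard library's urllib instead of requests so that it runs offline; its default is `Python-urllib/<version>`, which is the analogue of the `python-requests/<version>` default mentioned above. `default_urllib_user_agent` and `build_request` are hypothetical helper names.

```python
import urllib.request

# Browser user-agent copied from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)


def default_urllib_user_agent() -> str:
    """Return the User-Agent urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request.
    return dict(opener.addheaders)["User-agent"]


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Create (but do not send) a request carrying a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    print("default:", default_urllib_user_agent())
    req = build_request("https://www.bing.com/search?q=python&setlang=en-us")
    # Request stores header names capitalized, hence "User-agent".
    print("custom: ", req.get_header("User-agent"))
```

With requests itself, `requests.utils.default_user_agent()` reports the default string in the same way.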
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines. Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer, I work for SerpApi.
I would like to scrape the Amazon top 10 bestsellers in baby products.
I want just the title text, but it seems that I have a problem.
I'm getting 'None' when I try this code.
After getting "result", I want to iterate over it using "content" and print the titles.
Thanks!
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
url = "https://www.amazon.com/gp/bestsellers/baby-products"
r=requests.get(url, headers=headers)
print("status: ", r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
print("url: ", r.url)
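result = soup.find("ol", {"id": "zg-ordered-list"})
print(result)
# find() returns None when the element is missing from the raw HTML, so guard before searching inside it
content = result.find_all("div", {"class": "a-section a-spacing-none aok-relative"}) if result else []
print(content)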
You won't be able to scrape the Amazon website this way. You are using requests.get to get the HTTP response body of the URL provided. Pay attention to what that response actually is (e.g. with print(r.content)). What you see in your web browser differs from the raw HTTP response because of the client-side rendering technologies used by Amazon (typically JavaScript and others).
I advise you to use Selenium, which sort of "emulates" a typical browser from within the Python runtime, renders the site like a normal browser would, and lets you access the properties of the same page you see in your web browser.
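Before switching to Selenium, it may be worth confirming that the element really is absent from the raw response rather than just hard to select. A minimal sketch using only the standard library's HTML parser follows; `IdFinder` and `has_element_id` are hypothetical names, and the id `zg-ordered-list` is taken from the question.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Scan HTML and record whether any tag carries the target id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_id(html: str, target_id: str) -> bool:
    """Return True if the raw HTML contains an element with this id."""
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.found


# Possible usage with the code from the question (not executed here):
#   print(has_element_id(r.text, "zg-ordered-list"))
```

If this returns False for `r.text` while the element is visible in the browser's inspector, the list is most likely rendered client-side or the server returned a different page to the script, which is the situation described above.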