I would like to scrape the Amazon top 10 bestsellers in baby-products.
I want just the title text, but it seems that I have a problem.
I'm getting 'None' when I try this code.
After getting "result" I want to iterate over it using "content" and print the titles.
Thanks!
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
url = "https://www.amazon.com/gp/bestsellers/baby-products"
r = requests.get(url, headers=headers)
print("status: ", r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
print("url: ", r.url)
result = soup.find("ol", {"id": "zg-ordered-list"})
content = result.findAll("div", {"class": "a-section a-spacing-none aok-relative"})
print(result)
print(content)
You won't be able to scrape the Amazon website this way. You are using requests.get to fetch the HTTP response body of the provided url. Pay attention to what that response actually contains (e.g. by print(r.content)). What you see in your web browser differs from the raw HTTP response because of the client-side rendering technologies Amazon uses (typically JavaScript, among others).
I advise you to use Selenium, which sort of "emulates" a typical browser inside the Python runtime, renders the site as a normal browser would, and lets you access properties of the same page you see in your web browser.
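As a rough sketch (assuming Chrome with a matching driver is installed; the zg-ordered-list id is taken from your snippet and may not match what the rendered page actually uses, so verify the selectors in your browser's dev tools):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.amazon.com/gp/bestsellers/baby-products')
driver.implicitly_wait(10)  # give client-side rendering time to finish

# Selector taken from the question's snippet -- an assumption to verify.
items = driver.find_elements(By.CSS_SELECTOR, '#zg-ordered-list li')
for item in items[:10]:
    print(item.text.split('\n')[0])  # the first line is usually the title

driver.quit()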
import requests
import lxml
from bs4 import BeautifulSoup

LISTINGS_URL = 'https://shorturl.at/ceoAB'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/95.0.4638.69 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

response = requests.get(LISTINGS_URL, headers=headers)
listings = response.text

class DataScraper:
    def __init__(self):
        self.soup = BeautifulSoup(listings, "html.parser")

    def get_links(self):
        for a in self.soup.select(".list-card-top a"):
            print(a)
        # listing_text = [link.getText() for link in links]

    def get_address(self):
        pass

    def get_prices(self):
        pass
I have used the correct CSS selectors, even trying to find the elements using attrs in find_all().
What I am trying to achieve is to parse all the anchor tags and then fetch the href links for the specific listings; however, it is only returning the first 10.
You can make a GET request to this endpoint and fetch the data you need.
https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState={"pagination":{"currentPage":1},"mapBounds":{"west":-123.33522421253342,"east":-121.44008261097092,"south":37.041584214606814,"north":38.39290664366326},"isMapVisible":false,"filterState":{"price":{"max":872627},"beds":{"min":1},"isForSaleForeclosure":{"value":false},"monthlyPayment":{"max":3000},"isAuction":{"value":false},"isNewConstruction":{"value":false},"isForRent":{"value":true},"isForSaleByOwner":{"value":false},"isComingSoon":{"value":false},"isForSaleByAgent":{"value":false}},"isListVisible":true,"mapZoom":9}&wants={"cat1":["listResults"]}
Change the "currentPage" url parameter value in the above URL to fetch data from different pages.
Since the response is JSON, you can easily parse it and extract the information using the json module.
The website is probably using lazy loading, so you can either use something like selenium/puppeteer or use the website's own API (the easier way). To do this, make a GET request to a URL that starts with https://www.zillow.com/search/GetSearchPageState.htm (see your browser's dev tools), parse the JSON response, and you will find your href link under cat1.searchResults.listResults[index in array].detailUrl.
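As a rough sketch of that approach (the trimmed searchQueryState below may need the full mapBounds/filterState object from the URL above, and the cat1.searchResults.listResults[...].detailUrl path may change on Zillow's side at any time):

import json
import requests

# Endpoint and JSON path taken from the answers above; Zillow can change both.
BASE = 'https://www.zillow.com/search/GetSearchPageState.htm'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

for page in range(1, 4):  # change "currentPage" to walk through the pages
    # A trimmed searchQueryState; you may need to keep the full mapBounds /
    # filterState from the original URL for the request to succeed.
    query_state = {'pagination': {'currentPage': page}, 'isMapVisible': False, 'isListVisible': True}
    params = {
        'searchQueryState': json.dumps(query_state),
        'wants': json.dumps({'cat1': ['listResults']}),
    }
    data = requests.get(BASE, headers=headers, params=params).json()
    for result in data['cat1']['searchResults']['listResults']:
        print(result['detailUrl'])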
The code below extracts data from Zillow's for-sale listings.
My first question is: where do people get the headers information?
My second question is: how do I know when I need headers? For some other pages, like Cars.com, I don't need to put headers=headers and I can still get the data correctly.
Thank you for your help.
HHC
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.zillow.com/baltimore-md-21201/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%2221201%22%2C%22mapBounds%22%3A%7B%22west%22%3A-76.67377295275878%2C%22east%22%3A-76.5733510472412%2C%22south%22%3A39.26716345016057%2C%22north%22%3A39.32309233550334%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A66811%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A14%7D'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.zillow.com/new-york-ny/rentals/2_p/?searchQueryState=%7B%22pagination'
}
raw_page = requests.get(url, headers=headers)
status = raw_page.status_code
print(status)
# Load the page content into BeautifulSoup
page = raw_page.content
page_soup = BeautifulSoup(page, 'html.parser')
print(page_soup)
You can get headers by going to the site in your browser and opening the Network tab of the developer tools there; select a request and you can see the headers sent with it.
Some websites don't serve bots, so to make them think you're not a bot you set the user-agent header to one a browser uses; some sites may require more headers for you to pass the not-a-bot test. You can see all the headers being sent in the developer tools, and you can test different headers until your request succeeds.
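As a small sketch of that trial-and-error process (the target URL and header values are placeholders taken from this thread, not anything Zillow specifically documents):

import requests

url = 'https://www.zillow.com/baltimore-md-21201/'  # example target from the question

# Start from the headers a real browser sends (copy them from the Network tab),
# then trim or extend until the request succeeds. These values are examples only.
candidate_headers = [
    {},  # no headers at all -- many sites are fine with this
    {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'},
    {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
     'accept-language': 'en-US,en;q=0.9'},
]

for headers in candidate_headers:
    resp = requests.get(url, headers=headers)
    print(len(headers), 'headers ->', resp.status_code)
    if resp.ok:
        break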
From your browser, go to this website: http://myhttpheader.com/
You will find the headers info there.
Secondly, you only need to provide headers when a website like Zillow blocks you from scraping data.
How can I get all the URLs from this particular link: https://www.oddsportal.com/results/#soccer
For every URL on this page, there are multiple pages, e.g. the first link on the page:
https://www.oddsportal.com/soccer/africa/
leads to the below page as an example:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/...
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/#/page/2/...
I would ideally like to code this in Python, as I am pretty comfortable with it (more so than with other languages, though not at all close to what I would call truly comfortable).
After clicking on the link, when I go to inspect element, I can see that the links can be scraped; however, I am very new to this.
Please help.
I have extracted the URLs from the main page that you mentioned.
import requests
import bs4 as bs

url = 'https://www.oddsportal.com/results/#soccer'
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'html.parser')

base_url = 'https://www.oddsportal.com'
a = soup.findAll('a', attrs={'foo': 'f'})

# This set will have all the URLs of the main page
s = set()
for i in a:
    s.add(base_url + i['href'])
Since you are new to web scraping, I suggest you go through these:
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
Docs: https://selenium-python.readthedocs.io/
I am scraping web data with Python using requests and Beautiful Soup. I have found that two of the websites I am scraping from only respond if I do not specify the page number.
The following code works and allows me to extract the data needed:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer', headers = headers)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'class':'col-xs-12 job-results clearfix'})
If however I change the link to specify a page number, such as:
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers = headers)
Then the request never responds. There is no error code; the console just waits indefinitely. What is causing this, and how do I resolve it?
EDIT: I opened the site in Incognito manually. It seems that when opening with the page number I get an "access denied" response, but if I refresh the page it lets me in?
That's because, as you can see, you are not able to access the page numbers on the website from outside. So if you are logged in and have some sort of cookie, then add it to your headers.
What I just checked on the website is that you are trying to access the wrong URI. There are no page numbers. Did you add ?page= on your own?
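If a session really is required, a minimal sketch of passing a cookie copied from your browser might look like this ('sessionid' is a placeholder name, not Milkround's real cookie key; adding a timeout also avoids the indefinite wait you saw):

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)',
    # Copy the Cookie header for a logged-in request from your browser's
    # Network tab; 'sessionid=...' is a placeholder, not the real key.
    'cookie': 'sessionid=PASTE_VALUE_FROM_BROWSER',
}

r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2',
                 headers=headers, timeout=30)
print(r.status_code)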
The problem you're tackling is web scraping. In your case, the web page blocks you because your header declaration lacks a proper user-agent definition.
To get it to work you need to include a user-agent declaration like this:
headers={'user-agent':'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3',}
You can dive more deeply into the problem of writing good web scrapers here:
https://towardsdatascience.com/5-strategies-to-write-unblock-able-web-scrapers-in-python-5e40c147bdaf
A list of proper user-agents can be found here:
https://webscraping.com/blog/User-agents/
Hope this gets you going with your problem.
I use the code below to scrape results from Bing, and when I look at the scraped web page it says "There are no results for python".
But when I search in the browser, there is no problem.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = f'https://www.bing.com/search?q={term}&setlang=en-us'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
I searched and didn't find any similar problem.
You need to pass a user-agent while making the request to get the results.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = 'https://www.bing.com/search?q={}&setlang=en-us'.format(term)
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Since Bing is a dynamic website, meaning JavaScript generates the code, you won't be able to scrape it using only BeautifulSoup. Instead, I recommend Selenium, which opens a browser that you can control, and you can parse the rendered code with BeautifulSoup.
The same will happen for any other dynamically coded website, including Google and many others.
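A minimal sketch of that combination, assuming Chrome with a matching driver is available (the 'b_results'/'b_algo' selectors reflect Bing's markup at the time of writing and should be re-checked in dev tools):

# Render the page in a real browser, then hand the HTML to BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.bing.com/search?q=python&setlang=en-us')

# 'b_results' / 'b_algo' are assumptions about Bing's current markup.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for result in soup.select('#b_results .b_algo h2'):
    print(result.get_text())

driver.quit()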
It's probably because there's no user-agent being passed into the request headers (as already mentioned by KunduK). When no user-agent is specified while using the requests library, it defaults to python-requests, so Bing or other search engines understand that it's a bot/script and block the request. Check what your user-agent is.
Pass a user-agent:
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines. Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
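For illustration, a minimal sketch using SerpApi's Python client from the google-search-results package (the 'bing' engine name and 'organic_results' field follow SerpApi's docs; the API key is a placeholder):

# pip install google-search-results
from serpapi import GoogleSearch

params = {
    "engine": "bing",           # SerpApi's Bing Organic Results engine
    "q": "python",
    "api_key": "YOUR_API_KEY",  # placeholder -- use your own key
}

search = GoogleSearch(params)
results = search.get_dict()

# Organic results arrive as already-structured JSON.
for result in results.get("organic_results", []):
    print(result["title"], result["link"])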
Disclaimer, I work for SerpApi.