I have just started learning web scraping with Python's BeautifulSoup and requests libraries, using PyCharm as my IDE.
import requests
from bs4 import BeautifulSoup
result1 = requests.get("https://www.grainger.com/")
print('result1 is '+ str(result1.status_code))
When I request this website, it keeps loading and never returns, but if I use google.com it gives output.
Why don't I get any output for the website above?
To get status 200 from this site, specify a User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
result1 = requests.get("https://www.grainger.com/", headers=headers)
print('result1 is '+ str(result1.status_code))
Prints:
result1 is 200
The reason this works is that some sites ignore requests that don't appear to be made from a web browser. By default, requests sends a User-Agent of the form python-requests/x.y.z, so the website can tell you are not requesting it from a browser. Your request hangs and eventually times out most likely because their server is silently ignoring it.
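As a quick sanity check, you can print the User-Agent that requests sends by default, and pass a timeout so a silently dropped request fails fast instead of hanging (a minimal sketch; the 10-second timeout is an arbitrary choice):
import requests

# The default User-Agent announces the client as python-requests,
# which is exactly what such sites filter on.
print(requests.utils.default_user_agent())  # e.g. python-requests/2.25.1

try:
    # Without a timeout, a server that ignores the request makes this call hang.
    r = requests.get("https://www.grainger.com/", timeout=10)
    print(r.status_code)
except requests.exceptions.Timeout:
    print("No answer within 10s - the server is probably ignoring us")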
Hmm... there are a few things it could be (see the sketch after this list):
The website might not exist
You're using http instead of https
That site blocks scraping (send a user agent header)
It might be a problem with requests. Try using a different library.
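A quick way to test the first three possibilities in one go (a sketch, assuming the site from the question; swap in your own URL):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}

# Try http vs https, each with and without a browser-like User-Agent.
for url in ('http://www.grainger.com/', 'https://www.grainger.com/'):
    for h in ({}, headers):
        try:
            r = requests.get(url, headers=h, timeout=10)
            print(url, 'UA' if h else 'no UA', r.status_code)
        except requests.exceptions.RequestException as e:
            print(url, 'UA' if h else 'no UA', type(e).__name__)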
I'm trying to fetch a product title and its description from a webpage using the requests module. The title and description appear to be static, as both are present in the page source. However, I failed to grab them with the following attempt; the script throws an AttributeError.
import requests
from bs4 import BeautifulSoup
link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title, product_desc)
How can I scrape the title and description from the above page using the requests module?
The page is dynamic. Go after the data from the API source:
import requests
import pandas as pd
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()
df = pd.json_normalize(jsonData['products'].values())
print(df.iloc[0])
Output:
id 6638030-400
name ANINE BING Women's Plaid Shirt
styleId 6638030
styleNumber
colorCode 400
colorName BLUE
brandLabelName ANINE BING
hasFlatShot True
imageUrl https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price $149.00
pathAlias anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice $149.00
productTypeLvl1 12
productTypeLvl2 216
isUmap False
Name: 0, dtype: object
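If you only need the title, you can read it straight from the JSON without pandas (a sketch based on the fields shown above; note that the description is not part of this endpoint's payload, so it has to come from elsewhere):
import requests

api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()

# 'products' maps product ids to detail dicts; take the first one.
first = next(iter(jsonData['products'].values()))
print(first['name'])   # ANINE BING Women's Plaid Shirt
print(first['price'])  # $149.00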
When testing requests like these, you should output the response to see what you're getting back. It's best to use something like Postman (I think VS Code has a similar feature now) to set up URLs, headers, methods, and parameters, and to see the full response with headers. When you have everything working right, just convert it to Python code; Postman even has 'export to code' functions for common languages.
Anyways...
I tried your request in Postman and it came back blocked as well.
A request made from Python and one made from a browser are fundamentally the same thing: if the headers, URL, and parameters are identical, they should receive identical responses. So the next step is comparing your request against the one the browser makes.
Doing that shows that some combination of the headers the browser includes gets a good response from the server; just setting User-Agent is not enough.
I would try to identify which headers, but unfortunately Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
Probably due to sending an obviously handmade request. I think it's my IP that's blocked, since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
Best of luck!
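If your IP isn't blocked yet, one way to narrow down which headers matter is to replay a full browser-like header set and drop one header at a time, watching for the status code to change (a sketch; the header values here are illustrative, so copy the real ones from your browser's Network tab):
import requests

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'

# Illustrative browser-like headers; replace with the exact set your browser sends.
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.nordstrom.com/',
}

# Remove one header per round; a status change means that header mattered.
for name in list(browser_headers):
    trimmed = {k: v for k, v in browser_headers.items() if k != name}
    status = requests.get(link, headers=trimmed, timeout=10).status_code
    print(f'without {name}: {status}')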
import requests
from bs4 import BeautifulSoup

url_ = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'  # the product page from the question
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get(url_, headers=headers)
I am trying to scrape this website: "https://allegro.pl/uzytkownik/feni44/lampy-przednie-i-elementy-swiatla-do-jazdy-dziennej-drl-255102?bmatch=cl-e2101-d3793-c3792-fd-60-aut-1-3-0412"
Everything was working well until yesterday, but suddenly I get a 403 error.
I have used proxies/VPN but the error still persists.
When scraping a website, you must be careful of the website's anti-DDoS protection strategies. One form of DDoS is submitting many load requests at once, which increases the server's load and hinders its performance. A web scraper does exactly that as it works through each link, so the website can mistake your bot for a DDoSer and block its IP address, making it FORBIDDEN (error 403) to access the website from that IP address.
Usually this is only temporary, so after 12 or 24 hours (or however long the website's block period is) it should be good to go. If you want to avoid a future 403 FORBIDDEN error, consider sleeping for 10 seconds between each request.
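For example, a minimal throttling sketch (the URL list is a placeholder for whatever pages you are crawling):
import time
import requests

urls = ['https://allegro.pl/uzytkownik/feni44']  # placeholder list of target pages

for url in urls:
    r = requests.get(url, timeout=10)
    print(url, r.status_code)
    time.sleep(10)  # wait between requests so the crawl doesn't look like a flood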
Try using a proxy service like Bright proxy; they have 72 million+ proxies. I think this issue will resolve once you rotate the proxy and User-Agent.
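A rough sketch of what rotating proxies and User-Agents per request might look like (the proxy addresses and header values are placeholders, not working endpoints):
import itertools
import requests

# Placeholder pools; substitute proxies from your provider and real UA strings.
proxies = itertools.cycle([
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
])
user_agents = itertools.cycle([
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0',
])

url = 'https://allegro.pl/uzytkownik/feni44'
proxy = next(proxies)
r = requests.get(url,
                 headers={'User-Agent': next(user_agents)},
                 proxies={'http': proxy, 'https': proxy},
                 timeout=10)
print(r.status_code)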
I am trying to crawl data from this site. It uses multiple iframes for different components.
When I try to open one of the iframe URLs in the browser, it opens in that particular session, but in an incognito/private session it doesn't. The same happens when I try it via requests or wget.
I have tried using requests with a Session, but it still doesn't work. Here is my code snippet:
import requests
s = requests.Session()
s.get('https://www.epc.shell.com/')
r = s.get('https://www.epc.shell.com/welcome.asp')
r.text
The last line only returns JavaScript text with an error saying the URL is invalid.
I know Selenium can solve this problem, but I am keeping it as a last resort.
Is it possible to crawl this URL with requests (i.e., without running JavaScript)? If yes, any help would be appreciated. If no, is there any lightweight alternative in Python that can render the JavaScript?
Your issue can be solved by adding custom headers to your requests. All in all, your code should look like this:
import requests
s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Language": "en-US,en;q=0.5"}
s.get('https://www.epc.shell.com/', headers=headers)
r = s.get('https://www.epc.shell.com/welcome.asp', headers=headers)
print(r.text)
(Do note that it is almost always recommended to use headers when sending requests).
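One variation worth considering: set the headers once on the Session so every request inherits them (same behavior, slightly tidier):
import requests

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Language": "en-US,en;q=0.5"}

with requests.Session() as s:
    s.headers.update(headers)  # applied to every request made through this session
    s.get('https://www.epc.shell.com/')
    r = s.get('https://www.epc.shell.com/welcome.asp')
    print(r.text)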
I hope this helps!
I want to get data from this site.
When I request the main URL, I get an HTML file that contains the structure but not the values.
import requests
from bs4 import BeautifulSoup
url ='http://option.ime.co.ir/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')  # parse the response body, not the Response object
print(soup.prettify())
I found out that the site gets its values from:
url1 = 'http://option.ime.co.ir/GetTime'
url2 = 'http://option.ime.co.ir/GetMarketData'
When I look at the responses from those URLs in the browser, I see a JSON-formatted response and the time in a specific format.
But when I use requests to get the data, it gives me the same HTML that I get from the main URL.
Does anyone know the reason? How can I get the responses that I see in the browser?
I checked the headers for all the URLs and didn't find anything special that I should send with my request.
You have to provide the proper HTTP headers in the request. In my case, I was able to make it work using the following headers. Note that in my testing the HTTP response was a 200 OK rather than a redirect to the root website (as when no HTTP headers were provided in the request).
Raw HTTP Request:
GET http://option.ime.co.ir/GetTime HTTP/1.1
Host: option.ime.co.ir
Referer: "http://option.ime.co.ir/"
Accept: "application/json, text/plain, */*"
User-Agent: "Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0"
This should give you the proper JSON response you need.
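Translated into requests, the same call might look like this (headers copied from the raw request above):
import requests

headers = {
    'Referer': 'http://option.ime.co.ir/',
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
}
r = requests.get('http://option.ime.co.ir/GetTime', headers=headers)
print(r.json())  # the JSON body, assuming the server answers 200 OK as above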
Your first connection from the browser gets a 302 redirect response (to the same URL).
The browser then runs some JS, so the second request doesn't redirect anymore and gets the expected JSON.
It is a common technique to stop other people from using an API without permission.
Tick the "Preserve log" checkbox in the dev tools Network tab so you can see it for yourself.
What is the proper way to Google something in Python 3? I have tried requests and urllib for a Google page. When I simply do res = requests.get("https://www.google.com/#q=" + query), it doesn't come back with the same HTML as when I inspect the Google page in Safari. The same happens with urllib, and a similar thing happens when I use Bing. I am familiar with the AJAX API; however, it seems that it is now deprecated.
In Python, if you do not set the User-Agent header on your HTTP requests manually, a default one is added for you, which Google can detect and may block.
Try the following to see if it helps.
import urllib.request

yourUrl = "post it here"  # placeholder: put your target URL here
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(yourUrl, headers=headers)
page = urllib.request.urlopen(req)
print(page.read().decode('utf-8'))
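One more thing worth checking in the original snippet: everything after # in a URL is a fragment and is never sent to the server, so https://www.google.com/#q=query fetches the same page as https://www.google.com/. A sketch using the search path with a real, URL-encoded query parameter instead (still subject to Google's bot detection):
import urllib.parse
import urllib.request

query = "web scraping"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}

# Build the search URL with the query as a proper (encoded) parameter.
url = "https://www.google.com/search?" + urllib.parse.urlencode({"q": query})
req = urllib.request.Request(url, headers=headers)
page = urllib.request.urlopen(req)
print(page.read(500).decode('utf-8', errors='replace'))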