Python requests to usnews.com timing out, other websites work fine - python

import requests
from bs4 import BeautifulSoup

url = "https://www.usnews.com"
page = requests.get(url, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
The request to usnews.com is not working properly: the code either runs forever or, with the timeout set, times out after five seconds as instructed. I have tried other websites, which work perfectly fine (wikipedia.org, google.com).

They are using protection against web scrapers like yours. Whenever you visit a website, your web browser sends a header called a User-Agent. It tells the website what type of browser you are using and whether you are on a phone or a computer. By default, the requests module sends its own User-Agent (something like python-requests/&lt;version&gt;), which many sites recognize and block.
You can set your own User-Agent pretty easily. Using your website as an example:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
url = "https://www.usnews.com"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
This header makes the request look like it comes from a regular browser rather than a bot.
You can learn more about User Agents here (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent).
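If you want to see what is being sent by default, here is a quick sketch (httpbin.org is used only as a neutral echo service for illustration):
import requests

# The default headers requests attaches to every request; the User-Agent
# is "python-requests/<version>", which many sites flag as a bot.
print(requests.utils.default_headers())

# httpbin.org simply echoes back the headers it receives, so you can
# confirm what a server actually sees from your script.
print(requests.get("https://httpbin.org/headers", timeout=5).json())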

You could also try something like Selenium. The code below uses requests-html, which is similar to Selenium but a bit more user friendly:
from requests_html import HTMLSession
import re
#from fake_useragent import UserAgent
#create the session
#ua = UserAgent()
session = HTMLSession()
#define our URL
url = "https://www.usnews.com"
#use the session to get the data
r = session.get(url)
#render the page; increase scrolldown to page down multiple times
r.html.render(sleep=1, timeout=30, keep_page=True, scrolldown=1)
print(r.text)
This code mimics a real web browser and should bypass the bot detection.

Related

How to crawl a website that requires login using BeautifulSoup in Python3

I'm trying to parse articles from 'https://financialpost.com/'; an example link is provided below. To parse it, I need to log in to their website.
I do successfully post my credentials; however, it still does not parse the entire webpage, just the beginning.
How do I crawl everything?
import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

link = 'https://financialpost.com/sign-in/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    payload = {i['email']: i.get('value', '') for i in soup.select('input[email]')}
    payload['email'] = 'email#email.com'
    payload['password'] = 'my_password'
    s.post(link, data=payload)

url = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'
content_url = Request(url)
article_content = urlopen(content_url).read()
article_soup = BeautifulSoup(article_content, 'html.parser')
article_table = article_soup.findAll('section', attrs={'class': 'article-content__content-group'})
for x in article_table:
    print(x.find('p').text)
Using just requests
It's a bit complicated using just requests, but it is possible. You would first have to authenticate to get an authentication token, then request the article with that token so the site knows you are authenticated and returns the full article. To find out which API endpoints are used to authenticate and to load the content, you can use something like Chrome dev tools or Fiddler (they record all HTTP requests, so you can manually pick out the interesting ones).
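As a minimal sketch of that flow with requests alone, assuming the sign-in page accepts a plain form post with email and password fields (the real endpoint and field names have to be confirmed in dev tools and may well be a separate JSON API):
import requests
from bs4 import BeautifulSoup

login_url = 'https://financialpost.com/sign-in/'   # assumed form endpoint
article_url = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    # Post the credentials; any auth cookies/tokens the server sets are
    # kept on the session object.
    s.post(login_url, data={'email': 'you@example.com', 'password': 'my_password'})

    # Fetch the article with the SAME session, so those cookies are sent
    # along (a separate urlopen call would throw the authentication away).
    res = s.get(article_url)
    article_soup = BeautifulSoup(res.text, 'html.parser')
    for section in article_soup.find_all('section', attrs={'class': 'article-content__content-group'}):
        p = section.find('p')
        if p:
            print(p.text)
The key point is that the article is fetched with the same Session that performed the login, instead of a separate urlopen call that knows nothing about the authentication.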
Using just selenium
An easier way would be to just use Selenium. It drives a real browser from code, so you can open the login page, authenticate, and request the article, and the site will think you are a human.
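A hedged sketch of the Selenium route (the form locators and the fixed waits are assumptions about the page; inspect the real login form and adjust):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()          # or webdriver.Firefox()
driver.get('https://financialpost.com/sign-in/')
time.sleep(3)                        # crude wait; WebDriverWait would be more robust

# The locators below are assumptions -- inspect the real login form first.
driver.find_element(By.NAME, 'email').send_keys('you@example.com')
driver.find_element(By.NAME, 'password').send_keys('my_password')
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
time.sleep(3)                        # wait for the login to complete

# The browser is now authenticated; load the article and read the rendered HTML.
driver.get('https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass')
html = driver.page_source
driver.quit()
From here you can hand html to BeautifulSoup exactly as in the requests version above.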

getting no response from a url?

I'm getting no response from a URL when using requests.get; on the other hand, if I paste the URL into Firefox it responds. The provided URL is a link to a JSON file. I don't know what's happening. Here is my code:
from urllib.request import urlopen,Request
import requests
import pprint
import json
import pandas as pd
url = "https://www.nseindia.com/api/option-chain-equities?symbol=ACC"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
response = requests.get(url, headers=headers)
print(response.status_code)
##data_json = json.loads(response.read())
df = pd.read_json(response)
pprint.pprint(df['records'][1])
This website protects itself from bots. There are many ways to detect bots; some of them are:
requests rate
disabled JavaScript
empty cookies
not using a mouse to click buttons
etc.
To enable JavaScript and cookies, you can use Selenium.
The website you want to scrape has powerful bot detection. I couldn't access the link that you shared directly. But when I first opened the website's main page and then your link, it showed the JSON file.
Even so, this is not an easy site to write a bot for. I tried Selenium and clicked the website's button by moving the mouse, but it still detected that I was a bot. So we can conclude that the website also checks cookies: you need to carry the right cookies to access the webpage.
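A sketch of that "main page first, then the link" observation using a single requests.Session, so the cookies set by the home page are reused on the API call (no guarantee this keeps working if NSE tightens its checks further):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

with requests.Session() as s:
    s.headers.update(headers)
    # Hit the main page first so the session picks up the cookies the
    # site expects to see on its API calls.
    s.get('https://www.nseindia.com', timeout=10)
    # Call the option-chain API with the same session (same cookies).
    r = s.get('https://www.nseindia.com/api/option-chain-equities?symbol=ACC', timeout=10)
    print(r.status_code)
    print(list(r.json().keys()))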

BeautifulSoup and MechanicalSoup won't read website

I am working with BeautifulSoup and also trying MechanicalSoup, and I have gotten them to load other websites, but when I request this particular website it takes a long time and then never actually gets it. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the MechanicalSoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the URL will change to 2-bedrooms and then 3-bedrooms, and then it will move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.
You see the same thing if you use curl or wget to fetch the page. My guess is that they are using browser detection to try to prevent people from scraping their copyrighted information, as you are attempting to do. You can look at the User-Agent header to see how to pretend to be another browser.
import requests
from bs4 import BeautifulSoup as soup
headers = requests.utils.default_headers()
headers.update({
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code so that the request wasn't treated as coming from a bot.

beautiful soup returns none when the element exists in browser

I have looked through the previous answers but none seemed applicable. I am building an open source Quizlet scraper to extract all links from a class (e.g. https://quizlet.com/class/3675834/). In this case, the tag is "a" and the class is "UILink". But when I use the following code, the returned list does not contain the element I am looking for. Is it because of the JavaScript issue described here?
I tried the previously suggested method of importing the folder, as written here, but it does not contain the URLs.
How can I scrape these urls?
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
}
url = 'https://quizlet.com/class/8536895/'
response = requests.get(url, verify=False, headers=headers)
soup = BeautifulSoup(response.text,'html.parser')
b = soup.find_all("a", class_="UILink")
You wouldn't be able to directly scrape dynamic webpages using just requests. What you see in the browser is the fully rendered page, which the browser takes care of building.
In order to scrape data from these kinds of webpages, you can follow either of the approaches below.
Use requests-html instead of requests
pip install requests-html
scraper.py
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
url = 'https://quizlet.com/class/8536895/'
response = session.get(url)
response.html.render() # render the webpage
# access html page source with html.html
soup = BeautifulSoup(response.html.html, 'html.parser')
b = soup.find_all("a", class_="UILink")
print(len(b))
Note: this uses a headless browser (Chromium) under the hood to render the page, so it can time out or be a little slow at times.
Use selenium webdriver
Use driver.get(url) to get the page and pass the page source to Beautiful Soup with driver.page_source.
Note: run this in headless mode as well, and expect some latency at times.
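For reference, a minimal headless sketch of that Selenium variant (the class name UILink comes from the question; the fixed sleep is a crude stand-in for an explicit wait):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument('--headless')           # no visible browser window

driver = webdriver.Chrome(options=options)
driver.get('https://quizlet.com/class/8536895/')
time.sleep(2)                                # crude wait for the JavaScript to finish

# page_source is the HTML *after* JavaScript has run, so the
# dynamically inserted links are present.
soup = BeautifulSoup(driver.page_source, 'html.parser')
b = soup.find_all('a', class_='UILink')
print(len(b))
driver.quit()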

Download html in python?

I am trying to download the HTML of a page that is requested through a JavaScript action when you click a link in the browser. I can download the first page because it has a general URL:
http://www.locationary.com/stats/hotzone.jsp?hz=1
But there are links along the bottom of the page that are numbers (1 to 10). So if you click on one, it goes to, for example, page 2:
http://www.locationary.com/stats/hotzone.jsp?ACTION_TOKEN=hotzone_jsp$JspView$NumericAction&inPageNumber=2
When I put that URL into my program and try to download the HTML, it gives me the HTML of a different page on the website; I think it is the home page.
How can I get the HTML of a page that is loaded through JavaScript and has no specific URL of its own?
Thanks.
Code:
import urllib
import urllib2
import cookielib
import re

URL = ''

def load(url):
    data = urllib.urlencode({"inUserName": "email", "inUserPass": "password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie', 'site_version=REGULAR'))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction").read()
    h = response.info().headers
    jsid = re.findall(r'Set-Cookie: (.*);', str(h[5]))
    data = urllib.urlencode({"inUserName": "email", "inUserPass": "password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie', 'site_version=REGULAR; ' + str(jsid[0])))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open(url).read()
    print page

load(URL)
The Selenium WebDriver from the Selenium tool suite uses standard browsers to retrieve the HTML (its main goal is test automation for web applications), so it is well suited for scraping JavaScript-rich applications. It has nice Python bindings.
I tend to use Selenium to grab the page source after all the Ajax stuff has fired and then parse it with something like BeautifulSoup (BeautifulSoup copes well with malformed HTML).
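As a rough sketch of that workflow for the numbered links in this question (the link-text locator and the fixed sleeps are assumptions about the page's markup, and the site may also require a login step first):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get('http://www.locationary.com/stats/hotzone.jsp?hz=1')
time.sleep(3)                     # crude wait for the page and its JavaScript

for page_number in range(2, 11):
    # Click the numbered link at the bottom; locating it by its link text
    # is an assumption about the page's markup.
    driver.find_element(By.LINK_TEXT, str(page_number)).click()
    time.sleep(3)                 # wait for the JavaScript action to swap the content

    # page_source now holds the HTML that the click produced, which a
    # plain urllib2 request never sees.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(page_number, len(soup.get_text()))

driver.quit()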
