Python library to fetch content (HTTP text) from websites?

I'd like something equivalent to urlopen from Python's urllib library to fetch data from the web.
urlopen does not seem to work on sites like Google or YouTube, probably due to incorrect headers.
Is there another Python-based content fetcher I can use?

Here is an example with urllib2 that fetches a web page:
import urllib2
response = urllib2.urlopen('http://google.com/')
html = response.read()
The example is taken from the urllib2 documentation.
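If a site rejects the default urllib2 User-Agent, you can build a Request with browser-like headers before opening it. A minimal sketch of that idea (the header value is illustrative, not a requirement of any particular site):

import urllib2

# Supply a browser-like User-Agent so the server does not reject the default one
request = urllib2.Request('http://www.youtube.com/', headers={'User-Agent': 'Mozilla/5.0'})
response = urllib2.urlopen(request)
html = response.read()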

Related

How to scrape a dynamic JavaScript-based website using Python requests and BeautifulSoup?

I am scraping https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all to collect college information.
On the webpage, only one course name is shown below each college; the rest of the courses are rendered by JavaScript (for example, behind the "+13 More Courses" link).
So I don't get their info when I use requests.get(url).
How can I scrape such details using requests and BeautifulSoup?
I use Anaconda Jupyter Notebook as my IDE.
I have heard about Selenium but don't know much about it. Since Selenium is a bit heavy, is there a lighter alternative that can load all the JavaScript content at once?
I have also heard about the Splash framework. If anyone knows about it and how to integrate it with Python requests and BeautifulSoup, please answer.
Things I have tried
1. PyQt
Reference: https://www.youtube.com/watch?v=FSH77vnOGqU
I have imported different libraries than in the video, depending on the PyQt version in Anaconda.

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
import requests
from bs4 import BeautifulSoup

class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()

url = "https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
client_response = Client(url)
src = client_response.mainFrame().toHtml()
soup = BeautifulSoup(src, "lxml")
tpm = soup.find_all("section", {"class": "tpl-curse-dtls more_46905_0"})
print(tpm)

Output: []
2. json() in the requests module

import requests
from bs4 import BeautifulSoup

url = "https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
r = requests.get(url)
a = r.json()  # fails because the page body is HTML, not JSON

Output:
JSONDecodeError: Expecting value: line 3 column 1 (char 3)
3. json.loads() from the json module
(Screenshot: network-inspector details shown on clicking the link.)

import json
import requests

j_url = 'https://www.shiksha.com//nationalCategoryList/NationalCategoryList/loadMoreCourses/'

def j_data(url=j_url):
    # 'tp' comes from earlier BeautifulSoup parsing of the listing page
    dt = tp[0].find_all("input", {"id": "remainingCourseIds_46905"})
    output = dt[0]['value']
    data = {
        'courseIds': '231298,231294,231304,231306',
        'loadedCourseCount': 0
        # 'page': page
    }
    response = requests.post(url, data=data)
    return json.loads(response.content)  # the original code had json.loads(r.content), referencing the wrong variable

print(j_data())

Output:
JSONDecodeError: Expecting value: line 3 column 1 (char 3)
dryscrape is not available for Windows.
You don't need to know what the JavaScript does. Just click the link and use your browser's inspector to observe the network request it makes.
In your specific case, the JavaScript sends a POST request to "/nationalCategoryList/NationalCategoryList/loadMoreCourses/". So you can send the same request, and you'll get back a new HTML string. You can parse that string with BeautifulSoup and extract the data you need.
There is one extra step: the POST request needs a payload that specifies its parameters. You should be able to find these parameters in the original page; look at their surrounding HTML elements and either use BeautifulSoup to extract them or use a regular expression to find them.
I hope it helps.
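A minimal sketch of that approach, assuming the hidden-input id and payload keys shown in the question ('remainingCourseIds_46905', 'courseIds', 'loadedCourseCount'); adjust these to whatever your network inspector actually shows:

import requests
from bs4 import BeautifulSoup

listing_url = "https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
more_url = "https://www.shiksha.com/nationalCategoryList/NationalCategoryList/loadMoreCourses/"

# Fetch the listing page and read the hidden input that holds the remaining course ids
listing = requests.get(listing_url)
soup = BeautifulSoup(listing.text, "lxml")
hidden = soup.find("input", {"id": "remainingCourseIds_46905"})

payload = {
    'courseIds': hidden['value'],
    'loadedCourseCount': 0,
}

# The endpoint returns an HTML fragment, not JSON, so parse it with BeautifulSoup
response = requests.post(more_url, data=payload)
fragment = BeautifulSoup(response.text, "lxml")
print(fragment.get_text(" ", strip=True))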

Python - Requests pulling HTML instead of JSON

I'm building a Python web scraper (personal use) and am running into some trouble retrieving a JSON file. I was able to find the request URL I need, but when I run my script (I'm using Requests) the URL returns HTML instead of the JSON shown in the Chrome Developer Tools console. Here's my current script:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url)
print(r.text)
Completely new to Python, so any push in the right direction is greatly appreciated. Thanks!
It looks like that website returns a different response depending on the Accept header provided with the request. So try:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url, headers={'accept': 'application/json'})
print(r.json())
You can have a look at the full API for further reference: http://docs.python-requests.org/en/latest/api/.
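To confirm the content negotiation worked, you can also check the Content-Type the server reports (a quick sanity check, not part of the original answer):

import requests

url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url, headers={'accept': 'application/json'})
print(r.headers.get('Content-Type'))  # expect application/json rather than text/html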

Python: Extract the query parameters that any given URL receives

Background:
Typically, if I want to see what type of requests a website is getting, I open Chrome Developer Tools (F12), go to the Network tab, and filter the requests I want to see.
Example: (screenshot of a filtered request in the Network tab)
Once I have the request URL, I can simply parse it for the query string parameters I want.
This is a very manual task, and I thought I could write a script that does this for any URL I provide. Python seemed great for this.
Task:
I have found a library called requests that I use to validate the URL before opening it.
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urlopen(validatedRequest)
However, I am unsure how to get the requests that the URL I enter receives. Is this possible in Python? A pointer in the right direction would be great. Once I know how to access these request headers, I can easily parse through them.
Thank you.
You can use the urlparse function to fetch the query params.
Demo:
import requests
import urllib
from urlparse import urlparse
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urllib.urlopen(validatedRequest)
print urlparse(page.url).query
Result:
gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI
Tested in Python 2.7.
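For Python 3, the same demo would use urllib.parse instead; a sketch (the query values will differ per request):

import requests
from urllib.parse import urlparse, parse_qs

test_page = "http://www.google.com"
validated = requests.get(test_page, verify=False).url
query = urlparse(validated).query
print(query)            # the raw query string
print(parse_qs(query))  # parsed into a dict of name -> list of values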

Python urllib2.urlopen - HTML text is garbled - why?

The printed HTML is garbled text, instead of what I expect to see in "view source" in a browser.
Why is that? How can I fix it easily?
Thank you for your help.
Same behavior using mechanize, curl, etc.
import urllib
import urllib2
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html
I got the same garbled text using curl:
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
The result appears to be gzipped, so this shows the correct HTML for me:
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML
Edited by OP:
The revised answer, after reading the above, is:

import urllib2
import gzip
import StringIO

start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()

# The body is gzip-compressed; decompress it before use
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()

html now holds the HTML (print it to see).
Try requests. Python Requests.
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text
The reason for this is that the site uses gzip encoding. To my knowledge, urllib2 does not decompress responses, so you end up with compressed HTML from sites that use that encoding. You can confirm this by printing the response headers, like so:
print response.headers
There you will see that the "Content-Encoding" is gzip. To get around this with the standard urllib2 library, you would need to use the gzip module, as in the edit above. mechanize behaves the same way because it is built on the same urllib code. requests handles this encoding and decodes the response nicely for you.

Twitter authentication using cookielib

Can someone tell me why this doesn't work?
import cookielib
import urllib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
data = urllib.urlencode({'session[username_or_email]': 'twitter handle', 'session[password]': 'password'})
opener.open('https://twitter.com', data)
stuff = opener.open('https://twitter.com')
print stuff.read()

Why doesn't this give the HTML of the page after logging in?
Please consider using an OAuth library for this task. Scraping the site with something like mechanize is not recommended, because Twitter can change its HTML at any time, and then your code will break.
Check this out: python-twitter at http://code.google.com/p/python-twitter/
The simplest example to post an update:
>>> import twitter
>>> api = twitter.Api(
...     consumer_key='yourConsumerKey',
...     consumer_secret='consumerSecret',
...     access_token_key='accessToken',
...     access_token_secret='accessTokenSecret')
>>> api.PostUpdate('Blah blah blah!')
There can be many reasons why it is failing:
Twitter probably expects a User-Agent header, which you are not providing.
I didn't look at the HTML, but maybe there's some JavaScript at play before the form is actually submitted (I actually think this is the case, because I vaguely remember writing a very detailed answer on this exact thing, though I can't seem to find the link to it).
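A minimal sketch of the first point, adding a browser-like User-Agent to the opener (the header value is illustrative):

import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# addheaders is a list of (name, value) pairs sent with every request made by this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/10.0')]
print opener.open('https://twitter.com').read()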
