Link with status code 200 redirects - Python

I have a link that returns status code 200, but when I open it in a browser it redirects.
Fetching the same link with Python Requests simply returns the data from the original link. I tried both Python Requests and urllib with no success.
How do I capture the final URL and its data?
And how can a link with status 200 redirect at all?
>>> url = 'http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18'
>>> r = requests.get(url)
>>> r.url
'http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18'
>>> r.history
[]
>>> r.status_code
200

This kind of redirect is done client-side, by a meta refresh tag rather than an HTTP redirect, so you won't get the redirected link directly from requests.get(...). The original URL has the following page source:
<html>
<head>
<meta http-equiv="refresh" content="0;URL=http://www.afaqs.com/interviews/index.html?id=572_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18">
<script type="text/javascript" src="http://gc.kis.v2.scr.kaspersky-labs.com/D5838D60-3633-1046-AA3A-D5DDF145A207/main.js" charset="UTF-8"></script>
</head>
<body bgcolor="#FFFFFF"></body>
</html>
Here, you can see the redirected URL. Your job is to scrape that. You can do it using RegEx, or simply some string split operations.
For example:
import requests

r = requests.get('http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18')
redirected_url = r.text.split('URL=')[1].split('">')[0]
print(redirected_url)
# http://www.afaqs.com/interviews/index.html?id=572_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18
r = requests.get(redirected_url)
# Start scraping from this link...
Or, using a regex:
import re

redirected_url = re.findall(r'URL=(http.*)">', r.text)[0]

URLs like these live in the page markup (here, a meta refresh tag) rather than in the HTTP response headers, so Python's HTTP libraries do not follow them.
To get the link, simply extract it from its tag.
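For example, a more robust alternative to string splitting (a sketch, assuming the redirect always arrives as a meta refresh tag) is to parse the tag with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = 'http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# a meta refresh looks like: <meta http-equiv="refresh" content="0;URL=...">
meta = soup.find('meta', attrs={'http-equiv': 'refresh'})
if meta is not None:
    # content is "0;URL=http://...", so split off the URL part
    redirected_url = meta['content'].split('URL=', 1)[1]
    r = requests.get(redirected_url)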

Related

Why is the Python requests module not pulling the whole HTML?

The link: https://www.hyatt.com/explore-hotels/service/hotels
Code:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I also tried this:
import json

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
Output:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag the data is in, showing only a single script tag.
How do I access the data (which in the browser view looks like JSON)?
I don't believe this can be done... that data simply isn't in r.text.
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see, there is no <pre> tag with the data, for whatever reason, so you're unable to access it.
I also get a 429 error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? This site doesn't seem willing to give anything up, and some sites simply can't be parsed, for various reasons. If you want to play with JSON data, I would look into using an API instead.
If you google https://www.hyatt.com and manually navigate to the URL you mentioned, you get a 404 error.
I would say Hyatt don't want you parsing their site. So don't!
The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a usable form with the standard json package:
import requests
import json

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
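As a side note, requests can decode JSON itself via r.json(), so the explicit json import is optional:
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = r.json()  # equivalent to json.loads(r.text); raises an error if the body isn't JSON
print(data)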

Getting the source code for a section directly with requests in Python

I want to get the source code of only one section of a website instead of the whole page and then parse out the section, since that should be faster than loading the whole page and then parsing. I tried passing the section link as the URL parameter but still get the whole page.
url = 'https://stackoverflow.com/questions/19012495/smooth-scroll-to-div-id-jquery/#answer-19013712'
response = requests.get(url)
print(response.text)
You cannot get a specific section directly with the requests API, but you can use BeautifulSoup for that purpose.
A small sample is given on the Dataquest website:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(page.content)
Running the above script outputs this HTML string:
<html>
<head>
<title>A simple example page
</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p><p class="inner-text">
Second paragraph.
</p></div>
<p class="outer-text first-item" id="second"><b>
First outer paragraph.
</b></p><p class="outer-text"><b>
Second outer paragraph.
</b>
</p>
</body>
</html>
You can get a specific section by finding it through tag type, class, or id.
By tag-type:
soup.find_all('p')
By class:
soup.find_all('p', class_='outer-text')
By id:
soup.find_all(id="first")
HTTP will not let you do that: the #answer-19013712 part of the URL is a fragment, which the browser handles client-side and never sends to the server.
You can use the Stack Overflow API instead. Pass the answer id 19013712, and you get only that specific answer via the API.
Note that you may still have to register for an app key.
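A minimal sketch of that approach (assuming the public Stack Exchange API 2.3 endpoint; the withbody filter asks the API to include the answer body):
import requests

# fetch answer 19013712 directly from the Stack Exchange API
api_url = 'https://api.stackexchange.com/2.3/answers/19013712'
params = {'site': 'stackoverflow', 'filter': 'withbody'}
r = requests.get(api_url, params=params)
answer = r.json()['items'][0]
print(answer['body'])  # the HTML of just that answer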

Python requests only returning empty lists when scraping a .htm page

I am attempting to scrape a .htm link and cannot get my script to return anything besides '[]'.
link = https://www.forwardhealth.wi.gov/WIPortal/StaticContent/Member/caseloads/481-caseload.htm
import requests
from bs4 import BeautifulSoup as bs
link = 'https://www.forwardhealth.wi.gov/WIPortal/StaticContent/Member/caseloads/481-caseload.htm'
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get(link, headers=headers)
soup = bs(r.text, 'lxml') #I've tried other html parsers in here as well as r.content
I believe the issue lies in my attempt to interact with the page (possibly incorrect encoding?). The above format is how I've always set up my web scraping in the past, and I haven't had any issues I couldn't address. What stands out most is that when I call r.content or r.text I get a response that seems foreign:
'<HTML>\r\n<HEAD>\r\n<TITLE>481-caseload</TITLE>\r\n<META NAME="GENERATOR" CONTENT="Microsoft FrontPage 5.0">\r\n<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">\r\n</HEAD>\r\n\r\n<FRAMESET ROWS="*,48" FRAMESPACING="0" FRAMEBORDER="no" BORDER="0">\r\n<FRAME NAME="ReportArea" SRC="481-caseload/by_county_tribe/by_county_tribe.htm"\r\n MARGINWIDTH="0" MARGINHEIGHT="0" SCROLLING="no" FRAMEBORDER="0" NORESIZE>\r\n<FRAMESET COLS="*" FRAMESPACING="0" FRAMEBORDER="0" BORDER="0">\r\n<FRAME NAME="ReportLinks" SRC="481-caseload/DocLinks.htm" FRAMEBORDER="0" MARGINWIDTH="2" MARGINHEIGHT="3" scrolling="auto">\r\n</FRAMESET></FRAMESET></HTML>'
This makes me think that my script isn't properly written to handle whatever this is above. I've never seen "Microsoft FrontPage 5.0" before and don't know if that might be what's throwing off my code. I've tried forcing an encoding by changing r.encoding = #encoding here. Any guidance would be helpful.
This is because the page consists of multiple nested frames: separate pages, with their own URLs, that the browser loads when the main "container" page is loaded. Use the browser developer tools to inspect the page and see which frame your desired content is in.
The main content of this page comes from this URL:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: url = "https://www.forwardhealth.wi.gov/WIPortal/StaticContent/Member/caseloads/481-caseload/by_county_tribe/0.htm"
In [4]: response = requests.get(url)
In [5]: soup = BeautifulSoup(response.content, "lxml")
In [6]: soup.select_one("td.s2").get_text()
Out[6]: 'Wisconsin Medicaid'
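If you would rather not hard-code the frame URL, you can discover it from the parent page (a sketch, assuming the content stays inside <frame> tags):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

link = 'https://www.forwardhealth.wi.gov/WIPortal/StaticContent/Member/caseloads/481-caseload.htm'
r = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'lxml')

# each <frame> has a src that is relative to the parent page
for frame in soup.find_all('frame'):
    print(urljoin(link, frame['src']))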

requests.get(url) cannot get the HTML body

I am trying to crawl a website.
import requests
url = 'https://stackoverflow.com'
r = requests.get(url)
content = r.text
print(content)
This code works well. But when I try to connect to another URL:
url = 'http://www.iyingdi.com/web/article'
# Sorry, it is a Chinese website, but that should not matter;
# I don't see the same problem on English websites.
Output:
<html>
<head>
<!--head part works well-->
</head>
<body>
<!--body part is empty-->
</body>
</html>
The head part comes through fine, but the body part is empty!
I tried searching Google and Stack Overflow, but I have not found the same problem.
Thanks for your answers.

Is it possible to hook up a more robust HTML parser to Python mechanize?

I am trying to parse and submit a form on a website using mechanize, but it appears that the built-in form parser cannot detect the form and its elements. I suspect that it is choking on poorly formed HTML, and I'd like to try pre-parsing it with a parser better designed to handle bad HTML (say lxml or BeautifulSoup) and then feeding the prettified, cleaned-up output to the form parser. I need mechanize not only for submitting the form but also for maintaining sessions (I'm working this form from within a login session.)
I'm not sure how to go about doing this, if it is indeed possible. I'm not that familiar with the various details of the HTTP protocol or how to get the various parts to work together, etc. Any pointers?
I had a problem where a form field was missing from a form. I couldn't find any malformed HTML, but I figured that was the cause, so I used BeautifulSoup's prettify function on the response and it worked.
resp = br.open(url)
soup = BeautifulSoup(resp.get_data())
resp.set_data(soup.prettify())
br.set_response(resp)
I'd love to know how to do this automatically.
Edit: found out how to do this automatically
import mechanize
from bs4 import BeautifulSoup

class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if the response is HTML
        if 'html' in response.info().get('content-type', ''):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

    # also handle https responses in the same way
    https_response = http_response

br = mechanize.Browser()
br.add_handler(PrettifyHandler())
br will now use BeautifulSoup to prettify every response whose content type (MIME type) contains html, e.g. text/html.
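With the handler installed, form parsing should run on the prettified HTML. A hypothetical usage sketch (the URL and field name below are placeholders, not from the original question):
br.open('http://example.com/login')  # hypothetical URL
br.select_form(nr=0)                 # first form on the cleaned-up page
br['username'] = 'me'                # hypothetical field name
response = br.submit()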
Reading from the big example on the first page of the mechanize website:
# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)
So it seems very possible to preprocess the response with another parser that regenerates well-formed HTML, then feed it back to mechanize for further processing.
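For instance, the same handler pattern could use lxml instead of BeautifulSoup to regenerate well-formed HTML (a sketch under the same assumptions as the handler above):
import mechanize
from lxml import etree

class LxmlCleanHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, 'seek'):
            response = mechanize.response_seek_wrapper(response)
        if 'html' in response.info().get('content-type', ''):
            # reparse and reserialize; lxml closes and matches up bad tags
            root = etree.HTML(response.get_data())
            response.set_data(etree.tostring(root, method='html'))
        return response

    https_response = http_response

br = mechanize.Browser()
br.add_handler(LxmlCleanHandler())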
What you're looking for can be done with lxml.etree, which is the xml.etree.ElementTree emulator (and replacement) provided by lxml.
First we take some badly formed HTML:
% cat bad.html
<html>
<HEAD>
<TITLE>this HTML is awful</title>
</head>
<body>
<h1>THIS IS H1</H1>
<A HREF=MYLINK.HTML>This is a link and it is awful</a>
<img src=yay.gif>
</body>
</html>
(Observe the mixed case between opening and closing tags and the missing quotation marks.)
And then parse it:
>>> from lxml import etree
>>> bad = open('bad.html').read()
>>> html = etree.HTML(bad)
>>> print(etree.tostring(html).decode())
<html><head><title>this HTML is awful</title></head><body>
<h1>THIS IS H1</h1>
<a href="MYLINK.HTML">This is a link and it is awful</a>
<img src="yay.gif"/></body></html>
Observe that the tagging and quotation have been corrected for us.
If you were having problems parsing the HTML before, this might be the answer you're looking for. As for the details of HTTP, that is another matter entirely.
