requests.get(url) cannot get html's body - python

I am trying to crawl a website.
import requests
url='https://stackoverflow.com'
r=requests.get(url)
content = r.text
print(content)
This code works well. But when I try to connect to another URL:
url='http://www.iyingdi.com/web/article'
# Sorry, it is a Chinese website, but that does not matter;
# I don't see the same problem on English websites.
output:
<html>
<head>
<!--head part works well-->
</head>
<body>
<!--body part is empty-->
</body>
</html>
The head part works well, but the body part is empty!
I have searched Google and Stack Overflow but haven't found the same problem.
Thanks for your answers.
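A quick way to confirm what the server is actually sending for the body is to parse the response and look at the <body> element; a minimal sketch, assuming the bs4 package is installed (an empty result means the visible content is filled in later by JavaScript in the browser, which requests never executes):
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.iyingdi.com/web/article')
soup = BeautifulSoup(r.text, 'html.parser')
# Print whatever the server returned inside <body>; an empty string means the
# content is rendered client-side and never appears in r.text.
body = soup.find('body')
print(body.get_text(strip=True) if body else 'no <body> in response')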

Related

Why python requests module not pulling the whole html?

The link: https://www.hyatt.com/explore-hotels/service/hotels
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I also tried this:
import json
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
output:
<!DOCTYPE html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag the data are in, only showing a single script tag.
How do I access the data (shown in the browser view, it looks like JSON)?
I don't believe this can be done... that data simply isn't in r.text.
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see, there is no <pre> tag for whatever reason, so you're unable to access that.
I also get a 429 error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? This site doesn't seem willing to serve anything useful to a plain request. Some sites can't be scraped, for various reasons. If you want to play with JSON data, I would look into using an API instead.
If you google https://www.hyatt.com and manually go to the URL you mentioned, you get a 404 error.
I would say Hyatt doesn't want you scraping their site. So don't!
The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a usable form with the standard json package:
import json
import requests
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
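Since the response is already JSON, requests can also decode it for you; a minimal equivalent sketch:
import requests
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = r.json()  # same result as json.loads(r.text) for a JSON response
print(data)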

<!DOCTYPE html> missing in Selenium Python page_source

I'm using Selenium for functional testing of a Django application and thought I'd try html5lib as a way of validating the html output. One of the validations is that the page starts with a <!DOCTYPE ...> tag.
The unit-test checks using response.content.decode() all worked fine, correctly flagging errors, but I found that the Selenium driver.page_source output starts with the <html> tag instead. I have double-checked that I'm using the correct template by modifying the title and making sure that the change is reflected in the page_source. There is also a missing newline and indentation between the <html> tag and the <title> tag.
This is what the first few lines look like in the Firefox browser.
<!DOCTYPE html>
<html>
<head>
<title>NetLog</title>
</head>
Here's the Python code.
self.driver.get(f"{self.live_server_url}/netlog/")
print(self.driver.page_source)
And here are the first few lines of the print output when run under the Firefox web driver.
<html><head>
<title>NetLog</title>
</head>
The page body looks fine, and the newline is likewise missing between </body> and </html>. Is this expected behaviour? I suppose I could just stuff the DOCTYPE tag in front of the string as a workaround, but I would prefer it to behave as intended.
Chris
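For what it's worth, a minimal sketch of that workaround, assuming the same self.driver Selenium instance as in the question: the browser still knows the document's DOCTYPE, so it can be serialized with a little JavaScript and prepended to page_source before validation.
# Ask the browser for the DOCTYPE node and serialize it (empty string if absent).
doctype = self.driver.execute_script(
    "return document.doctype ? new XMLSerializer().serializeToString(document.doctype) : '';"
)
# Prepend it so html5lib sees a complete document.
html = doctype + self.driver.page_source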

Getting src code for a section directly with requests in python

I want to get the source code of only one section of a website instead of the whole page and then parse out that section, since that would be faster than loading the whole page and then parsing it. I tried passing the section link as the URL parameter but I still get the whole page.
url = 'https://stackoverflow.com/questions/19012495/smooth-scroll-to-div-id-jquery/#answer-19013712'
response = requests.get(url)
print(response.text)
You cannot get a specific section directly with the requests API, but you can use BeautifulSoup for that purpose.
A small sample is given on the Dataquest website:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(page.text)
Running the above script will output this HTML:
<html>
<head>
<title>A simple example page
</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p><p class="inner-text">
Second paragraph.
</p></div>
<p class="outer-text first-item" id="second"><b>
First outer paragraph.
</b></p><p class="outer-text"><b>
Second outer paragraph.
</b>
</p>
</body>
</html>
You can get a specific section by finding it through tag type, class, or id.
By tag-type:
soup.find_all('p')
By class:
soup.find_all('p', class_='outer-text')
By Id:
soup.find_all(id="first")
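Applied to the URL from the question, you would still fetch the whole page once and then pull out just the answer's block; a hedged sketch, assuming the answer's container on Stack Overflow carries id="answer-19013712":
import requests
from bs4 import BeautifulSoup
url = 'https://stackoverflow.com/questions/19012495/smooth-scroll-to-div-id-jquery/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# The fragment (#answer-19013712) is only a client-side anchor, so locate the
# element with that id in the parsed tree instead.
answer = soup.find(id='answer-19013712')
print(answer.get_text(strip=True) if answer else 'answer block not found')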
HTTP(S) will not allow you to do that: the fragment part of the URL (#answer-19013712) is handled by the browser and never sent to the server, so the server always returns the whole page.
You can use the Stack Overflow API instead and pass the answer id 19013712, and thus get only that specific answer via the API.
Note that you may still have to register for an app key.
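A hedged sketch of that approach, using the Stack Exchange API's answers endpoint (the parameter names below follow its documentation; filter=withbody asks the API to include the answer body):
import requests
# Fetch a single answer by id via the API instead of scraping the page.
resp = requests.get(
    'https://api.stackexchange.com/2.3/answers/19013712',
    params={'site': 'stackoverflow', 'filter': 'withbody'},
)
items = resp.json().get('items', [])
if items:
    print(items[0]['body'])  # HTML body of answer 19013712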

Link with status code 200 redirects

I have a link which returns status code 200, but when I open it in a browser it redirects.
When fetching the same link with Python Requests, it simply shows the data from the original link. I tried both Python Requests and urllib but had no success.
How do I capture the final URL and its data?
How can a link with status 200 redirect?
>>> url ='http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18'
>>> r = requests.get(url)
>>> r.url
'http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18'
>>> r.history
[]
>>> r.status_code
200
This kind of redirect is done by the page itself, via a meta refresh tag, rather than by an HTTP redirect, so you won't get the redirected link directly from requests.get(...) (and r.history stays empty). The original URL has the following page source:
<html>
<head>
<meta http-equiv="refresh" content="0;URL=http://www.afaqs.com/interviews/index.html?id=572_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18">
<script type="text/javascript" src="http://gc.kis.v2.scr.kaspersky-labs.com/D5838D60-3633-1046-AA3A-D5DDF145A207/main.js" charset="UTF-8"></script>
</head>
<body bgcolor="#FFFFFF"></body>
</html>
Here, you can see the redirected URL. Your job is to scrape that. You can do it using RegEx, or simply some string split operations.
For example:
r = requests.get('http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18')
redirected_url = r.text.split('URL=')[1].split('">')[0]
print(redirected_url)
# http://www.afaqs.com/interviews/index.html?id=572_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18
r = requests.get(redirected_url)
# Start scraping from this link...
Or, using a regex:
import re
redirected_url = re.findall(r'URL=(http.*)">', r.text)[0]
URLs like these are embedded in the page's markup (here, a meta refresh tag) rather than sent as an HTTP redirect, so requests does not follow them automatically.
To get the link, simply extract it from the relevant tag.
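If you prefer not to rely on string splitting, a hedged alternative is to pull the URL out of the meta refresh tag with BeautifulSoup (assuming the bs4 package is installed):
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18')
soup = BeautifulSoup(r.text, 'html.parser')
# The redirect target sits in <meta http-equiv="refresh" content="0;URL=...">.
meta = soup.find('meta', attrs={'http-equiv': 'refresh'})
if meta:
    redirected_url = meta['content'].split('URL=', 1)[1]
    print(redirected_url)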

Read HEAD contents from HTML

I need a small script in Python that reads a custom block from a web page.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page  # this prints the whole page source with html tags,
# but I only need to read the section from <head> to </head>
# example: the http://target.com source is:
# <html>
# <head>
# ... need to read this section ...
# </head>
# <body>
# ... page source ...
# </body>
# </html>
How do I read just that custom section?
To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. Just because it works for some cases doesn't mean it is the standard way of doing it or is the proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!
Just to give you a heads up, you have the_page which contains the HTML data.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
One solution would be to use the awesome Python library Beautiful Soup. It allows you to parse HTML/XML pretty easily, and it will try to help out when documents are broken or invalid.
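The snippets above are Python 2 (urllib2 and the old BeautifulSoup 3 import). A roughly equivalent sketch for Python 3, assuming the requests and bs4 packages are installed:
import requests
from bs4 import BeautifulSoup
# Fetch the page and print only its <head> ... </head> section.
page = requests.get('http://www.example.com')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find('head'))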
