I have this webpage. When I try to get its HTML using the requests module, like this:
import requests
link = "https://www.worldmarktheclub.com/resorts/7m/"
f = requests.get(link)
print(f.text)
I get a result like this:
<!DOCTYPE html>
<html><head>
<meta http-equiv="Pragma" content="no-cache"/>
<meta http-equiv="Expires" content="-1"/>
<meta http-equiv="CacheControl" content="no-cache"/>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<link rel="shortcut icon" href="data:;base64,iVBORw0KGgo="/>
<script>
(function(){
var securemsg;
var dosl7_common;
// seemingly garbage like [Z.li]+Z._j+Z.LO+Z.SJ+"(/.{"+Z.i+","+Z.Ii+"}
</script>
<script type="text/javascript" src="/TSPD/08e841a5c5ab20007f02433a700e2faba779c2e847ad5d441605ef3d4bbde75cd229bcdb30078f66?type=9"></script>
<noscript>Please enable JavaScript to view the page content.</noscript>
</head><body>
</body></html>
Only part of the result is shown, but I can see the proper HTML when I inspect the webpage in a browser. I guess there might be an issue with the encoding of the page, but I can't figure it out. Using urllib.request with read() gives the same wrong result. How do I correct this? Thanks in advance.
As suggested by @DeepSpace, the garbage-looking content in the script is minified JS code. But why am I not getting the HTML correctly?
What you deem as "garbage" is obfuscated/minified JS code that is written in <script> tags instead of in an external JS file.
If you look at the bottom of f.text, you will see <noscript>Please enable JavaScript to view the page content.</noscript>.
requests is not a browser, hence it cannot execute the JS code this page makes use of, and the server will not let user agents that do not support JS access the content. Setting the User-Agent header to Chrome's (Chrome/60.0.3112.90) still does not work.
You will have to resort to other tools that allow JS execution, such as selenium.
The HTML code is produced on the fly by the JavaScript code you see. Unfortunately, as said by @DeepSpace, requests does not execute JavaScript.
As an alternative I suggest using selenium. It is a library which simulates a browser and can therefore execute JavaScript.
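A minimal sketch of that approach, assuming Chrome and a matching chromedriver are installed (the URL is the one from the question):
import time

from selenium import webdriver

# a real browser, so the page's JavaScript challenge can run
driver = webdriver.Chrome()
driver.get("https://www.worldmarktheclub.com/resorts/7m/")

# crude wait for the anti-bot script to finish and reload the page;
# a WebDriverWait on a known element would be more robust
time.sleep(5)

# page_source now holds the DOM after JavaScript has executed
print(driver.page_source)
driver.quit()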
Related
I'm using Selenium for functional testing of a Django application and thought I'd try html5lib as a way of validating the HTML output. One of the validations is that the page starts with a <!DOCTYPE ...> tag.
The unit-test checks via response.content.decode() all worked fine, correctly flagging errors, but I found that Selenium's driver.page_source output starts with the <html> tag instead. I have double-checked that I'm using the correct template by modifying the title and making sure that the change is reflected in the page_source. There is also a missing newline and indentation between the <html> tag and the <title> tag.
This is what the first few lines looks like in the Firefox browser.
<!DOCTYPE html>
<html>
<head>
<title>NetLog</title>
</head>
Here's the Python code.
self.driver.get(f"{self.live_server_url}/netlog/")
print(self.driver.page_source)
And here's the first few lines of the print when run under the Firefox web driver.
<html><head>
<title>NetLog</title>
</head>
The page body looks fine, and the newline between </body> and </html> is missing as well. Is this expected behaviour? I suppose I could just stuff the DOCTYPE tag in front of the string as a workaround, but I would prefer to have it behave as intended.
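For reference, the workaround I have in mind is something like this (a sketch, assuming the validator only needs the doctype line to be present):
page_source = self.driver.page_source
# Selenium serializes the live DOM, which drops the doctype; re-add it
if not page_source.lstrip().lower().startswith("<!doctype"):
    page_source = "<!DOCTYPE html>\n" + page_source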
Chris
I am currently trying to scrape this Amazon page "https://www.amazon.com/b/?ie=UTF8&node=11552285011&ref_=sv_kstore_5" with the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.amazon.com/b/?ie=UTF8&node=11552285011&ref_=sv_kstore_5'
r = requests.get(url)
soup = BeautifulSoup(r.content)
print(soup.prettify)
However, when I run it, instead of the plain HTML source code I get a bunch of lines which don't make much sense to me, starting like this:
<bound method Tag.prettify of <!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><!-- emit CSM JS -->
<style>
[class*=scx-line-clamp-]{overflow:hidden}.scx-offscreen-truncate{position:relative;left:-1000000px}.scx-line-clamp-1{max-height:16.75px}.scx-truncate-medium.scx-line-clamp-1{max-height:20.34px}.scx-truncate-small.scx-line-clamp-1{max-height:13px}.scx-line-clamp-2{max-height:35.5px}.scx-truncate-medium.scx-line-clamp-2{max-height:41.67px}.scx-truncate-small.scx-line-clamp-2{max-height:28px}.scx-line-clamp-3{max-height:54.25px}.scx-truncate-medium.scx-line-clamp-3{max-height:63.01px}.scx-truncate-small.scx-line-clamp-3{max-height:43px}.scx-line-clamp-4{max-height:73px}.scx-truncate-medium.scx-line-clamp-4{max-height:84.34px}.scx-truncate-small.scx-line-clamp-4{max-height:58px}.scx-line-clamp-5{max-height:91.75px}.scx-truncate-medium.scx-line-clamp-5{max-height:105.68px}.scx-truncate-small.scx-line-clamp-5{max-height:73px}.scx-line-clamp-6{max-height:110.5px}.scx-truncate-medium.scx-line-clamp-6{max-height:127.01
And even when I scroll down, there is nothing that really resembles structured HTML code with all the info I need. What am I doing wrong? (I am a beginner, so it could be anything really.) Thank you very much!
print(soup.prettify)
prints the repr of the bound method soup.prettify instead of calling it. The output is
<bound method Tag.prettify of <!DOCTYPE html><html class="a-no-js" data-19ax5a9jf="dingo"><head>...
while you need to call the prettify method:
print(soup.prettify())
The output:
<html class="a-no-js" data-19ax5a9jf="dingo">
<head>
<script>
var aPageStart = (new Date()).getTime();
</script>
<meta charset="utf-8"/>
<!-- emit CSM JS -->
<style>
...
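The same pitfall applies to any method: printing obj.method without parentheses shows the method object, not its result. A tiny self-contained illustration:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

print(soup.prettify)    # <bound method Tag.prettify of ...> -- the method object
print(soup.prettify())  # the prettified HTML string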
I am trying to scrape a Japanese website (a trimmed-down sample below):
<html>
<head>
<meta charset="euc-jp">
</head>
<body>
<h3>不審者の出没</h3>
</body>
</html>
I am trying to get the data of this HTML with the requests package using:
response = requests.get(url)
I am getting the data from the h3 field as:
'¡ÊÂçʬ', and its Unicode value is like this:
'\xa4\xaa\xa4\xaa\xa4\xa4\xa4\xbf\'
but when I load this HTML from a file or from a local WSGI server (tried with Django serving a static HTML page), then I get
不審者の出没, which is the actual data.
How do I resolve this issue?
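No answer is quoted here, but the usual fix is to tell requests the page's real encoding before reading response.text: when a text/* response carries no charset in its Content-Type header, requests falls back to ISO-8859-1, which garbles an EUC-JP page exactly like this. A minimal sketch (the URL is a placeholder):
import requests

url = "https://example.com/japanese-page"  # placeholder for the site in question
response = requests.get(url)

# Override requests' ISO-8859-1 fallback with the charset from the <meta> tag:
response.encoding = 'euc-jp'
# or let requests sniff the encoding from the body instead:
# response.encoding = response.apparent_encoding

print(response.text)  # the h3 text now decodes as 不審者の出没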
I am currently writing a Python parser to automatically extract some information from a website. I am using mechanize to browse the website. I obtain the following HTML code:
<html>
<head>
<title>
XXXXX
</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8; no-cache;" />
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<link rel="stylesheet" href="/rr/style_other.css" type="text/css" />
</head>
<frameset cols="*,370" border="1">
<frame src="affiche_cp.php?uid=yyyyyyy&type=entree" name="cdrg" />
<frame src="affiche_bp.php?uid=yyyyyyy&type=entree" name="cdrd" />
</frameset>
</html>
I want to access both frames:
in cdrd I must fill some forms and submit
in cdrg I will obtain the result of the submission
How can I do this?
Personally, I do not use BeautifulSoup for parsing HTML. Instead I use PyQuery, which is similar, but I like its CSS selector syntax as opposed to XPath. I also use Requests to make HTTP requests.
That alone is enough to scrape data and submit requests; it can do what you want. I understand this probably isn't the answer you're looking for, but it might very well be useful to you.
Scraping Frames with PyQuery
import requests
import pyquery

# fetch the top-level page that contains the <frameset>
response = requests.get('http://example.com')
dom = pyquery.PyQuery(response.text)

# each <frame> is a separate document; select them like any other element
frames = dom('frame')
frame_one = frames[0]
frame_two = frames[1]
Making HTTP Requests
import requests
response = requests.post('http://example.com/signup', data={
'username': 'someuser',
'password': 'secret'
})
response_text = response.text
data is a dictionary with the POST data to submit to the forms. You should use Chrome's network panel, Fiddler, or Burp Suite to monitor requests. While monitoring, manually submit both forms, then inspect the HTTP requests and recreate them using Requests.
Hope that helps a little. I work in this field, so if you require any more information feel free to hit me up.
The solution to my issue was to load the first frame and fill in the form on that page. Then I loaded the second frame, read it, and obtained the results associated with the form in the first frame.
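In code, that sequence looks roughly like this with requests (the question used mechanize, but the flow is the same; the base URL, uid, and form fields are placeholders):
from urllib.parse import urljoin

import requests

base = "http://example.com/"  # placeholder for the site's base URL
session = requests.Session()

# load the form frame (cdrd) and submit its form; in practice you would
# parse the form's action and field names out of form_page first
form_page = session.get(urljoin(base, "affiche_bp.php?uid=yyyyyyy&type=entree"))
session.post(urljoin(base, "affiche_bp.php"), data={"field": "value"})

# then load the other frame (cdrg) to read the result of the submission
result_page = session.get(urljoin(base, "affiche_cp.php?uid=yyyyyyy&type=entree"))
print(result_page.text)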
I have this complicated problem that I can't find an answer to.
I have a Python HTTPServer running that serves webpages. These webpages are created at runtime with the help of Beautiful Soup. The problem is that Firefox shows the HTML code of the webpage instead of the actual rendered page. I really don't know which of these is causing the problem:
- Python HTTPServer
- Beautiful Soup
- HTML Code
In any case, I have copied parts of the webpage's HTML:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>
My title
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script src="./123_ui.js">
</script>
</head>
<body>
<div>
Hellos
</div>
</body>
</html>
Just to help you, here are the things that I have already tried:
- I have made sure that the Python HTTPServer is sending the MIME header as text/html
- Just copying and pasting the HTML code shows the correct page, since it is static; from this I can tell the problem is on the HTTPServer side
- Firebug shows an empty element and displays "This element has no style rules. You can create a rule for it."
I just want to know whether the error is in Beautiful Soup, the HTTPServer, or the HTML.
Thanks,
Amit
Why are you adding this at the top of the document?
<?xml version="1.0" encoding="iso-8859-1"?>
That will make the browser think the entire document is XML rather than XHTML. Removing that line should make it render correctly. I assume Firefox is displaying a page with a bunch of elements which you can expand/collapse to see the content, as it normally would for an XML document, even though the HTTP headers might say it's text/html.
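If the page is generated with BeautifulSoup, one way to apply that fix is to strip the declaration before writing the response; a sketch (soup stands for the generated document):
html = str(soup)

# drop a leading XML declaration so Firefox treats the page as HTML
if html.lstrip().startswith('<?xml'):
    html = html.split('?>', 1)[1].lstrip()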
So guys,
I have finally solved this problem. The reason was that I wasn't sending the MIME header (even though I thought I was) with content type "text/html".
In Python's HTTPServer, before writing anything to the response you should always do this:
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.end_headers()
# Once the status line and headers are sent, you can write the HTML to the client
self.wfile.write('ANY HTML CODE YOU WANT TO WRITE')
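Pieced together, a minimal self-contained version of such a handler might look like this (a sketch in Python 3; the page body is a placeholder):
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        html = "<!DOCTYPE html><html><body><div>Hellos</div></body></html>"
        self.send_response(200)
        # the Content-Type header is what tells Firefox to render the HTML
        # rather than display it as text
        self.send_header("Content-Type", "text/html; charset=iso-8859-1")
        self.end_headers()
        self.wfile.write(html.encode("iso-8859-1"))

HTTPServer(("", 8000), Handler).serve_forever()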