I am currently writing a Python parser to automatically extract some information from a website. I am using mechanize to browse the website, and I get the following HTML code:
<html>
<head>
<title>
XXXXX
</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8; no-cache;" />
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<link rel="stylesheet" href="/rr/style_other.css" type="text/css" />
</head>
<frameset cols="*,370" border="1">
<frame src="affiche_cp.php?uid=yyyyyyy&type=entree" name="cdrg" />
<frame src="affiche_bp.php?uid=yyyyyyy&type=entree" name="cdrd" />
</frameset>
</html>
I want to access both frames:
in cdrd I must fill in some forms and submit them
in cdrg I will obtain the result of the submission
How can I do this?
Personally, I do not use BeautifulSoup for parsing HTML. Instead I use PyQuery, which is similar, but I prefer its CSS selector syntax over XPath. I also use Requests to make HTTP requests.
That alone is enough to scrape data and submit forms. It can do what you want. I understand this probably isn't the answer you're looking for, but it might well be useful to you.
Scraping Frames with PyQuery
import requests
import pyquery

# Fetch the frameset page and parse it.
response = requests.get('http://example.com')
dom = pyquery.PyQuery(response.text)

# Select the <frame> elements; indexing into the result yields plain lxml elements.
frames = dom('frame')
frame_one = frames[0]
frame_two = frames[1]
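Since each frame is a plain lxml element, you can read its src attribute and fetch the frame's page directly; a sketch, with the base URL as a placeholder:
frame_one_url = frame_one.get('src')  # e.g. 'affiche_cp.php?uid=yyyyyyy&type=entree'
frame_one_html = requests.get('http://example.com/' + frame_one_url).text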
Making HTTP Requests
import requests
response = requests.post('http://example.com/signup', data={
'username': 'someuser',
'password': 'secret'
})
response_text = response.text
data is a dictionary of the POST data to submit with the form. You should use Chrome's network inspector, Fiddler, or Burp Suite to monitor requests. While monitoring, manually submit both forms, inspect the HTTP requests, and recreate them using Requests.
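If the site ties the form submission and the result page together with cookies, a requests.Session keeps them across calls. A minimal sketch, with the frame URLs and the field name as placeholders taken from the question:
import requests

session = requests.Session()  # cookies persist across requests made on this session
session.post('http://example.com/affiche_bp.php?uid=yyyyyyy&type=entree',
             data={'some_field': 'value'})
result = session.get('http://example.com/affiche_cp.php?uid=yyyyyyy&type=entree')
print(result.text)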
Hope that helps a little. I work in this field, so if you require any more information feel free to hit me up.
The solution to my issue was to load the first frame and fill in the form on that page. Then I loaded the second frame, where I could read the results associated with the form from the first frame.
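For reference, a minimal sketch of that approach with mechanize; the frame URLs come from the question's HTML, and the form field name is a placeholder:
import mechanize

br = mechanize.Browser()

# A frame is just another page: open the form frame (cdrd) directly by its src.
br.open('http://example.com/affiche_bp.php?uid=yyyyyyy&type=entree')
br.select_form(nr=0)        # assumes the relevant form is the first on the page
br['some_field'] = 'value'  # placeholder field name
br.submit()

# Then open the result frame (cdrg) by its src and read the outcome.
result = br.open('http://example.com/affiche_cp.php?uid=yyyyyyy&type=entree')
print result.read()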
The link: https://www.hyatt.com/explore-hotels/service/hotels
code:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I also tried this:
import json

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
output:
<!DOCTYPE html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag the data are in, only showing a single script tag.
How do I access the data (shown in the browser view, it looks like JSON)?
I don't believe this can be done... that data simply isn't in r.text.
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see, there is no <pre> tag containing the data, so you're unable to access it that way.
I also get a 429 (Too Many Requests) error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? This site doesn't seem willing to serve anything to a script. Some sites can't be parsed, for various reasons. If you want to work with JSON data, I would look into using an API instead.
If you google https://www.hyatt.com and manually go to the URL you mentioned, you get a 404 error.
I would say Hyatt don't want you parsing their site. So don't!
The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a useable form with the standard json package:
import requests
import json

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
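Equivalently, since the response declares application/json, requests can decode it for you:
data = r.json()  # parses the response body as JSON directly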
I have this webpage. When I try to get its HTML using the requests module like this:
import requests
link = "https://www.worldmarktheclub.com/resorts/7m/"
f = requests.get(link)
print(f.text)
I get a result like this:
<!DOCTYPE html>
<html><head>
<meta http-equiv="Pragma" content="no-cache"/>
<meta http-equiv="Expires" content="-1"/>
<meta http-equiv="CacheControl" content="no-cache"/>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<link rel="shortcut icon" href="data:;base64,iVBORw0KGgo="/>
<script>
(function(){
var securemsg;
var dosl7_common;
// seemingly garbage like [Z.li]+Z._j+Z.LO+Z.SJ+"(/.{"+Z.i+","+Z.Ii+"}
</script>
<script type="text/javascript" src="/TSPD/08e841a5c5ab20007f02433a700e2faba779c2e847ad5d441605ef3d4bbde75cd229bcdb30078f66?type=9"></script>
<noscript>Please enable JavaScript to view the page content.</noscript>
</head><body>
</body></html>
Only a part of the result is shown. But I can see the proper HTML when I inspect the webpage in a browser. I guess there might be an issue with the encoding of the page, but I can't figure it out. Using urllib.request + read() gives the same wrong result. How do I correct this? Thanks in advance.
As suggested by @DeepSpace, the garbage in the script is due to minified JS code. But why am I not getting the HTML correctly?
What you deem as "garbage" is obfuscated/minified JS code that is written in <script> tags instead of in an external JS file.
If you look at the bottom of f.text, you will see <noscript>Please enable JavaScript to view the page content.</noscript>.
requests is not a browser, hence it cannot execute the JS code this page makes use of, and the server will not allow user agents that do not support JS to access it. Setting the User-Agent header to Chrome's (Chrome/60.0.3112.90) still does not work.
You will have to resort to other tools that allow JS execution, such as selenium.
The HTML code is produced on the fly by the JavaScript code you see. Unfortunately, as said by @DeepSpace, requests does not execute JavaScript.
As an alternative I suggest using selenium. It is a library which simulates a browser and so executes JavaScript.
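A minimal sketch of that approach, assuming a chromedriver binary is available on your PATH:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.worldmarktheclub.com/resorts/7m/')
html = driver.page_source  # the HTML after the page's JavaScript has run
driver.quit()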
I am trying to scrape a Japanese website (a trimmed down sample below):
<html>
<head>
<meta charset="euc-jp">
</head>
<body>
<h3>不審者の出没</h3>
</body>
</html>
I am trying to get the data of this HTML with the requests package using:
response = requests.get(url)
I am getting the data from the h3 field as:
'¡ÊÂçʬ' and unicode value of it is like this:
'\xa4\xaa\xa4\xaa\xa4\xa4\xa4\xbf\'
but when I load this HTML from a file or from a local WSGI server (tried with Django serving a static HTML page), I get:
不審者の出没, which is the actual data.
How can I resolve this issue?
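A likely fix, assuming the server does not declare charset=euc-jp in its Content-Type header (requests then falls back to ISO-8859-1 for text responses): set the encoding explicitly before reading .text:
import requests

response = requests.get(url)  # url as in the question
response.encoding = 'euc-jp'  # match the page's <meta charset="euc-jp">
# or let requests sniff it from the body:
# response.encoding = response.apparent_encoding
print(response.text)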
This is part of my HTML code:
<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />
I have to find the hrefs of all the stylesheets.
I tried to use a regular expression like:
<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>
The full code is
import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />'''
real_viraz = '''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''
r = re.findall(real_viraz, body, re.I|re.DOTALL)
print r
But the problem is that rel='stylesheet' and href='...' can appear in any order inside <link ...>, and almost anything can come between them.
Please help me find the right regular expression. Thanks.
Somehow, your name looks like a power automation tool Sikuli :)
If you are trying to parse HTML/XML-based text in Python, BeautifulSoup (see its documentation) is an extremely powerful library to help you with that. Otherwise, you are indeed reinventing the wheel (an interesting story from Randy Sargent).
from bs4 import BeautifulSoup
# in case you need to get the page first.
#import urllib2
#url = "http://selenium-python.readthedocs.org/en/latest/"
#text = urllib2.urlopen("url").read()
text = """<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" /><link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' /><link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />"""
soup = BeautifulSoup(text, "html.parser")
links = soup.find_all("link", {"rel":"stylesheet"})
for link in links:
    try:
        print link['href']
    except:
        pass
the output is:
catalog/view/theme/default/stylesheet/stylesheet.css
http://1
http://2
Learn BeautifulSoup well and you are 100% ready for parsing anything in HTML or XML.
(You might also want to put Selenium, Scrapy into your toolbox in the future.)
Short answer: Don't use regular expressions to parse (X)HTML, use a (X)HTML parser.
In Python, this would be lxml. You could parse the HTML using lxml's HTML Parser, and use an XPath query to get all the link elements, and collect their href attributes:
from lxml import etree
parser = etree.HTMLParser()
doc = etree.parse(open('sample.html'), parser)
links = doc.xpath("//head/link[@rel='stylesheet']")
hrefs = [l.attrib['href'] for l in links]
print hrefs
Output:
['catalog/view/theme/default/stylesheet/stylesheet.css', 'http://1', 'http://2']
I'm amazed by the many developers here on Stack Exchange who insist on using outside modules over the re module for obtaining data and parsing strings, HTML, and CSS. Nothing works more efficiently or faster than re.
These two lines not only grab the CSS stylesheet path but also grab several paths if there is more than one stylesheet, placing them into a nice Python list for processing or for a urllib request method.
import re

a = re.findall('link rel="stylesheet" href=".*?"', t)  # t holds the HTML source
a = str(a)
Also, for those unaware of it, here is what most developers know as the HTML comment-out syntax:
<!-- stuff here -->
This allows re to process and grab data at will from HTML or CSS, or to remove chunks of pesky JavaScript for testing browser capabilities in a single pass, as shown below.
txt = re.sub('<script>', '<!--', txt)   # turn script tags into comment markers
txt = re.sub('</script>', '-->', txt)
txt = re.sub('<!--.*?-->', '', txt, flags=re.DOTALL)  # DOTALL lets '.' span newlines
Python retains all the regular expressions from native C, so use them, people. That's what they're for, and nothing is as slow as Beautiful Soup and HTMLParser.
Use the re module to grab all your data from HTML tags as well as CSS, or from anything a string can contain. And if you have a problem with a variable not being of type string, make it a string with a single tiny line of code:
var = str(var)
I have this complicated problem that I can't find an answer to.
I have a Python HTTPServer running that serves webpages. These webpages are created at runtime with the help of Beautiful Soup. The problem is that Firefox shows the HTML code of the webpage instead of the rendered page. I really don't know what is causing this problem:
- Python HTTPServer
- Beautiful Soup
- HTML Code
In any case, I have copied parts of the webpage HTML:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>
My title
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script src="./123_ui.js">
</script>
</head>
<body>
<div>
Hellos
</div>
</body>
</html>
Just to help you, here are the things that I have already tried:
- I have made sure that the Python HTTPServer is sending the MIME header as text/html
- Just copying and pasting the HTML code shows the correct page, as it's static. I can tell from this that the problem is on the HTTPServer side
- Firebug shows that the element is empty, and "This element has no style rules. You can create a rule for it." is displayed
I just want to know whether the error is in Beautiful Soup, the HTTPServer, or the HTML.
Thanks,
Amit
Why are you adding this at the top of the document?
<?xml version="1.0" encoding="iso-8859-1"?>
That will make the browser think the entire document is XML and not XHTML. Removing that line should make it render correctly. I assume Firefox is displaying a page with a bunch of elements which you can expand/collapse to see the content like it normally would for an XML document, even though the HTTP headers might say it's text/html.
So guys,
I have finally solved this problem. The reason was that I wasn't sending the MIME header (even though I thought I was) with content type "text/html".
In Python's HTTPServer, you must always send the status line and headers before writing anything to the response:
self.send_response(200)
self.send_header("Content-type", "text/html")
self.end_headers()
# Once you have called the above methods, you can send the HTML to the client
self.wfile.write('ANY HTML CODE YOU WANT TO WRITE')
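For context, a minimal self-contained sketch of such a handler, assuming the Python 2 BaseHTTPServer of that era:
import BaseHTTPServer

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # The status line and headers must be sent before the body.
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write("<html><body><div>Hellos</div></body></html>")

server = BaseHTTPServer.HTTPServer(("", 8000), Handler)
server.serve_forever()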