I'm trying to web scrape using Requests. My code so far is the usual:
import requests
html = requests.get('https://www.sampleurl.com').text
This gives:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><HTML lang=en><HEAD><TITLE>Company Name</TITLE><META HTTP-EQUIV="refresh" CONTENT="1; URL=https://www.sampleurl.com"></HEAD><BODY>
The URL inside CONTENT is the same as the URL I passed to Requests, so it's not a redirect I can follow by extracting the URL with BeautifulSoup. Is there some way of bypassing circular meta refreshes so I can get to the HTML of the website?
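A circular meta refresh is often just a cookie check: the first response sets a cookie and refreshes to the same URL, and the real page is served once that cookie is sent back. A minimal sketch of retrying within one session (the regex and hop limit here are my own, not from any library):

```python
import re
import requests

# Matches <meta http-equiv="refresh" content="N; url=...">. This assumes
# the attributes appear in that order; tweak for other pages.
META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>]+)',
    re.IGNORECASE,
)

def get_following_meta_refresh(url, max_hops=3):
    """Fetch a page, following meta-refresh hops within one Session.

    The Session re-sends any cookie set by the first response, which is
    usually what breaks the refresh loop.
    """
    session = requests.Session()
    html = session.get(url).text
    for _ in range(max_hops):
        match = META_REFRESH.search(html)
        if not match:
            break
        html = session.get(match.group(1).strip()).text
    return html
```

If the loop never breaks, the refresh is probably driven by JavaScript rather than cookies, and a plain HTTP client won't get past it.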
The link: https://www.hyatt.com/explore-hotels/service/hotels
code:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I also tried this:
import json
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
output:
<!DOCTYPE html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag the data are in, showing only a single script tag.
How do I access the data (which looks like JSON in the browser view)?
I don't believe this can be done... that data simply isn't in r.text.
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see, there is no <pre> tag for whatever reason, so you're unable to access that data.
I also get a 429 (Too Many Requests) error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? This site doesn't seem willing to serve scrapers. Some sites can't be parsed, for various reasons. If you want to play with JSON data, I would look into using an API instead.
If you google https://www.hyatt.com and manually navigate to the URL you mentioned, you get a 404 error.
I would say Hyatt doesn't want you parsing their site. So don't!
The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a usable form with the standard json package:
import json
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
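As a sketch of that content-type check in code (decode_body is a hypothetical helper, not part of requests):

```python
import json

def decode_body(content_type, body):
    """Parse a response body as JSON only when the Content-Type header
    says it is JSON; otherwise return the raw text unchanged."""
    if "application/json" in content_type:
        return json.loads(body)
    return body

# With requests itself, the equivalent check is:
#   if "application/json" in r.headers.get("Content-Type", ""):
#       data = r.json()   # shorthand for json.loads(r.text)
```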
I'm trying to get BeautifulSoup to read this page but the URL is not passed correctly into the get() command.
The URL is https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10. But when I try to use BeautifulSoup to get the data from the URL it always gives an error saying that the URL is incorrect
response = requests.get(
    url="https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10",
    verify=False,
)
print(response.request.url, end="\r")
It was the curly double quotes, “ (U+201C) and ” (U+201D), that caused the error. I've been trying for hours but still haven't figured out a way to pass the URL correctly.
I changed the double quotes around the URL to single quotes:
from bs4 import BeautifulSoup
import requests
url = 'https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10'
r = requests.get(url, allow_redirects=False)
soup = BeautifulSoup(r.content, 'lxml')
print(soup)
This prints out the HTML as expected (I trimmed it to fit in this answer):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<ALL THE CONTENT>Too much to paste in the answer</ALL THE CONTENT>
</html>
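If you ever start from the decoded form of the URL, with literal “ and ” characters in the path, the standard library can percent-encode it for you (the path below is reconstructed from the question's URL):

```python
from urllib.parse import quote

# Path with literal curly quotes, i.e. the decoded form of the URL.
path = ("/topic/supreme-court-to-\u201cconsider\u201d"
        "-taking-up-harvard-affirmative-action-case-on-june-10")

# quote() UTF-8-encodes non-ASCII characters and percent-encodes the
# bytes; "/" is in the default safe set, so path separators survive.
url = "https://www.econjobrumors.com" + quote(path)
print(url)
```

Note that quote() emits uppercase hex (%E2%80%9C); servers treat that as equivalent to the lowercase %e2%80%9c in the original URL.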
I am trying to scrape a Japanese website (a trimmed down sample below):
<html>
<head>
<meta charset="euc-jp">
</head>
<body>
<h3>不審者の出没</h3>
</body>
</html>
I am fetching this HTML with the requests package:
response = requests.get(url)
The h3 field comes back as mojibake such as:
'¡ÊÂçʬ' whose unicode value looks like:
'\xa4\xaa\xa4\xaa\xa4\xa4\xa4\xbf'
but when I load this HTML from a file or from a local WSGI server (I tried Django serving a static HTML page) I get:
不審者の出没, the actual data.
How do I resolve this issue?
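A likely cause (assuming the server's Content-Type header declares no charset): requests then falls back to ISO-8859-1 for r.text, so the EUC-JP bytes get decoded with the wrong codec. The bytes themselves are fine, as this sketch shows:

```python
# The page's bytes are EUC-JP, but decoding them as Latin-1 (the
# fallback requests uses when no charset is declared) yields mojibake.
raw = "不審者の出没".encode("euc-jp")

mojibake = raw.decode("latin-1")  # roughly what r.text showed
correct = raw.decode("euc-jp")    # what the page actually contains

print(mojibake)
print(correct)

# With requests, the fix is one line before reading r.text:
#   r = requests.get(url)
#   r.encoding = r.apparent_encoding   # or: r.encoding = "euc-jp"
```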
I want to access the details of all the students on the following college website: https://java.access.uni.edu/ed/faces/searchStudent.jsp
I don't know the names of the students and I want to access each student's details.
The directory is open and there is nothing illegal in it.
I am using the following GitHub code as a reference:
https://github.com/JoshuaRLi/direktory/blob/master/direktory.py
Please help!
You can do this with bs4 (BeautifulSoup), which will help you scrape the content from the given directory; this is what web scraping is, and it's what your GitHub link demonstrates.
Another method is Selenium WebDriver: you pass the URL, fill in the corresponding field names and values, and can trigger the API URL from Selenium itself.
Alternatively, you can send a POST request and read the response directly with the requests library.
Here is an example:
>>> import requests
>>> r = requests.post("https://java.access.uni.edu/ed/faces/searchStudent.jsp;jsessionid=e8093da105003620293edb31ec442edfdfa514485389b950c4f20b46515aa640.e34Sbx0MaNuObi0LahiMaxmRb30Re0", data={'txtLastName':'mohamemd','txtFirstName':'mohideen','txtEmail':'temp@mail.com','soMajor':0,'soCollege':0,'soClass':0})
>>> r.status_code
200
>>> r.text[:300]
u'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\r\n"http://www.w3.org/TR/html4/loose.dtd">\r\n\r\n\r\n\r\n\r\n\r\n\r\n <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/loose.dtd"><html dir="ltr" lang="en-US">\r\n <head id="head1"><title>UNI Directory - Student Search</t'
>>> a = r.text[:300]
>>> len(a)
300
>>>
Here I restricted the output to 300 characters; if you want the full response you can simply print:
r.text
I use spynner for scraping data from a site. My code is this:
import spynner
br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
text = br._get_html()
This code fails to load the entire HTML page. This is the HTML that I received:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>
<script type="text/javascript">(function(){var d=document,m=d.cookie.match(/_abs=(([or])[a-z]*)/i)
v_abs=m?m[1].toUpperCase():'N'
if(m){d.cookie='_abs='+v_abs+'; path=/; domain=.venere.com';if(m[2]=='r')location.reload(true)}
v_abp='--OO--OOO-OO-O'
v_abu=[,,1,1,,,1,1,1,,1,1,,1]})()
My question is: how do I load the complete html?
More information:
I tried with:
import spynner
br = spynner.Browser()
respond = br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
if respond is None:
    br.wait_load()
but the HTML load is never complete or reliable. What is the problem? I'm going crazy.
Again:
I'm working in Django 1.3. If I use the same code in plain Python (2.7), it sometimes loads all the HTML.
Use a wait callback so spynner keeps processing events until the reviews have rendered. After running the code below, check the contents of test.html and you will find the p elements with id="feedback-...somenumber...":
import spynner
def content_ready(browser):
    # Keep waiting until the reviews markup is present in the DOM.
    return 'id="feedback-' in browser.html
br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews", wait_callback=content_ready)
with open("test.html", "w") as hf:
    hf.write(br.html.encode("utf-8"))