Urllib2 get garbled string instead of page source [duplicate]

This question already has answers here:
Does python urllib2 automatically uncompress gzip data fetched from webpage?
(4 answers)
Closed 7 years ago.
When I crawl the webpage using urllib2, I get a garbled string instead of the page source, and I can't tell what it is. My code is as follows:
import urllib2
url = 'http://finance.sina.com.cn/china/20150905/065523161502.shtml'
conn = urllib2.urlopen(url)
content = conn.read()
print content
Can anyone help me find out what's wrong? Thank you so much.
Update: You can run the code above to reproduce what I see. This is what I get in Python:
{G?0????l???%ߐ?C0 ?K?z?%E
|?B ??|?F?oeB?'??M6?
y???~???;j????H????L?mv:??:]0Z?Wt6+Y+LV? VisV:캆P?Y?,
O?m?p[8??m/???Y]????f.|x~Fa]S?op1M?H?imm5??g?????k?K#?|??? ???????p:O
??(? P?FThq1??N4??P???X??lD???F???6??z?0[?}??z??|??+?pR"s?Lq??&g#?v[((J~??w1#-?G?8???'?V+ks0?????%???5)
And this is what I expected (using curl):
<html>
<head>
<link rel="mask-icon" sizes="any" href="http://www.sina.com.cn/favicon.svg" color="red">
<meta charset="gbk"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />

Here is a possible way to get the page source using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
#Url to request
url = "http://finance.sina.com.cn/china/20150905/065523161502.shtml"
r = requests.get(url)
#Use BeautifulSoup to organise the 'requested' content
soup=BeautifulSoup(r.content,"lxml")
print soup
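The garbled output is most likely a gzip-compressed response body, as the linked duplicate suggests. requests decompresses gzip transparently, while urllib2 hands back the raw bytes. Below is a minimal Python 3 sketch of doing the decompression by hand. The helper name `decode_body` is hypothetical, and the default `gbk` charset is an assumption taken from the `<meta charset="gbk"/>` tag in the expected output.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="gbk"):
    """Return text from a raw HTTP body, gunzipping it first if needed.

    Hypothetical helper: the name and the default charset are assumptions.
    """
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    is_gzip = content_encoding == "gzip" or raw[:2] == b"\x1f\x8b"
    if is_gzip:
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


# Usage with the standard library (needs network access):
# from urllib.request import urlopen
# with urlopen(url) as conn:
#     text = decode_body(conn.read(), conn.headers.get("Content-Encoding"))
```

If the server did not compress the body, the function simply decodes the bytes as they are.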

Related

Why python requests module not pulling the whole html?

The link: https://www.hyatt.com/explore-hotels/service/hotels
code:
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
Tried also this:
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
output:
<!DOCTYPE html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag the data is in; it only shows a single script tag.
How can I access the data (shown in the browser view, where it looks like JSON)?

I don't believe this can be done. That data simply isn't in r.text.
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see, there is no <pre> tag in the response, so you're unable to access that data.
I also get a 429 error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? This site doesn't seem willing to serve scripted requests. Some sites can't be parsed, for various reasons. If you want to work with JSON data, I would look into using an API instead.
If you google https://www.hyatt.com and manually go to the URL you mentioned you get a 404 error.
I would say Hyatt don't want you parsing their site. So don't!

The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a useable form with the standard json package:
import json
import requests
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
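Since the other answer reports getting a script-only HTML page and a 429 for the same URL, it may help to check the Content-Type header before calling json.loads. This is only a sketch: `parse_json_body` is a hypothetical helper, not part of requests or json.

```python
import json


def parse_json_body(content_type, body_text):
    """Parse body_text as JSON only if the Content-Type header says it is JSON.

    Hypothetical helper; returns None for any other media type.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "application/json":
        return None
    return json.loads(body_text)


# Usage with requests (needs network access):
# r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
# data = parse_json_body(r.headers.get("Content-Type"), r.text)
```

A return value of None then means the server sent something other than JSON, for example the script-only HTML page shown above.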

Python BeautifulSoup get request with double quotes in URL

I'm trying to get BeautifulSoup to read this page but the URL is not passed correctly into the get() command.
The URL is https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10. But when I try to fetch the page from that URL, it always gives an error saying that the URL is incorrect.
import requests
response = requests.get(
    url="https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10",
    verify=False,
)
print(response.request.url, end="\r")
It was the double quotes, “ (U+201C) and ” (U+201D), that caused the error. I've been trying for hours but still can't figure out how to pass the URL correctly.

I changed the double quotes around the URL to single quotes:
from bs4 import BeautifulSoup
import requests
url = 'https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10'
r = requests.get(url, allow_redirects=False)
soup = BeautifulSoup(r.content, 'lxml')
print(soup)
This prints the HTML as expected (trimmed here to fit in this answer):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<ALL THE CONTENT>Too much to paste in the answer</ALL THE CONTENT>
</html>
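For reference, the %e2%80%9c and %e2%80%9d sequences in the URL are the UTF-8 percent-encodings of “ (U+201C) and ” (U+201D). A small standard-library sketch of the round trip is below. The shortened slug is a made-up fragment of the real URL, used only for illustration.

```python
from urllib.parse import quote, unquote

# Shortened, illustrative fragment of the topic slug (not the full URL).
encoded_slug = "supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up"

# Decode the percent-escapes into the actual curly quotes.
decoded_slug = unquote(encoded_slug)
print(decoded_slug)

# Encode again; quote() emits uppercase hex digits.
reencoded_slug = quote(decoded_slug)
print(reencoded_slug)
```

Uppercase and lowercase hex digits in percent-escapes are equivalent, so both forms refer to the same path.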

Get content from meta in external website

I need to extract the meta description of an external website. I've already searched and maybe the simple answer is already out there, but I wasn't able to apply it to my code.
Currently, I can get its title doing the following:
from urllib.request import urlopen
from lxml.html import parse
url = "https://technofall.com/how-to-report-traffic-violation-pay-vehicle-fines-e-challan/"
page = urlopen(url)
p = parse(page)
print (p.find(".//title").text)
This gives me the title.
However, the description is a bit trickier. It can come in the form of:
<meta name="og:description" content="blabla">
<meta property="og:description" content="blabla">
<meta name="description" content="blabla">
So what I want is to extract the first one of these that appears inside the HTML.
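One possible approach, sketched with only the standard library's html.parser so it runs without extra packages: walk the tags in document order and keep the content of the first meta tag whose name or property is one of the keys above. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Keys listed in the question, matched against either name= or property=.
DESCRIPTION_KEYS = {"description", "og:description"}


class FirstDescription(HTMLParser):
    """Remember the content of the first matching <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a description has been found.
        if tag != "meta" or self.description is not None:
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        if key in DESCRIPTION_KEYS and attrs.get("content") is not None:
            self.description = attrs["content"]


def first_meta_description(html_text):
    """Return the first meta description in html_text, or None if there is none."""
    parser = FirstDescription()
    parser.feed(html_text)
    return parser.description
```

With the code from the question, the input could be something like `urlopen(url).read().decode("utf-8", errors="replace")`, assuming the page is UTF-8 encoded.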

Beautiful Soup doesn't give data for a site

I am trying to scrape this site for information:
https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0
I tried code that has worked for other sites, but it just leaves me with an empty text file instead of filling it with data as it has for other sites. Here is my code:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import json
import time
outfile = open('/Users/Luca/Desktop/test/farm_data.text','w')
my_list = list()
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=C&b=1&page=0"
my_list.append(site)
for item in my_list:
    time.sleep(5)
    html = urlopen(item)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    nameList = bsObj.prettify().split('.')
    count = 0
    for name in nameList:
        print(name[2:])
        outfile.write(name[2:] + ',' + item + '\n')
I am trying to split it into smaller parts and go from there. I have used this code on other sites, for example https://www.mtggoldfish.com/price/Aether+Revolt/Heart+of+Kiran#online, and it worked.
Any ideas why it works for some sites and not others? Thanks so much.

The website in question probably disallows webscraping, which is why you get:
HTTPError: HTTP Error 403: Forbidden
You can spoof your user agent so that the request appears to come from a browser. Here's an example of how to do it using the fantastic requests module. You pass a User-Agent header when making the request.
import requests
from bs4 import BeautifulSoup
url = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
html = requests.get(url, headers={'User-Agent' : 'Mozilla/5.0'}).text
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj)
Output:
<!DOCTYPE doctype html>
<html class="no-js" lang="en" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta charset="utf-8"/>
.
.
.
You can massage this code into your loop now.
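If you would rather stay with urllib as in the question, the same header can be attached through a Request object. This is a minimal sketch: `build_request` is a hypothetical helper, and whether 'Mozilla/5.0' is enough for this particular site is an assumption carried over from the example above.

```python
from urllib.request import Request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a urllib Request carrying a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))

# To actually fetch the page (needs network access):
# from urllib.request import urlopen
# html = urlopen(req).read()
```

Inside the loop from the question, `urlopen(item)` would become `urlopen(build_request(item))`.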

Read HEAD contents from HTML

I need a small Python script that reads a specific block of a web page.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page  # this prints the whole page source with HTML tags, but
# I only need the section from <head> to </head>
# for example, the source of http://target.com is:
# <html>
# <body>
# <head>
# ... need to read this section ...
# </head>
# ... page source ...
# </body>
# </html>
How can I read only that section?

To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. Just because it works for some cases doesn't mean it is the standard way of doing it or is the proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!
Just to give you a heads up, you have the_page which contains the HTML data.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.

from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>

One solution would be to use the awesome Python library Beautiful Soup. It allows you to parse HTML/XML pretty easily, and will try to help out when the documents are broken or invalid.
