Spynner doesn't load html from URL - python

I use spynner for scraping data from a site. My code is this:
import spynner
br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
text = br._get_html()
This code fails to load the entire HTML page. This is the HTML I receive:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>
<script type="text/javascript">(function(){var d=document,m=d.cookie.match(/_abs=(([or])[a-z]*)/i)
v_abs=m?m[1].toUpperCase():'N'
if(m){d.cookie='_abs='+v_abs+'; path=/; domain=.venere.com';if(m[2]=='r')location.reload(true)}
v_abp='--OO--OOO-OO-O'
v_abu=[,,1,1,,,1,1,1,,1,1,,1]})()
My question is: how do I load the complete html?
More information:
I tried with:
import spynner
br = spynner.Browser()
respond = br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
if respond is None:
    br.wait_load()
but the HTML is never loaded completely or reliably. What is the problem? It's driving me crazy.
Again:
I'm working in Django 1.3. If I use the same code in plain Python (2.7), it sometimes loads all the HTML.

The reviews are inserted by JavaScript after the initial document is ready, so load() returns before they exist; pass a wait_callback that returns True once they show up. Now, after you check the contents of test.html, you will find the p elements with id="feedback-...somenumber...":
import spynner

def content_ready(browser):
    if 'id="feedback-' in browser.html:
        return True

br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews", wait_callback=content_ready)
with open("test.html", "w") as hf:
    hf.write(br.html.encode("utf-8"))
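If the marker never shows up, the callback above keeps the load waiting. A variant that gives up after a number of polls (a sketch; the counter is plain Python, not part of the spynner API):
import spynner

state = {'polls': 0}

def content_ready(browser):
    state['polls'] += 1
    if 'id="feedback-' in browser.html:
        return True
    if state['polls'] > 600:  # give up eventually instead of hanging
        return True

br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews",
        wait_callback=content_ready)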

Related

Why is the Python requests module not pulling the whole HTML?

The link: https://www.hyatt.com/explore-hotels/service/hotels
code:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I also tried this:
import requests
import json

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
output:
<!DOCTYPE html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag the data are in, only showing a single script tag.
How do I access the data (shown in the browser view, it looks like JSON)?
I don't believe this can be done... that data simply isn't in r.text.
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see, there is no <pre> tag, for whatever reason, so you're unable to access that.
I also get a 429 error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? This site doesn't seem willing to give anything up. Some sites can't be parsed, for various reasons. If you want to play with JSON data, I would look into using an API instead.
If you go to https://www.hyatt.com and manually navigate to the URL you mentioned, you get a 404 error.
I would say Hyatt doesn't want you parsing their site. So don't!
The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a usable form with the standard json package:
import requests
import json

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
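As a side note, requests can decode JSON for you; a minimal sketch (assuming, as above, that the endpoint really answers with application/json):
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
r.raise_for_status()  # fail loudly on 429/404 instead of parsing an error page
data = r.json()       # equivalent to json.loads(r.text)
print(type(data))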

How do I scrape a multi-language website using Python?

I'm using Python to scrape data from a Japanese website that offers both English and Japanese (the link is in the code below).
The problem is that I get the data I need, but in the wrong language (the URL is identical for both languages). I inspected the HTML page and saw the 'lang' attribute, as follows:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja" class="">
Here is the code I used:
import requests
import lxml.html as lh
import pandas as pd
url='https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print("{}".format(name))
    col.append((name, []))
At this point I get the header row of the table, but in the Japanese version.
I'm new to Python and to scraping. Is there a method I could use to get the data in English?
If there are any existing examples, templates, or other resources I could use, even better.
Thanks in advance!
I visited the website you linked. For English, the site sets a cookie: look at the response headers for the request URL https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0 in the Network tab and you will see
Set-Cookie: SFCM01LANG=en; Max-Age=63072000; Expires=Tue, 18-Oct-2022 19:14:29 GMT; Path=/
So I basically used that.
Change your code snippet to this:
import requests
import lxml.html as lh
import pandas as pd
url='https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url, cookies={'SFCM01LANG':'en'})
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
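Since pandas is already imported, a sketch of the rest (assuming the results table is a plain <table> that pandas can parse):
import requests
import pandas as pd

url = 'https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url, cookies={'SFCM01LANG': 'en'})  # same English cookie as above
tables = pd.read_html(page.text)  # parses every <table> on the page into DataFrames
print(tables[0].head())           # the first table should now have English headers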

How to get past a circular meta refresh with Requests?

I'm trying to web scrape using Requests. My code so far is the usual:
import requests
html = requests.get('https://www.sampleurl.com').text
This gives:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><HTML lang=en><HEAD><TITLE>Company Name</TITLE><META HTTP-EQUIV="refresh" CONTENT="1; URL=https://www.sampleurl.com"></HEAD><BODY>
The url inside CONTENT is the same as the url I put into Requests, so it's not a redirect that I can follow by extracting the url with BeautifulSoup. Is there some way of bypassing circular meta refreshes so I can get to the html of the website?
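The thread doesn't carry an answer, but a common cause of this pattern is a cookie check: the first response sets a cookie and the refresh asks the client to retry while sending it back. A minimal sketch of that idea, keeping the placeholder https://www.sampleurl.com from the question:
import requests
from bs4 import BeautifulSoup

# a Session keeps cookies between requests, which is what the refresh
# loop is usually waiting for
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'  # some sites also reject the default UA

url = 'https://www.sampleurl.com'  # placeholder from the question
html = session.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# if the page still asks for a refresh, retry once with the cookies we now hold
if soup.find('meta', attrs={'http-equiv': 'refresh'}):
    html = session.get(url).text

print(html[:500])
If the cookie is set by JavaScript rather than a Set-Cookie header, requests alone won't get past it and you'd need a browser-driven tool instead.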

Python: BeautifulSoup returning garbage

I am building a basic data crawler in Python using BeautifulSoup, for Batoto, the manga host. For some reason, the URL works sometimes and other times it doesn't. For example:
from bs4 import BeautifulSoup
from urllib2 import urlopen
x = urlopen(*manga url here*)
y = BeautifulSoup(x)
print y
The result should be the tag soup of the page, but instead I get a big wall of this,
´ºŸ{›æP™oRhtüs2å÷%ëmßñ6Y›þ�GDŸ0Ë­͇켮Yé)–ÀØÅð&ô]½f³ÓÞ€Þþ)ú$÷á�üv…úzW¿¾úà†lªÀí¥ï«·_ OTL_ˆêsÁÿƒÁÖ<Ø?°Þ›Â+WLç¥àEh>rýÜ>x ˆ‡eÇžù»èå»–Ùý e:›§`L_.‹¦úoÓ‘®e=‰ìÓ4Wëo’]~Ãõ¬À8>x:²âœ2¸ Á|&0ÍVpMLÎñ»v¥Ín÷-ÅÉ–T§`Ì.SÔsóë„œ¡×[˜·P6»�ùè�>Ô¾È]Œ—·ú£âÊgí%ضkwýÃ=Üϸ2cïÑfÙ_�×]Õê“ž?„UÖ* m³/­`ñ§ÿL0³dµ·jªÅ}õ/õOXß×;«]®’ϯw‹·þ¡ÿ|Gýª`I{µœ}œí�ë–¼yÖÇ'�Wç�ëµÅþþ*ýœd{ÿDv:РíHzqÿÆ­÷æélG-èÈâpÇßQé´^ÐO´®Xÿ�ýö(‹šëñþ"4!SÃõ2{òÿÜ´»ûE</kî?x´&ý˜`Ù)uÂï¹ã[ÏŠ²y°kÆpù}¢></uŒ¸kpž¼cì∬ƒcubÆ¡¢=en2‚påÓb9®`áï|z…p"i6pvif¨þõ“⟒></t`$ò-e></cé”r)$�ˆ)ìªÜrd&mÉÊ*ßdÒuÄ.Æ-hx#9[s=m�Ýfd2o1ˆ]‡[Ôádœtë¤qâxæ°‹qËÁ×,½ŠmʇꇢùÅýl></sí°çù¡h?‡ÌÜœbá‰æÆý¡sd~¬></zz¡ózwÎ[à!n‰Àš5¤…¸‘ݹŽ></sÃ:›3Ìæ></lÑggu�».Б#4õë\ÃñÆ:¸5ÔwÛ·…)~ÛacÑ,d­³båÖ6></tg9y+wΉí%r8ƒ·}n`¼ÁÆ8˜”é²êÞ½°¶Ï></sÖ-di¨a±j9³4></ss„*w(ßibðïj*¶„)pâýÌ”a§%va{‰ò¦m mi></o³o˜Ÿ?¿Ñu-}{cÜ›a~:k²Ì></r+=ÅÌk˜c></wÓ¹âߊž‡ëf7vÑ�akÆ4ƒ‚></szŽµiÞêzâšÒ¬ú¢“âÀ#�-></qebndΑg*cxgsÆ€Ùüe¡³-ŠngÁ:�3ænæ5ï0`coäÏÖ9œ1Ða¯,æ—ªìàãÉÂð></j›h¶`à;)òiÖ š+></o”64ˆÎº9°��u—Úd¿ý¥pÎÖ‰0¢s:c�yƧ³t=ÕŸ“Ý‹41%}*,e³Ô¥ó></hiræe—';></v�fÞ«Ë¥n§Ð·¡kaììë\�`ùsõ©¸pv¦‘></bñ¼ut«w)Ø'¹ú#{)n0¡Žan¶Ë5èsª�–u–></y_x.mÅd:g}ëÕðhçð«õõ8ŠcËÕÌvž­v™-šêÙ`b¹˜ùÃΓçˤÔÙtx¹�ßïǶÎgþ°r‹$ò†aÆ–š?ì<y«Ëñõo{%ׇo{ú¥Á»æ]‡></u´¬Ø¸eÖïÝtßÚ'è3®nh±ûk4È#l«s]–Åec¹ÑtmÓl|ë£Þ¼~zôéõûwêÓÑñÉÆw\soøÊiyjvØÖ$¯ÈoºÙoyã]æ5]-t^[“¡aÑ{²Å¸6¦ðtŒçm¼ÂÎz´></wà™´»äõ#©õ></mÏu:=¼þ·'�qwúËö«m„l^ˆær¥30q±ÒšŸëù></l(„7¼=xi’?¤;ö$ØË4ßoóiòyoµxÉøþ¨—«g³Ãíß{|></body></html>
wrapped in html and body tags.
Sometimes if I keep trying it works, but it is so inconsistent that I can't figure out the reason.
Any help would be appreciated.
It seems to be urlopen having issues with the encoding; requests works fine:
import requests
from bs4 import BeautifulSoup

x = requests.get("http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615")
y = BeautifulSoup(x.content)
print y
<!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta charset="utf-8"/>
<title>Rakudai Kishi no Eiyuutan - Scanlations - Comic - Comic Directory - Batoto - Batoto</title>
.................
Using urlopen we get the following:
from urllib2 import urlopen
x = urlopen("http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615")
print x.read()
���������s+I���2���l��9C<�� ^�����쾯�dw�xzNT%��,T��A^�ݫ���9��a��E�C���W!�����ڡϳ��f7���s2�Px$���}I�*�'��;'3O>���'g?�u®{����e.�ڇ�e{�u���jf:aث
�����DS��%��X�Zͮ���������9�:�Dx�����\-�
�*tBW������t�I���GQ�=�c��\:����u���S�V(�><y�C��ã�*:�ۜ?D��a�g�o�sPD�m�"�,�Ɲ<;v[��s���=��V2�fX��ì�Cj̇�В~�
-~����+;V���m�|kv���:V!�hP��D�K�/`oԣ|�k�5���B�{�0�wa�-���iS
�>�œ��gǿ�o�OE3jçCV<`���Q!��5�B��N��Ynd����?~��q���� _G����;T�S'�#΀��t��Ha�.;J�61'`Й�#���>>`��Z�ˠ�x�#� J*u��'���-����]p�9{>����������#�<-~�K"[AQh0HjP
0^��R�]�{N#��
...................
So as you can see it is a problem with urlopen not BeautifulSoup.
The server is returning gzipped bytes. So to download the content using urllib2:
import urllib2
import gzip
import io

url = "http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615"
response = urllib2.urlopen(url)
# print(response.headers)
content = response.read()
if response.headers['Content-Encoding'] == 'gzip':
    g = gzip.GzipFile(fileobj=io.BytesIO(content))
    content = g.read()
encoding = response.info().getparam('charset')
content = content.decode(encoding)
This checks the content is the same as the page.text returned by requests:
import requests
page = requests.get(url)
# print(page.headers)
assert content == page.text
Since requests handles the gunzipping and decoding for you -- and more robustly too -- using requests is highly recommended.
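For reference, the same gunzip-and-decode logic in Python 3 (a sketch; the thread itself is Python 2):
import gzip
import urllib.request

url = "http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615"
response = urllib.request.urlopen(url)
content = response.read()
if response.headers.get('Content-Encoding') == 'gzip':
    content = gzip.decompress(content)  # replaces the GzipFile/BytesIO dance
charset = response.headers.get_content_charset() or 'utf-8'
text = content.decode(charset)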

Read HEAD contents from HTML

I need a small script in Python. It needs to read a custom block from a web page.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # Here is the whole page source with HTML tags, but
# I need to read only the section from <head> to </head>
# example: the http://target.com source is:
# <html>
# <head>
# ... need to read this section ...
# </head>
# <body>
# ... page source ...
# </body>
# </html>
How do I read this section?
To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. Just because it works for some cases doesn't mean it is the standard way of doing it or is the proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do. We are not going to do it for you; that is for you to read and learn!
Just to give you a heads up, you have the_page which contains the HTML data.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
One solution would be to use the awesome Python library Beautiful Soup. It allows you to parse HTML/XML pretty easily, and it will try to help out when the documents are broken or invalid.
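To illustrate that leniency, a tiny sketch with deliberately broken markup (a hypothetical example, not from the thread):
from BeautifulSoup import BeautifulSoup

# deliberately malformed: the head block is nested inside the body
broken = "<html><body><head><title>target</title></head>page source</body></html>"
soup = BeautifulSoup(broken)
print soup.find('head')  # Beautiful Soup still locates the head block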
