I'm trying to scrape "game tag" data (not the same as HTML tags) from games listed on the digital game distribution site, Steam (store.steampowered.com). This information isn't available via the Steam API, as far as I can tell.
Once I have the raw source data for a page, I want to pass it into BeautifulSoup for further parsing, but I have a problem: urllib2 doesn't seem to return the information I want (requests doesn't work either), even though it's clearly in the page source when viewed in the browser.
For example, I might download the page for the game "7 Days to Die" (http://store.steampowered.com/app/251570/). When viewing the page source in Chrome, I can see the following relevant information regarding the game's "tags" near the end, starting at line 1615:
<script type="text/javascript">
$J( function() {
InitAppTagModal( 251570,
{"tagid":1662,"name":"Survival","count":283,"browseable":true},
{"tagid":1659,"name":"Zombies","count":274,"browseable":true},
{"tagid":1702,"name":"Crafting","count":248,"browseable":true},...
In InitAppTagModal, there are the tags "Survival", "Zombies", "Crafting", etc. that contain the information I'd like to collect.
But when I use urllib2 to get the page source:
import urllib2
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page
page = urllib2.urlopen(url).read()
The part of the source page that I'm interested in is not saved in my "page" variable; instead, everything below line 1555 is simply blank until the closing body and html tags, resulting in this (carriage returns included):
</div><!-- End Footer -->
</body>
</html>
The blank space is where the source code I need (along with other code) should be.
I've tried this on several different computers with different installs of python 2.7 (Windows machines and a Mac), and I get the same result on all of them.
How can I get the data that I'm looking for?
Thank you for your consideration.
Well, I don't know if I'm missing something, but it's working for me using requests:
import requests
# Getting html code
url = "http://store.steampowered.com/app/251570/"
html = requests.get(url).text
And what's more, the requested data is in JSON format, so it's easy to extract it this way:
import json
# Extracting the javascript object (a json-like object)
start_tag = 'InitAppTagModal( 251570,'
end_tag = '],'
startIndex = html.find(start_tag) + len(start_tag)
endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
raw_data = html[startIndex:endIndex]
# Load raw data as a python json object
data = json.loads(raw_data)
You will see a beautiful JSON object like this (this is the info that you need, right?):
[
{
"count": 283,
"browseable": true,
"tagid": 1662,
"name": "Survival"
},
{
"count": 274,
"browseable": true,
"tagid": 1659,
"name": "Zombies"
},
{
"count": 248,
"browseable": true,
"tagid": 1702,
"name": "Crafting"
}......
I hope it helps....
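As an aside: if the start/end slicing ever misfires (page layouts do change), a regex that captures the whole array argument is a bit sturdier. This is only a sketch tested against a made-up snippet, so treat the pattern as an assumption about the current markup:

```python
import json
import re

def extract_tags(html, app_id):
    # Capture the JSON array passed as the second argument to InitAppTagModal.
    # Works as long as the array itself contains no nested brackets.
    match = re.search(r'InitAppTagModal\(\s*%d\s*,\s*(\[.*?\])\s*,' % app_id,
                      html, re.DOTALL)
    return json.loads(match.group(1)) if match else []

snippet = 'InitAppTagModal( 251570, [{"tagid":1662,"name":"Survival","count":283,"browseable":true}], [] );'
print([t['name'] for t in extract_tags(snippet, 251570)])  # ['Survival']
```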
UPDATED:
OK, I see your problem now: it's with page 224600. In this case the webpage requires you to confirm your age before showing the game's info. Anyway, it's easy to solve: just POST the form that confirms the age. Here is the updated code (I turned it into a function):
import json
import requests

def extract_info_games(page_id):
    # Create session
    session = requests.session()
    # Get initial html
    html = session.get("http://store.steampowered.com/app/%s/" % page_id).text
    # Check whether we landed on the age-check page
    # (just look for the age-check form in the html code)
    if ('<form action="http://store.steampowered.com/agecheck/app/%s/"' % page_id) in html:
        # We're being redirected to the age-check page;
        # let's confirm the age with a POST:
        post_data = {
            'snr': '1_agecheck_agecheck__age-gate',
            'ageDay': 1,
            'ageMonth': 'January',
            'ageYear': '1960'
        }
        html = session.post('http://store.steampowered.com/agecheck/app/%s/' % page_id, post_data).text
    # Extracting the javascript object (a json-like object)
    start_tag = 'InitAppTagModal( %s,' % page_id
    end_tag = '],'
    startIndex = html.find(start_tag) + len(start_tag)
    endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
    raw_data = html[startIndex:endIndex]
    # Load raw data as a python json object
    data = json.loads(raw_data)
    return data
And to use it:
extract_info_games(224600)
extract_info_games(251570)
Enjoy!
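A possible shortcut, sketched under the assumption that Steam still honours these cookies (check what your browser holds after passing the age gate once): pre-setting them on the session skips the POST round-trip entirely.

```python
import requests

def steam_session():
    # Cookie names and values here are assumptions based on what the store
    # sets after a successful age check; verify them in your own browser
    # before relying on this.
    session = requests.session()
    session.cookies.set('birthtime', '568022401', domain='store.steampowered.com')
    session.cookies.set('mature_content', '1', domain='store.steampowered.com')
    return session

# html = steam_session().get('http://store.steampowered.com/app/224600/').text
```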
When using urllib2, a single read() call isn't guaranteed to return the full response; read repeatedly in chunks until you hit EOF to be sure you have the entire HTML source.
import urllib2
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page
url_handle = urllib2.urlopen(url)
data = ""
while True:
    chunk = url_handle.read()
    if not chunk:
        break
    data += chunk
An alternative would be to use the requests module as:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/app/251570/')
soup = BeautifulSoup(r.text, 'html.parser')
I am trying to get the links from all the pages on https://apexranked.com/. I tried using
url = 'https://apexranked.com/'
page = 1
while page != 121:
    url = f'https://apexranked.com/?page={page}'
    print(url)
    page = page + 1
However, if you click on the page numbers, the URL doesn't include ?page=number the way https://www.mlb.com/stats/?page=2 does. How would I go about accessing and getting the links from all the pages if the page doesn't include ?page=number after the link?
The page is not reloading when you click on page 2. Instead, it is firing a GET request to the website's backend.
The request is being sent to : https://apexranked.com/wp-admin/admin-ajax.php
In addition, several parameters are appended directly to that URL.
?action=get_player_data&page=3&total_pages=195&_=1657230896643
Parameters:
action: as the endpoint can handle several purposes, you must indicate the action performed. Almost surely a mandatory parameter; don't omit it.
page: the requested page (i.e. the index you're iterating over).
total_pages: the total number of pages (maybe it can be omitted; otherwise you can scrape it from the main page).
_: a Unix timestamp; same idea as above, try omitting it and see what happens. Otherwise you can get a Unix timestamp quite easily with time.time().
Once you get a response, it yields rendered HTML; you could try setting an Accept: application/json field in the request headers to get JSON instead, but that's just a detail.
All this information wrapped up:
import requests
import time
url = "https://apexranked.com/wp-admin/admin-ajax.php"
# Issued from a previous scraping on the main page
total_pages = 195
params = {
    "total_pages": total_pages,
    "_": round(time.time() * 1000),
    "action": "get_player_data"
}
# Make sure to include all mandatory fields
headers = {
    ...
}
for k in range(1, total_pages + 1):
    params['page'] = k
    res = requests.get(url, headers=headers, params=params)
    # Make your thing :)
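Since the endpoint answers with rendered HTML rows, the player links still have to be pulled out of each response. A minimal sketch with BeautifulSoup (the exact row markup is an assumption; inspect one real response to confirm):

```python
from bs4 import BeautifulSoup

def extract_links(html):
    # Collect the href of every anchor in the returned HTML fragment.
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]

print(extract_links('<tr><td><a href="/player/abc">abc</a></td></tr>'))  # ['/player/abc']
```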
I don't know exactly what you mean, but if you want to get the raw text, for example, you can do it with requests:
import requests

page = 1
# A loop that will keep going until the page is not found.
while requests.get(f"https://apexranked.com/?page={page}").status_code != 404:
    # scrape content, e.g. the whole page
    link = f"https://apexranked.com/?page={page}"
    page = page + 1
You can also append each link to a list with nameOfArray.append(link).
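Putting the same idea together as a function that collects the links into a list (note this still assumes the site eventually answers such a page with a 404, which, per the question, it may not):

```python
import requests

def collect_page_links(base="https://apexranked.com/?page={}", limit=120):
    # Walk the pages until a 404 or the limit, gathering each page URL.
    links = []
    page = 1
    while page <= limit:
        link = base.format(page)
        if requests.get(link).status_code == 404:
            break  # ran past the last page
        links.append(link)
        page += 1
    return links
```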
I am a seller on Target.com and am trying to scrape the URL for every product in my catalog using Python (Python 3). When I try this I get an empty list for 'urllist', and when I print the variable 'soup', what BS4 has actually collected is the contents of "view page source" (forgive my naiveté here, definitely a novice at this still!). In reality I'd really like to be scraping the URLs from the content found in the "Elements" section of the DevTools page. I can sift through the HTML on that page manually and find the links, so I know they're in there... I just don't know enough yet to tell BS4 that's the content I want to search. How can I do that?
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
#Need this part below for HTTPS
ctx=ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
#Needs that context = ctx line to deal with HTTPS
url = input('Enter URL: ')
urllist=[]
html = urllib.request.urlopen(url, context = ctx).read()
soup=BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    urllist.append(link.get('href'))
print(urllist)
If it helps, I found code that someone developed in JavaScript that can be run from the developer console, which works and grabbed me all of my links. But my goal is to be able to do this in Python (Python 3):
var x = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < x.length; i++) {
    var nametext = x[i].textContent;
    var cleantext = nametext.replace(/\s+/g, ' ').trim();
    var cleanlink = x[i].href;
    myarray.push([cleantext, cleanlink]);
}
function make_table() {
    var table = '<table><thead><th>Name</th><th>Links</th></thead><tbody>';
    for (var i = 0; i < myarray.length; i++) {
        table += '<tr><td>' + myarray[i][0] + '</td><td>' + myarray[i][1] + '</td></tr>';
    }
    var w = window.open("");
    w.document.write(table);
}
make_table();
I suspect this is occurring because Target's website (at least, the main page) builds the page content via JavaScript. Your browser is able to render and execute the page's source code, but your Python code does no such thing. See this post for help in that regard.
Without going into the specifics of your code: fundamentally, if you can make a call to a URL, you've got that URL. If you use the script to scrape one entered URL at a time, that could be logged by adding the right entry to urllist (the object returned by each link.get('href')).
If you have some other original source (a list?) for the URLs to scrape, those could be added to the urllist object in a similar fashion.
The course of action chosen depends on the actual data structure returned by link.get('href'). Suggestions:
If it's a string containing HTML, put that string in a dict under the key 'html' and add another key 'url'.
If it's already a dict object, just add a 'url' key-value pair.
If you want to enter one URL and extract the others from that URL's HTML document, retrieve the HTML and parse it with something like ElementTree.
You can do this a number of ways.
I've created a script in Python using the requests module to fetch some information displayed upon filling in a form with this email: africk2#nd.edu. The problem is that when I hit the search button, I can see a new tab containing all the information I wish to grab. Moreover, I don't see any link in the All tab under the Network section within Chrome dev tools. So I'm at a loss as to how I can get the information using the requests module.
website address
Steps to populate the result manually:
Put this email address africk2#nd.edu in the Email address input box and hit the Search button.
I've tried with:
import requests
from bs4 import BeautifulSoup
url = "https://eds.nd.edu/search/index.shtml"
post_url = "https://eds.nd.edu/cgi-bin/nd_ldap_search.pl"
res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
payload = {item['name']:item.get('value','') for item in soup.select('input[name]')}
payload['email'] = 'africk2#nd.edu'
del payload['clear']
resp = requests.post(post_url,data=payload)
print(resp.content)
The above script is a faulty approach; however, I can't come up with any idea for grabbing the information connected to that email.
P.S. I'm not after a selenium-oriented solution.
Ok, solved it:
from urllib.parse import quote

import requests

def get_contact_html(email: str):
    encoded = quote('o="University of Notre Dame", '
                    'st=Indiana, '
                    'c=US?displayName,edupersonaffiliation,ndTitle,ndDepartment,postalAddress,telephoneNumber,mail,searchGuide,labeledURI,'
                    'uid?'
                    'sub?'
                    f'(&(ndMail=*{email}*))')
    data = {
        "ldapurl": f'LDAP://directory.nd.edu:389/{encoded}',
        "ldaphost": "directory.nd.edu",
        "ldapport": '389',
        "ldapbase": 'o="University of Notre Dame", st=Indiana, c=US',
        "ldapfilter": f'(&(ndMail=*{email}*))',
        "ldapheadattr": "displayname",
        "displayformat": "nd",
        "ldapmask": "",
        "ldapscope": "",
        "ldapsort": "",
        "ldapmailattr": "",
        "ldapurlattr": "",
        "ldapaltattr": "",
        "ldapjpgattr": "",
        "ldapdnattr": "",
    }
    res = requests.post('https://eds.nd.edu/cgi-bin/nd_ldap_search.pl',
                        data=data)
    res.raise_for_status()
    return res.text

if __name__ == '__main__':
    html = get_contact_html('africk2#nd.edu')
    print(html)
output:
...
Formal Name:
...
Aaron D Frick
...
This will give you the HTML for the page.
The trick was converting the encoded spaces (+) to real spaces in the "ldapbase": 'o="University of Notre Dame", st=Indiana, c=US' field and letting the requests module encode the value itself; otherwise the + signs get double-encoded.
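The double-encoding point is easy to reproduce with the standard library: quote encodes a space as %20 and quote_plus as +, and re-encoding a string that already contains + turns it into %2B, which is what was breaking the field.

```python
from urllib.parse import quote, quote_plus

plain = 'st=Indiana, c=US'
print(quote(plain))              # spaces become %20
print(quote_plus(plain))         # spaces become +
print(quote(quote_plus(plain)))  # the + now becomes %2B: double-encoded
```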
I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving just the title of a webpage in python, but all of the ones I've found download the entire page before retrieving the title, which seems highly inefficient as most often the title is contained within the first few lines of HTML.
Is it possible to download only the parts of the webpage until the title has been found?
I've tried the following, but page.readline() downloads the entire page.
import urllib2
print("Looking up {}".format(link))
hdr = {'User-Agent': 'Mozilla/5.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(link, headers=hdr)
page = urllib2.urlopen(req, timeout=10)
content = ''
while '</title>' not in content:
    content = content + page.readline()
-- Edit --
Note that my current solution makes use of BeautifulSoup constrained to only process the title so the only place I can optimize is likely to not read in the entire page.
title_selector = SoupStrainer('title')
soup = BeautifulSoup(page, "lxml", parse_only=title_selector)
title = soup.title.string.strip()
-- Edit 2 --
I've found that BeautifulSoup itself splits the content into multiple strings in the self.current_data
variable (see this function in bs4), but I'm unsure how to modify the code to basically stop reading all remaining content after the title has been found. One issue could be that redirects should still work.
-- Edit 3 --
So here's an example. I have a link www.xyz.com/abc and I have to follow this through any redirects (almost all of my links use a bit.ly kind of link shortening). I'm interested in both the title and domain that occurs after any redirections.
-- Edit 4 --
Thanks a lot for all of your assistance! The answer by Kul-Tigin works very well and has been accepted. I'll keep the bounty until it runs out though to see if a better answer comes up (as shown by e.g. a time measurement comparison).
-- Edit 5 --
For anyone interested: I've timed the accepted answer to be roughly twice as fast as my existing solution using BeautifulSoup4.
You can defer downloading the entire response body by enabling stream mode of requests.
Requests 2.14.2 documentation - Advanced Usage
By default, when you make a request, the body of the response is
downloaded immediately. You can override this behaviour and defer
downloading the response body until you access the Response.content
attribute with the stream parameter:
...
If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close.
This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing (documented here)
So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since the redirects will be handled by the library you'll be ready to go.
Here's a somewhat error-prone script, tested with Python 2.7.10 and 3.6.0:
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser

import requests, re
from contextlib import closing

CHUNKSIZE = 1024
retitle = re.compile("<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
buffer = ""
htmlp = HTMLParser()

with closing(requests.get("http://example.com/abc", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])
        match = retitle.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)))
            break
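The regex-and-unescape part of the script can be checked without any network traffic; this is the same pattern applied to a literal string:

```python
import re

retitle = re.compile("<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# DOTALL lets the title span lines; IGNORECASE copes with <TITLE>.
match = retitle.search('<head><TITLE lang="en">Hello\nWorld</TITLE></head>')
print(match.group(1))  # Hello\nWorld (across two lines)
```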
Question: ... the only place I can optimize is likely to not read in the entire page.
This does not read the entire page.
Note: Unicode .decode() will raise an exception if you cut a Unicode sequence in the middle; using .decode(errors='ignore') drops those incomplete sequences.
For instance:
import re
try:
    # PY3
    from urllib import request
except ImportError:
    # PY2
    import urllib2 as request

for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
    f = request.urlopen(url)
    re_obj = re.compile(r'.*(<head.*<title.*?>(.*)</title>.*</head>)', re.DOTALL)
    Found = False
    data = ''
    while True:
        b_data = f.read(4096)
        if not b_data:
            break
        data += b_data.decode(errors='ignore')
        match = re_obj.match(data)
        if match:
            Found = True
            title = match.groups()[1]
            print('title={}'.format(title))
            break
    f.close()
Output:
title=Welcome to Python.org
title=Google
title=Bitly | URL Shortener and Link Management Platform
Tested with Python: 3.4.2 and 2.7.9
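The decode caveat mentioned above is easy to demonstrate in isolation: cutting a multi-byte UTF-8 character in half makes a strict decode raise, while errors='ignore' silently drops the dangling bytes.

```python
data = 'title=\u0394'.encode('utf-8')  # Greek Delta is a 2-byte character
head = data[:-1]                       # chop off its second byte
try:
    head.decode('utf-8')
except UnicodeDecodeError:
    print('strict decode failed')
print(repr(head.decode('utf-8', errors='ignore')))  # 'title='
```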
You're scraping webpages using standard REST requests and I'm not aware of any request that only returns the title, so I don't think it's possible.
I know this doesn't necessarily help get the title only, but I usually use BeautifulSoup for any web scraping. It's much easier. Here's an example.
Code:
import requests
from bs4 import BeautifulSoup
urls = ["http://www.google.com", "http://www.msn.com"]
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    print "Title with tags: %s" % soup.title
    print "Title: %s" % soup.title.text
    print
Output:
Title with tags: <title>Google</title>
Title: Google
Title with tags: <title>MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos</title>
Title: MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos
The kind of thing you want can't really be done, I think, since the way the web is set up, you get the response for a request before anything is parsed. There isn't usually a streaming "if you encounter <title> then stop giving me data" flag; if there is, I'd love to see it. But there is something that may be able to help you. Keep in mind that not all sites respect this, so some sites will force you to download the entire page source before you can act on it, but a lot of them will allow you to specify a Range header. So in a requests example:
import requests
targeturl = "http://www.urbandictionary.com/define.php?term=Blarg&page=2"
rangeheader = {"Range": "bytes=0-150"}
response = requests.get(targeturl, headers=rangeheader)
response.text
and you get
'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#'
Now of course there are problems with this:
What if you specify a range that is too short to get the title of the page?
What's a good range to aim for? (A combination of speed and assurance of accuracy.)
What happens if the page doesn't respect Range? (Most of the time you just get the whole response you would have gotten without it.)
I don't know if this might help you; I hope so. But I've done similar things to only get file headers for download checking.
EDIT4:
So I thought of another kind of hacky thing that might help. Nearly every site has a "404 page not found" page. We might be able to use this to our advantage: instead of requesting the regular page, request something like this:
http://www.urbandictionary.com/nothing.php
The general page will have tons of information, links and data, but the 404 page is nothing more than a message and (in this case) a video, and usually there is no video, just some text.
But you'll also notice that the title still appears here. So perhaps we can just request something we know does not exist on any page, like:
X5ijsuUJSoisjHJFk948.php
and get a 404 for each page. That way you only download a very small, minimalistic page and nothing more, which will significantly reduce the amount of information you download, thus increasing speed and efficiency.
Here's the problem with this method: you need to check somehow whether the page supplies its own version of the 404 page. Most pages have one because it looks good with the site, and it's standard practice to include one, but not all of them do; make sure to handle this case.
But I think that could be something worth trying out. Over the course of thousands of sites, it would save many milliseconds of download time for each page.
EDIT5:
So, as we talked about, since you are interested in URLs that redirect, we might make use of an HTTP HEAD request, which won't get the site content, just the headers. So in this case:
response = requests.head('http://myshortenedurl.com/5b2su2')
(Replace myshortenedurl with a tinyurl link to follow along.)
>>> response
<Response [301]>
Nice, so we know this redirects to something.
>>> response.headers['Location']
'http://stackoverflow.com'
Now we know where the URL redirects to without actually following it or downloading any page source. Now we can apply any of the other techniques previously discussed.
Here's an example using the requests and lxml modules and the 404-page idea. (Be aware, I had to replace bit.ly with bit'ly so Stack Overflow doesn't get mad.)
#!/usr/bin/python3
import requests
from lxml.html import fromstring
links = ['http://bit'ly/MW2qgH',
'http://bit'ly/1x0885j',
'http://bit'ly/IFHzvO',
'http://bit'ly/1PwR9xM']
for link in links:
    response = requests.head(link)
    redirect = link
    # follow any number of redirects by hand
    while response.status_code in (301, 302) and 'Location' in response.headers:
        redirect = response.headers['Location']
        response = requests.head(redirect)
    fakepage = redirect.rstrip('/') + '/X5ijsuUJSoisjHJFk948.php'
    scrapetarget = requests.get(fakepage)
    tree = fromstring(scrapetarget.text)
    print(tree.findtext('.//title'))
So here we get the 404 pages, and it will follow any number of redirects. Here's the output:
Urban Dictionary error
Page Not Found - Stack Overflow
Error 404 (Not Found)!!1
Kijiji: Page Not Found
So, as you can see, we did indeed get our titles. But we see some problems with the method: namely, some titles add things, and some just don't have a good title at all. That's the issue with that method. We could, however, try the range method too. The benefit there would be that the title is correct, but sometimes we might miss it, and sometimes we have to download the whole page source to get it, increasing the required time.
Also, credit to alecxe for this part of my quick and dirty script:
tree = fromstring(scrapetarget.text)
print(tree.findtext('.//title'))
For an example with the range method: in the for link in links: loop, change the code after the try/except statement to this:
rangeheader = {"Range": "bytes=0-500"}
scrapetargetsection = requests.get(redirect, headers=rangeheader)
tree = fromstring(scrapetargetsection.text)
print(tree.findtext('.//title'))
output is:
None
Stack Overflow
Google
Kijiji: Free Classifieds in...
Here we see Urban Dictionary has no title, or I've missed it in the bytes returned. In any of these methods there are tradeoffs. The only way to get close to total accuracy would be to download the entire source of each page, I think.
Using urllib you can set the Range header to request a certain range of bytes, but there are some consequences:
it depends on the server honoring the request
you assume the data you're looking for is within the desired range (however, you can make another request using a different Range header to get the next bytes - i.e. download the first 300 bytes and fetch another 300 only if you can't find the title within the first result - two requests of 300 bytes are still much cheaper than the whole document)
(edit) to avoid situations where the title tag splits between two ranged requests, make your ranges overlap; see the 'range_header_overlapped' function in my example code
import urllib.request

req = urllib.request.Request('http://www.python.org/')
req.headers['Range'] = 'bytes=%s-%s' % (0, 300)
f = urllib.request.urlopen(req)
Just to verify whether the server accepted our range:
content_range=f.headers.get('Content-Range')
print(content_range)
My code also handles cases where the title tag is split between chunks.
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Tue May 30 04:21:26 2017
====================
#author: s
"""
import requests
from string import lower
from html.parser import HTMLParser
#proxies = { 'http': 'http://127.0.0.1:8080' }
urls = ['http://opencvexamples.blogspot.com/p/learning-opencv-functions-step-by-step.html',
'http://www.robindavid.fr/opencv-tutorial/chapter2-filters-and-arithmetic.html',
'http://blog.iank.org/playing-capitals-with-opencv-and-python.html',
'http://docs.opencv.org/3.2.0/df/d9d/tutorial_py_colorspaces.html',
'http://scikit-image.org/docs/dev/api/skimage.exposure.html',
'http://apprize.info/programming/opencv/8.html',
'http://opencvexamples.blogspot.com/2013/09/find-contour.html',
'http://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html',
'https://github.com/ArunJayan/OpenCV-Python/blob/master/resize.py']
class TitleParser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self)
self.match = False
self.title = ''
def handle_starttag(self, tag, attributes):
self.match = True if tag == 'title' else False
def handle_data(self, data):
if self.match:
self.title = data
self.match = False
def valid_content( url, proxies=None ):
valid = [ 'text/html; charset=utf-8',
'text/html',
'application/xhtml+xml',
'application/xhtml',
'application/xml',
'text/xml' ]
r = requests.head(url, proxies=proxies)
our_type = lower(r.headers.get('Content-Type'))
if not our_type in valid:
print('unknown content-type: {} at URL:{}'.format(our_type, url))
return False
return our_type in valid
def range_header_overlapped( chunksize, seg_num=0, overlap=50 ):
"""
generate overlapping ranges
(to solve cases when title tag splits between them)
seg_num: segment number we want, 0 based
overlap: number of overlaping bytes, defaults to 50
"""
start = chunksize * seg_num
end = chunksize * (seg_num + 1)
if seg_num:
overlap = overlap * seg_num
start -= overlap
end -= overlap
return {'Range': 'bytes={}-{}'.format( start, end )}
def get_title_from_url(url, proxies=None, chunksize=300, max_chunks=5):
if not valid_content(url, proxies=proxies):
return False
current_chunk = 0
myparser = TitleParser()
while current_chunk <= max_chunks:
headers = range_header_overlapped( chunksize, current_chunk )
headers['Accept-Encoding'] = 'deflate'
# quick fix, as my locally hosted Apache/2.4.25 kept raising
# ContentDecodingError when using "Content-Encoding: gzip"
# ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.',
# error('Error -3 while decompressing: incorrect header check',))
r = requests.get( url, headers=headers, proxies=proxies )
myparser.feed(r.content)
if myparser.title:
return myparser.title
current_chunk += 1
print('title tag not found within {} chunks ({}b each) at {}'.format(current_chunk-1, chunksize, url))
return False
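The parser half of the script can be exercised entirely offline, which also shows how the two handlers cooperate:

```python
from html.parser import HTMLParser

class TitleOnly(HTMLParser):
    # Same approach as the TitleParser above: raise a flag on <title>,
    # grab the next text node, then lower the flag.
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = (tag == 'title')

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

p = TitleOnly()
p.feed('<html><head><title>Chunked reads work</title></head><body>ignored</body>')
print(p.title)  # Chunked reads work
```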
I am trying to get my head around how data scraping works when you look past HTML (i.e. DOM scraping).
I've been trying to write a simple Python code to automatically retrieve the number of people that have seen a specific ad: the part where it says '3365 people viewed Peter's place this week.'
At first I tried to see if that was displayed in the HTML code but could not find it. Did some research and saw that not everything will be in the code, as it can be processed by the browser through JavaScript or other languages that I don't quite understand yet. I then inspected the element and realised that I would need to use the Python libraries 'requests' and 'lxml.html'. So I wrote this code:
import requests
import lxml.html
response = requests.get('https://www.airbnb.co.uk/rooms/501171')
resptext = lxml.html.fromstring(response.text)
final = resptext.text_content()
finalu = final.encode('utf-8')
file = open('file.txt', 'wb')
file.write(finalu)
file.close()
With that, I get a code with all the text in the web page, but not the text that I am looking for! Which is the magic number 3365.
So my question is: how do I get it? I have thought that maybe I am not using the correct language to get the DOM, maybe it is done with JavaScript and I am only using lxml. However, I have no idea.
The DOM element you are looking at is updated after page load with what looks like an AJAX call with the following request URL:
https://www.airbnb.co.uk/rooms/501171/personalization.json
If you GET that URL, it will return the following JSON data:
{
"extras_price":"£30",
"preview_bar_phrases":{
"steps_remaining":"<strong>1 step</strong> to list"
},
"flag_info":{
},
"user_is_admin":false,
"is_owned_by_user":false,
"is_instant_bookable":true,
"instant_book_reasons":{
"within_max_lead_time":null,
"within_max_nights":null,
"enough_lead_time":true,
"valid_reservation_status":null,
"not_country_or_village":true,
"allowed_noone":null,
"allowed_everyone":true,
"allowed_socially_connected":null,
"allowed_experienced_guest":null,
"is_instant_book_host":true,
"guest_has_profile_pic":null
},
"instant_book_experiments":{
"ib_max_nights":14
},
"lat":51.5299601405844,
"lng":-0.12462748035984603,
"localized_people_pricing_description":"£30 / night after 2 guests",
"monthly_price":"£4200",
"nightly_price":"£150",
"security_deposit":"",
"social_connections":{
"connected":null
},
"staggered_price":"£4452",
"weekly_price":"£1050",
"show_disaster_info":false,
"cancellation_policy":"Strict",
"cancellation_policy_link":"/home/cancellation_policies#strict",
"show_fb_cta":true,
"should_show_review_translations":false,
"listing_activity_data":{
"day":{
"unique_views":226,
"total_views":363
},
"week":{
"unique_views":3365,
"total_views":5000
}
},
"should_hide_action_buttons":false
}
If you look under "listing_activity_data" you will find the information you seek. Appending /personalization.json to any room URL seems to return this data (for now).
Update per the user agent issues
It looks like they are filtering requests to this URL based on user agent. I had to set the user agent on the urllib request in order to fix this:
import urllib2
import json
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://www.airbnb.co.uk/rooms/501171/personalization.json', None, headers)
json = json.load(urllib2.urlopen(req))
print(json['listing_activity_data']['week']['unique_views'])
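For Python 3, the same call can be written with urllib.request; the JSON drill-down is split into its own helper here. The endpoint and the User-Agent requirement are taken from the answer above; everything else is a sketch:

```python
import json
import urllib.request

def weekly_unique_views(payload):
    # The counter rendered as "... people viewed ... this week".
    return payload['listing_activity_data']['week']['unique_views']

def fetch_weekly_views(room_id):
    req = urllib.request.Request(
        'https://www.airbnb.co.uk/rooms/%s/personalization.json' % room_id,
        headers={'User-Agent': 'Mozilla/5.0'})
    return weekly_unique_views(json.load(urllib.request.urlopen(req)))

# fetch_weekly_views(501171)
```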
So first of all, you need to figure out if that section of code has any unique tags. If you look at the HTML tree you have:
html > body > #room > ....... > #book-it-urgency-commitment > div > div > ... > div#media-body > b
The data you need is stored in a 'b' tag. I'm not sure about using lxml, but I usually use BeautifulSoup for my scraping.
You can reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/ - it's pretty straightforward.
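If the number did live in the static HTML (the accepted answer shows that on this page it does not; it arrives via the personalization.json call), pulling it out of that <b> tag with BeautifulSoup would look roughly like this, against a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the html > ... > div#media-body > b path.
html = '<div id="media-body"><b>3365</b> people viewed this place.</div>'
soup = BeautifulSoup(html, 'html.parser')
views = soup.select_one('#media-body b').text
print(views)  # 3365
```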