Python data scraping - Elementary concepts

I am trying to get my head around how data scraping works when you look past HTML (i.e. DOM scraping).
I've been trying to write a simple Python script to automatically retrieve the number of people that have seen a specific ad: the part where it says '3365 people viewed Peter's place this week.'
At first I tried to see if that number was displayed in the HTML code, but I could not find it. I did some research and saw that not everything will be in the code, as it can be processed by the browser through JavaScript or other languages that I don't quite understand yet. I then inspected the element and realised that I would need to use the Python libraries 'requests' and 'lxml.html'. So I wrote this code:
import requests
import lxml.html

# Fetch the page and extract all of its visible text
response = requests.get('https://www.airbnb.co.uk/rooms/501171')
resptext = lxml.html.fromstring(response.text)
final = resptext.text_content()

# Encode to UTF-8 bytes and write in binary mode
finalu = final.encode('utf-8')
file = open('file.txt', 'wb')
file.write(finalu)
file.close()
With that, I get a file with all the text on the web page, but not the text that I am looking for: the magic number 3365.
So my question is: how do I get it? I have thought that maybe I am not using the right approach to read the DOM; maybe the number is filled in with JavaScript and I am only parsing the static HTML with lxml. However, I have no idea.

The DOM element you are looking at is updated after page load by what looks like an AJAX call, with the following request URL:
https://www.airbnb.co.uk/rooms/501171/personalization.json
If you GET that URL, it will return the following JSON data:
{
    "extras_price": "£30",
    "preview_bar_phrases": {
        "steps_remaining": "<strong>1 step</strong> to list"
    },
    "flag_info": {},
    "user_is_admin": false,
    "is_owned_by_user": false,
    "is_instant_bookable": true,
    "instant_book_reasons": {
        "within_max_lead_time": null,
        "within_max_nights": null,
        "enough_lead_time": true,
        "valid_reservation_status": null,
        "not_country_or_village": true,
        "allowed_noone": null,
        "allowed_everyone": true,
        "allowed_socially_connected": null,
        "allowed_experienced_guest": null,
        "is_instant_book_host": true,
        "guest_has_profile_pic": null
    },
    "instant_book_experiments": {
        "ib_max_nights": 14
    },
    "lat": 51.5299601405844,
    "lng": -0.12462748035984603,
    "localized_people_pricing_description": "£30 / night after 2 guests",
    "monthly_price": "£4200",
    "nightly_price": "£150",
    "security_deposit": "",
    "social_connections": {
        "connected": null
    },
    "staggered_price": "£4452",
    "weekly_price": "£1050",
    "show_disaster_info": false,
    "cancellation_policy": "Strict",
    "cancellation_policy_link": "/home/cancellation_policies#strict",
    "show_fb_cta": true,
    "should_show_review_translations": false,
    "listing_activity_data": {
        "day": {
            "unique_views": 226,
            "total_views": 363
        },
        "week": {
            "unique_views": 3365,
            "total_views": 5000
        }
    },
    "should_hide_action_buttons": false
}
If you look under "listing_activity_data" you will find the information you seek. Appending /personalization.json to any room URL seems to return this data (for now).
Update regarding the user-agent issue
It looks like they are filtering requests to this URL based on the user agent. I had to set the user agent on the urllib2 request in order to fix this:
import urllib2
import json

# Spoof a browser-like user agent, otherwise the endpoint rejects the request
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://www.airbnb.co.uk/rooms/501171/personalization.json', None, headers)
data = json.load(urllib2.urlopen(req))
print(data['listing_activity_data']['week']['unique_views'])
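For Python 3, a minimal equivalent sketch using requests (assuming the endpoint and its user-agent filtering still behave the same way):
import requests

# The browser-like user agent is still required, as above
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.airbnb.co.uk/rooms/501171/personalization.json'
response = requests.get(url, headers=headers)
data = response.json()
print(data['listing_activity_data']['week']['unique_views'])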

So first of all, you need to figure out if that section of the page has any unique tags. If you look at the HTML tree, you have
html > body > #room > ....... > #book-it-urgency-commitment > div > div > ... > div#media-body > b
The data you need is stored in a 'b' tag. I'm not sure about using lxml, but I usually use BeautifulSoup for my scraping.
You can reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/; it's pretty straightforward.
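For example, a minimal BeautifulSoup sketch that walks to that 'b' tag (the ids are taken from the tree above; note that, as the first answer explains, this element is filled in by an AJAX call after page load, so it will be empty in the static HTML that requests returns):
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.airbnb.co.uk/rooms/501171').text
soup = BeautifulSoup(html, 'html.parser')

# Selector based on the tree sketched above
tag = soup.select_one('#book-it-urgency-commitment #media-body b')
print(tag.get_text() if tag else 'not present in the static HTML')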

Related

Python scraping links from a webpage - Why no URLS?

I am a seller on Target.com and am trying to scrape the URL for every product in my catalog using Python (Python 3). When I try this I get an empty list for 'urllist', and when I print the variable 'soup', what BS4 has actually collected is the contents of "view page source" (forgive my naiveté here, definitely a novice at this still!). In reality I'd really like to be scraping URLs from the content found in the "Elements" section of the DevTools page. I can sift through the HTML on that page manually and find the links, so I know they're in there... I just don't know enough yet to tell BS4 that's the content I want to search. How can I do that?
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Need this part below for HTTPS
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Needs that context=ctx line to deal with HTTPS
url = input('Enter URL: ')
urllist = []
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Collect the href attribute of every anchor tag
for link in soup.find_all('a'):
    urllist.append(link.get('href'))
print(urllist)
If it helps, I found code that someone wrote in JavaScript that can be run from the developer console; it works and grabbed all of my links. But my goal is to be able to do this in Python (Python 3):
var x = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < x.length; i++) {
    var nametext = x[i].textContent;
    var cleantext = nametext.replace(/\s+/g, ' ').trim();
    var cleanlink = x[i].href;
    myarray.push([cleantext, cleanlink]);
};

function make_table() {
    var table = '<table><thead><th>Name</th><th>Links</th></thead><tbody>';
    for (var i = 0; i < myarray.length; i++) {
        table += '<tr><td>' + myarray[i][0] + '</td><td>' + myarray[i][1] + '</td></tr>';
    };
    var w = window.open("");
    w.document.write(table);
}
make_table()
I suspect this is occurring because Target's website (at least, the main page) builds the page content via JavaScript. Your browser is able to render and execute the page's scripts, but your Python code does no such thing. See this post for help in that regard.
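For illustration, a minimal sketch of that rendering approach using Selenium (an assumption on my part, not part of the linked post; driver setup varies, and late-loading content may also need explicit waits):
from selenium import webdriver
from bs4 import BeautifulSoup

# Let a real browser execute the page's JavaScript first
driver = webdriver.Chrome()  # requires chromedriver on your PATH
driver.get('https://www.target.com')  # hypothetical example page
html = driver.page_source  # the rendered DOM, not just the raw source
driver.quit()

# The anchors built by JavaScript are now present for BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
urllist = [a.get('href') for a in soup.find_all('a')]
print(urllist)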
Without going into the specifics of your code: fundamentally, if you can make a call to a URL, you already have that URL. If you use the script to scrape one entered URL at a time, that URL could be logged by appending the corresponding entry to urllist (the object returned by each link.get('href')).
If you have some other original source (a list?) for the URLs to scrape, those could be added to urllist in a similar fashion.
The course of action chosen depends on the actual data structure returned by link.get('href'). Suggestions (see the sketch after this list):
If it's a string containing HTML, put that string in a dict under the key 'html', and add another dict key 'url'.
If it's already a dict object, just add a 'url' key-value pair.
If you want to enter one URL and extract the others from that URL's HTML document, retrieve the HTML and parse it with something like ElementTree.
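A minimal sketch of that record-keeping idea (assuming the soup object from the question's code; the 'html' field is a placeholder for content fetched later):
# Store each scraped link as a dict so more fields can be attached later
records = []
for link in soup.find_all('a'):
    records.append({'url': link.get('href'), 'html': None})
print(records)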
You can do this in a number of ways.

Unable to retrieve table through lxml, xpath (unanswered)

I"m trying to scrape http://www.sgx.com/wps/portal/sgxweb/home/company_disclosure/stockfacts for this table, and return the Company Name, Code and industry into a list, for all 15 pages of it.
And i've been trying to work with lxml.html, xpath and Beautifulsoup to try and get this information, but i'm stuck.
I realised that this information seems to be a #html embedded within the website, but i'm not sure how I can build a module to retrieve it.
Any thoughts? Or if I should be using a different module/technique?
Edit
I found out that this link was embedded into the website, which consist of the #html that I was talking about previously: http://sgx.wealthmsi.com/index.html#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome%2Fcompany_disclosure%2Fstockfacts
When I tried to use BeautifulSoup to pull the data out:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://sgx.wealthmsi.com/index.html#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome%2Fcompany_disclosure%2Fstockfacts')
wb = BeautifulSoup(r.text, "html.parser")
print(wb.find_all('div', attrs={'class': 'table-wrapper results-display'}))
It returns the result below:
[<div class="table-wrapper results-display">
<table>
<thead>
<tr></tr>
</thead>
<tbody></tbody>
</table>
</div>]
But that's different from what is in the website. Any thoughts?
You might want to address this problem another way.
By looking at the server calls (Chrome -> F12 -> Network tab), you can figure out which URL you should actually call to get a JSON response instead.
Apparently, you could use a url that starts like this:
http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search?callback=json&json=???? (you'll need to do some reverse engineering to figure out the actual json query but it doesn't look too difficult)
Sorry I did not look much further into the json query but I hope this helps you keep going :)
Note: I based my answer on this url
#!/usr/bin/env python
import requests

url = "http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search"
params = {
    'callback': 'json',
    'json': {
        # key / value pairs defining your actual query to the server
        # you need to figure this out yourself depending on the data you want
        # to retrieve.
        # I usually look at chrome's network tab (F12), find the proper URL
        # that queries for the data, reverse engineer the key/value pairs
    }
}
response = requests.get(url, params)
print(response.json())
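One caveat worth flagging (my assumption, not verified against this endpoint): requests will not serialize a nested dict into a single query-string value, so the json parameter may need to be passed as a pre-serialized string, along these lines:
import json
import requests

url = "http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search"
query = {}  # the reverse-engineered key/value pairs would go here
response = requests.get(url, params={'callback': 'json', 'json': json.dumps(query)})
print(response.json())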

Downloading data using Python from a website that uses JavaScript to display information

I typically use the following template script to download data from a website:
import urllib.request as web
from bs4 import BeautifulSoup
...
url_to_visit = 'http://www.website-link-to-download-data'
source_code = web.urlopen(url_to_visit).read()

# Decode the raw bytes to text, then split into lines
source_code = ''.join(map(chr, source_code))
source_code = source_code.split('\n')
## then further process the lines returned in `source_code` as needed
But sometimes I come across very difficult sites.
Consider the site: https://www.spice-indices.com/idp2/Main#home. Suppose that from the first table, Intraday Alerts - United States, I want to download via a Python script the information that is displayed when I click the SP TMI tab.
I looked at the output of the split source_code above, but I couldn't figure out how to extract the information I want. The site seems to be using a JavaScript backend to display the information. Can someone give me any pointers or suggestions?
I am using Python 3.x.
When you activate the "SP TMI" tab, there is a POST request sent to the "intraday-announcements.json" endpoint; simulate that in your code and parse the JSON response.
Sample working code using requests:
import requests

with requests.Session() as session:
    session.get("https://www.spice-indices.com/idp2/Main#home")

    response = session.post("https://www.spice-indices.com/idp2/intraday/effectivedate/11-14-2015/intraday-announcements.json", data={
        "start": "0",
        "limit": "10",
        "indexKey": "SPUSA-TMI-USDUF--P-US----"
    })

    data = response.json()["widget_data"]
    for item in data:
        print(item["EVENT_NAME"])
Prints:
Dividend
Weekly Share Change
Special Dividend
Merger/Acquisition
Merger/Acquisition
Drop
Merger/Acquisition
Merger/Acquisition
Drop
Identifier Changes
Note that the effective date is actually inside the URL, see the 11-14-2015 part.
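Since the effective date lives in the URL path, a small helper can parameterize it (a sketch; the MM-DD-YYYY format is inferred from the URL above):
from datetime import date

def intraday_url(d):
    # Embed the effective date (MM-DD-YYYY) in the endpoint path
    return ("https://www.spice-indices.com/idp2/intraday/effectivedate/"
            "%s/intraday-announcements.json" % d.strftime("%m-%d-%Y"))

print(intraday_url(date(2015, 11, 14)))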

Facebook Login Using Requests error

import requests
from bs4 import BeautifulSoup

a = requests.Session()
soup = BeautifulSoup(a.get("https://www.facebook.com/").content, "html.parser")
payload = {
    "lsd": soup.find("input", {"name": "lsd"})["value"],
    "email": "my_email",
    "pass": "my_password",
    "persistent": "1",
    "default_persistent": "1",
    "timezone": "300",
    "lgnrnd": soup.find("input", {"name": "lgnrnd"})["value"],
    "lgndim": soup.find("input", {"name": "lgndim"})["value"],
    "lgnjs": soup.find("input", {"name": "lgnjs"})["value"],
    "locale": "en_US",
    "qsstamp": soup.find("input", {"name": "qsstamp"})["value"]
}
soup = BeautifulSoup(a.post("https://www.facebook.com/", data=payload).content, "html.parser")
print([i.text for i in soup.find_all("a")])
I'm playing around with requests and have read several threads here on SO about it, so I decided to try it out myself.
I am stumped by this line: "qsstamp": soup.find("input", {"name": "qsstamp"})["value"]
because soup.find returns nothing there, thereby causing an error.
However, looking at Chrome developer tools, this "qsstamp" is populated. What am I missing here?
The payload is everything shown in the form data on Chrome dev tools. So what is going on?
Using Firebug to search for qsstamp gives matched results that direct to: Here
You can see: j.createHiddenInputs({qsstamp:u},v)
That means qsstamp is dynamically generated by JavaScript.
requests will not run JavaScript (since what it does is fetch that page's HTML). You may want to use something like dryscrape, or an emulated browser like Selenium.
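As a sketch of the Selenium route (assuming Selenium 4's API, and that the hidden input exists once the page's JavaScript has run):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any Selenium-supported browser works
driver.get('https://www.facebook.com/')

# After the page's JavaScript runs, the hidden input should be in the DOM
elem = driver.find_element(By.CSS_SELECTOR, 'input[name="qsstamp"]')
print(elem.get_attribute('value'))
driver.quit()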

Missing source page information using urllib2

I'm trying to scrape "game tag" data (not the same as HTML tags) from games listed on the digital game distribution site, Steam (store.steampowered.com). This information isn't available via the Steam API, as far as I can tell.
Once I have the raw source data for a page, I want to pass it into BeautifulSoup for further parsing, but I have a problem: urllib2 doesn't seem to be reading the information I want (requests doesn't work either), even though it's obviously in the source page when viewed in the browser.
For example, I might download the page for the game "7 Days to Die" (http://store.steampowered.com/app/251570/). When viewing the browser source page in Chrome, I can see the following relevant information regarding the game's "tags",
near the end, starting at line 1615:
<script type="text/javascript">
$J( function() {
InitAppTagModal( 251570,
{"tagid":1662,"name":"Survival","count":283,"browseable":true},
{"tagid":1659,"name":"Zombies","count":274,"browseable":true},
{"tagid":1702,"name":"Crafting","count":248,"browseable":true},...
In InitAppTagModal, there are the tags "Survival", "Zombies", "Crafting", etc. that contain the information I'd like to collect.
But when I use urllib2 to get the page source:
import urllib2

url = "http://store.steampowered.com/app/224600/"  # 7 Days to Die page
page = urllib2.urlopen(url).read()
The part of the source page that I'm interested in is not saved in my "page" variable; instead, everything below line 1555 is simply blank until the closing body and html tags, resulting in this (carriage returns included):
</div><!-- End Footer -->
</body>
</html>
The blank space is where the source code I need (along with other code) should be.
I've tried this on several different computers with different installs of python 2.7 (Windows machines and a Mac), and I get the same result on all of them.
How can I get the data that I'm looking for?
Thank you for your consideration.
Well, I don't know if I'm missing something, but it's working for me using requests:
import requests
import json

# Getting html code
url = "http://store.steampowered.com/app/251570/"
html = requests.get(url).text
And even more, the data requested is in json format, so it's easy to extract it in this way:
# Extracting the javascript object (a json-like object)
start_tag = 'InitAppTagModal( 251570,'
end_tag = '],'
startIndex = html.find(start_tag) + len(start_tag)
endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
raw_data = html[startIndex:endIndex]

# Load raw data as a python json object
data = json.loads(raw_data)
You will see a beautiful json object like this (this is the info that you need, right?):
[
    {
        "count": 283,
        "browseable": true,
        "tagid": 1662,
        "name": "Survival"
    },
    {
        "count": 274,
        "browseable": true,
        "tagid": 1659,
        "name": "Zombies"
    },
    {
        "count": 248,
        "browseable": true,
        "tagid": 1702,
        "name": "Crafting"
    }......
I hope it helps....
UPDATED:
Ok, I see your problem now: it seems that the problem is with page 224600. In this case the webpage requires that you confirm your age before showing you the game's info. Anyway, it's easy to solve: just post the form that confirms the age. Here is the updated code (and I created a function):
import requests
import json

def extract_info_games(page_id):
    # Create session
    session = requests.session()

    # Get initial html
    html = session.get("http://store.steampowered.com/app/%s/" % page_id).text

    # Checking if I'm in the check age page (just checking if the check age form is in the html code)
    if ('<form action="http://store.steampowered.com/agecheck/app/%s/"' % page_id) in html:
        # I'm being redirected to the check age page
        # let's confirm my age with a POST:
        post_data = {
            'snr': '1_agecheck_agecheck__age-gate',
            'ageDay': 1,
            'ageMonth': 'January',
            'ageYear': '1960'
        }
        html = session.post('http://store.steampowered.com/agecheck/app/%s/' % page_id, post_data).text

    # Extracting the javascript object (a json-like object)
    start_tag = 'InitAppTagModal( %s,' % page_id
    end_tag = '],'
    startIndex = html.find(start_tag) + len(start_tag)
    endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
    raw_data = html[startIndex:endIndex]

    # Load raw data as a python json object
    data = json.loads(raw_data)
    return data
And to use it:
extract_info_games(224600)
extract_info_games(251570)
Enjoy!
When using urllib2 and read(), you may have to read repeatedly in chunks until you hit EOF, in order to get the entire HTML source.
import urllib2

url = "http://store.steampowered.com/app/224600/"  # 7 Days to Die page
url_handle = urllib2.urlopen(url)

# Accumulate the response in chunks until read() returns an empty string (EOF)
data = ""
while True:
    chunk = url_handle.read()
    if not chunk:
        break
    data += chunk
An alternative would be to use the requests module:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/app/251570/')
soup = BeautifulSoup(r.text, 'html.parser')
