Python scraping links from a webpage - Why no URLs?

I am a seller on Target.com and am trying to scrape the URL for every product in my catalog using Python (Python 3). When I try this I get an empty list for 'urllist', and when I print the variable 'soup', what BS4 has actually collected is the contents of "View page source" (forgive my naiveté here, I'm definitely still a novice at this!). What I'd really like is to scrape URLs from the content shown in the "Elements" tab of the DevTools page. I can sift through the HTML there manually and find the links, so I know they're in there... I just don't know enough yet to tell BS4 that's the content I want to search. How can I do that?
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
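# Disable certificate verification so HTTPS requests do not fail on
# certificate errors (insecure; only suitable for quick experiments)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
# The context=ctx argument below applies those settings to the request
url = input('Enter URL: ')
urllist = []
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Collect the href of every <a> tag present in the downloaded HTML
for link in soup.find_all('a'):
    urllist.append(link.get('href'))
print(urllist)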
If it helps, I found code that someone developed in JavaScript that can be run from the developer console; it works and grabbed me all of my links. But my goal is to be able to do this in Python (Python 3).
var x = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < x.length; i++) {
    var nametext = x[i].textContent;
    var cleantext = nametext.replace(/\s+/g, ' ').trim();
    var cleanlink = x[i].href;
    myarray.push([cleantext, cleanlink]);
}
function make_table() {
    var table = '<table><thead><th>Name</th><th>Links</th></thead><tbody>';
    for (var i = 0; i < myarray.length; i++) {
        table += '<tr><td>' + myarray[i][0] + '</td><td>' + myarray[i][1] + '</td></tr>';
    }
    var w = window.open("");
    w.document.write(table);
}
make_table()

I suspect this is occurring because Target's website (at least, the main page) builds the page content via JavaScript. Your browser executes that JavaScript and renders the resulting page, but your Python code only downloads the initial HTML and does no such thing. See this post for help in that regard.
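To see the difference concretely, here is a minimal sketch that uses only the standard library. The sample HTML and the names `LinkCollector` and `collect_links` are made up for illustration and are not Target's actual markup. It shows that a parser which does not run JavaScript only sees the anchors already present in the raw HTML, while links a script would create never appear.

```python
from html.parser import HTMLParser

# Hypothetical page: one static link, plus a script that would add
# another link only when a browser executes it.
SAMPLE_HTML = """
<html><body>
<a href="/static-link">Static</a>
<div id="products"></div>
<script>
  var a = document.createElement("a");
  a.href = "/product/123";
  document.getElementById("products").appendChild(a);
</script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags found in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Only the static link is found; the script body is never executed.
print(collect_links(SAMPLE_HTML))
```

If the real page behaves like this sample, the links have to come either from a tool that runs the JavaScript or from whatever request the script itself makes.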

Without going into the specifics of your code: fundamentally, if you can make a call to a URL, you already have that URL. If you use the script to scrape one entered URL at a time, you can record it by storing it alongside each entry appended to urllist (the object returned by each link.get('href')).
If you have some other original source (a list?) for the URLs to scrape, those can be added to urllist in a similar fashion.
The course of action chosen depends on the actual data structure returned by link.get('href'). Suggestions:
If it's a string containing HTML, put that string in a dict under the key 'html', and add another key 'url'.
If it's already a dict object, just add a key-value pair 'url'.
If you want to enter one URL and extract the others from that URL's HTML document, retrieve the HTML and parse it with something like ElementTree.
You can do this a number of ways.

Related

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.
This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
1. Use a library capable of executing JavaScript and giving you the resulting HTML
2. Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# A short sleep after the initial render gives the script time to fill in the word.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, then we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended in the standard library documentation here) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)
The site first sends you the page with an empty span and fills the word in later through JavaScript; that's why you get a span with nothing inside.
However, since you only want the word, I'd suggest a different method: rather than scraping the word off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in the response.
You're using Python 2, but in Python 3 (just so I can show this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.
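For readers without requests installed, the same empty-body POST can be sketched with Python 3's standard library. The helper names `build_post_request` and `fetch_random_word` are made up for this example, and whether the site still answers this way is an assumption.

```python
import urllib.request

URL = "http://watchout4snakes.com/wo4snakes/Random/RandomWord"


def build_post_request(url):
    # An explicit method plus an empty bytes body produces a POST
    # request with nothing in it, matching requests.post(url).
    return urllib.request.Request(url, data=b"", method="POST")


def fetch_random_word(url=URL):
    # Performs the network call and returns the response body as text.
    with urllib.request.urlopen(build_post_request(url)) as response:
        return response.read().decode("utf-8").strip()
```

Calling fetch_random_word() should then return a single word, provided the endpoint still behaves as described in the answers.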

Webscraping my grades

I'm trying to create a program that grabs my school grades from a website every day, then stores the values and creates a graph of my grades. But when I try to scrape the page, the HTML that I receive is different from the HTML that I get with Inspect Element.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://ames.usoe-dcs.org/Students/2567")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj)
Inspect Element gives me: http://pastebin.com/BakmpqUM
while Python gives me: http://pastebin.com/7gPY1WgB
I figure this is because the URL to my grades (https://ames.usoe-dcs.org/Students/2567) is private, so when you type it into the browser it redirects to: https://ames.usoe-dcs.org/Login/?DestinationURL=%2FStudents%2F2566
Is there a way to use Python to automatically sign me in?
The URL isn't necessarily private; however, requesting the URL without the cookies that verify your status as a user won't get you the information you see when you are logged in.
I would recommend opening Inspect Element to the Network tab and reloading the page with your grades on it (while signed in). Then right-click on the first request (it should be a GET request answered with HTML, code 200), hover over Copy, then click Copy as cURL command (bash). Then paste it into this webpage and copy the Python. It will give you the proper request for the page, including the cookies and verification parameters you used to access it in the browser. From there you can parse the HTML response for your grade.
You should have something like this to receive and parse your HTML from the request:
import requests
from bs4 import BeautifulSoup
cookies = {
    ...stuff...
}
headers = {
    ...stuff...
}
r = requests.get("https://ames.usoe-dcs.org/Students/2567", headers=headers, cookies=cookies)
soup = BeautifulSoup(r.text, "lxml")
grade = soup.find("h1", {"class": "grade"}).contents  # Customize to find your grade
print(grade)
The cookies and headers dictionaries come from the cURL to Python output.
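If you would rather build the cookies dictionary by hand, the Cookie header shown in the browser's network tab is just a string of name=value pairs. A minimal sketch using the standard library follows; the cookie names and values are hypothetical placeholders, not what this site actually sets, and `cookie_header_to_dict` is a made-up helper name.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header_value):
    """Turn 'name=value; other=value' into {'name': 'value', ...}."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical header value copied from the browser's request headers.
cookies = cookie_header_to_dict("sessionid=abc123; auth_token=xyz789")
print(cookies)
```

The resulting dictionary can be passed as the cookies argument in the requests.get call above. Session cookies expire, so the values would need to be refreshed from the browser from time to time.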

I cannot get the body element of an HTML page when web scraping with Python

I would like to parse a website with Python's urllib library. I wrote this:
from bs4 import BeautifulSoup
from urllib.request import HTTPCookieProcessor, build_opener
from http.cookiejar import FileCookieJar
def makeSoup(url):
    jar = FileCookieJar("cookies")
    opener = build_opener(HTTPCookieProcessor(jar))
    html = opener.open(url).read()
    return BeautifulSoup(html, "lxml")
def articlePage(url):
    return makeSoup(url)
Links = "http://collegeprozheh.ir/%d9%85%d9%82%d8%a7%d9%84%d9%87- %d9%85%d8%af%d9%84-%d8%b1%d9%82%d8%a7%d8%a8%d8%aa%db%8c-%d8%af%d8%b1-%d8%b5%d9%86%d8%b9%d8%aa-%d9%be%d9%86%d9%84-%d9%87%d8%a7%db%8c-%d8%ae%d9%88%d8%b1%d8%b4%db%8c%d8%af/"
print(articlePage(Links))
But the website does not return the content of the body tag.
This is the result of my program:
cURL = window.location.href;
var p = new Date();
second = p.getTime();
GetVars = getUrlVars();
setCookie("Human" , "15421469358743" , 10);
check_coockie = getCookie("Human");
if (check_coockie != "15421469358743")
document.write("Could not Set cookie!");
else
window.location.reload(true);
</script>
</head><body></body>
</html>
I think the cookie has caused this problem.
The page is using JavaScript to check the cookie and to generate the content. However, urllib does not process JavaScript and thus the page shows nothing.
You'll either need to use something like Selenium that acts as a browser and executes JavaScript, or you'll need to set the cookie yourself before you request the page (from what I can see, that's all the JavaScript code does). You seem to be loading a file containing cookie definitions (using FileCookieJar), however you haven't included the content.
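Setting the cookie yourself could look roughly like the sketch below. It assumes the page keeps embedding the value in a setCookie("Human", ...) call as in the output above, and that sending that value back in a Cookie header is all the server checks; neither is confirmed. The helper names are made up for illustration.

```python
import re
import urllib.request


def extract_human_cookie(html):
    """Return the value passed to setCookie("Human", ...), or None."""
    match = re.search(r'setCookie\(\s*"Human"\s*,\s*"(\d+)"', html)
    return match.group(1) if match else None


def build_request_with_cookie(url, value):
    # Send the cookie that the page's JavaScript would have set.
    request = urllib.request.Request(url)
    request.add_header("Cookie", "Human=" + value)
    return request


# Line taken from the program output shown in the question.
sample = 'setCookie("Human" , "15421469358743" , 10);'
value = extract_human_cookie(sample)
request = build_request_with_cookie("http://collegeprozheh.ir/", value)
print(request.get_header("Cookie"))
```

In practice you would fetch the page once, extract the value from that response, and then pass the request built here to urllib.request.urlopen for the second fetch.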

Python data scraping - Elementary concepts

I am trying to get my head around how data scraping works when you look past HTML (i.e. DOM scraping).
I've been trying to write a simple Python code to automatically retrieve the number of people that have seen a specific ad: the part where it says '3365 people viewed Peter's place this week.'
At first I tried to see if that was displayed in the HTML code but could not find it. I did some research and saw that not everything will be in the code, as it can be processed by the browser through JavaScript or other languages that I don't quite understand yet. I then inspected the element and realised that I would need to use the Python libraries 'requests' and 'lxml.html'. So I wrote this code:
import requests
import lxml.html
response = requests.get('https://www.airbnb.co.uk/rooms/501171')
resptext = lxml.html.fromstring(response.text)
final = resptext.text_content()
finalu = final.encode('utf-8')
file = open('file.txt', 'w')
file.write(finalu)
file.close()
With that, I get a file with all the text in the web page, but not the text that I am looking for, which is the magic number 3365!
So my question is: how do I get it? I have thought that maybe I am not using the correct language to get the DOM; maybe it is done with JavaScript and I am only using lxml. However, I have no idea.
The DOM element you are looking at is updated after page load with what looks like an AJAX call with the following request URL:
https://www.airbnb.co.uk/rooms/501171/personalization.json
If you GET that URL, it will return the following JSON data:
{
"extras_price":"£30",
"preview_bar_phrases":{
"steps_remaining":"<strong>1 step</strong> to list"
},
"flag_info":{
},
"user_is_admin":false,
"is_owned_by_user":false,
"is_instant_bookable":true,
"instant_book_reasons":{
"within_max_lead_time":null,
"within_max_nights":null,
"enough_lead_time":true,
"valid_reservation_status":null,
"not_country_or_village":true,
"allowed_noone":null,
"allowed_everyone":true,
"allowed_socially_connected":null,
"allowed_experienced_guest":null,
"is_instant_book_host":true,
"guest_has_profile_pic":null
},
"instant_book_experiments":{
"ib_max_nights":14
},
"lat":51.5299601405844,
"lng":-0.12462748035984603,
"localized_people_pricing_description":"£30 / night after 2 guests",
"monthly_price":"£4200",
"nightly_price":"£150",
"security_deposit":"",
"social_connections":{
"connected":null
},
"staggered_price":"£4452",
"weekly_price":"£1050",
"show_disaster_info":false,
"cancellation_policy":"Strict",
"cancellation_policy_link":"/home/cancellation_policies#strict",
"show_fb_cta":true,
"should_show_review_translations":false,
"listing_activity_data":{
"day":{
"unique_views":226,
"total_views":363
},
"week":{
"unique_views":3365,
"total_views":5000
}
},
"should_hide_action_buttons":false
}
If you look under "listing_activity_data" you will find the information you seek. Appending /personalization.json to any room URL seems to return this data (for now).
Update per the user agent issues
It looks like they are filtering requests to this URL based on user agent. I had to set the user agent on the urllib request in order to fix this:
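import urllib2
import json
# Send a browser-like user agent, since the default one appears to be filtered
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://www.airbnb.co.uk/rooms/501171/personalization.json', None, headers)
# Use a name other than "json" so the module is not shadowed
data = json.load(urllib2.urlopen(req))
print(data['listing_activity_data']['week']['unique_views'])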
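For Python 3, a rough equivalent is sketched below. The function names are made up for illustration, and the endpoint and the user-agent requirement are only what was observed in this answer, so both may have changed. The parsing step is separated out so it can be checked against a trimmed copy of the JSON shown above without touching the network.

```python
import json
import urllib.request


def weekly_unique_views(payload):
    """Pull the weekly unique view count out of the JSON text."""
    data = json.loads(payload)
    return data["listing_activity_data"]["week"]["unique_views"]


def fetch_personalization(room_url):
    # Network call with a browser-like user agent.
    request = urllib.request.Request(
        room_url.rstrip("/") + "/personalization.json",
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Trimmed copy of the JSON shown above.
SAMPLE = (
    '{"listing_activity_data": {'
    '"day": {"unique_views": 226, "total_views": 363}, '
    '"week": {"unique_views": 3365, "total_views": 5000}}}'
)

print(weekly_unique_views(SAMPLE))
```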
First of all, you need to figure out whether that section of the page has any unique tags. If you look at the HTML tree you have
html > body > #room > ....... > #book-it-urgency-commitment > div > div > ... > div#media-body > b
The data you need is stored in a 'b' tag. I'm not sure about using lxml, but I usually use BeautifulSoup for my scraping.
You can reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/; it's pretty straightforward.

Missing source page information using urllib2

I'm trying to scrape "game tag" data (not the same as HTML tags) from games listed on the digital game distribution site, Steam (store.steampowered.com). This information isn't available via the Steam API, as far as I can tell.
Once I have the raw source data for a page, I want to pass it into beautifulsoup for further parsing, but I have a problem - urllib2 doesn't seem to be reading the information I want (request doesn't work either), even though it's obviously in the source page when viewed in the browser.
For example, I might download the page for the game "7 Days to Die" (http://store.steampowered.com/app/251570/). When viewing the browser source page in Chrome, I can see the following relevant information regarding the game's "tags" near the end, starting at line 1615:
<script type="text/javascript">
$J( function() {
InitAppTagModal( 251570,
{"tagid":1662,"name":"Survival","count":283,"browseable":true},
{"tagid":1659,"name":"Zombies","count":274,"browseable":true},
{"tagid":1702,"name":"Crafting","count":248,"browseable":true},...
In InitAppTagModal, there are the tags "Survival", "Zombies", "Crafting", etc. that contain the information I'd like to collect.
But when I use urllib2 to get the page source:
import urllib2
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page
page = urllib2.urlopen(url).read()
The part of the source page that I'm interested in is not saved in my "page" variable; instead, everything below line 1555 is simply blank until the closing body and html tags, resulting in this (carriage returns included):
</div><!-- End Footer -->
</body>
</html>
The blank space is where the source code I need (along with other code) should be.
I've tried this on several different computers with different installs of python 2.7 (Windows machines and a Mac), and I get the same result on all of them.
How can I get the data that I'm looking for?
Thank you for your consideration.
Well, I don't know if I'm missing something, but it's working for me using requests:
import requests
# Getting html code
url = "http://store.steampowered.com/app/251570/"
html = requests.get(url).text
What's more, the requested data is in JSON format, so it's easy to extract it this way:
import json
# Extract the JavaScript object (a JSON-like object)
start_tag = 'InitAppTagModal( 251570,'
end_tag = '],'
startIndex = html.find(start_tag) + len(start_tag)
endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
raw_data = html[startIndex:endIndex]
# Load the raw data as a Python object
data = json.loads(raw_data)
You will see a beautiful JSON object like this (this is the info that you need, right?):
[
{
"count": 283,
"browseable": true,
"tagid": 1662,
"name": "Survival"
},
{
"count": 274,
"browseable": true,
"tagid": 1659,
"name": "Zombies"
},
{
"count": 248,
"browseable": true,
"tagid": 1702,
"name": "Crafting"
}......
I hope it helps.
UPDATED:
OK, I see your problem now: it seems to be specific to page 224600. In this case the webpage requires you to confirm your age before it shows you the game's info. It's easy to solve by posting the form that confirms the age. Here is the updated code (I also wrapped it in a function):
import json
import requests
def extract_info_games(page_id):
    # Create session
    session = requests.session()
    # Get initial html
    html = session.get("http://store.steampowered.com/app/%s/" % page_id).text
    # Check whether we landed on the age-check page (its form is in the html)
    if ('<form action="http://store.steampowered.com/agecheck/app/%s/"' % page_id) in html:
        # We are being redirected to the age-check page,
        # so confirm the age with a POST:
        post_data = {
            'snr': '1_agecheck_agecheck__age-gate',
            'ageDay': 1,
            'ageMonth': 'January',
            'ageYear': '1960'
        }
        html = session.post('http://store.steampowered.com/agecheck/app/%s/' % page_id, post_data).text
    # Extract the JavaScript object (a JSON-like object)
    start_tag = 'InitAppTagModal( %s,' % page_id
    end_tag = '],'
    startIndex = html.find(start_tag) + len(start_tag)
    endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
    raw_data = html[startIndex:endIndex]
    # Load the raw data as a Python object
    data = json.loads(raw_data)
    return data
And to use it:
extract_info_games(224600)
extract_info_games(251570)
Enjoy!
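The slicing step can be tried offline on a small sample. The sketch below is a Python 3 variant that returns an empty list when the marker is missing (for instance when an age-check page comes back) instead of slicing at the wrong position. The sample text is a shortened, hypothetical stand-in for the real page source, and it assumes the tag list is wrapped in square brackets, as the '],' end marker in the answer implies.

```python
import json


def extract_tags(html, page_id):
    """Slice the JSON array out of the InitAppTagModal(...) call."""
    start_tag = "InitAppTagModal( %s," % page_id
    start = html.find(start_tag)
    if start == -1:
        return []  # marker missing, e.g. an age-check page was returned
    start += len(start_tag)
    end = html.find("],", start)
    if end == -1:
        return []
    # Keep the closing bracket so the slice is a complete JSON array.
    return json.loads(html[start:end + 1])


# Shortened, hypothetical stand-in for the page source.
SAMPLE = '''$J( function() {
InitAppTagModal( 251570,
[{"tagid":1662,"name":"Survival","count":283,"browseable":true},
{"tagid":1659,"name":"Zombies","count":274,"browseable":true}],
"remaining arguments omitted" );
'''

names = [tag["name"] for tag in extract_tags(SAMPLE, 251570)]
print(names)
```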
When using urllib2 and read(), you may have to read repeatedly in chunks until you hit EOF in order to get the entire HTML source.
import urllib2
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page
url_handle = urllib2.urlopen(url)
data = ""
while True:
    # Read up to 8192 bytes at a time; an empty result means EOF
    chunk = url_handle.read(8192)
    if not chunk:
        break
    data += chunk
An alternative would be to use the requests module:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://store.steampowered.com/app/251570/')
soup = BeautifulSoup(r.text, 'html.parser')
