Accessing a website in Python

I am trying to get all the URLs on a website using Python. At the moment I am just copying the website's HTML into the Python program and then using code to extract all the URLs. Is there a way I could do this straight from the web without having to copy the entire HTML?

In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters, I suggest having a look at the requests library.
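For example, a minimal sketch with requests (the endpoint, parameters and credentials below are just placeholders, not a real API):
import requests

# hypothetical endpoint and credentials, purely for illustration
response = requests.get(
    'http://example.com/search',
    params={'q': 'python'},        # query parameters appended to the URL
    auth=('user', 'secret'),       # HTTP basic authentication
    timeout=10,
)
response.raise_for_status()        # raise an exception on a 4xx/5xx response
html = response.text               # the decoded HTML as a string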

The most straightforward would probably be urllib.urlopen if you're using Python 2, or urllib.request.urlopen if you're using Python 3 (you have to do import urllib or import urllib.request first, of course). That way you get a file-like object from which you can read (i.e. f.read()) the HTML document.
Example for Python 2:
import urllib
f = urllib.urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part, which is analyzing the HTML document for links.

You might want to use the bs4 (BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can install bs4 with the following command at the command line: pip install BeautifulSoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])

You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Pass the HTML string to BeautifulSoup, which does all the work of parsing the DOM, and then collect all the URLs, i.e. the <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, 'html.parser', parse_only=SoupStrainer('a'))
for a_element in bs:
    if a_element.has_attr('href'):
        print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...
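Note that many of these are relative paths. If you need absolute URLs, one option (a small sketch building on the code above) is to join each href against the page URL with urllib.parse.urljoin:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup, SoupStrainer

base_url = 'http://stackoverflow.com'
response = requests.get(base_url)

# parse only the <a> tags and resolve each href against the page URL
bs = BeautifulSoup(response.text, 'html.parser', parse_only=SoupStrainer('a'))
absolute_urls = [urljoin(base_url, a['href']) for a in bs.find_all('a', href=True)]
print(absolute_urls[:10])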

Related

How do I decode a webpage using the Requests and BeautifulSoup libraries in Python?

I tried writing some code for a project I am doing. First, I'll show you my code.
import requests
from bs4 import BeautifulSoup
url = 'http://github.com'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")
title = soup.find('span', 'articletitle')
The project is to be able to decode a webpage. Basically, you put any URL in the variable url and use Python to return the basic HTML code in a text format. I am using the requests and BeautifulSoup libraries for Python.
I tried running this code, and it should be right, but when it runs it doesn't return anything. Can you help me?
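One possible explanation, with a minimal sketch to check it: soup.find() returns None when no matching element exists (github.com may simply have no <span class="articletitle"> element), and the script as posted never prints or writes anything, so even a successful run shows no output:
import requests
from bs4 import BeautifulSoup

url = 'http://github.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# save the raw HTML to a text file, which was the stated goal of the project
with open('page.txt', 'w', encoding='utf-8') as f:
    f.write(r.text)

# guard against find() returning None before using the result
title = soup.find('span', class_='articletitle')
if title is None:
    print('No <span class="articletitle"> element on this page')
else:
    print(title.get_text())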

How can I get data from this link into a JSON?

I am trying to extract the search results with Python from this link into a JSON file, but the normal request methods don't seem to work in this case. How can I extract all the results?
url= https://apps.usp.org/app/worldwide/medQualityDatabase/reportResults.html?country=Ethiopia%2BGhana%2BKenya%2BMozambique%2BNigeria%2BCambodia%2BLao+PDR%2BPhilippines%2BThailand%2BViet+Nam%2BBolivia%2BColombia%2BEcuador%2BGuatemala%2BGuyana%2BPeru&period=2017%2B2016%2B2015%2B2014%2B2013%2B2012%2B2011%2B2010%2B2009%2B2008%2B2007%2B2006%2B2005%2B2004%2B2003&conclusion=Both&testType=Both&counterfeit=Both&recordstart=50
My code:
import requests
from bs4 import BeautifulSoup
r = requests.get(url)
results_page = BeautifulSoup(r.content, 'lxml')
Why am I not getting the full source code of the page?
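One thing to check first: the results on that page may be filled in by JavaScript after the page loads, in which case requests will never see them and you would need the site's underlying data endpoint or a browser-automation tool. If the results are in an ordinary HTML <table>, a minimal sketch for dumping them to JSON (the table structure is an assumption here, not verified against that site) could look like this:
import json

import requests
from bs4 import BeautifulSoup

url = 'https://apps.usp.org/app/worldwide/medQualityDatabase/reportResults.html?...'  # full query string as in the question
r = requests.get(url)
results_page = BeautifulSoup(r.content, 'lxml')

# collect every table row as a list of cell texts
rows = []
for tr in results_page.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
    if cells:
        rows.append(cells)

with open('results.json', 'w', encoding='utf-8') as f:
    json.dump(rows, f, indent=2)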

Web scraping using Python and Beautiful Soup for /post-sitemap.xml/

I am trying to scrape a page, website/post-sitemap.xml, which contains all the URLs posted for a WordPress website. In the first step, I need to make a list of all the URLs present in the post sitemap. When I use requests.get and check the output, it opens all of the internal URLs as well, which is weird. My intention is to make a list of all URLs first, and then, using a loop, I will scrape the individual URLs in the next function. Below is the code I have done so far. I would need all URLs as a list as my final output, if Python gurus can help.
I have tried using requests.get and urlopen, but nothing seems to open only the base URL for /post-sitemap.xml.
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

class wordpress_ext_url_cleanup(object):

    def __init__(self, wp_url):
        self.wp_url_raw = wp_url
        self.wp_url = wp_url + '/post-sitemap.xml/'

    def identify_ext_url(self):
        html = requests.get(self.wp_url)
        print(self.wp_url)
        print(html.text)
        soup = BeautifulSoup(html.text, 'lxml')
        #print(soup.get_text())
        raw_data = soup.find_all('tr')
        print(raw_data)
        #for link in raw_data:
        #    print(link.get("href"))

def main():
    print("Inside Main Function")
    url = "http://punefirst dot com"  # (knowingly removed the . so it doesnt look spammy)
    first_call = wordpress_ext_url_cleanup(url)
    first_call.identify_ext_url()

if __name__ == '__main__':
    main()
I would need all 548 URLs present in the post sitemap as a list, which I will use in the next function for further scraping.
The document that is returned from the server is XML, transformed with XSLT into HTML form (more info here). To parse all the links from this XML, you can use this script:
import requests
from bs4 import BeautifulSoup
url = 'http://punefirst.com/post-sitemap.xml/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for loc in soup.select('url > loc'):
    print(loc.text)
Prints:
http://punefirst.com
http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/saijyoti-hospital-and-icu-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/niramaya-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/chetna-hospital-chinchwad-pune
http://punefirst.com/hospitals/hadapsar-hospitals/pbmas-h-v-desai-eye-hospital
http://punefirst.com/hospitals/punecentral-hospitals/shree-sai-prasad-hospital
http://punefirst.com/hospitals/punecentral-hospitals/sadhu-vaswani-missions-medical-complex
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/shivneri-hospital
http://punefirst.com/hospitals/punecentral-hospitals/kelkar-nursing-home
http://punefirst.com/hospitals/pcmc-hospitals/shrinam-hospital
http://punefirst.com/hospitals/pcmc-hospitals/dhanwantari-hospital-nigdi
http://punefirst.com/hospitals/punecentral-hospitals/dr-tarabai-limaye-hospital
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/satyanand-hospital-kondhwa-pune
...and so on.
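If you need the URLs as a Python list (for the next function) rather than just printed, a small variation of the same script:
import requests
from bs4 import BeautifulSoup

url = 'http://punefirst.com/post-sitemap.xml/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# collect every <loc> entry into a plain Python list
urls = [loc.text for loc in soup.select('url > loc')]
print(len(urls))   # number of URLs found in the sitemap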

"soup.prettify()" gives just URL

I'm using Python 3 and BeautifulSoup 4.
When I run the code below, it gives just the URL "www.google.com", not the XML.
I couldn't figure out what is wrong.
from bs4 import BeautifulSoup
import urllib
html = "www.google.com"
soup = BeautifulSoup(html)
print (soup.prettify())
You need to use urllib2 or a similar library to actually fetch the HTML first; at the moment you are just handing BeautifulSoup the string "www.google.com" to parse as markup.
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.google.com")
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
EDIT: Just as a side note on the choice of module. If you read the Python 2 urllib documentation, you'll find "The urlopen() function has been removed in Python 3 in favor of urllib2.urlopen()." In Python 3, urllib2 itself has in turn been merged into urllib.request, so given that you have tagged Python 3, urllib.request.urlopen() is the equivalent to use there.

Beautiful Soup to parse url to get another urls data

I need to parse a URL to get a list of URLs that link to detail pages. Then, from each detail page, I need to get all the details. I need to do it this way because the detail page URLs are not regularly incremented and change, but the event list page stays the same.
Basically:
example.com/events/
Event 1
Event 2
example.com/events/1
...some detail stuff I need
example.com/events/2
...some detail stuff I need
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
soup.prettify()
for anchor in soup.findAll('a', href=True):
    print anchor['href']
It will give you the list of URLs. Now you can iterate over those URLs and parse the data.
inner_div = soup.findAll("div", {"id": "y-shade"})
This is an example. You can go through the BeautifulSoup tutorials.
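Putting the two steps together for the events case, a rough sketch (example.com/events/ and the y-shade id are placeholders carried over from the question and the line above, not a real site):
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

base_url = 'http://example.com/events/'   # placeholder listing page
list_soup = BeautifulSoup(urllib2.urlopen(base_url).read())

# step 1: collect the detail-page links from the listing page
detail_urls = [urlparse.urljoin(base_url, a['href'])
               for a in list_soup.findAll('a', href=True)]

# step 2: visit each detail page and pull out the parts you need
for detail_url in detail_urls:
    detail_soup = BeautifulSoup(urllib2.urlopen(detail_url).read())
    print detail_url
    print detail_soup.findAll('div', {'id': 'y-shade'})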
For the next group of people that come across this: as of this post, BeautifulSoup has been upgraded to v4, as v3 is no longer being updated.
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
To use it in Python:
from bs4 import BeautifulSoup
Use urllib2 to get the page, then use Beautiful Soup to get the list of links; also try scraperwiki.com.
Edit:
Recent discovery: Using BeautifulSoup through lxml with
from lxml.html.soupparser import fromstring
is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector'), which is a life saver. Just make sure you have a good version of BeautifulSoup installed; 3.2.1 works a treat.
dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
FULL PYTHON 3 EXAMPLE
Packages
# urllib (comes with standard python distribution)
# pip3 install beautifulsoup4
Example:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')
d = BeautifulSoup(data, 'html.parser')
print(d.title.string)
The above should print out 'Wikipedia'
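And since the thread is about collecting links, the same approach gives you every href on the page, for example:
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    soup = BeautifulSoup(f.read().decode('utf-8'), 'html.parser')

# every href on the page, relative links included
links = [a.get('href') for a in soup.find_all('a', href=True)]
print(links[:10])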
