Hello there,
I was wondering if it is possible to connect to an HTTP host (for example google.com) and download the source of the webpage?
Thanks in advance.
Use urllib2 to download the page.
Google will block this request because it tries to block all robots, so add a User-Agent header to the request.
import urllib2
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request('http://www.google.com', None, headers)
response = urllib2.urlopen(req)
page = response.read()
response.close() # it's good practice to close the connection when you're done
You can also use pycurl:
import sys
import pycurl

class ContentCallback:
    def __init__(self):
        self.contents = ''

    def content_callback(self, buf):
        self.contents = self.contents + buf

t = ContentCallback()
curlObj = pycurl.Curl()
curlObj.setopt(curlObj.URL, 'http://www.google.com')
curlObj.setopt(curlObj.WRITEFUNCTION, t.content_callback)
curlObj.perform()
curlObj.close()
print t.contents
You can use the urllib2 module.
import urllib2
url = "http://somewhere.com"
page = urllib2.urlopen(url)
data = page.read()
print data
See the urllib2 documentation for more examples.
The documentation of httplib (low-level) and urllib (high-level) should get you started. Choose the one that's more suitable for you.
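For example, a minimal sketch of the low-level httplib approach (Python 2; the host and path here are only placeholders):
import httplib

# open a connection and issue a plain GET request ourselves
conn = httplib.HTTPConnection("www.google.com")
conn.request("GET", "/")
resp = conn.getresponse()
print resp.status, resp.reason   # e.g. 200 OK
page = resp.read()               # the raw HTML source
conn.close()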
Using the requests package:
# Import requests
import requests
#url
url = 'https://www.google.com/'
# Create the binary string html containing the HTML source
html = requests.get(url).content
Or with urllib:
from urllib.request import urlopen
#url
url = 'https://www.google.com/'
# Create the binary string html containing the HTML source
html = urlopen(url).read()
Here's another approach to this problem using mechanize. I found it useful for bypassing a website's robot-checking system. I commented out set_all_readonly because for some reason mechanize didn't recognize it.
import mechanize
url = 'http://www.example.com'
br = mechanize.Browser()
#br.set_all_readonly(False) # allow everything to be written to
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] # [('User-agent', 'Firefox')]
response = br.open(url)
print response.read() # the text of the page
response1 = br.response() # get the response again
print response1.read() # can apply lxml.html.fromstring()
Related
I'm trying to parse the div with class "dealer-info" from the URL below.
https://www.nissanusa.com/dealer-locator.html
I tried this:
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Normally, I would expect that to work, but I'm getting this result: HTTPError: Forbidden
I also tried this:
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers={'User-Agent':user_agent,}
request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
data = response.read() # the data you need
print(data)
That gives me all the HTML on the site, but it's pretty ugly to look at and hard to make any sense of.
I'm trying to get a structured data set of the "dealer-info" entries. I am using Python 3.6.
You might be rejected by the server in your first example because you're not pretending to be an ordinary browser. Try combining the user-agent code from your second example with the Beautiful Soup code from your first:
import urllib.request
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
text = response.read()

soup = BeautifulSoup(text, "lxml")
data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Keep in mind that if the web site is explicitly trying to keep Beautiful Soup or other non-recognized user agents out, they may take issue with you scraping their web site data. You should consult and obey https://www.nissanusa.com/robots.txt as well as any terms of use or terms of service agreements you may have agreed to.
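If you want to check that programmatically, here is a minimal sketch using the standard library's urllib.robotparser (the user-agent string passed to can_fetch is just an example):
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.nissanusa.com/robots.txt")
rp.read()

url = "https://www.nissanusa.com/dealer-locator.html"
# True only if that user agent is allowed to fetch the URL
print(rp.can_fetch("Mozilla/5.0", url))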
I'm trying to find all the information shown under "Inspect" in a browser, for example Chrome. Currently I can get the page source, but it doesn't contain everything that Inspect contains.
When I tried using
with urllib.request.urlopen(section_url) as url:
    html = url.read()
I got the following error message: "urllib.error.HTTPError: HTTP Error 403: Forbidden"
Now I'm assuming this is because the URL I'm trying to get is an HTTPS URL instead of an HTTP one, and I was wondering if there is a specific way to get that information over HTTPS, since the normal methods aren't working.
Note: I've also tried this, but it didn't show me everything:
f = requests.get(url)
print(f.text)
You need to set a User-Agent header to make the server think you're not a robot.
import urllib.request, urllib.error, urllib.parse
url = 'http://www.google.com' #Input your url
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(req)
html = response.read()
response.close()
adapted from https://stackoverflow.com/a/3949760/6622817
import csv
import requests
from bs4 import BeautifulSoup
import lxml
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

f = open('ala2009link.csv', 'r')
s = open('2009alanews.csv', 'w')

for row in csv.reader(f):
    url = row[0]
    print url
    res = requests.get(url)
    print res.content
    soup = BeautifulSoup(res.content)
    print soup
    data = soup.find_all("article", {"class": "article-wrapper news"})
    #data = soup.find_all("main", {"class": "main-content"})
    for item in data:
        title = item.find_all("h2", {"class": "article-headline"})[0].text
        s.write("%s \n" % title)
    content = soup.find_all("p")
    for main in content:
        k = main.text.encode('utf-8')
        s.write("%s \n" % k)
        #k = csv.writer(s)
        #k.writerow('%s\n' % (main))

s.close()
f.close()
This is my code to extract data from a website, but I don't know why I can't extract any data. Is an ad-blocker warning blocking my BeautifulSoup?
This is an example link: http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football
The reason no results are returned is that this website requires a User-Agent header in your request.
To fix this, add a headers parameter with a User-Agent to the requests.get() call, like so:
url = 'http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football'
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/29.0.1547.65 Chrome/29.0.1547.65 Safari/537.36',
}
res = requests.get(url, headers=headers)
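Putting that together with the parsing code from your question, a rough sketch (the CSS classes are taken from your code and may not match the site's current markup):
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/29.0.1547.65 Chrome/29.0.1547.65 Safari/537.36',
}
url = 'http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football'
res = requests.get(url, headers=headers)   # the User-Agent is what makes this succeed
soup = BeautifulSoup(res.content, 'html.parser')
for item in soup.find_all("article", {"class": "article-wrapper news"}):
    headline = item.find("h2", {"class": "article-headline"})
    if headline:
        print headline.text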
I want to extract all of the available items in the équipements section, but I can only get the first four items, and then I get '+ Plus'.
import urllib2
from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A'
req = urllib2.Request(url = url, headers = headers)
html = urllib2.urlopen(req)
bsobj = BeautifulSoup(html.read(),'lxml')
b = bsobj.findAll("div",{"class": "row amenities"})
The result in b does not contain the whole list inside the tag; the last element is '+ Plus', which looks like the following:
<span data-reactid=".mjeft4n4sg.0.0.0.0.1.8.1.0.0.$1.1.0.0">+ Plus</span></strong></a></div></div></div></div></div>]
This is because the data is filled in by React after the page loads, so if you download the page via requests you can't see that data.
Instead, you have to use the Selenium WebDriver to open the page and run all the JavaScript; then you can get access to all the data you expect.
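A minimal Selenium sketch (this assumes selenium and a matching Chrome driver are installed, and that the "row amenities" class from your code still exists on the page):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A')
# you may also need to click the "+ Plus" link here so the full list is rendered
html = driver.page_source   # the HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'lxml')
for div in soup.find_all("div", {"class": "row amenities"}):
    print(div.get_text(" ", strip=True))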
I have made a web crawler which gets all links down to the first level of a page, and from those it gets all links and text plus image links and alt text. Here is the whole code:
import urllib
import re
import time
from threading import Thread

import MySQLdb
import mechanize
import readability
from bs4 import BeautifulSoup
from readability.readability import Document
import urlparse

url = ["http://sparkbrowser.com"]

i = 0
while i < len(url):
    counterArray = [0]
    levelLinks = []
    linkText = ["homepage"]
    levelLinks = []

    def scraper(root, steps):
        urls = [root]
        visited = [root]
        counter = 0
        while counter < steps:
            step_url = scrapeStep(urls)
            urls = []
            for u in step_url:
                if u not in visited:
                    urls.append(u)
                    visited.append(u)
                    counterArray.append(counter + 1)
            counter += 1
        levelLinks.append(visited)
        return visited

    def scrapeStep(root):
        result_urls = []
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
        for url in root:
            try:
                br.open(url)
                for link in br.links():
                    newurl = urlparse.urljoin(link.base_url, link.url)
                    result_urls.append(newurl)
                    #levelLinks.append(newurl)
            except:
                print "error"
        return result_urls

    scraperOut = scraper(url[i], 1)

    for sl, ca in zip(scraperOut, counterArray):
        print "\n\n", sl, " Level - ", ca, "\n"

        #Mechanize
        br = mechanize.Browser()
        page = br.open(sl)
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

        #BeautifulSoup
        htmlcontent = page.read()
        soup = BeautifulSoup(htmlcontent)

        for linkins in br.links(text_regex=re.compile('^((?!IMG).)*$')):
            newesturl = urlparse.urljoin(linkins.base_url, linkins.url)
            linkTxt = linkins.text
            print newesturl, linkTxt

        for linkwimg in soup.find_all('a', attrs={'href': re.compile("^http://")}):
            imgSource = linkwimg.find('img')
            if linkwimg.find('img', alt=True):
                imgLink = linkwimg['href']
                #imageLinks.append(imgLink)
                imgAlt = linkwimg.img['alt']
                #imageAlt.append(imgAlt)
                print imgLink, imgAlt
            elif linkwimg.find('img', alt=False):
                imgLink = linkwimg['href']
                #imageLinks.append(imgLink)
                imgAlt = ['No Alt']
                #imageAlt.append(imgAlt)
                print imgLink, imgAlt

    i += 1
Everything works great until my crawler reaches one of the Facebook links, which it can't read; instead it gives me the error
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
for line 68, which is: page = br.open(sl)
And I don't know why, because as you can see I've set up mechanize's set_handle_robots and addheaders options.
I don't know why that is, but I noticed that I'm getting that error for Facebook links, in this case facebook.com/sparkbrowser, and for Google too.
Any help or advice is welcome.
Cheers
OK, so the same problem appeared in this question:
Why is mechanize throwing a HTTP 403 error?
Sending all the request headers a normal browser would send, and accepting and sending back the cookies the server sends, should resolve the issue.
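A rough sketch of that fix with mechanize (the extra headers and the Facebook URL are just examples, not a guaranteed way past the block):
import cookielib
import mechanize

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)          # accept the server's cookies and send them back
br.set_handle_robots(False)   # ignore robots.txt
br.set_handle_refresh(False)
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-US,en;q=0.5'),
]
page = br.open("http://www.facebook.com/sparkbrowser")
print page.read()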