web scraping tutorial using python 3? - python

I am trying to learn python 3.x so that I can scrape websites. People have recommended that I use Beautiful Soup 4 or lxml.html. Could someone point me in the right direction for tutorial or examples for BeautifulSoup with python 3.x?
Thank you for your help.

I've actually just written a full guide on web scraping that includes some sample code in Python. I wrote and tested it on Python 2.7, but both of the packages that I used (requests and BeautifulSoup) are fully compatible with Python 3 according to the Wall of Shame.
Here's some code to get you started with web scraping in Python:
import sys
import requests
from bs4 import BeautifulSoup  # Beautiful Soup 4 (bs4) is the release maintained for Python 3
def scrape_google(keyword):
    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )
    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }
    # make the request to the search url, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r
    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)
    # convert the plaintext HTML markup into a DOM-like structure that we can search
    soup = BeautifulSoup(r.text, "html.parser")
    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.findAll("li", "g")
    # iterate over each of the result wrapper elements
    for result in results:
        # the main link is an <h3> element with class="r"
        result_anchor = result.find("h3", "r").find("a")
        # print out each link in the results
        print(result_anchor.contents)
if __name__ == "__main__":
    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"
    scrape_google(keyword)
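The script above builds the query string by replacing spaces with "+", which can produce a malformed URL for keywords that contain characters such as & or +. A minimal sketch of one alternative, assuming only the standard library's urllib.parse on Python 3; the helper name build_search_url is hypothetical and not part of the original answer:

```python
from urllib.parse import urlencode


def build_search_url(keyword, base="http://www.google.com/search"):
    """Return a search URL with the keyword percent-encoded."""
    # urlencode() escapes reserved characters and turns spaces into "+"
    query = urlencode({"q": keyword.strip()})
    return "{base}?{query}".format(base=base, query=query)


if __name__ == "__main__":
    print(build_search_url("web scraping"))
    print(build_search_url("C++ & more"))
```

The returned string could be passed to requests.get() in place of the hand-built url.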
If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then this article on transitioning from Python 2 to Python 3 might be helpful.

Related

Scraping Google Scholar with urllib2 instead of requests

I have the simple script below which works just fine for fetching a list of articles from Google Scholar searching for a term of interest.
import urllib
import urllib2
import requests
from bs4 import BeautifulSoup
SEARCH_SCHOLAR_HOST = "https://scholar.google.com"
SEARCH_SCHOLAR_URL = "/scholar"
def searchScholar(searchStr, limit=10):
    """Search Google Scholar for articles and publications containing terms of interest"""
    url = SEARCH_SCHOLAR_HOST + SEARCH_SCHOLAR_URL + "?q=" + urllib.quote_plus(searchStr) + "&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search"
    content = requests.get(url, verify=False).text
    page = BeautifulSoup(content, 'lxml')
    results = {}
    count = 0
    for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
        if count < limit:
            try:
                text = entry.a.text.encode("ascii", "ignore")
                url = entry.a['href']
                results[url] = text
                count += 1
            except:
                pass
    return results
queryStr = "Albert einstein"
pubs = searchScholar(queryStr, 10)
if len(pubs) == 0:
    print "No articles found"
else:
    for pub in pubs.keys():
        print pub + ' ' + pubs[pub]
However, I want to run this script as a CGI application on a remote server without console access, so I cannot install any external Python modules. (I managed to 'install' BeautifulSoup without resorting to pip or easy_install by just copying the bs4 directory to my cgi-bin directory, but this trick does not work with requests because of its large number of dependencies.)
So, my question is: is it possible to use the built-in urllib2 or httplib Python modules instead of requests to fetch the Google Scholar page and then pass it to BeautifulSoup? It should be, because I found some code here which scrapes Google Scholar using just the standard libraries plus BeautifulSoup, but it is rather convoluted. I would prefer a much simpler solution, just by adapting my script to use the standard libraries instead of requests.
Could anyone give me some help?
This code is enough to perform a simple request using urllib2:
import urllib2
def get(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/2.0 (compatible; MSIE 5.5; Windows NT)')
    return urllib2.urlopen(req).read()
If you need to do something more advanced in the future, it will take more code. What requests does is simplify usage compared with the standard libraries.
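For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea can be sketched as follows. This is only an illustration under that assumption, not the answerer's code; the User-Agent string and the helper name build_request are placeholders:

```python
import urllib.request

# Placeholder User-Agent string; substitute whatever your use case requires
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def build_request(url):
    """Create a Request carrying a browser-like User-Agent header."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", USER_AGENT)
    return req


def get(url, timeout=10):
    """Fetch the URL and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

The bytes returned by get(url) can be handed straight to BeautifulSoup, for example BeautifulSoup(get(url), 'lxml') as in the question's script.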

How to web scrape in an .jsf site with cookies and javascript "javax.faces.ViewState" CDATA using python requests and bs4 (BeautifulSoup) modules? [duplicate]

This question already has an answer here:
How to programmatically send POST request to JSF page without using HTML form?
I would like to automate the extraction of data from this site:
http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf
Explanation of the steps to follow to extract the data that I want:
Starting at the URL above, click "Séries Históricas". You should see a page with a form with some inputs. In my case I only need to enter the station code in the "Código da Estação" input. Suppose that the station code is 938001; enter it and hit "Consultar". Now you should see a lot of checkboxes. The one below "Selecionar" checks all of the checkboxes. Supposing that I don't want all kinds of data, only rain rate and flow rate, I check only the checkbox below "Chuva" and the one below "Vazão". Next, choose the type of file to download: "Arquivo Texto (.TXT)" is the .txt format. Then generate the file by clicking "Gerar Arquivo". After that the file can be downloaded by clicking "Baixar Arquivo".
Note: the site now is in version v1.0.0.12, it may be different in the future.
I have a list of station codes. Imagine how bad it would be to do these operations more than 1000 times! I want to automate this.
Many people in Brazil have been trying to automate the extraction of data from this web site. Some that I found:
Really old one: https://www.youtube.com/watch?v=IWCrC0MlasQ
Others:
https://pt.stackoverflow.com/questions/60124/gerar-e-baixar-links-programaticamente/86150#86150
https://pt.stackoverflow.com/questions/282111/r-download-de-dados-do-portal-hidroweb
An earlier attempt that I found, which does not work either because the site has changed: https://github.com/duartejr/pyHidroWeb
So a lot of people need this, and none of the above solutions work anymore because of updates to the site.
I do not want to use selenium: it is slow compared with a solution that uses the requests library, and it needs an interface.
My attempt:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
from urllib import parse
URL = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'
s = requests.Session()
r = s.get(URL)
JSESSIONID = s.cookies['JSESSIONID']
soup = BeautifulSoup(r.content, "html.parser")
javax_faces_ViewState = soup.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
d = {}
d['menuLateral:menuForm'] = 'menuLateral:menuForm'
d['javax.faces.ViewState'] = javax_faces_ViewState
d['menuLateral:menuForm:menuSection:j_idt68:link'] = 'menuLateral:menuForm:menuSection:j_idt68:link'
h = {}
h['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
h['Accept-Encoding'] = 'gzip, deflate'
h['Accept-Language'] = 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
h['Cache-Control'] = 'max-age=0'
h['Connection'] = 'keep-alive'
h['Content-Length'] = '218'
h['Content-Type'] = 'application/x-www-form-urlencoded'
h['Cookie'] = '_ga=GA1.3.4824711.1520011013; JSESSIONID={}; _gid=GA1.3.743342153.1522450617'.format(JSESSIONID)
h['Host'] = 'www.snirh.gov.br'
h['Origin'] = 'http://www.snirh.gov.br'
h['Referer'] = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'
h['Upgrade-Insecure-Requests'] = '1'
h['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
URL2 = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'
post_response = s.post(URL2, headers=h, data=d)
soup = BeautifulSoup(post_response.text, "html.parser")
javax_faces_ViewState = soup.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
def f_headers(JSESSIONID):
    headers = {}
    headers['Accept'] = '*/*'
    headers['Accept-Encoding'] = 'gzip, deflate'
    headers['Accept-Language'] = 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
    headers['Connection'] = 'keep-alive'
    headers['Content-Length'] = '672'
    headers['Content-type'] = 'application/x-www-form-urlencoded;charset=UTF-8'
    headers['Cookie'] = '_ga=GA1.3.4824711.1520011013; JSESSIONID=' + str(JSESSIONID)
    headers['Faces-Request'] = 'partial/ajax'
    headers['Host'] = 'www.snirh.gov.br'
    headers['Origin'] = 'http://www.snirh.gov.br'
    headers['Referer'] = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    return headers
def build_data(data, n, javax_faces_ViewState):
    if n == 1:
        data['form'] = 'form'
        data['form:fsListaEstacoes:codigoEstacao'] = '938001'
        data['form:fsListaEstacoes:nomeEstacao'] = ''
        data['form:fsListaEstacoes:j_idt92'] = 'a39c3713-c0f7-4461-b2c8-c2814b3a9af1'
        data['form:fsListaEstacoes:j_idt101'] = 'a39c3713-c0f7-4461-b2c8-c2814b3a9af1'
        data['form:fsListaEstacoes:nomeResponsavel'] = ''
        data['form:fsListaEstacoes:nomeOperador'] = ''
        data['javax.faces.ViewState'] = javax_faces_ViewState
        data['javax.faces.source'] = 'form:fsListaEstacoes:bt'
        data['javax.faces.partial.event'] = 'click'
        data['javax.faces.partial.execute'] = 'form:fsListaEstacoes:bt form:fsListaEstacoes'
        data['javax.faces.partial.render'] = 'form:fsListaEstacoes:pnListaEstacoes'
        data['javax.faces.behavior.event'] = 'action'
        data['javax.faces.partial.ajax'] = 'true'
data = {}
build_data(data, 1, javax_faces_ViewState)
headers = f_headers(JSESSIONID)
post_response = s.post(URL, headers=headers, data=data)
print(post_response.text)
That prints:
<?xml version='1.0' encoding='UTF-8'?>
<partial-response><changes><update id="javax.faces.ViewState"><![CDATA[-18212878
48648292010:1675387092887841821]]></update></changes></partial-response>
Explanations of what I tried:
I used the Chrome developer tools: I pressed F12, clicked "Network", and on the website clicked "Séries Históricas" to discover what the headers and form fields are. I think I did it correctly. Is there another or better way? Some people told me about Postman and Postman Interceptor, but I don't know how to use them or whether they would help.
After that I filled the station code in the "Código da Estação" input with 938001 and hit "Consultar" to see what the headers and form fields were.
Why is the site returning XML? Does this mean that something went wrong?
This XML has a CDATA section.
What does <![CDATA[]]> in XML mean?
I understand the basic idea of CDATA, but how is it used on this site, and how should I use it when scraping? I guess that it is used to save partial information, but it is just a guess. I am lost.
I tried this for the other clicks too, and got more forms, and the responses were XML. I did not put them here because they make the code bigger and the XML is big too.
One SO answer that is not completely related to mine is this:
https://stackoverflow.com/a/8625286
This answer explains the steps to upload a file, using Java, to a JSF-generated form. This is not my case; I want to download a file using Python requests.
General questions:
When is it possible, and when is it not, to use requests + bs4 to scrape a website?
What are the steps for this kind of web scraping?
In cases like this site, is it possible to extract the information directly in one request, or do we have to mimic, step by step, what we would do by hand when filling in the form? Based on this answer it looks like the answer is no: https://stackoverflow.com/a/35665529
I have faced many difficulties and doubts. In my opinion there is a gap in the explanations available for this kind of situation.
I agree with this SO question
Python urllib2 or requests post method
on the point that most tutorials are useless for a situation like the site that I am trying to scrape. A question like this one https://stackoverflow.com/q/43793998/9577149
that is as hard as mine has no answer.
This is my first post on Stack Overflow; sorry if I made mistakes. I am not a native English speaker, so feel free to correct me.
1) It's always possible to scrape HTML websites using bs4. But getting the response you would like requires more than just Beautiful Soup.
2) My approach with bs4 is usually as follows:
response = requests.request(
    method="GET",
    url='http://yourwebsite.com',
    params=params  # (params should be a python object)
)
soup = BeautifulSoup(response.text, 'html.parser')
3) Notice that when you fill out the first form (Séries Históricas) and click submit, the page URL (or action URL) does not change. This is because an Ajax request is being made to retrieve and update the data on the current page. Since you can't see the request, it's impossible for you to mimic that.
To submit the form I would recommend looking into Mechanize (a Python library for filling in and submitting form data):
import re
from mechanize import Browser
b = Browser()
b.open("http://yourwebsite.com")
b.select_form(name="form")
b["bacia"] = ["value"]
response = b.submit() # submit the form
The URL of the last request is wrong. In the penultimate line of code s.post(URL, headers=headers, data=data) the parameter should be URL2 instead.
The cookie name, also, is now SESSIONID not JSESSIONID, but that must have been a change made since the question was asked.
You do not need to manage cookies manually like that when using requests.Session(), it will keep track of cookies for you automatically.
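On the CDATA part of the question: the XML printed there looks like a JSF partial (Ajax) response, and the CDATA section simply wraps text so that it does not need escaping. A minimal sketch of pulling the updated javax.faces.ViewState out of such a response with the standard library follows. It assumes the line break in the printed value is only wrapping, and that the update id matches exactly; whether the new value must be sent back in the next POST is an assumption to verify against the site. The helper name extract_view_state is hypothetical.

```python
import xml.etree.ElementTree as ET

# The partial response printed in the question, joined onto one line
# (assumption: the line break in the printed output was only wrapping)
SAMPLE = (
    b"<?xml version='1.0' encoding='UTF-8'?>"
    b'<partial-response><changes><update id="javax.faces.ViewState">'
    b"<![CDATA[-1821287848648292010:1675387092887841821]]>"
    b"</update></changes></partial-response>"
)


def extract_view_state(xml_bytes):
    """Return the javax.faces.ViewState value from a partial response, or None."""
    root = ET.fromstring(xml_bytes)
    for update in root.iter("update"):
        if update.get("id") == "javax.faces.ViewState":
            # The parser exposes CDATA content as ordinary element text
            return update.text
    return None


if __name__ == "__main__":
    print(extract_view_state(SAMPLE))
```

With requests, post_response.content (bytes) could be passed to this helper, and the result stored back into the form data for the following request.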

how to scrape website in which page url is not changed but the next button add data below the same url page

I have a URL:
http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant
On that page there is a "next results" button which loads another 20 data points while still showing the first dataset, without updating the URL. I wrote a script to scrape this page in Python, but it only scrapes the first 22 data points, even though after clicking the "next results" button the page shows about 40.
How can I scrape this type of website, which loads content dynamically?
My script is:
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant/"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content)
print (soup.prettify())
g_data2 = soup.find_all("a", {"class": "heading"})
for item in g_data2:
    try:
        name = item.text
        print name
    except IndexError:
        name = ''
        print "No Name found!"
If you were to solve it with requests, you need to mimic what the browser does when you click the "Load More" button: it sends an XHR request to the http://www.goudengids.be/q/ajax/business/results.json endpoint. Simulate that in your code while maintaining the web-scraping session. The XHR responses are in JSON format, so there is no need for BeautifulSoup in this case at all:
import requests
main_url = "http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant/"
xhr_url = "http://www.goudengids.be/q/ajax/business/results.json"
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    # visit main URL
    session.get(main_url)
    # load more listings - follow the pagination
    page = 1
    listings = []
    while True:
        params = {
            "input": "restaurant Provincie Antwerpen",
            "what": "restaurant",
            "where": "Provincie Antwerpen",
            "type": "DOUBLE",
            "resultlisttype": "A_AND_B",
            "page": str(page),
            "offset": "2",
            "excludelistingids": "nl_BE_YP_FREE_11336647_0000_1746702_6165_20130000, nl_BE_YP_PAID_11336647_0000_1746702_7575_20139729427, nl_BE_YP_PAID_720348_0000_187688_7575_20139392980",
            "context": "SRP * A_LIST"
        }
        # use the session (not requests.get) so cookies from the first visit are sent
        response = session.get(xhr_url, params=params, headers={
            "X-Requested-With": "XMLHttpRequest",
            "Referer": main_url
        })
        data = response.json()
        # collect listing names in a list (for example purposes)
        listings.extend([item["bn"] for item in data["overallResult"]["searchResults"]])
        page += 1
        # TODO: figure out exit condition for the while True loop
    print(listings)
I've left an important TODO for you - figure out an exit condition - when to stop collecting listings.
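One possible shape for that exit condition, sketched without any network access: stop as soon as a page comes back empty, with a hard cap as a safety net. Both the empty-page behaviour of the endpoint and the names collect_listings, fetch_page and max_pages are assumptions for illustration, not something confirmed against the site.

```python
def collect_listings(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page has no results."""
    listings = []
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        if not results:
            # Assumed exit condition: an empty page means we ran out of data
            break
        listings.extend(results)
    return listings


if __name__ == "__main__":
    # Stand-in for the real XHR call: three pages of fake data, then nothing
    fake_pages = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}
    print(collect_listings(lambda page: fake_pages.get(page, [])))
```

In the script above, fetch_page would wrap the XHR request for a given page number and return the list of item["bn"] values from data["overallResult"]["searchResults"].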
Instead of focusing on scraping HTML I think you should look at the JSON that is retrieved via AJAX. I think the JSON is less likely to be changed in the future as opposed to the page's markup. And on top of that, it's way easier to traverse a JSON structure than it is to scrape a DOM.
For instance, when you load the page you provided it hits a url to get JSON at http://www.goudengids.be/q/ajax/business/results.json.
Then it provides some url parameters to query the businesses. I think you should look more into using this to get your data as opposed to scraping the page and simulating button clicks, and etc.
Edit:
And it looks like it's using the headers set from visiting the site initially to ensure that you have a valid session. So you may have to hit the site initially to get the cookie headers and set that for subsequent requests to get the JSON from the endpoint above. I still think this will be easier and more predictable than trying to scrape HTML.

How to get contents of frames automatically if browser does not support frames + can't access frame directly

I am trying to automatically download PDFs from URLs like this to make a library of UN resolutions.
If I use beautiful soup or mechanize to open that URL, I get "Your browser does not support frames" -- and I get the same thing if I use the copy as curl feature in chrome dev tools.
The standard advice for the "Your browser does not support frames" when using mechanize or beautiful soup is to follow the source of each individual frame and load that frame. But if I do so, I get to an error message that the page is not authorized.
How can I proceed? I guess I could try this in zombie or phantom but I would prefer to not use those tools as I am not that familiar with them.
Okay, this was an interesting task to do with requests and BeautifulSoup.
There is a set of underlying calls to un.org and daccess-ods.un.org that are important and set relevant cookies. This is why you need to maintain requests.Session() and visit several urls before getting access to the pdf.
Here's the complete code:
import re
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'
# start session
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
# get frame links
soup = BeautifulSoup(response.text)
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]
# get header
session.get(header_link, headers={'Referer': URL})
# get document html url
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text)
content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)
# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text)
# get the real document link
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print document_link
# follow the frame link with login and password first - would set the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)
# download file
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
You should probably extract separate blocks of code into functions to make it more readable and reusable.
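As one illustration of that refactoring, the repeated "find the URL= part of a meta refresh and make it absolute" step could become a small helper. This is a sketch for Python 3 (where urljoin lives in urllib.parse); the function name resolve_refresh_target and the example path in the usage note are made up for illustration.

```python
import re
from urllib.parse import urljoin  # Python 3 location of urljoin

REFRESH_URL = re.compile(r"URL=(.*)", re.IGNORECASE)


def resolve_refresh_target(content, base_url):
    """Extract the target of a meta refresh content attribute as an absolute URL."""
    match = REFRESH_URL.search(content)
    if match is None:
        return None
    # Strip whitespace and optional quotes, then make the link absolute
    target = match.group(1).strip().strip("'\"")
    return urljoin(base_url, target)
```

For example, resolve_refresh_target("0; URL=/TMP/1234.html", BASE_ACCESS_URL) would yield an absolute link under daccess-ods.un.org, replacing the two re.search/urljoin pairs in the script.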
FYI, all of this could be done more easily through a real browser with the help of selenium or Ghost.py.
Hope that helps.

Scraping and parsing Google search results using Python

I asked a question on realizing a general idea to crawl and save webpages.
Part of the original question is: how to crawl and save a lot of "About" pages from the Internet.
With some further research, I got some choices to go ahead with both on scraping and parsing (listed at the bottom).
Today, I ran into another Ruby discussion about how to scrape from Google search results. This provides a great alternative for my problem which will save all the effort on the crawling part.
The new question is: in Python, how do I scrape Google search results for a given keyword, in this case "About", and finally get the links for further parsing?
What are the best choices of methods and libraries to go ahead with (measured by how easy they are to learn and to implement)?
P.S. On this website, exactly the same thing is implemented, but it is closed and asks for money for more results. I'd prefer to do it myself if no open-source option is available, and learn more Python in the meantime.
Oh, by the way, advice on parsing the links from the search results would be nice, if any. Again, easy to learn and easy to implement. I just started learning Python. :P
Final update, problem solved. Code using xgoogle, please read note in the section below in order to make xgoogle working.
import time, random
from xgoogle.search import GoogleSearch, SearchError
f = open('a.txt','wb')
for i in range(0,2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    #Try not to annoy Google, with a random short wait
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")
print "Done"
f.close()
Note on xgoogle (below answered by Mike Pennington):
The latest version from its GitHub repository no longer works out of the box, probably due to changes in Google's search results. These two replies (a b) on the home page of the tool give a solution, and it currently still works with this tweak. But it may stop working again some other day due to changes or blocking by Google.
Resources known so far:
For scraping, Scrapy seems to be a popular choice, a webapp called ScraperWiki is very interesting, and there is another project that extracts its library for offline/local usage. Mechanize was also brought up several times in different discussions.
For parsing HTML, BeautifulSoup seems to be one of the most popular choices. And of course lxml too.
You may find xgoogle useful... much of what you seem to be asking for is there...
There is a twill lib for emulating a browser. I used it when I needed to log in with a Google email account. While it's a great tool with a great idea, it's pretty old and seems to lack support nowadays (the latest version was released in 2007).
It might be useful if you want to retrieve results that require cookie handling or authentication. twill is likely one of the best choices for that purpose.
BTW, it's based on mechanize.
As for parsing, you are right, BeautifulSoup and Scrapy are great. One of the cool things behind BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example.)
Have a look at this awesome urllib wrapper for web scraping https://github.com/mattseh/python-web/blob/master/web.py
Another option to scrape Google search results using Python is the one by ZenSERP.
I like the API-first approach which is easy to use and the JSON results are easily integrated into our solution.
Here is an example for a curl request:
curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"
And the response:
{
"q": "Pied Piper",
"domain": "google.com",
"location": "United States",
"language": "English",
"url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
"total_results": 17100000,
"auto_correct": "",
"auto_correct_type": "",
"results": []
}
An example in Python:
import requests
headers = {
    'apikey': 'APIKEY',
}
params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
To extract links from multiple pages of Google Search results you can use SerpApi. It's a paid API with a free trial.
Full example
import os
# Python package: https://pypi.org/project/google-search-results
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "about",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
pages = search.pagination()
for result in pages:
    print(f"Current page: {result['serpapi_pagination']['current']}\n")
    for organic_result in result["organic_results"]:
        print(
            f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n"
        )
Output
Current page: 12
URL: https://fi.google.com/
URL: https://www.mayoclinic.org/about-mayo-clinic
...
Current page: 18
URL: https://igem.org/About
URL: https://www.ieee.org/
URL: https://www.cancer.org/
...
Disclaimer: I work at SerpApi.
This one works well at the moment. For any search, the scraper keeps grabbing titles and their links, traversing all next pages until there is no next page left or your IP address is banned. Make sure your bs4 version is >= 4.7.0, as I've used a pseudo CSS selector within the script.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
base = "https://www.google.de"
link = "https://www.google.de/search?q={}"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
def grab_content(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    for container in soup.select("[class='g'] a[href^='http'][data-ved]:has(h3)"):
        post_title = container.select_one("h3").get_text(strip=True)
        post_link = container.get('href')
        yield post_title,post_link
    next_page = soup.select_one("a[href][id='pnnext']")
    if next_page:
        next_page_link = urljoin(base,next_page.get("href"))
        yield from grab_content(next_page_link)
if __name__ == '__main__':
    search_keyword = "python"
    qualified_link = link.format(search_keyword.replace(" ","+"))
    for item in grab_content(qualified_link):
        print(item)
This can be done using the google and beautifulsoup4 packages. Install them from the command line using the command below:
pip install google beautifulsoup4
Then run the simplified code below:
import webbrowser, googlesearch as gs
def direct(txt):
    print(f"sure, searching '{txt}'...")
    results=gs.search(txt,num=1,stop=1,pause=0)
    #num, stop denotes number of search results you want
    for link in results:
        print(link)
        webbrowser.open_new_tab(link)#to open the results in browser
direct('cheap thrills on Youtube') #this will play the song on YouTube
#(for this, keep num=1,stop=1)
TIP: Using this, you can also make a small Virtual Assistant that will open the top search result in browser for your given query(txt) in natural language.
Feel free to comment in case of difficulty while running this code:)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import urllib.request
import re
import numpy as np
count=0
query=input("query>>")
query=query.strip().split()
query="+".join(query)
html = "https://www.google.co.in/search?site=&source=hp&q="+query+"&gws_rd=ssl"
req = urllib.request.Request(html, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read(),"html.parser")
#Regex
reg=re.compile(".*&sa=")
links = []
#Parsing web urls
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    line = (reg.match(item.a['href'][7:]).group())
    links.append(line[:-4])
print(links)
This should be handy. For more, see:
https://github.com/goyal15rajat/Crawl-google-search.git
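The script above recovers the target address by slicing the href and matching up to "&sa=", which relies on the exact "/url?q=...&sa=..." layout. A sketch of an alternative that parses the query string instead, assuming the links really have that "/url?q=" form (an assumption taken from the slicing in the script, not verified against current Google markup); the helper name unwrap_google_link is hypothetical:

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_link(href):
    """Return the target of a '/url?q=<target>&sa=...' link, else href unchanged."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs also percent-decodes the value for us
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href


if __name__ == "__main__":
    print(unwrap_google_link("/url?q=https://example.com/page&sa=U&ved=abc"))
```

Inside the loop, links.append(unwrap_google_link(item.a['href'])) would replace the regex and the two slices.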
Here is a Python script using requests and BeautifulSoup to scrape Google results.
import urllib
import requests
from bs4 import BeautifulSoup
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            item = {
                "title": title,
                "link": link
            }
            results.append(item)
    print(results)
