I previously asked a question about a general approach to crawling and saving webpages.
Part of the original question was: how to crawl and save a lot of "About" pages from the Internet.
After some further research, I found several options for both scraping and parsing (listed at the bottom).
Today I ran into a Ruby discussion about how to scrape Google search results. This is a great alternative for my problem, since it saves all the effort on the crawling part.
The new question is: in Python, how do I scrape Google search results for a given keyword, in this case "About", and get the links for further parsing?
What are the best methods and libraries to use, measured by how easy they are to learn and to implement?
P.S. This website implements exactly the same thing, but it is closed-source and charges for more results. I'd prefer to do it myself if nothing open-source is available, and learn more Python along the way.
Oh, and advice on parsing the links from the search results would be nice, if any. Again, easy to learn and easy to implement. I've just started learning Python. :P
Final update: problem solved. The code below uses xgoogle; please read the note in the section below to get xgoogle working.
import time, random
from xgoogle.search import GoogleSearch, SearchError
f = open('a.txt','wb')
for i in range(0,2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    # Try not to annoy Google, with a random short wait
    time.sleep(wt)
    print 'Iteration %d: waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")
print "Done"
f.close()
Note on xgoogle (suggested below by Mike Pennington):
The latest version from its GitHub repository does not work out of the box, probably because of changes in Google's search results. These two replies (a b) on the tool's home page give a fix, and with this tweak it still works for now. It may stop working again some day if Google changes its markup or blocks the requests.
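The "random short wait" idea in the code above does not depend on xgoogle. Here is a minimal Python 3 sketch of the same loop, with the search call abstracted away. `collect_urls` and `fake_fetch` are hypothetical names and the URLs are made-up sample data; in real use `fetch_page` would wrap whatever library actually performs the search.

```python
import random
import time


def collect_urls(fetch_page, pages, min_wait=2.0, max_wait=5.0, sleep=time.sleep):
    """Call fetch_page(page) for each page, pausing a random time between calls.

    fetch_page is a hypothetical callable that returns a list of URL strings.
    """
    seen = set()
    urls = []
    for index, page in enumerate(pages):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_wait, max_wait))
        for url in fetch_page(page):
            if url not in seen:  # keep the first occurrence, drop repeats
                seen.add(url)
                urls.append(url)
    return urls


def fake_fetch(page):
    # Stand-in data so the sketch runs without touching the network.
    samples = {
        0: ["https://example.com/about", "https://example.org/about"],
        1: ["https://example.org/about", "https://example.net/about-us"],
    }
    return samples[page]


# The sleep function is injectable, so a dry run does not actually wait.
for url in collect_urls(fake_fetch, range(2), sleep=lambda seconds: None):
    print(url)
```

Passing `sleep` as a parameter is only a design convenience: it keeps the waiting policy in one place and makes the loop easy to try out without delays.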
Resources known so far:
For scraping, Scrapy seems to be a popular choice. A web app called ScraperWiki is very interesting, and there is another project that extracts its library for offline/local usage. Mechanize was also brought up several times in different discussions.
For parsing HTML, BeautifulSoup seems to be one of the most popular choices. lxml too, of course.
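To get a feel for what "parsing the links" involves before picking one of these libraries, here is a small sketch that uses only the standard library's `html.parser`. It is not a replacement for BeautifulSoup or lxml, which cope with messy markup far better. `LinkCollector`, `extract_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that points to an absolute http(s) URL."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(("http://", "https://")):
            self.links.append(href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Invented sample page: two absolute links and one relative link.
SAMPLE_HTML = """
<div class="result"><a href="https://example.com/about">About Example</a></div>
<div class="result"><a href="/internal/settings">Settings</a></div>
<div class="result"><a href="http://example.org/about-us">About us</a></div>
"""

print(extract_links(SAMPLE_HTML))
```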
You may find xgoogle useful... much of what you seem to be asking for is there...
There is a twill library for emulating a browser. I used it when I needed to log in with a Google email account. While it's a great tool with a great idea, it's pretty old and seems to lack support nowadays (the latest version was released in 2007).
It might be useful if you want to retrieve results that require cookie handling or authentication. twill is probably one of the best choices for those purposes.
BTW, it's based on mechanize.
As for parsing, you are right, BeautifulSoup and Scrapy are great. One of the cool things behind BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example.)
Have a look at this awesome urllib wrapper for web scraping https://github.com/mattseh/python-web/blob/master/web.py
Another option to scrape Google search results using Python is the one by ZenSERP.
I like the API-first approach which is easy to use and the JSON results are easily integrated into our solution.
Here is an example for a curl request:
curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"
And the response:
{
    "q": "Pied Piper",
    "domain": "google.com",
    "location": "United States",
    "language": "English",
    "url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
    "total_results": 17100000,
    "auto_correct": "",
    "auto_correct_type": "",
    "results": []
}
A Python example:
import requests
headers = {
    'apikey': 'APIKEY',
}
params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
To extract links from multiple pages of Google Search results you can use SerpApi. It's a paid API with a free trial.
Full example
import os
# Python package: https://pypi.org/project/google-search-results
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "about",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
pages = search.pagination()
for result in pages:
    print(f"Current page: {result['serpapi_pagination']['current']}\n")
    for organic_result in result["organic_results"]:
        print(
            f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n"
        )
Output
Current page: 12
URL: https://fi.google.com/
URL: https://www.mayoclinic.org/about-mayo-clinic
...
Current page: 18
URL: https://igem.org/About
URL: https://www.ieee.org/
URL: https://www.cancer.org/
...
Disclaimer: I work at SerpApi.
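If only the links are needed for further parsing, the nested loop above can be folded into a small generator. This is a sketch over plain dictionaries that mirror only the keys used in the example (`organic_results`, `link`); the sample pages are invented, and `collect_links` is a hypothetical helper, not part of any library.

```python
def collect_links(pages):
    """Yield the 'link' of every organic result across all pages."""
    for page in pages:
        # .get() with a default: a page may come back without organic results.
        for organic_result in page.get("organic_results", []):
            link = organic_result.get("link")
            if link:
                yield link


# Made-up pages that only imitate the shape of the results shown above.
sample_pages = [
    {"organic_results": [{"title": "A", "link": "https://example.com/about"}]},
    {"organic_results": []},
    {},  # no "organic_results" key at all
    {"organic_results": [{"title": "B", "link": "https://example.org/about"}]},
]

links = list(collect_links(sample_pages))
print(links)
```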
This one works well at the moment. For any search, the scraper keeps grabbing titles and their links, traversing all the next pages until no next page is left or your IP address is banned. Make sure your bs4 version is >= 4.7.0, as I've used a pseudo CSS selector within the script.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
base = "https://www.google.de"
link = "https://www.google.de/search?q={}"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
def grab_content(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    for container in soup.select("[class='g'] a[href^='http'][data-ved]:has(h3)"):
        post_title = container.select_one("h3").get_text(strip=True)
        post_link = container.get('href')
        yield post_title,post_link
    next_page = soup.select_one("a[href][id='pnnext']")
    if next_page:
        next_page_link = urljoin(base,next_page.get("href"))
        yield from grab_content(next_page_link)
if __name__ == '__main__':
    search_keyword = "python"
    qualified_link = link.format(search_keyword.replace(" ","+"))
    for item in grab_content(qualified_link):
        print(item)
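The script above follows the "next" link by recursing into `grab_content`, so every extra page adds one more nested generator. A loop does the same job with a flat call stack and makes it easy to cap the number of pages. This sketch assumes a hypothetical `fetch(url)` callable that returns `(items, next_href)`; the fake three-page site below stands in for the real requests + BeautifulSoup code.

```python
from urllib.parse import urljoin


def iterate_pages(start_url, fetch, base, max_pages=50):
    """Yield items page by page, following 'next' links with a loop.

    fetch(url) is a hypothetical callable returning (items, next_href),
    where next_href is a possibly relative link, or None on the last page.
    """
    url = start_url
    for _ in range(max_pages):  # hard cap so a bad 'next' link cannot loop forever
        items, next_href = fetch(url)
        yield from items
        if not next_href:
            break
        # urljoin turns "/search?..." into "https://host/search?..."
        url = urljoin(base, next_href)


# Fake site: each URL maps to (items on that page, relative link to the next page).
FAKE_SITE = {
    "https://www.google.de/search?q=python": (["r1", "r2"], "/search?q=python&start=10"),
    "https://www.google.de/search?q=python&start=10": (["r3"], "/search?q=python&start=20"),
    "https://www.google.de/search?q=python&start=20": (["r4"], None),
}

results = list(
    iterate_pages(
        "https://www.google.de/search?q=python",
        FAKE_SITE.__getitem__,
        "https://www.google.de",
    )
)
print(results)
```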
This can be done using the google and beautifulsoup4 packages. Install them from the command line with the command below:
pip install google beautifulsoup4
Then run the simplified code below:
import webbrowser, googlesearch as gs
def direct(txt):
    print(f"sure, searching '{txt}'...")
    results=gs.search(txt,num=1,stop=1,pause=0)
    #num and stop set the number of search results you want
    for link in results:
        print(link)
        webbrowser.open_new_tab(link)#to open the results in browser
direct('cheap thrills on Youtube') #this will play the song on YouTube
#(for this, keep num=1,stop=1)
TIP: Using this, you can also make a small Virtual Assistant that will open the top search result in browser for your given query(txt) in natural language.
Feel free to comment in case of difficulty while running this code:)
import re
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
query=input("query>>")
query=query.strip().split()
query="+".join(query)
html = "https://www.google.co.in/search?site=&source=hp&q="+query+"&gws_rd=ssl"
req = urllib.request.Request(html, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read(),"html.parser")
#Regex: everything up to and including "&sa="
reg=re.compile(".*&sa=")
links = []
#Parsing web urls
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    # skip the leading "/url?q=" (7 characters), then cut off the trailing "&sa="
    line = (reg.match(item.a['href'][7:]).group())
    links.append(line[:-4])
print(links)
This should be handy. For more, go to:
https://github.com/goyal15rajat/Crawl-google-search.git
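The slicing and regex above rely on the href having exactly the form `/url?q=<target>&sa=...`. A less brittle way to pull out the target, sketched here with the standard library only, is to parse the query string and read the `q` parameter. `target_from_redirect` is a hypothetical helper name, and the hrefs are invented examples of that link shape.

```python
from urllib.parse import parse_qs, urlparse


def target_from_redirect(href):
    """Return the destination URL stored in the q parameter of a /url?q=... link.

    Returns None when the link has no q parameter.
    """
    query = urlparse(href).query
    values = parse_qs(query).get("q")
    return values[0] if values else None


print(target_from_redirect("/url?q=https://example.com/about&sa=U&ved=abc"))
# Percent-encoded characters inside the target are decoded by parse_qs.
print(target_from_redirect("/url?q=https://example.com/page%3Fid%3D7%26lang%3Den&sa=U"))
# A link without a q parameter yields None instead of raising.
print(target_from_redirect("/search?ie=UTF-8"))
```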
Here is a Python script using requests and BeautifulSoup to scrape Google results.
import requests
from bs4 import BeautifulSoup
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            item = {
                "title": title,
                "link": link
            }
            results.append(item)
    print(results)
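Building the query with `replace(' ', '+')` works for plain words, but characters such as `&` or `+` in the search terms pass through unescaped and change the meaning of the URL. The standard library's `urllib.parse` handles this. A short sketch, where `build_search_url` is a hypothetical helper and the `hl` parameter is used only as an example of an extra argument:

```python
from urllib.parse import quote_plus, urlencode


def build_search_url(query, base="https://www.google.com/search", **extra):
    """Build a search URL with the query string properly percent-encoded."""
    params = {"q": query, **extra}
    return f"{base}?{urlencode(params)}"


# Naive replacement leaves "+" and "&" as they are.
print("C++ & Python".replace(" ", "+"))
# quote_plus escapes them and turns spaces into "+".
print(quote_plus("C++ & Python"))
print(build_search_url("caracas arepa", hl="en"))
```

When using requests, passing a dict as `params=` achieves the same encoding without building the string by hand.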
I've written a script that looks up a few blogs and sees if a new post has been added. However, when I try to do this on Patreon I cannot find the right element with bs4.
Let's take https://www.patreon.com/cubecoders for example.
Say I want to get the number of exclusive posts under the 'Become a patron to' section, which would be 25 as of now.
This code works just fine:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)
Output: 25
Now, I want to get the title of the newest post, which would be 'New in AMP 2.0.2 - Integrated SCP/SFTP server!' as of now.
I inspect the title in my browser and see that it is contained by a span tag with the class 'sc-1di2uql-1 vYcWR'.
However, when I try to run this code I cannot fetch the element:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)
Output: None
I've already tried to fetch the element with XPath or CSS selector but couldn't do it. I thought it might be because the site is rendered first with JavaScript and thus I cannot access the elements before they are rendered correctly.
When I use Selenium to render the site first I can see the title when printing out all div tags on the page but when I want to get only the very first title I can't access it.
Do you guys know a workaround maybe?
Thanks in advance!
EDIT:
In Selenium I can do this:
from selenium import webdriver
# raw string so the backslashes in the Windows path are not treated as escapes
browser = webdriver.Chrome(r"C:\webdrivers\chromedriver.exe")
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")
def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text
print(find_text(divs))
browser.close()
Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!
When I search directly for the spans with class 'sc-1di2uql-1 vYcWR' from the start, it won't give me the result though. Could it be that the find_elements method does not look deeper inside for nested tags?
The data you see is loaded via Ajax from their API. You can use the requests module to load the data.
For example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
'Accept-Language': 'en-US,en;q=0.5',
'Referer': url
}
with requests.session() as s:
    html_text = s.get(url, headers=headers).text
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
    data = s.get(api_url, headers=headers, params={'filter[campaign_id]': campaign_id, 'filter[contains_exclusive_posts]': 'true', 'sort': '-published_at'}).json()
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))
Prints:
New in AMP 2.0.2 - Integrated SCP/SFTP server! 2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal! 2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List 2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system 2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see? 2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation! 2020-05-21T12:19:23.000+00:00
Another day, another video tutorial! 2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes! 2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux 2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist? 2020-05-04T01:14:39.000+00:00
Well that was unexpected... 2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support! 2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features 2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers! 2020-03-11T14:53:31.000+00:00
Preparing for Enterprise 2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here! 2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress! 2020-02-26T17:53:53.000+00:00
Wallpaper! 2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many. 2020-02-06T15:26:09.000+00:00
Time for a new module! 2020-01-07T13:41:17.000+00:00
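Since the original goal was only the title of the newest post, the listing above can be reduced to one value. The sketch below works on a hand-written payload that mirrors only the keys the answer reads (`data`, `attributes`, `title`, `published_at`), so it runs offline. The campaign id `12345` and the helper names are invented for illustration. It also guards the regex lookup, which in the code above would raise an `AttributeError` if the page source ever stopped containing the campaign URL.

```python
import re

CAMPAIGN_RE = re.compile(r"https://www\.patreon\.com/api/campaigns/(\d+)")


def find_campaign_id(html_text):
    """Return the campaign id found in the page source, or None if absent."""
    match = CAMPAIGN_RE.search(html_text)
    return match.group(1) if match else None


def newest_post_title(payload):
    """Return the title of the most recently published post, or None."""
    posts = [item["attributes"] for item in payload.get("data", [])]
    if not posts:
        return None
    # ISO 8601 timestamps with the same UTC offset sort correctly as strings.
    return max(posts, key=lambda post: post["published_at"])["title"]


# Hand-written sample: the id is invented, the two posts come from the output above.
sample_html = '<script>{"self": "https://www.patreon.com/api/campaigns/12345"}</script>'
sample_payload = {
    "data": [
        {"attributes": {"title": "AMP Enterprise Pricing Reveal!",
                        "published_at": "2020-07-07T10:02:02.000+00:00"}},
        {"attributes": {"title": "New in AMP 2.0.2 - Integrated SCP/SFTP server!",
                        "published_at": "2020-07-17T13:28:49.000+00:00"}},
    ]
}

print(find_campaign_id(sample_html))
print(find_campaign_id("<html></html>"))
print(newest_post_title(sample_payload))
```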
Having trouble scraping links and article names from google scholar. I'm unsure if the issue is with my code or the xpath that I'm using to retrieve the data – or possibly both?
I've already spent the past few hours trying to debug/consulting other stackoverflow queries but to no success.
import scrapy
from scrapyproj.items import ScrapyProjItem
class scholarScrape(scrapy.Spider):
    name = "scholarScraper"
    allowed_domains = "scholar.google.com"
    start_urls=["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]
    def parse(self,response):
        item = ScrapyProjItem()
        item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/@href").extract()
        item['name'] = item.xpath("//div[@class='gs_rt']/h3").extract()
        yield item
The error messages I have been receiving say: "AttributeError: xpath" so I believe that the issue lies with the path that I'm using to try and retrieve the data, but I could also be mistaken?
Adding my comment as an answer, as it solved the problem:
The issue is with scrapyproj.items.ScrapyProjItem objects: they do not have an xpath attribute. Is this an official Scrapy class? I think you meant to call xpath on response:
item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/@href").extract()
item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()
Also, the predicate in the first path expression should test the attribute with @class, and might need a set of quotes around the attribute value "gs_rt":
item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()
Apart from that, the XPath expressions are fine.
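To see what the `@` in an attribute predicate changes, here is a small offline sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in, not Scrapy's selectors: ElementTree supports only a subset of XPath (for instance, it cannot end a path with `/@href`, so the attribute is read with `.get()`), and the snippet it parses is invented, well-formed markup.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet that imitates the shape of a result list.
SAMPLE = """
<div id="results">
  <h3 class="gs_rt"><a href="https://example.com/paper-1">Paper one</a></h3>
  <h3 class="other"><a href="https://example.com/ignored">Ignored</a></h3>
  <h3 class="gs_rt"><a href="https://example.com/paper-2">Paper two</a></h3>
</div>
"""

root = ET.fromstring(SAMPLE)

# [@class='gs_rt'] tests the class attribute of each <h3>.
anchors = root.findall(".//h3[@class='gs_rt']/a")
with_at = [(a.text, a.get("href")) for a in anchors]

# Without "@", the predicate looks for a child element named <class>
# whose text is 'gs_rt', so nothing matches here.
without_at = root.findall(".//h3[class='gs_rt']/a")

print(with_at)
print(len(without_at))
```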
Alternative solution using bs4:
from bs4 import BeautifulSoup
import requests, lxml, os
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Container where all the articles are located
for article_info in soup.select('#gsc_a_b .gsc_a_t'):
    # title CSS selector
    title = article_info.select_one('.gsc_a_at').text
    # Same title CSS selector, except we're reading the "data-href" attribute
    # Note: it is a relative link, so it has to be joined with the site's base URL after extracting.
    title_link = article_info.select_one('.gsc_a_at')['data-href']
    print(f'Title: {title}\nTitle link: https://scholar.google.com{title_link}\n')
# Part of the output:
'''
Title: Automating Gödel's Ontological Proof of God's Existence with Higher-order Automated Theorem Provers.
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&citation_for_view=m8dFEawAAAAJ:-f6ydRqryjwC
'''
Alternatively, you can do the same with Google Scholar Author Articles API from SerpApi.
The main difference is that you don't have to think about finding good proxies or solving CAPTCHAs, which can still come up even if you're using Selenium. It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "9PepYk8AAAAJ",
}
search = GoogleSearch(params)
results = search.get_dict()
for article in results['articles']:
    article_title = article['title']
    article_link = article['link']
    print(f"Title: {article_title}\nLink: {article_link}\n")
# Part of the output:
'''
Title: p-GaN gate HEMTs with tungsten gate metal for high threshold voltage and low gate current
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:bUkhZ_yRbTwC
'''
Disclaimer, I work for SerpApi.
I would like to automate the extraction of data from this site:
http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf
Explanation of the steps to follow to extract the data I want:
Starting at the URL above, click "Séries Históricas". You should see a page with a form with some inputs. In my case I only need to fill in the station code in the "Código da Estação" input. Suppose the station code is 938001: enter it and hit "Consultar". Now you should see a lot of checkboxes. The one below "Selecionar" checks all of them. Supposing I don't want all kinds of data, only rain rate and flow rate, I check only the checkbox below "Chuva" and the one below "Vazão". Next, choose the type of file to download: "Arquivo Texto (.TXT)" is the .txt format. Then generate the file by clicking "Gerar Arquivo". After that, the file can be downloaded by clicking "Baixar Arquivo".
Note: the site is currently at version v1.0.0.12; it may be different in the future.
I have a list of station codes. Imagine how bad it would be to do these operations more than 1000 times! I want to automate this.
Many people in Brazil have been trying to automate the extraction of data from this web site. Some that I found:
Really old one: https://www.youtube.com/watch?v=IWCrC0MlasQ
Others:
https://pt.stackoverflow.com/questions/60124/gerar-e-baixar-links-programaticamente/86150#86150
https://pt.stackoverflow.com/questions/282111/r-download-de-dados-do-portal-hidroweb
The earliest attempt that I found, which also no longer works because the site has changed: https://github.com/duartejr/pyHidroWeb
So a lot of people need this, and none of the above solutions work any more because of updates to the site.
I do not want to use Selenium: it is slow compared with a solution that uses the requests library, and it needs a graphical interface.
My attempt:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
from urllib import parse
URL = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'
s = requests.Session()
r = s.get(URL)
JSESSIONID = s.cookies['JSESSIONID']
soup = BeautifulSoup(r.content, "html.parser")
javax_faces_ViewState = soup.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
d = {}
d['menuLateral:menuForm'] = 'menuLateral:menuForm'
d['javax.faces.ViewState'] = javax_faces_ViewState
d['menuLateral:menuForm:menuSection:j_idt68:link'] = 'menuLateral:menuForm:menuSection:j_idt68:link'
h = {}
h['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
h['Accept-Encoding'] = 'gzip, deflate'
h['Accept-Language'] = 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
h['Cache-Control'] = 'max-age=0'
h['Connection'] = 'keep-alive'
h['Content-Length'] = '218'
h['Content-Type'] = 'application/x-www-form-urlencoded'
h['Cookie'] = '_ga=GA1.3.4824711.1520011013; JSESSIONID={}; _gid=GA1.3.743342153.1522450617'.format(JSESSIONID)
h['Host'] = 'www.snirh.gov.br'
h['Origin'] = 'http://www.snirh.gov.br'
h['Referer'] = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'
h['Upgrade-Insecure-Requests'] = '1'
h['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
URL2 = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'
post_response = s.post(URL2, headers=h, data=d)
soup = BeautifulSoup(post_response.text, "html.parser")
javax_faces_ViewState = soup.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
def f_headers(JSESSIONID):
    headers = {}
    headers['Accept'] = '*/*'
    headers['Accept-Encoding'] = 'gzip, deflate'
    headers['Accept-Language'] = 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
    headers['Connection'] = 'keep-alive'
    headers['Content-Length'] = '672'
    headers['Content-type'] = 'application/x-www-form-urlencoded;charset=UTF-8'
    headers['Cookie'] = '_ga=GA1.3.4824711.1520011013; JSESSIONID=' + str(JSESSIONID)
    headers['Faces-Request'] = 'partial/ajax'
    headers['Host'] = 'www.snirh.gov.br'
    headers['Origin'] = 'http://www.snirh.gov.br'
    headers['Referer'] = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    return headers
def build_data(data, n, javax_faces_ViewState):
    if n == 1:
        data['form'] = 'form'
        data['form:fsListaEstacoes:codigoEstacao'] = '938001'
        data['form:fsListaEstacoes:nomeEstacao'] = ''
        data['form:fsListaEstacoes:j_idt92'] = 'a39c3713-c0f7-4461-b2c8-c2814b3a9af1'
        data['form:fsListaEstacoes:j_idt101'] = 'a39c3713-c0f7-4461-b2c8-c2814b3a9af1'
        data['form:fsListaEstacoes:nomeResponsavel'] = ''
        data['form:fsListaEstacoes:nomeOperador'] = ''
        data['javax.faces.ViewState'] = javax_faces_ViewState
        data['javax.faces.source'] = 'form:fsListaEstacoes:bt'
        data['javax.faces.partial.event'] = 'click'
        data['javax.faces.partial.execute'] = 'form:fsListaEstacoes:bt form:fsListaEstacoes'
        data['javax.faces.partial.render'] = 'form:fsListaEstacoes:pnListaEstacoes'
        data['javax.faces.behavior.event'] = 'action'
        data['javax.faces.partial.ajax'] = 'true'
data = {}
build_data(data, 1, javax_faces_ViewState)
headers = f_headers(JSESSIONID)
post_response = s.post(URL, headers=headers, data=data)
print(post_response.text)
That prints:
<?xml version='1.0' encoding='UTF-8'?>
<partial-response><changes><update id="javax.faces.ViewState"><![CDATA[-18212878
48648292010:1675387092887841821]]></update></changes></partial-response>
Explanation of what I tried:
I used the Chrome developer tools: I pressed F12, clicked "Network", and then clicked "Séries Históricas" on the page to discover what the headers and form fields are. I think I did it correctly. Is there another way, or a better one? Some people told me about Postman and Postman Interceptor, but I don't know how to use them or whether they would help.
After that I filled the station code in the "Código da Estação" input with 938001 and hit "Consultar" to see what were the headers and forms.
Why is the site returning XML? Does this mean that something went wrong?
This XML has a CDATA section.
What does <![CDATA[]]> in XML mean?
I understand the basic idea of CDATA, but how is it used on this site, and how should I use it in the scraper? I guess that it is used to save partial information, but that is just a guess. I am lost.
I tried this for the other clicks too, and got more forms, and the response was XML again. I did not include it here because it would make the code bigger, and the XML is big too.
One SO answer that is not completely related to mine is this:
https://stackoverflow.com/a/8625286
This answer explains the steps to upload a file, using Java, to a JSF-generated form. This is not my case: I want to download a file using Python requests.
General questions:
When is it possible, and when is it not, to use requests + bs4 to scrape a website?
What are the steps for this kind of web scraping?
In cases like this site, is it possible to go straight to the data and extract it in one request, or do we have to mimic, step by step, what we would do by hand when filling in the form? Based on this answer, it looks like the answer is no: https://stackoverflow.com/a/35665529
I have faced many difficulties and doubts. In my opinion, there is a gap in the explanations available for this kind of situation.
I agree with this SO question
Python urllib2 or requests post method
on the point that most tutorials are useless for a situation like the site I am trying to scrape. A question like this one https://stackoverflow.com/q/43793998/9577149
that is as hard as mine has no answer.
This is my first post on Stack Overflow; sorry if I made mistakes. I am not a native English speaker, so feel free to correct me.
1) It's always possible to scrape HTML websites using bs4. But getting the response you want requires more than just Beautiful Soup.
2) My approach with bs4 is usually as follows:
response = requests.request(
    method="GET",
    url='http://yourwebsite.com',
    params=params  # params should be a Python object, such as a dict
)
soup = BeautifulSoup(response.text, 'html.parser')
3) Notice that when you fill out the first form (Séries Históricas) and click submit, the page URL (or action URL) does not change. This is because an Ajax request is made to retrieve and update the data on the current page. Since that request never shows up in the address bar, it is hard to mimic without inspecting the network traffic.
To submit the form, I would recommend looking into Mechanize (a Python library for filling in and submitting form data):
import re
from mechanize import Browser
b = Browser()
b.open("http://yourwebsite.com")
b.select_form(name="form")
b["bacia"] = ["value"]
response = b.submit() # submit the form
The URL of the last request is wrong. In the penultimate line of code s.post(URL, headers=headers, data=data) the parameter should be URL2 instead.
The cookie name, also, is now SESSIONID not JSESSIONID, but that must have been a change made since the question was asked.
You do not need to manage cookies manually like that when using requests.Session(); it keeps track of cookies for you automatically.
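On the XML and CDATA questions raised above: a `<partial-response>` document is the usual reply to a JSF Ajax request (the one sent with `Faces-Request: partial/ajax`), so receiving XML is not by itself a sign of failure. CDATA is only an escaping mechanism, and an XML parser returns its content as ordinary text. The sketch below parses a response shaped like the one printed in the question and pulls out the updated ViewState, which would presumably be sent back with the next POST. The ViewState value here is made up, and `extract_view_state` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative response in the same shape as the one printed in the question;
# the ViewState value is invented.
PARTIAL_RESPONSE = """<?xml version='1.0' encoding='UTF-8'?>
<partial-response><changes>
<update id="javax.faces.ViewState"><![CDATA[1234567890:-987654321]]></update>
</changes></partial-response>"""


def extract_view_state(xml_text):
    """Return the new javax.faces.ViewState from a JSF partial response, or None."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    for update in root.iter("update"):
        if update.get("id") == "javax.faces.ViewState":
            # The CDATA wrapper is gone after parsing: .text is plain text.
            return update.text
    return None


print(extract_view_state(PARTIAL_RESPONSE))
```

A response that contains nothing but the ViewState update, as in the question, suggests the server rendered no other component for that request, which is consistent with the request having been posted to the wrong URL.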
Goal: I would like to verify, if a specific Google search has a suggested result on the right hand side and - in case of such a suggestion - scrape some information like company type / address / etc.
Approach: I wanted to use a Python scraper with Requests and BeautifulSoup4
import bs4
import requests
address='https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa'
page = requests.get(address)
soup = bs4.BeautifulSoup(page.content,'html.parser')
print (soup.prettify())
Issue:
The requested page does not include the search results (I am not sure whether some variable on the Google page is set to invisible?), only the header and footer of the Google page.
Questions:
Alternative ways to obtain the described information? Any ideas?
Once I obtained results with the described method, but the respective address was constructed differently (I remember many numbers in the Google URL, but unfortunately cannot reproduce the search address). Therefore: Is there a requirement of the Google URL so that it can be scraped via requests.get?
The best way to get information from a service like Google Places will almost always be the official API. That said, if you're dead set on scraping, it's likely that what's returned by the HTTP request is meant for a browser to render. What BeautifulSoup does is not equivalent to rendering the data it receives, so it's very likely you're just getting useless empty containers that are later filled out dynamically.
I think your question is similar to google-search-with-python-reqeusts; maybe you could get some help from that.
And I agree with LiterallyElvis: the API is a better idea than crawling the page directly.
Finally, if requests alone is not enough for this work, I recommend using PhantomJS and Selenium to imitate a browser, as Google probably uses some AJAX techniques that make a real browser and a crawler see different content.
Since I am in a country where Google is difficult to access, I couldn't reproduce your problem directly. The above is what I could think of; I hope it helps.
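One more thing worth checking in the address from the question: the search terms sit after the `#`, in the URL fragment. Fragments are handled by the client and are never sent to the server, so `requests.get` on that address asks only for the home page. A short sketch with the standard library shows how the URL splits, and builds a variant that puts the terms in the query string of `/search` (whether Google then returns parseable results still depends on headers and blocking):

```python
from urllib.parse import urlencode, urlsplit

address = "https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa"
parts = urlsplit(address)

print(parts.path)      # the resource the server is asked for
print(parts.query)     # sent to the server
print(parts.fragment)  # kept by the client, never sent

# Moving the search terms into the query string of /search:
search_url = "https://www.google.co.ve/search?" + urlencode({"q": "caracas arepa"})
print(search_url)
```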
You need to select_one() the element (container) that contains all the needed data, check whether it exists, and if so, scrape the data.
Make sure you send a user-agent header so that the request looks like a "real" user visit; otherwise your request might be blocked, or you may receive different HTML with different selectors. Check what your user-agent is.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
params = {
    "q": "caracas arepa bar google",
    "gl": "us"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
}
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
# if right side knowledge graph is present -> parse the data.
if soup.select_one(".liYKde"):
    place_name = soup.select_one(".PZPZlf.q8U8x span").text
    place_type = soup.select_one(".YhemCb+ .YhemCb").text
    place_reviews = soup.select_one(".hqzQac span").text
    place_rating = soup.select_one(".Aq14fc").text
    print(place_name, place_type, place_reviews, place_rating, sep="\n")
# output:
'''
Caracas Arepa Bar
Venezuelan restaurant
1,123 Google reviews
4.5
'''
Alternatively, you can achieve the same thing using Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.
The biggest difference is that you don't need to figure out how to parse the data, scale up the number of requests, or bypass blocks from Google and other search engines.
import json
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "caracas arepa bar place",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
print(json.dumps([results["knowledge_graph"]], indent=2))
# part of the output:
'''
[
  {
    "title": "Caracas Arepa Bar",
    "type": "Venezuelan restaurant",
    "place_id": "ChIJVcQ2ll9ZwokRwmkvsArPXyo",
    "website": "http://caracasarepabar.com/",
    "description": "Arepa specialist offering creative, low-priced renditions of the Venezuelan corn-flour staple.",
    "local_map": {
      "image": "https://www.google.com/maps/vt/data=TF2Rd51PtEnU2M3pkZHYHKdSwhMDJ_ZwRfg0vfwlDRAmv1u919sgFl8hs_lo832ziTWxCZM9BKECs6Af-TA1hh0NLjuYAzOLFA1-RBEmj-8poygymcRX2KLNVTGGZZKDerZrKW6fnkONAM4Ui-BVN8XwFrwigoqqxObPg8bqFIgeM3LPCg",
      "link": "https://www.google.com/maps/place/Caracas+Arepa+Bar/#40.7131972,-73.9574167,15z/data=!4m2!3m1!1s0x0:0x2a5fcf0ab02f69c2?sa=X&hl=en",
      "gps_coordinates": {
        "latitude": 40.7131972,
        "longitude": -73.9574167,
        "altitude": 15
      }
    } ... much more results including place images, popular times, user reviews.
  }
]
'''
Disclaimer: I work for SerpApi.
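Because the goal was first to verify whether a suggested result exists at all, indexing `results["knowledge_graph"]` directly would raise a `KeyError` for searches without a panel. A small sketch of a tolerant lookup follows. It works on plain dictionaries that mirror a few of the keys shown in the output above; `knowledge_graph_summary` is a hypothetical helper, and the choice of fields is only an example.

```python
def knowledge_graph_summary(results, fields=("title", "type", "website")):
    """Return selected knowledge-graph fields, or None when there is no panel."""
    graph = results.get("knowledge_graph")
    if not graph:
        return None
    # Missing fields become None instead of raising KeyError.
    return {field: graph.get(field) for field in fields}


# Trimmed-down dicts mirroring the keys shown in the output above.
with_panel = {
    "knowledge_graph": {
        "title": "Caracas Arepa Bar",
        "type": "Venezuelan restaurant",
        "website": "http://caracasarepabar.com/",
    }
}
without_panel = {"organic_results": []}

print(knowledge_graph_summary(with_panel))
print(knowledge_graph_summary(without_panel))
```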
I am trying to learn python 3.x so that I can scrape websites. People have recommended that I use Beautiful Soup 4 or lxml.html. Could someone point me in the right direction for tutorial or examples for BeautifulSoup with python 3.x?
Thank you for your help.
I've actually just written a full guide on web scraping that includes some sample code in Python. I wrote and tested it on Python 2.7, but both of the packages that I used (requests and BeautifulSoup) are fully compatible with Python 3 according to the Wall of Shame.
Here's some code to get you started with web scraping in Python:
import sys
import requests
from bs4 import BeautifulSoup
def scrape_google(keyword):
    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )
    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }
    # make the request to the search url, passing in the spoofed headers.
    r = requests.get(url, headers=headers)  # assign the response to a variable r
    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)
    # convert the plaintext HTML markup into a DOM-like structure that we can search
    soup = BeautifulSoup(r.text, "html.parser")
    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")
    # iterate over each of the result wrapper elements
    for result in results:
        # the main link is an <h3> element with class="r"
        result_anchor = result.find("h3", "r").find("a")
        # print out each link in the results
        print(result_anchor.contents)
if __name__ == "__main__":
    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"
    scrape_google(keyword)
If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then this article on transitioning from Python 2 to Python 3 might be helpful.