Web scraping articles from Google News - python

I am trying to scrape Google News with the gnews package. However, I don't know how to scrape older articles, for example articles from 2010.
from gnews import GNews
from newspaper import Article
import pandas as pd
import datetime
google_news = GNews(language='es', country='Argentina', period = '7d')
argentina_news = google_news.get_news('protesta clarin')
print(len(argentina_news))
This code works perfectly for recent articles, but I need older ones. I saw https://github.com/ranahaani/GNews#todo, where something like the following appears:
google_news = GNews(language='es', country='Argentina', period='7d', start_date='01-01-2015', end_date='01-01-2016', max_results=10, exclude_websites=['yahoo.com', 'cnn.com'],
proxy=proxy)
but when I try start_date I get:
TypeError: __init__() got an unexpected keyword argument 'start_date'
Can anyone help me get articles for specific dates? Thank you very much, guys!

The example code doesn't work with gnews==0.2.7, which is the latest release you can install from PyPI via pip. The documentation describes the unreleased mainline code that you can only get directly from the project's git repository.
This is confirmed by inspecting the GNews.__init__ method, which has no keyword arguments for start_date or end_date:
In [1]: import gnews
In [2]: gnews.GNews.__init__??
Signature:
gnews.GNews.__init__(
    self,
    language='en',
    country='US',
    max_results=100,
    period=None,
    exclude_websites=None,
    proxy=None,
)
Docstring: Initialize self. See help(type(self)) for accurate signature.
Source:
def __init__(self, language="en", country="US", max_results=100, period=None, exclude_websites=None, proxy=None):
    self.countries = tuple(AVAILABLE_COUNTRIES),
    self.languages = tuple(AVAILABLE_LANGUAGES),
    self._max_results = max_results
    self._language = language
    self._country = country
    self._period = period
    self._exclude_websites = exclude_websites if exclude_websites and isinstance(exclude_websites, list) else []
    self._proxy = {'http': proxy, 'https': proxy} if proxy else None
File: ~/src/news-test/.venv/lib/python3.10/site-packages/gnews/gnews.py
Type: function
If you want the start_date and end_date functionality, note that it was only added recently, so you will need to install the module from the project's git source.
# use whatever you use to uninstall any pre-existing gnews module
pip uninstall gnews
# install from the project's git main branch
pip install git+https://github.com/ranahaani/GNews.git
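To double-check that the git install actually picked up the newer code, you can inspect the signature the same way as above; start_date and end_date should now appear among the keyword arguments:
import inspect
import gnews

# should now list start_date and end_date among the parameters
print(inspect.signature(gnews.GNews.__init__))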
Now you can use the start/end functionality:
import datetime
from gnews import GNews
start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 16)
google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end)
rsp = google_news.get_news("protesta")
print(rsp)
I get this as a result:
[{'title': 'Latin Roots: The Protest Music Of South America - NPR',
'description': 'Latin Roots: The Protest Music Of South America NPR',
'published date': 'Thu, 15 Jan 2015 08:00:00 GMT',
'url': 'https://www.npr.org/sections/world-cafe/2015/01/15/377491862/latin-roots-the-protest-music-of-south-america',
'publisher': {'href': 'https://www.npr.org', 'title': 'NPR'}}]
Also note:
period is ignored if you set start_date and end_date
Their documentation shows you can pass the dates as tuples like (2015, 1, 15). This doesn't seem to work, so to be safe, pass datetime.date objects as shown above.
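To reach older coverage like the 2010 articles asked about in the question, one approach is to walk a year month by month with narrow start_date/end_date windows. A minimal sketch, assuming the git install described above (Google may still cap how many results each window returns):
import datetime
from gnews import GNews  # requires the git install described above

def month_ranges(year):
    """Yield (start, end) date pairs covering each month of a year."""
    for month in range(1, 13):
        start = datetime.date(year, month, 1)
        end = datetime.date(year + 1, 1, 1) if month == 12 else datetime.date(year, month + 1, 1)
        yield start, end

articles = []
for start, end in month_ranges(2010):
    google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end, max_results=100)
    articles.extend(google_news.get_news('protesta clarin'))
print(len(articles))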

You can also use the requests module together with lxml and XPath to get what you need, without the gnews package.
Here is the code:
import requests
from lxml.html import fromstring

url = 'https://www.google.com/search?q=google+news&&hl=es&sxsrf=ALiCzsZoYzwIP0ZR9d6LLa5U6IJ2WDo1sw%3A1660116293247&source=lnt&tbs=cdr%3A1%2Ccd_min%3A8%2F10%2F2010%2Ccd_max%3A8%2F10%2F2022&tbm=nws'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4758.87 Safari/537.36",
}
r = requests.get(url, headers=headers, timeout=30)
root = fromstring(r.text)
news = []
for i in root.xpath('//div[@class="xuvV6b BGxR7d"]'):
    item = {}
    item['title'] = i.xpath('.//div[@class="mCBkyc y355M ynAwRc MBeuO nDgy9d"]//text()')
    item['description'] = i.xpath('.//div[@class="GI74Re nDgy9d"]//text()')
    item['published date'] = i.xpath('.//div[@class="OSrXXb ZE0LJd"]//span/text()')
    item['url'] = i.xpath('.//a/@href')
    item['publisher'] = i.xpath('.//div[@class="CEMjEf NUnG9d"]//span/text()')
    news.append(item)
And here is what I get:
for i in news:
    print(i)
"""
{'published date': ['Hace 1 mes'], 'url': ['https://www.20minutos.es/noticia/5019464/0/google-news-regresa-a-espana-tras-ocho-anos-cerrado/'], 'publisher': ['20Minutos'], 'description': [u'"Google News ayuda a los lectores a encontrar noticias de fuentes \nfidedignas, desde los sitios web de noticias m\xe1s grandes del mundo hasta \nlas publicaciones...'], 'title': [u'Noticias de 20minutos en Google News: c\xf3mo seguir la \xfaltima ...']}
{'published date': ['14 jun 2022'], 'url': ['https://www.bbc.com/mundo/noticias-61803565'], 'publisher': ['BBC'], 'description': [u'C\xf3mo funciona LaMDA, el sistema de inteligencia artificial que "cobr\xf3 \nconciencia y siente" seg\xfan un ingeniero de Google. Alicia Hern\xe1ndez \n#por_puesto; BBC News...'], 'title': [u'C\xf3mo funciona LaMDA, el sistema de inteligencia artificial que "cobr\xf3 \nconciencia y siente" seg\xfan un ingeniero de Google']}
{'published date': ['24 mar 2022'], 'url': ['https://www.theguardian.com/world/2022/mar/24/russia-blocks-google-news-after-it-bans-ads-on-proukraine-invasion-content'], 'publisher': ['The Guardian'], 'description': [u'Russia has blocked Google News, accusing it of promoting \u201cinauthentic \ninformation\u201d about the invasion of Ukraine. The ban came just hours after \nGoogle...'], 'title': ['Russia blocks Google News after ad ban on content condoning Ukraine invasion']}
{'published date': ['2 feb 2021'], 'url': ['https://dircomfidencial.com/medios/google-news-showcase-que-es-y-como-funciona-el-agregador-por-el-que-los-medios-pueden-generar-ingresos-20210202-0401/'], 'publisher': ['Dircomfidencial'], 'description': [u'Google News Showcase: qu\xe9 es y c\xf3mo funciona el agregador por el que los \nmedios pueden generar ingresos. MEDIOS | 2 FEBRERO 2021 | ACTUALIZADO: 3 \nFEBRERO 2021 8...'], 'title': [u'Google News Showcase: qu\xe9 es y c\xf3mo funciona el ...']}
{'published date': ['4 nov 2021'], 'url': ['https://www.euronews.com/next/2021/11/04/google-news-returns-to-spain-after-the-country-adopts-new-eu-copyright-law'], 'publisher': ['Euronews'], 'description': ['News aggregator Google News will return to Spain following a change in \ncopyright law that allows online platforms to negotiate fees directly with \ncontent...'], 'title': ['Google News returns to Spain after the country adopts new EU copyright law']}
{'published date': ['27 may 2022'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-hit-with-fresh-uk-investigation-over-ad-tech-dominance-7938896/'], 'publisher': ['The Indian Express'], 'description': ['The Indian Express website has been rated GREEN for its credibility and \ntrustworthiness by Newsguard, a global service that rates news sources for \ntheir...'], 'title': ['Google hit with fresh UK investigation over ad tech dominance']}
{'published date': [u'Hace 1 d\xeda'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-down-outage-issues-user-error-8079170/'], 'publisher': ['The Indian Express'], 'description': ['The outage also impacted a range of other Google products such as Google \n... Join our Telegram channel (The Indian Express) for the latest news and \nupdates.'], 'title': ['Google, Google Maps and other services recover after global ...']}
{'published date': ['14 nov 2016'], 'url': ['https://www.reuters.com/article/us-alphabet-advertising-idUSKBN1392MM'], 'publisher': ['Reuters'], 'description': ["Google's move similarly does not address the issue of fake news or hoaxes \nappearing in Google search results. That happened in the last few days, \nwhen a search..."], 'title': ['Google, Facebook move to restrict ads on fake news sites']}
{'published date': ['27 sept 2021'], 'url': ['https://news.sky.com/story/googles-appeal-against-eu-record-3-8bn-fine-starts-today-as-us-cases-threaten-to-break-the-company-up-12413655'], 'publisher': ['Sky News'], 'description': ["Google's five-day appeal against the decision is being heard at European \n... told Sky News he expected there could be another appeal after the \nhearing in..."], 'title': [u"Google's appeal against EU record \xa33.8bn fine starts today, as US cases \nthreaten to break the company up"]}
{'published date': ['11 jun 2022'], 'url': ['https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/'], 'publisher': ['The Washington Post'], 'description': [u"SAN FRANCISCO \u2014 Google engineer Blake Lemoine opened his laptop to the \ninterface for LaMDA, Google's artificially intelligent chatbot generator,..."], 'title': ["The Google engineer who thinks the company's AI has come ..."]}
"""

You can consider the parsel library, an XML/HTML parser with full XPath and CSS selector support.
Using this library, I want to demonstrate how to scrape Google News with pagination. One way is to use the start URL parameter, which defaults to 0: 0 means the first page, 10 the second, and so on.
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "protesta clarin",
    "hl": "es",     # language, es -> Spanish
    "gl": "AR",     # country of the search, AR -> Argentina
    "tbm": "nws",   # google news
    "start": 0,     # results offset; 0 = first page
}
While the next-page button exists, you increment the ["start"] parameter value by 10 to access the next page; otherwise, you break out of the while loop:
if selector.css('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
Also, make sure you send a user-agent request header so the visit looks like a "real" user. The default requests user-agent is python-requests, which tells websites that the request most likely comes from a script. Check what your user-agent is.
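A quick way to see the user-agent you are actually sending is to echo your request headers back from a service such as httpbin (just an illustration; any header-echo endpoint works):
import requests

# prints 'python-requests/x.y.z' unless you override the header
print(requests.get("https://httpbin.org/headers").json()["headers"]["User-Agent"])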
Code and full example:
from parsel import Selector
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "protesta clarin",
    "hl": "es",     # language, es -> Spanish
    "gl": "AR",     # country of the search, AR -> Argentina
    "tbm": "nws",   # google news
    "start": 0,     # results offset; 0 = first page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

page_num = 0
while True:
    page_num += 1
    print(f"{page_num} page:")
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    selector = Selector(html.text)
    for result in selector.css(".WlydOe"):
        source = result.css(".NUnG9d span::text").get()
        title = result.css(".mCBkyc::text").get()
        link = result.attrib['href']
        snippet = result.css(".GI74Re::text").get()
        date = result.css(".ZE0LJd span::text").get()
        print(source, title, link, snippet, date, sep='\n', end='\n\n')
    if selector.css('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
Output:
1 page:
Clarín
Protesta de productores rurales en Salliqueló ante la llegada ...
https://www.clarin.com/politica/protesta-productores-productores-rurales-salliquelo-llegada-presidente-ministro-economia_0_d6gT1lm0hy.html
Hace clic acá y te lo volvemos a enviar. Ya la active. Cancelar. Clarín.
Para comentar nuestras notas por favor completá los siguientes datos.
hace 9 horas
Clarín
Paro docente y polémica: por qué un conflicto en el sur afecta ...
https://www.clarin.com/sociedad/paro-docente-polemica-conflicto-sur-afecta-pais_0_7Kt1kRgL8F.html
Expertos en Educación explican a Clarín los límites –o la falta de ellos–
entre la protesta y el derecho a tener clases. Y por qué todos los...
hace 2 días
Clarín
El campo hace un paro en protesta por la presión impositiva y la falta de
gasoil
https://www.clarin.com/rural/protesta-presion-impositiva-falta-gasoil-campo-hara-paro-pais_0_z4VuXzllXP.html
Se viene la Rural de Palermo: precios, horarios y actividades de la
muestra. Newsletters Clarín Cosecha de noticias. Héctor Huergo trae lo
más...
Hace 1 mes
... other results from the 1st and subsequent pages.
31 page:
Ámbito Financiero
Hamburgo se blinda frente a protestas contra el G-20 y ...
http://www.ambito.com/888830-hamburgo-se-blinda-frente-a-protestas-contra-el-g-20-y-anticipa-tolerancia-cero
Fuerzas especiales alemanas frente al centro de convenciones donde se
desarrollará la cumbre. El ministro del Interior de Alemania, Thomas de...
4 jul 2017
Chequeado
Quién es Carlos Rosenkrantz, el nuevo presidente de la Corte ...
https://chequeado.com/el-explicador/quien-es-carlos-rosenkrantz-el-proximo-presidente-de-la-corte-suprema/
... entre otras, del Grupo Clarín y de Farmacity y Pegasus, las dos últimas
... más restrictiva del derecho a la protesta y a los cortes en la vía
pública.
3 nov 2018
ámbito.com
Echeverría: graves incidentes en protesta piquetera
https://www.ambito.com/informacion-general/echeverria-graves-incidentes-protesta-piquetera-n3594273
Una protesta de grupos piqueteros frente a la municipalidad de Esteban
Echeverría desató hoy serios incidentes que concluyeron con al menos...
20 nov 2009
... other results from 31st page.
If you need a more detailed explanation about scraping Google News, have a look at the Web Scraping Google News with Python blog post.

Related

How to web scrape this page and turn it into a CSV file?

My name is João, I'm a law student from Brazil and I'm new to this. I have been trying to web scrape this page for a week to help with my undergraduate thesis and other researchers.
I want to make a CSV file with all the results from a search in a court database (this link). As you can see at the link, there are 404 results (processo) divided into 41 pages. Each result has its own HTML page with its information (much like a product in a marketplace).
The result HTML is divided into two main tables. The first one has the result's general information and will probably have the same structure in all results. The second table contains the result's files (decisions in an administrative process), which may vary in the number of files and may even have files with the same name but different dates. From this second table I just need the link to the oldest "relatório/voto" and its date, and the link to the oldest "acórdão" and its date.
The head of the CSV file should look like the following image, and each result should be a row.
I'm working with Python on Google Colab and I've tried many ways to scrape, but it did not work well. My most complete approach was adapting a product-scraping tutorial: video and corresponding code on GitHub.
My adaptation does not work in Colab; it produces neither an error message nor a CSV file. In the following code, I identified some problems in the adaptation by comparing the pages with the lesson. They are:
While extracting the result HTML from one of the 41 pages, I believe I should create a list of the extracted result links, but it extracted the text too and I'm not sure how to correct it.
While trying to extract the data from the result HTML, I fail. Whenever I tried to create a list with these fields, it only returned one result.
Beyond the tutorial, I would also like to extract data from the second table in the result HTML: the link to the oldest "relatório/voto" and its date, and the link to the oldest "acórdão" and its date. I'm not sure how and where in the code I should do that.
ADAPTED CODE
from requests_html import HTMLSession
import csv

s = HTMLSession()

# STEP 01: take the result html
def get_results_links(page):
    url = f"https://www.tce.sp.gov.br/jurisprudencia/pesquisar?txtTdPalvs=munic%C3%ADpio+pessoal+37&txtExp=temporari&txtQqUma=admiss%C3%A3o+contrata%C3%A7%C3%A3o&txtNenhPalvs=&txtNumIni=&txtNumFim=&tipoBuscaTxt=Documento&_tipoBuscaTxt=on&quantTrechos=1&processo=&exercicio=&dataAutuacaoInicio=&dataAutuacaoFim=&dataPubInicio=01%2F01%2F2021&dataPubFim=31%2F12%2F2021&_relator=1&_auditor=1&_materia=1&tipoDocumento=2&_tipoDocumento=1&acao=Executa&offset={page}"
    links = []
    r = s.get(url)
    results = r.html.find('td.small a')
    for item in results:
        links.append(item.find('a', first=True).attrs['href'])  # Problem 01: I believe it should create a list of the result links extracted from the page, but it extracted the text too.
    return links

# STEP 02: extracting relevant information from the result html extracted before
def parse_result(url):
    r = s.get(url)
    numero = r.html.find('td.small', first=True).text.strip()
    data_autuacao = r.html.find('td.small', first=True).text.strip()
    try:
        parte_1 = r.html.find('td.small', first=True).text.strip()
    except AttributeError as err:
        sku = 'Não há'
    try:
        parte_2 = r.html.find('td.small', first=True).text.strip()
    except AttributeError as err:
        parte_2 = 'Não há'
    materia = r.html.find('td.small', first=True).text.strip()
    exercicio = r.html.find('td.small', first=True).text.strip()
    objeto = r.html.find('td.small', first=True).text.strip()
    relator = r.html.find('td.small', first=True).text.strip()
    # Problem 02
    # STEP 03: creating a dict based on the objects created before
    product = {
        'Nº do Processo': numero,
        "Link do Processo": r,
        'Data de Autuação': data_autuacao,
        'Parte 1': parte_1,
        'Parte 2': parte_2,
        'Exercício': exercicio,
        'Matéria': materia,
        'Objeto': objeto,
        'Relator': relator
        #'Relatório/Voto' :
        #'Data Relatório/Voto' :
        #'Acórdão' :
        #'Data Acórdão' :
    }  # Problem 03
    return product

# STEP 04: saving as csv
def save_csv(final):
    keys = final[0].keys()
    with open('products.csv', 'w') as f:
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(final)

# STEP 05: main - joining the functions
def main():
    final = []
    for x in range(0, 410, 10):
        print('Getting Page ', x)
        urls = get_results_links(x)
        for url in urls:
            final.append(parse_result(url))
    print('Total: ', len(final))
    save_csv(final)
Thank you, @shelter, for your help so far. I have tried to make the question more specific.
There are better (albeit more complex) ways of obtaining that information, like Scrapy or an async solution. Nonetheless, here is one way of getting the information you're after, as well as saving it to a CSV file. I only scraped the first 2 pages (20 results); you can increase the range if you wish:
from bs4 import BeautifulSoup as bs
import requests
from tqdm.notebook import tqdm
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)

big_list = []
detailed_list = []

for x in tqdm(range(0, 20, 10)):
    url = f'https://www.tce.sp.gov.br/jurisprudencia/pesquisar?txtTdPalvs=munic%C3%ADpio+pessoal+37&txtExp=temporari&txtQqUma=admiss%C3%A3o+contrata%C3%A7%C3%A3o&txtNenhPalvs=&txtNumIni=&txtNumFim=&tipoBuscaTxt=Documento&_tipoBuscaTxt=on&quantTrechos=1&processo=&exercicio=&dataAutuacaoInicio=&dataAutuacaoFim=&dataPubInicio=01%2F01%2F2021&dataPubFim=31%2F12%2F2021&_relator=1&_auditor=1&_materia=1&tipoDocumento=2&_tipoDocumento=1&acao=Executa&offset={x}'
    r = s.get(url)
    urls = bs(r.text, 'html.parser').select('tr[class="borda-superior"] td:nth-of-type(2) a')
    big_list.extend(['https://www.tce.sp.gov.br/jurisprudencia/' + x.get('href') for x in urls])

for x in tqdm(big_list):
    r = s.get(x)
    soup = bs(r.text, 'html.parser')
    n_proceso = soup.select_one('td:-soup-contains("N° Processo:")').find_next('td').text if soup.select('td:-soup-contains("N° Processo:")') else None
    link_proceso = x
    autoacao = soup.select_one('td:-soup-contains("Autuação:")').find_next('td').text if soup.select('td:-soup-contains("Autuação:")') else None
    parte_1 = soup.select_one('td:-soup-contains("Parte 1:")').find_next('td').text if soup.select('td:-soup-contains("Parte 1:")') else None
    parte_2 = soup.select_one('td:-soup-contains("Parte 2:")').find_next('td').text if soup.select('td:-soup-contains("Parte 2:")') else None
    materia = soup.select_one('td:-soup-contains("Matéria:")').find_next('td').text if soup.select('td:-soup-contains("Matéria:")') else None
    exercicio = soup.select_one('td:-soup-contains("Exercício:")').find_next('td').text if soup.select('td:-soup-contains("Exercício:")') else None
    objeto = soup.select_one('td:-soup-contains("Objeto:")').find_next('td').text if soup.select('td:-soup-contains("Objeto:")') else None
    relator = soup.select_one('td:-soup-contains("Relator:")').find_next('td').text if soup.select('td:-soup-contains("Relator:")') else None
    relatorio_voto = soup.select_one('td:-soup-contains("Relatório / Voto ")').find_previous('a').get('href') if soup.select('td:-soup-contains("Relatório / Voto")') else None
    data_relatorio = soup.select_one('td:-soup-contains("Relatório / Voto ")').find_previous('td').text if soup.select('td:-soup-contains("Relatório / Voto")') else None
    acordao = soup.select_one('td:-soup-contains("Acórdão ")').find_previous('a').get('href') if soup.select('td:-soup-contains("Acórdão ")') else None
    data_acordao = soup.select_one('td:-soup-contains("Acórdão ")').find_previous('td').text if soup.select('td:-soup-contains("Acórdão ")') else None
    detailed_list.append((n_proceso, link_proceso, autoacao, parte_1, parte_2,
                          materia, exercicio, objeto, relator, relatorio_voto,
                          data_relatorio, acordao, data_acordao))

detailed_df = pd.DataFrame(detailed_list, columns=['n_proceso', 'link_proceso', 'autoacao', 'parte_1',
                                                   'parte_2', 'materia', 'exercicio', 'objeto', 'relator',
                                                   'relatorio_voto', 'data_relatorio', 'acordao', 'data_acordao'])
display(detailed_df)
detailed_df.to_csv('legal_br_stuffs.csv')
Result in terminal:
100%
2/2 [00:04<00:00, 1.78s/it]
100%
20/20 [00:07<00:00, 2.56it/s]
n_proceso link_proceso autoacao parte_1 parte_2 materia exercicio objeto relator relatorio_voto data_relatorio acordao data_acordao
0 18955/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=18955/989/20&offset=0 31/07/2020 ELVES SCIARRETTA CARREIRA PREFEITURA MUNICIPAL DE BRODOWSKI RECURSO ORDINARIO 2020 Recurso Ordinário Protocolado em anexo. EDGARD CAMARGO RODRIGUES https://www2.tce.sp.gov.br/arqs_juri/pdf/801385.pdf 20/01/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/801414.pdf 20/01/2021
1 13614/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=13614/989/18&offset=0 11/06/2018 PREFEITURA MUNICIPAL DE SERRA NEGRA RECURSO ORDINARIO 2014 Recurso Ordinário ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/797986.pdf 05/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/800941.pdf 05/02/2021
2 6269/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=6269/989/19&offset=0 19/02/2019 PREFEITURA MUNICIPAL DE TREMEMBE ADMISSAO DE PESSOAL - CONCURSO PROCESSO SELETIVO 2018 INTERESSADO: Rafael Varejão Munhos e outros. EDITAL Nº: 01/2017. CONCURSO PÚBLICO: 01/2017. None https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
3 14011/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14011/989/19&offset=0 11/06/2019 RUBENS EDUARDO DE SOUZA AROUCA PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
4 14082/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14082/989/19&offset=0 12/06/2019 PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário nos autos do TC n° 6269.989.19 - Admissão de pessoal - Concurso Público RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
5 14238/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14238/989/19&offset=0 13/06/2019 MARCELO VAQUELI PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
6 14141/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14141/989/20&offset=0 28/05/2020 PREFEITURA MUNICIPAL DE BIRIGUI CRISTIANO SALMEIRAO RECURSO ORDINARIO 2018 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
7 15371/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15371/989/19&offset=0 02/07/2019 PREFEITURA MUNICIPAL DE BIRIGUI ADMISSAO DE PESSOAL - TEMPO DETERMINADO 2018 INTERESSADOS: ADRIANA PEREIRA CRISTAL E OUTROS. PROCESSOS SELETIVOS/EDITAIS Nºs:002/2016, 004/2017, 05/2017, 06/2017,001/2018 e 002/2018. LEIS AUTORIZADORAS: Nº 5134/2009 e Nº 3946/2001. None https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
8 15388/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15388/989/20&offset=0 04/06/2020 MARIA ANGELICA MIRANDA FERNANDES RECURSO ORDINARIO 2018 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
9 12911/989/16 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=12911/989/16&offset=0 20/07/2016 MARCELO CANDIDO DE SOUZA PREFEITURA MUNICIPAL DE SUZANO RECURSO ORDINARIO 2016 Recurso Ordinário Ref. Atos de Admissão de Pessoal - Exercício 2012. objetivando o preenchimento temporário dos cargos de Médico Cardiologista 20h, Fotógrafo, Médico Clínico Geral 20lt, Médico Gineco DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814599.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814741.pdf 27/04/2021
10 1735/002/11 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=1735/002/11&offset=10 22/11/2011 FUNDACAO DE APOIO AOS HOSP VETERINARIOS DA UNESP ADMISSAO DE PESSOAL - TEMPO DETERMINADO 2010 ADMISSAO DE PESSOAL POR TEMPO DETERMINADO COM CONCURSO/PROCESSO SELETIVO ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/800893.pdf 21/01/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/800969.pdf 21/01/2021
11 23494/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=23494/989/18&offset=10 20/11/2018 HAMILTON LUIS FOZ RECURSO ORDINARIO 2018 Recurso Ordinário DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/816918.pdf 13/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817317.pdf 13/05/2021
12 24496/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=24496/989/19&offset=10 25/11/2019 PREFEITURA MUNICIPAL DE LORENA RECURSO ORDINARIO 2017 Recurso Ordinário em face de sentença proferida nos autos de TC 00006265.989.19-4 DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814660.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814805.pdf 27/04/2021
13 17110/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=17110/989/18&offset=10 03/08/2018 JORGE ABISSAMRA PREFEITURA MUNICIPAL DE FERRAZ DE VASCONCELOS RECURSO ORDINARIO 2018 Recurso Ordinário DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814633.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814774.pdf 27/04/2021
14 24043/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=24043/989/19&offset=10 18/11/2019 PREFEITURA MUNICIPAL DE IRAPURU RECURSO ORDINARIO 2018 Recurso ordinário ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817014.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817269.pdf 12/05/2021
15 2515/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=2515/989/20&offset=10 03/02/2020 PREFEITURA MUNICIPAL DE IPORANGA RECURSO ORDINARIO 2020 Recurso interposto em face da sentença proferida nos autos do TC 15791/989/19-7. ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817001.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817267.pdf 12/05/2021
16 1891/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=1891/989/20&offset=10 24/01/2020 PREFEITURA MUNICIPAL DE IPORANGA RECURSO ORDINARIO 2020 RECURSO ORDINÁRIO DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/802484.pdf 03/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/802620.pdf 03/02/2021
17 15026/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15026/989/20&offset=10 02/06/2020 DIXON RONAN CARVALHO PREFEITURA MUNICIPAL DE PAULINIA RECURSO ORDINARIO 2018 RECURSO ORDINÁRIO ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/802648.pdf 05/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/803361.pdf 05/02/2021
18 9070/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=9070/989/20&offset=10 09/03/2020 PREFEITURA MUNICIPAL DE FLORIDA PAULISTA RECURSO ORDINARIO 2017 Recurso Ordinário ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817006.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817296.pdf 12/05/2021
19 21543/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=21543/989/20&offset=10 11/09/2020 PREFEITURA MUNICIPAL DE JERIQUARA RECURSO ORDINARIO 2020 RECURSO ORDINÁRIO SIDNEY ESTANISLAU BERALDO https://www2.tce.sp.gov.br/arqs_juri/pdf/802997.pdf 13/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804511.pdf 13/02/2021
If you are going to need coding in your career, I strongly suggest you build some foundational knowledge first, and then try to write or adapt other people's code.

Scraped data using BeautifulSoup does not match the source code

I'm new to web scraping. I have seen a few tutorials on how to scrape websites using BeautifulSoup.
As an exercise I would like to extract data from a real estate website.
The specific page I want to scrape is this one: https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1
My goal is to extract a list of all the links to each real estate sale.
Afterwards, I want to loop through that list of links to extract all the data for each sale (price, location, number of bedrooms, etc.).
The first issue I'm encountering is that the data scraped using the classic BeautifulSoup approach does not match the source code of the webpage.
This is my code:
URL = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
page = requests.get(URL)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Hence, when looking for the links to each real estate sale, which are located under
soup.find_all("a", class_="card__title-link")
it outputs an empty list. Indeed, these tags were not present in the HTML extracted by the code above.
Why is that? What should I do to ensure that the extracted html correctly corresponds to what is visible in the source code of the website?
Thank you :-)
The data you see is embedded within the page in JSON format. You can use this example to load it:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.find("iw-search")[":results"])

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data:
for ad in data:
    print(
        "{:<63} {:<8} {}".format(
            ad["property"]["title"],
            ad["transaction"]["sale"]["price"] or "-",
            "https://www.immoweb.be/fr/annonce/{}".format(ad["id"]),
        )
    )
Prints:
Triplex appartement met 3 slaapkamers en garage. 239000 https://www.immoweb.be/fr/annonce/9309298
Appartement 285000 https://www.immoweb.be/fr/annonce/9309895
Heel ruime, moderne, lichtrijke Duplex te koop, bij centrum 269000 https://www.immoweb.be/fr/annonce/9303797
À VENDRE PAR LANDBERGH : appartement de deux chambres à Gand 359000 https://www.immoweb.be/fr/annonce/9310300
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309278
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309251
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309264
Appartement intéressant avec agréable vue panoramique verdoy 219000 https://www.immoweb.be/fr/annonce/9309366
Projet Utopia by Godin - https://www.immoweb.be/fr/annonce/9309458
Appartement 2-ch avec vue unique! 270000 https://www.immoweb.be/fr/annonce/9309183
Residentieel wonen in Hélécine, dichtbij de natuur en de sne - https://www.immoweb.be/fr/annonce/9309241
Appartement 375000 https://www.immoweb.be/fr/annonce/9309187
DUPLEX LUMIEUX ET SPACIEUX 380000 https://www.immoweb.be/fr/annonce/9298271
SINT-PIETERS-LEEUW / Magnifique maison de ±130m² avec jardin 430000 https://www.immoweb.be/fr/annonce/9310259
PARC PARMENTIER // APP MODERNE 3CH 490000 https://www.immoweb.be/fr/annonce/9262193
BOIS DE LA CAMBRE – AV DE FRE – CLINIQUES DE L’EUROPE 575000 https://www.immoweb.be/fr/annonce/9309664
Entre Stockel et le Stade Fallon 675000 https://www.immoweb.be/fr/annonce/9310094
Maisons neuves dans un cadre verdoyant - https://www.immoweb.be/fr/annonce/6792221
Nieuwbouwproject Dockside Gardens - Gent - https://www.immoweb.be/fr/annonce/9008956
Appartement 139000 https://www.immoweb.be/fr/annonce/9187904
A VENDRE CHEZ LANDBERGH: appartements à Merelbeke Flora - https://www.immoweb.be/fr/annonce/9306877
Très beau studio avec une belle vue sur la plage et la mer! 319000 https://www.immoweb.be/fr/annonce/9306787
BEL APPARTEMENT LUMINEUX DIAMANT / PLASKY 320000 https://www.immoweb.be/fr/annonce/9264748
Un projet d'appartements neufs à proximité de Woluwé-St-Lamb - https://www.immoweb.be/fr/annonce/9308037
PLACE JOURDAN - 2 CHAMBRES 345000 https://www.immoweb.be/fr/annonce/9306953
Magnifiek appartement in de Brugse Rand - Assebroek 399000 https://www.immoweb.be/fr/annonce/9306613
Bien d'exception 415000 https://www.immoweb.be/fr/annonce/9308022
Appartement 435000 https://www.immoweb.be/fr/annonce/9307802
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307178
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307177
EDIT: Added URL column.
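Since the goal in the question is a list of links to every sale, the same trick can be repeated over the page parameter. A minimal sketch, assuming each results page embeds the same iw-search JSON (the site's markup may have changed since this answer was written):
import json
import requests
from bs4 import BeautifulSoup

base = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page={}"
links = []
for page in range(1, 4):  # first three result pages as a test; raise the bound as needed
    soup = BeautifulSoup(requests.get(base.format(page)).content, "html.parser")
    tag = soup.find("iw-search")
    if tag is None:  # embedded JSON not found - stop rather than crash
        break
    for ad in json.loads(tag[":results"]):
        links.append("https://www.immoweb.be/fr/annonce/{}".format(ad["id"]))
print(len(links), "links collected")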

How to change language in Request (GET) URL?

I am trying this code, but I am still unable to change the language of the page returned by the URL.
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.fincaraiz.com.co/apartamento-apartaestudio/arriendo/bogota/'
headers = {"Accept-Language": "en-US,en;q=0.5"}
params = dict(lang='en-US,en;q=0.5')
response = get(url, headers=headers, params=params)
print(response.text[:500])

html_soup = BeautifulSoup(response.text, 'html.parser')
titles = []
for a in html_soup.findAll('div', id='divAdverts'):
    for x in html_soup.findAll(class_='h2-grid'):
        title = x.text.replace("\r", "").replace("\n", "").strip()
        titles.append(title)
titles
Output
['Local en Itaguí - Santamaría',
'Casa en Sopó - Vereda Comuneros',
'Apartamento en Santa Marta - Bello Horizonte',
'Apartamento en Funza - Zuame',
'Casa en Bogotá - Centro Comercial Titán Plaza',
'Apartamento en Cali - Los Cristales',
'Apartamento en Itaguí - Suramerica',
'Casa en Palmira - Barrio Contiguo A Las Flores',
'Apartamento en Cali - La Hacienda',
'Casa en Bogotá - Marsella',
'Casa en Medellín - La Castellana',
'Casa en Villavicencio - Quintas De San Fernando',
'Apartamento en Santa Marta - Playa Salguero',
'Casa Campestre en Rionegro - La Mosquita',
'Casa Campestre en Jamundí - La Morada',
'Casa en Envigado - Loma De Las Brujas',
'Casa Campestre en El Retiro - Los Salados']
Does anyone know how I can change the language of the results? I have tried everything.
I am only giving an example for one particular field, the title; you can extend it to the other fields. You may face issues such as being blocked by Google for the number of concurrent requests, since this library is not an official one. Also make sure you read the note in the documentation: https://pypi.org/project/googletrans/
from requests import get
from bs4 import BeautifulSoup
from googletrans import Translator

translator = Translator()
url = 'https://www.fincaraiz.com.co/apartamento-apartaestudio/arriendo/bogota/'
headers = {"Accept-Language": "en-US,en;q=0.5"}
params = dict(lang='en-US,en;q=0.5')
response = get(url, headers=headers, params=params)

titles = []
html_soup = BeautifulSoup(response.text, 'html.parser')
for a in html_soup.findAll('div', id='divAdverts'):
    for x in html_soup.findAll(class_='h2-grid'):
        title = x.text.replace("\r", "").replace("\n", "").strip()
        titles.append(title)

english_titles = []
english_translations = translator.translate(titles)
for trans in english_translations:
    english_titles.append(trans.text)
print(english_titles)
Since you are translating from Spanish to English, you can specify the source and destination languages explicitly: translator.translate(titles, src="es", dest="en").

Scraping Yahoo Finance with Python3

I'm a complete newbie at scraping. I'm trying to scrape https://fr.finance.yahoo.com and I can't figure out what I'm doing wrong.
My goal is to scrape the index name, current level and the change (both in value and in %).
Here is the code I have used:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find("div",attrs={'data-reactid':'12'})
print(main_table)
links = main_table.find_all("li", class_=' D(ib) Bxz(bb) Bdc($seperatorColor) Mend(16px) BdEnd ')
print(links)
However, print(links) comes out empty. Could someone please assist? Any help would be highly appreciated, as I have been trying to figure this out for a few days now.
Although the better way to get all the fields is to parse and process the relevant script tag, this is one of the ways you can get them all.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com/'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, 'html.parser')

df = pd.DataFrame(columns=['Index Name', 'Current Level', 'Value', 'Percentage Change'])
for item in soup.select("[id='market-summary'] li"):
    index_name = item.select_one("a").contents[1]
    current_level = ''.join(item.select_one("a > span").text.split())
    value = ''.join(item.select_one("a")['aria-label'].split("ou")[1].split("points")[0].split())
    percentage_change = ''.join(item.select_one("a > span + span").text.split())
    df = df.append({'Index Name': index_name, 'Current Level': current_level, 'Value': value, 'Percentage Change': percentage_change}, ignore_index=True)
print(df)
The output looks like:
Index Name Current Level Value Percentage Change
0 CAC 40 4444,56 -0,88 -0,02%
1 Euro Stoxx 50 2905,47 0,49 +0,02%
2 Dow Jones 24438,63 -35,49 -0,15%
3 EUR/USD 1,0906 -0,0044 -0,40%
4 Gold future 1734,10 12,20 +0,71%
5 BTC-EUR 8443,23 161,79 +1,95%
6 CMC Crypto 200 185,66 4,42 +2,44%
7 Pétrole WTI 33,28 -0,64 -1,89%
8 DAX 11073,87 7,94 +0,07%
9 FTSE 100 5993,28 -21,97 -0,37%
10 Nasdaq 9315,26 30,38 +0,33%
11 S&P 500 2951,75 3,24 +0,11%
12 Nikkei 225 20388,16 -164,15 -0,80%
13 HANG SENG 22930,14 -1349,89 -5,56%
14 GBP/USD 1,2177 -0,0051 -0,41%
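On the "parse the relevant script tag" remark above: Yahoo pages have historically embedded their app state in a root.App.main = {...}; assignment inside a script tag. A rough sketch of pulling it out, with the caveat that neither the embedding nor the JSON layout is a public API and both may have changed, so inspect the keys before relying on any particular path:
import json
import re
import requests

html = requests.get("https://fr.finance.yahoo.com/", headers={"User-Agent": "Mozilla/5.0"}).text
# assumes the whole assignment sits on a single line, which has historically been the case
match = re.search(r"root\.App\.main\s*=\s*(\{.*\});", html)
if match:
    state = json.loads(match.group(1))
    print(list(state.keys()))  # explore the structure from here
else:
    print("root.App.main not found - the page layout has probably changed")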
I think you need to fix your element selection.
For example the following code:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')
main_table = soup.find(id="market-summary")
links = main_table.find_all("a")
for i in links:
    print(i.attrs["aria-label"])
This gives output text containing the index name, % change, change in points, and current level:
CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points
Euro Stoxx 50 a augmenté de 0,28 % ou 8,16 points pour atteindre 2 913,14 points
Dow Jones a diminué de -0,63 % ou -153,98 points pour atteindre 24 320,14 points
EUR/USD a diminué de -0,49 % ou -0,0054 points pour atteindre 1,0897 points
Gold future a augmenté de 0,88 % ou 15,10 points pour atteindre 1 737,00 points
a augmenté de 1,46 % ou 121,30 points pour atteindre 8 402,74 points
CMC Crypto 200 a augmenté de 1,60 % ou 2,90 points pour atteindre 184,14 points
Pétrole WTI a diminué de -3,95 % ou -1,34 points pour atteindre 32,58 points
DAX a augmenté de 0,29 % ou 32,27 points pour atteindre 11 098,20 points
FTSE 100 a diminué de -0,39 % ou -23,18 points pour atteindre 5 992,07 points
Nasdaq a diminué de -0,30 % ou -28,25 points pour atteindre 9 256,63 points
S&P 500 a diminué de -0,43 % ou -12,62 points pour atteindre 2 935,89 points
Nikkei 225 a diminué de -0,80 % ou -164,15 points pour atteindre 20 388,16 points
HANG SENG a diminué de -5,56 % ou -1 349,89 points pour atteindre 22 930,14 points
GBP/USD a diminué de -0,34 % ou -0,0041 points pour atteindre 1,2186 points
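If you want those sentences as structured fields rather than raw text, a regular expression over the aria-label strings is enough. A sketch built around the French wording shown above (locale-specific, so adjust if Yahoo serves you another language):
import re

pattern = re.compile(
    r"^(?P<name>.*?)\s*a (?P<direction>augmenté|diminué) de (?P<pct>-?[\d,\s]+) % "
    r"ou (?P<points>-?[\d,\s]+) points pour atteindre (?P<level>[\d,\s]+) points$"
)

label = "CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points"
m = pattern.match(label)
if m:
    print(m.groupdict())
    # {'name': 'CAC 40', 'direction': 'augmenté', 'pct': '0,37', 'points': '16,55', 'level': '4 461,99'}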
Try the following CSS selector to get all the links.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')
links = [link['href'] for link in soup.select("ul#market-summary a")]
print(links)
Output:
['/quote/^FCHI?p=^FCHI', '/quote/^STOXX50E?p=^STOXX50E', '/quote/^DJI?p=^DJI', '/quote/EURUSD=X?p=EURUSD=X', '/quote/GC=F?p=GC=F', '/quote/BTC-EUR?p=BTC-EUR', '/quote/^CMC200?p=^CMC200', '/quote/CL=F?p=CL=F', '/quote/^GDAXI?p=^GDAXI', '/quote/^FTSE?p=^FTSE', '/quote/^IXIC?p=^IXIC', '/quote/^GSPC?p=^GSPC', '/quote/^N225?p=^N225', '/quote/^HSI?p=^HSI', '/quote/GBPUSD=X?p=GBPUSD=X']
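Those href values are relative paths; if you need full URLs, you can resolve them against the site root (reusing the links list from the snippet above):
from urllib.parse import urljoin

absolute_links = [urljoin('https://fr.finance.yahoo.com', href) for href in links]
print(absolute_links[:3])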

Scraping Google Destinations

I'm preparing a tour around the world and am curious to find out what the top sights are, so I'm trying to scrape the top destinations within a certain place. I want to end up with the top places in a country and their best sights. Google Destinations was recently added as a great functionality for this.
For example, when googling Cuba Destinations, Google shows a card with destinations Havana, Varadero, Trinidad, Santiago de Cuba.
Then, when googling Havana Cuba Destinations, it shows Old Havana, Malecon, Castillo de los Tres Reyes Magos del Morro, El Capitolio.
Finally, I'll turn it into a table that looks like:
Cuba, Havana, Old Havana.
Cuba, Havana, Malecon.
Cuba, Havana, Castillo de los Tres Reyes Magos del Morro.
Cuba, Havana, El Capitolio.
Cuba, Varadero, Hicacos Peninsula.
and so on.
I have tried the API call as shown in Travel destinations API, but that does not provide the right feedback and often yields OVER_QUERY_LIMIT.
The code below returns an error:
URL = "https://www.google.nl/destination/compare?q=cuba+destinations&site=search&output=search&dest_mid=/m/0d04z6&sa=X&ved=0API_KEY"
import requests
from bs4 import BeautifulSoup
#URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
Any tips?
You will need to use something like Selenium for this; because the page makes multiple XHRs, you will not be able to get the rendered page using requests alone. First, install Selenium.
sudo pip3 install selenium
Then get a driver https://sites.google.com/a/chromium.org/chromedriver/downloads
(Depending upon your OS you may need to specify the location of your driver)
from bs4 import BeautifulSoup
from selenium import webdriver
import time
browser = webdriver.Chrome()
url = ("https://www.google.nl/destination/compare?q=cuba+destinations&site=search&output=search&dest_mid=/m/0d04z6&sa=X&ved=0API_KEY")
browser.get(url)
time.sleep (2)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, "lxml")
# Get the headings
hs = [tag.text for tag in soup.find_all('h2')]
# get the text containing divs
divs = [tag.text for tag in soup.find_all('div', {'class': False})]
# Delete surplus divs
del divs[:22]
del divs[-1:]
print(list(zip(hs,divs)))
Outputs:
[('Havana', "Cuban capital known for Old Havana's colonial architecture, live salsa music & nearby beaches."), ('Varadero', 'Major Cuban resort town on Hicacos Peninsula, with a 20km beach, a golf course & several parks.'), ('Trinidad', 'Cuban town known for Plaza Mayor, colonial architecture & plantations of Valle de los Ingenios.'), ('Santiago de Cuba', 'Cuban city known for Afro-Cuban festivals & music, plus Spanish colonial & revolutionary history.'), ('Viñales', 'Cuban town known for Viñales Valley, Casa de Caridad Botanical Gardens & nearby tobacco farms.'), ('Cienfuegos', 'Cuban coastal city, known for Tomás Terry Theater, Arco de Triunfo & Playa Rancho Luna resorts.'), ('Santa Clara', 'Cuban city home to the Che Guevara Mausoleum, Parque Vidal & ornate Teatro La Caridad.'), ('Cayo Coco', 'Cuban island known for its white-sand beaches & resorts, plus reef snorkeling & flamingos.'), ('Cayo Santa María', 'Cuban island known for Gaviotas Beach, Cayo Santa María Wildlife Refuge & Pueblo La Estrella.'), ('Cayo Largo del Sur', 'Cuban island, known for beaches like Playa Blanca & Playa Sirena, plus a sea turtle center & diving.'), ('Plaza de la Revolución', 'Che Guevara and monuments'), ('Camagüey', 'Ballet, churches, history, and beaches'), ('Holguín', 'Cuban city known for Parque Calixto García, the Hacha de Holguín axe head & Guardalavaca beaches.'), ('Cayo Guillermo', 'Cuban island with beaches like Playa del Medio & Playa Pilar, plus vast expanses of coral reef.'), ('Matanzas', 'Caves, theater, beaches, history, and rivers'), ('Baracoa', 'Beaches, rivers, and nature'), ('Centro Habana', '\xa0'), ('Playa Girón', 'Beaches, snorkeling, and museums'), ('Topes de Collantes', 'Scenic nature reserve park for hiking'), ('Guardalavaca', 'Cuban resort known for Esmeralda Beach, the Cayo Naranjo Aquarium & the Chorro de Maíta Museum.'), ('Bay of Pigs', 'Snorkeling, scuba diving, and beaches'), ('Isla de la Juventud', 'Scuba diving and beaches'), ('Zapata Swamp', 'Parks, crocodiles, birdwatching, and swamps'), ('Pinar del Río', 'History'), ('Remedios', 'Churches, beaches, and museums'), ('Bayamo', 'Wax museums, monuments, history, and music'), ('Sierra Maestra', 'Peaks with a storied political history'), ('Las Terrazas', 'Zip-lining, nature reserves, and hiking'), ('Sancti Spíritus', 'History and museums'), ('Playa Ancon', 'Beaches, snorkeling, and scuba diving'), ('Jibacoa', 'Beaches, snorkeling, and jellyfish'), ('Jardines de la Reina', 'Scuba diving, fly-fishing, and gardens'), ('Cayo Jutías', 'Beach and snorkeling'), ('Guamá, Cuba', 'Crocodiles, beaches, snorkeling, and lakes'), ('Morón', 'Crocodiles, lagoons, and beaches'), ('Las Tunas', 'Beaches, nightlife, and history'), ('Soroa', 'Waterfalls, gardens, nature, and ecotourism'), ('Guanabo', 'Beach'), ('María la Gorda', 'Scuba diving, beaches, and snorkeling'), ('Alejandro de Humboldt National Park', 'Park, protected area, and hiking'), ('Ciego de Ávila', 'Zoos and beaches'), ('Bacunayagua', '\xa0'), ('Guantánamo', 'Beaches, history, and nature'), ('Cárdenas', 'Beaches, museums, monuments, and history'), ('Canarreos Archipelago', 'Sailing and coral reefs'), ('Caibarién', 'Beaches'), ('El Nicho', 'Waterfalls, parks, and nature'), ('San Luis Valley', 'Cranes, national wildlife refuge, and elk')]
UPDATED IN RESPONSE TO COMMENT:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

browser = webdriver.Chrome()
for place in ["Cuba", "Belgium", "France"]:
    url = ("https://www.google.nl/destination/compare?site=destination&output=search")
    browser.get(url)  # you may not need to do this every time if you clear the search box
    time.sleep(2)
    element = browser.find_element_by_name('q')  # get the query box
    time.sleep(2)
    element.send_keys(place)  # populate the search box
    time.sleep(2)
    search_box = browser.find_element_by_class_name('sbsb_c')  # get the first element in the list
    search_box.click()  # click it
    time.sleep(2)
    destinations = browser.find_element_by_id('DESTINATIONS')  # Click the destinations link
    destinations.click()
    time.sleep(2)
    html_source = browser.page_source
    soup = BeautifulSoup(html_source, "lxml")
    # Get the headings
    hs = [tag.text for tag in soup.find_all('h2')]
    # get the text containing divs
    divs = [tag.text for tag in soup.find_all('div', {'class': False})]
    # Delete surplus divs
    del divs[:22]
    del divs[-1:]
    print(list(zip(hs, divs)))
browser.quit()
Try this Google Places API URL. You will get the points of interest/attractions/tourist places in (for example) New York City. You have to use the city name together with the keyword point of interest.
https://maps.googleapis.com/maps/api/place/textsearch/json?query=new+york+city+point+of+interest&language=en&key=API_KEY
These API results are the same as the results of the Google search below.
https://www.google.com/search?sclient=psy-ab&site=&source=hp&btnG=Search&q=New+York+point+of+interest
Two more little tips for you:
You can use the Python Client for Google Maps Services: https://github.com/googlemaps/google-maps-services-python
For the OVER_QUERY_LIMIT problem, make sure that you add a billing method to your Google Cloud project (with your credit card or free trial credit balance). Don't worry too much, because Google gives you a few thousand free queries each month.
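If you go the Places API route, the Python client linked above wraps the same text search endpoint. A minimal sketch, assuming you have a key with billing enabled (YOUR_API_KEY is a placeholder):
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder - substitute your own key
response = gmaps.places(query="Havana point of interest", language="en")
for place in response.get("results", []):
    print(place["name"], "-", place.get("formatted_address", ""))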
