I am trying to scrape Google News with the gnews package. However, I don't know how to scrape older articles, for example articles from 2010.
from gnews import GNews
from newspaper import Article
import pandas as pd
import datetime
google_news = GNews(language='es', country='Argentina', period = '7d')
argentina_news = google_news.get_news('protesta clarin')
print(len(argentina_news))
This code works perfectly for getting recent articles, but I need older ones. I saw https://github.com/ranahaani/GNews#todo, where something like the following appears:
google_news = GNews(language='es', country='Argentina', period='7d', start_date='01-01-2015', end_date='01-01-2016', max_results=10, exclude_websites=['yahoo.com', 'cnn.com'],
proxy=proxy)
but when I try start_date I get:
TypeError: __init__() got an unexpected keyword argument 'start_date'
Can anyone help me get articles for specific dates? Thank you very much, guys!
The example code won't work with gnews==0.2.7, which is the latest version you can install from PyPI via pip. The documentation describes the unreleased mainline code, which you can only get directly from their git repository.
This is confirmed by inspecting the GNews.__init__ method, which has no keyword arguments for start_date or end_date:
In [1]: import gnews
In [2]: gnews.GNews.__init__??
Signature:
gnews.GNews.__init__(
self,
language='en',
country='US',
max_results=100,
period=None,
exclude_websites=None,
proxy=None,
)
Docstring: Initialize self. See help(type(self)) for accurate signature.
Source:
def __init__(self, language="en", country="US", max_results=100, period=None, exclude_websites=None, proxy=None):
self.countries = tuple(AVAILABLE_COUNTRIES),
self.languages = tuple(AVAILABLE_LANGUAGES),
self._max_results = max_results
self._language = language
self._country = country
self._period = period
self._exclude_websites = exclude_websites if exclude_websites and isinstance(exclude_websites, list) else []
self._proxy = {'http': proxy, 'https': proxy} if proxy else None
File: ~/src/news-test/.venv/lib/python3.10/site-packages/gnews/gnews.py
Type: function
If you want the start_date and end_date functionality, that was only added rather recently, so you will need to install the module off their git source.
# use whatever you use to uninstall any pre-existing gnews module
pip uninstall gnews
# install from the project's git main branch
pip install git+https://github.com/ranahaani/GNews.git
Now you can use the start/end functionality:
import datetime
from gnews import GNews

start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 16)
google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end)
rsp = google_news.get_news("protesta")
print(rsp)
I get this as a result:
[{'title': 'Latin Roots: The Protest Music Of South America - NPR',
'description': 'Latin Roots: The Protest Music Of South America NPR',
'published date': 'Thu, 15 Jan 2015 08:00:00 GMT',
'url': 'https://www.npr.org/sections/world-cafe/2015/01/15/377491862/latin-roots-the-protest-music-of-south-america',
'publisher': {'href': 'https://www.npr.org', 'title': 'NPR'}}]
Also note:
period is ignored if you set start_date and end_date
Their documentation shows that you can pass the dates as tuples like (2015, 1, 15). This doesn't seem to work; to be safe, pass datetime objects.
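Putting that together for the original question (articles from 2010), here is a minimal sketch, assuming you have installed gnews from the git main branch as shown above:

import datetime
from gnews import GNews

# period is omitted on purpose: it's ignored when start_date/end_date are set
google_news = GNews(
    language='es',
    country='Argentina',
    start_date=datetime.date(2010, 1, 1),    # pass datetime objects, not tuples
    end_date=datetime.date(2010, 12, 31),
    max_results=10,
)
print(google_news.get_news('protesta clarin'))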
You can also use the Python requests module and XPath (via lxml) to get what you need without the gnews package.
Here is a snippet of the code:
import requests
from lxml.html import fromstring

url = 'https://www.google.com/search?q=google+news&&hl=es&sxsrf=ALiCzsZoYzwIP0ZR9d6LLa5U6IJ2WDo1sw%3A1660116293247&source=lnt&tbs=cdr%3A1%2Ccd_min%3A8%2F10%2F2010%2Ccd_max%3A8%2F10%2F2022&tbm=nws'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4758.87 Safari/537.36",
}

r = requests.get(url, headers=headers, timeout=30)
root = fromstring(r.text)

news = []
for i in root.xpath('//div[@class="xuvV6b BGxR7d"]'):
    item = {}
    item['title'] = i.xpath('.//div[@class="mCBkyc y355M ynAwRc MBeuO nDgy9d"]//text()')
    item['description'] = i.xpath('.//div[@class="GI74Re nDgy9d"]//text()')
    item['published date'] = i.xpath('.//div[@class="OSrXXb ZE0LJd"]//span/text()')
    item['url'] = i.xpath('.//a/@href')
    item['publisher'] = i.xpath('.//div[@class="CEMjEf NUnG9d"]//span/text()')
    news.append(item)
And here is what I get:
for i in news:
    print(i)
"""
{'published date': ['Hace 1 mes'], 'url': ['https://www.20minutos.es/noticia/5019464/0/google-news-regresa-a-espana-tras-ocho-anos-cerrado/'], 'publisher': ['20Minutos'], 'description': [u'"Google News ayuda a los lectores a encontrar noticias de fuentes \nfidedignas, desde los sitios web de noticias m\xe1s grandes del mundo hasta \nlas publicaciones...'], 'title': [u'Noticias de 20minutos en Google News: c\xf3mo seguir la \xfaltima ...']}
{'published date': ['14 jun 2022'], 'url': ['https://www.bbc.com/mundo/noticias-61803565'], 'publisher': ['BBC'], 'description': [u'C\xf3mo funciona LaMDA, el sistema de inteligencia artificial que "cobr\xf3 \nconciencia y siente" seg\xfan un ingeniero de Google. Alicia Hern\xe1ndez \n#por_puesto; BBC News...'], 'title': [u'C\xf3mo funciona LaMDA, el sistema de inteligencia artificial que "cobr\xf3 \nconciencia y siente" seg\xfan un ingeniero de Google']}
{'published date': ['24 mar 2022'], 'url': ['https://www.theguardian.com/world/2022/mar/24/russia-blocks-google-news-after-it-bans-ads-on-proukraine-invasion-content'], 'publisher': ['The Guardian'], 'description': [u'Russia has blocked Google News, accusing it of promoting \u201cinauthentic \ninformation\u201d about the invasion of Ukraine. The ban came just hours after \nGoogle...'], 'title': ['Russia blocks Google News after ad ban on content condoning Ukraine invasion']}
{'published date': ['2 feb 2021'], 'url': ['https://dircomfidencial.com/medios/google-news-showcase-que-es-y-como-funciona-el-agregador-por-el-que-los-medios-pueden-generar-ingresos-20210202-0401/'], 'publisher': ['Dircomfidencial'], 'description': [u'Google News Showcase: qu\xe9 es y c\xf3mo funciona el agregador por el que los \nmedios pueden generar ingresos. MEDIOS | 2 FEBRERO 2021 | ACTUALIZADO: 3 \nFEBRERO 2021 8...'], 'title': [u'Google News Showcase: qu\xe9 es y c\xf3mo funciona el ...']}
{'published date': ['4 nov 2021'], 'url': ['https://www.euronews.com/next/2021/11/04/google-news-returns-to-spain-after-the-country-adopts-new-eu-copyright-law'], 'publisher': ['Euronews'], 'description': ['News aggregator Google News will return to Spain following a change in \ncopyright law that allows online platforms to negotiate fees directly with \ncontent...'], 'title': ['Google News returns to Spain after the country adopts new EU copyright law']}
{'published date': ['27 may 2022'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-hit-with-fresh-uk-investigation-over-ad-tech-dominance-7938896/'], 'publisher': ['The Indian Express'], 'description': ['The Indian Express website has been rated GREEN for its credibility and \ntrustworthiness by Newsguard, a global service that rates news sources for \ntheir...'], 'title': ['Google hit with fresh UK investigation over ad tech dominance']}
{'published date': [u'Hace 1 d\xeda'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-down-outage-issues-user-error-8079170/'], 'publisher': ['The Indian Express'], 'description': ['The outage also impacted a range of other Google products such as Google \n... Join our Telegram channel (The Indian Express) for the latest news and \nupdates.'], 'title': ['Google, Google Maps and other services recover after global ...']}
{'published date': ['14 nov 2016'], 'url': ['https://www.reuters.com/article/us-alphabet-advertising-idUSKBN1392MM'], 'publisher': ['Reuters'], 'description': ["Google's move similarly does not address the issue of fake news or hoaxes \nappearing in Google search results. That happened in the last few days, \nwhen a search..."], 'title': ['Google, Facebook move to restrict ads on fake news sites']}
{'published date': ['27 sept 2021'], 'url': ['https://news.sky.com/story/googles-appeal-against-eu-record-3-8bn-fine-starts-today-as-us-cases-threaten-to-break-the-company-up-12413655'], 'publisher': ['Sky News'], 'description': ["Google's five-day appeal against the decision is being heard at European \n... told Sky News he expected there could be another appeal after the \nhearing in..."], 'title': [u"Google's appeal against EU record \xa33.8bn fine starts today, as US cases \nthreaten to break the company up"]}
{'published date': ['11 jun 2022'], 'url': ['https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/'], 'publisher': ['The Washington Post'], 'description': [u"SAN FRANCISCO \u2014 Google engineer Blake Lemoine opened his laptop to the \ninterface for LaMDA, Google's artificially intelligent chatbot generator,..."], 'title': ["The Google engineer who thinks the company's AI has come ..."]}
"""
You can consider the parsel library, an XML/HTML parser with full XPath and CSS selector support.
Using this library, I want to demonstrate how to scrape Google News using pagination. One of the ways is to use the start URL parameter, which is 0 by default. 0 means the first page, 10 is the second, and so on.
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "protesta clarin",
    "hl": "es",    # language, es -> Spanish
    "gl": "AR",    # country of the search, AR -> Argentina
    "tbm": "nws",  # google news
    "start": 0,    # page offset, 0 is the first page
}
While the next button exists, you need to increment the ["start"] parameter value by 10 to access the next page; otherwise, break out of the while loop:
if selector.css('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
Also, make sure you send a user-agent request header to act like a "real" user visit. The default requests user-agent is python-requests, and websites can tell that such a request most likely comes from a script. Check what your user-agent is.
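As a quick sanity check, you can see which user-agent you are actually sending by hitting a header-echoing service; httpbin below is just one example of such a service:

import requests

# httpbin echoes your request headers back as JSON
print(requests.get("https://httpbin.org/headers").json()["headers"]["User-Agent"])
# without a custom header this prints something like "python-requests/2.28.1"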
Code and full example:
from parsel import Selector
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "protesta clarin",
    "hl": "es",    # language, es -> Spanish
    "gl": "AR",    # country of the search, AR -> Argentina
    "tbm": "nws",  # google news
    "start": 0,    # page offset, 0 is the first page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

page_num = 0
while True:
    page_num += 1
    print(f"{page_num} page:")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    selector = Selector(html.text)

    for result in selector.css(".WlydOe"):
        source = result.css(".NUnG9d span::text").get()
        title = result.css(".mCBkyc::text").get()
        link = result.attrib['href']
        snippet = result.css(".GI74Re::text").get()
        date = result.css(".ZE0LJd span::text").get()

        print(source, title, link, snippet, date, sep='\n', end='\n\n')

    if selector.css('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
Output:
1 page:
Clarín
Protesta de productores rurales en Salliqueló ante la llegada ...
https://www.clarin.com/politica/protesta-productores-productores-rurales-salliquelo-llegada-presidente-ministro-economia_0_d6gT1lm0hy.html
Hace clic acá y te lo volvemos a enviar. Ya la active. Cancelar. Clarín.
Para comentar nuestras notas por favor completá los siguientes datos.
hace 9 horas
Clarín
Paro docente y polémica: por qué un conflicto en el sur afecta ...
https://www.clarin.com/sociedad/paro-docente-polemica-conflicto-sur-afecta-pais_0_7Kt1kRgL8F.html
Expertos en Educación explican a Clarín los límites –o la falta de ellos–
entre la protesta y el derecho a tener clases. Y por qué todos los...
hace 2 días
Clarín
El campo hace un paro en protesta por la presión impositiva y la falta de
gasoil
https://www.clarin.com/rural/protesta-presion-impositiva-falta-gasoil-campo-hara-paro-pais_0_z4VuXzllXP.html
Se viene la Rural de Palermo: precios, horarios y actividades de la
muestra. Newsletters Clarín Cosecha de noticias. Héctor Huergo trae lo
más...
Hace 1 mes
... other results from the 1st and subsequent pages.
31 page:
Ámbito Financiero
Hamburgo se blinda frente a protestas contra el G-20 y ...
http://www.ambito.com/888830-hamburgo-se-blinda-frente-a-protestas-contra-el-g-20-y-anticipa-tolerancia-cero
Fuerzas especiales alemanas frente al centro de convenciones donde se
desarrollará la cumbre. El ministro del Interior de Alemania, Thomas de...
4 jul 2017
Chequeado
Quién es Carlos Rosenkrantz, el nuevo presidente de la Corte ...
https://chequeado.com/el-explicador/quien-es-carlos-rosenkrantz-el-proximo-presidente-de-la-corte-suprema/
... entre otras, del Grupo Clarín y de Farmacity y Pegasus, las dos últimas
... más restrictiva del derecho a la protesta y a los cortes en la vía
pública.
3 nov 2018
ámbito.com
Echeverría: graves incidentes en protesta piquetera
https://www.ambito.com/informacion-general/echeverria-graves-incidentes-protesta-piquetera-n3594273
Una protesta de grupos piqueteros frente a la municipalidad de Esteban
Echeverría desató hoy serios incidentes que concluyeron con al menos...
20 nov 2009
... other results from 31st page.
If you need a more detailed explanation of scraping Google News, have a look at the Web Scraping Google News with Python blog post.
I'm new to web scraping. I have seen a few tutorials on how to scrape websites using BeautifulSoup.
As an exercise I would like to extract data from a real estate website.
The specific page I want to scrape is this one: https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1
My goal is to extract a list of all the links to each real estate sale.
Afterwards, I want to loop through that list of links to extract all the data for each sale (price, location, nb bedrooms etc.)
The first issue I'm encountering is that the data scraped using the classic BeautifulSoup code does not match the source code of the webpage.
This is my code:
import requests
from bs4 import BeautifulSoup

URL = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
page = requests.get(URL)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Hence, when I look for the links of each real estate sale, which are located under
soup.find_all("a", class_="card__title-link")
it outputs an empty list. Indeed, these tags are not present in the HTML extracted by my code above.
Why is that? What should I do to ensure that the extracted html correctly corresponds to what is visible in the source code of the website?
Thank you :-)
The data you see is embedded within the page in JSON format. You can use this example to load it:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# the search results are stored as JSON in the ":results" attribute of the <iw-search> tag
data = json.loads(soup.find("iw-search")[":results"])

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data:
for ad in data:
    print(
        "{:<63} {:<8} {}".format(
            ad["property"]["title"],
            ad["transaction"]["sale"]["price"] or "-",
            "https://www.immoweb.be/fr/annonce/{}".format(ad["id"]),
        )
    )
Prints:
Triplex appartement met 3 slaapkamers en garage. 239000 https://www.immoweb.be/fr/annonce/9309298
Appartement 285000 https://www.immoweb.be/fr/annonce/9309895
Heel ruime, moderne, lichtrijke Duplex te koop, bij centrum 269000 https://www.immoweb.be/fr/annonce/9303797
À VENDRE PAR LANDBERGH : appartement de deux chambres à Gand 359000 https://www.immoweb.be/fr/annonce/9310300
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309278
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309251
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309264
Appartement intéressant avec agréable vue panoramique verdoy 219000 https://www.immoweb.be/fr/annonce/9309366
Projet Utopia by Godin - https://www.immoweb.be/fr/annonce/9309458
Appartement 2-ch avec vue unique! 270000 https://www.immoweb.be/fr/annonce/9309183
Residentieel wonen in Hélécine, dichtbij de natuur en de sne - https://www.immoweb.be/fr/annonce/9309241
Appartement 375000 https://www.immoweb.be/fr/annonce/9309187
DUPLEX LUMIEUX ET SPACIEUX 380000 https://www.immoweb.be/fr/annonce/9298271
SINT-PIETERS-LEEUW / Magnifique maison de ±130m² avec jardin 430000 https://www.immoweb.be/fr/annonce/9310259
PARC PARMENTIER // APP MODERNE 3CH 490000 https://www.immoweb.be/fr/annonce/9262193
BOIS DE LA CAMBRE – AV DE FRE – CLINIQUES DE L’EUROPE 575000 https://www.immoweb.be/fr/annonce/9309664
Entre Stockel et le Stade Fallon 675000 https://www.immoweb.be/fr/annonce/9310094
Maisons neuves dans un cadre verdoyant - https://www.immoweb.be/fr/annonce/6792221
Nieuwbouwproject Dockside Gardens - Gent - https://www.immoweb.be/fr/annonce/9008956
Appartement 139000 https://www.immoweb.be/fr/annonce/9187904
A VENDRE CHEZ LANDBERGH: appartements à Merelbeke Flora - https://www.immoweb.be/fr/annonce/9306877
Très beau studio avec une belle vue sur la plage et la mer! 319000 https://www.immoweb.be/fr/annonce/9306787
BEL APPARTEMENT LUMINEUX DIAMANT / PLASKY 320000 https://www.immoweb.be/fr/annonce/9264748
Un projet d'appartements neufs à proximité de Woluwé-St-Lamb - https://www.immoweb.be/fr/annonce/9308037
PLACE JOURDAN - 2 CHAMBRES 345000 https://www.immoweb.be/fr/annonce/9306953
Magnifiek appartement in de Brugse Rand - Assebroek 399000 https://www.immoweb.be/fr/annonce/9306613
Bien d'exception 415000 https://www.immoweb.be/fr/annonce/9308022
Appartement 435000 https://www.immoweb.be/fr/annonce/9307802
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307178
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307177
EDIT: Added URL column.
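Since the question also asks about pulling other fields per listing (price, location, number of bedrooms, etc.), here is a minimal follow-up sketch, reusing the data list from above: rather than guessing key names, flatten the nested JSON with pandas and inspect which columns actually exist. The three columns selected at the end are the ones already used in the answer above.

import pandas as pd

# json_normalize flattens nested dicts into dot-separated column names
df = pd.json_normalize(data)
print(df.columns.tolist())  # inspect all available fields first

# pick further columns from the list printed above
print(df[["id", "property.title", "transaction.sale.price"]].head())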
I am currently trying to scrape some information from the following link:
http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument
I would like to scrape some of the information in the table using BeautifulSoup in Python. Ideally I would like to scrape the "Grupo Parlamentario," "Título," "Sumilla," and "Autores" from the table as separate items.
So far I've developed the following code using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
contents = []
summary = []
authors = []
contents.append(table.findAll('font'))
authors.append(table.findAll('a'))
What I'm struggling with is that the code to scrape the authors only scrapes the first author in the list. Ideally I need to scrape all of the authors in the list. This seems odd to me because, looking at the HTML code for the webpage, all authors in the list are indicated with '<a href= >' tags. I would think table.findAll('a') would then grab all of the authors in the list.
Finally, I'm sort of just dumping the rest of the very messy HTML (title, summary, parliamentary group) all into one long string under contents. I'm not sure if I'm missing something; I'm sort of new to HTML and web scraping, but would there be a way to pull these items out and store them individually (i.e. storing just the title in one object, just the summary in another, etc.)? I'm having a tough time identifying unique tags to do this in the code for the web page. Or is this something I should just clean and parse after scraping?
to get the authors you can use:
soup.find('input', {'name': 'NomCongre'})['value']
output:
'Santa María Calderón Luis,Alva Castro Luis,Armas Vela Carlos,Cabanillas Bustamante Mercedes,Carrasco Távara José,De la Mata Fernández Judith,De La Puente Haya Elvira,Del Castillo Gálvez Jorge,Delgado Nuñez Del Arco José,Gasco Bravo Luis,Gonzales Posada Eyzaguirre Luis,León Flores Rosa Marina,Noriega Toledo Víctor,Pastor Valdivieso Aurelio,Peralta Cruz Jonhy,Zumaeta Flores César'
to scrape Grupo Parlamentario:
table.find_all('td', {'width': 446})[1].text
output:
'Célula Parlamentaria Aprista'
to scrape Título:
table.find_all('td', {'width': 446})[2].text
output:
'IGV/SELECTIVO:D.L.821/LEY INTERPRETATIVA '
to scrape Sumilla:
table.find_all('td', {'width': 446})[3].text
output:
' Propone la aprobación de una Ley Interpretativa del Texto Original del Numeral 1 del Apéndice II del Decreto Legislativo N°821,modificatorio del Texto Vigente del Numeral 1 del Apéndice II del Texto Único Ordenado de la Ley del Impuesto General a las Ventas y Selectivo al Consumo,aprobado por Decreto Supremo N°054-99-EF. '
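To address the original question of storing each item separately, here is a minimal sketch that combines the selectors above into one record, reusing the soup and table objects from the question's code:

# grab the four <td width="446"> cells once, then index into them as shown above
cells = table.find_all('td', {'width': 446})

record = {
    'autores': soup.find('input', {'name': 'NomCongre'})['value'].split(','),
    'grupo_parlamentario': cells[1].get_text(strip=True),
    'titulo': cells[2].get_text(strip=True),
    'sumilla': cells[3].get_text(strip=True),
}
print(record)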
I have attempted several methods to pull links from the following webpage, but can't seem to find the desired links. From this webpage (https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1) I am attempting to extract all of the links for the "gamecast" button. The example of the first one I would be attempting to get is this: https://www.espn.com/college-football/game/_/gameId/401110723
When I try to just pull all links on the page, I do not even seem to get the desired ones at all, so I'm confused about where I'm going wrong here. A few attempts I have made are below; they don't seem to be pulling in what I want. The first method I tried is below.
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(page.text, 'html.parser')
# game_id = soup.find(name_='&lpos=college-football:scoreboard:gamecast')
game_id = soup.find('a',class_='button-alt sm')
Here is a second method I tried. Any help is greatly appreciated.
for a in soup.find_all('a'):
    if 'college-football' in a['href']:
        print(a['href'])
Edit: as a clarification, I am attempting to pull all links that contain a gameId, as in the example link.
The button with the link you are trying to get is loaded with JavaScript. The requests module does not execute the JavaScript in the HTML it fetches. Therefore, you cannot scrape the button directly to find the links you want (without a browser automation tool like Selenium). However, I found JSON data in the HTML that contains the scoreboard data in which the links are located. If you are also looking to scrape more information (times, etc.) from this page, I highly recommend looking through the JSON data in the variable json_scoreboard in the code.
Code
import requests, re, json
from bs4 import BeautifulSoup

r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(r.text, 'html.parser')
scripts_head = soup.find('head').find_all('script')
all_links = {}
for script in scripts_head:
    if 'window.espn.scoreboardData' in script.text:
        json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
        for event in json_scoreboard['events']:
            name = event['name']
            for link in event['links']:
                if link['text'] == 'Gamecast':
                    gamecast = link['href']
                    all_links[name] = gamecast
print(all_links)
Output
{'Miami Hurricanes at Florida Gators': 'http://www.espn.com/college-football/game/_/gameId/401110723', 'Georgia Tech Yellow Jackets at Clemson Tigers': 'http://www.espn.com/college-football/game/_/gameId/401111653', 'Texas State Bobcats at Texas A&M Aggies': 'http://www.espn.com/college-football/game/_/gameId/401110731', 'Utah Utes at BYU Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114223', 'Florida A&M Rattlers at UCF Knights': 'http://www.espn.com/college-football/game/_/gameId/401117853', 'Tulsa Golden Hurricane at Michigan State Spartans': 'http://www.espn.com/college-football/game/_/gameId/401112212', 'Wisconsin Badgers at South Florida Bulls': 'http://www.espn.com/college-football/game/_/gameId/401117856', 'Duke Blue Devils at Alabama Crimson Tide': 'http://www.espn.com/college-football/game/_/gameId/401110720', 'Georgia Bulldogs at Vanderbilt Commodores': 'http://www.espn.com/college-football/game/_/gameId/401110732', 'Florida Atlantic Owls at Ohio State Buckeyes': 'http://www.espn.com/college-football/game/_/gameId/401112251', 'Georgia Southern Eagles at LSU Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110725', 'Middle Tennessee Blue Raiders at Michigan Wolverines': 'http://www.espn.com/college-football/game/_/gameId/401112222', 'Louisiana Tech Bulldogs at Texas Longhorns': 'http://www.espn.com/college-football/game/_/gameId/401112135', 'Oregon Ducks at Auburn Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110722', 'Eastern Washington Eagles at Washington Huskies': 'http://www.espn.com/college-football/game/_/gameId/401114233', 'Idaho Vandals at Penn State Nittany Lions': 'http://www.espn.com/college-football/game/_/gameId/401112257', 'Miami (OH) RedHawks at Iowa Hawkeyes': 'http://www.espn.com/college-football/game/_/gameId/401112191', 'Northern Iowa Panthers at Iowa State Cyclones': 'http://www.espn.com/college-football/game/_/gameId/401112085', 'Syracuse Orange at Liberty Flames': 'http://www.espn.com/college-football/game/_/gameId/401112434', 'New Mexico State Aggies at Washington State Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114228', 'South Alabama Jaguars at Nebraska Cornhuskers': 'http://www.espn.com/college-football/game/_/gameId/401112238', 'Northwestern Wildcats at Stanford Cardinal': 'http://www.espn.com/college-football/game/_/gameId/401112245', 'Houston Cougars at Oklahoma Sooners': 'http://www.espn.com/college-football/game/_/gameId/401112114', 'Notre Dame Fighting Irish at Louisville Cardinals': 'http://www.espn.com/college-football/game/_/gameId/401112436'}
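Since the clarification asks specifically for the gameIds, here is a small follow-up sketch that pulls the numeric id out of each Gamecast URL in the all_links dict built above:

# the gameId is the last path segment of each Gamecast URL
game_ids = [href.rsplit('/', 1)[-1] for href in all_links.values()]
print(game_ids)  # e.g. ['401110723', '401111653', ...]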
Trying to scrape climbing gym data. I'm using BeautifulSoup.
I want to store arrays of the gym name, location, phone number, link, and description.
Here is sample html:
<div class="city">Alberta</div>
<p><b>Camp He Ho Ha Climbing Gym</b><br>
Seba Beach, Alberta, TOE 2BO Canada<br>
(780) 429-3277<br>
<a rel='nofollow' target='_blank' href='http://camphehoha.com/summer-camp/camp-life/'>Camp He Ho Ha Climbing Gym</a><br>
<span class='rt'></span> The Summit is Camp He Ho Ha's 40' climbing gym and ropes course. Facility is available for rent, with safety equipment, orientation to the course and staffing provided.</p>
<div class="city">Calgary</div>
<p><b>Bolder Climbing Community</b><br>
5508 1st Street SE, Calgary, Alberta, Canada<br>
403 988-8140<br>
<a rel='nofollow' target='_blank' href='http://www.bolderclimbing.com/'>Bolder Climbing Community</a><br>
<span class='rt'></span> Calgary's first bouldering specific climbing centre.</p>
I can easily move between each climbing gym because they are separated by <p> but the individual items I'm interested in are separated by <br>. How do I store these items into separate arrays?
You can do something like this. Basically, find each <br> tag, then take the content right before it.
html = '''<div class="city">Alberta</div>
<p><b>Camp He Ho Ha Climbing Gym</b><br>
Seba Beach, Alberta, TOE 2BO Canada<br>
(780) 429-3277<br>
<a rel='nofollow' target='_blank' href='http://camphehoha.com/summer-camp/camp-life/'>Camp He Ho Ha Climbing Gym</a><br>
<span class='rt'></span> The Summit is Camp He Ho Ha's 40' climbing gym and ropes course. Facility is available for rent, with safety equipment, orientation to the course and staffing provided.</p>
<div class="city">Calgary</div>
<p><b>Bolder Climbing Community</b><br>
5508 1st Street SE, Calgary, Alberta, Canada<br>
403 988-8140<br>
<a rel='nofollow' target='_blank' href='http://www.bolderclimbing.com/'>Bolder Climbing Community</a><br>
<span class='rt'></span> Calgary's first bouldering specific climbing centre.</p>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

final_content = []
ps = soup.find_all('p')
for p in ps:
    content = []
    breaks = p.find_all('br')
    for br in breaks:
        try:
            # previous_sibling is a Tag (e.g. <b> or <a>) for some <br>s; .strip() fails there
            b = br.previous_sibling.strip()
            content.append(b)
        except AttributeError:
            continue
    final_content.append(content)
Output:
print (final_content)
[['Seba Beach, Alberta, TOE 2BO Canada', '(780) 429-3277'], ['5508 1st Street SE, Calgary, Alberta, Canada', '403 988-8140']]
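If you also want the gym name, link, and description that the question mentions, here is a hedged extension of the same idea, reusing the html sample from above. It assumes each <p> follows the structure shown in the question: the name sits in the <b> tag, the link in the <a> tag, and the description in the text right after the <span class='rt'> marker.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

gyms = []
for p in soup.find_all('p'):
    # text nodes sitting right before a <br> are the address and phone number
    lines = [br.previous_sibling.strip() for br in p.find_all('br')
             if isinstance(br.previous_sibling, str) and br.previous_sibling.strip()]
    gyms.append({
        'name': p.b.get_text(strip=True),
        'address': lines[0],
        'phone': lines[1],
        'link': p.a['href'],
        'description': p.find('span', class_='rt').next_sibling.strip(),
    })

print(gyms)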