I am new to Python and I am trying to scrape this website. What I am trying to do is get just the dates and the articles' titles. I followed a procedure I found on SO, which is as follows:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
movies = soup.select(".title a , .date")
print(movies)
movies_titles = [title.text for title in movies]
movies_links = ["http://www.ecb.europa.eu"+ title["href"] for title in movies]
print(movies_titles)
print(movies_links)
I got .title a , .date using SelectorGadget on the URL I shared. However, print(movies) outputs an empty list. What am I doing wrong?
Can anyone help me?
Thanks!
The content is not part of index.en.html but is loaded in by JavaScript from
https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html
Also, as far as I know you can't select pairs with a single selector, so you need to select the titles and dates separately:
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
Then you can print them out like this:
movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)
movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
Result:
['Christine Lagarde:\xa0Interview with CNBC', 'Fabio Panetta:\xa0Interview with El País ', 'Isabel Schnabel:\xa0Interview with Der Spiegel', 'Philip R. Lane:\xa0Interview with CNBC', 'Frank Elderson:\xa0Q&A on Twitter', 'Isabel Schnabel:\xa0Interview with Les Echos ', 'Philip R. Lane:\xa0Interview with the Financial Times', 'Luis de Guindos:\xa0Interview with Público', 'Philip R. Lane:\xa0Interview with Expansión', 'Isabel Schnabel:\xa0Interview with LETA', 'Fabio Panetta:\xa0Interview with Der Spiegel', 'Christine Lagarde:\xa0Interview with Le Journal du Dimanche ', 'Philip R. Lane:\xa0Interview with Süddeutsche Zeitung', 'Isabel Schnabel:\xa0Interview with Deutschlandfunk', 'Philip R. Lane:\xa0Interview with SKAI TV', 'Isabel Schnabel:\xa0Interview with Der Standard']
['http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210412~ccd1b7c9bf.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210409~c8c348a12c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210323~e4026c61d1.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317_1~1d81212506.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317~458636d643.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210316~930d09ce3c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210302~c793ad7b68.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210226~79eba6f9fb.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210225~5f1be75a9f.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210209~af9c628e30.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210207~f6e34f3b90.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131_1~650f5ce5f7.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131~13d84cb9b2.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210127~9ad88eb038.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210112~1c3f989acd.en.html']
Full code:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)
movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
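If you also want to print each date next to its title, here is a minimal sketch reusing the pairs list from above (it assumes the page lists one .date element per .title entry, in the same order):
# pair each title with its date; assumes .date and .title a appear
# in matching order on the page
for title_el, date_el in pairs:
    print(date_el.get_text(strip=True), "|", title_el.get_text(strip=True))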
I would recommend using Selenium with Python. Try something like this:
from selenium.webdriver import Chrome
from selenium.common.exceptions import NoSuchElementException

url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
browser = Chrome()
browser.get(url)

interviews = browser.find_elements_by_class_name('title')
links = []
for interview in interviews:
    try:
        anchor = interview.find_element_by_tag_name('a')
        link = anchor.get_attribute('href')
        links.append(link)
    except NoSuchElementException:
        pass
The links list will contain the links to all the interviews. You can do something similar for the dates:
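A minimal sketch, assuming the dates sit in elements with class name date (the .date selector from the question):
# collect the date text the same way the links were collected
dates = []
for element in browser.find_elements_by_class_name('date'):
    dates.append(element.text)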
So, I need help here. This is my code:
import re
import requests
from bs4 import BeautifulSoup

results = []
# popup_linkz holds N links like
# https://www.mercadopublico.cl/Procurement/Modules/RFB/DetailsAcquisition.aspx?qs=uEap3sWEgifS2G+m9xvYiA==
for url in popup_linkz:  # iterate over the links and scrape each one
    response = requests.get(url)
    print('url:', response.url)
    #print('status:', response.status_code)
    soup = BeautifulSoup(response.content, "html.parser")
    #json_res = json.loads(res.text)
    #print(json_res[0]['price'])
    for line in soup.findAll('span', attrs={'id': 'grvProducto_ctl02_lblCategoria'}):
        results.append(line.text)
    # This gets the first code, but I don't know how to iterate over the others;
    # it also doesn't accumulate every code -- each print shows them singly.
    print('id', results)
I am trying to get data from this sample URL: https://www.mercadopublico.cl/Procurement/Modules/RFB/DetailsAcquisition.aspx?qs=uEap3sWEgifS2G+m9xvYiA==
In practice the code iterates over 2 to 10,000 of them.
(screenshot: the information I want to get but can't)
I am not sure how to use this
for line in soup.findAll('span', attrs={'id': 'grvProducto_ctl02_lblCategoria'}):
results.append(line.text)
in the same loop to get the other information.
(screenshot: the data of the underlying page)
Could you enlighten me, please?
Try:
import requests
from bs4 import BeautifulSoup
url = "https://www.mercadopublico.cl/Procurement/Modules/RFB/DetailsAcquisition.aspx?qs=uEap3sWEgifS2G+m9xvYiA=="
soup = BeautifulSoup(requests.get(url).content, "html.parser")
licitation_number = soup.select_one("#lblNumLicitacion").text
responsable = soup.select_one("#lblResponsable").text
ficha = soup.select_one("#lblFicha2Reclamo").text
print(f"{licitation_number=}")
print(f"{responsable=}")
print(f"{ficha=}")
print("-" * 80)
for t in soup.select("#grvProducto .borde_tabla00"):
    categoria = t.select_one('[id$="lblCategoria"]').text
    cantidad = t.select_one('[id$="lblCantidad"]').text
    descripcion = t.select_one('[id$="lblDescripcion"]').text

    print(f"{categoria=} {cantidad=}")
    print(f"{descripcion=}")
    print()
Prints:
licitation_number='1549-5-LR22'
responsable='SERVICIO DE SALUD METROPOLITANA NORTE HOSPITAL SAN JOSE, Hospital San José'
ficha='107'
--------------------------------------------------------------------------------
categoria='42221501' cantidad='130'
descripcion='(226-2001) STENT CORONARIO DE CROMO COBALTO, LIBERADOR DE FÁRMACO EVEROLIMUS'
categoria='42221501' cantidad='360'
descripcion='(226-2002) STENT CORONARIO DE CROMO COBALTO, LIBERADOR DE FÁRMACO ZOTAROLIMUS'
categoria='42221501' cantidad='120'
descripcion='(226-2004) STENT CORONARIO DE CROMO COBALTO, LIBERADOR DE FÁRMACO SIROLIMUS, CON STRUT DE 0.80'
categoria='42221501' cantidad='240'
descripcion='(226-2003) STENT CORONARIO DE CROMO COBALTO, LIBERADOR DE FÁRMACO SIROLIMUS, CON STRUT DE 0.60'
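Since the question mentions iterating over 2 to 10,000 of these links, here is a minimal sketch that accumulates all rows across URLs (popup_linkz is assumed from the question's earlier code):
import requests
from bs4 import BeautifulSoup

all_rows = []
for url in popup_linkz:  # popup_linkz is assumed from the question's code
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for t in soup.select("#grvProducto .borde_tabla00"):
        all_rows.append({
            "categoria": t.select_one('[id$="lblCategoria"]').text,
            "cantidad": t.select_one('[id$="lblCantidad"]').text,
            "descripcion": t.select_one('[id$="lblDescripcion"]').text,
        })
print(len(all_rows), "rows collected")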
I was trying to use the requests library in Python to search Stack Overflow for questions through the search bar, then take the first 3 links found, get the content of those pages, and send it to my Notion API. But I am stuck on how to get the HTML of the page each link directs me to in Python.
import requests
from bs4 import BeautifulSoup

# online search
def stackoverflow(question):
    questionAdjusted = question.replace(' ', '+')
    req = requests.get("https://pt.stackoverflow.com/search?q=" + questionAdjusted)
    soup = BeautifulSoup(req.text, "html.parser")
    questions = soup.select(".question-summary")
    for que in questions:
        #print(que.select_one('.question-hyperlink').getText().replace('P: ',''))
        #print((que.select_one('.question-hyperlink').getText().replace('P: ', '').replace(' ','-').replace('--------','')))
        for link in soup.findAll('a', href=(que.select_one('.question-hyperlink').getText().replace('P: ', '').replace(' ','-').replace('--------',''))):
            print(link['href'])

stackoverflow('python database')
This is all I have made so far.
To get the first 3 links plus their descriptions from the URL, you can use the next example:
import requests
from bs4 import BeautifulSoup

def stackoverflow(question):
    url = "https://pt.stackoverflow.com/search"
    r = requests.get(url, params={"q": question})
    soup = BeautifulSoup(r.content, "html.parser")

    questions = soup.select(".question-hyperlink")
    for q in questions[:3]:  # <-- select only first 3
        print(q.get_text(strip=True).replace("P: ", ""))
        print("https://pt.stackoverflow.com" + q["href"])
        print()

stackoverflow("python database")
Prints:
Select from e insert into em outro database com Python
https://pt.stackoverflow.com/questions/376648/select-from-e-insert-into-em-outro-database-com-python?r=SearchResults
Finalizando um projeto em python [duplicada]
https://pt.stackoverflow.com/questions/259591/finalizando-um-projeto-em-python?r=SearchResults
Erro de conexão com SQL Server 2012 com Python
https://pt.stackoverflow.com/questions/478779/erro-de-conex%c3%a3o-com-sql-server-2012-com-python?r=SearchResults
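To then download the HTML of each result page (the part the question was stuck on, before sending it to the Notion API), a minimal sketch extending the function above; stackoverflow_pages is an illustrative name:
import requests
from bs4 import BeautifulSoup

def stackoverflow_pages(question):
    url = "https://pt.stackoverflow.com/search"
    r = requests.get(url, params={"q": question})
    soup = BeautifulSoup(r.content, "html.parser")

    pages = []
    for q in soup.select(".question-hyperlink")[:3]:
        link = "https://pt.stackoverflow.com" + q["href"]
        html = requests.get(link).text  # raw HTML of the linked question page
        pages.append((link, html))
    return pages

pages = stackoverflow_pages("python database")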
I am having some difficulty saving the results that I am scraping.
Please refer to this code (this code was slightly changed for my specific case):
import bs4, requests
import pandas as pd
import re
import time

headline=[]
corpus=[]
dates=[]
tag=[]

start=1
url="https://www.imolaoggi.it/category/cron/"
while True:
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html')
    headlines=soup.find_all('h3')
    corpora=soup.find_all('p')
    dates=soup.find_all('time', attrs={'class':'entry-date published updated'})
    tags=soup.find_all('span', attrs={'class':'cat-links'})
    for t in headlines:
        headline.append(t.text)
    for s in corpora:
        corpus.append(s.text)
    for d in date:
        dates.append(d.text)
    for c in tags:
        tag.append(c.text)
    if soup.find_all('a', attrs={'class':'page-numbers'}):
        url = f"https://www.imolaoggi.it/category/cron/page/{page}"
        page += 1
    else:
        break
# create dataframe
df = pd.DataFrame(list(zip(date, headline, tag, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])
I would like to save all the pages from this link. The code works, but it seems to write, on every page, two identical sentences for the corpus. I think this is happening because of the tag I chose:
corpora=soup.find_all('p')
This causes a misalignment of rows in my dataframe: the data are saved in lists, and the corpus only starts being scraped correctly later compared to the other fields. I hope you can help me understand how to fix it.
You were close, but your selectors were off, and you mis-named some of your variables.
I would use css selectors like this:
headline=[]
corpus=[]
date_list=[]
tag_list=[]

headlines=soup.select('h3.entry-title')
corpora=soup.select('div.entry-meta + p')
dates=soup.select('div.entry-meta span.posted-on')
tags=soup.select('span.cat-links')

for t in headlines:
    headline.append(t.text)
for s in corpora:
    corpus.append(s.text.strip())
for d in dates:
    date_list.append(d.text)
for c in tags:
    tag_list.append(c.text)

df = pd.DataFrame(list(zip(date_list, headline, tag_list, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])
df
Output:
Date Headlines Tags Corpus
0 30 Ottobre 2020 Roma: con spranga di ferro danneggia 50 auto i... CRONACA, NEWS Notte di vandalismi a Colli Albani dove un uom...
1 30 Ottobre 2020\n30 Ottobre 2020 Aggressione con machete: grave un 28enne, arre... CRONACA, NEWS Roma - Ha impugnato il suo machete e lo ha agi...
2 30 Ottobre 2020\n30 Ottobre 2020 Deep State e globalismo, Mons. Viganò scrive a... CRONACA, NEWS LETTERA APERTA\r\nAL PRESIDENTE DEGLI STATI UN...
3 30 Ottobre 2020 Meluzzi e Scandurra: “Sacrificare libertà per ... CRONACA, NEWS "Sacrificare la libertà per la sicurezza è un ...
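To apply these selectors across all the pages, a minimal sketch reusing the question's /page/{n} URL pattern (stopping when a page returns no headlines is an assumption about how the site paginates):
import bs4, requests

headline, corpus, date_list, tag_list = [], [], [], []
page = 1
while True:
    r = requests.get(f"https://www.imolaoggi.it/category/cron/page/{page}")
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    headlines = soup.select('h3.entry-title')
    if not headlines:  # assumed stop condition: an empty page ends the loop
        break
    for t in headlines:
        headline.append(t.text)
    for s in soup.select('div.entry-meta + p'):
        corpus.append(s.text.strip())
    for d in soup.select('div.entry-meta span.posted-on'):
        date_list.append(d.text)
    for c in soup.select('span.cat-links'):
        tag_list.append(c.text)
    page += 1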
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
def main(req, num):
    r = req.get("https://www.imolaoggi.it/category/cron/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h3.a.text, x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
            for x in soup.select("div.entry-content")]
    return goal

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 2937)]
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Tags", "Content"])
        print(df)
        df.to_csv("result.csv", index=False)
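Note that because the futures are collected with f.result() in submission order, the rows in the final dataframe stay in page order even though the 2,936 requests run concurrently across 30 worker threads.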
I have this:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    aux.append(equipo)
If I do print(aux[0]) I get this:
Villarreal
Entrenador:
Javier Calleja
Jugadores:
1 Sergio Asenjo
13 Andrés Fernández
25 Mariano Barbosa
...
And my problem is that I want to take this tag:
<h2 class="cintillo">Villarreal</h2>
And this tag:
<span class="dorsal-jugador">1 Sergio Asenjo</span>
And put them into a database.
How can I do that?
Thanks!
You can extract the first <h2 class="cintillo"> element from equipo like this:
h2 = str(equipo.find('h2', {'class':'cintillo'}))
If you only want the inner text (without any tags), use:
h2 = equipo.find('h2', {'class':'cintillo'}).text
And you can extract all the <span class="dorsal-jugador"> elements from equipo like this:
jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
Then append h2 and jugadores to a multi-dimensional list.
Full code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    h2 = equipo.find('h2', {'class':'cintillo'}).text
    jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
    aux.append([h2, [j.text for j in jugadores]])
# format list for printing
print('\n\n'.join(['--'+i[0]+'--\n' + '\n'.join(i[1]) for i in aux]))
Output sample:
--Alavés--
Fernando Pacheco
Antonio Sivera
Álex Domínguez
Carlos Vigaray
...
Demo: https://repl.it/#glhr/55550385
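Since the goal was to store this in a database, here is a hedged sketch building on the aux list from the full code above, using sqlite3 from the standard library (the database file name and table schema are illustrative assumptions):
import sqlite3

# illustrative schema: one row per (team, player); names are assumptions
conn = sqlite3.connect("equipos.db")
conn.execute("CREATE TABLE IF NOT EXISTS jugadores (equipo TEXT, jugador TEXT)")
for equipo, jugadores in aux:
    conn.executemany(
        "INSERT INTO jugadores (equipo, jugador) VALUES (?, ?)",
        [(equipo, j) for j in jugadores],
    )
conn.commit()
conn.close()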
You could create a dictionary with team names as keys and {entrenador: players} dictionaries as values:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.marca.com/futbol/primera/equipos.html')
soup = bs(r.content, 'lxml')
teams = {}
for team in soup.select('[id=nombreEquipo]'):
    team_name = team.select_one('.cintillo').text
    entrenador = team.select_one('dd').text
    players = [item.text for item in team.select('.dorsal-jugador')]
    teams[team_name] = {entrenador: players}
print(teams)
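A quick usage example for the resulting dictionary (the team name 'Alavés' is assumed from the other answer's sample output):
# look up one team; 'Alavés' is an assumed key based on the site's content
for entrenador, players in teams.get('Alavés', {}).items():
    print('Entrenador:', entrenador)
    print('Jugadores:', ', '.join(players))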
Hi, I'm practicing regular expressions with Python to parse the titles of the top 250 movies from IMDb, but I am having difficulty searching for content between two tags, like:
<a>The Godfather</a>
import re, urllib.request

def movie(url):
    web_page = urllib.request.urlopen(url)
    lines = web_page.read().decode(errors="replace")
    web_page.close()
    return re.findall('(?<=<a>).+?(?=</a>)', lines, re.DOTALL)

title = movie("https://www.imdb.com/search/title?groups=top_250&sort=user_rating")
for name in title:
    print(name)
As pointed out in the comments, you'd better give BeautifulSoup a try. Something like this will list the titles, in Python 3:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.imdb.com/search/title?groups=top_250&sort=user_rating')
if html.ok:
    soup = BeautifulSoup(html.text, 'html.parser')
    html.close()
    for title in soup('h3', 'lister-item-header'):
        print(title('a')[0].get_text())
And here is a cleaner version of the code above:
import requests
from bs4 import BeautifulSoup
imdb_entry_point = 'https://www.imdb.com/search/title'
imdb_payload = {
    'groups': 'top_250',
    'sort': 'user_rating'
}
with requests.get(imdb_entry_point, imdb_payload) as imdb:
    if imdb.ok:
        html = BeautifulSoup(imdb.text, 'html.parser')
        for i, h3 in enumerate(html('h3', 'lister-item-header'), 1):
            for a in h3('a'):
                print(i, a.get_text())
BTW, that entry point is returning just 50 results and not 250 as you are expecting.
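If you want all 250 in one request, one hedged option is to add a per-page count to the payload (the count parameter is an assumption based on the search page's per-page selector, so verify it against the live site):
imdb_payload = {
    'groups': 'top_250',
    'sort': 'user_rating',
    'count': 250  # assumption: mirrors the '250 per page' option in the UI
}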
Here is a working solution, using both BeautifulSoup and some nasty regex, but it works fine. I love regex, though it seems I write them in a weird way; I can explain how they work if you want.
import re, urllib.request
from bs4 import BeautifulSoup
url = "https://www.imdb.com/search/title?groups=top_250&sort=user_rating"
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
i = 0
for txt in soup.findAll(attrs={"class": "lister-item-header"}):
    i += 1
    print(str(i) + " ." + re.match("""^.*>(.*)</a>.*$""", re.sub('"', '', re.sub('\n', '', str(txt)))).group(1))
My output (it's French):
Les évadés
Le parrain
The Dark Knight: Le chevalier noir
Le parrain, 2ème partie
Le seigneur des anneaux: Le retour du roi
And the list goes on...