How to scrape data from a Wikipedia list into a pandas DataFrame - Python

I'm trying to scrape a list (not a table) from a Wikipedia page. My code raises "list index out of range": how can I solve this?
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://it.m.wikipedia.org/wiki/Premio_Bagutta'
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
raw = soup.find_all("div", {"class": "div-col"})[0].find_all("li")
df = pd.DataFrame([[item.get_text().split(" ")[0],
                    item.find_next("a").get("title"),
                    item.find_next("i").get_text()[1:-1]]
                   for item in raw if item.find_next("i")],
                  columns=("Year"))
print(df.head())

You could try this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

data = requests.get("https://it.m.wikipedia.org/wiki/Premio_Bagutta")
raw = BeautifulSoup(data.content, "html.parser").find_all(
    "section", class_="mf-section-2 collapsible-block"
)[0]
raw_years = [item.text.replace("\n", "") for item in raw.find_all("p")]
raw_authors = [item for item in raw.find_all("ul")]

# For some years there are several authors, so iterate over both lists in sync
years = []
authors = []
for (year, author) in zip(raw_years, raw_authors):
    years.append(year)
    authors.append(author.text.split("\n"))

df = pd.DataFrame({"year": years, "author": authors}).explode("author")
print(df)
# Output
year author
0 1927 Giovanni Battista Angioletti, Il giorno del giudizio[11][12] (Ribet)
1 1928 Giovanni Comisso, Gente di mare[13][14] (Treves)
2 1929 Vincenzo Cardarelli, Il sole a picco[15][16] (Mondadori)
3 1930 Gino Rocca, Gli ultimi furono i primi[17][18] (Treves)
4 1931 Giovanni Titta Rosa, Il varco nel muro[19][20] (Carabba)
.. ... ...
82 2018 Helena Janeczek, La ragazza con la Leica[154] (Guanda)
83 2019 Marco Balzano, Resto qui[9][155] (Einaudi)
84 2020 Enrico Deaglio, La bomba[8][156] (Feltrinelli)
85 2021 Giorgio Fontana, Prima di noi[157] (Sellerio)
86 2022 Benedetta Craveri, La contessa[158][159] (Adelphi)
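As a side note, explode() is what turns the list-valued author column into one row per author. A minimal illustration with toy data (not taken from the page), assuming pandas >= 0.25:

import pandas as pd

# One year maps to a list of two winners (hypothetical data)
df = pd.DataFrame({"year": ["1928"], "author": [["Author A", "Author B"]]})
print(df.explode("author"))
# -> two rows, both with year 1928 and one author each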

How to web scrape this page and turn it into a CSV file?

My name is João, I'm a law student from Brazil and I'm new to this. I've been trying to web scrape this page for a week, to help with my undergraduate thesis and to help other researchers.
I want to make a CSV file with all the results from a search in a court database (this link). As you can see at the link, there are 404 results (processos) divided over 41 pages. Each result has its own HTML page with its information (much like a product page in a marketplace).
Each result page is divided into two main tables. The first one holds the result's general information and probably has the same structure for every result. The second table contains the result's files (decisions in an administrative process); the number of files varies, and some files may share a name but have different dates. From this second table I only need the link to the oldest "relatório/voto" and its date, and the link to the oldest "acórdão" and its date.
The header of the CSV file should look like the following image, and each result should be one row.
I'm working with Python on Google Colab and I've tried many ways to scrape, but none worked well. My most complete approach was adapting a product-scraping tutorial: video and corresponding code on GitHub.
My adaptation does not work in Colab; it produces neither an error message nor a CSV file. By comparing the pages and the lesson, I identified some problems in the adaptation:
While extracting the result links from one of the 41 pages, I believe I should build a list of the extracted result URLs, but it extracted the text too and I'm not sure how to correct it.
When trying to extract the data from a result page, I fail. Whenever I tried to create a list from these fields it only returned one result.
Beyond the tutorial, I would also like to extract data from the second table on each result page: the link to the oldest "relatório/voto" and its date, and the link to the oldest "acórdão" and its date. I'm not sure how and where in the code I should do that.
ADAPTED CODE
from requests_html import HTMLSession
import csv

s = HTMLSession()

# STEP 01: take the result html
def get_results_links(page):
    url = f"https://www.tce.sp.gov.br/jurisprudencia/pesquisar?txtTdPalvs=munic%C3%ADpio+pessoal+37&txtExp=temporari&txtQqUma=admiss%C3%A3o+contrata%C3%A7%C3%A3o&txtNenhPalvs=&txtNumIni=&txtNumFim=&tipoBuscaTxt=Documento&_tipoBuscaTxt=on&quantTrechos=1&processo=&exercicio=&dataAutuacaoInicio=&dataAutuacaoFim=&dataPubInicio=01%2F01%2F2021&dataPubFim=31%2F12%2F2021&_relator=1&_auditor=1&_materia=1&tipoDocumento=2&_tipoDocumento=1&acao=Executa&offset={page}"
    links = []
    r = s.get(url)
    results = r.html.find('td.small a')
    for item in results:
        links.append(item.find('a', first=True).attrs['href'])  # Problem 01: I believe this should create a list of the result URLs extracted from the page, but it extracted the text too.
    return links

# STEP 02: extracting relevant information from the result html extracted before
def parse_result(url):
    r = s.get(url)
    numero = r.html.find('td.small', first=True).text.strip()
    data_autuacao = r.html.find('td.small', first=True).text.strip()
    try:
        parte_1 = r.html.find('td.small', first=True).text.strip()
    except AttributeError as err:
        sku = 'Não há'
    try:
        parte_2 = r.html.find('td.small', first=True).text.strip()
    except AttributeError as err:
        parte_2 = 'Não há'
    materia = r.html.find('td.small', first=True).text.strip()
    exercicio = r.html.find('td.small', first=True).text.strip()
    objeto = r.html.find('td.small', first=True).text.strip()
    relator = r.html.find('td.small', first=True).text.strip()
    # Problem 02
    # STEP 03: creating a dict based on the objects created before
    product = {
        'Nº do Processo': numero,
        "Link do Processo": r,
        'Data de Autuação': data_autuacao,
        'Parte 1': parte_1,
        'Parte 2': parte_2,
        'Exercício': exercicio,
        'Matéria': materia,
        'Objeto': objeto,
        'Relator': relator
        # 'Relatório/Voto':
        # 'Data Relatório/Voto':
        # 'Acórdão':
        # 'Data Acórdão':
    }  # Problem 03
    return product

# STEP 04: saving as csv
def save_csv(final):
    keys = final[0].keys()
    with open('products.csv', 'w') as f:
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(final)

# STEP 05: main - joining the functions
def main():
    final = []
    for x in range(0, 410, 10):
        print('Getting Page ', x)
        urls = get_results_links(x)
        for url in urls:
            final.append(parse_result(url))
        print('Total: ', len(final))
    save_csv(final)
Thank you, @shelter, for your help so far. I tried to be more specific.
There are better (albeit more complex) ways of obtaining that information, such as scrapy or an async solution. Nonetheless, here is one way of getting the information you're after and saving it into a CSV file. I only scraped the first 2 pages (20 results); you can increase the range if you wish (see the note after the output):
from bs4 import BeautifulSoup as bs
import requests
from tqdm.notebook import tqdm
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)

big_list = []
detailed_list = []

# collect the detail-page links from the search result pages
for x in tqdm(range(0, 20, 10)):
    url = f'https://www.tce.sp.gov.br/jurisprudencia/pesquisar?txtTdPalvs=munic%C3%ADpio+pessoal+37&txtExp=temporari&txtQqUma=admiss%C3%A3o+contrata%C3%A7%C3%A3o&txtNenhPalvs=&txtNumIni=&txtNumFim=&tipoBuscaTxt=Documento&_tipoBuscaTxt=on&quantTrechos=1&processo=&exercicio=&dataAutuacaoInicio=&dataAutuacaoFim=&dataPubInicio=01%2F01%2F2021&dataPubFim=31%2F12%2F2021&_relator=1&_auditor=1&_materia=1&tipoDocumento=2&_tipoDocumento=1&acao=Executa&offset={x}'
    r = s.get(url)
    urls = bs(r.text, 'html.parser').select('tr[class="borda-superior"] td:nth-of-type(2) a')
    big_list.extend(['https://www.tce.sp.gov.br/jurisprudencia/' + x.get('href') for x in urls])

# visit each detail page and pull out the fields of interest
for x in tqdm(big_list):
    r = s.get(x)
    soup = bs(r.text, 'html.parser')
    n_proceso = soup.select_one('td:-soup-contains("N° Processo:")').find_next('td').text if soup.select('td:-soup-contains("N° Processo:")') else None
    link_proceso = x
    autoacao = soup.select_one('td:-soup-contains("Autuação:")').find_next('td').text if soup.select('td:-soup-contains("Autuação:")') else None
    parte_1 = soup.select_one('td:-soup-contains("Parte 1:")').find_next('td').text if soup.select('td:-soup-contains("Parte 1:")') else None
    parte_2 = soup.select_one('td:-soup-contains("Parte 2:")').find_next('td').text if soup.select('td:-soup-contains("Parte 2:")') else None
    materia = soup.select_one('td:-soup-contains("Matéria:")').find_next('td').text if soup.select('td:-soup-contains("Matéria:")') else None
    exercicio = soup.select_one('td:-soup-contains("Exercício:")').find_next('td').text if soup.select('td:-soup-contains("Exercício:")') else None
    objeto = soup.select_one('td:-soup-contains("Objeto:")').find_next('td').text if soup.select('td:-soup-contains("Objeto:")') else None
    relator = soup.select_one('td:-soup-contains("Relator:")').find_next('td').text if soup.select('td:-soup-contains("Relator:")') else None
    relatorio_voto = soup.select_one('td:-soup-contains("Relatório / Voto ")').find_previous('a').get('href') if soup.select('td:-soup-contains("Relatório / Voto")') else None
    data_relatorio = soup.select_one('td:-soup-contains("Relatório / Voto ")').find_previous('td').text if soup.select('td:-soup-contains("Relatório / Voto")') else None
    acordao = soup.select_one('td:-soup-contains("Acórdão ")').find_previous('a').get('href') if soup.select('td:-soup-contains("Acórdão ")') else None
    data_acordao = soup.select_one('td:-soup-contains("Acórdão ")').find_previous('td').text if soup.select('td:-soup-contains("Acórdão ")') else None
    detailed_list.append((n_proceso, link_proceso, autoacao, parte_1, parte_2,
                          materia, exercicio, objeto, relator, relatorio_voto,
                          data_relatorio, acordao, data_acordao))

detailed_df = pd.DataFrame(detailed_list, columns=['n_proceso', 'link_proceso', 'autoacao', 'parte_1',
                                                   'parte_2', 'materia', 'exercicio', 'objeto', 'relator',
                                                   'relatorio_voto', 'data_relatorio', 'acordao', 'data_acordao'])
display(detailed_df)
detailed_df.to_csv('legal_br_stuffs.csv')
Result in terminal:
100% 2/2 [00:04<00:00, 1.78s/it]
100% 20/20 [00:07<00:00, 2.56it/s]
n_proceso link_proceso autoacao parte_1 parte_2 materia exercicio objeto relator relatorio_voto data_relatorio acordao data_acordao
0 18955/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=18955/989/20&offset=0 31/07/2020 ELVES SCIARRETTA CARREIRA PREFEITURA MUNICIPAL DE BRODOWSKI RECURSO ORDINARIO 2020 Recurso Ordinário Protocolado em anexo. EDGARD CAMARGO RODRIGUES https://www2.tce.sp.gov.br/arqs_juri/pdf/801385.pdf 20/01/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/801414.pdf 20/01/2021
1 13614/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=13614/989/18&offset=0 11/06/2018 PREFEITURA MUNICIPAL DE SERRA NEGRA RECURSO ORDINARIO 2014 Recurso Ordinário ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/797986.pdf 05/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/800941.pdf 05/02/2021
2 6269/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=6269/989/19&offset=0 19/02/2019 PREFEITURA MUNICIPAL DE TREMEMBE ADMISSAO DE PESSOAL - CONCURSO PROCESSO SELETIVO 2018 INTERESSADO: Rafael Varejão Munhos e outros. EDITAL Nº: 01/2017. CONCURSO PÚBLICO: 01/2017. None https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
3 14011/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14011/989/19&offset=0 11/06/2019 RUBENS EDUARDO DE SOUZA AROUCA PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
4 14082/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14082/989/19&offset=0 12/06/2019 PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário nos autos do TC n° 6269.989.19 - Admissão de pessoal - Concurso Público RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
5 14238/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14238/989/19&offset=0 13/06/2019 MARCELO VAQUELI PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
6 14141/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14141/989/20&offset=0 28/05/2020 PREFEITURA MUNICIPAL DE BIRIGUI CRISTIANO SALMEIRAO RECURSO ORDINARIO 2018 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
7 15371/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15371/989/19&offset=0 02/07/2019 PREFEITURA MUNICIPAL DE BIRIGUI ADMISSAO DE PESSOAL - TEMPO DETERMINADO 2018 INTERESSADOS: ADRIANA PEREIRA CRISTAL E OUTROS. PROCESSOS SELETIVOS/EDITAIS Nºs:002/2016, 004/2017, 05/2017, 06/2017,001/2018 e 002/2018. LEIS AUTORIZADORAS: Nº 5134/2009 e Nº 3946/2001. None https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
8 15388/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15388/989/20&offset=0 04/06/2020 MARIA ANGELICA MIRANDA FERNANDES RECURSO ORDINARIO 2018 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
9 12911/989/16 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=12911/989/16&offset=0 20/07/2016 MARCELO CANDIDO DE SOUZA PREFEITURA MUNICIPAL DE SUZANO RECURSO ORDINARIO 2016 Recurso Ordinário Ref. Atos de Admissão de Pessoal - Exercício 2012. objetivando o preenchimento temporário dos cargos de Médico Cardiologista 20h, Fotógrafo, Médico Clínico Geral 20lt, Médico Gineco DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814599.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814741.pdf 27/04/2021
10 1735/002/11 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=1735/002/11&offset=10 22/11/2011 FUNDACAO DE APOIO AOS HOSP VETERINARIOS DA UNESP ADMISSAO DE PESSOAL - TEMPO DETERMINADO 2010 ADMISSAO DE PESSOAL POR TEMPO DETERMINADO COM CONCURSO/PROCESSO SELETIVO ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/800893.pdf 21/01/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/800969.pdf 21/01/2021
11 23494/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=23494/989/18&offset=10 20/11/2018 HAMILTON LUIS FOZ RECURSO ORDINARIO 2018 Recurso Ordinário DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/816918.pdf 13/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817317.pdf 13/05/2021
12 24496/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=24496/989/19&offset=10 25/11/2019 PREFEITURA MUNICIPAL DE LORENA RECURSO ORDINARIO 2017 Recurso Ordinário em face de sentença proferida nos autos de TC 00006265.989.19-4 DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814660.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814805.pdf 27/04/2021
13 17110/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=17110/989/18&offset=10 03/08/2018 JORGE ABISSAMRA PREFEITURA MUNICIPAL DE FERRAZ DE VASCONCELOS RECURSO ORDINARIO 2018 Recurso Ordinário DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814633.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814774.pdf 27/04/2021
14 24043/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=24043/989/19&offset=10 18/11/2019 PREFEITURA MUNICIPAL DE IRAPURU RECURSO ORDINARIO 2018 Recurso ordinário ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817014.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817269.pdf 12/05/2021
15 2515/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=2515/989/20&offset=10 03/02/2020 PREFEITURA MUNICIPAL DE IPORANGA RECURSO ORDINARIO 2020 Recurso interposto em face da sentença proferida nos autos do TC 15791/989/19-7. ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817001.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817267.pdf 12/05/2021
16 1891/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=1891/989/20&offset=10 24/01/2020 PREFEITURA MUNICIPAL DE IPORANGA RECURSO ORDINARIO 2020 RECURSO ORDINÁRIO DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/802484.pdf 03/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/802620.pdf 03/02/2021
17 15026/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15026/989/20&offset=10 02/06/2020 DIXON RONAN CARVALHO PREFEITURA MUNICIPAL DE PAULINIA RECURSO ORDINARIO 2018 RECURSO ORDINÁRIO ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/802648.pdf 05/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/803361.pdf 05/02/2021
18 9070/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=9070/989/20&offset=10 09/03/2020 PREFEITURA MUNICIPAL DE FLORIDA PAULISTA RECURSO ORDINARIO 2017 Recurso Ordinário ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817006.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817296.pdf 12/05/2021
19 21543/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=21543/989/20&offset=10 11/09/2020 PREFEITURA MUNICIPAL DE JERIQUARA RECURSO ORDINARIO 2020 RECURSO ORDINÁRIO SIDNEY ESTANISLAU BERALDO https://www2.tce.sp.gov.br/arqs_juri/pdf/802997.pdf 13/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804511.pdf 13/02/2021
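To cover all 404 results (41 pages of 10), widen the range in the first loop, just as the adapted code in the question does:

for x in tqdm(range(0, 410, 10)):
    # same body as above
    ...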
If you will need coding in your career, I strongly suggest you start by building some foundational knowledge first, and then try to write or adapt other people's code.

How can I extract the content of an article from an HTML web page using Python and Beautiful Soup?

I am a novice with Python and I am trying web scraping as an exercise. I would like to scrape the content and the title of each article on a web page.
I have a problem with my code: I do not think it is very efficient and I would like to optimize it.
The page I am trying to scrape is https://www.ansa.it/sito/notizie/politica/politica.shtml
This is what I have done so far:
# libraries
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import urllib.request, sys, time
import csv
from csv import writer
import time
from datetime import datetime

r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
b = soup(r.content, 'lxml')

title = []
links = []
content = []

for c in b.findAll('h3', {'class': 'news-title'}):
    title.append(c.text.strip())

for c in b.findAll("h3", {"class": "news-title"}):
    links.append(c.a["href"])

for link in links:
    page = requests.get('https://www.ansa.it' + link)
    bsjop = soup(page.content)
    for n in bsjop.findAll('div', {'itemprop': 'articleBody'}):
        content.append(n.text.strip())
The problem is that my output is made of multiple links, multiple titles and multiple contents that do not match each other (for example, one article ends up with a title and a content that have nothing to do with each other).
If you know ways I can improve my code, it would be nice.
Thanks.
To get all article titles, URLs and texts into a pandas DataFrame you can use the next example (I used the tqdm module to get a nice progress bar):
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup

url = "https://www.ansa.it/sito/notizie/politica/politica.shtml"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for title in tqdm(soup.select("h3.news-title")):
    t = title.get_text(strip=True)
    u = title.a["href"]
    s = BeautifulSoup(
        requests.get("https://www.ansa.it" + u).content, "html.parser"
    )
    text = s.select_one('[itemprop="articleBody"]')
    text = text.get_text(strip=True, separator="\n") if text else ""
    all_data.append([t, u, text])

df = pd.DataFrame(all_data, columns=["Title", "URL", "Text"])
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice not shown here).
import requests
from bs4 import BeautifulSoup
import pandas as pd

news_list = []
r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
soup = BeautifulSoup(r.text, 'html.parser')
articles = soup.select('article.news')

for art in articles:
    try:
        title = art.select_one('h3').text.strip()
        if 'javascript:void(0);' in art.select('a')[0].get('href'):
            url = 'https://www.ansa.it' + art.select('a')[1].get('href')
        else:
            url = 'https://www.ansa.it' + art.select('a')[0].get('href')
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.select_one('div.news-txt').text.strip()
        print(f'retrieving {url}')
        news_list.append((title, content, url))
    except Exception as e:
        print(art.text.strip(), e)

df = pd.DataFrame(news_list, columns=['Title', 'Content', 'Url'])
print(df)
This will return some errors for some links on the page, which you will have to investigate and debug (and ask for help if you need it - it's an important part of the learning process), and a DataFrame with the articles successfully retrieved, which looks like this:
Title Content Url
0 Letta: 'Ora difficile ricomporre con M5s'. Mel... Partiti scossi dallo scioglimento anticipato d... https://www.ansa.it/sito/notizie/politica/2022...
1 L'emozione di Draghi: 'Ancheil cuore dei banch... "Certe volte anche il cuore dei banchieri cent... https://www.ansa.it/sito/notizie/politica/2022...
2 La giornata di Draghi in foto https://www.ansa.it/sito/photogallery/primopia...
3 Il timing del voto, liste entro un mese. A Fer... Le liste dei candidati entro un mese a partire... https://www.ansa.it/sito/notizie/politica/2022...
4 Si lavora sulla concorrenza, ipotesi stralcio ... Il DDL Concorrenza andrà in Aula alla Camera l... https://www.ansa.it/sito/notizie/economia/2022...
5 Le cifre del governo Draghi: 55 voti fiducia e... Una media di 7,4 leggi approvate ogni mese su ... https://www.ansa.it/sito/notizie/politica/2022...
6 I 522 giorni del governo Draghi LE FOTO L'arrivo di SuperMario, gli incontri, le allea... https://www.ansa.it/sito/photogallery/primopia...
7 Presidi, disappunto per le urne in autunno, ci... "C'è disappunto non preoccupazione per le urne... https://www.ansa.it/sito/notizie/politica/2022...
8 Ucraina: Di Maio,sostegno ricerca mercati alte... (ANSA) - ROMA, 22 LUG - Lo scoppio del conflit... https://www.ansa.it/sito/photogallery/primopia...
9 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
10 Oltre mille sindaci a sostegno di Draghi Nei giorni che attendono il mercoledì che deci... https://www.ansa.it/sito/notizie/politica/2022...
11 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
12 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
13 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
14 Di Maio, Conte sta compiendo una vendetta poli... Se le cose restano come sono oggi "Mario Dragh... https://www.ansa.it/sito/notizie/politica/2022...
15 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
16 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
17 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
18 Il discorso di Draghi al Senato 'Partiti, pron... "Siamo qui perché lo hanno chiesto gli italian... https://www.ansa.it/sito/notizie/politica/2022...
19 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
20 Draghi al Senato per una fiducia al buio. Prem... Draghi al bivio tra governo e crisi. Alle 9.30... https://www.ansa.it/sito/notizie/politica/2022...
21 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
22 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
You don't need two different loops if you are referring to the same element. Try the code below to save the titles and links.
for c in b.findAll('h3', {'class': 'news-title'}):
    title.append(c.text.strip())
    links.append(c.a["href"])
By combining the loops you make sure that the title and the link are scraped from the same element.
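Putting it together with the article body, here is a minimal self-contained sketch of that idea (the selectors are the ones from the question; ANSA's markup may of course differ or change, so treat this as an outline rather than a tested solution):

import requests
from bs4 import BeautifulSoup
import pandas as pd

base = 'https://www.ansa.it'
b = BeautifulSoup(requests.get(base + '/sito/notizie/politica/politica.shtml').content, 'lxml')

rows = []
for c in b.findAll('h3', {'class': 'news-title'}):
    title = c.text.strip()
    link = c.a["href"]
    # fetch the article page that belongs to this exact title/link pair
    article = BeautifulSoup(requests.get(base + link).content, 'lxml')
    body = article.find('div', {'itemprop': 'articleBody'})
    rows.append((title, link, body.text.strip() if body else ''))

df = pd.DataFrame(rows, columns=['Title', 'Link', 'Content'])
print(df)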

Extracting tags using select_one

I am trying to extract content within specific tags using CSS selectors in Python from this page:
https://scenarieconomici.it/page/898/
Specifically, I am interested in the title, date, author, category and summary.
I have tried as follows:
print(tag.select_one(".entry-title").text)
print(tag.select_one("span.meta-time").text)
print(tag.select_one("span.meta-author").text)
print(tag.select_one("span.category-item").text)
print(tag.find_next(class_="entry-content").text.strip())
Could you please tell me if they are right? I could provide you with the whole code I am using, if required.
Many thanks
After Wasif's answer below, I changed my code, but unfortunately it seems there is still a problem with the tags:
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def main(req, num):
    r = req.get("https://scenarieconomici.it/page/{}/".format(num))
    # r = req.get("https://www.imolaoggi.it/category/polit/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    # goal = [(x.time.text, x.h3.a.text, x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
    #         for x in soup.select("div.site-content")]
    for tag in soup.select('div', class_='entry-blog'):
        print(tag.find('span', class_='entry-title'))
        print(tag.find('span', class_='meta-time'))
        print(tag.find('span', class_='meta-author'))
        print(tag.find('span', class_='category-item'))
        print(tag.find_next(class_='entry-content'))
    return tag.find('span', class_='entry-title'), tag.find('span', class_='meta-time'), tag.find('span', class_='meta-author'), tag.find('span', class_='category-item'), tag.find_next(class_='entry-content')

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        # fs = [executor.submit(main, req, num) for num in range(1, 2937)]
        fs = [executor.submit(main, req, num) for num in range(1, 2)]
        allin = []
        for f in fs:
            allin.append(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Title", "Time", "Author", "Category", "Content"])
As a result, I am getting only None values.
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat
import re
import pandas as pd

def main(req, num):
    r = req.get("https://scenarieconomici.it/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    return [(
        x.select_one("h2.entry-title").text,
        x.select_one("span.meta-time").text,
        x.select_one("span.meta-author").a.text,
        x.select_one("span.category-item").a.text,
        x.select_one("div.entry-content").p.text
    )
        for x in soup.findAll("article", id=re.compile(r"post-\d+"))]

with ThreadPoolExecutor(max_workers=50) as executor:
    with requests.Session() as req:
        fs = executor.map(main, repeat(req), range(1, 899))
        final = []
        [final.extend(f) for f in fs]
        df = pd.DataFrame.from_records(
            final, columns=["Title", "Time", "Author", "Category", "Content"])
        print(df)
Output
Title Time ... Category Content
0 LA SOLUZIONE DEL TEAM BIDEN PER IL COVID: MORI... 11 Novembre 2020 ... attualita' Ezekiel Emanuel Biden ha gia fatto partire i s...
1 10 Procuratori Statali con Trump, contro la Pe... 11 Novembre 2020 ... attualita' I procuratori generali del Missouri, Alabama, ...
2 TAGLIARE L’IVA: rilanciare l’economia ed incre... 11 Novembre 2020 ... attualita' Ieri Salvini e Bagnai hanno presentato il pian...
3 Il Parlamento Europeo? Incrementa i vincoli su... 10 Novembre 2020 ... attualita' Lunedì in commissione ECON è  stato  discusso ...
4 Dove siete finiti, cari rigoristi dell’austeri... 10 Novembre 2020 ... analisi e studi Dove siete finiti, detrattori della spesa pubb...
... ... ... ... ... ...
17941 Analisi della CRISI attuale: la GUERRA incombe 12 Novembre 2011 ... Uncategorized 10 domande e 10 rispostePerche’ gli spread tra...
17942 SEMPLICE CRISI FINANZIARIA O TERZA GUERRA MOND... 12 Luglio 2011 ... Uncategorized Intervista immaginaria a Frank (che spiega per...
17943 SLIDING DOORS: “SIAMO L’UNICO PAESE DEL MONDO ... 12 Giugno 2011 ... Uncategorized Spesso siamo portati a criticare, a lamentarci...
17944 L’origine del DEBITO PUBBLICO Italiano 12 Giugno 2011 ... Uncategorized Spesso ci si chiede la vera origine del nostro...
17945 La caduta dei Giganti 12 Giugno 2011 ... Uncategorized Vorrei soffermarmi un’istante sui 3 paesi card...
[17946 rows x 5 columns]
Why not use .find() instead:
print(tag.find('span',class_='your class'))
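For example, applied to the fields from the question (a sketch only: it loops over the article tags the way the other answer does, and the class names are the ones used in the question, so they may need adjusting to the real markup):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://scenarieconomici.it/page/898/").content, 'html.parser')

for tag in soup.find_all('article'):
    print(tag.find(class_='entry-title'))
    print(tag.find('span', class_='meta-time'))
    print(tag.find('span', class_='meta-author'))
    print(tag.find('span', class_='category-item'))
    print(tag.find(class_='entry-content'))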

Scraping data from interactive website map

I am trying to scrape the geolocations from the 2 following websites:
https://zendantenneskaart.omgeving.vlaanderen.be/ --> for this one, I found the underlying source JSON file, so it was easy: https://www.mercator.vlaanderen.be/raadpleegdienstenmercatorpubliek/us/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=us:us_zndant_pnt&outputFormat=application/json
http://www.sites.bipt.be/index.php?language=EN --> for this one, I cannot find such a JSON file; moreover, I cannot find a way to scrape it using Beautiful Soup, since the visibility of the pins depends on the zoom level of the map
Any ideas to scrape all the geo locations for the second website?
You can use the URL http://www.sites.bipt.be/ajaxinterface.php and specify some huge range as the latitude/longitude parameters. That way you get all the data in one go.
For example:
import json
import requests
from html import unescape

url = 'http://www.sites.bipt.be/ajaxinterface.php'
data = {"action": "getSites",
        "latfrom": "-9999",
        "latto": "9999",
        "longfrom": "-9999",
        "longto": "9999",
        "LangSiteTable": "sitesfr"}

data = requests.post(url, data=data).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for d in data[:10]:  # <-- print only first 10 items
    print('{:<50}{:<50}{:<30}{:<40} {:.4f} {:.4f}'.format(d['Eigenaar1'], unescape(d['Locatie']), unescape(d['Adres']), unescape(d['PostcodeGemeente']), float(d['Longitude']), float(d['Latitude'])))

print()
print('Total items:', len(data))
Prints:
Orange Belgium: 203W1_2 Cité de la Bruyère Clos des Marronniers 201 1480 Tubize 4.2086 50.6810
Telenet: _AN0171A Watertoren Scheeveld 2870 Puurs 4.2718 51.0744
Telenet: _AN0235V E34 2290 Vorselaar 4.7578 51.2449
Orange Belgium: 148L1_6 Institut Provincial d'Enseignement Supérieur Rue du Commerce 14 4100 Seraing 5.5077 50.6130
Orange Belgium: 198L1_5 / 32198L1_1 / 42198L1_1 Lieu-dit 'Bièster' Thier de Coo 4970 Stavelot 5.8876 50.3859
Telenet: _NR1363A Route de Sovenne 5560 Houyet 4.9529 50.1997
Orange Belgium: 181R1_1 Route Rimbaut / Route Rimbaut 6890 Libin 5.1504 49.9989
Proximus: 80WAM_00 Rue de Hottleux 71 4950 Waimes 6.0879 50.4152
Orange Belgium: 013R1_8 Rue Saint-Michel 6870 Saint-Hubert 5.3666 50.0355
Proximus: 41BIA_00 Aéroport de Bierset batiment 56 Aérodrome 4460 Grâce-Hollogne 5.4584 50.6416
Total items: 8104
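If the goal is a file of geolocations rather than console output, the same JSON response can be dumped to CSV, for instance with pandas (a sketch; the field names are the ones used in the code above, and any other fields returned by the endpoint are simply left out):

import pandas as pd
import requests

url = 'http://www.sites.bipt.be/ajaxinterface.php'
payload = {"action": "getSites", "latfrom": "-9999", "latto": "9999",
           "longfrom": "-9999", "longto": "9999", "LangSiteTable": "sitesfr"}
records = requests.post(url, data=payload).json()

# every record is a plain dict, so the list converts straight into a DataFrame
df = pd.DataFrame(records)[['Eigenaar1', 'Locatie', 'Adres', 'PostcodeGemeente', 'Latitude', 'Longitude']]
df.to_csv('bipt_sites.csv', index=False)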

Scraping Yahoo Finance with Python3

I'm a complete newbie in scraping and I'm trying to scrape https://fr.finance.yahoo.com and I can't figure out what I'm doing wrong.
My goal is to scrape the index name, current level and the change (both in value and in %).
Here is the code I have used:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find("div",attrs={'data-reactid':'12'})
print(main_table)
links = main_table.find_all("li", class_=' D(ib) Bxz(bb) Bdc($seperatorColor) Mend(16px) BdEnd ')
print(links)
However, the print(links) comes out empty. Could someone please assist? Any help would be highly appreciated as I have been trying to figure this out for a few days now.
Although the better way to get all the fields is to parse and process the relevant script tag, this is one of the ways you can get them all.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com/'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, 'html.parser')

df = pd.DataFrame(columns=['Index Name', 'Current Level', 'Value', 'Percentage Change'])
for item in soup.select("[id='market-summary'] li"):
    index_name = item.select_one("a").contents[1]
    current_level = ''.join(item.select_one("a > span").text.split())
    value = ''.join(item.select_one("a")['aria-label'].split("ou")[1].split("points")[0].split())
    percentage_change = ''.join(item.select_one("a > span + span").text.split())
    df = df.append({'Index Name': index_name, 'Current Level': current_level, 'Value': value, 'Percentage Change': percentage_change}, ignore_index=True)
print(df)
The output looks like:
Index Name Current Level Value Percentage Change
0 CAC 40 4444,56 -0,88 -0,02%
1 Euro Stoxx 50 2905,47 0,49 +0,02%
2 Dow Jones 24438,63 -35,49 -0,15%
3 EUR/USD 1,0906 -0,0044 -0,40%
4 Gold future 1734,10 12,20 +0,71%
5 BTC-EUR 8443,23 161,79 +1,95%
6 CMC Crypto 200 185,66 4,42 +2,44%
7 Pétrole WTI 33,28 -0,64 -1,89%
8 DAX 11073,87 7,94 +0,07%
9 FTSE 100 5993,28 -21,97 -0,37%
10 Nasdaq 9315,26 30,38 +0,33%
11 S&P 500 2951,75 3,24 +0,11%
12 Nikkei 225 20388,16 -164,15 -0,80%
13 HANG SENG 22930,14 -1349,89 -5,56%
14 GBP/USD 1,2177 -0,0051 -0,41%
I think you need to fix your element selection.
For example the following code:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

main_table = soup.find(id="market-summary")
links = main_table.find_all("a")
for i in links:
    print(i.attrs["aria-label"])
Gives output text having index name, % change, change, and value:
CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points
Euro Stoxx 50 a augmenté de 0,28 % ou 8,16 points pour atteindre 2 913,14 points
Dow Jones a diminué de -0,63 % ou -153,98 points pour atteindre 24 320,14 points
EUR/USD a diminué de -0,49 % ou -0,0054 points pour atteindre 1,0897 points
Gold future a augmenté de 0,88 % ou 15,10 points pour atteindre 1 737,00 points
a augmenté de 1,46 % ou 121,30 points pour atteindre 8 402,74 points
CMC Crypto 200 a augmenté de 1,60 % ou 2,90 points pour atteindre 184,14 points
Pétrole WTI a diminué de -3,95 % ou -1,34 points pour atteindre 32,58 points
DAX a augmenté de 0,29 % ou 32,27 points pour atteindre 11 098,20 points
FTSE 100 a diminué de -0,39 % ou -23,18 points pour atteindre 5 992,07 points
Nasdaq a diminué de -0,30 % ou -28,25 points pour atteindre 9 256,63 points
S&P 500 a diminué de -0,43 % ou -12,62 points pour atteindre 2 935,89 points
Nikkei 225 a diminué de -0,80 % ou -164,15 points pour atteindre 20 388,16 points
HANG SENG a diminué de -5,56 % ou -1 349,89 points pour atteindre 22 930,14 points
GBP/USD a diminué de -0,34 % ou -0,0041 points pour atteindre 1,2186 points
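If you need those aria-label strings split back into name, percentage change, point change and level, a small regex over the pattern above does it (a sketch; it assumes the French wording "a augmenté de / a diminué de ... % ou ... points pour atteindre ... points" stays stable):

import re

label = "CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points"
pattern = re.compile(
    r"^(?P<name>.*?) a (?:augmenté|diminué) de (?P<pct>-?[\d\s,]+) % "
    r"ou (?P<change>-?[\d\s,]+) points pour atteindre (?P<level>[\d\s,]+) points$"
)
m = pattern.match(label)
if m:
    print(m.group("name"), m.group("pct").strip(), m.group("change").strip(), m.group("level").strip())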
Try the following CSS selector to get all the links.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

links = [link['href'] for link in soup.select("ul#market-summary a")]
print(links)
Output:
['/quote/^FCHI?p=^FCHI', '/quote/^STOXX50E?p=^STOXX50E', '/quote/^DJI?p=^DJI', '/quote/EURUSD=X?p=EURUSD=X', '/quote/GC=F?p=GC=F', '/quote/BTC-EUR?p=BTC-EUR', '/quote/^CMC200?p=^CMC200', '/quote/CL=F?p=CL=F', '/quote/^GDAXI?p=^GDAXI', '/quote/^FTSE?p=^FTSE', '/quote/^IXIC?p=^IXIC', '/quote/^GSPC?p=^GSPC', '/quote/^N225?p=^N225', '/quote/^HSI?p=^HSI', '/quote/GBPUSD=X?p=GBPUSD=X']
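Note that these hrefs are relative, so prepend the base URL before requesting them, e.g.:

full_links = ['https://fr.finance.yahoo.com' + href for href in links]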
