Looking for child content with Beautifulsoup

Looking for child content with Beautifulsoup - python

I am trying to scrape a phrase/author from the body of a URL. I can scrape the phrases but I don't know how to find the author and print it together with the phrase. Can you help me?
import urllib.request
from bs4 import BeautifulSoup
page_url = "https://www.pensador.com/frases/"
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, "html.parser")
for frase in soup.find_all("p", attrs={'class': 'frase fr'}):
print(frase.text + '\n')
# author = soup.find_all("span", attrs={'class': 'autor'})
# print(author.text)
# this is the author that I need, for each phrase the right author

You can get to the parent of the p.frase.fr tag, which is a div, and get the author by selecting span.autor descending the div:
In [1268]: for phrase in soup.select('p.frase.fr'):
...: author = phrase.parent.select_one('span.autor')
...: print(author.text.strip(), ': ', phrase.text.strip())
...:
Roberto Shinyashiki : Tudo o que um sonho precisa para ser realizado é alguém que acredite que ele possa ser realizado.
Paulo Coelho : Imagine uma nova história para sua vida e acredite nela.
Carlos Drummond de Andrade : Ser feliz sem motivo é a mais autêntica forma de felicidade.
...
...
Here, I'm using the CSS selector by phrase.parent.select_one('span.autor'), you can obviously use find here:
phrase.parent.find('span', attrs={'class': 'autor'})

Related

How could I extract the content of an article from a html webpage using python Beautiful Soup?

I am a novice wiht python and I am trying to do webscraping as exercise. I would like to scrape the content, and the title of each article inside a web page .
I have a problem with my code because I do not think is very efficient and I would like to optimize it.
the page I am trying to scrape is https://www.ansa.it/sito/notizie/politica/politica.shtml
this is what I have done so far:
#libraries
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import urllib.request,sys,time
import csv
from csv import writer
import time
from datetime import datetime
r= requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
b= soup(r.content, 'lxml')
title=[]
links=[]
content=[]
for c in b.findAll('h3',{'class':'news-title'}):
title.append(c.text.strip())
for c in b.findAll("h3", {"class": "news-title"}):
links.append(c.a["href"])
for link in links:
page=requests.get('https://www.ansa.it'+link)
bsjop=soup(page.content)
for n in bsjop.findAll('div',{'itemprop': 'articleBody'}):
content.append(n.text.strip())
The problem is that my output is made of multiple links, multiple titles and multiple contents that do not match each other (like one article has title and a content that has nothing to do with it)
If you know ways that I can improve my code it would be nice
thanks

To get all articles titles, urls, texts into a Pandas DataFrame you can use next example (I used tqdm module to get nice progress bar):
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup
url = "https://www.ansa.it/sito/notizie/politica/politica.shtml"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for title in tqdm(soup.select("h3.news-title")):
t = title.get_text(strip=True)
u = title.a["href"]
s = BeautifulSoup(
requests.get("https://www.ansa.it" + u).content, "html.parser"
)
text = s.select_one('[itemprop="articleBody"]')
text = text.get_text(strip=True, separator="\n") if text else ""
all_data.append([t, u, text])
df = pd.DataFrame(all_data, columns=["Title", "URL", "Text"])
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice):

import requests
from bs4 import BeautifulSoup
import pandas as pd
news_list = []
r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
soup = BeautifulSoup(r.text, 'html.parser')
articles = soup.select('article.news')
for art in articles:
try:
title = art.select_one('h3').text.strip()
if 'javascript:void(0);' in art.select('a')[0].get('href'):
url = 'https://www.ansa.it' + art.select('a')[1].get('href')
else:
url = 'https://www.ansa.it' + art.select('a')[0].get('href')
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
content = soup.select_one('div.news-txt').text.strip()
print(f'retrieving {url}')
news_list.append((title, content, url))
except Exception as e:
print(art.text.strip(), e)
df = pd.DataFrame(news_list, columns = ['Title', 'Content', 'Url'])
print(df)
This will return some errors for some links on page, which you will have to investigate and debug (and ask for help if you need it - it's an important part of learning process), and a dataframe with the list successfully retrieved, which looks like this:
Title Content Url
0 Letta: 'Ora difficile ricomporre con M5s'. Mel... Partiti scossi dallo scioglimento anticipato d... https://www.ansa.it/sito/notizie/politica/2022...
1 L'emozione di Draghi: 'Ancheil cuore dei banch... "Certe volte anche il cuore dei banchieri cent... https://www.ansa.it/sito/notizie/politica/2022...
2 La giornata di Draghi in foto https://www.ansa.it/sito/photogallery/primopia...
3 Il timing del voto, liste entro un mese. A Fer... Le liste dei candidati entro un mese a partire... https://www.ansa.it/sito/notizie/politica/2022...
4 Si lavora sulla concorrenza, ipotesi stralcio ... Il DDL Concorrenza andrà in Aula alla Camera l... https://www.ansa.it/sito/notizie/economia/2022...
5 Le cifre del governo Draghi: 55 voti fiducia e... Una media di 7,4 leggi approvate ogni mese su ... https://www.ansa.it/sito/notizie/politica/2022...
6 I 522 giorni del governo Draghi LE FOTO L'arrivo di SuperMario, gli incontri, le allea... https://www.ansa.it/sito/photogallery/primopia...
7 Presidi, disappunto per le urne in autunno, ci... "C'è disappunto non preoccupazione per le urne... https://www.ansa.it/sito/notizie/politica/2022...
8 Ucraina: Di Maio,sostegno ricerca mercati alte... (ANSA) - ROMA, 22 LUG - Lo scoppio del conflit... https://www.ansa.it/sito/photogallery/primopia...
9 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
10 Oltre mille sindaci a sostegno di Draghi Nei giorni che attendono il mercoledì che deci... https://www.ansa.it/sito/notizie/politica/2022...
11 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
12 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
13 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
14 Di Maio, Conte sta compiendo una vendetta poli... Se le cose restano come sono oggi "Mario Dragh... https://www.ansa.it/sito/notizie/politica/2022...
15 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
16 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
17 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
18 Il discorso di Draghi al Senato 'Partiti, pron... "Siamo qui perché lo hanno chiesto gli italian... https://www.ansa.it/sito/notizie/politica/2022...
19 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
20 Draghi al Senato per una fiducia al buio. Prem... Draghi al bivio tra governo e crisi. Alle 9.30... https://www.ansa.it/sito/notizie/politica/2022...
21 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
22 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...

You don't need two different loops if you are referring to the same element. Try the below code to save the title and links.
for c in b.findAll('h3',{'class':'news-title'}):
title.append(c.text.strip())
links.append(c.a["href"])
By combining you will be sure that the title and link are scraped from the same element.

how to extract with beatifulsoup, problems with div

I need to extract from the url all the Nacimientos, years of yesterday today and tomorrow, I try to extract all the <li>, but when a <div> appears it only extracts up to the <div>, try next_sibling and it didn't work either.
# Página objetivo
url = "https://es.m.wikipedia.org/wiki/9_de_julio"
##Cantidad de articulos en español actuales. ##
# Obtener un requests de la URL objetivo
wikipedia2 = requests.get(url)
# Si el Status Code es OK!
if wikipedia2.status_code == 200:
nacimientos2 = soup(wikipedia2.text, "lxml")
else:
print("La página respondió con error", wikipedia.status_code)
filtro= nacimientos2.find("section", id="mf-section-2")
anios= filtro.find('ul').find_all('li')
lista2 = []
for data in anios:
lista2.append(data.text[:4])
lista2

Main issue is that the focus of your selection is the first <ul> and all its <li>, you can simply adjust the selection while skipping the <ul>, cause your working on a specific <section>.
As one line with list comprehension and css selectors`:
yearList = [e.text[:4] for e in soup.select('section#mf-section-2 li')]
or based on your code -> anios= filtro.find_all('li'):
filtro= nacimientos2.find("section", id="mf-section-2")
anios= filtro.find_all('li')
lista2 = []
for data in anios:
lista2.append(data.text[:4])
lista2

WebScraping in Python returning repeated data

I'm using the script below to retrieve property data for a college project. It is working without errors, but the dataframe has repeated values, that is, if I put it to fetch data from pages it repeats the same data from page 1 5 times, please help!
import requests, re, time, os, csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
# Inicializamos as listas para guardar as informações
link_imovel=[] # nesta lista iremos guardar a url
address=[] # nesta lista iremos guardar o endereço
neighbor=[] # nesta lista iremos guardar o bairro
anunciante=[] # nesta lista iremos guardar o anunciante
area=[] # nesta lista iremos guardar a area
tipo=[] # nesta lista iremos guardar o tipo de imóvel
room=[] # nesta lista iremos guardar a quantidade de quartos
bath=[] # nesta lista iremos guardar a quantidade de banheiros
park=[] # nesta lista iremos guardar a quantidade de vagas de garagem
price=[] # nesta lista iremos guardar o preço do imóvel
# Ele irá solicitar quantas páginas você deseja coletar
pages_number=int(input('How many pages? '))
# inicializa o tempo de execução
tic = time.time()
# Configure chromedriver
# para executar, é necessário que você baixe o chromedriver e deixe ele na mesma pasta de execução, ou mude o path
chromedriver = "./chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
time.sleep(15)
# Criando o loop entre as paginas do site
for page in range(1,pages_number+1):
link = 'https://www.vivareal.com.br/venda/minas-gerais/pocos-de-caldas/casa_residencial/?pagina='+str(page)+''
driver.get(link)
# Definimos um sleep time para não sobrecarregar o site
# coletamos todas as informações da página e transformamos em formato legivel
data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
# identificamos todos os itens de card de imóveis
soup = soup_complete_source.find(class_='results-list js-results-list')
# Web-Scraping
# para cada elemento no conjunto de cards, colete:
for line in soup.findAll(class_="js-card-selector"):
# colete o endereço completo e o bairro
try:
full_address=line.find(class_="property-card__address").text.strip()
address.append(full_address.replace('\n', '')) #Get all address
if full_address[:3]=='Rua' or full_address[:7]=='Avenida' or full_address[:8]=='Travessa' or full_address[:7]=='Alameda':
neighbor_first=full_address.strip().find('-')
neighbor_second=full_address.strip().find(',', neighbor_first)
if neighbor_second!=-1:
neighbor_text=full_address.strip()[neighbor_first+2:neighbor_second]
neighbor.append(neighbor_text) # Guarde na lista todos os bairros
else: # Bairro não encontrado
neighbor_text='-'
neighbor.append(neighbor_text) # Caso o bairro não seja encontrado
else:
get_comma=full_address.find(',')
if get_comma!=-1:
neighbor_text=full_address[:get_comma]
neighbor.append(neighbor_text) # Guarde na lista todos os bairros com problema de formatação provenientes do proprio website
else:
get_hif=full_address.find('-')
neighbor_text=full_address[:get_hif]
neighbor.append(neighbor_text)
# Coleta o link
full_link=line.find(class_='property-card__main-info').a.get('href')
link_imovel.append(full_link)
# Coleta o anunciante
full_anunciante=line.find(class_='property-card__account-link js-property-card-account-link').img.get('alt').title()
anunciante.append(full_anunciante)
# Coleta a área
full_area=line.find(class_="property-card__detail-value js-property-card-value property-card__detail-area js-property-card-detail-area").text.strip()
area.append(full_area)
# Coleta tipologia
full_tipo = line.find(class_='property-card__title js-cardLink js-card-title').text.split()[0]
full_tipo=full_tipo.replace(' ','')
full_tipo=full_tipo.replace('\n','')
tipo.append(full_tipo)
# Coleta numero de quartos
full_room=line.find(class_="property-card__detail-item property-card__detail-room js-property-detail-rooms").text.strip()
full_room=full_room.replace(' ','')
full_room=full_room.replace('\n','')
full_room=full_room.replace('Quartos','')
full_room=full_room.replace('Quarto','')
room.append(full_room) #Get apto's rooms
# Coleta numero de banheiros
full_bath=line.find(class_="property-card__detail-item property-card__detail-bathroom js-property-detail-bathroom").text.strip()
full_bath=full_bath.replace(' ','')
full_bath=full_bath.replace('\n','')
full_bath=full_bath.replace('Banheiros','')
full_bath=full_bath.replace('Banheiro','')
bath.append(full_bath) #Get apto's Bathrooms
# Coleta numero de vagas de garagem
full_park=line.find(class_="property-card__detail-item property-card__detail-garage js-property-detail-garages").text.strip()
full_park=full_park.replace(' ','')
full_park=full_park.replace('\n','')
full_park=full_park.replace('Vagas','')
full_park=full_park.replace('Vaga','')
park.append(full_park) #Get apto's parking lot
# Coleta preço
full_price=re.sub('[^0-9]','',line.find(class_="property-card__price js-property-card-prices js-property-card__price-small").text.strip())
price.append(full_price) #Get apto's parking lot
except:
continue
# fecha o chromedriver
driver.quit()
# cria um dataframe pandas e salva como um arquivo CSV
for i in range(0,len(neighbor)):
combinacao=[link_imovel[i],address[i],neighbor[i],anunciante[i],area[i],tipo[i],room[i],bath[i],park[i],price[i]]
df=pd.DataFrame(combinacao)
with open('VivaRealData.csv', 'a', encoding='utf-16', newline='') as f:
df.transpose().to_csv(f, encoding='iso-8859-1', header=False)
# Tempo de execução
toc = time.time()
get_time=round(toc-tic,3)
print('Finished in ' + str(get_time) + ' seconds')
print(str(len(price))+' results!')
it seems to me that the "for line in soup.findAll" never pops, I've tried everything but I always get data from the first page.

Indeed the URL does return the same results regardless of the page number requested. It also returns the same information if requests is used avoiding the huge overhead of using Selenium.
A better (and much faster) approach is to access all of the data directly from the site's JSON API.
The following shows you a possible starting point. All of the data is inside data, you just need to find the information you want inside it and access it. I suggest you print(data) and use a tool to format it better.
import requests, re, time, os, csv
# Ele irá solicitar quantas páginas você deseja coletar
#pages_number = int(input('How many pages? '))
pages_number = 5
# inicializa o tempo de execução
tic = time.time()
sess = requests.Session()
params = {
'addressCity' : 'Poços de Caldas',
'addressLocationId' : 'BR>Minas Gerais>NULL>Pocos de Caldas',
'addressNeighborhood' : '',
'addressState' : 'Minas Gerais',
'addressCountry' : 'Brasil',
'addressStreet' : '',
'addressZone' : '',
'addressPointLat' : '-21.7854',
'addressPointLon' : '-46.561934',
'business' : 'SALE',
'facets' : 'amenities',
'unitTypes' : 'HOME',
'unitSubTypes' : 'UnitSubType_NONE,SINGLE_STOREY_HOUSE,VILLAGE_HOUSE,KITNET',
'unitTypesV3' : 'HOME',
'usageTypes' : 'RESIDENTIAL',
'listingType' : 'USED',
'parentId' : 'null',
'categoryPage' : 'RESULT',
'includeFields' : 'search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount),page,seasonalCampaigns,fullUriFragments,nearby(search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount)),expansion(search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount)),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones,phones),developments(search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount)),owners(search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount))',
'size' : '100',
'from' : '144',
'q' : '',
'developmentsSize' : '5',
'__vt' : '',
'levels' : 'CITY,UNIT_TYPE',
'ref' : '/venda/minas-gerais/pocos-de-caldas/casa_residencial/',
'pointRadius' : '',
'isPOIQuery' : '',
}
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
'x-domain': 'www.vivareal.com.br',
}
results = 0
with open('VivaRealData.csv', 'w', newline='', encoding='utf-16') as f_output:
csv_output = csv.writer(f_output)
# Criando o loop entre as paginas do site
for page in range(pages_number+1):
print(f"Page {page+1}")
link = 'https://glue-api.vivareal.com/v2/listings'
params['from'] = f"{page * 100}"
req = sess.get(link, headers=headers, params=params)
data = req.json()
for listing in data['search']['result']['listings']:
href = listing['link']['href']
street = listing['listing']['address'].get('street', '').strip()
bedrooms = listing['listing']['bedrooms'][0]
bathrooms = listing['listing']['bathrooms'][0]
price = listing['listing']['pricingInfos'][0]['price']
row = [href, street, bedrooms, bathrooms, price]
csv_output.writerow(row)
results += 1
# Tempo de execução
toc = time.time()
get_time=round(toc-tic,3)
print(f'Finished in {get_time} seconds')
print(f'{results} results!')
For this example, it is hard coded to 5 pages and returns 593 results in about 6 seconds.
Using Pandas might be a bit overkill here as the data can be written a row at a time directly to your output CSV file.
How was this solved?
Your best friend here is your browser's network dev tools. With this you can watch the requests made to obtain the information. The normal process flow is the initial HTML page is downloaded, this runs the javascript and requests more data to further fill the page.
The trick is to first locate where the data you want is (often returned as JSON), then determine what you need to recreate the parameters needed to make the request for it.
Approaches using Selenium allow the javascript to work, but most times this is not needed as it is just making requests and formatting the data for display.

How to edit Python code to loop the request to extract information from list

I'm only a couple of weeks into learning Python, and I'm trying to extract specific info from a list (events). I've been able to call the list and extract specific lines (info for single event), but the objective is running the program and extracting the information from the entirety of the called list (info from all of the events).
Among others, my best guesses so far have been along the lines of:
one_a_tag = soup.findAll('a')[22:85]
and
one_a_tag = soup.findAll('a')[22+1]
But I come up with these errors:
TypeError Traceback (most recent call last)
<ipython-input-15-ee19539fbb00> in <module>
11 soup.findAll('a')
12 one_a_tag = soup.findAll('a')[22:85]
---> 13 link = one_a_tag['href']
14 'https://arema.mx' + link
15 eventUrl = ('https://arema.mx' + link)
TypeError: list indices must be integers or slices, not str
And
TypeError Traceback (most recent call last)
<ipython-input-22-81d98bcf8fd8> in <module>
10 soup
11 soup.findAll('a')
---> 12 one_a_tag = soup.findAll('a')[22]+1
13 link = one_a_tag['href']
14 'https://arema.mx' + link
TypeError: unsupported operand type(s) for +: 'Tag' and 'int'
This is the entire code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://arema.mx/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
soup
soup.findAll('a')
one_a_tag = soup.findAll('a')[22]
link = one_a_tag['href']
'https://arema.mx' + link
eventUrl = ('https://arema.mx' + link)
print(eventUrl)
def getAremaTitulo(eventUrl):
res = requests.get(eventUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body > div.body > div.ar.eventname')
return elems[0].text.strip()
def getAremaInfo(eventUrl):
res = requests.get(eventUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body > div.body > div.event-header')
return elems[0].text.strip()
titulo = getAremaTitulo(eventUrl)
print('Nombre de evento: ' + titulo)
info = getAremaInfo(eventUrl)
print('Info: ' + info)
time.sleep(1)
I'm sure there may be some redundancies in the code, but what I'm most keen on solving is creating a loop to extract the specific info I'm looking for from all of the events. What do I need to add to get there?
Thanks!

To get all information about events, you can use this script:
import requests
from bs4 import BeautifulSoup
url = 'https://arema.mx/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for event_link in soup.select('#events a.event'):
u = 'https://arema.mx' + event_link['href']
s = BeautifulSoup(requests.get(u).content, 'html.parser')
event_name = s.select_one('.eventname').get_text(strip=True)
event_info = s.select_one('.event-header').text.strip()
print(event_name)
print(event_info)
print('-' * 80)
Prints:
...
--------------------------------------------------------------------------------
NOCHE BOHEMIA <A PIANO Y GUITARRA>
"Freddy González y Víctor Freez dos amigos que al
paso del tiempo hermanaron sus talentos para crear un concepto musical cálido y
acústico entre cuerdas y teclas haciéndonos vibrar entre una línea de canciones
de ayer y hoy. Rescatando las bohemias que tantos recuerdos y encuentros nos han
generado a lo largo del tiempo.
 Precio: $69*ya incluye cargo de servicio.Fecha: Sábado 15 de agosto 20:00 hrsTransmisión en vivo por Arema LiveComo ingresar a ver la presentación.·         Dale clic en Comprar  y Selecciona tu acceso.·         Elije la forma de pago que más se te facilite y finaliza la compra.·         Te llegara un correo electrónico con la confirmación de compra y un liga exclusiva para ingresar a la transmisión el día seleccionado únicamente.La compra de tu boleto es un apoyo para el artista.Importante:  favor de revisar tu correo en bandeja de entrada, no deseados o spam ya que los correos en ocasiones son enviados a esas carpetas.
--------------------------------------------------------------------------------
...

Python BeautifulSoup extracting titles according to id

This is a subquestion of this one: Python associate urls's ids and url's titles in lists
I have this HTML script:
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">Monte le son</a>
<div class="rs-cell-details">
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">"Rubin_Steiner"</a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
class="ss-titre">Fare maohi</a>
How can I do to have this result with BeautifulSoup:
list_titre = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']] #one sublist by id
I tried this:
f = urllib.urlopen(url)
page = f.read()
f.close()
soup = BeautifulSoup(page)
show=[]
list_titre=[]
list_url=[]
for link in soup.findAll('a'):
lien = link.get('href')
if lien == None:
lien = ""
if "http://pluzz.francetv.fr/videos/" in lien:
titre = (link.text.strip())
if "Voir cette vidéo" in titre:
titre = ""
if "Lire la vidéo" in titre:
titre = ""
list_titre.append(titre)
list_url.append(lien)
My result is:
list_titre = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']
list_url = [http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html, http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html]
But "titre" is not sorted by id.

Search for your links with a CSS selector to limit hits to just qualifying URLs.
Collect the links in a dictionary by URL; that way you can then process the information by sorting the dictionary keys:
from bs4 import BeautifulSoup
links = {}
soup = BeautifulSoup(page)
for link in soup.select('a[href^=http://pluzz.francetv.fr/videos/]'):
title = link.get_text().strip()
if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
url = link['href']
links.setdefault(url, []).append(title)
The dict.setdefault() call sets an empty list for urls not yet encountered; this produces a dictionary with the URLs as keys, and the titles as a list of values per URL.
Demo:
>>> page = '''\
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">Monte le son</a>
... <div class="rs-cell-details">
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">"Rubin_Steiner"</a>
... <a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
... class="ss-titre">Fare maohi</a>
... '''
>>> links = {}
>>> soup = BeautifulSoup(page)
>>> for link in soup.select('a[href^=http://pluzz.francetv.fr/videos/]'):
... title = link.get_text().strip()
... if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
... url = link['href']
... links.setdefault(url, []).append(title)
...
>>> from pprint import pprint
>>> pprint(links)
{'http://pluzz.francetv.fr/videos/ce_soir_ou_jamais_,101506826.html': [u'Ce soir (ou jamais !)',
u'"Qui est propri\xe9taire de quoi ? La propri\xe9t\xe9 mise \xe0 mal dans tous les domaines"'],
'http://pluzz.francetv.fr/videos/clip_locaux_,102890631.html': [u'Clips'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102152859.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102292937.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102365651.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/inspecteur_barnaby_,101972045.html': [u'Inspecteur Barnaby',
u'"La musique en h\xe9ritage"'],
'http://pluzz.francetv.fr/videos/le_lab_o_saison3_,101215383.html': [u'Le Lab.\xd4',
u'"Episode 22"',
u'Saison 3'],
'http://pluzz.francetv.fr/videos/monsieur_madame_saison1_,101970319.html': [u'Les Monsieur Madame',
u'"Musique"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html': [u'Monte le son !',
u'"Rubin Steiner"'],
'http://pluzz.francetv.fr/videos/music_explorer_saison1_,101215382.html': [u'Music Explorer : les chasseurs de sons',
u'"Episode 3/6"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/retour_a_goree_,101641108.html': [u'Retour \xe0 Gor\xe9e'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101507102.html': [u'Singe mi singe moi',
u'"Le chat"'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101777072.html': [u'Singe mi singe moi',
u'"L\'autruche"'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472310.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472336.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102721018.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216774.html': [u'T.N.T.'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216788.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/via_cultura_,101959892.html': [u'Via cultura',
u'"L\'Ochju, le Mauvais oeil"']}

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Looking for child content with Beautifulsoup - python

Related

How could I extract the content of an article from a html webpage using python Beautiful Soup?

how to extract with beatifulsoup, problems with div

WebScraping in Python returning repeated data

How to edit Python code to loop the request to extract information from list

Python BeautifulSoup extracting titles according to id

Categories

Resources