This is a subquestion of this one: Python associate urls's ids and url's titles in lists
I have this HTML script:
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">Monte le son</a>
<div class="rs-cell-details">
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">"Rubin_Steiner"</a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
class="ss-titre">Fare maohi</a>
How can I get this result with BeautifulSoup:
list_titre = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']]  # one sublist per id
I tried this:
f = urllib.urlopen(url)
page = f.read()
f.close()
soup = BeautifulSoup(page)
show = []
list_titre = []
list_url = []
for link in soup.findAll('a'):
    lien = link.get('href')
    if lien == None:
        lien = ""
    if "http://pluzz.francetv.fr/videos/" in lien:
        titre = link.text.strip()
        if "Voir cette vidéo" in titre:
            titre = ""
        if "Lire la vidéo" in titre:
            titre = ""
        list_titre.append(titre)
        list_url.append(lien)
My result is:
list_titre = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']
list_url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']
But the titles are not grouped by id.
Search for your links with a CSS selector to limit hits to just qualifying URLs.
Collect the links in a dictionary by URL; that way you can then process the information by sorting the dictionary keys:
from bs4 import BeautifulSoup

links = {}
soup = BeautifulSoup(page)
for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
    title = link.get_text().strip()
    if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
        url = link['href']
        links.setdefault(url, []).append(title)
The dict.setdefault() call sets an empty list for urls not yet encountered; this produces a dictionary with the URLs as keys, and the titles as a list of values per URL.
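If you need the exact nested-list shape from the question, one sublist per URL, you can take the dictionary's values in order (a small self-contained sketch using only the snippet from the question; dicts preserve insertion order on Python 3.7+):

```python
from bs4 import BeautifulSoup

page = '''\
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">Monte le son</a>
<div class="rs-cell-details">
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">"Rubin_Steiner"</a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
class="ss-titre">Fare maohi</a>
'''

links = {}
soup = BeautifulSoup(page, 'html.parser')
for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
    title = link.get_text().strip()
    if title:
        links.setdefault(link['href'], []).append(title)

# One sublist per URL, in the order the URLs first appear on the page
list_titre = list(links.values())
print(list_titre)
```

Note that the second title keeps the literal quote characters from the markup, just as in the demo output below.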
Demo:
>>> page = '''\
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">Monte le son</a>
... <div class="rs-cell-details">
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">"Rubin_Steiner"</a>
... <a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
... class="ss-titre">Fare maohi</a>
... '''
>>> links = {}
>>> soup = BeautifulSoup(page)
>>> for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
...     title = link.get_text().strip()
...     if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
...         url = link['href']
...         links.setdefault(url, []).append(title)
...
>>> from pprint import pprint
>>> pprint(links)
{'http://pluzz.francetv.fr/videos/ce_soir_ou_jamais_,101506826.html': [u'Ce soir (ou jamais !)',
u'"Qui est propri\xe9taire de quoi ? La propri\xe9t\xe9 mise \xe0 mal dans tous les domaines"'],
'http://pluzz.francetv.fr/videos/clip_locaux_,102890631.html': [u'Clips'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102152859.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102292937.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102365651.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/inspecteur_barnaby_,101972045.html': [u'Inspecteur Barnaby',
u'"La musique en h\xe9ritage"'],
'http://pluzz.francetv.fr/videos/le_lab_o_saison3_,101215383.html': [u'Le Lab.\xd4',
u'"Episode 22"',
u'Saison 3'],
'http://pluzz.francetv.fr/videos/monsieur_madame_saison1_,101970319.html': [u'Les Monsieur Madame',
u'"Musique"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html': [u'Monte le son !',
u'"Rubin Steiner"'],
'http://pluzz.francetv.fr/videos/music_explorer_saison1_,101215382.html': [u'Music Explorer : les chasseurs de sons',
u'"Episode 3/6"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/retour_a_goree_,101641108.html': [u'Retour \xe0 Gor\xe9e'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101507102.html': [u'Singe mi singe moi',
u'"Le chat"'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101777072.html': [u'Singe mi singe moi',
u'"L\'autruche"'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472310.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472336.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102721018.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216774.html': [u'T.N.T.'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216788.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/via_cultura_,101959892.html': [u'Via cultura',
u'"L\'Ochju, le Mauvais oeil"']}
I need to extract from the URL all the Nacimientos (births) for yesterday, today and tomorrow. I tried to extract all the <li> elements, but when a <div> appears it only extracts up to the <div>; I tried next_sibling and that didn't work either.
# Target page
url = "https://es.m.wikipedia.org/wiki/9_de_julio"

# Fetch the target URL
wikipedia2 = requests.get(url)
# If the status code is OK
if wikipedia2.status_code == 200:
    nacimientos2 = soup(wikipedia2.text, "lxml")
else:
    print("The page responded with an error", wikipedia2.status_code)

filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find('ul').find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
The main issue is that your selection targets only the first <ul> and its <li> elements. Since you are already working on a specific <section>, you can simply skip the <ul> step.
As one line with a list comprehension and CSS selectors:
yearList = [e.text[:4] for e in soup.select('section#mf-section-2 li')]
Or, based on your code, with anios = filtro.find_all('li'):
filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
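To see why searching the whole <section> works: find_all('li') descends through nested tags, so an intervening <div> does not stop it, while find('ul').find_all('li') only ever sees the first list. A self-contained sketch, with made-up markup standing in for the Wikipedia section:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the Wikipedia section: two <ul> lists
# separated by a <div> (e.g. an image box).
html = """
<section id="mf-section-2">
  <ul><li>1900: first</li><li>1901: second</li></ul>
  <div class="thumb">an image box</div>
  <ul><li>1902: third</li></ul>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("section", id="mf-section-2")

# Only the first <ul> -> misses the items after the <div>
first_ul = [li.text[:4] for li in section.find("ul").find_all("li")]
print(first_ul)   # ['1900', '1901']

# The whole section -> every <li>, regardless of intervening <div>s
all_li = [li.text[:4] for li in section.find_all("li")]
print(all_li)     # ['1900', '1901', '1902']
```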
I have the following HTML code to scrape:
<ul class="item-features">
<li>
<strong>Graphic Type:</strong> Dedicated Card
</li>
<li>
<strong>Resolution:</strong> 3840 x 2160
</li>
<li>
<strong>Weight:</strong> 4.40 lbs.
</li>
<li>
<strong>Color:</strong> Black
</li>
</ul>
I would like to print all the single tags inside the <ul> (Graphic Type, Resolution, Weight, etc.) in separate columns of a .csv file.
I've tried the following in Python:
import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup

url = 'https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'
Client = req(url)
pagina = Client.read()
Client.close()
pagina_soup = soup(pagina, "html.parser")
productes = pagina_soup.findAll("div", {"class": "item-container"})
producte = productes[0]
features = producte.findAll("ul", {"class": "item-features"})
features[0].text
And it displays all the features, but in one single column of the .csv:
'\nGraphic Type: Dedicated CardResolution: 3840 x 2160Weight: 4.40 lbs.Color: Black\nModel #: AERO 15 OLED SA-7US5020SH\nItem #: N82E16834233268\nReturn Policy: Standard Return Policy\n'
I don't know how to export them one by one. Please see my whole Python code:
import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup

# Link of the page to scrape
url = 'https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'

# Open a connection with the web page
Client = req(url)
# Offloads the content of the page into a variable
pagina = Client.read()
# Closes the client
Client.close()

# HTML parser
pagina_soup = soup(pagina, "html.parser")

# Grabs each product
productes = pagina_soup.findAll("div", {"class": "item-container"})

# Open a .csv file
filename = "ordinadors.csv"
f = open(filename, "w")

# Headers of the .csv file
headers = "Marca; Producte; PreuActual; PreuAnterior; Rebaixa; CostEnvio\n"

# Write the header
f.write(headers)

# Loop over all the products
for producte in productes:
    # Get the product brand
    marca_productes = producte.findAll("div", {"class": "item-info"})
    marca = marca_productes[0].div.a.img["title"]
    # Get the product name
    name = producte.a.img["title"]
    # Current price
    actual_productes = producte.findAll("li", {"class": "price-current"})
    preuActual = actual_productes[0].strong.text
    # Previous price
    try:
        preuAbans = producte.find("li", class_="price-was").next_element.strip()
    except:
        print("Not found")
    # Get the shipping costs
    costos_productes = producte.findAll("li", {"class": "price-ship"})
    # It is a list, so take the first element and clean it up
    costos = costos_productes[0].text.strip()
    # Writing the file
    f.write(marca + ";" + name.replace(",", " ") + ";" + preuActual + ";"
            + preuAbans + ";" + costos + "\n")
f.close()
keys = [x.find('strong').text for x in pagina_soup.find_all('li')]
values = [x.find('strong').next_sibling.strip() for x in pagina_soup.find_all('li')]
print(keys)
print(values)
out:
Out[6]: ['Graphic Type:', 'Resolution:', 'Weight:', 'Color:']
Out[7]: ['Dedicated Card', '3840 x 2160', '4.40 lbs.', 'Black']
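To get those pairs into separate columns of a .csv, you can feed the two lists to the csv module (a sketch using only the <ul> snippet from the question; on the live page you would build one row per product):

```python
import csv
import io
from bs4 import BeautifulSoup

# The <ul class="item-features"> fragment from the question.
html = """
<ul class="item-features">
  <li><strong>Graphic Type:</strong> Dedicated Card</li>
  <li><strong>Resolution:</strong> 3840 x 2160</li>
  <li><strong>Weight:</strong> 4.40 lbs.</li>
  <li><strong>Color:</strong> Black</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
keys = [li.find('strong').text for li in soup.find_all('li')]
values = [li.find('strong').next_sibling.strip() for li in soup.find_all('li')]

buf = io.StringIO()           # use open("ordinadors.csv", "w", newline="") for a real file
writer = csv.writer(buf, delimiter=';')
writer.writerow(keys)         # header row: one feature name per column
writer.writerow(values)       # one feature value per column
print(buf.getvalue())
```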
I am trying to scrape a phrase/author from the body of a URL. I can scrape the phrases but I don't know how to find the author and print it together with the phrase. Can you help me?
import urllib.request
from bs4 import BeautifulSoup

page_url = "https://www.pensador.com/frases/"
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, "html.parser")
for frase in soup.find_all("p", attrs={'class': 'frase fr'}):
    print(frase.text + '\n')
    # author = soup.find_all("span", attrs={'class': 'autor'})
    # print(author.text)
    # this is the author I need: the right author for each phrase
You can get to the parent of the p.frase.fr tag, which is a div, and get the author by selecting span.autor descending from that div:
In [1268]: for phrase in soup.select('p.frase.fr'):
      ...:     author = phrase.parent.select_one('span.autor')
      ...:     print(author.text.strip(), ': ', phrase.text.strip())
...:
Roberto Shinyashiki : Tudo o que um sonho precisa para ser realizado é alguém que acredite que ele possa ser realizado.
Paulo Coelho : Imagine uma nova história para sua vida e acredite nela.
Carlos Drummond de Andrade : Ser feliz sem motivo é a mais autêntica forma de felicidade.
...
...
Here I'm using a CSS selector via phrase.parent.select_one('span.autor'); you can of course use find instead:
phrase.parent.find('span', attrs={'class': 'autor'})
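The same idea as a self-contained sketch, with made-up markup in the shape the site uses (each p.frase.fr is assumed to share a parent <div> with its span.autor):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each phrase and its author share a parent <div>.
html = """
<div class="thought">
  <p class="frase fr">Imagine uma nova história para sua vida e acredite nela.</p>
  <span class="autor">Paulo Coelho</span>
</div>
<div class="thought">
  <p class="frase fr">Ser feliz sem motivo é a mais autêntica forma de felicidade.</p>
  <span class="autor">Carlos Drummond de Andrade</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
for phrase in soup.select('p.frase.fr'):
    # Climb to the wrapping <div>, then find the author inside it.
    author = phrase.parent.select_one('span.autor')
    pairs.append((author.text.strip(), phrase.text.strip()))

for author, phrase in pairs:
    print(author, ':', phrase)
```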
HTML:
<div>
Está en: <b>
Inicio /
Valle Del Cauca /
Cali /
Zona Sur /
Zona Sur /
<a>Los Naranjos Conjunto Campestre</a></b>
</div>
Unable to fetch all <a> tags inside <div> tag
My code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.fincaraiz.com.co/oceana-52/barranquilla/proyecto-nuevo-det-1041165.aspx')
soup = BeautifulSoup(page.content, 'html.parser')
first = soup.find('div' , 'breadcrumb left')
link = first.find('div')
a_link = link.findAll('a')
print (a_link)
The above code only prints the first <a> tag:
[Inicio]
Following are the output required from the above HTML
Valle Del Cauca
Cali
Zona Sur
Zona Sur
I'm not sure why it does not print anything after the '/' inside the <b> tag.
You can use the lxml parser instead; html.parser copes poorly with this page's malformed markup, while lxml recovers from it, so all the <a> tags end up inside the <b>:
soup = BeautifulSoup(page.content, 'lxml')
I'm trying to get some information from a site, put it in a list, and export that list to csv.
This is a part of the site; it repeats several times:
<img src="image.jpg" alt="Aclimação">
</a>
</div>
Clique na imagem para ampliar
</div>
<div class="colInfos">
<h4>Aclimação</h4>
<div class="addressInfo">
Rua Muniz de Souza, 1110<br>
Aclimação - São Paulo - SP<br>
01534-001<br>
<br>
(11) 3208-3418 / 2639-0173<br>
aclimacao.sp#escolas.com.br<br>
I want to get the image link, the name (h4), the address (inside addressInfo; each br-separated line should be a separate item in a list) and the email of each school (a href mailto:) on this site, and export it all to a csv file. This is how I'm trying to do it, but there is a problem: I don't know how to search inside the result object 'endereco'. How can I do this?
This is my code:
import urllib2
from BeautifulSoup import BeautifulSoup

url = urllib2.urlopen("http://www.fisk.com.br/unidades?pais=1&uf=&rg=&cid=&ba=&un=")
soup = BeautifulSoup(url)

#nomes = soup.findAll('h4')
dados = []
i = 1
for endereco in enderecos:
    text = ''.join(endereco.findAll(???))  # <- how can I search the br's inside this?
    dados[i] = text.encode('utf-8').strip()
    i = i +
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
It really works fine. All you have to do is replace
dados = []
i = 1
for endereco in enderecos:
    text = ''.join(endereco.findAll(text=True))
    dados[i] = text.encode('utf-8').strip()
    i = i +
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
with
dados = []
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
for endereco in enderecos:
    text = ''.join(endereco.findAll(text=True))
    dados.append(text.encode('utf-8').strip())
print dados
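With the modern bs4 package on Python 3, the same extraction can be sketched like this, using the colInfos fragment from the question (the surrounding markup is assumed); stripped_strings yields each br-separated line as a separate list item, ready for the csv module:

```python
import csv
import io
from bs4 import BeautifulSoup

# The colInfos fragment from the question (wrapper markup assumed).
html = """
<div class="colInfos">
  <h4>Aclimação</h4>
  <div class="addressInfo">
    Rua Muniz de Souza, 1110<br>
    Aclimação - São Paulo - SP<br>
    01534-001<br>
    <br>
    (11) 3208-3418 / 2639-0173<br>
    aclimacao.sp#escolas.com.br<br>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for info in soup.find_all('div', class_='colInfos'):
    nome = info.h4.get_text(strip=True)
    # stripped_strings skips whitespace-only nodes, so each text chunk
    # between <br> tags becomes one list item.
    linhas = list(info.find('div', class_='addressInfo').stripped_strings)
    rows.append([nome] + linhas)

buf = io.StringIO()  # use open("escolas.csv", "w", newline="") for a real file
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```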