Scrape information within <a href> and <span> - python

I need to scrape journalists' names and their newspapers from this website:
https://www.politicasufacebook.it/giornalisti/
Specifically, I want the <a href> content (the journalist's name) and the <span> content (the newspaper's name).
For example, for Andrea Scanzi, the name appears as the text of an <a> tag:
Andrea Scanzi
and the newspaper, Il Fatto Quotidiano, appears in a <span>:
<span style="font-size:13px;line-height:25px"> Il Fatto Quotidiano</span>
I have written the following:
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:  # use a session object for efficiency of TCP re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://www.politicasufacebook.it/giornalisti/')
    soup = bs(r.content, 'lxml')
but I do not know how to continue in order to extract such information.

You can use soup.find_all with the desired tag and attributes.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.politicasufacebook.it/giornalisti/')
soup = BeautifulSoup(r.content, 'lxml')

# journalist names sit in <a> tags with these style/target attributes,
# newspaper names in <span> tags with this style attribute
journalists = soup.find_all('a', {'style': 'color:#003060', 'target': '_blank'})
newspapers = soup.find_all('span', {'style': 'font-size:13px;line-height:25px'})

for i, v in enumerate(journalists):
    print(v.text.strip() + ' - ' + newspapers[i].text.strip())
Output:
Roberto Saviano - La Repubblica
Marco Travaglio - Il Fatto Quotidiano
Enrico Mentana - La7
Andrea Scanzi - Il Fatto Quotidiano
Massimo Gramellini - Corriere Della Sera
Nicola Porro - Rete 4
Salvo Sottile - Rai1
Carmelo Abbate - Storie Nere
Gad Lerner - autonomo
Michele Serra - La Repubblica
...
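Note that indexing newspapers[i] assumes both result sets have the same length; if the page ever had fewer <span> matches than <a> matches, it would raise an IndexError. A slightly more defensive sketch of the same loop pairs the two lists with zip:
# zip pairs the two result sets positionally and stops at the shorter one,
# so a missing newspaper span cannot raise an IndexError
for journalist, newspaper in zip(journalists, newspapers):
    print(journalist.text.strip() + ' - ' + newspaper.text.strip())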


Beautiful Soup Find href text based on partial attribute value

I am trying to identify tags in an HTML document based on part of an attribute value.
I'm interested in any 'a' under a 'tr' tag whose href attribute starts with or contains:
"AnnoncesImmobilier.asp?rech_cod_pay="
HTML source :
<tr bgcolor="#f1efe2" class="Tableau1" valign="middle">
<td bgcolor="#294a73" height="20"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
<td> Berge Du Lac </td>
<td bgcolor="#294a73"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
<td onmouseover="return escape('<b>Rubrique</b> : Offres<br/><b>Nature</b> : Terrain<br/><b>Type</b> : Terrain nu');" style="CURSOR:pointer;"> Terrain</td>
will give ad_title = "Berge Du Lac"
In the source HTML, each "tr" tag with class "Tableau1" contains one ad, with nested td and a tags for title, price, description, etc.
Below is my code:
import re
import requests
from bs4 import BeautifulSoup

# The URL to get data from
URL = 'http://www.tunisie-annonce.com/AnnoncesImmobilier.asp'
data = requests.get(URL)
soup = BeautifulSoup(data.content, "html.parser")

# Variable to extract the ads
ads = soup.find_all("tr", {"class": "Tableau1"})
for ad in ads:
    ad_title = ads.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text()
    print(ad_title)
ad_title = ads.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text() is the last snippet I tried for retrieving the text, but neither this nor my previous code worked for me.
How can I proceed?
I'm interested in any 'a' under a 'tr' tag whose href attribute starts with or contains:
"AnnoncesImmobilier.asp?rech_cod_pay="
You can make your selection more specific with CSS selectors:
soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')
To get a list of all the link texts, just iterate over the result set:
[row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')]
Using set() you can filter the list to unique values:
set([row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')])
Output
{'Hammam Lif', 'El Manar 2', 'El Menzah 8', 'Chotrana 1', 'Rades', 'Sousse Corniche', 'Cite De La Sant', 'Sousse', 'Bizerte', 'Ain Zaghouan', 'Hammamet', 'La Soukra', 'Riadh Landlous', 'El Menzah 5', 'Khezama Ouest', 'Montplaisir', 'Sousse Khezama', 'Hergla', 'El Ouerdia', 'Hammam Sousse', 'El Menzah 1', 'Cite Ennasr 2', 'Bab El Khadra'}
To extract more than just the link text you can do the following:
import pandas as pd

data = []
for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])'):
    d = list(row.stripped_strings)
    d.append(row.a['href'])
    data.append(d)
pd.DataFrame(data)
Output
Région          | Nature   | Type        | Texte annonce                 | Prix    | Modifiée   | Link
Sousse Corniche | Location | App. 3 pièc | Magnifique appartement s2 fac | 1 000   | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12114&rech_cod_loc=1211413
Riadh Landlous  | Location | App. 4 pièc | S3 situé au 1ér étage à riadh | 850     | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020135
Khezama Ouest   | Vente    | App. 4 pièc | Magnifique s3 khzema pré      | 250 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12112&rech_cod_loc=1211209
El Menzah 8     | Location | App. 1 pièc | Studio meublé manzah 8 vv     | 600     | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020126
Hergla          | Vente    | App. 3 pièc | Appartement s 2 vue mer       | 300 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12105&rech_cod_loc=1210502
...
You do not need a regex here; you can use:
titles = []
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.append(link.get_text())
print(titles)
If you want a unique collection of titles, use a set:
titles = set()
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.add(link.get_text())
In both cases, href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h makes sure the href attribute exists and contains the string AnnoncesImmobilier.asp?rech_cod_pay=.
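If you also need the link target alongside each title, the same lambda filter can collect (title, href) pairs; a minimal sketch building on the ads result set from the question:
# collect (title, href) pairs instead of titles alone;
# the lambda filter is the same one used above
pairs = []
for ad in ads:
    for link in ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h):
        pairs.append((link.get_text(strip=True), link['href']))
print(pairs)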

Looking for child content with Beautifulsoup

I am trying to scrape phrases and their authors from the body of a URL. I can scrape the phrases, but I don't know how to find each author and print it together with its phrase. Can you help me?
import urllib.request
from bs4 import BeautifulSoup

page_url = "https://www.pensador.com/frases/"
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, "html.parser")

for frase in soup.find_all("p", attrs={'class': 'frase fr'}):
    print(frase.text + '\n')
    # author = soup.find_all("span", attrs={'class': 'autor'})
    # print(author.text)
    # this is the author that I need; for each phrase, the right author
You can go to the parent of the p.frase.fr tag, which is a div, and get the author by selecting the span.autor descendant of that div:
In [1268]: for phrase in soup.select('p.frase.fr'):
      ...:     author = phrase.parent.select_one('span.autor')
      ...:     print(author.text.strip(), ': ', phrase.text.strip())
      ...:
Roberto Shinyashiki : Tudo o que um sonho precisa para ser realizado é alguém que acredite que ele possa ser realizado.
Paulo Coelho : Imagine uma nova história para sua vida e acredite nela.
Carlos Drummond de Andrade : Ser feliz sem motivo é a mais autêntica forma de felicidade.
...
...
Here, I'm selecting with the CSS selector in phrase.parent.select_one('span.autor'); you can of course use find instead:
phrase.parent.find('span', attrs={'class': 'autor'})
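Putting it together, a minimal self-contained sketch that collects (author, phrase) pairs into a list, assuming the class names from the question still match the page:
import urllib.request
from bs4 import BeautifulSoup

page_url = "https://www.pensador.com/frases/"
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, "html.parser")

pairs = []
for phrase in soup.select('p.frase.fr'):
    author = phrase.parent.select_one('span.autor')
    # guard against a phrase block that has no author span
    author_text = author.text.strip() if author else ''
    pairs.append((author_text, phrase.text.strip()))

for author_text, phrase_text in pairs:
    print(author_text, ':', phrase_text)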

unable to fetch full data inside <div>

HTML:
<div>
Está en: <b>
Inicio /
Valle Del Cauca /
Cali /
Zona Sur /
Zona Sur /
<a>Los Naranjos Conjunto Campestre</a></b>
</div>
Unable to fetch all <a> tags inside <div> tag
My code:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.fincaraiz.com.co/oceana-52/barranquilla/proyecto-nuevo-det-1041165.aspx')
soup = BeautifulSoup(page.content, 'html.parser')
first = soup.find('div', 'breadcrumb left')
link = first.find('div')
a_link = link.findAll('a')
print(a_link)
The above code only prints the first <a> tag:
[Inicio]
The required output from the above HTML is:
Valle Del Cauca
Cali
Zona Sur
Zona Sur
I'm not sure why nothing after the '/' inside the <b> tag is printed.
You can use the lxml parser instead; html.parser normalizes/prettifies the actual source before BS4 parses it, which loses the remaining <a> tags here.
soup = BeautifulSoup(page.content, 'lxml')
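A minimal sketch of the full fix, assuming lxml is installed and that on the live page the remaining breadcrumb entries are links, as the question implies:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.fincaraiz.com.co/oceana-52/barranquilla/proyecto-nuevo-det-1041165.aspx')
# lxml is more tolerant of this page's malformed markup than html.parser
soup = BeautifulSoup(page.content, 'lxml')

first = soup.find('div', 'breadcrumb left')
link = first.find('div')
# with lxml, the <a> tags inside the <b> element survive parsing
for a in link.findAll('a'):
    print(a.text.strip())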

Python BeautifulSoup extracting titles according to id

This is a subquestion of this one: Python associate urls's ids and url's titles in lists
I have this HTML script:
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">Monte le son</a>
<div class="rs-cell-details">
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">"Rubin_Steiner"</a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
class="ss-titre">Fare maohi</a>
How can I get this result with BeautifulSoup:
list_titre = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']] #one sublist by id
I tried this:
f = urllib.urlopen(url)
page = f.read()
f.close()
soup = BeautifulSoup(page)
show = []
list_titre = []
list_url = []
for link in soup.findAll('a'):
    lien = link.get('href')
    if lien == None:
        lien = ""
    if "http://pluzz.francetv.fr/videos/" in lien:
        titre = (link.text.strip())
        if "Voir cette vidéo" in titre:
            titre = ""
        if "Lire la vidéo" in titre:
            titre = ""
        list_titre.append(titre)
        list_url.append(lien)
My result is:
list_titre = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']
list_url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']
But "titre" is not sorted by id.
Search for your links with a CSS selector to limit hits to just qualifying URLs.
Collect the links in a dictionary by URL; that way you can then process the information by sorting the dictionary keys:
from bs4 import BeautifulSoup
links = {}
soup = BeautifulSoup(page)
for link in soup.select('a[href^=http://pluzz.francetv.fr/videos/]'):
    title = link.get_text().strip()
    if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
        url = link['href']
        links.setdefault(url, []).append(title)
The dict.setdefault() call sets an empty list for urls not yet encountered; this produces a dictionary with the URLs as keys, and the titles as a list of values per URL.
Demo:
>>> page = '''\
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">Monte le son</a>
... <div class="rs-cell-details">
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">"Rubin_Steiner"</a>
... <a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
... class="ss-titre">Fare maohi</a>
... '''
>>> links = {}
>>> soup = BeautifulSoup(page)
>>> for link in soup.select('a[href^=http://pluzz.francetv.fr/videos/]'):
...     title = link.get_text().strip()
...     if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
...         url = link['href']
...         links.setdefault(url, []).append(title)
...
>>> from pprint import pprint
>>> pprint(links)
{'http://pluzz.francetv.fr/videos/ce_soir_ou_jamais_,101506826.html': [u'Ce soir (ou jamais !)',
u'"Qui est propri\xe9taire de quoi ? La propri\xe9t\xe9 mise \xe0 mal dans tous les domaines"'],
'http://pluzz.francetv.fr/videos/clip_locaux_,102890631.html': [u'Clips'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102152859.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102292937.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102365651.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/inspecteur_barnaby_,101972045.html': [u'Inspecteur Barnaby',
u'"La musique en h\xe9ritage"'],
'http://pluzz.francetv.fr/videos/le_lab_o_saison3_,101215383.html': [u'Le Lab.\xd4',
u'"Episode 22"',
u'Saison 3'],
'http://pluzz.francetv.fr/videos/monsieur_madame_saison1_,101970319.html': [u'Les Monsieur Madame',
u'"Musique"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html': [u'Monte le son !',
u'"Rubin Steiner"'],
'http://pluzz.francetv.fr/videos/music_explorer_saison1_,101215382.html': [u'Music Explorer : les chasseurs de sons',
u'"Episode 3/6"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/retour_a_goree_,101641108.html': [u'Retour \xe0 Gor\xe9e'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101507102.html': [u'Singe mi singe moi',
u'"Le chat"'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101777072.html': [u'Singe mi singe moi',
u'"L\'autruche"'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472310.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472336.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102721018.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216774.html': [u'T.N.T.'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216788.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/via_cultura_,101959892.html': [u'Via cultura',
u'"L\'Ochju, le Mauvais oeil"']}

Search inside results object - Python, BeautifulSoup

I'm trying to get some information from a site, put it in a list, and export this list to CSV.
This is a part of the site's HTML; it repeats several times.
<img src="image.jpg" alt="Aclimação">
</a>
</div>
Clique na imagem para ampliar
</div>
<div class="colInfos">
<h4>Aclimação</h4>
<div class="addressInfo">
Rua Muniz de Souza, 1110<br>
Aclimação - São Paulo - SP<br>
01534-001<br>
<br>
(11) 3208-3418 / 2639-0173<br>
aclimacao.sp#escolas.com.br<br>
I want to get the image link, the name (h4), the address (inside addressInfo; each br-separated line should be a separate item in a list) and the email of each school (the a href mailto:) on this site, and export them to a CSV file. This is how I'm trying to do it, but there is a problem: I don't know how to search inside the results object 'endereco'. How can I do this?
This is my code:
import urllib2
from BeautifulSoup import BeautifulSoup

url = urllib2.urlopen("http://www.fisk.com.br/unidades?pais=1&uf=&rg=&cid=&ba=&un=")
soup = BeautifulSoup(url)
#nomes = soup.findAll('h4')
dados = []
i = 1
for endereco in enderecos:
    text = ''.join(endereco.findAll(???))  # <- how can I search the br's inside this?
    dados[i] = text.encode('utf-8').strip()
    i = i +
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
It really works fine. All you have to do is replace
dados = []
i = 1
for endereco in enderecos:
    text = ''.join(endereco.findAll(text=True))
    dados[i] = text.encode('utf-8').strip()
    i = i +
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
with
dados = []
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
for endereco in enderecos:
    text = ''.join(endereco.findAll(text=True))
    dados.append(text.encode('utf-8').strip())
print dados
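Since the goal was a CSV file, here is a minimal sketch of the export step with the csv module, in the same Python 2 style as the code above; the filename escolas.csv is a hypothetical choice, and it assumes dados holds one utf-8 byte string per school:
import csv

# write one row per school; each dados entry is a utf-8 byte string,
# which Python 2's csv writer accepts directly in binary mode
with open('escolas.csv', 'wb') as f:
    writer = csv.writer(f)
    for linha in dados:
        writer.writerow([linha])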
