Beautiful Soup Find href text based on partial attribute value - python

I am trying to identify tags in an HTML document based on part of an attribute value.
I'm interested in any 'a' under a 'tr' tag whose href attribute starts with or contains:
"AnnoncesImmobilier.asp?rech_cod_pay="
HTML source:
<tr bgcolor="#f1efe2" class="Tableau1" valign="middle">
<td bgcolor="#294a73" height="20"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
<td> Berge Du Lac </td>
<td bgcolor="#294a73"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
<td onmouseover="return escape('<b>Rubrique</b> : Offres<br/><b>Nature</b> : Terrain<br/><b>Type</b> : Terrain nu');" style="CURSOR:pointer;"> Terrain</td>
This should give ad_title = "Berge Du Lac".
In the source HTML, each "tr" tag with class "Tableau1" contains one ad, with separate tr and a tags for the title, price, description, etc.
Below is my code:
import re
import requests
from bs4 import BeautifulSoup

# The URL to get data from
URL = 'http://www.tunisie-annonce.com/AnnoncesImmobilier.asp'
data = requests.get(URL)
soup = BeautifulSoup(data.content, "html.parser")

# Variable to extract the ads
ads = soup.find_all("tr", {"class": "Tableau1"})
for ad in ads:
    ad_title = ad.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text()
    print(ad_title)
ad_title = ad.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text() is the last snippet I tried for retrieving the text, but neither it nor my previous code worked for me.
How can I proceed?

I'm interested in any 'a' under a 'tr' tag whose href attribute starts with or contains:
"AnnoncesImmobilier.asp?rech_cod_pay="
You can make your selection more specific with CSS selectors:
soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')
To get a list of all the link texts, just iterate over the result set:
[row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')]
Using set() you can filter the list to unique values:
set([row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')])
Output
{'Hammam Lif', 'El Manar 2', 'El Menzah 8', 'Chotrana 1', 'Rades', 'Sousse Corniche', 'Cite De La Sant', 'Sousse', 'Bizerte', 'Ain Zaghouan', 'Hammamet', 'La Soukra', 'Riadh Landlous', 'El Menzah 5', 'Khezama Ouest', 'Montplaisir', 'Sousse Khezama', 'Hergla', 'El Ouerdia', 'Hammam Sousse', 'El Menzah 1', 'Cite Ennasr 2', 'Bab El Khadra'}
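Note that set() discards the page order; if you want unique titles in the order they first appear, dict.fromkeys does the same deduplication while keeping insertion order:
# unique titles, preserving the order they appear on the page
rows = soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')
unique_titles = list(dict.fromkeys(row.a.text for row in rows))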
To extract more than just the href text you can do the following:
import pandas as pd

data = []
for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])'):
    d = list(row.stripped_strings)
    d.append(row.a['href'])
    data.append(d)
pd.DataFrame(data)
Output
Région | Nature | Type | Texte annonce | Prix | Modifiée | Link
Sousse Corniche | Location | App. 3 pièc | Magnifique appartement s2 fac | 1 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12114&rech_cod_loc=1211413
Riadh Landlous | Location | App. 4 pièc | S3 situé au 1ér étage à riadh | 850 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020135
Khezama Ouest | Vente | App. 4 pièc | Magnifique s3 khzema pré | 250 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12112&rech_cod_loc=1211209
El Menzah 8 | Location | App. 1 pièc | Studio meublé manzah 8 vv | 600 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020126
Hergla | Vente | App. 3 pièc | Appartement s 2 vue mer | 300 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12105&rech_cod_loc=1210502
...
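Since each row follows the column layout above, you can also pass explicit column names to the DataFrame and write it straight to CSV (a sketch; it assumes every row really yields these seven values):
import pandas as pd

# column names taken from the output above
columns = ['Région', 'Nature', 'Type', 'Texte annonce', 'Prix', 'Modifiée', 'Link']
df = pd.DataFrame(data, columns=columns)
df.to_csv('ads.csv', index=False)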

You do not need a regex here; you can use:
titles = []
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.append(link.get_text())
print(titles)
If you want to get a unique list of titles, use a set:
titles = set()
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.add(link.get_text())
In both cases, href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h makes sure there is an href attribute and that it contains the AnnoncesImmobilier.asp?rech_cod_pay= string.
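If you do want a regex after all, pass it to the href argument rather than text, and escape the metacharacters; the unescaped ? and . were part of the problem in the original attempt:
import re

# re.escape handles the '?' and '.' metacharacters in the pattern
pattern = re.compile(re.escape('AnnoncesImmobilier.asp?rech_cod_pay='))
titles = [link.get_text() for ad in ads for link in ad.find_all('a', href=pattern)]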

Related

Find different elements with BeautifulSoup that have the same class

I am trying to scrape two elements, CDI and London, from this URL: https://www.welcometothejungle.com/fr/companies/dataiku/jobs/ai-solutions-manager-life-science_london_DATAI_a2jpa5o. The issue here is that:
They have the same class
They are in the same div
for London :
<li class="sc-1qc42fc-0 kExFnG"><span role="img" class="sc-1qc42fc-3 heity"><i name="location" class="sc-kmATbt bGKMNx"></i></span><span class="sc-1qc42fc-2 jmExaK">London</span></li>
for CDI :
<li class="sc-1qc42fc-0 kExFnG"><span role="img" class="sc-1qc42fc-3 heity"><i name="contract" class="sc-kmATbt jYkMSd"></i></span><span class="sc-1qc42fc-2 jmExaK"><span>CDI</span> </span></li>
I can see that the two HTML snippets differ in one thing, the "i" tag: one's name is location, the other's is contract, but I can't seem to find a way to use this info to scrape the correct element.
How can I write a soup.find that will let me extract both elements, "CDI" and "London"?
From what I understand, this should work for you:
# Get all the children of the parent of the first li with that class
lis = list(soup.find_all('li', attrs={'class': 'sc-1qc42fc-0 kExFnG'})[0].parent.children)
fields = {}
for li in lis:
    fields[li.find('i').get('name')] = li.text.strip()
print(fields)
Output:
{'contract': 'CDI', 'location': 'London'}
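You can also exploit the name attribute of the i tag you spotted, with CSS selectors (a sketch, assuming a bs4 version with the soupsieve backend so that :has() is supported):
# pick the text span inside the li whose icon carries the wanted name
location = soup.select_one('li:has(i[name="location"]) span.sc-1qc42fc-2').get_text(strip=True)
contract = soup.select_one('li:has(i[name="contract"]) span.sc-1qc42fc-2').get_text(strip=True)
print(location, contract)  # London CDI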

How to scrape content from a website with no class or id specified in attribute with BeautifulSoup4

I want to scrape separate content, like the text in the 'a' tag (i.e. only the name "42mm Architecture"), and 'Scope of services, Types of Built Projects, Locations of Built Projects, Style of work, Website' as CSV file headers with their content, for the whole webpage.
The elements have no class or ID associated with them, so I am kind of stuck on how to extract those details properly; there are also those 'br' and 'b' tags in between.
There are multiple 'p' tags before and after the block of code provided. Here is the website.
<h2>
<a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
42mm Architecture
</a>
|
<span style="color: #808080;">
Delhi | Top Architecture Firms/ Architects in India
</span>
</h2>
<!-- /wp:paragraph -->
<p>
<b>
Scope of services:
</b>
Architecture, Interiors, Urban Design.
<br/>
<b>
Types of Built Projects:
</b>
Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
<br/>
<b>
Locations of Built Projects:
</b>
New Delhi and nearby states
<b>
<br/>
</b>
<b>
Style of work
</b>
<span style="font-weight: 400;">
: Contemporary
</span>
<br/>
<b>
Website
</b>
<span style="font-weight: 400;">
:
<a href="https://www.42mm.co.in/">
42mm.co.in
</a>
</span>
</p>
So how is it done using BeautifulSoup4?
This one was a bit time consuming! The webpage is not complete and has few tags and identifiers. On top of that, they haven't even spell-checked the content, e.g. one place has the heading Scope of Services and another has Scope of services, and there are many more like that! So what I have done is a crude extraction, and I'm sure it will help you if you also plan on paginating.
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
soup = BeautifulSoup(page.text, 'lxml')

# there are many h2 tags but we want the ones without any class name
h2 = soup.find_all('h2', class_= '')

headers = []
contents = []
header_len = []
a_tags = []

for i in h2:
    if i.find_next().name == 'a':  # to make sure we do not grab the wrong tag
        a_tags.append(i.find_next().text)
        p = i.find_next_sibling()
        contents.append(p.text)
        h = [j.text for j in p.find_all('strong')]  # some headings were bold on the website
        headers.append(h)
        header_len.append(len(h))

# since only some headings were in bold, the entry with the most bold tags gives all headers
headers = headers[header_len.index(max(header_len))]
# removing the : from headings
headers = [i[:len(i)-1] for i in headers]
# inserting a new heading
headers.insert(0, 'Firm')

# n for traversing through the headers list
# k for traversing through the a_tags list
n = 1
k = 0

# this is the difficult part: each content value has all the details in one string,
# including the headings, like this:
"""
Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
"""
# thus I am splitting it on ':' and then splicing it from the start of each heading
contents = [i.split(':') for i in contents]
for i in contents:
    for j in i:
        h = headers[n][:5]
        if i.index(j) == 0:
            i[i.index(j)] = a_tags[k]
            n += 1
            k += 1
        elif h in j:
            i[i.index(j)] = j[:j.index(h)]
            j = j[:j.index(h)]
            if n < len(headers)-1:
                n += 1
    n = 1
    # merging those extra values in the list, if any
    if len(i) == 7:
        i[3] = i[3] + ' ' + i[4]
        i.remove(i[4])

# writing into the csv file
# if you don't want a blank line between rows, add the newline='' argument to open() below
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(contents)
This was the output (screenshot omitted).
If you want to paginate, just add the page number to the end of the URL and you'll be good!
page_num = 1
while page_num < 13:
    page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
    # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')
    page_num += 1
Hope this helps, let me know if there's any error.
EDIT 1:
I forgot to mention the most important part, sorry: if a tag has no class name, you can still get it with what I used in the code above:
h2 = soup.find_all('h2', class_= '')
This just says: give me all the h2 tags that do not have a class name. The absence of a class can itself sometimes be a unique identifier.
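If you prefer CSS selectors, the same idea can be written with :not (a sketch; :not([class]) matches h2 tags that carry no class attribute at all):
h2 = soup.select('h2:not([class])')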
You can use this example as a basis for how to scrape the information from that page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/endorsing-bodies-start-up/start-up"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

parent = soup.select_one("div.govspeak")
mapping = {"sector": "sectors", "endorses businesses": "endorses businesses in"}

all_data = []
for h3 in parent.select("h3"):
    name = h3.text
    link = h3.a["href"] if h3.a else "-"
    ul = h3.find_next("ul")
    if ul and ul.find_previous("h3") == h3 and ul.parent == parent:
        li = [
            list(map(lambda x: mapping.get((i := x.strip()), i), v))
            for li in ul.select("li")
            if len(v := li.get_text(strip=True).split(":")) == 2
        ]
    else:
        li = []
    all_data.append({"name": name, "link": link, **dict(li)})

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice omitted).
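One caveat: the list comprehension uses assignment expressions (:=), so it needs Python 3.8+. On older versions a plain loop does the same thing:
li = []
for item in ul.select("li"):
    v = item.get_text(strip=True).split(":")
    if len(v) == 2:
        li.append([mapping.get(x.strip(), x.strip()) for x in v])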

Webscraping with BeautifulSoup in Python tags

I am currently trying to scrape some information from the following link:
http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument
I would like to scrape some of the information in the table using BeautifulSoup in Python. Ideally I would like to scrape the "Grupo Parlamentario," "Título," "Sumilla," and "Autores" fields from the table as separate items.
So far I've developed the following code using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
contents = []
summary = []
authors = []
contents.append(table.findAll('font'))
authors.append(table.findAll('a'))
What I'm struggling with is that the code to scrape the authors only scrapes the first author in the list. Ideally I need to scrape all of the authors. This seems odd to me because, looking at the HTML code for the webpage, all authors in the list are indicated with '<a href = >' tags, so I would think table.findAll('a') would grab all of them.
Finally, I'm sort of just dumping the rest of the very messy HTML (title, summary, parliamentary group) into one long string under contents. I'm not sure if I'm missing something; I'm fairly new to HTML and web scraping, but is there a way to pull these items out and store them individually (i.e. storing just the title in an object, just the summary in an object, etc.)? I'm having a tough time identifying unique tags to do this in the code for the web page. Or is this something I should just clean and parse after scraping?
To get the authors you can use:
soup.find('input', {'name': 'NomCongre'})['value']
output:
'Santa María Calderón Luis,Alva Castro Luis,Armas Vela Carlos,Cabanillas Bustamante Mercedes,Carrasco Távara José,De la Mata Fernández Judith,De La Puente Haya Elvira,Del Castillo Gálvez Jorge,Delgado Nuñez Del Arco José,Gasco Bravo Luis,Gonzales Posada Eyzaguirre Luis,León Flores Rosa Marina,Noriega Toledo Víctor,Pastor Valdivieso Aurelio,Peralta Cruz Jonhy,Zumaeta Flores César'
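Since the value is one comma-separated string, splitting it gives the authors as separate items:
authors = soup.find('input', {'name': 'NomCongre'})['value'].split(',')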
To scrape Grupo Parlamentario:
table.find_all('td', {'width': 446})[1].text
output:
'Célula Parlamentaria Aprista'
To scrape Título:
table.find_all('td', {'width': 446})[2].text
output:
'IGV/SELECTIVO:D.L.821/LEY INTERPRETATIVA '
To scrape Sumilla:
table.find_all('td', {'width': 446})[3].text
output:
' Propone la aprobación de una Ley Interpretativa del Texto Original del Numeral 1 del Apéndice II del Decreto Legislativo N°821,modificatorio del Texto Vigente del Numeral 1 del Apéndice II del Texto Único Ordenado de la Ley del Impuesto General a las Ventas y Selectivo al Consumo,aprobado por Decreto Supremo N°054-99-EF. '
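Putting the pieces together, a sketch that stores the four fields individually, based on the cell positions shown above:
cells = table.find_all('td', {'width': 446})
record = {
    'Grupo Parlamentario': cells[1].text.strip(),
    'Título': cells[2].text.strip(),
    'Sumilla': cells[3].text.strip(),
    'Autores': soup.find('input', {'name': 'NomCongre'})['value'].split(','),
}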

Python BeautifulSoup extracting titles according to id

This is a subquestion of this one: Python associate urls's ids and url's titles in lists
I have this HTML snippet:
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">Monte le son</a>
<div class="rs-cell-details">
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">"Rubin_Steiner"</a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
class="ss-titre">Fare maohi</a>
How can I get this result with BeautifulSoup:
list_titre = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']] #one sublist by id
I tried this:
import urllib
from bs4 import BeautifulSoup

f = urllib.urlopen(url)
page = f.read()
f.close()
soup = BeautifulSoup(page)
show = []
list_titre = []
list_url = []
for link in soup.findAll('a'):
    lien = link.get('href')
    if lien == None:
        lien = ""
    if "http://pluzz.francetv.fr/videos/" in lien:
        titre = link.text.strip()
        if "Voir cette vidéo" in titre:
            titre = ""
        if "Lire la vidéo" in titre:
            titre = ""
        list_titre.append(titre)
        list_url.append(lien)
My result is:
list_titre = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']
list_url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']
But "titre" is not sorted by id.
Search for your links with a CSS selector to limit hits to just qualifying URLs.
Collect the links in a dictionary by URL; that way you can then process the information by sorting the dictionary keys:
from bs4 import BeautifulSoup

links = {}
soup = BeautifulSoup(page)
for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
    title = link.get_text().strip()
    if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
        url = link['href']
        links.setdefault(url, []).append(title)
The dict.setdefault() call sets an empty list for urls not yet encountered; this produces a dictionary with the URLs as keys, and the titles as a list of values per URL.
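collections.defaultdict(list) does the same job as dict.setdefault, if you find it more readable:
from collections import defaultdict

links = defaultdict(list)
for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
    title = link.get_text().strip()
    if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
        links[link['href']].append(title)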
Demo:
>>> page = '''\
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">Monte le son</a>
... <div class="rs-cell-details">
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">"Rubin_Steiner"</a>
... <a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
... class="ss-titre">Fare maohi</a>
... '''
>>> links = {}
>>> soup = BeautifulSoup(page)
>>> for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
...     title = link.get_text().strip()
...     if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
...         url = link['href']
...         links.setdefault(url, []).append(title)
...
>>> from pprint import pprint
>>> pprint(links)
{'http://pluzz.francetv.fr/videos/ce_soir_ou_jamais_,101506826.html': [u'Ce soir (ou jamais !)',
u'"Qui est propri\xe9taire de quoi ? La propri\xe9t\xe9 mise \xe0 mal dans tous les domaines"'],
'http://pluzz.francetv.fr/videos/clip_locaux_,102890631.html': [u'Clips'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102152859.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102292937.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102365651.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/inspecteur_barnaby_,101972045.html': [u'Inspecteur Barnaby',
u'"La musique en h\xe9ritage"'],
'http://pluzz.francetv.fr/videos/le_lab_o_saison3_,101215383.html': [u'Le Lab.\xd4',
u'"Episode 22"',
u'Saison 3'],
'http://pluzz.francetv.fr/videos/monsieur_madame_saison1_,101970319.html': [u'Les Monsieur Madame',
u'"Musique"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html': [u'Monte le son !',
u'"Rubin Steiner"'],
'http://pluzz.francetv.fr/videos/music_explorer_saison1_,101215382.html': [u'Music Explorer : les chasseurs de sons',
u'"Episode 3/6"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/retour_a_goree_,101641108.html': [u'Retour \xe0 Gor\xe9e'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101507102.html': [u'Singe mi singe moi',
u'"Le chat"'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101777072.html': [u'Singe mi singe moi',
u'"L\'autruche"'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472310.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472336.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102721018.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216774.html': [u'T.N.T.'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216788.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/via_cultura_,101959892.html': [u'Via cultura',
u'"L\'Ochju, le Mauvais oeil"']}

Search inside results object - Python, BeautifulSoup

I'm trying to get some information from a site, put it in a list, and export that list to CSV.
This is a part of the site; it repeats several times.
<img src="image.jpg" alt="Aclimação">
</a>
</div>
Clique na imagem para ampliar
</div>
<div class="colInfos">
<h4>Aclimação</h4>
<div class="addressInfo">
Rua Muniz de Souza, 1110<br>
Aclimação - São Paulo - SP<br>
01534-001<br>
<br>
(11) 3208-3418 / 2639-0173<br>
aclimacao.sp#escolas.com.br<br>
I want to get the image link, the name (h4), the address (inside addressInfo, where each br should be a separate item in a list) and the email of each school (the a href mailto:) from this site, and export them to a CSV file. This is how I'm trying to do it, but there is a problem: I don't know how to search inside the results object 'endereco'. How can I do this?
This is my code:
import urllib2
from BeautifulSoup import BeautifulSoup
url = urllib2.urlopen("http://www.fisk.com.br/unidades?pais=1&uf=&rg=&cid=&ba=&un=")
soup = BeautifulSoup(url)
#nomes = soup.findAll('h4')
dados = []
i = 1
for endereco in enderecos:
    text = ''.join(endereco.findAll(???))  # <- how can I search the br's inside this?
    dados[i] = text.encode('utf-8').strip()
    i = i + 1
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
It really works fine. All you have to do is replace
dados = []
i = 1
for endereco in enderecos:
    text = ''.join(endereco.findAll(text=True))
    dados[i] = text.encode('utf-8').strip()
    i = i + 1
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
with
dados = []
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
for endereco in enderecos:
    text = ''.join(endereco.findAll(text=True))
    dados.append(text.encode('utf-8').strip())
print dados
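If you also want each br-separated line of the address as its own list item, bs4's stripped_strings generator yields exactly that (a sketch assuming Python 3 with requests and a modern bs4 import instead of the old BeautifulSoup 3 one):
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.fisk.com.br/unidades?pais=1&uf=&rg=&cid=&ba=&un=").text
soup = BeautifulSoup(html, 'html.parser')
dados = []
for endereco in soup.find_all('div', attrs={'class': 'colInfos'}):
    # each string between <br> tags becomes a separate list item
    dados.append(list(endereco.stripped_strings))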
