I'm trying to get some information from a site, put it in a list, and export that list to CSV.
This is a part of the site; it repeats several times.
<img src="image.jpg" alt="Aclimação">
</a>
</div>
Clique na imagem para ampliar
</div>
<div class="colInfos">
<h4>Aclimação</h4>
<div class="addressInfo">
Rua Muniz de Souza, 1110<br>
Aclimação - São Paulo - SP<br>
01534-001<br>
<br>
(11) 3208-3418 / 2639-0173<br>
aclimacao.sp#escolas.com.br<br>
I want to get the image link, the name (the h4), the address (inside addressInfo, with each br-separated line as a separate item in a list), and the email of each school (the a href mailto:) from this site, and export all of it to a CSV file. This is how I'm trying to do it, but there is a problem: I don't know how to search inside the result object 'endereco'. How can I do this?
This is my code:
import urllib2
from BeautifulSoup import BeautifulSoup

url = urllib2.urlopen("http://www.fisk.com.br/unidades?pais=1&uf=&rg=&cid=&ba=&un=")
soup = BeautifulSoup(url)
#nomes = soup.findAll('h4')
dados = []
i = 1
for endereco in enderecos:
    text = ''.join(endereco.findAll(???))  # <- how can I search the br's inside this?
    dados[i] = text.encode('utf-8').strip()
    i = i + 1
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
It really works fine. All you have to do is replace
dados = []
i = 1
for endereco in enderecos:
    text = ''.join(endereco.findAll(text=True))
    dados[i] = text.encode('utf-8').strip()
    i = i + 1
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
with
dados = []
enderecos = soup.findAll('div', attrs={'class': 'colInfos'})
for endereco in enderecos:
    text = ''.join(endereco.findAll(text=True))
    dados.append(text.encode('utf-8').strip())
print dados
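To go a step further and meet the original goal (each br-separated line as its own item, exported to CSV), you can keep the text nodes separate instead of joining them. A minimal sketch in the same Python 2 / BeautifulSoup 3 style as the question; the filename escolas.csv is just an example:

import csv

with open('escolas.csv', 'wb') as f:  # 'wb' mode for the Python 2 csv module
    writer = csv.writer(f)
    for endereco in enderecos:
        # each text node (one per br-separated line) becomes its own column
        linhas = [t.strip().encode('utf-8') for t in endereco.findAll(text=True) if t.strip()]
        writer.writerow(linhas)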
I am trying to identify tags in an HTML document based on part of the attribute value.
I'm interested in any 'a' under a 'tr' tag as long as its href attribute starts with or contains "AnnoncesImmobilier.asp?rech_cod_pay=".
HTML source :
<tr bgcolor="#f1efe2" class="Tableau1" valign="middle">
<td bgcolor="#294a73" height="20"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
<td> Berge Du Lac </td>
<td bgcolor="#294a73"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
<td onmouseover="return escape('<b>Rubrique</b> : Offres<br/><b>Nature</b> : Terrain<br/><b>Type</b> : Terrain nu');" style="CURSOR:pointer;"> Terrain</td>
will give ad_title = "Berge Du Lac".
In the source HTML, each "tr" tag with class "Tableau1" contains an ad, with different tr and a tags for the title, price, description, etc.
Below is my code:
import re
import requests
from bs4 import BeautifulSoup

# The URL to get data from
URL = 'http://www.tunisie-annonce.com/AnnoncesImmobilier.asp'
data = requests.get(URL)
soup = BeautifulSoup(data.content, "html.parser")

# Variable to extract the ads
ads = soup.find_all("tr", {"class": "Tableau1"})
for ad in ads:
    ad_title = ad.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text()
    print(ad_title)

The ad_title line inside the loop is the last snippet I tried to retrieve the text with, but neither it nor my previous code worked for me.
How can I proceed?
I'm interested in any 'a' under a 'tr' tag as long as its href attribute starts with or contains "AnnoncesImmobilier.asp?rech_cod_pay=".
You can make your selection more specific with CSS selectors:
soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')
To get a list of all the link texts, just iterate the result set:
[row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')]
Using set() you can filter the list to unique values:
set([row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')])
Output
{'Hammam Lif', 'El Manar 2', 'El Menzah 8', 'Chotrana 1', 'Rades', 'Sousse Corniche', 'Cite De La Sant', 'Sousse', 'Bizerte', 'Ain Zaghouan', 'Hammamet', 'La Soukra', 'Riadh Landlous', 'El Menzah 5', 'Khezama Ouest', 'Montplaisir', 'Sousse Khezama', 'Hergla', 'El Ouerdia', 'Hammam Sousse', 'El Menzah 1', 'Cite Ennasr 2', 'Bab El Khadra'}
To extract more than just the link text you can do the following:

import pandas as pd

data = []
for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])'):
    d = list(row.stripped_strings)
    d.append(row.a['href'])
    data.append(d)
pd.DataFrame(data)
Output

| Région          | Nature   | Type        | Texte annonce                 | Prix    | Modifiée   | Link                                                                           |
|-----------------|----------|-------------|-------------------------------|---------|------------|--------------------------------------------------------------------------------|
| Sousse Corniche | Location | App. 3 pièc | Magnifique appartement s2 fac | 1 000   | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12114&rech_cod_loc=1211413 |
| Riadh Landlous  | Location | App. 4 pièc | S3 situé au 1ér étage à riadh | 850     | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020135 |
| Khezama Ouest   | Vente    | App. 4 pièc | Magnifique s3 khzema pré      | 250 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12112&rech_cod_loc=1211209 |
| El Menzah 8     | Location | App. 1 pièc | Studio meublé manzah 8 vv     | 600     | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020126 |
| Hergla          | Vente    | App. 3 pièc | Appartement s 2 vue mer       | 300 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12105&rech_cod_loc=1210502 |
| ...             |          |             |                               |         |            |                                                                                |
You do not need a regex here; you can use:
titles = []
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.append(link.get_text())
print(titles)
If you want to get a unique list of titles, use a set:
titles = set()
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.add(link.get_text())
In both cases, href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h makes sure there is an href attribute and that it contains the "AnnoncesImmobilier.asp?rech_cod_pay=" string.
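To see the filter in action on a self-contained snippet (the HTML below is illustrative, modeled on the question's markup):

from bs4 import BeautifulSoup

html = '''
<tr class="Tableau1"><td>
  <a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&amp;rech_cod_vil=12114">Sousse Corniche</a>
  <a href="other.asp">Ignored</a>
</td></tr>
'''
soup = BeautifulSoup(html, "html.parser")
# only anchors whose href contains the search string are kept
links = soup.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
print([a.get_text() for a in links])  # ['Sousse Corniche']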
I want to scrape separate content, like the text in the 'a' tag (i.e. only the name, "42mm Architecture"), and use 'Scope of services, Types of Built Projects, Locations of Built Projects, Style of work, Website' as CSV file headers with their content, for a whole webpage.
The elements have no class or ID associated with them, so I am stuck on how to extract those details properly; there are also 'br' and 'b' tags in between.
There are multiple 'p' tags before and after the block of code provided. Here is the website.
<h2>
<a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
42mm Architecture
</a>
|
<span style="color: #808080;">
Delhi | Top Architecture Firms/ Architects in India
</span>
</h2>
<!-- /wp:paragraph -->
<p>
<b>
Scope of services:
</b>
Architecture, Interiors, Urban Design.
<br/>
<b>
Types of Built Projects:
</b>
Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
<br/>
<b>
Locations of Built Projects:
</b>
New Delhi and nearby states
<b>
<br/>
</b>
<b>
Style of work
</b>
<span style="font-weight: 400;">
: Contemporary
</span>
<br/>
<b>
Website
</b>
<span style="font-weight: 400;">
:
<a href="https://www.42mm.co.in/">
42mm.co.in
</a>
</span>
</p>
So how is it done using BeautifulSoup4?
This one was a bit of a time-consuming one! The webpage is incomplete and has few tags and identifiers. On top of that, they haven't even spell-checked the content, e.g. one place has the heading Scope of Services and another place has Scope of services, and there are many more like that! So what I have done is a crude extraction, and I'm sure it will help you if you also plan on paginating.
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
soup = BeautifulSoup(page.text, 'lxml')

# there are many h2 tags but we want the one without any class name
h2 = soup.find_all('h2', class_= '')

headers = []
contents = []
header_len = []
a_tags = []

for i in h2:
    if i.find_next().name == 'a':  # to make sure we do not grab the wrong tag
        a_tags.append(i.find_next().text)
        p = i.find_next_sibling()
        contents.append(p.text)
        h = [j.text for j in p.find_all('strong')]  # some headings were bold in the website
        headers.append(h)
        header_len.append(len(h))

# since only some headings were in bold, the max number of bold headings gives all headers
headers = headers[header_len.index(max(header_len))]
# removing the : from headings
headers = [i[:len(i)-1] for i in headers]
# inserted a new heading
headers.insert(0, 'Firm')

# n for traversing through the headers list
# k for traversing through the a_tags list
n = 1
k = 0

# this is the difficult part, where the content has all the details in one value including the headings, like this:
"""
Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
"""
# thus I am splitting it using the ':' and then splicing it from the start of each heading
contents = [i.split(':') for i in contents]
for i in contents:
    for j in i:
        h = headers[n][:5]
        if i.index(j) == 0:
            i[i.index(j)] = a_tags[k]
            n += 1
            k += 1
        elif h in j:
            i[i.index(j)] = j[:j.index(h)]
            j = j[:j.index(h)]
            if n < len(headers)-1:
                n += 1
    n = 1
    # merging those extra values in the list if any
    if len(i) == 7:
        i[3] = i[3] + ' ' + i[4]
        i.remove(i[4])

# writing into the csv file
# if you don't want a blank line between rows, add the newline='' argument to the open() call below
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(contents)
This was the output:
If you want to paginate then just add the page number to the end of the url and you'll be good!
page_num = 1
while page_num < 13:
    page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
    # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')
    page_num += 1
Hope this helps, let me know if there's any error.
EDIT 1:
Sorry, I forgot to mention the most important part: if a tag has no class name, you can still get it with what I used in the code above:
h2 = soup.find_all('h2', class_= '')
This just says: give me all the h2 tags that do not have a class name. The absence of a class can itself sometimes be a unique identifier, since we are using the missing class value to identify the tags.
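For example, on a tiny illustrative document, only the classless h2 should be returned:

from bs4 import BeautifulSoup

html = '<h2 class="widget">skip me</h2><h2>42mm Architecture</h2>'
soup = BeautifulSoup(html, 'lxml')
# class_='' matches h2 tags that have no class attribute
print(soup.find_all('h2', class_=''))  # [<h2>42mm Architecture</h2>]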
You can use this example as a basis for scraping the information from that page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/endorsing-bodies-start-up/start-up"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

parent = soup.select_one("div.govspeak")
mapping = {"sector": "sectors", "endorses businesses": "endorses businesses in"}

all_data = []
for h3 in parent.select("h3"):
    name = h3.text
    link = h3.a["href"] if h3.a else "-"
    ul = h3.find_next("ul")
    if ul and ul.find_previous("h3") == h3 and ul.parent == parent:
        li = [
            list(map(lambda x: mapping.get((i := x.strip()), i), v))
            for li in ul.select("li")
            if len(v := li.get_text(strip=True).split(":")) == 2
        ]
    else:
        li = []
    all_data.append({"name": name, "link": link, **dict(li)})

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice):
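The list comprehension with the walrus operator is compact but dense; an expanded equivalent (purely illustrative, same logic) looks like this:

pairs = []
for item in ul.select("li"):
    # each li looks like "key: value"; skip anything that doesn't split into two parts
    parts = item.get_text(strip=True).split(":")
    if len(parts) == 2:
        # strip whitespace and normalize known header variants via the mapping dict
        pairs.append([mapping.get(p.strip(), p.strip()) for p in parts])
# `pairs` is the same list the comprehension builds as `li`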
I have the following HTML code to web scrape:
<ul class="item-features">
<li>
<strong>Graphic Type:</strong> Dedicated Card
</li>
<li>
<strong>Resolution:</strong> 3840 x 2160
</li>
<li>
<strong>Weight:</strong> 4.40 lbs.
</li>
<li>
<strong>Color:</strong> Black
</li>
</ul>
I would like to print to a .csv file all the single tags inside the list: Graphic Type, Resolution, Weight, etc., in different columns of the .csv file.
I've tried the following in Python:
import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup

url = 'https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'
Client = req(url)
pagina = Client.read()
Client.close()
pagina_soup = soup(pagina, "html.parser")

productes = pagina_soup.findAll("div", {"class": "item-container"})
producte = productes[0]
features = producte.findAll("ul", {"class": "item-features"})
features[0].text
And it displays all the features, but just in one single column of the .csv:
'\nGraphic Type: Dedicated CardResolution: 3840 x 2160Weight: 4.40 lbs.Color: Black\nModel #: AERO 15 OLED SA-7US5020SH\nItem #: N82E16834233268\nReturn Policy: Standard Return Policy\n'
I don't know how to export them one by one. Please see my whole Python code:
import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup

# Link of the page we will scrape
url = 'https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'

# Open a connection with the web page
Client = req(url)
# Offloads the content of the page into a variable
pagina = Client.read()
# Closes the client
Client.close()

# html parser
pagina_soup = soup(pagina, "html.parser")

# grabs each product
productes = pagina_soup.findAll("div", {"class": "item-container"})

# Open a .csv file
filename = "ordinadors.csv"
f = open(filename, "w")

# Headers of my .csv file
headers = "Marca; Producte; PreuActual; PreuAnterior; Rebaixa; CostEnvio\n"

# Write the header
f.write(headers)

# Loop over all the products
for producte in productes:
    # Get the product brand
    marca_productes = producte.findAll("div", {"class": "item-info"})
    marca = marca_productes[0].div.a.img["title"]

    # Get the product name
    name = producte.a.img["title"]

    # Current price
    actual_productes = producte.findAll("li", {"class": "price-current"})
    preuActual = actual_productes[0].strong.text

    # Previous price
    try:
        preuAbans = producte.find("li", class_="price-was").next_element.strip()
    except:
        print("Not found")

    # Get the shipping costs
    costos_productes = producte.findAll("li", {"class": "price-ship"})
    # Since this is a list, take the first element and clean it.
    costos = costos_productes[0].text.strip()

    # Writing the file
    f.write(marca + ";" + name.replace(",", " ") + ";" + preuActual + ";"
            + preuAbans + ";" + costos + "\n")

f.close()
keys = [x.find().text for x in pagina_soup.find_all('li')]
values = [x.find('strong').next_sibling.strip() for x in pagina_soup.find_all('li')]
print(keys)
print(values)
Output:
Out[6]: ['Graphic Type:', 'Resolution:', 'Weight:', 'Color:']
Out[7]: ['Dedicated Card', '3840 x 2160', '4.40 lbs.', 'Black']
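From there, writing the pairs into separate CSV columns is straightforward. A short sketch, assuming the keys and values lists built above (the filename features.csv is illustrative):

import csv

with open('features.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([k.rstrip(':') for k in keys])  # header row without the trailing ':'
    writer.writerow(values)                         # one column per feature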
import requests
import string
from bs4 import BeautifulSoup, Tag

[...]

def disease_spider(maxpages):
    i = 0
    while i <= maxpages:
        url = 'http://www.cdc.gov/DiseasesConditions/az/' + alpha[i] + '.html'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for l in soup.findAll('a', {'class': 'noLinking'}):
            x = l.find("em")
            if x is not None:
                return x.em.replaceWith(Tag('a'))
        i += 1
Some of the text from the website uses <em> tags instead of <a> tags and I wanted to replace them with <a> tags.
Using this code I get this error:
AttributeError: 'NoneType' object has no attribute 'replaceWith'
From what I understand, you want to replace em with its text.
In other words, the a element containing:
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
including Hib Infection (<em>Haemophilus influenzae</em> Infection)
</a>
should be replaced with:
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
including Hib Infection (Haemophilus influenzae Infection)
</a>
In this case, I would locate all em tags directly under the a tags and, for each em tag found, replace it with its text using replace_with():
for em in soup.select('a.noLinking > em'):
    em.replace_with(em.text)
As a side note, the replacement might not be necessary, because the .text of the a tag gives you the full text of the node including its children:
In [1]: from bs4 import BeautifulSoup
In [2]: data = """
...: <a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
...: including Hib Infection (<em>Haemophilus influenzae</em> Infection)
...: </a>
...: """
In [3]: soup = BeautifulSoup(data)
In [4]: print soup.a.text
including Hib Infection (Haemophilus influenzae Infection)
I am using the code at the bottom to get the web link and the Masjid name. However, I would also like to get the denomination and the street address. Please help; I am stuck.
Currently I am getting the following.
Weblink:
<div class="subtitleLink"><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah">
and the Masjid name:
<b>Masjid Al-Hijrah</b>
But I would also like to get the below:
Denomination
<b>Denomination:</b> Sunni (Traditional)
and street address
<br>45 Station Street (Sydney)
The below code scrapes the following:
<td width=25><img src='http://www.halalfire.com/images/en/photo_small.jpg' alt='Masjid Al-Hijrah' title='Masjid Al-Hijrah' border=0 width=48 height=36></a></td><td width=10><img src="http://www.salatomatic.com/images/spacer.gif" width=10 border=0></td><td nowrap><div class="subtitleLink"><b>Masjid Al-Hijrah</b> </div><div class="tinyLink"><b>Denomination:</b> Sunni (Traditional)<br>45 Station Street (Sydney) </div></td><td align=right valign=center><div class="tinyLink"></div></td>
CODE:
from bs4 import BeautifulSoup
import urllib2

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
results = soup.findAll("div", {"class": "subtitleLink"})
for result in results:
    br = result.find('b')
    a = result.find('a')
    currenturl = a.get('href')
    if not currenturl.startswith("http"):
        currenturl = "http://www.salatomatic.com" + currenturl
        print currenturl
    elif currenturl.startswith("http"):
        print a.get('href')
    pos = br.get_text()
    print pos
You can check the next <div> element with a class attribute of value tinyLink that contains both a <b> and a <br> tag, and extract their strings:
...
    print pos
    div = result.find_next_sibling('div', attrs={"class": "tinyLink"})
    if div and div.b and div.br:
        print(div.b.next_sibling.string)
        print(div.br.next_sibling.string)
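Putting the pieces together, a full row per listing could be collected and written to CSV like this. A sketch in the same Python 2 style as the thread, assuming the markup shown in the question; masjids.csv is an illustrative filename:

import csv

rows = []
for result in soup.findAll("div", {"class": "subtitleLink"}):
    a = result.find('a')
    link = a.get('href')
    if not link.startswith("http"):
        link = "http://www.salatomatic.com" + link
    name = result.find('b').get_text()
    denomination = address = ''
    div = result.find_next_sibling('div', attrs={"class": "tinyLink"})
    if div and div.b and div.br:
        # the text right after <b>Denomination:</b>, and the text after the <br>
        denomination = div.b.next_sibling.string.strip()
        address = div.br.next_sibling.string.strip()
    rows.append([link, name.encode('utf-8'),
                 denomination.encode('utf-8'), address.encode('utf-8')])

with open('masjids.csv', 'wb') as f:
    csv.writer(f).writerows(rows)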