scraped data using BeautifulSoup does not match source code - python

I'm new to web scraping. I have seen a few tutorials on how to scrape websites using BeautifulSoup.
As an exercise I would like to extract data from a real estate website.
The specific page I want to scrape is this one: https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1
My goal is to extract a list of all the links to each real estate sale.
Afterwards, I want to loop through that list of links to extract all the data for each sale (price, location, nb bedrooms etc.)
The first issue I'm encountering is that the data scraped using the classic BeautifulSoup code does not match the source code of the webpage.
This is my code:
URL = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
page = requests.get(URL)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Then, when looking for the links to each real estate sale, which are located under
soup.find_all("a", class_="card__title-link")
it outputs an empty list: these tags are simply not present in the HTML extracted by the code above.
Why is that? What should I do to ensure that the extracted html correctly corresponds to what is visible in the source code of the website?
Thank you :-)

The data you see is embedded within the page in JSON format. You can use this example to see how to load it:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# the search results are stored as JSON in the ":results" attribute of the <iw-search> tag
data = json.loads(soup.find("iw-search")[":results"])

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data:
for ad in data:
    print(
        "{:<63} {:<8} {}".format(
            ad["property"]["title"],
            ad["transaction"]["sale"]["price"] or "-",
            "https://www.immoweb.be/fr/annonce/{}".format(ad["id"]),
        )
    )
Prints:
Triplex appartement met 3 slaapkamers en garage. 239000 https://www.immoweb.be/fr/annonce/9309298
Appartement 285000 https://www.immoweb.be/fr/annonce/9309895
Heel ruime, moderne, lichtrijke Duplex te koop, bij centrum 269000 https://www.immoweb.be/fr/annonce/9303797
À VENDRE PAR LANDBERGH : appartement de deux chambres à Gand 359000 https://www.immoweb.be/fr/annonce/9310300
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309278
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309251
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309264
Appartement intéressant avec agréable vue panoramique verdoy 219000 https://www.immoweb.be/fr/annonce/9309366
Projet Utopia by Godin - https://www.immoweb.be/fr/annonce/9309458
Appartement 2-ch avec vue unique! 270000 https://www.immoweb.be/fr/annonce/9309183
Residentieel wonen in Hélécine, dichtbij de natuur en de sne - https://www.immoweb.be/fr/annonce/9309241
Appartement 375000 https://www.immoweb.be/fr/annonce/9309187
DUPLEX LUMIEUX ET SPACIEUX 380000 https://www.immoweb.be/fr/annonce/9298271
SINT-PIETERS-LEEUW / Magnifique maison de ±130m² avec jardin 430000 https://www.immoweb.be/fr/annonce/9310259
PARC PARMENTIER // APP MODERNE 3CH 490000 https://www.immoweb.be/fr/annonce/9262193
BOIS DE LA CAMBRE – AV DE FRE – CLINIQUES DE L’EUROPE 575000 https://www.immoweb.be/fr/annonce/9309664
Entre Stockel et le Stade Fallon 675000 https://www.immoweb.be/fr/annonce/9310094
Maisons neuves dans un cadre verdoyant - https://www.immoweb.be/fr/annonce/6792221
Nieuwbouwproject Dockside Gardens - Gent - https://www.immoweb.be/fr/annonce/9008956
Appartement 139000 https://www.immoweb.be/fr/annonce/9187904
A VENDRE CHEZ LANDBERGH: appartements à Merelbeke Flora - https://www.immoweb.be/fr/annonce/9306877
Très beau studio avec une belle vue sur la plage et la mer! 319000 https://www.immoweb.be/fr/annonce/9306787
BEL APPARTEMENT LUMINEUX DIAMANT / PLASKY 320000 https://www.immoweb.be/fr/annonce/9264748
Un projet d'appartements neufs à proximité de Woluwé-St-Lamb - https://www.immoweb.be/fr/annonce/9308037
PLACE JOURDAN - 2 CHAMBRES 345000 https://www.immoweb.be/fr/annonce/9306953
Magnifiek appartement in de Brugse Rand - Assebroek 399000 https://www.immoweb.be/fr/annonce/9306613
Bien d'exception 415000 https://www.immoweb.be/fr/annonce/9308022
Appartement 435000 https://www.immoweb.be/fr/annonce/9307802
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307178
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307177
EDIT: Added URL column.
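The question also asks for per-listing details such as price, location and number of bedrooms. Much of that appears to live in the same embedded JSON, so you may not need to visit every ad page. A minimal sketch, assuming the nested key names below (bedroomCount and location.locality are guesses; uncomment the json.dumps print above to inspect the real structure):

# Hypothetical sketch: pull extra fields from the same ":results" JSON.
# Key names other than "property", "transaction.sale.price" and "id" are
# assumptions -- verify them against the pretty-printed JSON first.
for ad in data:
    prop = ad.get("property", {})
    print(
        ad["id"],
        ad["transaction"]["sale"]["price"] or "-",
        prop.get("bedroomCount", "?"),                      # assumed key
        (prop.get("location") or {}).get("locality", "?"),  # assumed key
    )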

Related

How could I extract the content of an article from an HTML webpage using Python Beautiful Soup?

I am a novice with Python and I am trying to do web scraping as an exercise. I would like to scrape the content and the title of each article inside a web page.
I have a problem with my code because I do not think it is very efficient, and I would like to optimize it.
The page I am trying to scrape is https://www.ansa.it/sito/notizie/politica/politica.shtml
This is what I have done so far:
#libraries
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import urllib.request,sys,time
import csv
from csv import writer
import time
from datetime import datetime

r= requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
b= soup(r.content, 'lxml')

title=[]
links=[]
content=[]
for c in b.findAll('h3',{'class':'news-title'}):
    title.append(c.text.strip())
for c in b.findAll("h3", {"class": "news-title"}):
    links.append(c.a["href"])
for link in links:
    page=requests.get('https://www.ansa.it'+link)
    bsjop=soup(page.content)
    for n in bsjop.findAll('div',{'itemprop': 'articleBody'}):
        content.append(n.text.strip())
The problem is that my output is made of multiple links, multiple titles and multiple contents that do not match each other (e.g., one article ends up with a title and a content that have nothing to do with each other).
If you know ways that I can improve my code, it would be nice. Thanks.
To get all article titles, URLs and texts into a Pandas DataFrame you can use the next example (I used the tqdm module to get a nice progress bar):
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup

url = "https://www.ansa.it/sito/notizie/politica/politica.shtml"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for title in tqdm(soup.select("h3.news-title")):
    t = title.get_text(strip=True)
    u = title.a["href"]
    s = BeautifulSoup(
        requests.get("https://www.ansa.it" + u).content, "html.parser"
    )
    text = s.select_one('[itemprop="articleBody"]')
    text = text.get_text(strip=True, separator="\n") if text else ""
    all_data.append([t, u, text])

df = pd.DataFrame(all_data, columns=["Title", "URL", "Text"])
df.to_csv("data.csv", index=False)
Creates data.csv with Title, URL and Text columns, which can be opened in LibreOffice.
import requests
from bs4 import BeautifulSoup
import pandas as pd

news_list = []

r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
soup = BeautifulSoup(r.text, 'html.parser')
articles = soup.select('article.news')

for art in articles:
    try:
        title = art.select_one('h3').text.strip()
        if 'javascript:void(0);' in art.select('a')[0].get('href'):
            url = 'https://www.ansa.it' + art.select('a')[1].get('href')
        else:
            url = 'https://www.ansa.it' + art.select('a')[0].get('href')
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.select_one('div.news-txt').text.strip()
        print(f'retrieving {url}')
        news_list.append((title, content, url))
    except Exception as e:
        print(art.text.strip(), e)

df = pd.DataFrame(news_list, columns = ['Title', 'Content', 'Url'])
print(df)
This will return some errors for some links on the page, which you will have to investigate and debug (and ask for help if you need it - it's an important part of the learning process), and a dataframe with the articles successfully retrieved, which looks like this (a sketch that collects those failures for later inspection follows the sample output):
Title Content Url
0 Letta: 'Ora difficile ricomporre con M5s'. Mel... Partiti scossi dallo scioglimento anticipato d... https://www.ansa.it/sito/notizie/politica/2022...
1 L'emozione di Draghi: 'Ancheil cuore dei banch... "Certe volte anche il cuore dei banchieri cent... https://www.ansa.it/sito/notizie/politica/2022...
2 La giornata di Draghi in foto https://www.ansa.it/sito/photogallery/primopia...
3 Il timing del voto, liste entro un mese. A Fer... Le liste dei candidati entro un mese a partire... https://www.ansa.it/sito/notizie/politica/2022...
4 Si lavora sulla concorrenza, ipotesi stralcio ... Il DDL Concorrenza andrà in Aula alla Camera l... https://www.ansa.it/sito/notizie/economia/2022...
5 Le cifre del governo Draghi: 55 voti fiducia e... Una media di 7,4 leggi approvate ogni mese su ... https://www.ansa.it/sito/notizie/politica/2022...
6 I 522 giorni del governo Draghi LE FOTO L'arrivo di SuperMario, gli incontri, le allea... https://www.ansa.it/sito/photogallery/primopia...
7 Presidi, disappunto per le urne in autunno, ci... "C'è disappunto non preoccupazione per le urne... https://www.ansa.it/sito/notizie/politica/2022...
8 Ucraina: Di Maio,sostegno ricerca mercati alte... (ANSA) - ROMA, 22 LUG - Lo scoppio del conflit... https://www.ansa.it/sito/photogallery/primopia...
9 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
10 Oltre mille sindaci a sostegno di Draghi Nei giorni che attendono il mercoledì che deci... https://www.ansa.it/sito/notizie/politica/2022...
11 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
12 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
13 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
14 Di Maio, Conte sta compiendo una vendetta poli... Se le cose restano come sono oggi "Mario Dragh... https://www.ansa.it/sito/notizie/politica/2022...
15 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
16 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
17 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
18 Il discorso di Draghi al Senato 'Partiti, pron... "Siamo qui perché lo hanno chiesto gli italian... https://www.ansa.it/sito/notizie/politica/2022...
19 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
20 Draghi al Senato per una fiducia al buio. Prem... Draghi al bivio tra governo e crisi. Alle 9.30... https://www.ansa.it/sito/notizie/politica/2022...
21 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
22 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
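If you want to keep the failing articles around instead of only printing them, here is a minimal sketch of the same loop that collects the failures for later inspection (same selectors as above):

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
soup = BeautifulSoup(r.text, 'html.parser')

news_list, failed = [], []
for art in soup.select('article.news'):
    url = ''
    try:
        title = art.select_one('h3').text.strip()
        href = art.select('a')[0].get('href')
        if 'javascript:void(0);' in href:
            href = art.select('a')[1].get('href')
        url = 'https://www.ansa.it' + href
        page = BeautifulSoup(requests.get(url).text, 'html.parser')
        content = page.select_one('div.news-txt').text.strip()
        news_list.append((title, content, url))
    except Exception as e:
        # keep the offending url (or a snippet of the article markup) and the error
        failed.append((url or art.text.strip()[:60], repr(e)))

df = pd.DataFrame(news_list, columns=['Title', 'Content', 'Url'])
print(df)
print(f'{len(failed)} articles failed:')
for item, err in failed:
    print(item, '->', err)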
You don't need two different loops if you are referring to the same element. Try the code below to save the titles and links.
for c in b.findAll('h3', {'class': 'news-title'}):
    title.append(c.text.strip())
    links.append(c.a["href"])
By combining them you will be sure that the title and link are scraped from the same element.
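Putting it together with the article bodies, here is a minimal sketch of one pass that keeps title, link and content aligned (same tags and itemprop selector as in the question):

import requests
from bs4 import BeautifulSoup

base = 'https://www.ansa.it'
b = BeautifulSoup(requests.get(base + '/sito/notizie/politica/politica.shtml').content, 'lxml')

rows = []
for c in b.findAll('h3', {'class': 'news-title'}):
    t = c.text.strip()
    link = c.a['href']
    article = BeautifulSoup(requests.get(base + link).content, 'lxml')
    body = article.find('div', {'itemprop': 'articleBody'})
    # one tuple per article, so title, link and content can never get out of step
    rows.append((t, base + link, body.get_text(strip=True) if body else ''))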

BS4: Iterating through pages returns the same result in Python

Why does this code return the same film titles (the titles from the first page)?
url_base = "https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-"
for page in range(1, 3): #nb_pages+1):
url_n = url_base + str(page)
print(url_n)
html_n = urllib2.urlopen(url_n).read().decode('utf-8')
soup_n = BeautifulSoup(html_n, 'html.parser')
for film in soup_n.find_all('li', attrs={"class": u"elli-item"}):
print(film.find('a', attrs={"class": u"elco-anchor"}).text)
The page is loading the titles from a different URL via Ajax:
import requests
from bs4 import BeautifulSoup

# https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-
url_base = 'https://www.senscritique.com/sc2/liste/772407/page-{}.ajax'

for page in range(1, 3): #nb_pages+1):
    url_n = url_base.format(page)
    soup_n = BeautifulSoup(requests.get(url_n).content, 'html.parser')
    for film in soup_n.find_all('li', attrs={"class": u"elli-item"}):
        print(film.find('a', attrs={"class": u"elco-anchor"}).text)
Prints:
Old Boy
Lucy
Le Loup de Wall Street
Tomboy
Dersou Ouzala
Les Lumières de la ville
12 hommes en colère
Gravity
Hunger Games : La Révolte, partie 1
Le Parrain
La Nuit du chasseur
La Belle et la Bête
The Big Lebowski
Interstellar
Le Ruban blanc
Vive la France
La vie est belle
Le Hobbit : La Bataille des cinq armées
Pulp Fiction
Melancholia
Her
The Grand Budapest Hotel
The Tree of Life
Le Prestige
Conan le Barbare
Ninja Turtles
Jurassic Park III
Rebelle
Mud, sur les rives du Mississippi
Détour mortel
Only God Forgives
A Serious Man
Bienvenue à Gattaca
Colombiana
Rome, ville ouverte
Man of Steel
Black Book
La Rafle
Aliens : Le Retour
Les Petits Mouchoirs
Mysterious Skin
Rashômon
Lolita
Le Mystère de la matière noire
Godzilla
9 mois ferme
Pour une poignée de dollars
Les Enfants du paradis
Drive
Fight Club
Evil Dead
Le Labyrinthe
Sous les jupes des filles
Le Seigneur des Anneaux : La Communauté de l'anneau
La Chasse
Le Locataire
Gone Girl
La Planète des singes : L'Affrontement
L'Homme sans âge
Cinquante nuances de Grey
Change it to this. The problematic part is urllib2 in Python 3:
The urllib2 module has been split across several modules in Python 3, named urllib.request and urllib.error. The 2to3 tool will automatically adapt imports when converting your sources to Python 3.
from bs4 import BeautifulSoup
from urllib.request import urlopen

url_base = "https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-"

for page in range(1, 3): #nb_pages+1):
    url_n = url_base + str(page)
    print(url_n)
    html_n = urlopen(url_n).read().decode('utf-8')
    soup_n = BeautifulSoup(html_n, 'html.parser')
    for film in soup_n.find_all('li', attrs={"class": u"elli-item"}):
        print(film.find('a', attrs={"class": u"elco-anchor"}).text)
Output
Old Boy
Lucy
Le Loup de Wall Street
Tomboy
Dersou Ouzala
Les Lumières de la ville
12 hommes en colère
Gravity
Hunger Games : La Révolte, partie 1
Le Parrain
La Nuit du chasseur
La Belle et la Bête
The Big Lebowski
Interstellar
Le Ruban blanc
Vive la France
La vie est belle
Le Hobbit : La Bataille des cinq armées
Pulp Fiction
Melancholia
Her
The Grand Budapest Hotel
The Tree of Life
Le Prestige
Conan le Barbare
Ninja Turtles
Jurassic Park III
Rebelle
Mud, sur les rives du Mississippi
Détour mortel
https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-2
Old Boy
Lucy
Le Loup de Wall Street
Tomboy
Dersou Ouzala
Les Lumières de la ville
12 hommes en colère
Gravity
Hunger Games : La Révolte, partie 1
Le Parrain
La Nuit du chasseur
La Belle et la Bête
The Big Lebowski
Interstellar
Le Ruban blanc
Vive la France
La vie est belle
Le Hobbit : La Bataille des cinq armées
Pulp Fiction
Melancholia
Her
The Grand Budapest Hotel
The Tree of Life
Le Prestige
Conan le Barbare
Ninja Turtles
Jurassic Park III
Rebelle
Mud, sur les rives du Mississippi
Détour mortel

Can't get text without tag using Selenium Python

First of all, I'll show the code that I'm having a problem with, in order to better explain myself.
<div class="archivos"> ... </div>
<br>
<br>
<br>
<br>
THIS IS THE TEXT THAT I WANT TO CHECK
<div class="archivos"> ... </div>
...
I'm using Selenium in Python.
So, this is a piece of the HTML that I'm working with. My objective is this: inside the div with class="archivos" there's a link that I want to click, but first I need to analyze the text above it to decide whether or not to click the link.
The problem is that the text has no tag around it, and I can't seem to find a way to extract it so I can search it for the information I want. The text changes every time, so I need to locate the text preceding every class="archivos" element.
So far I've tried a lot of ways to find it, mainly using XPath, trying to get to the previous element of the div. I haven't come up with anything that works yet, as I'm not very experienced with Selenium and XPath.
I've found this https://chercher.tech/python/relative-xpath-selenium-python, which helped me try some XPaths, and several answers here on SO, but to no avail.
I've read somewhere that I can run JavaScript code from Python using Selenium to get it, but I don't know JavaScript and don't know how to do it. Maybe somebody understands what I'm talking about.
This is the webpage if it helps: http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901
Thanks in advance for the help, and I'll provide any further information if it's needed.
Here is an example of how to extract the preceding text with BeautifulSoup. I loaded the page with the requests module, but you can feed the HTML source to BeautifulSoup from Selenium; a sketch of that follows the sample output below:
import requests
from bs4 import BeautifulSoup

url = 'http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for t in soup.select('.archivos'):
    previous_text = t.find_previous(text=True).strip()
    link = t.a['href']
    print(previous_text)
    print('http://www.boa.aragon.es' + link)
    print('-' * 80)
Prints:
ORDEN HAP/804/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo de los Departamentos de Industria, Competitividad y Desarrollo Empresarial y de Economía, Planificación y Empleo.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=1&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN HAP/805/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo del Departamento de Agricultura, Ganadería y Medio Ambiente.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=2&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN HAP/806/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo del Organismo Autónomo Instituto Aragonés de Servicios Sociales.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=3&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN ECD/807/2020, de 24 de agosto, por la que se aprueba el expediente relativo al procedimiento selectivo de acceso al Cuerpo de Catedráticos de Música y Artes Escénicas.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=4&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
RESOLUCIÓN de 28 de julio de 2020, de la Dirección General de Justicia, por la que se convocan a concurso de traslado plazas vacantes entre funcionarios de los Cuerpos y Escalas de Gestión Procesal y Administrativa, Tramitación Procesal y
Administrativa y Auxilio Judicial de la Administración de Justicia.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=5&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
...and so on.
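As mentioned above, the same parsing can be driven from Selenium by handing driver.page_source to BeautifulSoup. A minimal sketch (the Firefox driver setup and the 'RESOLUCIÓN' filter are assumptions; adjust them to your case):

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901'

driver = webdriver.Firefox()   # assumed setup; use whichever browser/driver you already have
driver.get(url)

# parse the rendered page with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

to_visit = []
for t in soup.select('.archivos'):
    previous_text = t.find_previous(text=True).strip()
    if 'RESOLUCIÓN' in previous_text:   # hypothetical condition deciding whether to follow the link
        to_visit.append('http://www.boa.aragon.es' + t.a['href'])

for link in to_visit:
    driver.get(link)   # or locate the same <a> with Selenium and .click() it

driver.quit()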

How to change language in Request (GET) URL?

I am trying this code, however I am still unable to change the language of the URL.
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.fincaraiz.com.co/apartamento-apartaestudio/arriendo/bogota/'
headers = {"Accept-Language": "en-US,en;q=0.5"}
params = dict(lang='en-US,en;q=0.5')
response = get(url, headers = headers, params= params)
print(response.text[:500])

html_soup = BeautifulSoup(response.text, 'html.parser')
titles = []
for a in html_soup.findAll('div', id = 'divAdverts'):
    for x in html_soup.findAll(class_ = 'h2-grid'):
        title = x.text.replace("\r", "").replace("\n", "").strip()
        titles.append(title)
titles
Output
['Local en Itaguí - Santamaría',
'Casa en Sopó - Vereda Comuneros',
'Apartamento en Santa Marta - Bello Horizonte',
'Apartamento en Funza - Zuame',
'Casa en Bogotá - Centro Comercial Titán Plaza',
'Apartamento en Cali - Los Cristales',
'Apartamento en Itaguí - Suramerica',
'Casa en Palmira - Barrio Contiguo A Las Flores',
'Apartamento en Cali - La Hacienda',
'Casa en Bogotá - Marsella',
'Casa en Medellín - La Castellana',
'Casa en Villavicencio - Quintas De San Fernando',
'Apartamento en Santa Marta - Playa Salguero',
'Casa Campestre en Rionegro - La Mosquita',
'Casa Campestre en Jamundí - La Morada',
'Casa en Envigado - Loma De Las Brujas',
'Casa Campestre en El Retiro - Los Salados']
Does anyone know how I can change the language of the URL? I've tried everything.
I am only giving an example for the particular field title; you may extend it to the other fields. You may face issues such as being blocked by Google for the number of concurrent requests while using this library, as it is not an official one. You should also read the note in the documentation: https://pypi.org/project/googletrans/
from requests import get
from bs4 import BeautifulSoup
from googletrans import Translator

translator = Translator()

url = 'https://www.fincaraiz.com.co/apartamento-apartaestudio/arriendo/bogota/'
headers = {"Accept-Language": "en-US,en;q=0.5"}
params = dict(lang='en-US,en;q=0.5')
response = get(url, headers = headers, params= params)

titles = []
html_soup = BeautifulSoup(response.text, 'html.parser')
for a in html_soup.findAll('div', id = 'divAdverts'):
    for x in html_soup.findAll(class_ = 'h2-grid'):
        title = x.text.replace("\r", "").replace("\n", "").strip()
        titles.append(title)

english_titles = []
english_translations = translator.translate(titles)
for trans in english_translations:
    english_titles.append(trans.text)
print(english_titles)
Since you are translating from Spanish to English, you can specify the parameters explicitly: translator.translate(titles, src="es", dest="en").
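For example, a small sketch of the explicit call (googletrans is unofficial, so behaviour can change between versions):

from googletrans import Translator

translator = Translator()
sample = ['Local en Itaguí - Santamaría', 'Casa en Sopó - Vereda Comuneros']
# translate a list of Spanish titles to English explicitly
translated = translator.translate(sample, src="es", dest="en")
print([t.text for t in translated])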

Parsing a tag in HTML

I know that this question has been asked before, but I think not in this specific situation. If it has, feel free to point me to it.
I have an HTML file organized (you can view the original here) this way:
<h5 id="foo1">Title 1</h5>
<table class="foo2">
<tbody>
<tr>
<td>
<h3 class="foo3">SomeName1</h3>
<img src="Somesource" alt="SomeName2" title="SomeTitle"><br>
<p class="textcode">
Some precious text here
</p>
</td>
...
</table>
I would like to extract the name, the image, and the text contained in the <p> of each table cell, for each h5 separately, meaning I would like to save each one of these items in a separate folder named after the corresponding h5.
I tried this:
# coding: utf-8
import os
import re
from bs4 import BeautifulSoup as bs

os.chdir("WorkingDirectory")

# Open the HTML file and load its content into the variable of the same name
with open("TheGoodPath.htm", "r") as html:
    html = bs(html, 'html.parser')

# Select the headers, restrict the results to the first six, and create the folders
h5 = html.find_all("h5", limit=6)
for h in h5:
    # Create the files named after the headers
    chemin = u"../Résulat/"
    nom = str(h.contents[0].string)
    os.makedirs(chemin + nom, exist_ok=True)
    # Select the sibling table located just after the header
    table = h.find_next_sibling(name='table')
    for t in table:
        # Select the headers containing the document titles
        h3 = t.find_all("h3")
        for k in h3:
            titre = str(k.string)
            # Create the directories named after the figures
            os.makedirs(chemin + nom + titre, exist_ok=True)
            os.fdopen(titre.tex)
            # Get the image located in the sibling tag just after the previous header
            img = k.find_next_sibling("img")
            chimg = img.img['src']
            os.fdopen(img.img['title'])
            # Get the TikZ code located in the sibling tag just after the previous header
            tikz = k.find_next_sibling('p')
            # Extract the TikZ code contained in the tag retrieved above
            code = tikz.get_text()
            # Define then write the preamble and the code needed to produce the image saved above
            preambule = r"%PREAMBULE \n \usepackage{pgfplots} \n \usepackage{tikz} \n \usepackage[european resistor, european voltage, european current]{circuitikz} \n \usetikzlibrary{arrows,shapes,positioning} \n \usetikzlibrary{decorations.markings,decorations.pathmorphing, decorations.pathreplacing} \n \usetikzlibrary{calc,patterns,shapes.geometric} \n %FIN PREAMBULE"
            with open(chemin + nom + titre, 'w') as result:
                result.write(preambule + code)
But it raises AttributeError: 'NavigableString' object has no attribute 'find_next_element' on the line h3 = t.find_all("h3").
This seems to be what you want. There only seems to be one table between each h5, so don't iterate over it; just use find_next and use the table returned:
from bs4 import BeautifulSoup
import requests

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text
soup = BeautifulSoup(cont)

h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find first table after
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        print(img["src"])
        print(img["title"])
        print(img.find_next("p").text)
        print()
Which gives you output like:
repere-plan.svg
\begin{tikzpicture}[scale=1]
\draw (0,0) --++ (1,1) --++ (3,0) --++ (-1,-1) --++ (-3,0);
\draw [thick] [->] (2,0.5) --++(0,2) node [right] {z};
%thick : gras ; very thick : très gras ; ultra thick : hyper gras
\draw (2,0.5) node [left] {O};
\draw [thick] [->] (2,0.5) --++(-1,-1) node [left] {x};
\draw [thick] [->] (2,0.5) --++(2,0) node [below] {y};
\end{tikzpicture}
Lignes de champ et équipotentielles
images/cours-licence/em3/ligne-champ-equipot.svg
ligne-champ-equipot.svg
\begin{tikzpicture}[scale=0.8]
\draw[->] (-2,0) -- (2,0);
\draw[->] (0,-2) -- (0,2);
\draw node [red] at (-2,1.25) {\scriptsize{Lignes de champ}};
\draw node [blue] at (2,-1.25) {\scriptsize{Equipotentielles}};
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sin(\x r)*3*sin(\x r)*5});
%r = angle en radian
%domain permet de définir le domaine dans lequel la fonction sera tracée
%samples=200 permet d'augmenter le nombre de points pour le tracé
%smooth améliore également la qualité de la trace
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sin(\x r)*2*sin(\x r)*5});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sqrt(abs(cos(\x r)))*15});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sqrt(abs(cos(\x r)))*15});
\end{tikzpicture}
Fonction arctangente
images/schemas/math/arctan.svg
arctan.svg
\begin{tikzpicture}[scale=0.8]
\draw[very thin,color=gray] (-pi,pi) grid (-pi,pi);
\draw[->] (-pi,0) -- (pi,0) node[right] {$x$};
\draw[->] (0,-2) -- (0,2);
\draw[color=red,domain=-pi:pi,samples=150] plot ({\x},{rad(atan(\x))} )node[right,red] {$\arctan(x)$};
\draw[color=blue,domain=-pi:pi] plot ({\x},{rad(-atan(\x))} )node[right,blue] {$-\arctan(x)$};
%Le rad() est une autre façon de dire que l'argument est en radian
\end{tikzpicture}
To write all the .svg files to disk:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os import path

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text
soup = BeautifulSoup(cont)

base_url = "http://www.physagreg.fr/"
h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find first table after
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        src, title = img["src"], img["title"]
        # join base url and image url
        img_url = urljoin(base_url, src)
        # open file using title as file name (binary mode, since we write raw bytes)
        with open(title, "wb") as f:
            # request the img url and write the content
            f.write(requests.get(img_url).content)
Which will give you arctan.svg, courbe-Epeff.svg, and all the rest of the images on the page.
It looks like (judging by the for t in table loop) you meant to find multiple "table" elements. Use find_next_siblings() instead of find_next_sibling():
table = h.find_next_siblings(name='table')
for t in table:
