How to change language in Request (GET) URL? - python

I am trying this code, however I am still unable to change the language of the page:
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.fincaraiz.com.co/apartamento-apartaestudio/arriendo/bogota/'
headers = {"Accept-Language": "en-US,en;q=0.5"}
params = dict(lang='en-US,en;q=0.5')
response = get(url, headers=headers, params=params)
print(response.text[:500])

html_soup = BeautifulSoup(response.text, 'html.parser')
titles = []
for a in html_soup.findAll('div', id='divAdverts'):
    for x in html_soup.findAll(class_='h2-grid'):
        title = x.text.replace("\r", "").replace("\n", "").strip()
        titles.append(title)
titles
Output
['Local en Itaguí - Santamaría',
'Casa en Sopó - Vereda Comuneros',
'Apartamento en Santa Marta - Bello Horizonte',
'Apartamento en Funza - Zuame',
'Casa en Bogotá - Centro Comercial Titán Plaza',
'Apartamento en Cali - Los Cristales',
'Apartamento en Itaguí - Suramerica',
'Casa en Palmira - Barrio Contiguo A Las Flores',
'Apartamento en Cali - La Hacienda',
'Casa en Bogotá - Marsella',
'Casa en Medellín - La Castellana',
'Casa en Villavicencio - Quintas De San Fernando',
'Apartamento en Santa Marta - Playa Salguero',
'Casa Campestre en Rionegro - La Mosquita',
'Casa Campestre en Jamundí - La Morada',
'Casa en Envigado - Loma De Las Brujas',
'Casa Campestre en El Retiro - Los Salados']
Does anyone know how I can change the language of the page? I have tried everything.

I am only giving an example for one particular field, the title; you can extend it to the other fields. You may run into issues such as being blocked by Google for the number of concurrent requests, since this library is not an official one. You should also read the Note in the documentation: https://pypi.org/project/googletrans/
from requests import get
from bs4 import BeautifulSoup
from googletrans import Translator

translator = Translator()
url = 'https://www.fincaraiz.com.co/apartamento-apartaestudio/arriendo/bogota/'
headers = {"Accept-Language": "en-US,en;q=0.5"}
params = dict(lang='en-US,en;q=0.5')
response = get(url, headers=headers, params=params)

titles = []
html_soup = BeautifulSoup(response.text, 'html.parser')
for a in html_soup.findAll('div', id='divAdverts'):
    for x in html_soup.findAll(class_='h2-grid'):
        title = x.text.replace("\r", "").replace("\n", "").strip()
        titles.append(title)

english_titles = []
english_translations = translator.translate(titles)
for trans in english_translations:
    english_titles.append(trans.text)
print(english_titles)
Since you are translating from Spanish to English, you can specify the languages explicitly: translator.translate(titles, src="es", dest="en").
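A minimal sketch of that call, assuming the same titles list as above and a googletrans version whose translate() accepts a list:

from googletrans import Translator

translator = Translator()
titles = ['Local en Itaguí - Santamaría', 'Casa en Sopó - Vereda Comuneros']  # sample titles scraped above

# one Translated object comes back per input string; .origin is the source text
translations = translator.translate(titles, src='es', dest='en')
for t in translations:
    print(t.origin, '->', t.text)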

Related

Get full text below h2 tag with BeautifulSoup

I am trying to obtain different characters' descriptions and abilities for a dataset.
The problem I've encountered is that there seems to be a span tag within the h2 tag, and in some cases a figure before the p tags. This is the format I'm facing:
<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<figure class="thumb tleft " style="width: 100px">
<p>...</p>
<p>...</p>
<p>...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>...</p>
<p>...</p>
<p>...</p>
I need to obtain the text in those paragraphs.
I have tried something like this but it obviously does not work.
import urllib.request
from bs4 import BeautifulSoup

fp = urllib.request.urlopen("https://jojo.fandom.com/es/wiki/Star_Platinum")
mybytes = fp.read()
html_doc = mybytes.decode("utf8")
fp.close()

soup = BeautifulSoup(html_doc, 'html.parser')
spans = soup.find_all('span', {"class": "mw-headline"})
for s in spans:
    print(s.nextSibling.getText)
You can search backwards for the previous <h2> and store the results in a dictionary:
from bs4 import BeautifulSoup
html_doc = '''\
<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<figure class="thumb tleft " style="width: 100px">
<p>T1 ...</p>
<p>T2 ...</p>
<p>T3 ...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>T4 ...</p>
<p>T5 ...</p>
<p>T6 ...</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')
out = {}
for p in soup.select('p'):
    previous_h2 = p.find_previous('h2')
    out.setdefault(previous_h2.text, []).append(p.text)
print(out)
Prints:
{
'Apariencia': ['T1 ...', 'T2 ...', 'T3 ...'],
'Personalidad': ['T4 ...', 'T5 ...', 'T6 ...']
}
I think in this case, .select with CSS selectors would be very useful.
To get all the section headers, you can use the selector h2:has(span.mw-headline[id]); and to get a specific header by the id attribute value (hId below) of the span nested within, you can use the selector hSel below.
hId = 'Personalidad' # for example
hSel = f'h2:has(span.mw-headline[id="{hId}"])'
Then, to get all the tags after that header, you can use f'{hSel}~*', but you also need to filter out the tags after the next header to only get tags in that section, so the full selector would be
sSel = f'{hSel}~*:not({hSel}~h2~*):not(h2)'
[:not({hSel}~h2~*) filters out the tags after the next header, and :not(h2) filters out the next header tag itself.]
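As a quick illustration, here is a minimal sketch that reuses the small html_doc from the previous answer, showing the composed selector picking out only the paragraphs of one section:

from bs4 import BeautifulSoup

html_doc = '''\
<h2><span class="mw-headline" id="Apariencia">Apariencia</span></h2>
<p>T1 ...</p>
<h2><span class="mw-headline" id="Personalidad">Personalidad</span></h2>
<p>T4 ...</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')

hId = 'Apariencia'
hSel = f'h2:has(span.mw-headline[id="{hId}"])'
sSel = f'{hSel}~*:not({hSel}~h2~*):not(h2)'
print([p.text for p in soup.select(sSel)])  # ['T1 ...'] - paragraphs after the next h2 are filtered out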
Making use of this, the function below returns a dictionary containing section header, text and html.
def get_wikiSection(header, wSoup, sec1Header='? Abstract ?'):
    sSel = 'h2:has(span.mw-headline[id])'
    sSel = f'*:has(~{sSel}):not({sSel}~*)'
    if not header: hId = hSel = None  # [first section has no header]
    elif isinstance(header, str):
        hId, hSel = header, f'h2:has(span.mw-headline[id="{header}"])'
        header = wSoup.select_one(hSel)
        if not header: return {'errorMsg': f'Not found: {hSel}'}
    else: hId = header.select_one('span.mw-headline[id]')['id']
    ## header SHOULD BE: None/hId/a tag containing span.mw-headline[id] ##
    if hId:
        hSel = f'h2:has(span.mw-headline[id="{hId}"])'
        sSel = f'{hSel}~*:not({hSel}~h2~*):not(h2)'
        header = header.get_text(' ').strip()
    else: header = sec1Header
    sect = wSoup.select(sSel)
    sText = '\n'.join([s.get_text(' ').strip() for s in sect])
    sHtml = '\n'.join([''.join(s.prettify().splitlines()) for s in sect])
    if not sect: sText = sHtml = None
    return {'headerId': hId, 'sectionHeader': header,
            'sectionText': sText, 'sectionHtml': sHtml}
For example, get_wikiSection('Personalidad', soup) will return
{
'headerId': 'Personalidad',
'sectionHeader': 'Personalidad',
'sectionText': 'Jotaro describió a Star Platinum como un ser muy violento. Suele ser silencioso, excepto cuando lanza golpes, gritando "ORA ORA ORA" en voz alta, bastante rápido y repetidamente. Con una cara relativamente humana ha demostrado tener expresiones tales como fruncir el ceño y sonreír.\nStar Platinum demuestra un tipo de interés en la auto-conservación, como se ve cuando detiene una bala que Jotaro dispara experimentalmente en su propia cabeza, protege a un Jotaro incapacitado de los ataques de DIO durante los efectos de su The World , y lo revive de cerca de la muerte directamente hacia latir su corazón (Sin embargo, considerando el papel pionero de Star Platinum en la serie, esta capacidad puede hablar principalmente de cualidades metafísicas o subconscientes genéricas para los usuarios de Stand).\nEn el manga original, Star Platinum desde el principio se ve con una sonrisa amplia y desconcertante. Más tarde, Star Platinum gana el rostro estoico de Jotaro, cualquier sonrisa futura va a advertir a la persona que se dirige a un gran dolor inminente.\nStar Platinum lleva el nombre de la carta del Tarot La Estrella , que simboliza el optimismo, el discernimiento y la esperanza.',
'sectionHtml': '<p> Jotaro describió a Star Platinum como un ser muy violento. Suele ser silencioso, excepto cuando lanza golpes, gritando "ORA ORA ORA" en voz alta, bastante rápido y repetidamente. Con una cara relativamente humana ha demostrado tener expresiones tales como fruncir el ceño y sonreír.</p>\n<p> Star Platinum demuestra un tipo de interés en la auto-conservación, como se ve cuando detiene una bala que Jotaro dispara experimentalmente en su propia cabeza, protege a un Jotaro incapacitado de los ataques de DIO durante los efectos de su The World , y lo revive de cerca de la muerte directamente hacia latir su corazón (Sin embargo, considerando el papel pionero de Star Platinum en la serie, esta capacidad puede hablar principalmente de cualidades metafísicas o subconscientes genéricas para los usuarios de Stand).</p>\n<p> En el manga original, Star Platinum desde el principio se ve con una sonrisa amplia y desconcertante. Más tarde, Star Platinum gana el rostro estoico de Jotaro, cualquier sonrisa futura va a advertir a la persona que se dirige a un gran dolor inminente.</p>\n<p> Star Platinum lleva el nombre de la carta del Tarot <a class="extiw" href="http://en.wikipedia.org/wiki/es:La_Estrella_(Tarot)" title="wikipedia:es:La Estrella (Tarot)"> La Estrella </a> , que simboliza el optimismo, el discernimiento y la esperanza.</p>'
}
If you want all the sections:
h2Tags = soup.select('h2:has(span.mw-headline[id])')
wikiSections = [get_wikiSection(h, soup) for h in [None]+h2Tags]
The resulting list of dictionaries can also be converted to a pandas DataFrame as simply as pd.DataFrame(wikiSections).
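For a complete run against the page from the question, a sketch (assuming get_wikiSection as defined above):

import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

html_doc = urllib.request.urlopen("https://jojo.fandom.com/es/wiki/Star_Platinum").read().decode("utf8")
soup = BeautifulSoup(html_doc, 'html.parser')

h2Tags = soup.select('h2:has(span.mw-headline[id])')
wikiSections = [get_wikiSection(h, soup) for h in [None] + h2Tags]
df = pd.DataFrame(wikiSections)  # one row per section
print(df[['headerId', 'sectionHeader']])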

How could I extract the content of an article from a html webpage using python Beautiful Soup?

I am a novice with Python and I am trying to do web scraping as an exercise. I would like to scrape the content and the title of each article inside a web page.
I have a problem with my code because I do not think it is very efficient and I would like to optimize it.
The page I am trying to scrape is https://www.ansa.it/sito/notizie/politica/politica.shtml
This is what I have done so far:
#libraries
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import urllib.request,sys,time
import csv
from csv import writer
import time
from datetime import datetime

r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
b = soup(r.content, 'lxml')

title=[]
links=[]
content=[]
for c in b.findAll('h3',{'class':'news-title'}):
    title.append(c.text.strip())
for c in b.findAll("h3", {"class": "news-title"}):
    links.append(c.a["href"])
for link in links:
    page = requests.get('https://www.ansa.it'+link)
    bsjop = soup(page.content)
    for n in bsjop.findAll('div',{'itemprop': 'articleBody'}):
        content.append(n.text.strip())
The problem is that my output is made of multiple links, multiple titles and multiple contents that do not match each other (e.g. one article ends up with a title and a content that have nothing to do with each other).
If you know ways that I can improve my code, it would be nice.
Thanks!
To get all article titles, URLs and texts into a Pandas DataFrame, you can use the next example (I used the tqdm module to get a nice progress bar):
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup

url = "https://www.ansa.it/sito/notizie/politica/politica.shtml"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for title in tqdm(soup.select("h3.news-title")):
    t = title.get_text(strip=True)
    u = title.a["href"]
    s = BeautifulSoup(
        requests.get("https://www.ansa.it" + u).content, "html.parser"
    )
    text = s.select_one('[itemprop="articleBody"]')
    text = text.get_text(strip=True, separator="\n") if text else ""
    all_data.append([t, u, text])

df = pd.DataFrame(all_data, columns=["Title", "URL", "Text"])
df.to_csv("data.csv", index=False)
Creates data.csv with one row per article.
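A quick way to sanity-check the file afterwards is to read it back with pandas, for example:

import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())              # first rows: Title, URL, Text
print(len(df), "articles")

An alternative approach walks each article card directly, pairing every title with its own content and URL: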
import requests
from bs4 import BeautifulSoup
import pandas as pd

news_list = []
r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
soup = BeautifulSoup(r.text, 'html.parser')
articles = soup.select('article.news')
for art in articles:
    try:
        title = art.select_one('h3').text.strip()
        if 'javascript:void(0);' in art.select('a')[0].get('href'):
            url = 'https://www.ansa.it' + art.select('a')[1].get('href')
        else:
            url = 'https://www.ansa.it' + art.select('a')[0].get('href')
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.select_one('div.news-txt').text.strip()
        print(f'retrieving {url}')
        news_list.append((title, content, url))
    except Exception as e:
        print(art.text.strip(), e)
df = pd.DataFrame(news_list, columns=['Title', 'Content', 'Url'])
print(df)
This will return some errors for some links on the page, which you will have to investigate and debug (and ask for help if you need it - it's an important part of the learning process), plus a dataframe with the articles successfully retrieved, which looks like this:
Title Content Url
0 Letta: 'Ora difficile ricomporre con M5s'. Mel... Partiti scossi dallo scioglimento anticipato d... https://www.ansa.it/sito/notizie/politica/2022...
1 L'emozione di Draghi: 'Ancheil cuore dei banch... "Certe volte anche il cuore dei banchieri cent... https://www.ansa.it/sito/notizie/politica/2022...
2 La giornata di Draghi in foto https://www.ansa.it/sito/photogallery/primopia...
3 Il timing del voto, liste entro un mese. A Fer... Le liste dei candidati entro un mese a partire... https://www.ansa.it/sito/notizie/politica/2022...
4 Si lavora sulla concorrenza, ipotesi stralcio ... Il DDL Concorrenza andrà in Aula alla Camera l... https://www.ansa.it/sito/notizie/economia/2022...
5 Le cifre del governo Draghi: 55 voti fiducia e... Una media di 7,4 leggi approvate ogni mese su ... https://www.ansa.it/sito/notizie/politica/2022...
6 I 522 giorni del governo Draghi LE FOTO L'arrivo di SuperMario, gli incontri, le allea... https://www.ansa.it/sito/photogallery/primopia...
7 Presidi, disappunto per le urne in autunno, ci... "C'è disappunto non preoccupazione per le urne... https://www.ansa.it/sito/notizie/politica/2022...
8 Ucraina: Di Maio,sostegno ricerca mercati alte... (ANSA) - ROMA, 22 LUG - Lo scoppio del conflit... https://www.ansa.it/sito/photogallery/primopia...
9 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
10 Oltre mille sindaci a sostegno di Draghi Nei giorni che attendono il mercoledì che deci... https://www.ansa.it/sito/notizie/politica/2022...
11 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
12 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
13 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
14 Di Maio, Conte sta compiendo una vendetta poli... Se le cose restano come sono oggi "Mario Dragh... https://www.ansa.it/sito/notizie/politica/2022...
15 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
16 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
17 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
18 Il discorso di Draghi al Senato 'Partiti, pron... "Siamo qui perché lo hanno chiesto gli italian... https://www.ansa.it/sito/notizie/politica/2022...
19 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
20 Draghi al Senato per una fiducia al buio. Prem... Draghi al bivio tra governo e crisi. Alle 9.30... https://www.ansa.it/sito/notizie/politica/2022...
21 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
22 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
You don't need two different loops if you are referring to the same element. Try the below code to save the title and links.
for c in b.findAll('h3',{'class':'news-title'}):
    title.append(c.text.strip())
    links.append(c.a["href"])
By combining you will be sure that the title and link are scraped from the same element.
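As a minimal sketch, the combined loop could also fetch the matching article body in the same pass (assuming the question's imports and its soup alias for BeautifulSoup):

title, links, content = [], [], []
for c in b.findAll('h3', {'class': 'news-title'}):
    title.append(c.text.strip())
    links.append(c.a['href'])
    page = requests.get('https://www.ansa.it' + c.a['href'])
    body = soup(page.content, 'lxml').find('div', {'itemprop': 'articleBody'})
    content.append(body.text.strip() if body else '')
# title[i], links[i] and content[i] now all describe the same article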

scraped data using BeautifulSoup does not match source code

I'm new to webscraping. I have seen a few tutorials on how to scrape websites using beautifulsoup.
As an exercise I would like to extract data from a real estate website.
The specific page I want to scrape is this one: https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1
My goal is to extract a list of all the links to each real estate sale.
Afterwards, I want to loop through that list of links to extract all the data for each sale (price, location, nb bedrooms etc.)
The first issue I'm encountering is that the data scraped using the classic BeautifulSoup code does not match the source code of the webpage.
This is my code:
import requests
from bs4 import BeautifulSoup

URL = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
page = requests.get(URL)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Hence, when looking for the links of each real estate sale, which are located under
soup.find_all("a", class_="card__title-link")
it outputs an empty list. Indeed, these tags were not properly extracted by my code above.
Why is that? What should I do to ensure that the extracted html correctly corresponds to what is visible in the source code of the website?
Thank you :-)
The data you see is embedded within the page in JSON format. You can use this example to see how to load it:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = json.loads(soup.find("iw-search")[":results"])

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data:
for ad in data:
    print(
        "{:<63} {:<8} {}".format(
            ad["property"]["title"],
            ad["transaction"]["sale"]["price"] or "-",
            "https://www.immoweb.be/fr/annonce/{}".format(ad["id"]),
        )
    )
Prints:
Triplex appartement met 3 slaapkamers en garage. 239000 https://www.immoweb.be/fr/annonce/9309298
Appartement 285000 https://www.immoweb.be/fr/annonce/9309895
Heel ruime, moderne, lichtrijke Duplex te koop, bij centrum 269000 https://www.immoweb.be/fr/annonce/9303797
À VENDRE PAR LANDBERGH : appartement de deux chambres à Gand 359000 https://www.immoweb.be/fr/annonce/9310300
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309278
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309251
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309264
Appartement intéressant avec agréable vue panoramique verdoy 219000 https://www.immoweb.be/fr/annonce/9309366
Projet Utopia by Godin - https://www.immoweb.be/fr/annonce/9309458
Appartement 2-ch avec vue unique! 270000 https://www.immoweb.be/fr/annonce/9309183
Residentieel wonen in Hélécine, dichtbij de natuur en de sne - https://www.immoweb.be/fr/annonce/9309241
Appartement 375000 https://www.immoweb.be/fr/annonce/9309187
DUPLEX LUMIEUX ET SPACIEUX 380000 https://www.immoweb.be/fr/annonce/9298271
SINT-PIETERS-LEEUW / Magnifique maison de ±130m² avec jardin 430000 https://www.immoweb.be/fr/annonce/9310259
PARC PARMENTIER // APP MODERNE 3CH 490000 https://www.immoweb.be/fr/annonce/9262193
BOIS DE LA CAMBRE – AV DE FRE – CLINIQUES DE L’EUROPE 575000 https://www.immoweb.be/fr/annonce/9309664
Entre Stockel et le Stade Fallon 675000 https://www.immoweb.be/fr/annonce/9310094
Maisons neuves dans un cadre verdoyant - https://www.immoweb.be/fr/annonce/6792221
Nieuwbouwproject Dockside Gardens - Gent - https://www.immoweb.be/fr/annonce/9008956
Appartement 139000 https://www.immoweb.be/fr/annonce/9187904
A VENDRE CHEZ LANDBERGH: appartements à Merelbeke Flora - https://www.immoweb.be/fr/annonce/9306877
Très beau studio avec une belle vue sur la plage et la mer! 319000 https://www.immoweb.be/fr/annonce/9306787
BEL APPARTEMENT LUMINEUX DIAMANT / PLASKY 320000 https://www.immoweb.be/fr/annonce/9264748
Un projet d'appartements neufs à proximité de Woluwé-St-Lamb - https://www.immoweb.be/fr/annonce/9308037
PLACE JOURDAN - 2 CHAMBRES 345000 https://www.immoweb.be/fr/annonce/9306953
Magnifiek appartement in de Brugse Rand - Assebroek 399000 https://www.immoweb.be/fr/annonce/9306613
Bien d'exception 415000 https://www.immoweb.be/fr/annonce/9308022
Appartement 435000 https://www.immoweb.be/fr/annonce/9307802
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307178
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307177
EDIT: Added URL column.
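If a tabular result is preferred, the parsed list converts directly to pandas; a small sketch assuming the data list from the code above:

import pandas as pd

df = pd.DataFrame(
    [
        {
            "Title": ad["property"]["title"],
            "Price": ad["transaction"]["sale"]["price"],
            "URL": "https://www.immoweb.be/fr/annonce/{}".format(ad["id"]),
        }
        for ad in data
    ]
)
print(df.head())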

How to edit Python code to loop the request to extract information from list

I'm only a couple of weeks into learning Python, and I'm trying to extract specific info from a list (events). I've been able to call the list and extract specific lines (info for a single event), but the objective is running the program and extracting the information from the entire list (info for all of the events).
Among others, my best guesses so far have been along the lines of:
one_a_tag = soup.findAll('a')[22:85]
and
one_a_tag = soup.findAll('a')[22+1]
But I come up with these errors:
TypeError Traceback (most recent call last)
<ipython-input-15-ee19539fbb00> in <module>
11 soup.findAll('a')
12 one_a_tag = soup.findAll('a')[22:85]
---> 13 link = one_a_tag['href']
14 'https://arema.mx' + link
15 eventUrl = ('https://arema.mx' + link)
TypeError: list indices must be integers or slices, not str
And
TypeError Traceback (most recent call last)
<ipython-input-22-81d98bcf8fd8> in <module>
10 soup
11 soup.findAll('a')
---> 12 one_a_tag = soup.findAll('a')[22]+1
13 link = one_a_tag['href']
14 'https://arema.mx' + link
TypeError: unsupported operand type(s) for +: 'Tag' and 'int'
This is the entire code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

url = 'https://arema.mx/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
soup
soup.findAll('a')
one_a_tag = soup.findAll('a')[22]
link = one_a_tag['href']
'https://arema.mx' + link
eventUrl = ('https://arema.mx' + link)
print(eventUrl)

def getAremaTitulo(eventUrl):
    res = requests.get(eventUrl)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'html.parser')  # was bs4.BeautifulSoup, but only BeautifulSoup is imported
    elems = soup.select('body > div.body > div.ar.eventname')
    return elems[0].text.strip()

def getAremaInfo(eventUrl):
    res = requests.get(eventUrl)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('body > div.body > div.event-header')
    return elems[0].text.strip()

titulo = getAremaTitulo(eventUrl)
print('Nombre de evento: ' + titulo)
info = getAremaInfo(eventUrl)
print('Info: ' + info)
time.sleep(1)
I'm sure there may be some redundancies in the code, but what I'm most keen on solving is creating a loop to extract the specific info I'm looking for from all of the events. What do I need to add to get there?
Thanks!
To get all information about events, you can use this script:
import requests
from bs4 import BeautifulSoup

url = 'https://arema.mx/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for event_link in soup.select('#events a.event'):
    u = 'https://arema.mx' + event_link['href']
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    event_name = s.select_one('.eventname').get_text(strip=True)
    event_info = s.select_one('.event-header').text.strip()
    print(event_name)
    print(event_info)
    print('-' * 80)
Prints:
...
--------------------------------------------------------------------------------
NOCHE BOHEMIA <A PIANO Y GUITARRA>
"Freddy González y Víctor Freez dos amigos que al
paso del tiempo hermanaron sus talentos para crear un concepto musical cálido y
acústico entre cuerdas y teclas haciéndonos vibrar entre una línea de canciones
de ayer y hoy. Rescatando las bohemias que tantos recuerdos y encuentros nos han
generado a lo largo del tiempo.
 Precio: $69*ya incluye cargo de servicio.Fecha: Sábado 15 de agosto 20:00 hrsTransmisión en vivo por Arema LiveComo ingresar a ver la presentación.·         Dale clic en Comprar  y Selecciona tu acceso.·         Elije la forma de pago que más se te facilite y finaliza la compra.·         Te llegara un correo electrónico con la confirmación de compra y un liga exclusiva para ingresar a la transmisión el día seleccionado únicamente.La compra de tu boleto es un apoyo para el artista.Importante:  favor de revisar tu correo en bandeja de entrada, no deseados o spam ya que los correos en ocasiones son enviados a esas carpetas.
--------------------------------------------------------------------------------
...

Scraping Yahoo Finance with Python3

I'm a complete newbie in scraping and I'm trying to scrape https://fr.finance.yahoo.com and I can't figure out what I'm doing wrong.
My goal is to scrape the index name, current level and the change (both in value and in %).
Here is the code I have used:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find("div",attrs={'data-reactid':'12'})
print(main_table)
links = main_table.find_all("li", class_=' D(ib) Bxz(bb) Bdc($seperatorColor) Mend(16px) BdEnd ')
print(links)
However, the print(links) comes out empty. Could someone please assist? Any help would be highly appreciated as I have been trying to figure this out for a few days now.
Although the better way to get all the fields is to parse and process the relevant script tag, this is one of the ways you can get them all.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com/'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, 'html.parser')

rows = []
for item in soup.select("[id='market-summary'] li"):
    index_name = item.select_one("a").contents[1]
    current_level = ''.join(item.select_one("a > span").text.split())
    value = ''.join(item.select_one("a")['aria-label'].split("ou")[1].split("points")[0].split())
    percentage_change = ''.join(item.select_one("a > span + span").text.split())
    rows.append({'Index Name': index_name, 'Current Level': current_level,
                 'Value': value, 'Percentage Change': percentage_change})

# build the frame once at the end (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows, columns=['Index Name', 'Current Level', 'Value', 'Percentage Change'])
print(df)
The output looks like:
Index Name Current Level Value Percentage Change
0 CAC 40 4444,56 -0,88 -0,02%
1 Euro Stoxx 50 2905,47 0,49 +0,02%
2 Dow Jones 24438,63 -35,49 -0,15%
3 EUR/USD 1,0906 -0,0044 -0,40%
4 Gold future 1734,10 12,20 +0,71%
5 BTC-EUR 8443,23 161,79 +1,95%
6 CMC Crypto 200 185,66 4,42 +2,44%
7 Pétrole WTI 33,28 -0,64 -1,89%
8 DAX 11073,87 7,94 +0,07%
9 FTSE 100 5993,28 -21,97 -0,37%
10 Nasdaq 9315,26 30,38 +0,33%
11 S&P 500 2951,75 3,24 +0,11%
12 Nikkei 225 20388,16 -164,15 -0,80%
13 HANG SENG 22930,14 -1349,89 -5,56%
14 GBP/USD 1,2177 -0,0051 -0,41%
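As mentioned at the top of this answer, parsing the relevant script tag is the more robust route. A rough sketch of that idea follows; the root.App.main = {...}; pattern is an assumption about the page internals and may change at any time:

import json
import re
import requests

r = requests.get('https://fr.finance.yahoo.com/', headers={"User-Agent": "Mozilla/5.0"})
# assumption: the page state is embedded as "root.App.main = {...};" in a script tag
m = re.search(r'root\.App\.main\s*=\s*(\{.*\});', r.text)
if m:
    state = json.loads(m.group(1))
    print(list(state.keys()))  # explore the JSON tree to locate the market summary
else:
    print('pattern not found - the page structure has likely changed')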
I think you need to fix your element selection.
For example the following code:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

main_table = soup.find(id="market-summary")
links = main_table.find_all("a")
for i in links:
    print(i.attrs["aria-label"])
Gives output text having index name, % change, change, and value:
CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points
Euro Stoxx 50 a augmenté de 0,28 % ou 8,16 points pour atteindre 2 913,14 points
Dow Jones a diminué de -0,63 % ou -153,98 points pour atteindre 24 320,14 points
EUR/USD a diminué de -0,49 % ou -0,0054 points pour atteindre 1,0897 points
Gold future a augmenté de 0,88 % ou 15,10 points pour atteindre 1 737,00 points
a augmenté de 1,46 % ou 121,30 points pour atteindre 8 402,74 points
CMC Crypto 200 a augmenté de 1,60 % ou 2,90 points pour atteindre 184,14 points
Pétrole WTI a diminué de -3,95 % ou -1,34 points pour atteindre 32,58 points
DAX a augmenté de 0,29 % ou 32,27 points pour atteindre 11 098,20 points
FTSE 100 a diminué de -0,39 % ou -23,18 points pour atteindre 5 992,07 points
Nasdaq a diminué de -0,30 % ou -28,25 points pour atteindre 9 256,63 points
S&P 500 a diminué de -0,43 % ou -12,62 points pour atteindre 2 935,89 points
Nikkei 225 a diminué de -0,80 % ou -164,15 points pour atteindre 20 388,16 points
HANG SENG a diminué de -5,56 % ou -1 349,89 points pour atteindre 22 930,14 points
GBP/USD a diminué de -0,34 % ou -0,0041 points pour atteindre 1,2186 points
Try the following CSS selector to get all the links.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

links = [link['href'] for link in soup.select("ul#market-summary a")]
print(links)
Output:
['/quote/^FCHI?p=^FCHI', '/quote/^STOXX50E?p=^STOXX50E', '/quote/^DJI?p=^DJI', '/quote/EURUSD=X?p=EURUSD=X', '/quote/GC=F?p=GC=F', '/quote/BTC-EUR?p=BTC-EUR', '/quote/^CMC200?p=^CMC200', '/quote/CL=F?p=CL=F', '/quote/^GDAXI?p=^GDAXI', '/quote/^FTSE?p=^FTSE', '/quote/^IXIC?p=^IXIC', '/quote/^GSPC?p=^GSPC', '/quote/^N225?p=^N225', '/quote/^HSI?p=^HSI', '/quote/GBPUSD=X?p=GBPUSD=X']
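Since those hrefs are relative, they can be joined with the base URL using the standard library, for example:

from urllib.parse import urljoin

base = 'https://fr.finance.yahoo.com'
full_links = [urljoin(base, link) for link in links]
print(full_links[0])  # https://fr.finance.yahoo.com/quote/^FCHI?p=^FCHI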
