I'm a complete newbie at scraping, I'm trying to scrape https://fr.finance.yahoo.com, and I can't figure out what I'm doing wrong.
My goal is to scrape each index name, its current level, and the change (both in value and in %).
Here is the code I have used:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find("div",attrs={'data-reactid':'12'})
print(main_table)
links = main_table.find_all("li", class_=' D(ib) Bxz(bb) Bdc($seperatorColor) Mend(16px) BdEnd ')
print(links)
However, print(links) comes out empty. Could someone please assist? Any help would be highly appreciated, as I have been trying to figure this out for a few days now.
Although the better way to get all the fields is to parse and process the relevant script tag (see the sketch after the output below), this is one way you can get them all.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com/'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, 'html.parser')

rows = []
for item in soup.select("[id='market-summary'] li"):
    index_name = item.select_one("a").contents[1]
    current_level = ''.join(item.select_one("a > span").text.split())
    # the aria-label reads "... de X % ou Y points pour atteindre Z points"
    value = ''.join(item.select_one("a")['aria-label'].split("ou")[1].split("points")[0].split())
    percentage_change = ''.join(item.select_one("a > span + span").text.split())
    rows.append({'Index Name': index_name, 'Current Level': current_level,
                 'Value': value, 'Percentage Change': percentage_change})

# build the frame once at the end (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows, columns=['Index Name', 'Current Level', 'Value', 'Percentage Change'])
print(df)
The output looks like this:
Index Name Current Level Value Percentage Change
0 CAC 40 4444,56 -0,88 -0,02%
1 Euro Stoxx 50 2905,47 0,49 +0,02%
2 Dow Jones 24438,63 -35,49 -0,15%
3 EUR/USD 1,0906 -0,0044 -0,40%
4 Gold future 1734,10 12,20 +0,71%
5 BTC-EUR 8443,23 161,79 +1,95%
6 CMC Crypto 200 185,66 4,42 +2,44%
7 Pétrole WTI 33,28 -0,64 -1,89%
8 DAX 11073,87 7,94 +0,07%
9 FTSE 100 5993,28 -21,97 -0,37%
10 Nasdaq 9315,26 30,38 +0,33%
11 S&P 500 2951,75 3,24 +0,11%
12 Nikkei 225 20388,16 -164,15 -0,80%
13 HANG SENG 22930,14 -1349,89 -5,56%
14 GBP/USD 1,2177 -0,0051 -0,41%
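As for parsing the relevant script tag instead, here is a minimal sketch of that approach. It assumes the page embeds its state in a single-line root.App.main = {...}; script block; that marker is an assumption and can break whenever Yahoo changes the page.

import json
import re
import requests

url = 'https://fr.finance.yahoo.com/'
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# assumed marker: the page state used to live in "root.App.main = {...};"
m = re.search(r'root\.App\.main\s*=\s*(\{.*\});', html)
if m:
    data = json.loads(m.group(1))
    # explore the keys to locate the market summary data
    print(list(data.keys()))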
I think you need to fix your element selection.
For example the following code:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')
main_table = soup.find(id="market-summary")
links = main_table.find_all("a")
for i in links:
    print(i.attrs["aria-label"])
It gives output text containing the index name, % change, change, and value:
CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points
Euro Stoxx 50 a augmenté de 0,28 % ou 8,16 points pour atteindre 2 913,14 points
Dow Jones a diminué de -0,63 % ou -153,98 points pour atteindre 24 320,14 points
EUR/USD a diminué de -0,49 % ou -0,0054 points pour atteindre 1,0897 points
Gold future a augmenté de 0,88 % ou 15,10 points pour atteindre 1 737,00 points
a augmenté de 1,46 % ou 121,30 points pour atteindre 8 402,74 points
CMC Crypto 200 a augmenté de 1,60 % ou 2,90 points pour atteindre 184,14 points
Pétrole WTI a diminué de -3,95 % ou -1,34 points pour atteindre 32,58 points
DAX a augmenté de 0,29 % ou 32,27 points pour atteindre 11 098,20 points
FTSE 100 a diminué de -0,39 % ou -23,18 points pour atteindre 5 992,07 points
Nasdaq a diminué de -0,30 % ou -28,25 points pour atteindre 9 256,63 points
S&P 500 a diminué de -0,43 % ou -12,62 points pour atteindre 2 935,89 points
Nikkei 225 a diminué de -0,80 % ou -164,15 points pour atteindre 20 388,16 points
HANG SENG a diminué de -5,56 % ou -1 349,89 points pour atteindre 22 930,14 points
GBP/USD a diminué de -0,34 % ou -0,0041 points pour atteindre 1,2186 points
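If you then need the individual fields out of those labels, here is a small regex sketch; it assumes the exact French wording shown above ("a augmenté/diminué de X % ou Y points pour atteindre Z points") and will need adjusting if that changes.

import re

pattern = re.compile(
    r'^(?P<name>.*?)\s*a (?:augmenté|diminué) de\s+'
    r'(?P<pct>-?[\d\s,]+?)\s*%\s+ou\s+'
    r'(?P<change>-?[\d\s,]+?)\s*points pour atteindre\s+'
    r'(?P<level>[\d\s,]+?)\s*points$'
)

label = 'CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points'
m = pattern.match(label)
if m:
    print(m.groupdict())
    # -> {'name': 'CAC 40', 'pct': '0,37', 'change': '16,55', 'level': '4 461,99'}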
Try the following CSS selector to get all the links.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')
links = [link['href'] for link in soup.select("ul#market-summary a")]
print(links)
Output:
['/quote/^FCHI?p=^FCHI', '/quote/^STOXX50E?p=^STOXX50E', '/quote/^DJI?p=^DJI', '/quote/EURUSD=X?p=EURUSD=X', '/quote/GC=F?p=GC=F', '/quote/BTC-EUR?p=BTC-EUR', '/quote/^CMC200?p=^CMC200', '/quote/CL=F?p=CL=F', '/quote/^GDAXI?p=^GDAXI', '/quote/^FTSE?p=^FTSE', '/quote/^IXIC?p=^IXIC', '/quote/^GSPC?p=^GSPC', '/quote/^N225?p=^N225', '/quote/^HSI?p=^HSI', '/quote/GBPUSD=X?p=GBPUSD=X']
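Note these hrefs are relative. To turn them into absolute URLs you can join them against the host with urllib.parse.urljoin, for example:

from urllib.parse import urljoin

base = 'https://fr.finance.yahoo.com'
links = ['/quote/^FCHI?p=^FCHI', '/quote/^DJI?p=^DJI']  # e.g. from the list above
print([urljoin(base, link) for link in links])
# ['https://fr.finance.yahoo.com/quote/^FCHI?p=^FCHI', 'https://fr.finance.yahoo.com/quote/^DJI?p=^DJI']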
I am a novice with Python and I am trying to do web scraping as an exercise. I would like to scrape the content and the title of each article on a web page.
I have a problem with my code: I do not think it is very efficient and I would like to optimize it.
The page I am trying to scrape is https://www.ansa.it/sito/notizie/politica/politica.shtml
This is what I have done so far:
# libraries
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import urllib.request, sys, time
import csv
from csv import writer
import time
from datetime import datetime

r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
b = soup(r.content, 'lxml')

title = []
links = []
content = []

for c in b.findAll('h3', {'class': 'news-title'}):
    title.append(c.text.strip())

for c in b.findAll("h3", {"class": "news-title"}):
    links.append(c.a["href"])

for link in links:
    page = requests.get('https://www.ansa.it' + link)
    bsjop = soup(page.content)
    for n in bsjop.findAll('div', {'itemprop': 'articleBody'}):
        content.append(n.text.strip())
The problem is that my output is made of multiple links, multiple titles and multiple contents that do not match each other (e.g. one article gets a title and a content that have nothing to do with it).
If you know ways I can improve my code, that would be nice.
Thanks!
To get all article titles, URLs and texts into a Pandas DataFrame you can use the next example (I used the tqdm module to get a nice progress bar):
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup

url = "https://www.ansa.it/sito/notizie/politica/politica.shtml"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for title in tqdm(soup.select("h3.news-title")):
    t = title.get_text(strip=True)
    u = title.a["href"]
    s = BeautifulSoup(
        requests.get("https://www.ansa.it" + u).content, "html.parser"
    )
    text = s.select_one('[itemprop="articleBody"]')
    text = text.get_text(strip=True, separator="\n") if text else ""
    all_data.append([t, u, text])

df = pd.DataFrame(all_data, columns=["Title", "URL", "Text"])
df.to_csv("data.csv", index=False)
Creates data.csv containing the scraped titles, URLs and texts.
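You can sanity-check the file by loading it back:

import pandas as pd

# quick check: read the CSV created above and inspect the first rows
df = pd.read_csv("data.csv")
print(df.head())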
import requests
from bs4 import BeautifulSoup
import pandas as pd

news_list = []
r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
soup = BeautifulSoup(r.text, 'html.parser')
articles = soup.select('article.news')

for art in articles:
    try:
        title = art.select_one('h3').text.strip()
        if 'javascript:void(0);' in art.select('a')[0].get('href'):
            url = 'https://www.ansa.it' + art.select('a')[1].get('href')
        else:
            url = 'https://www.ansa.it' + art.select('a')[0].get('href')
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.select_one('div.news-txt').text.strip()
        print(f'retrieving {url}')
        news_list.append((title, content, url))
    except Exception as e:
        print(art.text.strip(), e)

df = pd.DataFrame(news_list, columns=['Title', 'Content', 'Url'])
print(df)
This will return some errors for some links on the page, which you will have to investigate and debug (and ask for help if you need it; that's an important part of the learning process), plus a dataframe of the articles successfully retrieved, which looks like this:
Title Content Url
0 Letta: 'Ora difficile ricomporre con M5s'. Mel... Partiti scossi dallo scioglimento anticipato d... https://www.ansa.it/sito/notizie/politica/2022...
1 L'emozione di Draghi: 'Ancheil cuore dei banch... "Certe volte anche il cuore dei banchieri cent... https://www.ansa.it/sito/notizie/politica/2022...
2 La giornata di Draghi in foto https://www.ansa.it/sito/photogallery/primopia...
3 Il timing del voto, liste entro un mese. A Fer... Le liste dei candidati entro un mese a partire... https://www.ansa.it/sito/notizie/politica/2022...
4 Si lavora sulla concorrenza, ipotesi stralcio ... Il DDL Concorrenza andrà in Aula alla Camera l... https://www.ansa.it/sito/notizie/economia/2022...
5 Le cifre del governo Draghi: 55 voti fiducia e... Una media di 7,4 leggi approvate ogni mese su ... https://www.ansa.it/sito/notizie/politica/2022...
6 I 522 giorni del governo Draghi LE FOTO L'arrivo di SuperMario, gli incontri, le allea... https://www.ansa.it/sito/photogallery/primopia...
7 Presidi, disappunto per le urne in autunno, ci... "C'è disappunto non preoccupazione per le urne... https://www.ansa.it/sito/notizie/politica/2022...
8 Ucraina: Di Maio,sostegno ricerca mercati alte... (ANSA) - ROMA, 22 LUG - Lo scoppio del conflit... https://www.ansa.it/sito/photogallery/primopia...
9 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
10 Oltre mille sindaci a sostegno di Draghi Nei giorni che attendono il mercoledì che deci... https://www.ansa.it/sito/notizie/politica/2022...
11 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
12 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
13 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
14 Di Maio, Conte sta compiendo una vendetta poli... Se le cose restano come sono oggi "Mario Dragh... https://www.ansa.it/sito/notizie/politica/2022...
15 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
16 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
17 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
18 Il discorso di Draghi al Senato 'Partiti, pron... "Siamo qui perché lo hanno chiesto gli italian... https://www.ansa.it/sito/notizie/politica/2022...
19 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
20 Draghi al Senato per una fiducia al buio. Prem... Draghi al bivio tra governo e crisi. Alle 9.30... https://www.ansa.it/sito/notizie/politica/2022...
21 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
22 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
You don't need two different loops if you are iterating over the same elements. Try the code below to save the titles and links.
for c in b.findAll('h3', {'class': 'news-title'}):
    title.append(c.text.strip())
    links.append(c.a["href"])
By combining them you make sure that the title and link are scraped from the same element; a self-contained version is sketched below.
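For completeness, a self-contained sketch of the combined loop (same page and selectors as in the question; the lxml parser must be installed):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
b = BeautifulSoup(r.content, 'lxml')

title = []
links = []
# one pass fills both lists in step, so title[i] always matches links[i]
for c in b.findAll('h3', {'class': 'news-title'}):
    title.append(c.text.strip())
    links.append(c.a["href"])

for t, link in zip(title, links):
    print(t, '->', link)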
I'm trying to scrape a list, not a table, from a Wikipedia page, but it says "list index out of range": how can I solve this?
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://it.m.wikipedia.org/wiki/Premio_Bagutta'
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
raw = soup.find_all("div", {"class": "div-col"})[0].find_all("li")
df = pd.DataFrame([[item.get_text().split(" ")[0],
                    item.find_next("a").get("title"),
                    item.find_next("i").get_text()[1:-1]]
                   for item in raw if item.find_next("i")],
                  columns=("Year"))
print(df.head())
You could try this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

data = requests.get("https://it.m.wikipedia.org/wiki/Premio_Bagutta")
raw = BeautifulSoup(data.content, "html.parser").find_all(
    "section", class_="mf-section-2 collapsible-block"
)[0]

raw_years = [item.text.replace("\n", "") for item in raw.find_all("p")]
raw_authors = [item for item in raw.find_all("ul")]

# For some years, there are several authors, so you have to iterate in sync
years = []
authors = []
for (year, author) in zip(raw_years, raw_authors):
    years.append(year)
    authors.append(author.text.split("\n"))

df = pd.DataFrame({"year": years, "author": authors}).explode("author")
print(df)
# Output
year author
0 1927 Giovanni Battista Angioletti, Il giorno del giudizio[11][12] (Ribet)
1 1928 Giovanni Comisso, Gente di mare[13][14] (Treves)
2 1929 Vincenzo Cardarelli, Il sole a picco[15][16] (Mondadori)
3 1930 Gino Rocca, Gli ultimi furono i primi[17][18] (Treves)
4 1931 Giovanni Titta Rosa, Il varco nel muro[19][20] (Carabba)
.. ... ...
82 2018 Helena Janeczek, La ragazza con la Leica[154] (Guanda)
83 2019 Marco Balzano, Resto qui[9][155] (Einaudi)
84 2020 Enrico Deaglio, La bomba[8][156] (Feltrinelli)
85 2021 Giorgio Fontana, Prima di noi[157] (Sellerio)
86 2022 Benedetta Craveri, La contessa[158][159] (Adelphi)
I'm new to web scraping. I have seen a few tutorials on how to scrape websites using BeautifulSoup.
As an exercise I would like to extract data from a real estate website.
The specific page I want to scrape is this one: https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1
My goal is to extract a list of all the links to each real estate sale.
Afterwards, I want to loop through that list of links to extract all the data for each sale (price, location, nb bedrooms etc.)
The first issue I'm encountering is that the data scraped with the classic BeautifulSoup approach does not match the source code of the web page.
This is my code:
import requests
from bs4 import BeautifulSoup

URL = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
page = requests.get(URL)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Hence, when looking for the links to each real estate sale, which are located under
soup.find_all("a", class_="card__title-link")
the result is an empty list: these tags were not actually present in the HTML extracted by the code above.
Why is that? What should I do to ensure that the extracted HTML corresponds to what is visible in the source code of the website?
Thank you :-)
The data you see is embedded within the page in JSON format. This example shows how to load it:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = json.loads(soup.find("iw-search")[":results"])

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data:
for ad in data:
    print(
        "{:<63} {:<8} {}".format(
            ad["property"]["title"],
            ad["transaction"]["sale"]["price"] or "-",
            "https://www.immoweb.be/fr/annonce/{}".format(ad["id"]),
        )
    )
Prints:
Triplex appartement met 3 slaapkamers en garage. 239000 https://www.immoweb.be/fr/annonce/9309298
Appartement 285000 https://www.immoweb.be/fr/annonce/9309895
Heel ruime, moderne, lichtrijke Duplex te koop, bij centrum 269000 https://www.immoweb.be/fr/annonce/9303797
À VENDRE PAR LANDBERGH : appartement de deux chambres à Gand 359000 https://www.immoweb.be/fr/annonce/9310300
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309278
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309251
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309264
Appartement intéressant avec agréable vue panoramique verdoy 219000 https://www.immoweb.be/fr/annonce/9309366
Projet Utopia by Godin - https://www.immoweb.be/fr/annonce/9309458
Appartement 2-ch avec vue unique! 270000 https://www.immoweb.be/fr/annonce/9309183
Residentieel wonen in Hélécine, dichtbij de natuur en de sne - https://www.immoweb.be/fr/annonce/9309241
Appartement 375000 https://www.immoweb.be/fr/annonce/9309187
DUPLEX LUMIEUX ET SPACIEUX 380000 https://www.immoweb.be/fr/annonce/9298271
SINT-PIETERS-LEEUW / Magnifique maison de ±130m² avec jardin 430000 https://www.immoweb.be/fr/annonce/9310259
PARC PARMENTIER // APP MODERNE 3CH 490000 https://www.immoweb.be/fr/annonce/9262193
BOIS DE LA CAMBRE – AV DE FRE – CLINIQUES DE L’EUROPE 575000 https://www.immoweb.be/fr/annonce/9309664
Entre Stockel et le Stade Fallon 675000 https://www.immoweb.be/fr/annonce/9310094
Maisons neuves dans un cadre verdoyant - https://www.immoweb.be/fr/annonce/6792221
Nieuwbouwproject Dockside Gardens - Gent - https://www.immoweb.be/fr/annonce/9008956
Appartement 139000 https://www.immoweb.be/fr/annonce/9187904
A VENDRE CHEZ LANDBERGH: appartements à Merelbeke Flora - https://www.immoweb.be/fr/annonce/9306877
Très beau studio avec une belle vue sur la plage et la mer! 319000 https://www.immoweb.be/fr/annonce/9306787
BEL APPARTEMENT LUMINEUX DIAMANT / PLASKY 320000 https://www.immoweb.be/fr/annonce/9264748
Un projet d'appartements neufs à proximité de Woluwé-St-Lamb - https://www.immoweb.be/fr/annonce/9308037
PLACE JOURDAN - 2 CHAMBRES 345000 https://www.immoweb.be/fr/annonce/9306953
Magnifiek appartement in de Brugse Rand - Assebroek 399000 https://www.immoweb.be/fr/annonce/9306613
Bien d'exception 415000 https://www.immoweb.be/fr/annonce/9308022
Appartement 435000 https://www.immoweb.be/fr/annonce/9307802
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307178
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307177
EDIT: Added URL column.
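If you want more than the first page, note the page number sits right in the query string, so you can loop over it. A hedged sketch, assuming every results page embeds the same iw-search tag with a :results attribute:

import json
import requests
from bs4 import BeautifulSoup

base = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page={}"

all_ads = []
for page in range(1, 4):  # first three pages as an example
    soup = BeautifulSoup(requests.get(base.format(page)).content, "html.parser")
    tag = soup.find("iw-search")
    if tag is None:  # stop if a page no longer carries the embedded JSON
        break
    all_ads.extend(json.loads(tag[":results"]))

print(len(all_ads), "ads collected")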
I'm only a couple of weeks into learning Python, and I'm trying to extract specific info from a list of events. I've been able to call the list and extract specific lines (info for a single event), but the objective is to run the program and extract the information from the entire list (info from all of the events).
Among others, my best guesses so far have been along the lines of:
one_a_tag = soup.findAll('a')[22:85]
and
one_a_tag = soup.findAll('a')[22+1]
But I come up with these errors:
TypeError Traceback (most recent call last)
<ipython-input-15-ee19539fbb00> in <module>
11 soup.findAll('a')
12 one_a_tag = soup.findAll('a')[22:85]
---> 13 link = one_a_tag['href']
14 'https://arema.mx' + link
15 eventUrl = ('https://arema.mx' + link)
TypeError: list indices must be integers or slices, not str
And
TypeError Traceback (most recent call last)
<ipython-input-22-81d98bcf8fd8> in <module>
10 soup
11 soup.findAll('a')
---> 12 one_a_tag = soup.findAll('a')[22]+1
13 link = one_a_tag['href']
14 'https://arema.mx' + link
TypeError: unsupported operand type(s) for +: 'Tag' and 'int'
This is the entire code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://arema.mx/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
soup
soup.findAll('a')
one_a_tag = soup.findAll('a')[22]
link = one_a_tag['href']
'https://arema.mx' + link
eventUrl = ('https://arema.mx' + link)
print(eventUrl)
def getAremaTitulo(eventUrl):
res = requests.get(eventUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body > div.body > div.ar.eventname')
return elems[0].text.strip()
def getAremaInfo(eventUrl):
res = requests.get(eventUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body > div.body > div.event-header')
return elems[0].text.strip()
titulo = getAremaTitulo(eventUrl)
print('Nombre de evento: ' + titulo)
info = getAremaInfo(eventUrl)
print('Info: ' + info)
time.sleep(1)
I'm sure there may be some redundancies in the code, but what I'm most keen on solving is creating a loop to extract the specific info I'm looking for from all of the events. What do I need to add to get there?
Thanks!
To get all information about events, you can use this script:
import requests
from bs4 import BeautifulSoup

url = 'https://arema.mx/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for event_link in soup.select('#events a.event'):
    u = 'https://arema.mx' + event_link['href']
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    event_name = s.select_one('.eventname').get_text(strip=True)
    event_info = s.select_one('.event-header').text.strip()
    print(event_name)
    print(event_info)
    print('-' * 80)
Prints:
...
--------------------------------------------------------------------------------
NOCHE BOHEMIA <A PIANO Y GUITARRA>
"Freddy González y Víctor Freez dos amigos que al
paso del tiempo hermanaron sus talentos para crear un concepto musical cálido y
acústico entre cuerdas y teclas haciéndonos vibrar entre una línea de canciones
de ayer y hoy. Rescatando las bohemias que tantos recuerdos y encuentros nos han
generado a lo largo del tiempo.
Precio: $69*ya incluye cargo de servicio.Fecha: Sábado 15 de agosto 20:00 hrsTransmisión en vivo por Arema LiveComo ingresar a ver la presentación.· Dale clic en Comprar y Selecciona tu acceso.· Elije la forma de pago que más se te facilite y finaliza la compra.· Te llegara un correo electrónico con la confirmación de compra y un liga exclusiva para ingresar a la transmisión el día seleccionado únicamente.La compra de tu boleto es un apoyo para el artista.Importante: favor de revisar tu correo en bandeja de entrada, no deseados o spam ya que los correos en ocasiones son enviados a esas carpetas.
--------------------------------------------------------------------------------
...
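A slightly hardened variation of the same idea: it guards against event pages missing either element instead of crashing, and pauses between requests (reusing time.sleep from the question) so you don't hammer the server.

import time
import requests
from bs4 import BeautifulSoup

url = 'https://arema.mx/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for event_link in soup.select('#events a.event'):
    u = 'https://arema.mx' + event_link['href']
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    name = s.select_one('.eventname')
    info = s.select_one('.event-header')
    # guard against event pages missing either element
    print(name.get_text(strip=True) if name else '(no name)')
    print(info.get_text(strip=True) if info else '(no info)')
    print('-' * 80)
    time.sleep(1)  # polite pause between requests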
I am trying this code, however I am still unable to change the language of the page at this URL.
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.fincaraiz.com.co/apartamento-apartaestudio/arriendo/bogota/'
headers = {"Accept-Language": "en-US,en;q=0.5"}
params = dict(lang='en-US,en;q=0.5')
response = get(url, headers=headers, params=params)
print(response.text[:500])

# the original snippet never defined html_soup; it has to be built from the response
html_soup = BeautifulSoup(response.text, 'html.parser')

titles = []
for a in html_soup.findAll('div', id='divAdverts'):
    for x in html_soup.findAll(class_='h2-grid'):
        title = x.text.replace("\r", "").replace("\n", "").strip()
        titles.append(title)
print(titles)
Output
['Local en Itaguí - Santamaría',
'Casa en Sopó - Vereda Comuneros',
'Apartamento en Santa Marta - Bello Horizonte',
'Apartamento en Funza - Zuame',
'Casa en Bogotá - Centro Comercial Titán Plaza',
'Apartamento en Cali - Los Cristales',
'Apartamento en Itaguí - Suramerica',
'Casa en Palmira - Barrio Contiguo A Las Flores',
'Apartamento en Cali - La Hacienda',
'Casa en Bogotá - Marsella',
'Casa en Medellín - La Castellana',
'Casa en Villavicencio - Quintas De San Fernando',
'Apartamento en Santa Marta - Playa Salguero',
'Casa Campestre en Rionegro - La Mosquita',
'Casa Campestre en Jamundí - La Morada',
'Casa en Envigado - Loma De Las Brujas',
'Casa Campestre en El Retiro - Los Salados']
Does anyone know how I can change the language of the results? I've tried everything.
I am only giving an example for one particular field, the title; you can extend it to the other fields. You may face issues like being blocked by Google for the number of concurrent requests, since this library is not an official one. You should also read the note in the documentation: https://pypi.org/project/googletrans/
from requests import get
from bs4 import BeautifulSoup
from googletrans import Translator

translator = Translator()

url = 'https://www.fincaraiz.com.co/apartamento-apartaestudio/arriendo/bogota/'
headers = {"Accept-Language": "en-US,en;q=0.5"}
params = dict(lang='en-US,en;q=0.5')
response = get(url, headers=headers, params=params)

titles = []
html_soup = BeautifulSoup(response.text, 'html.parser')
for a in html_soup.findAll('div', id='divAdverts'):
    for x in html_soup.findAll(class_='h2-grid'):
        title = x.text.replace("\r", "").replace("\n", "").strip()
        titles.append(title)

english_titles = []
english_translations = translator.translate(titles)
for trans in english_translations:
    english_titles.append(trans.text)
print(english_titles)
Since you are translating from Spanish to English, you can specify the languages explicitly: translator.translate(titles, src="es", dest="en").
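A minimal usage sketch of that call (the sample title is taken from the output above; the exact English wording is whatever the API returns):

from googletrans import Translator

translator = Translator()
# src/dest set explicitly to avoid language misdetection on short titles
results = translator.translate(['Casa en Bogotá - Marsella'], src='es', dest='en')
for r in results:
    print(r.text)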