Scraping data from interactive website map - python

I am trying to scrape the geolocations from the two following websites:
https://zendantenneskaart.omgeving.vlaanderen.be/ --> for this one, I found the underlying source JSON file, so it was easy: https://www.mercator.vlaanderen.be/raadpleegdienstenmercatorpubliek/us/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=us:us_zndant_pnt&outputFormat=application/json
http://www.sites.bipt.be/index.php?language=EN --> for this one, I cannot find such a JSON file; moreover, I cannot find a way to scrape it using Beautiful Soup, since the visibility of the pins depends on the zoom level of the map.
Any ideas on how to scrape all the geolocations for the second website?
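For reference, loading the first site's GetFeature URL above is just a plain JSON request; a minimal sketch (assuming the endpoint still returns GeoJSON):
import requests
wfs_url = ("https://www.mercator.vlaanderen.be/raadpleegdienstenmercatorpubliek/us/ows"
           "?service=WFS&version=1.0.0&request=GetFeature&typeName=us:us_zndant_pnt"
           "&outputFormat=application/json")
geojson = requests.get(wfs_url).json()
# each feature keeps its geolocation under geometry/coordinates (standard GeoJSON)
for feature in geojson["features"][:5]:
    print(feature["geometry"]["coordinates"])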

You can use the URL http://www.sites.bipt.be/ajaxinterface.php and specify a huge range for the latitude/longitude parameters. That way you get all the data in one go.
For example:
import json
import requests
from html import unescape
url = 'http://www.sites.bipt.be/ajaxinterface.php'
data = {"action": "getSites",
"latfrom": "-9999",
"latto": "9999",
"longfrom": "-9999",
"longto": "9999",
"LangSiteTable": "sitesfr"}
data = requests.post(url, data=data).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for d in data[:10]: # <-- print only first 10 items
print('{:<50}{:<50}{:<30}{:<40} {:.4f} {:.4f}'.format(d['Eigenaar1'], unescape(d['Locatie']), unescape(d['Adres']), unescape(d['PostcodeGemeente']), float(d['Longitude']), float(d['Latitude'])))
print()
print('Total items:', len(data))
Prints:
Orange Belgium: 203W1_2 Cité de la Bruyère Clos des Marronniers 201 1480 Tubize 4.2086 50.6810
Telenet: _AN0171A Watertoren Scheeveld 2870 Puurs 4.2718 51.0744
Telenet: _AN0235V E34 2290 Vorselaar 4.7578 51.2449
Orange Belgium: 148L1_6 Institut Provincial d'Enseignement Supérieur Rue du Commerce 14 4100 Seraing 5.5077 50.6130
Orange Belgium: 198L1_5 / 32198L1_1 / 42198L1_1 Lieu-dit 'Bièster' Thier de Coo 4970 Stavelot 5.8876 50.3859
Telenet: _NR1363A Route de Sovenne 5560 Houyet 4.9529 50.1997
Orange Belgium: 181R1_1 Route Rimbaut / Route Rimbaut 6890 Libin 5.1504 49.9989
Proximus: 80WAM_00 Rue de Hottleux 71 4950 Waimes 6.0879 50.4152
Orange Belgium: 013R1_8 Rue Saint-Michel 6870 Saint-Hubert 5.3666 50.0355
Proximus: 41BIA_00 Aéroport de Bierset batiment 56 Aérodrome 4460 Grâce-Hollogne 5.4584 50.6416
Total items: 8104
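If you need the geolocations in a file, the same response can be dumped to CSV with pandas; a minimal sketch reusing the data list from the snippet above (the column selection is only an example):
import pandas as pd
df = pd.DataFrame(data)  # `data` is the list of dicts returned by the POST request above
df[["Eigenaar1", "Locatie", "Adres", "PostcodeGemeente", "Latitude", "Longitude"]].to_csv(
    "bipt_sites.csv", index=False)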

Related

scraped data using BeautifulSoup does not match source code

I'm new to web scraping. I have seen a few tutorials on how to scrape websites using BeautifulSoup.
As an exercise I would like to extract data from a real estate website.
The specific page I want to scrape is this one: https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1
My goal is to extract a list of all the links to each real estate sale.
Afterwards, I want to loop through that list of links to extract all the data for each sale (price, location, nb bedrooms etc.)
The first issue I'm encountering is that the data scraped using the classic BeautifulSoup code does not match the source code of the webpage.
This is my code:
URL = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
page = requests.get(URL)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Hence, when looking for the links of each real estate sale, which are located under
soup.find_all("a", class_="card__title-link")
it outputs an empty list. Indeed, these tags are not present in the HTML extracted by my code above.
Why is that? What should I do to ensure that the extracted html correctly corresponds to what is visible in the source code of the website?
Thank you :-)
The data you see is embedded within the page in JSON format. You can use this example to load it:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.find("iw-search")[":results"])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some data:
for ad in data:
    print(
        "{:<63} {:<8} {}".format(
            ad["property"]["title"],
            ad["transaction"]["sale"]["price"] or "-",
            "https://www.immoweb.be/fr/annonce/{}".format(ad["id"]),
        )
    )
Prints:
Triplex appartement met 3 slaapkamers en garage. 239000 https://www.immoweb.be/fr/annonce/9309298
Appartement 285000 https://www.immoweb.be/fr/annonce/9309895
Heel ruime, moderne, lichtrijke Duplex te koop, bij centrum 269000 https://www.immoweb.be/fr/annonce/9303797
À VENDRE PAR LANDBERGH : appartement de deux chambres à Gand 359000 https://www.immoweb.be/fr/annonce/9310300
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309278
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309251
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309264
Appartement intéressant avec agréable vue panoramique verdoy 219000 https://www.immoweb.be/fr/annonce/9309366
Projet Utopia by Godin - https://www.immoweb.be/fr/annonce/9309458
Appartement 2-ch avec vue unique! 270000 https://www.immoweb.be/fr/annonce/9309183
Residentieel wonen in Hélécine, dichtbij de natuur en de sne - https://www.immoweb.be/fr/annonce/9309241
Appartement 375000 https://www.immoweb.be/fr/annonce/9309187
DUPLEX LUMIEUX ET SPACIEUX 380000 https://www.immoweb.be/fr/annonce/9298271
SINT-PIETERS-LEEUW / Magnifique maison de ±130m² avec jardin 430000 https://www.immoweb.be/fr/annonce/9310259
PARC PARMENTIER // APP MODERNE 3CH 490000 https://www.immoweb.be/fr/annonce/9262193
BOIS DE LA CAMBRE – AV DE FRE – CLINIQUES DE L’EUROPE 575000 https://www.immoweb.be/fr/annonce/9309664
Entre Stockel et le Stade Fallon 675000 https://www.immoweb.be/fr/annonce/9310094
Maisons neuves dans un cadre verdoyant - https://www.immoweb.be/fr/annonce/6792221
Nieuwbouwproject Dockside Gardens - Gent - https://www.immoweb.be/fr/annonce/9008956
Appartement 139000 https://www.immoweb.be/fr/annonce/9187904
A VENDRE CHEZ LANDBERGH: appartements à Merelbeke Flora - https://www.immoweb.be/fr/annonce/9306877
Très beau studio avec une belle vue sur la plage et la mer! 319000 https://www.immoweb.be/fr/annonce/9306787
BEL APPARTEMENT LUMINEUX DIAMANT / PLASKY 320000 https://www.immoweb.be/fr/annonce/9264748
Un projet d'appartements neufs à proximité de Woluwé-St-Lamb - https://www.immoweb.be/fr/annonce/9308037
PLACE JOURDAN - 2 CHAMBRES 345000 https://www.immoweb.be/fr/annonce/9306953
Magnifiek appartement in de Brugse Rand - Assebroek 399000 https://www.immoweb.be/fr/annonce/9306613
Bien d'exception 415000 https://www.immoweb.be/fr/annonce/9308022
Appartement 435000 https://www.immoweb.be/fr/annonce/9307802
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307178
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307177
EDIT: Added URL column.
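If you want more than the first page of results, the page parameter in the URL can be incremented and the same parsing applied; a rough sketch (how many pages exist is something you would have to check):
import json
import requests
from bs4 import BeautifulSoup

all_ads = []
for page in range(1, 4):  # first 3 result pages, as an example
    url = ("https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre"
           "?countries=BE&page={}".format(page))
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    all_ads.extend(json.loads(soup.find("iw-search")[":results"]))
print("Total ads collected:", len(all_ads))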

Python Webscraping Approach for Comparing Football Players' college alma maters with total NFL Fantasy Football output

I am looking to do a data science project where I will be able to sum up the fantasy football points by the college the players went to (e.g. Alabama has 56 active players in the NFL, so I will go through a database and add up all of their fantasy points to compare with other schools).
I was looking at the website:
https://fantasydata.com/nfl/fantasy-football-leaders?season=2020&seasontype=1&scope=1&subscope=1&aggregatescope=1&range=3
and I was going to use Beautiful Soup to scrape the rows of players and statistics and ultimately, fantasy football points.
However, I am having trouble figuring out how to extract the players' college alma mater. To do so, I would have to:
Click each "players" name
Scrape each and every profile of the hundreds of NFL players for one line "College"
Place all of this information into its own column.
Any suggestions here?
There's no need for Selenium, or other headless, automated browsers. That's overkill.
If you take a look at your browser's network traffic, you'll notice that your browser makes a POST request to this REST API endpoint: https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read
If the POST request is well-formed, the API responds with JSON, containing information about every single player. Normally, this information would be used to populate the DOM asynchronously using JavaScript. There's quite a lot of information there, but unfortunately, the college information isn't part of the JSON response. However, there is a field PlayerUrlString, which is a relative-URL to a given player's profile page, which does contain the college name. So:
Make a POST request to the API to get information about all players
For each player in the response JSON:
Visit that player's profile
Use BeautifulSoup to extract the college name from that player's profile
Code:
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read"
    data = {
        "sort": "FantasyPoints-desc",
        "pageSize": "50",
        "filters.season": "2020",
        "filters.seasontype": "1",
        "filters.scope": "1",
        "filters.subscope": "1",
        "filters.aggregatescope": "1",
        "filters.range": "3",
    }

    response = requests.post(url, data=data)
    response.raise_for_status()
    players = response.json()["Data"]

    for player in players:
        url = "https://fantasydata.com" + player["PlayerUrlString"]
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        college = soup.find("dl", {"class": "dl-horizontal"}).findAll("dd")[-1].text.strip()
        print(player["Name"] + " went to " + college)

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Patrick Mahomes went to Texas Tech
Kyler Murray went to Oklahoma
Aaron Rodgers went to California
Russell Wilson went to Wisconsin
Josh Allen went to Wyoming
Deshaun Watson went to Clemson
Ryan Tannehill went to Texas A&M
Lamar Jackson went to Louisville
Dalvin Cook went to Florida State
...
You can also edit the pageSize POST parameter in the data dictionary. The 50 corresponds to information about the first 50 players in the JSON response (according to the filters set by the other POST parameters). Changing this value will yield more or fewer players in the JSON response.
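For example, a one-line tweak before sending the request (whether the endpoint caps this value is an assumption to verify):
data["pageSize"] = "300"  # ask for up to 300 players instead of 50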
I agree, APIs are the way to go if they are there. My second "go to" is pandas' .read_html() (which uses BeautifulSoup under the hood to parse <table> tags). Here's an alternate solution using ESPN's API to get team roster links, then using pandas to pull the table from each link. This saves you the trouble of having to iterate through each player to get the college (I wish they just had an API that returned all players; nfl.com used to have that, but it is no longer publicly available, that I know of).
Code:
import requests
import pandas as pd
url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/athletes/101'
all_teams = []
roster_links = []
for i in range(1, 35):
    url = 'http://site.api.espn.com/apis/site/v2/sports/football/nfl/teams/{teamId}'.format(teamId=i)
    jsonData = requests.get(url).json()
    print(jsonData['team']['displayName'])
    for link in jsonData['team']['links']:
        if link['text'] == 'Roster':
            roster_links.append(link['href'])
            break

for link in roster_links:
    print(link)
    tables = pd.read_html(link)
    df = pd.concat(tables).drop('Unnamed: 0', axis=1)
    df['Jersey'] = df['Name'].str.replace("([A-Za-z.' ]+)", '')
    df['Name'] = df['Name'].str.extract("([A-Za-z.' ]+)")
    all_teams.append(df)
final_df = pd.concat(all_teams).reset_index(drop=True)
Output:
print (final_df)
Name POS Age HT WT Exp College Jersey
0 Matt Ryan QB 35 6' 4" 217 lbs 13 Boston College 2
1 Matt Schaub QB 39 6' 6" 245 lbs 17 Virginia 8
2 Todd Gurley II RB 26 6' 1" 224 lbs 6 Georgia 21
3 Brian Hill RB 25 6' 1" 219 lbs 4 Wyoming 23
4 Qadree Ollison RB 24 6' 1" 232 lbs 2 Pittsburgh 30
... .. ... ... ... .. ... ...
1772 Jonathan Owens S 25 5' 11" 210 lbs 2 Missouri Western 36
1773 Justin Reid S 23 6' 1" 203 lbs 3 Stanford 20
1774 Ka'imi Fairbairn PK 26 6' 0" 183 lbs 5 UCLA 7
1775 Bryan Anger P 32 6' 3" 205 lbs 9 California 9
1776 Jon Weeks LS 34 5' 10" 242 lbs 11 Baylor 46
[1777 rows x 8 columns]
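As a usage note toward the original goal, once final_df exists you can count active players per college in one line, for example:
print(final_df.groupby('College').size().sort_values(ascending=False).head(10))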

Scraping Yahoo Finance with Python3

I'm a complete newbie in scraping and I'm trying to scrape https://fr.finance.yahoo.com and I can't figure out what I'm doing wrong.
My goal is to scrape the index name, current level and the change (both in value and in %).
Here is the code I have used:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find("div",attrs={'data-reactid':'12'})
print(main_table)
links = main_table.find_all("li", class_=' D(ib) Bxz(bb) Bdc($seperatorColor) Mend(16px) BdEnd ')
print(links)
However, the print(links) comes out empty. Could someone please assist? Any help would be highly appreciated as I have been trying to figure this out for a few days now.
Although the better way to get all the fields is to parse and process the relevant script tag, this is one of the ways you can get them all.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com/'
r = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(r.text,'html.parser')
df = pd.DataFrame(columns=['Index Name','Current Level','Value','Percentage Change'])
for item in soup.select("[id='market-summary'] li"):
    index_name = item.select_one("a").contents[1]
    current_level = ''.join(item.select_one("a > span").text.split())
    value = ''.join(item.select_one("a")['aria-label'].split("ou")[1].split("points")[0].split())
    percentage_change = ''.join(item.select_one("a > span + span").text.split())
    df = df.append({'Index Name':index_name, 'Current Level':current_level,'Value':value,'Percentage Change':percentage_change}, ignore_index=True)
print(df)
Output is like:
Index Name Current Level Value Percentage Change
0 CAC 40 4444,56 -0,88 -0,02%
1 Euro Stoxx 50 2905,47 0,49 +0,02%
2 Dow Jones 24438,63 -35,49 -0,15%
3 EUR/USD 1,0906 -0,0044 -0,40%
4 Gold future 1734,10 12,20 +0,71%
5 BTC-EUR 8443,23 161,79 +1,95%
6 CMC Crypto 200 185,66 4,42 +2,44%
7 Pétrole WTI 33,28 -0,64 -1,89%
8 DAX 11073,87 7,94 +0,07%
9 FTSE 100 5993,28 -21,97 -0,37%
10 Nasdaq 9315,26 30,38 +0,33%
11 S&P 500 2951,75 3,24 +0,11%
12 Nikkei 225 20388,16 -164,15 -0,80%
13 HANG SENG 22930,14 -1349,89 -5,56%
14 GBP/USD 1,2177 -0,0051 -0,41%
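For reference, here is a rough sketch of the script-tag approach mentioned at the top of this answer; it assumes the page still embeds its state on one line as a root.App.main = {...}; assignment, which Yahoo can change at any time:
import json
import re
import requests
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com/'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, 'html.parser')
# assumption: the page state is serialized as `root.App.main = {...};`
pattern = re.compile(r'root\.App\.main\s*=\s*(\{.*\});')
script = soup.find('script', text=pattern)
state = json.loads(pattern.search(script.string).group(1))
# the market-summary fields live somewhere inside `state`; explore it with e.g.:
# print(json.dumps(state, indent=2)[:2000])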
I think you need to fix your element selection.
For example, the following code:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find(id="market-summary")
links = main_table.find_all("a")
for i in links:
    print(i.attrs["aria-label"])
Gives output text containing the index name, % change, change, and value:
CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points
Euro Stoxx 50 a augmenté de 0,28 % ou 8,16 points pour atteindre 2 913,14 points
Dow Jones a diminué de -0,63 % ou -153,98 points pour atteindre 24 320,14 points
EUR/USD a diminué de -0,49 % ou -0,0054 points pour atteindre 1,0897 points
Gold future a augmenté de 0,88 % ou 15,10 points pour atteindre 1 737,00 points
a augmenté de 1,46 % ou 121,30 points pour atteindre 8 402,74 points
CMC Crypto 200 a augmenté de 1,60 % ou 2,90 points pour atteindre 184,14 points
Pétrole WTI a diminué de -3,95 % ou -1,34 points pour atteindre 32,58 points
DAX a augmenté de 0,29 % ou 32,27 points pour atteindre 11 098,20 points
FTSE 100 a diminué de -0,39 % ou -23,18 points pour atteindre 5 992,07 points
Nasdaq a diminué de -0,30 % ou -28,25 points pour atteindre 9 256,63 points
S&P 500 a diminué de -0,43 % ou -12,62 points pour atteindre 2 935,89 points
Nikkei 225 a diminué de -0,80 % ou -164,15 points pour atteindre 20 388,16 points
HANG SENG a diminué de -5,56 % ou -1 349,89 points pour atteindre 22 930,14 points
GBP/USD a diminué de -0,34 % ou -0,0041 points pour atteindre 1,2186 points
Try the following CSS selector to get all the links.
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
links=[link['href'] for link in soup.select("ul#market-summary a")]
print(links)
Output:
['/quote/^FCHI?p=^FCHI', '/quote/^STOXX50E?p=^STOXX50E', '/quote/^DJI?p=^DJI', '/quote/EURUSD=X?p=EURUSD=X', '/quote/GC=F?p=GC=F', '/quote/BTC-EUR?p=BTC-EUR', '/quote/^CMC200?p=^CMC200', '/quote/CL=F?p=CL=F', '/quote/^GDAXI?p=^GDAXI', '/quote/^FTSE?p=^FTSE', '/quote/^IXIC?p=^IXIC', '/quote/^GSPC?p=^GSPC', '/quote/^N225?p=^N225', '/quote/^HSI?p=^HSI', '/quote/GBPUSD=X?p=GBPUSD=X']
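These are relative links; to turn them into absolute URLs you can join them with the base url, for example:
from urllib.parse import urljoin
full_links = [urljoin(url, link) for link in links]
print(full_links)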

Scraping Google Destinations

I'm preparing a tour around the world and am curious to find out what the top sights are around the world, so I'm trying to scrape the top destinations within a certain place. I want to end up with the top places in a country, and their best sights. Google Destinations was recently added as a great functionality for this.
For example, when googling Cuba Destinations, Google shows a card with destinations Havana, Varadero, Trinidad, Santiago de Cuba.
Then, when googling Havana Cuba Destinations, it shows Old Havana, Malecon, Castillo de los Tres Reyes Magos del Morro, El Capitolio.
Finally, I'll turn it into a table that looks like:
Cuba, Havana, Old Havana.
Cuba, Havana, Malecon.
Cuba, Havana, Castillo de los Tres Reyes Magos del Morro.
Cuba, Havana, El Capitolio.
Cuba, Varadero, Hicacos Peninsula.
and so on.
I have tried the API call as shown in Travel destinations API, but that does not provide the right feedback, and often yields OVER_QUERY_LIMIT.
The code below returns an error:
URL = "https://www.google.nl/destination/compare?q=cuba+destinations&site=search&output=search&dest_mid=/m/0d04z6&sa=X&ved=0API_KEY"
import requests
from bs4 import BeautifulSoup
#URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
Any tips?
You will need to use something like Selenium for this, as the page makes multiple XHRs; you will not be able to get the rendered page using requests alone. First install Selenium.
sudo pip3 install selenium
Then get a driver https://sites.google.com/a/chromium.org/chromedriver/downloads
(Depending upon your OS you may need to specify the location of your driver)
from bs4 import BeautifulSoup
from selenium import webdriver
import time
browser = webdriver.Chrome()
url = ("https://www.google.nl/destination/compare?q=cuba+destinations&site=search&output=search&dest_mid=/m/0d04z6&sa=X&ved=0API_KEY")
browser.get(url)
time.sleep(2)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, "lxml")
# Get the headings
hs = [tag.text for tag in soup.find_all('h2')]
# get the text containing divs
divs = [tag.text for tag in soup.find_all('div', {'class': False})]
# Delete surplus divs
del divs[:22]
del divs[-1:]
print(list(zip(hs,divs)))
Outputs:
[('Havana', "Cuban capital known for Old Havana's colonial architecture, live salsa music & nearby beaches."), ('Varadero', 'Major Cuban resort town on Hicacos Peninsula, with a 20km beach, a golf course & several parks.'), ('Trinidad', 'Cuban town known for Plaza Mayor, colonial architecture & plantations of Valle de los Ingenios.'), ('Santiago de Cuba', 'Cuban city known for Afro-Cuban festivals & music, plus Spanish colonial & revolutionary history.'), ('Viñales', 'Cuban town known for Viñales Valley, Casa de Caridad Botanical Gardens & nearby tobacco farms.'), ('Cienfuegos', 'Cuban coastal city, known for Tomás Terry Theater, Arco de Triunfo & Playa Rancho Luna resorts.'), ('Santa Clara', 'Cuban city home to the Che Guevara Mausoleum, Parque Vidal & ornate Teatro La Caridad.'), ('Cayo Coco', 'Cuban island known for its white-sand beaches & resorts, plus reef snorkeling & flamingos.'), ('Cayo Santa María', 'Cuban island known for Gaviotas Beach, Cayo Santa María Wildlife Refuge & Pueblo La Estrella.'), ('Cayo Largo del Sur', 'Cuban island, known for beaches like Playa Blanca & Playa Sirena, plus a sea turtle center & diving.'), ('Plaza de la Revolución', 'Che Guevara and monuments'), ('Camagüey', 'Ballet, churches, history, and beaches'), ('Holguín', 'Cuban city known for Parque Calixto García, the Hacha de Holguín axe head & Guardalavaca beaches.'), ('Cayo Guillermo', 'Cuban island with beaches like Playa del Medio & Playa Pilar, plus vast expanses of coral reef.'), ('Matanzas', 'Caves, theater, beaches, history, and rivers'), ('Baracoa', 'Beaches, rivers, and nature'), ('Centro Habana', '\xa0'), ('Playa Girón', 'Beaches, snorkeling, and museums'), ('Topes de Collantes', 'Scenic nature reserve park for hiking'), ('Guardalavaca', 'Cuban resort known for Esmeralda Beach, the Cayo Naranjo Aquarium & the Chorro de Maíta Museum.'), ('Bay of Pigs', 'Snorkeling, scuba diving, and beaches'), ('Isla de la Juventud', 'Scuba diving and beaches'), ('Zapata Swamp', 'Parks, crocodiles, birdwatching, and swamps'), ('Pinar del Río', 'History'), ('Remedios', 'Churches, beaches, and museums'), ('Bayamo', 'Wax museums, monuments, history, and music'), ('Sierra Maestra', 'Peaks with a storied political history'), ('Las Terrazas', 'Zip-lining, nature reserves, and hiking'), ('Sancti Spíritus', 'History and museums'), ('Playa Ancon', 'Beaches, snorkeling, and scuba diving'), ('Jibacoa', 'Beaches, snorkeling, and jellyfish'), ('Jardines de la Reina', 'Scuba diving, fly-fishing, and gardens'), ('Cayo Jutías', 'Beach and snorkeling'), ('Guamá, Cuba', 'Crocodiles, beaches, snorkeling, and lakes'), ('Morón', 'Crocodiles, lagoons, and beaches'), ('Las Tunas', 'Beaches, nightlife, and history'), ('Soroa', 'Waterfalls, gardens, nature, and ecotourism'), ('Guanabo', 'Beach'), ('María la Gorda', 'Scuba diving, beaches, and snorkeling'), ('Alejandro de Humboldt National Park', 'Park, protected area, and hiking'), ('Ciego de Ávila', 'Zoos and beaches'), ('Bacunayagua', '\xa0'), ('Guantánamo', 'Beaches, history, and nature'), ('Cárdenas', 'Beaches, museums, monuments, and history'), ('Canarreos Archipelago', 'Sailing and coral reefs'), ('Caibarién', 'Beaches'), ('El Nicho', 'Waterfalls, parks, and nature'), ('San Luis Valley', 'Cranes, national wildlife refuge, and elk')]
UPDATED IN RESPONSE TO COMMENT:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
browser = webdriver.Chrome()
for place in ["Cuba", "Belgum", "France"]:
url = ("https://www.google.nl/destination/compare?site=destination&output=search")
browser.get(url) # you may not need to do this every time if you clear the search box
time.sleep(2)
element = browser.find_element_by_name('q') # get the query box
time.sleep(2)
element.send_keys(place) # populate the search box
time.sleep (2)
search_box=browser.find_element_by_class_name('sbsb_c') # get the first element in the list
search_box.click() # click it
time.sleep (2)
destinations=browser.find_element_by_id('DESTINATIONS') # Click the destinations link
destinations.click()
time.sleep (2)
html_source = browser.page_source
soup = BeautifulSoup(html_source, "lxml")
# Get the headings
hs = [tag.text for tag in soup.find_all('h2')]
# get the text containg divs
divs = [tag.text for tag in soup.find_all('div', {'class': False})]
# Delete surplus divs
del divs[:22]
del divs[-1:]
print(list(zip(hs,divs)))
browser.quit()
Try this Google Places API URL. You will get the points of interest/attractions/tourist places in (for example) New York City. You have to use the city name with the keyword point of interest.
https://maps.googleapis.com/maps/api/place/textsearch/json?query=new+york+city+point+of+interest&language=en&key=API_KEY
These API results are the same as the results of the Google search below.
https://www.google.com/search?sclient=psy-ab&site=&source=hp&btnG=Search&q=New+York+point+of+interest
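For illustration, a minimal way to call that Text Search endpoint from Python (the response handling is only a sketch; check the Places API documentation for paging and quota details):
import requests

API_KEY = "YOUR_API_KEY"  # a key from a Google Cloud project with the Places API enabled is assumed
resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params={
        "query": "new york city point of interest",
        "language": "en",
        "key": API_KEY,
    },
).json()
for place in resp.get("results", []):
    print(place["name"], place["geometry"]["location"])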
Two more little tips for you:
You can use the Python Client for Google Maps Services: https://github.com/googlemaps/google-maps-services-python
For the OVER_QUERY_LIMIT problem, make sure that you add a billing method to your Google Cloud project (with your credit card or free trial credit balance). Don't worry too much, because Google gives you a few thousand free queries each month.
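With the official Python client, the same query looks roughly like this (a sketch, assuming the googlemaps package is installed and your key has the Places API enabled):
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")
response = gmaps.places("new york city point of interest", language="en")
for place in response.get("results", []):
    print(place["name"])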

Get the Text from the next_sibling - BeautifulSoup 4

I want to scrape Restaurants from this URL
for rests in dining_soup.select("div.infos-restos"):
    for rest in rests.select("h3"):
        safe_print(" Rest Name: "+rest.text)
        print(rest.next_sibling.next_sibling.next_sibling.next_sibling.contents)
outputs
<div class="descriptif-resto">
<p>
<strong>Type of cuisine</strong>:International</p>
<p>
<strong>Opening hours</strong>:06:00-23:30</p>
<p>The Food Square bar and restaurant offers a varied menu in an elegant and welcoming setting. In fine weather you can also enjoy your meal next to the pool or relax on the garden terrace.</p>
</div>
and
print(rest.next_sibling.next_sibling.next_sibling.next_sibling.text)
always outputs an empty string.
So my question is how do I scrape Type of cuisine and opening hours from that Div?
Opening hours and cuisine are in "descriptif-resto" text:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.accorhotels.com/gb/hotel-5548-mercure-niederbronn-hotel/restaurant.shtml")
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("div",attrs={"class":"descriptif-resto"}).text)
Type of cuisine:Brasserie
Opening hours:12:00 - 14:00 / 19:00 - 22:00
The name is in the first h3 tag, the type and opening hours are in the two p tags:
name = soup.find("div", attrs={"class":"infos-restos"}).h3.text
det = soup.find("div",attrs={"class":"descriptif-resto"}).p
hours = det.find_next("p").text
tpe = det.text
print(name)
print(hours)
print(tpe)
LA STUB DU CASINO
Opening hours:12:00 - 14:00 / 19:00 - 22:00
Type of cuisine:Brasserie
OK, so some parts don't have both opening hours and cuisine, so you will have to fine-tune that, but this gets all the info:
from itertools import chain
all_dets = soup.find_all("div", attrs={"class":"infos-restos"})
# get all names from h3 tags using chain so we can zip later
names = chain.from_iterable(x.find_all("h3") for x in all_dets)
# get all info to extract cuisine, hours
det = chain.from_iterable(x.find_all("div",attrs={"class":"descriptif-resto"}) for x in all_dets)
# zip appropriate details with each name
zipped = zip(names, det)
for name, det in zipped:
    details = det.p
    name, tpe = name.text, details
    hours = details.find_next("p") if "cuisine" in det.p.text else ""
    if hours:  # empty string means we have a bar
        print(name, tpe.text, hours.text)
    else:
        print(name, tpe.text)
    print("-----------------------------")
LA STUB DU CASINO
Type of cuisine:Brasserie
Opening hours:12:00 - 14:00 / 19:00 - 22:00
-----------------------------
RESTAURANT DU CASINO IVORY
Type of cuisine:French
Opening hours:19:00 - 22:00
-----------------------------
BAR DE L'HOTEL LE DOLLY
Opening hours:10:00-01:00
-----------------------------
BAR DES MACHINES A SOUS
Opening hours:10:30-03:00
-----------------------------
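If you need the label and the value separately rather than the whole line, splitting on the colon works; a small example based on the strings above:
text = "Type of cuisine:Brasserie"
label, _, value = text.partition(":")
print(label, "->", value)  # Type of cuisine -> Brasserie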
