How do I fix the code to scrape the Zomato website? - python

I wrote this code but got the error "IndexError: list index out of range" after running the last line. How do I fix this?
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants", headers=headers)
content = response.content
soup = BeautifulSoup(content, "html.parser")
top_rest = soup.find_all("div", attrs={"class": "sc-bblaLu dOXFUL"})
list_tr = top_rest[0].find_all("div", attrs={"class": "sc-gTAwTn cKXlHE"})
list_rest = []
for tr in list_tr:
    dataframe = {}
    dataframe["rest_name"] = (tr.find("div", attrs={"class": "res_title zblack bold nowrap"})).text.replace('\n', ' ')
    dataframe["rest_address"] = (tr.find("div", attrs={"class": "nowrap grey-text fontsize5 ttupper"})).text.replace('\n', ' ')
    dataframe["cuisine_type"] = (tr.find("div", attrs={"class": "nowrap grey-text"})).text.replace('\n', ' ')
    list_rest.append(dataframe)
list_rest

You are receiving this error because top_rest is empty when you attempt to take its first element with top_rest[0]. That happens because the first class you're trying to reference is dynamically named: if you refresh the page, the same div will not have the same class name, so your scrape returns empty results.
An alternative is to scrape ALL divs and then narrow in on the elements you want. Be mindful of the dynamic naming scheme: from one request to another you may get different class names:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)
content = response.content
soup = BeautifulSoup(content,"html.parser")
top_rest = soup.find_all("div")
list_tr = top_rest[0].find_all("div",attrs={"class": "bke1zw-1 eMsYsc"})
list_tr
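If you keep the class-based approach, guard against the empty result before indexing. A minimal sketch (the class names are the ones from the question, and are effectively placeholders since Zomato generates them dynamically):

top_rest = soup.find_all("div", attrs={"class": "sc-bblaLu dOXFUL"})  # dynamically generated class name
if not top_rest:
    # the class name has likely changed since the page was last inspected
    print("No matching container found - re-inspect the page for the current class names")
else:
    list_tr = top_rest[0].find_all("div", attrs={"class": "sc-gTAwTn cKXlHE"})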

I recently did a project that required scraping Zomato's data for Manila, Philippines. I used the geopy library to get the longitude and latitude of Manila City, then scraped the restaurants' details using that information.
Note: you can get your own API key on the Zomato website; it allows up to 1000 calls per day.
# Use the geopy library to get the latitude and longitude values of Manila City.
from geopy.geocoders import Nominatim

address = 'Manila City, Philippines'
geolocator = Nominatim(user_agent='Makati_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manila City are {}, {}.'.format(latitude, longitude))
# Use Zomato's API to make the calls
import requests
import numpy as np
import pandas as pd

headers = {'user-key': '617e6e315c6ec2ad5234e884957bfa4d'}
venues_information = []
# foursquare_venues is a DataFrame of venues (name, lat, lng) built earlier in the project
for index, row in foursquare_venues.iterrows():
    print("Fetching data for venue: {}".format(index + 1))
    venue = []
    url = ('https://developers.zomato.com/api/v2.1/search?q={}' +
           '&start=0&count=1&lat={}&lon={}&sort=real_distance').format(row['name'], row['lat'], row['lng'])
    try:
        result = requests.get(url, headers=headers).json()
    except:
        print("There was an error...")
    try:
        if len(result['restaurants']) > 0:
            venue.append(result['restaurants'][0]['restaurant']['name'])
            venue.append(result['restaurants'][0]['restaurant']['location']['latitude'])
            venue.append(result['restaurants'][0]['restaurant']['location']['longitude'])
            venue.append(result['restaurants'][0]['restaurant']['average_cost_for_two'])
            venue.append(result['restaurants'][0]['restaurant']['price_range'])
            venue.append(result['restaurants'][0]['restaurant']['user_rating']['aggregate_rating'])
            venue.append(result['restaurants'][0]['restaurant']['location']['address'])
            venues_information.append(venue)
        else:
            venues_information.append(np.zeros(7))  # one placeholder per column below
    except:
        pass
ZomatoVenues = pd.DataFrame(venues_information,
                            columns=['venue', 'latitude',
                                     'longitude', 'price_for_two',
                                     'price_range', 'rating', 'address'])

Using Web Scraping Language I was able to write this:
GOTO https://www.zomato.com/bangalore/top-restaurants
EXTRACT {'rest_name': '//div[@class="res_title zblack bold nowrap"]',
         'rest_address': '//div[@class="nowrap grey-text fontsize5 ttupper"]',
         'cuisine_type': '//div[@class="nowrap grey-text"]'} IN //div[@class="bke1zw-1 eMsYsc"]
This will iterate over each record element with class bke1zw-1 eMsYsc and pull out each restaurant's information.

Related

Scrape data from a &lt;script type="text/javascript"&gt; tag using BeautifulSoup

I'm building a web scraper to pull product data from a website. This particular company hides the price behind a "Login for Price" banner, but the price is present in the HTML inside a &lt;script type="text/javascript"&gt; tag, and I'm unable to pull it out. The specific link I'm testing is https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/
My current code is below; the last line is the one I'm using to pull the text out.
```
import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = "https://www.chadwellsupply.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
productlinks = []
for x in range(1, 3):
    response = requests.get(f'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/?q=&filter=&clearedfilter=undefined&orderby=19&pagesize=24&viewmode=list&currenttab=products&pagenumber={x}&articlepage=')
    soup = BeautifulSoup(response.content, 'html.parser')
    productlist = soup.find_all('div', class_="product-header")
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.find('div', class_="product-title").text.strip())
print(soup.find('p', class_="status").text.strip())
print(soup.find('meta', {'property': 'og:url'}))
print(soup.find('div', class_="tab-pane fade show active").text.strip())
print(soup.find('div', class_="Chadwell-Shared-Breadcrumbs").text.strip())
print(soup.find('script', {'type': 'text/javascript'}).text.strip())
```
Below is the chunk of script from the website that I'm expecting it to pull (I tried to paste it directly here but it wouldn't format correctly). What it actually gives me is
"window.dataLayer = window.dataLayer || [];"
[Screenshot of the HTML from the website omitted.]
Ideally I'd like to pull just the price out, but if I can at least get the whole chunk of data out, I can extract the price manually.
You can use the re/json modules to search/parse the HTML data (obviously, BeautifulSoup cannot parse JavaScript; another option is to use selenium).
import re
import json
import requests

url = "https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/"
html_doc = requests.get(url).text
# grab the JSON argument passed to ga('ec:addProduct', ...) in the inline script
data = re.search(r"ga\('ec:addProduct', (.*?)\);", html_doc).group(1)
data = json.loads(data)
print(data)
Prints:
{'id': '301078', 'name': 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE', 'category': 'Stove/ Ranges', 'brand': 'Hotpoint', 'price': '759'}
Then for price you can do:
print(data["price"])
Prints:
759
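For completeness, the selenium route mentioned above would look roughly like this. This is only a sketch; it assumes a working Chrome/chromedriver setup, and then applies the same re/json parsing to the rendered page source:

import re
import json
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get("https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/")
html_doc = driver.page_source  # the rendered HTML, inline scripts included
driver.quit()

data = json.loads(re.search(r"ga\('ec:addProduct', (.*?)\);", html_doc).group(1))
print(data["price"])  # 759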
A hacky alternative to regex is to select for a function in the scripts. In your case, the script contains function(i,s,o,g,r,a,m).
from bs4 import BeautifulSoup
import requests
import json

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
for el in soup.find_all("script"):
    if "function(i,s,o,g,r,a,m)" in el.text:
        scripttext = el.text
You can then select the data.
extracted = scripttext.split("{")[-1].split("}")[0]
my_json = json.loads("{%s}" % extracted)
print(my_json)
#{'id': '301078', 'name': 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE', 'category': 'Stove/ Ranges', 'brand': 'Hotpoint', 'price': '759'}
Then get the price.
print(my_json["price"])
#759

How do I access a table that's split into different pages under the same URL for transfermarkt?

I'm having an issue where I can access the first page of the table but not the rest. When I click on, say, tab 2, it shows me players 26-50, but I can't scrape them because the URL is the same, not a different one. Is there a way to edit my code so that I can get all the pages of the table?
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/plus/1/galerie/0?saison_id=2021&land_id=alle&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&leihe=&w_s=s&zuab=0"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
TransferPrice = pageSoup.find_all("td", {"class": "rechts hauptlink"})
transfer_prices = []
cleaned_transfer_prices = []
for i in TransferPrice:
    transfer_prices.append(i.text)
for i in transfer_prices:
    i = i[1:-1]  # strip the currency symbol and the trailing "m"
    i = float(i)
    cleaned_transfer_prices.append(i)
cleaned_transfer_prices
some_list = []
#Players = pageSoup.find_all("td", {"class": "hauptlink"})
for td_tag in pageSoup.find_all("td", {"class": "hauptlink"}):
    a_tag = td_tag.find('a')
    if a_tag == None:
        pass
    else:
        some_list.append(a_tag.text)
players = []
team_left = []
team_gone_to = []
for i in range(0, len(some_list), 3):
    players.append(some_list[i])
for i in range(1, len(some_list), 3):
    team_left.append(some_list[i])
for i in range(2, len(some_list), 3):
    team_gone_to.append(some_list[i])
df_2 = pd.DataFrame()
df_2['Player Name'] = players
df_2['Team Left'] = team_left
df_2['New Team'] = team_gone_to
df_2['Transfer Price'] = cleaned_transfer_prices
df_2.index += 1
df_2
You are using the wrong link here.
You should use this one instead:
https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/ajax/yw1/saison_id/2021/land_id/alle/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/leihe//w_s/s/zuab/0/plus/1/galerie/0/page/2?ajax=yw1
Pay attention to the ajax part of it: the dynamic content you need is loaded via AJAX.
The last part of the link is page/2?ajax=yw1. You can solve the problem you described by rotating that number (here it's 2, the second page); change it to any page you need, e.g. with an f-string, as in the sketch below.
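A minimal sketch of that idea, reusing the headers from the question and swapping the page number in with an f-string (the upper bound of the range is an assumption; set it to however many pages the table has):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
all_prices = []
for page_num in range(1, 5):  # assumed page count - adjust as needed
    url = (f"https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1"
           f"/ajax/yw1/saison_id/2021/land_id/alle/ausrichtung/alle/spielerposition_id/alle"
           f"/altersklasse/alle/leihe//w_s/s/zuab/0/plus/1/galerie/0/page/{page_num}?ajax=yw1")
    pageTree = requests.get(url, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    # same price cells as in the question, collected across all pages
    all_prices.extend(td.text for td in pageSoup.find_all("td", {"class": "rechts hauptlink"}))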

Python web scrape numerical weather data

I am attempting to print the int value of the current outside air temperature (55).
Any chance of a tip on what I am doing wrong? (Sorry, not a lot of wisdom here!)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime as dt

# this is used at the end with plotting results to the current hour
h = dt.datetime.now().hour
r = requests.get('https://www.google.com/search?q=weather+duluth')
soup = BeautifulSoup(r.text, 'html.parser')
stuff = []
for item in soup.select('vk_bk sol-tmp'):
    item = int(item.contents[1].get_text(strip=True)[:-1])
    #print(item)  # this is the weather data
    stuff.append(item)
This is the web URL for the weather, and the current outdoor temperature is tied to the div class highlighted in the screenshot (omitted here).
If I attempt to print stuff, I just get an empty list back.
Adding a User-Agent header should give the expected result:
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get('https://www.google.com/search?q=weather%20duluth', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
soup.find("span", {"class": "wob_t"}).text

Multiple Page BeautifulSoup Script only Pulling first value

New to screen scraping here and this is my first time posting on Stack Overflow. Apologies in advance for any formatting errors in this post. I am attempting to extract data from multiple pages with the URL:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script runs without errors. However, my Pandas-exported CSV contains only 1 row with the first extracted value. At the time of this posting, the first value is:
14.01 Acres   Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows that pulls the property descriptions from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })
n_pages = 0
desc = []
for page in range(1, 900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break
print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding='utf-8')
I suspect the problem is with the line reading desc = container.getText(strip=True); I have tried changing that line but keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:
desc = container.getText(strip=True)
Every time it loops, the value in desc is replaced, not added on. To add items into the list, do:
desc.append(container.getText(strip=True))
Also, since it is already a list, you can remove the brackets from the DataFrame creation like so:
df = pd.DataFrame({'description': desc})
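Putting both fixes together, the loop from the question would look like this (a sketch, reusing the names already defined above):

desc = []
for page in range(1, 900):
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if not house_containers:
        break
    for container in house_containers:
        desc.append(container.getText(strip=True))  # append instead of overwrite

df = pd.DataFrame({'description': desc})  # desc is already a list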
The cause is that the data is not being accumulated in the loop, so only the final value gets saved. For testing purposes the code below only scrapes up to page 2, so adjust the range as needed.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })
n_pages = 0
desc = []
all_data = pd.DataFrame(index=[], columns=['description'])
for page in range(1, 3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break
all_data.to_csv('test4.csv', encoding='utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

Web-scraping data from a graph

I am working with lobbying data from opensecrets.org, in particular industry data. I want to have a time series of lobby expenditures for each industry going back since the 90's.
I want to web-scrape the data automatically. The URLs where the data lives have the following format:
https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019
which is easy to embed in a loop. The problem is that the data I need is not in an easily parsed format on the page: it sits inside a bar graph, and when I inspect the graph I cannot find the data in the HTML code. I am familiar with web scraping in Python when the data is in the HTML, but in this case I am not sure how to proceed.
If there is an API, that's your best bet, as mentioned above. But the data can be parsed anyway, provided you request the right URL/query parameters.
I've managed to iterate through the links for you to grab each table. I stored them in a dictionary, with the key being the firm name and the value being the table/data. You can change this to whatever you'd like; maybe just store it as JSON, or save each table as a CSV.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
links = soup.find_all('a', href=True)
root_url = 'https://www.opensecrets.org/lobby/include/IMG_client_year_comp.php?'
links_dict = {}
for each in links:
    if 'clientsum.php?' in each['href']:
        firms = each.text
        link = root_url + each['href'].split('?')[-1].split('&')[0].strip() + '&type=c'
        links_dict[firms] = link
all_tables = {}
n = 1
tot = len(links_dict)
for firms, link in links_dict.items():
    print('%s of %s ---- %s' % (n, tot, firms))
    data = requests.get(link)
    soup = BeautifulSoup(data.text, 'html.parser')
    results = pd.DataFrame()
    graph = soup.find_all('set')  # the chart data is in <set label="..." value="..."> elements
    for each in graph:
        year = each['label']
        total = each['value']
        temp_df = pd.DataFrame([[year, total]], columns=['year', '$mil'])
        results = pd.concat([results, temp_df], ignore_index=True)
    all_tables[firms] = results
    n += 1
Output:
Not going to print them all, as there are 347 tables, but just so you see the structure: [sample table screenshot omitted]
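As suggested above, each table can then be written out, e.g. one CSV per firm. A sketch (the filename sanitization is an assumption, since firm names may contain characters not allowed in file names):

for firm, table in all_tables.items():
    safe_name = firm.replace('/', '_').strip()  # crude filename sanitization
    table.to_csv(f'{safe_name}.csv', index=False, encoding='utf-8')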
