New to screen scraping here and this is my first time posting on Stack Overflow. Apologies in advance for any formatting errors in this post. I am attempting to extract data from multiple pages with URLs of the form:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script runs without errors. However, the CSV exported with pandas contains only one row, holding the first extracted value. At the time of this posting, that first value is:
14.01 Acres  Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows, each holding a property description pulled from these URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []

for page in range(1, 900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break

print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding='utf-8')
I suspect the problem is with the line reading desc = container.getText(strip=True). I have tried changing that line but keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:
desc = container.getText(strip=True)
Every time the loop runs, the value in desc is replaced rather than appended. To add items to the list, do:
desc.append(container.getText(strip=True))
Also, since it is already a list, you can remove the brackets from the DataFrame creation like so:
df = pd.DataFrame({'description': desc})
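Putting those two changes together, a minimal sketch of the append-based fix applied to your original loop (same URL and propName class as in your question; the upper bound of the range is whatever you want to try):

import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

desc = []
for page in range(1, 900):
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if not house_containers:
        break
    for container in house_containers:
        # append instead of overwrite, so every property description is kept
        desc.append(container.getText(strip=True))

df = pd.DataFrame({'description': desc})
df.to_csv('test4.csv', encoding='utf-8')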
The root cause is that no data is being accumulated in the loop, so only the last value gets saved. Here is an alternative full example that appends a row to a DataFrame as it goes; for testing it only scrapes the first 2 pages, so widen the range once it works:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
all_data = pd.DataFrame(index=[], columns=['description'])

for page in range(1, 3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break

all_data.to_csv('test4.csv', encoding='utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(all_data)))
Related
I am attempting to update my script so that it searches through not only the URL provided but all of the pages in the range 1-3 and adds them to the CSV. Can anyone spot why my current code is not working? Pages after the first are addressed by appending a suffix of the form page-2.
from bs4 import BeautifulSoup
import requests
from csv import writer
from random import randint
from time import sleep

# example of second page url: https://www.propertypal.com/property-for-sale/ballymena-area/page-2
url = "https://www.propertypal.com/property-for-sale/ballymena-area/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

for page in range(1, 4):
    req = requests.get(url + 'page-' + str(page), headers=headers)
    # print(page)
    soup = BeautifulSoup(req.content, 'html.parser')
    lists = soup.find_all('li', class_="pp-property-box")

with open('ballymena.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    for list in lists:
        title = list.find('h2').text
        price = list.find('p', class_="pp-property-price").text
        info = [title, price]
        thewriter.writerow(info)

sleep(randint(2, 10))
You are overwriting req (and lists) on every iteration and end up only writing the results of the final page, because the CSV-writing block sits outside the loop. Put everything inside your loop.
edit: Also the upper limit in range() is not included, so you probably want to do for page in range(1, 4): to get the first three pages.
edit full example:
from bs4 import BeautifulSoup
import requests
from csv import writer

url = "https://www.propertypal.com/property-for-sale/ballymena-area/page-"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

with open('ballymena.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    for page in range(1, 4):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')

        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
The solution from bitflip is fine; however, there are a few things I'll point out to help you.
Try to avoid variable names that shadow Python built-ins, list being one of them.
While the csv writer is fine to use, also consider pandas. Further down the road you will likely need to do some data manipulation, so you might as well familiarise yourself with the package now; it is a very powerful tool.
Here's how I would have coded it.
from bs4 import BeautifulSoup
import requests
import pandas as pd
from random import randint
from time import sleep
from os.path import exists

# example of second page url: https://www.propertypal.com/property-for-sale/ballymena-area/page-2
url = "https://www.propertypal.com/property-for-sale/ballymena-area/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

# Check if csv file exists
file_exists = exists('ballymena.csv')

for page in range(1, 4):
    rows = []
    req = requests.get(url + 'page-' + str(page), headers=headers)
    # print(page)
    soup = BeautifulSoup(req.content, 'html.parser')
    lists = soup.find_all('li', class_="pp-property-box")

    for li in lists:
        title = li.find('h2').text
        price = li.find('p', class_="pp-property-price").text
        row = {
            'Address': title,
            'Price': price}
        rows.append(row)

    df = pd.DataFrame(rows)

    # If file doesn't exist, write initial file
    if not file_exists:
        df.to_csv('ballymena.csv', index=False)
        file_exists = True
    # If it already exists, append to file
    else:
        df.to_csv('ballymena.csv', mode='a', header=False, index=False)

    sleep(randint(2, 10))
I'm having an issue where I can access the first page of the table but not the rest. When I click on, say, tab 2, it shows players 26-50, but I can't scrape it because the URL stays the same rather than changing. Is there a way to edit my code so that I can get all the pages of the table?
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/plus/1/galerie/0?saison_id=2021&land_id=alle&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&leihe=&w_s=s&zuab=0"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

TransferPrice = pageSoup.find_all("td", {"class": "rechts hauptlink"})

transfer_prices = []
cleaned_transfer_prices = []
for i in TransferPrice:
    transfer_prices.append(i.text)

for i in transfer_prices:
    i = i[1:-1]
    i = float(i)
    cleaned_transfer_prices.append(i)

cleaned_transfer_prices

some_list = []
# Players = pageSoup.find_all("td", {"class": "hauptlink"})
for td_tag in pageSoup.find_all("td", {"class": "hauptlink"}):
    a_tag = td_tag.find('a')
    if a_tag == None:
        pass
    else:
        some_list.append(a_tag.text)

players = []
team_left = []
team_gone_to = []
for i in range(0, len(some_list), 3):
    players.append(some_list[i])
for i in range(1, len(some_list), 3):
    team_left.append(some_list[i])
for i in range(2, len(some_list), 3):
    team_gone_to.append(some_list[i])

df_2 = pd.DataFrame()
df_2['Player Name'] = players
df_2['Team Left'] = team_left
df_2['New Team'] = team_gone_to
df_2['Transfer Price'] = cleaned_transfer_prices
df_2.index += 1
df_2
You are using the wrong link here.
You should use this one instead:
https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/ajax/yw1/saison_id/2021/land_id/alle/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/leihe//w_s/s/zuab/0/plus/1/galerie/0/page/2?ajax=yw1
Pay attention to the ajax part of it: the dynamic content you need is loaded via AJAX.
The last part of the link is page/2?ajax=yw1; you can solve the problem you described by rotating that number (here it is 2, the second page) to any page you need, for example with an f-string, as sketched below.
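Building on that, a rough sketch of how you might rotate the page number with an f-string; the number of pages is an assumption to adjust, and the cell class is the "rechts hauptlink" one from your own code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

all_prices = []
for page in range(1, 5):  # assumed number of pages; adjust as needed
    url = (f"https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/ajax/yw1/"
           f"saison_id/2021/land_id/alle/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/"
           f"leihe//w_s/s/zuab/0/plus/1/galerie/0/page/{page}?ajax=yw1")
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.content, 'html.parser')
    # same transfer-fee cell class as in the question's code
    for td in soup.find_all("td", {"class": "rechts hauptlink"}):
        all_prices.append(td.text)

print(f"collected {len(all_prices)} transfer fees across the pages")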
I wrote this code but got the error "IndexError: list index out of range" after running the last line. Please, how do I fix this?
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants", headers=headers)
content = response.content
soup = BeautifulSoup(content, "html.parser")

top_rest = soup.find_all("div", attrs={"class": "sc-bblaLu dOXFUL"})
list_tr = top_rest[0].find_all("div", attrs={"class": "sc-gTAwTn cKXlHE"})

list_rest = []
for tr in list_tr:
    dataframe = {}
    dataframe["rest_name"] = (tr.find("div", attrs={"class": "res_title zblack bold nowrap"})).text.replace('\n', ' ')
    dataframe["rest_address"] = (tr.find("div", attrs={"class": "nowrap grey-text fontsize5 ttupper"})).text.replace('\n', ' ')
    dataframe["cuisine_type"] = (tr.find("div", attrs={"class": "nowrap grey-text"})).text.replace('\n', ' ')
    list_rest.append(dataframe)

list_rest
You are receiving this error because top_rest is empty when you attempt to get its first element with top_rest[0]. The reason is that the class you are trying to reference is dynamically named; if you refresh the page, you will notice that the same div no longer carries the same class name. So when you attempt to scrape it, you get empty results.
An alternative would be to scrape ALL divs and then narrow in on the elements you want. Be mindful of the dynamic naming scheme, since from one request to another you may get different results:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)
content = response.content
soup = BeautifulSoup(content,"html.parser")
top_rest = soup.find_all("div")
list_tr = top_rest[0].find_all("div",attrs={"class": "bke1zw-1 eMsYsc"})
list_tr
I recently did a project that had me researching how to scrape Zomato's website for Manila, Philippines. I used the geopy library to get the latitude and longitude of Manila City, then scraped the restaurants' details using that information.
ADD: You can get your own API key on the Zomato website to make up to 1000 calls a day.
# Use geopy library to get the latitude and longitude values of Manila City.
from geopy.geocoders import Nominatim

address = 'Manila City, Philippines'
geolocator = Nominatim(user_agent='Makati_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manila City are {}, {}.'.format(latitude, longitude))

# Use Zomato's API to make calls
import requests
import numpy as np
import pandas as pd

headers = {'user-key': '617e6e315c6ec2ad5234e884957bfa4d'}
venues_information = []

# foursquare_venues is a DataFrame of venue names and coordinates built earlier in the project
for index, row in foursquare_venues.iterrows():
    print("Fetching data for venue: {}".format(index + 1))
    venue = []
    url = ('https://developers.zomato.com/api/v2.1/search?q={}' +
           '&start=0&count=1&lat={}&lon={}&sort=real_distance').format(row['name'], row['lat'], row['lng'])

    try:
        result = requests.get(url, headers=headers).json()
    except:
        print("There was an error...")

    try:
        if (len(result['restaurants']) > 0):
            venue.append(result['restaurants'][0]['restaurant']['name'])
            venue.append(result['restaurants'][0]['restaurant']['location']['latitude'])
            venue.append(result['restaurants'][0]['restaurant']['location']['longitude'])
            venue.append(result['restaurants'][0]['restaurant']['average_cost_for_two'])
            venue.append(result['restaurants'][0]['restaurant']['price_range'])
            venue.append(result['restaurants'][0]['restaurant']['user_rating']['aggregate_rating'])
            venue.append(result['restaurants'][0]['restaurant']['location']['address'])
            venues_information.append(venue)
        else:
            venues_information.append(np.zeros(7))  # placeholder row matching the seven columns below
    except:
        pass

ZomatoVenues = pd.DataFrame(venues_information,
                            columns=['venue', 'latitude',
                                     'longitude', 'price_for_two',
                                     'price_range', 'rating', 'address'])
Using Web Scraping Language I was able to write this:
GOTO https://www.zomato.com/bangalore/top-restaurants
EXTRACT {'rest_name': '//div[@class="res_title zblack bold nowrap"]',
         'rest_address': '//div[@class="nowrap grey-text fontsize5 ttupper"]',
         'cusine_type': '//div[@class="nowrap grey-text"]'} IN //div[@class="bke1zw-1 eMsYsc"]
This will iterate over each record element with class bke1zw-1 eMsYsc and pull each restaurant's information.
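For readers who prefer to stay in Python, a rough bs4 equivalent of that extraction could look like the sketch below; it reuses the class names quoted in this thread, which Zomato generates dynamically, so they may well have changed by the time you run it:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants", headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

restaurants = []
# "bke1zw-1 eMsYsc" is the dynamically generated record class quoted above
for card in soup.find_all("div", attrs={"class": "bke1zw-1 eMsYsc"}):
    name = card.find("div", attrs={"class": "res_title zblack bold nowrap"})
    address = card.find("div", attrs={"class": "nowrap grey-text fontsize5 ttupper"})
    cuisine = card.find("div", attrs={"class": "nowrap grey-text"})
    restaurants.append({
        'rest_name': name.text.strip() if name else None,
        'rest_address': address.text.strip() if address else None,
        'cusine_type': cuisine.text.strip() if cuisine else None,
    })

print(restaurants)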
A year ago I learned some Python in one of my classes but haven't had to use it much since then, so this may or may not be a simple question.
I'm trying to web-scrape the top grossing films of all time table from Box Office Mojo, and I want to grab the rank, title, and gross for the top 10 films of the 2010s. I've been playing around in Python and can get the entire table into my script, but I don't know how to manipulate it from there, let alone write out a CSV file. Any guidance/tips?
Here is what will print the entire table for me (the first few lines are copied from an old web-scraping assignment to get me started):
import bs4
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.boxofficemojo.com/chart/top_lifetime_gross/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

page_html = requests.get(url, headers=headers)
page_soup = soup(page_html.text, "html.parser")

boxofficemojo_table = page_soup.find("div", {"class": "a-section imdb-scroll-table-inner"})
complete_table = boxofficemojo_table.get_text()
print(complete_table)
1. You can use pd.read_html for this:
import pandas as pd

Data = pd.read_html(r'https://www.boxofficemojo.com/chart/top_lifetime_gross/')
for data in Data:
    data.to_csv('Data.csv', sep=',')
2. Using bs4:
import pandas as pd
from bs4 import BeautifulSoup
import requests

URL = r'https://www.boxofficemojo.com/chart/top_lifetime_gross/'
print('\n>> Extracting Data using Beautiful Soup for: ' + URL)

try:
    res = requests.get(URL)
except Exception as e:
    print(repr(e))

print('\n<> URL present status Code = ', res.status_code)

soup = BeautifulSoup(res.text, "lxml")
table = soup.find('table')

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    ' '.join(item)

Data = pd.DataFrame(list_of_rows)
Data.dropna(axis=0, how='all', inplace=True)
print(Data.head(10))
Data.to_csv('Table.csv')
I am working with lobbying data from opensecrets.org, in particular industry data. I want to have a time series of lobby expenditures for each industry going back since the 90's.
I want to web-scrape the data automatically. Urls where the data is have the following format:
https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019
which are pretty easy to embed in a loop. The problem is that the data I need is not in an easy format on the webpage: it is inside a bar graph, and when I inspect the graph I cannot find the data in the HTML code. I am familiar with web scraping in Python when the data is in the HTML, but in this case I am not sure how to proceed.
If there is an API, that's your best bet, as mentioned above. But the data can be parsed anyway, provided you request the right URL/query parameters.
I've managed to iterate through the links and grab each table. I stored the results in a dictionary with the firm name as the key and the table/data as the value. You can change that to whatever you'd like, maybe store it as JSON, or save each table as a CSV (see the sketch after the output note below).
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')

links = soup.find_all('a', href=True)

root_url = 'https://www.opensecrets.org/lobby/include/IMG_client_year_comp.php?'
links_dict = {}
for each in links:
    if 'clientsum.php?' in each['href']:
        firms = each.text
        link = root_url + each['href'].split('?')[-1].split('&')[0].strip() + '&type=c'
        links_dict[firms] = link

all_tables = {}
n = 1
tot = len(links_dict)
for firms, link in links_dict.items():
    print('%s of %s ---- %s' % (n, tot, firms))
    data = requests.get(link)
    soup = BeautifulSoup(data.text, 'html.parser')

    results = pd.DataFrame()
    graph = soup.find_all('set')
    for each in graph:
        year = each['label']
        total = each['value']
        temp_df = pd.DataFrame([[year, total]], columns=['year', '$mil'])
        results = pd.concat([results, temp_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

    all_tables[firms] = results
    n += 1
Output:
Not going to print it all as there are 347 tables, but each value in all_tables is a small DataFrame with year and $mil columns.
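If you want to persist those tables as suggested above, here is a minimal sketch; the file names and the sanitising helper are made up for illustration:

import json

# Save each firm's table to its own CSV (hypothetical naming scheme)
for firm, table in all_tables.items():
    safe_name = ''.join(c if c.isalnum() else '_' for c in firm).strip('_')
    table.to_csv(f'{safe_name}.csv', index=False)

# Or dump everything into a single JSON file
with open('all_tables.json', 'w') as f:
    json.dump({firm: table.to_dict(orient='records') for firm, table in all_tables.items()},
              f, indent=2)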