Looping through the page numbers with Python BeautifulSoup

I'm attempting to update my script so that it searches through not only the URL provided but all of the pages in the range 1-3 and adds them to the CSV. Can anyone spot why my current code isn't working? The URLs for pages after the first have the suffix page-2, page-3, and so on.
from bs4 import BeautifulSoup
import requests
from csv import writer
from random import randint
from time import sleep

# example of second page url: https://www.propertypal.com/property-for-sale/ballymena-area/page-2
url = "https://www.propertypal.com/property-for-sale/ballymena-area/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

for page in range(1, 3):
    req = requests.get(url + 'page-' + str(page), headers=headers)
    # print(page)

soup = BeautifulSoup(req.content, 'html.parser')
lists = soup.find_all('li', class_="pp-property-box")

with open('ballymena.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)
    for list in lists:
        title = list.find('h2').text
        price = list.find('p', class_="pp-property-price").text
        info = [title, price]
        thewriter.writerow(info)
        sleep(randint(2, 10))

You are overwriting req multiple times and end up only analyzing the results of page 2. Put everything inside your loop.
Edit: Also, the upper limit in range() is not included, so you probably want for page in range(1, 4): to get the first three pages (range(1, 4) yields 1, 2, 3).
Edit, full example:
from bs4 import BeautifulSoup
import requests
from csv import writer

url = "https://www.propertypal.com/property-for-sale/ballymena-area/page-"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

with open('ballymena.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    for page in range(1, 4):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')

        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
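Note that this full example drops the sleep from the original script; if you want to stay polite to the server, re-add sleep(randint(2, 10)) (and the matching imports) at the end of the page loop so there is still a delay between requests.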

The solution from bitflip is fine; however, there are a few things I'll point out to help you.

1. Try to avoid variable names that shadow functions predefined in Python, list being one of those (see the short snippet after this list).
2. While the csv writer is fine to use, also consider pandas. Further down the road you will likely need to do some data manipulation, so you might as well familiarise yourself with the package now. It's a very powerful tool.
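As a quick illustration of the shadowing problem, a minimal hypothetical snippet unrelated to the scraper:

# Rebinding the name 'list' hides the built-in for the rest of the module.
list = [1, 2, 3]
# Calling list('abc') here would raise: TypeError: 'list' object is not callable
del list            # removes the shadowing binding
print(list('abc'))  # the built-in works again: ['a', 'b', 'c']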
Here's how I would have coded it.
from bs4 import BeautifulSoup
import requests
import pandas as pd
from random import randint
from time import sleep
from os.path import exists

# example of second page url: https://www.propertypal.com/property-for-sale/ballymena-area/page-2
url = "https://www.propertypal.com/property-for-sale/ballymena-area/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

# Check if the csv file already exists
file_exists = exists('ballymena.csv')

for page in range(1, 4):
    rows = []
    req = requests.get(url + 'page-' + str(page), headers=headers)
    # print(page)
    soup = BeautifulSoup(req.content, 'html.parser')
    lists = soup.find_all('li', class_="pp-property-box")
    for li in lists:
        title = li.find('h2').text
        price = li.find('p', class_="pp-property-price").text
        row = {
            'Address': title,
            'Price': price
        }
        rows.append(row)

    df = pd.DataFrame(rows)
    # If the file doesn't exist yet, write it with a header
    if not file_exists:
        df.to_csv('ballymena.csv', index=False)
        file_exists = True
    # If it already exists, append to it without a header
    else:
        df.to_csv('ballymena.csv', mode='a', header=False, index=False)

    sleep(randint(2, 10))
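As an aside, if each run starts from a fresh file anyway, you could skip the existence check entirely by collecting every page's rows first and writing one CSV at the end. A minimal sketch under the same assumptions (same URL, headers, and page range as above):

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.propertypal.com/property-for-sale/ballymena-area/"
headers = {'User-Agent': 'Mozilla/5.0'}  # use a full User-Agent string as above

rows = []
for page in range(1, 4):
    req = requests.get(url + 'page-' + str(page), headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    for li in soup.find_all('li', class_="pp-property-box"):
        rows.append({'Address': li.find('h2').text,
                     'Price': li.find('p', class_="pp-property-price").text})

# One DataFrame, one write: no append/exists bookkeeping needed.
pd.DataFrame(rows).to_csv('ballymena.csv', index=False)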

Related

Fix for missing 'tr' class in webscraping

I'm trying to web-scrape different stocks by row, with the data scraped from https://www.slickcharts.com/sp500. I am following a tutorial that uses a similar website; however, that website uses classes for each of its rows, while mine doesn't (attached below).
This is the code I'm trying to use, however I don't get any output whatsoever. I'm still pretty new at coding, so any feedback is welcome.
import requests
import pandas as pd
from bs4 import BeautifulSoup

company = []
symbol = []

url = 'https://www.slickcharts.com/sp500'  # Data from SlickCharts
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')
for i in rows:
    row = i.find_all('td')
    print(row[0])
First of all, you need to add some headers to your request, because most likely you get the same as me: status code 403 Forbidden. It's because the website is blocking your request. Adding a User-Agent does the trick:

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
page = requests.get(url, headers=headers)

Then you can iterate over the tr tags as you do. But you should be careful because, for example, the first tr doesn't have any td tags, and you will get an exception on the line:

print(row[0])
Here is an example of code that prints the names of all the companies:
import requests
from bs4 import BeautifulSoup

company = []
symbol = []
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

url = 'https://www.slickcharts.com/sp500'  # Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)
But this code also outputs some other data besides company names. That's because you are iterating over all tr tags on the page, when you need to iterate over a specific table only (the first table on the page, in this case).
import requests
from bs4 import BeautifulSoup

company = []
symbol = []
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

url = 'https://www.slickcharts.com/sp500'  # Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
first_table_on_the_page = soup.find('table')
rows = first_table_on_the_page.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)
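The company and symbol lists declared at the top are never filled. If the eventual goal is a DataFrame of both, here is a hedged sketch building on the same loop; the symbol column index is an assumption, so verify it against the live table:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get('https://www.slickcharts.com/sp500', headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

company = []
symbol = []
for row in soup.find('table').find_all('tr'):
    tds = row.find_all('td')
    if len(tds) > 2:                        # skips the th-only header row
        company.append(tds[1].text.strip())
        symbol.append(tds[2].text.strip())  # assumed column; check the page

df = pd.DataFrame({'Company': company, 'Symbol': symbol})
print(df.head())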

Limited number of scraped data?

I am scraping a website, and everything seems to work fine from today's news back to news published in 2015/2016. Beyond those years, I am not able to scrape news.
Could you please tell me if anything has changed?
I should get 672 pages of titles and snippets from this page:
https://catania.liveuniversity.it/attualita/
but I only get approx. 158.
The code that I am using is:
import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

page_num = 1
website = "https://catania.liveuniversity.it/attualita/"

while True:
    r = requests.get(website, headers=headers)
    soup = bs4.BeautifulSoup(r.text, 'html')
    title = soup.find_all('h2')
    date = soup.find_all('span', attrs={'class': 'updated'})
    if soup.find_all('a', attrs={'class': 'page-numbers'}):
        website = f"https://catania.liveuniversity.it/attualita/page/{page_num}"
        page_num += 1
        print(page_num)
    else:
        break

df = pd.DataFrame(list(zip(dates, titles)),
                  columns=['Date', 'Titles'])
I think there have been some changes in the tags (for example in the next-page button, or just in the date/title tags).
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd


def main(req, num):
    r = req.get(
        "https://catania.liveuniversity.it/attualita/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        data = [(x.select_one("span.updated").text, x.findAll("a")[1].text, x.select_one("div.entry-content").get_text(strip=True)) for x in soup.select(
            "div.col-lg-8.col-md-8.col-sm-8")]
        return data
    except AttributeError:
        print(r.url)
        return False


with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 673)]
        allin = []
        for f in fs:
            f = f.result()
            if f:
                allin.extend(f)
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Content"])
        print(df)
        df.to_csv("result.csv", index=False)
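A couple of design notes on this version: the futures in fs are iterated in submission order, so the rows come back in page order even though the requests complete concurrently, and passing one requests.Session to every worker re-uses the underlying connection pool instead of opening a fresh connection per page (Session is not formally documented as thread-safe, though this kind of read-only use is common).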

Multiple Page BeautifulSoup Script only Pulling first value

New to screen scraping here, and this is my first time posting on Stack Overflow. Apologies in advance for any formatting errors in this post. I am attempting to extract data from multiple pages with the URL:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script runs without errors. However, my pandas-exported CSV contains only 1 row with the first extracted value. At the time of this posting, the first value is:
14.01 Acres   Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows that pull the property descriptions from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []
for page in range(1, 900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break

print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding='utf-8')
I suspect the problem is with the line reading desc = container.getText(strip=True), and I have tried changing the line but keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:

desc = container.getText(strip=True)

Every time the loop runs, the value in desc is replaced, not added to. To add items to the list, do:

desc.append(container.getText(strip=True))

Also, since desc is already a list, you can remove the brackets from the DataFrame creation like so:

df = pd.DataFrame({'description': desc})
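Putting both fixes together, a minimal sketch of the corrected script (same URL pattern and headers as in the question):

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

desc = []
for page in range(1, 900):
    url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = requests.get(url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if not house_containers:
        break                                       # ran out of pages
    for container in house_containers:
        desc.append(container.getText(strip=True))  # append, don't overwrite

df = pd.DataFrame({'description': desc})            # desc is already a list
df.to_csv('test4.csv', encoding='utf-8')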
The cause is that the data is not being accumulated inside the loop, so only the final value gets saved. For testing purposes this version only goes up to page 2, so adjust the range back as needed.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []
all_data = pd.DataFrame(index=[], columns=['description'])

for page in range(1, 3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break

all_data.to_csv('test4.csv', encoding='utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

How do I convert a web-scraped table into a csv?

A year ago I learned some Python in one of my classes but haven't had to use it much since then, so this may or may not be a simple question.
I'm trying to web-scrape the top grossing films of all time table from Box Office Mojo, and I want to grab the rank, title, and gross for the top 10 films of the 2010s. I've been playing around in Python and I can get the entire table into Python, but I don't know how to manipulate it from there, let alone write out a CSV file. Any guidance/tips?
Here is what prints the entire table for me (the first few lines are copied from an old web-scraping assignment to get me started):
import bs4
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.boxofficemojo.com/chart/top_lifetime_gross/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
page_html = requests.get(url, headers=headers)
page_soup = soup(page_html.text, "html.parser")
boxofficemojo_table = page_soup.find("div", {"class": "a-section imdb-scroll-table-inner"})
complete_table = boxofficemojo_table.get_text()
print(complete_table)
1. Using pd.read_html. You can use pd.read_html for this:

import pandas as pd

Data = pd.read_html(r'https://www.boxofficemojo.com/chart/top_lifetime_gross/')
for data in Data:
    # Note: each table found on the page overwrites Data.csv, so only the last one survives.
    data.to_csv('Data.csv', sep=',')
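If read_html returns more than one table, a short sketch that writes each to its own file instead:

import pandas as pd

tables = pd.read_html(r'https://www.boxofficemojo.com/chart/top_lifetime_gross/')
for i, table in enumerate(tables):
    table.to_csv('Data_{}.csv'.format(i), index=False)  # one file per table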
2. Using bs4:
import pandas as pd
from bs4 import BeautifulSoup
import requests

URL = r'https://www.boxofficemojo.com/chart/top_lifetime_gross/'
print('\n>> Extracting Data using Beautiful Soup for: ' + URL)
try:
    res = requests.get(URL)
except Exception as e:
    print(repr(e))

print('\n<> URL present status Code = ', (res.status_code))
soup = BeautifulSoup(res.text, "lxml")
table = soup.find('table')

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

Data = pd.DataFrame(list_of_rows)
Data.dropna(axis=0, how='all', inplace=True)
print(Data.head(10))
Data.to_csv('Table.csv')
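Neither snippet narrows the output down to the asker's stated goal (rank, title, and gross for the top 10 films of the 2010s). As a hedged follow-up continuing from the bs4 version's Data frame, assuming the scraped table's four columns are rank, title, lifetime gross, and year (verify with print(Data.head()) first):

# Hypothetical follow-up: relies on the column-order assumption above.
Data.columns = ['Rank', 'Title', 'Lifetime Gross', 'Year']
films_2010s = Data[Data['Year'].astype(int).between(2010, 2019)]  # 2010s only
films_2010s.head(10).to_csv('Top10_2010s.csv', index=False)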

Is get_text() from bs4 different for span tags? Can't remove span tags

Whilst making a web scraper, I am able to find and scrape the data available.
On 2 of the fields I am able to use BeautifulSoup's get_text() to remove the HTML from the data, but the 3rd field will not work when I use get_text(). I can get it to give me the whole span tag, just not the text inside it.
I have tried different iterations of getting the data; all the same, it will give me the whole span tag rather than just the text inside.
I am trying to set busnumber to the phone number inside this span tag:
<span class="business--telephoneNumber" itemprop="telephone">01430 422826 </span>
I've tried:
from bs4 import BeautifulSoup
import requests
import csv

data_list = []
url = 'https://www.yell.com/ucs/UcsSearchAction.do?keywords=farmer&location=leeds'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}
site = requests.get(url, headers=headers, timeout=5)
if site.status_code is 200:
    content = BeautifulSoup(site.content, 'html.parser')
    # print(content)
    questions = content.find_all(class_='businessCapsule')
    for question in questions:
        busname = question.find(class_='businessCapsule--name').get_text()
        bustype = question.find(class_='businessCapsule--classification').get_text()
        busnum = question.find('span', {'itemprop': 'telephone'})
        print(busnum)
        busnumber = busnum.get_text()
        new_data = {"busname": busname, "bustype": bustype, "busnumber": busnumber}
        data_list.append(new_data)

with open('selector.csv', 'w') as file:
    writer = csv.DictWriter(file, fieldnames=["busname", "bustype", "busnumber"], delimiter=';')
    writer.writeheader()
    for row in data_list:
        writer.writerow(row)
as well as
from bs4 import BeautifulSoup
import requests
import csv

data_list = []
url = 'https://www.yell.com/ucs/UcsSearchAction.do?keywords=farmer&location=leeds'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}
site = requests.get(url, headers=headers, timeout=5)
if site.status_code is 200:
    content = BeautifulSoup(site.content, 'html.parser')
    # print(content)
    questions = content.find_all(class_='businessCapsule')
    for question in questions:
        busname = question.find(class_='businessCapsule--name').get_text()
        bustype = question.find(class_='businessCapsule--classification').get_text()
        busnumber = question.find('span', {'itemprop': 'telephone'}).get_text()
        new_data = {"busname": busname, "bustype": bustype, "busnumber": busnumber}
        data_list.append(new_data)

with open('selector.csv', 'w') as file:
    writer = csv.DictWriter(file, fieldnames=["busname", "bustype", "busnumber"], delimiter=';')
    writer.writeheader()
    for row in data_list:
        writer.writerow(row)
In both cases get_text() gives this error:

Traceback (most recent call last):
  File "webscraper2.py", line 22, in <module>
    busnumber = busnum.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

If get_text() is removed, it will give the whole tag:
<span class="business--telephoneNumber" itemprop="telephone">01430 422826 </span>
I only need the phone number inside it.
Update: latest code
from bs4 import BeautifulSoup as bs
import requests
import csv

data_list = []
url = 'https://www.yell.com/ucs/UcsSearchAction.do?keywords=farmer&location=leeds'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}
site = requests.get(url, headers=headers, timeout=5)
soup = bs(site.content, 'html.parser')
questions = soup.select('.businessCapsule--mainContent')
for question in questions:
    busname = question.find(class_='businessCapsule--name').get_text()
    bustype = question.find(class_='businessCapsule--classification').get_text()
    busnumber = question.select_one('span.business--telephoneNumber').text
    print(busnumber)
    new_data = {"busname": busname, "bustype": bustype, "busnumber": busnumber}
    data_list.append(new_data)

with open('selector.csv', 'w') as file:
    writer = csv.DictWriter(file, fieldnames=["busname", "bustype", "busnumber"], delimiter=';')
    writer.writeheader()
    for row in data_list:
        writer.writerow(row)
You need to get a different parent in order to select the appropriate child, and change your selector for the child as shown below:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.yell.com/ucs/UcsSearchAction.do?keywords=farmer&location=leeds'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}
site = requests.get(url, headers=headers, timeout=5)
soup = bs(site.content, 'lxml')
questions = soup.select('.businessCapsule--mainContent:has(span.business--telephoneNumber)')
for question in questions:
    print(question.select_one('span.business--telephoneNumber').text)
If you check this different parent selector, you will see it selects the entire box of info, so you can then select your various children.
If that is too restrictive, you can test whether the telephone number is present:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.yell.com/ucs/UcsSearchAction.do?keywords=farmer&location=leeds'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}
site = requests.get(url, headers=headers, timeout=5)
soup = bs(site.content, 'lxml')
questions = soup.select('.businessCapsule--mainContent')
for question in questions:
    tel = question.select_one('span.business--telephoneNumber')
    if tel is None:
        tel = 'Not present'
    else:
        tel = tel.text
    print(tel)
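The same guard can also be written more compactly; a tiny equivalent sketch of the loop body above:

    tel_tag = question.select_one('span.business--telephoneNumber')
    tel = tel_tag.text if tel_tag else 'Not present'
    print(tel)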
