I have a list of URLs in a CSV file that I want to scrape content from. The CSV has 200-plus URLs. The code that I'm running picks up the first URL and then fails. Here is the code:
import csv
from selenium import webdriver

with open('Godzilla1.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Title", "Content"])

f = open("links.csv")
urls = [url.strip() for url in f.readlines()]

driver = webdriver.Firefox()

for url in urls:
    driver.get(url)
    titles = driver.find_elements_by_xpath('//h2[@class="entry-title"]')
    contents = driver.find_elements_by_class_name("et_pb_post")
    num_page_items = len(titles)
    with open('Godzilla1.csv', 'a') as f:
        for i in range(num_page_items):
            f.write(titles[i].text + "," + contents[i].text + "\n")

# Clean up (close browser once completed task).
driver.close()
When that code runs, the error reported is:

    f.write(titles[i].text + "," + contents[i].text + "\n")
IndexError: list index out of range
The problem is that while you get 2 items in titles, there's only 1 element in contents. So as you iterate to the 2nd item in titles, contents is out of range (hence the error). It appears the title repeats twice, so rather than getting all elements with the entry-title class, just get the first element.
You're also going to run into issues using , as your delimiter, since there are commas in the content. Can I suggest just using pandas?
import pandas as pd
from selenium import webdriver

f = open("links.csv")
urls = [url.strip() for url in f.readlines()]

driver = webdriver.Firefox()

rows = []
for url in urls:
    driver.get(url)
    title = driver.find_element_by_xpath('//h2[@class="entry-title"]')
    content = driver.find_element_by_class_name("et_pb_post")
    row = {'Title': title.text,
           'Content': content.text}
    rows.append(row)

# Clean up (close browser once completed task).
driver.close()

df = pd.DataFrame(rows)
df.to_csv('Godzilla1.csv', index=False)
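A version note: the find_element_by_* / find_elements_by_* helpers were removed in Selenium 4, so on a current install the two lookups above would need the By locators instead (a sketch of just those two lines, everything else unchanged):

    from selenium.webdriver.common.by import By

    title = driver.find_element(By.XPATH, '//h2[@class="entry-title"]')
    content = driver.find_element(By.CLASS_NAME, "et_pb_post")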
There is also the option to avoid Selenium and simply use requests and BeautifulSoup:
import pandas as pd
import requests
from bs4 import BeautifulSoup

f = open("links.csv")
urls = [url.strip() for url in f.readlines()]

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}

rows = []
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    title = soup.find('h2', {'class': "entry-title"})
    content = soup.find('div', {'class': 'entry-content'}).find('p')
    post_meta = soup.find('p', {'class': 'post-meta'})
    try:
        category = post_meta.find('a', {'rel': 'category tag'}).text.strip()
    except AttributeError:
        # post_meta or the category link may be missing on some pages
        category = ''

    row = {'Title': title.text,
           'Content': content.text,
           'Category': category}
    print(row)
    rows.append(row)

df = pd.DataFrame(rows)
df.to_csv('Godzilla1.csv', index=False)
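One caveat with the requests/BeautifulSoup version: soup.find returns None when an element is missing, so title.text or the .find('p') chain will raise AttributeError on a page without the expected markup. A defensive sketch of those lookups (same class names assumed; the title_text / content_text names are mine):

    title_tag = soup.find('h2', {'class': 'entry-title'})
    content_tag = soup.find('div', {'class': 'entry-content'})
    first_p = content_tag.find('p') if content_tag else None

    title_text = title_tag.text.strip() if title_tag else ''
    content_text = first_p.text.strip() if first_p else ''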
Related
Attempting to update my script so that it searches through not only the URL provided but all of the pages in the range (1-3) and adds them to the CSV. Can anyone spot why my current code would not be working? Pages after the first are addressed in the following format: page-2
from bs4 import BeautifulSoup
import requests
from csv import writer
from random import randint
from time import sleep

# example of second page url: https://www.propertypal.com/property-for-sale/ballymena-area/page-2
url = "https://www.propertypal.com/property-for-sale/ballymena-area/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

for page in range(1, 4):
    req = requests.get(url + 'page-' + str(page), headers=headers)
    # print(page)
    soup = BeautifulSoup(req.content, 'html.parser')
    lists = soup.find_all('li', class_="pp-property-box")

    with open('ballymena.csv', 'w', encoding='utf8', newline='') as f:
        thewriter = writer(f)
        header = ['Address', 'Price']
        thewriter.writerow(header)

        for list in lists:
            title = list.find('h2').text
            price = list.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)

    sleep(randint(2, 10))
You are overwriting ballymena.csv on every pass through the loop (it is reopened in 'w' mode each time), so you end up with only the results of the last page. Open the file once and put the page loop inside that with block.
Edit: also, the upper limit in range() is not included, so you probably want for page in range(1, 4): to get the first three pages.
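For reference, the stop value of range() is excluded:

    >>> list(range(1, 4))
    [1, 2, 3]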
Edit: full example:
from bs4 import BeautifulSoup
import requests
from csv import writer

url = "https://www.propertypal.com/property-for-sale/ballymena-area/page-"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

with open('ballymena.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    for page in range(1, 4):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')

        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
The solution from bitflip is fine; however, there are a few things I'll point out to help you.
Try to avoid variable names that shadow built-in Python names, list being one of those.
While the csv writer is fine to use, also consider pandas. Further down the road you will likely need to do some data manipulation, so you might as well familiarise yourself with the package now. It's a very powerful tool.
Here's how I would have coded it.
from bs4 import BeautifulSoup
import requests
import pandas as pd
from random import randint
from time import sleep
from os.path import exists

# example of second page url: https://www.propertypal.com/property-for-sale/ballymena-area/page-2
url = "https://www.propertypal.com/property-for-sale/ballymena-area/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

# Check if the csv file already exists
file_exists = exists('ballymena.csv')

for page in range(1, 4):
    rows = []
    req = requests.get(url + 'page-' + str(page), headers=headers)
    # print(page)
    soup = BeautifulSoup(req.content, 'html.parser')
    lists = soup.find_all('li', class_="pp-property-box")

    for li in lists:
        title = li.find('h2').text
        price = li.find('p', class_="pp-property-price").text
        row = {
            'Address': title,
            'Price': price}
        rows.append(row)

    df = pd.DataFrame(rows)
    # If the file doesn't exist yet, write it with a header
    if not file_exists:
        df.to_csv('ballymena.csv', index=False)
        file_exists = True
    # If it already exists, append to it without a header
    else:
        df.to_csv('ballymena.csv', mode='a', header=False, index=False)

    sleep(randint(2, 10))
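As a side note on the append logic, to_csv accepts header=not file_exists directly, so the if/else can be collapsed (same file name and columns assumed):

    df.to_csv('ballymena.csv', mode='a', header=not file_exists, index=False)
    file_exists = True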
I'm a newbie with Python and BeautifulSoup. I would like to scrape multiple pages into a CSV, but when I try to store those 3 links, only the last one is stored in the CSV.
How can I fix my issue?
## importing bs4, requests, fake_useragent and csv modules
from bs4 import BeautifulSoup
import requests
from fake_useragent import UserAgent
import csv

## create an array with URLs
urls = [
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=750300360&editable_length=10',
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=030780118&editable_length=10',
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=620103432&editable_length=10'
]

## initializing the UserAgent object
user_agent = UserAgent()

## starting the loop
for url in urls:
    ## getting the response from the page using the get method of the requests module
    page = requests.get(url, headers={"user-agent": user_agent.chrome})

    ## storing the content of the page in a variable
    html = page.content

    ## creating a BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")
    table = soup.findAll("table", {"class": "table"})[0]
    rows = table.findAll("tr")

    with open("test.csv", "wt+", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            csv_row = []
            for cell in row.findAll(["td", "th"]):
                csv_row.append(cell.get_text())
            writer.writerow(csv_row)
Thanks a lot!
To simplify reading the rows, you could also give pandas a shot:
import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = [
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=750300360&editable_length=10',
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=030780118&editable_length=10',
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=620103432&editable_length=10'
]
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

all_data = []
for url in urls:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.findAll("table", {"class": "table"})[0]

    df_table = pd.read_html(str(table))[0]

    # add a column with additional info
    df_table['hit'] = soup.find("span", {"class": "c"}).text.strip()

    # store the table in a list of tables
    all_data.append(df_table)

# concat the tables and export them to csv
pd.concat(all_data).to_csv('test.csv', index=False)
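A small compatibility note: recent pandas releases warn when read_html is handed a literal HTML string and ask for a file-like object instead, so wrapping the markup in StringIO keeps the line above future-proof (same table variable assumed):

    from io import StringIO

    df_table = pd.read_html(StringIO(str(table)))[0]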
In your code, you reopen test.csv in write mode inside the loop and never accumulate the rows anywhere, so only the values from your last URL end up in the CSV file. This example will write values from all three URLs:
import csv
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=750300360&editable_length=10',
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=030780118&editable_length=10',
    'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=620103432&editable_length=10'
]
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

all_data = []
for url in urls:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.findAll("table", {"class": "table"})[0]

    # here I store all rows to list `all_data`
    for row in table.findAll('tr'):
        tds = [cell.get_text(strip=True, separator=' ') for cell in row.findAll(["td", "th"])]
        all_data.append(tds)
        print(*tds)

# write list `all_data` to CSV
with open("test.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in all_data:
        writer.writerow(row)
This writes test.csv with data from all three URLs.
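As a minor shortcut, csv.writer can also write the whole accumulated list in one call instead of looping (same all_data list assumed):

    with open("test.csv", "w", newline="") as f:
        csv.writer(f).writerows(all_data)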
New to screen scraping here and this is my first time posting on Stack Overflow. Apologies in advance for any formatting errors in this post. Attempting to extract data from multiple pages with URL:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script is running without errors. However, my pandas-exported CSV only contains 1 row with the first extracted value. At the time of this posting, the first value is:
14.01 Acres  Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows that pull the property description from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []
for page in range(1, 900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break

print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding='utf-8')
I suspect the problem is with the line reading desc = container.getText(strip=True). I have tried changing the line but keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:
desc = container.getText(strip=True)
Every time the loop runs, the value in desc is replaced rather than appended. To append items to the list, do:
desc.append(container.getText(strip=True))
Also, since it is already a list, you can remove the brackets from the DataFrame creation like so:
df = pd.DataFrame({'description': desc})
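Putting both changes together (same URL, headers and class name as the question, just the two fixes applied), the corrected script would look roughly like this:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

    desc = []
    n_pages = 0
    for page in range(1, 900):
        n_pages += 1
        sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
        r = requests.get(sapo_url, headers=headers)
        page_html = BeautifulSoup(r.text, 'html.parser')
        house_containers = page_html.find_all('div', class_="propName")
        if house_containers:
            for container in house_containers:
                desc.append(container.getText(strip=True))  # append, don't overwrite
        else:
            break

    print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))
    df = pd.DataFrame({'description': desc})  # desc is already a list
    df.to_csv('test4.csv', encoding='utf-8')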
The cause is that no data is being accumulated in the loop, so only the final value is saved. For testing purposes, this code only goes up to page 2; change the range back once it works.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []
all_data = pd.DataFrame(index=[], columns=['description'])

for page in range(1, 3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break

all_data.to_csv('test4.csv', encoding='utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(all_data)))
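A design note on this version: calling pd.concat inside the loop re-copies everything collected so far on each iteration. Collecting plain rows in a list and building the DataFrame once at the end scales better; a minimal sketch with made-up stand-in values (pretend they came from container.getText(strip=True)):

    import pandas as pd

    scraped = ['14.01 Acres  Vestaburg, Montcalm County, MI$275,000',
               'another property description here']

    rows = [{'description': text} for text in scraped]      # cheap list append per property
    all_data = pd.DataFrame(rows, columns=['description'])  # build the DataFrame a single time
    all_data.to_csv('test4.csv', encoding='utf-8', index=False)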
I'm trying to get leads from Yelp using Python and BeautifulSoup, but I'm not able to capture the fields for phone, name, address and website (optional).
I'm getting the following error. I searched and found different solutions, but they didn't work for me.
Here is my code:
from bs4 import BeautifulSoup
import requests
import sys
import csv
import requests, re, json

## Get the min and max page numbers
pagenum = 0
maxpage = 0

## loop through the pages
while pagenum <= maxpage:
    newsu = pagenum
    newsu = str(newsu)
    csvname = 'cardealers' + newsu + '.csv'
    csvfile = open(csvname, 'w', encoding="utf-8")
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Business name', 'phone', 'address'])

    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get('https://www.yelp.com/search?find_desc=Used%20Car%20Dealers&find_loc=New%20York%2C%20NY&ns=1&sortby=review_count&start={}'.format(pagenum), headers=headers)

    p = re.compile(r'PRELOADED_STATE__ = (.*?);')
    data = json.loads(p)
    print(data)
    pagenum = pagenum + 1

    for item in data['searchResult']['results']:
        name = item['businessName']
        phone = item['phone']
        address = ([item['address'], item['city'], item['state'], item['postalcode']])
        csv_writer.writerow([name, phone, address])
        print(name)

    csvfile.close()
Here is the error message:

Traceback (most recent call last):
  File "\Python\Python36\scraper\scrape.py", line 22, in <module>
    data = json.loads(p)
  File "\Python\Python36\lib\json\__init__.py", line 348, in loads
    'not {!r}'.format(s.__class__.__name__))
TypeError: the JSON object must be str, bytes or bytearray, not 'SRE_Pattern'
You are trying to load something that is not a JSON string — p is the compiled regex pattern itself, not the text it matches.
Essentially, this is what you are doing:

data = json.loads('THIS IS JUST A STRING. NOT IN A JSON FORMAT')

So you would need to apply the pattern to the response first, something like data = json.loads(p.search(r.text).group(1)).
You actually need to pull that out of the HTML. The other MAJOR issue, though, is that it is not even within the HTML you are pulling, so it will always come back empty.
Also, you are not really iterating through anything. You start at pagenum=0 with maxpage=0 and run while pagenum <= maxpage, which means the loop only ever processes the first page.
The JSON structure with the data is in the HTML, but it looks like it's inside HTML comments, so you'll need to parse that instead.
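For what it's worth, BeautifulSoup can also hand you comment nodes directly via its Comment type, which is an alternative to the string-splitting on '<!--' used in the full example below (a sketch, not the code the answer actually uses):

    import json
    from bs4 import BeautifulSoup, Comment

    def json_from_comments(html):
        """Return the first HTML comment that parses as JSON, else None."""
        soup = BeautifulSoup(html, 'html.parser')
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            text = comment.strip()
            if text.startswith('{'):
                try:
                    return json.loads(text)
                except json.JSONDecodeError:
                    continue
        return None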
Also, why do:

newsu = pagenum
newsu = str(newsu)

when you can simply do newsu = str(pagenum)? Do you really want a separate file for each iteration? I just put it into 1 file:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
import math

## Get the min and max page numbers
pagenum = 0
results = pd.DataFrame()

with requests.Session() as s:
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
    url = 'https://www.yelp.com/search?find_desc=Used%20Car%20Dealers&find_loc=New%20York%2C%20NY&ns=1&sortby=review_count&start={}'.format(pagenum)
    r = s.get(url, headers=headers)

    soup = BeautifulSoup(r.text, 'html.parser')
    scripts = soup.find_all('script')
    for script in scripts:
        if '<!--{' in script.text:
            jsonStr = script.text.split('<!--')[-1].split('-->')[0]
            jsonData = json.loads(jsonStr)

            totalPages = jsonData['searchPageProps']['searchResultsProps']['paginationInfo']['totalResults']
            resultsPerPage = jsonData['searchPageProps']['searchResultsProps']['paginationInfo']['resultsPerPage']
            totalPages = math.ceil(totalPages / resultsPerPage)

    ## loop through the pages
    for pagenum in range(0, totalPages + 1):
        url = 'https://www.yelp.com/search?find_desc=Used%20Car%20Dealers&find_loc=New%20York%2C%20NY&ns=1&sortby=review_count&start={}'.format(pagenum)
        r = s.get(url, headers=headers)

        soup = BeautifulSoup(r.text, 'html.parser')
        scripts = soup.find_all('script')
        for script in scripts:
            if '<!--{' in script.text:
                jsonStr = script.text.split('<!--')[-1].split('-->')[0]
                jsonData = json.loads(jsonStr)

                for each in jsonData['searchPageProps']['searchResultsProps']['searchResults']:
                    if 'searchResultBusiness' in each.keys():
                        busiName = each['searchResultBusiness']['name']
                        phone = each['searchResultBusiness']['phone']
                        address = each['searchResultBusiness']['formattedAddress']

                        temp_df = pd.DataFrame([[busiName, phone, address]], columns=['Business name', 'phone', 'address'])
                        results = results.append(temp_df, sort=False).reset_index(drop=True)

        print('Acquired page: %s' % pagenum)

results.to_csv('cardealers.csv', index=False)
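One caveat if you run this on a current environment: DataFrame.append was deprecated and then removed in pandas 2.0, so on newer pandas the results.append(...) line will fail. A sketch of the same accumulation with a plain list built inside the loop instead (busiName, phone and address come from the loop above):

    collected = []  # one dict per business instead of appending DataFrames

    # inside the loop, instead of results.append(temp_df, ...):
    collected.append({'Business name': busiName, 'phone': phone, 'address': address})

    # after the loop:
    results = pd.DataFrame(collected, columns=['Business name', 'phone', 'address'])
    results.to_csv('cardealers.csv', index=False)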
I want to replace duplicate title, price and link values with empty column values.
import requests
import csv
from bs4 import BeautifulSoup
requests.packages.urllib3.disable_warnings()
import pandas as pd

url = 'http://shop.kvgems-preciousstones.com/'

while True:
    session = requests.Session()
    session.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
    content = session.get(url, verify=False).content
    soup = BeautifulSoup(content, "html.parser")

    posts = soup.find_all('li', {'class': 'item'})
    data = []
    for url in posts:
        title = url.find('h2', {'product-name'}).text
        price = url.find('span', {'price'}).text
        link = url.find('a').get('href')

        url_response = requests.get(link)
        url_data = url_response.text
        url_soup = BeautifulSoup(url_data, 'html.parser')
        desciption = url_soup.find('tr')

        for tr in url_soup.find_all('tr'):
            planet_data = dict()
            values = [td.text for td in tr.find_all('td')]
            planet_data['name'] = tr.find('td').text.strip()
            planet_data['info'] = tr.find_all('td')[1].text.strip()
            data.append((title, price, planet_data, link))

        # data_new = data + "," + data_desciption
        # urls = soup.find('a',{'class': 'next i-next'}).get('href')
        # url = urls
        # print(url)

    with open('ineryrge5szdqzrt.csv', 'a') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['title', 'price', 'name', 'info', 'link'])
        # The for loop
        for title, price, planet_data, link in data:
            writer.writerow([title, price, planet_data['name'], planet_data['info'], link])
When I write the CSV I get duplicated title, price and link values, but I want only one row to carry the title, price, info and link while the rest are left empty.
The first for loop extracts the common values (title, price and link). The second for loop then extracts all the data attributes for each item.
However, you are then writing title, price and link fields to the CSV file for every row of data. You only need to do it for the first row of data.
To detect whether your second for loop is on the first row or not, you can change it to use the enumerate function, which gives you an extra index variable. You can then use this value to only write the title, price and link when the index is 0:
for index, tr in enumerate(url_soup.find_all('tr')):
    planet_data = dict()
    values = [td.text for td in tr.find_all('td')]
    planet_data['name'] = tr.find('td').text.strip()
    planet_data['info'] = tr.find_all('td')[1].text.strip()
    if index == 0:
        data.append((title, price, planet_data, link))
    else:
        data.append((None, None, planet_data, None))
(Also I don't think you need the initial while True: part.)
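For completeness (a quick standalone check with made-up values, not part of the original answer): csv.writer writes None values as empty fields, so the duplicate rows come out with blank title, price and link columns as intended:

    import csv, io

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['Ruby', '100', 'Origin', 'Burma', 'http://example.com/ruby'])
    writer.writerow([None, None, 'Weight', '2.1 ct', None])
    print(buf.getvalue())
    # Ruby,100,Origin,Burma,http://example.com/ruby
    # ,,Weight,2.1 ct,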