I'm working on a scraper using Selenium.
I have written the script and it scrapes properly; however, I am trying to scrape multiple URLs and then write the results to JSON.
The script scrapes and prints successfully, but I am only getting one result in the JSON file - the second URL's details (I get both results when printing).
How do I get both URLs' results?
I think I need to add another for loop for the JSON data, but I can't figure out how to add it in!
This is the code I am working with:
# -*- coding: UTF-8 -*-
from selenium import webdriver
import time
import json

def writeToJSONFile(path, fileName, data):
    filePathNameWExt = './' + path + '/' + fileName + '.json'
    with open(filePathNameWExt, 'a') as fp:
        json.dump(data, fp, ensure_ascii=False)

browser = webdriver.Firefox(executable_path="/Users/path/geckodriver")
urls = ['https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d8122594-Reviews-Humble_Grape_Battersea-London_England.html','https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d5561842-Reviews-Gastronhome-London_England.html']

data = {}

for url in urls:
    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text
    print(title)
    print(street_address)

    data = {}
    data['title'] = title
    data['street_address'] = street_address

filename = 'properties'
writeToJSONFile('./', filename, data)

browser.quit()
You're trying to add values with the same keys to a dictionary, but a Python dict can only contain unique keys! So instead of writing the second title, you're just overwriting the first. The same goes for street_address.
You can try to save the data as a list of dictionaries:
data = []

for url in urls:
    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text
    print(title)
    print(street_address)
    data.append({'title': title, 'street_address': street_address})
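After the loop you can then dump the whole list in one go. A minimal sketch of that final step (it reuses the json import and browser from the question's script, but opens the file in 'w' mode instead of 'a' so repeated runs don't append a second JSON document to the same file):

# After the loop: write the list of dicts out once.
with open('./properties.json', 'w', encoding='utf-8') as fp:
    json.dump(data, fp, ensure_ascii=False, indent=2)

browser.quit()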
You are resetting the data variable on every pass through the loop, so only the last result survives. What I did is add the iteration index using enumerate and format it into the key. Try this, it should work:
from selenium import webdriver
import time
import json

def writeToJSONFile(path, fileName, data):
    filePathNameWExt = './' + path + '/' + fileName + '.json'
    with open(filePathNameWExt, 'a') as fp:
        json.dump(data, fp, ensure_ascii=False)

browser = webdriver.Firefox(executable_path="/Users/path/geckodriver")
urls = ['https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d8122594-Reviews-Humble_Grape_Battersea-London_England.html','https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d5561842-Reviews-Gastronhome-London_England.html']

data = {}

for i, url in enumerate(urls):
    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text
    # this f-string formatting is supported from Python 3.6+; you can use another format
    # (for a cleaner job use a list, see the accepted answer above)
    data[f'{i}title'] = title
    data[f'{i}street_address'] = street_address
    print(title)
    print(street_address)

filename = 'properties'
writeToJSONFile('./', filename, data)

browser.quit()
Hope you find this helpful!
from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen
import datetime as dt
import csv
import pandas as pd

driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')

c1 = 'amazon_data_' + dt.datetime.now().strftime("%d_%b_%y_%I_%M_%p")
d = open(str(c1) + '.csv', 'x', encoding='utf-8')
#d = open(str(c1) + '.xlsx', 'x', encoding='utf-8')

for c in range(1):
    a = f'https://www.flipkart.com/search?q=sony+headphones&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=sony+headphones&requestId=ad797917-16ae-401e-98df-1c79a43d40c3&as-backfill=on&page={c}'

    '''
    request_response = requests.head(a)
    status_code = request_response.status_code
    if status_code == 200:
        print(True)
    else:
        print(False)
    '''

    driver.get(a)
    # time.sleep(1)

    page_soup = Soup(urlopen(a), 'html5lib')
    container = page_soup.find_all('div', {'class': '_4ddWXP'})

    for containers in container:
        find_url = containers.find('a')['href']
        new_url = 'https://www.flipkart.com' + find_url

        fetch = driver.get(new_url)
        # time.sleep(1)
        page_source = driver.page_source
        page_soup = Soup(page_source, 'html.parser')

        for data in page_soup:
            try:
                product_name = data.find('span', {'class': 'B_NuCI'}).text.strip()
                price = data.find('div', {'class': "_30jeq3 _16Jk6d"}).text.strip()
                current_url = new_url
            except:
                print('Not Available')
            # print(product_name, '\n', price, '\n', current_url, '\n')
            d.write(product_name + price + current_url + '\n')
Error I got
When I try to save the output data in .xlsx format, the file appears to save properly. But when opening it, an error pops up: "The file format of the extension is not valid, verify the file is not corrupted and the file extension matches the format of the file."
Things I tried
When I write the output data as .csv it saves properly, but when I open the file the data contains special characters and is not kept in a single cell.
Output of a single cell when writing the data through the .csv method:
JBL a noise cancellation enabled Bluetooth~
[Image: screenshot of the Excel output produced when the data from the script above is saved to .csv]
Things I want
I want to save this data in .xlsx format with the following 3 headers: product_name, price, URL.
I want all the special characters removed so that I get clean output when writing the data in .xlsx format.
I see a few problems:
using open() / write() you can't create .xlsx, because an .xlsx file is really .xml files compressed with zip;
some data contains , which is normally used as the separator for columns, so you should put the data in " " to create the columns correctly. Better to use the csv module or pandas, which add the " " automatically (see the short csv sketch after this list). This may well be your main problem;
you mix selenium with beautifulsoup and sometimes you make a mess;
you use for data in page_soup, so you get all the children of the page and run the same code for each of them, but you should get the values directly from page_soup.
I would put all the data in a list (every item as a sublist) and later convert it to a pandas.DataFrame and save it using to_csv() or to_excel().
I would even use selenium to find the elements (i.e. find_elements_by_xpath) instead of beautifulsoup, but I skipped this idea in the code below.
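To illustrate the point about commas before the full rewrite: the csv module quotes any field that contains the delimiter, so a comma inside a product name no longer breaks the columns. A tiny sketch with made-up values:

import csv

# The first field contains a comma; csv.writer wraps it in quotes automatically,
# so it still lands in a single column when the file is opened in Excel.
with open('example.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['product_name', 'price', 'url'])
    writer.writerow(['Example Headphones, Noise Cancelling', '1,999', 'https://www.flipkart.com/...'])

The full rewrite below collects all rows in a list first and lets pandas apply the same quoting when saving.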
from selenium import webdriver
import time
from bs4 import BeautifulSoup as BS
import datetime as dt
import pandas as pd

# - before loop -

all_rows = []

#driver = webdriver.Firefox(executable_path='C:\\Downloads\\webdrivers\\geckodriver.exe')
driver = webdriver.Firefox()  # I have `geckodriver` in folder `/home/furas/bin` and I don't have to set `executable_path`

# - loop -

for page in range(1):  # range(10)
    print('--- page:', page, '---')

    url = f'https://www.flipkart.com/search?q=sony+headphones&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=sony+headphones&requestId=ad797917-16ae-401e-98df-1c79a43d40c3&as-backfill=on&page={page}'
    driver.get(url)
    time.sleep(3)

    soup = BS(driver.page_source, 'html5lib')

    all_containers = soup.find_all('div', {'class': '_4ddWXP'})

    for container in all_containers:
        find_url = container.find('a')['href']
        print('find_url:', find_url)

        item_url = 'https://www.flipkart.com' + find_url

        driver.get(item_url)
        time.sleep(3)

        item_soup = BS(driver.page_source, 'html.parser')

        try:
            product_name = item_soup.find('span', {'class': 'B_NuCI'}).text.strip()
            price = item_soup.find('div', {'class': "_30jeq3 _16Jk6d"}).text.strip()

            print('product_name:', product_name)
            print('price:', price)
            print('item_url:', item_url)
            print('---')

            row = [product_name, price, item_url]
            all_rows.append(row)
        except Exception as ex:
            print('Not Available:', ex)
            print('---')

# - after loop -

df = pd.DataFrame(all_rows)

filename = dt.datetime.now().strftime("amazon_data_%d_%b_%y_%I_%M_%p.csv")
df.to_csv(filename)

#filename = dt.datetime.now().strftime("amazon_data_%d_%b_%y_%I_%M_%p.xlsx")
#df.to_excel(filename)
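If you want the three headers asked for in the question, you can also pass explicit column names to the DataFrame and skip the index column. A hedged variant of the last lines above (to_excel additionally needs an Excel engine such as openpyxl installed):

# Same data, but with the requested headers and no index column in the output.
df = pd.DataFrame(all_rows, columns=['product_name', 'price', 'url'])

filename = dt.datetime.now().strftime("amazon_data_%d_%b_%y_%I_%M_%p.csv")
df.to_csv(filename, index=False)

# .xlsx works the same way once an engine such as openpyxl is installed:
#filename = dt.datetime.now().strftime("amazon_data_%d_%b_%y_%I_%M_%p.xlsx")
#df.to_excel(filename, index=False)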
I took an introductory course in Python this semester and am now trying to do a project. However, I don't really know what code I should write to create multiple .txt files whose titles differ for each file.
I scraped all the terms and definitions from the website http://www.hogwartsishere.com/library/book/99/. The title of each .txt file should, for example, be 'Aconite.txt', and the content of the file should be the title and the definition. Every term with its definition can be found in a separate p-tag, and the term itself is a b-tag within the p-tag. Can I use this to write my code?
I suppose I will need to use a for loop for this, but I don't really know where to start. I searched StackOverflow and found several solutions, but all of them contain code I am not familiar with and/or relate to another issue.
This is what I have so far:
#!/usr/bin/env python
import os
import requests
import bs4

def download(url):
    r = requests.get(url)
    html = r.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    terms_definition = []

    #for item in soup.find_all('p'):  # define this more precisely
    items = soup.find_all("div", {"class": "font-size-16 roboto"})
    for item in items:
        terms = item.find_all("p")
        for term in terms:
            #print(term)
            if term.text is not 'None':
                #print(term.text)
                #print("\n")
                term_split = term.text.split()
                print(term_split)

            if term.text != None and len(term.text) > 1:
                if '-' in term.text.split():
                    print(term.text)
                    print('\n')

        if item.find('p'):
            terms_definition.append(item['p'])

    print(terms_definition)
    return terms_definition

def create_url(start, end):
    list_url = []
    base_url = 'http://www.hogwartsishere.com/library/book/99/chapter/'
    for x in range(start, end):
        list_url.append(base_url + str(x))
    return list_url

def search_all_url(list_url):
    for url in list_url:
        download(url)

#write data into separate text files. Word in front of the dash should be title of the document, term and definition should be content of the text file
#all terms and definitions are in separate p-tags, title is a b-tag within the p-tag
def name_term

def text_files
    path_write = os.path.join('data', name_term + '.txt')  #'term' should be replaced by the scraped terms
    with open(path_write, 'w') as f:
        f.write()

#for loop? in front of dash = title / everything before and after dash = text (file content) / empty line = new file

if __name__ == '__main__':
    download('http://www.hogwartsishere.com/library/book/99/chapter/1')
    #list_url = create_url(1, 27)
    #search_all_url(list_url)
Thanks in advance!
You can iterate over all the pages (1-27) to get their content, then parse each page with bs4 and save the results to files:
import requests
import bs4
import re

for i in range(1, 27):
    r = requests.get('http://www.hogwartsishere.com/library/book/99/chapter/{}/'.format(i)).text
    soup = bs4.BeautifulSoup(r, 'html.parser')
    items = soup.find_all("div", {"class": "font-size-16 roboto"})
    for item in items:
        terms = item.find_all("p")
        for term in terms:
            title = re.match('^(.*) -', term.text).group(1).replace('/', '-')
            with open(title + '.txt', 'w', encoding='utf-8') as f:
                f.write(term.text)
Output files: [screenshot of the generated .txt files]
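One caveat with the snippet above (not part of the original answer): if a paragraph does not contain the ' - ' separator, re.match() returns None and .group(1) raises an AttributeError. A hedged variant of the inner loop that skips such paragraphs and writes into a data/ folder, as the question's own draft intended:

import os
import re

os.makedirs('data', exist_ok=True)  # create the output folder once, if it doesn't exist

for term in terms:  # `terms` is the list of p-tags from the answer above
    match = re.match(r'^(.*?) -', term.text)
    if match is None:
        continue  # paragraph doesn't follow the "Term - definition" pattern, so skip it
    title = match.group(1).replace('/', '-').strip()
    with open(os.path.join('data', title + '.txt'), 'w', encoding='utf-8') as f:
        f.write(term.text)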
I am scraping and downloading links from a website, and the website is updated with new links each day. I would like it so that each time my code runs, it only scrapes/downloads the updated links since the last time the program ran, rather than running through the entire code again.
I have tried adding previously-scraped links to an empty list and only executing the rest of the code (which downloads and renames the file) if the scraped link isn't found in the list. But it doesn't seem to work as hoped; each time I run the code, it starts "from 0" and overwrites the previously-downloaded files.
Is there a different approach I should try?
Here is my code (also open to general suggestions on how to clean this up and make it better)
import praw
import requests
from bs4 import BeautifulSoup
import urllib.request
from difflib import get_close_matches
import os

period = '2018 Q4'
url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

#set soup
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

#create list of desired file names from existing directory names
candidates = os.listdir('/Users/test/Desktop/Test')

#set directory to download scraped files to
downloads_folder = '/Users/test/Desktop/Python/project/downloaded_files/'

#create empty list of names
scraped_name_list = []

#scrape site for names and links
for anchor in table.findAll('a'):
    try:
        if not anchor:
            continue
        name = anchor.text
        letter_link = anchor['href']

        #if name doesn't exist in list of names: append it to the list, download it, and rename it
        if name not in scraped_name_list:
            #append it to name list
            scraped_name_list.append(name)
            #download it
            urllib.request.urlretrieve(letter_link, '/Users/test/Desktop/Python/project/downloaded_files/' + period + " " + name + '.pdf')
            #rename it
            best_options = get_close_matches(name, candidates, n=1, cutoff=.33)
            try:
                if best_options:
                    name = (downloads_folder + period + " " + name + ".pdf")
                    os.rename(name, downloads_folder + period + " " + best_options[0] + ".pdf")
            except:
                pass
    except:
        pass
    #else skip it
    else:
        pass
Every time you run this, it recreates scraped_name_list as a new empty list. What you need to do is save the list at the end of the run and then try to load it back in on any later run. The pickle library is great for this.
Instead of defining scraped_name_list = [], try something like this:
import pickle

try:
    with open('/path/to/your/stuff/scraped_name_list.lst', 'rb') as f:
        scraped_name_list = pickle.load(f)
except IOError:
    scraped_name_list = []
This will attempt to open your list, but if it's the first run (meaning the list doesn't exist yet) it will start with an empty list. Then, at the end of your code, you just need to save the file so it can be used the next time the script runs:
with open('/path/to/your/stuff/scraped_name_list.lst', 'wb') as f:
    pickle.dump(scraped_name_list, f)
Here is my code:
driver = webdriver.Chrome()
path = "/home/winpc/test/python/dup/new"

def get_link_urls(url, driver):
    driver.get(url)
    url = urllib.urlopen(url)
    content = url.readlines()
    urls = []
    for link in driver.find_elements_by_tag_name('a'):
        elem = driver.find_element_by_xpath("//*")
        source_code = elem.get_attribute("outerHTML")
        test = link.get_attribute('href')
        if str(test) != 'None':
            file_name = test.rsplit('/')[-1].split('.')[0]
            file_name_formated = file_name + "Copy.html"
            with open(os.path.join(path, file_name_formated), 'wb') as temp_file:
                temp_file.write(source_code.encode('utf-8'))
        urls.append(link.get_attribute('href'))
    return urls

urls = get_link_urls("http://localhost:8080", driver)

sub_urls = []
for url in urls:
    if str(url) != 'None':
        sub_urls.extend(get_link_urls(url, driver))
This code navigates each and every link properly, but every time it copies only the first HTML page. I need to save the source code of each page it navigates to. The saving part is done using the code below:
file_name_formated = file_name + "Copy.html"
with open(os.path.join(path, file_name_formated), 'wb') as temp_file:
    temp_file.write(source_code.encode('utf-8'))
First of all, you're overwriting url again and again inside the function, so fix that.
For saving the page source through selenium, you can use driver.page_source.
Additionally, if you want this code to be faster, consider using the requests module:
response = requests.get(url).content
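For example, here is a minimal sketch of how the original function could be reorganised so that each visited page's own source is saved (the names and the output folder follow the question; driver.page_source returns the HTML of whatever page the driver currently has loaded):

import os

def save_link_sources(url, driver, path):
    # Collect the hrefs first, because navigating away makes the old elements stale.
    driver.get(url)
    hrefs = [a.get_attribute('href') for a in driver.find_elements_by_tag_name('a')]
    hrefs = [h for h in hrefs if h]  # drop anchors without an href

    for href in hrefs:
        driver.get(href)                    # navigate to the link first ...
        source_code = driver.page_source    # ... then grab the source of *this* page
        file_name = href.rsplit('/', 1)[-1].split('.')[0] or 'index'
        with open(os.path.join(path, file_name + 'Copy.html'), 'w', encoding='utf-8') as temp_file:
            temp_file.write(source_code)
    return hrefs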
I'm scraping the names of massage therapists along with their addresses from a directory. The addresses are all being saved into the CSV in one column for the whole string, but the title/name of each therapist is being saved one word per column over 2 or 3 columns.
What do I need to do in order to get the extracted string to save in one column, like the addresses are? (The first two lines of code below are example HTML from the page; the next block is the extract from the script targeting this element.)
<span class="name">
<img src="/images/famt-placeholder-sm.jpg" class="thumb" alt="Tiffani D Abraham"> Tiffani D Abraham</span>
import mechanize
from lxml import html
import csv
import io
from time import sleep

def save_products(products, writer):
    for product in products:
        for price in product['prices']:
            writer.writerow([product["title"].encode('utf-8')])
            writer.writerow([price["contact"].encode('utf-8')])
            writer.writerow([price["services"].encode('utf-8')])

f_out = open('mtResult.csv', 'wb')
writer = csv.writer(f_out)

links = ["https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=2&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=3&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=4&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=5&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=6&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=7&PageSize=10", "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=8&PageSize=10", "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=9&PageSize=10", "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=10&PageSize=10" ]

br = mechanize.Browser()

for link in links:
    print(link)
    r = br.open(link)
    content = r.read()
    products = []
    tree = html.fromstring(content)
    product_nodes = tree.xpath('//ul[@class="famt-results"]/li')
    for product_node in product_nodes:
        product = {}
        price_nodes = product_node.xpath('.//a')
        product['prices'] = []
        for price_node in price_nodes:
            price = {}
            try:
                product['title'] = product_node.xpath('.//span[1]/text()')[0]
            except:
                product['title'] = ""
            try:
                price['services'] = price_node.xpath('./span[2]/text()')[0]
            except:
                price['services'] = ""
            try:
                price['contact'] = price_node.xpath('./span[3]/text()')[0]
            except:
                price['contact'] = ""
            product['prices'].append(price)
        products.append(product)
    save_products(products, writer)

f_out.close()
I'm not positive if this solves the issue you were having, but either way there are a few improvements and modifications you might be interested in.
For example, since each link varies by a page index you can loop through the links easily rather than copying all 50 down to a list. Each therapist per page also has their own index, so you can also loop through the xpaths for each therapist's information.
#import modules
import mechanize
from lxml import html
import csv
import io

#open browser
br = mechanize.Browser()

#create file headers
titles = ["NAME"]
services = ["TECHNIQUE(S)"]
contacts = ["CONTACT INFO"]

#loop through all 50 webpages for therapist data
for link_index in range(1, 50):
    link = "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=" + str(link_index) + "&PageSize=10"
    r = br.open(link)
    page = r.read()
    tree = html.fromstring(page)

    #loop through therapist data for each therapist per page
    for therapist_index in range(1, 10):
        #store names
        title = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[1]/text()')
        titles.append(" ".join(title))

        #store techniques and convert to unicode
        service = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[2]/text()')
        try:
            services.append(service[0].encode("utf-8"))
        except:
            services.append(" ")

        #store contact info and convert to unicode
        contact = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[3]/text()')
        try:
            contacts.append(contact[0].encode("utf-8"))
        except:
            contacts.append(" ")

#open file to write to
f_out = open('mtResult.csv', 'wb')
writer = csv.writer(f_out)

#get rows in correct format
rows = zip(titles, services, contacts)

#write csv line by line
for row in rows:
    writer.writerow(row)

f_out.close()
The script loops through all 50 result pages and seems to scrape all the relevant information for each therapist, where provided. Finally, it writes all the data to a CSV, with everything stored under the respective columns 'Name', 'Technique(s)', and 'Contact Info', if that is what you were originally struggling with.
Hope this helps!