How to Scrape Only New Links (After Previous Scrape) Using Python

I am scraping and downloading links from a website, and the website is updated with new links each day. I would like it so that each time my code runs, it only scrapes/downloads the updated links since the last time the program ran, rather than running through the entire code again.
I have tried adding previously-scraped links to an empty list, and only executing the rest of the code (which downloads and renames the file) if the scraped link isn't found in the list. But it doesn't seem to work as hoped, for each time I run the code, it starts "from 0" and overwrites the previously-downloaded files.
Is there a different approach I should try?
Here is my code (I'm also open to general suggestions on how to clean this up and make it better):
import praw
import requests
from bs4 import BeautifulSoup
import urllib.request
from difflib import get_close_matches
import os

period = '2018 Q4'
url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

#set soup
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

#create list of desired file names from existing directory names
candidates = os.listdir('/Users/test/Desktop/Test')
#set directory to download scraped files to
downloads_folder = '/Users/test/Desktop/Python/project/downloaded_files/'
#create empty list of names
scraped_name_list = []

#scrape site for names and links
for anchor in table.findAll('a'):
    try:
        if not anchor:
            continue
        name = anchor.text
        letter_link = anchor['href']
        #if name doesn't exist in list of names: append it to the list, download it, and rename it
        if name not in scraped_name_list:
            #append it to name list
            scraped_name_list.append(name)
            #download it
            urllib.request.urlretrieve(letter_link, '/Users/test/Desktop/Python/project/downloaded_files/' + period + " " + name + '.pdf')
            #rename it
            best_options = get_close_matches(name, candidates, n=1, cutoff=.33)
            try:
                if best_options:
                    name = (downloads_folder + period + " " + name + ".pdf")
                    os.rename(name, downloads_folder + period + " " + best_options[0] + ".pdf")
            except:
                pass
    except:
        pass
    #else skip it
    else:
        pass

Every time you run this, it recreates scraped_name_list as a new empty list. What you need to do is save the list at the end of the run, and then try to load it on any later run. The pickle library is great for this.
Instead of defining scraped_name_list = [], try something like this:
try:
    with open('/path/to/your/stuff/scraped_name_list.lst', 'rb') as f:
        scraped_name_list = pickle.load(f)
except IOError:
    scraped_name_list = []
This will attempt to open your saved list, but if it's the first run (meaning the list doesn't exist yet) it will start with an empty list. Then at the end of your code, you just need to save the file so it can be used any other time the script runs:
with open('/path/to/your/stuff/scraped_name_list.lst', 'wb') as f:
    pickle.dump(scraped_name_list, f)
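Putting the pieces together, here is a minimal sketch of how that persistence could wrap the existing scrape loop (the state-file path is just a placeholder):

import pickle

# placeholder path for the persisted list of already-scraped names
state_file = '/Users/test/Desktop/Python/project/scraped_name_list.lst'

# load previously scraped names, or start fresh on the first run
try:
    with open(state_file, 'rb') as f:
        scraped_name_list = pickle.load(f)
except IOError:
    scraped_name_list = []

# ... run the scrape/download/rename loop from the question here,
#     appending each newly seen name to scraped_name_list ...

# persist the updated list so the next run skips these names
with open(state_file, 'wb') as f:
    pickle.dump(scraped_name_list, f)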

Related

While opening a .xlsx file written through Python, an error pops up: "File format or file extension is not valid. Verify that the file is not corrupted"

from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen
import datetime as dt
import csv
import pandas as pd

driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')

c1 = 'amazon_data_' + dt.datetime.now().strftime("%d_%b_%y_%I_%M_%p")
d = open(str(c1) + '.csv', 'x', encoding='utf-8')
#d = open(str(c1) + '.xlsx', 'x', encoding='utf-8')

for c in range(1):
    a = f'https://www.flipkart.com/search?q=sony+headphones&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=sony+headphones&requestId=ad797917-16ae-401e-98df-1c79a43d40c3&as-backfill=on&page={c}'
    '''
    request_response = requests.head(a)
    status_code = request_response.status_code
    if status_code == 200:
        print(True)
    else:
        print(False)
    '''
    driver.get(a)
    # time.sleep(1)
    page_soup = Soup(urlopen(a), 'html5lib')
    container = page_soup.find_all('div', {'class': '_4ddWXP'})
    for containers in container:
        find_url = containers.find('a')['href']
        new_url = 'https://www.flipkart.com' + find_url
        fetch = driver.get(new_url)
        # time.sleep(1)
        page_source = driver.page_source
        page_soup = Soup(page_source, 'html.parser')
        for data in page_soup:
            try:
                product_name = data.find('span', {'class': 'B_NuCI'}).text.strip()
                price = data.find('div', {'class': "_30jeq3 _16Jk6d"}).text.strip()
                current_url = new_url
            except:
                print('Not Available')
            # print(product_name, '\n', price, '\n', current_url, '\n')
            d.write(product_name + price + current_url + '\n')
Error I got
When I try to save the output data in .xlsx format, the file saves properly. But when opening it, an error pops up: "The file format or file extension is not valid. Verify that the file is not corrupted and that the file extension matches the format of the file."
Things I tried
When I write the output data to .csv it saves properly. But when opening the file, the data has some special characters and is not written into single cells.
Output of a single cell when writing the data through the .csv method:
JBL a noise cancellation enabled Bluetooth~
For a better understanding, I'm providing the URL of an image below, which shows the Excel output I got when fetching data with the above script and saving it to a .csv file.
Things I want
I want to save this data in .xlsx format with the following 3 headers: product_name, price, URL.
I want all the special characters to be removed so that I get clean output when writing the data in .xlsx format.
I see a few problems:
using open() and write() you can't create a .xlsx, because a .xlsx file is really .xml files compressed with zip
some data contains , which is normally used as the separator for columns, so you should put such data in " " to create the columns correctly. Better to use the csv module or pandas, which add the " " automatically. And this can be your main problem.
you mix selenium with beautifulsoup and sometimes make a mess.
you use for data in page_soup, so you get all children on the page and run the same code for each of these elements, but you should get the values directly from page_soup
I would put all data in a list - every item as a sublist - and later convert it to a pandas.DataFrame and save it using to_csv() or to_excel()
I would even use selenium to find the elements (i.e. find_elements_by_xpath) instead of beautifulsoup, but I skipped this idea in the code.
from selenium import webdriver
import time
from bs4 import BeautifulSoup as BS
import datetime as dt
import pandas as pd

# - before loop -

all_rows = []

#driver = webdriver.Firefox(executable_path='C:\\Downloads\\webdrivers\\geckodriver.exe')
driver = webdriver.Firefox()  # I have `geckodriver` in folder `/home/furas/bin` and I don't have to set `executable_path`

# - loop -

for page in range(1):  # range(10)
    print('--- page:', page, '---')

    url = f'https://www.flipkart.com/search?q=sony+headphones&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=sony+headphones&requestId=ad797917-16ae-401e-98df-1c79a43d40c3&as-backfill=on&page={page}'
    driver.get(url)
    time.sleep(3)

    soup = BS(driver.page_source, 'html5lib')

    all_containers = soup.find_all('div', {'class': '_4ddWXP'})

    for container in all_containers:
        find_url = container.find('a')['href']
        print('find_url:', find_url)

        item_url = 'https://www.flipkart.com' + find_url
        driver.get(item_url)
        time.sleep(3)

        item_soup = BS(driver.page_source, 'html.parser')

        try:
            product_name = item_soup.find('span', {'class': 'B_NuCI'}).text.strip()
            price = item_soup.find('div', {'class': "_30jeq3 _16Jk6d"}).text.strip()

            print('product_name:', product_name)
            print('price:', price)
            print('item_url:', item_url)
            print('---')

            row = [product_name, price, item_url]
            all_rows.append(row)
        except Exception as ex:
            print('Not Available:', ex)
            print('---')

# - after loop -

df = pd.DataFrame(all_rows)

filename = dt.datetime.now().strftime("amazon_data_%d_%b_%y_%I_%M_%p.csv")
df.to_csv(filename)

#filename = dt.datetime.now().strftime("amazon_data_%d_%b_%y_%I_%M_%p.xlsx")
#df.to_excel(filename)
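Since the question asks for the three headers product_name, price and URL, the DataFrame can also be given column names before saving. A small sketch, assuming the openpyxl package is installed for .xlsx output and using a made-up example row:

import pandas as pd

# hypothetical rows in the same shape as all_rows above
all_rows = [
    ['Sony WH-CH510 Bluetooth Headset', '1,499', 'https://www.flipkart.com/...'],
]

df = pd.DataFrame(all_rows, columns=['product_name', 'price', 'URL'])
df.to_excel('flipkart_headphones.xlsx', index=False)  # .xlsx writing needs openpyxl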

Scraping multiple URLs in Selenium and writing to JSON

I'm working on a scraper using Selenium.
I have written the script and it is scraping properly; however, I am trying to scrape multiple URLs, then write the results to JSON.
The script scrapes and prints successfully, but I am only getting one result in the JSON - the second URL's details (I get both results when printing).
How do I get both URLs' results?
I think I need to add another for loop for the JSON data, but can't figure out how to add it in!
This is the code I am working with:
# -*- coding: UTF-8 -*-

from selenium import webdriver
import time
import json

def writeToJSONFile(path, fileName, data):
    filePathNameWExt = './' + path + '/' + fileName + '.json'
    with open(filePathNameWExt, 'a') as fp:
        json.dump(data, fp, ensure_ascii=False)

browser = webdriver.Firefox(executable_path="/Users/path/geckodriver")

urls = ['https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d8122594-Reviews-Humble_Grape_Battersea-London_England.html','https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d5561842-Reviews-Gastronhome-London_England.html']

data = {}

for url in urls:
    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text
    print(title)
    print(street_address)
    data = {}
    data['title'] = title
    data['street_address'] = street_address

filename = 'properties'
writeToJSONFile('./', filename, data)

browser.quit()
You're trying to add values with the same keys to a dictionary, but a Python dict can contain unique keys only! So instead of writing a second title you're just overwriting the first. The same goes for street_address.
You can try saving the data as a list of dictionaries instead:
data = []

for url in urls:
    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text
    print(title)
    print(street_address)
    data.append({'title': title, 'street_address': street_address})
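The list can then be written out once after the loop; a minimal sketch using json.dump directly (the filename is just an example):

import json

# after the loop: write the whole list once, as a single valid JSON document
with open('properties.json', 'w') as fp:
    json.dump(data, fp, ensure_ascii=False, indent=2)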
You are resetting the data variable on every pass through the loop and only writing it once after the loop, so only the last URL's values survive.
So what I did is add the index of the iteration using enumerate and format it into the keys.
Try this, it should work:
from selenium import webdriver
import time
import json

def writeToJSONFile(path, fileName, data):
    filePathNameWExt = './' + path + '/' + fileName + '.json'
    with open(filePathNameWExt, 'a') as fp:
        json.dump(data, fp, ensure_ascii=False)

browser = webdriver.Firefox(executable_path="/Users/path/geckodriver")

urls = ['https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d8122594-Reviews-Humble_Grape_Battersea-London_England.html','https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d5561842-Reviews-Gastronhome-London_England.html']

data = {}

for i, url in enumerate(urls):
    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text
    # this f-string formatting is supported from Python 3.6+; you can use another format
    # (for a cleaner job use a list; see the accepted answer)
    data[f'{i}title'] = title
    data[f'{i}street_address'] = street_address
    print(title)
    print(street_address)

filename = 'properties'
writeToJSONFile('./', filename, data)

browser.quit()
Hope you find this helpful!

Saving output URLs with next line in file

I am trying to extract data from a website, and have the following code which extracts all URLs from the main category and its subcategory links.
I am now stuck on saving the extracted output with a line separator (to put each URL on a separate line) in a file, Medical.tsv.
Need help on this.
Code is given below:
from bs4 import BeautifulSoup
import requests
import time
import random

def write_to_file(file, mode, data, newline=None, with_tab=None):  #**
    with open(file, mode, encoding='utf-8') as l:
        if with_tab == True:
            data = ''.join(data)
        if newline == True:
            data = data + '\n'
        l.write(data)

def get_soup(url):
    return BeautifulSoup(requests.get(url).content, "lxml")

url = 'http://www.medicalexpo.com/'
soup = get_soup(url)

raw_categories = soup.select('div.univers-main li.category-group-item a')

category_links = {}
for cat in (raw_categories):
    t0 = time.time()
    response_delay = time.time() - t0  # It waits 10x longer than it took them to respond, using delay.
    time.sleep(10*response_delay)  # This way if the site gets overwhelmed and starts to slow down, the code will automatically back off.
    time.sleep(random.randint(2, 5))  # This will provide random time intervals of 2 to 5 secs, acting as a human crawl instead of a bot.
    soup = get_soup(cat['href'])
    links = soup.select('#category-group li a')
    category_links[cat.text] = [link['href'] for link in links]

print(category_links)
You've got the write_to_file function, but you never call it. mode has to be 'w' or 'w+' (if you want to overwrite in the case where the file already exists).
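For example, here is a minimal sketch of how the collected links could be flushed to Medical.tsv, one URL per line, reusing the question's own write_to_file helper (prefixing the category name as a tab-separated column is just one possible layout):

# after the scraping loop: one "category<TAB>url" line per link
lines = []
for category, links in category_links.items():
    for link in links:
        lines.append(category + '\t' + link)

write_to_file('Medical.tsv', 'w', '\n'.join(lines), newline=True)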

How to extract data from all urls, not just the first

This script is generating a CSV with the data from only one of the URLs fed into it. There are meant to be 98 sets of results, but the for loop isn't getting past the first URL.
I've been working on this for 12+ hours today; what am I missing in order to get the correct results?
import requests
import re
from bs4 import BeautifulSoup
import csv

#Read csv
csvfile = open("gyms4.csv")
csvfilelist = csvfile.read()

def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup # N.B. use yield instead of return
        print r.text

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text

        th = pages.find('b',text="Category")
        td = th.findNext()
        for link in td.findAll('a',href=True):
            match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
            if match:
                web_address = link.text

        gyms = [name,address,phoneNum,email,web_address]
        gyms.append(gyms)

#Saving specific listing data to csv
with open ("xgyms.csv", "w") as file:
    writer = csv.writer(file)
    for row in gyms:
        writer.writerow([row])
You have 3 for loops in your code and do not specify which one causes the problem. I assume it is the one in the get_page_data() function.
You leave the loop in the very first run with the return statement. That is why you never get to the second URL.
There are at least two possible solutions:
Append the parsed result for every URL to a list and return that list.
Move your processing code into the loop and append the parsed data to gyms inside the loop.
As Alex.S said, get_page_data() returns on the first iteration, hence subsequent URLs are never accessed. Furthermore, the code that extracts data from the page needs to be executed for each page downloaded, so it needs to be in a loop too. You could turn get_page_data() into a generator and then iterate over the pages like this:
def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup  # N.B. use yield instead of return

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
        # etc. etc.
You can write the data to the CSV file as each page is downloaded and processed, or you can accumulate the data in a list and write it in one go with csv.writer.writerows().
Also, you should pass the URL list to get_page_data() rather than accessing it from a global variable.
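A minimal sketch of the accumulate-then-write approach (only two of the columns shown, to keep it short):

import csv

rows = []
with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span", {"class": "wlt_shortcode_TITLE"}).text
        address = page.find("span", {"class": "wlt_shortcode_map_location"}).text
        rows.append([name, address])  # build the other columns the same way

# one write at the end instead of one per page
with open("xgyms.csv", "w") as out_file:
    csv.writer(out_file).writerows(rows)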

Python, BeautifulSoup iterating through files issue

This may end up being a really novice question, because I'm a novice, but here goes.
I have a set of .html pages obtained using wget. I want to iterate through them and extract certain info, putting it in a .csv file.
Using the code below, all the names print when my program runs, but only the info from the next-to-last page (i.e., page 29.html here) prints to the .csv file. I'm trying this with only a handful of files at first; there are about 1,200 that I'd like to get into this format.
The files are based on those here: https://www.cfis.state.nm.us/media/ReportLobbyist.aspx?id=25&el=2014 where the page numbers are the id.
Thanks for any help!
from bs4 import BeautifulSoup
import urllib2
import csv

for i in xrange(22, 30):
    try:
        page = urllib2.urlopen('file:{}.html'.format(i))
    except:
        continue
    else:
        soup = BeautifulSoup(page.read())

        n = soup.find(id='ctl00_ContentPlaceHolder1_lnkBCLobbyist')
        name = n.string
        print name

        table = soup.find('table', 'reportTbl')

        #get the rows
        list_of_rows = []
        for row in table.findAll('tr')[1:]:
            col = row.findAll('td')
            filing = col[0].string
            status = col[1].string
            cont = col[2].string
            exp = col[3].string
            record = (name, filing, status, cont, exp)
            list_of_rows.append(record)

        #write to file
        writer = csv.writer(open('lob.csv', 'wb'))
        writer.writerows(list_of_rows)
You need to append each time, not overwrite; use 'a'. open('lob.csv', 'wb') is overwriting the file each time through your outer loop:
writer = csv.writer(open('lob.csv', 'ab'))
writer.writerows(list_of_rows)
You could also declare list_of_rows = [] outside the for loops and write to the file once at the very end, as sketched below.
If you also want page 30, you need to loop over range(22, 31).
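A small sketch of that second option, keeping the question's Python 2 style (the page-parsing part is elided):

import csv

list_of_rows = []
for i in xrange(22, 31):  # 31 so that page 30 is included
    # ... open and parse page i exactly as above,
    #     appending each record to list_of_rows ...
    pass

# a single write at the very end, so nothing is overwritten
writer = csv.writer(open('lob.csv', 'wb'))
writer.writerows(list_of_rows)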
