I made a script that scrapes the pages of a shop looking for out-of-stock items. It looks like this:
import requests
from bs4 import BeautifulSoup

urls = ['https://www.someurla', 'https://www.someurlb']

for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')
    if len(soup.find_all('li', class_='out-of-stock')) > 0:
        print(soup.title)
Now I would like to make this list of URLs updatable without touching the script itself. I am thinking of a separate file that would serve as a flat database; that seems more appropriate than a relational DB, since I don't really need one.
I would like an opinion from more experienced Python users: is this an appropriate approach, and if so, is it better done with a plain text file or a .py file? Which libraries are good for this task? Or are there better approaches altogether?
Go with a simple JSON file. Something like this:
import os
import json

url_file = '<path>/urls.json'
urls = []

if os.path.isfile(url_file):
    with open(url_file, 'r') as f:
        urls = json.load(f)['urls']
else:
    print('No URLs found to load')

print(urls)
# hook in your script here...
JSON structure for this particular example:
{
    "urls": [
        "http://example.com",
        "http://google.com"
    ]
}
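If you also want to add URLs from code rather than by editing the file by hand, a minimal sketch along the same lines could look like this (the add_url helper is just an illustration I made up, assuming the same urls.json layout):
import json
import os

url_file = '<path>/urls.json'  # same flat file as above

def add_url(new_url):
    # hypothetical helper: append a URL to urls.json and write it back
    data = {'urls': []}
    if os.path.isfile(url_file):
        with open(url_file, 'r') as f:
            data = json.load(f)
    if new_url not in data['urls']:
        data['urls'].append(new_url)
    with open(url_file, 'w') as f:
        json.dump(data, f, indent=4)

add_url('https://www.someurlc')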
I'm new to Python. Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL and now I get the error. When I print(list_of_documents) it is blank.
Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
# set download location
downloads_folder = "C:\\Scripts\\"

##### Creating outage dataframe
# Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")

list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')

## create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0])) - 9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;', '')

# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0])) - 5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, the info in the reportTable is generated by JavaScript after the page loads, so it doesn't show up in the response to a plain request. You could either try something like Selenium, or you could retrieve the data from its source directly.
If you inspect the site and look at the Network tab, you'll find the request that actually retrieves the data for the table; and if you inspect the table's HTML, you'll find the scripts that generate the data just above it.
In the suggested solution below, getReqUrl scrapes your link to get the URL for requesting the report list (and also the base URL for downloading the documents).
def getReqUrl(scrapeUrl):
    # find the <script> block that holds the report-API parameters
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")

    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]

    # pull the quoted values out of the matching script lines
    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]

    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough - the results are the same with it, and even when that last bit is omitted entirely, so it's optional.)
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)

reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)

for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")

print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The printed output lists each document's constructed name with its download link, followed by the total document count.
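If you also need to save the documents rather than just print the links, a minimal follow-up sketch could look like this (the folder name and the assumption that each downloadLink serves the file directly are mine, not part of the original code):
import os

save_dir = "downloads"  # hypothetical target folder
os.makedirs(save_dir, exist_ok=True)

for doc in rsJson['ListDocsByRptTypeRes']['DocumentList'][:3]:  # first few, as a test
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    resp = requests.get(downloadLink)
    # saving under ConstructedName is an assumption; adjust if it contains odd characters
    with open(os.path.join(save_dir, d['ConstructedName']), 'wb') as f:
        f.write(resp.content)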
I am working on a web scraping project, and I have to get links from 19,062 facilities. With a for loop it will take almost 3 hours to complete. I tried to write a generator but couldn't work out the logic, and I am not sure a generator would help at all. Is there a faster way to get what I want? In my code I run it for just 20 IDs. Thanks.
import requests, json
from bs4 import BeautifulSoup as bs

url = 'https://hilfe.diakonie.de/hilfe-vor-ort/marker-json.php?ersteller=&kategorie=0&text=&n=55.0815&e=15.0418321&s=47.270127&w=5.8662579&zoom=20000'
res = requests.get(url).json()

url_1 = 'https://hilfe.diakonie.de/hilfe-vor-ort/info-window-html.php?id='

# extracting all the id= from the .json res object
id = []
for item in res['items'][0]["elements"]:
    id.append(item["id"])

# opening a .json file and making a dict for links
file = open('links.json', 'a')
links = {'links': []}

def link_parser(url, id):
    resp = requests.get(url + id).content
    soup = bs(resp, "html.parser")
    link = soup.select_one('p > a').attrs['href']
    links['links'].append(link)

for item in id[:20]:
    link_parser(url_1, item)

# dumping the dict into the links.json file
json.dump(links, file)
file.close()
In web scraping, raw speed is not the goal! If you fire requests in a tight loop you will be hitting the server many times a second and will most likely get blocked; a generator will not make this quicker. Ideally, you want to hit the server as few times as possible and process the data locally.
If it were me, I would use a framework like Scrapy, which encourages good practice and provides various Spider classes to support standard techniques.
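As a rough illustration only (the selector and URL pattern come from your code; the placeholder IDs and the throttling settings are assumptions to tune as needed), a Scrapy spider for this could look like:
import scrapy

class DiakonieSpider(scrapy.Spider):
    name = "diakonie"
    # be polite to the server; these values are assumptions, adjust as needed
    custom_settings = {"DOWNLOAD_DELAY": 0.5, "CONCURRENT_REQUESTS": 8}

    # placeholder list; in practice, fill this from the marker-json.php response
    ids = ["12345", "67890"]

    def start_requests(self):
        for facility_id in self.ids:
            yield scrapy.Request(
                f"https://hilfe.diakonie.de/hilfe-vor-ort/info-window-html.php?id={facility_id}",
                callback=self.parse,
            )

    def parse(self, response):
        # same 'p > a' selector your code uses with BeautifulSoup
        yield {"link": response.css("p > a::attr(href)").get()}
You could then run it with an output feed, e.g. scrapy runspider diakonie_spider.py -o links.json, and let Scrapy handle the concurrency and retries for you.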
This is probably a very simple task, but I cannot find any help. I have a website that takes the form www.xyz.com/somestuff/ID. I have a list of the IDs I need information from. I was hoping to have a simple script that goes to the site and downloads the (complete) web page for each ID, saved in a specific folder under a name of the form ID_whatever_the_default_save_name_is.
Can I run a simple Python script to do this for me? I could do it by hand, since it is only 75 different pages, but I was hoping to use this to learn how to do such things in the future.
Mechanize is a great package for crawling the web with Python. A simple example for your issue would be:
import mechanize

br = mechanize.Browser()
response = br.open("http://www.xyz.com/somestuff/ID")
print response.read()
This simply grabs your URL and prints the response from the server.
This can be done simply in Python using the urllib module. Here is a simple example in Python 3:
import urllib.request

url = 'http://www.xyz.com/somestuff/ID'
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.read()
print(src)
For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html
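To tie this back to the original goal of saving one page per ID into a folder, a minimal sketch building on urllib could look like this (the ID list and folder name here are made up for illustration):
import os
import urllib.request

ids = ['101', '102', '103']   # your 75 IDs would go here
out_dir = 'pages'             # hypothetical target folder
os.makedirs(out_dir, exist_ok=True)

for page_id in ids:
    url = 'http://www.xyz.com/somestuff/' + page_id
    with urllib.request.urlopen(url) as page:
        # save each page as e.g. pages/101.html
        with open(os.path.join(out_dir, page_id + '.html'), 'wb') as f:
            f.write(page.read())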
Do you want just the HTML code for the website? If so, just create a URL variable with the host site and add the page number as you go. I'll do this as an example with http://www.notalwaysright.com
import urllib.request

url = "http://www.notalwaysright.com/page/"

for x in range(1, 71):
    newurl = url + str(x)
    response = urllib.request.urlopen(newurl)
    with open("Page/" + str(x), "ab") as p:
        p.write(response.read())
Let me start by saying that I know there are a few topics discussing problems similar to mine, but the suggested solutions do not seem to work for me for some reason.
Also, I am new to downloading files from the internet using scripts. Up until now I have mostly used python as a Matlab replacement (using numpy/scipy).
My goal:
I want to download a lot of .csv files from an internet database (http://dna.korea.ac.kr/vhot/) automatically using python. I want to do this because it is too cumbersome to download the 1000+ csv files I require by hand. The database can only be accessed using a UI, where you have to select several options from a drop down menu to finally end up with links to .csv files after some steps.
I have figured out that the url you get after filling out the drop down menus and pressing 'search' contains all the parameters of the drop-down menu. This means I can just change those instead of using the drop down menu, which helps a lot.
An example url from this website is (lets call it url1):
url1 = http://dna.korea.ac.kr/vhot/search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene=
On this page I can select 5 csv-files, one example directs me to the following url:
url2 = http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=µt=&pita=on
However, this doesn't contain the csv file directly; it appears to be a 'redirect' (a new term for me that I found by googling, so correct me if I am wrong).
One strange thing: I appear to have to load url1 in my browser before I can access url2 (I don't know if it has to be within the same day or hour; url2 didn't work for me today even though it did yesterday, and only after accessing url1 did it work again). If I do not access url1 before url2, I get "no results" instead of my csv file in the browser. Does anyone know what is going on here?
However, my main problem is that I cannot save the csv files from Python.
I have tried the packages urllib, urllib2 and requests, but I cannot get it to work.
From what I understand, the requests package should take care of redirects, but I haven't been able to make it work.
The solutions from the following web pages do not appear to work for me (or I am messing up):
stackoverflow.com/questions/7603044/how-to-download-a-file-returned-indirectly-from-html-form-submission-pyt
stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url
techniqal.com/blog/2008/07/31/python-file-read-write-with-urllib2/
Some of the things I have tried include:
import urllib2
import csv
import sys
url = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=µt=&pita='
#1
u = urllib2.urlopen(url)
localFile = open('file.csv', 'w')
localFile.write(u.read())
localFile.close()
#2
req = urllib2.Request(url)
res = urllib2.urlopen(req)
finalurl = res.geturl()
pass
# finalurl = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=µt=&pita='
#3
import requests
r = requests.get(url)
r.content
pass
#r.content = "<script>location.replace('download_send.php?name=qgN9Th&type=targetscan');</script>"
#4
import requests
r = requests.get(url,
                 allow_redirects=True,
                 data={'download_open': 'Download', 'format_open': '.csv'})
print r.content
# r.content = "
#5
import urllib
test1 = urllib.urlretrieve(url, "test.csv")
test2 = urllib.urlopen(url)
pass
For #2, #3 and #4 the outputs are shown in the comments after the code.
For #1 and #5 I just get a .csv file containing </script>.
Option #3 just gives me a new redirect, I think. Can this help me?
Can anybody help me with my problem?
The page does not send an HTTP redirect; instead, the redirect is done via JavaScript.
urllib and requests do not process JavaScript, so they cannot follow it to the download URL.
You have to extract the final download URL yourself, and then open it using any of those methods.
You could extract the URL using the re module with a regex like r'location.replace\((.*?)\)'.
Based on the response from ch3ka, I think I got it to work. From the source code I get the JavaScript redirect, and from this redirect I can get the data.
import re
import requests

# Fetch the source code of the page
redirect = requests.get(url).content

# Search for the JavaScript redirect in the source code
# --> based on the answer by ch3ka
m = re.search(r"location.replace\(\'(.*?)\'\)", redirect).group(1)

# Build the full url from the relative redirect target, then request the data with it
new_url = 'http://dna.korea.ac.kr/vhot/' + m
data = requests.get(new_url).content
I am trying to create a website downloader using python. I have the code for:
Finding all URLs from a page
Downloading a given URL
What I have to do is recursively download a page, and if there is any other link in that page, I need to download that too. I tried combining the two functions above, but the recursion doesn't work.
The codes are given below:
1)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    wanted_url = raw_input("Enter the URL: ")
    usock = urllib.urlopen(wanted_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()

    for url in parser.urls:
        download(url)
2) where download(url) function is defined as follows:
def download(url):
    import urllib
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'w')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()

a = raw_input("Enter the URL: ")
download(a)
print "Done"
Please help me combine these two pieces of code so that new links found on a downloaded page are themselves downloaded "recursively".
You may want to look into the Scrapy library.
It would make a task like this pretty trivial, and allow you to download multiple pages concurrently.
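As a rough sketch only (the domain, depth limit, and file naming below are placeholders I chose, not part of the question), a Scrapy crawler that saves pages and follows their links could look like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteDownloader(CrawlSpider):
    name = "site_downloader"
    # placeholder start point and domain restriction
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]
    custom_settings = {"DEPTH_LIMIT": 3}  # keep the recursion bounded

    # follow every link on each page and hand the response to save_page
    rules = (Rule(LinkExtractor(), callback="save_page", follow=True),)

    def save_page(self, response):
        # crude filename derived from the URL; just for illustration
        filename = response.url.rstrip("/").split("/")[-1] or "index"
        with open(filename + ".html", "wb") as f:
            f.write(response.body)
Run it with something like scrapy runspider site_downloader.py and Scrapy will handle deduplication, depth limiting, and concurrent downloads for you.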
done_url = []

def download(url):
    if url in done_url:
        return
    # ...download url code...
    done_url.append(url)
    urls = some_function_to_fetch_urls_from_this_page()
    for url in urls:
        download(url)
This is very rough code, admittedly. For example, you will need to check whether a URL is within the domain you want to crawl. But you asked for recursion.
Be mindful of the recursion depth.
There are just so many things wrong with my solution. :P
You should try a crawling library like Scrapy or something.
Generally, the idea is this:
def get_links_recursive(document, current_depth, max_depth):
    links = document.get_links()
    for link in links:
        downloaded = link.download()
        if current_depth < max_depth:
            get_links_recursive(downloaded, current_depth + 1, max_depth)
Call get_links_recursive(document, 0, 3) to get things started.
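To make that idea concrete, here is a minimal sketch using requests and BeautifulSoup (the visited set, the file naming, and the restriction to absolute http(s) links are my assumptions, not part of the pseudocode above):
import requests
from bs4 import BeautifulSoup

visited = set()

def get_links_recursive(url, current_depth, max_depth):
    if current_depth > max_depth or url in visited:
        return
    visited.add(url)

    response = requests.get(url)

    # save the page under a crude name derived from the URL
    filename = url.rstrip('/').split('/')[-1] or 'index'
    with open(filename + '.html', 'wb') as f:
        f.write(response.content)

    # find links on the page and recurse one level deeper
    soup = BeautifulSoup(response.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        link = a['href']
        if link.startswith('http'):  # only follow absolute links in this sketch
            get_links_recursive(link, current_depth + 1, max_depth)

get_links_recursive('http://example.com/', 0, 2)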