Let's say I have an xls or csv file (on some cloud) with a list of websites that contain URLs to content on my site. I'd like to write a script that visits each website, checks whether the link is still there, and whether it has a 'follow' attribute. Which tools and libraries would be optimal for this? I'm thinking about using Selenium.
For manually selecting websites to check, try:
import urllib.request

url = urllib.request.urlopen(input("Website to check? (Format: http(s)://www.(WebPageDomain).(WebPageUrlEnder)/(OPTIONAL:Sub-page))\n>> "))
if input("Your website name?\n>> ") in url.read().decode("utf-8"):
    pass  # do thing
This may or may not work; I haven't had time to check it. If you run into issues with the read() method, have a look at the urllib documentation.
Yes, you can use Selenium to automate this in Python.
Alternatively, you can read the csv/xls file and store the values as a dataframe using pandas. You can then iterate over the websites and record whether each one is working or not.
# sample code for storing csv/xls in a dataframe
import pandas as pd

filepath = 'data.csv'
df = pd.read_csv(filepath)  # or: df = pd.read_excel(filepath, index_col=0)
print(df)
# sample code for checking website exists
import requests

url = 'http://www.example.com'
request = requests.get(url)
if request.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
and finally store the result in the form of csv/xls.
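Putting the pieces together for your original task, here is a minimal, untested sketch (the column name 'url', the file names, and your site's address are assumptions) that loads the spreadsheet, fetches each listed page, and records whether a link to your site is present and whether it is followed (i.e. lacks rel="nofollow"):

import pandas as pd
import requests
from bs4 import BeautifulSoup

my_site = 'https://www.example.com'  # assumption: the address your backlinks point to
df = pd.read_csv('data.csv')         # assumption: a column named 'url'

results = []
for page_url in df['url']:
    try:
        html = requests.get(page_url, timeout=10).text
    except requests.RequestException:
        results.append({'url': page_url, 'link_found': False, 'followed': False})
        continue
    soup = BeautifulSoup(html, 'html.parser')
    # anchors whose href points at your site
    links = [a for a in soup.find_all('a', href=True) if my_site in a['href']]
    # a link counts as followed unless it declares rel="nofollow"
    followed = any('nofollow' not in (a.get('rel') or []) for a in links)
    results.append({'url': page_url, 'link_found': bool(links), 'followed': followed})

pd.DataFrame(results).to_csv('link_check_results.csv', index=False)

Selenium is only worth the overhead if the target pages build their content with javascript; for plain html, requests plus BeautifulSoup as above is faster and simpler.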
New to Python: below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL and now I am getting the error. When I print(list_of_documents) it is blank.
Can someone help me with this? The url requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts"
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, the info in the reportTable is generated by javascript after the page loads, so it doesn't show up in the request response. You could either try something like Selenium, or you could try retrieving the data from the source itself.
If you inspect the site and look at the network tab, you'll find the request that actually retrieves the data for the table; and when you inspect the table's html, you'll find the scripts that generate the data just above it.
In the suggested solution below, the getReqUrl function scrapes your link to get the url for requesting the reports (and also the template of the url for downloading the documents).
def getReqUrl(scrapeUrl):
    # uses the requests, BeautifulSoup and time imports from above
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")
    # take the one <script> block that defines the report parameters,
    # keeping only its lines that contain exactly one quoted value
    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]
    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]
    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough - the results are the same even when that last bit is omitted entirely, so it's optional.)
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)
for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The printed results will list each document's constructed name followed by its download link.
I'm trying to fetch an excel file with urllib, as seen below:
import urllib.request as url
request = url.urlopen("url").geturl()
url.urlretrieve(request,"excelfile.xls")
However, the url is not a direct link to the file but to an html page which triggers the download after a small delay (without any redirects). This causes the code above to retrieve the html file instead.
I've worked out a temporary fix, but it is very unreliable. See below.
req1 = url.urlopen("url").geturl()
url.urlretrieve(req1,"excelfile.xls")
time.sleep(5)
req2 = url.urlopen("url").geturl()
url.urlretrieve(req2,"excelfile.xls")
time.sleep(5) sometimes makes up for the delay and the correct file gets downloaded.
Is there a more reliable way to be sure to get the correct file?
I've tried using .info() so the code could retry until it gets the correct file, but when trying the code below, the info printed did not correlate with the actual response from urlretrieve. I'm probably using it wrong.
req1 = url.urlopen("url")
url.urlretrieve(req1.geturl(),"excelfile.xls")
info = req1.info()
print(info.get_content_type())
time.sleep(5)
req2 = url.urlopen("url")
url.urlretrieve(req2.geturl(),"excelfile.xls")
info = req2.info()
print(info.get_content_type())
Any suggestions?
The url to the excel file can be found here.
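For reference, this is the kind of retry loop I have in mind (an untested sketch; it assumes the correct file is eventually served at the same url with a non-html Content-Type, and the retry count and delay are guesses). It checks .info() on the same response it saves, since urlretrieve(req1.geturl(), ...) opens the url a second time, which may be why the printed info didn't correlate with the saved file:

import time
import urllib.request

def fetch_excel(page_url, filename, retries=5, delay=5):
    # re-request until the response advertises something other than
    # text/html (e.g. application/vnd.ms-excel), then save that body
    for _ in range(retries):
        response = urllib.request.urlopen(page_url)
        if response.info().get_content_type() != 'text/html':
            with open(filename, 'wb') as f:
                f.write(response.read())
            return True
        time.sleep(delay)
    return False

fetch_excel("url", "excelfile.xls")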
I am trying to extract a list of golf course names and addresses from the Garmin website using the script below.
import csv
import requests
from bs4 import BeautifulSoup
courses_list = []
for i in range(893):  # 893 pages in total
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    g_data2 = soup.find_all("div", {"class": "result"})
    for item in g_data2:
        try:
            name = item.contents[3].find_all("div", {"class": "name"})[0].text
            print(name)
        except:
            name = ''
        try:
            address = item.contents[3].find_all("div", {"class": "location"})[0].text
        except:
            address = ''
        course = [name, address]
        courses_list.append(course)

with open('PGA_Garmin2.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(courses_list)
After running the script, I don't get the full data I need; the output is incomplete and inconsistent between runs. I need to extract information from 893 pages and should end up with a list of at least 18000 courses, but after running this script I only get 122. How do I fix this script to produce a CSV with the complete set of golf courses from the Garmin website? I have already corrected the page numbers to match the Garmin site's pagination, where the offset increases by 20 per page.
Just taking a guess here, but try checking r.status_code and confirming that it's 200? It's possible that you're not retrieving the whole website.
Stab in the dark.
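To make that concrete, here is a small diagnostic sketch (reusing the url pattern and result class from the question) that logs the status code and the number of parsed results per page, so you can see where the data stops:

import requests
from bs4 import BeautifulSoup

# check the first few pages before committing to all 893
for i in range(5):
    url = ("http://sites.garmin.com/clsearch/courses"
           "?browse=1&country=US&lang=en&per_page={}".format(i * 20))
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    results = soup.find_all("div", {"class": "result"})
    print(i, r.status_code, len(results))  # non-200 or 0 results flags a problem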
I am writing a function that downloads and stores today's list of pre-release domains as a .txt file from http://www.namejet.com/pages/downloads.aspx. I am trying to achieve it using json.
import json
import requests

def hello():
    r = requests.get('http://www.namejet.com/pages/downloads.aspx')
    # Replace with your website URL
    with open("a.txt", "w") as f:
        # Replace with your file name
        for item in r.json or []:
            try:
                f.write(item['name']['name'] + "\n")
            except KeyError:
                pass

hello()
I need to download the file consisting of pre-release domains using python. How can I do that? Is the code above the right way to do it?
I don't think mechanize is much use for javascript; use selenium. Here's an example:
In [1]: from selenium import webdriver
In [2]: browser=webdriver.Chrome() # Select browser that you want to automate
In [3]: browser.get('http://www.namejet.com/pages/downloads.aspx')
In [4]: element=browser.find_element_by_xpath(
            '//a[@id="ctl00_ContentPlaceHolder1_hlPreRelease1"]')
In [5]: element.click()
Now you can find prerelease_10-08-2012.txt in your download folder and you can open it in a usual way.
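If you want the file to land in a known directory rather than the browser's default download folder, Chrome's download location can be preconfigured (a sketch under the assumption you're driving Chrome; the path is a placeholder):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/tmp/downloads",  # placeholder path
})
browser = webdriver.Chrome(options=options)
browser.get('http://www.namejet.com/pages/downloads.aspx')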
I see a few problems with your approach:
The page doesn't return any json; so even if you were to access the page successfully, r.json will be empty:
>>> import requests
>>> r = requests.get('http://www.namejet.com/pages/downloads.aspx')
>>> r.json
The file that you are after is hidden behind a postback link, which you cannot "execute" using requests, since requests does not understand javascript.
In light of the above, the better approach is to use mechanize or an alternative that emulates a browser. You could also ask the company to provide you with a direct link.
Let me start by saying that I know there are a few topics discussing problems similar to mine, but the suggested solutions do not seem to work for me for some reason.
Also, I am new to downloading files from the internet using scripts. Up until now I have mostly used python as a Matlab replacement (using numpy/scipy).
My goal:
I want to download a lot of .csv files from an internet database (http://dna.korea.ac.kr/vhot/) automatically using python, because it is too cumbersome to download the 1000+ csv files I require by hand. The database can only be accessed through a UI, where you have to select several options from drop-down menus to finally end up with links to .csv files after some steps.
I have figured out that the url you get after filling out the drop-down menus and pressing 'search' contains all the parameters of those menus. This means I can just change those instead of using the drop-down menus, which helps a lot.
An example url from this website is (lets call it url1):
url1 = http://dna.korea.ac.kr/vhot/search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene=
On this page I can select 5 csv-files, one example directs me to the following url:
url2 = http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=&mut=&pita=on
However, this doesn't contain the csv file directly, but appears to be a 'redirect' (a new term for me that I found by googling, so correct me if I am wrong).
One strange thing: I appear to have to load url1 in my browser before I can access url2 (I don't know if it has to be within the same day or hour; url2 didn't work for me today even though it did yesterday, and only after accessing url1 did it work again). If I do not access url1 before url2, I get "no results" instead of my csv file in the browser. Does anyone know what is going on here?
However, my main problem is that I cannot save the csv files from python.
I have tried the packages urllib, urllib2 and requests, but I cannot get it to work.
From what I understand, the requests package should take care of redirects, but I haven't been able to make it work.
The solutions from the following web pages do not appear to work for me (or I am messing up):
stackoverflow.com/questions/7603044/how-to-download-a-file-returned-indirectly-from-html-form-submission-pyt
stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url
techniqal.com/blog/2008/07/31/python-file-read-write-with-urllib2/
Some of the things I have tried include:
import urllib2
import csv
import sys
url = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&mut=&pita='
#1
u = urllib2.urlopen(url)
localFile = open('file.csv', 'w')
localFile.write(u.read())
localFile.close()
#2
req = urllib2.Request(url)
res = urllib2.urlopen(req)
finalurl = res.geturl()
pass
# finalurl = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&mut=&pita='
#3
import requests
r = requests.get(url)
r.content
pass
#r.content = "< s c r i p t > location.replace('download_send.php?name=qgN9Th&type=targetscan'); < / s c r i p t >"
#4
import requests
r = requests.get(url,
allow_redirects=True,
data={'download_open': 'Download', 'format_open': '.csv'})
print r.content
# r.content = "
#5
import urllib
test1 = urllib.urlretrieve(url, "test.csv")
test2 = urllib.urlopen(url)
pass
For #2, #3 and #4 the outputs are displayed after the code.
For #1 and #5 I just get a .csv file containing the <script> redirect instead of the data.
Option #3 just gives me a new redirect, I think; can this help me?
Can anybody help me with my problem?
The page does not send a HTTP Redirect, instead the redirect is done via JavaScript.
urllib and requests do not process javascript, so they cannot follow to the download url.
You have to extract the final download url by yourself, and then open it, using any of the methods.
You could extract the URL using the re module, with a regex like r'location\.replace\((.*?)\)'.
Based on the response from ch3ka, I think I got it to work. From the page source I extract the JavaScript redirect, and from this redirect I can get the data.
import re
import requests

# Get the page source
redirect = requests.get(url).text
# Search for the JavaScript redirect in the source
# --> based on ch3ka's answer
target = re.search(r"location\.replace\('(.*?)'\)", redirect).group(1)
# Build the download url (the redirect target is relative to the /vhot/ directory),
# then use it to fetch the data
new_url = 'http://dna.korea.ac.kr/vhot/' + target
data = requests.get(new_url).content
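For completeness, here is a hedged sketch combining this with a single requests.Session, on the assumption (based on the observation in the question) that the site expects url1 to be visited before url2 in the same session; url1 and url2 are the addresses quoted earlier:

import re
import requests

BASE = 'http://dna.korea.ac.kr/vhot/'  # assumption: redirect targets are relative to this path

url1 = BASE + ('search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC'
               '&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene=')
url2 = BASE + ('download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and'
               '&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999'
               '&targetscan=&miranda=&rnahybrid=&mut=&pita=on')

with requests.Session() as s:
    s.get(url1)         # visit the search page first, as a browser would
    page = s.get(url2)  # returns the page containing the javascript redirect
    target = re.search(r"location\.replace\('(.*?)'\)", page.text).group(1)
    with open('file.csv', 'wb') as f:
        f.write(s.get(BASE + target).content)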