I have a script that works if I hard-code the link in the script itself, but I wish to take multiple URLs from a CSV file with a column, say url_to_check, and validate them one by one to see whether each URL is valid or not. Please help. Thanks
import httplib
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    conn = httplib.HTTPConnection(p.netloc)
    conn.request('HEAD', p.path)
    resp = conn.getresponse()
    return resp.status < 400

if __name__ == '__main__':
    print checkUrl('http://www.stackoverflow.com')
You can use Python's csv module for parsing your CSV file.
A simple example using your example column name and checkUrl function:
import csv

with open('/path/to/your/csv/file') as fobj:
    reader = csv.DictReader(fobj)
    for row in reader:
        valid = checkUrl(row['url_to_check'])
        print('%s is %svalid' % (row['url_to_check'], '' if valid else 'in'))
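One caveat: httplib.HTTPConnection always speaks plain HTTP, so any https:// rows in the CSV won't actually be checked over TLS. A variant of checkUrl that picks the connection class by scheme (a sketch, still Python 2):
import httplib
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    # choose HTTPS or HTTP based on the URL's scheme
    conn_cls = httplib.HTTPSConnection if p.scheme == 'https' else httplib.HTTPConnection
    conn = conn_cls(p.netloc)
    conn.request('HEAD', p.path or '/')  # an empty path must be sent as '/'
    resp = conn.getresponse()
    return resp.status < 400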
I am attempting to download CSV data from an API, which I will then edit, but I am struggling to get the different functions to work together, i.e. passing the export link through to download the file and then through to opening it.
'''
File name: downloadAWR.py
Author: Harry&Joe
Date created: 3/10/17
Date last modified: 5/10/17
Version: 3.6
'''
import requests
import json
import urllib2
import zipfile
import io
import csv
import os
from urllib2 import urlopen, URLError, HTTPError
geturl() is used to create a download link for the CSV data. One link is created from user-input data, in this case the project name and dates; this gives us a link we can use to download the data. The link is stored in export_link.
def geturl():
    # getProjectName
    project_name = 'BIMM'
    # getApiToken
    api_token = "API KEY HERE"
    # getStartDate
    start_date = '2017-01-01'
    # getStopDate
    stop_date = '2017-09-01'
    url = "https://api.awrcloud.com/get.php?action=export_ranking&project=%s&token=%s&startDate=%s&stopDate=%s" % (project_name, api_token, start_date, stop_date)
    export_link = requests.get(url).content
    return export_link
dlfile is used to actually use the link and get a file we can manipulate and edit, e.g. removing columns and some of the data.
def dlfile(export_link):
    # Open the url
    try:
        f = urlopen(export_link)
        print("downloading " + export_link)
        # Open our local file for writing
        with open(os.path.basename(export_link), "wb") as local_file:
            local_file.write(f.read())
    # handle errors
    except HTTPError as e:
        print("HTTP Error:", e.code, export_link)
    except URLError as e:
        print("URL Error:", e.reason, export_link)
    return f
readdata is used to go into the file and open it for us to use.
def readdata():
    with zipfile.ZipFile(io.BytesIO(zipdata)) as z:
        for f in z.filelist:
            csvdata = z.read(f)
            #reader = csv.reader(io.StringIO(csvdata.decode()))
def main():
    # Do something with the csv data
    export_link = geturl()
    data = dlfile(export_link)
    csvdata = data.readdata()

if __name__ == '__main__':
    main()
Generally I'm finding that the code works independently but struggles when I try to put it all together synchronously.
You need to clean up and call your code appropriately. It seems you copy-pasted from different sources and now have a salad bowl of code that isn't mixing well.
If the task is just to read and open a remote file to do something to it:
import io
import zipfile
import requests

def get_csv_file(project, api_token, start_date, end_date):
    url = "https://api.awrcloud.com/get.php"
    params = {'action': 'export_ranking',
              'project': project,
              'token': api_token,
              'startDate': start_date,
              'stopDate': end_date}
    r = requests.get(url, params=params)
    r.raise_for_status()
    # the first request returns the export link; a second request downloads the zip
    return zipfile.ZipFile(io.BytesIO(requests.get(r.content).content))

def process_csv_file(zip_file):
    contents = zip_file.extractall()
    # do stuff with the contents

if __name__ == '__main__':
    process_csv_file(get_csv_file('BIMM', 'api-key', '2017-01-01', '2017-09-01'))
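If you would rather parse the CSV rows in memory than extract them to disk, process_csv_file could instead look like this sketch (assuming Python 3 and UTF-8-encoded CSVs inside the zip):
import csv
import io

def process_csv_file(zip_file):
    # walk every file inside the zip and parse it as CSV, row by row
    for name in zip_file.namelist():
        with zip_file.open(name) as f:
            reader = csv.reader(io.TextIOWrapper(f, encoding='utf-8'))
            for row in reader:
                print(row)  # do stuff with each row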
Very new to Python, and I haven't found a specific answer on SO, so apologies in advance if this is very naive or already answered elsewhere.
I am trying to print the 'IncorporationDate' JSON data from multiple URLs of a public data set. I have the URLs saved in a CSV file, snippet below. I am only getting as far as printing ALL the JSON data from one URL, and I am uncertain how to run that over all of the CSV's URLs and write just the IncorporationDate values to a CSV.
Any basic guidance or edits are really welcome!
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen

import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)

url = "http://data.companieshouse.gov.uk/doc/company/01046514.json"
print(get_jsonparsed_data(url))
import csv

with open('test.csv') as f:
    lis = [line.split() for line in f]
    for i, x in enumerate(lis):
        print()

import StringIO
s = StringIO.StringIO()
with open('example.csv', 'w') as f:
    for line in s:
        f.write(line)
Snippet of csv:
http://business.data.gov.uk/id/company/01046514.json
http://business.data.gov.uk/id/company/01751318.json
http://business.data.gov.uk/id/company/03164710.json
http://business.data.gov.uk/id/company/04403406.json
http://business.data.gov.uk/id/company/04405987.json
Welcome to the Python world.
For making HTTP requests, we commonly use requests because of its dead-simple API.
The code snippet below does what I believe you want:
It grabs the data from each of the URLs you posted.
It creates a new CSV file with each of the IncorporationDate keys.
```
import csv
import requests

COMPANY_URLS = [
    'http://business.data.gov.uk/id/company/01046514.json',
    'http://business.data.gov.uk/id/company/01751318.json',
    'http://business.data.gov.uk/id/company/03164710.json',
    'http://business.data.gov.uk/id/company/04403406.json',
    'http://business.data.gov.uk/id/company/04405987.json',
]

def get_company_data():
    for url in COMPANY_URLS:
        res = requests.get(url)
        if res.status_code == 200:
            yield res.json()

if __name__ == '__main__':
    for data in get_company_data():
        try:
            incorporation_date = data['primaryTopic']['IncorporationDate']
        except KeyError:
            continue
        else:
            with open('out.csv', 'a') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow([incorporation_date])
```
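A small refinement: opening out.csv once, before the loop, avoids reopening the file for every row. A sketch of the same __main__ block with that change (reusing the imports and get_company_data above):
```
if __name__ == '__main__':
    with open('out.csv', 'a') as csvfile:
        writer = csv.writer(csvfile)
        for data in get_company_data():
            try:
                writer.writerow([data['primaryTopic']['IncorporationDate']])
            except KeyError:
                continue  # skip payloads without the field
```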
First step, you have to read all the URLs in your CSV:
import csv

with open('text.csv') as f:
    csvReader = csv.reader(f)
    # next(csvReader)  # uncomment if you have a header in the .CSV file
    all_urls = [row[0] for row in csvReader if row]
Second step, fetch the data from the URL:
from urllib.request import urlopen
import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)

url_data = get_jsonparsed_data("give_your_url_here")
Third step:
Go through all the URLs that you got from the CSV file
Get the JSON data
Fetch the field you need, in your case "IncorporationDate"
Write it into an output CSV file; I'm naming it IncorporationDates.csv
Code below:
with open('IncorporationDates.csv', 'w') as abc:  # open once, so earlier rows aren't overwritten
    for each_url in all_urls:
        url_data = get_jsonparsed_data(each_url)
        abc.write(url_data['primaryTopic']['IncorporationDate'] + '\n')
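If some of the URLs 404 or their JSON lacks the field, a try/except keeps the loop going instead of crashing (a sketch; HTTPError lives in urllib.error on Python 3):
from urllib.error import HTTPError

with open('IncorporationDates.csv', 'w') as abc:
    for each_url in all_urls:
        try:
            url_data = get_jsonparsed_data(each_url)
            abc.write(url_data['primaryTopic']['IncorporationDate'] + '\n')
        except (HTTPError, KeyError):
            continue  # skip URLs that fail or lack IncorporationDate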
I'm attempting to use this Python 2 code snippet from WeatherUnderground's API page in Python 3.
import urllib2
import json
f = urllib2.urlopen('http://api.wunderground.com/api/apikey/geolookup/conditions/q/IA/Cedar_Rapids.json')
json_string = f.read()
parsed_json = json.loads(json_string)
location = parsed_json['location']['city']
temp_f = parsed_json['current_observation']['temp_f']
print "Current temperature in %s is: %s" % (location, temp_f)
f.close()
I've used 2to3 to convert it over, but I'm still having some issues. The main conversion here is switching from the old urllib2 to the new urllib. I've tried using the requests library to no avail.
Using urllib from Python 3, this is the code I have come up with:
import urllib.request
import urllib.error
import urllib.parse
import codecs
import json
url = 'http://api.wunderground.com/api/apikey/forecast/conditions/q/C$
response = urllib.request.urlopen(url)
#Decoding on the two lines below this
reader = codecs.getreader("utf-8")
obj = json.load(reader(response))
json_string = obj.read()
parsed_json = json.loads(json_string)
currentTemp = parsed_json['current_observation']['temp_f']
todayTempLow = parsed_json['forecast']['simpleforecast']['forecastday']['low'][$
todayTempHigh = parsed_json['forecast']['simpleforecast']['forecastday']['high'$
todayPop = parsed_json['forecast']['simpleforecast']['forecastday']['pop']
Yet I'm getting an error about it being the wrong object type (bytes instead of str).
The closest thing I could find to the solution is this question here.
Let me know if any additional information is needed to help me find a solution!
Here's a link to the WU API website if that helps.
urllib returns bytes. You convert them to a string using
json_string.decode('utf-8')
Your Python 2 code would convert to:
from urllib import request
import json
f = request.urlopen('http://api.wunderground.com/api/apikey/geolookup/conditions/q/IA/Cedar_Rapids.json')
json_string = f.read()
parsed_json = json.loads(json_string.decode('utf-8'))
location = parsed_json['location']['city']
temp_f = parsed_json['current_observation']['temp_f']
print ("Current temperature in %s is: %s" % (location, temp_f))
f.close()
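Since you mentioned trying requests: the same program is shorter there, because requests handles the bytes-to-str decoding and JSON parsing for you (a sketch, keeping the same placeholder apikey URL):
import requests

r = requests.get('http://api.wunderground.com/api/apikey/geolookup/conditions/q/IA/Cedar_Rapids.json')
parsed_json = r.json()  # decodes the body and parses the JSON in one step
location = parsed_json['location']['city']
temp_f = parsed_json['current_observation']['temp_f']
print("Current temperature in %s is: %s" % (location, temp_f))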
import urllib2
import urllib
import json
import urlparse

def main():
    f = open("C:\Users\Stern Marketing\Desktop\dumpaday.txt", "r")
    if f.mode == 'r':
        item = f.read()
        for x in item:
            urlParts = urlparse.urlsplit(x)
            filename = urlParts.path.split('/')[-1]
            urllib.urlretrieve(item.strip(), filename)

if __name__ == "__main__":
    main()
Looks like the script is still not working properly, I'm really not sure why... :S
Getting lots of errors...
urllib.urlretrieve("x", "0001.jpg")
This will try to download from the (static) URL "x".
The URL you actually want to download from is within the variable x, so you should write your line to reference that variable:
urllib.urlretrieve(x, "0001.jpg")
Also, you probably want to change the target filename for each download, so you don’t keep on overwriting it.
Regarding your filename update:
urlparse.urlsplit is a function that takes a URL and splits it into multiple parts. Those parts are returned from the function, so you need to save them in some variable.
One part is the path, which is what contains the file name. The path itself is a string on which you can call the split method to separate it by the / character. As you are interested in only the last part—the filename—you can discard everything else:
url = 'http://www.dumpaday.com/wp-content/uploads/2013/12/funny-160.jpg'
urlParts = urlparse.urlsplit(url)
print(urlParts.path) # /wp-content/uploads/2013/12/funny-160.jpg
filename = urlParts.path.split('/')[-1]
print(filename) # funny-160.jpg
It should work like this:
import urllib2
import urllib
import json
import urlparse

def main():
    with open("C:\Users\Stern Marketing\Desktop\dumpaday.txt", "r") as f:
        for x in f:
            urlParts = urlparse.urlsplit(x.strip())
            filename = urlParts.path.split('/')[-1]
            urllib.urlretrieve(x.strip(), filename)

if __name__ == "__main__":
    main()
The readlines method of file objects returns lines with a trailing newline character (\n).
Change your loop to the following:
# By the way, you don't need readlines at all. Iterating over a file yields its lines.
for x in fl:
    urllib.urlretrieve(x.strip(), "0001.jpg")
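Note this still saves every download to 0001.jpg, overwriting it each time; deriving the filename from the URL, as the other answer shows, avoids that (a sketch, assuming Python 2's urllib and urlparse and the same fl file object):
import urllib
import urlparse

for x in fl:
    url = x.strip()
    # take the last path segment of the URL as the local filename
    filename = urlparse.urlsplit(url).path.split('/')[-1]
    urllib.urlretrieve(url, filename)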
Here is a solution that loops over images indexed 160 to 169. You can adjust as needed. It builds each URL from the base, opens it via urllib2, and saves it as a binary file.
import urllib2

base_url = "http://www.dumpaday.com/wp-content/uploads/2013/12/funny-{}.jpg"

for n in xrange(160, 170):
    url = base_url.format(n)
    f_save = "{}.jpg".format(n)
    req = urllib2.urlopen(url)
    with open(f_save, 'wb') as FOUT:
        FOUT.write(req.read())
I have made this simple download manager, but the problem is it won't work on complex URLs, when pages are redirected.
def str(d):
    for i in range(len(d)):
        if d[-i] == '/':
            x = -i
            break
    s = []
    l = len(d) + x + 1
    print d[l], d[len(d)-1]
    s = d[l:]
    return s

import urllib2

url = raw_input()
filename = str(url)
webfile = urllib2.urlopen(url)
data = webfile.read()
fout = open(filename, "w")
fout.write(data)
fout.close()
webfile.close()
it wouldn't work for http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=9&ved=0CG0QFjAI&url=http%3A%2F%2Fwww.iasted.org%2Fconferences%2Fformatting%2FPresentations-Tips.ppt&ei=clfWTpjZEIblrAfC8qWXDg&usg=AFQjCNEIgqx6x4ULHFXzzYDzCITuUJOczA&sig2=0VtKXPvoDnIq-lIR4S9LEQ
while it would work for http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt
and both links are for the same file.
How to solve the problem of redirection?
I think redirection is not the problem here:
urllib2 already follows HTTP redirects automatically; Google only sends you to a different page in case of error.
Try this script:
url1 = 'http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=9&ved=0CG0QFjAI&url=http%3A%2F%2Fwww.iasted.org%2Fconferences%2Fformatting%2FPresentations-Tips.ppt&ei=clfWTpjZEIblrAfC8qWXDg&usg=AFQjCNEIgqx6x4ULHFXzzYDzCITuUJOczA&sig2=0VtKXPvoDnIq-lIR4S9LEQ'
url2 = 'http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt'

from urlparse import urlsplit
from urllib2 import urlopen

for url in [url1, url2]:
    split = urlsplit(url)
    filename = split.path[split.path.rfind('/')+1:]
    if not filename:
        filename = split.query[split.query.rfind('/')+1:]
    f = open(filename, 'wb')  # binary mode: .ppt files are not text
    f.write(urlopen(url).read())
    f.close()
# Yields 2 files: url and Presentations-Tips.ppt [both are ppt files]
The above script works every time.
In general, you handle redirection by using urllib2.HTTPRedirectHandler, like this:
import urllib2

opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
res = opener.open('http://example.com/some/url/')
However, it doesn't look like this will work for the Google URL you've given in your example, because rather than including a Location header in the response, the Google result looks like this:
<script>window.googleJavaScriptRedirect=1</script><script>var a=parent,b=parent.google,c=location;if(a!=window&&b){if(b.r){b.r=0;a.location.href="http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt";c.replace("about:blank");}}else{c.replace("http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt");};</script><noscript><META http-equiv="refresh" content="0;URL='http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt'"></noscript>
...which is to say, it uses a JavaScript redirect, which substantially complicates your life. You could use Python's re module to extract the correct location from this block.
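For example, a sketch along those lines (the regex is an assumption about the shape of the snippet above, not a general-purpose redirect parser):
import re
import urllib2

# google_url stands for the original Google result link from the question
html = urllib2.urlopen(google_url).read()
# pull the first absolute URL out of the JavaScript redirect block
match = re.search(r'href="(http[^"]+)"', html)
if match:
    real_url = match.group(1)
    data = urllib2.urlopen(real_url).read()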