Checking if a download link works in Python

I'm currently using the following code to download a gz file. The URL of the gz file will be constructed from pieces of information provided by the user:
generalUrl = theWebsiteURL + "/" + packageName
So generalUrl can contain something like: http://www.example.com/blah-0.1.0.tar.gz
res = requests.get(generalUrl)
res.raise_for_status()
The problem I have here is that I have a list of websites for the variable theWebsiteURL, and I need to check all of them to see which ones have the package in packageName available for download. I would prefer not to download the package during this confirmation.
Once the code has gone through the list of websites to discover which ones have the package, I then want to pick the first of those websites and automatically download the package from it.
something like this:
#!/usr/bin/env python2.7
listOfWebsites = [ website1, website2, website3, website4, and so on ]
goodWebsites = []
for eachWebsite in listOfWebsites:
    genURL = eachWebsite + "/" + packageName
    res = requests.get(genURL)
    res.raise_for_status()
    if raise_for_status == "200":
        goodWebsites.append(genURL)
This is where my imagination stops. I need assistance completing this. Not even sure I'm going about it the right way.

You can try to send a HEAD request first in order to check that the URL is valid, and only then download the package via a GET request.
#!/usr/bin/env python2.7
listOfWebsites = [ website1, website2, website3, website4, and so on ]
goodWebsites = []
for eachWebsite in listOfWebsites:
    genURL = eachWebsite + "/" + packageName
    res = requests.head(genURL)
    if res.ok:
        goodWebsites.append(genURL)
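To then grab the package from the first mirror that passed the check, you could finish it along these lines (a minimal sketch, assuming packageName and goodWebsites are defined as above):
import requests

if goodWebsites:
    res = requests.get(goodWebsites[0])
    res.raise_for_status()
    # save the archive under its package name in the current directory
    with open(packageName, 'wb') as f:
        f.write(res.content)
One caveat: some servers mishandle HEAD requests, so if a mirror fails the HEAD check but serves the file fine via GET, a fallback is requests.get(genURL, stream=True), checking res.ok and closing the response without reading the body.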


Is it always correct to use URLs like "./about.html" or "../about.htm" instead of Absolute URLS like /about?

I'm a computer science student. Recently we were tasked with developing a static HTTP server from scratch without using any HTTP modules, relying solely on socket programming. This means I had to write all the logic for HTTP message parsing, extracting headers, parsing URLs, etc.
However, I'm stuck with some confusion. As I have some prior experience in web development, I'm used to writing URLs in places like anchor tags as "/about" and "/articles/article-1". However, I've seen people sometimes refer to relative paths according to their folder structure, like "./about.html" or "../contact.html". This always seemed like a bad idea to me. However, I realized that even though my code doesn't explicitly support these kinds of URLs, they seem to work anyhow.
Following is the python code I'm using to get the path from the HTTP message and then get the corresponding path in the file system.
def get_http_url(self, raw_request_headers: list[str]):
    """
    Method to get HTTP url by parsing request headers
    """
    if len(raw_request_headers) > 0:
        method_and_path_header = raw_request_headers[0]
        method_and_path_header_segments = method_and_path_header.split(" ")
        if len(method_and_path_header_segments) >= 2:
            """
            example: GET / HTTP/1.1 => ['GET', '/', 'HTTP/1.1'] => '/'
            """
            url = method_and_path_header_segments[1]
            return url
    return False
def get_resource_path_for_url(self, path: str | Literal[False]):
    """
    Method to get the resource path based on url
    """
    if not path:
        return False
    else:
        if path.endswith('/'):
            # Removing trailing '/' to make it easy to parse the url
            path = path[0:-1]
        # Split to see if the url also includes the file extension
        parts = path.split('.')
        if path == '':
            # if the requested path is "/"
            path_to_resource = os.path.join(
                os.getcwd(), "htdocs", "index.html")
        else:
            # Assumes the user entered a valid url with the resource's file extension as well, ex: http://localhost:2728/pages/about.html
            if len(parts) > 1:
                path_to_resource = os.path.join(
                    os.getcwd(), "htdocs", path[1:])  # Get the absolute path with the existing file extension
            else:
                # Assumes user requested a url without an extension and as such is hoping for an html response
                path_to_resource = os.path.join(
                    os.getcwd(), "htdocs", f"{path[1:]}.html")  # Get the absolute path to the corresponding html file
        return path_to_resource
So in my code, I'm not explicitly adding any logic to handle that kind of relative path. But somehow, when I use things like ../about.html in my test HTML files, it works anyway.
Is this the expected behavior, and if so, where is this behavior implemented? I'm on Windows, if that matters. And if this is expected, can I depend on it and conclude that it's safe to refer to HTML files and other assets with relative paths like this on my web server?
Thanks in advance for any help, and I apologize if my question is not clear or well-formed.
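For what it's worth, the reason this works is that relative URLs like "../about.html" are resolved by the browser against the URL of the current document before any request is sent (per RFC 3986), so a socket-level server only ever sees absolute paths such as /about.html. A quick illustration of that client-side resolution, using Python's urllib.parse:
from urllib.parse import urljoin

# the browser performs this resolution before the request leaves the client
print(urljoin('http://localhost:2728/pages/about.html', '../contact.html'))
# http://localhost:2728/contact.html
print(urljoin('http://localhost:2728/pages/', './about.html'))
# http://localhost:2728/pages/about.html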

I need to replace everything after https:// and before .com using python

What I'm trying to do is have it replace all URLs in an HTML file.
This is what I have done, but I realized it also deletes everything else after the site name.
s = 'https://12345678.com/'
site_link = "google"
print(s[:8] + site_link)
It would return https://google
I have made a code sample.
In this, link_template is a template for a link, and ***** represents where your site_name will go. It might look a bit confusing at first, but if you run it you'll understand.
# change this to change your URL
link_template = 'https://*****.com/'
# a site name, from your example
site_name = 'google'
# this is your completed link
site_link = site_name.join(link_template.split('*****'))
# prints the result
print(site_link)
Additionally, you can make a function for it:
def name_to_link(link_template, site_name):
    return site_name.join(link_template.split('*****'))
And then you can use the function like this:
link = name_to_link('https://translate.*****.com/','google')
print(link)
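If you'd rather do it with a regular expression, re.sub with lookarounds can swap out just the host part (a sketch, assuming the name always sits between https:// and .com as in the question):
import re

s = 'https://12345678.com/'
site_link = 'google'

# replace whatever sits between "https://" and ".com", keeping the rest
print(re.sub(r'(?<=https://)[^/.]+(?=\.com)', site_link, s))
# https://google.com/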

Python - Slate3k giving me a type error after pdfminer install

I'm on Python 3.8.3 on Windows 10.
I am working on a PDF parser, and I initially found slate3k to use with Python 3.x. I got a basic script working and started to test it on some PDFs. I had some issues with some text not being parsed properly, so I started to look into PDFMiner.
After reading through the documentation for PDFMiner, I decided to install it and give it a go, as there was some functionality that would be super useful for my use case.
However, I figured out soon after that PDFMiner doesn't work with Python 3.x. I uninstalled it and went back to using slate3k.
When I did this, I started to get a bunch of errors. I then uninstalled slate3k and re-installed it, hoping to fix things. Still got the errors. I re-installed PDFMiner and got rid of those errors, but now I'm stuck with the error below and at a loss for what to do next.
Exception has occurred: TypeError
__init__() missing 1 required positional argument: 'parser'
Here is the code (please note I haven't done much error trapping and it's still a work in progress, I'm more at the "proof of concept" stage):
import re, os
import slate3k as slate

# variable define
CurWkDir = os.getcwd()
tags = list()
rev = str()
FileName = str()
ProperFileName = str()
parsed = str()

# open file and create if it doesn't exist
xref = open('parsed from pdf xref.csv', 'w+')
xref.write('File Name, Rev, Tag')

for files in os.listdir(CurWkDir):
    # find pdf files
    if files.endswith('.pdf'):
        tags.clear()
        rev = ""
        FileName = ""
        ProperFileName = ""
        # extract revision, file name, create proper file name
        rev = re.findall(r'[0-9]{,2}[A-Z]{1}[0-9]{,2}', files)[0]
        FileName = re.findall(r'[A-Z]+[0-9]+-[A-Z]+-[0-9]+-[0-9]+|[A-Z]+[0-9]+-[A-Z]+-[A-Z]+[0-9]+-[0-9]+|[A-Z]+[0-9]+-[A-Z]+-[A-Z]+[0-9]+[A-Z]+-[0-9]+', files)[0]
        ProperFileName = FileName + "(" + rev[0: len(rev) - 1] + ")"
        # Parse through PDF to find tags
        fileopen = open(files, 'rb')
        print("Reading", files)
        raw = slate.PDF(fileopen)
        print("Finished reading", files)
        parsed = raw[0]
        parsedstripped = parsed.replace("\n", " ")
        rawtags = re.findall(r'[0-9]+[A-Z]+-[0-9]+|[0-9]+[A-Z]+[0-9]{1,5}|[0-9]{3}[A-Z]+[0-9]+', parsed, re.I)
        fileopen.close()
        print(parsedstripped)
        for t in rawtags:
            if t not in tags:
                row = ProperFileName + "," + rev + "," + t + "\n"
                xref.write(row)
                tags.append(t)
xref.close()
The error comes at Line 34
raw = slate.PDF(fileopen)
Any insight into what I did to break the functionality of slate3k is appreciated.
Thanks,
JT
I looked into the dependencies of slate3k by running pip show slate3k and found a couple of packages it depends on.
I uninstalled slate3k, pdfminer3k and pdfminer, and then re-installed slate3k.
Now everything seems to be working.

How to use python regular expression inside if... else... statement?

My Python script:
import wget

if windowsbit == x86:
    url = 'http://test.com/test_windows_2_56_30-STEST.exe'
    filename = wget.download(url)
else:
    url = 'http://test.com/test_windows-x64_2_56_30-STEST.exe'
    filename = wget.download(url)
In the above Python script, I am using the wget module to download a file from a URL, depending on whether Windows is 32-bit or 64-bit. It's working as expected.
I want to use a regular expression to do the following:
if windowsbit == x86, it should download the file that starts with test_windows and ends with STEST.exe;
otherwise, it should download the file that starts with test_windows-x64 and ends with STEST.exe.
I am new to Python and have no idea how to do this. Could anyone guide me on this?
This doesn't look possible. The regular expression that would match what you're trying to do is something like:
import re
urlre = re.compile(r"""
    http://test.com/test_windows  # base URL
    (?P<bit>-x64)?                # captures -x64 if present
    _(?P<version_major>\d+)       # captures major version
    _(?P<version_minor>\d+)       # captures minor version
    _(?P<version_revision>\d+)    # captures revision version
    -STEST.exe                    # ending filename""", re.X)
However, you can't just throw that at wget. You can't use wildcards in requests -- the webserver would have to know how to process them, and it doesn't. A better method might be:
base_url = "http://test.com/test_windows"
if windowsbit == x64:
    base_url += "-x64"

version = "2_56_30"
filename = "STEST.exe"

final_url = "{base}_{version}-{filename}".format(
    base=base_url, version=version, filename=filename)
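With the x64 branch taken, final_url comes out as http://test.com/test_windows-x64_2_56_30-STEST.exe, which is exactly the shape the verbose regex above describes.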
Maybe try this without regular expressions:
import wget

text = "http://test.com/test_windows"
if windowsbit == x86:
    url = '{}_2_56_30-STEST.exe'.format(text)
else:
    url = '{}-x64_2_56_30-STEST.exe'.format(text)
filename = wget.download(url)
With the version split out:
import wget

text = "http://test.com/test_windows"
version = '2_56_30'
if windowsbit == x86:
    url = '{}_{}-STEST.exe'.format(text, version)
else:
    url = '{}-x64_{}-STEST.exe'.format(text, version)
filename = wget.download(url)
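As an aside, the snippets above assume windowsbit is set elsewhere. If it isn't, one way to derive it (a sketch; the string values 'x86' and 'x64' are just a convention here) is via the standard platform module:
import platform

# platform.machine() returns e.g. 'AMD64' on 64-bit Windows
windowsbit = 'x64' if platform.machine().endswith('64') else 'x86'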

Python script for "Google search by image"

I have checked Google Search APIs and it seems that they have not released any API for searching images. So, I was wondering if there exists a Python script/library through which I can automate the "search by image" feature.
This was annoying enough to figure out that I thought I'd throw a comment on the first Python-related Stack Overflow result for "script google image search". The most annoying part of all this is setting up your application and custom search engine (CSE) in Google's web UI, but once you have your API key and CSE, define them in your environment and do something like:
#!/usr/bin/env python
# save top 10 google image search results to current directory
# https://developers.google.com/custom-search/json-api/v1/using_rest
import requests
import os
import sys
import re
import shutil

url = 'https://www.googleapis.com/customsearch/v1?key={}&cx={}&searchType=image&q={}'
apiKey = os.environ['GOOGLE_IMAGE_APIKEY']
cx = os.environ['GOOGLE_CSE_ID']
q = sys.argv[1]

i = 1
for result in requests.get(url.format(apiKey, cx, q)).json()['items']:
    link = result['link']
    image = requests.get(link, stream=True)
    if image.status_code == 200:
        m = re.search(r'[^\.]+$', link)
        filename = './{}-{}.{}'.format(q, i, m.group())
        with open(filename, 'wb') as f:
            image.raw.decode_content = True
            shutil.copyfileobj(image.raw, f)
        i += 1
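Invoked as, say, python image_search.py kittens (the script name is arbitrary), this saves files like ./kittens-1.jpg to the current directory; the extension is whatever follows the last dot in each result's link.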
There is no API available, but you can parse the page and imitate the browser. I don't know how much data you need to parse, though, because Google may limit or block access.
You can imitate the browser by simply using urllib and setting the correct headers, but if you think parsing complex web pages from Python may be difficult, you can use a headless browser like PhantomJS instead; inside a browser it is trivial to get the correct elements using JavaScript/DOM.
Note: before trying any of this, check Google's TOS.
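For the header trick mentioned above, a minimal sketch with urllib (Python 3's urllib.request; the URL and User-Agent value are placeholders):
from urllib.request import Request, urlopen

# a browser-like User-Agent; a real scrape would target the actual results page
req = Request('http://www.example.com/', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8', errors='replace')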
You can try this:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_python
It's deprecated, but seems to work.
