Build a URL using Python requests library - python

I have a base URL.
BASE_URL = 'https://api.github.com/licenses'
I want to create a new URL by appending a search term (e.g. mit) to the base URL.
NEW_URL = 'https://api.github.com/licenses/mit'
I am using the requests library to build and call the URLs, as shown below.
from requests.compat import urljoin
base_url = 'https://api.github.com/licenses'
new_url = urljoin(base_url, 'mit')
print(new_url)
But when I print the new_url, it messes up the URL.
https://api.github.com/mit
I am not sure how to fix this issue.

Add a / at the end of the base url.
BASE_URL = 'https://api.github.com/licenses/'
Otherwise urljoin treats licenses as a file name (the last path segment) and replaces it with mit.
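For reference, a quick sketch of how urljoin behaves with and without the trailing slash:
from requests.compat import urljoin

# Without a trailing slash, 'licenses' is treated as a resource and gets replaced
print(urljoin('https://api.github.com/licenses', 'mit'))    # https://api.github.com/mit

# With a trailing slash, 'mit' is appended below 'licenses'
print(urljoin('https://api.github.com/licenses/', 'mit'))   # https://api.github.com/licenses/mit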

Related

Using requests.get to get redirected url

I have a super long url, and I'm trying to print its final destination. I've got:
import requests
url = "https://l.facebook.com/l.php?u=https%3A%2F%2Fwww.washingtonpost.com%2Fscience%2F2020%2F04%2F29%2Fcoronavirus-detection-dogs%2F%3Ffbclid%3DIwAR00eT4EHsWC9986GUSox_7JS7IIg2wAan-tB-NteYJd8I4xckmxnfaNGEI&h=AT0cs4gTKPZlkSElC2uhoDYR98lsONooq_ZUFIK87khBmtZE_3r8j25EfioBPAdp-O8o7efRVG9uB-doy9vLT-AccZMrxnfpEiSYRmA2LTL21IU15bP_PTVw4SSibS1A_uE8bU-ROJexKgdk68VSTtE&__tn__=H-R&c[0]=AT3BNcTNFE13IJu3naJmxTRdJTWtO4O4L0_-nimmzcXpYv3N536YRpQZLg-v2FtP_Oz2DZZpBN6XQPb89JNJTsYFXlK8-1g4xdDLi1T_lfowpI5Ooh8kuLpciLiQ9t-ZmMd2CTUWaGZ_Y_JU0OEvVWfLLfjDq4VOzUtETBcvXHw2ZvQnTQ"
r = requests.get(url, allow_redirects=False)
print(r.headers['Location'])
It should get me to https://www.washingtonpost.com/science/2020/04/29/coronavirus-detection-dogs/?fbclid=IwAR00eT4EHsWC9986GUSox_7JS7IIg2wAan-tB-NteYJd8I4xckmxnfaNGEI. But I get the same URL I put in.
(By the way, if anyone happens to know how to do this in Javascript, that would be awesome, but Google tells me that's not possible.)
Since you just want to extract the target URL, requests can't be of much help here. Instead, you can use urllib.parse:
import urllib.parse as url_parse
url = "https://l.facebook.com/l.php?u=https..."  # the full l.facebook.com link from the question
news_link = url_parse.unquote(url).split("?u=")[1]
# if you also want to drop the fbclid tracking parameter, add this too
news_link = url_parse.unquote(url).split("?u=")[1].split("?fbclid")[0]
print(news_link)
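A slightly more robust sketch, assuming the destination always lives in the u query parameter, is to parse the query string instead of splitting on literal substrings (the url here is a stand-in for the full l.facebook.com link):
from urllib.parse import urlparse, parse_qs

url = "https://l.facebook.com/l.php?u=https..."   # the full link from the question

# parse_qs decodes the percent-encoding of the query values for us
query = parse_qs(urlparse(url).query)
news_link = query["u"][0]                          # the decoded destination URL
news_link = news_link.split("?fbclid")[0]          # optionally drop the fbclid tracker
print(news_link)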

Is there a way to extract information from shadow-root on a Website?

I am setting up code to check the reputation of any URL, e.g. http://go.mobisla.com/, on the website "https://www.virustotal.com/gui/home/url".
First, the very basic thing I am doing is extracting all the website contents using BeautifulSoup, but it seems the information I am looking for is inside a shadow-root (open) -- div.detections and span.individual-detection.
Example Copied Element from Webpage results:
No engines detected this URL
I am new to Python, and I am wondering if you can share the best way to extract the information.
I tried the requests.get() function, but it doesn't give the required information.
import requests
import os,sys
from bs4 import BeautifulSoup
import pandas as pd
url_check = "deloplen.com:443"
url = "https://www.virustotal.com/gui/home/url"
req = requests.get(url + url_check)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
I expect to see "2 engines detected this URL" along with a detection example: Dr. Web Malicious.
If you scrape their website like this, it'll only return VirusTotal's loading screen, because this isn't the proper way to query it.
Instead, what you're supposed to do is use their public API to make requests. However, you'll have to make an account to obtain a Public API Key.
You can use this code, which retrieves JSON info about the link. However, you'll have to fill in the API key with your own.
import requests, json

user_api_key = "<api key>"
resource = "deloplen.com:443"

# feel free to remove this, just makes it look nicer
def pp_json(json_thing, sort=True, indents=4):
    if type(json_thing) is str:
        print(json.dumps(json.loads(json_thing), sort_keys=sort, indent=indents))
    else:
        print(json.dumps(json_thing, sort_keys=sort, indent=indents))

response = requests.get("https://www.virustotal.com/vtapi/v2/url/report?apikey=" + user_api_key + "&resource=" + resource)
json_response = response.json()
pp_json(json_response)
If you want to learn more about the API, you can use their documentation.
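As a small aside, requests can also build the query string for you through its params argument, which avoids the manual concatenation above. A minimal sketch against the same v2 endpoint (the key and resource are placeholders):
import requests

user_api_key = "<api key>"          # your VirusTotal public API key
resource = "deloplen.com:443"

response = requests.get(
    "https://www.virustotal.com/vtapi/v2/url/report",
    params={"apikey": user_api_key, "resource": resource},  # requests URL-encodes these for you
)
print(response.json())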

Python webbrowser - Open a url without https://

I am trying to get python to open a website URL. This code works.
import webbrowser
url = 'http://www.example.com/'
webbrowser.open(url)
I have noticed that Python will only open the URL if it has http:// or https:// at the beginning.
Is it possible to get python to open the URL if it's in any of the formats in the examples below?
url = 'http://www.example.com/'
url = 'https://example.com/'
url = 'www.example.com/'
url = 'example.com/'
The URLs will be pulled from outside sources, so I can't change what data I receive.
I have looked at the python docs, and can't find the answer on stackoverflow.
Why not just add it?
if not url.startswith('http'):
    if url.startswith('www'):
        url = "http://" + url
    else:
        url = "http://www." + url
If you really don't want to change the URL string (which is quick and easy), as stazima suggested, then you can use Python 3. It supports all the URL formats listed in your question (tested them).
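Putting the two together, a minimal sketch that normalizes the string with the same prefix logic as above and then hands it to webbrowser:
import webbrowser

def open_url(url):
    # prepend a scheme if the URL doesn't already have one
    if not url.startswith('http'):
        if url.startswith('www'):
            url = "http://" + url
        else:
            url = "http://www." + url
    webbrowser.open(url)

open_url('example.com/')   # opens http://www.example.com/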

Python to Save Web Pages

This is probably a very simple task, but I cannot find any help. I have a website that takes the form www.xyz.com/somestuff/ID. I have a list of the IDs I need information from. I was hoping to have a simple script to go to the site and download the (complete) web page for each ID, saved as ID_whatever_the_default_save_name_is in a specific folder.
Can I run a simple Python script to do this for me? I can do it by hand, since it is only 75 different pages, but I was hoping to use this to learn how to do things like this in the future.
Mechanize is a great package for crawling the web with python. A simple example for your issue would be:
import mechanize
br = mechanize.Browser()
response = br.open("http://www.xyz.com/somestuff/ID")
print(response)
This simply grabs your url and prints the response from the server.
This can be done simply in python using the urllib module. Here is a simple example in Python 3:
import urllib.request
url = 'http://www.xyz.com/somestuff/ID'
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.read()
print(src)
For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html
Do you want just the html code for the website? If so, just create a url variable with the host site and add the page number as you go. I'll do this for an example with http://www.notalwaysright.com
import urllib.request
url = "http://www.notalwaysright.com/page/"
for x in range(1, 71):
    newurl = url + str(x)
    response = urllib.request.urlopen(newurl)
    with open("Page/" + str(x), "ab") as p:   # append bytes, since read() returns bytes
        p.write(response.read())
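Coming back to the original question (a fixed list of IDs, each page saved to its own file), here is a minimal sketch using requests; the base URL and the IDs are placeholders standing in for the real ones:
import requests

base_url = "http://www.xyz.com/somestuff/"   # placeholder base URL from the question
ids = ["123", "456", "789"]                  # your ~75 IDs go here

for page_id in ids:
    resp = requests.get(base_url + page_id)
    # save the raw HTML under a name that starts with the ID
    with open(page_id + "_page.html", "wb") as f:
        f.write(resp.content)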

Python to list HTTP-files and directories

How can I list files and folders if I only have an IP-address?
With urllib and others, I am only able to display the content of the index.html file. But what if I want to see which files are in the root as well?
I am looking for an example that shows how to implement username and password if needed. (Most of the time index.html is public, but sometimes the other files are not).
Use requests to get page content and BeautifulSoup to parse the result.
For example if we search for all iso files at http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/:
from bs4 import BeautifulSoup
import requests
url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'
def listFD(url, ext=''):
    page = requests.get(url).text
    # print(page)  # uncomment to inspect the raw listing
    soup = BeautifulSoup(page, 'html.parser')
    return [url + '/' + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]

for file in listFD(url, ext):
    print(file)
You cannot get the directory listing directly via HTTP, as another answer says. It's the HTTP server that "decides" what to give you. Some will give you an HTML page displaying links to all the files inside a "directory", some will give you some page (index.html), and some will not even interpret the "directory" as one.
For example, you might have a link to "http://localhost/user-login/": This does not mean that there is a directory called user-login in the document root of the server. The server interprets that as a "link" to some page.
Now, to achieve what you want, you either have to use something other than HTTP (an FTP server on the "ip address" you want to access would do the job), or set up an HTTP server on that machine that provides for each path (http://192.168.2.100/directory) a list of files in it (in whatever format) and parse that through Python.
If the server provides an "index of /bla/bla" kind of page (as Apache servers do with directory listings), you could parse the HTML output to find out the names of files and directories. If not (e.g. a custom index.html, or whatever the server decides to give you), then you're out of luck :(, you can't do it.
Zety provides a nice compact solution. I would add to his example by making the requests component more robust and functional:
import requests
from bs4 import BeautifulSoup
def get_url_paths(url, ext='', params={}):
    response = requests.get(url, params=params)
    if response.ok:
        response_text = response.text
    else:
        return response.raise_for_status()
    soup = BeautifulSoup(response_text, 'html.parser')
    parent = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
    return parent
url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'
result = get_url_paths(url, ext)
print(result)
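If the listing is protected by a username and password (the case mentioned in the question) and the server uses HTTP Basic authentication, requests can send the credentials through its auth argument. A small sketch with placeholder credentials:
import requests

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
# 'user' and 'secret' are placeholders; substitute the real credentials for your server
response = requests.get(url, auth=('user', 'secret'))
response.raise_for_status()
print(response.text[:200])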
HTTP does not work with "files" and "directories". Pick a different protocol.
You can use the following script to get the names of all files in the directories and sub-directories of an HTTP server. A file writer can be used to download them.
from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup
def read_url(url):
    url = url.replace(" ", "%20")
    req = Request(url)
    a = urlopen(req).read()
    soup = BeautifulSoup(a, 'html.parser')
    x = soup.find_all('a')
    for i in x:
        file_name = i.extract().get_text()
        url_new = url + file_name
        url_new = url_new.replace(" ", "%20")
        if file_name[-1] == '/' and file_name[0] != '.':
            read_url(url_new)
        print(url_new)

read_url("http://www.example.com/")
