Python to list HTTP-files and directories

Python to list HTTP-files and directories - python

How can I list files and folders if I only have an IP-address?
With urllib and others, I am only able to display the content of the index.html file. But what if I want to see which files are in the root as well?
I am looking for an example that shows how to implement username and password if needed. (Most of the time index.html is public, but sometimes the other files are not).

Use requests to get page content and BeautifulSoup to parse the result.
For example if we search for all iso files at http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/:
from bs4 import BeautifulSoup
import requests
url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'
def listFD(url, ext=''):
page = requests.get(url).text
print page
soup = BeautifulSoup(page, 'html.parser')
return [url + '/' + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
for file in listFD(url, ext):
print file

You cannot get the directory listing directly via HTTP, as another answer says. It's the HTTP server that "decides" what to give you. Some will give you an HTML page displaying links to all the files inside a "directory", some will give you some page (index.html), and some will not even interpret the "directory" as one.
For example, you might have a link to "http://localhost/user-login/": This does not mean that there is a directory called user-login in the document root of the server. The server interprets that as a "link" to some page.
Now, to achieve what you want, you either have to use something other than HTTP (an FTP server on the "ip address" you want to access would do the job), or set up an HTTP server on that machine that provides for each path (http://192.168.2.100/directory) a list of files in it (in whatever format) and parse that through Python.
If the server provides an "index of /bla/bla" kind of page (like Apache server do, directory listings), you could parse the HTML output to find out the names of files and directories. If not (e.g. a custom index.html, or whatever the server decides to give you), then you're out of luck :(, you can't do it.

Zety provides a nice compact solution. I would add to his example by making the requests component more robust and functional:
import requests
from bs4 import BeautifulSoup
def get_url_paths(url, ext='', params={}):
response = requests.get(url, params=params)
if response.ok:
response_text = response.text
else:
return response.raise_for_status()
soup = BeautifulSoup(response_text, 'html.parser')
parent = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
return parent
url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid'
ext = 'iso'
result = get_url_paths(url, ext)
print(result)

HTTP does not work with "files" and "directories". Pick a different protocol.

You can use the following script to get names of all files in sub-directories and directories in a HTTP Server. A file writer can be used to download them.
from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup
def read_url(url):
url = url.replace(" ","%20")
req = Request(url)
a = urlopen(req).read()
soup = BeautifulSoup(a, 'html.parser')
x = (soup.find_all('a'))
for i in x:
file_name = i.extract().get_text()
url_new = url + file_name
url_new = url_new.replace(" ","%20")
if(file_name[-1]=='/' and file_name[0]!='.'):
read_url(url_new)
print(url_new)
read_url("www.example.com")

Related

list index out of range - beautiful soup

NEW TO PYTHON*** Below is my code I am using to pull a zip file from a website but I am getting the error, "list index out of range". I was given this code by someone else who wrote it but I had to change the URL and now I am getting the error. When I print(list_of_documents) it is blank.
Can someone help me with this? The url requires access so you won't be able to try to input this code directly. I am trying to understand how to use beautiful soup in this and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts\"
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')'
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name

You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, if you want the info from the reportTable, it gets generated after the page gets loaded with javascript, so it wouldn't show up in the request response. You could either try something like Selenium, or you could try retrieving the data from the source itself.
If you inspect the site and look at the network tab, you'll find a request (which is what actually retrieves the data for the table) that looks like this, and when you inspect the table's html, you'll find above it the scripts to generate the data.
In the suggested solution below, the getReqUrl scrapes your link to get the url for requesting the reports (and also the template of the url for downloading the documents).
def getReqUrl(scrapeUrl):
res = requests.get(scrapeUrl)
ercot_soup = BeautifulSoup(res.text, "html.parser")
script = [l.split('"') for l in [
s for s in ercot_soup.select('script')
if 'reportListUrl' in s.text
and 'reportTypeID' in s.text
][0].text.split('\n') if l.count('"') == 2]
rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]
return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())}} seems to get close enough - the results are the same even then and even when that last bit is omitted entirely; so it's totally optional.)
The url returned can be used to request the documents:
#import json
url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl[0]).text
rsJson = json.loads(reqRes)
for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
d = doc['Document']
downloadLink = ddUrl+d['DocID']
#print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The print results will look like

Image source is different in html between my browser and get request

I suspect this has happened due to my misunderstanding of how either lxml or html works and I'd appreciate if someone could fill in this blank in my knowledge.
My code is:
url = "https://prnt.sc/ca0000"
response = requests.get(url,headers={'User-Agent': 'Chrome'})
# Navigate to the correct img src.
tree = html.fromstring(response.content)
xpath = '/html/body/div[3]/div/div/img/#src'
imageURL = tree.xpath(xpath)[0]
print(imageURL)
I expect when I do this to get a result such as:
data:image/png;base64,iVBORw0KGgoAAA...((THIS IS REALLY LONG))...Jggg==
Which if I understand correctly is where the image is stored locally on my computer.
However when I run the code I get:
"https://prnt.sc/ca0000"
Why are these different?

Problem is that this page uses javaScript to put data:image/png;base64 ... in place of https://prnt.sc/ca0000 but requests can't use JavaScript.
But there are two img with different scr - first has standard URL to image (https:///....) and other has fake https://prnt.sc/ca0000
So this xpath works for me even without JavaScript
xpath = '//img[#id="screenshot-image"]/#src'
This code get correct url and download image.
import requests
from lxml import html
url = "https://prnt.sc/ca0000"
response = requests.get(url, headers={'User-Agent': 'Chrome'})
tree = html.fromstring(response.content)
image_url = tree.xpath('//img[#id="screenshot-image"]/#src')[0]
print(image_url)
# -- download ---
response = requests.get(image_url, headers={'User-Agent': 'Chrome'})
with open('image.png', 'wb') as fh:
fh.write(response.content)
Result
https://image.prntscr.com/image/797501c08d0a46ae93ff3a477b4f771c.png

Is there a way to extract information from shadow-root on a Website?

I am setting up code to check the reputation of any URL E.g. http://go.mobisla.com/ on Website "https://www.virustotal.com/gui/home/url"
First, the very basic thing I am doing is to extract all the Website contents using BeautifulSoup but seems the information I am looking for is in shadow-root(open) -- div.detections and span.individual-detection.
Example Copied Element from Webpage results:
No engines detected this URL
I am new to Python, wondering if you can share the best way to extract the information
Tried requests.get() function but it doesn't give the required information
import requests
import os,sys
from bs4 import BeautifulSoup
import pandas as pd
url_check = "deloplen.com:443"
url = "https://www.virustotal.com/gui/home/url"
req = requests.get(url + url_str)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
Expect to see "2 engines detected this URL" along with Detection Example: Dr. Web Malicious

If you use their website, it'll only return a loading screen for VirusTotal, as this isn't the proper way.
What Shows Up:
Instead, what you're supposed to do is use their public API to make requests. However, you'll have to make an account to obtain a Public API Key.
You can use this code which is able to retrieve JSON info about the link. However, you'll have to fill in the API KEY with yours.
import requests, json
user_api_key = "<api key>"
resource = "deloplen.com:443"
# feel free to remove this, just makes it look nicer
def pp_json(json_thing, sort=True, indents=4):
if type(json_thing) is str:
print(json.dumps(json.loads(json_thing), sort_keys=sort, indent=indents))
else:
print(json.dumps(json_thing, sort_keys=sort, indent=indents))
return None
response = requests.get("https://www.virustotal.com/vtapi/v2/url/report?apikey=" + user_api_key + "&resource=" + resource)
json_response = response.json()
pretty_json = pp_json(json_response)
print(pretty_json)
If you want to learn more about the API, you can use their documentation.

How to download in python big media links of a web page behind a log in form?

I'm looking for some library or libraries in Python to:
a) log in a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks

You can use the broadly used requests module (more than 35k stars on github), and BeautifulSoup. The former handles session cookies, redirections, encodings, compression and more transparently. The later finds parts in the HTML code and has an easy-to-remember syntax, e.g. [] for properties of HTML tags.
It follows a complete example in Python 3.5.2 for a web site that you can scrap without a JavaScript engine (otherwise you can use Selenium), and downloading sequentially some links with download in its URL.
import shutil
import sys
import requests
from bs4 import BeautifulSoup
""" Requirements: beautifulsoup4, requests """
SCHEMA_DOMAIN = 'https://exmaple.com'
URL = SCHEMA_DOMAIN + '/house.php/' # this is the log-in URL
# here are the name property of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
'login[login]',
'login[password]']
client = requests.session()
request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
KEYS[1]: 'my_username',
KEYS[2]: 'my_password'}
# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
data=data,
headers=dict(Referer=URL))
soup = BeautifulSoup(request.text, features="html.parser")
generator = ((tag['href'], tag.string)
for tag in soup.find_all('a')
if 'download' in tag['href'])
for url, name in generator:
with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
if request.status_code == 200:
with open(name, 'wb') as output:
request.raw.decode_content = True
shutil.copyfileobj(request.raw, output)
else:
print('status code was {} for {}'.format(request.status_code,
name),
file=sys.stderr)

You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response webpage, which are media links (.mp3, .mp4, .jpg, etc) in your case.
Finally, use requests module to stream the media files so that they don't take up too much memory like so:
response = requests.get(url, stream=True) #URL here is the media URL
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
if chunk: # filter out keep-alive new chunks
handle.write(chunk)
handle.close()
when the stream attribute of get() is set to True, the content does not immediately start downloading to RAM, instead the response behaves like an iterable, which you can iterate over in chunks of size chunk_size in the loop right after the get() statement. Before moving on to the next chunk, you can write the previous chunk to memory hence ensuring that the data isn't stored in RAM.
You will have to put this last chunk of code in a loop if you want to download media of every link in the links list.
You will probably have to end up making some changes to this code to make it work as I haven't tested it for your use case myself, but hopefully this gives a blueprint to work off of.

Performing a get request in Python

Please tell me why this similar lists of code get different results.
First one (yandex.ru) get page of request, and another one get Main page of site (moyareklama.ru)
import urllib
base = "http://www.moyareklama.ru/single_ad_new.php?"
data = {"id":"201623465"}
url = base + urllib.urlencode(data)
print url
page = urllib.urlopen(url).read()
f = open ("1.html", "w")
f.write(page)
f.close()
print page
##base = "http://yandex.ru/yandsearch?"
##data = (("text","python"),("lr","192"))
##url = base + urllib.urlencode(data)
##print url
##page = urllib.urlopen(url).read()
##f = open ("1.html", "w")
##f.write(page)
##f.close()
##print page

Most likely the reason you get something different with urllib.urlopen and your browser is because your browser can be redirected with javascript and meta/refresh tags as well as standard HTTP 301/302 responses. I'm pretty sure the urllib module will only be redirected by HTTP 301/302 responses.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python to list HTTP-files and directories - python

HTTP does not work with "files" and "directories". Pick a different protocol.

Related

list index out of range - beautiful soup

Image source is different in html between my browser and get request

Is there a way to extract information from shadow-root on a Website?

How to download in python big media links of a web page behind a log in form?

Performing a get request in Python

Categories

Resources