I want to clone a single webpage with all of its images and no external links left in the HTML. I can achieve this with wget -E -H -k -K -p {url}, however this pulls down the page together with its full directory structure, and you have to navigate into that structure to find the HTML file that displays the content. That makes the location of the HTML file unpredictable.
I can also do wget --no-check-certificate -O index.html -c -k {url}, however this keeps the remote links to the images and doesn't make the webpage truly local, since it still has to go out to the web to display properly.
Is there any way to clone a single webpage and spit out an index.html with the images linked locally?
PS: I am using wget through a Python script that makes changes to webpages, so having an index.html is necessary for me. I am interested in other methods if there are better ones.
EDIT:
So it seems I haven't explained myself well, so here is a bit of background on this project. I am working on a proof of concept for school: an automated phishing script that is supposed to clone a webpage, modify a few action tags, and place the result on a local web server so that a user can navigate to it and the page displays correctly. Previously, using -O worked fine for me, but since I am now incorporating DNS spoofing into the project, the cloned page can't have any links pointing externally, as those requests would just end up rerouted to my internal web server and the page would look broken. That is why I need only the information necessary for the single webpage to display correctly, and I need the output to be predictable, so that I can be sure that when I navigate to the directory I cloned the website into, the page will display properly (with working links to images, CSS, etc.).
Use this: wget facebook.com --domains website.org --no-parent --page-requisites --html-extension --convert-links. If you want to download the entire website, add --recursive as well.
wget is a bash command. There's no point in invoking it through Python when you can directly achieve this task in Python. Basically what you're trying to make is a web scraper. Use requests and BeautifulSoup modules to achieve this. Research a bit about them and start writing a script. If you hit any errors, feel free to post a new question about it on SO.
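As a rough illustration of that suggestion, here is a minimal sketch (not the poster's script; the target URL, the output folder, and the availability of requests and BeautifulSoup are assumptions) that fetches a single page, downloads its images into one folder, and rewrites the img tags so that index.html displays entirely locally:

import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

url = "https://example.com/page"      # placeholder: the page to clone
out_dir = "cloned_page"               # placeholder: output folder
os.makedirs(out_dir, exist_ok=True)

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

# download every image and point its src at the local copy
for img in soup.find_all("img", src=True):
    img_url = urljoin(url, img["src"])
    filename = os.path.basename(urlparse(img_url).path) or "image"
    with open(os.path.join(out_dir, filename), "wb") as fh:
        fh.write(requests.get(img_url).content)
    img["src"] = filename

# write a single index.html next to the downloaded images
with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as fh:
    fh.write(str(soup))

Stylesheets and scripts could be handled the same way by rewriting the link and script tags before writing index.html.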
Related
In the past, Bank Holding Company data on form FR Y-9C was easily downloadable from the Chicago Fed website; a simple curl command would do the trick.
This has now changed: the repository, which originally goes back to 1986, has moved to this site, and it requires clicking through before one can download.
I want to download daily updates of the zip files (e.g. BHCF20210630.ZIP) on a headless Linux machine, and I want to avoid using Selenium.
I tried to obtain the zip-file using the link below, and variations of that link, but alas, no result:
https://www.ffiec.gov/nwp/FinancialReport/ReturnBHCFZipFiles?zipfilename='BHCF20210630.ZIP'
Thanks to the comments, "Copy as cURL" does the job. For Firefox, I installed the cliget extension and followed this instruction for copying as cURL: download the file in the browser, and cliget will show you the corresponding curl command.
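For anyone who would rather replay the captured request from Python than run the generated curl command, a minimal sketch might look like the following. The header and cookie values are placeholders that must be copied from your own browser session (the cliget or "Copy as cURL" output shows them), and the URL is the one from the question, which may still need adjusting:

import requests

# placeholder values: copy the real headers and cookie string
# from your own browser session ("Copy as cURL" / cliget output)
headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "<paste the cookie string from your browser here>",
}

url = ("https://www.ffiec.gov/nwp/FinancialReport/ReturnBHCFZipFiles"
       "?zipfilename=BHCF20210630.ZIP")

resp = requests.get(url, headers=headers)
resp.raise_for_status()

with open("BHCF20210630.ZIP", "wb") as fh:
    fh.write(resp.content)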
I cannot download the complete HTML of a Google Drive folder page, which I need in order to find the ID codes for downloading the public files in that folder. If I open the page in Mozilla Firefox and save it through the browser, everything is there in the HTML. The link to the Google Drive folder is in the example code below. All of this is done as an unregistered Google user; these are public files and public folders.
The file that I can find in the HTML saved by Firefox, but not in the HTML fetched with wget or Python, has the name:
piconwhite-220x132-freeSAT..........(insignificant remaining part of file name)
Here is an example of the Python code I use (the urllib2 module), in which the file does not turn up:
import urllib2

# fetch the raw HTML of the public Google Drive folder
u_handle = urllib2.urlopen('https://drive.google.com/drive/folders/0Bwz6mBA7lUOKZi1nbGdlbzFDZ0U')
htmlPage = u_handle.read()

# save the fetched HTML for inspection
with open('/tmp/test.html', 'w') as f:
    f.write(htmlPage)
If I save the HTML page with a web browser, the file is about 500 kB and contains the above-mentioned file name, from which I can recover the download code. If I download the page through wget or through the Python urllib2 module, the HTML is only about 213 kB and does not contain that file name.
By the way, I tried several wget variations (via the Linux shell), but the situation is always the same: the downloaded HTML contains only up to some maximum number of files from the folder, unfortunately not all of them.
Thank you for all the advice.
P.S.
I'm not a good web developer and I'm looking for a solution to the problem. I'm a developer in other languages and on other platforms.
So, I resolved my own problem by downloading a different Google Drive page that presents the folder as a condensed directory/file list. I use this new URL:
'https://drive.google.com/embeddedfolderview?id=0Bwz6mBA7lUOKZi1nbGdlbzFDZ0U#list'
Instead of the previous URL:
'https://drive.google.com/drive/folders/0Bwz6mBA7lUOKZi1nbGdlbzFDZ0U'
The source code of the "list" page is structured slightly differently, but it contains all of the records (all the directories and files in the Google Drive folder), so I can see every file and every directory on the required page.
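For completeness, a minimal sketch of the working approach (the same idea as the code above, just pointed at the "embeddedfolderview" URL; using the requests library here is the only assumption beyond the answer itself):

import requests

# the condensed "list" view of the public folder, as described above
url = "https://drive.google.com/embeddedfolderview?id=0Bwz6mBA7lUOKZi1nbGdlbzFDZ0U#list"

html_page = requests.get(url).text

# the saved HTML now contains entries for all files in the folder
with open("/tmp/test.html", "w", encoding="utf-8") as f:
    f.write(html_page)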
Thank you all for helping me or for reading my problem.
I am trying to download PDFs from my school's server, but the way the stupid IT department has set it up, we have to click each link one by one, and there are hundreds of PDF links on the same page.
How can I download "2015-0001.pdf", "2015-0002.pdf", "2015-0003.pdf", and so on, using Python or wget?
I have tried wget --accept pdf,zip,7z,doc --recursive but it only grabs the index.html file of the site and no actual files.
Use Scrapy: http://scrapy.org/
An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
The Scrapy tutorial shows how to get started with website scraping.
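As a concrete starting point, here is a minimal Scrapy sketch (the index URL and output folder are placeholders, not the asker's real server) that follows every PDF link on the listing page and hands the URLs to Scrapy's built-in FilesPipeline for download:

import scrapy

class PdfSpider(scrapy.Spider):
    name = "school_pdfs"
    # placeholder: the page that lists all the PDF links
    start_urls = ["http://example.edu/reports/"]

    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "downloaded_pdfs",  # placeholder output folder
    }

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            if href.lower().endswith(".pdf"):
                # FilesPipeline downloads everything listed in file_urls
                yield {"file_urls": [response.urljoin(href)]}

Saved as pdf_spider.py, it can be run without a full project via: scrapy runspider pdf_spider.py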
I want to download all the files that are publicly accessible on this site:
https://www.duo.uio.no/
This is the site for the University of Oslo, where every paper/thesis that is publicly available in the university's archives can be found. I tried a crawler, but the website has some mechanism in place that stops crawlers from accessing its documents. Are there any other ways of doing this?
I did not mention this in the original question, but what I want is all the PDF files on the server. I tried SiteSucker, but that seems to just download the site itself.
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=unix,ascii --domains your-site.com --no-parent http://your-site.com
Try it.
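Since you mention that what you really want is the PDF files, a variation of that command restricted to PDFs might be: wget --recursive --no-parent --accept pdf --wait=2 --random-wait https://www.duo.uio.no/ (the --wait and --random-wait flags are only an assumption that slowing the requests down may help with whatever anti-crawler mechanism the site uses).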
You could try using SiteSucker (download), which allows you to download the contents of a website, ignoring any rules they may have in place.
I have written a Java servlet that uploads multiple files, and I used cURL to upload a file:
curl -F filedata=@myfile.txt http://127.0.0.1/test_Server/multipleupload
This uploads the file to an "uploads" folder located in the webapps folder. I'm in the middle of writing a Python module that can be used instead of cURL, the reason being that this server is going to be used by a build farm, so using cURL is not an option, and the same goes for pycURL. The Python module I'm working on was previously written for doing this against Pastebin, so all I'm doing is editing it to use my server, and it looks like urllib doesn't do multipart/form-data. If anyone could point me in the right direction it would be great. I haven't posted the code, but if anyone wants it I will do so; there isn't much in it. As a start, all I did was change the URL to my server, and that's when I found out that it uses application/x-www-form-urlencoded (thank you, Wireshark!).
You can use the Request class to send your own headers, but you may want to use the requests library instead; it makes life easier.
EDIT: uploading files with requests
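A minimal sketch of what that looks like for the upload in the question (the URL and form-field name are taken from the original curl command; the only assumption is that the requests library is installed):

import requests

# same endpoint and form-field name as the curl -F command above
url = "http://127.0.0.1/test_Server/multipleupload"

with open("myfile.txt", "rb") as fh:
    # passing a dict to files= makes requests send multipart/form-data
    response = requests.post(url, files={"filedata": fh})

print(response.status_code)
print(response.text)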